In [61]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from processwx import select_stn, process_stn
%matplotlib inline
%config InlineBackend.figure_format='retina'
I've already preprocessed the avalanche events and forecasts, so we'll just load them into dataframes here:
In [2]:
events_df = pd.read_csv('btac_events.csv.gz', compression='gzip',
index_col=[0], parse_dates = [2])
hzrd_df = pd.read_csv('btac_nowcast_teton.csv.gz', compression='gzip',
index_col=[0], parse_dates=[0])
Since we don't have hazard forecasts for non-Jackson regions, let's filter out events in those areas. I infer the list of zones from the BTAC obs page.
In [3]:
zones = ['101','102','103','104','105','106','JHMR','GT','SK']
df1 = events_df[events_df['zone'].isin(zones)]
Let's take a look at the event data. The first 10 columns contain data on the time, location, and characteristics of the slide path:
In [4]:
df1[df1.columns[0:10]].head(10)
Out[4]:
These fields are largely self-explanatory; elevation is reported in feet above mean sea level.
After that, we get details on the size, type, and trigger for the avalanche, as well as the number of people killed:
In [5]:
df1[df1.columns[10:16]].head(10)
Out[5]:
A guide to avalanche terminology and codes is available here. Destructive size and relative size describe the magnitude of the slide; depth is the initial thickness, in inches, of the snow slab that slides. The trigger describes the source of the disturbance that caused the slide, while type describes the nature of the slab.
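For a quick feel for how slide magnitudes are distributed, here's a one-line tabulation of the destructive size codes (just a sketch, not part of the analysis below):
In [ ]:
# Quick look: how many events of each destructive size are in the record
df1['destructive_size'].value_counts()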
Finally, we get notes and info on the reporting individual, if available:
In [6]:
df1[df1.columns[16:]].head(10)
Out[6]:
Let's look at a quick count of the events in our database by month:
In [7]:
# Count events per month and plot as a bar chart, thinning the x tick labels
per = df1['event_date'].dt.to_period("M");
g = df1.groupby(per);
s1 = g['ID'].count();
fig, ax = plt.subplots(1,1,figsize=(16,6));
s1 = s1.resample('M').sum();
s1.plot(kind='bar', ax=ax, title='Avalanche event counts', rot=65);
ticks = ax.xaxis.get_ticklocs();
ticklabels = [l.get_text() for l in ax.xaxis.get_ticklabels()];
ax.xaxis.set_ticks(ticks[::3]);
ax.xaxis.set_ticklabels(ticklabels[::3]);
ax.set_xlabel('date');
ax.set_ylabel('count');
What's going on here? It turns out that, initially, the only events in our data are serious accidents. In the winter of 2009-2010, the ski patrol at Jackson Hole began reporting the results of their avalanche control work, and a broader range of non-accident reports from skiers and snowmobilers was included as well. The raw observation counts are thus tricky to compare from year to year.
In [8]:
# Sum fatalities per month and plot as a bar chart
per = df1['event_date'].dt.to_period("M");
g = df1.groupby(per);
s2 = g['fatality'].sum().astype(int);
fig, ax = plt.subplots(1,1,figsize=(16,6));
s2 = s2.resample('M').sum();
s2.plot(kind='bar', ax=ax, title='Monthly fatal avalanche events', rot=65);
ticks = ax.xaxis.get_ticklocs();
ticklabels = [l.get_text() for l in ax.xaxis.get_ticklabels()];
ax.xaxis.set_ticks(ticks[::3]);
ax.xaxis.set_ticklabels(ticklabels[::3]);
ax.set_xlabel('date');
ax.set_ylabel('count');
print("Total fatalities:",s2.sum())
So in our 16-year record, we have 25 avalanche fatalities in the Jackson Hole region. With such a small sample, it's hard to see any particular pattern.
Let's start examining these events in the context of the avalanche forecasts. First, the data (last 5 days shown):
In [9]:
hzrd_df.tail(10)
Out[9]:
Each column is for a specific terrain elevation band (atl is "above treeline", tl is "treeline", btl is "below treeline"), representing the forecast hazard level from 1-5 (low to extreme). The corresponding elevation bands are 9000-10500 ft (atl), 7500-9000 ft (tl), and 6000-7500 ft (btl). The daily bulletin gives an expected hazard for the morning and afternoon - in late winter and spring, the longer days and warming temperatures can create significant intraday increases in hazard. This is what the hazard looked like over the course of the 2008-2009 season:
In [10]:
s = ('2008-10-31','2009-5-31')
#s = ('2013-10-31','2014-5-31')
#s =('2016-10-31','2017-5-31')
ax = hzrd_df.loc[s[0]:s[1],['atl','tl','btl']].plot(figsize=(16,6),rot=45);
ax.set_ylim([0,5]);
ax.set_ylabel('Avalanche Danger');
Peak hazard occurred in late December to early January. The increased intraday variation in hazard is visible from March forward, with higher hazard in the afternoons.
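One way to quantify that intraday variation is the daily spread between the morning and afternoon forecasts. Here's a minimal sketch for the above-treeline band, assuming (as described above) two forecast rows per day:
In [ ]:
# Daily intraday spread in above-treeline hazard: max minus min of the two forecasts each day
daily_atl = hzrd_df.loc[s[0]:s[1], 'atl'].resample('D')
(daily_atl.max() - daily_atl.min()).plot(figsize=(16,3), title='Intraday change in atl hazard');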
What forecast conditions have the highest number of fatalities? We don't have a very impressive sample with just one region of a single state, but we can at least see how to approach it. We want to extract the appropriate forecast hazard for the date and elevation where fatalities occurred.
First, we make a function to categorize elevations:
In [11]:
def elevation_category(elevation):
if (6000. < elevation <= 7500.):
return 'btl'
elif (7500. < elevation <= 9000.):
return 'tl'
elif (9000. < elevation <= 10500.):
return 'atl'
else:
return None
Next we augment the event frame with the elevation category in which the slide occurred:
In [12]:
df1.is_copy = False  # suppress the SettingWithCopyWarning when adding a column to this filtered frame
df1['el_cat'] = df1['elevation'].apply(elevation_category)
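A quick sanity check, since events outside the 6000-10500 ft range get no category and will therefore pick up no hazard value in the merge below:
In [ ]:
# How many events fall in each elevation band (NaN means outside the forecast bands)
df1['el_cat'].value_counts(dropna=False)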
We next average the morning and afternoon hazard levels, then stack and reindex the hazard frame in preparation for a left outer join with the event frame:
In [13]:
df2 = hzrd_df[['atl','tl','btl']].resample('D').mean().stack()
df2 = df2.reset_index()
df2.columns = ['event_date','el_cat','hazard']
df2.head()
Out[13]:
Finally, we merge these frames, then restrict the analysis to fatal accidents. While the sample size is small, we recover the frequently noted result that more fatal accidents occur during "Considerable" forecast hazard than "High" or "Extreme". This is both the result of the underlying hazard frequency (there are more "Considerable" days than "High" or "Extreme" days) and psychology (fewer people choose to recreate in avalanche terrain when the forecast danger is above "Considerable").
In [14]:
df3 = pd.merge(df1, df2, how='left', left_on=['event_date','el_cat'], right_on=['event_date','el_cat'])
In [15]:
df4 = df3[df3['fatality']>0]
df4['hazard'].plot(kind='hist', title='Fatalities by avalanche forecast hazard',
xlim=[0,5], bins=20, figsize=(6,6));
The stacked histogram of forecast avalanche danger by elevation category has a lognormal character.
In [16]:
hzrd_df[['atl','tl','btl']].plot(kind='hist', stacked=True,
xlim=[0,5], bins=20, figsize=(6,6));
The raw count of avalanches by hazard rating is given below:
In [17]:
g = df3.groupby(by='hazard');
g['ID'].count()
Out[17]:
And here are the raw counts of forecasts per hazard rating:
In [18]:
atl1, b = np.histogram(hzrd_df['atl'], bins=20);
tl1, _ = np.histogram(hzrd_df['tl'], bins=20);
btl1, _ = np.histogram(hzrd_df['btl'], bins=20);
atl1 + tl1 + btl1
Out[18]:
In [19]:
b
Out[19]:
While this is "quick and dirty", the forecast-frequency-weighted avalanche occurrence suggests there are about three events per forecast at "High" and "Extreme" hazard, about one per forecast at "Considerable", less than 1/3 per forecast at "Moderate", and about 1/40 per forecast at "Low".
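Here's a rough sketch of that normalization; rounding the averaged hazard to the nearest whole rating is an approximation, so treat the ratios as order-of-magnitude only:
In [ ]:
# Approximate events per forecast at each hazard rating
event_counts = df3.groupby(df3['hazard'].round())['ID'].count()
fcst_counts = pd.concat([hzrd_df['atl'], hzrd_df['tl'],
                         hzrd_df['btl']]).round().value_counts().sort_index()
event_counts / fcst_counts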
A quick addendum, adding thresholds for destructive size:
In [38]:
# Count events by destructive size within each forecast hazard band
g = df3[(1 < df3['hazard']) & (df3['hazard'] <= 2)].groupby(by='destructive_size');
sz_2 = g['ID'].count()
g = df3[(2 < df3['hazard']) & (df3['hazard'] <= 3)].groupby(by='destructive_size');
sz_3 = g['ID'].count()
g = df3[(3 < df3['hazard']) & (df3['hazard'] <= 4)].groupby(by='destructive_size');
sz_4 = g['ID'].count()
g = df3[(4 < df3['hazard']) & (df3['hazard'] <= 5)].groupby(by='destructive_size');
sz_5 = g['ID'].count()
print(sz_2,sz_3,sz_4,sz_5)
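An equivalent, more compact way to build the same table is to bin the averaged hazard with pd.cut and cross-tabulate against destructive size (a sketch of an alternative, using the same (1,2], (2,3], (3,4], (4,5] bands as above):
In [ ]:
# Destructive size vs. forecast hazard band in a single table
hazard_band = pd.cut(df3['hazard'], bins=[1, 2, 3, 4, 5])
pd.crosstab(df3['destructive_size'], hazard_band)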
Having examined avalanche forecasts and observations, the next challenge is to address the driving variable: weather. We'll take a quick look at some station data from Jackson Hole.
First, let's look at the data sources available. I've coded a handy utility to help pick out useful stations. Let's find the ten nearest to Jackson Hole, at (lat, lon) = (43.572236, -110.8496103):
In [22]:
df = select_stn('./wxdata',{'k_nrst': 10, 'lat_lon': (43.572236,-110.8496103)}, return_df=True)
In [23]:
df.head(10)
Out[23]:
In [62]:
wx_df = process_stn('./wxdata', 'JHR');
Let's look at air temp during the winter of 2008-2009:
In [66]:
wx_df.loc['2008-10-31':'2009-05-31','air_temp_set_1'].plot(figsize=(16,6));
That looks pretty good. If we look at the whole series though, there are some issues:
In [68]:
wx_df['air_temp_set_1'].plot(figsize=(16,6));
At this resolution, we basically see the annual cycle for the period when the temperature sensor is reporting, but there are some obvious issues. Let's look at the period of 2011-2014 to explore this:
In [69]:
wx_df.loc['2011-10-31':'2014-05-31','air_temp_set_1'].plot(figsize=(16,6));
Odds are pretty good that those spikes down to -15C in July of 2012 are non-physical. We'd like a method for outlier detection. A first pass will use the median absolute deviation:
In [ ]:
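# A minimal first-pass sketch of a MAD-based filter; the window length and
# threshold are placeholder choices (assuming roughly hourly data), not tuned values.
def mad_outliers(series, window=24, thresh=3.5):
    """Flag points whose modified z-score, based on a rolling median
    absolute deviation, exceeds thresh."""
    med = series.rolling(window, center=True, min_periods=1).median()
    abs_dev = (series - med).abs()
    mad = abs_dev.rolling(window, center=True, min_periods=1).median()
    mad = mad.where(mad > 0)  # avoid division by zero in flat stretches
    return 0.6745 * abs_dev / mad > thresh

# Example: mask suspect readings and replot the 2011-2014 period
temps = wx_df['air_temp_set_1']
temps.mask(mad_outliers(temps)).loc['2011-10-31':'2014-05-31'].plot(figsize=(16,6));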