In [61]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


from processwx import select_stn, process_stn
%matplotlib inline
%config InlineBackend.figure_format='retina'

EDA with Teton avalanche observations and hazard forecasts

I've already preprocessed the avalanche events and forecasts, so we'll just load them into dataframes here:


In [2]:
events_df = pd.read_csv('btac_events.csv.gz', compression='gzip', 
                        index_col=[0], parse_dates = [2])
hzrd_df = pd.read_csv('btac_nowcast_teton.csv.gz', compression='gzip', 
                      index_col=[0], parse_dates=[0])

Since we don't have hazard forecasts for non-Jackson regions, let's filter out events in those areas. I infer the list of zones from the BTAC observations page.


In [3]:
zones = ['101','102','103','104','105','106','JHMR','GT','SK']
df1 = events_df[events_df['zone'].isin(zones)]

Let's take a look at the event data. The first 10 columns contain data on the time, location, and characteristics of the slide path:


In [4]:
df1[df1.columns[0:10]].head(10)


Out[4]:
ID event_date event_time zone pathname elevation lat lng aspect slope_angle
1 2094.0 2000-12-01 None 104 Glory Bowl 9800.0 43.505112 -110.948463 E 41-45
2 2095.0 2000-12-09 None 104 Titmouse Ridge 8000.0 43.489483 -110.950050 E 41-45
3 2096.0 2000-12-25 None 106 South Badger Creek 9200.0 43.893562 -110.908119 S 31-35
4 2097.0 2001-02-06 None 105 Ralphs Slide 9800.0 43.592175 -110.878319 S 41-45
5 2098.0 2001-02-23 None 105 Caldonia Couloir Area 8000.0 43.610756 -110.831970 N 41-45
10 2104.0 2003-01-04 None 105 Near Ski Lake 9200.0 43.526413 -110.936897 NE 36-40
13 2107.0 2003-01-27 None 104 Avalanche Bowl 8600.0 43.484516 -110.951447 E 41-45
14 2108.0 2003-02-10 None JHMR Hourglass Couloir 10150.0 43.598717 -110.870293 NE 41-45
16 2110.0 2004-01-31 None 105 Pyramid Peak 8900.0 43.547104 -110.886559 S 36-40
18 2112.0 2006-01-05 None 104 East Face of Taylor 9800.0 43.528904 -110.978439 E 41-45

These fields are largely self-explanatory; elevation is reported in feet above mean sea level.

After that, we get details on the size, type, and trigger for the avalanche, as well as the number of people killed:


In [5]:
df1[df1.columns[10:16]].head(10)


Out[5]:
destructive_size relative_size depth avy_trigger fldType fatality
1 2.5 3.0 24.0 AR SS 1.0
2 1.0 1.0 12.0 AS SS 1.0
3 3.0 3.0 24.0 AS SS 1.0
4 2.0 3.0 18.0 AS SS 1.0
5 1.5 1.0 18.0 AS SS 1.0
10 2.5 2.0 48.0 AR HS 1.0
13 2.0 2.0 24.0 AR SS 1.0
14 3.0 3.0 24.0 AS SS 1.0
16 4.0 4.0 48.0 AS SS 1.0
18 3.0 3.0 72.0 AS SS 1.0

A guide to avalanche terminology and codes is available here. Destructive size and relative size describe the magnitude of the slide; depth is the initial thickness, in inches, of the snow slab that slides. The trigger describes the source of the disturbance that set off the slide, while type describes the nature of the slab.
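
For a quick sense of which trigger and slab-type codes actually appear in this record, here is a minimal tabulation, assuming df1 as built above:


In [ ]:
# Tabulate the trigger and slab-type codes present in the events
# (a sketch; df1 is the zone-filtered event frame from above)
print(df1['avy_trigger'].value_counts().head(10))
print(df1['fldType'].value_counts().head(10))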

Finally, we get notes and info on the reporting individual, if available:


In [6]:
df1[df1.columns[16:]].head(10)


Out[6]:
observer affiliation notes
1 None None None
2 None None Lone skier buried by a small slide in a terrai...
3 None None The skier was the third person to cross this s...
4 None None Skier swept over cliff.
5 None None Buried skier was quickly uncovered by companio...
10 None None Snow boarder caught and carried. Died from tr...
13 None None The snowboarder was alone and had been missing...
14 None None Lift skier caught and buried in a closed area ...
16 None None Solo skier caught and buried by a large avalan...
18 None None Skier caught and buried by a large avalanche.

Let's look at a quick count of the events in our database by year:


In [7]:
per = df1['event_date'].dt.to_period("M");
g = df1.groupby(per);
s1 = g['ID'].count();
fig, ax = plt.subplots(1,1,figsize=(16,6));
s1 = s1.resample('M').sum();
s1.plot(kind='bar', ax=ax, title='Avalanche event counts', rot=65);
ticks = ax.xaxis.get_ticklocs();
ticklabels = [l.get_text() for l in ax.xaxis.get_ticklabels()];
ax.xaxis.set_ticks(ticks[::3]);
ax.xaxis.set_ticklabels(ticklabels[::3]);
ax.set_xlabel('date');
ax.set_ylabel('count');


What's going on here? It turns out that, initially, the only events in our data are serious accidents. Starting in the winter of 2009-2010, the ski patrol at Jackson Hole began reporting the results of their avalanche control work, and a broader range of non-accident reports from skiers and snowmobilers was included. The raw observation counts are thus tricky to compare from year to year.
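
One way to make the comparison a little more honest is to count events per winter season rather than per calendar month, keeping that 2009-2010 reporting change in mind. A minimal sketch, assuming df1 as above and a November-October season boundary:


In [ ]:
# Label each event with the starting year of its winter season
# (Nov-Oct boundary), then count events per season (a sketch)
yr = df1['event_date'].dt.year
season = yr.where(df1['event_date'].dt.month >= 11, yr - 1)
print(df1.groupby(season)['ID'].count())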


In [8]:
per = df1['event_date'].dt.to_period("M");
g = df1.groupby(per);
s2 = g['fatality'].sum().astype(int);
fig, ax = plt.subplots(1,1,figsize=(16,6));
s2 = s2.resample('M').sum();
s2.plot(kind='bar', ax=ax, title='Monthly fatal avalanche events', rot=65);
ticks = ax.xaxis.get_ticklocs();
ticklabels = [l.get_text() for l in ax.xaxis.get_ticklabels()];
ax.xaxis.set_ticks(ticks[::3]);
ax.xaxis.set_ticklabels(ticklabels[::3]);
ax.set_xlabel('date');
ax.set_ylabel('count');
print("Total fatalities:",s2.sum())


Total fatalities: 25.0

So in our 16-year record, we have 25 avalanche fatalities in the Jackson Hole region. With such a small sample, it's hard to see any particular pattern.

Let's start examining these events in the context of the avalanche forecasts. First, the data (last 5 days shown):


In [9]:
hzrd_df.tail(10)


Out[9]:
region atl tl btl
date
2017-04-08 09:00:00 teton 2 1 1
2017-04-08 15:00:00 teton 2 2 1
2017-04-09 09:00:00 teton 2 1 1
2017-04-09 15:00:00 teton 2 1 1
2017-04-10 09:00:00 teton 2 1 1
2017-04-10 15:00:00 teton 2 1 1
2017-04-11 09:00:00 teton 2 1 1
2017-04-11 15:00:00 teton 2 1 1
2017-04-12 09:00:00 teton 2 1 1
2017-04-12 15:00:00 teton 2 2 1

Each column corresponds to a terrain elevation band (atl is "above treeline", tl is "treeline", btl is "below treeline") and holds the forecast hazard level from 1-5 (low to extreme). The elevation bands are 9000-10500 ft (atl), 7500-9000 ft (tl), and 6000-7500 ft (btl). The daily bulletin gives an expected hazard for the morning and the afternoon - in late winter and spring, the longer days and warming temperatures can create significant intraday increases in hazard. This is what the hazard looked like over the course of the 2008-2009 season:


In [10]:
s = ('2008-10-31','2009-5-31')
#s = ('2013-10-31','2014-5-31')
#s =('2016-10-31','2017-5-31')

ax = hzrd_df.loc[s[0]:s[1],['atl','tl','btl']].plot(figsize=(16,6),rot=45);
ax.set_ylim([0,5]);
ax.set_ylabel('Avalanche Danger');


Peak hazard occurred in late December to early January. The increased intraday variation in hazard is visible from March forward, with higher hazard in the afternoons.
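
As a quick check on that intraday behavior across the whole record, here is a minimal sketch of the afternoon-minus-morning hazard change per day, assuming hzrd_df as loaded above with 09:00 and 15:00 issuance times:


In [ ]:
# Afternoon minus morning hazard per day and elevation band
# (a sketch; hzrd_df is the forecast frame loaded above)
am = hzrd_df.between_time('08:00', '10:00')[['atl', 'tl', 'btl']]
pm = hzrd_df.between_time('14:00', '16:00')[['atl', 'tl', 'btl']]
intraday = pm.resample('D').mean() - am.resample('D').mean()
print(intraday.describe())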

What forecast conditions have the highest number of fatalities? We don't have a very impressive sample with just one region of a single state, but we can at least see how to approach it. We want to extract the appropriate forecast hazard for the date and elevation where fatalities occurred.

First, we make a function to categorize elevations:


In [11]:
def elevation_category(elevation):
    if (6000. < elevation <= 7500.):
        return 'btl'
    elif (7500. < elevation <= 9000.):
        return 'tl'
    elif (9000. < elevation <= 10500.):
        return 'atl'
    else:
        return None

Next we augment the event frame with the elevation category in which the slide occurred:


In [12]:
# Work on an explicit copy so adding a column doesn't trigger SettingWithCopyWarning
df1 = df1.copy()
df1['el_cat'] = df1['elevation'].apply(elevation_category)

We next average the morning and afternoon hazard levels, then stack and reindex the hazard frame in preparation for a left outer join with the event frame:


In [13]:
df2 = hzrd_df[['atl','tl','btl']].resample('D').mean().stack()
df2 = df2.reset_index()
df2.columns = ['event_date','el_cat','hazard']
df2.head()


Out[13]:
event_date el_cat hazard
0 1999-11-29 atl 1.0
1 1999-11-29 tl 1.0
2 1999-11-29 btl 1.0
3 1999-11-30 atl 1.0
4 1999-11-30 tl 1.0

Finally, we merge these frames, then restrict the analysis to fatal accidents. While the sample size is small, we recover the frequently noted result that more fatal accidents occur during "Considerable" forecast hazard than "High" or "Extreme". This is both the result of the underlying hazard frequency (there are more "Considerable" days than "High" or "Extreme" days) and psychology (fewer people choose to recreate in avalanche terrain when the forecast danger is above "Considerable").


In [14]:
df3 = pd.merge(df1, df2, how='left', on=['event_date', 'el_cat'])

In [15]:
df4 = df3[df3['fatality']>0]
df4['hazard'].plot(kind='hist', title='Fatalities by avalanche forecast hazard', 
                   xlim=[0,5], bins=20, figsize=(6,6));


The stacked histogram of forecast avalanche danger by elevation category has a lognormal character.


In [16]:
hzrd_df[['atl','tl','btl']].plot(kind='hist', stacked=True, 
                                 xlim=[0,5], bins=20, figsize=(6,6));


The raw count of avalanches by hazard rating is given below:


In [17]:
g = df3.groupby(by='hazard');
g['ID'].count()


Out[17]:
hazard
0.0       8
1.0     202
1.5     104
2.0    1702
2.5     272
3.0    2378
3.5     184
4.0    1528
4.5       5
5.0      55
Name: ID, dtype: int64

and here are the raw counts of forecasts per hazard rating:


In [18]:
atl1, b = np.histogram(hzrd_df['atl'], bins=20);
tl1, _ = np.histogram(hzrd_df['tl'], bins=20);
btl1, _ = np.histogram(hzrd_df['btl'], bins=20);
atl1 + tl1 + btl1


Out[18]:
array([ 172,    0,    0,    0, 8551,    0,    0,    0, 6034,    0,    0,
          0, 2411,    0,    0,    0,  555,    0,    0,   19])

In [19]:
b


Out[19]:
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ,  1.75,  2.  ,
        2.25,  2.5 ,  2.75,  3.  ,  3.25,  3.5 ,  3.75,  4.  ,  4.25,
        4.5 ,  4.75,  5.  ])

While this is "quick and dirty", the forecast-frequency-weighted avalanche occurrence suggests there are about three events per forecast at "High" and "Extreme" hazard, about one per forecast at "Considerable", less than 1/3 per forecast at "Moderate", and about 1/40 per forecast at "Low".
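
A rough way to put numbers behind that, assuming the df3 and hzrd_df frames from above and flooring half-level daily means to whole hazard levels:


In [ ]:
# Events per forecast band-day, by hazard level (a rough sketch;
# half-level daily means are floored to the nearest whole level)
evt = np.floor(df3['hazard']).value_counts().sort_index()
fcst = (hzrd_df[['atl', 'tl', 'btl']]
        .resample('D').mean()        # one value per band per day
        .stack()
        .pipe(np.floor)
        .value_counts()
        .sort_index())
print((evt / fcst).dropna())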

A quick addendum, breaking out event counts by destructive size within each forecast hazard band:


In [38]:
g = df3[(1 < df3['hazard']) & (df3['hazard'] <= 2)].groupby(by='destructive_size');
sz_2 = g['ID'].count()

g = df3[(2 < df3['hazard']) & (df3['hazard'] <= 3)].groupby(by='destructive_size');
sz_3 = g['ID'].count()

g = df3[(3 < df3['hazard']) & (df3['hazard'] <= 4)].groupby(by='destructive_size');
sz_4 = g['ID'].count()

g = df3[(4 < df3['hazard']) & (df3['hazard'] <= 5)].groupby(by='destructive_size');
sz_5 = g['ID'].count()

print(sz_2,sz_3,sz_4,sz_5)


destructive_size
1.0    1272
1.2       1
1.5     267
2.0     211
2.5      18
3.0      21
3.5       1
4.0       2
Name: ID, dtype: int64 destructive_size
0.5        2
1.0     1441
1.5      379
2.0      602
2.4        1
2.5       62
3.0       78
3.5       20
4.0       13
6.0        1
11.0       1
Name: ID, dtype: int64 destructive_size
1.0    676
1.5    246
2.0    535
2.5     66
3.0    112
3.5     10
4.0      8
Name: ID, dtype: int64 destructive_size
1.0    45
1.5     3
2.0     8
2.5     2
3.0     2
Name: ID, dtype: int64

Weather

Having examined avalanche forecasts and observations, the next challenge is to address the driving variable: weather. We'll take a quick look at some station data from Jackson Hole.

First, let's look at the data sources available. I've coded a handy utility to help pick out useful stations. Let's find the ten stations nearest Jackson Hole, at (lat, lon) = (43.572236, -110.8496103):


In [22]:
df = select_stn('./wxdata',{'k_nrst': 10, 'lat_lon': (43.572236,-110.8496103)}, return_df=True)

In [23]:
df.head(10)


Out[23]:
stid mnet_id name latitude longitude elevation state rec_start rec_end dist
0 JHB 48 JACKSON HOLE-BASE 43.580000 -110.850000 6339 WY 1999-11-23 00:00:00 2017-04-11 23:15:00 0.864159
1 JHBUF 48 JACKSON HOLE-BUFFALO BOWL 43.590147 -110.848994 7913 WY 2016-01-04 20:40:00 2017-04-11 23:15:00 1.992856
2 JHMWY 48 JACKSON HOLE-MID MT 43.590000 -110.840000 8179 WY 2016-01-04 21:55:00 2017-04-11 23:15:00 2.122202
3 JHRV 48 JACKSON HOLE-RENDEZVOUS BOWL 43.590858 -110.872558 9608 WY 2010-02-06 00:00:00 2017-04-11 23:15:00 2.776535
4 JHR 48 JACKSON HOLE-RAYMER 43.600000 -110.850000 9360 WY 1998-12-08 00:00:00 2017-04-11 23:15:00 3.088345
5 JHS 48 JACKSON HOLE-SUMMIT 43.596969 -110.870756 10318 WY 2016-01-04 20:40:00 2017-04-11 23:15:00 3.235876
6 PHBW4 25 PHILLIPS BENCH 43.519480 -110.911030 8200 WY 2000-01-20 00:00:00 2017-04-11 23:00:00 7.678150
7 SNKWY 48 SNOW KING 43.463100 -110.763000 7810 WY 2007-11-11 00:00:00 2017-04-11 23:15:00 14.005828
8 GTHW4 25 GRAND TARGHEE 43.779330 -110.927830 9260 WY 2006-10-23 00:00:00 2017-04-11 23:00:00 23.879060
9 GTFM 48 Freds Mountain Summit 43.786630 -110.934230 9840 WY 2011-01-13 00:00:00 2017-04-11 23:15:00 24.799513

In [62]:
wx_df = process_stn('./wxdata', 'JHR');


/home/osboxes/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2821: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):

Let's look at air temp during the winter of 2008-2009:


In [66]:
wx_df.loc['2008-10-31':'2009-05-31','air_temp_set_1'].plot(figsize=(16,6));


That looks pretty good. If we look at the whole series though, there are some issues:


In [68]:
wx_df['air_temp_set_1'].plot(figsize=(16,6));


At this resolution, we basically see the annual cycle for the period when the temperature sensor is reporting, but there are some obvious issues. Let's look at the period of 2011-2014 to explore this:


In [69]:
wx_df.loc['2011-10-31':'2014-05-31','air_temp_set_1'].plot(figsize=(16,6));


Odds are pretty good that those spikes down to -15°C in July of 2012 are non-physical. We'd like a method for outlier detection; a first pass will use the median absolute deviation:
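
A minimal sketch of what that first pass might look like, assuming wx_df as loaded above and an arbitrary 14-day rolling window:


In [ ]:
# Flag points far from a rolling median, in units of the rolling MAD
# (a sketch, not the final implementation; wx_df as loaded above)
temps = wx_df['air_temp_set_1'].dropna()
med = temps.rolling('14D').median()
mad = (temps - med).abs().rolling('14D').median()
# 1.4826 * MAD approximates one standard deviation for Gaussian data
suspect = (temps - med).abs() > 5 * 1.4826 * mad
print('Flagged', int(suspect.sum()), 'of', len(temps), 'observations')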


In [ ]: