This post is part of a series documenting my (meager) contributions to an investigation to uncover relations, if any, between space weather and marine mammal beaching. This first installment is a warm-up, where I explore a small part of a marine mammal beaching time series recorded in the vicinity of Cape Cod, Massachusetts (USA).


In [1]:
import pandas as pd
import os
import glob
import matplotlib.pyplot as pl
import seaborn as sb
import matplotlib.colors as mcolors
from calendar import month_name
import numpy as np
import warnings
from IPython.display import Image
import pymc3 as pm
%matplotlib inline

In [2]:
warnings.filterwarnings('ignore')

LOAD & CLEAN


In [3]:
from matplotlib import rcParams

rcParams['xtick.labelsize']=14
rcParams['ytick.labelsize']=14
rcParams['font.size']=16

In [4]:
fp = '/home/madhatter106/DATA/Beaching_Project/IFAW Mass Stranding Data 1999-2014.xlsx'
try:
    df = pd.read_excel(fp)
except FileNotFoundError as e:
    print(e)

In [5]:
df.columns


Out[5]:
Index(['MASS_STRAND_NUM', 'SERIES_NUM', 'Event Start Date',
       'Event End date (if applicable)', 'COMMENTS',
       'Number of Animals Stranded', 'Number of Additional Animals Herded',
       'LATITUDE_DEGREE', 'LONGITUDE_DEGREE', 'Addn. LATITUDE_DEGREE ',
       'Addm. LONGITUDE_DEGREE', 'SPECIES'],
      dtype='object')

In [6]:
df.head(2)


Out[6]:
MASS_STRAND_NUM SERIES_NUM Event Start Date Event End date (if applicable) COMMENTS Number of Animals Stranded Number of Additional Animals Herded LATITUDE_DEGREE LONGITUDE_DEGREE Addn. LATITUDE_DEGREE Addm. LONGITUDE_DEGREE SPECIES
0 MS00-01 Series0003 2000-04-09 NaN Collected level A data\n\nCattle ear tag white... 2.0 0.0 41.930667 70.069667 NaN NaN Lagenorhynchus acutus
1 MS00-02 Series0003 2000-04-11 NaN Skin- Kim Amaral, Dorsal fin to G. Early\n\nS... 2.0 0.0 41.933333 70.024167 NaN NaN Lagenorhynchus acutus

Cleanup: there are too many non-features (useless columns), so I'll start by dropping those. There are several rows for a given event, and at this point I only want the row that carries the stranded and herded counts. The columns I care about are the stranded/herded counts, location, date, and species. But first, let's get some stats on the whole set...


In [7]:
df.describe()


Out[7]:
Event End date (if applicable) Number of Animals Stranded Number of Additional Animals Herded LATITUDE_DEGREE LONGITUDE_DEGREE Addn. LATITUDE_DEGREE Addm. LONGITUDE_DEGREE
count 0.0 160.00000 157.000000 2.310000e+02 2.310000e+02 2.040000e+02 2.040000e+02
mean NaN 5.70625 0.242038 1.812812e+04 3.034970e+04 2.052188e+04 3.435730e+04
std NaN 6.34203 1.885849 2.748874e+05 4.608534e+05 2.925134e+05 4.904041e+05
min NaN 1.00000 0.000000 4.142633e+01 -7.047960e+01 4.142633e+01 -7.047960e+01
25% NaN 2.00000 0.000000 4.177622e+01 -7.000550e+01 4.177515e+01 -7.001266e+01
50% NaN 4.00000 0.000000 4.189230e+01 7.002337e+01 4.187768e+01 7.001643e+01
75% NaN 6.00000 0.000000 4.192809e+01 7.006647e+01 4.192696e+01 7.006717e+01
max NaN 56.00000 20.000000 4.177969e+06 7.004393e+06 4.177969e+06 7.004393e+06

160 entries for stranded and 157 for herded. For now I'll also ditch the Addn. Lat/Lon columns. This is not final, however; I may still reincorporate that data later if I feel the need.


In [8]:
df.drop(['MASS_STRAND_NUM','SERIES_NUM','Event End date (if applicable)','COMMENTS','Addn. LATITUDE_DEGREE ',
         'Addm. LONGITUDE_DEGREE'],axis=1, inplace=True)

Next is to replace column names with something more practical:


In [9]:
replNames = {key: repl for key,repl in zip(df.columns[:3],['date','stranded', 'herded'])}
replNames.update({key:key.lower().split('_')[0] for key in df.columns[3:]})
df.rename(columns=replNames,inplace=True)
df.head(5)


Out[9]:
date stranded herded latitude longitude species
0 2000-04-09 2.0 0.0 41.930667 70.069667 Lagenorhynchus acutus
1 2000-04-11 2.0 0.0 41.933333 70.024167 Lagenorhynchus acutus
2 2000-04-11 2.0 0.0 41.933333 70.024167 Lagenorhynchus acutus
3 2000-08-14 NaN NaN 41.806833 70.002667 Lagenorhynchus acutus
4 2000-08-26 6.0 20.0 41.928167 70.066833 Lagenorhynchus acutus

Let's take a look at what's left in the table


In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231 entries, 0 to 230
Data columns (total 6 columns):
date         231 non-null datetime64[ns]
stranded     160 non-null float64
herded       157 non-null float64
latitude     231 non-null float64
longitude    231 non-null float64
species      231 non-null object
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 10.9+ KB

Cleaning NaNs


In [11]:
df.loc[(df.stranded.isnull()) & (df.herded.isnull())]


Out[11]:
date stranded herded latitude longitude species
3 2000-08-14 NaN NaN 4.180683e+01 7.000267e+01 Lagenorhynchus acutus
11 2002-03-24 NaN NaN 4.192347e+01 7.003273e+01 Lagenorhynchus acutus
25 2004-12-30 NaN NaN 4.182300e+01 7.000443e+01 Lagenorhynchus acutus
34 2005-12-10 NaN NaN 4.142633e+01 7.028300e+01 Delphinus delphis
35 2005-12-10 NaN NaN 4.171000e+01 7.032317e+01 Delphinus delphis
36 2005-12-10 NaN NaN 4.171133e+01 7.027467e+01 Delphinus delphis
37 2005-12-10 NaN NaN 4.178635e+01 7.002950e+01 Globicephala melas
38 2005-12-10 NaN NaN 4.176835e+01 7.009107e+01 Globicephala melas
39 2005-12-10 NaN NaN 4.171017e+01 7.028802e+01 Delphinus delphis
40 2005-12-10 NaN NaN 4.174863e+01 7.019983e+01 Globicephala melas
41 2005-12-10 NaN NaN 4.175643e+01 7.015298e+01 Delphinus delphis
42 2005-12-10 NaN NaN 4.176033e+01 7.013417e+01 Globicephala melas
43 2005-12-10 NaN NaN 4.193033e+01 7.006867e+01 Delphinus delphis
44 2005-12-10 NaN NaN 4.177583e+01 7.005190e+01 Globicephala melas
45 2005-12-10 NaN NaN 4.205648e+01 7.013608e+01 Lagenorhynchus acutus
46 2005-12-10 NaN NaN 4.182638e+01 7.000372e+01 Globicephala melas
47 2005-12-10 NaN NaN 4.173417e+01 7.023450e+01 Delphinus delphis
48 2005-12-10 NaN NaN 4.176850e+01 7.008500e+01 Globicephala melas
49 2005-12-10 NaN NaN 4.173735e+01 7.226767e+00 Delphinus delphis
51 2005-12-11 NaN NaN 4.176267e+01 7.010983e+01 Globicephala melas
52 2005-12-11 NaN NaN 4.174242e+01 7.021768e+01 Delphinus delphis
54 2005-12-14 NaN NaN 4.176033e+01 7.013417e+01 Globicephala melas
55 2005-12-14 NaN NaN 4.177233e+01 7.007000e+01 Globicephala melas
59 2006-01-17 NaN NaN 4.191900e+01 7.002817e+01 Lagenorhynchus acutus
61 2006-01-19 NaN NaN 4.187383e+01 7.000917e+01 Delphinus delphis
62 2006-01-19 NaN NaN 4.191283e+01 6.999100e+01 Lagenorhynchus acutus
63 2006-01-19 NaN NaN 4.191900e+01 7.002817e+01 Delphinus delphis
64 2006-01-19 NaN NaN 4.180933e+01 7.000483e+01 Lagenorhynchus acutus
65 2006-01-19 NaN NaN 4.173417e+01 7.023350e+01 Lagenorhynchus acutus
66 2006-01-19 NaN NaN 4.179900e+01 7.001200e+01 Delphinus delphis
... ... ... ... ... ... ...
125 2010-02-26 NaN NaN 4.199570e+01 7.006230e+01 Delphinus delphis
134 2010-04-10 NaN NaN 4.177613e+01 7.007222e+01 Lagenorhynchus acutus
137 2010-10-02 NaN NaN 4.177780e+01 7.004892e+01 Lagenorhynchus acutus
139 2010-11-12 NaN NaN 4.188945e+01 7.000003e+01 Delphinus delphis
141 2010-11-29 NaN NaN 4.189708e+01 7.002412e+01 Delphinus delphis
142 2011-02-02 NaN NaN 4.173530e+01 7.031120e+01 Delphinus delphis
143 2011-02-02 NaN NaN 4.177950e+01 6.993500e+01 Delphinus delphis
145 2011-02-14 NaN NaN 4.193120e+01 7.006610e+01 Delphinus delphis
148 2011-02-26 NaN NaN 4.193221e+01 -7.002414e+01 Lagenorhynchus acutus
154 2011-04-05 NaN NaN 4.203265e+01 -7.009570e+01 Lagenorhynchus acutus
157 2011-08-31 NaN NaN 4.182050e+01 -7.000550e+01 Delphinus delphis
160 2011-12-29 NaN NaN 4.192300e+01 -7.005440e+01 Delphinus delphis
163 2012-01-14 NaN NaN 4.192980e+01 -7.004800e+01 Delphinus delphis
164 2012-01-14 NaN NaN 4.177969e+06 7.004393e+06 Delphinus delphis
165 2012-01-14 NaN NaN 4.175050e+01 7.018640e+01 Delphinus delphis
166 2012-01-14 NaN NaN 4.179240e+01 -7.002030e+01 Delphinus delphis
168 2012-01-15 NaN NaN 4.189260e+01 -7.000280e+01 Delphinus delphis
169 2012-01-15 NaN NaN 4.177600e+01 -7.005470e+01 Delphinus delphis
171 2012-01-16 NaN NaN 4.185917e+01 -7.000728e+01 Delphinus delphis
179 2012-02-03 NaN NaN 4.193380e+01 -7.002690e+01 Delphinus delphis
181 2012-02-04 NaN NaN 4.192540e+01 -7.002610e+01 Delphinus delphis
185 2012-02-07 NaN NaN 4.193090e+01 -7.006350e+01 Delphinus delphis
196 2012-03-26 NaN NaN 4.176320e+01 -7.010690e+01 Delphinus delphis
198 2012-03-27 NaN NaN 4.179180e+01 -7.002000e+01 Delphinus delphis
199 2012-03-27 NaN NaN 4.191400e+01 -6.998810e+01 Delphinus delphis
205 2012-11-09 NaN NaN 4.207850e+01 -7.022830e+01 Delphinus delphis
219 2014-12-04 NaN NaN 4.189440e+01 -7.007110e+01 Delphinus delphis
223 1999-03-19 NaN NaN 4.182273e+01 7.000413e+01 Lagenorhynchus acutus
224 1999-03-19 NaN NaN 4.173633e+01 7.033267e+01 Lagenorhynchus acutus
226 1999-03-20 NaN NaN 4.176160e+01 7.011407e+01 Lagenorhynchus acutus

70 rows × 6 columns


In [12]:
df.loc[(df.stranded.isnull()) & (np.isfinite(df.herded))]


Out[12]:
date stranded herded latitude longitude species
113 2009-12-07 NaN 0.0 41.792267 70.0197 Delphinus delphis

In [13]:
df.loc[(np.isfinite(df.stranded)) &(df.herded.isnull())]


Out[13]:
date stranded herded latitude longitude species
98 2007-08-23 2.0 NaN 41.595333 70.847333 Stenella coeruleoalba
112 2009-12-07 3.0 NaN 41.775783 70.055300 Delphinus delphis
174 2012-01-19 6.0 NaN 41.928016 70.069959 Delphinus delphis
225 1999-03-20 12.0 NaN 41.916167 70.062333 Lagenorhynchus acutus

So it looks like we're fine if we drop rows where the 'stranded' column is NaN and replace the remaining 'herded' NaNs with 0.


In [14]:
df.dropna(subset=['stranded'], inplace=True)

In [15]:
df.loc[df.herded.isnull(),'herded'] = 0

In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 160 entries, 0 to 230
Data columns (total 6 columns):
date         160 non-null datetime64[ns]
stranded     160 non-null float64
herded       160 non-null float64
latitude     160 non-null float64
longitude    160 non-null float64
species      160 non-null object
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 8.8+ KB

More cleaning: Duplicates

Duplicate values might either indicate a problem in data organization, or an actual case of multiple entries of the same data. They could also have a nefarious effect on subsequent analyses. So I'll next look for those.


In [17]:
df.loc[df.duplicated(),:]


Out[17]:
date stranded herded latitude longitude species
2 2000-04-11 2.0 0.0 41.933333 70.024167 Lagenorhynchus acutus

There appears to be a duplicate entry of the above, and a decision needs to be made here: was the twin entry legitimate or an error? Examining the original Excel file, it looks to me like a redundant entry, so I'll drop one.


In [18]:
df.drop_duplicates(inplace=True)

Now, there might still be some date duplicates because of multiple events occurring on the same day. Let's look at those date-wise duplicates:


In [19]:
df.loc[df.duplicated(subset='date')]


Out[19]:
date stranded herded latitude longitude species
103 2008-12-25 3.0 0.0 41.9205 70.032167 Grampus griseus
187 2012-02-09 3.0 0.0 41.8923 -70.008500 Delphinus delphis

Two cases. Let's see what happens on those dates:


In [20]:
dupDates = df.loc[df.duplicated(subset='date'),'date'].values
for dd in dupDates:
    print(df[df['date'] == dd])


          date  stranded  herded  latitude  longitude            species
102 2008-12-25       5.0     0.0   41.7840 -70.033100  Delphinus delphis
103 2008-12-25       3.0     0.0   41.9205  70.032167    Grampus griseus
          date  stranded  herded  latitude  longitude            species
186 2012-02-09       5.0     0.0   41.7272   -70.3031  Delphinus delphis
187 2012-02-09       3.0     0.0   41.8923   -70.0085  Delphinus delphis

So in the first case, two different species stranded at different locations; in the second, the same species stranded at different locations. This might warrant "melting" the data frame (see further below). That's all the cleaning for now. Next, I'm going to add a couple of columns: day of week, to help check for observer bias, and month of year, to help reduce noise and bin the data toward seasonality.


In [21]:
df['weekday'] = df.date.dt.weekday_name

In [22]:
df['month'] = df.date.dt.month

Let's see what that did


In [23]:
df.head()


Out[23]:
date stranded herded latitude longitude species weekday month
0 2000-04-09 2.0 0.0 41.930667 70.069667 Lagenorhynchus acutus Sunday 4
1 2000-04-11 2.0 0.0 41.933333 70.024167 Lagenorhynchus acutus Tuesday 4
4 2000-08-26 6.0 20.0 41.928167 70.066833 Lagenorhynchus acutus Saturday 8
5 2000-08-27 5.0 0.0 41.928967 70.052300 Lagenorhynchus acutus Sunday 8
6 2001-03-07 2.0 0.0 41.906433 70.001117 Lagenorhynchus acutus Wednesday 3

Now, to take advantage of Pandas' time series functionality, I'm going to assign 'date' as the index. It doesn't look like multiple entries on the same date will be a problem.


In [24]:
df.index = df.date
df.drop('date', axis=1, inplace=True)

In [25]:
df.sort_index(inplace=True)

In [26]:
df.head(2)


Out[26]:
stranded herded latitude longitude species weekday month
date
1999-03-07 6.0 0.0 41.916667 70.033333 Lagenorhynchus acutus Sunday 3
1999-03-18 5.0 0.0 41.718017 70.368650 Lagenorhynchus acutus Thursday 3

EXPLORATION & DATA ENGINEERING

Now I can do some preliminary analysis. First I'll look at stranding and herding, breaking it down by species. I'll also take a look at possible weekday bias, and seasonal bias.

Observer Bias


In [27]:
f,ax=pl.subplots(figsize=(12,6))
sb.countplot(x='weekday', data=df,ax=ax);
ax.set_xlabel('weekday', fontsize=16)
ax.set_ylabel('count', fontsize=16);
ax.set_title('Week day bias', fontsize=18);


Numbers are higher during the weekend, but not by a huge margin. It does argue for binning the data by week or more, maybe even by month; with the frame now date-indexed, weekly binning would be a one-liner (see the sketch below). Next, I'll look at seasonal bias by aggregating on a monthly scale.
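A minimal sketch of that weekly binning (not run here; it assumes the date index set above):

df_weekly = df[['stranded', 'herded']].resample('W').sum().fillna(0)  # one row per week, zero-filled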


Seasonal Bias


In [28]:
dfs = df.groupby(['species'])['stranded'].sum()
dfh = df.groupby(['species'])['herded'].sum()
f,ax=pl.subplots(figsize=(10,8))
dfs.plot(kind='bar', logy=True,ax=ax, label='stranded')
dfh.plot(kind='bar', logy=True, ax=ax, alpha=0.4, color='red', label='herded')
ax.set_xticklabels(ax.get_xticklabels(),rotation=30, fontsize=14)
ax.set_xlabel('species',fontsize=16)
ax.set_ylabel('log(count)', fontsize=16)
ax.set_title('Overall count by species', fontsize=16);
ax.legend(fontsize=14);



In [29]:
dfpivot = df.pivot_table(index='month',columns='species',values='stranded',
                        aggfunc=sum, fill_value=0)

In [30]:
dfMonthly = df.loc[:, ['stranded', 'herded']].resample('M').sum().fillna(0)  # monthly totals; zero-fill empty months

In [31]:
f,ax=pl.subplots(figsize=(16,8))
dfpivot.plot(kind='bar',ax=ax, stacked=True, fontsize=18, colormap='jet')
ax.set_xlabel('month', fontsize=18)
ax.set_ylabel('count', fontsize=18)
ax.set_title('Seasonal trend', fontsize=18);
ax.set_xticklabels([month_name[i].lower() for i in range(1,13)], rotation=30);
ax.legend(fontsize=16);


Next... If I end up modeling the time series, I'll have to come back and zero-pad the data. For now, I need to reshape the data a bit to ease plotting and further exploration. To that end, I'll create a new dataframe with the following operations:

  1. use one-hot encoding to create indicator columns for each species.
  2. sum stranded and herded into a new 'total' column.
  3. multiply the species indicator columns by the total count.
  4. rename the species columns to a more practical set of names.

In [32]:
#1.
dfs = pd.get_dummies(df, columns=['species'])
#2.
dfs['total']=dfs[['stranded','herded']].sum(axis=1)
#3.
speciescols = [col for col in dfs.columns if 'species' in col]
dfs[speciescols] = dfs[speciescols].multiply(dfs.total, axis=0)
#4.
dfs.rename(columns={key:key.split('_')[1].replace(' ','_') for key in speciescols},
           inplace=True)
# Let's verify the result is as desired...
dfs.head(10)


Out[32]:
stranded herded latitude longitude weekday month Delphinus_delphis Globicephala_melas Grampus_griseus Lagenorhynchus_acutus Stenella_coeruleoalba Tursiops_truncatus total
date
1999-03-07 6.0 0.0 41.916667 70.033333 Sunday 3 0.0 0.0 0.0 6.0 0.0 0.0 6.0
1999-03-18 5.0 0.0 41.718017 70.368650 Thursday 3 0.0 0.0 0.0 5.0 0.0 0.0 5.0
1999-03-19 29.0 0.0 41.933900 70.016200 Friday 3 0.0 0.0 0.0 29.0 0.0 0.0 29.0
1999-03-20 12.0 0.0 41.916167 70.062333 Saturday 3 0.0 0.0 0.0 12.0 0.0 0.0 12.0
1999-03-21 2.0 0.0 41.933900 70.016200 Sunday 3 0.0 0.0 0.0 2.0 0.0 0.0 2.0
1999-03-22 2.0 0.0 41.939333 70.054833 Monday 3 0.0 0.0 0.0 2.0 0.0 0.0 2.0
1999-06-27 2.0 0.0 41.663667 69.990667 Sunday 6 0.0 2.0 0.0 0.0 0.0 0.0 2.0
1999-12-15 3.0 0.0 41.761867 70.113483 Wednesday 12 3.0 0.0 0.0 0.0 0.0 0.0 3.0
2000-04-09 2.0 0.0 41.930667 70.069667 Sunday 4 0.0 0.0 0.0 2.0 0.0 0.0 2.0
2000-04-11 2.0 0.0 41.933333 70.024167 Tuesday 4 0.0 0.0 0.0 2.0 0.0 0.0 2.0

What about the multi-species event we detected earlier? Now that the dataframe is date-indexed, it's easy to check:


In [33]:
dfs.loc['2008-12-25',:]


Out[33]:
stranded herded latitude longitude weekday month Delphinus_delphis Globicephala_melas Grampus_griseus Lagenorhynchus_acutus Stenella_coeruleoalba Tursiops_truncatus total
date
2008-12-25 5.0 0.0 41.7840 -70.033100 Thursday 12 5.0 0.0 0.0 0.0 0.0 0.0 5.0
2008-12-25 3.0 0.0 41.9205 70.032167 Thursday 12 0.0 0.0 3.0 0.0 0.0 0.0 3.0

For our purposes we don't need lat/lon (all events occur in the same general area) or weekday (we've already characterized the intra-week bias), so we'll drop those...


In [34]:
dfs.drop(['latitude','longitude','weekday'],axis=1, inplace=True)

In [35]:
dfs_monthly = dfs.resample('M').sum().fillna(0.0)  # monthly totals on the date index; zero-fill empty months

In [36]:
speciescols = [col for col in dfs_monthly.columns if '_' in col]

In [37]:
f, ax = pl.subplots(figsize=(12,6))
colors = ["#A60628", "#348ABD", 'orange','green','purple','pink']
ax.set_title('Count Distribution by Event')
sb.kdeplot(dfs.total,shade=True,ax=ax,color='k')
for col, spec in zip(colors, speciescols):
    sb.kdeplot(dfs[spec], shade=True, ax=ax, color=col)
ax.legend(fontsize=14)
ax.set_xlim((0, 60))
ax.set_xlabel('event #', fontsize=16)
ax.set_ylabel('freq.', fontsize=16);



In [38]:
f,ax=pl.subplots(figsize=(15,4))
markerline, stemlines, baseline = ax.stem(dfs_monthly.index.values, dfs_monthly.total.values,
                                          label='stranded + herded')
pl.setp(markerline, 'color','k')
pl.setp(stemlines, 'color', 'k')
pl.setp(baseline, 'color', 'k')
ax.xaxis_date()
ax.legend(loc='best', fontsize=14)
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('animal count',fontsize=16);



In [39]:
f,axs = pl.subplots(nrows=2,figsize=(15,6), sharex=True)
for ax,col,clr in zip(axs,['stranded','herded'],['#A60628','#348ABD']):
    markerline, stemlines, baseline = ax.stem(dfs_monthly.index.values, dfs_monthly.loc[:,col].values, label=col)
    pl.setp(baseline, 'color', clr)
    pl.setp(stemlines, 'color', clr)
    pl.setp(markerline, 'color',clr)
    ax.xaxis_date()
    ax.legend(loc='best', fontsize=15)
    ax.set_ylabel('count', fontsize=16)
ax.set_xlabel('date', fontsize=16);



In [40]:
f,axs = pl.subplots(nrows=6,figsize=(15,12))
for ax,col in zip(axs,speciescols):
    markerline, stemlines, baseline = ax.stem(dfs_monthly.index.values, dfs_monthly.loc[:,col].values, label=col)
    pl.setp(markerline, 'color', 'k')
    pl.setp(stemlines, 'color', 'k')
    pl.setp(baseline, 'color', 'k')
    ax.xaxis_date()
    ax.legend(loc='best')


Since these are count data, I'm first going to attempt modeling them with a Poisson distribution (cf. the Modeling section below). Before moving on, I'll pickle the various dataframes created here so I can reload them in a subsequent notebook...


In [41]:
dataSaveDir = '/home/madhatter106/DATA/Beaching_Project/'
df.to_pickle(os.path.join(dataSaveDir, 'dfBeaching.pkl'))
dfMonthly.to_pickle(os.path.join(dataSaveDir, 'dfBeachingMonthly.pkl'))
dfs.to_pickle(os.path.join(dataSaveDir, 'dfsBeaching.pkl'))
dfpivot.to_pickle(os.path.join(dataSaveDir, 'dfBchngPivotByMonth.pkl'))
dfs_monthly.to_pickle(os.path.join(dataSaveDir, 'dfsMonthly.pkl'))

... and to store a quick explanation in a text file


In [42]:
expList = [['df', 'main dataframe'], ['dfMonthly', 'df monthly resampling (by sum)'],
            ['dfs', 'main grouped by species'], ['dfpivot', 'pivot by month on stranded & species'],
            ['dfs_monthly', 'dfs montly resampling (by sum)']]
with open(os.path.join(dataSaveDir,'dfExplain.txt'), 'w') as fExplain:
    for pair in expList:
        print("%s: %s" %(pair[0], pair[1]), file=fExplain)

MODELING

These are count data, so as mentioned earlier, my first attempt will be to use a Poisson process. Starting from a basic model, I'll move on to explore possible switchpoints in the rate at which these beachings occur. A simpler alternative might be a binary-outcome model (beaching/no beaching) using a Bernoulli process, but for now a boy will dream.

Since I'm of the opinion that frequentist methods and science don't mix (sorry, Antti), and because I can't pass up a chance to flex my Bayesian muscle, I'll use a very nice library, pymc3, for this modeling bit. <img src=https://imgs.xkcd.com/comics/frequentists_vs_bayesians.png width="400" height="800" />

                                                                    (XKCD)

I'm first going to assume a single beaching rate for three groups: total count, D. delphis, and L. acutus. The other species make too scant an appearance in the data. I'll then attempt to identify switchpoints in time around which the beaching rate changes significantly. The basic inference model follows the diagram displayed below. The observations are drawn from the dfs dataframe I created in the Exploration & Data Engineering section. The rate parameter is a continuous variable. As I show in the Kruschke diagrams below, there are two possible distributions I could draw from as priors for the rate parameter: a uniform distribution and an exponential distribution. <img src="count_model.png" width="600" height="1200" />
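In equation form, with $y_t$ the monthly animal count, the two variants are

$$y_t \sim \text{Poisson}(\lambda), \qquad \lambda \sim \text{Uniform}(\min y,\, \max y) \quad \text{or} \quad \lambda \sim \text{Exponential}(\alpha), \quad \alpha = 1/\bar{y}.$$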

1 - Uniform rate prior


In [43]:
f,ax=pl.subplots(figsize=(15,4))
markerline, stemlines, baseline = ax.stem(dfs_monthly.index.values, dfs_monthly.total.values,
                                          label='stranded + herded')
pl.setp(markerline, 'color','k')
pl.setp(stemlines, 'color', 'k')
pl.setp(baseline, 'color', 'k')
ax.xaxis_date()
ax.legend(loc='best', fontsize=14)
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('animal count',fontsize=16);



In [44]:
totCntData = dfs_monthly.total.values

Next is to specify the Bayesian model, following the Kruschke diagram above:


In [45]:
with pm.Model() as totCntMdl:
    # prior
    lmbda = pm.Uniform('lmbda', totCntData.min(), totCntData.max())
    # likelihood
    lklhd = pm.Poisson('likelihood',mu=lmbda, observed=totCntData)
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(10000, start=start, step=step)


Optimization terminated successfully.
         Current function value: 1672.059521
         Iterations: 8
         Function evaluations: 9
         Gradient evaluations: 9
100%|██████████| 10000/10000 [00:01<00:00, 8206.43it/s]

In [46]:
chain_tcm = trace[1000:]

In [47]:
pm.traceplot(chain_tcm);


MCMC diagnostics


In [48]:
pm.df_summary(chain_tcm)


Out[48]:
mean sd mc_error hpd_2.5 hpd_97.5
lmbda 4.998167 0.16372 0.004269 4.675734 5.31374

In [49]:
pm.autocorrplot(chain_tcm);


This suggests an average rate of 5 strandings/month, which is just about equal to simply taking the mean (4.995). While this may seem like a trivial exercise, it establishes that the model is set up right, and in the process quantifies the uncertainty around that number. The diagnostics above look good, with a relatively small error, a tight 95% credible interval between roughly 4.68 and 5.31, and a quickly subsiding autocorrelation pattern. I'm going to repeat the exercise with an exponential prior, whose rate hyperparameter alpha is pegged to the inverse of the sample mean (see the model specification).

2 - Exponential rate prior


In [50]:
with pm.Model() as totCntMdl_2:
    # prior
    alpha = 1/totCntData.mean()
    lmbda = pm.Exponential('lmbda', alpha)
    # likelihood
    lklhd = pm.Poisson('likelihood',mu=lmbda, observed=totCntData)
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace_2 = pm.sample(10000, start=start, step=step)


Optimization terminated successfully.
         Current function value: 1670.031098
         Iterations: 5
         Function evaluations: 7
         Gradient evaluations: 7
100%|██████████| 10000/10000 [00:01<00:00, 6381.07it/s]

In [51]:
chain_tcm_2 = trace_2[1000:]

In [52]:
pm.traceplot(chain_tcm_2);



In [53]:
pm.df_summary(chain_tcm_2)


Out[53]:
mean sd mc_error hpd_2.5 hpd_97.5
lmbda 4.990791 0.162829 0.003994 4.687631 5.30902

In [54]:
pm.autocorrplot(chain_tcm_2);


3 - Switchpoints

Since I get comparable results with either a uniform or an exponential prior, I'll go ahead with the latter. To search for a switchpoint, I'll add a second lambda parameter to the Poisson process and introduce a switchpoint, $\tau$ (tau in the model), at which the model can switch lambda parameters to better accommodate a change in stranding rate. The $\tau$ parameter is drawn from a discrete uniform distribution between 0 and the number of months available in the time series.
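Written out, with $T$ the number of months in the series:

$$\lambda(t) = \begin{cases} \lambda_1 & t \le \tau \\ \lambda_2 & t > \tau \end{cases}, \qquad \lambda_1,\, \lambda_2 \sim \text{Exponential}(\alpha), \qquad \tau \sim \text{DiscreteUniform}(0,\, T-1).$$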


In [55]:
with pm.Model() as model:
    alpha = 1.0/totCntData.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    tau = pm.DiscreteUniform("tau", lower=0, upper=totCntData.size-1)
    idx = np.arange(totCntData.size)
    lambda_ = pm.math.switch( tau >= idx, lambda_1, lambda_2 )
    lklhd = pm.Poisson("lklhd", lambda_, observed=totCntData)
    #step = pm.Metropolis(model.free_RVs)
    traceExpPr = pm.sample(10000, tune=1000)


Assigned NUTS to lambda_1_log_
Assigned NUTS to lambda_2_log_
Assigned Metropolis to tau
100%|██████████| 10000/10000 [00:10<00:00, 952.41it/s]

In [56]:
chainExpPr=traceExpPr[1000:]

In [57]:
pm.traceplot(chainExpPr);



In [58]:
pm.df_summary(chainExpPr)


Out[58]:
mean sd mc_error hpd_2.5 hpd_97.5
tau 127.993000 1.350167 0.037182 125.000000 130.000000
lambda_1 4.246134 0.178711 0.002357 3.906706 4.588968
lambda_2 6.579516 0.327444 0.004292 5.918661 7.187236

In [59]:
pm.autocorrplot(chainExpPr);



In [60]:
f,ax=pl.subplots(figsize=(15,4))
markerline, stemlines, baseline = ax.stem(dfs_monthly.index.values, dfs_monthly.total.values,
                                          label='stranded + herded')
pl.setp(markerline, 'color','k')
pl.setp(stemlines, 'color', 'k')
pl.setp(baseline, 'color', 'k')
ax.xaxis_date()
ax.vlines(dfs_monthly.index.values[128], ymin=0, ymax=dfs_monthly.total.values.max(),
         color='r',
         label='%s' % str(dfs_monthly.index.values[128]).split('T')[0])
ax.legend(loc='best', fontsize=14)
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('animal count',fontsize=16);



In [61]:
# mean monthly count before and after month 128 (.ix is deprecated; use positional indexing)
dfs_monthly['total'].iloc[:128].mean(), dfs_monthly['total'].iloc[128:].mean()


Out[61]:
(4.25, 6.532258064516129)

So, with an exponential prior, we find a switchpoint, $\tau$, around the $128^{th}$ month of the (unpadded) series, with a mean stranding rate of 4.25 before and 6.5 after. I see some spikes before that point, so I'll take the dataset up to month 128 and see if I can identify another switchpoint.


In [62]:
data = totCntData[:128]
with pm.Model() as model:
    alpha = 1.0/totCntData[:128].mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    tau = pm.DiscreteUniform("tau", lower=0, upper=data.size-1)
    idx = np.arange(data.size)
    lambda_ = pm.math.switch( tau >= idx, lambda_1, lambda_2 )
    lklhd = pm.Poisson("lklhd", lambda_, observed=data)
    #step = pm.Metropolis(model.free_RVs)
    traceExpPrTrunc = pm.sample(10000, tune=1000)


Assigned NUTS to lambda_1_log_
Assigned NUTS to lambda_2_log_
Assigned Metropolis to tau
100%|██████████| 10000/10000 [00:09<00:00, 1052.90it/s]

In [63]:
pm.traceplot(traceExpPrTrunc)


Out[63]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff827dd1cc0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff8286af278>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff82f701d30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff82f6e0358>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff8286f50b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff82cc18710>]], dtype=object)

In [64]:
pm.df_summary(traceExpPrTrunc)


Out[64]:
mean sd mc_error hpd_2.5 hpd_97.5
tau 78.831700 2.244098 0.185661 76.000000 80.000000
lambda_1 3.622207 0.207410 0.002432 3.206417 4.023073
lambda_2 5.303500 0.338140 0.007486 4.655493 5.966942

In [65]:
f,ax=pl.subplots(figsize=(15,4))
markerline, stemlines, baseline = ax.stem(dfs_monthly.index.values, dfs_monthly.total.values,
                                          label='stranded + herded')
pl.setp(markerline, 'color','k')
pl.setp(stemlines, 'color', 'k')
pl.setp(baseline, 'color', 'k')
ax.xaxis_date()
ax.vlines(dfs_monthly.index.values[80], ymin=0, ymax=dfs_monthly.total.values.max(),
         color='r',
         label='%s' % str(dfs_monthly.index.values[80]).split('T')[0])
ax.vlines(dfs_monthly.index.values[128], ymin=0, ymax=dfs_monthly.total.values.max(),
         color='r',
         label='%s' % str(dfs_monthly.index.values[128]).split('T')[0])
ax.legend(loc='best', fontsize=14)
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('animal count',fontsize=16);



In [66]:
print(dfs_monthly['total'].iloc[:80].mean())
print(dfs_monthly['total'].iloc[80:128].mean())
print(dfs_monthly['total'].iloc[128:].mean())


3.65
5.25
6.532258064516129

In [67]:
print(dfs_monthly.index.values[79])
print(dfs_monthly.index.values[128])


2005-10-31T00:00:00.000000000
2009-11-30T00:00:00.000000000

The above then suggests the presence of a couple of switchpoints: one around the end of 2005 and one around the end of 2009. There might be more, but given the data paucity I'm not comfortable truncating the data set further. Note that I haven't zero-padded the data (i.e., introduced months with zero counts into the time series). Doing so would be more realistic and might yield different results. I'll try that next and, if the results change, revise this post accordingly.
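For the record, a minimal sketch of that zero-padding, reindexing the monthly frame against a complete month-end range (untested here):

full_idx = pd.date_range(dfs_monthly.index.min(), dfs_monthly.index.max(), freq='M')
dfs_monthly_padded = dfs_monthly.reindex(full_idx, fill_value=0)  # months without events get zero counts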

Tentatively, we could be done with the exploration of the count data. Another feature we might explore is whether, and if so how, the probability that an event occurs (an event being 2 or more animals, without any significance attached to the count) shifts over time. A Bernoulli process would be an appropriate candidate for modeling that question.
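Should I go that route, a minimal sketch of such a model might look like this (untested; it binarizes the monthly counts and puts a flat Beta prior on the event probability):

events = (totCntData >= 2).astype(int)  # an 'event' month: 2 or more animals
with pm.Model() as bernMdl:
    p = pm.Beta('p', alpha=1, beta=1)  # flat prior on the monthly event probability
    lklhd = pm.Bernoulli('lklhd', p=p, observed=events)
    trace_bern = pm.sample(10000)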


In [68]:
# ADDENDUM: What happens if I use a zero-inflated Poisson distribution?
with pm.Model() as zf_model:
    alpha_zf = 1.0/totCntData.mean()
    lambda_1_zf = pm.Exponential("lambda_1_zf", alpha_zf)
    lambda_2_zf = pm.Exponential("lambda_2_zf", alpha_zf)
    tau_zf = pm.DiscreteUniform("tau_zf", lower=0, upper=totCntData.size-1)
    idx = np.arange(totCntData.size)
    lambda_ = pm.math.switch( tau_zf >= idx, lambda_1_zf, lambda_2_zf )
    zf_lklhd = pm.ZeroInflatedPoisson("zf_lklhd", lambda_, observed=totCntData)
    #step = pm.Metropolis(model.free_RVs)
    trace_zfPoi = pm.sample(10000, tune=1000)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-68-d52322f2e63a> in <module>()
      7     idx = np.arange(totCntData.size)
      8     lambda_ = pm.math.switch( tau_zf >= idx, lambda_1_zf, lambda_2_zf )
----> 9     zf_lklhd = pm.ZeroInflatedPoisson("zf_lklhd", lambda_, observed=totCntData)
     10     #step = pm.Metropolis(model.free_RVs)
     11     trace_zfPoi = pm.sample(10000, tune=1000)

/home/madhatter106/anaconda3/lib/python3.5/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
     28         if isinstance(name, string_types):
     29             data = kwargs.pop('observed', None)
---> 30             dist = cls.dist(*args, **kwargs)
     31             return model.Var(name, dist, data)
     32         else:

/home/madhatter106/anaconda3/lib/python3.5/site-packages/pymc3/distributions/distribution.py in dist(cls, *args, **kwargs)
     39     def dist(cls, *args, **kwargs):
     40         dist = object.__new__(cls)
---> 41         dist.__init__(*args, **kwargs)
     42         return dist
     43 

TypeError: __init__() missing 1 required positional argument: 'psi'
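Per the error message, ZeroInflatedPoisson also expects psi, the mixture weight of the Poisson (non-extra-zero) component. A sketch of the likely fix, giving psi its own prior (untested, and the rate argument's name and position vary across pymc3 versions, so check the docs):

with pm.Model() as zf_model:
    psi = pm.Beta('psi', alpha=1, beta=1)  # probability of the Poisson component
    alpha_zf = 1.0/totCntData.mean()
    lambda_1_zf = pm.Exponential("lambda_1_zf", alpha_zf)
    lambda_2_zf = pm.Exponential("lambda_2_zf", alpha_zf)
    tau_zf = pm.DiscreteUniform("tau_zf", lower=0, upper=totCntData.size-1)
    idx = np.arange(totCntData.size)
    lambda_ = pm.math.switch(tau_zf >= idx, lambda_1_zf, lambda_2_zf)
    zf_lklhd = pm.ZeroInflatedPoisson("zf_lklhd", lambda_, psi, observed=totCntData)
    trace_zfPoi = pm.sample(10000, tune=1000)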
