Compute summary statistics for the daily sea ice index.

From the CSV files, determine the day of maximum and minimum extent for each

month, and rank each month's max and min against the same calendar month in all other years.

The input data format is just a date and an extent value for each day we have data:

Year, Month, Day,     Extent,    Missing, Source Data
YYYY,    MM,  DD, 10^6 sq km, 10^6 sq km, Source data product web site: http://nsidc.org/d....
1978,    10,  26,     10.231,      0.000, ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051....
1978,    10,  28,     10.420,      0.000, ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051....
1978,    10,  30,     10.557,      0.000, ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051....
....

Start by downloading the daily sea ice extent data from NSIDC's website.


In [207]:
!mkdir -p ../data
!wget -P ../data -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_final.csv 
!wget -P ../data -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_nrt.csv
!wget -P ../data -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/south/daily/data/SH_seaice_extent_final.csv
!wget -P ../data -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/south/daily/data/SH_seaice_extent_nrt.csv

In [208]:
import datetime as dt
import numpy as np
import os
import pandas as pd
from pandas import ExcelWriter

Code to read the CSV files:


In [209]:
def parse_the_date(year, mm, dd):
    return dt.date(int(year), int(mm), int(dd))

def slurp_csv(filename):
    data = pd.read_csv(filename, header = None, skiprows=2,
                       names=["year", "mm", "dd", "extent", "missing", "source"],
                       parse_dates={'date':['year', 'mm', 'dd']},
                       date_parser=parse_the_date, index_col='date')
    data = data.drop('missing', axis=1)
    return data

def read_a_hemisphere(hemisphere):
    the_dir = '../data'
    final_prod_filename = os.path.join(the_dir, '{hemi}H_seaice_extent_final.csv'.format(hemi=hemisphere[0:1].upper()))
    nrt_prod_filename = os.path.join(the_dir, '{hemi}H_seaice_extent_nrt.csv'.format(hemi=hemisphere[0:1].upper()))

    final = slurp_csv(final_prod_filename)
    nrt = slurp_csv(nrt_prod_filename)
    all_data = pd.concat([final, nrt])
    return all_data
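As an aside, newer pandas versions deprecate the `parse_dates` dict plus `date_parser` combination used above; `pd.to_datetime` can build the index directly from the component columns instead. A minimal sketch on an inline CSV in the same layout (the values and URLs below are placeholders, not real data):

```python
import io
import pandas as pd

# Two header lines followed by one data row, mimicking the NSIDC layout
# (values and URLs are placeholders).
csv = io.StringIO(
    "Year, Month, Day, Extent, Missing, Source Data\n"
    "YYYY, MM, DD, 10^6 sq km, 10^6 sq km, http://example.invalid\n"
    "1978, 10, 26, 10.231, 0.000, ftp://example.invalid\n")

data = pd.read_csv(csv, header=None, skiprows=2, skipinitialspace=True,
                   names=['year', 'mm', 'dd', 'extent', 'missing', 'source'])
# Build the datetime index straight from the year/month/day columns.
data.index = pd.to_datetime(
    data[['year', 'mm', 'dd']].rename(columns={'mm': 'month', 'dd': 'day'}))
data = data.drop(['year', 'mm', 'dd', 'missing'], axis=1)
```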

Read CSV data


In [210]:
north = read_a_hemisphere('north')
south = read_a_hemisphere('south')
south.head()


Out[210]:
            extent                                             source
date                                                                 
1978-10-26  17.634   ftp://sidads.colorado.edu/pub/DATASETS/nsidc0...
1978-10-28  17.815   ftp://sidads.colorado.edu/pub/DATASETS/nsidc0...
1978-10-30  17.671   ftp://sidads.colorado.edu/pub/DATASETS/nsidc0...
1978-11-01  17.534   ftp://sidads.colorado.edu/pub/DATASETS/nsidc0...
1978-11-03  17.493   ftp://sidads.colorado.edu/pub/DATASETS/nsidc0...

Add columns for year and month. We have to do this because when we read the CSV file we converted the existing year/month/day columns into a Python datetime object. Also drop the source column, because we don't care where the data came from (near-real-time or final product).


In [211]:
def add_year_month_columns(df):
    a = df.copy()
    a = a.drop('source', axis=1)
    a = a.reset_index()
    a['year'] = pd.to_datetime(a.date).dt.year
    a['month'] = pd.to_datetime(a.date).dt.month
    a = a.set_index('date')
    return a

In [212]:
north = add_year_month_columns(north)
south = add_year_month_columns(south)

In [213]:
north.head()


Out[213]:
            extent  year  month
date                           
1978-10-26  10.231  1978     10
1978-10-28  10.420  1978     10
1978-10-30  10.557  1978     10
1978-11-01  10.670  1978     11
1978-11-03  10.787  1978     11

In [214]:
south.head()


Out[214]:
            extent  year  month
date                           
1978-10-26  17.634  1978     10
1978-10-28  17.815  1978     10
1978-10-30  17.671  1978     10
1978-11-01  17.534  1978     11
1978-11-03  17.493  1978     11

Add a 5-day rolling mean to the time series.


In [215]:
def add_rolling_mean(df, window=5, min_periods=2):
    copy = df.copy()
    # create an empty ts to align our extent data with, so that missing days
    # appear as NaN rather than being silently skipped by the rolling window
    ts = pd.Series(np.nan, index=pd.date_range('1978-10-25', dt.date.today().strftime('%Y-%m-%d')))
    copy.index = copy.index.to_datetime()
    copy = copy.align(ts, axis=0, join='right')[0]
    df['5day-Avg'] = pd.rolling_mean(copy['extent'], window=window, min_periods=min_periods)
    return df
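For reference, the same windowing behavior on toy data using the modern `.rolling()` accessor (which replaced `pd.rolling_mean` in later pandas releases). The values are made up; `min_periods=2` is what lets the average begin after only two observations:

```python
import numpy as np
import pandas as pd

# Toy daily series with a missing day (values are made up).
idx = pd.date_range('2000-01-01', periods=6, freq='D')
extent = pd.Series([10.0, 11.0, np.nan, 13.0, 14.0, 15.0], index=idx)

# 5-day trailing window; min_periods=2 lets averaging start after only
# two observations, and NaNs inside the window are simply skipped.
avg = extent.rolling(window=5, min_periods=2).mean()
```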

Apply the rolling mean, then reset the index so that date is back in the columns.


In [216]:
north = add_rolling_mean(north)
south = add_rolling_mean(south)

In [217]:
north.head(1)


Out[217]:
            extent  year  month  5day-Avg
date                                     
1978-10-26  10.231  1978     10       NaN

In [218]:
north = north.reset_index()
south = south.reset_index()
north.head(1)


Out[218]:
         date  extent  year  month  5day-Avg
0  1978-10-26  10.231  1978     10       NaN

Use a groupby to compute the row locations of the minimum and maximum extent, and grab those rows into new variables. In other words: filter out everything but the minimum/maximum extent for each month and year.


In [219]:
def select_min_and_max_variable_rows_by_year_and_month(df, variable):
    min_groups = df.loc[df.groupby(['year','month'])[variable].idxmin()][['date', variable, 'year', 'month']]
    max_groups = df.loc[df.groupby(['year','month'])[variable].idxmax()][['date', variable, 'year', 'month']]
    return {'min': min_groups,  'max': max_groups}
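A minimal sketch of the `groupby`/`idxmin` selection on toy data (hypothetical dates and extents), showing how `.loc` pulls the full minimum-extent row back out for each year/month group:

```python
import pandas as pd

# Toy frame: a few days spanning two months (hypothetical extents).
df = pd.DataFrame({
    'date': pd.to_datetime(['2012-09-01', '2012-09-17', '2012-09-30',
                            '2012-10-01', '2012-10-15']),
    'extent': [4.5, 3.4, 4.0, 4.9, 5.5]})
df['year'] = df.date.dt.year
df['month'] = df.date.dt.month

# idxmin returns the row *label* of each group's minimum;
# .loc then retrieves the complete rows.
rows = df.loc[df.groupby(['year', 'month'])['extent'].idxmin()]
```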

Create dictionaries of max and min values for each hemisphere, for both daily and rolling-mean data.


In [220]:
n = select_min_and_max_variable_rows_by_year_and_month(north, 'extent')
navg = select_min_and_max_variable_rows_by_year_and_month(north, '5day-Avg')
s = select_min_and_max_variable_rows_by_year_and_month(south, 'extent')
savg = select_min_and_max_variable_rows_by_year_and_month(south, '5day-Avg')

Show that we have actually selected different rows for the daily and 5-day-averaged data.


In [221]:
n['max'][3:5]


Out[221]:
          date  extent  year  month
48  1979-01-30  15.912  1979      1
61  1979-02-25  16.579  1979      2

In [222]:
navg['max'][3:5]


Out[222]:
          date   5day-Avg  year  month
48  1979-01-30  15.795333  1979      1
62  1979-02-27  16.515000  1979      2

In [223]:
def add_rank(df, rank_by, ascending):
    df['rank'] = df.groupby('month')[rank_by].rank(ascending=ascending, method='first')
    return df
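The `method='first'` argument matters when two entries tie on extent: ties are broken by position, so every row gets a distinct rank. A small illustration on made-up values:

```python
import pandas as pd

extents = pd.Series([2.1, 2.1, 3.0, 1.5])
# method='first' breaks the 2.1 tie by position: the earlier row ranks first.
ranks = extents.rank(ascending=True, method='first')
```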

Add a rank column for each month and hemisphere's max and min:


In [224]:
n['max'] = add_rank(n['max'], 'extent', ascending=False)
n['min'] = add_rank(n['min'], 'extent', ascending=True)
s['max'] = add_rank(s['max'], 'extent', ascending=False)
s['min'] = add_rank(s['min'], 'extent', ascending=True)

navg['max'] = add_rank(navg['max'], '5day-Avg', ascending=False)
navg['min'] = add_rank(navg['min'], '5day-Avg', ascending=True)
savg['max'] = add_rank(savg['max'], '5day-Avg', ascending=False)
savg['min'] = add_rank(savg['min'], '5day-Avg', ascending=True)

In [225]:
def do_annual_min_max_ranking(df, field):
    min_index = df.groupby(['year'])[field].idxmin()
    max_index = df.groupby(['year'])[field].idxmax()
    mindata = df.loc[min_index][['date', field]]
    mindata['rank'] = mindata[field].rank(ascending=True, method='first')
    maxdata = df.loc[max_index][['date', field]]
    maxdata['rank'] = maxdata[field].rank(ascending=False, method='first')

    mindata = mindata.set_index(pd.to_datetime(mindata.date).dt.year)
    maxdata = maxdata.set_index(pd.to_datetime(maxdata.date).dt.year)

    maxdata = maxdata.add_prefix('max_')
    mindata = mindata.add_prefix('min_')

    data = pd.concat([mindata, maxdata], axis=1)
    return data
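The `add_prefix`/`concat(axis=1)` step at the end is what places the min and max columns side by side, keyed on year. A toy sketch (one hypothetical year):

```python
import pandas as pd

# One hypothetical year of annual min and max rows, indexed by year.
mindata = pd.DataFrame({'date': ['2012-09-17'], 'extent': [3.4]}, index=[2012])
maxdata = pd.DataFrame({'date': ['2012-03-20'], 'extent': [15.2]}, index=[2012])

# add_prefix renames the columns; concat(axis=1) joins on the year index.
wide = pd.concat([mindata.add_prefix('min_'), maxdata.add_prefix('max_')],
                 axis=1)
```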

We also want annual min/max rank data, so revisit the north and south dataframes.


In [226]:
north_annual_by_day = do_annual_min_max_ranking(north, 'extent')
north_annual_averaged = do_annual_min_max_ranking(north, '5day-Avg')

In [227]:
south_annual_by_day = do_annual_min_max_ranking(south, 'extent')
south_annual_averaged = do_annual_min_max_ranking(south, '5day-Avg')

In [228]:
south_annual_averaged.head(3)


Out[228]:
        min_date  min_5day-Avg  min_rank    max_date  max_5day-Avg  max_rank
1978  1978-12-31      7.596000        38  1978-10-28       17.7245        37
1979  1979-02-19      2.928333        25  1979-09-15       18.3230        31
1980  1980-02-26      2.574000         7  1980-09-25       19.0470        11

Write the dataframes out to an Excel file in a readable format.


In [229]:
a = navg['min'].copy()

In [230]:
a.columns


Out[230]:
Index([u'date', u'5day-Avg', u'year', u'month', u'rank'], dtype='object')

In [231]:
a.set_index(['rank', 'month']).unstack('month').head(3)


Out[231]:
             date                                                              \
month          1           2           3           4           5           6    
rank                                                                            
1      2011-01-01  2006-02-01  2006-03-27  2006-04-30  2015-05-31  2010-06-30   
2      2014-01-01  2011-02-01  2015-03-09  2004-04-30  2006-05-31  2012-06-30   
3      2013-01-01  2012-02-01  2007-03-31  2007-04-30  2010-05-31  2011-06-30   

                                                       ...   year              \
month          7           8           9           10  ...     3     4     5    
rank                                                   ...                      
1      2012-07-31  2012-08-31  2012-09-17  2012-10-01  ...   2006  2006  2015   
2      2007-07-31  2007-08-31  2007-09-18  2007-10-01  ...   2015  2004  2006   
3      2011-07-31  2011-08-31  2011-09-11  2011-10-01  ...   2007  2007  2010   

                                                 
month    6     7     8     9     10    11    12  
rank                                             
1      2010  2012  2012  2012  2012  2012  2006  
2      2012  2007  2007  2007  2007  2007  2010  
3      2011  2011  2011  2011  2011  2010  2012  

[3 rows x 36 columns]

In [232]:
import calendar
month_names = [calendar.month_name[x] for x in range(1,13)]
def swap_column_level_and_sort(df):
    df.columns = df.columns.swaplevel(1,0)
    df = df.sortlevel(0, axis=1)
    return df

# set index to year and month and then broadcast month back across the columns.
# next swap and sort so that you have the data grouped under the month.
def prepare_for_csv(df):
    df = df.set_index(['rank','month']).unstack('month')
    df = swap_column_level_and_sort(df)
    df.columns = df.columns.set_levels(month_names, level=0)
    return df


def write_to_xls(df_list, writer, is_monthly=True):
    for df, sheet in df_list:
        if is_monthly:
            df = prepare_for_csv(df)
        df.to_excel(writer,'{sheet}'.format(sheet=sheet), float_format="%.3f")
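The `unstack`/`swaplevel` reshaping used in `prepare_for_csv` can be seen on a tiny tidy frame (made-up ranks, years, and extents); note the sketch uses the modern `sort_index(axis=1)` in place of the older `sortlevel`:

```python
import pandas as pd

# Tidy toy data: rank within month, with made-up years and extents.
tidy = pd.DataFrame({
    'rank':   [1, 1, 2, 2],
    'month':  [1, 2, 1, 2],
    'year':   [2012, 2006, 2007, 2011],
    'extent': [3.4, 3.6, 4.0, 4.1]})

# Broadcast month across the columns, then make month the outer level
# so that each month groups its own (date/extent/year-style) columns.
wide = tidy.set_index(['rank', 'month']).unstack('month')
wide.columns = wide.columns.swaplevel(0, 1)
wide = wide.sort_index(axis=1, level=0)
```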

In [233]:
writer = ExcelWriter('../output/Sea_Ice_MinMax_Statistics.xls')

In [234]:
monthly_dataframelist =[(navg['min'], 'Northern 5day Min'),
                        (navg['max'], 'Northern 5day Max'),
                        (savg['min'], 'Southern 5day Min'),
                        (savg['max'], 'Southern 5day Max'),
                        (n['min'], 'Northern Daily Min'),
                        (n['max'], 'Northern Daily Max'),
                        (s['min'], 'Southern Daily Min'),
                        (s['max'], 'Southern Daily Max')]
annual_dataframelist = [(north_annual_averaged, 'North Annual 5day-avg'),
                        (north_annual_by_day, 'North Annual Daily'),
                        (south_annual_averaged, 'South Annual 5day-avg'),
                        (south_annual_by_day, 'South Annual Daily')]

In [235]:
write_to_xls(monthly_dataframelist, writer, is_monthly=True)

In [236]:
write_to_xls(annual_dataframelist, writer, is_monthly=False)

writer.save()

In [237]:
b = prepare_for_csv(a)

In [238]:
b


Out[238]:
month     January                     February                        March  \
             date   5day-Avg  year        date   5day-Avg  year        date   
rank                                                                          
1      2011-01-01  12.595800  2011  2006-02-01  13.894200  2006  2006-03-27   
2      2014-01-01  12.881600  2014  2011-02-01  14.029800  2011  2015-03-09   
3      2013-01-01  12.910400  2013  2012-02-01  14.055400  2012  2007-03-31   
4      2009-01-01  12.994800  2009  2005-02-01  14.069400  2005  2005-03-28   
5      2015-01-01  13.000800  2015  2010-02-01  14.090400  2010  2011-03-29   
6      2007-01-01  13.005200  2007  2015-02-01  14.098600  2015  2014-03-31   
7      2006-01-01  13.031400  2006  2014-02-01  14.174400  2014  2010-03-01   
8      2008-01-01  13.034200  2008  2007-02-01  14.232600  2007  2004-03-31   
9      2005-01-01  13.054400  2005  2013-02-01  14.305600  2013  2009-03-13   
10     2010-01-01  13.102200  2010  2004-02-01  14.343200  2004  2013-03-31   
11     2012-01-01  13.119800  2012  2009-02-01  14.383400  2009  1996-03-28   
12     2000-01-01  13.325000  2000  2008-02-01  14.549800  2008  2012-03-01   
13     2001-01-01  13.368800  2001  1996-02-01  14.637800  1996  2002-03-31   
14     2004-01-01  13.388000  2004  2001-02-01  14.780400  2001  2008-03-31   
15     2003-01-01  13.475400  2003  1991-02-01  14.800400  1991  2000-03-23   
16     2002-01-01  13.476000  2002  1999-02-01  14.830200  1999  1995-03-16   
17     1991-01-01  13.488600  1991  2003-02-01  14.858000  2003  2001-03-31   
18     1997-01-01  13.642600  1997  1997-02-01  14.868000  1997  1999-03-06   
19     1996-01-01  13.690200  1996  1984-02-01  14.918667  1984  1989-03-30   
20     1999-01-01  13.742600  1999  2000-02-01  14.936800  2000  1991-03-29   
21     1986-01-01  13.769333  1986  1995-02-01  14.965800  1995  1997-03-31   
22     1985-01-02  13.795333  1985  2002-02-01  14.995600  2002  2003-03-14   
23     1993-01-01  13.905600  1993  1990-02-01  15.235800  1990  1998-03-26   
24     1998-01-01  13.922400  1998  1985-02-03  15.239000  1985  1992-03-07   
25     1992-01-01  13.922400  1992  1986-02-02  15.249000  1986  1994-03-22   
26     1984-01-02  13.956000  1984  1992-02-01  15.252000  1992  1984-03-30   
27     1994-01-01  13.970200  1994  1988-02-11  15.298800  1988  1990-03-31   
28     1990-01-01  14.022800  1990  1989-02-02  15.371400  1989  1981-03-30   
29     1995-01-01  14.049800  1995  1994-02-01  15.372800  1994  1988-03-31   
30     1987-01-02  14.110333  1987  1980-02-02  15.403000  1980  1993-03-01   
31     1980-01-01  14.151667  1980  1998-02-01  15.408400  1998  1986-03-30   
32     1989-01-01  14.181600  1989  1981-02-14  15.466000  1981  1985-03-11   
33     1981-01-01  14.184333  1981  1993-02-02  15.488400  1993  1987-03-27   
34     1983-01-01  14.186667  1983  1983-02-02  15.657000  1983  1979-03-31   
35     1982-01-02  14.293000  1982  1987-02-01  15.750333  1987  1980-03-31   
36     1979-01-02  14.706000  1979  1982-02-01  15.762000  1982  1982-03-27   
37     1988-01-14  14.840000  1988  1979-02-07  15.833000  1979  1983-03-30   

month                        April  ...  September     October             \
        5day-Avg  year        date  ...       year        date   5day-Avg   
rank                                ...                                     
1      14.229200  2006  2006-04-30  ...       2012  2012-10-01   3.922800   
2      14.273200  2015  2004-04-30  ...       2007  2007-10-01   4.398200   
3      14.307000  2007  2007-04-30  ...       2011  2011-10-01   4.928200   
4      14.418800  2005  2015-04-30  ...       2008  2008-10-01   4.982400   
5      14.456200  2011  1989-04-30  ...       2010  2010-10-01   5.299400   
6      14.582200  2014  2014-04-30  ...       2014  2014-10-01   5.494800   
7      14.838800  2010  2011-04-30  ...       2013  2009-10-01   5.520800   
8      14.852800  2004  2013-04-30  ...       2009  2013-10-01   5.617800   
9      14.870000  2009  2005-04-30  ...       2005  2005-10-01   5.625600   
10     14.901000  2013  2002-04-30  ...       2002  2006-10-01   5.882400   
11     14.935600  1996  2008-04-30  ...       1999  1995-10-01   6.092600   
12     14.970600  2012  1996-04-30  ...       2006  2003-10-01   6.246400   
13     15.020600  2002  1995-04-30  ...       2004  2002-10-01   6.337000   
14     15.048400  2008  2003-04-30  ...       2000  1990-10-01   6.365600   
15     15.054000  2000  1990-04-30  ...       1995  2004-10-01   6.448200   
16     15.151000  1995  2012-04-30  ...       2003  1991-10-01   6.619000   
17     15.221000  2001  1997-04-30  ...       1990  2000-10-01   6.640400   
18     15.226200  1999  2009-04-30  ...       1993  1999-10-01   6.869800   
19     15.266600  1989  2010-04-30  ...       1991  1998-10-01   6.874600   
20     15.296600  1991  2000-04-30  ...       1998  1997-10-01   6.879800   
21     15.300600  1997  1992-04-30  ...       1984  1993-10-01   7.014400   
22     15.320400  2003  1991-04-30  ...       1985  2001-10-01   7.029600   
23     15.380400  1998  1998-04-30  ...       2001  1985-10-01   7.156333   
24     15.411000  1992  2001-04-30  ...       1997  1979-10-01   7.234667   
25     15.419600  1994  1993-04-30  ...       1989  1989-10-01   7.285000   
26     15.478667  1984  1983-04-29  ...       1981  1984-10-02   7.583000   
27     15.515800  1990  1994-04-30  ...       1979  1994-10-01   7.615000   
28     15.541333  1981  1984-04-29  ...       1994  1982-10-01   7.643000   
29     15.617600  1988  1988-04-30  ...       1987  1981-10-02   7.700667   
30     15.630800  1993  1999-04-30  ...       1988  1987-10-01   7.795400   
31     15.633667  1986  1986-04-29  ...       1986  1988-10-01   7.871200   
32     15.634667  1985  1981-04-29  ...       1982  1983-10-02   7.917667   
33     15.746667  1987  1987-04-30  ...       1996  1986-10-02   8.028333   
34     15.889000  1979  1980-04-30  ...       1992  1980-10-01   8.069333   
35     15.896667  1980  1985-04-30  ...       1983  1992-10-01   8.146600   
36     15.900667  1982  1979-04-30  ...       1980  1996-10-01   8.197400   
37     15.913000  1983  1982-04-30  ...        NaN  1978-10-28  10.325500   

month          November                     December                   
       year        date   5day-Avg  year        date   5day-Avg  year  
rank                                                                   
1      2012  2012-11-01   7.787000  2012  2006-12-01  10.152800  2006  
2      2007  2007-11-01   8.194800  2007  2010-12-01  10.458200  2010  
3      2011  2010-11-01   8.243200  2010  2012-12-01  10.642000  2012  
4      2008  2011-11-01   8.414600  2011  2007-12-01  10.757200  2007  
5      2010  2009-11-01   8.505200  2009  2011-12-01  10.786600  2011  
6      2014  2013-11-01   8.887400  2013  2009-12-01  10.916200  2009  
7      2009  2006-11-01   8.942200  2006  2014-12-01  11.084000  2014  
8      2013  2014-11-01   8.957200  2014  2013-12-01  11.089200  2013  
9      2005  2004-11-01   9.011200  2004  2003-12-01  11.112200  2003  
10     2006  2003-11-01   9.158400  2003  2005-12-01  11.137200  2005  
11     1995  2005-11-01   9.187000  2005  2008-12-01  11.212200  2008  
12     2003  2002-11-01   9.191200  2002  2001-12-01  11.218800  2001  
13     2002  2008-11-01   9.194400  2008  2000-12-01  11.295400  2000  
14     1990  1998-11-01   9.324600  1998  1998-12-01  11.318400  1998  
15     2004  2001-11-01   9.377800  2001  2004-12-01  11.341600  2004  
16     1991  2000-11-01   9.435400  2000  1996-12-01  11.359200  1996  
17     2000  1984-11-01   9.465333  1984  1999-12-01  11.441400  1999  
18     1999  1997-11-01   9.583000  1997  2002-12-01  11.525800  2002  
19     1998  1991-11-01   9.687400  1991  1991-12-01  11.780000  1991  
20     1997  1981-11-01   9.744333  1981  1981-12-01  11.858333  1981  
21     1993  1996-11-01   9.761600  1996  1997-12-01  11.877000  1997  
22     2001  1999-11-01   9.839200  1999  1995-12-01  11.901000  1995  
23     1985  1995-11-01   9.851200  1995  1984-12-01  11.913333  1984  
24     1979  1987-11-01   9.862800  1987  1985-12-02  12.002667  1985  
25     1989  1985-11-02   9.878667  1985  1990-12-01  12.014200  1990  
26     1984  1979-11-02   9.955000  1979  1994-12-01  12.070400  1994  
27     1994  1990-11-01   9.959200  1990  1979-12-02  12.116667  1979  
28     1982  1988-11-01   9.968000  1988  1993-12-01  12.118000  1993  
29     1981  1993-11-01  10.011600  1993  1987-12-01  12.225400  1987  
30     1987  1989-11-01  10.075600  1989  1986-12-01  12.362333  1986  
31     1988  1994-11-01  10.131000  1994  1989-12-01  12.408600  1989  
32     1983  1992-11-01  10.133200  1992  1983-12-01  12.438000  1983  
33     1986  1980-11-02  10.176333  1980  1988-12-01  12.443800  1988  
34     1980  1986-11-01  10.305333  1986  1992-12-01  12.499000  1992  
35     1992  1983-11-01  10.374000  1983  1980-12-02  12.589667  1980  
36     1996  1978-11-01  10.549000  1978  1982-12-02  12.666667  1982  
37     1978  1982-11-02  10.665333  1982  1978-12-03  12.693000  1978  

[37 rows x 36 columns]

Clean up the downloaded CSV files.


In [239]:
!cd ../data ; rm -f NH_seaice_extent_final.csv NH_seaice_extent_nrt.csv SH_seaice_extent_final.csv SH_seaice_extent_nrt.csv
