A guided tour of how the daily 5-day statistics work with pandas

As always, first fetch the NSIDC daily sea ice extent data into our output directory.


In [ ]:
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_final.csv
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_nrt.csv
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/south/daily/data/SH_seaice_extent_final.csv
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/south/daily/data/SH_seaice_extent_nrt.csv

Variables to set before running:


In [ ]:
hemisphere = 'north'  # 'south' or 'north'
climatology_years = (1981, 2010)

In [ ]:
# imports for working with dates, NumPy, pandas, and Excel files
import datetime as dt
import os

import numpy as np
import pandas as pd
from pandas import ExcelWriter

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# pd.options.display.mpl_style was removed from pandas; a matplotlib style
# sheet gives the same ggplot-like look
plt.style.use('ggplot')

In [ ]:
# code for reading a hemisphere of data from the CSV files

def parse_the_date(year, mm, dd):
    return dt.date(int(year), int(mm), int(dd))

def slurp_csv(filename):
    # combine the year/mm/dd columns into a single 'date' index while reading
    # (date_parser is deprecated in newer pandas, but kept here for clarity)
    data = pd.read_csv(filename, header=None, skiprows=2,
                       names=["year", "mm", "dd", "extent", "missing", "source"],
                       parse_dates={'date': ['year', 'mm', 'dd']},
                       date_parser=parse_the_date, index_col='date')
    # keep only the extent column; drop the missing-data and source columns
    data = data.drop(['missing', 'source'], axis=1)
    return data


def read_a_hemisphere(hemisphere):
    the_dir = "../output"
    hemi = hemisphere[0].upper()  # 'N' or 'S'
    final_prod_filename = os.path.join(the_dir, '{hemi}H_seaice_extent_final.csv'.format(hemi=hemi))
    nrt_prod_filename = os.path.join(the_dir, '{hemi}H_seaice_extent_nrt.csv'.format(hemi=hemi))

    # concatenate the final (quality-controlled) and near-real-time products
    final = slurp_csv(final_prod_filename)
    nrt = slurp_csv(nrt_prod_filename)
    all_data = pd.concat([final, nrt])
    return all_data

In [ ]:
df = read_a_hemisphere(hemisphere)

# df.head(3) shows the first 3 rows of the dataframe
df.head(3)

Convert the date index to a proper DatetimeIndex and then reindex the dataframe so that every daily timestep is included in the series, with any missing data marked as NaN.


In [ ]:
# index before turning into DatetimeIndex
print(df.index[0:5])

In [ ]:
df.index = pd.to_datetime(df.index)
df = df.reindex(index=pd.date_range('1978-10-25', dt.date.today().strftime('%Y-%m-%d')))
df['hemi'] = hemisphere
print(df.head())
print("\nindex: ")
print(df.index)

Interpolate missing data in the SMMR period.

We don't want to interpolate across any time periods where more than one day of data is missing, but we do want to do a strict linear interpolation across the standard every-other-day cadence at which SMMR operated.

So we are going to do both a forward and a backward fill on the extent field, with the limit of filled missing values set to one. Next, we are going to union the NaNs that remain after the fills, so that when we interpolate we can leave any real gaps in the data record alone.

Start by using a backward fill to fill any NaN location that has a valid "next" value, and a forward fill to fill any NaN location that has a valid "previous" value.
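
To see this trick on something tiny first, here is a toy series (made-up numbers, not the real extents) with a one-day gap and a two-day gap:


In [ ]:
# toy example: a one-day gap is filled from both sides; a two-day gap from neither
toy = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
still_nan = toy.bfill(limit=1).isnull() | toy.ffill(limit=1).isnull()
print(toy.interpolate().mask(still_nan))  # only the one-day gap gets a value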


In [ ]:
# .bfill/.ffill with limit=1 fill at most one consecutive missing value
df['backfill'] = df.extent.bfill(limit=1)
df['forwardfill'] = df.extent.ffill(limit=1)

See below that in the backfill column, 1978-10-25 was filled with the value (10.231) from 1978-10-26, while in the forwardfill column it remains NaN. Conversely, in the forwardfill column, 1978-10-27 picks up the extent value from 1978-10-26.


In [ ]:
print(df.head())

See that 1987-12-03 gets a forward fill, but not a backfill.


In [ ]:
print(df['19871201':'19871206'])

See that 1988-01-12 gets a backfill, but not a forward fill.


In [ ]:
print(df['19880110':'19880114'])

So the union of the backfill NaNs and the forward fill NaNs captures any missing data point that doesn't have valid data both immediately before and after it in the series. Save this mask off so we know which locations are really missing:


In [ ]:
is_really_nan = pd.isnull(df['backfill']) | pd.isnull(df['forwardfill'])

Then, in the next cell:

  1. Use the interpolation scheme to do simple linear interpolation on the entire extent column.
  2. Go back and mark as missing any large gaps in the linearly interpolated data.
  3. Drop the backfill and forwardfill columns.

In [ ]:
df['interpolated'] = df.extent.interpolate()
# re-mark the real gaps as missing, then drop the helper columns
df.loc[is_really_nan, 'interpolated'] = np.nan
df = df.drop(['forwardfill', 'backfill'], axis=1)

So now we have a simple dataframe with daily extents and daily interpolated extents.


In [ ]:
df.head()

Add a 5-day rolling mean, computed from the interpolated data, to the dataframe.


In [ ]:
# pd.rolling_mean was removed from pandas; .rolling().mean() is the equivalent
df['5 Day'] = df['interpolated'].rolling(window=5, min_periods=2).mean()
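
With min_periods=2, a window produces a value as soon as at least two of its five observations are present, so the rolling series starts on the second day rather than the fifth. A quick toy check with made-up numbers:


In [ ]:
# first value is NaN (only one observation), second is a 2-point mean
print(pd.Series([10.0, 12.0, 14.0]).rolling(window=5, min_periods=2).mean())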

In [ ]:
df.head()

Compute climatological means by selecting a copy of the data between your desired climatology years.


In [ ]:
clim_data = df[(df.index.year >= climatology_years[0]) & (df.index.year <= climatology_years[1])].copy()
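
Because the index is a DatetimeIndex, partial-string slicing would make the same selection (just a sketch; clim_data_alt is a throwaway name and the rest of the notebook uses the boolean mask above):


In [ ]:
# equivalent selection via partial-string slicing on the DatetimeIndex
clim_data_alt = df[str(climatology_years[0]):str(climatology_years[1])].copy()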

In [ ]:
print(clim_data.head(3), "\n...\n", clim_data.tail(3))

Show the number of years in the climatology and which years they are.


In [ ]:
print(len(np.unique(clim_data.index.year)))
print(np.unique(clim_data.index.year))

Grab the mean value of the interpolated extents for each month/day combination.


In [ ]:
def clim_string(climatology_years):
    return '{0}-{1}'.format(climatology_years[0], climatology_years[1])

In [ ]:
def get_climatological_means(column, clim_data):
    means = clim_data.copy()
    # group by (month, day) so each calendar day is averaged across all years
    means = means.groupby([clim_data.index.month, clim_data.index.day]).mean()[[column]]
    means = means.rename(columns={column: clim_string(climatology_years)})
    return means

In [ ]:
daily_means = get_climatological_means('interpolated', clim_data)
five_day_means = get_climatological_means('5 Day', clim_data)

In [ ]:
print(five_day_means.head())

Check yourself: you can see in the three cells below that the value we get by calling mean() on the groupby result is the same as what we get by selecting the January 1 data directly and averaging it.


In [ ]:
testmeans = clim_data.groupby([clim_data.index.month, clim_data.index.day]).mean()[['interpolated']]
testmeans.head(1)

Select the January 1 data for climatology_years


In [ ]:
clim_data[(clim_data.index.month == 1) & (clim_data.index.day == 1)]['interpolated'].values

In [ ]:
np.nanmean(clim_data[(clim_data.index.month == 1) & (clim_data.index.day == 1)]['interpolated'].values)

Get the daily extent data into the correct format for display and for concatenating with the clim_averages.

In [ ]:
df.index

In [ ]:
import calendar
month_names = [calendar.month_name[x] for x in range(1,13)]

Right now the data is stored as a time series with an index of datetimes and columns of extent, interpolated extent, and 5-day rolling mean extent.


In [ ]:
df.head(2)

So we would like to reorder (pivot) the data into a dataframe where month/day pairs run down the left side, years run across as columns, and the extent values fill the table:

           1979   1980  ...
    Jan 1  data   data
        2  data   data

There are a couple of ways to do this. One is to select a column (or columns), set the index to be a hierarchy of year/month/day, and then unstack the year; that is the route taken below. The other is pivot_table, sketched first.
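
A sketch of the pivot_table route (pivoted is a throwaway name; df is left untouched and the notebook continues with the unstack route):


In [ ]:
# equivalent reshape in one call; pivot_table aggregates with mean by default,
# but each (month, day, year) triple is unique here, so no real averaging happens
pivoted = df.pivot_table(index=[df.index.month, df.index.day],
                         columns=df.index.year, values='extent')
pivoted.head(3)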


In [ ]:
df = df[['extent']].set_index([df.index.year, df.index.month, df.index.day]).unstack(0)
df.head(3)

We now want to concatenate the climatology means onto this newly shaped dataframe. But before we can do that, the column indices must match.


In [ ]:
print(df.columns.nlevels)
print(df.columns.levels)
print(daily_means.columns.nlevels)

So drop the extra extent level:


In [ ]:
df.columns = df.columns.droplevel(0)
print(df.columns.nlevels)

Now concatenate, and the dataframe is ready to be output.


In [ ]:
df = pd.concat([df, daily_means.copy()], axis=1)

In [ ]:
df.to_csv('test.csv')
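
Since ExcelWriter was imported above, the same table could also be written to an Excel workbook (just a sketch; the filename is arbitrary and an Excel engine such as openpyxl must be installed):


In [ ]:
# optional: also write the final table to an Excel file (arbitrary name)
with ExcelWriter('test.xlsx') as writer:
    df.to_excel(writer, sheet_name=hemisphere)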

In [ ]:
# cleanup
!cd ../output; rm -f NH_seaice_extent_final.csv NH_seaice_extent_nrt.csv SH_seaice_extent_final.csv SH_seaice_extent_nrt.csv
