As always, first fetch the NSIDC daily sea ice extent data into our output directory.
In [ ]:
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_final.csv
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_nrt.csv
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/south/daily/data/SH_seaice_extent_final.csv
!wget -P ../output -qN ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/south/daily/data/SH_seaice_extent_nrt.csv
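If wget isn't available, a minimal Python fallback (a sketch; it fetches the same four files but does not replicate wget's -N timestamp check):
In [ ]:
# Fallback download using only the standard library; urlretrieve handles ftp:// URLs.
import os
import urllib.request

base = 'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02135/{hemi}/daily/data/{name}'
names = ['NH_seaice_extent_final.csv', 'NH_seaice_extent_nrt.csv',
         'SH_seaice_extent_final.csv', 'SH_seaice_extent_nrt.csv']
os.makedirs('../output', exist_ok=True)
for name in names:
    hemi = 'north' if name.startswith('NH') else 'south'
    urllib.request.urlretrieve(base.format(hemi=hemi, name=name),
                               os.path.join('../output', name))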
Variables to set before running:
In [ ]:
hemisphere = 'north' # 'south' or 'north'
climatology_years = (1981, 2010)
In [ ]:
# Imports for working with pandas, Excel files, and plotting.
import datetime as dt
import numpy as np
import os
import pandas as pd
from pandas import ExcelWriter
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # pd.options.display.mpl_style was removed from pandas; this style is the closest equivalent
In [ ]:
# Code for reading a hemisphere of data from the CSV files.
def parse_the_date(year, mm, dd):
    return dt.date(int(year), int(mm), int(dd))

def slurp_csv(filename):
    data = pd.read_csv(filename, header=None, skiprows=2,
                       names=["year", "mm", "dd", "extent", "missing", "source"],
                       parse_dates={'date': ['year', 'mm', 'dd']},
                       date_parser=parse_the_date, index_col='date')
    data = data.drop(['missing', 'source'], axis=1)
    return data
def read_a_hemisphere(hemisphere):
    the_dir = "../output"
    final_prod_filename = os.path.join(
        the_dir, '{hemi}H_seaice_extent_final.csv'.format(hemi=hemisphere[0:1].upper()))
    nrt_prod_filename = os.path.join(
        the_dir, '{hemi}H_seaice_extent_nrt.csv'.format(hemi=hemisphere[0:1].upper()))
    final = slurp_csv(final_prod_filename)
    nrt = slurp_csv(nrt_prod_filename)
    all_data = pd.concat([final, nrt])
    return all_data
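Note that the date_parser keyword used in slurp_csv was removed in pandas 2.0. A sketch of an equivalent reader for newer pandas (slurp_csv_modern is an illustrative name) that assembles the index from the date columns instead:
In [ ]:
def slurp_csv_modern(filename):
    # Same CSV layout as above; build the DatetimeIndex from the
    # year/mm/dd columns rather than via the removed date_parser hook.
    data = pd.read_csv(filename, header=None, skiprows=2,
                       names=["year", "mm", "dd", "extent", "missing", "source"])
    data.index = pd.to_datetime(
        data[["year", "mm", "dd"]].rename(columns={"mm": "month", "dd": "day"}))
    data.index.name = "date"
    return data.drop(["year", "mm", "dd", "missing", "source"], axis=1)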
In [ ]:
df = read_a_hemisphere(hemisphere)
# head(3) shows the first 3 rows of the dataframe
df.head(3)
Convert the date index to a DatetimeIndex, then reindex the dataframe so that every daily timestep is included in the series, with any missing days marked as NaN.
In [ ]:
# index before turning into DatetimeIndex
print(df.index[0:5])
In [ ]:
df.index = pd.to_datetime(df.index)
df = df.reindex(index=pd.date_range('1978-10-25', dt.date.today().strftime('%Y-%m-%d')))
df['hemi'] = hemisphere
print(df.head())
print("\nindex: ")
print(df.index)
We don't want to interpolate across any time periods where more than one day of data is missing, but we do want to do a strict linear interpolation across the standard every-other-day cadence that SMMR operated on.
So we are going to do both a forward and a backward fill on the extent field, with the limit of filled missing values set to one. Next, we take the union of the NaNs that remain after the two fills, so that when we interpolate we leave any real gaps in the data record alone.
Start by using a backward fill to fill any NaN location that has a valid "next" value, and a forward fill to fill any NaN location that has a valid "previous" value.
In [ ]:
df['backfill'] = df.extent.bfill(limit=1)
df['forwardfill'] = df.extent.ffill(limit=1)
See below that in the backfill column, 1978-10-25 was filled with the value (10.231) from 1978-10-26, while in the forwardfill column it remains NaN. Conversely, the forwardfill column's 1978-10-27 value gets the extent from 1978-10-26.
In [ ]:
print(df.head())
See that 1987-12-03 gets a forward fill, but not a backfill
In [ ]:
print(df['19871201':'19871206'])
See that 1988-01-12 gets a backfill, but not a forwardfill
In [ ]:
print(df['19880110':'19880114'])
So the union of the backfill's NaNs and the forward fill's NaNs will capture any missing data that doesn't have a valid data point both before and after itself in the series. Save this union off as a boolean mask of the values that are really missing.
In [ ]:
is_really_nan = pd.isnull(df['backfill']) | pd.isnull(df['forwardfill'])
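A quick look at how much of the record this mask flags (an illustrative cell, not part of the original flow):
In [ ]:
# Count the days that will be left as NaN rather than interpolated.
print(int(is_really_nan.sum()), 'days flagged as real gaps')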
In [ ]:
df['interpolated'] = df.extent.interpolate()
# Reset the real gaps (flagged above) back to NaN so we don't interpolate across them.
df.loc[is_really_nan, 'interpolated'] = np.nan
df = df.drop(['forwardfill', 'backfill'], axis=1)
So now we have a simple dataframe with daily extents and daily interpolated extents.
In [ ]:
df.head()
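As a quick check (again an illustrative cell), the gap that starts on 1987-12-03 should now be NaN in the interpolated column too:
In [ ]:
# 1987-12-03 got only a forward fill earlier, so is_really_nan flagged it
# and its interpolated value was reset to NaN.
print(df['19871201':'19871206'])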
In [ ]:
df['5 Day'] = df['interpolated'].rolling(window=5, min_periods=2).mean()
In [ ]:
df.head()
In [ ]:
clim_data = df[(df.index.year >= climatology_years[0]) & (df.index.year <= climatology_years[1])].copy()
In [ ]:
print(clim_data.head(3), "\n...\n", clim_data.tail(3))
Show the number of years in the climatology, and then the years themselves.
In [ ]:
print(len(np.unique(clim_data.index.year)))
print(np.unique(clim_data.index.year))
Grab the mean value of the interpolated extents for each month/day combination.
In [ ]:
def clim_string(climatology_years):
    return '{0}-{1}'.format(climatology_years[0], climatology_years[1])
In [ ]:
def get_climatological_means(column, clim_data):
    means = clim_data.copy()
    # numeric_only=True keeps the mean from choking on the string 'hemi' column.
    means = means.groupby([clim_data.index.month, clim_data.index.day]).mean(numeric_only=True)[[column]]
    means = means.rename(columns={column: clim_string(climatology_years)})
    return means
In [ ]:
daily_means = get_climatological_means('interpolated', clim_data)
five_day_means = get_climatological_means('5 Day', clim_data)
In [ ]:
print(five_day_means.head())
Check yourself: you can see in the three cells below that the value we get by calling mean() on the groupby result is the same as the value we get by selecting the day-and-month data directly and averaging it.
In [ ]:
testmeans = clim_data.groupby([clim_data.index.month, clim_data.index.day]).mean(numeric_only=True)[['interpolated']]
testmeans.head(1)
Select the January 1 data for climatology_years
In [ ]:
clim_data[(clim_data.index.month == 1) & (clim_data.index.day == 1)]['interpolated'].values
In [ ]:
np.nanmean(clim_data[(clim_data.index.month == 1) & (clim_data.index.day == 1)]['interpolated'].values)
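And a hypothetical assertion tying the two together:
In [ ]:
# The groupby mean for (month=1, day=1) should equal the nanmean of the
# directly selected January 1 values.
jan1 = clim_data[(clim_data.index.month == 1) & (clim_data.index.day == 1)]['interpolated']
assert np.isclose(testmeans.loc[(1, 1), 'interpolated'], np.nanmean(jan1.values))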
In [ ]:
df.index
In [ ]:
import calendar
month_names = [calendar.month_name[x] for x in range(1,13)]
Right now the data is stored as a timeseries with an index of datetimes and columns of extent, interpolated extent, and 5-day rolling mean extent.
In [ ]:
df.head(2)
So we would like to reorder (pivot) the data into a dataframe where month/day pairs run down the left side, years run across as columns, and the data values fill the body:

            1979   1980  ...
Jan   1     data   data
      2     data   data
There are a couple of ways to do this: you can select a column (or columns), set the index to be a hierarchy of year/month/day, and then unstack the year. A pivot_table sketch is shown next, followed by the unstack version.
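First the alternative: a sketch using pd.pivot_table on the frame as it stands here (still indexed by date, with its extent column); the name alternative is just illustrative.
In [ ]:
# Sketch: pivot_table builds the same (month, day) x year layout directly.
alternative = pd.pivot_table(df, values='extent',
                             index=[df.index.month, df.index.day],
                             columns=df.index.year)
alternative.head(3)
And the set_index/unstack route, which we use from here on: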
In [ ]:
df = df[['extent']].set_index([df.index.year, df.index.month, df.index.day]).unstack(0)
df.head(3)
We now want to concatenate the climatology means onto this newly shaped dataframe, but before we can do that, the column indices must match.
In [ ]:
print(df.columns.nlevels)
print(df.columns.levels)
print(daily_means.columns.nlevels)
So drop the extra 'extent' level:
In [ ]:
df.columns = df.columns.droplevel(0)
print(df.columns.nlevels)
Now concatenate, and the dataframe is ready to be output.
In [ ]:
df = pd.concat([df, daily_means.copy()], axis=1)
In [ ]:
df.to_csv('test.csv')
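The ExcelWriter imported at the top can write the same table to a spreadsheet; a minimal sketch (the output filename is arbitrary, and an Excel engine such as openpyxl must be installed):
In [ ]:
# Illustrative Excel output using the ExcelWriter imported earlier.
with ExcelWriter('../output/daily_extents.xlsx') as writer:
    df.to_excel(writer, sheet_name=hemisphere)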
In [ ]:
# cleanup
!cd ../output; rm -f NH_seaice_extent_final.csv NH_seaice_extent_nrt.csv SH_seaice_extent_final.csv SH_seaice_extent_nrt.csv