(source: NIH/WHO)
In [ ]:
import pandas as pd
%pylab inline
In [ ]:
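try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# Download the CGM data set and save it locally
url = 'http://files.figshare.com/1113528/CGM.csv'
response = urlopen(url)
fname = 'CGM.csv'
with open(fname, 'wb') as f:
    f.write(response.read())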
In [ ]:
df = pd.read_csv('CGM.csv')
df.head()
Pandas has advanced CSV loading and parsing. Instead of the vanilla read_csv, let's add arguments that parse the datetime column into datetime objects and set that column as the index.
In [ ]:
df = pd.read_csv('CGM.csv',sep=',',parse_dates=[1],index_col=1)
df.head(5)
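To see what those two arguments do without downloading anything, here is a minimal sketch on a tiny in-memory CSV. The column names (`id`, `time`, `glucose`) and the values are made up for illustration and may not match the real file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; column names and values are illustrative only.
csv_text = u"""id,time,glucose
1,2010-03-24 10:00:00,110
2,2010-03-24 10:05:00,115
3,2010-03-24 10:10:00,121
"""

# parse_dates=[1] converts column 1 ("time") into datetime objects;
# index_col=1 then uses that same column as the row index.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=[1], index_col=1)

print(type(sample.index).__name__)
print(sample.columns.tolist())
```

The index becomes a DatetimeIndex, and the time column no longer appears among the regular columns.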
Now we can index into the dataframe by date. Print the values on March 24th, 2010 from 10:00 am to 10:30 am.
In [ ]:
df.loc['2010-03-24 10:00':'2010-03-24 10:30']
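Slicing a datetime index with strings includes both endpoints, which is easy to miss. A small sketch on a synthetic 5-minute series (the spacing and values are assumptions, not taken from the CGM file):

```python
import numpy as np
import pandas as pd

# Synthetic readings every 5 minutes from 09:00 to 11:55 (illustrative only).
idx = pd.date_range('2010-03-24 09:00', periods=36, freq='5min')
ts = pd.DataFrame({'glucose': np.arange(36)}, index=idx)

# Label-based slice on the DatetimeIndex; both endpoints are included.
half_hour = ts.loc['2010-03-24 10:00':'2010-03-24 10:30']

print(len(half_hour))
print(half_hour.index[0], half_hour.index[-1])
```

With 5-minute spacing, the half-hour slice holds seven rows: 10:00, 10:05, and so on through 10:30.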
With new data, plotting is a good way to get a first look.
In [ ]:
df.plot()
We've seen a number of instances of messy data, and this data set is no different.
In [ ]:
print(df.iloc[39450:39470])
In [ ]:
df = pd.read_csv('CGM.csv',sep=',',parse_dates=[1],index_col=1,na_values='nil')
df.plot()
You can see the gap of missing values near September 2010 in the plot above.
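Here is a small sketch of why the na_values argument matters, using a made-up three-row CSV (column names and values are illustrative). A single 'nil' string keeps the whole column from being read as numbers, while na_values='nil' turns it into NaN:

```python
import io
import pandas as pd

# Hypothetical CSV with one 'nil' marker; values are illustrative only.
csv_text = u"""time,glucose
2010-09-01 10:00:00,102
2010-09-01 10:05:00,nil
2010-09-01 10:10:00,98
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values='nil')

print(raw['glucose'].dtype)    # not numeric: 'nil' forces a text column
print(clean['glucose'].dtype)  # float64, with NaN where 'nil' was
print(clean['glucose'].isna().sum())
```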
In [ ]:
print(df.iloc[39450:39470])
In [ ]:
df_drop = df.dropna(axis=0).iloc[39450:39470]
df_drop
Note the jump in time from 10:30 am to 13:56 (1:56 pm) where the rows with missing values were dropped.
In [ ]:
df.iloc[39460:39470]
In [ ]:
df.ffill(limit=5).iloc[39450:39470]
This fills up to 5 consecutive NaN rows with the last valid value preceding them.
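The effect of the limit is easiest to see on a gap longer than 5 samples. A minimal sketch with a synthetic series (values are illustrative):

```python
import numpy as np
import pandas as pd

# One valid reading, a gap of 7 NaNs, then another valid reading.
s = pd.Series([100.0] + [np.nan] * 7 + [120.0])

# Forward fill, but only across the first 5 NaNs of each gap.
filled = s.ffill(limit=5)

print(filled.tolist())
```

The first five NaNs take the value 100.0 and the remaining two stay NaN, so long gaps are not silently papered over.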
In [ ]:
df.fillna(0).iloc[39460:39470]
No matter how you fill the NaN values, statistical calculations will still complete successfully, but the results depend on the fill method.
In [ ]:
df.describe()
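As a rough illustration of how much the fill strategy can move a summary statistic, here is a sketch on a four-sample synthetic series (not the CGM data):

```python
import numpy as np
import pandas as pd

# Synthetic readings with a two-sample gap (values are illustrative only).
s = pd.Series([100.0, np.nan, np.nan, 130.0])

mean_skip = s.mean()                  # NaNs are skipped by default
mean_zero = s.fillna(0).mean()        # zeros pull the mean down
mean_pad = s.ffill().mean()           # repeats the last valid reading
mean_interp = s.interpolate().mean()  # linear ramp across the gap

print(mean_skip, mean_zero, mean_pad, mean_interp)
```

Every call completes, but the means come out as 115.0, 57.5, 107.5 and 115.0 respectively, so filling with zeros roughly halves the apparent average here.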
In [ ]:
df = df.apply(pd.Series.interpolate)
print(df.iloc[39460:39470])
We're going to use the interpolated values for the rest of the example.
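One thing worth knowing about interpolate: by default it treats the rows as evenly spaced and ignores the timestamps. When samples are unevenly spaced, method='time' weights by the actual time differences instead. A sketch on three synthetic readings (times and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Unevenly spaced timestamps: 5 minutes, then 15 minutes.
idx = pd.to_datetime(['2010-09-01 10:00',
                      '2010-09-01 10:05',
                      '2010-09-01 10:20'])
s = pd.Series([100.0, np.nan, 130.0], index=idx)

by_position = s.interpolate()           # default: rows assumed evenly spaced
by_time = s.interpolate(method='time')  # weighted by elapsed time

print(by_position.iloc[1], by_time.iloc[1])
```

The default puts the missing value at the midpoint (115.0), while the time-aware version gives 107.5 because the missing sample sits a quarter of the way through the 20-minute span.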
A great thing about pandas is its integrated plotting with matplotlib.
Index by date to plot a single day, then a month's worth of data.
In [ ]:
df.loc['2010-10-04'].plot()
In [ ]:
df.loc['2010-10-04':'2010-11-04'].plot()
In [ ]:
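# Select one day and pick out the readings above 180 and below 80
day = df.loc['2010-10-04']
highs = day[day['glucose'] > 180]
lows = day[day['glucose'] < 80]
figure(figsize=(9,6))
ax = gca()
# Dashed black line for the full day, red dots for highs, blue dots for lows
day['glucose'].plot(style='k--',ax=ax)
highs['glucose'].plot(style='ro',ax=ax)
lows['glucose'].plot(style='bo',ax=ax)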
In [ ]:
df['inrange'] = (df['glucose'] < 180) & (df['glucose'] > 80)
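The same boolean-mask pattern on a handful of synthetic readings, without the plotting, to show what each mask selects (the values are illustrative; the 80 and 180 thresholds are the ones used above):

```python
import pandas as pd

# Six synthetic readings at 5-minute spacing (illustrative only).
idx = pd.date_range('2010-10-04', periods=6, freq='5min')
day = pd.DataFrame({'glucose': [75, 95, 185, 140, 200, 60]}, index=idx)

highs = day[day['glucose'] > 180]   # rows above the upper threshold
lows = day[day['glucose'] < 80]     # rows below the lower threshold
inrange = (day['glucose'] < 180) & (day['glucose'] > 80)  # boolean Series

print(highs['glucose'].tolist())
print(lows['glucose'].tolist())
print(int(inrange.sum()))
```

Because all three comparisons are strict, a reading of exactly 80 or 180 would be counted as neither high, low, nor in range.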
In [ ]:
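# Rolling sum of in-range samples over roughly one month
window = int(30.5 * 288)  # 30.5 days (average month) x 288 samples per day
inrange = df['inrange'].astype(float).rolling(window).sum()
inrange = inrange.dropna()
# Convert the count to the fraction of samples in range
inrange = inrange / float(window)
figure(figsize=(9,8))
# Plot the rolling fraction of time in range
inrange.plot()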
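To make the rolling calculation concrete, here is a sketch with eight synthetic readings and a window of 4 instead of a month (the values and window size are assumptions for illustration):

```python
import pandas as pd

# Eight synthetic glucose readings (illustrative only).
glucose = pd.Series([70, 90, 120, 200, 150, 100, 60, 110], dtype=float)
inrange = (glucose > 80) & (glucose < 180)

window = 4
# Count in-range samples in each window, then divide to get a fraction.
frac = inrange.astype(float).rolling(window).sum() / float(window)
frac = frac.dropna()  # the first window - 1 entries have no full window

print(frac.tolist())
```

The first three positions are dropped because they lack a full window, leaving five fractions between 0 and 1. Dividing a rolling sum by the window size is the same as taking a rolling mean of the boolean column.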
Computational Tools: