Diabetes: A data-intensive disease

High levels of blood glucose resulting from defects in insulin production, insulin action, or both
25.8 million Americans have diabetes
- 8.3 percent of the U.S. population
- 13.0 million men have diabetes (11.8 percent of all men ages 20 years and older).
- 12.6 million women have diabetes (10.8 percent of all women ages 20 years and older).
347 million people worldwide have diabetes

(source: NIH/WHO)

CGM EXAMPLE

Grab some data

Continuous glucose monitor (CGM) time series
8 months' worth of blood glucose samples, one roughly every 5 minutes
Data CSV columns:
- datetime: date and time of the measurement
- isig: raw sensor signal (current/voltage of the measurement)
- glucose: glucose value converted from isig
Data originates from a juvenile with type 1 diabetes
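As a rough sanity check on how many rows to expect, the arithmetic can be sketched as follows. It assumes exactly one sample every 5 minutes and an average month of 30.5 days; both are approximations, and the real file will differ because of sensor gaps.

```python
# Rough size estimate for the data set (assumptions: one sample exactly
# every 5 minutes, 30.5 days per month on average, 8 full months).
minutes_per_day = 24 * 60
samples_per_day = minutes_per_day // 5       # samples in one day
days = 8 * 30.5                              # approximate days covered
approx_rows = int(samples_per_day * days)    # approximate row count

print(samples_per_day, approx_rows)
```

The per-day sample count is reused later when a one-month rolling window is built.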



In [ ]:

    
import pandas as pd
%pylab inline



In [ ]:

    
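# urllib2 exists only on Python 2; fall back so the cell runs on either version
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

url = 'http://files.figshare.com/1113528/CGM.csv'
response = urlopen(url)

# Save the downloaded bytes to a local file
fname = 'CGM.csv'
with open(fname, 'wb') as f:
    f.write(response.read())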

Getting Help

Use the pager
- name? brings up help
- name?? tries to show the source code
- To find a name, use wildcards: *name*?
- %quickref if you get lost

Load CSV and Print first few lines



In [ ]:

    
df = pd.read_csv('CGM.csv')
df.head()

Pandas has advanced CSV loading and parsing. Instead of the vanilla read_csv call, let's add arguments that parse the datetime column into datetime objects and set that column as the index.



In [ ]:

    
df = pd.read_csv('CGM.csv',sep=',',parse_dates=[1],index_col=1)
df.head(5)
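To see what the two extra arguments change without downloading anything, here is a minimal sketch on a few made-up rows. The column layout (a row counter in column 0, the timestamp in column 1) is an assumption inferred from `parse_dates=[1]` and `index_col=1`; the values are invented.

```python
import io
import pandas as pd

# Made-up rows; the real CGM.csv is assumed to have a row counter in
# column 0 and the timestamp in column 1.
csv_text = """,datetime,isig,glucose
0,2010-03-24 10:00:00,12.1,110
1,2010-03-24 10:05:00,12.4,113
2,2010-03-24 10:10:00,12.9,118
"""

# Vanilla load: default integer index, datetime left as a plain column
plain = pd.read_csv(io.StringIO(csv_text))

# Parse column 1 as datetimes and use it as the index
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=[1], index_col=1)

print(type(plain.index).__name__)
print(type(parsed.index).__name__)
```

A datetime index is what makes the date-based slicing in the next step possible.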

Now we can index into the DataFrame by date. Print the values on March 24th, 2010 from 10:00am to 10:30am.



In [ ]:

    
df.loc['2010-03-24 10:00':'2010-03-24 10:30']
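One detail of the date slice above is worth a small illustration: label-based slices on a datetime index include both endpoints. The sketch below uses synthetic 5-minute readings (invented values, not the CGM data).

```python
import pandas as pd

# Synthetic 5-minute readings from 10:00 to 11:00 (values are invented)
idx = pd.date_range('2010-03-24 10:00', '2010-03-24 11:00', freq='5min')
demo = pd.DataFrame({'glucose': range(100, 100 + len(idx))}, index=idx)

# Label-based slice on a DatetimeIndex: both 10:00 and 10:30 are included
window = demo.loc['2010-03-24 10:00':'2010-03-24 10:30']

print(len(window))
print(window.index[0], window.index[-1])
```

This differs from positional slicing such as `demo.iloc[0:6]`, which excludes the end position.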

With new data, plotting is a good way to get a first look



In [ ]:

    
df.plot()

We've seen a number of instances of messy data and this data set is no different



In [ ]:

    
print(df.iloc[39450:39470])

`nil` is something that can't be plotted. Tell pandas that `nil` marks the missing (NaN) values in this data set; pandas understands NaN values.



In [ ]:

    
df = pd.read_csv('CGM.csv',sep=',',parse_dates=[1],index_col=1,na_values='nil')
df.plot()
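The effect of `na_values` can be checked on a tiny made-up CSV. This is only a sketch with invented rows and a simplified two-column layout, not the real file.

```python
import io
import pandas as pd

# Invented rows with one 'nil' marker in the glucose column
csv_text = """datetime,glucose
2010-09-01 10:00:00,120
2010-09-01 10:05:00,nil
2010-09-01 10:10:00,130
"""

raw = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0)
clean = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0,
                    na_values='nil')

# Without na_values the column holds strings, so it is not numeric
print(pd.api.types.is_numeric_dtype(raw['glucose']))
# With na_values the column is numeric and 'nil' has become NaN
print(clean['glucose'].dtype)
print(clean['glucose'].isnull().sum())
```

A numeric column with NaN entries is what lets the plot draw a gap instead of failing.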

You can see the gap of missing values near Sep 2010 in the plot above

What to do with missing values?



In [ ]:

    
print(df.iloc[39450:39470])

Drop Missing Values



In [ ]:

    
df_drop = df.dropna(axis=0).iloc[39450:39470]
df_drop

Note the jump in time from 10:30am to 13:56 (1:56pm) where the rows with missing values were dropped
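Dropping rows leaves holes in the time axis, and those holes can be located by looking at the spacing between consecutive timestamps. A minimal sketch on synthetic data (invented values and times):

```python
import numpy as np
import pandas as pd

# Eight synthetic 5-minute readings with three missing in the middle
idx = pd.date_range('2010-09-01 10:00', periods=8, freq='5min')
s = pd.Series([110, 112, np.nan, np.nan, np.nan, 120, 121, 119], index=idx)

kept = s.dropna()

# Time difference between each remaining timestamp and the previous one
gaps = kept.index.to_series().diff()

print(len(kept))
print(gaps.max())
```

Any spacing larger than the nominal 5 minutes marks a stretch where readings were removed.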

Fill With Limit



In [ ]:

    
df.iloc[39460:39470]



In [ ]:

    
df.ffill(limit=5).iloc[39450:39470]

Fills up to 5 consecutive NaN rows with the last value preceding them
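The limit matters when a gap is longer than the limit: only the first rows of the gap are filled and the rest stay NaN. A small sketch with invented values and a limit of 2:

```python
import numpy as np
import pandas as pd

# One valid reading, a run of four NaNs, then another valid reading
s = pd.Series([100.0, np.nan, np.nan, np.nan, np.nan, 150.0])

# Forward fill, but carry the last value across at most 2 rows
filled = s.ffill(limit=2)

print(filled.tolist())
print(filled.isnull().sum())
```

Capping the fill this way avoids stretching a single stale reading across a long sensor outage.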

Fill Values with 0



In [ ]:

    
df.fillna(0).iloc[39460:39470]

No matter how you fill the NaN values, statistical calculations will still complete successfully.



In [ ]:

    
df.describe()
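The calculations complete either way, but the numbers they produce depend on how the gaps were handled. A quick sketch with invented values shows how far the mean can move:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, np.nan, 140.0])

mean_skip = s.mean()             # NaNs are skipped by default
mean_zero = s.fillna(0).mean()   # zeros pull the mean down
mean_pad = s.ffill().mean()      # last reading carried forward

print(mean_skip, mean_zero, mean_pad)
```

Filling glucose gaps with 0 is rarely physically meaningful, so it is worth checking summary statistics after any fill.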

Interpolate Values



In [ ]:

    
df = df.apply(pd.Series.interpolate)
print(df.iloc[39460:39470])

We're going to use the interpolated values for the rest of the example
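For intuition about what interpolation does, here is a minimal sketch on invented values. It relies on the default linear method, which treats rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Two missing readings between 100 and 130
s = pd.Series([100.0, np.nan, np.nan, 130.0])

# Default method is linear: the gap is filled along a straight line
filled = s.interpolate()

print(filled.tolist())
```

Because rows are treated as equally spaced, a long outage becomes a straight ramp between the readings on either side.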

A great thing about pandas is the integrated plotting with matplotlib

Index by date: plot a single day, then a month's worth of data



In [ ]:

    
df.loc['2010-10-04'].plot()



In [ ]:

    
df.loc['2010-10-04':'2010-11-04'].plot()

Visualize the highs and lows of a day. The healthy range for this patient is a glucose level greater than 80 and lower than 180.

We can use the same boolean masking we learned about in the NumPy section



In [ ]:

    
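# Select one day and mask the readings outside the 80-180 range
day = df.loc['2010-10-04']
highs = day[day['glucose'] > 180]
lows = day[day['glucose'] < 80]

figure(figsize=(9,6))
ax = gca()

# Dashed black line for all readings, red dots for highs, blue dots for lows
day['glucose'].plot(style='k--',ax=ax)
highs['glucose'].plot(style='ro',ax=ax)
lows['glucose'].plot(style='bo',ax=ax)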
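The masking step can be inspected on its own, without the plot. The sketch below uses six invented readings. A comparison produces a boolean Series, and indexing with it keeps only the rows where it is True.

```python
import pandas as pd

# Six invented readings at 5-minute spacing
idx = pd.date_range('2010-10-04 08:00', periods=6, freq='5min')
day = pd.DataFrame({'glucose': [70, 95, 150, 185, 210, 79]}, index=idx)

# Boolean mask: True where the reading is above the upper threshold
mask_high = day['glucose'] > 180

highs = day[mask_high]
lows = day[day['glucose'] < 80]

print(mask_high.tolist())
print(len(highs), len(lows))
```

The filtered frames keep their original timestamps, which is why they line up with the full series when drawn on the same axes.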

Generate a rolling statistic of how often the patient is in range



In [ ]:

    
df['inrange'] = (df['glucose'] < 180) & (df['glucose'] > 80)



In [ ]:

    
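# Rolling fraction of samples in range over a one-month window
window = int(30.5*288)  # 288 five-minute samples per day, 30.5 days per month on average
# pd.rolling_sum was removed from pandas; .rolling(window).sum() replaces it
inrange = df['inrange'].astype(float).rolling(window).sum()
inrange = inrange.dropna()
inrange = inrange/float(window)

figure(figsize=(9,8))
# Plot the fraction of samples in range
inrange.plot()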
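The same calculation is easier to follow with a tiny window. The sketch below uses six invented readings and a window of 3 samples instead of a month, with the `.rolling()` spelling of the rolling sum found in current pandas.

```python
import pandas as pd

# Six invented readings at 5-minute spacing
idx = pd.date_range('2010-10-04 08:00', periods=6, freq='5min')
glucose = pd.Series([90, 200, 100, 110, 120, 70], index=idx)

# 1.0 where the reading is inside the 80-180 range, 0.0 otherwise
inrange = ((glucose < 180) & (glucose > 80)).astype(float)

# Fraction of the last `window` samples that were in range
window = 3
frac = (inrange.rolling(window).sum() / window).dropna()

print(frac.tolist())
```

The first `window - 1` rows have no complete window, which is why the `dropna()` call is needed before plotting.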

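Computational Tools:

In current pandas the same operations are reached through .rolling(), for example series.rolling(window).sum() in place of rolling_sum.

- rolling_count: Number of non-null observations
- rolling_sum: Sum of values
- rolling_mean: Mean of values
- rolling_median: Arithmetic median of values
- rolling_window: Moving window function
...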
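As a brief illustration of two of these statistics, the sketch below computes a rolling mean and a rolling median on invented values, using the `.rolling()` method form found in current pandas.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 6.0, 4.0, 5.0])

# One rolling object can be reused for several statistics
r = s.rolling(3)

roll_mean = r.mean().dropna()
roll_median = r.median().dropna()

print(roll_mean.tolist())
print(roll_median.tolist())
```

The median is less sensitive than the mean to a single outlying reading inside the window, as the first window (1, 2, 6) shows.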