Diabetes: A data-intensive disease

  • High levels of blood glucose resulting from errors in insulin production
  • 25.8 million Americans have diabetes
    • 8.3 percent of the U.S. population
    • 13.0 million men have diabetes (11.8 percent of all men ages 20 years and older).
    • 12.6 million women have diabetes (10.8 percent of all women ages 20 years and older).
  • 347 million people worldwide have diabetes

(source: NIH/WHO)

CGM EXAMPLE

Grab some data

  • Continuous glucose monitor timeseries
  • 8 months' worth of blood glucose samples at ~5 min intervals
  • data csv:
    • datetime of measurement
    • isig (current/voltage of measurement)
    • glucose: converted isig value
  • data originates from a juvenile type I diabetic

In [ ]:
import pandas as pd
%pylab inline

In [ ]:
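try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

url = 'http://files.figshare.com/1113528/CGM.csv'
response = urlopen(url)

# Save the download next to the notebook so later cells can read it.
fname = 'CGM.csv'
with open(fname, 'wb') as f:
    f.write(response.read())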

Getting Help

  • Use the pager
    • name? brings up help
    • name?? tries to show source code
    • To search for a name, use wildcards: *name*?
    • %quickref if you get lost

Load CSV and Print first few lines


In [ ]:
df = pd.read_csv('CGM.csv')
df.head()

Pandas has advanced CSV loading and parsing. Instead of the vanilla read_csv call, let's add arguments that parse the datetime column into datetime objects and set that column as the index.


In [ ]:
df = pd.read_csv('CGM.csv',sep=',',parse_dates=[1],index_col=1)
df.head(5)
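To see what parse_dates and index_col change without downloading anything, here is a minimal sketch on a made-up three-row CSV. The column layout (a row counter first, then datetime, isig, glucose) and the values are assumptions for illustration, not the real file's contents.

```python
import io
import pandas as pd

# Hypothetical miniature of the CGM file; the column layout is assumed.
csv_text = """row,datetime,isig,glucose
0,2010-03-24 10:00:00,12.1,110
1,2010-03-24 10:05:00,12.4,113
2,2010-03-24 10:10:00,12.9,118
"""

# Vanilla load: the datetime column stays text and the index is 0, 1, 2.
plain = pd.read_csv(io.StringIO(csv_text))

# Parse column 1 as datetimes and use it as the index.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=[1], index_col=1)

print(type(plain.index))
print(type(parsed.index))
print(parsed.loc['2010-03-24 10:05:00', 'glucose'])
```

With the datetime column as the index, rows can be looked up by timestamp instead of by position.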

Now we can index into the DataFrame by date. Print the values on March 24th, 2010 from 10:00 am to 10:30 am.


In [ ]:
df.loc['2010-03-24 10:00':'2010-03-24 10:30']
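A small sketch of the same kind of date slice on synthetic data may help show one detail: slicing a DatetimeIndex by label includes both endpoints. The timestamps and glucose values below are invented for the example, and it uses .loc, the label-based indexer.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute samples from 09:50 to 10:45 (values are made up).
idx = pd.date_range('2010-03-24 09:50', periods=12, freq='5min')
demo = pd.DataFrame({'glucose': np.arange(100, 112)}, index=idx)

# Label-based slice: both 10:00 and 10:30 are included.
half_hour = demo.loc['2010-03-24 10:00':'2010-03-24 10:30']
print(half_hour)
print(len(half_hour))  # 7 rows: 10:00, 10:05, ..., 10:30
```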

With new data, plotting is a good way to take a first look


In [ ]:
df.plot()

We've seen a number of instances of messy data, and this data set is no different


In [ ]:
print(df.iloc[39450:39470])

nil is something that can't be plotted. Tell pandas that nil marks the NaN (missing) values in this data set; pandas understands NaN values.


In [ ]:
df = pd.read_csv('CGM.csv',sep=',',parse_dates=[1],index_col=1,na_values='nil')
df.plot()

You can see the gap of missing values near Sep 2010 in the plot above
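The effect of na_values can be checked on a tiny made-up CSV: without it, a column containing nil is read as text; with it, the column becomes numeric with NaN in the gaps. The rows below are invented and only mimic the structure described earlier.

```python
import io
import pandas as pd

# Made-up sample with one 'nil' row, mimicking the messy section of the file.
csv_text = """datetime,isig,glucose
2010-09-01 10:25:00,11.8,105
2010-09-01 10:30:00,nil,nil
2010-09-01 10:35:00,12.2,109
"""

raw = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0)
clean = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0,
                    na_values='nil')

print(raw['glucose'].dtype)           # non-numeric because of the 'nil' text
print(clean['glucose'].dtype)         # float64
print(clean['glucose'].isna().sum())  # 1 missing value
```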

What to do with missing values?


In [ ]:
print(df.iloc[39450:39470])

Drop Missing Values


In [ ]:
df_drop = df.dropna(axis=0).iloc[39450:39470]
df_drop

After dropping, the timestamps jump from 10:30 am to 13:56 (1:56 pm)
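Dropping rows leaves holes in the time axis. One way to find such jumps, sketched here on invented data, is to difference the index after dropna and look for steps larger than the sampling interval.

```python
import numpy as np
import pandas as pd

# Six synthetic 5-minute samples with a run of three missing values.
idx = pd.date_range('2010-09-01 10:20', periods=6, freq='5min')
demo = pd.DataFrame(
    {'glucose': [101.0, 104.0, np.nan, np.nan, np.nan, 112.0]}, index=idx)

dropped = demo.dropna(axis=0)

# Time between consecutive surviving rows; the first entry is NaT.
gaps = dropped.index.to_series().diff()
print(gaps)
print(gaps.max())  # the 20-minute jump left by the dropped rows
```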

Fill With Limit


In [ ]:
df.iloc[39460:39470]

In [ ]:
df.ffill(limit=5).iloc[39450:39470]  # forward fill, same as method='pad'

Fills up to 5 consecutive NaN rows with the last valid value preceding them
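A short sketch on a made-up series shows what the limit does when the gap is longer than 5 samples: the first five NaNs are filled and the rest stay missing. It uses ffill, which performs the same forward fill as method='pad'.

```python
import numpy as np
import pandas as pd

# One valid value, a gap of seven NaNs, then another valid value.
s = pd.Series([100.0] + [np.nan] * 7 + [120.0])

filled = s.ffill(limit=5)
print(filled)
print(filled.isna().sum())  # 2 values remain missing
```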

Fill Values with 0


In [ ]:
df.fillna(0).iloc[39460:39470]

No matter how you fill the NaN values, or whether you fill them at all, statistical calculations will still complete successfully; pandas skips NaN values by default.


In [ ]:
df.describe()
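The calculations complete either way, but the numbers they return depend on the fill strategy. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, np.nan, 140.0])

print(s.mean())            # 120.0, NaNs are skipped
print(s.fillna(0).mean())  # 60.0, zeros drag the mean down
print(s.ffill().mean())    # 110.0, the last valid value is repeated
```

Filling a glucose trace with 0 is therefore convenient for plotting but misleading for summary statistics.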

Interpolate Values


In [ ]:
df = df.apply(pd.Series.interpolate)
print(df.iloc[39460:39470])

We're going to use the interpolated values for the rest of the example
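As a quick check of what interpolation does, here is a sketch on a made-up frame. By default, interpolate draws a straight line between the valid neighbours, treating rows as equally spaced; applying pd.Series.interpolate column by column gives the same result as calling interpolate on the frame.

```python
import numpy as np
import pandas as pd

# Invented values with a two-sample gap in each column.
demo = pd.DataFrame({
    'isig': [10.0, np.nan, np.nan, 13.0],
    'glucose': [100.0, np.nan, np.nan, 130.0],
})

by_column = demo.apply(pd.Series.interpolate)
whole_frame = demo.interpolate()

print(by_column)
print(by_column.equals(whole_frame))
```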

A great thing about pandas is the integrated plotting with matplotlib

Index by date: plot a single day, then a month's worth of data


In [ ]:
df.loc['2010-10-04'].plot()

In [ ]:
df.loc['2010-10-04':'2010-11-04'].plot()

Visualize the highs and lows of a day. The healthy range for this patient is a glucose level greater than 80 and lower than 180.

We can use the same boolean masking we learned about in the NumPy section


In [ ]:
day = df.loc['2010-10-04']
highs = day[day['glucose'] > 180]
lows = day[day['glucose'] < 80]

figure(figsize=(9,6))
ax = gca()

day['glucose'].plot(style='k--',ax=ax)
highs['glucose'].plot(style='ro',ax=ax)
lows['glucose'].plot(style='bo',ax=ax)
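The masking step can be tried without the plot. In this sketch the readings are invented, and the 80 and 180 thresholds are the ones stated above; each comparison yields a boolean Series that selects the matching rows.

```python
import pandas as pd

# Six made-up readings for one day.
idx = pd.date_range('2010-10-04 08:00', periods=6, freq='5min')
day = pd.DataFrame(
    {'glucose': [70.0, 95.0, 190.0, 150.0, 210.0, 78.0]}, index=idx)

high_mask = day['glucose'] > 180  # boolean Series, True where too high
low_mask = day['glucose'] < 80    # boolean Series, True where too low

highs = day[high_mask]
lows = day[low_mask]
print(highs)
print(lows)
```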

Generate a rolling statistic of how often the patient is in range


In [ ]:
df['inrange'] = (df['glucose'] < 180) & (df['glucose'] > 80)

In [ ]:
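# Rolling fraction of samples in range over roughly one month
window = int(30.5 * 288)  # 288 five-minute samples per day, ~30.5 days per month
# Older pandas spelled this pd.rolling_sum(df.inrange, window)
inrange = df['inrange'].astype(float).rolling(window).sum()
inrange = inrange.dropna()
inrange = inrange / float(window)

figure(figsize=(9,8))
# plot
inrange.plot()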
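The same computation on a handful of invented samples makes the arithmetic easy to follow. This is a sketch using a 4-sample window instead of a month, and it assumes a pandas version that provides the .rolling() method.

```python
import numpy as np
import pandas as pd

samples_per_day = 24 * 60 // 5  # 288 samples per day at 5-minute spacing
print(samples_per_day)

# Eight made-up glucose readings.
idx = pd.date_range('2010-10-04', periods=8, freq='5min')
glucose = pd.Series([70, 90, 100, 200, 150, 170, 60, 120],
                    index=idx, dtype=float)

in_range = (glucose < 180) & (glucose > 80)

# Count in-range samples per window, then divide by the window length.
window = 4
frac = in_range.astype(float).rolling(window).sum().dropna() / window
print(frac)  # 0.5, 0.75, 0.75, 0.5, 0.75
```

The first window - 1 positions have no complete window, which is why dropna shortens the result.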

Computational Tools:

  • rolling_count Number of non-null observations
  • rolling_sum Sum of values
  • rolling_mean Mean of values
  • rolling_median Arithmetic median of values
  • rolling_window Moving window function
  • ...
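In pandas versions where the module-level rolling_* functions are no longer available, the same statistics are methods on the object returned by .rolling(). A minimal sketch on invented values, with min_periods=1 chosen here so that partial windows and NaNs still produce a result:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
r = s.rolling(3, min_periods=1)

counts = r.count()    # number of non-null observations per window
sums = r.sum()        # sum of values
means = r.mean()      # mean of values
medians = r.median()  # median of values

print(pd.DataFrame({'count': counts, 'sum': sums,
                    'mean': means, 'median': medians}))
```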