Agile Data Wrangling with Python

This talk is a quick overview of some Python tools for loading, cleaning, and combining data from a set of spreadsheets. In addition, I'll show some examples of simple visualization of the data and an example of a simple data model.

Example: Demographic Data

First I need to load the libraries I will be using:

  • %matplotlib inline : renders plots inline in the notebook
  • pandas : data-wrangling library
  • os : for working with the operating system
  • matplotlib.pyplot : more plotting tools

In [1]:
%matplotlib inline

In [2]:
import os
import numpy as np
import pandas
from matplotlib import pyplot 

# local reference to a data directory
datadir = 'data'

I will use pandas to access my Excel file of demographic data from the data directory.

Note that Excel files can have multiple sheets, so I first query the available sheet names and then load the sheet I want to work with, in this case sheet1.


In [3]:
demographics_file = os.path.join(datadir, 'demographics.xls')
tmpdf = pandas.ExcelFile(demographics_file)
tmpdf.sheet_names


Out[3]:
[u'sheet1']

In [4]:
## create DataFrame called demog by parsing ExcelFile
## Note I use the first column as my default index, as 
## this has unique IDs for each person
demog = tmpdf.parse('sheet1', index_col = 0)

In [5]:
print 'Column Names', demog.columns
print 'Index Values', demog.index[:5]


Column Names Index([u'BAC#', u'Birthday GRAB', u'Gender GRAB', u'Handedness GRAB', u'Years Ed GRAB', u'Number of Sessions', u'Subject Flagged?', u'TstScrs Session 1::Neuropsych Exam Test Date', u'TstScrs Session 1::Age at Session', u'TstScrs Session 1::LIFESTYLE FACTORS', u'TstScrs Session 1::GDS', u'TstScrs Session 1::MMSE', u'TstScrs Session 1::Cog Act Interview Total', u'TstScrs Session 1::CAI Age 6', u'TstScrs Session 1::CAI Age 12', u'TstScrs Session 1::CAI Age 18', u'TstScrs Session 1::CAI Age 40', u'TstScrs Session 1::CAI Past', u'TstScrs Session 1::CAI Now', u'TstScrs Session 1::Cog Act Adulthood', u'TstScrs Session 1::Cog Act Adulthood: Reading', u'TstScrs Session 1::FINGER TAPPING', u'TstScrs Session 1::taps per sec R', u'TstScrs Session 1::taps per sec L', u'TstScrs Session 1::taps per sec BR'], dtype=object)
Index Values Index([u'B06-234', u'B06-210', u'B07-237', u'B07-229', u'B06-235'], dtype=object)

Columns

We can access the column names of our dataframe.

The column names right now are very clunky, so first I will clean them up and write the result to a new Excel file.


In [6]:
#look at clunky columns
print demog.columns


Index([u'BAC#', u'Birthday GRAB', u'Gender GRAB', u'Handedness GRAB', u'Years Ed GRAB', u'Number of Sessions', u'Subject Flagged?', u'TstScrs Session 1::Neuropsych Exam Test Date', u'TstScrs Session 1::Age at Session', u'TstScrs Session 1::LIFESTYLE FACTORS', u'TstScrs Session 1::GDS', u'TstScrs Session 1::MMSE', u'TstScrs Session 1::Cog Act Interview Total', u'TstScrs Session 1::CAI Age 6', u'TstScrs Session 1::CAI Age 12', u'TstScrs Session 1::CAI Age 18', u'TstScrs Session 1::CAI Age 40', u'TstScrs Session 1::CAI Past', u'TstScrs Session 1::CAI Now', u'TstScrs Session 1::Cog Act Adulthood', u'TstScrs Session 1::Cog Act Adulthood: Reading', u'TstScrs Session 1::FINGER TAPPING', u'TstScrs Session 1::taps per sec R', u'TstScrs Session 1::taps per sec L', u'TstScrs Session 1::taps per sec BR'], dtype=object)

In [7]:
## clean up column names
cols = demog.columns
newcols = [x.replace(' ','').replace('#','').replace('GRAB','') \
           for x in cols]
newcols = [x.replace('TstScrsSession1::','') for x in newcols]

# Use cleaned headers with DataFrame
demog.columns = newcols
print demog.columns


Index([u'BAC', u'Birthday', u'Gender', u'Handedness', u'YearsEd', u'NumberofSessions', u'SubjectFlagged?', u'NeuropsychExamTestDate', u'AgeatSession', u'LIFESTYLEFACTORS', u'GDS', u'MMSE', u'CogActInterviewTotal', u'CAIAge6', u'CAIAge12', u'CAIAge18', u'CAIAge40', u'CAIPast', u'CAINow', u'CogActAdulthood', u'CogActAdulthood:Reading', u'FINGERTAPPING', u'tapspersecR', u'tapspersecL', u'tapspersecBR'], dtype=object)
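In more recent pandas versions the same cleanup can be written with the vectorized `.str` methods on the column Index instead of a list comprehension. A minimal sketch on a toy frame (the column names here are made up to mimic the real ones, and this assumes a modern pandas with the `regex` keyword):

```python
import pandas

# toy frame with clunky headers mimicking the real file
df = pandas.DataFrame({'BAC#': [1], 'Gender GRAB': ['F'],
                       'TstScrs Session 1::MMSE': [29]})

# chain literal-string replacements over the column Index
df.columns = (df.columns.str.replace('TstScrs Session 1::', '', regex=False)
                        .str.replace(' ', '', regex=False)
                        .str.replace('#', '', regex=False)
                        .str.replace('GRAB', '', regex=False))
print(list(df.columns))  # ['BAC', 'Gender', 'MMSE']
```

Note the prefix is stripped before the spaces are removed, since the prefix itself contains a space.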

In [8]:
demog[demog.columns[3]]


Out[8]:
LBNL_ID
B06-234                        R
B06-210                        R
B07-237                        R
B07-229                        R
B06-235                        R
B93-222                        R
B07-277                 R; son L
B08-206                        R
B08-207                        R
B08-224    R; father left handed
B08-218      R; brother L handed
B08-214                        R
B08-220                        R
B08-221                        R
B08-223                        R
...
B13-267                              R
B13-313                              R
B13-284                              R
B13-293                              R
B13-312                              R
B13-289                              R
B13-302                              R
B12-335                              R
B13-303                              R
B13-304                   A; Brother L
B13-247                              R
B13-242    L; brother + both parents L
B13-243                              R
B13-237                              R
B13-296                    L; Mother L
Name: Handedness, Length: 106, dtype: object

In [9]:
# by cleaning up column names I can now easily access fields 
# this mapping does not work if fields contain special characters or math operators
print demog.BAC[:10]


LBNL_ID
B06-234    BAC027
B06-210    BAC082
B07-237    BAC098
B07-229    BAC109
B06-235    BAC113
B93-222    BAC191
B07-277    BAC204
B08-206    BAC233
B08-207    BAC235
B08-224    BAC240
Name: BAC, dtype: object
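Bracket indexing, on the other hand, works for any column name, including the original clunky ones. A toy illustration (the values are made up; `Subject Flagged?` is one of the original headers that attribute access cannot reach):

```python
import pandas

df = pandas.DataFrame({'MMSE': [29, 30],
                       'Subject Flagged?': [False, True]})

print(df.MMSE[0])                 # attribute access works for a plain name
print(df['Subject Flagged?'][1])  # bracket access handles spaces and '?'
```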

In [10]:
# write data with cleaned headers to new excel file
demog.to_excel('data/newdemog.xls')

!ls data


005.png
comparing_apriori_vsother modularity.png
demographics.xls
newdemog.xls
subject_biomarker.csv

Cleaning Empty Data Fields

describe

Pandas provides us with a simple tool to have a quick look at our data.

Each numeric column is listed with its count of valid (non-null) entries.

For instance we can see that we have

106 valid values for YearsEd

but we don't seem to have any data for

LIFESTYLEFACTORS


In [11]:
demog.describe()


Out[11]:
YearsEd NumberofSessions AgeatSession LIFESTYLEFACTORS GDS MMSE CogActInterviewTotal CAIAge6 CAIAge12 CAIAge18 CAIAge40 CAIPast CAINow CogActAdulthood CogActAdulthood:Reading FINGERTAPPING tapspersecR tapspersecL tapspersecBR
count 106.000000 106.00000 106.000000 0 106.000000 106.000000 106.000000 99.000000 99.000000 99.000000 84.000000 99.000000 99.000000 106.000000 106.000000 0 106.000000 106.000000 106.000000
mean 16.490566 2.09434 67.622642 NaN 4.245283 28.839623 3.387453 3.342222 3.162158 3.344848 3.609524 3.328990 3.715152 20.679245 13.820755 NaN 4.655525 5.155675 4.785971
std 2.005921 1.55856 19.984019 NaN 3.593145 1.243166 0.541550 1.079244 0.725790 0.705894 0.547419 0.594666 0.647186 4.958009 3.348908 NaN 0.781013 0.740734 0.798957
min 12.000000 1.00000 21.000000 NaN 0.000000 25.000000 2.000000 1.000000 1.330000 1.000000 2.000000 1.750000 2.200000 6.000000 4.000000 NaN 2.545031 3.315833 2.233950
25% 16.000000 1.00000 70.250000 NaN 2.000000 28.000000 3.120000 2.670000 2.670000 2.915000 3.400000 2.940000 3.300000 18.000000 12.000000 NaN 4.082735 4.689999 4.177774
50% 16.000000 1.00000 73.500000 NaN 4.000000 29.000000 3.440000 3.330000 3.330000 3.500000 3.600000 3.400000 3.800000 21.000000 14.000000 NaN 4.569165 5.194224 4.798236
75% 18.000000 3.00000 78.000000 NaN 6.000000 30.000000 3.760000 4.330000 3.670000 3.830000 4.000000 3.775000 4.200000 24.000000 16.000000 NaN 5.243333 5.573045 5.359960
max 20.000000 7.00000 96.000000 NaN 18.000000 30.000000 4.600000 6.000000 4.500000 4.830000 4.800000 4.550000 5.000000 32.000000 21.000000 NaN 6.750871 6.946623 6.996163
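describe only summarizes numeric columns; a direct per-column count of missing entries can be had with isnull().sum(). A small sketch on a toy frame (made-up values):

```python
import numpy as np
import pandas

df = pandas.DataFrame({'MMSE': [29, 30, np.nan],
                       'LIFESTYLEFACTORS': [np.nan, np.nan, np.nan]})

# number of missing entries per column
print(df.isnull().sum())

# columns that contain at least one real value
print(list(df.columns[df.notnull().any()]))  # ['MMSE']
```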

So let's see the structure of this missing data.

I will make a mask marking where data is present and visualize the table as a heat map.


In [12]:
## set default colormap to gray
pyplot.gray()
# create binary mask where data is NOT null
mask = ~pandas.isnull(demog.values)
# set interpolation to nearest (nearest neighbor) to 
# avoid fuzzy image
pyplot.imshow(mask, aspect='auto', interpolation='nearest')
pyplot.ylabel('subjects')
pyplot.xlabel('demographic scores')
pyplot.colorbar(ticks=[0,1])


Out[12]:
<matplotlib.colorbar.Colorbar instance at 0x1100ac3f8>

In [13]:
demog.mean()


Out[13]:
YearsEd                    16.490566
NumberofSessions            2.094340
AgeatSession               67.622642
LIFESTYLEFACTORS                 NaN
GDS                         4.245283
MMSE                       28.839623
CogActInterviewTotal        3.387453
CAIAge6                     3.342222
CAIAge12                    3.162158
CAIAge18                    3.344848
CAIAge40                    3.609524
CAIPast                     3.328990
CAINow                      3.715152
CogActAdulthood            20.679245
CogActAdulthood:Reading    13.820755
FINGERTAPPING                    NaN
tapspersecR                 4.655525
tapspersecL                 5.155675
tapspersecBR                4.785971
dtype: float64

The pattern suggests:

  • most people have most of the tests
  • a couple of tests have data for only a few people

What we want to do:

  • drop columns with many NaNs

Now we just need to drop the tests with very few samples.

This is easy since we can use pandas' built-in method dropna and specify axis=1 to operate across columns.

This will drop any column that contains a NaN.
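dropna is more flexible than the all-or-nothing behaviour used here; the how and thresh options control how aggressive the drop is. A toy sketch (made-up values):

```python
import numpy as np
import pandas

df = pandas.DataFrame({'a': [1.0, 2.0, 3.0],
                       'b': [1.0, np.nan, 3.0],
                       'c': [np.nan, np.nan, np.nan]})

# any NaN drops the column
print(df.dropna(axis=1).columns.tolist())             # ['a']
# drop only columns that are entirely NaN
print(df.dropna(axis=1, how='all').columns.tolist())  # ['a', 'b']
# keep columns with at least 2 real values
print(df.dropna(axis=1, thresh=2).columns.tolist())   # ['a', 'b']
```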


In [14]:
## To drop the tests that are missing for most subjects is easy, 
## I use the built in dropna

final_demog = demog.dropna(axis=1)

In [15]:
# set interpolation to nearest (nearest neighbor) to 
# avoid fuzzy image
pyplot.imshow(pandas.isnull(final_demog), 
              aspect='auto', 
              interpolation='nearest')
pyplot.ylabel('subjects')
pyplot.xlabel('demographic scores')
pyplot.colorbar(ticks=[0,1])


Out[15]:
<matplotlib.colorbar.Colorbar instance at 0x111cb0fc8>

We have REAL data for all our data points!


In [16]:
print final_demog.shape

print final_demog.mean()


(106, 16)
YearsEd                    16.490566
NumberofSessions            2.094340
AgeatSession               67.622642
GDS                         4.245283
MMSE                       28.839623
CogActInterviewTotal        3.387453
CogActAdulthood            20.679245
CogActAdulthood:Reading    13.820755
tapspersecR                 4.655525
tapspersecL                 5.155675
tapspersecBR                4.785971
dtype: float64

In [17]:
print 'Years Education\n', final_demog['YearsEd'].describe()


Years Education
count    106.000000
mean      16.490566
std        2.005921
min       12.000000
25%       16.000000
50%       16.000000
75%       18.000000
max       20.000000
dtype: float64

Now I want to take this data and, for the subjects I have useful demographics for, grab their biomarker data.


In [18]:
## Use pandas to load the biomarker csv data
biomarker = pandas.read_csv('data/subject_biomarker.csv', sep=None)


Using Python parser to sniff delimiter

In [19]:
biomarker.describe()


Out[19]:
ALL Ant-cingulate Inferior-parietal Isthmus-cingulate Post-cingulate Precuneus lateralfrontal medialfrontal temporal
count 83.000000 85.000000 86.000000 85.000000 86.000000 85.000000 86.000000 85.000000 86.000000
mean 1.099934 1.186479 1.151134 1.179585 1.233277 1.167649 1.064884 1.069149 1.016278
std 0.184194 0.235193 0.188548 0.171204 0.210504 0.228273 0.189048 0.181282 0.153723
min 0.831769 0.863666 0.893938 0.949237 0.944589 0.921756 0.793289 0.806475 0.741812
25% 1.005110 1.076069 1.060854 1.082264 1.117816 1.051812 0.970168 0.979989 0.944867
50% 1.042127 1.132960 1.099851 1.130193 1.188095 1.092415 1.004200 1.017698 0.974516
75% 1.086275 1.182174 1.156271 1.216269 1.232962 1.170684 1.077915 1.060022 1.018200
max 1.748783 2.119020 1.773572 1.758167 2.092553 1.988178 1.859275 1.834954 1.525314

In [20]:
biomarker.columns


Out[20]:
Index([u'SUBID', u'ALL', u'Ant-cingulate', u'Inferior-parietal', u'Isthmus-cingulate', u'Post-cingulate', u'Precuneus', u'lateralfrontal', u'medialfrontal', u'temporal'], dtype=object)

In [21]:
biomarker.index


Out[21]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85], dtype=int64)

In [22]:
biomarker.index = biomarker.SUBID

In [23]:
biomarker.index


Out[23]:
Index([u'B05-201', u'B05-202', u'B05-215', u'B05-216', u'B05-217', u'B05-229', u'B05-231', u'B05-232', u'B06-202', u'B06-203', u'B06-210', u'B06-212', u'B06-213', u'B06-217', u'B06-219', u'B06-234', u'B06-235', u'B07-227', u'B07-228', u'B07-230', u'B07-235', u'B07-236', u'B07-237', u'B07-238', u'B07-243', u'B07-267', u'B07-271', u'B07-276', u'B07-277', u'B08-206', u'B08-207', u'B08-218', u'B08-220', u'B08-221', u'B08-223', u'B08-224', u'B08-231', u'B08-247', u'B08-251', u'B09-201', u'B09-210', u'B09-211', u'B09-215', u'B09-216', u'B09-218', u'B09-225', u'B09-228', u'B09-229', u'B09-230', u'B09-260', u'B09-277', u'B09-290', u'B10-216', u'B10-224', u'B10-226', u'B10-232', u'B10-257', u'B10-261', u'B10-264', u'B10-270', u'B10-282', u'B10-285', u'B10-294', u'B10-295', u'B10-298', u'B10-299', u'B10-306', u'B10-309', u'B10-311', u'B10-313', u'B10-314', u'B11-204', u'B11-208', u'B11-210', u'B11-211', u'B11-213', u'B11-216', u'B11-219', u'B11-224', u'B11-232', u'B11-236', u'B11-243', u'B11-253', u'B11-255', u'B11-256', u'B11-257'], dtype=object)

Combining DataFrames

Currently I have two data frames,

  • final_demog
  • biomarker

final_demog uses unique subject IDs for its index, while biomarker uses integers.

I could go back and use the index_col = 0 flag when I parse the csv file.

But I can also use the SUBID column and simply reset the index to the SUBID values (which are unique).

This will allow me to join the two dataframes database-style.


In [24]:
## reindex based on subject ID 
#biomarker.index = biomarker['SUBID']

## Join dataframes
join_data = final_demog.join(biomarker)

## or Concatenate
concat_data = pandas.concat((final_demog, biomarker))
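The difference between the two shows up clearly on a toy pair of frames (the IDs and values here are made up):

```python
import pandas

left = pandas.DataFrame({'age': [70, 75]}, index=['s1', 's2'])
right = pandas.DataFrame({'score': [1.1]}, index=['s1'])

# join aligns on the index: s2 gets NaN for score, shape stays (2, 2)
print(left.join(right))

# default concat stacks rows: shape (3, 2), with NaN filling the gaps
print(pandas.concat((left, right)).shape)
```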

Concat

Let's first look at concatenate.

If we take a closer look, concat just added more rows; it didn't merge the data together.


In [25]:
print concat_data.shape
pyplot.imshow(pandas.isnull(concat_data), interpolation='nearest',
              aspect='auto')


(192, 26)
Out[25]:
<matplotlib.image.AxesImage at 0x11212ed10>

In [26]:
## we can fix this by forcing concat to concatenate on a 
## specific axis (in this case axis = 1)
## 
concat_data2 = pandas.concat((final_demog, biomarker), axis=1)

In [27]:
print concat_data2.shape
pyplot.imshow(pandas.isnull(concat_data2), interpolation='nearest',
              aspect='auto')


(150, 26)
Out[27]:
<matplotlib.image.AxesImage at 0x1121fbe50>

In [28]:
combined_data = concat_data2.dropna(axis=0)

In [29]:
pyplot.imshow(pandas.isnull(combined_data))
print combined_data.SUBID[:10]


B06-210    B06-210
B06-234    B06-234
B06-235    B06-235
B07-237    B07-237
B07-277    B07-277
B08-206    B08-206
B08-207    B08-207
B08-218    B08-218
B08-220    B08-220
B08-221    B08-221
Name: SUBID, dtype: object

Join

Now let's see how a join works.


In [30]:
final_demog.join?

In [31]:
## Quick check that my data is aligned
## Note non-overlapping data
print join_data[2:6]['SUBID']


LBNL_ID
B07-237    B07-237
B07-229        NaN
B06-235    B06-235
B93-222        NaN
Name: SUBID, dtype: object

In [32]:
## by looking I can see not all my subjects with demographics 
#  have biomarker data
join_data


Out[32]:
<class 'pandas.core.frame.DataFrame'>
Index: 106 entries, B06-234 to B13-296
Data columns (total 26 columns):
BAC                        106  non-null values
Birthday                   106  non-null values
Gender                     106  non-null values
Handedness                 106  non-null values
YearsEd                    106  non-null values
NumberofSessions           106  non-null values
NeuropsychExamTestDate     106  non-null values
AgeatSession               106  non-null values
GDS                        106  non-null values
MMSE                       106  non-null values
CogActInterviewTotal       106  non-null values
CogActAdulthood            106  non-null values
CogActAdulthood:Reading    106  non-null values
tapspersecR                106  non-null values
tapspersecL                106  non-null values
tapspersecBR               106  non-null values
SUBID                      42  non-null values
ALL                        42  non-null values
Ant-cingulate              42  non-null values
Inferior-parietal          42  non-null values
Isthmus-cingulate          42  non-null values
Post-cingulate             42  non-null values
Precuneus                  42  non-null values
lateralfrontal             42  non-null values
medialfrontal              42  non-null values
temporal                   42  non-null values
dtypes: datetime64[ns](2), float64(20), object(4)

In [33]:
## quick look at image, I can see a number of people are missing
## biomarker data
mask = ~pandas.isnull(join_data)
pyplot.imshow(mask, interpolation = 'nearest', aspect = 'auto')


Out[33]:
<matplotlib.image.AxesImage at 0x1122057d0>

I can use an inner join, which will keep only the rows whose index appears in both frames, leaving me with only complete data.
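A toy illustration of how='inner' (the IDs and values are made up):

```python
import pandas

demo = pandas.DataFrame({'age': [70, 75, 80]}, index=['s1', 's2', 's3'])
bio = pandas.DataFrame({'score': [1.1, 1.2]}, index=['s2', 's3'])

# only s2 and s3 appear in both, so only they survive, with no NaNs
inner = demo.join(bio, how='inner')
print(inner)
```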


In [34]:
join_inner = final_demog.join(biomarker, how='inner')

In [35]:
print join_inner.shape
pyplot.imshow(pandas.isnull(join_inner))


(42, 26)
Out[35]:
<matplotlib.image.AxesImage at 0x11248bb50>

In [36]:
final_data = join_inner

In [37]:
final_data['random'] = np.random.random(42)

In [38]:
final_data['silly_thing']= 1

Plotting

I can also do a quick summary plot (boxplot) of a subset of the columns.


In [39]:
print final_data.columns[11:14]
subsample = final_data.columns[11:14]
jnk = final_data[subsample].boxplot()


Index([u'CogActAdulthood', u'CogActAdulthood:Reading', u'tapspersecR'], dtype=object)

In [40]:
subsample


Out[40]:
Index([u'CogActAdulthood', u'CogActAdulthood:Reading', u'tapspersecR'], dtype=object)

Fixing Fields

Sometimes people make errors during data entry, so let's go in and fix these.


In [41]:
## data entry is sloppy for gender 
final_data.Gender


Out[41]:
LBNL_ID
B06-210    FEMALE
B06-234         M
B06-235         M
B07-237         F
B07-277         F
B08-206         F
B08-207         F
B08-218         F
B08-220         F
B08-221         F
B08-223         F
B08-224         F
B08-247         M
B08-251         M
B09-201         F
B09-210         F
B09-211         M
B09-215         F
B09-218         F
B09-225         F
B09-228         M
B09-229         M
B10-270         F
B10-282         F
B10-285         M
B10-298         F
B10-306         F
B10-309         M
B10-311         F
B10-314         M
B11-210         F
B11-211         F
B11-213         F
B11-216         M
B11-219         F
B11-224         F
B11-232         M
B11-236         F
B11-243         M
B11-253         F
B11-255         M
B11-256         F
Name: Gender, dtype: object

So I am now going to clean up these values much as I cleaned up the headers


In [42]:
## fix gender to reflect just upper case letter M or F
final_data.Gender = [x[0].upper() for x in final_data.Gender]
print final_data.Gender


LBNL_ID
B06-210    F
B06-234    M
B06-235    M
B07-237    F
B07-277    F
B08-206    F
B08-207    F
B08-218    F
B08-220    F
B08-221    F
B08-223    F
B08-224    F
B08-247    M
B08-251    M
B09-201    F
B09-210    F
B09-211    M
B09-215    F
B09-218    F
B09-225    F
B09-228    M
B09-229    M
B10-270    F
B10-282    F
B10-285    M
B10-298    F
B10-306    F
B10-309    M
B10-311    F
B10-314    M
B11-210    F
B11-211    F
B11-213    F
B11-216    M
B11-219    F
B11-224    F
B11-232    M
B11-236    F
B11-243    M
B11-253    F
B11-255    M
B11-256    F
Name: Gender, dtype: object

Masks

Use a mask to select views of the dataframe


In [43]:
## now I can create masks to parse the data
females = final_data['Gender'] == 'F'
female_data = final_data[females]
male_data = final_data[~females]

In [44]:
female_data.shape


Out[44]:
(28, 28)

In [45]:
male_data.shape


Out[45]:
(14, 28)
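Masks can also be combined elementwise with & and | (the parentheses matter because of operator precedence). A toy sketch (made-up values):

```python
import pandas

df = pandas.DataFrame({'Gender': ['F', 'M', 'F', 'M'],
                       'YearsEd': [16, 12, 20, 18]})

# two boolean masks combined elementwise with &
educated_women = df[(df['Gender'] == 'F') & (df['YearsEd'] >= 16)]
print(educated_women.shape)  # (2, 2)
```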

In [46]:
## This allows me to make a simple box plot to compare 
## years of education between males and females
pyplot.boxplot((female_data['YearsEd'], male_data['YearsEd']))
pyplot.axis([0,3,10,25])
pyplot.xticks([1,2], ['Female', 'Male'])
pyplot.ylabel('Years of Education')


Out[46]:
<matplotlib.text.Text at 0x1127a91d0>

Groupby

You can use groupby much like you would use the mask above. Here is an example of plotting a biomarker value by Gender using groupby instead of creating a mask.


In [47]:
for val, (name, group) in enumerate(final_data.groupby('Gender')):
    pyplot.subplot(1,2,val+1)
    pyplot.boxplot(group['Post-cingulate'])
    pyplot.axis([0,2,0.8, 2.3])
    pyplot.xticks([1,], [name,])
    pyplot.title('Post-cingulate')



In [48]:
for val, (name, group) in enumerate(final_data.groupby('Gender')):
    print val, name, group.shape


0 F (28, 28)
1 M (14, 28)

In [49]:
just_girls = final_data.groupby('Gender').get_group('F')

In [50]:
final_data.groupby('Gender').apply


Out[50]:
<bound method DataFrameGroupBy.apply of <pandas.core.groupby.DataFrameGroupBy object at 0x11221b510>>
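groupby also supports direct aggregation without an explicit loop. A minimal sketch on a toy frame (made-up values):

```python
import pandas

df = pandas.DataFrame({'Gender': ['F', 'F', 'M'],
                       'YearsEd': [16, 20, 12]})

# mean of YearsEd within each Gender group
means = df.groupby('Gender')['YearsEd'].mean()
print(means['F'], means['M'])  # 18.0 12.0
```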

Models

Pandas has a number of models built in, for example, Ordinary Least Squares


In [51]:
final_data.columns
# quick look at influence of MMSE on Post-cingulate Biomarker
ols = pandas.stats.ols.OLS(final_data['Post-cingulate'], 
                           final_data['MMSE'])

In [52]:
print ols.summary


-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         42
Number of Degrees of Freedom:   2

R-squared:         0.0005
Adj R-squared:    -0.0245

Rmse:              0.2448

F-stat (1, 40):     0.0194, p-value:     0.8899

Degrees of Freedom: model 1, resid 40

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0041     0.0292      -0.14     0.8899    -0.0612     0.0531
     intercept     1.3761     0.8410       1.64     0.1096    -0.2723     3.0244
---------------------------------End of Summary---------------------------------

StatsModels

Now that I have some clean data, I might want to run a quick model. To do this I will load the

statsmodels library

specifically I will load the linear_model subpackage with the alias lm


In [53]:
import statsmodels.regression.linear_model as lm

I will grab some variables I care about

dependent variable

  • Post-cingulate biomarker

independent variables

  • gender (which I will relabel with 0,1)
  • Age
  • Precuneus biomarker

In [54]:
final_data.columns


Out[54]:
Index([u'BAC', u'Birthday', u'Gender', u'Handedness', u'YearsEd', u'NumberofSessions', u'NeuropsychExamTestDate', u'AgeatSession', u'GDS', u'MMSE', u'CogActInterviewTotal', u'CogActAdulthood', u'CogActAdulthood:Reading', u'tapspersecR', u'tapspersecL', u'tapspersecBR', u'SUBID', u'ALL', u'Ant-cingulate', u'Inferior-parietal', u'Isthmus-cingulate', u'Post-cingulate', u'Precuneus', u'lateralfrontal', u'medialfrontal', u'temporal', u'random', u'silly_thing'], dtype=object)

In [55]:
items_of_interest = ['Gender', 'AgeatSession', 'Precuneus']
independent = final_data.reindex(columns = items_of_interest)
independent


Out[55]:
Gender AgeatSession Precuneus
LBNL_ID
B06-210 F 73 1.081447
B06-234 M 74 1.330834
B06-235 M 77 1.056631
B07-237 F 72 1.051214
B07-277 F 72 1.175358
B08-206 F 87 1.977648
B08-207 F 76 1.027508
B08-218 F 72 0.960273
B08-220 F 73 1.058662
B08-221 F 86 1.008120
B08-223 F 70 1.170684
B08-224 F 70 1.135571
B08-247 M 73 1.069985
B08-251 M 78 0.921756
B09-201 F 80 1.082767
B09-210 F 82 1.791774
B09-211 M 75 1.173846
B09-215 F 82 1.058898
B09-218 F 70 1.637780
B09-225 F 72 1.089609
B09-228 M 76 1.125693
B09-229 M 70 1.130209
B10-270 F 68 1.118943
B10-282 F 80 1.099579
B10-285 M 78 0.928556
B10-298 F 73 1.034642
B10-306 F 75 1.358964
B10-309 M 72 1.171232
B10-311 F 81 1.058576
B10-314 M 87 1.248210
B11-210 F 70 1.086161
B11-211 F 71 1.044730
B11-213 F 75 1.104658
B11-216 M 96 1.812154
B11-219 F 76 1.173603
B11-224 F 79 1.453994
B11-232 M 73 1.609970
B11-236 F 83 1.030451
B11-243 M 75 1.053735
B11-253 F 72 1.988178
B11-255 M 73 1.216090
B11-256 F 71 1.057577

In [56]:
## demean my dependent variable
y = final_data['Post-cingulate'] - final_data['Post-cingulate'].mean()

## cast and demean my independent variables
independent.Gender[independent.Gender == 'F'] = 0
independent.Gender[independent.Gender == 'M'] = 1

independent.AgeatSession = independent.AgeatSession - independent.AgeatSession.mean()
independent.Precuneus = independent.Precuneus - independent.Precuneus.mean()

## Add constant (intercept)
## note how this creates a new column and populates it with ones
independent['const'] = 1

In [57]:
print 'dependent\n'
print y.values[:10]
print '\nindependent\n'
print independent.values[:10]


dependent

[-0.12048774  0.25894026 -0.11425474 -0.10989774 -0.13557674  0.72064726
 -0.16771274 -0.21860474 -0.15700874 -0.23350774]

independent

[[0 -2.904761904761898 -0.12727371428571455 1]
 [1 -1.904761904761898 0.1221132857142857 1]
 [1 1.095238095238102 -0.15208971428571427 1]
 [0 -3.904761904761898 -0.15750671428571428 1]
 [0 -3.904761904761898 -0.03336271428571447 1]
 [0 11.095238095238102 0.7689272857142857 1]
 [0 0.095238095238102 -0.18121271428571428 1]
 [0 -3.904761904761898 -0.24844771428571433 1]
 [0 -2.904761904761898 -0.15005871428571438 1]
 [0 10.095238095238102 -0.20060071428571447 1]]

In [58]:
lm.OLS?
lmfit = lm.OLS(endog = y.values, exog = independent.values)

In [59]:
result = lmfit.fit()

In [60]:
print independent.columns
result.summary()


Index([u'Gender', u'AgeatSession', u'Precuneus', u'const'], dtype=object)
Out[60]:
OLS Regression Results
Dep. Variable: y R-squared: 0.883
Model: OLS Adj. R-squared: 0.873
Method: Least Squares F-statistic: 95.16
Date: Mon, 04 Nov 2013 Prob (F-statistic): 1.01e-17
Time: 20:18:26 Log-Likelihood: 45.497
No. Observations: 42 AIC: -82.99
Df Residuals: 38 BIC: -76.04
Df Model: 3
coef std err t P>|t| [95.0% Conf. Int.]
x1 -0.0017 0.028 -0.061 0.952 -0.059 0.056
x2 -0.0054 0.002 -2.212 0.033 -0.010 -0.000
x3 0.8629 0.052 16.625 0.000 0.758 0.968
const 0.0006 0.016 0.036 0.972 -0.032 0.034
Omnibus: 2.046 Durbin-Watson: 2.160
Prob(Omnibus): 0.360 Jarque-Bera (JB): 1.244
Skew: -0.401 Prob(JB): 0.537
Kurtosis: 3.261 Cond. No. 22.5

In [61]:
pyplot.subplot(121)
pyplot.plot(independent.AgeatSession, y, 'bo')
pyplot.xlabel('Age')
pyplot.ylabel('Post-cingulate Biomarker')
pyplot.subplot(122)
pyplot.plot(independent.Precuneus, y, 'ro')
pyplot.xlabel('Precuneus Biomarker')


Out[61]:
<matplotlib.text.Text at 0x11491e5d0>

Talking to a Database

sqlite3

  • Python has a number of libraries to talk to databases
  • sqlite3 provides an interface to SQLite, a lightweight disk-based database

sqlite3 python DOCS


In [62]:
import sqlite3

In [63]:
data_to_db = final_data.reindex(columns = 
                                ('Post-cingulate', 
                                 'Precuneus', 
                                 'MMSE', 
                                 'AgeatSession'))

In [64]:
con = sqlite3.connect(':memory:')
con.commit?

In [65]:
query = """
CREATE TABLE test
(postcing FLOAT, precun FLOAT, mmse FLOAT, age FLOAT);"""
# open connection to database in memory
con = sqlite3.connect(':memory:')
## execute SQL statement to CREATE TABLE (create new table)
con.execute(query)
## Commit transaction
con.commit()

In [66]:
# coerce data to format that can be input into the table
# basically create a list of tuples
data = [tuple(item) for item in data_to_db.values.tolist()]

In [67]:
data[:2]


Out[67]:
[(1.138485, 1.0814469999999998, 29.0, 73.0),
 (1.5179129999999998, 1.330834, 30.0, 74.0)]

In [68]:
# new statement to insert our data into database
statement = "INSERT INTO test VALUES(?,?,?,?)"
con.executemany(statement, data)
con.commit()

In [69]:
## roundtrip: let's grab all the data out of the database
cursor = con.execute('select * from test')
newrows = cursor.fetchall()

In [70]:
cursor.description


Out[70]:
(('postcing', None, None, None, None, None, None),
 ('precun', None, None, None, None, None, None),
 ('mmse', None, None, None, None, None, None),
 ('age', None, None, None, None, None, None))

In [71]:
# zip column items in cursor description , and grab first
columns = zip(*cursor.description)[0]
columns


Out[71]:
('postcing', 'precun', 'mmse', 'age')

In [72]:
# create a new DataFrame
newdf = pandas.DataFrame(newrows, columns = columns)
newdf


Out[72]:
postcing precun mmse age
0 1.138485 1.081447 29 73
1 1.517913 1.330834 30 74
2 1.144718 1.056631 29 77
3 1.149075 1.051214 30 72
4 1.123396 1.175358 29 72
5 1.979620 1.977648 27 87
6 1.091260 1.027508 28 76
7 1.040368 0.960273 28 72
8 1.101964 1.058662 28 73
9 1.025465 1.008120 29 86
10 1.234171 1.170684 29 70
11 1.228818 1.135571 29 70
12 1.180982 1.069985 30 73
13 1.079167 0.921756 30 78
14 1.229333 1.082767 30 80
15 1.685527 1.791774 29 82
16 1.228112 1.173846 30 75
17 1.205652 1.058898 27 82
18 1.499777 1.637780 30 70
19 1.089057 1.089609 29 72
20 1.211186 1.125693 27 76
21 1.197423 1.130209 30 70
22 1.222365 1.118943 30 68
23 1.146490 1.099579 29 80
24 0.990200 0.928556 29 78
25 0.944589 1.034642 29 73
26 1.381214 1.358964 30 75
27 1.308381 1.171232 26 72
28 1.177977 1.058576 30 81
29 1.243055 1.248210 29 87
30 1.187455 1.086161 29 70
31 1.207291 1.044730 30 71
32 1.214298 1.104658 27 75
33 1.459784 1.812154 30 96
34 1.297182 1.173603 30 76
35 1.608539 1.453994 29 79
36 1.537757 1.609970 26 73
37 0.957489 1.030451 25 83
38 1.114593 1.053735 29 75
39 2.092553 1.988178 29 72
40 1.255497 1.216090 27 73
41 1.148677 1.057577 30 71

Pandas Database IO

To simplify this process, pandas has a simple built-in interface to SQL databases.


In [73]:
import pandas.io.sql as sql

newdf = sql.read_frame('select * from test', con)
print newdf[:3]


   postcing    precun  mmse  age
0  1.138485  1.081447    29   73
1  1.517913  1.330834    30   74
2  1.144718  1.056631    29   77
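In later pandas versions read_frame was replaced; the modern equivalent is pandas.read_sql_query. A self-contained sketch, assuming a recent pandas (the table here is a small made-up stand-in for test):

```python
import sqlite3
import pandas

# fresh in-memory database with a toy version of the test table
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (postcing FLOAT, age FLOAT)')
con.executemany('INSERT INTO test VALUES (?,?)',
                [(1.14, 73.0), (1.52, 74.0)])
con.commit()

# pull the table straight into a DataFrame
newdf = pandas.read_sql_query('select * from test order by age', con)
print(newdf)
```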

In [74]:
# I still have an open connection (con) to my database table 
# I can create a new DataFrame ordered by age
agedf = sql.read_frame('select * from test order by age',con)
agedf


Out[74]:
postcing precun mmse age
0 1.222365 1.118943 30 68
1 1.234171 1.170684 29 70
2 1.228818 1.135571 29 70
3 1.499777 1.637780 30 70
4 1.197423 1.130209 30 70
5 1.187455 1.086161 29 70
6 1.207291 1.044730 30 71
7 1.148677 1.057577 30 71
8 1.149075 1.051214 30 72
9 1.123396 1.175358 29 72
10 1.040368 0.960273 28 72
11 1.089057 1.089609 29 72
12 1.308381 1.171232 26 72
13 2.092553 1.988178 29 72
14 1.138485 1.081447 29 73
15 1.101964 1.058662 28 73
16 1.180982 1.069985 30 73
17 0.944589 1.034642 29 73
18 1.537757 1.609970 26 73
19 1.255497 1.216090 27 73
20 1.517913 1.330834 30 74
21 1.228112 1.173846 30 75
22 1.381214 1.358964 30 75
23 1.214298 1.104658 27 75
24 1.114593 1.053735 29 75
25 1.091260 1.027508 28 76
26 1.211186 1.125693 27 76
27 1.297182 1.173603 30 76
28 1.144718 1.056631 29 77
29 1.079167 0.921756 30 78
30 0.990200 0.928556 29 78
31 1.608539 1.453994 29 79
32 1.229333 1.082767 30 80
33 1.146490 1.099579 29 80
34 1.177977 1.058576 30 81
35 1.685527 1.791774 29 82
36 1.205652 1.058898 27 82
37 0.957489 1.030451 25 83
38 1.025465 1.008120 29 86
39 1.979620 1.977648 27 87
40 1.243055 1.248210 29 87
41 1.459784 1.812154 30 96

In [75]:
## I can create a dataframe where MMSE > 28 ordered by age
query = 'select * from test where mmse > 28 order by age'
mmsedf = sql.read_frame(query,con)
mmsedf


Out[75]:
postcing precun mmse age
0 1.222365 1.118943 30 68
1 1.234171 1.170684 29 70
2 1.228818 1.135571 29 70
3 1.499777 1.637780 30 70
4 1.197423 1.130209 30 70
5 1.187455 1.086161 29 70
6 1.207291 1.044730 30 71
7 1.148677 1.057577 30 71
8 1.149075 1.051214 30 72
9 1.123396 1.175358 29 72
10 1.089057 1.089609 29 72
11 2.092553 1.988178 29 72
12 1.138485 1.081447 29 73
13 1.180982 1.069985 30 73
14 0.944589 1.034642 29 73
15 1.517913 1.330834 30 74
16 1.228112 1.173846 30 75
17 1.381214 1.358964 30 75
18 1.114593 1.053735 29 75
19 1.297182 1.173603 30 76
20 1.144718 1.056631 29 77
21 1.079167 0.921756 30 78
22 0.990200 0.928556 29 78
23 1.608539 1.453994 29 79
24 1.229333 1.082767 30 80
25 1.146490 1.099579 29 80
26 1.177977 1.058576 30 81
27 1.685527 1.791774 29 82
28 1.025465 1.008120 29 86
29 1.243055 1.248210 29 87
30 1.459784 1.812154 30 96

Hey, I want to get my hands dirty! What should I do next?

https://www.wakari.io

http://learnds.com/