Agile Data Wrangling with Python

This talk is a quick overview of some Python tools for loading, cleaning, and combining data from a set of spreadsheets. In addition, I'll show some examples of simple visualization of the data and an example of a simple data model.

Example: Demographic Data

First I need to load the libraries I will be using:

  • %matplotlib inline : renders plots inline in the notebook
  • pandas : data-wrangling library
  • os : for working with the operating system
  • matplotlib.pyplot : more plotting tools

In [1]:
%matplotlib inline

In [2]:
import os
import numpy as np
import pandas
from matplotlib import pyplot 

# local reference to a data directory
datadir = 'data'

I will use pandas to access my Excel file of demographic data from the data directory.

Note that Excel files can have multiple sheets, so I first query the available sheet names and then load the sheet I want to work with, in this case sheet1.


In [3]:
demographics_file = os.path.join(datadir, 'demographics.xls')
tmpdf = pandas.ExcelFile(demographics_file)
tmpdf.sheet_names


Out[3]:
[u'sheet1']

In [4]:
## create DataFrame called demog by parsing ExcelFile
## Note I use the first column as my default index, as 
## this has unique IDs for each person
demog = tmpdf.parse('sheet1', index_col = 0)

In [5]:
print 'Column Names', demog.columns
print 'Index Values', demog.index[:5]


Column Names Index([u'BAC#', u'Birthday GRAB', u'Gender GRAB', u'Handedness GRAB', u'Years Ed GRAB', u'Number of Sessions', u'Subject Flagged?', u'TstScrs Session 1::Neuropsych Exam Test Date', u'TstScrs Session 1::Age at Session', u'TstScrs Session 1::LIFESTYLE FACTORS', u'TstScrs Session 1::GDS', u'TstScrs Session 1::MMSE', u'TstScrs Session 1::Cog Act Interview Total', u'TstScrs Session 1::CAI Age 6', u'TstScrs Session 1::CAI Age 12', u'TstScrs Session 1::CAI Age 18', u'TstScrs Session 1::CAI Age 40', u'TstScrs Session 1::CAI Past', u'TstScrs Session 1::CAI Now', u'TstScrs Session 1::Cog Act Adulthood', u'TstScrs Session 1::Cog Act Adulthood: Reading', u'TstScrs Session 1::FINGER TAPPING', u'TstScrs Session 1::taps per sec R', u'TstScrs Session 1::taps per sec L', u'TstScrs Session 1::taps per sec BR'], dtype=object)
Index Values Index([u'B06-234', u'B06-210', u'B07-237', u'B07-229', u'B06-235'], dtype=object)

Columns

We can access the column names of our dataframe.

The column names right now are very clunky, so first I will clean them up and write the result to a new Excel file.


In [6]:
#look at clunky columns
print demog.columns


Index([u'BAC#', u'Birthday GRAB', u'Gender GRAB', u'Handedness GRAB', u'Years Ed GRAB', u'Number of Sessions', u'Subject Flagged?', u'TstScrs Session 1::Neuropsych Exam Test Date', u'TstScrs Session 1::Age at Session', u'TstScrs Session 1::LIFESTYLE FACTORS', u'TstScrs Session 1::GDS', u'TstScrs Session 1::MMSE', u'TstScrs Session 1::Cog Act Interview Total', u'TstScrs Session 1::CAI Age 6', u'TstScrs Session 1::CAI Age 12', u'TstScrs Session 1::CAI Age 18', u'TstScrs Session 1::CAI Age 40', u'TstScrs Session 1::CAI Past', u'TstScrs Session 1::CAI Now', u'TstScrs Session 1::Cog Act Adulthood', u'TstScrs Session 1::Cog Act Adulthood: Reading', u'TstScrs Session 1::FINGER TAPPING', u'TstScrs Session 1::taps per sec R', u'TstScrs Session 1::taps per sec L', u'TstScrs Session 1::taps per sec BR'], dtype=object)

In [7]:
## clean up column names
cols = demog.columns
newcols = [x.replace(' ','').replace('#','').replace('GRAB','') \
           for x in cols]
newcols = [x.replace('TstScrsSession1::','') for x in newcols]

# Use cleaned headers with DataFrame
demog.columns = newcols
print demog.columns


Index([u'BAC', u'Birthday', u'Gender', u'Handedness', u'YearsEd', u'NumberofSessions', u'SubjectFlagged?', u'NeuropsychExamTestDate', u'AgeatSession', u'LIFESTYLEFACTORS', u'GDS', u'MMSE', u'CogActInterviewTotal', u'CAIAge6', u'CAIAge12', u'CAIAge18', u'CAIAge40', u'CAIPast', u'CAINow', u'CogActAdulthood', u'CogActAdulthood:Reading', u'FINGERTAPPING', u'tapspersecR', u'tapspersecL', u'tapspersecBR'], dtype=object)
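In more recent pandas versions the same cleanup can be written with the vectorized `.str` methods on the column Index instead of a list comprehension. A minimal sketch on a toy frame (the column names here are made up to mimic the real ones, and this assumes a modern pandas with the `regex` keyword):

```python
import pandas

# toy frame with clunky headers mimicking the real file
df = pandas.DataFrame({'BAC#': [1], 'Gender GRAB': ['F'],
                       'TstScrs Session 1::MMSE': [29]})

# chain literal-string replacements over the column Index
df.columns = (df.columns.str.replace('TstScrs Session 1::', '', regex=False)
                        .str.replace(' ', '', regex=False)
                        .str.replace('#', '', regex=False)
                        .str.replace('GRAB', '', regex=False))
print(list(df.columns))  # ['BAC', 'Gender', 'MMSE']
```

Note the prefix is stripped before the spaces are removed, since the prefix itself contains a space.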

In [8]:
demog[demog.columns[3]]


Out[8]:
LBNL_ID
B06-234                        R
B06-210                        R
B07-237                        R
B07-229                        R
B06-235                        R
B93-222                        R
B07-277                 R; son L
B08-206                        R
B08-207                        R
B08-224    R; father left handed
B08-218      R; brother L handed
B08-214                        R
B08-220                        R
B08-221                        R
B08-223                        R
...
B13-267                              R
B13-313                              R
B13-284                              R
B13-293                              R
B13-312                              R
B13-289                              R
B13-302                              R
B12-335                              R
B13-303                              R
B13-304                   A; Brother L
B13-247                              R
B13-242    L; brother + both parents L
B13-243                              R
B13-237                              R
B13-296                    L; Mother L
Name: Handedness, Length: 106, dtype: object

In [9]:
# by cleaning up column names I can now easily access fields 
# this mapping does not work if fields contain special characters or math operators
print demog.BAC[:10]


LBNL_ID
B06-234    BAC027
B06-210    BAC082
B07-237    BAC098
B07-229    BAC109
B06-235    BAC113
B93-222    BAC191
B07-277    BAC204
B08-206    BAC233
B08-207    BAC235
B08-224    BAC240
Name: BAC, dtype: object
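Bracket indexing, on the other hand, works for any column name, including the original clunky ones. A toy illustration (the values are made up; `Subject Flagged?` is one of the original headers that attribute access cannot reach):

```python
import pandas

df = pandas.DataFrame({'MMSE': [29, 30],
                       'Subject Flagged?': [False, True]})

print(df.MMSE[0])                 # attribute access works for a plain name
print(df['Subject Flagged?'][1])  # bracket access handles spaces and '?'
```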

In [10]:
# write data with cleaned headers to new excel file
demog.to_excel('data/newdemog.xls')

!ls data


005.png
comparing_apriori_vsother modularity.png
demographics.xls
newdemog.xls
subject_biomarker.csv

Cleaning Empty Data Fields

describe

Pandas provides us with a simple tool to have a quick look at our data.

Each numeric column is listed with its count of valid (non-null) entries.

For instance we can see that we have

106 valid values for YearsEd

but we don't seem to have any data for

LIFESTYLEFACTORS


In [11]:
demog.describe()


Out[11]:
YearsEd NumberofSessions AgeatSession LIFESTYLEFACTORS GDS MMSE CogActInterviewTotal CAIAge6 CAIAge12 CAIAge18 CAIAge40 CAIPast CAINow CogActAdulthood CogActAdulthood:Reading FINGERTAPPING tapspersecR tapspersecL tapspersecBR
count 106.000000 106.00000 106.000000 0 106.000000 106.000000 106.000000 99.000000 99.000000 99.000000 84.000000 99.000000 99.000000 106.000000 106.000000 0 106.000000 106.000000 106.000000
mean 16.490566 2.09434 67.622642 NaN 4.245283 28.839623 3.387453 3.342222 3.162158 3.344848 3.609524 3.328990 3.715152 20.679245 13.820755 NaN 4.655525 5.155675 4.785971
std 2.005921 1.55856 19.984019 NaN 3.593145 1.243166 0.541550 1.079244 0.725790 0.705894 0.547419 0.594666 0.647186 4.958009 3.348908 NaN 0.781013 0.740734 0.798957
min 12.000000 1.00000 21.000000 NaN 0.000000 25.000000 2.000000 1.000000 1.330000 1.000000 2.000000 1.750000 2.200000 6.000000 4.000000 NaN 2.545031 3.315833 2.233950
25% 16.000000 1.00000 70.250000 NaN 2.000000 28.000000 3.120000 2.670000 2.670000 2.915000 3.400000 2.940000 3.300000 18.000000 12.000000 NaN 4.082735 4.689999 4.177774
50% 16.000000 1.00000 73.500000 NaN 4.000000 29.000000 3.440000 3.330000 3.330000 3.500000 3.600000 3.400000 3.800000 21.000000 14.000000 NaN 4.569165 5.194224 4.798236
75% 18.000000 3.00000 78.000000 NaN 6.000000 30.000000 3.760000 4.330000 3.670000 3.830000 4.000000 3.775000 4.200000 24.000000 16.000000 NaN 5.243333 5.573045 5.359960
max 20.000000 7.00000 96.000000 NaN 18.000000 30.000000 4.600000 6.000000 4.500000 4.830000 4.800000 4.550000 5.000000 32.000000 21.000000 NaN 6.750871 6.946623 6.996163
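describe only summarizes numeric columns; a direct per-column count of missing entries can be had with isnull().sum(). A small sketch on a toy frame (made-up values):

```python
import numpy as np
import pandas

df = pandas.DataFrame({'MMSE': [29, 30, np.nan],
                       'LIFESTYLEFACTORS': [np.nan, np.nan, np.nan]})

# number of missing entries per column
print(df.isnull().sum())

# columns that contain at least one real value
print(list(df.columns[df.notnull().any()]))  # ['MMSE']
```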

So let's see the structure of this missing data.

I will make a mask marking where data is present and visualize the table as a heat map.


In [12]:
## set default colormap to gray
pyplot.gray()
# create binary mask where data is NOT null
mask = ~pandas.isnull(demog.values)
# set interpolation to nearest (nearest neighbor) to 
# avoid fuzzy image
pyplot.imshow(mask, aspect='auto', interpolation='nearest')
pyplot.ylabel('subjects')
pyplot.xlabel('demographic scores')
pyplot.colorbar(ticks=[0,1])


Out[12]:
<matplotlib.colorbar.Colorbar instance at 0x1100ac3f8>

In [13]:
demog.mean()


Out[13]:
YearsEd                    16.490566
NumberofSessions            2.094340
AgeatSession               67.622642
LIFESTYLEFACTORS                 NaN
GDS                         4.245283
MMSE                       28.839623
CogActInterviewTotal        3.387453
CAIAge6                     3.342222
CAIAge12                    3.162158
CAIAge18                    3.344848
CAIAge40                    3.609524
CAIPast                     3.328990
CAINow                      3.715152
CogActAdulthood            20.679245
CogActAdulthood:Reading    13.820755
FINGERTAPPING                    NaN
tapspersecR                 4.655525
tapspersecL                 5.155675
tapspersecBR                4.785971
dtype: float64

The pattern suggests:

  • most people have most of the tests
  • a couple of tests have data for only a few people

What we want to do:

  • drop columns with many NaNs

Now we just need to drop the tests with very few samples.

This is easy since we can use pandas' built-in method dropna and specify axis=1 to operate across columns.

This will drop any column that contains a NaN.
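dropna is more flexible than the all-or-nothing behaviour used here; the how and thresh options control how aggressive the drop is. A toy sketch (made-up values):

```python
import numpy as np
import pandas

df = pandas.DataFrame({'a': [1.0, 2.0, 3.0],
                       'b': [1.0, np.nan, 3.0],
                       'c': [np.nan, np.nan, np.nan]})

# any NaN drops the column
print(df.dropna(axis=1).columns.tolist())             # ['a']
# drop only columns that are entirely NaN
print(df.dropna(axis=1, how='all').columns.tolist())  # ['a', 'b']
# keep columns with at least 2 real values
print(df.dropna(axis=1, thresh=2).columns.tolist())   # ['a', 'b']
```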


In [14]:
## To drop the tests that are missing for most subjects is easy, 
## I use the built in dropna

final_demog = demog.dropna(axis=1)

In [15]:
# set interpolation to nearest (nearest neighbor) to 
# avoid fuzzy image
pyplot.imshow(pandas.isnull(final_demog), 
              aspect='auto', 
              interpolation='nearest')
pyplot.ylabel('subjects')
pyplot.xlabel('demographic scores')
pyplot.colorbar(ticks=[0,1])


Out[15]:
<matplotlib.colorbar.Colorbar instance at 0x111cb0fc8>

We have REAL data for all our data points!


In [16]:
print final_demog.shape

print final_demog.mean()


(106, 16)
YearsEd                    16.490566
NumberofSessions            2.094340
AgeatSession               67.622642
GDS                         4.245283
MMSE                       28.839623
CogActInterviewTotal        3.387453
CogActAdulthood            20.679245
CogActAdulthood:Reading    13.820755
tapspersecR                 4.655525
tapspersecL                 5.155675
tapspersecBR                4.785971
dtype: float64

In [17]:
print 'Years Education\n', final_demog['YearsEd'].describe()


Years Education
count    106.000000
mean      16.490566
std        2.005921
min       12.000000
25%       16.000000
50%       16.000000
75%       18.000000
max       20.000000
dtype: float64

Now I want to take this data and, for the subjects I have useful demographics for, grab their biomarker data.


In [18]:
## Use pandas to load the biomarker csv data
biomarker = pandas.read_csv('data/subject_biomarker.csv', sep=None)


Using Python parser to sniff delimiter

In [19]:
biomarker.describe()


Out[19]:
ALL Ant-cingulate Inferior-parietal Isthmus-cingulate Post-cingulate Precuneus lateralfrontal medialfrontal temporal
count 83.000000 85.000000 86.000000 85.000000 86.000000 85.000000 86.000000 85.000000 86.000000
mean 1.099934 1.186479 1.151134 1.179585 1.233277 1.167649 1.064884 1.069149 1.016278
std 0.184194 0.235193 0.188548 0.171204 0.210504 0.228273 0.189048 0.181282 0.153723
min 0.831769 0.863666 0.893938 0.949237 0.944589 0.921756 0.793289 0.806475 0.741812
25% 1.005110 1.076069 1.060854 1.082264 1.117816 1.051812 0.970168 0.979989 0.944867
50% 1.042127 1.132960 1.099851 1.130193 1.188095 1.092415 1.004200 1.017698 0.974516
75% 1.086275 1.182174 1.156271 1.216269 1.232962 1.170684 1.077915 1.060022 1.018200
max 1.748783 2.119020 1.773572 1.758167 2.092553 1.988178 1.859275 1.834954 1.525314

In [20]:
biomarker.columns


Out[20]:
Index([u'SUBID', u'ALL', u'Ant-cingulate', u'Inferior-parietal', u'Isthmus-cingulate', u'Post-cingulate', u'Precuneus', u'lateralfrontal', u'medialfrontal', u'temporal'], dtype=object)

In [21]:
biomarker.index


Out[21]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85], dtype=int64)

In [22]:
biomarker.index = biomarker.SUBID

In [23]:
biomarker.index


Out[23]:
Index([u'B05-201', u'B05-202', u'B05-215', u'B05-216', u'B05-217', u'B05-229', u'B05-231', u'B05-232', u'B06-202', u'B06-203', u'B06-210', u'B06-212', u'B06-213', u'B06-217', u'B06-219', u'B06-234', u'B06-235', u'B07-227', u'B07-228', u'B07-230', u'B07-235', u'B07-236', u'B07-237', u'B07-238', u'B07-243', u'B07-267', u'B07-271', u'B07-276', u'B07-277', u'B08-206', u'B08-207', u'B08-218', u'B08-220', u'B08-221', u'B08-223', u'B08-224', u'B08-231', u'B08-247', u'B08-251', u'B09-201', u'B09-210', u'B09-211', u'B09-215', u'B09-216', u'B09-218', u'B09-225', u'B09-228', u'B09-229', u'B09-230', u'B09-260', u'B09-277', u'B09-290', u'B10-216', u'B10-224', u'B10-226', u'B10-232', u'B10-257', u'B10-261', u'B10-264', u'B10-270', u'B10-282', u'B10-285', u'B10-294', u'B10-295', u'B10-298', u'B10-299', u'B10-306', u'B10-309', u'B10-311', u'B10-313', u'B10-314', u'B11-204', u'B11-208', u'B11-210', u'B11-211', u'B11-213', u'B11-216', u'B11-219', u'B11-224', u'B11-232', u'B11-236', u'B11-243', u'B11-253', u'B11-255', u'B11-256', u'B11-257'], dtype=object)

Combining DataFrames

Currently I have two data frames,

  • final_demog
  • biomarker

final_demog uses unique subject IDs for its index, while biomarker uses integers.

I could go back and use the index_col = 0 flag when I parse the csv file.

But I can also use the SUBID column and simply reset the index to the SUBID values (which are unique).

This will allow me to join the two dataframes database-style.


In [24]:
## reindex based on subject ID 
#biomarker.index = biomarker['SUBID']

## Join dataframes
join_data = final_demog.join(biomarker)

## or Concatenate
concat_data = pandas.concat((final_demog, biomarker))
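The difference between the two shows up clearly on a toy pair of frames (the IDs and values here are made up):

```python
import pandas

left = pandas.DataFrame({'age': [70, 75]}, index=['s1', 's2'])
right = pandas.DataFrame({'score': [1.1]}, index=['s1'])

# join aligns on the index: s2 gets NaN for score, shape stays (2, 2)
print(left.join(right))

# default concat stacks rows: shape (3, 2), with NaN filling the gaps
print(pandas.concat((left, right)).shape)
```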

Concat

Let's first look at concatenate.

If we take a closer look, concat just added more rows; it didn't merge the data together.


In [25]:
print concat_data.shape
pyplot.imshow(pandas.isnull(concat_data), interpolation='nearest',
              aspect='auto')


(192, 26)
Out[25]:
<matplotlib.image.AxesImage at 0x11212ed10>

In [26]:
## we can fix this by forcing concat to concatenate on a 
## specific axis (in this case axis = 1)
## 
concat_data2 = pandas.concat((final_demog, biomarker), axis=1)

In [27]:
print concat_data2.shape
pyplot.imshow(pandas.isnull(concat_data2), interpolation='nearest',
              aspect='auto')


(150, 26)
Out[27]:
<matplotlib.image.AxesImage at 0x1121fbe50>

In [28]:
combined_data = concat_data2.dropna(axis=0)

In [29]:
pyplot.imshow(pandas.isnull(combined_data))
print combined_data.SUBID[:10]


B06-210    B06-210
B06-234    B06-234
B06-235    B06-235
B07-237    B07-237
B07-277    B07-277
B08-206    B08-206
B08-207    B08-207
B08-218    B08-218
B08-220    B08-220
B08-221    B08-221
Name: SUBID, dtype: object

Join

Now let's see how a join works.


In [30]:
final_demog.join?

In [31]:
## Quick check that my data is aligned
## Note non-overlapping data
print join_data[2:6]['SUBID']


LBNL_ID
B07-237    B07-237
B07-229        NaN
B06-235    B06-235
B93-222        NaN
Name: SUBID, dtype: object

In [32]:
## by looking I can see not all my subjects with demographics 
#  have biomarker data
join_data


Out[32]:
<class 'pandas.core.frame.DataFrame'>
Index: 106 entries, B06-234 to B13-296
Data columns (total 26 columns):
BAC                        106  non-null values
Birthday                   106  non-null values
Gender                     106  non-null values
Handedness                 106  non-null values
YearsEd                    106  non-null values
NumberofSessions           106  non-null values
NeuropsychExamTestDate     106  non-null values
AgeatSession               106  non-null values
GDS                        106  non-null values
MMSE                       106  non-null values
CogActInterviewTotal       106  non-null values
CogActAdulthood            106  non-null values
CogActAdulthood:Reading    106  non-null values
tapspersecR                106  non-null values
tapspersecL                106  non-null values
tapspersecBR               106  non-null values
SUBID                      42  non-null values
ALL                        42  non-null values
Ant-cingulate              42  non-null values
Inferior-parietal          42  non-null values
Isthmus-cingulate          42  non-null values
Post-cingulate             42  non-null values
Precuneus                  42  non-null values
lateralfrontal             42  non-null values
medialfrontal              42  non-null values
temporal                   42  non-null values
dtypes: datetime64[ns](2), float64(20), object(4)

In [33]:
## quick look at image, I can see a number of people are missing
## biomarker data
mask = ~pandas.isnull(join_data)
pyplot.imshow(mask, interpolation = 'nearest', aspect = 'auto')


Out[33]:
<matplotlib.image.AxesImage at 0x1122057d0>

I can use an inner join, which will keep only the rows whose index appears in both frames, leaving me with only complete data.
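A toy illustration of how='inner' (the IDs and values are made up):

```python
import pandas

demo = pandas.DataFrame({'age': [70, 75, 80]}, index=['s1', 's2', 's3'])
bio = pandas.DataFrame({'score': [1.1, 1.2]}, index=['s2', 's3'])

# only s2 and s3 appear in both, so only they survive, with no NaNs
inner = demo.join(bio, how='inner')
print(inner)
```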


In [34]:
join_inner = final_demog.join(biomarker, how='inner')

In [35]:
print join_inner.shape
pyplot.imshow(pandas.isnull(join_inner))


(42, 26)
Out[35]:
<matplotlib.image.AxesImage at 0x11248bb50>

In [36]:
final_data = join_inner

In [37]:
final_data['random'] = np.random.random(42)

In [38]:
final_data['silly_thing']= 1

Plotting

I can also do a quick summary plot (boxplot) of a subset of the columns.


In [39]:
print final_data.columns[11:14]
subsample = final_data.columns[11:14]
jnk = final_data[subsample].boxplot()


Index([u'CogActAdulthood', u'CogActAdulthood:Reading', u'tapspersecR'], dtype=object)

In [40]:
subsample


Out[40]:
Index([u'CogActAdulthood', u'CogActAdulthood:Reading', u'tapspersecR'], dtype=object)

Fixing Fields

Sometimes people make errors during data entry, so let's go in and fix these.


In [41]:
## data entry is sloppy for gender 
final_data.Gender


Out[41]:
LBNL_ID
B06-210    FEMALE
B06-234         M
B06-235         M
B07-237         F
B07-277         F
B08-206         F
B08-207         F
B08-218         F
B08-220         F
B08-221         F
B08-223         F
B08-224         F
B08-247         M
B08-251         M
B09-201         F
B09-210         F
B09-211         M
B09-215         F
B09-218         F
B09-225         F
B09-228         M
B09-229         M
B10-270         F
B10-282         F
B10-285         M
B10-298         F
B10-306         F
B10-309         M
B10-311         F
B10-314         M
B11-210         F
B11-211         F
B11-213         F
B11-216         M
B11-219         F
B11-224         F
B11-232         M
B11-236         F
B11-243         M
B11-253         F
B11-255         M
B11-256         F
Name: Gender, dtype: object

So I am now going to clean up these values much as I cleaned up the headers


In [42]:
## fix gender to reflect just upper case letter M or F
final_data.Gender = [x[0].upper() for x in final_data.Gender]
print final_data.Gender


LBNL_ID
B06-210    F
B06-234    M
B06-235    M
B07-237    F
B07-277    F
B08-206    F
B08-207    F
B08-218    F
B08-220    F
B08-221    F
B08-223    F
B08-224    F
B08-247    M
B08-251    M
B09-201    F
B09-210    F
B09-211    M
B09-215    F
B09-218    F
B09-225    F
B09-228    M
B09-229    M
B10-270    F
B10-282    F
B10-285    M
B10-298    F
B10-306    F
B10-309    M
B10-311    F
B10-314    M
B11-210    F
B11-211    F
B11-213    F
B11-216    M
B11-219    F
B11-224    F
B11-232    M
B11-236    F
B11-243    M
B11-253    F
B11-255    M
B11-256    F
Name: Gender, dtype: object

Masks

Use a mask to select views of the dataframe


In [43]:
## now I can create masks to parse the data
females = final_data['Gender'] == 'F'
female_data = final_data[females]
male_data = final_data[~females]

In [44]:
female_data.shape


Out[44]:
(28, 28)

In [45]:
male_data.shape


Out[45]:
(14, 28)
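Masks can also be combined elementwise with & and | (the parentheses matter because of operator precedence). A toy sketch (made-up values):

```python
import pandas

df = pandas.DataFrame({'Gender': ['F', 'M', 'F', 'M'],
                       'YearsEd': [16, 12, 20, 18]})

# two boolean masks combined elementwise with &
educated_women = df[(df['Gender'] == 'F') & (df['YearsEd'] >= 16)]
print(educated_women.shape)  # (2, 2)
```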

In [46]:
## This allows me to make a simple box plot to compare 
## years of education between males and females
pyplot.boxplot((female_data['YearsEd'], male_data['YearsEd']))
pyplot.axis([0,3,10,25])
pyplot.xticks([1,2], ['Female', 'Male'])
pyplot.ylabel('Years of Education')


Out[46]:
<matplotlib.text.Text at 0x1127a91d0>

Groupby

You can use groupby much like you would use the mask above. Here is an example of plotting a biomarker value by Gender using groupby instead of creating a mask.


In [47]:
for val, (name, group) in enumerate(final_data.groupby('Gender')):
    pyplot.subplot(1,2,val+1)
    pyplot.boxplot(group['Post-cingulate'])
    pyplot.axis([0,2,0.8, 2.3])
    pyplot.xticks([1,], [name,])
    pyplot.title('Post-cingulate')



In [48]:
for val, (name, group) in enumerate(final_data.groupby('Gender')):
    print val, name, group.shape


0 F (28, 28)
1 M (14, 28)

In [49]:
just_girls = final_data.groupby('Gender').get_group('F')

In [50]:
final_data.groupby('Gender').apply


Out[50]:
<bound method DataFrameGroupBy.apply of <pandas.core.groupby.DataFrameGroupBy object at 0x11221b510>>
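groupby also supports direct aggregation without an explicit loop. A minimal sketch on a toy frame (made-up values):

```python
import pandas

df = pandas.DataFrame({'Gender': ['F', 'F', 'M'],
                       'YearsEd': [16, 20, 12]})

# mean of YearsEd within each Gender group
means = df.groupby('Gender')['YearsEd'].mean()
print(means['F'], means['M'])  # 18.0 12.0
```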

Models

Pandas has a number of models built in, for example, Ordinary Least Squares


In [51]:
final_data.columns
# quick look at influence of MMSE on Post-cingulate Biomarker
ols = pandas.stats.ols.OLS(final_data['Post-cingulate'], 
                           final_data['MMSE'])

In [52]:
print ols.summary


-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         42
Number of Degrees of Freedom:   2

R-squared:         0.0005
Adj R-squared:    -0.0245

Rmse:              0.2448

F-stat (1, 40):     0.0194, p-value:     0.8899

Degrees of Freedom: model 1, resid 40

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0041     0.0292      -0.14     0.8899    -0.0612     0.0531
     intercept     1.3761     0.8410       1.64     0.1096    -0.2723     3.0244
---------------------------------End of Summary---------------------------------

StatsModels

Now that I have some clean data, I might want to run a quick model. To do this I will load the

statsmodels library

specifically I will load the linear_model subpackage with the alias lm


In [53]:
import statsmodels.regression.linear_model as lm

I will grab some variables I care about

dependent variable

  • Post-cingulate biomarker

independent variables

  • gender (which I will relabel with 0,1)
  • Age
  • Precuneus biomarker

In [54]:
final_data.columns


Out[54]:
Index([u'BAC', u'Birthday', u'Gender', u'Handedness', u'YearsEd', u'NumberofSessions', u'NeuropsychExamTestDate', u'AgeatSession', u'GDS', u'MMSE', u'CogActInterviewTotal', u'CogActAdulthood', u'CogActAdulthood:Reading', u'tapspersecR', u'tapspersecL', u'tapspersecBR', u'SUBID', u'ALL', u'Ant-cingulate', u'Inferior-parietal', u'Isthmus-cingulate', u'Post-cingulate', u'Precuneus', u'lateralfrontal', u'medialfrontal', u'temporal', u'random', u'silly_thing'], dtype=object)

In [55]:
items_of_interest = ['Gender', 'AgeatSession', 'Precuneus']
independent = final_data.reindex(columns = items_of_interest)
independent


Out[55]:
Gender AgeatSession Precuneus
LBNL_ID
B06-210 F 73 1.081447
B06-234 M 74 1.330834
B06-235 M 77 1.056631
B07-237 F 72 1.051214
B07-277 F 72 1.175358
B08-206 F 87 1.977648
B08-207 F 76 1.027508
B08-218 F 72 0.960273
B08-220 F 73 1.058662
B08-221 F 86 1.008120
B08-223 F 70 1.170684
B08-224 F 70 1.135571
B08-247 M 73 1.069985
B08-251 M 78 0.921756
B09-201 F 80 1.082767
B09-210 F 82 1.791774
B09-211 M 75 1.173846
B09-215 F 82 1.058898
B09-218 F 70 1.637780
B09-225 F 72 1.089609
B09-228 M 76 1.125693
B09-229 M 70 1.130209
B10-270 F 68 1.118943
B10-282 F 80 1.099579
B10-285 M 78 0.928556
B10-298 F 73 1.034642
B10-306 F 75 1.358964
B10-309 M 72 1.171232
B10-311 F 81 1.058576
B10-314 M 87 1.248210
B11-210 F 70 1.086161
B11-211 F 71 1.044730
B11-213 F 75 1.104658
B11-216 M 96 1.812154
B11-219 F 76 1.173603
B11-224 F 79 1.453994
B11-232 M 73 1.609970
B11-236 F 83 1.030451
B11-243 M 75 1.053735
B11-253 F 72 1.988178
B11-255 M 73 1.216090
B11-256 F 71 1.057577

In [56]:
## demean my dependent variable
y = final_data['Post-cingulate'] - final_data['Post-cingulate'].mean()

## cast and demean my independent variables
independent.Gender[independent.Gender == 'F'] = 0
independent.Gender[independent.Gender == 'M'] = 1

independent.AgeatSession = independent.AgeatSession - independent.AgeatSession.mean()
independent.Precuneus = independent.Precuneus - independent.Precuneus.mean()

## Add constant (intercept)
## note how this creates a new column and populates it with ones
independent['const'] = 1

In [57]:
print 'dependent\n'
print y.values[:10]
print '\nindependent\n'
print independent.values[:10]


dependent

[-0.12048774  0.25894026 -0.11425474 -0.10989774 -0.13557674  0.72064726
 -0.16771274 -0.21860474 -0.15700874 -0.23350774]

independent

[[0 -2.904761904761898 -0.12727371428571455 1]
 [1 -1.904761904761898 0.1221132857142857 1]
 [1 1.095238095238102 -0.15208971428571427 1]
 [0 -3.904761904761898 -0.15750671428571428 1]
 [0 -3.904761904761898 -0.03336271428571447 1]
 [0 11.095238095238102 0.7689272857142857 1]
 [0 0.095238095238102 -0.18121271428571428 1]
 [0 -3.904761904761898 -0.24844771428571433 1]
 [0 -2.904761904761898 -0.15005871428571438 1]
 [0 10.095238095238102 -0.20060071428571447 1]]

In [58]:
lm.OLS?
lmfit = lm.OLS(endog = y.values, exog = independent.values)

In [59]:
result = lmfit.fit()

In [60]:
print independent.columns
result.summary()


Index([u'Gender', u'AgeatSession', u'Precuneus', u'const'], dtype=object)
Out[60]:
OLS Regression Results
Dep. Variable: y R-squared: 0.883
Model: OLS Adj. R-squared: 0.873
Method: Least Squares F-statistic: 95.16
Date: Mon, 04 Nov 2013 Prob (F-statistic): 1.01e-17
Time: 20:18:26 Log-Likelihood: 45.497
No. Observations: 42 AIC: -82.99
Df Residuals: 38 BIC: -76.04
Df Model: 3
coef std err t P>|t| [95.0% Conf. Int.]
x1 -0.0017 0.028 -0.061 0.952 -0.059 0.056
x2 -0.0054 0.002 -2.212 0.033 -0.010 -0.000
x3 0.8629 0.052 16.625 0.000 0.758 0.968
const 0.0006 0.016 0.036 0.972 -0.032 0.034
Omnibus: 2.046 Durbin-Watson: 2.160
Prob(Omnibus): 0.360 Jarque-Bera (JB): 1.244
Skew: -0.401 Prob(JB): 0.537
Kurtosis: 3.261 Cond. No. 22.5

In [61]:
pyplot.subplot(121)
pyplot.plot(independent.AgeatSession, y, 'bo')
pyplot.xlabel('Age')
pyplot.ylabel('Post-cingulate Biomarker')
pyplot.subplot(122)
pyplot.plot(independent.Precuneus, y, 'ro')
pyplot.xlabel('Precuneus Biomarker')


Out[61]:
<matplotlib.text.Text at 0x11491e5d0>

Talking to a Database

sqlite3

  • Python has a number of libraries to talk to databases
  • sqlite3 provides an interface to SQLite, a lightweight disk-based database

sqlite3 python DOCS


In [62]:
import sqlite3

In [63]:
data_to_db = final_data.reindex(columns = 
                                ('Post-cingulate', 
                                 'Precuneus', 
                                 'MMSE', 
                                 'AgeatSession'))

In [64]:
con = sqlite3.connect(':memory:')
con.commit?

In [65]:
query = """
CREATE TABLE test
(postcing FLOAT, precun FLOAT, mmse FLOAT, age FLOAT);"""
# open connection to database in memory
con = sqlite3.connect(':memory:')
## execute SQL statement to CREATE TABLE (create new table)
con.execute(query)
## Commit transaction
con.commit()

In [66]:
# coerce data to format that can be input into the table
# basically create a list of tuples
data = [tuple(item) for item in data_to_db.values.tolist()]

In [67]:
data[:2]


Out[67]:
[(1.138485, 1.0814469999999998, 29.0, 73.0),
 (1.5179129999999998, 1.330834, 30.0, 74.0)]

In [68]:
# new statement to insert our data into database
statement = "INSERT INTO test VALUES(?,?,?,?)"
con.executemany(statement, data)
con.commit()

In [69]:
## roundtrip: let's grab all the data out of the database
cursor = con.execute('select * from test')
newrows = cursor.fetchall()

In [70]:
cursor.description


Out[70]:
(('postcing', None, None, None, None, None, None),
 ('precun', None, None, None, None, None, None),
 ('mmse', None, None, None, None, None, None),
 ('age', None, None, None, None, None, None))

In [71]:
# zip column items in cursor description , and grab first
columns = zip(*cursor.description)[0]
columns


Out[71]:
('postcing', 'precun', 'mmse', 'age')

In [72]:
# create a new DataFrame
newdf = pandas.DataFrame(newrows, columns = columns)
newdf


Out[72]:
postcing precun mmse age
0 1.138485 1.081447 29 73
1 1.517913 1.330834 30 74
2 1.144718 1.056631 29 77
3 1.149075 1.051214 30 72
4 1.123396 1.175358 29 72
5 1.979620 1.977648 27 87
6 1.091260 1.027508 28 76
7 1.040368 0.960273 28 72
8 1.101964 1.058662 28 73
9 1.025465 1.008120 29 86
10 1.234171 1.170684 29 70
11 1.228818 1.135571 29 70
12 1.180982 1.069985 30 73
13 1.079167 0.921756 30 78
14 1.229333 1.082767 30 80
15 1.685527 1.791774 29 82
16 1.228112 1.173846 30 75
17 1.205652 1.058898 27 82
18 1.499777 1.637780 30 70
19 1.089057 1.089609 29 72
20 1.211186 1.125693 27 76
21 1.197423 1.130209 30 70
22 1.222365 1.118943 30 68
23 1.146490 1.099579 29 80
24 0.990200 0.928556 29 78
25 0.944589 1.034642 29 73
26 1.381214 1.358964 30 75
27 1.308381 1.171232 26 72
28 1.177977 1.058576 30 81
29 1.243055 1.248210 29 87
30 1.187455 1.086161 29 70
31 1.207291 1.044730 30 71
32 1.214298 1.104658 27 75
33 1.459784 1.812154 30 96
34 1.297182 1.173603 30 76
35 1.608539 1.453994 29 79
36 1.537757 1.609970 26 73
37 0.957489 1.030451 25 83
38 1.114593 1.053735 29 75
39 2.092553 1.988178 29 72
40 1.255497 1.216090 27 73
41 1.148677 1.057577 30 71

Pandas Database IO

To simplify this process, pandas has a simple built-in interface to SQL databases.


In [73]:
import pandas.io.sql as sql

newdf = sql.read_frame('select * from test', con)
print newdf[:3]


   postcing    precun  mmse  age
0  1.138485  1.081447    29   73
1  1.517913  1.330834    30   74
2  1.144718  1.056631    29   77
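In later pandas versions read_frame was replaced; the modern equivalent is pandas.read_sql_query. A self-contained sketch, assuming a recent pandas (the table here is a small made-up stand-in for test):

```python
import sqlite3
import pandas

# fresh in-memory database with a toy version of the test table
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (postcing FLOAT, age FLOAT)')
con.executemany('INSERT INTO test VALUES (?,?)',
                [(1.14, 73.0), (1.52, 74.0)])
con.commit()

# pull the table straight into a DataFrame
newdf = pandas.read_sql_query('select * from test order by age', con)
print(newdf)
```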

In [74]:
# I still have an open connection (con) to my database table 
# I can create a new DataFrame ordered by age
agedf = sql.read_frame('select * from test order by age',con)
agedf


Out[74]:
postcing precun mmse age
0 1.222365 1.118943 30 68
1 1.234171 1.170684 29 70
2 1.228818 1.135571 29 70
3 1.499777 1.637780 30 70
4 1.197423 1.130209 30 70
5 1.187455 1.086161 29 70
6 1.207291 1.044730 30 71
7 1.148677 1.057577 30 71
8 1.149075 1.051214 30 72
9 1.123396 1.175358 29 72
10 1.040368 0.960273 28 72
11 1.089057 1.089609 29 72
12 1.308381 1.171232 26 72
13 2.092553 1.988178 29 72
14 1.138485 1.081447 29 73
15 1.101964 1.058662 28 73
16 1.180982 1.069985 30 73
17 0.944589 1.034642 29 73
18 1.537757 1.609970 26 73
19 1.255497 1.216090 27 73
20 1.517913 1.330834 30 74
21 1.228112 1.173846 30 75
22 1.381214 1.358964 30 75
23 1.214298 1.104658 27 75
24 1.114593 1.053735 29 75
25 1.091260 1.027508 28 76
26 1.211186 1.125693 27 76
27 1.297182 1.173603 30 76
28 1.144718 1.056631 29 77
29 1.079167 0.921756 30 78
30 0.990200 0.928556 29 78
31 1.608539 1.453994 29 79
32 1.229333 1.082767 30 80
33 1.146490 1.099579 29 80
34 1.177977 1.058576 30 81
35 1.685527 1.791774 29 82
36 1.205652 1.058898 27 82
37 0.957489 1.030451 25 83
38 1.025465 1.008120 29 86
39 1.979620 1.977648 27 87
40 1.243055 1.248210 29 87
41 1.459784 1.812154 30 96

In [75]:
## I can create a dataframe where MMSE > 28 ordered by age
query = 'select * from test where mmse > 28 order by age'
mmsedf = sql.read_frame(query,con)
mmsedf


Out[75]:
postcing precun mmse age
0 1.222365 1.118943 30 68
1 1.234171 1.170684 29 70
2 1.228818 1.135571 29 70
3 1.499777 1.637780 30 70
4 1.197423 1.130209 30 70
5 1.187455 1.086161 29 70
6 1.207291 1.044730 30 71
7 1.148677 1.057577 30 71
8 1.149075 1.051214 30 72
9 1.123396 1.175358 29 72
10 1.089057 1.089609 29 72
11 2.092553 1.988178 29 72
12 1.138485 1.081447 29 73
13 1.180982 1.069985 30 73
14 0.944589 1.034642 29 73
15 1.517913 1.330834 30 74
16 1.228112 1.173846 30 75
17 1.381214 1.358964 30 75
18 1.114593 1.053735 29 75
19 1.297182 1.173603 30 76
20 1.144718 1.056631 29 77
21 1.079167 0.921756 30 78
22 0.990200 0.928556 29 78
23 1.608539 1.453994 29 79
24 1.229333 1.082767 30 80
25 1.146490 1.099579 29 80
26 1.177977 1.058576 30 81
27 1.685527 1.791774 29 82
28 1.025465 1.008120 29 86
29 1.243055 1.248210 29 87
30 1.459784 1.812154 30 96

Hey, I want to get my hands dirty! What should I do next?

https://www.wakari.io

http://learnds.com/