An IPython Tutorial for Data Analysis

In addition to the packages that come with Anaconda, dependencies for this notebook include:

  • seaborn (plotting)
  • nibabel (reading NIfTI files)
  • rpy2 (running R from the notebook)

You should be able to install these through the Terminal: pip install <package-name>

Getting started

For inline matplotlib/seaborn figures, run this line:


In [1]:
%matplotlib inline

Next, import any packages or modules that you'll use in the notebook:


In [2]:
# For data structures
import pandas as pd

# For operating system functionality
import os.path as op
import os

# Stats stuff
from scipy import stats
import numpy as np

# For plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Set up seaborn with larger font & white background


In [3]:
sns.set(context='poster', style='whitegrid')

Basic DataFrame manipulation with Pandas

Python is an object-oriented language, meaning that there are classes (e.g., the pandas DataFrame) which (1) describe the contents of objects of that class (e.g., data, columns, datatypes), and (2) define the operations (i.e., methods) that objects of that class can perform (e.g., append(), groupby()).
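As a quick illustration of attributes vs. methods (using a toy DataFrame made up for this example, not the tutorial's data):

```python
import pandas as pd

toy = pd.DataFrame({'cond': ['face', 'place'], 'value': [5, 3]})

# Attributes describe the object's contents (no parentheses)
n_rows, n_cols = toy.shape
col_names = list(toy.columns)

# Methods are operations the object can perform (called with parentheses)
means = toy.groupby('cond').mean()  # a new DataFrame of per-condition means
```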

Create a Pandas DataFrame "from scratch"

A Pandas DataFrame is a 2D labeled data structure that can have columns of different types (e.g., string, integer). You can specify indices (row labels) and column labels as arguments. You can make DataFrames from a variety of different inputs, including:

  • 2D numpy.ndarray
  • A Pandas Series
  • Another DataFrame
  • A dict of 1D arrays, lists, dicts, or Series
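For example, here are two of these input types producing the same small frame (toy data, assumed for illustration):

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column labels
df_dict = pd.DataFrame({'subid': ['subj1', 'subj2'], 'value': [3, 7]})

# From a 2D ndarray: supply the column labels explicitly
arr = np.array([['subj1', 3], ['subj2', 7]], dtype=object)
df_arr = pd.DataFrame(arr, columns=['subid', 'value'])
```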

Create an empty DataFrame

Here we will create a new instance of the DataFrame class, and assign this object to the variable data. This object will be initialized with the default parameters unless otherwise specified, and will inherit the methods and attributes of the DataFrame class. Check out the pandas DataFrame documentation to see the parameters, attributes, and methods of the DataFrame class.


In [4]:
data = pd.DataFrame(columns=['subid', 'cond', 'value'])
data


Out[4]:
subid cond value

Add a dict as a row


In [5]:
new_row = {'subid': 'subj1', 
           'cond': 'place', 
           'value': 3}

data = data.append([new_row])

In [6]:
data.head()


Out[6]:
subid cond value
0 subj1 place 3

Add a second DataFrame from a dict made of Series


In [7]:
# Make a dict of Series
new_rows = {'subid': pd.Series(['subj1']), 
            'cond': pd.Series(['face', 'place', 'face']),
            'value': pd.Series([5, 3, 7])}

# Create new DF
new_data = pd.DataFrame(new_rows)

# Add to old DF
data = data.append(new_data, ignore_index=True)

In [8]:
data.head()


Out[8]:
cond subid value
0 place subj1 3
1 face subj1 5
2 place NaN 3
3 face NaN 7

In [9]:
# Use .loc to avoid assigning to a chained selection
data.loc[data.subid.isnull(), 'subid'] = 'subj2'

In [10]:
data.head()


Out[10]:
cond subid value
0 place subj1 3
1 face subj1 5
2 place subj2 3
3 face subj2 7

Now, we can access the attributes of the DataFrame object:


In [11]:
print 'Data shape: ' + str(data.shape)


Data shape: (4, 3)

Create another dataframe


In [12]:
data2 = pd.DataFrame(columns=['subid', 'group', 'gender'])
new_row = {'subid': 'subj1', 
           'group': 'control',
           'gender': 'female'}
data2 = data2.append([new_row])
data2.head()


Out[12]:
subid group gender
0 subj1 control female

Merge dataframes

Merging uses relational terminology similar to that used in SQL. There are 3 main cases of joining DataFrame objects:

  • one-to-one joins: e.g., when joining DataFrames based on their indexes (e.g., join)
  • many-to-one joins: e.g., when joining an index to one or more columns in a DataFrame
  • many-to-many joins: joining columns on columns (e.g., inner merge)
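A minimal sketch of how the join types differ, using toy frames (names assumed for illustration, not the tutorial data):

```python
import pandas as pd

left = pd.DataFrame({'subid': ['s1', 's1', 's2'], 'value': [3, 5, 7]})
right = pd.DataFrame({'subid': ['s1', 's3'], 'group': ['control', 'patient']})

# inner: only keys present in both frames (s1; matched twice, so 2 rows)
inner = pd.merge(left, right, on='subid', how='inner')

# outer: union of keys; unmatched cells are filled with NaN (4 rows)
outer = pd.merge(left, right, on='subid', how='outer')

# left: all rows from the left frame are kept (3 rows)
left_join = pd.merge(left, right, on='subid', how='left')
```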

First, try merging using the intersection of keys from both frames:


In [13]:
pd.merge(data, data2, on=['subid'], how='inner')


Out[13]:
cond subid value group gender
0 place subj1 3 control female
1 face subj1 5 control female

Merge using union of keys from both frames:


In [14]:
pd.merge(data, data2, on=['subid'], how='outer')


Out[14]:
cond subid value group gender
0 place subj1 3 control female
1 face subj1 5 control female
2 place subj2 3 NaN NaN
3 face subj2 7 NaN NaN

Concatenate dataframes row-wise


In [15]:
pd.concat([data, data2], ignore_index=True)


Out[15]:
cond gender group subid value
0 place NaN NaN subj1 3
1 face NaN NaN subj1 5
2 place NaN NaN subj2 3
3 face NaN NaN subj2 7
4 NaN female control subj1 NaN

Or do the same thing using the append() method:


In [16]:
data.append(data2, ignore_index=True)


Out[16]:
cond gender group subid value
0 place NaN NaN subj1 3
1 face NaN NaN subj1 5
2 place NaN NaN subj2 3
3 face NaN NaN subj2 7
4 NaN female control subj1 NaN

Explore a Dataset with Pandas

Define paths & files


In [17]:
data_file = 'objfam_groupcat_euc.csv'
username = os.getlogin()
data_dir = op.join('/Users', username, 'Dropbox/Code/tutorial/')

In [18]:
data_dir


Out[18]:
'/Users/sgagnon/Dropbox/Code/tutorial/'

Create DataFrame from CSV with Pandas


In [19]:
df = pd.read_csv(op.join(data_dir, data_file))

In [20]:
df.head(n=3)


Out[20]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
0 3 1 1 24 2 2 5 1.97 204.2603 4.27
1 3 1 2 8 2 1 2 3.26 0.0000 6.73
2 3 1 3 7 5 3 5 3.60 482.0321 3.55

In [21]:
df.describe()


Out[21]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
count 3420.000000 3420.000000 3420.000000 3420.000000 3420.000000 3420.000000 3393.000000 3420.000000 3420.000000 3420.000000
mean 13.105263 6.500000 8.000000 15.500000 3.500000 2.000000 3.495726 1.529977 182.595322 5.039947
std 6.852008 3.452557 4.321126 8.656707 1.708075 0.816616 1.430195 0.549428 158.526164 1.209830
min 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 2.550000
25% 7.000000 3.750000 4.000000 8.000000 2.000000 1.000000 2.000000 1.150000 0.000000 3.910000
50% 12.000000 6.500000 8.000000 15.500000 3.500000 2.000000 4.000000 1.470000 179.376500 4.910000
75% 20.000000 9.250000 12.000000 23.000000 5.000000 3.000000 5.000000 1.830000 307.410375 6.360000
max 25.000000 12.000000 15.000000 30.000000 6.000000 3.000000 5.000000 4.000000 662.535900 6.910000

Selection by position


In [22]:
df.iloc[3]


Out[22]:
Subject         3.0000
Run             1.0000
Trial           4.0000
Prototype      15.0000
Pair            2.0000
Morph           2.0000
Response        2.0000
RT              1.8200
EuclidDist    159.5645
PercepDist      5.0900
Name: 3, dtype: float64

Selection by query


In [23]:
df.query('Subject == 3 & EuclidDist > 500')


Out[23]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
6 3 1 7 30 4 3 2 2.66 508.4457 3.82

In [24]:
df[(df.Subject == 3) & (df.EuclidDist > 500)]


Out[24]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
6 3 1 7 30 4 3 2 2.66 508.4457 3.82

In [25]:
df[(df['Subject'] == 3) & (df['EuclidDist'] > 500)]


Out[25]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
6 3 1 7 30 4 3 2 2.66 508.4457 3.82

In [26]:
df_identical = df.query('Morph == 1')
df_identical.head()


Out[26]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
1 3 1 2 8 2 1 2 3.26 0 6.73
4 3 1 5 19 1 1 4 3.47 0 6.27
11 3 1 12 29 3 1 4 1.55 0 6.45
12 3 1 13 13 1 1 5 1.97 0 6.27
14 3 1 15 1 5 1 3 1.43 0 6.73

Boolean indexing


In [27]:
df_resp_old = df[df.Response > 3]
df_resp_old.head()


Out[27]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
0 3 1 1 24 2 2 5 1.97 204.2603 4.27
2 3 1 3 7 5 3 5 3.60 482.0321 3.55
4 3 1 5 19 1 1 4 3.47 0.0000 6.27
8 3 1 9 22 2 2 5 1.27 164.6218 4.82
9 3 1 10 18 5 2 5 1.80 387.6952 4.36

In [28]:
select_prototypes = [7, 3]
df_select = df[df.Prototype.isin(select_prototypes)]

In [29]:
df_select.head()


Out[29]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
2 3 1 3 7 5 3 5 3.60 482.0321 3.55
29 3 2 15 3 1 2 2 1.15 188.4949 5.00
34 3 3 5 7 2 3 1 2.36 363.1455 3.36
47 3 4 3 3 5 3 2 1.40 383.3653 3.91
74 3 5 15 7 3 2 3 1.50 124.4781 5.45

Group data by subject

Grouping follows the split-apply-combine pattern: split the data into groups based on some criterion (e.g., subject ID), apply a function to each group independently (e.g., mean, median), and combine the results into a new data structure.
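The pattern in miniature, with toy data (made up for illustration):

```python
import pandas as pd

trials = pd.DataFrame({'subid': ['s1', 's1', 's2', 's2'],
                       'rt':    [1.2, 1.8, 2.0, 2.4]})

# Split by subject, apply the mean to each group, combine into one frame
subj_means = trials.groupby('subid').rt.mean().reset_index()
# subid 's1' -> 1.5, 's2' -> 2.2
```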


In [30]:
grouped = df.groupby(['Subject', 'Morph'], sort=False).mean().reset_index()
grouped.head()


Out[30]:
Subject Morph Run Trial Prototype Pair Response RT EuclidDist PercepDist
0 3 2 6.5 7.733333 16.533333 3.433333 3.450000 1.484333 189.257535 4.971333
1 3 1 6.5 7.816667 13.916667 3.266667 3.883333 1.448500 0.000000 6.496167
2 3 3 6.5 8.450000 16.050000 3.800000 2.450000 1.583833 351.998067 3.713667
3 4 2 6.5 8.183333 16.350000 3.516667 3.616667 1.656500 186.934810 4.928667
4 4 1 6.5 7.616667 15.900000 3.500000 4.366667 1.476167 0.000000 6.489500

In [31]:
table = pd.pivot_table(df, values='RT', 
                       index=['Subject', 'Morph'],
                       columns=['Response'],
                       aggfunc=np.median)

In [32]:
table.head()


Out[32]:
Response 1.0 2.0 3.0 4.0 5.0
Subject Morph
3 1 1.55 1.480 1.445 1.49 1.130
2 1.40 1.735 1.500 1.83 1.270
3 1.14 1.615 1.485 1.64 1.675
4 1 1.55 2.090 NaN 1.78 1.160
2 1.71 1.870 NaN 1.63 1.170

In [33]:
table.stack()


Out[33]:
Subject  Morph  Response
3        1      1           1.550
                2           1.480
                3           1.445
                4           1.490
                5           1.130
         2      1           1.400
                2           1.735
                3           1.500
                4           1.830
                5           1.270
         3      1           1.140
                2           1.615
                3           1.485
                4           1.640
                5           1.675
...
24       3      2           1.030
                4           0.990
                5           0.900
25       1      1           1.570
                2           2.190
                4           1.950
                5           1.205
         2      1           1.920
                2           1.960
                4           2.045
                5           1.560
         3      1           1.900
                2           2.270
                4           2.135
                5           1.475
Length: 253, dtype: float64
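stack() moves the column labels into the innermost level of the row index (producing a long-format Series, as above), and unstack() reverses it. A self-contained sketch with toy values (not the tutorial data):

```python
import pandas as pd

wide = pd.DataFrame({'resp1': [1.5, 1.4], 'resp2': [1.8, 1.7]},
                    index=pd.Index(['s1', 's2'], name='Subject'))

# Wide -> long: a Series indexed by (Subject, column label)
long = wide.stack()

# Long -> wide: restore the original 2x2 layout
roundtrip = long.unstack()
```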

Plotting with Seaborn


In [34]:
grouped = df.groupby(['Subject', 'Morph']).mean().reset_index()

sns.lmplot(x='Morph', y='Response', 
           ci=68, data=grouped,
           x_estimator=np.mean, 
           color='dodgerblue')


Out[34]:
<seaborn.axisgrid.FacetGrid at 0x109214f50>

In [35]:
sns.distplot(df.EuclidDist, color='darkviolet')


Out[35]:
<matplotlib.axes.AxesSubplot at 0x1091c5850>

In [36]:
grouped = df.groupby(["Subject", 
                      "Run", 
                      "Morph"]).Response.mean().reset_index()

sns.tsplot(data=grouped, time='Run', 
           condition='Morph',
           unit='Subject',
           value='Response', 
           estimator=stats.nanmean)


Out[36]:
<matplotlib.axes.AxesSubplot at 0x10a0ece10>

In [37]:
sns.corrplot(df)


/Users/sgagnon/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
Out[37]:
<matplotlib.axes.AxesSubplot at 0x10a599650>

Interactive Widgets


In [38]:
from IPython.html.widgets import interact, interactive, fixed
from IPython.html import widgets
from IPython.display import clear_output, display, HTML

In [39]:
def plot_bysubid(subid):
    sns.lmplot(x='Morph', y='Response', 
               ci=68, data=df[df.Subject == subid],
               x_estimator=np.mean, 
               color='dodgerblue')
    plt.ylim(1,5)

In [40]:
i = interact(plot_bysubid,
             subid=widgets.FloatSliderWidget(min=3, 
                                             max=25, 
                                             step=1, 
                                             value=5))


Analyze dataframe with R


In [41]:
%load_ext rpy2.ipython

In [42]:
%R -i df

In [43]:
%%R
print(str(df))


'data.frame':	3420 obs. of  10 variables:
 $ Subject   : int [1:3420(1d)] 3 3 3 3 3 3 3 3 3 3 ...
 $ Run       : int [1:3420(1d)] 1 1 1 1 1 1 1 1 1 1 ...
 $ Trial     : int [1:3420(1d)] 1 2 3 4 5 6 7 8 9 10 ...
 $ Prototype : int [1:3420(1d)] 24 8 7 15 19 26 30 16 22 18 ...
 $ Pair      : int [1:3420(1d)] 2 2 5 2 1 1 4 2 2 5 ...
 $ Morph     : int [1:3420(1d)] 2 1 3 2 1 2 3 3 2 2 ...
 $ Response  : num [1:3420(1d)] 5 2 5 2 4 3 2 2 5 5 ...
 $ RT        : num [1:3420(1d)] 1.97 3.26 3.6 1.82 3.47 1.59 2.66 1.54 1.27 1.8 ...
 $ EuclidDist: num [1:3420(1d)] 204 0 482 160 0 ...
 $ PercepDist: num [1:3420(1d)] 4.27 6.73 3.55 5.09 6.27 4.82 3.82 4 4.82 4.36 ...
NULL

In [44]:
%%R
# Factor categorical var
df$Subject = factor(df$Subject)

# Load in libraries
require(lme4)
require(lmerTest)

# Run analysis
rs1 = lmer(Response~scale(EuclidDist) + (1|Subject), data=df)
rs2 = lmer(Response~scale(EuclidDist) + (1 + scale(EuclidDist)|Subject), data=df)

print(anova(rs1, rs2))
print(summary(rs2))


Loading required package: lme4
Loading required package: Matrix
Loading required package: Rcpp
Loading required package: lmerTest
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009

Attaching package: ‘lmerTest’

The following object is masked from ‘package:lme4’:

    lmer

The following object is masked from ‘package:stats’:

    step

refitting model(s) with ML (instead of REML)
Data: df
Models:
object: Response ~ scale(EuclidDist) + (1 | Subject)
..1: Response ~ scale(EuclidDist) + (1 + scale(EuclidDist) | Subject)
       Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
object  4 11442 11466 -5716.8    11434                             
..1     6 11388 11424 -5687.9    11376 57.894      2  2.682e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Linear mixed model fit by REML ['merModLmerTest']
Formula: Response ~ scale(EuclidDist) + (1 + scale(EuclidDist) | Subject)
   Data: df

REML criterion at convergence: 11383.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7706 -0.7665  0.2245  0.7206  2.8434 

Random effects:
 Groups   Name              Variance Std.Dev. Corr
 Subject  (Intercept)       0.05463  0.2337       
          scale(EuclidDist) 0.04708  0.2170   0.23
 Residual                   1.63990  1.2806       
Number of obs: 3393, groups:  Subject, 19

Fixed effects:
                  Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)        3.49508    0.05796 17.99400   60.31  < 2e-16 ***
scale(EuclidDist) -0.56113    0.05441 18.01100  -10.31 5.52e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
scl(EcldDs) 0.194 

fMRI Data

Load in nibabel to read/write NIfTI files


In [45]:
import nibabel as nb
from nibabel import load

In [46]:
fmri_file = 'smoothed_timeseries.nii.gz'
fmri_filepath = op.join(data_dir, fmri_file)

In [47]:
fmri_data = load(fmri_filepath)
func_arr = fmri_data.get_data()

Check out the data


In [48]:
print np.shape(func_arr)
print func_arr.shape


(64, 64, 30, 195)
(64, 64, 30, 195)

Plot a slice at a given TR


In [49]:
TR = 20
plt.imshow(func_arr[:,:,15, TR])


Out[49]:
<matplotlib.image.AxesImage at 0x111c46850>

Plot a range of slices


In [50]:
slices = range(1, np.shape(func_arr)[2], 5)
num_subplots = len(slices)

f, axes = plt.subplots(1,num_subplots)

for ax, slice in zip(axes, slices):
    ax.imshow(func_arr[:,:,slice, TR], )
    ax.set_axis_off()
    
plt.tight_layout()


Load in mask file


In [51]:
mask_file = 'Bilat-Hippocampus.nii.gz'
mask_filepath = op.join(data_dir, mask_file)

In [52]:
mask_data = load(mask_filepath)
mask_arr = mask_data.get_data().astype(bool)

In [53]:
num_voxels = mask_arr.sum(); print '# Voxels = ' + str(num_voxels)
mask_dim = mask_arr.shape; print 'Mask Dim = ' + str(mask_dim)


# Voxels = 165
Mask Dim = (64, 64, 30)

Plot mask


In [54]:
plt.imshow(mask_arr[:,:,8])


Out[54]:
<matplotlib.image.AxesImage at 0x110457e90>

Load in fixed effect zstats & mask


In [55]:
fmri_file = 'zstat1.nii.gz'
fmri_filepath = op.join(data_dir, fmri_file)

fmri_data = load(fmri_filepath)
func_arr = fmri_data.get_data()

In [56]:
func_arr.shape


Out[56]:
(64, 64, 30)

In [57]:
func_masked = func_arr[mask_arr]
func_masked.shape


Out[57]:
(165,)
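Indexing a volume with a boolean mask of the same shape flattens the selected voxels into a 1D array, as above; applied to 4D data, the same pattern keeps the time axis, giving voxels-by-timepoints. A sketch with small synthetic arrays (shapes chosen only for illustration):

```python
import numpy as np

func = np.random.randn(4, 4, 3)        # stand-in for a 3D zstat volume
mask = np.zeros((4, 4, 3), dtype=bool)
mask[1:3, 1:3, 1] = True               # 4 "voxels" in the ROI

roi_vals = func[mask]                  # 1D array, one value per masked voxel

# With 4D data, masking the first three axes preserves the time axis
func4d = np.random.randn(4, 4, 3, 10)  # x, y, z, TR
roi_ts = func4d[mask]                  # voxels x timepoints
roi_mean_ts = roi_ts.mean(axis=0)      # mean ROI time course
```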

In [58]:
sns.distplot(func_masked, color='dodgerblue')
plt.xlabel('zstat')
plt.vlines(x=0, ymin=0, ymax=.25, linestyles='dashed')


Out[58]:
<matplotlib.collections.LineCollection at 0x11791e910>

Plot the Daily Puppy


In [59]:
sns.puppyplot()


Out[59]:
