An IPython Tutorial for Data Analysis

In addition to the packages that come with Anaconda, dependencies for this notebook include:

  • seaborn (plotting)
  • nibabel (reading NIfTI files)
  • rpy2 (running R from the notebook)

You should be able to install these through the Terminal: pip install <package-name>

Getting started

For inline matplotlib/seaborn figures, run this line:


In [1]:
%matplotlib inline

Next, import any packages or modules that you'll use in the notebook:


In [2]:
# For data structures
import pandas as pd

# For operating system functionality
import os.path as op
import os

# Stats stuff
from scipy import stats
import numpy as np

# For plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Set up seaborn with larger font & white background


In [3]:
sns.set(context='poster', style='whitegrid')

Basic DataFrame manipulation with Pandas

Python is an object-oriented language, meaning that there are classes (e.g., the pandas DataFrame) which (1) describe the contents of objects of that class (e.g., data, columns, datatypes), and (2) define the operations (i.e., methods) that objects of that class can perform (e.g., append(), groupby()).
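As a quick illustration of attributes vs. methods (using a toy DataFrame made up for this example, not the tutorial's data):

```python
import pandas as pd

toy = pd.DataFrame({'cond': ['face', 'place'], 'value': [5, 3]})

# Attributes describe the object's contents (no parentheses)
n_rows, n_cols = toy.shape
col_names = list(toy.columns)

# Methods are operations the object can perform (called with parentheses)
means = toy.groupby('cond').mean()  # a new DataFrame of per-condition means
```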

Create a Pandas DataFrame "from scratch"

A Pandas DataFrame is a 2D labeled data structure that can have columns of different types (e.g., string, integer). You can specify indices (row labels) and column labels as arguments. You can make DataFrames from a variety of different inputs, including:

  • 2D numpy.ndarray
  • A Pandas Series
  • Another DataFrame
  • A dict of 1D arrays, lists, dicts, or Series
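For example, here are two of these input types producing the same small frame (toy data, assumed for illustration):

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column labels
df_dict = pd.DataFrame({'subid': ['subj1', 'subj2'], 'value': [3, 7]})

# From a 2D ndarray: supply the column labels explicitly
arr = np.array([['subj1', 3], ['subj2', 7]], dtype=object)
df_arr = pd.DataFrame(arr, columns=['subid', 'value'])
```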

Create an empty DataFrame

Here we will create a new instance of the DataFrame class, and assign this object to the variable data. This object will be initialized with the default parameters unless otherwise specified, and will inherit the methods and attributes of the DataFrame class. Check out the pandas DataFrame documentation to see the parameters, attributes, and methods of the DataFrame class.


In [4]:
data = pd.DataFrame(columns=['subid', 'cond', 'value'])
data


Out[4]:
subid cond value

Add a dict as a row


In [5]:
new_row = {'subid': 'subj1', 
           'cond': 'place', 
           'value': 3}

data = data.append([new_row])

In [6]:
data.head()


Out[6]:
subid cond value
0 subj1 place 3

Add a second DataFrame from a dict made of Series


In [7]:
# Make a dict of Series
new_rows = {'subid': pd.Series(['subj1']), 
            'cond': pd.Series(['face', 'place', 'face']),
            'value': pd.Series([5, 3, 7])}

# Create new DF
new_data = pd.DataFrame(new_rows)

# Add to old DF
data = data.append(new_data, ignore_index=True)

In [8]:
data.head()


Out[8]:
cond subid value
0 place subj1 3
1 face subj1 5
2 place NaN 3
3 face NaN 7

In [9]:
# Use .loc to avoid assigning to a chained selection
data.loc[data.subid.isnull(), 'subid'] = 'subj2'

In [10]:
data.head()


Out[10]:
cond subid value
0 place subj1 3
1 face subj1 5
2 place subj2 3
3 face subj2 7

Now, we can access the attributes of the DataFrame object:


In [11]:
print 'Data shape: ' + str(data.shape)


Data shape: (4, 3)

Create another dataframe


In [12]:
data2 = pd.DataFrame(columns=['subid', 'group', 'gender'])
new_row = {'subid': 'subj1', 
           'group': 'control',
           'gender': 'female'}
data2 = data2.append([new_row])
data2.head()


Out[12]:
subid group gender
0 subj1 control female

Merge dataframes

Merging uses relational terminology similar to that used in SQL. There are 3 main cases of joining DataFrame objects:

  • one-to-one joins: e.g., when joining DataFrames based on their indexes (e.g., join)
  • many-to-one joins: e.g., when joining an index to one or more columns in a DataFrame
  • many-to-many joins: joining columns on columns (e.g., inner merge)
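A minimal sketch of how the join types differ, using toy frames (names assumed for illustration, not the tutorial data):

```python
import pandas as pd

left = pd.DataFrame({'subid': ['s1', 's1', 's2'], 'value': [3, 5, 7]})
right = pd.DataFrame({'subid': ['s1', 's3'], 'group': ['control', 'patient']})

# inner: only keys present in both frames (s1; matched twice, so 2 rows)
inner = pd.merge(left, right, on='subid', how='inner')

# outer: union of keys; unmatched cells are filled with NaN (4 rows)
outer = pd.merge(left, right, on='subid', how='outer')

# left: all rows from the left frame are kept (3 rows)
left_join = pd.merge(left, right, on='subid', how='left')
```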

First, try merging using the intersection of keys from both frames:


In [13]:
pd.merge(data, data2, on=['subid'], how='inner')


Out[13]:
cond subid value group gender
0 place subj1 3 control female
1 face subj1 5 control female

Merge using union of keys from both frames:


In [14]:
pd.merge(data, data2, on=['subid'], how='outer')


Out[14]:
cond subid value group gender
0 place subj1 3 control female
1 face subj1 5 control female
2 place subj2 3 NaN NaN
3 face subj2 7 NaN NaN

Concatenate dataframes row-wise


In [15]:
pd.concat([data, data2], ignore_index=True)


Out[15]:
cond gender group subid value
0 place NaN NaN subj1 3
1 face NaN NaN subj1 5
2 place NaN NaN subj2 3
3 face NaN NaN subj2 7
4 NaN female control subj1 NaN

Or do the same thing using the append() method:


In [16]:
data.append(data2, ignore_index=True)


Out[16]:
cond gender group subid value
0 place NaN NaN subj1 3
1 face NaN NaN subj1 5
2 place NaN NaN subj2 3
3 face NaN NaN subj2 7
4 NaN female control subj1 NaN

Explore a Dataset with Pandas

Define paths & files


In [17]:
data_file = 'objfam_groupcat_euc.csv'
username = os.getlogin()
data_dir = op.join('/Users', username, 'Dropbox/Code/tutorial/')

In [18]:
data_dir


Out[18]:
'/Users/sgagnon/Dropbox/Code/tutorial/'

Create DataFrame from CSV with Pandas


In [19]:
df = pd.read_csv(op.join(data_dir, data_file))

In [20]:
df.head(n=3)


Out[20]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
0 3 1 1 24 2 2 5 1.97 204.2603 4.27
1 3 1 2 8 2 1 2 3.26 0.0000 6.73
2 3 1 3 7 5 3 5 3.60 482.0321 3.55

In [21]:
df.describe()


Out[21]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
count 3420.000000 3420.000000 3420.000000 3420.000000 3420.000000 3420.000000 3393.000000 3420.000000 3420.000000 3420.000000
mean 13.105263 6.500000 8.000000 15.500000 3.500000 2.000000 3.495726 1.529977 182.595322 5.039947
std 6.852008 3.452557 4.321126 8.656707 1.708075 0.816616 1.430195 0.549428 158.526164 1.209830
min 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 2.550000
25% 7.000000 3.750000 4.000000 8.000000 2.000000 1.000000 2.000000 1.150000 0.000000 3.910000
50% 12.000000 6.500000 8.000000 15.500000 3.500000 2.000000 4.000000 1.470000 179.376500 4.910000
75% 20.000000 9.250000 12.000000 23.000000 5.000000 3.000000 5.000000 1.830000 307.410375 6.360000
max 25.000000 12.000000 15.000000 30.000000 6.000000 3.000000 5.000000 4.000000 662.535900 6.910000

Selection by position


In [22]:
df.iloc[3]


Out[22]:
Subject         3.0000
Run             1.0000
Trial           4.0000
Prototype      15.0000
Pair            2.0000
Morph           2.0000
Response        2.0000
RT              1.8200
EuclidDist    159.5645
PercepDist      5.0900
Name: 3, dtype: float64

Selection by query


In [23]:
df.query('Subject == 3 & EuclidDist > 500')


Out[23]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
6 3 1 7 30 4 3 2 2.66 508.4457 3.82

In [24]:
df[(df.Subject == 3) & (df.EuclidDist > 500)]


Out[24]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
6 3 1 7 30 4 3 2 2.66 508.4457 3.82

In [25]:
df[(df['Subject'] == 3) & (df['EuclidDist'] > 500)]


Out[25]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
6 3 1 7 30 4 3 2 2.66 508.4457 3.82

In [26]:
df_identical = df.query('Morph == 1')
df_identical.head()


Out[26]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
1 3 1 2 8 2 1 2 3.26 0 6.73
4 3 1 5 19 1 1 4 3.47 0 6.27
11 3 1 12 29 3 1 4 1.55 0 6.45
12 3 1 13 13 1 1 5 1.97 0 6.27
14 3 1 15 1 5 1 3 1.43 0 6.73

Boolean indexing


In [27]:
df_resp_old = df[df.Response > 3]
df_resp_old.head()


Out[27]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
0 3 1 1 24 2 2 5 1.97 204.2603 4.27
2 3 1 3 7 5 3 5 3.60 482.0321 3.55
4 3 1 5 19 1 1 4 3.47 0.0000 6.27
8 3 1 9 22 2 2 5 1.27 164.6218 4.82
9 3 1 10 18 5 2 5 1.80 387.6952 4.36

In [28]:
select_prototypes = [7, 3]
df_select = df[df.Prototype.isin(select_prototypes)]

In [29]:
df_select.head()


Out[29]:
Subject Run Trial Prototype Pair Morph Response RT EuclidDist PercepDist
2 3 1 3 7 5 3 5 3.60 482.0321 3.55
29 3 2 15 3 1 2 2 1.15 188.4949 5.00
34 3 3 5 7 2 3 1 2.36 363.1455 3.36
47 3 4 3 3 5 3 2 1.40 383.3653 3.91
74 3 5 15 7 3 2 3 1.50 124.4781 5.45

Group data by subject

Grouping follows the split-apply-combine pattern: split the data into groups based on some criterion (e.g., subject ID), apply a function to each group independently (e.g., mean, median), and combine the results into a new data structure.
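The pattern in miniature, with toy data (made up for illustration):

```python
import pandas as pd

trials = pd.DataFrame({'subid': ['s1', 's1', 's2', 's2'],
                       'rt':    [1.2, 1.8, 2.0, 2.4]})

# Split by subject, apply the mean to each group, combine into one frame
subj_means = trials.groupby('subid').rt.mean().reset_index()
# subid 's1' -> 1.5, 's2' -> 2.2
```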


In [30]:
grouped = df.groupby(['Subject', 'Morph'], sort=False).mean().reset_index()
grouped.head()


Out[30]:
Subject Morph Run Trial Prototype Pair Response RT EuclidDist PercepDist
0 3 2 6.5 7.733333 16.533333 3.433333 3.450000 1.484333 189.257535 4.971333
1 3 1 6.5 7.816667 13.916667 3.266667 3.883333 1.448500 0.000000 6.496167
2 3 3 6.5 8.450000 16.050000 3.800000 2.450000 1.583833 351.998067 3.713667
3 4 2 6.5 8.183333 16.350000 3.516667 3.616667 1.656500 186.934810 4.928667
4 4 1 6.5 7.616667 15.900000 3.500000 4.366667 1.476167 0.000000 6.489500

In [31]:
table = pd.pivot_table(df, values='RT', 
                       index=['Subject', 'Morph'],
                       columns=['Response'],
                       aggfunc=np.median)

In [32]:
table.head()


Out[32]:
Response 1.0 2.0 3.0 4.0 5.0
Subject Morph
3 1 1.55 1.480 1.445 1.49 1.130
2 1.40 1.735 1.500 1.83 1.270
3 1.14 1.615 1.485 1.64 1.675
4 1 1.55 2.090 NaN 1.78 1.160
2 1.71 1.870 NaN 1.63 1.170

In [33]:
table.stack()


Out[33]:
Subject  Morph  Response
3        1      1           1.550
                2           1.480
                3           1.445
                4           1.490
                5           1.130
         2      1           1.400
                2           1.735
                3           1.500
                4           1.830
                5           1.270
         3      1           1.140
                2           1.615
                3           1.485
                4           1.640
                5           1.675
...
24       3      2           1.030
                4           0.990
                5           0.900
25       1      1           1.570
                2           2.190
                4           1.950
                5           1.205
         2      1           1.920
                2           1.960
                4           2.045
                5           1.560
         3      1           1.900
                2           2.270
                4           2.135
                5           1.475
Length: 253, dtype: float64
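stack() moves the column labels into the innermost level of the row index (producing a long-format Series, as above), and unstack() reverses it. A self-contained sketch with toy values (not the tutorial data):

```python
import pandas as pd

wide = pd.DataFrame({'resp1': [1.5, 1.4], 'resp2': [1.8, 1.7]},
                    index=pd.Index(['s1', 's2'], name='Subject'))

# Wide -> long: a Series indexed by (Subject, column label)
long = wide.stack()

# Long -> wide: restore the original 2x2 layout
roundtrip = long.unstack()
```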

Plotting with Seaborn


In [34]:
grouped = df.groupby(['Subject', 'Morph']).mean().reset_index()

sns.lmplot(x='Morph', y='Response', 
           ci=68, data=grouped,
           x_estimator=np.mean, 
           color='dodgerblue')


Out[34]:
<seaborn.axisgrid.FacetGrid at 0x109214f50>

In [35]:
sns.distplot(df.EuclidDist, color='darkviolet')


Out[35]:
<matplotlib.axes.AxesSubplot at 0x1091c5850>

In [36]:
grouped = df.groupby(["Subject", 
                      "Run", 
                      "Morph"]).Response.mean().reset_index()

sns.tsplot(data=grouped, time='Run', 
           condition='Morph',
           unit='Subject',
           value='Response', 
           estimator=stats.nanmean)


Out[36]:
<matplotlib.axes.AxesSubplot at 0x10a0ece10>

In [37]:
sns.corrplot(df)


/Users/sgagnon/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
Out[37]:
<matplotlib.axes.AxesSubplot at 0x10a599650>

Interactive Widgets


In [38]:
from IPython.html.widgets import interact, interactive, fixed
from IPython.html import widgets
from IPython.display import clear_output, display, HTML

In [39]:
def plot_bysubid(subid):
    sns.lmplot(x='Morph', y='Response', 
               ci=68, data=df[df.Subject == subid],
               x_estimator=np.mean, 
               color='dodgerblue')
    plt.ylim(1,5)

In [40]:
i = interact(plot_bysubid,
             subid=widgets.FloatSliderWidget(min=3, 
                                             max=25, 
                                             step=1, 
                                             value=5))


Analyze dataframe with R


In [41]:
%load_ext rpy2.ipython

In [42]:
%R -i df

In [43]:
%%R
print(str(df))


'data.frame':	3420 obs. of  10 variables:
 $ Subject   : int [1:3420(1d)] 3 3 3 3 3 3 3 3 3 3 ...
 $ Run       : int [1:3420(1d)] 1 1 1 1 1 1 1 1 1 1 ...
 $ Trial     : int [1:3420(1d)] 1 2 3 4 5 6 7 8 9 10 ...
 $ Prototype : int [1:3420(1d)] 24 8 7 15 19 26 30 16 22 18 ...
 $ Pair      : int [1:3420(1d)] 2 2 5 2 1 1 4 2 2 5 ...
 $ Morph     : int [1:3420(1d)] 2 1 3 2 1 2 3 3 2 2 ...
 $ Response  : num [1:3420(1d)] 5 2 5 2 4 3 2 2 5 5 ...
 $ RT        : num [1:3420(1d)] 1.97 3.26 3.6 1.82 3.47 1.59 2.66 1.54 1.27 1.8 ...
 $ EuclidDist: num [1:3420(1d)] 204 0 482 160 0 ...
 $ PercepDist: num [1:3420(1d)] 4.27 6.73 3.55 5.09 6.27 4.82 3.82 4 4.82 4.36 ...
NULL

In [44]:
%%R
# Factor categorical var
df$Subject = factor(df$Subject)

# Load in libraries
require(lme4)
require(lmerTest)

# Run analysis
rs1 = lmer(Response~scale(EuclidDist) + (1|Subject), data=df)
rs2 = lmer(Response~scale(EuclidDist) + (1 + scale(EuclidDist)|Subject), data=df)

print(anova(rs1, rs2))
print(summary(rs2))


Loading required package: lme4
Loading required package: Matrix
Loading required package: Rcpp
Loading required package: lmerTest
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009

Attaching package: ‘lmerTest’

The following object is masked from ‘package:lme4’:

    lmer

The following object is masked from ‘package:stats’:

    step

refitting model(s) with ML (instead of REML)
Data: df
Models:
object: Response ~ scale(EuclidDist) + (1 | Subject)
..1: Response ~ scale(EuclidDist) + (1 + scale(EuclidDist) | Subject)
       Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
object  4 11442 11466 -5716.8    11434                             
..1     6 11388 11424 -5687.9    11376 57.894      2  2.682e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Linear mixed model fit by REML ['merModLmerTest']
Formula: Response ~ scale(EuclidDist) + (1 + scale(EuclidDist) | Subject)
   Data: df

REML criterion at convergence: 11383.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7706 -0.7665  0.2245  0.7206  2.8434 

Random effects:
 Groups   Name              Variance Std.Dev. Corr
 Subject  (Intercept)       0.05463  0.2337       
          scale(EuclidDist) 0.04708  0.2170   0.23
 Residual                   1.63990  1.2806       
Number of obs: 3393, groups:  Subject, 19

Fixed effects:
                  Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)        3.49508    0.05796 17.99400   60.31  < 2e-16 ***
scale(EuclidDist) -0.56113    0.05441 18.01100  -10.31 5.52e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
scl(EcldDs) 0.194 

fMRI Data

Load in nibabel to read/write NIfTI files


In [45]:
import nibabel as nb
from nibabel import load

In [46]:
fmri_file = 'smoothed_timeseries.nii.gz'
fmri_filepath = op.join(data_dir, fmri_file)

In [47]:
fmri_data = load(fmri_filepath)
func_arr = fmri_data.get_data()

Check out the data


In [48]:
print np.shape(func_arr)
print func_arr.shape


(64, 64, 30, 195)
(64, 64, 30, 195)

Plot a slice at a given TR


In [49]:
TR = 20
plt.imshow(func_arr[:,:,15, TR])


Out[49]:
<matplotlib.image.AxesImage at 0x111c46850>

Plot a range of slices


In [50]:
slices = range(1, np.shape(func_arr)[2], 5)
num_subplots = len(slices)

f, axes = plt.subplots(1,num_subplots)

for ax, slice in zip(axes, slices):
    ax.imshow(func_arr[:,:,slice, TR], )
    ax.set_axis_off()
    
plt.tight_layout()


Load in mask file


In [51]:
mask_file = 'Bilat-Hippocampus.nii.gz'
mask_filepath = op.join(data_dir, mask_file)

In [52]:
mask_data = load(mask_filepath)
mask_arr = mask_data.get_data().astype(bool)

In [53]:
num_voxels = mask_arr.sum(); print '# Voxels = ' + str(num_voxels)
mask_dim = mask_arr.shape; print 'Mask Dim = ' + str(mask_dim)


# Voxels = 165
Mask Dim = (64, 64, 30)

Plot mask


In [54]:
plt.imshow(mask_arr[:,:,8])


Out[54]:
<matplotlib.image.AxesImage at 0x110457e90>

Load in fixed effect zstats & mask


In [55]:
fmri_file = 'zstat1.nii.gz'
fmri_filepath = op.join(data_dir, fmri_file)

fmri_data = load(fmri_filepath)
func_arr = fmri_data.get_data()

In [56]:
func_arr.shape


Out[56]:
(64, 64, 30)

In [57]:
func_masked = func_arr[mask_arr]
func_masked.shape


Out[57]:
(165,)
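Indexing a volume with a boolean mask of the same shape flattens the selected voxels into a 1D array, as above; applied to 4D data, the same pattern keeps the time axis, giving voxels-by-timepoints. A sketch with small synthetic arrays (shapes chosen only for illustration):

```python
import numpy as np

func = np.random.randn(4, 4, 3)        # stand-in for a 3D zstat volume
mask = np.zeros((4, 4, 3), dtype=bool)
mask[1:3, 1:3, 1] = True               # 4 "voxels" in the ROI

roi_vals = func[mask]                  # 1D array, one value per masked voxel

# With 4D data, masking the first three axes preserves the time axis
func4d = np.random.randn(4, 4, 3, 10)  # x, y, z, TR
roi_ts = func4d[mask]                  # voxels x timepoints
roi_mean_ts = roi_ts.mean(axis=0)      # mean ROI time course
```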

In [58]:
sns.distplot(func_masked, color='dodgerblue')
plt.xlabel('zstat')
plt.vlines(x=0, ymin=0, ymax=.25, linestyles='dashed')


Out[58]:
<matplotlib.collections.LineCollection at 0x11791e910>

Plot the Daily Puppy


In [59]:
sns.puppyplot()


Out[59]:
