Putting it all together

Taking off with Loftus ...

(From Abdi, Edelman, Valentin & Dowling, 2009; Williams, Krishnan, & Abdi, 2009)


In the 1970s Elizabeth Loftus (Loftus and Palmer, 1974) conducted a series of experiments on eyewitness testimony. She wanted to find out whether the wording of a question affected the later responses of witnesses. To do this she showed subjects a film of a car accident and then asked them a series of questions. Among the questions was one of 5 versions of a critical question concerning the speed, in miles per hour, at which the vehicles were traveling. Here are the 5 versions of the question along with the experimental condition in which they occurred:

  • HIT: About how fast were the cars going when they hit each other?

  • SMASH: About how fast were the cars going when they smashed into each other?

  • COLLIDE: About how fast were the cars going when they collided with each other?

  • BUMP: About how fast were the cars going when they bumped each other?

  • CONTACT: About how fast were the cars going when they contacted each other?

Software Carpentry conducted a (fictitious) replication of the study as an online survey. The results are in the file loftus.csv.

We want to be able to answer the question: Does the wording of the question affect witnesses' estimation of the speed the vehicles were traveling?


Before we begin

Download and install version 0.5 of the statsmodels module. First, clone the repository from GitHub with git clone git://github.com/statsmodels/statsmodels.git. You will also need to install patsy: at the command line, enter pip install patsy. Everything else should have come with your installation of Anaconda. After installing patsy, navigate to the directory where you cloned the statsmodels repository and run python setup.py build followed by python setup.py install.
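
The full sequence of commands, assuming you want the statsmodels source cloned into your current directory:

```shell
# clone the statsmodels source (read-only git protocol URL)
git clone git://github.com/statsmodels/statsmodels.git
# install the patsy dependency
pip install patsy
# build and install statsmodels from source
cd statsmodels
python setup.py build
python setup.py install
```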

The statsmodels module has many useful statistical features that will benefit your work. The latest release allows you to build statistical models using R-like formula syntax, but with the readability of Python. Documentation is available from http://statsmodels.sourceforge.net/.

Step 1: Import the .csv file


In [1]:
datafile = open('loftus.csv', 'rU')
data = []
for row in datafile:
    data.append(row.strip().split(','))
datafile.close()

Step 1a: Turn this bit of code into a function

Remember to document your code and call your function something memorable so you will remember what you have done later.


In [2]:
def open_csv_file(filename):
    '''(str) -> list
    
    open_csv_file opens a csv file called filename and 
    returns the information in the datafile in python 
    readable format
    
    example:
    open_csv_file('loftus.csv')
    '''
    
    # Open the datafile
    datafile = open(filename, 'rU')
    # Initialize an empty array
    data = []
    # Create data array
    for row in datafile:
        data.append(row.strip().split(','))
    datafile.close()
    return data

data = open_csv_file('loftus.csv')

Step 2: Get header row

If you know that the data file has a header row, you can update your function to get the header row


In [3]:
def open_csv_file(filename):
    '''(str) -> (list, list)
    
    open_csv_file opens a csv file called filename and 
    returns the information in the datafile in python 
    readable format. Note that the first row is ALWAYS
    considered to be a header row!
    
    example:
    open_csv_file('loftus.csv')
    '''
    
    # Open the datafile
    datafile = open(filename, 'rU')
    # Initialize an empty array
    data = []
    # define counter start number
    rownum = 0
    for row in datafile:
        if rownum == 0:
            # Create header
            header = row.strip().split(',')
        else:
            # Create data array
            data.append(row.strip().split(','))
        rownum += 1
    datafile.close()
    return data, header

But this gets dangerous when we want to reuse our code, because a lot of the time (unless it is our own data) we don't know whether the data file has a header row.

Import the csv module.

By convention, import your modules at the beginning of your .py file. Here we are importing them as we need them, for explanation purposes only.


In [4]:
import csv

The has_header method of the Sniffer class of the csv module returns True if the first line of the data is likely to be a header row.
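
For example, on a small made-up sample (not the loftus data) with numeric data columns under text headers, the sniffer votes for a header row:

```python
import csv

# hypothetical two-column sample: numeric values under text column names
sample = 'Participant,Speed\n1,35.34\n2,33.18\n3,45.32\n'
sniffer = csv.Sniffer()
print(sniffer.has_header(sample))  # True
```

The result is a heuristic: the sniffer compares the types and lengths of the first row's fields against the rows below it, so it can be fooled by unusual files.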

Let's update our function to get the header row.


In [5]:
def open_csv_file(filename):
    '''(str) -> (list, list)
    
    open_csv_file opens a csv file called filename and 
    returns the information in the datafile in python 
    readable format. It uses the csv module to determine
    whether or not there is a row of column headers
    
    example:
    open_csv_file('loftus.csv')
    '''
    
    # Open the datafile
    datafile = open(filename, 'rU')
    # Find out whether there is a header row
    hasheader = csv.Sniffer().has_header(datafile.read(1024))
    datafile.seek(0)
    
    data = []
    header = []
    rownum = 0
    if hasheader:
        for row in datafile:
            if rownum == 0:
                header = row.strip().split(',')
                print 'Data file has a header row:' 
                print header
            else:
                data.append(row.strip().split(','))
            rownum += 1
    else:
        for row in datafile:
            if rownum == 0:
                print 'Data file does not have a header row'
            data.append(row.strip().split(','))
            rownum += 1
    datafile.close()
    return data, header


data, header = open_csv_file('loftus.csv')


Data file has a header row:
['Participant', 'Condition Name', 'Condition Number', 'Age', 'IQ', 'Estimated Speed (mph)', 'Reaction Time (ms)']

Now check that the output of the function makes sense.


In [6]:
header


Out[6]:
['Participant',
 'Condition Name',
 'Condition Number',
 'Age',
 'IQ',
 'Estimated Speed (mph)',
 'Reaction Time (ms)']

In [7]:
data[0:10]


Out[7]:
[['1', 'Contact', '1', '38', '79', '35.34309684', '366.3496022'],
 ['2', 'Contact', '1', '60', '115', '33.17929371', '103.3719955'],
 ['3', 'Contact', '1', '55', '99', '45.32379901', '117.1909298'],
 ['4', 'Contact', '1', '37', '86', '42.66942264', '121.062926'],
 ['5', 'Contact', '1', '26', '96', '24.3133367', '300.7980271'],
 ['6', 'Contact', '1', '35', '100', '20.01894596', '353.0088051'],
 ['7', 'Contact', '1', '49', '117', '29.69251209', '228.1920257'],
 ['8', 'Contact', '1', '41', '114', '27.38630052', '254.788998'],
 ['9', 'Contact', '1', '53', '104', '27.50141729', '636.4364611'],
 ['10', 'Contact', '1', '56', '88', '39.48336299', '216.9692579']]

It looks like the csv module got it right.

Step 3: Examine your data set.

What type of data do we have after import?

First, the header:


In [8]:
type(header)


Out[8]:
list

Now the data:


In [9]:
type(data)


Out[9]:
list

What are the variables in your data set?

First, let's get the number of list elements in the header list


In [10]:
len(header)


Out[10]:
7

Now let's do the same for data. First, let's see how many observations are in the list


In [11]:
len(data)


Out[11]:
25000

Now let's check that each list element has a list with 7 elements in it (to go with our headers). We might want to do this again, so let's write this as a function


In [12]:
def elements_in_row(data,header):
    '''(list, list) -> str

    elements_in_row checks that every row in data has the same
    number of elements as the header row and prints any row
    that does not.
    '''
    h = len(header)
    ok = True
    for rownum, row in enumerate(data):
        if len(row) != h:
            print 'Row ' + str(rownum) + ' does not have ' + str(h) + ' elements.'
            print row
            ok = False
    if ok:
        return 'All done! All rows have ' + str(h) + ' elements.'
    return 'Some rows are the wrong length; see above.'
    
elements_in_row(data,header)


Out[12]:
'All done! All rows have 7 elements.'

Find the columns associated with each of the variables of interest.

We want to look at the estimated speed and condition type. First, let's extract the column indices.


In [13]:
colspeed = header.index('Estimated Speed (mph)')
print colspeed


5

In [14]:
colcondname = header.index('Condition Name')
print colcondname


1

Now, extract the data for the estimated speed and the condition name.


In [15]:
speed = [x[colspeed] for x in data]
print speed[0:10]


['35.34309684', '33.17929371', '45.32379901', '42.66942264', '24.3133367', '20.01894596', '29.69251209', '27.38630052', '27.50141729', '39.48336299']

In [16]:
condname = [x[colcondname] for x in data]
print condname[0:10]


['Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact']

Now let's see how many conditions we have and what the conditions are.

First, we need to get the unique conditions.

The set function will give you a set of unique names. To turn this set back into a list, use the list function.
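
A quick illustration on a toy list (not the real data). Note that a set is unordered, so sort the result if you need a predictable order:

```python
# toy list of condition labels with duplicates
names = ['Hit', 'Smash', 'Hit', 'Bump', 'Smash']
# set drops duplicates; list turns the set back into a list
unique_names = list(set(names))
print(sorted(unique_names))  # ['Bump', 'Hit', 'Smash']
```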


In [17]:
def get_condition_names(condname):
    ''' list(str) -> list(str)
    
    get_condition_names takes a list of strings and returns 
    a list of the unique strings contained within the 
    original list.

    >>> get_condition_names(condname)
    ['Smash', 'Hit', 'Bump', 'Collide', 'Contact', 'Sash']
    '''
    conditions = list(set(condname))
    return conditions

conditions = get_condition_names(condname)
print conditions


['Smash', 'Hit', 'Bump', 'Collide', 'Contact', 'Sash']

And now we can use len to get the number of different conditions.


In [18]:
ncond = len(conditions)
print ncond


6

Not so good. We have a listing for 6 conditions, when we really have 5.

  • HIT: About how fast were the cars going when they hit each other?

  • SMASH: About how fast were the cars going when they smashed into each other?

  • COLLIDE: About how fast were the cars going when they collided with each other?

  • BUMP: About how fast were the cars going when they bumped each other?

  • CONTACT: About how fast were the cars going when they contacted each other?

It looks like somebody mistyped the label for the Smash condition. We need to fix this before we go on.

We need to check whether our hypothesis is correct. Let's count the number of instances of each condition.


In [19]:
def count_observations_in_condition(condname):
    ''' list(str) -> None

    count_observations_in_condition takes a list of condition
    names and prints the number of observations per unique name.

    Requires:
    get_condition_names

    Example:
    >>> count_observations_in_condition(condname)
    Smash   5000
    Hit     5000
    Bump    5000
    Collide 5000
    Contact 5000
    '''
    conditions = get_condition_names(condname)
    idx = 0
    indices = []
    for cond in conditions:
        indices.append([i for i, x in enumerate(condname) if x == cond])
        print cond + '\t' + str(len(indices[idx]))
        idx += 1
    
count_observations_in_condition(condname)


Smash	4998
Hit	5000
Bump	5000
Collide	5000
Contact	5000
Sash	2
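
As an aside, the standard library's collections.Counter computes such counts in a single pass; a sketch on a toy list (not the loftus data):

```python
from collections import Counter

# toy condition labels, including a misspelled one
labels = ['Hit', 'Smash', 'Hit', 'Sash', 'Smash', 'Hit']
counts = Counter(labels)
print(counts['Hit'])   # 3
print(counts['Sash'])  # 1
```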

It looks like we were correct. Let's replace the elements Sash with Smash.

In this case it is fairly easy because we have a balanced design, but watch out if you don't have an equal number of observations in each condition.


In [20]:
def change_condition_name(incorrect_name, correct_name, conditions):
    '''(str, str, list) -> list

    Replace every occurrence of incorrect_name in conditions
    with correct_name, reporting each replacement.
    '''
    for idx, item in enumerate(conditions):
        if item == incorrect_name:
            conditions[idx] = correct_name
            print "Element " + str(idx) + " (" + str(item) + ") replaced with " + correct_name
    return conditions

condname = change_condition_name('Sash','Smash',condname)


Element 20261 (Sash) replaced with Smash
Element 20282 (Sash) replaced with Smash

And let's check our numbers again.


In [21]:
count_observations_in_condition(condname)


Bump	5000
Smash	5000
Contact	5000
Hit	5000
Collide	5000

Now let's check that the data in the variable Estimated Speed (mph) are entered correctly.

Let's check the type of the variable speed.


In [22]:
type(speed)


Out[22]:
list

Now the elements:


In [23]:
def get_type_elements(variable):
    '''(list) -> None

    Print the set of types found among the elements of variable.
    '''
    format = []
    for item in variable:
        format.append(type(item))
    print list(set(format))

get_type_elements(speed)


[<type 'str'>]

Oops. The data is in the wrong format. We need a list of numbers, not strings. We'll use floating point numbers to accommodate any observations that have decimal points in them. You could check this too, if you wanted.
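
If you did want to check first, a small helper can test whether a string parses as a float before converting (a sketch; the helper name is our own):

```python
def looks_like_float(s):
    '''(str) -> bool  Return True if s can be converted to a float.'''
    try:
        float(s)
        return True
    except ValueError:
        return False

print(looks_like_float('35.34309684'))  # True
print(looks_like_float('thirty-five'))  # False
```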


In [24]:
def change_data_format_to_float(variable):
    '''(list of str) -> list of float

    Convert each element of variable to a floating point number.
    '''
    variablefloat = []
    for item in variable:
        variablefloat.append(float(item))
    return variablefloat

speedfloat = change_data_format_to_float(speed)
get_type_elements(speedfloat)


[<type 'float'>]

To make our life easier, let's create a dictionary to associate values and condition names


In [25]:
def create_dictionary(condname, values):
    '''(list of str, list of float) -> dict

    Build a dictionary that maps each condition name to the
    list of observed values for that condition.
    '''
    datadict = dict()
    for name, value in zip(condname, values):
        if name in datadict:
            # append the new number to the existing list at this slot
            datadict[name].append(value)
        else:
            # create a new list in this slot
            datadict[name] = [value]
    return datadict

datadict = create_dictionary(condname, speedfloat)
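
The same grouping can be written more compactly with collections.defaultdict, which creates the empty list for you on first access (a sketch with toy data):

```python
from collections import defaultdict

# toy (condition, speed) pairs, not the loftus data
pairs = [('Hit', 18.298), ('Smash', 40.810), ('Hit', 46.845)]
datadict = defaultdict(list)
for name, value in pairs:
    # a missing key is initialized to [] automatically
    datadict[name].append(value)
print(dict(datadict))  # {'Hit': [18.298, 46.845], 'Smash': [40.81]}
```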

Now we can reformat the data to use with the pandas and scipy modules for statistical analysis.

First, let's import the modules we will need


In [26]:
import pandas as pd
import numpy as np
from scipy import stats
from patsy import dmatrices

Now, we can turn our dictionary into a pandas dataframe.


In [27]:
df = pd.DataFrame(datadict)
print df


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Bump       5000  non-null values
Collide    5000  non-null values
Contact    5000  non-null values
Hit        5000  non-null values
Smash      5000  non-null values
dtypes: float64(5)

Let's take a look at the dataframe

We can call the system head and tail functions to do this


In [28]:
print df.head()


     Bump  Collide    Contact     Hit   Smash
0  45.601   43.872  35.343097  18.298  40.810
1  45.452   35.219  33.179294  46.845  48.751
2  42.071   30.035  45.323799  12.501  49.430
3  30.303   46.574  42.669423  32.120  51.688
4  27.146   40.785  24.313337  39.203  57.690

In [29]:
print df.tail()


        Bump  Collide  Contact     Hit   Smash
4995  23.902   43.902   27.352  23.896  42.378
4996  44.808   44.686   27.962  26.561  46.275
4997  58.433   39.162   42.062  35.872  35.883
4998  47.236   31.987   23.697  24.000  47.574
4999  31.359   49.443   23.715  34.286  48.763

Now it's time for some descriptive statistics.

We can look at each type of descriptive statistic separately:


In [30]:
condmean = df.mean()
print condmean


Bump       37.999997
Collide    41.000002
Contact    29.999999
Hit        34.999997
Smash      46.000005
dtype: float64

In [31]:
condvar = df.var()
print condvar


Bump       122.222092
Collide     41.555552
Contact    116.444323
Hit         86.444395
Smash       33.333344
dtype: float64

In [32]:
condstd = df.std()
print condstd


Bump       11.055410
Collide     6.446360
Contact    10.790937
Hit         9.297548
Smash       5.773504
dtype: float64

Or, we can get them all together:


In [33]:
descriptives = df.describe()
print descriptives


              Bump      Collide      Contact          Hit        Smash
count  5000.000000  5000.000000  5000.000000  5000.000000  5000.000000
mean     37.999997    41.000002    29.999999    34.999997    46.000005
std      11.055410     6.446360    10.790937     9.297548     5.773504
min       0.954110    19.711000    -7.928400     2.039300    24.962000
25%      30.680000    36.764250    22.807250    28.776750    42.158250
50%      38.002000    41.076500    30.086527    35.003500    46.083500
75%      45.564250    45.309250    37.378500    41.291750    49.801000
max      75.173000    61.102000    66.337000    68.553000    66.631000

A negative estimated speed is suspicious. Maybe it means the car was estimated to be travelling in reverse? Check with whoever gave you the data to be sure. In our case, that is indeed what it means.
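
One way to flag such values is a quick count of the negative observations; a sketch on a toy list of speeds (with the dataframe you could equivalently use (df &lt; 0).sum()):

```python
# toy speed estimates, not the loftus data
speeds = [35.34, -7.93, 24.31, 40.81, -0.52]
negatives = [s for s in speeds if s < 0]
print(len(negatives))  # 2
print(negatives)       # [-7.93, -0.52]
```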

Now we can plot our data

First, import pylab (this allows us to view our figures in the ipython notebook)


In [34]:
%pylab inline 

from pylab import *


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Now, let's plot our means


In [35]:
df.mean().plot(kind='bar')


Out[35]:
<matplotlib.axes.AxesSubplot at 0x107a88b90>

and take a look at the boxplot


In [36]:
plt.figure();
bp = df.boxplot()


Now, let's run an ANOVA

I like to have the full data output, so we will use the functions built into the statsmodels module.


In [37]:
#import statsmodels.api as sm
import statsmodels.formula.api as sm
from statsmodels.stats.anova import anova_lm
import statsmodels.sandbox as sand
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
                                         MultiComparison)

Although the format of df is great for getting descriptive statistics, we need a different format for running the ANOVA (corresponding vectors of data).


In [38]:
df_long = pd.DataFrame(zip(condname,speedfloat), columns=['condname', 'speed'])
df_long.head()


Out[38]:
  condname      speed
0  Contact  35.343097
1  Contact  33.179294
2  Contact  45.323799
3  Contact  42.669423
4  Contact  24.313337

In [39]:
df_long.tail()


Out[39]:
      condname   speed
24995    Smash  42.378
24996    Smash  46.275
24997    Smash  35.883
24998    Smash  47.574
24999    Smash  48.763

Now to actually run the ANOVA. Just as a reminder (for those whose stats are a bit rusty), an ANOVA is just a special case of OLS regression, so we will run the regression model here.

Let's set up the model:


In [40]:
model = sm.ols('speed ~ C(condname)',df_long)

where speed is the dependent variable, condname is the independent variable, and C() tells the model to treat condname as a categorical variable.

Now, we can run the analysis.


In [41]:
result = model.fit()

In [42]:
print result.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.267
Model:                            OLS   Adj. R-squared:                  0.267
Method:                 Least Squares   F-statistic:                     2281.
Date:                Thu, 05 Sep 2013   Prob (F-statistic):               0.00
Time:                        11:42:59   Log-Likelihood:                -90246.
No. Observations:               25000   AIC:                         1.805e+05
Df Residuals:                   24995   BIC:                         1.805e+05
Df Model:                           4                                         
==========================================================================================
                             coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                 38.0000      0.126    300.416      0.000        37.752    38.248
C(condname)[T.Collide]     3.0000      0.179     16.771      0.000         2.649     3.351
C(condname)[T.Contact]    -8.0000      0.179    -44.721      0.000        -8.351    -7.649
C(condname)[T.Hit]        -3.0000      0.179    -16.771      0.000        -3.351    -2.649
C(condname)[T.Smash]       8.0000      0.179     44.721      0.000         7.649     8.351
==============================================================================
Omnibus:                      210.373   Durbin-Watson:                   2.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              337.430
Skew:                          -0.025   Prob(JB):                     5.35e-74
Kurtosis:                       3.567   Cond. No.                         5.83
==============================================================================
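
The coefficients above use treatment (dummy) coding: Bump, the alphabetically first condition, is the reference level, the Intercept is its mean, and every other coefficient is that condition's difference from Bump. A quick sanity check against the condition means we computed earlier:

```python
# values read off the regression summary above
intercept = 38.0      # Intercept: mean of the reference level, Bump
coef_smash = 8.0      # C(condname)[T.Smash]: Smash minus Bump
coef_contact = -8.0   # C(condname)[T.Contact]: Contact minus Bump

print(intercept + coef_smash)    # 46.0, the Smash group mean
print(intercept + coef_contact)  # 30.0, the Contact group mean
```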

Note that the result object has many useful attributes. For example, we can extract parameter estimates and r-squared


In [43]:
result.params


Out[43]:
Intercept                 37.999997
C(condname)[T.Collide]     3.000004
C(condname)[T.Contact]    -7.999999
C(condname)[T.Hit]        -3.000000
C(condname)[T.Smash]       8.000008
dtype: float64

In [44]:
result.rsquared


Out[44]:
0.26743877202192234

To present the results as a traditional ANOVA table


In [45]:
anova_lm(result)


Out[45]:
                df          sum_sq        mean_sq           F  PR(>F)
C(condname)      4   730000.652181  182500.163045  2281.25372       0
Residual     24995  1999598.525461      79.999941         NaN     NaN

Now to run the post-hocs.


In [46]:
conditions = pd.Categorical.from_array(df_long['condname'])
print conditions


Categorical: condname
array(['Contact', 'Contact', 'Contact', ..., 'Smash', 'Smash', 'Smash'], dtype=object)
Levels (5): Index(['Bump', 'Collide', 'Contact', 'Hit', 'Smash'], dtype=object)

In [47]:
print conditions.levels


Index([Bump, Collide, Contact, Hit, Smash], dtype=object)

In [48]:
res2 = pairwise_tukeyhsd(df_long['speed'], df_long['condname'])
print res2


Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  0      1      3.0     2.512   3.488   True 
  0      2      -8.0    -8.488  -7.512  True 
  0      3      -3.0    -3.488  -2.512  True 
  0      4      8.0     7.512   8.488   True 
  1      2     -11.0   -11.488 -10.512  True 
  1      3      -6.0    -6.488  -5.512  True 
  1      4      5.0     4.512   5.488   True 
  2      3      5.0     4.512   5.488   True 
  2      4      16.0    15.512  16.488  True 
  3      4      11.0    10.512  11.488  True 
---------------------------------------------

In [49]:
mc = MultiComparison(df_long['speed'],df_long['condname'])

In [50]:
posthoc = mc.tukeyhsd()
print posthoc.summary() 
print zip(list((conditions.levels)), range(5))


Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  0      1      3.0     2.512   3.488   True 
  0      2      -8.0    -8.488  -7.512  True 
  0      3      -3.0    -3.488  -2.512  True 
  0      4      8.0     7.512   8.488   True 
  1      2     -11.0   -11.488 -10.512  True 
  1      3      -6.0    -6.488  -5.512  True 
  1      4      5.0     4.512   5.488   True 
  2      3      5.0     4.512   5.488   True 
  2      4      16.0    15.512  16.488  True 
  3      4      11.0    10.512  11.488  True 
---------------------------------------------
[('Bump', 0), ('Collide', 1), ('Contact', 2), ('Hit', 3), ('Smash', 4)]

So, does the wording of a question matter when eyewitnesses give testimony?

YES!

