(From Abdi, Edelman, Valentin & Dowling, 2009; Williams, Krishnan, & Abdi, 2009)
In the 1970s Elizabeth Loftus (Loftus and Palmer, 1974) conducted a series of experiments on eyewitness testimony. She wanted to find out whether the wording of a question affected the later responses of witnesses. To do this she showed subjects a film of a car accident and then asked them a series of questions. Among the questions was one of 5 versions of a critical question concerning the speed, in miles per hour, at which the vehicles were traveling. Here are the 5 versions of the question along with the experimental condition in which they occurred:
HIT: About how fast were the cars going when they hit each other?
SMASH: About how fast were the cars going when they smashed into each other?
COLLIDE: About how fast were the cars going when they collided with each other?
BUMP: About how fast were the cars going when they bumped each other?
CONTACT: About how fast were the cars going when they contacted each other?
Software Carpentry conducted a (fictitious) replication of the study in an online survey. The results are in the file loftus.csv.
We want to be able to answer the question: Does the wording of the question affect witnesses' estimation of the speed the vehicles were traveling?
Download and install the statsmodels 0.5 module. First clone the repository from GitHub with git clone git://github.com/statsmodels/statsmodels.git. You will also need to install patsy: using pip, enter pip install patsy at the command line. Everything else should have come with your installation of Anaconda. After installing patsy, navigate to the directory where you put the statsmodels repository; once there, type python setup.py build and finally python setup.py install.
The statsmodels module has many useful statistical features that will benefit your work. The latest release lets you build statistical models using R-like formula syntax, but with the readability of Python. Documentation is available from http://statsmodels.sourceforge.net/.
In [1]:
datafile = open('loftus.csv', 'rU')
data = []
for row in datafile:
    data.append(row.strip().split(','))
datafile.close()
Remember to document your code and call your function something memorable so you will remember what you have done later.
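For instance, a well-documented helper might look like this (a hypothetical function, not part of the analysis):

```python
def mean_speed(speeds):
    '''(list of float) -> float
    Hypothetical example: return the mean of a list of
    speed estimates in miles per hour.
    '''
    return sum(speeds) / len(speeds)

# The docstring is then available via help(mean_speed)
```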
In [2]:
def open_csv_file(filename):
    '''(str) -> list
    open_csv_file opens a CSV file called filename and
    returns its contents as a list of rows, each row a
    list of strings.
    example:
    open_csv_file('loftus.csv')
    '''
    # Open the datafile
    datafile = open(filename, 'rU')
    # Initialize an empty list
    data = []
    # Build the data list, one row at a time
    for row in datafile:
        data.append(row.strip().split(','))
    datafile.close()
    return data

data = open_csv_file('loftus.csv')
In [3]:
def open_csv_file(filename):
    '''(str) -> (list, list)
    open_csv_file opens a CSV file called filename and
    returns its contents as a list of rows, plus the
    header row. Note that the first row is ALWAYS
    considered to be a header row!
    example:
    open_csv_file('loftus.csv')
    '''
    # Open the datafile
    datafile = open(filename, 'rU')
    # Initialize an empty list
    data = []
    # Define counter start number
    rownum = 0
    for row in datafile:
        if rownum == 0:
            # Create header
            header = row.strip().split(',')
        else:
            # Append the row to the data list
            data.append(row.strip().split(','))
        rownum += 1
    datafile.close()
    return data, header
But this gets dangerous when we want to reuse our code, because much of the time (unless it is our own data) we don't know whether the data has a header row.
Import the csv module.
As a matter of good practice, import your modules at the beginning of your .py file. Here we are importing them as we need them, for explanation purposes only.
In [4]:
import csv
In [5]:
def open_csv_file(filename):
    '''(str) -> (list, list)
    open_csv_file opens a CSV file called filename and
    returns its contents as a list of rows, plus the
    header (if any). It uses the csv module to determine
    whether or not there is a row of column headers.
    example:
    open_csv_file('loftus.csv')
    '''
    # Open the datafile
    datafile = open(filename, 'rU')
    # Find out whether there is a header row
    hasheader = csv.Sniffer().has_header(datafile.read(1024))
    datafile.seek(0)
    data = []
    header = []
    rownum = 0
    if hasheader:
        for row in datafile:
            if rownum == 0:
                header = row.strip().split(',')
                print 'Data file has a header row:'
                print header
            else:
                data.append(row.strip().split(','))
            rownum += 1
    else:
        for row in datafile:
            if rownum == 0:
                print 'Data file does not have a header row'
            data.append(row.strip().split(','))
            rownum += 1
    datafile.close()
    return data, header
data, header = open_csv_file('loftus.csv')
Now check that the output of the function makes sense.
In [6]:
header
Out[6]:
In [7]:
data[0:10]
Out[7]:
It looks like the csv module got it correct.
In [8]:
type(header)
Out[8]:
Now the data:
In [9]:
type(data)
Out[9]:
In [10]:
len(header)
Out[10]:
Now let's do the same for data.
First, let's see how many observations are in the list
In [11]:
len(data)
Out[11]:
Now let's check that each element of data is a list with 7 elements in it (to go with our headers). We might want to do this again, so let's write it as a function.
In [12]:
def elements_in_row(data, header):
    '''(list, list) -> str
    Check that every row of data has the same number of
    elements as the header, printing any rows that do not.
    '''
    h = len(header)
    rownum = 0
    for row in data:
        r = len(row)
        if r != h:
            print 'Row ' + str(rownum) + ' does not have ' + str(h) + ' elements.'
            print row
        rownum += 1
    return 'All done! Any rows without ' + str(h) + ' elements are listed above.'

elements_in_row(data, header)
Out[12]:
In [13]:
colspeed = header.index('Estimated Speed (mph)')
print colspeed
In [14]:
colcondname = header.index('Condition Name')
print colcondname
In [15]:
speed = [x[colspeed] for x in data]
print speed[0:10]
In [16]:
condname = [x[colcondname] for x in data]
print condname[0:10]
In [17]:
def get_condition_names(condname):
    '''(list of str) -> list of str
    get_condition_names takes a list of strings and returns
    a list of the unique strings contained within the
    original list.
    >>> get_condition_names(condname)
    ['Smash', 'Hit', 'Bump', 'Collide', 'Contact', 'Sash']
    '''
    conditions = list(set(condname))
    return conditions

conditions = get_condition_names(condname)
print conditions
And now we can use len to get the number of different conditions.
In [18]:
ncond = len(conditions)
print ncond
Not so good. We have a listing for 6 conditions, when we really have 5.
HIT: About how fast were the cars going when they hit each other?
SMASH: About how fast were the cars going when they smashed into each other?
COLLIDE: About how fast were the cars going when they collided with each other?
BUMP: About how fast were the cars going when they bumped each other?
CONTACT: About how fast were the cars going when they contacted each other?
It looks like somebody mistyped the label for the Smash condition. We need to fix this before we go on.
We need to check whether our hypothesis is correct. Let's count the number of instances of each condition.
In [19]:
def count_observations_in_condition(condname):
    '''(list of str) -> None
    count_observations_in_condition takes a list of condition
    labels and prints the number of observations per unique
    label.
    Requires:
    get_condition_names
    Example:
    >>> count_observations_in_condition(condname)
    Smash 5000
    Hit 5000
    Bump 5000
    Collide 5000
    Contact 5000
    '''
    conditions = get_condition_names(condname)
    idx = 0
    indices = []
    for cond in conditions:
        indices.append([i for i, x in enumerate(condname) if x == cond])
        print cond + '\t' + str(len(indices[idx]))
        idx += 1

count_observations_in_condition(condname)
It looks like we were correct. Let's replace the elements Sash with Smash.
In this case it is fairly easy because we have a balanced design, but watch out if you don't have an equal number of observations in each condition.
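With an unbalanced design, a frequency table makes a stray label stand out. Here is a minimal sketch using Counter from the standard library, with a made-up list of labels (not the survey data):

```python
from collections import Counter

# Made-up labels containing a typo ('Sash') and unequal counts
labels = ['Smash', 'Hit', 'Sash', 'Smash', 'Hit', 'Smash']
counts = Counter(labels)
# counts['Smash'] is 3 and counts['Sash'] is 1, so the typo is obvious
```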
In [20]:
def change_condition_name(Incorrect_Name, Correct_Name, Vector_of_Conditions):
    '''(str, str, list of str) -> list of str
    Replace every occurrence of Incorrect_Name in
    Vector_of_Conditions with Correct_Name, in place.
    '''
    for idx, item in enumerate(Vector_of_Conditions):
        if item == Incorrect_Name:
            Vector_of_Conditions[idx] = Correct_Name
            print "Element " + str(idx) + " (" + str(item) + ") replaced with " + Correct_Name
    return Vector_of_Conditions

condname = change_condition_name('Sash', 'Smash', condname)
And let's check our numbers again.
In [21]:
count_observations_in_condition(condname)
In [22]:
type(speed)
Out[22]:
Now the elements:
In [23]:
def get_type_elements(variable):
    '''(list) -> None
    Print the set of distinct types found among the
    elements of variable.
    '''
    format = []
    for item in variable:
        format.append(type(item))
    print list(set(format))

get_type_elements(speed)
Oops. The data is in the wrong format. We need a list of numbers, not strings. We'll use floating point numbers to accommodate any observations that have decimal points in them. You could check this too, if you wanted.
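If you did want to check convertibility first, a small helper could attempt the conversion and catch failures. This is a sketch; the helper name is made up:

```python
def is_floatable(s):
    '''(str) -> bool
    Hypothetical helper: True if s can be converted to float.
    '''
    try:
        float(s)
        return True
    except ValueError:
        return False

# is_floatable('31.5') is True; is_floatable('fast') is False
```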
In [24]:
def change_data_format_to_float(variable):
    '''(list of str) -> list of float
    Convert every element of variable to a float.
    '''
    variablefloat = []
    for item in variable:
        variablefloat.append(float(item))
    return variablefloat

speedfloat = change_data_format_to_float(speed)
get_type_elements(speedfloat)
In [25]:
def create_dictionary(condname, speed):
    '''(list of str, list of float) -> dict
    Build a dictionary mapping each condition name to the
    list of speed estimates observed in that condition.
    '''
    datadict = dict()
    count = 0
    for name in condname:
        if name in datadict:
            # Append the new number to the existing list at this slot
            datadict[name].append(speed[count])
        else:
            # Create a new list in this slot
            datadict[name] = [speed[count]]
        count += 1
    return datadict

datadict = create_dictionary(condname, speedfloat)
In [26]:
import pandas as pd
import numpy as np
from scipy import stats
from patsy import dmatrices
Now, we can turn our dictionary into a pandas DataFrame.
In [27]:
df = pd.DataFrame(datadict)
print df
In [28]:
print df.head()
In [29]:
print df.tail()
In [30]:
condmean = df.mean()
print condmean
In [31]:
condvar = df.var()
print condvar
In [32]:
condstd = df.std()
print condstd
Or, we can get them all together:
In [33]:
descriptives = df.describe()
print descriptives
In [34]:
%pylab inline
from pylab import *
Now, let's plot our means
In [35]:
df.mean().plot(kind='bar')
Out[35]:
and take a look at the boxplot
In [36]:
plt.figure();
bp = df.boxplot()
In [37]:
#import statsmodels.api as sm
import statsmodels.formula.api as sm
from statsmodels.stats.anova import anova_lm
import statsmodels.sandbox as sand
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
MultiComparison)
Although the format of df is great for getting descriptive statistics, we need a different format for running the ANOVA: corresponding vectors of data.
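The "long" format keeps one vector of labels and one vector of values whose elements line up, one pair per observation. A pure-Python sketch with a tiny invented dataset (not the survey data):

```python
# Reshape a dict of condition -> observations ('wide') into two
# corresponding vectors ('long'); the numbers here are made up
wide = {'Hit': [34.0, 30.0], 'Smash': [41.0, 39.0]}
cond_long = []
speed_long = []
for cond in sorted(wide):
    for s in wide[cond]:
        cond_long.append(cond)
        speed_long.append(s)
# cond_long and speed_long now line up element by element
```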
In [38]:
df_long = pd.DataFrame(zip(condname,speedfloat), columns=['condname', 'speed'])
df_long.head()
Out[38]:
In [39]:
df_long.tail()
Out[39]:
Now to actually run the ANOVA. Just as a reminder (for those whose stats are a bit rusty), an ANOVA is just a special case of OLS regression, so we will run the regression model here.
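To see that equivalence concretely: with a single categorical predictor, the OLS coefficients are just group means in disguise (the intercept is the mean of the reference group, and each slope is a difference of means). A sketch with two invented groups:

```python
# Two made-up groups of speed estimates
group_a = [30.0, 34.0, 32.0]   # reference group
group_b = [40.0, 44.0, 42.0]
# Fitting speed = b0 + b1 * dummy (dummy = 1 for group_b) by OLS gives:
b0 = sum(group_a) / len(group_a)        # intercept: mean of group_a
b1 = sum(group_b) / len(group_b) - b0   # slope: difference of means
```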
Let's set up the model:
In [40]:
model = sm.ols('speed ~ C(condname)',df_long)
where speed is the dependent variable, condname is the independent variable, and C() indicates that condname should be treated as a categorical variable.
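Under the hood, C() expands condname into indicator columns, one per level except a reference level (treatment coding). A pure-Python sketch of that coding, assuming the levels are taken in alphabetical order with 'Bump' as the reference:

```python
levels = ['Bump', 'Collide', 'Contact', 'Hit', 'Smash']

# Each observation becomes an intercept column plus one 0/1
# indicator per non-reference level
def dummy_row(cond):
    return [1] + [1 if cond == lev else 0 for lev in levels[1:]]

# dummy_row('Bump') gives [1, 0, 0, 0, 0] (all indicators off)
# dummy_row('Smash') gives [1, 0, 0, 0, 1]
```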
Now, we can run the analysis.
In [41]:
result = model.fit()
In [42]:
print result.summary()
Note that the result object has many useful attributes. For example, we can extract the parameter estimates and r-squared.
In [43]:
result.params
Out[43]:
In [44]:
result.rsquared
Out[44]:
To present the results as a traditional ANOVA table
In [45]:
anova_lm(result)
Out[45]:
Now to run the post-hocs.
In [46]:
conditions = pd.Categorical.from_array(df_long['condname'])
print conditions
In [47]:
print conditions.levels
In [48]:
res2 = pairwise_tukeyhsd(df_long['speed'], df_long['condname'])
print res2
In [49]:
mc = MultiComparison(df_long['speed'],df_long['condname'])
In [50]:
posthoc = mc.tukeyhsd()
print posthoc.summary()
print zip(list((conditions.levels)), range(5))