Putting it all together

Taking off with Loftus ...

(From Abdi, Edelman, Valentin & Dowling, 2009; Williams, Krishnan, & Abdi, 2009)

In the 1970s Elizabeth Loftus (Loftus and Palmer, 1974) conducted a series of experiments on the theme of eyewitness testimony. They wanted to find out whether the wording of a question affected the later responses of witnesses. To do this she showed subjects a film of a car accident. Following the film, she asked them a series of questions. Among the questions was one of 5 versions of a critical question concerning the speed the vehicles were travelling in miles per hour. Here are the 5 versions of the question along with the experimental condition in which they occurred:

HIT: About how fast were the cars going when they hit each other?
SMASH: About how fast were the cars going when they smashed into each other?
COLLIDE: About how fast were the cars going when they collided with each other?
BUMP: About how fast were the cars going when they bumped each other?
CONTACT: About how fast were the cars going when they contacted each other?

Software Carpentry conducted a (ficticious) replication of the study in an online survey. The results are in the file loftus.csv.

We want to be able to answer the question: Does the wording of the question affect witnesses' estimation of the speed the vehicles were traveling?

Exercise:

Get the file loftus.csv from the class git repo (https://github.com/LJWilliams/swc_BerkeleyDLab)

Type:

git clone https://github.com/LJWilliams/swc_BerkeleyDLab

into the shell

Step 1: Import the .csv file

To import a .csv file into Python

open the data file into a variable



In [25]:

    
datafile = open('loftus.csv', 'rU')

write the rows of the datafile into a parsed file using a for loop



In [26]:

    
data = []
for row in datafile:
    data.append(row.strip().split(','))

close the datafile



In [27]:

    
datafile.close()

Exercise:

Because we will want to open all sorts of .csv files, turn the code to import the .csv files into a function

Remember to document your code and call your function something memorable so you will remember what you have done later.



In [28]:

    
def open_csv_file(filename):
    '''(str) -> str
    
    open_csv_file opens a csv file called filename and 
    returns the information in the datafile in python 
    readable format
    
    example:
    open_csv_file('loftus.csv')
    '''
    
    # Open the datafile
    datafile = open(filename, 'rU')
    # Initialize an empty array
    data = []
    # Create data array
    for row in datafile:
        data.append(row.strip().split(','))
    datafile.close()
    return data

data = open_csv_file('loftus.csv')

Step 2: Get header row

If you know that the data file has a header row, you can update your function to get the header row



In [29]:

    
def open_csv_file(filename):
    '''(str) -> str
    
    open_csv_file opens a csv file called filename and 
    returns the information in the datafile in python 
    readable format. Note that the first row is ALWAYS
    considered to be a header row!
    
    example:
    open_csv_file('loftus.csv')
    '''
    
    # Open the datafile
    datafile = open(filename, 'rU')
    # Initialize an empty array
    data = []
    # define counter start number
    rownum = 0
    for row in datafile:
        if rownum == 0:
            # Create header
            header = row
        else:
            # Create data array
            data.append(row.strip().split(','))
        rownum += 1
    datafile.close()
    return data, header

But this gets dangerous when we want to reuse our code because a lot of the time (unless it is our own data) you don't know if the data has a header row

Import the csv module .

To save processing time, import your modules at the beginning of your .py file. Here we are importing them as we need them for explanation purposes only.



In [30]:

    
import csv

The has_header method of the Sniffer class of the csv module returns True if the first line of the data is likely to be a header row.

Let's update our function to get the header row.



In [31]:

    
def open_csv_file(filename):
    '''(str) -> str
    
    open_csv_file opens a csv file called filename and 
    returns the information in the datafile in python 
    readable format. It uses the csv module to determine
    whether or not there is a row of column headers
    
    example:
    open_csv_file('loftus.csv')
    '''
    
    # Open the datafile
    datafile = open(filename, 'rU')
    # Find out whether there is a header row
    hasheader = csv.Sniffer().has_header(datafile.read(1024))
    datafile.seek(0)
    
    data = []
    header = []
    rownum = 0
    if hasheader == True:
        for row in datafile:
            if rownum == 0:
                header = row.strip().split(',')
                print 'Data file has a header row:' 
                print header
            else:
                data.append(row.strip().split(','))
            rownum += 1
    else:
        for row in datafile:
            if rownum == 0:
                print 'Data file does not have a header row'
            data.append(row.strip().split(','))
            rownum += 1
    datafile.close()
    return(data, header)


data, header = open_csv_file('loftus.csv')









    



Data file has a header row:
['Participant', 'Condition Name', 'Condition Number', 'Age', 'IQ', 'Estimated Speed (mph)', 'Reaction Time (ms)']

Now check that the output of the function makes sense.



In [32]:

    
header









    Out[32]:





['Participant',
 'Condition Name',
 'Condition Number',
 'Age',
 'IQ',
 'Estimated Speed (mph)',
 'Reaction Time (ms)']



In [33]:

    
data[0:10]









    Out[33]:





[['1', 'Contact', '1', '38', '79', '35.34309684', '366.3496022'],
 ['2', 'Contact', '1', '60', '115', '33.17929371', '103.3719955'],
 ['3', 'Contact', '1', '55', '99', '45.32379901', '117.1909298'],
 ['4', 'Contact', '1', '37', '86', '42.66942264', '121.062926'],
 ['5', 'Contact', '1', '26', '96', '24.3133367', '300.7980271'],
 ['6', 'Contact', '1', '35', '100', '20.01894596', '353.0088051'],
 ['7', 'Contact', '1', '49', '117', '29.69251209', '228.1920257'],
 ['8', 'Contact', '1', '41', '114', '27.38630052', '254.788998'],
 ['9', 'Contact', '1', '53', '104', '27.50141729', '636.4364611'],
 ['10', 'Contact', '1', '56', '88', '39.48336299', '216.9692579']]

It looks like the csv module got it correct.

Step 3: Examine your data set.

What type of data do we have after import?

First, the header:



In [34]:

    
type(header)









    Out[34]:





list

Now the data:



In [35]:

    
type(data)









    Out[35]:





list

What are the variables in your data set?

First, let's get the number of list elements in the header list



In [36]:

    
len(header)









    Out[36]:





7

Now let's do the same for data. First, let's see how many observations are in the list



In [37]:

    
len(data)









    Out[37]:





25000

Now let's check that each list element has a list with 7 elements in it (to go with our headers). We might want to do this again, so let's write this as a function



In [38]:

    
def elements_in_row(data,header):
    '''(list) -> str

       Elements in row 

    '''
    h = len(header)
    rownum = 0
    for row in data:
        r = len(row)
        if r != h:
            print 'Row ' + str(rownum) + ' does not have ' + str(h) + ' elements.'
            print row 
        rownum += 1
    return 'All done! All rows have ' + str(h) + ' elements.'
    
elements_in_row(data,header)









    Out[38]:





'All done! All rows have 7 elements.'

Find the columns associated with each of the variables of interest.

We want to look at the estimated speed and condition type. First, let's extract the column indices.



In [39]:

    
colspeed = header.index('Estimated Speed (mph)')
print colspeed



In [40]:

    
colcondname = header.index('Condition Name')
print colcondname

Now, extract the data for the estimated speed and the condition name.



In [41]:

    
speed = [x[colspeed] for x in data]
print speed[0:10]









    



['35.34309684', '33.17929371', '45.32379901', '42.66942264', '24.3133367', '20.01894596', '29.69251209', '27.38630052', '27.50141729', '39.48336299']



In [42]:

    
condname = [x[colcondname] for x in data]
print condname[0:10]









    



['Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact', 'Contact']

Now let's see how many conditions we have and what the conditions are.

First, we need to get the unique conditions.

The set function will give you a set of unique names. To turn this set back into a list, use the list function.



In [43]:

    
def get_condition_names(condname):
    ''' list(str) -> list(str)
    
    get_condition_names takes a list of strings and returns 
    a list of the unique strings contained within the 
    original list.

    >>> get_condition_names(condname)
    ['Smash', 'Hit', 'Bump', 'Collide', 'Contact', 'Sash']
    '''
    conditions = list(set(condname))
    return conditions

conditions = get_condition_names(condname)
print conditions









    



['Smash', 'Hit', 'Bump', 'Collide', 'Contact', 'Sash']

And now we can use len to get the number of different conditions.



In [44]:

    
ncond = len(conditions)
print ncond

Not so good. We have a listing for 6 conditions, when we really have 5.

HIT: About how fast were the cars going when they hit each other?
SMASH: About how fast were the cars going when they smashed into each other?
COLLIDE: About how fast were the cars going when they collided with each other?
BUMP: About how fast were the cars going when they bumped each other?
CONTACT: About how fast were the cars going when they contacted each other?

It looks like somebody mistyped the label for the Smash condition. We need to fix this before we go on.

We need to check whether our hypothesis is correct. Let's count the number of instances of each condition.



In [45]:

    
def count_observations_in_condition(condname):
    ''' list(str) list(str) -> int

    count_observations_in condition takes a list of unique strings and 
    returns the number of observations per string

    Requires:
    get_condition_names

    Example:
    >>> count_observations_in_condition(condname):
    Smash   5000
    Hit     5000
    Bump	5000
    Collide 5000
    Contact 5000
    '''
    conditions = get_condition_names(condname)
    idx = 0
    indices = []
    for cond in conditions:
        indices.append([i for i, x in enumerate(condname) if x == cond])
        print cond + '\t' + str(len(indices[idx]))
        idx += 1
    
count_observations_in_condition(condname)









    



Smash	4998
Hit	5000
Bump	5000
Collide	5000
Contact	5000
Sash	2

It looks like we were correct. Let's replace the elements Sash with Smash.

In this case it is fairly easy because we have a balanced design, but watch out if you don't have an equal number of observations in each condition.



In [46]:

    
def change_condition_name(Incorrect_Name,Correct_Name,Vector_of_Conditions):
    ''' str str list(str) -> list(str)
    
    change_codition_name takes 2 strings (the incorrect and correct
    condition names) and a list of strings and returns a list of 
    strings with the incorrect name replaced by the correct name.
    
    example:
    condname = change_condition_name('Sash','Smash',condname)
    
    where, condname is a list of strings.
    '''
    for idx, item in enumerate(Vector_of_Conditions):
        if item == Incorrect_Name:
            Vector_of_Conditions[idx] = Correct_Name
            print "Element " + str(idx) + " (" + str(item) + ") replaced with " + Correct_Name 
    return Vector_of_Conditions

condname = change_condition_name('Sash','Smash',condname)









    



Element 20261 (Sash) replaced with Smash
Element 20282 (Sash) replaced with Smash

And let's check our numbers again.



In [47]:

    
count_observations_in_condition(condname)









    



Bump	5000
Smash	5000
Contact	5000
Hit	5000
Collide	5000

Now let's check that the data in the variable `Estimated speed (mph)` are entered correctly.

Let's check the type of the variable speed.



In [48]:

    
type(speed)









    Out[48]:





list

Now the elements:



In [49]:

    
def get_type_elements(variable):
    ''' list -> str
    
    get_type_elements takes a list of numbers and prints
    the type of each variable to stdout
    
    example:
    get_type_elements(speed)
    '''
    count = 0
    format = []
    for item in variable:
        format.append(type(variable[count]))
        count += 1
    print list(set(format))

get_type_elements(speed)









    



[<type 'str'>]

Oops. The data is in the wrong format. We need a list of numbers, not strings. We'll use floating point numbers to accomodate any observations that have decimal points in them. You could check this too, if you wanted.

Exercise:

Create a function that uses a for loop to change the type of the elements in the list speed to floating point numbers



In [50]:

    
def change_data_format_to_float(variable):
    ''' list(str) -> list(float)
    
    change_data_format_to_float takes a list of type string
    and returns numbers in type float
    
    example:
    speedfloat = change_data_format_to_float(speed)
    '''
    count = 0
    variablefloat=[]
    for item in variable:
        variablefloat.append(float(variable[count]))
        count += 1
    return variablefloat

speedfloat = change_data_format_to_float(speed)
get_type_elements(speedfloat)









    



[<type 'float'>]

To make our life easier, let's create a dictionary to associate values and condition names

Exercise:

Write a function that will create our dictionary



In [51]:

    
def create_dictionary(condname):
    datadict = dict()
    count = 0
    
    for name in condname:
        if name in datadict:
            # append the new number to the existing array at this slot
            datadict[name].append(speedfloat[count])
        else:
            # create a new array in this slot            
            datadict[name] = [speedfloat[count]]
        count += 1
    return datadict

datadict = create_dictionary(condname)

Now we can reformat the data to use with the `pandas` and `scipy` modules for statistical analysis.

First, let's import the modules we will need



In [52]:

    
import pandas as pd
import numpy as np
from scipy import stats
from patsy import dmatrices

Now, we can turn our dictionary into a pandas dataframe.



In [53]:

    
df = pd.DataFrame(datadict)
print df









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Bump       5000  non-null values
Collide    5000  non-null values
Contact    5000  non-null values
Hit        5000  non-null values
Smash      5000  non-null values
dtypes: float64(5)

Let's take a look at the dataframe

We can call the system head and tail functions to do this



In [54]:

    
print df.head()









    



     Bump  Collide    Contact     Hit   Smash
0  45.601   43.872  35.343097  18.298  40.810
1  45.452   35.219  33.179294  46.845  48.751
2  42.071   30.035  45.323799  12.501  49.430
3  30.303   46.574  42.669423  32.120  51.688
4  27.146   40.785  24.313337  39.203  57.690



In [55]:

    
print df.tail()









    



        Bump  Collide  Contact     Hit   Smash
4995  23.902   43.902   27.352  23.896  42.378
4996  44.808   44.686   27.962  26.561  46.275
4997  58.433   39.162   42.062  35.872  35.883
4998  47.236   31.987   23.697  24.000  47.574
4999  31.359   49.443   23.715  34.286  48.763

Now it's time for some descriptive statistics.

We can look at each type of descriptive statistic separately:



In [56]:

    
condmean = df.mean()
print condmean









    



Bump       37.999997
Collide    41.000002
Contact    29.999999
Hit        34.999997
Smash      46.000005
dtype: float64



In [57]:

    
condvar = df.var()
print condvar









    



Bump       122.222092
Collide     41.555552
Contact    116.444323
Hit         86.444395
Smash       33.333344
dtype: float64



In [58]:

    
condstd = df.std()
print condstd









    



Bump       11.055410
Collide     6.446360
Contact    10.790937
Hit         9.297548
Smash       5.773504
dtype: float64

Or, we can get them all together:



In [59]:

    
descriptives = df.describe()
print descriptives









    



              Bump      Collide      Contact          Hit        Smash
count  5000.000000  5000.000000  5000.000000  5000.000000  5000.000000
mean     37.999997    41.000002    29.999999    34.999997    46.000005
std      11.055410     6.446360    10.790937     9.297548     5.773504
min       0.954110    19.711000    -7.928400     2.039300    24.962000
25%      30.680000    36.764250    22.807250    28.776750    42.158250
50%      38.002000    41.076500    30.086527    35.003500    46.083500
75%      45.564250    45.309250    37.378500    41.291750    49.801000
max      75.173000    61.102000    66.337000    68.553000    66.631000

Maybe a negative speed means the estimated speed was in reverse? Check with whoever gave you the data to be sure. In our case, it means the car was travelling in reverse.

Now we can plot our data

First, import pylab (this allows us to view our figures in the ipython notebook)



In [60]:

    
%pylab inline 

from pylab import *









    



Populating the interactive namespace from numpy and matplotlib

Now, let's plot our means



In [61]:

    
df.mean().plot(kind='bar')









    Out[61]:





<matplotlib.axes.AxesSubplot at 0x1083cdb50>

and take a look at the boxplot



In [62]:

    
plt.figure();
bp = df.boxplot()

Step 4: Now, let's run an ANOVA

First, let's specify our conditions. Let's start with Bump. Because we have the data in a pandas dataframe, we can specify our conditions by name.

Test the normality of the distribution of each condition

To do this, we can use the normaltest function from the scipy stats module. Recall that we already imported the scipy stats module as stats.



In [63]:

    
stats.normaltest(df)









    Out[63]:





(array([ 0.82480245,  1.48875741,  2.20470372,  1.34776874,  0.01416214]),
 array([ 0.66205859,  0.47502934,  0.33208914,  0.50972477,  0.99294394]))

The first row returns the k2 value (skew^2 + kurtosis^2) and the second row are the respective p values. We are lucky. None of our conditions are skewed or violate the assumption of normality.

Compute the F and p values

Recall that we already imported the scipy stats module.



In [64]:

    
F_val, p_val = stats.f_oneway(df["Bump"],df["Collide"],df["Contact"],df["Hit"],df["Smash"])  

print "One-way ANOVA F =", F_val
print "One-way ANOVA p =", p_val









    



One-way ANOVA F = 2281.2537203
One-way ANOVA p = 0.0

So, we now know that there is a significant difference in estimated speed of the vehicles based on the wording of the question. Now, which conditions drive this finding?

Post-hoc analyses

Using the scipy stats module is great for simple things, but it does not contain functions for doing post-hoc analyses. Therefore, we will use a series of t-tests to test the differences between conditions.



In [65]:

    
t_bcol,p_bcol = stats.ttest_ind(df["Bump"], df["Collide"])
print "The t value for Bump vs. Collide is = ", t_bcol, ", p =", p_bcol









    



The t value for Bump vs. Collide is =  -16.5759940295 , p = 6.72575204893e-61

EXERCISE:

Because we need to repeat these lines several times, turn the above into a function so that you can compare the remaining conditions.



In [66]:

    
def printtvals(df,cond1,cond2):
    '''PRINTVALS takes a pandas dataframe and 2 named 
    conditions and returns the t and p values printed 
    to screen.
    
    USAGE: printtvals(df,cond1,cond2)
    
    where,
    df is a pandas dataframe
    cond1 is a named condition in df
    cond2 is a named condition in df
    '''
    t, p = stats.ttest_ind(df[cond1],df[cond2])
    print "The t value for ", cond1, "vs. ", cond2, "is =\t", t, ",\t p =", p



In [67]:

    
printtvals(df,"Bump","Collide")
printtvals(df,"Bump","Contact")
printtvals(df,"Bump","Hit")
printtvals(df,"Bump","Smash")
printtvals(df,"Collide","Contact")
printtvals(df,"Collide","Hit")
printtvals(df,"Collide","Smash")
printtvals(df,"Contact","Hit")
printtvals(df,"Contact","Smash")
printtvals(df,"Hit","Smash")









    



The t value for  Bump vs.  Collide is =	-16.5759940295 ,	 p = 6.72575204893e-61
The t value for  Bump vs.  Contact is =	36.6167045706 ,	 p = 1.42343926785e-275
The t value for  Bump vs.  Hit is =	14.6852010529 ,	 p = 2.5507920365e-48
The t value for  Bump vs.  Smash is =	-45.3557994543 ,	 p = 0.0
The t value for  Collide vs.  Contact is =	61.8798756012 ,	 p = 0.0
The t value for  Collide vs.  Hit is =	37.5000348594 ,	 p = 4.414059147e-288
The t value for  Collide vs.  Smash is =	-40.8551308746 ,	 p = 0.0
The t value for  Contact vs.  Hit is =	-24.8213808029 ,	 p = 4.95393467287e-132
The t value for  Contact vs.  Smash is =	-92.4446178633 ,	 p = 0.0
The t value for  Hit vs.  Smash is =	-71.0705940143 ,	 p = 0.0

So, we can see that all conditions differ significantly from each other. However, the greatest difference is between the Contact and Smash conditions.

For more complex designs requiring more than a oneway ANOVA

If you would like to have the full data output or have a more complex design you will need to use the functions built into the statsmodels module. statsmodels allows you to use R-like syntax to explicitly create your model. We have included the analysis from above for comparison of methods.

Before you begin

Download and install the statsmodels 0.6 module. First clone the repository from github from git clone git://github.com/statsmodels/statsmodels.git. You will also need to install patsy. Using pip enter pip install patsy at the command line. Everything else should have come with your installation of Anaconda. Following the installation of patsy, navigate to the directory where you put the statsmodels repository. Once there, type python setup.py build and finally python setup.py install.

The statsmodels module has many useful statistical features that will benefit your work. The latest release allows you to build statistical models using R-like syntax, but with the readability of python. Documentation is available from http://statsmodels.sourceforge.net/.



In [68]:

    
#import statsmodels.api as sm
import statsmodels.formula.api as sm
from statsmodels.stats.anova import anova_lm
import statsmodels.sandbox as sand
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
                                         MultiComparison)

Although the format of df is great for getting descriptive statistics, we need a different format for running the ANOVA (corresponding vectors of data).



In [69]:

    
df_long = pd.DataFrame(zip(condname,speedfloat), columns=['condname', 'speed'])
df_long.head()



In [70]:

    
df_long.tail()

Now to actually run the ANOVA. Just as a reminder (for those whose stats are a bit rusty) the an ANOVA is just a special case of OLS regression, so we will run the regression model here.

Let's set up the model:



In [71]:

    
model = sm.ols('speed ~ C(condname)',df_long)

where, speed is the dependent variable, C() means that condname is a categorical variable, and condname is the independent variable.

Now, we can run the analysis.



In [72]:

    
result = model.fit()



In [73]:

    
print result.summary()









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.267
Model:                            OLS   Adj. R-squared:                  0.267
Method:                 Least Squares   F-statistic:                     2281.
Date:                Tue, 05 Nov 2013   Prob (F-statistic):               0.00
Time:                        18:37:30   Log-Likelihood:                -90246.
No. Observations:               25000   AIC:                         1.805e+05
Df Residuals:                   24995   BIC:                         1.805e+05
Df Model:                           4                                         
==========================================================================================
                             coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                 38.0000      0.126    300.416      0.000        37.752    38.248
C(condname)[T.Collide]     3.0000      0.179     16.771      0.000         2.649     3.351
C(condname)[T.Contact]    -8.0000      0.179    -44.721      0.000        -8.351    -7.649
C(condname)[T.Hit]        -3.0000      0.179    -16.771      0.000        -3.351    -2.649
C(condname)[T.Smash]       8.0000      0.179     44.721      0.000         7.649     8.351
==============================================================================
Omnibus:                      210.373   Durbin-Watson:                   2.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              337.430
Skew:                          -0.025   Prob(JB):                     5.35e-74
Kurtosis:                       3.567   Cond. No.                         5.83
==============================================================================

Note that the result object has many useful attributes. For example, we can extract parameter estimates and r-squared



In [74]:

    
result.params









    Out[74]:





Intercept                 37.999997
C(condname)[T.Collide]     3.000004
C(condname)[T.Contact]    -7.999999
C(condname)[T.Hit]        -3.000000
C(condname)[T.Smash]       8.000008
dtype: float64



In [75]:

    
result.rsquared









    Out[75]:





0.26743877202192234

To present the results as a traditional ANOVA table



In [76]:

    
anova_lm(result)









    Out[76]:






  
    
      
      df
      sum_sq
      mean_sq
      F
      PR(>F)
    
  
  
    
      C(condname)
           4
        730000.652181
       182500.163045
       2281.25372
        0
    
    
      Residual
       24995
       1999598.525461
           79.999941
              NaN
      NaN

Now to run the post-hocs.



In [77]:

    
conditions = pd.Categorical.from_array(df_long['condname'])
#print conditions



In [78]:

    
print conditions.levels









    



Index([u'Bump', u'Collide', u'Contact', u'Hit', u'Smash'], dtype=object)



In [79]:

    
res2 = pairwise_tukeyhsd(df_long['speed'], df_long['condname'])
print res2









    



Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  0      1      3.0     2.512   3.488   True 
  0      2      -8.0    -8.488  -7.512  True 
  0      3      -3.0    -3.488  -2.512  True 
  0      4      8.0     7.512   8.488   True 
  1      2     -11.0   -11.488 -10.512  True 
  1      3      -6.0    -6.488  -5.512  True 
  1      4      5.0     4.512   5.488   True 
  2      3      5.0     4.512   5.488   True 
  2      4      16.0    15.512  16.488  True 
  3      4      11.0    10.512  11.488  True 
---------------------------------------------



In [80]:

    
mc = MultiComparison(df_long['speed'],df_long['condname'])



In [81]:

    
posthoc = mc.tukeyhsd()
print posthoc.summary() 
print zip(list((conditions.levels)), range(5))









    



Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  0      1      3.0     2.512   3.488   True 
  0      2      -8.0    -8.488  -7.512  True 
  0      3      -3.0    -3.488  -2.512  True 
  0      4      8.0     7.512   8.488   True 
  1      2     -11.0   -11.488 -10.512  True 
  1      3      -6.0    -6.488  -5.512  True 
  1      4      5.0     4.512   5.488   True 
  2      3      5.0     4.512   5.488   True 
  2      4      16.0    15.512  16.488  True 
  3      4      11.0    10.512  11.488  True 
---------------------------------------------
[('Bump', 0), ('Collide', 1), ('Contact', 2), ('Hit', 3), ('Smash', 4)]

	condname	speed
0	Contact	35.343097
1	Contact	33.179294
2	Contact	45.323799
3	Contact	42.669423
4	Contact	24.313337

	condname	speed
24995	Smash	42.378
24996	Smash	46.275
24997	Smash	35.883
24998	Smash	47.574
24999	Smash	48.763

	df	sum_sq	mean_sq	F	PR(>F)
C(condname)	4	730000.652181	182500.163045	2281.25372	0
Residual	24995	1999598.525461	79.999941	NaN	NaN

Putting it all together

Taking off with Loftus ...

Exercise:

Step 1: Import the .csv file

Exercise:

Step 2: Get header row

Let's update our function to get the header row.

Step 3: Examine your data set.

What type of data do we have after import?

What are the variables in your data set?

Find the columns associated with each of the variables of interest.

Now, extract the data for the estimated speed and the condition name.

Now let's see how many conditions we have and what the conditions are.

First, we need to get the unique conditions.

Now let's check that the data in the variable Estimated speed (mph) are entered correctly.

Exercise:

To make our life easier, let's create a dictionary to associate values and condition names

Exercise:

Now we can reformat the data to use with the pandas and scipy modules for statistical analysis.

Let's take a look at the dataframe

Now it's time for some descriptive statistics.

Now we can plot our data

Step 4: Now, let's run an ANOVA

Test the normality of the distribution of each condition

Compute the F and p values

Post-hoc analyses

EXERCISE:

For more complex designs requiring more than a oneway ANOVA

Before you begin

Now let's check that the data in the variable `Estimated speed (mph)` are entered correctly.

Now we can reformat the data to use with the `pandas` and `scipy` modules for statistical analysis.