Title: Hypothesis Testing on Parts of Models
Date: 2013-10-08 22:51
Author: cfarmer
Email: carson.farmer@gmail.com
Category: Statistical Modeling for Python
Tags: Helpful tips, Python, Statistical Modeling, Teaching
Slug: statistical-modeling-python-hypothesis-testing-parts
Latex: yes
Status: draft

Hypothesis Testing on Parts of Models

The basic software for hypothesis testing on parts of models involves the familiar `ols` function and `summary` method for generating the regression report, and the `anova_lm` function for generating an ANOVA report on a model.

ANOVA Reports

The `anova_lm` function takes a fitted model as an argument and produces the term-by-term ANOVA report. To illustrate, consider this model of wages from the Current Population Survey data:

`cps.csv`


In [42]:
from statsmodels.stats.anova import anova_lm
import statsmodels.formula.api as sm
import pandas as pd

cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
mod1 = sm.ols("wage ~ married + age + educ", data=cps)
fit1 = mod1.fit()
anova_lm(fit1, typ=1)  # term-by-term (Type I, sequential) ANOVA report


Out[42]:
            df        sum_sq      mean_sq           F        PR(>F)
married      1    142.401413   142.401413    6.740387  9.686949e-03
age          1    338.479887   338.479887   16.021508  7.155517e-05
educ         1   2398.723127  2398.723127  113.540462  3.741813e-24
Residual   530  11197.094256    21.126593         NaN           NaN

Note the small p-value on the married term: 0.00969.
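Since `anova_lm` returns an ordinary pandas DataFrame, individual entries can be pulled out directly. The short sketch below (an addition, not part of the original report) grabs the p-value on `married` with `.loc`, and also prints the regression report produced by the fit's `summary` method:

anova1 = anova_lm(fit1, typ=1)             # the ANOVA report as a pandas DataFrame
print(anova1.loc["married", "PR(>F)"])     # p-value on the married term (about 0.0097)
print(fit1.summary())                      # the familiar regression report on the same model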

To change the order of the terms in the report, you can create a new model with the explanatory terms listed in a different order. For example, here's the ANOVA on the same model, but with married listed last instead of first:

WARNING: This doesn't currently work properly! As the output below shows, `anova_lm` still lists (and enters) married first, so the report does not reflect the new term order.


In [43]:
mod2 = sm.ols("wage ~ educ + age + married", data=cps)
fit2 = mod2.fit()
anova_lm(fit2, typ=1)


Out[43]:
            df        sum_sq      mean_sq           F        PR(>F)
married      1    142.401413   142.401413    6.740387  9.686949e-03
educ         1   2094.528877  2094.528877   99.141820  1.589724e-21
age          1    642.674137   642.674137   30.420150  5.445293e-08
Residual   530  11197.094256    21.126593         NaN           NaN

When married is entered last, its p-value is large (this is the intended result; as the warning above notes, the report shown here still enters married first). This suggests that much of the variation in wage that is associated with married can also be accounted for by age and educ instead.
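An alternative that sidesteps the term-ordering issue entirely is an incremental F-test: `anova_lm` will also compare two nested fitted models, reporting the F statistic for the terms that differ between them. The sketch below (an addition to the original example) tests whether married adds anything on top of age and educ:

fit_small = sm.ols("wage ~ age + educ", data=cps).fit()   # the same model with married dropped
anova_lm(fit_small, fit1)                                 # F-test on adding married to age + educ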

Non-Parametric Statistics

Consider the model of world-record swimming times plotted on page 116 of 'Statistical Modeling: A Fresh Approach (2nd Edition)'. It shows pretty clearly the interaction between year and sex.

`swim100m.csv`


In [229]:
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
from statsmodels.api import graphics
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fit0 = sm.ols("time ~ year*sex", data=swim).fit() # Fit simple model with a year-by-sex interaction
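In the formula, `year*sex` is shorthand for both main effects plus their interaction (year + sex + year:sex). A quick, informal way to see which terms patsy actually constructs is to look at the names of the fitted coefficients; the exact labels depend on how the sex variable gets coded:

print(fit0.params.index.tolist())   # intercept, sex, year, and the year:sex interaction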

A version of the figure on page 116 can be generated as follows:


In [232]:
fig = plt.figure(figsize=(12, 5))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1])
graphics.plot_fit(fit0, "year", ax=ax1)        # data and fitted values against year
ax1.set_title("")
ax1.set_ylabel("time (secs)")
dat = dict(list(swim.groupby('sex').time))     # record times grouped by sex
keys = sorted(dat.keys())
ax2.boxplot([dat[k] for k in keys])
ax2.get_yaxis().set_visible(False)
ax2.set_xticklabels(keys)
plt.show()


It's easy to confirm that this interaction term is statistically significant:


In [47]:
fit = sm.ols("time ~ year*sex", data=swim).fit()
anova_lm(fit, typ=1)


Out[47]:
            df       sum_sq      mean_sq           F        PR(>F)
sex          1  1720.655232  1720.655232  156.140793  4.299569e-18
year         1  3342.177104  3342.177104  303.285733  1.039245e-24
year:sex     1   296.675432   296.675432   26.921801  2.826421e-06
Residual    58   639.153942    11.019896         NaN           NaN

The p-value on the interaction term is very small: $2.83 \times 10^{-6}$.

To check whether this result might be influenced by the shape of the distribution of the time or year data, you can conduct a non-parametric test: simply take the rank of each quantitative variable.
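The `.rank()` used in the formula below is just the pandas Series method: each value is replaced by its position in sorted order. A tiny made-up Series illustrates the idea:

s = pd.Series([58.6, 49.9, 95.0])   # made-up values, purely for illustration
print(s.rank())                     # 2.0, 1.0, 3.0 -- each value's position in sorted order

Applying the same transformation inside the model formula gives the non-parametric version of the test: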


In [50]:
# .rank() in the formula is the pandas Series method, evaluated on each column by patsy
fit = sm.ols("time.rank() ~ year.rank()*sex", data=swim).fit()
anova_lm(fit, typ=1)


Out[50]:
                  df        sum_sq       mean_sq           F        PR(>F)
sex                1   6100.403226   6100.403226  1599.92582  6.278398e-44
year.rank()        1  13533.070845  13533.070845  3549.25874  1.006992e-53
year.rank():sex    1      0.876059      0.876059     0.22976  6.335042e-01
Residual          58    221.149870      3.812929         NaN           NaN

With the rank-transformed data, the p-value on the interaction term is much larger: no evidence for an interaction between year and sex. You can see this directly in a plot of the data after rank-transforming time:


In [60]:
fig, ax = plt.subplots()
swim['time_rank'] = swim.time.rank()           # rank-transformed record times
colors = ["blue", "green"]
for i, (key, grp) in enumerate(swim.groupby('sex')):
    ax.scatter(grp.year, grp.time_rank, label=key, color=colors[i])
ax.legend(loc='best')
ax.set_xlabel("year")
ax.set_ylabel("time (rank)")
plt.show()


The rank-transformed data suggest that women's records are improving in about the same way as men's.
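As a rough numerical check on that impression (an addition to the original analysis, using a hypothetical year_rank helper column), you can fit a separate line of rank-transformed time against rank-transformed year for each sex; since the interaction was not significant, the two slopes should come out close:

swim['year_rank'] = swim.year.rank()     # rank-transform year as well
for key, grp in swim.groupby('sex'):
    slope = sm.ols("time_rank ~ year_rank", data=grp).fit().params["year_rank"]
    print("%s: %.3f" % (key, slope))     # per-sex slope of ranked time on ranked year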

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.