Title: Hypothesis Testing on Parts of Models
Date: 2013-10-08 22:51
Author: cfarmer
Email: carson.farmer@gmail.com
Category: Statistical Modeling for Python
Tags: Helpful tips, Python, Statistical Modeling, Teaching
Slug: statistical-modeling-python-hypothesis-testing-parts
Latex: yes
Status: draft

Hypothesis Testing on Parts of Models

The basic software for hypothesis testing on parts of models involves the familiar `ols` function and `summary` method for generating the regression report, and the `anova_lm` function for generating an ANOVA report on a model.

ANOVA Reports

The `anova_lm` function takes a fitted model as an argument and produces the term-by-term ANOVA report. To illustrate, consider this model of wages from the Current Population Survey data:

`cps.csv`


In [42]:
from statsmodels.stats.anova import anova_lm
import statsmodels.formula.api as sm
import pandas as pd

cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
mod1 = sm.ols("wage ~ married + age + educ", data=cps)
fit1 = mod1.fit()
anova_lm(fit1, typ=1)  # term-by-term (Type I, sequential) ANOVA report


Out[42]:
            df        sum_sq      mean_sq           F        PR(>F)
married      1    142.401413   142.401413    6.740387  9.686949e-03
age          1    338.479887   338.479887   16.021508  7.155517e-05
educ         1   2398.723127  2398.723127  113.540462  3.741813e-24
Residual   530  11197.094256    21.126593         NaN           NaN

Note the small p-value on the married term: 0.00969.
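Since `anova_lm` returns an ordinary pandas DataFrame, individual entries can be pulled out directly. The short sketch below (an addition, not part of the original report) grabs the p-value on `married` with `.loc`, and also prints the regression report produced by the fit's `summary` method:

anova1 = anova_lm(fit1, typ=1)             # the ANOVA report as a pandas DataFrame
print(anova1.loc["married", "PR(>F)"])     # p-value on the married term (about 0.0097)
print(fit1.summary())                      # the familiar regression report on the same model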

To change the order of the terms in the report, you can create a new model with the explanatory terms listed in a different order. For example, here's the ANOVA on the same model, but with married listed last instead of first:

WARNING: This doesn't currently work properly! As the output below shows, `anova_lm` still lists (and enters) married first, so the report does not reflect the new term order.


In [43]:
mod2 = sm.ols("wage ~ educ + age + married", data=cps)
fit2 = mod2.fit()
anova_lm(fit2, typ=1)


Out[43]:
            df        sum_sq      mean_sq           F        PR(>F)
married      1    142.401413   142.401413    6.740387  9.686949e-03
educ         1   2094.528877  2094.528877   99.141820  1.589724e-21
age          1    642.674137   642.674137   30.420150  5.445293e-08
Residual   530  11197.094256    21.126593         NaN           NaN

When married is entered last, its p-value is large (this is the intended result; as the warning above notes, the report shown here still enters married first). This suggests that much of the variation in wage that is associated with married can also be accounted for by age and educ instead.
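An alternative that sidesteps the term-ordering issue entirely is an incremental F-test: `anova_lm` will also compare two nested fitted models, reporting the F statistic for the terms that differ between them. The sketch below (an addition to the original example) tests whether married adds anything on top of age and educ:

fit_small = sm.ols("wage ~ age + educ", data=cps).fit()   # the same model with married dropped
anova_lm(fit_small, fit1)                                 # F-test on adding married to age + educ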

Non-Parametric Statistics

Consider the model of world-record swimming times plotted on page 116 of 'Statistical Modeling: A Fresh Approach (2nd Edition)'. It shows pretty clearly the interaction between year and sex.

`swim100m.csv`


In [229]:
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
from statsmodels.api import graphics
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fit0 = sm.ols("time ~ year*sex", data=swim).fit() # Fit simple model with a year-by-sex interaction
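In the formula, `year*sex` is shorthand for both main effects plus their interaction (year + sex + year:sex). A quick, informal way to see which terms patsy actually constructs is to look at the names of the fitted coefficients; the exact labels depend on how the sex variable gets coded:

print(fit0.params.index.tolist())   # intercept, sex, year, and the year:sex interaction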

A version of the figure on page 116 can be generated as follows:


In [232]:
fig = plt.figure(figsize=(12, 5))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1])
graphics.plot_fit(fit0, "year", ax=ax1)        # data and fitted values against year
ax1.set_title("")
ax1.set_ylabel("time (secs)")
dat = dict(list(swim.groupby('sex').time))     # record times grouped by sex
keys = sorted(dat.keys())
ax2.boxplot([dat[k] for k in keys])
ax2.get_yaxis().set_visible(False)
ax2.set_xticklabels(keys)
plt.show()


It's easy to confirm that this interaction term is statistically significant:


In [47]:
fit = sm.ols("time ~ year*sex", data=swim).fit()
anova_lm(fit, typ=1)


Out[47]:
            df       sum_sq      mean_sq           F        PR(>F)
sex          1  1720.655232  1720.655232  156.140793  4.299569e-18
year         1  3342.177104  3342.177104  303.285733  1.039245e-24
year:sex     1   296.675432   296.675432   26.921801  2.826421e-06
Residual    58   639.153942    11.019896         NaN           NaN

The p-value on the interaction term is very small: $2.83 \times 10^{-6}$.

To check whether this result might be influenced by the shape of the distribution of the time or year data, you can conduct a non-parametric test: simply take the rank of each quantitative variable.
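The `.rank()` used in the formula below is just the pandas Series method: each value is replaced by its position in sorted order. A tiny made-up Series illustrates the idea:

s = pd.Series([58.6, 49.9, 95.0])   # made-up values, purely for illustration
print(s.rank())                     # 2.0, 1.0, 3.0 -- each value's position in sorted order

Applying the same transformation inside the model formula gives the non-parametric version of the test: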


In [50]:
# .rank() in the formula is the pandas Series method, evaluated on each column by patsy
fit = sm.ols("time.rank() ~ year.rank()*sex", data=swim).fit()
anova_lm(fit, typ=1)


Out[50]:
                  df        sum_sq       mean_sq           F        PR(>F)
sex                1   6100.403226   6100.403226  1599.92582  6.278398e-44
year.rank()        1  13533.070845  13533.070845  3549.25874  1.006992e-53
year.rank():sex    1      0.876059      0.876059     0.22976  6.335042e-01
Residual          58    221.149870      3.812929         NaN           NaN

With the rank-transformed data, the p-value on the interaction term is much larger: no evidence for an interaction between year and sex. You can see this directly in a plot of the data after rank-transforming time:


In [60]:
fig, ax = plt.subplots()
swim['time_rank'] = swim.time.rank()           # rank-transformed record times
colors = ["blue", "green"]
for i, (key, grp) in enumerate(swim.groupby('sex')):
    ax.scatter(grp.year, grp.time_rank, label=key, color=colors[i])
ax.legend(loc='best')
ax.set_xlabel("year")
ax.set_ylabel("time (rank)")
plt.show()


The rank-transformed data suggest that women's records are improving in about the same way as men's.
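As a rough numerical check on that impression (an addition to the original analysis, using a hypothetical year_rank helper column), you can fit a separate line of rank-transformed time against rank-transformed year for each sex; since the interaction was not significant, the two slopes should come out close:

swim['year_rank'] = swim.year.rank()     # rank-transform year as well
for key, grp in swim.groupby('sex'):
    slope = sm.ols("time_rank ~ year_rank", data=grp).fit().params["year_rank"]
    print("%s: %.3f" % (key, slope))     # per-sex slope of ranked time on ranked year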

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.