In [42]:
from statsmodels.stats.anova import anova_lm
import statsmodels.formula.api as sm
import pandas as pd
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
mod1 = sm.ols("wage ~ married + age + educ", data=cps)
fit1 = mod1.fit()
anova_lm(fit1, typ=1)
Out[42]:
Note the small p-value on the married
term: 0.00969
.
To change the order of the terms in the report, you can create a new model with the explanatory terms listed in a different order. For example, here's the ANOVA on the same model, but with married
list instead of first:
In [43]:
mod2 = sm.ols("wage ~ educ + age + married", data=cps)
fit2 = mod2.fit()
anova_lm(fit2, typ=1)
Out[43]:
Now the p-value on married
is large. This suggests that much of the variation in wage
that is associated with married
can also be accounted for by age
and educ
instead.
Consider the model of world-record swimming times plotting on page 116 of 'Statistical Modeling: A Fresh Approach (2nd Edition)'. It shows pretty clearly the interaction between year
and sex
.
`swim100m.csv`
In [229]:
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
from statsmodels.api import graphics
import matplotlib.gridspec as gridspec
fit0 = sm.ols("time ~ year*sex", data=swim).fit() # Fit simple model
A version of the figure on page 116 can be generated as follows:
In [232]:
fig = figure(figsize=(12,5))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1])
graphics.plot_fit(fit0, "year", ax=ax1)
ax1.set_title("")
ax1.set_ylabel("time (secs")
dat = dict(list(swim.groupby('sex').time))
ax2.boxplot([dat[i] for i in dat.keys()])
ax2.get_yaxis().set_visible(False)
xticks([1, 2], ['M', 'F'])
show()
It's easy to confirm that this interaction term is statistically significant:
In [47]:
fit = sm.ols("time ~ year*sex", data=swim).fit()
anova_lm(fit, typ=1)
Out[47]:
The p-value on the interaction term is very small: $2.83 \times 10^{-6}$.
To check whether this result might be influenced by the shape of the distribution of the time
or year
data, you can conduct a non-parametric test. Simply take the rank of each quantitative variable:
In [50]:
from numpy import argsort as rank
fit = sm.ols("time.rank() ~ year.rank()*sex", data=swim).fit()
anova_lm(fit, typ=1)
Out[50]:
With the rank-transformed data, the p-value on the interaction term is much larger: no evidence for an interaction between year
and sex
. You can see this directly in a plot of the data after rank-transforming time
:
In [60]:
fig, ax = subplots()
swim['time_rank'] = swim.time.rank()
colors = ["blue", "green"]
for i, (key, grp) in enumerate(swim.groupby(['sex'])):
ax.scatter(y=grp.time_rank, x=grp.year, label=key, color=colors[i])
legend(loc='best')
xlabel("year")
ylabel("time (rank)")
show()
The rank-transformed data suggest that women's records are improving in about the same way as men's.
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.