This tutorial presents two ways of doing hypothesis tests on whole models: 1) permutation tests where the connection is severed between the explanatory and response variables, and 2) tests such as ANOVA where the sampling distribution is calculated from first principals. In practice, first-principle tests are used most of the time. Still, the permutation test is useful for developing intuition about hypothesis testing - our main purpose here - and for those not-so-rare occaisons where the assumptions behind the first-principle tests as dubious.
The idea of a permutation test is to enfore the null hypothesis that there is no connection between the response variables and the explanatory variables. An effective way to do this is to randomize the response variable in a way that is consistent with sampling variability. When constructing confidence intervals, re-sampling was used. Re-sampling will typically repeat some cases and omit others. Here, we are more interested in scrambling the order of one or more variables while leaving the others in their original state. The permutation
function from the numpy.random
module will be used to accomplish this.
To illustrate, consider a model for exploring whether sex
and mother
's height are related to the height of the child:
`galton.csv`
In [20]:
import statsmodels.formula.api as sm
import pandas as pd
galton = pd.read_csv("http://www.mosaic-web.org/go/datasets/galton.csv")
mod = sm.ols("height ~ sex + mother", data=galton)
fit = mod.fit()
print fit.params, "\nrsquared ", fit.rsquared # Print out the rsquared value nicely
The parameters indicate that typical males are taller than typical females by about 5 inches and that for each inch taller the mother is, a child will typically be taller by 0.35 inches. A reasonable test statistic to summarize the whole model is $R^2$. The summary report shows this to be $R^2 = 0.562$ (not shown).
For confidence intervals, re-sampling was applied to the entire data frame. This selects random cases, but each selected case is an authentic one that matches exactly the original values for that case. The point of re-sampling is to get an idea of the variability introduced by random sampling of authentic cases.
In [4]:
import numpy as np
def resample(df, n=None, replace=True):
if n is None:
n = df.shape[0]
choices = np.random.choice(df.index, n, replace=replace)
return df.ix[choices]
pd.DataFrame([sm.ols("height ~ sex + mother", data=resample(galton)).fit().params for i in range(5)],
index=range(5))
Out[4]:
The sex[T.M]
parameters are tightly grouped hear 5 inches, the mother
parameters are around 0.3 to 0.4.
In order to carry out a permutation test, do not randomize the whole data frame. Instead, shuffle just the response variable:
In [7]:
from numpy.random import permutation
pd.DataFrame([sm.ols("permutation(height) ~ sex + mother", data=galton).fit().params for i in range(5)],
index=range(5))
Out[7]:
Now the sex[T.M]
and mother
parameters are close to zero, as would be expected when there is no relationship betweem the response variable and the explanatory variables.
In constructing the sampling distribution under the null hypothesis, you should do hundreds of trials of fitting the model to the scrambled data, calculating the test statistic (we'll use $R^2$ here) for each trial. Note that each trial itself involves all of the cases in your sample, but those cases have been changed so that the shuffled (or permuted) variable almost certainly takes on a different value in every case than in the original data.
In [11]:
nulls = pd.Series([sm.ols("permutation(height) ~ sex + mother", data=galton).fit().rsquared for i in range(500)])
The above produces a Pandas Series containing 500 $R^2$ values from the permutation test.
In [13]:
nulls.head() # There are 500 cases all together
Out[13]:
Naturally, all of the $R^2$ values for the trials are close to zero. After all, there is no relation between the response variable (after randomization with permutation()
) and the explanatory variables.
The p-value can be calculated directly from the trials, by comparison to the observed value in the actual data: $R^2$ was 0.5618.
In [22]:
(nulls>0.5618).value_counts()
Out[22]:
None of the 500 trials were greater than the value of the test statistic, 0.5618 (hence the count of 500 False's). It wouldn't be fair to claim that $p = 0$, since we only did 500 trials, but it is reasonable to say that the permutation test shows the p-value is $p < 1/500$.
On modern computers, the permutation test is entirely practical. But a few decades ago, it was not. Great creativity was applied to finding test statistics where the sampling distribution could be estimated without extensive calculation. One of these is the F statistic. This is still very useful today and is a standard part of the regression report in many statistical packages.
Here is the regression report from the height ~ sex + mother
model design presented earlier:
In [23]:
mod = sm.ols("height ~ sex + mother", data=galton)
fit = mod.fit()
fit.summary()
Out[23]:
The third row of the first table in the report shows an F statistic of 573.7 based on an $R^2$ of 0.562 and translates this to a p-value (Prob (F-statistic)
) that is practically zero: 4.44e-161
.
By way of showing that the regression report is rooted in the same approach show in Chapter 14 of 'Statistical Modeling: A Fresh Approach (2nd Edition)', you can confirm the calculations. There are $m = 3$ parameters and $n = 898$ cases, producing $n - m = 895$ degrees of freedom in the denominator and $m - 1 = 2$ degrees of freedom in the numerator. The calculation of the F statistic from $R^2$ and the degrees of freedom follows the formula given in the chapter:
$$F = \frac{R^2}{m - 1} \Big / \frac{1 - R^2}{n - m}$$Plugging the values into the formula:
In [25]:
(0.562/2.) / ((1.-0.562)/895.)
Out[25]:
F is the test statistic. To convert it to a p-value, you need to calculate how extreme the value of F = 574.19 is with reference to the F distribution with 895 and 2 degrees of freedom.
In [29]:
from scipy.stats import f
1. - f.cdf(574.189, 2, 895)
Out[29]:
The calculation of p-values from F always follows this form. In the context of the F distribution, "extreme" always means "bigger than". So, calculate the area under the F distribution to the right of the observed value.
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.