Title: Hypothesis Testing on Whole Models Date: 2013-10-08 17:05 Author: cfarmer Email: carson.farmer@gmail.com Category: Statistical Modeling for Python Tags: Helpful tips, Python, Statistical Modeling, Teaching Slug: statistical-modeling-python-hypothesis-testing-whole Latex: yes Status: draft

Hypothesis Testing on Whole Models

This tutorial presents two ways of doing hypothesis tests on whole models: 1) permutation tests, where the connection is severed between the explanatory and response variables, and 2) tests such as ANOVA, where the sampling distribution is calculated from first principles. In practice, first-principle tests are used most of the time. Still, the permutation test is useful for developing intuition about hypothesis testing - our main purpose here - and for those not-so-rare occasions where the assumptions behind the first-principle tests are dubious.

The Permutation Test

The idea of a permutation test is to enforce the null hypothesis that there is no connection between the response variable and the explanatory variables. An effective way to do this is to randomize the response variable in a way that is consistent with sampling variability. When constructing confidence intervals, re-sampling was used. Re-sampling will typically repeat some cases and omit others. Here, we are instead interested in scrambling the order of one or more variables while leaving the others in their original state. The `permutation()` function from the `numpy.random` module will be used to accomplish this.
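To see what `permutation()` does before using it in a model, here is a minimal sketch on a toy data frame (the column names are made up purely for illustration): one column is shuffled while the other keeps its original order, which is exactly what severs the link between the two.

import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})
toy["y_shuffled"] = np.random.permutation(toy["y"])  # same values, scrambled order
print(toy)  # 'x' is untouched; 'y_shuffled' no longer lines up with it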

To illustrate, consider a model for exploring whether sex and mother's height are related to the height of the child:

`galton.csv`


In [20]:
import statsmodels.formula.api as sm
import pandas as pd

galton = pd.read_csv("http://www.mosaic-web.org/go/datasets/galton.csv")
mod = sm.ols("height ~ sex + mother", data=galton)
fit = mod.fit()
print fit.params, "\nrsquared     ", fit.rsquared # print the fitted parameters and the R-squared value


Intercept    41.449524
sex[T.M]      5.176695
mother        0.353137
dtype: float64 
rsquared      0.561801885474

The parameters indicate that typical males are taller than typical females by about 5 inches and that for each inch taller the mother is, a child will typically be taller by 0.35 inches. A reasonable test statistic to summarize the whole model is $R^2$, which the output above shows to be $R^2 = 0.562$.

For confidence intervals, re-sampling was applied to the entire data frame. This selects random cases, but each selected case is an authentic one that matches exactly the original values for that case. The point of re-sampling is to get an idea of the variability introduced by random sampling of authentic cases.


In [4]:
import numpy as np
def resample(df, n=None, replace=True):
    # Draw a random sample of n rows from the data frame (with replacement by
    # default); if n is not given, the sample is the same size as the original.
    if n is None:
        n = df.shape[0]
    choices = np.random.choice(df.index, n, replace=replace)
    return df.ix[choices]

pd.DataFrame([sm.ols("height ~ sex + mother", data=resample(galton)).fit().params for i in range(5)], 
             index=range(5))


Out[4]:
   Intercept  sex[T.M]    mother
0  42.757400  4.958629  0.334337
1  44.088196  5.040707  0.315130
2  41.784323  5.290200  0.346561
3  38.903076  5.377676  0.392889
4  41.844829  5.070087  0.348238

The sex[T.M] parameters are tightly grouped near 5 inches, and the mother parameters are around 0.3 to 0.4.

In order to carry out a permutation test, do not randomize the whole data frame. Instead, shuffle just the response variable:


In [7]:
from numpy.random import permutation
pd.DataFrame([sm.ols("permutation(height) ~ sex + mother", data=galton).fit().params for i in range(5)], 
             index=range(5))


Out[7]:
   Intercept  sex[T.M]    mother
0  70.095853  0.215405 -0.053784
1  71.079129  0.576786 -0.072047
2  63.128318 -0.013061  0.056787
3  65.897598  0.459107  0.009758
4  68.360830 -0.186364 -0.023463

Now the sex[T.M] and mother parameters are close to zero, as would be expected when there is no relationship between the response variable and the explanatory variables.

In constructing the sampling distribution under the null hypothesis, you should do hundreds of trials of fitting the model to the scrambled data, calculating the test statistic (we'll use $R^2$ here) for each trial. Note that each trial itself involves all of the cases in your sample, but those cases have been changed so that the shuffled (or permuted) variable almost certainly takes on a different value in every case than in the original data.


In [11]:
nulls = pd.Series([sm.ols("permutation(height) ~ sex + mother", data=galton).fit().rsquared for i in range(500)])

The above produces a Pandas Series containing 500 $R^2$ values from the permutation test.


In [13]:
nulls.head() # There are 500 cases altogether


Out[13]:
0    0.000741
1    0.000326
2    0.000777
3    0.004096
4    0.000761
dtype: float64

Naturally, all of the $R^2$ values for the trials are close to zero. After all, there is no relation between the response variable (after randomization with permutation()) and the explanatory variables.
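To put a number on "close to zero", you can summarize the null distribution directly; a quick sketch using the `nulls` series from above:

print(nulls.describe())      # mean and spread of the 500 null R-squared values
print(nulls.quantile(0.99))  # even the 99th percentile stays far below 0.562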

The p-value can be calculated directly from the trials, by comparison to the observed value in the actual data: $R^2$ was 0.5618.


In [22]:
(nulls>0.5618).value_counts()


Out[22]:
False    500
dtype: int64

None of the 500 trials produced an $R^2$ greater than the observed test statistic, 0.5618 (hence the count of 500 `False` values). It wouldn't be fair to claim that $p = 0$, since we only did 500 trials, but it is reasonable to say that the permutation test shows the p-value is $p < 1/500$.
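If you prefer the p-value as a proportion rather than a count of `False` values, the same comparison can be summarized directly (a small sketch using the `nulls` series computed above):

p_value = (nulls > 0.5618).mean()  # fraction of trials exceeding the observed R-squared
print(p_value)                     # 0.0 here, so report p < 1/500 rather than p = 0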

First-Principle Tests

On modern computers, the permutation test is entirely practical. But a few decades ago, it was not. Great creativity was applied to finding test statistics where the sampling distribution could be estimated without extensive calculation. One of these is the F statistic. This is still very useful today and is a standard part of the regression report in many statistical packages.

Here is the regression report from the `height ~ sex + mother` model design presented earlier:


In [23]:
mod = sm.ols("height ~ sex + mother", data=galton)
fit = mod.fit()
fit.summary()


Out[23]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                 height   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.561
Method:                 Least Squares   F-statistic:                     573.7
Date:                Tue, 08 Oct 2013   Prob (F-statistic):          4.44e-161
Time:                        21:58:23   Log-Likelihood:                -2049.3
No. Observations:                 898   AIC:                             4105.
Df Residuals:                     895   BIC:                             4119.
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     41.4495      2.209     18.760      0.000    37.113    45.786
sex[T.M]       5.1767      0.159     32.625      0.000     4.865     5.488
mother         0.3531      0.034     10.270      0.000     0.286     0.421
==============================================================================
Omnibus:                        4.661   Durbin-Watson:                   1.290
Prob(Omnibus):                  0.097   Jarque-Bera (JB):                4.977
Skew:                          -0.103   Prob(JB):                        0.0830
Kurtosis:                       3.301   Cond. No.                     1.79e+03
==============================================================================

The third row of the first table in the report shows an F statistic of 573.7 based on an $R^2$ of 0.562 and translates this to a p-value (Prob (F-statistic)) that is practically zero: 4.44e-161.

By way of showing that the regression report is rooted in the same approach shown in Chapter 14 of 'Statistical Modeling: A Fresh Approach (2nd Edition)', you can confirm the calculations. There are $m = 3$ parameters and $n = 898$ cases, producing $n - m = 895$ degrees of freedom in the denominator and $m - 1 = 2$ degrees of freedom in the numerator. The calculation of the F statistic from $R^2$ and the degrees of freedom follows the formula given in the chapter:

$$F = \frac{R^2}{m - 1} \Big / \frac{1 - R^2}{n - m}$$

Plugging the values into the formula:


In [25]:
(0.562/2.) / ((1.-0.562)/895.)


Out[25]:
574.1894977168951

F is the test statistic. To convert it to a p-value, you need to calculate how extreme the value of F = 574.19 is with reference to the F distribution with 2 and 895 degrees of freedom. (The slight difference from the 573.7 in the regression report comes from rounding $R^2$ to 0.562 in the calculation above.)


In [29]:
from scipy.stats import f
1. - f.cdf(574.189, 2, 895)


Out[29]:
1.1102230246251565e-16

The calculation of p-values from F always follows this form. In the context of the F distribution, "extreme" always means "bigger than". So, calculate the area under the F distribution to the right of the observed value.
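One caveat about the computation above: `1. - f.cdf(...)` runs into floating-point limits for tail areas this small, which is why it bottoms out near machine precision (about 1.1e-16) instead of matching the 4.44e-161 in the regression report. The survival function `f.sf()` computes the right-hand tail area directly without that loss of precision, and statsmodels also stores the F statistic and its p-value on the fitted model. A short sketch (assuming `fit` is the fitted model from above):

from scipy.stats import f

print(f.sf(573.7, 2, 895))   # right-tail area computed directly; about 4e-161
print(fit.fvalue)            # the F statistic statsmodels computed: 573.7...
print(fit.f_pvalue)          # and its p-value: about 4.4e-161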

Next time on 'Statistical Modeling: A Fresh Approach for Python'...

  • Hypothesis Testing on Parts of Models

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.