Imports


In [1]:
%matplotlib inline
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
sns.set(style="white")

Getting and exploring the data


In [2]:
immer_ds = sm.datasets.get_rdataset("immer", "MASS")

In [3]:
print immer_ds.__doc__


+---------+-------------------+
| immer   | R Documentation   |
+---------+-------------------+

Yields from a Barley Field Trial
--------------------------------

Description
~~~~~~~~~~~

The ``immer`` data frame has 30 rows and 4 columns. Five varieties of
barley were grown in six locations in each of 1931 and 1932.

Usage
~~~~~

::

    immer

Format
~~~~~~

This data frame contains the following columns:

``Loc``
    The location.

``Var``
    The variety of barley (``"manchuria"``, ``"svansota"``,
    ``"velvet"``, ``"trebi"`` and ``"peatland"``).

``Y1``
    Yield in 1931.

``Y2``
    Yield in 1932.

Source
~~~~~~

Immer, F.R., Hayes, H.D. and LeRoy Powers (1934) Statistical
determination of barley varietal adaptation. *Journal of the American
Society for Agronomy* **26**, 403–419.

Fisher, R.A. (1947) *The Design of Experiments.* 4th edition. Edinburgh:
Oliver and Boyd.

References
~~~~~~~~~~

Venables, W. N. and Ripley, B. D. (1999) *Modern Applied Statistics with
S-PLUS.* Third Edition. Springer.

Examples
~~~~~~~~

::

    immer.aov <- aov(cbind(Y1,Y2) ~ Loc + Var, data = immer)
    summary(immer.aov)

    immer.aov <- aov((Y1+Y2)/2 ~ Var + Loc, data = immer)
    summary(immer.aov)
    model.tables(immer.aov, type = "means", se = TRUE, cterms = "Var")



In [4]:
immer_data = immer_ds.data
immer_ds.data


Out[4]:
Loc Var Y1 Y2
0 UF M 81.0 80.7
1 UF S 105.4 82.3
2 UF V 119.7 80.4
3 UF T 109.7 87.2
4 UF P 98.3 84.2
5 W M 146.6 100.4
6 W S 142.0 115.5
7 W V 150.7 112.2
8 W T 191.5 147.7
9 W P 145.7 108.1
10 M M 82.3 103.1
11 M S 77.3 105.1
12 M V 78.4 116.5
13 M T 131.3 139.9
14 M P 89.6 129.6
15 C M 119.8 98.9
16 C S 121.4 61.9
17 C V 124.0 96.2
18 C T 140.8 125.5
19 C P 124.8 75.7
20 GR M 98.9 66.4
21 GR S 89.0 49.9
22 GR V 69.1 96.7
23 GR T 89.3 61.9
24 GR P 104.1 80.3
25 D M 86.9 67.7
26 D S 77.1 66.7
27 D V 78.9 67.4
28 D T 101.8 91.8
29 D P 96.0 94.1

In [5]:
immer_data.describe()


Out[5]:
Y1 Y2
count 30.000000 30.000000
mean 109.046667 93.133333
std 28.674279 24.275450
min 69.100000 49.900000
25% 87.425000 76.850000
50% 102.950000 92.950000
75% 124.600000 107.350000
max 191.500000 147.700000

Two-sample t-test - are two distributions different?

Is manchuria yielding more crop than trebi?


In [6]:
manchuria_yield = immer_data.Y1[immer_data.Var == "M"]
trebi_yield = immer_data.Y1[immer_data.Var == "T"]

In [7]:
sns.distplot(manchuria_yield, rug=True, hist=False, label="manchuria")
sns.distplot(trebi_yield, rug=True, hist=False, label="trebi")


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd8c2321c10>

In [8]:
t, p = stats.ttest_ind(manchuria_yield, trebi_yield)
print t, p


-1.352886893 0.205886131092

How does it work?

$$t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$

where $s_{X_1X_2}$ is the pooled standard deviation:

$$ s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}}$$
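
To make the formula concrete, we can compute the statistic by hand for the manchuria vs. trebi comparison and check that it matches stats.ttest_ind (a minimal sketch):


In [ ]:
n1, n2 = len(manchuria_yield), len(trebi_yield)
# pooled standard deviation (unbiased sample variances, ddof=1)
s_pooled = np.sqrt(((n1 - 1) * manchuria_yield.var(ddof=1) +
                    (n2 - 1) * trebi_yield.var(ddof=1)) / (n1 + n2 - 2))
t_manual = (manchuria_yield.mean() - trebi_yield.mean()) / (s_pooled * np.sqrt(1.0/n1 + 1.0/n2))
print t_manual  # should match the statistic printed by stats.ttest_ind above
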

What does it really mean? The further apart the means are, the higher the statistic gets.


In [9]:
for rv2_mean in range(0,5,1):
    rvs1 = stats.norm.rvs(loc=0,scale=1,size=500)
    rvs2 = stats.norm.rvs(loc=rv2_mean,scale=1,size=500)
    stats.ttest_ind(rvs1,rvs2)
    f = plt.figure()
    sns.distplot(rvs1, hist=False, label="variable 1")
    sns.distplot(rvs2, hist=False, label="variable 2", axlabel="t = %g, p=%g"%stats.ttest_ind(rvs1,rvs2))
    plt.xlim([-9,9])
    plt.ylim([0, 0.45])


In general, the less overlap there is between the two distributions, the higher the statistic.


In [10]:
for rvs_spread in range(1,6,1):
    rvs1 = stats.norm.rvs(loc=0,scale=rvs_spread,size=500)
    rvs2 = stats.norm.rvs(loc=1,scale=rvs_spread,size=500)
    stats.ttest_ind(rvs1,rvs2)
    plt.figure()
    sns.distplot(rvs1, hist=False, label="variable 1")
    sns.distplot(rvs2, hist=False, label="variable 2", axlabel="t = %g, p=%g"%stats.ttest_ind(rvs1,rvs2))
    plt.xlim([-12,12])
    plt.ylim([0, 0.45])


The larger the sample size, the higher the statistic.


In [11]:
for sample_size in range(100,500,100):
    rvs1 = stats.norm.rvs(loc=0,scale=1,size=sample_size)
    rvs2 = stats.norm.rvs(loc=1,scale=1,size=sample_size)
    stats.ttest_ind(rvs1,rvs2)
    plt.figure()
    sns.distplot(rvs1, hist=False, label="variable 1")
    sns.distplot(rvs2, hist=False, label="variable 2", axlabel="t = %g, p=%g"%stats.ttest_ind(rvs1,rvs2))
    plt.xlim([-12,12])
    plt.ylim([0, 0.45])


A larger sample size not only increases the statistic, it also increases the degrees of freedom, which are used for calculating the p-value.


In [12]:
dfs = range(10,300, 10)
ps = [2 * stats.t.sf(2, df) for df in dfs]  # two-tailed p-value for a fixed statistic t = 2
plt.plot(ps, dfs)
plt.xlabel("p-value")
plt.ylabel("degrees of freedom")


Out[12]:
<matplotlib.text.Text at 0x7fd8bed6b490>
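
As a sanity check, the p-value reported by ttest_ind can be reproduced from the statistic and the degrees of freedom ($n_1 + n_2 - 2$ for the equal-variance test); a minimal sketch:


In [ ]:
t, p = stats.ttest_ind(manchuria_yield, trebi_yield)
df = len(manchuria_yield) + len(trebi_yield) - 2
# two-tailed p-value from the t distribution's survival function
print p, 2 * stats.t.sf(np.abs(t), df)
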

Even the tiniest effect size can be statistically significant if you collect enough samples.


In [13]:
sample_sizes = range(10, 1000, 10)
ps = [stats.ttest_ind(stats.norm.rvs(loc=0,scale=1,size=sample_size),
                      stats.norm.rvs(loc=0.3,scale=1,size=sample_size))[1] for sample_size in sample_sizes]
plt.semilogx(ps, sample_sizes)
plt.xlabel("p-value")
plt.ylabel("sample size")


Out[13]:
<matplotlib.text.Text at 0x7fd8bf1587d0>

Exercise: plot the distributions of the svansota and velvet crops, and perform a two-sample t-test. Now include every sample from each crop type twice in the test - what will happen to the p-value?


In [14]:
svansona_yield = immer_data.Y1[immer_data.Var == "S"]
velvet_yield = immer_data.Y1[immer_data.Var == "V"]

In [15]:
sns.distplot(svansona_yield, hist=False, rug=True)
sns.distplot(velvet_yield, hist=False, rug=True)


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd8bed61790>

In [16]:
stats.ttest_ind(svansona_yield, velvet_yield)


Out[16]:
Ttest_indResult(statistic=-0.084090620384495413, pvalue=0.93464398923779979)

In [17]:
svansona_yield_doubled = list(svansona_yield) + list(svansona_yield)
velvet_yield_doubled = list(velvet_yield) + list(velvet_yield)
stats.ttest_ind(svansona_yield_doubled, velvet_yield_doubled)


Out[17]:
Ttest_indResult(statistic=-0.1247265463348886, pvalue=0.90187271155787685)

Paired t-test


In [18]:
year1_yield = immer_data.Y1
year2_yield = immer_data.Y2

In [19]:
sns.distplot(year1_yield, rug=True, hist=False, label="1931")
sns.distplot(year2_yield, rug=True, hist=False, label="1932")
plt.xlabel("crop")


Out[19]:
<matplotlib.text.Text at 0x7fd8bee923d0>

In [20]:
sns.violinplot(inner=None, data=immer_data)
for y1, y2 in zip(year1_yield, year2_yield):
    plt.plot([0,1], [y1,y2], color="b")



In [21]:
y_diff = year1_yield-year2_yield
sns.distplot(y_diff,  rug=True, hist=False)


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd8bf4c1fd0>

Let's calculate a one-sample t-test checking whether the mean of the differences is significantly different from zero.


In [22]:
stats.ttest_1samp(y_diff, 0)


Out[22]:
Ttest_1sampResult(statistic=3.3239873042716788, pvalue=0.0024126338636167597)

This is equivalent to the so-called paired two-sample t-test.


In [23]:
stats.ttest_rel(year1_yield,year2_yield)


Out[23]:
Ttest_relResult(statistic=3.3239873042716788, pvalue=0.0024126338636167597)

What would happen if we ignored the pairing and treated the samples as independent?


In [24]:
stats.ttest_ind(year1_yield,year2_yield)


Out[24]:
Ttest_indResult(statistic=2.3199548629875282, pvalue=0.023883598700006337)

The result is similar, but would it always be the case? Let's generate a lot of data and compare the independent two-sample and paired tests.


In [25]:
t_diffs = []
rvs = []
for i in range(1000):
    rvs1 = stats.norm.rvs(loc=0,scale=1,size=15)
    rvs2 = stats.norm.rvs(loc=0.2,scale=1,size=15)
    t_rel, _ = stats.ttest_rel(rvs1,rvs2)
    t_ind, _ = stats.ttest_ind(rvs1,rvs2)
    t_diff = t_rel - t_ind
    t_diffs.append(t_diff)
    rvs.append((rvs1, rvs2))

In [26]:
sns.distplot(t_diffs)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd8bec8aad0>

Note that the distribution is left-skewed. This means that the independent-samples t-test more often has higher values than the paired version.
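
To put a number on that asymmetry we can compute the sample skewness (a one-line sketch; a negative value confirms the left skew):


In [ ]:
print stats.skew(t_diffs)
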

Let's pick an extreme case:


In [27]:
rv1, rv2 = rvs[np.array(t_diffs).argmax()]

In [28]:
sns.violinplot(data=pd.DataFrame(np.array([rv1, rv2]).T))
for y1, y2 in zip(rv1, rv2):
    plt.plot([0,1], [y1,y2], color="b")



In [29]:
stats.ttest_rel(rv1, rv2)


Out[29]:
Ttest_relResult(statistic=2.9220607590480232, pvalue=0.011145245653294555)

In [30]:
stats.ttest_ind(rv1, rv2)


Out[30]:
Ttest_indResult(statistic=1.8957443217426311, pvalue=0.068358758989211996)

Using the wrong test can make the result less or more significant than it should be. However, the sign of the statistic can never change, because both tests share the same numerator (the difference of the means).

Correlation between values

Are the yields of the barley crops in 1931 correlated with the yields in 1932?


In [31]:
plt.scatter(x=year1_yield, y=year2_yield)


Out[31]:
<matplotlib.collections.PathCollection at 0x7fd8be971ad0>

We can try to fit a line to this plot. A line is defined by the following equation: $$ y = ax + b $$ where $a$ is the slope and $b$ is the intercept of the line.


In [32]:
slope, intercept, r_value, p_value, std_err = stats.linregress(year1_yield, year2_yield)

In [33]:
plt.scatter(x=year1_yield, y=year2_yield)
plt.plot(np.arange(60,200),intercept+slope*np.arange(60,200), color="r")
plt.xlim([60, 200])
plt.xlabel("1931 crop yield")
plt.ylabel("1932 crop yield")


Out[33]:
<matplotlib.text.Text at 0x7fd8beb6ff10>

From the equation we can see that the intercept is the predicted crop yield in 1932 ($y$) when the yield in 1931 ($x$) is zero: $$ y = a \cdot 0 + b = b$$


In [34]:
plt.scatter(x=year1_yield, y=year2_yield)
plt.plot(np.arange(0,200),intercept+slope*np.arange(0,200), color="r")
plt.xlim([0, 200])
plt.xlabel("1931 crop yield")
plt.ylabel("1932 crop yield")
print intercept


45.1275605247

How is this line fitted? The line is a prediction: it tells you what the crop yield in 1932 should be based on the yield in 1931. But it's not perfect. Let's draw how far the predictions are from reality.


In [35]:
plt.scatter(x=year1_yield, y=year2_yield)
plt.plot(np.arange(60,200),intercept+slope*np.arange(60,200), color="r")
plt.xlim([60, 200])
plt.xlabel("1931 crop yield")
plt.ylabel("1932 crop yield")
plt.vlines(year1_yield, year2_yield,intercept+slope*year1_yield)


Out[35]:
<matplotlib.collections.LineCollection at 0x7fd8be928c10>

We can add up the absolute errors:


In [36]:
print np.abs(year2_yield-(intercept+slope*year1_yield)).sum()

A different line will have different errors:


In [37]:
slope2 = slope + 0.15
intercept2 = intercept - 10

plt.scatter(x=year1_yield, y=year2_yield)
plt.plot(np.arange(60,200),intercept2+slope2*np.arange(60,200), color="r")
plt.xlim([60, 200])
plt.xlabel("1931 crop yield")
plt.ylabel("1932 crop yield")
plt.vlines(year1_yield, year2_yield,intercept2+slope2*year1_yield)
print np.abs(year2_yield-(intercept2+slope2*year1_yield)).sum()

The fitting procedure minimizes those errors, trying to find the line that best describes the data.

In practice it is not the sum of absolute errors that is minimized, but the sum of squares:


In [38]:
print ((year2_yield-(intercept2+slope2*year1_yield))**2).sum()


In practice this means that big errors have a much bigger impact on the fit than small errors.

Why the square and not the absolute value? This comes from the fact that the linear regression model assumes Gaussian random noise:

$$ y = ax + b + \epsilon$$

where $\epsilon \sim N(0,\sigma^2)$ is Gaussian noise with mean zero and variance $\sigma^2$.
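
To see why squared errors follow from this assumption, here is the standard argument (a sketch): under Gaussian noise the log-likelihood of the observed data is

$$ \log L(a, b) = -\frac{1}{2\sigma^2}\sum_{i}\left(y_i - a x_i - b\right)^2 + \mathrm{const} $$

so maximizing the likelihood over $a$ and $b$ is exactly the same as minimizing the sum of squared errors. (With Laplace-distributed noise the same argument would instead lead to minimizing the sum of absolute errors.)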

Note that fitting a linear model has a closed-form solution. This means that we can calculate the optimal slope and intercept without having to try out multiple different lines.
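
For a single regressor the closed-form estimates are $a = \mathrm{cov}(x, y) / \mathrm{var}(x)$ and $b = \bar{y} - a\bar{x}$; a minimal sketch checking this against stats.linregress:


In [ ]:
# closed-form least-squares estimates for simple regression
cov_xy = np.cov(year1_yield, year2_yield)[0, 1]
slope_cf = cov_xy / year1_yield.var(ddof=1)
intercept_cf = year2_yield.mean() - slope_cf * year1_yield.mean()
print slope_cf, intercept_cf  # should reproduce slope and intercept from stats.linregress
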

To assess how well we can predict the values of one variable from the values of another, we can use the correlation coefficient.


In [40]:
r_value


Out[40]:
0.52000353342794481

Correlation is nothing more than the covariance of the two variables normalized by the product of their standard deviations. It ranges from -1 (anticorrelated), through 0 (uncorrelated), to 1 (fully correlated):

$$P_{ij} = \frac{ C_{ij} } { \sqrt{ C_{ii} \cdot C_{jj} } }$$

In [41]:
np.corrcoef(x=year1_yield, y=year2_yield)


Out[41]:
array([[ 1.        ,  0.52000353],
       [ 0.52000353,  1.        ]])
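
The same number can be obtained directly from the covariance matrix, following the formula above (a minimal sketch):


In [ ]:
C = np.cov(year1_yield, year2_yield)
print C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])  # should equal r_value
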

The correlation coefficient has another interesting property: when squared, it describes the proportion of variance of variable Y explained by variable X (and vice versa).


In [42]:
r_value**2*100


Out[42]:
27.040367477754774
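
Equivalently, using the slope and intercept fitted earlier, the squared correlation is the fraction of the 1932 variance that the regression line accounts for (a sketch):


In [ ]:
residuals = year2_yield - (intercept + slope * year1_yield)
print 1 - residuals.var(ddof=1) / year2_yield.var(ddof=1)  # equals r_value**2 (up to rounding)
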

There is also a p-value. It is determined entirely by r_value and the number of samples.
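
Under the null hypothesis of no correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ follows a t distribution with $n-2$ degrees of freedom, which matches the p-value returned by stats.linregress; a minimal sketch:


In [ ]:
n = len(year1_yield)
t_r = r_value * np.sqrt(n - 2) / np.sqrt(1 - r_value**2)
print p_value, 2 * stats.t.sf(np.abs(t_r), n - 2)  # the two values should agree
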

Exercise: there is clearly an outlier in the distribution (the data point in the upper right corner of the scatterplot). What will happen to the correlation if you remove it?


In [103]:
year1_yield_no_outlier = year1_yield[year1_yield!=year1_yield.max()]
year2_yield_no_outlier = year2_yield[year1_yield!=year1_yield.max()]
slope, intercept, r_value, p_value, std_err = stats.linregress(year1_yield_no_outlier, year2_yield_no_outlier)
plt.scatter(x=year1_yield_no_outlier, y=year2_yield_no_outlier)
plt.plot(np.arange(60,200),intercept+slope*np.arange(60,200), color="b", label="without outlier")
plt.xlim([60, 200])
plt.xlabel("1931 crop yield")
plt.ylabel("1932 crop yield")
plt.vlines(year1_yield_no_outlier, year2_yield_no_outlier,intercept+slope*year1_yield_no_outlier)

slope, intercept, r_value, p_value, std_err = stats.linregress(year1_yield, year2_yield)
plt.plot(np.arange(60,200),intercept+slope*np.arange(60,200), color="r", label="with outlier")
plt.legend()


Out[103]:
<matplotlib.legend.Legend at 0x7fd8bd669c50>

Multiple linear regression


In [43]:
prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data

In [44]:
ds = sm.datasets.get_rdataset("Duncan", "car", cache=True)
print ds.__doc__


+----------+-------------------+
| Duncan   | R Documentation   |
+----------+-------------------+

Duncan's Occupational Prestige Data
-----------------------------------

Description
~~~~~~~~~~~

The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in 1950.

Usage
~~~~~

::

    Duncan

Format
~~~~~~

This data frame contains the following columns:

type
    Type of occupation. A factor with the following levels: ``prof``,
    professional and managerial; ``wc``, white-collar; ``bc``,
    blue-collar.

income
    Percent of males in occupation earning $3500 or more in 1950.

education
    Percent of males in occupation in 1950 who were high-school
    graduates.

prestige
    Percent of raters in NORC study rating occupation as excellent or
    good in prestige.

Source
~~~~~~

Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free Press
[Table VI-1].

References
~~~~~~~~~~

Fox, J. (2008) *Applied Regression Analysis and Generalized Linear
Models*, Second Edition. Sage.

Fox, J. and Weisberg, S. (2011) *An R Companion to Applied Regression*,
Second Edition, Sage.


In [45]:
prestige.head()


Out[45]:
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90

How well can we predict prestige using education and income?


In [46]:
prestige_model = ols("prestige ~ income + education", data=prestige).fit()
prestige_model.summary()


Out[46]:
OLS Regression Results
Dep. Variable: prestige R-squared: 0.828
Model: OLS Adj. R-squared: 0.820
Method: Least Squares F-statistic: 101.2
Date: Tue, 01 Sep 2015 Prob (F-statistic): 8.65e-17
Time: 06:01:40 Log-Likelihood: -178.98
No. Observations: 45 AIC: 364.0
Df Residuals: 42 BIC: 369.4
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -6.0647 4.272 -1.420 0.163 -14.686 2.556
income 0.5987 0.120 5.003 0.000 0.357 0.840
education 0.5458 0.098 5.555 0.000 0.348 0.744
Omnibus: 1.279 Durbin-Watson: 1.458
Prob(Omnibus): 0.528 Jarque-Bera (JB): 0.520
Skew: 0.155 Prob(JB): 0.771
Kurtosis: 3.426 Cond. No. 163.

Multiple linear regression (or the General Linear Model - GLM) is based on describing the data with a matrix of variables: $$ Y = X\beta + \epsilon $$ where $\beta$ is the vector of coefficients and $X$ is the design matrix. In our case the design matrix looks like this:


In [47]:
plt.matshow(prestige_model.model.exog)
plt.xticks(range(3), prestige_model.model.exog_names, rotation=-25)
plt.colorbar()


Out[47]:
<matplotlib.colorbar.Colorbar instance at 0x7fd8be725cf8>

Note that on top of our two regressors of interest ("income" and "education") we also have an "Intercept" column, just like in regular linear regression with one variable.
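
The coefficients in the summary above are simply the least-squares solution of this matrix equation; a minimal sketch recovering them straight from the design matrix with numpy:


In [ ]:
X = prestige_model.model.exog   # design matrix: Intercept, income, education
y = prestige_model.model.endog  # prestige values
beta = np.linalg.lstsq(X, y)[0]
print beta  # should match the coefficients reported by prestige_model.summary()
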

In fact we can recreate the same results from our previous example using GLM.


In [48]:
crop_model = ols("Y2 ~ Y1", data=immer_ds.data).fit()
crop_model.summary()


Out[48]:
OLS Regression Results
Dep. Variable: Y2 R-squared: 0.270
Model: OLS Adj. R-squared: 0.244
Method: Least Squares F-statistic: 10.38
Date: Tue, 01 Sep 2015 Prob (F-statistic): 0.00323
Time: 06:01:40 Log-Likelihood: -133.01
No. Observations: 30 AIC: 270.0
Df Residuals: 28 BIC: 272.8
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 45.1276 15.392 2.932 0.007 13.598 76.657
Y1 0.4402 0.137 3.221 0.003 0.160 0.720
Omnibus: 1.042 Durbin-Watson: 1.153
Prob(Omnibus): 0.594 Jarque-Bera (JB): 1.009
Skew: 0.390 Prob(JB): 0.604
Kurtosis: 2.553 Cond. No. 450.

In [49]:
plt.matshow(crop_model.model.exog)
plt.xticks(range(2), crop_model.model.exog_names, rotation=-25)
plt.colorbar()


Out[49]:
<matplotlib.colorbar.Colorbar instance at 0x7fd8be5c9a28>

You can also express a two-sample t-test as a GLM:


In [50]:
immer_subset = immer_data[np.logical_or(immer_data.Var == "M", immer_data.Var == "T")]

In [51]:
subset_model = ols("Y1 ~ Var", data=immer_subset).fit()
subset_model.summary()


/home/vagrant/miniconda/lib/python2.7/site-packages/scipy/stats/stats.py:1277: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[51]:
OLS Regression Results
Dep. Variable: Y1 R-squared: 0.155
Model: OLS Adj. R-squared: 0.070
Method: Least Squares F-statistic: 1.830
Date: Tue, 01 Sep 2015 Prob (F-statistic): 0.206
Time: 06:01:41 Log-Likelihood: -57.436
No. Observations: 12 AIC: 118.9
Df Residuals: 10 BIC: 119.8
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 102.5833 12.971 7.909 0.000 73.683 131.484
Var[T.T] 24.8167 18.343 1.353 0.206 -16.055 65.689
Omnibus: 2.846 Durbin-Watson: 1.407
Prob(Omnibus): 0.241 Jarque-Bera (JB): 1.551
Skew: 0.875 Prob(JB): 0.460
Kurtosis: 2.807 Cond. No. 2.62

In [52]:
plt.matshow(subset_model.model.exog)
plt.xticks(range(2), subset_model.model.exog_names, rotation=-25)
plt.colorbar()


Out[52]:
<matplotlib.colorbar.Colorbar instance at 0x7fd8be3f2f38>

This model can also be used for more than two "groups":


In [53]:
full_model = ols("Y1 ~ Var", data=immer_ds.data).fit()
full_model.summary()


Out[53]:
OLS Regression Results
Dep. Variable: Y1 R-squared: 0.116
Model: OLS Adj. R-squared: -0.026
Method: Least Squares F-statistic: 0.8170
Date: Tue, 01 Sep 2015 Prob (F-statistic): 0.526
Time: 06:01:41 Log-Likelihood: -140.90
No. Observations: 30 AIC: 291.8
Df Residuals: 25 BIC: 298.8
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 102.5833 11.857 8.652 0.000 78.164 127.003
Var[T.P] 7.1667 16.768 0.427 0.673 -27.368 41.701
Var[T.S] -0.5500 16.768 -0.033 0.974 -35.084 33.984
Var[T.T] 24.8167 16.768 1.480 0.151 -9.718 59.351
Var[T.V] 0.8833 16.768 0.053 0.958 -33.651 35.418
Omnibus: 2.958 Durbin-Watson: 0.690
Prob(Omnibus): 0.228 Jarque-Bera (JB): 2.648
Skew: 0.678 Prob(JB): 0.266
Kurtosis: 2.472 Cond. No. 5.83

In [54]:
plt.matshow(full_model.model.exog)
plt.xticks(range(len(full_model.model.exog_names)), full_model.model.exog_names, rotation=-25)
plt.colorbar()


Out[54]:
<matplotlib.colorbar.Colorbar instance at 0x7fd8bf302128>

We can now explicitly test hypotheses, such as: is there a difference between peatland and trebi?


In [55]:
full_model.f_test("Var[T.P] = Var[T.T]")


Out[55]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[ 1.10795911]]), p=0.302594070459, df_denom=25, df_num=1>

Which is equivalent to


In [56]:
full_model.f_test([0,1,0,-1,0])


Out[56]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[ 1.10795911]]), p=0.302594070459, df_denom=25, df_num=1>

There are also other ways of coding categorical variables in design matrices.
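
For example, patsy (the formula engine behind ols) supports sum-to-zero (deviation) coding; a sketch, assuming patsy's C(..., Sum) contrast syntax:


In [ ]:
# deviation (sum-to-zero) coding instead of the default treatment coding
sum_coded_model = ols("Y1 ~ C(Var, Sum)", data=immer_ds.data).fit()
print sum_coded_model.model.exog_names
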

Shared variance and collinearity


In [57]:
data = sm.datasets.longley.load_pandas().data
data.head()


Out[57]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323 83.0 234289 2356 1590 107608 1947
1 61122 88.5 259426 2325 1456 108632 1948
2 60171 88.2 258054 3682 1616 109773 1949
3 61187 89.5 284599 3351 1650 110929 1950
4 63221 96.2 328975 2099 3099 112075 1951

In [58]:
print sm.datasets.longley.NOTE


::

    Number of Observations - 16

    Number of Variables - 6

    Variable name definitions::

            TOTEMP - Total Employment
            GNPDEFL - GNP deflator
            GNP - GNP
            UNEMP - Number of unemployed
            ARMED - Size of armed forces
            POP - Population
            YEAR - Year (1947 - 1962)

First we fit a model with one variable


In [59]:
model_1 = ols("TOTEMP ~ POP", data=data).fit()
model_1.summary()


/home/vagrant/miniconda/lib/python2.7/site-packages/scipy/stats/stats.py:1277: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=16
  "anyway, n=%i" % int(n))
Out[59]:
OLS Regression Results
Dep. Variable: TOTEMP R-squared: 0.922
Model: OLS Adj. R-squared: 0.917
Method: Least Squares F-statistic: 166.3
Date: Tue, 01 Sep 2015 Prob (F-statistic): 3.69e-09
Time: 06:01:42 Log-Likelihood: -132.37
No. Observations: 16 AIC: 268.7
Df Residuals: 14 BIC: 270.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 8380.6742 4422.434 1.895 0.079 -1104.503 1.79e+04
POP 0.4849 0.038 12.896 0.000 0.404 0.566
Omnibus: 0.822 Durbin-Watson: 1.571
Prob(Omnibus): 0.663 Jarque-Bera (JB): 0.723
Skew: 0.227 Prob(JB): 0.697
Kurtosis: 2.063 Cond. No. 2.05e+06

Lots of variance explained and a high t-value!

Let's try another variable:


In [60]:
model_2 = ols("TOTEMP ~ GNP", data=data).fit()
model_2.summary()


Out[60]:
OLS Regression Results
Dep. Variable: TOTEMP R-squared: 0.967
Model: OLS Adj. R-squared: 0.965
Method: Least Squares F-statistic: 415.1
Date: Tue, 01 Sep 2015 Prob (F-statistic): 8.36e-12
Time: 06:01:42 Log-Likelihood: -125.43
No. Observations: 16 AIC: 254.9
Df Residuals: 14 BIC: 256.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 5.184e+04 681.372 76.087 0.000 5.04e+04 5.33e+04
GNP 0.0348 0.002 20.374 0.000 0.031 0.038
Omnibus: 1.925 Durbin-Watson: 1.619
Prob(Omnibus): 0.382 Jarque-Bera (JB): 1.215
Skew: 0.664 Prob(JB): 0.545
Kurtosis: 2.759 Cond. No. 1.66e+06

Even more variance explained and an even higher t-value!

Let's put both variables in the model


In [61]:
model_1_plus_2 = ols("TOTEMP ~ POP + GNP", data=data).fit()
model_1_plus_2.summary()


Out[61]:
OLS Regression Results
Dep. Variable: TOTEMP R-squared: 0.979
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 303.9
Date: Tue, 01 Sep 2015 Prob (F-statistic): 1.22e-11
Time: 06:01:42 Log-Likelihood: -121.88
No. Observations: 16 AIC: 249.8
Df Residuals: 13 BIC: 252.1
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 8.894e+04 1.38e+04 6.452 0.000 5.92e+04 1.19e+05
POP -0.4097 0.152 -2.693 0.018 -0.738 -0.081
GNP 0.0632 0.011 5.933 0.000 0.040 0.086
Omnibus: 0.827 Durbin-Watson: 1.301
Prob(Omnibus): 0.661 Jarque-Bera (JB): 0.601
Skew: 0.435 Prob(JB): 0.740
Kurtosis: 2.618 Cond. No. 4.20e+07

The t-values decreased drastically! The effect of population (POP) even changed direction!


In [62]:
sns.jointplot("GNP", "POP", data=data)


Out[62]:
<seaborn.axisgrid.JointGrid at 0x7fd8be418f90>

This is because the two variables are highly correlated. The t-values and coefficients reported by a GLM correspond only to the unique variance of each regressor; variance shared by the independent variables is not assigned to any of them. Note, however, that the overall model performance ($R^2$) increased.
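
A common way to quantify this kind of collinearity is the variance inflation factor (VIF); a sketch, assuming the variance_inflation_factor helper in statsmodels.stats.outliers_influence:


In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
exog = model_1_plus_2.model.exog  # columns: Intercept, POP, GNP
for i, name in enumerate(model_1_plus_2.model.exog_names):
    if name == "Intercept":
        continue  # the constant term is not of interest here
    print name, variance_inflation_factor(exog, i)
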


In [63]:
sns.jointplot("TOTEMP", "POP", data=data)


Out[63]:
<seaborn.axisgrid.JointGrid at 0x7fd8be056750>

In [64]:
slope, intercept, r_value, p_value, std_err = stats.linregress(data["GNP"], data["POP"])

In [65]:
predicted = intercept+slope*data["GNP"]
data["POPwoGNP"] = data["POP"] - predicted

In [66]:
sns.jointplot("TOTEMP", "POPwoGNP", data=data)


Out[66]:
<seaborn.axisgrid.JointGrid at 0x7fd8bdedff50>

Exercise: there are many different models one could fit. Write a loop that iterates over a set of models and tries to find the one that shows the strongest relationship between unemployment and the size of the armed forces. What can you learn from this experiment in the context of what we said about p-hacking?


In [81]:
regressors = set(data.columns) - set(["ARMED", "UNEMP"])
regressors


Out[81]:
{'GNP', 'GNPDEFL', 'POP', 'POPwoGNP', 'TOTEMP', 'YEAR'}

In [86]:
list(itertools.combinations(regressors, 4))


Out[86]:
[('GNP', 'GNPDEFL', 'YEAR', 'POP'),
 ('GNP', 'GNPDEFL', 'YEAR', 'TOTEMP'),
 ('GNP', 'GNPDEFL', 'YEAR', 'POPwoGNP'),
 ('GNP', 'GNPDEFL', 'POP', 'TOTEMP'),
 ('GNP', 'GNPDEFL', 'POP', 'POPwoGNP'),
 ('GNP', 'GNPDEFL', 'TOTEMP', 'POPwoGNP'),
 ('GNP', 'YEAR', 'POP', 'TOTEMP'),
 ('GNP', 'YEAR', 'POP', 'POPwoGNP'),
 ('GNP', 'YEAR', 'TOTEMP', 'POPwoGNP'),
 ('GNP', 'POP', 'TOTEMP', 'POPwoGNP'),
 ('GNPDEFL', 'YEAR', 'POP', 'TOTEMP'),
 ('GNPDEFL', 'YEAR', 'POP', 'POPwoGNP'),
 ('GNPDEFL', 'YEAR', 'TOTEMP', 'POPwoGNP'),
 ('GNPDEFL', 'POP', 'TOTEMP', 'POPwoGNP'),
 ('YEAR', 'POP', 'TOTEMP', 'POPwoGNP')]

In [71]:
import itertools
import pandas as pd
models = []
t_values = []
p_values = []
# fit UNEMP ~ ARMED plus every possible subset of the remaining regressors
for i in range(len(regressors)):
    for regressor_subset in itertools.combinations(regressors, i):
        ind_variables = ["ARMED", ] + list(regressor_subset)
        models.append(ind_variables)
        model = ols("UNEMP ~ " + " + ".join(ind_variables), data=data).fit()
        t_values.append(model.params["ARMED"])   # NB: this stores the ARMED coefficient, not its t-value
        p_values.append(model.pvalues["ARMED"])

In [88]:
t_values


Out[88]:
[-0.23823618980853387,
 -0.74993585016869591,
 -0.79787067200432271,
 -0.74177671973940085,
 -0.66211434703015937,
 -0.69139168506110704,
 0.42587358006672982,
 -0.82734959800855779,
 -0.47403364494950279,
 -0.18053291820175732,
 -0.67251034064305992,
 -0.18053291820173872,
 -0.66442922867466248,
 -0.67174378603117357,
 -0.78032127265362061,
 -0.23178265250443181,
 -0.79137492410604504,
 -0.55453830607344579,
 -0.22775478771265117,
 -0.39376776397438845,
 -0.18053291820174952,
 -0.054831492349734556,
 -0.49061768940196926,
 -0.22675115011390623,
 -0.73034763357246058,
 -0.22675115011394137,
 -0.26944228728167097,
 -0.48378029205900264,
 -0.26944228727924091,
 -0.39532876139622419,
 -0.18053291820175066,
 -0.39532876139620654,
 -0.72736232807783296,
 -0.54112165438638615,
 -0.22876151308563436,
 -0.54872037744072455,
 -0.22675115011392866,
 -0.36500422757503009,
 -0.6397174326081011,
 -0.26944228727953762,
 -0.53113460536010937,
 -0.39532876139619955,
 -0.27138147648315575,
 -0.49461170540799626,
 -0.27138147648310962,
 -0.31752770623175564,
 -0.22675115011395297,
 -0.31752770623171311,
 -0.44854105937810251,
 -0.26944228727922487,
 -0.4485410593552368,
 -0.3953287613962061,
 -0.61855307513562097,
 -0.27138147648316524,
 -0.53580429075276936,
 -0.31752770623175586,
 -0.44854105935427435,
 -0.42869317572287452,
 -0.27138147648376404,
 -0.42869317572256438,
 -0.31752770623171506,
 -0.44854105935385713,
 -0.42869317572252619]

In [89]:
df = pd.DataFrame({"covariates": models, 
                   "p_values": p_values,
                   "t_values": t_values})

In [90]:
df.head()


Out[90]:
covariates p_values t_values
0 [ARMED] 0.510949 -0.238236
1 [ARMED, GNP] 0.012324 -0.749936
2 [ARMED, GNPDEFL] 0.006173 -0.797871
3 [ARMED, YEAR] 0.005779 -0.741777
4 [ARMED, POP] 0.011559 -0.662114

In [92]:
df.sort(["t_values"])


Out[92]:
covariates p_values t_values
7 [ARMED, GNP, GNPDEFL] 0.005849 -0.827350
2 [ARMED, GNPDEFL] 0.006173 -0.797871
16 [ARMED, YEAR, POP] 0.012489 -0.791375
14 [ARMED, GNPDEFL, TOTEMP] 0.001125 -0.780321
1 [ARMED, GNP] 0.012324 -0.749936
3 [ARMED, YEAR] 0.005779 -0.741777
24 [ARMED, GNP, GNPDEFL, TOTEMP] 0.001869 -0.730348
32 [ARMED, GNPDEFL, YEAR, POP] 0.026681 -0.727362
5 [ARMED, TOTEMP] 0.042309 -0.691392
10 [ARMED, GNP, TOTEMP] 0.003298 -0.672510
13 [ARMED, GNPDEFL, POP] 0.036638 -0.671744
12 [ARMED, GNPDEFL, YEAR] 0.022208 -0.664429
4 [ARMED, POP] 0.011559 -0.662114
38 [ARMED, YEAR, POP, TOTEMP] 0.000021 -0.639717
52 [ARMED, GNPDEFL, YEAR, POP, TOTEMP] 0.000066 -0.618553
17 [ARMED, YEAR, TOTEMP] 0.000041 -0.554538
35 [ARMED, GNPDEFL, POP, TOTEMP] 0.011100 -0.548720
33 [ARMED, GNPDEFL, YEAR, TOTEMP] 0.000206 -0.541122
54 [ARMED, GNPDEFL, YEAR, TOTEMP, POPwoGNP] 0.003300 -0.535804
40 [ARMED, YEAR, TOTEMP, POPwoGNP] 0.002042 -0.531135
43 [ARMED, GNP, GNPDEFL, YEAR, TOTEMP] 0.000007 -0.494612
22 [ARMED, GNP, GNPDEFL, YEAR] 0.003653 -0.490618
27 [ARMED, GNP, YEAR, TOTEMP] 0.000002 -0.483780
8 [ARMED, GNP, YEAR] 0.002089 -0.474034
48 [ARMED, GNP, YEAR, POP, TOTEMP] 0.000174 -0.448541
50 [ARMED, GNP, YEAR, TOTEMP, POPwoGNP] 0.000174 -0.448541
56 [ARMED, YEAR, POP, TOTEMP, POPwoGNP] 0.000174 -0.448541
61 [ARMED, GNP, YEAR, POP, TOTEMP, POPwoGNP] 0.000174 -0.448541
57 [ARMED, GNP, GNPDEFL, YEAR, POP, TOTEMP] 0.000363 -0.428693
59 [ARMED, GNP, GNPDEFL, YEAR, TOTEMP, POPwoGNP] 0.000363 -0.428693
... ... ... ...
51 [ARMED, GNP, POP, TOTEMP, POPwoGNP] 0.158411 -0.395329
41 [ARMED, POP, TOTEMP, POPwoGNP] 0.158411 -0.395329
19 [ARMED, POP, TOTEMP] 0.046094 -0.393768
37 [ARMED, GNPDEFL, TOTEMP, POPwoGNP] 0.074277 -0.365004
55 [ARMED, GNPDEFL, POP, TOTEMP, POPwoGNP] 0.159248 -0.317528
45 [ARMED, GNP, GNPDEFL, POP, TOTEMP] 0.159248 -0.317528
60 [ARMED, GNP, GNPDEFL, POP, TOTEMP, POPwoGNP] 0.159248 -0.317528
47 [ARMED, GNP, GNPDEFL, TOTEMP, POPwoGNP] 0.159248 -0.317528
58 [ARMED, GNP, GNPDEFL, YEAR, POP, POPwoGNP] 0.032018 -0.271381
53 [ARMED, GNPDEFL, YEAR, POP, POPwoGNP] 0.032018 -0.271381
42 [ARMED, GNP, GNPDEFL, YEAR, POP] 0.032018 -0.271381
44 [ARMED, GNP, GNPDEFL, YEAR, POPwoGNP] 0.032018 -0.271381
26 [ARMED, GNP, YEAR, POP] 0.054031 -0.269442
39 [ARMED, YEAR, POP, POPwoGNP] 0.054031 -0.269442
28 [ARMED, GNP, YEAR, POPwoGNP] 0.054031 -0.269442
49 [ARMED, GNP, YEAR, POP, POPwoGNP] 0.054031 -0.269442
0 [ARMED] 0.510949 -0.238236
15 [ARMED, GNPDEFL, POPwoGNP] 0.310884 -0.231783
34 [ARMED, GNPDEFL, YEAR, POPwoGNP] 0.338968 -0.228762
18 [ARMED, YEAR, POPwoGNP] 0.333834 -0.227755
46 [ARMED, GNP, GNPDEFL, POP, POPwoGNP] 0.240535 -0.226751
25 [ARMED, GNP, GNPDEFL, POPwoGNP] 0.240535 -0.226751
36 [ARMED, GNPDEFL, POP, POPwoGNP] 0.240535 -0.226751
23 [ARMED, GNP, GNPDEFL, POP] 0.240535 -0.226751
9 [ARMED, GNP, POP] 0.487455 -0.180533
30 [ARMED, GNP, POP, POPwoGNP] 0.487455 -0.180533
20 [ARMED, POP, POPwoGNP] 0.487455 -0.180533
11 [ARMED, GNP, POPwoGNP] 0.487455 -0.180533
21 [ARMED, TOTEMP, POPwoGNP] 0.839342 -0.054831
6 [ARMED, POPwoGNP] 0.212985 0.425874

63 rows × 3 columns


In [ ]: