3.1: One predictor


In [1]:
from __future__ import print_function, division
%matplotlib inline

import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# use matplotlib style sheet
plt.style.use('ggplot')

# import statsmodels for R-style regression
import statsmodels.formula.api as smf

Read the data

The data are in the child.iq directory of the ARM_Data download. You may need to change the path used below to match the location on your computer.


In [2]:
kidiq = pd.read_stata("../../ARM_Data/child.iq/kidiq.dta")
kidiq.head()


Out[2]:
   kid_score  mom_hs      mom_iq  mom_work  mom_age
0         65       1  121.117529         4       27
1         98       1   89.361882         4       25
2         85       1  115.443165         4       27
3         83       1   99.449639         3       25
4        115       1   92.745710         4       27

First regression -- binary predictor, Pg 31

Fit the regression using the non-jittered data


In [3]:
fit0 = smf.ols('kid_score ~ mom_hs', data=kidiq).fit()
print(fit0.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.056
Model:                            OLS   Adj. R-squared:                  0.054
Method:                 Least Squares   F-statistic:                     25.69
Date:                Thu, 30 Jul 2015   Prob (F-statistic):           5.96e-07
Time:                        13:21:13   Log-Likelihood:                -1911.8
No. Observations:                 434   AIC:                             3828.
Df Residuals:                     432   BIC:                             3836.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     77.5484      2.059     37.670      0.000        73.502    81.595
mom_hs        11.7713      2.322      5.069      0.000         7.207    16.336
==============================================================================
Omnibus:                       11.077   Durbin-Watson:                   1.464
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               11.316
Skew:                          -0.373   Prob(JB):                      0.00349
Kurtosis:                       2.738   Cond. No.                         4.11
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
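
With a single binary predictor, the least-squares intercept is the mean outcome in the group coded 0, and the slope is the difference between the two group means. Read that way, the table above says that children whose mothers did not complete high school average about 77.5, and children whose mothers did average about 11.8 points higher. The sketch below checks that identity on synthetic data, using plain NumPy rather than statsmodels. The arrays `x` and `y` are made up for illustration and are not the kidiq data.

```python
import numpy as np

# Synthetic data for illustration only (not the kidiq data).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=200)               # binary predictor, 0 or 1
y = 78 + 12 * x + rng.normal(0, 20, size=200)  # noisy outcome

# Least-squares fit of y = intercept + slope * x via the design matrix [1, x].
X = np.column_stack([np.ones(len(x)), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means computed directly from the data.
mean0 = y[x == 0].mean()
mean1 = y[x == 1].mean()

print(intercept, mean0)       # the two values agree
print(slope, mean1 - mean0)   # the two values agree
```

On the real data, the same comparison could be made with `kidiq.groupby('mom_hs')['kid_score'].mean()`.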

Plot Figure 3.1, Pg 32

A note for the python version:

  • I have not included jitter in either the vertical or horizontal direction. Instead, the points are plotted with partial transparency (alpha=0.5), so that regions of high data density can be distinguished.
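
For readers who want the jittered look of the book's figure, one possible approach is to add a small amount of uniform noise to the plotted coordinates only, leaving the data used in the regression untouched. The helper `jitter`, its `amount` default, and the example array below are hypothetical and not part of this notebook.

```python
import numpy as np

def jitter(values, amount=0.05, seed=None):
    """Return a float copy of `values` with uniform noise in [-amount, amount] added."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-amount, amount, size=values.shape)

# Hypothetical binary predictor values, standing in for kidiq['mom_hs'].
mom_hs_example = np.array([0, 1, 1, 0, 1])
jittered = jitter(mom_hs_example, amount=0.05, seed=1)
print(jittered)
```

Passing `jitter(kidiq['mom_hs'])` as the x-coordinates of the scatter call would spread the two vertical stripes slightly; the fitted line should still be computed from the original values.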

In [4]:
fig0, ax0 = plt.subplots(figsize=(8, 6))
hs_linspace = np.linspace(kidiq['mom_hs'].min(), kidiq['mom_hs'].max(), 50)

# default color cycle ('axes.color_cycle' is deprecated; use 'axes.prop_cycle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# plot points; alpha makes dense regions appear darker
ax0.scatter(kidiq['mom_hs'], kidiq['kid_score'],
            s=60, alpha=0.5, c=colors[1])
# add fitted line, looking up the coefficients by name
ax0.plot(hs_linspace,
         fit0.params['Intercept'] + fit0.params['mom_hs'] * hs_linspace,
         lw=3, c=colors[1])

ax0.set_xlabel("Mother completed high school")
ax0.set_ylabel("Child test score")


Out[4]:
[Figure 3.1: child test score plotted against mother completed high school, with the fitted regression line]

Second regression -- continuous predictor, Pg 32


In [5]:
fit1 = smf.ols('kid_score ~ mom_iq', data=kidiq).fit()
print(fit1.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:              kid_score   R-squared:                       0.201
Model:                            OLS   Adj. R-squared:                  0.199
Method:                 Least Squares   F-statistic:                     108.6
Date:                Thu, 30 Jul 2015   Prob (F-statistic):           7.66e-23
Time:                        13:21:13   Log-Likelihood:                -1875.6
No. Observations:                 434   AIC:                             3755.
Df Residuals:                     432   BIC:                             3763.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     25.7998      5.917      4.360      0.000        14.169    37.430
mom_iq         0.6100      0.059     10.423      0.000         0.495     0.725
==============================================================================
Omnibus:                        7.545   Durbin-Watson:                   1.645
Prob(Omnibus):                  0.023   Jarque-Bera (JB):                7.735
Skew:                          -0.324   Prob(JB):                       0.0209
Kurtosis:                       2.919   Cond. No.                         682.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
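
To make the coefficients concrete, the fitted line can be evaluated by hand. The short sketch below uses the intercept and slope copied from the summary table above; the function name `predict_kid_score` and the chosen IQ values are illustrative only. It gives a predicted score of about 86.8 for a mother with an IQ of 100, and a difference of about 6.1 points between children whose mothers differ by 10 IQ points.

```python
# Coefficients copied from the fit1 summary table above.
intercept = 25.7998
slope = 0.6100

def predict_kid_score(mom_iq):
    """Predicted child test score for a given mother IQ under the fitted line."""
    return intercept + slope * mom_iq

pred_100 = predict_kid_score(100)
diff_10 = predict_kid_score(110) - predict_kid_score(100)

print(round(pred_100, 2))
print(round(diff_10, 2))
```

With the fitted model in memory, `fit1.predict(pd.DataFrame({'mom_iq': [100]}))` should give the same prediction without hard-coding the coefficients.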

Plot Figure 3.2, Pg 33


In [6]:
fig1, ax1 = plt.subplots(figsize=(8, 6))
iq_linspace = np.linspace(kidiq['mom_iq'].min(), kidiq['mom_iq'].max(), 50)

# default color cycle ('axes.color_cycle' is deprecated; use 'axes.prop_cycle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# plot points; alpha makes dense regions appear darker
ax1.scatter(kidiq['mom_iq'], kidiq['kid_score'],
            s=60, alpha=0.5, c=colors[1])
# add fitted line, looking up the coefficients by name
ax1.plot(iq_linspace,
         fit1.params['Intercept'] + fit1.params['mom_iq'] * iq_linspace,
         lw=3, c=colors[1])

ax1.set_xlabel("Mother IQ score")
ax1.set_ylabel("Child test score")


Out[6]:
[Figure 3.2: child test score plotted against mother IQ score, with the fitted regression line]