Regression

Copyright 2015 Allen Downey

License: Creative Commons Attribution 4.0 International


In [1]:
from __future__ import print_function, division

import numpy as np
import pandas as pd

import first
import thinkstats2
import thinkplot

%matplotlib inline

Let's load up the NSFG data again.


In [2]:
live, firsts, others = first.MakeFrames()
live.shape


Out[2]:
(9148, 244)

And select live, full-term births.


In [3]:
live = live[live.prglngth>=37]
live.shape


Out[3]:
(8001, 244)

And drop rows with missing data (just for the variables we want).


In [4]:
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
live.shape


Out[4]:
(7912, 244)

Check a few rows:


In [5]:
live.head()


Out[5]:
caseid pregordr howpreg_n howpreg_p moscurrp nowprgdk pregend1 pregend2 nbrnaliv multbrth ... laborfor_i religion_i metro_i basewgt adj_mod_basewgt finalwgt secu_p sest cmintvw totalwgt_lb
0 1 1 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 3410.389399 3869.349602 6448.271112 2 9 NaN 8.8125
1 1 2 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 3410.389399 3869.349602 6448.271112 2 9 NaN 7.8750
2 2 1 NaN NaN NaN NaN 5 NaN 3 5 ... 0 0 0 7226.301740 8567.549110 12999.542264 2 12 NaN 9.1250
3 2 2 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 7226.301740 8567.549110 12999.542264 2 12 NaN 7.0000
4 2 3 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 7226.301740 8567.549110 12999.542264 2 12 NaN 6.1875

5 rows × 244 columns

And summarize a few variables.


In [6]:
live[['agepreg', 'totalwgt_lb']].describe()


Out[6]:
agepreg totalwgt_lb
count 7912.000000 7912.000000
mean 24.978917 7.482384
std 5.523504 1.169065
min 10.830000 0.125000
25% 20.580000 6.687500
50% 24.410000 7.437500
75% 28.910000 8.187500
max 44.080000 15.437500

Here's a scatterplot of age and birthweight, with parameters tuned to avoid saturation.


In [7]:
ages = live.agepreg
weights = live.totalwgt_lb
thinkplot.Scatter(ages, weights, alpha=0.1, s=15)
thinkplot.Config(xlabel='age (years)',
                 ylabel='weight (lbs)',
                 xlim=[10, 45],
                 ylim=[0, 15],
                 legend=False)
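
For readers who don't have thinkplot installed, here is a rough equivalent using plain Matplotlib (a sketch, assuming matplotlib.pyplot is available); the transparency and small markers are what keep the dense regions from saturating:

import matplotlib.pyplot as plt

# alpha makes overlapping points partly transparent; s keeps the markers small
plt.scatter(ages, weights, alpha=0.1, s=15)
plt.xlabel('age (years)')
plt.ylabel('weight (lbs)')
plt.xlim(10, 45)
plt.ylim(0, 15)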


Mean of mother's age:


In [8]:
live['agepreg'].mean()


Out[8]:
24.978916835186908

Mean and standard deviation of birthweight:


In [9]:
live['totalwgt_lb'].mean(), live['totalwgt_lb'].std()


Out[9]:
(7.482384352881699, 1.1690646813035233)

And the coefficient of correlation:


In [10]:
thinkstats2.Corr(ages, weights)


Out[10]:
0.085023137198215998

The pandas corr method gives the same result:


In [11]:
live['totalwgt_lb'].corr(live['agepreg'])


Out[11]:
0.085023137198215887

To see the relationship more clearly, we can group mother's age into 3-year bins and plot percentiles of birth weight for each bin.


In [12]:
bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)

ages = [group.agepreg.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.totalwgt_lb) for i, group in groups][1:-1]

thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weights = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(ages, weights, label=label)

thinkplot.Config(xlabel="mother's age (years)",
                 ylabel='birth weight (lbs)',
                 xlim=[14, 50],
                 legend=True)


The first and last points are not very reliable, because they represent fewer data points.
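
To see why, we can count the rows in each bin (a quick sketch reusing the groups defined above); the extreme bins should turn out to be much smaller than the middle ones:

# number of pregnancies in each age bin
[len(group) for i, group in groups]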

It looks like there is a generally positive relationship between birth weight and mother's age, possibly leveling off or dropping for older mothers.

We can get more information about the mothers by reading the respondents file, which contains one row per respondent.


In [13]:
def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
    return df

There are 7643 respondents and 3087 variables about each.


In [14]:
resp = ReadFemResp()
resp.shape


Out[14]:
(7643, 3087)

If we use the caseid variable as the index, we can look up respondents efficiently by id.

Here's what the first few rows look like:


In [15]:
resp.index = resp.caseid
resp.head()


Out[15]:
caseid rscrinf rdormres rostscrn rscreenhisp rscreenrace age_a age_r cmbirth agescrn ... pubassis_i basewgt adj_mod_basewgt finalwgt secu_r sest cmintvw cmlstyr screentime intvlngth
caseid
2298 2298 1 5 5 1 5 27 27 902 27 ... 0 3247.916977 5123.759559 5556.717241 2 18 1234 1222 18:26:36 110.492667
5012 5012 1 5 1 5 5 42 42 718 42 ... 0 2335.279149 2846.799490 4744.191350 2 18 1233 1221 16:30:59 64.294000
11586 11586 1 5 1 5 5 43 43 708 43 ... 0 2335.279149 2846.799490 4744.191350 2 18 1234 1222 18:19:09 75.149167
6794 6794 5 5 4 1 5 15 15 1042 15 ... 0 3783.152221 5071.464231 5923.977368 2 18 1234 1222 15:54:43 28.642833
616 616 1 5 4 1 5 20 20 991 20 ... 0 5341.329968 6437.335772 7229.128072 2 18 1233 1221 14:19:44 69.502667

5 rows × 3087 columns
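
For example, with caseid as the index we can pull out a single respondent directly (an illustrative lookup; 2298 is just the first caseid shown above):

resp.loc[2298]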

Now we can join the tables, using the caseid from each pregnancy record to find the corresponding respondent and (abstractly) copy over the additional variables.

So the joined table has one row for each pregnancy and all the columns from both tables.


In [16]:
join = live.join(resp, on='caseid', rsuffix='_r')
join.shape


Out[16]:
(7912, 3331)

The encoding for screentime is a colon-separated timestamp.


In [17]:
join.screentime.head()


Out[17]:
0    19:56:43
1    19:56:43
2    14:54:03
3    14:54:03
4    14:54:03
Name: screentime, dtype: object

If we convert to a datetime object, we avoid some processing problems later.


In [18]:
join.screentime = pd.to_datetime(join.screentime)
join.screentime.head()


Out[18]:
0   2015-10-07 19:56:43
1   2015-10-07 19:56:43
2   2015-10-07 14:54:03
3   2015-10-07 14:54:03
4   2015-10-07 14:54:03
Name: screentime, dtype: datetime64[ns]

To estimate the effect of mother's age on birthweight, we can use a simple least squares fit.


In [19]:
ages = join.agepreg
weights = join.totalwgt_lb

inter, slope = thinkstats2.LeastSquares(ages, weights)
inter, slope, slope*16*10


Out[19]:
(7.0328792906633737, 0.017995378469939118, 2.879260555190259)

The slope is almost 3 ounces per decade.
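
As a sanity check, here is the same fit computed directly with NumPy (a minimal sketch, not the book's implementation): the least-squares slope is $\mathrm{Cov}(x, y) / \mathrm{Var}(x)$ and the intercept is $\bar{y} - \mathrm{slope} \cdot \bar{x}$, so this should reproduce the values above.

# closed-form simple linear regression
slope_np = np.cov(ages, weights)[0, 1] / ages.var()
inter_np = weights.mean() - slope_np * ages.mean()
inter_np, slope_np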

We can do the same thing using Ordinary Least Squares from statsmodels:


In [20]:
import statsmodels.formula.api as smf

formula = ('totalwgt_lb ~ agepreg')
results = smf.ols(formula, data=join).fit()
results.summary()


Out[20]:
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.007
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 57.60
Date: Wed, 07 Oct 2015 Prob (F-statistic): 3.58e-14
Time: 15:40:17 Log-Likelihood: -12433.
No. Observations: 7912 AIC: 2.487e+04
Df Residuals: 7910 BIC: 2.488e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 7.0329 0.061 115.940 0.000 6.914 7.152
agepreg 0.0180 0.002 7.589 0.000 0.013 0.023
Omnibus: 381.292 Durbin-Watson: 1.586
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1098.730
Skew: 0.215 Prob(JB): 2.59e-239
Kurtosis: 4.774 Cond. No. 119.

The results object contains the parameters (and all the other info in the table):


In [21]:
inter, slope = results.params
inter, slope


Out[21]:
(7.032879290663371, 0.017995378469939035)

And the results are consistent with my implementation:


In [22]:
slope * 16 * 10           # slope in ounces per decade


Out[22]:
2.8792605551902457

We can use a boolean variable as a predictor:


In [23]:
join['isfirst'] = (join.birthord == 1)

formula = 'totalwgt_lb ~ isfirst'
results = smf.ols(formula, data=join).fit()
results.summary()


Out[23]:
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.002
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 12.34
Date: Wed, 07 Oct 2015 Prob (F-statistic): 0.000445
Time: 15:40:18 Log-Likelihood: -12456.
No. Observations: 7912 AIC: 2.492e+04
Df Residuals: 7910 BIC: 2.493e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 7.5265 0.018 414.288 0.000 7.491 7.562
isfirst[T.True] -0.0924 0.026 -3.513 0.000 -0.144 -0.041
Omnibus: 364.721 Durbin-Watson: 1.581
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1001.182
Skew: 0.218 Prob(JB): 3.95e-218
Kurtosis: 4.687 Cond. No. 2.57

First babies are lighter by about 1.5 ounces.


In [24]:
results.params['isfirst[T.True]'] * 16


Out[24]:
-1.4779760012473198

And we can make a model with multiple predictors.


In [25]:
formula = 'totalwgt_lb ~ agepreg + isfirst'
results = smf.ols(formula, data=join).fit()
results.summary()


Out[25]:
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.007
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 29.50
Date: Wed, 07 Oct 2015 Prob (F-statistic): 1.72e-13
Time: 15:40:18 Log-Likelihood: -12433.
No. Observations: 7912 AIC: 2.487e+04
Df Residuals: 7909 BIC: 2.489e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 7.0719 0.069 102.463 0.000 6.937 7.207
isfirst[T.True] -0.0327 0.028 -1.185 0.236 -0.087 0.021
agepreg 0.0171 0.002 6.826 0.000 0.012 0.022
Omnibus: 382.030 Durbin-Watson: 1.585
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1103.586
Skew: 0.214 Prob(JB): 2.29e-240
Kurtosis: 4.779 Cond. No. 138.

If we control for mother's age, the difference in weight for first babies is cut to about 0.5 ounces (and no longer statistically significant).


In [26]:
results.params['isfirst[T.True]'] * 16


Out[26]:
-0.52389693286041505

The relationship with age might be non-linear. Adding a quadratic term helps a little, although note that the $R^2$ values for all of these models are very small.


In [27]:
join['age2'] = join.agepreg**2

formula = 'totalwgt_lb ~ agepreg + age2'
results = smf.ols(formula, data=join).fit()
results.summary()


Out[27]:
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.009
Model: OLS Adj. R-squared: 0.009
Method: Least Squares F-statistic: 37.48
Date: Wed, 07 Oct 2015 Prob (F-statistic): 6.30e-17
Time: 15:40:18 Log-Likelihood: -12425.
No. Observations: 7912 AIC: 2.486e+04
Df Residuals: 7909 BIC: 2.488e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6.0378 0.247 24.428 0.000 5.553 6.522
agepreg 0.0979 0.019 5.050 0.000 0.060 0.136
age2 -0.0015 0.000 -4.153 0.000 -0.002 -0.001
Omnibus: 378.969 Durbin-Watson: 1.587
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1074.412
Skew: 0.219 Prob(JB): 4.95e-234
Kurtosis: 4.751 Cond. No. 1.36e+04
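
As an aside (an illustrative calculation, not part of the original analysis), the fitted quadratic implies that predicted birth weight peaks where its derivative is zero, at agepreg = -coef(agepreg) / (2 * coef(age2)), which with the coefficients above is around 33 years:

# age at which the fitted quadratic reaches its maximum
-results.params['agepreg'] / (2 * results.params['age2'])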

Now we can combine the quadratic age model with isfirst.


In [28]:
formula = 'totalwgt_lb ~ agepreg + age2 + isfirst'
results = smf.ols(formula, data=join).fit()
results.summary()


Out[28]:
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.009
Model: OLS Adj. R-squared: 0.009
Method: Least Squares F-statistic: 25.12
Date: Wed, 07 Oct 2015 Prob (F-statistic): 3.61e-16
Time: 15:40:18 Log-Likelihood: -12425.
No. Observations: 7912 AIC: 2.486e+04
Df Residuals: 7908 BIC: 2.488e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6.0795 0.256 23.775 0.000 5.578 6.581
isfirst[T.True] -0.0177 0.028 -0.637 0.524 -0.072 0.037
agepreg 0.0957 0.020 4.865 0.000 0.057 0.134
age2 -0.0015 0.000 -4.030 0.000 -0.002 -0.001
Omnibus: 379.403 Durbin-Watson: 1.586
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1077.436
Skew: 0.219 Prob(JB): 1.09e-234
Kurtosis: 4.754 Cond. No. 1.41e+04

Now the effect is cut to less than a third of an ounce, and very plausibly due to chance.


In [29]:
results.params['isfirst[T.True]'] * 16


Out[29]:
-0.28383607712278636

Here's the best model I found, combining all variables that seemed plausibly predictive.


In [30]:
formula = ('totalwgt_lb ~ agepreg + age2 + C(race) + '
           'nbrnaliv>1 + paydu==1 + totincr')
results = smf.ols(formula, data=join).fit()
results.summary()


Out[30]:
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.046
Model: OLS Adj. R-squared: 0.045
Method: Least Squares F-statistic: 54.63
Date: Wed, 07 Oct 2015 Prob (F-statistic): 1.11e-76
Time: 15:40:19 Log-Likelihood: -12275.
No. Observations: 7912 AIC: 2.457e+04
Df Residuals: 7904 BIC: 2.462e+04
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6.0829 0.243 25.022 0.000 5.606 6.559
C(race)[T.2] 0.3625 0.031 11.649 0.000 0.302 0.424
C(race)[T.3] 0.2434 0.050 4.833 0.000 0.145 0.342
nbrnaliv > 1[T.True] -1.0779 0.121 -8.889 0.000 -1.316 -0.840
paydu == 1[T.True] 0.0985 0.030 3.255 0.001 0.039 0.158
agepreg 0.0714 0.019 3.741 0.000 0.034 0.109
age2 -0.0012 0.000 -3.341 0.001 -0.002 -0.000
totincr 0.0132 0.004 3.452 0.001 0.006 0.021
Omnibus: 453.110 Durbin-Watson: 1.609
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1254.734
Skew: 0.300 Prob(JB): 3.45e-273
Kurtosis: 4.856 Cond. No. 1.36e+04

All predictors are statistically significant, so the effects could be legit, but the $R^2$ value is still very small: this model doesn't provide much help for the office pool.

Logistic regression

Let's say we want to predict the sex of a baby based on information about the mother.

I'll start by creating a binary dependent variable, isboy, and checking for dependence on mother's age:


In [31]:
live['isboy'] = (live.babysex==1).astype(int)
model = smf.logit('isboy ~ agepreg', data=live)    
results = model.fit()
results.summary()


Optimization terminated successfully.
         Current function value: 0.692981
         Iterations 3
Out[31]:
Logit Regression Results
Dep. Variable: isboy No. Observations: 7912
Model: Logit Df Residuals: 7910
Method: MLE Df Model: 1
Date: Wed, 07 Oct 2015 Pseudo R-squ.: 1.352e-05
Time: 15:40:19 Log-Likelihood: -5482.9
converged: True LL-Null: -5482.9
LLR p-value: 0.7002
coef std err z P>|z| [95.0% Conf. Int.]
Intercept -0.0038 0.104 -0.036 0.971 -0.208 0.200
agepreg 0.0016 0.004 0.385 0.700 -0.006 0.010

The estimated parameter is 0.0016, which is small and not statistically significant. So the apparent relationship might be due to chance.

But for the sake of the example, I'll take it at face value and work out the effect on the prediction.

A parameter in a logistic regression is a log odds ratio: the model is $\log \frac{p}{1-p} = \beta_0 + \beta_1 \cdot \mathrm{agepreg}$, so a difference of 10 years in mother's age changes the log odds by $10\,\beta_1$ and multiplies the odds by $e^{10\,\beta_1}$:


In [32]:
log_odds_ratio = results.params['agepreg'] * 10
odds_ratio = np.exp(log_odds_ratio)
odds_ratio


Out[32]:
1.0158023740467292

And we can use the odds ratio to update a prior probability. A mother at the mean age has a 51% chance of having a boy.

In that case, a mother who is 10 years older has a 51.4% chance.


In [33]:
p = 0.51
prior_odds = p / (1-p)
post_odds = prior_odds * odds_ratio
p = post_odds / (post_odds + 1)
p


Out[33]:
0.51391744178527499
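
As a cross-check (a sketch that assumes results still refers to the single-predictor model fit above), we can ask the model for predicted probabilities directly at the mean age and ten years later; the values should be close to the 51% and 51.4% quoted above.

mean_age = live.agepreg.mean()
results.predict(pd.DataFrame({'agepreg': [mean_age, mean_age + 10]}))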

I searched for other factors that might be predictive. The most likely candidates turn out not to be statistically significant.


In [34]:
formula = 'isboy ~ agepreg + hpagelb + birthord + C(race)'
model = smf.logit(formula, data=live)
results = model.fit()
results.summary()


Optimization terminated successfully.
         Current function value: 0.692727
         Iterations 3
Out[34]:
Logit Regression Results
Dep. Variable: isboy No. Observations: 7826
Model: Logit Df Residuals: 7820
Method: MLE Df Model: 5
Date: Wed, 07 Oct 2015 Pseudo R-squ.: 0.0004249
Time: 15:40:19 Log-Likelihood: -5421.3
converged: True LL-Null: -5423.6
LLR p-value: 0.4654
coef std err z P>|z| [95.0% Conf. Int.]
Intercept -0.0639 0.111 -0.575 0.565 -0.282 0.154
C(race)[T.2] -0.0135 0.054 -0.250 0.803 -0.120 0.092
C(race)[T.3] -0.0158 0.089 -0.178 0.859 -0.190 0.158
agepreg -0.0056 0.006 -0.963 0.336 -0.017 0.006
hpagelb 0.0069 0.004 1.542 0.123 -0.002 0.016
birthord 0.0297 0.024 1.264 0.206 -0.016 0.076

Again, taking these parameters at face value, we can use the model to make predictions.

The baseline strategy is to always guess boy, which yields an accuracy of 50.8%.


In [35]:
exog = pd.DataFrame(model.exog, columns=model.exog_names)
endog = pd.DataFrame(model.endog, columns=[model.endog_names])

actual = endog['isboy']
baseline = actual.mean()
baseline


Out[35]:
0.5079223102478917

results.predict uses the model to generate predictions for the data.

Adding up the correct positive and negative predictions, we get an accuracy of 51.3%.


In [36]:
predict = (results.predict() >= 0.5)
true_pos = predict * actual
true_neg = (1 - predict) * (1 - actual)

acc = (sum(true_pos) + sum(true_neg)) / len(actual)
acc


Out[36]:
0.51316125734730389
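
Equivalently (a one-line sketch), we can compare the boolean predictions to the actual values directly:

(predict == actual).mean()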

And we can use the model to generate a prediction for the office pool.

Suppose your hypothetical coworker is 39 years old and white, her husband is 30, and they are expecting their first child.


In [37]:
columns = ['agepreg', 'hpagelb', 'birthord', 'race']
new = pd.DataFrame([[39, 30, 1, 2]], columns=columns)
y = results.predict(new)
y


Out[37]:
array([ 0.48485192])

This is one of the few scenarios where the model predicts that the baby will be a girl.

But given the model's low accuracy (and the likelihood that it is overfit to the data), it would not be much help in the pool.