In [1]:
from __future__ import print_function, division
import numpy as np
import pandas as pd
import first
import thinkstats2
import thinkplot
%matplotlib inline
Let's load up the NSFG data again.
In [2]:
live, firsts, others = first.MakeFrames()
live.shape
Out[2]:
And select live, full-term births.
In [3]:
live = live[live.prglngth>=37]
live.shape
Out[3]:
And drop rows with missing data (just for the variables we want).
In [4]:
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
live.shape
Out[4]:
Check a few rows:
In [5]:
live.head()
Out[5]:
And summarize a few variables.
In [6]:
live[['agepreg', 'totalwgt_lb']].describe()
Out[6]:
Here's a scatterplot of age and birthweight, with parameters tuned to avoid saturation.
In [7]:
ages = live.agepreg
weights = live.totalwgt_lb
thinkplot.Scatter(ages, weights, alpha=0.1, s=15)
thinkplot.Config(xlabel='age (years)',
                 ylabel='weight (lbs)',
                 xlim=[10, 45],
                 ylim=[0, 15],
                 legend=False)
Mean of mother's age:
In [8]:
live['agepreg'].mean()
Out[8]:
Mean and standard deviation of birthweight:
In [9]:
live['totalwgt_lb'].mean(), live['totalwgt_lb'].std()
Out[9]:
And the coefficient of correlation:
In [10]:
thinkstats2.Corr(ages, weights)
Out[10]:
The Pandas corr method gets the same result:
In [11]:
live['totalwgt_lb'].corr(live['agepreg'])
Out[11]:
To see the relationship more clearly, we can group mother's age into 3-year bins and plot percentiles of birth weight for each bin.
In [12]:
bins = np.arange(10, 48, 3)                  # 3-year bins of mother's age
indices = np.digitize(live.agepreg, bins)    # bin index for each pregnancy
groups = live.groupby(indices)

# Drop the first and last groups, which catch the out-of-range extremes.
ages = [group.agepreg.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.totalwgt_lb) for i, group in groups][1:-1]

thinkplot.PrePlot(5)
for percent in [90, 75, 50, 25, 10]:
    weights = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(ages, weights, label=label)

thinkplot.Config(xlabel="mother's age (years)",
                 ylabel='birth weight (lbs)',
                 xlim=[14, 50],
                 legend=True)
The first and last points are not very reliable, because they represent fewer data points.
It looks like there is a generally positive relationship between birth weight and mother's age, possibly leveling off or dropping for older mothers.
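To see how thin those bins are, we can count the rows in each one; this check is an addition, not part of the original notebook.
groups.size()    # rows per age bin; the bins at the extremes have far fewer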
We can get more information about the mothers by reading the respondents file, which contains one row per respondent.
In [13]:
def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
    return df
There are 7643 respondents and 3087 variables about each.
In [14]:
resp = ReadFemResp()
resp.shape
Out[14]:
If we use the caseid variable as the index, we can look up respondents efficiently by id.
Here's what the first few rows look like:
In [15]:
resp.index = resp.caseid
resp.head()
Out[15]:
Now we can join the tables, using the caseid from each pregnancy record to find the corresponding respondent and (abstractly) copy over the additional variables.
So the joined table has one row for each pregnancy and all the columns from both tables.
In [16]:
join = live.join(resp, on='caseid', rsuffix='_r')
join.shape
Out[16]:
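As a quick sanity check (an addition, not in the original), the joined table should have one row per pregnancy and the columns of both tables combined:
live.shape, resp.shape, join.shape    # join keeps len(live) rows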
The encoding for screentime is a colon-separated timestamp.
In [17]:
join.screentime.head()
Out[17]:
If we convert to a datetime object, we avoid some processing problems later.
In [18]:
join.screentime = pd.to_datetime(join.screentime)
join.screentime.head()
Out[18]:
To estimate the effect of mother's age on birthweight, we can use a simple least squares fit.
In [19]:
ages = join.agepreg
weights = join.totalwgt_lb
inter, slope = thinkstats2.LeastSquares(ages, weights)
inter, slope, slope*16*10
Out[19]:
The slope is almost 3 ounces per decade.
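To make the unit conversion explicit, here's a small sketch (an addition, not in the original): the slope is in pounds per year of mother's age, so the predicted difference over a decade, at 16 ounces per pound, matches slope*16*10.
w25 = inter + slope * 25    # predicted weight at age 25, in pounds
w35 = inter + slope * 35    # predicted weight at age 35, in pounds
(w35 - w25) * 16            # difference over a decade, in ounces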
We can do the same thing using Ordinary Least Squares from statsmodels:
In [20]:
import statsmodels.formula.api as smf
formula = ('totalwgt_lb ~ agepreg')
results = smf.ols(formula, data=join).fit()
results.summary()
Out[20]:
The results object contains the parameters (and all the other info in the table):
In [21]:
inter, slope = results.params
inter, slope
Out[21]:
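The results object also exposes the rest of the summary table; for example (these are standard statsmodels attributes, shown here as an addition):
print(results.rsquared)    # coefficient of determination, R^2
print(results.pvalues)     # p-value for each parameter
results.conf_int()         # 95% confidence intervals for the parameters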
And the results are consistent with my implementation:
In [22]:
slope * 16 * 10 # slope in ounces per decade
Out[22]:
We can use a boolean variable as a predictor:
In [23]:
join['isfirst'] = (join.birthord == 1)
formula = 'totalwgt_lb ~ isfirst'
results = smf.ols(formula, data=join).fit()
results.summary()
Out[23]:
First babies are lighter by about 1.5 ounces.
In [24]:
results.params['isfirst[T.True]'] * 16
Out[24]:
And we can make a model with multiple predictors.
In [25]:
formula = 'totalwgt_lb ~ agepreg + isfirst'
results = smf.ols(formula, data=join).fit()
results.summary()
Out[25]:
If we control for mother's age, the difference in weight for first babies is cut to about 0.5 ounces (and no longer statistically significant).
In [26]:
results.params['isfirst[T.True]'] * 16
Out[26]:
The relationship with age might be non-linear. Adding a quadratic term helps a little, although note that the $R^2$ values for all of these models are very small.
In [27]:
join['age2'] = join.agepreg**2
formula = 'totalwgt_lb ~ agepreg + age2'
results = smf.ols(formula, data=join).fit()
results.summary()
Out[27]:
Now we can combine the quadratic age model with isfirst:
In [28]:
formula = 'totalwgt_lb ~ agepreg + age2 + isfirst'
results = smf.ols(formula, data=join).fit()
results.summary()
Out[28]:
Now the effect is cut to less than a third of an ounce, and is very plausibly due to chance.
In [29]:
results.params['isfirst[T.True]'] * 16
Out[29]:
Here's the best model I found, combining all variables that seemed plausibly predictive.
In [30]:
formula = ('totalwgt_lb ~ agepreg + age2 + C(race) + '
           'nbrnaliv>1 + paydu==1 + totincr')
results = smf.ols(formula, data=join).fit()
results.summary()
Out[30]:
All predictors are statistically significant, so the effects could be legit, but the $R^2$ value is still very small: this model doesn't provide much help for the office pool.
To predict a binary variable like the sex of the baby, we can use logistic regression. Here's a model of sex as a function of mother's age:
In [31]:
live['isboy'] = (live.babysex==1).astype(int)
model = smf.logit('isboy ~ agepreg', data=live)
results = model.fit()
results.summary()
Out[31]:
The estimated parameter is 0.0016, which is small and not statistically significant. So the apparent relationship might be due to chance.
But for the sake of the example, I'll take it at face value and work out the effect on the prediction.
A parameter in a logistic regression is a log odds ratio: the model is $\log \frac{p}{1-p} = \beta_0 + \beta_1 \cdot \mathtt{agepreg}$, so a 10-year difference in mother's age adds $10 \beta_1$ to the log odds. We can compute the corresponding odds ratio:
In [32]:
log_odds_ratio = results.params['agepreg'] * 10
odds_ratio = np.exp(log_odds_ratio)
odds_ratio
Out[32]:
And we can use the odds ratio to update a prior probability. A mother at the mean age has a 51% chance of having a boy.
In that case, a mother who is 10 years older has a 51.4% chance.
In [33]:
p = 0.51
prior_odds = p / (1-p)
post_odds = prior_odds * odds_ratio
p = post_odds / (post_odds + 1)
p
Out[33]:
I searched for other factors that might be predictive. The most likely candidates turn out not to be statistically significant.
In [34]:
formula = 'isboy ~ agepreg + hpagelb + birthord + C(race)'
model = smf.logit(formula, data=live)
results = model.fit()
results.summary()
Out[34]:
Again, taking these parameters at face value, we can use the model to make predictions.
The baseline strategy is to always guess boy, which yields an accuracy of 50.8%.
In [35]:
exog = pd.DataFrame(model.exog, columns=model.exog_names)
endog = pd.DataFrame(model.endog, columns=[model.endog_names])
actual = endog['isboy']
baseline = actual.mean()
baseline
Out[35]:
results.predict uses the model to generate predictions for the data.
Adding up the correct positive and negative predictions, we get an accuracy of 51.3%.
In [36]:
predict = (results.predict() >= 0.5)       # probability of a boy -> guess boy
true_pos = predict * actual                # guessed boy, actually boy
true_neg = (1 - predict) * (1 - actual)    # guessed girl, actually girl
acc = (sum(true_pos) + sum(true_neg)) / len(actual)
acc
Out[36]:
And we can use the model to generate a prediction for the office pool.
Suppose your hypothetical coworker is 39 years old and white, her husband is 30, and they are expecting their first child.
In [37]:
columns = ['agepreg', 'hpagelb', 'birthord', 'race']
new = pd.DataFrame([[39, 30, 1, 2]], columns=columns)
y = results.predict(new)
y
Out[37]:
This is one of the few scenarios where the model predicts that the baby will be a girl.
But given the low accuracy of the model (and the likelihood that it is overfit to the data), it would not be much help in the pool.
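One way to check the overfitting concern is a simple hold-out split. This sketch is an addition, not part of the original analysis; the seed and split fraction are arbitrary choices.
# Hold out 30% of the rows, refit the sex model on the rest, and
# measure out-of-sample accuracy (an illustration added here).
cols = ['isboy', 'agepreg', 'hpagelb', 'birthord', 'race']
valid = live.dropna(subset=cols)
np.random.seed(17)                        # arbitrary seed
mask = np.random.random(len(valid)) < 0.7
train, test = valid[mask], valid[~mask]
formula = 'isboy ~ agepreg + hpagelb + birthord + C(race)'
results = smf.logit(formula, data=train).fit()
predict = results.predict(test) >= 0.5    # probability of a boy -> guess
(predict == test.isboy).mean()            # out-of-sample accuracy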