Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import random
import thinkstats2
import thinkplot
In [2]:
import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb
The following function computes the intercept and slope of the least squares fit.
In [3]:
from thinkstats2 import Mean, MeanVar, Var, Std, Cov
def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope
Here's the least squares fit to birth weight as a function of mother's age.
In [4]:
inter, slope = LeastSquares(ages, weights)
inter, slope
Out[4]:
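As a quick sanity check, np.polyfit with degree 1 computes the same least-squares fit; it returns coefficients from highest order to lowest, so the slope comes first.
In [ ]:
# Cross-check against NumPy's built-in least squares fit
slope_np, inter_np = np.polyfit(ages, weights, 1)
slope_np, inter_np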
The intercept is often easier to interpret if we evaluate the fitted line near the mean of the independent variable; for these data the mean age of the mothers is about 25 years.
In [5]:
inter + slope * 25
Out[5]:
And the slope is easier to interpret if we express it in pounds per decade (or ounces per year).
In [6]:
slope * 10
Out[6]:
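And for the ounces-per-year version, multiply by 16 ounces per pound:
In [ ]:
# slope is in pounds per year; convert to ounces per year
slope * 16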
The following function evaluates the fitted line at the given xs.
In [7]:
def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys
And here's an example.
In [8]:
fit_xs, fit_ys = FitLine(ages, inter, slope)
Here's a scatterplot of the data with the fitted line.
In [9]:
thinkplot.Scatter(ages, weights, color='blue', alpha=0.1, s=10)
thinkplot.Plot(fit_xs, fit_ys, color='white', linewidth=3)
thinkplot.Plot(fit_xs, fit_ys, color='red', linewidth=2)
thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Birth weight (lbs)',
                 axis=[10, 45, 0, 15],
                 legend=False)
In [10]:
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res
Now we can add the residuals as a column in the DataFrame.
In [11]:
live['residual'] = Residuals(ages, weights, inter, slope)
To visualize the residuals, I'll split the respondents into groups by age, then plot the percentiles of the residuals versus the average age in each group.
First I'll make the groups and compute the average age in each group.
In [12]:
bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)
age_means = [group.agepreg.mean() for _, group in groups][1:-1]
age_means
Out[12]:
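In case the binning step is unclear: np.digitize maps each age to the index of its bin, assigning 0 to values below the first edge and len(bins) to values at or above the last edge. The slice [1:-1] drops the first and last groups, which cover the extreme ages and contain relatively few respondents. A small illustration with hypothetical ages:
In [ ]:
# Values below 10 map to 0; values of 46 or more map to len(bins) == 13
np.digitize([9, 10, 14, 45, 46], bins)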
Next I'll compute the CDF of the residuals in each group.
In [13]:
cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]
The following function plots percentiles of the residuals against the average age in each group.
In [14]:
def PlotPercentiles(age_means, cdfs):
    thinkplot.PrePlot(3)
    for percent in [75, 50, 25]:
        weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
        label = '%dth' % percent
        thinkplot.Plot(age_means, weight_percentiles, label=label)
The following figure shows the 25th, 50th, and 75th percentiles.
Curvature in the residuals suggests a non-linear relationship.
In [15]:
PlotPercentiles(age_means, cdfs)
thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])
In [16]:
def SampleRows(df, nrows, replace=False):
    """Choose a sample of rows from a DataFrame.

    df: DataFrame
    nrows: number of rows
    replace: whether to sample with replacement

    returns: DataFrame
    """
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample

def ResampleRows(df):
    """Resamples rows from a DataFrame.

    df: DataFrame

    returns: DataFrame
    """
    return SampleRows(df, len(df), replace=True)
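As an aside, recent versions of pandas provide this operation directly: DataFrame.sample with replace=True should behave like ResampleRows.
In [ ]:
# Equivalent resampling using the built-in pandas method
sample = live.sample(n=len(live), replace=True)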
The following function resamples the given DataFrame and returns lists of estimates for inter and slope.
In [17]:
def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = LeastSquares(ages, weights)
        t.append(estimates)

    inters, slopes = zip(*t)
    return inters, slopes
Here's an example.
In [18]:
inters, slopes = SamplingDistributions(live, iters=1001)
The following function takes a list of estimates and prints the mean, standard error, and 90% confidence interval.
In [19]:
def Summarize(estimates, actual=None):
    mean = Mean(estimates)
    stderr = Std(estimates, mu=actual)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)
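The confidence interval is just a pair of percentiles of the sampling distribution, so NumPy gives an equivalent computation; for a 90% CI we take the 5th and 95th percentiles.
In [ ]:
# A 90% CI spans the 5th to 95th percentiles of the estimates
np.percentile(inters, [5, 95])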
Here's the summary for inter.
In [20]:
Summarize(inters)
And for slope.
In [21]:
Summarize(slopes)
Exercise: Use ResampleRows and generate a list of estimates for the mean birth weight. Use Summarize to compute the SE and CI for these estimates.
In [22]:
# Solution goes here
To visualize the variability of the fit, we can plot all of the fitted lines generated by the resampled estimates.
In [23]:
for slope, inter in zip(slopes, inters):
    fxs, fys = FitLine(age_means, inter, slope)
    thinkplot.Plot(fxs, fys, color='gray', alpha=0.01)

thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Birth weight (lbs)',
                 xlim=[10, 45])
Or we can make a neater (and more efficient) plot by computing fitted lines and finding percentiles of the fits for each value of the independent variable.
In [24]:
def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = FitLine(xs, inter, slope)
        fys_seq.append(fys)

    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)
This example shows the confidence interval for the fitted values at each mother's age.
In [25]:
PlotConfidenceIntervals(age_means, inters, slopes, percent=90,
                        color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(age_means, inters, slopes, percent=50,
                        color='gray', alpha=0.5, label='50% CI')
thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Birth weight (lbs)',
                 xlim=[10, 45])
The coefficient of determination, $R^2$, compares the variance of the residuals to the variance of the dependent variable.
In [26]:
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
For birth weight and mother's age $R^2$ is very small, indicating that the mother's age predicts a small part of the variance in birth weight.
In [27]:
inter, slope = LeastSquares(ages, weights)
res = Residuals(ages, weights, inter, slope)
r2 = CoefDetermination(weights, res)
r2
Out[27]:
We can confirm that $R^2 = \rho^2$:
In [28]:
print('rho', thinkstats2.Corr(ages, weights))
print('R', np.sqrt(r2))
To express predictive power, I think it's useful to compare the standard deviation of the residuals to the standard deviation of the dependent variable: Std(ys) is a measure of the RMSE if you guess birth weight without taking the mother's age into account, and Std(res) is the RMSE if you do.
In [29]:
print('Std(ys)', Std(weights))
print('Std(res)', Std(res))
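To quantify the improvement, we can express the drop in RMSE as a fraction of the baseline:
In [ ]:
# Fractional reduction in RMSE when predictions use mother's age
1 - Std(res) / Std(weights)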
As another example of the same idea, here's how much we can improve guesses about IQ if we know someone's SAT scores.
In [30]:
var_ys = 15**2
rho = 0.72
r2 = rho**2
var_res = (1 - r2) * var_ys
std_res = np.sqrt(var_res)
std_res
Out[30]:
The following class implements a hypothesis test of the slope. The null hypothesis is that there is no relationship between mother's age and birth weight, which we simulate by permuting the residuals; if the observed slope reflects a real relationship, the p-value should be small.
In [31]:
class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights
And it is.
In [32]:
ht = SlopeTest((ages, weights))
pvalue = ht.PValue()
pvalue
Out[32]:
Under the null hypothesis, the largest slope we observe after 1000 tries is substantially less than the observed value.
In [33]:
ht.actual, ht.MaxTestStat()
Out[33]:
We can also use resampling to estimate the sampling distribution of the slope.
In [34]:
sampling_cdf = thinkstats2.Cdf(slopes)
The distribution of slopes under the null hypothesis, and the sampling distribution of the slope under resampling, have the same shape, but one has mean at 0 and the other has mean at the observed slope.
To compute a p-value, we can count how often the estimated slope under the null hypothesis exceeds the observed slope, or how often the estimated slope under resampling falls below 0.
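Assuming the HypothesisTest base class stores the simulated test statistics in test_stats after PValue runs (as in the book's implementation), both counts can be computed directly:
In [ ]:
# Fraction of null-hypothesis slopes at or above the observed slope
print(np.mean(np.asarray(ht.test_stats) >= ht.actual))
# Fraction of resampled slopes at or below 0
print(np.mean(np.asarray(slopes) <= 0))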
In [35]:
thinkplot.PrePlot(2)
thinkplot.Plot([0, 0], [0, 1], color='0.8')
ht.PlotCdf(label='null hypothesis')
thinkplot.Cdf(sampling_cdf, label='sampling distribution')
thinkplot.Config(xlabel='slope (lbs / year)',
                 ylabel='CDF',
                 xlim=[-0.03, 0.03],
                 legend=True, loc='upper left')
Here's how to get a p-value from the sampling distribution.
In [36]:
pvalue = sampling_cdf[0]
pvalue
Out[36]:
The following function resamples rows with probability proportional to the sampling weights, which the NSFG provides in the finalwgt column.
In [37]:
def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = thinkstats2.Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
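As an aside, DataFrame.sample also accepts a weights argument, so a pandas-native version of ResampleRowsWeighted should look like this:
In [ ]:
# Weighted resampling using the built-in pandas method
sample = live.sample(n=len(live), replace=True, weights=live.finalwgt)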
We can use it to estimate the mean birth weight and compute the SE and CI.
In [38]:
iters = 100
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)
And here's what the same calculation looks like if we ignore the weights.
In [39]:
estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)
The difference is non-negligible, which suggests that there are differences in birth weight between the strata in the survey.
Exercise: Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?
Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is finalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?
Read the BRFSS data and extract heights and log weights.
In [40]:
import brfss
df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
log_weights = np.log10(weights)
Estimate intercept and slope.
In [41]:
# Solution goes here
Make a scatter plot of the data and show the fitted line.
In [42]:
# Solution goes here
Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.
In [43]:
# Solution goes here
Plot percentiles of the residuals.
In [44]:
# Solution goes here
Compute correlation.
In [45]:
# Solution goes here
Compute coefficient of determination.
In [46]:
# Solution goes here
Confirm that $R^2 = \rho^2$.
In [47]:
# Solution goes here
Compute Std(ys), which is the RMSE of predictions that don't use height.
In [48]:
# Solution goes here
Compute Std(res), the RMSE of predictions that do use height.
In [49]:
# Solution goes here
How much does height information reduce RMSE?
In [50]:
# Solution goes here
Use resampling to compute sampling distributions for inter and slope.
In [51]:
# Solution goes here
Plot the sampling distribution of slope.
In [52]:
# Solution goes here
Compute the p-value of the slope.
In [53]:
# Solution goes here
Compute the 90% confidence interval of slope.
In [54]:
# Solution goes here
Compute the mean of the sampling distribution.
In [55]:
# Solution goes here
Compute the standard deviation of the sampling distribution, which is the standard error.
In [56]:
# Solution goes here
Resample rows without weights, compute mean height, and summarize results.
In [57]:
# Solution goes here
Resample rows with weights. Note that the weight column in this dataset is called finalwt.
In [58]:
# Solution goes here