Solution to an exercise in Think Stats, 2nd Edition, by Allen Downey (thinkstats2.com)
Exercise 10.1
Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?
Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?
Read the BRFSS data and extract heights and log weights.
In [5]:
import brfss
import numpy as np
%matplotlib inline
In [1]:
df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
weights = np.log10(weights)
Estimate intercept and slope.
In [2]:
import thinkstats2
inter, slope = thinkstats2.LeastSquares(heights, weights)
inter, slope
Out[2]:
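The exercise asks how best to present the parameters when one of the variables is log-transformed. One option (a sketch added here, not part of the original solution; the names percent_per_cm and ref_weight are ad hoc) is to undo the transform: report the slope as a percent change in weight per cm of height, and report the intercept as the predicted weight at the mean height rather than at height 0.
In [ ]:
# Sketch: present the log10 model on a more interpretable scale.
# Each additional cm of height multiplies predicted weight by 10**slope.
percent_per_cm = 100 * (10**slope - 1)
# Predicted weight (kg) at the mean height, instead of the meaningless
# intercept at height = 0.
ref_weight = 10**(inter + slope * heights.mean())
percent_per_cm, ref_weight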
Make a scatter plot of the data and show the fitted line.
In [6]:
import thinkplot
thinkplot.Scatter(heights, weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, fys)
thinkplot.Config(xlabel='height (cm)', ylabel='log10 weight (kg)', legend=False)
Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.
In [7]:
thinkplot.Scatter(heights, 10**weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, 10**fys)
thinkplot.Config(xlabel='height (cm)', ylabel='weight (kg)', legend=False)
Plot percentiles of the residuals.
The lines are flat over most of the range, indicating that the relationship is linear.
The lines are mostly parallel, indicating that the variance of the residuals is the same over the range.
In [9]:
res = thinkstats2.Residuals(heights, weights, inter, slope)
df['residual'] = res
bins = np.arange(130, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)
means = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for i, group in groups][1:-1]
thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
    ys = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(means, ys, label=label)
thinkplot.Config(xlabel='height (cm)', ylabel='residual log10 weight', legend=False)
Compute correlation.
In [10]:
rho = thinkstats2.Corr(heights, weights)
rho
Out[10]:
Compute coefficient of determination.
In [11]:
r2 = thinkstats2.CoefDetermination(weights, res)
r2
Out[11]:
Confirm that $R^2 = \rho^2$.
In [12]:
rho**2 - r2
Out[12]:
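This identity is a property of simple least squares: the fitted line is $\hat{y} = \bar{y} + \rho \frac{S_y}{S_x}(x - \bar{x})$, so $\mathrm{Var}(res) = (1 - \rho^2)\,\mathrm{Var}(ys)$ and therefore $R^2 = 1 - \mathrm{Var}(res)/\mathrm{Var}(ys) = \rho^2$; the difference above should be zero up to floating-point error.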
Compute Std(ys), which is the RMSE of predictions that don't use height.
In [13]:
std_ys = thinkstats2.Std(weights)
std_ys
Out[13]:
Compute Std(res), the RMSE of predictions that do use height.
In [14]:
std_res = thinkstats2.Std(res)
std_res
Out[14]:
How much does height information reduce RMSE? About 15%.
In [15]:
1 - std_res / std_ys
Out[15]:
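As a consistency check (added here, not in the original), the RMSE reduction is determined by $R^2$: since $\mathrm{Std}(res)/\mathrm{Std}(ys) = \sqrt{1 - R^2}$, the same fraction can be computed directly from r2.
In [ ]:
# Sketch: the RMSE reduction implied by R^2; should match 1 - std_res / std_ys
1 - np.sqrt(1 - r2)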
Use resampling to compute sampling distributions for inter and slope.
In [16]:
t = []
for _ in range(100):
    sample = thinkstats2.ResampleRows(df)
    estimates = thinkstats2.LeastSquares(sample.htm3, np.log10(sample.wtkg2))
    t.append(estimates)
inters, slopes = zip(*t)
Plot the sampling distribution of slope.
In [17]:
cdf = thinkstats2.Cdf(slopes)
thinkplot.Cdf(cdf)
Out[17]:
Compute the p-value of the slope: the fraction of resampled slopes that are less than or equal to 0.
In [18]:
pvalue = cdf[0]
pvalue
Out[18]:
Compute the 90% confidence interval of slope.
In [19]:
ci = cdf.Percentile(5), cdf.Percentile(95)
ci
Out[19]:
Compute the mean of the sampling distribution.
In [20]:
mean = thinkstats2.Mean(slopes)
mean
Out[20]:
Compute the standard deviation of the sampling distribution, which is the standard error.
In [21]:
stderr = thinkstats2.Std(slopes)
stderr
Out[21]:
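As a rough check (an addition, not in the original), if the sampling distribution is approximately normal, the 90% confidence interval above should be close to the mean plus or minus 1.645 standard errors.
In [ ]:
# Sketch: normal-approximation CI for comparison with the percentile CI above
mean - 1.645 * stderr, mean + 1.645 * stderr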
Resample using sampling weights.
In [40]:
def ResampleRowsWeighted(df, column='finalwt'):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    # Series mapping each row index to its sampling weight
    weights = df[column]
    # Cdf over row indices, weighted by the sampling weights
    cdf = thinkstats2.Cdf(dict(weights))
    # draw row indices with replacement, with probability proportional to weight
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
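For comparison (an alternative sketch, not what the book uses), pandas can do the same weighted resampling directly: DataFrame.sample can draw rows with replacement, with probabilities proportional to a weight column. The name ResampleRowsWeightedPandas is made up for this example.
In [ ]:
def ResampleRowsWeightedPandas(df, column='finalwt'):
    """Weighted resampling using pandas.DataFrame.sample (sketch)."""
    return df.sample(n=len(df), replace=True, weights=column)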
Summarize a sampling distribution.
In [32]:
def Summarize(estimates):
    """Prints the mean, standard error, and 90% CI of a list of estimates."""
    mean = thinkstats2.Mean(estimates)
    stderr = thinkstats2.Std(estimates)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.Percentile(5), cdf.Percentile(95)
    print('mean', mean)
    print('stderr', stderr)
    print('ci', ci)
Resample rows without weights, and summarize results.
In [33]:
estimates_unweighted = [thinkstats2.ResampleRows(df).htm3.mean() for _ in range(100)]
Summarize(estimates_unweighted)
Resample rows with weights. The estimated mean height is almost 2 cm greater when we take the sampling weights into account, and this difference is much bigger than the sampling error.
In [41]:
estimates_weighted = [ResampleRowsWeighted(df).htm3.mean() for _ in range(100)]
Summarize(estimates_weighted)
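To make the comparison concrete (a check added here, not part of the original solution), compare the gap between the weighted and unweighted means to the standard errors printed above.
In [ ]:
# Sketch: difference between weighted and unweighted mean height (cm)
np.mean(estimates_weighted) - np.mean(estimates_unweighted)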