Solution to an exercise in Think Stats, 2nd Edition, by Allen Downey (thinkstats2.com)

Exercise 10.1

Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?

Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?

Read the BRFSS data and extract heights and log weights.


In [5]:
import brfss
import numpy as np

%matplotlib inline

In [1]:
df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
weights = np.log10(weights)

Estimate intercept and slope.


In [2]:
import thinkstats2
inter, slope = thinkstats2.LeastSquares(heights, weights)
inter, slope


Out[2]:
(0.99308041639310674, 0.0052814541694188475)
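The exercise asks how best to present parameters when the dependent variable is log-transformed. Since the model predicts log10(weight), the slope is not in kg per cm: each extra centimeter multiplies the predicted weight by 10**slope, so the fit is most naturally reported as a percent change per cm, and predictions are converted back to kg by inverting the transform. A quick sketch using the estimates above:

```python
# Estimated parameters from the fit above.
inter = 0.9930804163931067
slope = 0.0052814541694188475

# Each extra cm multiplies predicted weight by 10**slope,
# i.e. roughly a 1.2% increase per cm.
percent_per_cm = (10**slope - 1) * 100

# Predictions come back to kg by inverting the log10 transform;
# e.g. the predicted weight for a 170 cm respondent:
pred_170 = 10**(inter + slope * 170)
```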

Make a scatter plot of the data and show the fitted line.


In [6]:
import thinkplot
thinkplot.Scatter(heights, weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, fys)
thinkplot.Config(xlabel='height (cm)', ylabel='log10 weight (kg)', legend=False)


Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.


In [7]:
thinkplot.Scatter(heights, 10**weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, 10**fys)
thinkplot.Config(xlabel='height (cm)', ylabel='weight (kg)', legend=False)


Plot percentiles of the residuals.

The lines are flat over most of the range, indicating that the relationship is linear.

The lines are mostly parallel, indicating that the variance of the residuals is the same over the range.


In [9]:
res = thinkstats2.Residuals(heights, weights, inter, slope)
df['residual'] = res

bins = np.arange(130, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)

means = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for i, group in groups][1:-1]

thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
    ys = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(means, ys, label=label)
    
thinkplot.Config(xlabel='height (cm)', ylabel='residual weight (kg)', legend=False)
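The same binned-percentile diagnostic can be sketched with plain numpy and pandas, without thinkstats2. The data below is a synthetic stand-in, not the BRFSS:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the BRFSS columns (hypothetical, for illustration).
rng = np.random.default_rng(0)
n = 10_000
df_demo = pd.DataFrame({'htm3': rng.normal(169, 8, n)})
df_demo['residual'] = rng.normal(0, 0.09, n)

# Bin the heights and compute residual percentiles within each bin.
bins = np.arange(130, 210, 5)
groups = df_demo.groupby(np.digitize(df_demo.htm3, bins))
means = groups.htm3.mean()
p25 = groups.residual.quantile(0.25)
p75 = groups.residual.quantile(0.75)
# Flat, roughly parallel percentile curves across the bins indicate a
# linear relationship with constant residual variance.
```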


Compute correlation.


In [10]:
rho = thinkstats2.Corr(heights, weights)
rho


Out[10]:
0.53172826059839817

Compute coefficient of determination.


In [11]:
r2 = thinkstats2.CoefDetermination(weights, res)
r2


Out[11]:
0.28273494311897296

Confirm that $R^2 = \rho^2$.


In [12]:
rho**2 - r2


Out[12]:
2.5091040356528538e-14
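The identity $R^2 = \rho^2$ is exact for a simple least-squares fit with an intercept, not an artifact of this dataset. A quick numerical check on synthetic data (hypothetical, not the BRFSS):

```python
import numpy as np

# Synthetic data (hypothetical, for illustration).
rng = np.random.default_rng(1)
xs = rng.normal(169, 8, 1000)
ys = 1.0 + 0.005 * xs + rng.normal(0, 0.09, 1000)

# Least squares: slope = Cov(x, y) / Var(x), intercept from the means.
slope = np.cov(xs, ys, ddof=0)[0, 1] / np.var(xs)
inter = ys.mean() - slope * xs.mean()
res = ys - (inter + slope * xs)

r2 = 1 - np.var(res) / np.var(ys)   # coefficient of determination
rho = np.corrcoef(xs, ys)[0, 1]     # Pearson correlation
```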

Compute Std(ys), which is the RMSE of predictions that don't use height.


In [13]:
std_ys = thinkstats2.Std(weights)
std_ys


Out[13]:
0.10320725030004504

Compute Std(res), the RMSE of predictions that do use height.


In [14]:
std_res = thinkstats2.Std(res)
std_res


Out[14]:
0.08740777080415557

How much does height information reduce RMSE? About 15%.


In [15]:
1 - std_res / std_ys


Out[15]:
0.15308497658795372
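The 15% figure follows directly from $R^2$: since Var(res) = (1 - R²) Var(ys), the fractional reduction in RMSE is 1 - √(1 - R²).

```python
import math

r2 = 0.28273494311897296  # coefficient of determination computed above

# Std(res) = Std(ys) * sqrt(1 - R^2), so knowing height shrinks RMSE by:
reduction = 1 - math.sqrt(1 - r2)   # about 0.153, matching the direct ratio
```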

Use resampling to compute sampling distributions for inter and slope.


In [16]:
t = []
for _ in range(100):
    sample = thinkstats2.ResampleRows(df)
    estimates = thinkstats2.LeastSquares(sample.htm3, np.log10(sample.wtkg2))
    t.append(estimates)

inters, slopes = zip(*t)
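thinkstats2.ResampleRows draws rows with replacement, which is one bootstrap sample. In plain pandas the same step can be sketched as follows (the demo data is hypothetical):

```python
import pandas as pd

def resample_rows(df):
    """Draw len(df) rows with replacement -- one bootstrap sample."""
    return df.sample(n=len(df), replace=True)

# Hypothetical stand-in data, for illustration.
df_demo = pd.DataFrame({'htm3': [160.0, 165.0, 170.0, 175.0, 180.0]})
sample = resample_rows(df_demo)
```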

Plot the sampling distribution of slope.


In [17]:
cdf = thinkstats2.Cdf(slopes)
thinkplot.Cdf(cdf)


Out[17]:
{'xscale': 'linear', 'yscale': 'linear'}

Compute the p-value of the slope: the fraction of resampled slopes that are less than or equal to 0.


In [18]:
pvalue = cdf[0]
pvalue


Out[18]:
0.0

Compute the 90% confidence interval of slope.


In [19]:
ci = cdf.Percentile(5), cdf.Percentile(95)
ci


Out[19]:
(0.0052571196935886876, 0.0053097007636649964)

Compute the mean of the sampling distribution.


In [20]:
mean = thinkstats2.Mean(slopes)
mean


Out[20]:
0.0052828017365193856

Compute the standard deviation of the sampling distribution, which is the standard error.


In [21]:
stderr = thinkstats2.Std(slopes)
stderr


Out[21]:
1.4057320854894858e-05

Resample using sampling weights.


In [40]:
def ResampleRowsWeighted(df, column='finalwt'):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    weights = df[column]
    # Build a Cdf over the row indices, where each index is drawn with
    # probability proportional to its sampling weight.
    cdf = thinkstats2.Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
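pandas offers an equivalent weighted sampler via `DataFrame.sample`; this is an alternative sketch, not thinkstats2's implementation, and the demo data is hypothetical:

```python
import pandas as pd

def resample_rows_weighted_pd(df, column='finalwt'):
    """Draw len(df) rows with replacement, with probability
    proportional to the given weight column."""
    return df.sample(n=len(df), replace=True, weights=df[column])

# Hypothetical stand-in data, for illustration.
df_demo = pd.DataFrame({'htm3': [160.0, 170.0, 180.0],
                        'finalwt': [1.0, 1.0, 98.0]})
sample = resample_rows_weighted_pd(df_demo)

# With weights this lopsided, the 180 cm row dominates a large sample:
big = df_demo.sample(n=10_000, replace=True, weights=df_demo['finalwt'])
frac_180 = (big.htm3 == 180.0).mean()   # close to 0.98
```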

Summarize a sampling distribution.


In [32]:
def Summarize(estimates):
    mean = thinkstats2.Mean(estimates)
    stderr = thinkstats2.Std(estimates)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.Percentile(5), cdf.Percentile(95)
    print('mean', mean)
    print('stderr', stderr)
    print('ci', ci)

Resample rows without weights, and summarize results.


In [33]:
estimates_unweighted = [thinkstats2.ResampleRows(df).htm3.mean() for _ in range(100)]
Summarize(estimates_unweighted)


mean 168.95592933365671
stderr 0.017625478720510346
ci (168.92333111016794, 168.98266183633461)

Resample rows with weights. The estimated mean height is about 1.5 cm taller if we take into account the sampling weights, and this difference is much bigger than the sampling error.


In [41]:
estimates_weighted = [ResampleRowsWeighted(df).htm3.mean() for _ in range(100)]
Summarize(estimates_weighted)


mean 170.49814828007842
stderr 0.018522280501919314
ci (170.46753673275532, 170.53268305745871)
