Read the BRFSS data and extract heights and log weights.


In [101]:
import brfss
import numpy as np
df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
weights = np.log10(weights)

Estimate intercept and slope.


In [102]:
import thinkstats2
inter, slope = thinkstats2.LeastSquares(heights, weights)
inter, slope


Out[102]:
(0.99308041639154265, 0.0052814541694194035)

Make a scatter plot of the data and show the fitted line.


In [103]:
import thinkplot
thinkplot.Scatter(heights, weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, fys)


Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.


In [104]:
thinkplot.Scatter(heights, 10**weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, 10**fys)


Plot percentiles of the residuals.

The lines are flat over most of the range, indicating that the relationship is linear.

The lines are mostly parallel, indicating that the variance of the residuals is the same over the range.


In [105]:
res = thinkstats2.Residuals(heights, weights, inter, slope)
df['residual'] = res

bins = np.arange(130, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)

means = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for i, group in groups][1:-1]

thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
    ys = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(means, ys, label=label)


Compute correlation.


In [106]:
rho = thinkstats2.Corr(heights, weights)
rho


Out[106]:
0.53172826059836875

Compute coefficient of determination.


In [107]:
r2 = thinkstats2.CoefDetermination(weights, res)
r2


Out[107]:
0.28273494311920844

Confirm that $R^2 = \rho^2$.


In [124]:
rho**2 - r2


Out[124]:
-2.4169555246089658e-13

Compute Std(ys), which is the RMSE of predictions that don't use height.


In [109]:
std_ys = thinkstats2.Std(weights)
std_ys


Out[109]:
0.10320725030006163

Compute Std(res), the RMSE of predictions that do use height.


In [110]:
std_res = thinkstats2.Std(res)
std_res


Out[110]:
0.08740777080415527

How much does height information reduce RMSE? About 15%.


In [111]:
1 - std_res / std_ys


Out[111]:
0.15308497658809272

Use resampling to compute sampling distributions for inter and slope.


In [112]:
t = []
for _ in range(100):
    sample = thinkstats2.ResampleRows(df)
    estimates = thinkstats2.LeastSquares(sample.htm3, np.log10(sample.wtkg2))
    t.append(estimates)

inters, slopes = zip(*t)

Plot the sampling distribution of slope.


In [113]:
cdf = thinkstats2.Cdf(slopes)
thinkplot.Cdf(cdf)


Warning: Brewer ran out of colors.
Out[113]:
{'xscale': 'linear', 'yscale': 'linear'}

Compute the p-value of the slope.


In [116]:
pvalue = cdf[0]
pvalue


Out[116]:
0.0

Compute the 90% confidence interval of slope.


In [117]:
ci = cdf.Percentile(5), cdf.Percentile(95)
ci


Out[117]:
(0.0052597853025311081, 0.0053006108100437377)

Compute the mean of the sampling distribution.


In [118]:
mean = thinkstats2.Mean(slopes)
mean


Out[118]:
0.0052827446926660684

Compute the standard deviation of the sampling distribution, which is the standard error.


In [119]:
stderr = thinkstats2.Std(slopes)
stderr


Out[119]:
1.2199452296645808e-05

Resample using sampling weights.


In [120]:
def ResampleRowsWeighted(df):
    """Resamples a DataFrame using probabilities proportional to finalwt.

    live: DataFrame

    returns: DataFrame
    """
    weights = df.finalwt
    cdf = thinkstats2.Pmf(weights.iteritems()).MakeCdf()
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample

Summarize a sampling distribution.


In [121]:
def Summarize(estimates):
    mean = thinkstats2.Mean(estimates)
    stderr = thinkstats2.Std(estimates)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.Percentile(5), cdf.Percentile(95)
    print('mean', mean)
    print('stderr', stderr)
    print('ci', ci)

Resample rows without weights, and summarize results.


In [123]:
estimates_unweighted = [thinkstats2.ResampleRows(df).htm3.mean() for _ in range(100)]
Summarize(estimates_unweighted)


('mean', 168.9558018805958)
('stderr', 0.016998444250244257)
('ci', (168.92560732836154, 168.98326562784212))

Resample rows with weights. The estimated mean height is almost 2 cm taller if we take into account the sampling weights, and this difference is much bigger than the sampling error.


In [122]:
estimates_weighted = [ResampleRowsWeighted(df).htm3.mean() for _ in range(100)]
Summarize(estimates_weighted)


('mean', 170.49577714788091)
('stderr', 0.016879720001523813)
('ci', (170.46547525212716, 170.52335081549748))

In [ ]: