Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

```
In [1]:
```from __future__ import print_function, division
%matplotlib inline
import numpy as np
import random
import thinkstats2
import thinkplot

```
In [2]:
```import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb

The following function computes the intercept and slope of the least squares fit.

```
In [3]:
```from thinkstats2 import Mean, MeanVar, Var, Std, Cov
def LeastSquares(xs, ys):
meanx, varx = MeanVar(xs)
meany = Mean(ys)
slope = Cov(xs, ys, meanx, meany) / varx
inter = meany - slope * meanx
return inter, slope

Here's the least squares fit to birth weight as a function of mother's age.

```
In [4]:
```inter, slope = LeastSquares(ages, weights)
inter, slope

```
Out[4]:
```

```
In [5]:
```inter + slope * 25

```
Out[5]:
```

And the slope is easier to interpret if we express it in pounds per decade (or ounces per year).

```
In [6]:
```slope * 10

```
Out[6]:
```

The following function evaluates the fitted line at the given `xs`

.

```
In [7]:
```def FitLine(xs, inter, slope):
fit_xs = np.sort(xs)
fit_ys = inter + slope * fit_xs
return fit_xs, fit_ys

And here's an example.

```
In [8]:
```fit_xs, fit_ys = FitLine(ages, inter, slope)

Here's a scatterplot of the data with the fitted line.

```
In [9]:
```thinkplot.Scatter(ages, weights, color='blue', alpha=0.1, s=10)
thinkplot.Plot(fit_xs, fit_ys, color='white', linewidth=3)
thinkplot.Plot(fit_xs, fit_ys, color='red', linewidth=2)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Birth weight (lbs)',
axis=[10, 45, 0, 15],
legend=False)

```
```

```
In [10]:
```def Residuals(xs, ys, inter, slope):
xs = np.asarray(xs)
ys = np.asarray(ys)
res = ys - (inter + slope * xs)
return res

Now we can add the residuals as a column in the DataFrame.

```
In [11]:
```live['residual'] = Residuals(ages, weights, inter, slope)

To visualize the residuals, I'll split the respondents into groups by age, then plot the percentiles of the residuals versus the average age in each group.

First I'll make the groups and compute the average age in each group.

```
In [12]:
```bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)
age_means = [group.agepreg.mean() for _, group in groups][1:-1]
age_means

```
Out[12]:
```

Next I'll compute the CDF of the residuals in each group.

```
In [13]:
```cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]

The following function plots percentiles of the residuals against the average age in each group.

```
In [14]:
```def PlotPercentiles(age_means, cdfs):
thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
label = '%dth' % percent
thinkplot.Plot(age_means, weight_percentiles, label=label)

The following figure shows the 25th, 50th, and 75th percentiles.

Curvature in the residuals suggests a non-linear relationship.

```
In [15]:
```PlotPercentiles(age_means, cdfs)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Residual (lbs)',
xlim=[10, 45])

```
```

```
In [16]:
```def SampleRows(df, nrows, replace=False):
"""Choose a sample of rows from a DataFrame.
df: DataFrame
nrows: number of rows
replace: whether to sample with replacement
returns: DataDf
"""
indices = np.random.choice(df.index, nrows, replace=replace)
sample = df.loc[indices]
return sample
def ResampleRows(df):
"""Resamples rows from a DataFrame.
df: DataFrame
returns: DataFrame
"""
return SampleRows(df, len(df), replace=True)

`inter`

and `slope`

.

```
In [17]:
```def SamplingDistributions(live, iters=101):
t = []
for _ in range(iters):
sample = ResampleRows(live)
ages = sample.agepreg
weights = sample.totalwgt_lb
estimates = LeastSquares(ages, weights)
t.append(estimates)
inters, slopes = zip(*t)
return inters, slopes

Here's an example.

```
In [18]:
```inters, slopes = SamplingDistributions(live, iters=1001)

```
In [19]:
```def Summarize(estimates, actual=None):
mean = Mean(estimates)
stderr = Std(estimates, mu=actual)
cdf = thinkstats2.Cdf(estimates)
ci = cdf.ConfidenceInterval(90)
print('mean, SE, CI', mean, stderr, ci)

Here's the summary for `inter`

.

```
In [20]:
```Summarize(inters)

```
```

And for `slope`

.

```
In [21]:
```Summarize(slopes)

```
```

**Exercise:** Use `ResampleRows`

and generate a list of estimates for the mean birth weight. Use `Summarize`

to compute the SE and CI for these estimates.

```
In [22]:
```# Solution
iters = 1000
estimates = [ResampleRows(live).totalwgt_lb.mean()
for _ in range(iters)]
Summarize(estimates)

```
```

```
In [23]:
```for slope, inter in zip(slopes, inters):
fxs, fys = FitLine(age_means, inter, slope)
thinkplot.Plot(fxs, fys, color='gray', alpha=0.01)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Residual (lbs)',
xlim=[10, 45])

```
```

```
In [24]:
```def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):
fys_seq = []
for inter, slope in zip(inters, slopes):
fxs, fys = FitLine(xs, inter, slope)
fys_seq.append(fys)
p = (100 - percent) / 2
percents = p, 100 - p
low, high = thinkstats2.PercentileRows(fys_seq, percents)
thinkplot.FillBetween(fxs, low, high, **options)

This example shows the confidence interval for the fitted values at each mother's age.

```
In [25]:
```PlotConfidenceIntervals(age_means, inters, slopes, percent=90,
color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(age_means, inters, slopes, percent=50,
color='gray', alpha=0.5, label='50% CI')
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Residual (lbs)',
xlim=[10, 45])

```
```

The coefficient compares the variance of the residuals to the variance of the dependent variable.

```
In [26]:
```def CoefDetermination(ys, res):
return 1 - Var(res) / Var(ys)

```
In [27]:
```inter, slope = LeastSquares(ages, weights)
res = Residuals(ages, weights, inter, slope)
r2 = CoefDetermination(weights, res)
r2

```
Out[27]:
```

We can confirm that $R^2 = \rho^2$:

```
In [28]:
```print('rho', thinkstats2.Corr(ages, weights))
print('R', np.sqrt(r2))

```
```

```
In [29]:
```print('Std(ys)', Std(weights))
print('Std(res)', Std(res))

```
```

```
In [30]:
```var_ys = 15**2
rho = 0.72
r2 = rho**2
var_res = (1 - r2) * var_ys
std_res = np.sqrt(var_res)
std_res

```
Out[30]:
```

```
In [31]:
```class SlopeTest(thinkstats2.HypothesisTest):
def TestStatistic(self, data):
ages, weights = data
_, slope = thinkstats2.LeastSquares(ages, weights)
return slope
def MakeModel(self):
_, weights = self.data
self.ybar = weights.mean()
self.res = weights - self.ybar
def RunModel(self):
ages, _ = self.data
weights = self.ybar + np.random.permutation(self.res)
return ages, weights

And it is.

```
In [32]:
```ht = SlopeTest((ages, weights))
pvalue = ht.PValue()
pvalue

```
Out[32]:
```

```
In [33]:
```ht.actual, ht.MaxTestStat()

```
Out[33]:
```

We can also use resampling to estimate the sampling distribution of the slope.

```
In [34]:
```sampling_cdf = thinkstats2.Cdf(slopes)

The distribution of slopes under the null hypothesis, and the sampling distribution of the slope under resampling, have the same shape, but one has mean at 0 and the other has mean at the observed slope.

To compute a p-value, we can count how often the estimated slope under the null hypothesis exceeds the observed slope, or how often the estimated slope under resampling falls below 0.

```
In [35]:
```thinkplot.PrePlot(2)
thinkplot.Plot([0, 0], [0, 1], color='0.8')
ht.PlotCdf(label='null hypothesis')
thinkplot.Cdf(sampling_cdf, label='sampling distribution')
thinkplot.Config(xlabel='slope (lbs / year)',
ylabel='CDF',
xlim=[-0.03, 0.03],
legend=True, loc='upper left')

```
```

Here's how to get a p-value from the sampling distribution.

```
In [36]:
```pvalue = sampling_cdf[0]
pvalue

```
Out[36]:
```

```
In [37]:
```def ResampleRowsWeighted(df, column='finalwgt'):
weights = df[column]
cdf = thinkstats2.Cdf(dict(weights))
indices = cdf.Sample(len(weights))
sample = df.loc[indices]
return sample

We can use it to estimate the mean birthweight and compute SE and CI.

```
In [38]:
```iters = 100
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
for _ in range(iters)]
Summarize(estimates)

```
```

And here's what the same calculation looks like if we ignore the weights.

```
In [39]:
```estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()
for _ in range(iters)]
Summarize(estimates)

```
```

**Exercise:** Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?

Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?

Read the BRFSS data and extract heights and log weights.

```
In [40]:
```import brfss
df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
log_weights = np.log10(weights)

Estimate intercept and slope.

```
In [41]:
```# Solution
inter, slope = thinkstats2.LeastSquares(heights, log_weights)
inter, slope

```
Out[41]:
```

Make a scatter plot of the data and show the fitted line.

```
In [42]:
```# Solution
thinkplot.Scatter(heights, log_weights, alpha=0.01, s=5)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, fys, color='red')
thinkplot.Config(xlabel='Height (cm)', ylabel='log10 weight (kg)', legend=False)

```
```

Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.

```
In [43]:
```# Solution
thinkplot.Scatter(heights, weights, alpha=0.01, s=5)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, 10**fys, color='red')
thinkplot.Config(xlabel='Height (cm)', ylabel='Weight (kg)', legend=False)

```
```

Plot percentiles of the residuals.

```
In [44]:
```# Solution
# The lines are flat over most of the range,
# indicating that the relationship is linear.
# The lines are mostly parallel, indicating
# that the variance of the residuals is the
# same over the range.
res = thinkstats2.Residuals(heights, log_weights, inter, slope)
df['residual'] = res
bins = np.arange(130, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)
means = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for i, group in groups][1:-1]
thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
ys = [cdf.Percentile(percent) for cdf in cdfs]
label = '%dth' % percent
thinkplot.Plot(means, ys, label=label)
thinkplot.Config(xlabel='height (cm)', ylabel='residual weight (kg)', legend=False)

```
```

Compute correlation.

```
In [45]:
```# Solution
rho = thinkstats2.Corr(heights, log_weights)
rho

```
Out[45]:
```

Compute coefficient of determination.

```
In [46]:
```# Solution
r2 = thinkstats2.CoefDetermination(log_weights, res)
r2

```
Out[46]:
```

Confirm that $R^2 = \rho^2$.

```
In [47]:
```# Solution
rho**2 - r2

```
Out[47]:
```

Compute Std(ys), which is the RMSE of predictions that don't use height.

```
In [48]:
```# Solution
std_ys = thinkstats2.Std(log_weights)
std_ys

```
Out[48]:
```

Compute Std(res), the RMSE of predictions that do use height.

```
In [49]:
```# Solution
std_res = thinkstats2.Std(res)
std_res

```
Out[49]:
```

How much does height information reduce RMSE?

```
In [50]:
```# Solution
1 - std_res / std_ys

```
Out[50]:
```

Use resampling to compute sampling distributions for inter and slope.

```
In [51]:
```# Solution
t = []
for _ in range(100):
sample = thinkstats2.ResampleRows(df)
estimates = thinkstats2.LeastSquares(sample.htm3, np.log10(sample.wtkg2))
t.append(estimates)
inters, slopes = zip(*t)

Plot the sampling distribution of slope.

```
In [52]:
```# Solution
cdf = thinkstats2.Cdf(slopes)
thinkplot.Cdf(cdf)

```
Out[52]:
```

Compute the p-value of the slope.

```
In [53]:
```# Solution
pvalue = cdf[0]
pvalue

```
Out[53]:
```

Compute the 90% confidence interval of slope.

```
In [54]:
```# Solution
ci = cdf.Percentile(5), cdf.Percentile(95)
ci

```
Out[54]:
```

Compute the mean of the sampling distribution.

```
In [55]:
```# Solution
mean = thinkstats2.Mean(slopes)
mean

```
Out[55]:
```

Compute the standard deviation of the sampling distribution, which is the standard error.

```
In [56]:
```# Solution
stderr = thinkstats2.Std(slopes)
stderr

```
Out[56]:
```

Resample rows without weights, compute mean height, and summarize results.

```
In [57]:
```# Solution
estimates_unweighted = [thinkstats2.ResampleRows(df).htm3.mean() for _ in range(100)]
Summarize(estimates_unweighted)

```
```

Resample rows with weights. Note that the weight column in this dataset is called `finalwt`

.

```
In [58]:
```# Solution
# The estimated mean height is almost 2 cm taller
# if we take into account the sampling weights,
# and this difference is much bigger than the sampling error.
estimates_weighted = [ResampleRowsWeighted(df, 'finalwt').htm3.mean() for _ in range(100)]
Summarize(estimates_weighted)

```
```

```
In [ ]:
```