Examples and Exercises from Think Stats, 2nd Edition

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
from __future__ import print_function, division

%matplotlib inline

import numpy as np

import random

import thinkstats2
import thinkplot

Least squares

One more time, let's load up the NSFG data.



In [2]:

    
import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb

The following function computes the intercept and slope of the least squares fit.



In [3]:

    
from thinkstats2 import Mean, MeanVar, Var, Std, Cov

def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope

Here's the least squares fit to birth weight as a function of mother's age.



In [4]:

    
inter, slope = LeastSquares(ages, weights)
inter, slope









    Out[4]:





(6.8303969733110526, 0.017453851471802753)

The intercept is often easier to interpret if we evaluate it at the mean of the independent variable.



In [5]:

    
inter + slope * 25









    Out[5]:





7.2667432601061215

And the slope is easier to interpret if we express it in pounds per decade (or ounces per year).



In [6]:

    
slope * 10









    Out[6]:





0.17453851471802753

The following function evaluates the fitted line at the given xs.



In [7]:

    
def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

And here's an example.



In [8]:

    
fit_xs, fit_ys = FitLine(ages, inter, slope)

Here's a scatterplot of the data with the fitted line.



In [9]:

    
thinkplot.Scatter(ages, weights, color='blue', alpha=0.1, s=10)
thinkplot.Plot(fit_xs, fit_ys, color='white', linewidth=3)
thinkplot.Plot(fit_xs, fit_ys, color='red', linewidth=2)
thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Birth weight (lbs)',
                 axis=[10, 45, 0, 15],
                 legend=False)

Residuals

The following functon computes the residuals.



In [10]:

    
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res

Now we can add the residuals as a column in the DataFrame.



In [11]:

    
live['residual'] = Residuals(ages, weights, inter, slope)

To visualize the residuals, I'll split the respondents into groups by age, then plot the percentiles of the residuals versus the average age in each group.

First I'll make the groups and compute the average age in each group.



In [12]:

    
bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)

age_means = [group.agepreg.mean() for _, group in groups][1:-1]
age_means









    Out[12]:





[15.212333333333312,
 17.74035928143719,
 20.506304824561838,
 23.455752212389893,
 26.435156146179903,
 29.411177432543294,
 32.30232530120497,
 35.240273631840736,
 38.10876470588231,
 40.91205882352941]

Next I'll compute the CDF of the residuals in each group.



In [13]:

    
cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]

The following function plots percentiles of the residuals against the average age in each group.



In [14]:

    
def PlotPercentiles(age_means, cdfs):
    thinkplot.PrePlot(3)
    for percent in [75, 50, 25]:
        weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
        label = '%dth' % percent
        thinkplot.Plot(age_means, weight_percentiles, label=label)

The following figure shows the 25th, 50th, and 75th percentiles.

Curvature in the residuals suggests a non-linear relationship.



In [15]:

    
PlotPercentiles(age_means, cdfs)

thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])

Sampling distribution

To estimate the sampling distribution of inter and slope, I'll use resampling.



In [16]:

    
def SampleRows(df, nrows, replace=False):
    """Choose a sample of rows from a DataFrame.

    df: DataFrame
    nrows: number of rows
    replace: whether to sample with replacement

    returns: DataDf
    """
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample

def ResampleRows(df):
    """Resamples rows from a DataFrame.

    df: DataFrame

    returns: DataFrame
    """
    return SampleRows(df, len(df), replace=True)

The following function resamples the given dataframe and returns lists of estimates for inter and slope.



In [17]:

    
def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = LeastSquares(ages, weights)
        t.append(estimates)

    inters, slopes = zip(*t)
    return inters, slopes

Here's an example.



In [18]:

    
inters, slopes = SamplingDistributions(live, iters=1001)

The following function takes a list of estimates and prints the mean, standard error, and 90% confidence interval.



In [19]:

    
def Summarize(estimates, actual=None):
    mean = Mean(estimates)
    stderr = Std(estimates, mu=actual)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)

Here's the summary for inter.



In [20]:

    
Summarize(inters)









    



mean, SE, CI 6.829185459088737 0.06767011520211424 (6.716748354462339, 6.941896381986979)

And for slope.



In [21]:

    
Summarize(slopes)









    



mean, SE, CI 0.017480239500445995 0.002670358753454292 (0.013028573267487135, 0.022022292160010386)

Exercise: Use ResampleRows and generate a list of estimates for the mean birth weight. Use Summarize to compute the SE and CI for these estimates.



In [22]:

    
# Solution

iters = 1000
estimates = [ResampleRows(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)









    



mean, SE, CI 7.265321586634211 0.014517304608097607 (7.240851128568267, 7.2885870767869)

Visualizing uncertainty

To show the uncertainty of the estimated slope and intercept, we can generate a fitted line for each resampled estimate and plot them on top of each other.



In [23]:

    
for slope, inter in zip(slopes, inters):
    fxs, fys = FitLine(age_means, inter, slope)
    thinkplot.Plot(fxs, fys, color='gray', alpha=0.01)
    
thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])

Or we can make a neater (and more efficient plot) by computing fitted lines and finding percentiles of the fits for each value of the dependent variable.



In [24]:

    
def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = FitLine(xs, inter, slope)
        fys_seq.append(fys)

    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)

This example shows the confidence interval for the fitted values at each mother's age.



In [25]:

    
PlotConfidenceIntervals(age_means, inters, slopes, percent=90, 
                        color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(age_means, inters, slopes, percent=50,
                        color='gray', alpha=0.5, label='50% CI')

thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])

Coefficient of determination

The coefficient compares the variance of the residuals to the variance of the dependent variable.



In [26]:

    
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)

For birth weight and mother's age $R^2$ is very small, indicating that the mother's age predicts a small part of the variance in birth weight.



In [27]:

    
inter, slope = LeastSquares(ages, weights)
res = Residuals(ages, weights, inter, slope)
r2 = CoefDetermination(weights, res)
r2









    Out[27]:





0.004738115474710258

We can confirm that $R^2 = \rho^2$:



In [28]:

    
print('rho', thinkstats2.Corr(ages, weights))
print('R', np.sqrt(r2))









    



rho 0.06883397035410908
R 0.06883397035410828

To express predictive power, I think it's useful to compare the standard deviation of the residuals to the standard deviation of the dependent variable, as a measure RMSE if you try to guess birth weight with and without taking into account mother's age.



In [29]:

    
print('Std(ys)', Std(weights))
print('Std(res)', Std(res))









    



Std(ys) 1.4082155338406197
Std(res) 1.4048754287857832

As another example of the same idea, here's how much we can improve guesses about IQ if we know someone's SAT scores.



In [30]:

    
var_ys = 15**2
rho = 0.72
r2 = rho**2
var_res = (1 - r2) * var_ys
std_res = np.sqrt(var_res)
std_res









    Out[30]:





10.409610943738484

Hypothesis testing with slopes

Here's a HypothesisTest that uses permutation to test whether the observed slope is statistically significant.



In [31]:

    
class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights

And it is.



In [32]:

    
ht = SlopeTest((ages, weights))
pvalue = ht.PValue()
pvalue









    Out[32]:





0.0

Under the null hypothesis, the largest slope we observe after 1000 tries is substantially less than the observed value.



In [33]:

    
ht.actual, ht.MaxTestStat()









    Out[33]:





(0.017453851471802753, 0.009339357664930548)

We can also use resampling to estimate the sampling distribution of the slope.



In [34]:

    
sampling_cdf = thinkstats2.Cdf(slopes)

The distribution of slopes under the null hypothesis, and the sampling distribution of the slope under resampling, have the same shape, but one has mean at 0 and the other has mean at the observed slope.

To compute a p-value, we can count how often the estimated slope under the null hypothesis exceeds the observed slope, or how often the estimated slope under resampling falls below 0.



In [35]:

    
thinkplot.PrePlot(2)
thinkplot.Plot([0, 0], [0, 1], color='0.8')
ht.PlotCdf(label='null hypothesis')

thinkplot.Cdf(sampling_cdf, label='sampling distribution')

thinkplot.Config(xlabel='slope (lbs / year)',
                   ylabel='CDF',
                   xlim=[-0.03, 0.03],
                   legend=True, loc='upper left')

Here's how to get a p-value from the sampling distribution.



In [36]:

    
pvalue = sampling_cdf[0]
pvalue









    Out[36]:





0

Resampling with weights

Resampling provides a convenient way to take into account the sampling weights associated with respondents in a stratified survey design.

The following function resamples rows with probabilities proportional to weights.



In [37]:

    
def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = thinkstats2.Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample

We can use it to estimate the mean birthweight and compute SE and CI.



In [38]:

    
iters = 100
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)









    



mean, SE, CI 7.347540385040938 0.013571205961903726 (7.321607379951317, 7.368651803496348)

And here's what the same calculation looks like if we ignore the weights.



In [39]:

    
estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)









    



mean, SE, CI 7.268162342332374 0.014782284407590501 (7.2425038725381725, 7.29244578446559)

The difference is non-negligible, which suggests that there are differences in birth weight between the strata in the survey.

Exercises

Exercise: Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?

Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?

Read the BRFSS data and extract heights and log weights.



In [40]:

    
import brfss

df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
log_weights = np.log10(weights)

Estimate intercept and slope.



In [41]:

    
# Solution

inter, slope = thinkstats2.LeastSquares(heights, log_weights)
inter, slope









    Out[41]:





(0.9930804163917621, 0.005281454169418104)

Make a scatter plot of the data and show the fitted line.



In [42]:

    
# Solution

thinkplot.Scatter(heights, log_weights, alpha=0.01, s=5)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, fys, color='red')
thinkplot.Config(xlabel='Height (cm)', ylabel='log10 weight (kg)', legend=False)

Make the same plot but apply the inverse transform to show weights on a linear (not log) scale.



In [43]:

    
# Solution

thinkplot.Scatter(heights, weights, alpha=0.01, s=5)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, 10**fys, color='red')
thinkplot.Config(xlabel='Height (cm)', ylabel='Weight (kg)', legend=False)

Plot percentiles of the residuals.



In [44]:

    
# Solution

# The lines are flat over most of the range, 
# indicating that the relationship is linear.

# The lines are mostly parallel, indicating 
# that the variance of the residuals is the 
# same over the range.

res = thinkstats2.Residuals(heights, log_weights, inter, slope)
df['residual'] = res

bins = np.arange(130, 210, 5)
indices = np.digitize(df.htm3, bins)
groups = df.groupby(indices)

means = [group.htm3.mean() for i, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for i, group in groups][1:-1]

thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
    ys = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(means, ys, label=label)
    
thinkplot.Config(xlabel='height (cm)', ylabel='residual weight (kg)', legend=False)

Compute correlation.



In [45]:

    
# Solution

rho = thinkstats2.Corr(heights, log_weights)
rho









    Out[45]:





0.5317282605983589

Compute coefficient of determination.



In [46]:

    
# Solution

r2 = thinkstats2.CoefDetermination(log_weights, res)
r2









    Out[46]:





0.2827349431189434

Confirm that $R^2 = \rho^2$.



In [47]:

    
# Solution

rho**2 - r2









    Out[47]:





1.2823075934420558e-14

Compute Std(ys), which is the RMSE of predictions that don't use height.



In [48]:

    
# Solution

std_ys = thinkstats2.Std(log_weights)
std_ys









    Out[48]:





0.10320725030004975

Compute Std(res), the RMSE of predictions that do use height.



In [49]:

    
# Solution

std_res = thinkstats2.Std(res)
std_res









    Out[49]:





0.08740777080416137

How much does height information reduce RMSE?



In [50]:

    
# Solution

1 - std_res / std_ys









    Out[50]:





0.15308497658793607

Use resampling to compute sampling distributions for inter and slope.



In [51]:

    
# Solution

t = []
for _ in range(100):
    sample = thinkstats2.ResampleRows(df)
    estimates = thinkstats2.LeastSquares(sample.htm3, np.log10(sample.wtkg2))
    t.append(estimates)

inters, slopes = zip(*t)

Plot the sampling distribution of slope.



In [52]:

    
# Solution

cdf = thinkstats2.Cdf(slopes)
thinkplot.Cdf(cdf)









    Out[52]:





{'xscale': 'linear', 'yscale': 'linear'}

Compute the p-value of the slope.



In [53]:

    
# Solution

pvalue = cdf[0]
pvalue









    Out[53]:





0

Compute the 90% confidence interval of slope.



In [54]:

    
# Solution

ci = cdf.Percentile(5), cdf.Percentile(95)
ci









    Out[54]:





(0.005256179585510808, 0.005305925811393488)

Compute the mean of the sampling distribution.



In [55]:

    
# Solution

mean = thinkstats2.Mean(slopes)
mean









    Out[55]:





0.005280974367722204

Compute the standard deviation of the sampling distribution, which is the standard error.



In [56]:

    
# Solution

stderr = thinkstats2.Std(slopes)
stderr









    Out[56]:





1.4328644946448544e-05

Resample rows without weights, compute mean height, and summarize results.



In [57]:

    
# Solution

estimates_unweighted = [thinkstats2.ResampleRows(df).htm3.mean() for _ in range(100)]
Summarize(estimates_unweighted)









    



mean, SE, CI 168.9567580690798 0.01719931368075356 (168.9266633319186, 168.9880782756321)

Resample rows with weights. Note that the weight column in this dataset is called finalwt.



In [58]:

    
# Solution

# The estimated mean height is almost 2 cm taller 
# if we take into account the sampling weights,
# and this difference is much bigger than the sampling error.

estimates_weighted = [ResampleRowsWeighted(df, 'finalwt').htm3.mean() for _ in range(100)]
Summarize(estimates_weighted)









    



mean, SE, CI 170.497091619677 0.01746768128310096 (170.46624325471413, 170.52783756745285)



In [ ]: