In [28]:
%matplotlib inline
import thinkstats2, thinkplot
import first
import numpy as np
import math

Linear Least Squares Fit

  • linear - the model is intended to capture a linear relationship between the variables.
  • least squares - the fit is the line that minimizes the MSE between the line and the data.

If there is a linear relationship between xs and ys with intercept inter and slope slope, we expect each y[i] to be:

inter + slope * x[i]

Unless the correlation is perfect, this prediction is only approximate. The vertical deviation from the line is called a residual:

res = ys - (inter + slope * xs)

We want to find the inter and slope that minimize the residuals. Usually it's actually sum(res**2) that we minimize, which is what makes the fit "least squares".
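
The minimizing parameters have a closed form: the slope is the covariance of xs and ys divided by the variance of xs, and the line passes through the point of means. A minimal NumPy sketch of that idea (least_squares here is a hypothetical stand-in, not the book's thinkstats2.LeastSquares, though it should agree with it):

```python
import numpy as np

def least_squares(xs, ys):
    """Closed-form least squares fit: slope = Cov(x, y) / Var(x)."""
    xbar, ybar = np.mean(xs), np.mean(ys)
    slope = np.sum((xs - xbar) * (ys - ybar)) / np.sum((xs - xbar) ** 2)
    inter = ybar - slope * xbar   # the line passes through (xbar, ybar)
    return inter, slope

# perfectly linear data recovers the true parameters
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 + 0.5 * xs
inter, slope = least_squares(xs, ys)
print(inter, slope)  # → 2.0 0.5
```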


In [3]:
##birthweight vs mother's age:
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb

inter, slope = thinkstats2.LeastSquares(ages, weights)
fit_xs, fit_ys = thinkstats2.FitLine(ages, inter, slope)


nsfg.py:42: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.birthwgt_lb[df.birthwgt_lb > 20] = np.nan

In [7]:
thinkplot.Plot(fit_xs, fit_ys)
thinkplot.HexBin(ages, weights)
thinkplot.Show()

In [10]:
##plotting residuals by percentile
ages = live.agepreg
weights = live.totalwgt_lb
inter, slope = thinkstats2.LeastSquares(ages, weights)
live['residual'] = thinkstats2.Residuals(ages, weights, inter, slope)

bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)

ages = [group.agepreg.mean() for _, group in groups][1:-1]
cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]

thinkplot.PrePlot(3)
for percent in [75, 50, 25]:
    weights = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(ages, weights, label=label)

thinkplot.Config(root='linear2',
               xlabel='age (years)',
               ylabel='residual (lbs)',
               xlim=[10, 45])

In [11]:
thinkplot.Show()

To assess sampling error, draw samples from the observed sample with replacement and measure variability.


In [13]:
def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = thinkstats2.ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = thinkstats2.LeastSquares(ages, weights)
        t.append(estimates)
    inters, slopes = zip(*t)
    return inters, slopes

def Summarize(estimates, actual=None):
    mean = thinkstats2.Mean(estimates)
    stderr = thinkstats2.Std(estimates, mu=actual)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print 'mean, SE, CI', mean, stderr, ci

In [ ]:
def PlotConfidenceIntervals(xs, inters, slopes, 
                            percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = thinkstats2.FitLine(xs, inter, slope)
        fys_seq.append(fys)
    p = (100 -  percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)

What this code does:

  1. Resample from the observed sample.
  2. Compute a least squares line for each resampling.
  3. Select the upper and lower percentiles of y for each value of x.
  4. Draw a polygon between them.
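
Step 3 is what thinkstats2.PercentileRows does. A rough sketch of the idea using np.percentile (percentile_rows is a hypothetical stand-in):

```python
import numpy as np

def percentile_rows(rows, percents):
    """Given several fitted lines evaluated at the same xs, pick the
    requested percentiles of y separately at each x (i.e. per column)."""
    arr = np.asarray(rows)
    return [np.percentile(arr, p, axis=0) for p in percents]

# three fitted lines, each evaluated at two x values
fys_seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
low, high = percentile_rows(fys_seq, [0, 100])
print(low, high)  # → [1. 2.] [5. 6.]
```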

There are several ways to measure the goodness of fit of a linear model:

  • Standard deviation of the residuals: Std(res) is the RMSE of predictions.
  • Coefficient of determination, or R-squared:

    def CoefDetermination(ys, res):
        return 1 - Var(res) / Var(ys)
    

    Var(res) is the MSE of guesses using the model. Var(ys) is the MSE without it (i.e. raw ys). So their ratio is the fraction of MSE that remains if you use the model and $R^2$ is the fraction of MSE that the model eliminates.

    Note that $R^2$ is equal to the square of Pearson's Coefficient of correlation.
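
That identity can be checked numerically. A small sketch on synthetic data, using np.polyfit rather than the book's functions:

```python
import numpy as np

# noisy linear data
rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 200)
ys = 1.5 * xs + rng.normal(0, 2.0, size=xs.size)

# least squares fit and its residuals
slope, inter = np.polyfit(xs, ys, 1)
res = ys - (inter + slope * xs)

r2 = 1 - np.var(res) / np.var(ys)    # coefficient of determination
rho = np.corrcoef(xs, ys)[0, 1]      # Pearson's correlation

print(r2, rho ** 2)  # the two values agree
```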

Testing a linear model

How do we know that the apparent relationship isn't just due to chance?

  • Test whether the apparent reduction in MSE is due to chance. The test statistic is $R^2$ and the null hypothesis is that there is no relationship between the variables, which we can simulate by permutation.

  • Test whether the slope is due to chance. The null hypothesis is that the slope is actually 0; model the birth weights as random variation around their mean.


In [17]:
class SlopeTest(thinkstats2.HypothesisTest):
    
    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights

In [20]:
ht = SlopeTest((live.agepreg, live.totalwgt_lb))
pvalue = ht.PValue()

In [21]:
pvalue


Out[21]:
0.0

Although the estimated slope is small, it is unlikely to be due to chance.

Another way to get this p-value is to calculate it from the sampling distribution of the slope:


In [22]:
inters, slopes = SamplingDistributions(live, iters=1001)
slope_cdf = thinkstats2.Cdf(slopes)
pvalue = slope_cdf[0]
pvalue


Out[22]:
0.0

Weighted resampling: a sampling weight indicates the number of people in the general population that the respondent represents.

To correct for oversampling, we can use resampling that draws samples from the survey using probabilities proportional to sampling weights.
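
A sketch of the idea using plain NumPy (resample_weighted is a hypothetical stand-in for thinkstats2.ResampleRowsWeighted, which operates on DataFrame rows):

```python
import numpy as np

def resample_weighted(values, weights, n, seed=None):
    """Draw n samples with replacement, with probability proportional
    to each respondent's sampling weight."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # normalize weights to probabilities
    idx = rng.choice(len(values), size=n, replace=True, p=p)
    return np.asarray(values)[idx]

# a respondent with weight 99 represents 99x as many people as one with
# weight 1, so it should be drawn about 99% of the time
values = [10.0, 20.0]
weights = [1.0, 99.0]
sample = resample_weighted(values, weights, n=10000, seed=0)
print(np.mean(sample))  # close to 0.01*10 + 0.99*20 = 19.9
```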


In [23]:
np.random.random(8)


Out[23]:
array([ 0.10033854,  0.61147155,  0.71183061,  0.42039043,  0.79537224,
        0.98759405,  0.67693887,  0.34326631])

Exercise 10.9

  • compute linear least squares...
  • how would you best present estimated parameters when one variable is log-transformed?
  • if you wanted to guess someone's weight, how much would it help to know their height?

  • Weighting...


In [4]:
import brfss
df = brfss.ReadBrfss()
df = df.dropna(subset=['wtkg2','htm3'])

In [46]:
log_weights = np.log10(df.wtkg2)
heights = df.htm3

In [47]:
thinkplot.Scatter(heights, log_weights, alpha=.01)
inter, slope = thinkstats2.LeastSquares(heights, log_weights)
print inter, slope
xs, ys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(xs, ys)
thinkplot.Show(xlabel="height cm",
               ylabel="log weight kg",
               title="Plotting log weight versus height")


0.993080416392 0.00528145416942

I would calculate the weight for the mean height and report the slope as a parameter on an exponential function.
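
Concretely, since the fit is log10(weight) = inter + slope * height, the model is exponential on the linear scale, and each additional centimeter multiplies the predicted weight by 10**slope. A sketch using the estimates printed above (predict_weight_kg is a hypothetical helper):

```python
# the fit is in log10 space: log10(weight) = inter + slope * height
inter, slope = 0.993080416392, 0.00528145416942  # estimates from the cell above

def predict_weight_kg(height_cm):
    """Invert the log10 transform to predict weight on the linear scale."""
    return 10 ** (inter + slope * height_cm)

# each extra centimeter multiplies the predicted weight by this factor
factor_per_cm = 10 ** slope
print(predict_weight_kg(170), factor_per_cm)  # roughly 78 kg, about 1.2% per cm
```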


In [48]:
## THIS IS MY WORK
# sorted_heights = np.sort(heights)
# ys = 10**(0.01216 * sorted_heights) + 2.28665
# thinkplot.Scatter(heights, df.wtkg2)
# thinkplot.Plot(sorted_heights, ys)
# thinkplot.Show()

##THIS IS THE ANSWER:
thinkplot.Scatter(heights, 10**log_weights, alpha=0.01)
fxs, fys = thinkstats2.FitLine(heights, inter, slope)
thinkplot.Plot(fxs, 10**fys)
thinkplot.Show(xlabel="heights cm",
               ylabel="weights kg",
               title="heights versus weights on linear scale")



In [59]:
res = thinkstats2.Residuals(heights, log_weights, inter, slope)
df['residuals'] = res

def plotRes(heights, log_weights):
    bins = np.arange(100,226, 5)
    indices = np.digitize(heights, bins)
    groups = df.groupby(indices)
    heights = [group.htm3.mean() for _, group in groups]
    cdfs = [thinkstats2.Cdf(group.residuals) for _, group in groups]
    
    thinkplot.PrePlot(3)
    for percent in [75, 50, 25]:
        log_weights = [cdf.Percentile(percent) for cdf in cdfs]
        label = '%dth' % percent
        thinkplot.Plot(heights, log_weights, label=label)
    thinkplot.Show(title="Plotting Residual Weight Error",
                   xlabel="height cm",
                   ylabel="log10 weight kg")
plotRes(heights, log_weights)



The model works well between heights of 140 cm and 190 cm. Below and above that range, the percentile lines are not flat, which indicates that the residuals are not random there and that there is nonlinearity: weight predictions for short people tend to be too high, and predictions for taller people tend to be too low. The lines are close to parallel, though, which indicates that the variance remains roughly constant across the range.

Within the indicated range, the model allows us to use height to predict log10 weight to within about 0.1, i.e. weight to within a factor of roughly 1.26.


In [60]:
rho = thinkstats2.Corr(log_weights, heights)
r2 = rho**2

CoD = thinkstats2.CoefDetermination(log_weights, res)
print "CoD using fn", CoD
print "rho:", rho
print "r2 using rho", r2
print "CoD - r2 (should be 0)", CoD - r2


CoD using fn 0.282734943119
rho: 0.531728260598
r2 using rho 0.282734943119
CoD - r2 (should be 0) 7.23310300543e-14

In [27]:
sRes = thinkstats2.Std(res)
sYs = thinkstats2.Std(log_weights)
print 'std(res)', sRes
print 'std(ys)', sYs
print 'reduction in RMSE: %f%%' % ((1 - sRes/sYs) * 100)


std(res) 0.201263830065
std(ys) 0.23764347603
reduction in RMSE: 15.308498%

If you want to guess somebody's weight, knowing their height helps only modestly: the model accounts for about 28% of the variance in log weight ($R^2 \approx 0.28$), and reduces the RMSE of the guess by only about 15%.


In [80]:
iters = 10
HeightsWithW = [thinkstats2.ResampleRowsWeighted(df, column='finalwt')
               .htm3.mean() for _ in range(iters)]      
HeightsNoW = [thinkstats2.ResampleRows(df).htm3.mean()
              for _ in range(iters)]

HeightsWithW, HeightsNoW


Out[80]:
([170.53480264354576,
  170.51314699165303,
  170.52794114674913,
  170.51467794417835,
  170.46392156268317,
  170.52448513510782,
  170.51361941429698,
  170.4983932577457,
  170.48776753774328,
  170.481077830999],
 [168.95522342812103,
  168.96655146627862,
  168.9508276238404,
  168.96610935952626,
  168.93446209503023,
  168.95146425756383,
  168.97149800925646,
  168.9442894965541,
  168.96733967945997,
  168.98538016128055])

In [81]:
import estimation
print HeightsNoW
weight_mean = np.mean(HeightsWithW)
nw_mean = np.mean(HeightsNoW)

weight_stderr = estimation.RMSE(HeightsWithW, weight_mean)
nw_stderr = estimation.RMSE(HeightsNoW, nw_mean)

weight_cdf = thinkstats2.Cdf(HeightsWithW)
nw_cdf = thinkstats2.Cdf(HeightsNoW)

weight_ci = weight_cdf.Percentile(5), weight_cdf.Percentile(95)
nw_ci = nw_cdf.Percentile(5), nw_cdf.Percentile(95)

print '\t mean, stderr, 90% CI'
print 'weighted', weight_mean, weight_stderr, weight_ci
print "not wtd'd", nw_mean, nw_stderr, nw_ci


[168.95522342812103, 168.96655146627862, 168.9508276238404, 168.96610935952626, 168.93446209503023, 168.95146425756383, 168.97149800925646, 168.9442894965541, 168.96733967945997, 168.98538016128055]
	 mean, stderr, 90% CI
weighted 170.505983346 0.0214921844491 (170.46392156268317, 170.53480264354576)
not wtd'd 168.959314558 0.0140875366964 (168.93446209503023, 168.98538016128055)

In [84]:
import linear
linear.Summarize(HeightsWithW)
linear.Summarize(HeightsNoW)


mean, SE, CI 170.505983346 0.0214921844491 (170.46392156268317, 170.53480264354576)
mean, SE, CI 168.959314558 0.0140875366964 (168.93446209503023, 168.98538016128055)

In [ ]: