In [110]:
%matplotlib inline
import math
import numpy as np
import random
import scipy.stats

import first
import hypothesis
import thinkstats2
import thinkplot

import normal

Central Limit Theorem: if we add up n values drawn from almost any distribution, the distribution of the sum converges to normal as n increases.

  • values have to be drawn independently
  • values have to come from the same distribution (this requirement can be relaxed)
  • distribution has to have finite mean and variance
  • rate of convergence depends on skewness of the distribution
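A minimal sketch of the convergence claim above, assuming exponential draws as the skewed source distribution. The helper names `sample_sums` and `skewness` are hypothetical and not part of thinkstats2. The skewness of the sums should shrink toward 0 (the value for a normal distribution) as n grows.

```python
import numpy as np

def sample_sums(n, iters=1000, seed=17):
    """Return `iters` sums, each of n independent exponential draws."""
    rng = np.random.RandomState(seed)
    return rng.exponential(1.0, size=(iters, n)).sum(axis=1)

def skewness(xs):
    """Sample skewness: the third standardized moment."""
    xs = np.asarray(xs, dtype=float)
    dev = xs - xs.mean()
    return np.mean(dev**3) / np.mean(dev**2)**1.5

# Skewness of the sum should decrease as more values are added.
for n in [1, 10, 100]:
    print("n=%d skewness=%.3f" % (n, skewness(sample_sums(n))))
```

A more skewed source distribution would need a larger n before the sums look normal, which is the point of the last bullet.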

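Student's t-distribution: under the null hypothesis of no correlation, the transformed correlation t = r * sqrt((n-2) / (1-r^2)) follows a t-distribution with n-2 degrees of freedom (assuming roughly normal data), so it serves as the sampling distribution for testing a correlation.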
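As a sketch of how the t-distribution is used for correlation, the following transforms a sample correlation r into a t statistic and looks up a two-sided p-value. The function name `correlation_p_value` is hypothetical, and the result assumes the data are roughly normal.

```python
import math
import scipy.stats

def correlation_p_value(r, n):
    """Two-sided p-value for a Pearson correlation r computed from n pairs,
    under the null hypothesis of no correlation."""
    # Transform r to a t statistic with n - 2 degrees of freedom.
    t = r * math.sqrt((n - 2) / (1.0 - r**2))
    # sf is the survival function, 1 - CDF; double it for two tails.
    return 2 * scipy.stats.t.sf(abs(t), df=n - 2)

print(correlation_p_value(0.5, 30))
```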

Exercise 14.1

  • choose distribution for growth factor by year
  • generate a sample of adult weights: choose a birth weight from the distribution of birthweights, choose a sequence of factors from f, and compute the product
  • what value of n is needed to converge to lognormal?

In [2]:
live, firsts, others = first.MakeFrames()


nsfg.py:42: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.birthwgt_lb[df.birthwgt_lb > 20] = np.nan

average male weight:


In [124]:
np.random.seed(17)
def makefDist(n=18, mu=0.17, sigma=0.1):
    """Draw n yearly growth factors from a lognormal distribution."""
    return np.random.lognormal(mu, sigma, n)

def genWeight(birthweights, age):
    """
    birthweights: sequence of birth weights
    age: int, age in years (one growth factor is drawn per year)
    """
    adultweights = []
    for bw in birthweights:
        fDist = makefDist(age)
        fDist = np.append(fDist, bw)
        fDist = np.log(fDist)
        fDist = np.sum(fDist)
        result = np.exp(fDist)
        adultweights.append(result)
    return adultweights

boyWeights = live[live.babysex==1].totalwgt_lb
boyWeights = boyWeights.dropna()

aWeights = genWeight(boyWeights, 19)
thinkplot.PrePlot(2, rows=2)
thinkplot.SubPlot(1)
cdf = thinkstats2.Cdf(aWeights)
thinkplot.Cdf(cdf)
thinkplot.SubPlot(2)
pdf = thinkstats2.EstimatedPdf(aWeights)
thinkplot.Pdf(pdf)
thinkplot.Show()

print "median",cdf.Percentile(50)
print "mean", cdf.Mean()
print "iqr", cdf.Percentile(25), cdf.Percentile(75)
print "ci", cdf.Percentile(5), cdf.Percentile(95)


median 184.872572709
mean 204.748234993
iqr 134.631023354 254.752246774
ci 82.7675631776 396.174628719
<matplotlib.figure.Figure at 0x10bfd9ed0>

In [126]:
def makeAWsamples(nVals, aWeights, iters=1000):
    samples = []
    for n in nVals:
        sample = [np.sum(np.random.choice(aWeights, n)) 
                  for _ in range(iters)]
        samples.append((n, sample))
    return samples
    
nVals = [1, 10, 100, 1000]
samples = makeAWsamples(nVals, aWeights)

thinkplot.PrePlot(len(nVals), rows=len(nVals)//2, cols=len(nVals)//2)
normal.NormalPlotSamples(samples)


The intended solution samples the growth "factors" from a normal distribution rather than a lognormal one.
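A minimal sketch of that approach, assuming normally distributed yearly factors. The function name `gen_adult_weights` and the parameters `mu=1.19` and `sigma=0.1` are illustrative assumptions chosen to sit near the scale used in the cell above; they are not values from the book. Because the log of a product is a sum of logs, the Central Limit Theorem suggests the adult weights should approach a lognormal distribution as the number of factors grows.

```python
import numpy as np

def gen_adult_weights(birth_weights, years=18, mu=1.19, sigma=0.1, seed=17):
    """Multiply each birth weight by `years` growth factors drawn from a
    normal distribution (mu and sigma are assumed, illustrative values)."""
    rng = np.random.RandomState(seed)
    birth_weights = np.asarray(birth_weights, dtype=float)
    # One row of yearly factors per birth weight.
    factors = rng.normal(mu, sigma, size=(len(birth_weights), years))
    return birth_weights * factors.prod(axis=1)

adult = gen_adult_weights([6.5, 7.5, 8.5])
print(adult)
```

Passing the log of the result to a normal probability plot for several values of `years` is one way to see how many factors are needed before the fit looks lognormal.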

Exercise 14.2


In [132]:
dist1 = normal.SamplingDistMean(firsts.prglngth, len(firsts))
dist2 = normal.SamplingDistMean(others.prglngth, len(others))

dist = dist1 - dist2
print 'Standard Error:', dist.sigma
print '90% CI', dist.Percentile(5), dist.Percentile(95)


Standard Error: 0.0566695304191
90% CI -0.0151758158699 0.171250349425

The standard error is equal to the square root of the variance of the distribution of differences.
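The same relationship can be sketched without the `normal` module. The variance of a sample mean is the sample variance divided by n, and the variances of two independent means add when the means are subtracted. The helper names below are hypothetical, and 1.6449 is the standard normal 95th percentile.

```python
import math

def std_err_diff(var1, n1, var2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(var1 / float(n1) + var2 / float(n2))

def ci90(mean_diff, se):
    """Approximate 90% interval for a normal sampling distribution."""
    z = 1.6449  # 95th percentile of the standard normal
    return mean_diff - z * se, mean_diff + z * se

se = std_err_diff(4.0, 100, 9.0, 100)  # made-up variances and sample sizes
print(se)
print(ci90(0.5, se))
```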

Exercise 14.3


In [152]:
## null hypothesis is that boys' and girls' scores come
## from the same population
male_before = normal.Normal(3.57, 0.28**2)
male_after = normal.Normal(3.44, 0.16**2)

fem_before = normal.Normal(1.91, 0.32**2)
fem_after = normal.Normal(3.18, 0.16**2)

diff_before = fem_before - male_before
#thinkplot.Cdf(diff_before)
diff_after = fem_after - male_after
# thinkplot.Cdf(diff_after)

print "before: mean, p-value", diff_before.mu, 1-diff_before.Prob(0)
print "after: mean, p-value", diff_after.mu, 1-diff_after.Prob(0)

diff =  diff_after - diff_before
# thinkplot.Cdf(diff)
print "diff in gender gap: mean, p-value", diff.mu, diff.Prob(0)


before: mean, p-value -1.66 4.73095323208e-05
after: mean, p-value -0.26 0.125267987207
diff in gender gap: mean, p-value 1.4 0.00182694836898

The CDF is key to understanding this problem.

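For the first part, the null hypothesis is that there is no difference between the male and female distributions. The observed mean difference is negative, so the p-value is the probability mass above zero, 1-diff.Prob(0), where Prob(0) is the CDF evaluated at 0.

For the second part, the null hypothesis is that the gender gap did not change. The observed mean change is positive, so the p-value is the probability mass below zero, diff.Prob(0).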
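The tail calculation can be sketched directly with `scipy.stats.norm`, assuming the reported standard errors are the standard deviations of independent normal sampling distributions. The helpers `diff_normal` and `tail_beyond_zero` are hypothetical names, not part of the `normal` module.

```python
import math
import scipy.stats

def diff_normal(mu1, se1, mu2, se2):
    """Return (mu, sigma) of X1 - X2 for independent normal X1 and X2."""
    return mu1 - mu2, math.sqrt(se1**2 + se2**2)

def tail_beyond_zero(mu, sigma):
    """Probability mass on the opposite side of zero from the mean."""
    cdf0 = scipy.stats.norm.cdf(0, loc=mu, scale=sigma)
    return cdf0 if mu > 0 else 1 - cdf0

before = diff_normal(1.91, 0.32, 3.57, 0.28)  # female - male, before
after = diff_normal(3.18, 0.16, 3.44, 0.16)   # female - male, after
change = diff_normal(after[0], after[1], before[0], before[1])

for name, dist in [("before", before), ("after", after), ("change", change)]:
    print("%s mean=%.2f tail=%.6g" % (name, dist[0], tail_beyond_zero(*dist)))
```

Choosing the tail by the sign of the mean is what lets one helper cover both cases: the first two differences are negative and use 1 - CDF(0), while the change in the gap is positive and uses CDF(0).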