``````

In [110]:

%matplotlib inline
import math
import numpy as np
import random
import scipy.stats

import first
import hypothesis
import thinkstats2
import thinkplot

import normal

``````

Central Limit Theorem: if we add up n values from almost any distribution, the distribution of the sum converges to normal as n increases.

• values have to be drawn independently
• values have to come from same distribution (relaxed)
• distribution has to have finite mean and variance
• rate of convergence depends on skewness of the distribution
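As a quick sketch of the convergence claim (illustrative NumPy code, not part of the notebook): sums of draws from a strongly skewed distribution such as Exponential(1) already look close to normal for moderate n.

```python
import numpy as np

# Illustrative sketch: sums of draws from a skewed distribution
# approach a normal distribution as n grows.
rng = np.random.RandomState(17)

def sample_sums(n, iters=1000):
    """Return `iters` sums of n draws from Exponential(1)."""
    return np.array([rng.exponential(1.0, n).sum() for _ in range(iters)])

sums = sample_sums(100)
# For Exponential(1), the sum of n draws has mean n and variance n,
# so these estimates should land near 100 and 10.
print(sums.mean(), sums.std())
```

A normal probability plot of `sums` (as `normal.NormalPlotSamples` does below) would be nearly straight at n=100.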

t distribution: under the null hypothesis, the sampling distribution of a correlation (after the transform t = r√((n−2)/(1−r²))) follows a Student's t distribution with n−2 degrees of freedom.
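That transform can be sketched directly (`corr_pvalue` is a hypothetical helper for illustration, not part of `normal.py` or `thinkstats2`):

```python
import math
import numpy as np
import scipy.stats

def corr_pvalue(xs, ys):
    """Hypothetical helper: Pearson r and its two-sided p-value
    via the Student t transform t = r * sqrt((n-2) / (1 - r**2))."""
    n = len(xs)
    r = np.corrcoef(xs, ys)[0, 1]
    t = r * math.sqrt((n - 2) / (1 - r**2))
    return r, 2 * scipy.stats.t.sf(abs(t), df=n - 2)

# weakly correlated synthetic data
rng = np.random.RandomState(17)
xs = rng.normal(size=200)
ys = 0.2 * xs + rng.normal(size=200)
r, p = corr_pvalue(xs, ys)
```

`scipy.stats.pearsonr` computes the same p-value and can serve as a cross-check.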

Exercise 14.1

• choose a distribution for the growth factor per year
• generate a sample of adult weights by choosing from the distribution of birth weights, choosing a sequence of factors from f, and computing the product
• what value of n is needed to converge to lognormal?
``````

In [2]:

live, firsts, others = first.MakeFrames()

``````
``````

nsfg.py:42: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.birthwgt_lb[df.birthwgt_lb > 20] = np.nan

``````

Simulating adult male weight:

``````

In [124]:

np.random.seed(17)

def makefDist(n=18, mu=0.17, sigma=0.1):
    """Draw n yearly growth factors from a lognormal distribution."""
    return np.random.lognormal(mu, sigma, n)

def genWeight(birthweights, age):
    """
    birthweights: sequence of birth weights
    age: int, number of yearly growth factors to apply

    Returns a list of simulated adult weights: each birth weight
    times the product of `age` growth factors (computed as the
    exponential of the sum of logs, for numerical stability).
    """
    results = []
    for bw in birthweights:
        fDist = makefDist(age)
        fDist = np.append(fDist, bw)
        results.append(np.exp(np.sum(np.log(fDist))))
    return results

boyWeights = live[live.babysex==1].totalwgt_lb
boyWeights = boyWeights.dropna()

aWeights = genWeight(boyWeights, 19)
thinkplot.PrePlot(2, rows=2)
thinkplot.SubPlot(1)
cdf = thinkstats2.Cdf(aWeights)
thinkplot.Cdf(cdf)
thinkplot.SubPlot(2)
pdf = thinkstats2.EstimatedPdf(aWeights)
thinkplot.Pdf(pdf)
thinkplot.Show()

print "median",cdf.Percentile(50)
print "mean", cdf.Mean()
print "iqr", cdf.Percentile(25), cdf.Percentile(75)
print "ci", cdf.Percentile(5), cdf.Percentile(95)

``````
``````

median 184.872572709
mean 204.748234993
iqr 134.631023354 254.752246774
ci 82.7675631776 396.174628719


``````
``````

In [126]:

def makeAWsamples(nVals, aWeights, iters=1000):
    """For each n in nVals, draw `iters` sums of n adult weights."""
    samples = []
    for n in nVals:
        sample = [np.sum(np.random.choice(aWeights, n))
                  for _ in range(iters)]
        samples.append((n, sample))
    return samples

nVals = [1, 10, 100, 1000]
samples = makeAWsamples(nVals, aWeights)

thinkplot.PrePlot(len(nVals), rows=len(nVals)//2, cols=len(nVals)//2)
normal.NormalPlotSamples(samples)

``````

The correct answer here was to sample the "factors" from a normal distribution.
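A minimal sketch of that fix (the parameters below are illustrative guesses, not values fitted to the NSFG data): draw the yearly factors from a normal distribution and take the product. The log of each resulting weight is then a sum of logs, which the CLT pushes toward normal, so the weights converge to lognormal.

```python
import numpy as np

def sim_adult_weights(birth_weights, n_years=18, mu=1.19, sigma=0.12, seed=17):
    """Sketch: multiply each birth weight by n_years growth factors
    drawn from Normal(mu, sigma).  mu and sigma are illustrative
    guesses, not values fitted to the data."""
    rng = np.random.RandomState(seed)
    return np.array([bw * np.prod(rng.normal(mu, sigma, n_years))
                     for bw in birth_weights])

# 1000 simulated adult weights starting from a 7.5 lb birth weight
ws = sim_adult_weights(np.full(1000, 7.5))
```

With these parameters the log-weights are a sum of 18 terms, so `np.log(ws)` on a normal probability plot is already close to a straight line.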

### Exercise 14.2

``````

In [132]:

dist1 = normal.SamplingDistMean(firsts.prglngth, len(firsts))
dist2 = normal.SamplingDistMean(others.prglngth, len(others))

dist = dist1 - dist2
print 'Standard Error:', dist.sigma
print '90% CI', dist.Percentile(5), dist.Percentile(95)

``````
``````

Standard Error: 0.0566695304191
90% CI -0.0151758158699 0.171250349425

``````

The standard error is the square root of the variance of the sampling distribution of the difference in means.
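Concretely (with a hypothetical helper and illustrative numbers rather than the `normal.py` API): each sampling distribution of a mean has variance s²/n, and when two independent distributions are subtracted their variances add, so the standard error of the difference is the square root of that sum.

```python
import math

def diff_se(s1, n1, s2, n2):
    """Hypothetical helper: standard error of the difference of two
    independent sample means, sqrt(s1**2/n1 + s2**2/n2)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# illustrative standard deviations and sample sizes (not the NSFG values)
se = diff_se(2.7, 4400, 2.6, 4700)
```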

### Exercise 14.3

``````

In [152]:

# null hypothesis is that boys' and girls' scores come
# from the same population
male_before = normal.Normal(3.57, 0.28**2)
male_after = normal.Normal(3.44, 0.16**2)

fem_before = normal.Normal(1.91, 0.32**2)
fem_after = normal.Normal(3.18, 0.16**2)

diff_before = fem_before - male_before
#thinkplot.Cdf(diff_before)
diff_after = fem_after - male_after
# thinkplot.Cdf(diff_after)

print "before: mean, p-value", diff_before.mu, 1-diff_before.Prob(0)
print "after: mean, p-value", diff_after.mu, 1-diff_after.Prob(0)

diff =  diff_after - diff_before
# thinkplot.Cdf(diff)
print "diff in gender gap: mean, p-value", diff.mu, diff.Prob(0)

``````
``````

before: mean, p-value -1.66 4.73095323208e-05
after: mean, p-value -0.26 0.125267987207
diff in gender gap: mean, p-value 1.4 0.00182694836898

``````

The CDF is key to understanding this problem.

The null hypothesis for the first part is that there is no difference between the distributions, so the p-value is 1 - diff.Prob(0): the probability that, given these distributions, the difference is greater than 0.

For the second part, the null hypothesis is that the gender gap did not shrink, so the p-value is diff.Prob(0): the probability that the difference between the gaps is less than or equal to 0.
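Both tail probabilities can be read off a normal approximation directly; here is a sketch using `scipy.stats.norm` in place of the notebook's `Normal` class (the mean and sigma below are illustrative, not the computed values):

```python
import scipy.stats

def tail_pvalues(mu, sigma):
    """For a normal approximation of a difference's sampling
    distribution, return P(diff > 0) and P(diff <= 0)."""
    dist = scipy.stats.norm(mu, sigma)
    return dist.sf(0), dist.cdf(0)

# illustrative values: a difference with mean -1.66 and sd 0.4
p_greater, p_leq = tail_pvalues(-1.66, 0.4)
```

The two tails are complementary, which is why one part of the exercise uses `Prob(0)` and the other uses `1 - Prob(0)`.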