Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey



In [2]:

    
from __future__ import division

import thinkstats2
import thinkplot
import numpy as np

%matplotlib inline

Exercise 5.1

In the BRFSS (see Section 5.4), the distribution of heights is roughly normal with parameters µ = 178 cm and σ = 7.7 cm for men, and µ = 163 cm and σ = 7.3 cm for women.

In order to join Blue Man Group, you have to be male between 5’10” and 6’1” (see http://bluemancasting.com). What percentage of the U.S. male population is in this range? Hint: use scipy.stats.norm.cdf.

scipy.stats contains objects that represent analytic distributions.



In [3]:

    
import scipy.stats

For example, scipy.stats.norm represents a normal distribution.



In [4]:

    
mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)
type(dist)









    Out[4]:





scipy.stats._distn_infrastructure.rv_frozen

A "frozen random variable" can compute its mean and standard deviation.



In [5]:

    
dist.mean(), dist.std()









    Out[5]:





(178.0, 7.7000000000000002)

It can also evaluate its CDF. What fraction of men are more than one standard deviation below the mean? About 16%.



In [6]:

    
dist.cdf(mu-sigma)









    Out[6]:





0.15865525393145741

What fraction of men are between 5'10" and 6'1"?



In [7]:

    
def heightToCentimeters(ft, inches):
    height_in = ft * 12 + inches
    return height_in * 2.54

minHeight = heightToCentimeters(5,10)
minPercentile = dist.cdf(minHeight)
print('minPercentile', minPercentile )

maxHeight = heightToCentimeters(6,1)
maxPercentile = dist.cdf(maxHeight)
print('maxPercentile', maxPercentile)

print('population percent', maxPercentile - minPercentile)
print('my Answer: %d%%' % round((maxPercentile - minPercentile) * 100, 2))









    



('minPercentile', 0.48963902786483265)
('maxPercentile', 0.83238586549630722)
('population percent', 0.34274683763147457)
my Answer: 34%
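
As a cross-check that does not depend on scipy, the same probability can be sketched with the standard library alone, assuming the usual identity Φ(z) = (1 + erf(z/√2)) / 2 for the standard normal CDF. The helper names `normal_cdf` and `height_to_cm` are illustrative and are not part of thinkstats2.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed from the error function."""
    z = (x - mu) / (sigma * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))

def height_to_cm(feet, inches):
    """Convert a height in feet and inches to centimeters."""
    return (feet * 12 + inches) * 2.54

mu, sigma = 178, 7.7          # BRFSS parameters for men, in cm
low = height_to_cm(5, 10)     # lower bound of the range
high = height_to_cm(6, 1)     # upper bound of the range

# P(low < X < high) is the difference of the two CDF values.
fraction = normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)
print('fraction of men in range: %.4f' % fraction)
```

The result should agree with the scipy-based answer above to within floating-point error.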

Exercise 5.2

To get a feel for the Pareto distribution, let’s see how different the world would be if the distribution of human height were Pareto. With the parameters $x_m = 1$ m and $α = 1.7$, we get a distribution with a reasonable minimum, 1 m, and median, 1.5 m.

Plot this distribution. What is the mean human height in Pareto world? What fraction of the population is shorter than the mean? If there are 7 billion people in Pareto world, how many do we expect to be taller than 1 km? How tall do we expect the tallest person to be?

scipy.stats.pareto represents a Pareto distribution. In Pareto world, the distribution of human heights has parameters alpha=1.7 and xmin=1 meter, so the shortest person is 100 cm tall and the median height is about 150 cm.



In [8]:

    
alpha = 1.7
xmin = 1
dist = scipy.stats.pareto(b=alpha, scale=xmin)
dist.median()









    Out[8]:





1.5034066538560549



In [18]:

    
xs, ps = thinkstats2.RenderParetoCdf(xmin, alpha, 0, 10.0, n=100) 
thinkplot.Plot(xs, ps, label=r'$\alpha=%g$' % alpha)
thinkplot.Config(xlabel='height (m)', ylabel='CDF')

What is the mean height in Pareto world?



In [22]:

    
pMean = dist.mean()
pMean









    Out[22]:





2.4285714285714288

What fraction of people are shorter than the mean?



In [23]:

    
dist.cdf(pMean)









    Out[23]:





0.77873969756528805
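
Both numbers above can also be checked by hand. This is a minimal sketch assuming the standard Pareto formulas, mean = α·x_m / (α − 1) for α > 1 and CDF(x) = 1 − (x / x_m)^(−α); the function name `pareto_cdf` is illustrative.

```python
alpha = 1.7
xmin = 1.0

def pareto_cdf(x, xmin, alpha):
    """CDF of a Pareto distribution with minimum xmin and shape alpha."""
    if x < xmin:
        return 0.0
    return 1 - (x / xmin) ** -alpha

# The mean is finite only for alpha > 1.
mean = alpha * xmin / (alpha - 1)
frac_below_mean = pareto_cdf(mean, xmin, alpha)
print('mean = %.4f m, fraction below mean = %.4f' % (mean, frac_below_mean))
```

Roughly 78% of the population is shorter than the mean, which reflects how strongly the long right tail pulls the mean above the median.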

Out of 7 billion people, how many do we expect to be taller than 1 km? You could use dist.cdf or dist.sf.



In [29]:

    
fracTall = 1 - dist.cdf(1000)
fracTall * 7e9









    Out[29]:





55602.976430479954
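
The hint mentions `dist.sf` because the survival function gives the tail probability directly instead of subtracting a value very close to 1 from 1. A sketch of the closed form, assuming SF(x) = (x / x_m)^(−α) for x ≥ x_m; the name `pareto_sf` is illustrative.

```python
alpha = 1.7
xmin = 1.0
population = 7e9

def pareto_sf(x, xmin, alpha):
    """Survival function P(X > x) of a Pareto distribution."""
    if x < xmin:
        return 1.0
    return (x / xmin) ** -alpha

# Expected count is the population times the tail probability at 1000 m.
expected_taller = population * pareto_sf(1000.0, xmin, alpha)
print('expected number taller than 1 km: %.0f' % expected_taller)
```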

How tall do we expect the tallest person to be? Hint: find the height that yields about 1 person.



In [35]:

    
# The tallest of 7 billion people is the one person expected above some height,
# so that height has survival probability 1 in 7 billion.
# Equivalently, its CDF value is 1 - 1/7e9; ppf (the inverse CDF) returns the height.
tallestProb = 1 - (1 /  7e9)
dist.ppf(tallestProb)









    Out[35]:





618349.61067595053



In [47]:

    
dist.sf(618349.61067595053) * 7e9









    Out[47]:





1.0000001937626735
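
The same height can be obtained without forming 1 − 1/7e9, by inverting the survival function directly. This sketch assumes SF(x) = (x / x_m)^(−α), so that SF(x) = p gives x = x_m · p^(−1/α); the name `pareto_isf` is illustrative.

```python
alpha = 1.7
xmin = 1.0
population = 7e9

def pareto_isf(p, xmin, alpha):
    """Height x at which the Pareto survival function equals p."""
    return xmin * p ** (-1.0 / alpha)

# Height above which we expect exactly one person out of the population.
tallest = pareto_isf(1.0 / population, xmin, alpha)

# Sanity check: plugging the height back in should give about one person.
expected_above = population * (tallest / xmin) ** -alpha
print('tallest person: about %.0f km' % (tallest / 1000))
```

In Pareto world the tallest person is expected to be on the order of 600 km tall.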

Exercise 5.3

The Weibull distribution is a generalization of the exponential distribution that comes up in failure analysis (see http://wikipedia.org/wiki/Weibull_distribution). Its CDF is

$CDF(x) = 1 - \exp(-(x / λ)^k)$

Can you find a transformation that makes a Weibull distribution look like a straight line? What do the slope and intercept of the line indicate?

Use random.weibullvariate to generate a sample from a Weibull distribution and use it to test your transformation.



In [84]:

    
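import random, math

# random.weibullvariate(alpha, beta) takes the scale first, then the shape,
# so this sample has lambda = 3 and k = 5.
WB_sample = [random.weibullvariate(3, 5) for i in range(1000)]
WB_cdf = thinkstats2.Cdf(WB_sample)

# Drop the largest value, where CDF(x) = 1 and log(1 - CDF(x)) is undefined.
WB_sample2 = [w for w in WB_sample if WB_cdf.Prob(w) < 1]
WB_sample2.sort()

# Transform: log(-log(1 - CDF(x))) = k*log(x) - k*log(lambda), a line in log(x).
t_sample = [math.log(x) for x in WB_sample2]
t_cdf = [math.log(-math.log(1 - WB_cdf.Prob(y))) for y in WB_sample2]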

thinkplot.plot(t_sample, t_cdf)
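
To answer the question about slope and intercept: taking logs of the CDF twice gives log(−log(1 − CDF(x))) = k·log(x) − k·log(λ), so the slope of the line is k and the intercept is −k·log(λ). The following standard-library sketch tests this on a seeded sample; the plotting positions (i + 0.5) / n and the least-squares fit are choices made here for illustration, not part of the exercise.

```python
import math
import random

random.seed(17)  # fixed seed so the sketch is repeatable

lam, k = 3.0, 5.0  # scale and shape, in the order weibullvariate expects
sample = sorted(random.weibullvariate(lam, k) for _ in range(1000))
n = len(sample)

# Empirical CDF at plotting positions (i + 0.5) / n, which stay strictly
# between 0 and 1 so both logarithms are defined.
xs = [math.log(x) for x in sample]
ys = [math.log(-math.log(1 - (i + 0.5) / n)) for i in range(n)]

# Ordinary least-squares line through the transformed points.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Recover the parameters from the line.
k_est = slope
lam_est = math.exp(-intercept / slope)
print('estimated k = %.2f, estimated lambda = %.2f' % (k_est, lam_est))
```

The estimates should land near k = 5 and λ = 3, though a least-squares fit on transformed data is only a rough estimator.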

Exercise 5.4

For small values of n, we don’t expect an empirical distribution to fit an analytic distribution exactly. One way to evaluate the quality of fit is to generate a sample from an analytic distribution and see how well it matches the data.

For example, in Section 5.1 we plotted the distribution of time between births and saw that it is approximately exponential. But the distribution is based on only 44 data points. To see whether the data might have come from an exponential distribution, generate 44 values from an exponential distribution with the same mean as the data, about 33 minutes between births.

Plot the distribution of the random values and compare it to the actual distribution. You can use random.expovariate to generate the values.



In [99]:

    
import analytic

df = analytic.ReadBabyBoom()
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

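# 44 births in 24 hours is a rate of 44 / (24 * 60) births per minute,
# which corresponds to a mean of about 33 minutes between births.
lam = 44.0 / 24 / 60
# expovariate already returns times between events, so the values are
# used directly rather than differenced again.
randSamp = [random.expovariate(lam) for i in range(44)]

cdfSamp = thinkstats2.Cdf(randSamp, label='sample')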

thinkplot.Cdfs([cdf, cdfSamp], complement=True)
thinkplot.Config(yscale='log')
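
For readers without thinkstats2 at hand, the model side of this comparison can be sketched with the standard library. The helper `ccdf` is a hypothetical stand-in for the complementary CDF that thinkplot draws, and it assumes the sample has no ties.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

n = 44
lam = 44.0 / (24 * 60)  # births per minute
sample = [random.expovariate(lam) for _ in range(n)]

def ccdf(values):
    """Return sorted values and, for each, the fraction of the sample above it."""
    xs = sorted(values)
    m = len(xs)
    return xs, [1 - (i + 1) / float(m) for i in range(m)]

xs, ps = ccdf(sample)
print('model mean: %.1f minutes' % (1 / lam))
print('sample mean: %.1f minutes' % (sum(sample) / n))
```

On a log-y scale, an exponential CCDF is a straight line with slope −λ, so rerunning with different seeds gives a sense of how far 44 points can stray from that line by chance.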

Exercise 5.5

mystery0 --> linear and weibull
mystery1 --> weibull and normal are both pretty good
mystery2 --> expo and weibull
mystery3 --> normal (and pareto)
mystery4 --> lognormal (b/c) looks like normal, but ?. e + n
mystery5 --> pareto
mystery6 --> normal and weibull
mystery7 --> lognormal and expo

Exercise 5.6



In [145]:

    
import hinc

income = hinc.ReadData()

inc_freq = dict(zip(income.income, income.freq))
inc_hist = thinkstats2.Hist(inc_freq, label='income distribution')
inc_cdf = thinkstats2.Cdf(inc_freq, label='income distribution cdf')
print('done')









    



done
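
Several cells below expand the (income, frequency) table into a flat list with one entry per counted unit. A toy sketch of that expansion, using a hypothetical three-row table rather than the real hinc data:

```python
# Hypothetical toy table: income upper bound -> frequency.
toy_freq = {5000: 2, 10000: 3, 15000: 1}

# The loop over table rows comes first, so that `freq` is bound
# before range(freq) is evaluated in the inner loop.
expanded = sorted(inc for inc, freq in toy_freq.items() for _ in range(freq))
print(expanded)
```

Each income value appears as many times as its frequency, which is the form a sorted-sample probability plot needs.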



In [146]:

    
# Exponential: the complementary CDF is a straight line on a log-y scale.
print('starting...')
thinkplot.Cdf(inc_cdf,
              complement=True,
              label="Exponential")
thinkplot.Show(yscale='log')









    



starting...






    












    








In [154]:

    
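# Normal probability plot: sorted incomes against sorted standard normal values.
# Repeat each income value freq times; the loop over table rows must come first.
inc_list = [inc for inc, freq in inc_freq.items() for i in range(freq)]
inc_list.sort()
rand_samp = np.random.normal(0,1,len(inc_list))
rand_samp.sort()
thinkplot.Plot(inc_list, rand_samp, label='Normal')
thinkplot.Show()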









    












    








In [169]:

    
# Lognormal: if income is lognormal, then log(income) is normal.
loginc_list = [math.log(inc) for inc, freq in inc_freq.items() for i in range(freq)]
loginc_list.sort()
logrand_samp = [np.random.normal(0,1) for i in range(len(loginc_list))]
logrand_samp.sort()
thinkplot.PrePlot(2, rows=2)
thinkplot.SubPlot(1)
thinkplot.Plot(loginc_list, logrand_samp, label='lognormal')

# Show() is called once, after both subplots are drawn, so the first
# panel is not cleared before the second is added.
log_cdf = thinkstats2.Cdf(loginc_list)
thinkplot.SubPlot(2)
thinkplot.Cdf(log_cdf, label='lognormal plotted as normal')
thinkplot.Show()









    












    












    








In [172]:

    
# Pareto: the complementary CDF is a straight line on log-log axes.
inc_list = [inc for inc, freq in inc_freq.items() for i in range(freq)]
inc_cdf = thinkstats2.Cdf(inc_list)
thinkplot.Cdf(inc_cdf, transform="pareto", label='Pareto')
thinkplot.Show()









    












    








In [170]:

    
##Weibull
thinkplot.figure()
thinkplot.Cdf(inc_cdf, transform='weibull')









    Out[170]:





{'xscale': 'log', 'yscale': 'log'}



In [175]:

    
import hinc_soln
hinc_soln.main()









    



Writing hinc_linear.pdf
Writing hinc_linear.eps
Writing hinc_pareto.pdf
Writing hinc_pareto.eps
4.70950034693 0.35
Writing hinc_normal.pdf
Writing hinc_normal.eps






    







