# Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey



In [1]:

from __future__ import print_function, division

%matplotlib inline

import numpy as np

import nsfg
import first
import analytic

import thinkstats2
import thinkplot



## Exponential distribution

Here's what the exponential CDF looks like with a range of parameters.



In [2]:

thinkplot.PrePlot(3)
for lam in [2.0, 1, 0.5]:
xs, ps = thinkstats2.RenderExpoCdf(lam, 0, 3.0, 50)
label = r'$\lambda=%g$' % lam
thinkplot.Plot(xs, ps, label=label)

thinkplot.Config(title='Exponential CDF', xlabel='x', ylabel='CDF',
loc='lower right')






Here's the distribution of interarrival times from a dataset of birth times.



In [3]:

diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='Time between births (minutes)', ylabel='CDF')






Here's what the CCDF looks like on a log-y scale. A straight line is consistent with an exponential distribution.



In [4]:

thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='Time between births (minutes)',
ylabel='CCDF', yscale='log', loc='upper right')






## Normal distribution

Here's what the normal CDF looks like with a range of parameters.



In [5]:

thinkplot.PrePlot(3)

mus = [1.0, 2.0, 3.0]
sigmas = [0.5, 0.4, 0.3]
for mu, sigma in zip(mus, sigmas):
xs, ps = thinkstats2.RenderNormalCdf(mu=mu, sigma=sigma,
low=-1.0, high=4.0)
label = r'$\mu=%g$, $\sigma=%g$' % (mu, sigma)
thinkplot.Plot(xs, ps, label=label)

thinkplot.Config(title='Normal CDF', xlabel='x', ylabel='CDF',
loc='upper left')






I'll use a normal model to fit the distribution of birth weights from the NSFG.



In [6]:

weights = preg.totalwgt_lb.dropna()



Here's the observed CDF and the model. The model fits the data well except in the left tail.



In [7]:

# estimate parameters: trimming outliers yields a better fit
mu, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
print('Mean, Var', mu, var)

# plot the model
sigma = np.sqrt(var)
print('Sigma', sigma)
xs, ps = thinkstats2.RenderNormalCdf(mu, sigma, low=0, high=12.5)

thinkplot.Plot(xs, ps, label='model', color='0.6')

# plot the data
cdf = thinkstats2.Cdf(weights, label='data')

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf)
thinkplot.Config(title='Birth weights',
xlabel='Birth weight (pounds)',
ylabel='CDF')




Mean, Var 7.280883100022579 1.5452125703544872
Sigma 1.2430657948614334



A normal probability plot is a visual test for normality. The following example shows that if the data are actually from a normal distribution, the plot is approximately straight.



In [8]:

n = 1000
thinkplot.PrePlot(3)

mus = [0, 1, 5]
sigmas = [1, 1, 2]

for mu, sigma in zip(mus, sigmas):
sample = np.random.normal(mu, sigma, n)
xs, ys = thinkstats2.NormalProbability(sample)
label = '$\mu=%d$, $\sigma=%d$' % (mu, sigma)
thinkplot.Plot(xs, ys, label=label)

thinkplot.Config(title='Normal probability plot',
xlabel='standard normal sample',
ylabel='sample values')






Here's the normal probability plot for birth weights, showing that the lightest babies are lighter than we expect from the normal mode, and the heaviest babies are heavier.



In [9]:

mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
std = np.sqrt(var)

xs = [-4, 4]
fxs, fys = thinkstats2.FitLine(xs, mean, std)
thinkplot.Plot(fxs, fys, linewidth=4, color='0.8')

xs, ys = thinkstats2.NormalProbability(weights)
thinkplot.Plot(xs, ys, label='all live')

thinkplot.Config(title='Normal probability plot',
xlabel='Standard deviations from mean',
ylabel='Birth weight (lbs)')






If we suspect that the deviation in the left tail is due to preterm babies, we can check by selecting only full term births.



In [10]:

full_term = preg[preg.prglngth >= 37]
term_weights = full_term.totalwgt_lb.dropna()



Now the deviation in the left tail is almost gone, but the heaviest babies are still heavy.



In [11]:

mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
std = np.sqrt(var)

xs = [-4, 4]
fxs, fys = thinkstats2.FitLine(xs, mean, std)
thinkplot.Plot(fxs, fys, linewidth=4, color='0.8')

thinkplot.PrePlot(2)
xs, ys = thinkstats2.NormalProbability(weights)
thinkplot.Plot(xs, ys, label='all live')

xs, ys = thinkstats2.NormalProbability(term_weights)
thinkplot.Plot(xs, ys, label='full term')
thinkplot.Config(title='Normal probability plot',
xlabel='Standard deviations from mean',
ylabel='Birth weight (lbs)')






## Lognormal model

As an example of a lognormal disrtribution, we'll look at adult weights from the BRFSS.



In [14]:

import brfss
weights = df.wtkg2.dropna()



The following function estimates the parameters of a normal distribution and plots the data and a normal model.



In [15]:

def MakeNormalModel(weights):
"""Plots a CDF with a Normal model.

weights: sequence
"""
cdf = thinkstats2.Cdf(weights, label='weights')

mean, var = thinkstats2.TrimmedMeanVar(weights)
std = np.sqrt(var)
print('n, mean, std', len(weights), mean, std)

xmin = mean - 4 * std
xmax = mean + 4 * std

xs, ps = thinkstats2.RenderNormalCdf(mean, std, xmin, xmax)
thinkplot.Plot(xs, ps, label='model', linewidth=4, color='0.8')
thinkplot.Cdf(cdf)



Here's the distribution of adult weights and a normal model, which is not a very good fit.



In [16]:

MakeNormalModel(weights)
thinkplot.Config(title='Adult weight, linear scale', xlabel='Weight (kg)',
ylabel='CDF', loc='upper right')




n, mean, std 398484 78.59599565702814 17.75455519179871



Here's the distribution of adult weight and a lognormal model, plotted on a log-x scale. The model is a better fit for the data, although the heaviest people are heavier than the model expects.



In [17]:

log_weights = np.log10(weights)
MakeNormalModel(log_weights)
thinkplot.Config(title='Adult weight, log scale', xlabel='Weight (log10 kg)',
ylabel='CDF', loc='upper right')




n, mean, std 398484 1.884660713731975 0.09623580259151371



The following function generates a normal probability plot.



In [18]:

def MakeNormalPlot(weights):
"""Generates a normal probability plot of birth weights.

weights: sequence
"""
mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
std = np.sqrt(var)

xs = [-5, 5]
xs, ys = thinkstats2.FitLine(xs, mean, std)
thinkplot.Plot(xs, ys, color='0.8', label='model')

xs, ys = thinkstats2.NormalProbability(weights)
thinkplot.Plot(xs, ys, label='weights')



When we generate a normal probability plot with adult weights, we can see clearly that the data deviate from the model systematically.



In [19]:

MakeNormalPlot(weights)
thinkplot.Config(title='Adult weight, normal plot', xlabel='Weight (kg)',
ylabel='CDF', loc='upper left')






If we make a normal probability plot with log weights, the model fit the data well except in the tails, where the heaviest people exceed expectations.



In [20]:

MakeNormalPlot(log_weights)
thinkplot.Config(title='Adult weight, lognormal plot', xlabel='Weight (log10 kg)',
ylabel='CDF', loc='upper left')






## Pareto distribution

Here's what the Pareto CDF looks like with a range of parameters.



In [21]:

xmin = 0.5

thinkplot.PrePlot(3)
for alpha in [2.0, 1.0, 0.5]:
xs, ps = thinkstats2.RenderParetoCdf(xmin, alpha, 0, 10.0, n=100)
thinkplot.Plot(xs, ps, label=r'$\alpha=%g$' % alpha)

thinkplot.Config(title='Pareto CDF', xlabel='x',
ylabel='CDF', loc='lower right')






The distribution of populations for cities and towns is sometimes said to be Pareto-like.



In [24]:

import populations

print('Number of cities/towns', len(pops))




Number of cities/towns 19515



Here's the distribution of population for cities and towns in the U.S., along with a Pareto model. The model fits the data well in the tail.



In [25]:

log_pops = np.log10(pops)
cdf = thinkstats2.Cdf(pops, label='data')
cdf_log = thinkstats2.Cdf(log_pops, label='data')

# pareto plot
xs, ys = thinkstats2.RenderParetoCdf(xmin=5000, alpha=1.4, low=0, high=1e7)
thinkplot.Plot(np.log10(xs), 1-ys, label='model', color='0.8')

thinkplot.Cdf(cdf_log, complement=True)
thinkplot.Config(xlabel='log10 population',
ylabel='CCDF',
yscale='log', loc='lower left')






The lognormal model might be a better fit for this data (as is often the case for things that are supposed to be Pareto).



In [26]:

thinkplot.PrePlot(cols=2)

mu, sigma = log_pops.mean(), log_pops.std()
xs, ps = thinkstats2.RenderNormalCdf(mu, sigma, low=0, high=8)
thinkplot.Plot(xs, ps, label='model', color='0.8')

thinkplot.Cdf(cdf_log)
thinkplot.Config(xlabel='log10 population',
ylabel='CDF', loc='lower right')






Here's a normal probability plot for the log-populations. The model fits the data well except in the right tail, where the biggest cities are bigger than expected.



In [27]:

thinkstats2.NormalProbabilityPlot(log_pops, label='data')
thinkplot.Config(xlabel='Random variate',
ylabel='log10 population',
xlim=[-5, 5])






## Random variates

When we have an analytic CDF, we can sometimes invert it to generate random values. The following function generates values from an exponential distribution.



In [28]:

import random

def expovariate(lam):
p = random.random()
x = -np.log(1-p) / lam
return x



We can test it by generating a sample.



In [29]:

t = [expovariate(lam=2) for _ in range(1000)]



And plotting the CCDF on a log-y scale.



In [30]:

cdf = thinkstats2.Cdf(t)

thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='Exponential variate', ylabel='CCDF', yscale='log')






A straight line is consistent with an exponential distribution.

As an exercise, write a function that generates a Pareto variate.

## Exercises

Exercise: In the BRFSS (see Section 5.4), the distribution of heights is roughly normal with parameters µ = 178 cm and σ = 7.7 cm for men, and µ = 163 cm and σ = 7.3 cm for women.

In order to join Blue Man Group, you have to be male between 5’10” and 6’1” (see http://bluemancasting.com). What percentage of the U.S. male population is in this range? Hint: use scipy.stats.norm.cdf.

scipy.stats contains objects that represent analytic distributions



In [31]:

import scipy.stats



For example scipy.stats.norm represents a normal distribution.



In [32]:

mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)
type(dist)




Out[32]:

scipy.stats._distn_infrastructure.rv_frozen



A "frozen random variable" can compute its mean and standard deviation.



In [33]:

dist.mean(), dist.std()




Out[33]:

(178.0, 7.7)



It can also evaluate its CDF. How many people are more than one standard deviation below the mean? About 16%



In [34]:

dist.cdf(mu-sigma)




Out[34]:

0.1586552539314574



How many people are between 5'10" and 6'1"?



In [35]:

# Solution

low = dist.cdf(177.8)    # 5'10"
high = dist.cdf(185.4)   # 6'1"
low, high, high-low




Out[35]:

(0.48963902786483265, 0.8317337108107857, 0.3420946829459531)



Exercise: To get a feel for the Pareto distribution, let’s see how different the world would be if the distribution of human height were Pareto. With the parameters xm = 1 m and α = 1.7, we get a distribution with a reasonable minimum, 1 m, and median, 1.5 m.

Plot this distribution. What is the mean human height in Pareto world? What fraction of the population is shorter than the mean? If there are 7 billion people in Pareto world, how many do we expect to be taller than 1 km? How tall do we expect the tallest person to be?

scipy.stats.pareto represents a pareto distribution. In Pareto world, the distribution of human heights has parameters alpha=1.7 and xmin=1 meter. So the shortest person is 100 cm and the median is 150.



In [36]:

alpha = 1.7
xmin = 1       # meter
dist = scipy.stats.pareto(b=alpha, scale=xmin)
dist.median()




Out[36]:

1.5034066538560549



What is the mean height in Pareto world?



In [37]:

# Solution

dist.mean()




Out[37]:

2.428571428571429



What fraction of people are shorter than the mean?



In [38]:

# Solution

dist.cdf(dist.mean())




Out[38]:

0.778739697565288



Out of 7 billion people, how many do we expect to be taller than 1 km? You could use dist.cdf or dist.sf.



In [39]:

# Solution

(1 - dist.cdf(1000)) * 7e9, dist.sf(1000) * 7e9




Out[39]:

(55602.976430479954, 55602.97643069972)



How tall do we expect the tallest person to be?



In [40]:

# Solution

# One way to solve this is to search for a height that we
# expect one person out of 7 billion to exceed.

# It comes in at roughly 600 kilometers.

dist.sf(600000) * 7e9




Out[40]:

1.0525455861201714




In [41]:

# Solution

# Another way is to use ppf, which evaluates the "percent point function", which
# is the inverse CDF.  So we can compute the height in meters that corresponds to
# the probability (1 - 1/7e9).

dist.ppf(1 - 1/7e9)




Out[41]:

618349.6106759505



Exercise: The Weibull distribution is a generalization of the exponential distribution that comes up in failure analysis (see http://wikipedia.org/wiki/Weibull_distribution). Its CDF is

$\mathrm{CDF}(x) = 1 − \exp[−(x / λ)^k]$

Can you find a transformation that makes a Weibull distribution look like a straight line? What do the slope and intercept of the line indicate?

Use random.weibullvariate to generate a sample from a Weibull distribution and use it to test your transformation.

Generate a sample from a Weibull distribution and plot it using a transform that makes a Weibull distribution look like a straight line.

If you are stuck, you can get a hint from thinkplot.Cdf, which provides a transform that makes the CDF of a Weibull distribution look like a straight line. Here's an example that shows how it's used.



In [42]:

sample = [random.weibullvariate(2, 1) for _ in range(1000)]
cdf = thinkstats2.Cdf(sample)
thinkplot.Cdf(cdf, transform='weibull')
thinkplot.Config(xlabel='Weibull variate', ylabel='CCDF')






Exercise: For small values of n, we don’t expect an empirical distribution to fit an analytic distribution exactly. One way to evaluate the quality of fit is to generate a sample from an analytic distribution and see how well it matches the data.

For example, in Section 5.1 we plotted the distribution of time between births and saw that it is approximately exponential. But the distribution is based on only 44 data points. To see whether the data might have come from an exponential distribution, generate 44 values from an exponential distribution with the same mean as the data, about 33 minutes between births.

Plot the distribution of the random values and compare it to the actual distribution. You can use random.expovariate to generate the values.



In [43]:

import analytic

diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

n = len(diffs)
lam = 44.0 / 24 / 60
sample = [random.expovariate(lam) for _ in range(n)]

1/lam, np.mean(sample)




Out[43]:

(32.72727272727273, 30.514962897557755)




In [44]:

# Solution

model = thinkstats2.Cdf(sample, label='model')

thinkplot.PrePlot(2)
thinkplot.Cdfs([cdf, model], complement=True)
thinkplot.Config(xlabel='Time between births (minutes)',
ylabel='CCDF',
yscale='log')







In [45]:

# Solution

# If you plot distributions for a large number of samples, you get a sense
# of how much random variation to expect.  In this case, the data fall within
# the range we expect, so there is no compelling reason to think it is
# not exponential.

for i in range(100):
sample = [random.expovariate(lam) for _ in range(n)]
thinkplot.Cdf(thinkstats2.Cdf(sample), complement=True, color='0.9')

thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='Time between births (minutes)',
ylabel='CCDF',
yscale='log')






Worked Example: The distributions of wealth and income are sometimes modeled using lognormal and Pareto distributions. To see which is better, let’s look at some data.

The Current Population Survey (CPS) is a joint effort of the Bureau of Labor Statistics and the Census Bureau to study income and related variables. Data collected in 2013 is available from http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm. I downloaded hinc06.xls, which is an Excel spreadsheet with information about household income, and converted it to hinc06.csv, a CSV file you will find in the repository for this book. You will also find hinc.py, which reads this file.

Extract the distribution of incomes from this dataset. Are any of the analytic distributions in this chapter a good model of the data?



In [48]:

import hinc
df




Out[48]:

income
freq
cumsum
ps

0
4.999000e+03
4204
4204
0.034330

1
9.999000e+03
4729
8933
0.072947

2
1.499900e+04
6982
15915
0.129963

3
1.999900e+04
7157
23072
0.188407

4
2.499900e+04
7131
30203
0.246640

5
2.999900e+04
6740
36943
0.301679

6
3.499900e+04
6354
43297
0.353566

7
3.999900e+04
5832
49129
0.401191

8
4.499900e+04
5547
54676
0.446488

9
4.999900e+04
5254
59930
0.489392

10
5.499900e+04
5102
65032
0.531056

11
5.999900e+04
4256
69288
0.565810

12
6.499900e+04
4356
73644
0.601382

13
6.999900e+04
3949
77593
0.633629

14
7.499900e+04
3756
81349
0.664301

15
7.999900e+04
3414
84763
0.692180

16
8.499900e+04
3326
88089
0.719341

17
8.999900e+04
2643
90732
0.740923

18
9.499900e+04
2678
93410
0.762792

19
9.999900e+04
2223
95633
0.780945

20
1.049990e+05
2606
98239
0.802226

21
1.099990e+05
1838
100077
0.817235

22
1.149990e+05
1986
102063
0.833453

23
1.199990e+05
1464
103527
0.845408

24
1.249990e+05
1596
105123
0.858441

25
1.299990e+05
1327
106450
0.869278

26
1.349990e+05
1253
107703
0.879510

27
1.399990e+05
1140
108843
0.888819

28
1.449990e+05
1119
109962
0.897957

29
1.499990e+05
920
110882
0.905470

30
1.549990e+05
1143
112025
0.914803

31
1.599990e+05
805
112830
0.921377

32
1.649990e+05
731
113561
0.927347

33
1.699990e+05
575
114136
0.932042

34
1.749990e+05
616
114752
0.937072

35
1.799990e+05
570
115322
0.941727

36
1.849990e+05
502
115824
0.945826

37
1.899990e+05
364
116188
0.948799

38
1.949990e+05
432
116620
0.952327

39
1.999990e+05
378
116998
0.955413

40
2.499990e+05
2549
119547
0.976229

41
inf
2911
122458
1.000000



Here's what the CDF looks like on a linear scale.



In [49]:

xs, ps = df.income.values, df.ps.values
cdf = thinkstats2.Cdf(xs, ps, label='data')
cdf_log = thinkstats2.Cdf(np.log10(xs), ps, label='data')

# linear plot
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='household income',
ylabel='CDF')






To check whether a Pareto model describes the data well, I plot the CCDF on a log-log scale.

I found parameters for the Pareto model that match the tail of the distribution.



In [50]:

xs, ys = thinkstats2.RenderParetoCdf(xmin=55000, alpha=2.5,
low=0, high=250000)

thinkplot.Plot(xs, 1-ys, label='model', color='0.8')

thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='log10 household income',
ylabel='CCDF',
xscale='log',
yscale='log',
loc='lower left')






For the lognormal model I estimate mu and sigma using percentile-based statistics (median and IQR).



In [51]:

median = cdf_log.Percentile(50)
iqr = cdf_log.Percentile(75) - cdf_log.Percentile(25)
std = iqr / 1.349

# choose std to match the upper tail
std = 0.35
print(median, std)




4.740354793159152 0.35



Here's what the distribution, and fitted model, look like on a log-x scale.



In [52]:

xs, ps = thinkstats2.RenderNormalCdf(median, std, low=3.5, high=5.5)
thinkplot.Plot(xs, ps, label='model', color='0.8')

thinkplot.Cdf(cdf_log)
thinkplot.Config(xlabel='log10 household income',
ylabel='CDF')






My conclusions based on these figures are:

1) The Pareto model might be a reasonable choice for the top 10-20% of incomes.

2) The lognormal model captures the shape of the distribution better, with some deviation in the left tail. With different choices for sigma, you could match the upper or lower tail, but not both at the same time.

In summary I would say that neither model captures the whole distribution, so you might have to

1) look for another analytic model,

2) choose one that captures the part of the distribution that is most relevent, or

3) avoid using an analytic model altogether.



In [ ]: