Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey


In [4]:
from __future__ import print_function, division

Exercise 5.1

scipy.stats contains objects that represent analytic distributions


In [1]:
import scipy.stats

%matplotlib inline

For example scipy.stats.norm represents a normal distribution.


In [3]:
mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)
type(dist)


Out[3]:
scipy.stats._distn_infrastructure.rv_frozen

A "frozen random variable" can compute its mean and standard deviation.


In [4]:
dist.mean(), dist.std()


Out[4]:
(178.0, 7.7000000000000002)

It can also evaluate its CDF. How many people are more than one standard deviation below the mean? About 16%


In [5]:
dist.cdf(mu-sigma)


Out[5]:
0.15865525393145741

How many people are between 5'10" and 6'1"?


In [6]:
low = dist.cdf(177.8)    # 5'10"
high = dist.cdf(185.4)   # 6'1"
low, high, high-low


Out[6]:
(0.48963902786483265, 0.83173371081078573, 0.34209468294595308)

Exercise 5.2

scipy.stats.pareto represents a pareto distribution. In Pareto world, the distribution of human heights has parameters alpha=1.7 and xmin=1 meter. So the shortest person is 100 cm and the median is 150.


In [7]:
alpha = 1.7
xmin = 1
dist = scipy.stats.pareto(b=alpha, scale=xmin)
dist.median()


Out[7]:
1.5034066538560549

What is the mean height in Pareto world?


In [8]:
dist.mean()


Out[8]:
2.4285714285714288

What fraction of people are shorter than the mean?


In [9]:
dist.cdf(dist.mean())


Out[9]:
0.77873969756528805

Out of 7 billion people, how many do we expect to be taller than 1 km? You could use dist.cdf or dist.sf.


In [10]:
(1 - dist.cdf(1000)) * 7e9
dist.sf(1000) * 7e9


Out[10]:
55602.976430479954

How tall do we expect the tallest person to be?


In [1]:
dist.sf(600000) * 7e9            # find the height that yields about 1 person


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-9d9fc41a89dc> in <module>()
----> 1 dist.sf(618349.61067595053) * 7e9            # find the height that yields about 1 person

NameError: name 'dist' is not defined

Exercise 5.3

Generate a sample from a Weibull distribution and plot it using a transform that makes a Weibull distribution look like a straight line.


In [12]:
import random
import thinkstats2
import thinkplot

thinkplot.Cdf provides a transform that makes the CDF of a Weibull distribution look like a straight line.


In [13]:
sample = [random.weibullvariate(2, 1) for _ in range(1000)]
cdf = thinkstats2.Cdf(sample)
thinkplot.Cdf(cdf, transform='weibull')
thinkplot.Show(legend=False)


<matplotlib.figure.Figure at 0x7f4332611310>

Make a random selection from cdf.


In [14]:
cdf.Random()


Out[14]:
0.18977095273221636

Draw a random sample from cdf.


In [15]:
cdf.Sample(10)


Out[15]:
array([ 2.25095035,  1.83146812,  1.2560635 ,  2.38449412,  0.29440278,
        2.19038979,  0.29697198,  0.6200072 ,  4.8524759 ,  1.51877504])

Draw a random sample from cdf, then compute the percentile rank for each value, and plot the distribution of the percentile ranks.


In [16]:
prs = [cdf.PercentileRank(x) for x in cdf.Sample(1000)]
pr_cdf = thinkstats2.Cdf(prs)
thinkplot.Cdf(pr_cdf)


Out[16]:
{'xscale': 'linear', 'yscale': 'linear'}

Generate 1000 random values using random.random() and plot their PMF.


In [17]:
values = [random.random() for _ in range(1000)]
pmf = thinkstats2.Pmf(values)
thinkplot.Pmf(pmf, linewidth=0.1)


Assuming that the PMF doesn't work very well, try plotting the CDF instead.


In [20]:
cdf = thinkstats2.Cdf(values)
thinkplot.Cdf(cdf)
thinkplot.Show()


<matplotlib.figure.Figure at 0x7f4346330f10>

Exercise 5.4


In [23]:
import analytic

df = analytic.ReadBabyBoom()
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

n = len(diffs)
lam = 44.0 / 24 / 60
sample = [random.expovariate(lam) for _ in range(n)]
model = thinkstats2.Cdf(sample, label='model')
    
thinkplot.PrePlot(2)
thinkplot.Cdfs([cdf, model], complement=True)
thinkplot.Show(title='Time between births',
                xlabel='minutes',
                ylabel='CCDF',
                yscale='log')

lam, np.mean(sample)


Out[23]:
(0.030555555555555555, 39.382613993507178)
<matplotlib.figure.Figure at 0x7f43323b6c90>

Exercise 5.5

Here is the code from mystery.py that generates the random data files:


In [8]:
from mystery import *

funcs = [uniform_sample, triangular_sample, expo_sample,
             gauss_sample, lognorm_sample, pareto_sample,
             weibull_sample, gumbel_sample]

for i in range(len(funcs)):
    sample = funcs[i](1000)
    filename = 'mystery%d.dat' % i
    print(filename, funcs[i].__name__)


mystery0.dat uniform_sample
mystery1.dat triangular_sample
mystery2.dat expo_sample
mystery3.dat gauss_sample
mystery4.dat lognorm_sample
mystery5.dat pareto_sample
mystery6.dat weibull_sample
mystery7.dat gumbel_sample

In [ ]: