Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey



In [4]:

    
from __future__ import print_function, division

Exercise 5.1

scipy.stats contains objects that represent analytic distributions



In [1]:

    
import scipy.stats

%matplotlib inline

For example scipy.stats.norm represents a normal distribution.



In [3]:

    
mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)
type(dist)









    Out[3]:





scipy.stats._distn_infrastructure.rv_frozen

A "frozen random variable" can compute its mean and standard deviation.



In [4]:

    
dist.mean(), dist.std()









    Out[4]:





(178.0, 7.7000000000000002)

It can also evaluate its CDF. How many people are more than one standard deviation below the mean? About 16%



In [5]:

    
dist.cdf(mu-sigma)









    Out[5]:





0.15865525393145741

How many people are between 5'10" and 6'1"?



In [6]:

    
low = dist.cdf(177.8)    # 5'10"
high = dist.cdf(185.4)   # 6'1"
low, high, high-low









    Out[6]:





(0.48963902786483265, 0.83173371081078573, 0.34209468294595308)

Exercise 5.2

scipy.stats.pareto represents a pareto distribution. In Pareto world, the distribution of human heights has parameters alpha=1.7 and xmin=1 meter. So the shortest person is 100 cm and the median is 150.



In [7]:

    
alpha = 1.7
xmin = 1
dist = scipy.stats.pareto(b=alpha, scale=xmin)
dist.median()









    Out[7]:





1.5034066538560549

What is the mean height in Pareto world?



In [8]:

    
dist.mean()









    Out[8]:





2.4285714285714288

What fraction of people are shorter than the mean?



In [9]:

    
dist.cdf(dist.mean())









    Out[9]:





0.77873969756528805

Out of 7 billion people, how many do we expect to be taller than 1 km? You could use dist.cdf or dist.sf.



In [10]:

    
(1 - dist.cdf(1000)) * 7e9
dist.sf(1000) * 7e9









    Out[10]:





55602.976430479954

How tall do we expect the tallest person to be?



In [1]:

    
dist.sf(600000) * 7e9            # find the height that yields about 1 person









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-9d9fc41a89dc> in <module>()
----> 1 dist.sf(618349.61067595053) * 7e9            # find the height that yields about 1 person

NameError: name 'dist' is not defined

Exercise 5.3

Generate a sample from a Weibull distribution and plot it using a transform that makes a Weibull distribution look like a straight line.



In [12]:

    
import random
import thinkstats2
import thinkplot

thinkplot.Cdf provides a transform that makes the CDF of a Weibull distribution look like a straight line.



In [13]:

    
sample = [random.weibullvariate(2, 1) for _ in range(1000)]
cdf = thinkstats2.Cdf(sample)
thinkplot.Cdf(cdf, transform='weibull')
thinkplot.Show(legend=False)









    












    





<matplotlib.figure.Figure at 0x7f4332611310>

Make a random selection from cdf.



In [14]:

    
cdf.Random()









    Out[14]:





0.18977095273221636

Draw a random sample from cdf.



In [15]:

    
cdf.Sample(10)









    Out[15]:





array([ 2.25095035,  1.83146812,  1.2560635 ,  2.38449412,  0.29440278,
        2.19038979,  0.29697198,  0.6200072 ,  4.8524759 ,  1.51877504])

Draw a random sample from cdf, then compute the percentile rank for each value, and plot the distribution of the percentile ranks.



In [16]:

    
prs = [cdf.PercentileRank(x) for x in cdf.Sample(1000)]
pr_cdf = thinkstats2.Cdf(prs)
thinkplot.Cdf(pr_cdf)









    Out[16]:





{'xscale': 'linear', 'yscale': 'linear'}

Generate 1000 random values using random.random() and plot their PMF.



In [17]:

    
values = [random.random() for _ in range(1000)]
pmf = thinkstats2.Pmf(values)
thinkplot.Pmf(pmf, linewidth=0.1)

Assuming that the PMF doesn't work very well, try plotting the CDF instead.



In [20]:

    
cdf = thinkstats2.Cdf(values)
thinkplot.Cdf(cdf)
thinkplot.Show()









    












    





<matplotlib.figure.Figure at 0x7f4346330f10>

Exercise 5.4



In [23]:

    
import analytic

df = analytic.ReadBabyBoom()
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

n = len(diffs)
lam = 44.0 / 24 / 60
sample = [random.expovariate(lam) for _ in range(n)]
model = thinkstats2.Cdf(sample, label='model')
    
thinkplot.PrePlot(2)
thinkplot.Cdfs([cdf, model], complement=True)
thinkplot.Show(title='Time between births',
                xlabel='minutes',
                ylabel='CCDF',
                yscale='log')

lam, np.mean(sample)









    












    Out[23]:





(0.030555555555555555, 39.382613993507178)






    





<matplotlib.figure.Figure at 0x7f43323b6c90>

Exercise 5.5

Here is the code from mystery.py that generates the random data files:



In [8]:

    
from mystery import *

funcs = [uniform_sample, triangular_sample, expo_sample,
             gauss_sample, lognorm_sample, pareto_sample,
             weibull_sample, gumbel_sample]

for i in range(len(funcs)):
    sample = funcs[i](1000)
    filename = 'mystery%d.dat' % i
    print(filename, funcs[i].__name__)









    



mystery0.dat uniform_sample
mystery1.dat triangular_sample
mystery2.dat expo_sample
mystery3.dat gauss_sample
mystery4.dat lognorm_sample
mystery5.dat pareto_sample
mystery6.dat weibull_sample
mystery7.dat gumbel_sample



In [ ]: