Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey
In [4]:
from __future__ import print_function, division
In [1]:
import scipy.stats
%matplotlib inline
For example, scipy.stats.norm represents a normal distribution.
In [3]:
mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)
type(dist)
Out[3]:
A "frozen random variable" can compute its mean and standard deviation.
In [4]:
dist.mean(), dist.std()
Out[4]:
It can also evaluate its CDF. What fraction of people are more than one standard deviation below the mean? About 16%.
In [5]:
dist.cdf(mu-sigma)
Out[5]:
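As a cross-check on the 16% figure, here is a minimal sketch of the normal CDF written with the standard library's error function. The helper name normal_cdf is an assumption made for illustration; it is not part of scipy.stats.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, expressed with the error function."""
    z = (x - mu) / (sigma * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))

mu = 178      # mean height in cm, as in the cells above
sigma = 7.7   # standard deviation in cm

# Fraction of the population more than one standard deviation below the mean.
frac_below = normal_cdf(mu - sigma, mu, sigma)
print(round(frac_below, 4))  # prints 0.1587
```

The result does not depend on mu or sigma: one standard deviation below the mean always corresponds to z = -1.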
What fraction of people are between 5'10" and 6'1"?
In [6]:
low = dist.cdf(177.8) # 5'10"
high = dist.cdf(185.4) # 6'1"
low, high, high-low
Out[6]:
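The centimeter values in the cell above come from converting feet and inches. The following sketch makes that conversion explicit and repeats the calculation with statistics.NormalDist from the standard library, used here as a stand-in for scipy.stats.norm; the helper feet_inches_to_cm is a hypothetical name.

```python
from statistics import NormalDist

CM_PER_INCH = 2.54

def feet_inches_to_cm(feet, inches):
    """Convert a height in feet and inches to centimeters."""
    return (feet * 12 + inches) * CM_PER_INCH

# Same parameters as the normal distribution above (heights in cm).
height_dist = NormalDist(mu=178, sigma=7.7)

low_cm = feet_inches_to_cm(5, 10)
high_cm = feet_inches_to_cm(6, 1)

# The fraction between two heights is the difference of the CDF values.
fraction = height_dist.cdf(high_cm) - height_dist.cdf(low_cm)
print(round(low_cm, 1), round(high_cm, 1), round(fraction, 2))
```

With these parameters, roughly a third of the population falls in this range.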
In [7]:
alpha = 1.7
xmin = 1
dist = scipy.stats.pareto(b=alpha, scale=xmin)
dist.median()
Out[7]:
What is the mean height in Pareto world?
In [8]:
dist.mean()
Out[8]:
What fraction of people are shorter than the mean?
In [9]:
dist.cdf(dist.mean())
Out[9]:
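The median, mean, and CDF of a Pareto distribution have simple closed forms, so the three results above can be checked by hand. This is a sketch using the standard textbook formulas; the function name pareto_cdf is an assumption, not a scipy call.

```python
alpha = 1.7   # shape parameter, as in the cells above
xmin = 1.0    # minimum height in meters

def pareto_cdf(x, alpha, xmin):
    """Pareto CDF: 1 - (x / xmin) ** -alpha for x >= xmin, else 0."""
    if x < xmin:
        return 0.0
    return 1 - (x / xmin) ** -alpha

# Median: the x where the CDF equals 0.5.
median = xmin * 2 ** (1 / alpha)

# Mean: finite only when alpha > 1.
mean = alpha * xmin / (alpha - 1)

# Fraction of people shorter than the mean.
frac_shorter = pareto_cdf(mean, alpha, xmin)
print(median, mean, frac_shorter)
```

Because the distribution is strongly right-skewed, the mean sits well above the median, and most people (about 78% with these parameters) are shorter than the mean.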
Out of 7 billion people, how many do we expect to be taller than 1 km? You could use dist.cdf or dist.sf.
In [10]:
(1 - dist.cdf(1000)) * 7e9
dist.sf(1000) * 7e9
Out[10]:
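The two expressions above are mathematically equal, but the survival function is usually the safer choice far out in the tail, where 1 - CDF can round to zero in floating point. A minimal sketch with a hand-written survival function (pareto_sf is a hypothetical helper, not the scipy method):

```python
alpha, xmin = 1.7, 1.0   # same parameters as above, heights in meters
population = 7e9

def pareto_sf(x, alpha, xmin):
    """Pareto survival function: P(X > x) = (x / xmin) ** -alpha for x >= xmin."""
    return (x / xmin) ** -alpha if x >= xmin else 1.0

# Expected number of people taller than 1 km (1000 m).
expected_taller = pareto_sf(1000, alpha, xmin) * population
print(round(expected_taller))

# Far in the tail, computing 1 - CDF loses the answer entirely,
# while the survival function still returns a small positive number.
far = 1e12
cdf_far = 1 - pareto_sf(far, alpha, xmin)   # rounds to exactly 1.0
print(1 - cdf_far, pareto_sf(far, alpha, xmin))
```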
How tall do we expect the tallest person to be?
In [1]:
dist.sf(600000) * 7e9 # find the height that yields about 1 person
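Instead of searching for the height by trial and error, the survival function can be inverted directly: setting population * (x / xmin) ** -alpha equal to 1 and solving for x gives x = xmin * population ** (1 / alpha). The sketch below assumes the same parameters as above.

```python
alpha, xmin = 1.7, 1.0   # same parameters as above, heights in meters
population = 7e9

# Height at which we expect exactly one person to be taller.
tallest = xmin * population ** (1 / alpha)

# Plug the result back into the survival function as a sanity check.
expected_count = (tallest / xmin) ** -alpha * population

print(tallest / 1000)    # height in km
print(expected_count)    # should be very close to 1
```

This lands a little above the 600 km guess used in the cell above.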
In [12]:
import random
import thinkstats2
import thinkplot
thinkplot.Cdf provides a transform that makes the CDF of a Weibull distribution look like a straight line.
In [13]:
sample = [random.weibullvariate(2, 1) for _ in range(1000)]
cdf = thinkstats2.Cdf(sample)
thinkplot.Cdf(cdf, transform='weibull')
thinkplot.Show(legend=False)
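To see why the transformed plot is a straight line, note that a Weibull CDF satisfies log(-log(1 - CDF(x))) = k * log(x) - k * log(scale), where k is the shape parameter. The sketch below assumes the 'weibull' transform plots the data in these coordinates (the standard Weibull plot), and estimates the slope with a least-squares fit using only the standard library. The sample size of 5000 and the seed are arbitrary choices.

```python
import math
import random

random.seed(17)  # fixed seed so the sketch is repeatable

scale, shape = 2, 1   # same arguments as random.weibullvariate(2, 1) above
n = 5000
xs = sorted(random.weibullvariate(scale, shape) for _ in range(n))

# Empirical CDF at the i-th sorted value; i / (n + 1) keeps it strictly below 1.
points = []
for i, x in enumerate(xs, start=1):
    p = i / (n + 1)
    points.append((math.log(x), math.log(-math.log(1 - p))))

# Ordinary least-squares slope of the transformed points.
mean_u = sum(u for u, _ in points) / n
mean_v = sum(v for _, v in points) / n
slope = (sum((u - mean_u) * (v - mean_v) for u, v in points)
         / sum((u - mean_u) ** 2 for u, _ in points))

print(slope)  # should be close to the shape parameter
```

Note that random.weibullvariate takes the scale first and the shape second, so the slope here should be near 1.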
Make a random selection from cdf.
In [14]:
cdf.Random()
Out[14]:
Draw a random sample from cdf.
In [15]:
cdf.Sample(10)
Out[15]:
Draw a random sample from cdf, then compute the percentile rank for each value, and plot the distribution of the percentile ranks.
In [16]:
prs = [cdf.PercentileRank(x) for x in cdf.Sample(1000)]
pr_cdf = thinkstats2.Cdf(prs)
thinkplot.Cdf(pr_cdf)
Out[16]:
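The percentile ranks of values drawn from a distribution are approximately uniform between 0 and 100, which is why the plotted CDF is close to a straight line. Here is a minimal sketch of that check using bisect in place of thinkstats2.Cdf; the percentile_rank helper is an assumption and may differ in detail from Cdf.PercentileRank.

```python
import bisect
import random

random.seed(17)  # fixed seed so the sketch is repeatable

data = sorted(random.weibullvariate(2, 1) for _ in range(1000))

def percentile_rank(sorted_values, x):
    """Percentage of values less than or equal to x."""
    return 100 * bisect.bisect_right(sorted_values, x) / len(sorted_values)

# Resample from the data, then compute each value's percentile rank.
resampled = random.choices(data, k=1000)
ranks = [percentile_rank(data, x) for x in resampled]

# If the ranks are uniform, about a quarter fall in each quartile.
quartile_counts = [0, 0, 0, 0]
for r in ranks:
    quartile_counts[min(int(r // 25), 3)] += 1

print(quartile_counts)
```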
Generate 1000 random values using random.random() and plot their PMF.
In [17]:
values = [random.random() for _ in range(1000)]
pmf = thinkstats2.Pmf(values)
thinkplot.Pmf(pmf, linewidth=0.1)
The PMF doesn't work very well here because nearly every value is unique, so each one has the same small probability. Try plotting the CDF instead.
In [20]:
cdf = thinkstats2.Cdf(values)
thinkplot.Cdf(cdf)
thinkplot.Show()
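A quick way to see why the PMF is uninformative for continuous data is to count how often each value occurs. The following sketch uses collections.Counter as a stand-in for thinkstats2.Pmf.

```python
import random
from collections import Counter

random.seed(17)  # fixed seed so the sketch is repeatable

values = [random.random() for _ in range(1000)]

# Count occurrences of each distinct value, as a PMF would.
counts = Counter(values)
distinct = len(counts)
largest_count = max(counts.values())
print(distinct, largest_count)

# The CDF is still informative: for uniform values on [0, 1),
# the fraction at or below x should be close to x.
frac_below_half = sum(v <= 0.5 for v in values) / len(values)
print(frac_below_half)
```

Every value appears once, so the PMF is flat at 1/1000, while the CDF shows the roughly linear shape expected of a uniform distribution.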
In [23]:
import numpy as np
import analytic
df = analytic.ReadBabyBoom()
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')
n = len(diffs)
lam = 44.0 / 24 / 60  # 44 births in 24 hours, expressed as births per minute
sample = [random.expovariate(lam) for _ in range(n)]
model = thinkstats2.Cdf(sample, label='model')
thinkplot.PrePlot(2)
thinkplot.Cdfs([cdf, model], complement=True)
thinkplot.Show(title='Time between births',
               xlabel='minutes',
               ylabel='CCDF',
               yscale='log')
lam, np.mean(sample)
Out[23]:
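Two properties of the exponential model are worth checking against the simulated sample: the mean interval should be close to 1 / lam, and the CCDF at time t should be close to exp(-lam * t), which is why it appears as a straight line on a log-y scale. The sketch below uses only the standard library and a much larger sample than the notebook's (n = 10000 is an arbitrary choice to reduce noise).

```python
import math
import random

random.seed(17)  # fixed seed so the sketch is repeatable

lam = 44.0 / 24 / 60   # births per minute, as in the cell above
n = 10000              # larger than the real dataset, to reduce noise
sample = [random.expovariate(lam) for _ in range(n)]

# The mean time between births should be close to 1 / lam minutes.
sample_mean = sum(sample) / n
expected_mean = 1 / lam
print(expected_mean, sample_mean)

# The CCDF at t is the fraction of intervals longer than t;
# the exponential model predicts exp(-lam * t).
t = 60
empirical_ccdf = sum(x > t for x in sample) / n
model_ccdf = math.exp(-lam * t)
print(empirical_ccdf, model_ccdf)
```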
In [8]:
from mystery import *
funcs = [uniform_sample, triangular_sample, expo_sample,
         gauss_sample, lognorm_sample, pareto_sample,
         weibull_sample, gumbel_sample]
for i in range(len(funcs)):
    sample = funcs[i](1000)
    filename = 'mystery%d.dat' % i
    print(filename, funcs[i].__name__)
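The loop above builds a filename for each sample but does not write anything. If the samples are meant to be saved, one possibility is sketched below. Everything here is an assumption: the one-value-per-line layout, the helpers write_sample and read_sample, and the stand-in sampler expo_sample_sketch, which is not the mystery module's function.

```python
import os
import random
import tempfile

def expo_sample_sketch(n):
    """Stand-in sampler for illustration; not the mystery module's function."""
    return [random.expovariate(1.0) for _ in range(n)]

def write_sample(sample, filename):
    """Write one value per line (a hypothetical file layout)."""
    with open(filename, 'w') as fp:
        for x in sample:
            fp.write('%r\n' % x)

def read_sample(filename):
    """Read the values back as floats."""
    with open(filename) as fp:
        return [float(line) for line in fp]

# Write to a temporary directory so the sketch leaves the working directory alone.
out_dir = tempfile.mkdtemp()
filename = os.path.join(out_dir, 'mystery%d.dat' % 0)

sample = expo_sample_sketch(1000)
write_sample(sample, filename)
round_trip = read_sample(filename)
print(filename, len(round_trip))
```

Writing with '%r' keeps the full precision of each float, so reading the file back reproduces the sample exactly.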