Modeling distributions

License: Creative Commons Attribution 4.0 International



In [1]:

    
from __future__ import print_function, division

import analytic
import brfss
import nsfg

import thinkstats2
import thinkplot

import pandas as pd
import numpy as np
import math

%matplotlib inline

This notebook is about ways to model data using analytic distributions. I start with the exponential distribution, which is often a good model of time between random arrivals.

On December 18, 1997, 44 babies were born in a hospital in Brisbane, Australia. The time of birth for all 44 babies was reported in the local paper; the complete dataset is in a file called babyboom.dat.

If the times of birth are random, we expect the times between births to fit an exponential model.

Here's what the exponential CDF looks like for a range of parameters.



In [2]:

    
thinkplot.PrePlot(3)
for lam in [2.0, 1, 0.5]:
    xs, ps = thinkstats2.RenderExpoCdf(lam, 0, 3.0, 50)
    label = r'$\lambda=%g$' % lam
    thinkplot.Plot(xs, ps, label=label)
    
thinkplot.Config(title='Exponential CDF',
                 xlabel='x',
                 ylabel='CDF',
                 loc='lower right')

Here's the babyboom data:



In [3]:

    
df = analytic.ReadBabyBoom()
df

And here's the CDF of interarrival times.



In [4]:

    
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='minutes',
                 ylabel='CDF',
                 legend=False)

Visually it looks like an exponential CDF, but there are other analytic distributions that also look like this. A stronger test is to plot the complementary CDF, that is $1-CDF(x)$ on a log-y scale.

If the data are from an exponential distribution, the result should approximate a straight line.



In [5]:

    
thinkplot.PrePlot(1)
thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='minutes',
                 ylabel='CCDF',
                 yscale='log',
                 legend=False)

It is not exactly straight, which indicates that the exponential distribution is not a perfect model for this data. Most likely the underlying assumption—that a birth is equally likely at any time of day—is not exactly true. Nevertheless, it might be reasonable to model this dataset with an exponential distribution.

As George Box said, "All models are wrong, but some are useful".

The normal distribution

Many quantities in the natural world are well modeled by a normal distribution, also known as a Gaussian.

Here is what the CDF of a normal distriubution looks like for a few different parameter values:



In [6]:

    
thinkplot.PrePlot(3)

mus = [1.0, 2.0, 3.0]
sigmas = [0.5, 0.4, 0.3]
for mu, sigma in zip(mus, sigmas):
    xs, ps = thinkstats2.RenderNormalCdf(mu=mu, sigma=sigma, 
                                               low=-1.0, high=4.0)
    label = r'$\mu=%g$, $\sigma=%g$' % (mu, sigma)
    thinkplot.Plot(xs, ps, label=label)

thinkplot.Config(title='Normal CDF',
                 xlabel='x',
                 ylabel='CDF',
                 loc=2)

We might expect the distribution of birth weights to be approximately normal. I'll load data from the NSFG again:



In [7]:

    
preg = nsfg.ReadFemPreg()
weights = preg.totalwgt_lb.dropna()

We can estimate the parameters of the normal distribution, mu and sigma, then plot the data on top of the analytic model:



In [8]:

    
mu, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
print('Mean, Var', mu, var)
    
    # plot the model
sigma = math.sqrt(var)
print('Sigma', sigma)
xs, ps = thinkstats2.RenderNormalCdf(mu, sigma, low=0, high=12.5)

thinkplot.Plot(xs, ps, label='model', color='orange')

    # plot the data
cdf = thinkstats2.Cdf(weights, label='data')

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf) 
thinkplot.Config(title='Birth weights',
                   xlabel='birth weight (lbs)',
                   ylabel='CDF',
                   legend=True)









    



Mean, Var 7.28088310002 1.54521257035
Sigma 1.24306579486

The data fit the model well, but there are some deviations in the lower tail.

To get a better view of the tails, we can use a normal probability plot, which plots the actual data versus sorted values from random normal values.



In [9]:

    
def NormalProbability(ys, jitter=0.0):
    """Generates data for a normal probability plot.

    ys: sequence of values
    jitter: float magnitude of jitter added to the ys 

    returns: numpy arrays xs, ys
    """
    n = len(ys)
    xs = np.random.normal(0, 1, n)
    xs.sort()
    
    if jitter:
        ys = Jitter(ys, jitter)
    else:
        ys = np.array(ys)
    ys.sort()

    return xs, ys

If the data are normal, the result should be a straight line.



In [10]:

    
mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
std = math.sqrt(var)

xlim = [-4.5, 4.5]
fxs, fys = thinkstats2.FitLine(xlim, mean, std)
thinkplot.Plot(fxs, fys, linewidth=4, color='0.8')

thinkplot.PrePlot(2) 
xs, ys = NormalProbability(weights)
thinkplot.Plot(xs, ys, label='all live')

thinkplot.Config(title='Normal probability plot',
                 xlabel='Standard deviations from mean',
                 ylabel='Birth weight (lbs)',
                 legend=True, loc='lower right',
                 xlim=xlim)

The normal probability plot shows that the lightest babies are lighter than expected, starting about two standard deviations below the mean. Also, the heaviest babies are heavier than the model predicts.

The NSFG data includes premature babies; if we select full-term babies, we might expect the normal model to be a better fit.



In [11]:

    
full_term = preg[preg.prglngth >= 37]
term_weights = full_term.totalwgt_lb.dropna()



In [12]:

    
thinkplot.Plot(fxs, fys, linewidth=4, color='0.8')

thinkplot.PrePlot(2) 
xs, ys = NormalProbability(weights)
thinkplot.Plot(xs, ys, label='all live')

xs, ys = NormalProbability(term_weights)
thinkplot.Plot(xs, ys, label='full term')

thinkplot.Config(title='Normal probability plot',
                 xlabel='Standard deviations from mean',
                 ylabel='Birth weight (lbs)',
                 legend=True, loc='lower right',
                 xlim=xlim)

As expected, the normal model is a better fit for full-term babies at the low end of the distribution.

But it turns out that the normal model does not do very well for adult weight. I'll load data from the BRFSS.



In [13]:

    
df = brfss.ReadBrfss()
weights = df.wtkg2.dropna()
log_weights = np.log10(weights)

This function generates normal probability plots:



In [14]:

    
def MakeNormalPlot(weights):
    """Generates a normal probability plot of birth weights.

    weights: sequence
    """
    mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
    std = math.sqrt(var)

    xs = [-5, 5]
    xs, ys = thinkstats2.FitLine(xs, mean, std)
    thinkplot.Plot(xs, ys, color='0.8', label='model')

    xs, ys = thinkstats2.NormalProbability(weights)
    thinkplot.Plot(xs, ys, label='weights')

The normal distribution is a poor model for the distribution of adult weights.



In [15]:

    
MakeNormalPlot(weights)
thinkplot.Config(xlabel='z', ylabel='weights (kg)')

But if we compute the log of adult weights, the normal distribution is much better.



In [16]:

    
MakeNormalPlot(log_weights)
thinkplot.Config(xlabel='z', ylabel='weights (log10 kg)')

Within 3 standard deviations of the mean, the normal model does quite well, although the heaviest and lightest people diverge from the model.

If $\log x$ has a normal distribution, $x$ has a lognormal distribution.

	time	sex	weight_g	minutes
0	5	1	3837	5
1	104	1	3334	64
2	118	2	3554	78
3	155	2	3838	115
4	257	2	3625	177
5	405	1	2208	245
6	407	1	1745	247
7	422	2	2846	262
8	431	2	3166	271
9	708	2	3520	428
10	735	2	3380	455
11	812	2	3294	492
12	814	1	2576	494
13	909	1	3208	549
14	1035	2	3521	635
15	1049	1	3746	649
16	1053	1	3523	653
17	1133	2	2902	693
18	1209	2	2635	729
19	1256	2	3920	776
20	1305	2	3690	785
21	1406	1	3430	846
22	1407	1	3480	847
23	1433	1	3116	873
24	1446	1	3428	886
25	1514	2	3783	914
26	1631	2	3345	991
27	1657	2	3034	1017
28	1742	1	2184	1062
29	1807	2	3300	1087
30	1825	1	2383	1105
31	1854	2	3428	1134
32	1909	2	4162	1149
33	1947	2	3630	1187
34	1949	2	3406	1189
35	1951	2	3402	1191
36	2010	1	3500	1210
37	2037	2	3736	1237
38	2051	2	3370	1251
39	2104	2	2121	1264
40	2123	2	3150	1283
41	2217	1	3866	1337
42	2327	1	3542	1407
43	2355	1	3278	1435