Modeling distributions

Copyright 2015 Allen Downey

License: Creative Commons Attribution 4.0 International


In [1]:
from __future__ import print_function, division

import analytic
import brfss
import nsfg

import thinkstats2
import thinkplot

import pandas as pd
import numpy as np
import math

%matplotlib inline

This notebook is about ways to model data using analytic distributions. I start with the exponential distribution, which is often a good model of time between random arrivals.

On December 18, 1997, 44 babies were born in a hospital in Brisbane, Australia. The time of birth for all 44 babies was reported in the local paper; the complete dataset is in a file called babyboom.dat.

If the times of birth are random, we expect the times between births to fit an exponential model.

Here's what the exponential CDF looks like for a range of parameters.


In [2]:
thinkplot.PrePlot(3)
for lam in [2.0, 1, 0.5]:
    xs, ps = thinkstats2.RenderExpoCdf(lam, 0, 3.0, 50)
    label = r'$\lambda=%g$' % lam
    thinkplot.Plot(xs, ps, label=label)
    
thinkplot.Config(title='Exponential CDF',
                 xlabel='x',
                 ylabel='CDF',
                 loc='lower right')


Here's the babyboom data:


In [3]:
df = analytic.ReadBabyBoom()
df


Out[3]:
time sex weight_g minutes
0 5 1 3837 5
1 104 1 3334 64
2 118 2 3554 78
3 155 2 3838 115
4 257 2 3625 177
5 405 1 2208 245
6 407 1 1745 247
7 422 2 2846 262
8 431 2 3166 271
9 708 2 3520 428
10 735 2 3380 455
11 812 2 3294 492
12 814 1 2576 494
13 909 1 3208 549
14 1035 2 3521 635
15 1049 1 3746 649
16 1053 1 3523 653
17 1133 2 2902 693
18 1209 2 2635 729
19 1256 2 3920 776
20 1305 2 3690 785
21 1406 1 3430 846
22 1407 1 3480 847
23 1433 1 3116 873
24 1446 1 3428 886
25 1514 2 3783 914
26 1631 2 3345 991
27 1657 2 3034 1017
28 1742 1 2184 1062
29 1807 2 3300 1087
30 1825 1 2383 1105
31 1854 2 3428 1134
32 1909 2 4162 1149
33 1947 2 3630 1187
34 1949 2 3406 1189
35 1951 2 3402 1191
36 2010 1 3500 1210
37 2037 2 3736 1237
38 2051 2 3370 1251
39 2104 2 2121 1264
40 2123 2 3150 1283
41 2217 1 3866 1337
42 2327 1 3542 1407
43 2355 1 3278 1435

And here's the CDF of interarrival times.


In [4]:
diffs = df.minutes.diff()
cdf = thinkstats2.Cdf(diffs, label='actual')

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='minutes',
                 ylabel='CDF',
                 legend=False)


Visually it looks like an exponential CDF, but there are other analytic distributions that also look like this. A stronger test is to plot the complementary CDF, that is $1-CDF(x)$ on a log-y scale.

If the data are from an exponential distribution, the result should approximate a straight line.


In [5]:
thinkplot.PrePlot(1)
thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='minutes',
                 ylabel='CCDF',
                 yscale='log',
                 legend=False)


It is not exactly straight, which indicates that the exponential distribution is not a perfect model for this data. Most likely the underlying assumption—that a birth is equally likely at any time of day—is not exactly true. Nevertheless, it might be reasonable to model this dataset with an exponential distribution.

As George Box said, "All models are wrong, but some are useful".

The normal distribution

Many quantities in the natural world are well modeled by a normal distribution, also known as a Gaussian.

Here is what the CDF of a normal distriubution looks like for a few different parameter values:


In [6]:
thinkplot.PrePlot(3)

mus = [1.0, 2.0, 3.0]
sigmas = [0.5, 0.4, 0.3]
for mu, sigma in zip(mus, sigmas):
    xs, ps = thinkstats2.RenderNormalCdf(mu=mu, sigma=sigma, 
                                               low=-1.0, high=4.0)
    label = r'$\mu=%g$, $\sigma=%g$' % (mu, sigma)
    thinkplot.Plot(xs, ps, label=label)

thinkplot.Config(title='Normal CDF',
                 xlabel='x',
                 ylabel='CDF',
                 loc=2)


We might expect the distribution of birth weights to be approximately normal. I'll load data from the NSFG again:


In [7]:
preg = nsfg.ReadFemPreg()
weights = preg.totalwgt_lb.dropna()

We can estimate the parameters of the normal distribution, mu and sigma, then plot the data on top of the analytic model:


In [8]:
mu, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
print('Mean, Var', mu, var)
    
    # plot the model
sigma = math.sqrt(var)
print('Sigma', sigma)
xs, ps = thinkstats2.RenderNormalCdf(mu, sigma, low=0, high=12.5)

thinkplot.Plot(xs, ps, label='model', color='orange')

    # plot the data
cdf = thinkstats2.Cdf(weights, label='data')

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf) 
thinkplot.Config(title='Birth weights',
                   xlabel='birth weight (lbs)',
                   ylabel='CDF',
                   legend=True)


Mean, Var 7.28088310002 1.54521257035
Sigma 1.24306579486

The data fit the model well, but there are some deviations in the lower tail.

To get a better view of the tails, we can use a normal probability plot, which plots the actual data versus sorted values from random normal values.


In [9]:
def NormalProbability(ys, jitter=0.0):
    """Generates data for a normal probability plot.

    ys: sequence of values
    jitter: float magnitude of jitter added to the ys 

    returns: numpy arrays xs, ys
    """
    n = len(ys)
    xs = np.random.normal(0, 1, n)
    xs.sort()
    
    if jitter:
        ys = Jitter(ys, jitter)
    else:
        ys = np.array(ys)
    ys.sort()

    return xs, ys

If the data are normal, the result should be a straight line.


In [10]:
mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
std = math.sqrt(var)

xlim = [-4.5, 4.5]
fxs, fys = thinkstats2.FitLine(xlim, mean, std)
thinkplot.Plot(fxs, fys, linewidth=4, color='0.8')

thinkplot.PrePlot(2) 
xs, ys = NormalProbability(weights)
thinkplot.Plot(xs, ys, label='all live')

thinkplot.Config(title='Normal probability plot',
                 xlabel='Standard deviations from mean',
                 ylabel='Birth weight (lbs)',
                 legend=True, loc='lower right',
                 xlim=xlim)


The normal probability plot shows that the lightest babies are lighter than expected, starting about two standard deviations below the mean. Also, the heaviest babies are heavier than the model predicts.

The NSFG data includes premature babies; if we select full-term babies, we might expect the normal model to be a better fit.


In [11]:
full_term = preg[preg.prglngth >= 37]
term_weights = full_term.totalwgt_lb.dropna()

In [12]:
thinkplot.Plot(fxs, fys, linewidth=4, color='0.8')

thinkplot.PrePlot(2) 
xs, ys = NormalProbability(weights)
thinkplot.Plot(xs, ys, label='all live')

xs, ys = NormalProbability(term_weights)
thinkplot.Plot(xs, ys, label='full term')

thinkplot.Config(title='Normal probability plot',
                 xlabel='Standard deviations from mean',
                 ylabel='Birth weight (lbs)',
                 legend=True, loc='lower right',
                 xlim=xlim)


As expected, the normal model is a better fit for full-term babies at the low end of the distribution.

But it turns out that the normal model does not do very well for adult weight. I'll load data from the BRFSS.


In [13]:
df = brfss.ReadBrfss()
weights = df.wtkg2.dropna()
log_weights = np.log10(weights)

This function generates normal probability plots:


In [14]:
def MakeNormalPlot(weights):
    """Generates a normal probability plot of birth weights.

    weights: sequence
    """
    mean, var = thinkstats2.TrimmedMeanVar(weights, p=0.01)
    std = math.sqrt(var)

    xs = [-5, 5]
    xs, ys = thinkstats2.FitLine(xs, mean, std)
    thinkplot.Plot(xs, ys, color='0.8', label='model')

    xs, ys = thinkstats2.NormalProbability(weights)
    thinkplot.Plot(xs, ys, label='weights')

The normal distribution is a poor model for the distribution of adult weights.


In [15]:
MakeNormalPlot(weights)
thinkplot.Config(xlabel='z', ylabel='weights (kg)')


But if we compute the log of adult weights, the normal distribution is much better.


In [16]:
MakeNormalPlot(log_weights)
thinkplot.Config(xlabel='z', ylabel='weights (log10 kg)')


Within 3 standard deviations of the mean, the normal model does quite well, although the heaviest and lightest people diverge from the model.

If $\log x$ has a normal distribution, $x$ has a lognormal distribution.