Think Bayes

Second Edition

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)



In [1]:

    
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist



In [2]:

    
# Get utils.py and create directories

import os

if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py
        
if not os.path.exists('figs'):
    !mkdir figs
        
if not os.path.exists('tables'):
    !mkdir tables



In [3]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from empiricaldist import Pmf
from utils import decorate, savefig

The train problem

I found the train problem in Frederick Mosteller's, Fifty Challenging Problems in Probability with Solutions:

"A railroad numbers its locomotives in order $1..N$. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has."

Based on this observation, we know the railroad has 60 or more locomotives. But how many more? To apply Bayesian reasoning, we can break this problem into two steps:

What did we know about $N$ before we saw the data?
For any given value of $N$, what is the likelihood of seeing the data (a locomotive with number 60)?

The answer to the first question is the prior. The answer to the second is the likelihood.

We don't have much basis to choose a prior, so we'll start with something simple and then consider alternatives.
Let's assume that $N$ is equally likely to be any value from 1 to 1000.



In [4]:

    
hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)

Now let's figure out the likelihood of the data. In a hypothetical fleet of $N$ locomotives, what is the probability that we would see number 60? If we assume that we are equally likely to see any locomotive, the chance of seeing any particular one is $1/N$.

Here's the function that does the update:



In [5]:

    
def update_train(pmf, data):
    """Update a pmf based on new data.
    
    pmf: Pmf of possible numbers of trains
    data: integer train #
    """
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()

This function might look familiar; it is the same as the update function for the dice problem in the previous notebook.

Here's the update:



In [6]:

    
data = 60
posterior = prior.copy()
update_train(posterior, data)

Here's what the posterior looks like



In [7]:

    
posterior.plot(label='Posterior after train 60')
decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Train Problem')
savefig('fig04-01')

Not surprisingly, all values of $N$ below 60 have been eliminated.

The most likely value, if you had to guess, is 60.
That might not seem like a very good guess; after all, what are the chances that you just happened to see the train with the highest number? Nevertheless, if you want to maximize the chance of getting the answer exactly right, you should guess 60.

But maybe that's not the right goal.
An alternative is to compute the mean of the posterior distribution. Given a set of possible quantities, $q_i$, and their probabilities, $p_i$, the mean of the distribution is:

$\mathrm{mean} = \sum_i p_i q_i$

Which we can compute like this:



In [8]:

    
np.sum(posterior.ps * posterior.qs)









    Out[8]:





333.41989326370776

Or we can use the method provided by Pmf:



In [9]:

    
posterior.mean()









    Out[9]:





333.41989326370776

The mean of the posterior is 333, so that might be a good guess if you want to minimize error.
If you played this guessing game over and over, using the mean of the posterior as your estimate would minimize the mean squared error over the long run.

Sensitivity to the prior

The prior I chose in the previous section is uniform from 1 to 1000, but I offered no justification for choosing a uniform distribution or that particular upper bound.

We might wonder whether the posterior distribution is sensitive to the prior. With so little data---only one observation---it is.

Here's what happens as we vary the upper bound:



In [10]:

    
data = 60

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    pmf = Pmf(1, hypos)
    update_train(pmf, data)
    print(high, pmf.mean())









    



500 207.0792279834105
1000 333.41989326370776
2000 552.1790171646376

So that's bad.
When the posterior is sensitive to the prior, there are two ways to proceed:

Get more data.
Get more background information and choose a better prior.

With more data, posterior distributions based on different priors tend to converge.
For example, suppose that in addition to train 60 we also see trains 30 and 90.
We can update the distribution like this:



In [11]:

    
dataset = [30, 60, 90]

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    pmf = Pmf(1, hypos)
    for data in dataset:
        update_train(pmf, data)
    print(high, pmf.mean())









    



500 151.84958795903833
1000 164.30558642273346
2000 171.3381810915096

The differences are smaller, but apparently three trains is not enough for the posteriors to converge.

Power law prior

If more data are not available, another option is to improve the priors by gathering more background information.
It is probably not reasonable to assume that a train-operating company with 1000 locomotives is just as likely as a company with only 1.

With some effort, we could probably find a list of companies that operate locomotives in the area of observation. Or we could interview an expert in rail shipping to gather information about the typical size of companies.

But even without getting into the specifics of railroad economics, we can make some educated guesses.
In most fields, there are many small companies, fewer medium-sized companies, and only one or two very large companies.
In fact, the distribution of company sizes tends to follow a power law, as Robert Axtell reports in {\it Science} (official site, available here) ).

This law suggests that if there are 1000 companies with fewer than 10 locomotives, there might be 100 companies with 100 locomotives, 10 companies with 1000, and possibly one company with 10,000 locomotives.

Mathematically, a power law means that the number of companies with a given size is inversely proportional to size, or

$ \mathrm{PMF}(N) \sim \left( \frac{1}{N} \right)^{\alpha} $

where $\mathrm{PMF}(N)$ is the probability mass function of $N$ and $\alpha$ is a parameter that is often near 1.

We can construct a power law prior like this:



In [12]:

    
alpha = 1.0
ps = hypos**(-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()









    Out[12]:





8.178368103610282

And here's the uniform prior again.



In [13]:

    
hypos = np.arange(1, 1001)
uniform = Pmf(1, hypos, name='uniform')
uniform.normalize()









    Out[13]:





1000

Again, the upper bound is arbitrary, but with a power law prior, the posterior is less sensitive to this choice.

Here's what a power law prior looks like, compared to a uniform prior



In [14]:

    
uniform.plot()
power.plot()

decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Train Problem')

Now let's see what the posteriors look like after observing one train.



In [15]:

    
dataset = [60]

update_train(uniform, dataset)
update_train(power, dataset)



In [16]:

    
uniform.plot()
power.plot()

decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Train Problem')
savefig('fig04-02')

The power law gives less prior probability to high values, which yields lower posterior means, and less sensitivity to the upper bound.

If we observe trains 30, 60, and 90, the means of the posteriors are:



In [17]:

    
alpha = 1.0
dataset = [30, 60, 90]

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    ps = hypos**(-alpha)
    power = Pmf(ps, hypos)
    for data in dataset:
        update_train(power, data)
    print(high, power.mean())









    



500 130.70846986255992
1000 133.2752313750311
2000 133.99746308073125

Now the differences are much smaller. In fact, with an arbitrarily large upper bound, the mean converges on 134.

So the power law prior is more realistic, because it is based on general information about the size of companies, and it behaves better in practice.

Credible intervals

So far we have seen two ways to summarize a posterior distribution: the value with the highest posterior probability (the MAP) and the posterior mean. These are both point estimates, that is, single values that estimate the quantity we are interested in.

Another way to summarize posterior distribution is with percentiles. If you have taken a standardized test, you might be familiar with percentiles. For example, if your score is the 90th percentile, that means you did as well as or better than 90\% of the people who took the test.

If we are given a value, x, we can compute its percentile rank by finding all values less than or equal to x and adding up their probabilities.

Pmf provides a method that does this computation. So, for example, we can compute the probability that the company has less than or equal to 100 trains:



In [18]:

    
power.lt_dist(100)









    Out[18]:





0.2722361322050704

With a power law prior and a dataset of three trains, the result is about 27%. So 100 trains is the 27th percentile.

Going the other way, suppose we want to compute a particular percentile; for example, the median of a distriution is the 50th percentile. We can compute it by adding up probabilities until the total exceeds 0.5. Here's a function that does it:



In [19]:

    
def quantile(pmf, prob):
    """Compute a quantile.
    
    pmf: Pmf representing a normalized distribution
    prob: float probability
    
    returns: quantity from the distribution
    """
    total = 0
    for q, p in pmf.items():
        total += p
        if total >= prob:
            return q
    return np.nan

The loop uses items, which iterates the quantities and probabilities in the distribution. Inside the loop we add up the probabilities of the quantities in order. When the total equals or exceeds prob, we return the corresponding quantity.

This function is called quantile because it computes a quantile rather than a percentile. The difference is the way we specify prob. If prob is a percentage between 0 and 100, we call the corresponding quantity a percentile. If prob is a probability between 0 and 1, we call the corresponding quantity a quantile.

Here's how we can use this function to compute the median of the posterior distribution:



In [20]:

    
quantile(power, 0.5)









    Out[20]:





113

The result, 113 trains, is the median of the posterior distribution.

Pmf provides a method called quantile that does the same thing. We can call it like this to compute the 5th and 9th percentiles:



In [21]:

    
power.quantile([0.05, 0.95])









    Out[21]:





array([ 91., 243.])

The result is the interval from 91 to 242 trains, which implies:

The probability is 5% that the number of trains is less than or equal to 91.
The probability is 5% that the number of trains is greater than 242.

Therefore the probability is 90% that the number of trains falls between 91 and 242 (excluding 91 and including 242). For this reason, this interval is called a 90% credible interval.

Pmf also provides credible_interval, which computes an interval that contains the given probability.



In [22]:

    
power.credible_interval(0.9)









    Out[22]:





array([ 91., 243.])

The German tank problem

During World War II, the Economic Warfare Division of the American Embassy in London used statistical analysis to estimate German production of tanks and other equipment.

(see Ruggles and Brodie, "An Empirical Approach to Economic Intelligence in World War II", {\em Journal of the American Statistical Association}, Vol. 42, No. 237 (March 1947).)

The Western Allies had captured log books, inventories, and repair records that included chassis and engine serial numbers for individual tanks.

Analysis of these records indicated that serial numbers were allocated by manufacturer and tank type in blocks of 100 numbers, that numbers in each block were used sequentially, and that not all numbers in each block were used. So the problem of estimating German tank production could be reduced, within each block of 100 numbers, to a form of the locomotive problem.

Based on this insight, American and British analysts produced estimates substantially lower than estimates from other forms of intelligence. And after the war, records indicated that they were substantially more accurate.

They performed similar analyses for tires, trucks, rockets, and other equipment, yielding accurate and actionable economic intelligence.

The German tank problem is historically interesting; it is also a nice example of real-world application of statistical estimation. So far many of the examples in this book have been toy problems, but it will not be long before we start solving real problems. I think it is an advantage of Bayesian analysis, especially with the computational approach we are taking, that it provides such a short path from a basic introduction to the research frontier.

Informative priors

Among Bayesians, there are two approaches to choosing prior distributions. Some recommend choosing the prior that best represents background information about the problem; in that case the prior is said to be informative. The problem with using an informative prior is that people might use different background information (or interpret it differently). So informative priors often seem subjective.

The alternative is a so-called uninformative prior, which is intended to be as unrestricted as possible, in order to let the data speak for themselves. In some cases you can identify a unique prior that has some desirable property, like representing minimal prior information about the estimated quantity.

Uninformative priors are appealing because they seem more objective. But I am generally in favor of using informative priors. Why? First, Bayesian analysis is always based on modeling decisions. Choosing the prior is one of those decisions, but it is not the only one, and it might not even be the most subjective. So even if an uninformative prior is more objective, the entire analysis is still subjective.

Also, for most practical problems, you are likely to be in one of two regimes: either you have a lot of data or not very much. If you have a lot of data, the choice of the prior doesn't matter very much; informative and uninformative priors yield almost the same results.

But if, as in the locomotive problem, you don't have much data, using relevant background information (like the power law distribution) makes a big difference.

And if, as in the German tank problem, you have to make life-and-death decisions based on your results, you should probably use all of the information at your disposal, rather than maintaining the illusion of objectivity by pretending to know less than you do.

Exercises

Exercise: Suppose you are giving a talk in a large lecture hall and you want to estimate the number of people in the audience. There are too many to count, so you ask how many people were born on May 11 and two people raise their hands. You ask how many were born on May 23 and 1 person raises their hand. Finally, you ask how many were born on August 1, and no one raises their hand.

How many people are in the audience? What is the 90% credible interval for your estimate?

Hint: Remember the binomial distribution.



In [42]:

    
# Solution

hypos = np.arange(1, 2000, 10)
prior = Pmf(1, hypos)
prior.normalize()









    Out[42]:





200



In [43]:

    
# Solution

from scipy.stats import binom

likelihood1 = binom.pmf(2, hypos, 1/365)
likelihood2 = binom.pmf(1, hypos, 1/365)
likelihood3 = binom.pmf(0, hypos, 1/365)



In [44]:

    
# Solution

posterior = prior * likelihood1 * likelihood2 * likelihood3
posterior.normalize()









    Out[44]:





0.006758799800451805



In [45]:

    
# Solution

posterior.plot()
decorate(xlabel='Number of people in the audience',
         ylabel='PMF')



In [46]:

    
# Solution

posterior.max_prob()









    Out[46]:





361



In [47]:

    
# Solution

posterior.mean()









    Out[47]:





486.2255161687084



In [48]:

    
# Solution

posterior.credible_interval(0.9)









    Out[48]:





array([171., 941.])

Exercise: I often see rabbits in the garden behind my house, but it's not easy to tell them apart, so I don't really know how many there are.

Suppose I deploy a motion-sensing camera trap that takes a picture of the first rabbit it sees each day. After three days, I compare the pictures and conclude that two of them are the same rabbit and the other is different.

How many rabbits visit my garden?

To answer this question, we have to think about the prior distribution and the likelihood of the data:

I have sometimes seen four rabbits at the same time, so I know there are at least that many. I would be surprised if there were more than 10. So, at least as a starting place, I think a uniform prior from 4 to 10 is reasonable.
To keep things simple, let's assume that all rabbits who visit my garden are equally likely to be caught by the camera trap in a given day. Let's also assume it is guaranteed that the camera trap gets a picture every day.



In [29]:

    
# Solution

hypos = np.arange(4, 11)
prior = Pmf(1, hypos)



In [30]:

    
# Solution

# The probability that the second rabbit is the same as the first is 1/N
# The probability that the third rabbit is different is (N-1)/N

N = hypos
likelihood = (N-1) / N**2



In [31]:

    
# Solution

posterior = prior * likelihood
posterior.normalize()

posterior.bar(alpha=0.7)
decorate(xlabel='Number of rabbits',
         ylabel='PMF',
         title='The Rabbit Problem')

Exercise: Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each. One day, you visit a prison and choose a prisoner at random. What is the probability that they are serving a 3-year sentence? What is the average remaining sentence of the prisoners you observe?



In [32]:

    
# Solution

hypos = np.arange(1, 4)
prior = Pmf(1/3, hypos)
prior



In [33]:

    
# Solution

# If you visit a prison at a random point in time,
# the probability of observing any given prisoner
# is proportional to the duration of their sentence.

likelihood = hypos
posterior = prior * likelihood
posterior.normalize()
posterior



In [34]:

    
# Solution

# The mean of the posterior is the average sentence.
# We can divide by 2 to get the average remaining sentence.

posterior.mean() / 2









    Out[34]:





1.1666666666666665

Exercise: If I chose a random adult in the U.S., what is the probability that they have a sibling? To be precise, what is the probability that their mother has had at least one other child.

This article from the Pew Research Center provides some relevant data. From it, I extracted the following distribution of family size for mothers in the U.S. who were 40-44 years old in 2014:



In [35]:

    
qs = [1, 2, 3, 4]
ps = [22, 41, 24, 14]
prior = Pmf(ps, qs)
prior.bar(alpha=0.7)

plt.xticks(qs, ['1 child', '2 children', '3 children', '4+ children'])
decorate(ylabel='PMF',
         title='Distribution of family size')

For simplicity, let's assume that all families in the 4+ category have exactly 4 children.



In [36]:

    
# Solution

# When you choose a person a random, you are more likely to get someone
# from a bigger family; in fact, the chance of choosing someone from
# any given family is proportional to the number of children

likelihood = qs
posterior = prior * likelihood
posterior.normalize()
posterior



In [37]:

    
# Solution

# The probability that they have a sibling is the probability
# that they do not come from a family of 1 

1 - posterior[1]









    Out[37]:





0.9051724137931034

Exercise: The Doomsday argument is "a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far."

Suppose there are only two kinds of intelligent civilizations that can happen in the universe. The "short-lived" kind go exinct after only 200 billion individuals are born. The "long-lived" kind survive until 2,000 billion individuals are born. And suppose that the two kinds of civilization are equally likely. Which kind of civilization do you think we live in?

The Doomsday argument says we can use the total number of humans born so far as data. According to the Population Reference Bureau, the total number of people who have ever lived is about 108 billion.

Since you were born quite recently, let's assume that you are, in fact, human being number 108 billion. If $N$ is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or $N$, or any number in between. So what is the probability that you would be number 108 billion?

Given this data and dubious prior, what is the probability that our civilization will be short-lived?



In [38]:

    
# Solution

hypos = [200, 2000]
prior = Pmf(1, hypos)



In [39]:

    
# Solution

likelihood = 1/prior.qs
posterior = prior * likelihood
posterior.normalize()
posterior



In [40]:

    
# According to this analysis, the probability is about 91% that our civilization will be short-lived.  
# But this conclusion is based on a dubious prior.

# And with so little data, the posterior depends strongly on the prior.  
# To see that, run this analysis again with a different prior, and see what the results look like.

# What do you think of the Doomsday argument?



In [ ]:

	probs
1	0.333333
2	0.333333
3	0.333333

	probs
1	0.166667
2	0.333333
3	0.500000

	probs
1	0.094828
2	0.353448
3	0.310345
4	0.241379

	probs
200	0.909091
2000	0.090909