Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

```
In [1]:
```# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
!pip install empiricaldist

```
In [2]:
```# Get utils.py and create directories
import os
if not os.path.exists('utils.py'):
!wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py
if not os.path.exists('figs'):
!mkdir figs
if not os.path.exists('tables'):
!mkdir tables

```
In [3]:
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf
from utils import decorate, savefig

I found the train problem
in Frederick Mosteller's, *Fifty Challenging Problems in
Probability with Solutions*:

"A railroad numbers its locomotives in order $1..N$. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has."

Based on this observation, we know the railroad has 60 or more locomotives. But how many more? To apply Bayesian reasoning, we can break this problem into two steps:

What did we know about $N$ before we saw the data?

For any given value of $N$, what is the likelihood of seeing the data (a locomotive with number 60)?

The answer to the first question is the prior. The answer to the second is the likelihood.

We don't have much basis to choose a prior, so we'll start with
something simple and then consider alternatives.

Let's assume that $N$ is equally likely to be any value from 1 to 1000.

```
In [4]:
```hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)

Now let's figure out the likelihood of the data. In a hypothetical fleet of $N$ locomotives, what is the probability that we would see number 60? If we assume that we are equally likely to see any locomotive, the chance of seeing any particular one is $1/N$.

Here's the function that does the update:

```
In [5]:
```def update_train(pmf, data):
"""Update a pmf based on new data.
pmf: Pmf of possible numbers of trains
data: integer train #
"""
hypos = pmf.qs
likelihood = 1 / hypos
impossible = (data > hypos)
likelihood[impossible] = 0
pmf *= likelihood
pmf.normalize()

This function might look familiar; it is the same as the update function for the dice problem in the previous notebook.

Here's the update:

```
In [6]:
```data = 60
posterior = prior.copy()
update_train(posterior, data)

Here's what the posterior looks like

```
In [7]:
```posterior.plot(label='Posterior after train 60')
decorate(xlabel='Number of trains',
ylabel='PMF',
title='Train Problem')
savefig('fig04-01')

```
```

Not surprisingly, all values of $N$ below 60 have been eliminated.

The most likely value, if you had to guess, is 60.

That might not seem like a very good guess; after all, what are the chances that you just happened to see the train with the highest number?
Nevertheless, if you want to maximize the chance of getting
the answer exactly right, you should guess 60.

But maybe that's not the right goal.

An alternative is to compute the mean of the posterior distribution.
Given a set of possible quantities, $q_i$, and their probabilities, $p_i$, the mean of the distribution is:

$\mathrm{mean} = \sum_i p_i q_i$

Which we can compute like this:

```
In [8]:
```np.sum(posterior.ps * posterior.qs)

```
Out[8]:
```

Or we can use the method provided by `Pmf`

:

```
In [9]:
```posterior.mean()

```
Out[9]:
```

If you played this guessing game over and over, using the mean of the posterior as your estimate would minimize the mean squared error over the long run.

The prior I chose in the previous section is uniform from 1 to 1000, but I offered no justification for choosing a uniform distribution or that particular upper bound.

We might wonder whether the posterior distribution is sensitive to the prior. With so little data---only one observation---it is.

Here's what happens as we vary the upper bound:

```
In [10]:
```data = 60
for high in [500, 1000, 2000]:
hypos = np.arange(1, high+1)
pmf = Pmf(1, hypos)
update_train(pmf, data)
print(high, pmf.mean())

```
```

So that's bad.

When the posterior is sensitive to the prior, there are two ways to proceed:

Get more data.

Get more background information and choose a better prior.

With more data, posterior distributions based on different
priors tend to converge.

For example, suppose that in addition
to train 60 we also see trains 30 and 90.

We can update the distribution like this:

```
In [11]:
```dataset = [30, 60, 90]
for high in [500, 1000, 2000]:
hypos = np.arange(1, high+1)
pmf = Pmf(1, hypos)
for data in dataset:
update_train(pmf, data)
print(high, pmf.mean())

```
```

If more data are not available, another option is to improve the
priors by gathering more background information.

It is probably not reasonable to assume that a train-operating company with 1000 locomotives is just as likely as a company with only 1.

With some effort, we could probably find a list of companies that operate locomotives in the area of observation. Or we could interview an expert in rail shipping to gather information about the typical size of companies.

But even without getting into the specifics of railroad economics, we
can make some educated guesses.

In most fields, there are many small
companies, fewer medium-sized companies, and only one or two very
large companies.

In fact, the distribution of company sizes tends to
follow a power law, as Robert Axtell reports in {\it Science} (official site, available here)
).

This law suggests that if there are 1000 companies with fewer than 10 locomotives, there might be 100 companies with 100 locomotives, 10 companies with 1000, and possibly one company with 10,000 locomotives.

Mathematically, a power law means that the number of companies with a given size is inversely proportional to size, or

$ \mathrm{PMF}(N) \sim \left( \frac{1}{N} \right)^{\alpha} $

where $\mathrm{PMF}(N)$ is the probability mass function of $N$ and $\alpha$ is a parameter that is often near 1.

We can construct a power law prior like this:

```
In [12]:
```alpha = 1.0
ps = hypos**(-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()

```
Out[12]:
```

And here's the uniform prior again.

```
In [13]:
```hypos = np.arange(1, 1001)
uniform = Pmf(1, hypos, name='uniform')
uniform.normalize()

```
Out[13]:
```

Again, the upper bound is arbitrary, but with a power law prior, the posterior is less sensitive to this choice.

Here's what a power law prior looks like, compared to a uniform prior

```
In [14]:
```uniform.plot()
power.plot()
decorate(xlabel='Number of trains',
ylabel='PMF',
title='Train Problem')

```
```

Now let's see what the posteriors look like after observing one train.

```
In [15]:
```dataset = [60]
update_train(uniform, dataset)
update_train(power, dataset)

```
In [16]:
```uniform.plot()
power.plot()
decorate(xlabel='Number of trains',
ylabel='PMF',
title='Train Problem')
savefig('fig04-02')

```
```

The power law gives less prior probability to high values, which yields lower posterior means, and less sensitivity to the upper bound.

If we observe trains 30, 60, and 90, the means of the posteriors are:

```
In [17]:
```alpha = 1.0
dataset = [30, 60, 90]
for high in [500, 1000, 2000]:
hypos = np.arange(1, high+1)
ps = hypos**(-alpha)
power = Pmf(ps, hypos)
for data in dataset:
update_train(power, data)
print(high, power.mean())

```
```

Now the differences are much smaller. In fact, with an arbitrarily large upper bound, the mean converges on 134.

So the power law prior is more realistic, because it is based on general information about the size of companies, and it behaves better in practice.

So far we have seen two ways to summarize a posterior distribution: the value with the highest posterior probability (the MAP) and the posterior mean.
These are both **point estimates**, that is, single values that estimate the quantity we are interested in.

Another way to summarize posterior distribution is with percentiles. If you have taken a standardized test, you might be familiar with percentiles. For example, if your score is the 90th percentile, that means you did as well as or better than 90\% of the people who took the test.

If we are given a value, `x`

, we can compute its **percentile rank** by finding all values less than or equal to `x`

and adding up their probabilities.

`Pmf`

provides a method that does this computation.
So, for example, we can compute the probability that the company has less than or equal to 100 trains:

```
In [18]:
```power.lt_dist(100)

```
Out[18]:
```

With a power law prior and a dataset of three trains, the result is about 27%. So 100 trains is the 27th percentile.

Going the other way, suppose we want to compute a particular percentile; for example, the median of a distriution is the 50th percentile. We can compute it by adding up probabilities until the total exceeds 0.5. Here's a function that does it:

```
In [19]:
```def quantile(pmf, prob):
"""Compute a quantile.
pmf: Pmf representing a normalized distribution
prob: float probability
returns: quantity from the distribution
"""
total = 0
for q, p in pmf.items():
total += p
if total >= prob:
return q
return np.nan

The loop uses `items`

, which iterates the quantities and probabilities in the distribution.
Inside the loop we add up the probabilities of the quantities in order.
When the total equals or exceeds `prob`

, we return the corresponding quantity.

This function is called `quantile`

because it computes a quantile rather than a percentile.
The difference is the way we specify `prob`

.
If `prob`

is a percentage between 0 and 100, we call the corresponding quantity a percentile.
If `prob`

is a probability between 0 and 1, we call the corresponding quantity a **quantile**.

Here's how we can use this function to compute the median of the posterior distribution:

```
In [20]:
```quantile(power, 0.5)

```
Out[20]:
```

The result, 113 trains, is the median of the posterior distribution.

`Pmf`

provides a method called `quantile`

that does the same thing.
We can call it like this to compute the 5th and 9th percentiles:

```
In [21]:
```power.quantile([0.05, 0.95])

```
Out[21]:
```

The result is the interval from 91 to 242 trains, which implies:

The probability is 5% that the number of trains is less than or equal to 91.

The probability is 5% that the number of trains is greater than 242.

Therefore the probability is 90% that the number of trains falls between 91 and 242 (excluding 91 and including 242).
For this reason, this interval is called a 90% **credible interval**.

`Pmf`

also provides `credible_interval`

, which computes an interval that contains the given probability.

```
In [22]:
```power.credible_interval(0.9)

```
Out[22]:
```

During World War II, the Economic Warfare Division of the American Embassy in London used statistical analysis to estimate German production of tanks and other equipment.

(see Ruggles and Brodie, "An Empirical Approach to Economic Intelligence in World War II", {\em Journal of the American Statistical Association}, Vol. 42, No. 237 (March 1947).)

The Western Allies had captured log books, inventories, and repair records that included chassis and engine serial numbers for individual tanks.

Analysis of these records indicated that serial numbers were allocated by manufacturer and tank type in blocks of 100 numbers, that numbers in each block were used sequentially, and that not all numbers in each block were used. So the problem of estimating German tank production could be reduced, within each block of 100 numbers, to a form of the locomotive problem.

Based on this insight, American and British analysts produced estimates substantially lower than estimates from other forms of intelligence. And after the war, records indicated that they were substantially more accurate.

They performed similar analyses for tires, trucks, rockets, and other equipment, yielding accurate and actionable economic intelligence.

The German tank problem is historically interesting; it is also a nice example of real-world application of statistical estimation. So far many of the examples in this book have been toy problems, but it will not be long before we start solving real problems. I think it is an advantage of Bayesian analysis, especially with the computational approach we are taking, that it provides such a short path from a basic introduction to the research frontier.

Among Bayesians, there are two approaches to choosing prior
distributions. Some recommend choosing the prior that best represents
background information about the problem; in that case the prior
is said to be **informative**. The problem with using an informative
prior is that people might use different background information (or
interpret it differently). So informative priors often seem subjective.

The alternative is a so-called **uninformative prior**, which is
intended to be as unrestricted as possible, in order to let the data
speak for themselves. In some cases you can identify a unique prior
that has some desirable property, like representing minimal prior
information about the estimated quantity.

Uninformative priors are appealing because they seem more objective. But I am generally in favor of using informative priors. Why? First, Bayesian analysis is always based on modeling decisions. Choosing the prior is one of those decisions, but it is not the only one, and it might not even be the most subjective. So even if an uninformative prior is more objective, the entire analysis is still subjective.

Also, for most practical problems, you are likely to be in one of two regimes: either you have a lot of data or not very much. If you have a lot of data, the choice of the prior doesn't matter very much; informative and uninformative priors yield almost the same results.

But if, as in the locomotive problem, you don't have much data, using relevant background information (like the power law distribution) makes a big difference.

And if, as in the German tank problem, you have to make life-and-death decisions based on your results, you should probably use all of the information at your disposal, rather than maintaining the illusion of objectivity by pretending to know less than you do.

**Exercise:** Suppose you are giving a talk in a large lecture hall and you want to estimate the number of people in the audience. There are too many to count, so you ask how many people were born on May 11 and two people raise their hands. You ask how many were born on May 23 and 1 person raises their hand. Finally, you ask how many were born on August 1, and no one raises their hand.

How many people are in the audience? What is the 90% credible interval for your estimate?

Hint: Remember the binomial distribution.

```
In [42]:
```# Solution
hypos = np.arange(1, 2000, 10)
prior = Pmf(1, hypos)
prior.normalize()

```
Out[42]:
```

```
In [43]:
```# Solution
from scipy.stats import binom
likelihood1 = binom.pmf(2, hypos, 1/365)
likelihood2 = binom.pmf(1, hypos, 1/365)
likelihood3 = binom.pmf(0, hypos, 1/365)

```
In [44]:
```# Solution
posterior = prior * likelihood1 * likelihood2 * likelihood3
posterior.normalize()

```
Out[44]:
```

```
In [45]:
```# Solution
posterior.plot()
decorate(xlabel='Number of people in the audience',
ylabel='PMF')

```
```

```
In [46]:
```# Solution
posterior.max_prob()

```
Out[46]:
```

```
In [47]:
```# Solution
posterior.mean()

```
Out[47]:
```

```
In [48]:
```# Solution
posterior.credible_interval(0.9)

```
Out[48]:
```

**Exercise:** I often see rabbits in the garden behind my house, but it's not easy to tell them apart, so I don't really know how many there are.

Suppose I deploy a motion-sensing camera trap that takes a picture of the first rabbit it sees each day. After three days, I compare the pictures and conclude that two of them are the same rabbit and the other is different.

How many rabbits visit my garden?

To answer this question, we have to think about the prior distribution and the likelihood of the data:

I have sometimes seen four rabbits at the same time, so I know there are at least that many. I would be surprised if there were more than 10. So, at least as a starting place, I think a uniform prior from 4 to 10 is reasonable.

To keep things simple, let's assume that all rabbits who visit my garden are equally likely to be caught by the camera trap in a given day. Let's also assume it is guaranteed that the camera trap gets a picture every day.

```
In [29]:
```# Solution
hypos = np.arange(4, 11)
prior = Pmf(1, hypos)

```
In [30]:
```# Solution
# The probability that the second rabbit is the same as the first is 1/N
# The probability that the third rabbit is different is (N-1)/N
N = hypos
likelihood = (N-1) / N**2

```
In [31]:
```# Solution
posterior = prior * likelihood
posterior.normalize()
posterior.bar(alpha=0.7)
decorate(xlabel='Number of rabbits',
ylabel='PMF',
title='The Rabbit Problem')

```
```

**Exercise:** Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each. One day, you visit a prison and choose a prisoner at random. What is the probability that they are serving a 3-year sentence? What is the average remaining sentence of the prisoners you observe?

```
In [32]:
```# Solution
hypos = np.arange(1, 4)
prior = Pmf(1/3, hypos)
prior

```
Out[32]:
```

```
In [33]:
```# Solution
# If you visit a prison at a random point in time,
# the probability of observing any given prisoner
# is proportional to the duration of their sentence.
likelihood = hypos
posterior = prior * likelihood
posterior.normalize()
posterior

```
Out[33]:
```

```
In [34]:
```# Solution
# The mean of the posterior is the average sentence.
# We can divide by 2 to get the average remaining sentence.
posterior.mean() / 2

```
Out[34]:
```

**Exercise:** If I chose a random adult in the U.S., what is the probability that they have a sibling? To be precise, what is the probability that their mother has had at least one other child.

This article from the Pew Research Center provides some relevant data. From it, I extracted the following distribution of family size for mothers in the U.S. who were 40-44 years old in 2014:

```
In [35]:
```qs = [1, 2, 3, 4]
ps = [22, 41, 24, 14]
prior = Pmf(ps, qs)
prior.bar(alpha=0.7)
plt.xticks(qs, ['1 child', '2 children', '3 children', '4+ children'])
decorate(ylabel='PMF',
title='Distribution of family size')

```
```

For simplicity, let's assume that all families in the 4+ category have exactly 4 children.

```
In [36]:
```# Solution
# When you choose a person a random, you are more likely to get someone
# from a bigger family; in fact, the chance of choosing someone from
# any given family is proportional to the number of children
likelihood = qs
posterior = prior * likelihood
posterior.normalize()
posterior

```
Out[36]:
```

```
In [37]:
```# Solution
# The probability that they have a sibling is the probability
# that they do not come from a family of 1
1 - posterior[1]

```
Out[37]:
```

**Exercise:** The Doomsday argument is "a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far."

Suppose there are only two kinds of intelligent civilizations that can happen in the universe. The "short-lived" kind go exinct after only 200 billion individuals are born. The "long-lived" kind survive until 2,000 billion individuals are born. And suppose that the two kinds of civilization are equally likely. Which kind of civilization do you think we live in?

The Doomsday argument says we can use the total number of humans born so far as data. According to the Population Reference Bureau, the total number of people who have ever lived is about 108 billion.

Since you were born quite recently, let's assume that you are, in fact, human being number 108 billion. If $N$ is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or $N$, or any number in between. So what is the probability that you would be number 108 billion?

Given this data and dubious prior, what is the probability that our civilization will be short-lived?

```
In [38]:
```# Solution
hypos = [200, 2000]
prior = Pmf(1, hypos)

```
In [39]:
```# Solution
likelihood = 1/prior.qs
posterior = prior * likelihood
posterior.normalize()
posterior

```
Out[39]:
```

```
In [40]:
```# According to this analysis, the probability is about 91% that our civilization will be short-lived.
# But this conclusion is based on a dubious prior.
# And with so little data, the posterior depends strongly on the prior.
# To see that, run this analysis again with a different prior, and see what the results look like.
# What do you think of the Doomsday argument?

```
In [ ]:
```