Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

In [1]:
```
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install empiricaldist
```

In [2]:
```
# Get utils.py and create directories
import os
if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py
if not os.path.exists('figs'):
    !mkdir figs
if not os.path.exists('tables'):
    !mkdir tables
```

In [3]:
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf
```

Here's a Pmf that represents the outcome of a coin toss.

In [4]:
```
coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
coin
```

This Pmf contains two values, each with probability 1/2.

Here's a Pmf that represents the outcome of a 6-sided die.

In [5]:
```
die = Pmf()
for x in [1, 2, 3, 4, 5, 6]:
    die[x] = 1
die
```

Initially, the probabilities don't add up to 1 (so they are not really probabilities).

In [6]:
```
die.sum()
```

`normalize` divides the probabilities by their total, so they add up to 1.

In [7]:
```
die.normalize()
```

The return value from `normalize` is the sum of the probabilities before normalizing.

Now we can see that the total is 1 (at least within floating-point error).

In [8]:
```
die
```

In [9]:
```
die.sum()
```

Another way to make a Pmf is to provide a sequence of values. In this example, every value appears once, so they all have the same probability.

In [10]:
```
die = Pmf.from_seq([1, 2, 3, 4, 5, 6])
die
```

More generally, values can appear more than once.

In [11]:
```
letters = Pmf.from_seq(list('Mississippi'))
letters
```

In [12]:
```
from utils import write_pmf
write_pmf(letters, 'table02-01')
```

The `Pmf` class inherits from a Pandas `Series`, so anything you can do with a `Series`, you can also do with a `Pmf`.

The bracket operator looks up a value and returns the corresponding probability.

In [13]:
```
letters['s']
```

If you look up a value that's not in the `Pmf`, you get an error.

In [14]:
```
try:
    letters['t']
except KeyError as e:
    print('KeyError')
```

You can also call a `Pmf` like a function. If the value is in the `Pmf`, the result is the same as the bracket operator.

In [15]:
```
letters('s')
```

If the value is not in the `Pmf`, the result is 0, not an error.

In [16]:
```
letters('t')
```

With parentheses, you can also provide a sequence of values and get a sequence of probabilities.

In [17]:
```
die([1, 4, 7])
```

Now let's use a `Pmf` to solve the cookie problem. Here's a `Pmf` that represents the two hypotheses and their prior probabilities:

In [18]:
```
prior = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
prior
```

We can compute the unnormalized posteriors by multiplying the prior by the likelihoods.

In [19]:
```
likelihood_vanilla = [0.75, 0.5]
posterior = prior * likelihood_vanilla
posterior
```

Then we can use `normalize` to compute the posteriors.

In [20]:
```
posterior.normalize()
posterior
```

In [21]:
```
posterior('Bowl 1')
```

One benefit of using `Pmf` objects is that it is easy to do successive updates with more data. For example, if we draw another cookie and get vanilla again, we can do a second update like this:

In [22]:
```
posterior *= likelihood_vanilla
posterior.normalize()
posterior
```

And if we draw a third cookie and it's chocolate, we can do another update like this:

In [23]:
```
likelihood_chocolate = [0.25, 0.5]
posterior *= likelihood_chocolate
posterior.normalize()
posterior
```

With two vanilla cookies and one chocolate, the posterior probabilities are close to 50/50.

Next let's solve a cookie problem with 101 bowls:

* Bowl 0 contains no vanilla cookies,
* Bowl 1 contains 1% vanilla cookies,
* Bowl 2 contains 2% vanilla cookies,

and so on, up to

* Bowl 99 contains 99% vanilla cookies, and
* Bowl 100 contains all vanilla cookies.

As in the previous version, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99% chocolate, and so on.

Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl $x$, for each value of $x$?

To solve this problem, I'll use `np.arange` to represent 101 hypotheses, numbered from 0 to 100.

In [24]:
```
hypos = np.arange(101)
```

And use it to make the prior distribution:

In [25]:
```
prior = Pmf(1, hypos)
prior.normalize()
```

The likelihood of drawing a vanilla cookie from each bowl is the fraction of vanilla cookies, which we can compute using `hypos`:

In [26]:
```
likelihood_vanilla = hypos/100
```

Now we can compute the posterior distribution in the usual way:

In [27]:
```
posterior1 = prior * likelihood_vanilla
posterior1.normalize()
```

And here's what it looks like:

In [28]:
```
from utils import decorate, savefig

def decorate_bowls(title):
    decorate(xlabel='Bowl #',
             ylabel='PMF',
             title=title)
```

In [29]:
```
prior.plot(label='prior', color='gray')
posterior1.plot(label='posterior')
decorate_bowls('Posterior after one vanilla cookie')
```

Now suppose we draw another cookie and it's also vanilla. Here's the update:

In [30]:
```
posterior2 = posterior1 * likelihood_vanilla
posterior2.normalize()
```

In [31]:
```
posterior2.plot(label='posterior')
decorate_bowls('Posterior after two vanilla cookies')
```

At this point the high-numbered bowls are the most likely because they contain the most vanilla cookies, and the low-numbered bowls have been all but eliminated.

But suppose we draw again and get a chocolate cookie. Here's the update:

In [32]:
```
likelihood_chocolate = 1 - hypos/100
posterior3 = posterior2 * likelihood_chocolate
posterior3.normalize()
```

In [33]:
```
posterior3.plot(label='posterior')
decorate_bowls('Posterior after 2 vanilla, 1 chocolate')
```

The peak of the posterior distribution is at Bowl 67, which corresponds to the fraction of vanilla cookies in the data we've observed, $2/3$.

The quantity with the highest posterior probability is called the **MAP**, which stands for "maximum a posteriori probability".

To compute the MAP, we can use the `Series` method `idxmax`:

In [34]:
```
posterior3.idxmax()
```

Or `Pmf` provides a more memorable name for the same thing:

In [35]:
```
posterior3.max_prob()
```

Here's a figure for the book.

In [36]:
```
plt.figure(figsize=(4, 6))

plt.subplot(311)
prior.plot(label='prior', color='gray')
posterior1.plot(label='1 vanilla', color='C0')
plt.ylabel('PMF')
plt.title('101 Bowls')
plt.legend()

plt.subplot(312)
posterior2.plot(label='2 vanilla', color='C1')
plt.ylabel('PMF')
plt.legend()

plt.subplot(313)
posterior3.plot(label='2 vanilla, 1 chocolate', color='C2')
decorate_bowls('')

savefig('fig02-01')
```

Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die. I choose one of the dice at random, roll it, and report that the outcome is a 1. What is the probability that I chose the 6-sided die?

I'll use integers to represent the hypotheses, and I can make the prior distribution like this:

In [37]:
```
hypos = [6, 8, 12]
prior = Pmf(1/3, hypos)
prior
```

Now we can compute the likelihood of the data and use it to compute the posterior distribution.

In [38]:
```
likelihood1 = 1/6, 1/8, 1/12
posterior = prior * likelihood1
posterior.normalize()
posterior
```

In [39]:
```
write_pmf(posterior, 'table02-02')
```

One note about the `Pmf` class: if you multiply a `Pmf` by a sequence, the result is a `Pmf`:

In [40]:
```
type(prior * likelihood1)
```

If you do it the other way around, the result is a `Series`:

In [41]:
```
type(likelihood1 * prior)
```

But you can use `Pmf` to convert it back to a `Pmf`:

In [42]:
```
Pmf(likelihood1 * prior)
```

Now suppose I roll the same die again and get a $7$. We can do a second update like this:

In [43]:
```
likelihood2 = 0, 1/8, 1/12
posterior *= likelihood2
posterior.normalize()
posterior
```

In [44]:
```
write_pmf(posterior, 'table02-03')
```

The following function encapsulates these updates. It takes a `Pmf` that represents the possible dice and an integer outcome:

In [45]:
```
def update_dice(pmf, data):
    """Update a pmf based on new data.

    pmf: Pmf of possible dice and their probabilities
    data: integer outcome
    """
    hypos = pmf.qs
    likelihood = 1 / hypos

    # an outcome greater than the number of sides is impossible
    impossible = (data > hypos)
    likelihood[impossible] = 0

    pmf *= likelihood
    pmf.normalize()
```

Here's how we can use this function to compute the updates in the previous section:

In [46]:
```
pmf = prior.copy()
pmf
```

In [47]:
```
update_dice(pmf, 1)
update_dice(pmf, 7)
pmf
```

**Exercise:** Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it four times, and get 1, 3, 5, and 7.
What is the probability that I chose the 8-sided die?

In [48]:
```
# Solution goes here
```

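One possible solution (a sketch, not necessarily the book's): reuse `update_dice` from above to apply the four outcomes in turn. A roll of 7 eliminates the 6-sided die.

```
# sketch: apply the four outcomes with update_dice (defined above)
pmf = Pmf(1/3, [6, 8, 12])
for data in [1, 3, 5, 7]:
    update_dice(pmf, data)

pmf[8]  # posterior probability of the 8-sided die, about 0.835
```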

**Exercise:** In the previous version of the dice problem, the prior probabilities are the same because the box contains one of each die.
But suppose the box contains 1 die that is 4-sided, 2 dice that are 6-sided, 3 dice that are 8-sided, 4 dice that are 12-sided, and 5 dice that are 20-sided.
I choose a die, roll it, and get a 7.

What is the probability that I chose an 8-sided die?

In [49]:
```
# Solution goes here
```

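A sketch of one approach: use the die counts as an unnormalized prior, then update with the outcome 7, which eliminates the 4-sided and 6-sided dice.

```
# sketch: counts as an unnormalized prior
hypos = [4, 6, 8, 12, 20]
counts = [1, 2, 3, 4, 5]
pmf = Pmf(counts, hypos)
pmf.normalize()

update_dice(pmf, 7)
pmf[8]  # posterior probability of an 8-sided die, about 0.391
```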

**Exercise:** Suppose I have two sock drawers.
One contains equal numbers of black and white socks.
The other contains equal numbers of red, green, and blue socks.
Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair.
What is the probability that the socks are white?

For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.

In [50]:
```
# Solution goes here
```


In [51]:
```
# Solution goes here
```

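A sketch of one approach: the hypotheses are the two drawers. The probability of a matching pair is 1/4 + 1/4 = 1/2 for the black/white drawer and 3 × 1/9 = 1/3 for the red/green/blue drawer; given a match from the black/white drawer, the pair is white with probability 1/2.

```
# sketch: update on the event "matching pair"
hypos = ['black/white', 'red/green/blue']
prior = Pmf(1/2, hypos)

likelihood = 1/2, 1/3    # probability of a matching pair from each drawer
posterior = prior * likelihood
posterior.normalize()

posterior['black/white'] / 2   # probability the pair is white, 0.3
```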

**Exercise:** Here's a problem from Bayesian Data Analysis:

Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?

Hint: In 1935, about 2/3 of twins were fraternal and 1/3 were identical.

In [52]:
```
# Solution goes here
```


In [53]:
```
# Solution goes here
```

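A sketch of one approach: with the hint as the prior, the likelihood of having a twin brother is 1 for identical twins (always the same sex) and 1/2 for fraternal twins (same sex half the time).

```
# sketch: prior from the hint, likelihood of a twin brother
hypos = ['identical', 'fraternal']
prior = Pmf([1/3, 2/3], hypos)

likelihood = 1, 1/2
posterior = prior * likelihood
posterior.normalize()

posterior['identical']  # posterior probability, 0.5
```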
