# Think Bayes

Second Edition

``````

In :

# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

``````
``````

In :

# Get utils.py and create directories

import os

if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py

if not os.path.exists('figs'):
    !mkdir figs

if not os.path.exists('tables'):
    !mkdir tables

``````
``````

In :

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from empiricaldist import Pmf

``````

## The Pmf class

I'll start by making a Pmf that represents the outcome of a coin toss.

``````

In :

coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
coin

``````

This Pmf contains two values, each with probability 1/2.

Here's a Pmf that represents the outcome of a 6-sided die.

``````

In :

die = Pmf()
for x in [1, 2, 3, 4, 5, 6]:
    die[x] = 1

die

``````

Initially, the probabilities don't add up to 1 (so they are not really probabilities).

``````

In :

die.sum()

``````

`normalize` divides the probabilities by their total, so they add up to 1.

``````

In :

die.normalize()

``````

The return value from `normalize` is the sum of the probabilities before normalizing.

Now we can see that the total is 1 (at least within floating-point error).

``````

In :

die

``````
``````

In :

die.sum()

``````

Another way to make a Pmf is to provide a sequence of values.

In this example, every value appears once, so they all have the same probability.

``````

In :

die = Pmf.from_seq([1,2,3,4,5,6])
die

``````

More generally, values can appear more than once.

``````

In :

letters = Pmf.from_seq(list('Mississippi'))
letters

``````
``````

In :

from utils import write_pmf

write_pmf(letters, 'table02-01')

``````

The `Pmf` class inherits from a Pandas `Series`, so anything you can do with a `Series`, you can also do with a `Pmf`.
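For example, `Series` methods like `sum`, `idxmax`, and `sort_values` all work on a `Pmf`. As a sketch, here's the same idea with a plain Pandas `Series` standing in for the die `Pmf` (so it runs even without `empiricaldist` installed):

```python
import pandas as pd

# A stand-in for the die Pmf: probabilities indexed by outcome
die = pd.Series(1/6, index=[1, 2, 3, 4, 5, 6])

print(die.sum())      # total probability, 1 within floating-point error
print(die.idxmax())   # outcome with the highest probability (ties -> first)
```

Anything that works on this `Series` works the same way on a `Pmf`.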

The bracket operator looks up a value and returns the corresponding probability.

``````

In :

letters['s']

``````

If you look up a value that's not in the `Pmf`, you get an error.

``````

In :

try:
    letters['t']
except KeyError as e:
    print('KeyError')

``````

You can also call `Pmf` like a function. If the value is in the `Pmf`, the result is the same as the bracket operator.

``````

In :

letters('s')

``````

But if you use parentheses and ask for the probability of something that's not in the Pmf, the result is 0.

``````

In :

letters('t')

``````

With parentheses, you can also provide a sequence of values, and you get a sequence of probabilities.

``````

In :

die([1,4,7])

``````

Here's a Pmf that represents the prior distribution for the cookie problem, with one hypothesis for each bowl.

``````

In :

prior = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
prior

``````

We can compute the unnormalized posteriors by multiplying the prior by the likelihoods.

``````

In :

likelihood_vanilla = [0.75, 0.5]
posterior = prior * likelihood_vanilla
posterior

``````

Then we can use `normalize` to compute the posteriors.

``````

In :

posterior.normalize()
posterior

``````
``````

In :

posterior('Bowl 1')

``````

One benefit of using `Pmf` objects is that it is easy to do successive updates with more data. For example, if we draw another cookie and get vanilla again, we can do a second update like this:

``````

In :

posterior *= likelihood_vanilla
posterior.normalize()
posterior

``````

And if we draw a third cookie and it's chocolate, we can do another update like this:

``````

In :

likelihood_chocolate = [0.25, 0.5]
posterior *= likelihood_chocolate
posterior.normalize()
posterior

``````

With two vanilla cookies and one chocolate, the posterior probabilities are close to 50/50.
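We can check that claim exactly: with a uniform prior, the unnormalized posterior for each bowl is the prior times the likelihood of drawing vanilla, vanilla, chocolate. Here's a sketch using the standard-library `fractions` module:

```python
from fractions import Fraction

# Unnormalized posteriors: prior (1/2 each) times the likelihood
# of drawing vanilla, vanilla, chocolate from each bowl
bowl1 = Fraction(1, 2) * Fraction(3, 4)**2 * Fraction(1, 4)
bowl2 = Fraction(1, 2) * Fraction(1, 2)**2 * Fraction(1, 2)

total = bowl1 + bowl2
print(bowl1 / total)   # 9/17, about 0.53
print(bowl2 / total)   # 8/17, about 0.47
```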

## 101 Bowls

Next let's solve a cookie problem with 101 bowls:

• Bowl 0 contains no vanilla cookies,

• Bowl 1 contains 1% vanilla cookies,

• Bowl 2 contains 2% vanilla cookies,

and so on, up to

• Bowl 99 contains 99% vanilla cookies, and

• Bowl 100 contains all vanilla cookies.

As in the previous version, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99% chocolate, and so on.

Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl $x$, for each value of $x$?

To solve this problem, I'll use `np.arange` to represent 101 hypotheses, numbered from 0 to 100.

``````

In :

hypos = np.arange(101)

``````

And use it to make the prior distribution:

``````

In :

prior = Pmf(1, hypos)
prior.normalize()

``````

The likelihood of the data is the fraction of vanilla cookies in each bowl, which we can calculate using `hypos`:

``````

In :

likelihood_vanilla = hypos/100

``````

Now we can compute the posterior distribution in the usual way:

``````

In :

posterior1 = prior * likelihood_vanilla
posterior1.normalize()

``````

And here's what it looks like:

``````

In :

from utils import decorate, savefig

def decorate_bowls(title):
    decorate(xlabel='Bowl #',
             ylabel='PMF',
             title=title)

``````
``````

In :

prior.plot(label='prior', color='gray')
posterior1.plot(label='posterior')
decorate_bowls('Posterior after one vanilla cookie')

``````

Now suppose we put the cookie back, draw again from the same bowl, and get another vanilla cookie. Here's the update after the second cookie:

``````

In :

posterior2 = posterior1 * likelihood_vanilla
posterior2.normalize()

``````
``````

In :

posterior2.plot(label='posterior')
decorate_bowls('Posterior after two vanilla cookies')

``````

At this point the high-numbered bowls are the most likely because they contain the most vanilla cookies, and the low-numbered bowls have been all but eliminated.

But suppose we draw again and get a chocolate cookie. Here's the update:

``````

In :

likelihood_chocolate = 1 - hypos/100

posterior3 = posterior2 * likelihood_chocolate
posterior3.normalize()

``````
``````

In :

posterior3.plot(label='posterior')
decorate_bowls('Posterior after 2 vanilla, 1 chocolate')

``````

The peak of the posterior distribution is at Bowl 67, which corresponds to the fraction of vanilla cookies in the data we've observed, $2/3$.
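That location is no accident: with a uniform prior, the posterior for bowl $x$ is proportional to $(x/100)^2 (1 - x/100)$, which is maximized near $x/100 = 2/3$. A quick check with NumPy:

```python
import numpy as np

hypos = np.arange(101)
x = hypos / 100

# posterior is proportional to the likelihood of 2 vanilla, 1 chocolate
unnorm = x**2 * (1 - x)
print(hypos[np.argmax(unnorm)])   # 67
```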

The quantity with the highest posterior probability is called the MAP, which stands for "maximum a posteriori probability".

To compute the MAP, we can use the `Series` method `idxmax`:

``````

In :

posterior3.idxmax()

``````

Or `Pmf` provides a more memorable name for the same thing:

``````

In :

posterior3.max_prob()

``````

Here's a figure for the book.

``````

In :

plt.figure(figsize=(4, 6))
plt.subplot(311)
prior.plot(label='prior', color='gray')
posterior1.plot(label='1 vanilla', color='C0')
plt.ylabel('PMF')
plt.title('101 Bowls')
plt.legend()

plt.subplot(312)
posterior2.plot(label='2 vanilla', color='C1')
plt.ylabel('PMF')
plt.legend()

plt.subplot(313)
posterior3.plot(label='2 vanilla, 1 chocolate', color='C2')
decorate_bowls('')

savefig('fig02-01')

``````

## The dice problem

Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die. I choose one of the dice at random, roll it, and report that the outcome is a 1. What is the probability that I chose the 6-sided die?

I'll use integers to represent the hypotheses, and I can make the prior distribution like this:

``````

In :

hypos = [6, 8, 12]
prior = Pmf(1/3, hypos)
prior

``````

Now we can compute the likelihood of the data and use it to compute the posterior distribution.

``````

In :

likelihood1 = 1/6, 1/8, 1/12
posterior = prior * likelihood1
posterior.normalize()
posterior

``````
``````

In :

write_pmf(posterior, 'table02-02')

``````

One note about the `Pmf` class: if you multiply a `Pmf` by a sequence, the result is a `Pmf`:

``````

In :

type(prior * likelihood1)

``````

If you do it the other way around, the result is a `Series`:

``````

In :

type(likelihood1 * prior)

``````

But you can use `Pmf` to convert it back to a `Pmf`:

``````

In :

Pmf(likelihood1 * prior)

``````

Now suppose I roll the same die again and get a 7. We can do a second update like this:

``````

In :

likelihood2 = 0, 1/8, 1/12
posterior *= likelihood2
posterior.normalize()
posterior

``````
``````

In :

write_pmf(posterior, 'table02-03')

``````

## Updating dice

The following function is a more general version of the update in the previous section:

``````

In :

def update_dice(pmf, data):
    """Update a pmf based on new data.

    pmf: Pmf of possible dice and their probabilities
    data: integer outcome
    """
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()

``````

Here's how we can use this function to compute the updates in the previous section:

``````

In :

pmf = prior.copy()
pmf

``````
``````

In :

update_dice(pmf, 1)
update_dice(pmf, 7)
pmf

``````

## Exercises

Exercise: Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die. I choose one of the dice at random, roll it four times, and get 1, 3, 5, and 7. What is the probability that I chose the 8-sided die?

``````

In :

# Solution goes here

``````
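One possible approach, sketched here with plain NumPy rather than `Pmf` (not necessarily the intended solution): a roll of 7 rules out the 6-sided die, and an n-sided die produces any feasible outcome with likelihood 1/n.

```python
import numpy as np

hypos = np.array([6, 8, 12])
prior = np.ones(3) / 3          # one die of each kind

posterior = prior.copy()
for roll in [1, 3, 5, 7]:
    # likelihood of the roll: 1/n if the die can produce it, else 0
    likelihood = np.where(roll <= hypos, 1 / hypos, 0)
    posterior = posterior * likelihood
posterior /= posterior.sum()

print(posterior[1])   # probability of the 8-sided die, about 0.835
```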

Exercise: In the previous version of the dice problem, the prior probabilities are the same because the box contains one of each die. But suppose the box contains 1 die that is 4-sided, 2 dice that are 6-sided, 3 dice that are 8-sided, 4 dice that are 12-sided, and 5 dice that are 20-sided. I choose a die, roll it, and get a 7.
What is the probability that I chose an 8-sided die?

``````

In :

# Solution goes here

``````
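Again as a sketch of one possible approach: the prior is proportional to the number of dice of each kind, and a 7 is impossible on the 4- and 6-sided dice.

```python
import numpy as np

hypos = np.array([4, 6, 8, 12, 20])
counts = np.array([1, 2, 3, 4, 5])   # number of dice of each kind
prior = counts / counts.sum()

# a 7 is impossible on the 4- and 6-sided dice
likelihood = np.where(7 <= hypos, 1 / hypos, 0)
posterior = prior * likelihood
posterior /= posterior.sum()

print(posterior[2])   # probability of an 8-sided die, 9/23 or about 0.391
```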

Exercise: Suppose I have two sock drawers.
One contains equal numbers of black and white socks.
The other contains equal numbers of red, green, and blue socks. Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair. What is the probability that the socks are white?

For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.

``````

In :

# Solution goes here

``````
``````

In :

# Solution goes here

``````
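One way to reason about it, sketched with exact fractions: the probability of a matching pair is 1/2 for the black/white drawer and 1/3 for the three-color drawer; given a match from the black/white drawer, white and black are equally likely.

```python
from fractions import Fraction

# likelihood of drawing a matching pair from each drawer
p_match_bw = Fraction(1, 2)    # black-black (1/4) + white-white (1/4)
p_match_rgb = Fraction(1, 3)   # three colors, each pair has probability 1/9

# posterior probability of the black/white drawer, uniform prior
p_bw = p_match_bw / (p_match_bw + p_match_rgb)

# given that drawer and a matching pair, white is as likely as black
p_white_pair = p_bw * Fraction(1, 2)
print(p_white_pair)   # 3/10
```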

Exercise: Here's a problem from Bayesian Data Analysis:

Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?

Hint: In 1935, about 2/3 of twins were fraternal and 1/3 were identical.

``````

In :

# Solution goes here

``````
``````

In :

# Solution goes here

``````
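One possible solution, sketched with exact fractions: identical twins are always the same sex, so an identical twin of Elvis is certainly a brother, while a fraternal twin is a brother with probability 1/2.

```python
from fractions import Fraction

prior_identical = Fraction(1, 3)   # 1935 twin rates from the hint
prior_fraternal = Fraction(2, 3)

like_identical = Fraction(1)       # identical twins are always the same sex
like_fraternal = Fraction(1, 2)    # a fraternal twin is a brother half the time

numer = prior_identical * like_identical
p_identical = numer / (numer + prior_fraternal * like_fraternal)
print(p_identical)   # 1/2
```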