In statistics a **distribution** is a set of values and their corresponding probabilities.
For example, if you roll a six-sided die, the set of possible values is the numbers 1 to 6, and the probability associated with each value is 1/6.
As another example, you might be interested in how many times each word appears in common English usage. You could build a distribution that includes each word and how many times it appears.
To represent a distribution in Python, you could use a dictionary that maps from each value to its probability. I have written a class called Pmf that uses a Python dictionary in exactly that way, and provides a number of useful methods. I called the class Pmf in reference to a **probability mass function**, which is a way to represent a distribution mathematically.
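For instance, a plain dictionary already captures the idea, and Pmf wraps this pattern with convenience methods. Here is a minimal sketch (just an illustration, not the Pmf implementation itself):

    # A fair die as a plain value -> probability dictionary
    die = {x: 1/6.0 for x in range(1, 7)}
    print(die[3])   # 0.1666...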
Pmf is defined in a Python module I wrote to accompany this book, thinkbayes.py. You can download it from http://thinkbayes.com/thinkbayes.py. For more information see Section [download].
To use Pmf you can import it like this:
In [2]:
import sys
sys.path.insert(0, './code')
# Add the ./code subdirectory to the import path
from thinkbayes import Pmf
# Import the Pmf class from the thinkbayes module
Before building one, let's see what the Pmf class provides:
In [8]:
help(Pmf)
# What is this object?
This is a Probability Mass Function object, which includes some pre-defined methods and attributes to help us work with PMFs. (A PMF gives the probability that a discrete quantity is equal to each possible value, and the probabilities over all values must sum to 1.)
The following code builds a Pmf to represent the distribution of outcomes for a fair six-sided die:
In [15]:
pmf = Pmf()
# initialize the object
Pmf() creates an empty Probability Mass Function with no values.
In [16]:
for x in [1, 2, 3, 4, 5, 6]:
    # For each possible outcome x
    pmf.Set(x, 1/6.0)
    # Set the probability for x
pmf.Print()
The Set method sets the probability associated with each value to $1/6$.
Here’s another example that counts the number of times each word appears in a sequence:
In [18]:
word_list = ['hi', 'the', 'bye', 'hi', 'football', 'sky']
pmf = Pmf()
for word in word_list:
    pmf.Incr(word, 1)
pmf.Print()
Incr increases the “probability” associated with each word by 1. If a word is not already in the Pmf, it is added.
“Probability” is in quotes because in this example, the probabilities are not normalized; that is, they do not add up to 1, so they are not true probabilities.
However, in this example the word counts are proportional to the probabilities. So after we count all the words, we can compute probabilities by dividing through by the total number of words.
Pmf provides a method, Normalize, that does exactly that:
In [20]:
pmf.Normalize()
pmf.Print()
# Now the probabilities sum to 1
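For comparison, here is roughly what Incr plus Normalize accomplish, done by hand with a plain dictionary (a sketch only, assuming the same word list as above):

    # Count words by hand, then divide by the total to get probabilities
    counts = {}
    for word in ['hi', 'the', 'bye', 'hi', 'football', 'sky']:
        counts[word] = counts.get(word, 0) + 1
    total = float(sum(counts.values()))
    probs = {word: n / total for word, n in counts.items()}
    print(probs['hi'])   # 2/6 = 0.333...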
Once you have a defined Pmf object, you can ask for the probability associated with any value:
In [21]:
print(pmf.Prob('the'))
This returns the probability of the word “the”: its frequency as a fraction of the words in the list.
Pmf uses a Python dictionary to store the values and their probabilities, so the values in the Pmf can be any hashable type.
The probabilities can be any numerical type, but they are usually floating-point numbers (type float).
Now let's return to the cookie example and use a Pmf to hold the prior probability of each bowl:
In [37]:
pmf = Pmf()
# Reinitialize the Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)
# Set up the prior distribution; 50/50 odds
pmf.Print()
# Show us what's in there so far
This distribution, which contains the priors for each hypothesis, is called (wait for it) the **prior distribution**.
To update the distribution based on new data (the vanilla cookie), we multiply each prior by the corresponding likelihood.
In the cookie example, the likelihood of drawing a vanilla cookie from Bowl 1 is 3/4; the likelihood for Bowl 2 is 1/2.
Let's use the Mult method to update these priors with the vanilla likelihoods. Mult does what you would expect: it gets the probability for the given hypothesis and multiplies it by the given likelihood.
In [38]:
pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)
# Update with the vanilla likelihoods
pmf.Print()
# Where are we at now?
Note that the probabilities no longer add up to 1: after this update, the distribution is not normalized. But because these hypotheses are mutually exclusive and collectively exhaustive, we can **renormalize**:
In [39]:
pmf.Normalize()
Out[39]:
The result is a distribution that contains the posterior probability for each hypothesis, which is called (wait for it) the **posterior distribution**. (The value Normalize returns is the normalizing constant: the total probability mass before renormalizing.)
Check if it's normalized now:
In [40]:
pmf.Normalize()
Out[40]:
Finally, we can get the posterior probability for Bowl 1: given that we drew a vanilla cookie, how probable is it that it came from Bowl 1?
In [41]:
pmf.Prob('Bowl 1')
# Posterior probability that the vanilla cookie came from Bowl 1
Out[41]:
And for fun, Bowl 2?
In [42]:
pmf.Prob('Bowl 2')
# And the posterior probability that it came from Bowl 2
Out[42]:
And the answer for Bowl 1 is 0.6 (and for Bowl 2, 0.4). You can download this example from http://thinkbayes.com/cookie.py. For more information see Section [download].
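As a check, the same posterior follows directly from Bayes's theorem:

$$\mathrm{p}(\mathrm{Bowl\ 1} \mid \mathrm{vanilla}) = \frac{(1/2)(3/4)}{(1/2)(3/4) + (1/2)(1/2)} = \frac{0.375}{0.625} = 0.6$$

Next, let's encapsulate this computation in a class, Cookie, that maps from each hypothesis to its probability: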
In [84]:
class Cookie(Pmf):
    """A map from string bowl ID to probability."""

    def __init__(self, hypos):
        """Initialize self.

        hypos: sequence of string bowl IDs
        """
        Pmf.__init__(self)
        # Initialize the underlying Pmf object from before
        for hypo in hypos:
            # self.Set(hypo, 1)
            # For each hypo in the sequence we could Set its value to 1;
            # for learning, let's see what happens with Pmf.Incr()
            # (it yields the same result here)
            self.Incr(hypo, 1)
        self.Normalize()
        # Renormalize after adding all the new hypotheses

    mixes = {
        'Bowl 1': dict(vanilla=0.75, chocolate=0.25),
        'Bowl 2': dict(vanilla=0.5, chocolate=0.5),
    }
    # Mix data as provided by the problem.
    # Refresher:
    # * Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
    # * Bowl 2 contains 20 of each (20 vanilla, 20 chocolate).

    def Likelihood(self, data, hypo):
        """The likelihood of the data under the hypothesis.

        data: string cookie type
        hypo: string bowl ID
        """
        mix = self.mixes[hypo]
        # Look up the mix for the given hypo ('Bowl 1' or 'Bowl 2')
        like = mix[data]
        # Likelihood of the observed cookie type under this hypothesis
        return like

    def Update(self, data):
        """Updates the PMF with new data.

        data: string cookie type
        """
        for hypo in self.Values():
            # For every hypothesis in the current distribution
            like = self.Likelihood(data, hypo)
            # Get the likelihood value using the Likelihood() method above
            self.Mult(hypo, like)
            # Multiply the prior by the likelihood
        self.Normalize()
        # Renormalize after all the updates
A Cookie object is now a Pmf that maps from hypotheses to their probabilities.
The __init__ method gives each hypothesis the same prior probability. As in the previous section, there are two hypotheses:
In [9]:
hypos = ['Bowl 1', 'Bowl 2']
pmf = Cookie(hypos)
# Construct a Cookie object from our hypotheses; __init__ assigns the priors
pmf.Print()
# Show us the current distribution
Cookie provides an Update method that takes data as a parameter and updates the probabilities. Update loops through each hypothesis in the suite and multiplies its probability by the likelihood of the data under the hypothesis, which is computed by Likelihood.

Likelihood uses the mixes class attribute, which is a dictionary that maps from the name of a bowl to the mix of cookies in the bowl.
In [89]:
Cookie.mixes
Out[89]:
Here’s what the update looks like:
In [90]:
pmf.Update('vanilla')
# Update each hypothesis with the likelihood of drawing a vanilla cookie
pmf.Print()
# Show us the posterior distribution (post Update via Likelihood)
And then we can print the posterior probability of each hypothesis:
In [91]:
for hypo, prob in pmf.Items():
    print(hypo, prob)
This is the same as what we got before.
This code is more complicated than what we saw in the previous section, but the advantage is that it generalizes to the case where we draw more than one cookie from the same bowl (with replacement), using the data parameter in the methods above:
In [93]:
dataset = ['vanilla', 'chocolate', 'vanilla']
# 3 draws, here are the results
for data in dataset:
    pmf.Update(data)
    # Update our pmf using the results of our draws
pmf.Print()
# What's the new distribution?
# More refined with new information
The other advantage is that it provides a framework for solving many similar problems.
In the next section we’ll solve the Monty Hall problem computationally and then see what parts of the framework are the same.
The code in this section is available from http://thinkbayes.com/cookie2.py. For more information see Section [download].
In [27]:
class Monty(Pmf):
    """Map from string location of car to probability"""

    def __init__(self, hypos):
        """Initialize the prior distribution from the hypotheses.

        hypos: sequence of hypotheses
        """
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()

    def Update(self, data):
        """Updates each hypothesis based on the data.

        data: any representation of the data
        """
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()

    def Likelihood(self, data, hypo):
        """Compute the likelihood of the data under the hypothesis.

        hypo: string name of the door where the prize is
        data: string name of the door Monty opened
        """
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
So far Monty and Cookie are nearly the same (ignoring the Likelihood method for a second).
The code that creates the Pmf is the same, too, except for the names of the hypotheses:
In [28]:
hypos = 'ABC'
pmf = Monty(hypos)
pmf.Print()
# Current prior; all have the same odds
Calling Update is pretty much the same:
In [29]:
data = 'B'
# Opened Door B
pmf.Update(data)
# Update Prior with the Likelihoods
In [30]:
pmf.Print()
# Posterior Distribution
# The opened door B did not have the car, so the probability of the car being behind B is now 0
The implementation of Update is exactly the same; we are updating the prior distribution defined by the hypotheses, using Mult with the values returned by Likelihood.

Speaking of which, let's examine the primary change in the new object, the Likelihood method. Here it is again with some explanatory output added:
In [ ]:
def Likelihood(self, data, hypo):
    """Compute the likelihood of the data under the hypothesis.

    hypo: string name of the door where the prize is
    data: string name of the door Monty opened
    """
    print('Is our hypo {} the same as our data {}?'.format(hypo, data))
    if hypo == data:
        print('Yes, so the odds of the car behind {} are 0'.format(data))
        return 0
    elif hypo == 'A':
        print('Hypo A is our own door, so Monty could have opened either of the other two doors: likelihood 0.5')
        return 0.5
    else:
        # Monty had no choice but to open this door: likelihood 1
        return 1
Finally, printing the results is the same:
In [26]:
for hypo, prob in pmf.Items():
    print(hypo, prob)
The problem centers on the notion of switching: the car is behind one of three doors, we pick door A, and Monty opens one of the other doors, always one that does not hide the car.
Here that door is our data, door B. Once it is opened, we need to decide whether to stay or switch, and our Bayesian framework says it is in our interest to switch. The logic comes from the likelihoods: if the car is behind A, Monty could have opened either B or C, so the likelihood of him opening B is only 1/2; if the car is behind C, he was forced to open B, so the likelihood is 1.
Combined with the fact that there is definitely no car behind the opened door, this is what powers our switching behavior. Let us examine the other case (Monty opening door C instead of B).
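As a quick check of that symmetric case, we can reuse the Monty class defined above (the expected output in the comments assumes the same setup):

    pmf = Monty('ABC')    # fresh prior: 1/3 for each door
    pmf.Update('C')       # this time Monty opens door C
    pmf.Print()           # expect A: 1/3, B: 2/3 -- switching still wins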
In this example, writing Likelihood is a little complicated, but the framework of the Bayesian update is simple. The code in this section is available from http://thinkbayes.com/monty.py. For more information see Section [download].
The update framework we have been using is so common that thinkbayes.py encapsulates it in a class called Suite, a Pmf that provides __init__, Update, and Print:

class Suite(Pmf):
    """Represents a suite of hypotheses and their probabilities."""

    def __init__(self, hypo=tuple()):
        """Initializes the distribution."""

    def Update(self, data):
        """Updates each hypothesis based on the data."""

    def Print(self):
        """Prints the hypotheses and their probabilities."""
The implementation of Suite is in thinkbayes.py. To use Suite, you should write a class that inherits from it and provides Likelihood. For example, here is the solution to the Monty Hall problem rewritten to use Suite:
In [3]:
from thinkbayes import Suite
class Monty(Suite):

    def Likelihood(self, data, hypo):
        if hypo == data:
            return 0
        elif hypo == 'A':
            return 0.5
        else:
            return 1
And here’s the code that uses this class.
Once you've updated the Likelihood
function for the given problem, things get much easier to operationalize.
In [4]:
suite = Monty('ABC')
suite.Update('B')
suite.Print()
You can download this example from http://thinkbayes.com/monty2.py. For more information see Section [download].
We can use the Suite framework to solve the M&M problem. Writing the Likelihood function is tricky, but everything else is straightforward.
M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time.
In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.
Suppose a friend of mine has two bags of M&M’s, and he tells me that one is from 1994 and one from 1996. He won’t tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow one came from the 1994 bag?
This problem is similar to the cookie problem, with the twist that I draw one sample from each bowl/bag.
The first step is to enumerate the hypotheses. The bag the yellow M&M came from I’ll call Bag 1; I’ll call the other Bag 2. So the hypotheses are:
A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.
B: Bag 1 is from 1996 and Bag 2 from 1994.
With the original table method, we can construct a table with a row for each hypothesis and a column for each term in Bayes’s theorem:
|   | Prior $\mathrm{p}(H)$ | Likelihood $\mathrm{p}(D\vert H)$ | $\mathrm{p}(H)\,\mathrm{p}(D\vert H)$ | Posterior $\mathrm{p}(H\vert D)$ |
|---|---|---|---|---|
| A | 1/2 | (20)(20) | 200 | 20/27 |
| B | 1/2 | (14)(10) | 70 | 7/27 |
The first column has the priors. Based on the statement of the problem, it is reasonable to choose ${{\mathrm{p}(A)}} = {{\mathrm{p}(B)}} = 1/2$.
The second column has the likelihoods, which follow from the information in the problem.
For example:
If $A$ is true, the yellow M&M came from the 1994 bag with probability 20%, and the green came from the 1996 bag with probability 20%.
If $B$ is true, the yellow M&M came from the 1996 bag with probability 14%, and the green came from the 1994 bag with probability 10%.
Because the selections are independent, we get the conjoint probability by multiplying.
The third column is just the product of the previous two. The sum of this column, 270, is the normalizing constant. To get the last column, which contains the posteriors, we divide the third column by the normalizing constant.
That’s it. Simple, right?
Well, you might be bothered by one detail. I write $\mathrm{p}(D|H)$ in terms of percentages, not probabilities, which means it is off by a factor of 10,000. But that cancels out when we divide through by the normalizing constant, so it doesn’t affect the result.
When the set of hypotheses is mutually exclusive and collectively exhaustive, you can multiply the likelihoods by any factor, if it is convenient, as long as you apply the same factor to the entire column.
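Here is a quick sanity check of the table arithmetic in plain Python (no thinkbayes needed; the variable names are just for illustration):

    priors = {'A': 0.5, 'B': 0.5}
    likes = {'A': 20 * 20, 'B': 14 * 10}   # in "percent squared"; the common factor cancels
    unnorm = {h: priors[h] * likes[h] for h in priors}    # 200 and 70
    total = sum(unnorm.values())                          # normalizing constant: 270
    posteriors = {h: unnorm[h] / total for h in unnorm}
    print(posteriors)    # A: 0.7407... (20/27), B: 0.2592... (7/27)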
In [32]:
from thinkbayes import Suite
class M_and_M(Suite):
    """Map from hypothesis (A or B) to probability."""

    # Mixes as defined by the problem
    mix94 = dict(brown=30,
                 yellow=20,
                 red=20,
                 green=10,
                 orange=10,
                 tan=10)

    mix96 = dict(blue=24,
                 green=20,
                 orange=16,
                 yellow=14,
                 red=13,
                 brown=13)

    # Hypotheses about which bag is which:
    # A says Bag 1 is the 1994 bag (and Bag 2 the 1996 bag); B is the reverse
    hypoA = dict(bag1=mix94, bag2=mix96)
    hypoB = dict(bag1=mix96, bag2=mix94)

    hypotheses = dict(A=hypoA, B=hypoB)

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        hypo: string hypothesis (A or B)
        data: tuple of string bag, string color
        """
        print('The data we observed is {}'.format(data))
        bag, color = data
        # Take the bag and color of the observed M&M
        mix = self.hypotheses[hypo][bag]
        print('The current hypo we are examining is {}'.format(hypo))
        # Pull the mix for the relevant bag under this hypothesis
        print('The current mix for {} is {}'.format(bag, mix))
        like = mix[color]
        # The likelihood is the percentage of that color in the mix
        print('The likelihood of drawing {} from {} is {}'.format(color, bag, like))
        return like
First I need to encode the color mixes from before and after 1995:
mix94 = dict(brown=30,
             yellow=20,
             ...

mix96 = dict(blue=24,
             green=20,
             ...
Then I have to encode the hypotheses:
hypoA = dict(bag1=mix94, bag2=mix96)
hypoB = dict(bag1=mix96, bag2=mix94)
hypoA represents the hypothesis that Bag 1 is from 1994 and Bag 2 from 1996. hypoB is the other way around.
Next I map from the name of the hypothesis to the representation:
hypotheses = dict(A=hypoA, B=hypoB)
And finally I can write Likelihood. In this case the hypothesis, hypo, is a string, either A or B. The data is a tuple that specifies a bag and a color.
def Likelihood(self, data, hypo):
    bag, color = data
    mix = self.hypotheses[hypo][bag]
    like = mix[color]
    return like
Here’s the code that creates the suite and updates it:
In [31]:
suite = M_and_M('AB')
print('\n The prior probabilities before any observations are:\n')
suite.Print()
print('\n Where \'A\' is the hypothesis that Bag 1 (the bag the yellow M&M came from) is the 1994 bag, and \'B\' that it is the 1996 bag\n')
print('\n Let us draw the first M&M')
suite.Update(('bag1', 'yellow'))
print('\n The posterior probabilities after this observation are now:')
suite.Print()
print('\n Let us draw another M&M')
suite.Update(('bag2', 'green'))
print('\n The posterior probabilities after pulling both M&Ms is now:')
suite.Print()
The posterior probability of A is approximately $20/27$, which is what we got before.
The code in this section is available from http://thinkbayes.com/m_and_m.py. For more information see Section [download].
This chapter presents the Suite class, which encapsulates the Bayesian update framework.
Suite is an **abstract type**, which means that it defines the interface a Suite is supposed to have, but does not provide a complete implementation. The Suite interface includes Update and Likelihood, but the Suite class only provides an implementation of Update, not Likelihood.
A **concrete type** is a class that extends an abstract parent class and provides an implementation of the missing methods. For example, Monty extends Suite, so it inherits Update and provides Likelihood.
If you are familiar with design patterns, you might recognize this as an example of the template method pattern. You can read about this pattern at http://en.wikipedia.org/wiki/Template_method_pattern.
Most of the examples in the following chapters follow the same pattern; for each problem we define a new class that extends Suite, inherits Update, and provides Likelihood. In a few cases we override Update, usually to improve performance.
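To make the pattern concrete, here is a toy, stripped-down illustration (the names TinySuite and TinyMonty are made up for this sketch; they are not part of thinkbayes):

    class TinySuite(object):
        """Abstract parent: defines the update skeleton (the template method)."""
        def __init__(self, hypos):
            self.probs = {h: 1.0 / len(hypos) for h in hypos}

        def Update(self, data):
            # Fixed skeleton: multiply each hypothesis by its likelihood, then renormalize
            for h in self.probs:
                self.probs[h] *= self.Likelihood(data, h)   # hook supplied by subclass
            total = sum(self.probs.values())
            for h in self.probs:
                self.probs[h] /= total

    class TinyMonty(TinySuite):
        """Concrete child: provides the problem-specific Likelihood."""
        def Likelihood(self, data, hypo):
            if hypo == data:
                return 0
            return 0.5 if hypo == 'A' else 1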
In Section [framework] I said that the solution to the cookie problem generalizes to the case where we draw multiple cookies with replacement.
But in the more likely scenario where we eat the cookies we draw, the likelihood of each draw depends on the previous draws.
Modify the solution in this chapter to handle selection without replacement. Hint: add instance variables to Cookie to represent the hypothetical state of the bowls, and modify Likelihood accordingly. You might want to define a Bowl object.
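Here is one possible sketch of a solution to this exercise (not the book's solution; the Bowl helper and the name Cookie2 are just illustrative):

    from thinkbayes import Pmf

    class Bowl(object):
        """Hypothetical cookie counts for one bowl."""
        def __init__(self, vanilla, chocolate):
            self.counts = dict(vanilla=vanilla, chocolate=chocolate)

        def Prob(self, flavor):
            # Probability of drawing this flavor from what is left in the bowl
            total = sum(self.counts.values())
            return self.counts[flavor] / float(total) if total else 0

        def Remove(self, flavor):
            # Eat one cookie of this flavor, if any remain
            if self.counts[flavor] > 0:
                self.counts[flavor] -= 1

    class Cookie2(Pmf):
        """Cookie problem without replacement."""
        def __init__(self, hypos):
            Pmf.__init__(self)
            for hypo in hypos:
                self.Set(hypo, 1)
            self.Normalize()
            # Each hypothesis tracks its own hypothetical bowl state
            self.bowls = {
                'Bowl 1': Bowl(vanilla=30, chocolate=10),
                'Bowl 2': Bowl(vanilla=20, chocolate=20),
            }

        def Likelihood(self, data, hypo):
            # The likelihood now depends on the depleted contents of the bowl
            return self.bowls[hypo].Prob(data)

        def Update(self, data):
            for hypo in self.Values():
                self.Mult(hypo, self.Likelihood(data, hypo))
            self.Normalize()
            # Eat the cookie: remove it from every hypothetical bowl
            for bowl in self.bowls.values():
                bowl.Remove(data)

    pmf = Cookie2(['Bowl 1', 'Bowl 2'])
    for data in ['vanilla', 'chocolate', 'vanilla']:
        pmf.Update(data)
    pmf.Print()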