In [ ]:
# Formats the notebook to look like the actual text
# Make sure to install the right dependencies for Python 3
# Commented out because this is a variant of the text
#import sys
#import book_format
# book_format.load_style('code')

Bayes’s Theorem


The fundamental idea behind all Bayesian statistics is Bayes’s theorem, which is surprisingly easy to derive, provided that you understand conditional probability.

Therefore, we’ll start with:

1) Probability

2) Conditional probability

3) Conjoint Probability

4) Bayes’s theorem

And then proceed with Bayesian statistics in the rest of the text.


A probability is a number between 0 and 1 (including both) that represents a degree of belief in a fact or prediction.

The value 1 represents certainty that a fact is true, or that a prediction will come true.

The value 0 represents certainty that the fact is false.

Intermediate values represent degrees of certainty. The value 0.5, often written as 50%, means that a predicted outcome is as likely to happen as not.

For example, the probability that a tossed coin lands face up is very close to 50%.

Conditional Probability

Conditional probability is a probability based on some background information.

For example, I want to know the probability that I will have a heart attack in the next year. According to the CDC, “Every year about 785,000 Americans have a first coronary attack.”

The U.S. population is about 311 million, so the probability that a randomly chosen American will have a heart attack in the next year is roughly 0.3%:

$${P}(\text{Heart Attack}) = \frac{\text{US Heart Attacks}}{\text{US Population}} = \frac{785{,}000}{311{,}000{,}000} \approx 0.3\%$$
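As a quick check of the arithmetic, here is a minimal sketch using only the two figures quoted in the text. The variable names are mine and purely illustrative.

```python
# Figures quoted in the text (CDC count and approximate U.S. population).
first_heart_attacks_per_year = 785_000
us_population = 311_000_000

# Unconditional probability for a randomly chosen American.
p_heart_attack = first_heart_attacks_per_year / us_population

print(f"{p_heart_attack:.2%}")  # 0.25%, which rounds to roughly 0.3%
```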

But I am not a randomly chosen American. Epidemiologists have identified many factors that affect the risk of heart attacks; depending on those factors, my risk might be higher or lower than average.

I am male, 45 years old, and I have borderline high cholesterol. Those factors increase my chances. However, I have low blood pressure and I don’t smoke, and those factors decrease my chances.

Plugging everything into an online risk calculator, I find that the probability that I specifically will have a heart attack in the next year is about 0.2%, less than the national average of 0.3%.

That value is a conditional probability, because it is based on a number of factors that make up my “condition.”

$${P}(\text{Heart Attack | Personal Medical History }) = 0.2\%$$

The usual notation for conditional probability is ${P}(A \mid B)$, which is the probability of $A$ given that $B$ is true.

In this example, $A$ represents the prediction that I will have a heart attack in the next year, and $B$ is the set of conditions I listed (blood pressure, cholesterol, etc.).

Conjoint probability

Conjoint probability is a fancy way of asking for the probability that two things are both true.

I write $${P}(A {~\mathrm{and}~}B) = {P}(A \cap B)$$ to mean the probability that $A$ and $B$ are both true.

If you learned about probability in the context of coin tosses and dice, you might have learned the following formula:

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A)~{P}(B) \quad\quad\quad\quad\mbox{ Warning: Not always true (Independence Assumption)}$$

For example, if I toss two coins, and $A$ means the first coin lands heads and $B$ means the second coin lands heads, then the probability of each event is:

$${P}(A) = 0.5$$

$${P}(B) = 0.5$$

The probability of both independent events occurring, that is, both coin flips coming up heads (the conjoint probability of these two things being true), is then:

$${P}(A){P}(B) = {P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = 0.25$$

There's a 25% chance we'll get two heads from two independent coin flips:

$${P}(Coin_1){P}(Coin_2) = {P}(Coin_1 {~\mathrm{and}~}Coin_2) = {P}(Coin_1 \cap Coin_2) = 0.25$$

But this formula only works because in this case $A$ and $B$ are independent; that is, knowing the outcome of the first event does not change the probability of the second; flipping $Coin_1$ has no effect on the odds of $Coin_2$.

Or, more formally:

$${P}(B \mid A) = {P}(B)$$

$${P}(Coin_2 \mid Coin_1) = {P}(Coin_2)$$

The probability is unchanged given the new condition, i.e. the events are totally independent.
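The coin example can be checked by brute-force enumeration. The sketch below is illustrative only: it assumes two fair coins, and the helper `prob` is a name I made up. It lists the four equally likely outcomes and confirms both the product rule and the independence condition.

```python
from fractions import Fraction
from itertools import product

# Enumerate the four equally likely outcomes of two fair coin tosses.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_a = prob(lambda o: o[0] == "H")            # first coin lands heads
p_b = prob(lambda o: o[1] == "H")            # second coin lands heads
p_a_and_b = prob(lambda o: o == ("H", "H"))  # both land heads
p_b_given_a = p_a_and_b / p_a                # conditional probability

print(p_a, p_b, p_a_and_b, p_b_given_a)  # 1/2 1/2 1/4 1/2
```

Because `p_b_given_a` equals `p_b`, knowing the first toss tells us nothing about the second.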

Here is a different example where the events are not independent. Suppose that now $A$ means that it rains today and $B$ means that it rains tomorrow.

If I know that it rained today, it is *more likely* that it will rain tomorrow. We can describe this relationship formally as:

$${P}(B \mid A) > {P}(B)$$

$${P}(Rain_{t+1} \mid Rain_{t}) > {P}(Rain_{t+1})$$

The probability of rain tomorrow given that it rained today, ${P}(Rain_{t+1} \mid Rain_{t})$, is greater than the unconditional probability of rain tomorrow, ${P}(Rain_{t+1})$; these events are not independent.

Therefore, we can generalize the definition of conjoint probability as

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A)~{P}(B \mid A)$$

for any events $A$ and $B$.

If these events are independent, this reduces to:

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A)~{P}(B \mid A) = {P}(A){P}(B)$$

If they are not independent, we must use the conditional probability, which is not equal to the unconditional probability of the event:

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A)~{P}(B \mid A)$$

where $$ {P}(B \mid A) \neq {P}(B) $$

So if the chance of rain on any given day is 0.5, the chance of rain on two consecutive days is not 0.25, but probably a bit higher.
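To make "a bit higher" concrete, here is a small sketch. The value 0.6 for the conditional probability is an assumption of mine, chosen only for illustration; the text does not give one.

```python
# P(rain on any given day), as in the text.
p_rain_today = 0.5

# ASSUMPTION: an illustrative value, not from the text. Rain today makes
# rain tomorrow somewhat more likely than the unconditional 0.5.
p_rain_tomorrow_given_today = 0.6

# General rule: P(A and B) = P(A) * P(B | A)
p_two_rainy_days = p_rain_today * p_rain_tomorrow_given_today

# What the independence shortcut P(A) * P(B) would have given.
p_if_independent = p_rain_today * p_rain_today

print(p_two_rainy_days, p_if_independent)
```

Under this assumed value the conjoint probability comes out above the 0.25 that the independence shortcut would suggest.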

We’ll get to Bayes’s theorem soon, but I want to motivate it with an example called the cookie problem.

Suppose there are two bowls of cookies:

  • Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
  • Bowl 2 contains 20 of each (20 vanilla, 20 chocolate).

Now suppose you choose one of the bowls at random and, without looking, select a cookie at random, and the cookie is vanilla. What is the probability that it came from Bowl 1?

Or formally, what was the probability the cookie was from the first bowl, given that it was vanilla:

$$ {P}(Bowl_1 \mid Vanilla) $$

With the keyword given, we know that this is a conditional probability; but it is not immediately obvious how to compute it.

If I asked a different question — the probability of a vanilla cookie given Bowl 1 — it would be easy:

$$ {P}(Vanilla \mid Bowl_1) = \frac{Vanilla}{Vanilla + Chocolate} = \frac{30}{10+30} = 3/4 = 75\%$$

Sadly, ${P}(A \mid B)$ is not the same as ${P}(B \mid A)$. Treating the two as interchangeable is a common error, closely related to the base rate fallacy. For example, the probability of having some disease given certain symptoms is not the same as the probability of having those symptoms given the disease; they are totally different probabilities.

However, there is a way to get from one to the other: Bayes’s theorem.

Bayes’s theorem

At this point we have everything we need to derive Bayes’s theorem.

We’ll start with the observation that conjunction is commutative, or:

$${P}(A \cap B) = {P}(B \cap A)$$

or $${P}(A {~\mathrm{and}~}B) = {P}(B {~\mathrm{and}~}A)$$

Which is just saying that the probability of events $A$ and $B$ both occurring is exactly the same as the probability of events $B$ and $A$ both occurring.

Next, let us rewrite the LHS of this equation using the notion of conditional probability, which relates the events $A$ and $B$:

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A){P}(B \mid A)$$

Which doesn't change the definition at all.

Remember, if these two events are independent, then ${P}(B \mid A) = {P}(B)$, and this equation resolves exactly the same as before:

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A){P}(B \mid A) = {P}(A){P}(B) $$

Since the other side involves the same two events, we could just flip the events, and since they are independent, the condition changes nothing, since ${P}(A \mid B) = {P}(A)$.

So, for two independent events $A$ and $B$:

$${P}(B {~\mathrm{and}~}A) = {P}(A {~\mathrm{and}~}B) = {P}(B){P}(A) = {P}(A){P}(B)$$

$${P}(B){P}(A) = {P}(A){P}(B) $$

Which is true, and is the same equality we started with.

Now, let's take that same conditional definition that relates these two events. Since we haven't said anything about what $A$ and $B$ mean, they are still completely interchangeable.

Let's take the LHS of the equation:

$${P}(A {~\mathrm{and}~}B) = {P}(A \cap B) = {P}(A){P}(B \mid A)$$

And flip them again, just like we did when they were independent:

$${P}(B {~\mathrm{and}~}A) = {P}(B \cap A) = {P}(B){P}(A \mid B)$$

That’s all we need. Putting those pieces together in the shape of the original equality, we get:

$${P}(A \cap B) = {P}(B \cap A)$$

$${P}(A {~\mathrm{and}~}B) = {P}(B {~\mathrm{and}~}A)$$

$${P}(B)~{P}(A \mid B) = {P}(A)~{P}(B \mid A)$$

What does this mean?

Mathematically, we have two ways to compute the conjunction (the probability of both events occurring). If you have ${P}(A)$, you multiply by the conditional probability ${P}(B \mid A)$, or you could do it the other way around; if you know ${P}(B)$ you multiply by ${P}(A \mid B)$.

Intuitively, the probability of events $A$ and $B$ both occurring can be described in two equivalent ways: as the probability of $B$ occurring times the probability of $A$ occurring given that $B$ occurred, or as the probability of $A$ occurring times the probability of $B$ occurring given that $A$ occurred.

Knowing either conditional relationship can help you solve for the other, because either way you should get the same thing: the probability of events $A$ and $B$ both occurring.

Finally we can divide through by ${P}(B)$ for convenience, and we get:

$${{{P}(A \mid B)}} = \frac{{{{P}(A)}}~{{{P}(B \mid A)}}}{{{{P}(B)}}}$$
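The derivation can be checked numerically. The sketch below uses a made-up joint distribution over two events (the numbers are an assumption, chosen only so the fractions stay simple). It confirms that both factorizations give the same conjoint probability, and that dividing by ${P}(B)$ recovers ${P}(A \mid B)$.

```python
from fractions import Fraction

# ASSUMPTION: a made-up joint distribution over two events, for illustration.
# Keys are (A is true?, B is true?); values are probabilities that sum to 1.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

p_a = sum(p for (a, _), p in joint.items() if a)  # P(A)
p_b = sum(p for (_, b), p in joint.items() if b)  # P(B)
p_a_and_b = joint[(True, True)]                   # P(A and B)

p_b_given_a = p_a_and_b / p_a  # P(B | A), by definition
p_a_given_b = p_a_and_b / p_b  # P(A | B), by definition

# Both factorizations give the same conjoint probability ...
assert p_a * p_b_given_a == p_b * p_a_given_b == p_a_and_b
# ... and dividing through by P(B) gives Bayes's theorem.
assert p_a_given_b == p_a * p_b_given_a / p_b

print(p_a_given_b)  # 3/5
```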

And that’s Bayes’s theorem! It might not look like much, but it turns out to be surprisingly powerful.

For example, let's see if we can use it to solve the cookie problem.

I’ll write $Bowl_1$ for our hypothesis that the cookie came from Bowl 1 and $Vanilla$ for the vanilla cookie.

Plugging into Bayes’s theorem, we can solve for the probability that the cookie came from the first bowl, given that it was vanilla:

$${{\mathrm{P}(Bowl_1|Vanilla)}} = \frac{{{\mathrm{P}(Bowl_1)}}~{{\mathrm{P}(Vanilla \mid Bowl_1)}}}{{{\mathrm{P}(Vanilla)}}}$$

The term on the left is what we want: the probability of $Bowl_1$, given we're holding that vanilla cookie.

Now let's solve for the terms on the right:

  • ${{\mathrm{P}(Bowl_1)}}$: This is the probability that we chose Bowl 1, unconditioned by (irrespective of) what kind of cookie we got; what are the odds of grabbing $Bowl_1$ at random? (50/50, since we have two bowls, and we're just grabbing either one.)

    $${{\mathrm{P}(Bowl_1)}} = \frac{Bowl_1}{Bowl_1 + Bowl_2} = \frac{1}{2} = 50\%$$

  • ${{\mathrm{P}(Vanilla \mid Bowl_1)}}$: Now, what was the probability of getting a vanilla cookie given that we can only grab it from Bowl 1? (3/4, since 75% of those cookies are vanilla.)

    $${{\mathrm{P}(Vanilla \mid Bowl_1)}} = \frac{Vanilla_{Bowl_1}}{Vanilla_{Bowl_1} + Chocolate_{Bowl_1}} = \frac{30}{30 + 10} = \frac{3}{4} = 75\%$$

  • ${{\mathrm{P}(Vanilla)}}$: This is the probability of getting a vanilla cookie from any bowl. We had an equal chance of choosing either bowl, and the bowls contain the same number of cookies, so we had the same chance of choosing any cookie. What is the chance we ended up with a vanilla cookie? (5/8, since there are 50 vanilla and 30 chocolate cookies overall, making 62.5% of them vanilla.)

    $${{\mathrm{P}(Vanilla)}} = \frac{Vanilla}{Vanilla + Chocolate} = \frac{50}{50 + 30} = \frac{5}{8} = 62.5\%$$

Putting all this together, we can now solve for the conditional relationship we didn't know, the odds of our vanilla cookie coming from the first bowl:

$${{\mathrm{P}(Bowl_1|Vanilla)}} = \frac{{{\mathrm{P}(Bowl_1)}}~{{\mathrm{P}(Vanilla \mid Bowl_1)}}}{{{\mathrm{P}(Vanilla)}}} = \frac{(1/2)~(3/4)}{5/8} = \frac{3}{5} = 60\%$$

There's a 60% chance we drew our cookie from Bowl 1, given the cookie was vanilla.

So the vanilla cookie is evidence in favor of the hypothesis that we chose Bowl 1, because vanilla cookies are more likely to come from Bowl 1 (since there were more vanilla cookies in Bowl 1).
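The same calculation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and variable names are my own, and Bowl 2 is taken to hold 20 vanilla and 20 chocolate cookies, matching the arithmetic above.

```python
from fractions import Fraction

# Cookie counts from the problem statement.
bowls = {
    "Bowl 1": {"vanilla": 30, "chocolate": 10},
    "Bowl 2": {"vanilla": 20, "chocolate": 20},
}

prior = Fraction(1, 2)  # P(Bowl 1): each bowl is equally likely to be chosen

# P(Vanilla | Bowl 1): fraction of vanilla cookies in Bowl 1.
likelihood = Fraction(bowls["Bowl 1"]["vanilla"], sum(bowls["Bowl 1"].values()))

# P(Vanilla): pooled fraction, valid here because both bowls are the same
# size and equally likely to be chosen.
total_vanilla = sum(b["vanilla"] for b in bowls.values())
total_cookies = sum(sum(b.values()) for b in bowls.values())
evidence = Fraction(total_vanilla, total_cookies)

posterior = prior * likelihood / evidence  # Bayes's theorem
print(posterior)  # 3/5
```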

This example demonstrates one use of Bayes’s theorem: it provides a strategy to get from $\mathrm{P}(B \mid A)$ to $\mathrm{P}(A \mid B)$.

This strategy is useful in cases like the cookie problem, where it is often easier to compute the terms on the right side of Bayes’s theorem than the term on the left.

The Diachronic Interpretation

There is another way to think of Bayes’s theorem: in the context of time.

Bayes’s theorem gives us a way to update the probability of a hypothesis, $H$, in light of some (new) data $D$.

This way of thinking about Bayes’s theorem is called the diachronic interpretation. “Diachronic” means that something is happening over time; in this case the probability of the hypotheses changes, over time, as we see new data.

Rewriting Bayes’s theorem with $H$ and $D$ instead of $A$ and $B$ yields:

$${{\mathrm{P}(H|D)}} = \frac{{{\mathrm{P}(H)}}~{{\mathrm{P}(D|H)}}}{{{\mathrm{P}(D)}}}$$

or $${{\mathrm{P}(Hypothesis \mid Data)}} = \frac{{{\mathrm{P}(Hypothesis)}}~{{\mathrm{P}(Data \mid Hypothesis)}}}{{{\mathrm{P}(Data)}}}$$

In this interpretation, each term has a name:

| Function | Name | Definition | Context |
|---|---|---|---|
| $\mathrm{P}(H)$ | Prior | The probability of the hypothesis before we see any data. | What are the odds of our hypothesis, before we observe anything / have any data? |
| $\mathrm{P}(H \mid D)$ | Posterior | What we want to compute; the probability of the hypothesis given the data. | After we've observed something / have some data, what are the odds of the hypothesis? |
| $\mathrm{P}(D \mid H)$ | Likelihood | The probability of getting the data we got, given our hypothesis. | How likely were we to see this data, given our hypothesis was true? |
| $\mathrm{P}(D)$ | Evidence | The probability of seeing this data under any hypothesis. | What are the odds of witnessing this data, given any hypothesis? |

So another way of writing this new diachronic version is:

$$Posterior = \frac{Prior \cdot Likelihood}{Evidence}$$

This isn't in the book, but I'd like to jump quickly to a useful example I found that helped me understand this a little better.

Suppose we have a pile of movies and books sorted by genre, and there are 3 different movie genres (Action, Fantasy, Romance) and 2 book genres (Fantasy, Romance).

Assuming we grabbed something randomly from the pile that was labeled 'Fantasy', what are the odds it was a book, not a movie?

Or, more formally:

$${{\mathrm{P}(Book \mid Fantasy)}} = \frac{{{\mathrm{P}(Book)}}~{{\mathrm{P}(Fantasy \mid Book)}}}{{{\mathrm{P}(Fantasy)}}}$$

Let's create a similar table in the context of this example:

| Function | Name | Action | Explanation | Example |
|---|---|---|---|---|
| ${P}(Book)$ | Prior | Odds before we observe any data | Before we read the label, the object is completely unknown to us. So, given our goal to just find out whether we have a book, this is the probability that we have a book a priori (before reading the label / knowing no data). | What're the odds we just randomly grabbed a book? |
| ${P}(Book \mid Fantasy)$ | Posterior | Odds after we observe some data | Now that we've read the label, we have new information; we know we're holding something from the Fantasy section. So now we know what we're solving for: the probability that it's a book a posteriori (after reading the label / having some data). | What're the odds we're holding a book, given we now know it's Fantasy? |
| ${P}(Fantasy)$ | Evidence | Odds of getting this data | This is the label we just read, and we can use it to get closer to our posterior probability, since it is the chance of getting this data by default, whatever we are holding. | What are the odds of getting anything with a Fantasy label on it? |
| ${P}(Fantasy \mid Book)$ | Likelihood | Odds of the data, given our hypothesis | This is the magical part; the easier conditional relationship we can use to infer the missing one. We want to know the odds of getting a book given our label is Fantasy, but it's much easier to find the odds of getting the Fantasy genre, given we're holding a book. Therefore, this is the chance we got the data we did, given our hypothesis. | How likely is it we're holding something from the Fantasy genre, given our assumption that it's a book? |

So, now we have a tool that updates our prior belief in a hypothesis with new evidence: we weight the prior by the likelihood of seeing that evidence given the hypothesis, and the result is our posterior knowledge after seeing the evidence.
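The text gives no counts for the pile, so the sketch below has to assume some. It supposes the pile holds exactly one item per genre for each medium (three movies, two books) and that every item is equally likely to be grabbed. Under that assumption, which is mine and not the author's, each term of the table can be counted directly.

```python
from fractions import Fraction

# ASSUMPTION: the pile holds exactly one item per (medium, genre) pair and
# each item is equally likely to be grabbed. The text gives no counts.
pile = [
    ("movie", "Action"), ("movie", "Fantasy"), ("movie", "Romance"),
    ("book", "Fantasy"), ("book", "Romance"),
]

def prob(predicate, items):
    """Fraction of items that satisfy the predicate."""
    return Fraction(sum(1 for item in items if predicate(item)), len(items))

books = [item for item in pile if item[0] == "book"]

prior = prob(lambda item: item[0] == "book", pile)           # P(Book)
likelihood = prob(lambda item: item[1] == "Fantasy", books)  # P(Fantasy | Book)
evidence = prob(lambda item: item[1] == "Fantasy", pile)     # P(Fantasy)

posterior = prior * likelihood / evidence                    # P(Book | Fantasy)
print(prior, likelihood, evidence, posterior)  # 2/5 1/2 2/5 1/2
```

With different counts in the pile the numbers would change, but the roles of the four terms stay the same.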

Let's jump back to the context of the Cookie problem, and explore these terms in more detail:

As a refresher, the values from the cookie problem were:

  • Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
  • Bowl 2 contains 20 of each (20 vanilla, 20 chocolate).


The posterior is the thing we're trying to solve for. In the diachronic approach, this is the unknown probability we seek to update with new information.

$${{\mathrm{P}(Bowl_1 \mid Vanilla)}} $$

Technically, our hypothesis is just the $Bowl_1$ part (which is what we're curious about), but we are not trying to get ${P}(Bowl_1)$ on its own, which is the prior; we are trying to get the probability of $Bowl_1$ in the context of our data. What are the odds of our hypothesis given what we know?


The prior is the situation before we have any information, which we will update with the likelihood and evidence to find the posterior (post-data) situation. Usually this is just the chance of our hypothesis being true by default.

Sometimes we can compute the objective prior based on background information. In the cookie problem, it was specified that we choose a bowl at random with equal probability.

$$ {P}(Bowl_1) = \frac{Bowl_1}{Bowl_1 + Bowl_2} = \frac{1}{2} = 50\% $$

In other cases the prior is subjective; that is, reasonable people might disagree, either because they use different background information or because they interpret the current situation differently.


The evidence is the probability of the data we have on hand, often called the normalizing constant. This can be a little tricky, since it is supposed to be the probability of seeing the data under any hypothesis at all, but in the most general case it can be hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses that are:

  • Mutually exclusive: At most one hypothesis in the set can be true, and

  • Collectively exhaustive: There are no other possibilities; at least one of the hypotheses has to be true.

I use the word suite for a set of hypotheses that has these properties. If these two conditions are valid for the hypothesis suite, then we can use the law of total probability to compute this value, which just says that if there are two exclusive ways that something might happen, you can add up the probabilities.

In the cookie problem, there are only two hypotheses — the cookie came from Bowl 1 or Bowl 2 — and they are mutually exclusive and collectively exhaustive, and our evidence was the fact that it was vanilla, so our ${{\mathrm{P}(Vanilla)}}$ can be calculated like this:

$${{\mathrm{P}(D)}} = {{\mathrm{P}(Vanilla)}} = {{\mathrm{P}(Bowl_1)}}~{{\mathrm{P}(Vanilla \mid Bowl_1)}} + {{\mathrm{P}(Bowl_2)}}~{{\mathrm{P}(Vanilla \mid Bowl_2)}}$$

Plugging those in, we have:

$${{\mathrm{P}(Vanilla)}} = \frac{1}{2} \cdot \left(\frac{30}{30 + 10}\right) + \frac{1}{2} \cdot \left(\frac{20}{20 + 20}\right) = \frac{5}{8} = 62.5\%$$

Which is what we computed earlier by mentally combining the two bowls (i.e. tossing them into one big pile and asking what the odds of getting a Vanilla cookie was). Hence the law of total probability.
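The law of total probability is a one-line sum once the suite of hypotheses is written down. A minimal sketch, with dictionary names of my own choosing:

```python
from fractions import Fraction

# Suite of hypotheses: mutually exclusive and collectively exhaustive.
priors = {"Bowl 1": Fraction(1, 2), "Bowl 2": Fraction(1, 2)}

# P(Vanilla | Bowl) from the cookie counts in the text.
likelihoods = {"Bowl 1": Fraction(30, 40), "Bowl 2": Fraction(20, 40)}

# Law of total probability: P(D) = sum over H of P(H) * P(D | H)
evidence = sum(priors[h] * likelihoods[h] for h in priors)
print(evidence)  # 5/8
```

Unlike pooling the cookies into one pile, this form still works when the bowls differ in size or are not equally likely to be chosen.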


The likelihood is usually the easiest part to compute. In the cookie problem, if we know which bowl the cookie came from, we can find the probability of a vanilla cookie just through counting / brute force (which is where machines come in). It's called the likelihood because it measures how likely it was to see our data, given that our hypothesis is true.

$${{\mathrm{P}(Vanilla \mid Bowl_1)}} = \frac{Vanilla_{Bowl_1}}{Vanilla_{Bowl_1} + Chocolate_{Bowl_1}} = \frac{30}{30 + 10} = \frac{3}{4} = 75\%$$


Using all those variables, and the format:

$$Posterior = \frac{Prior \cdot Likelihood}{Evidence}$$

We can begin updating our prior probability with new evidence, and updating our posterior knowledge. Let's go through step by step (as per the diachronic approach). Note that this is not completely mathematically accurate, but is useful as a thought exercise.

Step 0: Hypothesis

We grabbed a random cookie. What're the odds it came from Bowl 1?

$$ {P}(Bowl_1 \mid Cookie) = ...$$

In [ ]:
# More than 1 Bowl

#print(Hypothesis per number of bowls)

Step 1: Prior

We grabbed a cookie but we don't know what it is. What are the odds we grabbed it from Bowl 1, knowing no new information at all?

$$ {P}(Bowl_1 \mid Cookie) = Prior = \frac{Bowl_1}{Bowl_1 + Bowl_2} = \frac{1}{2} = 50\% $$

Step 2: Evidence

We looked in our hand and realized we're holding a vanilla cookie, so let's start updating toward our posterior; what're the odds that we grabbed it from Bowl 1 now that we know this new information? Well, we need to know the odds we'd get a vanilla cookie at all, out of either bowl:

$$ \frac{Prior}{Evidence} = \frac{\frac{Bowl_1}{Bowl_1 + Bowl_2}}{\frac{Bowl_1}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_1}}{Vanilla_{Bowl_1} + Chocolate_{Bowl_1}} + \frac{Bowl_2}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_2}}{Vanilla_{Bowl_2} + Chocolate_{Bowl_2}}} = \frac{\frac{1}{2}}{\frac{1}{2} \cdot \left(\frac{30}{30 + 10}\right) + \frac{1}{2} \cdot \left(\frac{20}{20 + 20}\right)} = \frac{50\%}{62.5\%} = 80\%$$

This intermediate value is larger than the prior because the evidence term is less than 1. It is not yet the probability of our hypothesis, since it does not account for how likely a vanilla cookie is under Bowl 1 specifically.

So we need to tune this with the probability of the literal condition to find out exactly how likely our hypothesis is; what is the actual chance we'd get a vanilla cookie from Bowl 1?

In [ ]:
# Prior for every variant cookie; change up the cookie counts, keep number of bowls exactly the same
# OR vary with just as many bowls; an array of bowls numbers with different probabilities output as a table

Step 3: Likelihood

I like to think about this as a tuning operation: the likelihood scales the Prior / Evidence ratio from Step 2 by how probable the data is under our hypothesis.

A small likelihood shrinks that ratio toward zero: if our hypothesis rarely produces this data, the posterior ends up small no matter how large the Step 2 value was.

A likelihood equal to the evidence leaves the prior unchanged: if the data is exactly as probable under our hypothesis as it is overall, then Likelihood / Evidence = 100% and the data tells us nothing new. For example, if both bowls held 20 vanilla and 20 chocolate cookies, the evidence would be 50%, and the Step 2 ratio would be:

$$ \frac{Prior}{Evidence} = \frac{\frac{Bowl_1}{Bowl_1 + Bowl_2}}{\frac{Bowl_1}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_1}}{Vanilla_{Bowl_1} + Chocolate_{Bowl_1}} + \frac{Bowl_2}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_2}}{Vanilla_{Bowl_2} + Chocolate_{Bowl_2}}} = \frac{\frac{1}{2}}{\frac{1}{2} \cdot \left(\frac{20}{20 + 20}\right) + \frac{1}{2} \cdot \left(\frac{20}{20 + 20}\right)} = \frac{50\%}{50\%} = 100\%$$

In [ ]:
# Run different assumed conditional probabilities across all bowl count arrays and chocolate chip arrays, show that it 
# only measures the magnitude of the prior explained by the evidence

In [3]:
# Prior / Evidence when both bowls are half vanilla: 0.5 / 0.5 = 1.0
0.5 / ((0.5) * 0.5 + (0.5) ** 2)


In [ ]:
# Same ratio, with both evidence terms written as squares
0.5 / ((0.5) ** 2 + (0.5) ** 2)

Putting the prior, likelihood, and evidence together:

$$ {P}(Bowl_1 \mid Vanilla) = \frac{Prior \cdot Likelihood}{Evidence} = \frac{\frac{Bowl_1}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_1}}{Vanilla_{Bowl_1} + Chocolate_{Bowl_1}}}{\frac{Bowl_1}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_1}}{Vanilla_{Bowl_1} + Chocolate_{Bowl_1}} + \frac{Bowl_2}{Bowl_1 + Bowl_2} \cdot \frac{Vanilla_{Bowl_2}}{Vanilla_{Bowl_2} + Chocolate_{Bowl_2}}} = \frac{\frac{1}{2} \cdot \frac{30}{30 + 10}}{\frac{1}{2} \cdot \left(\frac{30}{30 + 10}\right) + \frac{1}{2} \cdot \left(\frac{20}{20 + 20}\right)} = \frac{50\% \cdot 75\%}{62.5\%} = 60\%$$

Now we have a fully updated posterior, incorporating our new data and updating our prior in relation to our hypothesis. This process can be repeated with any new data: the posterior becomes the new prior, which is updated again with evidence and likelihoods as more data arrives.
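To illustrate the repeated update, here is a hedged sketch. The `update` helper is a name I made up, and the second draw is an assumption that goes beyond the text: it supposes the first cookie is put back and a second vanilla cookie is drawn from the same bowl.

```python
from fractions import Fraction

# P(Vanilla | Bowl) from the cookie counts in the text.
likelihood_vanilla = {"Bowl 1": Fraction(3, 4), "Bowl 2": Fraction(1, 2)}

def update(priors, likelihoods):
    """One Bayesian update: returns the posterior for every hypothesis."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # law of total probability
    return {h: p / evidence for h, p in unnormalized.items()}

beliefs = {"Bowl 1": Fraction(1, 2), "Bowl 2": Fraction(1, 2)}

# First vanilla cookie: reproduces the 60% from the cookie problem.
beliefs = update(beliefs, likelihood_vanilla)
print(beliefs["Bowl 1"])  # 3/5

# ASSUMPTION (not in the text): we put the cookie back and draw a second
# vanilla cookie from the same bowl. The previous posterior is the new prior.
beliefs = update(beliefs, likelihood_vanilla)
print(beliefs["Bowl 1"])  # 9/13
```

Each vanilla cookie shifts belief further toward Bowl 1, from 50% to 60% to roughly 69%.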

Google Optimize: the probability of a click on a certain webpage, given the probability of a click on any webpage $P(click)$, the likelihood that a click happened on that webpage $P(webpage \mid click)$, and the probability of someone being on that webpage at all $P(webpage)$:

$$P(click \mid webpage) = \frac{P(click) \cdot P(webpage \mid click)}{P(webpage)}$$

The M&M problem

M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time.

In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.

Suppose a friend of mine has two bags of M&M’s, and he tells me that one is from 1994 and one from 1996. He won’t tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow one came from the 1994 bag?

This problem is similar to the cookie problem, with the twist that I draw one sample from each bowl/bag. This problem also gives me a chance to demonstrate the table method, which is useful for solving problems like this on paper. In the next chapter we will solve them computationally.

The first step is to enumerate the hypotheses. The bag the yellow M&M came from I’ll call Bag 1; I’ll call the other Bag 2. So the hypotheses are:

  • A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.

  • B: Bag 1 is from 1996 and Bag 2 from 1994.

Now we construct a table with a row for each hypothesis and a column for each term in Bayes’s theorem:

| | Prior $\mathrm{p}(H)$ | Likelihood $\mathrm{p}(D\vert H)$ | $\mathrm{p}(H) \mathrm{p}(D\vert H)$ | Posterior $\mathrm{p}(H\vert D)$ |
|---|---|---|---|---|
| A | 1/2 | (20)(20) | 200 | 20/27 |
| B | 1/2 | (14)(10) | 70 | 7/27 |

The first column has the priors. Based on the statement of the problem, it is reasonable to choose ${{\mathrm{p}(A)}} = {{\mathrm{p}(B)}} = 1/2$.

The second column has the likelihoods, which follow from the information in the problem. For example, if $A$ is true, the yellow M&M came from the 1994 bag with probability 20%, and the green came from the 1996 bag with probability 20%. If $B$ is true, the yellow M&M came from the 1996 bag with probability 14%, and the green came from the 1994 bag with probability 10%. Because the selections are independent, we get the conjoint probability by multiplying.

The third column is just the product of the previous two. The sum of this column, 270, is the normalizing constant. To get the last column, which contains the posteriors, we divide the third column by the normalizing constant.

That’s it. Simple, right?

Well, you might be bothered by one detail. I write $\mathrm{p}(D|H)$ in terms of percentages, not probabilities, which means it is off by a factor of 10,000. But that cancels out when we divide through by the normalizing constant, so it doesn’t affect the result.

When the set of hypotheses is mutually exclusive and collectively exhaustive, you can multiply the likelihoods by any factor, if it is convenient, as long as you apply the same factor to the entire column.
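The table method translates almost line for line into code. In this sketch the dictionary names are mine, and the likelihoods are deliberately left in percent units, as in the table, to show that the common factor cancels.

```python
from fractions import Fraction

# Color mixes (in percent) quoted in the text.
mix_1994 = {"brown": 30, "yellow": 20, "red": 20, "green": 10, "orange": 10, "tan": 10}
mix_1996 = {"blue": 24, "green": 20, "orange": 16, "yellow": 14, "red": 13, "brown": 13}

priors = {"A": Fraction(1, 2), "B": Fraction(1, 2)}

# Likelihood of (yellow from Bag 1, green from Bag 2) under each hypothesis.
# Kept in percent units, as in the table; the factor cancels on normalizing.
likelihoods = {
    "A": mix_1994["yellow"] * mix_1996["green"],  # Bag 1 is 1994, Bag 2 is 1996
    "B": mix_1996["yellow"] * mix_1994["green"],  # Bag 1 is 1996, Bag 2 is 1994
}

products = {h: priors[h] * likelihoods[h] for h in priors}    # third column
normalizer = sum(products.values())                           # 270
posteriors = {h: products[h] / normalizer for h in products}  # last column

print(posteriors["A"], posteriors["B"])  # 20/27 7/27
```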

The Monty Hall problem

The Monty Hall problem might be the most contentious question in the history of probability. The scenario is simple, but the correct answer is so counterintuitive that many people just can’t accept it, and many smart people have embarrassed themselves not just by getting it wrong but by arguing the wrong side, aggressively, in public.

Monty Hall was the original host of the game show *Let’s Make a Deal*. The Monty Hall problem is based on one of the regular games on the show. If you are on the show, here’s what happens:

  • Monty shows you three closed doors and tells you that there is a prize behind each door: one prize is a car, the other two are less valuable prizes like peanut butter and fake finger nails. The prizes are arranged at random.

  • The object of the game is to guess which door has the car. If you guess right, you get to keep the car.

  • You pick a door, which we will call Door A. We’ll call the other doors B and C.

  • Before opening the door you chose, Monty increases the suspense by opening either Door B or C, whichever does not have the car. (If the car is actually behind Door A, Monty can safely open B or C, so he chooses one at random.)

  • Then Monty offers you the option to stick with your original choice or switch to the one remaining unopened door.

The question is, should you “stick” or “switch” or does it make no difference?

Most people have the strong intuition that it makes no difference. There are two doors left, they reason, so the chance that the car is behind Door A is 50%.

But that is wrong. In fact, the chance of winning if you stick with Door A is only 1/3; if you switch, your chances are 2/3.

By applying Bayes’s theorem, we can break this problem into simple pieces, and maybe convince ourselves that the correct answer is, in fact, correct.

To start, we should make a careful statement of the data. In this case $D$ consists of two parts: Monty chooses Door B *and* there is no car there.

Next we define three hypotheses: $A$, $B$, and $C$ represent the hypothesis that the car is behind Door A, Door B, or Door C. Again, let’s apply the table method:

| | Prior $\mathrm{p}(H)$ | Likelihood $\mathrm{p}(D\vert H)$ | $\mathrm{p}(H) \mathrm{p}(D\vert H)$ | Posterior $\mathrm{p}(H\vert D)$ |
|---|---|---|---|---|
| A | 1/3 | 1/2 | 1/6 | 1/3 |
| B | 1/3 | 0 | 0 | 0 |
| C | 1/3 | 1 | 1/3 | 2/3 |
Filling in the priors is easy because we are told that the prizes are arranged at random, which suggests that the car is equally likely to be behind any door.

Figuring out the likelihoods takes some thought, but with reasonable care we can be confident that we have it right:

  • If the car is actually behind A, Monty could safely open Doors B or C. So the probability that he chooses B is 1/2. And since the car is actually behind A, the probability that the car is not behind B is 1.

  • If the car is actually behind B, Monty has to open door C, so the probability that he opens door B is 0.

  • Finally, if the car is behind Door C, Monty opens B with probability 1 and finds no car there with probability 1.

Now the hard part is over; the rest is just arithmetic. The sum of the third column is 1/2. Dividing through yields ${{\mathrm{p}(A|D)}} = 1/3$ and ${{\mathrm{p}(C|D)}} = 2/3$. So you are better off switching.
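The arithmetic in the table can be reproduced with a small helper. `table_method` is a hypothetical name of my own, not something from the text; it multiplies priors by likelihoods and normalizes.

```python
from fractions import Fraction

def table_method(priors, likelihoods):
    """Return posteriors from priors and likelihoods via the table method."""
    products = {h: priors[h] * likelihoods[h] for h in priors}
    normalizer = sum(products.values())
    return {h: products[h] / normalizer for h in products}

priors = {h: Fraction(1, 3) for h in "ABC"}

# P(Monty opens B and shows no car | car is behind each door)
likelihoods = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}

posteriors = table_method(priors, likelihoods)
print(posteriors["A"], posteriors["B"], posteriors["C"])  # 1/3 0 2/3
```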

There are many variations of the Monty Hall problem. One of the strengths of the Bayesian approach is that it generalizes to handle these variations.

For example, suppose that Monty always chooses B if he can, and only chooses C if he has to (because the car is behind B). In that case the revised table is:

| | Prior $\mathrm{p}(H)$ | Likelihood $\mathrm{p}(D\vert H)$ | $\mathrm{p}(H) \mathrm{p}(D\vert H)$ | Posterior $\mathrm{p}(H\vert D)$ |
|---|---|---|---|---|
| A | 1/3 | 1 | 1/3 | 1/2 |
| B | 1/3 | 0 | 0 | 0 |
| C | 1/3 | 1 | 1/3 | 1/2 |

The only change is $\mathrm{p}(D|A)$. If the car is behind $A$, Monty can choose to open B or C. But in this variation he always chooses B, so ${{\mathrm{p}(D|A)}} = 1$.

As a result, the likelihoods are the same for $A$ and $C$, and the posteriors are the same: ${{\mathrm{p}(A|D)}} = {{\mathrm{p}(C|D)}} = 1/2$. In this case, the fact that Monty chose B reveals no information about the location of the car, so it doesn’t matter whether the contestant sticks or switches.

On the other hand, if he had opened $C$, we would know ${{\mathrm{p}(B|D)}} = 1$.
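Both cases of this variant can be sketched the same way. The likelihoods for the case where Monty opens C are my own reading of the rule "Monty always chooses B if he can": he opens C only when the car is behind B.

```python
from fractions import Fraction

priors = {h: Fraction(1, 3) for h in "ABC"}

# Variant: Monty opens B whenever he can, so P(D | A) rises from 1/2 to 1.
likelihoods = {"A": Fraction(1), "B": Fraction(0), "C": Fraction(1)}

products = {h: priors[h] * likelihoods[h] for h in priors}
normalizer = sum(products.values())
posteriors = {h: products[h] / normalizer for h in products}
print(posteriors["A"], posteriors["B"], posteriors["C"])  # 1/2 0 1/2

# If Monty opens C instead, that can only happen when the car is behind B.
likelihoods_c = {"A": Fraction(0), "B": Fraction(1), "C": Fraction(0)}
products_c = {h: priors[h] * likelihoods_c[h] for h in priors}
posteriors_c = {h: products_c[h] / sum(products_c.values()) for h in products_c}
print(posteriors_c["B"])  # 1
```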

I included the Monty Hall problem in this chapter because I think it is fun, and because Bayes’s theorem makes the complexity of the problem a little more manageable. But it is not a typical use of Bayes’s theorem, so if you found it confusing, don’t worry!


For many problems involving conditional probability, Bayes’s theorem provides a divide-and-conquer strategy. If $\mathrm{p}(A|B)$ is hard to compute, or hard to measure experimentally, check whether it might be easier to compute the other terms in Bayes’s theorem, $\mathrm{p}(B|A)$, $\mathrm{p}(A)$ and $\mathrm{p}(B)$.

If the Monty Hall problem is your idea of fun, I have collected a number of similar problems in an article called “All your Bayes are belong to us,” which you can read at