In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../notebook_format')
from formats import load_style
load_style()


Out[1]:

In [2]:
os.chdir(path)

Bayes's Theorem

Conjoint probability is a fancy way to say the probability that two things are true. If you learned about probability in the context of coin tosses and dice, you might have learned the following formula:

$$p( A \cap B ) = p( A \text{ and } B ) = p(A)p(B)$$

Meaning the probability of A and B occurring together is the probability of A occurring times the probability of B occurring. For example, if I toss two fair coins, the probability that both coins end up heads is 0.5 * 0.5 = 0.25.

The formula above only works when A and B are independent, meaning that the outcome of event A does not change the probability of event B, or more formally, $p(B|A) = p(B)$, where $p(B|A)$ denotes the probability of B given that A is true. For an example where the events are not independent, suppose A means that it rains today and B means that it rains tomorrow. If I know that it rained today, then it is more likely that it will rain tomorrow, so $p(B|A) > p(B)$.

Thus when the two events are not independent of one another, the formula above becomes:

$$p(A \text{ and } B) = p(A)p(B|A)$$

So if the chance of rain on any given day is 0.5, the chance of rain on two consecutive days is not 0.25, but probably a bit higher.

Next, we know that conjoint probabilities are symmetric (commutative), meaning that $p(A \text{ and } B) = p(B \text{ and } A)$. Putting the pieces together gives $p(A)p(B|A) = p(B)p(A|B)$, and dividing both sides by $p(B)$ gives us Bayes's theorem:

$$p(A \mid B) = \frac{ p(A) \, p(B \mid A) }{p(B)}$$

Using this formula, let's consider the following cookie problem:

Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each. Now suppose you choose one of the bowls at random and without looking select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1?

Using Bayes's theorem, this gives us $p(B_1 \mid V) = \frac{p(B_1)p(V \mid B_1)}{p(V)}$. We know $p(B_1)$, the probability that we chose Bowl 1, is 1/2; $p(V|B_1)$, the probability of getting a vanilla cookie from Bowl 1, is 3/4; and $p(V)$, the probability of drawing a vanilla cookie from either bowl, is 5/8 (there are 50 vanilla cookies out of 80 cookies total across the two bowls).

Plugging these values into the formula gives us 3/5. So the vanilla cookie we randomly selected is more likely to have come from Bowl 1.
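
We can check this arithmetic with a couple of lines of Python. This is just a sketch; the variable names are purely illustrative.

In [ ]:
# cookie problem: posterior probability that the vanilla cookie came from Bowl 1
prior_bowl1 = 1 / 2      # p(B1): we picked Bowl 1 at random
like_bowl1 = 3 / 4       # p(V|B1): 30 of Bowl 1's 40 cookies are vanilla
p_vanilla = 5 / 8        # p(V): 50 of the 80 cookies overall are vanilla

posterior_bowl1 = prior_bowl1 * like_bowl1 / p_vanilla
print(posterior_bowl1)   # 0.6, i.e. 3/5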

Diachronic Interpretation

An alternative way of looking at Bayes's theorem is that it gives us a way to update the probability of a hypothesis $H$ in light of some body of data $D$. This is called the diachronic interpretation, where “diachronic” means that something is happening over time. In other words, the probability of the hypotheses changes over time as we see new data. With this in mind, we can rewrite Bayes's theorem with a new set of notations:

$$p(H \mid D) = \frac{ p(H) \, p(D \mid H) }{p(D)}$$
  • $p(H)$ is the probability of the hypothesis before we see the data, called the prior probability. Sometimes we can compute the prior based on background information. For example, the cookie problem specifies that there are only two hypotheses: the cookie came from either Bowl 1 or Bowl 2. In other cases the prior is subjective; that is, people might disagree, either because they use different background information or because they interpret the same information differently.
  • $p(H|D)$ is what we want to compute, the probability of the hypothesis after we see the data, called the posterior probability.
  • $p(D|H)$ is the probability of the data under the hypothesis, called the likelihood. This is usually the easiest part to compute.
  • $p(D)$ is the probability of the data under any hypothesis, called the normalizing constant. In the cookie problem, there are only two hypotheses. In that case we can compute $p(D)$ using the law of total probability, which says that if there are two mutually exclusive ways that something might happen, we can add up the probabilities like this: $p(D) = p(B_1) p(D|B_1) + p(B_2) p(D|B_2)$. Plugging in the values from the cookie problem, we have $p(D) = (1/2)(3/4) + (1/2)(1/2) = 5/8$, as the sketch after this list also shows.
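
Here is a minimal sketch of this update for the cookie problem, computing the normalizing constant with the law of total probability. The dictionaries and names are just for illustration.

In [ ]:
# prior p(H) and likelihood p(D|H) for each hypothesis
priors = {'bowl 1': 1 / 2, 'bowl 2': 1 / 2}
likelihoods = {'bowl 1': 3 / 4, 'bowl 2': 1 / 2}   # p(vanilla | bowl)

# law of total probability: p(D) = sum of p(H) * p(D|H) over all hypotheses
p_data = sum(priors[h] * likelihoods[h] for h in priors)

# posterior p(H|D) for each hypothesis
posteriors = {h: priors[h] * likelihoods[h] / p_data for h in priors}
print(p_data)       # 0.625, i.e. 5/8
print(posteriors)   # {'bowl 1': 0.6, 'bowl 2': 0.4}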

For many problems involving conditional probability, Bayes’s theorem provides a divide-and-conquer strategy. If $p(A|B)$ is hard to compute, or hard to measure experimentally, check whether it might be easier to compute the other terms in Bayes’s theorem.

Let's look at another problem.

The M&M Problem

M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time. In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.

Suppose a friend of yours has two bags of M&M’s, and he tells you that one bag is from 1994 and the other is from 1996. He won’t tell you which is which, but he gives you one M&M from each bag. One is Yellow and one is Green. What is the probability that the Yellow one came from the 1994 bag?

This is similar to the cookie problem, with the twist that this time you'll be drawing one sample from each bag. This problem also gives us a chance to use the table method, which is useful for solving problems like this on paper. The first step is to enumerate the hypotheses. Call the bag that the Yellow M&M came from Bag 1, and the one that the Green M&M came from Bag 2. So the hypotheses are:

  • A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.
  • B: Bag 1 is from 1996 and Bag 2 from 1994.

Now we construct a table with a row for each hypothesis and a column for each term in Bayes’s theorem:

  Hypothesis | Prior $p(H)$ | Likelihood $p(D \mid H)$ | Prior * Likelihood $p(H)p(D \mid H)$ | Posterior $p(H \mid D)$
  A          | 1/2          | (20)(20)                 | 200                                  | 20/27
  B          | 1/2          | (14)(10)                 | 70                                   | 7/27
  • The first column has the priors. Based on the statement of the problem, it is reasonable to choose $p(A) = p(B) = 1/2$.
  • The second column has the likelihoods, which follow from the information in the problem. For example, if $A$ is true, the yellow M&M came from the 1994 bag with probability 20%, and the green came from the 1996 bag with probability 20%. If $B$ is true, the yellow M&M came from the 1996 bag with probability 14%, and the green came from the 1994 bag with probability 10%. Because the selections are independent, we get the conjoint probability by multiplying the two numbers.
  • The third column is just the product of the previous two. The sum of this column, 270, is the normalizing constant, $p(D)$. To get the last column, which contains the posteriors, we divide the third column by the normalizing constant.

Well, you might be bothered by one detail. In the table above, we wrote $p(D|H)$ in terms of pure numbers, not probabilities, which means it is off by a factor of 10,000. But that cancels out when we divide through by the normalizing constant, so it doesn’t affect the result.
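
The table method translates directly into a few lines of Python. As in the table, the likelihoods are written as whole numbers rather than probabilities, since the common factor cancels during normalization; the names are just illustrative.

In [ ]:
# priors and (unnormalized) likelihoods for the two hypotheses
priors = {'A': 1 / 2, 'B': 1 / 2}
likelihoods = {'A': 20 * 20,   # yellow from the 1994 bag, green from the 1996 bag
               'B': 14 * 10}   # yellow from the 1996 bag, green from the 1994 bag

products = {h: priors[h] * likelihoods[h] for h in priors}
normalizing_constant = sum(products.values())   # 270
posteriors = {h: products[h] / normalizing_constant for h in products}
print(posteriors)   # {'A': 0.7407..., 'B': 0.2592...}, i.e. 20/27 and 7/27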

Odds

Probabilities and odds are different representations of the same information. One way to represent a probability is with a number between 0 and 1, but that’s not the only way. If you have ever bet on a football game or a horse race, you have probably encountered another representation of probability, called odds.

You might have heard expressions like "the odds are three to one," but you might not know what that means. The odds in favor of an event are the ratio of the probability it will occur to the probability that it will not.

So if you think a team has a 75% chance of winning, you would say that the odds in their favor are "three to one", because the chance of winning is three times the chance of losing. You can write odds in decimal form, but it is most common to write them as a ratio of integers. So "three to one" is written $3:1$.

When probabilities are low, it is more common to report the odds against rather than the odds in favor. For example, if you think your horse has a 10% chance of winning, you would say that the odds against are $9:1$.


In [3]:
# given a probability, you can compute the odds like this:
def odds(p):
    return p / (1 - p)

# 75% chance of winning, and this is equivalent to
# saying the odds in favor of winning is
print(odds(0.75))


# given the odds in favor, you can convert to probability like this:
def probability(o):
    return o / (o + 1)

# the odds in favor of winning is 3
# this is equivalent to saying the probability of winning is
print(probability(3))


3.0
0.75

The Odds Form of Bayes’s Theorem

Recall that when we first looked at Bayes's theorem, we wrote it in the probability form:

$$\mathrm{p}(H|D) = \frac{ \mathrm{p}(H) \mathrm{p}(D|H) } {\mathrm{p}(D)}$$

If we have two hypotheses, $A$ and $B$, we can write the ratio of posterior probabilities like this:

$$\frac{ \mathrm{p}(A|D)} {\mathrm{p}(B|D) } = \frac{ \mathrm{p}(A) \mathrm{p}(D|A) }{ \mathrm{p}(B) \mathrm{p}(D|B) }$$

Notice that the normalizing constant, $\mathrm{p}(D)$, drops out of this equation.

If $A$ and $B$ are mutually exclusive and collectively exhaustive, that means $\mathrm{p}(B) = 1 - \mathrm{p}(A)$, so we can rewrite the ratio of the priors, and the ratio of the posteriors, as odds.

For odds in favor of $A$, we replace the ratio of the priors $\frac{ \mathrm{p}(A)}{\mathrm{p}(B)}$ with $\mathrm{o}(A)$, and doing the same for the ratio of the posteriors, we get:

$$\mathrm{o}(A|D) = \mathrm{o}(A) \frac{\mathrm{p}(D|A)}{\mathrm{p}(D|B)}$$

In other words, this formula says that the posterior odds are the prior odds (our relative belief in hypothesis $A$ versus $B$ before seeing the evidence) times the likelihood ratio (the relative probability of the evidence, supposing each hypothesis is true). This is the odds form of Bayes’s theorem.

This form is most convenient for computing a Bayesian update on paper or in your head. For example, let’s go back to the cookie problem:

Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each. Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1?

The prior probability is 50%, so the prior odds are $1:1$, or just 1. The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$. So the posterior odds are $3:2$, which corresponds to probability $3/5$.
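
Using the odds and probability helpers defined above, the same update takes a few lines (a quick sketch; the variable names are illustrative):

In [ ]:
prior_odds = odds(0.5)                  # 1.0, i.e. odds of 1:1
likelihood_ratio = (3 / 4) / (1 / 2)    # 1.5, i.e. 3/2
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)                   # 1.5, i.e. odds of 3:2
print(probability(posterior_odds))      # 0.6, i.e. probability 3/5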

Let's look at another example.

Oliver's Blood

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type "O" blood. The two traces are found to be of type "O" (a common type in the local population, having frequency 60%) and of type "AB" (a rare type, with frequency 1%).

Do these data, the traces found at the scene, give evidence in favor of the proposition that Oliver was one of the people who left blood at the scene?

To answer this question, we need to think about what it means for data to give evidence in favor of (or against) a hypothesis. Intuitively, we might say that data favors a hypothesis if the hypothesis is more likely to be true in light of the data.

In the cookie problem, the prior odds are $1:1$, or probability 50%. The posterior odds are $3:2$, or probability 60%. So we could say that the vanilla cookie is evidence in favor of Bowl 1.

The odds form of Bayes’s theorem provides a way to make this intuition more precise. Again:

$$\mathrm{o}(A|D) = \mathrm{o}(A) \frac{\mathrm{p}(D|A)}{\mathrm{p}(D|B)}$$

now by dividing both sides by $\mathrm{o}(A)$:

$$\frac{\mathrm{o}(A|D)}{\mathrm{o}(A)} = \frac{\mathrm{p}(D|A)}{\mathrm{p}(D|B)}$$

The term on the left is the ratio of the posterior and prior odds. The term on the right is the likelihood ratio, also called the Bayes factor.

  • If the Bayes factor is greater than 1, that means that the data were more likely under $A$ than under $B$. And since the odds ratio is also greater than 1, that means that the odds are greater, in light of the data, than they were before.
  • If the Bayes factor is less than 1, that means the data were less likely under $A$ than under $B$, so the odds in favor of $A$ go down.
  • Finally, if the Bayes factor is exactly 1, the data are equally likely under either hypothesis, so the odds do not change.

Now we can get back to the Oliver's Blood problem. If Oliver is one of the people who left blood at the crime scene, then he accounts for the "O" sample, so the probability of the data is just the probability that a random member of the population has type "AB" blood, which is 1%.

If Oliver did not leave blood at the scene, then we have two samples to account for. If we choose two random people from the population, what is the chance of finding one with type "O" and one with type "AB"? Well, there are two ways it might happen: the first person we choose might have type "O" and the second "AB", or the other way around. So the total probability is $2 (0.6) (0.01) = 1.2\%$.
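
A quick sketch of the resulting Bayes factor (the names are illustrative):

In [ ]:
# probability of the data if Oliver left one of the traces:
# he accounts for the "O" sample, so only the "AB" trace needs explaining
p_data_given_oliver = 0.01

# probability of the data if two random people left the traces:
# one "O" and one "AB", in either order
p_data_given_not_oliver = 2 * 0.6 * 0.01   # 0.012

bayes_factor = p_data_given_oliver / p_data_given_not_oliver
print(bayes_factor)   # about 0.83: less than 1, so the data favor "not Oliver"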

The likelihood of the data is slightly higher if Oliver is not one of the people who left blood at the scene, so the blood data is actually evidence against Oliver’s guilt.

This example is a little contrived, but it is an example of the counterintuitive result that data consistent with a hypothesis are not necessarily in favor of the hypothesis.

If this result is so counterintuitive that it bothers you, this way of thinking might help: the data consist of a common event, type "O" blood, and a rare event, type "AB" blood. If Oliver accounts for the common event, that leaves the rare event still unexplained. If Oliver doesn’t account for the "O" blood, then we have two chances to find someone in the population with "AB" blood. And that factor of two makes the difference.

Discussion

Among Bayesians, there are two approaches to choosing prior distributions. Some recommend choosing the prior that best represents background information about the problem; in that case the prior is said to be informative. The problem with using an informative prior is that people might use different background information (or interpret it differently). So informative priors often seem subjective. The alternative is a so-called uninformative prior, which is intended to be as unrestricted as possible, in order to let the data speak for themselves.

Uninformative priors may seem more appealing because they are more objective, but many people still favor informative priors. Why? First, Bayesian analysis is always based on modeling decisions. Choosing the prior is one of those decisions, but it is not the only one, and it might not even be the most subjective. So even if an uninformative prior is more objective, the entire analysis is still subjective.

Also, for most practical problems, you are likely to be in one of two regimes: either you have a lot of data or not very much. If you have a lot of data, the choice of the prior doesn’t matter very much; informative and uninformative priors yield almost the same results. But if you don’t have much data, using relevant background information makes a big difference.