By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie

Notebook released under the Creative Commons Attribution 4.0 License.

Conditional probability refers to probability that takes some extra information about the state of the world into account. For instance, we can ask, "what is the probability that I have the flu, given that I am sneezing?" or, "what is the probability that this stock will have a return above the risk-free rate, given that it has positive return?" In both cases, the extra information changes the probability. We write the probability of $A$ given $B$ as $P(A|B)$. When looking at conditional probability, we are only considering situations where $B$ occurs, and we don't care about the situations where it does not.

If we compute conditional probability empirically, we can use simple counting. Just look at the set of data points for which $B$ is true, and compute the fraction of them that have $A$ true as well. For instance, we can ask, "what is the probability that a stock is in the top decile of returns this month, given that it was in the top decile last month?" To compute this, we count the number of stocks that were in the top decile both months, and divide it by the number of stocks in the top decile last month.
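As a minimal sketch of the counting approach, the snippet below estimates $P(A|B)$ from a small made-up sample. The `observations` list and the `conditional_probability` helper are illustrative assumptions, not data or code from this lecture, and the toy flags are not real decile memberships.

```python
# Hypothetical toy data: one (top_last_month, top_this_month) flag pair per stock.
observations = [
    (True, True), (True, False), (True, True), (False, False),
    (False, True), (True, False), (False, False), (False, False),
    (True, True), (False, False),
]


def conditional_probability(pairs):
    """Estimate P(A|B) from (B, A) pairs by simple counting."""
    b_count = sum(1 for b, _ in pairs if b)            # points where B is true
    both_count = sum(1 for b, a in pairs if b and a)   # points where A and B are both true
    if b_count == 0:
        raise ValueError("B never occurs, so P(A|B) is undefined")
    return both_count / b_count


p_stay_top = conditional_probability(observations)
print(p_stay_top)  # 3 of the 5 stocks with B true also have A true -> 0.6
```

Only the rows where $B$ is true enter the denominator, which mirrors the idea that we ignore situations where $B$ does not occur.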

The following formula holds for conditional probabilities:

$$ P(A|B) = \frac{P(A\text{ and } B)}{P(B)} $$

This can be written as $P(A|B)P(B) = P(A\text{ and } B)$, that is, the probability that $A$ and $B$ both occur is the probability that $B$ occurs, times the probability that $A$ occurs given that $B$ occurs. Notice that $P(A\text{ and } B)$ is symmetric in $A$ and $B$, so we can also say $P(A\text{ and } B) = P(B|A)P(A)$.
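One way to check this identity is to enumerate a small sample space exactly. The sketch below uses two fair dice, with $A$ = "the sum is 8" and $B$ = "the first die is even". Both events are chosen only for illustration and are not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event, given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def is_a(o):
    return o[0] + o[1] == 8   # A: the sum is 8


def is_b(o):
    return o[0] % 2 == 0      # B: the first die is even


p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))

p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

print(p_a_given_b)  # 1/6
print(p_b_given_a)  # 3/5
# Both factorizations recover the same joint probability.
print(p_a_given_b * p_b == p_b_given_a * p_a == p_a_and_b)  # True
```

Using `Fraction` keeps the arithmetic exact, so the equality check does not depend on floating-point rounding.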

Conditional probabilities satisfy the *total probability rule*, which states that if $S_1, S_2, \ldots, S_n$ are mutually exclusive and exhaustive events, then
$$ P(A) = P(A|S_1)P(S_1) + P(A|S_2)P(S_2) + \ldots + P(A|S_n)P(S_n) $$
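A small numerical illustration of the rule follows. The three market regimes and every probability in `regimes` are made-up values assumed for this sketch. Here $A$ is "the stock has a positive return" and the regimes play the role of $S_1, S_2, S_3$.

```python
import math

# Hypothetical scenarios: name -> (P(S_i), P(A | S_i)).
regimes = {
    "bull": (0.5, 0.7),
    "flat": (0.3, 0.5),
    "bear": (0.2, 0.2),
}

# The scenarios must be mutually exclusive and exhaustive, so P(S_i) sums to 1.
total_scenario_mass = sum(p_s for p_s, _ in regimes.values())
if not math.isclose(total_scenario_mass, 1.0):
    raise ValueError("scenario probabilities must sum to 1")

# Total probability rule: P(A) = sum over i of P(A | S_i) * P(S_i).
p_positive = sum(p_a_given_s * p_s for p_s, p_a_given_s in regimes.values())
print(f"{p_positive:.2f}")  # 0.54
```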

We can also use conditional probability for inference. This is useful when we want to know the state of a variable which is unobservable, because we can infer it from observable ones. In this case, the equation for conditional probability above isn't simply a true statement about known quantities, but a way to compute an unknown.

A useful restatement of the equation is

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

This is known as Bayes' formula. $P(A)$ is called the *prior probability* and $P(A|B)$ is called the *posterior probability* (the probability after taking into account the observation of $B$).

As an example, imagine that a certain test for disease X comes up positive for 99% of subjects with the disease, and for 5% of healthy subjects. Statistically, 1% of the population have the disease. So, your prior probability for having the disease before you get tested is 1%. If your test comes up positive, you have to update the probability that you have the disease. Say $B$ is the test coming up positive, and $A$ is having the disease. Then $P(B|A) = .99$ and $P(A) = .01$. To compute $P(B)$, we can use the total probability rule over the two cases "diseased" and "healthy": $P(B) = P(B|A)P(A) + P(B|\text{not } A)P(\text{not } A) = .99 \cdot .01 + .05 \cdot .99 = .99 \cdot .06 = .0594$. Then the posterior probability that you have the disease, given that your test comes up positive, is $P(A|B) = \frac{.99 \cdot .01}{.0594} \approx 16.7\%$. Even with a positive result the disease remains unlikely, because healthy subjects vastly outnumber sick ones and so produce most of the positive tests.
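The arithmetic in this example can be reproduced with a short function. This is a sketch: the function name `posterior_given_positive` and its parameter names are assumptions introduced here, while the numbers are those stated in the example.

```python
def posterior_given_positive(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(disease | positive test) using Bayes' formula."""
    # P(B) from the total probability rule over the cases "diseased" and "healthy".
    p_positive = (p_pos_given_disease * prior
                  + p_pos_given_healthy * (1 - prior))
    # Bayes' formula: P(A|B) = P(B|A) P(A) / P(B).
    return p_pos_given_disease * prior / p_positive


p_disease = posterior_given_positive(
    prior=0.01,                # 1% of the population have the disease
    p_pos_given_disease=0.99,  # positive for 99% of subjects with the disease
    p_pos_given_healthy=0.05,  # positive for 5% of healthy subjects
)
print(f"{p_disease:.1%}")  # 16.7%
```

Changing `prior` shows how strongly the answer depends on the base rate: a rarer disease gives a lower posterior for the same test.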

Bayes' formula, and conditional probability in general, work for continuous distributions as well as for discrete probabilities. We can therefore compute the conditional distribution of a random variable $A$ given a random variable $B$ (which is a function of two variables), or the distribution of $A$ given a particular observation $b$ of $B$.

Bayesian inference is very useful for online estimation and prediction, where we update our estimate as new evidence comes in. Examples of this include neural networks and Kalman filters. This allows us to have some estimate to work with at all times, while continually improving it using the newest observations. The process for this is:

1. Start with a prior, based either on your own prior knowledge or estimated from past data using statistical techniques.
2. When a new data point comes in, use Bayes' formula to update the prior distribution to a posterior distribution.
3. The posterior distribution is now your new prior. Go to step 2.
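The update loop described above can be sketched with a discrete set of hypotheses. Everything in this example is an assumption made for illustration: the three candidate values for the probability of an up day, the uniform prior, the made-up observation sequence, and the helper name `update`. It also assumes the observations are independent given the hypothesis.

```python
def update(prior, likelihoods):
    """One Bayes step. Both arguments are dicts keyed by hypothesis."""
    unnormalized = {h: likelihoods[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(B), via the total probability rule
    return {h: weight / evidence for h, weight in unnormalized.items()}


# Hypothetical candidate values for the probability that a day is an up day.
hypotheses = [0.4, 0.5, 0.6]

# Start with a prior: here, equal weight on every hypothesis.
belief = {h: 1 / len(hypotheses) for h in hypotheses}

# Made-up observations arriving one at a time: 1 = up day, 0 = down day.
observations = [1, 1, 0, 1, 1, 0, 1, 1]

for obs in observations:
    # Likelihood of this single observation under each hypothesis.
    likelihoods = {h: (h if obs == 1 else 1 - h) for h in hypotheses}
    # The posterior replaces the prior before the next data point arrives.
    belief = update(belief, likelihoods)

for h, p in belief.items():
    print(f"P(p_up = {h}) = {p:.3f}")
```

With six up days out of eight, most of the posterior mass ends up on the 0.6 hypothesis, and a usable estimate is available after every observation rather than only at the end.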

```
# TODO example
# sklearn.naive_bayes?
```