Probability Theory

What we need is a mathematical description of uncertainty.

Probability distributions make intuitive sense for data, because they encode what might have been.

How is it that they can also be used to quantify degree of belief in a model, or parameter value?

The Rules of Probability Theory

Following MacKay Chapter 2:

If $x$ is an "outcome" (or an "event", or "the value of a random variable") then ${\rm Pr}(x)$ describes the probability of us getting outcome $x$ (or event $x$ happening, or the variable taking the value $x$). In the long run, this probability will equal the frequency of $x$ (provided nothing changes in the meantime).

If $y$ is another outcome (or event, or variable), we define the joint probability of both $x$ and $y$ as the joint probability ${\rm Pr}(x,y)$, a 2-dimensional object.

The frequencies of all $27^2$ 2-letter bigrams $xy$ in the Linux FAQ, from MacKay Ch. 2.

Conditional Probability

The probability of getting both $x$ and $y$ is given by the product of the probability of getting $y$ and the probability of getting $x$ given that we already got $y$:

${\rm Pr}(x,y)\;={\rm Pr}(x|y)\;{\rm Pr}(y)$

The probability of getting both $x$ and $y$ is also given by the product of the probability of getting $x$ and the probability of getting $y$ given that we already got $x$:

${\rm Pr}(x,y)\;={\rm Pr}(y|x)\;{\rm Pr}(x)$

The conditional probabilities for bigrams $xy$ in the Linux FAQ, from MacKay Ch. 2. The one on the left helps answer the question: which letter is likely to come next?

Q: What additional information do you need to turn the joint distribution above into these figures?

Marginal Probability

The probability of getting a particular value of $x$, no matter what $y$ is, means summing over all possible $y$ values

${\rm Pr}(x) = \sum_y\;{\rm Pr}(x,y)$

If $x$ is a continuous variable, the probability densities follow the same rules, and the sum becomes a marginalization integral:

${\rm Pr}(x) = \int\;{\rm Pr}(x,y)\;dy$

Probabilities (and probability densities) are everywhere positive, and sum to one:

$\int\;{\rm Pr}(x,y)\;dx\,dy = 1$

Additional Dependencies

Assigning values to probabilities inevitably involves keeping track of constraints, assumptions etc. In the above example, $H$ might stand for the statement "we are looking in the FAQ Manual for Linux", such that all our probabilities are estimated given this statement:

${\rm Pr}(x,y\,|\,H)\;={\rm Pr}(x\,|\,y,H)\;{\rm Pr}(y\,|\,H)$

Notice how the $H$ gets added after the "|" in all terms. In data analysis, its helpful to keep the $H$ in, to stand for "all of our assumptions", or equivalently, our model.

Degrees of Belief

In everyday life we use words like "probability," "chances", "odds" and so on all the time to describe our belief in things.

Matilda

Here's an example, due to poet and educator Hilaire Belloc:

Suppose you are a firefighter. A call comes through from a person who simply reports that a little girl is leaning out of a window at number 53, shouting "Fire"! But last week, there was a very similar report, also of a little girl leaning out of a window at number 53 and shouting "Fire!", which you responded to, only to find that there was actually no fire at all!

The key question is: do you believe that there is a fire at number 53?

To answer it you need to estimate the probability that there is a fire at number 53, given what you know about Matilda and her previous lies. There are two propositions here: $x$ = "there's a fire" and $y$ = "Matilda is telling the truth." Belief that there is a fire given that Matilda is known to tell lies, $B(x|y)$, is different from $B(x)$ given no other information.

What matters right now is your own level of uncertainty - and not the frequency of fires (reported or otherwise) in the future.

Q: Where else does subjective probability appear in everyday life?

Discuss degree of belief and quantifiable probability with your neighbor, and see if you can come up with an example. Whose beliefs are involved, and how is the probability quantified? Can you distinguish between a probability that is equal to a frequency, and a probability that describes a degree of belief, in your scenario?

Formally: The Cox Axioms

Let $x$ instead be a proposition, and the degree to which we believe it to be true be $B(x)$.

$B(x)$ can be mapped onto, and indeed replaced by, probability ${\rm Pr}(x)$ if the Cox Axioms are satisfied. As a checklist, these are:
1. If your degree of belief $B(x)$ is greater than your degree of belief $B(y)$, and your degree of belief $B(y)$ is greater than your degree of belief $B(z)$, is $B(x)$ greater than $B(z)$? Yes: this means that degree of belief can be represented by numbers
2. Is degree of belief $B(x)$ related to $B(NOT x)$? Yes: one goes up as the other goes down, like probabilities.
3. Does your degree of belief in both $x$ and $y$ depend on the degree of belief in $y$ and the degree of belief in $x$ given $y$?

This third one is the most interesting to think about: it's maybe not obvious at first sight that degrees of belief do follow this pattern, and so can be represented by probabilities. But your experience with things like small children, bookmakers, and so on involve conditional belief expressed as probability.

Bayes Theorem

By using probability to describe our degree of belief in propositions, or in science, models, we can combine the results above and write:

${\rm Pr}(x|d,H)\;=\frac{1}{Z}{\rm Pr}(d|x,H)\;{\rm Pr}(x|H)$

where $Z$ is a normalization factor given by

$Z = \int {\rm Pr}(d|x,H)\;{\rm Pr}(x|H)\;dx = {\rm Pr}(d|H)$

We could also write that:

${\rm Pr}(H|d)\;=\;{\rm Pr}(d|H)\;{\rm Pr}(H)\;/\;{\rm Pr}(d)$

This is "Bayes Theorem," and it describes the process by which we can learn about models ($H$) and their parameters ($x$).

Your degree of belief in a model $H$ given some data $d$ is ${\rm Pr}(H|d)$, and it depends on what your degree of belief in it was before, ${\rm Pr}(H)$, and how probable the model would make the data, ${\rm Pr}(d|H)$.

Likewise, your uncertainty about a parameter's value before you see any data is the "prior" probability ${\rm Pr}(x|H)$, but this is updated by the data to become the "posterior" probability ${\rm Pr}(x|d,H)$. The factor by which it is updated is the "likelihood", ${\rm Pr}(d|x,H)$.

Exercise: Diagnosis

You go to the doctor to be tested for a rare disease, and the test comes back positive.

The test is pretty good: it has a 95% true positive rate, and a 10% false positive rate. The incidence of the disease in the general population is 1 in 5000.

What is the probability that you actually have the disease? Are your chances of being sick better or worse than 95%?

Let's edit the code below and compute this posterior probability using Bayes Theorem.



In [20]:

    
P_testpositive_given_sick = 0.95
P_testpositive_given_healthy = 0.10
P_sick = 1.0/5000.0
P_healthy = 1 - P_sick

Z = P_testpositive_given_sick*P_sick + \
    P_testpositive_given_healthy*P_healthy

P_sick_given_testpositive = P_testpositive_given_sick*P_sick / Z

print("Probability of actually having the disease: ",round(P_sick_given_testpositive,3)*100,"%")









    



('Probability of actually having the disease: ', 0.2, '%')

In this example, the prior probability of having the disease is so low that it takes a much better test to raise your chances of having the disease above 5%. Try changing the test quality each way (reduce the flase positive rate or increase the true positive rate) and see how good it really needs to be.

Frequentist and Bayesian Thinking

Allowing parameters and models to have probability distributions that quantify our degree of belief in them is the key idea in Bayesian statistics.

The alternative is to maintain that only data are to be considered as having been drawn from a probability, or frequency, distribution. This puts emphasis on datasets that could have been taken but were not, and leads to answers of the form "the probability of getting the data by chance is p".

Bayesian analysis puts emphasis on understanding the data you have, and gaining understanding (via a model) from that data.

The probability for the data, $\Pr(d|x,H)$, A.K.A the sampling distribution, or likelihood, feature in both approaches - but different questions are answered with it.

Now let's go back to the Xray Image



In [ ]: