Basic Statistics

G. Richards, 2016

Resources for this material include Ivezic Sections 1.2, 3.0, and 3.1, Karen Leighly's Bayesian Statistics Lecture, and Jo Bovy's class, specifically Lecture 1.

Last time we worked through some examples of the kinds of things that we will be doing later in the course. But before we can do the fun stuff, we need to lay some statistical groundwork. Some of you may have encountered some of this material in Math 311.

Notation

First we need to go over some of the notation that the book uses.

$x$ is a scalar quantity, measured $N$ times

$x_i$ is a single measurement with $i=1,...,N$

$\{x_i\}$ refers to the set of all N measurements

We are generally trying to estimate $h(x)$, the true distribution from which the values of $x$ are drawn. We will refer to $h(x)$ as the probability density (distribution) function or the "pdf", and $h(x)dx$ is the probability of a value lying between $x$ and $x+dx$.

While $h(x)$ is the "true" pdf (or population pdf), what we measure from the data is the empirical pdf, which is denoted $f(x)$. So, $f(x)$ is a model of $h(x)$. In principle, with infinite data $f(x) \rightarrow h(x)$, but in reality measurement errors keep this from being strictly true.

If we are attempting to guess a model for $h(x)$, then the process is parametric. With a model solution we can generate new data that should mimic what we measure. If we are not attempting to guess a model, then the process is nonparametric. That is, we are just trying to describe the data that we see in the most compact manner that we can, but we are not trying to produce mock data.

The histograms that we made last time are an example of a nonparametric method of describing data.

We could summarize the goal of this class as an attempt to 1) estimate $f(x)$ from some real (possibly multi-dimensional) data set, 2) find a way to describe $f(x)$ and its uncertainty, 3) compare it to models of $h(x)$, and then 4) use the knowledge that we have gained in order to interpret new measurements.

Probability

The probability of $A$, $p(A)$, is the probability that some event will happen (say a coin toss), or if the process is continuous, the probability of $A$ falling in a certain range. (N.B., technically these two things are different and are sometimes indicated by $P$ and $p$, but I'm ignoring that here.) $p(A)$ must be non-negative for all $A$ and the sum/integral of the pdf must be unity.

If we have two events, $A$ and $B$, the possible combinations are illustrated by the following figure:

$A \cup B$ is the union of sets $A$ and $B$.

$A \cap B$ is the intersection of sets $A$ and $B$.

The probability that either $A$ or $B$ will happen (which could include both) is the union, given by

$$p(A \cup B) = p(A) + p(B) - p(A \cap B)$$

The figure makes it clear why the last term is necessary. Since $A$ and $B$ overlap, we are double-counting the region where both $A$ and $B$ happen, so we have to subtract this out.
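We can check the inclusion-exclusion formula by brute-force enumeration. Here is a quick sketch using made-up events on a fair six-sided die (the events are my own choice for illustration, not from the text):

```python
from fractions import Fraction

# Check p(A ∪ B) = p(A) + p(B) - p(A ∩ B) on a fair six-sided die.
# Hypothetical events: A = "roll is even", B = "roll is greater than 3".
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}   # {2, 4, 6}
B = {x for x in outcomes if x > 3}        # {4, 5, 6}

def p(S):
    """Probability of a set of outcomes under a uniform die."""
    return Fraction(len(S), 6)

lhs = p(A | B)                  # union, counted directly
rhs = p(A) + p(B) - p(A & B)    # inclusion-exclusion
print(lhs, rhs)                 # 2/3 2/3
```

Without the $-p(A \cap B)$ term we would get $1/2 + 1/2 = 1$, double-counting the outcomes $\{4, 6\}$ that belong to both sets.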

The probability that both $A$ and $B$ will happen, $p(A \cap B)$, is $$p(A \cap B) = p(A|B)p(B) = p(B|A)p(A)$$

where $p(A|B)$ is the probability of $A$ given that $B$ is true and is called the conditional probability. So the $|$ is short for "given that".

The law of total probability says that

$$p(A) = \sum_ip(A|B_i)p(B_i)$$

N.B. Just to be annoying, different people use different notation and the following all mean the same thing $$p(A \cap B) = p(A,B) = p(AB) = p(A \,{\rm and}\, B)$$

I'll use the comma notation as that is what the book uses.

It is important to realize that the following is always true $$p(A,B) = p(A|B)p(B) = p(B|A)p(A)$$

However, if $A$ and $B$ are independent, then

$$p(A,B) = p(A)p(B)$$

Let's look at an example.

If you have a bag with 5 marbles (3 yellow and 2 blue) and you want to know the probability of picking 2 yellow marbles in a row, that would be

$$p(Y_1,Y_2) = p(Y_1)p(Y_2|Y_1).$$

$p(Y_1) = \frac{3}{5}$ since you have an equally likely chance of drawing any of the 5 marbles.

If you did not put the first marble back in the bag after drawing it (sampling without "replacement"), then the probability $p(Y_2|Y_1) = \frac{2}{4}$, so that $$p(Y_1,Y_2) = \frac{3}{5}\frac{2}{4} = \frac{3}{10}.$$

But if you put the first marble back, then $p(Y_2|Y_1) = \frac{3}{5} = p(Y_2)$, so that $$p(Y_1,Y_2) = \frac{3}{5}\frac{3}{5} = \frac{9}{25}.$$

In the first case $A$ and $B$ (or rather $Y_1$ and $Y_2$) are not independent, whereas in the second case they are.

We say that two random variables, $A$ and $B$ are independent iff $p(A,B) = p(A)p(B)$ such that knowing $B$ does not give any information about $A$.
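The marble example is easy to verify with a quick simulation (a sketch; the function name and trial count are my own choices):

```python
import random

random.seed(42)

def draw_two_yellow(replace, trials=100_000):
    """Estimate p(Y1, Y2): both draws yellow from a bag of 3 yellow + 2 blue."""
    both = 0
    for _ in range(trials):
        bag = ["Y"] * 3 + ["B"] * 2
        first = bag.pop(random.randrange(len(bag)))
        if replace:
            bag.append(first)                 # sampling with replacement
        second = bag[random.randrange(len(bag))]
        both += (first == "Y" and second == "Y")
    return both / trials

print(draw_two_yellow(replace=False))  # ≈ 3/10 = 0.30
print(draw_two_yellow(replace=True))   # ≈ 9/25 = 0.36
```

The two estimates differ because without replacement the second draw depends on the first, i.e., $Y_1$ and $Y_2$ are not independent.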

A more complicated example from Jo Bovy's class at UToronto

So $$p(A \,{\rm or}\, B|C) = p(A|C) + p(B|C) - p(A \, {\rm and}\, B|C)$$

We could get more complicated than that, but let's leave it there for now as this is all that we need right now.

Need more help with this? Try watching some Khan Academy videos and working through the exercises: https://www.khanacademy.org/math/probability/probability-geometry

https://www.khanacademy.org/math/precalculus/prob-comb

Bayes' Rule

We have that $$p(x,y) = p(x|y)p(y) = p(y|x)p(x)$$

We can define the marginal probability as $$p(x) = \int p(x,y)dy,$$ where marginal means essentially projecting on to one axis (integrating over the other axis).

We can re-write this as $$p(x) = \int p(x|y)p(y) dy$$
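A discrete version of this marginalization can be checked numerically. The probabilities below are toy numbers of my own invention, chosen only so that each conditional distribution is normalized:

```python
import numpy as np

# Discrete sketch of p(x) = sum_y p(x|y) p(y) on a small grid.
p_y = np.array([0.2, 0.5, 0.3])          # p(y) for three y values
p_x_given_y = np.array([[0.7, 0.3],      # p(x|y=y0)
                        [0.4, 0.6],      # p(x|y=y1)
                        [0.1, 0.9]])     # p(x|y=y2); each row sums to 1

p_xy = p_x_given_y * p_y[:, None]        # joint: p(x, y) = p(x|y) p(y)
p_x = p_xy.sum(axis=0)                   # marginalize (sum over y)
print(p_x)                               # the marginal distribution of x
print(p_x.sum())                         # normalization is preserved
```

Summing the joint over $y$ "projects" it onto the $x$ axis, which is exactly what the integral above does in the continuous case.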

An illustration might help. In the following figure, we have a 2-D distribution in $x-y$ parameter space. Here $x$ and $y$ are not independent as, once you pick a $y$, your values of $x$ are constrained.

The marginal distributions are shown on the left and bottom sides of the left panel. As the equation above says, this is just the integral along the $x$ direction for a given $y$ (left side panel) or the integral along the $y$ direction for a given $x$ (bottom panel).

The three panels on the right show the conditional probability (of $x$) for three $y$ values: $p(x|y=y_0)$. These are just "slices" through the 2-D distribution.

Since $p(x|y)p(y) = p(y|x)p(x)$ we can write that $$p(y|x) = \frac{p(x|y)p(y)}{p(x)} = \frac{p(x|y)p(y)}{\int p(x|y)p(y) dy}$$ which in words says that

the (conditional) probability of $y$ given $x$ is just the (conditional) probability of $x$ given $y$ times the (marginal) probability of $y$ divided by the (marginal) probability of $x$, where the latter is just the integral of the numerator.

This is Bayes' rule, which itself is not at all controversial, though its application can be, as we'll discuss later.

Example: Legos

An example with Legos (it's awesome): https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego

Example: Monty Hall Problem

You are on a game show and are shown 2 doors. One has a car behind it, the other a goat. What are your chances of picking the door with the car?

OK, now there are 3 doors: one with a car, two with goats. The game show host asks you to pick a door, but not to open it yet. Then the host opens one of the other two doors (that you did not pick), making sure to select one with a goat. The host offers you the opportunity to switch doors. Do you?

Now you are back at the 2 door situation. But what can you make of your prior information?

$p(1{\rm st \; choice}) = 1/3$

$p({\rm other}) = 2/3$, which doesn't change after the host opens a door without the prize. So, switching doubles your chances. But only because you had prior information. If someone walked in after the "bad" door was opened, then their probability of winning is the expected $1/2$.

For $N$ choices, revealing $N-2$ "answers" doesn't change the probability of your original choice: it is still $\frac{1}{N}$. But it does change what you know about the other remaining choice, whose probability is now $\frac{N-1}{N}$.

This is an example of the use of conditional probability, where we have $p(A|B) \ne p(A)$.
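A short simulation makes the switching advantage concrete (a sketch; the helper name and trial count are my own choices):

```python
import random

random.seed(0)

def monty_hall(switch, trials=100_000):
    """Fraction of games won, with or without switching doors."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a door that is neither your pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))  # ≈ 1/3
print(monty_hall(switch=True))   # ≈ 2/3
```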

Example: Contingency Table

We can also use Bayes' rule to learn something about false positives and false negatives.

Let's say that we have a test for a disease. The test can be positive ($T=1$) or negative ($T=0$) and one can either have the disease ($D=1$) or not ($D=0$). So, there are 4 possible combinations: $$T=0; D=0 \;\;\; {\rm true \; negative}$$ $$T=0; D=1 \;\;\; {\rm false \; negative}$$ $$T=1; D=0 \;\;\; {\rm false \; positive}$$ $$T=1; D=1 \;\;\; {\rm true \; positive}$$

If all four outcomes were equally likely, you would have a 50% chance of being misdiagnosed. Not good! But the probability of disease and the accuracy of the test presumably are not random.

If the rates of false positive and false negative are: $$p(T=1|D=0) = \epsilon_{\rm FP}$$ $$p(T=0|D=1) = \epsilon_{\rm FN}$$

then the true positive and true negative rates are just: $$p(T=0| D=0) = 1-\epsilon_{\rm FP}$$ $$p(T=1| D=1) = 1-\epsilon_{\rm FN}$$

In graphical form this is:

If we have a prior regarding how likely the disease is, we can take this into account.

$$p(D=1)=\epsilon_D$$

and then $p(D=0)=1-\epsilon_D$.

Bayes' rule then can be used to help us determine how likely it is that you have the disease if you tested positive:

$$p(D=1|T=1) = \frac{p(T=1|D=1)p(D=1)}{p(T=1)},$$

where $$p(T=1) = p(T=1|D=0)p(D=0) + p(T=1|D=1)p(D=1).$$

So $$p(D=1|T=1) = \frac{(1 - \epsilon_{FN})\epsilon_D}{\epsilon_{FP}(1-\epsilon_D) + (1-\epsilon_{FN})\epsilon_D} \approx \frac{\epsilon_D}{\epsilon_D+\epsilon_{FP}}$$

Wondering why we can't just read $p(D=1|T=1)$ off the table? That's because the table entries are conditional probabilities of the test given the disease, e.g., $p(T=1|D=1)$; what we want is the conditional probability of the disease given the test.

That means that to get a reliable diagnosis, we need $\epsilon_{FP}$ to be quite small. (We want $p(D=1|T=1)$ to be close to unity; otherwise a positive test is likely to be a false positive.)

Take an example with a disease rate of 1% and a false positive rate of 2%.

So we have $$p(D=1|T=1) = \frac{0.01}{0.01+0.02} = 0.333$$

Then in a sample of 1000 people, 10 people will actually have the disease $(1000*0.01)$, but another 20 $(1000*0.02)$ will test positive!
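Plugging the example numbers into the exact expression for $p(D=1|T=1)$ shows how close the approximation is (here assuming a negligible false-negative rate, which the text's approximation implicitly does):

```python
def p_disease_given_positive(eps_D, eps_FP, eps_FN):
    """Bayes' rule for p(D=1 | T=1), with prevalence eps_D and the
    false-positive / false-negative rates defined in the text."""
    num = (1 - eps_FN) * eps_D                 # p(T=1|D=1) p(D=1)
    den = eps_FP * (1 - eps_D) + num           # p(T=1), law of total probability
    return num / den

# Example from the text: 1% disease rate, 2% false positives,
# and (an assumption here) zero false negatives.
print(p_disease_given_positive(eps_D=0.01, eps_FP=0.02, eps_FN=0.0))
```

The exact value is $0.01/(0.02 \times 0.99 + 0.01) \approx 0.336$, essentially the $0.333$ given by the approximation above.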

Models and Data

In this class, we generally won't be dealing with the probability of events $A$ and $B$, rather we will be dealing with models and data, where we are trying to determine the model, given the data. So, we can rewrite Bayes' rule as $$p({\rm model}|{\rm data}) = \frac{p(\rm{data}|\rm{model})p(\rm{model})}{p(\rm{data})}.$$

We can write this in words as: $${\rm Posterior\; Probability} = \frac{{\rm Likelihood}\times{\rm Prior}}{{\rm Evidence}},$$

where we interpret the posterior probability as the probability of the model (including the model parameters).

We'll talk more about models next time.

GTR

Bovy Lecture 1, Slides 26-50 for good stuff on stats parameters and distributions