Review of Probability

Goals:

  • Review the bits of mathematical probability that will be most important for us later.
  • Prime our brains for probabilistic reasoning.

This will be quick -- the key concepts will already be familiar if you've studied statistical and/or quantum mechanics.

References

  • Ross Ch. 2,3,6
  • Ivezic 3.1
  • MacKay 2.1-2.2

Some terminology

  • Sample space ($\Omega$): the set of all possible answers/outcomes for a given question/experiment.
  • Event ($E$): any subset of $\Omega$.

The probability of an event is given by a real-valued function satisfying certain requirements...

Axioms of probability:

  • $\forall E: 0 \leq P(E) \leq 1$
  • $P(\Omega) = P\left(\bigcup_{\mathrm{all~}i} E_i\right) = 1$
  • If $E_i$ are mutually exclusive, $P\left(\bigcup_i E_i\right) = \sum_i P(E_i)$

This dry definition provides a function with the right properties to describe our intuitive understanding of probability.
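
As a minimal concrete check of these axioms (a sketch using only the Python standard library, with a fair six-sided die as the example), we can define a uniform probability function on a finite sample space and verify each property:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(E) under the uniform distribution on omega."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
odd = {1, 3, 5}

assert 0 <= prob(even) <= 1                          # first axiom
assert prob(omega) == 1                              # second axiom
assert prob(even | odd) == prob(even) + prob(odd)    # mutually exclusive events add
print(prob(even), prob(odd), prob(even | odd))       # 1/2 1/2 1
```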

A familiar example

Let $\Omega$ be the set of states available to a system of fixed energy, e.g. a box full of gas particles.

With one additional (physics) assumption, that it's equally probable for the system to occupy any state in $\Omega$, this is the microcanonical ensemble in statistical mechanics.

Discrete vs. continuous sample spaces

Very often the type of event we're interested in lives in a continuous sample space. E.g., the Hubble parameter is $h=0.7$.

Our axioms mostly translate straightforwardly; in this example $P(\Omega)=1$ becomes the normalization condition

$\int_{-\infty}^{\infty} p(h=x)\,dx = 1$

We can always describe the discrete case as a continuous one where $p$ is a sum of Dirac delta functions.
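
As a quick numerical check of this normalization condition (a sketch assuming numpy and scipy are available; the Gaussian form for $p(h)$ is an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats, integrate

# Arbitrary illustrative PDF for h: a Gaussian centered on 0.7 with width 0.05
p = stats.norm(loc=0.7, scale=0.05).pdf

total, err = integrate.quad(p, -np.inf, np.inf)
print(total)  # ~1.0, as required by the normalization condition
```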

More definitions

If $X$ takes real values, then $p(X=x)$ is a probability density function, or PDF.

  • $p(X=x)$ is not a probability! But $p(X=x)\,dx$ and integrals like $P(x_0 < X < x_1)$ are.
  • We will rapidly become lazy and denote $p(X=x)$ incorrectly as $p(X)$ or $p(x)$. You have been warned.

The first bullet is highly relevant if we ever want to change variables, e.g. $x\rightarrow y(x)$

  • $p(y) \neq p[x(y)]$; rather $p(y) = p(x) \left|dx/dy\right|$
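
As a quick numerical illustration of this transformation rule (a sketch assuming numpy and scipy; the particular choice $x\sim\mathrm{Normal}(0,1)$, $y=e^x$ is arbitrary, and the bonus exercises at the end go further):

```python
import numpy as np
from scipy import stats

# Arbitrary monotonic transformation for illustration: x ~ Normal(0, 1), y = exp(x)
rng = np.random.default_rng(42)
x = rng.normal(size=100_000)
y = np.exp(x)

# Transformed density from the rule above: p(y) = p_x(x(y)) |dx/dy| = p_x(ln y) / y
hist, edges = np.histogram(y, bins=50, range=(0.05, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p_y = stats.norm.pdf(np.log(centers)) / centers

print(np.max(np.abs(hist - p_y)))  # small, up to finite-sample and binning error
```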

The cumulative distribution function (CDF) is the probability that $X\leq x$.

  • Usually written: $F(x) = P(X \leq x) = \int_{-\infty}^x p(X=x')dx'$.
  • Conversely, the PDF is the derivative of the CDF.
  • (The CDF is sometimes referred to just as the distribution function.)

The quantile function is the inverse of the CDF, $F^{-1}(P)$.

  • Multiply a quantile by 100 and it's a percentile.
  • The median of a given distribution is $F^{-1}(0.5)$, etc.
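
These relationships are easy to check numerically; here is a minimal sketch with scipy, using a standard normal as the example distribution:

```python
import numpy as np
from scipy import stats

dist = stats.norm()   # standard normal, as an arbitrary example
x = 0.5

print(dist.cdf(x))    # CDF: F(x) = P(X <= x)

# PDF as the (numerical) derivative of the CDF
eps = 1e-6
print((dist.cdf(x + eps) - dist.cdf(x - eps)) / (2 * eps), dist.pdf(x))

# Quantile function (scipy calls it ppf) as the inverse of the CDF
print(dist.ppf(0.5))            # the median; 0.0 for a standard normal
print(dist.ppf(dist.cdf(x)))    # recovers x
```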

Ridiculous example: an unfair coin toss

We flip a coin which is weighted to land on heads a fraction $q$ of the time. To make things numeric, let $X=0$ stand for an outcome of tails and $X=1$ for heads.

  X    PMF $P(X)$    CDF $F(X)$
  0    $1-q$         $1-q$
  1    $q$           $1$
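
To make the table concrete, a quick simulation sketch (assuming numpy; $q=0.3$ is an arbitrary choice):

```python
import numpy as np

# Simulate the weighted coin: X = 1 (heads) with probability q, else 0 (tails)
q = 0.3
rng = np.random.default_rng(1)
X = (rng.random(100_000) < q).astype(int)

print(np.mean(X == 0), np.mean(X == 1))   # PMF: ~(1-q) and ~q
print(np.mean(X <= 0), np.mean(X <= 1))   # CDF: ~(1-q) and exactly 1
```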

Joint probability distributions

Things get more interesting when we deal with joint distributions of multiple events, $p(X=x$ and $Y=y)$, or just $p(x,y)$.

(Usually visualized as contours of p)

The marginal probability of $y$, $p(y)$, means the probability of $y$ irrespective of what $x$ is.

  • $p(y) = \int dx ~ p(x,y)$

The conditional probability of $y$ given a value of $x$, $p(y|x)$, is most easily understood this way

  • $p(x,y) = p(y|x)\,p(x)$

i.e., $p$ of getting $x$ AND $y$ can be factorized into the product of

  • $p$ of getting $x$ regardless of $y$, and
  • $p$ of getting $y$ given $x$.

$p(y|x)$ is a (normalized) slice through $p(x,y)$ rather than an integral.

$x$ and $y$ are independent if $p(y|x) = p(y)$.

Equivalently, $p(x,y) = p(x)\,p(y)$.
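
To see marginalization, conditioning, and (non-)independence in action, here is a sketch on a discretized grid (assuming numpy; the correlated 2D Gaussian for $p(x,y)$ is an arbitrary choice):

```python
import numpy as np

# Toy joint distribution on a grid: a correlated 2D Gaussian (correlation 0.6)
x = np.linspace(-3, 3, 201)
y = np.linspace(-3, 3, 201)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing='ij')
p_xy = np.exp(-0.5 * (X**2 - 1.2 * X * Y + Y**2) / (1.0 - 0.6**2))
p_xy /= p_xy.sum() * dx * dy                  # normalize the 2D integral to 1

# Marginal p(y): integrate out x
p_y = p_xy.sum(axis=0) * dx

# Conditional p(y | x=1): a slice through the joint, renormalized
i = np.argmin(np.abs(x - 1.0))
p_y_given_x = p_xy[i, :] / (p_xy[i, :].sum() * dy)

print(p_y.sum() * dy, p_y_given_x.sum() * dy)   # both ~1 (normalized)
# The slice differs from the marginal, so x and y are not independent here
print(np.max(np.abs(p_y - p_y_given_x)))
```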

Exercise

Take the coin tossing example from earlier, where $P(\mathrm{heads})=q$ and $P(\mathrm{tails})=1-q$ for a given toss, and assume that this holds independently for each toss. Suppose we toss the coin twice.

Find:

  1. The conditional probability that both tosses are heads, given that the first toss is heads.
  2. The conditional probability that both tosses are heads, given that at least one of the tosses is heads.

Exercise

Say we keep on tossing this coin, still assuming independence, a total of $N$ times. Work out the probability that exactly $n$ of these turn out to be heads.

How to count things

The answer to the previous exercise is the probability mass function (PMF) of the binomial distribution

$P(n|q,N) = {N \choose n} q^n (1-q)^{N-n}$

To introduce some notation, we might write this as

$n \sim \mathrm{Binom}(q,N)$

Here the squiggle means "is a random variable that is distributed as" (as opposed to "has the same order of magnitude as" or "scales with", the common usages in physics).
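
A quick check of this formula against a direct simulation (a sketch assuming numpy and scipy; $q=0.3$, $N=10$ are arbitrary choices):

```python
import numpy as np
from scipy import stats

q, N = 0.3, 10
rng = np.random.default_rng(2)
tosses = rng.random((100_000, N)) < q    # each row is one set of N independent tosses
n_heads = tosses.sum(axis=1)

for n in range(N + 1):
    # Binomial PMF vs. the simulated frequency of exactly n heads
    print(n, stats.binom.pmf(n, N, q), np.mean(n_heads == n))
```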

Recall that a key assumption was that each toss (trial) was independent. If we write the mean number of heads as $\mu=qN$ and also assume that $q$ is small while $N$ is large, then a series of irritating limits and substitutions yields the Poisson distribution

$P(n|\mu) = \frac{\mu^n e^{-\mu}}{n!}$

This is an extremely important result, given that most astronomy and physics experiments boil down to counting events that are rare compared with the number of time intervals in which they might happen (and be recorded).

  • E.g., most obviously, the number of photons from some source hitting a particular CCD pixel during an observation.

The Poisson distribution has the following (probably familiar) properties:

  • Expectation value (mean) $\langle n\rangle = \mu$
  • Variance $\left\langle \left(n-\langle n \rangle\right)^2 \right\rangle = \mu$
  • Additivity: $n_1+n_2\sim \mathrm{Pois}(\mu_1+\mu_2)$ if $n_i\sim\mathrm{Pois}(\mu_i)$
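
These limits and properties can be verified numerically; a sketch with numpy and scipy (the values of $\mu$ are arbitrary):

```python
import numpy as np
from scipy import stats

# Binomial -> Poisson limit: small q, large N, with mu = q*N held fixed
mu = 3.0
n = np.arange(15)
for N in [10, 100, 10_000]:
    q = mu / N
    print(N, np.max(np.abs(stats.binom.pmf(n, N, q) - stats.poisson.pmf(n, mu))))

# Mean, variance, and additivity of Poisson draws
rng = np.random.default_rng(3)
n1 = rng.poisson(2.0, size=100_000)
n2 = rng.poisson(5.0, size=100_000)
print(n1.mean(), n1.var())                                  # both ~2.0
print(np.mean((n1 + n2) == 4), stats.poisson.pmf(4, 7.0))   # sum ~ Pois(2+5)
```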

The central limit theorem

Another important theorem states, in its most common form:

  • If $X_i$ are independent and drawn from an identical PDF, with mean $\mu$ and variance $\sigma^2$, then the sum of $n$ $X$'s tends to the normal (Gaussian) distribution with mean $n\,\mu$ and variance $n\,\sigma^2$.
  • Alternatively, the average $\sum_i X_i/n$ tends to normal with mean $\mu$ and variance $\sigma^2/n$.

Among other things, this implies that a Poisson distribution with large enough $\mu$ closely resembles a Gaussian.
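
A minimal numerical demonstration (a sketch assuming numpy and scipy; the exponential distribution is an arbitrary non-Gaussian starting point):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma = 1.0, 1.0    # mean and standard deviation of the Exponential(1) distribution

for n in [1, 10, 100]:
    # Average of n iid exponential draws, repeated many times
    means = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)
    # Compare against the CLT's Gaussian prediction: Normal(mu, sigma/sqrt(n))
    ks = stats.kstest(means, stats.norm(loc=mu, scale=sigma / np.sqrt(n)).cdf)
    print(n, ks.statistic)    # the KS distance shrinks as n grows
```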

Cautions

This is a powerful result, but we need to keep some things in mind.

  1. It doesn't tell us, in general, how big $n$ needs to be for things to become "Gaussian enough" for a given purpose. This would need to be determined by more careful analysis.
  2. It's tempting to bin up data (e.g. Poisson counts in adjacent pixels/channels/integrations) enough to justify using the simple Gaussian distribution, but this risks throwing away key information in the data set (e.g. spatial/spectral/temporal structure).

Coming attractions

In the next chunk, we'll see how the principles of probability are applied to create generative models, which are a key ingredient of inference.

Bonus exercise

Test your understanding of the basic mathematical tools of probability by proving the formula for transforming random variables

$p(y) = p(x) \left|\frac{dx}{dy}\right|$

for the case where $y(x)$ is monotonic.

Bonus numerical exercise

Go farther with the previous exercise! Consider the function $b=\tan(\theta)$, which is sometimes used to reparametrize the slope of a line ($b$) with the angle the line makes in a plot ($\theta$).

  1. If $p(\theta)$ is uniform (proportional to a constant) for $-\frac{\pi}{2}<\theta<\frac{\pi}{2}$, work out $p(b)$.
  2. Demonstrate that you're right by generating a bunch of uniform random $\theta$'s, transforming each one to its corresponding $b$, and comparing a histogram of $b$ with your answer to (1).
