In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, numpy as np, pandas as pd
sn.set_context('notebook')

Distributions

This notebook introduces some fundamental concepts that we’ll use extensively later on: joint, marginal and conditional distributions; the sum and product rules; and Bayes’ Theorem.

The aim is to avoid lots of formal statistical notation and instead develop some intuition about what these things actually mean. Most of the examples use just two variables to make things easier to visualise, but everything here can be extended in a straightforward way to much larger parameter spaces.

1.1. One dimensional probability densities

Suppose we have a continuous random variable, $x$, with a Gaussian (or Normal) distribution with mean, $\mu$, and standard deviation, $\sigma$. This is often written as $x\sim\mathcal{N}(\mu, \sigma)$.

The equation for the probability density, $P(x)$, is:

$$P(x)=\frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}$$

which we can plot like this:


In [2]:
def gaussian_function(x, mu=0, sigma=1):
    """ Simple example function to return the probability density of a Gaussian with mean, mu, 
        and standard deviation, sigma, evaluated at values in x.
        
        Note that a better function is available in scipy.stats.norm - this version is just 
        for illustration.
    """
    return (1/(sigma*(2*np.pi)**0.5))*np.exp(-((x - mu)**2)/(2*sigma**2))

# Pick some example values for mu and sigma
mu = 5
sigma = 3

# Calculate probability densities, P_x, at a range of values of x
x = np.arange(-5, 15, 0.1)
P_x = gaussian_function(x, mu=mu, sigma=sigma)

# Plot
plt.plot(x, P_x, 'k-')
plt.fill_between(x, P_x, where=((8<=x)&(x<=10)), alpha=0.3, color='red') # Explained below
plt.xlabel('$x$', fontsize=20)
plt.ylabel('$P(x)$', fontsize=20)


Out[2]:
[Figure: the Gaussian density $P(x)$ plotted against $x$, with the region $8 \leq x \leq 10$ shaded in red]

Note that the values on the $P(x)$ axis are probability densities rather than probabilities: because $x$ is a continuous variable, the probability of it taking any exact value is zero. Instead, we can calculate the probability of $x$ lying within a particular range by integrating over that range. For example, the probability of $x$ lying between $a$ and $b$ is given by:

$$P(a < x \leq b)={\int_{a}^{b} P(x) dx}$$

which, for the example above where $a=8$ and $b=10$, is equal to the area shaded in red on the plot.

Also, because $x$ must take some real value, the total area under the curve is equal to 1, i.e.

$${\int_{-\infty}^{\infty} P(x) dx} = 1$$
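
As a quick numerical sanity check (a sketch using the gaussian_function, mu and sigma defined above), we can approximate both of these integrals with scipy.integrate.quad:


In [ ]:
from scipy.integrate import quad

# Probability of x lying between 8 and 10 (the red shaded area above)
p_shaded, _ = quad(gaussian_function, 8, 10, args=(mu, sigma))
print('P(8 < x <= 10) = %.4f' % p_shaded)

# The total area under the curve should be (very close to) 1
p_total, _ = quad(gaussian_function, -np.inf, np.inf, args=(mu, sigma))
print('Total area = %.4f' % p_total)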

1.2. More than one dimension: the joint distribution

Suppose we have two independent random variables, $x$ and $y$, both normally distributed with means $\mu_x$ and $\mu_y$ and standard deviations $\sigma_x$ and $\sigma_y$ respectively.


In [3]:
# Pick some example values for mu and sigma
mu_x, mu_y = 5, 8
sigma_x, sigma_y = 3, 5

# Calculate densities
x = np.arange(-10, 30, 0.1)
P_x = gaussian_function(x, mu=mu_x, sigma=sigma_x)
P_y = gaussian_function(x, mu=mu_y, sigma=sigma_y)

# Plot
fig, axes = plt.subplots(nrows=1, ncols=2)
data = {'x':P_x, 'y':P_y}

for idx, item in enumerate(['x', 'y']):
    axes[idx].plot(x, data[item], 'k-')
    axes[idx].set_xlabel('$%s$' % item, fontsize=20)
    axes[idx].set_ylabel('$P(%s)$' % item, fontsize=20)

fig.subplots_adjust(wspace=0.8)


Because $x$ and $y$ are independent, the joint probability distribution, $P(x,y)$, is simply the product of the two individual densities:

$$P(x,y)=P(x)P(y)$$

We can represent this joint distribution as a surface in three dimensions, where the "hills" in the landscape represent areas with higher probability density.


In [4]:
import scipy.stats as stats
from mpl_toolkits.mplot3d import Axes3D

x = y = np.arange(-10, 30, 0.1)

# Get a "mesh" for the x and y values
X, Y = np.meshgrid(x, y)

# This time we'll use the "norm" function from scipy.stats, although our gaussian_function 
# from above would work just as well
norm_x = stats.norm.pdf(x, loc=mu_x, scale=sigma_x)
norm_y = stats.norm.pdf(y, loc=mu_y, scale=sigma_y)
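# Outer product of the two density vectors gives P(x,y) = P(x)P(y) over the whole grid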
M = np.dot(norm_y[:, None], norm_x[None, :])

# Plot
fig = plt.figure(figsize=(15,5))

# First subplot in 3D
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, M, cmap=plt.cm.terrain)
ax1.view_init(azim=390)
ax1.set_xlabel('$x$', fontsize=16)
ax1.set_ylabel('$y$', fontsize=16) 
ax1.set_zlabel('$P(x,y)$', fontsize=16)

# Second subplot as a contour plot
ax2 = fig.add_subplot(122)
ax2.contour(X, Y, M)
ax2.imshow(M, interpolation='none', origin='lower',
           cmap=plt.cm.terrain, 
           extent=(-10, 30, -10, 30))
ax2.set_xlabel('$x$', fontsize=16)
ax2.set_ylabel('$y$', fontsize=16)


Out[4]:
[Figure: left, 3D surface plot of the joint density $P(x,y)$; right, the same density as a contour plot in the $x$-$y$ plane]

In the 1D example above, the total area under the curve was equal to 1, and to find the probability of $x$ lying between $a$ and $b$ we integrated over that interval. Here the situation is exactly the same, except now we have an extra dimension. This means that the total volume under the surface must be equal to 1, because the values of $x$ and $y$ must lie somewhere in the $x$-$y$ plane.

In a similar way, the probability of $x$ lying between $a$ and $b$ and $y$ lying between $c$ and $d$ is given by a double integral:

$$P(a < x \leq b, c < y \leq d) = {\int_{c}^{d} \int_{a}^{b} P(x,y) dx dy}$$

This integral corresponds to the volume beneath the surface over the region bounded by $a$, $b$, $c$ and $d$, which is illustrated by the red rectangle on the plot below.


In [5]:
from matplotlib.patches import Rectangle

# Pick some example values for a, b, c and d
a, b, c, d = 5, 10, 0, 5

# Contour plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contour(X, Y, M)
ax.imshow(M, interpolation='none', origin='lower',
          cmap=plt.cm.terrain, alpha=0.6,
          extent=(-10, 30, -10, 30))
ax.add_patch(Rectangle((a, c), b-a, d-c, facecolor='red', alpha=0.5))
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$y$', fontsize=16)


Out[5]:
[Figure: contour plot of $P(x,y)$ with the rectangle bounded by $a$, $b$, $c$ and $d$ shaded in red]
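
As a rough numerical check (assuming the 0.1 grid spacing and the values of $a$, $b$, $c$ and $d$ from above), we can approximate this double integral by summing the gridded densities over the rectangle and multiplying by the area of each grid cell:


In [ ]:
# Approximate the volume beneath the red rectangle by summing over grid cells
dx = dy = 0.1                                    # Grid spacing used above
in_rect = (X > a) & (X <= b) & (Y > c) & (Y <= d)
p_rect = M[in_rect].sum() * dx * dy
print('P(%s < x <= %s, %s < y <= %s) ~ %.3f' % (a, b, c, d, p_rect))

# The total volume under the whole surface should be close to 1
print('Total volume ~ %.3f' % (M.sum() * dx * dy))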

1.3. Independent and dependent variables

In the example above, $x$ and $y$ were independent variables: we first drew a value from $x$, then independently drew one from $y$, and then multiplied the probability densities together to get the density for the joint distribution. Because we used different values for the standard deviations of our two Gaussians ($\sigma_x = 3$ and $\sigma_y = 5$ in the example above), the contours of the joint distribution are ellipses, which are elongated parallel to the $y$ axis (reflecting the fact that the distribution has greater variability in this direction). More specifically, the major axis of each ellipse is parallel to the $y$ axis, while the minor axis is parallel to the $x$ axis. It is important to understand that this is only the case because $x$ and $y$ are independent.

The major and minor axes are related to the eigenvectors of the covariance matrix. This sounds complicated, but you can think of the covariance matrix as a table showing how each parameter (in this case $x$ and $y$) varies with the others. For the two-variable example here, it looks something like this:

|     | $x$          | $y$          |
|-----|--------------|--------------|
| $x$ | $\sigma_x^2$ | 0            |
| $y$ | 0            | $\sigma_y^2$ |

Note that variance is simply the square of the standard deviation (i.e. $\sigma_x^2$ and $\sigma_y^2$). The two zeros in the table indicate that $x$ does not vary with $y$ and vice versa - in other words, these variables are independent.
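
As a quick empirical sketch (reusing the mu and sigma values from above), we can draw a large number of independent samples for $x$ and $y$ and look at their sample covariance matrix - the off-diagonal entries should be close to zero:


In [ ]:
# Draw independent samples of x and y and estimate their covariance matrix
x_samples = np.random.normal(mu_x, sigma_x, size=100000)
y_samples = np.random.normal(mu_y, sigma_y, size=100000)

# The diagonal should be roughly (sigma_x**2, sigma_y**2); the off-diagonal roughly 0
print(np.cov(x_samples, y_samples))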

On the other hand, suppose we're considering a dataset where $x$ and $y$ depend on one another. For example, we may be working with a dataset where $x$ corresponds to people's weight and $y$ to their height. We would not expect these two variables to be independent, because we know that in general tall people tend to be heavier. Of course, we wouldn't really use Gaussian distributions to represent height and weight (not least because a Gaussian assigns non-zero probability density to negative values). However, for the sake of keeping things simple we'll stick with the Gaussian example for now. If $x$ and $y$ co-vary, the covariance matrix looks like this:

|     | $x$             | $y$             |
|-----|-----------------|-----------------|
| $x$ | $\sigma_x^2$    | $\sigma_{yx}^2$ |
| $y$ | $\sigma_{xy}^2$ | $\sigma_y^2$    |

where $\sigma_{xy}^2 = \sigma_{yx}^2 \neq 0$.

The result is that the major and minor axes (the eigenvectors) of the probability density contour ellipses are rotated and stretched with respect to the $x$ and $y$ axes (see plot below). This occurs because the two variables are dependent on one another.


In [6]:
from scipy.stats import multivariate_normal

# Use the same mu and sigma values from above. We need new values for sigma_xy and sigma_yx
sigma_xy = sigma_yx = 3
mus = [mu_x, mu_y]
cov = [[sigma_x**2, sigma_yx**2], [sigma_xy**2, sigma_y**2]] # Covariance matrix

# Grid of x and y values
x = y = np.arange(-10, 30, 0.1)
X, Y = np.meshgrid(x, y)
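
# Stack the two grids into an array of (x, y) pairs - the input format expected by multivariate_normal.pdf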
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y

# Create a "frozen" distribution
rv = multivariate_normal(mus, cov)

# Evaluate the probability density over the grid
M = rv.pdf(pos)

# Contour plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contour(X, Y, M)
ax.imshow(M, interpolation='none', origin='lower',
          cmap=plt.cm.terrain, alpha=0.6,
          extent=(-10, 30, -10, 30))
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$y$', fontsize=16)


Out[6]:
[Figure: contour plot of the correlated joint density; the ellipses' major and minor axes are rotated relative to the $x$ and $y$ axes]
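
To make the connection with eigenvectors a little more concrete, here's a small sketch (using the covariance matrix defined above) that computes its eigen-decomposition. The eigenvectors give the directions of the ellipses' major and minor axes, and the square roots of the eigenvalues give their relative lengths:


In [ ]:
# Eigen-decomposition of the covariance matrix (eigenvectors are the columns)
eig_vals, eig_vecs = np.linalg.eigh(np.array(cov))
print('Eigenvalues:', eig_vals)
print('Eigenvectors:')
print(eig_vecs)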

1.4. Marginal and conditional distributions

In section 1.2 we introduced the idea of a joint distribution and showed how joint probabilities can be interpreted as volumes beneath particular surface elements:

$$P(a < x \leq b, c < y \leq d) = {\int_{c}^{d} \int_{a}^{b} P(x,y) dx dy}$$

However, suppose we're only interested in the distribution over $x$, independent of the value of $y$. If we return to the height and weight example from above, we may wish to look at the joint distribution of height and weight and answer questions such as "What is the probability of weighing between 70 and 90 kgs, regardless of height?"

To answer this, we can simply expand the limits in one of the integrals - the probability of $x$ lying between $a$ and $b$ regardless of $y$ is given by:

$$P(a < x \leq b) = P(a < x \leq b, -\infty < y \leq \infty) = {\int_{-\infty}^{\infty} \int_{a}^{b} P(x,y) dx dy}$$

which is equivalent to the volume under the long strip shaded red on the left-hand plot below.


In [7]:
# Pick some example values for a, b, c and d
a, b, c, d = 5, 10, 0, 5

# Set up for plotting
fig, axes = plt.subplots(nrows=1, ncols=2)

# Integrate out y
axes[0].contour(X, Y, M)
axes[0].imshow(M, interpolation='none', origin='lower',
               cmap=plt.cm.terrain, alpha=0.6,
               extent=(-10, 30, -10, 30))
axes[0].axvspan(a, b, alpha=0.5, color='red')
axes[0].set_xlabel('$x$', fontsize=16)
axes[0].set_ylabel('$y$', fontsize=16)
axes[0].set_title('Integrating out $y$')

# Integrate out x
axes[1].contour(X, Y, M)
axes[1].imshow(M, interpolation='none', origin='lower',
               cmap=plt.cm.terrain, alpha=0.6,
               extent=(-10, 30, -10, 30))
axes[1].axhspan(c, d, alpha=0.5, color='red')
axes[1].set_xlabel('$x$', fontsize=16)
axes[1].set_ylabel('$y$', fontsize=16)
axes[1].set_title('Integrating out $x$')

plt.subplots_adjust(wspace=0.8)
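
Numerically, this marginal probability is just the volume under the red strip on the left-hand plot. Here's a rough check, summing the gridded densities over the strip (assuming the 0.1 grid spacing from above):


In [ ]:
# Volume under the strip a < x <= b, summed over all y values on the grid
dx = dy = 0.1
in_strip = (X > a) & (X <= b)
p_strip = M[in_strip].sum() * dx * dy
print('P(%s < x <= %s) ~ %.3f' % (a, b, p_strip))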


If we're interested in deriving an equation for the probability density distribution over $x$, rather than the actual probability of $x$ lying within a particular interval, this integral becomes:

$$P(x) = \int_y P(x,y) dy$$

which is known as the sum rule. The process itself is called marginalisation, and it's often said that we are "integrating out" the dependence on $y$. Integrating over all possible values of $y$ essentially takes all the probability along the $y$ axis, sums it, and stacks it up along the $x$ axis, as if writing the totals in the margin of a table - hence the term marginalisation. A nice way to visualise this is to plot the marginal distributions for $x$ and $y$ along the edges of the plot:


In [8]:
# This time we'll use the handy jointplot function in Seaborn to plot 
# the same multivariate normal distribution as above
# Rather than evaluating the densities over a regular grid, for simplicity
# we'll just choose 1000 values at random
data = np.random.multivariate_normal(mus, cov, size=1000)
data = pd.DataFrame(data, columns=['$x$', '$y$'])
sn.jointplot('$x$', '$y$', data, kind='kde', stat_func=None)


Out[8]:
[Figure: seaborn jointplot showing the 2D KDE of $x$ and $y$, with the marginal distributions of $x$ and $y$ along the top and right edges]
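
We can also check the sum rule numerically using the gridded densities, M, computed earlier: summing $P(x,y)$ over all $y$ values (and multiplying by the grid spacing) should recover the marginal density of $x$, which for this bivariate normal example is just a Gaussian with mean $\mu_x$ and standard deviation $\sigma_x$:


In [ ]:
# Marginalise the gridded joint density over y (sum down the columns of M)
dy = 0.1
P_x_marginal = M.sum(axis=0) * dy

# Compare against the analytical marginal, N(mu_x, sigma_x)
plt.plot(x, P_x_marginal, 'k-', label='Numerical marginal')
plt.plot(x, stats.norm.pdf(x, loc=mu_x, scale=sigma_x), 'r--', label='Analytical marginal')
plt.xlabel('$x$', fontsize=16)
plt.ylabel('$P(x)$', fontsize=16)
plt.legend()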

Another question we might want to ask is, "Given that a person weighs between 70 and 90 kgs, what is the probability that their height is between 1.6 and 1.8 m?" This is a conditional probability, because we're interested in the probability that height lies in a certain range, conditioned on the fact that weight is between 70 and 90 kgs.

More generally, we might ask, "What is the probability that $y$ lies between $c$ and $d$, given that $x$ lies between $a$ and $b$?" This is usually written as

$$P(c < y \leq d \mid a < x \leq b)$$

where the $\mid$ symbol is read as "given".

To evaluate this conditional probability, we need to redistribute the volume beneath our probability density surface. We know that $a < x \leq b$, so any probability density outside this range is now zero. Densities within the range therefore need to be re-weighted ("re-normalised"), so that the total probability of being anywhere within the strip bounded by $a < x \leq b$ is equal to 1.

From the discussion above, we know that the joint probability is given by:

$$P(a < x \leq b, c < y \leq d) = {\int_{c}^{d} \int_{a}^{b} P(x,y) dx dy}$$

and the marginal probability of $x$ lying between $a$ and $b$ (regardless of $y$) by:

$$P(a < x \leq b, -\infty < y \leq \infty) = {\int_{-\infty}^{\infty} \int_{a}^{b} P(x,y) dx dy}$$

On the plot below, the region whose volume gives the marginal probability is shaded red, while the region whose volume gives the joint probability we want to work out is shaded black.


In [9]:
# Pick some example values for a, b, c and d
a, b, c, d = 5, 10, 0, 5

# Contour plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.contour(X, Y, M)
ax.imshow(M, interpolation='none', origin='lower',
          cmap=plt.cm.terrain, alpha=0.6,
          extent=(-10, 30, -10, 30))
ax.axvspan(a, b, alpha=0.5, color='red')
ax.add_patch(Rectangle((a, c), b-a, d-c, color='black'))
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$y$', fontsize=16)


Out[9]:
[Figure: contour plot with the strip $a < x \leq b$ shaded red and the rectangle bounded by $a$, $b$, $c$ and $d$ shaded black]

Hopefully you can see from this illustration that the probability of $y$ lying between $c$ and $d$ given that $x$ lies between $a$ and $b$ is equal to the ratio of the volume beneath the black square to the volume beneath the red strip. We don't need to consider the area outside of the red strip because our calculation is conditioned on the assumption that the probability density in this region is zero.

So, the volume beneath the black square is the joint probability; the volume beneath the red strip is the marginal probability; and the ratio of the two is the conditional probability:

$$P(c < y \leq d \mid a < x \leq b) = \frac{P(a < x \leq b, c < y \leq d)}{P(a < x \leq b)}$$
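
As a rough numerical check of this ratio (reusing the gridded densities and the values of $a$, $b$, $c$ and $d$ from the cells above), we can divide the volume under the black square by the volume under the red strip:


In [ ]:
# Conditional probability as the ratio of the two volumes shown above
dx = dy = 0.1
in_strip = (X > a) & (X <= b)
in_square = in_strip & (Y > c) & (Y <= d)

p_joint = M[in_square].sum() * dx * dy        # Volume under the black square
p_marginal = M[in_strip].sum() * dx * dy      # Volume under the red strip

print('P(%s < y <= %s | %s < x <= %s) ~ %.3f' % (c, d, a, b, p_joint / p_marginal))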

In cleaner notation, we can write:

$$P(y \mid x) = \frac{P(x,y)}{P(x)}$$

This is known as the product rule and we can combine it with the sum rule to derive Bayes' famous equation. First, note that

$$P(x,y) = P(y,x)$$

and therefore

$$P(x,y) = P(y \mid x)P(x) = P(y,x) = P(x \mid y)P(y)$$

This can be rearranged to give one form of Bayes' equation

$$P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}$$

We can now use the sum rule to re-write the denominator

$$P(y \mid x) = \frac{P(x \mid y)P(y)}{\int_y P(x,y) dy}$$

and finally use the product rule to re-write it again

$$P(y \mid x) = \frac{P(x \mid y)P(y)}{\int_y P(x \mid y)P(y) dy}$$

At first glance this version might look more complicated than the others, but note that in this form the denominator is the integral of the numerator, which is a useful rearrangement.
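
To make this concrete, here's a small sketch (using the gridded joint density from the cells above) that takes a single slice through $P(x,y)$ at a fixed value of $x$ (we'll use $x=5$ as an example) and re-normalises it. Dividing by the integral of the slice is exactly the division by the denominator above, and the resulting conditional density integrates to 1:


In [ ]:
# Conditional density of y at the example value x = 5: take a slice through the 
# joint density and re-normalise so that it integrates to 1
dy = 0.1
idx = np.argmin(np.abs(x - 5))          # Grid column closest to x = 5

joint_slice = M[:, idx]                 # P(x=5, y) for all y on the grid
P_y_given_x = joint_slice / (joint_slice.sum() * dy)

print('Integral of P(y | x=5) ~ %.3f' % (P_y_given_x.sum() * dy))

plt.plot(y, P_y_given_x, 'k-')
plt.xlabel('$y$', fontsize=16)
plt.ylabel(r'$P(y \mid x=5)$', fontsize=16)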

Bayes' equation is one of the fundamental building blocks for everything that follows. Hopefully this notebook has helped to provide some basic intuition as to where it comes from and how it relates to the properties of some simple distributions.