For this Machine Learning course, students are expected to have prior knowledge of some basic concepts from probability theory and linear algebra. In this notebook, we will briefly review the most relevant of these concepts.
A random variable (r.v.) can be defined as a variable whose value is subject to variations due to chance. In this course, we will refer to random variables using capital letters, while realizations of random variables will be denoted with lowercase letters.
According to the values they can take, we can distinguish between:
Discrete r.v.: If the variable can only take values from a discretized set of points, e.g., $X \in \{1,\dots,5\}$.
Continuous r.v.: If the variable can take values from any interval (or union of intervals) on the real line, e.g., $Y \in [0,1] \cup [3,5]$.
An event can then be defined as the outcome of an experiment involving the random variable to which a probability is assigned. For instance, for the discrete and continuous random variables defined above, we could consider the following events: $\{X=3\}$, $\{X\in \{1,2\}\}$, $\{Y < 3.5\}$.
Probabilities of all events: A discrete random variable can be characterized by the probabilities of the variable taking each of its feasible values. If r.v. $X$ can take values from the set $\cal X$, the random variable is completely characterized by the following probabilities: $Pr\{X=i\}, \text{ for } i \in {\cal X}$.
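As a quick sketch, we can represent such a characterization numerically. The probabilities below are made up for illustration; any non-negative vector summing to 1 would do.

```python
import numpy as np

# Illustrative pmf for a discrete r.v. X taking values in {1, ..., 5}
# (these probabilities are made up for this example)
values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # Pr{X = i} for each i

# A valid set of probabilities must sum to 1
assert np.isclose(pmf.sum(), 1.0)

# Probability of the event {X in {1, 2}}: sum the corresponding probabilities
p_event = pmf[np.isin(values, [1, 2])].sum()
print(p_event)   # 0.1 + 0.2 = 0.3 (up to floating point)
```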
Cumulative distribution function: if the domain of the random variable can be mapped to the real line, we can use the following unidimensional function: $$F(x) = Pr(\{X \leq x\}).$$
By construction, the cumulative distribution function is a non-decreasing function, which takes values between 0 and 1.
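For a discrete r.v., the cdf can be obtained as a running sum of the probabilities, and we can check the two properties just stated. The pmf below is again illustrative.

```python
import numpy as np

# Illustrative pmf for X in {1, ..., 5}
values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# For a discrete r.v., F(x) = Pr{X <= x} is the running sum of the pmf
cdf = np.cumsum(pmf)

# By construction: non-decreasing, and taking values between 0 and 1
assert np.all(np.diff(cdf) >= 0)
assert cdf[0] >= 0 and np.isclose(cdf[-1], 1.0)
print(cdf)   # [0.1, 0.3, 0.7, 0.9, 1.0] (up to floating point)
```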
Some important properties of the pdf follow:
A final comment is in order. We said that if $Y$ is a continuous random variable, the probability of the event $\{Y = y_0\}$ will typically be zero. However, in some cases such an event can concentrate a non-zero probability. If that is the case, the cdf will present a discontinuity at $y_0$, and the pdf will have a delta at that point (with height equal to the probability of the event).
If $X$ and $Y$ are two continuous random variables, we can characterize the joint behavior of both variables using their joint pdf and/or their joint cdf.
$$F(x,y) = Pr(\{X\leq x, Y \leq y\}) = Pr(\{X\leq x \text{ and } Y \leq y\})$$

$$p(x,y) = \frac{\partial^2 F(x,y)}{\partial x \, \partial y}$$

The above functions provide the most complete characterization of the two random variables when considered together. The same knowledge cannot be extracted, for instance, from the individual pdfs of $X$ and $Y$: some combinations of values of $X$ and $Y$ can be very unlikely to occur jointly, even if each individual value has a large pdf value. Imagine that $X$ represents distance to the sea, and $Y$ represents humidity level. Clearly, these two variables are closely related, and we need a characterization that accounts for their joint behavior.
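We can illustrate this numerically with a discretized joint pdf. The bivariate Gaussian below and its correlation value are assumptions chosen for the sketch (loosely standing in for the distance/humidity example): at a point where both variables are individually plausible, the joint pdf can still be far smaller than the product of the individual pdfs.

```python
import numpy as np

# Discretize the joint pdf of two correlated Gaussian variables
# (the bivariate Gaussian model and rho = -0.8 are illustrative choices)
x = np.linspace(-3, 3, 200)
y = np.linspace(-3, 3, 200)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing='ij')

rho = -0.8  # strong negative correlation
p_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))

# The joint pdf integrates (approximately) to 1 over the grid
total = (p_xy * dx * dy).sum()
print(total)

# Marginal pdfs obtained from the joint, for comparison
p_x = p_xy.sum(axis=1) * dy
p_y = p_xy.sum(axis=0) * dx

# At (x, y) = (2, 2), both marginal pdf values are non-negligible, but the
# joint pdf is far smaller than their product: this pair is jointly unlikely
i = np.argmin(np.abs(x - 2.0))
j = np.argmin(np.abs(y - 2.0))
print(p_xy[i, j], p_x[i] * p_y[j])
```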
The discussion above leads to the definition of dependent and independent random variables. Two random variables are said to be dependent if knowledge of one of them alters the pdf of the other. The distribution of random variable $X$ given that the value of $Y$ is known is given by the so-called conditional pdf: $$p(x|Y=y)$$
It is important to remark that, in spite of being a function of both $x$ and $y$, the above expression is a probability distribution over random variable $X$ only.
Using conditional pdfs, we can factorize the joint pdf of two random variables as $$p(x,y) = p(x | Y = y) p(y)\;\;\;\;\;\; (1)$$
Two random variables are said to be independent if they are not dependent. If that is the case, the following properties are satisfied:
If we know the joint pdf of two random variables, we have a complete characterization of their joint behavior. Thus, any other probability information can also be inferred from the joint pdf. In particular:
The last two expressions can easily be derived starting from (1). It is interesting to see that the conditional pdfs have the same shape as the joint pdf for the particular value $Y=y$ (or $X=x$). In this sense, the term in the denominator acts as a rescaling factor that ensures that the integral of the pdf over the whole domain of the random variable remains 1.
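The rescaling role of the denominator can be checked numerically. The sketch below reuses an illustrative discretized bivariate Gaussian (the model and the value $\rho = 0.5$ are assumptions, not from the course): taking a slice of the joint pdf at a fixed $Y = y_0$ and dividing by the marginal $p(y_0)$ yields a function that integrates to 1 over $x$.

```python
import numpy as np

# Illustrative discretized joint pdf of two correlated Gaussian variables
x = np.linspace(-3, 3, 300)
y = np.linspace(-3, 3, 300)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing='ij')

rho = 0.5  # assumed correlation, for illustration only
p_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))

# Marginal p(y): integrate the joint pdf over x
p_y = p_xy.sum(axis=0) * dx

# Conditional pdf p(x | Y = y0): slice of the joint at y0, rescaled by p(y0)
j = np.argmin(np.abs(y - 1.0))          # pick y0 = 1 on the grid
p_x_given_y = p_xy[:, j] / p_y[j]

# The rescaling ensures the conditional integrates to 1 over x
print((p_x_given_y * dx).sum())
```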
In [ ]:
# To visualize plots in the notebook
%matplotlib inline
# Imported libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io # To read matlab files
import pylab
In [ ]: