For this Machine Learning course, students are expected to have prior knowledge of some basic concepts from probability theory and linear algebra. In this notebook, we will briefly review the most relevant of these concepts.
A random variable (r.v.) can be defined as a variable whose value is subject to variations due to chance. In this course, we will refer to random variables using capital letters, while realizations of random variables will be denoted with lowercase letters.
According to the values they can take, we can distinguish between:
Discrete r.v.: If the variable can only take values from a discretized set of points, e.g., $X \in \{1,\dots,5\}$.
Continuous r.v.: If the variable can take values from any interval (or union of intervals) on the real line, e.g., $Y \in [0,1] \cup [3,5]$.
An event can then be defined as the outcome of an experiment involving the random variable to which a probability is assigned. For instance, for the discrete and continuous random variables defined above, we could consider the following events: $\{X=3\}$, $\{X\in \{1,2\}\}$, $\{Y < 3.5\}$.
Probabilities of all events: A discrete random variable can be characterized by the probabilities of the variable taking each of its feasible values. If r.v. $X$ can take values from the set $\cal X$, the random variable is completely characterized by the following probabilities: $Pr\{X=i\}, \text{ for } i \in {\cal X}$.
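As a quick sketch, we can represent such a characterization numerically. The probabilities below are made up for illustration; any non-negative vector summing to 1 would do.

```python
import numpy as np

# Illustrative pmf for a discrete r.v. X taking values in {1, ..., 5}
# (these probabilities are made up for this example)
values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # Pr{X = i} for each i

# A valid set of probabilities must sum to 1
assert np.isclose(pmf.sum(), 1.0)

# Probability of the event {X in {1, 2}}: sum the corresponding probabilities
p_event = pmf[np.isin(values, [1, 2])].sum()
print(p_event)   # 0.1 + 0.2 = 0.3 (up to floating point)
```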
Cumulative distribution function: if the domain of the random variable can be mapped to the real line, we can use the following unidimensional function: $$F(x) = Pr(\{X \leq x\}).$$
By construction, the cumulative distribution function is a non-decreasing function, which takes values between 0 and 1.
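For a discrete r.v., the cdf can be obtained as a running sum of the probabilities, and we can check the two properties just stated. The pmf below is again illustrative.

```python
import numpy as np

# Illustrative pmf for X in {1, ..., 5}
values = np.array([1, 2, 3, 4, 5])
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# For a discrete r.v., F(x) = Pr{X <= x} is the running sum of the pmf
cdf = np.cumsum(pmf)

# By construction: non-decreasing, and taking values between 0 and 1
assert np.all(np.diff(cdf) >= 0)
assert cdf[0] >= 0 and np.isclose(cdf[-1], 1.0)
print(cdf)   # [0.1, 0.3, 0.7, 0.9, 1.0] (up to floating point)
```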
Some important properties of the pdf follow:
A final comment is in order. We said that if $Y$ is a continuous random variable, the probability of the event $\{Y = y_0\}$ will typically be zero. However, in some cases such an event can concentrate a non-zero probability. If that is the case, the cdf will present a discontinuity at $y_0$, and the pdf will have a delta at that point (with height equal to the probability of the event).
If $X$ and $Y$ are two continuous random variables, we can characterize the joint behavior of both variables using their joint pdf and/or their joint cdf.
$$F(x,y) = Pr(\{X\leq x, Y \leq y\}) = Pr(\{X\leq x \text{ and } Y \leq y\})$$

$$p(x,y) = \frac{\partial^2 F(x,y)}{\partial x \, \partial y}$$

The above functions provide the most complete characterization of the two random variables when considered together. The same knowledge cannot be extracted, for instance, from the individual pdfs of $X$ and $Y$: some combinations of values of $X$ and $Y$ can be very unlikely to occur jointly, even if each individual value has a large pdf value. Imagine that $X$ represents distance to the sea, and $Y$ represents humidity level. Clearly, these two variables are closely related, and we need a characterization that accounts for their joint behavior.
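We can illustrate this numerically with a discretized joint pdf. The bivariate Gaussian below and its correlation value are assumptions chosen for the sketch (loosely standing in for the distance/humidity example): at a point where both variables are individually plausible, the joint pdf can still be far smaller than the product of the individual pdfs.

```python
import numpy as np

# Discretize the joint pdf of two correlated Gaussian variables
# (the bivariate Gaussian model and rho = -0.8 are illustrative choices)
x = np.linspace(-3, 3, 200)
y = np.linspace(-3, 3, 200)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing='ij')

rho = -0.8  # strong negative correlation
p_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))

# The joint pdf integrates (approximately) to 1 over the grid
total = (p_xy * dx * dy).sum()
print(total)

# Marginal pdfs obtained from the joint, for comparison
p_x = p_xy.sum(axis=1) * dy
p_y = p_xy.sum(axis=0) * dx

# At (x, y) = (2, 2), both marginal pdf values are non-negligible, but the
# joint pdf is far smaller than their product: this pair is jointly unlikely
i = np.argmin(np.abs(x - 2.0))
j = np.argmin(np.abs(y - 2.0))
print(p_xy[i, j], p_x[i] * p_y[j])
```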
The discussion above leads to the definition of dependent and independent random variables. Two random variables are said to be dependent if knowledge of one of them alters the pdf of the other. The distribution of random variable $X$ given that the value of $Y$ is known is given by the so-called conditional pdf: $$p(x|Y=y)$$
It is important to remark that, in spite of being a function of both $x$ and $y$, the above expression is a probability distribution over random variable $X$ only.
Using conditional pdfs, we can factorize the joint pdf of two random variables as $$p(x,y) = p(x | Y = y) p(y)\;\;\;\;\;\; (1)$$
Two random variables are said to be independent if they are not dependent. If that is the case, the following properties are satisfied:
If we know the joint pdf of two random variables, we have a complete characterization of their joint behavior. Thus, any other probability information can also be inferred from the joint pdf. In particular:
The last two expressions can easily be derived starting from (1). It is interesting to see that the conditional pdfs have the same shape as the joint pdf for the particular value $Y=y$ (or $X=x$). In this sense, the term in the denominator acts as a rescaling factor that ensures that the integral of the pdf over the whole domain of the random variable remains 1.
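The rescaling role of the denominator can be checked numerically. The sketch below reuses an illustrative discretized bivariate Gaussian (the model and the value $\rho = 0.5$ are assumptions, not from the course): taking a slice of the joint pdf at a fixed $Y = y_0$ and dividing by the marginal $p(y_0)$ yields a function that integrates to 1 over $x$.

```python
import numpy as np

# Illustrative discretized joint pdf of two correlated Gaussian variables
x = np.linspace(-3, 3, 300)
y = np.linspace(-3, 3, 300)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing='ij')

rho = 0.5  # assumed correlation, for illustration only
p_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))

# Marginal p(y): integrate the joint pdf over x
p_y = p_xy.sum(axis=0) * dx

# Conditional pdf p(x | Y = y0): slice of the joint at y0, rescaled by p(y0)
j = np.argmin(np.abs(y - 1.0))          # pick y0 = 1 on the grid
p_x_given_y = p_xy[:, j] / p_y[j]

# The rescaling ensures the conditional integrates to 1 over x
print((p_x_given_y * dx).sum())
```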
In [ ]:
# To visualize plots in the notebook
%matplotlib inline
# Imported libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io # To read matlab files
import pylab
In [ ]: