In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
In [2]:
def plot_dist(dist, xmin, xmax, ax):
    x = np.linspace(xmin, xmax, 1000)
    y = dist.pdf(x)
    ax.plot(x, y)
    ax.set_ylim(0, y.max() + y.max() / 4.0)
The probability of the disjunction (union) of two events (or assertions) is given by:

$$p(A\ or\ B) = p(A) + p(B) - p(A\ and\ B)$$

The probability of the conjunction of the events $A$ and $B$ (the product rule) is given by:

$$p(A\ and\ B) = p(A|B)p(B)$$

The probability of the event $A$ given that the event $B$ is true is given by:

$$p(A|B) = \frac{p(A\ and\ B)}{p(B)}$$

Bayes' rule is:

$$p(X|Y) = \frac{p(X)p(Y|X)}{p(Y)}$$

This can be derived from the sum and the product rules.
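As a quick sanity check, here is a minimal sketch that verifies Bayes' rule numerically on a made-up joint distribution over two binary events (the numbers are purely illustrative):

In [ ]:
# Hypothetical joint distribution over X and Y: rows index X, columns index Y
joint = np.array([[0.08, 0.02],   # p(X=1, Y=1), p(X=1, Y=0)
                  [0.09, 0.81]])  # p(X=0, Y=1), p(X=0, Y=0)

p_x = joint.sum(axis=1)  # marginal p(X)
p_y = joint.sum(axis=0)  # marginal p(Y)

# Conditional straight from the joint: p(X=1 | Y=1) = p(X=1, Y=1) / p(Y=1)
direct = joint[0, 0] / p_y[0]

# Bayes' rule: p(X=1 | Y=1) = p(X=1) p(Y=1 | X=1) / p(Y=1)
p_y_given_x = joint[0, 0] / p_x[0]
bayes = p_x[0] * p_y_given_x / p_y[0]

print(direct, bayes)  # the two computations agree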
The $\alpha$ quantile of a cdf $F$, denoted $F^{-1}(\alpha)$, is the value $x_\alpha$ such that
$$F(x_\alpha) = P(X \leq x_\alpha) = \alpha$$

The value $F^{-1}(0.5)$ is the median of the distribution.
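For example, scipy exposes the quantile function $F^{-1}$ as ppf (the inverse of the cdf), so the median and other quantiles of a standard normal can be recovered directly:

In [ ]:
print(stats.norm.ppf(0.5))                      # median of the standard normal -> 0.0
print(stats.norm.cdf(stats.norm.ppf(0.975)))    # recovers alpha = 0.975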
The mean or expected value of a discrete distribution, commonly denoted by $\mu$, is defined as

$$\mathbb{E}[X] = \sum_{x\in\mathcal{X}}x~p(x)$$

whereas for a continuous distribution the mean is defined as

$$\mathbb{E}[X] = \int_{\mathcal{X}}x~p(x)~\mathrm{d}x$$

The variance, denoted by $\sigma^2$, is a measure of the "spread" of a distribution, defined as

$$\sigma^2 = \mathrm{var}[X] = \mathbb{E}\left[(X-\mu)^2\right]$$

where $\mu = \mathbb{E}[X]$. A useful result is

$$\mathbb{E}[X^2] = \sigma^2 + \mu^2$$

The standard deviation is defined as $\mathrm{std}[X] = \sqrt{\sigma^2} = \sigma$
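A quick numerical check of the identity above, using samples from a normal distribution with hypothetical parameters ($\mu = 2$, $\sigma = 3$):

In [ ]:
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # mu = 2, sigma = 3
print(np.mean(samples**2))                    # E[X^2] estimated from samples
print(np.var(samples) + np.mean(samples)**2)  # sigma^2 + mu^2, should match closely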
The covariance between two random variables measures the degree to which they are linearly related:

$$\mathrm{cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

If $\mathbf{x}$ is a $d$-dimensional random vector, its covariance matrix is given by:

$$\mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^\intercal\right] = \begin{bmatrix}\mathrm{var}[X_1] & \mathrm{cov}[X_1, X_2] & \dots & \mathrm{cov}[X_1, X_d] \\ \mathrm{cov}[X_2, X_1] & \mathrm{var}[X_2] & \dots & \mathrm{cov}[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}[X_d, X_1] & \mathrm{cov}[X_d, X_2] & \dots & \mathrm{var}[X_d] \\ \end{bmatrix}$$

The Pearson correlation coefficient between two random variables $X$ and $Y$ is given by:

$$\mathrm{corr}\left[X, Y\right] = \frac{\mathrm{cov}\left[X, Y\right]}{\sqrt{\mathrm{var}[X]\mathrm{var}[Y]}}$$
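A small sketch, with made-up linearly related data, showing how np.cov and np.corrcoef estimate these quantities from samples:

In [ ]:
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)  # y is linearly related to x, plus noise

print(np.cov(x, y))       # 2x2 sample covariance matrix
print(np.corrcoef(x, y))  # Pearson correlation, close to 1 for strongly linear data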
The multinomial coefficient is defined as

$$\binom{n}{x_1,\dots,x_K} = \frac{n!}{x_1! \dots x_K!}$$

Given a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$, the (weighted) empirical distribution is defined as

$$p_{\mathrm{emp}}(A) = \sum_{i=1}^{N}w_i \, \delta_{x_i}(A)$$

where $0\leq w_i \leq 1$ and $\sum_i w_i = 1$; the unweighted case corresponds to $w_i = 1/N$.
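The multinomial coefficient can be computed straight from the factorial definition; the small helper below is written here just for illustration:

In [ ]:
from math import factorial

def multinomial_coefficient(n, xs):
    # n! / (x_1! * ... * x_K!), assuming sum(xs) == n
    result = factorial(n)
    for x in xs:
        result //= factorial(x)
    return result

print(multinomial_coefficient(6, [2, 2, 2]))  # 6!/(2!2!2!) = 90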
In [3]:
ax = plt.subplot(111)
plot_dist(stats.norm, -4, 4, ax)
In [4]:
ax = plt.subplot(111)
plot_dist(stats.laplace, -6, 6, ax)
Where $$\Gamma(x) = \int_{0}^{\infty}u^{x-1}e^{-u}~\mathrm{d}u$$
In [5]:
ax = plt.subplot(111)
plot_dist(stats.gamma(1,2), 0, 6, ax)
Where $$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
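Both special functions are available in scipy.special, so the identity relating them can be checked numerically (the values of $a$ and $b$ below are arbitrary):

In [ ]:
from scipy.special import gamma, beta

a, b = 2.5, 4.0
print(beta(a, b))                          # B(a, b)
print(gamma(a) * gamma(b) / gamma(a + b))  # Gamma(a)Gamma(b)/Gamma(a+b), same value
print(gamma(5))                            # Gamma(n) = (n-1)!, so this is 4! = 24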
Pareto $$P(x~|~k,m) = km^kx^{-(k+1)}\mathbb{I}(x\geq m)$$
Student t $$P(x~|~\nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$
where $\nu$ is the number of degrees of freedom
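Both distributions are available in scipy.stats and can be plotted with the same helper; the shape parameters below are arbitrary choices:

In [ ]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_dist(stats.pareto(b=3), 1, 4, ax1)   # Pareto with k = 3, m = 1
plot_dist(stats.t(df=2), -6, 6, ax2)      # Student t with nu = 2 degrees of freedom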
where $\mathbf{\mu} = \mathbb{E}[\mathbf{x}]$ is the mean and $\mathbf{\Sigma} = \mathrm{cov}[\mathbf{x}]$ is the covariance matrix.
In [6]:
x = np.arange(-4.0, 4.0, 0.1)
y = np.arange(-3.0, 3.0, 0.1)
X, Y = np.meshgrid(x, y)
# mlab.bivariate_normal was removed from matplotlib; build the same density
# (sigma_x = sigma_y = 1, sigma_xy = 0.7) with scipy.stats.multivariate_normal instead
pos = np.dstack((X, Y))
rv = stats.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.7], [0.7, 1.0]])
Z = rv.pdf(pos)
plt.contour(X,Y,Z);
Where $\Sigma$ is called the *scale matrix*, $\nu$ is a scalar that controls how fat the tails of the distribution are (the larger it is, the thinner the tails, and the distribution tends to the multivariate normal) and $V = \nu\Sigma$. This distribution has the following properties: $\mu$ is the mean and also the median, and the covariance is $\frac{\nu}{\nu - 2}\Sigma$ (defined for $\nu > 2$).
where $S_k = \left\{x: 0 \leq x_k \leq 1, \sum_{k=1}^{K} x_k = 1\right\}$ is the support of the distribution (the probability simplex) and $B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$ is the beta function for $K$ variables.
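A short sketch, with an arbitrary concentration vector, showing that Dirichlet samples live on the simplex, i.e. each sample is non-negative and sums to 1:

In [ ]:
alpha = [2.0, 3.0, 5.0]                  # arbitrary concentration parameters
samples = np.random.dirichlet(alpha, size=5)
print(samples)
print(samples.sum(axis=1))               # every row sums to 1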
Consider $N$ independent and identically distributed random variables with pdf $p(x_i)$ (not necessarily Gaussian), each with mean $\mu$ and variance $\sigma^2$. Let $S_N = \sum_{i=1}^{N} X_i$ be the sum of these random variables. The central limit theorem states that, as $N$ grows,

$$p(S_N = s) \approx \frac{1}{\sqrt{2\pi N\sigma^2}}\exp\left(-\frac{(s - N\mu)^2}{2N\sigma^2}\right)$$

That is, the distribution of

$$Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}}$$

where $\bar{X}$ is the empirical mean, converges to the standard normal.
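A quick illustration using uniform variables (which are clearly not Gaussian): the histogram of the standardized sums $Z_N$ should look like the standard normal pdf overlaid on top of it.

In [ ]:
rng = np.random.default_rng(2)
N = 50
mu, sigma = 0.5, np.sqrt(1 / 12)                   # mean and std of Uniform(0, 1)
S = rng.uniform(size=(10_000, N)).sum(axis=1)      # 10,000 realizations of S_N
Z = (S - N * mu) / (sigma * np.sqrt(N))            # standardized sums
plt.hist(Z, bins=50, density=True)
ax = plt.gca()
plot_dist(stats.norm, -4, 4, ax)                   # overlay the standard normal pdf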
If $\mathbf{y} = f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$ then:

$$\mathbb{E}[\mathbf{y}] = \mathbf{A}\mathbf{\mu} + \mathbf{b}$$

$$\mathrm{cov}[\mathbf{y}] = \mathbf{A} \mathbf{\Sigma} \mathbf{A}^\intercal$$

where $\mathbf{\mu} = \mathbb{E}[\mathbf{x}]$ and $\mathbf{\Sigma} = \mathrm{cov}[\mathbf{x}]$. If $X$ is a discrete random variable, we can derive the pmf of $y = f(x)$ by summing up the probability mass for all $x$ such that $f(x) = y$:

$$P(y) = \sum_{x:~f(x)=y} P(x)$$

If $X$ is continuous we instead work with the cdf (here assuming $f$ is monotonically increasing):

$$F_y(y) = P(f(X) \leq y) = P(X \leq f^{-1}(y)) = F_x(f^{-1}(y))$$

Taking derivatives, and writing $x = f^{-1}(y)$, we get:

$$p_y(y) = p_x(x) \left|\frac{dx}{dy}\right|$$

If $\mathbf{x}$ is a multivariate continuous random variable we get:

$$p_y(\mathbf{y}) = p_x(\mathbf{x}) \left|\mathrm{det} ~ \mathbf{J}_{\mathbf{y} \rightarrow \mathbf{x}}\right|$$

where $\mathbf{J}_{\mathbf{y} \rightarrow \mathbf{x}} = \frac{\partial \mathbf{x}}{\partial \mathbf{y}}$ is the Jacobian matrix.
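A numeric check of the linear-transformation rules, using samples from a 2-d Gaussian with made-up $\mathbf{\mu}$, $\mathbf{\Sigma}$, $\mathbf{A}$ and $\mathbf{b}$:

In [ ]:
rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b                                    # y = A x + b for each sample

print(y.mean(axis=0), A @ mu + b)                  # empirical mean vs. A mu + b
print(np.cov(y, rowvar=False))                     # close to A Sigma A^T
print(A @ Sigma @ A.T)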
First generate $S$ samples from the distribution, call them $x_1, \dots, x_S$. We approximate the distribution of $f(X)$ by using the empirical distribution of $\{f(x_s)\}_{s=1}^{S}$. To compute the expected value of any function of a random variable we do:

$$\mathbb{E}[f(X)] = \int f(x)p(x)dx \approx \frac{1}{S}\sum_{s=1}^{S} f(x_s)$$

By varying the function $f$ we can approximate quantities such as the mean, variance, cdf and median of a random variable.
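For instance, a Monte Carlo sketch of $\mathbb{E}[X^2]$ for a standard normal (whose exact value is $\sigma^2 + \mu^2 = 1$), plus a cdf estimate for comparison against stats.norm.cdf:

In [ ]:
S = 100_000
xs = np.random.randn(S)          # samples from the standard normal
print(np.mean(xs**2))            # Monte Carlo estimate of E[X^2], close to 1
print(np.mean(xs <= 0.5))        # estimate of the cdf at 0.5, close to stats.norm.cdf(0.5)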
We know that the area of a circle is given by $\pi r^2$, but it is also given by:

$$A = \int_{-r}^r \int_{-r}^r \mathbb{I}(x^2 + y^2 \leq r^2)~dx~dy$$

therefore $\pi = \frac{A}{r^2}$. To use Monte Carlo to estimate $A$, we simply take $p(x)$ and $p(y)$ to be uniform distributions on $[-r,r]$, so that $p(x) = p(y) = \frac{1}{2r}$, and take $f(x,y)$ to be $\mathbb{I}(x^2 + y^2 \leq r^2)$. So, by using the Monte Carlo approximation:

$$A = 4r^2 \int_{-r}^r \int_{-r}^r \mathbb{I}(x^2 + y^2 \leq r^2)~p(x)~p(y)~dx~dy \approx 4r^2 \frac{1}{S} \sum_{s=1}^{S} f(x_s, y_s)$$
In [7]:
def plot_circle(radius):
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.set_xlim(-radius, radius)
    ax.set_ylim(-radius, radius)
    circle = plt.Circle((0, 0), radius, color='k', fill=False)
    ax.add_artist(circle)
    return ax

def in_circle(point, radius):
    x, y = point
    return x**2 + y**2 <= radius**2

def sample_points(a, b, n):
    # Draw n points uniformly at random from the square [a, b] x [a, b]
    for i in range(n):
        x = np.random.uniform(a, b)
        y = np.random.uniform(a, b)
        yield x, y

def compute_pi(n_points, radius=1):
    ax = plot_circle(radius)
    count = 0
    for point in sample_points(-radius, radius, n_points):
        color = 'b'
        if in_circle(point, radius):
            count += 1
            color = 'r'
        ax.scatter(point[0], point[1], color=color)
    # Fraction of points inside the circle times the area of the square
    return 4 * count / n_points

pi = compute_pi(10000)
print("The value of pi is:", pi)
Using the central limit theorem, one can show that the error of the Monte Carlo estimate $\hat{\mu}$ of $\mu = \mathbb{E}[f(X)]$ satisfies:

$$(\hat{\mu} - \mu) \rightarrow \mathcal{N}\left(0, \frac{\sigma^2}{S}\right)$$

where $\sigma^2 = \mathrm{var}[f(X)]$ can be estimated using:

$$\hat{\sigma}^2 = \frac{1}{S} \sum_{s=1}^{S} \left(f(x_s) - \hat{\mu}\right)^2$$

Then we have:

$$P\left\{\mu - 1.96\frac{\hat{\sigma}}{\sqrt{S}} \leq \hat{\mu} \leq \mu + 1.96\frac{\hat{\sigma}}{\sqrt{S}}\right\} \approx 0.95$$

That is, the estimate $\hat{\mu}$ falls within $1.96\frac{\hat{\sigma}}{\sqrt{S}}$ of the true value with roughly $95\%$ probability. The number of samples $S$ controls this tolerance: the larger $S$ is, the smaller the tolerance, but the slower the approximation.
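Applied to the $\pi$ example above (re-done here in vectorized form so that the per-sample values $f(x_s, y_s)$ are kept around), a minimal sketch of the resulting $95\%$ interval:

In [ ]:
S = 100_000
xs = np.random.uniform(-1, 1, size=S)
ys = np.random.uniform(-1, 1, size=S)
f = 4.0 * (xs**2 + ys**2 <= 1)           # per-sample values, so that mean(f) estimates pi

mu_hat = f.mean()
sigma_hat = f.std()
tol = 1.96 * sigma_hat / np.sqrt(S)
print(mu_hat, "+/-", tol)                # pi lies in this interval with ~95% probability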
Although information theory is concerned with representing data in a compact fashion and transmitting it in a robust manner, it has a lot to do with probability, so we introduce a few of its most useful concepts.
The entropy of a random variable $X$ is given by:
$$\mathbb{H}(X) = -\sum_{k=1}^{K} p(k) \log p(k)$$

It is measured in bits if we use log base 2 and in nats if we use log base $e$. This measures the amount of uncertainty we have about the random variable $X$.
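For example, the entropy of a fair coin is exactly 1 bit, while a biased coin is less uncertain; scipy.stats.entropy computes this directly:

In [ ]:
print(stats.entropy([0.5, 0.5], base=2))   # fair coin: 1 bit
print(stats.entropy([0.9, 0.1], base=2))   # biased coin: ~0.469 bits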
A way to measure the dissimilarity between two distributions $p$ and $q$ is the KL-divergence (or relative entropy), defined as follows:

$$\mathbb{KL}[p\|q] = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} = - \mathbb{H}(p) + \mathbb{H}(p,q)$$

where $\mathbb{H}(p,q) = - \sum_k p_k \log q_k$ is the cross entropy, which can be interpreted as the average number of bits needed to encode data coming from a source with distribution $p$ when we use a model $q$ to define our codebook. Therefore the KL-divergence can be interpreted as the extra information needed to encode the data due to using distribution $q$ instead of $p$. The KL-divergence is always non-negative, and it is zero if and only if $p=q$.
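scipy.stats.entropy also computes the KL-divergence when given two distributions; a small check with made-up distributions:

In [ ]:
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(stats.entropy(p, q))   # KL[p || q] > 0
print(stats.entropy(p, p))   # KL[p || p] = 0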
To measure how much knowing a random variable $X$ tells you about another one, $Y$, we could use the correlation or covariance, but these only work for real-valued random variables and only capture linear relations. A more general approach is to use the mutual information:

$$I(X;Y) = \mathbb{KL}[P(X,Y)\|P(X)P(Y)]$$

which is zero if and only if the variables are independent. This can also be written as:

$$I(X;Y) = \mathbb{H}(X) - \mathbb{H}(X|Y)$$

where $\mathbb{H}(X|Y) = \sum_y p(y) \mathbb{H}(X|Y=y)$ is the conditional entropy. This way we can interpret the MI as the reduction in uncertainty about $X$ after observing $Y$. For continuous random variables, one can use a generalization of the MI called the MIC (maximal information coefficient).
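As a final sketch, the mutual information of a small made-up joint distribution can be computed directly from the KL definition above:

In [ ]:
# Hypothetical joint distribution P(X, Y) over two binary variables
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
px = joint.sum(axis=1, keepdims=True)            # marginal P(X)
py = joint.sum(axis=0, keepdims=True)            # marginal P(Y)

mi = np.sum(joint * np.log(joint / (px * py)))   # KL[P(X,Y) || P(X)P(Y)], in nats
print(mi)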