Vector: $x = \left[x_1 \ x_2 \ \dots \ x_n\right]^T = \left[\begin{array}{c}x_1\\x_2\\\vdots\\x_n \end{array} \right]$
Squared Euclidean norm: $x^T x = \|x\|_2^2$
Norm of a matrix (Frobenius norm): $\|X\|_F = \sqrt{\mathrm{trace}(X^T X)}$
Trace of a matrix: $\displaystyle \mathrm{trace}(X) = \sum_i X_{i,i}$
Inverse of a matrix: $X^{-1} X = X X^{-1} = I$
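A minimal NumPy sketch checking these identities on an arbitrary example matrix (not from the notes):

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Frobenius norm equals sqrt(trace(X^T X))
assert np.isclose(np.sqrt(np.trace(X.T @ X)), np.linalg.norm(X, 'fro'))

# Squared Euclidean norm of a vector equals x^T x
x = np.array([1.0, -2.0, 3.0])
assert np.isclose(x @ x, np.linalg.norm(x, 2) ** 2)

# Inverse: X^{-1} X = X X^{-1} = I
X_inv = np.linalg.inv(X)
assert np.allclose(X_inv @ X, np.eye(2))
assert np.allclose(X @ X_inv, np.eye(2))
```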
Determinant
Set of linearly dependent vectors $x_t$: $\exists w, t^*$ such that $x_{t^*} = \sum_{t \ne t^*} w_t x_t$
definition: the rank of a matrix is the number of linearly independent columns (equivalently, rows); it is at most $\min(\#\text{columns}, \#\text{rows})$
definition: range of a matrix
$$\mathcal{R}(X) = \left\{x \in \mathbb{R}^n \mid \exists w \text{ such that } x = \sum_j w_j X_{\cdot,j} \right\}$$
definition: Nullspace of a matrix: $\left\{ x \in \mathbb{R}^n \mid Xx = 0\right\}$
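A small NumPy illustration of rank and null space on a hand-picked rank-deficient matrix (the example is ad hoc, not from the notes):

```python
import numpy as np

# Third column is the sum of the first two, so the columns are linearly dependent
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

print(np.linalg.matrix_rank(X))   # 2, less than min(#columns, #rows) = 3

# A vector in the null space: X x = 0
x_null = np.array([1.0, 1.0, -1.0])
print(X @ x_null)                  # [0. 0. 0.]
```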
For the SVD $X = U \Sigma V^T$: the columns of $U$ span the column space of $X$ and the columns of $V$ span the row space.
$U^T U = I$ $V^TV = I$
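A quick numerical check of these orthogonality conditions, assuming $U$, $\Sigma$, $V^T$ come from np.linalg.svd on a random matrix:

```python
import numpy as np

X = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(3))    # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(3))  # columns of V (rows of Vt) are orthonormal
assert np.allclose((U * s) @ Vt, X)       # X = U Σ V^T
```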
The null space of $A$ is the set of all vectors that $A$ maps to zero: $\{x \mid A x = 0\}$.
$$\mathcal{N}(X) = \left\{x\in \mathbb{R}^n \mid Xx = 0\right\}$$
Eigenvalues and eigenvectors (square matrix):
$$\left\{ \lambda_i, u_i \mid Xu_i = \lambda_i u_i \text{ and } u_i^T u_j = 1_{i=j}\right\}$$
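A small NumPy check of the eigenvector definition; the matrix is an arbitrary symmetric example so that the eigenvectors are orthonormal:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # symmetric, so np.linalg.eigh applies

lams, U = np.linalg.eigh(X)          # eigenvalues and orthonormal eigenvectors

for i in range(2):
    assert np.allclose(X @ U[:, i], lams[i] * U[:, i])   # X u_i = λ_i u_i
assert np.allclose(U.T @ U, np.eye(2))                    # u_i^T u_j = 1_{i=j}
```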
Properties
derivative: $\displaystyle \frac{d}{dx}f(x) = \lim_{\Delta\to 0} \frac{f(x + \Delta) - f(x)}{\Delta}$
partial derivative
$$\frac{\partial}{\partial x_i} f(x_1, \dots, x_n) = \lim_{\Delta\to 0} \frac{f(x_1, \dots, x_i + \Delta, \dots, x_n) - f(x_1, \dots, x_n)}{\Delta}$$
Example: $f(x,y) = \frac{x^2}{y}$
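Carrying the example through, the partial derivatives are
$$\frac{\partial f}{\partial x} = \frac{2x}{y}, \qquad \frac{\partial f}{\partial y} = -\frac{x^2}{y^2}$$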
Probability space: the triplet $\left( \Omega, \mathcal{F}, P\right)$, where $\Omega$ is the sample space, $\mathcal{F}$ the set of events, and $P$ the probability measure
properties of probabilities
conditional probability: $P(x_1|x_2) = \dfrac{P(x_1, x_2)}{P(x_2)}$
$$\sum_{x_1} P(X_1 = x_1|X_2 = c) = 1$$
Given two random variables $x_1$ and $x_2$ and their marginal probabilities $P(x_1), P(x_2)$, can we recover their joint distribution $P(x_1,x_2)$?
Answer: No. The marginals do not determine the joint; we also need the conditional distribution $P(x_1|x_2)$, since $P(x_1, x_2) = P(x_1|x_2) P(x_2)$.
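A small NumPy illustration of this point: two joint tables (chosen ad hoc here) share the same marginals but have different conditionals.

```python
import numpy as np

# Two joint distributions over binary (x1, x2) with identical marginals
joint_indep = np.array([[0.25, 0.25],
                        [0.25, 0.25]])   # x1 and x2 independent
joint_dep   = np.array([[0.40, 0.10],
                        [0.10, 0.40]])   # x1 and x2 correlated

for J in (joint_indep, joint_dep):
    print(J.sum(axis=1), J.sum(axis=0))  # same marginals: [0.5 0.5] [0.5 0.5]

# The conditional P(x1 | x2 = 0) differs, and each conditional sums to 1
for J in (joint_indep, joint_dep):
    cond = J[:, 0] / J[:, 0].sum()
    print(cond, cond.sum())
```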
generative models
(Q) Given a discrete probability distribution, how do we draw samples from it?
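A minimal sketch of one answer, using the inverse-CDF trick on an arbitrary distribution (np.random.choice with the `p` argument does the same thing directly):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])           # discrete distribution over {0, 1, 2}
rng = np.random.default_rng(0)

def sample(p, rng):
    u = rng.uniform()                    # u ~ Uniform(0, 1)
    return int(np.searchsorted(np.cumsum(p), u))  # first index whose CDF >= u

draws = [sample(p, rng) for _ in range(10000)]
print(np.bincount(draws) / len(draws))   # empirical frequencies ≈ p
```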
(Q) What about the continuous case? We have a density $\mathbf{p}(x)$, where $x$ is a continuous variable.
Importance sampling: with samples $x_k \sim q$,
$$\mathbf{E}_{\mathbf{p}}[f(x)] = \int_x f(x) \frac{\mathbf{p}(x)}{q(x)} q(x)\, dx \approx \frac{1}{K} \sum_k f(x_k) \frac{\mathbf{p}(x_k)}{q(x_k)}$$
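A hedged NumPy sketch of this estimate; the target $\mathbf{p}$, proposal $q$, and $f$ below are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100_000

# Target p = N(0, 1), proposal q = N(0, 2^2), f(x) = x^2 (true value E_p[x^2] = 1)
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=K)                        # samples x_k ~ q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # weights p(x_k) / q(x_k)
print(np.mean(x ** 2 * w))                              # ≈ 1.0
```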
In Markov chain Monte Carlo (MCMC), the selected points are not independent: each sample is drawn conditioned on the previous one
$$x^{(1)} \xrightarrow{T(x'\leftarrow x)} x^{(2)} \rightarrow x^{(3)} \rightarrow \dots \rightarrow x^{(K)}$$
$T(x' \leftarrow x)$ is a transition operator that must satisfy certain properties (in particular, it must leave the target distribution invariant)
Gibbs sampling is an MCMC method which uses the following transition operator $T(x' \leftarrow x)$: pick a variable index $i$; sample $x'_i \sim p(x_i \mid x_{-i})$ with all other variables held fixed ($x'_j = x_j$ for $j \ne i$); return $x'$
Often, we simply cycle through the variables in random order
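A minimal Gibbs-sampling sketch for a bivariate Gaussian with correlation $\rho$, where each conditional is itself Gaussian; the target distribution is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # target: 2D standard Gaussian with correlation rho
K = 50_000

x = np.zeros(2)
samples = np.empty((K, 2))
for k in range(K):
    # cycle through the variables, resampling each from its conditional
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho ** 2))  # x1 | x2 ~ N(rho*x2, 1-rho^2)
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho ** 2))  # x2 | x1 ~ N(rho*x1, 1-rho^2)
    samples[k] = x

print(np.corrcoef(samples.T)[0, 1])   # ≈ rho
```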
$$H(X) = \sum_i p_i \log_2 (\frac{1}{p_i})$$
Since $\sum_i p_i = \sum_i q_i = 1$, $\sqrt{p}$ and $\sqrt{q}$ are unit vectors. Hellinger distance:
$$D_H(p,q) = \frac{1}{\sqrt{2}}\|\sqrt{p} - \sqrt{q}\|_2$$
$$D_{KL}(p,q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
Jensen–Shannon divergence:
$$D_{JS}(p,q) = \frac{1}{2} D_{KL}(p,r) + \frac{1}{2} D_{KL}(q,r)$$
where $r=\frac{p+q}{2}$ is the average distribution
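A small NumPy sketch computing these quantities for two arbitrary discrete distributions (the vectors `p` and `q` are made up for illustration):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

entropy   = np.sum(p * np.log2(1 / p))                          # H(p)
hellinger = np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)  # D_H(p, q)
kl        = np.sum(p * np.log2(p / q))                          # D_KL(p, q)
r = (p + q) / 2
js = 0.5 * np.sum(p * np.log2(p / r)) + 0.5 * np.sum(q * np.log2(q / r))  # D_JS(p, q)

print(entropy, hellinger, kl, js)
```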