Review Linear Algebra

Matrix Cookbook

Vector: $x = \left[x_1 \ x_2 \ \dots \ x_n\right]^T = \left[\begin{array}{c}x_1\\x_2\\\vdots\\x_n \end{array} \right]$

Norm: $x^T x = \|x\|_2^2$

Norm of a matrix (Frobenius): $\|X\|_F = \sqrt{\mathrm{trace}(X^T X)}$

Trace of a matrix: $\displaystyle \mathrm{trace}(X) = \sum_i X_{i,i}$

Inverse of a matrix: $X^{-1} X = X X^{-1} = I$
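A quick NumPy check of the identities above, using an assumed random example vector and matrix:

```python
import numpy as np

# Example data (not from the notes): a random vector and a random 3x3 matrix.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
X = rng.standard_normal((3, 3))

# x^T x equals the squared 2-norm
assert np.isclose(x @ x, np.linalg.norm(x) ** 2)

# Frobenius norm of a matrix: sqrt(trace(X^T X))
assert np.isclose(np.sqrt(np.trace(X.T @ X)), np.linalg.norm(X, "fro"))

# Inverse: X^{-1} X = X X^{-1} = I (when X is invertible)
Xinv = np.linalg.inv(X)
assert np.allclose(Xinv @ X, np.eye(3)) and np.allclose(X @ Xinv, np.eye(3))
```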

Determinant

  • for a triangular matrix: $\det(X) = \prod_i X_{i,i}$ (product of the diagonal entries)
  • for the transpose of a matrix: $\det(X^T) = \det(X)$
  • for the inverse of a matrix: $\det(X^{-1}) = 1/\det(X)$
  • for a product of matrices: $\det(XY) = \det(X)\det(Y)$
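The determinant properties can be verified numerically on assumed example matrices:

```python
import numpy as np

# Example 4x4 random matrices (not from the notes).
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Triangular matrix: det is the product of the diagonal entries
T = np.triu(A)
assert np.isclose(np.linalg.det(T), np.prod(np.diag(T)))

# Transpose: det(A^T) = det(A)
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))

# Inverse: det(A^{-1}) = 1 / det(A)
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))

# Product: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
```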

Properties of matrices

  • Orthogonal matrix:

    • definition: $X^T X = I$
    • if the inverse of $X$ exists, then $X^{-1} = X^T$
  • Positive definite matrix: $v^T X v > 0 \ \forall v \in R^n, v \ne 0$

    • if $\ge$, then it's called "positive semi-definite"
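For a symmetric matrix, positive definiteness is equivalent to all eigenvalues being positive; a sketch with an assumed example matrix:

```python
import numpy as np

# Example symmetric matrix (not from the notes).
A = np.array([[2.0, -1.0], [-1.0, 2.0]])

# v^T A v > 0 for all nonzero v  <=>  all eigenvalues of symmetric A are > 0
eigvals = np.linalg.eigvalsh(A)
assert np.all(eigvals > 0)  # positive definite

# Spot-check the quadratic form on a few random nonzero vectors
rng = np.random.default_rng(2)
for _ in range(5):
    v = rng.standard_normal(2)
    assert v @ A @ v > 0
```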

Linear dependence

  • Set of linearly dependent vectors $\{x_t\}$: $\exists w, t^*$ such that $x_{t^*} = \sum_{t \ne t^*} w_t x_t$

  • definition: the rank of a matrix is the number of linearly independent columns (equivalently, rows); it is at most $\min(\#\text{columns}, \#\text{rows})$

  • definition: range of a matrix

    $$\mathcal{R}(X) = \left\{x \in R^n \mid \exists w \text{ such that } x = \sum_j w_j X_{.,j} \right\}$$

  • definition: Nullspace of a matrix: $\mathcal{N}(X) = \left\{ x \in R^n \mid Xx = 0\right\}$
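Rank and nullspace for a small example matrix (assumed for illustration) whose third column is the sum of the first two:

```python
import numpy as np

# Example matrix with a deliberate column dependence: col3 = col1 + col2.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0]])

# Rank <= min(#rows, #cols); the dependence makes it 2 here
assert np.linalg.matrix_rank(X) == 2

# Nullspace vectors satisfy X x = 0; one basis vector via the SVD:
_, s, Vt = np.linalg.svd(X)
null_vec = Vt[-1]            # right singular vector for the zero singular value
assert np.allclose(X @ null_vec, 0)
```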

Eigen Analysis of Matrices

  • Eigenvalues and eigenvectors (square matrix) $$\left\{ \lambda_i, u_i \mid Xu_i = \lambda_i u_i \text{ and } u_i^T u_j = 1_{i=j}\right\}$$ (the orthonormality condition $u_i^T u_j = 1_{i=j}$ holds when $X$ is symmetric)

  • Properties

    • Can write $X = \sum_i \lambda_i u_i u_i^T$ (for symmetric $X$)
    • determinant: $\det(X) = \prod_i \lambda_i$
    • positive definite iff all $\lambda_i > 0$
    • rank is the number of non-zero eigenvalues
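These properties can be checked on an assumed symmetric example matrix:

```python
import numpy as np

# Example symmetric matrix (not from the notes); eigenvalues are 1 and 3.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, U = np.linalg.eigh(A)   # eigenvalues ascending, orthonormal eigenvectors

# Reconstruction: A = sum_i lambda_i u_i u_i^T
recon = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(2))
assert np.allclose(recon, A)

# Determinant = product of eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(lam))

# Positive definite: all eigenvalues > 0
assert np.all(lam > 0)

# Rank = number of nonzero eigenvalues
assert np.linalg.matrix_rank(A) == np.sum(~np.isclose(lam, 0))
```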

Singular value decomposition

  • Applicable to any matrix (even non-square): $X = U \Sigma V^T$, with orthogonal $U$, $V$ and $\Sigma$ diagonal with the singular values
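A minimal SVD sketch on an assumed random non-square matrix:

```python
import numpy as np

# Example 3x5 random matrix (not from the notes); SVD works for any shape.
rng = np.random.default_rng(3)
X = rng.standard_normal((3, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, X)   # exact reconstruction X = U S V^T
assert np.all(s >= 0)                        # singular values are nonnegative
```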

Differential Calculus

  • derivative: $\displaystyle \frac{d}{dx}f(x) = \lim_{\Delta\to 0} \frac{f(x + \Delta) - f(x)}{\Delta}$

  • partial derivative

    • for $f(x,y)$: differentiate with respect to one variable while holding the other fixed
    • $$\frac{\partial}{\partial x} f(x,y) = \lim_{\Delta\to 0} \frac{f(x + \Delta, y) - f(x, y)}{\Delta}$$

    • Example: $f(x,y) = \frac{x^2}{y}$

      • $\frac{\partial}{\partial x} f(x,y) = \frac{2x}{y}$
      • $\frac{\partial}{\partial y} f(x,y) = -\frac{x^2}{y^2}$
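The analytic partials of $f(x,y) = x^2/y$ can be sanity-checked with finite differences (the step size and evaluation point are assumed for illustration):

```python
# Finite-difference check of the partial derivatives of f(x, y) = x^2 / y.
def f(x, y):
    return x ** 2 / y

x0, y0, h = 3.0, 2.0, 1e-6   # example point and step size

dfdx = (f(x0 + h, y0) - f(x0, y0)) / h   # vary x only
dfdy = (f(x0, y0 + h) - f(x0, y0)) / h   # vary y only

assert abs(dfdx - 2 * x0 / y0) < 1e-4            # matches 2x/y
assert abs(dfdy - (-x0 ** 2 / y0 ** 2)) < 1e-4   # matches -x^2/y^2
```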

Gradients

Jacobians, Hessians

  • The Hessian has $N^2$ entries; inverting it (e.g., for a Newton step) costs $O(N^3)$

Taylor Series

$$f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2 + ...$$

For vector-valued input $\mathbf{x}$: $$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2}(\mathbf{x}-\mathbf{a})^T H(\mathbf{a})(\mathbf{x}-\mathbf{a})$$ where $H$ is the Hessian.
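A numeric illustration of the scalar Taylor expansion, using the assumed example $f(x) = e^x$ around $a = 0$:

```python
import math

# First- and second-order Taylor approximations of f(x) = e^x around a = 0,
# where f'(a) = f''(a) = e^a.
f = math.exp
a, x = 0.0, 0.1

first = f(a) + f(a) * (x - a)                  # linear approximation
second = first + 0.5 * f(a) * (x - a) ** 2     # add the quadratic term

# Adding the quadratic term shrinks the approximation error
assert abs(second - f(x)) < abs(first - f(x)) < 1e-2
```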

Probability

  • probability space: Triplet $\left( \Omega, \mathcal{F}, P\right)$

    • $\Omega$ is the sample space, $\mathcal{F}$ the set of events, and $P$ the probability measure
  • properties of probabilities

    • $P(w) \ge 0 \ \forall w \in \Omega$, and $P(\Omega) = 1$

Distributions (joint, marginal, conditional)

  • joint distribution
  • marginal
  • conditional

Chain rule, Bayes rule

  • Probability chain rule
  • Bayes rule

Independence between variables

  • Conditional independence

Covariance matrix

  • Independence always implies that $Cov = 0$
  • If $Cov = 0$, independence does not always follow; the converse holds only for jointly Gaussian variables

Sampling

  • generative models

  • (Q) Given a discrete distribution, how do we sample from it?

    • (A) generate the cumulative distribution function, then pick a random number in $[0,1)$ and map it to a discrete sample value
  • (Q) how about the continuous case? We have $\mathbf{p}(x)$ where $x$ is a continuous variable.

    • (A) the previous method does not directly work here. Instead, suppose we want to estimate some expected value by sampling: $\mathbf{E}[f(x)] = \int f(x) \mathbf{p}(x)\, dx$. With $N$ samples $x_i \sim \mathbf{p}$, the approximation is $\displaystyle \mathbf{E}[f(x)] \approx_{N \to \infty} \frac{1}{N} \sum_{i=1}^N f(x_i)$
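The discrete CDF trick above can be sketched as follows (the probabilities are an assumed example):

```python
import numpy as np

# Sample a discrete distribution by inverting its cumulative distribution.
rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])            # example probabilities
cdf = np.cumsum(p)                       # [0.2, 0.7, 1.0]

u = rng.uniform(size=100_000)            # uniform numbers in [0, 1)
samples = np.searchsorted(cdf, u)        # map each u to the first bin with cdf >= u

# Empirical frequencies approach p
freqs = np.bincount(samples, minlength=3) / len(samples)
assert np.allclose(freqs, p, atol=0.01)
```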

Monte Carlo Estimate

  • A method to approximate an expensive expectation $$\mathbf{E}[f(x)] = \int_x f(x) \mathbf{p}(x)\, dx \approx \frac{1}{K} \sum_{k=1}^K f(x_k), \quad x_k \sim \mathbf{p}$$
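A minimal Monte Carlo sketch, with the assumed example $f(x) = x^2$ and $x \sim N(0,1)$, where the exact answer is $1$ (the variance):

```python
import numpy as np

# Monte Carlo estimate of E[f(x)] for f(x) = x^2 and x ~ N(0, 1).
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # x_k ~ p
estimate = np.mean(x ** 2)           # (1/K) sum_k f(x_k)
assert abs(estimate - 1.0) < 0.01    # exact value is Var(x) = 1
```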

Importance sampling

$$\mathbf{E}[f(x)] = \int_x f(x) \frac{\mathbf{p}(x)}{q(x)} q(x)\, dx \approx \frac{1}{K} \sum_k f(x_k) \frac{\mathbf{p}(x_k)}{q(x_k)}, \quad x_k \sim q$$
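An importance-sampling sketch under assumed example choices: target $p = N(0,1)$, proposal $q = N(0, 2^2)$, and $f(x) = x^2$, so the true value is again $1$:

```python
import numpy as np

def p(x):      # target density N(0, 1)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):  # proposal density N(0, 4)
    return np.exp(-x ** 2 / 8) / np.sqrt(8 * np.pi)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1_000_000)   # x_k ~ q
weights = p(x) / q_pdf(x)                  # importance weights p(x_k)/q(x_k)
estimate = np.mean(x ** 2 * weights)       # (1/K) sum_k f(x_k) p(x_k)/q(x_k)
assert abs(estimate - 1.0) < 0.05
```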

Markov-Chain Monte Carlo (MCMC)

  • An iterative method to generate the sequence $x^{(k)}$
  • selected points are not independent

    $$x^{(1)} \rightarrow^{T(x'\leftarrow x)} x^{(2)} \rightarrow x^{(3)} ... \rightarrow x^{(K)}$$

  • $T(x' \leftarrow x)$ is a transition operator, which must satisfy certain properties

  • usually, we drop the first few samples (burn-in), which are not yet reliable.

Gibbs Sampling

  • MCMC method which uses the following transition operator $T(x' \leftarrow x)$:

    • pick a variable $x_i$
    • obtain $x'$ by only resampling this variable according to $\mathbf{p}(x_i \vert x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)$
    • return $x'$

    • Often, we simply cycle through the variables in a random order
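A Gibbs-sampling sketch for an assumed example target: a bivariate Gaussian with correlation $\rho$, chosen because its conditionals are closed-form, $x_1 \vert x_2 \sim N(\rho x_2, 1-\rho^2)$ and symmetrically for $x_2 \vert x_1$:

```python
import numpy as np

# Gibbs sampler alternating the two conditional resampling steps.
rng = np.random.default_rng(0)
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []

for k in range(60_000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # resample x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # resample x2 | x1
    samples.append((x1, x2))

samples = np.array(samples[10_000:])    # drop burn-in samples
est_rho = np.corrcoef(samples.T)[0, 1]
assert abs(est_rho - rho) < 0.05        # recovered correlation is close to rho
```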

Machine Learning

Supervised Learning

  • Learning example: $(x,y)$

Unsupervised

  • Clustering
  • Feature extraction
  • Dimensionality reduction: learning a lower-dimensional representation of input

Learning algorithm

Review: Information Theory

  • Probability over a discrete sample space: $p_i = Pr(X = i)$, with $\sum_i p_i = 1$

Entropy

$$H(X) = \sum_i p_i \log_2 (\frac{1}{p_i})$$

  • Entropy measures "uncertainty" of $X$
  • Entropy has an upper bound: $H(X) \le \log_2(n)$. Equality holds when $p_i = \frac{1}{n} \ \forall i$ (intuitively, all outcomes are equally likely).

Example:

  • For a coin toss, we have $p(X=Head)=p,\ \ p(X=Tail)=1-p$. Then, the entropy is $\displaystyle H(X) = p\log_2(\frac{1}{p}) + (1-p)\log_2(\frac{1}{1-p})$
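The coin-toss entropy can be computed directly:

```python
import math

# Entropy of a biased coin: H = p log2(1/p) + (1-p) log2(1/(1-p)).
def coin_entropy(p):
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

assert math.isclose(coin_entropy(0.5), 1.0)    # fair coin: maximum entropy, 1 bit
assert coin_entropy(0.9) < coin_entropy(0.5)   # a biased coin is less uncertain
```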

Conditional Entropy

  • Let $X$ and $Y$ be two random variables $$H(X\vert Y) = \sum_y H(X\vert Y=y) Pr(Y=y)$$

  • $H(X)\ge H(X\vert Y)$

  • If $X$ is independent of $Y$, then $H(X) = H(X\vert Y)$

Joint Entropy

$$H(X,Y) = \sum_{x,y}Pr(X=x, Y=y) \log_2(\frac{1}{Pr(X=x, Y=y)})$$
  • Chain Rule $H(X,Y) = H(X) + H(Y\vert X)$
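The chain rule can be checked numerically on an assumed example joint probability table:

```python
import numpy as np

# Example joint table: rows index X, columns index Y (assumed values).
joint = np.array([[0.25, 0.25],
                  [0.40, 0.10]])

def H(p):
    p = p[p > 0]                      # ignore zero-probability outcomes
    return -np.sum(p * np.log2(p))

H_XY = H(joint.flatten())             # joint entropy H(X, Y)
H_X = H(joint.sum(axis=1))            # marginal entropy H(X)
# H(Y|X) = sum_x Pr(X=x) H(Y | X=x)
H_Y_given_X = sum(px * H(row / px) for px, row in zip(joint.sum(axis=1), joint))

assert np.isclose(H_XY, H_X + H_Y_given_X)   # chain rule holds
```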

Mutual Information

$$I(X;Y) = H(X)-H(X\vert Y) \\ = H(Y) - H(Y\vert X) \\ = I(Y; X)$$
  • $I(X;X) = H(X) - H(X\vert X) = H(X)$

Conditional Mutual Information

Distance Between Distributions:

  • Let $p$ and $q$ be two random distributions with the same support (domain)

Total Variation

  • $D_{TV}(p,q) = \frac{1}{2} \|p-q\|_1 = \frac{1}{2}\sum_i |p_i - q_i|$

Hellinger Distance

  • define:
    $\sqrt{p} = \left(\sqrt{p_1}, ..., \sqrt{p_n}\right)$, $\sqrt{q} = \left(\sqrt{q_1}, ..., \sqrt{q_n}\right)$
  • $\sqrt{p}$ and $\sqrt{q}$ are unit vectors (in the 2-norm)

    $$D_H(p,q) = \frac{1}{\sqrt{2}}\|\sqrt{p} - \sqrt{q}\|_2$$

Kullback-Leibler Divergence

$$D_{KL}(p,q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$

  • It is an asymmetric "distance": $D_{KL}(p,q) \ne D_{KL}(q,p)$ in general

Jensen-Shannon Divergence

$$D_{JS}(p,q) = \frac{D_{KL}(p,r) + D_{KL}(q,r)}{2}$$

where $r=\frac{p+q}{2}$ is the average distribution
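These distances can be computed side by side for two assumed example distributions, confirming that KL is asymmetric:

```python
import numpy as np

# Example distributions over three outcomes (assumed values).
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.3, 0.5])

def kl(a, b):
    return np.sum(a * np.log2(a / b))

tv = 0.5 * np.sum(np.abs(p - q))     # total variation
r = (p + q) / 2                      # average distribution
js = (kl(p, r) + kl(q, r)) / 2       # Jensen-Shannon

assert 0.0 <= tv <= 1.0
assert not np.isclose(kl(p, q), kl(q, p))   # KL is asymmetric
assert js >= 0.0                             # JS is symmetric by construction
```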

Some relations between these distances

  • $D_{JS}(p,q) \ge D_H^2(p,q)$

  • $D_H^2(p,q) \ge D^2_{TV}(p,q)$

  • $D_{TV}(p,q) \ge \delta$ where $\delta$ is the probability of distinguishing $p$ from $q$ with a sample.

  • $TV$ measure is very conservative, while $JS$ is very generous.
