Generative Models

  • Unsupervised learning: uses only the input data (no labels)
Example: the most basic generative model is PCA:

$$X = \mu + \phi \alpha$$

where $\mu$ is the mean, $\phi$ holds the eigenvectors, and $\alpha$ is the vector of coefficients. By sampling different coefficients $\alpha$, we can generate new $X$ values.
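
A minimal NumPy sketch of this generative view of PCA (the toy data and the choice of three components are assumptions for illustration, not from the notes):

```python
import numpy as np

# Sketch: PCA as a generative model, X = mu + phi * alpha.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # toy data: 500 samples, 10 dimensions

mu = X.mean(axis=0)                            # mean
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
phi = Vt[:3].T                                 # top-3 eigenvectors as columns, shape (10, 3)

# Generate new samples: draw coefficients alpha and map them back to data space.
alpha = rng.normal(size=(5, 3)) * (S[:3] / np.sqrt(len(X)))  # scale by per-component std
X_new = mu + alpha @ phi.T                     # five generated samples
```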

Different techniques for unsupervised learning:

  • Autoencoders
  • Generative Adversarial Networks
  • Restricted Boltzmann Machines

Why Generative Models?

  • Go beyond associating inputs with outputs

  • Recognize objects in the world and their factors of variation: e.g., recognize a car and its different configurations (doors open or closed)

  • Understand and imagine how the world evolves

  • Detect surprising events in the world

  • Imagine and generate rich plans for the future

  • Establish concepts as useful for reasoning and decision making

What will we learn?

  • Use deep networks for $f_\theta(\cdot)$
    • fully connected layers

Auto-encoders

  • Decoder

  • Encoder

    We can share (tie) the weights so that $W^* = W^T$

  • If the number of hidden units equals the number of input and output units, then the network can learn an identity mapping and simply reproduce the input at the output.

  • To avoid this, we can make the number of hidden units smaller, so the network is forced to learn useful information from the input.
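
A minimal sketch of a one-hidden-layer autoencoder with tied weights ($W^* = W^T$); the layer sizes, sigmoid activations, and toy batch are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Minimal tied-weight autoencoder sketch; sizes and activations are illustrative assumptions.
d_in, d_hidden = 784, 64
W = (0.01 * torch.randn(d_hidden, d_in)).requires_grad_()   # shared weight matrix
b = torch.zeros(d_hidden, requires_grad=True)               # encoder bias
c = torch.zeros(d_in, requires_grad=True)                   # decoder bias

def encode(x):
    return torch.sigmoid(F.linear(x, W, b))        # h = sigmoid(W x + b)

def decode(h):
    return torch.sigmoid(F.linear(h, W.t(), c))    # x_hat = sigmoid(W^T h + c), tied weights

x = torch.rand(32, d_in)                           # toy batch of inputs in [0, 1]
x_hat = decode(encode(x))
loss = F.binary_cross_entropy(x_hat, x)
loss.backward()   # W.grad accumulates contributions from both the encoder and the decoder
```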

Loss function

  • For binary inputs: cross-entropy loss

  • For real-valued inputs: sum-of-squared differences

    • Use a linear activation function at the output
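
A short illustration of the two loss choices (tensor shapes and values below are made up for the example):

```python
import torch
import torch.nn.functional as F

# Binary inputs: sigmoid output + cross-entropy loss.
x_bin  = torch.rand(8, 784)                      # targets in [0, 1]
logits = torch.randn(8, 784)                     # decoder pre-activations
bce = F.binary_cross_entropy_with_logits(logits, x_bin)

# Real-valued inputs: linear (identity) output activation + sum-of-squared differences.
x_real = torch.randn(8, 784)
x_hat  = torch.randn(8, 784)                     # linear output, no squashing
sse = F.mse_loss(x_hat, x_real, reduction="sum")
```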

Gradients:

  • In both cases above, the gradient of the loss with respect to the output pre-activation is $\hat{x} - x$

  • When weights are shared (tied/coupled), the gradient with respect to $W$ is the sum of the gradients from its encoder and decoder uses

Undercomplete Hidden Layers

  • When hidden layer is smaller than the input layer

    • Hidden layer compresses the input
    • Will compress well only for the training distribution

      • Example: if trained on MNIST, it learns good compressed features for digit inputs

Overcomplete Hidden Layer

  • Will learn the identity matrix

    • No compression
  • This will only be useful in certain scenarios

Linear Autoencoder

  • If the decoder is linear, what is the best encoder for the MSE loss?

    Theorem: let $A$ be any matrix, with SVD decomposition $A = U\Sigma V^T$. The best rank-$K$ approximation of $A$ (in the squared-error sense) keeps only the top $K$ singular values and vectors.

    • The corresponding rank-$K$ encoder projects onto the top-$K$ right singular vectors:

      $h(X) = V^T_{\le K} X$

      i.e. $h(X) = f(\tilde{W} X)$ with $\tilde{W} = V^T_{\le K}$ and a linear $f$.

Optimality

  • If the inputs are normalized (by subtracting the mean):
    • then the encoder corresponds to PCA
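
A small NumPy check of this equivalence (the toy data, the rank $K=3$, and the comparison against the covariance eigenvectors are assumptions for illustration):

```python
import numpy as np

# With a linear decoder and MSE loss, the optimal rank-K encoder projects onto the
# top-K right singular vectors of the mean-subtracted data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # toy correlated data
Xc = X - X.mean(axis=0)                                      # subtract the mean

K = 3
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_enc = Vt[:K]                   # encoder weights, rows = top-K directions
H = Xc @ W_enc.T                 # codes: h(X) = V_{<=K}^T X
X_rec = H @ W_enc                # linear decoder: best rank-K reconstruction

# Same subspace as PCA on the covariance matrix (up to sign).
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
top = evecs[:, np.argsort(evals)[::-1][:K]]
print(np.allclose(np.abs(W_enc @ top), np.eye(K), atol=1e-6))  # True
```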

A Probabilistic Viewpoint

Goal: modeling $p(x)$

Three ways to do this:

  • Fully observed models

    • An undirected graphical model
    • here there are no latent variables, so we can directly model the joint distribution
    • Example: recurrent neural network
  • Transformation models:

    • start from a random vector $z$ that can be generated as $z\sim \mathcal{N}(0,I)$, then learn a transformation function $f$ that maps $z \to x = f(z)$
    • transform an unobserved noise source using a parameterized function
    • Examples: many sampling functions, Generative Adversarial Networks (GANs)
  • Latent Variable Model: both $x$ and $z$ are random variables
    • modeling hidden causes
    • introduce unobserved local random variables

Review of graphical models: how to get the joint probability distribution $p(x_1, \dots, x_n)$

  • Directed graphical model
  • Undirected graphical model (here we have to deal with the partition function $Z$)

    A directed graphical model captures the independence assumptions:

    Example: $x_1\to x_2 \ \ x_2\to x_3 \ \ x_2 \to x_4 \ \ x_4\to x_3$

    $p(x_1,x_2,x_3,x_4) = p(x_1)~p(x_2|x_1)~p(x_3|x_2,x_4)~p(x_4|x_2)$

    In general, we can write $p(x_1,\dots,x_n) = \prod_{i=1}^n p(x_i|\pi(x_i))$, where $\pi(x_i)$ represents the parents of node $x_i$.

    • For undirected:

    $$p(x_1,\dots,x_n) = \frac{1}{Z}\prod_{i=1}^m f_i (\phi_i(x))$$

In undirected graphical models, we have to deal with the partition function $Z$.
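
To make the directed factorization above concrete, here is a small sketch with made-up conditional probability tables for binary variables (the numbers are assumptions; only the graph structure matches the example):

```python
# Evaluating p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2,x4) p(x4|x2) for binary variables.
p_x1      = {0: 0.6, 1: 0.4}                                          # p(x1)
p_x2_x1   = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}      # p(x2|x1), keyed by (x1, x2)
p_x4_x2   = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9}      # p(x4|x2), keyed by (x2, x4)
p_x3_x2x4 = {(0, 0, 0): 0.9, (0, 0, 1): 0.1, (0, 1, 0): 0.4, (0, 1, 1): 0.6,
             (1, 0, 0): 0.3, (1, 0, 1): 0.7, (1, 1, 0): 0.2, (1, 1, 1): 0.8}  # p(x3|x2,x4), keyed by (x2, x4, x3)

def joint(x1, x2, x3, x4):
    return p_x1[x1] * p_x2_x1[(x1, x2)] * p_x3_x2x4[(x2, x4, x3)] * p_x4_x2[(x2, x4)]

# The directed factorization sums to 1 over all configurations: no partition function needed.
total = sum(joint(a, b, c, d) for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(total)   # 1.0
```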

Inferential Problems

  1. Evidence Estimation

    $$p(x) = \int p(x,z) dz$$

  2. Moment computation:

    $$\mathbf{E}[f(z)|x] = \int f(z)\, p(z|x)\, dz$$

  3. Prediction

    $$p(x_{t+1}) = \int p(x_{t+1}|x_t)\, p(x_t)\, dx_t$$

  4. Hypothesis testing:

    $$\mathcal{B} = \log p(x|H_1) - \log p(x|H_2)$$

Importance sampling

To estimate $p(x) = \int p(x|z)\,p(z)\,dz$, we draw samples $z^{(s)}$ from a simpler proposal distribution $q(z)$ and reweight them:

$$w^{(s)} = \frac{p(z^{(s)})}{q(z^{(s)})}, \qquad z^{(s)} \sim q(z)$$

$w^{(s)}$ is called the weight/importance. Intuitively speaking, it tells how well the proposal distribution $q(z)$ matches the true distribution $p(z)$.

Then, with sampling, we convert the integral into a summation:

$$p(x) \approx \frac{1}{S}\sum_{s}w^{(s)} p(x|z^{(s)})$$
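
A small sketch of this estimator on a toy model where the answer is known in closed form (the Gaussian prior, likelihood, and proposal are assumptions chosen for illustration):

```python
import numpy as np
from scipy import stats

# Importance sampling for the evidence p(x) = ∫ p(x|z) p(z) dz.
# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), observed x = 2.0.
rng = np.random.default_rng(0)
x_obs, S = 2.0, 100_000

# Proposal q(z): a Gaussian that need not match the prior p(z).
q_mu, q_sigma = 1.0, 2.0
z = rng.normal(q_mu, q_sigma, size=S)                                 # z^(s) ~ q(z)

w = stats.norm.pdf(z, 0.0, 1.0) / stats.norm.pdf(z, q_mu, q_sigma)    # w^(s) = p(z^(s)) / q(z^(s))
p_x_given_z = stats.norm.pdf(x_obs, z, 1.0)

p_x_hat = np.mean(w * p_x_given_z)                   # (1/S) Σ w^(s) p(x | z^(s))
p_x_true = stats.norm.pdf(x_obs, 0.0, np.sqrt(2.0))  # exact: x ~ N(0, 2) after marginalizing z
print(p_x_hat, p_x_true)
```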

Importance Sampling to Variational Inference

For this, we use Jensen's inequality: $$\log\left(\int p(x) q(x) dx\right) \ge \int p(x) \log q(x) dx$$
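
Applying this inequality with the proposal $q(z)$ as the averaging distribution gives the standard variational lower bound (ELBO); the derivation below is standard and written out here for completeness:

$$\log p(x) = \log \int p(x|z)\,p(z)\,dz = \log \int q(z)\,\frac{p(x|z)\,p(z)}{q(z)}\,dz \ge \int q(z)\,\log\frac{p(x|z)\,p(z)}{q(z)}\,dz$$

$$= \mathbf{E}_{q(z)}\left[\log p(x|z)\right] - \mathrm{KL}\!\left(q(z)\,\|\,p(z)\right)$$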

Generative Models

  • Goal: we want to find the joint distribution: $$P(x,z)$$

    For this, we need to know

    • $P(x)$
    • $P(x|z)$
    • $P(z|x)$
    • $P(z)$

    • The easiest one to solve first is $P(x)$

    • The hardest one is $P(z|x)$
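
These quantities are tied together by Bayes' rule (a standard identity, noted here to connect the bullets above):

$$P(z|x) = \frac{P(x|z)\,P(z)}{P(x)}, \qquad P(x) = \int P(x|z)\,P(z)\,dz$$

so the posterior $P(z|x)$ combines all of the other three, which is one way to see why it is the hardest to obtain.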

Generative Adversarial Networks


Variational Auto-Encoders



  • We start from the input data/image ($x$)
  • Using the inference network (encoder), we model the parameters of the latent variable $z$: for example $q(z|x) = \mathcal{N}(\mu_z,\Sigma_z)$
  • Using the parameters obtained above, sample a random $z$
  • Use the sampled $z$ to reconstruct the input and obtain $\hat{x}$
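
A minimal sketch of these steps as a variational auto-encoder with a diagonal-Gaussian $q(z|x)$ and the reparameterization trick (layer sizes, activations, and the toy batch are assumptions for illustration, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_z=20):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)             # inference network (encoder)
        self.enc_mu = nn.Linear(d_hidden, d_z)           # mu_z
        self.enc_logvar = nn.Linear(d_hidden, d_z)       # log of diagonal Sigma_z
        self.dec = nn.Sequential(nn.Linear(d_z, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_in))   # decoder

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)          # q(z|x) = N(mu_z, Sigma_z)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample z (reparameterization)
        x_hat_logits = self.dec(z)                               # reconstruct x_hat
        return x_hat_logits, mu, logvar

def loss_fn(x, x_hat_logits, mu, logvar):
    # Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I))
    rec = F.binary_cross_entropy_with_logits(x_hat_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
x = torch.rand(32, 784)                     # toy batch of "images"
x_hat_logits, mu, logvar = model(x)
loss = loss_fn(x, x_hat_logits, mu, logvar)
loss.backward()
```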
