Gaussian Processes

The term "Bayesian nonparametric" implies not that the model has no parameters, but rather that the number of parameters grows with the size of the dataset. In this sense, Bayesian nonparametric models are infinitely parametric. The goal is to determine $k$, the effective number of model components, as a function of the data.

If our model $p(y|\theta)$ has a large number of parameters in $\theta$, inference requires integrating over a high-dimensional $\theta$. To avoid this multidimensional integration we can either use MCMC or represent the model with Gaussians, for which marginalization and conditioning are available in closed form:

$$p(x \mid \mu, \Sigma) = (2\pi)^{-k/2}|\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x-\mu)^{\prime}\Sigma^{-1}(x-\mu) \right\}$$
  • marginals of multivariate normal distributions are normal
$$p(x,y) = \mathcal{N}\left(\left[ \begin{array}{c} \mu_x \\ \mu_y \end{array} \right], \left[ \begin{array}{cc} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{array} \right]\right)$$

$$p(x) = \int p(x,y) dy = \mathcal{N}(\mu_x, \Sigma_x)$$
  • conditionals of multivariate normals are normal
$$p(x|y) = \mathcal{N}(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y-\mu_y), \Sigma_x-\Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T)$$
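
As a quick numerical check of these two properties, here is a small sketch in numpy; the mean vector, covariance matrix, and conditioning value are arbitrary illustrative choices. It computes the conditional mean and variance of $x$ given $y$ from the formula above and compares them with an empirical estimate obtained by slicing joint samples near the conditioning value.

import numpy as np

# hypothetical bivariate example (mu, Sigma, and y_obs are arbitrary choices)
mu = np.array([1.0, -1.0])            # [mu_x, mu_y]
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])        # [[Sigma_x, Sigma_xy], [Sigma_xy, Sigma_y]]
y_obs = 0.5

# conditional mean and variance of x given y, from the formula above
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y_obs - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# empirical check: restrict joint samples to a thin slice around y_obs
samples = np.random.multivariate_normal(mu, Sigma, size=200000)
slice_x = samples[np.abs(samples[:, 1] - y_obs) < 0.01, 0]
print(cond_mean, cond_var)
print(slice_x.mean(), slice_x.var())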

A Gaussian process (GP) generalizes the multivariate normal to an infinite number of dimensions: it is a distribution over functions, and any finite subset of its values has a joint Gaussian distribution. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a mean function and a covariance function:

$$f(x) \sim \mathcal{GP}(m(x), k(x,x^{\prime}))$$

It is the marginalization property that makes working with a Gaussian process feasible: we can marginalize over the infinitely-many variables that we are not interested in, or have not observed.

An example specification of a GP:

$$\begin{aligned} m(x) &=0 \\ k(x,x^{\prime}) &= \theta_1\exp\left(-\frac{\theta_2}{2}(x-x^{\prime})^2\right) \end{aligned}$$

The covariance function here is a squared exponential, for which values of $x$ and $x^{\prime}$ that are close together yield covariances near the maximum $\theta_1$, while values that are far apart yield covariances near zero.
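
To see what this prior implies, the following minimal sketch (the hyperparameter values and the input grid are arbitrary choices) evaluates the squared-exponential covariance on a grid and draws a few functions from the zero-mean GP prior. A small jitter is added to the diagonal to keep the covariance matrix numerically positive definite.

import numpy as np
import matplotlib.pyplot as plt

def exponential_cov(x, xp, theta_1=1.0, theta_2=10.0):
    # squared-exponential covariance: theta_1 * exp(-theta_2/2 * (x - x')^2)
    return theta_1 * np.exp(-0.5 * theta_2 * np.subtract.outer(x, xp) ** 2)

x = np.linspace(-3, 3, 100)
K = exponential_cov(x, x)

# draw three functions from the zero-mean GP prior
for _ in range(3):
    f = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
    plt.plot(x, f)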

Gaussian processes in machine learning

GPs are used as a generic supervised learning method, most commonly in the context of regression. Some GP regression implementations (for example, the legacy GaussianProcess class in scikit-learn) expose a nugget parameter that is added to the diagonal of the correlation matrix between training points; in general this is a type of Tikhonov regularization. In the special case of a squared-exponential correlation function, this regularization is equivalent to specifying a fractional variance in the input:

$$\begin{equation} \text{nugget}_{i} = \left[ \frac{\sigma_{i}}{y_{i}} \right]^{2} \end{equation}$$

In the context of Gaussian processes, the covariance matrix is referred to as the kernel (or Gram) matrix. The flexibility of GPs also allows them to be used as prior distributions over unknown functions in larger Bayesian models, which are typically fit using MCMC.
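
As a rough illustration of how the nugget enters the computation, here is a sketch in plain numpy (not the scikit-learn implementation; the training points, hyperparameters, and nugget value are made up for illustration). It builds the squared-exponential kernel matrix, adds the nugget to its diagonal, and computes the GP predictive mean and variance via the conditional-normal formulas given earlier.

import numpy as np

def gp_predict(x_train, y_train, x_test, theta_1=1.0, theta_2=10.0, nugget=1e-2):
    # kernel (Gram) matrix between training points, with the nugget on the diagonal
    K = theta_1 * np.exp(-0.5 * theta_2 * np.subtract.outer(x_train, x_train) ** 2)
    K += nugget * np.eye(len(x_train))
    # cross-covariance between test and training points
    K_s = theta_1 * np.exp(-0.5 * theta_2 * np.subtract.outer(x_test, x_train) ** 2)
    # predictive mean and variance via the conditional-normal formulas
    mean = K_s @ np.linalg.solve(K, y_train)
    var = theta_1 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, var

x_train = np.array([-2.0, -1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3, 3, 50)
mu_pred, var_pred = gp_predict(x_train, y_train, x_test)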

Dirichlet Processes

Dirichlet processes are usually used to determine $k$, the number of groups or components, in a data-driven manner. Here we discuss two generative approaches for allocating samples to groups when the number of groups is not pre-determined.

Illustrating the DP with a Histogram

One way to approximate an unknown density from sample observations is with a histogram. A histogram can be described parametrically by specifying a series of knots that define its bins:

$$\zeta = (\zeta_0, \zeta_1, \ldots, \zeta_k), \qquad \zeta_0 \lt \zeta_1 \lt \ldots \lt \zeta_k$$

We can specify an associated probability model as:

$$f(x) = \sum_{h=1}^k I(\zeta_{h-1} \lt x \le \zeta_h) \frac{\pi_h}{\zeta_h - \zeta_{h-1}}$$

where $I$ is the indicator function and $\pi = (\pi_1, \ldots, \pi_k)$ is a vector of bin probabilities on the simplex.

We require a prior for the unknown probabilities, for which a natural choice is the Dirichlet distribution:

$$f(\mathbf{\pi}) = \frac{\Gamma(\sum_{h=1}^k \alpha_h)}{\prod_{h=1}^k \Gamma(\alpha_h)}\prod_{h=1}^{k} \pi_h^{\alpha_h - 1}$$

$$\text{where } \, E(\pi|\alpha) = \pi_0 = \left(\frac{\alpha_1}{\sum_h \alpha_h}, \ldots , \frac{\alpha_k}{\sum_h \alpha_h}\right)$$

Notice that the Dirichlet is just a generalization of the beta distribution to $k \gt 2$ classes.
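
A quick numerical illustration of this relationship (the parameter values are arbitrary): for $k=2$, the first component of a Dirichlet draw has the same distribution as a beta draw with the same parameters.

import numpy as np

a, b = 2.0, 5.0
# first component of Dirichlet(a, b) draws vs. Beta(a, b) draws
dir_draws = np.random.dirichlet([a, b], size=100000)[:, 0]
beta_draws = np.random.beta(a, b, size=100000)
print(dir_draws.mean(), beta_draws.mean())   # both approach a / (a + b)
print(dir_draws.var(), beta_draws.var())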

It is easy to show that the resulting posterior distribution for $\pi$ is another Dirichlet:

$$\pi|x \sim \text{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_k + n_k)$$

where $n_h$ is the number of observations contained by the $h^{th}$ histogram bin.


In [2]:
%matplotlib inline
from pymc import rbeta
import matplotlib.pyplot as plt

n = 100
# simulate a bimodal dataset as a weighted combination of two beta samples
y = 0.75 * rbeta(1, 5, n) + 0.25 * rbeta(20, 2, n)

# bin the data into a 10-bin histogram
counts, bins, patches = plt.hist(y, bins=10)


We can use these bin counts to calculate the expected value of the Dirichlet posterior:


In [3]:
from pymc import dirichlet_expval
import numpy as np

fig = plt.figure()

# 10-bin histogram with the posterior expected bin counts overlaid
ax1 = fig.add_subplot(1, 2, 1)
counts, bins, patches = ax1.hist(y, bins=10)
p = dirichlet_expval(1 + counts)   # posterior mean of Dirichlet(1 + counts)
p = np.append(p, 1. - p.sum())     # pad so the vector matches the number of bin edges
y_exp = n * p                      # expected counts under the posterior mean
ax1.step(bins, y_exp, color='red', where='post', linewidth=4)
ax1.set_aspect(1. / ax1.get_data_ratio())

# the same data with 20 bins
ax2 = fig.add_subplot(1, 2, 2)
counts, bins, patches = ax2.hist(y, bins=20)
p = dirichlet_expval(1 + counts)
y_exp = n * np.append(p, 1. - p.sum())
ax2.step(bins, y_exp, color='red', where='post', linewidth=4)
ax2.set_aspect(1. / ax2.get_data_ratio())


Density estimation with a histogram is straightforward, but the result is clearly sensitive to the number of bins. So it is useful to generalize this Dirichlet approach further.

Dirichlet Process Prior

Consider a sample space $\Omega$ that we may partition into $k$ non-overlapping subsets $\{B_1,\ldots,B_k\} \in \mathcal{B}$. We can assign a probability measure $P$ to this partition:

$$P(B_1),\ldots,P(B_k) = \int_{B_1} f(x) dx, \ldots, \int_{B_k} f(x) dx$$

A Dirichlet distribution is a natural conjugate prior on these partition (bin) probabilities:

$$P(B_1),\ldots,P(B_k) \sim \text{Dirichlet}(a P_0(B_1), \ldots, a P_0(B_k))$$

where $P_0$ is a base probability measure and $a > 0$ can be interpreted as prior sample size, which essentially controls the amount of prior shrinkage.

However, we want our model to be insensitive to the choice of partition and to the number of bins. The important implication of specifying this prior is that although probabilities are assigned to each bin, it does not prescribe how that probability mass is distributed across any particular bin.

It is easy to show that combining (or splitting) the elements of a Dirichlet distribution results in another Dirichlet:

$$\begin{aligned} \pi_1, \ldots, \pi_k &\sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k) \\ \Rightarrow \pi_1 + \pi_2, \pi_3, \ldots, \pi_k &\sim \text{Dirichlet}(\alpha_1 + \alpha_2, \alpha_3, \ldots, \alpha_k) \end{aligned}$$

or generally, for partition $\{B_1,\ldots,B_k\} \in \mathcal{B}$:

$$\sum_{h \in B_1} \pi_h, \ldots, \sum_{h \in B_k} \pi_h \sim \text{Dirichlet}(\sum_{h \in B_1} \alpha_h, \ldots, \sum_{h \in B_k} \alpha_h)$$

Similarly, for $\beta_1 + \beta_2 = 1$,

$$\begin{aligned} \pi_1, \ldots, \pi_k &\sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k) \\ \tau_1, \tau_2 &\sim \text{Dirichlet}(\alpha_1 \beta_1, \alpha_1 \beta_2) \\ \Rightarrow \pi_1\tau_1, \pi_1\tau_2, \pi_2, \ldots, \pi_k &\sim \text{Dirichlet}(\alpha_1\beta_1, \alpha_1\beta_2, \alpha_2, \alpha_3, \ldots, \alpha_k) \end{aligned}$$
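
As a quick sanity check of the merging property (the concentration parameters below are arbitrary), we can compare the sum of the first two components of Dirichlet draws against the first component of draws from the corresponding merged Dirichlet.

import numpy as np

alpha = np.array([1.0, 2.0, 3.0, 4.0])       # arbitrary concentration parameters
draws = np.random.dirichlet(alpha, size=100000)

# sum of the first two components of a Dirichlet(1, 2, 3, 4) draw...
merged = draws[:, 0] + draws[:, 1]
# ...should match the first component of a Dirichlet(3, 3, 4) draw
direct = np.random.dirichlet([3.0, 3.0, 4.0], size=100000)[:, 0]

print(merged.mean(), direct.mean())   # both approach (1 + 2) / 10 = 0.3
print(merged.var(), direct.var())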

Just as the Gaussian process is a distribution over functions, a Dirichlet process is a distribution over distributions (or, measure over measures).

$$P \sim \text{DP}(\alpha, P_0)$$

It is centered upon the baseline probability measure $P_0$, with $\alpha$ specifying the certainty in this baseline (i.e. the inverse variance).

The expectation of a DP is:

$$E[P(B)] = P_0(B)$$

in other words, it is centered on the baseline measure. Its variance is:

$$\text{Var}(P(B)) = \frac{P_0(B)(1-P_0(B))}{1+\alpha}$$

The DP is essentially an infinitely decimated Dirichlet distribution, and the marginal probability it assigns to any subset $B$ is beta distributed:

$$P(B) \sim \text{Beta}(\alpha P_0(B), \alpha(1-P_0(B)))$$
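
The following small simulation illustrates these moments (all numeric values are illustrative). It uses the fact that the DP's finite-dimensional marginals are Dirichlet: we partition the real line into 20 sets that each carry mass 0.05 under a standard normal base measure, draw the corresponding Dirichlet vector, and check that $P(B)$ for $B = (-\infty, 0]$ has the mean and variance given above.

import numpy as np

alpha = 5.0                                   # illustrative concentration
# partition the real line into 20 sets, each with mass 0.05 under P0 = N(0, 1);
# the DP's finite-dimensional marginal on this partition is Dirichlet(alpha * p0)
p0 = np.full(20, 0.05)
draws = np.random.dirichlet(alpha * p0, size=100000)

# B = union of the first 10 sets, i.e. (-inf, 0], so P0(B) = 0.5
PB = draws[:, :10].sum(axis=1)
print(PB.mean())                              # approx E[P(B)] = P0(B) = 0.5
print(PB.var(), 0.5 * 0.5 / (1 + alpha))      # approx P0(B)(1 - P0(B)) / (1 + alpha)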

Stick-breaking Process

The specification of the DP above is not necessarily intuitive in terms of what a DP realization looks like. A generative approach for allocating observations to groups is the stick-breaking process, which involves breaking the support of a particular variable into $k$ disjoint segments. Here, we start with a "stick" of unit length. To "break" the stick, we generate random points along the stick via the following algorithm:

  1. generate a random variable $\beta_1 \sim \text{Beta}(1, \alpha_0)$
  2. use this random variable (which is on the unit interval) to define a break point on the stick
  3. iterate $k-1$ times:
    • generate $\beta_i \sim \text{Beta}(1, \alpha_0)$
    • identify next break point at $\pi_i = \beta_i \prod_{j=1}^{i-1} (1-\beta_j)$ (which is on the part of the stick that remains after the previous break)

This results in $k$ "pieces" being created. Associated with each piece is a probability that is proportional to its length; these $k$ probabilities will have a Dirichlet distribution -- thus, the DP is a distribution over distributions.

This process defines an exchangeable distribution on partitions of the stick.

  • though there is an order to the generation of the segments, the distribution is independent of order.

Implementing a stick-breaking constructive process in Python is straightforward:


In [4]:
from numpy.random import beta

def stick_breaking(alpha, k):
    # k beta-distributed break proportions
    betas = beta(1, alpha, k)
    # length of stick remaining before each break: prod_{j<i} (1 - beta_j)
    remaining_pieces = np.append(1, np.cumprod(1 - betas[:-1]))
    p = betas * remaining_pieces
    # normalize so the truncated set of weights sums to one
    return p / p.sum()

For example, let's construct a DP with a baseline distribution that is standard normal:

$$P_0 = N(0,1)$$

We take a draw of $k$ values from the baseline distribution:

$$ \theta_1, \theta_2, \ldots \theta_k \sim P_0 $$

then, using a stick breaking process, we can obtain a set of draws $\beta_1, \beta_2, \ldots$ from a $\text{Beta}(1,\alpha)$. These are used to assign probabilities to the $\theta_i$ values. As we established above, the probability of each $\theta_i$ is calculated via:

$$ \pi_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j) $$

In [5]:
k = 25
alpha = 7
theta = np.random.normal(0, 1, k)
p = stick_breaking(alpha, k)

These probabilities correspond to the set of draws from the baseline distribution, each of which is a point mass of probability. So, the DP density function is:

$$ P(x) = \sum_{i=1}^{k} \pi_i I(x=\theta_i) $$

where $I$ is the indicator function.


In [6]:
# draw k values from the discrete measure: indices sampled with probabilities p
x = np.random.choice(k, size=k, p=p)
dp = theta[x]
print(dp)


[ 1.45083515  0.36452561  0.43705663  0.43705663  0.46168919  1.45083515
  1.45083515  1.45083515  0.43705663  0.46168919  0.43705663  0.7463551
  1.45083515  0.43705663  0.43705663  1.26302955  1.45083515  0.43705663
  0.43705663  0.43705663  1.45083515  0.43705663  0.43705663  0.43705663
  0.43705663]

In [7]:
# tabulate the frequency of each distinct atom in the DP draw
x = list(set(dp))
f = [(dp == i).sum() for i in x]
plt.bar(x, f, width=0.01)


Out[7]:
<Container object of 6 artists>

So, you can see that the Dirichlet process is discrete, even though its atoms take continuous (non-integer) values. This can be generalized to a mixture of continuous distributions, which is called a DP mixture.

Here are several realizations with $k=20$, for increasing values of $\alpha$ (0.5, 5, and 25). In each panel there are 20 bars (sorted by size), and the height of each bar represents that group's probability.


In [8]:
k = 20
fig, axes = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(10,6))
for ax in np.ravel(axes):
    ax.bar(np.arange(k), np.sort(stick_breaking(alpha=0.5, k=k))[::-1])



In [9]:
fig, axes = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(10,6))
for ax in np.ravel(axes):
    ax.bar(np.arange(k), np.sort(stick_breaking(alpha=5, k=k))[::-1])
    ax.set_ylim(0,1)



In [10]:
fig, axes = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(10,6))
for ax in np.ravel(axes):
    ax.bar(np.arange(k), np.sort(stick_breaking(alpha=25, k=k))[::-1])
    ax.set_ylim(0,1)


We can see that as $\alpha$ increases, the weights (stick sizes) become more even across the samples.

We can use the stick-breaking process to induce $P \sim DP(\alpha P_0)$ for some $P_0$.

$$P(\cdot) = \sum_{h=1}^k \pi_h \delta_{\theta_h}(\cdot)$$

where the $\pi_h$ are generated by stick-breaking and $\delta_{\theta}$ is a degenerate distribution with its entire probability mass at $\theta$; the atoms $\theta_h$ are in turn generated from $P_0$, which here we take to be a standard normal.


In [11]:
from numpy.random import choice

def dirichlet_process(p, n, P0=np.random.randn):
    # draw one atom from the base measure P0 for each stick-breaking weight,
    # then sample n observations from the resulting discrete distribution
    theta = P0(len(p))
    return choice(theta, size=n, p=p)

In [12]:
p = stick_breaking(alpha=1.0, k=1000)
_ = plt.hist(dirichlet_process(p, 100))



In [13]:
p = stick_breaking(alpha=5, k=1000)
_ = plt.hist(dirichlet_process(p, 100))



In [14]:
p = stick_breaking(alpha=1000, k=10000)
_ = plt.hist(dirichlet_process(p, 1000))


Notice that, while the particular values of the DP realizations are continuous, the distribution is discrete. But, as $\alpha \rightarrow \infty$, the likelihood of indexing the same $\theta_h$ more than once goes to zero, and one is essentially drawing from $P_0$.

So, while the DP is of limited use as a direct prior of a data distribution, it is extremely useful as a prior for an unknown mixture.

If we generalize the above approach such that the DP is used as the mixture measure for some kernel $\mathcal{K}(y|\theta)$, then we can define the mixture model:

$$f(y) = \sum_{h=1}^{\infty} \pi_h \mathcal{K}(y|\theta_h)$$

This is no different than other mixture models we have seen, except that the number of components is infinite. In practice, almost all the components are empty when we consider using it to model a finite dataset, but the model has the capacity to increase the number of mixture components as data are added.

This model can be specified hierarchically by:

$$\begin{aligned} P &\sim DP(\alpha P_0) \\ \theta_i &\sim P \\ y_i &\sim \mathcal{K}(y|\theta_i) \end{aligned}$$
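
To make this concrete, here is a minimal sketch of generating data from a truncated version of this model, reusing the stick_breaking function defined above. The kernel is taken to be $\mathcal{K}(y|\theta) = N(\theta, 0.3^2)$, and the truncation level, concentration, and sample size are arbitrary illustrative choices.

import numpy as np

k, alpha, n = 100, 2.0, 500                   # truncation level, concentration, sample size
pi = stick_breaking(alpha, k)                 # mixture weights via stick-breaking
theta = np.random.normal(0, 1, k)             # component locations drawn from P0 = N(0, 1)

z = np.random.choice(k, size=n, p=pi)         # pick a component for each observation
y = np.random.normal(theta[z], 0.3)           # y_i ~ K(y | theta_{z_i}) = N(theta_{z_i}, 0.3^2)

print(np.unique(z).size, "of", k, "components used")

Even with a generous truncation level, only a modest number of components end up with appreciable weight; the infinite mixture simply formalizes the ability to add more as data accumulate.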

The computational hurdle is in how to characterize the mixture when we cannot generate infinite mixture components. For this, we will use another generative model metaphor, the Chinese restaurant process.



