In [2]:
%pylab inline
import pylab as pp
import random as r
import matplotlib as mpl
Notes on implementation:
$\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bxp}{\mathbf{x}^{'}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\ba}{\mathbf{a}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bq}{\mathbf{q}}$ $\newcommand{\bp}{\mathbf{p}}$ $\newcommand{\bh}{\mathbf{h}}$
$\newcommand{\bI}{\mathbf{I}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\bT}{\mathbf{T}}$ $\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\xm}{\mathbf{x}_m}$ $\newcommand{\xn}{\mathbf{x}_n}$ $\newcommand{\y}{\mathbf{y}}$ $\newcommand{\K}{\mathbf{K}}$ $\newcommand{\zero}{\mathbf{0}}$ $\newcommand{\yi}{\y_i}$ $\newcommand{\thetav}{\mathbf{\theta}}$ $\newcommand{\t}{\mathbf{t}}$ $\newcommand{\x}{\mathbf{x}}$ $\newcommand{\tN}{\mathbf{t}_N}$ $\newcommand{\xN}{\mathbf{x}_N}$ $\newcommand{\k}{\mathbf{k}}$ $\newcommand{\C}{\mathbf{C}}$ $\newcommand{\CN}{\mathbf{C}_N}$ $\newcommand{\KN}{\mathbf{K}_N}$ $\newcommand{\eyeN}{\mathbf{I}_N}$
For this lab we will refer to sections 6.4.2 and 6.4.3. You may also want to refer to Rasmussen's Gaussian Processes text, which is available online at gaussianprocesses.org. To understand Gaussian processes, it is highly recommended to understand how marginal, partitioned Gaussian distributions can be converted into conditional Gaussian distributions. This is covered in Chapter 2.3 and summarized in Eqns 2.94-2.98.
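For reference, if $\bx = (\bx_a, \bx_b)$ is jointly Gaussian with mean $(\boldsymbol{\mu}_a, \boldsymbol{\mu}_b)$ and covariance blocks $\boldsymbol{\Sigma}_{aa}$, $\boldsymbol{\Sigma}_{ab}$, $\boldsymbol{\Sigma}_{bb}$, the conditional distribution summarized there is
$p(\bx_a \mid \bx_b) = \mathcal{N}\left(\boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\bx_b - \boldsymbol{\mu}_b),\; \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}\right).$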
We will use the same data generating function that we used previously for regression. You can change sigma/beta, but keep it reasonable. Definitely play around once you have things working. Make use of these functions as you wish.
In [39]:
sigma = 0.1
beta = 1.0 / pow(sigma,2) # this is the beta used in Bishop Eqn. 6.59
N_test = 100
x_test = np.linspace(-1,1,N_test);
mu_test = np.zeros( N_test )
var_test = np.zeros( N_test )
In [4]:
def true_mean_function( x ):
    return np.sin( 2*pi*(x+1) )

def add_noise( y, sigma ):
    return y + sigma*np.random.randn(len(y))

def generate_t( x, sigma ):
    return add_noise( true_mean_function( x ), sigma )
In [5]:
y_test = true_mean_function( x_test )
t_test = add_noise( y_test, sigma )
pp.plot( x_test, y_test, 'b-', lw=2)
pp.plot( x_test, t_test, 'go')
Out[5]:
We will implement Gaussian process regression using the kernel function in Bishop Eqn. 6.63.
To start, implement the function "k_n_m( xn, xm, thetas )" that takes scalars $\xn$ and $\xm$ and a vector of $4$ thetas, and computes the kernel function of Bishop Eqn. 6.63 (10 points). NB: usually the kernel function takes $D$ by $1$ vectors, but since we are working on a univariate problem, this makes things easier.
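For reference, the kernel to implement (Bishop Eqn. 6.63) is
$k(\xn, \xm) = \theta_0 \exp\left(-\frac{\theta_1}{2}\|\xn - \xm\|^2\right) + \theta_2 + \theta_3\, \xn^T \xm.$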
In [93]:
def k_n_m(xn, xm, thetas):
    # Bishop Eqn. 6.63: theta0 * exp(-theta1/2 * ||xn - xm||^2) + theta2 + theta3 * xn^T xm
    d = xn - xm
    return thetas[0] * np.exp(-thetas[1]/2 * d**2) + thetas[2] + thetas[3] * xn * xm
Eqn 6.60 is the marginal distribution of the mean output of $N$ data vectors: $p(\y) = \mathcal{N}(\zero, \K)$. Notice that the expected mean function is $0$ at all locations, and that the covariance is an $N$ by $N$ kernel matrix $\K$. Write a function "computeK( X1, X2, thetas )" that computes the kernel matrix. Use k_n_m as part of an inner loop (of course, there are more efficient ways of computing the kernel function making better use of vectorization, but that is not necessary) (5 points).
In [7]:
def compute_K(X1, X2, thetas):
    # kernel matrix between all pairs of points in X1 and X2
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i, xn in enumerate(X1):
        for j, xm in enumerate(X2):
            K[i,j] = k_n_m(xn, xm, thetas)
    return K
Now sample mean functions at the x_test locations for the theta values in Bishop Figure 6.5, make a figure with a 2 by 3 subplot, and make sure the title reflects the theta values (make sure everything is legible). In other words, sample $\yi \sim \mathcal{N}(\zero, \K(\thetav))$. Make use of numpy.random.multivariate_normal(). On your plots include the expected value of $\y$ with a dashed line, and fill_between 2 standard deviations of the uncertainty due to $\K$ (the diagonal of $\K$ is the variance of the model uncertainty) (15 points).
In [126]:
def plot_function_samples(x_test, y_test, Thetas):
    plt.figure(figsize=(16, 12))
    for i, thetas in enumerate(Thetas):
        K = compute_K(x_test, x_test, thetas)
        plt.subplot(2, 3, 1 + i)
        plt.title('thetas: %s' % (thetas, ))
        # one sample mean function drawn from the GP prior N(0, K)
        y = np.random.multivariate_normal(np.zeros(len(x_test)), K)
        # 2 standard deviations of the model uncertainty (diagonal of K)
        std = np.sqrt(np.diag(K))
        plt.plot(x_test, y, '--', lw=2)
        plt.fill_between(x_test, y - 2*std, y + 2*std, alpha=0.1)
    plt.tight_layout()
    plt.show()
Thetas = (
    (1.00, 4.00, 0.00, 0.00),
    (9.00, 4.00, 0.00, 0.00),
    (1.00, 64.00, 0.00, 0.00),
    (1.00, 0.25, 0.00, 0.00),
    (1.00, 4.00, 10.00, 0.00),
    (1.00, 4.00, 0.00, 5.00),
)
plot_function_samples(x_test, y_test, Thetas)
So far we have sampled mean functions from the prior. We can draw actual data $\t$ in two ways. The first way is generative: first sample $\y | \K$, then sample $\t | \y, \beta$ (Eqn 6.60 followed by Eqn 6.59). The second way is to integrate over $\y$ (the mean draw) and directly sample $\t | \K, \beta$ using Eqn 6.61. This is the generative process for $\t$. Note that we have not specified a distribution over inputs $\x$; this is because Gaussian processes are conditional models. Because of this we are free to generate locations $\x$ when playing around with the GP; obviously a dataset will give us input-output pairs.
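As an illustration (not required by the assignment), both sampling routes can be written in a few lines; this sketch reuses the compute_K, x_test and beta defined above:

thetas = (1.00, 4.00, 0.00, 0.00)   # example hyperparameters
K = compute_K(x_test, x_test, thetas)

# Way 1 (generative): sample y | K (Eqn 6.60), then t | y, beta (Eqn 6.59)
# (numerically, K may need a tiny jitter on its diagonal to be treated as PSD)
y_sample = np.random.multivariate_normal(np.zeros(len(x_test)), K)
t_sample_1 = y_sample + np.random.randn(len(x_test)) / np.sqrt(beta)

# Way 2 (marginal): sample t | K, beta directly from N(0, C) with C = K + I/beta (Eqn 6.61)
C = K + np.eye(len(x_test)) / beta
t_sample_2 = np.random.multivariate_normal(np.zeros(len(x_test)), C)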
Once we have data, we are interested in the predictive distribution (note: the prior is the predictive distribution when there is no data). Consider the joint distribution for $N+1$ targets, given by Eqn 6.64. Its covariance matrix is composed of block components $\CN$, $\k$, and $c$. The covariance matrix $\CN$ for $\tN$ is $\CN = \KN + \eyeN / \beta$. We have just made explicit the size $N$ of the matrix; $N$ is the number of training points. The kernel vector $\k$ is an $N$ by $1$ vector of kernel function evaluations between the training input data and the test input vector. The scalar $c$ is a kernel evaluation at the test input.
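Explicitly, the joint covariance and the resulting predictive mean and variance (Bishop Eqns 6.65-6.67) are
$\C_{N+1} = \begin{pmatrix} \CN & \k \\ \k^T & c \end{pmatrix}, \qquad m(x_{N+1}) = \k^T \CN^{-1} \tN, \qquad \sigma^2(x_{N+1}) = c - \k^T \CN^{-1} \k,$
with $c = k(x_{N+1}, x_{N+1}) + 1/\beta$.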
Write a function "gp_predictive_distribution( x_train, t_train, x_test, theta, C = None )" that computes Eqns 6.66 and 6.67, except allow for an arbitrary number of test points (not just one); the kernel matrix is now built from the training data. By having C as an optional parameter, we can avoid computing it more than once (for this problem it is unimportant, but for real problems this is an issue). The function should compute $\C$, $\k$, and $c$, and return the predictive mean and variance. Do not forget: the computeK function computes $\K$, not $\C$! (10 points)
In [11]:
def gp_predictive_distribution(x_train, t_train, x_test, theta, C=None):
    # C is the training covariance C = K + I/beta (Eqn 6.62); compute it only if not supplied
    if C is None:
        C = compute_K(x_train, x_train, theta) + np.eye(x_train.shape[0]) / beta
    invC = np.linalg.inv(C)
    means = np.zeros(len(x_test))
    variances = np.zeros(len(x_test))
    for i in range(len(x_test)):
        # c and k from Eqn 6.65; predictive mean and variance from Eqns 6.66 and 6.67
        c = k_n_m(x_test[i], x_test[i], theta) + 1./beta
        k = np.array([k_n_m(x_train[n], x_test[i], theta) for n in range(len(x_train))])
        means[i] = k.dot(invC).dot(t_train)
        variances[i] = c - k.dot(invC).dot(k)
    return means, variances
Later, to learn the hyperparameters, we will need to compute the log-likelihood of the training data. Implicitly, this is conditioned on the value setting for $\thetav$. Write a function "gp_log_likelihood( x_train, t_train, theta, C = None, invC = None )", where C and invC can be stored and reused. (10 points)
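For reference, the quantity to compute is the log marginal likelihood of Bishop Eqn 6.69:
$\ln p(\tN \mid \thetav) = -\frac{1}{2} \ln |\CN| - \frac{1}{2} \tN^T \CN^{-1} \tN - \frac{N}{2} \ln(2\pi).$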
In [12]:
def gp_log_likelihood(x_train, t_train, theta, C=None, invC=None):
    # Bishop Eqn 6.69; C and invC can be passed in to avoid recomputation
    if C is None:
        C = compute_K(x_train, x_train, theta) + np.eye(x_train.shape[0]) / beta
    if invC is None:
        invC = np.linalg.inv(C)
    # slogdet is numerically more stable than log(det(C))
    sign, logdetC = np.linalg.slogdet(C)
    return -0.5 * (logdetC + t_train.dot(invC).dot(t_train) + len(t_train) * np.log(2*np.pi))
Repeat the 6 plots above, but this time conditioned on 2 training points. Use the sinusoidal data generator to create 2 training points where x is sampled uniformly between $-1$ and $1$. For these plots, feel free to use the provided function "gp_plot". Make sure you put the parameters in the title, and this time also the log-likelihood. (10 points) Try to understand the two types of uncertainty! If you do not use "gp_plot", please add a fill_between for the model and target noise.
In [13]:
def gp_plot( x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta ):
    # x_test:   the test input locations
    # y_test:   the true function at x_test
    # mu_test:  predictive mean at x_test
    # var_test: predictive variance at x_test
    # t_train:  the training values
    # theta:    the kernel parameters
    # beta:     the precision (known)
    # the reason for the manipulation is to allow plots separating model and data stddevs.
    std_total = np.sqrt(var_test)                   # includes all uncertainty, model and target noise
    std_model = np.sqrt( std_total**2 - 1.0/beta )  # remove data noise to get model uncertainty in stddev
    std_combo = std_model + np.sqrt( 1.0/beta )     # add stddev (note: not the same as full)
    pp.plot( x_test, y_test, 'b', lw=3)
    pp.plot( x_test, mu_test, 'k--', lw=2 )
    pp.fill_between( x_test, mu_test+2*std_combo, mu_test-2*std_combo, color='k', alpha=0.25 )
    pp.fill_between( x_test, mu_test+2*std_model, mu_test-2*std_model, color='r', alpha=0.25 )
    pp.title("theta = " + str(theta) + " log_l = " + str(gp_log_likelihood(x_train, t_train, theta)))
    pp.plot( x_train, t_train, 'ro', ms=10 )
N_train = 2
x_train = np.linspace(-1,1,N_train)
t_train = generate_t(x_train, sigma)

plt.figure(figsize=(16, 12))
for i, theta in enumerate(Thetas):
    plt.subplot(2, 3, 1 + i)
    mu_test, var_test = gp_predictive_distribution( x_train, t_train, x_test, theta)
    gp_plot(x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta)
plt.tight_layout()
In [14]:
N_train = 10
x_train = np.linspace(-1,1,N_train)
t_train = generate_t(x_train, sigma)

plt.figure(figsize=(16, 12))
for i, theta in enumerate(Thetas):
    plt.subplot(2, 3, 1 + i)
    mu_test, var_test = gp_predictive_distribution( x_train, t_train, x_test, theta)
    gp_plot(x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta)
plt.tight_layout()
Learning the values of the parameter $\thetav$ can be very tricky for Gaussian processes in general, but when the data is univariate like ours, we can visualize the fit and see how plausible it looks.
Maximum likelihood or MAP learning is the most common way of setting the parameters, though a fully Bayesian approach is possible too. We will look at ML today. For this, we start with the derivative of the log-likelihood with respect to the parameters $\thetav$; this is Eqn 6.70. This, in turn, requires the derivative of the kernel matrix $\CN$ with respect to $\thetav$. This is the matrix of element-wise derivatives of the kernel function. Write the derivatives for $\theta_0$ to $\theta_3$ for our kernel function (5 points).
$k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp \left( -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right) + \theta_2 + \theta_3 \mathbf{x}_n^T\mathbf{x}_m$
$\frac{\partial}{\partial \theta_0} k(\mathbf{x}_n, \mathbf{x}_m) = \exp \left( -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)$
$\frac{\partial}{\partial \theta_1} k(\mathbf{x}_n, \mathbf{x}_m) = - \frac{\theta_0}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \exp \left( -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)$
$\frac{\partial}{\partial \theta_2} k(\mathbf{x}_n, \mathbf{x}_m) = 1$
$\frac{\partial}{\partial \theta_3} k(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^T\mathbf{x}_m$
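The same derivatives written as code, as a small sketch (the name k_n_m_grad is our addition, not part of the assignment); it returns the four partials at scalar inputs, matching k_n_m above:

def k_n_m_grad(xn, xm, thetas):
    # partial derivatives of k(xn, xm) w.r.t. theta_0 ... theta_3 (scalar inputs)
    sq_dist = (xn - xm)**2
    e = np.exp(-thetas[1]/2 * sq_dist)
    dk_dtheta0 = e
    dk_dtheta1 = -thetas[0]/2 * sq_dist * e
    dk_dtheta2 = 1.0
    dk_dtheta3 = xn * xm
    return np.array([dk_dtheta0, dk_dtheta1, dk_dtheta2, dk_dtheta3])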
The covariance matrix $C$ is restricted to be positive definite. Since $C$ is constructed by adding the (positive) noise variance to $K$, $C$ is always positive definite given that $K$ is positive semidefinite. So it suffices to show that $K$ is positive semidefinite. If we work this out for the two-dimensional case where we have two data points, $x_1$ and $x_2$, we get the following situation:
$ K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2)\\ k(x_2, x_1) & k(x_2, x_2) \end{bmatrix} $
For $K$ to be positive semidefinite, we require (among other conditions) that $|K| \ge 0$. If we write out the determinant, we get
$k(x_1, x_1)k(x_2, x_2) - k(x_1, x_2)^2 \ge 0$
Using the fact that $k(x_1, x_2) = k(x_2, x_1)$
$k(x_1, x_1)k(x_2, x_2) \ge k(x_1, x_2)^2$
$(\theta_0 + \theta_2 + \theta_3 x_1^T x_1) (\theta_0 + \theta_2 + \theta_3 x_2^T x_2) \ge \left(\theta_0 \exp\left(-\frac{\theta_1}{2} ||x_1 - x_2||^2\right) + \theta_2 + \theta_3 x_1^T x_2\right)^2 $
Here we can see that $\theta_1$ disappeared on the left-hand side because the distance of a datapoint to itself is zero. Since the exponential is always positive, $\theta_0$ seems to play a deciding role in this inequality. We can't quite settle this from the equations alone, but some experimentation (see the check below) seems to confirm that $\theta_0$ is constrained to be greater than or equal to zero. We will assume this result for the next question.
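A minimal numerical check of this claim (our addition, using the compute_K defined above): the smallest eigenvalue of $K$ should be non-negative for $\theta_0 \ge 0$ and clearly negative otherwise.

# positive semidefiniteness check of K for a positive and a negative theta_0;
# values on the order of machine precision below zero are just round-off
x_check = np.random.uniform(-1, 1, 5)
for theta0 in (1.0, -1.0):
    K_check = compute_K(x_check, x_check, (theta0, 4.0, 0.0, 0.0))
    print('theta0 = %+.1f, min eigenvalue = %.2e' % (theta0, np.linalg.eigvalsh(K_check).min()))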
For parameters that are constrained to be positive, the usual approach is to use the exponential of the free-parameter in the kernel function, but perform gradient ascent on the unconstrained values. Consider the case $\theta_i = \exp( \phi_i)$, where $\phi_i$ is unconstrained. Write the derivative for $\phi_i$ in terms of the derivatives you already computed (5 points). Hint: use the chain rule and do not repeat the full derivation.
Since we're assuming that $\theta_0$ is constrained to be greater than or equal to zero, we substitute it by $\exp(\phi_0)$:
$\frac{\partial k(\mathbf{x}_n, \mathbf{x}_m)}{\partial \phi_0} = \frac{\partial k(\mathbf{x}_n, \mathbf{x}_m)}{\partial \exp(\phi_0)} \frac{\partial \exp(\phi_0)}{\partial \phi_0} = \exp \left( \phi_0 -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)$
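More generally, for any parameter constrained via $\theta_i = \exp(\phi_i)$, the chain rule gives $\frac{\partial k}{\partial \phi_i} = \frac{\partial k}{\partial \theta_i} \frac{\partial \theta_i}{\partial \phi_i} = \theta_i \frac{\partial k}{\partial \theta_i}$, i.e. each unconstrained derivative is the corresponding derivative from above scaled by $\theta_i$.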
Grid search: for the same training set you have above, perform a small grid search over $\thetav$ (try at least 20 combinations). Have your grid-search loop or function print out rows of log-likelihood + $\thetav$, sorted from best to worst. Use the log-likelihood to select the best $\thetav$ and the worst. Plot both the same way as the subplots above (i.e., a 1 by 2 subplot of best and worst). (10 points)
In [15]:
def getKey(item):
    return item[0]

def gridsearch(x_train, t_train):
    # 4^4 = 256 combinations of integer theta values, sorted by log-likelihood (best first)
    ran = np.arange(4)
    srch = []
    for a in ran:
        for b in ran:
            for c in ran:
                for d in ran:
                    srch.append([gp_log_likelihood(x_train, t_train, [a,b,c,d], C=None, invC=None), [a,b,c,d]])
    return sorted(srch, key=getKey, reverse=True)

gs = gridsearch(x_train, t_train)
b_and_w = []
b_and_w.append(gs[0][1])   # best theta
b_and_w.append(gs[-1][1])  # worst theta
Thetas = tuple(b_and_w)

plt.figure(figsize=(16, 12))
for i, theta in enumerate(Thetas):
    plt.subplot(1, 2, 1 + i)
    mu_test, var_test = gp_predictive_distribution( x_train, t_train, x_test, theta)
    gp_plot(x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta)
plt.tight_layout()
Selecting kernel functions can be somewhat of an art. There are characteristics of kernel functions that are useful for some data sets, but not others. Complicating the matter is the ability to combine kernels with different characteristics (e.g. long-term trends + seasonal fluctuations). Describe the characteristics of the kernel function we are using in terms of (signal, scale, offsets, etc). You may want to play around with $\thetav$ and see what each parameter does/affects/etc. (5 points) Describe why the best parameters work well for the training data and explain why the bad parameter settings perform poorly (in terms of the first part of the question). (5 points)
$\theta_0 \exp\left(-\frac{\theta_1}{2} ||x_1 - x_2||^2\right) + \theta_2 + \theta_3\, x_1^T x_2$
The kernel function indirectly constructs the covariance matrix $\mathbf{C}$ of $p(\t)$. Going through the kernel function term by term: the first term, involving $\theta_0$ and $\theta_1$, determines how correlated points in $\t$ are depending on their distance. $\theta_0$ scales the overall signal variance, while $\theta_1$ sets how quickly the correlation decays with squared distance; together they control the smoothness of the sampled functions.

$\theta_2$ is a constant added to the kernel function, so it acts as a constant offset (bias) term shared by all points.

$\theta_3$ operates on the dot product between datapoints, adding a linear component whose variance grows with the points' distance from the origin.
Bonus: Implement gradient ascent (or descent if you wish) using the combination of a) the log-likelihood objective function and b) the gradients you calculated above. Run on the training data above and show the log-likelihood curve as it learns, and a plot of the final model. Feel free to use available software (e.g. search for "minimize.py", which uses conjugate gradient descent, or something in scipy). NB: the log-likelihood should be monotonically increasing. You are encouraged to also search for and use "checkgrad". (20 points)
In [ ]:
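A minimal sketch of one way to approach the bonus (our addition, not a reference solution): plain gradient ascent on $\phi = \log\thetav$, combining the log-likelihood gradient of Bishop Eqn 6.70 with the k_n_m_grad sketch above. The initialisation, step size, and iteration count are arbitrary choices and may need tuning.

def log_likelihood_grad(x_train, t_train, thetas):
    # Eqn 6.70: d/dtheta_i ln p(t|theta) = -0.5 Tr(C^-1 dC/dtheta_i) + 0.5 t^T C^-1 dC/dtheta_i C^-1 t
    N = len(x_train)
    C = compute_K(x_train, x_train, thetas) + np.eye(N) / beta
    invC = np.linalg.inv(C)
    alpha = invC.dot(t_train)
    grad = np.zeros(4)
    for i in range(4):
        # element-wise derivative of C (equivalently K) w.r.t. theta_i
        dC = np.array([[k_n_m_grad(xn, xm, thetas)[i] for xm in x_train] for xn in x_train])
        grad[i] = -0.5 * np.trace(invC.dot(dC)) + 0.5 * alpha.dot(dC).dot(alpha)
    return grad

# gradient ascent in phi = log(theta) so that the thetas stay positive;
# with a sufficiently small learning rate the curve should be monotonically increasing
phi = np.log(np.array([1.0, 1.0, 0.1, 0.1]))   # arbitrary initialisation
learning_rate = 0.01
log_liks = []
for it in range(200):
    thetas = np.exp(phi)
    log_liks.append(gp_log_likelihood(x_train, t_train, thetas))
    # chain rule from above: d/dphi_i = theta_i * d/dtheta_i
    phi = phi + learning_rate * thetas * log_likelihood_grad(x_train, t_train, thetas)

pp.plot(log_liks)
pp.title('log-likelihood during gradient ascent')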