Notes on implementation:
Submitters:
For this lab we will refer to Bishop sections 6.4.2 and 6.4.3. You may also want to refer to Rasmussen's Gaussian Process text, which is available online at http://www.gaussianprocess.org/gpml/chapters/, and especially to the project found at http://www.automaticstatistician.com/index.php by Ghahramani for some intuition about GPs. To understand Gaussian processes, it is highly recommended to understand how marginal, partitioned Gaussian distributions can be converted into conditional Gaussian distributions. This is covered in Bishop 2.3 and summarized in Eqns 2.94-2.98.
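As a quick reminder (written here in covariance form; Bishop 2.3 derives the equivalent precision-matrix form), conditioning a partitioned Gaussian works as follows:
$p(\mathbf{x}_a, \mathbf{x}_b) = \mathcal{N}\left(\begin{pmatrix}\mu_a\\ \mu_b\end{pmatrix}, \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}\right)$
$p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}\left(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right)$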
We will use the same data generating function that we used previously for regression. You can change sigma/beta, but keep it reasonable. Definitely play around once you have things working. Make use of these functions as you wish.
In [1]:
%pylab inline
# %pylab pulls numpy (np) and matplotlib.pyplot (plt) into the namespace; import them explicitly as well
import numpy as np
import matplotlib.pyplot as plt
import pylab as pp
In [2]:
sigma = 0.5
beta = 1.0 / pow(sigma,2) # this is the beta used in Bishop Eqn. 6.59
N_test = 100
x_test = np.linspace(-1,1,N_test)
mu_test = np.zeros( N_test )
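For reference, this beta is the precision of the target noise in Bishop Eqn. 6.59:
$p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$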
In [3]:
def true_mean_function( x ):
return np.sin( 2*pi*(x+1) )
def add_noise( y, sigma ):
return y + sigma*np.random.randn(len(y))
def generate_t( x, sigma ):
return add_noise( true_mean_function( x), sigma )
In [4]:
y_test = true_mean_function( x_test )
t_test = add_noise( y_test, sigma )
pp.plot( x_test, y_test, 'b-', lw=2)
pp.plot( x_test, t_test, 'go')
pp.plot()
Out[4]:
We will implement Gaussian process regression using the kernel function in Bishop Eqn. 6.63.
To start, implement a function "k_n_m( xn, xm, thetas )" that takes scalars $\xn$ and $\xm$, and a vector of $4$ thetas, and computes the kernel function of Bishop Eqn. 6.63 (10 points).
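Written out, the kernel of Eqn. 6.63 (which the implementation below follows) is:
$k(x_n, x_m) = \theta_0 \exp\left(-\dfrac{\theta_1}{2}\|x_n - x_m\|^2\right) + \theta_2 + \theta_3\, x_n^T x_m$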
In [5]:
def k_n_m(xn, xm, thetas):
return thetas[0] * np.exp(-thetas[1] * (xn - xm) ** 2/ 2) + thetas[2] + thetas[3] * xn * xm
Eqn 6.60 is the marginal distribution of the mean output of $N$ data vectors: $p(\y) = \mathcal{N}(\zero, \K)$. Notice that the expected mean function is $0$ at all locations, and that the covariance is an $N$ by $N$ kernel matrix $\K$. Write a function "computeK( X1, X2, thetas )" that computes the kernel matrix. Hint: use k_n_m as part of an inner loop (of course, there are more efficient ways of computing the kernel function making better use of vectorization, but that is not necessary) (5 points).
In [6]:
def computeK(X1, X2, thetas):
K = np.zeros((len(X1), len(X2)))
for n in range(len(X1)):
for m in range(len(X2)):
K[n, m] = k_n_m(X1[n], X2[m], thetas)
return K
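For reference, the vectorization hinted at above could look like the sketch below. The helper name computeK_vectorized is ours, not part of the assignment, and the loop version above is perfectly adequate for this lab.
def computeK_vectorized(X1, X2, thetas):
    # Pairwise squared differences via broadcasting; X1 and X2 are 1-D arrays here
    sq_dist = (X1[:, np.newaxis] - X2[np.newaxis, :]) ** 2
    # Same kernel as k_n_m, evaluated for all pairs at once
    return thetas[0] * np.exp(-thetas[1] * sq_dist / 2) + thetas[2] + thetas[3] * np.outer(X1, X2)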
Now sample mean functions at the x_test locations for the theta values in Bishop Figure 6.5, make a figure with a 2 by 3 subplot and make sure the title reflects the theta values (make sure everything is legible). In other words, sample $\yi \sim \mathcal{N}(\zero, \K(\thetav))$. Make use of numpy.random.multivariate_normal(). On your plots include the expected value of $\y$ with a dashed line and fill_between 2 standard deviations of the uncertainty due to $\K$ (the diagonal of $\K$ is the variance of the model uncertainty) (15 points).
In [7]:
from itertools import product
thetas = np.array([(1, 4, 0, 0), (9, 4, 0, 0), (1, 64, 0, 0), (1, 0.25, 0, 0), (1, 4, 10, 0), (1, 4, 0, 5)])
fig = plt.figure()
fig.set_size_inches(14, 8)
for row, col in product(range(2), range(3)):
idx = row * 3 + col
K_theta = computeK(x_test, x_test, thetas[idx])
for i in range(5):
y = np.random.multivariate_normal(np.zeros(len(K_theta)), K_theta)
plt.subplot(2, 3, idx + 1)
plt.plot(x_test, y)
plt.plot(x_test, [0] * len(x_test), '--')
std = np.sqrt(np.diag(K_theta))
plt.fill_between(x_test, -2 * std, 2 * std, color='r', alpha=0.25 )
plt.title(str(thetas[idx]))
plt.show()
So far we have sampled mean functions from the prior. We can draw actual data $\t$ two ways. The first way is generatively, by first sampling $\y | \K$, then sampling $\t | \y, \beta$ (Eqns 6.60 followed by 6.59). The second way is to integrate over $\y$ (the mean draw) and directly sample $\t | \K, \beta$ using Eqn 6.61. This is the generative process for $\t$. Note that we have not specified a distribution over inputs $\x$; this is because Gaussian processes are conditional models. Because of this we are free to generate locations $\x$ when playing around with the GP; obviously a dataset will give us input-output pairs.
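As an illustrative sketch of the two routes just described (the theta setting and the names theta_example, C_example, t_sample_1 and t_sample_2 are placeholders of ours):
# Sketch: two ways to draw targets t at the test inputs
theta_example = np.array([1.0, 4.0, 0.0, 0.0])  # placeholder hyperparameters
K = computeK(x_test, x_test, theta_example)
# Way 1: sample y | K (Eqn 6.60), then t | y, beta (Eqn 6.59)
y_sample = np.random.multivariate_normal(np.zeros(N_test), K)
t_sample_1 = y_sample + np.random.randn(N_test) / np.sqrt(beta)
# Way 2: integrate out y and sample t | K, beta directly (Eqn 6.61)
C_example = K + np.identity(N_test) / beta
t_sample_2 = np.random.multivariate_normal(np.zeros(N_test), C_example)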
Once we have data, we are interested in the predictive distribution (note: the prior is the predictive distribution when there is no data). Consider the joint distribution for $N+1$ targets, given by Eqn 6.64. Its covariance matrix is composed of block components $\CN$, $\k$, and $c$. The covariance matrix $\CN$ for $\tN$ is $\CN = \KN + \eyeN / \beta$. We have just made explicit the size $N$ of the matrix; $N$ is the number of training points. The kernel vector $\k$ is an $N$ by $1$ vector of kernel function evaluations between the training input data and the test input vector. The scalar $c$ is a kernel evaluation at the test input.
Write a function "gp_predictive_distribution(x_train, t_train, x_test, theta, beta, C = None)" that computes Eqns 6.66 and 6.67, except allow for an arbitrary number of test points (not just one) and now the kernel matrix is for training data. By having C as an optional parameter, we can avoid computing it more than once (for this problem it is unimportant, but for real problems this is an issue). The function should compute $\C$, $\k$, and $c$ and the mean and noise functions. Do not forget: the computeK function computes $\K$, not $\C$! (10 points)
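For reference, the predictive mean and variance at a single test input (Bishop Eqns. 6.66 and 6.67) are:
$m(x_{N+1}) = \k^T \CN^{-1} \t$
$\sigma^2(x_{N+1}) = c - \k^T \CN^{-1} \k$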
In [8]:
def computeC(x_train, theta, beta):
K_t = computeK(x_train, x_train, theta)
return K_t + np.identity(len(x_train)) / beta
def gp_predictive_distribution(x_train, t_train, x_test, theta, beta, C = None):
    if C is None:  # "not C" is ambiguous for numpy arrays
        C = computeC(x_train, theta, beta)
    k = computeK(x_train, x_test, theta)
    Cinv = np.linalg.inv(C)
    means = k.T.dot(Cinv).dot(t_train)
    # Target noise is independent per test point, so add 1/beta on the diagonal only (cf. Eqn 6.67)
    c = computeK(x_test, x_test, theta) + np.identity(len(x_test)) / beta
    covs = c - k.T.dot(Cinv).dot(k)
    return means, covs
Later, to learn the hyperparameters, we will need to compute the log-likelihood of the training data. Implicitly, this is conditioned on the value setting for $\thetav$. Write a function "gp_log_likelihood( x_train, t_train, theta, beta, C = None, invC = None )", where C and invC can be stored and reused. (10 points)
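For reference, the quantity to compute is the log-likelihood of Bishop Eqn. 6.69:
$\ln p(\t \mid \thetav) = -\dfrac{1}{2}\ln|\CN| - \dfrac{1}{2}\t^T \CN^{-1}\t - \dfrac{N}{2}\ln(2\pi)$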
In [9]:
def gp_log_likelihood(x_train, t_train, theta, beta, C = None, invC = None):
    if C is None:  # "not C" is ambiguous for numpy arrays
        C = computeC(x_train, theta, beta)
    if invC is None:
        invC = np.linalg.inv(C)
    # slogdet is numerically more stable than log(det(C))
    sign, logdetC = np.linalg.slogdet(C)
    return -(logdetC + t_train.T.dot(invC).dot(t_train) + len(t_train) * np.log(2 * np.pi)) / 2
Repeat the 6 plots above, but this time conditioned on the training points. Use the sinusoidal data generator to create 2 training points where x is sampled uniformly between $-1$ and $1$. For these plots, feel free to use the provided function "gp_plot". Make sure you put the parameters in the title and this time also the log-likelihood. (10 points) Try to understand the two types of uncertainty! If you do not use "gp_plot", please add a fill_between for the model and target noise.
In [10]:
def gp_plot( x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta ):
# x_test: the test data
# y_test: the true function at x_test
# mu_test: predictive mean at x_test
# var_test: predictive covariance at x_test
# t_train: the training values
# theta: the kernel parameters
# beta: the precision (known)
# the reason for the manipulation is to allow plots separating model and data stddevs.
std_total = np.sqrt(np.diag(var_test)) # includes all uncertainty, model and target noise
std_model = np.sqrt( std_total**2 - 1.0/beta ) # remove data noise to get model uncertainty in stddev
std_combo = std_model + np.sqrt( 1.0/beta ) # add stddev (note: not the same as full)
pp.plot( x_test, y_test, 'b', lw=3)
pp.plot( x_test, mu_test, 'k--', lw=2 )
pp.fill_between( x_test, mu_test+2*std_combo,mu_test-2*std_combo, color='k', alpha=0.25 )
pp.fill_between( x_test, mu_test+2*std_model,mu_test-2*std_model, color='r', alpha=0.25 )
pp.plot( x_train, t_train, 'ro', ms=10 )
In [11]:
beta = 1.0 / pow(sigma,2)
In [12]:
def plot1(x_train):
fig = plt.figure()
y_train = true_mean_function(x_train)
t_train = add_noise(y_train, sigma)
x_test = np.linspace(-1, 1, N_test)
y_test = true_mean_function(x_test)
fig.set_size_inches(14, 8)
for row, col in product(range(2), range(3)):
idx = row * 3 + col
plt.subplot(2, 3, idx + 1)
means, sigmas = gp_predictive_distribution(x_train, t_train, x_test, thetas[idx], beta)
gp_plot(x_test, y_test, means, sigmas, x_train, t_train, thetas[idx], beta)
plt.title(str(thetas[idx]))
plt.show()
x_train = np.random.uniform(-1, 1, 2)
plot1(x_train)
In [13]:
x_train = np.random.uniform(-1, 1, 10)
plot1(x_train)
Learning the values of the parameter $\thetav$ can be very tricky for Gaussian processes in general, but when the data is univariate like ours, we can visualize the fit and see how plausible it looks.
Maximum likelihood or MAP learning is the most common way of setting the parameters, though a fully Bayesian approach is possible too. We will look at ML today. For this, we start with the derivative of the log-likelihood with respect to the parameters $\thetav$; this is Eqn 6.70. This, in turn, requires the derivative of the kernel matrix $\CN$ with respect to $\thetav$. This is the matrix of element-wise derivatives of the kernel function. Write the derivatives for $\theta_0$ to $\theta_3$ for our kernel function (5 points).
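For reference, Eqn. 6.70 reads:
$\dfrac{\partial}{\partial\theta_i}\ln p(\t \mid \thetav) = -\dfrac{1}{2}\mathrm{Tr}\left(\CN^{-1}\dfrac{\partial \CN}{\partial\theta_i}\right) + \dfrac{1}{2}\t^T \CN^{-1}\dfrac{\partial \CN}{\partial\theta_i}\CN^{-1}\t$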
Answer
$\dfrac{\partial k(x_n,x_m)}{\partial\theta_0} = \exp(-\dfrac{\theta_1}{2}||x_n - x_m||^2)$
$\dfrac{\partial k(x_n,x_m)}{\partial\theta_1} = -\dfrac{1}{2}||x_n - x_m||^2 \theta_0 \exp(-\dfrac{\theta_1}{2}||x_n - x_m||^2)$
$\dfrac{\partial k(x_n,x_m)}{\partial\theta_2} = 1$
$\dfrac{\partial k(x_n,x_m)}{\partial\theta_3} = x_n^Tx_m$
Answer
We start by decomposing our kernel function into a sum of three kernel functions:
$k(x_n,x_m)= \theta_0 k_1(x_n,x_m) + \theta_2 k_2(x_n,x_m) + \theta_3 k_3(x_n,x_m)$
As we can see, $\theta_0$, $\theta_2$, and $\theta_3$ are constant coefficients and therefore should be $>0$ for the kernel to be valid. On the other hand, $\theta_1$ plays the role of a bandwidth in $k_1$ and is not constrained.
For parameters that are constrained to be positive, the usual approach is to use the exponential of the free-parameter in the kernel function, but perform gradient ascent on the unconstrained values. Consider the case $\theta_i = \exp( \phi_i)$, where $\phi_i$ is unconstrained. Write the derivative for $\phi_i$ in terms of the derivatives you already computed (5 points). Hint: use the chain rule and do not repeat the full derivation.
Answer
$\dfrac{\partial k(x_n,x_m)}{\partial \phi_0}= \exp(-\dfrac{\theta_1}{2} ||x_n-x_m||^2 + \phi_0)$
$\dfrac{\partial k(x_n,x_m)}{\partial \phi_2}= \exp(\phi_2)$
$\dfrac{\partial k(x_n,x_m)}{\partial \phi_3}= x_n^Tx_m \exp(\phi_3)$
Grid-search: for the same training set you have above, perform a small grid search over $\thetav$ (try at least 20 combinations). Have your grid-search loop or function print out rows of log-likelihood + $\thetav$ sorted from best to worst. Use the log-likelihood to select the best $\thetav$ and the worst. Plot both the same way as the subplots above (i.e. a 1 by 2 subplot of best and worst). (10 points)
In [14]:
from operator import itemgetter
theta = [0, 1, 2, 4, 8]
x_train = np.random.uniform(-1, 1, 10)
y_train = true_mean_function(x_train)
t_train = add_noise(y_train, sigma)
results = []
# 5^4 = 625 combinations of theta values
for t in product(theta, repeat=4):
    results.append(tuple([gp_log_likelihood(x_train, t_train, t, beta)]) + t)
# Sort ascending by log-likelihood, then print from best to worst
results = sorted(results, key=itemgetter(0))
for loglkl, t0, t1, t2, t3 in reversed(results):
    print('log-likelihood: %f thetas: %s' % (loglkl, (t0, t1, t2, t3)))
print('Best log-likelihood: %f for thetas: (%d,%d,%d,%d)' % results[-1])
print('Worst log-likelihood: %f for thetas: (%d,%d,%d,%d)' % results[0])
In [15]:
fig = plt.figure()
fig.set_size_inches(14, 8)
x_test = np.linspace(-1, 1, N_test)
y_test = true_mean_function(x_test)
plt.subplot(1, 2, 1)
means, sigmas = gp_predictive_distribution(x_train, t_train, x_test, np.array(results[-1][1:]), beta)
gp_plot(x_test, y_test, means, sigmas, x_train, t_train, np.array(results[-1][1:]), beta)
plt.title('Best likelihood %.2f, theta = (%d, %d, %d, %d)' % results[-1])
plt.subplot(1, 2, 2)
means, sigmas = gp_predictive_distribution(x_train, t_train, x_test, np.array(results[0][1:]), beta)
gp_plot(x_test, y_test, means, sigmas, x_train, t_train, np.array(results[0][1:]), beta)
plt.title('Worst likelihood %.2f, theta = (%d, %d, %d, %d)' % results[0])
plt.show()
Selecting kernel functions can be somewhat of an art. There are characteristics of kernel functions that are useful for some data sets, but not others. Complicating the matter is the ability to combine kernels with different characteristics (long term trends + seasonal fluctuations). Describe the characteristics of the kernel function we are using in terms of (signal, scale, offsets, etc). You may want to play around with $\thetav$ and see what each parameter does/affects/etc. (5 points) Describe why the best parameters work well for the training data and explain why the bad parameter settings perform poorly (in terms of the first part of the question). (5 points)
Answer
As we showed in 3.2, the kernel function can be decomposed into a combination of other kernel functions, so we continue our discussion from that point: $k(x_n,x_m)= \theta_0 k_1(x_n,x_m) + \theta_2 k_2(x_n,x_m) + \theta_3 k_3(x_n,x_m)$
The $k_1$ kernel corresponds to the similarity measure between points (it captures the signal). $k_2$ and $k_3$ act as offsets, and we observed that it is possible to achieve a good fit to the true function without the offset. The theta parameters also have an interpretation: $\theta_1$ corresponds to the bandwidth of the unnormalised Gaussian, and $\theta_0$ corresponds to its scaling factor. Since $k_2(x_n,x_m)=x_n^0x_m^0$, $\theta_2$ plays the role of an offset, and $\theta_3$ is the scaling factor of the linear kernel.
As stated previously, it is possible to achieve a good data fit without the offset (i.e. with only the similarity kernel and its two thetas); conversely, it is not possible to achieve a good data fit without the similarity kernel and the corresponding thetas. The worst-performing models were observed to have low values for $\theta_0$ and $\theta_1$, which implies they cannot capture the signal. The key component of the kernel is clearly the similarity measure between two data points: if both $\theta_0$ and $\theta_1$ are set close to 0, the kernel loses its power and the predictive mean becomes close to a constant function.
Implement gradient-ascent (or descent if you wish) using the combination of a) the log-likelihood objective function and b) the gradients you calculated above. Run on the training data above and show the log-likelihood curve as it learns and a plot of the final model. Feel free to use available software (e.g. search for "minimize.py" which uses conjugate gradient descent, or something in scipy). NB: the log-likelihood should be monotonically increasing. You are encouraged to also search for and use "checkgrad". (20 points)
In [16]:
from scipy import optimize
# Overloading k_n_m to accept phis instead of thetas
# The implementation of computeK, computeC will be the same.
def k_n_m(xn, xm, phis):
return np.exp(phis[0]) * np.exp(-np.exp(phis[1]) * (xn - xm) ** 2 / 2) + np.exp(phis[2]) + np.exp(phis[3]) * xn * xm
log_likelihoods = []
# Log Likelihood function should be changed a bit to accept phis first
def log_likelihood(phis, x_train, t_train, beta):
loglik = gp_log_likelihood(x_train, t_train, phis, beta)
log_likelihoods.append(loglik)
return -loglik
def grad_log_lklh(phis, x_train, t_train, beta):
thetas = np.exp(phis)
diff_xn_xm2 = (np.tile(x_train[np.newaxis].T, (1, len(x_train))) - np.tile(x_train, (len(x_train), 1))) ** 2
derivP = [0] * 4
derivP[0] = np.exp(-thetas[1] * diff_xn_xm2 / 2 + phis[0])
derivP[1] = -derivP[0] * diff_xn_xm2 * thetas[1] / 2
derivP[2] = np.tile(thetas[2], (len(x_train), len(x_train)))
derivP[3] = x_train[np.newaxis].T.dot(x_train[np.newaxis]) * thetas[3]
derivLogLklh = np.zeros(4)
C = computeC(x_train, phis, beta)
invC = np.linalg.inv(C)
for i in range(4):
derivLogLklh[i] = -0.5 * (-np.trace(invC.dot(derivP[i])) +
t_train.T.dot(invC).dot(derivP[i]).dot(invC).dot(t_train))
return derivLogLklh
phis = np.array([2, 6, 0, 0])
error = optimize.check_grad(log_likelihood, grad_log_lklh, phis, x_train, t_train, beta)
print('check_grad error:', error)
result = optimize.minimize(log_likelihood, phis, jac=grad_log_lklh, method='CG', args=(x_train, t_train, beta))
print('Optimization converged:', result.success)
print('Best grid-search log-likelihood: %f' % results[-1][0])
# Compare the gradient-based result with the best grid-search log-likelihood
print('Obtained log-likelihood:', -log_likelihood(result.x, x_train, t_train, beta))
The obtained log-likelihood turns out to be better than the one found by grid search.
In [17]:
fig = plt.figure()
fig.set_size_inches(14, 8)
means, sigmas = gp_predictive_distribution(x_train, t_train, x_test, result.x, beta)
gp_plot(x_test, y_test, means, sigmas, x_train, t_train, result.x, beta)
plt.show()
In [18]:
plt.plot(np.arange(len(log_likelihoods)), log_likelihoods)
plt.ylabel('Log likelihood')
plt.xlabel('Iteration')
plt.show()