In [2]:
%pylab inline
import pylab as pp
import random as r
import matplotlib as mpl
Notes on implementation:
$\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bxp}{\mathbf{x}^{'}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\ba}{\mathbf{a}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bq}{\mathbf{q}}$ $\newcommand{\bp}{\mathbf{p}}$ $\newcommand{\bh}{\mathbf{h}}$
$\newcommand{\bI}{\mathbf{I}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\bT}{\mathbf{T}}$ $\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\xm}{\mathbf{x}_m}$ $\newcommand{\xn}{\mathbf{x}_n}$ $\newcommand{\y}{\mathbf{y}}$ $\newcommand{\K}{\mathbf{K}}$ $\newcommand{\zero}{\mathbf{0}}$ $\newcommand{\yi}{\y_i}$ $\newcommand{\thetav}{\mathbf{\theta}}$ $\newcommand{\t}{\mathbf{t}}$ $\newcommand{\x}{\mathbf{x}}$ $\newcommand{\tN}{\mathbf{t}_N}$ $\newcommand{\xN}{\mathbf{x}_N}$ $\newcommand{\k}{\mathbf{k}}$ $\newcommand{\C}{\mathbf{C}}$ $\newcommand{\CN}{\mathbf{C}_N}$ $\newcommand{\KN}{\mathbf{K}_N}$ $\newcommand{\eyeN}{\mathbf{I}_N}$
For this lab we will refer to sections 6.4.2 and 6.4.3. You may also want to refer to Rasmussen's Gaussian Processes text, which is available online at gaussianprocesses.org. To understand Gaussian processes, it is highly recommended to understand how marginal, partitioned Gaussian distributions can be converted into conditional Gaussian distributions. This is covered in Chapter 2.3 and summarized in Eqns 2.94-2.98.
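For reference, if $\bx = (\bx_a, \bx_b)$ is jointly Gaussian with mean $(\boldsymbol{\mu}_a, \boldsymbol{\mu}_b)$ and covariance blocks $\boldsymbol{\Sigma}_{aa}$, $\boldsymbol{\Sigma}_{ab}$, $\boldsymbol{\Sigma}_{bb}$, the conditional distribution summarized there is
$p(\bx_a \mid \bx_b) = \mathcal{N}\left(\boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\bx_b - \boldsymbol{\mu}_b),\; \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}\right).$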
We will use the same data generating function that we used previously for regression. You can change sigma/beta, but keep it reasonable. Definitely play around once you have things working. Make use of these functions as you wish.
In [39]:
sigma = 0.1
beta = 1.0 / pow(sigma,2) # this is the beta used in Bishop Eqn. 6.59
N_test = 100
x_test = np.linspace(-1,1,N_test);
mu_test = np.zeros( N_test )
var_test = np.zeros( N_test )
In [4]:
def true_mean_function( x ):
    return np.sin( 2*pi*(x+1) )

def add_noise( y, sigma ):
    return y + sigma*np.random.randn(len(y))

def generate_t( x, sigma ):
    return add_noise( true_mean_function( x ), sigma )
In [5]:
y_test = true_mean_function( x_test )
t_test = add_noise( y_test, sigma )
pp.plot( x_test, y_test, 'b-', lw=2)
pp.plot( x_test, t_test, 'go')
Out[5]:
We will implement Gaussian process regression using the kernel function in Bishop Eqn. 6.63.
To start, implement the function "k_n_m( xn, xm, thetas )" that takes scalars $\xn$ and $\xm$ and a vector of $4$ thetas, and computes the kernel function of Bishop Eqn. 6.63 (10 points). NB: usually the kernel function takes $D$ by $1$ vectors, but since we are working on a univariate problem, this makes things easier.
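For reference, the kernel to implement (Bishop Eqn. 6.63) is
$k(\xn, \xm) = \theta_0 \exp\left(-\frac{\theta_1}{2}\|\xn - \xm\|^2\right) + \theta_2 + \theta_3\, \xn^T \xm.$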
In [93]:
def k_n_m(xn, xm, thetas):
    # Bishop Eqn. 6.63: theta0 * exp(-theta1/2 * ||xn - xm||^2) + theta2 + theta3 * xn^T xm
    d = xn - xm
    return thetas[0] * np.exp(-thetas[1]/2 * d**2) + thetas[2] + thetas[3] * xn * xm
Eqn 6.60 is the marginal distribution of the mean output of $N$ data vectors: $p(\y) = \mathcal{N}(\zero, \K)$. Notice that the expected mean function is $0$ at all locations, and that the covariance is an $N$ by $N$ kernel matrix $\K$. Write a function "computeK( X1, X2, thetas )" that computes the kernel matrix. Use k_n_m as part of an inner loop (of course, there are more efficient ways of computing the kernel function making better use of vectorization, but that is not necessary) (5 points).
In [7]:
def compute_K(X1, X2, thetas):
    # kernel matrix between all pairs of points in X1 and X2
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i, xn in enumerate(X1):
        for j, xm in enumerate(X2):
            K[i,j] = k_n_m(xn, xm, thetas)
    return K
Now sample mean functions at the x_test locations for the theta values in Bishop Figure 6.5, make a figure with a 2 by 3 subplot, and make sure the title reflects the theta values (make sure everything is legible). In other words, sample $\yi \sim \mathcal{N}(\zero, \K(\thetav))$. Make use of numpy.random.multivariate_normal(). On your plots include the expected value of $\y$ with a dashed line, and fill_between 2 standard deviations of the uncertainty due to $\K$ (the diagonal of $\K$ is the variance of the model uncertainty) (15 points).
In [126]:
def plot_function_samples(x_test, y_test, Thetas):
    plt.figure(figsize=(16, 12))
    for i, thetas in enumerate(Thetas):
        K = compute_K(x_test, x_test, thetas)
        plt.subplot(2, 3, 1 + i)
        plt.title('thetas: %s' % (thetas, ))
        # one sample mean function drawn from the GP prior N(0, K)
        y = np.random.multivariate_normal(np.zeros(len(x_test)), K)
        # 2 standard deviations of the model uncertainty (diagonal of K)
        std = np.sqrt(np.diag(K))
        plt.plot(x_test, y, '--', lw=2)
        plt.fill_between(x_test, y - 2*std, y + 2*std, alpha=0.1)
    plt.tight_layout()
    plt.show()
Thetas = (
    (1.00, 4.00, 0.00, 0.00),
    (9.00, 4.00, 0.00, 0.00),
    (1.00, 64.00, 0.00, 0.00),
    (1.00, 0.25, 0.00, 0.00),
    (1.00, 4.00, 10.00, 0.00),
    (1.00, 4.00, 0.00, 5.00),
)
plot_function_samples(x_test, y_test, Thetas)
So far we have sampled mean functions from the prior. We can draw actual data $\t$ in two ways. The first way is generative: first sample $\y | \K$, then sample $\t | \y, \beta$ (Eqn 6.60 followed by Eqn 6.59). The second way is to integrate over $\y$ (the mean draw) and directly sample $\t | \K, \beta$ using Eqn 6.61. This is the generative process for $\t$. Note that we have not specified a distribution over inputs $\x$; this is because Gaussian processes are conditional models. Because of this we are free to generate locations $\x$ when playing around with the GP; obviously a dataset will give us input-output pairs.
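As an illustration (not required by the assignment), both sampling routes can be written in a few lines; this sketch reuses the compute_K, x_test and beta defined above:

thetas = (1.00, 4.00, 0.00, 0.00)   # example hyperparameters
K = compute_K(x_test, x_test, thetas)

# Way 1 (generative): sample y | K (Eqn 6.60), then t | y, beta (Eqn 6.59)
# (numerically, K may need a tiny jitter on its diagonal to be treated as PSD)
y_sample = np.random.multivariate_normal(np.zeros(len(x_test)), K)
t_sample_1 = y_sample + np.random.randn(len(x_test)) / np.sqrt(beta)

# Way 2 (marginal): sample t | K, beta directly from N(0, C) with C = K + I/beta (Eqn 6.61)
C = K + np.eye(len(x_test)) / beta
t_sample_2 = np.random.multivariate_normal(np.zeros(len(x_test)), C)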
Once we have data, we are interested in the predictive distribution (note: the prior is the predictive distribution when there is no data). Consider the joint distribution for $N+1$ targets, given by Eqn 6.64. Its covariance matrix is composed of block components $\CN$, $\k$, and $c$. The covariance matrix $\CN$ for $\tN$ is $\CN = \KN + \eyeN / \beta$. We have just made explicit the size $N$ of the matrix; $N$ is the number of training points. The kernel vector $\k$ is an $N$ by $1$ vector of kernel function evaluations between the training input data and the test input vector. The scalar $c$ is a kernel evaluation at the test input.
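Explicitly, the joint covariance and the resulting predictive mean and variance (Bishop Eqns 6.65-6.67) are
$\C_{N+1} = \begin{pmatrix} \CN & \k \\ \k^T & c \end{pmatrix}, \qquad m(x_{N+1}) = \k^T \CN^{-1} \tN, \qquad \sigma^2(x_{N+1}) = c - \k^T \CN^{-1} \k,$
with $c = k(x_{N+1}, x_{N+1}) + 1/\beta$.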
Write a function "gp_predictive_distribution( x_train, t_train, x_test, theta, C = None )" that computes Eqns 6.66 and 6.67, except allow for an arbitrary number of test points (not just one); the kernel matrix is now built from the training data. By having C as an optional parameter, we can avoid computing it more than once (for this problem it is unimportant, but for real problems this is an issue). The function should compute $\C$, $\k$, and $c$, and return the predictive mean and variance. Do not forget: the computeK function computes $\K$, not $\C$! (10 points)
In [11]:
def gp_predictive_distribution(x_train, t_train, x_test, theta, C=None):
    # C is the training covariance C = K + I/beta (Eqn 6.62); compute it only if not supplied
    if C is None:
        C = compute_K(x_train, x_train, theta) + np.eye(x_train.shape[0]) / beta
    invC = np.linalg.inv(C)
    means = np.zeros(len(x_test))
    variances = np.zeros(len(x_test))
    for i in range(len(x_test)):
        # c and k from Eqn 6.65; predictive mean and variance from Eqns 6.66 and 6.67
        c = k_n_m(x_test[i], x_test[i], theta) + 1./beta
        k = np.array([k_n_m(x_train[n], x_test[i], theta) for n in range(len(x_train))])
        means[i] = k.dot(invC).dot(t_train)
        variances[i] = c - k.dot(invC).dot(k)
    return means, variances
Later, to learn the hyperparameters, we will need to compute the log-likelihood of the training data. Implicitly, this is conditioned on the value setting for $\thetav$. Write a function "gp_log_likelihood( x_train, t_train, theta, C = None, invC = None )", where C and invC can be stored and reused. (10 points)
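For reference, the quantity to compute is the log marginal likelihood of Bishop Eqn 6.69:
$\ln p(\tN \mid \thetav) = -\frac{1}{2} \ln |\CN| - \frac{1}{2} \tN^T \CN^{-1} \tN - \frac{N}{2} \ln(2\pi).$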
In [12]:
def gp_log_likelihood(x_train, t_train, theta, C=None, invC=None):
    # Bishop Eqn 6.69; C and invC can be passed in to avoid recomputation
    if C is None:
        C = compute_K(x_train, x_train, theta) + np.eye(x_train.shape[0]) / beta
    if invC is None:
        invC = np.linalg.inv(C)
    # slogdet is numerically more stable than log(det(C))
    sign, logdetC = np.linalg.slogdet(C)
    return -0.5 * (logdetC + t_train.dot(invC).dot(t_train) + len(t_train) * np.log(2*np.pi))
Repeat the 6 plots above, but this time conditioned on 2 training points. Use the sinusoidal data generator to create 2 training points where x is sampled uniformly between $-1$ and $1$. For these plots, feel free to use the provided function "gp_plot". Make sure you put the parameters in the title, and this time also the log-likelihood. (10 points) Try to understand the two types of uncertainty! If you do not use "gp_plot", please add a fill_between for the model and target noise.
In [13]:
def gp_plot( x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta ):
    # x_test:   the test input locations
    # y_test:   the true function at x_test
    # mu_test:  predictive mean at x_test
    # var_test: predictive variance at x_test
    # t_train:  the training values
    # theta:    the kernel parameters
    # beta:     the precision (known)
    # the reason for the manipulation is to allow plots separating model and data stddevs.
    std_total = np.sqrt(var_test)                   # includes all uncertainty, model and target noise
    std_model = np.sqrt( std_total**2 - 1.0/beta )  # remove data noise to get model uncertainty in stddev
    std_combo = std_model + np.sqrt( 1.0/beta )     # add stddev (note: not the same as full)
    pp.plot( x_test, y_test, 'b', lw=3)
    pp.plot( x_test, mu_test, 'k--', lw=2 )
    pp.fill_between( x_test, mu_test+2*std_combo, mu_test-2*std_combo, color='k', alpha=0.25 )
    pp.fill_between( x_test, mu_test+2*std_model, mu_test-2*std_model, color='r', alpha=0.25 )
    pp.title("theta = " + str(theta) + " log_l = " + str(gp_log_likelihood(x_train, t_train, theta)))
    pp.plot( x_train, t_train, 'ro', ms=10 )
N_train = 2
x_train = np.linspace(-1,1,N_train)
t_train = generate_t(x_train, sigma)

plt.figure(figsize=(16, 12))
for i, theta in enumerate(Thetas):
    plt.subplot(2, 3, 1 + i)
    mu_test, var_test = gp_predictive_distribution( x_train, t_train, x_test, theta)
    gp_plot(x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta)
plt.tight_layout()
In [14]:
N_train = 10
x_train = np.linspace(-1,1,N_train)
t_train = generate_t(x_train, sigma)

plt.figure(figsize=(16, 12))
for i, theta in enumerate(Thetas):
    plt.subplot(2, 3, 1 + i)
    mu_test, var_test = gp_predictive_distribution( x_train, t_train, x_test, theta)
    gp_plot(x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta)
plt.tight_layout()
Learning the values of the parameter $\thetav$ can be very tricky for Gaussian processes in general, but when the data is univariate like ours, we can visualize the fit and see how plausible it looks.
Maximum likelihood or MAP learning is the most common way of setting the parameters, though a fully Bayesian approach is possible too. We will look at ML today. For this, we start with the derivative of the log-likelihood with respect to the parameters $\thetav$; this is Eqn 6.70. This, in turn, requires the derivative of the kernel matrix $\CN$ with respect to $\thetav$. This is the matrix of element-wise derivatives of the kernel function. Write the derivatives for $\theta_0$ to $\theta_3$ for our kernel function (5 points).
$k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp \left( -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right) + \theta_2 + \theta_3 \mathbf{x}_n^T\mathbf{x}_m$
$\frac{\partial}{\partial \theta_0} k(\mathbf{x}_n, \mathbf{x}_m) = \exp \left( -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)$
$\frac{\partial}{\partial \theta_1} k(\mathbf{x}_n, \mathbf{x}_m) = - \frac{\theta_0}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \exp \left( -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)$
$\frac{\partial}{\partial \theta_2} k(\mathbf{x}_n, \mathbf{x}_m) = 1$
$\frac{\partial}{\partial \theta_3} k(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^T\mathbf{x}_m$
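The same derivatives written as code, as a small sketch (the name k_n_m_grad is our addition, not part of the assignment); it returns the four partials at scalar inputs, matching k_n_m above:

def k_n_m_grad(xn, xm, thetas):
    # partial derivatives of k(xn, xm) w.r.t. theta_0 ... theta_3 (scalar inputs)
    sq_dist = (xn - xm)**2
    e = np.exp(-thetas[1]/2 * sq_dist)
    dk_dtheta0 = e
    dk_dtheta1 = -thetas[0]/2 * sq_dist * e
    dk_dtheta2 = 1.0
    dk_dtheta3 = xn * xm
    return np.array([dk_dtheta0, dk_dtheta1, dk_dtheta2, dk_dtheta3])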
The covariance matrix $C$ is restricted to be positive definite. Since $C$ is constructed by adding the (positive) noise variance to $K$, $C$ is always positive definite given that $K$ is positive semidefinite. So it suffices to show that $K$ is positive semidefinite. If we work this out for the two-dimensional case where we have two data points, $x_1$ and $x_2$, we get the following situation:
$ K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2)\\ k(x_2, x_1) & k(x_2, x_2) \end{bmatrix} $
For $K$ to be positive semidefinite, we require (among other conditions) that $|K| \ge 0$. If we write out the determinant, we get
$k(x_1, x_1)k(x_2, x_2) - k(x_1, x_2)^2 \ge 0$
Using the fact that $k(x_1, x_2) = k(x_2, x_1)$
$k(x_1, x_1)k(x_2, x_2) \ge k(x_1, x_2)^2$
$(\theta_0 + \theta_2 + \theta_3 x_1^T x_1) (\theta_0 + \theta_2 + \theta_3 x_2^T x_2) \ge \left(\theta_0 \exp\left(-\frac{\theta_1}{2} ||x_1 - x_2||^2\right) + \theta_2 + \theta_3 x_1^T x_2\right)^2 $
Here we can see that $\theta_1$ disappeared on the left-hand side because the distance of a datapoint to itself is zero. Since the exponential is always positive, $\theta_0$ seems to play a deciding role in this inequality. We can't quite settle this from the equations alone, but some experimentation (see the check below) seems to confirm that $\theta_0$ is constrained to be greater than or equal to zero. We will assume this result for the next question.
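A minimal numerical check of this claim (our addition, using the compute_K defined above): the smallest eigenvalue of $K$ should be non-negative for $\theta_0 \ge 0$ and clearly negative otherwise.

# positive semidefiniteness check of K for a positive and a negative theta_0;
# values on the order of machine precision below zero are just round-off
x_check = np.random.uniform(-1, 1, 5)
for theta0 in (1.0, -1.0):
    K_check = compute_K(x_check, x_check, (theta0, 4.0, 0.0, 0.0))
    print('theta0 = %+.1f, min eigenvalue = %.2e' % (theta0, np.linalg.eigvalsh(K_check).min()))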
For parameters that are constrained to be positive, the usual approach is to use the exponential of the free-parameter in the kernel function, but perform gradient ascent on the unconstrained values. Consider the case $\theta_i = \exp( \phi_i)$, where $\phi_i$ is unconstrained. Write the derivative for $\phi_i$ in terms of the derivatives you already computed (5 points). Hint: use the chain rule and do not repeat the full derivation.
Since we're assuming that $\theta_0$ is constrained to be greater than or equal to zero, we substitute it by $\exp(\phi_0)$:
$\frac{\partial k(\mathbf{x}_n, \mathbf{x}_m)}{\partial \phi_0} = \frac{\partial k(\mathbf{x}_n, \mathbf{x}_m)}{\partial \exp(\phi_0)} \frac{\partial \exp(\phi_0)}{\partial \phi_0} = \exp \left( \phi_0 -\frac{\theta_1}{2} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)$
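More generally, for any parameter constrained via $\theta_i = \exp(\phi_i)$, the chain rule gives $\frac{\partial k}{\partial \phi_i} = \frac{\partial k}{\partial \theta_i} \frac{\partial \theta_i}{\partial \phi_i} = \theta_i \frac{\partial k}{\partial \theta_i}$, i.e. each unconstrained derivative is the corresponding derivative from above scaled by $\theta_i$.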
Grid search: for the same training set you have above, perform a small grid search over $\thetav$ (try at least 20 combinations). Have your grid-search loop or function print out rows of log-likelihood + $\thetav$, sorted from best to worst. Use the log-likelihood to select the best $\thetav$ and the worst. Plot both the same way as the subplots above (i.e., a 1 by 2 subplot of best and worst). (10 points)
In [15]:
def getKey(item):
    return item[0]

def gridsearch(x_train, t_train):
    # 4^4 = 256 combinations of integer theta values, sorted by log-likelihood (best first)
    ran = np.arange(4)
    srch = []
    for a in ran:
        for b in ran:
            for c in ran:
                for d in ran:
                    srch.append([gp_log_likelihood(x_train, t_train, [a,b,c,d], C=None, invC=None), [a,b,c,d]])
    return sorted(srch, key=getKey, reverse=True)

gs = gridsearch(x_train, t_train)
b_and_w = []
b_and_w.append(gs[0][1])   # best theta
b_and_w.append(gs[-1][1])  # worst theta
Thetas = tuple(b_and_w)

plt.figure(figsize=(16, 12))
for i, theta in enumerate(Thetas):
    plt.subplot(1, 2, 1 + i)
    mu_test, var_test = gp_predictive_distribution( x_train, t_train, x_test, theta)
    gp_plot(x_test, y_test, mu_test, var_test, x_train, t_train, theta, beta)
plt.tight_layout()
Selecting kernel functions can be somewhat of an art. There are characteristics of kernel functions that are useful for some data sets, but not others. Complicating the matter is the ability to combine kernels with different characteristics (e.g. long-term trends + seasonal fluctuations). Describe the characteristics of the kernel function we are using in terms of (signal, scale, offsets, etc). You may want to play around with $\thetav$ and see what each parameter does/affects/etc. (5 points) Describe why the best parameters work well for the training data and explain why the bad parameter settings perform poorly (in terms of the first part of the question). (5 points)
$\theta_0 \exp\left(-\frac{\theta_1}{2} ||x_1 - x_2||^2\right) + \theta_2 + \theta_3\, x_1^T x_2$
The kernel function indirectly constructs the covariance matrix $\mathbf{C}$ of $p(\t)$. Going through the kernel function term by term: the first term, involving $\theta_0$ and $\theta_1$, determines how correlated points in $\t$ are depending on their distance. $\theta_0$ scales the overall signal variance, while $\theta_1$ sets how quickly the correlation decays with squared distance; together they control the smoothness of the sampled functions.

$\theta_2$ is a constant added to the kernel function, so it acts as a constant offset (bias) term shared by all points.

$\theta_3$ operates on the dot product between datapoints, adding a linear component whose variance grows with the points' distance from the origin.
Bonus: Implement gradient ascent (or descent if you wish) using the combination of a) the log-likelihood objective function and b) the gradients you calculated above. Run on the training data above and show the log-likelihood curve as it learns, and a plot of the final model. Feel free to use available software (e.g. search for "minimize.py", which uses conjugate gradient descent, or something in scipy). NB: the log-likelihood should be monotonically increasing. You are encouraged to also search for and use "checkgrad". (20 points)
In [ ]:
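A minimal sketch of one way to approach the bonus (our addition, not a reference solution): plain gradient ascent on $\phi = \log\thetav$, combining the log-likelihood gradient of Bishop Eqn 6.70 with the k_n_m_grad sketch above. The initialisation, step size, and iteration count are arbitrary choices and may need tuning.

def log_likelihood_grad(x_train, t_train, thetas):
    # Eqn 6.70: d/dtheta_i ln p(t|theta) = -0.5 Tr(C^-1 dC/dtheta_i) + 0.5 t^T C^-1 dC/dtheta_i C^-1 t
    N = len(x_train)
    C = compute_K(x_train, x_train, thetas) + np.eye(N) / beta
    invC = np.linalg.inv(C)
    alpha = invC.dot(t_train)
    grad = np.zeros(4)
    for i in range(4):
        # element-wise derivative of C (equivalently K) w.r.t. theta_i
        dC = np.array([[k_n_m_grad(xn, xm, thetas)[i] for xm in x_train] for xn in x_train])
        grad[i] = -0.5 * np.trace(invC.dot(dC)) + 0.5 * alpha.dot(dC).dot(alpha)
    return grad

# gradient ascent in phi = log(theta) so that the thetas stay positive;
# with a sufficiently small learning rate the curve should be monotonically increasing
phi = np.log(np.array([1.0, 1.0, 0.1, 0.1]))   # arbitrary initialisation
learning_rate = 0.01
log_liks = []
for it in range(200):
    thetas = np.exp(phi)
    log_liks.append(gp_log_likelihood(x_train, t_train, thetas))
    # chain rule from above: d/dphi_i = theta_i * d/dtheta_i
    phi = phi + learning_rate * thetas * log_likelihood_grad(x_train, t_train, thetas)

pp.plot(log_liks)
pp.title('log-likelihood during gradient ascent')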