GPyTorch regression with derivative information

Introduction

In this notebook, we show how to train a GP regression model in GPyTorch of an unknown function given function value and derivative observations. We consider modeling the function:

\begin{align*} y &= \sin(2x) + \cos(x) + \epsilon \\ \frac{dy}{dx} &= 2\cos(2x) - \sin(x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.05^2) \end{align*}

using 50 value and derivative observations.


In [1]:
import torch
import gpytorch
import math
from matplotlib import pyplot as plt
import numpy as np

%matplotlib inline
%load_ext autoreload
%autoreload 2



Setting up the training data

We use 50 evenly spaced points in the interval $[0, 5\pi]$ and observe noisy function values and derivatives at each point.


In [2]:
lb, ub = 0.0, 5*math.pi
n = 50

train_x = torch.linspace(lb, ub, n).unsqueeze(-1)   # shape (n, 1)
train_y = torch.stack([
    torch.sin(2*train_x) + torch.cos(train_x),       # function values
    -torch.sin(train_x) + 2*torch.cos(2*train_x)     # derivative values
], -1).squeeze(1)                                    # shape (n, 2)

train_y += 0.05 * torch.randn(n, 2)                  # add observation noise
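
As a quick sanity check (an added sketch, not part of the original notebook), we can plot the noisy observations before fitting: the left panel shows the function values and the right panel the derivatives.

# Added sketch: visualize the noisy training data before fitting
fig, (ax_f, ax_df) = plt.subplots(1, 2, figsize=(12, 4))
ax_f.plot(train_x.squeeze(-1).numpy(), train_y[:, 0].numpy(), 'k*')
ax_f.set_title('Noisy function values')
ax_df.plot(train_x.squeeze(-1).numpy(), train_y[:, 1].numpy(), 'k*')
ax_df.set_title('Noisy derivatives')
None  # suppress text output, as in the plotting cell below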

Setting up the model

A GP prior on a function implies a multi-output GP prior over the function values and the partial derivatives; see Section 9.4 of http://www.gaussianprocess.org/gpml/chapters/RW9.pdf for more details. This lets us use a MultitaskMultivariateNormal and a MultitaskGaussianLikelihood to train a GP model on both function values and gradients. The RBF kernel extended to model the covariance between values and partial derivatives is implemented in RBFKernelGrad, and the corresponding extension of the constant mean is implemented in ConstantMeanGrad.

RBFKernelGrad generally produces worse-conditioned covariance matrices than RBFKernel, so we fix the noise parameter at a small value (rather than learning it) to keep the smallest eigenvalues of the kernel matrix away from zero.
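
To make this concrete, here is a small added sketch (not part of the original notebook): evaluating RBFKernelGrad on $n$ points in $d$ dimensions should yield an $n(d+1) \times n(d+1)$ covariance matrix, with the value and derivative entries for each point stored together.

# Added sketch: inspect the joint value/derivative covariance built by RBFKernelGrad
k_grad = gpytorch.kernels.RBFKernelGrad()
x_demo = torch.linspace(0, 1, 5).unsqueeze(-1)  # n = 5 points, d = 1
K_demo = k_grad(x_demo)                         # lazily evaluated kernel matrix
print(K_demo.shape)                             # expected: torch.Size([10, 10]), i.e. n*(d+1) on each side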


In [3]:
class GPModelWithDerivatives(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPModelWithDerivatives, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMeanGrad()
        self.base_kernel = gpytorch.kernels.RBFKernelGrad()
        self.covar_module = gpytorch.kernels.ScaleKernel(self.base_kernel)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)  # Value + Derivative
likelihood.initialize(noise=0.01*train_y.std())  # Fix the noise at roughly 1% of the target std
likelihood.raw_noise.requires_grad = False       # Do not learn the noise
model = GPModelWithDerivatives(train_x, train_y, likelihood)
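
Here the noise is simply frozen at its initial value. In more recent GPyTorch versions, an alternative (sketched below, assuming the noise_constraint keyword argument and gpytorch.constraints.GreaterThan are available in your version) keeps the noise learnable but bounded away from zero:

# Alternative sketch (assumes noise_constraint is available in your GPyTorch version):
# keep the noise learnable but lower-bounded for numerical stability.
likelihood_lb = gpytorch.likelihoods.MultitaskGaussianLikelihood(
    num_tasks=2,
    noise_constraint=gpytorch.constraints.GreaterThan(1e-2),
)
model_lb = GPModelWithDerivatives(train_x, train_y, likelihood_lb)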

Training the model

Training this model is similar to training a standard GP regression model.


In [4]:
# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

n_iter = 50
for i in range(n_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, n_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))        
    optimizer.step()


Iter 1/50 - Loss: 60.662   lengthscale: 0.693   noise: 0.014
Iter 2/50 - Loss: 58.924   lengthscale: 0.744   noise: 0.014
Iter 3/50 - Loss: 57.026   lengthscale: 0.798   noise: 0.014
Iter 4/50 - Loss: 55.471   lengthscale: 0.853   noise: 0.014
Iter 5/50 - Loss: 52.139   lengthscale: 0.905   noise: 0.014
Iter 6/50 - Loss: 51.552   lengthscale: 0.950   noise: 0.014
Iter 7/50 - Loss: 50.446   lengthscale: 0.988   noise: 0.014
Iter 8/50 - Loss: 46.764   lengthscale: 1.015   noise: 0.014
Iter 9/50 - Loss: 48.350   lengthscale: 1.028   noise: 0.014
Iter 10/50 - Loss: 43.598   lengthscale: 1.036   noise: 0.014
Iter 11/50 - Loss: 42.887   lengthscale: 1.037   noise: 0.014
Iter 12/50 - Loss: 42.121   lengthscale: 1.032   noise: 0.014
Iter 13/50 - Loss: 39.327   lengthscale: 1.026   noise: 0.014
Iter 14/50 - Loss: 38.159   lengthscale: 1.020   noise: 0.014
Iter 15/50 - Loss: 36.058   lengthscale: 1.018   noise: 0.014
Iter 16/50 - Loss: 33.545   lengthscale: 1.019   noise: 0.014
Iter 17/50 - Loss: 30.586   lengthscale: 1.024   noise: 0.014
Iter 18/50 - Loss: 29.251   lengthscale: 1.036   noise: 0.014
Iter 19/50 - Loss: 24.608   lengthscale: 1.053   noise: 0.014
Iter 20/50 - Loss: 25.399   lengthscale: 1.069   noise: 0.014
Iter 21/50 - Loss: 22.303   lengthscale: 1.091   noise: 0.014
Iter 22/50 - Loss: 19.724   lengthscale: 1.119   noise: 0.014
Iter 23/50 - Loss: 22.927   lengthscale: 1.141   noise: 0.014
Iter 24/50 - Loss: 19.074   lengthscale: 1.164   noise: 0.014
Iter 25/50 - Loss: 17.078   lengthscale: 1.176   noise: 0.014
Iter 26/50 - Loss: 14.719   lengthscale: 1.181   noise: 0.014
Iter 27/50 - Loss: 14.614   lengthscale: 1.183   noise: 0.014
Iter 28/50 - Loss: 10.786   lengthscale: 1.180   noise: 0.014
Iter 29/50 - Loss: 9.239   lengthscale: 1.166   noise: 0.014
Iter 30/50 - Loss: 6.483   lengthscale: 1.151   noise: 0.014
Iter 31/50 - Loss: 6.814   lengthscale: 1.141   noise: 0.014
Iter 32/50 - Loss: 4.227   lengthscale: 1.146   noise: 0.014
Iter 33/50 - Loss: 4.655   lengthscale: 1.158   noise: 0.014
Iter 34/50 - Loss: 2.414   lengthscale: 1.174   noise: 0.014
Iter 35/50 - Loss: 2.902   lengthscale: 1.190   noise: 0.014
Iter 36/50 - Loss: -3.138   lengthscale: 1.209   noise: 0.014
Iter 37/50 - Loss: 2.241   lengthscale: 1.228   noise: 0.014
Iter 38/50 - Loss: -6.051   lengthscale: 1.241   noise: 0.014
Iter 39/50 - Loss: -3.690   lengthscale: 1.251   noise: 0.014
Iter 40/50 - Loss: -2.811   lengthscale: 1.256   noise: 0.014
Iter 41/50 - Loss: -9.964   lengthscale: 1.259   noise: 0.014
Iter 42/50 - Loss: -5.040   lengthscale: 1.253   noise: 0.014
Iter 43/50 - Loss: -9.899   lengthscale: 1.250   noise: 0.014
Iter 44/50 - Loss: -10.464   lengthscale: 1.241   noise: 0.014
Iter 45/50 - Loss: -8.486   lengthscale: 1.234   noise: 0.014
Iter 46/50 - Loss: -9.840   lengthscale: 1.231   noise: 0.014
Iter 47/50 - Loss: -12.451   lengthscale: 1.235   noise: 0.014
Iter 48/50 - Loss: -12.304   lengthscale: 1.246   noise: 0.014
Iter 49/50 - Loss: -7.173   lengthscale: 1.257   noise: 0.014
Iter 50/50 - Loss: -11.317   lengthscale: 1.268   noise: 0.014

Making predictions with the model

Model predictions are also similar to GP regression with only function values, but we need more CG iterations to get accurate estimates of the predictive variance.


In [5]:
# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots
f, (y1_ax, y2_ax) = plt.subplots(1, 2, figsize=(12, 6))

# Make predictions
with torch.no_grad(), gpytorch.settings.max_cg_iterations(50):
    test_x = torch.linspace(lb, ub, 500)
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()
    
# Plot training data as black stars
y1_ax.plot(train_x.detach().numpy(), train_y[:, 0].detach().numpy(), 'k*')
# Predictive mean as blue line
y1_ax.plot(test_x.numpy(), mean[:, 0].numpy(), 'b')
# Shade in confidence 
y1_ax.fill_between(test_x.numpy(), lower[:, 0].numpy(), upper[:, 0].numpy(), alpha=0.5)
y1_ax.legend(['Observed Values', 'Mean', 'Confidence'])
y1_ax.set_title('Function values')

# Plot training data as black stars
y2_ax.plot(train_x.detach().numpy(), train_y[:, 1].detach().numpy(), 'k*')
# Predictive mean as blue line
y2_ax.plot(test_x.numpy(), mean[:, 1].numpy(), 'b')
# Shade in confidence 
y2_ax.fill_between(test_x.numpy(), lower[:, 1].numpy(), upper[:, 1].numpy(), alpha=0.5)
y2_ax.legend(['Observed Derivatives', 'Mean', 'Confidence'])
y2_ax.set_title('Derivatives')

None
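
Since the ground-truth function is known in this example, we can also check the fit numerically. The following added sketch (not in the original notebook) compares the predictive means to the noiseless true values and derivatives on the test grid:

# Added sketch: RMSE of the predictive means against the noiseless ground truth
true_f  = torch.sin(2 * test_x) + torch.cos(test_x)
true_df = -torch.sin(test_x) + 2 * torch.cos(2 * test_x)
rmse_f  = torch.sqrt(torch.mean((mean[:, 0] - true_f) ** 2))
rmse_df = torch.sqrt(torch.mean((mean[:, 1] - true_df) ** 2))
print(f'RMSE (values): {rmse_f.item():.3f}   RMSE (derivatives): {rmse_df.item():.3f}')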


