This example shows how to use an AdditiveGridInducingVariationalGP
module. This classification module is designed for when the function you’re modeling has an additive decomposition over dimensions. This is equivalent to using a covariance function that additively decomposes over dimensions:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{d} k\left([\mathbf{x}]_{i}, [\mathbf{x}']_{i}\right),$$
where $[\mathbf{x}]_{i}$ denotes the $i$th component of the vector $\mathbf{x}$. Example applications of this include Bayesian optimization and deep kernel learning.
The use of inducing points allows the model to scale to large training sets by making the computational complexity linear, rather than cubic, in the number of data points.
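As an aside (not part of the original notebook), the additive covariance above can also be written out by hand in GPyTorch by summing one-dimensional kernels that each act on a single input column via active_dims. The sketch below is illustrative only; the model in this notebook imposes the decomposition through its variational strategy instead.
# Minimal sketch of the additive covariance: a sum of 1D RBF kernels,
# each restricted to one input dimension via `active_dims`.
import torch
import gpytorch

k1 = gpytorch.kernels.RBFKernel(active_dims=torch.tensor([0]))  # k([x]_1, [x']_1)
k2 = gpytorch.kernels.RBFKernel(active_dims=torch.tensor([1]))  # k([x]_2, [x']_2)
additive_covar = k1 + k2  # kernel addition yields the additive covariance

x = torch.randn(5, 2)
# .evaluate() materializes the lazy kernel matrix in the GPyTorch version
# used in this notebook (newer releases use .to_dense())
print(additive_covar(x).evaluate().shape)  # torch.Size([5, 5])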
In this example, we’re performing classification on a two-dimensional toy dataset whose labels are:
$$y(\mathbf{x}) = \begin{cases} 1 & \text{if } |x_1| < 0.5 \text{ and } |x_2| < 0.5 \\ -1 & \text{otherwise} \end{cases}$$
The above function doesn't have an obvious additive decomposition, but it turns out that it can be very well approximated by the additive kernel anyway.
In [1]:
# High-level imports
import math
from math import exp
import torch
import gpytorch
from matplotlib import pyplot as plt
# Make inline plots
%matplotlib inline
In [2]:
n = 101
train_x = torch.zeros(n ** 2, 2)
train_x[:, 0].copy_(torch.linspace(-1, 1, n).repeat(n))
train_x[:, 1].copy_(torch.linspace(-1, 1, n).unsqueeze(1).repeat(1, n).view(-1))
train_y = (train_x[:, 0].abs().lt(0.5)).float() * (train_x[:, 1].abs().lt(0.5)).float() * 2 - 1
train_x = train_x.cuda()
train_y = train_y.cuda()
In contrast to the most basic classification models, this model uses an AdditiveGridInterpolationVariationalStrategy. This causes two key changes in the model. First, the model now specifically assumes that the input to forward, x, is to be additively decomposed. Thus, although the model below defines an RBFKernel as the covariance function, because we extend this base class, the additive decomposition discussed above will be imposed.
Second, this model automatically assumes we will be using scalable kernel interpolation (SKI) for each dimension. Because of the additive decomposition, we only provide one set of grid bounds to the base class constructor, as the same grid will be used for all dimensions. It is recommended that you scale your training and test data appropriately.
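For instance, a minimal sketch of such scaling (the scale_to_bounds helper below is ours, not part of this notebook) could min-max scale each input dimension into the grid bounds. In practice you would compute the minimum and maximum on the training set and reuse them for the test set.
# Hypothetical helper (not from the notebook): min-max scale each column of x
# into the grid bounds so that SKI interpolation stays inside the grid.
def scale_to_bounds(x, lower=-1.0, upper=1.0):
    x_min = x.min(dim=0, keepdim=True)[0]
    x_max = x.max(dim=0, keepdim=True)[0]
    x_unit = (x - x_min) / (x_max - x_min)   # each column now in [0, 1]
    return x_unit * (upper - lower) + lower  # each column now in [lower, upper]

# Example usage (the toy data in this notebook is already in [-1, 1]):
# train_x = scale_to_bounds(train_x)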
In [3]:
from gpytorch.models import AbstractVariationalGP
from gpytorch.variational import AdditiveGridInterpolationVariationalStrategy, CholeskyVariationalDistribution
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.likelihoods import BernoulliLikelihood
from gpytorch.means import ConstantMean
from gpytorch.distributions import MultivariateNormal
class GPClassificationModel(AbstractVariationalGP):
    def __init__(self, grid_size=128, grid_bounds=([-1, 1],)):
        # One variational distribution per input dimension (batch_size=2)
        variational_distribution = CholeskyVariationalDistribution(num_inducing_points=grid_size, batch_size=2)
        # The additive strategy reuses the same grid and bounds for each of the two dimensions
        variational_strategy = AdditiveGridInterpolationVariationalStrategy(
            self,
            grid_size=grid_size,
            grid_bounds=grid_bounds,
            num_dim=2,
            variational_distribution=variational_distribution,
        )
        super(GPClassificationModel, self).__init__(variational_strategy)
        self.mean_module = ConstantMean()
        self.covar_module = ScaleKernel(RBFKernel(ard_num_dims=1))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        latent_pred = MultivariateNormal(mean_x, covar_x)
        return latent_pred
# Cuda the model and likelihood function
model = GPClassificationModel().cuda()
likelihood = gpytorch.likelihoods.BernoulliLikelihood().cuda()
Once the model has been defined, the training loop looks very similar to other variational models we've seen in the past. We will optimize the variational lower bound as our objective function. In this case, although variational inference in GPyTorch supports stochastic gradient descent, we choose to do batch optimization due to the relatively small toy dataset.
For an example of using the AdditiveGridInducingVariationalGP model with stochastic gradient descent, see the dkl_mnist example.
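As a rough sketch (not a reproduction of the dkl_mnist example), a minibatch version of the loop below could wrap the data in a standard PyTorch DataLoader; num_data still needs to be the size of the full training set so the ELBO is scaled correctly.
# Sketch of minibatch (SGD-style) training; batch_size and the epoch count
# are illustrative choices, not values from the original notebook.
from torch.utils.data import TensorDataset, DataLoader

train_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=1024, shuffle=True)
sgd_mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())
sgd_optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
likelihood.train()
for epoch in range(10):
    for x_batch, y_batch in train_loader:
        sgd_optimizer.zero_grad()
        loss = -sgd_mll(model(x_batch), y_batch)
        loss.backward()
        sgd_optimizer.step()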
In [4]:
# Find optimal model hyperparameters
model.train()
likelihood.train()
# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# "Loss" for GPs - the marginal log likelihood
# num_data refers to the total number of training data points
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())
# Training function
def train(num_iter=200):
    for i in range(num_iter):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        print('Iter %d/%d - Loss: %.3f' % (i + 1, num_iter, loss.item()))
        optimizer.step()
%time train()
In [5]:
# Switch the model and likelihood into evaluation mode
model.eval()
likelihood.eval()
# Start the plot, 4x3in
f, ax = plt.subplots(1, 1, figsize=(4, 3))
n = 150
test_x = torch.zeros(n ** 2, 2)
test_x[:, 0].copy_(torch.linspace(-1, 1, n).repeat(n))
test_x[:, 1].copy_(torch.linspace(-1, 1, n).unsqueeze(1).repeat(1, n).view(-1))
# Cuda variable of test data
test_x = test_x.cuda()
with torch.no_grad():
    predictions = likelihood(model(test_x))

# prob < 0.5 --> label -1 // prob > 0.5 --> label 1
pred_labels = predictions.mean.ge(0.5).float().mul(2).sub(1).cpu()

# Colors: yellow for label 1, red for label -1
color = []
for i in range(len(pred_labels)):
    if pred_labels[i] == 1:
        color.append('y')
    else:
        color.append('r')

# Plot the predicted labels as a scatter plot
ax.scatter(test_x[:, 0].cpu(), test_x[:, 1].cpu(), color=color, s=1)
Out[5]:
In [ ]: