Deep Kernel Learning (DenseNet + GP) on CIFAR10/100

In this notebook, we'll demonstrate the steps necessary to train a medium sized DenseNet (https://arxiv.org/abs/1608.06993) on either of two popularly used benchmark dataset in computer vision (CIFAR10 and CIFAR100). We'll be training the DKL model entirely end to end using the standard 300 Epoch training schedule and SGD.

This notebook is largely for tutorial purposes. If your goal is just to get (for example) a trained DKL + CIFAR100 model, we recommend that you move this code to a simple python script and run that, rather than training directly out of a python notebook. We find that training is just a bit faster out of a python notebook. We also of course recommend that you increase the size of the DenseNet used to a full sized model if you would like to achieve state of the art performance.

Furthermore, because this notebook involves training an actually reasonably large neural network, it is strongly recommended that you have a decent GPU available for this, as with all large deep learning models.


In [1]:
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import MultiStepLR
import torch.nn.functional as F
from torch import nn
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms
import gpytorch
import math

Set up data augmentation

The first thing we'll do is set up some data augmentation transformations to use during training, as well as some basic normalization to use during both training and testing. We'll use random crops and flips to train the model, and do basic normalization at both training time and test time. To accomplish these transformations, we use standard torchvision transforms.


In [2]:
normalize = transforms.Normalize(mean=[0.5071, 0.4867, 0.4408], std=[0.2675, 0.2565, 0.2761])
aug_trans = [transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()]
common_trans = [transforms.ToTensor(), normalize]
train_compose = transforms.Compose(aug_trans + common_trans)
test_compose = transforms.Compose(common_trans)

Create DataLoaders

Next, we create dataloaders for the selected dataset using the built in torchvision datasets. The cell below will download either the cifar10 or cifar100 dataset, depending on which choice is made. The default here is cifar10, however training is just as fast on either dataset.

After downloading the datasets, we create standard torch.utils.data.DataLoaders for each dataset that we will be using to get minibatches of augmented data.


In [3]:
dataset = 'cifar10'

if dataset == 'cifar10':
    d_func = dset.CIFAR10
    train_set = dset.CIFAR10('data', train=True, transform=train_compose, download=True)
    test_set = dset.CIFAR10('data', train=False, transform=test_compose)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)
    num_classes = 10
elif dataset == 'cifar100':
    d_func = dset.CIFAR100
    train_set = dset.CIFAR100('data', train=True, transform=train_compose, download=True)
    test_set = dset.CIFAR100('data', train=False, transform=test_compose)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)
    num_classes = 100
else:
    raise RuntimeError('dataset must be one of "cifar100" or "cifar10"')


Files already downloaded and verified

Creating the DenseNet Model

With the data loaded, we can move on to defining our DKL model. A DKL model consists of three components: the neural network, the Gaussian process layer used after the neural network, and the Softmax likelihood.

The first step is defining the neural network architecture. To do this, we use a slightly modified version of the DenseNet available in the standard PyTorch package. Specifically, we modify it to remove the softmax layer, since we'll only be needing the final features extracted from the neural network.


In [4]:
from densenet import DenseNet

class DenseNetFeatureExtractor(DenseNet):
    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        out = F.avg_pool2d(out, kernel_size=self.avgpool_size).view(features.size(0), -1)
        return out

feature_extractor = DenseNetFeatureExtractor(block_config=(6, 6, 6), num_classes=num_classes).cuda()
num_features = feature_extractor.classifier.in_features

Creating the GP Layer

In the next cell, we create the layer of Gaussian process models that are called after the neural network. In this case, we'll be using one GP per feature, as in the SV-DKL paper. The outputs of these Gaussian processes will the be mixed in the softmax likelihood.


In [5]:
class GaussianProcessLayer(gpytorch.models.AdditiveGridInducingVariationalGP):
    def __init__(self, num_dim, grid_bounds=(-10., 10.), grid_size=64):
        super(GaussianProcessLayer, self).__init__(grid_size=grid_size, grid_bounds=[grid_bounds],
                                                   num_dim=num_dim, mixing_params=False, sum_output=False)
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(
                lengthscale_prior=gpytorch.priors.SmoothedBoxPrior(
                    math.exp(-1), math.exp(1), sigma=0.1, transform=torch.exp
                )
            )
        )
        self.mean_module = gpytorch.means.ConstantMean()
        self.grid_bounds = grid_bounds

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

Creating the DKL Model

With both the DenseNet feature extractor and GP layer defined, we can put them together in a single module that simply calls one and then the other, much like building any Sequential neural network in PyTorch. This completes defining our DKL model.


In [6]:
class DKLModel(gpytorch.Module):
    def __init__(self, feature_extractor, num_dim, grid_bounds=(-10., 10.)):
        super(DKLModel, self).__init__()
        self.feature_extractor = feature_extractor
        self.gp_layer = GaussianProcessLayer(num_dim=num_dim, grid_bounds=grid_bounds)
        self.grid_bounds = grid_bounds
        self.num_dim = num_dim

    def forward(self, x):
        features = self.feature_extractor(x)
        features = gpytorch.utils.grid.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
        res = self.gp_layer(features)
        return res

model = DKLModel(feature_extractor, num_dim=num_features).cuda()
likelihood = gpytorch.likelihoods.SoftmaxLikelihood(num_features=model.num_dim, n_classes=num_classes).cuda()

Defining Training and Testing Code

Next, we define the basic optimization loop and testing code. This code is entirely analogous to the standard PyTorch training loop. We create a torch.optim.SGD optimizer with the parameters of the neural network on which we apply the standard amount of weight decay suggested from the paper, the parameters of the Gaussian process (from which we omit weight decay, as L2 regualrization on top of variational inference is not necessary), and the mixing parameters of the Softmax likelihood.

We use the standard learning rate schedule from the paper, where we decrease the learning rate by a factor of ten 50% of the way through training, and again at 75% of the way through training.


In [7]:
n_epochs = 300
lr = 0.1
optimizer = SGD([
    {'params': model.feature_extractor.parameters()},
    {'params': model.gp_layer.hyperparameters(), 'lr': lr * 0.01},
    {'params': model.gp_layer.variational_parameters()},
    {'params': likelihood.parameters()},
], lr=lr, momentum=0.9, nesterov=True, weight_decay=0)
scheduler = MultiStepLR(optimizer, milestones=[0.5 * n_epochs, 0.75 * n_epochs], gamma=0.1)

def train(epoch):
    model.train()
    likelihood.train()
    
    mll = gpytorch.mlls.VariationalELBO(likelihood, model.gp_layer, num_data=len(train_loader.dataset))

    train_loss = 0.
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = -mll(output, target)
        loss.backward()
        optimizer.step()
        if (batch_idx + 1) % 25 == 0:
            print('Train Epoch: %d [%03d/%03d], Loss: %.6f' % (epoch, batch_idx + 1, len(train_loader), loss.item()))

def test():
    from torch.optim import SGD, Adam
    from torch.optim.lr_scheduler import MultiStepLR
    import torch.nn.functional as F
    from torch import nn
    import torch
    model.eval()
    likelihood.eval()

    correct = 0
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        with torch.no_grad():
            output = likelihood(model(data))
            pred = output.probs.argmax(1)
            correct += pred.eq(target.view_as(pred)).cpu().sum()
    print('Test set: Accuracy: {}/{} ({}%)'.format(
        correct, len(test_loader.dataset), 100. * correct / float(len(test_loader.dataset))
    ))

Train the Model

We are now ready to train the model. At the end of each Epoch we report the current test loss and accuracy, and we save a checkpoint model out to a file.


In [8]:
for epoch in range(1, n_epochs + 1):
    scheduler.step()
    with gpytorch.settings.use_toeplitz(False), gpytorch.settings.max_preconditioner_size(0):
        train(epoch)
        test()
    state_dict = model.state_dict()
    likelihood_state_dict = likelihood.state_dict()
    torch.save({'model': state_dict, 'likelihood': likelihood_state_dict}, 'dkl_cifar_checkpoint.dat')


tensor(47.1009, device='cuda:0', grad_fn=<NegBackward>)
tensor(26.8875, device='cuda:0', grad_fn=<NegBackward>)
tensor(8.7311, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.6159, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3649, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3383, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3309, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3293, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3326, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3310, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3319, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3324, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3339, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3123, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3348, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3320, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3342, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3375, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3378, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3076, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3295, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3431, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3343, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3319, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3401, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [025/196], Loss: 2.340129
tensor(2.3041, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3234, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3298, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3199, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3214, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2743, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.3012, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2692, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2446, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2281, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1919, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2081, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2151, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2046, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1341, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0657, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0500, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1542, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1805, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1384, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0945, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1233, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9728, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0628, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0723, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [050/196], Loss: 2.072335
tensor(2.1178, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9982, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8651, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0120, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9134, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0502, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.2028, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1629, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0794, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0876, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1774, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1260, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0042, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0408, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1549, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0885, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1347, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9506, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0196, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9888, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0983, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9950, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9773, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0446, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0111, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [075/196], Loss: 2.011076
tensor(1.9814, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8978, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8812, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9378, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9315, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9206, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9119, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0955, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9971, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9460, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9494, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9523, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9868, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8727, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.0696, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9163, device='cuda:0', grad_fn=<NegBackward>)
tensor(2.1109, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9777, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9008, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8858, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8459, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9929, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8997, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8077, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9749, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [100/196], Loss: 1.974880
tensor(1.9985, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9266, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8180, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8392, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8632, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9162, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9133, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7486, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9088, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8396, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7973, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8723, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8315, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8802, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8935, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7472, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8971, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9510, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7913, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7712, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8330, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8725, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7992, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8531, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7762, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [125/196], Loss: 1.776210
tensor(1.8108, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8968, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8432, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8733, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9007, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7433, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8644, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8635, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7391, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7522, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7129, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8130, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8068, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7275, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8611, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7851, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7062, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8262, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7434, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7824, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8547, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6733, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7022, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8125, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7844, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [150/196], Loss: 1.784407
tensor(1.7918, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8991, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7181, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6422, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7846, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7182, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8400, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7451, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6119, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7347, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6750, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8209, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7957, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7753, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8147, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7284, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6716, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6753, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8575, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7900, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6705, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6430, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6265, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8428, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7523, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 1 [175/196], Loss: 1.752253
tensor(1.6146, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7357, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.9351, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7774, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7121, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6053, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6259, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6550, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.8049, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6552, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5468, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5977, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7542, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6063, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7376, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7883, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6957, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6548, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7300, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6726, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6928, device='cuda:0', grad_fn=<NegBackward>)
Test set: Accuracy: 3506/10000 (35%)
tensor(1.6910, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7390, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7005, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6366, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7069, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6917, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7848, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6554, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7470, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6568, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6804, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6843, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6705, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5863, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6471, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6579, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6179, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5943, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5841, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6979, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6961, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7157, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.7617, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6654, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5626, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 2 [025/196], Loss: 1.562617
tensor(1.6991, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6000, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6070, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6046, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6720, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6086, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6933, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6127, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6088, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6291, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5365, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5950, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.4854, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6592, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6604, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5340, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6197, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5018, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5148, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5087, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5791, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5876, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5823, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5576, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.4929, device='cuda:0', grad_fn=<NegBackward>)
Train Epoch: 2 [050/196], Loss: 1.492854
tensor(1.5229, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5125, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5994, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5263, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5076, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5490, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.6787, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.4411, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5614, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5814, device='cuda:0', grad_fn=<NegBackward>)
tensor(1.5855, device='cuda:0', grad_fn=<NegBackward>)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-8-30679d80b21c> in <module>()
      2     scheduler.step()
      3     with gpytorch.settings.use_toeplitz(False), gpytorch.settings.max_preconditioner_size(0):
----> 4         train(epoch)
      5         test()
      6     state_dict = model.state_dict()

<ipython-input-7-7a12e899d2ef> in train(epoch)
     19         data, target = data.cuda(), target.cuda()
     20         optimizer.zero_grad()
---> 21         output = model(data)
     22         loss = -mll(output, target)
     23 #         import pdb

~/gpytorch/gpytorch/module.py in __call__(self, *inputs, **kwargs)
    179 
    180     def __call__(self, *inputs, **kwargs):
--> 181         outputs = self.forward(*inputs, **kwargs)
    182 
    183         if isinstance(outputs, tuple):

<ipython-input-6-1d29ce6ae3ca> in forward(self, x)
      9     def forward(self, x):
     10         features = self.feature_extractor(x)
---> 11         features = gpytorch.utils.grid.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
     12         res = self.gp_layer(features)
     13         return res

~/gpytorch/gpytorch/utils/grid.py in scale_to_bounds(x, lower_bound, upper_bound)
     22     """
     23     # Scale features so they fit inside grid bounds
---> 24     min_val = x.min()
     25     max_val = x.max()
     26     diff = max_val - min_val

KeyboardInterrupt: 

In [9]:
%debug


> /home/jrg365/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py(90)backward()
     88     Variable._execution_engine.run_backward(
     89         tensors, grad_tensors, retain_graph, create_graph,
---> 90         allow_unreachable=True)  # allow_unreachable flag
     91 
     92 

ipdb> up
> /home/jrg365/anaconda3/lib/python3.7/site-packages/torch/tensor.py(102)backward()
    100                 products. Defaults to ``False``.
    101         """
--> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    103 
    104     def register_hook(self, hook):

ipdb> up
> <ipython-input-7-7a12e899d2ef>(26)train()
     24 #         pdb.set_trace()
     25         print(loss)
---> 26         loss.backward()
     27         optimizer.step()
     28         if (batch_idx + 1) % 25 == 0:

ipdb> loss
*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch-nightly_1540726045570/work/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69
ipdb> q

In [ ]: