In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
You can get the data via:
wget http://pjreddie.com/media/files/cifar.tgz
In [2]:
from fastai.conv_learner import *
PATH = "data/cifar10/"
os.makedirs(PATH, exist_ok=True)
In [3]:
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
stats = (np.array([ 0.4914 , 0.48216, 0.44653]), np.array([ 0.24703, 0.24349, 0.26159]))
In [4]:
def get_data(sz,bs):
tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlip()], pad=sz//8)
return ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs)
In [5]:
bs=256
In [6]:
data = get_data(32, 4)
In [7]:
x,y = next(iter(data.trn_dl))
In [8]:
plt.imshow(data.trn_ds.denorm(x)[0]);
In [9]:
plt.imshow(data.trn_ds.denorm(x)[3]);
In [10]:
data = get_data(32,bs)
In [11]:
lr=1e-2
From this notebook by K.Turgutlu.
In [12]:
class SimpleNet(nn.Module):
def __init__(self, layers):
super().__init__()
self.layers = nn.ModuleList([
nn.Linear(layers[i], layers[i+1]) for i in range(len(layers) - 1)])
def forward(self, x):
x = x.view(x.size(0), -1)
for λ in self.layers:
λ_x = λ(x)
x = F.relu(λ_x)
return F.log_softmax(λ_x, dim=-1)
In [13]:
learn = ConvLearner.from_model_data(SimpleNet([32*32*3, 40, 10]), data)
In [14]:
learn, [o.numel() for o in learn.model.parameters()]
Out[14]:
In [35]:
[o for o in learn.model.parameters()]
Out[35]:
In [15]:
learn.summary()
Out[15]:
In [16]:
learn.lr_find()
learn.sched.plot()
In [17]:
%time learn.fit(lr,2)
In [18]:
%time learn.fit(lr, 2, cycle_len=1)
The goal is to replicate the basic architecture of a ResNet. The simple model above gets an accuracy of around 47%, with 120,000 parameters. Not great. We're definitely not using our parameters very well -- they're treating each pixel with a different weight.
Instead we'll find groups of 3x3 pixels with particular patterns, using a ConvNet.
The first step is to replace our FullNet model with a ConvNet model.
In [22]:
class ConvNet(nn.Module):
def __init__(self, layers, c):
super().__init__()
self.layers = nn.ModuleList([
nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2)
for i in range(len(layers) - 1)])
self.pool = nn.AdaptiveMaxPool2d(1)
self.out = nn.Linear(layers[-1], c)
def forward(self, x):
for λ in self.layers: x = F.relu(λ(x))
x = self.pool(x)
x = x.view(x.size(0), -1)
return F.log_softmax(self.out(x), dim=-1)
nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2): the first two parameters are exactly the same as for nn.Linear: num_features_in, num_features_out.
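As a quick illustration (a sketch, not from the notebook; the 20 and 40 sizes are just the ones used in this model), you can see that the first two arguments play the same role in both layer types by looking at the weight shapes:

import torch.nn as nn

lin  = nn.Linear(20, 40)
conv = nn.Conv2d(20, 40, kernel_size=3, stride=2)
print(lin.weight.shape)    # torch.Size([40, 20])
print(conv.weight.shape)   # torch.Size([40, 20, 3, 3]): same in/out, plus the 3x3 kernel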
In [23]:
learn = ConvLearner.from_model_data(ConvNet([3, 20, 40, 80], 10), data)
learn = ConvLearner.from_model_data(ConvNet([3, 20, 40, 80], 10), data): 3 channels come in; the 1st layer comes out with 20 filters, the 2nd with 40, the 3rd with 80.
In [24]:
learn.summary()
Out[24]:
To turn the output of the ConvNet into a prediction of one of ten classes, we use Adaptive Max Pooling, which is now standard in state-of-the-art architectures. A max pool is done on the very last conv layer, but instead of specifying a 2x2 or 3x3 or X-by-X pooling area, with adaptive max pooling we don't tell the algorithm how big an area to pool; we tell it how big a resolution to create.
So doing a 16x16 adaptive max pool on a 32x32 input (CIFAR-10 images, in this case) is the same as a 2x2 max pool, and a 2x2 adaptive max pool would be the same as a 16x16 max pool on a 32x32 image.
What we pretty much always do in modern CNNs is make the penultimate layer a 1x1 Adaptive Max Pool. ie: find the single largest cell and use that as our new activation.
That gives us a 1x1xNum_Features Tensor that we can send into our FullNet.
Then we do:
x = x.view(x.size(0), -1)
which returns a matrix of Mini_Batch x Num_Features.
We can feed that into a linear layer with as many outputs as we have classes.
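A minimal shape-only sketch of that pooling -> flatten -> linear step (the batch size of 64 and the 80 features are just illustrative numbers, roughly matching the last conv layer above):

import torch, torch.nn as nn, torch.nn.functional as F

x = torch.randn(64, 80, 4, 4)        # (mini_batch, num_features, h, w) from the last conv layer
x = F.adaptive_max_pool2d(x, 1)      # -> (64, 80, 1, 1): the single largest cell per feature map
x = x.view(x.size(0), -1)            # -> (64, 80): mini_batch x num_features
out = nn.Linear(80, 10)(x)           # -> (64, 10): one score per class
print(out.shape)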
In [25]:
learn.lr_find(end_lr=100)
In [26]:
learn.sched.plot()
In [27]:
%time learn.fit(1e-1, 2)
In [28]:
%time learn.fit(1e-1, 4, cycle_len=1)
We have around 30,000 parameters in the ConvNet, about a quarter of what the simple FullNet had, and our accuracy is around 57%, up from 47%.
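To check that figure, you can sum the per-tensor counts that learn.model.parameters() reports (the same idea as the [o.numel() for o in ...] cell earlier):

total_params = sum(o.numel() for o in learn.model.parameters())
print(total_params)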
We're going to refactor the ConvNet slightly so that we put less stuff in the forward pass. For instance, calling relu explicitly on every loop iteration isn't ideal. So we'll create a new class called ConvLayer, which contains a convolution with a kernel size of 3, a stride of 2, and padding. Padding becomes especially important when you're down to small activation sizes in later convolutions, where throwing away a potential convolution around the edge would lose a significant amount of information.
The relu lives inside the ConvLayer class, making it easier to edit and helping prevent bugs.
In [30]:
class ConvLayer(nn.Module):
def __init__(self, ni, nf):
super().__init__()
self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)
def forward(self, x): return F.relu(self.conv(x))
In [32]:
class ConvNet2(nn.Module):
def __init__(self, layers, c):
super().__init__()
self.layers = nn.ModuleList([ConvLayer(layers[i], layers[i + 1])
for i in range(len(layers) - 1)])
self.out = nn.Linear(layers[-1], c)
def forward(self, x):
for λ in self.layers: x = λ(x)
x = F.adaptive_max_pool2d(x, 1) # F is torch.nn.functional
x = x.view(x.size(0), -1)
return F.log_softmax(self.out(x), dim=-1)
What's awesome about PyTorch is that a Layer definition and a Neural Network definition are literally identical. They both have a Constructor, and a Forward. Any time you have a layer, you can use it as a neural net, and vice versa.
Also, since AMP has no state (no weights), we don't need to have it as an attribute as in the ConvNet class up above, we can instead call it as a function in the Forward method of ConvNet2.
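A tiny demonstration of both points (a sketch; the 1x3x8x8 input is arbitrary, and ConvLayer is the class defined just above):

import torch, torch.nn as nn, torch.nn.functional as F

layer_as_net = ConvLayer(3, 8)     # a "layer" used directly as a standalone model
t = torch.randn(1, 3, 8, 8)
a = layer_as_net(t)                # works like any nn.Module
# the stateless pooling module and the plain function give the same result:
assert torch.equal(nn.AdaptiveMaxPool2d(1)(a), F.adaptive_max_pool2d(a, 1))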
In [35]:
learn = ConvLearner.from_model_data(ConvNet2([3, 20, 40, 80], 10), data)
In [36]:
learn.summary()
Out[36]:
In [37]:
%time learn.fit(1e-1, 2)
In [ ]:
%time learn.fit(1e-1, 2, cycle_len=1)
An issue above is that we're having trouble training the ConvNet as we add more layers. If we use larger learning rates we get NaNs (infinities), and smaller learning rates take forever and don't get a chance to explore properly. So the model isn't resilient.
To make the model more resilient, we'll use Batch Normalization. BatchNorm is a couple of years old now (as of 2018), and makes it much easier to train deep networks.
The network we're going to create will have more layers: 5 conv layers and 1 fully connected layer. Back in the day that would have been considered a pretty deep network and would have been hard to train. It's very simple now, thanks to batch norm.
Batch norm can be used by calling the built-in nn.BatchNorm layers (nn.BatchNorm2d for conv activations), but we'll write it from scratch here to learn how it works.
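For reference, here's a minimal sketch of how the built-in layer is normally used (the 3 -> 10 sizes are just illustrative); the BnLayer below is our simplified hand-rolled version of the same idea:

import torch.nn as nn

builtin_block = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(10),   # learnable per-channel scale & shift, plus running statistics
    nn.ReLU(),
)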
In [38]:
class BnLayer(nn.Module):
def __init__(self, ni, nf, stride=2, kernel_size=3):
super().__init__()
self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, stride=stride,
bias=False, padding=1)
self.a = nn.Parameter(torch.zeros(nf,1,1))
self.m = nn.Parameter(torch.ones(nf, 1,1))
def forward(self, x):
x = F.relu(self.conv(x))
x_chan = x.transpose(0,1).contiguous().view(x.size(1), -1)
if self.training: # true for training set. false for val set.
self.means = x_chan.mean(1)[:, None, None]
self.stds = x_chan.std(1)[:, None, None]
return (x-self.means) / self.stds * self.m + self.a
This normalizes our input automatically, per channel (and for later layers, per filter). But that on its own isn't enough, because SGD is a bloody-minded soldier: it will keep changing the activations back to what it thinks they should be on each minibatch, BatchNorm be damned.
In fact, the normalization on its own, (x - self.means) / self.stds, does effectively nothing, because SGD just undoes it on the next minibatch.
So what we do is create a new multiplier and a new adder for each channel: self.m and self.a. The adder starts as nf zeros, self.a = nn.Parameter(torch.zeros(nf,1,1)), and the multiplier starts as nf ones, self.m = nn.Parameter(torch.ones(nf,1,1)), where nf is the number of filters in that layer.
We set those to be parameters. By wrapping them in nn.Parameter(..) we tell PyTorch it's allowed to learn them as weights. So initially we subtract the means, divide by the standard deviations, multiply by ones, and add zeros: nothing much happens.
Now, though, when SGD wants to scale the layer up, it doesn't have to scale up every value in the weight matrix: it can just scale up the nf numbers in self.m, the multiplier. Likewise, if it wants to shift the activations up or down a bit, it doesn't have to shift the entire weight matrix: it can just shift the nf numbers in self.a, the adder.
We're normalizing the data, then saying: you can shift and scale it using far fewer parameters than would have been necessary if you had to shift and scale the entire set of conv filters.
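A quick sanity check of the normalization step in BnLayer (a sketch using random numbers standing in for a minibatch of activations; the 64x20x16x16 shape is arbitrary):

import torch

x = torch.randn(64, 20, 16, 16) * 3 + 5                       # batch x channels x h x w, deliberately not normalized
x_chan = x.transpose(0, 1).contiguous().view(x.size(1), -1)   # one row per channel, as in BnLayer
means = x_chan.mean(1)[:, None, None]
stds  = x_chan.std(1)[:, None, None]
x_norm = (x - means) / stds
print(x_norm.mean().item(), x_norm.std().item())              # roughly 0 and 1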
In practice this allows us to increase our learning rates, makes training more resilient, and lets us add more layers.
In this case, using BnLayer instead of the original ConvLayer lets us add more layers (the 80 and 160 below in the learner) and still train the model effectively.
Another great thing BatchNorm does is regularize the network. In other words, you can decrease or remove dropout and weight decay. The reason is that each minibatch has a different mean and standard deviation, so they keep changing; this keeps changing the meaning of the filters, which has a regularizing effect because it's noise. When you add noise of any kind, it regularizes your model.
In [42]:
class ConvBnNet(nn.Module):
def __init__(self, layers, c):
super().__init__()
self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
self.layers = nn.ModuleList([BnLayer(layers[i], layers[i + 1])
for i in range(len(layers) - 1)])
self.out = nn.Linear(layers[-1], c)
def forward(self, x):
x = self.conv1(x)
for λ in self.layers: x = λ(x)
x = F.adaptive_max_pool2d(x, 1)
x = x.view(x.size(0), -1)
return F.log_softmax(self.out(x), dim=-1)
NOTE: this is a simplified version of BatchNorm! In real BatchNorm, instead of using just the current minibatch's mean & stddev, you keep an exponentially weighted moving average of the mean & stddev, and use that at inference time.
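Roughly, the running statistics work like this (a sketch of the idea only, not the notebook's code; the 0.1 momentum and the 20 channels are assumptions for illustration):

import torch

momentum = 0.1
running_mean = torch.zeros(20)     # one running value per channel
batch_mean = torch.randn(20)       # stand-in for the current minibatch's per-channel means
running_mean = (1 - momentum) * running_mean + momentum * batch_mean
# the variance gets the same treatment; at inference time (self.training == False)
# you normalize with the running statistics instead of the minibatch ones.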
A change in ConvBnNet -- in line with modern approaches -- is the addition of a single conv layer at the beginning with a kernel size of 5 and a stride of 1. The reason is to make sure the first layer has a richer input, sampling from a larger area. This first layer outputs 10 feature maps from its 5x5 filters. The padding is (kernel_size - 1) / 2 = 2.
In [43]:
learn = ConvLearner.from_model_data(ConvBnNet([10, 20, 40, 80, 160], 10), data)
In [44]:
learn.summary()
Out[44]:
In [45]:
%time learn.fit(3e-2, 2)
In [46]:
%time learn.fit(1e-1, 4, cycle_len=1)
In [48]:
# quick aside: zip pairs up two lists element by element; the same trick is used
# below to interleave the stride-2 and stride-1 layers
t1 = [chr(ord('a')+i) for i in range(10)]
t2 = [chr(ord('ა')+i) for i in range(10)]
for a,b in zip(t1, t2):
    print(a)
    print(b)
Take a look at the accuracy rise (note val-loss < trn-loss, signaling no overfitting yet): 47% and 57% before, now up to 70%. Woo! Okay, personal note: THIS IS SO MUCH EASIER THAN I IMAGINED.
So, given that this is looking so good, an obvious thing to try is increasing the depth of the model. We can't just add more of our stride-2 layers, because they halve the activation size each time (we're down to 2x2 by the end), so instead we create a stride-1 layer (no size change) for each stride-2 layer and zip the stride-2 and stride-1 layers together (stride-2 first), which gives us alternating stride-2 / stride-1 layers.
This, however, doesn't help, because the model is now too deep for even batch norm to handle on its own (12 layers: the starting conv, 10 stride-2/stride-1 convs, and 1 linear).
In [54]:
class ConvBnNet2(nn.Module):
def __init__(self, layers, c):
super().__init__()
self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1])
for i in range(len(layers) - 1)])
self.layers2 = nn.ModuleList([BnLayer(layers[i+1], layers[i+1], 1)
for i in range(len(layers) - 1)])
self.out = nn.Linear(layers[-1], c)
def forward(self, x):
x = self.conv1(x)
for λ, λ2 in zip(self.layers, self.layers2):
x = λ(x)
x = λ2(x)
x = F.adaptive_max_pool2d(x, 1)
x = x.view(x.size(0), -1)
return F.log_softmax(self.out(x), dim=-1)
In [55]:
learn = ConvLearner.from_model_data(ConvBnNet2([10,20,40,80,160],10), data)
In [56]:
%time learn.fit(1e-2, 2)
In [57]:
%time learn.fit(1e-2, 2, cycle_len=1)
Notice making the model deeper hasn't helped. It's possible to train a standard ConvNet 12 layers deep, but it's hard to do properly. Instead we're going to replace the ConvNet with a ResNet.
The ResnetLayer class is going to inherit from BnLayer and replace our forward with return x + super().forward(x). And that's it. Everything else is going to be identical -- except that we're now going to make the network 4x deeper.
In [58]:
class ResnetLayer(BnLayer):
def forward(self, x): return x + super().forward(x)
In [59]:
class Resnet(nn.Module):
def __init__(self, layers, c):
super().__init__()
self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1])
for i in range(len(layers) - 1)])
self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
for i in range(len(layers) - 1)])
self.layers3 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
for i in range(len(layers) - 1)])
self.out = nn.Linear(layers[-1], c)
def forward(self, x):
x = self.conv1(x)
for λ,λ2,λ3 in zip(self.layers, self.layers2, self.layers3):
x = λ3(λ2(λ(x))) # function of a function of a function
x = F.adaptive_max_pool2d(x, 1)
x = x.view(x.size(0), -1)
return F.log_softmax(self.out(x), dim=-1)
And now this model is going to train beautifully, just because of that one line: def forward(self, x): return x + super().forward(x). Why is that?
This is called a ResNet block. It says the layer's prediction is $$y = x + f(x)$$ where, in this case, the function f is a convolution. Rearranging, that's $$f(x) = y - x$$ where x is the input to the layer (the output of the previous layer) and y is what the layer is trying to predict. So the layer is fitting a function f to the difference between y and x. That difference is the residual.
If y is what I'm trying to calculate, and x is the thing I've most recently calculated (the input to the current layer), then the difference between the two is essentially the error in what I've calculated so far. So this is saying: attempt to find a set of convolutional weights that fills in the amount I was off by.
Lecture at ~ 1:55:00
In [60]:
learn = ConvLearner.from_model_data(Resnet([10,20,40,80,160], 10), data)
In [61]:
wd=1e-5
In [62]:
%time learn.fit(1e-2, 2, wds=wd)
In [63]:
%time learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2, wds=wd)
In [64]:
%time learn.fit(1e-2, 8, cycle_len=4, wds=wd)
The idea is: if we have some inputs coming in and a function trying to predict the error, then we add on another prediction of the error at that new stage, and so on, each time zooming in closer and closer to the correct answer. In other words: we've gotten to a certain point but there's still an error, a residual, so let's create a model that predicts that residual and add it onto our previous model, then another model that predicts the remaining residual and adds that on, and so on. If you keep doing that over and over, you should get closer and closer to the answer.
This is based on the theory of boosting. By specifying return x + super().forward(x) as the thing we're trying to calculate, we get a kind of boosting for free.
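One way to see the connection: unrolling x = x + f(x) across a stack of residual layers (writing $x_0$ for the block's input, $x_i$ for the activations after the i-th residual layer, and $f_i$ for its convolution) gives
$$x_N = x_0 + f_1(x_0) + f_2(x_1) + \dots + f_N(x_{N-1})$$
so each layer only has to model whatever error is left after all the previous corrections -- which is exactly the boosting picture.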
Note that here, only one convolution is done in the ResNet block. Actual standard ResNet blocks use two convolutions before adding back onto the input.
Note also that the first layer in every group is a plain BnLayer with a stride of two, not a ResnetLayer. This acts as a bottleneck layer; from time to time we change the geometry in a ResNet model. Actual ResNets don't use a standard conv layer for bottlenecks, but that'll be covered in Part 2 of this course.
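For comparison, a sketch of what a two-convolution residual block could look like (FullResBlock is a hypothetical name, not the notebook's ResnetLayer; batch norm is omitted for brevity and stride is kept at 1 so the shortcut can be added directly):

import torch.nn as nn, torch.nn.functional as F

class FullResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, nf, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(nf, nf, kernel_size=3, stride=1, padding=1)
    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))   # two convs, then add the input back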
Still, this simplified ResNet gets up to about 82% accuracy (though by that point it has started to overfit).
We can make the Resnet even bigger
In [68]:
class Resnet2(nn.Module):
def __init__(self, layers, c, p=0.5):
super().__init__()
self.conv1 = BnLayer(3, 16, stride=1, kernel_size=7)
self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1])
for i in range(len(layers) - 1)])
self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
for i in range(len(layers) - 1)])
self.layers3 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
for i in range(len(layers) - 1)])
self.out = nn.Linear(layers[-1], c)
self.drop = nn.Dropout(p) # dropout added
def forward(self, x):
x = self.conv1(x)
for λ,λ2,λ3 in zip(self.layers, self.layers2, self.layers3):
x = λ3(λ2(λ(x)))
x = F.adaptive_max_pool2d(x, 1)
x = x.view(x.size(0), -1)
x = self.drop(x)
return F.log_softmax(self.out(x), dim=-1)
Other than the minor simplification of ResNet, this is a reasonable approximation of a good starting point for a modern architecture.
In [69]:
# all sizes increased; 0.2 dropout
learn = ConvLearner.from_model_data(Resnet2([16, 32, 64, 128, 256], 10, 0.2), data)
In [70]:
wd=1e-6
In [71]:
%time learn.fit(1e-2, 2, wds=wd)
In [72]:
%time learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2, wds=wd)
In [73]:
%time learn.fit(1e-2, 8, cycle_len=4, wds=wd)
In [74]:
learn.save('tmp')
In [75]:
log_preds,y = learn.TTA()
In [76]:
preds = np.mean(np.exp(log_preds), 0)
In [77]:
metrics.log_loss(y,preds), accuracy(preds,y)
Out[77]:
This accuracy was pretty much state of the art for CIFAR-10 back in 2012. Today it's up to around 97%, but those implementations are all based on these same techniques; the gains come mostly from better approaches to data augmentation, regularization, ResNet tweaks, etc.
The ResNet idea can be applied to a lot of non-image data, but it had been largely ignored outside computer vision as of Dec 2017. The Transformer architecture in NLP is essentially a very simplified ResNet analog, and the first of its kind there.
The idea of skip connections is common in computer vision, though there's nothing about it that's unique to images.