2018/9/15-16 WNixalo
https://github.com/fastai/fastai_v1/blob/master/dev_nb/001a_nn_basics.ipynb
In [3]:
from pathlib import Path
import requests
In [4]:
data_path = Path('data')
path = data_path/'mnist'
In [5]:
path.mkdir(parents=True, exist_ok=True)
url = 'http://deeplearning.net/data/mnist/'
filename = 'mnist.pkl.gz'
In [8]:
(path/filename)
Out[8]:
In [10]:
if not (path/filename).exists():
    content = requests.get(url+filename).content
    (path/filename).open('wb').write(content)
In [9]:
import pickle, gzip
In [12]:
with gzip.open(path/filename, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
In [14]:
%matplotlib inline
In [15]:
from matplotlib import pyplot
import numpy as np
In [16]:
pyplot.imshow(x_train[0].reshape((28,28)), cmap="gray")
x_train.shape
Out[16]:
In [17]:
import torch
In [18]:
x_train,y_train,x_valid,y_valid = map(torch.tensor, (x_train,y_train,x_valid,y_valid))
n,c = x_train.shape
x_train, x_train.shape, y_train.min(), y_train.max()
Out[18]:
In [20]:
import math
In [21]:
weights = torch.rand(784, 10)/math.sqrt(784)
weights.requires_grad_()
bias = torch.zeros(10, requires_grad=True)
In [22]:
def log_softmax(x): return x - x.exp().sum(-1).log().unsqueeze(-1)
def model(xb): return log_softmax(xb @ weights + bias)
In [91]:
xb.shape, xb.sum(-1).shape
Out[91]:
The torch.Tensor.sum(dim) call takes an integer argument as the axis along which to sum; the same applies to NumPy arrays.
In this case xb.sum(-1) turns a 64x784 tensor into a size-64 tensor: each element is the total sum of its corresponding size-784 (28x28 flattened) image from the minibatch.
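A quick sketch of that (using a random stand-in called xb_demo, since xb itself is only defined a couple of cells down):

import torch

xb_demo = torch.rand(64, 784)                          # stand-in minibatch: 64 flattened 28x28 images
row_sums = xb_demo.sum(-1)                             # sum along the last axis (the 784 pixels)
print(row_sums.shape)                                  # torch.Size([64])
print(torch.allclose(row_sums[0], xb_demo[0].sum()))   # True: element 0 is image 0's pixel total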
In [23]:
bs = 64
xb = x_train[0:bs] # a mini-batch from x
preds = model(xb)
preds[0], preds.shape
Out[23]:
In [24]:
def nll(input, target): return -input[range(target.shape[0]), target].mean()
loss_func = nll
In [25]:
yb = y_train[0:bs]
loss_func(preds, yb)
Out[25]:
In [27]:
preds[0]
Out[27]:
In [42]:
((x_train[0:bs]@weights+bias) - (x_train[0:bs]@weights+bias).exp().sum(-1).log().unsqueeze(-1))[0]
Out[42]:
In [40]:
preds[0]
Out[40]:
In [43]:
nll(preds, yb)
Out[43]:
In [44]:
-preds[range(yb.shape[0]), yb].mean()
Out[44]:
In [45]:
type(preds)
Out[45]:
In [46]:
preds[range(0)]
Out[46]:
In [48]:
preds[0]
Out[48]:
In [49]:
preds[range(1)]
Out[49]:
In [50]:
preds[range(2)]
Out[50]:
In [53]:
preds[:2]
Out[53]:
In [55]:
type(preds)
Out[55]:
In [58]:
np.array([[range(10)]])[range(1)]
Out[58]:
In [59]:
A = np.array([[range(10)]])
In [65]:
A.shape
Out[65]:
In [64]:
A[range(2)]
In [67]:
A.shape
Out[67]:
In [71]:
len(A[0])
Out[71]:
In [73]:
A.shape[0]
Out[73]:
In [72]:
A[0]
Out[72]:
In [76]:
A[range(1)]
Out[76]:
In [77]:
xb.sum()
Out[77]:
In [81]:
xb.numpy().sum(-1)
Out[81]:
In [82]:
xb.sum(-1)
Out[82]:
torch.unsqueeze returns a tensor with a dimension of size 1 inserted at the specified position.
The returned tensor shares the same underlying data with this tensor.
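A quick check of both claims (the shape change and the shared storage) on a throwaway tensor:

import torch

t = torch.arange(4.)             # shape: torch.Size([4])
u = t.unsqueeze(-1)              # shape: torch.Size([4, 1]) -- a size-1 dim appended at the end
print(t.shape, u.shape)

u[0, 0] = 99.                    # the unsqueezed tensor is a view sharing storage...
print(t[0])                      # tensor(99.) -- ...so the original sees the change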
In [85]:
xb.sum(-1)
Out[85]:
In [94]:
xb[0].sum()
Out[94]:
Taking a look at what .unsqueeze does: what does the tensor look like right before unsqueeze is applied to it?
In [100]:
xb.exp().sum(-1).log()
Out[100]:
In [101]:
xb.exp().sum(-1).log()[0]
Out[101]:
Making sure I didn't need parentheses there.
In [109]:
(xb.exp().sum(-1).log())[0]
Out[109]:
In [111]:
xb.exp().sum(-1).log().unsqueeze(-1)[:10]
Out[111]:
In [115]:
np.array([i for i in range(10)]).shape
Out[115]:
In [114]:
torch.Tensor([i for i in range(10)]).shape
Out[114]:
In [116]:
xb.exp().sum(-1).log().unsqueeze(-1).numpy().shape
Out[116]:
Okay so .unsqueeze(-1) turns the size-64 tensor into a 64x1 tensor: each of the 64 values gets packaged into its own length-1 row ... or something like that, right?
In [118]:
xb.exp().sum(-1).log()[:10]
Out[118]:
The non-unsqueezed tensor doesn't look as 'nice'.. I guess. So it's packaged into a single column vector because we'll need that for the linear algebra we'll do to it later, yeah?
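My guess at what that "linear algebra later" actually is: broadcasting. A column of shape (n, 1) subtracts row-wise from an (n, m) matrix, which is exactly what the x - ...unsqueeze(-1) line relies on. A tiny sketch with made-up names and small shapes:

import torch

acts = torch.rand(3, 4)                 # pretend: 3 rows, 4 values each
lse  = acts.exp().sum(-1).log()         # shape (3,)
col  = lse.unsqueeze(-1)                # shape (3, 1)

out = acts - col                        # broadcasts: each row minus its own log-sum-exp
print(out.shape)                        # torch.Size([3, 4])
print(torch.allclose(out[0], acts[0] - lse[0]))   # True
# acts - lse (no unsqueeze) would raise a shape error here: (3, 4) vs (3,) don't broadcast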
In [125]:
preds.unsqueeze(-1).shape
Out[125]:
Oh this is cool. I was wondering how .unsqueeze worked for tensors with multiple items in multiple dimensions (i.e. not just a single row vector). Well this is what it does:
In [128]:
preds.unsqueeze(-1)[:2]
Out[128]:
So .unsqueeze turns our size 64x10 ... ohhhhhhhh I misread:
torch.unsqueeze returns a tensor with a dimension of size 1 inserted at the specified position.
doesn't mean it repackages the original tensor into a 1-dimensional tensor. I was wondering how it knew how long to make it (you'd have to just concatenate everything, but then in what order?).
No, a size-1 dimension is inserted where you tell it. So if it's an (X,Y) matrix, you go and give it a Z dimension, but that Z only contains the original (X,Y), i.e. the only thing added is a dimension.
Okay, interesting. Not exactly sure yet why we want 3 dimensions, but I kinda get it. Is it related to our data being 28x28x1? Wait isn't PyTorch's ordering N x [C x H x W] ? So it's unrelated then? Or useful for returning 64x784 to 64x28x28? I think that's not the case? Don't know.
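For the record, here's where the size-1 dimension lands for a few positions, on a small (2, 3) matrix:

import torch

m = torch.zeros(2, 3)
print(m.unsqueeze(0).shape)     # torch.Size([1, 2, 3]) -- new leading dim
print(m.unsqueeze(1).shape)     # torch.Size([2, 1, 3]) -- new middle dim
print(m.unsqueeze(-1).shape)    # torch.Size([2, 3, 1]) -- new trailing dim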
So what's up with the input[range(target.shape[0]), target] thing?:
In [129]:
# logsoftmax(xb)
ls_xb = log_softmax(xb)
In [140]:
log_softmax(xb@weights+bias)[0]
Out[140]:
In [138]:
(xb@weights).shape
Out[138]:
In [141]:
xb.shape
Out[141]:
In [142]:
(xb@weights).shape
Out[142]:
Oh this is where I was confused. I'm not throwing xb into Log Softmax. I'm throwing xb • w + bias. The shape going into the log softmax function is not 64x784, it's 64x10. Yeah that makes sense. Well, duh, it has to. Each value in the tensor is an activation for a class, for each image in the minibatch. So by the magic of machine learning, each activation encapsulates the effect of the weights and biases on that input element with respect to that class.
So that means the .unsqueeze operation isn't working on anything 784-wide: it gets the size-64 tensor of per-image sums over the 10 class activations, and gives back a 64x1 column.
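A shape-only sanity check of that flow (random stand-ins with the same shapes as the real tensors; xb_demo, w_demo, b_demo are made-up names):

import torch

xb_demo = torch.rand(64, 784)          # stand-in minibatch
w_demo  = torch.rand(784, 10)          # stand-in weights: 784 pixels -> 10 classes
b_demo  = torch.zeros(10)

logits   = xb_demo @ w_demo + b_demo                             # shape (64, 10)
logprobs = logits - logits.exp().sum(-1).log().unsqueeze(-1)     # still (64, 10)
print(logits.shape, logits.exp().sum(-1).unsqueeze(-1).shape, logprobs.shape)
# torch.Size([64, 10]) torch.Size([64, 1]) torch.Size([64, 10])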
In [147]:
# for reference:
xb = x_train[0:bs]
yb = y_train[0:bs]
def log_softmax(x): return x - x.exp().sum(-1).log().unsqueeze(-1)
def model(xb): return log_softmax(xb @ weights + bias)
preds = model(xb)
def nll(input, target): return -input[range(target.shape[0]), target].mean()
loss = nll(preds, yb)
In [148]:
loss
Out[148]:
Note the loss equals that in cell Out[25] above, as it should.
Back to teasing this apart by hand.
The minibatch:
In [155]:
xb, xb.shape
Out[155]:
The minibatch's activations as they head into the Log Softmax:
In [171]:
(xb @ weights + bias)[:2]
Out[171]:
In [172]:
(xb @ weights + bias).shape
Out[172]:
The minibatch activations after the Log Softmax and before heading into Negative Log Likelihood:
In [173]:
log_softmax(xb@weights+bias)[:2]
Out[173]:
In [174]:
log_softmax(xb@weights+bias).shape
Out[174]:
The loss value computed via NLL on the Log Softmax activations:
In [177]:
nll(log_softmax(xb@weights+bias), yb)
Out[177]:
Okay. Now questions. What is indexing input by [range(target.shape[0]), target] supposed to be doing? I established before that A[range(n)] is valid if n ≤ A.shape[0]. So what's going on is I'm range-indexing the 1st dimension of the LogSoftmax activations over the length of the target tensor, with the second dimension's indices being the... target tensor itself?
That means the index is this:
In [180]:
[range(yb.shape[0]), yb]
Out[180]:
Okay. What does it look like when I index a tensor – forget range-idx for now – with another tensor?
In [181]:
xb[yb]
Out[181]:
Okay..
In [192]:
xb.shape, yb.shape
Out[192]:
In [191]:
array_1 = np.array([[str(j)+str(i) for i in range(10)] for j in range(5)])
array_1
Out[191]:
In [195]:
array_2 = np.array([i for i in range(len(array_1[0]))])
array_2
Out[195]:
Uh, moment of truth:
In [196]:
array_1[range(array_2.shape[0]), array_2]
Of course. What happened? Is it... yes. I'm indexing with the wrong array: range(array_2.shape[0]) is range(10), but array_1 only has 5 rows. Also, in the real case no value in target is greater than the number of classes ... oh... oh ffs. Okay.
I range-index by the length of target's first dim to get the entire first dim of the LogSoftmax activations, and each row selected that way is itself indexed by the corresponding value of the target.
Less-shitty English: the activations are batch_size x num_classes, so there are num_classes values in each of batch_size vectors. For each of those vectors, pull out the value indexed by the corresponding index-value in the target tensor.
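A small sketch of that indexing pattern in isolation, on made-up "activations" where the picked values are easy to eyeball:

import torch

acts = torch.arange(12.).reshape(3, 4)            # 3 "images", 4 "classes": rows [0..3], [4..7], [8..11]
targets = torch.tensor([2, 0, 3])                 # the "correct" class for each row

picked = acts[range(targets.shape[0]), targets]   # row 0 col 2, row 1 col 0, row 2 col 3
print(picked)                                     # tensor([ 2.,  4., 11.])
print(-picked.mean())                             # nll of these "activations": -(2 + 4 + 11) / 3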
Oh I see. So just now I was confused that there was redundant work being done. Yeah, kinda. It's Linear Algebra. See, the weights and biases produce the entire output-activations tensor. Meaning: the dot-product & addition operation creates an activation (a score) for every class, for every image in the minibatch. Yeah that can be a lot; linalg exists in a block-like world & it's easy to get carried away (I think).
And that answers another question: the loss function here only cares about how wrong the correct class was. Looks like the incorrect classes are totally ignored (hence a bit of mental hesitation on my part: it looks like 90% of the information is being thrown away, and it is). Now, that's not what's going on when the Log Softmax is being computed. Gotta think about that a moment..
Could activations for non-target classes affect the target activations during the Log Softmax step, before they're discarded in the NLL?
xb - xb.exp().sum(-1).log().unsqueeze(-1)
is the magic line (xb is x in the definition).
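Just from that line, the answer should be yes: every class's output subtracts the same log-sum-exp, and that sum runs over all the classes, so nudging a non-target activation can move the target's output. A tiny sanity check before the bigger experiment below (numbers are made up):

import torch

def log_softmax(x): return x - x.exp().sum(-1).log().unsqueeze(-1)

a = torch.tensor([[1., 2., 3.]])
b = torch.tensor([[1., 2., 9.]])       # only a non-target class (index 2) changed

print(log_softmax(a)[0, 0])            # target class 0's log-softmax value...
print(log_softmax(b)[0, 0])            # ...moves anyway, via the shared sum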
In [197]:
# for reference (again):
xb = x_train[0:bs]
yb = y_train[0:bs]
def log_softmax(x): return x - x.exp().sum(-1).log().unsqueeze(-1)
def model(xb): return log_softmax(xb @ weights + bias)
preds = model(xb)
def nll(input, target): return -input[range(target.shape[0]), target].mean()
loss = nll(preds, yb)
When the activations are being computed, only the weights and biases (applied to that one image) are having a say. Right?
In [205]:
xb.shape, weights.shape
Out[205]:
In [204]:
np.array([[1,1,1],[2,2,2],[3,3,3]]) @ np.array([[1],[2],[3]])
Out[204]:
In [208]:
np.array([[1,1,1],[2,2,2],[-11,0,3]]) @ np.array([[1],[2],[3]])
Out[208]:
Right.
Now what about the Log Softmax operation itself? Well okay I can simulate this by hand:
In [247]:
yb.type()
Out[247]:
In [268]:
# batch size of 3
xb_tmp = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
yb_tmp = np.array([0,1,2])
# 4 classes
c = 4
w_tmp = np.array([[i for i in range(c)] for j in range(xb_tmp.shape[1])])
xb_tmp = torch.Tensor(xb_tmp)
yb_tmp = torch.tensor(yb_tmp, dtype=torch.int64) # see: https://pytorch.org/docs/stable/tensors.html#torch-tensor
w_tmp = torch.Tensor(w_tmp)
umm....
...
So it's torch.tensor not torch.Tensor? Got a lot of errors trying to specify a datatype with capital T. Alright then.
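As far as I can tell, torch.Tensor is the constructor for the default (float) tensor type and doesn't accept a dtype keyword, while torch.tensor is the factory function that infers a dtype or takes one explicitly. A quick illustration:

import torch

print(torch.Tensor([1, 2, 3]).dtype)                     # torch.float32 -- always the default float type
print(torch.tensor([1, 2, 3]).dtype)                     # torch.int64   -- inferred from the data
print(torch.tensor([1, 2, 3], dtype=torch.int32).dtype)  # torch.int32   -- dtype keyword accepted

try:
    torch.Tensor([1, 2, 3], dtype=torch.int32)           # the capital-T version rejects dtype
except TypeError as e:
    print('TypeError:', e)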
In [269]:
torch.tensor([[1, 2, 3]],dtype=torch.int32)
Out[269]:
In [270]:
xb_tmp.shape, yb_tmp.shape, w_tmp.shape
Out[270]:
In [271]:
xb.shape, yb.shape, weights.shape
Out[271]:
In [272]:
actv_tmp = log_softmax(xb_tmp @ w_tmp)
actv_tmp
Out[272]:
In [273]:
nll(actv_tmp, yb_tmp)
Out[273]:
Good, it works. Now to change things. The question was whether any of the dropped values (non-target indices) had any effect on the loss, since the loss was only calculated on error from the correct target. Basically: is there any lateral flow of information?
So I'll check this by editing values in the softmax activation that are not of the correct index.
Wait, that shouldn't have an effect anyway. No, the question is whether information earlier in the stream had an effect later on. It is 4:12 am..
Aha. My question was if the activations that created the non-target class probabilities had any effect on target classes. Which is asking if there is crossing of information in the ... oh.
I confused myself with the minibatches. Ignore those, there'd be something very wrong if there was cross-talk between them. I want to know if there is cross-talk within an individual tensor as it travels through the model.
In [288]:
# batch size of 1
xb_tmp = np.array([[0,1,1,0,0]])
yb_tmp = np.array([1])
# 4 classes
c = 4
w_tmp = np.array([[i for i in range(c)] for j in range(xb_tmp.shape[1])])
xb_tmp = torch.Tensor(xb_tmp)
yb_tmp = torch.tensor(yb_tmp, dtype=torch.int64) # see: https://pytorch.org/docs/stable/tensors.html#torch-tensor
w_tmp = torch.Tensor(w_tmp)
xb_tmp @ w_tmp
Out[288]:
In [289]:
# LogSoftmax(activations)
actv_tmp = log_softmax(xb_tmp @ w_tmp)
actv_tmp
Out[289]:
In [290]:
# NLL Loss
loss = nll(actv_tmp, yb_tmp)
loss
Out[290]:
In [298]:
def cross_test(x, y):
    # batch size determined by x
    xb_tmp = np.array(x)
    yb_tmp = np.array(y)
    # 4 classes
    c = 4
    w_tmp = np.array([[i for i in range(c)] for j in range(xb_tmp.shape[1])])
    xb_tmp = torch.Tensor(xb_tmp)
    yb_tmp = torch.tensor(yb_tmp, dtype=torch.int64) # see: https://pytorch.org/docs/stable/tensors.html#torch-tensor
    w_tmp = torch.Tensor(w_tmp)
    print(f'Activation: {xb_tmp @ w_tmp}')
    # LogSoftmax(activations)
    actv_tmp = log_softmax(xb_tmp @ w_tmp)
    print(f'Log Softmax: {actv_tmp}')
    # NLL Loss
    loss = nll(actv_tmp, yb_tmp)
    print(f'NLL Loss: {loss}')
In [303]:
w_tmp
Out[303]:
In [297]:
cross_test([[1,1,1,1,1]], [1])
In [299]:
cross_test([[1,1,1,1,0]], [1])
In [300]:
cross_test([[1,1,1,0,0]], [1])
In [301]:
cross_test([[1,1,1,1,0]], [1])
In [302]:
cross_test([[1,1,0,0,0]], [1])
Right, so uh, guess this is the black-box territory people keep talking about. But.. more of a translucent gray. Okay, so..
The entire input tensor affects the loss. Of course. There exist functions where multiple very different inputs result in the same loss. This kind of gets at the issue of neural networks learning very unintuitive things, and the possible space of functions grows very quickly.
The loss function in this case only looks at whether the correct class is on the mark or not - it doesn't care about incorrect classes, just how wrong the correct one is. In that respect the fraction of wasted information is $1 - 1/c$ (where $c$ is the number of classes) for single-label classification, with respect to the predictions tensor (Log Softmax's output).
But that's a byproduct of what we're trying to do, or maybe a deliberate choice in loss function, eh? Because the information that created the predictions tensor was very much important.
Cool, I'm weaving between logic and numbers, high-level abstraction and specific technical details. The funny thing about this stuff is it always starts as an enigma, and by the time you're done you sort of feel dumb because of how obvious it seems afterwards.
The architecture of this problem is such that the model doesn't know which class is correct, so it has to do the work of looking at all of them.
Yeah. No shit.
You ever look so close at a nail you forget you're holding a screwdriver? Yeah, watch out for that.