Numpy
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
In [2]:
sample_index = 45
plt.figure(figsize=(3, 3))
plt.imshow(digits.images[sample_index], cmap=plt.cm.gray_r,
interpolation='nearest')
plt.title("image label: %d" % digits.target[sample_index]);
In [3]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
data = np.asarray(digits.data, dtype='float32')
target = np.asarray(digits.target, dtype='int32')
X_train, X_test, y_train, y_test = train_test_split(
data, target, test_size=0.15, random_state=37)
# mean = 0 ; standard deviation = 1.0
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# print(scaler.mean_)
# print(scaler.scale_)
In [4]:
X_train.shape
Out[4]:
In [5]:
X_train.dtype
Out[5]:
In [6]:
X_test.shape
Out[6]:
In [7]:
y_train.shape
Out[7]:
In [8]:
y_train.dtype
Out[8]:
In this section we will implement a logistic regression model trainable with SGD using numpy. Here are the objectives:
Implement a simple forward model with no hidden layer (equivalent to a logistic regression): $y = softmax(\mathbf{W} \cdot x + b)$ (note: the shape and transpose of $\mathbf{W}$ may differ from the course slides)
Build a predict function which returns the most probable class given an input $x$
Build an accuracy function for a batch of inputs $X$ and the corresponding expected outputs $y_{true}$
Build a grad function which computes $\frac{d}{d\mathbf{W}} \left[-\log softmax(\mathbf{W} \cdot x + b)_{y_{true}}\right]$ for an $x$ and its corresponding expected output $y_{true}$; check that the gradients are well defined
Build a train function which uses the grad function output to update $\mathbf{W}$ and $b$
First let's define a helper function to compute the one-hot encoding of an integer array for a fixed number of classes (similar to Keras' to_categorical):
In [9]:
def one_hot(n_classes, y):
return np.eye(n_classes)[y]
In [10]:
one_hot(n_classes=10, y=3)
Out[10]:
In [11]:
one_hot(n_classes=10, y=[0, 4, 9, 1])
Out[11]:
In [12]:
def softmax(X):
# TODO:
return None
Make sure that this works one vector at a time (and check that the components sum to one):
In [13]:
print(softmax([10, 2, -3]))
Note that a naive implementation of softmax might not be able to process a batch of activations in a single call:
In [14]:
X = np.array([[10, 2, -3],
[-1, 5, -20]])
print(softmax(X))
Here is a way to implement softmax that works both for an individual vector of activations and for a batch of activation vectors at once:
In [15]:
def softmax(X):
exp = np.exp(X)
return exp / np.sum(exp, axis=-1, keepdims=True)
print("softmax of a single vector:")
print(softmax([10, 2, -3]))
Probabilities should sum to 1:
In [16]:
print(np.sum(softmax([10, 2, -3])))
In [17]:
print("sotfmax of 2 vectors:")
X = np.array([[10, 2, -3],
[-1, 5, -20]])
print(softmax(X))
The probabilities for each input vector of logits should sum to 1:
In [18]:
print(np.sum(softmax(X), axis=1))
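Side note (not required for the rest of the lab): np.exp can overflow for large positive logits. A common remedy is to subtract the per-row maximum before exponentiating, which leaves the softmax values unchanged. A minimal sketch of such a numerically stable variant (the name stable_softmax is ours, not part of the solution files):
def stable_softmax(X):
    # Subtracting the row-wise max does not change the result
    # but prevents overflow in np.exp for large logits.
    X = np.asarray(X, dtype='float64')
    shifted = X - np.max(X, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)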
Implement a function that, given the true one-hot encoded class Y_true and some predicted probabilities Y_pred, returns the negative log likelihood.
In [19]:
def nll(Y_true, Y_pred):
Y_true = np.asarray(Y_true)
Y_pred = np.asarray(Y_pred)
# TODO
return 0.
# Make sure that it works for a simple sample at a time
print(nll([1, 0, 0], [.99, 0.01, 0]))
Check that the nll of a very confident yet bad prediction is a much higher positive number:
In [20]:
print(nll([1, 0, 0], [0.01, 0.01, .98]))
Make sure that your implementation can compute the average negative log likelihood of a group of predictions: Y_pred and Y_true can therefore be passed as 2D arrays:
In [21]:
def nll(Y_true, Y_pred):
Y_true = np.atleast_2d(Y_true)
Y_pred = np.atleast_2d(Y_pred)
# TODO
return 0.
In [22]:
# Check that the average NLL of the following 3 almost perfect
# predictions is close to 0
Y_true = np.array([[0, 1, 0],
[1, 0, 0],
[0, 0, 1]])
Y_pred = np.array([[0, 1, 0],
[.99, 0.01, 0],
[0, 0, 1]])
print(nll(Y_true, Y_pred))
In [24]:
# %load solutions/numpy_nll.py
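If you prefer not to load the solution file, here is one possible sketch (the clipping with a small EPSILON to avoid taking the log of zero is our choice; the official solution may differ slightly). It works for a single sample as well as for a batch:
EPSILON = 1e-8

def nll(Y_true, Y_pred):
    Y_true = np.atleast_2d(Y_true)
    Y_pred = np.atleast_2d(Y_pred)
    # Clip the predicted probabilities to avoid log(0) for
    # overconfident wrong predictions.
    Y_pred = np.clip(Y_pred, EPSILON, 1 - EPSILON)
    # Cross-entropy between one-hot targets and predicted
    # probabilities, averaged over the samples of the batch.
    loglikelihoods = np.sum(np.log(Y_pred) * Y_true, axis=1)
    return -np.mean(loglikelihoods)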
Let us now study the following linear model trainable by SGD, one sample at a time.
In [25]:
class LogisticRegression():
def __init__(self, input_size, output_size):
self.W = np.random.uniform(size=(input_size, output_size),
high=0.1, low=-0.1)
self.b = np.random.uniform(size=output_size,
high=0.1, low=-0.1)
self.output_size = output_size
def forward(self, X):
Z = np.dot(X, self.W) + self.b
return softmax(Z)
def predict(self, X):
if len(X.shape) == 1:
return np.argmax(self.forward(X))
else:
return np.argmax(self.forward(X), axis=1)
def grad_loss(self, x, y_true):
y_pred = self.forward(x)
dnll_output = y_pred - one_hot(self.output_size, y_true)
grad_W = np.outer(x, dnll_output)
grad_b = dnll_output
grads = {"W": grad_W, "b": grad_b}
return grads
def train(self, x, y, learning_rate):
# Traditional SGD update without momentum
grads = self.grad_loss(x, y)
self.W = self.W - learning_rate * grads["W"]
self.b = self.b - learning_rate * grads["b"]
def loss(self, X, y):
return nll(one_hot(self.output_size, y), self.forward(X))
def accuracy(self, X, y):
y_preds = np.argmax(self.forward(X), axis=1)
return np.mean(y_preds == y)
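Note that grad_loss relies on a classical identity: when the softmax output is combined with the negative log likelihood, the gradient of the loss with respect to the pre-activation $z = \mathbf{W} \cdot x + b$ is simply the difference between the predicted probabilities and the one-hot encoded target, $\frac{\partial \ell}{\partial z} = softmax(z) - onehot(y_{true})$. The parameter gradients then follow from the chain rule: $\frac{\partial \ell}{\partial \mathbf{W}} = x \otimes \frac{\partial \ell}{\partial z}$ (an outer product, as computed by np.outer) and $\frac{\partial \ell}{\partial b} = \frac{\partial \ell}{\partial z}$.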
In [26]:
# Build a model and test its forward inference
n_features = X_train.shape[1]
n_classes = len(np.unique(y_train))
lr = LogisticRegression(n_features, n_classes)
print("Evaluation of the untrained model:")
train_loss = lr.loss(X_train, y_train)
train_acc = lr.accuracy(X_train, y_train)
test_acc = lr.accuracy(X_test, y_test)
print("train loss: %0.4f, train acc: %0.3f, test acc: %0.3f"
% (train_loss, train_acc, test_acc))
Evaluate the randomly initialized model on the first example:
In [27]:
def plot_prediction(model, sample_idx=0, classes=range(10)):
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax0.imshow(scaler.inverse_transform(X_test[sample_idx]).reshape(8, 8), cmap=plt.cm.gray_r,
interpolation='nearest')
ax0.set_title("True image label: %d" % y_test[sample_idx]);
ax1.bar(classes, one_hot(len(classes), y_test[sample_idx]), label='true')
ax1.bar(classes, model.forward(X_test[sample_idx]), label='prediction', color="red")
ax1.set_xticks(classes)
prediction = model.predict(X_test[sample_idx])
ax1.set_title('Output probabilities (prediction: %d)'
% prediction)
ax1.set_xlabel('Digit class')
ax1.legend()
plot_prediction(lr, sample_idx=0)
In [28]:
# Training for one epoch
learning_rate = 0.01
for i, (x, y) in enumerate(zip(X_train, y_train)):
lr.train(x, y, learning_rate)
if i % 100 == 0:
train_loss = lr.loss(X_train, y_train)
train_acc = lr.accuracy(X_train, y_train)
test_acc = lr.accuracy(X_test, y_test)
print("Update #%d, train loss: %0.4f, train acc: %0.3f, test acc: %0.3f"
% (i, train_loss, train_acc, test_acc))
Evaluate the trained model on the first example:
In [29]:
plot_prediction(lr, sample_idx=0)
The objective of this section is to implement the backpropagation algorithm (SGD with the chain rule) on a single-layer neural network using the sigmoid activation function.
Start by implementing the sigmoid function and its element-wise derivative dsigmoid:
In [30]:
def sigmoid(X):
# TODO
return X
def dsigmoid(X):
# TODO
return X
x = np.linspace(-5, 5, 100)
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, dsigmoid(x), label='dsigmoid')
plt.legend(loc='best');
In [32]:
# %load solutions/sigmoid.py
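If you prefer not to load the solution, one possible implementation, based on the standard identity $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, is:
def sigmoid(X):
    return 1 / (1 + np.exp(-X))

def dsigmoid(X):
    # Element-wise derivative: sigma'(x) = sigma(x) * (1 - sigma(x)).
    sig = sigmoid(X)
    return sig * (1 - sig)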
Implement the forward and forward_keep_activations functions for a model with a hidden layer and a sigmoid activation function.
Notes:
forward_keep_activations also returns the hidden activations and pre-activations, which are needed for backpropagation;
Update the grad_loss function to compute all the gradients; check that the gradients are well defined;
Implement the train and loss functions (a possible implementation sketch is given after the solution cell below).
Bonus: reimplementing everything from scratch using only the lecture slides, without looking at the solution of the LogisticRegression, is an excellent exercise.
In [33]:
EPSILON = 1e-8
class NeuralNet():
"""MLP with 1 hidden layer with a sigmoid activation"""
def __init__(self, input_size, hidden_size, output_size):
# TODO
self.W_h = None
self.b_h = None
self.W_o = None
self.b_o = None
self.output_size = output_size
def forward_keep_activations(self, X):
# TODO
z_h = 0.
h = 0.
        y = np.zeros(self.output_size)
return y, h, z_h
def forward(self, X):
y, h, z_h = self.forward_keep_activations(X)
return y
def loss(self, X, y):
# TODO
return 42.
def grad_loss(self, x, y_true):
# TODO
return {"W_h": 0., "b_h": 0., "W_o": 0., "b_o": 0.}
def train(self, x, y, learning_rate):
# TODO
pass
def predict(self, X):
if len(X.shape) == 1:
return np.argmax(self.forward(X))
else:
return np.argmax(self.forward(X), axis=1)
def accuracy(self, X, y):
y_preds = np.argmax(self.forward(X), axis=1)
return np.mean(y_preds == y)
In [35]:
# %load solutions/neural_net.py
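For reference, here is one possible sketch of the missing methods (it reuses the sigmoid, dsigmoid, softmax, nll and one_hot helpers defined above; the initialization scale is our choice, and the official solution file may differ in such details):
class NeuralNet():
    """MLP with 1 hidden layer with a sigmoid activation"""

    def __init__(self, input_size, hidden_size, output_size):
        # Small random weights, zero biases.
        self.W_h = np.random.uniform(
            size=(input_size, hidden_size), high=0.01, low=-0.01)
        self.b_h = np.zeros(hidden_size)
        self.W_o = np.random.uniform(
            size=(hidden_size, output_size), high=0.01, low=-0.01)
        self.b_o = np.zeros(output_size)
        self.output_size = output_size

    def forward_keep_activations(self, X):
        z_h = np.dot(X, self.W_h) + self.b_h   # hidden pre-activations
        h = sigmoid(z_h)                       # hidden activations
        z_o = np.dot(h, self.W_o) + self.b_o   # output pre-activations
        y = softmax(z_o)                       # output probabilities
        return y, h, z_h

    def forward(self, X):
        y, h, z_h = self.forward_keep_activations(X)
        return y

    def loss(self, X, y):
        return nll(one_hot(self.output_size, y), self.forward(X))

    def grad_loss(self, x, y_true):
        y_pred, h, z_h = self.forward_keep_activations(x)
        # Gradient of the nll w.r.t. the output pre-activations
        # (same softmax + nll identity as for LogisticRegression).
        grad_z_o = y_pred - one_hot(self.output_size, y_true)
        grad_W_o = np.outer(h, grad_z_o)
        grad_b_o = grad_z_o
        # Backpropagate through the output layer and the sigmoid.
        grad_h = np.dot(self.W_o, grad_z_o)
        grad_z_h = grad_h * dsigmoid(z_h)
        grad_W_h = np.outer(x, grad_z_h)
        grad_b_h = grad_z_h
        return {"W_h": grad_W_h, "b_h": grad_b_h,
                "W_o": grad_W_o, "b_o": grad_b_o}

    def train(self, x, y, learning_rate):
        # Plain SGD update, one sample at a time.
        grads = self.grad_loss(x, y)
        self.W_h = self.W_h - learning_rate * grads["W_h"]
        self.b_h = self.b_h - learning_rate * grads["b_h"]
        self.W_o = self.W_o - learning_rate * grads["W_o"]
        self.b_o = self.b_o - learning_rate * grads["b_o"]

    def predict(self, X):
        if len(X.shape) == 1:
            return np.argmax(self.forward(X))
        else:
            return np.argmax(self.forward(X), axis=1)

    def accuracy(self, X, y):
        y_preds = np.argmax(self.forward(X), axis=1)
        return np.mean(y_preds == y)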
In [36]:
n_hidden = 10
model = NeuralNet(n_features, n_hidden, n_classes)
In [37]:
model.loss(X_train, y_train)
Out[37]:
In [38]:
model.accuracy(X_train, y_train)
Out[38]:
In [39]:
plot_prediction(model, sample_idx=5)
In [40]:
losses, accuracies, accuracies_test = [], [], []
losses.append(model.loss(X_train, y_train))
accuracies.append(model.accuracy(X_train, y_train))
accuracies_test.append(model.accuracy(X_test, y_test))
print("Random init: train loss: %0.5f, train acc: %0.3f, test acc: %0.3f"
% (losses[-1], accuracies[-1], accuracies_test[-1]))
for epoch in range(15):
for i, (x, y) in enumerate(zip(X_train, y_train)):
model.train(x, y, 0.1)
losses.append(model.loss(X_train, y_train))
accuracies.append(model.accuracy(X_train, y_train))
accuracies_test.append(model.accuracy(X_test, y_test))
print("Epoch #%d, train loss: %0.5f, train acc: %0.3f, test acc: %0.3f"
% (epoch + 1, losses[-1], accuracies[-1], accuracies_test[-1]))
In [41]:
plt.plot(losses)
plt.title("Training loss");
In [42]:
plt.plot(accuracies, label='train')
plt.plot(accuracies_test, label='test')
plt.ylim(0, 1.1)
plt.ylabel("accuracy")
plt.legend(loc='best');
In [43]:
plot_prediction(model, sample_idx=4)
In [45]:
# %load solutions/worst_predictions.py
The train and grad_loss functions currently only accept a single sample at a time.
Exercises:
Implement the same network architecture with Keras (a possible starting sketch is given after the solution cell below);
Check that the Keras model can approximately reproduce the behavior of the Numpy model when using similar hyperparameter values (size of the model, type of activations, learning rate value and use of momentum);
Compute the negative log likelihood of sample 42 in the test set (you can use model.predict_proba);
Compute the average negative log-likelihood on the full test set;
Compute the average negative log-likelihood on the full training set and check that you can recover the value of the loss reported by Keras;
Is the model overfitting or underfitting? (ensure that the model has fully converged by increasing the number of epochs to 50 or more if necessary).
In [ ]:
In [ ]:
# %load solutions/keras_model.py
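If you want a starting point without loading the solution, a minimal sketch with the Keras Sequential API could look as follows (this assumes TensorFlow 2.x; the variable names and hyperparameters are ours, and the solution file may differ, e.g. by adding momentum):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

keras_model = Sequential([
    Dense(n_hidden, input_shape=(n_features,), activation='sigmoid'),
    Dense(n_classes, activation='softmax'),
])
keras_model.compile(optimizer=SGD(learning_rate=0.1),
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
history = keras_model.fit(X_train, y_train, epochs=15, batch_size=32,
                          validation_data=(X_test, y_test))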
In [ ]:
# %load solutions/keras_model_test_loss.py
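A possible way to compute the requested quantities from the predicted probabilities of the hypothetical keras_model sketched above (model.predict returns the softmax outputs since the last layer uses a softmax activation):
probs = keras_model.predict(X_test)
# Negative log likelihood of sample 42 in the test set.
print(-np.log(probs[42, y_test[42]]))
# Average negative log likelihood over the full test set.
print(-np.mean(np.log(probs[np.arange(len(y_test)), y_test])))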
In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo("o64FV-ez6Gw")
Optional: read the following blog post on Reverse-Mode Automatic Differentiation from start to section "A simple implementation in Python" included:
https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation