In [32]:
%matplotlib inline

In [33]:
import scipy.optimize
import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_mldata

Data

First, get and preprocess the data. This time we will use the complete data set and not just the samples for the digits 0 and 1.

Get the data: it contains 70k samples, of which the last 10k are meant for testing.


In [34]:
mnist = fetch_mldata('MNIST original', data_home='./data')
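
Note: fetch_mldata has since been removed from scikit-learn. If you run this notebook with a recent version, fetch_openml is the usual replacement; a minimal sketch (assuming a scikit-learn version that provides fetch_openml with the as_frame argument, and noting that OpenML returns the targets as strings):


from sklearn.datasets import fetch_openml

# 'mnist_784' is the OpenML name of the same 70k-sample data set.
# The targets come back as strings, so cast them to match the code below.
mnist = fetch_openml('mnist_784', version=1, as_frame=False, data_home='./data')
mnist.target = mnist.target.astype(np.float64)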

Reshape the target into a column vector to prepare for concatenation


In [35]:
y_all = mnist.target[:, np.newaxis]

Create the intercept column that will be added to the features


In [36]:
intercept = np.ones_like(y_all)

Before the next step, we need to define this utility function, which normalizes the data.


In [37]:
def normalize_features(train, test):
    """Normalizes train set features to a standard normal distribution
    (zero mean and unit variance). The same procedure is then applied
    to the test set features.
    """
    train_mean = train.mean(axis=0)
    # +0.1 to avoid division by zero in this specific case
    train_std = train.std(axis=0) + 0.1
    
    train = (train - train_mean) / train_std
    test = (test - train_mean) / train_std
    return train, test

Now, normalize the data (zero mean and unit variance)


In [38]:
train_normalized, test_normalized = normalize_features(
    mnist.data[:60000, :],
    mnist.data[60000:, :],
)
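
As a quick sanity check (not part of the original pipeline), the normalized training features should now have means close to zero and, because of the +0.1 in the denominator, standard deviations slightly below one:


# Per-feature means should be ~0; standard deviations are at most 1
# because we divided by (std + 0.1) rather than std.
print(train_normalized.mean(axis=0).max(), train_normalized.std(axis=0).max())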

Concatenate the intercept, X, and y so that shuffling is easier in the next step


In [39]:
train_all = np.hstack((
    intercept[:60000],
    train_normalized,
    y_all[:60000],
))
test_all = np.hstack((
    intercept[60000:],
    test_normalized,
    y_all[60000:],
))

Shuffle the data. As already mentioned in the Logistic_Regression notebook, I don't think it's needed, but let's stick with the UFLDL tutorial here.


In [40]:
np.random.shuffle(train_all)
np.random.shuffle(test_all)

Finally, get train and test data sets


In [41]:
train_X = train_all[:, :-1]
train_y = train_all[:, -1]

test_X = test_all[:, :-1]
test_y = test_all[:, -1]

Softmax Regression

Define some helpful variables and initialize theta with small random values for all classes.


In [42]:
m, n = train_X.shape
k = np.unique(train_y).size
theta = np.random.rand(n, k) * 0.001

This indicator_mask will come in handy when computing the gradient later on.


In [43]:
indicator_mask = np.zeros((train_X.shape[0], theta.shape[1]), dtype=bool)
for i, label in enumerate(train_y):
    indicator_mask[i, int(label)] = True
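
For reference, the same one-hot mask can be built without the Python loop. A minimal vectorized sketch (assuming the labels are exactly the integers 0..k-1, which holds for MNIST):


# Pick the matching row of a k x k identity matrix for every sample.
indicator_mask_vectorized = np.eye(k, dtype=bool)[train_y.astype(int)]
assert (indicator_mask_vectorized == indicator_mask).all()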

This is a helper function to keep the code DRY. It computes the probabilities of all classes for all samples.


In [44]:
def probs(theta, X, y):
    # y is unused, but kept so that all functions share the same signature
    if theta.ndim == 1:
        theta = theta.reshape((theta.size // k, k))
    values = np.exp(X.dot(theta))
    sums = np.sum(values, axis=1)
    return (values.T / sums).T
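
One caveat: np.exp can overflow for large activations. The standard remedy is to subtract the row-wise maximum before exponentiating, which leaves the resulting probabilities unchanged. A sketch of such a variant (not wired into the code below, to keep the original flow):


def probs_stable(theta, X, y=None):
    """Like probs, but shifts the activations by their row-wise maximum
    before exponentiating; the softmax is invariant to this shift."""
    if theta.ndim == 1:
        theta = theta.reshape((theta.size // k, k))
    activations = X.dot(theta)
    activations -= activations.max(axis=1, keepdims=True)
    values = np.exp(activations)
    return values / values.sum(axis=1, keepdims=True)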

The cost function of Softmax Regression. We could actually use the indicator_mask here instead of the loop at the end (a sketch of that variant follows the cell below), but let's stick with the explicit loop.


In [45]:
def cost_function(theta, X, y):
    log_probs = np.log(probs(theta, X, y))
    cost = 0
    for i in range(m):
        cost -= log_probs[i, int(y[i])]
    return cost
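
Just for illustration, this is the indicator_mask variant referred to above; boolean indexing picks exactly one log-probability per sample, so it gives the same cost without the Python loop (like the gradient below, it relies on the global indicator_mask and therefore only applies to the full training set):


def cost_function_masked(theta, X, y):
    log_probs = np.log(probs(theta, X, y))
    return -log_probs[indicator_mask].sum()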

The gradient function of Softmax Regression.


In [46]:
def gradient(theta, X, y):
    # Gradient of the softmax cost: -X^T (indicator - P(y=j|x)), flattened for scipy
    gradient_matrix = -X.T.dot(indicator_mask - probs(theta, X, y))
    return gradient_matrix.flatten()
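
Before running the full optimization, it is cheap to sanity-check the analytic gradient against finite differences. The functions above are tied to the full training set via the globals m and indicator_mask, so the sketch below builds a self-contained miniature on a 50-sample subset (scipy.optimize.check_grad returns the norm of the difference, which should be small relative to the gradient's norm):


X_small = train_X[:50]
y_small = train_y[:50].astype(int)
mask_small = np.eye(k, dtype=bool)[y_small]
theta_small = np.random.rand(n * k) * 0.001

def cost_small(t):
    return -np.log(probs(t, X_small, y_small))[mask_small].sum()

def gradient_small(t):
    return (-X_small.T.dot(mask_small - probs(t, X_small, y_small))).flatten()

print(scipy.optimize.check_grad(cost_small, gradient_small, theta_small))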

Alright, let's run the optimization. 100 iterations are enough here.


In [47]:
J_history = []

t0 = time.time()
res = scipy.optimize.minimize(
    fun=cost_function,
    x0=theta,
    args=(train_X, train_y),
    method='L-BFGS-B',
    jac=gradient,
    options={'maxiter': 100, 'disp': True},
    callback=lambda x: J_history.append(cost_function(x, train_X, train_y)),
)
t1 = time.time()

print('Optimization took {s} seconds'.format(s=t1 - t0))
optimal_theta = res.x.reshape((theta.size // k, k))


Optimization took 195.31115102767944 seconds

Plot the evolution of J (to make sure we did the right thing).


In [48]:
plt.plot(J_history, marker='o')
plt.xlabel('Iterations')
plt.ylabel('J(theta)')


Out[48]:
<matplotlib.text.Text at 0x10d5d00f0>

Even with the optimal_theta, the cost J remains relatively high in absolute terms. That's not necessarily a bad sign: the unregularized cost sums the negative log-probabilities over all 60,000 training samples and 10 classes, so its raw value is naturally larger than for a simpler setup such as binary Logistic Regression.
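
One way to put the raw number into perspective (a small addition, not part of the original notebook): divide the final cost by the number of training samples to get the average negative log-likelihood, and exponentiate its negative to see the geometric-mean probability the model assigns to the correct class.


# Average negative log-likelihood per training sample, and the corresponding
# geometric-mean probability assigned to the true class.
average_nll = res.fun / m
print(average_nll, np.exp(-average_nll))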


In [49]:
def accuracy(theta, X, y):
    correct = np.sum(np.argmax(probs(theta, X, y), axis=1) == y)
    return correct / y.size

In [50]:
print('Training accuracy: {acc}'.format(acc=accuracy(res.x, train_X, train_y)))
print('Test accuracy: {acc}'.format(acc=accuracy(res.x, test_X, test_y)))


Training accuracy: 0.9435833333333333
Test accuracy: 0.9235