First we need to import NumPy, pandas, and Matplotlib...
In [3]:
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
%matplotlib inline
Again we need functions for shuffling the data and for calculating classification errors.
In [4]:
### function for shuffling the data and labels in unison
def shuffle_in_unison(features, labels):
    # save the RNG state so both arrays are permuted the same way
    rng_state = np.random.get_state()
    np.random.shuffle(features)
    np.random.set_state(rng_state)
    np.random.shuffle(labels)

### calculate classification error
# return a fraction: (number misclassified) / (total number of datapoints)
def calc_classification_error(predictions, class_labels):
    n = predictions.size
    num_of_errors = 0.
    for idx in range(n):
        if (predictions[idx] >= 0.5 and class_labels[idx] == 0) or (predictions[idx] < 0.5 and class_labels[idx] == 1):
            num_of_errors += 1
    return num_of_errors / n
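As a quick sanity check of the error function on a toy example (predictions at or above 0.5 count as class 1, below 0.5 as class 0):
In [ ]:
# toy check: the last two predictions are misclassified, so the error is 0.5
toy_preds = np.array([0.9, 0.2, 0.6, 0.4])
toy_labels = np.array([1, 0, 0, 1])
print(calc_classification_error(toy_preds, toy_labels))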
We are going to use the MNIST dataset throughout this session, keeping only the 0 and 1 digits for binary classification. Let's load the data...
In [5]:
# load the 70,000 x 784 MNIST matrix
# (note: newer scikit-learn versions removed fetch_mldata; fetch_openml('mnist_784') is the replacement)
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

# keep only the digits 0 and 1 for binary classification
idxs_to_keep = []
for idx in range(mnist.data.shape[0]):
    if mnist.target[idx] == 0 or mnist.target[idx] == 1:
        idxs_to_keep.append(idx)
mnist_x, mnist_y = (mnist.data[idxs_to_keep, :] / 255., mnist.target[idxs_to_keep])
shuffle_in_unison(mnist_x, mnist_y)
print("Dataset size: %d x %d" % mnist_x.shape)

# make a train / test split
x_train, x_test = (mnist_x[:10000, :], mnist_x[10000:, :])
y_train, y_test = (mnist_y[:10000], mnist_y[10000:])

# show one of the digits
ax1 = plt.subplot(1, 2, 1)
digit = mnist_x[1, :]
ax1.imshow(np.reshape(digit, (28, 28)), cmap='Greys_r')
plt.show()
We saw in the previous session that simply adding noise to the input of an Autoencoder improves its performance. Let's see how far we can stretch this idea. Can we simply multiply our data by a random matrix and reduce its dimensionality while still preserving its structure? Yes! The answer is provided by a famous result called the Johnson-Lindenstrauss Lemma: for $\epsilon < 1$ and a large enough projection dimension, $$ (1 - \epsilon) \, || \mathbf{x}_{i} - \mathbf{x}_{j} ||^{2} \le || \mathbf{x}_{i}\mathbf{W} - \mathbf{x}_{j}\mathbf{W} ||^{2} \le (1 + \epsilon) \, || \mathbf{x}_{i} - \mathbf{x}_{j} ||^{2}, \text{ where } \mathbf{W} \text{ is a random projection matrix.}$$ In fact, scikit-learn has a built-in function, johnson_lindenstrauss_min_dim, that tells you the minimum number of projected dimensions needed to achieve a given $\epsilon$ for a dataset of a given size.
In [ ]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim
johnson_lindenstrauss_min_dim(n_samples=x_train.shape[0], eps=0.9)
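To see the lemma in action, we can project a small sample of the training digits with a Gaussian random matrix and compare pairwise distances before and after. This is only a rough sketch: the sample size of 500, the target dimension of 200, and the $1/\sqrt{k}$ scaling (the conventional choice that keeps squared distances roughly unchanged) are arbitrary choices for this check, separate from the projection matrix W used below.
In [ ]:
# rough sanity check of the Johnson-Lindenstrauss effect on a small sample
from sklearn.metrics import pairwise_distances
sample = x_train[:500]                                   # small sample to keep this fast
k = 200                                                  # target dimension for this check
W_check = np.random.normal(size=(sample.shape[1], k)) / np.sqrt(k)
proj = np.dot(sample, W_check)
d_orig = pairwise_distances(sample)                      # distances in the original 784-d space
d_proj = pairwise_distances(proj)                        # distances after the random projection
mask = d_orig > 0                                        # ignore self-distances on the diagonal
ratios = d_proj[mask] / d_orig[mask]
print("distance ratios (projected / original): min %.3f, mean %.3f, max %.3f"
      % (ratios.min(), ratios.mean(), ratios.max()))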
The johnson_lindenstrauss_min_dim bound is nice if we truly care about theoretical guarantees on preserving distances, but in practice we can just see what works empirically. Next, let's generate a random projection matrix...
In [ ]:
# set the random number generator seed for reproducibility
np.random.seed(49)
# define the dimensionality of the hidden representation
n_components = 200
# randomly initialize the projection matrix W
W = np.random.normal(size=(x_train.shape[1], n_components), scale=1./x_train.shape[1])
# project the train and test sets into the lower-dimensional space
train_red = np.dot(x_train, W)
test_red = np.dot(x_test, W)
print("Dataset is now of size: %d x %d" % train_red.shape)
Let's run a kNN classifier and a logistic regression, first on the original pixels and then on the random projections, and compare...
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# baseline: train on the original 784-dimensional pixels
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
preds = knn.predict(x_test)
knn_error_orig = calc_classification_error(preds, y_test) * 100

lr = LogisticRegression()
lr.fit(x_train, y_train)
preds = lr.predict(x_test)
lr_error_orig = calc_classification_error(preds, y_test) * 100
In [ ]:
# the same classifiers trained on the random projections
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_red, y_train)
preds = knn.predict(test_red)
knn_error_red = calc_classification_error(preds, y_test) * 100

lr = LogisticRegression()
lr.fit(train_red, y_train)
preds = lr.predict(test_red)
lr_error_red = calc_classification_error(preds, y_test) * 100
In [ ]:
plt.bar([0,1,2,3], [knn_error_orig, lr_error_orig, knn_error_red, lr_error_red], color=['r','r','b','b'], align='center')
plt.xticks([0,1,2,3], ['kNN - OS', 'Log. Reg - OS', 'kNN - RP', 'Log. Reg. - RP'])
plt.ylim([0,5.])
plt.xlabel("Classifiers and Features")
plt.ylabel("Classification Error (%)")
plt.show()
In [ ]:
### TO DO
In [9]:
### TO DO
### Should see the graph trend downward, with classification error decreasing as the dimensionality increases.
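A minimal sketch of one way to complete the exercise, assuming the intent is to sweep the projection dimensionality, retrain kNN on each set of projections, and plot the resulting error (the list of dimensionalities below is an arbitrary choice):
In [ ]:
# sweep the projection dimensionality and plot kNN error on the projected data
dims = [10, 25, 50, 100, 200, 400]   # candidate dimensionalities (arbitrary)
errors = []
for k in dims:
    W_k = np.random.normal(size=(x_train.shape[1], k), scale=1./x_train.shape[1])
    knn_k = KNeighborsClassifier(n_neighbors=3)
    knn_k.fit(np.dot(x_train, W_k), y_train)
    preds_k = knn_k.predict(np.dot(x_test, W_k))
    errors.append(calc_classification_error(preds_k, y_test) * 100)

plt.plot(dims, errors, marker='o')
plt.xlabel("Projection dimensionality")
plt.ylabel("Classification Error (%)")
plt.show()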