First we need to import NumPy, pandas, and Matplotlib...
In [3]:
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
%matplotlib inline
Again we need functions for shuffling the data and for calculating classification errors.
In [4]:
### function for shuffling the data and labels in unison
def shuffle_in_unison(features, labels):
    # save the RNG state so both arrays are permuted the same way
    rng_state = np.random.get_state()
    np.random.shuffle(features)
    np.random.set_state(rng_state)
    np.random.shuffle(labels)

### calculate classification error
# return a fraction: (number misclassified) / (total number of datapoints)
def calc_classification_error(predictions, class_labels):
    n = predictions.size
    num_of_errors = 0.
    for idx in range(n):
        if (predictions[idx] >= 0.5 and class_labels[idx] == 0) or (predictions[idx] < 0.5 and class_labels[idx] == 1):
            num_of_errors += 1
    return num_of_errors / n
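As a quick sanity check of the error function on a toy example (predictions at or above 0.5 count as class 1, below 0.5 as class 0):
In [ ]:
# toy check: the last two predictions are misclassified, so the error is 0.5
toy_preds = np.array([0.9, 0.2, 0.6, 0.4])
toy_labels = np.array([1, 0, 0, 1])
print(calc_classification_error(toy_preds, toy_labels))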
We are going to use the MNIST dataset throughout this session, keeping only the 0 and 1 digits for binary classification. Let's load the data...
In [5]:
# load the 70,000 x 784 MNIST matrix
# (note: newer scikit-learn versions removed fetch_mldata; fetch_openml('mnist_784') is the replacement)
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

# keep only the digits 0 and 1 for binary classification
idxs_to_keep = []
for idx in range(mnist.data.shape[0]):
    if mnist.target[idx] == 0 or mnist.target[idx] == 1:
        idxs_to_keep.append(idx)
mnist_x, mnist_y = (mnist.data[idxs_to_keep, :] / 255., mnist.target[idxs_to_keep])
shuffle_in_unison(mnist_x, mnist_y)
print("Dataset size: %d x %d" % mnist_x.shape)

# make a train / test split
x_train, x_test = (mnist_x[:10000, :], mnist_x[10000:, :])
y_train, y_test = (mnist_y[:10000], mnist_y[10000:])

# show one of the digits
ax1 = plt.subplot(1, 2, 1)
digit = mnist_x[1, :]
ax1.imshow(np.reshape(digit, (28, 28)), cmap='Greys_r')
plt.show()
We saw in the previous session that simply adding noise to the input of an Autoencoder improves its performance. Let's see how far we can stretch this idea. Can we simply multiply our data by a random matrix and reduce its dimensionality while still preserving its structure? Yes! The answer is provided by a famous result called the Johnson-Lindenstrauss Lemma: for $\epsilon < 1$ and a large enough projection dimension, $$ (1 - \epsilon) \, || \mathbf{x}_{i} - \mathbf{x}_{j} ||^{2} \le || \mathbf{x}_{i}\mathbf{W} - \mathbf{x}_{j}\mathbf{W} ||^{2} \le (1 + \epsilon) \, || \mathbf{x}_{i} - \mathbf{x}_{j} ||^{2}, \text{ where } \mathbf{W} \text{ is a random projection matrix.}$$ In fact, scikit-learn has a built-in function, johnson_lindenstrauss_min_dim, that tells you the minimum number of projected dimensions needed to achieve a given $\epsilon$ for a dataset of a given size.
In [ ]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim
johnson_lindenstrauss_min_dim(n_samples=x_train.shape[0], eps=0.9)
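To see the lemma in action, we can project a small sample of the training digits with a Gaussian random matrix and compare pairwise distances before and after. This is only a rough sketch: the sample size of 500, the target dimension of 200, and the $1/\sqrt{k}$ scaling (the conventional choice that keeps squared distances roughly unchanged) are arbitrary choices for this check, separate from the projection matrix W used below.
In [ ]:
# rough sanity check of the Johnson-Lindenstrauss effect on a small sample
from sklearn.metrics import pairwise_distances
sample = x_train[:500]                                   # small sample to keep this fast
k = 200                                                  # target dimension for this check
W_check = np.random.normal(size=(sample.shape[1], k)) / np.sqrt(k)
proj = np.dot(sample, W_check)
d_orig = pairwise_distances(sample)                      # distances in the original 784-d space
d_proj = pairwise_distances(proj)                        # distances after the random projection
mask = d_orig > 0                                        # ignore self-distances on the diagonal
ratios = d_proj[mask] / d_orig[mask]
print("distance ratios (projected / original): min %.3f, mean %.3f, max %.3f"
      % (ratios.min(), ratios.mean(), ratios.max()))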
The johnson_lindenstrauss_min_dim bound is nice if we truly care about theoretical guarantees on preserving distances, but in practice we can just see what works empirically. Next, let's generate a random projection matrix...
In [ ]:
# set the random number generator seed for reproducibility
np.random.seed(49)
# define the dimensionality of the hidden representation
n_components = 200
# randomly initialize the projection matrix W
W = np.random.normal(size=(x_train.shape[1], n_components), scale=1./x_train.shape[1])
# project the train and test sets into the lower-dimensional space
train_red = np.dot(x_train, W)
test_red = np.dot(x_test, W)
print("Dataset is now of size: %d x %d" % train_red.shape)
Let's run a kNN classifier and a logistic regression, first on the original pixels and then on the random projections, and compare...
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# baseline: train on the original 784-dimensional pixels
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
preds = knn.predict(x_test)
knn_error_orig = calc_classification_error(preds, y_test) * 100

lr = LogisticRegression()
lr.fit(x_train, y_train)
preds = lr.predict(x_test)
lr_error_orig = calc_classification_error(preds, y_test) * 100
In [ ]:
# the same classifiers trained on the random projections
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_red, y_train)
preds = knn.predict(test_red)
knn_error_red = calc_classification_error(preds, y_test) * 100

lr = LogisticRegression()
lr.fit(train_red, y_train)
preds = lr.predict(test_red)
lr_error_red = calc_classification_error(preds, y_test) * 100
In [ ]:
plt.bar([0,1,2,3], [knn_error_orig, lr_error_orig, knn_error_red, lr_error_red], color=['r','r','b','b'], align='center')
plt.xticks([0,1,2,3], ['kNN - OS', 'Log. Reg - OS', 'kNN - RP', 'Log. Reg. - RP'])
plt.ylim([0,5.])
plt.xlabel("Classifiers and Features")
plt.ylabel("Classification Error (%)")
plt.show()
In [ ]:
### TO DO
In [9]:
### TO DO
### Should see the graph trend downward, with classification error decreasing as the dimensionality increases.
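A minimal sketch of one way to complete the exercise, assuming the intent is to sweep the projection dimensionality, retrain kNN on each set of projections, and plot the resulting error (the list of dimensionalities below is an arbitrary choice):
In [ ]:
# sweep the projection dimensionality and plot kNN error on the projected data
dims = [10, 25, 50, 100, 200, 400]   # candidate dimensionalities (arbitrary)
errors = []
for k in dims:
    W_k = np.random.normal(size=(x_train.shape[1], k), scale=1./x_train.shape[1])
    knn_k = KNeighborsClassifier(n_neighbors=3)
    knn_k.fit(np.dot(x_train, W_k), y_train)
    preds_k = knn_k.predict(np.dot(x_test, W_k))
    errors.append(calc_classification_error(preds_k, y_test) * 100)

plt.plot(dims, errors, marker='o')
plt.xlabel("Projection dimensionality")
plt.ylabel("Classification Error (%)")
plt.show()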