In the concluding sessions of this course, I have shifted from talking about the data pipeline to the functions at the end of the tunnel: our Machine Learning algorithms. I've also likened these to a stable of horses, in that we "race" them to find the best. Choosing the best horse for your application takes experience. Don't expect to become a data scientist overnight.
In our sequence below, I start with a famous, oft-used dataset made of 8 by 8 numpy arrays representing grayscale images of the numerals 0 through 9, with quite a few specimens of each. The rows are labeled, so we know the digits. Let's take a look.
In [1]:
import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)
In [2]:
print(digits.DESCR)
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
In [4]:
plt.gray() # gray reversed shown below
_ = plt.matshow(digits.images[0])
In [5]:
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
In [6]:
_ = plt.matshow(digits.images[108])
In [7]:
digits.data[108]
Out[7]:
Remember how we think in machine learning. We have a multifaceted (multi-featured) set of samples, rows with many columns, and then a single column of correct results, an "answer key" if you will.
We often call this answer key column the "target" and then measure "error" as divergence between guesses and target.
A shrinking divergence shows that the model is learning as it trains on, or fits, the training data. Whether we control the learning rate as a hyperparameter, or leave it to the algorithm to work at some built-in speed, depends on which type of machine learner we've selected. Below we're looking at KNN and then a neural net.
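As a minimal sketch of "error as divergence between guesses and target", the simplest measure for a classifier is the fraction of samples it gets wrong. The two arrays below are made up for illustration and are not drawn from the digits data.

```python
import numpy as np

# Hypothetical answer key and guesses, invented for this example
y_true = np.array([7, 2, 1, 0, 4, 1, 4, 9])
y_pred = np.array([7, 2, 1, 0, 9, 1, 4, 4])

# Fraction of samples where the guess diverges from the target
error_rate = np.mean(y_pred != y_true)
accuracy = 1.0 - error_rate
print("Error rate: %.3f, accuracy: %.3f" % (error_rate, accuracy))
```

The accuracy score printed further down is the complement of this error rate.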
In [8]:
digits.target[108]
Out[8]:
That's a very poor rendering of the numeral 7, and we should be forgiving if our Machine Learning algorithm gets some wrong, with training data of such abysmal quality. As seen from digits.data, the 64 pixels used to represent a digit are hardly enough. Other datasets come with at least 28 x 28 pixels for each numeral. We're truly at the low end with this skimpy number of pixels per digit.
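A quick check of how little information each sample carries, offered as a sketch. It only uses attributes of the object returned by load_digits; the names `sample` and `same` are mine.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each sample is stored flat (64 values) and also as an 8 x 8 image
print(digits.data.shape)    # (n_samples, 64)
print(digits.images.shape)  # (n_samples, 8, 8)

# Reshaping the flat row recovers the image
sample = digits.data[108].reshape(8, 8)
same = bool((sample == digits.images[108]).all())
print(same)

# Smallest and largest intensity found anywhere in the dataset
print(digits.data.min(), digits.data.max())
```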
Nevertheless, we press on... I'm making only minor changes to this open source script on GitHub, by Fabiosato.
Remember how KNN works:
In [9]:
from IPython.display import YouTubeVideo
YouTubeVideo("MDniRwXizWo")
Out[9]:
Remember to distinguish KNN from K-Means. You might use the latter to create the clusters whereby you could then fit the former. Here's a paper on LinkedIn suggesting doing that. Once you have the clusters (voters), a new data point is "claimed" by one or more clusters.
Hierarchical clustering algorithms compete with K-Means. The latter does better for spherical or globular clusters.
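To keep KNN itself in focus, here is a bare-bones sketch of the voting idea in plain numpy. It is not how scikit-learn implements it; the function name `knn_predict` and the toy points are invented for illustration, and Euclidean distance is an assumption.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # The neighbors "vote" with their labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: two small groups in the plane
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```

Note there is no training step beyond storing the labeled points, which is what distinguishes KNN from K-Means, where the cluster centers have to be discovered first.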
In [10]:
from IPython.display import YouTubeVideo
YouTubeVideo("3vHqmPF4VBA")
Out[10]:
In [11]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import neighbors # http://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors
# prepare datasets for training and for validation
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    test_size=0.4, random_state=0)
# runs the kNN classifier for odd numbers of neighbors from 1 to 9
for n in range(1, 10, 2):
    clf = neighbors.KNeighborsClassifier(n)
    # instance based learning
    clf.fit(X_train, y_train)
    # our 'ground truth'
    y_true = y_test
    # predict
    y_pred = clf.predict(X_test)
    # learning metrics
    cm = confusion_matrix(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    print("Neighbors: %d" % n)
    print("Confusion Matrix")
    print(cm)
    print("Accuracy score: %f" % acc)
    print()
Discerning digits, or other patterns, within a blizzard of data points streaming in may be described as a process of identifying clusters or neighborhoods. Even before we name the clusters we claim to find, we need to find them. This is where dimensionality reduction comes in handy: if we can get the dimensions down to three, we have some axes we might use to picture the data.
"Dimensionality reduction" involves finding eigenvectors that form a basis, each one efficient at singling out structure because it carries no information redundant with the others. Ranking the eigenvectors, in the sense of "most significant digits", allows us to cluster data by just the first few eigenvector coordinates.
One might usefully compare this process to discovering the dendrogram, the binary tree resulting from bottom-up progressive agglomeration into larger groups. One may then place a threshold cut through the tree to vary the number of clusters one wishes to regard as separate. There's a sense of binning and/or pigeon-holing, where the hyperparameter is the degree of subdivision.
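The two ideas above can be sketched together: keep only the first few eigenvector coordinates, build the agglomeration tree on them, then cut it at different heights. PCA, Ward linkage, three components and the cut fractions are all my choices for illustration; the text does not prescribe any of them.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Keep only the three most significant PCA coordinates per sample
reduced = PCA(n_components=3).fit_transform(digits.data)
print(reduced.shape)

# Bottom-up agglomeration; each row of Z records one merge in the tree
Z = linkage(reduced, method="ward")
max_height = Z[:, 2].max()

# Cut the tree at a few heights; the fractions are arbitrary illustrative choices
counts = []
for fraction in (0.1, 0.3, 0.7):
    labels = fcluster(Z, t=fraction * max_height, criterion="distance")
    counts.append(len(np.unique(labels)))
    print("cut at %.0f%% of tree height -> %d clusters" % (100 * fraction, counts[-1]))
```

Raising the cut merges more branches, so the cluster count can only stay the same or shrink as the threshold grows.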
Does a neural network fare better? Let's admit, the KNN machine learner did a great job. Fast horse!
In [12]:
from sklearn.neural_network import MLPClassifier
# runs the MLP classifier five times, all with same hyperparameters
# (n is unused; each pass starts from fresh random weights)
for n in range(1, 10, 2):
    clf = MLPClassifier()
    # train the network on the training set
    clf.fit(X_train, y_train)
    # our 'ground truth'
    y_true = y_test
    # predict
    y_pred = clf.predict(X_test)
    # learning metrics
    cm = confusion_matrix(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    print("Confusion Matrix")
    print(cm)
    print("Accuracy score: %f" % acc)
    print()
I'd say these two are competitive, but I award KNN first prize in this case. On the other hand, I did not try varying the hyperparameters available to me with the MLP classifier. Let's say the results so far are inconclusive. More research needed.
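One way that further research might start, as a sketch only: a small grid search over two MLPClassifier hyperparameters. The grid values, max_iter=500, random_state=0 and cv=3 are arbitrary assumptions chosen to keep the run short, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=0)

# A deliberately tiny grid, chosen for illustration only
param_grid = {
    "hidden_layer_sizes": [(50,), (100,)],
    "alpha": [1e-4, 1e-2],
}

# Cross-validate every combination on the training split
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train)

# Score the best combination on the held-out test split
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Held-out accuracy: %f" % test_accuracy)
```

The same split as the KNN cell is reused so the held-out accuracy can be compared with the KNN scores above.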