This notebook was put together by **Andrew Greenhut** on **June 3rd 2015**.
In [1]:
#Run this code at the beginning of the presentation
from IPython.display import Image, display
from fig_code import plot_sgd_separator
from fig_code import plot_linear_regression
import seaborn; seaborn.set()
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
display(Image(filename='images/DS_Cat.jpg'))
In [3]:
display(Image(filename='images/dashboard-snockered-624x418.png'))
Machine learning refers to programs that adapt their behavior based on previously seen data.
The output depends on the algorithm and on a set of tunable parameters (known as "hyper-parameters").
Most algorithms fall into one of two categories: supervised or unsupervised.
There is no magic to machine learning. Most methods come down to linear algebra on matrices, with the goal of minimizing an error function.
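As a minimal sketch of what "minimizing an error function" can look like, the snippet below fits a straight line by least squares using plain NumPy. The synthetic data, the true slope and intercept, and the helper name `mse` are assumptions made for illustration; they are not part of the original presentation.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Find the weights w that minimize the squared error ||A w - y||^2
w, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = w

def mse(weights):
    """Mean squared error of the line defined by `weights` (hypothetical helper)."""
    return np.mean((A @ weights - y) ** 2)

print("slope=%.2f intercept=%.2f" % (slope, intercept))
print("error at the fit:       ", mse(w))
print("error after nudging w:  ", mse(w + np.array([0.1, 0.0])))
```

Any change to the fitted weights increases the error, which is what it means for the fit to sit at the minimum of the error function.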
In [4]:
plot_linear_regression()
In [5]:
plot_sgd_separator()
Most machine learning algorithms expect data to be stored in a table (or matrix). The size of the table is [n_samples (rows), n_features (columns)].
The number of features can be very large (e.g. millions of features), with most of them being zero for a given sample.
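To illustrate the [n_samples, n_features] layout, here is a small sketch with a made-up table that is mostly zeros, stored both as a dense NumPy array and as a SciPy sparse matrix. The values and the variable names are hypothetical; the point is only that a sparse format stores the nonzero entries rather than the whole table.

```python
import numpy as np
from scipy import sparse

# Hypothetical dataset: 3 samples (rows) and 6 features (columns), mostly zeros
X_dense = np.array([
    [0, 0, 3, 0, 0, 0],
    [1, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 4, 0],
])
n_samples, n_features = X_dense.shape

# Compressed sparse row format keeps only the nonzero entries
X_sparse = sparse.csr_matrix(X_dense)

# Fraction of entries that are actually stored
density = X_sparse.nnz / (n_samples * n_features)

print("shape:", X_sparse.shape)
print("stored nonzeros:", X_sparse.nnz)
print("density:", density)
```

Many scikit-learn estimators accept sparse matrices directly, though support varies by estimator.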
In [6]:
display(Image(filename='images/images.png'))
In [7]:
display(Image(filename='images/supervised_learning_flowchart.png'))
In [8]:
display(Image(filename='images/Predictive-Analytics.png'))
In [9]:
from sklearn import datasets
digits = datasets.load_digits()
digits.images.shape
Out[9]:
(1797, 8, 8)
Let's plot a few of these:
In [10]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])
Here the data is simply each pixel value within an 8x8 grid:
In [11]:
# The images themselves
print(digits.images.shape)
print(digits.images[0])
In [12]:
# The data for use in our algorithms
print(digits.data.shape)
print(digits.data[0])
In [13]:
# The target label
print(digits.target)
So our data have 1797 samples in 64 dimensions.
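As a quick check of how the two arrays relate, the sketch below flattens each 8x8 image into a row of 64 values and compares the result with `digits.data`. The variable name `flattened` is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a single row of 64 pixel values
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)

print(flattened.shape)
# True if the feature matrix is just the flattened images
print(np.array_equal(flattened, digits.data))
```

Each row of the feature matrix is one sample and each column is one pixel position.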
We'd like to visualize our points within the 64-dimensional feature space, but it's difficult to plot points in 64 dimensions! Instead we'll reduce the data to two dimensions using an unsupervised method. Here, we'll use a manifold learning algorithm called Isomap.
In [14]:
from sklearn.manifold import Isomap
In [15]:
iso = Isomap(n_components=2)
data_projected = iso.fit_transform(digits.data)
In [16]:
data_projected.shape
Out[16]:
(1797, 2)
In [17]:
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5, cmap=plt.get_cmap('nipy_spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)
We see here that the digits are fairly well separated in the projected two-dimensional space; this suggests that a supervised classification algorithm should perform fairly well. Let's give it a try.
In [18]:
# train_test_split lives in sklearn.model_selection (the old cross_validation module was removed)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                random_state=2)
print(Xtrain.shape, Xtest.shape)
Let's use a simple logistic regression, which (despite its confusing name) is a classification algorithm:
In [19]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2')
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
We can check our classification accuracy by comparing the true values of the test set to the predictions:
In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)
Out[20]:
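The accuracy score is simply the fraction of predictions that match the true labels. The sketch below shows this on a small set of made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not the notebook's test set) and compares the hand computation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: five of the six predictions are correct
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# Fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```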
This single number doesn't tell us where we've gone wrong. One nice way to see that is to use the confusion matrix:
In [21]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, ypred))
In [22]:
# log1p computes log(1 + x), so cells with zero counts stay finite (np.log would give -inf)
plt.imshow(np.log1p(confusion_matrix(ytest, ypred)),
           cmap='Blues', interpolation='nearest')
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');
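To make the confusion matrix easier to read, here is a small sketch on made-up labels (the arrays are hypothetical, not the digits test set). Rows are true classes and columns are predicted classes, so the diagonal holds correct predictions; dividing the diagonal by the row sums gives the fraction of each true class that was classified correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 1, 2])

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class recall: correct predictions divided by the number of true samples
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print(per_class_recall)

# Off-diagonal entries are the mistakes; list each (true, predicted) pair
errors = cm.copy()
np.fill_diagonal(errors, 0)
for true_label, pred_label in zip(*np.nonzero(errors)):
    print("true %d predicted as %d: %d time(s)"
          % (true_label, pred_label, errors[true_label, pred_label]))
```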
We might also take a look at some of the outputs along with their predicted labels. We'll make the bad labels red:
In [23]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(Xtest[i].reshape(8, 8), cmap='binary')
    ax.text(0.05, 0.05, str(ypred[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == ypred[i]) else 'red')
    ax.set_xticks([])
    ax.set_yticks([])
The interesting thing is that even with this simple logistic regression algorithm, many of the mislabeled cases are ones that we ourselves might get wrong!
There are many ways to improve this classifier, but we're out of time here. To go further, we could use a more sophisticated model, use cross-validation, or apply other techniques. We'll cover some of these topics later in the tutorial.
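As one possible sketch of the cross-validation idea, the snippet below scores the same kind of logistic regression model on five different train/test splits of the digits data. The choices `cv=5` and `max_iter=5000` are assumptions made here (the larger iteration limit is only meant to give the solver room to converge); they are not settings from the original presentation.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

# max_iter is raised from its default as an assumption, to help the solver converge
clf = LogisticRegression(max_iter=5000)

# Fit and score the model on 5 different splits of the data
scores = cross_val_score(clf, digits.data, digits.target, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Averaging over several splits gives a more stable estimate than the single split used earlier, at the cost of fitting the model several times.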
Special thanks to [Jake Vanderplas](http://www.vanderplas.com) for his scikit-learn content. Check out his PyCon 2015 tutorial on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/), or his tutorial [video](http://pyvideo.org/video/3429/machine-learning-with-scikit-learn-i).