This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2014/).
There are many machine learning algorithms available; here we'll go into brief detail on two of the most common and interesting ones: Support Vector Machines (SVMs) and Random Forests.
By the end of this section you should have a feel for how Support Vector Machines and Random Forests work, and know how to apply them to data with scikit-learn.
As before, we'll start by getting our notebook ready for interactive plotting:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for classification or for regression. SVMs are a discriminative classifier: that is, they draw a boundary between clusters of data.
Let's show a quick example of support vector classification. First we need to create a dataset:
In [2]:
from sklearn.datasets import make_blobs  # in older scikit-learn: sklearn.datasets.samples_generator
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50);
Now we'll fit a Support Vector Machine Classifier to these points:
In [3]:
from sklearn.svm import SVC # "Support Vector Classifier"
clf = SVC(kernel='linear')
clf.fit(X, y)
Out[3]:
To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:
In [4]:
def plot_svc_decision_function(clf):
    """Plot the decision function for a 2D SVC"""
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros_like(X)
    # evaluate the decision function at each point of the grid
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            P[i, j] = clf.decision_function([[xi, yj]])[0]
    # dashed lines mark the margins; the solid line is the decision boundary
    return plt.contour(X, Y, P, colors='k',
                       levels=[-1, 0, 1],
                       linestyles=['--', '-', '--'])
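The double loop above is easy to follow but slow for finer grids. As an aside, the same values can be computed in a single vectorized call to decision_function; a minimal sketch (the function name below is ours, not part of the original notebook):

def plot_svc_decision_function_fast(clf):
    """Vectorized variant: evaluate the decision function on the whole grid at once."""
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    X1, X2 = np.meshgrid(x, y)
    # stack the grid into an (n_points, 2) array and evaluate in one call
    P = clf.decision_function(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape)
    return plt.contour(X1, X2, P, colors='k',
                       levels=[-1, 0, 1],
                       linestyles=['--', '-', '--'])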
In [5]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_svc_decision_function(clf);
Notice that the dashed lines touch a couple of the points: these points are known as the "support vectors", and are stored in the support_vectors_
attribute of the classifier:
In [6]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none');
The unique thing about SVMs is that only the support vectors matter: that is, if you moved any of the other points without letting them cross the decision boundaries, they would have no effect on the classification results! You can see this interactively by running the fig_code/svm_gui.py script:
In [7]:
# %run fig_code/svm_gui.py
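To check this claim numerically rather than interactively, we can refit a classifier using only the support vectors and confirm that it makes the same predictions; a quick sketch (the names sv and clf_sv below are ours):

sv = clf.support_                    # indices of the support vectors within X
clf_sv = SVC(kernel='linear').fit(X[sv], y[sv])

# the refit model should agree with the original on every training point
print(np.all(clf.predict(X) == clf_sv.predict(X)))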
The version above uses a linear kernel; it is also possible to use radial basis function (RBF) kernels, among others.
In [8]:
clf = SVC(kernel='rbf')
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none');
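The shape of the RBF boundary is controlled by the kernel coefficient gamma (together with the penalty parameter C). As a rough sketch, one can compare a few gamma values side by side (the values below are illustrative, not tuned):

for gamma in [0.1, 1.0, 10.0]:
    clf_rbf = SVC(kernel='rbf', gamma=gamma).fit(X, y)
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
    plot_svc_decision_function(clf_rbf)
    plt.title("RBF kernel, gamma = {0}".format(gamma))

Larger gamma values produce more tightly curved boundaries that hug the training points, while smaller values give smoother boundaries.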
These properties (the fact that only a handful of support vectors matter, plus the flexibility that kernels provide) make SVMs extremely useful classifiers in practice.
We'll leave SVMs for the time being and take a look at another powerful classifier.
Random forests are an example of an ensemble learner built on decision trees. For this reason we'll first discuss decision trees themselves:
Here we'll explore a class of algorithms based on decision trees. Decision trees are, at their root (ha!), extremely intuitive: they encode a series of binary choices in a process that parallels how a person might classify things, but use an information criterion to decide which question is most fruitful at each step. For example, if you wanted to create a guide to identifying an animal found in nature, you might ask a series of questions such as: Is the animal bigger than a meter? If so, does it have horns? If not, does it have four legs?
and so on. This binary splitting of questions is the essence of a decision tree.
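In code, such a guide is nothing more than a cascade of nested if/else statements. The toy sketch below is only an illustration of the idea (the questions and animal names are invented for this example):

def identify_animal(bigger_than_a_meter, has_horns, has_four_legs):
    """Hand-written toy 'decision tree' for illustration only."""
    if bigger_than_a_meter:
        if has_horns:
            return "perhaps an elk"
        else:
            return "perhaps a bear"
    else:
        if has_four_legs:
            return "perhaps a fox"
        else:
            return "perhaps a bird"

print(identify_animal(True, False, True))   # perhaps a bear

A DecisionTreeClassifier learns questions of exactly this form, expressed as thresholds on the input features, directly from the training data: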
In [9]:
from sklearn.tree import DecisionTreeClassifier
In [10]:
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
Out[10]:
We'll start by defining a convenience function which allows us to plot the predicted value in the background:
In [11]:
def plot_estimator(estimator, X, y):
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 50),
                         np.linspace(y_min, y_max, 50))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, alpha=0.3)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
    plt.axis('tight')
    plt.axis('off')
    plt.tight_layout()
In [12]:
clf = DecisionTreeClassifier(max_depth=10)
plot_estimator(clf, X, y)
One problem with decision trees is that they can end up over-fitting the data: they are such flexible models that, given a large depth, they can quickly memorize the inputs, which doesn't generalize well to previously unseen data. One way to get around this is to use many slightly different decision trees in concert. This approach is known as a Random Forest, and it is one of the more common techniques of ensemble learning (i.e. combining the results from several estimators).
In [13]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=0)
plot_estimator(clf, X, y)
It's difficult to see in this example, but for more complicated data, random forests can be a very powerful technique.
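One way to quantify the benefit, rather than eyeballing the decision boundaries, is to cross-validate a single deep tree against the forest on the same blob data. A rough sketch (cross_val_score lives in sklearn.model_selection in recent releases; older versions keep it in sklearn.cross_validation):

from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(max_depth=10)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

# mean 5-fold cross-validated accuracy for each model
print("tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())

The exact numbers depend on the data and random seed, but the forest will typically score at least as well as the single deep tree.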
We previously saw the hand-written digits data. Let's use that here to test the efficacy of the SVM and Random Forest classifiers.
In [14]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
Out[14]:
In [15]:
X = digits.data
y = digits.target
print(X.shape)
print(y.shape)
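Each row of digits.data is simply the corresponding 8x8 image in digits.images flattened into 64 pixel values; a quick sanity check (assuming the layout used by current scikit-learn):

print(digits.images.shape)                                               # (1797, 8, 8)
print(np.all(digits.images.reshape(len(digits.images), -1) == digits.data))  # expect True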
To remind us what we're looking at, we'll visualize the first few data points:
In [16]:
# set up the figure
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
We can quickly classify the digits using a decision tree as follows:
In [17]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
from sklearn import metrics
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
plt.imshow(metrics.confusion_matrix(ypred, ytest),
           interpolation='nearest', cmap=plt.cm.binary)
plt.colorbar()
plt.xlabel("true label")
plt.ylabel("predicted label");
Exercise: try to improve on the decision tree results above.
Repeat the classification using sklearn.svm.SVC. How does the choice of kernel affect the results?
Try sklearn.ensemble.RandomForestClassifier. How do the max_depth, max_features, and n_estimators parameters affect the results?
Report the F1 score (sklearn.metrics.f1_score) on your results. What's the best F1 score you can reach?
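As a starting point for the exercise (this is only a sketch with illustrative, untuned parameters, not the provided solution in solutions/04_svm_rf.py):

from sklearn.svm import SVC
from sklearn import metrics

clf = SVC(kernel='rbf', gamma=0.001)   # illustrative parameters, not tuned
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print(metrics.f1_score(ytest, ypred, average='macro'))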
In [18]:
# run this to load the solution
# %load solutions/04_svm_rf.py