This notebook is based on a tutorial given by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2014/).
In [26]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Support Vector Machines (SVMs) are a powerful supervised learning algorithm that can be used for classification or regression. An SVM is a discriminative classifier: that is, it draws a boundary between clusters of data.
Let's show a quick example of support vector classification. First we need to create a dataset:
In [27]:
from sklearn.datasets import make_blobs  # formerly sklearn.datasets.samples_generator
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50);
Now we'll fit a Support Vector Machine Classifier to these points:
In [28]:
from sklearn.svm import SVC # "Support Vector Classifier"
clf = SVC(kernel='linear')
clf.fit(X, y)
Out[28]:
To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:
In [29]:
def plot_svc_decision_function(clf):
    """Plot the decision function for a 2D SVC"""
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros_like(X)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            # decision_function expects a 2D array of shape (n_samples, n_features)
            P[i, j] = clf.decision_function([[xi, yj]])
    return plt.contour(X, Y, P, colors='k',
                       levels=[-1, 0, 1],
                       linestyles=['--', '-', '--'])
In [30]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_svc_decision_function(clf);
Notice that the dashed lines touch a couple of the points: these points are known as the "support vectors", and are stored in the support_vectors_ attribute of the classifier:
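You can also inspect this attribute directly (a quick sketch, not part of the original tutorial; the exact coordinates depend on the random blobs generated above):

# the support vectors are a small subset of the training points
print(clf.support_vectors_.shape)
print(clf.support_vectors_)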
In [31]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none');
The unique thing about SVM is that only the support vectors matter: that is, if you moved any of the other points without letting them cross the decision boundaries, they would have no effect on the classification results!
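As a quick sanity check (a sketch, not part of the original tutorial), we can drop a point that is not a support vector, refit the same linear classifier, and confirm that the fitted coefficients are (numerically) unchanged:

# remove one point that is *not* a support vector and refit;
# the learned boundary should not change
non_support = np.setdiff1d(np.arange(len(X)), clf.support_)
mask = np.ones(len(X), dtype=bool)
mask[non_support[0]] = False
clf_reduced = SVC(kernel='linear').fit(X[mask], y[mask])
print(np.allclose(clf.coef_, clf_reduced.coef_))          # should print True
print(np.allclose(clf.intercept_, clf_reduced.intercept_))

The same plotting function also works with a nonlinear kernel; below we refit the classifier with kernel='rbf' and draw its decision function in the same way: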
In [32]:
clf = SVC(kernel='rbf')
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none');
Now we'll take a look at another dataset, one where we have to put a bit more thought into how to represent the data. We can explore the data in a similar manner as above:
In [33]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
Out[33]:
In [34]:
X = digits.data
y = digits.target
print(X.shape)
print(y.shape)
In [37]:
print(digits.data[0])
print(digits.target)
The target here is just the digit represented by the data, and each sample of the data is an array of length 64... but what does this data mean?
There's a clue in the fact that we have two versions of the data array: data and images. Let's take a look at them:
In [38]:
print(digits.data.shape)
print(digits.images.shape)
We can see that they're related by a simple reshaping:
In [39]:
print(np.all(digits.images.reshape((1797, 64)) == digits.data))
Let's visualize the data. It's a little bit more involved than the simple scatter plot we used above, but we can do it rather tersely.
In [35]:
# set up the figure
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')

    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
We see now what the features mean. Each feature is a real-valued quantity representing the darkness of a pixel in an 8x8 image of a hand-written digit.
Even though each sample is inherently two-dimensional (an 8x8 image), the data matrix flattens it into a single 64-element vector, so that each sample occupies one row of the data matrix.
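To see this flattening for a single sample (a small sketch, not in the original notebook), we can reshape one row of the data matrix back into its 8x8 pixel grid:

# reshape one flattened 64-element sample back into its 8x8 pixel grid
sample = digits.data[0].reshape(8, 8)
print(sample.shape)
print(np.array_equal(sample, digits.images[0]))  # True: same pixels, different shape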
Now let's classify the digits using a Support Vector Classifier and try out 2 different kernels to see which one performs better.
In [36]:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn import metrics
from sklearn.svm import SVC
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
for kernel in ['rbf', 'linear']:
    clf = SVC(kernel=kernel).fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    print("SVC: kernel = {0}".format(kernel))
    # average='weighted' gives a single F1 score for this multiclass problem
    print(metrics.f1_score(ytest, ypred, average='weighted'))

    plt.figure()
    plt.imshow(metrics.confusion_matrix(ypred, ytest),
               interpolation='nearest', cmap=plt.cm.binary)
    plt.colorbar()
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    plt.title("SVC: kernel = {0}".format(kernel))
The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
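For example (a small sketch using ypred and ytest from the last kernel in the loop above), the overall accuracy is just the sum of the diagonal divided by the total number of test points:

# accuracy recovered from the confusion matrix: trace / total count
cm = metrics.confusion_matrix(ytest, ypred)
print(np.trace(cm) / float(cm.sum()))
print(metrics.accuracy_score(ytest, ypred))  # should agree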