This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
In this section, we will develop the intuition behind support vector machines and their use in classification problems.
We begin with the standard imports:
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
As an example of discriminative classification, consider the simple case of a classification task in which the two classes of points are well separated:
In [2]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification. For two-dimensional data like that shown here, this is a task we could do by hand. But immediately we see a problem: there is more than one possible dividing line that can perfectly discriminate between the two classes!
We can draw some of them as follows:
In [3]:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.plot([0.6], [2.1], 'x', color='red', markeredgewidth=2, markersize=10)
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    plt.plot(xfit, m * xfit + b, '-k')

plt.xlim(-1, 3.5);
These are three very different separators which, nevertheless, perfectly discriminate between these samples. Depending on which you choose, a new data point (e.g., the one marked by the "X" in the plot above) will be assigned a different label! Evidently our simple intuition of "drawing a line between classes" is not enough. Support vector machines offer one way to improve on this: rather than simply drawing a zero-width line between the classes, we can draw around each line a margin of some width, up to the nearest point. Here is an example of how this might look:
In [4]:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5);
In support vector machines, the line that maximizes this margin is the one we will choose as the optimal model. Let's see the result of an actual fit to this data: we will use Scikit-Learn's support vector classifier, with a linear kernel and the C parameter set to a very large number (we will discuss the meaning of C more fully below):
In [3]:
from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:
In [4]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)

    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none', edgecolor='black')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
In [5]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);
This is the dividing line that maximizes the margin between the two sets of points. Notice that a few of the training points just touch the margin; these points are the pivotal elements of the fit, and are known as the support vectors, giving the algorithm its name.
In Scikit-Learn, the identity of these points is stored in the support_vectors_ attribute of the classifier:
In [6]:
model.support_vectors_
A key to this classifier's success is that for the fit, only the positions of the support vectors matter; any points further from the margin that are on the correct side do not modify the fit.
We can see this, for example, if we plot the model learned from the first 60 points and first 120 points of this dataset:
In [7]:
def plot_svm(N=10, ax=None):
    X, y = make_blobs(n_samples=400, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)

    ax = ax or plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 6)
    plot_svc_decision_function(model, ax)
In [8]:
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for axi, N in zip(ax, [60, 120]):
    plot_svm(N, axi)
    axi.set_title('N = {0}'.format(N))
In the right panel, we have doubled the number of training points, but the model has not changed: the support vectors from the left panel are still the support vectors in the right panel.
This insensitivity to the exact behavior of distant points is one of the strengths of the SVM model.
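We can also check this numerically. As a quick sketch (not part of the original text), duplicating a training point that is not a support vector and refitting should leave the linear model's coefficients unchanged:
import numpy as np  # already imported above; repeated here for self-containment

# pick a training point that is *not* one of the support vectors
mask = np.ones(len(X), dtype=bool)
mask[model.support_] = False
idx = np.where(mask)[0][0]

# duplicate it and refit with the same settings
X2 = np.vstack([X, X[idx]])
y2 = np.append(y, y[idx])
model2 = SVC(kernel='linear', C=1E10).fit(X2, y2)

# the separating hyperplane should be (numerically) identical
print(np.allclose(model.coef_, model2.coef_),
      np.allclose(model.intercept_, model2.intercept_))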
If you are running this notebook live, you can use IPython's interactive widgets to view this feature of the SVM model interactively:
In [9]:
from ipywidgets import interact, fixed
interact(plot_svm, N=[10, 400], ax=fixed(None));
Where SVM becomes extremely powerful is when it is combined with kernels.
We have seen a version of kernels before, in the basis function regressions of In Depth: Linear Regression.
To motivate the need for kernels, let's look at some data that is not linearly separable:
In [13]:
from sklearn.datasets import make_circles
X, y = make_circles(100, factor=.1, noise=.1)
clf = SVC(kernel='linear').fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf, plot_support=False);
It is clear that no linear discrimination will ever be able to separate this data. But we can draw a lesson from the basis function regressions in In Depth: Linear Regression, and think about how we might project the data into a higher dimension such that a linear separator would be sufficient.
For example, one simple projection we could use would be to compute a radial basis function centered on the middle clump:
In [14]:
# build up the squared radius of each point, step by step
print((X ** 2)[0])          # squared coordinates of the first point
print((X ** 2)[0].sum())    # their sum: the squared distance of that point from the origin
print((X ** 2).sum(1)[0])   # the same quantity, computed for every point at once
In [15]:
r = np.exp(-(X ** 2).sum(1))
We can visualize this extra data dimension using a three-dimensional plot:
In [16]:
from mpl_toolkits import mplot3d

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')

interact(plot_3D, elev=[-90, 90], azim=(-180, 180),
         X=fixed(X), y=fixed(y));
We can see that with this additional dimension, the data becomes trivially linearly separable, by drawing a separating plane at, say, r = 0.7.
Here we had to choose and carefully tune our projection:
if we had not centered our radial basis function in the right location, we would not have seen such clean, linearly separable results.
Making such a choice is a problem: we would like to somehow automatically find the best basis functions to use. One strategy is to compute a basis function centered at every point in the dataset, and let the SVM algorithm sift through the results.
Projecting $N$ points into $N$ dimensions might become very computationally intensive as $N$ grows large; fortunately, thanks to a neat procedure known as the kernel trick, a fit on kernel-transformed data can be done implicitly, without ever building the full $N$-dimensional representation of the kernel projection.
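As a small aside (not part of the original text), the kernel itself is cheap to evaluate: for the RBF kernel, the similarity between two points depends only on their squared distance. The sketch below checks a direct computation against scikit-learn's rbf_kernel helper, with gamma=1.0 chosen arbitrarily:
from sklearn.metrics.pairwise import rbf_kernel

# two sample points from the circles data above
a, b = X[0:1], X[1:2]
gamma = 1.0  # arbitrary kernel width for this illustration

manual = np.exp(-gamma * np.sum((a - b) ** 2))
print(manual, rbf_kernel(a, b, gamma=gamma)[0, 0])  # the two values should agree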
In Scikit-Learn, we can apply kernelized SVM simply by changing our linear kernel to an RBF (radial basis function) kernel, using the kernel model hyperparameter:
In [17]:
clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)
In [18]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=300, lw=1, facecolors='none', edgecolor='black');
Using this kernelized support vector machine, we learn a suitable nonlinear decision boundary.
This kernel transformation strategy is used often in machine learning to turn fast linear methods into fast nonlinear methods, especially for models in which the kernel trick can be used.
Our discussion so far has centered on very clean datasets, in which a perfect decision boundary exists. But what if your data has some amount of overlap? For example, you may have data like this:
In [20]:
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=1)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
To handle cases like this, the SVM implementation has a bit of a fudge-factor which "softens" the margin: that is, it allows some of the points to creep into the margin if that allows a better fit.
The hardness of the margin is controlled by a tuning parameter, most often known as $C$. For very large $C$, the margin is hard, and points cannot lie in it; for smaller $C$, the margin is softer, and can grow to encompass some points.
The plot shown below gives a visual picture of how a changing $C$ parameter affects the final fit, via the softening of the margin:
In [21]:
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=1)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for axi, C in zip(ax, [50.0, 0.1]):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none', edgecolor='black')
    axi.set_title('C = {0:.1f}'.format(C), size=14)
The optimal value of the $C$ parameter will depend on your dataset, and should be tuned using cross-validation or a similar procedure (refer back to Hyperparameters and Model Validation).
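As a minimal sketch of what such tuning might look like (not part of the original text; the candidate values of C below are arbitrary), one could compare cross-validated accuracy for a few settings on the overlapping blobs above:
from sklearn.model_selection import cross_val_score

# compare a few candidate values of C by 5-fold cross-validation
for C in [0.1, 1.0, 10.0, 50.0]:
    scores = cross_val_score(SVC(kernel='linear', C=C), X, y, cv=5)
    print('C = {0:>5}: mean accuracy = {1:.3f}'.format(C, scores.mean()))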
As an example of support vector machines in action, let's take a look at the facial recognition problem. We will use the Labeled Faces in the Wild dataset, which consists of several thousand collated photos of various public figures. A fetcher for the dataset is built into Scikit-Learn:
In [22]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=57)
print(faces.target_names)
print(faces.images.shape)
Let's plot a few of these faces to see what we're working with:
In [23]:
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
Each image contains 62 × 47, or nearly 3,000, pixels.
We could proceed by simply using each pixel value as a feature, but often it is more effective to use some sort of preprocessor to extract more meaningful features; here we will use principal component analysis (see In Depth: Principal Component Analysis) to extract 150 fundamental components to feed into our support vector machine classifier.
We can do this most straightforwardly by packaging the preprocessor and the classifier into a single pipeline:
In [24]:
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
pca = PCA(n_components=150, whiten=True, svd_solver='randomized', random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
For the sake of testing our classifier output, we will split the data into a training and testing set:
In [27]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target,
                                                random_state=42)
Finally, we can use a grid search cross-validation to explore combinations of parameters.
Here we will adjust C (which controls the margin hardness) and gamma (which controls the size of the radial basis function kernel), and determine the best model:
In [29]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)
%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)
The optimal values fall toward the middle of our grid; if they fell at the edges, we would want to expand the grid to make sure we have found the true optimum.
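A quick programmatic check along these lines (a sketch, not from the original text) is to test whether any best parameter sits at the edge of its search range:
best = grid.best_params_
for name, values in param_grid.items():
    if best[name] in (min(values), max(values)):
        print('{0} = {1} lies on the edge of the grid; consider widening the search'
              .format(name, best[name]))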
Now with this cross-validated model, we can predict the labels for the test data, which the model has not yet seen:
In [30]:
model = grid.best_estimator_
yfit = model.predict(Xtest)
Let's take a look at a few of the test images along with their predicted values:
In [31]:
fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14);
We can get a better sense of our estimator's performance using the classification report, which lists recovery statistics label by label:
In [32]:
from sklearn.metrics import classification_report
print(classification_report(ytest, yfit,
target_names=faces.target_names))
We might also display the confusion matrix between these classes:
This helps us get a sense of which labels are likely to be confused by the estimator.
In [33]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=faces.target_names,
yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
For a real-world facial recognition task, in which the photos do not come pre-cropped into nice grids, the only difference in the classification scheme is the feature selection: you would need to use a more sophisticated algorithm to find the faces, and to extract features that are independent of the pixellation. For this kind of application, one good option is to make use of OpenCV, which includes pre-trained implementations of state-of-the-art feature extraction tools for images in general and faces in particular.
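As a rough, hedged sketch of that face-finding step (not part of the original text; it assumes the opencv-python package is installed and uses a hypothetical input file photo.jpg), one could crop candidate faces with OpenCV's bundled Haar cascade detector and resize them to the 62 × 47 shape used above:
import cv2

# load OpenCV's pre-trained frontal-face Haar cascade
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

# read a hypothetical photo in grayscale and detect face bounding boxes
image = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE)
boxes = cascade.detectMultiScale(image, scaleFactor=1.1, minNeighbors=5)

# crop each detection and resize to the (62, 47) shape expected by our pipeline
crops = [cv2.resize(image[y:y + h, x:x + w], (47, 62)) for (x, y, w, h) in boxes]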
We have seen here a brief intuitive introduction to the principles behind support vector machines. These are a powerful classification method for a number of reasons: their dependence on relatively few support vectors means the models are compact and fast at prediction time, they work well with high-dimensional data, and their integration with kernel methods makes them very versatile, able to adapt to many types of data.
However, SVMs have several disadvantages as well: training can scale poorly with the number of samples, the results depend strongly on a suitable choice of the softening parameter $C$, which must be found by cross-validation, and the results do not have a direct probabilistic interpretation (this can be estimated via an internal cross-validation, using the probability parameter of SVC, but the extra estimation is costly).
With those traits in mind, I generally only turn to SVMs once other simpler, faster, and less tuning-intensive methods have been shown to be insufficient for my needs.
Nevertheless, if you have the CPU cycles to commit to training and cross-validating an SVM on your data, the method can lead to excellent results.