This notebook uses material from a tutorial given by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2014/).

Support Vector Machine tutorial

Support vector machines (or SVMs) are a popular supervised classification method. An SVM attempts to find a classification boundary that maximizes the separability of the classes; that is, it tries to maximize the distance (called the margin) between the decision boundary and the closest data points of each class.

Scikit-learn has a great SVM tutorial if you want more detailed information.

Toy Dataset Illustration

Here, we will use a toy (or overly simple) dataset of two classes which can be perfectly separated with a single, straight line.

The solid line is the decision boundary, dividing the red and blue classes. Notice that on either side of the boundary, there is a dotted line that passes through the closest datapoints. The distance between the solid boundary line and this dotted line is what an SVM tries to maximize.

The points that touch the dotted lines are called "support vectors". These points are the only ones that matter when determining boundary locations. All other datapoints can be added, moved, or removed from the dataset without changing the classification boundary, as long as they do not cross that dotted line.
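
The figure described above is not reproduced in this notebook, so below is a minimal sketch (an addition to the original material, not Jake Vanderplas's code) that generates a similar illustration. It assumes scikit-learn's make_blobs for a perfectly separable two-class toy dataset and uses a linear SVC with a large C value to approximate a hard margin; the solid line is the decision boundary, the dotted lines are the margins, and the circled points are the support vectors.


In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so the classes are perfectly separable
X_toy, y_toy = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

# A large C approximates a "hard" margin: no points are allowed inside it
toy_clf = SVC(kernel='linear', C=1e6)
toy_clf.fit(X_toy, y_toy)

# Plot the data points
plt.scatter(X_toy[:, 0], X_toy[:, 1], c=y_toy, cmap='bwr')
ax = plt.gca()

# Evaluate the decision function on a grid and draw the boundary (level 0)
# and the margin lines (levels -1 and +1)
xs = np.linspace(ax.get_xlim()[0], ax.get_xlim()[1], 50)
ys = np.linspace(ax.get_ylim()[0], ax.get_ylim()[1], 50)
XX, YY = np.meshgrid(xs, ys)
Z = toy_clf.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
ax.contour(XX, YY, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])

# Circle the support vectors: the only points that determine the boundary
ax.scatter(toy_clf.support_vectors_[:, 0], toy_clf.support_vectors_[:, 1],
           s=150, facecolors='none', edgecolors='k')
plt.show()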

Setup

Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plotting colors), datasets (to load the iris dataset from scikit-learn), SVC (scikit-learn's support vector classifier), and train_test_split and KFold (for splitting the data when evaluating our models).

Also create the color maps used to color the plotted data, and labelList, a list of colored rectangles used in the plot legends.


In [ ]:
# Print figures in the notebook
%matplotlib inline 

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets # Import the dataset from scikit-learn
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, KFold

# Import patch for drawing rectangles in the legend
from matplotlib.patches import Rectangle

# Create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])

# Create a legend for the colors, using rectangles for the corresponding colormap colors
labelList = []
for color in cmap_bold.colors:
    labelList.append(Rectangle((0, 0), 1, 1, fc=color))

Import the dataset

Import the dataset and store it in a variable called iris. Scikit-learn's explanation of the dataset is here. This dataset behaves like a Python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']

The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.

The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.

Below, we load the labels into y, the corresponding label names into labelNames, the data into X, and the names of the features into featureNames.


In [ ]:
# Import some data to play with
iris = datasets.load_iris()

# Store the labels (y), label names, features (X), and feature names
y = iris.target       # Labels are stored in y as numbers
labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2
X = iris.data
featureNames = iris.feature_names

Below, we plot the first two features from the dataset (sepal length and width). Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily.

Then we plot the data to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting.


In [ ]:
# Plot the data

# Sepal length and width
X_small = X[:,:2]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5
y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Sepal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Support vector machines: training

Next, we train an SVM classifier on our data.

The first line creates our classifier using the SVC() function. For now we can ignore the parameter kernel='linear'; this just means the decision boundaries will be straight lines. The second line uses the fit() method to train the classifier on the features in X_small, using the labels in y.

It is safe to ignore the parameter 'decision_function_shape'. This is not important for this tutorial, but its inclusion prevents warnings from Scikit-learn about future changes to the default.


In [ ]:
# Create an instance of SVM and fit the data.
clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X_small, y)

Plot the classification boundaries

Now that we have our classifier, let's visualize what it's doing.

First we plot the decision spaces, or the areas assigned to the different labels (species of iris). Then we plot our examples onto the space, showing where each point lies and the corresponding decision boundary.

The colored background shows the areas that are considered to belong to a certain label. If we took sepal measurements from a new flower, we could plot it in this space and use the color to determine which type of iris our classifier believes it to be.


In [ ]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (SVM)")
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Making predictions

Now, let's say we go out and measure the sepals of two new iris plants, and want to know what species they are. We're going to use our classifier to predict the flowers with the following measurements:

| Plant | Sepal length (cm) | Sepal width (cm) |
|-------|-------------------|------------------|
| A     | 4.3               | 2.5              |
| B     | 6.3               | 2.1              |

We can use our classifier's predict() function to predict the labels for our input features. We pass the variable examples to predict(); it is a list in which each element is another list containing the features (measurements) for a single example. The output is a list of labels corresponding to the input examples.


In [ ]:
# Add our new data examples
examples = [[4.3, 2.5], # Plant A
            [6.3, 2.1]] # Plant B


# Create an instance of SVM and fit the data
clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X_small, y)

# Predict the labels for our new examples
labels = clf.predict(examples)

# Print the predicted species names
print('A: ' + labelNames[labels[0]])
print('B: ' + labelNames[labels[1]])

Plotting our predictions

Now let's plot our predictions to see why they were classified that way.


In [ ]:
# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (SVM)")
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

What are the support vectors in this example?

Below, we define a function to plot the solid decision boundary and corresponding dashed lines, as shown in the introductory picture. Because there are three classes to separate, there will now be three sets of lines.


In [ ]:
def plot_svc_decision_function(clf):
    """Plot the decision function for a 2D SVC"""
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros((3,X.shape[0],X.shape[1]))
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            P[:, i,j] = clf.decision_function([[xi, yj]])[0]
    for ind in range(3):
        plt.contour(X, Y, P[ind,:,:], colors='k',
                levels=[-1, 0, 1],
                linestyles=['--', '-', '--'])

And now we plot these lines on top of our previous plot.


In [ ]:
# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (SVM)")
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Plot the legend
plt.legend(labelList, labelNames)

plot_svc_decision_function(clf) # Plot the decision function

plt.show()

This plot is much more visually cluttered than our previous toy example. There are a few points worth noticing if you take a closer look.

First, notice how each of the three solid lines runs right along one of the decision boundaries; these lines determine the boundaries between the classification areas (where the background colors change).

Additionally, while the parallel dotted lines still pass through one or more support vectors, there are now data points located between the decision boundary and the dotted line (and even on the wrong side of the decision boundary!). This happens when our data is not "perfectly separable". A perfectly separable dataset is one where the classes can be separated completely with a single, straight (or at least simple) line. While it makes for nice examples, real-world machine learning data is almost never perfectly separable.
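
If you want to inspect the support vectors directly, the fitted classifier exposes them as attributes. The short sketch below is an addition to the original tutorial; it assumes clf is still the linear SVC trained on X_small above and simply prints how many support vectors each class contributes. With data that is not perfectly separable, many points end up as support vectors.


In [ ]:
# Inspect the support vectors of the fitted linear classifier (clf from above)
print('Support vectors per class:', clf.n_support_)
print('Total support vectors:', len(clf.support_vectors_))

# The support vectors themselves (sepal length, sepal width); first five shown
print(clf.support_vectors_[:5])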

Kernels: Changing The Decision Boundary Lines

In our previous example, all of the decision boundaries are straight lines. But what if our data is grouped into more circular clusters? Maybe a curved line would separate the data better.

SVMs use something called kernels to determine the shape of the decision boundary. Remember that when we called the SVC() function we gave it the parameter kernel='linear', which made the boundaries straight. A different kernel, the radial basis function (RBF), groups data into circular clusters instead of dividing the space with straight lines.

Below we show the same example as before, but with an RBF kernel.


In [ ]:
# Create an instance of SVM and fit the data.
clf = SVC(kernel='rbf', decision_function_shape='ovo') # Use the RBF kernel this time
clf.fit(X_small, y)

# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (SVM)")
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

The classification areas are very similar to before, but now their boundaries are curved instead of straight. Now let's add the decision function lines, as we did for the linear kernel.


In [ ]:
# Create an instance of SVM and fit the data.
clf = SVC(kernel='rbf', decision_function_shape='ovo') # Use the RBF kernel this time
clf.fit(X_small, y)

# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (SVM)")
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Plot the legend
plt.legend(labelList, labelNames)

plot_svc_decision_function(clf) # Plot the decision function

plt.show()

Now the plot looks very different from before! The solid black lines are now curves, but each decision boundary still falls along part of those lines. And instead of dotted lines parallel to the solid ones, there are smaller ellipses on either side of each solid line.

What other kernels exist?

Scikit-learn comes with two other built-in kernels: polynomial and sigmoid. Advanced users can also create their own kernels, but we will stick to the built-in ones for now.

Below, modify our previous examples to try out the other kernels. How do they change the decision boundaries?


In [ ]:
# Your code here!
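
If you want a starting point, here is one possible sketch (an addition to the original notebook, not the official solution). It loops over the 'poly' and 'sigmoid' kernels and reuses the mesh-plotting pattern from the earlier cells, leaving kernel parameters such as degree and gamma at scikit-learn's defaults.


In [ ]:
# One possible approach: plot the decision areas for the 'poly' and 'sigmoid' kernels
h = .02  # step size in the mesh
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

for kernel in ['poly', 'sigmoid']:
    # Train a classifier with this kernel
    clf = SVC(kernel=kernel, decision_function_shape='ovo')
    clf.fit(X_small, y)

    # Predict at every point in the mesh to find the classification areas
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title('3-Class classification (SVM, %s kernel)' % kernel)
    plt.xlabel('Sepal length (cm)')
    plt.ylabel('Sepal width (cm)')
    plt.legend(labelList, labelNames)
    plt.show()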

What about my other features?

We've been looking at two features: the length and width of the plant's sepal. But what about the other two features, petal length and width? What does the graph look like when you train on the petal length and width? How does it change when you change the SVM kernel?

How would you plot our two new plants, A and B, on these new plots? Assume we have all four measurements for each plant, as shown below.

| Plant | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) |
|-------|-------------------|------------------|-------------------|------------------|
| A     | 4.3               | 2.5              | 1.5               | 0.5              |
| B     | 6.3               | 2.1              | 4.8               | 1.5              |

In [ ]:
# Your code here!
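
One possible way to start (again an addition, not part of the original tutorial): petal length and petal width are columns 2 and 3 of iris.data, and plants A and B take the petal values from the table above.


In [ ]:
# Train on petal length and width (columns 2 and 3 of the feature matrix)
X_petal = X[:, 2:4]

petal_examples = [[1.5, 0.5],  # Plant A: petal length, petal width
                  [4.8, 1.5]]  # Plant B

clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X_petal, y)

# Plot the decision areas, the training points, and the two new plants
h = .02  # step size in the mesh
x_min, x_max = X_petal[:, 0].min() - 1, X_petal[:, 0].max() + 1
y_min, y_max = X_petal[:, 1].min() - 1, X_petal[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(X_petal[:, 0], X_petal[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title('3-Class classification (SVM, petal features)')
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.text(petal_examples[0][0], petal_examples[0][1], 'A', fontsize=14)
plt.text(petal_examples[1][0], petal_examples[1][1], 'B', fontsize=14)
plt.legend(labelList, labelNames)
plt.show()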

Using more than two features

Sticking to two features is great for visualization, but is less practical for solving real machine learning problems. If you have time, you can experiment with using more features to train your classifier. It gets much harder to visualize the results with 3 features, and nearly impossible with 4 or more. There are techniques that can help you visualize the data in some form, and there are also ways to reduce the number of features you use while still retaining (hopefully) the same amount of information. However, those techniques are beyond the scope of this class.
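
If you just want a quick sense of whether the extra features help, the small sketch below (an addition to the original material) fits one linear classifier on the two sepal features and one on all four features, then compares their accuracy on the training data. Keep in mind that training accuracy is optimistic; the next section shows the proper way to evaluate a classifier on held-out test data.


In [ ]:
# Compare training accuracy using 2 features vs. all 4 features
# (training accuracy is optimistic; see the next section for a real evaluation)
clf_2 = SVC(kernel='linear', decision_function_shape='ovo').fit(X[:, :2], y)
clf_4 = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)

print('Training accuracy, 2 features: %.2f' % np.mean(clf_2.predict(X[:, :2]) == y))
print('Training accuracy, 4 features: %.2f' % np.mean(clf_4.predict(X) == y))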


In [ ]:

Evaluating The Classifier

In order to evaluate a classifier, we need to split our dataset into training data, which we'll show to the classifier so it can learn, and testing data, which we will hold back from training and use to test its predictions.

Below, we create the training and testing datasets, using all four features. We then train our classifier on the training data, and get the predictions for the test data.


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create an instance of SVM and fit the data
clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X_train, y_train)

# Predict the labels of the test data
predictions = clf.predict(X_test)

Next, we evaluate how well the classifier did. The easiest way to do this is to compute the proportion of predictions that are correct, usually referred to as the accuracy.


In [ ]:
accuracy = np.mean(predictions == y_test) * 100

print('The accuracy is %.2f' % accuracy + '%')

Comparing Models with Cross-Validation

To select which kernel to use in our model, we use cross-validation. We can then get our final result using our held-out test data.

First we choose the kernels we want to investigate, then divide our training data into folds. We loop through the folds; each time, we train each model on the training portion and evaluate it on the validation portion. We store the accuracy of each classifier on each fold so we can compare them later.


In [ ]:
# Choose our kernels
kernels = ['linear', 'rbf']

# Create a dictionary of arrays to store accuracies
accuracies = {}
for kernel in kernels:
    accuracies[kernel] = []

# Loop through 5 folds
kf = KFold(n_splits=5)
for trainInd, valInd in kf.split(X_train):
    X_tr = X_train[trainInd,:]
    y_tr = y_train[trainInd]
    X_val = X_train[valInd,:]
    y_val = y_train[valInd]
    
    # Loop through each kernel
    for kernel in kernels:
        
        # Create the classifier
        clf = SVC(kernel=kernel, decision_function_shape='ovo')
        
        # Train the classifier
        clf.fit(X_tr, y_tr) 
        
        # Make our predictions
        pred = clf.predict(X_val)
    
        # Calculate the accuracy
        accuracies[kernel].append(np.mean(pred == y_val))

Select a Model

To select a model, we look at the average accuracy across all folds.


In [ ]:
for kernel in kernels:
    print('%s: %.2f' % (kernel, np.mean(accuracies[kernel])))

Final Evaluation

The linear kernel gives us the highest average accuracy, so we select it as our best model. Now we can evaluate it on our test set and get our final accuracy rating.


In [ ]:
clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

accuracy = np.mean(predictions == y_test) * 100

print('The final accuracy is %.2f' % accuracy + '%')

Sidenote: Randomness and Results

Every time you run this notebook, you will get slightly different results. Why? Because the data is randomly divided among the training, testing, and validation sets. Running the code again will create a different division of the data and make the results slightly different. However, the overall outcome should remain consistent, with approximately the same values. If you get drastically different results when running an analysis multiple times, it suggests a problem with your model or that you need more data.

If it's important that you get exactly the same results every time you run the code, you can specify a random state in the random_state argument of train_test_split() and KFold().


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
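
For KFold, the random state only has an effect when the folds are shuffled, so you also need shuffle=True. A minimal sketch (an addition to the original notebook):


In [ ]:
# Shuffle the data before splitting into folds, with a fixed random state
# so the folds are identical every time the notebook is run
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for trainInd, valInd in kf.split(X_train):
    print('Fold sizes: %d train, %d validation' % (len(trainInd), len(valInd)))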

In [ ]: