Intro to support vector machines

2015-02-06, JM

This notebook is a gentle introduction to the support vector machine as a binary classifier. It starts with a reminder of the mechanisms of logistic regression, moves on to a geometric interpretation of SVM optimization on linearly-separable data, then demonstrates the use of a kernel method using the radial basis function ('rbf' or Gaussian) kernel, and leaves off with an illustration of the results of using the other kernel functions available in the scikit-learn implementation of SVM.


In [ ]:
import sys
import pandas as pd

import numpy as np

from sklearn.svm import SVC

import matplotlib.pyplot as plt

from matplotlib import cm

%matplotlib inline

from IPython.display import Math

from IPython.display import Image



# optional, but nice to have

try:

    from mpl_toolkits.mplot3d import Axes3D

except ImportError:    

    sys.stderr.write("mpl_toolkits not installed. NO 3D GRAPH FOR YOU!")

try:

    import seaborn as sns

except ImportError:

    sys.stderr.write("seaborn not installed, using mpl defaults. Sadness.")

Revisit logistic regression

Let's start by briefly reviewing some of the pieces of logistic regression. In an earlier session, we learned how logistic regression is formulated.

The foundation of a logistic regression implementation is the sigmoid (or logistic) activation function. This is also referred to as the hypothesis function, $h$:

$ h_{\theta}(x) = \frac{1}{1 + e^{-z}}$, where $ \theta^T x = z $.

This function takes an arbitrary input vector $x$, forms the linear combination $z = \theta^T x$ with the model vector $\theta$, and maps it to an output on [0,1], representing the probability of being in the positive class.

Below is a one-dimensional example:


In [ ]:
def sigmoid(x):
    # vector in, vector out
    return 1./(1. + np.exp(-x))

# Let's see it
df = pd.DataFrame.from_dict({"z": np.linspace(-8,8,30), 
                                "h": sigmoid(np.linspace(-8,8,30))})

#df.head()

In [ ]:
df.plot(x="z", y="h")

plt.xlabel("z")
plt.ylabel("h(z)")

So $z = \theta^Tx$, where $\theta =$ model vector, $x =$ feature vector, which means the form of $z$ depends on the dimension of our data:

  • 1D data: $z = \theta_0 + \theta_1 x_1$
  • 2D data: $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
  • ...

For a positive sample, we want $h \approx 1$, so we need to make sure that $ z \gt\gt 0 $. Similarly, for a negative sample, we want $ h \approx 0 $, so we want $ z \lt\lt 0 $.

As a result of these constraints, we use the following cost function:

$ C = - \left( y \: \text{log}(h_{\theta}(x)) + (1-y) \: \text{log}(1-h_{\theta}(x)) \right) $

$ C = -y \: \text{log} \frac{1}{1+exp(-z)} - (1-y) \: \text{log} \left( 1-\frac{1}{1+exp(-z)} \right) $

We see, below, what these functions look like for the two classes.


In [ ]:
# add piece-wise cost funcs for each class
df["LR_y=1"] = -np.log10(1.0 / (1 + np.exp(-1.0 * df["z"])))
df["LR_y=0"] = -np.log10(1.0 - 1.0 / (1 + np.exp(-1.0 * df["z"])))

# plot LR cost funcs
df[["z","LR_y=0","LR_y=1"]].plot(x="z", figsize=(8,4))

plt.ylabel("cost")
plt.title("approx. cost functions (by class)")


# uncomment to add conceptual SVM cost funcs 
_ = """
slope = 0.4
df["svm_y=1"] = df["z"].apply(lambda x: -slope*x + 0.5 if x < 1 else 0)
df["svm_y=0"] = df["z"].apply(lambda x: slope*x + 0.5 if x > -1 else 0)
plt.plot(df["z"], df["svm_y=1"], label="svm_cost1")
plt.plot(df["z"], df["svm_y=0"], label="svm_cost0")
plt.legend(loc='best')
"""

As we can see, when $y=1$, the cost for positive $z$ is low and the cost for negative $z$ is very high. This is consistent with our observations above.

The actual cost function (that gets minimized in the algorithm) also includes a sum over all the samples, a normalization by the number of samples, and possibly a regularization term. That full optimization problem becomes:

$ \text{min}_{\theta} \frac{1}{m} \sum \limits_{i=1}^{m} \lbrack y_i \left( -\text{log}\, h_{\theta}(x_i) \right) + (1-y_i) \left( -\text{log} (1-h_{\theta}(x_i)) \right) \rbrack + \frac{\lambda}{2m} \sum \limits_{j=1}^{n} \theta_j^2 $

My understanding here is that (among other options) this optimization is typically achieved with some form of maximum likelihood estimation to find the optimal parameters. With the regularization term, this flavor of logistic regression is very closely related to the linear SVM.
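
Just to make the pieces concrete, here is a minimal numpy sketch of that regularized cost (the function name and the assumption of a design matrix $X$ with a leading column of ones are my own choices; it reuses the `sigmoid` defined above and the natural log):


In [ ]:
def lr_cost(theta, X, y, lam=1.0):
    """Regularized logistic regression cost, as written above.

    Assumes X has a leading column of ones (so theta[0] is the intercept)
    and y is a 0/1 vector.
    """
    m = len(y)
    h = sigmoid(X.dot(theta))                                  # h_theta(x_i) for every sample
    data_term = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg_term = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)      # don't regularize theta_0
    return data_term + reg_term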

LR $ \rightarrow $ linear SVM

One way to motivate the linear SVM is to start with the logistic regression cost function above, and make the following changes:

  • replace the $\text{log}$ cost functions with simplified piecewise-linear versions, $\text{cost}_{n}(z)$, whose "elbow" sits at $\pm 1$ (to see this, uncomment the multi-line string in the plotting cell above)
  • drop the $ 1/m $
  • "swap" the $\lambda$ parametrization for $C$

As a result of this reformulation of the cost function (seen in the plot above), we have the following updates to our optimization objective:

  • if $y=1$, we want $ \theta^T x \ge 1 $ (not just $\ge$ 0)
  • if $y=0$, we want $ \theta^T x \le -1 $ (not just $\le$ 0)

When y=1, $z \ge 0$ would be just enough to classify correctly, but the SVM algorithm tries to enforce a "safety margin" and pushes that decision point up to e.g. $z \ge 1$. This is why it's referred to as a "large-margin" classifier. This mainly applies to linearly-separable data (which we'll focus on for now).
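
Here's a minimal sketch of those simplified cost pieces (the names `cost1`/`cost0` and the unit slope are my choices; the conceptual curves in the plotting cell above use a gentler slope): each one is zero past its elbow at $\pm 1$ and linear on the other side.


In [ ]:
# simplified (hinge-style) cost pieces with the elbow at +/- 1
def cost1(z):
    # cost for y=1 samples: zero once z >= 1, linear penalty below that
    return np.maximum(0, 1 - z)

def cost0(z):
    # cost for y=0 samples: zero once z <= -1, linear penalty above that
    return np.maximum(0, 1 + z)

zz = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("z      = {}".format(zz))
print("cost_1 = {}".format(cost1(zz)))
print("cost_0 = {}".format(cost0(zz)))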

Implementation

The optimization objective for logistic regression (above) had the appearance:

$ \text{min}_{\theta} A + \lambda B$, where $\lambda$ adjusts trade-off between training data and regularization.

The optimization problem for the SVM algorithm now looks like:

$ \text{min}_{\theta} C \sum \limits_{i=1}^{m} \lbrack y_i \text{cost}_1(\theta^Tx) + (1-y_i) \text{cost}_0(\theta^Tx) \rbrack + \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 $

which has the appearance:

$ \text{min}_{\theta} C A + B$, where $C$ adjusts the trade-off between training data and regularization.

If we set $C = 1 / \lambda$ and multiply the first objective through by $C$, the two problem statements take the same form and share the same minimizer (but recall that the underlying per-sample cost functions are slightly different).
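
To see the $C$ trade-off in action, here's a hedged sketch (my own toy blobs, not part of this notebook's data) comparing how many support vectors sklearn's SVC retains for a small and a large $C$; a larger $C$ penalizes margin violations more heavily and typically keeps fewer points inside the margin:


In [ ]:
from sklearn.datasets import make_blobs

# two slightly-overlapping blobs (toy data, for illustration only)
Xc, yc = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    model = SVC(kernel='linear', C=C).fit(Xc, yc)
    print("C = {:>6}: {} support vectors".format(C, len(model.support_)))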

Example

To get a gut feel for where we're headed, let's just show a result up front. You can keep this in mind while we work below. A legend for the figure below:

  • points: data (colored by class)
  • solid line: separating hyperplane
  • dashed lines: margins
  • open circles: support vectors

(nb: this graphic works about 90% of the time; for now, just hit Ctrl-Enter until no points fall inside the margins)


In [ ]:
%run svm-example.py

Handrolled SVM

Ok, now let's see if we can implement one by hand - at least, conceptually.

Consider the minimization task above:

$ \text{min}_{\theta} C \sum \limits_{i=1}^{m} \lbrack y_i \text{cost}_1(\theta^Tx) + (1-y_i) \text{cost}_0(\theta^Tx) \rbrack + \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 \rightarrow \text{min}_{\theta} C A + B $

To get a feel for how this works, let's do the following:

Consider the case where $C$ is really large. The first priority of the minimization is then to get $A=0$. When $y=1$, we can make $A=0$ if $\theta^T x \ge 1$. Conversely, when $y=0$, $A=0$ when $\theta^T x \le -1$. With $CA=0$, the remaining task is essentially to minimize the $B$ term, $ \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 $, subject to the constraint that $\theta^T x \ge 1$ if $y=1$ (or $\le -1$ if $y=0$).

Since these are just inner products, we're essentially trying to make the $\theta$ vector point in the same direction as our $y=1$ data, while keeping the magnitude somewhat small.
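
As a throwaway numeric illustration of that inner-product intuition (these vectors are made up and not part of the example data below):


In [ ]:
# a theta roughly aligned with the y=1 side, and two made-up sample points
theta = np.array([-1.0, 0.5])
x_pos = np.array([-2.0, 2.0])    # a point we'd like to label y=1
x_neg = np.array([2.0, -1.0])    # a point we'd like to label y=0

print("theta . x_pos = {}".format(theta.dot(x_pos)))   # comfortably >= 1
print("theta . x_neg = {}".format(theta.dot(x_neg)))   # comfortably <= -1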

Data

Start with some 2D data. This means our $(n-1)$-dimensional hyperplane is going to be $2-1=1$ dimensional (i.e. a line). The equation of a line in 2D has the form $a x + b y = c$, so $\theta = [\theta_0 \; \theta_1 \; \theta_2 ]$.

For simplicity, I'm choosing the data so the separating hyperplane goes through the origin. In practice, this lets us set $\theta_0 = 0$ and ignore it for the rest of the time. Here's what the data looks like:


In [ ]:
# make up some data
d = {"x" :[1, 1, 2, 2,  -1, -2, -2, -3], 
    "y" :[1, -1.1, 2, -1, -1, 2, -2, 2], 
    "label":[0, 0, 0, 0, 1, 1, 1, 1]}

df = pd.DataFrame(d)


#sns.lmplot('x', 'y', df, hue='label', size=7, fit_reg=False)
df.plot(x='x', y='y', kind='scatter', c='label', s=100, figsize=(10,6))

Geometric interpretation

Consider what it actually means to (as we said above) minimize the $B$ term. It's equivalent to minimizing the (squared) norm, or length, of the $\theta$ vector:

$\text{min}_\theta \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 \rightarrow \text{min}_\theta \frac{1}{2} \Vert \theta \Vert ^2 $, such that $ \theta^T x \ge 1 $ if $y=1$ (or $\le -1$ if $y=0$)

Another way to think about the constraints $\theta^T x \ge 1$ if $y=1$ (or $\le -1$ if $y=0$) is in terms of the projection, $p^i$, of $x^i$ onto $\theta$:

$\theta^T x^i = p^i \cdot \Vert \theta \Vert$
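
A quick numeric check of that identity, using throwaway vectors of my own choosing (here $p$ is the signed scalar projection):


In [ ]:
# check that theta^T x equals p * ||theta|| for an arbitrary pair of vectors
theta = np.array([2.0, 1.0])
x_i = np.array([1.0, 3.0])

p = theta.dot(x_i) / np.linalg.norm(theta)     # signed length of the projection of x_i onto theta
print("theta^T x     = {}".format(theta.dot(x_i)))
print("p * ||theta|| = {}".format(p * np.linalg.norm(theta)))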

See the figure below for an example of this:


In [ ]:
# image from Ng's ML Coursera MOOC
# https://www.coursera.org/course/ml
Image(filename='projection.png')

With this in mind, we can restate the optimization goal with these constraints:

$ \text{min}_\theta \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 \rightarrow \text{min}_\theta \frac{1}{2} \Vert \theta \Vert ^2 $ such that

$ p^i \cdot \Vert \theta \Vert \ge 1$ if $y=1$

$ p^i \cdot \Vert \theta \Vert \le -1$ if $y=0$
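
For example, if the nearest positive point projects onto $\theta$ with $p^i = 0.5$, the constraint forces $\Vert \theta \Vert \ge 2$; if we can orient $\theta$ so that $p^i = 2$, then $\Vert \theta \Vert \ge 0.5$ is enough. Bigger projections of the nearest points allow a shorter $\theta$, which is exactly what the minimization wants.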

Pick a $\theta$

Using the data above, let's pick a naive $ \theta^T = \lbrack -2 \; -1.2 \rbrack $. We'll again indicate the support vectors with an extra empty circle around the data, and the decision boundary with the dashed line. The projections $p^i$ will be illustrated by colored lines.

(note: I'm only illustrating two support vectors for simplicity... we actually need three to create a stable margin.)


In [ ]:
#
#
# don't try to interpret all of this cell's code for now! The picture is 
#   the important part (and I didn't have time to hide it in a module)!
#
#

def projection(x, y):
    """Helper fnc to return the vector projection of y onto x"""
    x_hat = x / np.linalg.norm(x)
    cos = x.dot(y) / ( np.linalg.norm(y) * np.linalg.norm(x) )
    return np.linalg.norm(y) * cos * x_hat 


# pick a theta (use 3d for ease of cross product)
t = np.array([-2,-1.2,0])


######## calculate things #########

# calculate a vector orthogonal to t
o_t = np.cross(t, [0,0,-1])

# points for the plane
xx = np.linspace(-4, 4)
yy = (o_t[1] / o_t[0]) * xx

# support vector projections
# -2,2
idx_0 = 5
p0 = np.array([df.x[idx_0], df.y[idx_0], 0])
proj0 = projection(t, p0)

# 1,-1.1
idx_1 = 1
p1 = np.array([df.x[idx_1], df.y[idx_1], 0])
proj1 = projection(t, p1)


######## plot things #########

# data
df.plot(x='x', y='y', kind='scatter', c='label', s=100, figsize=(10,6))

# theta 
ax = plt.gca()
ax.arrow(0, 0, t[0], t[1], head_width=0.25, head_length=0.25, fc='r', ec='r', label='theta')

# plane
plt.plot(xx, yy, 'k--', label='plane')

# support vectors
plt.scatter([df.x[idx_1], df.x[idx_0]], [df.y[idx_1], df.y[idx_0]], s=300, facecolors='none', edgecolors='k')

# projections
plt.plot([0,proj0[0]], [0,proj0[1]], 'b', linewidth=5)
plt.plot([0,proj1[0]], [0,proj1[1]], 'y', linewidth=5)

_ = plt.legend()
_ = plt.ylim(-4,4)

In [ ]:
print('norm of p0 = {:.3}'.format(np.linalg.norm(proj0)))
print('norm of p1 = {:.3}'.format(np.linalg.norm(proj1)))

For this choice of $\theta$, the resulting projections $p^i$ are rather small, so $ \Vert \theta \Vert$ has to be large to satisfy the conditions. Since we're trying to minimize $ \Vert \theta \Vert $, this is not ideal.

Let's adjust $\theta$ (by hand) so that the projections of the nearest points to the plane are larger; recall that this was the geometric interpretation of our objective. Note that the support vectors have changed since the closest points to the boundary are now different than before.


In [ ]:
#
#
# same disclaimer about content!
#
#

# pick a theta
t = np.array([-1.25,1e-4,0])


######## calculate things #########

# calculate a vector orthogonal to t
o_t = np.cross(t, [0,0,-1])

# points for the plane
xx = np.linspace(-4, 4)
yy = (o_t[1] / o_t[0]) * xx

# support vector projections
# -1,1
idx_0 = 4
p0 = np.array([df.x[idx_0], df.y[idx_0], 0])
proj0 = projection(t, p0)

# 1,-1.1
idx_1 = 1
p1 = np.array([df.x[idx_1], df.y[idx_1], 0])
proj1 = projection(t, p1)


######## plot things #########

# data
df.plot(x='x', y='y', kind='scatter', c='label', s=100, figsize=(10,6))

# theta 
ax = plt.gca()
ax.arrow(0, 0, t[0], t[1], head_width=0.25, head_length=0.25, fc='r', ec='r', label='theta')

# plane
plt.plot(xx, yy, 'k--', label='plane')

# support vectors
plt.scatter([df.x[idx_1], df.x[idx_0]], [df.y[idx_1], df.y[idx_0]], s=300, facecolors='none', edgecolors='k')

# projections
plt.plot([0,proj0[0]], [0,proj0[1]], 'b', linewidth=5)
plt.plot([0,proj1[0]], [0,proj1[1]], 'y', linewidth=5)

_ = plt.legend()
_ = plt.ylim(-4,4)
_ = plt.xlim(-4,4)

In [ ]:
print('norm of p0 = {:.3}'.format(np.linalg.norm(proj0)))
print('norm of p1 = {:.3}'.format(np.linalg.norm(proj1)))

We can see that the projections $p^i$ have increased in magnitude, which means $\Vert \theta \Vert$ can be smaller. Thus, we're definitely getting closer to our optimization objective. Hooray!

(aside: the observant among you will notice that I said nothing about the magnitude of $\theta$ during this example. That's true, and it's an important point; for now, I'm only illustrating how we orient the separating plane.)

The Right Way ®

Ok, so as always, what we've just done is a terrible idea.

Let's see what it looks like when we use one of the SVM implementations in scikit-learn and extract all the pieces we need to see the decision boundary and the actual $\theta$ vector.


In [ ]:
# fit an SVM
clf = SVC(kernel='linear').fit(df[['x','y']].values, d['label'])

# back out the linear model coefficients ==> theta vector
w = clf.coef_[0]

# calculate the slope
a = -w[0] / w[1]

# span a space around the data points
xxx = np.linspace(-4, 4)
yyy = a * xxx - (clf.intercept_[0]) / w[1]


# plot data
df.plot(x='x', y='y', kind='scatter', c='label', s=100, figsize=(10,6))

# plot separating plane
plt.plot(xxx, yyy, 'k--', linewidth=1, label='separating plane')

# plot support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=300, facecolors='none', edgecolors='k')
    
# calculated theta vector
ax = plt.gca()
ax.arrow(0, 0, w[0], w[1], head_width=0.25, head_length=0.25, fc='r', ec='r', label='theta')

_ = plt.legend()
_ = plt.ylim(-4,4)

print "*\n* theta = {} \n*".format(w)

We can see that our guess was pretty close (I may or may not have designed it that way). Here, we can also see that there are three support vectors - the minimum needed to keep this particular large margin stable.
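
As a quick sanity check on the fitted model (reusing `clf` and `w` from the cell above), the margins sit where the decision function equals $\pm 1$, so the full margin width is $2 / \Vert \theta \Vert$:


In [ ]:
# margin width implied by the fitted coefficients
print("||theta||    = {:.3f}".format(np.linalg.norm(w)))
print("margin width = {:.3f}".format(2.0 / np.linalg.norm(w)))
print("support vectors:\n{}".format(clf.support_vectors_))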

"Soft-margin" classifiers and nonlinear boundaries

So far, we've only talked about the way an SVM works in a narrow space of problems: linearly-separable data. The algorithm we've been using in this case (the original form of the SVM, ca. 1963) is known as a "hard-margin" SVM (and what we've written above is its "primal" formulation). In this form, no data is allowed inside the margin, and if there's no straight line that separates the classes, the optimization has no solution.

Obviously not all data is linearly-separable in the feature space, so in the 90s the "soft-margin" SVM was introduced. It follows the same idea, but allows data points to fall inside the margin, or even on the wrong side of the boundary. The soft-margin SVM is generally what's implemented when you use an SVM in an ML library, and it's what sklearn gives you, too.

To see how the soft-margin aspect enters, let's add a bunch of additional data so it's more likely that we'll have some class overlap.

(nb: the blobs are randomly-generated, so you may observe linearly-separable distributions. Hit Ctrl-Enter a few times until you see a case where points fall on both sides of the decision plane and the margin plotting isn't too confused)


In [ ]:
# argument is the number of points (total)
%run svm-example.py 200

As you might expect, the soft-margin version generalizes much better when it comes to predicting new data.

Non-linear boundaries and "the kernel trick"

Hopefully now we've got a sense for how the linear kernel works in an SVM. Many argue that one big advantage of the SVM as a learning method is its compatibility with alternative kernels - something that is possible with other algorithms, but seemingly less commonly implemented.

The "kernel trick" is something you'll read about in machine learning literature. It has to do with reformulating the optimization problem in a clever, more efficient manner (you can read more here). The short version is that the use of kernels allows the algorithm to optimize for a large-margin (most likely still "soft") in a transformed feature space other than that of the data.

Consider a set of data like the one below; no linear boundary $\lbrack \theta_1 \; \theta_2 \rbrack$ in $\mathbf{R}^2$ can separate these. We want a new feature vector in a higher dimension so that we can still calculate $ \theta^T x = \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \dots$; we just need to come up with the right $f$.


In [ ]:
from sklearn.datasets import make_circles

x, y = make_circles(n_samples=500, factor=0.2, noise=.15)

plt.figure(figsize=(8,6))
plt.scatter(x[:,0], x[:,1], c=y, s=100, alpha=0.75)
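
We could, in principle, build such an $f$ by hand. As a hedged illustration (this particular mapping is my own choice, not something the SVM does for us): append the squared radius $x_1^2 + x_2^2$ as a third feature, and a plain linear SVM separates these circles easily.


In [ ]:
# hand-rolled feature map: (x1, x2) -> (x1, x2, x1^2 + x2^2)
x3 = np.column_stack([x[:, 0], x[:, 1], x[:, 0]**2 + x[:, 1]**2])

clf_lin = SVC(kernel='linear').fit(x3, y)
print("training accuracy with the explicit feature map: {:.3f}".format(clf_lin.score(x3, y)))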

It's conceivable that one could come up with an arbitrary polynomial to wiggle between the classes, but polynomials can quickly grow complex, so - computationally - they're not always a good solution. Instead, we'll focus on the use of kernels.

In the svm.SVC implementation of SVM, there are four kernels available: linear, polynomial, rbf (aka Gaussian), and sigmoid. For now, we'll only talk about the rbf kernel.

Consider a point in space, $l^i$, and define a function $f$ that describes the similarity between a new point $x$ and the reference point $l^i$, based on how far away it is:

$ f = \text{exp} \left( - \frac{\Vert x - l^i \Vert ^2}{2 \sigma^2} \right) $

To get a sense for this function, let's plot its value over a 2D grid with $l^i$ at the origin, $(0,0)$:


In [ ]:
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)

R = np.sqrt(X**2 + Y**2)
sig = 1
Z = np.exp(-R**2 / (2*sig**2))

fig = plt.figure(figsize=(6,4))
ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(X,Y,Z, rstride=1, cstride=1, cmap=cm.coolwarm, linewidth=0, antialiased=False)
_ = ax.set_zlim(0, 1)

This similarity function is the Gaussian kernel. It takes two vectors as inputs and returns a single number: a similarity that is 1 when the points coincide and falls toward 0 as they move apart.

In practice, we replace the original feature vector $x$ with a new vector $f$, whose $i$th element is the kernel function evaluated between the sample point and the $i$th landmark (for the fitted SVM, those landmarks end up being the support vectors).
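
Here's a hedged sketch of that substitution (the helper name, the landmark choices, and the value of $\sigma$ are all mine):


In [ ]:
def rbf_features(x_sample, landmarks, sigma=1.0):
    """Map a single sample to its vector of Gaussian similarities to each landmark."""
    x_sample = np.asarray(x_sample, dtype=float)
    return np.array([np.exp(-np.linalg.norm(x_sample - l)**2 / (2 * sigma**2))
                     for l in landmarks])

# three made-up landmarks in 2D
landmarks = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
print(rbf_features([0.5, 0.5], landmarks))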

We end up effectively with the same minimization task, but in a new space (of $f$ instead of $x$):

$ \text{min}_{\theta} C \sum \limits_{i=1}^{m} \lbrack y_i \text{cost}_1(\theta^T f) + (1-y_i) \text{cost}_0(\theta^T f) \rbrack + \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 $

and we optimize $ \theta $, for the set of $f$s, such that $ \theta^T f \ge 1 $ if $y=1$ and $ \theta^T f \le -1 $ if $y=0$, just like before.

An interesting property of these kernel functions (that we won't delve into here) is that this approach is computationally pretty cheap, since it doesn't have to explicitly build a higher-dimensional representation of the vectors.
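
For instance, sklearn can evaluate the Gaussian kernel between all pairs of samples directly (its `rbf_kernel` is parametrized by gamma, where gamma $= 1/(2\sigma^2)$); the resulting $m \times m$ kernel matrix is all the optimizer needs, with no explicit high-dimensional feature vectors:


In [ ]:
from sklearn.metrics.pairwise import rbf_kernel

# kernel (Gram) matrix between all pairs of the circle samples from above
K = rbf_kernel(x, gamma=0.5)      # gamma = 1/(2*sigma^2), so gamma=0.5 <=> sigma=1
print("kernel matrix shape: {}".format(K.shape))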

Let's see how the rbf/Gaussian kernel performs on the kind of data we looked at above:


In [ ]:
%run rbf-circles.py

Pretty cool! For these (artificially-generated) classes, this kernel works pretty well.

Other SVM kernels

Finally, I'll illustrate the other kernels available in sklearn's svm.SVC implementation. These fits all use the same set of data and the default settings for each kernel; note the wide range of decision boundaries.
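
(For reference, the same comparison can be reproduced directly by looping over the kernel names; this rough sketch just prints training accuracy on the circles data from earlier, so re-run that data cell first if the helper scripts have overwritten x and y.)


In [ ]:
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = SVC(kernel=kernel).fit(x, y)
    print("{:>8} kernel: training accuracy = {:.3f}".format(kernel, model.score(x, y)))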


In [ ]:
%run kernel-examples.py

What's next?

Well, based on that last set of figures, the next obvious question is: "WTF?! How do you choose an appropriate kernel when they perform so differently?"

Great question! At the moment, I have no idea. This topic would make a great followup session!

The End


References

Links that I found helpful and from which I cobbled together this session:

