Support vector machines

In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

https://en.wikipedia.org/wiki/Support_vector_machine


In [ ]:
# SVM Regression
import numpy as np
from sklearn import datasets
from sklearn.svm import SVR
import pandas as pd

In [ ]:
# Load the diabetes dataset
# for info on this dataset, refer to the linear_regression script
dataset = datasets.load_diabetes()

In [ ]:
#Let us now build a pandas dataframe hosting the data at hand

# We first need the list of feature names for our columns
# BMI is the Body Mass Index
# ABP is the Average Blood Pressure
lfeat = ["Age", "Sex", "BMI", "ABP", "S1", "S2", "S3", "S4", "S5", "S6"]

In [ ]:
# We now build the Dataframe, with the data as argument
# and the list of column names as keyword argument
df_diabetes = pd.DataFrame(dataset.data, columns = lfeat)

In [ ]:
# We also want to add the regression target
# Let's create a new column :
df_diabetes["Target"] = dataset.target # Must have the correct size of course

In [ ]:
# Let's have a look at the first few entries
print("Printing data up to the 5th sample")
print(df_diabetes.iloc[:5, :])  # Look at the first 5 samples for all features.

In [ ]:
# We are now going to fit a SVR model to the data

# Please have a look at svm_classification.py first
# SVR is basically an adaptation of the SVM method to regression
# where we modify the constraints of the optimisation problem
# so that the prediction target does not deviate from the model
# more than a specified threshold

#As before, we create an instance of the model
model = SVR()

In [ ]:
# Which we then fit to the training data X, Y
# with pandas we have to split the df in two :
# the feature part (X) and the target part (Y)
# This is done below :

data = df_diabetes[lfeat].values
target = df_diabetes["Target"].values
model.fit(data, target)
print(model)
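
Since the model above is created with `SVR()` and no arguments, it can help to list the settings it picked by default. The cell below is a minimal sketch, assuming a reasonably recent scikit-learn; the name `default_params` is only illustrative and not part of the original notebook.

```python
from sklearn.svm import SVR

# Collect the hyperparameters of a default-constructed SVR as a dict
default_params = SVR().get_params()

# Print the ones that matter most for the discussion in this notebook:
# the kernel, the regularisation strength C and the tube width epsilon
for name in ("kernel", "C", "epsilon", "gamma"):
    print(name, "=", default_params[name])
```

Note that the default kernel is not linear, so the model fitted above is not restricted to a straight line unless `kernel="linear"` is passed explicitly.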

In [ ]:
# As before, we can use the model to make predictions on any data
predicted = model.predict(data)
# and evaluate the performance of the regression with standard metrics:
# the mean squared error and the R^2 score returned by model.score
mse = np.mean((predicted - target) ** 2)
print(mse)
print(model.score(data, target))
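
To make the two numbers printed above easier to interpret, here is a small sketch of how they relate. For a regressor, `score` returns the coefficient of determination R^2, which can be recomputed by hand from the residuals. The cell reloads the data so that it runs on its own; the names `X`, `y`, `svr` and `r2_manual` are illustrative. Keep in mind that both metrics are computed on the training data, so they say nothing about performance on unseen samples.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVR

# Reload the diabetes data as plain arrays and fit a default SVR
X, y = datasets.load_diabetes(return_X_y=True)
svr = SVR().fit(X, y)
y_pred = svr.predict(X)

# Mean squared error: average of the squared residuals
mse = np.mean((y_pred - y) ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("R^2 by hand:", r2_manual)
print("R^2 from score():", svr.score(X, y))
```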

Business formulation - continued

Support vector regression (SVR) will also find the coefficients.

The difference lies in how it proceeds to find them. Like linear regression, SVR can fit a straight line to the data,

but it does so with a geometric interpretation. It looks for the line such that the targets deviate from it by a distance D or less; points that fall farther away are allowed but penalised. See the figure below.

D is specified in the construction of the model (it is the `epsilon` parameter of scikit-learn's `SVR`) and can be tuned so as to minimise the prediction error.
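
The role of D can be sketched on a toy problem. The cell below is only an illustration under stated assumptions: the data are synthetic (a noisy straight line generated here, not the diabetes data), the kernel is forced to be linear, and the helper `fit_linear_svr` and the values of `C` and `epsilon` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: a straight line plus Gaussian noise (assumption for the demo)
rng = np.random.RandomState(0)
x = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.5, size=100)

def fit_linear_svr(epsilon):
    """Fit a linear SVR and count the points lying outside the tube."""
    model = SVR(kernel="linear", C=10.0, epsilon=epsilon).fit(x, y)
    residuals = np.abs(y - model.predict(x))
    n_outside = int(np.sum(residuals > epsilon))
    return model, n_outside

# Compare a narrow tube with a wide one
narrow, outside_narrow = fit_linear_svr(0.1)
wide, outside_wide = fit_linear_svr(2.0)

print("epsilon=0.1:", outside_narrow, "points outside,",
      len(narrow.support_), "support vectors")
print("epsilon=2.0:", outside_wide, "points outside,",
      len(wide.support_), "support vectors")
```

With a wider tube, fewer points end up outside it and fewer samples are kept as support vectors, so the fitted line depends on a smaller part of the data.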


In [2]:
from IPython.display import Image
Image('figures/svm_regression.png')


Out[2]:
