This is an example use case of a support vector classifier. We will infer a data classification function from a set of training data.
We will use the SVC implementation in the scikit-learn toolbox.
In [1]:
from sklearn import svm
import pandas as pd
import pylab as pl
import seaborn as sns
%matplotlib inline
We begin by defining a set of training points. This is the set which the classifier will use to infer the data classification function. Each row represents a data point, with x, y coordinates and a classification value.
In [2]:
fit_points = [
    [2, 1, 1],
    [1, 2, 1],
    [3, 2, 1],
    [4, 2, 0],
    [4, 4, 0],
    [5, 1, 0]
]
To understand the data set, we can plot the points from both classes (1 and 0). Points of class 1 are in black, and points from class 0 in red.
In [3]:
sns.set(style="darkgrid")
pl.scatter([point[0] for point in fit_points if point[2] == 1],
           [point[1] for point in fit_points if point[2] == 1],
           color='black')
pl.scatter([point[0] for point in fit_points if point[2] == 0],
           [point[1] for point in fit_points if point[2] == 0],
           color='red')
pl.grid(True)
pl.show()
We will use a pandas data frame to hold the data. The data frame is a convenient data structure for tabular data, and it lets us label the columns.
In [4]:
df_fit = pd.DataFrame(fit_points, columns=["x", "y", "value"])
print(df_fit)
We need to select the set of columns with the data features. In our example, those are the x and y coordinates.
In [5]:
train_cols = ["x", "y"]
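Indexing the data frame with this list of column names yields just the feature columns; a quick check:

print(df_fit[train_cols])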
We are now able to build and train the classifier.
In [6]:
clf = svm.SVC()
clf.fit(df_fit[train_cols], df_fit.value)
Out[6]:
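As an optional check, the trained classifier exposes the support vectors it selected from the fit points:

print(clf.support_vectors_)  # coordinates of the support vectors
print(clf.n_support_)        # number of support vectors per class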
The classifier is now trained with the fit points, and is ready to be evaluated with a set of test points, which have a similar structure to the fit points: x, y coordinates, and a value.
In [7]:
test_points = [
    [5, 3],
    [4, 5],
    [2, 5],
    [2, 3],
    [1, 1]
]
We separate the features from the values to make it clear where the data comes from.
In [8]:
test_points_values = [0,0,0,1,1]
We build the test points dataframe with the features.
In [9]:
df_test = pd.DataFrame(test_points, columns=['x','y'])
print(df_test)
We can add the values to the dataframe.
In [10]:
df_test['real_value'] = test_points_values
print(df_test)
Right now we have a dataframe similar to the one with the fit points. We'll use the classifier to add a fourth column with the predicted values. Our goal is to have the same value in both the real_value and predicted_value columns.
In [11]:
df_test['predicted_value'] = clf.predict(df_test[train_cols])  # predict from the feature columns
print(df_test)
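One simple way to quantify the match is to compare the two columns directly; the fraction of equal rows is the accuracy on the test points:

print((df_test.real_value == df_test.predicted_value).mean())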
The classifier is pretty successful at predicting values from the x and y coordinates. We may also apply the classifier to the fit points; it's somewhat pointless, because those are the points used to infer the data classification function.
In [12]:
df_fit['predicted_value'] = clf.predict(df_fit[train_cols])
print(df_fit)
To better understand the data separation between values 1 and 0, we'll plot both the fit points and the test points.
Following the same color code as before, points that belong to class 1 are shown in black, and points that belong to class 0 in red. Fit points are drawn as filled circles, and test points as hollow circles.
In [13]:
sns.set(style="darkgrid")
for i in range(0, 2):
    # filled circles for the fit points
    pl.scatter(df_fit[df_fit.value == i].x,
               df_fit[df_fit.value == i].y,
               color='black' if i == 1 else 'red')
    # hollow circles for the test points
    pl.scatter(df_test[df_test.predicted_value == i].x,
               df_test[df_test.predicted_value == i].y,
               marker='o',
               facecolors='none',
               edgecolors='black' if i == 1 else 'red')
pl.grid(True)
pl.show()
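To see the separation the classifier actually learned, rather than just the labelled points, one option is to evaluate it on a dense grid and shade the two regions. This is a minimal sketch, not part of the walkthrough above; it assumes numpy is available and picks the grid bounds (0 to 6) by eye for this data set.

import numpy as np

# build a grid covering the data range (bounds chosen by eye)
xx, yy = np.meshgrid(np.linspace(0, 6, 200), np.linspace(0, 6, 200))
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=train_cols)

# predict a class for every grid point and shade the two regions
zz = clf.predict(grid).reshape(xx.shape)
pl.contourf(xx, yy, zz, alpha=0.2, cmap='RdGy')
pl.show()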