Support vector machines (SVMs) are a general class of supervised learning models. They can be used for at least three basic tasks in machine learning: classification, regression, and outlier (novelty) detection; the corresponding scikit-learn estimators are sketched after the lists below.
In the first two cases, SVMs compete with many other machine learning techniques that have a range of implementations. Here are some reasons why an SVM might be an appropriate tool for a problem:
- They remain effective in high-dimensional feature spaces, even when the number of features exceeds the number of samples.
- The decision function depends only on a subset of the training points (the support vectors), so the fitted model is memory-efficient.
- The kernel trick makes them versatile: linear, polynomial, RBF, or custom kernels give very different decision boundaries from the same machinery.

Reasons against using SVMs:
- Training does not scale well to very large datasets; fit time grows faster than linearly in the number of samples.
- They do not provide probability estimates directly; getting them requires an expensive internal cross-validation (Platt scaling).
- Results are sensitive to feature scaling and to the choice of kernel and regularization parameters, so some tuning is usually needed.
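For orientation, here is a minimal sketch of the three corresponding scikit-learn estimators on synthetic data (the data, kernel choices, and nu value are illustrative assumptions, not part of the Titanic example below):

In [ ]:
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = rng.randn(100, 2)

# Classification: SVC learns a maximum-margin boundary between two classes
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
svm.SVC(kernel='rbf').fit(X, y_class)

# Regression: SVR fits a function inside an epsilon-insensitive tube
y_reg = 2.0 * X[:, 0] + 0.1 * rng.randn(100)
svm.SVR(kernel='rbf').fit(X, y_reg)

# Outlier/novelty detection: OneClassSVM learns the support of the data
svm.OneClassSVM(nu=0.05).fit(X)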
In [7]:
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing, svm
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
In [3]:
# Try it out with some data; start with Titanic survivors.
titanic = pd.read_csv("../dc/titanic_train.csv")
titanic.head()
Out[3]:
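Before encoding anything, it helps to see where the missing values are. A quick check (assuming the file follows the standard Kaggle Titanic schema):

In [ ]:
# Count missing values per column; in the standard Kaggle training set
# Age, Cabin, and Embarked are the incomplete columns.
titanic.isnull().sum()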
In [6]:
# Encode the categorical variables
# Sex (binary)
le_Sex = preprocessing.LabelEncoder()
le_Sex.fit(list(set(titanic.Sex)))
titanic['Sex_int'] = le_Sex.transform(titanic.Sex)
# Embarked (three ports of embarkation, plus a fill value for missing entries)
embarked_filled = titanic.Embarked.fillna("N")
le_Embarked = preprocessing.LabelEncoder()
le_Embarked.fit(list(set(embarked_filled)))
titanic['Embarked_int'] = le_Embarked.transform(embarked_filled)
# Since there are still NaNs in the frame, impute missing values
tvar = ['Pclass', 'Sex_int', 'Age', 'SibSp', 'Parch',
        'Fare', 'Embarked_int']
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(titanic[tvar])
imputed = imp.transform(titanic[tvar])
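The same encode-and-impute step can also be written declaratively. A sketch using scikit-learn's ColumnTransformer (the column names are the ones used above; the choice of one-hot encoding instead of integer labels is an assumption of this sketch, not what the notebook does):

In [ ]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
categorical = ['Sex', 'Embarked']

preprocess = ColumnTransformer([
    # Impute numeric columns with the column mean
    ('num', SimpleImputer(strategy='mean'), numeric),
    # Impute then one-hot encode the categorical columns
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical),
])
features = preprocess.fit_transform(titanic[numeric + categorical])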
In [19]:
titanic['Survived'].values.shape
Out[19]:
In [22]:
# Split into test and training data
X = imputed
y = titanic['Survived'].values
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, train_size=0.70, random_state=51)
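One caveat: the scaler above is fit on all of X before the split, so test-set statistics leak (slightly) into the training features. A leak-free sketch of the same step, reusing the variables defined above and fitting the scaler on the training fold only:

In [ ]:
# Split the raw features first, then fit the scaler on the training fold only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, train_size=0.70, random_state=51)
scaler = preprocessing.StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)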
In [25]:
# Fit a linear-kernel SVM classifier; larger C fits the training data
# more closely (less regularization)
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train);
In [32]:
# Mean accuracy on the held-out test set
clf.score(X_test, y_test)
Out[32]:
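Accuracy alone hides which kinds of mistakes the classifier makes. A short sketch of the per-class breakdown on the test fold:

In [ ]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
# Rows are the true classes (0 = died, 1 = survived), columns the predictions
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))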
In [36]:
# Signed distance from the separating hyperplane; used as the ROC score
y_score = clf.decision_function(X_test)
In [49]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
In [50]:
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
ax.plot([0, 1], [0, 1], 'k--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.legend(loc="best");
Not great: $AUC = 0.80$ leaves plenty of room for improvement, although it is a clear gain over random guessing ($AUC = 0.5$).
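One obvious next step is to tune the kernel and regularization rather than accept the defaults. A sketch with GridSearchCV (the parameter grid is illustrative; no claim that these values are optimal or that they were tried here):

In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]},
]
# Optimize area under the ROC curve with 5-fold cross-validation on the training fold
search = GridSearchCV(svm.SVC(), param_grid, scoring='roc_auc', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)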
In [56]:
print "This result used {} support vectors from the {}-sized training sample.".format(
clf.support_vectors_.shape[0],X_train.shape[0])
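Since the kernel is linear, the fitted model also exposes one weight per feature, which gives a rough sense of what drives the decision. A sketch (feature order follows tvar above):

In [ ]:
# coef_ has shape (1, n_features) for a binary, linear-kernel SVC;
# larger magnitude means more influence on the decision boundary
for name, weight in sorted(zip(tvar, clf.coef_[0]), key=lambda t: -abs(t[1])):
    print("{:>12}: {:+.3f}".format(name, weight))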