In this notebook we train an SVM in scikit-learn and choose its hyperparameters using cross-validation. We use a polynomial kernel and tune the polynomial degree $d$ of the kernel:
$ \kappa(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T \mathbf{v} + c)^d $
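For orientation, here is a minimal NumPy sketch of this kernel; the example vectors and the values $c = 1$, $d = 2$ are made up for illustration. (In scikit-learn's SVC the corresponding parameters are coef0 for $c$ and degree for $d$, and the kernel additionally includes a gamma factor: $(\gamma\, \mathbf{u}^T \mathbf{v} + c)^d$.)

import numpy as np

def poly_kernel(u, v, c=1.0, d=2):
    # kappa(u, v) = (u^T v + c)^d
    return (np.dot(u, v) + c) ** d

poly_kernel(np.array([1.0, 2.0]), np.array([3.0, 0.5]))  # (1*3 + 2*0.5 + 1)^2 = 25.0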
We use the Iris flower data set, first introduced by Ronald Fisher (https://en.wikipedia.org/wiki/Iris_flower_data_set), which contains 150 samples from three species (setosa, versicolor, virginica), each described by four features: sepal length, sepal width, petal length, and petal width (in cm).
In [1]:
import numpy as np
import pandas as pd
from sklearn import svm, datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
In [2]:
# load iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
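The feature and class names are available on the returned Bunch object, which is handy for labelling outputs later:

iris.feature_names, iris.target_names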
In [3]:
X[:3]
Out[3]:
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2]])
In [4]:
y[:3]
Out[4]:
array([0, 0, 0])
Randomly select 20% of the samples as the test set.
In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
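As a quick sanity check: Iris has 150 samples, so a 20% split should leave 120 for training and 30 for testing.

X_train.shape, X_test.shape  # expected: ((120, 4), (30, 4))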
Using cross-validation, try out $d = 1, 2, \ldots, 20$. Use accuracy as the metric, both for the cross-validation scores and for the final test error.
In [6]:
# try polynomial degrees d = 1, ..., 20
parameters = {'degree': list(range(1, 21))}
svc = svm.SVC(kernel='poly')
# cross-validated grid search over the degree, scored by accuracy
clf = GridSearchCV(svc, parameters, scoring='accuracy')
clf.fit(X_train, y_train)
Out[6]:
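The best degree found and its mean cross-validation accuracy can be read directly off the fitted search object:

clf.best_params_, clf.best_score_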
The cross-validation results can be loaded into a pandas DataFrame. We see that the model starts overfitting for polynomial degrees $>3$.
In [7]:
pd.DataFrame(clf.cv_results_)
Out[7]:
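To see this drop-off at a glance, here is a minimal plotting sketch (assuming matplotlib is available; note that per-degree training scores only appear in cv_results_ if GridSearchCV is constructed with return_train_score=True):

import matplotlib.pyplot as plt

results = pd.DataFrame(clf.cv_results_)
plt.plot(results['param_degree'].astype(int), results['mean_test_score'], marker='o')
plt.xlabel('polynomial degree $d$')
plt.ylabel('mean validation accuracy')
plt.show()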
Finally, take the model with the lowest mean validation error in cross-validation (i.e., the highest mean test score in cv_results_). GridSearchCV refits this best estimator on all training data by default (refit=True), so we can use it directly and determine the accuracy on the test set.
In [8]:
# clf.best_estimator_ is the best model, already refit on all training data
# (GridSearchCV's default refit=True); clf.estimator is the unfitted template
e = clf.best_estimator_
e
Out[8]:
In [9]:
y_pred = e.predict(X_test)
accuracy_score(y_test, y_pred)
Out[9]:
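For a per-class breakdown beyond the single accuracy number, scikit-learn's classification_report can be used (an optional follow-up, not part of the original exercise):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=iris.target_names))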