Kernelized support vector machines are powerful models and perform well on a variety of datasets.
They work well on low-dimensional and high-dimensional data (i.e., few and many features), but don’t scale very well with the number of samples.
Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
SVMs require careful preprocessing of the data and tuning of the parameters. This is why, these days, most people instead use tree-based models such as random forests or gradient boosting (which require little or no preprocessing) in many applications.
The important parameters in kernel SVMs are the regularization parameter C, the choice of the kernel, and the kernel-specific parameters (such as gamma for the RBF kernel). gamma and C both control the complexity of the model, with large values in either resulting in a more complex model. Therefore, good settings for the two parameters are usually strongly correlated, and C and gamma should be adjusted together.
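To make this coupling concrete, here is a minimal self-contained sketch that fits RBF SVMs over a small grid of C and gamma values and prints train/test accuracy; the grid values are arbitrary illustrations, not recommendations:
# Minimal sketch: C and gamma jointly control the complexity of an RBF SVM.
# The grid values below are arbitrary choices for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler().fit(X_tr)  # fit the scaler on the training set only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
for C in [0.1, 1, 1000]:
    for gamma in [0.01, 0.1, 1]:
        svc = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_tr, y_tr)
        print("C={:<6} gamma={:<5} train acc={:.2f} test acc={:.2f}".format(
            C, gamma, svc.score(X_tr, y_tr), svc.score(X_te, y_te)))
Notice that raising either parameter pushes the model toward a more complex fit, which is why the two are usually tuned together.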
In [16]:
# load libraries
import pandas as pd
import numpy as np
# supervised learning: train/test splitting and the SVM classifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
In [17]:
#Load data set
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
In [19]:
# put the features into a DataFrame (keep the Bunch object `cancer` intact)
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df.head()
Out[19]:
In [6]:
# split the data set into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.25, stratify=cancer.target, random_state=66)
In [22]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
In [38]:
list(cancer.target_names)
Out[38]:
In [23]:
list(cancer.feature_names)
Out[23]:
In [24]:
# Create an SVM classifier and train it on 75% of the data set.
svc = SVC(probability=True)
svc.fit(X_train, y_train)
# Analyze accuracy of predictions on the 25% holdout test sample.
classifier_score_test = svc.score(X_test, y_test)
classifier_score_train = svc.score(X_train, y_train)
print('The classifier accuracy on the test set is {:.2f}'.format(classifier_score_test))
print('The classifier accuracy on the training set is {:.2f}'.format(classifier_score_train))
The model overfits quite substantially, with a perfect score on the training set and only 63% accuracy on the test set.
While SVMs often perform quite well, they are very sensitive to the settings of the parameters and to the scaling of the data. In particular, they require all the features to vary on a similar scale. Let's look at the minimum and maximum values for each feature, plotted in log-space.
In [25]:
# import Matplotlib (scientific plotting library)
import matplotlib.pyplot as plt
# allow plots to appear within the notebook
%matplotlib inline
plt.plot(X_train.min(axis=0), 'o', label="min")
plt.plot(X_train.max(axis=0), '^', label="max")
plt.legend(loc=4)
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
plt.yscale("log")
SVMs are very sensitive to the scaling of the data. Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data.
One way to resolve the overfitting problem is to rescale each feature so that they are all on approximately the same scale. A common rescaling method for kernel SVMs is to scale the data so that all features are between 0 and 1. We will first do this by hand, and then see how to do it with MinMaxScaler.
In [26]:
# compute the minimum value per feature on the training set
min_on_training = X_train.min(axis=0)
# compute the range of each feature (max - min) on the training set
range_on_training = (X_train - min_on_training).max(axis=0)
# subtract the min, and divide by range
# afterward, min=0 and max=1 for each feature
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))
In [27]:
# use THE SAME transformation on the test set,
# using min and range of the training set (see Chapter 3 for details)
X_test_scaled = (X_test - min_on_training) / range_on_training
svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
In [28]:
from sklearn.preprocessing import MinMaxScaler
# preprocessing using 0-1 scaling
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# learning an SVM on the scaled training data
svm = SVC()
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(
svm.score(X_test_scaled, y_test)))
In [29]:
# preprocessing using zero mean and unit variance scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("SVM test accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))
The gamma parameter is the one shown in the formula given in the previous section, which controls the width of the Gaussian kernel. It determines the scale of what it means for points to be close together. The C parameter is a regularization parameter, similar to the one used in the linear models. It limits the importance of each point.
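To make gamma's role concrete, here is a small sketch computing the Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2) by hand and checking it against scikit-learn's rbf_kernel; the two points are made-up values for illustration:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# k(x, x') = exp(-gamma * ||x - x'||^2): small gamma gives a wide kernel
# (points count as "close" even when far apart); large gamma gives a narrow one.
x1 = np.array([[1.0, 2.0]])
x2 = np.array([[2.0, 4.0]])
for gamma in [0.01, 0.1, 1.0]:
    manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
    library = rbf_kernel(x1, x2, gamma=gamma)[0, 0]
    print("gamma={:<5} manual={:.4f} rbf_kernel={:.4f}".format(gamma, manual, library))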
In [30]:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'accuracy'
# Tune the SVM on scaled data
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
model = SVC()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, y_train)
print("Best: {:f} using {}".format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_result.cv_results_['params']):
    print("{:f} ({:f}) with: {!r}".format(mean, std, params))
Scaling the data made a huge difference! Now we are actually in an underfitting regime, where training and test set performance are quite similar but less close to 100% accuracy. From here, we can try increasing either C or gamma to fit a more complex model. For example:
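For instance, a sketch that raises C on the scaled training data (C=1000 is one arbitrary larger value, not a tuned setting):
# a more complex model: larger C weakens the regularization
svc = SVC(C=1000)
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))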
Let's demonstrate the classification result by plotting the decision boundary. The following cells fit SVMs with different kernels on the first two features and plot the decision boundaries for each class:
In [31]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# we only take the first two features (mean radius, mean texture)
Xtrain = X_train[:, :2]
#================================================================
# Create color maps for the two classes
#================================================================
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])
#================================================================
# we create an instance of SVM per kernel and fit our data.
# We do not scale our data since we want to plot the support vectors
#================================================================
C = 1.0  # SVM regularization parameter
lin_svc = SVC(kernel='linear', random_state=0, C=C).fit(Xtrain, y_train)
rbf_svc = SVC(kernel='rbf', gamma=0.7, C=C).fit(Xtrain, y_train)
poly_svc = SVC(kernel='poly', degree=3, C=C).fit(Xtrain, y_train)
In [32]:
#================================================================
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
#================================================================
x_min, x_max = Xtrain[:, 0].min() - 1, Xtrain[:, 0].max() + 1
y_min, y_max = Xtrain[:, 1].min() - 1, Xtrain[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = lin_svc.predict(np.c_[xx.ravel(), yy.ravel()])
#================================================================
# Put the result into a color plot
#================================================================
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=y_train, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("SVC with linear kernel")
plt.show()
In [33]:
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 9)
plt.rcParams['axes.titlesize'] = 'large'
# create a mesh to plot in
x_min, x_max = Xtrain[:, 0].min() - 1, Xtrain[:, 0].max() + 1
y_min, y_max = Xtrain[:, 1].min() - 1, Xtrain[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
# titles for the plots
titles = ['SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']
for i, clf in enumerate((lin_svc, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    # Plot also the training points
    plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=y_train, cmap=plt.cm.coolwarm)
    plt.xlabel('mean radius')
    plt.ylabel('mean texture')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()
A confusion matrix describes the performance of a classification model. Every observation in the test set is represented in the matrix. For a binary classification problem with two responses, it is a 2-by-2 matrix.
              Model says "+"    Model says "-"
Actual: "+"   True positive  |  False negative
              ---------------------------------
Actual: "-"   False positive |  True negative
In [1]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# prepare the model (probability=True so we can plot a ROC curve later)
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
model = SVC(kernel='linear', random_state=0, C=0.3, probability=True)
model.fit(rescaledX, y_train)
# estimate accuracy on the held-out test set
rescaledtestX = scaler.transform(X_test)
predictions = model.predict(rescaledtestX)
n_classes = cancer.target_names.shape[0]
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions, labels=range(n_classes)))
print(classification_report(y_test, predictions, target_names=cancer.target_names))
In [47]:
# print the first 25 true and predicted responses
print('True:', y_test[0:25])
print('Pred:', predictions[0:25])
              Model says "+"    Model says "-"
Actual: "+"   50 (TP)        |  3 (FN)
              ---------------------------------
Actual: "-"   1 (FP)         |  89 (TN)
Classification Accuracy - Answers the question: how often is the classifier correct? (TP+TN)/Total.
Confusion Matrix
Summary of the classification report (a worked example follows the list below):
Precision - When a positive value is predicted, how often is the prediction correct? That is, how precise is the classifier when predicting positive instances? In this case, the SVM classifier is 97% precise in predicting a malignant (cancerous) tumor.
Recall/Sensitivity/True Positive Rate (TPR) - Quantifies the avoidance of false negatives. When the actual value is positive, how often is the prediction correct? That is, how sensitive is the classifier in detecting a positive instance?
f1-Score - The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: f1-Score = 2 x ((Precision x Recall) / (Precision + Recall)) (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
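As a worked example, the counts from the confusion matrix above give these metrics directly (a quick sanity-check sketch using the TP/FN/FP/TN values from the table):
# worked example using the counts from the confusion matrix above
TP, FN, FP, TN = 50, 3, 1, 89
accuracy = (TP + TN) / (TP + TN + FP + FN)           # (50+89)/143 = 0.972
precision = TP / (TP + FP)                           # 50/51 = 0.980
recall = TP / (TP + FN)                              # 50/53 = 0.943
f1 = 2 * precision * recall / (precision + recall)   # about 0.962
print("accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}".format(
    accuracy, precision, recall, f1))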
In [1]:
# print the first 10 predicted responses (this model was trained on scaled data)
svm.predict(X_test_scaled)[0:10]
In statistical modeling and machine learning, a commonly reported performance measure of model accuracy is the Area Under the Curve (AUC), where by "curve" the ROC curve is implied. ROC (receiver operating characteristic) is a term that originated in the Second World War, where it was used by radar engineers; it had nothing to do with machine learning or pattern recognition.
In [49]:
# Plot the receiver operating characteristic (ROC) curve.
from sklearn.metrics import roc_curve, auc
plt.figure(figsize=(20, 10))
# use the scaled test set, matching the data the model was trained on
probas_ = model.predict_proba(rescaledtestX)
fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label='ROC (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Random')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc='lower right')