In this lab, we will apply support vector classification methods to the Endometrium vs. Uterus cancer data. For documentation see: http://scikit-learn.org/0.17/modules/svm.html#svm
Let us start, as usual, by setting up our environment, loading the data, and setting up our cross-validation.
In [ ]:
import numpy as np
%pylab inline
In [ ]:
# Load the data
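# A possible loading sketch -- the file name 'data/small_Endometrium_Uterus.csv'
# and the 'Tissue' label column are assumptions; adjust them to your copy of the data.
import pandas as pd
df = pd.read_csv('data/small_Endometrium_Uterus.csv', index_col=0)
X = df.drop('Tissue', axis=1).values           # gene-expression features
y = np.where(df['Tissue'] == 'Uterus', 1, 0)   # binary labels: Uterus = 1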
In [ ]:
# Set up a stratified 10-fold cross-validation
from sklearn import cross_validation
folds = cross_validation.StratifiedKFold(y, 10, shuffle=True)
In [ ]:
from sklearn import svm
SVMs do not naturally compute probabilities. It is possible to convert the output of the decision function into a probability, but that is a computationally intensive procedure called Platt scaling. You can read about it in the corresponding paper: https://www.microsoft.com/en-us/research/publication/probabilities-for-sv-machines/
The natural way for SVMs to return scores (rather than predicted classes) is to use the output of their decision function directly.
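For illustration, scikit-learn exposes both options: passing probability=True to svm.SVC enables Platt scaling (and hence predict_proba), at the cost of an internal cross-validation, while decision_function is always available. A quick sketch on toy data (the toy arrays below are ours, for illustration only):
In [ ]:
# Toy data, for illustration only
Xtoy = np.random.random((20, 2))
ytoy = np.array([0, 1] * 10)
clf_toy = svm.SVC(kernel='linear', probability=True).fit(Xtoy, ytoy)
print(clf_toy.decision_function(Xtoy[:3]))  # signed distances to the hyperplane
print(clf_toy.predict_proba(Xtoy[:3]))      # Platt-scaled probability estimates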
Question: Modify the cross_validate function to return as predictions the values of the decision function.
In [ ]:
from sklearn import preprocessing
In [ ]:
def cross_validate(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return, for each data point x,
    the value of the decision function f computed when x was part of the test set.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - decision_function(X) to apply the trained classifier to the data X
          and return decision values
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.

    Return:
    -------
    pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)  # Hold all predictions, in correct order.
    for tr, te in cv_folds:
        # Restrict data to train/test folds
        Xtr = design_matrix[tr, :]
        ytr = labels[tr]
        Xte = design_matrix[te, :]
        # Fit classifier
        classifier.fit(Xtr, ytr)
        # Compute decision function on test data and store the values
        # at the corresponding test indices
        pred[te] = classifier.decision_function(Xte)
    return pred
In [ ]:
clf = svm.SVC(kernel='linear')
ypred_linear = cross_validate(X, y, clf, folds)
Question: Plot the corresponding ROC curve.
In [ ]:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y, ypred_linear)
auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr,
         color='orange',
         label='Linear SVM (AUC=%.2f)' % auc)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc=(1.1, 0), fontsize=14)
Let us now look at the matrix of the linear kernel, $k(x, x') = \langle x, x' \rangle$, on this data.
In [ ]:
kmatrix = np.dot(X, X.T)  # linear kernel matrix: K_ij = <x_i, x_j>
# heatmap + color map
plt.pcolor(kmatrix, cmap=matplotlib.cm.PuRd)
# plot colorbar to the right
plt.colorbar()
# set axes boundaries
plt.xlim([0, X.shape[0]])
plt.ylim([0, X.shape[0]])
# flip the y-axis
plt.gca().invert_yaxis()
plt.gca().xaxis.tick_top()
Question: What do you observe about the values taken by the kernel? What happens if you scale the data before computing the kernel?
In [ ]:
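# One way to check (a sketch; X_scaled and kmatrix_scaled are our names):
# standardize the features, recompute the linear kernel, and compare the
# heatmap with the unscaled one above.
X_scaled = preprocessing.StandardScaler().fit_transform(X)
kmatrix_scaled = np.dot(X_scaled, X_scaled.T)
plt.pcolor(kmatrix_scaled, cmap=matplotlib.cm.PuRd)
plt.colorbar()
plt.xlim([0, X.shape[0]])
plt.ylim([0, X.shape[0]])
plt.gca().invert_yaxis()
plt.gca().xaxis.tick_top()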
Question: How does scaling affect the performance of the linear SVM?
In [ ]:
def cross_validate_with_scaling(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return, for each data point x,
    the value of the decision function f computed when x was part of the test set.
    Scale the training data, and apply the same scaling to the test data.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - decision_function(X) to apply the trained classifier to the data X
          and return decision values
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.

    Return:
    -------
    pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        # Restrict data to train/test folds
        Xtr = design_matrix[tr, :]
        ytr = labels[tr]
        Xte = design_matrix[te, :]
        # Create scaler object (centers each feature, scales it to unit variance)
        scaler = preprocessing.StandardScaler()
        # Fit the scaler on the training data and transform the training data
        Xtr = scaler.fit_transform(Xtr)
        # Apply the same transformation to the test data
        Xte = scaler.transform(Xte)
        # Fit classifier
        classifier.fit(Xtr, ytr)
        # Compute decision function on test data and store the values
        # at the corresponding test indices
        pred[te] = classifier.decision_function(Xte)
    return pred
In [ ]:
clf = svm.SVC(kernel='linear')
ypred_linear_scaled = cross_validate_with_scaling(X, y, clf, folds)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y, ypred_linear_scaled)
auc2 = metrics.auc(fpr2, tpr2)
plt.plot(fpr, tpr,
         color='orange',
         label='Linear SVM (AUC=%.2f)' % auc)
plt.plot(fpr2, tpr2,
         label='Linear SVM + scaling (AUC=%.2f)' % auc2)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc=(1.1, 0), fontsize=14)
Question: Now optimize for the C-parameter within each loop of the cross-validation. Plot the new ROC curve.
In [ ]:
from sklearn import grid_search
parameters_dict = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
# One possible approach: tune C on each training fold via internal grid search
# (sklearn 0.17 API). GridSearchCV exposes fit and decision_function, so it can
# be passed to cross_validate_with_scaling directly; scoring='roc_auc' is one
# sensible choice here.
clf = grid_search.GridSearchCV(svm.SVC(kernel='linear'), parameters_dict,
                               scoring='roc_auc')
ypred_gridsearch = cross_validate_with_scaling(X, y, clf, folds)
fpr_cv, tpr_cv, thresholds_cv = metrics.roc_curve(y, ypred_gridsearch)
auc_cv = metrics.auc(fpr_cv, tpr_cv)
plt.plot(fpr_cv, tpr_cv,
         label='Linear SVM + scaling + C tuned (AUC=%.2f)' % auc_cv)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc=(1.1, 0), fontsize=14)
We will use SVMs with kernels of the form $k(x, x') = (\langle x, x' \rangle + r)^d$.
Question: Plot the kernel matrix for $r=0, d=2$.
In [ ]:
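# A sketch: quadratic kernel matrix (r=0, d=2) on the standardized data.
# X_scaled is our name for the standardized design matrix.
X_scaled = preprocessing.StandardScaler().fit_transform(X)
kmatrix_quad = np.dot(X_scaled, X_scaled.T) ** 2
plt.pcolor(kmatrix_quad, cmap=matplotlib.cm.PuRd)
plt.colorbar()
plt.xlim([0, X.shape[0]])
plt.ylim([0, X.shape[0]])
plt.gca().invert_yaxis()
plt.gca().xaxis.tick_top()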
Question: What do you observe? What is going to happen if you increase $d$? How do you think this will affect the SVM? Cross-validate the SVM and plot the ROC curve.
In [ ]:
# Quadratic SVM: sklearn's poly kernel is (gamma <x, x'> + coef0)^degree,
# so gamma=1, coef0=0, degree=2 gives k(x, x') = <x, x'>^2
clf = svm.SVC(kernel='poly', degree=2, gamma=1, coef0=0)
ypred_quadratic = cross_validate_with_scaling(X, y, clf, folds)
fpr3, tpr3, thresholds3 = metrics.roc_curve(y, ypred_quadratic)
auc3 = metrics.auc(fpr3, tpr3)
# Plot
plt.plot(fpr2, tpr2,
         label='Linear SVM + scaling (AUC=%.2f)' % auc2)
plt.plot(fpr3, tpr3,
         label='Quadratic SVM (AUC=%.2f)' % auc3)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc=(1.1, 0), fontsize=14)
Question: What value of $r$ can change this behavior? Plot the corresponding kernel matrix.
In [ ]:
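# A sketch: kernel matrix for d=2 with a non-zero offset, e.g. r=1
# (r is the coef0 parameter in sklearn); compare with the r=0 heatmap above.
X_scaled = preprocessing.StandardScaler().fit_transform(X)
kmatrix_quad_r1 = (np.dot(X_scaled, X_scaled.T) + 1) ** 2
plt.pcolor(kmatrix_quad_r1, cmap=matplotlib.cm.PuRd)
plt.colorbar()
plt.xlim([0, X.shape[0]])
plt.ylim([0, X.shape[0]])
plt.gca().invert_yaxis()
plt.gca().xaxis.tick_top()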
Question: Now evaluate an SVM with a polynomial kernel of degree $d=2$, using the value of $r$ found above.
In [ ]:
# Same quadratic kernel, now with an offset (assuming r=1, i.e. coef0=1)
clf = svm.SVC(kernel='poly', degree=2, gamma=1, coef0=1)
ypred_quadratic_r = cross_validate_with_scaling(X, y, clf, folds)
fpr4, tpr4, thresholds4 = metrics.roc_curve(y, ypred_quadratic_r)
auc4 = metrics.auc(fpr4, tpr4)
# Plot
plt.plot(fpr2, tpr2,
         label='Linear SVM + scaling (AUC=%.2f)' % auc2)
plt.plot(fpr4, tpr4,
         label='Quadratic SVM (AUC=%.2f)' % auc4)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc=(1.1, 0), fontsize=14)
We will use SVMs with kernels of the form $k(x, x') = \exp \left(-\gamma ||x - x'||^2 \right)$.
The following code efficiently computes the pairwise squared distances between all items in X, that is to say the matrix $P$ such that $P_{ij} = ||x^i - x^j||^2$.
In [ ]:
from scipy.spatial.distance import pdist, squareform
pairwise_sq_dists = squareform(pdist(X, 'sqeuclidean'))
Question: Plot kernel matrices for varying values of $\gamma$. What do you observe? What is going to be the impact on the SVM? What happens for very small values of $\gamma$? Check your intuitions by cross-validating SVMs with these various kernels.
In [ ]:
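# A sketch: RBF kernel matrices for a few values of gamma (the values below are
# illustrative, not prescribed), computed from the precomputed squared distances.
for i, gamma in enumerate([1e-4, 1e-2, 1, 100]):
    kmatrix_rbf = np.exp(-gamma * pairwise_sq_dists)
    plt.subplot(2, 2, i + 1)
    plt.pcolor(kmatrix_rbf, cmap=matplotlib.cm.PuRd)
    plt.colorbar()
    plt.title('gamma = %s' % gamma)
# Cross-validate an RBF SVM for each gamma
for gamma in [1e-4, 1e-2, 1, 100]:
    clf = svm.SVC(kernel='rbf', gamma=gamma)
    ypred_rbf = cross_validate_with_scaling(X, y, clf, folds)
    fpr_g, tpr_g, thresholds_g = metrics.roc_curve(y, ypred_rbf)
    print('gamma = %s: AUC = %.2f' % (gamma, metrics.auc(fpr_g, tpr_g)))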
What we have observed here is a phenomenon of diagonal dominance in the kernel matrix.
One way to address this is to re-scale the kernel matrix in the following way: $\hat K_{ij} = \frac{K_{ij}}{\sqrt{K_{ii} K_{jj}}}$.
To implement this, you can pass your own kernel function or matrix to the kernel parameter of the SVM.
Question: Write a function scaled_quadratic_kernel that computes the scaled quadratic kernel matrix between two data arrays.
In [ ]:
def scaled_quadratic_kernel(X1, X2):
    """ Custom scaled quadratic kernel.
    The quadratic kernel between X1 and X2 is scaled to avoid diagonal dominance,
    by applying k(X1_i, X2_j) <-- k(X1_i, X2_j) / sqrt(k(X1_i, X1_i) k(X2_j, X2_j))

    Parameters:
    -----------
    X1: (n_samples1, n_features) np.array
        First data matrix.
    X2: (n_samples2, n_features) np.array
        Second data matrix.

    Return:
    -------
    K: (n_samples1, n_samples2) np.array
        Kernel matrix between samples from X1 and samples from X2.
    """
    # Quadratic kernel (r=0, d=2): k(x, x') = <x, x'>^2
    K = np.dot(X1, X2.T) ** 2
    # Self-similarities k(x, x) = <x, x>^2 for the rows of X1 and of X2
    K1_diag = np.sum(X1 ** 2, axis=1) ** 2
    K2_diag = np.sum(X2 ** 2, axis=1) ** 2
    # Normalize: K_ij / sqrt(k(X1_i, X1_i) * k(X2_j, X2_j))
    return K / np.sqrt(np.outer(K1_diag, K2_diag))
Question: Plot the corresponding kernel matrix. Does it match your expectations?
In [ ]:
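# A sketch: heatmap of the scaled quadratic kernel on the standardized data
# (by construction, the diagonal entries should all equal 1).
X_scaled = preprocessing.StandardScaler().fit_transform(X)
kmatrix_scaled_quad = scaled_quadratic_kernel(X_scaled, X_scaled)
plt.pcolor(kmatrix_scaled_quad, cmap=matplotlib.cm.PuRd)
plt.colorbar()
plt.xlim([0, X.shape[0]])
plt.ylim([0, X.shape[0]])
plt.gca().invert_yaxis()
plt.gca().xaxis.tick_top()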
Question: In the 0.17 version of scikit-learn, it isn't possible to use a custom kernel function within GridSearchCV. Set the C parameter to 1 and compare your custom kernel with the quadratic kernel (d=2) on the gene expression data.
In [ ]:
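# A sketch: compare the custom scaled kernel with the plain quadratic kernel,
# both with C fixed to 1 (the variable names below are ours).
clf_custom = svm.SVC(kernel=scaled_quadratic_kernel, C=1.0)
ypred_custom = cross_validate_with_scaling(X, y, clf_custom, folds)
fpr5, tpr5, thresholds5 = metrics.roc_curve(y, ypred_custom)
auc5 = metrics.auc(fpr5, tpr5)
clf_quad = svm.SVC(kernel='poly', degree=2, gamma=1, coef0=0, C=1.0)
ypred_quad = cross_validate_with_scaling(X, y, clf_quad, folds)
fpr6, tpr6, thresholds6 = metrics.roc_curve(y, ypred_quad)
auc6 = metrics.auc(fpr6, tpr6)
plt.plot(fpr6, tpr6, label='Quadratic SVM (AUC=%.2f)' % auc6)
plt.plot(fpr5, tpr5, label='Scaled quadratic SVM (AUC=%.2f)' % auc5)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc=(1.1, 0), fontsize=14)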