Q2

In this question, we'll again look at classification, but with a more sophisticated algorithm. We'll also use the Iris dataset again.

Part A

In this question, you'll use a powerful classification technique known as Support Vector Machines, or SVMs.

SVMs work by finding a line (or, in higher dimensions, a hyperplane) that best separates data points belonging to different classes. SVMs range from this fairly straightforward linear version all the way to complex variants that use nonlinear kernels to project the original data into a very high-dimensional space, where (in theory) the data are easier to separate.
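To make the "separating hyperplane" idea concrete, here is a minimal sketch. The toy data is made up, and it uses SVC with a linear kernel as a stand-in rather than the model you'll be asked to use below. It shows that a fitted linear SVM is just a weight vector w and an offset b, and that the sign of w . x + b decides the predicted class.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration only).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# The separating hyperplane is the set of points where w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# The sign of w . x + b says which side of the hyperplane a point falls on.
scores = X_toy @ w + b
manual_pred = (scores > 0).astype(int)

print("w =", w, "b =", b)
print("manual predictions:", manual_pred)
print("model predictions: ", clf.predict(X_toy))
```

The manual predictions should agree with clf.predict(), since for a two-class linear model that is all predict() needs to compute.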

SVMs also include a penalty term, usually written C, that controls how much classification error is tolerated during training, which makes the decision boundary more or less fluid. Increasing the penalty makes the decision boundary less permeable; decreasing it allows for more errors.
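One way to see the effect of the penalty is to look at the width of the margin around the decision boundary, which for a linear SVM is 2 / ||w||. The sketch below is only an illustration: the data is made up, margin_width is a hypothetical helper, and SVC with a linear kernel is used as a stand-in model. A small penalty should give a wide, permissive margin, and a large penalty a narrower one.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: two clusters plus one stray point from each class,
# so no straight line separates the classes perfectly.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.5, 2.5],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0], [1.5, 1.5]])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def margin_width(C):
    """Fit a linear SVM with penalty C and return the margin width 2 / ||w||."""
    clf = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    return 2.0 / np.linalg.norm(clf.coef_[0])

width_small_C = margin_width(0.01)   # errors are cheap: wide margin
width_large_C = margin_width(100.0)  # errors are expensive: narrower margin

print("margin width with C = 0.01: ", width_small_C)
print("margin width with C = 100.0:", width_large_C)
```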

In this part, you'll write code to train a linear SVM. In your code:

Define a function train_svm().
train_svm should take 3 arguments: a data matrix X, a target array y, and a floating-point penalty strength term C.
It should return a trained SVM model.

Your function should 1) create a linear SVM model, initialized with the correct penalty term, and 2) train (or fit) the model on the dataset X and its labels y. Look at the scikit-learn documentation for LinearSVC.

(The "C" in "SVC" stands for "Classifier", as in "Support Vector Classifier"; scikit-learn also has SVM implementations that can be used for regression.)



In [ ]:

    
import sklearn.svm as svm



In [ ]:



In [ ]:

    
import numpy as np
np.random.seed(13775)
X = np.random.random((20, 2))
y = np.random.randint(2, size = 20)

m1 = train_svm(X, y, 100.0)
assert m1.C == 100.0
np.testing.assert_allclose(m1.coef_, np.array([[ 0.392707, -0.563687]]), rtol=1e-6)



In [ ]:

    
import numpy as np
np.random.seed(598497)
X = np.random.random((20, 2))
y = np.random.randint(2, size = 20)

m2 = train_svm(X, y, 10000.0)
assert m2.C == 10000.0
np.testing.assert_allclose(m2.coef_, np.array([[ -0.345056, -0.6118 ]]), rtol=1e-6)

Part B

In this part, you'll write an accompanying function to test the classification accuracy of your trained model.

In your code:

Define a function test_svm().
test_svm should take 3 arguments: a data matrix X, a target array y, and a trained SVM model. It should return a prediction accuracy between 0 (completely incorrect) and 1 (completely correct).

Your function can use the score() method available on the SVM model. Look at the scikit-learn documentation for LinearSVC.
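If it helps to see what score() reports, the sketch below compares it against an accuracy computed by hand. It is an illustration under assumptions: the data is made up, the model is an SVC with a linear kernel used as a stand-in, and it scores on the training data only to keep the example short.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data, not linearly separable.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.5, 2.5],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0], [1.5, 1.5]])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# score() predicts a label for each row and returns the fraction that match.
acc_from_score = clf.score(X_toy, y_toy)

# The same quantity computed by hand.
acc_by_hand = np.mean(clf.predict(X_toy) == y_toy)

print(acc_from_score, acc_by_hand)
```

Both numbers should be identical, and both lie between 0 and 1.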



In [ ]:



In [ ]:

    
np.random.seed(58982)
X = np.random.random((100, 2))
y = np.random.randint(2, size = 100)

m2 = train_svm(X[:75], y[:75], 100.0)
acc2 = test_svm(X[75:], y[75:], m2)
np.testing.assert_allclose(acc2, 0.36, rtol = 1e-4)



In [ ]:

    
np.random.seed(99766)
X = np.random.random((20, 2))
y = np.random.randint(2, size = 20)

m2 = train_svm(X[:18], y[:18], 10.0)
acc2 = test_svm(X[18:], y[18:], m2)
np.testing.assert_allclose(acc2, 0.5)

Part C

In this part, you'll test the functions you just wrote.

The following code contains a cross-validation loop: it uses built-in scikit-learn tools to automate the task of implementing robust k-fold cross-validation. Incorporate your code into the core of the loop to extract sub-portions of the data matrix X and corresponding sub-portions of the target array y for training and testing.

In the following code:

Implement training and testing of the SVM in each iteration of the cross-validation loop. The point is that, in each iteration, the training and testing sets differ from those of the previous iteration.
Keep track of the average classification accuracy. Print it at the end.
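As a rough picture of what a k-fold loop hands you on each iteration, the sketch below runs KFold on a tiny made-up array. It assumes a recent scikit-learn, where KFold lives in sklearn.model_selection; the names X_demo, y_demo, and seen_test_rows are made up for the illustration. Each iteration yields two arrays of row indices, which can be used to slice the data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Tiny made-up dataset: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)

seen_test_rows = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # train_idx and test_idx are integer index arrays into the rows of X_demo.
    X_train, y_train = X_demo[train_idx], y_demo[train_idx]
    X_test, y_test = X_demo[test_idx], y_demo[test_idx]
    seen_test_rows.extend(test_idx.tolist())
    print("train rows:", train_idx, "test rows:", test_idx)
```

Across the five iterations, every row should appear in a test set exactly once.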



In [ ]:

    
import numpy as np
import sklearn.datasets as datasets
# KFold lives in sklearn.model_selection (scikit-learn 0.18 and later).
import sklearn.model_selection as cv

# Set up the iris data.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Some variables you're welcome to change, if you want.
C = 1.0    # SVM penalty term
folds = 5  # The "k" in "k-fold cross-validation"

# Set up the cross-validation loop.
kfold = cv.KFold(n_splits = folds, shuffle = True, random_state = 10)
for train, test in kfold.split(X):
    # YOUR CODE HERE.
    
    ### BEGIN SOLUTION
    
    ### END SOLUTION

Part D

What was your average classification accuracy in the previous part? How did that compare with the KNN accuracy from Q1? Does this difference or similarity in accuracy make sense to you? Can you say anything about how the "bias/variance tradeoff" may or may not be at play here between KNN and SVM?