Welcome to the practical section of module 5.3. Here we'll work with one of the most powerful tools in machine learning: Support Vector Machines (SVMs). We'll learn how to train a support vector machine classifier (SVC) using scikit-learn on the Pima Indians Diabetes Dataset, and we'll also visualize the SVMs by plotting the decision boundaries and the maximum margin for different kernels, to get a better practical understanding of how SVMs behave.
In [89]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm
plt.rcParams['figure.figsize'] = (10, 10)
The dataset we'll be working with here is one of the classics of the machine learning community. It is a collection of clinical measurements for 768 women of Pima Indian heritage, together with their diabetes diagnosis. We have 8 clinical measurements, such as pregnancy count, blood glucose, body mass index (BMI), and diastolic blood pressure. The diagnosis is represented by the Class column: 1 means that the subject has diabetes, 0 means that she doesn't.
We can take a look at a sample of the data to understand better:
In [90]:
dataset = pd.read_csv('../datasets/pima-indians-diabetes.csv')
dataset.sample(10)
Out[90]:
To simplify our work and make the visualizations feasible, we'll focus on only two of the eight features: Blood Glucose and Body Mass Index (BMI). We'll now extract our focus dataset and look at a summary of the values of both features:
In [93]:
focus_dataset = dataset[["Blood Glucose", "BMI", "Class"]].copy()
x1_min, x1_max = focus_dataset["Blood Glucose"].min(), focus_dataset["Blood Glucose"].max()
x1_avg = np.mean(focus_dataset["Blood Glucose"])
x2_min, x2_max = focus_dataset["BMI"].min(), focus_dataset["BMI"].max()
x2_avg = np.mean(focus_dataset["BMI"])
print "Blood Glucose: min = %d, max = %d, mean = %0.2f" % (x1_min, x1_max, x1_avg)
print "BMI: min = %0.1f, max = %0.1f, mean = %0.2f" % (x2_min, x2_max, x2_avg)
focus_dataset.sample(10)
Out[93]:
From the data summary we can notice two things:
There are some instances where Blood Glucose or BMI takes the value of 0. This is biologically impossible! These values are most likely errors in recording the data, so we need to filter those instances out of the dataset.
The features span widely different ranges, which can cause problems when we try to plot and visualize the results we get from our models. A good idea is to scale the features into a comparable range.
We'll start by cleaning the data from the 0-valued instances.
In [95]:
# filtering the data out of 0-valued instances
real_bg = focus_dataset["Blood Glucose"] != 0
real_bmi = focus_dataset["BMI"] != 0
focus_dataset = focus_dataset[real_bg & real_bmi].reset_index(drop=True)  # reset the index so labels and positions agree after filtering
x1_min, x1_max = focus_dataset["Blood Glucose"].min(), focus_dataset["Blood Glucose"].max()
x1_avg = np.mean(focus_dataset["Blood Glucose"])
x2_min, x2_max = focus_dataset["BMI"].min(), focus_dataset["BMI"].max()
x2_avg = np.mean(focus_dataset["BMI"])
print "Blood Glucose: min = %d, max = %d, mean = %0.2f" % (x1_min, x1_max, x1_avg)
print "BMI: min = %0.1f, max = %0.1f, mean = %0.2f" % (x2_min, x2_max, x2_avg)
focus_dataset.sample(10)
Out[95]:
Now that we have cleaned the data and all our instances have biologically plausible values, we turn to scaling the features. We're going to use the same technique we used back when we were doing linear regression: we fit our scaler on the training data, then use the fitted scaler to transform the rest of the data (the test data).
In [37]:
def scale_features(X, scalar=None):
    # make sure X is 2-dimensional, as StandardScaler expects
    if len(X.shape) == 1:
        X = np.asarray(X).reshape(-1, 1)
    # if no fitted scaler was passed in, fit a new one on X
    if scalar is None:
        scalar = StandardScaler()
        scalar.fit(X)
    return scalar.transform(X), scalar
In [96]:
# scaling the features
size = len(focus_dataset)
training_size = np.floor(size * 0.7).astype(int)
training_portion = focus_dataset.loc[:training_size, ("Blood Glucose", "BMI")]
test_portion = focus_dataset.loc[training_size:, ("Blood Glucose", "BMI")]
focus_dataset.loc[:training_size, ("Blood Glucose", "BMI")], train_scalar = scale_features(training_portion)
focus_dataset.loc[training_size:, ("Blood Glucose", "BMI")],_ = scale_features(test_portion, scalar=train_scalar)
x1_min, x1_max = focus_dataset["Blood Glucose"].min(), focus_dataset["Blood Glucose"].max()
x2_min, x2_max = focus_dataset["BMI"].min(), focus_dataset["BMI"].max()
focus_dataset.sample(10)
Out[96]:
Now that we've finished preprocessing our data, we can finally visualize the distribution of our data points with a scatter plot. The green data points represent the healthy subjects, and the red ones represent the diabetic subjects.
In [100]:
plt.figure()
colors = ['green' if diagnosis == 0 else 'red' for diagnosis in focus_dataset["Class"]]
plt.scatter(focus_dataset["Blood Glucose"], focus_dataset["BMI"], c=colors)
plt.xlabel("Blood Glucose")
plt.ylabel("BMI")
plt.show()
Now we're ready to start building our SVMs. We're going to use SVC from scikit-learn's svm module. SVC implements the objective function we learned about in the videos:
$$\text{min} \hspace{0.5em} \frac{1}{2}||w||^2 + C\sum_{i=1}^{M}\zeta_{i} \hspace{0.5em} \text{subject to} \hspace{0.5em} y_ih_w(x_i) \geq 1 - \zeta_i, \hspace{0.5em} \zeta_i \geq 0$$
Our hypothesis function is in the usual form: $h_w(x) = w^T\phi(x) + b$, where $\phi$ is the mapping we use (if any) from our feature space to a higher dimensional feature space. This same hypothesis function can be written in an alternative form (called the dual form) that involves the kernel function we discussed in the video:
$$h_w(x) = \rho + \sum_{i=1}^{M}y_i\alpha_iK(x,x_i)$$
The details of how the two formulas are equivalent are mathematically involved, but roughly speaking, $\rho$ plays the role of $b$ and the $\alpha_i$ relate to the weights $w$. The kernel function $K$ allows us to do our computations in the higher dimensional feature space without explicitly using $\phi$, which saves us a lot of computation. This is why the dual form is useful.
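To make the kernel trick a little more concrete, here is a small numeric sketch (our own illustration; the names phi_2 and poly2_kernel are made up for this example). For a degree-2 polynomial kernel on 2D points, computing the kernel in the original space gives the same value as explicitly mapping both points with $\phi$ and taking a dot product in the higher dimensional space:
In [ ]:
# a minimal numeric check of the kernel trick for the degree-2 polynomial kernel
def phi_2(x):
    # explicit mapping of a 2D point into the space where (x . x')^2 becomes a plain dot product
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, x_prime):
    # the kernel computes the same value without ever building phi_2(x)
    return np.dot(x, x_prime) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 0.5])
print(np.dot(phi_2(x), phi_2(x_prime)))  # dot product in the mapped space
print(poly2_kernel(x, x_prime))          # same value (up to floating point), computed in the original space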
Practically, we can use SVC in the same way we used LogisticRegression and SGDRegressor: we instantiate a model, fit it to the training data, and score it on the test data. We'll be concerned with two arguments when we instantiate our model: kernel, which selects the kernel function $K$, and C, which controls how strongly margin violations are penalized.
We'll start by splitting our data into a training and a testing part, then we proceed to try out different SVMs with different kernels.
In [101]:
X_train = focus_dataset[["Blood Glucose", "BMI"]][:training_size]
y_train = focus_dataset["Class"][:training_size]
X_test = focus_dataset[["Blood Glucose", "BMI"]][training_size:]
y_test = focus_dataset["Class"][training_size:]
We'll start with the simplest kernel there is: the linear kernel. This kernel takes the dot product form:
$$K(x_i,x_i') = x_i \cdot x_i'$$
With this kernel we simply stay in our original feature space; no mapping to a higher dimensional space is done. We can choose to train our model with the linear kernel by passing kernel='linear' as an argument when we construct the model.
In [133]:
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
mean_accuracy_train = model.score(X_train, y_train)
mean_accuracy_test = model.score(X_test, y_test)
print("Mean Accuracy on Train Data: %0.3f" % mean_accuracy_train)
print("Mean Accuracy on Test Data: %0.3f" % mean_accuracy_test)
After we've trained our model, we can start visualizing it. You may notice that the visualization script is a little more complex than what we've done before; that's because here we're trying to visualize several things at once: the class regions (the blue and brown areas), the optimal hyperplane (the solid line in the middle), the maximum margin (defined by the two dashed lines), and the support vectors of the model (the circled data points). The script may look scary, but it'll feel much simpler once you go through it.
Notice that we're using a method called decision_function: for each data point it returns the value of the decision function, which tells us how far the point lies from the separating hyperplane and on which side. We use these values to create the colored regions, the optimal hyperplane, and the maximum margin lines. We're also using the model's support_vectors_ attribute to get the support vectors.
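If you want to convince yourself of what decision_function returns for the linear kernel, the quick check below (our own optional sketch) rebuilds $h_w(x) = w \cdot x + b$ from the fitted coef_ and intercept_ attributes; it should match decision_function up to floating point error:
In [ ]:
# optional sanity check: for a linear kernel, decision_function(x) is just w . x + b
w = model.coef_[0]
b = model.intercept_[0]
manual_values = np.dot(X_test, w) + b
print(np.allclose(manual_values, model.decision_function(X_test)))  # expected: True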
In [134]:
plt.figure() # we just create a matplotlib figure as usual
# we create the coordinate matrix for the colored meshes
# instead of using the feature values directly like we did before,
# we use ranges between the min and max of each feature with a 0.01 step to make the boundaries smoother
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.01), np.arange(x2_min, x2_max, 0.01))
# we calculate the z values for the colored mesh
# we use .ravel() to flatten the matrices and np.c_ to concatenate them
z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
# here we plot the colored meshes
# notice that we pass z > 0: this creates 2 colors, one for values of z > 0 and the other for z <= 0
# if we used the raw z values, we'd get a different color for every value z can take
plt.pcolormesh(xx, yy, z > 0, cmap=cm.Paired)
# here we plot the lines of the optimal hyperplane and the maximum margin
# these lines are the contours of the regions where the z values are -1, 0, 1
plt.contour(xx, yy, z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'], levels=[-1., 0, 1.])
# this is our regular scatter plot of the data points
plt.scatter(focus_dataset["Blood Glucose"], focus_dataset["BMI"], c=colors)
# here we scatter some big hollow circles at the positions of the support vectors
# this will result in the support vector data points being circled
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=80, facecolors='none', edgecolors='k')
# we finally show the plot!
plt.show()
We can see in the plot the two class regions separated by the optimal hyperplane (the solid line), the maximum margin bounded by the two dashed lines, and the circled support vectors, which lie on the margin boundaries or violate them.
You're encouraged to tweak the value of the parameter $C$ and see how the resulting model is affected. Try reducing the value of $C$, which relaxes the penalty on noisy data points that violate the margin, and notice how the model's accuracy is affected and how the maximum margin gets wider as $C$ gets smaller.
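If you'd like a quick way to run that experiment, the small sketch below (just a suggestion, not part of the original flow) retrains the linear model for a few values of $C$ and prints the accuracies along with the number of support vectors:
In [ ]:
# a small experiment: how does C affect the linear SVC?
for C_value in [0.01, 0.1, 1.0, 10.0]:
    trial = SVC(kernel='linear', C=C_value)
    trial.fit(X_train, y_train)
    print("C = %5.2f | train = %0.3f | test = %0.3f | support vectors = %d" % (
        C_value, trial.score(X_train, y_train), trial.score(X_test, y_test),
        len(trial.support_vectors_)))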
The Radial Basis Function (RBF) kernel is one of the most fascinating kernels we can work with. It can produce a very non-linear solution that adapts to even the noisiest datasets. It's computed using:
$$K(x_i,x_i') = \exp(-\gamma\,||x_i - x_i'||^2)$$
An interesting fact about this kernel is that it operates in an infinite-dimensional feature space! Yes, this formula is the dot product of two vectors in an infinite-dimensional feature space. This is why the kernel trick and the dual form are so fascinating: without them, we'd have been forced to calculate the mapping $\phi:R^2 \rightarrow R^\infty$ for each data point and store it in memory to be reused, which is practically impossible!
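As a small numeric illustration (our own sketch, using scikit-learn's rbf_kernel helper), we can compute the kernel value by hand and see how it decays toward 0 as the two points move apart:
In [ ]:
from sklearn.metrics.pairwise import rbf_kernel

# computing K(x, x') = exp(-gamma * ||x - x'||^2) by hand for a couple of points
gamma = 1.0
x = np.array([[0.0, 0.0]])
for x_prime in [np.array([[0.5, 0.0]]), np.array([[2.0, 0.0]])]:
    by_hand = np.exp(-gamma * np.sum((x - x_prime) ** 2))
    by_sklearn = rbf_kernel(x, x_prime, gamma=gamma)[0, 0]
    print("distance = %0.1f | by hand = %0.4f | sklearn = %0.4f" % (
        np.linalg.norm(x - x_prime), by_hand, by_sklearn))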
To use this kernel, we pass kernel='rbf' to the constructor of our model. We can also change the value of the free parameter $\gamma$ through the gamma argument.
In [149]:
model = SVC(kernel='rbf', gamma=1.0, C=1.0)
model.fit(X_train, y_train)
mean_accuracy_train = model.score(X_train, y_train)
mean_accuracy_test = model.score(X_test, y_test)
print "Mean Accuracy on Train Data: %0.3f" % (mean_accuracy_train)
print "Mean Accuracy on Test Data: %0.3f" % (mean_accuracy_test)
Notice how the model does slightly better on the training data but slightly worse on the test data. This is because the non-linear decision boundary fits the training data as closely as it can. We can visualize that non-linear solution using the same script we used for the linear kernel, with no changes whatsoever!
In [150]:
plt.figure()
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.01), np.arange(x2_min, x2_max, 0.01))
z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
plt.pcolormesh(xx, yy, z > 0, cmap=cm.Paired)
plt.contour(xx, yy, z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'], levels=[-1., 0, 1.])
plt.scatter(focus_dataset["Blood Glucose"], focus_dataset["BMI"], c=colors)
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=80, facecolors='none', edgecolors='k')
Out[150]:
We can produce an even more non-linear solution by increasing the value of $\gamma$. Let's set it to 10 and see what happens.
In [151]:
model = SVC(kernel='rbf', gamma=10.0, C=1.0)
model.fit(X_train, y_train)
mean_accuracy_train = model.score(X_train, y_train)
mean_accuracy_test = model.score(X_test, y_test)
print "Mean Accuracy on Train Data: %0.3f" % (mean_accuracy_train)
print "Mean Accuracy on Test Data: %0.3f" % (mean_accuracy_test)
Notice that the model got much better on the training data but slightly worse on the test data. This is the classic bias/variance tradeoff we encounter everywhere in machine learning. In our case, the model is so non-linear and so complex that it fits the noise of the training data very well but can't generalize as well to new, unseen data. We can get a sense of how complex the non-linear solution is from the following plot, which shows how the data points are grouped into islands!
In [152]:
plt.figure()
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.01), np.arange(x2_min, x2_max, 0.01))
z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
plt.pcolormesh(xx, yy, z > 0, cmap=cm.Paired)
plt.contour(xx, yy, z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'], levels=[-1., 0, 1.])
plt.scatter(focus_dataset["Blood Glucose"], focus_dataset["BMI"], c=colors)
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=80, facecolors='none', edgecolors='k')
Out[152]:
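Before we move on to another kernel, here is a small optional sketch (ours, not part of the original flow) that sweeps over a few values of $\gamma$; you should see the training accuracy keep climbing while the test accuracy eventually drops, which is exactly the overfitting effect described above:
In [ ]:
# a small experiment: the bias/variance effect of gamma on the RBF kernel
for gamma_value in [0.1, 1.0, 10.0, 100.0]:
    trial = SVC(kernel='rbf', gamma=gamma_value, C=1.0)
    trial.fit(X_train, y_train)
    print("gamma = %6.1f | train = %0.3f | test = %0.3f" % (
        gamma_value, trial.score(X_train, y_train), trial.score(X_test, y_test)))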
Another kind of kernel we can work with is the polynomial kernel. This kernel implicitly maps our feature space to a higher dimensional feature space, whose size grows with the degree $d$, through the equation:
$$K(x_i,x_i') = (\gamma\,(x_i \cdot x_i') + r)^d$$
where $d$ is the degree of the polynomial and $\gamma, r$ are free parameters. To train a model using this kernel we pass kernel='poly'. We can change the values of $d$, $\gamma$, and $r$ using the arguments degree, gamma, and coef0 respectively.
In [159]:
model = SVC(kernel='poly', degree=3)
model.fit(X_train, y_train)
mean_accuracy_train = model.score(X_train, y_train)
mean_accuracy_test = model.score(X_test, y_test)
print "Mean Accuracy on Train Data: %0.3f" % (mean_accuracy_train)
print "Mean Accuracy on Test Data: %0.3f" % (mean_accuracy_test)
We notice that the 3rd-degree polynomial kernel does a better job on the test data than our initial RBF kernel; that's because its non-linearity is less complex than the RBF's. This can be seen in the following visualization of the polynomial kernel's solution.
In [161]:
plt.figure()
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.01), np.arange(x2_min, x2_max, 0.01))
z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
plt.pcolormesh(xx, yy, z > 0, cmap=cm.Paired)
plt.contour(xx, yy, z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'], levels=[-1., 0, 1.])
plt.scatter(focus_dataset["Blood Glucose"], focus_dataset["BMI"], c=colors)
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=80, facecolors='none', edgecolors='k')
Out[161]: