Class 06

ML Models: Support Vector Machines + Overfitting

We're going back to machine learning models again! This time we are going to look at another type of machine learning algorithm that will give us the opportunity to adjust hyperparameters: inputs to the machine learning model that tell the model how to behave. Tuning the hyperparameters is something of an art; we'll talk about Occam's Razor again and how to balance model performance with model complexity. But we'll start with the Support Vector Machine (SVM) classifier and go from there.

SVM Classifier

We'll use the same set of data that we used in Class 04: the classifier data describing the self-driving car road conditions. We'll load in the data and plot it to make sure we know what we are working with.


In [1]:
library(ggplot2)

In [2]:
#Note that we didn't use stringsAsFactors=FALSE this time. The read.csv() function will convert the Speed column to a factor.
speeddf <- read.csv("../Class04/Class04_speed_data.csv")

options(repr.plot.width=5, repr.plot.height=4)
ggplot() + geom_point(aes(Grade,Bumpiness, col=Speed), data=speeddf)+ coord_fixed()


The goal, as before, is to build a model to describe the decision boundary between the "fast" and "slow" categories in our label column. We are going to skip the background information on the SVM for now, but you can read about how it works on Wikipedia.

We run through the same set of steps that we've used before to get ready to train the model. We are going to start with the simplest SVM: the linear classifier. This will give us a linear decision boundary.


In [4]:
library(e1071)
set.seed(24)
trainIndex <- sample(seq(nrow(speeddf)), nrow(speeddf)*0.8)

train <- speeddf[trainIndex, ]
test <- speeddf[-trainIndex, ]

svm_model <- svm(Speed ~ ., data=train, kernel="linear")

library(pracma)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh
x_min <- 0.0; x_max <- 1.0 # Mesh x size
y_min <- 0.0; y_max <- 1.0  # Mesh y size
h <- .01  # step size in the mesh
mesh<-meshgrid(seq(x_min,x_max,h),seq(y_min,y_max,h))
meshdf <- data.frame(Grade=c(mesh$X),Bumpiness=c(mesh$Y))
meshdf$Prediction <- predict(svm_model,meshdf)
meshdf$col <- ifelse(meshdf$Prediction == 'fast', 'blue', 'white')

p <- ggplot(meshdf) +
 geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
 coord_fixed()
p + geom_point(aes(Grade,Bumpiness,col=Speed),data=test)



In [5]:
library(caret)
pred <- predict(svm_model, test)
cm<-confusionMatrix(pred,test$Speed)
print(cm$table)

Reference <- factor(c('fast', 'fast', 'slow', 'slow'))
Prediction <- factor(c('fast', 'slow', 'fast', 'slow'),levels=rev(levels(test$Speed)))
df <- data.frame(Reference, Prediction, Y=c(cm$table))

ggplot(data =  df, mapping = aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Y), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_bw() + theme(legend.position = "none") + scale_x_discrete(position = "top")


Loading required package: lattice
          Reference
Prediction fast slow
      fast  109    7
      slow    9   75

In [6]:
cm$byClass


Sensitivity: 0.923728813559322
Specificity: 0.914634146341463
Pos Pred Value: 0.939655172413793
Neg Pred Value: 0.892857142857143
Precision: 0.939655172413793
Recall: 0.923728813559322
F1: 0.931623931623932
Prevalence: 0.59
Detection Rate: 0.545
Detection Prevalence: 0.58
Balanced Accuracy: 0.919181479950393

In [7]:
cm$overall


Accuracy: 0.92
Kappa: 0.835255354200989
AccuracyLower: 0.873335143977918
AccuracyUpper: 0.953582391600177
AccuracyNull: 0.59
AccuracyPValue: 8.4476073907249e-26
McnemarPValue: 0.802587348634152

At this point I want to go back to the last set of classifiers we looked at with this exact same dataset: the Perceptron and the Naïve Bayes. For comparison, here were the metrics for those two classifiers:

Perceptron:


Class-dependent Metrics

  • Sensitivity: 0.983870967741935
  • Precision: 0.897058823529412
  • F1: 0.938461538461539

Class-independent Metrics

  • Accuracy: 0.92

Naïve Bayes:

Class-dependent Metrics

  • Sensitivity: 0.9838710
  • Precision: 0.8531469
  • F1: 0.9138577

Class-independent Metrics

  • Accuracy: 0.885

We see that the SVC model did better than both of these other models on many of the metrics. Not bad considering it is a linear fit. Now we can look a little more closely at how the SVC model works. It picks a few data points from the training set to use as "support" points and then draws the decision boundary as a line between those points. So, in essence, the number of support points plays the role of the number of model parameters. That is important because it tells us about the model complexity. Let's look at our linear model and show which points it is using as its support points.


In [8]:
support_points <- train[svm_model$index,]

p <- ggplot(meshdf) +
 geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
 coord_fixed()
p + geom_point(aes(Grade,Bumpiness),data=support_points,shape='o',size=3) + labs(title=paste("Number of support points:",nrow(support_points)))


We see that there are actually a large number of support points in this model! So, even though it ends up being a linear decision boundary, it has quite a high complexity. Does the improvement in performance justify this increase in complexity? Perhaps...

Before we go on, we need to note that the SVC model only works well for small-ish datasets (fewer than about 100,000 points); beyond that, fitting slows down considerably. You are welcome to try this model on larger datasets, but be prepared for long fit times.
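
If you'd like to gauge the fit time on your own data (the assignment at the end asks for exactly this), base R's system.time() wraps any expression and reports the elapsed wall-clock time. A minimal sketch using the training set from above:

# Time the model fit; the "elapsed" entry is wall-clock seconds.
fit_time <- system.time(svm(Speed ~ ., data = train, kernel = "linear"))
print(fit_time["elapsed"])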

First Hyperparameter: cost

Let's dive into the model hyperparameters! The only hyperparameter for the model at this point is the cost parameter. The SVC model is always looking for a perfect boundary: it wants to classify every point in the training set correctly. So, what do we do about points that just aren't going to be classified perfectly? We apply a "penalty" to those points and try to make the total penalty as small as we can. The size of the penalty is controlled by the cost parameter. Let's try two extremes with our linear decision boundary to see what they do (the model defaults to cost=1.0 if we don't tell it to use a different value).
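
For reference (we're skipping the derivation), the cost parameter is the $C$ in the standard soft-margin objective

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,$$

where each slack variable $\xi_i$ measures how badly point $i$ violates the margin. A large $C$ punishes violations heavily, hugging the training data; a small $C$ tolerates them in exchange for a wider margin.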

We're going to use a set of sub-plots so we can put both the confusion matrix and the support vector points on the same line.


In [9]:
library(gridExtra)

svm_model2 <- svm(Speed ~ ., data=train, kernel="linear", cost=0.05)
pred2 <- predict(svm_model2, test)

meshdf$Prediction <- predict(svm_model2,meshdf)
meshdf$col <- ifelse(meshdf$Prediction == 'fast', 'blue', 'white')

cm<-confusionMatrix(pred2,test$Speed)

support_points <- train[svm_model2$index,]


print(paste("Accuracy:",cm$overall['Accuracy']))
df <- data.frame(Reference, Prediction, Y=c(cm$table))

p1 <- ggplot(data =  df, mapping = aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Y), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_bw() + theme(legend.position = "none") + scale_x_discrete(position = "top") + coord_fixed()

p2 <- ggplot(meshdf) +
 geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
 coord_fixed()
p2 <- p2 + geom_point(aes(Grade,Bumpiness),data=support_points,shape='o',size=3) + labs(title=paste("N support points:",nrow(support_points)))

grid.arrange(p1, p2, ncol=2)


Warning message:
"package 'gridExtra' was built under R version 3.3.3"
[1] "Accuracy: 0.92"

Note that the number of support points has dramatically increased, but the overall quality of the fit (as measured by the Accuracy) hasn't improved! We've added complexity and not improved our performance at all. This is called overfitting, and we'll see more examples of it later.

So, let's go to the other extreme and make cost large.


In [10]:
svm_model3 <- svm(Speed ~ ., data=train, kernel="linear", cost=100)
pred3 <- predict(svm_model3, test)

meshdf$Prediction <- predict(svm_model3,meshdf)
meshdf$col <- ifelse(meshdf$Prediction == 'fast', 'blue', 'white')

cm<-confusionMatrix(pred3,test$Speed)

support_points <- train[svm_model3$index,]

print(paste("Accuracy:",cm$overall['Accuracy']))
df <- data.frame(Reference, Prediction, Y=c(cm$table))

p1 <- ggplot(data =  df, mapping = aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Y), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_bw() + theme(legend.position = "none") + scale_x_discrete(position = "top") + coord_fixed()


p2 <- ggplot(meshdf) +
 geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
 coord_fixed()
p2 <- p2 + geom_point(aes(Grade,Bumpiness),data=support_points,shape='o',size=3) + labs(title=paste("N support points:",nrow(support_points)))

grid.arrange(p1, p2, ncol=2)


[1] "Accuracy: 0.92"

Now we've reduced the number of parameters without reducing the Accuracy very much at all. At this point it doesn't matter how much bigger we make cost; the SVC algorithm can't do any better.

The RBF Kernel

Now that we've looked at the linear SVC, we can extend this model in a very simple way: instead of demanding that the decision boundary be a straight line, we let the boundary wiggle between points. This gives us another hyperparameter to work with, $\gamma$, which tells the SVC how close it should try to get to any given point. The best thing to do is to try out a couple of combinations to see how they perform.
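
For reference, the radial basis function (RBF) kernel behind kernel="radial" is

$$K(x, x') = \exp\left(-\gamma\,\lVert x - x' \rVert^2\right),$$

so a large $\gamma$ makes each support point's influence very local (the boundary can chase individual points), while a small $\gamma$ spreads that influence out and keeps the boundary smooth, approaching the linear case.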


In [11]:
svm_model4 <- svm(Speed ~ ., data=train, kernel="radial", cost=1000, gamma=0.001)
pred4 <- predict(svm_model4, test)

meshdf$Prediction <- predict(svm_model4,meshdf)
meshdf$col <- ifelse(meshdf$Prediction == 'fast', 'blue', 'white')

cm<-confusionMatrix(pred4,test$Speed)

support_points <- train[svm_model4$index,]

print(paste("Accuracy:",cm$overall['Accuracy']))
df <- data.frame(Reference, Prediction, Y=c(cm$table))

p1 <- ggplot(data =  df, mapping = aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Y), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_bw() + theme(legend.position = "none") + scale_x_discrete(position = "top") + coord_fixed()


p2 <- ggplot(meshdf) +
 geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
 coord_fixed()
p2 <- p2 + geom_point(aes(Grade,Bumpiness),data=support_points,shape='o',size=3) + labs(title=paste("N support points:",nrow(support_points)))

grid.arrange(p1, p2, ncol=2)


[1] "Accuracy: 0.915"

This looks a lot like the linear boundary! We'll try changing the values to see what they do.


In [12]:
svm_model5 <- svm(Speed ~ ., data=train, kernel="radial", cost=1, gamma=0.001)
pred5 <- predict(svm_model5, test)

meshdf$Prediction <- predict(svm_model5,meshdf)
meshdf$col <- ifelse(meshdf$Prediction == 'fast', 'blue', 'white')

cm<-confusionMatrix(pred5,test$Speed)

support_points <- train[svm_model5$index,]

print(paste("Accuracy:",cm$overall['Accuracy']))
df <- data.frame(Reference, Prediction, Y=c(cm$table))

p1 <- ggplot(data =  df, mapping = aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Y), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_bw() + theme(legend.position = "none") + scale_x_discrete(position = "top") + coord_fixed()


p2 <- ggplot(meshdf) +
 geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
 coord_fixed()
p2 <- p2 + geom_point(aes(Grade,Bumpiness),data=support_points,shape='o',size=3) + labs(title=paste("N support points:",nrow(support_points)))

grid.arrange(p1, p2, ncol=2)


[1] "Accuracy: 0.77"

Reducing the penalty looks like it added a bit of curvature, but it increased the number of support points and lowered the accuracy. Let's try the other way.

Programming Aside Note:

Take a quick look at the previous two blocks of code: they were identical except for a couple of parameters at the top. This is a good indication that we should write our own function to simplify running more examples. R makes that really easy: we wrap the code in a function() definition, pass the model parameters in as arguments, and indent the rest of the code.


In [13]:
testsvmmodel <- function(inputCost, inputGamma){
    svm_modelf <- svm(Speed ~ ., data=train, kernel="radial", cost=inputCost, gamma=inputGamma)
    predf <- predict(svm_modelf, test)

    meshdf$Prediction <- predict(svm_modelf,meshdf)
    meshdf$col <- ifelse(meshdf$Prediction == 'fast', 'blue', 'white')

    cm<-confusionMatrix(predf,test$Speed)

    support_points <- train[svm_modelf$index,]

    print(paste("Accuracy:",cm$overall['Accuracy']))
    df <- data.frame(Reference, Prediction, Y=c(cm$table))

    p1 <- ggplot(data =  df, mapping = aes(x = Reference, y = Prediction)) +
      geom_tile(aes(fill = Y), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
      scale_fill_gradient(low = "white", high = "blue") +
      theme_bw() + theme(legend.position = "none") + scale_x_discrete(position = "top") + coord_fixed()


    p2 <- ggplot(meshdf) +
     geom_rect(mapping = aes(xmin=Grade-h/2, xmax=Grade+h/2, ymin=Bumpiness-h/2, ymax=Bumpiness+h/2),fill = meshdf$col) + 
     coord_fixed()
    p2 <- p2 + geom_point(aes(Grade,Bumpiness),data=support_points,shape='o',size=3) + labs(title=paste("N support points:",nrow(support_points)))

    grid.arrange(p1, p2, ncol=2)
}

# Now we just need to call the function!
testsvmmodel(100,0.001)


[1] "Accuracy: 0.915"

So we've reduced the number of support points and increased the performance of the model! The boundary is no longer a straight line, but it seems to do better than the straight line did. This is an example of reducing complexity and improving performance. That's the direction we want to go! Let's try the other hyperparameter: $\gamma$.


In [14]:
testsvmmodel(10,10)


[1] "Accuracy: 0.925"

So we've improved performance again! But this time, look at the support points and the decision boundary: they are both a mess! This is a complicated model that is trying hard to fit a very particular shape that doesn't really match our dataset. In other words, it looks like it is fitting the noise in the data instead of the underlying pattern. This is what we mean by overfitting, and it is going the wrong way! A model like this will most likely perform poorly when we head out to try it in the "real world".

We'll try other combinations now.


In [15]:
testsvmmodel(10,1)


[1] "Accuracy: 0.935"

So, looking at all of these together, it looks like our best performance with the simplest model was where we had cost=10 and gamma=1. If we wanted to do even better, we could tune the parameters further by exploring around this point. Later in the course we'll take a look at tools that will simplify that optimization for us.
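
As a preview of one such tool: the e1071 package we're already using ships a grid-search helper, tune.svm(), which cross-validates every combination of the values you hand it. A minimal sketch (the grid values here are just illustrative, not a recommendation):

# Cross-validate every (cost, gamma) pair on the training set
# (10-fold cross-validation by default) and report the best combination.
tuned <- tune.svm(Speed ~ ., data = train,
                  cost = c(1, 10, 100), gamma = c(0.1, 1, 10))
print(tuned$best.parameters)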

Support Vector Regressions

We can use this same type of tool to make predictions for continuous data as well. We will return to the fake data we used in Class 02 to see how to apply the SVM regression.
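
Under the hood (we won't go deeper than this), when the response variable is numeric, svm() switches to "eps-regression": it fits a function that ignores errors inside a tube of width $\varepsilon$ around the prediction and penalizes larger deviations, again weighted by the cost parameter. The loss on a single point is

$$L(y, f(x)) = \max\left(0,\ \lvert y - f(x) \rvert - \varepsilon\right).$$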


In [16]:
fakedata2 <- read.csv('../Class02/Class02_fakedata2.csv')
# Do the test/train split
trainIndex <- sample(seq(nrow(fakedata2)), nrow(fakedata2)*0.8)
train <- fakedata2[trainIndex, ]
test <- fakedata2[-trainIndex, ]

ggplot() + geom_point(aes(input,output),color='red',data=train)+ geom_point(aes(input,output),color='blue',data=test)


As a reminder, let's review the linear regression along with its graph.


In [17]:
linearmodel <- lm(output ~ input, train)
options(repr.plot.width=4, repr.plot.height=4)
plot(test, pch=15, col="blue")
abline(linearmodel,col="red")

# Add a legend
legend(0,2.4, # places a legend at the appropriate place
       c("Data","Linear Fit"), # puts text in the legend
       lty=c(0,1), # gives the legend appropriate symbols (lines)
       pch=c(15,NA), # Sets the symbol correctly for the point and line
       col=c("blue","red")) # gives the legend lines the correct color and width
predictions <- predict(linearmodel, test)
actuals <- test$output
print(paste("RMS Error:", sqrt(mean((predictions-actuals)^2))))


[1] "RMS Error: 0.0403957695492211"

Ok, we now try the support vector regression model.


In [18]:
svrmodel <- svm(output ~ input, train,cost=1.0,gamma=1.0)
predictions <- predict(svrmodel,test)
preddf <- data.frame(input=test$input,pred=predictions)
ggplot() + geom_point(aes(input,output),shape=15,color='blue',data=test) + geom_line(aes(input,pred),color='red',data=preddf)
print(paste("RMS Error:", sqrt(mean((predictions-actuals)^2))))


[1] "RMS Error: 0.0571158741047992"

That performance was worse than the linear model (remember, an RMS error of 0.0 would mean a perfect fit). Let's try adjusting the hyperparameters. We'll make a function to make it easy to repeat this.


In [19]:
svrtest <- function(inputcost, inputgamma){
    svrmodel <- svm(output ~ input, train,cost=inputcost,gamma=inputgamma)
    predictions <- predict(svrmodel,test)
    preddf <- data.frame(input=test$input,pred=predictions)
    ggplot() + geom_point(aes(input,output),shape=15,color='blue',data=test) + geom_line(aes(input,pred),color='red',data=preddf) + 
    labs(title=(paste("RMS Error:", sqrt(mean((predictions-actuals)^2)))))
}
svrtest(100,1)



In [20]:
svrtest(0.1,1)



In [21]:
svrtest(1,100)


It looks like the best we can do with the SVR is actually just a little bit better than the linear regression.
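
One more knob worth knowing about: we only varied cost and gamma above, but for regression svm() also takes an epsilon argument (default 0.1) that sets the width of the insensitive tube described earlier. A quick sketch using the same train/test split:

# Shrinking epsilon forces the fit to track the training points more closely;
# whether that helps or just overfits depends on the noise in the data.
svrmodel_eps <- svm(output ~ input, train, cost = 1, gamma = 1, epsilon = 0.01)
predictions_eps <- predict(svrmodel_eps, test)
print(paste("RMS Error:", sqrt(mean((predictions_eps - actuals)^2))))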

In-class Activity

We used a multi-feature dataset in Class 02 to try out the regression. Use the SVM regression on that dataset to see if you can do any better in predicting the output.

Assignment

Implement either the SVM classifier or regression on your own data. You can do multi-class predictions with the classifier, so it should be able to handle pretty much any dataset. Record the time it takes for the model to fit on your data and compare that to the Perceptron and Naïve Bayes models we did last week.