Class 04

ML Models: Naïve Bayes + Evaluation Metrics

We are going to work with classifier models today. We start with a sample dataset from Sebastian Thrun's Udacity Machine Learning course. Here's the scenario: we are building a self-driving car. We have mapped out the course we are taking and created a dataset that indicates, on a scale from 0 to 1, how bumpy the road is and, on the same scale, how steep the road is (its "grade"). For each road we need to know whether the car should drive "slow" or "fast". For example, we want to slow down for bumpy roads, but we may want to speed up when we are going up steep hills. I've created a sample dataset of fake data that maps this out. We start by loading and plotting the data.


In [1]:
import pandas as pd
import seaborn as sns
sns.set_style("white")

#Note the new use of the dtype option here. We can directly tell pandas to use the Speed column as a category in one step.
speeddf = pd.read_csv("Class04_speed_data.csv",dtype={'Speed':'category'})

lm = sns.lmplot(x='Grade', y='Bumpiness', data=speeddf, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)


We will start with a subset of this data to illustrate what we are trying to do. We use the sample() function to pull out a small piece of the data (16 points), and we fix the random_state option so that we get the same subset every time the notebook runs.


In [2]:
speedsub = speeddf.sample(16,random_state=55)
sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(top=False, right=False)


What we want is for the computer to learn where the boundary lies between the fast data points and the slow data points. That way we can input any grade and any bumpiness and the computer will tell us whether to go fast or slow. It looks like there is a region between the two sets of data where we could potentially put our boundary.


In [3]:
lm = sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
# Shade a polygon covering the gap between the two classes: a rough guess at
# where the decision boundary could go
patches=[]
polygon = Polygon([[.92,0],[1,0],[1,.24],[0,.9],[0,.67]], closed=True)
patches.append(polygon)
p = PatchCollection(patches, alpha=0.4)
lm.ax.add_collection(p)


Out[3]:
<matplotlib.collections.PatchCollection at 0x7f28ed36f710>

How do we decide where in this region to put the boundary? There are a couple of different algorithms that will do the job for us. We're not going to spend time describing how they work - you can look them up if you are interested in the mathematics. Instead, we'll look at how to apply them and look at how well they work.

Perceptron

The first algorithm is called the Perceptron (information on how it works is found on Wikipedia: https://en.wikipedia.org/wiki/Perceptron#Learning_algorithm). The documentation for the Scikit Learn Perceptron is found here. We'll use a syntax very similar to the pattern we used in Class 02. First, we split the data into training and testing sets.


In [4]:
from sklearn.model_selection import train_test_split

trainsub, testsub = train_test_split(speedsub, test_size=0.2, random_state=23)

Now we import the model and train it, just like we did with the linear regression.


In [5]:
from sklearn.linear_model import Perceptron

# Step 1: Create the Perceptron object
model = Perceptron()

# Step 2: Train the model using the training sets
features = trainsub[['Grade','Bumpiness']].values
labels = trainsub['Speed'].values

model.fit(features,labels)
print("Model Coefficients: {}".format(model.coef_))
print("Model Intercept: {}".format(model.intercept_))


Model Coefficients: [[-0.71490007 -1.31126853]]
Model Intercept: [ 1.]

We would like to visualize the decision boundary between the two classes. There are a couple of ways we could do this. For linear models like the perceptron, we can get the coefficients from the model and then plot them as a line. There are a couple of other steps to this, but fortunately, there is code to help us figure it out.
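To be concrete about those steps: the perceptron classifies a point by the sign of w0*Grade + w1*Bumpiness + intercept, so the decision boundary is the line where that expression equals zero. Solving for Bumpiness gives Bumpiness = -(w0/w1)*Grade - intercept/w1, which is exactly the slope and offset the code below computes.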


In [6]:
import matplotlib.pyplot as plt
import numpy as np
w = model.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(0,1)
yy = a * xx - (model.intercept_[0]) / w[1]

# Plot the points
lm2 = sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(ax=lm2.ax, top=False, right=False)

# Plot our range estimate
p2 = PatchCollection(patches, alpha=0.4)
lm2.ax.add_collection(p2)

# Plot the actual decision boundary
plt.plot(xx, yy, 'k-')


Out[6]:
[<matplotlib.lines.Line2D at 0x7f28ed1bd790>]

Note that the line isn't very good - remember that we only used a subset of the data to fit the decision boundary. But it still lies in the expected range.

There is another way we could plot this: we could split our figure into small boxes, then make a prediction for each box. We then plot all the decisions in two different colors, showing the prediction for each box. This gives us a more general tool for plotting not only linear boundaries, but any possible decision boundary.


In [7]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh
x_min = 0.0; x_max = 1.0 # Mesh x size
y_min = 0.0; y_max = 1.0  # Mesh y size
h = .01  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Now predict the results at each point and get the categorical values
Zpred = model.predict(np.c_[xx.ravel(), yy.ravel()])
Zseries = pd.Series(Zpred, dtype='category')
Zvalues = Zseries.cat.codes.values
Z = Zvalues.reshape(xx.shape)


# First plot our points
lm2 = sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(ax=lm2.ax, top=False, right=False)

# Now add in the decision boundary
plt.pcolormesh(xx, yy, Z, cmap= plt.cm.cool, alpha=0.1)


Out[7]:
<matplotlib.collections.QuadMesh at 0x7f28ed34cd10>

At this point, let's go back to the entire dataset, split it into training and testing sets, and fit the decision boundary using all of the training data. We'll also look at the out-of-sample performance by plotting the test data instead of the training data.


In [8]:
train, test = train_test_split(speeddf, test_size=0.2, random_state=23)

model2 = Perceptron()

features_train = train[['Grade','Bumpiness']].values
labels_train = train['Speed'].values
features_test = test[['Grade','Bumpiness']].values
labels_test = test['Speed'].values

model2.fit(features_train,labels_train)

Zpred = pd.Series(model2.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values
Z = Zpred.reshape(xx.shape)

# First plot our points
lm = sns.lmplot(x='Grade', y='Bumpiness', data=test, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)
plt.pcolormesh(xx, yy, Z, cmap= plt.cm.cool, alpha=0.1)


Out[8]:
<matplotlib.collections.QuadMesh at 0x7f2903c5a4d0>

So, there are a few things to note here. First, the Perceptron has given us a boundary that works fairly well. However, it isn't perfect. There are a few points that are labeled "fast" that will now be classified as "slow". It would be nice to have a way to quantify how well the classifier has performed. We'll look at a new set of tools to do that.

Evaluation Metrics

First, we review the evaluation metric we've already seen: the RMS value for the linear regression. Recall from Class 02 that we calculated this by taking our model prediction, subtracting the actual value, squaring the difference, then averaging over all points in the test set. Finally, we took the square root of this to get the RMS: "[Square]Root [of the] Mean-Squared". A perfect fit gives an RMS of 0.0, and larger RMS values mean the fit is performing worse.
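As a quick refresher, here is that calculation spelled out on a few made-up numbers (a sketch, not tied to our dataset):

# A quick refresher on the RMS calculation, using made-up numbers
import numpy as np
y_actual = np.array([1.0, 2.0, 3.0, 4.0])   # the values we were trying to predict
y_model  = np.array([1.1, 1.9, 3.2, 3.7])   # what the model predicted
rms = np.sqrt(np.mean((y_model - y_actual)**2))
print("RMS: {}".format(rms))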

There are several different ways to evaluate the performance of a classifier model. They all start with the confusion matrix, so we'll start there.

The Confusion Matrix

The first thing we do is recognize that, for a binary (two-class) classifier, there are four possible outcomes when we evaluate each test point:

  1. The prediction says "slow" and the actual label says "slow"
  2. The prediction says "fast", but the actual label says "slow"
  3. The prediction says "slow", but the actual label says "fast"
  4. The prediction says "fast" and the actual label says "fast"

The first and last possibilities indicate that the prediction did a good job, but the other two mean there were problems. Let's make this into a table:

                Predicted Slow    Predicted Fast
Actual Slow           #1                #2
Actual Fast           #3                #4

Now we need to count how many of each possibility there were using the test data. There is, naturally, a tool to do this for us.


In [9]:
from sklearn.metrics import confusion_matrix
class_labels = ["slow", "fast"]
y_pred = model2.predict(features_test)
cnf_matrix = confusion_matrix(labels_test, y_pred,labels=class_labels)
print(cnf_matrix)


[[ 76   0]
 [ 20 104]]

We can also visualize this as a graphic, showing a shade of color for each of the different values. This is especially useful when we have more than two classes. Because we'll use this again, we define a function that takes the class labels and confusion matrix as inputs and creates the plot.


In [10]:
def show_confusion_matrix(cnf_matrix, class_labels):
    plt.matshow(cnf_matrix,cmap=plt.cm.YlGn,alpha=0.7)
    ax = plt.gca()
    ax.set_xlabel('Predicted Label', fontsize=16)
    ax.set_xticks(range(0,len(class_labels)))
    ax.set_xticklabels(class_labels)
    ax.set_ylabel('Actual Label', fontsize=16, rotation=90)
    ax.set_yticks(range(0,len(class_labels)))
    ax.set_yticklabels(class_labels)
    ax.xaxis.set_label_position('top')
    ax.xaxis.tick_top()

    for row in range(len(cnf_matrix)):
        for col in range(len(cnf_matrix[row])):
            ax.text(col, row, cnf_matrix[row][col], va='center', ha='center', fontsize=16)
        
show_confusion_matrix(cnf_matrix,class_labels)


We can see now that the diagonal entries are what we want: the darker they are, the better we are doing. The off-diagonal terms (the slow-fast and fast-slow terms) are points that have been incorrectly identified. It would be nice if we could distill this matrix down into a single number. Unfortunately, there is no unique way of doing that. There are a couple of different metrics that people use and we can quickly go through them. There is a nice summary here of some of the metrics and how people use them.

Class-dependent Metrics

The first three metrics are computed per class, so which number you look at depends on what your target is. For example, the Sensitivity/Recall score reports how well we identify each class separately: correctly predicting when to go slow and correctly predicting when to go fast each get their own score, and you pick the one that matters more to you. Of course, you could also average them and get something in the middle.
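To make the definitions concrete, here is a small sketch of how the recall, precision, and F1 scores for the "slow" class fall out of the confusion matrix counts above (these are the standard definitions; the sklearn functions below compute the same thing for us):

# A sketch of the class-dependent scores for the "slow" class, computed by hand
# from the confusion matrix above (rows = actual, columns = predicted, both
# ordered [slow, fast])
TP = cnf_matrix[0][0]   # actual slow, predicted slow
FN = cnf_matrix[0][1]   # actual slow, predicted fast
FP = cnf_matrix[1][0]   # actual fast, predicted slow
recall_slow = TP / float(TP + FN)      # fraction of actual "slow" points we caught
precision_slow = TP / float(TP + FP)   # fraction of "slow" predictions that were right
f1_slow = 2 * precision_slow * recall_slow / (precision_slow + recall_slow)
print("Recall: {}, Precision: {}, F1: {}".format(recall_slow, precision_slow, f1_slow))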

Class-independent Metrics

The last two metrics take all the possibilities into account and wrap them up as a single number. Which metric you use is something of a personal preference. However, it is good practice to use the same metric when comparing different models.
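These two can also be written directly in terms of the same confusion-matrix counts; here is a sketch for our two-class case (the MCC formula below is the standard binary form, treating "slow" as the positive class):

# A sketch of the class-independent scores from the confusion matrix counts
import numpy as np
TP = cnf_matrix[0][0]; FN = cnf_matrix[0][1]
FP = cnf_matrix[1][0]; TN = cnf_matrix[1][1]
accuracy = (TP + TN) / float(TP + TN + FP + FN)
mcc = (TP*TN - FP*FN) / np.sqrt(float((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)))
print("Accuracy: {}, MCC: {}".format(accuracy, mcc))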


In [11]:
import sklearn.metrics as metrics

recall_score = metrics.recall_score(labels_test, y_pred,labels=class_labels,average=None)
prec_score = metrics.precision_score(labels_test, y_pred,labels=class_labels,average=None)
f1_score = metrics.f1_score(labels_test, y_pred,labels=class_labels,average=None)

acc_score = metrics.accuracy_score(labels_test, y_pred)
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)

print("Class-dependent Metrics")
print("Sensitivity/Recall Score: {}".format(recall_score))

print("Precision Score: {}".format(prec_score))
print("F1 Score: {}".format(f1_score))

print("\nClass-independent Metrics")
print("Accuracy Score: {}".format(acc_score))
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Class-dependent Metrics
Sensitivity/Recall Score: [ 1.          0.83870968]
Precision Score: [ 0.79166667  1.        ]
F1 Score: [ 0.88372093  0.9122807 ]

Class-independent Metrics
Accuracy Score: 0.9
Matthews Correlation Coefficient (MCC): 0.814848755674

The Perceptron is typically slow and not very flexible: with a large dataset it takes a long time to reach a solution. Although it is simple to implement, it isn't very accurate and isn't used much in practice. We'll try one more classifier so we can compare the two.

Naïve Bayes

We'll now try the Naïve Bayes classifier. If you are interested in how the classifier works, I suggest either this tutorial or reading the Wikipedia page. We'll stick to the application and evaluation of the model. One of the advantages of the Naïve Bayes classifier is that it isn't restricted to a linear decision boundary. That means it can account for curved boundaries and maybe do a little bit better than the Perceptron.

We use the same set of training/testing features and labels as we used with the Perceptron. That will give us a head-to-head comparison between the two models.
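Before fitting the model, here is a rough look at what the "Gaussian" part of GaussianNB means: the classifier models each feature within each class as a normal distribution, so all it really needs to learn are per-class means and variances. A quick sketch of those quantities, computed directly from the training DataFrame rather than from the fitted model:

# A rough illustration of what Gaussian Naive Bayes learns: a mean and a
# variance for each feature within each class (computed here straight from
# the training data as a sketch)
print(train[['Grade', 'Bumpiness', 'Speed']].groupby('Speed').mean())
print(train[['Grade', 'Bumpiness', 'Speed']].groupby('Speed').var())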


In [12]:
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(features_train, labels_train)

# Plot the decision boundary
Zpred = pd.Series(nb_model.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values
Z = Zpred.reshape(xx.shape)
lm = sns.lmplot(x='Grade', y='Bumpiness', data=test, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)
plt.pcolormesh(xx, yy, Z, cmap= plt.cm.cool, alpha=0.1)

# Plot the confusion matrix
y_pred_nb = nb_model.predict(features_test)
cnf_matrix_nb = confusion_matrix(labels_test, y_pred_nb,labels=class_labels)
show_confusion_matrix(cnf_matrix_nb, class_labels)


There are a couple of things to note here. First, the decision boundary is curved! However, it is a fairly simple curve in that it doesn't wiggle very much - it is a smooth arc. This is related to the Learning Principle of Occam's Razor from class. A straight line is the simplest possible decision boundary and, therefore, is valued highly from the perspective of keeping the model as simple as possible. A smooth curve is slightly more complicated, but still fairly simple. The question is: do we gain out-of-sample performance by adding the complexity of a curved decision boundary?

That brings us to the second point: the confusion matrix now shows us that we have mis-classified 17 points. We compare that to the Perceptron model where we mis-classified 20 points. So we've done a little bit better in terms of out-of-sample performance, which is good. Let's take a look at the other metrics to see how they compare.
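A quick way to read those error counts straight off the two confusion matrices (everything off the diagonal is a mistake):

# Count the misclassified points for each model: total minus the diagonal
print("Perceptron errors: {}".format(cnf_matrix.sum() - np.trace(cnf_matrix)))
print("Naive Bayes errors: {}".format(cnf_matrix_nb.sum() - np.trace(cnf_matrix_nb)))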


In [13]:
recall_score = metrics.recall_score(labels_test, y_pred_nb,labels=class_labels,average=None)
prec_score = metrics.precision_score(labels_test, y_pred_nb,labels=class_labels,average=None)
f1_score = metrics.f1_score(labels_test, y_pred_nb,labels=class_labels,average=None)

acc_score = metrics.accuracy_score(labels_test, y_pred_nb)
matt_score = metrics.matthews_corrcoef(labels_test, y_pred_nb)

print("Class-dependent Metrics")
print("Sensitivity/Recall Score: {}".format(recall_score))

print("Precision Score: {}".format(prec_score))
print("F1 Score: {}".format(f1_score))

print("\nClass-independent Metrics")
print("Accuracy Score: {}".format(acc_score))
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Class-dependent Metrics
Sensitivity/Recall Score: [ 0.80263158  0.98387097]
Precision Score: [ 0.96825397  0.89051095]
F1 Score: [ 0.87769784  0.9348659 ]

Class-independent Metrics
Accuracy Score: 0.915
Matthews Correlation Coefficient (MCC): 0.821839883647

Almost across the board, the Naïve Bayes classifier does a little bit better than the Perceptron classifier. It isn't a huge difference, though.

On the other hand, the Naïve Bayes classifier is a faster algorithm and handles large datasets better. It also gives us one additional piece of information that can be useful: it will tell us the prediction probabilities for each test point. That will give us access to another metric that can be useful.

Prediction Probabilities

When we make a prediction on one of the test features, the Naïve Bayes classifier will not only tell us its prediction for what the label should be, it will also tell us with what probability it thinks that label is correct. For example, we input the following values to get the prediction.


In [14]:
print("Input values: {}".format(features_test[0]))
print("Prediction: {}".format(nb_model.predict([features_test[0]])))


Input values: [ 0.75209012  0.15270399]
Prediction: ['fast']

How confident is the model of that prediction? Let's get the prediction probabilities for that point.


In [15]:
print("Prediction Probabilities: {}".format(nb_model.predict_proba([features_test[0]])))


Prediction Probabilities: [[ 0.6794251  0.3205749]]

So, we can see that, for this point, the model outputs a 68% chance that the point should be classified as "fast" and, therefore, a 32% chance that it is "slow". We can plot these probabilities for all of the test points to show how the model maps input values to output probabilities. (Note: there may be a pandas warning... it doesn't appear to affect the outcome, so don't worry about it.)


In [16]:
#first, get all the predictions
y_proba_nb = nb_model.predict_proba(features_test)

# Column 0 of the probabilities corresponds to the "fast" class (the classes
# are stored in sorted order)
test['fastprob'] = y_proba_nb[:,0]

cm = plt.cm.get_cmap('YlGn')
sc = plt.scatter(x=test['Grade'], y=test['Bumpiness'], c=test['fastprob'] , vmin=0, vmax=1, s=35, cmap=cm)
cbr = plt.colorbar(sc)
cbr.set_label('Probability of "fast"')
plt.xlabel('Grade')
plt.ylabel('Bumpiness')


/projects/sage/sage-7.5/local/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[16]:
<matplotlib.text.Text at 0x7f28e804eb50>

So we see that the model assigns a pretty high probability to the correct label in both corners, but closer to the decision boundary the probability of each label approaches the 50% midpoint.
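One way to see the connection between these probabilities and the labels we plotted earlier: predict() simply picks whichever class has the higher probability. A small sanity check (a sketch):

# predict() should agree with taking the most probable class for every test point
most_likely = nb_model.classes_[np.argmax(y_proba_nb, axis=1)]
print((most_likely == nb_model.predict(features_test)).all())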

Logloss Metric

We've got one more metric we can use for models that give us access to the prediction probabilities. This metric is 0.0 only when every point is assigned the correct label with 100% probability; the closer to zero you are, the better the model is doing at predicting the correct outcomes. It is a class-independent metric and works for models with more than two classes, too.
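For reference, the log loss is just the average of -log(probability assigned to the true class) over the test points. A sketch of that calculation by hand, which should agree with the metrics.log_loss value below:

# Log loss by hand: average of -log(p) where p is the probability the model
# assigned to the true label (columns of y_proba_nb follow nb_model.classes_)
col_of = {c: i for i, c in enumerate(nb_model.classes_)}
true_cols = np.array([col_of[label] for label in labels_test])
p_true = y_proba_nb[np.arange(len(labels_test)), true_cols]
print("Manual log loss: {}".format(-np.mean(np.log(p_true))))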


In [17]:
logloss = metrics.log_loss(labels_test, y_proba_nb)
print("Log loss: {}".format(logloss))


Log loss: 0.247699390524

Assignment

Your assignment this week is to run through both the Perceptron and the Naïve Bayes classifiers with your classification data. Evaluate both models using each of the metrics we've learned about and compare the performance of the models.

If you find that the model fit is taking a long time, you should note that in your assignment as well. How long a model takes to train is an important consideration. There is a simple way of timing the model performance. We'll run both models again and compare their timing. For the small number of data points we have in this dataset, the timing isn't very different. That may not be the case for your models.


In [19]:
import time
# Perceptron Model
start1 = time.time()
model2.fit(features_train,labels_train)
stop1 = time.time()
print("Elapsed time: {} seconds".format(stop1-start1))


Elapsed time: 0.00249886512756 seconds

In [20]:
# Naïve Bayes model
start2 = time.time()
nb_model.fit(features_train,labels_train)
stop2 = time.time()
print("Elapsed time: {} seconds".format(stop2-start2))


Elapsed time: 0.078929901123 seconds
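If you want a less noisy measurement, Python's timeit module can repeat the fit several times and let you take the best run. A sketch (the repeat and number values here are arbitrary choices):

# A less noisy timing approach: repeat the fit several times and keep the best run
import timeit
best = min(timeit.repeat(lambda: nb_model.fit(features_train, labels_train),
                         repeat=3, number=10)) / 10.0
print("Best average fit time: {} seconds".format(best))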
