Class 07

ML Models: Decision Trees

We will cover a new type of machine learning algorithm in this class: decision trees. We will also talk about ensemble methods and how we can use them to improve the performance of our machine learning models.

Classification Decision Trees

We'll start by using a decision tree classifier. We'll use the same set of data as we used in Class 06. Again, that will allow us to compare the algorithm head-to-head with the other classifiers we've used previously. A decision tree works by splitting the data into pieces while trying to maximize the uniformity of each piece. Although we won't dive deeply into how the algorithm works, you can read a great tutorial here.

For example, suppose we start with a group of 10 people, half of whom identify as male and half as female. The most uniform split divides the group into two sub-groups, known as nodes: here we can split the group cleanly so that each sub-group is uniformly populated. The tree builds a set of decision nodes to split the group so as to end up with the best set of rules for predicting the output labels.

The tree will continue to split until it reaches a point where it can't split the data any further. These end points are called leaf nodes. The minimum number of data points allowed in a leaf node is one of the hyperparameters we have to tune. Going back to our example, if we set the minimum size of a leaf node to 5 people, the decision tree will stop after a single split. If we allow smaller leaf nodes, however, it may split the sub-groups further by age, height, or other features.
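To make "uniformity" concrete: scikit-learn's default split criterion is Gini impurity, which is 0 for a perfectly uniform node and 0.5 for a maximally mixed two-class node. Here is a minimal sketch (the gini helper is our own illustration, not a sklearn function) applied to the 10-person example:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(len(labels))
    return 1.0 - np.sum(p ** 2)

group = ['male'] * 5 + ['female'] * 5
print(gini(group))      # 0.5: maximally mixed two-class node
print(gini(group[:5]))  # 0.0: perfectly uniform node
print(gini(group[5:]))  # 0.0: the clean split removes all impurity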

This will make more sense as we try it out, so let's get started.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

#Note the new use of the dtype option here. We can directly tell pandas to use the Speed column as a category in one step.
speeddf = pd.read_csv("../Class04/Class04_speed_data.csv",dtype={'Speed':'category'})

#We'll use a different tool to plot the data now that we know how to group the data by a category. This will help us make better combined plots later on.
groups = speeddf.groupby('Speed')

# Plot
trainfig, ax = plt.subplots()
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.set_aspect(1)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')


Out[1]:
<matplotlib.text.Text at 0x7f58e92fb410>

We'll import the DecisionTreeClassifier and use all of the default values except for random_state. We provide that so the output is consistent from run to run: the classifier uses a random number generator when choosing among equally good splits, so if we don't set it, we may get different results each time we run the algorithm.


In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create our decision boundary mesh
# point in the mesh
x_min = 0.0; x_max = 1.0 # Mesh x size
y_min = 0.0; y_max = 1.0  # Mesh y size
h = .01  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max+h, h), np.arange(y_min, y_max+h, h))

# Split the data into training and testing sets and prepare the features and labels
train, test = train_test_split(speeddf, test_size=0.2, random_state=23)
features_train = train[['Grade','Bumpiness']].values
labels_train = train['Speed'].values
features_test = test[['Grade','Bumpiness']].values
labels_test = test['Speed'].values
class_labels = ["slow", "fast"]

# Load the model and fit the data
dtmodel = DecisionTreeClassifier(random_state=32)
dtmodel.fit(features_train,labels_train)

y_pred = dtmodel.predict(features_test)

# Predict the boundary
Z = pd.Series(dtmodel.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values.reshape(xx.shape)


# First plot our points
testfig1, ax = plt.subplots()

ax.pcolormesh(xx, yy, Z, cmap=plt.cm.cool, alpha=0.1)
ax.set_aspect(1)

# Plot test points
groups = test.groupby('Speed')
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')

import sklearn.metrics as metrics

recall_score = metrics.recall_score(labels_test, y_pred,labels=class_labels,average=None)
prec_score = metrics.precision_score(labels_test, y_pred,labels=class_labels,average=None)
f1_score = metrics.f1_score(labels_test, y_pred,labels=class_labels,average=None)

acc_score = metrics.accuracy_score(labels_test, y_pred)
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)

print("Class-dependent Metrics")
print("Sensitivity/Recall Score: {}".format(recall_score))

print("Precision Score: {}".format(prec_score))
print("F1 Score: {}".format(f1_score))

print("\nClass-independent Metrics")
print("Accuracy Score: {}".format(acc_score))
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Class-dependent Metrics
Sensitivity/Recall Score: [ 0.89473684  0.94354839]
Precision Score: [ 0.90666667  0.936     ]
F1 Score: [ 0.90066225  0.93975904]

Class-independent Metrics
Accuracy Score: 0.925
Matthews Correlation Coefficient (MCC): 0.840473092852

Take a look at the decision boundary for this classifier: it is all over the place! The tree tries to account for every point, so it creates branches where there shouldn't be any. We have a classic case of overfitting! And the model's performance isn't great either, with an MCC of 0.84. It is time to tune the hyperparameters to see if we can do better. Let's start by tuning the minimum number of samples in the leaf nodes of the tree.


In [3]:
# Load the model and fit the data
dtmodel = DecisionTreeClassifier(min_samples_leaf=10,random_state=32)
dtmodel.fit(features_train,labels_train)

y_pred = dtmodel.predict(features_test)

# Predict the boundary
Z = pd.Series(dtmodel.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values.reshape(xx.shape)


# First plot our points
testfig1, ax = plt.subplots()

ax.pcolormesh(xx, yy, Z, cmap=plt.cm.cool, alpha=0.1)
ax.set_aspect(1)

# Plot test points
groups = test.groupby('Speed')
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Matthews Correlation Coefficient (MCC): 0.851443123939

So our decision boundary cleaned up significantly, and we got a bump in the model's test performance. Let's check one more value to see if we can do any better.


In [4]:
# Load the model and fit the data
dtmodel = DecisionTreeClassifier(min_samples_leaf=5,random_state=32)
dtmodel.fit(features_train,labels_train)

y_pred = dtmodel.predict(features_test)

# Predict the boundary
Z = pd.Series(dtmodel.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values.reshape(xx.shape)


# First plot our points
testfig1, ax = plt.subplots()

ax.pcolormesh(xx, yy, Z, cmap=plt.cm.cool, alpha=0.1)
ax.set_aspect(1)

# Plot test points
groups = test.groupby('Speed')
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Matthews Correlation Coefficient (MCC): 0.894625126632

We got an MCC of 0.894 with a fairly simple decision boundary. That's good! There are, perhaps, a few too many wiggles in the boundary, but overall it is looking pretty good. Note that all of the boundary segments are straight lines: that is because the decision tree chooses cutoff values of "Grade" and "Bumpiness" and splits the dataset along those lines. Overall this isn't too bad.
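You can even see those cutoff values directly. A minimal sketch using the fitted tree's tree_ attribute, where feature holds the column index each internal node splits on (leaves are marked with -2) and threshold holds the cutoff:

# Print the axis-aligned cutoffs the tree learned.
feature_names = ['Grade', 'Bumpiness']
for node, (feat, thresh) in enumerate(zip(dtmodel.tree_.feature, dtmodel.tree_.threshold)):
    if feat >= 0:  # negative values mark leaf nodes
        print("node {}: split on {} <= {:.3f}".format(node, feature_names[feat], thresh))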

Ensemble Methods

The decision tree did a reasonable job of modeling our data, but we only used one tree and one set of random values. What if we could do this many times and average the results? There are tools to do that! One of the strategies ensemble methods use is to resample the training data for each model in the ensemble. Let's take a quick look at that resampling method, called a "bootstrap" sample.

Suppose we start with 100 data points in our training sample. For each model in the ensemble, a bootstrap sample draws 100 points from the training set with replacement, so some points appear more than once and others not at all. The points left out of a given draw (the "out-of-bag" points) can then be used to validate that model. The ensemble repeats this as many times as it needs, so it is doing training and validation all on the same set of data!
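Here is a minimal numpy sketch of drawing one bootstrap sample and finding its out-of-bag points; this is illustrative only, since RandomForestClassifier handles the equivalent bookkeeping internally:

import numpy as np

rng = np.random.RandomState(32)
n = 100  # size of our training sample

# Draw n indices with replacement: some points repeat, others are left out.
boot = rng.choice(n, size=n, replace=True)

# The points never drawn (out-of-bag) can validate this one model.
oob = np.setdiff1d(np.arange(n), boot)

print("unique points in bootstrap sample: {}".format(len(np.unique(boot))))
print("out-of-bag points for validation: {}".format(len(oob)))
# On average about 63% of the points land in the sample; ~37% are out-of-bag.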

Data Snooping Warning

Although the ensemble is doing its own validation, that doesn't mean you can train with all of your data. You still need to keep the test data locked away, untouched by training. That way we can compare the ensemble model to the other models without cheating ourselves.

We'll try out the simplest version of this first, called the RandomForestClassifier.


In [5]:
# Load the model and fit the data
from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier(n_estimators=100,random_state=32)
rfmodel.fit(features_train,labels_train)

y_pred = rfmodel.predict(features_test)

# Predict the boundary
Z = pd.Series(rfmodel.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values.reshape(xx.shape)


# First plot our points
testfig1, ax = plt.subplots()

ax.pcolormesh(xx, yy, Z, cmap=plt.cm.cool, alpha=0.1)
ax.set_aspect(1)

# Plot test points
groups = test.groupby('Speed')
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Matthews Correlation Coefficient (MCC): 0.883622738921

We see that the ensemble does a reasonable job, though perhaps not better, in this case, than the decision tree by itself. However, there is something else we get out of using the ensemble: it will tell us the relative importance of the different features it used in building the decision boundary. The feature importances are reported as fractions that sum to one, so they can be read as each feature's percentage contribution. This can be helpful in deciding which features to use as inputs to the model: if the ensemble says a feature is not very important, you may be able to drop it and simplify your model.

Let's look at our feature importances:


In [6]:
rfmodel.feature_importances_


Out[6]:
array([ 0.48856166,  0.51143834])

Both features (Grade and Bumpiness) have just about the same importance in our model (about 50% each). That isn't too surprising since we faked the data to begin with...
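Since the array follows the column order of our feature matrix, it can help to print each importance next to its feature name. A quick sketch:

# Pair each importance with its feature name (same column order as features_train).
for name, importance in zip(['Grade', 'Bumpiness'], rfmodel.feature_importances_):
    print("{}: {:.1%}".format(name, importance))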

Let's try some other ensemble methods to see how they work.

AdaBoost Classifier

This is another ensemble classifier. AdaBoost fits a sequence of simple models, re-weighting the training points after each round so that later models concentrate on the examples the earlier ones got wrong.
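To make the re-weighting idea concrete, here is a minimal numpy sketch of one round of the classic discrete (binary) AdaBoost weight update; the real classifier handles all of this internally:

import numpy as np

# Toy setup: true labels and one weak learner's +1/-1 predictions.
y_true = np.array([ 1,  1, -1, -1,  1])
y_pred = np.array([ 1, -1, -1, -1, -1])  # this weak learner gets two wrong

weights = np.ones(len(y_true)) / len(y_true)  # start with uniform weights

# Weighted error of this weak learner and its say (alpha) in the final vote.
err = np.sum(weights[y_pred != y_true])
alpha = 0.5 * np.log((1 - err) / err)

# Increase the weights of misclassified points so the next learner focuses on them.
weights *= np.exp(-alpha * y_true * y_pred)
weights /= weights.sum()
print(weights)  # the two misclassified points now carry half the total weight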


In [7]:
# Load the model and fit the data
from sklearn.ensemble import AdaBoostClassifier

abcmodel = AdaBoostClassifier(n_estimators=100,random_state=32)
abcmodel.fit(features_train,labels_train)

y_pred = abcmodel.predict(features_test)

# Predict the boundary
Z = pd.Series(abcmodel.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values.reshape(xx.shape)


# First plot our points
testfig1, ax = plt.subplots()

ax.pcolormesh(xx, yy, Z, cmap=plt.cm.cool, alpha=0.1)
ax.set_aspect(1)

# Plot test points
groups = test.groupby('Speed')
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Matthews Correlation Coefficient (MCC): 0.88302869249

XGBoost

This last ensemble method is new enough that it is not yet part of the regular sklearn toolbox. However, it has made a fairly big splash in the machine learning community for its performance on real-world data.


In [8]:
import xgboost
xgbmodel = xgboost.XGBClassifier(n_estimators=100, seed=32)
xgbmodel.fit(features_train,labels_train)

y_pred = xgbmodel.predict(features_test)

# Predict the boundary
Z = pd.Series(xgbmodel.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values.reshape(xx.shape)


# First plot our points
testfig1, ax = plt.subplots()

ax.pcolormesh(xx, yy, Z, cmap=plt.cm.cool, alpha=0.1)
ax.set_aspect(1)

# Plot test points
groups = test.groupby('Speed')
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Grade'], group['Bumpiness'], marker='o', linestyle='', ms=8, label=name)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Grade')
ax.set_ylabel('Bumpiness')
matt_score = metrics.matthews_corrcoef(labels_test, y_pred)
print("Matthews Correlation Coefficient (MCC): {}".format(matt_score))


Matthews Correlation Coefficient (MCC): 0.904792426006

So we get a little better performance on this dataset with the XGBoost algorithm. As a quick side note: XGBoost only works out-of-the-box on SageMath using their Python 2 kernel. If you want to use the Python 3 kernel like we have for the other classes, you'll need to install XGBoost for that kernel.
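One common way to do that, assuming pip is available for that kernel's Python environment, is to run the installer from a notebook cell (a sketch, not SageMath-specific instructions):

# Install XGBoost into the active kernel's environment (run once, then restart the kernel).
!pip install xgboost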

In-class Activity

We can also use decision trees to model continuous data for regressions. I want you to look up the documentation for how to do that and implement the regressions on the same data we used in Class 06.

Assignment

Your assignment this week is to try using the decision tree models on your own data. Record how long it took to train the models and how well they performed compared to our previous classifier/regression models.