Random Forests

May 2017


This is a study for a blog post to appear on Data Simple. It will focus on the theory and scikit-learn implementation of the Random Forest machine learning (ML) algorithm.

Contents

  1. Introduction
  2. Decision Trees
  3. Random Forests
  4. Extremely Randomized Trees



1. Introduction

Random Forests are a popular example of an Ensemble Learning method. Ensemble Learning consists of combining multiple ML models in order to achieve higher predictive performance than could be obtained from any of the individual models alone.
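To make the idea concrete, here is a minimal sketch of a hand-rolled ensemble: scikit-learn's VotingClassifier combines a few different classifiers by majority ("hard") vote, here on the same kind of blob data we use throughout this post.

from sklearn.datasets import make_blobs
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data, generated the same way as in the next cell
X_demo, y_demo = make_blobs(n_samples=300, n_features=2, centers=3,
                            cluster_std=4, random_state=42)

# Hard-voting ensemble: each model casts a vote, the majority class wins
voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('knn', KNeighborsClassifier()),
                ('tree', DecisionTreeClassifier(random_state=42))],
    voting='hard')
voting_clf.fit(X_demo, y_demo)
print(voting_clf.predict([[-5, 10]]))

Random Forests take this idea further by combining many instances of the same base model, the Decision Tree, which we look at first.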

Let's start by generating some toy data using scikit-learn's make_blobs function. We'll make it a two-dimensional data set so it is easy to visualize, with 300 observations evenly split between three classes ("Yellow", "Blue", and "Red").


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

# Generate data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, n_features=2, centers=3,
                  cluster_std=4, random_state=42)

# Plot data
plt.figure(figsize=(4, 3))
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "y.")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "b.")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "r.")
plt.xlabel(r"$x_1$", fontsize=20)
plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
plt.title("Toy data set\n", fontsize=16)

#plt.savefig('toy_data.png', dpi=300, bbox_inches='tight')


[Figure: scatter plot of the toy data set, colored by class]

2. Decision Trees

2.1 Making predictions

Let's grow a Decision Tree on our data using scikit-learn to take a closer look at how this works in practice.


In [2]:
from sklearn.tree import DecisionTreeClassifier

# Instantiate the classifier class
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)

# Grow a Decision Tree
tree_clf.fit(X, y)


Out[2]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')

We have grown a tree on our 2-D data set. Classifying new examples is simple:


In [3]:
import numpy as np

def predict_class(data_point, classes=['Yellow', 'Blue', 'Red']):
    # Convert to appropriate numpy array
    data_point = np.array(data_point).reshape(1, -1)
    
    # Classify
    result = tree_clf.predict(data_point)
    
    # Print output
    print('Predicted class for point {}: {}'.format(data_point,
                                                    classes[int(result[0])]))
    
predict_class([-5, 10])
predict_class([10, 5])
predict_class([-10, -10])


Predicted class for point [[-5 10]]: Yellow
Predicted class for point [[10  5]]: Blue
Predicted class for point [[-10 -10]]: Red

We can visualize the Decision Tree model using the dot tool from the graphviz package. First, generate a graph definition as a .dot file using the export_graphviz function. Then convert the .dot file to a .png image file.


In [4]:
from sklearn.tree import export_graphviz

export_graphviz(tree_clf, out_file='tree.dot',
                feature_names=['x1', 'x2'],
                class_names=['Yellow', 'Blue', 'Red'],
                rounded=True, filled=True)

Generate the .png image from the command line with graphviz's dot tool:

$ dot -Tpng tree.dot -o tree.png
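Alternatively, the conversion can be done without leaving Python. A small sketch, assuming the graphviz Python bindings are installed:

import graphviz

# Read the .dot file produced above and render it to tree.png
with open('tree.dot') as f:
    dot_graph = graphviz.Source(f.read())
dot_graph.render('tree', format='png', cleanup=True)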

Here's the tree visualization.

We can now see how the model predicts the class of a new instance: it starts at the root node, which holds all 300 training examples split evenly between the three classes, and asks whether the feature $x_2$ value is lower than or equal to $-1.3707$. It splits the data into two groups according to the answer and then asks another question at each of the two resulting internal nodes, which in turn split the data into two groups each.
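The same split structure can be read programmatically from the fitted estimator's tree_ attribute. A quick inspection sketch (the thresholds come from the tree grown above):

tree = tree_clf.tree_

for node in range(tree.node_count):
    if tree.children_left[node] == -1:
        # Leaf node: report the class distribution that ended up here
        # (counts or fractions, depending on the scikit-learn version)
        print('Node {}: leaf, class distribution = {}'.format(node, tree.value[node][0]))
    else:
        # Internal node: report the split rule
        print('Node {}: split on x{} <= {:.4f}'.format(
            node, tree.feature[node] + 1, tree.threshold[node]))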

Let's define a function to compute the decision boundaries and a function to plot them together with the training data using matplotlib.


In [5]:
from matplotlib.colors import ListedColormap

def compute_decision_boundaries(clf, X, y, axes):
    # Evaluate the classifier on a dense grid covering the plotting area
    x1s = np.linspace(axes[0], axes[1], 300)
    x2s = np.linspace(axes[2], axes[3], 300)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    
    return x1, x2, y_pred

def plot_feature_space(clf, X, y, axes, file_name=None):
    x1, x2, y_pred = compute_decision_boundaries(clf, X, y, axes)
    
    # Shade the predicted regions, then overlay the training data
    plt.figure(figsize=(8, 4))
    custom_cmap = ListedColormap(['y', 'b', 'r'])
    plt.contourf(x1, x2, y_pred, cmap=custom_cmap, alpha=0.1)

    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "y.")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "b.")
    plt.plot(X[:, 0][y==2], X[:, 1][y==2], "r.")
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
    if file_name is not None:
        plt.savefig(file_name, dpi=300, bbox_inches='tight')

In [6]:
plot_feature_space(tree_clf, X, y, axes=[-16, 15, -20, 20], file_name='max_depth2.png')


[Figure: decision boundaries of the max_depth=2 tree]

2.2 "Growing" a Decision Tree model

scikit-learn grows trees with the Classification And Regression Tree (CART) algorithm, which, at each split, searches for the feature $k$ and threshold $t_k$ that minimize the following cost function:

$$ J(k, t_k) = \dfrac{m_{left}}{m}I_{left} + \dfrac{m_{right}}{m}I_{right} $$

$$ \text{with} \begin{cases} I_{left/right} & \text{impurity of the left/right subset,}\\ m_{left/right} & \text{number of instances in the left/right subset.} \end{cases} $$

The impurity measure $I$ is typically the Gini impurity. Formally:

$$ I_G(t) = 1 - \sum_{k=1}^K p_{k,t}^2 $$

$$ \text{with} \begin{cases} k \in \{1, \dots, K\} & \text{index of the class,}\\ p_{k,t} & \text{ratio of class $k$ instances among the training instances in node $t$.} \end{cases} $$

You can go back to the image of the tree structure and check the "gini" attribute at each node. Let's take, for example, the purple leaf node $t=4$. Using the definition above, we can calculate the Gini impurity at this node and get the indicated value of $0.0425$:

$$ \begin{aligned} I_G(4) &= 1 - \sum_{k=1}^K p_{k,4}^2 \\ &= 1 - \left( {\left(\dfrac{0}{92}\right)}^2 + {\left(\dfrac{2}{92}\right)}^2 + {\left(\dfrac{90}{92}\right)}^2 \right) \\ &= 1 - ( 0 + 0.00047 + 0.95699 ) \\ &= 0.0425 \end{aligned} $$
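To double-check the arithmetic, here is a small helper that computes the Gini impurity from class counts, plus the weighted CART cost of a candidate split. This is just a verification sketch, not part of scikit-learn's API:

def gini(counts):
    """Gini impurity for a node with the given class counts."""
    m = sum(counts)
    return 1 - sum((n / m) ** 2 for n in counts)

def cart_cost(left_counts, right_counts):
    """Weighted impurity J(k, t_k) of a candidate split."""
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return (m_left / m) * gini(left_counts) + (m_right / m) * gini(right_counts)

print(gini([0, 2, 90]))   # node t=4 from the tree above: ~0.0425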

I used the argument max_depth=2 above mainly to generate a small tree that was easy to visualize. The result is arguably an underfitting model that could be improved. Here is what it looks like with max_depth=3.


In [7]:
# Instantiate the classifier class for a deeper tree 
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Grow a Decision Tree
tree_clf.fit(X, y)

plot_feature_space(tree_clf, X, y, axes=[-16, 15, -20, 20], file_name='max_depth3.png')


[Figure: decision boundaries of the max_depth=3 tree]

If left unconstrained, the algorithm tries hard to fit the training data, learning every single data point's class and thus producing an overcomplicated, overfit model.


In [8]:
# Instantiate the classifier class with no limits
tree_clf = DecisionTreeClassifier(random_state=42)

# Grow a Decision Tree
tree_clf.fit(X, y)

plot_feature_space(tree_clf, X, y, axes=[-16, 15, -20, 20], file_name='unlimited.png')


[Figure: decision boundaries of the unconstrained tree]
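To get a sense of how complex the unconstrained tree is, we can inspect the fitted tree_ object (the exact numbers depend on the data and the scikit-learn version):

print('Tree depth: {}'.format(tree_clf.tree_.max_depth))
print('Number of nodes: {}'.format(tree_clf.tree_.node_count))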

3. Random Forests

3.1 Building an ensemble of Decision Trees

An ensemble of Decision Trees is called a Random Forest. Random Forests are an example of an ensemble that uses the same ML algorithm to build multiple models, each trained on a different random subset of the training data. When the training data for each predictor is sampled with replacement, the method is called Bootstrap aggregating, commonly abbreviated to Bagging.
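A bootstrap sample is just a draw of n indices with replacement. A quick numpy sketch shows that each such sample leaves out roughly a third of the training instances; these left-out instances are exactly what out-of-bag evaluation uses later on.

import numpy as np

rng = np.random.RandomState(42)
n = len(X)

# Draw one bootstrap sample: n indices, sampled with replacement
bootstrap_idx = rng.randint(0, n, size=n)
n_unique = len(np.unique(bootstrap_idx))

print('Unique instances in the bootstrap sample: {} of {} (~{:.0%})'.format(
    n_unique, n, n_unique / n))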

Let's grow a Random Forest using scikit-learn's BaggingClassifier: 50 Decision Trees, each trained on a random sample of 50 training set instances.


In [10]:
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=50, bootstrap=True, n_jobs=-1,
                            oob_score=True, random_state=42)

bag_clf.fit(X, y)

plot_feature_space(bag_clf, X, y, axes=[-16, 15, -20, 20], file_name='bagging.png')


[Figure: decision boundaries of the bagging ensemble]

The parameter oob_score=True above tells scikit-learn to perform out-of-bag evaluation after training, and we can now use the oob_score_ attribute to check the score.


In [11]:
print('Out-of-bag evaluation score: {}%'.format(round(bag_clf.oob_score_, 4)*100))


Out-of-bag evaluation score: 91.0%
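The OOB score is essentially a free validation estimate, so it should land in the same ballpark as a cross-validated accuracy. A quick sanity-check sketch:

from sklearn.model_selection import cross_val_score

# Refit the same bagging ensemble on 5 folds and compare with the OOB score
cv_scores = cross_val_score(bag_clf, X, y, cv=5)
print('5-fold cross-validation accuracy: {:.2%} (+/- {:.2%})'.format(
    cv_scores.mean(), cv_scores.std()))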

3.2 The RandomForestClassifier class

As you may know, scikit-learn actually has a RandomForestClassifier class pre-built for you, so you can reserve BaggingClassifier for bagging other kinds of models in custom-built ensembles. Let's build a Random Forest with the RandomForestClassifier class, adding some regularization by setting max_leaf_nodes=10.


In [12]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=50, max_leaf_nodes=10, n_jobs=-1,
                                bootstrap=True,
                                oob_score=True, random_state=42)

rf_clf.fit(X, y)

plot_feature_space(rf_clf, X, y, axes=[-16, 15, -20, 20], file_name='random_forest.png')


[Figure: decision boundaries of the Random Forest]

In [13]:
print('Out-of-bag evaluation score: {}%'.format(round(rf_clf.oob_score_, 4)*100))


Out-of-bag evaluation score: 91.33%
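A RandomForestClassifier is, roughly speaking, a bagged ensemble of Decision Trees that also restricts each split to a random subset of the features (the max_features hyperparameter). Just to make that connection explicit, here is a sketch of an approximately equivalent BaggingClassifier setup; it is an approximation, not an exact reproduction of RandomForestClassifier's internals.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagged trees that consider a random subset of features at each split
equiv_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=10,
                           random_state=42),
    n_estimators=50, bootstrap=True, n_jobs=-1,
    oob_score=True, random_state=42)

equiv_clf.fit(X, y)
print('Out-of-bag evaluation score: {:.2%}'.format(equiv_clf.oob_score_))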

4. Extremely Randomized Trees

scikit-learn provides yet another ensemble algorithm that introduces additional randomness: the Extremely Randomized Trees ensemble, or Extra-Trees. Instead of searching for the best possible threshold for each candidate feature at every split, Extra-Trees use random thresholds, trading a bit more bias for lower variance and faster training. Here's what that looks like with our toy data.


In [14]:
from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(n_estimators=50, max_leaf_nodes=10, n_jobs=-1,
                                 bootstrap=True,
                                 oob_score=True, random_state=42)

extra_clf.fit(X, y)

plot_feature_space(extra_clf, X, y, axes=[-16, 15, -20, 20], file_name='extra_trees.png')


[Figure: decision boundaries of the Extra-Trees ensemble]

In [15]:
print('Out-of-bag evaluation score: {}%'.format(round(extra_clf.oob_score_, 4)*100))


Out-of-bag evaluation score: 92.0%

The decision boundaries look even more natural, right?