This is a study for a blog post to appear on Data Simple. It will focus on the theory and scikit-learn implementation of the Random Forest machine learning (ML) algorithm.
Random Forests are a popular example of an Ensemble Learning method. Ensemble Learning consists of combining multiple ML models in order to achieve higher predictive performance than could be obtained using any of the individual models alone.
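To make the idea concrete, here is a minimal, self-contained sketch that combines two different models with scikit-learn's VotingClassifier (illustrative only; this classifier is not used in the rest of the post):

from sklearn.datasets import make_blobs
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=42)

# Hard voting: the ensemble predicts the majority class among its members
voting_clf = VotingClassifier(estimators=[
    ('lr', LogisticRegression()),
    ('dt', DecisionTreeClassifier(random_state=42)),
], voting='hard')
voting_clf.fit(X_demo, y_demo)
print(voting_clf.predict(X_demo[:3]))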
Let's start by generating some toy data using scikit-learn's make_blobs function. Let's make it a two-dimensional data set, in order to allow easy visualization. There will be, say, 300 observations evenly split between three classes ("Yellow", "Blue", and "Red").
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

# Generate data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, n_features=2, centers=3,
                  cluster_std=4, random_state=42)

# Plot data (create the figure before plotting, not after)
plt.figure(figsize=(4, 3))
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "y.")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "b.")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "r.")
plt.xlabel(r"$x_1$", fontsize=20)
plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
plt.title("Toy data set\n", fontsize=16)
#plt.savefig('toy_data.png', dpi=300, bbox_inches='tight')
Out[1]:
In [2]:
from sklearn.tree import DecisionTreeClassifier
# Instantiate the classifier class
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
# Grow a Decision Tree
tree_clf.fit(X, y)
Out[2]:
We have grown a tree on our 2-D data set. Classifying new examples is simple:
In [3]:
import numpy as np

def predict_class(data_point, classes=['Yellow', 'Blue', 'Red']):
    # Convert to the 2-D array shape scikit-learn expects
    data_point = np.array(data_point).reshape(1, -1)
    # Classify
    result = tree_clf.predict(data_point)
    # Print output; result.item() replaces the deprecated np.asscalar
    print('Predicted class for point {}: {}'.format(data_point,
                                                    classes[result.item()]))

predict_class([-5, 10])
predict_class([10, 5])
predict_class([-10, -10])
We can visualize the Decision Tree model using the dot tool from the graphviz package. First generate a graph definition as a .dot file, using the export_graphviz function. Then convert the .dot file to a .png image file.
In [4]:
from sklearn.tree import export_graphviz

export_graphviz(tree_clf, out_file='tree.dot',
                feature_names=['x1', 'x2'],
                class_names=['Yellow', 'Blue', 'Red'],
                rounded=True, filled=True)
Generate the .png image using the graphviz package:
$ dot -Tpng tree.dot -o tree.png
Here's the tree visualization.
We can now see how the model predicts the class of a new instance: it starts at the root node, where all 300 training examples are split evenly between the three classes, and asks whether the value of feature $x_2$ is lower than or equal to $-1.3707$. It splits the data into two groups according to the answer and then asks another question at each of the two resulting internal nodes. These, in turn, split the data into two groups each.
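If you prefer a text view of the same structure, recent scikit-learn versions also provide export_text, which prints the split rules directly (a quick sketch, separate from the graphviz workflow above):

from sklearn.tree import export_text

# Text rendering of the fitted tree's split rules
print(export_text(tree_clf, feature_names=['x1', 'x2']))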
Let's define a function to compute the decision boundaries and a function to plot them together with the training data using matplotlib.
In [5]:
from matplotlib.colors import ListedColormap

def compute_decision_boundaries(clf, X, y, axes):
    # Evaluate the classifier on a fine grid covering the axes
    x1s = np.linspace(axes[0], axes[1], 300)
    x2s = np.linspace(axes[2], axes[3], 300)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    return x1, x2, y_pred

def plot_feature_space(clf, X, y, axes, file_name=None):
    # Create the figure before plotting, not after
    plt.figure(figsize=(8, 4))
    x1, x2, y_pred = compute_decision_boundaries(clf, X, y, axes)
    custom_cmap = ListedColormap(['y', 'b', 'r'])
    plt.contourf(x1, x2, y_pred, cmap=custom_cmap, alpha=0.1)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "y.")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "b.")
    plt.plot(X[:, 0][y==2], X[:, 1][y==2], "r.")
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
    if file_name is not None:
        plt.savefig(file_name, dpi=300, bbox_inches='tight')
In [6]:
plot_feature_space(tree_clf, X, y, axes=[-16, 15, -20, 20], file_name='max_depth2.png')
scikit-learn uses the Classification And Regression Tree (CART) algorithm, which minimizes the following cost function at each split:

$$ J(k, t_k) = \frac{m_{\text{left}}}{m} I_{\text{left}} + \frac{m_{\text{right}}}{m} I_{\text{right}} $$

where $k$ is the candidate feature, $t_k$ the candidate threshold, $m_{\text{left}}$ and $m_{\text{right}}$ are the numbers of instances in the left and right subsets, and $m$ is the number of instances in the node being split. The impurity measure $I$ is typically the Gini impurity. Formally:
$$ I_G(t) = 1 - \sum_{k=1}^K p_{k,t}^2 $$

with

$$ \begin{cases} k \in \{1, \dots, K\} & \text{index of the class,}\\ p_{k,t} & \text{ratio of class $k$ instances among the training instances in node $t$} \end{cases} $$

You can go back to the image of the tree structure and check the "gini" attribute at each node. Let's take, for example, the purple leaf node $t=4$. Using the definition above, we can calculate the Gini impurity at this node and get the indicated value of $0.0425$:
$$ I_G(4) = 1 - \sum_{k=1}^K p_{k,4}^2 \\ = 1 - \left( \left(\dfrac{0}{92}\right)^2 + \left(\dfrac{2}{92}\right)^2 + \left(\dfrac{90}{92}\right)^2 \right) \\ = 1 - ( 0 + 0.00047 + 0.95699 ) \\ = 0.0425 $$
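As a quick sanity check, here is a minimal sketch (a hypothetical helper, not part of the original notebook) that computes the same value in plain Python:

def gini(counts):
    # Gini impurity for a node with the given per-class instance counts
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

# Leaf node t=4 holds 0 "Yellow", 2 "Blue" and 90 "Red" instances
print(gini([0, 2, 90]))  # ~0.0425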
I used the argument max_depth=2 above, mainly to generate a small tree that was easy to visualize. The result was arguably an underfitting model that could be improved. Here is what it looks like using max_depth=3.
In [7]:
# Instantiate the classifier class for a deeper tree
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
# Grow a Decision Tree
tree_clf.fit(X, y)
plot_feature_space(tree_clf, X, y, axes=[-16, 15, -20, 20], file_name='max_depth3.png')
If left unconstrained, the algorithm tries hard to fit the data, learning every single data point's class and thus producing an overcomplicated model.
In [8]:
# Instantiate the classifier class with no limits
tree_clf = DecisionTreeClassifier(random_state=42)
# Grow a Decision Tree
tree_clf.fit(X, y)
plot_feature_space(tree_clf, X, y, axes=[-16, 15, -20, 20], file_name='unlimited.png')
An ensemble of Decision Trees is called a Random Forest. Random Forests are an example of an ensemble that uses the same ML algorithm to build multiple models, each trained on a different random subset of the training data. When the training data sampling for each predictor is performed with replacement, the method is called Bootstrap aggregating, commonly abbreviated to Bagging.
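To make the "with replacement" part concrete, here is a minimal NumPy sketch of how a single bootstrap sample could be drawn (illustrative only; BaggingClassifier handles this internally):

import numpy as np

rng = np.random.RandomState(42)
# Draw 50 training indices with replacement: duplicates are expected,
# and some instances are never picked (they are "out-of-bag")
indices = rng.choice(len(X), size=50, replace=True)
X_sample, y_sample = X[indices], y[indices]
print('{} unique instances out of {}'.format(len(np.unique(indices)), len(indices)))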
Let's grow a Random Forest using scikit-learn's BaggingClassifier: 50 Decision Trees, each trained on a random sample of 50 training set instances.
In [10]:
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=50, bootstrap=True, n_jobs=-1,
                            oob_score=True, random_state=42)
bag_clf.fit(X, y)
plot_feature_space(bag_clf, X, y, axes=[-16, 15, -20, 20], file_name='bagging.png')
The parameter oob_score=True above tells scikit-learn to perform out-of-bag evaluation after training, and we can now use the oob_score_ attribute to check the score.
In [11]:
print('Out-of-bag evaluation score: {:.2f}%'.format(bag_clf.oob_score_ * 100))
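Since each predictor never sees the instances left out of its bootstrap sample, those instances act as a free validation set. The fitted ensemble also exposes the per-instance out-of-bag class probabilities, which we can peek at (a small illustrative check):

# Out-of-bag class probabilities for the first three training instances
print(bag_clf.oob_decision_function_[:3])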
The RandomForestClassifier class

As you may know, scikit-learn actually has a RandomForestClassifier class pre-built for you, so you can reserve BaggingClassifier for bagging other models in your custom-built ensembles. Besides bagging the trees, a Random Forest adds extra randomness: at each node, the best split is searched for among a random subset of the features rather than all of them. Let's build a Random Forest with the RandomForestClassifier class and an additional regularization hyperparameter by setting max_leaf_nodes=10.
In [12]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=50, max_leaf_nodes=10, n_jobs=-1,
                                bootstrap=True,
                                oob_score=True, random_state=42)
rf_clf.fit(X, y)
plot_feature_space(rf_clf, X, y, axes=[-16, 15, -20, 20], file_name='random_forest.png')
In [13]:
print('Out-of-bag evaluation score: {:.2f}%'.format(rf_clf.oob_score_ * 100))
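scikit-learn also provides an ExtraTreesClassifier. Extremely Randomized Trees ("Extra-Trees") push the randomization one step further: on top of considering a random subset of features at each split, they use random thresholds for each candidate feature instead of searching for the best ones, trading a little more bias for lower variance and faster training. The interface mirrors RandomForestClassifier: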
In [14]:
from sklearn.ensemble import ExtraTreesClassifier
extra_clf = ExtraTreesClassifier(n_estimators=50, max_leaf_nodes=10, n_jobs=-1,
                                 bootstrap=True,
                                 oob_score=True, random_state=42)
extra_clf.fit(X, y)
plot_feature_space(extra_clf, X, y, axes=[-16, 15, -20, 20], file_name='extra_trees.png')
In [15]:
print('Out-of-bag evaluation score: {:.2f}%'.format(extra_clf.oob_score_ * 100))
The decision boundaries look even more natural, right?