13 - Ensemble Methods - Random Forest

by Alejandro Correa Bahnsen and Jesus Solano

version 1.5, February 2019

Part of the class Practical Machine Learning

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.

Why are we learning about ensembling?

  • Very popular method for improving the predictive performance of machine learning models
  • Provides a foundation for understanding more sophisticated models

Lesson objectives

Students will be able to:

  • Explain the difference between bagged trees and Random Forests
  • Build and tune a Random Forest model in scikit-learn
  • Decide whether a decision tree or a Random Forest is a better model for a given problem

Part 1: Building and tuning decision trees


In [1]:
import pandas as pd
import numpy as np

# read in the data
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/hitters.csv'
hitters = pd.read_csv(url)

# remove rows with missing values
hitters.dropna(inplace=True)
hitters.head()


Out[1]:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
4 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
5 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A

In [2]:
# encode categorical variables as integers
hitters['League'] = pd.factorize(hitters.League)[0]
hitters['Division'] = pd.factorize(hitters.Division)[0]
hitters['NewLeague'] = pd.factorize(hitters.NewLeague)[0]
hitters.head()


Out[2]:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 0 0 632 43 10 475.0 0
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 1 0 880 82 14 480.0 1
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 0 1 200 11 3 500.0 0
4 321 87 10 39 42 30 2 396 101 12 48 46 33 0 1 805 40 4 91.5 0
5 594 169 4 74 51 35 11 4408 1133 19 501 336 194 1 0 282 421 25 750.0 1

In [3]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [4]:
# scatter plot of Years versus Hits colored by Salary
hitters.plot(kind='scatter', x='Years', y='Hits', c='Salary', colormap='jet', xlim=(0, 25), ylim=(0, 250))


Out[4]:
[Figure: scatter plot of Years (x-axis) vs. Hits (y-axis), colored by Salary]

In [5]:
# define features: exclude career statistics (which start with "C") and the response (Salary)
feature_cols = hitters.columns[~hitters.columns.str.startswith('C')].drop('Salary')
feature_cols


Out[5]:
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'League',
       'Division', 'PutOuts', 'Assists', 'Errors', 'NewLeague'],
      dtype='object')

In [6]:
hitters.Salary.describe()


Out[6]:
count     263.000000
mean      535.925882
std       451.118681
min        67.500000
25%       190.000000
50%       425.000000
75%       750.000000
max      2460.000000
Name: Salary, dtype: float64

In [7]:
# define X and y
X = hitters[feature_cols]
y = (hitters.Salary > 425).astype(int)

In [8]:
X.columns


Out[8]:
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'League',
       'Division', 'PutOuts', 'Assists', 'Errors', 'NewLeague'],
      dtype='object')

Predicting if salary is high with a decision tree

Review - Building a Decision Tree by hand


In [9]:
max_depth = None
num_pct = 10
max_features = None
min_gain = 0.001

For feature j = 1, calculate the candidate splitting points


In [10]:
j = 1
print(X.columns[j])


Hits

In [11]:
# Split the variable into num_pct percentile points
splits = np.percentile(X.iloc[:, j], np.arange(0, 100, 100.0 / num_pct).tolist())

In [12]:
# Keep only unique values (handles binary features and features with few distinct values)
splits = np.unique(splits)

In [13]:
splits


Out[13]:
array([  1. ,  52. ,  66.8,  77. ,  92. , 103. , 120. , 136. , 148.6,
       168. ])

Split the data using split k = 5


In [14]:
k = 5

In [15]:
filter_l = X.iloc[:, j] < splits[k]

y_l = y.loc[filter_l]
y_r = y.loc[~filter_l]

Gini

The Gini impurity of a node is the probability that a randomly chosen sample in the node would be incorrectly labeled if it were labeled randomly according to the distribution of labels in that node.
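
For a binary target with positive-class rate $p$ (the case handled by the gini function below), this works out to

$$G = 1 - \bigl(p^2 + (1-p)^2\bigr) = 2\,p\,(1-p).$$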

For each node


In [16]:
def gini(y):
    if y.shape[0] == 0:
        return 0
    else:
        return 1 - (y.mean()**2 + (1 - y.mean())**2)

In [17]:
gini_l = gini(y_l)
gini_l


Out[17]:
0.39928079856159704

In [18]:
gini_r = gini(y_r)
gini_r


Out[18]:
0.42690311418685123

The Gini impurity of a split is the weighted sum of the Gini impurities of the two child nodes, each weighted by the fraction of the parent node's samples it receives; the gain of the split is the parent's impurity minus this weighted sum.
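
In symbols, the quantity returned by the gini_impurity function below is this gain:

$$\Delta G = G(y) - \left(\frac{n_l}{n_l + n_r}\,G(y_l) + \frac{n_r}{n_l + n_r}\,G(y_r)\right)$$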

Putting it all together in a function


In [19]:
def gini_impurity(X_col, y, split):
    "Calculate the gain of an split k on feature j"
    
    filter_l = X_col < split
    y_l = y.loc[filter_l]
    y_r = y.loc[~filter_l]
    
    n_l = y_l.shape[0]
    n_r = y_r.shape[0]
    
    gini_y = gini(y)
    gini_l = gini(y_l)
    gini_r = gini(y_r)
    
    gini_impurity_ = gini_y - (n_l / (n_l + n_r) * gini_l + n_r / (n_l + n_r) * gini_r)
    
    return gini_impurity_

In [20]:
gini_impurity(X.iloc[:, j], y, splits[k])


Out[20]:
0.0862547016583845

Test all splits on all features


In [21]:
def best_split(X, y, num_pct=10):
    
    features = range(X.shape[1])
    
    best_split = [0, 0, 0]  # j, split, gain
    
    # For all features
    for j in features:
        
        splits = np.percentile(X.iloc[:, j], np.arange(0, 100, 100.0 / (num_pct+1)).tolist())
        splits = np.unique(splits)[1:]
        
        # For all splits
        for split in splits:
            gain = gini_impurity(X.iloc[:, j], y, split)
                        
            if gain > best_split[2]:
                best_split = [j, split, gain]
    
    return best_split

In [22]:
j, split, gain = best_split(X, y, 5)
j, split, gain


Out[22]:
(6, 6.0, 0.1428365268140297)

In [23]:
filter_l = X.iloc[:, j] < split

y_l = y.loc[filter_l]
y_r = y.loc[~filter_l]

In [24]:
y.shape[0], y_l.shape[0], y_r.shape[0]


Out[24]:
(263, 116, 147)

In [25]:
y.mean(), y_l.mean(), y_r.mean()


Out[25]:
(0.49049429657794674, 0.1896551724137931, 0.7278911564625851)

Recursively grow the tree


In [26]:
def tree_grow(X, y, level=0, min_gain=0.001, max_depth=None, num_pct=10):
    
    # If only one observation
    if X.shape[0] == 1:
        tree = dict(y_pred=y.iloc[:1].values[0], y_prob=0.5, level=level, split=-1, n_samples=1, gain=0)
        return tree
    
    # Calculate the best split
    j, split, gain = best_split(X, y, num_pct)
    
    # save tree and estimate prediction
    y_pred = int(y.mean() >= 0.5) 
    y_prob = (y.sum() + 1.0) / (y.shape[0] + 2.0)  # Laplace correction
    
    tree = dict(y_pred=y_pred, y_prob=y_prob, level=level, split=-1, n_samples=X.shape[0], gain=gain)
    
    # Check stopping criteria
    if gain < min_gain:
        return tree
    if max_depth is not None:
        if level >= max_depth:
            return tree   
    
    # No stopping criterion was met, so continue to create the partition
    filter_l = X.iloc[:, j] < split
    X_l, y_l = X.loc[filter_l], y.loc[filter_l]
    X_r, y_r = X.loc[~filter_l], y.loc[~filter_l]
    tree['split'] = [j, split]

    # Next iteration to each split
    
    tree['sl'] = tree_grow(X_l, y_l, level + 1, min_gain=min_gain, max_depth=max_depth, num_pct=num_pct)
    tree['sr'] = tree_grow(X_r, y_r, level + 1, min_gain=min_gain, max_depth=max_depth, num_pct=num_pct)
    
    return tree

In [27]:
tree_grow(X, y, level=0, min_gain=0.001, max_depth=1, num_pct=10)


Out[27]:
{'y_pred': 0,
 'y_prob': 0.49056603773584906,
 'level': 0,
 'split': [6, 5.0],
 'n_samples': 263,
 'gain': 0.15865574114903452,
 'sl': {'y_pred': 0,
  'y_prob': 0.10869565217391304,
  'level': 1,
  'split': -1,
  'n_samples': 90,
  'gain': 0.01935558112773289},
 'sr': {'y_pred': 1,
  'y_prob': 0.6914285714285714,
  'level': 1,
  'split': -1,
  'n_samples': 173,
  'gain': 0.1127122881295256}}

In [28]:
tree = tree_grow(X, y, level=0, min_gain=0.001, max_depth=3, num_pct=10)

In [29]:
tree


Out[29]:
{'y_pred': 0,
 'y_prob': 0.49056603773584906,
 'level': 0,
 'split': [6, 5.0],
 'n_samples': 263,
 'gain': 0.15865574114903452,
 'sl': {'y_pred': 0,
  'y_prob': 0.10869565217391304,
  'level': 1,
  'split': [5, 65.0],
  'n_samples': 90,
  'gain': 0.01935558112773289,
  'sl': {'y_pred': 0,
   'y_prob': 0.07407407407407407,
   'level': 2,
   'split': [0, 185.0],
   'n_samples': 79,
   'gain': 0.009619566461418955,
   'sl': {'y_pred': 0,
    'y_prob': 0.3333333333333333,
    'level': 3,
    'split': -1,
    'n_samples': 7,
    'gain': 0.40816326530612246},
   'sr': {'y_pred': 0,
    'y_prob': 0.05405405405405406,
    'level': 3,
    'split': -1,
    'n_samples': 72,
    'gain': 0.009027777777777565}},
  'sr': {'y_pred': 0,
   'y_prob': 0.38461538461538464,
   'level': 2,
   'split': [0, 470.90909090909093],
   'n_samples': 11,
   'gain': 0.2203856749311295,
   'sl': {'y_pred': 0,
    'y_prob': 0.14285714285714285,
    'level': 3,
    'split': -1,
    'n_samples': 5,
    'gain': 0},
   'sr': {'y_pred': 1,
    'y_prob': 0.625,
    'level': 3,
    'split': -1,
    'n_samples': 6,
    'gain': 0.4444444444444444}}},
 'sr': {'y_pred': 1,
  'y_prob': 0.6914285714285714,
  'level': 1,
  'split': [1, 103.0],
  'n_samples': 173,
  'gain': 0.1127122881295256,
  'sl': {'y_pred': 0,
   'y_prob': 0.43037974683544306,
   'level': 2,
   'split': [5, 22.0],
   'n_samples': 77,
   'gain': 0.07695385846646363,
   'sl': {'y_pred': 0,
    'y_prob': 0.17857142857142858,
    'level': 3,
    'split': -1,
    'n_samples': 26,
    'gain': 0.06860475087899842},
   'sr': {'y_pred': 1,
    'y_prob': 0.5660377358490566,
    'level': 3,
    'split': -1,
    'n_samples': 51,
    'gain': 0.09501691508611931}},
  'sr': {'y_pred': 1,
   'y_prob': 0.8979591836734694,
   'level': 2,
   'split': [2, 6.0],
   'n_samples': 96,
   'gain': 0.01107413837448551,
   'sl': {'y_pred': 1,
    'y_prob': 0.7058823529411765,
    'level': 3,
    'split': -1,
    'n_samples': 15,
    'gain': 0.16547008547008554},
   'sr': {'y_pred': 1,
    'y_prob': 0.927710843373494,
    'level': 3,
    'split': -1,
    'n_samples': 81,
    'gain': 0.006994315787586275}}}}

Prediction


In [30]:
def tree_predict(X, tree, proba=False):
    
    predicted = np.ones(X.shape[0])

    # Check if final node
    if tree['split'] == -1:
        if not proba:
            predicted = predicted * tree['y_pred']
        else:
            predicted = predicted * tree['y_prob']
            
    else:
        
        j, split = tree['split']
        filter_l = (X.iloc[:, j] < split)
        X_l = X.loc[filter_l]
        X_r = X.loc[~filter_l]

        if X_l.shape[0] == 0:  # If left node is empty only continue with right
            predicted[~filter_l] = tree_predict(X_r, tree['sr'], proba)
        elif X_r.shape[0] == 0:  # If right node is empty only continue with left
            predicted[filter_l] = tree_predict(X_l, tree['sl'], proba)
        else:
            predicted[filter_l] = tree_predict(X_l, tree['sl'], proba)
            predicted[~filter_l] = tree_predict(X_r, tree['sr'], proba)

    return predicted

In [31]:
tree_predict(X, tree)


Out[31]:
array([1., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 1., 1., 0., 0.,
       0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1.,
       0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,
       0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
       0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0.,
       1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0.,
       0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 1.,
       0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1.,
       0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1.,
       0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0.,
       1., 1., 0., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 0.,
       0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0.,
       0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 1., 1.,
       1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1.,
       1., 0., 0., 1., 1., 1., 1., 1.])

Using sklearn


In [32]:
# list of values to try for max_depth
max_depth_range = range(1, 21)

# list to store the average accuracy for each value of max_depth
accuracy_scores = []

# use 10-fold cross-validation with each value of max_depth
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for depth in max_depth_range:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    accuracy_scores.append(cross_val_score(clf, X, y, cv=10, scoring='accuracy').mean())

In [33]:
# plot max_depth (x-axis) versus accuracy (y-axis)
plt.plot(max_depth_range, accuracy_scores)
plt.xlabel('max_depth')
plt.ylabel('Accuracy')


Out[33]:
[Figure: max_depth vs. cross-validated accuracy]

In [34]:
# show the best accuracy and the corresponding max_depth
sorted(zip(accuracy_scores, max_depth_range))[::-1][0]


Out[34]:
(0.8205754985754986, 4)

In [35]:
# max_depth=4 was best, so fit a tree using that parameter
clf = DecisionTreeClassifier(max_depth=4, random_state=1)
clf.fit(X, y)


Out[35]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

In [36]:
# compute feature importances
pd.DataFrame({'feature':feature_cols, 'importance':clf.feature_importances_}).sort_values('importance')


Out[36]:
feature importance
0 AtBat 0.000000
7 League 0.000000
8 Division 0.000000
10 Assists 0.000000
11 Errors 0.000000
12 NewLeague 0.000000
9 PutOuts 0.006048
2 HmRun 0.010841
4 RBI 0.012073
3 Runs 0.021020
5 Walks 0.103473
1 Hits 0.298269
6 Years 0.548277

In [37]:
pd.Series(cross_val_score(clf, X, y, cv=10)).describe()


Out[37]:
count    10.000000
mean      0.820575
std       0.083007
min       0.692308
25%       0.751781
50%       0.830484
75%       0.879630
max       0.923077
dtype: float64

Part 2: Random Forests

Random Forests are a slight variation of bagged trees that typically achieve even better performance:

  • Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
  • However, when building each tree, each time a split is considered, a random sample of m features is chosen as split candidates from the full set of p features. The split is only allowed to use one of those m features.
    • A new random sample of features is chosen for every single tree at every single split.
    • For classification, m is typically chosen to be the square root of p.
    • For regression, m is typically chosen to be somewhere between p/3 and p.
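
As a minimal sketch (not part of the original notebook), here is how those conventional choices of m map onto scikit-learn's max_features parameter; treat the exact values as starting points rather than rules:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider roughly sqrt(p) candidate features at each split
clf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=1)

# Regression: consider roughly p/3 candidate features at each split
# (a float max_features is interpreted as a fraction of the p features)
reg = RandomForestRegressor(n_estimators=200, max_features=1/3, random_state=1)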

What's the point?

  • Suppose there is one very strong feature in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.
  • Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
  • By randomly leaving out candidate features from each split, Random Forests "decorrelates" the trees, such that the averaging process can reduce the variance of the resulting model.
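
A minimal sketch of that idea, assuming the X and y defined earlier in this notebook: the only conceptual difference between the two ensembles below is the per-split feature subsampling, so differences in cross-validated accuracy are largely due to the decorrelation effect.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Bagged trees: bootstrap samples only; every split may use all p features
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=1)

# Random Forest: bootstrap samples plus a random subset of features at each split
forest = RandomForestClassifier(n_estimators=200, random_state=1)

print('Bagged trees :', cross_val_score(bagged, X, y, cv=10).mean())
print('Random Forest:', cross_val_score(forest, X, y, cv=10).mean())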

Predicting salary with a Random Forest


In [38]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf


Out[38]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [39]:
pd.Series(cross_val_score(clf, X, y, cv=10)).describe()


Out[39]:
count    10.000000
mean      0.817749
std       0.062496
min       0.703704
25%       0.783333
50%       0.811254
75%       0.846154
max       0.923077
dtype: float64

Tuning n_estimators

One important tuning parameter is n_estimators, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".


In [40]:
# list of values to try for n_estimators
estimator_range = range(10, 310, 10)

# list to store the average Accuracy for each value of n_estimators
accuracy_scores = []

# use 5-fold cross-validation with each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    clf = RandomForestClassifier(n_estimators=estimator, random_state=1, n_jobs=-1)
    accuracy_scores.append(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())

In [41]:
plt.plot(estimator_range, accuracy_scores)
plt.xlabel('n_estimators')
plt.ylabel('Accuracy')


Out[41]:
[Figure: n_estimators vs. cross-validated accuracy]

Tuning max_features

The other important tuning parameter is max_features, which is the number of features that should be considered at each split.


In [42]:
# list of values to try for max_features
feature_range = range(1, len(feature_cols)+1)

# list to store the average Accuracy for each value of max_features
accuracy_scores = []

# use 5-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    clf = RandomForestClassifier(n_estimators=200, max_features=feature, random_state=1, n_jobs=-1)
    accuracy_scores.append(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())

In [43]:
plt.plot(feature_range, accuracy_scores)
plt.xlabel('max_features')
plt.ylabel('Accuracy')


Out[43]:
[Figure: max_features vs. cross-validated accuracy]

Fitting a Random Forest with the best parameters


In [44]:
# max_features=6 is best and n_estimators=200 is sufficiently large
clf = RandomForestClassifier(n_estimators=200, max_features=6, random_state=1, n_jobs=-1)
clf.fit(X, y)


Out[44]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=6, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

In [45]:
# compute feature importances
pd.DataFrame({'feature':feature_cols, 'importance':clf.feature_importances_}).sort_values('importance')


Out[45]:
feature importance
8 Division 0.006081
7 League 0.008834
12 NewLeague 0.009709
11 Errors 0.032638
10 Assists 0.040503
2 HmRun 0.047118
9 PutOuts 0.051506
0 AtBat 0.078822
3 Runs 0.080185
5 Walks 0.082160
4 RBI 0.091048
1 Hits 0.132156
6 Years 0.339239

Comparing Random Forests with decision trees

Advantages of Random Forests:

  • Performance is competitive with the best supervised learning methods
  • Provides a more reliable estimate of feature importance
  • Allows you to estimate out-of-sample error without using a train/test split or cross-validation (via the out-of-bag samples; see the sketch below)
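
A minimal sketch of that last point, reusing the tuned parameters from above: with oob_score=True, each sample is scored only by the trees whose bootstrap sample did not include it, giving a built-in out-of-sample accuracy estimate.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, max_features=6, oob_score=True,
                             random_state=1, n_jobs=-1)
clf.fit(X, y)
clf.oob_score_  # out-of-bag accuracy estimate; no separate test set needed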

Disadvantages of Random Forests:

  • Less interpretable
  • Slower to train
  • Slower to predict