Random Forest

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.

As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
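This bias/variance trade-off can be seen empirically. Below is a minimal sketch (on synthetic data, not the loan dataset used later) comparing cross-validated accuracy of a single unpruned tree against an averaged forest; the variable names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for the demonstration
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

# A single deep tree: low bias, high variance
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X_demo, y_demo, cv=5)

# An averaged ensemble: slightly higher bias, much lower variance
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo, y_demo, cv=5)

print(tree_scores.mean(), forest_scores.mean())
```

On most datasets of this shape the forest's mean accuracy comes out ahead of the single tree's.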

Data Preparation


In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [42]:
df = pd.read_csv("data/historical_loan.csv")

In [43]:
# fill missing values in 'years' with the column mean
df['years'] = df['years'].fillna(df['years'].mean())

In [44]:
#Load the preprocessing module
from sklearn import preprocessing
categorical_variables = df.dtypes[df.dtypes=="object"].index.tolist()
for col in categorical_variables:
    lbl = preprocessing.LabelEncoder()
    df[col] = lbl.fit_transform(df[col])

In [45]:
df.head()


Out[45]:
default amount grade years ownership income age
0 0 1000 1 2.0 3 19200.0 24
1 1 6500 0 2.0 0 66000.0 28
2 0 2400 0 2.0 3 60000.0 36
3 0 10000 2 3.0 3 62000.0 24
4 1 4000 2 2.0 3 20000.0 28

In [46]:
X = df.iloc[:, 1:]  # feature columns (everything except the target)

In [47]:
y = df.iloc[:, 0]  # target: default

Implementing Random Forest


In [48]:
from sklearn.ensemble import RandomForestClassifier

In [49]:
clf = RandomForestClassifier()

In [50]:
clf.fit(X, y)


Out[50]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


Key input parameters (in addition to decision trees)

  • bootstrap: Whether bootstrap samples are used when building trees
  • max_features: The number of features to consider when looking for the best split ('auto' uses sqrt(n_features))
  • n_estimators: The number of trees in the forest
  • oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy
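As a sketch, these parameters might be combined like this (the values shown are illustrative, not recommendations for the loan data):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings for the key parameters listed above
clf_demo = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features='sqrt',  # features considered at each split
    bootstrap=True,       # draw a bootstrap sample for each tree
    oob_score=True,       # estimate generalization accuracy on OOB samples
    random_state=0)

print(clf_demo.get_params()['n_estimators'])
```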

Key output parameters

  • Feature Importance: The higher, the more important the feature
  • Out-of-Bag Score: Validation score of the training dataset obtained using an out-of-bag estimate.

Feature Importance

There are several ways to get feature "importances" with no strict consensus on what it means.

Mean Decrease Impurity

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.

In scikit-learn this is implemented as "gini importance" or "mean decrease impurity": the total decrease in node impurity brought by a feature, weighted by the probability of reaching that node (approximated by the proportion of samples that reach it), averaged over all trees in the ensemble.
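As a sanity check on that definition, here is a sketch (on synthetic data, not the loan dataset) showing that the forest's reported importances are the average of the per-tree values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for the demonstration
Xd, yd = make_classification(n_samples=300, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(Xd, yd)

# Stack the per-tree mean-decrease-impurity importances and average them
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.feature_importances_))
```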

Mean Decrease Accuracy

In the literature and in other packages you will also find feature importance implemented as the "mean decrease accuracy": randomly permute the values of a feature and measure how much the accuracy on out-of-bag data drops. A small drop means the feature is unimportant, and vice versa.

Averaging these decreases over many randomized trees reduces the variance of the estimate, which can then be used for feature selection.
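A minimal hand-rolled sketch of this permutation idea, using a held-out set in place of OOB samples (synthetic data; all names here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the demonstration
Xd, yd = make_classification(n_samples=500, n_features=6, n_informative=3,
                             random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(Xd, yd, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

base = rf.score(X_te, y_te)          # baseline held-out accuracy
rng = np.random.RandomState(0)
perm_importance = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])        # break the feature/target link
    perm_importance.append(base - rf.score(X_perm, y_te))

print(np.round(perm_importance, 3))
```

Informative features should show a clearly positive accuracy drop; noise features should hover around zero.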


In [11]:
importances = clf.feature_importances_

In [12]:
# Importance of the features in the forest
importances


Out[12]:
array([ 0.22541327,  0.11081867,  0.1679413 ,  0.03932398,  0.2825136 ,
        0.17398918])

In [13]:
#Calculate the standard deviation of variable importance
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)

In [14]:
std


Out[14]:
array([ 0.01667274,  0.00867487,  0.00816368,  0.00753017,  0.01572384,
        0.01548575])

In [15]:
indices = np.argsort(importances)[::-1]
indices


Out[15]:
array([4, 0, 5, 2, 1, 3])

In [16]:
length = X.shape[1]

In [17]:
labels = []

In [18]:
for i in range(length):
    labels.append(X.columns[indices[i]])

In [19]:
# Plot the feature importances of the forest
plt.figure(figsize=(16, 6))
plt.title("Feature importances")
plt.bar(range(length), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(length), labels)
plt.xlim([-1, length])
plt.show()


Exercise 1

Plot the feature importances for a forest with max_depth = 6


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Out-of-Bag Error

The out-of-bag (OOB) error is the average prediction error for each training observation, computed using only the trees whose bootstrap samples did not include that observation. It lets a random forest be validated while it is being trained, without a separate hold-out set.


In [20]:
import warnings
warnings.filterwarnings('ignore')

In [21]:
clf2 = RandomForestClassifier(warm_start=True, class_weight="balanced", 
                              oob_score=True, max_features=None)

In [22]:
clf2.fit(X, y)


Out[22]:
RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features=None,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=None,
            verbose=0, warm_start=True)

In [23]:
clf2.oob_score_


Out[23]:
0.62495146887537212

In [24]:
min_estimators = 10
max_estimators = 50
error_rate = []

In [25]:
for i in range(min_estimators, max_estimators + 1):
    clf2.set_params(n_estimators=i)
    clf2.fit(X, y)
    oob_error = 1 - clf2.oob_score_
    error_rate.append(oob_error)

In [26]:
n_trees = list(range(min_estimators, max_estimators + 1))

In [28]:
plt.figure(figsize=(16, 6))
plt.plot(n_trees, error_rate)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show()



Exercise 2

Calculate the OOB error rate plot with max_features = 'sqrt'


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Exercise 3

Calculate the OOB error rate plot with max_depth = 6


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Exercise 4

Plot how the AUC score changes as max_depth and the number of estimators (n_estimators) vary


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: