In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.
As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree), but, due to averaging, its variance decreases, usually more than compensating for the increase in bias and hence yielding an overall better model.
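As a quick illustration of this bias–variance trade-off, the sketch below compares a single decision tree with a random forest on a small synthetic dataset (the dataset and its parameters are arbitrary, chosen only for the demo; this is not the loan data used later):

```python
# Illustrative only: a forest of randomized trees usually generalizes
# better than one fully-grown tree, despite each tree being "worse".
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=0)

tree_score = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X_demo, y_demo, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                      random_state=0),
                               X_demo, y_demo, cv=5).mean()
print(tree_score, forest_score)
```

On most runs the forest's cross-validated accuracy is noticeably higher than the single tree's, which is the averaging effect described above.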
In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
In [42]:
df = pd.read_csv("data/historical_loan.csv")
In [43]:
# Fill missing values in the years column with the column mean
df['years'] = df['years'].fillna(df['years'].mean())
In [44]:
#Load the preprocessing module
from sklearn import preprocessing
categorical_variables = df.dtypes[df.dtypes=="object"].index.tolist()
for i in categorical_variables:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[i]))
    df[i] = lbl.transform(df[i])
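To see what the loop above does to a single column, here is a toy example (the category values are made up for illustration): `LabelEncoder` sorts the unique values and maps each one to its index.

```python
from sklearn import preprocessing

# Hypothetical categorical column, e.g. a home-ownership field
lbl = preprocessing.LabelEncoder()
codes = lbl.fit_transform(["rent", "own", "mortgage", "rent"])

print(list(lbl.classes_))  # sorted unique categories: ['mortgage', 'own', 'rent']
print(list(codes))         # each value replaced by its index: [2, 1, 0, 2]
```

Note that the resulting integers impose an arbitrary ordering on the categories; tree-based models such as random forests tolerate this reasonably well, but linear models generally should not be fed label-encoded categoricals.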
In [45]:
df.head()
Out[45]:
In [46]:
X = df.iloc[:,1:8]
In [47]:
y = df.iloc[:,0]
In [48]:
from sklearn.ensemble import RandomForestClassifier
In [49]:
clf = RandomForestClassifier()
In [50]:
clf.fit(X, y)
Out[50]:
There are several ways to get feature "importances" with no strict consensus on what it means.
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
In scikit-learn, it is implemented by using "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
By averaging those expected activity rates over the random trees of the ensemble, one can reduce the variance of the estimate and use it for feature selection.
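The "mean decrease accuracy" idea can be sketched in a few lines: permute one feature column at a time and record how much the model's accuracy drops. The function below is a simplified, from-scratch version (function name and the validation arrays `X_val`/`y_val` are assumptions for the sketch); it scores on the data you pass in rather than on OOB samples, unlike the per-tree OOB variant described in the literature.

```python
import numpy as np

def mean_decrease_accuracy(model, X_val, y_val, seed=0):
    """Drop in accuracy when each feature column is permuted in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    drops = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        # Break the association between feature j and the target
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - model.score(X_perm, y_val))
    return np.array(drops)  # large drop => important feature
```

scikit-learn ships a more complete implementation of the same idea as `sklearn.inspection.permutation_importance`, which repeats the shuffling several times and works with any scorer.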
In [11]:
importances = clf.feature_importances_
In [12]:
# Importance of the features in the forest
importances
Out[12]:
In [13]:
#Calculate the standard deviation of variable importance
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
In [14]:
std
Out[14]:
In [15]:
indices = np.argsort(importances)[::-1]
indices
Out[15]:
In [16]:
length = X.shape[1]
In [17]:
labels = []
In [18]:
for i in range(length):
    labels.append(X.columns[indices[i]])
In [19]:
# Plot the feature importances of the forest
plt.figure(figsize=(16, 6))
plt.title("Feature importances")
plt.bar(range(length), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(length), labels)
plt.xlim([-1, length])
plt.show()
In [20]:
import warnings
warnings.filterwarnings('ignore')
In [21]:
clf2 = RandomForestClassifier(warm_start=True, class_weight="balanced",
oob_score=True, max_features=None)
In [22]:
clf2.fit(X, y)
Out[22]:
In [23]:
clf2.oob_score_
Out[23]:
In [24]:
min_estimators = 10
max_estimators = 50
error_rate = []
In [25]:
for i in range(min_estimators, max_estimators + 1):
    clf2.set_params(n_estimators=i)
    clf2.fit(X, y)  # warm_start=True: new trees are added to the existing forest
    oob_error = 1 - clf2.oob_score_
    error_rate.append(oob_error)
In [26]:
error_rate_indice = list(range(min_estimators, max_estimators + 1))
In [28]:
plt.figure(figsize=(16, 6))
plt.plot(error_rate_indice, error_rate)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show()