In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.
As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
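As a rough illustration of this trade-off, the short sketch below compares a single fully grown decision tree with a forest of randomized trees. It uses a synthetic dataset from make_classification purely for illustration (an assumption here, not the data loaded below), and it assumes a recent scikit-learn with the model_selection module:

# Sketch: single deterministic tree vs. a forest of randomized trees
# on synthetic data (illustration only, not the notebook's dataset).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())

The forest typically scores higher here because averaging many decorrelated trees reduces variance more than the per-tree randomization raises bias.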
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
In [2]:
train = pd.read_csv("data/trainRF.csv")
test = pd.read_csv("data/testRF.csv")
In [3]:
train_13 = train.drop(['day', 'month', 'duration'], axis=1)
test_13 = test.drop(['day', 'month', 'duration'], axis=1)
In [4]:
from sklearn import preprocessing
# Columns with dtype "object" are the categorical variables to encode
categorical_variables = train_13.dtypes[train_13.dtypes == "object"].index.tolist()
In [5]:
for i in categorical_variables:
    # Fit the encoder on the union of train and test values so that labels
    # appearing only in the test set do not fail at transform time
    lbl = preprocessing.LabelEncoder()
    lbl.fit(pd.concat([train_13[i], test_13[i]]))
    train_13[i] = lbl.transform(train_13[i])
    test_13[i] = lbl.transform(test_13[i])
In [6]:
train_13_data = train_13.iloc[:, 0:13]
train_13_target = train_13.iloc[:, -1]
In [7]:
test_13_data = test_13.iloc[:, 0:13]
test_13_target = test_13.iloc[:, -1]
In [8]:
from sklearn.ensemble import RandomForestClassifier
In [9]:
clf = RandomForestClassifier()
In [10]:
clf.fit(train_13_data, train_13_target)
Out[10]:
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
By averaging those expected activity rates over random trees one can reduce the variance of such an estimate and use it for feature selection.
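Since clf has already been fitted above, this averaging can be checked directly. The sketch below assumes that feature_importances_ is the normalized mean of the per-tree impurity-based importances (which is how recent scikit-learn versions compute it):

# Sketch: forest-level importances should match the (normalized) average
# of the per-tree importances collected from clf.estimators_
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged / averaged.sum(), clf.feature_importances_))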
In [11]:
importances = clf.feature_importances_
In [12]:
# Importance of the features in the forest
importances
Out[12]:
In [13]:
#Calculate the standard deviation of variable importance
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
In [14]:
std
Out[14]:
In [15]:
indices = np.argsort(importances)[::-1]
indices
Out[15]:
In [16]:
length = train_13_data.shape[1]
In [17]:
labels = []
In [18]:
for i in range(length):
    labels.append(train_13_data.columns[indices[i]])
In [19]:
# Plot the feature importances of the forest
plt.figure(figsize=(16, 6))
plt.title("Feature importances")
plt.bar(range(length), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(length), labels)
plt.xlim([-1, length])
plt.show()
In [20]:
import warnings
warnings.filterwarnings('ignore')
In [21]:
# warm_start=True lets later fits add trees instead of retraining from scratch;
# oob_score=True tracks the out-of-bag accuracy of the growing forest
clf2 = RandomForestClassifier(warm_start=True, oob_score=True, max_features=None)
In [22]:
clf2.fit(train_13_data, train_13_target)
Out[22]:
In [23]:
clf2.oob_score_
Out[23]:
In [24]:
min_estimators = 10
max_estimators = 50
error_rate = []
In [25]:
for i in range(min_estimators, max_estimators + 1):
    clf2.set_params(n_estimators=i)
    clf2.fit(train_13_data, train_13_target)
    oob_error = 1 - clf2.oob_score_
    error_rate.append(oob_error)
In [26]:
error_rate_indice = [x for x in range(min_estimators, max_estimators + 1)]
In [27]:
plt.figure()
plt.plot(error_rate_indice, error_rate)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show()