Ensemble of Decision Trees

By Parijat Mazumdar (GitHub ID: mazumdarparijat)

This notebook illustrates the use of Random Forests in Shogun for classification and regression. We will look at how Random Forests work, discuss the importance of their various parameters, and see why this learning method is useful.

What is Random Forest?

Random Forest is an ensemble learning method in which a collection of decision trees is grown during training, and the outputs of all the individual trees are combined during testing or application. The combination strategy can vary, but generally the mode of the output classes is used for classification and the mean of the outputs is used for regression.

The randomness in the method, as its name suggests, comes mainly from the random subspace sampling done while training the individual trees. When choosing the best split during tree growing, only a small, randomly chosen subset of all the features is considered. The subset size is a user-controlled parameter and is usually set to the square root of the total number of available features. The purpose of random subset sampling is to decorrelate the individual trees in the forest, which makes the overall model generalize better, i.e. it decreases the variance without a significant increase in bias (see bias-variance trade-off). In summary, the purpose of Random Forest is to reduce the generalization error of the model as much as possible.
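The two combination strategies and the square-root subset rule described above can be sketched in a few lines of plain Python. This is only an illustration and does not use Shogun; the function names (combine_classification, combine_regression, sample_feature_subset) are made up for this sketch.

```python
import math
import random
from collections import Counter
from statistics import mean

def combine_classification(tree_outputs):
    """Return the most common class label (the mode) among the tree outputs."""
    return Counter(tree_outputs).most_common(1)[0][0]

def combine_regression(tree_outputs):
    """Return the mean of the tree outputs."""
    return mean(tree_outputs)

def sample_feature_subset(num_features, rng):
    """Pick round(sqrt(num_features)) distinct feature indices at random."""
    subset_size = max(1, round(math.sqrt(num_features)))
    return sorted(rng.sample(range(num_features), subset_size))

rng = random.Random(0)
print(combine_classification(['A', 'B', 'A', 'C', 'A']))  # A
print(combine_regression([2.0, 3.0, 4.0]))                # 3.0
print(sample_feature_subset(16, rng))  # four distinct indices in range(16)
```

In a real forest a fresh feature subset is drawn at every split of every tree, not once per tree.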

Random Forest vs Decision Tree

In this section, we will see why training a Random Forest is preferable to training a single decision tree. In the process, we will also learn how to use Shogun's Random Forest class. For this purpose, we will use the letter recognition dataset. This dataset contains pixel information (16 features) of 20000 samples of letters of the English alphabet. This is a 26-class classification problem where the task is to predict the letter given the 16 pixel features. We start by loading the training dataset.


In [ ]:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../../data')
from shogun import CSVFile,RealFeatures,MulticlassLabels

def load_file(feat_file,label_file):
    feats=RealFeatures(CSVFile(feat_file))
    labels=MulticlassLabels(CSVFile(label_file))
    return (feats, labels)

trainfeat_file=os.path.join(SHOGUN_DATA_DIR, 'uci/letter/train_fm_letter.dat')
trainlab_file=os.path.join(SHOGUN_DATA_DIR, 'uci/letter/train_label_letter.dat')
train_feats,train_labels=load_file(trainfeat_file,trainlab_file)

Next, we decide the parameters of our Random Forest.


In [ ]:
from shogun import RandomForest, MajorityVote
from numpy import array

def setup_random_forest(num_trees,rand_subset_size,combination_rule,feature_types):
    rf=RandomForest(rand_subset_size,num_trees)
    rf.set_combination_rule(combination_rule)
    rf.set_feature_types(feature_types)

    return rf

comb_rule=MajorityVote()
feat_types=array([False]*16)
rand_forest=setup_random_forest(10,4,comb_rule,feat_types)

In the above code snippet, we create a forest of 10 trees in which each split in an individual tree uses a randomly chosen subset of 4 features. Note that 4 is the square root of the total number of available features (16) and is hence the usual choice, as mentioned in the introductory paragraph. The combination strategy chosen is Majority Vote which, as the name suggests, takes the mode of all the individual tree outputs. The given features are all continuous, and hence the feature types are all set to False (i.e. not nominal). Next, we train our Random Forest and use it to classify letters in our test dataset.


In [ ]:
# train forest
rand_forest.set_labels(train_labels)
rand_forest.train(train_feats)

# load test dataset
testfeat_file= os.path.join(SHOGUN_DATA_DIR, 'uci/letter/test_fm_letter.dat')
testlab_file= os.path.join(SHOGUN_DATA_DIR, 'uci/letter/test_label_letter.dat')
test_feats,test_labels=load_file(testfeat_file,testlab_file)

# apply forest
output_rand_forest_train=rand_forest.apply_multiclass(train_feats)
output_rand_forest_test=rand_forest.apply_multiclass(test_feats)

We have with us the labels predicted by our Random Forest model. Let us also get the predictions made by a single tree. For this purpose, we train a CART-flavoured decision tree.


In [ ]:
from shogun import CARTree, PT_MULTICLASS

def train_cart(train_feats,train_labels,feature_types,problem_type):
    c=CARTree(feature_types,problem_type,2,False)
    c.set_labels(train_labels)
    c.train(train_feats)
    
    return c

# train CART
cart=train_cart(train_feats,train_labels,feat_types,PT_MULTICLASS)

# apply CART model
output_cart_train=cart.apply_multiclass(train_feats)
output_cart_test=cart.apply_multiclass(test_feats)

With both results at our disposal, let us find out which one is better.


In [ ]:
from shogun import MulticlassAccuracy

accuracy=MulticlassAccuracy()

rf_train_accuracy=accuracy.evaluate(output_rand_forest_train,train_labels)*100
rf_test_accuracy=accuracy.evaluate(output_rand_forest_test,test_labels)*100

cart_train_accuracy=accuracy.evaluate(output_cart_train,train_labels)*100
cart_test_accuracy=accuracy.evaluate(output_cart_test,test_labels)*100

print('Random Forest training accuracy : '+str(round(rf_train_accuracy,3))+'%')
print('CART training accuracy : '+str(round(cart_train_accuracy,3))+'%')
print()
print('Random Forest test accuracy : '+str(round(rf_test_accuracy,3))+'%')
print('CART test accuracy : '+str(round(cart_test_accuracy,3))+'%')

As is clear from the results above, the Random Forest gives a significant improvement in the test predictions. The reason for the improvement becomes clear when one looks at the training accuracy. The single decision tree was over-fitting the training dataset and hence did not generalize well. Random Forest, on the other hand, appropriately trades off training accuracy for better generalization. Impressed already? Let us now see what happens if we increase the number of trees in our forest.

Random Forest parameters : Number of trees and random subset size

In the last section, we trained a forest of 10 trees. What happens if we build our forest with 20 trees? Let us try to answer this question in a general way.


In [ ]:
def get_rf_accuracy(num_trees,rand_subset_size):
    rf=setup_random_forest(num_trees,rand_subset_size,comb_rule,feat_types)
    rf.set_labels(train_labels)
    rf.train(train_feats)
    out_test=rf.apply_multiclass(test_feats)
    acc=MulticlassAccuracy()
    return acc.evaluate(out_test,test_labels)

The function above takes the number of trees and the subset size as inputs and returns the test accuracy as output. Let us use it to get the accuracy for different numbers of trees, keeping the subset size constant at 4.


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

num_trees4=[5,10,20,50,100]
rf_accuracy_4=[round(get_rf_accuracy(i,4)*100,3) for i in num_trees4]

print('Random Forest accuracies (as %) :' + str(rf_accuracy_4))

# plot results

x4=[1]
y4=[86.48] # accuracy for single tree-CART
x4.extend(num_trees4)
y4.extend(rf_accuracy_4)
plt.plot(x4,y4,'--bo')
plt.xlabel('Number of trees')
plt.ylabel('Multiclass Accuracy (as %)')
plt.xlim([0,110])
plt.ylim([85,100])
plt.show()
NOTE : The above code snippet takes about a minute to execute. Please wait patiently.

We see from the above plot that the accuracy of the model keeps increasing as we increase the number of trees in our Random Forest and eventually saturates at some value. Extrapolating the plot qualitatively, the saturation value will be somewhere around 96.5%. The jump in accuracy from 86.48% for a single tree to 96.5% for a Random Forest with about 100 trees clearly highlights the value of the Random Forest algorithm.
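The rise-then-saturate shape of the curve can be reproduced with a small calculation, under a strongly idealized assumption: a two-class problem in which every voter is independently correct with the same probability. Real trees in a forest are correlated and the letter problem has 26 classes, so the numbers below do not carry over to the plot above; only the qualitative trend does. The function name majority_accuracy is made up for this sketch.

```python
from math import comb

def majority_accuracy(num_voters, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    needed = num_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(num_voters, k) * p_correct**k * (1 - p_correct)**(num_voters - k)
        for k in range(needed, num_voters + 1)
    )

# Each voter is right 60% of the time; odd counts avoid tied votes.
for n in [1, 5, 11, 51, 101]:
    print(n, round(majority_accuracy(n, 0.6), 4))
```

The gain from each additional voter shrinks as the ensemble grows, which is the saturation seen in the plot. Correlation between trees lowers the level at which the curve flattens, which is why decorrelating the trees through random feature subsets matters.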

The inevitable question at this point is whether a higher saturation accuracy can be achieved with a smaller (or larger) random feature subset size. Let us find out by repeating the above procedure with random subset sizes of 2 and 8.


In [ ]:
# subset size 2

num_trees2=[10,20,50,100]
rf_accuracy_2=[round(get_rf_accuracy(i,2)*100,3) for i in num_trees2]

print('Random Forest accuracies (as %) :' + str(rf_accuracy_2))

In [ ]:
# subset size 8

num_trees8=[5,10,50,100]
rf_accuracy_8=[round(get_rf_accuracy(i,8)*100,3) for i in num_trees8]

print('Random Forest accuracies (as %) :' + str(rf_accuracy_8))
NOTE : The above code snippets take about a minute each to execute. Please wait patiently.

Let us plot all the results together and then comprehend the results.


In [ ]:
x2=[1]
y2=[86.48]
x2.extend(num_trees2)
y2.extend(rf_accuracy_2)

x8=[1]
y8=[86.48]
x8.extend(num_trees8)
y8.extend(rf_accuracy_8)

plt.plot(x2,y2,'--bo',label='Subset Size = 2')
plt.plot(x4,y4,'--r^',label='Subset Size = 4')
plt.plot(x8,y8,'--gs',label='Subset Size = 8')
plt.xlabel('Number of trees')
plt.ylabel('Multiclass Accuracy (as %) ')
plt.legend(bbox_to_anchor=(0.92,0.4))
plt.xlim([0,110])
plt.ylim([85,100])
plt.show()

As we can see from the above plot, the subset size does not have a major impact on the saturated accuracy obtained on this particular dataset. While this is true for many datasets, it does not hold in general. In some datasets, the random feature sample size does have a measurable impact on the test accuracy. A simple strategy to find the optimal subset size is to use cross-validation. With a Random Forest model, however, there is actually no need to perform cross-validation. Let us see why in the next section.

Out-of-bag error

The individual trees in a Random Forest are trained on data vectors randomly chosen with replacement. As a result, each individual tree leaves some of the data vectors out of its training set. These vectors form the out-of-bag (OOB) vectors of the corresponding tree. A data vector can be in the OOB set of multiple trees. When calculating the OOB error, a data vector is applied only to those trees for which it is an OOB vector, and the results are combined. This combined result, averaged over the corresponding estimates for all other vectors, gives the OOB error. The OOB error is an estimate of the generalization error of the Random Forest model. Let us see how to compute this OOB estimate in Shogun.


In [ ]:
rf=setup_random_forest(100,2,comb_rule,feat_types)
rf.set_labels(train_labels)
rf.train(train_feats)
    
# set evaluation strategy
evaluator=MulticlassAccuracy()  # named to avoid shadowing the built-in eval
oobe=rf.get_oob_error(evaluator)

print('OOB accuracy : '+str(round(oobe*100,3))+'%')

The OOB accuracy computed above is slightly lower than the test accuracy evaluated in the previous section (see the plot for num_trees=100 and rand_subset_size=2). This is because the OOB estimate reflects the expected accuracy on an arbitrary set of data vectors. It is only natural that for some sets of vectors the actual accuracy is slightly higher than the OOB estimate, while for others it is a bit lower.
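The bookkeeping behind an OOB estimate can be sketched in plain Python. This is not how Shogun implements get_oob_error; it only illustrates the idea described in this section. The helper names are made up, and a 1-nearest-neighbour rule on scalar inputs stands in for a decision tree to keep the example short.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; also return the left-out (OOB) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def oob_accuracy(xs, ys, num_models, fit, rng):
    """Score each sample using only the models that did not train on it."""
    n = len(xs)
    votes = [[] for _ in range(n)]
    for _ in range(num_models):
        in_bag, out_of_bag = bootstrap_indices(n, rng)
        model = fit([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        for i in out_of_bag:
            votes[i].append(model(xs[i]))
    # Majority vote per sample; samples that were never out-of-bag are skipped.
    scored = [(Counter(v).most_common(1)[0][0], ys[i])
              for i, v in enumerate(votes) if v]
    return sum(pred == truth for pred, truth in scored) / len(scored)

def fit_nearest_neighbour(train_x, train_y):
    """Hypothetical stand-in for a tree: a 1-nearest-neighbour rule on scalars."""
    def predict(x):
        j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
        return train_y[j]
    return predict

rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(200)]  # synthetic toy data
ys = [int(x > 0.5) for x in xs]

_, oob = bootstrap_indices(len(xs), rng)
print('OOB fraction of one bootstrap sample:', len(oob) / len(xs))
print('OOB accuracy:', oob_accuracy(xs, ys, 25, fit_nearest_neighbour, rng))
```

For large n, a bootstrap sample leaves out a fraction of roughly 1/e (about 37%) of the vectors, so every vector is scored by a substantial share of the models without ever being scored by one that saw it during training.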

Let us now apply the Random Forest model to the wine dataset. This dataset is different from the previous one in the sense that this dataset is small and has no separate test dataset. Hence OOB (or equivalently cross-validation) is the only viable strategy available here. Let us read the dataset first.


In [ ]:
trainfeat_file= os.path.join(SHOGUN_DATA_DIR, 'uci/wine/fm_wine.dat')
trainlab_file= os.path.join(SHOGUN_DATA_DIR, 'uci/wine/label_wine.dat')
train_feats,train_labels=load_file(trainfeat_file,trainlab_file)

Next let us find out the appropriate feature subset size. For this we will make use of OOB error.


In [ ]:
import matplotlib.pyplot as plt

def get_oob_errors_wine(num_trees,rand_subset_size):
    feat_types=array([False]*13)
    rf=setup_random_forest(num_trees,rand_subset_size,MajorityVote(),feat_types)
    rf.set_labels(train_labels)
    rf.train(train_feats)
    evaluator=MulticlassAccuracy()  # named to avoid shadowing the built-in eval
    return rf.get_oob_error(evaluator)

size=[1,2,4,6,8,10,13]
oobe=[round(get_oob_errors_wine(400,i)*100,3) for i in size]

print('Out-of-bag Accuracies (as %) : '+str(oobe))

plt.plot(size,oobe,'--bo')
plt.xlim([0,14])
plt.xlabel('Random subset size')
plt.ylabel('Multiclass accuracy')
plt.show()

From the above plot it is clear that a subset size of 2 or 3 produces the maximum accuracy for wine classification. At this subset size, the expected classification accuracy of the model is 98.87%. Finally, as a sanity check, let us plot the accuracy against the number of trees to ensure that 400 is indeed a sufficient value, i.e. that the OOB accuracy saturates before 400.


In [ ]:
size=[50,100,200,400,600]
oobe=[round(get_oob_errors_wine(i,2)*100,3) for i in size]

print('Out-of-bag Accuracies (as %) : '+str(oobe))

plt.plot(size,oobe,'--bo')
plt.xlim([40,650])
plt.ylim([95,100])
plt.xlabel('Number of trees')
plt.ylabel('Multiclass accuracy')
plt.show()

We see from the above plot that the accuracy remains constant beyond 100 trees. Hence 400 is a sufficient value. In fact, values just above 100 would have been ideal because of the lower training time associated with them.

References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science

[2] Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (October 2001), 5-32. DOI=10.1023/A:1010933404324 http://dx.doi.org/10.1023/A:1010933404324