In this lab session we continue working with classification algorithms, focusing in particular on decision trees and their use in ensembles.
In [ ]:
%matplotlib inline
In [ ]:
import numpy as np
import matplotlib.pyplot as plt  # needed later for the plotting cells
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Initialize the random generator seed to compare results
np.random.seed(0)
iris = datasets.load_iris()
X = iris.data # All input features are used
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.4)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Decision Trees learn simple decision rules by iteratively selecting an input feature and setting a threshold over it, so they are a simple tool to understand and interpret.
Use the DecisionTreeClassifier( ) function to train a decision tree. Although the tree depth is usually a parameter to select, here we are working with only four input features, so you can use all the default parameters and still obtain good performance. Complete the following code so that the variable acc_tree contains the test accuracy of the tree.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
from sklearn import tree
clf_tree = # <FILL IN>
acc_tree= # <FILL IN>
print("The test accuracy of the decision tree is %2.2f" %(100*acc_tree))
In [ ]:
###########################################################
# TEST CELL
###########################################################
from test_helper import Test
# TEST accuracy values
Test.assertEquals(np.round(acc_tree, 2), 0.95, 'incorrect result: The value of acc_tree is incorrect')
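Since decision trees are easy to interpret, you can also inspect the rules the trained tree has learned. Below is a minimal sketch, assuming clf_tree is the fitted DecisionTreeClassifier from the cell above and that your scikit-learn version provides tree.export_text; it is only an illustration, not part of the required exercise.
In [ ]:
# Sketch: print the decision rules learned by the fitted tree as plain text.
# Assumes clf_tree is the fitted DecisionTreeClassifier from the cell above.
from sklearn import tree
print(tree.export_text(clf_tree, feature_names=list(iris.feature_names)))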
Try to use the following example from the scikit-learn documentation to plot the classification regions for different pairs of input features. Modify the necessary code lines to plot our training data over the decision regions.
Be careful: this example retrains a different classifier for each pair of input features; therefore, its solution differs from the one we have just computed.
In [ ]:
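# A possible sketch for this cell (only one way to do it; the feature pairs,
# grid step and plotting details are choices of this illustration, not the
# official solution). A new tree is trained per feature pair, so its accuracy
# need not match that of the tree trained on all four features above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree

pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
plot_step = 0.02
plt.figure(figsize=(10, 6))
for idx, (f1, f2) in enumerate(pairs):
    X_pair = X_train[:, [f1, f2]]
    clf_pair = tree.DecisionTreeClassifier().fit(X_pair, Y_train)
    # Grid covering the range of the two selected (standardized) features
    x_min, x_max = X_pair[:, 0].min() - 1, X_pair[:, 0].max() + 1
    y_min, y_max = X_pair[:, 1].min() - 1, X_pair[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = clf_pair.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(2, 3, idx + 1)
    plt.contourf(xx, yy, Z, alpha=0.4)
    # Training data plotted over the decision regions
    plt.scatter(X_pair[:, 0], X_pair[:, 1], c=Y_train, edgecolor='k', s=15)
    plt.xlabel(iris.feature_names[f1])
    plt.ylabel(iris.feature_names[f2])
plt.tight_layout()
plt.show()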
A Random Forest (RF) trains several decision tree classifiers, each one on a different sub-sample of the training data, and averages their outputs to improve the final accuracy.
Use the RandomForestClassifier( ) function to train an RF classifier, selecting the number of trees by cross-validation. The remaining parameters, such as the number of subsampled data points or features, can be left at their default values. Return the optimal number of trees and the final test accuracy of the RF classifier.
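For reference, the general cross-validated grid-search pattern in scikit-learn looks roughly like the sketch below. It is only an illustration of the API (best_params_ and score() are standard GridSearchCV attributes/methods), not necessarily the exact fill-in expected in the next cell.
In [ ]:
# Sketch of the GridSearchCV pattern (illustration only).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = [{'n_estimators': list(range(1, 10))}]
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=10)
search.fit(X_train, Y_train)
print(search.best_params_['n_estimators'])  # selected number of trees
print(search.score(X_test, Y_test))         # test accuracy of the refitted model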
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rang_n_trees=np.arange(1,10)
tuned_parameters = [{'n_estimators': rang_n_trees}]
nfold = 10
clf_RF = #<FILL IN>
n_trees_opt = #<FILL IN>
acc_RF = #<FILL IN>
print "The number of selected trees is " + str(n_trees_opt)
print("The test accuracy of the RF is %2.2f" %(100*acc_RF))
Run the above code again. Do you obtain the same accuracy?
Random Forests have a random component, since the training data are subsampled, so you can obtain a different result each time the algorithm is run. To provide a statistically meaningful measurement of the classifier's performance, we therefore need to average the result over a large number of runs.
Complete the following code to train the RF classifier again, this time averaging its test accuracy over 50 runs. Report the average accuracy and the average number of selected trees, together with their standard deviations.
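As a reminder, averaging over runs simply means collecting one accuracy per run and reporting its sample mean and standard deviation; a trivial sketch with placeholder values:
In [ ]:
# Sketch: mean and standard deviation over per-run accuracies (placeholder values).
acc_runs = [0.93, 0.95, 0.92, 0.97]  # one accuracy per run (placeholders, not real results)
print(np.mean(acc_runs), np.std(acc_runs))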
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# Initialize the random generator seed to compare results
np.random.seed(0)
print('This can take a few minutes, be patient')
# Create RF classifier object with CV
clf_RF = # <FILL IN>
acc_RF_vector=[]
n_trees_vector=[]
for run in np.arange(50):
# For each run, train it, compute its accuracy and examine the number of optimal trees
clf_RF.# <FILL IN>
acc = # <FILL IN>
acc_RF_vector.append(acc)
n_trees = # <FILL IN>
n_trees_vector.append(n_trees)
# Compute averaged accuracies and number of used trees
mean_acc_RF = # <FILL IN>
std_acc_RF = # <FILL IN>
mean_n_trees = # <FILL IN>
std_n_trees = # <FILL IN>
# Print the results
print('Averaged accuracy for RF classifier is %2.2f +/- %2.2f '%(100*mean_acc_RF, 100*std_acc_RF))
print('Averaged number of selected trees is %2.2f +/- %2.2f '%(mean_n_trees, std_n_trees))
In [ ]:
###########################################################
# TEST CELL
###########################################################
from test_helper import Test
Test.assertEquals(np.round(mean_acc_RF, 1), 0.9, 'incorrect result: The value of mean_acc_RF is incorrect')
Test.assertEquals(np.round(std_acc_RF, 2), 0.03, 'incorrect result: The value of std_acc_RF is incorrect')
Test.assertEquals(np.round(mean_n_trees, 1), 4.2, 'incorrect result: The value of mean_n_trees is incorrect')
Test.assertEquals(np.round(std_n_trees, 1), 2.0, 'incorrect result: The value of std_n_trees is incorrect')
The goal of ensemble methods is to combine the predictions of several base estimators or learners to obtain a classifier with improved performance. We are going to work with two ensemble methods: bagging and boosting (AdaBoost).
Here, to implement bagged classifiers, we are going to use the BaggingClassifier( ) object, which offers several degrees of freedom in the learner design: sampling with or without replacement, selecting random subsets of features instead of samples, or selecting subsets of both samples and features.
For the sake of simplicity, we are going to use a decision stump (i.e., a decision tree of depth one) as the base learner. Note that when decision trees are used as base learners, the resulting ensemble is a random forest.
Run the following code to train an ensemble of bagged decision stumps. The max_samples (fraction of training samples used by each learner) and max_features (fraction of input features used by each learner) parameters are set to 0.5, and the number of learners is fixed to 10.
In [ ]:
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
base_learner = tree.DecisionTreeClassifier(max_depth=1)
bagging = BaggingClassifier(base_learner, n_estimators = 10, max_samples=0.5, max_features = 0.5)
bagging.fit(X_train, Y_train)
acc_test = bagging.score(X_test, Y_test)
print('Accuracy of bagged ensemble is %2.2f '%(100*acc_test))
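The other degrees of freedom mentioned above are exposed as constructor parameters of BaggingClassifier; the sketch below shows them explicitly (the values are chosen only for illustration; bootstrap and bootstrap_features control whether samples and features are drawn with replacement).
In [ ]:
# Sketch: BaggingClassifier options controlling how samples and features are drawn
# (values chosen only for illustration).
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
bagging_opts = BaggingClassifier(
    tree.DecisionTreeClassifier(max_depth=1),
    n_estimators=10,
    max_samples=0.5,           # fraction of training samples per learner
    max_features=0.5,          # fraction of input features per learner
    bootstrap=True,            # draw samples with replacement
    bootstrap_features=False)  # draw features without replacement
bagging_opts.fit(X_train, Y_train)
print(bagging_opts.score(X_test, Y_test))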
Analyze the final ensemble performance as a function of the number of learners. Average the result over 20 or more runs to obtain statistically significant results (note that the above accuracy changes if you run the code again).
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# Initialize the random generator seed to test results
np.random.seed(0)
acc_test_evol = []
rang_n_learners = range(1,50,2)
for n_learners in rang_n_learners:
acc_test_run=[]
for run in range(50):
bagging = # <FILL IN>
acc = # <FILL IN>
acc_test_run.append(acc)
acc_test_evol.append(np.mean(acc_test_run))
# Plotting results
plt.figure()
plt.plot(rang_n_learners,acc_test_evol)
plt.xlabel('Number of learners')
plt.ylabel('Accuracy')
plt.title('Evolution of the bagged ensemble accuracy with the number of learners ')
plt.show()
In [ ]:
###########################################################
# TEST CELL
###########################################################
from test_helper import Test
# TEST accuracy values
Test.assertEquals(np.round(acc_test_evol[-1], 2), 0.94, 'incorrect result: The final value of acc_test_evol is incorrect')
To train an AdaBoost classifier, scikit-learn provides the AdaBoostClassifier() method, which implements two versions of the AdaBoost algorithm: Discrete AdaBoost (SAMME) and Real AdaBoost (SAMME.R).
As in the previous subsection, use a decision stump as the base learner. Fix the number of learners to 50 and compare the results of both approaches: Discrete AdaBoost (set the algorithm parameter to 'SAMME') and Real AdaBoost (algorithm='SAMME.R').
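A minimal sketch of the AdaBoostClassifier calls is shown below; it is an illustration rather than the required fill-in. The base learner is passed as the first argument, and note that very recent scikit-learn versions deprecate the 'SAMME.R' option.
In [ ]:
# Sketch: discrete vs. real AdaBoost (illustration only; see the version note above).
from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
stump = tree.DecisionTreeClassifier(max_depth=1)
ada_discrete = AdaBoostClassifier(stump, n_estimators=50, algorithm='SAMME').fit(X_train, Y_train)
ada_real = AdaBoostClassifier(stump, n_estimators=50, algorithm='SAMME.R').fit(X_train, Y_train)
print(ada_discrete.score(X_test, Y_test), ada_real.score(X_test, Y_test))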
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# Initialize the random generator seed to test results
np.random.seed(0)
from sklearn.ensemble import AdaBoostClassifier
base_learner = tree.DecisionTreeClassifier(max_depth=1)
# Train a discrete Adaboost classifier and obtain its accuracy
AB_D = #<FILL IN>
acc_AB_D = # <FILL IN>
# Train a real Adaboost classifier and obtain its accuracy
AB_R = # <FILL IN>
acc_AB_R = # <FILL IN>
print('Accuracy of discrete adaboost ensemble is %2.2f '%(100*acc_AB_D))
print('Accuracy of real adaboost ensemble is %2.2f '%(100*acc_AB_R))
In [ ]:
###########################################################
# TEST CELL
###########################################################
from test_helper import Test
# TEST accuracy values
Test.assertEquals(np.round(acc_AB_D, 2), 0.95, 'incorrect result: The value of acc_AB_D is incorrect')
Test.assertEquals(np.round(acc_AB_R, 2), 0.88, 'incorrect result: The value of acc_AB_R is incorrect')
Unlike the BaggingClassifier() method, AdaBoostClassifier() lets you analyze the evolution of the error without having to retrain the ensemble for different numbers of learners. For this task, you can use the classifier method .staged_score(), which returns the evolution of the ensemble accuracy. Note that it returns this information as a generator object, so you have to iterate over it to access each element.
The following code lines let you plot the evolution of the ensemble accuracy (over the test data) for both the Discrete and Real AdaBoost approaches.
In [ ]:
acc_AB_D_evol=[acc for acc in AB_D.staged_score(X_test, Y_test)]
acc_AB_R_evol=[acc for acc in AB_R.staged_score(X_test, Y_test)]
# Plotting results
rang_n_learners=np.arange(50)+1
plt.figure()
plt.subplot(211)
plt.plot(rang_n_learners,acc_AB_D_evol)
plt.xlabel('Number of learners')
plt.ylabel('Accuracy')
plt.title('Discrete AB accuracy')
plt.subplot(212)
plt.plot(rang_n_learners,acc_AB_R_evol)
plt.xlabel('Number of learners')
plt.ylabel('Accuracy')
plt.title('Real AB accuracy')
plt.show()
If you want, you can check the following scikit-learn example, where the performance of different ensembles on the Iris dataset is analyzed.