The goal of ensemble methods is to combine the predictions of several individual estimators built with a given learning algorithm in order to solve a shared problem. Typically, an ensemble consists of two major components:
- a set of models, each of which produces its own prediction for a given data point
- a decision rule that governs how these individual predictions are combined into a single output
A consequence of this procedure is that we get a multitude of opinions about any given problem. So how do we know which classifier is right?
This is why we need a decision rule. Perhaps we consider everybody's opinion of equal importance, or perhaps we might want to weight somebody's opinion based on their expert status. Depending on the nature of our decision rule, ensemble methods can be categorized as follows:
- Averaging methods: These build several estimators in parallel and then average (or vote on) their predictions.
- Boosting methods: These build estimators in sequence, where each new estimator tries to improve on the combined performance of the ones built before it.
An averaging ensemble is essentially a collection of models that train on the same dataset. Their results are then aggregated in a number of ways.
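To make the idea of a decision rule concrete, here is a minimal sketch of a plain majority vote; the three classifiers and their predictions are made up purely for illustration and do not come from the book:
import numpy as np

# Hypothetical predictions of three classifiers for five data points
pred_a = np.array([0, 1, 1, 0, 1])
pred_b = np.array([0, 1, 0, 0, 1])
pred_c = np.array([1, 1, 1, 0, 0])

# Majority vote: a point gets label 1 if at least two of the three agree
votes = pred_a + pred_b + pred_c
majority = (votes >= 2).astype(int)
print(majority)  # [0 1 1 0 1]
A weighted decision rule would simply multiply each vote by a weight before summing. Before the votes can be aggregated, though, we first need a way to generate a diverse set of models.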
One common method involves creating multiple model configurations that are each trained on a different random subset of the training data. Techniques that take this approach are referred to collectively as bagging methods.
Bagging methods come in many different flavors. However, they typically only differ in the way they draw random subsets of the training set:
- Pasting methods draw random subsets of the samples without replacement.
- Bagging methods draw random subsets of the samples with replacement.
- Random subspace methods draw random subsets of the features, but train on all data samples.
- Random patch methods draw random subsets of both samples and features.
In scikit-learn, bagging methods can be realized using the meta-estimators BaggingClassifier and BaggingRegressor. These are meta-estimators because they allow us to build an ensemble from any other base estimator.
In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=10)
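Because BaggingClassifier is a meta-estimator, nothing ties us to $k$-NN as the base estimator. As a quick sketch (the variable name bag_tree_clf is ours, not from the book), we could just as well wrap a decision tree:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Same meta-estimator, different base estimator
bag_tree_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)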
The BaggingClassifier class provides a number of options to customize the ensemble:
- n_estimators: This denotes the number of base estimators in the ensemble, as specified in the preceding command.
- max_samples: This denotes the number (or fraction) of samples to draw from the dataset for each base estimator. We can set bootstrap=True to draw samples with replacement (implementing bagging proper), or bootstrap=False to draw them without replacement (implementing pasting).
- max_features: This denotes the number (or fraction) of features to draw from the feature matrix for each base estimator. We can set max_samples $=1.0$ and max_features $<1.0$ to implement the random subspace method. Alternatively, we can set both max_samples $<1.0$ and max_features $<1.0$ to implement the random patches method.

For example, if we wanted to implement bagging with 10 $k$-NN classifiers with $k=5$, where every $k$-NN classifier is trained on 50% of the samples in the dataset, we would modify the preceding command as follows:
In [2]:
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=10, max_samples=0.5,
                            bootstrap=True, random_state=3)
In order to observe a performance boost, we have to apply the ensemble to some dataset, such as the breast cancer dataset from Chapter 5, Using Decision Trees to Make a Medical Diagnosis:
In [3]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)
In [5]:
bag_knn.fit(X_train, y_train)
bag_knn.score(X_test, y_test)
Out[5]:
The performance boost will become evident once we also train a single $k$-NN classifier on the data:
In [6]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
Out[6]:
Without changing the underlying algorithm, we were able to improve our test score from 91.6% to 93.7% by simply letting 10 $k$-NN classifiers do the job instead of a single one.
You're welcome to experiment with other bagging ensembles. For example, in order to change the above code to implement the random patches method, add max_features=xxx to the BaggingClassifier call in In [2], where xxx is the number or fraction of features you want each base estimator to train on.
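As a concrete sketch of that suggestion (the value 0.5 for max_features is an arbitrary choice, not a tuned result), the random patches variant could look like this, reusing the imports and the train/test split from the preceding cells:
# Random patches: each k-NN classifier sees 50% of the samples
# and 50% of the features
bag_patches = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                n_estimators=10, max_samples=0.5,
                                max_features=0.5, bootstrap=True,
                                random_state=3)
bag_patches.fit(X_train, y_train)
bag_patches.score(X_test, y_test)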
Bagging is not restricted to classification; we can use the BaggingRegressor class in the same way to build an ensemble of regressors:
In [7]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
bag_tree = BaggingRegressor(DecisionTreeRegressor(),
                            max_features=0.5, n_estimators=10,
                            random_state=3)
For example, we could build an ensemble of decision trees to predict the housing prices from the Boston dataset of Chapter 3, First Steps in Supervised Learning:
In [8]:
from sklearn.datasets import load_boston
dataset = load_boston()
X = dataset.data
y = dataset.target
In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)
Then we can fit the bagging regressor on X_train and score it on X_test:
In [10]:
bag_tree.fit(X_train, y_train)
bag_tree.score(X_test, y_test)
Out[10]:
As in the preceding example, we find a performance boost of roughly 5%, from a test score of 77.3% for a single decision tree to 82.7% for the bagged ensemble.
Of course, we wouldn't just stop here. Nobody said the ensemble needs to consist of 10 individual estimators, so we are free to explore different-sized ensembles. On top of that, the max_samples and max_features parameters allow for a great deal of customization.
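As a rough sketch of how such an exploration could look (the particular ensemble sizes below are arbitrary, and the snippet assumes the Boston training split from the preceding cells is still in scope), we could loop over a few values of n_estimators and compare cross-validated scores:
from sklearn.model_selection import cross_val_score

# Compare a few (arbitrarily chosen) ensemble sizes via 3-fold cross-validation
for n in [5, 10, 20, 50]:
    bag = BaggingRegressor(DecisionTreeRegressor(),
                           n_estimators=n, max_features=0.5,
                           random_state=3)
    scores = cross_val_score(bag, X_train, y_train, cv=3)
    print(n, scores.mean())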
Another approach to building ensembles is through boosting. Boosting models use multiple individual learners in sequence to iteratively boost the performance of the ensemble.
Typically, the learners used in boosting are relatively simple. A good example is a decision tree with only a single node—a decision stump. Another example could be a simple linear regression model. The idea is not to have the strongest individual learners, quite the opposite—we want the individuals to be weak learners, so that we get a superior performance only when we consider a large number of individuals.
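To make the notion of a weak learner concrete, here is a minimal sketch (not from the book) of a decision stump, which is simply a decision tree restricted to a single split:
from sklearn.tree import DecisionTreeClassifier

# A decision stump: a decision tree that is allowed only one split
stump = DecisionTreeClassifier(max_depth=1)
On its own, such a stump will usually underfit; the point of boosting is to chain many of them so that each new one focuses on the mistakes of its predecessors.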
In scikit-learn, one popular boosting classifier is implemented in the GradientBoostingClassifier class:
In [11]:
from sklearn.ensemble import GradientBoostingClassifier
boost_class = GradientBoostingClassifier(n_estimators=10,
                                         random_state=3)
These classifiers support both binary and multiclass classification.
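For instance, here is a minimal sketch of a multiclass fit; the Iris dataset is used purely for illustration and does not appear in this chapter:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Three-class problem: the boosted classifier handles it out of the box
iris = load_iris()
multi_boost = GradientBoostingClassifier(n_estimators=10, random_state=3)
multi_boost.fit(iris.data, iris.target)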
Similar to the BaggingClassifier class, the GradientBoostingClassifier class provides a number of options to customize the ensemble:
- n_estimators: This denotes the number of base estimators in the ensemble. A large number of estimators typically results in better performance.
- loss: This denotes the loss function (or cost function) to be optimized. Setting loss='deviance' implements logistic regression for classification with probabilistic outputs. Setting loss='exponential' actually results in AdaBoost, which we will talk about in a little bit.
- learning_rate: This denotes the fraction by which to shrink the contribution of each tree. There is a trade-off between learning_rate and n_estimators.
- max_depth: This denotes the maximum depth of the individual trees in the ensemble.
- criterion: This denotes the function to measure the quality of a node split.
- min_samples_split: This denotes the number of samples required to split an internal node.
- max_leaf_nodes: This denotes the maximum number of leaf nodes allowed in each individual tree, and so on.

We can apply the boosted classifier to the preceding breast cancer dataset to get an idea of how this ensemble compares to a bagged classifier. But first, we need to reload the dataset:
In [12]:
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target
In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)
Then we find that the boosted classifier achieves 94.4% accuracy on the test set—a little under 1% better than the preceding bagged classifier:
In [14]:
boost_class.fit(X_train, y_train)
boost_class.score(X_test, y_test)
Out[14]:
We would expect an even better score if we increased the number of base estimators from 10 to 100. In addition, we might want to play around with the learning rate and the depths of the trees.
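A minimal sketch of such an experiment might look as follows; the particular values for learning_rate and max_depth are arbitrary starting points, not tuned results:
# Larger ensemble with (arbitrarily chosen) learning rate and tree depth
boost_tuned = GradientBoostingClassifier(n_estimators=100,
                                         learning_rate=0.1,
                                         max_depth=2,
                                         random_state=3)
boost_tuned.fit(X_train, y_train)
boost_tuned.score(X_test, y_test)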
Boosting also works for regression problems, using the GradientBoostingRegressor class:
In [15]:
from sklearn.ensemble import GradientBoostingRegressor
boost_reg = GradientBoostingRegressor(n_estimators=10,
                                      random_state=3)
We have seen earlier that a single decision tree can achieve a test score of 79.3% on the Boston dataset. A bagged ensemble of 10 individual regression trees achieved a test score of 82.7%. But how does a boosted regressor compare?
Let's reload the Boston dataset and split it into training and test sets. We want to make sure we use the same value for random_state so that we end up training and testing on the same subsets of the data:
In [16]:
dataset = load_boston()
X = dataset.data
y = dataset.target
In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)
As it turns out, the boosted decision tree ensemble actually performs worse than the models we have built so far:
In [18]:
boost_reg.fit(X_train, y_train)
boost_reg.score(X_test, y_test)
Out[18]:
This result might be confusing at first. After all, we used 10 times more estimators than we did for the single decision tree. Why would our numbers get worse?
As you can see, this is a good example of a single expert being smarter than a group of weak learners. One possible solution is to make the ensemble larger. In fact, it is customary to use on the order of 100 weak learners in a boosted ensemble:
In [19]:
boost_reg = GradientBoostingRegressor(n_estimators=100,
                                      random_state=3)
Then, when we retrain the ensemble on the Boston dataset, we get a test score of 89.8%:
In [20]:
boost_reg.fit(X_train, y_train)
boost_reg.score(X_test, y_test)
Out[20]:
What happens when you increase the number to n_estimators=500? There's a lot more we could do by playing with the optional parameters.
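If you want to explore these options more systematically, a grid search is one way to go. Here is a minimal sketch; the parameter grid below is an arbitrary starting point rather than a recommendation:
from sklearn.model_selection import GridSearchCV

# Search over a small (arbitrary) grid of boosting parameters
param_grid = {'n_estimators': [100, 300, 500],
              'learning_rate': [0.05, 0.1, 0.2],
              'max_depth': [2, 3, 4]}
grid = GridSearchCV(GradientBoostingRegressor(random_state=3),
                    param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)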
As you can see, boosting is a powerful procedure that allows you to get massive performance improvements by combining a large number of relatively simple learners.