This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Understanding Ensemble Methods

The goal of ensemble methods is to combine the predictions of several individual estimators built with a given learning algorithm in order to solve a shared problem. Typically, an ensemble consists of two major components:

  • a set of models
  • a set of decision rules that govern how the results of these models are combined into a single output

A consequence of this procedure is that we get a multitude of opinions about any given problem. So how do we know which classifier is right?

This is why we need a decision rule. Perhaps we consider everybody's opinion of equal importance, or perhaps we would want to weight somebody's opinion based on their expert status. Depending on the nature of our decision rule, ensemble methods can be categorized as follows:

  • Averaging methods: They develop models in parallel and then use averaging or voting techniques to come up with a combined estimator (a minimal voting sketch follows this list). This is as close to democracy as ensemble methods can get.
  • Boosting methods: They involve building models in sequence, where each added model aims to improve the score of the combined estimator. This is akin to debugging the code of your intern or reading the report of your undergraduate student: they are all bound to make errors, and the job of every subsequent expert laying eyes on the topic is to figure out the special cases where the preceding expert got it wrong.
  • Stacking methods: Also known as blending methods, they use the weighted output of multiple classifiers as inputs to the next layer in the model. This is akin to having expert groups who pass on their decision to the next expert group.
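
To make the notion of a decision rule a little more concrete, here is a minimal majority-vote sketch. It is not taken from the book; the toy prediction arrays and variable names are made up purely for illustration:

import numpy as np

# Each row holds the label predictions of one (hypothetical) classifier
# on four data points.
predictions = np.array([[0, 1, 1, 0],   # classifier 1
                        [0, 1, 0, 0],   # classifier 2
                        [1, 1, 1, 0]])  # classifier 3

# The ensemble prediction is the most frequent label in each column.
majority_vote = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions)
# majority_vote is now array([0, 1, 1, 0])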

Understanding average ensembles

An averaging ensemble is essentially a collection of models that train on the same dataset. Their results are then aggregated in a number of ways.

One common approach is to create multiple model configurations that are trained on different random subsets of the training data. Techniques that take this approach are referred to collectively as bagging methods.

Bagging methods come in many different flavors. However, they typically differ only in the way they draw random subsets of the training set:

  • Pasting methods draw random subsets of the samples without replacement
  • Bagging methods draw random subsets of the samples with replacement
  • Random subspace methods draw random subsets of the features but train on all data samples
  • Random patches methods draw random subsets of both samples and features

In scikit-learn, bagging methods can be realized using the meta-estimators BaggingClassifier and BaggingRegressor. These are meta-estimators because they allow us to build an ensemble from any other base estimator.

Implementing a bagging classifier

We can, for instance, build an ensemble from a collection of 10 k-NN classifiers as follows:


In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=10)

The BaggingClassifier class provides a number of options to customize the ensemble:

  • n_estimators: As shown in the preceding code, this specifies the number of base estimators in the ensemble.
  • max_samples: This denotes the number (or fraction) of samples to draw from the dataset to train each base estimator. We can set bootstrap=True to sample with replacement (effectively implementing bagging), or we can set bootstrap=False to implement pasting.
  • max_features: This denotes the number (or fraction) of features to draw from the feature matrix to train each base estimator. We can set max_samples=1.0 and max_features<1.0 to implement the random subspace method. Alternatively, we can set both max_samples<1.0 and max_features<1.0 to implement the random patches method (see the sketch after this list).
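
As a rough sketch of how these flavors map onto the BaggingClassifier parameters (the variable names and the 0.5 fractions below are arbitrary illustrative choices, not values from the book):

# Using the classes imported in In [1]; the meta-estimator clones the
# base estimator internally, so the same instance can be reused.
base = KNeighborsClassifier()
pasting = BaggingClassifier(base, max_samples=0.5, bootstrap=False)
bagging = BaggingClassifier(base, max_samples=0.5, bootstrap=True)
random_subspace = BaggingClassifier(base, max_samples=1.0,
                                    max_features=0.5, bootstrap=False)
random_patches = BaggingClassifier(base, max_samples=0.5,
                                   max_features=0.5, bootstrap=True)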

For example, if we wanted to implement bagging with 10 $k$-NN classifiers with $k=5$, where every $k$-NN classifier is trained on 50% of the samples in the dataset, we would modify the preceding command as follows:


In [2]:
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=10, max_samples=0.5,
                            bootstrap=True, random_state=3)

In order to observe a performance boost, we have to apply the ensemble to some dataset, such as the breast cancer dataset from Chapter 5, Using Decision Trees to Make a Medical Diagnosis:


In [3]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)

In [5]:
bag_knn.fit(X_train, y_train)
bag_knn.score(X_test, y_test)


Out[5]:
0.93706293706293708

The performance boost will become evident once we also train a single $k$-NN classifier on the data:


In [6]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)


Out[6]:
0.91608391608391604

Without changing the underlying algorithm, we were able to improve our test score from 91.6% to 93.7% by simply letting 10 k-NN classifiers do the job instead of a single one.

You're welcome to experiment with other bagging ensembles.

For example, in order to change the above code to implement the random patches method, add max_features=xxx to the BaggingClassifier call in In [2], where xxx is a number or fraction of features you want each base estimator to train on.
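
For instance, a random patches version of the ensemble from In [2] could look like the following sketch (the variable name and max_features=0.5 are arbitrary choices here):

bag_knn_patches = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                    n_estimators=10, max_samples=0.5,
                                    max_features=0.5,  # random patches
                                    bootstrap=True, random_state=3)
bag_knn_patches.fit(X_train, y_train)
bag_knn_patches.score(X_test, y_test)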

Implementing a bagging regressor

Similarly, we can use the BaggingRegressor class to form an ensemble of regressors:


In [7]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
bag_tree = BaggingRegressor(DecisionTreeRegressor(),
                            max_features=0.5, n_estimators=10,
                            random_state=3)

For example, we could build an ensemble of decision trees to predict the housing prices from the Boston dataset of Chapter 3, First Steps in Supervised Learning:


In [8]:
from sklearn.datasets import load_boston
dataset = load_boston()
X = dataset.data
y = dataset.target

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)

Then we can fit the bagging regressor on X_train and score it on X_test:


In [10]:
bag_tree.fit(X_train, y_train)
bag_tree.score(X_test, y_test)


Out[10]:
0.82704756225081688

As in the preceding example, we find a performance boost of roughly 5 percentage points: the test score improves from 77.3% for a single decision tree to 82.7% for the bagged ensemble. (For a regressor, score returns the R² coefficient of determination rather than classification accuracy.)

Of course, we wouldn't just stop here. Nobody said the ensemble needs to consist of 10 individual estimators, so we are free to explore different-sized ensembles. On top of that, the max_samples and max_features parameters allow for a great deal of customization.
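
As a sketch of one such variation (the variable name and parameter values below are arbitrary, not tuned or taken from the book):

# A larger, more customized bagging regressor to experiment with.
bag_tree_50 = BaggingRegressor(DecisionTreeRegressor(),
                               n_estimators=50, max_samples=0.8,
                               max_features=0.5, random_state=3)
bag_tree_50.fit(X_train, y_train)
bag_tree_50.score(X_test, y_test)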

Understanding boosting ensembles

Another approach to building ensembles is through boosting. Boosting models use multiple individual learners in sequence to iteratively boost the performance of the ensemble.

Typically, the learners used in boosting are relatively simple. A good example is a decision tree with only a single decision node (a decision stump). Another example could be a simple linear regression model. The idea is not to have the strongest individual learners, but quite the opposite: we want the individuals to be weak learners, so that we obtain superior performance only when we combine a large number of them.
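
In scikit-learn, such a decision stump could be expressed as a tree that is allowed to make only a single split (a sketch, not a snippet from the book):

from sklearn.tree import DecisionTreeClassifier

# A decision stump: a decision tree restricted to a single split.
stump = DecisionTreeClassifier(max_depth=1)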

Implementing a boosting classifier

For example, we can build a boosting classifier from a collection of 10 decision trees as follows:


In [11]:
from sklearn.ensemble import GradientBoostingClassifier
boost_class = GradientBoostingClassifier(n_estimators=10,
                                         random_state=3)

These classifiers support both binary and multiclass classification.

Similar to the BaggingClassifier class, the GradientBoostingClassifier class provides a number of options to customize the ensemble:

  • n_estimators: This denotes the number of base estimators in the ensemble. A large number of estimators typically results in better performance.
  • loss: This denotes the loss function (or cost function) to be optimized. Setting loss='deviance' implements logistic regression for classification with probabilistic outputs. Setting loss='exponential' actually results in AdaBoost, which we will talk about in a little bit.
  • learning_rate: This denotes the fraction by which to shrink the contribution of each tree. There is a trade-off between learning_rate and n_estimators.
  • max_depth: This denotes the maximum depth of the individual trees in the ensemble.
  • criterion: This denotes the function to measure the quality of a node split.
  • min_samples_split: This denotes the minimum number of samples required to split an internal node.
  • max_leaf_nodes: This denotes the maximum number of leaf nodes allowed in each individual tree. There are several more options, which you can look up in the scikit-learn documentation.

We can apply the boosted classifier to the preceding breast cancer dataset to get an idea of how this ensemble compares to a bagged classifier. But first, we need to reload the dataset:


In [12]:
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)

Then we find that the boosted classifier achieves 94.4% accuracy on the test set—a little under 1% better than the preceding bagged classifier:


In [14]:
boost_class.fit(X_train, y_train)
boost_class.score(X_test, y_test)


Out[14]:
0.94405594405594406

We would expect an even better score if we increased the number of base estimators from 10 to 100. In addition, we might want to play around with the learning rate and the depths of the trees.
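
A sketch of one such experiment (the variable name is ours, and the learning rate and depth values below are illustrative, not tuned):

boost_class_100 = GradientBoostingClassifier(n_estimators=100,
                                             learning_rate=0.1,
                                             max_depth=3,
                                             random_state=3)
boost_class_100.fit(X_train, y_train)
boost_class_100.score(X_test, y_test)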

Implementing a boosting regressor

Implementing a boosted regressor follows the same syntax as the boosted classifier:


In [15]:
from sklearn.ensemble import GradientBoostingRegressor
boost_reg = GradientBoostingRegressor(n_estimators=10,
                                      random_state=3)

We have seen earlier that a single decision tree can achieve a test score of 79.3% on the Boston dataset. A bagged regressor made of 10 individual regression trees achieved a test score of 82.7%. But how does a boosted regressor compare?

Let's reload the Boston dataset and split it into training and test sets. We want to make sure we use the same value for random_state so that we end up training and testing on the same subsets of the data:


In [16]:
dataset = load_boston()
X = dataset.data
y = dataset.target

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=3
)

As it turns out, the boosted decision tree ensemble actually performs worse than the previous models:


In [18]:
boost_reg.fit(X_train, y_train)
boost_reg.score(X_test, y_test)


Out[18]:
0.71991199075668488

This result might be confusing at first. After all, we used 10 times as many estimators as we did for the single decision tree. Why did our numbers get worse?

As you can see, this is a good example of a single expert being smarter than a small group of weak learners. One possible solution is to make the ensemble larger. In fact, it is customary to use on the order of 100 weak learners in a boosted ensemble:


In [19]:
boost_reg = GradientBoostingRegressor(n_estimators=100,
                                      random_state=3)

Then, when we retrain the ensemble on the Boston dataset, we get a test score of roughly 90%:


In [20]:
boost_reg.fit(X_train, y_train)
boost_reg.score(X_test, y_test)


Out[20]:
0.89984081091774459

What happens when you increase the number to n_estimators=500? There's a lot more we could do by playing with the optional parameters.
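
If you want to try the suggested experiment, a minimal sketch looks like this (the variable name is ours, and the resulting score is not reported in the text, so run it and see):

boost_reg_500 = GradientBoostingRegressor(n_estimators=500,
                                          random_state=3)
boost_reg_500.fit(X_train, y_train)
boost_reg_500.score(X_test, y_test)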

As you can see, boosting is a powerful procedure that allows you to get massive performance improvements by combining a large number of relatively simple learners.