Recall: Boosting

AdaBoost Algorithm

An iterative algorithm for "ensembling" base learners

  • Input: $\{(\mathbf{x}_i, y_i)\}_{i = 1}^n, T, \mathscr{F}$, base learner
  • Initialize: $\mathbf{w}^{1} = (\frac{1}{n}, ..., \frac{1}{n})$
  • For $t = 1, ..., T$
    • $\mathbf{w}^{t} \rightarrow \boxed{\text{base learner finds} \quad \arg\min_{f \in \mathscr{F}} \sum \limits_{i = 1}^n w^t_i \mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}}} \rightarrow f_t$
    • $\alpha_t = \frac{1}{2}\text{ln}\left(\frac{1 - r_t}{r_t}\right)$
      • where $r_t := e_{\mathbf{w}^t}(f_t) = \sum \limits_{i = 1}^n w^t_i \mathbb{1}_{\{f_t(\mathbf{x}_i) \neq y_i\}}$ is the weighted error of $f_t$ (the weights sum to one)
    • $w_i^{t + 1} = \frac{w_i^t \exp \left(- \alpha_ty_if_t(\mathbf{x}_i)\right)}{z_t}$ where $z_t$ normalizes.
  • Output: $h_T(\mathbf{x}) = \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x})\right)$
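
A minimal sketch of this loop in Python, assuming binary labels $y_i \in \{-1, +1\}$ and using depth-1 scikit-learn decision trees ("stumps") as the base learner via its sample_weight argument; the function names adaboost_fit and adaboost_predict are just illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # Sketch of the AdaBoost loop; assumes y takes values in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)                # initialize uniform weights
    stumps, alphas = [], []
    for t in range(T):
        # base learner: a stump fit to the weighted data
        f = DecisionTreeClassifier(max_depth=1)
        f.fit(X, y, sample_weight=w)
        pred = f.predict(X)
        r = np.sum(w * (pred != y))        # weighted error (weights sum to 1)
        alpha = 0.5 * np.log((1 - r) / r)  # no guard here against r == 0
        w = w * np.exp(-alpha * y * pred)  # reweight examples
        w /= w.sum()                       # normalize (the z_t step)
        stumps.append(f)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # sign of the weighted vote over the base learners
    scores = sum(a * f.predict(X) for f, a in zip(stumps, alphas))
    return np.sign(scores)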

AdaBoost through Coordinate Descent

It is often said that we can view AdaBoost as performing "coordinate descent" on the exponential loss function.

Question: Can you figure out what that means? Why is AdaBoost doing coordinate descent?

Hint 1: You need to figure out the objective function being minimized. For simplicity, assume there are a finite number of weak learners in $\mathscr{F}$.

Hint 2: Recall that the exponential loss function is $\ell(h; (x,y)) = \exp(-y h(x))$

Hint 3: Let's write down the objective function being minimized. For simplicity, assume there are a finite number of weak learners in $\mathscr{F}$, say indexed by $j=1, \ldots, m$. Given a weight vector $\vec{\alpha}$, the exponential loss over the data for this $\vec{\alpha}$ is: $$\text{Loss}(\vec{\alpha}) = \sum_{i=1}^n \exp \left( - y_i \left(\sum_{j=1}^m \alpha_j h_j(\vec{x}_i)\right)\right)$$ Coordinate descent chooses the coordinate of $\nabla \text{Loss}(\vec{\alpha})$ with the smallest (most negative) value and updates only this coordinate. Which coordinate is chosen?
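
Hint 4: As a starting point for that gradient calculation (not the full answer), the partial derivative of the objective with respect to a single coordinate $\alpha_j$ is $$\frac{\partial\, \text{Loss}(\vec{\alpha})}{\partial \alpha_j} = -\sum_{i=1}^n y_i h_j(\vec{x}_i) \exp \left( - y_i \sum_{k=1}^m \alpha_k h_k(\vec{x}_i)\right)$$ Notice that the exponential factor plays the role of an (unnormalized) weight on example $i$; compare it to the weights $w_i^t$ that AdaBoost maintains.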

Bagging classifiers

Let's explore how bagging (bootstrap aggregation) works with classifiers to reduce variance, first by evaluating off-the-shelf tools and then by implementing our own basic bagging classifier.

In both examples we'll be working with the dataset from the forest cover type prediction Kaggle competition, where the aim is to build a multi-class classifier to predict the forest cover type of a 30x30 meter plot of land based on cartographic features. See their notes about the dataset for more background.

Exploring bagging

Loading and splitting the dataset

First, let's load the dataset:


In [1]:
import pandas as pd
df = pd.read_csv('forest-cover-type.csv')
df.head()


Out[1]:
Id Elevation Aspect Slope Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways Hillshade_9am Hillshade_Noon Hillshade_3pm ... Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39 Soil_Type40 Cover_Type
0 1 2596 51 3 258 0 510 221 232 148 ... 0 0 0 0 0 0 0 0 0 5
1 2 2590 56 2 212 -6 390 220 235 151 ... 0 0 0 0 0 0 0 0 0 5
2 3 2804 139 9 268 65 3180 234 238 135 ... 0 0 0 0 0 0 0 0 0 2
3 4 2785 155 18 242 118 3090 238 238 122 ... 0 0 0 0 0 0 0 0 0 2
4 5 2595 45 2 153 -1 391 220 234 150 ... 0 0 0 0 0 0 0 0 0 5

5 rows × 56 columns

Now we extract the X/y features and split them into a 40/60 train/test split (40% train, 60% test) so that we can see how well the training set performance generalizes to a held-out set.


In [2]:
X, y = df.iloc[:, 1:-1].values, df.iloc[:, -1].values

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=0)

Evaluating train/test with and without bagging

Now let's use an off-the-shelf decision tree classifier and compare its train/test performance with a bagged decision tree.


In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

models = [
    ('tree', DecisionTreeClassifier(random_state=0)),
    ('bagged tree', BaggingClassifier(
            DecisionTreeClassifier(random_state=0),
            random_state=0,
            n_estimators=10))
]

for label, model in models:  
    model.fit(X_train, y_train) 
    print("{} training|test accuracy: {:.2f} | {:.2f}".format(
            label, 
            accuracy_score(y_train, model.predict(X_train)),
            accuracy_score(y_test, model.predict(X_test))))


tree training|test accuracy: 1.00 | 0.73
bagged tree training|test accuracy: 0.99 | 0.80

Note that both models were able to (nearly) fit the training set perfectly, and that bagging substantially improves test set performance (reduces variance).

Hyperparameters

Let's look at two hyperparameters associated with the bagging classifier (a short configuration sketch follows this list):

  • n_estimators controls how many classifiers make up the ensemble
  • max_samples controls how many samples each classifier in the ensemble draws
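
For example, a minimal configuration sketch (parameter names as in scikit-learn's BaggingClassifier; the particular values here are arbitrary):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# 50 trees, each fit on a bootstrap sample half the size of the training set
bagged = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    max_samples=0.5,
    random_state=0)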

How many classifiers do we need to reduce variance?

The default number of estimators is 10; explore the performance of the bagging classifier with a range of values. How many classifiers do we need to reduce variance? What is the point of diminishing returns for this dataset?


In [5]:
# your code goes here!
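
One possible way to run this sweep, reusing the imports and the X_train/X_test split from above (the grid of values here is arbitrary):

for n in [1, 2, 5, 10, 20, 50, 100]:
    model = BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=n,
        random_state=0)
    model.fit(X_train, y_train)
    print("n_estimators={:3d}: train {:.2f} | test {:.2f}".format(
        n,
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test))))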

How much of the dataset does each classifier need?

By default, max_samples is set to 1.0, which means each classifier draws a number of samples equal to the size of the training set (sampled with replacement, since bootstrapping is the default).

How do you suppose bagging manages to reduce variance while still using the same number of samples?

Explore how the performance varies as you range max_samples (note, you can use float values between 0.0 and 1.0 to choose a fraction of the training set):


In [6]:
# your code goes here!
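
Similarly, a possible sketch for sweeping max_samples, again reusing the objects defined above:

for frac in [0.05, 0.1, 0.25, 0.5, 0.75, 1.0]:
    model = BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=10,
        max_samples=frac,
        random_state=0)
    model.fit(X_train, y_train)
    print("max_samples={:.2f}: train {:.2f} | test {:.2f}".format(
        frac,
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test))))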

Implementing Bagging

We've shown the power of bagging; now let's appreciate its simplicity by implementing our own bagging classifier right here!


In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import BaseEstimator
import numpy as np

class McBaggingClassifier(BaseEstimator):
    
    def __init__(self, classifier_factory=DecisionTreeClassifier, num_classifiers=10):
        self.classifier_factory = classifier_factory
        self.num_classifiers = num_classifiers
        
    def fit(self, X, y):
        # create num_classifiers classifiers by calling classifier_factory, each
        # fitted with a different bootstrap sample of X and y
        return self

    def predict(self, X):
        # get the prediction for each classifier, take a majority vote        
        return np.ones(X.shape[0])

You should be able to achieve similar performance to scikit-learn's implementation:


In [8]:
our_models = [
    ('tree', DecisionTreeClassifier(random_state=0)),
    ('our bagged tree', McBaggingClassifier(
            classifier_factory=lambda: DecisionTreeClassifier(random_state=0)
            ))
]

for label, model in our_models:  
    model.fit(X_train, y_train) 
    print("{} training|test accuracy: {:.2f} | {:.2f}".format(
        label, 
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test))))


tree training|test accuracy: 1.00 | 0.73
our bagged tree training|test accuracy: 0.14 | 0.14
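
As expected, the stub predict above (which always returns ones) performs poorly. One possible way to complete the class (a sketch, not the only solution): fit each classifier on a bootstrap sample of rows drawn with replacement, then take a per-sample majority vote at prediction time. The fixed RandomState and the bincount-based vote (which assumes non-negative integer labels, true for the cover types here) are implementation choices, not requirements:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeClassifier

class McBaggingClassifier(BaseEstimator):

    def __init__(self, classifier_factory=DecisionTreeClassifier, num_classifiers=10):
        self.classifier_factory = classifier_factory
        self.num_classifiers = num_classifiers

    def fit(self, X, y):
        # fit each classifier on its own bootstrap sample (rows drawn with replacement)
        rng = np.random.RandomState(0)
        n = X.shape[0]
        self.classifiers_ = []
        for _ in range(self.num_classifiers):
            idx = rng.choice(n, size=n, replace=True)
            clf = self.classifier_factory()
            clf.fit(X[idx], y[idx])
            self.classifiers_.append(clf)
        return self

    def predict(self, X):
        # collect every classifier's predictions, then majority-vote per sample
        preds = np.array([clf.predict(X) for clf in self.classifiers_])
        return np.apply_along_axis(
            lambda votes: np.bincount(votes).argmax(), axis=0, arr=preds)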