In this discussion notebook we will cover the material from lecture 12 about ensembles and boosting. For consistency with the lecture notes, we will use decision trees; however, any other classifier/regressor could be used just as easily.


In [1]:
# Import all required libraries
from __future__ import division # For python 2.*

import numpy as np
import matplotlib.pyplot as plt
import mltools as ml

np.random.seed(0)
%matplotlib inline

Bagging and Random Forests

Bagging rests on a simple idea: the average prediction over multiple classifiers is often better than any single classifier's prediction. Each classifier is trained on a random sample of the data drawn with replacement (bootstrapping).
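
To make the bootstrapping step concrete, here is a minimal sketch of drawing a bootstrap sample with plain NumPy. This is illustrative only; the notebook itself uses the ml.bootstrapData helper below, and bootstrap_sample is just a hypothetical name.

# Illustrative sketch only -- the notebook uses ml.bootstrapData for this.
def bootstrap_sample(X, Y):
    """Draw X.shape[0] points from (X, Y) uniformly at random, with replacement."""
    idx = np.random.randint(0, X.shape[0], X.shape[0])
    return X[idx], Y[idx]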

Loading the data

We'll use the Kaggle competition data so you'll get familiar with loading it and working with it.


In [2]:
X = np.genfromtxt("data/X_train.txt",delimiter=None)
Y = np.genfromtxt("data/Y_train.txt",delimiter=None)
[Xtr,Xva,Ytr,Yva] = ml.splitData(X,Y,0.80)

Xte = np.genfromtxt('data/X_test.txt',delimiter=None)
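# Use only the first 4000 training examples for fitting the trees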
Xt, Yt = Xtr[:4000], Ytr[:4000]

Single Decision Tree

As a reminder, this is how we create a decision tree classifier with the mltools package.

Note that I create this tree as a random tree (by limiting the number of features considered at each split via nFeatures). That's because in bagging it is very common to build a bunch of random trees; a bagged collection of random trees is called a Random Forest!


In [3]:
tree_one = ml.dtree.treeClassify(Xt, Yt, minParent=2**6, maxDepth=25, nFeatures=6)  # The nFeatures makes it random
probs = tree_one.predictSoft(Xte)

print("{0:>15}: {1:.4f}".format('Train AUC', tree_one.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', tree_one.auc(Xva, Yva)))


      Train AUC: 0.8290
 Validation AUC: 0.6012

Random Forest

We'll create a set of 10 random trees, each trained on a bootstrap sample of the data, and combine them into a random forest.


In [4]:
np.random.seed(0)  # Resetting the seed in case you ran other stuff.
n_bags = 10
bags = []   # self.learners
for l in range(n_bags):
    # Each bootstrap sample is the same size as the original data.
    Xi, Yi = ml.bootstrapData(Xt, Yt, Xt.shape[0])

    # Train the model on that draw
    tree = ml.dtree.treeClassify(Xi, Yi, minParent=2**6,maxDepth=25, nFeatures=6)
    bags.append(tree)

Printing the train and validation AUC for each of the bagged classifiers.


In [5]:
for l in range(n_bags):
    print(l)
    print("{0:>15}: {1:.4f}".format('Train AUC', bags[l].auc(Xt, Yt)))
    print("{0:>15}: {1:.4f}".format('Validation AUC', bags[l].auc(Xva, Yva)))


0
      Train AUC: 0.7615
 Validation AUC: 0.5942
1
      Train AUC: 0.7598
 Validation AUC: 0.5888
2
      Train AUC: 0.7509
 Validation AUC: 0.5975
3
      Train AUC: 0.7481
 Validation AUC: 0.6019
4
      Train AUC: 0.7480
 Validation AUC: 0.5997
5
      Train AUC: 0.7444
 Validation AUC: 0.5884
6
      Train AUC: 0.7521
 Validation AUC: 0.5906
7
      Train AUC: 0.7590
 Validation AUC: 0.5882
8
      Train AUC: 0.7523
 Validation AUC: 0.6059
9
      Train AUC: 0.7496
 Validation AUC: 0.5876

Creating a BaggedTree class

One option for computing the AUC of the bagging ensemble is to implement it ourselves. But as programmers, we are lazy (the lazier the better), so instead let's create a BaggedTree class that inherits from the mltools classifier base class.

By implementing the predictSoft method we'll get everything else for free :)


In [6]:
class BaggedTree(ml.base.classifier):
    def __init__(self, learners):
        """Constructs a BaggedTree class with a set of learners. """
        self.learners = learners
    
    def predictSoft(self, X):
        """Predicts the probabilities with each bagged learner and average over the results. """
        n_bags = len(self.learners)
        preds = [self.learners[l].predictSoft(X) for l in range(n_bags)]
        return np.mean(preds, axis=0)

Note that this class doesn't have a train function. We assume the training was already done and that we are given the trained learners. As an exercise, try to write the train function yourself.
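
For reference, here is one way such a train method might look -- a sketch only, reusing the same ml.bootstrapData and ml.dtree.treeClassify calls from the cells above; it is not part of the original class.

# Sketch of a possible train method for BaggedTree (not part of the original class).
def train(self, X, Y, n_bags=10, **tree_args):
    """Trains n_bags trees, each on a bootstrap sample of (X, Y)."""
    self.classes = np.unique(Y)
    self.learners = []
    for _ in range(n_bags):
        Xi, Yi = ml.bootstrapData(X, Y, X.shape[0])
        self.learners.append(ml.dtree.treeClassify(Xi, Yi, **tree_args))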


In [7]:
bt = BaggedTree(bags)
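# Set the class labels by hand, since our simple constructor doesn't infer them from the data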
bt.classes = np.unique(Y)

print("{0:>15}: {1:.4f}".format('Train AUC', bt.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', bt.auc(Xva, Yva)))


      Train AUC: 0.8922
 Validation AUC: 0.6524

Not surprisingly, the validation AUC has improved: averaging the trees reduces the variance of the individual (overfitting) trees :)

Gradient Boosted Trees

Boosting is, in a sense, the opposite of bagging. In bagging we have a set of powerful classifiers that we are afraid will overfit the data, so we average them to reduce variance and improve prediction accuracy.

In boosting it's the other way around: we take a bunch of very simple ("weak") classifiers and build a more complex model by training them in sequence, where each new classifier learns from the previous classifiers' mistakes.

With the decision tree example: in bagging we used fairly deep random trees; in boosting we are going to use one-level trees (a.k.a. stumps).
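
Written out, the boosted predictor is F(x) = f_1(x) + f_2(x) + ... + f_K(x), where f_1 is fit to Y and each later stump f_k is fit to the residuals Y - (f_1(X) + ... + f_{k-1}(X)) left over by the previous stages. This is exactly what the training loop below implements.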

Loading the data

This is a regression problem, so let's use a regression dataset :)


In [8]:
path_to_file = './data/poly_data.txt' 
data = np.genfromtxt(path_to_file, delimiter='\t') # Read data from file

X, Y = np.atleast_2d(data[:, 0]).T, data[:, 1]
X, Y = ml.shuffleData(X, Y)
Xtr, Xte, Ytr, Yte = ml.splitData(X, Y, 0.75)

In [9]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)

plt.show()



In [27]:
boosts = []
n_boosts = 20

Ytr_ = np.copy(Ytr)  # Copy the targets, because each boosting iteration modifies them.

for i in range(n_boosts):
    tree = ml.dtree.treeRegress(Xtr, Ytr_, maxDepth=1)
    boosts.append(tree)
    
    # Now "learning" from out mistakes.
    Ytr_ -= tree.predict(Xtr)

Just for fun, let's see what a single tree does. We can't take an arbitrary tree; we have to take the first one -- make sure you understand why.


In [28]:
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T

ys = boosts[0].predict(xs)

In [29]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)

plt.show()


Now let's predict using all the boosted trees we have.


In [15]:
def predict(X, boosts):
    """Predicts regression values using boosting. """
    preds = [boosts[i].predict(X) for i in range(len(boosts))]
    
    # Notice that in bagging we returned the mean; here we return the sum,
    # since each tree is a correction added on top of the previous ones.
    return np.sum(preds, axis=0)

In [16]:
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T

ys = predict(xs, boosts)

In [17]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
    
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')

ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')

ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)   

# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)

plt.show()


