In this discussion notebook we will cover the material from lecture 12 about ensembles and boosting. For consistency with the lecture notes, we will use decision trees; however, any other classifier/regressor could be used just as easily.
In [1]:
# Import all required libraries
from __future__ import division # For python 2.*
import numpy as np
import matplotlib.pyplot as plt
import mltools as ml
np.random.seed(0)
%matplotlib inline
In [2]:
X = np.genfromtxt("data/X_train.txt",delimiter=None)
Y = np.genfromtxt("data/Y_train.txt",delimiter=None)
[Xtr,Xva,Ytr,Yva] = ml.splitData(X,Y,0.80)
Xte = np.genfromtxt('data/X_test.txt',delimiter=None)
Xt, Yt = Xtr[:4000], Ytr[:4000]
In [3]:
tree_one = ml.dtree.treeClassify(Xt, Yt, minParent=2**6, maxDepth=25, nFeatures=6)  # nFeatures makes the tree randomized (random feature subset at each split)
probs = tree_one.predictSoft(Xte)
print("{0:>15}: {1:.4f}".format('Train AUC', tree_one.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', tree_one.auc(Xva, Yva)))
In [4]:
np.random.seed(0)  # Resetting the seed in case you ran other stuff.
n_bags = 10
bags = []  # This will play the role of self.learners in the class below.
for l in range(n_bags):
    # Each bootstrap sample is the same size as the original data.
    Xi, Yi = ml.bootstrapData(Xt, Yt, Xt.shape[0])
    # Train a randomized tree on that draw
    tree = ml.dtree.treeClassify(Xi, Yi, minParent=2**6, maxDepth=25, nFeatures=6)
    bags.append(tree)
Printing the train and validation AUC for each of the bagged classifiers.
In [5]:
for l in range(n_bags):
    print(l)
    print("{0:>15}: {1:.4f}".format('Train AUC', bags[l].auc(Xt, Yt)))
    print("{0:>15}: {1:.4f}".format('Validation AUC', bags[l].auc(Xva, Yva)))
One option for computing the AUC of the bagging ensemble is to implement it ourselves. But as programmers, we are lazy (the lazier the better). So instead let's just create a BaggedTree class that inherits from the classifier base class.
By implementing the predictSoft method we'll get everything else for free :)
In [6]:
class BaggedTree(ml.base.classifier):
    def __init__(self, learners):
        """Constructs a BaggedTree class from a set of trained learners. """
        self.learners = learners

    def predictSoft(self, X):
        """Predicts probabilities with each bagged learner and averages the results. """
        n_bags = len(self.learners)
        preds = [self.learners[l].predictSoft(X) for l in range(n_bags)]
        return np.mean(preds, axis=0)
Note that this class doesn't have a train function. We assume the training was already done and we are simply handed the learners. As an exercise, try to write the train function yourself.
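For reference, here is one rough sketch of what such a train function might look like (my own version, not from the lecture; the class name BaggedTreeWithTrain is made up, and it simply reuses the ml.bootstrapData and ml.dtree.treeClassify calls from cell [4]):
In [ ]:
class BaggedTreeWithTrain(ml.base.classifier):
    def __init__(self, learners=None):
        """Optionally wraps an already-trained set of learners. """
        self.learners = learners if learners is not None else []

    def train(self, X, Y, n_bags=10, **tree_args):
        """Trains n_bags trees, each on a bootstrap sample of (X, Y). """
        self.classes = np.unique(Y)
        self.learners = []
        for l in range(n_bags):
            # Draw a bootstrap sample the same size as the original data
            Xi, Yi = ml.bootstrapData(X, Y, X.shape[0])
            self.learners.append(ml.dtree.treeClassify(Xi, Yi, **tree_args))

    def predictSoft(self, X):
        """Predicts probabilities with each learner and averages the results. """
        preds = [learner.predictSoft(X) for learner in self.learners]
        return np.mean(preds, axis=0)
Usage would then be something like bt2 = BaggedTreeWithTrain(); bt2.train(Xt, Yt, n_bags=10, minParent=2**6, maxDepth=25, nFeatures=6).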
In [7]:
bt = BaggedTree(bags)
bt.classes = np.unique(Y)
print("{0:>15}: {1:.4f}".format('Train AUC', bt.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', bt.auc(Xva, Yva)))
Not surprisingly, the validation AUC has improved :)
Boosting is, in a sense, the opposite of bagging. In bagging we have a set of really smart classifiers that we are afraid will overfit the data, so we take their average to increase the prediction accuracy.
In boosting it's the other way around: we take a bunch of really "stupid" classifiers and make the combined model more expressive by learning in a sequence, where each new classifier learns from the previous classifiers' mistakes.
With the decision tree example: in bagging we took really smart random trees; in boosting we are going to take one-level trees (a.k.a. stumps).
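Concretely, in the regression example below the boosted model is just the sum of the individual stumps,

$$\hat{y}(x) = \sum_{i=1}^{T} f_i(x),$$

where each new stump $f_i$ is fit to the residuals $y - \sum_{j<i} f_j(x)$ of the model built so far. (This is a simple gradient-boosting-style scheme for squared error; boosting a classifier, e.g. with AdaBoost, reweights the data instead, but we won't do that here.)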
In [8]:
path_to_file = './data/poly_data.txt'
data = np.genfromtxt(path_to_file, delimiter='\t') # Read data from file
X, Y = np.atleast_2d(data[:, 0]).T, data[:, 1]
X, Y = ml.shuffleData(X, Y)
Xtr, Xte, Ytr, Yte = ml.splitData(X, Y, 0.75)
In [9]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)
plt.show()
In [27]:
boosts = []
n_boosts = 20
Ytr_ = np.copy(Ytr)  # Copy the targets, because each boosting iteration is going to modify them.
for i in range(n_boosts):
    tree = ml.dtree.treeRegress(Xtr, Ytr_, maxDepth=1)
    boosts.append(tree)
    # Now "learning" from our mistakes: keep only the residuals for the next stump.
    Ytr_ -= tree.predict(Xtr)
Just for the fun of it, let's see what a single tree does. We can't take just any tree from the list; we have to take the first one -- make sure you understand why.
In [28]:
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T
ys = boosts[0].predict(xs)
In [29]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)
plt.show()
Now let's predict using all of the boosted learners.
In [15]:
def predict(X, boosts):
    """Predicts regression values using the boosted ensemble. """
    preds = [boosts[i].predict(X) for i in range(len(boosts))]
    # Notice that in bagging we returned the mean; here we return the sum.
    return np.sum(preds, axis=0)
In [16]:
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T
ys = predict(xs, boosts)
In [17]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)
plt.show()
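As a quick sanity check (my own addition, not part of the original discussion), we could also compare the mean squared error of the single stump against the full boosted ensemble; this assumes the boosts list, the predict helper, and the train/test split from the cells above:
In [ ]:
# Quick MSE comparison: single stump vs. the full boosted ensemble (sketch).
def mse(y, yhat):
    """Mean squared error, flattening both arrays to be safe about shapes. """
    return np.mean((np.ravel(y) - np.ravel(yhat)) ** 2)

print("{0:>20}: train {1:.4f}, test {2:.4f}".format(
    'Single stump MSE', mse(Ytr, boosts[0].predict(Xtr)), mse(Yte, boosts[0].predict(Xte))))
print("{0:>20}: train {1:.4f}, test {2:.4f}".format(
    'Boosted MSE', mse(Ytr, predict(Xtr, boosts)), mse(Yte, predict(Xte, boosts))))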