Title: Boosting a Weak Learner
Slug: weak-learners
Summary: By boosting a decision stump, it is possible to create a strong learner
Date: 2018-02-05 13:32
Category: Machine Learning
Tags: Ensemble
Authors: Thomas Pinder

Load the Data


In [5]:
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset bundled with sklearn
cancer = load_breast_cancer()
X = cancer.data    # (569, 30) feature matrix
y = cancer.target  # binary labels: 0 = malignant, 1 = benign
print(X)


[[  1.79900000e+01   1.03800000e+01   1.22800000e+02 ...,   2.65400000e-01
    4.60100000e-01   1.18900000e-01]
 [  2.05700000e+01   1.77700000e+01   1.32900000e+02 ...,   1.86000000e-01
    2.75000000e-01   8.90200000e-02]
 [  1.96900000e+01   2.12500000e+01   1.30000000e+02 ...,   2.43000000e-01
    3.61300000e-01   8.75800000e-02]
 ..., 
 [  1.66000000e+01   2.80800000e+01   1.08300000e+02 ...,   1.41800000e-01
    2.21800000e-01   7.82000000e-02]
 [  2.06000000e+01   2.93300000e+01   1.40100000e+02 ...,   2.65000000e-01
    4.08700000e-01   1.24000000e-01]
 [  7.76000000e+00   2.45400000e+01   4.79200000e+01 ...,   0.00000000e+00
    2.87100000e-01   7.03900000e-02]]
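
A quick check of the shapes confirms what was loaded: 569 samples, each with 30 features.


In [ ]:
print(X.shape, y.shape)  # (569, 30) (569,)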

Build the Decision Stump


In [14]:
from sklearn.tree import DecisionTreeClassifier

def decision_stump(features, labels):
    # A stump is a depth-one tree: a single split on a single feature
    clf = DecisionTreeClassifier(max_depth=1, random_state=123)
    clf.fit(features, labels)
    # Predict back on the training data so we can find the examples it gets wrong
    predictions = clf.predict(features)
    return predictions
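
The stump's single split can be inspected by fitting the classifier directly (a small aside, not part of the function above; the fitted tree's tree_.feature array holds the feature index used at each node, with the root at position 0):


In [ ]:
clf = DecisionTreeClassifier(max_depth=1, random_state=123)
clf.fit(X, y)
# Map the root node's feature index back to a named column
print(cancer.feature_names[clf.tree_.feature[0]])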

Get an Accuracy Result

This can be done with the accuracy_score() function in sklearn.metrics; however, it's written out the long way here using NumPy for completeness.


In [41]:
import numpy as np 

def get_accuracy(predictions, ground_truth):
    # Boolean array: True where the prediction matches the label
    equality = (predictions == ground_truth)
    # The mean of a boolean array is the proportion of True values
    accuracy = np.mean(equality)
    return accuracy * 100

Demonstrate for a Single Iteration


In [42]:
prediction = decision_stump(X, y)
accuracy = get_accuracy(prediction, y)
print(accuracy)


92.2671353251
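
For reference, the accuracy_score() one-liner mentioned earlier gives the same figure (it returns a proportion, hence the scaling):


In [ ]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y, prediction) * 100)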

Extract Incorrect Classifications


In [55]:
def mistakes(predictions, ground_truth):
    equality = (predictions == ground_truth)
    # Indices the stump got wrong (hard) and those it got right (easy)
    misclass = np.where(~equality)
    correct = np.where(equality)
    return misclass, correct

def easy_hard_split(features, labels, predictions):
    hard_index, easy_index = mistakes(predictions, labels)
    hard_X, hard_y = features[hard_index], labels[hard_index]
    easy_X, easy_y = features[easy_index], labels[easy_index]
    return hard_X, easy_X, hard_y, easy_y

hard_X, easy_X, hard_y, easy_y = easy_hard_split(X, y, prediction)
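
Given the 92.27% accuracy above, 44 of the 569 samples should land in the hard split, leaving 525 easy ones:


In [ ]:
# 569 * (1 - 0.9226713...) = 44 misclassified examples
print(hard_X.shape[0], easy_X.shape[0])  # 44 and 525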

Apply Boosting


In [ ]:
iterations = 1000
boost_X, boost_y = X, y
for i in range(iterations):
    predictions = decision_stump(boost_X, boost_y)
    accuracy = get_accuracy(predictions, boost_y)
    # Refit the next stump only on the examples this one got wrong
    boost_X, _, boost_y, _ = easy_hard_split(boost_X, boost_y, predictions)
    if boost_y.size == 0:  # nothing left to correct
        break
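
The loop above only re-fits on a shrinking pool of hard examples; full AdaBoost instead reweights every sample at each round and combines all the stumps through a weighted vote. For comparison, a minimal sketch using sklearn's built-in AdaBoostClassifier (base_estimator is the argument name in sklearn releases from this period):


In [ ]:
from sklearn.ensemble import AdaBoostClassifier

# 1000 depth-one stumps combined by AdaBoost's weighted majority vote
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=1000, random_state=123)
ada.fit(X, y)
print(get_accuracy(ada.predict(X), y))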