Ensembles

As you have undoubtedly noticed, different classifiers tend to give different results on different problems and the same applies for the same classifier on different subsets or even feature subsets of the same data. After working with Random Forests, you may have begun to suspect that there ought to be a way to combine or average the results of different classifiers to produce an even better classification algorithm. This topic is generally referred to as ensemble methods which branches out into many different techniques, several of which we will explore in this exercise.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier,\
                            AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

In [3]:
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

Voting

One such ensemble method is voting. The idea here is to have each classifier in the ensemble vote on the classification of a new data point, and then tally up the votes, or average the probabilities, and classify the point accordingly, the former being called "hard" and latter "soft" voting. An algorithm like this wouldn't be too hard to write ourselves, but sklearn provides a very nice framework for us.

1 - Create a set of 100 datapoints which are linearly separable, classifying the data as being above or below the curve. Choose three classification algorithms, construct hard and soft ensemble voters, and fit all 5 models to your data, plotting the decision boundary for each (if possible). Then, using cross-validation, report the training and testing accuracies for all but the ensembles. Comment on your results.


In [23]:
np.random.seed(0)
# random points between -5 and 5
x1 = 10*np.random.rand(100) - 5
x2 = 10*np.random.rand(100) - 5
X = np.vstack([x1, x2]).T
# line y = -x
y = np.array(x2 < -x1, dtype=int)

In [24]:
# plot the dataset
plt.scatter(x1[y==0], x2[y==0])
plt.scatter(x1[y==1], x2[y==1]);



In [27]:
lr = LogisticRegression()
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier()
hard = VotingClassifier(estimators=[('lr', lr), ('knn', knn), ('tree', tree)], voting='hard')
soft = VotingClassifier(estimators=[('lr', lr), ('knn', knn), ('tree', tree)], voting='soft')

In [34]:
# loop through each model
for model, name in zip([lr, knn, tree, hard, soft], ['Logistic Regression', 'KNN', 'Decision Tree', 'Hard Voting', 'Soft Voting']):
    # fit
    model.fit(X, y)
    # cv scores
    cv_scores = cross_val_score(model, X, y)
    # plotting decision boundary + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('{}, cv-scores mean={:.4f}, , cv-scores std={:.4f}'.format(name, cv_scores.mean(), cv_scores.std()))
    plot_decision_boundary(model, X, y, axes=[-5, 5, -5, 5])


The data is linearly separable so logistic regression works extremely well. In this case the ensemble isn't as good because the other two estimators arent' that good and, I assume, because the problem is too simple for this model.

2 - Create a set of 100 random data points that lie on a unit circle displaced by some noise, categorizing each as being inside or outside the circle. Stated analytically, select $x$ and $y$ such that $x^2 + y^2 = 1 + \epsilon$ where $\epsilon$ is a noise term. Choose three classification algorithms, construct hard and soft ensemble voters, and fit all 5 models to your data, plotting the decision boundary for each (if possible). Then, using cross-validation, report the training and testing accuracies for all but the ensembles. Comment on your results.


In [63]:
# random numbers between 0 and 2*pi
np.random.seed(0)
x = 2*np.pi*np.random.rand(100)
noise = np.random.rand(100) - 0.5
X = np.vstack([np.cos(x)*(1+noise), np.sin(x)*(1+noise)]).T
y = np.array(noise > 0, dtype=int)

In [64]:
# plot the dataset
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1]);



In [73]:
svm = SVC(kernel='rbf', probability=True)
knn = KNeighborsClassifier()
forest = RandomForestClassifier()
hard = VotingClassifier(estimators=[('svm', svm), ('knn', knn), ('forest', forest)], voting='hard')
soft = VotingClassifier(estimators=[('svm', svm), ('knn', knn), ('forest', forest)], voting='soft')

In [74]:
# loop through each model
for model, name in zip([svm, knn, forest, hard, soft], ['SVM', 'KNN', 'Random Forest', 'Hard Voting', 'Soft Voting']):
    # fit
    model.fit(X, y)
    # cv scores
    cv_scores = cross_val_score(model, X, y)
    # plotting decision boundary + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('{}, cv-scores mean={:.4f}, , cv-scores std={:.4f}'.format(name, cv_scores.mean(), cv_scores.std()))
    plot_decision_boundary(model, X, y, axes=[-2, 2, -2, 2])


As before we have SVM which is performing quite well so the ensembles aren't outperforming the best single model.

3 - Head over to the Machine Learning Repository, download the Semeion Handwritten Digit Data Set, put it into a dataframe, and split into training and testing sets. Then, using GridSearchCV and all the other tools you have at your disposal, fit the best possible model you can, making use of ensemble voting. Report and comment on your results.


In [76]:
digits = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data', header=None, sep=' ')
digits.drop(266, axis=1, inplace=True)

In [97]:
# constructing a class label by multiplying the flag column by the digits+1 (to avoid multiplication by 0)
i = 1
class_labels = digits.loc[:, 256]*i
for n in range(257, 266):
    i += 1
    class_labels += digits.loc[:, n]*i
# subtracting one to get the right digit
class_labels = class_labels - 1
# adding the column and dropping all flags
digits['class_label'] = class_labels
for n in range(256, 266):
    digits.drop([n], axis=1, inplace=True)

In [107]:
digits.head()


Out[107]:
0 1 2 3 4 5 6 7 8 9 ... 247 248 249 250 251 252 253 254 255 class_label
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
3 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0

5 rows × 257 columns


In [109]:
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.drop('class_label', axis=1), digits.class_label,
                                                stratify=digits.class_label,
                                                test_size=0.2,
                                                random_state=0)

In solution he uses PCA and KNN, forest and SVM; he tunes every model (including PCA) as I did and then applies voting (he tunes again the voting model - excluding PCA - and he does both soft and hard voting).


In [119]:
knn = KNeighborsClassifier() # n_neighbors
svm = SVC(kernel='rbf') # C
naive = MultinomialNB() # alpha
forest = RandomForestClassifier() # n_estimators, max_features
ensemble = VotingClassifier(estimators=[('knn', knn), ('svm', svm), ('naive', naive), ('forest', forest)], voting='hard')

In [164]:
reg_values = [0.003, 0.001, 0.03, 0.01, 0.3, 0.1, 1, 3, 10, 30, 100, 300]
neigh_values = np.arange(2, 11)
trees_values = np.arange(10, 60, 10)
feat_values = np.arange(0.2, 1.2, 0.2)

I'm going to tune each model by itself and then do the ensemble learning with the tuned models:


In [150]:
param_grid = [{'n_neighbors':neigh_values}]

gs = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, n_jobs=2)
gs.fit(Xtrain, ytrain)
print(gs.best_params_)
print(gs.best_score_)


{'n_neighbors': 5}
0.894819466248

In [148]:
param_grid = [{'max_features':feat_values, 'n_estimators':trees_values}]

gs = GridSearchCV(estimator=forest, param_grid=param_grid, cv=5, n_jobs=2)
gs.fit(Xtrain, ytrain)
print(gs.best_params_)
print(gs.best_score_)


{'max_features': 0.20000000000000001, 'n_estimators': 40}
0.912087912088

In [151]:
param_grid = [{'alpha':reg_values}]

gs = GridSearchCV(estimator=naive, param_grid=param_grid, cv=5, n_jobs=2)
gs.fit(Xtrain, ytrain)
print(gs.best_params_)
print(gs.best_score_)


{'alpha': 0.003}
0.835164835165

In [152]:
param_grid = [{'C':reg_values}]

gs = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, n_jobs=2)
gs.fit(Xtrain, ytrain)
print(gs.best_params_)
print(gs.best_score_)


{'C': 30}
0.938775510204

Here I set the optimal hyperparameters, fit and score the ensemble:


In [153]:
ensemble.set_params(knn__n_neighbors=5, forest__max_features=0.2, forest__n_estimators=40, naive__alpha=0.003, svm__C=30)


Out[153]:
VotingClassifier(estimators=[('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')), ('svm', SVC(C=30, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [154]:
ensemble.fit(Xtrain, ytrain)


Out[154]:
VotingClassifier(estimators=[('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')), ('svm', SVC(C=30, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [156]:
cv_scores = cross_val_score(ensemble, Xtrain, ytrain)
np.mean(cv_scores), np.std(cv_scores)


Out[156]:
(0.92696250635204702, 0.020545611848306139)

In [157]:
ensemble.score(Xtest, ytest)


Out[157]:
0.94670846394984332

I get a test accuracy of 94.7%, which is quite good and better than the best scores obtained tuning each model.

Now I'm just plotting the confusion matrix to see how the ensemble predicts each digit:


In [163]:
confmat = confusion_matrix(ytest, ensemble.predict(Xtest))

fig, ax = plt.subplots(figsize=(10, 10))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')

plt.xlabel('predicted label')
plt.ylabel('true label');


The 9s are the most difficult digits for this learner.

Bagging and Boosting

As you saw above, a key issue to keep in mind when selecting the best model for you data is the bias/variance trade-off. Here we'll take a look at a few other ensemble methods which relate to this issue, namely, bagging and boosting methods. Broadly speaking, these methods work by taking a collection of identically prepared models, fitting them to the data, and then aggregating the results in some way to produce a final prediction, having better use for "strong" models, as in the case of bagging, or on "weak" models as in the case of boosting.

1 - Generate a set of 1000 random datapoints separated by a decaying sine curve defined by $$f(x) = e^{-\beta x}sin(\nu x)$$ on the range $-2\pi < x < 2\pi$, where $\beta = \frac{1}{2.5}$ and $\nu = 10$, classify each as being above or below the curve, and split into training and testing sets.


In [4]:
np.random.seed(0)
# random points in the quadrant
x1 = 4*np.pi*np.random.rand(1000) - 2*np.pi
x2 = 4*np.pi*np.random.rand(1000) - 2*np.pi
X = np.vstack([x1, x2]).T
# classified by the damped oscillator
y = np.array(x2 < np.exp(-1/2.5*x1) * np.sin(10*x1), dtype=int)

In the solution he uses the formula for x2 adding some noise


In [5]:
# plot the dataset
plt.scatter(x1[y==0], x2[y==0])
plt.scatter(x1[y==1], x2[y==1]);



In [6]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

2 - Fit two Tree Classifiers, one with default parameters, and another with max_depth = 5, and one other model of your choice to the data, report the training and testing accuracies, and plot the decision boundary. Also perform 10 fold cross validation and report the mean and standard deviation. Comment on your results.


In [8]:
tree1 = DecisionTreeClassifier()
tree2 = DecisionTreeClassifier(max_depth=5)
svm = SVC(kernel='rbf')

In [172]:
# loop through each model
for model, name in zip([tree1, tree2, svm], ['Tree', 'Tree max_depth=5', 'SVM']):
    # fit
    model.fit(Xtrain, ytrain)
    train_score = model.score(Xtrain, ytrain)
    test_score = model.score(Xtest, ytest)
    # cv scores
    cv_scores = cross_val_score(model, Xtrain, ytrain, cv=10)
    # plotting decision boundary + train-test scores + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('{}\nTrain score={:.4f}, Test score={:.4f}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'.format(name,
                                                                                                              train_score,
                                                                                                              test_score,
                                                                                                              cv_scores.mean(),
                                                                                                              cv_scores.std()))
    plot_decision_boundary(model, Xtest, ytest, axes=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi])


The first tree is clearly overfitting but it gets good accuracy nonetheless, the decision boundary is really complex and far from the true function.

The second tree isn't complex enough in contrast and it has little overfitting; the SVM is very far away from the real function although its accuracy is still very good.

3 - Using each of the models you fit in part (2), perform bagging with bootstrap = True and with bootstrap = False, report the training and testing accuracies, and plot the decision boundary. Also perform 10 fold cross validation and report the mean and standard deviation. Comment on your results.


In [175]:
bag1 = BaggingClassifier(n_estimators=50, max_samples=0.5, bootstrap=True)
bag2 = BaggingClassifier(n_estimators=50, max_samples=0.5, bootstrap=False)

In [176]:
# loop through each model
for model, name in zip([tree1, tree2, svm], ['Tree', 'Tree max_depth=5', 'SVM']):
    # set bagging estimator
    bag1.set_params(base_estimator=model)
    bag2.set_params(base_estimator=model)
    
    # bagging with bootstrap
    # fit
    bag1.fit(Xtrain, ytrain)
    train_score = bag1.score(Xtrain, ytrain)
    test_score = bag1.score(Xtest, ytest)
    # cv scores
    cv_scores = cross_val_score(bag1, Xtrain, ytrain, cv=10)
    # plotting decision boundary + train-test scores + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('Bagged {} with Bootstrap\nTrain score={:.4f}, Test score={:.4f}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'\
                  .format(name,
                          train_score,
                          test_score,
                          cv_scores.mean(),
                          cv_scores.std()))
    plot_decision_boundary(bag1, Xtest, ytest, axes=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi])
    
    # bagging without bootstrap
    # fit
    bag2.fit(Xtrain, ytrain)
    train_score = bag2.score(Xtrain, ytrain)
    test_score = bag2.score(Xtest, ytest)
    # cv scores
    cv_scores = cross_val_score(bag2, Xtrain, ytrain, cv=10)
    # plotting decision boundary + train-test scores + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('Bagged {} without Bootstrap\nTrain score={:.4f}, Test score={:.4f}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'\
                  .format(name,
                          train_score,
                          test_score,
                          cv_scores.mean(),
                          cv_scores.std()))
    plot_decision_boundary(bag2, Xtest, ytest, axes=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi])


With SVM I don't see much difference, with the trees the decision boundaries are much smoother and for the full tree is very precise (though the test precision is the same we have less overfitting).

The best improvement though is for the second tree which is 3.5% better on the test set.

Also, in general the bootstrap method gives better results than the pasting method.

4 - Repeat part (3), but this time using AdaBoost.


In [181]:
ada = AdaBoostClassifier(n_estimators=50, algorithm='SAMME.R')
svm.set_params(probability=True);

In [182]:
# loop through each model
for model, name in zip([tree1, tree2, svm], ['Tree', 'Tree max_depth=5', 'SVM']):
    # set AdaBoost estimator
    ada.set_params(base_estimator=model)
    # fit
    ada.fit(Xtrain, ytrain)
    train_score = ada.score(Xtrain, ytrain)
    test_score = ada.score(Xtest, ytest)
    # cv scores
    cv_scores = cross_val_score(ada, Xtrain, ytrain, cv=10)
    # plotting decision boundary + train-test scores + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('AdaBoosted {}\nTrain score={:.4f}, Test score={:.4f}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'\
                  .format(name,
                          train_score,
                          test_score,
                          cv_scores.mean(),
                          cv_scores.std()))
    plot_decision_boundary(ada, Xtest, ytest, axes=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi])


The first tree seems identical to the non-boosted one, while the second has become almost identical to the first and is slightly better. Still, nothing to do for SVM...

5 - Perform a GridSearchCV for the learning_rate and n_estimators parameters for Gradient Boosting, the fit the best model to the data, report CV variance and accuracy as before, report training and testing accuracies, and plot the decision boundary. Comment on your results.


In [43]:
gbc = GradientBoostingClassifier()

In [48]:
learn_range = np.arange(0.1, 1.1, 0.1)
est_range = [10, 30, 100, 300, 1000]
param_grid = [{'learning_rate':learn_range, 'n_estimators':est_range}]

gs = GridSearchCV(estimator=gbc, param_grid=param_grid, cv=5)
gs.fit(Xtrain, ytrain)


Out[48]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.5, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=50,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'learning_rate': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'n_estimators': [10, 30, 100, 300, 1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [49]:
gs.best_params_, gs.best_score_


Out[49]:
({'learning_rate': 0.59999999999999998, 'n_estimators': 300},
 0.96250000000000002)

In [51]:
gbc.set_params(learning_rate=0.6, n_estimators=300)
gbc.fit(Xtrain, ytrain)

train_score = gbc.score(Xtrain, ytrain)
test_score = gbc.score(Xtest, ytest)
# cv scores
cv_scores = cross_val_score(gbc, Xtrain, ytrain, cv=10)
# plotting decision boundary + train-test scores + cv scores
plt.figure(figsize=(10, 6))
plt.title('Gradient Boost\nTrain score={:.4f}, Test score={:.4f}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'\
              .format(train_score,
                      test_score,
                      cv_scores.mean(),
                      cv_scores.std()))
plot_decision_boundary(gbc, Xtest, ytest, axes=[-2*np.pi, 2*np.pi, -2*np.pi, 2*np.pi])


The accuracy is very good, similar to the bagged full decision tree.

6 - Head over to the Machine Learning Repository, download the Breast Cancer Wisconsin (Diagnostic) Data Set, put it into a dataframe, and split into training and testing sets.


In [52]:
breast = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                     header=None)

breast.columns = ['id','clump_thickness','uniformity_cell_size','uniformity_cell_shape','marginal_adhesion',
                'single_epithelial_cell_size','bare_nuclei','bland_chromatin','normal_nucleoli','mitoses','class']

In [53]:
breast.bare_nuclei = breast.bare_nuclei.apply(lambda x: x.replace('?', '1'))
breast.bare_nuclei = pd.to_numeric(breast.bare_nuclei)

In [55]:
# 2 for benign, 4 for malignant
le = LabelEncoder()
le.fit([2, 4])
breast['class'] = le.transform(breast['class'])

In [56]:
Xtrain, Xtest, ytrain, ytest = train_test_split(breast.drop('class', axis=1), breast['class'], test_size=0.2, random_state=0)

7 - Fit full and shallow decision tree models to the data using either bagging or boosting for each. Then fit a random forest model. Report cross-validation accuracies and variance, as well as training and testing arruracies for each model. Comment on your results.


In [59]:
tree_full = DecisionTreeClassifier()
tree_shallow = DecisionTreeClassifier(max_depth=1)

In [61]:
bag_full = BaggingClassifier(tree_full, n_estimators=500)
boost_shallow = AdaBoostClassifier(tree_full, n_estimators=500)
forest = RandomForestClassifier(n_estimators=500)

In [62]:
for model, name in zip([bag_full, boost_shallow, forest], ['bagging full tree', 'boosting shallow trees', 'forest']):
    model.fit(Xtrain, ytrain)
    cv_scores = cross_val_score(model, Xtrain, ytrain)
    train_score = model.score(Xtrain, ytrain)
    test_score = model.score(Xtest, ytest)
    print('{}\ncv mean={}, cv std={}\nTrain accuracy={}, Test accuracy={}'.format(name,
                                                                                  np.mean(cv_scores),
                                                                                  np.std(cv_scores), 
                                                                                  train_score, 
                                                                                  test_score))


bagging full tree
cv mean=0.9553025511279781, cv std=0.013298619389161388
Train accuracy=1.0, Test accuracy=0.9642857142857143
boosting shallow trees
cv mean=0.9409752170662986, cv std=0.007523298585622078
Train accuracy=1.0, Test accuracy=0.9357142857142857
forest
cv mean=0.9535200199336821, cv std=0.015282947065455374
Train accuracy=1.0, Test accuracy=0.9785714285714285

The best results are for bagged full regression trees and random forest, while the boosted trees aren't quite as good. All models have 100% training accuracy but different generalizations.

8 - Fit a Gradient Boost model to the data, tuning the parameters. Report the training and testing accuracies and the cross-validation mean and standard deviation of the best model. Comment on your results.


In [7]:
gbc = GradientBoostingClassifier()

In [8]:
learn_range = np.arange(0.1, 1.1, 0.1)
est_range = [10, 30, 100, 300, 1000]
param_grid = [{'learning_rate':learn_range, 'n_estimators':est_range}]

gs = GridSearchCV(estimator=gbc, param_grid=param_grid, cv=5)
gs.fit(Xtrain, ytrain)


Out[8]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'learning_rate': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'n_estimators': [10, 30, 100, 300, 1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [9]:
gs.best_params_, gs.best_score_


Out[9]:
({'learning_rate': 0.59999999999999998, 'n_estimators': 300},
 0.96250000000000002)

In [10]:
gbc.set_params(learning_rate=0.1, n_estimators=1000)
gbc.fit(Xtrain, ytrain)

train_score = gbc.score(Xtrain, ytrain)
test_score = gbc.score(Xtest, ytest)
# cv scores
cv_scores = cross_val_score(gbc, Xtrain, ytrain, cv=10)
print('Gradient Boost\nTrain score={:.4f}, Test score={:.4f}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'\
              .format(train_score,
                      test_score,
                      cv_scores.mean(),
                      cv_scores.std()))


Gradient Boost
Train score=1.0000, Test score=0.9650
cv-scores mean=0.9563, cv-scores std=0.0178

We get better accuracy than the previous models.

9 - Repeat part (7) using another classification model of your choice, performing either bagging or boosting. Comment on your results.


In [11]:
lr = LogisticRegression()
boost_lr = AdaBoostClassifier(lr)

In [18]:
param_grid = [{
    'learning_rate':np.arange(0.1, 1.1, 0.1),
    # 'n_estimators':np.arange(50, 550, 50),
    'base_estimator__C':[0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
}]

gs = GridSearchCV(boost_lr, param_grid=param_grid, cv=10, n_jobs=2)
gs.fit(Xtrain, ytrain)


Out[18]:
GridSearchCV(cv=10, error_score='raise',
       estimator=AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'learning_rate': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'base_estimator__C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [19]:
gs.best_params_, gs.best_score_


Out[19]:
({'base_estimator__C': 0.3, 'learning_rate': 1.0}, 0.90000000000000002)

In [21]:
lr.fit(Xtrain, ytrain)
cv_scores = cross_val_score(lr, Xtrain, ytrain)
train_score = lr.score(Xtrain, ytrain)
test_score = lr.score(Xtest, ytest)
print('cv mean={}, cv std={}\nTrain accuracy={}, Test accuracy={}'.format(np.mean(cv_scores),
                                                                          np.std(cv_scores), 
                                                                          train_score, 
                                                                          test_score))


cv mean=0.8974871636018885, cv std=0.009493883439372682
Train accuracy=0.8925, Test accuracy=0.885

In [22]:
boost_lr.set_params(base_estimator__C=0.3, learning_rate=1.0)
boost_lr.fit(Xtrain, ytrain)
cv_scores = cross_val_score(boost_lr, Xtrain, ytrain)
train_score = boost_lr.score(Xtrain, ytrain)
test_score = boost_lr.score(Xtest, ytest)
print('cv mean={}, cv std={}\nTrain accuracy={}, Test accuracy={}'.format(np.mean(cv_scores),
                                                                          np.std(cv_scores), 
                                                                          train_score, 
                                                                          test_score))


cv mean=0.9024996949302094, cv std=0.006118513472520807
Train accuracy=0.8975, Test accuracy=0.885

There is no improvement and the results are way worse than before, let's try bagging:


In [23]:
bagg_lr = BaggingClassifier(lr)

In [26]:
param_grid = [{
    'max_samples':np.arange(0.1, 1.1, 0.1),
    # 'n_estimators':np.arange(50, 550, 50),
    'base_estimator__C':[0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
}]

gs = GridSearchCV(bagg_lr, param_grid=param_grid, cv=10, n_jobs=2)
gs.fit(Xtrain, ytrain)


Out[26]:
GridSearchCV(cv=10, error_score='raise',
       estimator=BaggingClassifier(base_estimator=LogisticRegression(C=0.3, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'max_samples': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]), 'base_estimator__C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [27]:
gs.best_params_, gs.best_score_


Out[27]:
({'base_estimator__C': 0.01, 'max_samples': 0.70000000000000007},
 0.89875000000000005)

In [29]:
bagg_lr.set_params(base_estimator__C=0.01, max_samples=0.7)
bagg_lr.fit(Xtrain, ytrain)
cv_scores = cross_val_score(bagg_lr, Xtrain, ytrain)
train_score = bagg_lr.score(Xtrain, ytrain)
test_score = bagg_lr.score(Xtest, ytest)
print('cv mean={}, cv std={}\nTrain accuracy={}, Test accuracy={}'.format(np.mean(cv_scores),
                                                                          np.std(cv_scores), 
                                                                          train_score, 
                                                                          test_score))


cv mean=0.8949902847005152, cv std=0.006289313195949913
Train accuracy=0.90125, Test accuracy=0.89

A bit better, but still worse than the trees.

[Optional] - Advanced Random Forests

The final topics we'll look at here incorporate many of the same ideas that you encountered when working with Random Forests, namely Extremely Random Trees and Totally Random Trees, both of which have very different uses. Extremely Random Trees serves as an extension to Random Forests, adding an additional level of randomness to tree construction whereas Totally Random Trees is a data transformation which may improve classification when used in tandem with another classifier.

1 - Generate a set of 100 random data points, half from the unit circle, and half from a circle of radius 1.5. Stated analytically choose $x$ and $y$ such that $x^2 + y^2 = R^2 + \epsilon$ where $\epsilon$ is a noise term and $R$ is the circle radius. Then fit a Random Forest, Extra Trees Classifier, and one other model of your choice to the data. Report the training accuracies and cross-validation means and variances, plot the decision boundaries, and comment on your results.


In [4]:
from sklearn.ensemble import ExtraTreesClassifier, RandomTreesEmbedding
from sklearn.pipeline import Pipeline

In the solution he adds noise to each circle! (+0.25 * np.random.randn(n))


In [5]:
# random numbers between 0 and 2*pi
np.random.seed(0)
x = 2*np.pi*np.random.rand(100)
# half on a circle, half on the other
half = np.array([0]*50 + [1]*50)
np.random.shuffle(half)
X = np.vstack([np.cos(x)*(1+half*0.5), np.sin(x)*(1+half*0.5)]).T
y = half

In [6]:
# plot the dataset
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1]);



In [7]:
extra = ExtraTreesClassifier()
forest = RandomForestClassifier()
svm = SVC(kernel='rbf', probability=True)

In [8]:
# loop through each model
for model, name in zip([extra, forest, svm], ['Extra Trees', 'Random Forest', 'SVM']):
    # fit
    model.fit(X, y)
    train_score = model.score(X, y)
    # cv scores
    cv_scores = cross_val_score(model, X, y)
    # plotting decision boundary + cv scores
    plt.figure(figsize=(10, 6))
    plt.title('{}, train score={}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'.format(name, train_score, cv_scores.mean(), cv_scores.std()))
    plot_decision_boundary(model, X, y, axes=[-2, 2, -2, 2])


The best boundary is from the SVM, the extra trees fits the data better than the random forest (the boundary is smoother).

2 - Repeat part (1), but this time transforming the data using a Totally Random Trees hash prior to fitting.


In [10]:
rte = RandomTreesEmbedding()
rte.fit(X, y)
X_trans = rte.transform(X)

In [12]:
# loop through each model
for model, name in zip([extra, forest, svm], ['Extra Trees', 'Random Forest', 'SVM']):
    # fit
    model.fit(X_trans, y)
    train_score = model.score(X_trans, y)
    # cv scores
    cv_scores = cross_val_score(model, X_trans, y)
    print('{}, train score={}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'.format(name, train_score, cv_scores.mean(), cv_scores.std()))


Extra Trees, train score=1.0
cv-scores mean=0.8885, cv-scores std=0.0551
Random Forest, train score=1.0
cv-scores mean=0.8591, cv-scores std=0.0582
SVM, train score=0.9
cv-scores mean=0.7194, cv-scores std=0.0531

Extra trees is slightly better, the other two model are worse than before, especially SVM!

3 - Head over to the Machine Learning Repository, download the Madelon Data Set (train data and labels), put it into a dataframe, and split into training and testing sets. Then, using Totally Random Trees and all the other tools you have at your disposal, fit a Random Forest, an Extremely Random Forest, and one other algorithm of your choice to the data. Report the cross-validation mean and variance as well as the training and testing accuracies. Comment on your results.


In [13]:
madelon = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data',
                      sep=' ',
                      header=None)
madelon.drop(500, axis=1, inplace=True)
madelon_labels = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels',
                             header=None)
Xtrain, Xtest, ytrain, ytest = train_test_split(madelon, madelon_labels, test_size=0.2, random_state=0)

In [14]:
madelon.describe().T


Out[14]:
count mean std min 25% 50% 75% max
0 2000.0 481.7225 6.421769 462.0 477.00 482.0 486.00 503.0
1 2000.0 483.4525 30.186294 381.0 464.00 483.0 503.00 600.0
2 2000.0 510.1660 38.899165 370.0 485.00 510.5 536.00 654.0
3 2000.0 483.3845 9.059895 453.0 477.00 483.0 490.00 519.0
4 2000.0 501.6125 41.389418 371.0 475.00 500.0 528.00 688.0
5 2000.0 479.2590 6.795956 459.0 475.00 479.0 484.00 505.0
6 2000.0 480.1095 40.575925 334.0 452.75 480.0 506.25 611.0
7 2000.0 476.5650 1.384461 471.0 476.00 477.0 477.00 481.0
8 2000.0 486.7935 15.043836 430.0 477.00 487.0 496.25 536.0
9 2000.0 478.7890 7.190920 455.0 474.00 479.0 484.00 503.0
10 2000.0 486.5175 36.556831 354.0 462.00 486.0 511.00 620.0
11 2000.0 490.6535 28.198404 389.0 472.00 491.0 508.25 603.0
12 2000.0 478.1645 34.169434 347.0 455.00 478.0 502.00 593.0
13 2000.0 481.5245 42.617385 352.0 454.00 482.0 510.00 632.0
14 2000.0 485.0385 12.135224 444.0 477.00 485.0 493.00 523.0
15 2000.0 479.4235 21.984604 410.0 464.00 479.0 494.00 551.0
16 2000.0 479.2870 13.483602 433.0 470.00 479.0 489.00 523.0
17 2000.0 494.9475 30.339495 377.0 475.00 495.0 515.00 601.0
18 2000.0 484.1730 23.369449 408.0 468.00 484.0 500.00 566.0
19 2000.0 477.4270 9.957170 441.0 470.00 477.5 484.00 513.0
20 2000.0 484.7145 12.724305 426.0 476.75 485.0 493.00 526.0
21 2000.0 494.2390 22.344295 412.0 479.00 494.0 509.00 571.0
22 2000.0 476.5730 6.840326 456.0 472.00 477.0 481.00 499.0
23 2000.0 479.2605 9.641946 438.0 473.00 479.0 486.00 516.0
24 2000.0 499.4075 44.247580 343.0 469.00 500.0 529.00 653.0
25 2000.0 504.7190 28.529686 416.0 485.00 504.0 524.00 597.0
26 2000.0 484.7835 21.573278 421.0 471.00 485.0 499.00 573.0
27 2000.0 482.0995 11.934899 441.0 474.00 482.0 490.00 528.0
28 2000.0 480.5975 14.703897 438.0 468.00 481.0 493.00 520.0
29 2000.0 491.9145 21.965323 417.0 477.00 492.0 505.00 573.0
... ... ... ... ... ... ... ... ...
470 2000.0 516.0075 40.808800 398.0 488.00 515.0 543.25 666.0
471 2000.0 478.7120 4.541957 464.0 476.00 479.0 482.00 493.0
472 2000.0 480.3980 44.456189 347.0 451.00 480.0 512.00 621.0
473 2000.0 477.6405 1.787696 472.0 476.00 478.0 479.00 484.0
474 2000.0 479.4110 6.381570 455.0 475.00 479.0 484.00 506.0
475 2000.0 486.3710 74.518547 276.0 424.00 485.5 547.25 680.0
476 2000.0 481.3660 15.903317 428.0 471.00 482.0 492.00 535.0
477 2000.0 487.7350 17.023180 423.0 476.00 488.0 499.00 548.0
478 2000.0 483.6760 37.987953 347.0 458.00 483.0 509.00 630.0
479 2000.0 486.6320 11.695810 450.0 479.00 487.0 495.00 528.0
480 2000.0 485.4965 19.957333 418.0 472.00 486.0 499.00 554.0
481 2000.0 478.4575 27.820985 391.0 460.00 478.0 497.00 587.0
482 2000.0 477.3605 9.774779 439.0 471.00 478.0 484.00 509.0
483 2000.0 497.9700 34.666092 375.0 475.00 497.0 521.00 623.0
484 2000.0 501.8770 35.628971 376.0 478.00 501.0 526.25 619.0
485 2000.0 482.7790 11.275936 444.0 475.00 483.0 490.00 523.0
486 2000.0 478.7155 10.149387 444.0 472.00 479.0 485.00 512.0
487 2000.0 482.5695 34.051431 355.0 459.00 482.0 505.00 618.0
488 2000.0 489.9115 22.729650 411.0 475.00 490.0 505.00 606.0
489 2000.0 479.6215 8.864622 449.0 474.00 480.0 485.00 513.0
490 2000.0 482.4585 20.890033 407.0 468.00 482.0 496.00 549.0
491 2000.0 478.8115 4.011735 463.0 476.00 479.0 481.00 497.0
492 2000.0 486.3565 23.967366 391.0 471.00 486.0 502.00 566.0
493 2000.0 496.5655 127.635442 130.0 404.00 504.0 586.00 920.0
494 2000.0 493.4995 34.819020 368.0 470.00 492.0 517.00 615.0
495 2000.0 510.8930 37.459353 398.0 486.00 511.0 535.00 661.0
496 2000.0 478.2195 5.880613 457.0 474.00 478.0 482.00 500.0
497 2000.0 483.3090 13.559847 435.0 474.00 483.0 492.00 535.0
498 2000.0 507.9770 37.224297 363.0 482.00 508.0 533.00 644.0
499 2000.0 490.2660 25.825273 403.0 473.00 490.0 507.25 583.0

500 rows × 8 columns

In the solution he scales the data, which I thought wasn't necessary since the trees embedding is a data transformation to 1 and 0 of its own...


In [16]:
pipe_rf = Pipeline([
    ('rte', RandomTreesEmbedding()),
    ('clf', RandomForestClassifier())
])

pipe_xt = Pipeline([
    ('rte', RandomTreesEmbedding()),
    ('clf', ExtraTreesClassifier())
])

pipe_knn = Pipeline([
    ('rte', RandomTreesEmbedding()),
    ('clf', KNeighborsClassifier())
])

In [17]:
# loop through each model
for model, name in zip([pipe_rf, pipe_xt, pipe_knn], ['Random Forest', 'Extra Trees', 'KNN']):
    # fit
    model.fit(Xtrain, ytrain.values.ravel())
    train_score = model.score(Xtrain, ytrain.values.ravel())
    test_score = model.score(Xtest, ytest.values.ravel())
    # cv scores
    cv_scores = cross_val_score(model, Xtrain, ytrain.values.ravel())
    print('{}, train score={}, test score={}\ncv-scores mean={:.4f}, cv-scores std={:.4f}'.format(name, train_score, test_score, cv_scores.mean(), cv_scores.std()))


Random Forest, train score=0.96875, test score=0.5125
cv-scores mean=0.4982, cv-scores std=0.0341
Extra Trees, train score=0.993125, test score=0.545
cv-scores mean=0.5200, cv-scores std=0.0252
KNN, train score=0.7025, test score=0.475
cv-scores mean=0.5038, cv-scores std=0.0151

Not bad (if I remember correctly this dataset is hard to classify) but we have a lot of overfitting with both random forest and extra trees, while K-nearest neighbors is performing slightly worse.

In the solution he also does a grid search to tune all models.