# Data Analysis and Machine Learning: From Decision Trees to Forests and all that **Morten Hjorth-Jensen**, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University Date: **Dec 27, 2019** Copyright 1999-2019, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license ## Decision trees, overarching aims We start here with the most basic algorithm, the so-called decision tree. With this basic algorithm we can in turn build more complex networks, spanning from homogeneous and heterogenous forests (bagging, random forests and more) to one of the most popular supervised algorithms nowadays, the extreme gradient boosting, or just XGBoost. But let us start with the simplest possible ingredient. Decision trees are supervised learning algorithms used for both, classification and regression tasks. The main idea of decision trees is to find those descriptive features which contain the most **information** regarding the target feature and then split the dataset along the values of these features such that the target feature values for the resulting underlying datasets are as pure as possible. The descriptive features which reproduce best the target/output features are normally said to be the most informative ones. The process of finding the **most informative** feature is done until we accomplish a stopping criteria where we then finally end up in so called **leaf nodes**. A decision tree is typically divided into a **root node**, the **interior nodes**, and the final **leaf nodes** or just **leaves**. These entities are then connected by so-called **branches**. The leaf nodes contain the predictions we will make for new query instances presented to our trained model. This is possible since the model has learned the underlying structure of the training data and hence can, given some assumptions, make predictions about the target feature value (class) of unseen query instances. ## A typical Decision Tree with its pertinent Jargon, Classification Problem

This tree was produced using the Wisconsin cancer data (discussed here as well, see code examples below) using Scikit-Learn's decision tree classifier. Here we have used the so-called gini index (see below) to split the various branches.

General Features

The overarching approach to decision trees is a top-down approach.

  • A leaf provides the classification of a given instance.

  • A node specifies a test of some attribute of the instance.

  • A branch corresponds to a possible values of an attribute.

  • An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example.

This process is then repeated for the subtree rooted at the new node.

How do we set it up?

In simplified terms, the process of training a decision tree and predicting the target features of query instances is as follows:

  1. Present a dataset containing of a number of training instances characterized by a number of descriptive features and a target feature

  2. Train the decision tree model by continuously splitting the target feature along the values of the descriptive features using a measure of information gain during the training process

  3. Grow the tree until we accomplish a stopping criteria create leaf nodes which represent the predictions we want to make for new query instances

  4. Show query instances to the tree and run down the tree until we arrive at leaf nodes

Then we are essentially done!

Decision trees and Regression


In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

steps=250

distance=0
x=0
distance_list=[]
steps_list=[]
while x<steps:
    distance+=np.random.randint(-1,2)
    distance_list.append(distance)
    x+=1
    steps_list.append(x)
plt.plot(steps_list,distance_list, color='green', label="Random Walk Data")

steps_list=np.asarray(steps_list)
distance_list=np.asarray(distance_list)

X=steps_list[:,np.newaxis]

#Polynomial fits

#Degree 2
poly_features=PolynomialFeatures(degree=2, include_bias=False)
X_poly=poly_features.fit_transform(X)

lin_reg=LinearRegression()
poly_fit=lin_reg.fit(X_poly,distance_list)
b=lin_reg.coef_
c=lin_reg.intercept_
print ("2nd degree coefficients:")
print ("zero power: ",c)
print ("first power: ", b[0])
print ("second power: ",b[1])

z = np.arange(0, steps, .01)
z_mod=b[1]*z**2+b[0]*z+c

fit_mod=b[1]*X**2+b[0]*X+c
plt.plot(z, z_mod, color='r', label="2nd Degree Fit")
plt.title("Polynomial Regression")

plt.xlabel("Steps")
plt.ylabel("Distance")

#Degree 10
poly_features10=PolynomialFeatures(degree=10, include_bias=False)
X_poly10=poly_features10.fit_transform(X)

poly_fit10=lin_reg.fit(X_poly10,distance_list)

y_plot=poly_fit10.predict(X_poly10)
plt.plot(X, y_plot, color='black', label="10th Degree Fit")

plt.legend()
plt.show()


#Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
regr_1=DecisionTreeRegressor(max_depth=2)
regr_2=DecisionTreeRegressor(max_depth=5)
regr_3=DecisionTreeRegressor(max_depth=11)
regr_1.fit(X, distance_list)
regr_2.fit(X, distance_list)
regr_3.fit(X, distance_list)

X_test = np.arange(0.0, steps, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3=regr_3.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, distance_list, s=2.5, c="black", label="data")
plt.plot(X_test, y_1, color="red",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="green", label="max_depth=5", linewidth=2)
plt.plot(X_test, y_3, color="m", label="max_depth=7", linewidth=2)

plt.xlabel("Data")
plt.ylabel("Darget")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()


2nd degree coefficients:
zero power:  7.069813447337733
first power:  -0.169654380239669
second power:  0.00048141857827328896

Building a tree, regression

There are mainly two steps

  1. We split the predictor space (the set of possible values $x_1,x_2,\dots, x_p$) into $J$ distinct and non-non-overlapping regions, $R_1,R_2,\dots,R_J$.

  2. For every observation that falls into the region $R_j$ , we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.

How do we construct the regions $R_1,\dots,R_J$? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes $R_1,\dots,R_J$ that minimize the MSE, given by

$$ \sum_{j=1}^J\sum_{i\in R_j}(y_i-\overline{y}_{R_j})^2, $$

where $\overline{y}_{R_j}$ is the mean response for the training observations within box $j$.

A top-down approach, recursive binary splitting

Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes. The common strategy is to take a top-down approach

The approach is top-down because it begins at the top of the tree (all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

Making a tree

In order to implement the recursive binary splitting we start by selecting the predictor $x_j$ and a cutpoint $s$ that splits the predictor space into two regions $R_1$ and $R_2$

$$ \left\{X\vert x_j < s\right\}, $$

and

$$ \left\{X\vert x_j \geq s\right\}, $$

so that we obtain the lowest MSE, that is

$$ \sum_{i:x_i\in R_j}(y_i-\overline{y}_{R_1})^2+\sum_{i:x_i\in R_2}(y_i-\overline{y}_{R_2})^2, $$

which we want to minimize by considering all predictors $x_1,x_2,\dots,x_p$. We consider also all possible values of $s$ for each predictor. These values could be determined by randomly assigned numbers or by starting at the midpoint and then proceed till we find an optimal value.

For any $j$ and $s$, we define the pair of half-planes where $\overline{y}_{R_1}$ is the mean response for the training observations in $R_1(j,s)$, and $\overline{y}_{R_2}$ is the mean response for the training observations in $R_2(j,s)$.

Finding the values of $j$ and $s$ that minimize the above equation can be done quite quickly, especially when the number of features $p$ is not too large.

Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the MSE within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the MSE. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.

Pruning the tree

The above procedure is rather straightforward, but leads often to overfitting and unnecessarily large and complicated trees. The basic idea is to grow a large tree $T_0$ and then prune it back in order to obtain a subtree. A smaller tree with fewer splits (fewer regions) can lead to smaller variance and better interpretation at the cost of a little more bias.

The so-called Cost complexity pruning algorithm gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$.

Cost complexity pruning

For each value of $\alpha$ there corresponds a subtree $T \in T_0$ such that

$$ \sum_{m=1}^{\overline{T}}\sum_{i:x_i\in R_m}(y_i-\overline{y}_{R_m})^2+\alpha\overline{T}, $$

is as small as possible. Here $\overline{T}$ is the number of terminal nodes of the tree $T$ , $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m$-th terminal node.

The tuning parameter $\alpha$ controls a trade-off between the subtree’s com- plexity and its fit to the training data. When $\alpha = 0$, then the subtree $T$ will simply equal $T_0$, because then the above equation just measures the training error. However, as $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes. The above equation will tend to be minimized for a smaller subtree.

It turns out that as we increase $\alpha$ from zero branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of $\alpha$ is easy. We can select a value of $\alpha$ using a validation set or using cross-validation. We then return to the full data set and obtain the subtree corresponding to $\alpha$.

Schematic Regression Procedure

Building a Regression Tree.

  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.

  2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.

  3. Use for example $K$-fold cross-validation to choose $\alpha$. Divide the training observations into $K$ folds. For each $k=1,2,\dots,K$ we:

    • repeat steps 1 and 2 on all but the $k$-th fold of the training data.

    • Then we valuate the mean squared prediction error on the data in the left-out $k$-th fold, as a function of $\alpha$.

    • Finally we average the results for each value of $\alpha$, and pick $\alpha$ to minimize the average error.

  1. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$.

A Classification Tree

A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. Recall that for a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node. In contrast, for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.

Growing a classification tree

The task of growing a classification tree is quite similar to the task of growing a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, the MSE cannot be used as a criterion for making the binary splits. A natural alternative to MSE is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring error rate class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class.

When building a classification tree, either the Gini index or the entropy are typically used to evaluate the quality of a particular split, since these two approaches are more sensitive to node purity than is the classification error rate.

Classification tree, how to split nodes

If our targets are the outcome of a classification process that takes for example $k=1,2,\dots,K$ values, the only thing we need to think of is to set up the splitting criteria for each node.

We define a PDF $p_{mk}$ that represents the number of observations of a class $k$ in a region $R_m$ with $N_m$ observations. We represent this likelihood function in terms of the proportion $I(y_i=k)$ of observations of this class in the region $R_m$ as

$$ p_{mk} = \frac{1}{N_m}\sum_{x_i\in R_m}I(y_i=k). $$

We let $p_{mk}$ represent the majority class of observations in region $m$. The three most common ways of splitting a node are given by

  • Misclassification error
$$ p_{mk} = \frac{1}{N_m}\sum_{x_i\in R_m}I(y_i\ne k) = 1-p_{mk}. $$
  • Gini index $g$
$$ g = \sum_{k=1}^K p_{mk}(1-p_{mk}). $$
  • Information entropy or just entropy $s$
$$ s = -\sum_{k=1}^K p_{mk}\log{p_{mk}}. $$

Visualizing the Tree, Classification


In [2]:
import os
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz

from IPython.display import Image 
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np


cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(X)
y = pd.Categorical.from_codes(cancer.target, cancer.target_names)
y = pd.get_dummies(y)
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
tree_clf = DecisionTreeClassifier(max_depth=5)
tree_clf.fit(X_train, y_train)

export_graphviz(
    tree_clf,
    out_file="DataFiles/cancer.dot",
    feature_names=cancer.feature_names,
    class_names=cancer.target_names,
    rounded=True,
    filled=True
)
cmd = 'dot -Tpng DataFiles/cancer.dot -o DataFiles/cancer.png'
os.system(cmd)


     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0         17.990         10.38          122.80     1001.0          0.11840   
1         20.570         17.77          132.90     1326.0          0.08474   
2         19.690         21.25          130.00     1203.0          0.10960   
3         11.420         20.38           77.58      386.1          0.14250   
4         20.290         14.34          135.10     1297.0          0.10030   
5         12.450         15.70           82.57      477.1          0.12780   
6         18.250         19.98          119.60     1040.0          0.09463   
7         13.710         20.83           90.20      577.9          0.11890   
8         13.000         21.82           87.50      519.8          0.12730   
9         12.460         24.04           83.97      475.9          0.11860   
10        16.020         23.24          102.70      797.8          0.08206   
11        15.780         17.89          103.60      781.0          0.09710   
12        19.170         24.80          132.40     1123.0          0.09740   
13        15.850         23.95          103.70      782.7          0.08401   
14        13.730         22.61           93.60      578.3          0.11310   
15        14.540         27.54           96.73      658.8          0.11390   
16        14.680         20.13           94.74      684.5          0.09867   
17        16.130         20.68          108.10      798.8          0.11700   
18        19.810         22.15          130.00     1260.0          0.09831   
19        13.540         14.36           87.46      566.3          0.09779   
20        13.080         15.71           85.63      520.0          0.10750   
21         9.504         12.44           60.34      273.9          0.10240   
22        15.340         14.26          102.50      704.4          0.10730   
23        21.160         23.04          137.20     1404.0          0.09428   
24        16.650         21.38          110.00      904.6          0.11210   
25        17.140         16.40          116.00      912.7          0.11860   
26        14.580         21.53           97.41      644.8          0.10540   
27        18.610         20.25          122.10     1094.0          0.09440   
28        15.300         25.27          102.40      732.4          0.10820   
29        17.570         15.05          115.00      955.1          0.09847   
..           ...           ...             ...        ...              ...   
539        7.691         25.44           48.34      170.4          0.08668   
540       11.540         14.44           74.65      402.9          0.09984   
541       14.470         24.99           95.81      656.4          0.08837   
542       14.740         25.42           94.70      668.6          0.08275   
543       13.210         28.06           84.88      538.4          0.08671   
544       13.870         20.70           89.77      584.8          0.09578   
545       13.620         23.23           87.19      573.2          0.09246   
546       10.320         16.35           65.31      324.9          0.09434   
547       10.260         16.58           65.85      320.8          0.08877   
548        9.683         19.34           61.05      285.7          0.08491   
549       10.820         24.21           68.89      361.6          0.08192   
550       10.860         21.48           68.51      360.5          0.07431   
551       11.130         22.44           71.49      378.4          0.09566   
552       12.770         29.43           81.35      507.9          0.08276   
553        9.333         21.94           59.01      264.0          0.09240   
554       12.880         28.92           82.50      514.3          0.08123   
555       10.290         27.61           65.67      321.4          0.09030   
556       10.160         19.59           64.73      311.7          0.10030   
557        9.423         27.88           59.26      271.3          0.08123   
558       14.590         22.68           96.39      657.1          0.08473   
559       11.510         23.93           74.52      403.5          0.09261   
560       14.050         27.15           91.38      600.4          0.09929   
561       11.200         29.37           70.67      386.0          0.07449   
562       15.220         30.62          103.40      716.9          0.10480   
563       20.920         25.09          143.00     1347.0          0.10990   
564       21.560         22.39          142.00     1479.0          0.11100   
565       20.130         28.25          131.20     1261.0          0.09780   
566       16.600         28.08          108.30      858.1          0.08455   
567       20.600         29.33          140.10     1265.0          0.11780   
568        7.760         24.54           47.92      181.0          0.05263   

     mean compactness  mean concavity  mean concave points  mean symmetry  \
0             0.27760        0.300100             0.147100         0.2419   
1             0.07864        0.086900             0.070170         0.1812   
2             0.15990        0.197400             0.127900         0.2069   
3             0.28390        0.241400             0.105200         0.2597   
4             0.13280        0.198000             0.104300         0.1809   
5             0.17000        0.157800             0.080890         0.2087   
6             0.10900        0.112700             0.074000         0.1794   
7             0.16450        0.093660             0.059850         0.2196   
8             0.19320        0.185900             0.093530         0.2350   
9             0.23960        0.227300             0.085430         0.2030   
10            0.06669        0.032990             0.033230         0.1528   
11            0.12920        0.099540             0.066060         0.1842   
12            0.24580        0.206500             0.111800         0.2397   
13            0.10020        0.099380             0.053640         0.1847   
14            0.22930        0.212800             0.080250         0.2069   
15            0.15950        0.163900             0.073640         0.2303   
16            0.07200        0.073950             0.052590         0.1586   
17            0.20220        0.172200             0.102800         0.2164   
18            0.10270        0.147900             0.094980         0.1582   
19            0.08129        0.066640             0.047810         0.1885   
20            0.12700        0.045680             0.031100         0.1967   
21            0.06492        0.029560             0.020760         0.1815   
22            0.21350        0.207700             0.097560         0.2521   
23            0.10220        0.109700             0.086320         0.1769   
24            0.14570        0.152500             0.091700         0.1995   
25            0.22760        0.222900             0.140100         0.3040   
26            0.18680        0.142500             0.087830         0.2252   
27            0.10660        0.149000             0.077310         0.1697   
28            0.16970        0.168300             0.087510         0.1926   
29            0.11570        0.098750             0.079530         0.1739   
..                ...             ...                  ...            ...   
539           0.11990        0.092520             0.013640         0.2037   
540           0.11200        0.067370             0.025940         0.1818   
541           0.12300        0.100900             0.038900         0.1872   
542           0.07214        0.041050             0.030270         0.1840   
543           0.06877        0.029870             0.032750         0.1628   
544           0.10180        0.036880             0.023690         0.1620   
545           0.06747        0.029740             0.024430         0.1664   
546           0.04994        0.010120             0.005495         0.1885   
547           0.08066        0.043580             0.024380         0.1669   
548           0.05030        0.023370             0.009615         0.1580   
549           0.06602        0.015480             0.008160         0.1976   
550           0.04227        0.000000             0.000000         0.1661   
551           0.08194        0.048240             0.022570         0.2030   
552           0.04234        0.019970             0.014990         0.1539   
553           0.05605        0.039960             0.012820         0.1692   
554           0.05824        0.061950             0.023430         0.1566   
555           0.07658        0.059990             0.027380         0.1593   
556           0.07504        0.005025             0.011160         0.1791   
557           0.04971        0.000000             0.000000         0.1742   
558           0.13300        0.102900             0.037360         0.1454   
559           0.10210        0.111200             0.041050         0.1388   
560           0.11260        0.044620             0.043040         0.1537   
561           0.03558        0.000000             0.000000         0.1060   
562           0.20870        0.255000             0.094290         0.2128   
563           0.22360        0.317400             0.147400         0.2149   
564           0.11590        0.243900             0.138900         0.1726   
565           0.10340        0.144000             0.097910         0.1752   
566           0.10230        0.092510             0.053020         0.1590   
567           0.27700        0.351400             0.152000         0.2397   
568           0.04362        0.000000             0.000000         0.1587   

     mean fractal dimension           ...             worst radius  \
0                   0.07871           ...                   25.380   
1                   0.05667           ...                   24.990   
2                   0.05999           ...                   23.570   
3                   0.09744           ...                   14.910   
4                   0.05883           ...                   22.540   
5                   0.07613           ...                   15.470   
6                   0.05742           ...                   22.880   
7                   0.07451           ...                   17.060   
8                   0.07389           ...                   15.490   
9                   0.08243           ...                   15.090   
10                  0.05697           ...                   19.190   
11                  0.06082           ...                   20.420   
12                  0.07800           ...                   20.960   
13                  0.05338           ...                   16.840   
14                  0.07682           ...                   15.030   
15                  0.07077           ...                   17.460   
16                  0.05922           ...                   19.070   
17                  0.07356           ...                   20.960   
18                  0.05395           ...                   27.320   
19                  0.05766           ...                   15.110   
20                  0.06811           ...                   14.500   
21                  0.06905           ...                   10.230   
22                  0.07032           ...                   18.070   
23                  0.05278           ...                   29.170   
24                  0.06330           ...                   26.460   
25                  0.07413           ...                   22.250   
26                  0.06924           ...                   17.620   
27                  0.05699           ...                   21.310   
28                  0.06540           ...                   20.270   
29                  0.06149           ...                   20.010   
..                      ...           ...                      ...   
539                 0.07751           ...                    8.678   
540                 0.06782           ...                   12.260   
541                 0.06341           ...                   16.220   
542                 0.05680           ...                   16.510   
543                 0.05781           ...                   14.370   
544                 0.06688           ...                   15.050   
545                 0.05801           ...                   15.350   
546                 0.06201           ...                   11.250   
547                 0.06714           ...                   10.830   
548                 0.06235           ...                   10.930   
549                 0.06328           ...                   13.030   
550                 0.05948           ...                   11.660   
551                 0.06552           ...                   12.020   
552                 0.05637           ...                   13.870   
553                 0.06576           ...                    9.845   
554                 0.05708           ...                   13.890   
555                 0.06127           ...                   10.840   
556                 0.06331           ...                   10.650   
557                 0.06059           ...                   10.490   
558                 0.06147           ...                   15.480   
559                 0.06570           ...                   12.480   
560                 0.06171           ...                   15.300   
561                 0.05502           ...                   11.920   
562                 0.07152           ...                   17.520   
563                 0.06879           ...                   24.290   
564                 0.05623           ...                   25.450   
565                 0.05533           ...                   23.690   
566                 0.05648           ...                   18.980   
567                 0.07016           ...                   25.740   
568                 0.05884           ...                    9.456   

     worst texture  worst perimeter  worst area  worst smoothness  \
0            17.33           184.60      2019.0           0.16220   
1            23.41           158.80      1956.0           0.12380   
2            25.53           152.50      1709.0           0.14440   
3            26.50            98.87       567.7           0.20980   
4            16.67           152.20      1575.0           0.13740   
5            23.75           103.40       741.6           0.17910   
6            27.66           153.20      1606.0           0.14420   
7            28.14           110.60       897.0           0.16540   
8            30.73           106.20       739.3           0.17030   
9            40.68            97.65       711.4           0.18530   
10           33.88           123.80      1150.0           0.11810   
11           27.28           136.50      1299.0           0.13960   
12           29.94           151.70      1332.0           0.10370   
13           27.66           112.00       876.5           0.11310   
14           32.01           108.80       697.7           0.16510   
15           37.13           124.10       943.2           0.16780   
16           30.88           123.40      1138.0           0.14640   
17           31.48           136.80      1315.0           0.17890   
18           30.88           186.80      2398.0           0.15120   
19           19.26            99.70       711.2           0.14400   
20           20.49            96.09       630.5           0.13120   
21           15.66            65.13       314.9           0.13240   
22           19.08           125.10       980.9           0.13900   
23           35.59           188.00      2615.0           0.14010   
24           31.56           177.00      2215.0           0.18050   
25           21.40           152.40      1461.0           0.15450   
26           33.21           122.40       896.9           0.15250   
27           27.26           139.90      1403.0           0.13380   
28           36.71           149.30      1269.0           0.16410   
29           19.52           134.90      1227.0           0.12550   
..             ...              ...         ...               ...   
539          31.89            54.49       223.6           0.15960   
540          19.68            78.78       457.8           0.13450   
541          31.73           113.50       808.9           0.13400   
542          32.29           107.40       826.4           0.10600   
543          37.17            92.48       629.6           0.10720   
544          24.75            99.17       688.6           0.12640   
545          29.09            97.58       729.8           0.12160   
546          21.77            71.12       384.9           0.12850   
547          22.04            71.08       357.4           0.14610   
548          25.59            69.10       364.2           0.11990   
549          31.45            83.90       505.6           0.12040   
550          24.77            74.08       412.3           0.10010   
551          28.26            77.80       436.6           0.10870   
552          36.00            88.10       594.7           0.12340   
553          25.05            62.86       295.8           0.11030   
554          35.74            88.84       595.7           0.12270   
555          34.91            69.57       357.6           0.13840   
556          22.88            67.88       347.3           0.12650   
557          34.24            66.50       330.6           0.10730   
558          27.27           105.90       733.5           0.10260   
559          37.16            82.28       474.2           0.12980   
560          33.17           100.20       706.7           0.12410   
561          38.30            75.19       439.6           0.09267   
562          42.79           128.70       915.0           0.14170   
563          29.41           179.10      1819.0           0.14070   
564          26.40           166.10      2027.0           0.14100   
565          38.25           155.00      1731.0           0.11660   
566          34.12           126.70      1124.0           0.11390   
567          39.42           184.60      1821.0           0.16500   
568          30.37            59.16       268.6           0.08996   

     worst compactness  worst concavity  worst concave points  worst symmetry  \
0              0.66560          0.71190               0.26540          0.4601   
1              0.18660          0.24160               0.18600          0.2750   
2              0.42450          0.45040               0.24300          0.3613   
3              0.86630          0.68690               0.25750          0.6638   
4              0.20500          0.40000               0.16250          0.2364   
5              0.52490          0.53550               0.17410          0.3985   
6              0.25760          0.37840               0.19320          0.3063   
7              0.36820          0.26780               0.15560          0.3196   
8              0.54010          0.53900               0.20600          0.4378   
9              1.05800          1.10500               0.22100          0.4366   
10             0.15510          0.14590               0.09975          0.2948   
11             0.56090          0.39650               0.18100          0.3792   
12             0.39030          0.36390               0.17670          0.3176   
13             0.19240          0.23220               0.11190          0.2809   
14             0.77250          0.69430               0.22080          0.3596   
15             0.65770          0.70260               0.17120          0.4218   
16             0.18710          0.29140               0.16090          0.3029   
17             0.42330          0.47840               0.20730          0.3706   
18             0.31500          0.53720               0.23880          0.2768   
19             0.17730          0.23900               0.12880          0.2977   
20             0.27760          0.18900               0.07283          0.3184   
21             0.11480          0.08867               0.06227          0.2450   
22             0.59540          0.63050               0.23930          0.4667   
23             0.26000          0.31550               0.20090          0.2822   
24             0.35780          0.46950               0.20950          0.3613   
25             0.39490          0.38530               0.25500          0.4066   
26             0.66430          0.55390               0.27010          0.4264   
27             0.21170          0.34460               0.14900          0.2341   
28             0.61100          0.63350               0.20240          0.4027   
29             0.28120          0.24890               0.14560          0.2756   
..                 ...              ...                   ...             ...   
539            0.30640          0.33930               0.05000          0.2790   
540            0.21180          0.17970               0.06918          0.2329   
541            0.42020          0.40400               0.12050          0.3187   
542            0.13760          0.16110               0.10950          0.2722   
543            0.13810          0.10620               0.07958          0.2473   
544            0.20370          0.13770               0.06845          0.2249   
545            0.15170          0.10490               0.07174          0.2642   
546            0.08842          0.04384               0.02381          0.2681   
547            0.22460          0.17830               0.08333          0.2691   
548            0.09546          0.09350               0.03846          0.2552   
549            0.16330          0.06194               0.03264          0.3059   
550            0.07348          0.00000               0.00000          0.2458   
551            0.17820          0.15640               0.06413          0.3169   
552            0.10640          0.08653               0.06498          0.2407   
553            0.08298          0.07993               0.02564          0.2435   
554            0.16200          0.24390               0.06493          0.2372   
555            0.17100          0.20000               0.09127          0.2226   
556            0.12000          0.01005               0.02232          0.2262   
557            0.07158          0.00000               0.00000          0.2475   
558            0.31710          0.36620               0.11050          0.2258   
559            0.25170          0.36300               0.09653          0.2112   
560            0.22640          0.13260               0.10480          0.2250   
561            0.05494          0.00000               0.00000          0.1566   
562            0.79170          1.17000               0.23560          0.4089   
563            0.41860          0.65990               0.25420          0.2929   
564            0.21130          0.41070               0.22160          0.2060   
565            0.19220          0.32150               0.16280          0.2572   
566            0.30940          0.34030               0.14180          0.2218   
567            0.86810          0.93870               0.26500          0.4087   
568            0.06444          0.00000               0.00000          0.2871   

     worst fractal dimension  
0                    0.11890  
1                    0.08902  
2                    0.08758  
3                    0.17300  
4                    0.07678  
5                    0.12440  
6                    0.08368  
7                    0.11510  
8                    0.10720  
9                    0.20750  
10                   0.08452  
11                   0.10480  
12                   0.10230  
13                   0.06287  
14                   0.14310  
15                   0.13410  
16                   0.08216  
17                   0.11420  
18                   0.07615  
19                   0.07259  
20                   0.08183  
21                   0.07773  
22                   0.09946  
23                   0.07526  
24                   0.09564  
25                   0.10590  
26                   0.12750  
27                   0.07421  
28                   0.09876  
29                   0.07919  
..                       ...  
539                  0.10660  
540                  0.08134  
541                  0.10230  
542                  0.06956  
543                  0.06443  
544                  0.08492  
545                  0.06953  
546                  0.07399  
547                  0.09479  
548                  0.07920  
549                  0.07626  
550                  0.06592  
551                  0.08032  
552                  0.06484  
553                  0.07393  
554                  0.07242  
555                  0.08283  
556                  0.06742  
557                  0.06969  
558                  0.08004  
559                  0.08732  
560                  0.08321  
561                  0.05905  
562                  0.14090  
563                  0.09873  
564                  0.07115  
565                  0.06637  
566                  0.07820  
567                  0.12400  
568                  0.07039  

[569 rows x 30 columns]
     malignant  benign
0            1       0
1            1       0
2            1       0
3            1       0
4            1       0
5            1       0
6            1       0
7            1       0
8            1       0
9            1       0
10           1       0
11           1       0
12           1       0
13           1       0
14           1       0
15           1       0
16           1       0
17           1       0
18           1       0
19           0       1
20           0       1
21           0       1
22           1       0
23           1       0
24           1       0
25           1       0
26           1       0
27           1       0
28           1       0
29           1       0
..         ...     ...
539          0       1
540          0       1
541          0       1
542          0       1
543          0       1
544          0       1
545          0       1
546          0       1
547          0       1
548          0       1
549          0       1
550          0       1
551          0       1
552          0       1
553          0       1
554          0       1
555          0       1
556          0       1
557          0       1
558          0       1
559          0       1
560          0       1
561          0       1
562          1       0
563          1       0
564          1       0
565          1       0
566          1       0
567          1       0
568          0       1

[569 rows x 2 columns]
Out[2]:
0

Visualizing the Tree, The Moons


In [3]:
# Common imports
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.tree import export_graphviz
from pydot import graph_from_dot_data
import pandas as pd
import os

np.random.seed(42)
X, y = make_moons(n_samples=100, noise=0.25, random_state=53)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
tree_clf = DecisionTreeClassifier(max_depth=5)
tree_clf.fit(X_train, y_train)

export_graphviz(
    tree_clf,
    out_file="DataFiles/moons.dot",
    rounded=True,
    filled=True
)
cmd = 'dot -Tpng DataFiles/moons.dot -o DataFiles/moons.png'
os.system(cmd)


Out[3]:
0

Algorithms for Setting up Decision Trees

Two algorithms stand out in the set up of decision trees:

  1. The CART (Classification And Regression Tree) algorithm for both classification and regression

  2. The ID3 algorithm based on the computation of the information gain for classification

We discuss both algorithms with applications here. The popular library Scikit-Learn uses the CART algorithm. For classification problems you can use either the gini index or the entropy to split a tree in two branches.

The CART algorithm for Classification

For classification, the CART algorithm splits the data set in two subsets using a single feature $k$ and a threshold $t_k$. This could be for example a threshold set by a number below a certain circumference of a malign tumor.

How do we find these two quantities? We search for the pair $(k,t_k)$ that produces the purest subset using for example the gini factor $G$. The cost function it tries to minimize is then

$$ C(k,t_k) = \frac{m_{\mathrm{left}}}{m}G_{\mathrm{left}}+ \frac{m_{\mathrm{right}}}{m}G_{\mathrm{right}}, $$

where $G_{\mathrm{left/right}}$ measures the impurity of the left/right subset and $m_{\mathrm{left/right}}$ is the number of instances in the left/right subset

Once it has successfully split the training set in two, it splits the subsets using the same logic, then the subsubsets and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the $max\_depth$ hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters control additional stopping conditions such as the $min\_samples\_split$, $min\_samples\_leaf$, $min\_weight\_fraction\_leaf$, and $max\_leaf\_nodes$.

The CART algorithm for Regression

The CART algorithm for regression works is similar to the one for classification except that instead of trying to split the training set in a way that minimizes say the gini or entropy impurity, it now tries to split the training set in a way that minimizes our well-known mean-squared error (MSE). The cost function is now

$$ C(k,t_k) = \frac{m_{\mathrm{left}}}{m}\mathrm{MSE}_{\mathrm{left}}+ \frac{m_{\mathrm{right}}}{m}\mathrm{MSE}_{\mathrm{right}}. $$

Here the MSE for a specific node is defined as

$$ \mathrm{MSE}_{\mathrm{node}}=\frac{1}{m_\mathrm{node}}\sum_{i\in \mathrm{node}}(\overline{y}_{\mathrm{node}}-y_i)^2, $$

with

$$ \overline{y}_{\mathrm{node}}=\frac{1}{m_\mathrm{node}}\sum_{i\in \mathrm{node}}y_i, $$

the mean value of all observations in a specific node.

Without any regularization, the regression task for decision trees, just like for classification tasks, is prone to overfitting.

Computing the Gini index

The example we will look at is a classical one in many Machine Learning applications. Based on various meteorological features, we have several so-called attributes which decide whether we at the end will do some outdoor activity like skiing, going for a bike ride etc etc. The table here contains the feautures outlook, temperature, humidity and wind. The target or output is whether we ride (True=1) or whether we do something else that day (False=0). The attributes for each feature are then sunny, overcast and rain for the outlook, hot, cold and mild for temperature, high and normal for humidity and weak and strong for wind.

The table here summarizes the various attributes and

Day Outlook Temperature Humidity Wind Ride
1 Sunny Hot High Weak 0
2 Sunny Hot High Strong 1
3 Overcast Hot High Weak 1
4 Rain Mild High Weak 1
5 Rain Cool Normal Weak 1
6 Rain Cool Normal Strong 0
7 Overcast Cool Normal Strong 1
8 Sunny Mild High Weak 0
9 Sunny Cool Normal Weak 1
10 Rain Mild Normal Weak 1
11 Sunny Mild Normal Strong 1
12 Overcast Mild High Strong 1
13 Overcast Hot Normal Weak 1
14 Rain Mild High Strong 0

Simple Python Code to read in Data and perform Classification


In [4]:
# Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from IPython.display import Image 
from pydot import graph_from_dot_data
import os

# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"

if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)

if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)

if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')

infile = open(data_path("rideclass.csv"),'r')

# Read the experimental data with Pandas
from IPython.display import display
ridedata = pd.read_csv(infile,names = ('Outlook','Temperature','Humidity','Wind','Ride'))
ridedata = pd.DataFrame(ridedata)

# Features and targets
X = ridedata.loc[:, ridedata.columns != 'Ride'].values
y = ridedata.loc[:, ridedata.columns == 'Ride'].values

# Create the encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
# Assume for simplicity all features are categorical.
encoder.fit(X)    
# Apply the encoder.
X = encoder.transform(X)
print(X)
# Then do a Classification tree
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
print("Train set accuracy with Decision Tree: {:.2f}".format(tree_clf.score(X,y)))
#transfer to a decision tree graph
export_graphviz(
    tree_clf,
    out_file="DataFiles/ride.dot",
    rounded=True,
    filled=True
)
cmd = 'dot -Tpng DataFiles/cancer.dot -o DataFiles/cancer.png'
os.system(cmd)


  (0, 0)	1.0
  (0, 7)	1.0
  (0, 9)	1.0
  (0, 13)	1.0
  (1, 3)	1.0
  (1, 5)	1.0
  (1, 8)	1.0
  (1, 12)	1.0
  (2, 3)	1.0
  (2, 5)	1.0
  (2, 8)	1.0
  (2, 11)	1.0
  (3, 1)	1.0
  (3, 5)	1.0
  (3, 8)	1.0
  (3, 12)	1.0
  (4, 2)	1.0
  (4, 6)	1.0
  (4, 8)	1.0
  (4, 12)	1.0
  (5, 2)	1.0
  (5, 4)	1.0
  (5, 10)	1.0
  (5, 12)	1.0
  (6, 2)	1.0
  :	:
  (8, 12)	1.0
  (9, 3)	1.0
  (9, 4)	1.0
  (9, 10)	1.0
  (9, 12)	1.0
  (10, 2)	1.0
  (10, 6)	1.0
  (10, 10)	1.0
  (10, 12)	1.0
  (11, 3)	1.0
  (11, 6)	1.0
  (11, 10)	1.0
  (11, 11)	1.0
  (12, 1)	1.0
  (12, 6)	1.0
  (12, 8)	1.0
  (12, 11)	1.0
  (13, 1)	1.0
  (13, 5)	1.0
  (13, 10)	1.0
  (13, 12)	1.0
  (14, 2)	1.0
  (14, 6)	1.0
  (14, 8)	1.0
  (14, 11)	1.0
Train set accuracy with Decision Tree: 0.73
Out[4]:
0

Computing the Gini Factor

The above functions (gini, entropy and misclassification error) are important components of the so-called CART algorithm. We will discuss this algorithm below after we have discussed the information gain algorithm ID3.

In the example here we have converted all our attributes into numerical values $0,1,2$ etc.


In [5]:
# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
	left, right = list(), list()
	for row in dataset:
		if row[index] < value:
			left.append(row)
		else:
			right.append(row)
	return left, right
 
# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
	# count all samples at split point
	n_instances = float(sum([len(group) for group in groups]))
	# sum weighted Gini index for each group
	gini = 0.0
	for group in groups:
		size = float(len(group))
		# avoid divide by zero
		if size == 0:
			continue
		score = 0.0
		# score the group based on the score for each class
		for class_val in classes:
			p = [row[-1] for row in group].count(class_val) / size
			score += p * p
		# weight the group score by its relative size
		gini += (1.0 - score) * (size / n_instances)
	return gini

# Select the best split point for a dataset
def get_split(dataset):
	class_values = list(set(row[-1] for row in dataset))
	b_index, b_value, b_score, b_groups = 999, 999, 999, None
	for index in range(len(dataset[0])-1):
		for row in dataset:
			groups = test_split(index, row[index], dataset)
			gini = gini_index(groups, class_values)
			print('X%d < %.3f Gini=%.3f' % ((index+1), row[index], gini))
			if gini < b_score:
				b_index, b_value, b_score, b_groups = index, row[index], gini, groups
	return {'index':b_index, 'value':b_value, 'groups':b_groups}
 
dataset = [[0,0,0,0,0],
            [0,0,0,1,1],
            [1,0,0,0,1],
            [2,1,0,0,1],
            [2,2,1,0,1],
            [2,2,1,1,0],
            [1,2,1,1,1],
            [0,1,0,0,0],
            [0,2,1,0,1],
            [2,1,1,0,1],
            [0,1,1,1,1],
            [1,1,0,1,1],
            [1,0,1,0,1],
            [2,1,0,1,0]]

split = get_split(dataset)
print('Split: [X%d < %.3f]' % ((split['index']+1), split['value']))


X1 < 0.000 Gini=0.408
X1 < 0.000 Gini=0.408
X1 < 1.000 Gini=0.394
X1 < 2.000 Gini=0.394
X1 < 2.000 Gini=0.394
X1 < 2.000 Gini=0.394
X1 < 1.000 Gini=0.394
X1 < 0.000 Gini=0.408
X1 < 0.000 Gini=0.408
X1 < 2.000 Gini=0.394
X1 < 0.000 Gini=0.408
X1 < 1.000 Gini=0.394
X1 < 1.000 Gini=0.394
X1 < 2.000 Gini=0.394
X2 < 0.000 Gini=0.408
X2 < 0.000 Gini=0.408
X2 < 0.000 Gini=0.408
X2 < 1.000 Gini=0.407
X2 < 2.000 Gini=0.407
X2 < 2.000 Gini=0.407
X2 < 2.000 Gini=0.407
X2 < 1.000 Gini=0.407
X2 < 2.000 Gini=0.407
X2 < 1.000 Gini=0.407
X2 < 1.000 Gini=0.407
X2 < 1.000 Gini=0.407
X2 < 0.000 Gini=0.408
X2 < 1.000 Gini=0.407
X3 < 0.000 Gini=0.408
X3 < 0.000 Gini=0.408
X3 < 0.000 Gini=0.408
X3 < 0.000 Gini=0.408
X3 < 1.000 Gini=0.367
X3 < 1.000 Gini=0.367
X3 < 1.000 Gini=0.367
X3 < 0.000 Gini=0.408
X3 < 1.000 Gini=0.367
X3 < 1.000 Gini=0.367
X3 < 1.000 Gini=0.367
X3 < 0.000 Gini=0.408
X3 < 1.000 Gini=0.367
X3 < 0.000 Gini=0.408
X4 < 0.000 Gini=0.408
X4 < 1.000 Gini=0.405
X4 < 0.000 Gini=0.408
X4 < 0.000 Gini=0.408
X4 < 0.000 Gini=0.408
X4 < 1.000 Gini=0.405
X4 < 1.000 Gini=0.405
X4 < 0.000 Gini=0.408
X4 < 0.000 Gini=0.408
X4 < 0.000 Gini=0.408
X4 < 1.000 Gini=0.405
X4 < 1.000 Gini=0.405
X4 < 0.000 Gini=0.408
X4 < 1.000 Gini=0.405
Split: [X3 < 1.000]

Entropy and the ID3 algorithm

ID3, learns decision trees by constructing them topdown, beginning with the question which attribute should be tested at the root of the tree?

  1. Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples.

  2. The best attribute is selected and used as the test at the root node of the tree.

  3. A descendant of the root node is then created for each possible value of this attribute.

  4. Training examples are sorted to the appropriate descendant node.

  5. The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.

  6. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.

The ID3 algorithm selects, which attribute to test at each node in the tree.

We would like to select the attribute that is most useful for classifying examples.

What is a good quantitative measure of the worth of an attribute?

Information gain measures how well a given attribute separates the training examples according to their target classification.

The ID3 algorithm uses this information gain measure to select among the candidate attributes at each step while growing the tree.

Implementing the ID3 Algorithm


In [6]:
import re
import math
from collections import deque

# x is examples in training set
# y is set of targets
# label is target attributes
# Node is a class which has properties values, childs, and next
# root is top node in the decision tree

class Node(object):
	def __init__(self):
		self.value = None
		self.next = None
		self.childs = None

# Simple class of Decision Tree
# Aimed for who want to learn Decision Tree, so it is not optimized
class DecisionTree(object):
	def __init__(self, sample, attributes, labels):
		self.sample = sample
		self.attributes = attributes
		self.labels = labels
		self.labelCodes = None
		self.labelCodesCount = None
		self.initLabelCodes()
		# print(self.labelCodes)
		self.root = None
		self.entropy = self.getEntropy([x for x in range(len(self.labels))])

	def initLabelCodes(self):
		self.labelCodes = []
		self.labelCodesCount = []
		for l in self.labels:
			if l not in self.labelCodes:
				self.labelCodes.append(l)
				self.labelCodesCount.append(0)
			self.labelCodesCount[self.labelCodes.index(l)] += 1

	def getLabelCodeId(self, sampleId):
		return self.labelCodes.index(self.labels[sampleId])

	def getAttributeValues(self, sampleIds, attributeId):
		vals = []
		for sid in sampleIds:
			val = self.sample[sid][attributeId]
			if val not in vals:
				vals.append(val)
		# print(vals)
		return vals

	def getEntropy(self, sampleIds):
		entropy = 0
		labelCount = [0] * len(self.labelCodes)
		for sid in sampleIds:
			labelCount[self.getLabelCodeId(sid)] += 1
		# print("-ge", labelCount)
		for lv in labelCount:
			# print(lv)
			if lv != 0:
				entropy += -lv/len(sampleIds) * math.log(lv/len(sampleIds), 2)
			else:
				entropy += 0
		return entropy

	def getDominantLabel(self, sampleIds):
		labelCodesCount = [0] * len(self.labelCodes)
		for sid in sampleIds:
			labelCodesCount[self.labelCodes.index(self.labels[sid])] += 1
		return self.labelCodes[labelCodesCount.index(max(labelCodesCount))]

	def getInformationGain(self, sampleIds, attributeId):
		gain = self.getEntropy(sampleIds)
		attributeVals = []
		attributeValsCount = []
		attributeValsIds = []
		for sid in sampleIds:
			val = self.sample[sid][attributeId]
			if val not in attributeVals:
				attributeVals.append(val)
				attributeValsCount.append(0)
				attributeValsIds.append([])
			vid = attributeVals.index(val)
			attributeValsCount[vid] += 1
			attributeValsIds[vid].append(sid)
		# print("-gig", self.attributes[attributeId])
		for vc, vids in zip(attributeValsCount, attributeValsIds):
			# print("-gig", vids)
			gain -= vc/len(sampleIds) * self.getEntropy(vids)
		return gain

	def getAttributeMaxInformationGain(self, sampleIds, attributeIds):
		attributesEntropy = [0] * len(attributeIds)
		for i, attId in zip(range(len(attributeIds)), attributeIds):
			attributesEntropy[i] = self.getInformationGain(sampleIds, attId)
		maxId = attributeIds[attributesEntropy.index(max(attributesEntropy))]
		return self.attributes[maxId], maxId

	def isSingleLabeled(self, sampleIds):
		label = self.labels[sampleIds[0]]
		for sid in sampleIds:
			if self.labels[sid] != label:
				return False
		return True

	def getLabel(self, sampleId):
		return self.labels[sampleId]

	def id3(self):
		sampleIds = [x for x in range(len(self.sample))]
		attributeIds = [x for x in range(len(self.attributes))]
		self.root = self.id3Recv(sampleIds, attributeIds, self.root)

	def id3Recv(self, sampleIds, attributeIds, root):
		root = Node() # Initialize current root
		if self.isSingleLabeled(sampleIds):
			root.value = self.labels[sampleIds[0]]
			return root
		# print(attributeIds)
		if len(attributeIds) == 0:
			root.value = self.getDominantLabel(sampleIds)
			return root
		bestAttrName, bestAttrId = self.getAttributeMaxInformationGain(
			sampleIds, attributeIds)
		# print(bestAttrName)
		root.value = bestAttrName
		root.childs = []  # Create list of children
		for value in self.getAttributeValues(sampleIds, bestAttrId):
			# print(value)
			child = Node()
			child.value = value
			root.childs.append(child)  # Append new child node to current
									   # root
			childSampleIds = []
			for sid in sampleIds:
				if self.sample[sid][bestAttrId] == value:
					childSampleIds.append(sid)
			if len(childSampleIds) == 0:
				child.next = self.getDominantLabel(sampleIds)
			else:
				# print(bestAttrName, bestAttrId)
				# print(attributeIds)
				if len(attributeIds) > 0 and bestAttrId in attributeIds:
					toRemove = attributeIds.index(bestAttrId)
					attributeIds.pop(toRemove)
				child.next = self.id3Recv(
					childSampleIds, attributeIds, child.next)
		return root

	def printTree(self):
		if self.root:
			roots = deque()
			roots.append(self.root)
			while len(roots) > 0:
				root = roots.popleft()
				print(root.value)
				if root.childs:
					for child in root.childs:
						print('({})'.format(child.value))
						roots.append(child.next)
				elif root.next:
					print(root.next)


def test():
	f = open('DataFiles/rideclass.csv')
	attributes = f.readline().split(',')
	attributes = attributes[1:len(attributes)-1]
	print(attributes)
	sample = f.readlines()
	f.close()
	for i in range(len(sample)):
		sample[i] = re.sub('\d+,', '', sample[i])
		sample[i] = sample[i].strip().split(',')
	labels = []
	for s in sample:
		labels.append(s.pop())
	# print(sample)
	# print(labels)
	decisionTree = DecisionTree(sample, attributes, labels)
	print("System entropy {}".format(decisionTree.entropy))
	decisionTree.id3()
	decisionTree.printTree()


if __name__ == '__main__':
	test()


['Outlook', 'Temperature', 'Humidity', 'Wind']
System entropy 0.863120568566631
Outlook
(Sunny)
(Overcast)
(Rain)
Humidity
(High)
(Normal)
1
Temperature
(Mild)
(Cool)
Wind
(Weak)
(Strong)
1
1
0
0
1

Cancer Data again now with Decision Trees and other Methods


In [7]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Load the data
cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
print("Test set accuracy with Logistic Regression: {:.2f}".format(logreg.score(X_test,y_test)))
# Support vector machine
svm = SVC(gamma='auto', C=100)
svm.fit(X_train, y_train)
print("Test set accuracy with SVM: {:.2f}".format(svm.score(X_test,y_test)))
# Decision Trees
deep_tree_clf = DecisionTreeClassifier(max_depth=None)
deep_tree_clf.fit(X_train, y_train)
print("Test set accuracy with Decision Trees: {:.2f}".format(deep_tree_clf.score(X_test,y_test)))
#now scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
logreg.fit(X_train_scaled, y_train)
print("Test set accuracy Logistic Regression with scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
# Support Vector Machine
svm.fit(X_train_scaled, y_train)
print("Test set accuracy SVM with scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
# Decision Trees
deep_tree_clf.fit(X_train_scaled, y_train)
print("Test set accuracy with Decision Trees and scaled data: {:.2f}".format(deep_tree_clf.score(X_test_scaled,y_test)))


(426, 30)
(143, 30)
Test set accuracy with Logistic Regression: 0.95
Test set accuracy with SVM: 0.63
Test set accuracy with Decision Trees: 0.90
Test set accuracy Logistic Regression with scaled data: 0.96
Test set accuracy SVM with scaled data: 0.96
Test set accuracy with Decision Trees and scaled data: 0.90
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:757: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
  "of iterations.", ConvergenceWarning)

Another example, the moons again


In [8]:
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12


from sklearn.svm import SVC
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.tree import export_graphviz

Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)


def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)
plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)
plt.show()


Playing around with regions


In [9]:
np.random.seed(6)
Xs = np.random.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32) * 2

angle = np.pi/4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)

tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)
tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)
plt.subplot(122)
plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)

plt.show()


Regression trees


In [10]:
# Quadratic training set + noise
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
y = y + np.random.randn(m, 1) / 10

In [11]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)


Out[11]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

Final regressor code


In [12]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

def plot_regression_predictions(tree_reg, X, y, axes=[0, 1, -0.2, 1], ylabel="$y$"):
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
    y_pred = tree_reg.predict(x1)
    plt.axis(axes)
    plt.xlabel("$x_1$", fontsize=18)
    if ylabel:
        plt.ylabel(ylabel, fontsize=18, rotation=0)
    plt.plot(X, y, "b.")
    plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_regression_predictions(tree_reg1, X, y)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
plt.text(0.21, 0.65, "Depth=0", fontsize=15)
plt.text(0.01, 0.2, "Depth=1", fontsize=13)
plt.text(0.65, 0.8, "Depth=1", fontsize=13)
plt.legend(loc="upper center", fontsize=18)
plt.title("max_depth=2", fontsize=14)

plt.subplot(122)
plot_regression_predictions(tree_reg2, X, y, ylabel=None)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
for split in (0.0458, 0.1298, 0.2873, 0.9040):
    plt.plot([split, split], [-0.2, 1], "k:", linewidth=1)
plt.text(0.3, 0.5, "Depth=2", fontsize=13)
plt.title("max_depth=3", fontsize=14)

plt.show()



In [13]:
tree_reg1 = DecisionTreeRegressor(random_state=42)
tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

x1 = np.linspace(0, 1, 500).reshape(-1, 1)
y_pred1 = tree_reg1.predict(x1)
y_pred2 = tree_reg2.predict(x1)

plt.figure(figsize=(11, 4))

plt.subplot(121)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred1, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.legend(loc="upper center", fontsize=18)
plt.title("No restrictions", fontsize=14)

plt.subplot(122)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred2, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.title("min_samples_leaf={}".format(tree_reg2.min_samples_leaf), fontsize=14)

plt.show()


Pros and cons of trees, pros

  • White box, easy to interpret model. Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches discussed earlier (think of support vector machines)

  • Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!

  • No feature normalization needed

  • Tree models can handle both continuous and categorical data (Classification and Regression Trees)

  • Can model nonlinear relationships

  • Can model interactions between the different descriptive features

  • Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small)

Disadvantages

  • Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches

  • If continuous features are used the tree may become quite large and hence less interpretable

  • Decision trees are prone to overfit the training data and hence do not well generalize the data if no stopping criteria or improvements like pruning, boosting or bagging are implemented

  • Small changes in the data may lead to a completely different tree. This issue can be addressed by using ensemble methods like bagging, boosting or random forests

  • Unbalanced datasets where some target feature values occur much more frequently than others may lead to biased trees since the frequently occurring feature values are preferred over the less frequently occurring ones.

  • If the number of features is relatively large (high dimensional) and the number of instances is relatively low, the tree might overfit the data

  • Features with many levels may be preferred over features with less levels since for them it is more easy to split the dataset such that the sub datasets only contain pure target feature values. This issue can be addressed by preferring for instance the information gain ratio as splitting criteria over information gain

However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved.

Ensemble Methods: From a Single Tree to Many Trees and Extreme Boosting, Meet the Jungle of Methods

As stated above and seen in many of the examples discussed here about a single decision tree, we often end up overfitting our training data. This normally means that we have a high variance. Can we reduce the variance of a statistical learning method?

This leads us to a set of different methods that can combine different machine learning algorithms or just use one of them to construct forests and jungles of trees, homogeneous ones or heterogenous ones. These methods are recognized by different names which we will try to explain here. These are

  1. Voting classifiers

  2. Bagging and Pasting

  3. Random forests

  4. Boosting methods, from adaptive to Extreme Gradient Boosting (XGBoost)

We discuss these methods here.

An Overview of Ensemble Methods

Bagging

The plain decision trees suffer from high variance. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance, if the ratio of $n$ to $p$ is moderately large.

Bootstrap aggregation, or just bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

More bagging

Bagging typically results in improved accuracy over prediction using a single tree. Unfortunately, however, it can be difficult to interpret the resulting model. Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results.

However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree, and it is no longer clear which variables are most important to the procedure. Thus, bagging improves prediction accuracy at the expense of interpretability. Although the collection of bagged trees is much more difficult to interpret than a single tree, one can obtain an overall summary of the importance of each predictor using the MSE (for bagging regression trees) or the Gini index (for bagging classification trees). In the case of bagging regression trees, we can record the total amount that the MSE is decreased due to splits over a given predictor, averaged over all $B$ possible trees. A large value indicates an important predictor. Similarly, in the context of bagging classification trees, we can add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all $B$ trees.

Simple Voting Example, head or tail


In [14]:
heads_proba = 0.51
coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)
cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)
plt.figure(figsize=(8,3.5))
plt.plot(cumulative_heads_ratio)
plt.plot([0, 10000], [0.51, 0.51], "k--", linewidth=2, label="51%")
plt.plot([0, 10000], [0.5, 0.5], "k-", label="50%")
plt.xlabel("Number of coin tosses")
plt.ylabel("Heads ratio")
plt.legend(loc="lower right")
plt.axis([0, 10000, 0.42, 0.58])
save_fig("votingsimple")
plt.show()


Using the Voting Classifier


In [15]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))


LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912

Please, not the moons again! Voting and Bagging


In [16]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)


/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[16]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', RandomFore...rbf', max_iter=-1, probability=False, random_state=42,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [17]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))


LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

In [18]:
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)


/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[18]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', RandomFore...'rbf', max_iter=-1, probability=True, random_state=42,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='soft', weights=None)

In [19]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))


LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

Bagging Examples


In [20]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [21]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


0.904

In [22]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))


0.856

In [23]:
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
save_fig("baggingtree")
plt.show()


Making your own Bootstrap: Changing the Level of the Decision Tree

Let us bring up our good old boostrap example from the linear regression lectures. We change the linerar regression algorithm with a decision tree wth different depths and perform a bootstrap aggregate (in this case we perform as many bootstraps as data points $n$).


In [27]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample
from sklearn.tree import DecisionTreeRegressor

n = 100
n_boostraps = 100
maxdepth = 8

# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
error = np.zeros(maxdepth)
bias = np.zeros(maxdepth)
variance = np.zeros(maxdepth)
polydegree = np.zeros(maxdepth)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# we produce a simple tree first as benchmark
simpletree = DecisionTreeRegressor(max_depth=3) 
simpletree.fit(X_train_scaled, y_train)
simpleprediction = simpletree.predict(X_test_scaled)
for degree in range(1,maxdepth):
    model = DecisionTreeRegressor(max_depth=degree) 
    y_pred = np.empty((y_test.shape[0], n_boostraps))
    for i in range(n_boostraps):
        x_, y_ = resample(X_train_scaled, y_train)
        model.fit(x_, y_)
        y_pred[:, i] = model.predict(X_test_scaled)#.ravel()

    polydegree[degree] = degree
    error[degree] = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )
    bias[degree] = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )
    variance[degree] = np.mean( np.var(y_pred, axis=1, keepdims=True) )
    print('Polynomial degree:', degree)
    print('Error:', error[degree])
    print('Bias^2:', bias[degree])
    print('Var:', variance[degree])
    print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))

mse_simpletree = np.mean( np.mean((y_test - simpleprediction)**2)
plt.xlim(1,maxdepth)
plt.plot(polydegree, error, label='MSE simple tree')
plt.plot(polydegree, mse_simpletree, label='MSE for Bootstrap')
plt.plot(polydegree, bias, label='bias')
plt.plot(polydegree, variance, label='Variance')
plt.legend()
save_fig("baggingboot")
plt.show()


  File "<ipython-input-27-c21b87fe7de5>", line 51
    plt.xlim(1,maxdepth)
      ^
SyntaxError: invalid syntax

Random forests

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. The split is allowed to use only one of those $m$ predictors.

A fresh sample of $m$ predictors is taken at each split, and typically we choose

$$ m\approx \sqrt{p}. $$

In building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors.

The reason for this is rather clever. Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged variable importance random forest trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.

Random Forest Algorithm

The algorithm described here can be applied to both classification and regression problems.

We will grow of forest of say $B$ trees.

  1. For $b=1:B$

    • Draw a bootstrap sample of from the training data organized in our $\boldsymbol{X}$ matrix.

    • We grow then a random forest tree $T_b$ based on the bootstrapped data by repeating the steps outlined till we reach the maximum node size is reached

  2. we select $m \le p$ variables at random from the $p$ predictors/features

  3. pick the best split point among the $m$ features using either the CART algorithm or the ID3 for classification and create a new node

  4. split the node into daughter nodes

  1. Output then the ensemble of trees $\{T_b\}_1^{B}$ and make predictions for either a regression type of problem or a classification type of problem.

Random Forests Compared with other Methods on the Cancer Data


In [25]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Load the data
cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
print("Test set accuracy with Logistic Regression: {:.2f}".format(logreg.score(X_test,y_test)))
# Support vector machine
svm = SVC(gamma='auto', C=100)
svm.fit(X_train, y_train)
print("Test set accuracy with SVM: {:.2f}".format(svm.score(X_test,y_test)))
# Decision Trees
deep_tree_clf = DecisionTreeClassifier(max_depth=None)
deep_tree_clf.fit(X_train, y_train)
print("Test set accuracy with Decision Trees: {:.2f}".format(deep_tree_clf.score(X_test,y_test)))
#now scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
logreg.fit(X_train_scaled, y_train)
print("Test set accuracy Logistic Regression with scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
# Support Vector Machine
svm.fit(X_train_scaled, y_train)
print("Test set accuracy SVM with scaled data: {:.2f}".format(logreg.score(X_test_scaled,y_test)))
# Decision Trees
deep_tree_clf.fit(X_train_scaled, y_train)
print("Test set accuracy with Decision Trees and scaled data: {:.2f}".format(deep_tree_clf.score(X_test_scaled,y_test)))


from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
# Data set not specificied
#Instantiate the model with 500 trees and entropy as splitting criteria
Random_Forest_model = RandomForestClassifier(n_estimators=500,criterion="entropy")
Random_Forest_model.fit(X_train_scaled, y_train)
#Cross validation
accuracy = cross_validate(Random_Forest_model,X_test_scaled,y_test,cv=10)['test_score']
print(accuracy)
print("Test set accuracy with Random Forests and scaled data: {:.2f}".format(Random_Forest_model.score(X_test_scaled,y_test)))


import scikitplot as skplt
y_pred = Random_Forest_model.predict(X_test_scaled)
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
plt.show()
y_probas = Random_Forest_model.predict_proba(X_test_scaled)
skplt.metrics.plot_roc(y_test, y_probas)
plt.show()
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
plt.show()


(426, 30)
(143, 30)
Test set accuracy with Logistic Regression: 0.95
Test set accuracy with SVM: 0.63
Test set accuracy with Decision Trees: 0.87
Test set accuracy Logistic Regression with scaled data: 0.96
Test set accuracy SVM with scaled data: 0.96
Test set accuracy with Decision Trees and scaled data: 0.90
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:757: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
  "of iterations.", ConvergenceWarning)
[0.93333333 0.8        0.93333333 1.         1.         0.92857143
 1.         0.92857143 0.92857143 1.        ]
Test set accuracy with Random Forests and scaled data: 0.98
/usr/local/lib/python3.7/site-packages/matplotlib/cbook/__init__.py:424: MatplotlibDeprecationWarning: 
Passing one of 'on', 'true', 'off', 'false' as a boolean is deprecated; use an actual boolean (True/False) instead.
  warn_deprecated("2.2", "Passing one of 'on', 'true', 'off', 'false' as a "

Compare Bagging on Trees with Random Forests


In [ ]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)

In [ ]:
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
np.sum(y_pred == y_pred_rf) / len(y_pred)

Boosting, a Bird's Eye View

The basic idea is to combine weak classifiers in order to create a good classifier. With a weak classifier we often intend a classifier which produces results which are only slightly better than we would get by random guesses.

This is done by applying in an iterative way a weak (or a standard classifier like decision trees) to modify the data. In each iteration we emphasize those observations which are misclassified by weighting them with a factor.

What is boosting? Additive Modelling/Iterative Fitting

Boosting is a way of fitting an additive expansion in a set of elementary basis functions like for example some simple polynomials. Assume for example that we have a function

$$ f_M(x) = \sum_{i=1}^M \beta_m b(x;\gamma_m), $$

where $\beta_m$ are the expansion parameters to be determined in a minimization process and $b(x;\gamma_m)$ are some simple functions of the multivariable parameter $x$ which is characterized by the parameters $\gamma_m$.

As an example, consider the Sigmoid function we used in logistic regression. In that case, we can translate the function $b(x;\gamma_m)$ into the Sigmoid function

$$ \sigma(t) = \frac{1}{1+\exp{(-t)}}, $$

where $t=\gamma_0+\gamma_1 x$ and the parameters $\gamma_0$ and $\gamma_1$ were determined by the Logistic Regression fitting algorithm.

As another example, consider the cost function we defined for linear regression

$$ C(\boldsymbol{y},\boldsymbol{f}) = \frac{1}{n} \sum_{i=0}^{n-1}(y_i-f(x_i))^2. $$

In this case the function $f(x)$ was replaced by the design matrix $\boldsymbol{X}$ and the unknown linear regression parameters $\boldsymbol{\beta}$, that is $\boldsymbol{f}=\boldsymbol{X}\boldsymbol{\beta}$. In linear regression we can simply invert a matrix and obtain the parameters $\beta$ by

$$ \boldsymbol{\beta}=\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

In iterative fitting or additive modeling, we minimize the cost function with respect to the parameters $\beta_m$ and $\gamma_m$.

Iterative Fitting, Regression and Squared-error Cost Function

The way we proceed is as follows (here we specialize to the squared-error cost function)

  1. Establish a cost function, here ${\cal C}(\boldsymbol{y},\boldsymbol{f}) = \frac{1}{n} \sum_{i=0}^{n-1}(y_i-f_M(x_i))^2$ with $f_M(x) = \sum_{i=1}^M \beta_m b(x;\gamma_m)$.

  2. Initialize with a guess $f_0(x)$. It could be one or even zero or some random numbers.

  3. For $m=1:M$

a. minimize $\sum_{i=0}^{n-1}(y_i-f_{m-1}(x_i)-\beta b(x;\gamma))^2$ wrt $\gamma$ and $\beta$

b. This gives the optimal values $\beta_m$ and $\gamma_m$

c. Determine then the new values $f_m(x)=f_{m-1}(x) +\beta_m b(x;\gamma_m)$

We could use any of the algorithms we have discussed till now. If we use trees, $\gamma$ parameterizes the split variables and split points at the internal nodes, and the predictions at the terminal nodes.

Squared-Error Example and Iterative Fitting

To better understand what happens, let us develop the steps for the iterative fitting using the above squared error function.

For simplicity we assume also that our functions $b(x;\gamma)=1+\gamma x$.

This means that for every iteration $m$, we need to optimize

$$ (\beta_m,\gamma_m) = \mathrm{argmin}_{\beta,\lambda}\hspace{0.1cm} \sum_{i=0}^{n-1}(y_i-f_{m-1}(x_i)-\beta b(x;\gamma))^2=\sum_{i=0}^{n-1}(y_i-f_{m-1}(x_i)-\beta(1+\gamma x_i))^2. $$

We start our iteration by simply setting $f_0(x)=0$. Taking the derivatives with respect to $\beta$ and $\gamma$ we obtain

$$ \frac{\partial {\cal C}}{\partial \beta} = -2\sum_{i}(1+\gamma x_i)(y_i-\beta(1+\gamma x_i))=0, $$

and

$$ \frac{\partial {\cal C}}{\partial \gamma} =-2\sum_{i}\beta x_i(y_i-\beta(1+\gamma x_i))=0. $$

We can then rewrite these equations as (defining $\boldsymbol{w}=\boldsymbol{e}+\gamma \boldsymbol{x})$ with $\boldsymbol{e}$ being the unit vector)

$$ \gamma \boldsymbol{w}^T(\boldsymbol{y}-\beta\gamma \boldsymbol{w})=0, $$

which gives us $\beta = \boldsymbol{w}^T\boldsymbol{y}/(\boldsymbol{w}^T\boldsymbol{w})$. Similarly we have

$$ \beta\gamma \boldsymbol{x}^T(\boldsymbol{y}-\beta(1+\gamma \boldsymbol{x}))=0, $$

which leads to $\gamma =(\boldsymbol{x}^T\boldsymbol{y}-\beta\boldsymbol{x}^T\boldsymbol{e})/(\beta\boldsymbol{x}^T\boldsymbol{x})$. Inserting for $\beta$ gives us an equation for $\gamma$. This is a non-linear equation in the unknown $\gamma$ and has to be solved numerically.

The solution to these two equations gives us in turn $\beta_1$ and $\gamma_1$ leading to the new expression for $f_1(x)$ as $f_1(x) = \beta_1(1+\gamma_1x)$. Doing this $M$ times results in our final estimate for the function $f$.

Iterative Fitting, Classification and AdaBoost

Let us consider a binary classification problem with two outcomes $y_i \in \{-1,1\}$ and $i=0,1,2,\dots,n-1$ as our set of observations. We define a classification function $G(x)$ which produces a prediction taking one or the other of the two values $\{-1,1\}$.

The error rate of the training sample is then

$$ \mathrm{\overline{err}}=\frac{1}{n} \sum_{i=0}^{n-1} I(y_i\ne G(x_i)). $$

The iterative procedure starts with defining a weak classifier whose error rate is barely better than random guessing. The iterative procedure in boosting is to sequentially apply a weak classification algorithm to repeatedly modified versions of the data producing a sequence of weak classifiers $G_m(x)$.

Here we will express our function $f(x)$ in terms of $G(x)$. That is

$$ f_M(x) = \sum_{i=1}^M \beta_m b(x;\gamma_m), $$

will be a function of

$$ G_M(x) = \mathrm{sign} \sum_{i=1}^M \alpha_m G_m(x). $$

Adaptive Boosting, AdaBoost

In our iterative procedure we define thus

$$ f_m(x) = f_{m-1}(x)+\beta_mG_m(x). $$

The simplest possible cost function which leads (also simple from a computational point of view) to the AdaBoost algorithm is the exponential cost/loss function defined as

$$ C(\boldsymbol{y},\boldsymbol{f}) = \sum_{i=0}^{n-1}\exp{(-y_i(f_{m-1}(x_i)+\beta G(x_i))}. $$

We optimize $\beta$ and $G$ for each value of $m=1:M$ as we did in the regression case. This is normally done in two steps. Let us however first rewrite the cost function as

$$ C(\boldsymbol{y},\boldsymbol{f}) = \sum_{i=0}^{n-1}w_i^{m}\exp{(-y_i\beta G(x_i))}, $$

where we have defined $w_i^m= \exp{(-y_if_{m-1}(x_i))}$.

Building up AdaBoost

First, for any $\beta > 0$, we optimize $G$ by setting

$$ G_m(x) = \mathrm{sign} \sum_{i=0}^{n-1} w_i^m I(y_i \ne G_(x_i)), $$

which is the classifier that minimizes the weighted error rate in predicting $y$.

We can do this by rewriting

$$ \exp{-(\beta)}\sum_{y_i=G(x_i)}w_i^m+\exp{(\beta)}\sum_{y_i\ne G(x_i)}w_i^m, $$

which can be rewritten as

$$ (\exp{(\beta)}-\exp{-(\beta)})\sum_{i=0}^{n-1}w_i^mI(y_i\ne G(x_i))+\exp{(-\beta)}\sum_{i=0}^{n-1}w_i^m=0, $$

which leads to

$$ \beta_m = \frac{1}{2}\log{\frac{1-\mathrm{\overline{err}}}{\mathrm{\overline{err}}}}, $$

where we have redefined the error as

$$ \mathrm{\overline{err}}_m=\frac{1}{n}\frac{\sum_{i=0}^{n-1}w_i^mI(y_i\ne G(x_i)}{\sum_{i=0}^{n-1}w_i^m}, $$

which leads to an update of

$$ f_m(x) = f_{m-1}(x) +\beta_m G_m(x). $$

This leads to the new weights

$$ w_i^{m+1} = w_i^m \exp{(-y_i\beta_m G_m(x_i))} $$

Adaptive boosting: AdaBoost, Basic Algorithm

The algorithm here is rather straightforward. Assume that our weak classifier is a decision tree and we consider a binary set of outputs with $y_i \in \{-1,1\}$ and $i=0,1,2,\dots,n-1$ as our set of observations. Our design matrix is given in terms of the feature/predictor vectors $\boldsymbol{X}=[\boldsymbol{x}_0\boldsymbol{x}_1\dots\boldsymbol{x}_{p-1}]$. Finally, we define also a classifier determined by our data via a function $G(x)$. This function tells us how well we are able to classify our outputs/targets $\boldsymbol{y}$.

We have already defined the misclassification error $\mathrm{err}$ as

$$ \mathrm{err}=\frac{1}{n}\sum_{i=0}^{n-1}I(y_i\ne G(x_i)), $$

where the function $I()$ is one if we misclassify and zero if we classify correctly.

Basic Steps of AdaBoost

With the above definitions we are now ready to set up the algorithm for AdaBoost. The basic idea is to set up weights which will be used to scale the correctly classified and the misclassified cases.

  1. We start by initializing all weights to $w_i = 1/n$, with $i=0,1,2,\dots n-1$. It is easy to see that we must have $\sum_{i=0}^{n-1}w_i = 1$.

  2. We rewrite the misclassification error as

$$ \mathrm{\overline{err}}_m=\frac{\sum_{i=0}^{n-1}w_i^m I(y_i\ne G(x_i))}{\sum_{i=0}^{n-1}w_i}, $$
  1. Then we start looping over all attempts at classifying, namely we start an iterative process for $m=1:M$, where $M$ is the final number of classifications. Our given classifier could for example be a plain decision tree.

a. Fit then a given classifier to the training set using the weights $w_i$.

b. Compute then $\mathrm{err}$ and figure out which events are classified properly and which are classified wrongly.

c. Define a quantity $\alpha_{m} = \log{(1-\mathrm{\overline{err}}_m)/\mathrm{\overline{err}}_m}$

d. Set the new weights to $w_i = w_i\times \exp{(\alpha_m I(y_i\ne G(x_i)}$.

  1. Compute the new classifier $G(x)= \sum_{i=0}^{n-1}\alpha_m I(y_i\ne G(x_i)$.

For the iterations with $m \le 2$ the weights are modified individually at each steps. The observations which were misclassified at iteration $m-1$ have a weight which is larger than those which were classified properly. As this proceeds, the observations which were difficult to classifiy correctly are given a larger influence. Each new classification step $m$ is then forced to concentrate on those observations that are missed in the previous iterations.

AdaBoost Examples

Using Scikit-Learn it is easy to apply the adaptive boosting algorithm, as done here.


In [26]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train_scaled, y_train)
y_pred = ada_clf.predict(X_test_scaled)
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
plt.show()
y_probas = ada_clf.predict_proba(X_test_scaled)
skplt.metrics.plot_roc(y_test, y_probas)
plt.show()
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
plt.show()


/usr/local/lib/python3.7/site-packages/matplotlib/cbook/__init__.py:424: MatplotlibDeprecationWarning: 
Passing one of 'on', 'true', 'off', 'false' as a boolean is deprecated; use an actual boolean (True/False) instead.
  warn_deprecated("2.2", "Passing one of 'on', 'true', 'off', 'false' as a "

AdaBoost for Regression

Here we present Drucker's AdaBoost tailored for regression.

In bagging, each training example is equally likely to be picked. In boosting, the probability of a particular example being in the training set of a particular machine depends on the performance of the prior machines on that example. The following is a modification of Adaboost by Drucker.

Start by selecting a set of training data $n$ and assign to each entry a weight $w_i=1$ for $i=1,2,\dots,n$. As we have done earlier, we could pick say $80\%$ of the data set for training. The algorithm runs as follows:

  1. We define the probability that the training sample $i$ is in the set by $p_i = w_i/\sum_iw_i$. We pick $n$ samples (with replacement) to form our training set. We pick a number uniformly in the range $[0,\sum_iw_i]$.

  2. We choose then a regression machine (for example plain linear regression or a simple decision tree). A given regression machine makes then a hypothesis.

  3. Using every member of the training set with the chosen regression machine we obtain then a prediction $\tilde{y}_i$.

  4. We calculate then the loss function $L_i$ for each training sample. We can use various types of loss function as long as we have a value

$L_i\in [0,1]$.

Gradient boosting: Basics with Steepest Descent

Gradient boosting is again a similar technique to Adaptive boosting, it combines so-called weak classifiers or regressors into a strong method via a series of iterations.

In order to understand the method, let us illustrate its basics by bringing back the essential steps in linear regression, where our cost function was the least squares function.

The Squared-Error again! Steepest Descent

We start again with our cost function ${\cal C}(\boldsymbol{y}m\boldsymbol{f})=\sum_{i=0}^{n-1}{\cal L}(y_i, f(x_i))$ where we want to minimize This means that for every iteration, we need to optimize

$$ (\hat{\boldsymbol{f}}) = \mathrm{argmin}_{\boldsymbol{f}}\hspace{0.1cm} \sum_{i=0}^{n-1}(y_i-f(x_i))^2. $$

We define a real function $h_m(x)$ that defines our final function $f_M(x)$ as

$$ f_M(x) = \sum_{m=0}^M h_m(x). $$

In the steepest decent approach we approximate $h_m(x) = -\rho_m g_m(x)$, where $\rho_m$ is a scalar and $g_m(x)$ the gradient defined as

$$ g_m(x_i) = \left[ \frac{\partial {\cal L}(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i)=f_{m-1}(x_i)}. $$

With the new gradient we can update $f_m(x) = f_{m-1}(x) -\rho_m g_m(x)$. Using the above squared-error function we see that the gradient is $g_m(x_i) = -2(y_i-f(x_i))$.

Choosing $f_0(x)=0$ we obtain $g_m(x) = -2y_i$ and inserting this into the minimization problem for the cost function we have

$$ (\rho_1) = \mathrm{argmin}_{\rho}\hspace{0.1cm} \sum_{i=0}^{n-1}(y_i+2\rho y_i)^2. $$

Steepest Descent Example

Optimizing with respect to $\rho$ we obtain (taking the derivative) that $\rho_1 = -1/2$. We have then that

$$ f_1(x) = f_{0}(x) -\rho_1 g_1(x)=-y_i. $$

We can then proceed and compute

$$ g_2(x_i) = \left[ \frac{\partial {\cal L}(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i)=f_{1}(x_i)=y_i}=-4y_i, $$

and find a new value for $\rho_2=-1/2$ and continue till we have reached $m=M$. We can modify the steepest descent method, or steepest boosting, by introducing what is called gradient boosting.

Gradient Boosting, algorithm

Suppose we have a cost function $C(f)=\sum_{i=0}^{n-1}L(y_i, f(x_i))$ where $y_i$ is our target and $f(x_i)$ the function which is meant to model $y_i$. The above cost function could be our standard squared-error function

$$ C(\boldsymbol{y},\boldsymbol{f})=\sum_{i=0}^{n-1}(y_i-f(x_i))^2. $$

The way we proceed in an iterative fashion is to

  1. Initialize our estimate $f_0(x)$.

  2. For $m=1:M$, we

a. compute the negative gradient vector $\boldsymbol{u}_m = -\partial C(\boldsymbol{y},\boldsymbol{f})/\partial \boldsymbol{f}(x)$ at $f(x) = f_{m-1}(x)$;

b. fit the so-called base-learner to the negative gradient $h_m(u_m,x)$;

c. update the estimate $f_m(x) = f_{m-1}(x)+\nu h_m(u_m,x)$;

  1. The final estimate is then $f_M(x) = \sum_{m=1}^M\nu h_m(u_m,x)$.

Gradient Boosting Example, Regression

We discuss here the difference between the steepest descent approach and gradient boosting by repeating our simple regression example above.

Gradient Boosting, Examples of Regression


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
import scikitplot as skplt
from sklearn.metrics import mean_squared_error

n = 100
maxdegree = 6

# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)

error = np.zeros(maxdegree)
bias = np.zeros(maxdegree)
variance = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

for degree in range(1,maxdegree):
    model = GradientBoostingRegressor(max_depth=degree, n_estimators=100, learning_rate=1.0)  
    model.fit(X_train_scaled,y_train)
    y_pred = model.predict(X_test_scaled)
    polydegree[degree] = degree
    error[degree] = np.mean( np.mean((y_test - y_pred)**2) )
    bias[degree] = np.mean( (y_test - np.mean(y_pred))**2 )
    variance[degree] = np.mean( np.var(y_pred) )
    print('Max depth:', degree)
    print('Error:', error[degree])
    print('Bias^2:', bias[degree])
    print('Var:', variance[degree])
    print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))

plt.xlim(1,maxdegree-1)
plt.plot(polydegree, error, label='Error')
plt.plot(polydegree, bias, label='bias')
plt.plot(polydegree, variance, label='Variance')
plt.legend()
save_fig("gdregression")
plt.show()

Gradient Boosting, Classification Example


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.datasets import load_breast_cancer
import scikitplot as skplt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Load the data
cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
#now scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

gd_clf = GradientBoostingClassifier(max_depth=3, n_estimators=100, learning_rate=1.0)  
gd_clf.fit(X_train_scaled, y_train)
#Cross validation
accuracy = cross_validate(gd_clf,X_test_scaled,y_test,cv=10)['test_score']
print(accuracy)
print("Test set accuracy with Random Forests and scaled data: {:.2f}".format(gd_clf.score(X_test_scaled,y_test)))

import scikitplot as skplt
y_pred = gd_clf.predict(X_test_scaled)
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
save_fig("gdclassiffierconfusion")
plt.show()
y_probas = gd_clf.predict_proba(X_test_scaled)
skplt.metrics.plot_roc(y_test, y_probas)
save_fig("gdclassiffierroc")
plt.show()
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
save_fig("gdclassiffiercgain")
plt.show()

XGBoost: Extreme Gradient Boosting

XGBoost or Extreme Gradient Boosting, is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting that solve many data science problems in a fast and accurate way. See the article by Chen and Guestrin.

The authors design and build a highly scalable end-to-end tree boosting system. It has a theoretically justified weighted quantile sketch for efficient proposal calculation. It introduces a novel sparsity-aware algorithm for parallel tree learning and an effective cache-aware block structure for out-of-core tree learning.

It is now the algorithm which wins essentially all ML competitions!!!

Regression Case


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
import scikitplot as skplt
from sklearn.metrics import mean_squared_error

n = 100
maxdegree = 6

# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)

error = np.zeros(maxdegree)
bias = np.zeros(maxdegree)
variance = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

for degree in range(maxdegree):
    model =  xgb.XGBRegressor(objective ='reg:squarederror', colsaobjective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,max_depth = degree, alpha = 10, n_estimators = 200)

    model.fit(X_train_scaled,y_train)
    y_pred = model.predict(X_test_scaled)
    polydegree[degree] = degree
    error[degree] = np.mean( np.mean((y_test - y_pred)**2) )
    bias[degree] = np.mean( (y_test - np.mean(y_pred))**2 )
    variance[degree] = np.mean( np.var(y_pred) )
    print('Max depth:', degree)
    print('Error:', error[degree])
    print('Bias^2:', bias[degree])
    print('Var:', variance[degree])
    print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))

plt.xlim(1,maxdegree-1)
plt.plot(polydegree, error, label='Error')
plt.plot(polydegree, bias, label='bias')
plt.plot(polydegree, variance, label='Variance')
plt.legend()
plt.show()

Xgboost on the Cancer Data

As you will see from the confusion matrix below, XGBoots does an excellent job on the Wisconsin cancer data and outperforms essentially all agorithms we have discussed till now.


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import  train_test_split 
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
import scikitplot as skplt
import xgboost as xgb
# Load the data
cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
print(X_train.shape)
print(X_test.shape)
#now scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

xg_clf = xgb.XGBClassifier()
xg_clf.fit(X_train_scaled,y_train)

y_test = xg_clf.predict(X_test_scaled)

print("Test set accuracy with Random Forests and scaled data: {:.2f}".format(xg_clf.score(X_test_scaled,y_test)))

import scikitplot as skplt
y_pred = xg_clf.predict(X_test_scaled)
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
save_fig("xdclassiffierconfusion")
plt.show()
y_probas = xg_clf.predict_proba(X_test_scaled)
skplt.metrics.plot_roc(y_test, y_probas)
save_fig("xdclassiffierroc")
plt.show()
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
save_fig("gdclassiffiercgain")
plt.show()


xgb.plot_tree(xg_clf,num_trees=0)
plt.rcParams['figure.figsize'] = [50, 10]
save_fig("xgtree")
plt.show()

xgb.plot_importance(xg_clf)
plt.rcParams['figure.figsize'] = [5, 5]
save_fig("xgparams")
plt.show()