Decision trees are extremely intuitive ways of classifying data: you simply ask a series of questions in order to close in on the class label. For example, if you are into value investing, you might think of the following scheme to decide whether or not to invest in a certain stock:
(P/B = Price/Book ratio, P/E = Price/Earnings ratio, PEG = Price/Earnings to growth ratio, E = Equity, D = Debt).
Sources tell me this scheme is not Warren Buffett's key to success - it is obviously an overly simplified illustration. But it serves well to demonstrate the idea behind decision trees in ML. The binary splits narrow down the options. The big question, of course, is what questions we ought to ask to derive the desired answer, and in what sequence. This we will discuss in the next sections - first intuitively and then in more rigorous terms.
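To make the idea concrete, here is a minimal sketch of such a question-by-question scheme in Python. The thresholds and the ordering of the questions are purely hypothetical and only serve to illustrate how a sequence of binary splits narrows down the decision:

# Purely hypothetical thresholds - for illustration only, not investment advice
def invest_decision(pe, pb, peg, debt_to_equity):
    """Walk through a hand-made sequence of binary questions and return 'buy' or 'pass'."""
    if pe > 25:                 # Is the stock expensive relative to earnings?
        return 'pass'
    if peg > 2:                 # Is future growth already priced in?
        return 'pass'
    if debt_to_equity > 1:      # Is the balance sheet too leveraged (D > E)?
        return 'pass'
    if pb > 3:                  # Is the price rich relative to book value?
        return 'pass'
    return 'buy'

print(invest_decision(pe=15, pb=1.2, peg=1.1, debt_to_equity=0.4))   # -> buy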
To draw up a simple decision tree, we follow these two steps (James et al. (2013)):
Let us illustrate this on a two-dimensional data set with features $X_1$ and $X_2$. The color of the dots indicates the true class label.
Without any split, we have but one region. A decision tree now splits the region iteratively into predictor spaces $R_m$. This is shown below. The first figure on the left has two spaces ($R_1, R_2$), the second already has four ($R_1, R_2, R_3, R_4$), etc. The background color indicates the label that the model would assign to a new data point in the respective area. Note that (as in the introductory decision tree figure) each branch can have a separate number of splits; less pure nodes get split more than purer ones.
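A plot of this kind can be generated with a few lines of code. The sketch below is only an approximation of the setup: it uses a synthetic data set from make_blobs (an assumption, not the data behind the original figures) and colors the predicted regions for increasing tree depths:

# Sketch: color the predicted regions R_m of trees with increasing depth (synthetic data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, depth in zip(axes, [1, 2, 3]):
    tree = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    # Evaluate the tree on a dense grid to color the regions
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
    ax.set_title('depth={}'.format(depth))
plt.show()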
The depth=n argument in the figure title refers to the depth of the tree. Nodes that contain only a single class (color) are not split further. All others are split until a stopping criterion is reached, e.g. the maximum depth set via the depth argument.

How do we construct the regions $R_1, \ldots, R_M$? Or, put in other words: how do we decide on the splitting variables (i.e. $X_1, X_2, \ldots, X_p$), the split points and the topology (shape) of the tree? To explain this we focus on the CART (classification and regression tree) approach to decision trees as implemented in Scikit-learn.
In order to describe the mathematics let us assume we have a data set with $p$ inputs for each of the $N$ observations: $(x_i, y_i)$ for $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$.
Most libraries (including Scikit-learn) implement binary decision trees, meaning that each node is split into two child nodes. Hence, for a binary split at node $m$ we take the initial region $R_m$ and select a feature index $j$ (of feature $X_j$) and a threshold $t_m$ such that the resulting two half-planes
$$\begin{equation} R_{\text{left}}(R_m; j, t_m) = \{(X, y) \, | \, X_j < t_m \} \qquad \text{and} \qquad R_{\text{right}}(R_m; j, t_m) = \{(X, y) \, | \, X_j \geq t_m \} \end{equation}$$
maximize the information gain ($G$) over all possible values of $j$ and $t_m$.
Let us define $\theta = (j, t_m)$ to simplify expressions. Given $R_m$ with $N_m$ being the total number of samples in $R_m$ and $n_{\text{left}}$, $n_{\text{right}}$ the number of samples in the left ($R_{\text{left}}$) and right ($R_{\text{right}}$) child nodes, respectively, the information gain function $G$ is defined as
$$\begin{equation} G(R_m; \theta) = H(R_m) - \left( \frac{n_{\text{left}}}{N_m} H(R_{\text{left}}) + \frac{n_{\text{right}}}{N_m} H(R_{\text{right}})\right) \end{equation}$$
Here, $H()$ is simply a measure of impurity which we will get to shortly. With that, the information gain $G$ is simply the difference between the impurity of the parent node and the weighted sum of the child node impurities. The lower the impurity of the child nodes, the larger the information gain (Raschka (2015)).
Our optimization task can therefore be formulated as finding, at every node $m$, the parameter $\theta^*$ that maximizes the information gain:
$$\begin{equation} \theta^* = \arg \max_{\theta} G(R_m; \theta) \end{equation}$$
This is done recursively for every node until a stopping criterion is reached (max. depth, min. number of samples in a region, etc.; see above).
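To make this optimization tangible, the following sketch performs the exhaustive search over $\theta = (j, t_m)$ for a single node of a small toy data set. It uses the Gini index (formally introduced just below) as the impurity measure $H$; the data and function names are illustrative and not part of the original text:

import numpy as np

def gini(y):
    # Impurity H of a node: sum_k p_k (1 - p_k)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y):
    """Exhaustive search for theta* = (j, t_m) maximizing the information gain G."""
    n, p = X.shape
    best = (None, None, -np.inf)                 # (j, t_m, G)
    for j in range(p):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = gini(y) - (len(left) / n * gini(left) + len(right) / n * gini(right))
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Toy example: the split X_0 < 4 separates the two classes perfectly
X_toy = np.array([[1, 5], [2, 4], [3, 7], [4, 1], [5, 2]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1])
print(best_split(X_toy, y_toy))    # feature j=0, threshold 4.0, gain approx. 0.48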
There are three common impurity measures (or splitting criteria), of which only the latter two are recommended for growing decision trees: the classification error rate ($H_E$), the Gini index ($H_G$) and cross-entropy ($H_H$). To discuss them, let us first define the proportion of class $k$ observations in node $m$ (Friedman et al. (2001)):
$$\begin{equation} \hat{p}_{m,k} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k) \end{equation}$$
Earlier we mentioned that all observations in a region $R_m$ are assigned to the same class following a majority vote. In more formal terms this means that observations in node $m$ are classified to the majority class $k(m) = \arg \max_k \hat{p}_{m,k}$.
Now, the impurity measures are defined as follows:
$$\begin{align} &\text{Classification error rate: } & H_E(R_m) &= 1 - \max_k (\hat{p}_{m,k}) \\ &\text{Gini index: } & H_G(R_m) &= \sum_k \hat{p}_{m,k} (1 - \hat{p}_{m,k}) \\ &\text{Cross-entropy: } & H_H(R_m) &= - \sum_k \hat{p}_{m,k} \, \log(\hat{p}_{m,k}) \end{align}$$
Note that for binary classification tasks (e.g. $y \in \{0, 1\}$), the $\log$ in the cross-entropy is usually taken to base 2.
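As a quick sanity check of these definitions, the small sketch below implements the three measures for a vector of class proportions and evaluates them for a mixed, a skewed and a pure binary node (the helper names are illustrative):

import numpy as np

def error_rate(p):
    # H_E = 1 - max_k p_k
    return 1 - np.max(p)

def gini_index(p):
    # H_G = sum_k p_k (1 - p_k)
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def cross_entropy(p):
    # H_H = - sum_k p_k log2(p_k), ignoring empty classes (0 * log 0 := 0)
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A 50/50 node gives H_E = 0.5, H_G = 0.5, H_H = 1.0; a pure node gives 0 for all three
for props in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(props, error_rate(props), gini_index(props), cross_entropy(props))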
The figure below graphs the three impurity measures with respect to the class proportion $p$. As mentioned, the classification error rate should not be used as an impurity measure for growing a tree; in practice it is primarily used to prune a tree - a concept we will not discuss here. Cross-entropy is minimal if all samples at a node belong to the same class and maximal if the samples are distributed uniformly across the classes. Therefore, cross-entropy can be understood as a criterion that attempts to maximize the mutual information in the tree. The Gini index, on the other hand, works towards minimizing the probability of misclassification. Similar to cross-entropy, it is maximal if classes are evenly distributed and minimal if all samples belong to the same class.
In practice both the Gini index and cross-entropy usually produce similar results, and thus it is not advisable to spend much time on evaluating trees using different impurity criteria (Raschka (2015)).
Decision trees have several advantages over the other classification approaches discussed so far:
However, simple decision trees have also disadvantages, some of them are significant:
To show how decision trees are applied in Python we once again rely on functions implemented in the Scikit-learn package. This time we will use the Carseats data set. It is again a data set that accompanies James et al. (2013)'s book and contains information on child car seat sales at 400 different stores. Sales is the response variable, given as the number of units sold (in thousands). All other columns are used as features. A detailed description of the data set can be found here.
In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('Data/Carseats.csv')
df.head()
Out[1]:
Let us assume you are a financial consultant working on a market study that discusses appropriate sales channels for child car seats. You know that your client's strategy is to grow the business. Furthermore, you have learned that to break even at a sales location the company needs to sell approximately 4'000 units. What features drive sales, and in which stores would you advise your client to offer their product? Here a decision tree is very helpful in visualizing the sales drivers.
Before we implement the model we prepare the data. Notice that decision trees are invariant to scaling, meaning we could, but do not have to, scale the values. This is true for both quantitative and categorical variables. What we do need to do is transform the categorical variables ShelveLoc, Urban and US into numeric values. We will use pandas' map method to (a) show an alternative to pd.factorize introduced in previous chapters and (b) ensure a mapping that does not confuse: pd.factorize() maps a column's first entry to the value 0, the second to 1, etc. This would mean that pd.factorize() would label Yes in the columns Urban or US as 0, yet Yes is predominantly represented by a 1. With the map method we preclude any confusion.
In [2]:
# Create 'BreakEven' column with 1 if Sales >= 4k, else 0
df['BreakEven'] = df.Sales.map(lambda x: 1 if x>=4 else 0)
# Replace category names with numbers
df.ShelveLoc = df.ShelveLoc.map({'Bad':0, 'Medium': 1, 'Good': 2})
df.Urban = df.Urban.map({'No':0, 'Yes':1})
df.US = df.US.map({'No':0, 'Yes':1})
print(df.head(3))
In [3]:
# Assign features & response to X and y, respectively
X = df.drop(['Sales', 'BreakEven'], axis=1)
y = df.BreakEven
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Now we are in a position to run the DecisionTreeClassifier.
In [4]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X_train, y_train)
Out[4]:
Creating a decision tree is as simple as that. From the above output we see that by default the function uses the Gini index as splitting criterion. The only argument we set is the maximum depth. Alternatively we could define the maximum number of leaf nodes, the minimum number of samples required to split an internal node or the minimum number of samples required to be at a leaf node. There are more options; it is best to check the documentation page to get a thorough understanding of what is available.
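For reference, these alternative stopping criteria map to constructor arguments as sketched below; the chosen values are arbitrary and the snippet reuses the import and training data from the cells above:

# Arbitrary example values - not tuned
tree_leaves = DecisionTreeClassifier(max_leaf_nodes=8)       # cap on the number of leaf nodes
tree_split  = DecisionTreeClassifier(min_samples_split=20)   # min. samples required to split an internal node
tree_leaf   = DecisionTreeClassifier(min_samples_leaf=10)    # min. samples required at a leaf node
tree_leaf.fit(X_train, y_train)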
As with all other Scikit-learn functions, we can again call the usual performance metrics. The interpretation is as discussed in chapter 8.
In [5]:
# Load metrics sublibrary
from sklearn import metrics
# Print performance metrics
print('True proportion of sales >= 4k: ', y.sum() / y.shape[0])
print('Train score: ', tree.score(X_train, y_train))
print('Test score: ', tree.score(X_test, y_test))
print(37*'-')
# Confusion matrix
y_pred = tree.predict(X_test)
print('Confusion matrix: \n',
      metrics.confusion_matrix(y_test, y_pred))
# Manual confusion matrix as pandas DataFrame
confm = pd.DataFrame({'Predicted sales >=4k': y_pred,
                      'True sales >=4k': y_test})
confm.replace(to_replace={0:'No', 1:'Yes'}, inplace=True)
print(confm.groupby(['True sales >=4k','Predicted sales >=4k']).size().unstack('Predicted sales >=4k'))
Visualizing decision trees was a horrendously cumbersome task in the past. However, with Scikit-learn version 0.21 there is now a function that plots the decision tree very easily. The following code snippet displays the necessary steps. For further details - as usual - visit the Scikit-learn website. Be sure to have the list of class names for the class_names parameter in the correct order.
In [22]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Plot tree
plt.figure(figsize=(14, 6))
plot_tree(tree, class_names=['loss', 'profit'],
          filled=True, rounded=True);
We have mentioned above that one of the major disadvantages of decision trees is the risk of overfitting. Having discussed cross validation in the previous chapter, we ought to ask how cross validation can be used to prevent overfitting. The answer to this seemingly trivial question might be more complex than initially thought. If we apply $k$-fold cross validation and grow a decision tree for every fold in the training set, then these $k$ trees will most certainly look different from fold to fold. Each of the $k$ decision trees will still suffer from overfitting. Furthermore, CV does not tell us which of the $k$ trees we are supposed to select for prediction, and ultimately, this ought to be our goal. If CV produces 10 different trees but doesn't tell us which one of these to choose, we have gained nothing. Choosing the one tree (among the $k$) with, for example, the lowest error rate will not do it either, as this approach is most probably flawed: such a model is based on even less information than what would be available through the full training set, and in general models with more information beat those with less. So CV will not help us on this end.
However, CV is still of use. To recapitulate the idea behind CV: fundamentally, the purpose of cross validation is not to help select a particular instance of the $k$ decision trees but rather to qualify the model, i.e. to provide metrics such as the error rate, which in turn are useful in assessing the level of precision one can expect from the model. Therefore, CV comes into play when we are tuning the model to find the optimal hyperparameters. As an example: usually we do not know the optimal tree form. Does it generalize best with depth 2, 5 or 10? Or is a stopping criterion of no more than 5, 10 or 20 observations per region $R_m$ best? Here we can run different parameter values in combination with CV, which tells us how well the model will generalize to new data given a set of hyperparameters.
Below we show two setups to do this: with loops or grid search. The first approach follows what we learned in the previous chapter on cross validation.
In [28]:
# Max depth
maxDepth = np.array([1, 2, 5, 10])
# Minimum number of samples required to split any internal node
minSamplesNode = np.array([2, 5, 10, 20])
# The minimum number of samples required to be at a leaf/terminal node
minSamplesLeaf = np.array([2, 5, 10, 20])
In [33]:
# Import necessary functions
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Create k-Fold CV object
kFold = StratifiedKFold(n_splits=10)
# Loop through maxDepth values, run CV and print results
for i in maxDepth:
    tree = DecisionTreeClassifier(max_depth=i, random_state=0)
    scrs = cross_val_score(tree, X_train, y_train, cv=kFold)
    print('Score (depth ={0: 3.0f}): {1: .3f} +/- {2: .3f}'.format(i, np.mean(scrs), np.std(scrs)))
print(50*'-')
# Loop through minSamplesNode values, run CV and print results
for i in minSamplesNode:
    tree = DecisionTreeClassifier(min_samples_split=i, random_state=0)
    scrs = cross_val_score(tree, X_train, y_train, cv=kFold)
    print('Score (min sample at node ={0: 3.0f}): {1: .3f} +/- {2: .3f}'.format(i, np.mean(scrs), np.std(scrs)))
print(50*'-')
# Loop through minSamplesLeaf values, run CV and print results
for i in minSamplesLeaf:
    tree = DecisionTreeClassifier(min_samples_leaf=i, random_state=0)
    scrs = cross_val_score(tree, X_train, y_train, cv=kFold)
    print('Score (min sample at leaf ={0: 3.0f}): {1: .3f} +/- {2: .3f}'.format(i, np.mean(scrs), np.std(scrs)))
Based on the output we can conclude that we get better scores with fewer nodes (i.e. shallower trees). As for the minimum samples at a node/leaf, the differences are too small to judge. What we can also conclude, however, is that this is a fairly cumbersome process: three separate loops to find the optimal hyperparameters. Furthermore, these three loops each check only one criterion, but what if we were interested in all possible combinations? Maybe we get better results if we combine them - we will only know if we check. And, as you should be expecting by now, there is a convenient way of doing this: grid search.
The approach of grid search is fairly simple: it is a brute-force search paradigm where we specify a list of values for different hyperparameters. The algorithm evaluates the model performance for each combination of these values to obtain the optimal combination from the given set (Raschka (2015)). As usual we use a code example to show how this works.
In [26]:
from sklearn.model_selection import GridSearchCV
# Define the hyperparameter values to be tested
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': maxDepth,
              'min_samples_split': minSamplesNode,
              'min_samples_leaf': minSamplesLeaf}
# Run brute-force grid search
gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=kFold, n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
Using the preceding code, we train and tune a DecisionTreeClassifier on the given parameter grid. For this we define a dictionary called param_grid and pass it to GridSearchCV. Using the training data we obtain the score of the best-performing model via the best_score_ attribute (here based on the accuracy measure) and the corresponding parameters via the best_params_ attribute.
As it turns out, we are unable to increase the accuracy score by combining stopping criteria beyond the min_samples_leaf hyperparameter. The best result is indeed obtained with a maximum depth of 1 and a min_samples_split criterion of 2. If a combination of the three criteria (or the entropy criterion) yielded better results, the latter two hyperparameters would take values greater than their minimum values of 1 and 2, respectively.
It is important to note that this result should not deceive you into believing that overfitting with decision trees is a myth. Here we seem to have come across a rare exception where a pruned tree of depth 1 yields the best performance. In general, decision trees achieve much better training results the deeper they are grown - hence the risk of overfitting.
Note that grid search might be a convenient and powerful way of tuning hyperparameters, but because it is a brute-force approach it is computationally very expensive. Depending on the number of processors you run your script on and the task at hand, this might take substantial time. If, for whatever reason, this is not feasible, the RandomizedSearchCV class might be a viable alternative. This class draws random parameter combinations from sampling distributions within a specified budget. See the documentation for more details.
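As a rough sketch of how this could look for our example (reusing param_grid and kFold from above; the budget of 20 candidate combinations is arbitrary):

from sklearn.model_selection import RandomizedSearchCV

# Randomly sample 20 hyperparameter combinations instead of evaluating the full grid
rs = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                        param_distributions=param_grid,
                        n_iter=20, scoring='accuracy',
                        cv=kFold, random_state=0, n_jobs=-1)
rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.best_params_)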
Finally, to estimate the performance of these parameters on the independent test data set, we can run these three lines:
In [34]:
# Extract the estimator with the best parameters
clf = gs.best_estimator_
# Fit model given best parameter
clf.fit(X_train, y_train)
# Print out score on Test dataset
print('Test accuracy: {0: .4f}'.format(clf.score(X_test, y_test)))
CV is so valuable because it provides reliable information on how well a model generalizes. Recall that given a set of $n$ independent observations $Z_1, Z_2, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$ (see appendix of the script). This shows that if we average a set of (independent) performance metrics (e.g. classification error), we actually reduce the variance of this error. And in doing so, we increase the validity of said metric.
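A quick numerical illustration of this variance argument (a sketch with simulated, normally distributed observations; the chosen numbers are arbitrary):

import numpy as np

rng = np.random.RandomState(0)
sigma2, n, reps = 4.0, 25, 100000

# Draw 'reps' samples of n independent observations with variance sigma^2 and average each
means = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

print('Theoretical variance of the mean:', sigma2 / n)    # 0.16
print('Empirical variance of the means: ', means.var())   # approx. 0.16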
If we now extend this idea from assessing performance metrics to predicting outcomes, we enter the field of ensemble models. These models produce $k$ independent predictions (e.g. from $k$ trees) and assign the class label based on a majority vote of the $k$ outcomes. In doing so we lower the variance without compromising the low bias of decision trees. It is therefore easy to see why ensemble methods have proven to be extremely valuable, especially in the field of decision trees. One of the more prominent ensemble algorithms with respect to decision trees is the 'Random Forest'.
Let us first define the steps of a random forest model (Raschka (2015, p. 90)):
Step two above might seem odd at first: why would we restrict the model to choose from only a subset of $m$ features (instead of selecting from the complete set of $p$ features)? This is best explained with James et al. (2013, p. 320):
"Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. [...] Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average $(p - m)/p$ of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable."
On a side note: there exists a predecessor algorithm, called bagging, that works similarly to random forests except that it sets $m=p$ in step 2.1 by default. Scikit-learn implements it as BaggingClassifier(). Since random forests improve on the problem of correlated features, they are today clearly the preferred approach. For this reason we will only discuss the random forest implementation.
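For reference only (the remainder of the text sticks to random forests), a minimal sketch of how bagged trees could be fitted on the same training data; the number of trees is arbitrary:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bootstrap-aggregated trees: every split may consider all p features (m = p)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=100, random_state=0, n_jobs=-1)
bag.fit(X_train, y_train)
print('Test accuracy: {0: .4f}'.format(bag.score(X_test, y_test)))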
Though the interpretability of a random forest does not match the simplicity of a single decision tree, a big advantage is that we do not have to worry as much about choosing good hyperparameter values. We have three primary values to set:
Typically, the larger the number of trees $B$, the better the performance of our random forest classifier. But this of course comes at the expense of (potentially significant) increased computational cost. Additionally, the marginal improvement decreases as the number of trees increases, i.e. at a certain point the cost in computation time will outgrow the benefit in prediction accuracy from adding more trees. In the Scikit-learn implementation, this hyperparameter is steered through the n_estimators argument.
The feature subset size $m$ to consider at each node is typically set to $m = \sqrt{p}$, that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors $p$. Scikit-learn uses the max_features argument to control this.
Finally, via the size $n$ of the bootstrap sample we control the bias-variance trade-off. A large value for $n$ decreases randomness and thus such a model is more likely to overfit. On the other hand, preventing overfitting by selecting a smaller value comes at the expense of the model's predictive performance. And since predictive accuracy is what we are most interested in, the vast majority of random forest implementations, including the RandomForestClassifier implementation in Scikit-learn, set the bootstrap sample size $n$ by default to the number of samples in the original training set. This usually provides a good bias-variance trade-off (Raschka (2015)).
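In code, the three values map to arguments of the RandomForestClassifier roughly as follows (a sketch with illustrative settings, not the configuration used in the cell below):

from sklearn.ensemble import RandomForestClassifier

forest_example = RandomForestClassifier(n_estimators=500,     # B: number of trees
                                        max_features='sqrt',  # m: features considered per split (sqrt(p))
                                        bootstrap=True,       # bootstrap samples of training-set size n
                                        random_state=0)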
There is an easily accessible implementation of a random forest classifier in Scikit-learn that we can use. For a description of the available hyperparameters please check again the function's documentation. It is also left to the reader to investigate how CV and/or grid search can improve the performance; the same steps as explained for the DecisionTreeClassifier can be applied.
In [14]:
from sklearn.ensemble import RandomForestClassifier
# Create classifier object and fit it to data
forest = RandomForestClassifier(criterion='gini', random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
Out[14]:
In [15]:
# Print test score
print('Test accuracy: {0: .4f}'.format(forest.score(X_test, y_test)))
In writing this notebook, many resources were consulted. For internet resources the links are provided within the text above and will therefore not be listed again. Beyond these links, the following resources were consulted and are recommended as further reading on the discussed topics: