Decision Trees and Random Forests

Suppose you're classifying dogs, and you know (as an expert in the field) that all poodles have tails shorter than 15mm, while all dachshunds have tails longer than 25mm. However, the difference between golden retrievers and poodles is that golden retrievers are taller than 40mm in height, while dachshunds are less than 35mm tall.

I give you a dataset of dogs, where we've measured the heights and tail lengths. Dutifully, you build a linear model taking in the features $f_h$ and $f_t$, and find a threshold/bias for $\alpha f_h + \beta f_t$.

But, you think, why bother? I already know how to decide the breed of a dog. Why spend all this energy?

In [ ]:
def breed(tail, height):
    if tail > 25:
        if height < 35:
            return 'dachshund'
            return 'golden retriever'
    elif tail < 15:
        return 'poodle'
        return 'dunno'

As you can probably tell, the problem with this strategy is that while it may work in cases where you are an expert, it doesn't help you figure out what to do when you aren't an expert. And there will be data science problems that you will need to solve where you no know nothing about the problem, and there aren't any experts.

Still, the above strategy has some positive factors:

  • It's fast to evaluate (just like linear models, unlike SVMs)
  • It's nonlinear (which means it's sometimes more flexible than linear models)
  • It's very easy to read

Don't discount the latter criteria - when your model is getting 98% accuracy and you want to get to 99% accuracy (to pass your class, to do your job, or to win a million dollars), you're going to want to improve results. What better way to improve your accuracy than to look at the errors that you made and figure out how to fix them? It's very hard to understand why you are making errors in many models, especially SVMs and neural nets, and this model is much easier.

How do we construct decision trees?

Ok, how can we construct these decision trees in cases when we aren't experts and know the correct splits beforehand? It turns out that there are a lot of different ways, most notably (today) CART (Classification and Regression Trees) and C5.0 (some people refer to this as ID3's successor). Let's look at a simplified version of the CART algorithm:

Suppose we want to solve the classification problem of deciding whether animals are good or bad (regression is similar) and we have a labeled data set of the form $$S = \{(\vec{x}, y)_i\}_{i = 1}^n$$ Where $\vec{x}$ is written with an arrow to emphasize that it is a vector of features, $x_i[0]$, $x_i[1]$, ...etc.

We're going to construct the tree in a top-down approach - that is, we're going to start at the top, create a node, and then walk downwards, creating more nodes as we go. Suppose that we want to branch on feature $k$, and there are two possible values for feature $k$ (which is a label, rather than a continuous variable), cat and dog. If we split based on feature $k$, we'll have two clumps of data, $S_{cat}$, and $S_{dog}$. How can we tell whether this is a good split? We'll look at two metrics to evaluate how to choose $k$ at every step.

Gini Impurity

Gini impurity is a measure of the mixiness of a set. If a set of points is really pure (i.e. 100% one label or 100% another label) the Gini impurity will be low. For a given set $S$ (say, $S' = S_{cat}$) we first get the fraction $f_{good}$ of $S'$ that consists of good and the fraction $f_{bad}$ that is bad. More generally, let the labels be $i = [0, 1, 2, ... k]$, and we calculate $f_i$ for all $i$. Then, $$I_G(S') = \sum_{i = 0}^k f_i(1 - f_i)$$

Note: as a function of each $f_i$s, $I_G$ is maximized whenever $f_i = 0$ or $f_i = 1$ (i.e. when the set is the most pure).

So, to choose the best feature to branch on, we just choose whichever reduces the total Gini impurity the most ($I_G(S_{cat}) + I_G(S_{dog})$).

Example: if our branch factor is breed, and the set of cats contains 45 good cats and 15 bad cats, and the set of dogs contains 60 good dogs and only 5 bad dogs, the overall Gini impurity of the split is: $$I_G = \left(\frac{45}{105}\cdot\frac{60}{105} + \frac{15}{20}\cdot\frac{5}{20}\right) + \left(\frac{60}{105}\cdot\frac{45}{105} + \frac{5}{20}\cdot\frac{15}{20}\right) = 0.4324$$

Question: if you look closely above, the Gini impurity of each of the sets above is the same. Is this always true?

However, if we split based on color, and the brown animals were 100 good and 10 bad, and the black animals were 5 good and 10 bad, then the Gini impurity is:

$$I_G = \left(\frac{100}{105}\cdot\frac{5}{105} + \frac{10}{20}\cdot\frac{10}{20}\right) + \left(\frac{5}{105}\cdot\frac{100}{105} + \frac{10}{20}\cdot\frac{100}{20}\right) = 0.2954$$

So, we'd choose to branch based on color, which makes some intuitive sense.

Information Gain

Similar to Gini impurity, information gain is another metric. Basically, for any mixed up set we can measure it's Shannon entropy, which is a measure of randomness. So, at each stage we try to reduce the entropy by as much as possible. However, given a piece of information $I$, the difference in entropy before knowing $I$ (e.g. the value of a feature) and the entropy afterwards is called the information gain.

$$I_E(S') = -\sum_{i = 0}^k f_i \cdot \log k_i$$

What about regression?

If our output isn't class labels (i.e., good or bad), then what are we supposed to do? Fall back on our old friend, mean squared error! We find a mean for each of the target nodes, and then calculate the overall reduction in mean squared error.

Wow, this seems kind of complicated

It is, a little bit. Luckily, we're not cavepeople and we can use the fruits of other people's labor, instead of writing our own code. Everyone's favorite Python library, scikit-learn comes pretty handy here:

In [ ]:
from sklearn import tree

Let's take a look at an example:

Let's load up a sample dataset and walk through using a decision tree on it:

In [ ]:
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

# load data
iris = load_iris()

# just to see what's going on
mask = np.random.randint(len(iris['data']), size=10)
print(iris['data'][mask, :])

So we're predicting class labels given 4 features. Let's do this

In [ ]:
clf = tree.DecisionTreeClassifier()
clf =,

Wait, we're done? That's it?

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import pydotplus
import StringIO
import pydotplus
from scipy import misc

from IPython.display import Image, display
dot_data = tree.export_graphviz(clf, out_file=None, 
                         filled=True, rounded=True,  
graph = pydotplus.graph_from_dot_data(dot_data)  


As you can tell, decision trees are a powerful tool for both classification and regression, formalizing an intuitive method of classification. Decision trees can be more versatile than SVMs, which rely on choosing the right kernel and certain assumptions of the data (some degree of seperability, etc).

So far our supervised learning toolkit consists of regression, decision trees, and SVMs. There are plenty of other models for supervised learning like neural networks and ensemble methods, which we will not get to during our workshops. If you need to use any of these advanced models, Scikit-learn provides easy-to-use implementations for these just like it does for the models that we've seen.

In [ ]: