Introduction

Idea behind decision trees: recursively split the space of the attribute variables with binary splits, aiming for high purity in every resulting region.

The collection of paths to all regions makes a tree.

Vocabulary

  • attribute, or attribute variable: a dimension in which a data point has a value (typically excluding the target variable)

  • target variable: the variable whose value is to be predicted

  • the attributes of the i-th data point are labeled X_i.

  • the value of the target variable for the i-th data point is labeled y_i.

  • Trees that predict a quantitative target variable are called regression trees, and trees that predict qualitative targets are called classification trees.

Play


In [ ]:
%matplotlib inline

Get iris data and make a simple prediction


In [ ]:
from sklearn import datasets
iris = datasets.load_iris()

In [ ]:
import numpy as np
import random

Create training and test data sets


In [ ]:
iris_X = iris.data
iris_y = iris.target

r = random.randint(0,100)
np.random.seed(r)
idx = np.random.permutation(len(iris_X))

subset = 25

iris_X_train = iris_X[idx[:-subset]]  # all but the last 'subset' rows
iris_y_train = iris_y[idx[:-subset]]
iris_X_test  = iris_X[idx[-subset:]]  # the last 'subset' rows
iris_y_test  = iris_y[idx[-subset:]]
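
Equivalently, scikit-learn can build the same kind of split; a minimal sketch using train_test_split (the test_size of 25 mirrors subset above):


In [ ]:
from sklearn.model_selection import train_test_split

# hold out 25 rows for testing, mirroring 'subset' above
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=25, random_state=r)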

Train a classification tree


In [ ]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris_X_train,iris_y_train)

Predict the classes of the training set


In [ ]:
clf.predict(iris_X_train)

Based on the target values in the training set, calculate the training accuracy:


In [ ]:
def accuracy(x,y):
    output = []
    for i,j in zip(x,y):
        if i == j:
            output.append(1)
        else:
            output.append(0)
    return np.mean(output)
print("training accuracy: {}".format(accuracy(clf.predict(iris_X_train),iris_y_train)))

And here's the testing accuracy:


In [ ]:
print("testing accuracy: {}".format(accuracy(clf.predict(iris_X_test),iris_y_test)))

Visualize the tree:


In [ ]:
from io import StringIO # in-memory "file" that collects the exported dot data
from IPython.display import Image # need Image to display inline

# export the tree data as a string to a file
dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data) 

# compatible with modern pyparsing
import pydotplus as pydot
# or olde-timey
# import pydot
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())
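
If graphviz/pydot is not available, newer scikit-learn versions (0.21+) can draw the tree directly with matplotlib; a minimal sketch:


In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True)  # requires scikit-learn >= 0.21
plt.show()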

How it works

Split definition

To decide which variable is considered at each node, and what the split point is, we need a metric to minimize:


In [ ]:
Image(filename="gini.png")

where p_mk is the proportion of training data in the m-th region that are from the k-th class.

Values of p_mk close to 0 or 1 represent better purity, so we minimize G.
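
For reference, in case the image is unavailable, the standard Gini index for the m-th region (presumably what gini.png shows) is:

$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})$$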

Cross validation: a side note

Cross validation is a generalization of the testing/training data set paradigm. A reasonable test for the validity of a tree is to re-sample the training and testing data set, re-fitting the tree each time. Small variations in the resulting trees indicate a stable model.
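
A minimal sketch of this idea using scikit-learn's cross_val_score on the iris data (iris_X, iris_y, and the tree module are defined above):


In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross validation: fit the tree on 4/5 of the data, test on the rest, repeat
scores = cross_val_score(tree.DecisionTreeClassifier(), iris_X, iris_y, cv=5)
print("fold accuracies: {}".format(scores))
print("mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))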

A Problematic Example


In [ ]:
classifier_1 = tree.DecisionTreeClassifier()
X = np.array([[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
Y = np.array([0,1,2,3,4,5,6,7,8,9,10])
classifier_1 = classifier_1.fit(X,Y)

In [ ]:
classifier_1.predict(X)

In [ ]:
## print the tree

# export the tree data as a string to a file
dot_data = StringIO()
tree.export_graphviz(classifier_1, out_file=dot_data) 

# parse the dot string and render the tree inline
import pydotplus as pydot
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

The tree shown above is overfit: it memorizes the training data. Let's limit the depth.


In [ ]:
classifier_2 = tree.DecisionTreeClassifier(max_depth=3)
classifier_2 = classifier_2.fit(X,Y)
classifier_2.predict(X)

In [ ]:
dot_data = StringIO() 
tree.export_graphviz(classifier_2, out_file=dot_data) 

graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

Take-away:

  • trees aren't great at capturing linear relationships between attribute and target variables, but standard linear regression is (see the sketch after this list)

  • tree size needs to be controlled to avoid overfitting
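
A minimal sketch contrasting the depth-limited tree with ordinary linear regression on the same data (X, Y, and classifier_2 from the problematic example above):


In [ ]:
from sklearn.linear_model import LinearRegression

lin = LinearRegression().fit(X, Y)
# the data lie exactly on y = x, so the linear model recovers it (up to float precision)
print("linear regression: {}".format(lin.predict(X)))
# the depth-3 tree can only produce piecewise-constant predictions
print("depth-3 tree: {}".format(classifier_2.predict(X)))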

Regression Trees

Concepts

  • The predicted target value is the mean of the training target values in the region

  • The split between R_1 and R_2 minimizes the following:


In [ ]:
Image(filename="rss.png")

where x_i and y_i are the attribute and target variables for the i-th training data point, and y_hat is the mean of the target variables in the region.
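
For reference, in case the image is unavailable, the quantity being minimized (presumably what rss.png shows) is the residual sum of squares over the two candidate regions:

$$\sum_{i:\,x_i \in R_1} (y_i - \hat{y}_{R_1})^2 \;+\; \sum_{i:\,x_i \in R_2} (y_i - \hat{y}_{R_2})^2$$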

Example

Let's create an example with a noisy sine function.


In [ ]:
# Create a random dataset
rng = np.random.RandomState(1)
# Set the range to [0,5] and sort it numerically
X = np.sort(5 * rng.rand(80, 1), axis=0)
# for target, take the sine of the data, and place it in an array
y = np.sin(X).ravel()
# add some noise to every fifth point
y[::5] += 3 * (0.5 - rng.rand(16))

Test a set of regression trees that have different depth limits.


In [ ]:
# use a regression tree model
from sklearn.tree import DecisionTreeRegressor

clf_1 = DecisionTreeRegressor(max_depth=2)
clf_2 = DecisionTreeRegressor(max_depth=5)
clf_3 = DecisionTreeRegressor()
clf_1.fit(X, y)
clf_2.fit(X, y)
clf_3.fit(X, y)

In [ ]:
# generate test data in correct range, and place each pt in its own array 
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = clf_1.predict(X_test)
y_2 = clf_2.predict(X_test)
y_3 = clf_3.predict(X_test)

In [ ]:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(X, y, c="k", label="data")
plt.plot(X_test, y_1, c="g", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, c="r", label="max_depth=5", linewidth=2)
plt.plot(X_test, y_3, c="b", label="max_depth=inf", linewidth=1)

plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

In [ ]:
dot_data = StringIO()
tree.export_graphviz(clf_1, out_file=dot_data)
tree.export_graphviz(clf_2, out_file=dot_data)
tree.export_graphviz(clf_3, out_file=dot_data)

In [ ]:
graph = pydot.graph_from_dot_data(dot_data.getvalue())

Visualization of tree with depth=2


In [ ]:
Image(graph[0].create_png())

Visualization of tree with depth=5


In [ ]:
Image(graph[1].create_png())

Visualization of tree with no limitation on depth.


In [ ]:
Image(graph[2].create_png())

Options for dealing with overfitting (a minimal sketch follows the list):

  • maximum depth

  • minimum training data points per region

  • pruning
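
A minimal sketch of the corresponding DecisionTreeRegressor parameters, using the sine data from above (the specific values are arbitrary):


In [ ]:
# each parameter limits tree growth in a different way
clf_small = DecisionTreeRegressor(
    max_depth=4,          # maximum depth of the tree
    min_samples_leaf=5,   # minimum training points per leaf (region)
)
clf_small.fit(X, y)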

Pruning

  • Not implemented in older scikit-learn releases; newer versions (0.22+) provide minimal cost-complexity pruning via the ccp_alpha parameter (see the sketch below)

  • uses cross validation to remove nodes from the tree, trading off the tree's complexity against its fit to the training data
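
A minimal sketch of cost-complexity pruning in newer scikit-learn (0.22+), applied to the sine data; the ccp_alpha value is arbitrary:


In [ ]:
# grow a full tree, then prune it back by penalizing complexity
pruned = DecisionTreeRegressor(ccp_alpha=0.01).fit(X, y)
print("leaves, unpruned: {}".format(clf_3.get_n_leaves()))
print("leaves, pruned:   {}".format(pruned.get_n_leaves()))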

Bagging

  • create an ensemble of trees, each trained on a bootstrap sample (drawn with replacement) of the training data (see the sketch below)

  • average the results of the ensemble
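
A minimal sketch with scikit-learn's BaggingRegressor, whose default base estimator is a decision tree, fit to the sine data from above:


In [ ]:
from sklearn.ensemble import BaggingRegressor

# 100 trees, each fit to a bootstrap sample of (X, y); predictions are averaged
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)
y_bag = bag.predict(X_test)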

Random forests

  • address the fact that tree construction is greedy and always uses the strongest split first, which makes bagged trees highly correlated

  • base each split on a random subset of attributes

  • combine as in bagging (see the sketch below)
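
A minimal sketch with RandomForestClassifier on the iris split from above; max_features controls the size of the random attribute subset considered at each split:


In [ ]:
from sklearn.ensemble import RandomForestClassifier

# each split considers a random subset of the 4 iris attributes (here sqrt(4) = 2)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(iris_X_train, iris_y_train)
print("testing accuracy: {}".format(forest.score(iris_X_test, iris_y_test)))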