Introduction

Idea behind decision tree: split the space of the attribute variables with recursive, binary split, with the aim of high purity for all regions.

The collection of paths to all regions makes a tree.

Vocabulary

attribute, or attribute variable: a dimension in which a data point has a value (typically excluding the target variable)
target variable: the variable whose value is to be predicted
the attributes of the i-th data point are labeled X_i.
the value of the target variable for the i-th data point is labeled y_i.
Trees that predict a quantitative target variable are called regression trees, and trees that predict qualitative targets are called classification trees.

Play



In [ ]:

    
%pylab inline

Get iris data and make a simple prediction



In [ ]:

    
from sklearn import datasets
iris = datasets.load_iris()



In [ ]:

    
import numpy as np
import random

Create training and test data sets



In [ ]:

    
iris_X = iris.data
iris_y = iris.target

r = random.randint(0,100)
np.random.seed(r)
idx = np.random.permutation(len(iris_X))

subset = 25

iris_X_train = iris_X[idx[:-subset]]  # all but the last 'subset' rows
iris_y_train = iris_y[idx[:-subset]]
iris_X_test  = iris_X[idx[-subset:]]  # the last 'subset' rows
iris_y_test  = iris_y[idx[-subset:]]

Train a classification tree



In [ ]:

    
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris_X_train,iris_y_train)

Print the predicted class of iris



In [ ]:

    
clf.predict(iris_X_train)

Print the true class of iris



In [ ]:

    
iris_y_train



In [ ]:

    
clf.predict(iris_X_test)



In [ ]:

    
iris_y_test



In [ ]:

    
from sklearn.externals.six import StringIO # StringIO streams data as a string to "output file"
from IPython.display import Image # need Image to display inline

# export the tree data as a string to a file
dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data) 

#
import pydot 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

Split definition

Need a metric to minimize:



In [ ]:

    
Image(filename="gini.png")

where p_mk is the proportion of training data in the m-th region that are from the k-th class.

Values of p_mk close to 0 or 1 represent better purity, so we minimize G.

Cross validation: a side note

Cross validation is a generalization of the testing/training data set paradigm. A reasonable test for the validity of a tree is re-sample the training and testing data set, re-fitting the tree each time. Small variations in the resulting trees indicate a stable model.

A Problematic Example



In [ ]:

    
classifier_1 = tree.DecisionTreeClassifier()
X = numpy.array([[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
Y = numpy.array([0,1,2,3,4,5,6,7,8,9,10])
classifier_1 = classifier_1.fit(X,Y)



In [ ]:

    
classifier_1.predict(X)



In [ ]:

    
from sklearn.externals.six import StringIO # StringIO streams data as a string to "output file"
from IPython.display import Image # need Image to display inline

# export the tree data as a string to a file
dot_data = StringIO() 
tree.export_graphviz(classifier_1, out_file=dot_data) 

#
import pydot 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())



In [ ]:

    
classifier_2 = tree.DecisionTreeClassifier(max_depth=3)
classifier_2 = classifier_2.fit(X,Y)
classifier_2.predict(X)



In [ ]:

    
dot_data = StringIO() 
tree.export_graphviz(classifier_2, out_file=dot_data) 

graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

Notes:

trees aren't great at predicting linear relationships between attrtibute and target variables. But standard linear regression is.
tree size needs to be controlled to avoid over training

Regression Trees

Concepts

The predicted target variable is the mean of all the training target variable in the region
The split betwee R_1 and R_2 minimizes the following:



In [ ]:

    
Image(filename="rss.png")

where x_i and y_i are the attribute and target variables for the i-th training data point, and y_hat is the mean of the target variables in the region.



In [ ]:

    
# Create a random dataset
rng = np.random.RandomState(1)
# Set the range to [0,5] and sort it numerically
X = np.sort(5 * rng.rand(80, 1), axis=0)
# for target, take the sine of the data, and place it in an array
y = np.sin(X).ravel()
# add some noise to every fifth point
y[::5] += 3 * (0.5 - rng.rand(16))



In [ ]:

    
# use a regression tree model
from sklearn.tree import DecisionTreeRegressor

clf_1 = DecisionTreeRegressor(max_depth=2)
clf_2 = DecisionTreeRegressor(max_depth=5)
clf_3 = DecisionTreeRegressor()
clf_1.fit(X, y)
clf_2.fit(X, y)
clf_3.fit(X, y)



In [ ]:

    
# generate test data in correct range, and place each pt in its own array 
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = clf_1.predict(X_test)
y_2 = clf_2.predict(X_test)
y_3 = clf_3.predict(X_test)



In [ ]:

    
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(X, y, c="k", label="data")
plt.plot(X_test, y_1, c="g", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, c="r", label="max_depth=5", linewidth=2)
plt.plot(X_test, y_3, c="b", label="max_depth=inf", linewidth=1)

plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()



In [ ]:

    
dot_data = StringIO()
tree.export_graphviz(clf_1, out_file=dot_data)
tree.export_graphviz(clf_2, out_file=dot_data)
tree.export_graphviz(clf_3, out_file=dot_data)



In [ ]:

    
graph = pydot.graph_from_dot_data(dot_data.getvalue())



In [ ]:

    
Image(graph[0].create_png())



In [ ]:

    
Image(graph[1].create_png())



In [ ]:

    
Image(graph[2].create_png())

Options for overfitting:

maximum depth
minimum training data points per region
pruning

Pruning

Not implemented in scikit-learn
uses cross validation to remove nodes from the tree in such a way that one makes an optimal tradeoff between the tree's complexity and its fit to the training data.

Bagging

create an ensemble of trees, based on a subdivision of the training data
average the results of the ensemble

Random forests

deal with the fact that tree production is greedy, and always uses the strongest split first.
base each split on a random subset of attributes
combine as in bagging



In [ ]: