Idea behind decision tree: split the space of the attribute variables with recursive, binary split, with the aim of high purity for all regions.
The collection of paths to all regions makes a tree.
attribute, or attribute variable: a dimension in which a data point has a value (typically excluding the target variable)
target variable: the variable whose value is to be predicted
the attributes of the i-th data point are labeled X_i.
the value of the target variable for the i-th data point is labeled y_i.
Trees that predict a quantitative target variable are called regression trees, and trees that predict qualitative targets are called classification trees.
In [ ]:
%pylab inline
In [ ]:
from sklearn import datasets
iris = datasets.load_iris()
In [ ]:
import numpy as np
import random
Create training and test data sets
In [ ]:
iris_X = iris.data
iris_y = iris.target
r = random.randint(0,100)
np.random.seed(r)
idx = np.random.permutation(len(iris_X))
subset = 25
iris_X_train = iris_X[idx[:-subset]] # all but the last 'subset' rows
iris_y_train = iris_y[idx[:-subset]]
iris_X_test = iris_X[idx[-subset:]] # the last 'subset' rows
iris_y_test = iris_y[idx[-subset:]]
Train a classification tree
In [ ]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris_X_train,iris_y_train)
Print the predicted class of iris
In [ ]:
clf.predict(iris_X_train)
Print the true class of iris
In [ ]:
iris_y_train
In [ ]:
clf.predict(iris_X_test)
In [ ]:
iris_y_test
In [ ]:
from sklearn.externals.six import StringIO # StringIO streams data as a string to "output file"
from IPython.display import Image # need Image to display inline
# export the tree data as a string to a file
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
#
import pydot
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
In [ ]:
Image(filename="gini.png")
where p_mk is the proportion of training data in the m-th region that are from the k-th class.
Values of p_mk close to 0 or 1 represent better purity, so we minimize G.
Cross validation is a generalization of the testing/training data set paradigm. A reasonable test for the validity of a tree is re-sample the training and testing data set, re-fitting the tree each time. Small variations in the resulting trees indicate a stable model.
In [ ]:
classifier_1 = tree.DecisionTreeClassifier()
X = numpy.array([[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
Y = numpy.array([0,1,2,3,4,5,6,7,8,9,10])
classifier_1 = classifier_1.fit(X,Y)
In [ ]:
classifier_1.predict(X)
In [ ]:
from sklearn.externals.six import StringIO # StringIO streams data as a string to "output file"
from IPython.display import Image # need Image to display inline
# export the tree data as a string to a file
dot_data = StringIO()
tree.export_graphviz(classifier_1, out_file=dot_data)
#
import pydot
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
In [ ]:
classifier_2 = tree.DecisionTreeClassifier(max_depth=3)
classifier_2 = classifier_2.fit(X,Y)
classifier_2.predict(X)
In [ ]:
dot_data = StringIO()
tree.export_graphviz(classifier_2, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
In [ ]:
Image(filename="rss.png")
where x_i and y_i are the attribute and target variables for the i-th training data point, and y_hat is the mean of the target variables in the region.
In [ ]:
# Create a random dataset
rng = np.random.RandomState(1)
# Set the range to [0,5] and sort it numerically
X = np.sort(5 * rng.rand(80, 1), axis=0)
# for target, take the sine of the data, and place it in an array
y = np.sin(X).ravel()
# add some noise to every fifth point
y[::5] += 3 * (0.5 - rng.rand(16))
In [ ]:
# use a regression tree model
from sklearn.tree import DecisionTreeRegressor
clf_1 = DecisionTreeRegressor(max_depth=2)
clf_2 = DecisionTreeRegressor(max_depth=5)
clf_3 = DecisionTreeRegressor()
clf_1.fit(X, y)
clf_2.fit(X, y)
clf_3.fit(X, y)
In [ ]:
# generate test data in correct range, and place each pt in its own array
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = clf_1.predict(X_test)
y_2 = clf_2.predict(X_test)
y_3 = clf_3.predict(X_test)
In [ ]:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(X, y, c="k", label="data")
plt.plot(X_test, y_1, c="g", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, c="r", label="max_depth=5", linewidth=2)
plt.plot(X_test, y_3, c="b", label="max_depth=inf", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
In [ ]:
dot_data = StringIO()
tree.export_graphviz(clf_1, out_file=dot_data)
tree.export_graphviz(clf_2, out_file=dot_data)
tree.export_graphviz(clf_3, out_file=dot_data)
In [ ]:
graph = pydot.graph_from_dot_data(dot_data.getvalue())
In [ ]:
Image(graph[0].create_png())
In [ ]:
Image(graph[1].create_png())
In [ ]:
Image(graph[2].create_png())
Options for overfitting:
maximum depth
minimum training data points per region
pruning
Not implemented in scikit-learn
uses cross validation to remove nodes from the tree in such a way that one makes an optimal tradeoff between the tree's complexity and its fit to the training data.
create an ensemble of trees, based on a subdivision of the training data
average the results of the ensemble
deal with the fact that tree production is greedy, and always uses the strongest split first.
base each split on a random subset of attributes
combine as in bagging
In [ ]: