In [1]:
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.tree
In [2]:
plt.rcParams["figure.figsize"] = [17, 10]
Decision trees are directed graphs that begin with one node and branch out to many. They are a hierarchical data structure that represents data using a divide-and-conquer strategy. There are two main types of decision tree, classification and regression, and both are used to make predictions from data. Classification trees output a discrete category/class/target, while regression trees output real values. Regression tree algorithms were introduced in 1963 (reference).
Moving through a decision tree, each node splits up the input data. Each node is effectively a cluster of cases that is to be split further by branches deeper in the tree. Trees are often binary, with each node split into two subsamples, but they don't have to be binary.
So, imagine there are some colored shapes that can be classified as A, B or C.
A classification decision tree for the colored shapes could look like this:
Decision trees can be seen as a compact way to represent a lot of data. A usual goal in defining a decision tree is to search for one that is as small as possible.
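To make the divide-and-conquer idea concrete with the colored-shapes example above, a classification tree is essentially a nest of feature tests. Here is a minimal hand-written sketch; the features ("color", "shape") and the split tests are invented for illustration, whereas a learned tree would choose its own tests from the training data.
In [ ]:
# a hypothetical, hand-written classification tree for the colored-shapes example
def classify_shape(color, shape):
    if color == "red":
        return "A"      # red shapes fall into class A
    if shape == "circle":
        return "B"      # non-red circles fall into class B
    return "C"          # everything else falls into class C

classify_shape("blue", "circle")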
scikit-learn provides a DecisionTreeClassifier. It takes as input two arrays: an array of data features and an array of class labels, one label for each collection of features.
Create some data. There are features and there are classifications for each collection of features.
In [3]:
# features
X = [
    [0, 0],
    [1, 1]
]
# targets
Y = [
    0,
    1
]
classifier = sklearn.tree.DecisionTreeClassifier()
classifier = classifier.fit(X, Y)
Now, predict the class of some example collection of features.
In [4]:
classifier.predict([[2, 2]])
Out[4]:
The probability of each class can be predicted too, which is the fraction of training samples of the same class in a leaf.
In [5]:
classifier.predict_proba([[2, 2]])
Out[5]:
We can export the tree in Graphviz (DOT) format.
In [6]:
graph = graphviz.Source(sklearn.tree.export_graphviz(classifier, out_file=None))
graph;
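If the inline display isn't convenient, the raw DOT source can be printed or the graph rendered out to an image file. A small aside; the filename and format here are arbitrary choices.
In [ ]:
# inspect the raw DOT source, or render it to a PNG next to the notebook
print(graph.source)
graph.render("simple_tree", format="png")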
Get the iris dataset.
In [7]:
iris = sklearn.datasets.load_iris()
The top bit of the dataset looks like this:
In [8]:
pd.DataFrame(
    data = np.c_[iris["data"], iris["target"]],
    columns = iris["feature_names"] + ["target"]
).head()
Out[8]:
Make a decision tree and then fit it using the features ("data") and class labels ("target") of the iris dataset.
In [9]:
classifier = sklearn.tree.DecisionTreeClassifier()
classifier = classifier.fit(iris.data, iris.target)
Ok, let's look at the tree, but we'll fancy it up this time with colors, feature names, and class names.
In [10]:
graph = graphviz.Source(
    sklearn.tree.export_graphviz(
        classifier,
        out_file = None,
        feature_names = iris.feature_names,
        class_names = iris.target_names,
        filled = True,
        rounded = False,
        special_characters = True,
        proportion = True,
    )
)
graph.render('iris_DT')
graph
Out[10]:
In [11]:
# export_graphviz writes DOT source, so use a .dot filename rather than .svg
sklearn.tree.export_graphviz(
    classifier,
    out_file = "tree_1.dot",
    feature_names = iris.feature_names,
    class_names = iris.target_names,
    filled = True,
    rounded = False,
    special_characters = True,
    proportion = True,
)
Right, so now let's make some predictions.
In [12]:
classifier.predict(iris.data)
Out[12]:
How accurate is it? Well, here is what it should have got:
In [13]:
iris.target
Out[13]:
It matches exactly; well done, decision tree. :) (Keep in mind these are predictions on the training data, so a perfect match is expected.)
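To put a number on that rather than eyeballing the two arrays, the predictions can be compared to the targets directly. A quick sketch; since this is accuracy on the training data, a perfect score is expected.
In [ ]:
import sklearn.metrics

predictions = classifier.predict(iris.data)
# fraction of training samples classified correctly
print(np.mean(predictions == iris.target))
print(sklearn.metrics.accuracy_score(iris.target, predictions))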
Now, let's take a glance at a decision tree for regression, or modelling something. Here, let's model a slightly noisy sine curve.
In [14]:
rng = np.random.RandomState(1)
# 80 sorted samples of a single feature in the range [0, 5)
X = np.sort(5*rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# add noise to every fifth target value
y[::5] += 3*(0.5-rng.rand(16))
In [15]:
plt.scatter(X, y, s=30, edgecolor="black", c="red", label="data")
plt.title("a fuck off noisy sine curve")
plt.xlabel("data")
plt.ylabel("target")
plt.show();
Right, let's create and fit a decision tree regressor with a maximum depth of 2.
In [16]:
regressor = sklearn.tree.DecisionTreeRegressor(max_depth=2)
regressor.fit(X, y);
Ok, let's make some predictions and see how it does.
In [17]:
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_prediction = regressor.predict(X_test)
In [18]:
plt.scatter(X, y, s=30, edgecolor="black", c = "red", label="data")
plt.plot(X_test, y_prediction, color="cornflowerblue", label="max_depth = 2", linewidth=2)
plt.title("just fittin' a noisy sine curve, it's fine")
plt.xlabel("data")
plt.ylabel("target")
plt.legend()
plt.show();
Nice, that looks pretty reasonable! The fit follows the overall shape of the sine curve.
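That step shape is no accident: a regression tree with max_depth=2 has at most four leaves, and each leaf predicts a single constant (the mean of the training targets that land in it), so the prediction is piecewise constant. A quick check of the leaf count and the distinct predicted values (assuming a scikit-learn version that provides get_n_leaves):
In [ ]:
# a depth-2 binary tree has at most 2**2 = 4 leaves, each predicting one constant
print(regressor.get_n_leaves())
print(np.unique(y_prediction))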
In [19]:
graph = graphviz.Source(
sklearn.tree.export_graphviz(
regressor,
out_file = None,
filled = True,
rounded = False
)
)
graph;
Ok, now let's try a tree with greater depth, say a maximum depth of 5.
In [20]:
regressor = sklearn.tree.DecisionTreeRegressor(max_depth=5)
regressor.fit(X, y);
In [21]:
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_prediction = regressor.predict(X_test)
In [22]:
plt.scatter(X, y, s=30, edgecolor="black", c="red", label="data")
plt.plot(X_test, y_prediction, color="cornflowerblue", label="max_depth = 5", linewidth=2)
plt.title("just fittin' a noisy sine curve, but what the Bjork?")
plt.xlabel("data")
plt.ylabel("target")
plt.legend()
plt.show();
Yeah, ok... not so good. The deeper tree is chasing the noise.
It turns out that learning a tree that classifies or models the training data perfectly may not lead to a tree with good generalization performance: this is overfitting. There could be noise in the data (as there was in this example), or the algorithm might be making decisions based on very little data (low statistics).
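One way to see this is to hold out part of the noisy sine data and compare shallow and deep trees on it. A rough sketch; the split fraction and depths here are arbitrary choices, not tuned.
In [ ]:
import sklearn.model_selection

# hold out a third of the noisy sine data to estimate generalization
X_train, X_holdout, y_train, y_holdout = sklearn.model_selection.train_test_split(
    X, y, test_size=0.33, random_state=0
)

for depth in [2, 5, None]:
    tree = sklearn.tree.DecisionTreeRegressor(max_depth=depth)
    tree.fit(X_train, y_train)
    # R^2 on data the tree never saw; deeper trees tend to fit the noise
    # and can score worse here
    print(depth, round(tree.score(X_holdout, y_holdout), 3))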
In [27]:
graph = graphviz.Source(
    sklearn.tree.export_graphviz(
        regressor,
        out_file = None,
        filled = True,
        rounded = False,
        special_characters = True,
        proportion = True,
    )
)
# render under a distinct name so the earlier iris tree file isn't overwritten
graph.render('sine_regression_DT')
graph
Out[27]: