Experiments with entropy, information gain, and decision trees.

Iris fact of the day: Iris setosa's root contains a toxin that was used by the Aleut tribe in Alaska to make poisonous arrowheads.


In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# For producing decision tree diagrams.
from IPython.core.display import Image, display
from sklearn.externals.six import StringIO
import pydot


Couldn't import dot_parser, loading of dot files will not be possible.

If you do not have the pydot library installed, open your terminal and run either conda install pydot or pip install pydot.


In [2]:
# Load the data, which is included in sklearn.
iris = load_iris()
print 'Iris target names:', iris.target_names
print 'Iris feature names:', iris.feature_names
X, Y = iris.data, iris.target

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# Split into train and test.
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]


Iris target names: ['setosa' 'versicolor' 'virginica']
Iris feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

In [3]:
# Define a function that applies a threshold to turn real-valued iris features into 0/1 features.
# 0 will mean "short" and 1 will mean "long".
def binarize_iris(data, thresholds=[6.0, 3.0, 2.5, 1.0]):
    # Initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)

    # Apply a threshold to each feature.
    for feature in range(data.shape[1]):
        binarized_data[:,feature] = data[:,feature] > thresholds[feature]
    return binarized_data

# Create new binarized training and test data
binarized_train_data = binarize_iris(train_data)
binarized_test_data = binarize_iris(test_data)

The plan:

The goal is to identify the data partitioning scheme that maximizes information gain.

Information gain will be measured in terms of entropy.

Let's start by defining a function that computes the entropy of a distribution. Remember that entropy is a measure of uncertainty: for a distribution p, H(p) = -sum_i p_i * log2(p_i), and it is maximized when the distribution is uniform.


In [4]:
def entropy(distribution):
    h = 0.0
    for probability in distribution:
        logprob = -100.0  # finite stand-in for log(0) = -inf, so a zero probability contributes 0 * logprob = 0 rather than nan
        if probability > 0.0: logprob = np.log2(probability)
        h -= probability * logprob
    return h

# Show a plot of the entropy, H(X), of a Bernoulli random variable X.
p_values = np.linspace(0, 1, 50)
entropies = [entropy([p, 1-p]) for p in p_values]
plt.figure(figsize=(4,4))
plt.plot(p_values, entropies, 'o')
plt.xlabel('P(X=1)')
plt.ylabel('H(X)')
print



When you have time, try it with other bases for the log: 10 and "e"
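
One rough way to try that: convert the natural log with np.log(p) / np.log(base). A minimal sketch with a configurable-base variant of the entropy function (the entropy_base helper is just for this experiment):

def entropy_base(distribution, base=2.0):
    # Same loop as entropy(), but with a configurable logarithm base.
    h = 0.0
    for probability in distribution:
        if probability > 0.0:
            h -= probability * (np.log(probability) / np.log(base))
    return h

# A fair coin has entropy 1 bit in base 2, ln(2) ~ 0.693 nats in base e,
# and log10(2) ~ 0.301 in base 10.
for base in [2.0, np.e, 10.0]:
    print('base %.3f  H([0.5, 0.5]) = %.3f' % (base, entropy_base([0.5, 0.5], base)))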

We are interested in the entropy of our distribution over labels.


In [5]:
def get_label_distribution(labels):
    # Initialize counters for all labels to zero.
    label_probs = np.array([0.0 for i in range(len(iris.target_names))])

    # Iterate over labels in the training data and update counts.
    for label in labels:
        label_probs[label] += 1.0
    
    # Normalize to get a distribution.
    label_probs /= label_probs.sum()
    return label_probs

label_probs = get_label_distribution(train_labels)
print 'Label distribution', label_probs

# Compare the label entropy to a uniform distribution.
print 'Label entropy:', entropy(label_probs)
print 'Uniform entropy:', entropy([1./3, 1./3, 1./3])


Label distribution [ 0.31  0.33  0.36]
Label entropy: 1.58223227365
Uniform entropy: 1.58496250072

Very interesting. The distribution of labels is almost indistinguishable from uniform.

A 64-thousand-dollar question: Can we use entropy as a similarity measure for distributions? 
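
One quick way to poke at that question: entropy only looks at the set of probabilities, not at which label they belong to, so two quite different distributions can share exactly the same entropy. A minimal check:

# Entropy is invariant to permuting the probabilities, so it cannot tell
# these two distributions apart even though they disagree on every label.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print('H(p) = %.3f   H(q) = %.3f' % (entropy(p), entropy(q)))

This suggests that entropy by itself is not enough to compare two distributions.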

Now let's figure out which feature provides the greatest information gain. Philosophically, information gain means a reduction in randomness. So we are looking for the feature(s) that reduce entropy the most.

To do this, we compare the entropy of the labels before splitting with the size-weighted entropies of the label subsets after splitting on each feature: IG(feature) = H(labels) - sum over subsets s of (|s| / |labels|) * H(s). In a sense, this is similar to marginalization by feature (like we did last week).


In [6]:
# A function that computes information gain given these inputs:
#   data: an array of featurized examples
#   labels: an array of labels corresponding to the data
#   feature: the feature to use to split the data
#   threshold: the feature value to use to split the data (the default threshold is good for binary features)
def information_gain(data, labels, feature, threshold=0):
    # Get the initial entropy of the label distribution.
    initial_entropy = entropy(get_label_distribution(labels))
    
    # subset0 will contain the labels for which the feature is 0 and
    # subset1 will contain the labels for which the feature is 1.
    subset0, subset1 = [], []
    for datum, label in zip(data, labels):
        if datum[feature] > threshold: subset1.append(label)
        else: subset0.append(label)
    
    # Compute the entropy of each subset.
    subset0_entropy = entropy(get_label_distribution(subset0))
    subset1_entropy = entropy(get_label_distribution(subset1))
    
    # Make it a fair comparison: 
    # Compute the final entropy by weighting each subset's entropy according to its size.
    subset0_weight = 1.0 * len(subset0) / len(labels)
    subset1_weight = 1.0 * len(subset1) / len(labels)
    final_entropy = subset0_weight * subset0_entropy + subset1_weight * subset1_entropy
    
    # Finally, compute information gain as the difference between the initial and final entropy.
    return initial_entropy - final_entropy

for feature in range(binarized_train_data.shape[1]):
    ## We are looking at binarized data, so the default threshold of 0 is the right split point
    ig = information_gain(binarized_train_data, train_labels, feature)
    print '%d %.3f %s' %(feature, ig, iris.feature_names[feature])


0 0.406 sepal length (cm)
1 0.216 sepal width (cm)
2 0.893 petal length (cm)
3 0.780 petal width (cm)

According to the information gain metric, petal length is the most useful feature, followed by petal width. Let's confirm that this agrees with the sklearn decision tree implementation.

Actually, sklearn doesn't expose the information gain values. Instead, it stores the distribution of "feature importances", which reflects the value of each feature in the full decision tree. Let's train a decision tree with max_depth=1 so it will only choose a single feature. Let's also get the test accuracy with this "decision stump".

When you have time, try it with depths between 1 and 4, observe the Feature Importances. What can you conclude?


In [7]:
dt = DecisionTreeClassifier(criterion='entropy', max_depth=1)
dt.fit(binarized_train_data, train_labels)
print 'Using a decision stump -- a tree with depth 1:'
print 'Feature importances:', dt.feature_importances_
print 'Accuracy:', dt.score(binarized_test_data, test_labels)


Using a decision stump -- a tree with depth 1:
Feature importances: [ 0.  0.  1.  0.]
Accuracy: 0.66
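
As a sketch for the depth exercise suggested above, the same experiment can be repeated for max_depth values 1 through 4 (binarized data and entropy criterion, as before):

# Sweep tree depths 1-4 and watch how the feature importances spread out.
for depth in range(1, 5):
    dt_depth = DecisionTreeClassifier(criterion='entropy', max_depth=depth)
    dt_depth.fit(binarized_train_data, train_labels)
    print('max_depth=%d  importances=%s  accuracy=%.2f'
          % (depth, dt_depth.feature_importances_, dt_depth.score(binarized_test_data, test_labels)))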

We've been using the binarized version of the iris features. Recall that we simply chose thresholds for each feature by inspecting feature histograms. Let's use information gain as a metric to choose the best feature and the best threshold.


In [8]:
def try_features_and_thresholds(data, labels):
    for feature in range(data.shape[1]):
        # Choose a set of thresholds between the min- and max-valued feature, ignoring the min and max themselves.
        thresholds = np.linspace(data[:,feature].min(), data[:,feature].max(), 20)[1:-1]

        # Try each threshold and keep track of the best one for this feature.
        best_threshold = 0
        best_ig = 0
        for threshold in thresholds:
            ig = information_gain(data, labels, feature, threshold)
            if ig > best_ig:
                best_ig = ig
                best_threshold = threshold

        # Show the best threshold and information gain for this feature.
        print '%d %.3f %.3f %s' %(feature, best_threshold, best_ig, iris.feature_names[feature])
        
try_features_and_thresholds(train_data, train_labels)


0 5.732 0.525 sepal length (cm)
1 3.389 0.311 sepal width (cm)
2 2.116 0.893 petal length (cm)
3 0.605 0.893 petal width (cm)

It looks like when we binarized our data, we didn't choose the thresholds that maximized information gain for 3 out of 4 features. Let's try training actual decision trees (as opposed to stumps) with the original (non-binarized) data. You may need to install GraphViz before exporting the tree.
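
As a quick check, here is a sketch that re-binarizes the data with the better thresholds printed above and retrains the depth-1 stump, to see whether the accuracy moves:

# Re-binarize using the thresholds found by the information-gain search above.
better_thresholds = [5.732, 3.389, 2.116, 0.605]
rebinarized_train_data = binarize_iris(train_data, better_thresholds)
rebinarized_test_data = binarize_iris(test_data, better_thresholds)

dt_stump = DecisionTreeClassifier(criterion='entropy', max_depth=1)
dt_stump.fit(rebinarized_train_data, train_labels)
print('Stump accuracy with tuned thresholds: %.2f' % dt_stump.score(rebinarized_test_data, test_labels))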

If pydot was installed correctly, you will see an image of the decision tree after running this block of code. Otherwise, you will see error messages, as in my case. Either way, you can uncomment the

print 'dot_data value:', dot_data.getvalue()

line, and that will reveal the structure of the tree.


In [9]:
# Train a decision tree classifier.
dt = DecisionTreeClassifier(criterion='entropy', min_samples_split=2)
dt.fit(train_data, train_labels)
print 'Accuracy:', dt.score(test_data, test_labels)

# Export the trained tree so we can look at it.
output_name = 'iris-decisiontree.jpg'
print output_name

dot_data = StringIO()
tree.export_graphviz(dt, out_file=dot_data)
## print 'dot_data value:', dot_data.getvalue()

graph = pydot.graph_from_dot_data(dot_data.getvalue())


# If the export was successful, show the image.   
if graph.write_jpg(output_name):
    print 'Output:', output_name
    display(Image(filename=output_name))


Accuracy: 0.96
iris-decisiontree.jpg
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-e9c59abd9e4c> in <module>()
     12 ## print 'dot_data value:', dot_data.getvalue()
     13 
---> 14 graph = pydot.graph_from_dot_data(dot_data.getvalue())
     15 
     16 

C:\Anaconda\lib\site-packages\pydot.pyc in graph_from_dot_data(data)
    218     """
    219 
--> 220     return dot_parser.parse_dot_data(data)
    221 
    222 

NameError: global name 'dot_parser' is not defined
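
If the pydot export fails like this and you happen to be on a newer sklearn (0.21 or later, which is newer than the version used in this notebook), tree.export_text prints the same structure without pydot or GraphViz at all; a minimal sketch, assuming such a version is installed:

# GraphViz-free alternative for inspecting the fitted tree (requires sklearn >= 0.21).
from sklearn.tree import export_text
print(export_text(dt, feature_names=list(iris.feature_names)))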

If you successfully output the tree, you should be able to see it here. The first split perfectly partitions the setosas because they have very narrow petals. The next split identifies a pure subset of virginicas that have wide petals. Of the remaining medium-width petal examples, those with shorter petals are versicolors, but the split is not perfect. At this point, we stop splitting because we don't have enough samples to be convinced that further splitting would generalize well.

Note, though, that this depth-3 tree gets 96% accuracy on the test data. So does a depth-2 tree (try it!). Tree pruning, which is not implemented in this version of sklearn, can be useful for choosing a depth that generalizes well.
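
For the depth-2 suggestion, a one-cell sketch on the same non-binarized split:

# Sketch: a depth-2 tree on the original (non-binarized) features.
dt2 = DecisionTreeClassifier(criterion='entropy', max_depth=2)
dt2.fit(train_data, train_labels)
print('Depth-2 accuracy: %.2f' % dt2.score(test_data, test_labels))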