Sample code from "Data Science from Scratch" by Joel Grus, O'Reilly Media, 2015.
The VP provides you with the interviewee data, consisting of (per your specification) pairs (input, label), where each input is a dict of candidate attributes, and each label is either True (the candidate interviewed well) or False (the candidate interviewed poorly). In particular, you are provided with each candidate’s level, her preferred language, whether she is active on Twitter, and whether she has a PhD.
In [1]:
from __future__ import division
from collections import Counter, defaultdict
from functools import partial
import math, random
Our tree will consist of decision nodes (which ask a question and direct us differently depending on the answer) and leaf nodes (which give us a prediction). We will build it using the relatively simple ID3 algorithm, which operates in the following manner. Let’s say we’re given some labeled data and a list of attributes to consider branching on:
- If the data all have the same label, create a leaf node that predicts that label and stop.
- If the list of attributes is empty (there are no more possible questions to ask), create a leaf node that predicts the most common label and stop.
- Otherwise, try partitioning the data by each of the attributes, choose the partition with the lowest entropy, add a decision node based on the chosen attribute, and recur on each partitioned subset using the remaining attributes.
This is what’s known as a “greedy” algorithm because, at each step, it chooses the most immediately best option. Given a data set, there may be a better tree with a worse-looking first move. If so, this algorithm won’t find it. Nonetheless, it is relatively easy to understand and implement, which makes it a good place to begin exploring decision trees.
In [2]:
def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p)                            # ignore zero probabilities

def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count
            for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)

def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

def group_by(items, key_fn):
    """returns a defaultdict(list), where each input item
    is in the list whose key is key_fn(item)"""
    groups = defaultdict(list)
    for item in items:
        key = key_fn(item)
        groups[key].append(item)
    return groups

def partition_by(inputs, attribute):
    """returns a dict of inputs partitioned by the attribute;
    each input is a pair (attribute_dict, label)"""
    return group_by(inputs, lambda x: x[0][attribute])

def partition_entropy_by(inputs, attribute):
    """computes the entropy corresponding to the given partition"""
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())

def classify(tree, input):
    """classify the input using the given decision tree"""

    # if this is a leaf node, return its value
    if tree in [True, False]:
        return tree

    # otherwise find the correct subtree
    attribute, subtree_dict = tree

    subtree_key = input.get(attribute)   # None if input is missing attribute

    if subtree_key not in subtree_dict:  # if no subtree for key,
        subtree_key = None               # we'll use the None subtree

    subtree = subtree_dict[subtree_key]  # choose the appropriate subtree
    return classify(subtree, input)      # and use it to classify the input

def build_tree_id3(inputs, split_candidates=None):

    # if this is our first pass,
    # all keys of the first input are split candidates
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()

    # count Trues and Falses in the inputs
    num_inputs = len(inputs)
    num_trues = len([label for item, label in inputs if label])
    num_falses = num_inputs - num_trues

    if num_trues == 0:                  # if only Falses are left
        return False                    # return a "False" leaf

    if num_falses == 0:                 # if only Trues are left
        return True                     # return a "True" leaf

    if not split_candidates:            # if no split candidates left
        return num_trues >= num_falses  # return the majority leaf

    # otherwise, split on the best attribute
    best_attribute = min(split_candidates,
                         key=partial(partition_entropy_by, inputs))

    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates
                      if a != best_attribute]

    # recursively build the subtrees
    subtrees = { attribute_value : build_tree_id3(subset, new_candidates)
                 for attribute_value, subset in partitions.iteritems() }

    subtrees[None] = num_trues > num_falses    # default case

    return (best_attribute, subtrees)

def forest_classify(trees, input):
    """classify by letting each tree vote and returning the most common vote"""
    votes = [classify(tree, input) for tree in trees]
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]
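Before applying these helpers to the interviewee data, a quick sanity check can be reassuring (illustrative values worked out by hand, not output from the book): a set containing only one class has entropy 0, a 50/50 split between two classes has entropy 1, and data_entropy works directly on (input, label) pairs.

print entropy([1.0])          # 0.0: a pure set has no uncertainty
print entropy([0.5, 0.5])     # 1.0: a 50/50 split is maximally uncertain
print data_entropy([({}, True), ({}, True), ({}, False)])  # roughly 0.918
                              # the empty dicts stand in for attribute dicts;
                              # only the labels matter here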
Let’s manually go through these steps on the interviewee data set. The data set has both True and False labels, and we have four attributes we can split on, so our first step is to find the split with the lowest partition entropy. partition_by() does the partitioning, and partition_entropy_by() uses it to compute the entropy of the resulting partition.
In [3]:
inputs = [
    ({'level':'Senior','lang':'Java','tweets':'no','phd':'no'},    False),
    ({'level':'Senior','lang':'Java','tweets':'no','phd':'yes'},   False),
    ({'level':'Mid','lang':'Python','tweets':'no','phd':'no'},     True),
    ({'level':'Junior','lang':'Python','tweets':'no','phd':'no'},  True),
    ({'level':'Junior','lang':'R','tweets':'yes','phd':'no'},      True),
    ({'level':'Junior','lang':'R','tweets':'yes','phd':'yes'},     False),
    ({'level':'Mid','lang':'R','tweets':'yes','phd':'yes'},        True),
    ({'level':'Senior','lang':'Python','tweets':'no','phd':'no'},  False),
    ({'level':'Senior','lang':'R','tweets':'yes','phd':'no'},      True),
    ({'level':'Junior','lang':'Python','tweets':'yes','phd':'no'}, True),
    ({'level':'Senior','lang':'Python','tweets':'yes','phd':'yes'},True),
    ({'level':'Mid','lang':'Python','tweets':'no','phd':'yes'},    True),
    ({'level':'Mid','lang':'Java','tweets':'yes','phd':'no'},      True),
    ({'level':'Junior','lang':'Python','tweets':'no','phd':'yes'}, False)
]
Then we just need to find the minimum-entropy partition for the whole data set:
In [4]:
for key in ['level','lang','tweets','phd']:
    print key, partition_entropy_by(inputs, key)
The lowest entropy comes from splitting on level, so we’ll need to make a subtree for each possible level value. Every Mid candidate is labeled True, which means that the Mid subtree is simply a leaf node predicting True. For Senior candidates, we have a mix of Trues and Falses, so we need to split again:
In [5]:
senior_inputs = [(input, label)
                 for input, label in inputs if input["level"] == "Senior"]

for key in ['lang', 'tweets', 'phd']:
    print key, partition_entropy_by(senior_inputs, key)
This shows us that our next split should be on tweets, which results in a zero-entropy partition. For these Senior-level candidates, “yes” tweets always result in True while “no” tweets always result in False.
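To see why the tweets split is zero-entropy, note that the five Senior candidates break into three non-tweeters (all False) and two tweeters (all True). Each subset is pure, so each has entropy 0, and so does their weighted sum. A quick check, reusing senior_inputs from the previous cell (this snippet is not in the original sample):

by_tweets = partition_by(senior_inputs, 'tweets')
for value, subset in by_tweets.items():
    print value, data_entropy(subset)        # each subset is pure: entropy 0.0
print partition_entropy(by_tweets.values())  # so the weighted sum is 0.0 too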
Finally, if we do the same thing for the Junior candidates, we end up splitting on phd, after which we find that no PhD always results in True and PhD always results in False.
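For completeness, here is the analogous computation for the Junior candidates (a sketch mirroring the Senior cell above, not a numbered cell from the original sample); phd is the split that comes out at zero:

junior_inputs = [(input, label)
                 for input, label in inputs if input["level"] == "Junior"]

for key in ['lang', 'tweets', 'phd']:
    print key, partition_entropy_by(junior_inputs, key)  # phd gives entropy 0.0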
Now that we’ve seen how the algorithm works, we would like to implement it more generally. This means we need to decide how we want to represent trees. We’ll use pretty much the most lightweight representation possible. We define a tree to be one of the following:
- True
- False
- a tuple (attribute, subtree_dict)
Here True represents a leaf node that returns True for any input, False represents a leaf node that returns False for any input, and a tuple (attribute, subtree_dict) represents a decision node that looks up the input’s value for attribute and then classifies the input using the corresponding subtree.
All that’s left is to build the tree representation from our training data. Our hiring tree would look like:
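(Sketched here from the manual walkthrough above; the tree that build_tree_id3 actually returns in the next cell also contains the None default keys described below.)

('level',
 {'Junior': ('phd', {'no': True, 'yes': False}),
  'Mid': True,
  'Senior': ('tweets', {'no': False, 'yes': True})})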
In [6]:
print "Building the tree"
tree = build_tree_id3(inputs)
print tree
In the tree we built, every leaf consists entirely of True inputs or entirely of False inputs, which means the tree predicts perfectly on the training data set. But we can also apply it to new inputs that weren’t in the training set:
In [7]:
print "Junior / Java / tweets / no phd", classify(tree,
{ "level" : "Junior",
"lang" : "Java",
"tweets" : "yes",
"phd" : "no"} )
In [8]:
print "Junior / Java / tweets / phd", classify(tree,
{ "level" : "Junior",
"lang" : "Java",
"tweets" : "yes",
"phd" : "yes"} )
There’s still the question of what to do if we encounter an unexpected (or missing) attribute value. What should our hiring tree do if it encounters a candidate whose level is “Intern”? This case is handled by the None key that build_tree_id3 adds to every subtree_dict, which simply predicts the most common label.
Applying the tree to inputs with missing or unexpected values:
In [9]:
print "Intern", classify(tree, { "level" : "Intern" } )
In [10]:
print "Senior", classify(tree, { "level" : "Senior" } )
Note: Since our goal was mainly to demonstrate how to build a tree, we built the tree using the entire data set. As always, if we were really trying to create a good model for something, we would have (collected more data and) split the data into train/validation/test subsets.