The goal of this notebook is to implement a binary decision tree classifier from scratch. We will:
Important Note: We will focus on building decision trees where the data contain only binary (0 or 1) features. This allows us to avoid dealing with:
In [46]:
import sys
import os
sys.path.append('..')
import graphlab
In [47]:
import numpy as np
In [48]:
loans = graphlab.SFrame('lending-club-data.gl/')
In [49]:
loans.head()
Out[49]:
We reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.
In [50]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')
We will be using 4 categorical features:
Since we are building a binary decision tree, we will have to convert these categorical features to a binary representation in a subsequent section using 1-hot encoding.
In [51]:
features = ['grade', # grade of the loan
'term', # the term of the loan
'home_ownership', # home_ownership status: own, mortgage or rent
'emp_length', # number of years of employment
]
target = 'safe_loans'
loans = loans[features + [target]]
Let's explore what the dataset looks like.
In [52]:
loans
Out[52]:
We will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use seed=1
so everyone gets the same results.
In [53]:
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]
# Since there are less risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)
print "Percentage of safe loans :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)
Note: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this presentation, but some of them are reviewed in this paper. Here, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
In this presentation, we will implement binary decision trees (decision trees for binary features, a specific case of categorical variables taking on two values, e.g., true/false). Since all of our features are currently categorical features, we want to turn them into binary features.
For instance, the home_ownership feature represents the home ownership status of the loanee, which is either own
, mortgage
or rent
. For example, if a data point has the feature
{'home_ownership': 'RENT'}
we want to turn this into three features:
{
'home_ownership = OWN' : 0,
'home_ownership = MORTGAGE' : 0,
'home_ownership = RENT' : 1
}
Since this code requires a few Python and GraphLab tricks, feel free to use this block of code as is. Refer to the API documentation for a deeper understanding.
In [9]:
loans_data = risky_loans.append(safe_loans)
for feature in features:
loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
# Change None's to 0's
for column in loans_data_unpacked.column_names():
loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)
loans_data.remove_column(feature)
loans_data.add_columns(loans_data_unpacked)
Let's see what the feature columns look like now:
In [10]:
features = loans_data.column_names()
features.remove('safe_loans') # Remove the response variable
features
Out[10]:
In [11]:
print "Number of features (after binarizing categorical variables) = %s" % len(features)
Let's explore what one of these columns looks like:
In [12]:
loans_data['grade.A']
Out[12]:
This column is set to 1 if the loan grade is A and 0 otherwise.
Checkpoint: Make sure the following answers match up.
In [13]:
print "Total number of grade.A loans : %s" % loans_data['grade.A'].sum()
print "Expexted answer : 6422"
In [14]:
train_data, test_data = loans_data.random_split(.8, seed=1)
In [15]:
train_data['safe_loans'].unique()
Out[15]:
In [16]:
(train_data['safe_loans']).size()
Out[16]:
In [17]:
print( (train_data['safe_loans'] == 1).sum())
print( (train_data['safe_loans'] == -1).sum())
print( (train_data['safe_loans'] == 1).sum()+ (train_data['safe_loans'] == -1).sum() )
print((train_data['safe_loans']).size())
In this section, we will implement binary decision trees from scratch. There are several steps involved in building a decision tree. For that reason, we have split the entire assignment into several sections.
Recall that prediction at an intermediate node works by predicting the majority class for all data points that belong to this node.
Now, we will write a function that calculates the number of missclassified examples when predicting the majority class. This will be used to help determine which feature is the best to split on at a given node of the tree.
Note: Keep in mind that in order to compute the number of mistakes for a majority classifier, we only need the label (y values) of the data points in the node.
Steps to follow :
Now, let us write the function intermediate_node_num_mistakes
which computes the number of misclassified examples of an intermediate node given the set of labels (y values) of the data points contained in the node.
In [18]:
def intermediate_node_num_mistakes(labels_in_node):
# Corner case: If labels_in_node is empty, return 0
if len(labels_in_node) == 0:
return 0
number_examples = labels_in_node.size()
# Count the number of 1's (safe loans)
number_safe_loans = (labels_in_node == 1).sum()
# Count the number of -1's (risky loans)
number_risky_loans = (labels_in_node == -1).sum()
if number_safe_loans >= number_risky_loans:
class_predictions = np.ones(number_examples)
else:
class_predictions = -1*np.ones(number_examples)
number_mistakes = (labels_in_node.to_numpy() != class_predictions).sum()
return number_mistakes
In [19]:
example_labels = graphlab.SArray([-1, -1, 1, 1, 1])
intermediate_node_num_mistakes(example_labels)#.to_numyp == 2
Out[19]:
In [20]:
example_labels = graphlab.SArray([-1, -1, 1, 1, 1])
(example_labels == 1).to_numpy()
Out[20]:
Because there are several steps in this assignment, we have introduced some stopping points where you can check your code and make sure it is correct before proceeding. To test your intermediate_node_num_mistakes
function, run the following code until you get a Test passed!, then you should proceed. Otherwise, you should spend some time figuring out where things went wrong.
In [21]:
# Test case 1
example_labels = graphlab.SArray([-1, -1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
print 'Test passed!'
else:
print 'Test 1 failed... try again!'
# Test case 2
example_labels = graphlab.SArray([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
print 'Test passed!'
else:
print 'Test 2 failed... try again!'
# Test case 3
example_labels = graphlab.SArray([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
print 'Test passed!'
else:
print 'Test 3 failed... try again!'
The function best_splitting_feature takes 3 arguments:
The function will loop through the list of possible features, and consider splitting on each of them. It will calculate the classification error of each split and return the feature that had the smallest classification error when split on.
Recall that the classification error is defined as follows: $$ \mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# total examples}} $$
Follow these steps:
This may seem like a lot, but we have provided pseudocode in the comments in order to help you implement the function correctly.
Note: Remember that since we are only dealing with binary features, we do not have to consider thresholds for real-valued features. This makes the implementation of this function much easier.
In [22]:
def best_splitting_feature(data, features, target):
best_feature = None # Keep track of the best feature
best_error = 10 # Keep track of the best error so far
# Note: Since error is always <= 1, we should intialize it with something larger than 1.
# Convert to float to make sure error gets computed correctly.
num_data_points = float(len(data))
# Loop through each feature to consider splitting on that feature
for feature in features:
# The left split will have all data points where the feature value is 0
left_split = data[data[feature] == 0]
# The right split will have all data points where the feature value is 1
right_split = data[data[feature] == 1]
# Calculate the number of misclassified examples in the left split.
left_mistakes = intermediate_node_num_mistakes(left_split[target])
# Calculate the number of misclassified examples in the right split.
right_mistakes = intermediate_node_num_mistakes(right_split[target])
# Compute the classification error of this split.
error = (left_mistakes + right_mistakes) / num_data_points
if error < best_error:
(best_error, best_feature) = (error, feature)
return best_feature
To test your best_splitting_feature
function, run the following code:
In [23]:
if best_splitting_feature(train_data, features, 'safe_loans') == 'term. 36 months':
print 'Test passed!'
else:
print 'Test failed... try again!'
With the above functions implemented correctly, we are now ready to build our decision tree. Each node in the decision tree is represented as a dictionary which contains the following keys and possible values:
{
'is_leaf' : True/False.
'prediction' : Prediction at the leaf node.
'left' : (dictionary corresponding to the left tree).
'right' : (dictionary corresponding to the right tree).
'splitting_feature' : The feature that this node splits on.
}
First, we will write a function that creates a leaf node given a set of target values.
In [24]:
def create_leaf(target_values):
# Create a leaf node
leaf = {'splitting_feature' : None,
'left' : None,
'right' : None,
'is_leaf': True }
# Count the number of data points that are +1 and -1 in this node.
num_ones = len(target_values[target_values == +1])
num_minus_ones = len(target_values[target_values == -1])
# For the leaf node, set the prediction to be the majority class.
# Store the predicted class (1 or -1) in leaf['prediction']
if num_ones > num_minus_ones:
leaf['prediction'] = 1
else:
leaf['prediction'] = -1
return leaf
We have provided a function that learns the decision tree recursively and implements 3 stopping conditions:
Now, we will write down the skeleton of the learning algorithm.
In [25]:
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10):
remaining_features = features[:] # Make a copy of the features.
target_values = data[target]
print "--------------------------------------------------------------------"
print "Subtree, depth = %s (%s data points)." % (current_depth, len(target_values))
# Stopping condition 1
# (Check if there are mistakes at current node.
if intermediate_node_num_mistakes(target_values) == 0:
print "Stopping condition 1 reached."
# If no mistakes at current node, make current node a leaf node
return create_leaf(target_values)
# Stopping condition 2 (check if there are remaining features to consider splitting on)
if remaining_features == ['']:
print "Stopping condition 2 reached."
# If there are no remaining features to consider, make current node a leaf node
return create_leaf(target_values)
# Additional stopping condition (limit tree depth)
if current_depth >= max_depth:
print "Reached maximum depth. Stopping for now."
# If the max tree depth has been reached, make current node a leaf node
return create_leaf(target_values)
# Find the best splitting feature (recall the function best_splitting_feature implemented above)
splitting_feature = best_splitting_feature(data, features, target)
# Split on the best feature that we found.
left_split = data[data[splitting_feature] == 0]
right_split = data[data[splitting_feature] == 1]
remaining_features.remove(splitting_feature)
print "Split on feature %s. (%s, %s)" % (\
splitting_feature, len(left_split), len(right_split))
# Create a leaf node if the split is "perfect"
if len(left_split) == len(data):
print "Creating leaf node left."
return create_leaf(left_split[target])
if len(right_split) == len(data):
print "Creating leaf node right."
return create_leaf(right_split[target])
# Repeat (recurse) on left and right subtrees
left_tree = decision_tree_create(left_split, remaining_features, target, current_depth + 1, max_depth)
right_tree = decision_tree_create(right_split, remaining_features, target, current_depth + 1, max_depth)
return {'is_leaf' : False,
'prediction' : None,
'splitting_feature': splitting_feature,
'left' : left_tree,
'right' : right_tree}
Here is a recursive function to count the nodes in your tree:
In [26]:
def count_nodes(tree):
if tree['is_leaf']:
return 1
return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])
Run the following test code to check your implementation. Make sure you get 'Test passed' before proceeding.
In [27]:
small_data_decision_tree = decision_tree_create(train_data, features, 'safe_loans', max_depth = 3)
if count_nodes(small_data_decision_tree) == 13:
print 'Test passed!'
else:
print 'Test failed... try again!'
print 'Number of nodes found :', count_nodes(small_data_decision_tree)
print 'Number of nodes that should be there : 13'
In [28]:
# Make sure to cap the depth at 6 by using max_depth = 6
my_decision_tree = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6)
In [29]:
print(my_decision_tree.keys())
As discussed in the lecture, we can make predictions from the decision tree with a simple recursive function. Below, we call this function classify
, which takes in a learned tree
and a test point x
to classify. We include an option annotate
that describes the prediction path when set to True
.
In [30]:
def classify(tree, x, annotate = False):
# if the node is a leaf node.
if tree['is_leaf']:
if annotate:
print "At leaf, predicting %s" % tree['prediction']
return tree['prediction']
else:
# split on feature.
split_feature_value = x[tree['splitting_feature']]
if annotate:
print "Split on %s = %s" % (tree['splitting_feature'], split_feature_value)
if split_feature_value == 0:
return classify(tree['left'], x, annotate)
else:
return classify(tree['right'], x, annotate)
Now, let's consider the first example of the test set and see what my_decision_tree
model predicts for this data point.
In [31]:
test_data[0]
Out[31]:
In [32]:
print 'Predicted class: %s ' % classify(my_decision_tree, test_data[0])
Let's add some annotations to our prediction to see what the prediction path was that lead to this predicted class:
In [33]:
classify(my_decision_tree, test_data[0], annotate=True)
Out[33]:
Now, we will write a function to evaluate a decision tree by computing the classification error of the tree on the given dataset.
Again, recall that the classification error is defined as follows: $$ \mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# total examples}} $$
Now, write a function called evaluate_classification_error
that takes in as input:
tree
(as described above)data
(an SFrame)This function should return a prediction (class label) for each row in data
using the decision tree
.
In [34]:
def evaluate_classification_error(tree, data):
# Apply the classify(tree, x) to each row in your data
prediction = data.apply(lambda x: classify(tree, x))
number_examples = prediction.size()
safe_loans = data['safe_loans']
classification_error = 1 - ( 1.0*(safe_loans == prediction).sum() ) / number_examples
return classification_error
Now, let's use this function to evaluate the classification error on the test set.
In [36]:
evaluate_classification_error(my_decision_tree, test_data)
Out[36]:
As discussed in the lecture, we can print out a single decision stump.
In [37]:
def print_stump(tree, name = 'root'):
split_name = tree['splitting_feature'] # split_name is something like 'term. 36 months'
if split_name is None:
print "(leaf, label: %s)" % tree['prediction']
return None
split_feature, split_value = split_name.split('.')
print ' %s' % name
print ' |---------------|----------------|'
print ' | |'
print ' | |'
print ' | |'
print ' [{0} == 0] [{0} == 1] '.format(split_name)
print ' | |'
print ' | |'
print ' | |'
print ' (%s) (%s)' \
% (('leaf, label: ' + str(tree['left']['prediction']) if tree['left']['is_leaf'] else 'subtree'),
('leaf, label: ' + str(tree['right']['prediction']) if tree['right']['is_leaf'] else 'subtree'))
In [38]:
print_stump(my_decision_tree)
In [39]:
print_stump(my_decision_tree['left'], my_decision_tree['splitting_feature'])
In [117]:
print_stump(my_decision_tree['left']['left'], my_decision_tree['left']['splitting_feature'])
In [118]:
print_stump(my_decision_tree['left']['left']['left'], my_decision_tree['left']['left']['splitting_feature'])
In [41]:
print_stump(my_decision_tree)
In [42]:
print_stump(my_decision_tree['right'], my_decision_tree['splitting_feature'])
In [43]:
print_stump(my_decision_tree['right']['right'], my_decision_tree['right']['splitting_feature'])