Assignment 4: Decision Tree Learning

In this assignment, you will work with a class of reinforcement learning agents called decision trees to attempt to classify features according to some decision boundary.

This assignment is due on T-Square on November 3 by 9:35 AM.

Introduction:

For this assignment we're going to need an explicit way to make structured decisions. The following is a decision node- a class representing some atomic choice in a binary decision graph. It can represent a class label (i.e. a final decision) or a binary decision to guide the us through a flow-chart to arrive at a decision.


In [204]:
class DecisionNode():

    def __init__(self, left, right, decision_function,class_label=None):
        self.left = left
        self.right = right
        self.decision_function = decision_function
        self.class_label = class_label

    def decide(self, feature):
        if self.class_label is not None:
            return self.class_label
        
        return self.left.decide(feature) if self.decision_function(feature) else self.right.decide(feature)

Part 1: Warmup: Building a tree by hand

20 pts.

In the below code block, construct a tree of decision nodes by hand in order to classify the data below. Select tests to be as small as possible (in terms of attributes), breaking ties among tests with the same number of attributes by selecting the one that classifies the greatest number of examples correctly. If multiple tests have the same number of attributes and classift the same number of examples, then break the tie using attributes with lower index numbers (e.g. select $A_1$ over $A_2$)

Datum $A_1$ $A_2$ $A_3$ $A_4$ y
$x_1$ 1 0 0 0 1
$x_2$ 1 0 1 1 1
$x_3$ 0 1 0 0 1
$x_4$ 0 1 1 0 0
$x_5$ 1 1 0 1 1
$x_6$ 0 1 0 1 0
$x_7$ 0 0 1 1 1
$x_8$ 0 0 1 0 0

In [205]:
examples = [[1,0,0,0],
            [1,0,1,1],
            [0,1,0,0],
            [0,1,1,0],
            [1,1,0,1],
            [0,1,0,1],
            [0,0,1,1],
            [0,0,1,0]]

classes = [1,1,1,0,1,0,1,0]

# Constructing nodes one at a time,
# build a decision tree as specified above.
# There exists a correct tree with less than 6 nodes.

decision_tree_label_1 = DecisionNode(None, None, None, 1)
decision_tree_label_0 = DecisionNode(None, None, None, 0)
a3Eqa4Func = lambda feat: feat[2]==feat[3]
decision_tree_test2 = DecisionNode(decision_tree_label_1, decision_tree_label_0, a3Eqa4Func)
a1Func = lambda feat: feat[0]==1
decision_tree_root = DecisionNode(decision_tree_label_1, decision_tree_test2, a1Func)

for i in examples:
    print decision_tree_root.decide(i)


1
1
1
0
1
0
1
0

Part 1b: Validation

Now that we have a decision tree, we're going to need some way to evaluate its performance. In most cases we'd reserve a portion of the training data for evaluation, or use cross validation, bot for now let's just see how your tree does on the provided examples. In the stubbed out code below, fill out the methods to compute accuracy, precision, recall, and the confusion matrix for your classifier output.


In [206]:
def confusion_matrix(classifier_output, true_labels):
    #TODO output should be [[true_positive, false_negative], [false_positive, true_negative]]
    truePos = 0
    falsePos = 0
    trueNeg = 0
    falseNeg = 0
    for i in range(len(true_labels)):
        if classifier_output[i]==1:
            if classifier_output[i] == true_labels[i]:
                truePos += 1
            if classifier_output[i] != true_labels[i]:
                falsePos += 1
        elif true_labels[i]==1 and classifier_output[i]==0:
                falseNeg += 1
        elif true_labels[i]==0 and classifier_output[i]==0:
            trueNeg += 1
            
    return [[truePos, falseNeg],[falsePos,trueNeg]]

def precision(classifier_output, true_labels):
    #TODO precision is measured as: true_positive/ (true_positive + false_positive)
    truePos = 0
    falsePos = 0
    for i in range(len(true_labels)):
        if classifier_output[i]==1:
            if classifier_output[i] == true_labels[i]:
                truePos += 1
            if classifier_output[i] != true_labels[i]:
                falsePos += 1
    if truePos>0 or falsePos>0:
        return 1.0*truePos/(truePos+falsePos)
    else:
        return -1
    
def recall(classifier_output, true_labels):
    #TODO: recall is measured as: true_positive/ (true_positive + false_negative)
    truePos = 0
    falseNeg = 0
    for i in range(len(true_labels)):
        if true_labels[i]==1:
            if classifier_output[i] == true_labels[i]:
                truePos += 1
            if classifier_output[i] != true_labels[i]:
                falseNeg += 1
    if truePos>0 or falseNeg>0:
        return 1.0*truePos/(truePos+falseNeg)
    else:
        return -1
    
def accuracy(classifier_output, true_labels):
    #TODO accuracy is measured as:  correct_classifications / total_number_examples
    trueCount = 0
    newList = zip(classifier_output, true_labels)
    for i,j in newList:
        if i==j:
            trueCount += 1
    
    if len(true_labels)>0:
        return 1.0*trueCount/len(true_labels)
    else:
        return -1
    
classifier_output = [decision_tree_root.decide(example) for example in examples]

# Make sure your hand-built tree is 100% accurate.
p1_accuracy = accuracy( classifier_output, classes )
p1_precision = precision(classifier_output, classes)
p1_recall = recall(classifier_output, classes)
p1_confusion_matrix = confusion_matrix(classifier_output, classes)

print p1_accuracy
print p1_precision
print p1_confusion_matrix
print p1_recall


1.0
1.0
[[5, 0], [0, 3]]
1.0

Part 2: Decision Tree Learning

40 pts.

As the number of examples we have grows, it rapidly becomes impractical to build these trees by hand, so it becomes necessary to specify a procedure by which we can automagically construct these trees.

For starters, let's consider the following algorithm (a variation of C4.5) for the construction of a decision tree from a given set of examples:

1) Check for base cases: 
     a)If all elements of a list are of the same class, return a leaf node with the appropriate class label.
     b)If a specified depth limit is reached, return a leaf labeled with the most frequent class.

2) For each attribute alpha: evaluate the normalized information gain gained by splitting on alpha

3) Let alpha_best be the attribute with the highest normalized information gain

4) Create a decision node that splits on alpha_best

5) Recur on the sublists obtained by splitting on alpha_best, and add those nodes as children of node

In the __build_tree__ method below implement the above algorithm. In the "classify" method below, write a function to produce classifications for a list of features once your decision tree has been build.


In [207]:
import math
import itertools
import operator
import numpy
import sys

def entropy(class_vector):
    # TODO: Compute the Shannon entropy for a vector of classes
    # Note: Classes will be given as either a 0 or a 1.
    zeroCount = 0
    oneCount = 0
    for i in class_vector:
        if i == 1:
            oneCount += 1
        else:
            zeroCount += 1
    totCount = oneCount + zeroCount
    oneFrac = float(1.0*oneCount/totCount)
    zeroFrac = float(1.0*zeroCount/totCount)
    if oneFrac>0.0 and zeroFrac>0.0:
        totEntropy = -1.0*((oneFrac)*math.log(oneFrac)+(zeroFrac)*math.log(zeroFrac))
    else:
        totEntropy = 0
    
    return totEntropy
    
def information_gain(previous_classes, current_classes ):
    # TODO: Implement information gain
    entropyOrig = entropy(previous_classes)
    entropyNew = entropy(current_classes[0]) + entropy(current_classes[1])
    return entropyOrig - entropyNew

class DecisionTree():

    def __init__(self, depth_limit=float('inf')):
        self.root = None
        self.depth_limit = depth_limit
        self.numAttr = 0
        #self.ignoreAttr = []

    def fit(self, features, classes):
        self.numAttr = len(features[0])
        self.root = self.__build_tree__(features, classes)
        #self.ignoreAttr = []

    def __build_tree__(self, features, classes, depth=0, mode=0, attr_subsample_rate=0):
        #TODO Implement the algorithm as specified above
        #check for base cases
        #if len(self.ignoreAttr)==self.numAttr:
         #   sys.exit("Depth more than number of attributes!")
        if depth == self.depth_limit:
            zeroCount = 0
            oneCount = 0
            for i in classes:
                if i == 1:
                    oneCount += 1
                else:
                    zeroCount += 1
            maxLabel = 0 if zeroCount>oneCount else 1
            #print str(['reached max depth, maxLabel is: ', maxLabel,'. counts are (0/1): ',zeroCount,'/',oneCount])
            return DecisionNode(None, None, None, maxLabel)
            
        uniqueLabels = set(classes)
        if len(uniqueLabels)==1:
            return DecisionNode(None, None, None, uniqueLabels.pop())
        
        bestIG = 0
        bestAttr = 0
        bestAttrSplit = 0
        bestLeftIndex = []
        sampledAttr = []
        #serially checking each attribute for IG
        if mode==1:
            sampledAttr = list(numpy.random.choice(range(len(features[0])), len(features[0])*attr_subsample_rate))
            #print sampledAttr
        for i in range(self.numAttr):
            if mode==0 or (mode==1 and i in sampledAttr):
                #print [mode, i]
                allAttrValues = [featVector[i] for featVector in features]
                minAttrValue = min(allAttrValues)
                maxAttrValue = max(allAttrValues)
                #print [minAttrValue, maxAttrValue]
                split = minAttrValue
                step = (maxAttrValue-minAttrValue)/100
                bestSplitIG = 0
                bestSplit = 0
                bestSplitLeftIndex = []
                #checking IG on different splits
                while split<=maxAttrValue:
                    classSplitLeft = []
                    classSplitRight = []
                    indxLeft  = []
                    j=0
                    for attValue in allAttrValues:
                        #print [attValue, j, classes[j]]
                        if attValue<split:
                            classSplitLeft.append(classes[j])
                            indxLeft.append(j)
                        else:
                            classSplitRight.append(classes[j])
                        j += 1
                    if len(classSplitLeft)>0 and len(classSplitRight)>0:
                        IG = information_gain(classes,[classSplitLeft,classSplitRight])
                        if IG>bestSplitIG:
                            bestSplitIG = IG
                            bestSplit = split
                            bestSplitLeftIndex = list(indxLeft)
                    split += step
                #checking if this attribute is the best or not    
                if bestSplitIG>bestIG:
                    bestIG = bestSplitIG
                    bestAttr = i
                    bestAttrSplit = bestSplit
                    bestLeftIndex = list(bestSplitLeftIndex)
                #print str(['depth: ',depth,' bestIG: ', bestIG, 'bestAttr: ',bestAttr,' bestSplit: ', bestSplit])
            #now we have the index of the best attribute by maximizing on the IG
            #self.ignoreAttr.append(bestAttr)
        #construct feature and class label lists for left and right tree
        leftFeatures = []
        leftClasses = []
        rightFeatures = []
        rightClasses = []
        for i in range(len(features)):
            if i in bestLeftIndex:
                leftFeatures.append(features[i])
                leftClasses.append(classes[i])
            else:
                rightFeatures.append(features[i])
                rightClasses.append(classes[i])
        decFunc = lambda feat: feat[bestAttr]<bestAttrSplit
        if mode==0:
            return DecisionNode(self.__build_tree__(leftFeatures,leftClasses, depth+1),self.__build_tree__(rightFeatures,rightClasses,depth+1), decFunc)
        else:
            #print str(['best split on depth:',depth,'. Split:', bestAttrSplit,'. Length of left and right',len(leftFeatures),' ', len(rightFeatures)]) 
            return DecisionNode(self.__build_tree__(leftFeatures,leftClasses, depth+1,1,attr_subsample_rate),self.__build_tree__(rightFeatures,rightClasses,depth+1,1,attr_subsample_rate), decFunc)
    
        
    def classify(self, features):
        #TODO Use a fitted tree to classify a list of feature vectors
        # Your output should be a list of class labels (either 0 or 1)
        labels = [0 for i in features]
        for i in range(len(features)):
            labels[i] = self.root.decide(features[i])
        return labels

Part 2b: Validation

For this part of the assignment we're going to use a relatively simple dataset (banknote authentication, found in 'part_2_data.csv'. In the section below there are methods to load the data in a consistent format.

In general, reserving part of your data as a test set can lead to unpredictable performance- a serendipitous choice of your train or test split could give you a very inaccurate idea of how your classifier performs. That's where k-fold cross validation comes in.

In the below method, we'll split the dataset at random into k equal subsections, then iterating on each of our k samples, we'll reserve that sample for testing and use the other k-1 for training. Averaging the results of each fold should give us a more consistent idea of how the classifier is doing.


In [208]:
import random
import numpy
def load_csv(data_file_path, class_index=-1):
    handle = open(data_file_path, 'r')
    contents = handle.read()
    handle.close()
    rows = contents.split('\n')
    out = numpy.array([  [float(i) for i in r.split(',')] for r in rows if r ])
    classes= map(int,  out[:, class_index])
    features = out[:, :class_index]
    return features, classes

def generate_k_folds(dataset, k):
    #TODO this method should return a list of folds,
    # where each fold is a tuple like (training_set, test_set)
    # where each set is a tuple like (examples, classes)
    kFolds = []
    numTraining = int(len(dataset[1])/k)
    for i in range(k):
        testFeat = []
        testClasses = []
        trainingFeat = []
        trainingClasses = []
        testIndx = random.sample(range(len(dataset[1])),numTraining)
        for j in range(len(dataset[0])):
            if j in testIndx:
                testFeat.append(dataset[0][j])
                testClasses.append(dataset[1][j])
            else:
                trainingFeat.append(dataset[0][j])
                trainingClasses.append(dataset[1][j])
        kFolds.append([[trainingFeat, trainingClasses],[testFeat, testClasses]])
    
    return kFolds

dataset = load_csv('part2_data.csv')
ten_folds = generate_k_folds(dataset, 10)

#on average your accuracy should be higher than 60%.
accuracies = []
precisions = []
recalls = []
confusion = []

for fold in ten_folds:
    train, test = fold
    train_features, train_classes = train
    test_features, test_classes = test
    tree = DecisionTree(1)
    tree.fit( train_features, train_classes)
    output = tree.classify(test_features)
    #print output
    
    accuracies.append( accuracy(output, test_classes))
    precisions.append( precision(output, test_classes))
    recalls.append( recall(output, test_classes))
    confusion.append( confusion_matrix(output, test_classes))
    
print sum(accuracies)*10
print sum(precisions)*10
print sum(recalls)*10


72.5547445255
62.1258714106
99.8076923077

Part 3: Random Forests

30 pts.

The decision boundaries drawn by decision trees are very sharp, and fitting a decision tree of unbounded depth to a list of examples almost inevitably leads to overfitting. In an attempt to decrease the variance of our classifier we're going to use a technique called 'Bootstrap Aggregating' (often abbreviated 'bagging').

A Random Forest is a collection of decision trees, built as follows:

1) For every tree we're going to build:

a) Subsample the examples provided us (with replacement) in accordance with a provided example subsampling rate.

b) From the sample in a), choose attributes at random to learn on (in accordance with a provided attribute subsampling rate)

c) Fit a decision tree to the subsample of data we've chosen (to a certain depth)

Classification for a random forest is then done by taking a majority vote of the classifications yielded by each tree in the forest after it classifies an example.


In [209]:
import numpy
class RandomForest():

    def __init__(self, num_trees, depth_limit, example_subsample_rate, attr_subsample_rate):
        self.trees = []
        self.num_trees = num_trees
        self.depth_limit = depth_limit
        self.example_subsample_rate = example_subsample_rate
        self.attr_subsample_rate = attr_subsample_rate

    def fit(self, features, classes):
        #assuming data and attr subsample rate to be a fraction
        #create 5 different subsamples of data and attr and then fit 5 different DTs
        
        for i in range(self.num_trees):
            trainingIdx = numpy.random.choice(range(len(features)),len(features)*self.example_subsample_rate)
            trainingFeat = [features[i] for i in trainingIdx]
            trainingClass = [classes[i] for i in trainingIdx]
            tree = DecisionTree(self.depth_limit)
            tree.numAttr = len(features[0])
            tree.root = tree.__build_tree__(features, classes, 0, 1, self.attr_subsample_rate)
            self.trees.append(tree)
            
            
    def classify(self, features):
        #get decisions from each DT and then count votes for each label
        count = [[0,0] for i in range(len(features))]
        for i in range(self.num_trees):
            label = self.trees[i].classify(features)
            i=0
            for l in label:
                count[i][l] +=1
                i += 1
        classes = [0 if count[i][0]>count[i][1] else 1 for i in range(len(features))]
        return classes
    
#TODO: As with the DecisionTree, evaluate the performance of your RandomForest on the dataset for part 2.
# on average your accuracy should be higher than 75%.

#  Optimize the parameters of your random forest for accuracy for a forest of 5 trees.
# (We'll verify these by training one of your RandomForest instances using these parameters
#  and checking the resulting accuracy)

#  Fill out the function below to reflect your answer:


dataset = load_csv('part2_data.csv')
ten_folds = generate_k_folds(dataset, 10)

bestAcc = 0
bestPrec = 0
bestRecall = 0
bestI = 0
bestJ = 0
i=0.1
stepData = 0.1
stepAttr = 0.25
'''while i<1:
    j = 0.75
    while j<=1:
        print [i,j]'''
accuracies = []
precisions = []
recalls = []
confusion = []
for fold in ten_folds:
    train, test = fold
    train_features, train_classes = train
    test_features, test_classes = test
    RF = RandomForest(5,1,0.1,0.25)
    RF.fit(train_features, train_classes)
    output = RF.classify(test_features)
    #print zip(output, test_classes)

    accuracies.append( accuracy(output, test_classes))
    precisions.append( precision(output, test_classes))
    recalls.append( recall(output, test_classes))
    confusion.append( confusion_matrix(output, test_classes))

avgAcc = sum(accuracies)*10
avgPrec = sum(precisions)*10
avgRec = sum(recalls)*10
'''if avgAcc>bestAcc:
            bestAcc = avgAcc
            bestPrec = avgPrec
            bestRecall = avgRec
            bestI = i
            bestJ = j
        j += stepAttr
    i += stepData'''        

#print [bestAcc,bestPrec,bestRecall,bestI, bestJ]
print avgAcc
def ideal_parameters():
    return 1, 0.1, 0.25


72.1897810219

Part 4: Challenge!

10 pts

You've been provided with a sample of data from a research dataset in 'challenge_data.pickle'. It is serialized as a tuple of (features, classes). I have reserved a part of the dataset for testing. The classifier that performs most accurately on the holdout set wins (so optimize for accuracy). To get full points for this part of the assignment, you'll need to get at least an average accuracy of 80% on the data you have, and at least an average accuracy of 60% on the holdout set.

Ties will be broken by submission time.

First place: +3% on your final grade

Second place: +2% on your final grade

Third place: +1% on your final grade


In [79]:
class ChallengeClassifier():
    
    def __init__(self):
        # initialize whatever parameters you may need here-
        # this method will be called without parameters 
        # so if you add any to make parameter sweeps easier, provide defaults
        raise NotImplemented()
        
    def fit(self, features, classes):
        # fit your model to the provided features
        raise NotImplemented()
        
    def classify(self, features):
        # classify each feature in features as either 0 or 1.
        raise NotImplemented()