For this assignment we're going to need an explicit way to make structured decisions. The following is a decision node: a class representing some atomic choice in a binary decision graph. It can represent a class label (i.e. a final decision) or a binary decision that guides us through a flow chart to arrive at a decision.
In [204]:
class DecisionNode():
def __init__(self, left, right, decision_function,class_label=None):
self.left = left
self.right = right
self.decision_function = decision_function
self.class_label = class_label
def decide(self, feature):
if self.class_label is not None:
return self.class_label
return self.left.decide(feature) if self.decision_function(feature) else self.right.decide(feature)
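For example, a small tree built from this class classifies a feature vector by walking from the root (a minimal sketch; the test and feature values here are made up for illustration):

yes_leaf = DecisionNode(None, None, None, 1)
no_leaf = DecisionNode(None, None, None, 0)
root = DecisionNode(yes_leaf, no_leaf, lambda feature: feature[0] > 0)
print(root.decide([5, 2]))   # prints 1: the test passes, so we take the left branch
print(root.decide([-3, 2]))  # prints 0: the test fails, so we take the right branch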
20 pts.
In the code block below, construct a tree of decision nodes by hand in order to classify the data below. Select tests that are as small as possible (in terms of attributes), breaking ties among tests with the same number of attributes by selecting the one that classifies the greatest number of examples correctly. If multiple tests have the same number of attributes and classify the same number of examples correctly, break the tie using the attribute with the lower index number (e.g. select $A_1$ over $A_2$).
Datum | $A_1$ | $A_2$ | $A_3$ | $A_4$ | y |
---|---|---|---|---|---|
$x_1$ | 1 | 0 | 0 | 0 | 1 |
$x_2$ | 1 | 0 | 1 | 1 | 1 |
$x_3$ | 0 | 1 | 0 | 0 | 1 |
$x_4$ | 0 | 1 | 1 | 0 | 0 |
$x_5$ | 1 | 1 | 0 | 1 | 1 |
$x_6$ | 0 | 1 | 0 | 1 | 0 |
$x_7$ | 0 | 0 | 1 | 1 | 1 |
$x_8$ | 0 | 0 | 1 | 0 | 0 |
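One way to arrive at a tree that satisfies these rules: the single-attribute test $A_1 = 1$ is satisfied only by $x_1$, $x_2$, and $x_5$, all of which have $y = 1$, so that branch can end in a leaf labeled 1; among the remaining examples, the test $A_3 = A_4$ separates the positives ($x_3$, $x_7$) from the negatives ($x_4$, $x_6$, $x_8$). The hand-built tree below follows exactly this structure.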
In [205]:
examples = [[1,0,0,0],
[1,0,1,1],
[0,1,0,0],
[0,1,1,0],
[1,1,0,1],
[0,1,0,1],
[0,0,1,1],
[0,0,1,0]]
classes = [1,1,1,0,1,0,1,0]
# Constructing nodes one at a time,
# build a decision tree as specified above.
# There exists a correct tree with fewer than 6 nodes.
decision_tree_label_1 = DecisionNode(None, None, None, 1)
decision_tree_label_0 = DecisionNode(None, None, None, 0)
a3Eqa4Func = lambda feat: feat[2]==feat[3]
decision_tree_test2 = DecisionNode(decision_tree_label_1, decision_tree_label_0, a3Eqa4Func)
a1Func = lambda feat: feat[0]==1
decision_tree_root = DecisionNode(decision_tree_label_1, decision_tree_test2, a1Func)
for i in examples:
print decision_tree_root.decide(i)
Now that we have a decision tree, we're going to need some way to evaluate its performance. In most cases we'd reserve a portion of the training data for evaluation, or use cross validation, but for now let's just see how your tree does on the provided examples. In the stubbed-out code below, fill out the methods to compute accuracy, precision, recall, and the confusion matrix for your classifier output.
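For reference (standard definitions, with TP, FP, TN, and FN denoting true/false positives and negatives): accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$, precision $= \frac{TP}{TP + FP}$, recall $= \frac{TP}{TP + FN}$, and the confusion matrix should be laid out as $[[TP, FN], [FP, TN]]$ to match the comment in the stub.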
In [206]:
def confusion_matrix(classifier_output, true_labels):
#TODO output should be [[true_positive, false_negative], [false_positive, true_negative]]
truePos = 0
falsePos = 0
trueNeg = 0
falseNeg = 0
for i in range(len(true_labels)):
if classifier_output[i]==1:
if classifier_output[i] == true_labels[i]:
truePos += 1
if classifier_output[i] != true_labels[i]:
falsePos += 1
elif true_labels[i]==1 and classifier_output[i]==0:
falseNeg += 1
elif true_labels[i]==0 and classifier_output[i]==0:
trueNeg += 1
return [[truePos, falseNeg],[falsePos,trueNeg]]
def precision(classifier_output, true_labels):
#TODO precision is measured as: true_positive/ (true_positive + false_positive)
truePos = 0
falsePos = 0
for i in range(len(true_labels)):
if classifier_output[i]==1:
if classifier_output[i] == true_labels[i]:
truePos += 1
if classifier_output[i] != true_labels[i]:
falsePos += 1
if truePos>0 or falsePos>0:
return 1.0*truePos/(truePos+falsePos)
else:
return -1
def recall(classifier_output, true_labels):
#TODO: recall is measured as: true_positive/ (true_positive + false_negative)
truePos = 0
falseNeg = 0
for i in range(len(true_labels)):
if true_labels[i]==1:
if classifier_output[i] == true_labels[i]:
truePos += 1
if classifier_output[i] != true_labels[i]:
falseNeg += 1
if truePos>0 or falseNeg>0:
return 1.0*truePos/(truePos+falseNeg)
else:
return -1
def accuracy(classifier_output, true_labels):
#TODO accuracy is measured as: correct_classifications / total_number_examples
trueCount = 0
newList = zip(classifier_output, true_labels)
for i,j in newList:
if i==j:
trueCount += 1
if len(true_labels)>0:
return 1.0*trueCount/len(true_labels)
else:
return -1
classifier_output = [decision_tree_root.decide(example) for example in examples]
# Make sure your hand-built tree is 100% accurate.
p1_accuracy = accuracy( classifier_output, classes )
p1_precision = precision(classifier_output, classes)
p1_recall = recall(classifier_output, classes)
p1_confusion_matrix = confusion_matrix(classifier_output, classes)
print p1_accuracy
print p1_precision
print p1_confusion_matrix
print p1_recall
40 pts.
As the number of examples we have grows, it rapidly becomes impractical to build these trees by hand, so it becomes necessary to specify a procedure by which we can automagically construct these trees.
For starters, let's consider the following algorithm (a variation of C4.5) for the construction of a decision tree from a given set of examples:
1) Check for base cases:
a) If all elements of a list are of the same class, return a leaf node with the appropriate class label.
b) If a specified depth limit is reached, return a leaf node labeled with the most frequent class.
2) For each attribute $\alpha$: evaluate the normalized information gain from splitting on $\alpha$
3) Let $\alpha_{best}$ be the attribute with the highest normalized information gain
4) Create a decision node that splits on $\alpha_{best}$
5) Recur on the sublists obtained by splitting on $\alpha_{best}$, and add those nodes as children of the current node
In the __build_tree__ method below, implement the above algorithm. In the "classify" method below, write a function to produce classifications for a list of features once your decision tree has been built.
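As a reference for the computations below (these are the standard definitions, stated here for convenience): for a binary class vector with positive fraction $p$, the Shannon entropy is $H = -p \log_2 p - (1 - p) \log_2 (1 - p)$, and the information gain from splitting a set $S$ into subsets $S_1, \dots, S_m$ is $H(S) - \sum_i \frac{|S_i|}{|S|} H(S_i)$.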
In [207]:
import math
import itertools
import operator
import numpy
import sys
def entropy(class_vector):
# TODO: Compute the Shannon entropy for a vector of classes
# Note: Classes will be given as either a 0 or a 1.
zeroCount = 0
oneCount = 0
for i in class_vector:
if i == 1:
oneCount += 1
else:
zeroCount += 1
    totCount = oneCount + zeroCount
    if totCount == 0:
        return 0.0
    oneFrac = 1.0 * oneCount / totCount
    zeroFrac = 1.0 * zeroCount / totCount
    totEntropy = 0.0
    if oneFrac > 0.0 and zeroFrac > 0.0:
        # Shannon entropy in bits (log base 2)
        totEntropy = -(oneFrac * math.log(oneFrac, 2) + zeroFrac * math.log(zeroFrac, 2))
    return totEntropy
def information_gain(previous_classes, current_classes):
    # TODO: Implement information gain
    # Gain = parent entropy minus the size-weighted entropy of the child splits.
    entropyOrig = entropy(previous_classes)
    total = float(len(previous_classes))
    entropyNew = sum(len(split) * entropy(split) for split in current_classes) / total
    return entropyOrig - entropyNew
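# Quick sanity checks on the helpers above (assumed example values, base-2 logs):
#   entropy([1, 1, 0, 0])                             -> 1.0
#   entropy([1, 1, 1, 1])                             -> 0.0
#   information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])  -> 1.0 (a perfect split)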
class DecisionTree():
def __init__(self, depth_limit=float('inf')):
self.root = None
self.depth_limit = depth_limit
self.numAttr = 0
#self.ignoreAttr = []
def fit(self, features, classes):
self.numAttr = len(features[0])
self.root = self.__build_tree__(features, classes)
#self.ignoreAttr = []
def __build_tree__(self, features, classes, depth=0, mode=0, attr_subsample_rate=0):
#TODO Implement the algorithm as specified above
#check for base cases
#if len(self.ignoreAttr)==self.numAttr:
# sys.exit("Depth more than number of attributes!")
if depth == self.depth_limit:
zeroCount = 0
oneCount = 0
for i in classes:
if i == 1:
oneCount += 1
else:
zeroCount += 1
maxLabel = 0 if zeroCount>oneCount else 1
#print str(['reached max depth, maxLabel is: ', maxLabel,'. counts are (0/1): ',zeroCount,'/',oneCount])
return DecisionNode(None, None, None, maxLabel)
uniqueLabels = set(classes)
if len(uniqueLabels)==1:
return DecisionNode(None, None, None, uniqueLabels.pop())
bestIG = 0
bestAttr = 0
bestAttrSplit = 0
bestLeftIndex = []
sampledAttr = []
#serially checking each attribute for IG
if mode==1:
            sampledAttr = list(numpy.random.choice(range(len(features[0])), int(len(features[0]) * attr_subsample_rate)))
#print sampledAttr
for i in range(self.numAttr):
if mode==0 or (mode==1 and i in sampledAttr):
#print [mode, i]
allAttrValues = [featVector[i] for featVector in features]
minAttrValue = min(allAttrValues)
maxAttrValue = max(allAttrValues)
#print [minAttrValue, maxAttrValue]
                split = minAttrValue
                # Float division: integer division would yield a zero step for
                # integer-valued attributes and the loop below would never advance.
                step = (maxAttrValue - minAttrValue) / 100.0
                if step == 0:
                    continue
                bestSplitIG = 0
bestSplit = 0
bestSplitLeftIndex = []
#checking IG on different splits
while split<=maxAttrValue:
classSplitLeft = []
classSplitRight = []
indxLeft = []
j=0
for attValue in allAttrValues:
#print [attValue, j, classes[j]]
if attValue<split:
classSplitLeft.append(classes[j])
indxLeft.append(j)
else:
classSplitRight.append(classes[j])
j += 1
if len(classSplitLeft)>0 and len(classSplitRight)>0:
IG = information_gain(classes,[classSplitLeft,classSplitRight])
if IG>bestSplitIG:
bestSplitIG = IG
bestSplit = split
bestSplitLeftIndex = list(indxLeft)
split += step
#checking if this attribute is the best or not
if bestSplitIG>bestIG:
bestIG = bestSplitIG
bestAttr = i
bestAttrSplit = bestSplit
bestLeftIndex = list(bestSplitLeftIndex)
#print str(['depth: ',depth,' bestIG: ', bestIG, 'bestAttr: ',bestAttr,' bestSplit: ', bestSplit])
#now we have the index of the best attribute by maximizing on the IG
#self.ignoreAttr.append(bestAttr)
        # If no split produced any information gain, stop and return a leaf with
        # the majority class rather than recursing on an empty partition.
        if not bestLeftIndex:
            majority = 1 if classes.count(1) >= classes.count(0) else 0
            return DecisionNode(None, None, None, majority)
        #construct feature and class label lists for left and right tree
        leftFeatures = []
        leftClasses = []
        rightFeatures = []
        rightClasses = []
for i in range(len(features)):
if i in bestLeftIndex:
leftFeatures.append(features[i])
leftClasses.append(classes[i])
else:
rightFeatures.append(features[i])
rightClasses.append(classes[i])
decFunc = lambda feat: feat[bestAttr]<bestAttrSplit
if mode==0:
return DecisionNode(self.__build_tree__(leftFeatures,leftClasses, depth+1),self.__build_tree__(rightFeatures,rightClasses,depth+1), decFunc)
else:
#print str(['best split on depth:',depth,'. Split:', bestAttrSplit,'. Length of left and right',len(leftFeatures),' ', len(rightFeatures)])
return DecisionNode(self.__build_tree__(leftFeatures,leftClasses, depth+1,1,attr_subsample_rate),self.__build_tree__(rightFeatures,rightClasses,depth+1,1,attr_subsample_rate), decFunc)
def classify(self, features):
#TODO Use a fitted tree to classify a list of feature vectors
# Your output should be a list of class labels (either 0 or 1)
labels = [0 for i in features]
for i in range(len(features)):
labels[i] = self.root.decide(features[i])
return labels
For this part of the assignment we're going to use a relatively simple dataset (banknote authentication, found in 'part2_data.csv'). In the section below there are methods to load the data in a consistent format.
In general, reserving part of your data as a test set can lead to unpredictable performance estimates - a lucky (or unlucky) choice of train/test split can give you a very inaccurate idea of how your classifier performs. That's where k-fold cross validation comes in.
In the method below, we'll split the dataset at random into k equal subsections; then, iterating over each of our k subsections, we'll reserve that subsection for testing and use the other k-1 for training. Averaging the results of each fold should give us a more consistent idea of how the classifier is doing.
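To make the splitting step concrete, here is a minimal sketch of one way to carve shuffled indices into disjoint folds with numpy (the sizes here are made up for illustration; the stub below builds the folds itself):

import numpy

indices = numpy.random.permutation(12)          # 12 hypothetical example indices, shuffled
test_chunks = numpy.array_split(indices, 3)     # 3 disjoint test chunks of 4 indices each
for test_idx in test_chunks:
    train_idx = [i for i in indices if i not in set(test_idx)]
    print("%d train / %d test" % (len(train_idx), len(test_idx)))   # 8 train / 4 test, three times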
In [208]:
import random
import numpy
def load_csv(data_file_path, class_index=-1):
handle = open(data_file_path, 'r')
contents = handle.read()
handle.close()
rows = contents.split('\n')
out = numpy.array([ [float(i) for i in r.split(',')] for r in rows if r ])
classes= map(int, out[:, class_index])
features = out[:, :class_index]
return features, classes
def generate_k_folds(dataset, k):
    #TODO this method should return a list of folds,
    # where each fold is a tuple like (training_set, test_set)
    # where each set is a tuple like (examples, classes)
    features, classes = dataset
    # Shuffle the example indices once, then carve them into k disjoint,
    # (roughly) equal-sized test chunks so every example is tested exactly once.
    indices = list(range(len(classes)))
    random.shuffle(indices)
    foldSize = int(len(classes) / k)
    kFolds = []
    for i in range(k):
        testIndx = set(indices[i * foldSize:(i + 1) * foldSize])
        testFeat = []
        testClasses = []
        trainingFeat = []
        trainingClasses = []
        for j in range(len(features)):
            if j in testIndx:
                testFeat.append(features[j])
                testClasses.append(classes[j])
            else:
                trainingFeat.append(features[j])
                trainingClasses.append(classes[j])
        kFolds.append(((trainingFeat, trainingClasses), (testFeat, testClasses)))
    return kFolds
dataset = load_csv('part2_data.csv')
ten_folds = generate_k_folds(dataset, 10)
#on average your accuracy should be higher than 60%.
accuracies = []
precisions = []
recalls = []
confusion = []
for fold in ten_folds:
train, test = fold
train_features, train_classes = train
test_features, test_classes = test
tree = DecisionTree(1)
tree.fit( train_features, train_classes)
output = tree.classify(test_features)
#print output
accuracies.append( accuracy(output, test_classes))
precisions.append( precision(output, test_classes))
recalls.append( recall(output, test_classes))
confusion.append( confusion_matrix(output, test_classes))
print sum(accuracies)*10
print sum(precisions)*10
print sum(recalls)*10
30 pts.
The decision boundaries drawn by decision trees are very sharp, and fitting a decision tree of unbounded depth to a list of examples almost inevitably leads to overfitting. In an attempt to decrease the variance of our classifier we're going to use a technique called 'Bootstrap Aggregating' (often abbreviated 'bagging').
A Random Forest is a collection of decision trees, built as follows:
1) For every tree we're going to build:
a) Subsample the examples provided to us (with replacement), in accordance with a provided example subsampling rate.
b) From the sample in a), choose attributes at random to learn on (in accordance with a provided attribute subsampling rate)
c) Fit a decision tree to the subsample of data we've chosen (to a certain depth)
Classification for a random forest is then done by taking a majority vote of the classifications yielded by each tree in the forest after it classifies an example.
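As a minimal sketch of the two subsampling steps in a) and b) (the counts and rates here are made up for illustration; the RandomForest class below performs these steps for each tree it builds):

import numpy

n_examples, n_attrs = 100, 4
example_rate, attr_rate = 0.5, 0.5

# a) Bootstrap the examples: sample indices with replacement.
example_idx = numpy.random.choice(n_examples, int(n_examples * example_rate), replace=True)

# b) Subsample the attributes: pick a random subset without replacement.
attr_idx = numpy.random.choice(n_attrs, int(n_attrs * attr_rate), replace=False)

print("%d example indices, %d attribute indices" % (len(example_idx), len(attr_idx)))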
In [209]:
import numpy
class RandomForest():
def __init__(self, num_trees, depth_limit, example_subsample_rate, attr_subsample_rate):
self.trees = []
self.num_trees = num_trees
self.depth_limit = depth_limit
self.example_subsample_rate = example_subsample_rate
self.attr_subsample_rate = attr_subsample_rate
    def fit(self, features, classes):
        # Assuming the example and attribute subsample rates are fractions.
        # For each tree: bootstrap-sample the examples (with replacement),
        # then fit a depth-limited tree that subsamples attributes at each split.
        for i in range(self.num_trees):
            numSamples = int(len(features) * self.example_subsample_rate)
            trainingIdx = numpy.random.choice(range(len(features)), numSamples)
            trainingFeat = [features[j] for j in trainingIdx]
            trainingClass = [classes[j] for j in trainingIdx]
            tree = DecisionTree(self.depth_limit)
            tree.numAttr = len(features[0])
            tree.root = tree.__build_tree__(trainingFeat, trainingClass, 0, 1, self.attr_subsample_rate)
            self.trees.append(tree)
    def classify(self, features):
        # Get a classification from every tree, then take a majority vote per example.
        count = [[0, 0] for _ in range(len(features))]
        for tree in self.trees:
            labels = tree.classify(features)
            for j, l in enumerate(labels):
                count[j][l] += 1
        classes = [0 if count[j][0] > count[j][1] else 1 for j in range(len(features))]
        return classes
#TODO: As with the DecisionTree, evaluate the performance of your RandomForest on the dataset for part 2.
# on average your accuracy should be higher than 75%.
# Optimize the parameters of your random forest for accuracy for a forest of 5 trees.
# (We'll verify these by training one of your RandomForest instances using these parameters
# and checking the resulting accuracy)
# Fill out the function below to reflect your answer:
dataset = load_csv('part2_data.csv')
ten_folds = generate_k_folds(dataset, 10)
bestAcc = 0
bestPrec = 0
bestRecall = 0
bestI = 0
bestJ = 0
i=0.1
stepData = 0.1
stepAttr = 0.25
'''while i<1:
j = 0.75
while j<=1:
print [i,j]'''
accuracies = []
precisions = []
recalls = []
confusion = []
for fold in ten_folds:
train, test = fold
train_features, train_classes = train
test_features, test_classes = test
RF = RandomForest(5,1,0.1,0.25)
RF.fit(train_features, train_classes)
output = RF.classify(test_features)
#print zip(output, test_classes)
accuracies.append( accuracy(output, test_classes))
precisions.append( precision(output, test_classes))
recalls.append( recall(output, test_classes))
confusion.append( confusion_matrix(output, test_classes))
avgAcc = sum(accuracies)*10
avgPrec = sum(precisions)*10
avgRec = sum(recalls)*10
'''if avgAcc>bestAcc:
bestAcc = avgAcc
bestPrec = avgPrec
bestRecall = avgRec
bestI = i
bestJ = j
j += stepAttr
i += stepData'''
#print [bestAcc,bestPrec,bestRecall,bestI, bestJ]
print avgAcc
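# Assumed ordering below, matching the RandomForest(num_trees, depth_limit,
# example_subsample_rate, attr_subsample_rate) constructor used above: a 5-tree
# forest with depth limit 1, example subsample rate 0.1, attribute subsample rate 0.25.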
def ideal_parameters():
return 1, 0.1, 0.25
10 pts.
You've been provided with a sample of data from a research dataset in 'challenge_data.pickle'. It is serialized as a tuple of (features, classes). I have reserved a part of the dataset for testing. The classifier that performs most accurately on the holdout set wins (so optimize for accuracy). To get full points for this part of the assignment, you'll need to get at least an average accuracy of 80% on the data you have, and at least an average accuracy of 60% on the holdout set.
Ties will be broken by submission time.
First place: +3% on your final grade
Second place: +2% on your final grade
Third place: +1% on your final grade
In [79]:
class ChallengeClassifier():
def __init__(self):
# initialize whatever parameters you may need here-
# this method will be called without parameters
# so if you add any to make parameter sweeps easier, provide defaults
        raise NotImplementedError()
def fit(self, features, classes):
# fit your model to the provided features
        raise NotImplementedError()
def classify(self, features):
# classify each feature in features as either 0 or 1.
        raise NotImplementedError()