Lesson 13 - Validation


In [1]:
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

# Pickle files are binary, so open in "rb" mode and close the file when done.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

### The first element is the label; any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)


print("%d %d" % (len(labels), len(features)))


95 95
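The convention described in the comments above (label first, predictor features after) can be illustrated with a small stand-in. This is only a sketch: `split_labels_features` and the sample rows are hypothetical, and this is not the course's `targetFeatureSplit` implementation.

```python
def split_labels_features(rows):
    """Split rows of [label, feature_1, ...] into two parallel lists."""
    row_labels = [row[0] for row in rows]
    row_features = [list(row[1:]) for row in rows]
    return row_labels, row_features

# Made-up rows in the order ["poi", "salary"]; the values are illustrative only.
demo_rows = [[0.0, 50000.0], [1.0, 120000.0], [0.0, 75000.0]]
demo_labels, demo_features = split_labels_features(demo_rows)

print(demo_labels)
print(demo_features)
```

Each feature stays wrapped in its own list even when there is only one, because scikit-learn estimators expect a two-dimensional feature array.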

Create a decision tree classifier (just use the default parameters), train it on all the data, and print out the accuracy. THIS IS AN OVERFIT TREE, DO NOT TRUST THIS NUMBER: the classifier is scored on the same points it was trained on. Nonetheless,

  • What’s the accuracy?

In [2]:
from sklearn import tree
from time import time

def submitAcc(features, labels):
    # Mean accuracy of the module-level classifier `clf` on the given data.
    return clf.score(features, labels)



clf = tree.DecisionTreeClassifier()
t0 = time()
clf.fit(features, labels)
print("done in %0.3fs" % (time() - t0))


done in 0.001s

In [3]:
pred = clf.predict(features)
# clf.score returns a fraction between 0 and 1, so print it without a percent sign.
print("Classifier with accuracy %.2f" % submitAcc(features, labels))


Classifier with accuracy 0.99
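To see why a score measured on the training points says little, here is a minimal sketch on synthetic data, assuming NumPy and scikit-learn are installed. The feature and labels are random and unrelated to each other (they are not the project data), and the names `demo_clf`, `X_seen`, and `X_unseen` are made up for illustration.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
X = rng.rand(1000, 1)               # one random feature, unrelated to the labels
y = rng.randint(0, 2, size=1000)    # random 0/1 labels

# Fit on the first 700 points and keep the last 300 aside.
X_seen, y_seen = X[:700], y[:700]
X_unseen, y_unseen = X[700:], y[700:]

demo_clf = tree.DecisionTreeClassifier(random_state=0)
demo_clf.fit(X_seen, y_seen)

train_acc = demo_clf.score(X_seen, y_seen)
unseen_acc = demo_clf.score(X_unseen, y_unseen)
print("accuracy on training points: %.2f" % train_acc)
print("accuracy on unseen points:   %.2f" % unseen_acc)
```

A default decision tree keeps splitting until it has memorised the points it was given, so the first number looks excellent even though there is nothing to learn here; the second number is the honest one.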

Now you’ll add in training and testing, so that you get a trustworthy accuracy number. Use the train_test_split function available in sklearn.cross_validation (moved to sklearn.model_selection in scikit-learn 0.18 and later), hold out 30% of the data for testing, and set the random_state parameter to 42. random_state controls which points go into the training set and which are used for testing; setting it to 42 means we know exactly which points are in which set and can check the results you get.

  • What’s your updated accuracy?
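The role of random_state can be checked with a short sketch. It assumes scikit-learn 0.18 or later, where train_test_split is importable from sklearn.model_selection; the `points` and `tags` lists are made-up stand-ins for 95 data points, not the project data.

```python
from sklearn.model_selection import train_test_split

points = list(range(95))            # stand-in for 95 data points
tags = [i % 2 for i in points]      # made-up labels

# Same random_state twice, then a different one.
a_train, a_test, _, _ = train_test_split(points, tags, test_size=0.30, random_state=42)
b_train, b_test, _, _ = train_test_split(points, tags, test_size=0.30, random_state=42)
c_train, c_test, _, _ = train_test_split(points, tags, test_size=0.30, random_state=7)

print(len(a_train), len(a_test))    # sizes of the two sets
print(a_test == b_test)             # same seed, same held-out points
print(a_test == c_test)             # a different seed will usually pick other points
```

Fixing the seed is what makes an accuracy number comparable between two people running the same notebook.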

In [15]:
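# train_test_split lives in sklearn.model_selection from scikit-learn 0.18 on;
# fall back to the older sklearn.cross_validation module otherwise.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.30, random_state=42)

print("%d %d" % (len(X_train), len(y_train)))
print("%d %d" % (len(X_test), len(y_test)))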


66 66
29 29

In [16]:
clf = tree.DecisionTreeClassifier()
t0 = time()
clf.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))


done in 0.001s

In [17]:
pred = clf.predict(X_test)
# Score on the held-out points only; the value is a fraction between 0 and 1.
print("Classifier with accuracy %.2f" % submitAcc(X_test, y_test))


Classifier with accuracy 0.72

And the testing data brings us back down to earth: accuracy on the held-out points is 0.72, well below the 0.99 measured on the training data.

