Udacity Machine Learning Validation mini-project

Prep: load the dataset and pick the features.



In [1]:

    
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

# Pickle files must be opened in binary mode ("rb"); text mode fails on Python 3.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)



In [2]:

    
### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)
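
To make the "first element is our labels" convention concrete, here is a minimal sketch of that kind of split on a toy array. The helper name `split_labels_and_features` and the toy values are hypothetical, and this is only an assumption about what `targetFeatureSplit` does based on the comment above, not its actual source:

```python
import numpy as np

def split_labels_and_features(rows):
    """Hypothetical helper: column 0 is the label, the rest are features."""
    labels = [float(row[0]) for row in rows]
    features = [[float(value) for value in row[1:]] for row in rows]
    return labels, features

# Toy rows in the order [poi, salary]; the numbers are made up for illustration.
toy = np.array([[1.0, 250000.0],
                [0.0, 80000.0],
                [0.0, 120000.0]])

toy_labels, toy_features = split_labels_and_features(toy)
print(toy_labels)
print(toy_features)
```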

Training a decision tree on this starter data:



In [3]:

    
from sklearn.tree import DecisionTreeClassifier

Doing things the wrong way (scoring on the same data the tree was trained on):



In [4]:

    
clf = DecisionTreeClassifier()
clf.fit(features, labels)
# Accuracy on the training data itself: the tree has already seen every one of these points.
clf.score(features, labels)









    Out[4]:

0.98947368421052628
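
Why a high score here says little: an unconstrained decision tree can memorize its training set. The sketch below is not part of the mini-project; it uses synthetic data (a single made-up numeric feature and labels drawn at random, so there is nothing real to learn) and hypothetical variable names to show the effect:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# 1000 distinct feature values and labels that are independent of them.
X = rng.permutation(1000).reshape(-1, 1).astype(float)
y = rng.randint(0, 2, size=1000)

X_seen, y_seen = X[:700], y[:700]
X_unseen, y_unseen = X[700:], y[700:]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_seen, y_seen)

# The tree separates every distinct training point, so this is perfect...
train_acc = tree.score(X_seen, y_seen)
# ...while on points it never saw it can only guess.
unseen_acc = tree.score(X_unseen, y_unseen)
print(train_acc, unseen_acc)
```

The training accuracy is perfect even though the labels are pure noise, which is why a score computed on the training data cannot be read as evidence that the model generalizes.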

Less wrong (hold out 30% of the data and score only on that):



In [5]:

    
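# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection.
from sklearn.model_selection import train_test_split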



In [6]:

    
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)



In [7]:

    
clf = DecisionTreeClassifier()
# Fit on the training split only, then score on the held-out split.
clf.fit(features_train, labels_train)
clf.score(features_test, labels_test)









    Out[7]:

0.72413793103448276
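
A single train/test split is only "less wrong" because the score still depends on which rows happen to land in the test set. One common follow-up is k-fold cross-validation, sketched below with scikit-learn's `cross_val_score`. This goes beyond what the mini-project asks for; the data is synthetic and the names `demo_features` and `demo_labels` are hypothetical stand-ins for the real `features` and `labels`:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Synthetic stand-in data: one feature, label is 1 when the feature exceeds 0.5.
demo_features = rng.rand(100, 1)
demo_labels = (demo_features[:, 0] > 0.5).astype(int)

# Five folds: each fold is held out once while the tree trains on the other four.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), demo_features, demo_labels, cv=5)
mean_score = scores.mean()
print(scores, mean_score)
```

Averaging over the folds gives an estimate that is less sensitive to one lucky or unlucky split, at the cost of fitting the model several times.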