In [1]:
%load poi_id.py

In [2]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [15]:
#!/usr/bin/python

import matplotlib.pyplot as plt
import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat
from feature_format import targetFeatureSplit

### features_list is a list of strings, each of which is a feature name
### first feature must be "poi", as this will be singled out as the label
features_list = ["poi","salary"]


### load the dictionary containing the dataset
data_dict = pickle.load(open("final_project_dataset.pkl", "rb"))

### we suggest removing any outliers before proceeding further

### if you are creating any new features, you might want to do that here
### store to my_dataset for easy export below
my_dataset = data_dict



### these two lines extract the features specified in features_list
### and extract them from data_dict, returning a numpy array
data = featureFormat(my_dataset, features_list)



### if you are creating new features, could also do that here



### split into labels and features (this line assumes that the first
### feature in the array is the label, which is why "poi" must always
### be first in features_list)
labels, features = targetFeatureSplit(data)



### machine learning goes here!
### please name your classifier clf for easy export below

clf = None    ### get rid of this line!  just here to keep code from crashing out-of-box


### dump your classifier, dataset and features_list so 
### anyone can run/check your results
pickle.dump(clf, open("my_classifier.pkl", "wb"))
pickle.dump(data_dict, open("my_dataset.pkl", "wb"))
pickle.dump(features_list, open("my_feature_list.pkl", "wb"))
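
The starter comments suggest removing outliers before going further. A minimal sketch of how one might start looking for them, assuming the standard Udacity dataset (where the spreadsheet aggregate row 'TOTAL' is a well-known outlier):

In [ ]:
# Sketch: find the record with the largest salary. In the standard
# dataset this turns out to be the spreadsheet total row 'TOTAL',
# an aggregate that should be removed before training.
salaries = [(name, person["salary"]) for name, person in data_dict.items()
            if person["salary"] != "NaN"]
print max(salaries, key=lambda pair: pair[1])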

In [5]:
from sklearn.cross_validation import train_test_split

In [16]:
from sklearn.tree import DecisionTreeClassifier

In [18]:
features_train,features_test,labels_train,labels_test = \
train_test_split(features,labels,test_size=0.3,random_state=42)

In [19]:
clf = DecisionTreeClassifier()
clf.fit(features_train,labels_train)


Out[19]:
DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features=None, min_density=None,
            min_samples_leaf=1, min_samples_split=2, random_state=None,
            splitter='best')

In [20]:
clf.score(features_test,labels_test)


Out[20]:
0.72413793103448276

In [21]:
from sklearn.metrics import *

In [26]:
print precision_score(labels_test,clf.predict(features_test))


0.0

In [25]:
print recall_score(labels_test,clf.predict(features_test))


0.0

Both precision and recall come out as 0.0: despite reasonable accuracy, the classifier never correctly identifies a POI in the test set. That is the problem with this classifier. To understand why, let's look more closely at the dataset.
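
To check this interpretation, a minimal sketch using the confusion matrix (confusion_matrix lives in sklearn.metrics, already imported above):

In [ ]:
from sklearn.metrics import confusion_matrix

# Sketch: the confusion matrix shows how many POIs were classified
# correctly. If every true POI is missed, precision and recall are
# both zero even though overall accuracy still looks acceptable.
print confusion_matrix(labels_test, clf.predict(features_test))
print "POIs in test set:", int(sum(labels_test))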

Dataset and Questions

To decide which features to choose and which kind of algorithm to pick, it helps to first get some insight into the data by asking a series of questions. We already know that identifying whether a person is a POI or not is a classification problem, so we will want to try a series of classification algorithms. For now, let's take a look at the data.

First, the number of people currently in the dataset:


In [29]:
print len(data_dict)


146

And the number of features each person has:


In [31]:
print len(data_dict['SKILLING JEFFREY K'])


21

In [33]:
data_dict['SKILLING JEFFREY K'].keys()


Out[33]:
['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']

It seems that in this dataset, POIs tend to have missing values for their financial features. A learning algorithm could interpret this the wrong way: it might learn that any person with a 'NaN' financial value must be a POI. One way to deal with this is to discard the financial information for Enron employees entirely and work directly with the Enron email data.
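
To check this, a minimal sketch that compares missing financial values between POIs and non-POIs, assuming (as in the starter dataset) that missing entries are stored as the string 'NaN'; the four features below are just an illustrative subset:

In [ ]:
# Sketch: count 'NaN' entries in a few financial features,
# separately for POIs and non-POIs. Assumes missing values are
# stored as the string 'NaN', as in the starter dataset.
financial_features = ["salary", "bonus", "total_payments", "total_stock_value"]
for is_poi in [True, False]:
    group = [p for p in data_dict.values() if p["poi"] == is_poi]
    missing = sum(1 for p in group
                  for f in financial_features if p[f] == "NaN")
    label = "POI" if is_poi else "non-POI"
    print label, "-", len(group), "people,", missing, "missing financial values"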

