Goal:

  • Based on the email and financial dataset, predict whether an Enron employee is a person of interest (POI) in the fraud case.
  • Target evaluation metrics: precision and recall both above 0.3.

Machine Learning

  • Use Naive Bayes for email author classification.
  • Preprocess with tools/email_preprocess. word_data contains the list of words; email_authors contains the authors encoded numerically. Possibly we extract the name and the email and feed them to the preprocessor.
  • Use SVM; keeping only 1% of the features can still give 88-99% accuracy.
  • Naive Bayes is better at classifying this text than SVM: it is faster and gives better performance.
  • Decision Tree (sketched below):
    • min_samples_split: the minimum number of samples a node must hold before it may split; raise it to avoid overfitting.
    • Feature selection is based on information gain.
    • Use SelectPercentile to limit features: the more features, the more complex the fit.
  • RandomForest has built-in protection against overfitting (random bootstrap samples, majority votes across the ensemble).
  • Use feature_format to convert the data dictionary into a list of features:

      feature_list = ["poi", "salary", "bonus"] 
      data_array = featureFormat( data_dictionary, feature_list )
      label, features = targetFeatureSplit(data_array)
    
      the line above (targetFeatureSplit) assumes that the
      label is the _first_ item in feature_list--very important
      that poi is listed first!
  • LinearRegression: long_term_incentive explains bonus better than salary does.
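
  A minimal sketch of the decision-tree bullets above, on toy data (numbers are illustrative, not tuned):

      from sklearn.tree import DecisionTreeClassifier

      # toy features: [salary, bonus]; label 1 = POI
      features = [[200000, 1000000], [100000, 0], [400000, 2000000], [90000, 0]]
      labels = [1, 0, 1, 0]

      # min_samples_split defaults to 2; raising it (e.g. to 40 on the real data)
      # forces nodes to hold more samples before splitting, which curbs overfitting
      clf = DecisionTreeClassifier(min_samples_split=2)
      clf.fit(features, labels)
      print(clf.feature_importances_)  # relative importance of salary vs bonus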

  • Outliers can greatly affect the slope in linear regression.
  • Visualize the outliers in outlier/enron_outliers.py.
  • After removing outliers, salary and bonus are important numerical features for detecting POIs!

  • Not using clustering since we already have labeled data.

  • Feature scaling matters for SVM (RBF kernel) and k-means: both trade features off against one another in a distance computation, so their relative scales interact. Linear regression and decision trees don't need it, since each variable is treated independently (DT: each split uses one variable; LR: each variable gets its own coefficient).
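
  For example, a MinMaxScaler sketch (toy numbers):

      from sklearn.preprocessing import MinMaxScaler

      # salary is on the order of hundreds of thousands, message counts in the tens;
      # without rescaling, a distance-based model (SVM with RBF kernel, k-means)
      # is dominated by the salary axis
      finance_features = [[200000., 20.], [100000., 50.], [400000., 5.]]
      rescaled = MinMaxScaler().fit_transform(finance_features)  # each column mapped to [0, 1]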

  • SelectPercentile keeps the top X percent of features; SelectKBest keeps the top K.
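
  Both live in sklearn.feature_selection; a quick sketch on synthetic data:

      from sklearn.datasets import make_classification
      from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

      X, y = make_classification(n_samples=100, n_features=20, random_state=42)

      # keep the best 10% of features by ANOVA F-score (2 of 20 columns here)...
      X_top_10pct = SelectPercentile(f_classif, percentile=10).fit_transform(X, y)
      # ...versus keeping a fixed number of features
      X_top_5 = SelectKBest(f_classif, k=5).fit_transform(X, y)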

  • Use TfidfVectorizer when frequent words need to be filtered out (a combined stemming + Tf-Idf sketch follows this list).
  • Lasso regression avoids overfitting through regularization.
  • Call clf.feature_importances_ to see which features drive the most performance.
  • Use the vectorizer and the method above to find which words are signatures of POIs.
  • Use a stemmer (SnowballStemmer) to stem words.
  • Use ../tools/parse_out_email_text.py to parse the text out of an email.
  • Use text_learning/vectorize_text.py to parse out stemmed words, labelled 1 if from a POI, 0 otherwise.
  • Use PCA to find latent features.
  • F1 score: the best of both worlds (balances precision and recall).
  • Use train_test_split and GridSearchCV to find the best parameters.
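
  A combined sketch of the stemming and Tf-Idf bullets above (the documents and parameter values are illustrative):

      from nltk.stem.snowball import SnowballStemmer
      from sklearn.feature_extraction.text import TfidfVectorizer

      stemmer = SnowballStemmer("english")
      docs = ["Kenneth responded to the responsive responsibility memo",
              "the memo responded well"]
      stemmed = [" ".join(stemmer.stem(w) for w in d.split()) for d in docs]

      # stop_words drops common English words; max_df drops words appearing in
      # more than half of the documents (too frequent to be a POI signature)
      vect = TfidfVectorizer(stop_words="english", max_df=0.5)
      tfidf = vect.fit_transform(stemmed)
      print(vect.get_feature_names())  # the surviving stemmed vocabulary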

Methods:

  • Use poi_names for labelling; connect it to the Enron dataset dictionary.
  • The dictionary has email addresses; if we want to see from/to Enron emails, crawl that dataset.
  • The dataset has the form enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }.
  • Pick the best features using human understanding of the dataset.
  • No training point has "NaN" for total_payments when the class label is POI.
  • After adding the new entries there are 156 folks in the dataset, 31 of whom have "NaN" total_payments -- about 20% overall.
  • With those new data points counted, 36% of POIs have "NaN" for total_payments.
  • Don't remove the outliers we actually want to pay attention to.
  • Remove outliers by dropping the 10% of points with the highest residual errors, then refit (outliers/outlier_cleaner.py; sketched below this list).
  • Remove the ['TOTAL'] outlier (it's not a person) from the final_project_dataset.
  • In the next visualization (after removing TOTAL, outlier/enron_outliers.py), two of the four remaining outliers are identified POIs. Not sure about the rest; check again whether they are typos.
  • Feature scale all of your numerical features, including salary and exercised_stock_options.
  • It is critical to feature scale n_messages and salary! The first is a simple count, while the latter is on the order of thousands.
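
  A sketch of that residual-based cleaning; the real outliers/outlier_cleaner.py may differ in its signature, this just captures the idea:

      import numpy as np

      def clean_outliers(predictions, xs, ys):
          # squared residual of each point against the fitted line
          errors = (np.array(ys) - np.array(predictions)) ** 2
          keep = np.argsort(errors)[:int(len(errors) * 0.9)]  # drop the worst 10%
          return [(xs[i], ys[i], errors[i]) for i in keep]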

  • Intuition: POIs email other POIs at a higher rate than they email people in general.

  • Visualize: which features make POIs stand out from everyone else?
  • Use feature_selection/poi_flag_email.py to identify whether someone is a POI, based on a given email file.
  • Use feature_selection/new_enron_feature.py to create a new feature: the count of emails from POIs.
  • Use feature_selection/visualize_new_feature.py to create the fraction of messages from POIs out of all messages received (helper sketched below).
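
  A hypothetical helper mirroring what visualize_new_feature.py computes ('NaN' counts are treated as zero):

      def fraction_from_poi(person):
          """Fraction of a person's received messages that came from POIs."""
          poi_msgs = person['from_poi_to_this_person']
          all_msgs = person['to_messages']
          if poi_msgs == 'NaN' or all_msgs in ('NaN', 0):
              return 0.
          return float(poi_msgs) / all_msgs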

Problems

  • A POI could end up not being in the dataset (e.g., Michael Krautz).
  • Inconsistent naming formats in the text emails.
  • total_payments could act as a feature on its own: originally only non-POIs had "NaN", so an isTotalPaymentExist indicator for the missing values could be picked up by the model to identify POIs (sketched after this list). This is wrong: it is an artifact of how the data was assembled, and once the new data points were added the pattern evened out.
  • Different data sources can introduce different biases and mistakes.
  • Financial features carry some bias that you have to handle; this doesn't arise if you only use the email data.
  • Whether to feature scale salary and stock is debatable.
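
  A sketch of that (ultimately misleading) indicator feature, for illustration only:

      import pickle

      data_dict = pickle.load(open('final_project_dataset.pkl', 'r'))
      for person in data_dict.values():
          # 1 if total_payments is present, 0 if it is the string 'NaN';
          # this tracks how the data was assembled, not real POI behaviour
          person['isTotalPaymentExist'] = 0 if person['total_payments'] == 'NaN' else 1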

In [91]:
%run tester.py


SGDClassifier(alpha=1e-06, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=31, n_jobs=-1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)
	Accuracy: 0.79058	Precision: 0.36853	Recall: 0.35950	F1: 0.36396	F2: 0.36127
	Total predictions: 12000	True positives:  719	False positives: 1232	False negatives: 1281	True negatives: 8768


In [17]:
%run tester.py


GridSearchCV(cv=None, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'loss': ['hinge', 'log', 'squared_hinge'], 'n_iter': [30, 35], 'alpha': [0.01, 0.0001, 1e-06]},
       pre_dispatch='2*n_jobs', refit=True, scoring='f1', verbose=0)
	Accuracy: 0.78658	Precision: 0.35534	Recall: 0.34450	F1: 0.34983	F2: 0.34661
	Total predictions: 12000	True positives:  689	False positives: 1250	False negatives: 1311	True negatives: 8750


In [28]:
%run tester.py


GridSearchCV(cv=10, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'loss': ['hinge', 'log', 'squared_hinge'], 'n_iter': [30, 35], 'alpha': [0.01, 0.0001, 1e-06]},
       pre_dispatch='2*n_jobs', refit=True, scoring='f1', verbose=0)
	Accuracy: 0.78225	Precision: 0.34924	Recall: 0.35500	F1: 0.35210	F2: 0.35383
	Total predictions: 12000	True positives:  710	False positives: 1323	False negatives: 1290	True negatives: 8677


In [29]:
features_percentiled


Out[29]:
['deferred_income',
 'bonus',
 'total_stock_value',
 'salary',
 'exercised_stock_options']

In [7]:
import pickle

In [8]:
clf = pickle.load(open('my_classifier.pkl','r'))

In [30]:
# %%writefile poi_id.py
#!/usr/bin/python

import sys
import pickle
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectPercentile,f_classif
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

sys.path.append("../tools")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from visualize_new_feature import update_with_fraction_poi
from vectorize_text import get_word_data_from_email


### Load the dictionary containing the dataset
data_dict = pickle.load(open('final_project_dataset.pkl', "r") )

## Task 2: Remove outliers
data_dict.pop('TOTAL')

### Task 3: Create new feature(s)
data_with_fraction_poi = update_with_fraction_poi(data_dict)#Update dataset with fraction poi


full_df = pd.DataFrame.from_dict(data_with_fraction_poi,orient='index') #create pandas Dataframe from dataset
df = full_df[full_df.email_address != 'NaN']

cols = df.columns.tolist() # get the list of features
cols.remove('email_address')#remove non numeric features
cols.remove('poi')# remove labels
impute = df[cols].copy().applymap(lambda x: 0  if x == 'NaN' else x) #replace NaN features as 0

scaled = impute.apply(MinMaxScaler().fit_transform) # Scale each feature column to [0, 1]

selPerc = SelectPercentile(f_classif,percentile=21) # Build the SelectPercentile; percentile=21 chosen based on performance
selPerc.fit(scaled,df['poi']) # Learn which features to keep

features_percentiled = scaled.columns[selPerc.get_support()].tolist() # Keep only the columns the percentile selects
scaled['poi'] = df['poi'] # rejoin the label

###### Add Text Learning####
word_data = df['email_address'].apply(get_word_data_from_email) # Extract the words of each person's emails
vect = TfidfVectorizer(stop_words='english',max_df=0.4,min_df=0.33) # Build the vectorizer
vect.fit(word_data) # Fit the vectorizer to the word data
words = vect.get_feature_names() # Feature names, in the same order as the transform's columns
vectorized_words = vect.transform(word_data).toarray() # The Tf-Idf values of the vectorized words
df_docs = pd.DataFrame(vectorized_words,
                       columns=words,
                       index=df.index) # Use the same index: one row per person
df_with_data = pd.concat([df_docs,scaled],axis=1) # Concatenate email features with numerical features
############################

### Store to my_dataset for easy export below.
my_dataset = df_with_data.to_dict(orient='index') #change the dataframe back to dictionary
# my_dataset = scaled.to_dict(orient='index') #change the dataframe back to dictionary

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi'] + features_percentiled + words  # You will need to use more features

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.naive_bayes import GaussianNB ##Default(Tuned): Precision: 0.29453	Recall: 0.43650
from sklearn.tree import DecisionTreeClassifier ##Default: Precision: 0.14830	Recall: 0.05450
from sklearn.ensemble import RandomForestClassifier ##Default: Precision: 0.47575	Recall: 0.20600, Longer time
from sklearn.linear_model import SGDClassifier ##Tuned: Precision: 0.36853	Recall: 0.35950, BEST!



# clf = SGDClassifier(loss='hinge',
#                     penalty='l2',
#                     alpha=1e-6,
#                     n_iter=31,
#                     n_jobs=-1,
#                     random_state=42)

# clf = GaussianNB()

### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score

folds = 1000

# StratifiedShuffleSplit is used so that, despite the skewed classes, every split
# keeps the POI/non-POI proportion of the full dataset. With a plain train/test
# split, the test set (or, worse, the training set) could end up with no POI labels
# at all, which would make the evaluation unreliable. With 10 folds, every fold
# contains the same proportion of POI vs non-POI.

cv = StratifiedShuffleSplit(df.poi,10,random_state=42)
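# NOTE: the StratifiedShuffleSplit above is kept for reference but is not passed
# to GridSearchCV below, which uses plain 10-fold CV (cv=10); to use the
# stratified splits instead, pass cv=cv.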

sgd = SGDClassifier(penalty='l2',random_state=42)
parameters = {'loss': ['hinge','log','squared_hinge'],
              'n_iter': [30,35],
              'alpha': [1e-2, 1e-4 ,1e-6],
              }

clf = GridSearchCV(sgd,parameters,scoring='f1',cv=10)


### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)


Overwriting poi_id.py

vectorize_text.py

Given an email address, parse the text out of that person's emails.

Look up text_learning/vectorize_text.py.

If the address is in poi_email_address.py's poiEmails(), append 1 to from_poi, otherwise 0.
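
A minimal sketch of that labeling loop (poiEmails comes from the course's poi_email_address.py; parsed_emails, a list of (address, stemmed_text) pairs, is a hypothetical stand-in for what the parser produces upstream):

    from poi_email_address import poiEmails

    poi_addresses = set(poiEmails())
    word_data, from_poi = [], []
    # parsed_emails is assumed to hold (address, stemmed_text) pairs from the parser
    for address, stemmed_text in parsed_emails:
        word_data.append(stemmed_text)
        from_poi.append(1 if address in poi_addresses else 0)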


In [13]:
data


Out[13]:
array([[ 0.        ,  0.12080033,  0.521875  , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.99854354,  0.        , ...,  0.        ,
         0.03904287,  0.        ],
       [ 0.        ,  0.94246039,  0.05      , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  1.        ,  0.05625   , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        ,  1.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])


In [1]:
from vectorize_text import get_word_data_from_email

In [2]:
import pickle

In [3]:
data_dict  = pickle.load(open('final_project_dataset.pkl','r'))

In [4]:
import pandas as pd

In [9]:
df = pd.DataFrame.from_dict(data_dict,orient='index')

In [40]:
df['word_data'] = df['email_address'].apply(get_word_data_from_email)

In [42]:
vect = TfidfVectorizer(stop_words='english',max_df=0.8,min_df=0.1)

In [43]:
vect.fit_transform(df['word_data'])


Out[43]:
<145x518 sparse matrix of type '<type 'numpy.float64'>'
	with 13491 stored elements in Compressed Sparse Row format>

In [50]:
df_docs = pd.DataFrame(vect.fit_transform(df['word_data']).toarray(),
                       columns=vect.get_feature_names(),
                       index=df.index)

In [34]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB

In [49]:
GaussianNB().fit(vect.fit_transform(df['word_data']).toarray(), df.poi)


Out[49]:
GaussianNB()

In [15]:
df[df.email_address == 'NaN'].index


Out[15]:
Index([u'BADUM JAMES P', u'BAXTER JOHN C', u'BAZELIDES PHILIP J',
       u'BELFER ROBERT', u'BLAKE JR. NORMAN P', u'CHAN RONNIE',
       u'CLINE KENNETH W', u'CUMBERLAND MICHAEL S', u'DUNCAN JOHN H',
       u'FUGH JOHN L', u'GAHN ROBERT S', u'GATHMANN WILLIAM D', u'GILLIS JOHN',
       u'GRAMM WENDY L', u'GRAY RODNEY', u'JAEDICKE ROBERT',
       u'LEMAISTRE CHARLES', u'LOCKHART EUGENE E', u'LOWRY CHARLES P',
       u'MENDELSOHN JOHN', u'MEYER JEROME J', u'NOLES JAMES L',
       u'PEREIRA PAULO V. FERRAZ', u'REYNOLDS LAWRENCE', u'SAVAGE FRANK',
       u'SULLIVAN-SHAKLOVITZ COLLEEN', u'THE TRAVEL AGENCY IN THE PARK',
       u'URQUHART JOHN A', u'WAKEHAM JOHN', u'WALTERS GARETH W',
       u'WHALEY DAVID A', u'WINOKUR JR. HERBERT S', u'WROBEL BRUCE',
       u'YEAP SOON'],
      dtype='object')

In [6]:
df_modified = pd.DataFrame.from_dict(pickle.load(open('../dataset/final_project_dataset_modified.pkl','r')),orient='index')
