Fraud detection machine learning on Enron enterprise dataset

Table of Contents

  1. Abstract
  2. Workflow
  3. Environment, Best Practices and Fundamental Steps
  4. Jupyter Notebook Structure
  5. Methods and Procedures
    1. Naive Bayes
    2. SVM
    3. Decision Tree
  6. Summary of Results
  7. References

1. Abstract

The purpose of this project is to provide a reproducible paper on how well the Naive Bayes, SVM, and Decision Tree machine learning algorithms can identify the authors of emails, using a pre-processed list of email texts and their corresponding authors. The texts come from the dataset (comprising 146 users with 21 features each) of the famous fraud scandal at the bankrupt American company Enron Corporation. We will also study how to tune parameters to improve accuracy and performance.

NB: All contents and instructions used for this paper were based on the Udacity "Introduction to Machine Learning" course and were adapted to the goals explained here. This is being used for educational purposes only.

For more information on the history of the corporation, see the link below:
http://www.investopedia.com/updates/enron-scandal-summary/

2. Workflow

3. Environment, Best Practices and Fundamental Steps

This project is based on the following tools: git version 2.7.4, anaconda 4.3.1 (64-bit), Jupyter Notebook Server 4.3.1, Python 2.7.13, scikit-learn library.

The experiments can be reproduced in three distinct ways: through an Anaconda installation, through Docker, or through Oracle VirtualBox.

Please read the following link for best practices concerning projects with this environment and for key setup procedures: https://github.com/ecalio07/enron-paper/blob/master/BEST_PRACTICES.md

4. Jupyter Notebook Structure

The Jupyter Notebook server must have folders and files in the following structure. It should be the same structure as in this GitHub repository, except for the environments folder.

5. Methods and Procedures

The arguments of each classifier below will be configured to reach the best time performance and accuracy, and the results will be compared.

We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify the emails as written by one person or the other based only on the text of the email.

To find out which algorithm is best for this situation, we run tests and use the results to determine which one is most suitable for our scenario.

A couple of years ago, J.K. Rowling (of Harry Potter fame) tried something interesting. She wrote a book, “The Cuckoo’s Calling,” under the name Robert Galbraith. The book received some good reviews, but no one paid much attention to it--until an anonymous tipster on Twitter said it was J.K. Rowling. The London Sunday Times enlisted two experts to compare the linguistic patterns of “Cuckoo” to Rowling’s “The Casual Vacancy,” as well as to books by several other authors. After the results of their analysis pointed strongly toward Rowling as the author, the Times directly asked the publisher if they were the same person, and the publisher confirmed. The book exploded in popularity overnight.

We’ll do something very similar in this project, using the set of emails described above. We will start with Naive Bayes and then move on to the other algorithms.

5.1. Naive Bayes

Naive Bayes is sometimes considered the holy grail of probabilistic inference. It is based on the work of Reverend Thomas Bayes, who is said to have used these principles (Bayes' rule) to reason about the existence of God. His work gave rise to a family of methods that influenced artificial intelligence and statistics. Bayes' rule relies on the concepts of sensitivity and specificity.

Naive Bayes is a supervised classification algorithm used substantially in learning from documents (text learning). Each word is considered a feature and user names are considered the labels. It is called naive because it assumes that the features are independent of each other given the label; for text, this means it ignores word order.

The classifier uses posterior probability: given a text, it ranks the authors by how likely each one is to have written it. In other words, it is trained on the words (features) frequently used by Chris and Sara (labels), and it then calculates the probability that each test email comes from Chris or from Sara and assigns the more likely author.
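As a rough illustration of the posterior calculation, the sketch below applies Bayes' rule to a toy vocabulary. The word frequencies and the equal priors are invented for the example; they are not taken from the Enron data, and the notebook itself relies on scikit-learn's GaussianNB rather than this hand-rolled version.

```python
# Toy Naive Bayes: every probability below is a made-up illustrative value.

# P(word | author): hypothetical per-author word frequencies.
word_probs = {
    "Chris": {"deal": 0.5, "meeting": 0.3, "lunch": 0.2},
    "Sara": {"deal": 0.1, "meeting": 0.4, "lunch": 0.5},
}
# P(author): equal priors, assumed because the training set is roughly balanced.
priors = {"Chris": 0.5, "Sara": 0.5}


def posterior(words):
    """Return P(author | words) for each author using Bayes' rule."""
    joint = {}
    for author in priors:
        # Naive assumption: word likelihoods are multiplied independently.
        likelihood = 1.0
        for word in words:
            likelihood *= word_probs[author][word]
        joint[author] = priors[author] * likelihood
    # Normalise so that the posteriors sum to 1.
    total = sum(joint.values())
    return {author: value / total for author, value in joint.items()}


result = posterior(["lunch", "meeting"])
predicted_author = max(result, key=result.get)
print(result, predicted_author)
```

Swapping the order of the two words leaves the result unchanged, which is the word-order blindness that gives the method its name.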


In [9]:
#NAIVE BAYES

import sys
from time import time
sys.path.append("../tools")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()




#########################################################
### your code goes here ###

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print "training time: ", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(features_test)
print "predicting time: ", round(time()-t1, 3), "s"

accuracy = accuracy_score(labels_test, pred)

print accuracy



## TODO: add code to display how many emails were predicted for Chris and for Sara,
## which emails were attributed to each of them,
## and graphs of the results.


no. of Chris training emails: 7936
no. of Sara training emails: 7884
performance and accuracy being processed, please wait...
training time:  0.838 s
predicting time:  0.15 s
0.973265073948
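The pending step of counting how many test emails were attributed to each author could be sketched as follows. This assumes the label encoding 1 for Chris and 0 for Sara, which is not stated in this document, and it uses short hypothetical lists in place of the real labels_test and pred arrays.

```python
from collections import Counter

# Assumed label encoding (not confirmed in this document): 1 = Chris, 0 = Sara.
AUTHOR_NAMES = {0: "Sara", 1: "Chris"}


def count_by_author(predictions):
    """Count how many emails were predicted for each author."""
    counts = Counter(AUTHOR_NAMES[int(label)] for label in predictions)
    # Report both authors even when one of them received no predictions.
    return {name: counts.get(name, 0) for name in AUTHOR_NAMES.values()}


def accuracy(true_labels, predictions):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(true_labels, predictions) if t == p)
    return matches / float(len(true_labels))


# Hypothetical stand-ins for labels_test and pred.
example_true = [1, 0, 0, 1, 1, 0]
example_pred = [1, 0, 1, 1, 1, 0]

author_counts = count_by_author(example_pred)
example_accuracy = accuracy(example_true, example_pred)
print(author_counts, example_accuracy)
```

In the notebook, the same two functions could be called with labels_test and pred once the classifier has run.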

5.2. SVM

A support vector machine (SVM) separates two classes by creating a separating line (decision boundary) that maximizes the margin between them, and it handles outliers well.

For information on Parameters, Advantages and Disadvantages: http://scikit-learn.org/stable/modules/svm.html

For this experiment we will change the values of the parameters C, kernel and gamma when instantiating the SVC class. The call can specify a few parameters (ex 1), many parameters (ex 2), or none at all.

  • ex 1
    linear_kernel_svm = svm.SVC(kernel='rbf', C=10000.)

  • ex 2
    linear_kernel_svm = svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

In machine learning we should avoid overfitting. For that reason, we will tune the parameters below, since all of them affect overfitting as well as results such as accuracy and performance.

C: controls the tradeoff between a smooth decision boundary and classifying training points correctly. In theory, a large value of C means that more training points are classified correctly.
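One way to see this tradeoff is through the standard soft-margin objective, which adds a margin term to C times the total hinge loss. The sketch below compares two hand-picked candidate boundaries on invented one-dimensional data; the data, the candidates and the helper names are all hypothetical, and a real SVC searches over all boundaries rather than choosing between two.

```python
# Toy 1-D data (invented for illustration): (x, label) with labels in {-1, +1}.
points = [(-2.0, -1), (-1.0, -1), (0.2, -1), (1.0, 1), (2.0, 1)]

# Two hypothetical candidate boundaries of the form f(x) = w * x + b.
candidates = {
    "smooth": (1.0, 0.0),   # small w: wide margin, misclassifies x = 0.2
    "tight": (5.0, -3.0),   # large w: narrow margin, fits every point
}


def objective(w, b, c_value):
    """Soft-margin objective: 0.5 * w^2 + C * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in points)
    return 0.5 * w ** 2 + c_value * hinge


def preferred(c_value):
    """Name of the candidate with the lower objective for this C."""
    return min(
        candidates,
        key=lambda name: objective(*candidates[name], c_value=c_value),
    )


for c_value in (1.0, 100.0):
    print(c_value, preferred(c_value))
```

With the small C the smoother boundary wins despite its one training error, while the large C favours the boundary that classifies every training point correctly.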

gamma: defines how far the influence of a single training example reaches. If gamma has a low value, every point has a far reach. If gamma has a high value, each training example has a close reach. A high value might make the decision boundary less linear, because the boundary will follow the nearby training points more closely.
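The reach of a training example can be made concrete with the RBF kernel formula, exp(-gamma * distance^2), which measures how strongly one point influences another. The sketch below evaluates it by hand for a few distances; the chosen distances and gamma values are arbitrary examples.

```python
import math


def rbf_similarity(distance, gamma):
    """RBF kernel value exp(-gamma * distance**2) for two points `distance` apart."""
    return math.exp(-gamma * distance ** 2)


# Arbitrary example distances and gamma values.
for gamma in (0.1, 1.0, 1000.0):
    values = [rbf_similarity(d, gamma) for d in (0.1, 1.0, 3.0)]
    print(gamma, values)
```

With a low gamma the similarity stays high even for distant points (far reach), whereas with a very high gamma it drops to almost zero for anything but the closest neighbours (close reach).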

kernel: can be ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ is used.

Please refer to the following url for more information on Parameters:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

Steps to reproduce focusing on Accuracy vs Performance

In this test we will improve accuracy at the cost of performance.

  • 1 - Run the code as it is and record the results.
  • 2 - Comment out the two lines below the comment section "REDUCING DATASET" and record the result.
  • 3 - Compare the two results and notice that accuracy is better in step 2. However, it took much longer to run because the dataset is larger. We can conclude that a larger dataset gives better accuracy here, at the cost of training and testing performance.
  • 4 - Uncomment the two lines commented out in step 2, leaving the code as it was before step 2.
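The reduction and timing used in these steps can be sketched in isolation as follows. The helper names are hypothetical and a plain list stands in for features_train; only the row count of 15820 comes from the notebook output.

```python
from time import time


def first_fraction(items, divisor=100):
    """Keep the first 1/divisor of the items (floor division works on Python 2 and 3)."""
    return items[:len(items) // divisor]


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time()
    result = func(*args)
    return result, time() - start


# Hypothetical stand-in for features_train: 15820 placeholder rows.
all_rows = list(range(15820))
small_rows = first_fraction(all_rows)

# Time the same operation on the reduced and on the full data.
_, elapsed_small = timed(sum, small_rows)
_, elapsed_full = timed(sum, all_rows)
print(len(small_rows), elapsed_small, elapsed_full)
```

Wrapping fit and predict calls in a helper such as timed would make it easier to record the timings for both dataset sizes side by side.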

Steps to reproduce focusing on Gamma parameter

  • 1 - Uncomment the code (gamma with HIGH value) "linear_kernel_svm = svm.SVC(kernel='rbf', gamma=1000)"; comment out the other instantiations and check the results.
  • 2 - Uncomment the code (gamma with LOW value) "linear_kernel_svm = svm.SVC(kernel='rbf', gamma=1.0)"; comment out the other instantiations and check the results.

Steps to reproduce focusing on Kernel parameter

  • 1 - Run the code as it is and record the results
  • 2 - Change the kernel of your SVM to “rbf” and record the result.
  • 3 - Compare the two results and notice that accuracy is better with this more complex kernel.
  • 4 - Keep this kernel (rbf) for the next tests.

Steps to reproduce focusing on C parameter

  • 1 - Uncomment the code below the section "LINES OF CODE MEANT TO TEST C PARAMETER" and comment out the lines related to other parameters.
  • 2 - Try several values of C (say, 10.0, 100., 1000., and 10000.) and record all results.
  • 3 - Compare the results and notice that the higher the value of C, the better the accuracy.
  • 4 - Record the best accuracy result for C and notice that we are still using 1% of the dataset.
  • 5 - Comment out the 2 lines below the section REDUCING DATASET TO 1% and run the code again. Be patient because processing will be slow now.
  • 6 - Compare the accuracy result with the one recorded in step 4. Verify that accuracy is higher with the larger dataset.

In [1]:
#SVM TEST

import sys
from time import time
sys.path.append("../tools")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()




#########################################################
### your code goes here ###
from sklearn import svm
from sklearn.metrics import accuracy_score

#####TEST CHANGING PARAMETERS
######################## JUST ONE LINE IN HERE MUST BE UNCOMMENTED ######################## 

### LINES OF CODE MEANT TO TEST GAMMA PARAMETER
#linear_kernel_svm = svm.SVC(kernel='rbf', gamma=1000) #GAMMA WITH HIGH VALUE
#linear_kernel_svm = svm.SVC(kernel='rbf', gamma=1.0) #GAMMA WITH LOW VALUE

### LINES OF CODE MEANT TO TEST C PARAMETER
linear_kernel_svm = svm.SVC(kernel='rbf', C=10000.0)



######################## REDUCING DATASET TO 1% ########################
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]
####### END ############

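# NOTE: the variable name and the messages say "linear", but the kernel
# actually used is the one set on the uncommented line above.
t0 = time()
linear_kernel_svm.fit(features_train, labels_train)
print "training time with SVM's linear kernel", time() - t0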

t1 = time()
pred = linear_kernel_svm.predict(features_test)
print "prediction time with SVM's linear kernel", time() - t1

print "accuracy being processed, please wait..."
acc = accuracy_score(labels_test, pred)
print acc

#########################################################

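import numpy as np

# Unused helper; it is not called by the SVM experiment above.
def time_with_power(power, people, times):
    results = np.random.power(power, people)
    for i in range(times):
        # Draw `people` samples so the shapes match when accumulating.
        results += np.random.power(power, people)
    return results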



## TODO: add code to display how many emails were predicted for Chris and for Sara,
## which emails were attributed to each of them,
## and graphs of the results.


/home/eduardo/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
no. of Chris training emails: 7936
no. of Sara training emails: 7884
performance and accuracy being processed, please wait...
training time with SVM's linear kernel 100.790041208
prediction time with SVM's linear kernel 10.3493950367
0.990898748578

5.3. Decision Tree

  • make tests with percentile parameter

Advantages and Disadvantages: http://scikit-learn.org/stable/modules/tree.html
Parameters Information: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

Main parameters covered in this experiment will be:

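  • min_samples_split: the minimum number of samples a node must hold before it can be split; it controls how deep the tree will reach, and larger values give a shallower tree
  • percentile: the percentage of highest-scoring features kept by SelectPercentile in ../tools/email_preprocess.py; it is a feature selection setting rather than an argument of the classifier itself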

Steps to reproduce focusing on "min_samples_split" parameter

  • 1 - Run the code as it is and record the results
  • 2 - Change the min_samples_split parameter to 40 and record results.
  • 3 - Compare the two results and note which min_samples_split value gives the better accuracy.
  • 4 - Keep min_samples_split=40 for the next tests.
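To get a feel for how min_samples_split limits depth, the sketch below counts the levels of an idealised tree in which every eligible node is split into two near-equal halves. This is a simplifying assumption: a real DecisionTreeClassifier chooses splits from the data, produces unbalanced trees and also stops at pure nodes, so the numbers are only indicative.

```python
def max_depth_balanced(n_samples, min_samples_split):
    """Depth reached if every eligible node is split into two near-equal halves."""
    # A node is split only if it holds at least min_samples_split samples
    # (and at least 2, since a single sample cannot be divided).
    if n_samples < min_samples_split or n_samples < 2:
        return 0
    # The larger half always yields the deeper branch, so follow only that one.
    larger_half = n_samples - n_samples // 2
    return 1 + max_depth_balanced(larger_half, min_samples_split)


# 15820 is the number of training rows reported in the notebook output.
depth_default = max_depth_balanced(15820, 2)
depth_forty = max_depth_balanced(15820, 40)
print(depth_default, depth_forty)
```

Under this idealisation the larger threshold stops the tree several levels earlier, which is the sense in which the parameter acts as a brake on overfitting.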

Steps to reproduce focusing on "percentile" parameter

  • 1 - Run the code as it is and record the number of features
  • 2 - Go into ../tools/email_preprocess.py and find the line of code that looks like this: selector = SelectPercentile(f_classif, percentile=10). Change percentile from 10 to 1, and rerun dt_author_id.py. Record the number of features.
  • 3 - Compare the number of features
  • 4 - Check accuracy when you use 1% of your available features
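Before rerunning, the expected feature count can be estimated with simple arithmetic. The sketch below back-computes an approximate vocabulary size from the 3785 features reported in the notebook output at percentile=10; both the vocabulary size and the resulting estimate are approximations, since the exact count depends on how SelectPercentile handles rounding and ties.

```python
def estimated_features(total_features, percentile):
    """Approximate number of features kept at a given percentile."""
    return total_features * percentile / 100.0


# Reported in the notebook output for percentile=10.
features_at_10 = 3785

# Back-computed estimate of the full vocabulary size (an approximation).
approx_vocabulary = features_at_10 * 100 // 10

approx_at_1 = estimated_features(approx_vocabulary, 1)
print(approx_vocabulary, approx_at_1)
```

The estimate suggests that percentile=1 keeps roughly a tenth of the features used at percentile=10, so training should be faster with less information available to the tree.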

In [7]:
#DECISION TREE

import sys
from time import time
sys.path.append("../tools")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

print "Size of features matrix: ", features_train.shape


#########################################################
### your code goes here ###

from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier(min_samples_split=40)

clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

print "accuracy being processed, please wait..."
acc = accuracy_score(labels_test, pred)
print "Accuracy: ", acc
#########################################################


## TODO: add code to display how many emails were predicted for Chris and for Sara,
## which emails were attributed to each of them,
## and graphs of the results.


no. of Chris training emails: 7936
no. of Sara training emails: 7884
performance and accuracy being processed, please wait...
Size of features matrix:  (15820, 3785)
accuracy being processed, please wait...
Accuracy:  0.978384527873

6. Summary of Results

Naive Bayes is really easy to implement and efficient. The relative simplicity of the algorithm and its assumption of independent features make it a strong performer for classifying texts. It copes well with noisy data. On the other hand, it can break on some phrases, because it considers the words individually.

SVM works very well in complicated domains with a clear margin of separation, but it does not perform well on very large datasets, because it can become slow and prone to overfitting. As for tuning, the best accuracy was achieved with the RBF kernel, C=10000, and the full dataset. As for performance, there is always a tradeoff with accuracy: reducing the dataset makes the code faster but less accurate.

Decision Trees are easy to use but are prone to overfitting.