Fraud Detection Machine Learning on Enron Enterprise Dataset

Abstract

The purpose of this project is to provide a reproducible paper studying how well the Naive Bayes, SVM, and Decision Tree machine learning algorithms can identify emails by their authors, using a pre-processed list of email texts and the corresponding authors. The data comes from the text dataset (comprising 146 users with 21 features each) of the famous fraud scandal of the bankrupt American corporation Enron. We will also study ways to work with parameters to improve accuracy and performance.

Introduction

In the years 2001/2002, Enron Corporation, an American energy company based in Houston, Texas, went bankrupt due to a fraud scandal. After this scandal, many rules and regulations were changed to make it possible to audit and prevent cases like this one.

After some time, part of the Enron Corporation data was made available to the public for learning purposes.

Aside from the original dataset, new versions of the data were created to offer more options to explore with machine learning. Since machine learning is widely used nowadays, the Enron data became a good text source to explore with it.

Environment Tools

This project is based on the following tools: git version 2.7.4, anaconda 4.3.1 (64-bit), Jupyter Notebook Server 4.3.1, Python 2.7.13, scikit-learn library.

The experiments can be reproduced in three distinct ways: through an Anaconda installation, through Docker, or through Oracle VirtualBox. All of them, however, use Jupyter Notebook as their fundamental tool.

Please refer to the following link for guidance on installation: https://github.com/ecalio07/enron-paper/blob/master/BEST_PRACTICES.md

Contents and instructions used for this paper were based on the "Udacity - Introduction to Machine Learning" course and were adapted according to the goals explained here.

https://github.com/mdegis/machine-learning
https://github.com/baumanab/udacity_intro_machinelearning_project
https://github.com/skl92/machine-learning-enron-email-analysis
https://github.com/dshgna/ud120-projects

This is being used for educational purposes only.

Experiments Workflow

For each classifier below, we will configure its arguments so as to reach the best time performance and accuracy, and then compare the results.

We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify the emails as written by one person or the other based only on the text of the email.

In order to know which algorithm is best for this situation, we should run tests and, based on the results, determine which one is most suitable for our scenario.

A couple of years ago, J.K. Rowling (of Harry Potter fame) tried something interesting. She wrote a book, “The Cuckoo’s Calling,” under the name Robert Galbraith. The book received some good reviews, but no one paid much attention to it--until an anonymous tipster on Twitter said it was J.K. Rowling. The London Sunday Times enlisted two experts to compare the linguistic patterns of “Cuckoo” to Rowling’s “The Casual Vacancy,” as well as to books by several other authors. After the results of their analysis pointed strongly toward Rowling as the author, the Times directly asked the publisher if they were the same person, and the publisher confirmed. The book exploded in popularity overnight.

We'll do something very similar in this project: classify each email in that set as written by one of the two authors, based only on its text. We will start with Naive Bayes and then expand to the other algorithms.

For each experiment, there is a cell (below the description) to reproduce the results, plus access to the code so that values can be changed to produce other results.
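
All experiments below share the same basic workflow: turn the raw email texts into numeric features, keep only the most informative ones, fit a classifier, and measure accuracy on held-out emails. The following is a minimal sketch of that workflow; the names word_data, authors, prepare_features, and run_experiment are illustrative assumptions, not the exact code in the notebooks.

    # Minimal sketch of the shared experiment workflow (illustrative, not the
    # exact notebook code): vectorize texts, select features, fit, and score.
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectPercentile, f_classif
    from sklearn.metrics import accuracy_score

    def prepare_features(word_data, authors, percentile=10):
        # hold out 10% of the emails for testing
        features_train, features_test, labels_train, labels_test = train_test_split(
            word_data, authors, test_size=0.1, random_state=42)

        # turn raw email texts into TF-IDF word weights
        vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
        features_train = vectorizer.fit_transform(features_train)
        features_test = vectorizer.transform(features_test)

        # keep only the most informative percentage of words
        selector = SelectPercentile(f_classif, percentile=percentile)
        features_train = selector.fit_transform(features_train.toarray(), labels_train)
        features_test = selector.transform(features_test.toarray())
        return features_train, features_test, labels_train, labels_test

    def run_experiment(clf, features_train, features_test, labels_train, labels_test):
        # fit the classifier and report accuracy on the held-out emails
        clf.fit(features_train, labels_train)
        pred = clf.predict(features_test)
        return accuracy_score(labels_test, pred)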

* SVM

Support Vector Machines work very well in complicated domains with a clear margin of separation, but they do not perform well on very large datasets, since they can become slow and prone to overfitting. As for tuning, we can conclude that the best accuracy was achieved with the RBF kernel, C=10000, and the full dataset. As for performance, there will always be a tradeoff with accuracy when reducing the dataset to make the code faster.

It separates two classes by creating a line separator (decision boundary), handling margins and outliers well.

For this experiment we will change the values of the parameters C, kernel, and gamma when initiating the SVC function. It can be a simple choice with a few parameters (ex 1), multiple parameters (ex 2), or no parameters at all.

  • ex 1
    linear_kernel_svm = svm.SVC(kernel='rbf', C=10000.)

  • ex 2
    linear_kernel_svm = svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

In machine learning we should avoid OVERFITTING. Because of that, we will tune the parameters below, since all of them affect overfitting and results such as accuracy and performance.

C: controls the tradeoff between a smooth decision boundary and classifying training points correctly. In theory, a large value of C means that more training points will be classified correctly.

gamma: defines how far the influence of a single training example reaches. If gamma has a low value, every point has a far reach. If gamma has a high value, each training example has a close reach. A high value might make the decision boundary less linear, since it will hug the training points more closely.

The kernel parameter can be ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used.

For more information on SVM parameters, please click here

SVM Experiment - Focus on Kernel Parameter: RBF vs Linear values

Kernel values can be ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. The default one is rbf.

Accuracy is not good enough using the rbf kernel (the default value) alone. We can reach a slight improvement in accuracy by changing the value to "linear".
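
A minimal sketch of this comparison, assuming the features, labels, and run_experiment helper from the workflow sketch in the Experiments Workflow section:

    # Compare the default rbf kernel against the linear kernel (sketch).
    from sklearn.svm import SVC

    for kernel in ('rbf', 'linear'):
        clf = SVC(kernel=kernel)
        acc = run_experiment(clf, features_train, features_test, labels_train, labels_test)
        print("kernel=%s -> accuracy %.4f" % (kernel, acc))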

code access


In [ ]:
%run ../dev/svm_kernel_parameter.ipynb

SVM Experiment - Focus on Gamma Parameter.

In this test we will change the gamma parameter value, addressing both high and low values. This condition can be tested with both the rbf and linear kernels.

Gamma made a major difference when working with RBF. When we have a high gamma value (say 10000), the accuracy score is very low (0.5). However, when gamma is low, accuracy is much better.

On the other hand, the gamma parameter made no difference in accuracy when working with the linear kernel; results didn't change.
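
A sketch of the gamma comparison under the same assumptions; the concrete gamma values below are illustrative and not necessarily the ones used in svm_gama_parameter.ipynb:

    # Compare a high and a low gamma value with the rbf kernel (sketch).
    from sklearn.svm import SVC

    for gamma in (10000.0, 0.001):  # high vs. low gamma
        clf = SVC(kernel='rbf', gamma=gamma)
        acc = run_experiment(clf, features_train, features_test, labels_train, labels_test)
        print("gamma=%s -> accuracy %.4f" % (gamma, acc))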

code access


In [4]:
%run ../dev/svm_gama_parameter.ipynb


Number of available emails to be trained for Chris: 7936
Number of available emails to be trained for Sara: 7884


Please await, processing the result: Train and Predict Data with High Gamma Value


********** Results for experiment: " Train and Predict Data with High Gamma Value " ************

Training time: 0.096 s
Predicting time: 1.044 s

Number of Predicted emails for Chris 36
Number of Predicted emails for Sara 1722
Total Accuracy: 0.527303754266


Please await, processing the result: Train and Predict Data with Low Gamma Value


********** Results for experiment: " Train and Predict Data with Low Gamma Value " ************

Training time: 0.113 s
Predicting time: 0.994 s

Number of Predicted emails for Chris 1032
Number of Predicted emails for Sara 726
Total Accuracy: 0.889078498294

SVM Experiment - Focus on C Parameter.

Trying several values of C (say, 10.0, 100., 1000., and 10000.) and recording the results, we notice that the higher the value of C, the better the accuracy, until it reaches a limit where results no longer change and there is therefore no reason to keep increasing it.

The interesting part of this experiment is that the C parameter affects results when working with both the linear and rbf kernels. The best results were obtained when working with the linear kernel and high C values.
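
A sketch of the C sweep described above, reusing the assumed run_experiment helper; the list of C values mirrors the text:

    # Sweep C for both kernels and watch accuracy improve until it plateaus (sketch).
    from sklearn.svm import SVC

    for kernel in ('linear', 'rbf'):
        for C in (10.0, 100.0, 1000.0, 10000.0):
            clf = SVC(kernel=kernel, C=C)
            acc = run_experiment(clf, features_train, features_test, labels_train, labels_test)
            print("kernel=%s, C=%s -> accuracy %.4f" % (kernel, C, acc))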

code access


In [ ]:
%run ../dev/svm_C_parameter.ipynb

SVM Experiment - Focus on Accuracy vs Performance

In this test we trade accuracy against performance: reducing the dataset makes the code faster, but accuracy is much better with the full dataset. A middle-ground result is possible, depending on how much the dataset is reduced.
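
A sketch of this trade-off under the same assumptions: training on roughly 1% of the emails makes the (slow) SVM much faster, at the cost of accuracy:

    # Train the SVM on only 1% of the emails to speed it up (sketch).
    from sklearn.svm import SVC

    reduced_features = features_train[:len(features_train) // 100]
    reduced_labels = labels_train[:len(labels_train) // 100]

    clf = SVC(kernel='rbf', C=10000.0)
    acc = run_experiment(clf, reduced_features, features_test, reduced_labels, labels_test)
    print("1%% of the training data -> accuracy %.4f" % acc)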

code access


In [ ]:
%run ../dev/svm_accuracy_vs_performance.ipynb

* GaussianNB (Naive Bayes)

Naive Bayes is a supervised classification algorithm used substantially for learning from documents (text learning). Each word is considered a feature and the user names are considered the labels. It is called Naive because it ignores word order.

It is really easy to implement and efficient. The relative simplicity of the algorithm and the independent-features assumption of Naive Bayes make it a strong performer for classifying texts. It works well even with a lot of noise in the data. On the other hand, it can break down for some phrases, since it considers each word individually.

The classifier uses posterior probability to rank the most likely author given a text. In other words, it is trained with the texts (features) frequently used by Chris and Sara (labels), and it then calculates the probability that each test email is from Chris or from Sara.

It is possible to work with parameters, just as with SVM, but changing them made no difference in the results. Using Naive Bayes, it was possible to reach a good accuracy, and it was simpler, since it was not necessary to play with parameter changes.
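
A minimal GaussianNB sketch under the same assumptions as the earlier snippets; no parameter tuning is involved, we simply fit and predict:

    # Fit Gaussian Naive Bayes and score it on the held-out emails (sketch).
    from sklearn.naive_bayes import GaussianNB

    clf = GaussianNB()
    acc = run_experiment(clf, features_train, features_test, labels_train, labels_test)
    print("GaussianNB -> accuracy %.4f" % acc)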

For more information on GaussianNB parameters, please click here

code access


In [3]:
%run ../dev/bayes.ipynb


Number of available emails to be trained for Chris: 7936
Number of available emails to be trained for Sara: 7884


Please await, processing the result: Train and Predict Data with GaussianNB

Number of Predicted emails for Chris 906
Number of Predicted emails for Sara 852
Total Accuracy: 0.973265073948

* Decision Tree

Decision Trees are simple to understand and interpret, but they do not tend to be as accurate as other approaches. They are also prone to overfitting.

Main parameters covered in this experiment will be:

  • min_samples_split: controls how deep the tree can grow, by setting the minimum number of samples required to split an internal node
  • percentile: controls the percentage of the most informative text features kept by the feature-selection step that feeds the tree

    For more information on Decision Tree parameters, please click here

Tree Experiment - Focus on min_samples_split parameter.

In this test we will change the min_samples_split parameter value, comparing a low value (2) with a higher value (40).
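
A sketch of the comparison, again assuming the helpers introduced in the workflow sketch; the values 2 and 40 mirror the experiment output below:

    # A low min_samples_split lets the tree grow deeper (risking overfitting);
    # a higher value stops splitting earlier (sketch).
    from sklearn.tree import DecisionTreeClassifier

    for min_split in (2, 40):
        clf = DecisionTreeClassifier(min_samples_split=min_split)
        acc = run_experiment(clf, features_train, features_test, labels_train, labels_test)
        print("min_samples_split=%d -> accuracy %.4f" % (min_split, acc))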

code access


In [2]:
%run ../dev/min_samples_split_parameter.ipynb


Number of available emails to be trained for Chris: 7936
Number of available emails to be trained for Sara: 7884


Please await, processing the result: Train and Predict Data with min_samples_split = 2

Number of Predicted emails for Chris 861
Number of Predicted emails for Sara 897
Total Accuracy: 0.990898748578


Please await, processing the result: Train and Predict Data with min_samples_split = 40

Number of Predicted emails for Chris 865
Number of Predicted emails for Sara 893
Total Accuracy: 0.978384527873

Tree Experiment - Focus on percentile parameter.

By changing the percentile parameter to a higher number, it was possible to achieve improvements in accuracy.
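
A sketch of this comparison. Since percentile is a feature-selection setting rather than a tree parameter, the features are rebuilt for each value; the min_samples_split value used here is an illustrative assumption:

    # Rebuild the features with different SelectPercentile values and compare
    # the resulting Decision Tree accuracy (sketch).
    from sklearn.tree import DecisionTreeClassifier

    for percentile in (1, 10):
        f_train, f_test, l_train, l_test = prepare_features(word_data, authors,
                                                            percentile=percentile)
        clf = DecisionTreeClassifier(min_samples_split=40)
        acc = run_experiment(clf, f_train, f_test, l_train, l_test)
        print("percentile=%d -> accuracy %.4f" % (percentile, acc))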

code access


In [1]:
%run ../dev/tree_percentile_parameter.ipynb


/home/eduardo/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Number of available emails to be trained for Chris: 7936
Number of available emails to be trained for Sara: 7884


Please await, processing the result: Train and Predict Data with percentile = 1

Number of Predicted emails for Chris 886
Number of Predicted emails for Sara 872
Total Accuracy: 0.966439135381


Please await, processing the result: Train and Predict Data with percentile = 10

Number of Predicted emails for Chris 871
Number of Predicted emails for Sara 887
Total Accuracy: 0.978384527873

Summary of Results

These experiments explored algorithm choices and configuration options in order to find the most suitable approaches for handling text-based author identification using machine learning.

Analysis with Reduced DataSet:

When working with SVM using only the kernel parameter, the linear value is better. However, when we added the gamma parameter to the linear option, accuracy was badly affected. The C parameter, with a high or a low value, made no difference in results with the linear kernel.

Making a cross-comparison among the options above, the best linear result was obtained when working with no other parameters except kernel='linear'.

With regard to RBF, the highest accuracy was reached by combining kernel='rbf' with a low gamma value (0.889078498294).

The best result with the reduced dataset was reached with the combination kernel='linear' and C=10000 (0.892491467577).

Decision Tree presented poor accuracy with the reduced dataset, around 0.776450511945.

Analysis with Full DataSet:

Although SVM with the full dataset reached the highest accuracy of all (0.990898748578), it was the slowest algorithm.

Decision Tree presented great accuracy and performance with larger datasets.

When we tested Naive Bayes, accuracy with the full dataset was 0.973265073948. It was the second best result of all the experiments, and its performance was much better than SVM's.

Conclusion: Based on the experiments presented here, I would exclude SVM due to its complexity and slow performance, even though its accuracy was the best.

Results suggest that Decision Tree and GaussianNB (Naive Bayes) would be great options for working with larger datasets while maintaining good accuracy and performance.

GaussianNB presented the best performance of all and good accuracy.