Fraud detection machine learning on Enron enterprise dataset

Abstract
Workflow
Reproducibility Environments and Fundamental Steps
Jupyter Notebook Structure
Methods and Procedures
1. Naive Bayes
2. SVM
3. Decision Tree
Summary of Results
References

1. Abstract

The purpose of this project is to study how well the Naive Bayes, SVM, and Decision Tree machine learning algorithms can identify the authors of emails. The classifiers are compared on performance and accuracy using a pre-made list of email texts and their corresponding authors, drawn from the Enron dataset, which comprises 146 users with 21 features each.

The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.



In [2]:

    
import numpy as np
import matplotlib.pyplot as plt


n_groups = 5

means_men = (20, 35, 30, 35, 27)
std_men = (2, 3, 4, 1, 2)

means_women = (25, 32, 34, 20, 25)
std_women = (3, 5, 2, 3, 3)

fig, ax = plt.subplots()

index = np.arange(n_groups)
bar_width = 0.35

opacity = 0.4
error_config = {'ecolor': '0.3'}

rects1 = plt.bar(index, means_men, bar_width,
                 alpha=opacity,
                 color='b',
                 yerr=std_men,
                 error_kw=error_config,
                 label='Men')

rects2 = plt.bar(index + bar_width, means_women, bar_width,
                 alpha=opacity,
                 color='r',
                 yerr=std_women,
                 error_kw=error_config,
                 label='Women')

plt.xlabel('Group')
plt.ylabel('Scores')
plt.title('Scores by group and gender')
plt.xticks(index + bar_width / 2, ('A', 'B', 'C', 'D', 'E'))
plt.legend()

plt.tight_layout()
plt.show()

3. Reproducibility Environments and Fundamental Steps

This project can be reproduced in two distinct ways: through section "3.1. Jupyter Notebook Environment and Procedures" or section "3.2. Docker Environment and Procedures". Both rely on the common/fundamental tools below. The difference is that in 3.1 (Jupyter Notebook Env.) you have to install the fundamental tools manually, whereas in 3.2 (Docker Env.) you only need the Docker client installed to run the Docker file (xxx.docker) available in the environments folder, because all required tools come preinstalled in that file.

Common/Fundamental Tools: git version 2.7.4, anaconda 4.3.1 (64-bit), Jupyter Notebook Server 4.3.1, Python 2.7.13, scikit-learn library.
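
To confirm that an environment matches the versions listed above, the installed packages can be queried from Python. This is a minimal sketch: the helper name `report_versions` is hypothetical and not part of the project, and packages that cannot be imported are reported as missing instead of raising an error.

```python
from __future__ import print_function

import importlib
import sys


def report_versions(module_names):
    """Return {module name: version string, or None if not importable}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # package is not installed here
        else:
            # Not every module exposes __version__, so fall back to a marker.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    found = report_versions(["sklearn", "numpy", "notebook"])
    for name in sorted(found):
        print(name, found[name] or "not installed")
```

Running it inside the notebook server and inside the Docker image should print the same versions if both environments were set up as described.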

3.1. Jupyter Notebook Environment and Procedures

3.1.1 Install Anaconda 4.3.1 for Python 2.7.13. Anaconda will automatically install Jupyter Notebook Server 4.3.1 and Python 2.7.13: https://www.continuum.io/downloads
3.1.2 Install scikit-learn: http://scikit-learn.org/stable/developers/advanced_installation.html
3.1.3 Install git version 2.7.4: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
3.1.4 Clone the files from this repository to the desired path on your local machine by issuing the following command:
"git clone https://github.com/ecalio07/enron-paper.git"
3.1.5 Start the Jupyter server and create the structure described in section 4 (Jupyter Notebook Structure). Start by creating the folders, then upload the files cloned in the previous step (3.1.4) to their respective directories.

3.2. Docker Environment and Procedures

3.2.1 Install Docker: https://www.docker.com/community-edition#/download / https://github.com/docker/labs/blob/master/beginner/chapters/setup.md
3.2.2 Start the Docker image
3.2.3 Start the Jupyter Notebook Server. Verify that the notebook already comes with the correct folder and file structure (refer to section 4).

4. Jupyter Notebook Structure

Jupyter notebook server must have folders and files in the following structure. It should be the same structure as we have here in GitHub, except for the environments folder.

5. Methods and Procedures

The arguments of each classifier below will be configured to reach the best time performance and accuracy, and the results will then be compared.
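
Comparing classifiers on both time and accuracy needs the same measurement applied to each one. The helper below is a sketch of one way to do that; the name `evaluate` and its return layout are assumptions for illustration, and it only relies on the object having `fit` and `predict` methods, as scikit-learn estimators do.

```python
from time import time


def evaluate(clf, features_train, labels_train, features_test, labels_test):
    """Fit clf, predict on the test set, and return
    (training seconds, prediction seconds, accuracy)."""
    start = time()
    clf.fit(features_train, labels_train)
    train_seconds = time() - start

    start = time()
    predictions = clf.predict(features_test)
    predict_seconds = time() - start

    # Accuracy is the fraction of test labels predicted correctly.
    correct = sum(1 for p, y in zip(predictions, labels_test) if p == y)
    accuracy = correct / float(len(labels_test))
    return train_seconds, predict_seconds, accuracy
```

Calling `evaluate` once per classifier on the same train/test split gives directly comparable numbers for each method.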

We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify each email as written by one person or the other based only on the text of the email.
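
Classifying by text alone means the emails first have to become numeric features. The sketch below assumes TF-IDF weighting followed by percentile-based feature selection, both from scikit-learn. The four toy emails, the author labels, and the chosen percentile are invented stand-ins, not the project's data or settings.

```python
from __future__ import print_function

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# Hypothetical toy data: labels 0 and 1 mark the two authors.
emails = [
    "please review the gas contract before the meeting",
    "the gas contract needs your signature today",
    "lunch on friday with the trading desk",
    "trading desk lunch moved to friday noon",
]
authors = [0, 0, 1, 1]

# Turn each email into a TF-IDF weighted bag of words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(emails)

# Keep only the words whose scores best separate the two authors.
selector = SelectPercentile(f_classif, percentile=50)
features = selector.fit_transform(tfidf, authors).toarray()

print(tfidf.shape, "->", features.shape)
```

The `percentile` argument controls how many word features survive, which is the parameter the tests in the following subsections vary.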

5.1. Naive Bayes

make tests with percentile parameter

5.2. SVM

deploy an rbf kernel
optimize C parameter
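
The two SVM steps can be sketched as a sweep over `C` with an RBF kernel. The synthetic data from `make_classification` stands in for the vectorized emails, and the candidate `C` values are arbitrary choices for illustration. `sklearn.model_selection` requires scikit-learn 0.18 or later.

```python
from __future__ import print_function

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data as a placeholder for the email features.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train one RBF-kernel SVM per candidate C and record test accuracy.
results = {}
for C in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", C=C)
    clf.fit(X_train, y_train)
    results[C] = clf.score(X_test, y_test)

best_C = max(results, key=results.get)
print("accuracy per C:", results)
print("best C on this split:", best_C)
```

A larger `C` penalizes misclassified training points more heavily, which gives a more complex decision boundary. Picking `C` on the test split, as done here for brevity, overstates accuracy; a separate validation split or cross-validation would be the more careful choice.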

5.3. Decision Tree

make tests with percentile parameter
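
The percentile tests listed for Naive Bayes (5.1) and Decision Tree (5.3) can be sketched with one loop that varies the share of features kept. Everything specific here is an assumption for illustration: the synthetic data, the percentile values, the use of `GaussianNB`, and the `min_samples_split=40` setting.

```python
from __future__ import print_function

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with many uninformative features, standing in for email text.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Factories so every run starts from a fresh, unfitted classifier.
classifiers = {
    "naive_bayes": lambda: GaussianNB(),
    "decision_tree": lambda: DecisionTreeClassifier(
        min_samples_split=40, random_state=0),
}

scores = {}
for name, make_clf in classifiers.items():
    for percentile in (10, 25, 50, 100):
        # Feature selection is fitted on the training data only.
        model = make_pipeline(
            SelectPercentile(f_classif, percentile=percentile), make_clf())
        model.fit(X_train, y_train)
        scores[(name, percentile)] = model.score(X_test, y_test)

for key in sorted(scores):
    print(key, scores[key])
```

Keeping fewer features usually shortens training time, so the percentile is one lever for trading accuracy against speed.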

6. Summary of Results

We will reach a conclusion, based on pros and cons, as to which classifier best suits this scenario. This also gives us the opportunity to explore these methods.

7. References

Main: https://classroom.udacity.com/courses/ud120