Fraud detection machine learning on Enron enteprise dataset

Table of Contents

  1. Abstract
  2. Workflow
  3. Reprodutibility Environments and Fundamental Steps
  4. Jupyter Notebook Structure
  5. Methods and Procedures
    1. Naive Bayes
    2. SVM
    3. Decision Tree
  6. Summury of Result
  7. References

1. Abstract

The purpose of this project is to study how well Naive Bayes, SVM, and Decision Tree machine learning algorithms can indentify emails by their authors. There will be comparinsons among them as to their respective performance and accuracy based on a pre-made list of email texts and the corresponding authors based on Enron dataset comprised of 146 users with 21 features each.

The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.


In [2]:
import numpy as np
import matplotlib.pyplot as plt


n_groups = 5

means_men = (20, 35, 30, 35, 27)
std_men = (2, 3, 4, 1, 2)

means_women = (25, 32, 34, 20, 25)
std_women = (3, 5, 2, 3, 3)

fig, ax = plt.subplots()

index = np.arange(n_groups)
bar_width = 0.35

opacity = 0.4
error_config = {'ecolor': '0.3'}

rects1 = plt.bar(index, means_men, bar_width,
                 alpha=opacity,
                 color='b',
                 yerr=std_men,
                 error_kw=error_config,
                 label='Men')

rects2 = plt.bar(index + bar_width, means_women, bar_width,
                 alpha=opacity,
                 color='r',
                 yerr=std_women,
                 error_kw=error_config,
                 label='Women')

plt.xlabel('Group')
plt.ylabel('Scores')
plt.title('Scores by group and gender')
plt.xticks(index + bar_width / 2, ('A', 'B', 'C', 'D', 'E'))
plt.legend()

plt.tight_layout()
plt.show()

3 Reprodutibility Environments and Fundamental Steps

This project can be reproduced in two distinct manners: through section "3.1. Jupyter Notebook Envonment and Procedures" OR section "3.2. Docker Envonment and Procedures". Both will contain the common/fundamental tools below, the difference is that in 3.1 (Jypyter Notebook Env.) you will have to install the fundamental tools manually, whereas in 3.2 (Jypyter Notebook Env.), you will need only docker client installed to run the docker file (xxx.docker) available in environments folder, because all required tools will come already installed in this file.

  • Common/Fundamental Tools: git version 2.7.4, anaconda 4.3.1 (64-bit), Jupyter Notebook Server 4.3.1, Python 2.7.13, scikit-learn library.

3.1. Jupyter Notebook Envonment and Procedures

3.2. Docker Envonment and Procedures

4. Jupyter Notebook Structure

Jupyter notebook server must have folders and files in the following structure. It should be the same structure as we have here in GitHub, except for the environments folder.

5. Methods and Procedures

It will be performed arguments confirguration according to each classifier below so as to reach best time performance and accurance, as well as comparisons of results.

We have a set of emails, half of which were written by one person and the other half by another person at the same company . Our objective is to classify the emails as written by one person or the other based only on the text of the email.

5.1. Naive Bayes

  • make tests with percentile parameter ### 5.2. SVM
  • deploy an rbf kernel
  • optimize C parameter
    ### 5.3. Decision Tree
  • make tests with percentile parameter

7. Summury of Result

We will reach to a conclusion based on pros and cons as to which classifier best suits to this scenario. Thus, we will have the oportunity explore these methods.