The purpose of this project is to study how well the Naive Bayes, SVM, and Decision Tree machine learning algorithms can identify the authors of emails. Their performance and accuracy are compared on a pre-made list of email texts and their corresponding authors, drawn from the Enron dataset, which comprises 146 users with 21 features each.
The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.
In [2]:
import numpy as np
import matplotlib.pyplot as plt
n_groups = 5
means_men = (20, 35, 30, 35, 27)
std_men = (2, 3, 4, 1, 2)
means_women = (25, 32, 34, 20, 25)
std_women = (3, 5, 2, 3, 3)
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.4
error_config = {'ecolor': '0.3'}
rects1 = plt.bar(index, means_men, bar_width,
                 alpha=opacity,
                 color='b',
                 yerr=std_men,
                 error_kw=error_config,
                 label='Men')
rects2 = plt.bar(index + bar_width, means_women, bar_width,
                 alpha=opacity,
                 color='r',
                 yerr=std_women,
                 error_kw=error_config,
                 label='Women')
plt.xlabel('Group')
plt.ylabel('Scores')
plt.title('Scores by group and gender')
plt.xticks(index + bar_width / 2, ('A', 'B', 'C', 'D', 'E'))
plt.legend()
plt.tight_layout()
plt.show()
This project can be reproduced in two distinct ways: through section "3.1. Jupyter Notebook Environment and Procedures" or section "3.2. Docker Environment and Procedures". Both rely on the same fundamental tools listed below. The difference is that in 3.1 (Jupyter Notebook Env.) you have to install the fundamental tools manually, whereas in 3.2 (Docker Env.) you only need the Docker client installed to run the Docker file (xxx.docker) available in the environments folder, because all required tools come preinstalled in that file.
The arguments of each classifier below will be configured to reach the best time performance and accuracy, and the results will then be compared.
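One way to compare classifiers on both time and accuracy is to wrap training and prediction in a small timing helper. The sketch below is only an illustration under stated assumptions: `evaluate` and `MajorityClassifier` are hypothetical names that do not come from this project, and the stand-in classifier merely has the `fit`/`predict` shape that a real Naive Bayes, SVM, or Decision Tree object would be expected to expose.

```python
import time


class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def evaluate(clf, X_train, y_train, X_test, y_test):
    """Return (accuracy, fit_seconds, predict_seconds) for one classifier."""
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predicted = clf.predict(X_test)
    predict_seconds = time.perf_counter() - start

    correct = sum(p == t for p, t in zip(predicted, y_test))
    return correct / len(y_test), fit_seconds, predict_seconds


# Toy data, invented for illustration only (not taken from the Enron dataset).
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 0, 1]
X_test = [[4], [5]]
y_test = [0, 1]

accuracy, fit_seconds, predict_seconds = evaluate(
    MajorityClassifier(), X_train, y_train, X_test, y_test
)
print(f"accuracy={accuracy:.2f} fit={fit_seconds:.6f}s predict={predict_seconds:.6f}s")
```

Calling the same helper once per classifier and per argument setting would give directly comparable numbers, since every model is timed and scored in the same way.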
We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify each email as written by one person or the other, based only on the text of the email.
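To make the two-author setup concrete, here is a minimal hand-rolled sketch of a word-count Naive Bayes classifier with add-one smoothing. It is an assumption-laden illustration rather than this project's implementation: the class `TinyNaiveBayes`, the author labels `"A"` and `"B"`, and the example emails are all invented, and a real run would more likely rely on a library classifier and the actual email texts.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes over raw word counts."""

    def fit(self, texts, authors):
        self.word_counts = defaultdict(Counter)  # author -> word frequencies
        self.doc_counts = Counter(authors)       # author -> number of emails
        for text, author in zip(texts, authors):
            self.word_counts[author].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_one(self, text):
        total_docs = sum(self.doc_counts.values())
        best_author, best_score = None, -math.inf
        for author, n_docs in self.doc_counts.items():
            counts = self.word_counts[author]
            total_words = sum(counts.values())
            # Log prior: share of training emails written by this author.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                # Add-one smoothing avoids a zero probability for unseen pairs.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Invented training emails: half from author "A", half from author "B".
texts = [
    "please review the attached contract before the meeting",
    "the contract needs legal review today",
    "lunch at noon sounds great",
    "great game last night see you at lunch",
]
authors = ["A", "A", "B", "B"]

model = TinyNaiveBayes().fit(texts, authors)
print(model.predict_one("can you review the contract"))
print(model.predict_one("see you at lunch"))
```

The classifier sees nothing but the words of each email, which matches the stated objective of deciding authorship from the text alone.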