All through history, corruption and fraud has been present. It is known cases in which one book author published its book in the name of a more famous one. The fraud was detected by the use of Machine Learning.
In such contexts, it is difficult, if not impossible, to find efficient patterns without making use of Artificial Inteligence (AI). And since Machine Learning has become a buzzword and has proven its efficiency. This work explores a subfield of AI called Supervised Classificaton.
This paper covers Naive Bayes, Support Vector Mahcine (SVM), and Decision Tree supervised classification algorithms, working on a pre-processed list of email texts based on the Enron Corporation dataset. As such, they will predict email authors by their writting style and content of words
The purpose of this work is to understand complexity, strengths, weaknessess and how to improve accuracy and performance of the algorithms mentioned above.
It is holped this study will guide practicioners to manage these techniques to reach their expected results.
Keywords: Machine Learning, Classification, Support Vector Machines, Gaussian Naive Bayes, Decision Tree.
Enron Corporation was an American energy, commodities, and services company based in Houston, Texas. It was founded in 1985 as the result of a merger between Houston Natural Gas and InterNorth.
In the 1990s, the company became a leading energy marketer of natural gas, crude oil, electricity and liquids in North America, Europe and the rest of the world. It employed approximately 20,000 staff and was one of the world's major electricity, natural gas, communications and pulp and paper companies, with claimed revenues of nearly $101 billion during 2000. It was named as America's Most Innovative Company.
However, the CEO Jeffrey Skilling had a way of hiding the financial losses of the trading business and other operations of the company, it was called mark-to-market accounting . This is a technique used when trading securities where you measure the value of a security based on its current market value, instead of its book value. This can work well for securities, but it can be disastrous for other businesses.
In Enron's case, the company would build an asset, such as a power plant, and immediately claim the projected profit on its books, even though it hadn't made any money from it. If the revenue from the power plant were less than the projected amount, instead of taking the loss, the company would then transfer these assets to an off-the-books corporation, where the loss would go unreported. This type of accounting enabled Enron to write off losses without hurting the company's bottom line.
At the end of 2001, the company's corruption emmerged as a great fraud scandal and soon after this the company filed for bankruptcy. After that, many rules and regulations were changed to be able to audit and prenvent cases like this one.
This scandal brought into question the accounting practices and activities of many corporations in the United States and many rules and regulations were changed to be able to audit and prenvent cases like this one.
Many wonder how such a powerful business disintegrated so quickly and how it managed to fool the regulators for so long.
Due to this interesting case, the company's financial and emails data became public for studies purpose. And this case became a point of interest for machine learning analysis because it could assist in finding solutions on how to prevent similar situations to happen.
The original dataset was collected and prepared by the CALO Project, and contains financial data and text features, that were extracted from emails comprised of 146 users with 21 features each.
For the experiments in this paper though, two other datasets will be used as the data sources(/data): word_data.pkl corresponding to the features, and email_authors.pkl to the labels.
They were created by Katie Malone for Udacity machine learning training course, and represent 8.000 emails per user, belonging to two users: Chris and Sara.
This research is based on the instructions taken from "Udacity - Introduction to Machine Leaning course" and it was adaped according to the goals explained the abstract.
Degis, Bauman, Liu and Gunawardhana, et al. \cite{social} present approaches working with classifiers in identifying authors by their email, also based from Udacity ML course. The use of parameters for classifiers are done manually, without the use of a library similar to gridsearch.
Three different classifiers will be trainned to predict autors emails by their word content and style.
For performing the experiments, estimators will have their parameters tunned by the use of scikit-learn GridSearchCV library \cite{scikit-learn}, that considers all parameter combinations and provide outputs of their scores.
Each experiment can be executed against only a portion or full dataset. They are represented by two running code cells (for each algorithm). Partial dataset is intended to speed up results so as to focus on finding the parameters values combinations, while fulll dataset is meant to verify how well estimators perform in a larger scale.
Support Vector Machine can work with classification or regression, but it is mostly applied to classification.
While Support Vectors are the co-ordinates of individual observation, Support Vector Machine is a frontier which best segregates two classes. It has a feature that igores the outliers and tend to be quite robust with them.
For this experiment values are changed for parameters:
C: controls the tradeoff between smooth decision boundary and classification training points correctly. In theory, a large value of C means that you will get more training points correctly.
gamma: defines how far a the influence of a single training example reaches. If gamma has a low value, every point has a far reach. If gamma has a high value, each training example has a close reach. High value might make the decision boundary less linear, for it will be closer to training points.
kernel parameter can be ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used.
GridSearch will make testes making combinations with different parameter and values: kernel: rbf and linear gamma: 1e-3, 1e-4 C: 1, 10, 100, 1000
In [1]:
# RUN CODE WITH REDUCED DATASET
%run ../dev/svm_gridsearch.ipynb
In [ ]:
# RUN CODE WITH FULL DATASET (AFTER MORE THAN TWO HOURS, IT WAS STILL RUNNING AND DIDN'T DISPLAY RESULTS)
%run ../dev/svm_gridsearch-fulldataset.ipynb
Naive Bayes,based on Bayes’ Theorem, works with the assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
It uses Posterior Probability, giving the rank occurance provided text. In order words, it will be trained with frequent texts(features) used by Chris and Sarah(labels), and it will calculate the probabily and determine if each test email is from Chris or Sara.
It is possible to work with parameters, similarly to SVM, but parameter tune for this classifier makes no changes in results. As such, GridSearchCV is not applied to this scenario.
In [ ]:
# RUN CODE WITH REDUCED DATASET
%run ../dev/bayes.ipynb
In [ ]:
# RUN CODE WITH FULL DATASET
%run ../dev/bayes-fulldataset.ipynb
Decision trees work with classification or regression models, and it breaks down a dataset into smaller subsets while at the same time an associated decision tree is incrementally developed. The final outcome is a tree with decision nodes and leaf nodes.
GridSearchCV will be instatiate with the tree estimator, the parameters below and will make the predictions: parameters = {"criterion": ["gini", "entropy"], "min_samples_split": [2, 10, 20], "max_depth": [None, 2, 5, 10], "min_samples_leaf": [1, 5, 10], "max_leaf_nodes": [None, 5, 10, 20], }
In [ ]:
# RUN CODE WITH REDUCED DATASET
%run ../dev/decitionTree_gridsearch.ipynb
In [ ]:
# RUN CODE WITH FULL DATASET (IT TAKES ALMOST ONE HOUR TO DISPLAY RESULTS)
%run ../dev/decitionTree_gridsearch-fulldataset.ipynb
These experiments explored the GridSearch libraries on different classifiers. This presents a faster and more efficient way to tune paramters for best performance and accuracy.
This paper contribution was centered in an automated way to tune paramenters, rather than doingi manually and traying to guess best combinations.
With regards to performance, all of experiments were fast, since dataset was reduced to 1%.
GridSearch presented best parameter ('min_samples_split': 2, 'max_leaf_nodes': None, 'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 5) for Decison tree with a bad accuracy: around 0.80379.
GaussianNB was the only classifier to work without paramenters tune in GridSearch, however, it presented the best accurace: around 9.91922
GridSearch presented best parameter (kernel:rbf, C:1000, gamma: 0.001) for SVM scoring: 0.91772
In this scenario, GaussianNB was the simplest and most efficient.
SVM experiment with combination of params and full dataset either was still running after more than two hours or had the screen fronzen and did not display any output in more than one attempt. It usually works very well in complicated domains but it doesn't perform well in very large datasets, for it can become slow and prone to overfitting. At this time, results could not be recorded.
GaussianNB presented a impressive performance in the large dataset, and within less than 8 seconds return the high score of 0.973265073948. That was somehow expected, since it is not working with combinations of parameters. But still, it is easy to implement, efficient and performatic, but it can break for some phrases for considering the words individually.
Decision Tree experiment was with the combination of params in a full dataset reached the best score 0.979393173198. On the other hand, it was slow and it took almost one hour to display the results.
Conclusion:
According to these experiemtns, results suggests that GaussianNB (Naive Bayes) and Decision Tree would be great options to work with larger datasets and meet good accuarcy. GaussianNB was much superior with regards to performance, but Decision Tree had a slight better accuracy.
And due to the lack of performance, SVM is allegedly not a good option for working with a large dataset like the one in this paper.