In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into public record, including tens of thousands of emails and detailed financial data for top executives.
In this project, I will play detective and put my machine learning skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. The identifier will be built on the provided dataset, which contains several financial and email features for 146 employees. 18 of these employees are labeled as persons of interest (POI). The features and labels will help me build a classifier that decides whether a given employee is a person of interest.
The dataset contains financial features (all units are in US dollars):
salary
deferral_payments
total_payments
loan_advances
bonus
restricted_stock_deferred
deferred_income
total_stock_value
expenses
exercised_stock_options
other
long_term_incentive
restricted_stock
director_fees
Email features (units are generally the number of email messages; the notable exception is ‘email_address’, which is a text string):
to_messages
email_address
from_poi_to_this_person
from_messages
from_this_person_to_poi
shared_receipt_with_poi
Label (boolean, represented as integer):
poi
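For reference, here is a minimal sketch of loading the dataset. The file name and the dict-of-dicts structure keyed by employee name are assumptions based on the standard layout of the project data.
In [ ]:
import pickle

# Load the dataset -- assumed to be a pickled dict keyed by employee name,
# where each value is a dict of the features listed above (file name is an assumption).
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

print(len(data_dict))              # 146 records before cleaning
print(data_dict["LAY KENNETH L"])  # example record (key assumed to exist)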
The dataset is not complete: many features are missing entries. Here is an overview of the number of missing (NaN) values for each feature.
Feature | Number of NaN's |
---|---|
total_stock_value | 18 |
total_payments | 20 |
restricted_stock | 34 |
exercised_stock_options | 42 |
expenses | 49 |
salary | 49 |
other | 52 |
shared_receipt_with_poi | 57 |
to_messages | 57 |
from_poi_to_this_person | 57 |
from_this_person_to_poi | 57 |
from_messages | 57 |
bonus | 62 |
long_term_incentive | 78 |
deferred_income | 95 |
deferral_payments | 105 |
restricted_stock_deferred | 126 |
director_fees | 127 |
loan_advances | 140 |
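The counts above can be reproduced with a short loop over the data dictionary. A minimal sketch, assuming missing entries are stored as the string 'NaN' (as in the standard project format):
In [ ]:
from collections import defaultdict

# Count 'NaN' entries per feature across all employees.
nan_counts = defaultdict(int)
for person, record in data_dict.items():
    for feature, value in record.items():
        if value == "NaN":
            nan_counts[feature] += 1

# Print features sorted by the number of missing values.
for feature, count in sorted(nan_counts.items(), key=lambda item: item[1]):
    print(feature, count)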
After exploratory data analysis, I found three records that must be removed from the dataset:
LOCKHART EUGENE E
(contains only NaN values)
TOTAL
(contains the sum of each feature over all employees)
THE TRAVEL AGENCY IN THE PARK
(does not represent an employee)
In this part I also do some additional feature engineering by constructing three new features from the ones in the raw data. I hope that these engineered features will enhance the training of the algorithms by providing information that better differentiates the patterns in the data. I also expect them to provide additional information that is not easily apparent or not clearly captured in the raw dataset. The new features are (a construction sketch follows the list):
email_from_poi_ratio
: Ratio of emails received from POIs to all received emails
email_to_poi_ratio
: Ratio of emails sent to POIs to all sent emails
total_wealth
: Sum of salary, bonus, exercised_stock_options and total_stock_value
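Below is a minimal sketch of how these features might be constructed; the helper functions are my own illustration and assume the dict-of-dicts structure with 'NaN' marking missing values.
In [ ]:
def safe_ratio(numerator, denominator):
    # Treat missing entries and zero denominators as a ratio of 0.
    if numerator == "NaN" or denominator in ("NaN", 0):
        return 0.0
    return float(numerator) / float(denominator)

def safe_sum(values):
    # Sum the given values, ignoring missing entries.
    return sum(v for v in values if v != "NaN")

for person, record in data_dict.items():
    record["email_from_poi_ratio"] = safe_ratio(record["from_poi_to_this_person"],
                                                record["to_messages"])
    record["email_to_poi_ratio"] = safe_ratio(record["from_this_person_to_poi"],
                                              record["from_messages"])
    record["total_wealth"] = safe_sum([record["salary"],
                                       record["bonus"],
                                       record["exercised_stock_options"],
                                       record["total_stock_value"]])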
Feature scaling is a method used to standardize the range of features in the data. Since the ranges of values in the raw dataset vary widely, some machine learning algorithms will perform badly without feature scaling. Many classifiers calculate the distance between two points using the Euclidean distance; if one feature has a much broader range of values than the others, the distance will be dominated by that feature. Because of this, the ranges of all features should be scaled so that every feature in the dataset contributes proportionately to the final distance. In this project I will use Scikit-learn's StandardScaler() to normalize the features. However, not every classifier benefits from feature scaling.
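As a small illustration of what StandardScaler() does (the numbers are made up for the example):
In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features with very different ranges, e.g. a salary-like value and an email count.
X = np.array([[200000.0, 15.0],
              [1000000.0, 40.0],
              [350000.0, 5.0]])

# StandardScaler removes the mean and scales each feature to unit variance,
# so both features contribute comparably to Euclidean distances.
print(StandardScaler().fit_transform(X))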
In this project I try out several algorithms to determine which classifier provides the best result in the end. The algorithms used are Logistic Regression, AdaBoost, Decision Tree, Random Forest and Gaussian Naive Bayes.
Before the algorithms are used in practice, I perform algorithm tuning. Tuning means determining the input parameters that give the best performance for each algorithm.
Algorithms have numerous input parameters that guide the decision-making process during training. The input parameters determine, for instance, which heuristics are used or the probabilities of certain events occurring.
The performance of an algorithm is greatly affected by its parameter settings. However, these settings are not easy to determine. To find the combination of parameter settings that yields the best performance, I use Scikit-learn's GridSearchCV(). This function takes a classifier and a dictionary of input parameters and returns the parameters that result in the best performance of the algorithm.
Here are the dictionaries of input parameters that were passed to GridSearchCV() to determine the best parameter settings for each classifier:
In [1]:
# Parameters for Logistic Regression
logReg_parameters = {"C": [0.5, 1, 5, 5.5, 6],
                     "penalty": ["l1", "l2"],
                     "tol": [1e-2, 1e-3, 1e-4, 1e-5]}

# Parameters for AdaBoostClassifier
AdaBoost_parameters = {"n_estimators": [25, 50, 75],
                       "learning_rate": [0.25, 0.5, 0.75, 1.0]}

# Parameters for DecisionTreeClassifier
DecisionTree_parameters = {"criterion": ["gini", "entropy"],
                           "min_samples_split": [2, 3, 4],
                           "min_samples_leaf": [1, 2, 3]}

# Parameters for RandomForestClassifier
RandomForest_parameters = {"max_depth": [3, 4, 5, 6, 7],
                           "criterion": ["gini", "entropy"],
                           "n_estimators": [20, 25, 30],
                           "random_state": [35, 40, 45]}
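As a sketch of how one of these dictionaries could be passed to GridSearchCV(); the `features` and `labels` arrays, the solver, the scoring metric and the number of folds are assumptions, not taken from the original code:
In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# The liblinear solver supports both the 'l1' and 'l2' penalties in the grid above.
grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                    logReg_parameters,
                    scoring="f1",
                    cv=5)
grid.fit(features, labels)  # assumes features/labels were extracted from the dataset
print(grid.best_params_)    # best parameter combination found for this grid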
The execution of GridSearchCV() determined the best parameter settings for each classifier; they are listed in the final pipelines further below.
To increase the performance even more, I use the Pipeline function to chain several operations such as StandardScaler(), SelectKBest(k) and PCA(n_components). StandardScaler() was already mentioned. Scikit-learn's SelectKBest(k) is used for feature selection: it selects the features with the k highest scores. Here is an overview of the features with their corresponding scores.
Feature | Score |
---|---|
total_wealth | 28.26 |
exercised_stock_options | 24.81 |
total_stock_value | 24.18 |
bonus | 20.79 |
salary | 18.28 |
deferred_income | 11.45 |
long_term_incentive | 9.92 |
restricted_stock | 9.21 |
total_payments | 8.77 |
shared_receipt_with_poi | 8.58 |
loan_advances | 7.18 |
expenses | 7.18 |
from_poi_to_this_person | 5.24 |
email_from_poi_ratio | 5.12 |
other | 4.18 |
email_to_poi_ratio | 4.09 |
from_this_person_to_poi | 2.38 |
director_fees | 2.12 |
to_messages | 1.64 |
deferral_payments | 0.22 |
from_messages | 0.16 |
restricted_stock_deferred | 0.06 |
The new feature total_wealth has the highest score; in contrast, email_from_poi_ratio and email_to_poi_ratio only occupy 14th and 16th place out of 22.
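The scores above come from a fitted SelectKBest instance. A sketch, assuming `features`, `labels` and a `feature_names` list (in column order) are already prepared and that the default ANOVA F-test (f_classif) is used:
In [ ]:
from sklearn.feature_selection import SelectKBest, f_classif

# Fit on all features just to inspect the univariate scores.
selector = SelectKBest(score_func=f_classif, k="all")
selector.fit(features, labels)

# Pair each feature name with its score and sort from highest to lowest.
for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda pair: pair[1], reverse=True):
    print("%-28s %.2f" % (name, score))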
Scikit-learn's PCA() is used to transform the features into principal components, which are then used as new features. This step may increase the performance of an algorithm. The final number of principal components is set by the n_components input parameter of PCA().
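A minimal sketch of applying PCA on its own; the choice of 5 components is only an example:
In [ ]:
from sklearn.decomposition import PCA

# Project the features onto the first 5 principal components.
pca = PCA(n_components=5)
features_pca = pca.fit_transform(features)

# The explained variance ratio shows how much of the variance each component retains.
print(pca.explained_variance_ratio_)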
While the set of best input parameters determined by GridSearchCV() for each algorithm stays the same, I try different combinations of StandardScaler(), SelectKBest() and PCA(), as well as different values of k for SelectKBest() and n_components for PCA(). In the end I keep the combination of operations and input parameters that results in the best performance.
Since I tried the combinations manually, I could not cover all possible combinations. I am aware that some combinations I did not try may result in even better performance.
Here I provide an overview of the final manually chosen operations used in the Pipeline, as well as the optimal input parameters for each algorithm determined by GridSearchCV() during tuning. The expressions "Scaler", "K_best", "PCA" and "clf" correspond to StandardScaler(), SelectKBest(), PCA() and the classifier itself.
'None' means that the corresponding function was not used in the pipeline. In the case of "K_best" it means that all features were selected.
In [ ]:
Pipeline(steps=[("Scaler", None),
("K_best", None),
("PCA", PCA(n_components=12)),
("clf", LogisticRegression(C=6,
penalty=l1,
tol=0.001)
)
]
)
In [ ]:
Pipeline(steps=[("Scaler", StandardScaler()),
("K_best", None),
("PCA", None),
("clf", AdaBoostClassifier(n_estimators=50,
learning_rate=0.25))
]
)
In [ ]:
Pipeline(steps=[("Scaler", None),
("K_best", SelectKBest(k=10)),
("PCA", PCA(n_components=5)),
("clf", RandomForestClassifier(n_estimators=20,
criterion='entropy',
max_depth=3,
random_state=40))
]
)
In [ ]:
Pipeline(steps=[("Scaler", None),
("K_best", SelectKBest(k=13)),
("PCA", PCA(n_components=10)),
("clf", GaussianNB())
]
)
In [ ]:
Pipeline(steps=[("Scaler", StandardScaler()),
("K_best", SelectKBest(k=4)),
("PCA", None),
("clf", DecisionTreeClassifier(min_samples_split=3,
criterion='entropy',
min_samples_leaf=2))
]
)
The final part of the project is checking whether the algorithms fulfill their intended purpose of identifying POIs. For that reason the dataset is split into two parts: a training set and a test set. The training set is used to train the algorithm to predict POIs, while the test set is used to check the algorithm's performance. It is important to use different data for testing and training: if the test and training sets were the same, we would get a better but misleading performance estimate, because the algorithm would already have seen the correct answers during training.
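A simple hold-out split could look like the sketch below; the 30% test size and the `features`/`labels` variables are assumptions for the example:
In [ ]:
from sklearn.model_selection import train_test_split

# Hold out part of the data so the classifier is evaluated on examples
# it never saw during training.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)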
Precision, Recall and F1-Score are the metrics used to evaluate the performance of each algorithm. These metrics are based on the numbers of true positives (TP), false positives (FP) and false negatives (FN).
The metrics are calculated according to:
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-Score = (2 * Precision * Recall)/(Precision + Recall)
Precision represents the percentage of cases labeled as positive by the classifier that are actually positive. Higher precision corresponds to an algorithm whose POI predictions are more likely to be correct.
Recall represents the percentage of actual positive cases that the classifier labeled as positive. An algorithm with higher recall finds a larger share of the POIs in the dataset.
F1-Score represents the harmonic mean of Recall and Precision. The score takes values between 0 and 1, with 1 as the best and 0 as the worst score.
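These metrics can be computed directly with scikit-learn; a small self-contained example with toy labels (1 = POI, 0 = non-POI):
In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score

labels_true = [0, 0, 1, 1, 0, 1, 0, 0]   # 3 actual POIs
labels_pred = [0, 1, 1, 0, 0, 1, 0, 0]   # 3 predicted POIs: 2 correct, 1 wrong

print("Precision:", precision_score(labels_true, labels_pred))  # 2/3
print("Recall:   ", recall_score(labels_true, labels_pred))     # 2/3
print("F1-Score: ", f1_score(labels_true, labels_pred))         # 2/3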
To calculate the metrics I use cross-validation. In this procedure the dataset is split into k smaller sets. The algorithm is trained on k-1 of these sets, while the remaining set is used to test the performance. The process is repeated k times, so that each subset is used for validation exactly once.
This method makes it possible to evaluate an algorithm on a very small dataset, which happens to be the case in this project. The cross-validation is implemented in the function test_classifier().
Furthermore, I am dealing with the problem that the dataset is very imbalanced: the number of non-POIs is much higher than the number of POIs.
Splitting the dataset into training and test subsets may therefore result in all POIs being allocated to the training set. In that case it would not be possible to check whether the classifier is actually good at predicting POIs. Ideally the training and test sets should have the same ratio of POIs to non-POIs; this is called stratification. To achieve this, test_classifier() uses Scikit-learn's StratifiedShuffleSplit(), which performs the stratification.
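A sketch of a stratified evaluation loop, loosely following what test_classifier() does; the number of splits, the test size and the `features`/`labels` arrays are assumptions:
In [ ]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

features_arr = np.asarray(features)
labels_arr = np.asarray(labels)

# Each split keeps approximately the same POI / non-POI ratio in train and test.
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)
for train_idx, test_idx in sss.split(features_arr, labels_arr):
    features_train, labels_train = features_arr[train_idx], labels_arr[train_idx]
    features_test, labels_test = features_arr[test_idx], labels_arr[test_idx]
    # ... fit the pipeline on the training fold and accumulate TP/FP/FN over all folds ...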
The resulting Precision, Recall and F1-Score for each algorithm are given below.
Algorithm | Precision | Recall | F1-Score |
---|---|---|---|
Gaussian Naive Bayes | 0.451 | 0.372 | 0.408 |
Ada Boost | 0.332 | 0.269 | 0.297 |
Logistic Regression | 0.407 | 0.229 | 0.293 |
Decision Tree | 0.298 | 0.237 | 0.264 |
Random Forest | 0.535 | 0.168 | 0.255 |
Using the F1-Score as the main metric for the evaluation, Gaussian Naive Bayes turns out to be the best algorithm: 45.1% of the predicted persons of interest were actual POIs, while 37.2% of the POIs in the dataset were found. In comparison, 53.5% of the Random Forest's POI predictions were correct, but it found only 16.8% of the POIs in the dataset.
The performance of Gaussian Naive Bayes was achieved by selecting the 13 features with the best scores and picking 10 principal components. Using only the 13 best features means that the new features email_from_poi_ratio and email_to_poi_ratio have no impact on the performance of the algorithm. In contrast, total_wealth does have a massive influence on the performance. Removing this feature from the feature list yields the following values for the metrics:
Using StandardScaler() also lowered the performance. This leads to the assumption that Gaussian Naive Bayes does not need any normalization to work properly.