The goal of this project is to identify persons of interest (POIs), in other words the people who actually committed fraud at Enron. Their crimes include selling assets to shell companies at the end of each month and buying them back at the beginning of each month to avoid accounting losses. Ideally, for people who are not in the dataset, a machine learning model can predict whether each of them is a POI based on financial features and emails.
There are 146 people in the dataset, 18 of whom are labeled as persons of interest (there are actually 35 POIs in total). Since the email data is only a sample, data for some POIs is missing, which may make the predictions a little worse. There are 21 features in the dataset.
The dataset is not free of errors, especially in the financial features. Because not all POIs are in the dataset, we might want to add the missing ones by hand and fill in their financial information with missing values. But this could itself introduce an error, because the model could then predict whether a person is a POI from the NaN values alone. So the financial features are still used as they are. Below is the proportion of non-NaN values for each column.
In [29]:
import numpy as np
import pandas as pd

# Missing values are stored as the string 'NaN'; convert them to real NaN so count() skips them.
nan_summary = pd.DataFrame({'size': df.count(),
                            'no-nan': df.replace('NaN', np.nan).count()},
                           index=df.columns)
nan_summary['no-nan-proportion'] = nan_summary['no-nan'] / nan_summary['size']
nan_summary
Out[29]:
We can see that, of all the features in the dataset, only poi, the label for this machine learning task, has no missing values. This is good, since the model needs the label; without it the data is meaningless. On the other hand, a feature with too many missing values, like loan_advances, would not benefit the model.
The dataset contains an outlier, 'TOTAL'. It is the sum of each numerical feature over every person in the Enron dataset, but it is counted as a person. We should exclude it because it is not a data point we are interested in. Looking at the remaining outliers, 2 of the 4 that I found are identified as POIs. Since this is exactly the data we are paying attention to, I do not exclude them.
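A minimal sketch of dropping the aggregate row, assuming the data sits in a pandas DataFrame indexed by person name. The names and values in `df_toy` are made up for illustration and are not taken from the Enron data.

```python
import pandas as pd

# Toy stand-in for the Enron frame; names and numbers are illustrative only.
df_toy = pd.DataFrame(
    {"salary": [200000, 350000, 550000], "poi": [False, True, False]},
    index=["PERSON A", "PERSON B", "TOTAL"],
)

# 'TOTAL' is a spreadsheet sum, not a person, so drop it by its index label.
# errors="ignore" keeps the cell safe to re-run after the row is already gone.
df_clean = df_toy.drop(index="TOTAL", errors="ignore")

print(df_clean.index.tolist())
```

The other outliers are left in place here, matching the decision above to keep extreme values that belong to real people.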
I added two new features: the fraction of a person's sent emails that went to POIs, and the fraction of a person's received emails that came from POIs. The reasoning is that a person who exchanges email with POIs more frequently could end up being a POI himself. However, these features turned out to be filtered out by SelectPercentile and therefore had no effect on performance. I also added text features (words) based on each person's emails.
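The two fraction features could be computed roughly as follows. This is a sketch, not the notebook's actual code: the helper `compute_fraction` is hypothetical, and the field names (`from_poi_to_this_person`, `to_messages`, `from_this_person_to_poi`, `from_messages`) are assumed to be the email-count columns of the dataset.

```python
import math


def compute_fraction(poi_messages, all_messages):
    """Return poi_messages / all_messages, or 0.0 when a value is missing or the total is 0."""
    for value in (poi_messages, all_messages):
        if value == "NaN" or (isinstance(value, float) and math.isnan(value)):
            return 0.0
    if all_messages == 0:
        return 0.0
    return poi_messages / all_messages


# Hypothetical record; the counts are invented for illustration.
person = {
    "from_poi_to_this_person": 30,
    "to_messages": 600,
    "from_this_person_to_poi": 10,
    "from_messages": "NaN",  # missing values are stored as the string 'NaN'
}

fraction_from_poi = compute_fraction(person["from_poi_to_this_person"], person["to_messages"])
fraction_to_poi = compute_fraction(person["from_this_person_to_poi"], person["from_messages"])

print(fraction_from_poi, fraction_to_poi)
```

Mapping missing counts to 0.0 is one possible choice; it avoids a division error but treats "unknown" the same as "no contact with POIs".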
Without text features I achieved: Precision: 0.27753 Recall: 0.24700
With text features I achieved: Precision: 0.36853 Recall: 0.35950
I scaled all numerical features. The reason is that the algorithm I am using, SGDClassifier, is sensitive to feature scale: its gradient updates and its l2 penalty act on all coefficients together, so a feature with a much larger range can dominate the fit. This is unlike ordinary linear regression, where each coefficient simply absorbs the scale of its own feature. Since I also saw that scaling made the model better, I decided to scale the features.
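As a small illustration of what scaling does, the sketch below rescales two toy columns with very different ranges. The write-up does not say which scaler was used, so `MinMaxScaler` is an assumption made for this example, and the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy columns on very different scales (e.g. dollars vs. a small count).
X = np.array([
    [200000.0, 1.0],
    [1000000.0, 5.0],
    [600000.0, 3.0],
])

# MinMaxScaler maps each column independently onto the range [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

In a real pipeline the scaler would be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into training.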
I selected numerical features using SelectPercentile with a percentile of 21. I tried a variety of percentiles to find the one that maximizes both precision and recall. When the two traded off against each other, I picked the percentile with the highest F1 score.
Range of percentiles used and the corresponding precision and recall:
Final features used are:
['deferred_income',
'bonus',
'total_stock_value',
'salary',
'exercised_stock_options']
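The selection step that produced a list like the one above can be sketched on synthetic data. The scoring function used in the notebook is not stated, so `f_classif` (the scikit-learn default for `SelectPercentile`) is an assumption, and the data here is random rather than the Enron features.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)

# Skewed synthetic labels: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)

# Ten noise features; only feature 0 is made informative by shifting it for class 1.
X = rng.normal(size=(100, 10))
X[:, 0] += 3 * y

# Keep only the features whose univariate score falls in the top 21 percent.
selector = SelectPercentile(score_func=f_classif, percentile=21)
X_selected = selector.fit_transform(X, y)

# Indices of the columns that survived the selection.
kept = selector.get_support(indices=True)
print(kept, X_selected.shape)
```

Trying several values of `percentile` and comparing the resulting F1 scores is one way to reproduce the search described above.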
Gaussian Naive Bayes gave the best default performance among the classifiers I tried, but the tuned SGDClassifier performed best overall, so that is the one I ended up using. The performance of each algorithm is as follows:
from sklearn.naive_bayes import GaussianNB ##Default(Tuned): Precision: 0.29453 Recall: 0.43650
from sklearn.tree import DecisionTreeClassifier ##Default: Precision: 0.14830 Recall: 0.05450
from sklearn.ensemble import RandomForestClassifier ##Default: Precision: 0.47575 Recall: 0.20600, Longer time
from sklearn.linear_model import SGDClassifier ##Tuned: Precision: 0.35534 Recall: 0.34450, BEST!
Since the algorithm I now use is SGDClassifier, I tuned its parameters. Tuning is important because the best settings for an estimator vary with the problem; by tuning, we fit the parameters to our specific problem. By default the estimator uses the hinge loss, which gives a linear SVM. alpha is the constant that multiplies the regularization term, and with the default learning_rate schedule it is also used to compute the learning rate. If the learning rate is too small, the model learns very slowly; if it is too high, the updates overshoot and the model cannot settle on the best parameters.
I used GridSearchCV to tune the algorithm, but I did not hand all of the parameters over to it. For the text features, the l2 penalty is a must, since it regularizes the sparse features. By default the cv parameter uses StratifiedKFold for classifiers, which is consistent with the stratified splitting that tester.py uses. StratifiedKFold suits skewed data because every fold preserves the proportion of each class. The scoring method used is the F1 score.
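A hedged sketch of this kind of search on synthetic data follows. The parameter grid is an assumption chosen for illustration (the notebook's actual grid is not shown), and the data is random rather than the Enron features.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)

# Skewed synthetic labels with one informative feature.
y = np.array([0] * 85 + [1] * 15)
X = rng.normal(size=(100, 5))
X[:, 0] += 2.5 * y

# Illustrative grid only; the penalty is fixed to l2 and not searched.
param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],
    "loss": ["hinge", "modified_huber"],
}

# With an integer cv and a classifier, GridSearchCV uses StratifiedKFold.
search = GridSearchCV(
    SGDClassifier(penalty="l2", random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search from favoring a model that simply predicts the majority class.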
Validation is important when we want to know how the model will perform on future data. The drawback is that we have less data to train on, but it is useful for estimating performance. We cannot train the model on the whole dataset and test it on the same data: the model would already know what it is up against and would appear to perform excellently, which is called cheating in machine learning. I use a 70:30 train/test split and validate the performance using precision and recall.
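The point about testing on the training data can be seen in a small sketch. The labels below are pure noise, so there is nothing to learn, yet a decision tree still scores perfectly on the data it was trained on. The classifier and data are chosen only for illustration and are not the model from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
y = rng.randint(0, 2, size=100)  # labels are random noise

# 70:30 split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree can memorize the training set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # measured on data the model has seen
test_acc = clf.score(X_test, y_test)     # measured on held-out data

print(train_acc, test_acc)
```

Only the held-out score says anything about how the model would behave on new people.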
I use precision and recall as my evaluation metrics, because they remain informative on skewed data, where plain accuracy can be misleading. From the results above, I obtained reasonably good precision and recall. Recall measures how many of the real POIs the model identifies, and precision is the probability that a person flagged as a POI really is one.
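To make the definitions concrete, here is a small worked example on made-up labels (1 = POI, 0 = non-POI). The helper `precision_recall_f1` is written for this illustration and is not part of the project code.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Invented example: 4 real POIs out of 10 people, 3 people flagged by the model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 2 true positives, 1 false positive, 2 false negatives:
# precision = 2/3, recall = 2/4, F1 = 4/7, accuracy = 7/10
print(precision, recall, f1, accuracy)
```

Accuracy looks acceptable here even though half of the POIs are missed, which is why precision and recall are the more useful numbers for this dataset.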
StratifiedShuffleSplit is used when the data is skewed and we want every split to keep the proportion of labels. With an ordinary train/test split, the test set could end up with no POI labels or, even worse, the training set could, which would make the model perform poorly. If, for example, StratifiedShuffleSplit makes ten splits, every split will contain the same proportion of POIs to non-POIs.
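The proportion-preserving behavior can be checked with a short sketch on synthetic labels. The class sizes and `test_size` are assumptions chosen so that the arithmetic is exact.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Skewed synthetic labels: 90 non-POI, 10 POI.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to how the split is made

# Ten random splits, each holding out 30% of the data.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

# Count the POIs that land in each test set.
test_poi_counts = [int(y[test_idx].sum()) for _, test_idx in sss.split(X, y)]

# 10% of the 30 test samples are POIs in every split.
print(test_poi_counts)
```

Unlike k-fold, the test sets of different splits are drawn independently and may overlap.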