In [2]:
import pandas as pd
import numpy as np
import pylab as pl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pipeline.util as u
import pipeline.process as pr
import pipeline.read as r
import pipeline.explore as ex
import pipeline.evaluate as ev
%matplotlib inline
While I was working on improving my previous homework and deciding on a method for feature selection, I ran the small grid search from Magic Loops with the specifications Rayid selected. This ran overnight, but unfortunately I started it before I thought to add a timer; from observation, gradient boosting and AdaBoost appeared to be among the slower models. The results informed the smaller loop I ran on the features selected by Random Forest, which were nearly identical to Rayid's feature set except that I included monthly_income.
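For reference, a minimal sketch of the Random Forest feature-selection step. The DataFrame name df and the label column serious_dlqin2yrs are assumptions for illustration, not the pipeline's actual names:

from sklearn.ensemble import RandomForestClassifier

# assumed: df is the processed credit DataFrame; serious_dlqin2yrs is the label
features = [c for c in df.columns if c != "serious_dlqin2yrs"]
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(df[features], df["serious_dlqin2yrs"])

# rank features by importance; the top-ranked set was Rayid's plus monthly_income
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)
importances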
In [11]:
first_grid = r.read_csv("small_loop_result.csv")
fg = first_grid.sort_values(by="auc-roc")
fg.head(10)  # bottom 10 specifications by AUC-ROC (ascending sort)
Out[11]:
In [17]:
# top 10 specifications by AUC-ROC
fg.tail(10).sort_values(by="auc-roc", ascending=False)
Out[17]:
Ensemble methods performed the best overall, but with wide variance. Hyperparameter tuning made a big difference in model fit according to AUC-ROC. Of the 213 specifications that ran, the Gradient Boosting classifier produced four of the top 10 and four of the bottom 10 models. Gradient Boosting split on max depth: the top performers used a max depth of 5, while the bottom performers used a max depth of 50.
Random Forests performed very strongly and had much lower variance in their results; AUC-ROC ranged from 72 to 83 percent (compare with Gradient Boosting's range of roughly 48 to 83). Random Forests also split on max depth, with shorter trees outperforming deeper ones.
Simpler methods such as Decision Trees and KNN almost matched the ensemble methods under certain specifications. Decision Tree specifications with depth-1 trees captured 65 to 72 percent of the area under the curve, which gives us a baseline for comparison. These models also beat trees with max depths of 20, 50, and 100, which suggests that, with a limited number of features, depths four or five times greater than the length of the feature set will overfit. Linear regression performed below the 1-deep trees, with AUC-ROC ranging from 63 to 65 percent over 10 specifications, and Naive Bayes hit just above that range. SVM and SGD did not run directly from Rayid's setup.
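To illustrate the kind of hyperparameter sweep behind these results, the max-depth split looks roughly like the dictionary below. The exact values are an assumption for illustration, not Rayid's actual grid:

# illustrative grid only; the real small loop covers more models and parameters
grid = {
    'GB': {'n_estimators': [10, 100], 'learning_rate': [0.1, 0.5], 'max_depth': [5, 50]},
    'RF': {'n_estimators': [10, 100], 'max_depth': [5, 50]},
    'DT': {'criterion': ['gini', 'entropy'], 'max_depth': [1, 5, 20, 50, 100]},
    'KNN': {'n_neighbors': [5, 10, 25], 'weights': ['uniform', 'distance']},
}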
In [22]:
second_grid = r.read_csv("refined_results.csv")
sg = second_grid.sort_values(by="auc-roc", ascending=False)
sg.head(10)  # top 10 specifications from the refined loop; speed and the extra feature are discussed below
Out[22]:
Adding monthly_income increased the AUC-ROC scores for the top-performing models by roughly 6 points. The top Gradient Boosting model had an AUC-ROC of .88, the top Random Forest .87, and the top Decision Tree .86. Linear regression and Naive Bayes did about the same as in the first pass, hovering around .66; adding information did not improve their results much.
Time is a major factor in grid search, and the top GB models take significantly longer than the other models. Dividing the number of estimators by 10 yields a similar speed gain, though, so the algorithm appears to be linear in n_estimators while gaining less than 1 percent in AUC-ROC. Considering that gain relative to our baseline of about 72 percent, the improvement is bigger than it looks. n_estimators had a similar impact on RF, though the performance gain was smaller.
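The timer I wish I had added is a one-line wrapper around the fit call, something like the sketch below (clf, x_train, and y_train stand for whatever the loop defines at that point):

import time

start = time.time()
clf.fit(x_train, y_train)          # clf is the current specification, e.g. a GB or RF model
train_time = time.time() - start   # seconds; record alongside auc-roc in the results csv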
I included an RF with max_depth of 50 in this batch as well. It takes roughly twice as long as a similarly specified 5-deep RF and produces a lower AUC-ROC. However, if we shift our focus to finding delinquents, it performed the best, finding 768 of 2,547 (~30 percent). That seems low; looking at the confusion matrices in the Jupyter notebook, there are high levels of false negatives, which makes sense given that there are many more non-delinquents. The best model by AUC-ROC found 735 of 2,547 delinquents (~29 percent).
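The delinquent counts above are read off the confusion matrix; a sketch of that calculation, where y_test and y_pred are assumed to come from the loop's train/test split:

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("delinquents found: {} of {} ({:.0%})".format(tp, tp + fn, float(tp) / (tp + fn)))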
My recommendations for future analysis would depend on the purpose and usage of the model. My analysis generally relied on AUC-ROC, which measures the model's ability to identify true positives without selecting false positives and is a good gauge of general reliability. These models do well by minimizing false positives; the benefit here would be minimizing unnecessary contact with people who are not going to be delinquent. Depending on the severity of the policy intervention, that may be desirable, as being labeled a potential delinquent can have serious negative consequences for a person, from higher finance rates to stigma. If the goal is to use resources effectively while minimizing unnecessary contact, the analyst might focus on tweaking RF and GB; since RF is fairly robust to feature selection, adding additional variables could boost the score, and max depth should stay around 5 in both models. However, if the goal is to minimize false negatives, then the analyst might consider focusing on recall.
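If recall becomes the target, a quick check is to score the same predictions on recall directly and to lower the probability threshold before labeling; a hedged sketch, where y_test, y_pred, and y_scores are assumed to come from the loop:

from sklearn.metrics import recall_score, precision_score

print("recall:", recall_score(y_test, y_pred))        # share of actual delinquents caught
print("precision:", precision_score(y_test, y_pred))  # share of flagged people who are delinquent

# lowering the threshold trades precision for recall
y_pred_low = (y_scores >= 0.3).astype(int)   # 0.3 instead of the default 0.5
print("recall at 0.3:", recall_score(y_test, y_pred_low))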