Because my metric (buy rate) is continuous, I have turned a simple classification problem (good vs. bad loans) into a regression of return. I believe this is a better tool for picking loans to invest in, since it forecasts a predicted return rather than just whether the loan will be paid as agreed. The metric also interacts with the loan's interest rate, preferring higher-interest loans that are paid as agreed over lower-interest ones; a simple classifier makes no such distinction. Successfully forecasting this metric can be the difference between a portfolio with no defaults but only a 4% annual return (low-interest, safe loans) and one with few or no defaults and a 10-15% annual return.
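To make that concrete, here is a toy illustration; the interest rates, default probabilities, and loss-given-default below are invented purely to show why a return forecast separates loans that a good/bad classifier would treat identically:
# Hypothetical numbers only: two loans a binary classifier would likely both call "good".
safe_rate, safe_default_prob = 0.06, 0.01     # low-interest, very safe loan
risky_rate, risky_default_prob = 0.14, 0.05   # higher-interest, slightly riskier loan
loss_if_default = 0.50                        # assumed fraction of principal lost on default

def expected_return(rate, p_default, loss=loss_if_default):
    return (1 - p_default) * rate - p_default * loss

print(expected_return(safe_rate, safe_default_prob))    # ~0.054
print(expected_return(risky_rate, risky_default_prob))  # ~0.108
A regression on return prefers the second loan; a classifier that only predicts "paid as agreed" cannot tell them apart.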
Data exploration and cleaning were performed in the previous Jupyter notebook, "Data Cleaning.ipynb", where I exported cleaned features and labels as pickles using joblib. I will import them here and plot a few preliminary views to start!
In [1]:
import joblib
features=joblib.load('clean_LCfeatures.p')
labels=joblib.load('clean_LClabels.p')
clabels=joblib.load('clean_LCclassifierlabel.p')
In [2]:
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualization code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
In [3]:
features.head(n=10)
Out[3]:
In [4]:
features['earliest_cr_line']=features.earliest_cr_line.dt.year
In [5]:
features.earliest_cr_line.dtype
Out[5]:
In [6]:
import imp
imp.reload(vs)
vs.distribution(features)
There is some odd scaling going on. Just as a sanity check, look at the maximum values of a few of the features; these may need to be log-scaled to deal with the large values. Income almost always needs log scaling anyway.
In [7]:
valuespread={'annual_inc':max(features.annual_inc),'delinqmax':max(features.delinq_2yrs),
'openmax':max(features.open_acc), 'pubmax':max(features.pub_rec),
'revolbalmax':max(features.revol_bal),'revolutilmax':max(features.revol_util),
'collectmax':max(features.collections_12_mths_ex_med)}
valuespread
Out[7]:
There are some extreme revolving balances and utilizations; I need to check those quickly. Everything else looks extreme, but within ranges I'd assume could exist.
In [8]:
features.query('revol_bal>1000000.0')
Out[8]:
I guess these are pretty understandable: high-paying jobs and high incomes. Let's check the utilizations - having over 100% seems... not possible.
In [9]:
features.query('revol_util>120')
Out[9]:
Having a utilization over 100% is odd, but a utilization of almost 900% is crazy! I am going to remove just that one loan - I'm not sure what is going on with this outlier. The others are clearly unusual, but there doesn't seem to be an obvious cutoff beyond removing the huge outlier.
In [10]:
features.drop([294407],inplace=True)
labels.drop([294407],inplace=True)
clabels.drop([294407],inplace=True)
In [11]:
imp.reload(vs)
vs.distribution(features)
In [12]:
# Log-transform the skewed features
skewed = ['annual_inc','delinq_2yrs','open_acc', 'pub_rec','revol_bal','total_acc', 'collections_12_mths_ex_med']
features_raw=features.copy()
features_raw[skewed] = features[skewed].apply(lambda x: np.log(x + 1))
# Visualize the new log distributions
imp.reload(vs)
vs.distribution(features_raw, transformed = True)
We see a clear improvement in income, open accounts, and revolving balance. The others are improved but still dominated by low values. That's okay for now. There are extreme outliers, but they carry real information. I am tempted to winsorize, yet the values are so heavily weighted toward zero that winsorization would affect every value greater than zero (a sketch of what it would look like follows the commented-out cell below).
In [13]:
# import scipy.stats
# winsors=['delinq_2yrs','pub_rec','collections_12_mths_ex_med']
# for feature in winsors:
# winsor[feature]=scipy.stats.mstats.winsorize(features[feature], limits=0.01,axis=1)
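For reference, a working version of that winsorization (clipping only the top 1% of each heavy-tailed feature) would look roughly like this; I'm leaving it unused for the reason above:
# Sketch only (not applied): cap the top 1% of each heavy-tailed feature.
# scipy.stats.mstats.winsorize takes limits=(lower_fraction, upper_fraction).
from scipy.stats import mstats
winsors = ['delinq_2yrs', 'pub_rec', 'collections_12_mths_ex_med']
features_wins = features.copy()
for feature in winsors:
    features_wins[feature] = mstats.winsorize(features_wins[feature].values, limits=[0, 0.01])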
In [14]:
from sklearn.preprocessing import MinMaxScaler
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler()
numerical = ['loan_amnt','emp_length', 'annual_inc','dti','delinq_2yrs','earliest_cr_line','inq_last_6mths','open_acc', 'pub_rec','revol_bal','revol_util','total_acc', 'collections_12_mths_ex_med']
features_raw[numerical] = scaler.fit_transform(features_raw[numerical])
# Show an example of a record with scaling applied
display(features_raw.head(n = 10))
Up to this point I have only dealt with the "continuous" numerical features, where the values have some relation to one another: higher income, for example, can be directly compared to lower income. Now I want to deal with discrete features, where the values are independent. One example is the address state. Although we each might have our own way to rank the states, there is no "correct" relationship between any two of them - they are discrete and independent. Categorical features like these get one-hot encoded!
Of note is the employment title - it is entered free-form by the loan applicant and can be pretty much anything, so encoding it would certainly cause an explosion in the feature space. At this point I will simply drop it rather than encode it, and I will drop the address state as well.
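A quick cardinality check makes the point; the exact counts depend on the cleaned data, and I'm assuming the loan-purpose column is named purpose as in the raw LendingClub export:
# Distinct free-text job titles vs. a bounded categorical like the loan purpose.
# emp_title is expected to have a huge number of unique strings; purpose only a handful.
print(features_raw['emp_title'].nunique())
print(features_raw['purpose'].nunique())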
In [15]:
features_raw
Out[15]:
In [16]:
features_raw=features_raw.drop(['emp_title','addr_state'],axis=1)
In [17]:
features_raw.dtypes
Out[17]:
In [18]:
# One-hot encode the 'features_raw' data using pandas.get_dummies()
feat = pd.get_dummies(features_raw)
# Print the number of features after one-hot encoding
encoded = list(feat.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))
# Show the encoded feature names
print(encoded)
Great! There are the original continuous features, plus a new binary feature for every home-ownership type and every loan purpose.
One more thing: I believe this data is ordered by origination date, so it is important to shuffle it. Luckily that happens anyway, since train_test_split shuffles by default.
In [19]:
# Import train_test_split
from sklearn.model_selection import train_test_split #sklearn 0.18.1 and up
# Split the features and regression labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(feat, labels, test_size = 0.2, random_state = 1)
# Show the results of the split
print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))
For almost any supervised learning problem, some version of decision trees is worth testing, if only for their extreme intuitiveness; at the very least, a tree can be used to shrink the feature space and improve speed for other algorithms (a sketch of that idea follows below). I will use gradient boosting regression for this. Second, I want to test a simple non-linear regression, for which I will use linear regression on second-order interaction features. Finally, I will test ridge regression as a regularized linear baseline.
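As an aside on the feature-shrinking point: scikit-learn's SelectFromModel can wrap a tree-based model and keep only the more important features. A minimal sketch, not run as part of this analysis, and the 'median' threshold is just an assumption:
# Sketch: prune the feature space using a tree ensemble's feature importances,
# then hand the reduced matrix to a slower model. The threshold choice is arbitrary here.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor

selector = SelectFromModel(GradientBoostingRegressor(random_state=2), threshold='median')
selector.fit(X_train, y_train)
X_train_reduced = selector.transform(X_train)   # keeps features at or above median importance
X_test_reduced = selector.transform(X_test)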
In [20]:
from sklearn.metrics import r2_score, mean_squared_error
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the number of samples to be drawn from the training set
       - X_train: features training set
       - y_train: labels training set
       - X_test: features testing set
       - y_test: labels testing set
    '''
    results = {}

    # Fit the learner to a random subsample of 'sample_size' training points
    # (same random_state keeps the feature and label samples aligned)
    start = time()  # Get start time
    learner = learner.fit(X_train.sample(n=sample_size, random_state=1),
                          y_train.sample(n=sample_size, random_state=1))
    end = time()  # Get end time

    # Calculate the training time
    results['train_time'] = end - start

    # Get the predictions on the test set,
    # then get predictions on the first 300 training samples
    start = time()  # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time()  # Get end time

    # Calculate the total prediction time
    results['pred_time'] = end - start

    # Compute mean squared error on the first 300 training samples
    results['mse_train'] = mean_squared_error(y_train[:300], predictions_train)
    # Compute mean squared error on the test set
    results['mse_test'] = mean_squared_error(y_test, predictions_test)
    # Compute R^2 on the first 300 training samples
    results['R2_train'] = r2_score(y_train[:300], predictions_train)
    # Compute R^2 on the test set
    results['R2_test'] = r2_score(y_test, predictions_test)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    # Return the results
    return results
In [21]:
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Initialize models
clf_A = make_pipeline(PolynomialFeatures(2,interaction_only=True),LinearRegression())
clf_B = GradientBoostingRegressor(random_state=2)
clf_C = Ridge()
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = len(y_train.sample(frac=.01))
samples_10 = len(y_train.sample(frac=.1))
samples_100 = len(y_train.sample(frac=1))
# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
            train_predict(clf, samples, X_train, y_train, X_test, y_test)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results,.5,.5)
In [22]:
results
Out[22]:
Woof, terrible bias; almost no structure is captured by these models. I'm not entirely sure what to make of such extreme values, but negative R^2 is bad and large MSE is bad. Most likely the models are too simple. I'd like to try to overfit before I move toward a more modest approach such as classification. If I can tune a decision tree back from overfitting toward something useful, maybe the metric and this regression still have hope.
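For context, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$: a model that predicted the mean return for every loan would score exactly 0, so a negative $R^2$ means these regressors are doing worse than that constant guess.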
For this I will use a standard decision tree and attempt to overfit, with the hope of pruning it back afterwards.
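Worth noting: scikit-learn only gained true post-pruning (minimal cost-complexity pruning via the ccp_alpha parameter) in version 0.22, so whether this is directly usable depends on the installed version. A minimal sketch, assuming 0.22+ is available:
# Sketch (requires scikit-learn >= 0.22): cost-complexity post-pruning of a regression tree.
# Computing the full pruning path can be slow on a training set this large.
from sklearn.tree import DecisionTreeRegressor

path = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # arbitrary mid-path alpha for illustration
pruned = DecisionTreeRegressor(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)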
In [23]:
from sklearn.tree import DecisionTreeRegressor
results2 = {}
clf_D = DecisionTreeRegressor(random_state=1)
clf_D.fit(X_train, y_train)
predictions_test = clf_D.predict(X_test)
predictions_train = clf_D.predict(X_train[:3000])
# Compute mean squared error on the first 3000 training samples
results2['mse_train'] = mean_squared_error(y_train[:3000], predictions_train)
# Compute mean squared error on the test set
results2['mse_test'] = mean_squared_error(y_test, predictions_test)
# Compute R^2 on the first 3000 training samples
results2['R2_train'] = r2_score(y_train[:3000], predictions_train)
# Compute R^2 on the test set
results2['R2_test'] = r2_score(y_test, predictions_test)
results2
Out[23]:
In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# Initialize the regressor
clf = DecisionTreeRegressor(random_state=2)
# Create the list of parameters to tune
parameters = {'max_depth': (4, 6, 20, None), 'min_samples_split': (2, 50), 'min_samples_leaf': (1, 101)}
# Make an R^2 scoring object
scorer = make_scorer(r2_score)
# Perform grid search on the regressor using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1)
# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)
# Get the best estimator
best_clf = grid_fit.best_estimator_
# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
# Report the before-and-after scores
print("Unoptimized model\n------")
print("MSE score on testing data: {:.4f}".format(mean_squared_error(y_test, predictions)))
print("R2-score on testing data: {:.4f}".format(r2_score(y_test, predictions)))
print("\nOptimized Model\n------")
print("Final MSE score on the testing data: {:.4f}".format(mean_squared_error(y_test, best_predictions)))
print("Final R2-score on the testing data: {:.4f}".format(r2_score(y_test, best_predictions)))
In [25]:
best_clf
Out[25]:
In [26]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import make_scorer
# # Initialize the classifier
# clf = GradientBoostingRegressor(random_state=2)
# # Create the parameters list you wish to tune
# parameters = {'learning_rate':(0.01,0.1), 'max_depth':(3,6)}
# # Make an R2_score scoring object
# scorer = make_scorer(r2_score)
# # TODO: Perform grid search on the classifier using 'scorer' as the scoring method
# grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)
# # TODO: Fit the grid search object to the training data and find the optimal parameters
# grid_fit = grid_obj.fit(X_train,y_train)
# # Get the estimator
# best_clf = grid_fit.best_estimator_
# # Make predictions using the unoptimized and model
# predictions = (clf.fit(X_train, y_train)).predict(X_test)
# best_predictions = best_clf.predict(X_test)
# # Report the before-and-afterscores
# print ("Unoptimized model\n------")
# print ("MSE score on testing data: {:.4f}".format(mean_squared_error(y_test, predictions)))
# print ("R2-score on testing data: {:.4f}".format(r2_score(y_test, predictions)))
# print ("\nOptimized Model\n------")
# print ("Final MSE score on the testing data: {:.4f}".format(mean_squared_error(y_test, best_predictions)))
# print ("Final R2-score on the testing data: {:.4f}".format(r2_score(y_test, best_predictions)))
In [27]:
best_clf
Out[27]:
Interestingly, the best regressor found by the grid search is essentially the "stock" one, and it is still pretty poor. I think it is time to try classification instead of regression.
In this case we use the classification labels, which group "Current" and "Fully Paid" loans together as good and group all other, delinquent loans together as bad. First we need a new train/test split, and then a baseline naive predictor.
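For reference, the grouping from the cleaning notebook amounts to something like the toy sketch below; the exact status strings are assumptions about how the statuses are worded, since the real labels were exported there as clean_LCclassifierlabel.p:
import pandas as pd
# Toy example of the good/bad grouping; "Fully Paid" and "Current" count as good (1).
statuses = pd.Series(['Fully Paid', 'Current', 'Charged Off', 'Late (31-120 days)', 'Default'])
good_loan = statuses.isin(['Fully Paid', 'Current']).astype(int)
print(good_loan.tolist())   # [1, 1, 0, 0, 0]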
In [28]:
# Import train_test_split
from sklearn.model_selection import train_test_split #sklearn 0.18.1 and up
# Split the features and classification labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(feat, clabels, test_size = 0.2, random_state = 1)
# Show the results of the split
print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))
Here we build a naive predictor that simply guesses every loan is "good" (or bad, if more loans were bad than good). A model has to beat these scores before we can claim it is doing anything meaningful. For classification I will look at accuracy, the $f_{0.5}$ score, ROC AUC, and Cohen's kappa to judge model quality on this highly imbalanced dataset. I use $f_{0.5}$ rather than other beta values because I care most about precision: I'd rather misclassify a few good loans as bad than let bad loans slip through, since missing a good loan carries much less risk than funding a bad one. (I could also look at tuning the classifier with an explicit cost of misclassification.) ROC AUC takes both the true-positive and false-positive rates into account and copes with unbalanced data, and Cohen's kappa is designed for unbalanced data, describing the agreement between predictions and the true labels beyond what chance alone would produce.
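Concretely, $F_\beta = (1+\beta^2)\,\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\,\mathrm{precision} + \mathrm{recall}}$, so with $\beta = 0.5$ the score weights precision more heavily than recall, which matches the goal of keeping bad loans out even at the cost of passing on some good ones.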
In [29]:
from sklearn.metrics import accuracy_score,fbeta_score, roc_auc_score, cohen_kappa_score, precision_score
allones=np.ones(len(clabels))
acc=accuracy_score(clabels,allones)
fbeta=fbeta_score(clabels,allones,beta=.5)
auc=roc_auc_score(clabels,allones)
cohen=cohen_kappa_score(clabels,allones)
prec=precision_score(clabels,allones)
# Print the results
print ("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}, ROC AUC: {:.4f}, Cohen's k: {:.4f}, precision: {:.4f}]".format(acc, fbeta,auc,cohen,prec))
Wow! Accuracy and F-score are pretty high. Likely this is because there are an order of magnitude more fully paid loans than others, which can be seen in the data cleaning notebook. Interestingly, ROC AUC and Cohen's kappa show just how poor a naive classifier really is, so I'll use those to judge my models.
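To see why the naive predictor bottoms out on those two metrics: predicting 1 for every loan gives a true-positive rate and a false-positive rate of 1, which pins ROC AUC at 0.5, and since its observed agreement with the labels equals the agreement expected by chance for a constant guess, Cohen's $\kappa = \frac{p_o - p_e}{1 - p_e} = 0$.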
I am choosing three classifiers: decision trees, since it is practically obligatory to try them on any classification problem - they get 80% of the way there 80% of the time; random forests, since averaging many trees should reduce the variance a single tree will show on data this noisy; and Gaussian naive Bayes, since it is one of the simplest models around (essentially a lazy baseline) and I love to see simplicity win out.
In [30]:
from sklearn.metrics import fbeta_score, roc_auc_score, cohen_kappa_score
def train_predictclass(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the number of samples to be drawn from the training set
       - X_train: features training set
       - y_train: labels training set
       - X_test: features testing set
       - y_test: labels testing set
    '''
    resultsclass = {}

    # Fit the learner to a random subsample of 'sample_size' training points
    # (same random_state keeps the feature and label samples aligned)
    start = time()  # Get start time
    learner = learner.fit(X_train.sample(n=sample_size, random_state=1),
                          y_train.sample(n=sample_size, random_state=1))
    end = time()  # Get end time

    # Calculate the training time
    resultsclass['train_time'] = end - start

    # Get the predictions on the test set,
    # then get predictions on the first 300 training samples
    start = time()  # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time()  # Get end time

    # Calculate the total prediction time
    resultsclass['pred_time'] = end - start

    # Compute ROC AUC on the first 300 training samples
    resultsclass['ROC_AUC_train'] = roc_auc_score(y_train[:300], predictions_train)
    # Compute ROC AUC on the test set
    resultsclass['ROC_AUC_test'] = roc_auc_score(y_test, predictions_test)
    # Compute precision on the first 300 training samples
    resultsclass['precision_train'] = precision_score(y_train[:300], predictions_train)
    # Compute precision on the test set
    resultsclass['precision_test'] = precision_score(y_test, predictions_test)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    # Return the results
    return resultsclass
In [31]:
# Initialize models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
# Initialize the three models
clf_A = GaussianNB()
clf_B = DecisionTreeClassifier(random_state=5)
clf_C = RandomForestClassifier(n_estimators=50)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = len(y_train.sample(frac=.01))
samples_10 = len(y_train.sample(frac=.1))
samples_100 = len(y_train.sample(frac=1))
# Collect results on the learners
resultsclass = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    resultsclass[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        resultsclass[clf_name][i] = \
            train_predictclass(clf, samples, X_train, y_train, X_test, y_test)
In [32]:
# Run metrics visualization for the three supervised learning models chosen
imp.reload(vs)
vs.classevaluate(resultsclass,auc,prec)
In [33]:
# Initialize the classifier
clf = DecisionTreeClassifier(random_state=3)
# Create the list of parameters to tune
parameters = {'max_depth': (20, 50, None), 'max_leaf_nodes': (100, 300, None)}
# Make a precision scoring object
scorer = make_scorer(precision_score)
# Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1)
# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)
# Get the best estimator
best_clf = grid_fit.best_estimator_
# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
In [34]:
# Report the before-and-after scores
print ("Unoptimized model\n------")
print ("ROC AUC score on testing data: {:.4f}".format(roc_auc_score(y_test, predictions)))
print ("Precision on testing data: {:.4f}".format(precision_score(y_test, predictions)))
print ("\nOptimized Model\n------")
print ("Final ROC AUC score on the testing data: {:.4f}".format(roc_auc_score(y_test, best_predictions)))
print ("Final Precision on the testing data: {:.4f}".format(precision_score(y_test, best_predictions)))
Interestingly, this doesn't really improve the precision. There may not be enough bad-loan data to give the model something to hang on to, and it could also be that we simply haven't collected the data that would be more predictive. Still, because the tuned tree is extremely simple and actually does better at avoiding false positives than the other models, at least with this training/test split, it is the one I'll keep. It is worth noting that the 0.0095% increase in accuracy over the naive classifier is very small, but better than nothing.
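One way to make the false-positive claim concrete is to look at the tuned tree's confusion matrix on the same test split (a quick check; rows are true classes, columns are predictions, assuming bad loans are coded 0 and good loans 1):
# The top-right cell (true bad loans predicted as good) is the costly mistake here.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, best_predictions))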
In [35]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(best_clf,X_train, y_train, cv=20,scoring=scorer, n_jobs=-1)
scores
Out[35]:
In [36]:
# Compare the mean cross-validated precision to the naive predictor's precision
difference = scores.mean() - prec
import math
from scipy import stats, special
# Treat the CV scores as a sample and compute a z-score for the improvement
zscore = difference / (scores.std())
# One-sided p-value: chance of seeing an improvement this large if there were none
p_values = 1 - special.ndtr(zscore)
print(zscore, p_values)
In [37]:
# Standard error of the mean cross-validation score
scores.std() / math.sqrt(len(scores))
Out[37]:
In [38]:
scores.std()*100
Out[38]:
In [39]:
import joblib
joblib.dump(clf, 'decisiontree.p')
Out[39]:
In [40]:
features.columns
Out[40]:
In [41]:
# from sklearn import tree
# import pydotplus
# dot_data = tree.export_graphviz(best_clf, out_file=None,feature_names=feat.columns,class_names=['Bad Loan','Fully Paid'])
# graph = pydotplus.graph_from_dot_data(dot_data)
# graph.write_pdf("LC_decisiontree.pdf")
In [42]:
best_clf.feature_importances_
Out[42]:
In [58]:
importances = best_clf.feature_importances_
# Plot
import importlib
importlib.reload(vs)
vs.feature_plot(importances, X_train, y_train)
In [44]:
best_clf.tree_.node_count
Out[44]:
In [45]:
# Sort feature importances in descending order and keep the top ten
indices = np.argsort(importances)[::-1]
columns = X_train.columns.values[indices[:10]]
values = importances[indices][:10]
np.cumsum(values)
values
columns
Out[45]:
In [ ]: