LendingClub is an online financial community that brings together creditworthy borrowers and savvy investors so that both can benefit financially. It replaces the high cost and complexity of bank lending with a faster, smarter way to borrow and invest. Since 2007, LendingClub has been bringing borrowers and investors together, transforming the way people access credit. Over the last 10 years, LendingClub has helped millions of people take control of their debt, grow their small businesses, and invest for the future.
LendingClub balances different investors on its platform. The mechanics of the platform allow LendingClub to meet the objectives of many different types of investors, including retail investors. Here’s how it works. Once loans are approved on the LendingClub platform, they are randomly allocated at a grade and term level either to a program designed for retail investors purchasing interests in fractions of loans (e.g. LendingClub Notes) or to a program intended for institutional investors. This helps ensure that investors have access to comparable-quality loans no matter which type of investor they are. LendingClub's goal is to meet incoming investor demand for interests in fractional loans as much as possible.
The design of the LendingClub platform emphasizes how important retail investors are to LendingClub: retail investors are a key component of its diverse marketplace strategy and are, and will always be, the heart of the LendingClub marketplace. This project, the “Investor ML” tool, is built specifically with retail investors in mind. Investor ML aims to predict the probability that a loan will be charged off or not fully paid, called the “Risk Rate %”. LendingClub can provide the “Risk Rate %” prediction from Investor ML for each approved loan as an additional indicator to help investors make investment decisions. Retail investors already use loan information and applicant information such as FICO score and loan grade to decide whether to invest in fractional loans. An additional statistic, “Risk Rate %”, learned from historical loans, can provide further information and help retail investors diversify their investments.
The datasets are provided by LendingClub. We will use lending data from 2007-2011 and try to classify and predict whether or not the borrower paid back their loan in full. The data is available to download here.
In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
import warnings
warnings.filterwarnings("ignore")
# Import supplementary visualization code visuals.py
import visuals as vs
# Pretty display for notebooks
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
# Load the LendingClub loans dataset
loans = pd.read_csv("LoanStats3a_securev1_2007_2011.csv", encoding='utf-8')
# Success - Display the first record
print ("Loans dataset has {} data points with {} variables each.".format(*loans.shape))
print('lending club data 2007 to 2011 {}: {}'.format(loans.shape, ', '.join(loans.columns)))
loans.head(n=4)
Out[1]:
In [2]:
loans.describe()
Out[2]:
Definition list
In [3]:
# Success
print ("Before removing NA")
print "Dataset has {} data points with {} variables each.".format(*loans.shape)
print(loans.isnull().sum())
As we can see, many features have all values as NaN. Let's drop any column that does not have at least 40% of its values populated.
In [4]:
# Drop a column unless at least 40% of its values are non-NA
loans = loans.dropna(thresh=0.4*len(loans), axis = 1)
print "Loans dataset has {} data points with {} variables each.".format(*loans.shape)
#loans.info()
Now we can analyse the remaining 60 columns, each of which has at least 40% of its records populated.
Assumption
For investors, only information about the loan applicant is available. The following 41 features are assumed to be available to investors.
Available features of loan applicant
Identifiers, not needed as features
Other features to analyze
In [5]:
identifiers = ['id', 'url']
investor_features = loans[['loan_amnt','term','int_rate','grade','sub_grade','emp_title','emp_length',\
'home_ownership','annual_inc','verification_status','loan_status',\
'desc','purpose','title','zip_code','addr_state','dti','delinq_2yrs',\
'earliest_cr_line','fico_range_low','fico_range_high','inq_last_6mths',\
'open_acc','pub_rec','revol_bal','revol_util','total_acc',\
'initial_list_status','out_prncp','out_prncp_inv','last_credit_pull_d','last_fico_range_high',\
'last_fico_range_low','collections_12_mths_ex_med','policy_code','application_type','acc_now_delinq',\
'delinq_amnt','pub_rec_bankruptcies','tax_liens','hardship_flag']]
#loans.info()
investor_features.shape
Out[5]:
We found an interesting tool, pandas_profiling, and will use it to profile and learn more about the data.
In [6]:
import pandas as pd
import pandas_profiling
import numpy as np
pandas_profiling.ProfileReport(investor_features)
Out[6]:
Based on the profiling above, let's work on the features.
Removals
In [7]:
unique_loans = investor_features.drop_duplicates()
unique_loans_features = unique_loans.drop(['application_type', 'collections_12_mths_ex_med','fico_range_high',\
'initial_list_status','hardship_flag','out_prncp','out_prncp_inv',\
'policy_code','tax_liens','title','desc','addr_state','earliest_cr_line',\
'emp_title','last_credit_pull_d', 'delinq_amnt', 'last_fico_range_high',\
'zip_code'], axis=1)
unique_loans_features.shape
Out[7]:
Conversion
In [8]:
unique_loans_features['revol_util'] = unique_loans_features['revol_util'].str.replace('%', '').astype(float)
unique_loans_features['int_rate'] = unique_loans_features['int_rate'].str.replace('%', '').astype(float)
# XGBoost can't handle a column name containing '<'
unique_loans_features['emp_length'] = unique_loans_features['emp_length'].str.replace('<', '')
Consider only rows where a loan_status value is present.
In [9]:
unique_loans_features = unique_loans_features[unique_loans_features['loan_status'].notnull()]
In [10]:
# Split the data into features and target label
loan_paid_status_raw = unique_loans_features['loan_status']
features_raw = unique_loans_features.drop('loan_status', axis = 1)
In [11]:
loans_status_count = unique_loans_features['loan_status'].value_counts().reset_index()
loans_status_count
Out[11]:
For our project we just need to know whether loans are charged off or not. We will convert "Fully Paid" and "Does not meet the credit policy. Status:Fully Paid" to 0, and "Charged Off" and "Does not meet the credit policy. Status:Charged Off" to 1.
In [12]:
# function to convert target label to numeric values
def convert_label(val):
#print val
if val=='Fully Paid':
return 0
elif val=='Charged Off':
return 1
elif val=='Does not meet the credit policy. Status:Fully Paid':
return 0
elif val=='Does not meet the credit policy. Status:Charged Off':
return 1
In [13]:
loan_paid_status = loan_paid_status_raw.apply(convert_label)
loan_paid_status.value_counts().reset_index()
Out[13]:
In [14]:
cnt_label = loan_paid_status.value_counts()
plt.figure(figsize=(8,4))
sns.barplot(cnt_label.index, cnt_label.values, alpha=0.8, color=color[1])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.title('Number of loans fully paid (0), charged off(1)', fontsize=15)
plt.show()
In [15]:
unique_loans_features['charged_off'] = unique_loans_features['loan_status'].apply(convert_label)
In [16]:
pos = unique_loans_features[unique_loans_features["charged_off"] == 1].shape[0]
neg = unique_loans_features[unique_loans_features["charged_off"] == 0].shape[0]
print "Positive examples = {}".format(pos)
print "Negative examples = {}".format(neg)
print "Proportion of positive to negative examples = {}".format((pos*1.0 / neg) * 100)
We see that the data set is pretty imbalanced, as expected: positive examples ("charged off") make up only ~18%. We will handle the imbalanced class issue later.
Let's do some basic analysis on the data.
Let's validate some opinions about the loans data.
Loans with good FICO scores should be paid back.
In [17]:
from matplotlib.pyplot import subplots, show
fig, ax = subplots(figsize=(12, 8))
unique_loans_features[unique_loans_features['charged_off'] == 0]['fico_range_low'].hist(alpha=0.5, color='red', bins=30, label='Fully Paid')
unique_loans_features[unique_loans_features['charged_off'] == 1]['fico_range_low'].hist(alpha=0.5, color='blue', bins=30, label='Charged Off')
plt.title("# Loans charged off or not & FICO score ", fontsize=15)
plt.legend()
#plt.xlabel('FICO')
plt.ylabel('Count')
ax.set_xlabel("FICO")
show()
What are the different purposes for which loans are being requested?
In [18]:
plt.figure(figsize=(12,6))
sns.countplot(x='purpose', data=unique_loans_features, hue='charged_off', palette='Set1')
plt.xticks(rotation='vertical')
plt.title("# of loans vs purpose", fontsize=15)
plt.show()
Let's see the trend between FICO score and interest rate.
In [19]:
sns.jointplot('fico_range_low', 'int_rate', data=unique_loans_features, color='purple')
Out[19]:
Selected categorical features, with intuitions for selection
Selected numerical features, with intuitions for selection
In [20]:
categorical_features = ['emp_length', 'grade', 'home_ownership', 'purpose',\
'sub_grade','term','verification_status']
numerical_features = ['acc_now_delinq', 'annual_inc', 'delinq_2yrs', 'int_rate', 'dti', 'fico_range_low'\
,'inq_last_6mths','last_fico_range_low','loan_amnt'\
,'open_acc','pub_rec','pub_rec_bankruptcies','revol_bal','revol_util','total_acc']
In [21]:
print "Dataset has {} data points with {} numerical variables each.".format(*unique_loans_features[numerical_features].shape)
print(unique_loans_features[numerical_features].isnull().sum())
All of the above features are sensitive information related to an individual's finances, so imputing with mean or median strategies might not be right; it is better to impute 0 for NA values than to use mean, median, or mode. Let's impute missing values as 0.
In [22]:
unique_loans_features[numerical_features] = unique_loans_features[numerical_features].fillna(0)
In [23]:
print "Dataset has {} data points with {} numerical variables each.".format(*unique_loans_features[categorical_features].shape)
print(unique_loans_features[categorical_features].isnull().sum())
No missing values in categorical features
In [24]:
from sklearn.preprocessing import StandardScaler
# Initialize a scaler, then apply it to the features
scaler = StandardScaler()
features_raw[numerical_features] = scaler.fit_transform(unique_loans_features[numerical_features])
# Show an example of a record with scaling applied
display(features_raw.head(n = 10))
#print(features_raw.isnull().sum())
In [25]:
features= pd.concat([pd.get_dummies(features_raw[categorical_features], prefix_sep='_'),\
features_raw[numerical_features]], axis=1)
In [26]:
# Print the number of features after one-hot encoding
encoded = list(features.columns)
print "{} total features after one-hot encoding.".format(len(encoded))
print encoded
display(features.head(n = 1))
Now all categorical variables have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.
In [50]:
# Import train_test_split
from sklearn.model_selection import train_test_split
# Split the 'features' and 'loan_paid_status' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, loan_paid_status, test_size = 0.2, random_state = 0,\
stratify= loan_paid_status)
# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])
display(X_test.head(n = 1))
display(y_test.head(n = 1))
Classification problems in most real-world applications have imbalanced data sets. In other words, the positive examples (minority class) are far fewer than the negative examples (majority class). We see this in spam detection, ad clicks, loan approvals, etc. In our example, the positive examples (borrowers who charged off) are only ~18% of the total examples. Therefore, accuracy is no longer a good measure of performance, because if we simply predict all examples as the negative class we achieve roughly 82% accuracy. Better metrics for imbalanced data sets are AUC (area under the ROC curve) and F1-score. However, that's not enough, because class imbalance influences the learning algorithm during training: the decision rule becomes biased towards the majority class, since the model implicitly optimizes its predictions for the majority class in the dataset. As a result, we'll explore different methods to overcome the class imbalance problem.
Under-sample: Under-sample the majority class, with or without replacement, to make the number of positive and negative examples equal. One drawback of under-sampling is that it discards a good portion of the training data that contains valuable information; in our example, it would lose around ~27,000+ examples. However, it's very fast to train.
Over-sample: Over-sample the minority class, with or without replacement, to make the number of positive and negative examples equal. We'd add around ~27,000+ samples to the training data set with this strategy. It's a lot more computationally expensive than under-sampling, and it's more prone to overfitting due to repeated examples.
EasyEnsemble: Sample several subsets from the majority class, build a classifier on top of each sampled subset, and combine the outputs of all classifiers. More details can be found here.
Synthetic Minority Oversampling Technique (SMOTE): Over-samples the minority class using synthesized examples. It operates on the feature space, not the data space.
In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline as imb_make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedBaggingClassifier
# Build random forest classifier (same config)
rf_clf = RandomForestClassifier(criterion="entropy", verbose=False, class_weight="balanced",random_state=10)
# Build model with no sampling
pip_orig = make_pipeline(rf_clf)
scores = cross_val_score(pip_orig, X_train, y_train, scoring="roc_auc", cv=10)
print("Original model's average AUC: {}".format(scores.mean()))
# Build model with undersampling
pip_undersample = imb_make_pipeline(RandomUnderSampler(), rf_clf)
scores = cross_val_score(pip_undersample, X_train, y_train, scoring="roc_auc", cv=10)
print("Under-sampled model's average AUC: {}".format(scores.mean()))
# Build model with oversampling
pip_oversample = imb_make_pipeline(RandomOverSampler(), rf_clf)
scores = cross_val_score(pip_oversample, X_train, y_train, scoring="roc_auc", cv=10)
print("Over-sampled model's average AUC: {}".format(scores.mean()))
# Build model with EasyEnsemble
resampled_rf = BalancedBaggingClassifier(base_estimator=rf_clf, n_estimators=10, random_state=10)
pip_resampled = make_pipeline(resampled_rf)
scores = cross_val_score(pip_resampled, X_train, y_train, scoring="roc_auc", cv=10)
print("EasyEnsemble model's average AUC: {}".format(scores.mean()))
# Build model with SMOTE
pip_smote = imb_make_pipeline(SMOTE(), rf_clf)
scores = cross_val_score(pip_smote, X_train, y_train, scoring="roc_auc", cv=10)
print("SMOTE model's average AUC: {}".format(scores.mean()))
The EasyEnsemble method has the highest 10-fold CV average AUC = 0.872006664839, so we use the EasyEnsemble technique to handle the imbalanced class problem.
Investor ML is particularly interested in predicting who will fully pay back the loan. It might seem that using accuracy as a metric for evaluating a model's performance would be appropriate. However, recommending someone who does not fully pay back the loan would be detrimental to Investor ML, since investors are looking to invest in individuals who will pay back the loan. Therefore, a model's ability to precisely predict those who fully pay back is more important than its ability to recall those individuals. We can use the F-beta score as a metric that considers both precision and recall: $$ F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\left( \beta^2 \cdot precision \right) + recall} $$ In particular, when $\beta = 0.5$, more emphasis is placed on precision. This is called the F$_{0.5}$ score (or F-score for simplicity). Looking at the distribution of classes (those who charged off and those who fully paid), it's clear that most individuals fully pay back their loans. This can greatly inflate accuracy, since we could simply say "this person will pay back the loan" and generally be right, without ever looking at the data! Making such a statement would be naive, since we have not considered any information to substantiate the claim. Also, since we are trying to predict a "Risk Rate %", we cannot simply declare that a loan gets charged off. To define a naive predictor, we will use a Gaussian NB model to help establish a benchmark for whether a model is performing well.
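As a quick sanity check on the F-beta formula, the numbers below are purely hypothetical (precision = 0.8, recall = 0.5, not results from this data) and only illustrate how beta = 0.5 pulls the score towards precision:
# Hypothetical precision/recall values, only to illustrate the F-beta formula above
precision, recall = 0.8, 0.5
beta = 0.5
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print("F0.5 = {:.3f}".format(f_beta))  # ~0.714, closer to precision than the F1 score of ~0.615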
In [29]:
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
from sklearn.naive_bayes import GaussianNB
cols = ['model','matthews_corrcoef', 'roc_auc_score', 'precision_score', 'recall_score','f1_score','accuracy',\
'fscore']
models_report = pd.DataFrame(columns = cols)
# Use a Gaussian Naive Bayes model as the naive benchmark predictor
clf_GNB = GaussianNB()
clf_GNB.fit(X_train,y_train)
naive_prediction = clf_GNB.predict(X_test)
y_score = clf_GNB.predict_proba(X_test)[:,1]
# Accuracy, F-score using sklearn
accuracy = accuracy_score(y_test, naive_prediction)
fscore = fbeta_score(y_test, naive_prediction, beta = 0.5, pos_label=1)
tmp = pd.Series({'model': 'Naive Predictor',\
'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, naive_prediction),\
'precision_score': metrics.precision_score(y_test, naive_prediction),\
'recall_score': metrics.recall_score(y_test, naive_prediction),\
'f1_score': metrics.f1_score(y_test, naive_prediction),\
'accuracy': accuracy_score(y_test, naive_prediction),\
'fscore' : fbeta_score(y_test, naive_prediction, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
# Print the results
print "Naive Predictor per sklearn: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore)
models_report
Out[29]:
In [30]:
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(classifier_name, learner, sample_size, X_train, y_train, X_test, y_test, models_report, meta_learner=None):
'''
inputs:
- learner: the learning algorithm to be trained and predicted on
- sample_size: the size of samples (number) to be drawn from training set
- X_train: features training set
- y_train: income training set
- X_test: features testing set
- y_test: income testing set
'''
results = {}
# Fit the learner to the training data using slicing with 'sample_size'
start = time() # Get start time
if meta_learner and classifier_name=='stackedEnsembleClassifier':
# Use CV to generate meta-features
meta_features = cross_val_predict(learner, X_train[:sample_size],y_train[:sample_size], cv=10, method="transform")
# Refit the first stack on the full training set
learner.fit(X_train[:sample_size],y_train[:sample_size])
# Fit the meta learner
second_stack = meta_learner.fit(meta_features, y_train[:sample_size])
else:
learner.fit(X_train[:sample_size],y_train[:sample_size])
end = time() # Get end time
# Calculate the training time
results['train_time'] = end-start
# Get the predictions on the test set,
# then get predictions on the first 300 training samples
start = time() # Get start time
if meta_learner and classifier_name=='stackedEnsembleClassifier':
predictions_test = second_stack.predict(learner.transform(X_test))
predictions_train = second_stack.predict(learner.transform(X_train[:300]))
else:
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train[:300])
end = time() # Get end time
# Calculate the total prediction time
results['pred_time'] = end-start
# Compute accuracy on the first 300 training samples
results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
# Compute accuracy on test set
results['acc_test'] = accuracy_score(y_test, predictions_test)
# Compute F-score on the the first 300 training samples
results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta = 0.5)
# Compute F-score on the test set
results['f_test'] = fbeta_score(y_test, predictions_test, beta = 0.5)
if meta_learner and classifier_name=='stackedEnsembleClassifier':
results['meta-learner'] = second_stack
results['learner'] = learner
else:
results['learner'] = learner
# Success
print "{} trained on {} samples.".format(classifier_name, sample_size)
if classifier_name=='stackedEnsembleClassifier':
y_score = second_stack.predict_proba(learner.transform(X_test))[:, 1:]
else:
y_score = learner.predict_proba(X_test)[:,1]
tmp = pd.Series({'model': classifier_name+str(sample_size),\
'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, predictions_test),\
'precision_score': metrics.precision_score(y_test, predictions_test),\
'recall_score': metrics.recall_score(y_test, predictions_test),\
'f1_score': metrics.f1_score(y_test, predictions_test),\
'accuracy': accuracy_score(y_test, predictions_test),\
'fscore' : fbeta_score(y_test, predictions_test, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
# Return the results
return results, models_report
Let's evaluate using the following models:
In [31]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# Initialize the four models
clf_A = LogisticRegression(random_state=10)
# Build model with EasyEnsemble
resampled_LR_A = BalancedBaggingClassifier(base_estimator=clf_A, n_estimators=10, random_state=10)
clf_B = GradientBoostingClassifier(random_state=10)
resampled_GBC_B = BalancedBaggingClassifier(base_estimator=clf_B, n_estimators=10, random_state=10)
clf_C = XGBClassifier(objective='binary:logistic', random_state=10)
resampled_XGB_C = BalancedBaggingClassifier(base_estimator=clf_C, n_estimators=10, random_state=10)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = (X_train.shape[0]*1/100)
samples_10 = (X_train.shape[0]*10/100)
samples_100 = (X_train.shape[0]*100/100)
# Collect results on the learners
classifiers = {'LogisticRegression':resampled_LR_A, 'GradientBoostingClassifier':resampled_GBC_B,\
'XGBClassifier':resampled_XGB_C}
#[resampled_LR_A, resampled_GBC_B, resampled_XGB_C, resampled_RFC_D]
results = {}
for clf_name, clf in classifiers.items():
#clf_name = clf.__class__.__name__
results[clf_name] = {}
for i, samples in enumerate([samples_1, samples_10, samples_100]):
results[clf_name][i], models_report = \
train_predict(clf_name, clf, samples, X_train, y_train, X_test, y_test, models_report)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results, accuracy, fscore)
models_report
Out[31]:
We can see from the charts and data above that, with EasyEnsemble sampling, LogisticRegression performs better than XGBClassifier and GradientBoostingClassifier when we consider running time on the 100% training size, variance (train and test accuracy scores are close), and F-score.
Let's perform a grid search optimization for the model over the entire training set (X_train and y_train).
Tuning for Logistic Regression: solver was tuned over ['newton-cg', 'lbfgs', 'liblinear', 'sag'] (best found: 'sag'), 'C' over [0.1, 1.0, 1.5] (best found: 1.5), and n_estimators over range(20,81,10) (tuned to 10). The grid below is restricted to the best values found in those earlier runs.
In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# Initialize the classifier
clf = LogisticRegression(random_state=10)
tune_clf = BalancedBaggingClassifier(base_estimator=clf, random_state=10)
# Create the parameters list to tune
parameters = {'base_estimator__fit_intercept': [True],\
'base_estimator__C': [1.5],\
'base_estimator__penalty':['l1'],\
'n_estimators':[10]}
# Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score, beta=0.5)
# Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(tune_clf, parameters, verbose=True, cv=10, scoring=scorer)
# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)
# Get the estimator
best_clf = grid_fit.best_estimator_
# Build model with EasyEnsemble
#best_clf= BalancedBaggingClassifier(base_estimator=best_model, n_estimators=10, random_state=10)
# Make predictions using the unoptimized and model
predictions = (tune_clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
In [33]:
y_best_score = best_clf.predict_proba(X_test)[:,1]
tmp = pd.Series({'model': 'GridSearchTunedLR',\
'roc_auc_score' : metrics.roc_auc_score(y_test, y_best_score),\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, best_predictions),\
'precision_score': metrics.precision_score(y_test, best_predictions),\
'recall_score': metrics.recall_score(y_test, best_predictions),\
'f1_score': metrics.f1_score(y_test, best_predictions),\
'accuracy': accuracy_score(y_test, best_predictions),\
'fscore' : fbeta_score(y_test, best_predictions, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
# Report the before-and-afterscores
print "Unoptimized model\n------"
print "Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5))
print "\nOptimized Model\n------"
print "Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions))
print "Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5))
# show best parameters
print "\nBest Classifier\n------"
print best_clf
models_report
Out[33]:
An important task when performing supervised learning on a dataset is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do. In the case of this project, that means we wish to identify a small number of features that most strongly predict whether an individual will fully pay back the loan.
In [34]:
# Import supplementary visualization code visuals.py
import visuals as vs
import matplotlib.pyplot as pl
# Pretty display for notebooks
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
# Train the supervised model on the training set
model = RandomForestClassifier(random_state=10)
model.fit(X_train, y_train)
# Extract the feature importances
importances = model.feature_importances_
# Plot
vs.feature_plot( importances, X_train, y_train)
# show most importance features
a = np.array(importances)
factors = pd.DataFrame(data = np.array([importances.astype(float), features.columns]).T,
columns = ['importances', 'features'])
factors = factors.sort_values('importances', ascending=False)
print "\n top 20 important features"
display(factors[:20])
In [35]:
# Import functionality for cloning a model
from sklearn.base import clone
# Reduce the feature space
X_train_reduced_20 = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:20]]]
X_test_reduced_20 = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:20]]]
# Train on the "best" model found from grid search earlier
start = time()
full_clf = (clone(best_clf)).fit(X_train, y_train)
end = time()
train_time_full = end - start
start = time()
reduced_20_clf = (clone(best_clf)).fit(X_train_reduced_20, y_train)
end = time()
train_time_reduced_20 = end - start
# Make new predictions
full_predictions = full_clf.predict(X_test)
reduced_20_predictions = reduced_20_clf.predict(X_test_reduced_20)
# Report scores from the final model using both versions of data
print "Final Model trained on full data\n------"
print "Final Model Train time {} s full data\n------".format(train_time_full)
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, full_predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, full_predictions, beta = 0.5))
print "\nFinal Model trained on reduced to 81 features data\n------"
print "Final Model Train time {} s full data\n------".format(train_time_reduced_20)
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_20_predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_20_predictions, beta = 0.5))
Feature selection using only the top 20 features wasn't prudent in our case, so we will use the tuned best classifier with the full feature set instead of the top 20 features.
We’ll build ensemble models using four different models as base learners:
The ensemble models will be built using two different methods:
Blending (average) ensemble model: Fit the base learners to the training data and then, at test time, average the predictions generated by all the base learners. We use VotingClassifier from sklearn, which fits all the base learners on the training data and, at test time, uses all base learners to predict the test data and then takes the average of all predictions.
Stacked ensemble model: Fit the base learners to the training data. Next, use those trained base learners to generate predictions (meta-features) used by the meta-learner (assuming we have only one layer of base learners).
There are a few different ways of training a stacked ensemble model:
Fit the base learners to all of the training data and then generate predictions using the same training data used to fit those learners. This method is more prone to overfitting because the meta-learner will give more weight to the base learner that memorized the training data best, i.e. the meta-learner won't generalize well and would overfit.
Split the training data into 2 to 3 different parts used for training, validation, and generating predictions. It's a suboptimal method because held-out sets usually have higher variance, different splits give different results, and the learning algorithms have less data to train on.
Use k-fold cross validation, where we split the data into k folds. We fit the base learners to the (k - 1) folds and use the fitted models to generate predictions for the held-out fold. We repeat the process until we have generated predictions for all k folds. When done, we refit the base learners to the full training data. This method is more reliable and gives less weight to models that memorize the data, so it generalizes better on future data.
We’ll use logistic regression as the meta-learner for the stacked model. Note that we can use k-folds cross validation to validate and tune the hyperparameters of the meta learner. We will not tune the hyperparameters of any of the base learners or the meta-learner; however, we will use some of the values recommended by the Data-driven advice for applying machine learning to bioinformatics problems and https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/.
In [36]:
from sklearn.ensemble import VotingClassifier
# Define base learners
xgb_clf = XGBClassifier(objective="binary:logistic", learning_rate=0.03, n_estimators=500,\
max_depth=1, subsample=0.4, random_state=10)
balanced_xgb_clf = BalancedBaggingClassifier(base_estimator=xgb_clf, n_estimators=10, random_state=10)
lr_clf = LogisticRegression(C=1.5, fit_intercept=True, penalty='l1', random_state=10)
balanced_lr_clf = BalancedBaggingClassifier(base_estimator=lr_clf, n_estimators=10, random_state=10)
gb_clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, max_depth=3, max_features='log2', \
n_estimators=500, random_state=10)
balanced_gb_clf = BalancedBaggingClassifier(base_estimator=gb_clf, n_estimators=10, random_state=10)
rf_clf = RandomForestClassifier(n_estimators=300, max_features="sqrt", criterion="gini", min_samples_leaf=5,\
class_weight="balanced", random_state=10)
balanced_rf_clf = BalancedBaggingClassifier(base_estimator=rf_clf, n_estimators=10, random_state=10)
# Define meta-learner
logreg_clf = LogisticRegression(penalty="l2", C=100, fit_intercept=True)
# Fitting voting clf --> average ensemble
voting_clf = VotingClassifier([("xgb", balanced_xgb_clf), ("lr", balanced_lr_clf), ("gbc", gb_clf),\
("rf", rf_clf)], voting="soft", flatten_transform=True)
voting_clf.fit(X_train, y_train)
xgb_model, lr_model, gbc_model, rf_model = voting_clf.estimators_
models = {"xgb": xgb_model, "lr": lr_model, "gbc": gbc_model, "rf": rf_model, "avg_ensemble": voting_clf}
In [37]:
from sklearn.model_selection import cross_val_predict
# Build first stack of base learners
first_stack = make_pipeline(voting_clf)
# Use CV to generate meta-features
meta_features = cross_val_predict(first_stack, X_train, y_train, cv=10, method="transform")
# Refit the first stack on the full training set
first_stack.fit(X_train, y_train)
# Fit the meta learner
second_stack = logreg_clf.fit(meta_features, y_train)
In [38]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
# Plot ROC and PR curves using all models and test data
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for name, model in models.items():
model_probs = model.predict_proba(X_test)[:, 1:]
model_auc_score = roc_auc_score(y_test, model_probs)
fpr, tpr, _ = roc_curve(y_test, model_probs)
precision, recall, _ = precision_recall_curve(y_test, model_probs)
axes[0].plot(fpr, tpr, label="{}, auc = {}".format(name, model_auc_score))
axes[1].plot(recall, precision, label="{}".format(name))
stacked_probs = second_stack.predict_proba(first_stack.transform(X_test))[:, 1:]
stacked_auc_score = roc_auc_score(y_test, stacked_probs)
fpr, tpr, _ = roc_curve(y_test, stacked_probs)
precision, recall, _ = precision_recall_curve(y_test, stacked_probs)
axes[0].plot(fpr, tpr, label="stacked_ensemble, auc = {}".format(stacked_auc_score))
axes[1].plot(recall, precision, label="stacked_ensembe")
axes[0].legend(loc="lower right")
axes[0].set_xlabel("FPR")
axes[0].set_ylabel("TPR")
axes[0].set_title("ROC curve")
axes[1].legend()
axes[1].set_xlabel("recall")
axes[1].set_ylabel("precision")
axes[1].set_title("PR curve")
plt.tight_layout()
As we can see from the chart above, the stacked ensemble model has improved the performance.
In [39]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#PERFORMANCE METRICS FOR TEST SET
print("Final model test METRICS")
y_test_pred = second_stack.predict(first_stack.transform(X_test))
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, y_test_pred))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, y_test_pred, beta = 0.5))
accuracy = accuracy_score(y_test, y_test_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
recall = recall_score(y_test, y_test_pred)
print("Recall: %.2f%%" % (recall * 100.0))
precision = precision_score(y_test, y_test_pred)
print("Precision: %.2f%%" % (precision * 100.0))
f1 = f1_score(y_test, y_test_pred, pos_label=1)
print("F1: %.2f%%" % (f1 * 100.0))
cfm = confusion_matrix(y_test, y_test_pred)
print (cfm)
print(classification_report(y_test, y_test_pred))
tmp = pd.Series({'model': 'Stacked Ensemble Final Model',\
'roc_auc_score' : stacked_auc_score,\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, y_test_pred),\
'precision_score': metrics.precision_score(y_test, y_test_pred),\
'recall_score': metrics.recall_score(y_test, y_test_pred),\
'f1_score': metrics.f1_score(y_test, y_test_pred),\
'accuracy': accuracy_score(y_test, y_test_pred),\
'fscore' : fbeta_score(y_test, y_test_pred, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
models_report
Out[39]:
Although the stacked ensemble model has better accuracy and F-score, in classification problems where false negatives are a lot more expensive than false positives we may want a model with high precision rather than high recall, i.e. a high proportion of the examples the model identifies as positive that truly are positive. The confusion matrix printed above shows this breakdown.
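If precision on the "charged off" class matters most, one option, shown here only as a sketch on top of the stacked model's probabilities already computed above (the 50% target precision is an arbitrary, hypothetical choice), is to move the decision threshold instead of using the default 0.5 cut-off:
from sklearn.metrics import precision_recall_curve
# stacked_probs holds the stacked model's predicted charge-off probabilities for X_test (computed above)
probs = stacked_probs.ravel()
precisions, recalls, thresholds = precision_recall_curve(y_test, probs)
target_precision = 0.50  # hypothetical target, purely for illustration
# Index of the first threshold whose precision reaches the target (falls back to 0 if none does)
idx = np.argmax(precisions[:-1] >= target_precision)
chosen_threshold = thresholds[idx]
y_pred_high_precision = (probs >= chosen_threshold).astype(int)
print("threshold = {:.3f}, precision = {:.3f}, recall = {:.3f}".format(chosen_threshold, precisions[idx], recalls[idx]))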
Let's check the best base classifier, the tuned base classifier, and the final stacked model. We compare their results across different sets of inputs to validate robustness and obtain their outputs. We also plot the ROC curves to perform a sensitivity analysis.
In [40]:
bestBaseClassifier = clone(tune_clf)
tunedClassifier = clone(full_clf)
stackedEnsembleClassifier = first_stack
# Define meta-learner
meta_learner = LogisticRegression(penalty="l2", C=100, fit_intercept=True)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = (X_train.shape[0]*1/100)
samples_10 = (X_train.shape[0]*10/100)
samples_100 = (X_train.shape[0]*100/100)
# Collect results on the learners
classifiers2 = {'bestBaseClassifier':bestBaseClassifier, 'tunedClassifier':tunedClassifier,\
'stackedEnsembleClassifier':stackedEnsembleClassifier}
#[resampled_LR_A, resampled_GBC_B, resampled_XGB_C, resampled_RFC_D]
results2 = {}
for clf_name, clf in classifiers2.items():
#clf_name = clf.__class__.__name__
results2[clf_name] = {}
for i, samples in enumerate([samples_1, samples_10, samples_100]):
results2[clf_name][i], models_report = \
train_predict(clf_name, clf, samples, X_train, y_train, X_test, y_test, models_report, meta_learner)
In [41]:
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results2, accuracy, fscore)
models_report
Out[41]:
From the chart above we can observe that, although the stackedEnsembleClassifier has a higher time cost for training and prediction, it is better in terms of variance and has better accuracy and F-score. We can also see that the model is robust across varied inputs.
In [42]:
# Plot ROC and PR curves using all models and test data
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
final_first_stack = None
final_second_stack= None
for k, v in results2.items():
for k2, v2 in v.items():
if k2==2:
if k=='stackedEnsembleClassifier':
final_first_stack = v2['learner']
final_second_stack = v2['meta-learner']
stacked_probs = final_second_stack.predict_proba(final_first_stack.transform(X_test))[:, 1:]
stacked_auc_score = roc_auc_score(y_test, stacked_probs)
fpr, tpr, _ = roc_curve(y_test, stacked_probs)
precision, recall, _ = precision_recall_curve(y_test, stacked_probs)
axes[0].plot(fpr, tpr, label="{}, auc = {}".format(k, stacked_auc_score))
axes[1].plot(recall, precision, label="{}".format(k))
else:
model_probs = v2['learner'].predict_proba(X_test)[:, 1:]
model_auc_score = roc_auc_score(y_test, model_probs)
fpr, tpr, _ = roc_curve(y_test, model_probs)
precision, recall, _ = precision_recall_curve(y_test, model_probs)
axes[0].plot(fpr, tpr, label="{}, auc = {}".format(k, model_auc_score))
axes[1].plot(recall, precision, label="{}".format(k))
axes[0].legend(loc="lower right")
axes[0].set_xlabel("FPR")
axes[0].set_ylabel("TPR")
axes[0].set_title("ROC curve")
axes[1].legend()
axes[1].set_xlabel("recall")
axes[1].set_ylabel("precision")
axes[1].set_title("PR curve")
plt.tight_layout()
The ROC plot above shows the sensitivity analysis: the stackedEnsembleClassifier, with AUC 0.88168, is better than the bestBaseClassifier and the tunedClassifier.
Optimized model's accuracy and F-score on the testing data.
In [43]:
models_report.iloc[[0, 6, 10, 11]].T
Out[43]:
Partial dependence plots show the most important features and their relationships with whether the borrower will most likely pay the loan in full before the maturity date. We plot only the top 12 features to keep the plots easy to read. Note that the partial dependence plots are based on the Gradient Boosting model.
In [44]:
# Plot partial dependence plots
gb_clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, max_depth=3, max_features='log2', \
n_estimators=500, random_state=10)
gb_clf.fit(X_train, y_train)
Out[44]:
In [45]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
fig, axes = plot_partial_dependence(gb_clf, X_train, np.argsort(gb_clf.feature_importances_)[::-1][:12],\
n_cols=4, feature_names=features.columns[:], figsize=(14, 8))
plt.subplots_adjust(top=0.9)
plt.suptitle("Partial dependence plots of borrower charged off\n" "the loan based on top most influential features")
for ax in axes:
ax.set_xticks(())
As expected, borrowers with lower annual income and lower FICO scores are highly likely to be charged off, while borrowers with lower interest rates and higher revol_bal are more likely to pay the loan in full.
In [47]:
# Use CV to generate meta-features
meta_features = cross_val_predict(final_first_stack, features, loan_paid_status, cv=10, method="transform")
# Refit the first stack on the full training set
final_first_stack.fit(features, loan_paid_status)
# Fit the meta learner
final_second_stack_meta = final_second_stack.fit(meta_features, loan_paid_status)
In [48]:
from sklearn.externals import joblib
filename = 'finalized_first_stack_model.sav'
filename2 = 'finalized_second_stack_model.sav'
joblib.dump(final_first_stack, filename)
joblib.dump(final_second_stack_meta, filename2)
first_stack
Out[48]:
Test the saved model. The final model's results are on par with our best model; the performance metrics increased because we refit on all of the available data (including the test split) before saving.
In [49]:
# load the model from disk
loaded_first_stack_model = joblib.load(filename)
loaded_second_stack_model = joblib.load(filename2)
Y_hat_imp_f = loaded_second_stack_model.predict(loaded_first_stack_model.transform(X_test))
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, Y_hat_imp_f))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, Y_hat_imp_f, beta = 0.5))
Most classification problems in the real world are imbalanced, and data sets almost always have missing values. In this project, we covered strategies to deal with both missing values and imbalanced data sets. We also explored different ways of building ensembles, which can give better performance. Below are some interesting learnings.
Sometimes we may be willing to give up some improvement to the model if it would increase the complexity much more than it improves the evaluation metrics.
EasyEnsemble usually performs better than the other resampling methods.
Following are some of the additional improvements that could be made: