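We go step by step through assessing whether our model works. We compare two sets of values for loans issued in 2009-2011: the expected rate of return that the model predicts from loan features, and the actual rate of return calculated from the loans' payment details.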
The model is trained with features from loans issued in 2012-2014, i.e. loans that have not yet matured. The steps detailed here are identical to the validation process when running python test.py compare.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from transfers.fileio import load_from_pickle
from helpers.preprocessing import process_payment, process_features
from model.model import StatusModel
from model.validate import actual_IRR
We'll be training our model on loans issued in 2012-2014, with the loan features as inputs and the expected recovery, based on the status of the loan, as our target. The model is actually a collection of 4 x 36 = 144 Random Forest sub-models, i.e. one for each pair of grade and payment month.
In [2]:
# Load data, then pre-process
print("Loading data...")
# header=1: column names are on the second row (newer pandas rejects header=True)
df_3c = pd.read_csv('data/LoanStats3c_securev1.csv', header=1).iloc[:-2, :]
df_3b = pd.read_csv('data/LoanStats3b_securev1.csv', header=1).iloc[:-2, :]
df_raw = pd.concat((df_3c, df_3b), axis=0)
# Pre-process data
print("Pre-processing data...")
df = process_features(df_raw)
# Train models for every grade for every month
print("Training models...")
model = StatusModel(model=RandomForestRegressor, parameters={'n_estimators': 100, 'max_depth': 10})
model.train_model(df)
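As a rough illustration of the 4 x 36 layout, the sketch below keeps one estimator per (grade, month) pair in a dictionary and routes each loan to its own sub-model. It is not the actual StatusModel implementation: MeanRegressor, train_submodels and the (grade, month, features, target) row layout are hypothetical names chosen for this example, and the stand-in estimator simply predicts the mean target so that the sketch runs without scikit-learn.

```python
class MeanRegressor(object):
    """Stand-in estimator with a scikit-learn style fit/predict interface.

    It predicts the mean of the training targets; the notebook passes
    RandomForestRegressor instead.
    """

    def fit(self, X, y):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


GRADES = ['A', 'B', 'C', 'D']   # 4 grades
MONTHS = range(1, 37)           # 36 payment months


def train_submodels(rows, make_model=MeanRegressor):
    """Fit one model per (grade, month) pair.

    `rows` is an iterable of (grade, month, features, target) tuples,
    a hypothetical layout assumed for this sketch.
    """
    grouped = {}
    for grade, month, features, target in rows:
        X, y = grouped.setdefault((grade, month), ([], []))
        X.append(features)
        y.append(target)
    return {key: make_model().fit(X, y) for key, (X, y) in grouped.items()}


# Toy data: two loans for every (grade, month) pair, target = expected recovery.
rows = [(g, m, [i], 1.0 - 0.01 * i)
        for g in GRADES for m in MONTHS for i in range(2)]
submodels = train_submodels(rows)
n_submodels = len(submodels)                       # one per (grade, month) pair
prediction = submodels[('B', 12)].predict([[0]])   # route a loan to its sub-model
```

Keying the dictionary by (grade, month) keeps each sub-model small and lets any estimator with fit and predict methods be swapped in through make_model.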
First we use our trained model to calculate the expected rate of return of the matured loans issued in 2009-2011. The input is the features of each loan, and the model outputs a single rate for each loan.
Because we are calculating a rate of return for 40,000 loans, and generating three 36-month cash flows for each loan, this step might take a little longer.
In [3]:
# Load data, then pre-process
print("Loading data...")
# header=1: column names are on the second row (newer pandas rejects header=True)
df_3a = pd.read_csv('data/LoanStats3a_securev1.csv', header=1).iloc[:-2, :]
df_raw = df_3a.copy()
# Pre-process data
print("Pre-processing data...")
df_features = process_features(df_raw, True, False)
# Calculate the expected rate of return for loans that have already matured
print("Calculating rate of return...")
int_rate_dict = {'A1':0.0603, 'A2':0.0649, 'A3':0.0699, 'A4':0.0749, 'A5':0.0819,
'B1':0.0867, 'B2':0.0949, 'B3':0.1049, 'B4':0.1144, 'B5':0.1199,
'C1':0.1239, 'C2':0.1299, 'C3':0.1366, 'C4':0.1431, 'C5':0.1499,
'D1':0.1559, 'D2':0.1599, 'D3':0.1649, 'D4':0.1714, 'D5':0.1786}
rate_predict = model.expected_IRR(df_features, False, int_rate_dict)
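The notebook does not spell out the formula behind the rate of return, so the following is only the standard definition of an internal rate of return, written in assumed notation: P is the loan principal, \hat{c}_t is the expected cash flow in month t, and r is the monthly rate. Annualizing by compounding is one common convention and may differ from what expected_IRR actually does.

```latex
P = \sum_{t=1}^{36} \frac{\hat{c}_t}{(1 + r)^{t}},
\qquad
r_{\text{annual}} = (1 + r)^{12} - 1
```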
Next we generate the actual rate of return, our 'ground truth'. The input is the payment details of matured loans issued between 2009 and 2011, e.g. principal paid, interest paid, late fees, and the recovered value if the loan defaulted. The output is a single actual rate of return for each loan.
In [4]:
# Pre-process data
print("Pre-processing data...")
df_payment = process_payment(df_raw)
# Calculating actual rate of return for loans already matured with Dec 2014 int_rate
print("Calculating rate of return...")
# Replace int_rate with values set in Dec 2015 by sub_grade
int_rate_dict = {'A1':0.0603, 'A2':0.0649, 'A3':0.0699, 'A4':0.0749, 'A5':0.0819,
'B1':0.0867, 'B2':0.0949, 'B3':0.1049, 'B4':0.1144, 'B5':0.1199,
'C1':0.1239, 'C2':0.1299, 'C3':0.1366, 'C4':0.1431, 'C5':0.1499,
'D1':0.1559, 'D2':0.1599, 'D3':0.1649, 'D4':0.1714, 'D5':0.1786}
rate_true = actual_IRR(df_payment, False, int_rate_dict)
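To make the idea of an actual rate of return concrete, here is a minimal sketch that solves for the monthly rate equating a loan's principal with the present value of the payments actually received. It is an assumption-laden illustration rather than the code in model.validate: monthly_irr, the bisection bounds and the toy payment streams are all hypothetical.

```python
def monthly_irr(principal, payments, lo=-0.99, hi=1.0, tol=1e-10):
    """Solve principal = sum(payment_t / (1 + r)**t) for r by bisection.

    Assumes non-negative payments with at least one positive value, so the
    present value falls as r rises and the root is unique within [lo, hi].
    """
    def npv(rate):
        pv = sum(p / (1.0 + rate) ** t for t, p in enumerate(payments, start=1))
        return pv - principal

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if npv(mid) > 0:
            lo = mid   # present value still too high: the rate must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy 36-month loan of 1000 at 1% per month (illustrative numbers only).
principal = 1000.0
rate = 0.01
installment = principal * rate / (1.0 - (1.0 + rate) ** -36)

# Fully paid loan: all 36 installments are received.
irr_paid = monthly_irr(principal, [installment] * 36)

# Defaulted loan: 12 installments, then a single recovery payment of 100.
irr_defaulted = monthly_irr(principal, [installment] * 12 + [100.0])
```

A fully paid loan earns back its contractual monthly rate, while the defaulted loan returns less than its principal and so has a negative rate, which is why averaging over both kinds of loan pulls the actual rate below the headline rate.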
We collect the results into a single dataframe, then review the overall MSE and the average rates by sub-grade.
In [5]:
# Copy the slice so the new columns are added to an independent dataframe
df_select = df_features[['sub_grade', 'int_rate']].copy()
df_select['rate_true'] = rate_true
df_select['rate_predict'] = rate_predict
rate_true = np.array(rate_true)
rate_predict = np.array(rate_predict)
# Mean squared error over all loans
print("Average MSE: {}".format(np.mean((rate_true - rate_predict) ** 2)))
In [6]:
df_compare = df_select.groupby('sub_grade').mean()
# Align headline rates by sub-grade label rather than by sorted position
df_compare['headline'] = pd.Series(int_rate_dict)
df_compare = df_compare.drop('int_rate', axis=1)
print(df_compare)
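The table above averages the rates themselves within each sub-grade. If a squared-error figure per sub-grade is also wanted, it could be computed along the following lines. This is a dependency-free sketch with made-up numbers: mse, mse_by_group and the toy_ lists are hypothetical names, not part of the project.

```python
from collections import defaultdict


def mse(true_rates, predicted_rates):
    """Mean squared error between two equal-length sequences."""
    errors = [(t - p) ** 2 for t, p in zip(true_rates, predicted_rates)]
    return sum(errors) / float(len(errors))


def mse_by_group(groups, true_rates, predicted_rates):
    """Mean squared error computed separately for each group label."""
    buckets = defaultdict(lambda: ([], []))
    for g, t, p in zip(groups, true_rates, predicted_rates):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: mse(t, p) for g, (t, p) in buckets.items()}


# Made-up rates for four loans in two sub-grades (illustrative values only).
toy_grades = ['A1', 'A1', 'B3', 'B3']
toy_true = [0.05, 0.07, 0.02, 0.10]
toy_predict = [0.06, 0.06, 0.06, 0.06]

overall = mse(toy_true, toy_predict)
per_grade = mse_by_group(toy_grades, toy_true, toy_predict)
```

In this toy data both sub-grades have the same average prediction error of zero, yet B3 has a much larger MSE than A1, which is the kind of per-loan spread that a comparison of sub-grade averages alone would hide.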
The headline Lending Club rate is shown by the red line; this is what the lender would receive if the loan does not default. After accounting for defaulted loans, we get the blue line. This is the average actual rate of return by loan sub-grade (averaging over loans that were paid in full and those that defaulted), and it is calculated at maturity purely from the loan payment details.
The predicted rate, shown by the green line, is calculated purely from the loan features. This means that we can estimate the rate of return before maturity, and in particular at the inception of the loan.
In [7]:
plt.figure(figsize=(12, 6))
x = range(20)
y_true = df_compare['rate_true']
y_predict = df_compare['rate_predict']
y_headline = df_compare['headline']
# Colors are set explicitly so the lines match the description in the text
plt.plot(x, y_true, color='blue', label='Actual rate')
plt.plot(x, y_predict, color='green', label='Predicted rate')
plt.plot(x, y_headline, color='red', label='Lending Club rate')
plt.legend(loc='best')
plt.xlabel('Sub-grade, 0:A1 19:D5')
plt.ylabel('Rate of return')
plt.title("Comparison of predicted vs true rate of return")
Out[7]:
[Figure: actual, predicted and Lending Club headline rates of return by sub-grade]
The final notebook, applications.ipynb, discusses the application of our model to the problem of loan selection.