Suppose you were given a number of loans of grade B-2 to choose from. Since all of them are in the same sub-grade, they pay the same interest rate. Is random selection good enough, or it possible to beat the average return in a statistically reliable way?
Viewed as a black box, the model takes in loan features as input and returns the predicted rate of return. What we'll be doing here is to use our the predicted rate of return to rank the loans, and attempt to beat the average by choosing the top quartile. We'll then compare the average rate of return of the loans in the top quartile against the sub-grade average.
In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from transfers.fileio import load_from_pickle
from helpers.preprocessing import process_payment, process_features
from model.model import StatusModel
from model.validate import actual_IRR
In [14]:
# Load data, then pre-process
print "Loading data..."
df_3c = pd.read_csv('data/LoanStats3c_securev1.csv', header=True).iloc[:-2, :]
df_3b = pd.read_csv('data/LoanStats3b_securev1.csv', header=True).iloc[:-2, :]
df_raw = pd.concat((df_3c, df_3b), axis=0)
# Pre-process data
print "Pre-processing data..."
df_features = process_features(df_raw)
# Train models for every grade for every month
print "Training models..."
model = StatusModel(model=RandomForestRegressor, parameters={'n_estimators':100, 'max_depth':10})
model.train_model(df_features)
In [15]:
# Load data, then pre-process
print "Loading data..."
df_3a = pd.read_csv('data/LoanStats3a_securev1.csv', header=True).iloc[:-2, :]
df_raw = df_3a.copy()
# Pre-process data
print "Pre-processing data..."
df_features = process_features(df_raw, True, False)
# Calculating expected rate of return for loans already matured
print "Calculating rate of return..."
int_rate_dict = {'A1':0.0603, 'A2':0.0649, 'A3':0.0699, 'A4':0.0749, 'A5':0.0819,
'B1':0.0867, 'B2':0.0949, 'B3':0.1049, 'B4':0.1144, 'B5':0.1199,
'C1':0.1239, 'C2':0.1299, 'C3':0.1366, 'C4':0.1431, 'C5':0.1499,
'D1':0.1559, 'D2':0.1599, 'D3':0.1649, 'D4':0.1714, 'D5':0.1786}
rate_predict = model.expected_IRR(df_features, False, int_rate_dict)
In [16]:
# Pre-process data
print "Pre-processing data..."
df_payment = process_payment(df_raw)
# Calculating actual rate of return for loans already matured with Dec 2014 int_rate
print "Calculating rate of return..."
# Replace int_rate with values set in Dec 2015 by sub_grade
int_rate_dict = {'A1':0.0603, 'A2':0.0649, 'A3':0.0699, 'A4':0.0749, 'A5':0.0819,
'B1':0.0867, 'B2':0.0949, 'B3':0.1049, 'B4':0.1144, 'B5':0.1199,
'C1':0.1239, 'C2':0.1299, 'C3':0.1366, 'C4':0.1431, 'C5':0.1499,
'D1':0.1559, 'D2':0.1599, 'D3':0.1649, 'D4':0.1714, 'D5':0.1786}
rate_true = actual_IRR(df_payment, False, int_rate_dict)
In [17]:
df_select = df_features[['id', 'sub_grade']]
df_select['rate_true'] = rate_true
df_select['rate_predict'] = rate_predict
Now that we have the predicted rate of return for each loan, we can rank the loans and identify which loans fall into the top quartile of each sub-grade band. We then compare the average actual return of the top quartile against the average actual return of the whole sub-grade band.
For the purpose of comparison, we do the same for the bottom quartile.
In [18]:
sub_grade_range = ['A1', 'A2', 'A3', 'A4', 'A5',
'B1', 'B2', 'B3', 'B4', 'B5',
'C1', 'C2', 'C3', 'C4', 'C5',
'D1', 'D2', 'D3', 'D4', 'D5']
average_list = []
top_list = []
bottom_list = []
for sub_grade in sub_grade_range:
df_sub_grade = df_select[df_select['sub_grade'] == sub_grade]
df_sub_grade['quartile'] = pd.qcut(df_sub_grade['rate_predict'], 4, False)
id_top = df_sub_grade[df_sub_grade['quartile'] == 3]['id'].values
id_bottom = df_sub_grade[df_sub_grade['quartile'] == 0]['id'].values
average_IRR = df_sub_grade.mean()['rate_true']
top_IRR = df_select[df_select['id'].isin(id_top)]['rate_true'].values.mean()
bottom_IRR = df_select[df_select['id'].isin(id_bottom)]['rate_true'].values.mean()
average_list.append(average_IRR)
top_list.append(top_IRR)
bottom_list.append(bottom_IRR)
print 'Average IRR of %s: %f' % (sub_grade, average_IRR)
print 'Actual IRR of top quartile: ', top_IRR
print 'Actual IRR of bottom 10 quartile:', bottom_IRR, '\n'
Finally, we plot the average overall return, as well as the average return of the top and bottom quartiles. We are able to beat the average of 9.1% by 1.3%.
In [19]:
plt.figure(figsize = (12, 6))
x = xrange(20)
plt.plot(x, average_list, label='Average IRR')
plt.plot(x, top_list, label='Top quartile IRR')
plt.plot(x, bottom_list, label='Bottom quartile IRR')
plt.xlabel('Sub-grade, 0:A1 19:D5')
plt.ylabel('IRR')
plt.legend(loc='best')
plt.title("Performance of selective loan selection strategy")
Out[19]: