We'll walk through, step by step, how the rate of return is predicted. To start, let's look at an example loan.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from collections import defaultdict
from helpers.cashflow import calc_monthly_payment, get_monthly_payments, get_compound_curve
from helpers.preprocessing import process_features
from model.model import StatusModel
In [2]:
# header=1: column names are on the second line of each file (current pandas rejects header=True)
df_3c = pd.read_csv('data/LoanStats3c_securev1.csv', header=1).iloc[:-2, :]
df_3b = pd.read_csv('data/LoanStats3b_securev1.csv', header=1).iloc[:-2, :]
df_raw = pd.concat((df_3c, df_3b), axis=0)
Our training set is the set of 3-year loans issued between Jan 2012 and Dec 2014. The loan below has an interest rate of 19.20%, paid on an amortizing principal.
In [3]:
df_3c.iloc[-1:,:][['id', 'loan_amnt', 'int_rate', 'term', 'sub_grade', 'annual_inc', 'issue_d', 'loan_status']]
Out[3]:
We can calculate the required monthly payment with the standard amortization formula. This is the calc_monthly_payment function in cashflow.py, and here we suppose the loan amount was $1.
In [4]:
print "Monthly payment:", calc_monthly_payment(loan_amnt=1, int_rate=0.1920, term=3)
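As a rough illustration of the calculation above, here is a minimal sketch of the standard level-payment formula, P = L * r / (1 - (1 + r)^(-n)), where r is the monthly rate and n the number of payments. The function name monthly_payment_sketch is hypothetical, and this is an assumption about what calc_monthly_payment computes, not its actual source.

```python
def monthly_payment_sketch(loan_amnt, int_rate, term_years):
    """Level monthly payment on an amortizing loan (standard annuity formula)."""
    r = int_rate / 12.0        # monthly interest rate
    n = int(term_years * 12)   # number of monthly payments
    return loan_amnt * r / (1.0 - (1.0 + r) ** (-n))


payment = monthly_payment_sketch(loan_amnt=1.0, int_rate=0.1920, term_years=3)

# Sanity check: paying `payment` every month should retire the $1 principal.
balance = 1.0
for _ in range(36):
    balance = balance * (1.0 + 0.1920 / 12.0) - payment

print("Monthly payment: %.6f" % payment)
print("Balance after 36 payments: %.2e" % balance)
```

The balance loop is only a consistency check: interest accrues on the outstanding principal each month, the payment is subtracted, and the balance should end at (numerically) zero.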
This is the payment for one month. For the whole 36 months, we simply have a list of 36 payments.
In [5]:
monthly_payments = np.array(get_monthly_payments(X_int_rate=np.array([0.1920]), date_range_length=36)[0])
print "Cashflow of monthly payments:\n", monthly_payments
This is the list of payments the borrower would make if there were no default. For the expected rate of return, we need to adjust these payments for the risk of default and for the time value of money.
In [6]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), monthly_payments, alpha=0.25)
plt.xlim((0,36))
plt.ylim((0, 0.04))
plt.xlabel('Monthly payment by month', fontsize=13)
Out[6]:
We'll do the time value adjustment first as it's relatively straightforward. We assume that payments received are reinvested in an instrument that pays the same interest rate. This is a very strong assumption, but we're able to make it as we're only making comparisons between loans (and not, say, comparing loans against stocks).
If the interest rate is 19.20%, then the monthly interest is simply that figure divided by 12. If we had $1, then after one month it would increase by the monthly interest.
In [7]:
print "Amount after 1 month:", (1 + 0.1920 / 12)
After two months, the amount would be 1.016 compounded again by the monthly interest.
In [8]:
print "Amount after 2 months:", (1 + 0.1920 / 12) ** 2
For the whole 36 months, we'll use the get_compound_curve function in cashflow.py.
In [9]:
compound_curve = np.array(get_compound_curve(X_compound_rate=np.array([0.1920]), date_range_length=36))[0]
print "Compound curve:\n", compound_curve
Note in particular that the final month has no compound adjustment (a factor of 1), because that payment is received at the maturity of the loan and there is no time left to reinvest it.
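The compounding described above can be sketched as follows. This assumes a payment received at the end of month k (k = 1, ..., 36) is reinvested for the remaining 36 - k months at the loan's monthly rate, which matches the final month having a factor of 1. The function name compound_curve_sketch is hypothetical and is not the actual get_compound_curve implementation.

```python
import numpy as np


def compound_curve_sketch(annual_rate, n_months):
    """Growth factor applied to each monthly payment until maturity."""
    monthly_rate = annual_rate / 12.0
    # First payment is reinvested for n_months - 1 months, the last for 0 months
    months_remaining = np.arange(n_months - 1, -1, -1)
    return (1.0 + monthly_rate) ** months_remaining


curve = compound_curve_sketch(annual_rate=0.1920, n_months=36)
print("First month factor: %.4f" % curve[0])
print("Final month factor: %.4f" % curve[-1])
```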
In [10]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), compound_curve, alpha=0.25, color='m')
plt.xlim((0,36))
plt.ylim((0, 1.8))
plt.xlabel('Compound adjustment by month', fontsize=13)
Out[10]:
We now return to the loan that we looked at in the very beginning.
In [11]:
df_3c.iloc[-1:,:][['id', 'loan_amnt', 'int_rate', 'term', 'sub_grade', 'annual_inc', 'issue_d', 'loan_status']]
Out[11]:
This loan was issued in Jan 2014, and the loan status is current. Viewed from Jan 2015, this was 12 months ago. In our model, we assume that should the same loan be issued today, in 12 months' time (Jan 2016) it would have a loan status of current.
The loan status being current means that the probability we would be receiving this payment is 1. If the loan was not current, then we assume that the probability we would receive this payment is given by the schedule at the bottom of the link below:
https://www.lendingclub.com/info/demand-and-credit-profile.action
For example, if this loan has already defaulted, then there is only an 8% chance we would receive this payment. This probability of receiving the payment is the target on which we train a Random Forest Regressor. Before doing so, we pre-process the data to clean it up and fill in missing values.
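As a rough illustration of this target encoding, the sketch below maps a loan status to a payment probability. Only the two values stated above (current and default) are filled in; the dictionary and function names are hypothetical, and the remaining statuses would take their probabilities from the Lending Club schedule.

```python
# Hypothetical mapping from loan status to the probability of receiving a payment.
# Only the two values stated in the text are filled in here.
STATUS_TO_PAYMENT_PROB = {
    'Current': 1.00,
    'Default': 0.08,
}


def status_to_target(status):
    """Return the regression target for one loan status (None if not listed here)."""
    return STATUS_TO_PAYMENT_PROB.get(status)


example_statuses = ['Current', 'Default', 'Current']
targets = [status_to_target(s) for s in example_statuses]
print(targets)  # [1.0, 0.08, 1.0]
```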
In [12]:
df = process_features(df_raw)
Since this loan is of grade D, we train our model on all loans of grade D issued in Jan 2014. This involves going into the details of our StatusModel class.
In [13]:
model = RandomForestRegressor
parameters = {'n_estimators':100,
              'max_depth':10}
features = ['loan_amnt', 'emp_length', 'monthly_inc', 'dti',
            'fico', 'earliest_cr_line', 'open_acc', 'total_acc',
            'revol_bal', 'revol_util', 'inq_last_6mths',
            'delinq_2yrs', 'pub_rec', 'collect_12mths',
            'last_delinq', 'last_record', 'last_derog',
            'purpose_debt', 'purpose_credit', 'purpose_home',
            'purpose_other', 'purpose_buy', 'purpose_biz',
            'purpose_medic', 'purpose_car', 'purpose_move',
            'purpose_vac', 'purpose_house', 'purpose_wed', 'purpose_energy',
            'home_mortgage', 'home_rent', 'home_own',
            'home_other', 'home_none', 'home_any']
grade_range = ['D']
date_range = ['Jan-2014']
In [14]:
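grade_dict = defaultdict(list)
for grade in grade_range:
    for month in date_range:
        # Select the loans of this grade issued in this month
        df_select = df[(df['grade'].isin([grade]))
                       & (df['issue_d'].isin([month]))]
        X = df_select[features].values
        y = df_select['loan_status'].values
        # Note: this rebinds `model` from the class to a fitted instance, which
        # only works because there is exactly one grade and one month here
        model = model(**parameters)
        model.fit(X, y)
        grade_dict[grade].append(model)
    print grade, 'training completed...'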
We now predict the status of our original loan after 12 months.
In [15]:
df_select.iloc[-1:,:][['id', 'loan_amnt', 'int_rate', 'term', 'sub_grade', 'monthly_inc', 'issue_d', 'loan_status']]
Out[15]:
In [16]:
X = df_select.iloc[-1:,:][features].values
print "Probability of receiving payment of loan 9199665 after 12 months:", model.predict(X)
What we've done so far is train our model on all loans of grade D issued in Jan 2014 and, using that model, predict that should loan 9199665 be issued today, there would be a 97.4% chance of receiving the monthly payment in Jan 2016.
To get the probability of payment being received for the whole loan period, we repeat the process for 36 months. We'll be using the get_expected_payout function inside model.py.
In [17]:
model = StatusModel(model=RandomForestRegressor,
                    parameters={'n_estimators':100,
                                'max_depth':10})
model.grade_range = ['D']
model.date_range = ['Dec-2014', 'Nov-2014', 'Oct-2014',
                    'Sep-2014', 'Aug-2014', 'Jul-2014',
                    'Jun-2014', 'May-2014', 'Apr-2014',
                    'Mar-2014', 'Feb-2014', 'Jan-2014',
                    'Dec-2013', 'Nov-2013', 'Oct-2013',
                    'Sep-2013', 'Aug-2013', 'Jul-2013',
                    'Jun-2013', 'May-2013', 'Apr-2013',
                    'Mar-2013', 'Feb-2013', 'Jan-2013',
                    'Dec-2012', 'Nov-2012', 'Oct-2012',
                    'Sep-2012', 'Aug-2012', 'Jul-2012',
                    'Jun-2012', 'May-2012', 'Apr-2012',
                    'Mar-2012', 'Feb-2012', 'Jan-2012']
In [18]:
model.train_model(df)
In [19]:
X_sub_grade = df_select.iloc[-1:,:]['sub_grade'].values
expected_payout = np.array(model.get_expected_payout(X, X_sub_grade))[0]
print "Expected payout:\n", expected_payout
In [20]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), expected_payout, alpha=0.25, color='r')
plt.xlim((0,36))
plt.ylim((0, 1.0))
plt.xlabel('Expected payout by month', fontsize=13)
Out[20]:
For the final step, we multiply the monthly payment by the compound adjustment to account for time value, and by the expected payout to account for risk of default. This is also what the get_cashflows function inside cashflow.py does.
In [21]:
expected_cashflows = monthly_payments * compound_curve * expected_payout
print "Expected cashflows:\n", expected_cashflows
In [22]:
plt.figure(figsize=(18,6))
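plt.bar(xrange(36), expected_cashflows, alpha=0.25, color='g')  # plot the cashflows, not the payout probabilities
plt.xlim((0,36))
plt.ylim((0, 0.08))  # cashflows per $1 of principal are at most about 0.064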
plt.xlabel('Expected cashflow by month', fontsize=13)
Out[22]:
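To get the rate of return, we simply add up all the cashflows, take the cube root, and finally subtract 1. Since the loan amount is $1 and every cashflow is compounded to maturity, the sum is the expected value of $1 after 3 years, and the cube root annualizes it. The calc_IRR function inside cashflow.py can also be used for this calculation.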
In [23]:
rate_of_return = (np.sum(expected_cashflows))**(1/3.) - 1
print "Rate of return:", rate_of_return
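A useful sanity check on the whole pipeline is the no-default case. The sketch below rebuilds the three curves under the assumptions that every payment is received with probability 1 and that each payment compounds at the monthly rate until maturity; all variable names are hypothetical and none of the helper functions are used. In that case the sum of the cashflows collapses to (1 + r)^36, so the rate of return should equal the effective annual rate (1 + 0.1920 / 12)^12 - 1, about 21%.

```python
import numpy as np

annual_rate, n_months = 0.1920, 36
r = annual_rate / 12.0  # monthly interest rate

# Level payment on $1 of principal (standard amortization formula)
nd_payment = r / (1.0 - (1.0 + r) ** (-n_months))
nd_payments = np.full(n_months, nd_payment)

# Each payment is reinvested until maturity; the final payment is not compounded
nd_curve = (1.0 + r) ** np.arange(n_months - 1, -1, -1)

# Hypothetical no-default case: every payment is received with certainty
nd_payout = np.ones(n_months)

nd_cashflows = nd_payments * nd_curve * nd_payout
nd_rate_of_return = np.sum(nd_cashflows) ** (1 / 3.0) - 1
effective_annual_rate = (1.0 + r) ** 12 - 1

print("No-default rate of return: %.4f" % nd_rate_of_return)
print("Effective annual rate:     %.4f" % effective_annual_rate)
```

Any expected payout below 1 pulls the rate of return below this ceiling, which is why the predicted return for a loan sits under its headline rate once default risk is included.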
To conclude, our model predicts that loan 9199665 has an expected rate of return of 15.66%, based on the headline Lending Club rate of 19.20%. The next notebook, validation.ipynb, discusses how well the model works.