Mechanics: How does the model work?

We'll be going through step-by-step how the rate of return is predicted. To start off, let's look at an loan example.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from collections import defaultdict
from helpers.cashflow import calc_monthly_payment, get_monthly_payments, get_compound_curve
from helpers.preprocessing import process_features
from model.model import StatusModel

In [2]:
df_3c = pd.read_csv('data/LoanStats3c_securev1.csv', header=True).iloc[:-2, :]
df_3b = pd.read_csv('data/LoanStats3b_securev1.csv', header=True).iloc[:-2, :]
df_raw = pd.concat((df_3c, df_3b), axis=0)


/Users/savarin/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (0,19) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
/Users/savarin/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

Our training set is the set of 3-year loans issued between Jan 2012 and Dec 2014. The loan below has an interest rate of 19.20%, paid on an amortizing principal.


In [3]:
df_3c.iloc[-1:,:][['id', 'loan_amnt', 'int_rate', 'term', 'sub_grade', 'annual_inc', 'issue_d', 'loan_status']]


Out[3]:
id loan_amnt int_rate term sub_grade annual_inc issue_d loan_status
235628 9199665 10000 19.20% 36 months D3 46000 Jan-2014 Current

Monthly payments

We can calculate the required monthly payment with the formula in the link below. This is the calc_monthly_payment function in cashflow.py, and here we suppose the loan amount was $1.

http://en.wikipedia.org/wiki/Amortization_calculator


In [4]:
print "Monthly payment:", calc_monthly_payment(loan_amnt=1, int_rate=0.1920, term=3)


Monthly payment: 0.0367572181573

This is the payment for one month. For the whole 36 months, we simply have a list of 36 payments.


In [5]:
monthly_payments = np.array(get_monthly_payments(X_int_rate=np.array([0.1920]), date_range_length=36)[0])

print "Cashflow of monthly payments:\n", monthly_payments


Cashflow of monthly payments:
[ 0.03675722  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722
  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722
  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722
  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722
  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722
  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722  0.03675722]

This is the list of payments the borrower would make if there was no default. For the expected rate of return, we would need to adjust these payments by risk of default and by the time value of money.


In [6]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), monthly_payments, alpha=0.25)
plt.xlim((0,36))
plt.ylim((0, 0.04))
plt.xlabel('Monthly payment by month', fontsize=13)


Out[6]:
<matplotlib.text.Text at 0x1029d1950>

Compound curve

We'll do the time value adjustment first as it's relatively straightforward. We'll be assuming that payments received are reinvested in a instrument that has the same interest. This is a very strong assumption, but we're able to do this as we're only making comparison between loans (and not, say, comparing loans against stocks).

If the interest rate is 19.20%, then the monthly interest is simply that figure divided by 12. If we had $1, then after one month it would increase by the monthly interest.


In [7]:
print "Amount after 1 month:", (1 + 0.1920 / 12)


Amount after 1 month: 1.016

After two months, the amount would be 1.016 compounded again by the monthly interest.


In [8]:
print "Amount after 2 months:", (1 + 0.1920 / 12) ** 2


Amount after 2 months: 1.032256

For the whole 36 months, we'll use the get_compound_curve function in cashflow.py.


In [9]:
compound_curve = np.array(get_compound_curve(X_compound_rate=np.array([0.1920]), date_range_length=36))[0]

print "Compound curve:\n", compound_curve


Compound curve:
[ 1.74292933  1.71548162  1.68846617  1.66187615  1.63570487  1.60994574
  1.58459226  1.55963805  1.53507682  1.51090238  1.48710865  1.46368961
  1.44063938  1.41795215  1.39562219  1.37364389  1.3520117   1.33072018
  1.30976396  1.28913775  1.26883637  1.2488547   1.22918769  1.20983041
  1.19077796  1.17202555  1.15356846  1.13540202  1.11752168  1.09992291
  1.08260129  1.06555245  1.0487721   1.032256    1.016       1.        ]

In particular, note that there would be no compound adjustment at the final month because we would be at final month or maturity of the loan.


In [10]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), compound_curve, alpha=0.25, color='m')

plt.xlim((0,36))
plt.ylim((0, 1.8))
plt.xlabel('Compound adjustment by month', fontsize=13)


Out[10]:
<matplotlib.text.Text at 0x103a4eed0>

Expected payout

We now return to the loan that we looked into at very start.


In [11]:
df_3c.iloc[-1:,:][['id', 'loan_amnt', 'int_rate', 'term', 'sub_grade', 'annual_inc', 'issue_d', 'loan_status']]


Out[11]:
id loan_amnt int_rate term sub_grade annual_inc issue_d loan_status
235628 9199665 10000 19.20% 36 months D3 46000 Jan-2014 Current

This loan was issued in Jan 2014, and the loan status is current. Viewed from Jan 2015, this was 12 months ago. In our model, we assume that should the same loan be issued today, in 12 months' time (Jan 2016) it would have a loan status of current.

The loan status being current means that the probability we would be receiving this payment is 1. If the loan was not current, then we assume that the probability we would receive this payment is given by the schedule at the bottom of the link below:

https://www.lendingclub.com/info/demand-and-credit-profile.action

For example, if this loan has already defaulted, then there is only an 8% chance we would receive this payment. This probability of payment received would be our target to apply a Random Forest Regressor. Before doing so, we pre-process the data to fill in clean up the data and fill in missing values.


In [12]:
df = process_features(df_raw)


helpers/preprocessing.py:138: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  'Charged Off': 0.})
helpers/preprocessing.py:140: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['int_rate'] = df['int_rate'].map(lambda x: float(str(x).strip('%')) / 100)
helpers/preprocessing.py:142: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['term'] = df['term'].map(lambda x: int(str(x).strip(' months')))
helpers/preprocessing.py:146: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['emp_length'] = df['emp_length'].map(lambda x: '0.5 years' if x == '< 1 year' else x)
helpers/preprocessing.py:147: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['emp_length'] = df['emp_length'].map(lambda x: '10 years' if x == '10+ years' else x)
helpers/preprocessing.py:148: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['emp_length'] = df['emp_length'].map(lambda x: '-1 years' if x == 'n/a' else x)
helpers/preprocessing.py:149: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['emp_length'] = df['emp_length'].map(lambda x: float(x.strip(' years')))
helpers/preprocessing.py:154: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['emp_length'] = df['emp_length'].map(lambda x: emp_length_mean if x < 0 else x)
helpers/preprocessing.py:156: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['annual_inc'] = df['annual_inc'].map(lambda x: float(x) / 12)
/Users/savarin/anaconda/envs/py27/lib/python2.7/site-packages/pandas/core/frame.py:2302: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
  **kwargs)
helpers/preprocessing.py:159: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['fico_range_low'] = (df['fico_range_low'] + df['fico_range_high']) / 2.
helpers/preprocessing.py:167: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  else x['earliest_cr_line'], axis=1)
helpers/preprocessing.py:171: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  '%b-%Y')).days / 30, axis=1)
helpers/preprocessing.py:175: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['revol_util'] = df['revol_util'].map(lambda x: float(str(x).strip('%')) / 100)
helpers/preprocessing.py:180: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['revol_util'] = df['revol_util'].fillna(revol_util_mean)
helpers/preprocessing.py:185: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['last_delinq'] = df['last_delinq'].map(lambda x: -1 if x == 'n/a' else x)
helpers/preprocessing.py:187: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['last_record'] = df['last_record'].map(lambda x: -1 if x == 'n/a' else x)
helpers/preprocessing.py:189: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  df['last_derog'] = df['last_derog'].map(lambda x: -1 if x == 'n/a' else x)

Since this loan is of grade D, we train our model on all loans of grade D issued in Jan 2014. This involves going into the details of our StatusModel class.


In [13]:
model = RandomForestRegressor
parameters = {'n_estimators':100, 
              'max_depth':10}

features = ['loan_amnt', 'emp_length', 'monthly_inc', 'dti',
            'fico', 'earliest_cr_line', 'open_acc', 'total_acc',
            'revol_bal', 'revol_util', 'inq_last_6mths',
            'delinq_2yrs', 'pub_rec', 'collect_12mths',
            'last_delinq', 'last_record', 'last_derog',
            'purpose_debt', 'purpose_credit', 'purpose_home',
            'purpose_other', 'purpose_buy', 'purpose_biz',
            'purpose_medic', 'purpose_car', 'purpose_move',
            'purpose_vac', 'purpose_house', 'purpose_wed', 'purpose_energy',
            'home_mortgage', 'home_rent', 'home_own',
            'home_other', 'home_none', 'home_any']

grade_range = ['D']
date_range = ['Jan-2014']

In [14]:
grade_dict = defaultdict(list)

for grade in grade_range:
    for month in date_range:
        df_select = df[(df['grade'].isin([grade]))
                     & (df['issue_d'].isin([month]))]

        X = df_select[features].values
        y = df_select['loan_status'].values

        model = model(**parameters)
        model.fit(X, y)

        grade_dict[grade].append(model)
    print grade, 'training completed...'


D training completed...

We now predict the status of our original loan after 12 months.


In [15]:
df_select.iloc[-1:,:][['id', 'loan_amnt', 'int_rate', 'term', 'sub_grade', 'monthly_inc', 'issue_d', 'loan_status']]


Out[15]:
id loan_amnt int_rate term sub_grade monthly_inc issue_d loan_status
235628 9199665 10000 0.192 36 D3 3833.333333 Jan-2014 1

In [16]:
X = df_select.iloc[-1:,:][features].values

print "Probability of receiving payment of loan 9199665 after 12 months:", model.predict(X)


Probability of receiving payment of loan 9199665 after 12 months: [ 0.9735888]

What we've done so far is train our model on all loans of grade D issued in Jan 2014, and using our model, predict that should loan 9199665 be issued today, there would be a 97.4% of receiving the monthly payment in Jan 2016.

To get the probability of payment being received for the whole loan period, we repeat the process for 36 months. We'll be using the get_expected_payout function inside model.py.


In [17]:
model = StatusModel(model=RandomForestRegressor,
                    parameters={'n_estimators':100,
                                 'max_depth':10})

model.grade_range = ['D']
model.date_range = ['Dec-2014', 'Nov-2014', 'Oct-2014',
                    'Sep-2014', 'Aug-2014', 'Jul-2014',
                    'Jun-2014', 'May-2014', 'Apr-2014',
                    'Mar-2014', 'Feb-2014', 'Jan-2014',
                    'Dec-2013', 'Nov-2013', 'Oct-2013',
                    'Sep-2013', 'Aug-2013', 'Jul-2013',
                    'Jun-2013', 'May-2013', 'Apr-2013',
                    'Mar-2013', 'Feb-2013', 'Jan-2013',
                    'Dec-2012', 'Nov-2012', 'Oct-2012',
                    'Sep-2012', 'Aug-2012', 'Jul-2012',
                    'Jun-2012', 'May-2012', 'Apr-2012',
                    'Mar-2012', 'Feb-2012', 'Jan-2012']

In [18]:
model.train_model(df)


D training completed...

In [19]:
X_sub_grade = df_select.iloc[-1:,:]['sub_grade'].values
expected_payout = np.array(model.get_expected_payout(X, X_sub_grade))[0]

print "Expected payout:\n", expected_payout


Expected payout:
[ 0.99178735  0.98364215  0.97556384  0.96755188  0.95960571  0.95172481
  0.94390862  0.93615663  0.92846831  0.92084312  0.91328056  0.90578011
  0.89834125  0.89096349  0.88364632  0.87638924  0.86919176  0.8620534
  0.85497365  0.84795206  0.84098812  0.83408138  0.82723136  0.8204376
  0.81369964  0.80701701  0.80038926  0.79381594  0.78729661  0.78083082
  0.77441813  0.7680581   0.76175031  0.75549432  0.74928971  0.74313606]

In [20]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), expected_payout, alpha=0.25, color='r')

plt.xlim((0,36))
plt.ylim((0, 1.0))
plt.xlabel('Expected payout by month', fontsize=13)


Out[20]:
<matplotlib.text.Text at 0x10b6c5610>

Rate of return

For the final step, we multiply the monthly payment by the compound adjustment to account for time value, and by the expected payout to account for risk of default. This is also what the get_cashflows.py function inside cashflow.py does.


In [21]:
expected_cashflows = monthly_payments * compound_curve * expected_payout

print "Expected cashflows:\n", expected_cashflows


Expected cashflows:
[ 0.06353909  0.06202487  0.06054673  0.05910382  0.0576953   0.05632034
  0.05497815  0.05366795  0.05238897  0.05114047  0.04992172  0.04873202
  0.04757067  0.046437    0.04533034  0.04425006  0.04319552  0.04216611
  0.04116124  0.04018031  0.03922276  0.03828803  0.03737557  0.03648486
  0.03561538  0.03476661  0.03393808  0.03312929  0.03233977  0.03156907
  0.03081674  0.03008233  0.02936543  0.02866561  0.02798247  0.02731561]

In [22]:
plt.figure(figsize=(18,6))
plt.bar(xrange(36), expected_payout, alpha=0.25, color='g')

plt.xlim((0,36))
plt.ylim((0, 1.0))
plt.xlabel('Expected cashflow by month', fontsize=13)


Out[22]:
<matplotlib.text.Text at 0x10e139310>

To get the rate of return, we simply add up all the cashflows, take the cube root, and finally subtract 1. The calc_IRR function inside cashflow.py can also be used for this calculation.


In [23]:
rate_of_return = (np.sum(expected_cashflows))**(1/3.) - 1

print "Rate of return:", rate_of_return


Rate of return: 0.156624220874

To conclude, our model predicts that loan 9199665 has an expected rate of return of 15.66%, based of the headline Lending Club rate of 19.20%. The next notebook validation.ipynb discusses how well the model works.