Validation: Does my model work?

We'll assess, step by step, whether our model works by comparing two sets of values for loans issued in 2009-2011:

  1. Expected rate of return, as predicted by our model purely from loan features, such as FICO score, and
  2. Actual rate of return, as calculated purely from the payment details in the loan data.

The model is trained with features from loans issued in 2012-2014, i.e. loans that have not yet matured. The steps detailed here are identical to the validation process run by python test.py compare.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from transfers.fileio import load_from_pickle
from helpers.preprocessing import process_payment, process_features
from model.model import StatusModel
from model.validate import actual_IRR

We'll train our model on loans issued in 2012-2014, with the loan features as inputs and the expected recovery, based on the status of the loan, as our target. The model is actually a collection of 4 x 36 = 144 Random Forest sub-models, i.e. one for each pair of grade (A to D) and payment month (1 to 36).
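To make the grid of sub-models concrete, here is a minimal sketch of one model per (grade, month) pair. It is not the project's StatusModel: StandInModel and train_grid are hypothetical names, and a trivial mean predictor stands in for the Random Forest so that the example has no dependencies.

```python
# Illustrative sketch only: StandInModel and train_grid are hypothetical names,
# not part of the project's StatusModel class.

class StandInModel(object):
    """Trivial regressor that predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


GRADES = ['A', 'B', 'C', 'D']
MONTHS = list(range(1, 37))  # payment months 1..36 of a 36-month loan


def train_grid(data_by_grade):
    """Fit one sub-model per (grade, month) pair.

    data_by_grade maps grade -> (X, targets_by_month), where
    targets_by_month maps month -> list of targets for that month.
    """
    models = {}
    for grade in GRADES:
        X, targets_by_month = data_by_grade[grade]
        for month in MONTHS:
            models[(grade, month)] = StandInModel().fit(X, targets_by_month[month])
    return models


# Toy data: two loans per grade, with the same targets in every month.
toy = {}
for grade in GRADES:
    targets = {}
    for month in MONTHS:
        targets[month] = [1.0, 0.5]
    toy[grade] = ([[700], [680]], targets)

models = train_grid(toy)
print(len(models))  # 4 grades x 36 months = 144 sub-models
```

Keying the sub-models by (grade, month) lets a prediction for a given loan look up the 36 models for its grade, one per payment month.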


In [2]:
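# Load data, then pre-process
# header=1 reads the column names from the second line of each file (this is
# how older pandas treated header=True); the last two rows are dropped.
print "Loading data..."
df_3c = pd.read_csv('data/LoanStats3c_securev1.csv', header=1, low_memory=False).iloc[:-2, :]
df_3b = pd.read_csv('data/LoanStats3b_securev1.csv', header=1, low_memory=False).iloc[:-2, :]
df_raw = pd.concat((df_3c, df_3b), axis=0)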

# Pre-process data
print "Pre-processing data..."
df = process_features(df_raw)

# Train models for every grade for every month
print "Training models..."
model = StatusModel(model=RandomForestRegressor, parameters={'n_estimators':100, 'max_depth':10})
model.train_model(df)


Loading data...
Pre-processing data...
Training models...
A training completed...
B training completed...
C training completed...
D training completed...

First we use our trained model to calculate the expected rate of return of the matured loans issued in 2009-2011. The input is the features of each loan, and the model outputs a single rate for each loan.

As we are calculating the rate of return for 40,000 loans, generating 3 cashflows of 36 months' length for each loan, this step might take a little longer.


In [3]:
# Load data, then pre-process
# header=1 reads the column names from the second line of the file
print "Loading data..."
df_3a = pd.read_csv('data/LoanStats3a_securev1.csv', header=1, low_memory=False).iloc[:-2, :]
df_raw = df_3a.copy()

# Pre-process data
print "Pre-processing data..."
df_features = process_features(df_raw, True, False) 

# Calculating expected rate of return for loans already matured
print "Calculating rate of return..."
int_rate_dict = {'A1':0.0603, 'A2':0.0649, 'A3':0.0699, 'A4':0.0749, 'A5':0.0819,
                 'B1':0.0867, 'B2':0.0949, 'B3':0.1049, 'B4':0.1144, 'B5':0.1199,
                 'C1':0.1239, 'C2':0.1299, 'C3':0.1366, 'C4':0.1431, 'C5':0.1499,
                 'D1':0.1559, 'D2':0.1599, 'D3':0.1649, 'D4':0.1714, 'D5':0.1786}

rate_predict = model.expected_IRR(df_features, False, int_rate_dict)


Loading data...
Pre-processing data...
Calculating rate of return...

Next we generate the actual rate of return, our 'ground truth'. The input is the payment details of matured loans issued between 2009 and 2011, e.g. principal paid, interest paid, late fees and recovered value if the loan defaulted. The output is a single actual rate of return figure for each loan.
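As an illustration of how a rate of return can be backed out of a stream of payments, here is a minimal sketch that solves for the internal rate of return (IRR) by bisection. It is not the project's actual_IRR function: npv and monthly_irr are hypothetical helpers, the cashflows are invented, and annualising by multiplying the monthly rate by 12 is an assumption chosen so that a loan paid in full recovers its headline rate.

```python
def npv(rate, cashflows):
    """Net present value of monthly cashflows at a monthly discount rate."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cashflows))


def monthly_irr(cashflows, lo=-0.99, hi=1.0, tol=1e-10):
    """Monthly rate at which the NPV is zero, found by bisection.

    Assumes one initial outflow followed only by inflows, so that the
    NPV falls as the rate rises.
    """
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if npv(mid, cashflows) > 0:
            lo = mid  # NPV still positive: the rate is too low
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0


# Hypothetical 36-month loan of 1,000 at a 12% headline rate.
principal, annual_rate, n = 1000.0, 0.12, 36
r = annual_rate / 12
payment = principal * r / (1 - (1 + r) ** -n)  # level monthly payment

paid_in_full = [-principal] + [payment] * n
defaulted = [-principal] + [payment] * 18  # stops after month 18, no recovery

full_rate = 12 * monthly_irr(paid_in_full)    # assumption: simple annualisation
default_rate = 12 * monthly_irr(defaulted)
print("paid in full: {:.4f}".format(full_rate))
print("defaulted:    {:.4f}".format(default_rate))
```

The loan paid in full returns its headline rate of 0.12, while the defaulted loan returns a negative rate because its 18 payments do not cover the principal. Averaging over both kinds of loan is what pulls the actual rate below the headline rate.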


In [4]:
# Pre-process data
print "Pre-processing data..."
df_payment = process_payment(df_raw)

# Calculating actual rate of return for loans already matured
print "Calculating rate of return..."

# Replace int_rate with values set in Dec 2014 by sub_grade
int_rate_dict = {'A1':0.0603, 'A2':0.0649, 'A3':0.0699, 'A4':0.0749, 'A5':0.0819,
                 'B1':0.0867, 'B2':0.0949, 'B3':0.1049, 'B4':0.1144, 'B5':0.1199,
                 'C1':0.1239, 'C2':0.1299, 'C3':0.1366, 'C4':0.1431, 'C5':0.1499,
                 'D1':0.1559, 'D2':0.1599, 'D3':0.1649, 'D4':0.1714, 'D5':0.1786}

rate_true = actual_IRR(df_payment, False, int_rate_dict)


Pre-processing data...
Calculating rate of return...

We simply collect all our data back into a single dataframe, then review the overall average MSE and the average actual and predicted rates by sub-grade.


In [5]:
# Copy the selected columns so that adding columns does not write to a slice
df_select = df_features[['sub_grade', 'int_rate']].copy()

df_select['rate_true'] = rate_true
df_select['rate_predict'] = rate_predict

rate_true = np.array(rate_true)
rate_predict = np.array(rate_predict)

# Mean squared error over all loans
print "Average MSE: ", np.mean((rate_true - rate_predict)**2)


Average MSE:  0.0114000874305
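The MSE is in squared units of rate, so it helps to take the square root before interpreting it. A minimal sketch, using only the figure printed above:

```python
import math

mse = 0.0114000874305  # average MSE printed above
rmse = math.sqrt(mse)  # back in units of rate of return
print("RMSE: {:.4f}".format(rmse))
```

This gives an RMSE of about 0.107, i.e. a typical error of roughly 10.7 percentage points for an individual loan. The sub-grade averages in the next cell agree much more closely, which suggests that the error is dominated by loan-level variation rather than by a systematic bias.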

In [6]:
df_compare = df_select.groupby('sub_grade').mean()
# Align headline rates to sub-grades by label rather than by sort order
df_compare['headline'] = pd.Series(int_rate_dict)
df_compare = df_compare.drop('int_rate', axis=1)

print df_compare


           rate_true  rate_predict  headline
sub_grade                                   
A1          0.054840      0.050399    0.0603
A2          0.054009      0.053465    0.0649
A3          0.058820      0.057932    0.0699
A4          0.063289      0.061775    0.0749
A5          0.066053      0.067648    0.0819
B1          0.065036      0.060339    0.0867
B2          0.074596      0.069333    0.0949
B3          0.082055      0.080169    0.1049
B4          0.089151      0.090714    0.1144
B5          0.094420      0.095568    0.1199
C1          0.094693      0.090015    0.1239
C2          0.095258      0.095579    0.1299
C3          0.093860      0.101309    0.1366
C4          0.106270      0.107995    0.1431
C5          0.111463      0.112938    0.1499
D1          0.114010      0.111568    0.1559
D2          0.120886      0.115902    0.1599
D3          0.121501      0.118529    0.1649
D4          0.123292      0.124906    0.1714
D5          0.137635      0.131685    0.1786
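To summarise how closely the two columns agree, we can compute the gap between predicted and actual rate for each sub-grade. This is a minimal sketch with the values copied by hand from the table above; the names table, gaps, worst and mean_abs_gap are introduced here for illustration only.

```python
# (rate_true, rate_predict) by sub-grade, copied from the table above
table = {
    'A1': (0.054840, 0.050399), 'A2': (0.054009, 0.053465),
    'A3': (0.058820, 0.057932), 'A4': (0.063289, 0.061775),
    'A5': (0.066053, 0.067648), 'B1': (0.065036, 0.060339),
    'B2': (0.074596, 0.069333), 'B3': (0.082055, 0.080169),
    'B4': (0.089151, 0.090714), 'B5': (0.094420, 0.095568),
    'C1': (0.094693, 0.090015), 'C2': (0.095258, 0.095579),
    'C3': (0.093860, 0.101309), 'C4': (0.106270, 0.107995),
    'C5': (0.111463, 0.112938), 'D1': (0.114010, 0.111568),
    'D2': (0.120886, 0.115902), 'D3': (0.121501, 0.118529),
    'D4': (0.123292, 0.124906), 'D5': (0.137635, 0.131685),
}

# Positive gap: the model over-predicts; negative: it under-predicts
gaps = {sg: pred - actual for sg, (actual, pred) in table.items()}

worst = max(gaps, key=lambda sg: abs(gaps[sg]))
mean_abs_gap = sum(abs(g) for g in gaps.values()) / len(gaps)

print("Largest gap: {} ({:+.4f})".format(worst, gaps[worst]))
print("Mean absolute gap: {:.4f}".format(mean_abs_gap))
```

On these figures the largest gap is at C3, where the model over-predicts by about 0.7 percentage points, and the mean absolute gap across sub-grades is roughly 0.3 percentage points.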

The headline Lending Club rate, shown by the red line, is what the lender would receive if the loan does not default. After accounting for defaulted loans, we get the blue line. This is the average actual rate of return by loan sub-grade (averaging over loans that paid in full and those that defaulted) and is calculated at maturity purely from the loan payment details.

The predicted rate, shown by the green line, is calculated purely from the loan features. This means that we can estimate the rate of return before maturity, and in particular at the inception of the loan.


In [7]:
plt.figure(figsize = (12, 6))
x = range(len(df_compare))  # one position per sub-grade
y_true = df_compare['rate_true']
y_predict = df_compare['rate_predict']
y_headline = df_compare['headline']

plt.plot(x, y_true, label='Actual rate')
plt.plot(x, y_predict, label='Predicted rate')
plt.plot(x, y_headline, label='Lending Club rate')
plt.legend(loc='best')
plt.xlabel('Sub-grade, 0:A1 19:D5')
plt.ylabel('Rate of return')
plt.title("Comparison of predicted vs true rate of return")


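Out[7]:
[Figure: "Comparison of predicted vs true rate of return". Lines for actual rate, predicted rate and Lending Club rate; x-axis: sub-grade (0:A1 to 19:D5); y-axis: rate of return.]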

The final notebook applications.ipynb discusses the application of our model to the problem of loan selection.