Building a better model

Following the baseline model and some feature engineering, we will now build a better predictive model. This will follow a few new patterns:

  1. We will import the data cleaning and feature engineering code from external Python modules we've built (for standardization across our machines).
  2. We will cross-validate across time: that is, the model will be trained on earlier years and tested on later years (a sketch of scikit-learn's built-in splitter for this follows the list).
  3. Rather than looping through models (and perhaps working more with Pipeline and GridSearch), we will focus on tuning the parameters of the best-performing model from the baseline set.
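
On point 2: scikit-learn ships an expanding-window splitter, TimeSeriesSplit, that formalizes train-on-earlier/test-on-later folds. Below we use a simpler single year-based split instead; this is just a sketch of the alternative, where X_sorted and y_sorted are hypothetical date-sorted versions of the design matrix:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

# X_sorted / y_sorted (hypothetical) must be ordered by issue date.
# Each fold trains on all earlier rows and validates on the next chunk,
# so information never flows backwards in time.
model = RandomForestRegressor(n_estimators=10)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X_sorted):
    model.fit(X_sorted.iloc[train_idx], y_sorted.iloc[train_idx])
    print(model.score(X_sorted.iloc[test_idx], y_sorted.iloc[test_idx]))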

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import os
import sys
import sklearn
import sqlite3
import matplotlib

import numpy as np
import pandas as pd
import enchant as en
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
%aimport data
from data import make_dataset as md

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (16.0, 6.0)
plt.rcParams['legend.markerscale'] = 3
matplotlib.rcParams['font.size'] = 16.0



Data: Preparing for the model

Importing the raw data


In [2]:
DIR = os.getcwd() + "/../data/"
t = pd.read_csv(DIR + 'raw/lending-club-loan-data/loan.csv', low_memory=False)
t.head()


Out[2]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 1077501 1296599 5000.0 5000.0 4975.0 36 months 10.65 162.87 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1077430 1314167 2500.0 2500.0 2500.0 60 months 15.27 59.83 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1077175 1313524 2400.0 2400.0 2400.0 36 months 15.96 84.33 C C5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1076863 1277178 10000.0 10000.0 10000.0 36 months 13.49 339.31 C C1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1075358 1311748 3000.0 3000.0 3000.0 60 months 12.69 67.79 B B5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 74 columns

Cleaning, imputing missing values, feature engineering (some NLP)


In [3]:
t2 = md.clean_data(t)
t3 = md.impute_missing(t2)
df = md.simple_dataset(t3)
# df = md.spelling_mistakes(t3) - skipping for now, too computationally expensive!


Now cleaning data.
Now imputing missing values and encoding categories.
Skipping NLP/geo stuff, and removing cols.
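
The cleaning helpers live in src/data/make_dataset.py. Purely to illustrate the pattern of centralizing this logic in a shared module (this is not the actual module's code), a cleaning step might look like:

# Hypothetical sketch of the kind of helper make_dataset.py provides;
# the real module's logic differs.
def clean_data(df):
    # Parse dates and drop unusable columns.
    print('Now cleaning data.')
    df = df.copy()
    df['issue_d'] = pd.to_datetime(df['issue_d'])
    df = df.dropna(axis=1, how='all')  # drop all-empty columns
    return df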

Train, test split: Splitting on 2015


In [4]:
df['issue_d'].hist(bins = 50)
plt.title('Seasonality in lending')
plt.ylabel('Frequency')
plt.xlabel('Year')
plt.show()


We can use past years as predictors of future years. One challenge with this approach is that it confounds time-sensitive trends (for example, global economic shocks to interest rates, such as the financial crisis of 2008, or Lending Club's expansion into ever-broader markets of debtors) with time-insensitive factors (such as a debtor's riskiness).

To account for this, we can bundle our training and test sets into the following blocks:

  • Before 2015: Training set
  • 2015 to current: Test set

In [5]:
old = df[df['issue_d'] < '2015']
new = df[df['issue_d'] >= '2015']
old.shape, new.shape


Out[5]:
((464943, 83), (419823, 83))

We'll use the pre-2015 data on interest rates (old) to fit a model and cross-validate it. We'll then use the 2015-and-later data (new) as a 'wild' dataset to test against.

Cross-validating the model


In [6]:
X = old.drop(['int_rate', 'issue_d', 'earliest_cr_line', 'grade'], 1)
y = old['int_rate']
X.shape, y.shape


Out[6]:
((464943, 79), (464943,))

In [7]:
# hold out a random 33% within the pre-2015 block (kept for reference;
# the cross-validation below scores the full pre-2015 set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


Out[7]:
((311511, 79), (153432, 79), (311511,), (153432,))

In [8]:
rfr = RandomForestRegressor(n_estimators = 10, max_features='sqrt')
# cross_val_score uses the estimator's default scorer, which for a
# regressor is R^2, not classification accuracy
scores = cross_val_score(rfr, X, y, cv = 3)
print("R^2: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2))


R^2: 0.76 (+/- 0.03)

In [9]:
X_new = new.drop(['int_rate', 'issue_d', 'earliest_cr_line', 'grade'], 1)
y_new = new['int_rate']

new_scores = cross_val_score(rfr, X_new, y_new, cv = 3)
print("Accuracy: {:.2f} (+/- {:.2f})".format(new_scores.mean(), new_scores.std() * 2))


R^2: 0.70 (+/- 0.14)
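
Note that cross_val_score refits the forest within each fold of the 2015-and-later data, so the score above measures predictability within that block rather than how well a model trained on earlier years transfers. A minimal sketch of that out-of-time check, reusing the variables defined above:

# Fit on the pre-2015 block, then score (R^2) on the 2015-and-later block.
rfr.fit(X, y)
print(rfr.score(X_new, y_new))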

In [10]:
# Use all the data - all years
X_total = df.drop(['int_rate', 'issue_d', 'earliest_cr_line', 'grade'], 1)
y_total = df['int_rate']

total_scores = cross_val_score(rfr, X_total, y_total, cv = 3)
print("Accuracy: {:.2f} (+/- {:.2f})".format(total_scores.mean(), total_scores.std() * 2))


R^2: 0.68 (+/- 0.13)
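
Cross-validating on the pooled data shuffles years together, so each fold trains partly on later years than it tests on. One way to keep the full-data estimate time-aware (a sketch, not what was run above; it assumes issue_d is a datetime column) is to group the folds by issue year:

from sklearn.model_selection import GroupKFold

# Each validation fold contains only issue years that the model
# never saw during training for that fold.
years = df['issue_d'].dt.year
grouped_scores = cross_val_score(rfr, X_total, y_total,
                                 groups=years, cv=GroupKFold(n_splits=3))
print("R^2: {:.2f} (+/- {:.2f})".format(grouped_scores.mean(),
                                        grouped_scores.std() * 2))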

Fitting the model

We fit the model on all the data, and evaluate feature importances.


In [30]:
rfr.fit(X_total, y_total)

# collect (importance, feature) pairs, then sort the frame so the
# most important features come first
fi = [{'importance': x, 'feature': y} for (x, y) in \
      sorted(zip(rfr.feature_importances_, X_total.columns))]
fi = pd.DataFrame(fi)
fi.sort_values(by = 'importance', ascending = False, inplace = True)
fi.head()


Out[30]:
feature importance
78 total_rec_int 0.147562
77 term_ 60 months 0.097561
76 installment 0.072274
75 total_rec_prncp 0.057368
74 term_ 36 months 0.051018
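
Impurity-based importances like these can be inflated for high-cardinality or correlated features. A quick sanity check is permutation importance: shuffle one column and measure the drop in R^2. A minimal sketch (permutation_r2_drop is a hypothetical helper, not something used above):

def permutation_r2_drop(model, X, y, col, n_repeats=3):
    # Average drop in R^2 when `col` is shuffled; higher = more important.
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[col] = np.random.permutation(X_perm[col].values)
        drops.append(baseline - model.score(X_perm, y))
    return np.mean(drops)

# e.g. permutation_r2_drop(rfr, X_total, y_total, 'total_rec_int')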

In [33]:
top5 = fi.head()
top5.plot(kind = 'bar')
plt.xticks(range(5), top5['feature'])
plt.title('Feature importances (top 5 features)')
plt.ylabel('Relative importance')
plt.show()
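
Point 3 of the intro calls for tuning the parameters of the best-performing model. As a next step, a minimal sketch with GridSearchCV; the grid values are placeholders, not tuned choices:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100],          # placeholder values
    'max_features': ['sqrt', 'log2', 0.5],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(RandomForestRegressor(), param_grid, cv=3)
search.fit(X_total, y_total)
print(search.best_params_, search.best_score_)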


