Cross-validating across years

What if we train on last year's loans to predict this year's interest rates?



In [4]:

    
%matplotlib inline
%load_ext autoreload
%autoreload 2

import os
import sys
import sklearn
import sqlite3
import matplotlib

import numpy as np
import pandas as pd
import enchant as en
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import Ridge

src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
%aimport data
from data import make_dataset as md

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (16.0, 6.0)
plt.rcParams['legend.markerscale'] = 3
matplotlib.rcParams['font.size'] = 16.0









    



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Data: Preparing for the model

Importing the raw data



In [5]:

    
DIR = os.path.join(os.getcwd(), os.pardir, 'data')
t = pd.read_csv(os.path.join(DIR, 'raw', 'lending-club-loan-data', 'loan.csv'), low_memory=False)
t.head(3)









    Out[5]:






  
    
      
        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment  grade  sub_grade  ...  total_bal_il  il_util  open_rv_12m  open_rv_24m  max_bal_bc  all_util  total_rev_hi_lim  inq_fi  total_cu_tl  inq_last_12m
0  1077501    1296599     5000.0       5000.0           4975.0  36 months     10.65       162.87      B         B2  ...           NaN      NaN          NaN          NaN         NaN       NaN               NaN     NaN          NaN           NaN
1  1077430    1314167     2500.0       2500.0           2500.0  60 months     15.27        59.83      C         C4  ...           NaN      NaN          NaN          NaN         NaN       NaN               NaN     NaN          NaN           NaN
2  1077175    1313524     2400.0       2400.0           2400.0  36 months     15.96        84.33      C         C5  ...           NaN      NaN          NaN          NaN         NaN       NaN               NaN     NaN          NaN           NaN
    
  

3 rows × 74 columns

Cleaning, imputing missing values, feature engineering (some NLP)



In [6]:

    
t2 = md.clean_data(t)
t3 = md.impute_missing(t2)
df = md.simple_dataset(t3)









    



Now cleaning data.
Now imputing missing values and encoding categories.






    



/usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2824: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)






    



Skipping NLP/geo stuff, and removing cols.
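
The `md` helpers live in the project's `src/data` package and are not shown in this notebook. As a rough illustration of what an "impute missing values and encode categories" step can look like, here is a minimal sketch on a toy frame. The function `impute_and_encode`, the median/sentinel strategy, and the toy values are all assumptions made for this example; they are not the actual `md.impute_missing` implementation.

```python
import numpy as np
import pandas as pd


def impute_and_encode(frame):
    """Hypothetical stand-in for a cleaning step (not the md module's code)."""
    # Work on an explicit copy so assignments never touch a slice of the caller's frame.
    out = frame.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.select_dtypes(exclude="number").columns
    # Fill numeric gaps with each column's median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # Fill categorical gaps with a sentinel, then map each category to an integer code.
    for col in cat_cols:
        out[col] = out[col].fillna("missing").astype("category").cat.codes
    return out


# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "loan_amnt": [5000.0, np.nan, 2400.0],
    "term": ["36 months", "60 months", None],
})
clean = impute_and_encode(toy)
print(clean)
```

Integer codes are enough for tree-based models such as a random forest; a linear model would more likely need one-hot encoding instead.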

Fitting the model

For each year, we fit the model on that year's loans and score its predictions on the following year's loans.



In [19]:

    
df['issue_d'].describe()









    Out[19]:





count                  884766
unique                    101
top       2015-10-01 00:00:00
freq                    48473
first     2007-08-01 00:00:00
last      2015-12-01 00:00:00
Name: issue_d, dtype: object
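
The summary above shows that `issue_d` holds month-start dates from 2007 to 2015. Before splitting by year it can help to see how many loans fall in each year, since early years are much smaller than late ones. A minimal sketch, using a small made-up `issue_d` series rather than the real column:

```python
import pandas as pd

# Toy stand-in for df['issue_d'] (dates invented for illustration).
issue_d = pd.Series(pd.to_datetime([
    "2014-03-01", "2014-11-01", "2015-01-01", "2015-10-01", "2015-10-01",
]), name="issue_d")

# Extract the calendar year and count rows per year, in chronological order.
per_year = issue_d.dt.year.value_counts().sort_index()
print(per_year)
```

On the real data the same expression would be applied to `df['issue_d']`.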



In [20]:

    
rfr = RandomForestRegressor(n_estimators=10, max_features='sqrt')



In [24]:

    
# Columns excluded from the features: the target, raw dates, and the loan grade.
drop_cols = ['int_rate', 'issue_d', 'earliest_cr_line', 'grade']

# Train on year y, then report R^2 on year y + 1. 2015 is the last year with data,
# so the last training year is 2014.
for y in range(2008, 2015):
    last_year = df[df['issue_d'] == str(y)]
    last_year_X = last_year.drop(drop_cols, axis=1)
    last_year_y = last_year['int_rate']

    this_year = df[df['issue_d'] == str(y + 1)]
    this_year_X = this_year.drop(drop_cols, axis=1)
    this_year_y = this_year['int_rate']

    rfr.fit(last_year_X, last_year_y)
    score = rfr.score(this_year_X, this_year_y)
    print("Predicting year {} using {} data: {:.2f}".format(y + 1, y, score))



Predicting year 2009 using 2008 data: 0.16
Predicting year 2010 using 2009 data: 0.29
Predicting year 2011 using 2010 data: 0.25
Predicting year 2012 using 2011 data: 0.51
Predicting year 2013 using 2012 data: 0.34
Predicting year 2014 using 2013 data: 0.43
Predicting year 2015 using 2014 data: 0.36
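
One thing worth checking in the loop: depending on the pandas version, comparing a datetime column with a bare year string such as '2008' may match only a single date rather than the whole calendar year. The sketch below shows the same train-on-one-year, score-on-the-next idea with the year extracted explicitly via `.dt.year`. The helper `year_forward_scores`, the synthetic features `x1` and `x2`, and the toy dates are assumptions for illustration; the scores it produces say nothing about the Lending Club data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the loan frame (all values invented for illustration).
rng = np.random.RandomState(0)
n = 600
dates = ["2013-03-01", "2013-09-01", "2014-02-01", "2014-08-01", "2015-05-01"]
toy = pd.DataFrame({
    "issue_d": pd.to_datetime(rng.choice(dates, size=n)),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
toy["int_rate"] = 10 + 3 * toy["x1"] - 2 * toy["x2"] + rng.normal(scale=0.5, size=n)


def year_forward_scores(frame, model, target="int_rate", date_col="issue_d"):
    """Fit on each year present in the data and score (R^2) on the next one present."""
    year = frame[date_col].dt.year
    years = sorted(int(v) for v in year.unique())
    scores = {}
    # Pairs adjacent years in the sorted list; a gap in the data would pair
    # non-consecutive years.
    for prev, nxt in zip(years[:-1], years[1:]):
        train = frame[year == prev]
        test = frame[year == nxt]
        model.fit(train.drop([target, date_col], axis=1), train[target])
        scores[(prev, nxt)] = model.score(
            test.drop([target, date_col], axis=1), test[target]
        )
    return scores


model = RandomForestRegressor(n_estimators=10, max_features="sqrt", random_state=0)
scores = year_forward_scores(toy, model)
for (prev, nxt), r2 in scores.items():
    print("Predicting year {} using {} data: {:.2f}".format(nxt, prev, r2))
```

Selecting by `.dt.year` uses every loan issued in the year, so on the real data the training sets would be larger than a single-date match and the scores would likely differ from those printed above.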
