Cross-validating across years

What if we train on one year's loans and use that model to predict the next year's interest rates?


In [4]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import os
import sys
import sklearn
import sqlite3
import matplotlib

import numpy as np
import pandas as pd
import enchant as en
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
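# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# the same helpers live in sklearn.model_selection.
from sklearn.model_selection import train_test_split, cross_val_score, KFold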
from sklearn.linear_model import Ridge

src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
%aimport data
from data import make_dataset as md

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (16.0, 6.0)
plt.rcParams['legend.markerscale'] = 3
matplotlib.rcParams['font.size'] = 16.0



Data: Preparing for the model

Importing the raw data


In [5]:
DIR = os.getcwd() + "/../data/"
t = pd.read_csv(DIR + 'raw/lending-club-loan-data/loan.csv', low_memory=False)
t.head(3)


Out[5]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 1077501 1296599 5000.0 5000.0 4975.0 36 months 10.65 162.87 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1077430 1314167 2500.0 2500.0 2500.0 60 months 15.27 59.83 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1077175 1313524 2400.0 2400.0 2400.0 36 months 15.96 84.33 C C5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 74 columns

Cleaning, imputing missing values, feature engineering (some NLP)


In [6]:
t2 = md.clean_data(t)
t3 = md.impute_missing(t2)
df = md.simple_dataset(t3)


Now cleaning data.
Now imputing missing values and encoding categories.
/usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2824: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
Skipping NLP/geo stuff, and removing cols.

Fitting the model

For each year, we fit the model on that year's loans and score its predictions on the following year's loans.
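One way to express this procedure as a reusable helper is sketched below. It is not the notebook's code: it runs on a synthetic table with made-up columns and values, the helper name `year_over_year_scores` is hypothetical, and it assumes the date column is a pandas datetime so that `.dt.year` selects whole calendar years.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def year_over_year_scores(frame, date_col, target_col, model):
    """Fit on each calendar year and score (R^2) on the following year."""
    years = frame[date_col].dt.year
    features = frame.drop([date_col, target_col], axis=1)
    target = frame[target_col]
    scores = {}
    for year in sorted(years.unique())[:-1]:
        train = years == year
        test = years == year + 1
        if not test.any():
            continue  # no rows in the following year, nothing to score
        model.fit(features[train], target[train])
        scores[int(year) + 1] = model.score(features[test], target[test])
    return scores


# Synthetic stand-in for the loan table (hypothetical values, 2012-2014).
rng = np.random.RandomState(0)
n = 600
offsets = pd.to_timedelta(rng.randint(0, 3 * 365, n), unit='D')
toy = pd.DataFrame({
    'issue_d': pd.to_datetime('2012-01-01') + offsets,
    'loan_amnt': rng.uniform(1000, 35000, n),
    'dti': rng.uniform(0, 40, n),
})
toy['int_rate'] = 5 + 0.3 * toy['dti'] + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=10, max_features='sqrt',
                              random_state=0)
scores = year_over_year_scores(toy, 'issue_d', 'int_rate', model)
for year, r2 in sorted(scores.items()):
    print("Predicting year {} using {} data: {:.2f}".format(year, year - 1, r2))
```

Keying the result by the predicted year keeps the output aligned with the "Predicting year N using N-1 data" wording used in this notebook.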


In [19]:
df['issue_d'].describe()


Out[19]:
count                  884766
unique                    101
top       2015-10-01 00:00:00
freq                    48473
first     2007-08-01 00:00:00
last      2015-12-01 00:00:00
Name: issue_d, dtype: object
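
`describe()` shows the date range but not how many loans fall in each year, which matters when a single year is the entire training set. A small sketch on made-up dates, assuming `issue_d` is a datetime column as the `first`/`last` rows above suggest:

```python
import pandas as pd

# Hypothetical issue dates; the real column holds month-start timestamps.
issue_d = pd.Series(pd.to_datetime([
    '2013-11-01', '2013-12-01',
    '2014-01-01', '2014-01-01', '2014-06-01',
    '2015-03-01',
]))

# Count rows per calendar year, ordered by year.
per_year = issue_d.dt.year.value_counts().sort_index()
print(per_year)
```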

In [20]:
rfr = RandomForestRegressor(n_estimators=10, max_features='sqrt')

In [24]:
# Columns excluded from the features: the target, the two date columns, and 'grade'.
drop_cols = ['int_rate', 'issue_d', 'earliest_cr_line', 'grade']

# Caveat: if issue_d is a datetime column, `== str(y)` matches only rows dated
# y-01-01 (January's loans), not the whole calendar year.
for y in range(2008, 2015):
    last_year = df[df['issue_d'] == str(y)]
    last_year_X = last_year.drop(drop_cols, axis=1)
    last_year_y = last_year['int_rate']

    this_year = df[df['issue_d'] == str(y + 1)]
    this_year_X = this_year.drop(drop_cols, axis=1)
    this_year_y = this_year['int_rate']

    # Train on year y, then report R^2 on year y + 1.
    rfr.fit(last_year_X, last_year_y)
    print("Predicting year {} using {} data: \
        {:.2f}".format(y + 1, y, rfr.score(this_year_X, this_year_y)))


Predicting year 2009 using 2008 data:         0.16
Predicting year 2010 using 2009 data:         0.29
Predicting year 2011 using 2010 data:         0.25
Predicting year 2012 using 2011 data:         0.51
Predicting year 2013 using 2012 data:         0.34
Predicting year 2014 using 2013 data:         0.43
Predicting year 2015 using 2014 data:         0.36
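
The numbers above come from `rfr.score`, which for scikit-learn regressors is the coefficient of determination R^2: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean of the test targets, and negative values are worse than that. The sketch below computes it by hand on made-up values and checks it against `sklearn.metrics.r2_score`; the helper name `r_squared` is only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity returned by regressor.score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 12.0, 14.0, 16.0]   # hypothetical interest rates
good = [10.5, 11.5, 14.5, 15.5]     # close predictions
mean_only = [13.0] * 4              # always predict the mean
bad = [16.0, 14.0, 12.0, 10.0]      # reversed ordering

for name, pred in [('good', good), ('mean_only', mean_only), ('bad', bad)]:
    print("{:>9}: {:.2f}".format(name, r_squared(y_true, pred)))
```

Read this way, a score of 0.36 means the model trained on the earlier data explains roughly a third of the variance in the later year's interest rates.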