Following the baseline model and the feature-engineering work, we will now build a stronger predictive model, introducing a few new patterns along the way.
In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import os
import sys
import sklearn
import sqlite3
import matplotlib
import numpy as np
import pandas as pd
import enchant as en
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn releases
# make the project's src/ directory importable so we can reuse the make_dataset helpers
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
%aimport data
from data import make_dataset as md
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (16.0, 6.0)
plt.rcParams['legend.markerscale'] = 3
matplotlib.rcParams['font.size'] = 16.0
In [2]:
DIR = os.getcwd() + "/../data/"
t = pd.read_csv(DIR + 'raw/lending-club-loan-data/loan.csv', low_memory=False)
t.head()
Out[2]:
In [3]:
t2 = md.clean_data(t)
t3 = md.impute_missing(t2)
df = md.simple_dataset(t3)
# df = md.spelling_mistakes(t3) - skipping for now, too computationally expensive!
In [4]:
df['issue_d'].hist(bins=50)
plt.title('Seasonality in lending')
plt.ylabel('Frequency')
plt.xlabel('Year')
plt.show()
We can use past years as predictors of future years. One challenge with this approach is that it confounds time-sensitive trends (for example, global economic shocks to interest rates such as the 2008 financial crisis, or Lending Club's expansion into broader markets of debtors) with time-insensitive factors (such as an individual debtor's riskiness).
To account for this, we split the data into the following time-based blocks:
In [5]:
old = df[df['issue_d'] < '2015']
new = df[df['issue_d'] >= '2015']
old.shape, new.shape
Out[5]:
We'll use the pre-2015 data on interest rates (old) to fit a model and cross-validate it. We'll then use the post-2015 data (new) as a 'wild' dataset to test against.
In [6]:
X = old.drop(['int_rate', 'issue_d', 'earliest_cr_line', 'grade'], axis=1)
y = old['int_rate']
X.shape, y.shape
Out[6]:
In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[7]:
In [8]:
rfr = RandomForestRegressor(n_estimators=10, max_features='sqrt')
scores = cross_val_score(rfr, X, y, cv=3)
print("R^2: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2))
In [9]:
X_new = new.drop(['int_rate', 'issue_d', 'earliest_cr_line', 'grade'], axis=1)
y_new = new['int_rate']
new_scores = cross_val_score(rfr, X_new, y_new, cv=3)
print("R^2: {:.2f} (+/- {:.2f})".format(new_scores.mean(), new_scores.std() * 2))
In [10]:
# Use the full dataset (pre- and post-2015 loans) for the final model
X_total = df.drop(['int_rate', 'issue_d', 'earliest_cr_line', 'grade'], axis=1)
y_total = df['int_rate']
total_scores = cross_val_score(rfr, X_total, y_total, cv=3)
print("R^2: {:.2f} (+/- {:.2f})".format(total_scores.mean(), total_scores.std() * 2))
In [30]:
rfr.fit(X_total, y_total)
fi = pd.DataFrame({'feature': X_total.columns,
                   'importance': rfr.feature_importances_})
fi.sort_values(by='importance', ascending=False, inplace=True)
fi.head()
Out[30]:
In [33]:
top5 = fi.head()
top5.plot(kind='bar')
plt.xticks(range(5), top5['feature'])
plt.title('Feature importances (top 5 features)')
plt.ylabel('Relative importance')
plt.show()