Transform and Model

Let us build a regression model to predict the loan amount to be approved.


In [2]:
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Default Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [4]:
#Load the dataset
df = pd.read_csv("data/loan_data_clean.csv")

In [5]:
df.head()


Out[5]:
   default  amount  interest grade  years ownership   income  age
0        0    5000     10.65     B  10.00      RENT 24000.00   33
1        0    2400     10.99     C  25.00      RENT 12252.00   31
2        0   10000     13.49     C  13.00      RENT 49200.00   24
3        0    5000     10.99     A   3.00      RENT 36000.00   39
4        0    3000     10.99     E   9.00      RENT 48000.00   24

Transform Variables

Let us create the features and the target.


In [8]:
# Select the initial feature set
df_X = df[['age', 'income', 'ownership', 'years', 'grade']]

In [11]:
# Convert the categorical variables into numerical values (one-hot encoding)
df_X = pd.get_dummies(df_X)

In [12]:
# Create the feature set X
X = df_X

In [54]:
# Create the target from amount and default: the amount for loans that did not default, 0 otherwise
df['amount_non_default'] = df['amount'] * (1 - df['default'])

In [55]:
y = df['amount_non_default']

Build Model - Linear Regression


In [125]:
# import the sklearn linear model
from sklearn.linear_model import LinearRegression

In [126]:
# Initialise the linear regression model
model_ols = LinearRegression(normalize=True)

In [127]:
# Review the parameters in the Linear Regression
model_ols


Out[127]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

In [128]:
# Fit the model on the full dataset
model_ols.fit(X,y)


Out[128]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

In [129]:
# What are the coefficients of the model
model_ols.coef_


Out[129]:
array([  7.99973540e+00,   3.41670491e-02,   3.75524906e+01,
        -3.40659174e+16,  -3.40659174e+16,  -3.40659174e+16,
        -3.40659174e+16,   1.42620966e+17,   1.42620966e+17,
         1.42620966e+17,   1.42620966e+17,   1.42620966e+17,
         1.42620966e+17,   1.42620966e+17])

In [130]:
# What is the intercept of the model
model_ols.intercept_


Out[130]:
-1.0855504871673947e+17
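
The coefficients and the intercept are on the order of 10^16 to 10^17, a typical symptom of perfectly collinear one-hot columns: for each categorical variable the dummy columns sum to one, duplicating the intercept. A minimal sketch of one way to check this (the drop_first option is a standard pandas argument, not something used elsewhere in this notebook):

In [ ]:
# Drop one dummy level per categorical variable to remove the exact
# collinearity with the intercept; the coefficients should then come back
# to an interpretable scale.
df_X_reduced = pd.get_dummies(
    df[['age', 'income', 'ownership', 'years', 'grade']], drop_first=True)
model_check = LinearRegression(normalize=True)
model_check.fit(df_X_reduced, y)
model_check.coef_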

Calculate Model - Predictions & Error


In [131]:
# Predict y on the full dataset
y_pred_ols = model_ols.predict(X)

In [132]:
# import metrics from sklearn
from sklearn import metrics

In [133]:
# Calculate the mean squared error
metrics.mean_squared_error(y_pred_ols, y)


Out[133]:
40138831.382730052

Evaluate Model


In [134]:
# What is the R^2 score of the model
model_ols.score(X,y)


Out[134]:
0.097940101660104362

In [135]:
# What is the root mean squared error
np.sqrt(metrics.mean_squared_error(y_pred_ols, y))


Out[135]:
6335.5213978590627

In [136]:
# How does the RMSE compare with the standard deviation of the target
df.amount_non_default.std()


Out[136]:
6670.7111933826272
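
As a point of reference, a baseline that always predicts the mean of the target has an RMSE roughly equal to the target's standard deviation, so the model above improves on that baseline only modestly. A minimal sketch of such a baseline, assuming DummyRegressor from sklearn.dummy is acceptable here:

In [ ]:
# A mean-predicting baseline; its RMSE should be close to the standard
# deviation of amount_non_default shown above.
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy='mean')
baseline.fit(X, y)
np.sqrt(metrics.mean_squared_error(baseline.predict(X), y))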

Generalisation Error


In [137]:
# Get the function for the train-test split
from sklearn.model_selection import train_test_split

In [138]:
# Split the data into training and test sets - 80% and 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [139]:
# Initialise the model
model_ols_split = LinearRegression()

In [140]:
#Fit the model
model_ols_split.fit(X_train, y_train)


Out[140]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [142]:
# Make predictions for test and train
y_pred_split_train = model_ols_split.predict(X_train)
y_pred_split_test = model_ols_split.predict(X_test)

In [145]:
# Find the errors for train and test
error_ols_split_train = metrics.mean_squared_error(y_pred_split_train, y_train)
error_ols_split_test = metrics.mean_squared_error(y_pred_split_test, y_test)

In [148]:
error_ols_split_train, error_ols_split_test


Out[148]:
(40196922.046010435, 39906625.742081515)

In [147]:
# Find the generalisation error
generalisation_error = error_ols_split_test - error_ols_split_train
generalisation_error


Out[147]:
-290296.30392891914

Build Complex Model


In [150]:
# Import Polynomial Features
from sklearn.preprocessing import PolynomialFeatures

In [246]:
# Initialise polynomial features of degree 2 (interaction terms only)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

In [247]:
# Create the polynomial features
X_poly = poly.fit_transform(X)

In [248]:
# See the shape of the new feature set
X_poly.shape


Out[248]:
(29091, 105)
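
To see which of the 105 columns the transformer produced, its feature names can be inspected. A brief sketch; in older scikit-learn releases the method is get_feature_names(), which was later renamed get_feature_names_out():

In [ ]:
# Inspect the first few generated feature names (the original columns plus
# pairwise interaction terms).
poly.get_feature_names(list(df_X.columns))[:10]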

In [309]:
# Split the new features into train and test sets
X_poly_train, X_poly_test, y_poly_train, y_poly_test = train_test_split(
    X_poly, y, test_size=0.2, random_state=42)

In [324]:
# Initialise the model
model_ols_poly = LinearRegression(normalize=True)

In [325]:
# Fit the model
model_ols_poly.fit(X_poly_train, y_poly_train)


Out[325]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

In [326]:
# Make predictions for test and train
y_pred_poly_train = model_ols_poly.predict(X_poly_train)
y_pred_poly_test = model_ols_poly.predict(X_poly_test)

In [327]:
# Find the errors for train and test
error_ols_poly_train = metrics.mean_squared_error(y_pred_poly_train, y_poly_train)
error_ols_poly_test = metrics.mean_squared_error(y_pred_poly_test, y_poly_test)

In [328]:
error_ols_poly_train, error_ols_poly_test


Out[328]:
(38576398.190701269, 39240001.867846712)

In [329]:
# Find the generalisation error
generalisation_poly_error = error_ols_poly_test - error_ols_poly_train
generalisation_poly_error


Out[329]:
663603.67714544386

For Discussion

  • Why has the generalisation error gone up? (see the sketch after this list)
  • Should a complex model perform better than a simple one?
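
A minimal sketch (not part of the original notebook) to help explore these questions: compare the train and test MSE for models of increasing polynomial degree and watch how the gap between them behaves.

In [ ]:
# Train/test MSE and the generalisation gap for degrees 1 to 3.
for degree in [1, 2, 3]:
    pf = PolynomialFeatures(degree=degree, interaction_only=True, include_bias=False)
    Xp = pf.fit_transform(X)
    Xp_train, Xp_test, yp_train, yp_test = train_test_split(
        Xp, y, test_size=0.2, random_state=42)
    m = LinearRegression().fit(Xp_train, yp_train)
    mse_train = metrics.mean_squared_error(m.predict(Xp_train), yp_train)
    mse_test = metrics.mean_squared_error(m.predict(Xp_test), yp_test)
    print(degree, mse_train, mse_test, mse_test - mse_train)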

Regularization - Ridge


In [358]:
# Get ridge regression from linear_model
from sklearn.linear_model import Ridge

In [396]:
# Initialise the model with alpha = 0.1
model_ridge = Ridge(alpha = 0.1, normalize=True)

In [397]:
# Fit the model
model_ridge.fit(X_poly_train, y_poly_train)


Out[397]:
Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [398]:
# Make predictions for test and train
y_pred_ridge_train = model_ridge.predict(X_poly_train)
y_pred_ridge_test = model_ridge.predict(X_poly_test)

In [399]:
# Find the errors for train and test
error_ridge_train = metrics.mean_squared_error(y_pred_ridge_train, y_poly_train)
error_ridge_test = metrics.mean_squared_error(y_pred_ridge_test, y_poly_test)

In [400]:
error_ridge_train, error_ridge_test


Out[400]:
(39171885.390347548, 39591479.19005619)

In [401]:
# Find the generalisation error
generalisation_ridge_error = error_ridge_test - error_ridge_train
generalisation_ridge_error


Out[401]:
419593.79970864207

Cross Validation

Finding alpha using Cross Validation


In [402]:
# Get ridge regression with built-in cross-validation from linear_model
from sklearn.linear_model import RidgeCV

In [403]:
# Initialise the model with candidate alphas 0.1, 0.001, 0.0001
model_ridge_CV = RidgeCV(alphas=[0.1, 0.001, 0.0001], normalize = True)

In [404]:
# Fit the model
model_ridge_CV.fit(X_poly_train, y_poly_train)


Out[404]:
RidgeCV(alphas=[0.1, 0.001, 0.0001], cv=None, fit_intercept=True,
    gcv_mode=None, normalize=True, scoring=None, store_cv_values=False)

In [405]:
# Find the alpha chosen by cross-validation
model_ridge_CV.alpha_


Out[405]:
0.001
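
Cross-validation prefers alpha = 0.001 to the 0.1 used above. A short sketch refitting Ridge at the selected value so its errors can be compared with the earlier model:

In [ ]:
# Refit Ridge with the cross-validated alpha and compare the train and test
# errors with the alpha=0.1 model above.
model_ridge_best = Ridge(alpha=model_ridge_CV.alpha_, normalize=True)
model_ridge_best.fit(X_poly_train, y_poly_train)

error_ridge_best_train = metrics.mean_squared_error(
    model_ridge_best.predict(X_poly_train), y_poly_train)
error_ridge_best_test = metrics.mean_squared_error(
    model_ridge_best.predict(X_poly_test), y_poly_test)
error_ridge_best_train, error_ridge_best_test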

Exercise: Regularization - Lasso
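
A possible starting point (a sketch, not a reference solution), mirroring the Ridge workflow above with the Lasso estimator from sklearn.linear_model:

In [ ]:
# L1-regularised regression on the polynomial features. Try different alphas,
# or use LassoCV to select one by cross-validation.
from sklearn.linear_model import Lasso

model_lasso = Lasso(alpha=0.1, normalize=True)
model_lasso.fit(X_poly_train, y_poly_train)

error_lasso_train = metrics.mean_squared_error(
    model_lasso.predict(X_poly_train), y_poly_train)
error_lasso_test = metrics.mean_squared_error(
    model_lasso.predict(X_poly_test), y_poly_test)
error_lasso_train, error_lasso_test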


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: