Transform and Model

Let us build a Regression Model for prediciting the amount to be approved



In [1]:

    
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:

    
#Default Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)



In [3]:

    
#Load the dataset
df = pd.read_csv("data/loan_data_clean.csv")



In [4]:

    
df.head()

Transform Variables

Let us create feature and target



In [10]:

    
# Select the initial feature set
X_raw = df[['age', 'grade', 'years', 'ownership', 'income']]



In [9]:

    
# Convert the categorical variables in to numerical values
from sklearn.preprocessing import OneHotEncoder



In [11]:

    
# Create the feature set X
X = pd.get_dummies(X_raw)



In [14]:

    
# Create the target from amount and default
y = df.amount * (1 - df.default)



In [15]:

    
y.head()









    Out[15]:





0     5000
1     2400
2    10000
3     5000
4     3000
dtype: int64

Build Model - Linear Regression



In [16]:

    
# import the sklearn linear model
from sklearn.linear_model import LinearRegression



In [20]:

    
# initiate the Linear Regression Model
model_ols = LinearRegression(normalize = True)



In [21]:

    
# Review the parameters in the Linear Regression
model_ols









    Out[21]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)



In [22]:

    
model_ols.fit(X,y)









    Out[22]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)



In [25]:

    
# What are the coeffecients of the model
model_ols.coef_









    Out[25]:





array([  8.22428831e+00,   3.56559740e+01,   3.41152469e-02,
        -5.56093383e+02,   4.63115647e+02,  -5.27037725e+02,
         5.13726114e+02,   1.53946293e+03,   9.03022126e+02,
         2.15733968e+03,   4.37125155e+02,  -6.26484704e+02,
        -2.02390459e+02,  -3.56487860e+02])



In [26]:

    
model_ols.intercept_









    Out[26]:





5867.0568590363873

Calculate Model - Predictions & Error



In [29]:

    
# predict the y
y_pred = model_ols.predict(X)



In [30]:

    
# import metrics from sklearn
np.sum((y_pred - y)**2)/y.shape[0]









    Out[30]:





40135325.120906696



In [31]:

    
from sklearn.metrics import mean_squared_error



In [32]:

    
# Calculate mean squared error
mean_squared_error(y_pred,y)









    Out[32]:





40135325.120906696



In [33]:

    
# Root mean square error
np.sqrt(np.sum((y_pred - y)**2)/y.shape[0])









    Out[33]:





6335.2446772722751



In [45]:

    
plt.hist(y, bins=50, alpha=0.4, label="true")
plt.hist(y_pred, bins=50, alpha=0.4, label="pred")
plt.legend()
plt.show()



In [46]:

    
plt.scatter(y_pred, y)
plt.show()

Evaluate Model



In [53]:

    
# What is the score given by the model
model_ols.score(X, y)









    Out[53]:





0.098018899623945832



In [56]:

    
# What is the root mean square error
np.sqrt(mean_squared_error(y_pred, y))









    Out[56]:





6335.2446772722751



In [55]:

    
# How does rmse compare with standard deviation of the target
y.std()









    Out[55]:





6670.7111933826272

Generalisation Error



In [57]:

    
# Get the module for train test split
from sklearn.model_selection import train_test_split



In [59]:

    
train_test_split?



In [61]:

    
#Split the data in test and training - 20% and 80%
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)



In [63]:

    
X_train.shape, X_test.shape, y_train.shape, y_test.shape









    Out[63]:





((23272, 14), (5819, 14), (23272,), (5819,))



In [64]:

    
#Initiate the model
model_general = LinearRegression(normalize=True)



In [65]:

    
#Fit the model
model_general.fit(X_train, y_train)









    Out[65]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)



In [66]:

    
# Make predictions for test and train
y_pred_general_train = model_general.predict(X_train)
y_pred_general_test = model_general.predict(X_test)



In [68]:

    
#Find the errors for test and train
mse_general_train = mean_squared_error(y_pred_general_train, y_train)
mse_general_test = mean_squared_error(y_pred_general_test, y_test)



In [70]:

    
# Find the generalisation error
mse_general_train - mse_general_test, mse_general_train, mse_general_test









    Out[70]:





(769603.69098217785, 40303216.575326569, 39533612.884344392)

Build Complex Model



In [71]:

    
# Import Polynominal Features
from sklearn.preprocessing import PolynomialFeatures



In [78]:

    
# Initiate Polynominal Features for Degree 2
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)



In [79]:

    
# Create Polynominal Features
X_poly = poly.fit_transform(X)



In [80]:

    
# See the new dataset
X.shape, X_poly.shape









    Out[80]:





((29091, 14), (29091, 105))



In [81]:

    
#Create split and train
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly,y, test_size = 0.2)



In [82]:

    
# Initiate the model
model_poly = LinearRegression(normalize=True)



In [85]:

    
# Fit the model
model_poly.fit(X_train_poly, y_train_poly)









    Out[85]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)



In [86]:

    
# Make predictions for test and train
y_pred_poly_train = model_poly.predict(X_train_poly) 
y_pred_poly_test = model_poly.predict(X_test_poly)



In [89]:

    
#Find the errors for test and train
mse_poly_train = mean_squared_error(y_pred_poly_train,y_train_poly )
mse_poly_test = mean_squared_error(y_pred_poly_test,y_test_poly )



In [90]:

    
# Find the generalisation error
mse_poly_test - mse_poly_train, mse_poly_train, mse_poly_test









    Out[90]:





(2152944.5680075735, 38314644.263922311, 40467588.831929885)



In [91]:

    
mse_general_train - mse_general_test, mse_general_train, mse_general_test









    Out[91]:





(769603.69098217785, 40303216.575326569, 39533612.884344392)

For Discussion

Why has the generalisation error gone up?
Should a complex model perform better than a simple one?

Regularization - Ridge



In [39]:

    
# Get ridge regression from linear_models



In [40]:

    
# Initiate model



In [41]:

    
# Fit the model



In [42]:

    
# Make predictions for test and train



In [43]:

    
#Find the errors for test and train



In [ ]:



In [44]:

    
# Find the generalisation error

Cross Validation

Finding alpha using Cross Validation



In [45]:

    
# Get ridge regression from linear_models



In [46]:

    
# Initiate model with alphas = 0.1, 0.001, 0.0001



In [47]:

    
# Fit the model



In [48]:

    
# Find the correct alpha

Exercise: Regularization - Lasso



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	amount	interest	grade	years	ownership	income	age
0	5000	10.65	B	10.00	RENT	24000.00	33
1	2400	10.99	C	25.00	RENT	12252.00	31
2	10000	13.49	C	13.00	RENT	49200.00	24
3	5000	10.99	A	3.00	RENT	36000.00	39
4	3000	10.99	E	9.00	RENT	48000.00	24