Model - Logistic Regression

Raw Data

This is the historical data that the bank has provided. It has the following columns:

Application Attributes:

  • years: Number of years the applicant has been employed
  • ownership: Home-ownership status of the applicant (e.g. RENT)
  • income: Annual income of the applicant
  • age: Age of the applicant

Behavioural Attributes:

  • grade: Credit grade of the applicant

Outcome Variables:

  • amount: Amount of the loan provided to the applicant
  • default: Whether the applicant has defaulted or not
  • interest: Interest rate charged to the applicant

You are provided with the following data: loan_data.csv
A cleaned-up dataset, with outliers removed and missing values treated, is also provided: loan_data_clean.csv

Let us build some intuition around the loan data.


In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#Default Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
#Load the dataset
df = pd.read_csv("data/loan_data_clean.csv")

In [4]:
df.head()


Out[4]:
default amount interest grade years ownership income age
0 0 5000 10.65 B 10.00 RENT 24000.00 33
1 0 2400 10.99 C 25.00 RENT 12252.00 31
2 0 10000 13.49 C 13.00 RENT 49200.00 24
3 0 5000 10.99 A 3.00 RENT 36000.00 39
4 0 3000 10.99 E 9.00 RENT 48000.00 24

Explore


In [5]:
# Distribution of default
df.default.hist()


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x11af48630>

In [6]:
# Fraction of applicants who defaulted (the default rate)
df.default.sum()/df.default.count()


Out[6]:
0.11092777835069266

In [7]:
# Plot three variables: income vs interest, coloured by default
plt.scatter(df.income, df.interest, c=df.default, alpha=0.4)
plt.xlabel("income")
plt.ylabel("interest")


Out[7]:
<matplotlib.text.Text at 0x11b2ea400>

In [8]:
# Log-transform income to spread out its skewed distribution
plt.scatter(np.log(df.income), df.interest, c=df.default, alpha=0.4)


Out[8]:
<matplotlib.collections.PathCollection at 0x11e2d39e8>

Logistic Regression - Two Variables: income and interest


In [9]:
# Get the module
from sklearn.linear_model import LogisticRegression

In [10]:
# Define the features
df['incomeLog'] = np.log(df.income)
X2 = df[['incomeLog', 'interest']]

In [11]:
# Define the target
y = df.default

In [12]:
# Initialize the model
clf = LogisticRegression()

In [13]:
# Fit the model
clf.fit(X2,y)


Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
# Calculate the Accuracy Score
clf.score(X2,y)


Out[14]:
0.88907222164930733

In [15]:
# Calculate the predictions
clf.predict(X2)


Out[15]:
array([0, 0, 0, ..., 0, 0, 0])

In [16]:
clf.predict(X2).sum()


Out[16]:
0

In [17]:
# Calculate the probabilities
clf.predict_proba(X2)


Out[17]:
array([[ 0.86339483,  0.13660517],
       [ 0.81430966,  0.18569034],
       [ 0.85198499,  0.14801501],
       ..., 
       [ 0.94785985,  0.05214015],
       [ 0.96513583,  0.03486417],
       [ 0.85217582,  0.14782418]])

In [18]:
plt.hist(clf.predict_proba(X2)[:,0], bins=100)
#plt.hist(clf.predict_proba(X2)[:,1], bins=100)
plt.show()


Plot the Decision Boundaries


In [19]:
def plot_classifier(X, y, clf):
    # Build a 100x100 grid spanning the range of the two features
    x1_min, x1_max = X.iloc[:,0].min(), X.iloc[:,0].max()
    x2_min, x2_max = X.iloc[:,1].min(), X.iloc[:,1].max()
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, (x1_max - x1_min)/100), 
                           np.arange(x2_min, x2_max, (x2_max - x2_min)/100))
    # Predicted probability of class 0 at every grid point
    Z = clf.predict_proba(np.c_[xx1.ravel(), xx2.ravel()])[:,0]
    Z = Z.reshape(xx1.shape)
    # Shade the probability surface and overlay the actual observations
    cs = plt.contourf(xx1, xx2, Z, cmap="magma", alpha=0.3)
    plt.scatter(x=X.iloc[:,0], y=X.iloc[:,1], c=y, s=50, cmap="viridis", alpha=0.3)
    plt.colorbar(cs)

In [20]:
plot_classifier(X2,y,clf)


Exercise: What is the range of the predicted probabilities?


In [ ]:
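
A minimal sketch of one way to check this, using the clf fitted on the two variables above:

In [ ]:
# Probability of default (class 1) for every applicant
probs = clf.predict_proba(X2)[:, 1]
probs.min(), probs.max()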

Exercise: What is the accuracy if you change the cut-off threshold?


In [ ]:
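
A minimal sketch, assuming a hypothetical cut-off of 0.3 on the predicted default probability instead of the default 0.5:

In [ ]:
# Classify as default whenever P(default) exceeds the chosen threshold
threshold = 0.3
pred = (clf.predict_proba(X2)[:, 1] > threshold).astype(int)
(pred == y).mean()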

Decision Tree


In [21]:
from sklearn.tree import DecisionTreeClassifier

In [22]:
clf_dt = DecisionTreeClassifier()

In [23]:
clf_dt.fit(X2,y)


Out[23]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [24]:
plot_classifier(X2,y,clf_dt)



In [25]:
import pydotplus 
from IPython.display import Image

In [26]:
from sklearn import tree

In [27]:
# dot_data = tree.export_graphviz(clf_dt, out_file='tree.dot', feature_names=X2.columns,
#                                class_names=['0', '1'], filled=True, 
#                                rounded=True, special_characters=True)

In [28]:
# graph = pydotplus.graph_from_dot_file('tree.dot')

In [29]:
# Image(graph.create_png())

Regularization (Hyperparameter Tuning)


In [30]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

In [31]:
clf_LogCV = LogisticRegressionCV(Cs=10, cv=StratifiedKFold(5), scoring="accuracy")

In [32]:
clf_LogCV.fit(X2,y)


Out[32]:
LogisticRegressionCV(Cs=10, class_weight=None,
           cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
           dual=False, fit_intercept=True, intercept_scaling=1.0,
           max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2',
           random_state=None, refit=True, scoring='accuracy',
           solver='lbfgs', tol=0.0001, verbose=0)

In [33]:
clf_LogCV.Cs_


Out[33]:
array([  1.00000000e-04,   7.74263683e-04,   5.99484250e-03,
         4.64158883e-02,   3.59381366e-01,   2.78255940e+00,
         2.15443469e+01,   1.66810054e+02,   1.29154967e+03,
         1.00000000e+04])

In [34]:
clf_LogCV.C_


Out[34]:
array([ 0.0001])

In [35]:
clf_LogCV.predict(X2)


Out[35]:
array([0, 0, 0, ..., 0, 0, 0])

In [36]:
clf_LogCV.score(X2,y)


Out[36]:
0.88907222164930733

Logistic Regression - All Variables


In [11]:
# Preprocess the data

In [ ]:
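
One possible preprocessing sketch: keep the numeric columns, reuse the log-transformed income, and one-hot encode the categorical columns grade and ownership. The feature set and the name X_all are choices made here, not the only option:

In [ ]:
# Assemble the full feature matrix with one-hot encoded categoricals
X_all = pd.concat([
    df[['amount', 'interest', 'years', 'age', 'incomeLog']],
    pd.get_dummies(df.grade, prefix='grade'),
    pd.get_dummies(df.ownership, prefix='ownership')
], axis=1)
X_all.head()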


In [ ]:


In [65]:
# Build the Model

In [ ]:
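
A minimal sketch, assuming the X_all feature matrix from the preprocessing sketch above and the target y defined earlier:

In [ ]:
# Fit a logistic regression on all the features
clf_all = LogisticRegression()
clf_all.fit(X_all, y)
clf_all.score(X_all, y)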


In [ ]:


In [1]:
# Choose a Threshold

In [ ]:
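
A minimal sketch, assuming a hypothetical threshold of 0.25 on the predicted default probability:

In [ ]:
# Convert predicted probabilities into class labels at the chosen threshold
threshold = 0.25
probs_all = clf_all.predict_proba(X_all)[:, 1]
pred_all = (probs_all > threshold).astype(int)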


In [ ]:


In [2]:
# Calculate the accuracy

In [ ]:
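
A minimal sketch comparing the thresholded predictions with the actual defaults:

In [ ]:
# Fraction of applicants classified correctly at the chosen threshold
from sklearn.metrics import accuracy_score
accuracy_score(y, pred_all)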


In [ ]:

Calculate the error metrics (a combined sketch follows the list):

  • Accuracy
  • Precision
  • Recall
  • Sensitivity
  • Specificity
  • Receiver Operating Characteristic (ROC) curve
  • Area Under the Curve (AUC)

In [ ]:
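
A minimal sketch of these metrics using sklearn.metrics, assuming the pred_all labels and probs_all probabilities from the cells above. Sensitivity is the same quantity as recall, and specificity comes from the confusion matrix:

In [ ]:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, roc_curve, roc_auc_score)

print("Accuracy :", accuracy_score(y, pred_all))
print("Precision:", precision_score(y, pred_all))
print("Recall   :", recall_score(y, pred_all))   # also called sensitivity

tn, fp, fn, tp = confusion_matrix(y, pred_all).ravel()
print("Specificity:", tn / (tn + fp))

# The ROC curve and the area under it use the probabilities, not the labels
fpr, tpr, thresholds = roc_curve(y, probs_all)
print("AUC:", roc_auc_score(y, probs_all))
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")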


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Choosing the Error Metric

What is a good error metric to choose in this case?


In [ ]:
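
Only about 11% of applicants default, so a classifier that predicts "no default" for everyone already reaches roughly 89% accuracy, which is exactly what the two-variable model above achieved. A short sketch makes this concrete and suggests why recall (or AUC) is more informative here:

In [ ]:
# A trivial "never default" baseline matches the earlier accuracy score
from sklearn.metrics import recall_score
baseline = np.zeros_like(y)
print("Baseline accuracy:", (baseline == y).mean())
print("Baseline recall  :", recall_score(y, baseline))  # 0 - it catches no defaulters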


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Regularization - L1 and L2


In [ ]:
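
A minimal sketch comparing L1 and L2 penalties at a fixed C, assuming the X_all features from above; the liblinear solver supports both penalties:

In [ ]:
# L1 (lasso-style) penalty can drive some coefficients to exactly zero
clf_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
clf_l1.fit(X_all, y)

# L2 (ridge-style) penalty shrinks coefficients but keeps them non-zero
clf_l2 = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
clf_l2.fit(X_all, y)

print("L1 coefficients:", clf_l1.coef_)
print("L2 coefficients:", clf_l2.coef_)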


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Feature Selection


In [ ]:
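
A minimal sketch using SelectFromModel with an L1-penalized logistic regression to keep only the features with non-zero coefficients; recursive feature elimination (RFE) would be another reasonable option:

In [ ]:
from sklearn.feature_selection import SelectFromModel

# Keep the features whose L1 coefficients are non-zero
selector = SelectFromModel(LogisticRegression(penalty='l1', C=1.0, solver='liblinear'))
selector.fit(X_all, y)
selected = X_all.columns[selector.get_support()]
print("Selected features:", list(selected))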


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: