Exercise 04

Logistic regression for credit scoring

Banks play a crucial role in market economies. They decide who can obtain financing and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

Dataset

Attribute Information:

Variable Name Description Type
SeriousDlqin2yrs Person experienced 90 days past due delinquency or worse Y/N
RevolvingUtilizationOfUnsecuredLines Total balance on credit lines divided by the sum of credit limits percentage
age Age of borrower in years integer
NumberOfTime30-59DaysPastDueNotWorse Number of times borrower has been 30-59 days past due integer
DebtRatio Monthly debt payments divided by monthly gross income percentage
MonthlyIncome Monthly income real
NumberOfOpenCreditLinesAndLoans Number of open loans and lines of credit integer
NumberOfTimes90DaysLate Number of times borrower has been 90 days or more past due integer
NumberRealEstateLoansOrLines Number of mortgage and real estate loans integer
NumberOfDependents Number of dependents in family integer
NumberOfTime60-89DaysPastDueNotWorse Number of times borrower has been 60-89 days past due integer

Read the data into Pandas


In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()


Out[1]:
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 0.766127 45.0 2.0 0.802982 9120.0 13.0 0.0 6.0 0.0 2.0
1 0 0.957151 40.0 0.0 0.121876 2600.0 4.0 0.0 0.0 0.0 1.0
2 0 0.658180 38.0 1.0 0.085113 3042.0 2.0 1.0 0.0 0.0 0.0
3 0 0.233810 30.0 0.0 0.036050 3300.0 5.0 0.0 0.0 0.0 0.0
4 0 0.907239 49.0 1.0 0.024926 63588.0 7.0 0.0 1.0 0.0 0.0

In [2]:
data.shape


Out[2]:
(112915, 11)

Drop missing values


In [3]:
data.isnull().sum(axis=0)


Out[3]:
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [4]:
data.dropna(inplace=True)
data.shape


Out[4]:
(108648, 11)

Create X and y


In [5]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [6]:
y.mean()


Out[6]:
0.06742876076872101

Exercise 4.1

Split the data into training and testing sets


In [ ]:
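One possible sketch of this step, assuming scikit-learn's `train_test_split` is available. The synthetic `X` and `y` below are stand-ins for the real features and labels from the cells above, so the snippet runs on its own; with the real data you would pass `X` and `y` directly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 features and a ~7% positive rate, mirroring y.mean() above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)

# stratify=y keeps the low default rate roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
X_train.shape, X_test.shape
```

Stratifying matters here because defaulters are rare; a plain random split could leave the test set with too few of them.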

Exercise 4.2

Fit a logistic regression model and examine the coefficients


In [ ]:
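A minimal sketch, assuming scikit-learn's `LogisticRegression`. Here `X_train` and `y_train` are recreated synthetically so the cell is self-contained; in the notebook they would come from the split in Exercise 4.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training split from Exercise 4.1.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(700, 10))
y_train = (rng.random(700) < 0.07).astype(int)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# One coefficient per feature; with the real data, pair them with X.columns
# to see which variables push the predicted default probability up or down.
print(logreg.coef_)
print(logreg.intercept_)
```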

Exercise 4.3

Make predictions on the testing set and calculate the accuracy


In [ ]:
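A sketch of this step under the same assumptions as above (scikit-learn, synthetic stand-in data so the snippet is self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logreg.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
```

Note that with only ~7% defaulters, always predicting "good customer" already scores about 93% accuracy, so accuracy alone is a weak metric on this data.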

Exercise 4.4

Compute the confusion matrix of the predictions

What is the percentage of detected bad customers?


In [ ]:
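A possible sketch, again with synthetic stand-in data so it runs on its own. The "percentage of detected bad customers" is the recall (sensitivity) on the positive class, which can be read off the confusion matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logreg.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # rows: true class, cols: predicted
tn, fp, fn, tp = cm.ravel()
# Share of actual bad customers that the model flags.
detected = tp / (tp + fn)
print(cm)
print(detected)
```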

Exercise 4.5

Increase sensitivity by lowering the threshold for predicting a bad customer

Create a new classifier by changing the probability threshold to 0.3

What is the new confusion matrix?

What is the new percentage of detected bad customers?


In [ ]:
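A sketch of the thresholding idea, with the same synthetic stand-in data as above. `predict()` uses a 0.5 cutoff on the predicted probability; applying a 0.3 cutoff to `predict_proba` flags more customers as bad, trading extra false positives for higher sensitivity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = logreg.predict_proba(X_test)[:, 1]   # P(SeriousDlqin2yrs = 1)
y_pred_03 = (proba >= 0.3).astype(int)       # lowered threshold

cm_03 = confusion_matrix(y_test, y_pred_03)
tn, fp, fn, tp = cm_03.ravel()
print(cm_03)
print(tp / (tp + fn))   # new share of detected bad customers
```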