Exercise 04

Logistic regression for credit scoring

Banks play a crucial role in market economies. They decide who can obtain financing and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

Dataset

Attribute Information:

Variable Name Description Type
SeriousDlqin2yrs Person experienced 90 days past due delinquency or worse Y/N
RevolvingUtilizationOfUnsecuredLines Total balance on credit lines divided by the sum of credit limits percentage
age Age of borrower in years integer
NumberOfTime30-59DaysPastDueNotWorse Number of times borrower has been 30-59 days past due integer
DebtRatio Monthly debt payments divided by monthly gross income percentage
MonthlyIncome Monthly income real
NumberOfOpenCreditLinesAndLoans Number of open loans and lines of credit integer
NumberOfTimes90DaysLate Number of times borrower has been 90 days or more past due integer
NumberRealEstateLoansOrLines Number of mortgage and real estate loans integer
NumberOfDependents Number of dependents in family integer
NumberOfTime60-89DaysPastDueNotWorse Number of times borrower has been 60-89 days past due integer

Read the data into Pandas


In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()


Out[1]:
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 0.766127 45.0 2.0 0.802982 9120.0 13.0 0.0 6.0 0.0 2.0
1 0 0.957151 40.0 0.0 0.121876 2600.0 4.0 0.0 0.0 0.0 1.0
2 0 0.658180 38.0 1.0 0.085113 3042.0 2.0 1.0 0.0 0.0 0.0
3 0 0.233810 30.0 0.0 0.036050 3300.0 5.0 0.0 0.0 0.0 0.0
4 0 0.907239 49.0 1.0 0.024926 63588.0 7.0 0.0 1.0 0.0 0.0

In [2]:
data.shape


Out[2]:
(112915, 11)

Drop missing values


In [3]:
data.isnull().sum(axis=0)


Out[3]:
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [4]:
data.dropna(inplace=True)
data.shape


Out[4]:
(108648, 11)

Create X and y


In [5]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [6]:
y.mean()


Out[6]:
0.06742876076872101

Exercise 4.1

Split the data into training and testing sets


In [ ]:
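One possible sketch of this step, assuming scikit-learn's `train_test_split` is available. The synthetic `X` and `y` below are stand-ins for the real features and labels from the cells above, so the snippet runs on its own; with the real data you would pass `X` and `y` directly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 features and a ~7% positive rate, mirroring y.mean() above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)

# stratify=y keeps the low default rate roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
X_train.shape, X_test.shape
```

Stratifying matters here because defaulters are rare; a plain random split could leave the test set with too few of them.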

Exercise 4.2

Fit a logistic regression model and examine the coefficients


In [ ]:
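A minimal sketch, assuming scikit-learn's `LogisticRegression`. Here `X_train` and `y_train` are recreated synthetically so the cell is self-contained; in the notebook they would come from the split in Exercise 4.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training split from Exercise 4.1.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(700, 10))
y_train = (rng.random(700) < 0.07).astype(int)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# One coefficient per feature; with the real data, pair them with X.columns
# to see which variables push the predicted default probability up or down.
print(logreg.coef_)
print(logreg.intercept_)
```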

Exercise 4.3

Make predictions on the testing set and calculate the accuracy


In [ ]:
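A sketch of this step under the same assumptions as above (scikit-learn, synthetic stand-in data so the snippet is self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logreg.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
```

Note that with only ~7% defaulters, always predicting "good customer" already scores about 93% accuracy, so accuracy alone is a weak metric on this data.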

Exercise 4.4

Compute the confusion matrix of the predictions

What is the percentage of detected bad customers?


In [ ]:
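A possible sketch, again with synthetic stand-in data so it runs on its own. The "percentage of detected bad customers" is the recall (sensitivity) on the positive class, which can be read off the confusion matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logreg.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # rows: true class, cols: predicted
tn, fp, fn, tp = cm.ravel()
# Share of actual bad customers that the model flags.
detected = tp / (tp + fn)
print(cm)
print(detected)
```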

Exercise 4.5

Increase sensitivity by lowering the threshold for predicting a bad customer

Create a new classifier by changing the probability threshold to 0.3

What is the new confusion matrix?

What is the new percentage of detected bad customers?


In [ ]:
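A sketch of the thresholding idea, with the same synthetic stand-in data as above. `predict()` uses a 0.5 cutoff on the predicted probability; applying a 0.3 cutoff to `predict_proba` flags more customers as bad, trading extra false positives for higher sensitivity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.07).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = logreg.predict_proba(X_test)[:, 1]   # P(SeriousDlqin2yrs = 1)
y_pred_03 = (proba >= 0.3).astype(int)       # lowered threshold

cm_03 = confusion_matrix(y_test, y_pred_03)
tn, fp, fn, tp = cm_03.ravel()
print(cm_03)
print(tp / (tp + fn))   # new share of detected bad customers
```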