default of credit card clients Data Set

The assignment was as follows:

In the workshop for this week, you are to select a data set from the UCI Machine Learning Repository and based on the recommended analysis type, wrangle the data into a fitted model, showing some model evaluation. In particular:

  • Layout the data into a dataset X and targets y.
  • Choose regression, classification, or clustering and build the best model you can from it.
  • Report an evaluation of the model built
  • Visualize aspects of your model (optional)
  • Compare and contrast different model families

Credit Card Example

Downloaded from the UCI Machine Learning Repository on February 26, 2015. The first thing is to fully describe your data in a README file. The dataset description is as follows:

  • Data Set: Multivariate
  • Attribute: Integer, Real
  • Tasks: Classification,
  • Instances: 30000
  • Attributes: 24

Data Set Information:

This research aimed at the case of customers’ default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel “Sorting Smoothing Method” to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default.

Attribute Information:

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

X2: Gender (1 = male; 2 = female).

X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4: Marital status (1 = married; 2 = single; 3 = others).

X5: Age (year).

X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

Relevant Papers:

Data Exploration

Source:

Name: I-Cheng Yeh email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan. other contact information: 886-2-26215656 ext. 3181


In [1]:
%matplotlib inline

import os
import json
import time
import pickle
import requests


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split as tts
from sklearn import metrics
from sklearn import cross_validation
from sklearn.cross_validation import KFold

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


C:\Users\eobrien\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [2]:
data = pd.read_csv('data/dataset.csv')

In [3]:
data.head()


Out[3]:
Unnamed: 0 X1 X2 X3 X4 X5 X6 X7 X8 X9 ... X15 X16 X17 X18 X19 X20 X21 X22 X23 Y
0 ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
2 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0

5 rows × 25 columns


In [4]:
data.iloc[0]
df = data

new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df =df.rename(columns = new_header) #set the header row as the df header

In [5]:
#this forces all of the columns to become numeric

for c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')

In [6]:
# Describe the dataset
print(df.describe())
df.info()


                 ID       LIMIT_BAL           SEX     EDUCATION      MARRIAGE  \
count  30000.000000    30000.000000  30000.000000  30000.000000  30000.000000   
mean   15000.500000   167484.322667      1.603733      1.853133      1.551867   
std     8660.398374   129747.661567      0.489129      0.790349      0.521970   
min        1.000000    10000.000000      1.000000      0.000000      0.000000   
25%     7500.750000    50000.000000      1.000000      1.000000      1.000000   
50%    15000.500000   140000.000000      2.000000      2.000000      2.000000   
75%    22500.250000   240000.000000      2.000000      2.000000      2.000000   
max    30000.000000  1000000.000000      2.000000      6.000000      3.000000   

                AGE         PAY_0         PAY_2         PAY_3         PAY_4  \
count  30000.000000  30000.000000  30000.000000  30000.000000  30000.000000   
mean      35.485500     -0.016700     -0.133767     -0.166200     -0.220667   
std        9.217904      1.123802      1.197186      1.196868      1.169139   
min       21.000000     -2.000000     -2.000000     -2.000000     -2.000000   
25%       28.000000     -1.000000     -1.000000     -1.000000     -1.000000   
50%       34.000000      0.000000      0.000000      0.000000      0.000000   
75%       41.000000      0.000000      0.000000      0.000000      0.000000   
max       79.000000      8.000000      8.000000      8.000000      8.000000   

                  ...                  BILL_AMT4      BILL_AMT5  \
count             ...               30000.000000   30000.000000   
mean              ...               43262.948967   40311.400967   
std               ...               64332.856134   60797.155770   
min               ...             -170000.000000  -81334.000000   
25%               ...                2326.750000    1763.000000   
50%               ...               19052.000000   18104.500000   
75%               ...               54506.000000   50190.500000   
max               ...              891586.000000  927171.000000   

           BILL_AMT6       PAY_AMT1      PAY_AMT2      PAY_AMT3  \
count   30000.000000   30000.000000  3.000000e+04   30000.00000   
mean    38871.760400    5663.580500  5.921163e+03    5225.68150   
std     59554.107537   16563.280354  2.304087e+04   17606.96147   
min   -339603.000000       0.000000  0.000000e+00       0.00000   
25%      1256.000000    1000.000000  8.330000e+02     390.00000   
50%     17071.000000    2100.000000  2.009000e+03    1800.00000   
75%     49198.250000    5006.000000  5.000000e+03    4505.00000   
max    961664.000000  873552.000000  1.684259e+06  896040.00000   

            PAY_AMT4       PAY_AMT5       PAY_AMT6  default payment next month  
count   30000.000000   30000.000000   30000.000000                30000.000000  
mean     4826.076867    4799.387633    5215.502567                    0.221200  
std     15666.159744   15278.305679   17777.465775                    0.415062  
min         0.000000       0.000000       0.000000                    0.000000  
25%       296.000000     252.500000     117.750000                    0.000000  
50%      1500.000000    1500.000000    1500.000000                    0.000000  
75%      4013.250000    4031.500000    4000.000000                    0.000000  
max    621000.000000  426529.000000  528666.000000                    1.000000  

[8 rows x 25 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 1 to 30000
Data columns (total 25 columns):
ID                            30000 non-null int64
LIMIT_BAL                     30000 non-null int64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null int64
BILL_AMT2                     30000 non-null int64
BILL_AMT3                     30000 non-null int64
BILL_AMT4                     30000 non-null int64
BILL_AMT5                     30000 non-null int64
BILL_AMT6                     30000 non-null int64
PAY_AMT1                      30000 non-null int64
PAY_AMT2                      30000 non-null int64
PAY_AMT3                      30000 non-null int64
PAY_AMT4                      30000 non-null int64
PAY_AMT5                      30000 non-null int64
PAY_AMT6                      30000 non-null int64
default payment next month    30000 non-null int64
dtypes: int64(25)
memory usage: 5.7 MB

In [7]:
#changing this so it is easier to work with

df.columns =['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'Default']

In [9]:
feature_cols = ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

In [11]:
X = df[feature_cols]

In [12]:
y = df.Default

In [ ]:


In [ ]:


In [13]:
estimator = RandomForestClassifier()

In [17]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=0)

In [18]:
for train, test in KFold(len(X), n_folds=12, shuffle=True):
        X_train, X_test = X.iloc[train], X.iloc[test]
        y_train, y_test = y.iloc[train], y.iloc[test]   
        estimator

In [19]:
df.iloc[test]


Out[19]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 Default
30 30 50000 1 1 2 26 0 0 0 0 ... 17907 18375 11400 1500 1500 1000 1000 1600 0 0
38 38 60000 2 2 2 22 0 0 0 0 ... 6026 -28335 18660 1500 1518 2043 0 47671 617 0
41 41 360000 1 1 2 33 0 0 0 0 ... 628699 195969 179224 10000 7000 6000 188840 28000 4000 0
43 43 10000 1 2 2 22 0 0 0 0 ... 3576 3670 4451 1500 2927 1000 300 1000 500 0
45 45 40000 2 1 2 30 0 0 0 2 ... 25209 26636 29197 3000 5000 0 2000 3000 0 0
101 101 140000 1 1 2 32 -2 -2 -2 -2 ... 415 100 1430 10212 850 415 100 1430 0 0
113 113 280000 1 2 1 41 2 2 2 2 ... 144401 152174 149415 6500 0 14254 14850 0 5000 0
123 123 110000 2 1 1 48 1 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 0
126 126 20000 1 2 2 23 1 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 0
130 130 60000 1 3 1 55 3 2 2 0 ... 28853 29510 26547 2504 7 1200 1200 1100 1500 0
133 133 420000 1 2 1 34 0 0 0 0 ... 220951 210606 188108 9744 9553 7603 7830 7253 11326 0
134 134 330000 1 3 1 46 0 0 0 0 ... 227587 227775 228203 8210 8095 8025 8175 8391 8200 0
141 141 240000 1 1 2 47 1 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 1
143 143 50000 1 2 2 23 1 2 2 2 ... 19996 19879 18065 1000 10000 400 700 800 600 0
150 150 260000 2 1 1 60 1 -2 -1 -1 ... 0 969 869 0 22500 0 969 1000 0 0
152 152 80000 1 1 2 25 0 0 0 0 ... 41087 41951 31826 30000 3000 6000 8000 2000 14000 0
154 154 280000 2 2 1 56 0 0 0 0 ... 101783 177145 169311 8042 6700 5137 100000 7000 6321 0
175 175 360000 1 1 2 29 1 -2 -1 -1 ... 0 0 0 0 77 0 0 0 0 0
192 192 60000 2 1 2 27 2 0 0 0 ... 23005 22499 22873 1342 1664 2000 0 900 846 1
194 194 180000 2 1 2 24 -1 -1 2 0 ... 10200 0 0 37867 0 200 0 0 0 0
197 197 150000 2 2 1 34 -2 -2 -2 -2 ... 116 0 1500 0 0 116 0 1500 0 0
237 237 150000 2 2 2 27 0 0 0 0 ... 44384 36900 29497 4500 1745 1566 1208 1077 2529 0
248 248 100000 2 2 2 27 0 0 0 0 ... 30337 30997 32904 1788 1799 1100 1150 2423 0 0
260 260 220000 1 1 1 48 2 0 0 0 ... 169115 172169 162402 10000 9020 6000 5500 6000 5500 1
279 279 50000 1 3 1 33 0 0 0 0 ... 20046 20067 19703 2007 2199 691 707 703 697 0
292 292 50000 2 2 2 22 1 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 1
304 304 20000 2 1 2 25 0 0 2 0 ... 15824 15761 12510 2800 0 4000 1700 1000 2000 1
307 307 500000 2 2 1 36 -1 -1 -1 -1 ... 8500 4590 0 23962 0 8500 4590 0 0 0
349 349 140000 2 2 2 31 0 0 0 0 ... 139679 141748 142174 6600 6500 5100 5300 6000 5000 0
353 353 380000 1 1 2 30 1 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29690 29690 200000 1 1 2 36 -1 -1 -1 -1 ... 396 396 396 396 396 396 396 396 396 0
29707 29707 420000 1 2 1 43 0 0 0 0 ... 30099 66049 61043 42063 30253 20016 40015 30004 3000 0
29722 29722 230000 1 2 1 35 0 0 0 0 ... 180117 179717 124117 7004 10000 5000 0 5000 189600 0
29729 29729 160000 1 1 1 46 -2 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 1
29798 29798 90000 1 2 1 31 0 0 0 0 ... 72021 73912 58301 2471 2600 2633 3108 2052 1858 0
29800 29800 20000 1 2 2 33 3 2 8 7 ... 21631 21026 20130 8000 0 0 0 0 0 1
29806 29806 60000 1 2 1 40 0 0 0 0 ... 52642 52806 53530 2500 2500 2500 2000 2100 2200 0
29810 29810 20000 1 3 1 48 0 0 3 2 ... 18188 17630 16780 5000 0 4000 0 0 0 1
29817 29817 80000 1 2 2 42 1 4 3 2 ... 81545 51338 50826 0 639 0 50918 2000 2000 0
29818 29818 50000 1 2 2 40 2 0 0 2 ... 13444 13367 13282 2000 2000 1000 1000 1000 2000 1
29847 29847 300000 1 2 2 47 0 0 0 0 ... 118522 118586 118179 3000 5000 5013 4514 4220 4008 0
29858 29858 50000 1 2 1 47 0 0 0 0 ... 14439 14602 14918 1504 1600 1537 700 700 600 0
29866 29866 150000 1 1 1 43 -1 3 2 -1 ... 416 416 416 0 416 416 416 416 0 1
29872 29872 420000 1 2 2 31 0 0 0 0 ... 293951 305011 312087 14302 10500 11000 16000 12000 16000 0
29881 29881 140000 1 2 1 38 0 0 0 0 ... 75815 65099 66445 4494 5303 2806 1825 1880 1901 1
29890 29890 340000 1 2 1 37 0 0 0 0 ... 81589 83546 85362 8000 4000 3000 3300 3300 4000 0
29904 29904 260000 1 1 1 30 -1 0 -1 -1 ... 99 99 172104 10018 13333 99 99 172104 30013 0
29905 29905 60000 1 3 2 30 0 0 0 0 ... 58732 59306 59728 2600 4553 5800 2000 1000 1462 1
29909 29909 140000 1 1 2 29 -2 -2 -2 -2 ... 0 0 0 0 0 0 0 0 0 1
29923 29923 150000 1 2 1 35 1 -2 -2 -2 ... -18 -18 -18 0 0 0 0 0 0 0
29931 29931 20000 1 2 1 43 1 2 2 2 ... 8102 7136 5243 1307 0 1400 0 182 400 0
29939 29939 30000 1 2 2 38 2 2 2 2 ... 21471 21598 23241 1700 1700 0 782 2000 1100 0
29941 29941 20000 1 2 1 48 1 2 0 0 ... 18928 18761 18787 2 2019 11309 2000 700 1002 0
29945 29945 80000 1 2 2 39 3 2 0 0 ... 23867 23130 0 0 2968 1867 463 0 0 1
29953 29953 210000 1 2 2 30 -1 -1 -1 -1 ... 1252 626 626 3980 7509 1252 0 626 626 0
29962 29962 260000 1 1 2 33 -2 -2 -2 -2 ... 1368 101 955 263 0 1368 101 955 0 0
29969 29969 20000 1 2 2 34 0 0 0 0 ... 13478 16978 12914 2000 2000 1000 5000 12914 600 0
29978 29978 420000 1 1 2 34 0 0 0 0 ... 141695 144839 147954 7000 7000 5500 5500 5600 5000 0
29983 29983 90000 1 2 1 36 0 0 0 0 ... 11328 12036 14329 1500 1500 1500 1200 2500 0 1
29996 29996 220000 1 3 1 39 0 0 0 0 ... 88004 31237 15980 8500 20000 5003 3047 5000 1000 0

2500 rows × 25 columns


In [20]:
label = "default data credit card random forest"
start  = time.time() # Start the clock! 
scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
for train, test in KFold(len(X), n_folds=12, shuffle=True):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y.iloc[train], y.iloc[test] 
        
estimator = RandomForestClassifier()
estimator.fit(X_train, y_train)
        
expected  = y_test
predicted = estimator.predict(X_test)
        
# Append our scores to the tracker
scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

# Report
print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
print("Validation scores are as follows:\n")
print(pd.DataFrame(scores).mean())
    
# Write official estimator to disk
estimator = RandomForestClassifier()
estimator.fit(X, y)
    
outpath = label.lower().replace(" ", "-") + ".pickle"
with open(outpath, 'wb') as f:
    pickle.dump(estimator, f)

#print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))


Build and Validation of default data credit card random forest took 1.272 seconds
Validation scores are as follows:

accuracy     0.801200
f1           0.772397
precision    0.777112
recall       0.801200
dtype: float64

In [ ]:
label = "default data credit card SVC"
start  = time.time() # Start the clock! 
scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
for train, test in KFold(len(X), n_folds=12, shuffle=True):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y.iloc[train], y.iloc[test] 
        
estimator = SVC()
estimator.fit(X_train, y_train)
        
expected  = y_test
predicted = estimator.predict(X_test)
        
# Append our scores to the tracker
scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

# Report
print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
print("Validation scores are as follows:\n")
print(pd.DataFrame(scores).mean())
    
    
    
#outpath = label.lower().replace(" ", "-") + ".pickle"
#with open(outpath, 'wb') as f:
#    pickle.dump(estimator, f)

#print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))

In [21]:
label = "default data credit card KNNclassifer "
start  = time.time() # Start the clock! 
scores = {'precision':[], 'recall':[], 'accuracy':[], 'f1':[]}
    
for train, test in KFold(len(X), n_folds=12, shuffle=True):
    X_train, X_test = X.iloc[train], X.iloc[test]
    y_train, y_test = y.iloc[train], y.iloc[test] 
        
estimator = KNeighborsClassifier(n_neighbors=12)
estimator.fit(X_train, y_train)
        
expected  = y_test
predicted = estimator.predict(X_test)
        
# Append our scores to the tracker
scores['precision'].append(metrics.precision_score(expected, predicted, average="weighted"))
scores['recall'].append(metrics.recall_score(expected, predicted, average="weighted"))
scores['accuracy'].append(metrics.accuracy_score(expected, predicted))
scores['f1'].append(metrics.f1_score(expected, predicted, average="weighted"))

# Report
print("Build and Validation of {} took {:0.3f} seconds".format(label, time.time()-start))
print("Validation scores are as follows:\n")
print(pd.DataFrame(scores).mean())
    
    
    
#outpath = label.lower().replace(" ", "-") + ".pickle"
#with open(outpath, 'wb') as f:
#    pickle.dump(estimator, f)

#print("\nFitted model written to:\n{}".format(os.path.abspath(outpath)))


Build and Validation of default data credit card KNNclassifer  took 1.197 seconds
Validation scores are as follows:

accuracy     0.776800
f1           0.714357
precision    0.710574
recall       0.776800
dtype: float64

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: