Lending Club Default (Charged Off and Late Included) vs. Fully Paid Classification Model

  • Main repository for this project
  • Raw data is gathered from Lending Club Loan Data and Kaggle
  • This project was started on Jan 18, 2017
  • Co-contributors are listed below (sorted alphabetically)
    • Jang Sungguk(simfeel87@gmail.com)
    • Kim Gibeom(curtisk808@gmail.com)
    • Shin Yoonsig(shinys825@gmail.com)

Summary

Basic approach: how to build a basic investment strategy for beginners

: Focus on 7 features to avoid losing money

Check List

  • High annual income
  • Low interest rate
  • Low loan amount
  • Employment title provided
  • Issued in January, September, or December
  • Verification status and an owned home are less important, but relatively positive

Initialize

Library loading


In [26]:
import re
import numpy as np
import scipy as sp
import pandas as pd
import statsmodels.api as sm

import matplotlib as mpl
import matplotlib.tri as mtri
import matplotlib.pylab as plt
import matplotlib.patches as mpatches
import seaborn as sns

import itertools

from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

# Seaborn setting
sns.set(palette="hls", font_scale=2)

Read dataframe


In [6]:
df = pd.read_csv('frames/lc_dataframe.csv')
print list(df.columns)
df.tail()


['loan_amnt', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'desc', 'purpose', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'application_type']
Out[6]:
loan_amnt int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status ... dti delinq_2yrs inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc initial_list_status application_type
268131 31050 21.99 857.40 6 61 1 10 2 875000.0 1 ... 9.66 1 0 10 0 25770 79.3 13 0 1
268132 10800 7.89 337.89 1 15 1 8 2 92400.0 1 ... 19.62 1 0 11 0 9760 68.7 36 1 1
268133 9000 9.17 286.92 2 22 1 1 2 80000.0 1 ... 3.97 1 0 8 0 6320 51.8 17 0 1
268134 14400 25.99 431.06 6 65 0 11 6 62000.0 1 ... 16.88 0 1 9 1 5677 45.1 30 0 1
268135 8000 12.59 267.98 3 32 1 4 5 45000.0 1 ... 26.21 0 0 12 0 9097 50.8 47 1 1

5 rows × 25 columns

Variables Description


In [3]:
var_desc = pd.read_csv('frames/var_description(csv).csv')
var_desc


Out[3]:
Name Included in dataframe Data type Encoding Not Null ratio Reason for removal English Description Korean Description
0 loan_status o Categorical Fully Paid = 1 / Charged Off, Default, Late(31... 100.00% NaN Current status of the loan 대출의 현재 상태
1 int_rate o Continuous NaN 100.00% NaN Interest Rate on the loan 대출 금리
2 emp_title o Categorical True = 1, False = 0 94.20% NaN The job title supplied by the Borrower when ap... 대출 신청시 차용자가 제공 한 직책 *
3 emp_length o Categorical < 1' = 0, 10+ = 10, 'n/a' = 11 100.00% NaN Employment length in years. Possible values ar... 고용 기간(년)
4 home_ownership o Categorical Mortgage, Other, Own, Rent = {1, 2, 3, 4} 100.00% NaN The home ownership status provided by the borr... 주택 소유 상태(렌트, 소유, 모기지, 기타)
5 annual_inc o Continuous NaN 100.00% NaN The self-reported annual income provided by th... 차용자 자체보고 연간 소득.
6 verification_status o Categorical Source Verified, Verified = 1, Not Verified = 0 100.00% NaN Indicates if income was verified by LC, not ve... 수입 및 수입원 확인여부
7 issue_d o Categorical Transform to mm 100.00% NaN The month which the loan was funded 대출 자금이 조달 된 달
8 loan_amnt o Continuous NaN 100.00% NaN The listed amount of the loan applied for by t... 대출금액
9 desc o Continuous True = doc length, False = 0 14.20% NaN Loan description provided by the borrower 대출 내용 *
10 purpose o Categorical 14 categoried = {1:14} (from 1 to 14) 100.00% NaN A category provided by the borrower for the lo... 대출 목적
11 dti o Continuous NaN 100.00% NaN A ratio calculated using the borrower’s total ... 차용인의 (월별 채무상환액 / 월별 소득)
12 delinq_2yrs o Categorical No delayed = 0, Delayed = 1 100.00% NaN The number of 30+ days past-due incidences of ... 지난 2 년간 연체 30일 이상 발생 횟수
13 inq_last_6mths o Categorical {0:5} = 0~5 times, 6 = above 6 times 100.00% NaN The number of inquiries in past 6 months (excl... 지난 6 개월 간 신용관련 문의 건수 (자동차, 모기지 제외)
14 pub_rec o Categorical No record = 0, record = 1 100.00% NaN Number of derogatory public records 추징, 압류 등과 관련된 기록
15 revol_bal o Continuous NaN 100.00% NaN Total credit revolving balance 총 크레딧 회전 균형
16 revol_util o Continuous NaN 99.94% NaN Revolving line utilization rate, or the amount... 리볼빙 이용률, 차용자가 사용하는 리볼빙 크레딧 금액
17 total_acc o Continuous NaN 100.00% NaN The total number of credit lines currently in ... 현재 개설된 신용계좌 수
18 initial_list_status o Categorical W = 1, F = 0 100.00% NaN The initial listing status of the loan. Possib... 대출의 초기 리스팅 상태(W: 전체, F: 부분)
19 acc_now_delinq x Continuous NaN 100.00% unclear data collection date from LC The number of accounts on which the borrower i... 연체중인 계좌의 수
20 addr_state x - NaN 100.00% not related to analytics purpose The state provided by the borrower in the loan... 대출 신청서에 차용인이 제공 한 국가
21 all_util x - NaN 2.41% high null ratio Balance to credit limit on all trades 신용한도
22 annual_inc_joint x - NaN 0.06% high null ratio The combined self-reported annual income provi... 등록 중 공동 차용자가 제공 한 자체보고 된 연간 소득
23 application_type x Categorical Individual = 1, Joint = 0 100.00% extremely high unbalanced(above 99.9) Indicates whether the loan is an individual ap... 대출신청 개인, 공동 여부
24 collection_recovery_fee x Continuous NaN 100.00% not provided information on loan application post charge off collection fee 수금수수료
25 collections_12_mths_ex_med x Continuous NaN 99.98% not provided information on loan application Number of collections in 12 months excluding m... 의료제외 12개월 간 수금건수
26 dti_joint x - NaN 0.06% dependent to 'dti' A ratio calculated using the co-borrowers' tot... 모기지 및 요청 된 LC 대출을 제외하고 총 차입금에 대한 공동 차용자의 월별 총 ...
27 earliest_cr_line x - NaN 100.00% dependent to 'issue_d' The month the borrower's earliest reported cre... 차용자의 최초보고 크레딧 라인이 개설 된 월
28 funded_amnt x Continuous NaN 100.00% dependent to 'loan_amnt' The total amount committed to that loan at tha... 해당 시점에 대출에 투입된 총 금액.
29 funded_amnt_inv x Continuous NaN 100.00% dependent to 'loan_amnt' The total amount committed by investors for th... 해당 시점에 해당 대출에 대해 투자자들이 투입한 총 금액.
... ... ... ... ... ... ... ... ...
44 mths_since_rcnt_il x - NaN 2.35% high null ratio Months since most recent installment accounts ... 가장 최근의 할부 계좌가 개설 된 이후의 달
45 next_pymnt_d x - NaN 71.49% unclear data collection date from LC Next scheduled payment date 다음 지급 예정일
46 open_acc x Continuous NaN 100.00% dependent to 'total_acc' The number of open credit lines in the borrowe... 차용인의 개설된 신용계좌 수
47 open_acc_6m x - NaN 2.41% high null ratio Number of open trades in last 6 months 최근 6 개월 동안의 미결 거래 횟수
48 open_il_12m x - NaN 2.41% high null ratio Number of installment accounts opened in past ... 지난 12 개월 동안 개설 된 할부 계좌 수
49 open_il_24m x - NaN 2.41% high null ratio Number of installment accounts opened in past ... 지난 24 개월 동안 개설 된 할부 계좌 수
50 open_il_6m x - NaN 2.41% high null ratio Number of currently active installment trades 현재 활동중인 할부 거래 수
51 open_rv_12m x - NaN 2.41% high null ratio Number of revolving trades opened in past 12 m... 지난 12 개월 동안 개설 된 회전 거래 수
52 open_rv_24m x - NaN 2.41% high null ratio Number of revolving trades opened in past 24 m... 지난 24 개월 동안 개설 된 회전 거래 수
53 out_prncp x Continuous NaN 100.00% not provided information on loan application Remaining outstanding principal for total amou... 후원 된 총 금액에 대한 미납 된 원금
54 out_prncp_inv x Continuous NaN 100.00% not provided information on loan application Remaining outstanding principal for portion of... 투자자가 자금을 조달 한 부분의 잔액이 남아 있음.
55 policy_code x - NaN 100.00% same all elements publicly available policy_code=1\nnew products... publicly available policy_code = 1\r\n공개적으로 사용...
56 pymnt_plan x Categorical y = 1, n = 0 100.00% extremely high unbalanced(above 99.9) Indicates if a payment plan has been put in pl... 대출금 지불 계획 수립여부
57 recoveries x Continuous NaN 100.00% not provided information on loan application post charge off gross recovery 총체적인 회복으로 인한 비용 청구
58 sub_grade x Categorical 1, 2, 3, 4, 5 100.00% dependent to 'grade' LC assigned loan subgrade LC 세부 신용등급
59 term x Continuous NaN 100.00% not used in present The number of payments on the loan. Values are... 대출금 지불 기간(값은 개월 단위)
60 title x - NaN 99.98% too many categories The loan title provided by the borrower 차용인이 제공 한 대출 명칭
61 tot_coll_amt x - NaN 92.08% dependent to 'term' Total collection amounts ever owed 적립 된 총 징수액
62 tot_cur_bal x - NaN 92.08% dependent to other variables of repayment ability Total current balance of all accounts 모든 계정의 총 현재 잔액
63 total_bal_il x - NaN 2.41% high null ratio Total current balance of all installment accounts 모든 할부 계좌의 총 현재 잔액
64 total_cu_tl x - NaN 0.00% high null ratio Number of finance trades 금융 거래 건수
65 total_pymnt x Continuous NaN 100.00% not provided information on loan application Payments received to date for total amount funded 현재까지받은 지불액
66 total_pymnt_inv x - NaN 100.00% unclear data collection date from LC Payments received to date for portion of total... 투자자가 자금을 조달 한 총액의 일부분에 대해 현재까지 수령 한 지급액
67 total_rec_int x - NaN 100.00% unclear data collection date from LC Interest received to date 현재까지받은이자
68 total_rec_late_fee x - NaN 100.00% unclear data collection date from LC Late fees received to date 현재까지 수령 한 연체료
69 total_rec_prncp x - NaN 100.00% unclear data collection date from LC Principal received to date 현재까지받은 교장
70 total_rev_hi_lim x - NaN 92.08% dependent to 'revol_bal', 'revol_util' Total revolving high credit/credit limit 총 회전 높은 신용 한도 / 신용 한도
71 url x x NaN 100.00% not related to analytics purpose URL for the LC page with listing data. 목록 데이터가있는 LC 페이지의 URL입니다.
72 verification_status_joint x - NaN 0.06% high null ratio Indicates if the co-borrowers' joint income wa... 공동 차용인의 공동 수입이 LC로 검증되었는지, 검증되지 않았는지, 수입원이 확인되...
73 zip_code x - NaN 100.00% not related to analytics purpose The first 3 numbers of the zip code provided b... 대출 신청서에 차용자가 제공 한 우편 번호의 처음 3 자리 숫자입니다.

74 rows × 8 columns

Check dataframe and elements


In [56]:
df.dtypes


Out[56]:
loan_amnt                int64
int_rate               float64
installment            float64
grade                    int64
sub_grade                int64
emp_title                int64
emp_length               int64
home_ownership           int64
annual_inc             float64
verification_status      int64
issue_d                  int64
loan_status              int64
pymnt_plan               int64
desc                     int64
purpose                  int64
dti                    float64
delinq_2yrs              int64
inq_last_6mths           int64
open_acc                 int64
pub_rec                  int64
revol_bal                int64
revol_util             float64
total_acc                int64
initial_list_status      int64
application_type         int64
dtype: object

In [55]:
df.isnull().sum()


Out[55]:
loan_amnt              0
int_rate               0
installment            0
grade                  0
sub_grade              0
emp_title              0
emp_length             0
home_ownership         0
annual_inc             0
verification_status    0
issue_d                0
loan_status            0
pymnt_plan             0
desc                   0
purpose                0
dti                    0
delinq_2yrs            0
inq_last_6mths         0
open_acc               0
pub_rec                0
revol_bal              0
revol_util             0
total_acc              0
initial_list_status    0
application_type       0
dtype: int64

Data Preprocessing

Check correlation between features


In [257]:
fig = plt.figure(figsize = (15, 10))
corrmat = df.corr()
sns.heatmap(corrmat)
plt.title("Correlation between Features")
plt.show()


Get highly correlated pairs (absolute correlation above 0.6)


In [307]:
corr_stack = corrmat.abs().unstack()
# Series.order is deprecated; sort_values is its replacement
ordered_stack = corr_stack.sort_values(kind="quicksort", ascending=False)

# keep pairs with 0.6 < |r| < 1.0; the upper bound drops each variable's self-correlation
high_corr = ordered_stack[(ordered_stack > 0.6) & (ordered_stack < 1.0)]
print high_corr


sub_grade    grade          0.994620
grade        sub_grade      0.994620
sub_grade    int_rate       0.966306
int_rate     sub_grade      0.966306
installment  loan_amnt      0.954577
loan_amnt    installment    0.954577
int_rate     grade          0.950878
grade        int_rate       0.950878
open_acc     total_acc      0.672594
total_acc    open_acc       0.672594
dtype: float64
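Because a correlation matrix is symmetric, every pair in the listing above shows up twice, once as (a, b) and once as (b, a). The following is a minimal sketch of one way to list each pair only once, by masking everything except the strict upper triangle before stacking. The helper name `high_corr_pairs` and the toy frame are illustrative assumptions and are not part of the original notebook; the snippet is written to run on both Python 2 and Python 3.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.6):
    """Return each column pair with |r| > threshold once, strongest first."""
    corr = frame.corr().abs()
    # True only above the diagonal, so mirrored pairs and self-correlations are masked out
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data (assumption): "b" is "a" plus small noise, "c" is independent
rng = np.random.RandomState(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

pairs = high_corr_pairs(demo)
print(pairs)
```

Applied to `df`, a helper like this would report `grade`/`sub_grade`, `int_rate`/`sub_grade`, and so on, one row per pair.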

Remove highly correlated variables


In [7]:
del df["grade"]
del df["sub_grade"]
del df["installment"]
del df["open_acc"]

Check loan_status counts for each categorical variable


In [8]:
category_vars = ["emp_title", "emp_length", "home_ownership",
                 "verification_status", "issue_d", "purpose",
                 "initial_list_status", "pymnt_plan", "application_type"]

continuous_vars = ["loan_amnt", "int_rate", "annual_inc", 
                   "desc", "dti", "revol_bal", "revol_util", "total_acc",
                   "pub_rec", "inq_last_6mths", "delinq_2yrs"]

In [310]:
for var in category_vars:
    print df.groupby(var).loan_status.value_counts()


emp_title  loan_status
0          1               10494
           0                4471
1          1              197227
           0               55944
Name: loan_status, dtype: int64
emp_length  loan_status
0           1              17033
            0               5181
1           1              13892
            0               3984
2           1              19528
            0               5384
3           1              16846
            0               4761
4           1              13422
            0               3668
5           1              14856
            0               4122
6           1              12058
            0               3474
7           1              11483
            0               3381
8           1               9695
            0               2890
9           1               7789
            0               2392
10          1              63747
            0              17721
11          1               7372
            0               3457
Name: loan_status, dtype: int64
home_ownership  loan_status
1               1                   1
2               1              104965
                0               26496
3               1                  36
                0                   7
4               1                 114
                0                  27
5               1               17959
                0                5607
6               1               84646
                0               28278
Name: loan_status, dtype: int64
verification_status  loan_status
0                    1               73856
                     0               15480
1                    1              133865
                     0               44935
Name: loan_status, dtype: int64
issue_d  loan_status
1        1              18445
         0               5317
2        1              14696
         0               4508
3        1              15754
         0               4647
4        1              17455
         0               5405
5        1              16907
         0               5346
6        1              16138
         0               4991
7        1              20212
         0               6290
8        1              17607
         0               4863
9        1              15747
         0               4125
10       1              21371
         0               6242
11       1              18085
         0               4957
12       1              15304
         0               3724
Name: loan_status, dtype: int64
purpose  loan_status
1        1                3198
         0                 543
2        1               42249
         0               10536
3        1              120763
         0               37318
4        1                 269
         0                  56
5        1               12660
         0                3115
6        1                1366
         0                 369
7        1                5391
         0                1146
8        1                2285
         0                 726
9        1                1603
         0                 549
10       1               11341
         0                3732
11       1                 213
         0                  63
12       1                3375
         0                1630
13       1                1318
         0                 359
14       1                1690
         0                 273
Name: loan_status, dtype: int64
initial_list_status  loan_status
0                    1              149144
                     0               41104
1                    1               58577
                     0               19311
Name: loan_status, dtype: int64
pymnt_plan  loan_status
0           1              207719
            0               60410
1           0                   5
            1                   2
Name: loan_status, dtype: int64
application_type  loan_status
0                 0                   2
                  1                   1
1                 1              207720
                  0               60413
Name: loan_status, dtype: int64

Remove 'pymnt_plan' and 'application_type': almost all observations fall into a single category


In [9]:
del df["pymnt_plan"]
del df["application_type"]

In [10]:
category_vars.remove("pymnt_plan")
category_vars.remove("application_type")

Rearrange the categories of 'home_ownership' and 'purpose'


In [11]:
df["home_ownership"].replace([1, 3], 4, inplace=True)
df["home_ownership"].replace(2, 1, inplace=True)
df["home_ownership"].replace(4, 2, inplace=True)
df["home_ownership"].replace(5, 3, inplace=True)
df["home_ownership"].replace(6, 4, inplace=True)

In [327]:
np.unique(df["home_ownership"])


Out[327]:
array([1, 2, 3, 4], dtype=int64)
Home_ownership category    Category number
Mortgage                   1
Other                      2
Own                        3
Rent                       4
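The chained `replace` calls above only give the intended result because they run in that exact order (for example, code 4 is used as a temporary bucket before being moved to 2). A minimal sketch of an alternative that applies the whole recoding in one step with a dictionary is shown below. The mapping is derived from the replace calls and the category table; the names `HOME_OWNERSHIP_MAP` and `recode` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# Old code -> new code, read off the replace calls:
# 2 becomes Mortgage (1); 1, 3 and 4 are merged into Other (2);
# 5 becomes Own (3); 6 becomes Rent (4).
HOME_OWNERSHIP_MAP = {1: 2, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4}


def recode(series, mapping):
    """Map every value through `mapping`, failing loudly on unexpected codes."""
    unknown = set(series.unique()) - set(mapping)
    if unknown:
        raise ValueError("codes missing from mapping: %s" % sorted(unknown))
    return series.map(mapping)


demo = pd.Series([1, 2, 3, 4, 5, 6, 2, 6])
recoded = recode(demo, HOME_OWNERSHIP_MAP)
print(recoded.tolist())
```

Since all old codes are looked up at once, the result does not depend on the order in which the rules are written.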

In [12]:
df["purpose"].replace([4, 11], 10, inplace=True)
df["purpose"].replace(13, 4, inplace=True)
df["purpose"].replace(14, 11, inplace=True)

In [326]:
np.unique(df["purpose"])


Out[326]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int64)
Purpose category      Category number
car                   1
credit_card           2
debt_consolidation    3
vacation              4
home_improvement      5
house                 6
major_purchase        7
medical               8
moving                9
other                 10
wedding               11
small_business        12

Histogram


In [332]:
print len(category_vars), len(continuous_vars)


7 11

In [377]:
N=4; M=2;  # set row and column of the figure
fig = plt.figure(figsize=(10,15))  # figure size

plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# draw one histogram per categorical variable
for i in range(N):
    for j in range(M):
        if i * M + j < 7:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            plt.hist(df[category_vars[i * M + j]], bins=30)
            plt.title(category_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("Histograms for category variables", y=1.02,fontsize=30)
plt.show()



In [381]:
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size

plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# draw one histogram per continuous variable
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            plt.hist(df[continuous_vars[i * M + j]], bins=10)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("Histograms for continuous variables", y=1.02,fontsize=30)
plt.show()


QQ-Plot


In [420]:
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# draw one QQ-plot per continuous variable
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            sp.stats.probplot(df[continuous_vars[i * M + j]], plot=plt)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("QQ-Plots for continuous variables", y=1.02,fontsize=30)
plt.show()
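The QQ-plots are judged by eye, but `scipy.stats.probplot` also returns the correlation coefficient r of the least-squares line through the plotted points, which can serve as a rough numeric summary of how close a sample is to normal. The sketch below uses a synthetic log-normal sample as a stand-in for a right-skewed variable such as income; the sample and its parameters are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (assumption, not Lending Club data)
rng = np.random.RandomState(0)
skewed = rng.lognormal(mean=11.0, sigma=0.6, size=2000)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_raw) = stats.probplot(skewed, dist="norm")
_, (_, _, r_log) = stats.probplot(np.log10(skewed), dist="norm")

print("r before log10: %.4f" % r_raw)
print("r after log10:  %.4f" % r_log)
```

An r closer to 1 means the points lie closer to the straight reference line in the QQ-plot.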


Remove outliers (the highest and lowest 0.005% of observations for each continuous variable)


In [13]:
outliers = int(len(df) * 0.00005)
print "Number of 0.005%: ", outliers

out_ix = np.array([])

for name in continuous_vars:
    out_large_list = np.array(df[name].nlargest(outliers).index)
    out_small_list = np.array(df[name].nsmallest(outliers).index)
    out_ix = np.concatenate((out_ix, out_large_list, out_small_list))
    out_ix = np.unique(out_ix)
print "Number of outliers(0.005% * 2): ", out_ix.shape[0]


Number of 0.005%:  13
Number of outliers(0.005% * 2):  248
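With 13 rows per tail, 2 tails, and 11 continuous variables, at most 13 × 2 × 11 = 286 rows could be flagged; the reported 248 is lower because some rows are extreme in more than one variable and are counted once. The sketch below reproduces that effect on a toy frame. The helper `extreme_row_labels` and the toy data are hypothetical and only illustrate the same nlargest/nsmallest approach.

```python
import numpy as np
import pandas as pd


def extreme_row_labels(frame, columns, fraction):
    """Collect index labels of the k largest and k smallest rows of each column."""
    k = int(len(frame) * fraction)
    labels = set()
    for name in columns:
        labels.update(frame[name].nlargest(k).index)
        labels.update(frame[name].nsmallest(k).index)
    return sorted(int(label) for label in labels), k


# Toy frame (assumption): "y" is "x" reversed, so the extremes fall on the same rows
demo = pd.DataFrame({"x": np.arange(100), "y": np.arange(100)[::-1]})
labels, k = extreme_row_labels(demo, ["x", "y"], fraction=0.02)

upper_bound = k * 2 * len(["x", "y"])  # rows per tail * tails * variables
cleaned = demo.drop(labels).reset_index(drop=True)

print("k per tail:", k)
print("flagged rows:", labels, "upper bound:", upper_bound)
print("rows left:", len(cleaned))
```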

In [14]:
# out_ix started as an empty float array, so cast it back to int before using it as an index
cleaned_df = df.drop(df.index[out_ix.astype(int)])
cleaned_df.index = range(len(cleaned_df))
cleaned_df.tail()


Out[14]:
loan_amnt int_rate emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status desc purpose dti delinq_2yrs inq_last_6mths pub_rec revol_bal revol_util total_acc initial_list_status
267883 31050 21.99 1 10 1 875000.0 1 12 1 0 3 9.66 1 0 0 25770 79.3 13 0
267884 10800 7.89 1 8 1 92400.0 1 12 1 0 2 19.62 1 0 0 9760 68.7 36 1
267885 9000 9.17 1 1 1 80000.0 1 12 1 0 3 3.97 1 0 0 6320 51.8 17 0
267886 14400 25.99 0 11 4 62000.0 1 12 1 0 3 16.88 0 1 1 5677 45.1 30 0
267887 8000 12.59 1 4 3 45000.0 1 12 1 0 3 26.21 0 0 0 9097 50.8 47 1

QQ-Plot after removing outliers


In [422]:
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# draw one QQ-plot per continuous variable
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            sp.stats.probplot(cleaned_df[continuous_vars[i * M + j]], plot=plt)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("QQ-Plots for continuous variables after removing outliers", y=1.02,fontsize=30)
plt.show()


Scaling and Logarithm


In [15]:
log_vars = ["int_rate", "annual_inc", "loan_amnt"]
scaling_vars = ["desc", "dti", "revol_bal", "revol_util", "total_acc", "pub_rec", "inq_last_6mths", "delinq_2yrs"]

Check log_vars for zeros, which would become '-inf' after the log transform


In [401]:
for i in log_vars:
    print cleaned_df[cleaned_df[i].isin([int(0)])].index


Int64Index([], dtype='int64')
Int64Index([], dtype='int64')
Int64Index([], dtype='int64')

In [16]:
cleaned_df[log_vars] = np.log10(cleaned_df[log_vars])

In [17]:
cleaned_df[scaling_vars] = preprocessing.scale(cleaned_df[scaling_vars])
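As a small sanity check of the two transforms, the sketch below applies `np.log10` to the `loan_amnt` values of the last five rows shown earlier and standardizes their `dti` values with `preprocessing.scale`. The manual formula (subtract the mean, divide by the population standard deviation) is included for comparison. Note that the scaled numbers here are computed from these five rows only, so they will not match the notebook's values, which use the statistics of the full column.

```python
import numpy as np
from sklearn import preprocessing

# loan_amnt and dti of the last five rows of cleaned_df, as printed above
loan_amnt = np.array([31050.0, 10800.0, 9000.0, 14400.0, 8000.0])
dti = np.array([[9.66], [19.62], [3.97], [16.88], [26.21]])

# log10 compresses the right tail; 31050 maps to about 4.492062
logged = np.log10(loan_amnt)

# scale() centers each column to mean 0 and divides by the population std (ddof=0)
scaled = preprocessing.scale(dti)
manual = (dti - dti.mean(axis=0)) / dti.std(axis=0)

print(np.round(logged, 6))
print(np.round(scaled.ravel(), 6))
```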

In [419]:
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# draw one QQ-plot per continuous variable
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            sp.stats.probplot(cleaned_df[continuous_vars[i * M + j]], plot=plt)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("QQ-Plots for continuous variables after Scaling and Logarithm", y=1.02,fontsize=30)
plt.show()



In [426]:
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size

plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# draw one histogram per continuous variable
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            plt.hist(cleaned_df[continuous_vars[i * M + j]], bins=10)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("Histograms for continuous variables after Scaling and Logarithm", y=1.02,fontsize=30)
plt.show()


Fit and Scoring

Under Sampling


In [18]:
scaled_df = cleaned_df

In [20]:
def under_sampling(df_name):
    """Balance the classes by randomly dropping fully paid loans.

    Results are stored in the globals X, y, under_X, under_y and under_df.
    Assumes df_name has a 0..n-1 integer index, so labels can be used with iloc.
    """
    global X, y, under_X, under_y, under_df

    # .ix is deprecated; boolean column masks work the same way with .loc
    X = df_name.loc[:, df_name.columns != "loan_status"]
    y = df_name.loc[:, df_name.columns == "loan_status"]

    default_ix = np.array(df_name[df_name.loan_status == 0].index)
    paid_ix = np.array(df_name[df_name.loan_status == 1].index)
    # draw as many paid loans as there are defaults, without replacement
    rand_paid_ix = np.random.choice(paid_ix, len(default_ix), replace=False)
    rand_ix = np.concatenate([default_ix, rand_paid_ix])

    print "X Shape: ", X.shape
    print "y Shape: ", y.shape
    print "Default Index Length: ", len(default_ix)
    print "Paid Index Length: ", len(rand_paid_ix)
    print "Random Index Shape: ", rand_ix.shape
    print "-" * 75

    under_df = df_name.iloc[rand_ix, :]
    under_X = under_df.loc[:, under_df.columns != "loan_status"]
    under_y = under_df.loc[:, under_df.columns == "loan_status"]

    print "Under DataFrame's X and y", len(under_X), len(under_y)
    print "Data Variables Names: under_df, under_X, under_y"

In [21]:
under_sampling(scaled_df)


X Shape:  (267888, 18)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y
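The function above draws its random sample without a fixed seed and communicates through global variables, so repeated runs select different paid loans. As a hedged alternative, the sketch below shows a seeded variant that returns the balanced frame instead. The name `undersample`, its `seed` parameter, and the toy frame are assumptions for illustration; this is not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd


def undersample(frame, target="loan_status", seed=0):
    """Return a frame in which every class has as many rows as the rarest class."""
    rng = np.random.RandomState(seed)
    counts = frame[target].value_counts()
    n_minority = int(counts.min())
    chosen = []
    for label in counts.index:
        # positional indices of this class, so any index labels are allowed
        positions = np.flatnonzero(frame[target].values == label)
        chosen.append(rng.choice(positions, n_minority, replace=False))
    return frame.iloc[np.sort(np.concatenate(chosen))]


# Toy frame (assumption): 8 fully paid loans (1) and 3 defaults (0)
demo = pd.DataFrame({
    "loan_status": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "x": list(range(11)),
})
balanced = undersample(demo, seed=0)
print(balanced["loan_status"].value_counts().to_dict())
```

Calling the function again with the same seed returns the same rows, which makes downstream scores comparable between runs.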

Split both the undersampled data and the whole data into train and test sets


In [22]:
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

Random Forest Fit


In [18]:
rf = RandomForestClassifier(max_features=None, random_state=0)
# pass y as a 1d array; a one-column DataFrame triggers a DataConversionWarning
rf_result = rf.fit(under_train_X, under_train_y.values.ravel())



Confusion Matrix


In [23]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Oranges):
    """
    Plot the confusion matrix `cm`, labelling both axes with `classes`.
    With `normalize=True`, each row is divided by its sum, so the cells
    show the fraction of each true class.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # counts are printed as integers, normalized values with two decimals
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Confusion Matrix (predictions on the undersampled test set)


In [20]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(under_test_y, rf_result.predict(under_test_X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Confusion Matrix (predictions on the undersampled test set)')
plt.show()

print classification_report(under_test_y, rf_result.predict(under_test_X))


             precision    recall  f1-score   support

          0       0.60      0.69      0.64     12065
          1       0.63      0.53      0.58     12076

avg / total       0.61      0.61      0.61     24141

Confusion Matrix (predictions on the whole dataset)


In [21]:
# Compute confusion matrix
# Note: X and y are the whole dataset, which includes the rows used to fit rf,
# so these scores are optimistic compared with a held-out test set
cnf_matrix = confusion_matrix(y, rf_result.predict(X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Confusion Matrix (predictions on the whole dataset)')
plt.show()

print classification_report(y, rf_result.predict(X))


             precision    recall  f1-score   support

          0       0.42      0.93      0.58     60352
          1       0.97      0.63      0.77    207536

avg / total       0.85      0.70      0.72    267888
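The precision of class 0 (default) is only 0.42 here even though its recall is 0.93. This follows from the class imbalance: the 37% of fully paid loans that are misclassified outnumber the correctly flagged defaults. The sketch below recomputes both precisions from the rounded recalls and the supports in the report above, so the results are approximate. The helper `precision_from_recall` is a hypothetical name introduced for this illustration.

```python
def precision_from_recall(recall_pos, recall_neg, n_pos, n_neg):
    """Precision of the 'positive' class from per-class recall and class sizes."""
    true_pos = recall_pos * n_pos
    false_pos = (1.0 - recall_neg) * n_neg
    return true_pos / (true_pos + false_pos)


# Supports and (rounded) recalls taken from the report above
n_default, n_paid = 60352, 207536
recall_default, recall_paid = 0.93, 0.63

p_default = precision_from_recall(recall_default, recall_paid, n_default, n_paid)
p_paid = precision_from_recall(recall_paid, recall_default, n_paid, n_default)

# Hypothetical comparison: the same recalls on equally sized classes
p_default_balanced = precision_from_recall(recall_default, recall_paid, 1, 1)

print("precision, default: %.2f" % p_default)
print("precision, paid:    %.2f" % p_paid)
print("precision, default (balanced classes): %.2f" % p_default_balanced)
```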

Set CV and Models


In [24]:
# random_state only has an effect when shuffle=True, so it is omitted here
cv = StratifiedKFold(n_splits=8, shuffle=False)

sgd = SGDClassifier(loss="log", fit_intercept=True,
                    average=1000, n_iter=30, n_jobs=8,
                    random_state=0)

lda = LinearDiscriminantAnalysis()

lr = LogisticRegression(n_jobs=8)
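`StratifiedKFold` builds the folds so that each one keeps roughly the class proportions of the full target. A minimal sketch on toy labels (8 defaults and 16 paid loans, an assumption chosen so the counts divide evenly) shows the class counts in each test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy target (assumption): class ratio 1:2
y_demo = np.array([0] * 8 + [1] * 16)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

folds = StratifiedKFold(n_splits=4)
fold_counts = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # number of class-0 and class-1 rows in this test fold
    fold_counts.append(np.bincount(y_demo[test_idx]).tolist())

print(fold_counts)
```

Without shuffling, the rows of each class are assigned to folds in their original order, so any ordering in the data (for example by issue date) carries over into the folds.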

Dummy Encoding for Categorical Variables


In [113]:
dummies_df = scaled_df[continuous_vars]

for var in category_vars:
    dummies_df = dummies_df.join(pd.get_dummies(scaled_df[var], prefix=var))
print dummies_df.shape
print dummies_df.columns

dummies_df = dummies_df.join(scaled_df["loan_status"])
dummies_df.tail()


(267888, 57)
Index([u'loan_amnt', u'int_rate', u'annual_inc', u'desc', u'dti', u'revol_bal',
       u'revol_util', u'total_acc', u'pub_rec', u'inq_last_6mths',
       u'delinq_2yrs', u'emp_title_0', u'emp_title_1', u'emp_length_0',
       u'emp_length_1', u'emp_length_2', u'emp_length_3', u'emp_length_4',
       u'emp_length_5', u'emp_length_6', u'emp_length_7', u'emp_length_8',
       u'emp_length_9', u'emp_length_10', u'emp_length_11',
       u'home_ownership_1', u'home_ownership_2', u'home_ownership_3',
       u'home_ownership_4', u'verification_status_0', u'verification_status_1',
       u'issue_d_1', u'issue_d_2', u'issue_d_3', u'issue_d_4', u'issue_d_5',
       u'issue_d_6', u'issue_d_7', u'issue_d_8', u'issue_d_9', u'issue_d_10',
       u'issue_d_11', u'issue_d_12', u'purpose_1', u'purpose_2', u'purpose_3',
       u'purpose_4', u'purpose_5', u'purpose_6', u'purpose_7', u'purpose_8',
       u'purpose_9', u'purpose_10', u'purpose_11', u'purpose_12',
       u'initial_list_status_0', u'initial_list_status_1'],
      dtype='object')
Out[113]:
loan_amnt int_rate annual_inc desc dti revol_bal revol_util total_acc pub_rec inq_last_6mths ... purpose_6 purpose_7 purpose_8 purpose_9 purpose_10 purpose_11 purpose_12 initial_list_status_0 initial_list_status_1 loan_status
267883 4.492062 1.342225 5.942008 -0.419569 -0.899401 0.600672 1.004029 -1.025334 -0.335109 -0.798916 ... 0 0 0 0 0 0 0 1 0 1
267884 4.033424 0.897077 4.965672 -0.419569 0.368136 -0.304613 0.574806 0.932358 -0.335109 -0.798916 ... 0 0 0 0 0 0 0 0 1 1
267885 3.954243 0.962369 4.903090 -0.419569 -1.623526 -0.499128 -0.109522 -0.684866 -0.335109 -0.798916 ... 0 0 0 0 0 0 0 1 0 1
267886 4.158362 1.414806 4.792392 -0.419569 0.019436 -0.535486 -0.380823 0.421656 1.941395 0.138736 ... 0 0 0 0 0 0 0 1 0 1
267887 3.903090 1.100026 4.653213 -0.419569 1.206798 -0.342102 -0.150015 1.868646 -0.335109 -0.798916 ... 0 0 0 0 0 0 0 0 1 1

5 rows × 58 columns
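As a quick check on the shape above, the sketch below dummy-encodes a toy `home_ownership` column with `pd.get_dummies` and then counts the columns expected for the full frame: 11 continuous variables plus one dummy column per level of each categorical variable. The level counts are read off the column list printed above; the toy column is an assumption for illustration.

```python
import pandas as pd

# Toy column (assumption) covering the four home_ownership codes
demo = pd.DataFrame({"home_ownership": [1, 4, 3, 2, 1]})
dummies = pd.get_dummies(demo["home_ownership"], prefix="home_ownership")
print(list(dummies.columns))

# Number of levels per categorical variable, taken from the printed column names
levels = {
    "emp_title": 2,
    "emp_length": 12,
    "home_ownership": 4,
    "verification_status": 2,
    "issue_d": 12,
    "purpose": 12,
    "initial_list_status": 2,
}
n_continuous = 11
n_columns = n_continuous + sum(levels.values())
print(n_columns)
```

Each row has exactly one active dummy per categorical variable, which is why the feature count grows from 18 to 57 without adding new information.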


In [109]:
under_sampling(dummies_df)


X Shape:  (267888, 57)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y
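The `under_sampling` helper itself is defined earlier in the notebook and is not reproduced here. As a rough sketch of the general idea only (random under-sampling of the larger class until both classes have the same number of rows), assuming labels 0 = default and 1 = fully paid; the function name `under_sample_indices` and the toy labels are hypothetical:

```python
import numpy as np

def under_sample_indices(y, seed=0):
    """Return sorted row indices holding equal counts of class 0 and class 1."""
    y = np.ravel(y)
    rng = np.random.RandomState(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    # Draw n rows from each class without replacement; the smaller class is kept whole
    keep0 = rng.choice(idx0, size=n, replace=False)
    keep1 = rng.choice(idx1, size=n, replace=False)
    return np.sort(np.concatenate([keep0, keep1]))

# Toy labels (hypothetical): 3 defaults (0) and 9 fully paid (1)
y_toy = np.array([0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
idx = under_sample_indices(y_toy)
print("rows kept: {}, share of class 1: {}".format(len(idx), y_toy[idx].mean()))
```

With 60352 rows per class, this kind of sampling gives the 120704-row balanced set reported above.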

Get Scores

SGD


In [124]:
print "SGD CV Accuracy for under sampled dataset"
# Flatten the (n, 1) target to a 1-D array, as scikit-learn expects
sgd_cv_under = cross_val_score(sgd, under_X, np.ravel(under_y),
                               cv=cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", sgd_cv_under
print "Average Accuracy: ", np.mean(sgd_cv_under)

print '-' * 75

print "SGD CV Accuracy for whole dataset"
sgd_cv_whole = cross_val_score(sgd, X, np.ravel(y),
                               cv=cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", sgd_cv_whole
print "Average Accuracy: ", np.mean(sgd_cv_whole)


SGD CV Accuracy for under sampled dataset
Each Validation's Accuracy:  [ 0.45  0.59  0.64  0.59  0.6   0.59  0.58  0.65]
Average Accuracy:  0.585722097031
---------------------------------------------------------------------------
SGD CV Accuracy for whole dataset
Each Validation's Accuracy:  [ 0.78  0.77  0.75  0.72  0.73  0.73  0.75  0.77]
Average Accuracy:  0.750216508392

LDA


In [80]:
print "LDA CV Accuracy for under sampled dataset"
# Flatten the (n, 1) target to a 1-D array, as scikit-learn expects
lda_cv_under = cross_val_score(lda, under_X, np.ravel(under_y),
                               cv=cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lda_cv_under
print "Average Accuracy: ", np.mean(lda_cv_under)

print '-' * 75

print "LDA CV Accuracy for whole dataset"
lda_cv_whole = cross_val_score(lda, X, np.ravel(y),
                               cv=cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lda_cv_whole
print "Average Accuracy: ", np.mean(lda_cv_whole)


LDA CV Accuracy for under sampled dataset
Each Validation's Accuracy:  [ 0.46  0.6   0.64  0.6   0.6   0.58  0.57  0.64]
Average Accuracy:  0.588108099152
---------------------------------------------------------------------------
LDA CV Accuracy for whole dataset
Each Validation's Accuracy:  [ 0.78  0.77  0.75  0.72  0.73  0.73  0.74  0.77]
Average Accuracy:  0.749914143224

LogisticRegression


In [125]:
print "LogisticRegression CV Accuracy for under sampled dataset"
# Flatten the (n, 1) target to a 1-D array, as scikit-learn expects
lr_cv_under = cross_val_score(lr, under_X, np.ravel(under_y),
                              cv=cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lr_cv_under
print "Average Accuracy: ", np.mean(lr_cv_under)

print '-' * 75

print "LogisticRegression CV Accuracy for whole dataset"
lr_cv_whole = cross_val_score(lr, X, np.ravel(y),
                              cv=cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lr_cv_whole
print "Average Accuracy: ", np.mean(lr_cv_whole)


LogisticRegression CV Accuracy for under sampled dataset
Each Validation's Accuracy:  [ 0.45  0.59  0.64  0.59  0.6   0.59  0.58  0.65]
Average Accuracy:  0.586915098091
---------------------------------------------------------------------------
LogisticRegression CV Accuracy for whole dataset
Each Validation's Accuracy:  [ 0.78  0.77  0.75  0.72  0.73  0.73  0.75  0.77]
Average Accuracy:  0.748308994804
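The whole-dataset accuracies are easier to read next to a trivial baseline. The short calculation below uses only the counts printed by `under_sampling` (267888 rows in total, 60352 of them defaults) to get the accuracy of always predicting the majority class; it is a reading aid and not part of the original analysis.

```python
# Counts taken from the under_sampling output above
n_total = 267888
n_default = 60352
n_paid = n_total - n_default

# Accuracy of a model that always predicts the larger class
baseline_whole = max(n_default, n_paid) / float(n_total)
# After under-sampling both classes have 60352 rows
baseline_under = n_default / float(2 * n_default)

print("majority baseline, whole dataset: {:.4f}".format(baseline_whole))
print("majority baseline, under-sampled: {:.4f}".format(baseline_under))
```

On the whole dataset the baseline is about 0.77, so the mean accuracies of about 0.75 do not beat it. On the balanced set the baseline is 0.5, so the 0.59 figures are the more informative ones.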

ROC Curve


In [27]:
lr_result = lr.fit(under_train_X, under_train_y)
sgd_result = sgd.fit(under_train_X, under_train_y)
lda_result = lda.fit(under_train_X, under_train_y)

fpr0, tpr0, thresholds0 = roc_curve(y, sgd_result.decision_function(X), pos_label=1)
fpr1, tpr1, thresholds1 = roc_curve(y, lr_result.decision_function(X), pos_label=1)
fpr2, tpr2, thresholds2 = roc_curve(y, lda_result.decision_function(X), pos_label=1)

# fpr0/tpr0 come from the SGD model and fpr1/tpr1 from LogisticRegression
plt.plot(fpr0, tpr0, "r-", label="SGD")
plt.plot(fpr1, tpr1, "g-", label="LogisticRegression")
plt.plot(fpr2, tpr2, "y-", label="LDA")
plt.plot([0, 1], [0, 1], 'k--', label="random guess")
plt.xlim(-0.05, 1.0)
plt.ylim(0, 1.05)
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver operating characteristic Curve')
plt.legend(loc="lower right")
plt.show()
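The curves are compared by eye here. A single-number summary is the area under the curve (AUC), which equals the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores. The helper name `auc_from_scores` and the toy data are hypothetical, and the pairwise comparison only suits small arrays (scikit-learn's `roc_auc_score` would be the usual choice on the full data).

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.ravel(y_true)
    scores = np.ravel(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score; ties count as half
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / float(len(pos) * len(neg))

# Toy example (hypothetical): two negatives, two positives
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])
auc = auc_from_scores(y_toy, s_toy)
print("AUC: {:.2f}".format(auc))
```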


Random Forest Feature importances


In [469]:
importances = rf_result.feature_importances_
indices = np.argsort(importances)[::-1]

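# Note: list(X.columns)[f] is the f-th column in original order, while indices[f]
# is the column index of the f-th most important feature. The printed name is
# therefore not the name of the ranked feature (X.columns[indices[f]] would be).
for f in range(X.shape[1]):
    print("{}. {} / feature_n: {} / importance: {}".format(f+1, list(X.columns)[f], indices[f], importances[indices[f]]))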

std = np.std([tree.feature_importances_ for tree in rf_result.estimators_],
             axis=0)

plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()


1. loan_amnt / feature_n: 1 / importance: 0.158881575828
2. int_rate / feature_n: 10 / importance: 0.114321064941
3. emp_title / feature_n: 14 / importance: 0.104817431158
4. emp_length / feature_n: 15 / importance: 0.103399981851
5. home_ownership / feature_n: 5 / importance: 0.0919407006712
6. annual_inc / feature_n: 0 / importance: 0.0849728830988
7. verification_status / feature_n: 16 / importance: 0.0784396621926
8. issue_d / feature_n: 7 / importance: 0.0540577119001
9. desc / feature_n: 3 / importance: 0.045905758798
10. purpose / feature_n: 8 / importance: 0.0405277987998
11. dti / feature_n: 9 / importance: 0.0290124276806
12. delinq_2yrs / feature_n: 12 / importance: 0.0268262275236
13. inq_last_6mths / feature_n: 4 / importance: 0.0158916830919
14. pub_rec / feature_n: 11 / importance: 0.0152995025604
15. revol_bal / feature_n: 13 / importance: 0.0111792915635
16. revol_util / feature_n: 17 / importance: 0.0105867564604
17. total_acc / feature_n: 6 / importance: 0.00935101737055
18. initial_list_status / feature_n: 2 / importance: 0.0045885245118
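In the listing above, the name on each line follows the original column order, whereas `feature_n` and `importance` follow the sorted order, so a name and an importance on the same line do not necessarily belong together. The sketch below shows one way to pair them through the same index; the column list and importance values are invented for the example.

```python
import numpy as np

# Hypothetical columns and importances, for illustration only
columns = ["loan_amnt", "int_rate", "annual_inc", "dti"]
importances = np.array([0.10, 0.55, 0.30, 0.05])

# Column indices sorted from most to least important
indices = np.argsort(importances)[::-1]

# Look up name and importance with the same index so they stay paired
ranked = [(columns[i], float(importances[i])) for i in indices]
for rank, (name, imp) in enumerate(ranked, start=1):
    print("{}. {} / feature_n: {} / importance: {}".format(
        rank, name, indices[rank - 1], imp))
```

Read this way, `feature_n` in the output above can be mapped back to a name by counting columns from 0 in the printed order.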

Optimization

Fit iteration 1

Fit a logistic regression on the variables with high feature importance to check their p-values and coefficients


In [167]:
model = sm.Logit.from_formula("loan_status ~ loan_amnt + int_rate + C(emp_title) + C(emp_length) + annual_inc + C(verification_status) + C(home_ownership)",
                              data=under_df)
result = model.fit()
print result.pred_table()
result.summary()


Optimization terminated successfully.
         Current function value: 0.632952
         Iterations 5
[[ 40636.  19716.]
 [ 23609.  36743.]]
Out[167]:
Logit Regression Results
Dep. Variable:    loan_status         No. Observations:    120704
Model:            Logit               Df Residuals:        120684
Method:           MLE                 Df Model:            19
Date:             Fri, 17 Mar 2017    Pseudo R-squ.:       0.08684
Time:             05:03:46            Log-Likelihood:      -76400.
converged:        True                LL-Null:             -83666.
                                      LLR p-value:         0.000

                                 coef  std err        z   P>|z|  [95.0% Conf. Int.]
Intercept                      1.6425    0.163   10.096   0.000     1.324     1.961
C(emp_title)[T.1]              0.3131    0.048    6.558   0.000     0.220     0.407
C(emp_length)[T.1]             0.0384    0.032    1.208   0.227    -0.024     0.101
C(emp_length)[T.2]             0.0979    0.029    3.356   0.001     0.041     0.155
C(emp_length)[T.3]             0.0682    0.030    2.262   0.024     0.009     0.127
C(emp_length)[T.4]             0.0814    0.032    2.520   0.012     0.018     0.145
C(emp_length)[T.5]             0.0865    0.031    2.762   0.006     0.025     0.148
C(emp_length)[T.6]             0.0403    0.033    1.216   0.224    -0.025     0.105
C(emp_length)[T.7]             0.0092    0.034    0.275   0.784    -0.057     0.075
C(emp_length)[T.8]             0.0038    0.035    0.108   0.914    -0.065     0.073
C(emp_length)[T.9]            -0.0040    0.038   -0.107   0.915    -0.078     0.070
C(emp_length)[T.10]            0.0551    0.024    2.272   0.023     0.008     0.103
C(emp_length)[T.11]           -0.0343    0.059   -0.583   0.560    -0.150     0.081
C(verification_status)[T.1]   -0.1118    0.014   -7.730   0.000    -0.140    -0.083
C(home_ownership)[T.2]         0.0834    0.239    0.349   0.727    -0.385     0.552
C(home_ownership)[T.3]        -0.0574    0.023   -2.545   0.011    -0.102    -0.013
C(home_ownership)[T.4]        -0.1470    0.014  -10.824   0.000    -0.174    -0.120
loan_amnt                     -0.7416    0.026  -29.060   0.000    -0.792    -0.692
int_rate                      -4.3722    0.049  -89.828   0.000    -4.468    -4.277
annual_inc                     1.2937    0.033   39.068   0.000     1.229     1.359
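Logit coefficients are on the log-odds scale, so exponentiating one gives an odds ratio for `loan_status = 1` relative to the reference level. The sketch below does this for three of the categorical terms, with coefficients copied from the table above. The continuous columns are left out because they appear to have been transformed earlier in the notebook, which makes a per-unit reading less direct.

```python
import math

# Coefficients copied from the iteration 1 summary table
coefs = {
    "C(emp_title)[T.1]": 0.3131,
    "C(verification_status)[T.1]": -0.1118,
    "C(home_ownership)[T.4]": -0.1470,
}

# exp(coef) = multiplicative change in the odds of loan_status == 1
# compared with the reference level of that categorical variable
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name in sorted(odds_ratios):
    print("{}: {:.3f}".format(name, odds_ratios[name]))
```

An odds ratio above 1 raises the odds of class 1, holding the other terms fixed, and one below 1 lowers them.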

Remove the variable whose levels mostly have high p-values (emp_length)


In [184]:
iter_df = scaled_df.drop("emp_length", axis=1)
under_sampling(iter_df)


X Shape:  (267888, 17)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y

In [185]:
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

In [159]:
def rf_fit():
    rf = RandomForestClassifier(max_features=None, random_state=0)
    rf_result = rf.fit(under_train_X, under_train_y)

    importances = rf_result.feature_importances_
    indices = np.argsort(importances)[::-1]

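    # Note: list(X.columns)[f] is the f-th column in original order, while indices[f]
    # is the column index of the f-th most important feature. The printed name is
    # therefore not the name of the ranked feature (X.columns[indices[f]] would be).
    for f in range(X.shape[1]):
        print("{}. {} / feature_n: {} / importance: {}".format(f+1, list(X.columns)[f], indices[f], importances[indices[f]]))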

    std = np.std([tree.feature_importances_ for tree in rf_result.estimators_],
                 axis=0)

    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
           yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])
    plt.show()

In [187]:
rf_fit()


D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()
1. loan_amnt / feature_n: 1 / importance: 0.163012653529
2. int_rate / feature_n: 9 / importance: 0.122158240701
3. emp_title / feature_n: 13 / importance: 0.110458846173
4. home_ownership / feature_n: 14 / importance: 0.107975373878
5. annual_inc / feature_n: 4 / importance: 0.0956290069998
6. verification_status / feature_n: 0 / importance: 0.088868919899
7. issue_d / feature_n: 15 / importance: 0.081914717001
8. desc / feature_n: 6 / importance: 0.0563041257624
9. purpose / feature_n: 7 / importance: 0.0422866619251
10. dti / feature_n: 8 / importance: 0.0305744085439
11. delinq_2yrs / feature_n: 11 / importance: 0.0280201247177
12. inq_last_6mths / feature_n: 3 / importance: 0.01652521287
13. pub_rec / feature_n: 10 / importance: 0.0164114731871
14. revol_bal / feature_n: 12 / importance: 0.0121277878523
15. revol_util / feature_n: 16 / importance: 0.0108470909168
16. total_acc / feature_n: 5 / importance: 0.0102946919374
17. initial_list_status / feature_n: 2 / importance: 0.00659066410688
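The `DataConversionWarning` above appears because the target is passed as a column vector of shape (n, 1). As the message itself suggests, flattening it first avoids the warning. A minimal illustration on a made-up array:

```python
import numpy as np

# Hypothetical column-vector target, shaped like under_train_y
y_col = np.array([[1], [0], [1], [1]])

# ravel() returns the same values as a 1-D array of shape (n_samples,)
y_flat = np.ravel(y_col)
print("{} -> {}".format(y_col.shape, y_flat.shape))
```

In the notebook this would presumably mean calling `rf.fit(under_train_X, np.ravel(under_train_y))`.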

Fit iteration 2


In [164]:
model = sm.Logit.from_formula("loan_status ~ loan_amnt + int_rate + C(emp_title) + C(issue_d) + annual_inc + C(verification_status) + C(home_ownership)",
                              data=under_df)
result = model.fit()
print result.pred_table()
result.summary()


Optimization terminated successfully.
         Current function value: 0.632352
         Iterations 5
[[ 40723.  19629.]
 [ 23608.  36744.]]
Out[164]:
Logit Regression Results
Dep. Variable:    loan_status         No. Observations:    120704
Model:            Logit               Df Residuals:        120684
Method:           MLE                 Df Model:            19
Date:             Fri, 17 Mar 2017    Pseudo R-squ.:       0.08771
Time:             05:03:01            Log-Likelihood:      -76327.
converged:        True                LL-Null:             -83666.
                                      LLR p-value:         0.000

                                 coef  std err        z   P>|z|  [95.0% Conf. Int.]
Intercept                      1.5862    0.153   10.382   0.000     1.287     1.886
C(emp_title)[T.1]              0.3586    0.026   13.610   0.000     0.307     0.410
C(issue_d)[T.2]               -0.0279    0.031   -0.915   0.360    -0.088     0.032
C(issue_d)[T.3]               -0.0120    0.030   -0.396   0.692    -0.071     0.047
C(issue_d)[T.4]               -0.0634    0.029   -2.160   0.031    -0.121    -0.006
C(issue_d)[T.5]               -0.0360    0.029   -1.228   0.219    -0.093     0.021
C(issue_d)[T.6]               -0.0192    0.030   -0.646   0.518    -0.078     0.039
C(issue_d)[T.7]               -0.0666    0.028   -2.361   0.018    -0.122    -0.011
C(issue_d)[T.8]                0.0531    0.030    1.786   0.074    -0.005     0.111
C(issue_d)[T.9]                0.1493    0.031    4.873   0.000     0.089     0.209
C(issue_d)[T.10]               0.0121    0.028    0.430   0.667    -0.043     0.067
C(issue_d)[T.11]               0.0378    0.029    1.281   0.200    -0.020     0.096
C(issue_d)[T.12]               0.1565    0.031    4.987   0.000     0.095     0.218
C(verification_status)[T.1]   -0.1424    0.014   -9.884   0.000    -0.171    -0.114
C(home_ownership)[T.2]         0.2426    0.233    1.039   0.299    -0.215     0.700
C(home_ownership)[T.3]        -0.0758    0.023   -3.357   0.001    -0.120    -0.032
C(home_ownership)[T.4]        -0.1550    0.013  -11.566   0.000    -0.181    -0.129
loan_amnt                     -0.7573    0.026  -29.665   0.000    -0.807    -0.707
int_rate                      -4.3215    0.049  -88.754   0.000    -4.417    -4.226
annual_inc                     1.3106    0.033   40.022   0.000     1.246     1.375
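The matrix printed by `result.pred_table()` can be turned into in-sample accuracy and per-class recall. In statsmodels, entry `[i, j]` counts rows observed as class `i` and predicted as class `j` (at the default 0.5 threshold), so the diagonal holds the correct predictions. A small sketch using the iteration 2 numbers from above:

```python
import numpy as np

# pred_table from fit iteration 2: rows = observed class, columns = predicted class
pred_table = np.array([[40723., 19629.],
                       [23608., 36744.]])

# Correct predictions sit on the diagonal
accuracy = np.trace(pred_table) / pred_table.sum()
# Recall per class = correct predictions / observed rows of that class
recall_0 = pred_table[0, 0] / pred_table[0].sum()
recall_1 = pred_table[1, 1] / pred_table[1].sum()

print("accuracy: {:.4f}".format(accuracy))
print("recall class 0: {:.4f}, recall class 1: {:.4f}".format(recall_0, recall_1))
```

These are in-sample figures on the balanced data, so they are best compared against the 0.5 baseline and not against the whole-dataset scores.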

In [188]:
iter_df["issue_d"] = iter_df["issue_d"].replace([1, 9, 12], 1)
iter_df["issue_d"] = iter_df["issue_d"].replace([2, 3, 4, 5, 6, 7, 8, 10, 11], 0)

print np.unique(iter_df["issue_d"])
under_sampling(iter_df)


[0 1]
X Shape:  (267888, 17)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y
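The two `replace` calls above collapse the issue month into a binary flag: 1 for months 1, 9 and 12, and 0 otherwise. The same mapping can be written as one membership test, sketched here on a made-up array of months:

```python
import numpy as np

# Hypothetical issue months, for illustration
months = np.array([1, 2, 3, 9, 10, 12, 7])

# 1 if the month is January, September or December, else 0
issue_flag = np.isin(months, [1, 9, 12]).astype(int)
print(issue_flag.tolist())
```

On the pandas column itself, the analogous expression would be `iter_df["issue_d"].isin([1, 9, 12]).astype(int)`, which also avoids relying on the order of the two replacements.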

In [189]:
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

In [190]:
rf_fit()


D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()
1. loan_amnt / feature_n: 1 / importance: 0.168268150536
2. int_rate / feature_n: 9 / importance: 0.128251442655
3. emp_title / feature_n: 13 / importance: 0.115353508891
4. home_ownership / feature_n: 14 / importance: 0.113473816335
5. annual_inc / feature_n: 4 / importance: 0.101729866349
6. verification_status / feature_n: 0 / importance: 0.0923634635808
7. issue_d / feature_n: 15 / importance: 0.087299923558
8. desc / feature_n: 7 / importance: 0.0429419049315
9. purpose / feature_n: 8 / importance: 0.0326709198208
10. dti / feature_n: 11 / importance: 0.0294871400802
11. delinq_2yrs / feature_n: 10 / importance: 0.017892607527
12. inq_last_6mths / feature_n: 3 / importance: 0.0166550554523
13. pub_rec / feature_n: 16 / importance: 0.0122725152856
14. revol_bal / feature_n: 12 / importance: 0.0119994681601
15. revol_util / feature_n: 6 / importance: 0.0114018819998
16. total_acc / feature_n: 5 / importance: 0.0107822332615
17. initial_list_status / feature_n: 2 / importance: 0.00715610157643

Fit iteration 3


In [176]:
model = sm.Logit.from_formula("loan_status ~ loan_amnt + int_rate + C(emp_title) + C(issue_d) + annual_inc + C(verification_status) + C(home_ownership)",
                              data=under_df)
result = model.fit()
print result.pred_table()
result.summary()


Optimization terminated successfully.
         Current function value: 0.632503
         Iterations 5
[[ 40681.  19671.]
 [ 23502.  36850.]]
Out[176]:
Logit Regression Results
Dep. Variable:    loan_status         No. Observations:    120704
Model:            Logit               Df Residuals:        120694
Method:           MLE                 Df Model:            9
Date:             Fri, 17 Mar 2017    Pseudo R-squ.:       0.08749
Time:             05:07:29            Log-Likelihood:      -76346.
converged:        True                LL-Null:             -83666.
                                      LLR p-value:         0.000

                                 coef  std err        z   P>|z|  [95.0% Conf. Int.]
Intercept                      1.8122    0.151   11.972   0.000     1.516     2.109
C(emp_title)[T.1]              0.3341    0.026   12.754   0.000     0.283     0.385
C(issue_d)[T.1]                0.1226    0.015    8.423   0.000     0.094     0.151
C(verification_status)[T.1]   -0.1376    0.014   -9.554   0.000    -0.166    -0.109
C(home_ownership)[T.2]         0.0941    0.241    0.391   0.696    -0.378     0.566
C(home_ownership)[T.3]        -0.0729    0.023   -3.226   0.001    -0.117    -0.029
C(home_ownership)[T.4]        -0.1462    0.013  -10.918   0.000    -0.172    -0.120
loan_amnt                     -0.7678    0.026  -30.063   0.000    -0.818    -0.718
int_rate                      -4.3578    0.049  -89.725   0.000    -4.453    -4.263
annual_inc                     1.2804    0.033   38.992   0.000     1.216     1.345

In [191]:
fitted_cols = ["loan_status", "emp_title", "issue_d", "verification_status",
               "home_ownership", "loan_amnt", "int_rate", "annual_inc"]
fitted_df = iter_df.reindex(columns=fitted_cols)
fitted_df.tail()


Out[191]:
        loan_status  emp_title  issue_d  verification_status  home_ownership  loan_amnt  int_rate  annual_inc
267883            1          1        1                    1               1   4.492062  1.342225    5.942008
267884            1          1        1                    1               1   4.033424  0.897077    4.965672
267885            1          1        1                    1               1   3.954243  0.962369    4.903090
267886            1          0        1                    1               4   4.158362  1.414806    4.792392
267887            1          1        1                    1               3   3.903090  1.100026    4.653213

In [192]:
under_sampling(fitted_df)


X Shape:  (267888, 7)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y

In [196]:
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

In [194]:
rf_fit()


D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()
1. emp_title / feature_n: 5 / importance: 0.302556766856
2. issue_d / feature_n: 6 / importance: 0.291654176929
3. verification_status / feature_n: 4 / importance: 0.270477162656
4. home_ownership / feature_n: 3 / importance: 0.054684974733
5. loan_amnt / feature_n: 1 / importance: 0.0357518921347
6. int_rate / feature_n: 2 / importance: 0.0320879910333
7. annual_inc / feature_n: 0 / importance: 0.0127870356587

In [200]:
rf = RandomForestClassifier(max_features=None, random_state=0)
rf_result = rf.fit(under_train_X, under_train_y)


D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app

In [201]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(under_test_y, rf_result.predict(under_test_X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Confusion Matrix (predicted on the under-sampled test set)')
plt.show()

print classification_report(under_test_y, rf_result.predict(under_test_X))


             precision    recall  f1-score   support

          0       0.58      0.65      0.61     12065
          1       0.60      0.53      0.56     12076

avg / total       0.59      0.59      0.59     24141


In [202]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y, rf_result.predict(X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Confusion Matrix (predicted on the whole dataset)')
plt.show()

print classification_report(y, rf_result.predict(X))


             precision    recall  f1-score   support

          0       0.42      0.91      0.57     60352
          1       0.96      0.63      0.76    207536

avg / total       0.84      0.69      0.72    267888
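The `avg / total` row of this report is a support-weighted mean of the per-class rows, which is why it sits much closer to the class 1 values. The sketch below recomputes it from the rounded per-class figures printed above. It is a reading aid only, and since `X` here also contains the rows the forest was fitted on, the scores are likely somewhat optimistic.

```python
# Per-class values copied from the report above (class 0, class 1)
support = [60352, 207536]
precision = [0.42, 0.96]
recall = [0.91, 0.63]
f1 = [0.57, 0.76]

def weighted_mean(values, weights):
    """Mean of values weighted by the number of rows in each class."""
    return sum(v * w for v, w in zip(values, weights)) / float(sum(weights))

avg_precision = weighted_mean(precision, support)
avg_recall = weighted_mean(recall, support)
avg_f1 = weighted_mean(f1, support)

print("precision: {:.2f}, recall: {:.2f}, f1: {:.2f}".format(
    avg_precision, avg_recall, avg_f1))
```

The support-weighted recall is the same quantity as overall accuracy, so the 0.69 here can be set against the share of class 1 rows in the data (207536 / 267888, about 0.77).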