Lending Club Default(charged off, late also are included) Fully Paid Classification Model

Main Repository for this project/
Raw data is gathered from Lending Club Loan Data and Kaggle
This project has been initiated from JAN 18 2017
Co-contributers are as below(sorted alphabetically)
- Jang Sungguk(simfeel87@gmail.com)
- Kim Gibeom(curtisk808@gmail.com)
- Shin Yoonsig(shinys825@gmail.com)

Summary

Basic Approach: How to build basic investment strategy for beginners

: focus on 7-features not to lose money

Check List

High annual income
Low interest rate
Low loan amount
Opened employment title
Issued on JAN, SEP, DEC
Verification status and Owned house are not so important, but relatively positive

Initialize

Library loading



In [26]:

    
import re
import numpy as np
import scipy as sp
import pandas as pd
import statsmodels.api as sm

import matplotlib as mpl
import matplotlib.tri as mtri
import matplotlib.pylab as plt
import matplotlib.patches as mpatches
import seaborn as sns

import itertools

from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

# Seaborn setting
sns.set(palette="hls", font_scale=2)

Read dataframe



In [6]:

    
df = pd.read_csv('frames/lc_dataframe.csv')
print list(df.columns)
df.tail()









    



['loan_amnt', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'desc', 'purpose', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'application_type']






    Out[6]:






  
    
      
      loan_amnt
      int_rate
      installment
      grade
      sub_grade
      emp_title
      emp_length
      home_ownership
      annual_inc
      verification_status
      ...
      dti
      delinq_2yrs
      inq_last_6mths
      open_acc
      pub_rec
      revol_bal
      revol_util
      total_acc
      initial_list_status
      application_type
    
  
  
    
      268131
      31050
      21.99
      857.40
      6
      61
      1
      10
      2
      875000.0
      1
      ...
      9.66
      1
      0
      10
      0
      25770
      79.3
      13
      0
      1
    
    
      268132
      10800
      7.89
      337.89
      1
      15
      1
      8
      2
      92400.0
      1
      ...
      19.62
      1
      0
      11
      0
      9760
      68.7
      36
      1
      1
    
    
      268133
      9000
      9.17
      286.92
      2
      22
      1
      1
      2
      80000.0
      1
      ...
      3.97
      1
      0
      8
      0
      6320
      51.8
      17
      0
      1
    
    
      268134
      14400
      25.99
      431.06
      6
      65
      0
      11
      6
      62000.0
      1
      ...
      16.88
      0
      1
      9
      1
      5677
      45.1
      30
      0
      1
    
    
      268135
      8000
      12.59
      267.98
      3
      32
      1
      4
      5
      45000.0
      1
      ...
      26.21
      0
      0
      12
      0
      9097
      50.8
      47
      1
      1
    
  

5 rows × 25 columns

Variables Description



In [3]:

    
var_desc = pd.read_csv('frames/var_description(csv).csv')
var_desc









    Out[3]:






  
    
      
      Name
      Included in dataframe
      Data type
      Encoding
      Not Null ratio
      Reason for removal
      English Description
      Korean Description
    
  
  
    
      0
      loan_status
      o
      Categorical
      Fully Paid = 1 / Charged Off, Default, Late(31...
      100.00%
      NaN
      Current status of the loan
      대출의 현재 상태
    
    
      1
      int_rate
      o
      Continuous
      NaN
      100.00%
      NaN
      Interest Rate on the loan
      대출 금리
    
    
      2
      emp_title
      o
      Categorical
      True = 1, False = 0
      94.20%
      NaN
      The job title supplied by the Borrower when ap...
      대출 신청시 차용자가 제공 한 직책 *
    
    
      3
      emp_length
      o
      Categorical
      < 1' = 0, 10+ = 10, 'n/a' = 11
      100.00%
      NaN
      Employment length in years. Possible values ar...
      고용 기간(년)
    
    
      4
      home_ownership
      o
      Categorical
      Mortgage, Other, Own, Rent = {1, 2, 3, 4}
      100.00%
      NaN
      The home ownership status provided by the borr...
      주택 소유 상태(렌트, 소유, 모기지, 기타)
    
    
      5
      annual_inc
      o
      Continuous
      NaN
      100.00%
      NaN
      The self-reported annual income provided by th...
      차용자 자체보고 연간 소득.
    
    
      6
      verification_status
      o
      Categorical
      Source Verified, Verified = 1, Not Verified = 0
      100.00%
      NaN
      Indicates if income was verified by LC, not ve...
      수입 및 수입원 확인여부
    
    
      7
      issue_d
      o
      Categorical
      Transform to mm
      100.00%
      NaN
      The month which the loan was funded
      대출 자금이 조달 된 달
    
    
      8
      loan_amnt
      o
      Continuous
      NaN
      100.00%
      NaN
      The listed amount of the loan applied for by t...
      대출금액
    
    
      9
      desc
      o
      Continuous
      True = doc length, False = 0
      14.20%
      NaN
      Loan description provided by the borrower
      대출 내용 *
    
    
      10
      purpose
      o
      Categorical
      14 categoried  = {1:14} (from 1 to 14)
      100.00%
      NaN
      A category provided by the borrower for the lo...
      대출 목적
    
    
      11
      dti
      o
      Continuous
      NaN
      100.00%
      NaN
      A ratio calculated using the borrower’s total ...
      차용인의 (월별 채무상환액 / 월별 소득)
    
    
      12
      delinq_2yrs
      o
      Categorical
      No delayed = 0, Delayed = 1
      100.00%
      NaN
      The number of 30+ days past-due incidences of ...
      지난 2 년간 연체 30일 이상 발생 횟수
    
    
      13
      inq_last_6mths
      o
      Categorical
      {0:5} = 0~5 times, 6 = above 6 times
      100.00%
      NaN
      The number of inquiries in past 6 months (excl...
      지난 6 개월 간 신용관련 문의 건수 (자동차, 모기지 제외)
    
    
      14
      pub_rec
      o
      Categorical
      No record = 0, record = 1
      100.00%
      NaN
      Number of derogatory public records
      추징, 압류 등과 관련된 기록
    
    
      15
      revol_bal
      o
      Continuous
      NaN
      100.00%
      NaN
      Total credit revolving balance
      총 크레딧 회전 균형
    
    
      16
      revol_util
      o
      Continuous
      NaN
      99.94%
      NaN
      Revolving line utilization rate, or the amount...
      리볼빙 이용률, 차용자가 사용하는 리볼빙 크레딧 금액
    
    
      17
      total_acc
      o
      Continuous
      NaN
      100.00%
      NaN
      The total number of credit lines currently in ...
      현재 개설된 신용계좌 수
    
    
      18
      initial_list_status
      o
      Categorical
      W = 1, F = 0
      100.00%
      NaN
      The initial listing status of the loan. Possib...
      대출의 초기 리스팅 상태(W: 전체, F: 부분)
    
    
      19
      acc_now_delinq
      x
      Continuous
      NaN
      100.00%
      unclear data collection date from LC
      The number of accounts on which the borrower i...
      연체중인 계좌의 수
    
    
      20
      addr_state
      x
      -
      NaN
      100.00%
      not related to analytics purpose
      The state provided by the borrower in the loan...
      대출 신청서에 차용인이 제공 한 국가
    
    
      21
      all_util
      x
      -
      NaN
      2.41%
      high null ratio
      Balance to credit limit on all trades
      신용한도
    
    
      22
      annual_inc_joint
      x
      -
      NaN
      0.06%
      high null ratio
      The combined self-reported annual income provi...
      등록 중 공동 차용자가 제공 한 자체보고 된 연간 소득
    
    
      23
      application_type
      x
      Categorical
      Individual = 1, Joint = 0
      100.00%
      extremely high unbalanced(above 99.9)
      Indicates whether the loan is an individual ap...
      대출신청 개인, 공동 여부
    
    
      24
      collection_recovery_fee
      x
      Continuous
      NaN
      100.00%
      not provided information on loan application
      post charge off collection fee
      수금수수료
    
    
      25
      collections_12_mths_ex_med
      x
      Continuous
      NaN
      99.98%
      not provided information on loan application
      Number of collections in 12 months excluding m...
      의료제외 12개월 간 수금건수
    
    
      26
      dti_joint
      x
      -
      NaN
      0.06%
      dependent to 'dti'
      A ratio calculated using the co-borrowers' tot...
      모기지 및 요청 된 LC 대출을 제외하고 총 차입금에 대한 공동 차용자의 월별 총 ...
    
    
      27
      earliest_cr_line
      x
      -
      NaN
      100.00%
      dependent to 'issue_d'
      The month the borrower's earliest reported cre...
      차용자의 최초보고 크레딧 라인이 개설 된 월
    
    
      28
      funded_amnt
      x
      Continuous
      NaN
      100.00%
      dependent to 'loan_amnt'
      The total amount committed to that loan at tha...
      해당 시점에 대출에 투입된 총 금액.
    
    
      29
      funded_amnt_inv
      x
      Continuous
      NaN
      100.00%
      dependent to 'loan_amnt'
      The total amount committed by investors for th...
      해당 시점에 해당 대출에 대해 투자자들이 투입한 총 금액.
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      44
      mths_since_rcnt_il
      x
      -
      NaN
      2.35%
      high null ratio
      Months since most recent installment accounts ...
      가장 최근의 할부 계좌가 개설 된 이후의 달
    
    
      45
      next_pymnt_d
      x
      -
      NaN
      71.49%
      unclear data collection date from LC
      Next scheduled payment date
      다음 지급 예정일
    
    
      46
      open_acc
      x
      Continuous
      NaN
      100.00%
      dependent to 'total_acc'
      The number of open credit lines in the borrowe...
      차용인의 개설된 신용계좌 수
    
    
      47
      open_acc_6m
      x
      -
      NaN
      2.41%
      high null ratio
      Number of open trades in last 6 months
      최근 6 개월 동안의 미결 거래 횟수
    
    
      48
      open_il_12m
      x
      -
      NaN
      2.41%
      high null ratio
      Number of installment accounts opened in past ...
      지난 12 개월 동안 개설 된 할부 계좌 수
    
    
      49
      open_il_24m
      x
      -
      NaN
      2.41%
      high null ratio
      Number of installment accounts opened in past ...
      지난 24 개월 동안 개설 된 할부 계좌 수
    
    
      50
      open_il_6m
      x
      -
      NaN
      2.41%
      high null ratio
      Number of currently active installment trades
      현재 활동중인 할부 거래 수
    
    
      51
      open_rv_12m
      x
      -
      NaN
      2.41%
      high null ratio
      Number of revolving trades opened in past 12 m...
      지난 12 개월 동안 개설 된 회전 거래 수
    
    
      52
      open_rv_24m
      x
      -
      NaN
      2.41%
      high null ratio
      Number of revolving trades opened in past 24 m...
      지난 24 개월 동안 개설 된 회전 거래 수
    
    
      53
      out_prncp
      x
      Continuous
      NaN
      100.00%
      not provided information on loan application
      Remaining outstanding principal for total amou...
      후원 된 총 금액에 대한 미납 된 원금
    
    
      54
      out_prncp_inv
      x
      Continuous
      NaN
      100.00%
      not provided information on loan application
      Remaining outstanding principal for portion of...
      투자자가 자금을 조달 한 부분의 잔액이 남아 있음.
    
    
      55
      policy_code
      x
      -
      NaN
      100.00%
      same all elements
      publicly available policy_code=1\nnew products...
      publicly available policy_code = 1\r\n공개적으로 사용...
    
    
      56
      pymnt_plan
      x
      Categorical
      y = 1, n = 0
      100.00%
      extremely high unbalanced(above 99.9)
      Indicates if a payment plan has been put in pl...
      대출금 지불 계획 수립여부
    
    
      57
      recoveries
      x
      Continuous
      NaN
      100.00%
      not provided information on loan application
      post charge off gross recovery
      총체적인 회복으로 인한 비용 청구
    
    
      58
      sub_grade
      x
      Categorical
      1, 2, 3, 4, 5
      100.00%
      dependent to 'grade'
      LC assigned loan subgrade
      LC 세부 신용등급
    
    
      59
      term
      x
      Continuous
      NaN
      100.00%
      not used in present
      The number of payments on the loan. Values are...
      대출금 지불 기간(값은 개월 단위)
    
    
      60
      title
      x
      -
      NaN
      99.98%
      too many categories
      The loan title provided by the borrower
      차용인이 제공 한 대출 명칭
    
    
      61
      tot_coll_amt
      x
      -
      NaN
      92.08%
      dependent to 'term'
      Total collection amounts ever owed
      적립 된 총 징수액
    
    
      62
      tot_cur_bal
      x
      -
      NaN
      92.08%
      dependent to other variables of repayment ability
      Total current balance of all accounts
      모든 계정의 총 현재 잔액
    
    
      63
      total_bal_il
      x
      -
      NaN
      2.41%
      high null ratio
      Total current balance of all installment accounts
      모든 할부 계좌의 총 현재 잔액
    
    
      64
      total_cu_tl
      x
      -
      NaN
      0.00%
      high null ratio
      Number of finance trades
      금융 거래 건수
    
    
      65
      total_pymnt
      x
      Continuous
      NaN
      100.00%
      not provided information on loan application
      Payments received to date for total amount funded
      현재까지받은 지불액
    
    
      66
      total_pymnt_inv
      x
      -
      NaN
      100.00%
      unclear data collection date from LC
      Payments received to date for portion of total...
      투자자가 자금을 조달 한 총액의 일부분에 대해 현재까지 수령 한 지급액
    
    
      67
      total_rec_int
      x
      -
      NaN
      100.00%
      unclear data collection date from LC
      Interest received to date
      현재까지받은이자
    
    
      68
      total_rec_late_fee
      x
      -
      NaN
      100.00%
      unclear data collection date from LC
      Late fees received to date
      현재까지 수령 한 연체료
    
    
      69
      total_rec_prncp
      x
      -
      NaN
      100.00%
      unclear data collection date from LC
      Principal received to date
      현재까지받은 교장
    
    
      70
      total_rev_hi_lim
      x
      -
      NaN
      92.08%
      dependent to 'revol_bal', 'revol_util'
      Total revolving high credit/credit limit
      총 회전 높은 신용 한도 / 신용 한도
    
    
      71
      url
      x
      x
      NaN
      100.00%
      not related to analytics purpose
      URL for the LC page with listing data.
      목록 데이터가있는 LC 페이지의 URL입니다.
    
    
      72
      verification_status_joint
      x
      -
      NaN
      0.06%
      high null ratio
      Indicates if the co-borrowers' joint income wa...
      공동 차용인의 공동 수입이 LC로 검증되었는지, 검증되지 않았는지, 수입원이 확인되...
    
    
      73
      zip_code
      x
      -
      NaN
      100.00%
      not related to analytics purpose
      The first 3 numbers of the zip code provided b...
      대출 신청서에 차용자가 제공 한 우편 번호의 처음 3 자리 숫자입니다.
    
  

74 rows × 8 columns

Check dataframe and elements



In [56]:

    
df.dtypes









    Out[56]:





loan_amnt                int64
int_rate               float64
installment            float64
grade                    int64
sub_grade                int64
emp_title                int64
emp_length               int64
home_ownership           int64
annual_inc             float64
verification_status      int64
issue_d                  int64
loan_status              int64
pymnt_plan               int64
desc                     int64
purpose                  int64
dti                    float64
delinq_2yrs              int64
inq_last_6mths           int64
open_acc                 int64
pub_rec                  int64
revol_bal                int64
revol_util             float64
total_acc                int64
initial_list_status      int64
application_type         int64
dtype: object



In [55]:

    
df.isnull().sum()









    Out[55]:





loan_amnt              0
int_rate               0
installment            0
grade                  0
sub_grade              0
emp_title              0
emp_length             0
home_ownership         0
annual_inc             0
verification_status    0
issue_d                0
loan_status            0
pymnt_plan             0
desc                   0
purpose                0
dti                    0
delinq_2yrs            0
inq_last_6mths         0
open_acc               0
pub_rec                0
revol_bal              0
revol_util             0
total_acc              0
initial_list_status    0
application_type       0
dtype: int64

Data Preprocessing

check correlation between dependent variables



In [257]:

    
fig = plt.figure(figsize = (15, 10))
corrmat = df.corr()
sns.heatmap(corrmat)
plt.title("Correlation between Features")
plt.show()

get high correlation



In [307]:

    
corr_stack = corrmat.abs().unstack()
ordered_stack = corr_stack.order(kind="quicksort", ascending=False)
order_ix = []

for num in range(len(ordered_stack)):
    if ordered_stack[num] > 0.6 and ordered_stack[num] < 1.0:
        order_ix.append(num)

print ordered_stack[min(order_ix):max(order_ix)]









    



sub_grade    grade          0.994620
grade        sub_grade      0.994620
sub_grade    int_rate       0.966306
int_rate     sub_grade      0.966306
installment  loan_amnt      0.954577
loan_amnt    installment    0.954577
int_rate     grade          0.950878
grade        int_rate       0.950878
open_acc     total_acc      0.672594
dtype: float64






    



D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: order is deprecated, use sort_values(...)
  from ipykernel import kernelapp as app

remove high corrleation variables



In [7]:

    
del df["grade"]
del df["sub_grade"]
del df["installment"]
del df["open_acc"]

check groupby elements



In [8]:

    
category_vars = ["emp_title", "emp_length", "home_ownership",
                 "verification_status", "issue_d", "purpose",
                 "initial_list_status", "pymnt_plan", "application_type"]

continuous_vars = ["loan_amnt", "int_rate", "annual_inc", 
                   "desc", "dti", "revol_bal", "revol_util", "total_acc",
                   "pub_rec", "inq_last_6mths", "delinq_2yrs"]



In [310]:

    
for var in category_vars:
    print df.groupby(var).loan_status.value_counts()









    



emp_title  loan_status
0          1               10494
           0                4471
1          1              197227
           0               55944
Name: loan_status, dtype: int64
emp_length  loan_status
0           1              17033
            0               5181
1           1              13892
            0               3984
2           1              19528
            0               5384
3           1              16846
            0               4761
4           1              13422
            0               3668
5           1              14856
            0               4122
6           1              12058
            0               3474
7           1              11483
            0               3381
8           1               9695
            0               2890
9           1               7789
            0               2392
10          1              63747
            0              17721
11          1               7372
            0               3457
Name: loan_status, dtype: int64
home_ownership  loan_status
1               1                   1
2               1              104965
                0               26496
3               1                  36
                0                   7
4               1                 114
                0                  27
5               1               17959
                0                5607
6               1               84646
                0               28278
Name: loan_status, dtype: int64
verification_status  loan_status
0                    1               73856
                     0               15480
1                    1              133865
                     0               44935
Name: loan_status, dtype: int64
issue_d  loan_status
1        1              18445
         0               5317
2        1              14696
         0               4508
3        1              15754
         0               4647
4        1              17455
         0               5405
5        1              16907
         0               5346
6        1              16138
         0               4991
7        1              20212
         0               6290
8        1              17607
         0               4863
9        1              15747
         0               4125
10       1              21371
         0               6242
11       1              18085
         0               4957
12       1              15304
         0               3724
Name: loan_status, dtype: int64
purpose  loan_status
1        1                3198
         0                 543
2        1               42249
         0               10536
3        1              120763
         0               37318
4        1                 269
         0                  56
5        1               12660
         0                3115
6        1                1366
         0                 369
7        1                5391
         0                1146
8        1                2285
         0                 726
9        1                1603
         0                 549
10       1               11341
         0                3732
11       1                 213
         0                  63
12       1                3375
         0                1630
13       1                1318
         0                 359
14       1                1690
         0                 273
Name: loan_status, dtype: int64
initial_list_status  loan_status
0                    1              149144
                     0               41104
1                    1               58577
                     0               19311
Name: loan_status, dtype: int64
pymnt_plan  loan_status
0           1              207719
            0               60410
1           0                   5
            1                   2
Name: loan_status, dtype: int64
application_type  loan_status
0                 0                   2
                  1                   1
1                 1              207720
                  0               60413
Name: loan_status, dtype: int64

remove variables 'pymnt_plan', 'application_type': almost every elements are in oneside



In [9]:

    
del df["pymnt_plan"]
del df["application_type"]



In [10]:

    
category_vars.remove("pymnt_plan")
category_vars.remove("application_type")

rearrage 'home_ownership', 'purpose'



In [11]:

    
df["home_ownership"].replace([1, 3], 4, inplace=True)
df["home_ownership"].replace(2, 1, inplace=True)
df["home_ownership"].replace(4, 2, inplace=True)
df["home_ownership"].replace(5, 3, inplace=True)
df["home_ownership"].replace(6, 4, inplace=True)



In [327]:

    
np.unique(df["home_ownership"])









    Out[327]:





array([1, 2, 3, 4], dtype=int64)

Home_ownership Category	Category number
Mortgage	1
Other	2
Own	3
Rent	4



In [12]:

    
df["purpose"].replace([4, 11], 10, inplace=True)
df["purpose"].replace(13, 4, inplace=True)
df["purpose"].replace(14, 11, inplace=True)



In [326]:

    
np.unique(df["purpose"])









    Out[326]:





array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int64)

Purpose Category	Category number
car	1
credit_card	2
debt_consolidation	3
vacation	4
home_improvement	5
house	6
major_purchase	7
medical	8
moving	9
other	10
wedding	11
small_business	12

Histogram



In [332]:

    
print len(category_vars), len(continuous_vars)



In [377]:

    
N=4; M=2;  # set row and column of the figure
fig = plt.figure(figsize=(10,15))  # figure size

plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# make subplot with face data index
for i in range(N):
    for j in range(M):
        if i * M + j < 7:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            plt.hist(df[category_vars[i * M + j]], bins=30)
            plt.title(category_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("Histograms for category variables", y=1.02,fontsize=30)
plt.show()



In [381]:

    
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size

plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# make subplot with face data index
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            plt.hist(df[continuous_vars[i * M + j]], bins=10)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("Histograms for continuous variables", y=1.02,fontsize=30)
plt.show()

QQ-Plot



In [420]:

    
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# make subplot with face data index
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            sp.stats.probplot(df[continuous_vars[i * M + j]], plot=plt)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("QQ-Plots for continuous variables", y=1.02,fontsize=30)
plt.show()

Remove outliers(0.005% of highest and lowest observations)



In [13]:

    
outliers = int(len(df) * 0.00005)
print "Number of 0.005%: ", outliers

out_ix = np.array([])

for name in continuous_vars:
    out_large_list = np.array(df[name].nlargest(outliers).index)
    out_small_list = np.array(df[name].nsmallest(outliers).index)
    out_ix = np.concatenate((out_ix, out_large_list, out_small_list))
    out_ix = np.unique(out_ix)
print "Number of outliers(0.005% * 2): ", out_ix.shape[0]









    



Number of 0.005%:  13
Number of outliers(0.005% * 2):  248



In [14]:

    
cleaned_df = df.drop(df.index[list(out_ix)])
cleaned_df.index = range(len(cleaned_df))
cleaned_df.tail()









    



D:\Anaconda2\lib\site-packages\pandas\indexes\base.py:1434: VisibleDeprecationWarning: non integer (and non boolean) array-likes will not be accepted as indices in the future
  result = getitem(key)






    Out[14]:






  
    
      
      loan_amnt
      int_rate
      emp_title
      emp_length
      home_ownership
      annual_inc
      verification_status
      issue_d
      loan_status
      desc
      purpose
      dti
      delinq_2yrs
      inq_last_6mths
      pub_rec
      revol_bal
      revol_util
      total_acc
      initial_list_status
    
  
  
    
      267883
      31050
      21.99
      1
      10
      1
      875000.0
      1
      12
      1
      0
      3
      9.66
      1
      0
      0
      25770
      79.3
      13
      0
    
    
      267884
      10800
      7.89
      1
      8
      1
      92400.0
      1
      12
      1
      0
      2
      19.62
      1
      0
      0
      9760
      68.7
      36
      1
    
    
      267885
      9000
      9.17
      1
      1
      1
      80000.0
      1
      12
      1
      0
      3
      3.97
      1
      0
      0
      6320
      51.8
      17
      0
    
    
      267886
      14400
      25.99
      0
      11
      4
      62000.0
      1
      12
      1
      0
      3
      16.88
      0
      1
      1
      5677
      45.1
      30
      0
    
    
      267887
      8000
      12.59
      1
      4
      3
      45000.0
      1
      12
      1
      0
      3
      26.21
      0
      0
      0
      9097
      50.8
      47
      1

QQ-Plot after removing outliers



In [422]:

    
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# make subplot with face data index
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            sp.stats.probplot(cleaned_df[continuous_vars[i * M + j]], plot=plt)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("QQ-Plots for continuous variables after removing outliers", y=1.02,fontsize=30)
plt.show()

Scaling and Logarithm



In [15]:

    
log_vars = ["int_rate", "annual_inc", "loan_amnt"]
scaling_vars = ["desc", "dti", "revol_bal", "revol_util", "total_acc", "pub_rec", "inq_last_6mths", "delinq_2yrs"]

check 0 for log_vars to prevent '-inf'



In [401]:

    
for i in log_vars:
    print cleaned_df[cleaned_df[i].isin([int(0)])].index









    



Int64Index([], dtype='int64')
Int64Index([], dtype='int64')
Int64Index([], dtype='int64')



In [16]:

    
cleaned_df[log_vars] = np.log10(cleaned_df[log_vars])



In [17]:

    
cleaned_df[scaling_vars] = preprocessing.scale(cleaned_df[scaling_vars])



In [419]:

    
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size
plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# make subplot with face data index
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            sp.stats.probplot(cleaned_df[continuous_vars[i * M + j]], plot=plt)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("QQ-Plots for continuous variables after Scaling and Logarithm", y=1.02,fontsize=30)
plt.show()



In [426]:

    
N=4; M=3;  # set row and column of the figure
fig = plt.figure(figsize=(15,15))  # figure size

plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0)  # subplot setting

# make subplot with face data index
for i in range(N):
    for j in range(M):
        if i * M + j < 11:
            ax = fig.add_subplot(N, M, i * M + j + 1)
            plt.hist(cleaned_df[continuous_vars[i * M + j]], bins=10)
            plt.title(continuous_vars[i * M + j])
        else:
            pass
plt.tight_layout()
plt.suptitle("Histograms for continuous variables after Scaling and Logarithm", y=1.02,fontsize=30)
plt.show()

Fit and Scoring

Under Sampling



In [18]:

    
scaled_df = cleaned_df



In [20]:

    
def under_sampling(df_name):
    
    global X, y, under_X, under_y, under_df
    
    X = df_name.ix[:, df_name.columns != "loan_status"]
    y = df_name.ix[:, df_name.columns == "loan_status"]

    default_ix = np.array(df_name[df_name.loan_status == 0].index)
    paid_ix = np.array(df_name[df_name.loan_status == 1].index)
    rand_paid_ix = np.random.choice(paid_ix, len(default_ix), replace=False)
    rand_ix = np.concatenate([default_ix, rand_paid_ix])

    print "X Shape: ", X.shape
    print "y Shape: ", y.shape
    print "Default Index Length: ", len(default_ix)
    print "Paid Index Length: ", len(rand_paid_ix)
    print "Random Index Shape: ", rand_ix.shape
    print "-" * 75

    under_df = df_name.iloc[rand_ix, :]
    under_X = under_df.ix[:, under_df.columns != "loan_status"]
    under_y = under_df.ix[:, under_df.columns == "loan_status"]

    print "Under DataFrame's X and y", len(under_X), len(under_y)
    print "Data Variables Names: under_df, under_X, under_y"



In [21]:

    
under_sampling(scaled_df)









    



X Shape:  (267888, 18)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y

Split train and test for under sampled data and whole data



In [22]:

    
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

Random Forest Fit



In [18]:

    
rf = RandomForestClassifier(max_features=None, random_state=0)
rf_result = rf.fit(under_train_X, under_train_y)









    



D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app

Confusion Matrix



In [23]:

    
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Oranges):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)


    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Cofusion Matrix(predict with under sampled test set)



In [20]:

    
# Compute confusion matrix
cnf_matrix = confusion_matrix(under_test_y, rf_result.predict(under_test_X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Cofusion Matrix(predict with under sampled test set)')
plt.show()

print classification_report(under_test_y, rf_result.predict(under_test_X))









    












    



             precision    recall  f1-score   support

          0       0.60      0.69      0.64     12065
          1       0.63      0.53      0.58     12076

avg / total       0.61      0.61      0.61     24141

Cofusion Matrix(predict with whole test set)



In [21]:

    
# Compute confusion matrix
cnf_matrix = confusion_matrix(y, rf_result.predict(X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Cofusion Matrix(predict with whole test set)')
plt.show()

print classification_report(y, rf_result.predict(X))









    












    



             precision    recall  f1-score   support

          0       0.42      0.93      0.58     60352
          1       0.97      0.63      0.77    207536

avg / total       0.85      0.70      0.72    267888

Set CV and Models



In [24]:

    
cv = StratifiedKFold(n_splits=8, random_state=0, shuffle=False)

sgd = SGDClassifier(loss="log", fit_intercept=True,
                    average=1000, n_iter=30, n_jobs=8,
                    random_state=0)

lda = LinearDiscriminantAnalysis()

lr = LogisticRegression(n_jobs=8)

Dummy Encoding for Categori Variables



In [113]:

    
dummies_df = scaled_df[continuous_vars]

for var in category_vars:
    dummies_df = dummies_df.join(pd.get_dummies(scaled_df[var], prefix=var))
print dummies_df.shape
print dummies_df.columns

dummies_df = dummies_df.join(scaled_df["loan_status"])
dummies_df.tail()









    



(267888, 57)
Index([u'loan_amnt', u'int_rate', u'annual_inc', u'desc', u'dti', u'revol_bal',
       u'revol_util', u'total_acc', u'pub_rec', u'inq_last_6mths',
       u'delinq_2yrs', u'emp_title_0', u'emp_title_1', u'emp_length_0',
       u'emp_length_1', u'emp_length_2', u'emp_length_3', u'emp_length_4',
       u'emp_length_5', u'emp_length_6', u'emp_length_7', u'emp_length_8',
       u'emp_length_9', u'emp_length_10', u'emp_length_11',
       u'home_ownership_1', u'home_ownership_2', u'home_ownership_3',
       u'home_ownership_4', u'verification_status_0', u'verification_status_1',
       u'issue_d_1', u'issue_d_2', u'issue_d_3', u'issue_d_4', u'issue_d_5',
       u'issue_d_6', u'issue_d_7', u'issue_d_8', u'issue_d_9', u'issue_d_10',
       u'issue_d_11', u'issue_d_12', u'purpose_1', u'purpose_2', u'purpose_3',
       u'purpose_4', u'purpose_5', u'purpose_6', u'purpose_7', u'purpose_8',
       u'purpose_9', u'purpose_10', u'purpose_11', u'purpose_12',
       u'initial_list_status_0', u'initial_list_status_1'],
      dtype='object')






    Out[113]:






  
    
      
      loan_amnt
      int_rate
      annual_inc
      desc
      dti
      revol_bal
      revol_util
      total_acc
      pub_rec
      inq_last_6mths
      ...
      purpose_6
      purpose_7
      purpose_8
      purpose_9
      purpose_10
      purpose_11
      purpose_12
      initial_list_status_0
      initial_list_status_1
      loan_status
    
  
  
    
      267883
      4.492062
      1.342225
      5.942008
      -0.419569
      -0.899401
      0.600672
      1.004029
      -1.025334
      -0.335109
      -0.798916
      ...
      0
      0
      0
      0
      0
      0
      0
      1
      0
      1
    
    
      267884
      4.033424
      0.897077
      4.965672
      -0.419569
      0.368136
      -0.304613
      0.574806
      0.932358
      -0.335109
      -0.798916
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      1
      1
    
    
      267885
      3.954243
      0.962369
      4.903090
      -0.419569
      -1.623526
      -0.499128
      -0.109522
      -0.684866
      -0.335109
      -0.798916
      ...
      0
      0
      0
      0
      0
      0
      0
      1
      0
      1
    
    
      267886
      4.158362
      1.414806
      4.792392
      -0.419569
      0.019436
      -0.535486
      -0.380823
      0.421656
      1.941395
      0.138736
      ...
      0
      0
      0
      0
      0
      0
      0
      1
      0
      1
    
    
      267887
      3.903090
      1.100026
      4.653213
      -0.419569
      1.206798
      -0.342102
      -0.150015
      1.868646
      -0.335109
      -0.798916
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      1
      1
    
  

5 rows × 58 columns



In [109]:

    
under_sampling(dummies_df)









    



X Shape:  (267888, 57)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y

Get Scores

SGD



In [124]:

    
print "SGD CV Accuracy for under sampled dataset"
sgd_cv_under = cross_val_score(sgd, under_X,
            np.array(under_y).reshape(len(under_y, )),
            cv = cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", sgd_cv_under
print "Average Accuracy: ", np.mean(sgd_cv_under)

print '-' * 75

print "SGD CV Accuracy for whole dataset"
sgd_cv_whole = cross_val_score(sgd, X,
            np.array(y).reshape(len(y, )),
            cv = cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", sgd_cv_whole
print "Average Accuracy: ", np.mean(sgd_cv_whole)









    



SGD CV Accuracy for under sampled dataset
Each Validation's Accuracy:  [ 0.45  0.59  0.64  0.59  0.6   0.59  0.58  0.65]
Average Accuracy:  0.585722097031
---------------------------------------------------------------------------
SGD CV Accuracy for whole dataset
Each Validation's Accuracy:  [ 0.78  0.77  0.75  0.72  0.73  0.73  0.75  0.77]
Average Accuracy:  0.750216508392

LDA



In [80]:

    
print "LDA CV Accuracy for under sampled dataset"
lda_cv_under = cross_val_score(lda, under_X,
            np.array(under_y).reshape(len(under_y, )),
            cv = cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lda_cv_under
print "Average Accuracy: ", np.mean(lda_cv_under)

print '-' * 75

print "LDA CV Accuracy for whole dataset"
lda_cv_whole = cross_val_score(lda, X,
            np.array(y).reshape(len(y, )),
            cv = cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lda_cv_whole
print "Average Accuracy: ", np.mean(lda_cv_whole)









    



LDA CV Accuracy for under sampled dataset
Each Validation's Accuracy:  [ 0.46  0.6   0.64  0.6   0.6   0.58  0.57  0.64]
Average Accuracy:  0.588108099152
---------------------------------------------------------------------------
LDA CV Accuracy for whole dataset
Each Validation's Accuracy:  [ 0.78  0.77  0.75  0.72  0.73  0.73  0.74  0.77]
Average Accuracy:  0.749914143224

LogisticRegression



In [125]:

    
print "LogisticRegression CV Accuracy for under sampled dataset"
lr_cv_under = cross_val_score(lr, under_X,
            np.array(under_y).reshape(len(under_y, )),
            cv = cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lr_cv_under
print "Average Accuracy: ", np.mean(lr_cv_under)

print '-' * 75

print "LogisticRegression CV Accuracy for whole dataset"
lr_cv_whole = cross_val_score(lr, X,
            np.array(y).reshape(len(y, )),
            cv = cv, scoring="accuracy", n_jobs=8)
print "Each Validation's Accuracy: ", lr_cv_whole
print "Average Accuracy: ", np.mean(lr_cv_whole)









    



LogisticRegression CV Accuracy for under sampled dataset
Each Validation's Accuracy:  [ 0.45  0.59  0.64  0.59  0.6   0.59  0.58  0.65]
Average Accuracy:  0.586915098091
---------------------------------------------------------------------------
LogisticRegression CV Accuracy for whole dataset
Each Validation's Accuracy:  [ 0.78  0.77  0.75  0.72  0.73  0.73  0.75  0.77]
Average Accuracy:  0.748308994804

ROC Curve



In [27]:

    
lr_result = lr.fit(under_train_X, under_train_y)
sgd_result = sgd.fit(under_train_X, under_train_y)
lda_result = lda.fit(under_train_X, under_train_y)

fpr0, tpr0, thresholds0 = roc_curve(y, sgd_result.decision_function(X), pos_label=1)
fpr1, tpr1, thresholds1 = roc_curve(y, lr_result.decision_function(X), pos_label=1)
fpr2, tpr2, thresholds2 = roc_curve(y, lda_result.decision_function(X), pos_label=1)

plt.plot(fpr0, tpr0, "r-", label="LogisticRegression")
plt.plot(fpr1, tpr1, "g-", label="SGD")
plt.plot(fpr2, tpr2, "y-", label="LDA")
plt.plot([0, 1], [0, 1], 'k--', label="random guess")
plt.xlim(-0.05, 1.0)
plt.ylim(0, 1.05)
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver operating characteristic Curve')
plt.legend(loc="lower right")
plt.show()

Random Forest Feature importances



In [469]:

    
importances = rf_result.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X.shape[1]):
    print("{}. {} / feature_n: {} / importance: {}".format(f+1, list(X.columns)[f], indices[f], importances[indices[f]]))

std = np.std([tree.feature_importances_ for tree in rf_result.estimators_],
             axis=0)

plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()









    



1. loan_amnt / feature_n: 1 / importance: 0.158881575828
2. int_rate / feature_n: 10 / importance: 0.114321064941
3. emp_title / feature_n: 14 / importance: 0.104817431158
4. emp_length / feature_n: 15 / importance: 0.103399981851
5. home_ownership / feature_n: 5 / importance: 0.0919407006712
6. annual_inc / feature_n: 0 / importance: 0.0849728830988
7. verification_status / feature_n: 16 / importance: 0.0784396621926
8. issue_d / feature_n: 7 / importance: 0.0540577119001
9. desc / feature_n: 3 / importance: 0.045905758798
10. purpose / feature_n: 8 / importance: 0.0405277987998
11. dti / feature_n: 9 / importance: 0.0290124276806
12. delinq_2yrs / feature_n: 12 / importance: 0.0268262275236
13. inq_last_6mths / feature_n: 4 / importance: 0.0158916830919
14. pub_rec / feature_n: 11 / importance: 0.0152995025604
15. revol_bal / feature_n: 13 / importance: 0.0111792915635
16. revol_util / feature_n: 17 / importance: 0.0105867564604
17. total_acc / feature_n: 6 / importance: 0.00935101737055
18. initial_list_status / feature_n: 2 / importance: 0.0045885245118

Optimization

Fit iteration 1

Logistic Regressions with High feature importance variables to check p-value and coefficient



In [167]:

    
model = sm.Logit.from_formula("loan_status ~ loan_amnt + int_rate + C(emp_title) + C(emp_length) + annual_inc + C(verification_status) + C(home_ownership)",
                              data=under_df)
result = model.fit()
print result.pred_table()
result.summary()









    



Optimization terminated successfully.
         Current function value: 0.632952
         Iterations 5
[[ 40636.  19716.]
 [ 23609.  36743.]]






    Out[167]:





Logit Regression Results

  Dep. Variable:     loan_status      No. Observations:    120704 


  Model:                Logit         Df Residuals:        120684 


  Method:                MLE          Df Model:                19 


  Date:           Fri, 17 Mar 2017    Pseudo R-squ.:       0.08684


  Time:               05:03:46        Log-Likelihood:      -76400.


  converged:            True          LL-Null:             -83666.


                                    LLR p-value:          0.000 




                                 coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept                        1.6425      0.163     10.096   0.000      1.324     1.961


  C(emp_title)[T.1]                0.3131      0.048      6.558   0.000      0.220     0.407


  C(emp_length)[T.1]               0.0384      0.032      1.208   0.227     -0.024     0.101


  C(emp_length)[T.2]               0.0979      0.029      3.356   0.001      0.041     0.155


  C(emp_length)[T.3]               0.0682      0.030      2.262   0.024      0.009     0.127


  C(emp_length)[T.4]               0.0814      0.032      2.520   0.012      0.018     0.145


  C(emp_length)[T.5]               0.0865      0.031      2.762   0.006      0.025     0.148


  C(emp_length)[T.6]               0.0403      0.033      1.216   0.224     -0.025     0.105


  C(emp_length)[T.7]               0.0092      0.034      0.275   0.784     -0.057     0.075


  C(emp_length)[T.8]               0.0038      0.035      0.108   0.914     -0.065     0.073


  C(emp_length)[T.9]              -0.0040      0.038     -0.107   0.915     -0.078     0.070


  C(emp_length)[T.10]              0.0551      0.024      2.272   0.023      0.008     0.103


  C(emp_length)[T.11]             -0.0343      0.059     -0.583   0.560     -0.150     0.081


  C(verification_status)[T.1]     -0.1118      0.014     -7.730   0.000     -0.140    -0.083


  C(home_ownership)[T.2]           0.0834      0.239      0.349   0.727     -0.385     0.552


  C(home_ownership)[T.3]          -0.0574      0.023     -2.545   0.011     -0.102    -0.013


  C(home_ownership)[T.4]          -0.1470      0.014    -10.824   0.000     -0.174    -0.120


  loan_amnt                       -0.7416      0.026    -29.060   0.000     -0.792    -0.692


  int_rate                        -4.3722      0.049    -89.828   0.000     -4.468    -4.277


  annual_inc                       1.2937      0.033     39.068   0.000      1.229     1.359

Remove the variable with high p-value



In [184]:

    
iter_df = scaled_df.drop("emp_length", 1)
under_sampling(iter_df)









    



X Shape:  (267888, 17)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y



In [185]:

    
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)



In [159]:

    
def rf_fit():
    rf = RandomForestClassifier(max_features=None, random_state=0)
    rf_result = rf.fit(under_train_X, under_train_y)

    importances = rf_result.feature_importances_
    indices = np.argsort(importances)[::-1]

    for f in range(X.shape[1]):
        print("{}. {} / feature_n: {} / importance: {}".format(f+1, list(X.columns)[f], indices[f], importances[indices[f]]))

    std = np.std([tree.feature_importances_ for tree in rf_result.estimators_],
                 axis=0)

    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
           yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])
    plt.show()



In [187]:

    
rf_fit()









    



D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()






    



1. loan_amnt / feature_n: 1 / importance: 0.163012653529
2. int_rate / feature_n: 9 / importance: 0.122158240701
3. emp_title / feature_n: 13 / importance: 0.110458846173
4. home_ownership / feature_n: 14 / importance: 0.107975373878
5. annual_inc / feature_n: 4 / importance: 0.0956290069998
6. verification_status / feature_n: 0 / importance: 0.088868919899
7. issue_d / feature_n: 15 / importance: 0.081914717001
8. desc / feature_n: 6 / importance: 0.0563041257624
9. purpose / feature_n: 7 / importance: 0.0422866619251
10. dti / feature_n: 8 / importance: 0.0305744085439
11. delinq_2yrs / feature_n: 11 / importance: 0.0280201247177
12. inq_last_6mths / feature_n: 3 / importance: 0.01652521287
13. pub_rec / feature_n: 10 / importance: 0.0164114731871
14. revol_bal / feature_n: 12 / importance: 0.0121277878523
15. revol_util / feature_n: 16 / importance: 0.0108470909168
16. total_acc / feature_n: 5 / importance: 0.0102946919374
17. initial_list_status / feature_n: 2 / importance: 0.00659066410688

Fit Iteration 2



In [164]:

    
model = sm.Logit.from_formula("loan_status ~ loan_amnt + int_rate + C(emp_title) + C(issue_d) + annual_inc + C(verification_status) + C(home_ownership)",
                              data=under_df)
result = model.fit()
print result.pred_table()
result.summary()









    



Optimization terminated successfully.
         Current function value: 0.632352
         Iterations 5
[[ 40723.  19629.]
 [ 23608.  36744.]]






    Out[164]:





Logit Regression Results

  Dep. Variable:     loan_status      No. Observations:    120704 


  Model:                Logit         Df Residuals:        120684 


  Method:                MLE          Df Model:                19 


  Date:           Fri, 17 Mar 2017    Pseudo R-squ.:       0.08771


  Time:               05:03:01        Log-Likelihood:      -76327.


  converged:            True          LL-Null:             -83666.


                                    LLR p-value:          0.000 




                                 coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept                        1.5862      0.153     10.382   0.000      1.287     1.886


  C(emp_title)[T.1]                0.3586      0.026     13.610   0.000      0.307     0.410


  C(issue_d)[T.2]                 -0.0279      0.031     -0.915   0.360     -0.088     0.032


  C(issue_d)[T.3]                 -0.0120      0.030     -0.396   0.692     -0.071     0.047


  C(issue_d)[T.4]                 -0.0634      0.029     -2.160   0.031     -0.121    -0.006


  C(issue_d)[T.5]                 -0.0360      0.029     -1.228   0.219     -0.093     0.021


  C(issue_d)[T.6]                 -0.0192      0.030     -0.646   0.518     -0.078     0.039


  C(issue_d)[T.7]                 -0.0666      0.028     -2.361   0.018     -0.122    -0.011


  C(issue_d)[T.8]                  0.0531      0.030      1.786   0.074     -0.005     0.111


  C(issue_d)[T.9]                  0.1493      0.031      4.873   0.000      0.089     0.209


  C(issue_d)[T.10]                 0.0121      0.028      0.430   0.667     -0.043     0.067


  C(issue_d)[T.11]                 0.0378      0.029      1.281   0.200     -0.020     0.096


  C(issue_d)[T.12]                 0.1565      0.031      4.987   0.000      0.095     0.218


  C(verification_status)[T.1]     -0.1424      0.014     -9.884   0.000     -0.171    -0.114


  C(home_ownership)[T.2]           0.2426      0.233      1.039   0.299     -0.215     0.700


  C(home_ownership)[T.3]          -0.0758      0.023     -3.357   0.001     -0.120    -0.032


  C(home_ownership)[T.4]          -0.1550      0.013    -11.566   0.000     -0.181    -0.129


  loan_amnt                       -0.7573      0.026    -29.665   0.000     -0.807    -0.707


  int_rate                        -4.3215      0.049    -88.754   0.000     -4.417    -4.226


  annual_inc                       1.3106      0.033     40.022   0.000      1.246     1.375



In [188]:

    
iter_df["issue_d"] = iter_df["issue_d"].replace([1, 9, 12], 1)
iter_df["issue_d"] = iter_df["issue_d"].replace([2, 3, 4, 5, 6, 7, 8, 10, 11], 0)

print np.unique(iter_df["issue_d"])
under_sampling(iter_df)









    



[0 1]
X Shape:  (267888, 17)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y



In [189]:

    
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)



In [190]:

    
rf_fit()









    



D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()






    



1. loan_amnt / feature_n: 1 / importance: 0.168268150536
2. int_rate / feature_n: 9 / importance: 0.128251442655
3. emp_title / feature_n: 13 / importance: 0.115353508891
4. home_ownership / feature_n: 14 / importance: 0.113473816335
5. annual_inc / feature_n: 4 / importance: 0.101729866349
6. verification_status / feature_n: 0 / importance: 0.0923634635808
7. issue_d / feature_n: 15 / importance: 0.087299923558
8. desc / feature_n: 7 / importance: 0.0429419049315
9. purpose / feature_n: 8 / importance: 0.0326709198208
10. dti / feature_n: 11 / importance: 0.0294871400802
11. delinq_2yrs / feature_n: 10 / importance: 0.017892607527
12. inq_last_6mths / feature_n: 3 / importance: 0.0166550554523
13. pub_rec / feature_n: 16 / importance: 0.0122725152856
14. revol_bal / feature_n: 12 / importance: 0.0119994681601
15. revol_util / feature_n: 6 / importance: 0.0114018819998
16. total_acc / feature_n: 5 / importance: 0.0107822332615
17. initial_list_status / feature_n: 2 / importance: 0.00715610157643

Fit iteration 3



In [176]:

    
model = sm.Logit.from_formula("loan_status ~ loan_amnt + int_rate + C(emp_title) + C(issue_d) + annual_inc + C(verification_status) + C(home_ownership)",
                              data=under_df)
result = model.fit()
print result.pred_table()
result.summary()









    



Optimization terminated successfully.
         Current function value: 0.632503
         Iterations 5
[[ 40681.  19671.]
 [ 23502.  36850.]]






    Out[176]:





Logit Regression Results

  Dep. Variable:     loan_status      No. Observations:    120704 


  Model:                Logit         Df Residuals:        120694 


  Method:                MLE          Df Model:                 9 


  Date:           Fri, 17 Mar 2017    Pseudo R-squ.:       0.08749


  Time:               05:07:29        Log-Likelihood:      -76346.


  converged:            True          LL-Null:             -83666.


                                    LLR p-value:          0.000 




                                 coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept                        1.8122      0.151     11.972   0.000      1.516     2.109


  C(emp_title)[T.1]                0.3341      0.026     12.754   0.000      0.283     0.385


  C(issue_d)[T.1]                  0.1226      0.015      8.423   0.000      0.094     0.151


  C(verification_status)[T.1]     -0.1376      0.014     -9.554   0.000     -0.166    -0.109


  C(home_ownership)[T.2]           0.0941      0.241      0.391   0.696     -0.378     0.566


  C(home_ownership)[T.3]          -0.0729      0.023     -3.226   0.001     -0.117    -0.029


  C(home_ownership)[T.4]          -0.1462      0.013    -10.918   0.000     -0.172    -0.120


  loan_amnt                       -0.7678      0.026    -30.063   0.000     -0.818    -0.718


  int_rate                        -4.3578      0.049    -89.725   0.000     -4.453    -4.263


  annual_inc                       1.2804      0.033     38.992   0.000      1.216     1.345



In [191]:

    
fitted_cols = ["loan_status", "emp_title", "issue_d", "verification_status",
               "home_ownership", "loan_amnt", "int_rate", "annual_inc"]
fitted_df = iter_df.reindex(columns=fitted_cols)
fitted_df.tail()









    Out[191]:






  
    
      
      loan_status
      emp_title
      issue_d
      verification_status
      home_ownership
      loan_amnt
      int_rate
      annual_inc
    
  
  
    
      267883
      1
      1
      1
      1
      1
      4.492062
      1.342225
      5.942008
    
    
      267884
      1
      1
      1
      1
      1
      4.033424
      0.897077
      4.965672
    
    
      267885
      1
      1
      1
      1
      1
      3.954243
      0.962369
      4.903090
    
    
      267886
      1
      0
      1
      1
      4
      4.158362
      1.414806
      4.792392
    
    
      267887
      1
      1
      1
      1
      3
      3.903090
      1.100026
      4.653213



In [192]:

    
under_sampling(fitted_df)









    



X Shape:  (267888, 7)
y Shape:  (267888, 1)
Default Index Length:  60352
Paid Index Length:  60352
Random Index Shape:  (120704L,)
---------------------------------------------------------------------------
Under DataFrame's X and y 120704 120704
Data Variables Names: under_df, under_X, under_y



In [196]:

    
under_train_X, under_test_X, under_train_y, under_test_y = train_test_split(
    under_X, under_y, test_size=0.20, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)



In [194]:

    
rf_fit()









    



D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()






    



1. emp_title / feature_n: 5 / importance: 0.302556766856
2. issue_d / feature_n: 6 / importance: 0.291654176929
3. verification_status / feature_n: 4 / importance: 0.270477162656
4. home_ownership / feature_n: 3 / importance: 0.054684974733
5. loan_amnt / feature_n: 1 / importance: 0.0357518921347
6. int_rate / feature_n: 2 / importance: 0.0320879910333
7. annual_inc / feature_n: 0 / importance: 0.0127870356587



In [200]:

    
rf = RandomForestClassifier(max_features=None, random_state=0)
rf_result = rf.fit(under_train_X, under_train_y)









    



D:\Anaconda2\lib\site-packages\ipykernel\__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app



In [201]:

    
# Compute confusion matrix
cnf_matrix = confusion_matrix(under_test_y, rf_result.predict(under_test_X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Cofusion Matrix(predict with under sampled test set)')
plt.show()

print classification_report(under_test_y, rf_result.predict(under_test_X))









    












    



             precision    recall  f1-score   support

          0       0.58      0.65      0.61     12065
          1       0.60      0.53      0.56     12076

avg / total       0.59      0.59      0.59     24141



In [202]:

    
# Compute confusion matrix
cnf_matrix = confusion_matrix(y, rf_result.predict(X))
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[0,1],
                      title='Cofusion Matrix(predict with whole test set)')
plt.show()

print classification_report(y, rf_result.predict(X))









    












    



             precision    recall  f1-score   support

          0       0.42      0.91      0.57     60352
          1       0.96      0.63      0.76    207536

avg / total       0.84      0.69      0.72    267888

	loan_amnt	int_rate	installment	grade	sub_grade	emp_title	emp_length	home_ownership	annual_inc	verification_status	...	dti	delinq_2yrs	inq_last_6mths	open_acc	pub_rec	revol_bal	revol_util	total_acc	initial_list_status	application_type
268131	31050	21.99	857.40	6	61	1	10	2	875000.0	1	...	9.66	1	0	10	0	25770	79.3	13	0	1
268132	10800	7.89	337.89	1	15	1	8	2	92400.0	1	...	19.62	1	0	11	0	9760	68.7	36	1	1
268133	9000	9.17	286.92	2	22	1	1	2	80000.0	1	...	3.97	1	0	8	0	6320	51.8	17	0	1
268134	14400	25.99	431.06	6	65	0	11	6	62000.0	1	...	16.88	0	1	9	1	5677	45.1	30	0	1
268135	8000	12.59	267.98	3	32	1	4	5	45000.0	1	...	26.21	0	0	12	0	9097	50.8	47	1	1

	Name	Included in dataframe	Data type	Encoding	Not Null ratio	Reason for removal	English Description	Korean Description
0	loan_status	o	Categorical	Fully Paid = 1 / Charged Off, Default, Late(31...	100.00%	NaN	Current status of the loan	대출의 현재 상태
1	int_rate	o	Continuous	NaN	100.00%	NaN	Interest Rate on the loan	대출 금리
2	emp_title	o	Categorical	True = 1, False = 0	94.20%	NaN	The job title supplied by the Borrower when ap...	대출 신청시 차용자가 제공 한 직책 *
3	emp_length	o	Categorical	< 1' = 0, 10+ = 10, 'n/a' = 11	100.00%	NaN	Employment length in years. Possible values ar...	고용 기간(년)
4	home_ownership	o	Categorical	Mortgage, Other, Own, Rent = {1, 2, 3, 4}	100.00%	NaN	The home ownership status provided by the borr...	주택 소유 상태(렌트, 소유, 모기지, 기타)
5	annual_inc	o	Continuous	NaN	100.00%	NaN	The self-reported annual income provided by th...	차용자 자체보고 연간 소득.
6	verification_status	o	Categorical	Source Verified, Verified = 1, Not Verified = 0	100.00%	NaN	Indicates if income was verified by LC, not ve...	수입 및 수입원 확인여부
7	issue_d	o	Categorical	Transform to mm	100.00%	NaN	The month which the loan was funded	대출 자금이 조달 된 달
8	loan_amnt	o	Continuous	NaN	100.00%	NaN	The listed amount of the loan applied for by t...	대출금액
9	desc	o	Continuous	True = doc length, False = 0	14.20%	NaN	Loan description provided by the borrower	대출 내용 *
10	purpose	o	Categorical	14 categoried = {1:14} (from 1 to 14)	100.00%	NaN	A category provided by the borrower for the lo...	대출 목적
11	dti	o	Continuous	NaN	100.00%	NaN	A ratio calculated using the borrower’s total ...	차용인의 (월별 채무상환액 / 월별 소득)
12	delinq_2yrs	o	Categorical	No delayed = 0, Delayed = 1	100.00%	NaN	The number of 30+ days past-due incidences of ...	지난 2 년간 연체 30일 이상 발생 횟수
13	inq_last_6mths	o	Categorical	{0:5} = 0~5 times, 6 = above 6 times	100.00%	NaN	The number of inquiries in past 6 months (excl...	지난 6 개월 간 신용관련 문의 건수 (자동차, 모기지 제외)
14	pub_rec	o	Categorical	No record = 0, record = 1	100.00%	NaN	Number of derogatory public records	추징, 압류 등과 관련된 기록
15	revol_bal	o	Continuous	NaN	100.00%	NaN	Total credit revolving balance	총 크레딧 회전 균형
16	revol_util	o	Continuous	NaN	99.94%	NaN	Revolving line utilization rate, or the amount...	리볼빙 이용률, 차용자가 사용하는 리볼빙 크레딧 금액
17	total_acc	o	Continuous	NaN	100.00%	NaN	The total number of credit lines currently in ...	현재 개설된 신용계좌 수
18	initial_list_status	o	Categorical	W = 1, F = 0	100.00%	NaN	The initial listing status of the loan. Possib...	대출의 초기 리스팅 상태(W: 전체, F: 부분)
19	acc_now_delinq	x	Continuous	NaN	100.00%	unclear data collection date from LC	The number of accounts on which the borrower i...	연체중인 계좌의 수
20	addr_state	x	-	NaN	100.00%	not related to analytics purpose	The state provided by the borrower in the loan...	대출 신청서에 차용인이 제공 한 국가
21	all_util	x	-	NaN	2.41%	high null ratio	Balance to credit limit on all trades	신용한도
22	annual_inc_joint	x	-	NaN	0.06%	high null ratio	The combined self-reported annual income provi...	등록 중 공동 차용자가 제공 한 자체보고 된 연간 소득
23	application_type	x	Categorical	Individual = 1, Joint = 0	100.00%	extremely high unbalanced(above 99.9)	Indicates whether the loan is an individual ap...	대출신청 개인, 공동 여부
24	collection_recovery_fee	x	Continuous	NaN	100.00%	not provided information on loan application	post charge off collection fee	수금수수료
25	collections_12_mths_ex_med	x	Continuous	NaN	99.98%	not provided information on loan application	Number of collections in 12 months excluding m...	의료제외 12개월 간 수금건수
26	dti_joint	x	-	NaN	0.06%	dependent to 'dti'	A ratio calculated using the co-borrowers' tot...	모기지 및 요청 된 LC 대출을 제외하고 총 차입금에 대한 공동 차용자의 월별 총 ...
27	earliest_cr_line	x	-	NaN	100.00%	dependent to 'issue_d'	The month the borrower's earliest reported cre...	차용자의 최초보고 크레딧 라인이 개설 된 월
28	funded_amnt	x	Continuous	NaN	100.00%	dependent to 'loan_amnt'	The total amount committed to that loan at tha...	해당 시점에 대출에 투입된 총 금액.
29	funded_amnt_inv	x	Continuous	NaN	100.00%	dependent to 'loan_amnt'	The total amount committed by investors for th...	해당 시점에 해당 대출에 대해 투자자들이 투입한 총 금액.
...	...	...	...	...	...	...	...	...
44	mths_since_rcnt_il	x	-	NaN	2.35%	high null ratio	Months since most recent installment accounts ...	가장 최근의 할부 계좌가 개설 된 이후의 달
45	next_pymnt_d	x	-	NaN	71.49%	unclear data collection date from LC	Next scheduled payment date	다음 지급 예정일
46	open_acc	x	Continuous	NaN	100.00%	dependent to 'total_acc'	The number of open credit lines in the borrowe...	차용인의 개설된 신용계좌 수
47	open_acc_6m	x	-	NaN	2.41%	high null ratio	Number of open trades in last 6 months	최근 6 개월 동안의 미결 거래 횟수
48	open_il_12m	x	-	NaN	2.41%	high null ratio	Number of installment accounts opened in past ...	지난 12 개월 동안 개설 된 할부 계좌 수
49	open_il_24m	x	-	NaN	2.41%	high null ratio	Number of installment accounts opened in past ...	지난 24 개월 동안 개설 된 할부 계좌 수
50	open_il_6m	x	-	NaN	2.41%	high null ratio	Number of currently active installment trades	현재 활동중인 할부 거래 수
51	open_rv_12m	x	-	NaN	2.41%	high null ratio	Number of revolving trades opened in past 12 m...	지난 12 개월 동안 개설 된 회전 거래 수
52	open_rv_24m	x	-	NaN	2.41%	high null ratio	Number of revolving trades opened in past 24 m...	지난 24 개월 동안 개설 된 회전 거래 수
53	out_prncp	x	Continuous	NaN	100.00%	not provided information on loan application	Remaining outstanding principal for total amou...	후원 된 총 금액에 대한 미납 된 원금
54	out_prncp_inv	x	Continuous	NaN	100.00%	not provided information on loan application	Remaining outstanding principal for portion of...	투자자가 자금을 조달 한 부분의 잔액이 남아 있음.
55	policy_code	x	-	NaN	100.00%	same all elements	publicly available policy_code=1\nnew products...	publicly available policy_code = 1\r\n공개적으로 사용...
56	pymnt_plan	x	Categorical	y = 1, n = 0	100.00%	extremely high unbalanced(above 99.9)	Indicates if a payment plan has been put in pl...	대출금 지불 계획 수립여부
57	recoveries	x	Continuous	NaN	100.00%	not provided information on loan application	post charge off gross recovery	총체적인 회복으로 인한 비용 청구
58	sub_grade	x	Categorical	1, 2, 3, 4, 5	100.00%	dependent to 'grade'	LC assigned loan subgrade	LC 세부 신용등급
59	term	x	Continuous	NaN	100.00%	not used in present	The number of payments on the loan. Values are...	대출금 지불 기간(값은 개월 단위)
60	title	x	-	NaN	99.98%	too many categories	The loan title provided by the borrower	차용인이 제공 한 대출 명칭
61	tot_coll_amt	x	-	NaN	92.08%	dependent to 'term'	Total collection amounts ever owed	적립 된 총 징수액
62	tot_cur_bal	x	-	NaN	92.08%	dependent to other variables of repayment ability	Total current balance of all accounts	모든 계정의 총 현재 잔액
63	total_bal_il	x	-	NaN	2.41%	high null ratio	Total current balance of all installment accounts	모든 할부 계좌의 총 현재 잔액
64	total_cu_tl	x	-	NaN	0.00%	high null ratio	Number of finance trades	금융 거래 건수
65	total_pymnt	x	Continuous	NaN	100.00%	not provided information on loan application	Payments received to date for total amount funded	현재까지받은 지불액
66	total_pymnt_inv	x	-	NaN	100.00%	unclear data collection date from LC	Payments received to date for portion of total...	투자자가 자금을 조달 한 총액의 일부분에 대해 현재까지 수령 한 지급액
67	total_rec_int	x	-	NaN	100.00%	unclear data collection date from LC	Interest received to date	현재까지받은이자
68	total_rec_late_fee	x	-	NaN	100.00%	unclear data collection date from LC	Late fees received to date	현재까지 수령 한 연체료
69	total_rec_prncp	x	-	NaN	100.00%	unclear data collection date from LC	Principal received to date	현재까지받은 교장
70	total_rev_hi_lim	x	-	NaN	92.08%	dependent to 'revol_bal', 'revol_util'	Total revolving high credit/credit limit	총 회전 높은 신용 한도 / 신용 한도
71	url	x	x	NaN	100.00%	not related to analytics purpose	URL for the LC page with listing data.	목록 데이터가있는 LC 페이지의 URL입니다.
72	verification_status_joint	x	-	NaN	0.06%	high null ratio	Indicates if the co-borrowers' joint income wa...	공동 차용인의 공동 수입이 LC로 검증되었는지, 검증되지 않았는지, 수입원이 확인되...
73	zip_code	x	-	NaN	100.00%	not related to analytics purpose	The first 3 numbers of the zip code provided b...	대출 신청서에 차용자가 제공 한 우편 번호의 처음 3 자리 숫자입니다.

	loan_amnt	int_rate	emp_title	emp_length	home_ownership	annual_inc	verification_status	issue_d	loan_status	purpose	dti	delinq_2yrs	inq_last_6mths	pub_rec	revol_bal	revol_util	total_acc	initial_list_status
267883	31050	21.99	1	10	1	875000.0	1	12	1	3	9.66	1	0	0	25770	79.3	13	0
267884	10800	7.89	1	8	1	92400.0	1	12	1	2	19.62	1	0	0	9760	68.7	36	1
267885	9000	9.17	1	1	1	80000.0	1	12	1	3	3.97	1	0	0	6320	51.8	17	0
267886	14400	25.99	0	11	4	62000.0	1	12	1	3	16.88	0	1	1	5677	45.1	30	0
267887	8000	12.59	1	4	3	45000.0	1	12	1	3	26.21	0	0	0	9097	50.8	47	1

	loan_amnt	int_rate	annual_inc	desc	dti	revol_bal	revol_util	total_acc	pub_rec	inq_last_6mths	...	initial_list_status_0	initial_list_status_1	loan_status
267883	4.492062	1.342225	5.942008	-0.419569	-0.899401	0.600672	1.004029	-1.025334	-0.335109	-0.798916	...	1	0	1
267884	4.033424	0.897077	4.965672	-0.419569	0.368136	-0.304613	0.574806	0.932358	-0.335109	-0.798916	...	0	1	1
267885	3.954243	0.962369	4.903090	-0.419569	-1.623526	-0.499128	-0.109522	-0.684866	-0.335109	-0.798916	...	1	0	1
267886	4.158362	1.414806	4.792392	-0.419569	0.019436	-0.535486	-0.380823	0.421656	1.941395	0.138736	...	1	0	1
267887	3.903090	1.100026	4.653213	-0.419569	1.206798	-0.342102	-0.150015	1.868646	-0.335109	-0.798916	...	0	1	1

Dep. Variable:	loan_status	No. Observations:	120704
Model:	Logit	Df Residuals:	120684
Method:	MLE	Df Model:	19
Date:	Fri, 17 Mar 2017	Pseudo R-squ.:	0.08684
Time:	05:03:46	Log-Likelihood:	-76400.
converged:	True	LL-Null:	-83666.
		LLR p-value:	0.000

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	1.6425	0.163	10.096	0.000	1.324 1.961
C(emp_title)[T.1]	0.3131	0.048	6.558	0.000	0.220 0.407
C(emp_length)[T.1]	0.0384	0.032	1.208	0.227	-0.024 0.101
C(emp_length)[T.2]	0.0979	0.029	3.356	0.001	0.041 0.155
C(emp_length)[T.3]	0.0682	0.030	2.262	0.024	0.009 0.127
C(emp_length)[T.4]	0.0814	0.032	2.520	0.012	0.018 0.145
C(emp_length)[T.5]	0.0865	0.031	2.762	0.006	0.025 0.148
C(emp_length)[T.6]	0.0403	0.033	1.216	0.224	-0.025 0.105
C(emp_length)[T.7]	0.0092	0.034	0.275	0.784	-0.057 0.075
C(emp_length)[T.8]	0.0038	0.035	0.108	0.914	-0.065 0.073
C(emp_length)[T.9]	-0.0040	0.038	-0.107	0.915	-0.078 0.070
C(emp_length)[T.10]	0.0551	0.024	2.272	0.023	0.008 0.103
C(emp_length)[T.11]	-0.0343	0.059	-0.583	0.560	-0.150 0.081
C(verification_status)[T.1]	-0.1118	0.014	-7.730	0.000	-0.140 -0.083
C(home_ownership)[T.2]	0.0834	0.239	0.349	0.727	-0.385 0.552
C(home_ownership)[T.3]	-0.0574	0.023	-2.545	0.011	-0.102 -0.013
C(home_ownership)[T.4]	-0.1470	0.014	-10.824	0.000	-0.174 -0.120
loan_amnt	-0.7416	0.026	-29.060	0.000	-0.792 -0.692
int_rate	-4.3722	0.049	-89.828	0.000	-4.468 -4.277
annual_inc	1.2937	0.033	39.068	0.000	1.229 1.359