LendingClub is an online financial community that brings together creditworthy borrowers and savvy investors so that both can benefit financially. It replaces the high cost and complexity of bank lending with a faster, smarter way to borrow and invest. Since 2007, LendingClub has been bringing borrowers and investors together, transforming the way people access credit. Over the last 10 years, LendingClub has helped millions of people take control of their debt, grow their small businesses, and invest for the future.
LendingClub balances different investors on its platform. The mechanics of the platform allow LendingClub to meet the objectives of many different types of investors, including retail investors. Here’s how it works. Once loans are approved on the LendingClub platform, they are randomly allocated at a grade and term level either to a program designed for retail investors purchasing interests in fractions of loans (e.g. LendingClub Notes) or to a program intended for institutional investors. This helps ensure that investors have access to comparable-quality loans no matter which type of investor they are. LendingClub's goal is to meet incoming investor demand for interests in fractional loans as much as possible.
The design of the LendingClub platform emphasizes how important retail investors are to LendingClub: retail investors are a key component of its diverse marketplace strategy and are, and will always be, the heart of the LendingClub marketplace. This project, the “Investor ML” tool, is built specifically with retail investors in mind. Investor ML aims to predict the probability that a loan will be charged off or not fully paid, called the “Risk Rate %”. LendingClub can provide the “Risk Rate %” prediction from Investor ML for each approved loan as an additional indicator to help investors make investment decisions. Retail investors already use loan information and applicant information such as FICO score and loan grade to decide whether to invest in fractional loans. An additional statistic, “Risk Rate %”, learned from historical loans, can provide further information and help retail investors diversify their investments.
The datasets are provided by LendingClub. We will use lending data from 2007-2011 and try to classify and predict whether or not the borrower paid back their loan in full. The data is available to download here.
In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
import warnings
warnings.filterwarnings("ignore")
# Import supplementary visualization code visuals.py
import visuals as vs
# Pretty display for notebooks
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
# Load the LendingClub loans dataset
loans = pd.read_csv("LoanStats3a_securev1_2007_2011.csv", encoding='utf-8')
# Success - Display the first record
print ("Loans dataset has {} data points with {} variables each.".format(*loans.shape))
print('lending club data 2007 to 2011 {}: {}'.format(loans.shape, ', '.join(loans.columns)))
loans.head(n=4)
Out[1]:
In [2]:
loans.describe()
Out[2]:
Definition list
In [3]:
# Success
print ("Before removing NA")
print "Dataset has {} data points with {} variables each.".format(*loans.shape)
print(loans.isnull().sum())
As we can see, many features have all values as NaN. Let's drop any column that does not have at least 40% of its values populated.
In [4]:
# Drop a column unless at least 40% of its values are non-NA
loans = loans.dropna(thresh=0.4*len(loans), axis = 1)
print "Loans dataset has {} data points with {} variables each.".format(*loans.shape)
#loans.info()
Now we can analyse the remaining 60 columns, each of which has at least 40% of its records populated.
Assumption
For investors, only information about the loan applicant is available. The following 41 features are assumed to be available to investors.
Available features of loan applicant
Identifiers, not needed as features
Other features to analyze
In [5]:
identifiers = ['id', 'url']
investor_features = loans[['loan_amnt','term','int_rate','grade','sub_grade','emp_title','emp_length',\
'home_ownership','annual_inc','verification_status','loan_status',\
'desc','purpose','title','zip_code','addr_state','dti','delinq_2yrs',\
'earliest_cr_line','fico_range_low','fico_range_high','inq_last_6mths',\
'open_acc','pub_rec','revol_bal','revol_util','total_acc',\
'initial_list_status','out_prncp','out_prncp_inv','last_credit_pull_d','last_fico_range_high',\
'last_fico_range_low','collections_12_mths_ex_med','policy_code','application_type','acc_now_delinq',\
'delinq_amnt','pub_rec_bankruptcies','tax_liens','hardship_flag']]
#loans.info()
investor_features.shape
Out[5]:
We found an interesting tool, pandas_profiling, and will use it to profile and learn more about the data.
In [6]:
import pandas as pd
import pandas_profiling
import numpy as np
pandas_profiling.ProfileReport(investor_features)
Out[6]:
Based on the profiling above, let's work on the features.
Removals
In [7]:
unique_loans = investor_features.drop_duplicates()
unique_loans_features = unique_loans.drop(['application_type', 'collections_12_mths_ex_med','fico_range_high',\
'initial_list_status','hardship_flag','out_prncp','out_prncp_inv',\
'policy_code','tax_liens','title','desc','addr_state','earliest_cr_line',\
'emp_title','last_credit_pull_d', 'delinq_amnt', 'last_fico_range_high',\
'zip_code'], axis=1)
unique_loans_features.shape
Out[7]:
Conversion
In [8]:
unique_loans_features['revol_util'] = unique_loans_features['revol_util'].str.replace('%', '').astype(float)
unique_loans_features['int_rate'] = unique_loans_features['int_rate'].str.replace('%', '').astype(float)
# XGBoost can't handle a column name containing '<'
unique_loans_features['emp_length'] = unique_loans_features['emp_length'].str.replace('<', '')
Consider only rows where a loan_status value is present.
In [9]:
unique_loans_features = unique_loans_features[unique_loans_features['loan_status'].notnull()]
In [10]:
# Split the data into features and target label
loan_paid_status_raw = unique_loans_features['loan_status']
features_raw = unique_loans_features.drop('loan_status', axis = 1)
In [11]:
loans_status_count = unique_loans_features['loan_status'].value_counts().reset_index()
loans_status_count
Out[11]:
For our project we just need to know whether loans are charged off or not. We will convert "Fully Paid" and "Does not meet the credit policy. Status:Fully Paid" to 0, and "Charged Off" and "Does not meet the credit policy. Status:Charged Off" to 1.
In [12]:
# function to convert target label to numeric values
def convert_label(val):
#print val
if val=='Fully Paid':
return 0
elif val=='Charged Off':
return 1
elif val=='Does not meet the credit policy. Status:Fully Paid':
return 0
elif val=='Does not meet the credit policy. Status:Charged Off':
return 1
In [13]:
loan_paid_status = loan_paid_status_raw.apply(convert_label)
loan_paid_status.value_counts().reset_index()
Out[13]:
In [14]:
cnt_label = loan_paid_status.value_counts()
plt.figure(figsize=(8,4))
sns.barplot(cnt_label.index, cnt_label.values, alpha=0.8, color=color[1])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.title('Number of loans fully paid (0), charged off(1)', fontsize=15)
plt.show()
In [15]:
unique_loans_features['charged_off'] = unique_loans_features['loan_status'].apply(convert_label)
In [16]:
pos = unique_loans_features[unique_loans_features["charged_off"] == 1].shape[0]
neg = unique_loans_features[unique_loans_features["charged_off"] == 0].shape[0]
print "Positive examples = {}".format(pos)
print "Negative examples = {}".format(neg)
print "Proportion of positive to negative examples = {}".format((pos*1.0 / neg) * 100)
We see that the data set is pretty imbalanced, as expected: positive examples ("charged off") make up only ~18%. We will handle the imbalanced class issue later.
Let's do some basic analysis on the data.
Let's validate some opinions about the loans data.
Loans with good FICO scores should be paid back.
In [17]:
from matplotlib.pyplot import subplots, show
fig, ax = subplots(figsize=(12, 8))
unique_loans_features[unique_loans_features['charged_off'] == 0]['fico_range_low'].hist(alpha=0.5, color='red', bins=30, label='Fully Paid')
unique_loans_features[unique_loans_features['charged_off'] == 1]['fico_range_low'].hist(alpha=0.5, color='blue', bins=30, label='Charged Off')
plt.title("# Loans charged off or not & FICO score ", fontsize=15)
plt.legend()
#plt.xlabel('FICO')
plt.ylabel('Count')
ax.set_xlabel("FICO")
show()
What are the different purposes for which loans are being requested?
In [18]:
plt.figure(figsize=(12,6))
sns.countplot(x='purpose', data=unique_loans_features, hue='charged_off', palette='Set1')
plt.xticks(rotation='vertical')
plt.title("# of loans vs purpose", fontsize=15)
plt.show()
Let's see the trend between FICO score and interest rate.
In [19]:
sns.jointplot('fico_range_low', 'int_rate', data=unique_loans_features, color='purple')
Out[19]:
Selected categorical features, with intuitions for selection
Selected numerical features, with intuitions for selection
In [20]:
categorical_features = ['emp_length', 'grade', 'home_ownership', 'purpose',\
'sub_grade','term','verification_status']
numerical_features = ['acc_now_delinq', 'annual_inc', 'delinq_2yrs', 'int_rate', 'dti', 'fico_range_low'\
,'inq_last_6mths','last_fico_range_low','loan_amnt'\
,'open_acc','pub_rec','pub_rec_bankruptcies','revol_bal','revol_util','total_acc']
In [21]:
print "Dataset has {} data points with {} numerical variables each.".format(*unique_loans_features[numerical_features].shape)
print(unique_loans_features[numerical_features].isnull().sum())
All of the above features are sensitive information related to an individual's finances, so imputing with mean or median strategies might not be right; it is better to impute 0 for NA values than to use mean, median, or mode. Let's impute missing values as 0.
In [22]:
unique_loans_features[numerical_features] = unique_loans_features[numerical_features].fillna(0)
In [23]:
print "Dataset has {} data points with {} numerical variables each.".format(*unique_loans_features[categorical_features].shape)
print(unique_loans_features[categorical_features].isnull().sum())
No missing values in categorical features
In [24]:
from sklearn.preprocessing import StandardScaler
# Initialize a scaler, then apply it to the features
scaler = StandardScaler()
features_raw[numerical_features] = scaler.fit_transform(unique_loans_features[numerical_features])
# Show an example of a record with scaling applied
display(features_raw.head(n = 10))
#print(features_raw.isnull().sum())
In [25]:
features= pd.concat([pd.get_dummies(features_raw[categorical_features], prefix_sep='_'),\
features_raw[numerical_features]], axis=1)
In [26]:
# Print the number of features after one-hot encoding
encoded = list(features.columns)
print "{} total features after one-hot encoding.".format(len(encoded))
print encoded
display(features.head(n = 1))
Now all categorical variables have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.
In [50]:
# Import train_test_split
from sklearn.model_selection import train_test_split
# Split the 'features' and 'loan_paid_status' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, loan_paid_status, test_size = 0.2, random_state = 0,\
stratify= loan_paid_status)
# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])
display(X_test.head(n = 1))
display(y_test.head(n = 1))
Classification problems in most real-world applications have imbalanced data sets. In other words, the positive examples (minority class) are far fewer than the negative examples (majority class). We see this in spam detection, ad clicks, loan approvals, etc. In our example, the positive examples (borrowers who charged off) are only ~18% of the total examples. Therefore, accuracy is no longer a good measure of performance, because if we simply predict all examples as the negative class we achieve roughly 82% accuracy. Better metrics for imbalanced data sets are AUC (area under the ROC curve) and F1-score. However, that's not enough, because class imbalance influences the learning algorithm during training: the decision rule becomes biased towards the majority class, since the model implicitly optimizes its predictions for the majority class in the dataset. As a result, we'll explore different methods to overcome the class imbalance problem.
Under-sample: Under-sample the majority class, with or without replacement, to make the number of positive and negative examples equal. One drawback of under-sampling is that it discards a good portion of the training data that contains valuable information; in our example, it would lose around ~27,000+ examples. However, it's very fast to train.
Over-sample: Over-sample the minority class, with or without replacement, to make the number of positive and negative examples equal. We'd add around ~27,000+ samples to the training data set with this strategy. It's a lot more computationally expensive than under-sampling, and it's more prone to overfitting due to repeated examples.
EasyEnsemble: Sample several subsets from the majority class, build a classifier on top of each sampled subset, and combine the outputs of all classifiers. More details can be found here.
Synthetic Minority Oversampling Technique (SMOTE): Over-samples the minority class using synthesized examples. It operates on the feature space, not the data space.
In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline as imb_make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedBaggingClassifier
# Build random forest classifier (same config)
rf_clf = RandomForestClassifier(criterion="entropy", verbose=False, class_weight="balanced",random_state=10)
# Build model with no sampling
pip_orig = make_pipeline(rf_clf)
scores = cross_val_score(pip_orig, X_train, y_train, scoring="roc_auc", cv=10)
print("Original model's average AUC: {}".format(scores.mean()))
# Build model with undersampling
pip_undersample = imb_make_pipeline(RandomUnderSampler(), rf_clf)
scores = cross_val_score(pip_undersample, X_train, y_train, scoring="roc_auc", cv=10)
print("Under-sampled model's average AUC: {}".format(scores.mean()))
# Build model with oversampling
pip_oversample = imb_make_pipeline(RandomOverSampler(), rf_clf)
scores = cross_val_score(pip_oversample, X_train, y_train, scoring="roc_auc", cv=10)
print("Over-sampled model's average AUC: {}".format(scores.mean()))
# Build model with EasyEnsemble
resampled_rf = BalancedBaggingClassifier(base_estimator=rf_clf, n_estimators=10, random_state=10)
pip_resampled = make_pipeline(resampled_rf)
scores = cross_val_score(pip_resampled, X_train, y_train, scoring="roc_auc", cv=10)
print("EasyEnsemble model's average AUC: {}".format(scores.mean()))
# Build model with SMOTE
pip_smote = imb_make_pipeline(SMOTE(), rf_clf)
scores = cross_val_score(pip_smote, X_train, y_train, scoring="roc_auc", cv=10)
print("SMOTE model's average AUC: {}".format(scores.mean()))
The EasyEnsemble method has the highest 10-fold CV average AUC = 0.872006664839, so we use the EasyEnsemble technique to handle the imbalanced class problem.
Investor ML is particularly interested in predicting who will fully pay back the loan. It might seem that using accuracy as a metric for evaluating a model's performance would be appropriate. However, recommending someone who does not fully pay back the loan would be detrimental to Investor ML, since investors are looking to invest in individuals who will pay back the loan. Therefore, a model's ability to precisely predict those who fully pay back is more important than its ability to recall those individuals. We can use the F-beta score as a metric that considers both precision and recall: $$ F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\left( \beta^2 \cdot precision \right) + recall} $$ In particular, when $\beta = 0.5$, more emphasis is placed on precision. This is called the F$_{0.5}$ score (or F-score for simplicity). Looking at the distribution of classes (those who charged off and those who fully paid), it's clear that most individuals fully pay back their loans. This can greatly inflate accuracy, since we could simply say "this person will pay back the loan" and generally be right, without ever looking at the data! Making such a statement would be naive, since we have not considered any information to substantiate the claim. Also, since we are trying to predict a "Risk Rate %", we cannot simply declare that a loan gets charged off. To define a naive predictor, we will use a Gaussian NB model to help establish a benchmark for whether a model is performing well.
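As a quick sanity check on the F-beta formula, the numbers below are purely hypothetical (precision = 0.8, recall = 0.5, not results from this data) and only illustrate how beta = 0.5 pulls the score towards precision:
# Hypothetical precision/recall values, only to illustrate the F-beta formula above
precision, recall = 0.8, 0.5
beta = 0.5
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print("F0.5 = {:.3f}".format(f_beta))  # ~0.714, closer to precision than the F1 score of ~0.615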
In [29]:
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
from sklearn.naive_bayes import GaussianNB
cols = ['model','matthews_corrcoef', 'roc_auc_score', 'precision_score', 'recall_score','f1_score','accuracy',\
'fscore']
models_report = pd.DataFrame(columns = cols)
# Use a Gaussian Naive Bayes model as the naive benchmark predictor
clf_GNB = GaussianNB()
clf_GNB.fit(X_train,y_train)
naive_prediction = clf_GNB.predict(X_test)
y_score = clf_GNB.predict_proba(X_test)[:,1]
# Accuracy, F-score using sklearn
accuracy = accuracy_score(y_test, naive_prediction)
fscore = fbeta_score(y_test, naive_prediction, beta = 0.5, pos_label=1)
tmp = pd.Series({'model': 'Naive Predictor',\
'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, naive_prediction),\
'precision_score': metrics.precision_score(y_test, naive_prediction),\
'recall_score': metrics.recall_score(y_test, naive_prediction),\
'f1_score': metrics.f1_score(y_test, naive_prediction),\
'accuracy': accuracy_score(y_test, naive_prediction),\
'fscore' : fbeta_score(y_test, naive_prediction, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
# Print the results
print "Naive Predictor per sklearn: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore)
models_report
Out[29]:
In [30]:
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(classifier_name, learner, sample_size, X_train, y_train, X_test, y_test, models_report, meta_learner=None):
'''
inputs:
- learner: the learning algorithm to be trained and predicted on
- sample_size: the size of samples (number) to be drawn from training set
- X_train: features training set
- y_train: income training set
- X_test: features testing set
- y_test: income testing set
'''
results = {}
# Fit the learner to the training data using slicing with 'sample_size'
start = time() # Get start time
if meta_learner and classifier_name=='stackedEnsembleClassifier':
# Use CV to generate meta-features
meta_features = cross_val_predict(learner, X_train[:sample_size],y_train[:sample_size], cv=10, method="transform")
# Refit the first stack on the full training set
learner.fit(X_train[:sample_size],y_train[:sample_size])
# Fit the meta learner
second_stack = meta_learner.fit(meta_features, y_train[:sample_size])
else:
learner.fit(X_train[:sample_size],y_train[:sample_size])
end = time() # Get end time
# Calculate the training time
results['train_time'] = end-start
# Get the predictions on the test set,
# then get predictions on the first 300 training samples
start = time() # Get start time
if meta_learner and classifier_name=='stackedEnsembleClassifier':
predictions_test = second_stack.predict(learner.transform(X_test))
predictions_train = second_stack.predict(learner.transform(X_train[:300]))
else:
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train[:300])
end = time() # Get end time
# Calculate the total prediction time
results['pred_time'] = end-start
# Compute accuracy on the first 300 training samples
results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
# Compute accuracy on test set
results['acc_test'] = accuracy_score(y_test, predictions_test)
# Compute F-score on the the first 300 training samples
results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta = 0.5)
# Compute F-score on the test set
results['f_test'] = fbeta_score(y_test, predictions_test, beta = 0.5)
if meta_learner and classifier_name=='stackedEnsembleClassifier':
results['meta-learner'] = second_stack
results['learner'] = learner
else:
results['learner'] = learner
# Success
print "{} trained on {} samples.".format(classifier_name, sample_size)
if classifier_name=='stackedEnsembleClassifier':
y_score = second_stack.predict_proba(learner.transform(X_test))[:, 1:]
else:
y_score = learner.predict_proba(X_test)[:,1]
tmp = pd.Series({'model': classifier_name+str(sample_size),\
'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, predictions_test),\
'precision_score': metrics.precision_score(y_test, predictions_test),\
'recall_score': metrics.recall_score(y_test, predictions_test),\
'f1_score': metrics.f1_score(y_test, predictions_test),\
'accuracy': accuracy_score(y_test, predictions_test),\
'fscore' : fbeta_score(y_test, predictions_test, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
# Return the results
return results, models_report
Let's evaluate using the following models:
In [31]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# Initialize the four models
clf_A = LogisticRegression(random_state=10)
# Build model with EasyEnsemble
resampled_LR_A = BalancedBaggingClassifier(base_estimator=clf_A, n_estimators=10, random_state=10)
clf_B = GradientBoostingClassifier(random_state=10)
resampled_GBC_B = BalancedBaggingClassifier(base_estimator=clf_B, n_estimators=10, random_state=10)
clf_C = XGBClassifier(objective='binary:logistic', random_state=10)
resampled_XGB_C = BalancedBaggingClassifier(base_estimator=clf_C, n_estimators=10, random_state=10)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = (X_train.shape[0]*1/100)
samples_10 = (X_train.shape[0]*10/100)
samples_100 = (X_train.shape[0]*100/100)
# Collect results on the learners
classifiers = {'LogisticRegression':resampled_LR_A, 'GradientBoostingClassifier':resampled_GBC_B,\
'XGBClassifier':resampled_XGB_C}
#[resampled_LR_A, resampled_GBC_B, resampled_XGB_C, resampled_RFC_D]
results = {}
for clf_name, clf in classifiers.items():
#clf_name = clf.__class__.__name__
results[clf_name] = {}
for i, samples in enumerate([samples_1, samples_10, samples_100]):
results[clf_name][i], models_report = \
train_predict(clf_name, clf, samples, X_train, y_train, X_test, y_test, models_report)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results, accuracy, fscore)
models_report
Out[31]:
We can see from the charts and data above that, with EasyEnsemble sampling, LogisticRegression performs better than XGBClassifier and GradientBoostingClassifier when we consider running time on the 100% training size, variance (train and test accuracy scores are close), and F-score.
Let's perform a grid search optimization for the model over the entire training set (X_train and y_train).
Tuning for Logistic Regression: solver was tuned over ['newton-cg', 'lbfgs', 'liblinear', 'sag'] (best found: 'sag'), 'C' over [0.1, 1.0, 1.5] (best found: 1.5), and n_estimators over range(20,81,10) (tuned to 10). The grid below is restricted to the best values found in those earlier runs.
In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# Initialize the classifier
clf = LogisticRegression(random_state=10)
tune_clf = BalancedBaggingClassifier(base_estimator=clf, random_state=10)
# Create the parameters list to tune
parameters = {'base_estimator__fit_intercept': [True],\
'base_estimator__C': [1.5],\
'base_estimator__penalty':['l1'],\
'n_estimators':[10]}
# Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score, beta=0.5)
# Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(tune_clf, parameters, verbose=True, cv=10, scoring=scorer)
# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)
# Get the estimator
best_clf = grid_fit.best_estimator_
# Build model with EasyEnsemble
#best_clf= BalancedBaggingClassifier(base_estimator=best_model, n_estimators=10, random_state=10)
# Make predictions using the unoptimized and model
predictions = (tune_clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
In [33]:
y_best_score = best_clf.predict_proba(X_test)[:,1]
tmp = pd.Series({'model': 'GridSearchTunedLR',\
'roc_auc_score' : metrics.roc_auc_score(y_test, y_best_score),\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, best_predictions),\
'precision_score': metrics.precision_score(y_test, best_predictions),\
'recall_score': metrics.recall_score(y_test, best_predictions),\
'f1_score': metrics.f1_score(y_test, best_predictions),\
'accuracy': accuracy_score(y_test, best_predictions),\
'fscore' : fbeta_score(y_test, best_predictions, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
# Report the before-and-afterscores
print "Unoptimized model\n------"
print "Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5))
print "\nOptimized Model\n------"
print "Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions))
print "Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5))
# show best parameters
print "\nBest Classifier\n------"
print best_clf
models_report
Out[33]:
An important task when performing supervised learning on a dataset is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do. In the case of this project, that means we wish to identify a small number of features that most strongly predict whether an individual will fully pay back the loan.
In [34]:
# Import supplementary visualization code visuals.py
import visuals as vs
import matplotlib.pyplot as pl
# Pretty display for notebooks
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
# Train the supervised model on the training set
model = RandomForestClassifier(random_state=10)
model.fit(X_train, y_train)
# Extract the feature importances
importances = model.feature_importances_
# Plot
vs.feature_plot( importances, X_train, y_train)
# show most importance features
a = np.array(importances)
factors = pd.DataFrame(data = np.array([importances.astype(float), features.columns]).T,
columns = ['importances', 'features'])
factors = factors.sort_values('importances', ascending=False)
print "\n top 20 important features"
display(factors[:20])
In [35]:
# Import functionality for cloning a model
from sklearn.base import clone
# Reduce the feature space
X_train_reduced_20 = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:20]]]
X_test_reduced_20 = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:20]]]
# Train on the "best" model found from grid search earlier
start = time()
full_clf = (clone(best_clf)).fit(X_train, y_train)
end = time()
train_time_full = end - start
start = time()
reduced_20_clf = (clone(best_clf)).fit(X_train_reduced_20, y_train)
end = time()
train_time_reduced_20 = end - start
# Make new predictions
full_predictions = full_clf.predict(X_test)
reduced_20_predictions = reduced_20_clf.predict(X_test_reduced_20)
# Report scores from the final model using both versions of data
print "Final Model trained on full data\n------"
print "Final Model Train time {} s full data\n------".format(train_time_full)
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, full_predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, full_predictions, beta = 0.5))
print "\nFinal Model trained on reduced to 81 features data\n------"
print "Final Model Train time {} s full data\n------".format(train_time_reduced_20)
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_20_predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_20_predictions, beta = 0.5))
Feature selection using only the top 20 features wasn't prudent in our case, so we will use the tuned best classifier with the full feature set instead of the top 20 features.
We’ll build ensemble models using four different models as base learners:
The ensemble models will be built using two different methods:
Blending (average) ensemble model: Fit the base learners to the training data and then, at test time, average the predictions generated by all the base learners. We use VotingClassifier from sklearn, which fits all the base learners on the training data and, at test time, uses all base learners to predict the test data and then takes the average of all predictions.
Stacked ensemble model: Fit the base learners to the training data. Next, use those trained base learners to generate predictions (meta-features) used by the meta-learner (assuming we have only one layer of base learners).
There are a few different ways of training a stacked ensemble model:
Fit the base learners to all of the training data and then generate predictions using the same training data used to fit those learners. This method is more prone to overfitting because the meta-learner will give more weight to the base learner that memorized the training data best, i.e. the meta-learner won't generalize well and would overfit.
Split the training data into 2 to 3 different parts used for training, validation, and generating predictions. It's a suboptimal method because held-out sets usually have higher variance, different splits give different results, and the learning algorithms have less data to train on.
Use k-fold cross validation, where we split the data into k folds. We fit the base learners to the (k - 1) folds and use the fitted models to generate predictions for the held-out fold. We repeat the process until we have generated predictions for all k folds. When done, we refit the base learners to the full training data. This method is more reliable and gives less weight to models that memorize the data, so it generalizes better on future data.
We’ll use logistic regression as the meta-learner for the stacked model. Note that we can use k-folds cross validation to validate and tune the hyperparameters of the meta learner. We will not tune the hyperparameters of any of the base learners or the meta-learner; however, we will use some of the values recommended by the Data-driven advice for applying machine learning to bioinformatics problems and https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/.
In [36]:
from sklearn.ensemble import VotingClassifier
# Define base learners
xgb_clf = XGBClassifier(objective="binary:logistic", learning_rate=0.03, n_estimators=500,\
max_depth=1, subsample=0.4, random_state=10)
balanced_xgb_clf = BalancedBaggingClassifier(base_estimator=xgb_clf, n_estimators=10, random_state=10)
lr_clf = LogisticRegression(C=1.5, fit_intercept=True, penalty='l1', random_state=10)
balanced_lr_clf = BalancedBaggingClassifier(base_estimator=lr_clf, n_estimators=10, random_state=10)
gb_clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, max_depth=3, max_features='log2', \
n_estimators=500, random_state=10)
balanced_gb_clf = BalancedBaggingClassifier(base_estimator=gb_clf, n_estimators=10, random_state=10)
rf_clf = RandomForestClassifier(n_estimators=300, max_features="sqrt", criterion="gini", min_samples_leaf=5,\
class_weight="balanced", random_state=10)
balanced_rf_clf = BalancedBaggingClassifier(base_estimator=rf_clf, n_estimators=10, random_state=10)
# Define meta-learner
logreg_clf = LogisticRegression(penalty="l2", C=100, fit_intercept=True)
# Fitting voting clf --> average ensemble
voting_clf = VotingClassifier([("xgb", balanced_xgb_clf), ("lr", balanced_lr_clf), ("gbc", gb_clf),\
("rf", rf_clf)], voting="soft", flatten_transform=True)
voting_clf.fit(X_train, y_train)
xgb_model, lr_model, gbc_model, rf_model = voting_clf.estimators_
models = {"xgb": xgb_model, "lr": lr_model, "gbc": gbc_model, "rf": rf_model, "avg_ensemble": voting_clf}
In [37]:
from sklearn.model_selection import cross_val_predict
# Build first stack of base learners
first_stack = make_pipeline(voting_clf)
# Use CV to generate meta-features
meta_features = cross_val_predict(first_stack, X_train, y_train, cv=10, method="transform")
# Refit the first stack on the full training set
first_stack.fit(X_train, y_train)
# Fit the meta learner
second_stack = logreg_clf.fit(meta_features, y_train)
In [38]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
# Plot ROC and PR curves using all models and test data
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for name, model in models.items():
model_probs = model.predict_proba(X_test)[:, 1:]
model_auc_score = roc_auc_score(y_test, model_probs)
fpr, tpr, _ = roc_curve(y_test, model_probs)
precision, recall, _ = precision_recall_curve(y_test, model_probs)
axes[0].plot(fpr, tpr, label="{}, auc = {}".format(name, model_auc_score))
axes[1].plot(recall, precision, label="{}".format(name))
stacked_probs = second_stack.predict_proba(first_stack.transform(X_test))[:, 1:]
stacked_auc_score = roc_auc_score(y_test, stacked_probs)
fpr, tpr, _ = roc_curve(y_test, stacked_probs)
precision, recall, _ = precision_recall_curve(y_test, stacked_probs)
axes[0].plot(fpr, tpr, label="stacked_ensemble, auc = {}".format(stacked_auc_score))
axes[1].plot(recall, precision, label="stacked_ensembe")
axes[0].legend(loc="lower right")
axes[0].set_xlabel("FPR")
axes[0].set_ylabel("TPR")
axes[0].set_title("ROC curve")
axes[1].legend()
axes[1].set_xlabel("recall")
axes[1].set_ylabel("precision")
axes[1].set_title("PR curve")
plt.tight_layout()
As we can see from the chart above, the stacked ensemble model has improved the performance.
In [39]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#PERFORMANCE METRICS FOR TEST SET
print("Final model test METRICS")
y_test_pred = second_stack.predict(first_stack.transform(X_test))
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, y_test_pred))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, y_test_pred, beta = 0.5))
accuracy = accuracy_score(y_test, y_test_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
recall = recall_score(y_test, y_test_pred)
print("Recall: %.2f%%" % (recall * 100.0))
precision = precision_score(y_test, y_test_pred)
print("Precision: %.2f%%" % (precision * 100.0))
f1 = f1_score(y_test, y_test_pred, pos_label=1)
print("F1: %.2f%%" % (f1 * 100.0))
cfm = confusion_matrix(y_test, y_test_pred)
print (cfm)
print(classification_report(y_test, y_test_pred))
tmp = pd.Series({'model': 'Stacked Ensemble Final Model',\
'roc_auc_score' : stacked_auc_score,\
'matthews_corrcoef': metrics.matthews_corrcoef(y_test, y_test_pred),\
'precision_score': metrics.precision_score(y_test, y_test_pred),\
'recall_score': metrics.recall_score(y_test, y_test_pred),\
'f1_score': metrics.f1_score(y_test, y_test_pred),\
'accuracy': accuracy_score(y_test, y_test_pred),\
'fscore' : fbeta_score(y_test, y_test_pred, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)
models_report
Out[39]:
Although the stacked ensemble model has better accuracy and F-score, in classification problems where false negatives are a lot more expensive than false positives we may want a model with high precision rather than high recall, i.e. a high proportion of the examples the model identifies as positive that truly are positive. The confusion matrix printed above shows this breakdown.
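If precision on the "charged off" class matters most, one option, shown here only as a sketch on top of the stacked model's probabilities already computed above (the 50% target precision is an arbitrary, hypothetical choice), is to move the decision threshold instead of using the default 0.5 cut-off:
from sklearn.metrics import precision_recall_curve
# stacked_probs holds the stacked model's predicted charge-off probabilities for X_test (computed above)
probs = stacked_probs.ravel()
precisions, recalls, thresholds = precision_recall_curve(y_test, probs)
target_precision = 0.50  # hypothetical target, purely for illustration
# Index of the first threshold whose precision reaches the target (falls back to 0 if none does)
idx = np.argmax(precisions[:-1] >= target_precision)
chosen_threshold = thresholds[idx]
y_pred_high_precision = (probs >= chosen_threshold).astype(int)
print("threshold = {:.3f}, precision = {:.3f}, recall = {:.3f}".format(chosen_threshold, precisions[idx], recalls[idx]))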
Let's check the best base classifier, the tuned base classifier, and the final stacked model. We compare their results across different sets of inputs to validate robustness and obtain their outputs. We also plot the ROC curves to perform a sensitivity analysis.
In [40]:
bestBaseClassifier = clone(tune_clf)
tunedClassifier = clone(full_clf)
stackedEnsembleClassifier = first_stack
# Define meta-learner
meta_learner = LogisticRegression(penalty="l2", C=100, fit_intercept=True)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = (X_train.shape[0]*1/100)
samples_10 = (X_train.shape[0]*10/100)
samples_100 = (X_train.shape[0]*100/100)
# Collect results on the learners
classifiers2 = {'bestBaseClassifier':bestBaseClassifier, 'tunedClassifier':tunedClassifier,\
'stackedEnsembleClassifier':stackedEnsembleClassifier}
#[resampled_LR_A, resampled_GBC_B, resampled_XGB_C, resampled_RFC_D]
results2 = {}
for clf_name, clf in classifiers2.items():
#clf_name = clf.__class__.__name__
results2[clf_name] = {}
for i, samples in enumerate([samples_1, samples_10, samples_100]):
results2[clf_name][i], models_report = \
train_predict(clf_name, clf, samples, X_train, y_train, X_test, y_test, models_report, meta_learner)
In [41]:
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results2, accuracy, fscore)
models_report
Out[41]:
From the chart above we can observe that, although the stackedEnsembleClassifier has a higher time cost for training and prediction, it is better in terms of variance and has better accuracy and F-score. We can also see that the model is robust across varied inputs.
In [42]:
# Plot ROC and PR curves using all models and test data
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
final_first_stack = None
final_second_stack= None
for k, v in results2.items():
for k2, v2 in v.items():
if k2==2:
if k=='stackedEnsembleClassifier':
final_first_stack = v2['learner']
final_second_stack = v2['meta-learner']
stacked_probs = final_second_stack.predict_proba(final_first_stack.transform(X_test))[:, 1:]
stacked_auc_score = roc_auc_score(y_test, stacked_probs)
fpr, tpr, _ = roc_curve(y_test, stacked_probs)
precision, recall, _ = precision_recall_curve(y_test, stacked_probs)
axes[0].plot(fpr, tpr, label="{}, auc = {}".format(k, stacked_auc_score))
axes[1].plot(recall, precision, label="{}".format(k))
else:
model_probs = v2['learner'].predict_proba(X_test)[:, 1:]
model_auc_score = roc_auc_score(y_test, model_probs)
fpr, tpr, _ = roc_curve(y_test, model_probs)
precision, recall, _ = precision_recall_curve(y_test, model_probs)
axes[0].plot(fpr, tpr, label="{}, auc = {}".format(k, model_auc_score))
axes[1].plot(recall, precision, label="{}".format(k))
axes[0].legend(loc="lower right")
axes[0].set_xlabel("FPR")
axes[0].set_ylabel("TPR")
axes[0].set_title("ROC curve")
axes[1].legend()
axes[1].set_xlabel("recall")
axes[1].set_ylabel("precision")
axes[1].set_title("PR curve")
plt.tight_layout()
The ROC plot above shows the sensitivity analysis: the stackedEnsembleClassifier, with AUC 0.88168, is better than the bestBaseClassifier and the tunedClassifier.
Optimized model's accuracy and F-score on the testing data.
In [43]:
models_report.iloc[[0, 6, 10, 11]].T
Out[43]:
Partial dependence plots show the most important features and their relationships with whether the borrower will most likely pay the loan in full before the maturity date. We plot only the top 12 features to keep the plots easy to read. Note that the partial dependence plots are based on the Gradient Boosting model.
In [44]:
# Plot partial dependence plots
gb_clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, max_depth=3, max_features='log2', \
n_estimators=500, random_state=10)
gb_clf.fit(X_train, y_train)
Out[44]:
In [45]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
fig, axes = plot_partial_dependence(gb_clf, X_train, np.argsort(gb_clf.feature_importances_)[::-1][:12],\
n_cols=4, feature_names=features.columns[:], figsize=(14, 8))
plt.subplots_adjust(top=0.9)
plt.suptitle("Partial dependence plots of borrower charged off\n" "the loan based on top most influential features")
for ax in axes:
ax.set_xticks(())
As expected, borrowers with lower annual income and lower FICO scores are highly likely to be charged off, while borrowers with lower interest rates and higher revol_bal are more likely to pay the loan in full.
In [47]:
# Use CV to generate meta-features
meta_features = cross_val_predict(final_first_stack, features, loan_paid_status, cv=10, method="transform")
# Refit the first stack on the full training set
final_first_stack.fit(features, loan_paid_status)
# Fit the meta learner
final_second_stack_meta = final_second_stack.fit(meta_features, loan_paid_status)
In [48]:
from sklearn.externals import joblib
filename = 'finalized_first_stack_model.sav'
filename2 = 'finalized_second_stack_model.sav'
joblib.dump(final_first_stack, filename)
joblib.dump(final_second_stack_meta, filename2)
first_stack
Out[48]:
Test the saved model. The final model's results are on par with our best model; the performance metrics increased because we refit on all of the available data (including the test split) before saving.
In [49]:
# load the model from disk
loaded_first_stack_model = joblib.load(filename)
loaded_second_stack_model = joblib.load(filename2)
Y_hat_imp_f = loaded_second_stack_model.predict(loaded_first_stack_model.transform(X_test))
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, Y_hat_imp_f))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, Y_hat_imp_f, beta = 0.5))
Most classification problems in the real world are imbalanced, and data sets almost always have missing values. In this project, we covered strategies to deal with both missing values and imbalanced data sets. We also explored different ways of building ensembles, which can give better performance. Below are some interesting learnings.
Sometimes we may be willing to give up some improvement to the model if it would increase the complexity much more than it improves the evaluation metrics.
EasyEnsemble usually performs better than the other resampling methods.
Following are some of the additional improvements that could be made: