Machine Learning Engineer Nanodegree

Capstone

Project: Lending Club Investor ML

Overview

LendingClub is an online financial community that brings together creditworthy borrowers and savvy investors so that both can benefit financially. It replaces the high cost and complexity of bank lending with a faster, smarter way to borrow and invest. Since 2007, LendingClub has been bringing borrowers and investors together, transforming the way people access credit. Over the last 10 years, LendingClub has helped millions of people take control of their debt, grow their small businesses, and invest for the future.

LendingClub balances different investors on its platform. The mechanics of the platform allow LendingClub to meet the objectives of many different types of investors, including retail investors. Here's how it works: once loans are approved for the LendingClub platform, they are randomly allocated at a grade and term level either to a program designed for retail investors purchasing interests in fractions of loans (e.g. LendingClub Notes) or to a program intended for institutional investors. This helps ensure that investors have access to comparable-quality loans no matter which type of investor they are. LendingClub's goal is to meet incoming investor demand for interests in fractional loans as much as possible.

The design of the LendingClub platform emphasizes how important retail investors are to LendingClub: they are a key component of its diverse marketplace strategy and are, and will always be, the heart of the LendingClub marketplace. This project, the "Investor ML" tool, is built specifically with retail investors in mind. Investor ML aims to predict the probability that a loan will be charged off or not fully paid, called the "Risk Rate %". LendingClub can provide the "Risk Rate %" prediction from Investor ML for each approved loan as an additional indicator for investors making investment decisions. Retail investors look at loan information and applicant information, such as FICO score and loan grade, and decide whether to invest in fractional loans. The "Risk Rate %" statistic, learned from historical loans, can serve as one more piece of information and help retail investors diversify their investments.

The dataset is provided by LendingClub. We will use lending data from 2007-2011 and try to classify and predict whether or not the borrower paid back their loan in full. The data is available to download here.
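As a sketch of how the binary target for this classification could later be derived, assuming the historical loan_status field contains strings such as "Fully Paid" and "Charged Off" (the exact status values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical status values; 1 = risky outcome (charged off), 0 = fully paid.
statuses = pd.Series([
    "Fully Paid",
    "Charged Off",
    "Fully Paid",
    "Does not meet the credit policy. Status:Charged Off",
])
target = statuses.str.contains("Charged Off").astype(int)
print(target.tolist())  # [0, 1, 0, 1]
```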


Exploring the Data

Let's load the necessary Python libraries and the LendingClub data we downloaded.


In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
import warnings
warnings.filterwarnings("ignore")

# Import supplementary visualization code visuals.py
import visuals as vs

# Pretty display for notebooks
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()


# Load the LendingClub dataset
loans = pd.read_csv("LoanStats3a_securev1_2007_2011.csv", encoding='utf-8')

# Success - display the first few records
print ("Loans dataset has {} data points with {} variables each.".format(*loans.shape))
print('lending club data 2007 to 2011 {}: {}'.format(loans.shape, ', '.join(loans.columns)))
loans.head(n=4)


Loans dataset has 42542 data points with 151 variables each.
lending club data 2007 to 2011 (42542, 151): id, member_id, loan_amnt, funded_amnt, funded_amnt_inv, term, int_rate, installment, grade, sub_grade, emp_title, emp_length, home_ownership, annual_inc, verification_status, issue_d, loan_status, pymnt_plan, url, desc, purpose, title, zip_code, addr_state, dti, delinq_2yrs, earliest_cr_line, fico_range_low, fico_range_high, inq_last_6mths, mths_since_last_delinq, mths_since_last_record, open_acc, pub_rec, revol_bal, revol_util, total_acc, initial_list_status, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv, total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee, last_pymnt_d, last_pymnt_amnt, next_pymnt_d, last_credit_pull_d, last_fico_range_high, last_fico_range_low, collections_12_mths_ex_med, mths_since_last_major_derog, policy_code, application_type, annual_inc_joint, dti_joint, verification_status_joint, acc_now_delinq, tot_coll_amt, tot_cur_bal, open_acc_6m, open_act_il, open_il_12m, open_il_24m, mths_since_rcnt_il, total_bal_il, il_util, open_rv_12m, open_rv_24m, max_bal_bc, all_util, total_rev_hi_lim, inq_fi, total_cu_tl, inq_last_12m, acc_open_past_24mths, avg_cur_bal, bc_open_to_buy, bc_util, chargeoff_within_12_mths, delinq_amnt, mo_sin_old_il_acct, mo_sin_old_rev_tl_op, mo_sin_rcnt_rev_tl_op, mo_sin_rcnt_tl, mort_acc, mths_since_recent_bc, mths_since_recent_bc_dlq, mths_since_recent_inq, mths_since_recent_revol_delinq, num_accts_ever_120_pd, num_actv_bc_tl, num_actv_rev_tl, num_bc_sats, num_bc_tl, num_il_tl, num_op_rev_tl, num_rev_accts, num_rev_tl_bal_gt_0, num_sats, num_tl_120dpd_2m, num_tl_30dpd, num_tl_90g_dpd_24m, num_tl_op_past_12m, pct_tl_nvr_dlq, percent_bc_gt_75, pub_rec_bankruptcies, tax_liens, tot_hi_cred_lim, total_bal_ex_mort, total_bc_limit, total_il_high_credit_limit, revol_bal_joint, sec_app_fico_range_low, sec_app_fico_range_high, sec_app_earliest_cr_line, sec_app_inq_last_6mths, sec_app_mort_acc, sec_app_open_acc, sec_app_revol_util, 
sec_app_open_act_il, sec_app_num_rev_accts, sec_app_chargeoff_within_12_mths, sec_app_collections_12_mths_ex_med, sec_app_mths_since_last_major_derog, hardship_flag, hardship_type, hardship_reason, hardship_status, deferral_term, hardship_amount, hardship_start_date, hardship_end_date, payment_plan_start_date, hardship_length, hardship_dpd, hardship_loan_status, orig_projected_additional_accrued_interest, hardship_payoff_balance_amount, hardship_last_payment_amount, disbursement_method, debt_settlement_flag, debt_settlement_flag_date, settlement_status, settlement_date, settlement_amount, settlement_percentage, settlement_term
Out[1]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
0 1077501 NaN 5000.0 5000.0 4975.0 36 months 10.65% 162.87 B B2 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
1 1077430 NaN 2500.0 2500.0 2500.0 60 months 15.27% 59.83 C C4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
2 1077175 NaN 2400.0 2400.0 2400.0 36 months 15.96% 84.33 C C5 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
3 1076863 NaN 10000.0 10000.0 10000.0 36 months 13.49% 339.31 C C1 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN

4 rows × 151 columns


In [2]:
loans.describe()


Out[2]:
member_id loan_amnt funded_amnt funded_amnt_inv installment annual_inc dti delinq_2yrs fico_range_low fico_range_high ... payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount settlement_amount settlement_percentage settlement_term
count 0.0 42535.000000 42535.000000 42535.000000 42535.000000 4.253100e+04 42535.000000 42506.000000 42535.000000 42535.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 156.000000 156.000000 156.000000
mean NaN 11089.722581 10821.585753 10139.830603 322.623063 6.913656e+04 13.373043 0.152449 713.052545 717.052545 ... NaN NaN NaN NaN NaN NaN NaN 4259.603077 49.935385 0.942308
std NaN 7410.938391 7146.914675 7131.686447 208.927216 6.409635e+04 6.726315 0.512406 36.188439 36.188439 ... NaN NaN NaN NaN NaN NaN NaN 3147.115163 15.707054 3.445877
min NaN 500.000000 500.000000 0.000000 15.670000 1.896000e+03 0.000000 0.000000 610.000000 614.000000 ... NaN NaN NaN NaN NaN NaN NaN 193.290000 10.690000 0.000000
25% NaN 5200.000000 5000.000000 4950.000000 165.520000 4.000000e+04 8.200000 0.000000 685.000000 689.000000 ... NaN NaN NaN NaN NaN NaN NaN 1800.000000 40.000000 0.000000
50% NaN 9700.000000 9600.000000 8500.000000 277.690000 5.900000e+04 13.470000 0.000000 710.000000 714.000000 ... NaN NaN NaN NaN NaN NaN NaN 3481.850000 49.980000 0.000000
75% NaN 15000.000000 15000.000000 14000.000000 428.180000 8.250000e+04 18.680000 0.000000 740.000000 744.000000 ... NaN NaN NaN NaN NaN NaN NaN 5701.100000 60.700000 0.000000
max NaN 35000.000000 35000.000000 35000.000000 1305.190000 6.000000e+06 29.990000 13.000000 825.000000 829.000000 ... NaN NaN NaN NaN NaN NaN NaN 14798.200000 92.740000 24.000000

8 rows × 120 columns

Data dictionary

Definition list

  • id: A unique LC assigned ID for the loan listing.
  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
  • funded_amnt: The total amount committed to that loan at that point in time.
  • funded_amnt_inv: The total amount committed by investors for that loan at that point in time.
  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.
  • int_rate: Interest Rate on the loan
  • installment: The monthly payment owed by the borrower if the loan originates.
  • grade: LC assigned loan grade
  • sub_grade: LC assigned loan subgrade
  • emp_title: The job title supplied by the Borrower when applying for the loan.
  • emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
  • home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER.
  • annual_inc: The self-reported annual income provided by the borrower during registration.
  • verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified.
  • issue_d: The month which the loan was funded.
  • loan_status: Current status of the loan.
  • pymnt_plan: Indicates if a payment plan has been put in place for the loan.
  • url: URL for the LC page with listing data.
  • desc: Loan description provided by the borrower.
  • purpose: A category provided by the borrower for the loan request.
  • title: The loan title provided by the borrower.
  • zip_code: The first 3 numbers of the zip code provided by the borrower in the loan application.
  • addr_state: The state provided by the borrower in the loan application.
  • dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
  • delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.
  • earliest_cr_line: The month the borrower's earliest reported credit line was opened.
  • fico_range_low: The lower boundary range the borrower’s FICO at loan origination belongs to.
  • fico_range_high: The upper boundary range the borrower’s FICO at loan origination belongs to.
  • inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries).
  • mths_since_last_delinq: The number of months since the borrower's last delinquency.
  • mths_since_last_record: The number of months since the last public record.
  • open_acc: The number of open credit lines in the borrower's credit file.
  • pub_rec: Number of derogatory public records.
  • revol_bal: Total credit revolving balance.
  • revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
  • total_acc: The total number of credit lines currently in the borrower's credit file.
  • initial_list_status: The initial listing status of the loan. Possible values are – W, F.
  • out_prncp: Remaining outstanding principal for total amount funded.
  • out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors.
  • total_pymnt: Payments received to date for total amount funded.
  • total_pymnt_inv: Payments received to date for portion of total amount funded by investors.
  • total_rec_prncp: Principal received to date.
  • total_rec_int: Interest received to date.
  • total_rec_late_fee: Late fees received to date.
  • recoveries: post charge off gross recovery.
  • collection_recovery_fee: post charge off collection fee.
  • last_pymnt_d: Last month payment was received.
  • last_pymnt_amnt: Last total payment amount received.
  • next_pymnt_d: Next scheduled payment date.
  • last_credit_pull_d: The most recent month LC pulled credit for this loan.
  • last_fico_range_high: The upper boundary range the borrower’s last FICO pulled belongs to.
  • last_fico_range_low: The lower boundary range the borrower’s last FICO pulled belongs to.
  • collections_12_mths_ex_med: Number of collections in 12 months excluding medical collections.
  • policy_code: Publicly available: policy_code=1; new products not publicly available: policy_code=2.
  • application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers.
  • acc_now_delinq: The number of accounts on which the borrower is now delinquent.
  • chargeoff_within_12_mths: Number of charge-offs within 12 months.
  • delinq_amnt: The past-due amount owed for the accounts on which the borrower is now delinquent.
  • pub_rec_bankruptcies: Number of public record bankruptcies.
  • tax_liens: Number of tax liens.
  • hardship_flag: Flags whether or not the borrower is on a hardship plan.
  • disbursement_method: The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY.
  • debt_settlement_flag: Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company.
  • debt_settlement_flag_date: The most recent date that the Debt_Settlement_Flag has been set.
  • settlement_status: The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT.
  • settlement_date: The date that the borrower agrees to the settlement plan.
  • settlement_amount: The loan amount that the borrower has agreed to settle for.
  • settlement_percentage: The settlement amount as a percentage of the payoff balance amount on the loan.
  • settlement_term: The number of months that the borrower will be on the settlement plan.
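Several of these fields arrive as strings rather than numbers: the head() output above shows int_rate stored like "10.65%" and term like " 36 months". A minimal conversion sketch, assuming those formats:

```python
import pandas as pd

# Illustrative rows mimicking the raw formats seen in the head() output.
df = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%"],
    "term": [" 36 months", " 60 months"],
})

# Strip the percent sign and parse the rate as a float.
df["int_rate"] = df["int_rate"].str.rstrip("%").astype(float)
# Pull out the number of months as an integer.
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)
print(df["int_rate"].tolist(), df["term"].tolist())  # [10.65, 15.27] [36, 60]
```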

Identify NaNs

Let's analyze the NaN values first, so we can focus on the columns with a decent population of data. Since we can see NaN values in the sample above, let's count how many NA values are present in each column.


In [3]:
# Count NA values per column
print("Before removing NA")
print("Dataset has {} data points with {} variables each.".format(*loans.shape))
print(loans.isnull().sum())


Before removing NA
Dataset has 42542 data points with 151 variables each.
id                                                4
member_id                                     42542
loan_amnt                                         7
funded_amnt                                       7
funded_amnt_inv                                   7
term                                              7
int_rate                                          7
installment                                       7
grade                                             7
sub_grade                                         7
emp_title                                      2631
emp_length                                        7
home_ownership                                    7
annual_inc                                       11
verification_status                               7
issue_d                                           7
loan_status                                       7
pymnt_plan                                        7
url                                               7
desc                                          13299
purpose                                           7
title                                            19
zip_code                                          7
addr_state                                        7
dti                                               7
delinq_2yrs                                      36
earliest_cr_line                                 36
fico_range_low                                    7
fico_range_high                                   7
inq_last_6mths                                   36
                                              ...  
sec_app_open_acc                              42542
sec_app_revol_util                            42542
sec_app_open_act_il                           42542
sec_app_num_rev_accts                         42542
sec_app_chargeoff_within_12_mths              42542
sec_app_collections_12_mths_ex_med            42542
sec_app_mths_since_last_major_derog           42542
hardship_flag                                     7
hardship_type                                 42542
hardship_reason                               42542
hardship_status                               42542
deferral_term                                 42542
hardship_amount                               42542
hardship_start_date                           42542
hardship_end_date                             42542
payment_plan_start_date                       42542
hardship_length                               42542
hardship_dpd                                  42542
hardship_loan_status                          42542
orig_projected_additional_accrued_interest    42542
hardship_payoff_balance_amount                42542
hardship_last_payment_amount                  42542
disbursement_method                               7
debt_settlement_flag                              7
debt_settlement_flag_date                     42386
settlement_status                             42386
settlement_date                               42386
settlement_amount                             42386
settlement_percentage                         42386
settlement_term                               42386
Length: 151, dtype: int64

As we can see, many features have all values as NaN. Let's drop every column that is less than 40% populated (i.e., more than 60% missing).


In [4]:
# Drop a column if it has fewer than 40% non-NaN values
loans = loans.dropna(thresh=int(0.4 * len(loans)), axis=1)
print("Loans dataset has {} data points with {} variables each.".format(*loans.shape))
#loans.info()


Loans dataset has 42542 data points with 60 variables each.

Now we can analyze the remaining 60 columns, each of which is at least 40% populated.
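To make the thresh semantics concrete, here is a toy illustration (with made-up data): a column survives dropna only if it has at least thresh non-NaN values, here 40% of the row count.

```python
import numpy as np
import pandas as pd

# Toy frame: one column mostly populated, one mostly NaN.
toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, 4, np.nan],   # 4 of 5 non-NaN -> kept
    "mostly_nan":  [1, np.nan, np.nan, np.nan, np.nan],  # 1 of 5 -> dropped
})
kept = toy.dropna(thresh=int(0.4 * len(toy)), axis=1)  # thresh = 2
print(list(kept.columns))  # ['mostly_full']
```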

Assumption

For investors, only information about the loan applicant is available at listing time; the following 41 features are assumed to be available to investors.

Available features of loan applicant

Identifiers, not needed as features

  • id: A unique LC assigned ID for the loan listing.
  • url: URL for the LC page with listing data.

Other features to analyze

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.
  • int_rate: Interest Rate on the loan.
  • grade: LC assigned loan grade.
  • sub_grade: LC assigned loan subgrade.
  • emp_title: The job title supplied by the Borrower when applying for the loan.
  • emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
  • home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER.
  • annual_inc: The self-reported annual income provided by the borrower during registration.
  • verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified.
  • loan_status: Current status of the loan.
  • desc: Loan description provided by the borrower.
  • purpose: A category provided by the borrower for the loan request.
  • title: The loan title provided by the borrower.
  • zip_code: The first 3 numbers of the zip code provided by the borrower in the loan application.
  • addr_state: The state provided by the borrower in the loan application.
  • dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
  • delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.
  • earliest_cr_line: The month the borrower's earliest reported credit line was opened.
  • fico_range_low: The lower boundary range the borrower’s FICO at loan origination belongs to.
  • fico_range_high: The upper boundary range the borrower’s FICO at loan origination belongs to.
  • inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries).
  • mths_since_last_delinq: The number of months since the borrower's last delinquency.
  • mths_since_last_record: The number of months since the last public record.
  • open_acc: The number of open credit lines in the borrower's credit file.
  • pub_rec: Number of derogatory public records.
  • revol_bal: Total credit revolving balance.
  • revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
  • total_acc: The total number of credit lines currently in the borrower's credit file.
  • initial_list_status: The initial listing status of the loan. Possible values are – W, F.
  • out_prncp: Remaining outstanding principal for total amount funded.
  • out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors.
  • last_credit_pull_d: The most recent month LC pulled credit for this loan.
  • last_fico_range_high: The upper boundary range the borrower’s last FICO pulled belongs to.
  • last_fico_range_low: The lower boundary range the borrower’s last FICO pulled belongs to.
  • collections_12_mths_ex_med: Number of collections in 12 months excluding medical collections.
  • policy_code: Publicly available: policy_code=1; new products not publicly available: policy_code=2.
  • application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers.
  • acc_now_delinq: The number of accounts on which the borrower is now delinquent.
  • delinq_amnt: The past-due amount owed for the accounts on which the borrower is now delinquent.
  • pub_rec_bankruptcies: Number of public record bankruptcies.
  • tax_liens: Number of tax liens.
  • hardship_flag: Flags whether or not the borrower is on a hardship plan.

In [5]:
identifiers = ['id', 'url']
investor_features = loans[['loan_amnt','term','int_rate','grade','sub_grade','emp_title','emp_length',\
               'home_ownership','annual_inc','verification_status','loan_status',\
               'desc','purpose','title','zip_code','addr_state','dti','delinq_2yrs',\
               'earliest_cr_line','fico_range_low','fico_range_high','inq_last_6mths',\
               'open_acc','pub_rec','revol_bal','revol_util','total_acc',\
               'initial_list_status','out_prncp','out_prncp_inv','last_credit_pull_d','last_fico_range_high',\
               'last_fico_range_low','collections_12_mths_ex_med','policy_code','application_type','acc_now_delinq',\
               'delinq_amnt','pub_rec_bankruptcies','tax_liens','hardship_flag']]
#loans.info()
investor_features.shape


Out[5]:
(42542, 41)

We found an interesting tool, pandas_profiling, and will use it to profile the dataset and learn more about the data.


In [6]:
import pandas as pd
import pandas_profiling
import numpy as np

pandas_profiling.ProfileReport(investor_features)


Out[6]:

Overview

Dataset info

Number of variables 41
Number of observations 42542
Total Missing (%) 0.0%
Total size in memory 13.3 MiB
Average record size in memory 328.0 B

Variables types

Numeric 16
Categorical 20
Boolean 4
Date 0
Text (Unique) 0
Rejected 1
Unsupported 0

Warnings

Variables

acc_now_delinq
Numeric

Distinct count 3
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.4104e-05
Minimum 0
Maximum 1
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 1
Range 1
Interquartile range 0

Descriptive statistics

Standard deviation 0.0097004
Coef of variation 103.08
Kurtosis 10623
Mean 9.4104e-05
MAD 0.00018819
Skewness 103.07
Sum 4
Variance 9.4098e-05
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 42502 0.0%
 
1.0 4 0.0%
 
(Missing) 36 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 42502 0.0%
 
1.0 4 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 42502 0.0%
 
1.0 4 0.0%
 

addr_state
Categorical

Distinct count 51
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 7
CA
 
7429
NY
 
4065
FL
 
3104
Other values (47)
27937
Value Count Frequency (%)  
CA 7429 0.0%
 
NY 4065 0.0%
 
FL 3104 0.0%
 
TX 2915 0.0%
 
NJ 1988 0.0%
 
IL 1672 0.0%
 
PA 1651 0.0%
 
GA 1503 0.0%
 
VA 1487 0.0%
 
MA 1438 0.0%
 
Other values (40) 15283 0.0%
 

annual_inc
Numeric

Distinct count 5598
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 11
Infinite (%) 0.0%
Infinite (n) 0
Mean 69137
Minimum 1896
Maximum 6000000
Zeros (%) 0.0%

Quantile statistics

Minimum 1896
5-th percentile 24000
Q1 40000
Median 59000
Q3 82500
95-th percentile 144000
Maximum 6000000
Range 5998100
Interquartile range 42500

Descriptive statistics

Standard deviation 64096
Coef of variation 0.9271
Kurtosis 2117.3
Mean 69137
MAD 30995
Skewness 29.035
Sum 2940400000
Variance 4108300000
Memory size 332.4 KiB
Value Count Frequency (%)  
60000.0 1591 0.0%
 
50000.0 1119 0.0%
 
40000.0 935 0.0%
 
45000.0 898 0.0%
 
30000.0 884 0.0%
 
75000.0 865 0.0%
 
65000.0 840 0.0%
 
70000.0 790 0.0%
 
48000.0 766 0.0%
 
80000.0 718 0.0%
 
Other values (5587) 33125 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
1896.0 1 0.0%
 
2000.0 1 0.0%
 
3300.0 1 0.0%
 
3500.0 1 0.0%
 
3600.0 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
1782000.0 1 0.0%
 
1900000.0 1 0.0%
 
2039784.0 1 0.0%
 
3900000.0 1 0.0%
 
6000000.0 1 0.0%
 

application_type
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 7
Individual
42535
(Missing)
 
7
Value Count Frequency (%)  
Individual 42535 0.0%
 
(Missing) 7 0.0%
 

collections_12_mths_ex_med
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 152
Mean 0
0.0
42390
(Missing)
 
152
Value Count Frequency (%)  
0.0 42390 0.0%
 
(Missing) 152 0.0%
 

delinq_2yrs
Numeric

Distinct count 13
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.15245
Minimum 0
Maximum 13
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 13
Range 13
Interquartile range 0

Descriptive statistics

Standard deviation 0.51241
Coef of variation 3.3612
Kurtosis 51.073
Mean 0.15245
MAD 0.27093
Skewness 5.4334
Sum 6480
Variance 0.26256
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 37771 0.0%
 
1.0 3595 0.0%
 
2.0 771 0.0%
 
3.0 244 0.0%
 
4.0 72 0.0%
 
5.0 27 0.0%
 
6.0 13 0.0%
 
7.0 6 0.0%
 
8.0 3 0.0%
 
11.0 2 0.0%
 
Other values (2) 2 0.0%
 
(Missing) 36 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 37771 0.0%
 
1.0 3595 0.0%
 
2.0 771 0.0%
 
3.0 244 0.0%
 
4.0 72 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
7.0 6 0.0%
 
8.0 3 0.0%
 
9.0 1 0.0%
 
11.0 2 0.0%
 
13.0 1 0.0%
 

delinq_amnt
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.14304
Minimum 0
Maximum 6053
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 6053
Range 6053
Interquartile range 0

Descriptive statistics

Standard deviation 29.36
Coef of variation 205.26
Kurtosis 42504
Mean 0.14304
MAD 0.28606
Skewness 206.16
Sum 6080
Variance 861.98
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 42504 0.0%
 
6053.0 1 0.0%
 
27.0 1 0.0%
 
(Missing) 36 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 42504 0.0%
 
27.0 1 0.0%
 
6053.0 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 42504 0.0%
 
27.0 1 0.0%
 
6053.0 1 0.0%
 

desc
Categorical

Distinct count 28965
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 13299
 
225
Debt Consolidation
 
11
Camping Membership
 
8
Other values (28961)
28999
(Missing)
 
13299
Value Count Frequency (%)  
225 0.0%
 
Debt Consolidation 11 0.0%
 
Camping Membership 8 0.0%
 
refinancing 5 0.0%
 
Personal Loan 3 0.0%
 
personal loan 3 0.0%
 
credit card consolidation 3 0.0%
 
credit card debt consolidation 3 0.0%
 
consolidate debt 3 0.0%
 
debt consolidation 3 0.0%
 
Other values (28954) 28976 0.0%
 
(Missing) 13299 0.0%
 

dti
Numeric

Distinct count 2895
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 13.373
Minimum 0
Maximum 29.99
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 2.1
Q1 8.2
Median 13.47
Q3 18.68
95-th percentile 23.92
Maximum 29.99
Range 29.99
Interquartile range 10.48

Descriptive statistics

Standard deviation 6.7263
Coef of variation 0.50298
Kurtosis -0.85174
Mean 13.373
MAD 5.6401
Skewness -0.029922
Sum 568820
Variance 45.243
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 206 0.0%
 
12.0 54 0.0%
 
10.0 46 0.0%
 
18.0 46 0.0%
 
19.2 45 0.0%
 
13.2 43 0.0%
 
16.8 41 0.0%
 
13.5 41 0.0%
 
12.48 40 0.0%
 
15.0 38 0.0%
 
Other values (2884) 41935 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 206 0.0%
 
0.01 3 0.0%
 
0.02 5 0.0%
 
0.03 2 0.0%
 
0.04 3 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
29.92 2 0.0%
 
29.93 3 0.0%
 
29.95 2 0.0%
 
29.96 1 0.0%
 
29.99 1 0.0%
 

earliest_cr_line
Categorical

Distinct count 531
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 36
Oct-99
 
393
Nov-98
 
390
Oct-00
 
370
Other values (527)
41353
Value Count Frequency (%)  
Oct-99 393 0.0%
 
Nov-98 390 0.0%
 
Oct-00 370 0.0%
 
Dec-98 366 0.0%
 
Dec-97 348 0.0%
 
Nov-00 340 0.0%
 
Nov-99 337 0.0%
 
Oct-98 334 0.0%
 
Sep-00 325 0.0%
 
Nov-97 319 0.0%
 
Other values (520) 38984 0.0%
 

emp_length
Categorical

Distinct count 13
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 7
10+ years
 
9369
< 1 year
 
5062
2 years
 
4743
Other values (9)
23361
Value Count Frequency (%)  
10+ years 9369 0.0%
 
< 1 year 5062 0.0%
 
2 years 4743 0.0%
 
3 years 4364 0.0%
 
4 years 3649 0.0%
 
1 year 3595 0.0%
 
5 years 3458 0.0%
 
6 years 2375 0.0%
 
7 years 1875 0.0%
 
8 years 1592 0.0%
 
Other values (2) 2453 0.0%
 

emp_title
Categorical

Distinct count 30660
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 2631
US Army
 
139
Bank of America
 
115
IBM
 
72
Other values (30656)
39585
(Missing)
 
2631
Value Count Frequency (%)  
US Army 139 0.0%
 
Bank of America 115 0.0%
 
IBM 72 0.0%
 
Kaiser Permanente 61 0.0%
 
AT&T 61 0.0%
 
UPS 58 0.0%
 
Wells Fargo 57 0.0%
 
USAF 56 0.0%
 
US Air Force 55 0.0%
 
Self Employed 49 0.0%
 
Other values (30649) 39188 0.0%
 
(Missing) 2631 0.0%
 

fico_range_high
Highly correlated

This variable is highly correlated with fico_range_low and should be ignored for analysis

Correlation 1

fico_range_low
Numeric

Distinct count 45
Unique (%) 0.0%
Missing (%) 100.0%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 713.05
Minimum 610
Maximum 825
Zeros (%) 0.0%

Quantile statistics

Minimum 610
5-th percentile 665
Q1 685
Median 710
Q3 740
95-th percentile 780
Maximum 825
Range 215
Interquartile range 55

Descriptive statistics

Standard deviation 36.188
Coef of variation 0.050751
Kurtosis -0.49633
Mean 713.05
MAD 30.05
Skewness 0.46487
Sum 30330000
Variance 1309.6
Memory size 332.4 KiB
Value Count Frequency (%)  
685.0 2310 0.0%
 
700.0 2267 0.0%
 
680.0 2228 0.0%
 
695.0 2202 0.0%
 
690.0 2196 0.0%
 
675.0 1994 0.0%
 
705.0 1970 0.0%
 
720.0 1949 0.0%
 
715.0 1891 0.0%
 
725.0 1891 0.0%
 
Other values (34) 21637 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
610.0 2 0.0%
 
615.0 1 0.0%
 
620.0 1 0.0%
 
625.0 2 0.0%
 
630.0 6 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
805.0 193 0.0%
 
810.0 125 0.0%
 
815.0 28 0.0%
 
820.0 19 0.0%
 
825.0 3 0.0%
 

grade
Categorical

Distinct count 8
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
B
12389
A
 
10183
C
 
8740
Other values (4)
 
11223
Value Count Frequency (%)  
B 12389 0.0%
 
A 10183 0.0%
 
C 8740 0.0%
 
D 6016 0.0%
 
E 3394 0.0%
 
F 1301 0.0%
 
G 512 0.0%
 
(Missing) 7 0.0%
 

hardship_flag
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
N
42535
(Missing)
 
7
Value Count Frequency (%)  
N 42535 0.0%
 
(Missing) 7 0.0%
 

home_ownership
Categorical

Distinct count 6
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
RENT
20181
MORTGAGE
 
18959
OWN
 
3251
Other values (2)
 
144
Value Count Frequency (%)  
RENT 20181 0.0%
 
MORTGAGE 18959 0.0%
 
OWN 3251 0.0%
 
OTHER 136 0.0%
 
NONE 8 0.0%
 
(Missing) 7 0.0%
 

initial_list_status
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
f
42535
(Missing)
 
7
Value Count Frequency (%)  
f 42535 0.0%
 
(Missing) 7 0.0%
 

inq_last_6mths
Numeric

Distinct count 29
Unique (%) 0.1%
Missing (%) 0.1%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.0814
Minimum 0
Maximum 33
Zeros (%) 46.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 2
95-th percentile 4
Maximum 33
Range 33
Interquartile range 2

Descriptive statistics

Standard deviation 1.5275
Coef of variation 1.4124
Kurtosis 30.962
Mean 1.0814
MAD 1.0433
Skewness 3.4535
Sum 45967
Variance 2.3331
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 19657 0.0%
 
1.0 11247 0.0%
 
2.0 5987 0.0%
 
3.0 3182 0.0%
 
4.0 1056 0.0%
 
5.0 596 0.0%
 
6.0 339 0.0%
 
7.0 182 0.0%
 
8.0 115 0.0%
 
9.0 50 0.0%
 
Other values (18) 95 0.0%
 
(Missing) 36 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 19657 0.0%
 
1.0 11247 0.0%
 
2.0 5987 0.0%
 
3.0 3182 0.0%
 
4.0 1056 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
27.0 1 0.0%
 
28.0 1 0.0%
 
31.0 1 0.0%
 
32.0 1 0.0%
 
33.0 1 0.0%
 

int_rate
Categorical

Distinct count 395
Unique (%) 0.9%
Missing (%) 0.0%
Missing (n) 7
10.99%
 
970
11.49%
 
837
13.49%
 
832
Other values (391)
39896
Value Count Frequency (%)  
10.99% 970 0.0%
 
11.49% 837 0.0%
 
13.49% 832 0.0%
 
7.51% 787 0.0%
 
7.88% 742 0.0%
 
7.49% 656 0.0%
 
11.71% 609 0.0%
 
9.99% 607 0.0%
 
7.90% 582 0.0%
 
5.42% 573 0.0%
 
Other values (384) 35340 0.0%
 

last_credit_pull_d
Categorical

Distinct count 128
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 11
Jan-18
 
9348
Oct-16
 
4312
Dec-17
 
905
Other values (124)
27966
Value Count Frequency (%)  
Jan-18 9348 0.0%
 
Oct-16 4312 0.0%
 
Dec-17 905 0.0%
 
Nov-17 868 0.0%
 
Oct-17 822 0.0%
 
Feb-17 816 0.0%
 
Sep-17 683 0.0%
 
Aug-17 665 0.0%
 
Feb-13 608 0.0%
 
Mar-16 520 0.0%
 
Other values (117) 22984 0.0%
 

last_fico_range_high
Numeric

Distinct count 73
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 688.87
Minimum 0
Maximum 850
Zeros (%) 0.1%

Quantile statistics

Minimum 0
5-th percentile 534
Q1 644
Median 699
Q3 744
95-th percentile 804
Maximum 850
Range 850
Interquartile range 100

Descriptive statistics

Standard deviation 80.587
Coef of variation 0.11698
Kurtosis 2.9593
Mean 688.87
MAD 63.353
Skewness -0.87048
Sum 29301000
Variance 6494.2
Memory size 332.4 KiB
Value Count Frequency (%)  
709.0 1305 0.0%
 
694.0 1256 0.0%
 
699.0 1227 0.0%
 
724.0 1208 0.0%
 
719.0 1181 0.0%
 
714.0 1176 0.0%
 
704.0 1172 0.0%
 
684.0 1147 0.0%
 
689.0 1084 0.0%
 
734.0 1057 0.0%
 
Other values (62) 30722 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 28 0.0%
 
499.0 768 0.0%
 
504.0 223 0.0%
 
509.0 177 0.0%
 
514.0 206 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
829.0 177 0.0%
 
834.0 129 0.0%
 
839.0 40 0.0%
 
844.0 34 0.0%
 
850.0 11 0.0%
 

last_fico_range_low
Numeric

Distinct count 72
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 675.94
Minimum 0
Maximum 845
Zeros (%) 1.9%

Quantile statistics

Minimum 0
5-th percentile 530
Q1 640
Median 695
Q3 740
95-th percentile 800
Maximum 845
Range 845
Interquartile range 100

Descriptive statistics

Standard deviation 119.29
Coef of variation 0.17647
Kurtosis 16.684
Mean 675.94
MAD 73.728
Skewness -3.3858
Sum 28751000
Variance 14229
Memory size 332.4 KiB
Value Count Frequency (%)  
705.0 1305 0.0%
 
690.0 1256 0.0%
 
695.0 1227 0.0%
 
720.0 1208 0.0%
 
715.0 1181 0.0%
 
710.0 1176 0.0%
 
700.0 1172 0.0%
 
680.0 1147 0.0%
 
685.0 1084 0.0%
 
730.0 1057 0.0%
 
Other values (61) 30722 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 796 0.0%
 
500.0 223 0.0%
 
505.0 177 0.0%
 
510.0 206 0.0%
 
515.0 206 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
825.0 177 0.0%
 
830.0 129 0.0%
 
835.0 40 0.0%
 
840.0 34 0.0%
 
845.0 11 0.0%
 

loan_amnt
Numeric

Distinct count 899
Unique (%) 2.1%
Missing (%) 0.0%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 11090
Minimum 500
Maximum 35000
Zeros (%) 0.0%

Quantile statistics

Minimum 500
5-th percentile 2400
Q1 5200
Median 9700
Q3 15000
95-th percentile 25000
Maximum 35000
Range 34500
Interquartile range 9800

Descriptive statistics

Standard deviation 7410.9
Coef of variation 0.66827
Kurtosis 0.78587
Mean 11090
MAD 5876.1
Skewness 1.065
Sum 471700000
Variance 54922000
Memory size 332.4 KiB
Value Count Frequency (%)  
10000.0 3016 0.0%
 
12000.0 2439 0.0%
 
5000.0 2260 0.0%
 
6000.0 2037 0.0%
 
15000.0 2012 0.0%
 
20000.0 1724 0.0%
 
8000.0 1699 0.0%
 
25000.0 1499 0.0%
 
4000.0 1230 0.0%
 
3000.0 1134 0.0%
 
Other values (888) 23485 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
500.0 11 0.0%
 
550.0 1 0.0%
 
600.0 6 0.0%
 
700.0 3 0.0%
 
725.0 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
34475.0 5 0.0%
 
34525.0 1 0.0%
 
34675.0 1 0.0%
 
34800.0 2 0.0%
 
35000.0 685 0.0%
 

loan_status
Categorical

Distinct count 5
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
Fully Paid
34116
Charged Off
 
5670
Does not meet the credit policy. Status:Fully Paid
 
1988
Value Count Frequency (%)  
Fully Paid 34116 0.0%
 
Charged Off 5670 0.0%
 
Does not meet the credit policy. Status:Fully Paid 1988 0.0%
 
Does not meet the credit policy. Status:Charged Off 761 0.0%
 
(Missing) 7 0.0%
 

open_acc
Numeric

Distinct count 45
Unique (%) 0.1%
Missing (%) 0.1%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.344
Minimum 1
Maximum 47
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 3
Q1 6
Median 9
Q3 12
95-th percentile 18
Maximum 47
Range 46
Interquartile range 6

Descriptive statistics

Standard deviation 4.4963
Coef of variation 0.4812
Kurtosis 1.935
Mean 9.344
MAD 3.5063
Skewness 1.042
Sum 397170
Variance 20.216
Memory size 332.4 KiB
Value Count Frequency (%)  
7.0 4252 0.0%
 
8.0 4176 0.0%
 
6.0 4172 0.0%
 
9.0 3922 0.0%
 
10.0 3386 0.0%
 
5.0 3368 0.0%
 
11.0 2944 0.0%
 
4.0 2508 0.0%
 
12.0 2398 0.0%
 
13.0 2060 0.0%
 
Other values (34) 9320 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
1.0 39 0.0%
 
2.0 692 0.0%
 
3.0 1608 0.0%
 
4.0 2508 0.0%
 
5.0 3368 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
41.0 1 0.0%
 
42.0 1 0.0%
 
44.0 1 0.0%
 
46.0 1 0.0%
 
47.0 1 0.0%
 

out_prncp
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
Mean 0
0.0
42535
(Missing)
 
7
Value Count Frequency (%)  
0.0 42535 0.0%
 
(Missing) 7 0.0%
 

out_prncp_inv
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
Mean 0
0.0
42535
(Missing)
 
7
Value Count Frequency (%)  
0.0 42535 0.0%
 
(Missing) 7 0.0%
 

policy_code
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
Mean 1
1.0
42535
(Missing)
 
7
Value Count Frequency (%)  
1.0 42535 0.0%
 
(Missing) 7 0.0%
 

pub_rec
Numeric

Distinct count 7
Unique (%) 0.0%
Missing (%) 0.1%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.058156
Minimum 0
Maximum 5
Zeros (%) 94.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 5
Range 5
Interquartile range 0

Descriptive statistics

Standard deviation 0.24571
Coef of variation 4.225
Kurtosis 26.835
Mean 0.058156
MAD 0.10981
Skewness 4.6055
Sum 2472
Variance 0.060375
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 40130 0.0%
 
1.0 2298 0.0%
 
2.0 64 0.0%
 
3.0 11 0.0%
 
4.0 2 0.0%
 
5.0 1 0.0%
 
(Missing) 36 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 40130 0.0%
 
1.0 2298 0.0%
 
2.0 64 0.0%
 
3.0 11 0.0%
 
4.0 2 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
1.0 2298 0.0%
 
2.0 64 0.0%
 
3.0 11 0.0%
 
4.0 2 0.0%
 
5.0 1 0.0%
 

pub_rec_bankruptcies
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 3.2%
Missing (n) 1372
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.045227
Minimum 0
Maximum 2
Zeros (%) 92.4%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 2
Range 2
Interquartile range 0

Descriptive statistics

Standard deviation 0.20874
Coef of variation 4.6153
Kurtosis 18.127
Mean 0.045227
MAD 0.086381
Skewness 4.4411
Sum 1862
Variance 0.043571
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 39316 0.0%
 
1.0 1846 0.0%
 
2.0 8 0.0%
 
(Missing) 1372 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 39316 0.0%
 
1.0 1846 0.0%
 
2.0 8 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 39316 0.0%
 
1.0 1846 0.0%
 
2.0 8 0.0%
 

purpose
Categorical

Distinct count 15
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
debt_consolidation
19776
credit_card
 
5477
other
 
4425
Other values (11)
 
12857
Value Count Frequency (%)  
debt_consolidation 19776 0.0%
 
credit_card 5477 0.0%
 
other 4425 0.0%
 
home_improvement 3199 0.0%
 
major_purchase 2311 0.0%
 
small_business 1992 0.0%
 
car 1615 0.0%
 
wedding 1004 0.0%
 
medical 753 0.0%
 
moving 629 0.0%
 
Other values (4) 1354 0.0%
 

revol_bal
Numeric

Distinct count 22710
Unique (%) 53.4%
Missing (%) 0.0%
Missing (n) 7
Infinite (%) 0.0%
Infinite (n) 0
Mean 14298
Minimum 0
Maximum 1207400
Zeros (%) 2.6%

Quantile statistics

Minimum 0
5-th percentile 295.4
Q1 3635
Median 8821
Q3 17251
95-th percentile 44544
Maximum 1207400
Range 1207400
Interquartile range 13616

Descriptive statistics

Standard deviation 22018
Coef of variation 1.54
Kurtosis 346.29
Mean 14298
MAD 11545
Skewness 11.012
Sum 608160000
Variance 484810000
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 1119 0.0%
 
255.0 14 0.0%
 
298.0 14 0.0%
 
1.0 13 0.0%
 
682.0 12 0.0%
 
52.0 10 0.0%
 
39.0 10 0.0%
 
400.0 10 0.0%
 
6.0 10 0.0%
 
23.0 9 0.0%
 
Other values (22699) 41314 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 1119 0.0%
 
1.0 13 0.0%
 
2.0 6 0.0%
 
3.0 7 0.0%
 
4.0 3 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
487589.0 1 0.0%
 
508961.0 1 0.0%
 
602519.0 1 0.0%
 
952013.0 1 0.0%
 
1207359.0 1 0.0%
 

revol_util
Categorical

Distinct count 1120
Unique (%) 2.6%
Missing (%) 0.2%
Missing (n) 97
0%
 
1070
40.70%
 
65
0.20%
 
64
Other values (1116)
41246
(Missing)
 
97
Value Count Frequency (%)  
0% 1070 0.0%
 
40.70% 65 0.0%
 
0.20% 64 0.0%
 
63% 63 0.0%
 
66.60% 62 0.0%
 
70.40% 61 0.0%
 
0.10% 61 0.0%
 
64.60% 60 0.0%
 
37.60% 60 0.0%
 
46.40% 59 0.0%
 
Other values (1109) 40820 0.0%
 
(Missing) 97 0.0%
 

sub_grade
Categorical

Distinct count 36
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 7
B3
 
2997
A4
 
2905
B5
 
2807
Other values (32)
33826
Value Count Frequency (%)  
B3 2997 0.0%
 
A4 2905 0.0%
 
B5 2807 0.0%
 
A5 2793 0.0%
 
B4 2590 0.0%
 
C1 2264 0.0%
 
C2 2157 0.0%
 
B2 2113 0.0%
 
B1 1882 0.0%
 
A3 1823 0.0%
 
Other values (25) 18204 0.0%
 

tax_liens
Numeric

Distinct count 3
Unique (%) 0.0%
Missing (%) 0.3%
Missing (n) 112
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.3568e-05
Minimum 0
Maximum 1
Zeros (%) 99.7%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 0
Maximum 1
Range 1
Interquartile range 0

Descriptive statistics

Standard deviation 0.0048547
Coef of variation 205.99
Kurtosis 42430
Mean 2.3568e-05
MAD 4.7135e-05
Skewness 205.99
Sum 1
Variance 2.3568e-05
Memory size 332.4 KiB
Value Count Frequency (%)  
0.0 42429 0.0%
 
1.0 1 0.0%
 
(Missing) 112 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 42429 0.0%
 
1.0 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 42429 0.0%
 
1.0 1 0.0%
 

term
Categorical

Distinct count 3
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
36 months
31534
60 months
 
11001
(Missing)
 
7
Value Count Frequency (%)  
36 months 31534 0.0%
 
60 months 11001 0.0%
 
(Missing) 7 0.0%
 

title
Categorical

Distinct count 21258
Unique (%) 50.0%
Missing (%) 0.0%
Missing (n) 19
Debt Consolidation
 
2259
Debt Consolidation Loan
 
1760
Personal Loan
 
708
Other values (21254)
37796
Value Count Frequency (%)  
Debt Consolidation 2259 0.0%
 
Debt Consolidation Loan 1760 0.0%
 
Personal Loan 708 0.0%
 
Consolidation 547 0.0%
 
debt consolidation 532 0.0%
 
Home Improvement 373 0.0%
 
Credit Card Consolidation 370 0.0%
 
Debt consolidation 347 0.0%
 
Small Business Loan 333 0.0%
 
Personal 330 0.0%
 
Other values (21247) 34964 0.0%
 

total_acc
Numeric

Distinct count 84
Unique (%) 0.2%
Missing (%) 0.1%
Missing (n) 36
Infinite (%) 0.0%
Infinite (n) 0
Mean 22.124
Minimum 1
Maximum 90
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 6
Q1 13
Median 20
Q3 29
95-th percentile 44
Maximum 90
Range 89
Interquartile range 16

Descriptive statistics

Standard deviation 11.593
Coef of variation 0.52398
Kurtosis 0.65885
Mean 22.124
MAD 9.2155
Skewness 0.82238
Sum 940420
Variance 134.39
Memory size 332.4 KiB
Value Count Frequency (%)  
15.0 1552 0.0%
 
16.0 1547 0.0%
 
17.0 1543 0.0%
 
14.0 1531 0.0%
 
20.0 1504 0.0%
 
18.0 1493 0.0%
 
21.0 1483 0.0%
 
13.0 1480 0.0%
 
12.0 1416 0.0%
 
19.0 1404 0.0%
 
Other values (73) 27553 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
1.0 21 0.0%
 
2.0 41 0.0%
 
3.0 238 0.0%
 
4.0 486 0.0%
 
5.0 622 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
79.0 2 0.0%
 
80.0 1 0.0%
 
81.0 1 0.0%
 
87.0 1 0.0%
 
90.0 1 0.0%
 

verification_status
Categorical

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 7
Not Verified
18758
Verified
 
13471
Source Verified
 
10306
(Missing)
 
7
Value Count Frequency (%)  
Not Verified 18758 0.0%
 
Verified 13471 0.0%
 
Source Verified 10306 0.0%
 
(Missing) 7 0.0%
 

zip_code
Categorical

Distinct count 838
Unique (%) 2.0%
Missing (%) 0.0%
Missing (n) 7
100xx
 
649
945xx
 
559
606xx
 
548
Other values (834)
40779
Value Count Frequency (%)  
100xx 649 0.0%
 
945xx 559 0.0%
 
606xx 548 0.0%
 
112xx 538 0.0%
 
070xx 503 0.0%
 
900xx 478 0.0%
 
300xx 436 0.0%
 
021xx 416 0.0%
 
750xx 392 0.0%
 
926xx 387 0.0%
 
Other values (827) 37629 0.0%
 

Correlations

Sample

loan_amnt term int_rate grade sub_grade emp_title emp_length home_ownership annual_inc verification_status loan_status desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line fico_range_low fico_range_high inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med policy_code application_type acc_now_delinq delinq_amnt pub_rec_bankruptcies tax_liens hardship_flag
0 5000.0 36 months 10.65% B B2 NaN 10+ years RENT 24000.0 Verified Fully Paid Borrower added on 12/22/11 > I need to upgra... credit_card Computer 860xx AZ 27.65 0.0 Jan-85 735.0 739.0 1.0 3.0 0.0 13648.0 83.70% 9.0 f 0.0 0.0 Jan-18 739.0 735.0 0.0 1.0 Individual 0.0 0.0 0.0 0.0 N
1 2500.0 60 months 15.27% C C4 Ryder < 1 year RENT 30000.0 Source Verified Charged Off Borrower added on 12/22/11 > I plan to use t... car bike 309xx GA 1.00 0.0 Apr-99 740.0 744.0 5.0 3.0 0.0 1687.0 9.40% 4.0 f 0.0 0.0 Oct-16 499.0 0.0 0.0 1.0 Individual 0.0 0.0 0.0 0.0 N
2 2400.0 36 months 15.96% C C5 NaN 10+ years RENT 12252.0 Not Verified Fully Paid NaN small_business real estate business 606xx IL 8.72 0.0 Nov-01 735.0 739.0 2.0 2.0 0.0 2956.0 98.50% 10.0 f 0.0 0.0 Jun-17 739.0 735.0 0.0 1.0 Individual 0.0 0.0 0.0 0.0 N
3 10000.0 36 months 13.49% C C1 AIR RESOURCES BOARD 10+ years RENT 49200.0 Source Verified Fully Paid Borrower added on 12/21/11 > to pay for prop... other personel 917xx CA 20.00 0.0 Feb-96 690.0 694.0 1.0 10.0 0.0 5598.0 21% 37.0 f 0.0 0.0 Apr-16 604.0 600.0 0.0 1.0 Individual 0.0 0.0 0.0 0.0 N
4 3000.0 60 months 12.69% B B5 University Medical Group 1 year RENT 80000.0 Source Verified Fully Paid Borrower added on 12/21/11 > I plan on combi... other Personal 972xx OR 17.94 0.0 Jan-96 695.0 699.0 0.0 15.0 0.0 27783.0 53.90% 38.0 f 0.0 0.0 Jan-17 694.0 690.0 0.0 1.0 Individual 0.0 0.0 0.0 0.0 N

Based on the profiling above, let's work on the features

Removals

  • The dataset has 3 duplicate rows; let's remove the duplicates
  • application_type: all values are Individual except the 7 missing
  • collections_12_mths_ex_med: all values are 0.0 except the 152 missing
  • fico_range_high is highly correlated with fico_range_low, so let's remove fico_range_high
  • initial_list_status: all values are f except the 7 missing
  • hardship_flag: all values are 'N' except the missing ones
  • out_prncp: all values are 0.0 except the missing ones
  • out_prncp_inv: all values are 0.0 except the missing ones
  • policy_code: all values are 1 except the missing ones
  • tax_liens: all values are 0.0 except a single 1.0; the rest are missing
  • title and purpose carry similar values (e.g. debt consolidation), so remove title and use only purpose
  • desc is verbose text, often resembling title or purpose; we are not analyzing free text, so remove it
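
The single-valued columns flagged above (policy_code, hardship_flag, out_prncp, etc.) can also be found programmatically instead of by eye. A minimal sketch on a hypothetical mini-frame standing in for investor_features (the actual drop list is built by hand in the next cell):

```python
import pandas as pd

# Hypothetical mini-frame; column values are illustrative only.
df = pd.DataFrame({
    'policy_code': [1.0, 1.0, 1.0, None],
    'hardship_flag': ['N', 'N', 'N', None],
    'loan_amnt': [5000.0, 2500.0, 2400.0, 10000.0],
})

# Columns with at most one distinct non-missing value carry no signal.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(constant_cols)
```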

In [7]:
unique_loans = investor_features.drop_duplicates()
unique_loans_features = unique_loans.drop(['application_type', 'collections_12_mths_ex_med','fico_range_high',\
                                          'initial_list_status','hardship_flag','out_prncp','out_prncp_inv',\
                                          'policy_code','tax_liens','title','desc','addr_state','earliest_cr_line',\
                                           'emp_title','last_credit_pull_d', 'delinq_amnt', 'last_fico_range_high',\
                                           'zip_code'], axis=1)
unique_loans_features.shape


Out[7]:
(42536, 23)

Conversion

  • revol_util is in % format; convert it to numeric
  • the int_rate feature is in the format 10.99%; convert it to numeric
  • strip the '<' symbol from emp_length values, as XGBClassifier won't accept it

In [8]:
unique_loans_features['revol_util'] = unique_loans_features['revol_util'].str.replace('%', '').astype(float)
unique_loans_features['int_rate'] = unique_loans_features['int_rate'].str.replace('%', '').astype(float)
#XGB cant have column name with <
unique_loans_features['emp_length'] = unique_loans_features['emp_length'].str.replace('<', '')

Keep rows only if a loan_status value is present


In [9]:
unique_loans_features = unique_loans_features[unique_loans_features['loan_status'].notnull()]

Prepare features & target label


In [10]:
# Split the data into features and target label
loan_paid_status_raw = unique_loans_features['loan_status']
features_raw = unique_loans_features.drop('loan_status', axis = 1)

Prepare labels

  • Let's first work on the labels (the target, or response variable): whether the loan is completely paid off or not.

In [11]:
loans_status_count = unique_loans_features['loan_status'].value_counts().reset_index()
loans_status_count


Out[11]:
index loan_status
0 Fully Paid 34116
1 Charged Off 5670
2 Does not meet the credit policy. Status:Fully ... 1988
3 Does not meet the credit policy. Status:Charge... 761

For our project we just need to know whether loans are charged off or not. We will map 'Fully Paid' and 'Does not meet the credit policy. Status:Fully Paid' to 0, and 'Charged Off' and 'Does not meet the credit policy. Status:Charged Off' to 1.


In [12]:
# function to convert target label to numeric values
def convert_label(val):
    #print val
    if val=='Fully Paid':
        return 0
    elif val=='Charged Off':
        return 1
    elif val=='Does not meet the credit policy. Status:Fully Paid':
        return 0
    elif val=='Does not meet the credit policy. Status:Charged Off':
        return 1

In [13]:
loan_paid_status = loan_paid_status_raw.apply(convert_label)

loan_paid_status.value_counts().reset_index()


Out[13]:
index loan_status
0 0 36104
1 1 6431

In [14]:
cnt_label = loan_paid_status.value_counts()

plt.figure(figsize=(8,4))
sns.barplot(cnt_label.index, cnt_label.values, alpha=0.8, color=color[1])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.title('Number of loans fully paid (0), charged off(1)', fontsize=15)
plt.show()



In [15]:
unique_loans_features['charged_off'] = unique_loans_features['loan_status'].apply(convert_label)

In [16]:
pos = unique_loans_features[unique_loans_features["charged_off"] == 1].shape[0]
neg = unique_loans_features[unique_loans_features["charged_off"] == 0].shape[0]
print "Positive examples = {}".format(pos)
print "Negative examples = {}".format(neg)
print "Proportion of positive to negative examples = {}".format((pos*1.0 / neg) * 100)


Positive examples = 6431
Negative examples = 36104
Proportion of positive to negative examples = 17.8124307556

We see that the dataset is quite imbalanced, as expected: positive examples ("charged off") amount to only ~18% of the negatives. We will handle the class-imbalance issue later.
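
One common way to handle such imbalance later with XGBoost (one option among several; the notebook has not committed to a method at this point) is the scale_pos_weight parameter, conventionally set to the negative-to-positive ratio so the minority class contributes comparable total weight during training. A sketch using the counts printed above:

```python
# Counts printed above: 6431 charged off (positive), 36104 fully paid (negative).
pos, neg = 6431, 36104

# Conventional setting: neg / pos, which up-weights each positive example.
scale_pos_weight = neg / float(pos)
print(round(scale_pos_weight, 2))
```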

Let's do some basic analysis on the data.

Let's validate some opinions about the loans.

Loans with good FICO scores should be paid off


In [17]:
from matplotlib.pyplot import subplots, show

fig, ax = subplots(figsize=(12, 8))
unique_loans_features[unique_loans_features['charged_off'] == 0]['fico_range_low'].hist(alpha=0.5, color='red', bins=30, label='Fully Paid')
unique_loans_features[unique_loans_features['charged_off'] == 1]['fico_range_low'].hist(alpha=0.5, color='blue', bins=30, label='Charged Off')
plt.title("# Loans charged off or not & FICO score ", fontsize=15)
plt.legend()
#plt.xlabel('FICO')
plt.ylabel('Count')
ax.set_xlabel("FICO")
show()


  • Though some loan applicants have good FICO scores (750+), we can see that some of their loans are not paid off
  • The highest number of paid-off loans falls roughly in the 660–700 FICO range
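
Raw histogram counts can mislead here, since the FICO bands hold different numbers of loans; the charge-off rate per band is more telling. A sketch on toy data (column names match the notebook; the values are made up):

```python
import pandas as pd

# Toy stand-in for unique_loans_features; values are illustrative only.
df = pd.DataFrame({
    'fico_range_low': [660, 665, 700, 705, 750, 755, 800, 805],
    'charged_off':    [1,   1,   1,   0,   0,   1,   0,   0],
})

# Bucket FICO scores, then take the mean of the 0/1 label per bucket,
# i.e. the fraction of loans charged off in each band.
bands = pd.cut(df['fico_range_low'], bins=[600, 700, 750, 850])
charge_off_rate = df.groupby(bands, observed=False)['charged_off'].mean()
print(charge_off_rate)
```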

What are the different purposes loans are being requested for?


In [18]:
plt.figure(figsize=(12,6))
sns.countplot(x='purpose', data=unique_loans_features, hue='charged_off', palette='Set1')
plt.xticks(rotation='vertical')
plt.title("# of loans vs purpose", fontsize=15)
plt.show()


  • Most loans are requested for debt_consolidation
  • Most charged-off loans also have purpose debt_consolidation

Let's see the trend between FICO score and interest rate.


In [19]:
sns.jointplot('fico_range_low', 'int_rate', data=unique_loans_features, color='purple')


Out[19]:
<seaborn.axisgrid.JointGrid at 0x1a1a52f4d0>
  • The higher the FICO score, the lower the interest rate
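
The inverse trend visible in the jointplot can be quantified with a Pearson correlation coefficient; a sketch on made-up numbers that echo the plot:

```python
import pandas as pd

# Made-up points echoing the jointplot: higher FICO, lower interest rate.
df = pd.DataFrame({
    'fico_range_low': [660, 680, 700, 720, 740, 760, 780],
    'int_rate':       [18.0, 16.5, 14.0, 12.5, 10.0, 8.5, 7.0],
})

# Pearson r near -1 confirms a strong inverse linear relationship.
r = df['fico_range_low'].corr(df['int_rate'])
print(r)
```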

Feature selection based on intuition

  • Based on the profiling above, going through each feature in turn, we can identify the categorical and numerical features that are well enough populated to be used for modeling.
  • Pandas profiling also reports highly correlated features, so we can drop one of each correlated pair; for example, fico_range_high and fico_range_low are highly correlated, so we keep only one of them for our analysis and modeling.
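
The correlated-pair check that the profiler performs can be reproduced with pandas directly. A minimal sketch, where the toy column 'b' mirrors 'a' the way fico_range_high mirrors fico_range_low:

```python
import pandas as pd

# Toy frame: 'b' duplicates 'a'; 'c' is unrelated.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [4.0, 1.0, 3.0, 2.0],
})

# Absolute correlation matrix; scan only the upper triangle so each
# pair is reported once, keeping pairs above a 0.9 threshold.
corr = df.corr().abs()
pairs = [(r, c) for i, r in enumerate(corr.columns)
         for c in corr.columns[i + 1:] if corr.loc[r, c] > 0.9]
print(pairs)
```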

Selected categorical features, intuitions for selection

  • addr_state: Not selected; this feature has high cardinality (51 distinct values). Though state might be a good indicator, one-hot encoding it would add 51 features, so it is left out of the first pass.
  • earliest_cr_line: Not selected; it should first be converted to a numeric value (months between this date and the loan date), which is deferred past the first pass.
  • emp_length: Selected; employment duration might influence paying off the loan.
  • emp_title: Not selected; purpose is a better indicator, and one-hot encoding this high-cardinality field would explode the feature count.
  • grade: Selected; the LC grade might influence paying off the loan.
  • home_ownership: Selected; home ownership might influence paying off the loan.
  • int_rate: Selected; the interest rate might influence the loan getting paid off. A lower rate means less to pay back, so the loan may be more likely to be paid.
  • last_credit_pull_d: Not selected; the gap between the loan application month and the last credit pull month might make a helpful feature later.
  • purpose: Selected; the purpose indicates usage, which might correlate with the loan being paid off.
  • sub_grade: Selected; may be a good indicator of paying off the loan.
  • term: Selected; a longer term means a smaller monthly commitment, which may help in paying off the loan.
  • title: Not selected; purpose is a better feature than title, and title has too many categories to one-hot encode.
  • url: Not selected; it is just an identifier.
  • verification_status: Selected; a stronger verification status suggests the borrower is more likely to pay off the loan.
  • zip_code: Not selected; it indicates the neighborhood, which may relate to chances of paying off the loan, but there are too many zip codes to one-hot encode.

Selected numerical features, intuitions for selection

  • acc_now_delinq: Selected; once we add more data, this feature might become valuable.
  • annual_inc: Selected; income informs the applicant's ability to pay back the loan.
  • delinq_2yrs: Selected; delinquency in the past 2 years may be a good indicator.
  • delinq_amnt: Not selected; most values are 0, 36 are missing, and only 2 take other values.
  • dti: Selected; the debt-to-income ratio informs the applicant's ability to pay back the loan.
  • fico_range_high: Not selected; it is highly correlated with fico_range_low, which will be used.
  • fico_range_low: Selected; it reflects the applicant's credit rating, which by itself is a good measure of the ability to pay back a loan.
  • inq_last_6mths: Selected; multiple recent credit inquiries may indicate the applicant is pursuing several debts.
  • last_fico_range_high: Not selected; it is highly correlated with last_fico_range_low, which will be used.
  • last_fico_range_low: Selected; this feature might tell us how the applicant has maintained their credit.
  • loan_amnt: Selected; the higher the loan amount, the less likely the loan will be paid off.
  • mths_since_last_delinq: Selected; more months since the last delinquency suggests a better applicant.
  • open_acc: Selected; a moderate number of open accounts may be a good sign.
  • pub_rec: Selected; derogatory public records make the loan less likely to be paid off.
  • pub_rec_bankruptcies: Selected; any bankruptcy makes the loan less likely to be paid off.
  • revol_bal: Selected; the revolving balance may influence the likelihood of paying off the loan.
  • revol_util: Selected; revolving-line utilization may influence the likelihood of paying off the loan.
  • tax_liens: Not selected; 42429 values are 0, one is 1, and 112 are missing.
  • total_acc: Selected; the total number of accounts may be a useful indicator.

In [20]:
categorical_features = ['emp_length', 'grade', 'home_ownership', 'purpose',\
                                     'sub_grade','term','verification_status']
numerical_features = ['acc_now_delinq', 'annual_inc', 'delinq_2yrs', 'int_rate', 'dti', 'fico_range_low'\
                                   ,'inq_last_6mths','last_fico_range_low','loan_amnt'\
                                  ,'open_acc','pub_rec','pub_rec_bankruptcies','revol_bal','revol_util','total_acc']

Preparing the Data

Missing values

1. Handling missing numerical features

Let's identify and treat the missing values


In [21]:
print "Dataset has {} data points with {} numerical variables each.".format(*unique_loans_features[numerical_features].shape)
print(unique_loans_features[numerical_features].isnull().sum())


Dataset has 42535 data points with 15 numerical variables each.
acc_now_delinq            29
annual_inc                 4
delinq_2yrs               29
int_rate                   0
dti                        0
fico_range_low             0
inq_last_6mths            29
last_fico_range_low        0
loan_amnt                  0
open_acc                  29
pub_rec                   29
pub_rec_bankruptcies    1365
revol_bal                  0
revol_util                90
total_acc                 29
dtype: int64

As all the features above are sensitive information related to an individual's finances, imputing the mean or median might not be right; it is better to impute 0 for NA values than to use mean, median, or mode strategies. Let's impute missing values as 0


In [22]:
unique_loans_features[numerical_features] = unique_loans_features[numerical_features].fillna(0)

2. Handling missing categorical features

Let's identify and treat the missing categorical feature values


In [23]:
print "Dataset has {} data points with {} categorical variables each.".format(*unique_loans_features[categorical_features].shape)
print(unique_loans_features[categorical_features].isnull().sum())


Dataset has 42535 data points with 7 categorical variables each.
emp_length             0
grade                  0
home_ownership         0
purpose                0
sub_grade              0
term                   0
verification_status    0
dtype: int64

No missing values in categorical features

Normalizing Numerical Features

Normalization ensures that each feature is treated equally when applying supervised learners.


In [24]:
from sklearn.preprocessing import StandardScaler


# Initialize a scaler, then apply it to the features
scaler = StandardScaler()
features_raw[numerical_features] = scaler.fit_transform(unique_loans_features[numerical_features])

# Show an example of a record with scaling applied
display(features_raw.head(n = 10))

#print(features_raw.isnull().sum())


loan_amnt term int_rate grade sub_grade emp_length home_ownership annual_inc verification_status purpose ... fico_range_low inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc last_fico_range_low acc_now_delinq pub_rec_bankruptcies
0 -0.821731 36 months -0.408592 B B2 10+ years RENT -0.704100 Verified credit_card ... 0.606484 -0.052834 -1.407944 -0.236602 -0.029515 1.220348 -1.129812 0.495148 -0.009698 -0.213007
1 -1.159074 60 months 0.837399 C C4 1 year RENT -0.610491 Source Verified car ... 0.744651 2.566378 -1.407944 -0.236602 -0.572748 -1.393671 -1.560731 -5.666625 -0.009698 -0.213007
2 -1.172567 36 months 1.023488 C C5 10+ years RENT -0.887387 Not Verified small_business ... 0.606484 0.601969 -1.630102 -0.236602 -0.515113 1.741041 -1.043628 0.495148 -0.009698 -0.213007
3 -0.147044 36 months 0.357342 C C1 10+ years RENT -0.310940 Source Verified other ... -0.637022 -0.052834 0.147162 -0.236602 -0.395122 -0.985560 1.283336 -0.636606 -0.009698 -0.213007
4 -1.091605 60 months 0.141586 B B5 1 year RENT 0.169588 Source Verified other ... -0.498854 -0.707637 1.257952 -0.236602 0.612455 0.171926 1.369520 0.117896 -0.009698 -0.213007
5 -0.821731 36 months -1.150253 A A4 3 years RENT -0.516881 Source Verified wedding ... 0.468317 1.256772 -0.074996 -0.236602 -0.287710 -0.728732 -0.871260 -0.971941 -0.009698 -0.213007
6 -0.551856 60 months 1.023488 C C5 8 years RENT -0.345201 Not Verified debt_consolidation ... -0.637022 -0.052834 -0.519312 -0.236602 0.155696 1.287194 -0.957444 -0.217438 -0.009698 -0.213007
7 -1.091605 36 months 1.746271 E E1 9 years RENT -0.329662 Source Verified car ... -1.466025 0.601969 -1.185786 -0.236602 -0.275993 1.354040 -1.560731 0.075979 -0.009698 -0.213007
8 -0.740768 60 months 2.458266 F F2 4 years OWN -0.454475 Source Verified small_business ... -1.051523 0.601969 0.369320 -0.236602 -0.412743 -0.577449 -0.785076 -5.666625 -0.009698 -0.213007
9 -0.771129 60 months 0.141586 B B5 1 year RENT -0.844514 Verified other ... 0.330150 -0.707637 -1.630102 -0.236602 -0.227942 -0.440240 -1.646915 -1.474943 -0.009698 -0.213007

10 rows × 22 columns

One-Hot Encoding Categorical Features

One-hot encoding creates a "dummy" variable for each possible category of each non-numeric feature.


In [25]:
features= pd.concat([pd.get_dummies(features_raw[categorical_features], prefix_sep='_'),\
                     features_raw[numerical_features]], axis=1)

In [26]:
# Print the number of features after one-hot encoding
encoded = list(features.columns)
print "{} total features after one-hot encoding.".format(len(encoded))

print encoded
display(features.head(n = 1))


93 total features after one-hot encoding.
[u'emp_length_ 1 year', u'emp_length_1 year', u'emp_length_10+ years', u'emp_length_2 years', u'emp_length_3 years', u'emp_length_4 years', u'emp_length_5 years', u'emp_length_6 years', u'emp_length_7 years', u'emp_length_8 years', u'emp_length_9 years', u'emp_length_n/a', u'grade_A', u'grade_B', u'grade_C', u'grade_D', u'grade_E', u'grade_F', u'grade_G', u'home_ownership_MORTGAGE', u'home_ownership_NONE', u'home_ownership_OTHER', u'home_ownership_OWN', u'home_ownership_RENT', u'purpose_car', u'purpose_credit_card', u'purpose_debt_consolidation', u'purpose_educational', u'purpose_home_improvement', u'purpose_house', u'purpose_major_purchase', u'purpose_medical', u'purpose_moving', u'purpose_other', u'purpose_renewable_energy', u'purpose_small_business', u'purpose_vacation', u'purpose_wedding', u'sub_grade_A1', u'sub_grade_A2', u'sub_grade_A3', u'sub_grade_A4', u'sub_grade_A5', u'sub_grade_B1', u'sub_grade_B2', u'sub_grade_B3', u'sub_grade_B4', u'sub_grade_B5', u'sub_grade_C1', u'sub_grade_C2', u'sub_grade_C3', u'sub_grade_C4', u'sub_grade_C5', u'sub_grade_D1', u'sub_grade_D2', u'sub_grade_D3', u'sub_grade_D4', u'sub_grade_D5', u'sub_grade_E1', u'sub_grade_E2', u'sub_grade_E3', u'sub_grade_E4', u'sub_grade_E5', u'sub_grade_F1', u'sub_grade_F2', u'sub_grade_F3', u'sub_grade_F4', u'sub_grade_F5', u'sub_grade_G1', u'sub_grade_G2', u'sub_grade_G3', u'sub_grade_G4', u'sub_grade_G5', u'term_ 36 months', u'term_ 60 months', u'verification_status_Not Verified', u'verification_status_Source Verified', u'verification_status_Verified', u'acc_now_delinq', u'annual_inc', u'delinq_2yrs', u'int_rate', u'dti', u'fico_range_low', u'inq_last_6mths', u'last_fico_range_low', u'loan_amnt', u'open_acc', u'pub_rec', u'pub_rec_bankruptcies', u'revol_bal', u'revol_util', u'total_acc']
emp_length_ 1 year emp_length_1 year emp_length_10+ years emp_length_2 years emp_length_3 years emp_length_4 years emp_length_5 years emp_length_6 years emp_length_7 years emp_length_8 years ... fico_range_low inq_last_6mths last_fico_range_low loan_amnt open_acc pub_rec pub_rec_bankruptcies revol_bal revol_util total_acc
0 0 0 1 0 0 0 0 0 0 0 ... 0.606484 -0.052834 0.495148 -0.821731 -1.407944 -0.236602 -0.213007 -0.029515 1.220348 -1.129812

1 rows × 93 columns

Shuffle and Split Data

Now all categorical variables have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.


In [50]:
# Import train_test_split
from sklearn.cross_validation import train_test_split

# Split the 'features' and 'loan_paid_status' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, loan_paid_status, test_size = 0.2, random_state = 0,\
                                                    stratify= loan_paid_status)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])
display(X_test.head(n = 1))
display(y_test.head(n = 1))


Training set has 34028 samples.
Testing set has 8507 samples.
emp_length_ 1 year emp_length_1 year emp_length_10+ years emp_length_2 years emp_length_3 years emp_length_4 years emp_length_5 years emp_length_6 years emp_length_7 years emp_length_8 years ... fico_range_low inq_last_6mths last_fico_range_low loan_amnt open_acc pub_rec pub_rec_bankruptcies revol_bal revol_util total_acc
20975 0 0 1 0 0 0 0 0 0 0 ... -0.22252 -0.052834 0.159813 -0.281981 -1.407944 -0.236602 -0.213007 -0.294841 0.129708 -0.612709

1 rows × 93 columns

20975    0
Name: loan_status, dtype: int64

Strategies to deal with imbalanced datasets

Classification problems in most real-world applications have imbalanced data sets. In other words, the positive examples (minority class) are far fewer than the negative examples (majority class). We can see that in spam detection, ad clicks, loan approvals, etc. In our example, the positive examples (people who charged off) were only ~18% of the total examples. Therefore, accuracy is no longer a good measure of performance for different models, because if we simply predict all examples to belong to the negative class, we achieve ~82% accuracy. Better metrics for imbalanced data sets are AUC (area under the ROC curve) and F1-score. However, that's not enough, because class imbalance influences a learning algorithm during training by making the decision rule biased towards the majority class: the algorithm implicitly learns a model that optimizes its predictions based on the majority class in the dataset. As a result, we'll explore different methods to overcome the class imbalance problem.
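As a quick illustration of why accuracy misleads here (a standalone sketch on synthetic labels with the same ~18% positive rate, not the actual LendingClub data), an "always predict the majority class" rule scores ~82% accuracy while recalling none of the positives:

```python
# Synthetic labels: 18% positive (charged off), 82% negative (fully paid)
labels = [1] * 18 + [0] * 82

# A "classifier" that always predicts the majority (negative) class
always_negative = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(always_negative, labels)) / float(len(labels))
recall_pos = sum(1 for p, y in zip(always_negative, labels) if y == 1 and p == 1) / 18.0

print(accuracy)    # 0.82 -- looks strong...
print(recall_pos)  # 0.0  -- ...yet not a single charged-off loan is caught
```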

Under-Sample: Under-sample the majority class, with or without replacement, to make the number of positive and negative examples equal. One of the drawbacks of under-sampling is that it ignores a good portion of the training data that holds valuable information. In our example, we would lose around 27,000+ examples. However, it's very fast to train.

Over-Sample: Over-sample the minority class, with or without replacement, to make the number of positive and negative examples equal. We'll add around 27,000+ samples to the training data set with this strategy. It's a lot more computationally expensive than under-sampling. Also, it's more prone to overfitting due to repeated examples.

EasyEnsemble: Sample several subsets from the majority class, build a classifier on top of each sampled subset, and combine the outputs of all classifiers. More details can be found in the imbalanced-learn (imblearn) documentation.

Synthetic Minority Oversampling Technique (SMOTE): Over-samples the minority class using synthesized examples. It operates on the feature space, not the data space.
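The random under-sampling strategy above can be sketched in a few lines of plain Python (a toy stand-in for imblearn's RandomUnderSampler; the function name and toy data here are hypothetical, not part of the notebook):

```python
import random

def random_under_sample(X, y, seed=0):
    """Drop majority-class rows (without replacement) until classes balance.
    A minimal sketch of what imblearn's RandomUnderSampler does."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Keep every minority row plus an equal-sized random subset of the majority
    kept = minority + rng.sample(majority, len(minority))
    return [X[i] for i in kept], [y[i] for i in kept]

X = [[v] for v in range(10)]
y = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # 2 positives, 8 negatives
X_res, y_res = random_under_sample(X, y)
print(sum(y_res), len(y_res))        # 2 4 -> classes are now balanced
```

Over-sampling is the mirror image: sample (with replacement) from the minority indices until both classes have the majority-class count.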


In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline as imb_make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedBaggingClassifier

# Build random forest classifier (same config)
rf_clf = RandomForestClassifier(criterion="entropy", verbose=False, class_weight="balanced",random_state=10)
# Build model with no sampling
pip_orig = make_pipeline(rf_clf)
scores = cross_val_score(pip_orig, X_train, y_train, scoring="roc_auc", cv=10)
print("Original model's average AUC: {}".format(scores.mean()))
# Build model with undersampling
pip_undersample = imb_make_pipeline(RandomUnderSampler(), rf_clf)
scores = cross_val_score(pip_undersample, X_train, y_train, scoring="roc_auc", cv=10)
print("Under-sampled model's average AUC: {}".format(scores.mean()))
# Build model with oversampling
pip_oversample = imb_make_pipeline(RandomOverSampler(), rf_clf)
scores = cross_val_score(pip_oversample, X_train, y_train, scoring="roc_auc", cv=10)
print("Over-sampled model's average AUC: {}".format(scores.mean()))
# Build model with EasyEnsemble
resampled_rf = BalancedBaggingClassifier(base_estimator=rf_clf, n_estimators=10, random_state=10)
pip_resampled = make_pipeline(resampled_rf)
scores = cross_val_score(pip_resampled, X_train, y_train, scoring="roc_auc", cv=10)
print("EasyEnsemble model's average AUC: {}".format(scores.mean()))
# Build model with SMOTE
pip_smote = imb_make_pipeline(SMOTE(), rf_clf)
scores = cross_val_score(pip_smote, X_train, y_train, scoring="roc_auc", cv=10)
print("SMOTE model's average AUC: {}".format(scores.mean()))


Original model's average AUC: 0.825141686061
Under-sampled model's average AUC: 0.841471980485
Over-sampled model's average AUC: 0.840353095013
EasyEnsemble model's average AUC: 0.872006664839
SMOTE model's average AUC: 0.832709723122

The EasyEnsemble method has the highest 10-fold CV average AUC (0.8720), so we use the EasyEnsemble technique to handle the imbalanced class problem.

Evaluating Model Performance

In this section, we will investigate four different algorithms, and determine which is best at modeling the data.

Metrics and the Naive Predictor

Investor ML is particularly interested in predicting who will fully pay back their loan. It would seem that using accuracy as a metric for evaluating a particular model's performance would be appropriate. Additionally, misidentifying someone who does not fully pay back a loan would be detrimental to Investor ML, since investors are looking to invest in individuals who will pay back their loans. Therefore, a model's ability to precisely predict those who fully pay back is more important than the model's ability to recall those individuals. We can use the F-beta score as a metric that considers both precision and recall: $$ F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\left( \beta^2 \cdot precision \right) + recall} $$ In particular, when $\beta = 0.5$, more emphasis is placed on precision. This is called the F$_{0.5}$ score (or F-score for simplicity). Looking at the distribution of classes (those who charged off, and those who fully paid), it's clear most individuals fully pay back their loans. This can greatly affect accuracy, since we could simply say "this person will pay back the loan" and generally be right, without ever looking at the data! Making such a statement would be called naive, since we have not considered any information to substantiate the claim. Also, since we are trying to predict a "Risk Rate %", we can't simply declare that a loan gets charged off. To define a naive predictor, we will use a Gaussian NB model to help establish a benchmark for whether a model is performing well.
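Plugging hypothetical precision/recall values into the formula above shows how beta shifts the emphasis (a standalone sketch of the formula itself, separate from the sklearn `fbeta_score` call used in this notebook):

```python
def fbeta(precision, recall, beta):
    """F-beta from the formula above; beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A made-up high-precision, low-recall classifier
p, r = 0.9, 0.3
print(round(fbeta(p, r, beta=0.5), 4))  # 0.6429 -> rewarded for precision
print(round(fbeta(p, r, beta=1.0), 4))  # 0.45   -> plain F1 (harmonic mean)
print(round(fbeta(p, r, beta=2.0), 4))  # 0.3462 -> dominated by recall
```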

Naive Predictor Performance

Let's define the naive predictor and check its performance.


In [29]:
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
from sklearn.naive_bayes import GaussianNB

cols = ['model','matthews_corrcoef', 'roc_auc_score', 'precision_score', 'recall_score','f1_score','accuracy',\
       'fscore']
models_report = pd.DataFrame(columns = cols)

# Use a Gaussian Naive Bayes model as the naive benchmark predictor
clf_GNB = GaussianNB()

clf_GNB.fit(X_train,y_train)

naive_prediction = clf_GNB.predict(X_test)
y_score = clf_GNB.predict_proba(X_test)[:,1]

# Accuracy, F-score using sklearn
accuracy = accuracy_score(y_test, naive_prediction)
fscore = fbeta_score(y_test, naive_prediction, beta = 0.5, pos_label=1)

tmp = pd.Series({'model': 'Naive Predictor',\
                 'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),\
                 'matthews_corrcoef': metrics.matthews_corrcoef(y_test, naive_prediction),\
                 'precision_score': metrics.precision_score(y_test, naive_prediction),\
                 'recall_score': metrics.recall_score(y_test, naive_prediction),\
                 'f1_score': metrics.f1_score(y_test, naive_prediction),\
                 'accuracy': accuracy_score(y_test, naive_prediction),\
                 'fscore' : fbeta_score(y_test, naive_prediction, beta = 0.5, pos_label=1)})
models_report = models_report.append(tmp, ignore_index = True)

# Print the results 
print "Naive Predictor per sklearn: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore)
models_report


Naive Predictor per sklearn: [Accuracy score: 0.5571, F-score: 0.2567]
Out[29]:
model matthews_corrcoef roc_auc_score precision_score recall_score f1_score accuracy fscore
0 Naive Predictor 0.201287 0.696211 0.220243 0.75972 0.341489 0.557071 0.2567

Creating a Training and Predicting Pipeline

Let's create a training and predicting pipeline that allows us to quickly and effectively train models using various sizes of training data and perform predictions on the testing data.


In [30]:
from sklearn.metrics import fbeta_score, accuracy_score

def train_predict(classifier_name, learner, sample_size, X_train, y_train, X_test, y_test, models_report, meta_learner=None): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: labels for the training set
       - X_test: features testing set
       - y_test: labels for the testing set
    '''
    
    results = {}
    
    #  Fit the learner to the training data using slicing with 'sample_size'
    start = time() # Get start time
    if meta_learner and classifier_name=='stackedEnsembleClassifier':
        # Use CV to generate meta-features
        meta_features = cross_val_predict(learner, X_train[:sample_size],y_train[:sample_size], cv=10, method="transform")
        # Refit the first stack on the full training set 
        learner.fit(X_train[:sample_size],y_train[:sample_size])
        # Fit the meta learner
        second_stack = meta_learner.fit(meta_features, y_train[:sample_size])
    else:
        learner.fit(X_train[:sample_size],y_train[:sample_size])
    end = time() # Get end time
    
    # Calculate the training time
    results['train_time'] = end-start
        
    #  Get the predictions on the test set,
    #       then get predictions on the first 300 training samples
    start = time() # Get start time
    if meta_learner and classifier_name=='stackedEnsembleClassifier':
        predictions_test = second_stack.predict(learner.transform(X_test))
        predictions_train = second_stack.predict(learner.transform(X_train[:300]))
    else:
        predictions_test = learner.predict(X_test)
        predictions_train = learner.predict(X_train[:300])
        
    end = time() # Get end time
    
    #  Calculate the total prediction time
    results['pred_time'] = end-start
            
    # Compute accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
        
    #  Compute accuracy on test set
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    
    #  Compute F-score on the first 300 training samples
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta = 0.5)
        
    #  Compute F-score on the test set
    results['f_test'] = fbeta_score(y_test, predictions_test, beta = 0.5)
    
    if meta_learner and classifier_name=='stackedEnsembleClassifier':
        results['meta-learner'] = second_stack
        results['learner'] = learner
    else:
        results['learner'] = learner
    # Success
    print "{} trained on {} samples.".format(classifier_name, sample_size)
    
    if classifier_name=='stackedEnsembleClassifier':
        y_score = second_stack.predict_proba(learner.transform(X_test))[:, 1:] 
    else:
        y_score = learner.predict_proba(X_test)[:,1]
    
    tmp = pd.Series({'model': classifier_name+str(sample_size),\
                 'roc_auc_score' : metrics.roc_auc_score(y_test, y_score),\
                 'matthews_corrcoef': metrics.matthews_corrcoef(y_test, predictions_test),\
                 'precision_score': metrics.precision_score(y_test, predictions_test),\
                 'recall_score': metrics.recall_score(y_test, predictions_test),\
                 'f1_score': metrics.f1_score(y_test, predictions_test),\
                 'accuracy': accuracy_score(y_test, predictions_test),\
                 'fscore' : fbeta_score(y_test, predictions_test, beta = 0.5, pos_label=1)})
    
    models_report = models_report.append(tmp, ignore_index = True)
        
    # Return the results
    return results, models_report

Let's evaluate the following models:

  • LogisticRegression
  • GradientBoostingClassifier
  • XGBoostClassifier

In [31]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Initialize the four models
clf_A = LogisticRegression(random_state=10)
# Build model with EasyEnsemble
resampled_LR_A = BalancedBaggingClassifier(base_estimator=clf_A, n_estimators=10, random_state=10)
clf_B = GradientBoostingClassifier(random_state=10)
resampled_GBC_B = BalancedBaggingClassifier(base_estimator=clf_B, n_estimators=10, random_state=10)
clf_C = XGBClassifier(objective='binary:logistic', random_state=10)
resampled_XGB_C = BalancedBaggingClassifier(base_estimator=clf_C, n_estimators=10, random_state=10)


# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = (X_train.shape[0]*1/100)
samples_10 = (X_train.shape[0]*10/100)
samples_100 = (X_train.shape[0]*100/100)

# Collect results on the learners
classifiers = {'LogisticRegression':resampled_LR_A, 'GradientBoostingClassifier':resampled_GBC_B,\
              'XGBClassifier':resampled_XGB_C}
#[resampled_LR_A, resampled_GBC_B, resampled_XGB_C, resampled_RFC_D]
results = {}
for clf_name, clf in classifiers.items():
    #clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i], models_report = \
        train_predict(clf_name, clf, samples, X_train, y_train, X_test, y_test, models_report)

# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results, accuracy, fscore)
models_report


XGBClassifier trained on 340 samples.
XGBClassifier trained on 3402 samples.
XGBClassifier trained on 34028 samples.
LogisticRegression trained on 340 samples.
LogisticRegression trained on 3402 samples.
LogisticRegression trained on 34028 samples.
GradientBoostingClassifier trained on 340 samples.
GradientBoostingClassifier trained on 3402 samples.
GradientBoostingClassifier trained on 34028 samples.
Out[31]:
model matthews_corrcoef roc_auc_score precision_score recall_score f1_score accuracy fscore
0 Naive Predictor 0.201287 0.696211 0.220243 0.759720 0.341489 0.557071 0.256700
1 XGBClassifier340 0.441011 0.858439 0.367037 0.846812 0.512109 0.756083 0.413943
2 XGBClassifier3402 0.450757 0.874160 0.370445 0.861586 0.518120 0.757729 0.418113
3 XGBClassifier34028 0.462447 0.884929 0.379299 0.866252 0.527587 0.765487 0.427344
4 LogisticRegression340 0.376855 0.824225 0.360721 0.699844 0.476065 0.767133 0.399432
5 LogisticRegression3402 0.420559 0.848994 0.376618 0.769051 0.505624 0.772658 0.419423
6 LogisticRegression34028 0.459866 0.870452 0.405653 0.792379 0.536598 0.793112 0.449532
7 GradientBoostingClassifier340 0.418919 0.843553 0.340653 0.868585 0.489376 0.725990 0.387793
8 GradientBoostingClassifier3402 0.446944 0.871031 0.370395 0.852255 0.516372 0.758669 0.417619
9 GradientBoostingClassifier34028 0.464679 0.884464 0.380757 0.867807 0.529286 0.766663 0.428901

We can see from the above charts and data that, with EasyEnsemble sampling, LogisticRegression performs better than XGBClassifier and GradientBoostingClassifier when we consider running time, performance at 100% training size, variance (training and test accuracy scores are close), and F-score.

Improving Results

Let's perform a grid search optimization for the model over the entire training set (X_train and y_train).

Tuning for Logistic Regression: among the solvers ['newton-cg', 'lbfgs', 'liblinear', 'sag'], 'sag' was found as the tuned parameter; among 'C' values [0.1, 1.0, 1.5], 1.5 was found; and n_estimators over range(20,81,10) was tuned to 10.


In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Initialize the classifier
clf = LogisticRegression(random_state=10)
tune_clf = BalancedBaggingClassifier(base_estimator=clf, random_state=10)

# Create the parameters list to tune
parameters = {'base_estimator__fit_intercept': [True],\
              'base_estimator__C': [1.5],\
              'base_estimator__penalty':['l1'],\
              'n_estimators':[10]}


# Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score, beta=0.5)

# Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(tune_clf, parameters, verbose=True, cv=10, scoring=scorer)

#  Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train, y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Build model with EasyEnsemble
#best_clf= BalancedBaggingClassifier(base_estimator=best_model, n_estimators=10, random_state=10)

# Make predictions using the unoptimized and model
predictions = (tune_clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)


Fitting 10 folds for each of 1 candidates, totalling 10 fits
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   20.2s finished

In [33]:
y_best_score = best_clf.predict_proba(X_test)[:,1]
tmp = pd.Series({'model': 'GridSearchTunedLR',\
                 'roc_auc_score' : metrics.roc_auc_score(y_test, y_best_score),\
                 'matthews_corrcoef': metrics.matthews_corrcoef(y_test, best_predictions),\
                 'precision_score': metrics.precision_score(y_test, best_predictions),\
                 'recall_score': metrics.recall_score(y_test, best_predictions),\
                 'f1_score': metrics.f1_score(y_test, best_predictions),\
                 'accuracy': accuracy_score(y_test, best_predictions),\
                 'fscore' : fbeta_score(y_test, best_predictions, beta = 0.5, pos_label=1)})
    
models_report = models_report.append(tmp, ignore_index = True)

# Report the before-and-after scores
print "Unoptimized model\n------"
print "Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5))
print "\nOptimized Model\n------"
print "Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions))
print "Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5))

# show best parameters
print "\nBest Classifier\n------"
print best_clf
models_report


Unoptimized model
------
Accuracy score on testing data: 0.7931
F-score on testing data: 0.4495

Optimized Model
------
Final accuracy score on the testing data: 0.7936
Final F-score on the testing data: 0.4507

Best Classifier
------
BalancedBaggingClassifier(base_estimator=LogisticRegression(C=1.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=10, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
             bootstrap=True, bootstrap_features=False, max_features=1.0,
             max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
             random_state=10, ratio='auto', replacement=False, verbose=0,
             warm_start=False)
Out[33]:
model matthews_corrcoef roc_auc_score precision_score recall_score f1_score accuracy fscore
0 Naive Predictor 0.201287 0.696211 0.220243 0.759720 0.341489 0.557071 0.256700
1 XGBClassifier340 0.441011 0.858439 0.367037 0.846812 0.512109 0.756083 0.413943
2 XGBClassifier3402 0.450757 0.874160 0.370445 0.861586 0.518120 0.757729 0.418113
3 XGBClassifier34028 0.462447 0.884929 0.379299 0.866252 0.527587 0.765487 0.427344
4 LogisticRegression340 0.376855 0.824225 0.360721 0.699844 0.476065 0.767133 0.399432
5 LogisticRegression3402 0.420559 0.848994 0.376618 0.769051 0.505624 0.772658 0.419423
6 LogisticRegression34028 0.459866 0.870452 0.405653 0.792379 0.536598 0.793112 0.449532
7 GradientBoostingClassifier340 0.418919 0.843553 0.340653 0.868585 0.489376 0.725990 0.387793
8 GradientBoostingClassifier3402 0.446944 0.871031 0.370395 0.852255 0.516372 0.758669 0.417619
9 GradientBoostingClassifier34028 0.464679 0.884464 0.380757 0.867807 0.529286 0.766663 0.428901
10 GridSearchTunedLR 0.462095 0.870490 0.406598 0.795490 0.538138 0.793582 0.450661

Feature Importance

An important task when performing supervised learning on a dataset is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do. In the case of this project, that means we wish to identify a small number of features that most strongly predict whether an individual fully pays back a loan.

Feature Selection

How does a model perform if we only use a subset of all the available features in the data? With fewer features required to train, the expectation is that training and prediction times are much lower, at the cost of performance metrics.


In [34]:
# Import supplementary visualization code visuals.py
import visuals as vs
import matplotlib.pyplot as plt
# Pretty display for notebooks
%matplotlib inline



from sklearn.ensemble import RandomForestClassifier
#  Train the supervised model on the training set 
model = RandomForestClassifier(random_state=10)
model.fit(X_train, y_train)
#  Extract the feature importances
importances = model.feature_importances_

# Plot
vs.feature_plot( importances, X_train, y_train)

# show most importance features
a = np.array(importances)
factors = pd.DataFrame(data = np.array([importances.astype(float), features.columns]).T,
                       columns = ['importances', 'features'])
factors = factors.sort_values('importances', ascending=False)

print "\n top 20 important features"
display(factors[:20])


 top 20 important features
importances features
85 0.246455 last_fico_range_low
79 0.0572329 annual_inc
91 0.0544118 revol_util
82 0.0542158 dti
81 0.0523224 int_rate
90 0.0520493 revol_bal
86 0.0496855 loan_amnt
92 0.0466982 total_acc
83 0.042266 fico_range_low
87 0.0408149 open_acc
84 0.0281177 inq_last_6mths
73 0.00973854 term_ 36 months
75 0.0094579 verification_status_Not Verified
26 0.00941782 purpose_debt_consolidation
80 0.00903726 delinq_2yrs
74 0.00870967 term_ 60 months
19 0.00869922 home_ownership_MORTGAGE
2 0.00810152 emp_length_10+ years
77 0.00795508 verification_status_Verified
23 0.00786212 home_ownership_RENT

In [35]:
# Import functionality for cloning a model
from sklearn.base import clone

# Reduce the feature space
X_train_reduced_20 = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:20]]]
X_test_reduced_20 = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:20]]]

# Train on the "best" model found from grid search earlier
start = time()
full_clf = (clone(best_clf)).fit(X_train, y_train)
end = time()
train_time_full = end - start

start = time()
reduced_20_clf = (clone(best_clf)).fit(X_train_reduced_20, y_train)
end = time()
train_time_reduced_20 = end - start


# Make new predictions
full_predictions = full_clf.predict(X_test)
reduced_20_predictions = reduced_20_clf.predict(X_test_reduced_20)

# Report scores from the final model using both versions of data
print "Final Model trained on full data\n------"
print "Final Model Train time {} s full data\n------".format(train_time_full)
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, full_predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, full_predictions, beta = 0.5))
print "\nFinal Model trained on reduced data (top 20 features)\n------"
print "Final Model Train time {} s reduced data\n------".format(train_time_reduced_20)
print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_20_predictions))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_20_predictions, beta = 0.5))


Final Model trained on full data
------
Final Model Train time 2.2059469223 s full data
------
Accuracy on testing data: 0.7936
F-score on testing data: 0.4507

Final Model trained on reduced data (top 20 features)
------
Final Model Train time 1.24439692497 s reduced data
------
Accuracy on testing data: 0.7932
F-score on testing data: 0.4489

Feature selection using only the top 20 features wasn't beneficial in our case, so we will use the tuned best classifier with the full feature set instead of the top 20 features.


Model Ensemble

We’ll build ensemble models using four different models as base learners:

  • LogisticRegression
  • GradientBoostingClassifier
  • XGBoostClassifier
  • RandomForestClassifier

The ensemble models will be built using two different methods:

Blending (average) ensemble model: Fit the base learners to the training data and then, at test time, average the predictions generated by all the base learners. We use VotingClassifier from sklearn, which fits all the base learners on the training data and, at test time, uses all base learners to predict the test data and then takes the average of all predictions.
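In miniature, soft voting is just a per-sample average of the base learners' predicted class probabilities (a toy sketch with made-up probabilities, not the fitted models below):

```python
def soft_vote(probas):
    """Average class-1 probabilities across base learners
    (what VotingClassifier(voting="soft") does, in miniature)."""
    n = len(probas)
    return [sum(column) / float(n) for column in zip(*probas)]

# Hypothetical per-sample P(charged off) from three base learners
p_xgb = [0.20, 0.80, 0.55]
p_lr  = [0.10, 0.90, 0.45]
p_gbc = [0.30, 0.70, 0.65]

avg = soft_vote([p_xgb, p_lr, p_gbc])
print([round(p, 2) for p in avg])    # [0.2, 0.8, 0.55]
print([int(p >= 0.5) for p in avg])  # [0, 1, 1] -> final blended predictions
```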

Stacked ensemble model: Fit the base learners to the training data. Next, use those trained base learners to generate predictions (meta-features) that are consumed by the meta-learner (assuming we have only one layer of base learners).

There are a few different ways of training a stacked ensemble model:

Fitting the base learners to all the training data and then generating predictions on the same training data used to fit those learners. This method is more prone to overfitting because the meta-learner will give more weight to the base learners that memorized the training data best, i.e. the meta-learner won't generalize well and would overfit.

Splitting the training data into 2 to 3 different parts used for training, validation, and generating predictions. This is a suboptimal method because held-out sets usually have higher variance, different splits give different results, and the learning algorithms have less data to train on.

Using k-fold cross validation, where we split the data into k folds. We fit the base learners to (k - 1) folds and use the fitted models to generate predictions for the held-out fold. We repeat the process until we have generated predictions for all k folds. When done, we refit the base learners on the full training data. This method is more reliable and gives less weight to models that memorize the data; therefore, it generalizes better on future data.
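The k-fold scheme can be sketched in plain Python: every training sample receives a meta-feature predicted by a model that never saw that sample during fitting (a minimal stand-in for cross_val_predict; the function names and the constant-rate toy learner are hypothetical, for illustration only):

```python
def out_of_fold_predictions(X, y, fit, k=5):
    """Meta-features via k-fold CV: each sample is predicted by a model
    fitted on the other k-1 folds (a minimal cross_val_predict sketch)."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    meta = [None] * n
    for held_out in folds:
        train = [i for i in range(n) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            meta[i] = model(X[i])
    return meta

# Toy "learner": predicts the training-set positive rate for every sample
def mean_learner(X_tr, y_tr):
    rate = sum(y_tr) / float(len(y_tr))
    return lambda x: rate

y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
meta = out_of_fold_predictions(list(range(10)), y, mean_learner, k=5)
print(len(meta))  # 10 -- one out-of-fold prediction per training sample
```

After generating `meta`, the base learners are refit on all the training data and the meta-learner is fit on `meta` versus `y`, mirroring the notebook's `cross_val_predict` / `first_stack.fit` / `logreg_clf.fit` sequence below.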

We'll use logistic regression as the meta-learner for the stacked model. Note that we can use k-fold cross validation to validate and tune the hyperparameters of the meta-learner. We will not tune the hyperparameters of any of the base learners or the meta-learner; however, we will use some of the values recommended by the paper "Data-driven advice for applying machine learning to bioinformatics problems" and by https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/.


In [36]:
from sklearn.ensemble import VotingClassifier
# Define base learners
xgb_clf = XGBClassifier(objective="binary:logistic", learning_rate=0.03, n_estimators=500,\
                            max_depth=1, subsample=0.4, random_state=10)

balanced_xgb_clf = BalancedBaggingClassifier(base_estimator=xgb_clf, n_estimators=10, random_state=10)

lr_clf = LogisticRegression(C=1.5, fit_intercept=True, penalty='l1', random_state=10)

balanced_lr_clf = BalancedBaggingClassifier(base_estimator=lr_clf, n_estimators=10, random_state=10)

gb_clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, max_depth=3, max_features='log2', \
                                    n_estimators=500, random_state=10)

balanced_gb_clf = BalancedBaggingClassifier(base_estimator=gb_clf, n_estimators=10, random_state=10)

rf_clf = RandomForestClassifier(n_estimators=300, max_features="sqrt", criterion="gini", min_samples_leaf=5,\
                                class_weight="balanced", random_state=10)

balanced_rf_clf = BalancedBaggingClassifier(base_estimator=rf_clf, n_estimators=10, random_state=10)

# Define meta-learner
logreg_clf = LogisticRegression(penalty="l2", C=100, fit_intercept=True)

# Fitting voting clf --> average ensemble
voting_clf = VotingClassifier([("xgb", balanced_xgb_clf), ("lr", balanced_lr_clf), ("gbc", gb_clf),\
                               ("rf", rf_clf)], voting="soft", flatten_transform=True) 
voting_clf.fit(X_train, y_train)

xgb_model, lr_model, gbc_model, rf_model = voting_clf.estimators_

models = {"xgb": xgb_model, "lr": lr_model, "gbc": gbc_model, "rf": rf_model, "avg_ensemble": voting_clf}

In [37]:
from sklearn.model_selection import cross_val_predict
# Build first stack of base learners
first_stack = make_pipeline(voting_clf)

# Use CV to generate meta-features
meta_features = cross_val_predict(first_stack, X_train, y_train, cv=10, method="transform")

# Refit the first stack on the full training set 
first_stack.fit(X_train, y_train)

# Fit the meta learner
second_stack = logreg_clf.fit(meta_features, y_train)

In [38]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

# Plot ROC and PR curves using all models and test data
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for name, model in models.items():
    model_probs = model.predict_proba(X_test)[:, 1:]
    model_auc_score = roc_auc_score(y_test, model_probs)
    fpr, tpr, _ = roc_curve(y_test, model_probs)
    precision, recall, _ = precision_recall_curve(y_test, model_probs) 
    axes[0].plot(fpr, tpr, label="{}, auc = {}".format(name, model_auc_score)) 
    axes[1].plot(recall, precision, label="{}".format(name))

stacked_probs = second_stack.predict_proba(first_stack.transform(X_test))[:, 1:] 
stacked_auc_score = roc_auc_score(y_test, stacked_probs)
fpr, tpr, _ = roc_curve(y_test, stacked_probs)
precision, recall, _ = precision_recall_curve(y_test, stacked_probs) 
axes[0].plot(fpr, tpr, label="stacked_ensemble, auc = {}".format(stacked_auc_score))
axes[1].plot(recall, precision, label="stacked_ensemble") 
axes[0].legend(loc="lower right") 
axes[0].set_xlabel("FPR") 
axes[0].set_ylabel("TPR") 
axes[0].set_title("ROC curve") 
axes[1].legend() 
axes[1].set_xlabel("recall") 
axes[1].set_ylabel("precision") 
axes[1].set_title("PR curve") 

plt.tight_layout()


As we can see from the chart above, the stacked ensemble model has improved performance.


In [39]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report



#PERFORMANCE METRICS FOR TEST SET
print("Final model test METRICS")
y_test_pred = second_stack.predict(first_stack.transform(X_test))

print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, y_test_pred))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, y_test_pred, beta = 0.5))

accuracy = accuracy_score(y_test, y_test_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

recall = recall_score(y_test, y_test_pred)
print("Recall: %.2f%%" % (recall * 100.0))

precision = precision_score(y_test, y_test_pred)
print("Precision: %.2f%%" % (precision * 100.0))

f1 = f1_score(y_test, y_test_pred, pos_label=1)
print("F1: %.2f%%" % (f1  * 100.0))

cfm = confusion_matrix(y_test, y_test_pred)
print (cfm)

print(classification_report(y_test, y_test_pred))

tmp = pd.Series({'model': 'Stacked Ensemble Final Model',\
                 'roc_auc_score' : stacked_auc_score,\
                 'matthews_corrcoef': metrics.matthews_corrcoef(y_test, y_test_pred),\
                 'precision_score': metrics.precision_score(y_test, y_test_pred),\
                 'recall_score': metrics.recall_score(y_test, y_test_pred),\
                 'f1_score': metrics.f1_score(y_test, y_test_pred),\
                 'accuracy': accuracy_score(y_test, y_test_pred),\
                 'fscore' : fbeta_score(y_test, y_test_pred, beta = 0.5, pos_label=1)})
    
models_report = models_report.append(tmp, ignore_index = True)
models_report


Final model test METRICS
Accuracy on testing data: 0.8723
F-score on testing data: 0.5579
Accuracy: 87.23%
Recall: 39.27%
Precision: 62.35%
F1: 48.19%
[[6916  305]
 [ 781  505]]
             precision    recall  f1-score   support

          0       0.90      0.96      0.93      7221
          1       0.62      0.39      0.48      1286

avg / total       0.86      0.87      0.86      8507

Out[39]:
model matthews_corrcoef roc_auc_score precision_score recall_score f1_score accuracy fscore
0 Naive Predictor 0.201287 0.696211 0.220243 0.759720 0.341489 0.557071 0.256700
1 XGBClassifier340 0.441011 0.858439 0.367037 0.846812 0.512109 0.756083 0.413943
2 XGBClassifier3402 0.450757 0.874160 0.370445 0.861586 0.518120 0.757729 0.418113
3 XGBClassifier34028 0.462447 0.884929 0.379299 0.866252 0.527587 0.765487 0.427344
4 LogisticRegression340 0.376855 0.824225 0.360721 0.699844 0.476065 0.767133 0.399432
5 LogisticRegression3402 0.420559 0.848994 0.376618 0.769051 0.505624 0.772658 0.419423
6 LogisticRegression34028 0.459866 0.870452 0.405653 0.792379 0.536598 0.793112 0.449532
7 GradientBoostingClassifier340 0.418919 0.843553 0.340653 0.868585 0.489376 0.725990 0.387793
8 GradientBoostingClassifier3402 0.446944 0.871031 0.370395 0.852255 0.516372 0.758669 0.417619
9 GradientBoostingClassifier34028 0.464679 0.884464 0.380757 0.867807 0.529286 0.766663 0.428901
10 GridSearchTunedLR 0.462095 0.870490 0.406598 0.795490 0.538138 0.793582 0.450661
11 Stacked Ensemble Final Model 0.427706 0.881685 0.623457 0.392691 0.481870 0.872340 0.557888

Though the stacked ensemble model has better accuracy and F-score, its recall is lower. In classification problems where False Negatives are a lot more expensive than False Positives, we may want a model with high recall rather than high precision, i.e. a model that identifies most of the actual positive examples. The confusion matrix above illustrates this trade-off.

Final Model Evaluation

Let's check the best base classifier, the tuned base classifier, and the final stacked model. We compare their results for several different sets of inputs to validate robustness and obtain their outputs, and plot the ROC curves to perform a sensitivity analysis.


In [40]:
bestBaseClassifier = clone(tune_clf)
tunedClassifier = clone(full_clf)
stackedEnsembleClassifier = first_stack
# Define meta-learner
meta_learner = LogisticRegression(penalty="l2", C=100, fit_intercept=True)


# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = int(X_train.shape[0] * 1 / 100)
samples_10 = int(X_train.shape[0] * 10 / 100)
samples_100 = int(X_train.shape[0] * 100 / 100)

# Collect results on the learners
classifiers2 = {'bestBaseClassifier':bestBaseClassifier, 'tunedClassifier':tunedClassifier,\
              'stackedEnsembleClassifier':stackedEnsembleClassifier}
#[resampled_LR_A, resampled_GBC_B, resampled_XGB_C, resampled_RFC_D]
results2 = {}
for clf_name, clf in classifiers2.items():
    #clf_name = clf.__class__.__name__
    results2[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results2[clf_name][i], models_report = \
        train_predict(clf_name, clf, samples, X_train, y_train, X_test, y_test, models_report, meta_learner)


bestBaseClassifier trained on 340 samples.
bestBaseClassifier trained on 3402 samples.
bestBaseClassifier trained on 34028 samples.
tunedClassifier trained on 340 samples.
tunedClassifier trained on 3402 samples.
tunedClassifier trained on 34028 samples.
stackedEnsembleClassifier trained on 340 samples.
stackedEnsembleClassifier trained on 3402 samples.
stackedEnsembleClassifier trained on 34028 samples.

In [41]:
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results2, accuracy, fscore)
models_report


Out[41]:
model matthews_corrcoef roc_auc_score precision_score recall_score f1_score accuracy fscore
0 Naive Predictor 0.201287 0.696211 0.220243 0.759720 0.341489 0.557071 0.256700
1 XGBClassifier340 0.441011 0.858439 0.367037 0.846812 0.512109 0.756083 0.413943
2 XGBClassifier3402 0.450757 0.874160 0.370445 0.861586 0.518120 0.757729 0.418113
3 XGBClassifier34028 0.462447 0.884929 0.379299 0.866252 0.527587 0.765487 0.427344
4 LogisticRegression340 0.376855 0.824225 0.360721 0.699844 0.476065 0.767133 0.399432
5 LogisticRegression3402 0.420559 0.848994 0.376618 0.769051 0.505624 0.772658 0.419423
6 LogisticRegression34028 0.459866 0.870452 0.405653 0.792379 0.536598 0.793112 0.449532
7 GradientBoostingClassifier340 0.418919 0.843553 0.340653 0.868585 0.489376 0.725990 0.387793
8 GradientBoostingClassifier3402 0.446944 0.871031 0.370395 0.852255 0.516372 0.758669 0.417619
9 GradientBoostingClassifier34028 0.464679 0.884464 0.380757 0.867807 0.529286 0.766663 0.428901
10 GridSearchTunedLR 0.462095 0.870490 0.406598 0.795490 0.538138 0.793582 0.450661
11 Stacked Ensemble Final Model 0.427706 0.881685 0.623457 0.392691 0.481870 0.872340 0.557888
12 bestBaseClassifier340 0.376855 0.824225 0.360721 0.699844 0.476065 0.767133 0.399432
13 bestBaseClassifier3402 0.420559 0.848994 0.376618 0.769051 0.505624 0.772658 0.419423
14 bestBaseClassifier34028 0.459866 0.870452 0.405653 0.792379 0.536598 0.793112 0.449532
15 tunedClassifier340 0.424741 0.852030 0.383262 0.762053 0.510018 0.778653 0.425569
16 tunedClassifier3402 0.424213 0.852650 0.379205 0.771384 0.508457 0.774539 0.422128
17 tunedClassifier34028 0.462095 0.870490 0.406598 0.795490 0.538138 0.793582 0.450661
18 stackedEnsembleClassifier340 0.377719 0.856132 0.535752 0.390358 0.451642 0.856706 0.498609
19 stackedEnsembleClassifier3402 0.395049 0.873611 0.623395 0.339813 0.439859 0.869167 0.534230
20 stackedEnsembleClassifier34028 0.427706 0.881685 0.623457 0.392691 0.481870 0.872340 0.557888

From the chart above we can observe that although stackedEnsembleClassifier has a higher time cost for training and prediction, it has lower variance and better accuracy and F-score. We can also see that the model is robust across varied inputs.


In [42]:
# Plot ROC and PR curves using all models and test data
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
final_first_stack = None
final_second_stack= None

for k, v in results2.items():
    for k2, v2 in v.items():
        if k2==2:
            if k=='stackedEnsembleClassifier':
                final_first_stack = v2['learner']
                final_second_stack = v2['meta-learner']
                stacked_probs = final_second_stack.predict_proba(final_first_stack.transform(X_test))[:, 1:]
                stacked_auc_score = roc_auc_score(y_test, stacked_probs)
                fpr, tpr, _ = roc_curve(y_test, stacked_probs)
                precision, recall, _ = precision_recall_curve(y_test, stacked_probs) 
                axes[0].plot(fpr, tpr, label="{}, auc = {}".format(k, stacked_auc_score)) 
                axes[1].plot(recall, precision, label="{}".format(k))
                
            else:
                model_probs = v2['learner'].predict_proba(X_test)[:, 1:]
                model_auc_score = roc_auc_score(y_test, model_probs)
                fpr, tpr, _ = roc_curve(y_test, model_probs)
                precision, recall, _ = precision_recall_curve(y_test, model_probs) 
                axes[0].plot(fpr, tpr, label="{}, auc = {}".format(k, model_auc_score)) 
                axes[1].plot(recall, precision, label="{}".format(k))
axes[0].legend(loc="lower right") 
axes[0].set_xlabel("FPR") 
axes[0].set_ylabel("TPR") 
axes[0].set_title("ROC curve") 
axes[1].legend() 
axes[1].set_xlabel("recall") 
axes[1].set_ylabel("precision") 
axes[1].set_title("PR curve") 
plt.tight_layout()


The ROC plot above shows the sensitivity analysis: stackedEnsembleClassifier, with AUC 0.88168, outperforms both bestBaseClassifier and tunedClassifier.

Optimized model's accuracy and F-score on the testing data.

Results:

Metric          Benchmark Predictor   Unoptimized Model   Optimized Model   Stacked Ensemble Final Model
Accuracy Score  0.557071              0.793112            0.793582          0.87234
F-score         0.2567                0.449532            0.450661          0.55788

In [43]:
models_report.iloc[[0, 6, 10, 11]].T


Out[43]:
0 6 10 11
model Naive Predictor LogisticRegression34028 GridSearchTunedLR Stacked Ensemble Final Model
matthews_corrcoef 0.201287 0.459866 0.462095 0.427706
roc_auc_score 0.696211 0.870452 0.87049 0.881685
precision_score 0.220243 0.405653 0.406598 0.623457
recall_score 0.75972 0.792379 0.79549 0.392691
f1_score 0.341489 0.536598 0.538138 0.48187
accuracy 0.557071 0.793112 0.793582 0.87234
fscore 0.2567 0.449532 0.450661 0.557888
  • In terms of accuracy and F-score, the Stacked Ensemble Final Model is much better.
  • The Stacked Ensemble Final Model has higher accuracy and F-score than the Benchmark Naive Predictor, the unoptimized model, and the optimized model.

Additional visualization

Partial dependence plots show the most important features and their relationships with whether the borrower will most likely pay the loan in full before the maturity date. We will plot only the top 12 features to make the plots easier to read. Note that the partial plots are based on the Gradient Boosting model.


In [44]:
# Plot partial dependence plots
gb_clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, max_depth=3, max_features='log2', \
                                    n_estimators=500, random_state=10)

gb_clf.fit(X_train, y_train)


Out[44]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features='log2', max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=500,
              presort='auto', random_state=10, subsample=1.0, verbose=0,
              warm_start=False)

In [45]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
fig, axes = plot_partial_dependence(gb_clf, X_train, np.argsort(gb_clf.feature_importances_)[::-1][:12],\
                                    n_cols=4, feature_names=features.columns[:], figsize=(14, 8)) 
plt.subplots_adjust(top=0.9)
plt.suptitle("Partial dependence plots of borrower charged off\n" "the loan based on top most influential features")
for ax in axes: 
    ax.set_xticks(())


As expected, borrowers with lower annual income and lower FICO scores are highly likely to be charged off; conversely, borrowers with lower interest rates (assigned to less risky grades) and higher revol_bal are more likely to pay the loan in full.

Final Model

We will train the Stacked Ensemble Final Model on all the data, and use it to predict whether an individual's loan will be charged off.


In [47]:
# Use CV to generate meta-features
meta_features = cross_val_predict(final_first_stack, features, loan_paid_status, cv=10, method="transform")

# Refit the first stack on the full training set 
final_first_stack.fit(features, loan_paid_status)

# Fit the meta learner
final_second_stack_meta = final_second_stack.fit(meta_features, loan_paid_status)

Save the Model

We will save the model using joblib so that the saved model can be used to infer whether a new loan will be paid or not based on application data, allowing Investor ML to run in real time. The final model is fit using all the available data.


In [48]:
from sklearn.externals import joblib
filename = 'finalized_first_stack_model.sav'
filename2 = 'finalized_second_stack_model.sav'

joblib.dump(final_first_stack, filename)
joblib.dump(final_second_stack_meta, filename2)
first_stack


Out[48]:
Pipeline(memory=None,
     steps=[('votingclassifier', VotingClassifier(estimators=[('xgb', BalancedBaggingClassifier(base_estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.03, max_delta_step=0,
       max_depth=1, min_child_weight=1, missing=Non...se=0, warm_start=False))],
         flatten_transform=True, n_jobs=1, voting='soft', weights=None))])

Testing the saved model: the final model's results are on par with our best model, and the performance metrics improved because we trained on all the available data.


In [49]:
# load the model from disk
loaded_first_stack_model = joblib.load(filename)
loaded_second_stack_model = joblib.load(filename2)
Y_hat_imp_f = loaded_second_stack_model.predict(loaded_first_stack_model.transform(X_test))

print "Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, Y_hat_imp_f))
print "F-score on testing data: {:.4f}".format(fbeta_score(y_test, Y_hat_imp_f, beta = 0.5))


Accuracy on testing data: 0.8969
F-score on testing data: 0.6729

Conclusion

Most classification problems in the real world are imbalanced. Also, almost always data sets have missing values. In this project, we covered strategies to deal with both missing values and imbalanced data sets. We also explored different ways of building ensembles which can give better performance. Below are some interesting learnings.

  • There is no definitive guide to which algorithms to use in a given situation. What works on some data sets may not work on others. Therefore, always evaluate methods using cross validation to get reliable estimates.
  • Sometimes we may be willing to give up some improvement to the model if that would increase the complexity much more than the percentage change in the improvement to the evaluation metrics.

  • EasyEnsemble usually performs better than any other resampling methods.

  • Missing values sometimes add more information to the model than we might expect.
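As a sketch of that last point, scikit-learn (0.20+) can expose missingness to a model as explicit binary features. The arrays here are toy data, not part of this notebook's pipeline:

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[7.0, np.nan],
              [4.0, 3.0],
              [np.nan, 5.0]])

# Binary flags marking which values were missing in each column
indicator = MissingIndicator(features="all")
flags = indicator.fit_transform(X)

# Impute the original values, then append the flags as extra features
imputed = SimpleImputer(strategy="mean").fit_transform(X)
X_aug = np.hstack([imputed, flags])
print(X_aug.shape)  # (3, 4): 2 imputed columns + 2 missing-indicator columns
```

This way the model can learn from the fact that a value was missing, not just from the imputed replacement.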

Improvements

Following are some additional improvements that could be made:

  • In some classification problems, False Negatives are a lot more expensive than False Positives. Therefore, we can reduce cut-off points to reduce the False Negatives.
  • When building ensemble models, try to use good models that are as different as possible to reduce correlation between the base learners. We could have enhanced our stacked ensemble model by adding a Dense Neural Network and other kinds of base learners, as well as adding more layers to the stacked model.
  • Add binary features for each feature that has missing values to check if each example is missing or not.
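The first improvement, moving the decision cut-off, can be sketched as follows. This is an illustrative example on synthetic imbalanced data; the 0.3 threshold is arbitrary, not a tuned value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Default cut-off of 0.5 vs a lower cut-off of 0.3: lowering the
# threshold flags more loans as risky, which reduces False Negatives
# (raises recall) at the cost of precision.
recalls = {}
for threshold in (0.5, 0.3):
    preds = (probs >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
print(recalls)
```

Since lowering the threshold can only add positive predictions, recall at 0.3 is never lower than at 0.5; the appropriate cut-off depends on the relative cost of False Negatives and False Positives.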