Lending Club


In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import roc_curve, auc, accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

from scipy.stats import ttest_ind
import matplotlib.dates as mdates
from pandas_confusion import ConfusionMatrix
import statsmodels.api as sm

sns.set_style('white')

Abstract

Lending Club offers an exciting alternative to the stock market by providing loans that others can invest in. They claim a 4% overall default rate and give each loan a grade that represents the chance of the loan ending up in default. In this project we found that the default rate among closed loans is 18% instead of 4%, and that only grades A, B and C are profitable on average, of which A and B give the highest average return on investment (around 4.5%). Furthermore, we found that adding more features than grade alone to predict loans ending in default is only marginally beneficial, and that logistic regression with all selected features performed best (AUC 0.71). Features that were important for this prediction were interest rate, annual income, term and debt-to-income. A higher annual income gave a lower chance of default, while for the others the relationship was reversed. We further predicted grade to find the features that are important but already incorporated in grade. In this case Random Forest performed best, but it predicted mostly grade A, so only the precision on the other grades was good (around 0.8). Features that were found to be important were either based on the amount that was borrowed or on the amount of debt the borrower already had: revolving line utilization rate (the amount of credit used compared to all credit), installment (monthly payment), revolving balance (all credit), loan amount and debt-to-income. We recommend that investors invest in loans with grades A and B and, on top of that, look for loans with short terms, loans of lower amounts, and borrowers with little other debt and high incomes.

Introduction

Crowd funding has become a new and exciting way to get capital and to invest. Lending Club has joined the trend by offering loans with fixed interest rates and terms that the public can choose to invest in. Lending Club screens the loan applications; only 10% are approved and subsequently offered to the public. By investing a small amount in many different loans, investors can diversify their portfolio and in this way keep the default risk to a minimum (estimated by Lending Club to be 4%). For their services Lending Club asks a fee of 1%. This is an interesting way for investors to make a profit, since it supposedly gives more stable returns than the stock market and higher interest rates than a savings account. The profits depend on the interest rate and the default rate. Therefore it is interesting to see whether certain characteristics of the loan or the borrower give a bigger chance of default, which might help investors increase their profits.

Lending Club has made their records available to the public via their website. An earlier dataset holds the records from 2007-2011, and there has also been a Kaggle contest with a preprocessed Lending Club dataset. In April 2016 Lending Club provided their 2007-2015 dataset through Kaggle as a dataset, not as a contest. This is the dataset we will be working on in this project. Previous work has usually been done on one of the earlier releases. Most of that work focused on separating good loans from bad loans, which we will also do, but most of it also included the current loans. This poses a problem, since loans with a 'late' status could still recover and end in 'fully paid', and 'current' loans could still end in 'charged off' (Lending Club’s default status). This is why we will focus only on loans that are closed and therefore have either status 'fully paid' or 'charged off'. The consequence is that previous work that included current loans is not directly comparable.

To predict whether a loan will end in 'charged off' we will use machine learning algorithms. According to previous work (Pandey and Srinivasan, 2014; Tsai et al.), Logistic Regression and Random Forest work best. Work that incorporates no external datasets usually ends up with an Area Under the Curve (AUC) score of around 0.7, which is not great but better than chance. The most important feature is usually found to be 'grade'. This is a risk assessment of each loan given by Lending Club itself. The categories are A-G, with subcategories such as A1. The idea is that the closer to G, the higher the chance of default. Usually the interest rate is also higher for the riskier loans in order to keep these loans attractive for investors.

In this project, we will first focus on exploring the data. We will see whether Lending Club is right about their claimed 4% default rate. Subsequently, we will look into whether loans with riskier grades indeed have higher interest rates and higher default rates. We will close the exploration part with how profitable the loans in the different grade categories actually are on average. After that we will move on to the prediction part, where we will use Random Forest and Logistic Regression to separate the 'charged off' from the 'fully paid' loans. We will see whether an algorithm with just grade performs as well as an algorithm with all features, in other words whether adding features gives any benefit over the metric Lending Club already provides. Furthermore, we will try to recreate grade from the features, to see whether Lending Club provides the features they use for their algorithm and which features are important because they are used to create grade. Finally, we will give some recommendations to the investors of Lending Club.

Methods

Dataset

For this project the Lending Club dataset from Kaggle was used (https://www.kaggle.com/wendykan/lending-club-loan-data). This file contains complete loan data for loans issued between 2007 and 2015. There are 887,379 loans in the file and 74 features. A self-created feature ROI was added (Chang et al., 2015), but this is not part of the dataset. Of the features, 32 describe the loan and 42 describe the borrower. The feature we want to predict is 'loan status'. We are only interested in loans that went to full term, so we selected the loans that had either status 'fully paid' or 'charged off'. Statuses 'issued', 'current', 'default', 'late (31-120 days)', 'late (16-30 days)' and 'in grace period' belong to loans that are still ongoing, for which you cannot yet be certain how they will end up. 'Does not meet credit policy' loans would not be issued today, so they are not useful for future investors. Of all the loans, 5% have the status 'charged off'. After selecting only the loans that went to full term, we are left with 252,971 loans. This is 28.5% of the number of loans we started with. Of these, 18% have the status 'charged off'.


In [2]:
loans = pd.read_csv('../data/loan.csv')
loans['roi'] = ((loans['total_rec_int'] + loans['total_rec_prncp'] 
                          + loans['total_rec_late_fee'] + loans['recoveries']) / loans['funded_amnt']) - 1
print('loans:',loans.shape)
print(loans['loan_status'].unique())
print('percentage charged off in all loans:', 
      round(sum(loans['loan_status']=='Charged Off')/len(loans['loan_status'])*100), '\n')

# selecting loans that went to full term
closed_loans = loans[loans['loan_status'].isin(['Fully Paid', 'Charged Off'])]
print('closed_loans:',closed_loans.shape)
print('percentage closed loans of total loans:', round(closed_loans.shape[0] / loans.shape[0] * 100, 1))
print('percentage charged off in closed loans:', 
      round(sum(closed_loans['loan_status']=='Charged Off')/len(closed_loans['loan_status'])*100))


loans: (887379, 75)
['Fully Paid' 'Charged Off' 'Current' 'Default' 'Late (31-120 days)'
 'In Grace Period' 'Late (16-30 days)'
 'Does not meet the credit policy. Status:Fully Paid'
 'Does not meet the credit policy. Status:Charged Off' 'Issued']
percentage charged off in all loans: 5.0 

closed_loans: (252971, 75)
percentage closed loans of total loans: 28.5
percentage charged off in closed loans: 18.0

Preprocessing

We want to advise investors on which loans they should invest in. Therefore we selected for the prediction only the features that are known before investors pick the loans they want to invest in. We also deleted features that are not useful for prediction, like 'id', and features that have the same value everywhere. There was only one 'joint' loan application, while all others were individual loans, so we deleted this one loan as well. If a feature had more than 10% missing values, we deleted it from the features used for prediction. Rows that had a missing value in one of the remaining features were deleted. The features 'earliest credit line' and 'issue date' were transformed into one feature, namely the number of days between the earliest credit line and the issue date of the loan. This was previously done by O’Rourke (2016). The values in the feature annual income were divided by 1000 and rounded up in order to get more similar values, and outliers (above 200,000) were capped at 200,000. After these transformations we are left with 252,771 loans and 23 features, and the percentage of 'charged off' loans is still 18%. We kept our self-created 'roi' feature for data exploration purposes, but this is not a feature we will use for prediction and it will be excluded later.


In [3]:
include = ['term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 
          'annual_inc', 'purpose', 'zip_code', 'addr_state', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 
          'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 
          'mths_since_last_major_derog', 'acc_now_delinq', 'loan_amnt', 'open_il_6m', 'open_il_12m', 
          'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'dti', 'open_acc_6m', 'tot_cur_bal',
          'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl',
          'inq_last_12m', 'issue_d', 'loan_status', 'roi']

# exclude the one joint application
closed_loans = closed_loans[closed_loans['application_type'] == 'INDIVIDUAL']

# make id index
closed_loans.index = closed_loans.id

# include only the features above
closed_loans = closed_loans[include]

# exclude features with more than 10% missing values
columns_not_missing = (closed_loans.isnull().apply(sum, 0) / len(closed_loans)) < 0.1
closed_loans = closed_loans.loc[:,columns_not_missing[columns_not_missing].index]

# delete rows with NANs
closed_loans = closed_loans.dropna()

# calculate nr of days between earliest creditline and issue date of the loan
# delete the two original features
closed_loans['earliest_cr_line'] = pd.to_datetime(closed_loans['earliest_cr_line'])
closed_loans['issue_d'] = pd.to_datetime(closed_loans['issue_d'])
closed_loans['days_since_first_credit_line'] = closed_loans['issue_d'] - closed_loans['earliest_cr_line']
closed_loans['days_since_first_credit_line'] = closed_loans['days_since_first_credit_line'] / np.timedelta64(1, 'D')
closed_loans = closed_loans.drop(['earliest_cr_line', 'issue_d'], axis=1)

# round-up annual_inc and cut-off outliers annual_inc at 200.000
closed_loans['annual_inc'] = np.ceil(closed_loans['annual_inc'] / 1000)
closed_loans.loc[closed_loans['annual_inc'] > 200, 'annual_inc'] = 200

print(closed_loans.shape)
print('percentage charged off in closed loans:', 
      round(sum(closed_loans['loan_status']=='Charged Off') / len(closed_loans['loan_status']) * 100))


(252771, 24)
percentage charged off in closed loans: 18.0

The selected features:

  • term: the number of payments on the loan. Values are in months and can be either 36 or 60
  • int_rate: interest rate
  • installment: the monthly payment amount
  • grade: A-G, A low risk, G high risk
  • sub_grade: A1-G5
  • emp_length: 0-10 years (10 stands for >=10)
  • home_ownership: 'RENT', 'OWN', 'MORTGAGE', 'OTHER', 'NONE' and 'ANY'
  • annual_inc: annual income stated by the borrower, divided by 1000 and rounded up; 200 stands for >=200,000
  • purpose: 'credit_card', 'car', 'small_business', 'other', 'wedding', 'debt_consolidation', 'home_improvement', 'major_purchase', 'medical', 'moving', 'vacation', 'house', 'renewable_energy' and 'educational'
  • zip_code: first 3 digits followed by 'xx'
  • addr_state: two letters representing the state the borrower lives in
  • delinq_2yrs: the number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
  • inq_last_6mths: the number of inquiries by creditors during the past 6 months
  • open_acc: the number of open credit lines in the borrower’s credit file
  • pub_rec: number of derogatory public records
  • revol_bal: total credit revolving balance
  • revol_util: revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
  • total_acc: the total number of credit lines currently in the borrower’s credit file
  • acc_now_delinq: the number of accounts on which the borrower is now delinquent
  • loan_amnt: the listed amount of the loan applied for by the borrower
  • dti: a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income
  • loan_status: the status of the loan, here either 'Fully Paid' or 'Charged Off'
  • days_since_first_credit_line: self created feature, days between earliest credit line and issue date

In [4]:
closed_loans.columns


Out[4]:
Index(['term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length',
       'home_ownership', 'annual_inc', 'purpose', 'zip_code', 'addr_state',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'acc_now_delinq', 'loan_amnt', 'dti',
       'loan_status', 'roi', 'days_since_first_credit_line'],
      dtype='object')

sklearn requires numerical input features, so we need to convert the categorical features to numeric. Ordered categorical features get adjacent numbers, and unordered features are given an order as best as possible during conversion, for instance a geographical one. These transformations could be done with sklearn's LabelEncoder, but we want to keep any order that is in the data, which might help with the prediction. With more features this would not be manageable, but with this number of features it is still doable. NaN/inf/-inf values are not allowed either, so these are set to 0. For Logistic Regression the features will also have to be scaled and normalized. Non-numeric features were converted as follows:

  • grade/sub_grade: order of the letters was kept
  • emp_length: nr of years
  • zip_code: the first 3 digits of the zip code were kept (geographical order)
  • term: in months
  • home_ownership: from none/any/other to rent to mortgage to owned
  • purpose: from purposes that might make money to purposes that only cost money
  • addr_state: ordered geographically from west to east, top to bottom (https://theusa.nl/staten/)
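
As a minimal sketch of the grade and sub-grade conversion listed above: the helper below (`encode_sub_grade`, a hypothetical name that is not part of the notebook) maps the letter to 1-7 and adds the sub-grade digit as a decimal, so that 'B3' becomes roughly 2.3 and the ordering A1 < A2 < ... < G5 is preserved.

```python
# Hypothetical helper for illustration only.
GRADE_ORDER = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}


def encode_sub_grade(sub_grade):
    """Map e.g. 'B3' to about 2.3: grade letter as integer part, digit as tenths."""
    letter, digit = sub_grade[0], int(sub_grade[1])
    return GRADE_ORDER[letter] + digit / 10


# The numeric codes keep the risk ordering of the original labels.
codes = [encode_sub_grade(s) for s in ['A1', 'A5', 'B1', 'G5']]
print(codes)
```

Because the mapping is an explicit dictionary, an unexpected label raises a KeyError instead of silently receiving an arbitrary number.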

In [5]:
# features that are not float or int, so these need to be converted:

# ordered:
# sub_grade, emp_length, zip_code, term

# unordered:
# home_ownership, purpose, addr_state (ordered geographically)

closed_loans_predict = closed_loans.copy()

# term
closed_loans_predict['term'] = closed_loans_predict['term'].apply(lambda x: int(x.split(' ')[1]))

# grade
closed_loans_predict['grade'] = closed_loans_predict['grade'].astype('category')
grade_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
closed_loans_predict['grade'] = closed_loans_predict['grade'].apply(lambda x: grade_dict[x])

# emp_length
emp_length_dict = {'n/a':0,
                   '< 1 year':0,
                   '1 year':1,
                   '2 years':2,
                   '3 years':3,
                   '4 years':4,
                   '5 years':5,
                   '6 years':6,
                   '7 years':7,
                   '8 years':8,
                   '9 years':9,
                   '10+ years':10}
closed_loans_predict['emp_length'] = closed_loans_predict['emp_length'].apply(lambda x: emp_length_dict[x])

# zipcode
closed_loans_predict['zip_code'] = closed_loans_predict['zip_code'].apply(lambda x: int(x[0:3]))

# subgrade
closed_loans_predict['sub_grade'] = (closed_loans_predict['grade'] 
                                    + closed_loans_predict['sub_grade'].apply(lambda x: float(list(x)[1])/10))

# house
house_dict = {'NONE': 0, 'OTHER': 0, 'ANY': 0, 'RENT': 1, 'MORTGAGE': 2, 'OWN': 3}
closed_loans_predict['home_ownership'] = closed_loans_predict['home_ownership'].apply(lambda x: house_dict[x])

# purpose
purpose_dict = {'other': 0, 'small_business': 1, 'renewable_energy': 2, 'home_improvement': 3,
                'house': 4, 'educational': 5, 'medical': 6, 'moving': 7, 'car': 8, 
                'major_purchase': 9, 'wedding': 10, 'vacation': 11, 'credit_card': 12, 
                'debt_consolidation': 13}
closed_loans_predict['purpose'] = closed_loans_predict['purpose'].apply(lambda x: purpose_dict[x])

# states
state_dict = {'AK': 0, 'WA': 1, 'ID': 2, 'MT': 3, 'ND': 4, 'MN': 5, 
              'OR': 6, 'WY': 7, 'SD': 8, 'WI': 9, 'MI': 10, 'NY': 11, 
              'VT': 12, 'NH': 13, 'MA': 14, 'CT': 15, 'RI': 16, 'ME': 17,
              'CA': 18, 'NV': 19, 'UT': 20, 'CO': 21, 'NE': 22, 'IA': 23, 
              'KS': 24, 'MO': 25, 'IL': 26, 'IN': 27, 'OH': 28, 'PA': 29, 
              'NJ': 30, 'KY': 31, 'WV': 32, 'VA': 33, 'DC': 34, 'MD': 35, 
              'DE': 36, 'AZ': 37, 'NM': 38, 'OK': 39, 'AR': 40, 'TN': 41, 
              'NC': 42, 'TX': 43, 'LA': 44, 'MS': 45, 'AL': 46, 'GA': 47, 
              'SC': 48, 'FL': 49, 'HI': 50}
closed_loans_predict['addr_state'] = closed_loans_predict['addr_state'].apply(lambda x: state_dict[x])

# make NA's, inf and -inf 0
closed_loans_predict = closed_loans_predict.fillna(0)
closed_loans_predict = closed_loans_predict.replace([np.inf, -np.inf], 0)

Classification

We selected two algorithms for this project because they performed best on Lending Club datasets in previous work: Logistic Regression and Random Forest. The Logistic Regression classifier is a simple classifier that uses a sigmoidal curve to predict from the features to which class a sample belongs. It has one parameter to tune, the C-parameter. This parameter is the inverse of the regularization strength, so smaller values specify stronger regularization. We will be using l1/lasso regularization in the case of multiple features. With this algorithm we will also have to scale and normalize the features. This algorithm has sometimes been found to perform better with fewer features on a Lending Club dataset.

Random Forest is a more complicated algorithm that scores well in a lot of cases. This algorithm makes various decision trees from subsets of the samples and uses at each split only a fraction of the features to prevent overfitting. The Random Forest algorithm is known to be not very sensitive to the values of its parameters: the number of features used at each split and the number of trees in the forest. Nevertheless, the default of sklearn is so low that we will raise the number of trees to 100. The algorithm has feature selection already built-in (at each split) and scaling/normalization is also not necessary.
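
To make the two Random Forest parameters concrete, here is a small sketch on synthetic data (generated with `make_classification`, an assumption for illustration and not the Lending Club data). `n_estimators` sets the number of trees and `max_features` the number of features considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 500 samples, 8 features, 3 of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=3, random_state=0)

# 100 trees; each split considers sqrt(8), i.e. 2, randomly chosen features.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
print(forest.feature_importances_.round(3))
```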

For the classification we will split the data into a training set (70%) and a test set (30%). The test set is used to evaluate the performance of our classifier.


In [6]:
# split data in train (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(closed_loans_predict.drop(['loan_status', 'roi'], axis=1), 
                                                    closed_loans_predict['loan_status'], 
                                                    test_size=0.3, random_state=123)

# scaling and normalizing the features (the scaler is fitted on the training set only)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)

# scale test set with scaling used in train set
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
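
One possible refinement, sketched here on synthetic data rather than the loan data, is to put the scaler and the classifier in a single sklearn pipeline. During cross-validation the scaler is then refitted on each training fold, so the held-out fold does not influence the scaling. The variable names and the parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 80% / 20%).
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     weights=[0.8, 0.2], random_state=1)

# Scaling and the l1-regularized classifier are fitted together per fold.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty='l1', solver='liblinear'))

# make_pipeline names the step after the lowercased class name.
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, scoring='f1_weighted', cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```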

Performance metrics

We will use a few metrics to test the performance of our classifier on the test set. First we will use confusion matrices and their statistics. A confusion matrix shows how many true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP) there are. Secondly, we will use the F1-score, implemented as 'f1_weighted' in sklearn. The F1-score is the harmonic mean of precision and recall; 'f1_weighted' computes it per class and averages the results weighted by class size. Precision is defined as TP / (TP + FP), while recall is defined as TP / (TP + FN). The F1-score is supposed to deal better with classes of unequal size, as is the case in this project, than the standard accuracy metric, which can become very high if the algorithm only predicts the dominant class. Thirdly, we will show Receiver Operating Characteristic (ROC) curves, which deal very well with classes of unequal size. The Area Under the Curve (AUC) score of the ROC plot is 0.5 for a random result and above 0.5 for a better than random result, with 1.0 as the maximum score. Lastly, we will use the definition of Chang et al. (2015) for return on investment on a loan to represent how profitable a loan is. Their definition is ROI = (Total payment received by investors / Total amount committed by investors) − 1.
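
The definitions above can be written out as a few small functions. This is a minimal sketch with hypothetical counts and amounts, chosen only to illustrate the formulas; the function names are not part of the notebook.

```python
def precision(tp, fp):
    """TP / (TP + FP): share of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that are found."""
    return tp / (tp + fn)


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


def roi(total_received, total_committed):
    """ROI as defined by Chang et al. (2015): received / committed - 1."""
    return total_received / total_committed - 1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
print(precision(30, 10))   # 0.75
print(recall(30, 20))      # 0.6
print(round(f1(30, 10, 20), 3))

# Hypothetical loan: 1000 committed, 1100 received in total.
print(round(roi(1100.0, 1000.0), 3))
```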

Results

Exploration

Lending Club claims that the default rate of their loans is 4%. We checked this in the complete set we used for this project and found a 'charged off' rate (their default loan status) of 5%. This is a little higher than they claimed, but not by much. However, this set contains a lot of loans that are still ongoing. For these loans you do not know whether they will end up in 'fully paid' or 'charged off'. Therefore we focus on loans that are closed. Of these loans a much higher percentage ends up in 'charged off', namely 18%.


In [7]:
print('percentage charged off in all loans:', 
      round(sum(loans['loan_status']=='Charged Off')/len(loans['loan_status'])*100))
print('percentage charged off in closed loans:', 
      round(sum(closed_loans['loan_status']=='Charged Off') / len(closed_loans['loan_status']) * 100))


percentage charged off in all loans: 5.0
percentage charged off in closed loans: 18.0
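
The two rates are consistent with each other once the denominators are made explicit. As a rough check, the sketch below uses the loan counts from the Dataset section and the rounded 18%, so the result is approximate.

```python
total_loans = 887379            # all loans in the file
closed_loans_n = 252971         # loans that are 'Fully Paid' or 'Charged Off'
charged_off_rate_closed = 0.18  # rounded rate among closed loans

# Same numerator (charged-off loans), two different denominators.
charged_off_n = charged_off_rate_closed * closed_loans_n
rate_over_all = charged_off_n / total_loans

print(round(rate_over_all * 100, 1))  # about 5 percent
```

The ongoing loans enlarge the denominator without yet contributing any charge-offs, which is why the rate over all loans looks so much lower.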

Lending Club gives grades (A-G) to their loans so potential investors can see which of these loans are not so risky (A) and which are the riskier loans (G). To make it still worthwhile to invest in the riskier loans, investors supposedly get more interest on these loans. From the figure below we see that the interest rate is indeed higher for riskier loans, but that there are a few exceptions.


In [8]:
closed_loans['grade'] = pd.Categorical(closed_loans['grade'], ordered=True)
sns.boxplot(data=closed_loans, x='grade', y='int_rate', color='turquoise')


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x13cfae630>

Apart from grade and interest rate there are of course other characteristics of the loans. A few of them are shown below. We see that loan amounts range from almost 0 to 35,000. Hence Lending Club loans seem to be an alternative to personal loans and credit cards, not to mortgages. The loans run for either 36 months (3 years) or 60 months (5 years), and mostly for 3 years. The purpose of the loan is mostly debt consolidation or credit card, so the borrowers seem to be mostly people who already have debts. The annual income was capped at 200,000 but lies mostly between 25,000 and 100,000.


In [9]:
sns.distplot(closed_loans['loan_amnt'], kde=False, bins=50)
plt.show()
sns.countplot(closed_loans['term'], color='turquoise')
plt.show()
sns.countplot(closed_loans['purpose'], color='turquoise')
plt.xticks(rotation=90)
plt.show()
ax = sns.distplot(closed_loans['annual_inc'], bins=100, kde=False)
plt.xlim([0,200])
ax.set(xlabel='annual income (x 1000)')
plt.show()


We can speculate that some of the characteristics of loans have an influence on the loan ending up in 'charged off'. A first logical check is whether the Lending Club 'grade' is already visibly a factor in loans ending up in this status. As we can see from the figure below, the 'grade' correlates very well with the 'charged off' proportion of the loans. Only between F and G is the difference smaller. Hence Lending Club has built a pretty good algorithm to predict the 'charged off' status. Higher interest loans also seem to end up in 'charged off' more often, as expected. For purpose the influence is not clearly visible. For dti (debt-to-income), however, the difference is statistically significant (t-test): the mean dti is about 16.1 for 'fully paid' loans and 18.4 for 'charged off' loans. This means that the more debt a person has compared to their income, the higher the chance of the loan ending in 'charged off'. Lastly, for home ownership status the difference is visible in the figure and also in the numbers: 'rent' has the highest charged off proportion, then 'own' and then 'mortgage'.


In [10]:
grade_status = closed_loans.reset_index().groupby(['grade', 'loan_status'])['id'].count()
risk_grades = dict.fromkeys(closed_loans['grade'].unique())
for g in risk_grades.keys():
    risk_grades[g] = grade_status.loc[(g, 'Charged Off')] / (grade_status.loc[(g, 'Charged Off')] + grade_status.loc[(g, 'Fully Paid')])
risk_grades = pd.DataFrame(risk_grades, index=['proportion_unpaid_loans'])    
sns.stripplot(data=risk_grades, color='darkgray', size=15)
plt.show()

sns.distplot(closed_loans[closed_loans['loan_status']=='Charged Off']['int_rate'])
sns.distplot(closed_loans[closed_loans['loan_status']=='Fully Paid']['int_rate'])
plt.show()

purpose_paid = closed_loans.reset_index().groupby(['purpose', 'loan_status'])['id'].count()
sns.barplot(data=pd.DataFrame(purpose_paid).reset_index(), x='purpose', y='id', hue='loan_status')
plt.xticks(rotation=90)
plt.show()

sns.boxplot(data=closed_loans, x='loan_status', y='dti')
plt.show()
print(ttest_ind(closed_loans[closed_loans['loan_status']=='Fully Paid']['dti'], 
                closed_loans[closed_loans['loan_status']=='Charged Off']['dti']))
print((closed_loans[closed_loans['loan_status']=='Fully Paid']['dti']).mean())
print((closed_loans[closed_loans['loan_status']=='Charged Off']['dti']).mean())

home_paid = closed_loans.reset_index().groupby(['home_ownership', 'loan_status'])['id'].count()
sns.barplot(data=pd.DataFrame(home_paid).reset_index(), x='home_ownership', y='id', hue='loan_status')
plt.xticks(rotation=90)
plt.show()

print(home_paid)
print('mortgage:', home_paid['MORTGAGE'][0] / (home_paid['MORTGAGE'][0] + home_paid['MORTGAGE'][1]))
print('own:', home_paid['OWN'][0] / (home_paid['OWN'][0] + home_paid['OWN'][1]))
print('rent:', home_paid['RENT'][0] / (home_paid['RENT'][0] + home_paid['RENT'][1]))


Ttest_indResult(statistic=-56.448263930658023, pvalue=0.0)
16.1410641711
18.4084456096
home_ownership  loan_status
ANY             Fully Paid          1
MORTGAGE        Charged Off     19860
                Fully Paid     104895
NONE            Charged Off         7
                Fully Paid         36
OTHER           Charged Off        27
                Fully Paid        112
OWN             Charged Off      4018
                Fully Paid      17946
RENT            Charged Off     21289
                Fully Paid      84580
Name: id, dtype: int64
mortgage: 0.159192016352
own: 0.182935712985
rent: 0.201088137226
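
The three proportions printed above can be reproduced directly from the counts in the table. The sketch below copies the printed counts and looks them up by status label rather than by position, which avoids relying on 'Charged Off' sorting before 'Fully Paid'. The dictionary layout is an assumption for illustration.

```python
# Counts copied from the printed groupby output.
counts = {
    'MORTGAGE': {'Charged Off': 19860, 'Fully Paid': 104895},
    'OWN': {'Charged Off': 4018, 'Fully Paid': 17946},
    'RENT': {'Charged Off': 21289, 'Fully Paid': 84580},
}

# Share of closed loans that were charged off, per home ownership status.
charged_off_share = {
    status: c['Charged Off'] / (c['Charged Off'] + c['Fully Paid'])
    for status, c in counts.items()
}

for status, share in charged_off_share.items():
    print(status, round(share, 4))
```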

Another interesting question is whether it is profitable to invest in loans from Lending Club and whether the 'grade' has an influence on profitability. For this purpose we show the return on investment (ROI) overall and per grade. As is seen below, the loans yield an average profit of only 1.4%. If we look per grade, only A-C result in a profit on average. Loans that end up in 'charged off' are on average very bad for the profits, since you will likely lose part of the principal as well. In the A-C categories the loans end up in 'charged off' less often and are therefore on average more profitable, even though the loans in the riskier categories deliver more interest. The higher interest (more than 20% in the riskiest grades) does not compensate enough for the high 'charged off' ratio, which is around 40% in the riskiest grades as we saw before.


In [11]:
roi = closed_loans.groupby('grade')['roi'].mean()
print(roi)
print(closed_loans['roi'].mean())
sns.barplot(data=roi.reset_index(), x='grade', y='roi', color='gray')
plt.show()
roi = closed_loans.groupby(['grade', 'loan_status'])['roi'].mean()
sns.barplot(data=roi.reset_index(), x='roi', y='grade', hue='loan_status', orient='h')
plt.show()
sns.countplot(data=closed_loans, x='grade', hue='loan_status')
plt.show()


grade
A    0.044555
B    0.046401
C    0.012001
D   -0.016277
E   -0.061398
F   -0.085575
G   -0.104552
Name: roi, dtype: float64
0.013866103562
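
The trade-off between interest and charge-offs can be sketched as an expected value: the average ROI of a grade is the charge-off proportion times the average ROI of charged-off loans, plus the remaining proportion times the average ROI of fully paid loans. The ROI values below are hypothetical and only illustrate the mechanism; they are not taken from the dataset.

```python
def expected_roi(p_charged_off, roi_paid, roi_charged_off):
    """Average ROI of a group as a mixture of paid and charged-off loans."""
    return p_charged_off * roi_charged_off + (1 - p_charged_off) * roi_paid


def break_even_charge_off_rate(roi_paid, roi_charged_off):
    """Charge-off proportion at which the expected ROI is exactly zero."""
    return roi_paid / (roi_paid - roi_charged_off)


# Hypothetical risky grade: 25% gain when paid, 50% loss when charged off.
print(round(expected_roi(0.40, roi_paid=0.25, roi_charged_off=-0.50), 3))
print(round(break_even_charge_off_rate(0.25, -0.50), 3))
```

With these assumed values, any charge-off proportion above one third turns the average return negative, which is the pattern the riskier grades show.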

Prediction

Predicting status 'charged off'

As we saw in the exploration part, the 'grade' is already a pretty good characteristic to predict the 'charged off' rate. Therefore we will first see whether adding any additional features is actually useful. Subsequently we will see if we can recreate 'grade' from the features, to see which features are still useful but incorporated in 'grade'. The Methods section describes which features were selected. We excluded all features not known at the start of the loan and features that are not predictive, like the id of the loan, or that have a lot of missing values. Twenty-three of the features remain in this way. Logistic Regression and Random Forest, two algorithms that have performed well in the past on this dataset, will be used for the prediction. For optimal performance of the Logistic Regression algorithm, the C-parameter (the inverse of the regularization strength) can be tuned on the training set. This is only necessary when using multiple features, because regularization is not useful in the case of one feature (grade in this case). The grid search output below reports C = 100 as the best value on the training set, while the models in the following cells are fitted with C = 10.


In [12]:
# parameter tuning Logistic Regression
dict_Cs = {'C': [0.001, 0.1, 1, 10, 100]}
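# the l1 penalty needs a solver that supports it (liblinear); scoring is passed by keyword
clf = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'), dict_Cs, scoring='f1_weighted', cv=10)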

clf.fit(X_train_scaled, y_train)
print(clf.best_params_)
print(clf.best_score_)


{'C': 100}
0.753149547427

We trained both classifiers on only 'grade' and on all features. With Logistic Regression we also trained one on the top-5 features as selected by SelectKBest from sklearn, because Logistic Regression sometimes performs better with fewer features. We see that all F1-scores are around 0.75. Using all features instead of only grade gives only a very marginal increase of around 1%, and using 5 features gives no increase. The best performing algorithm based on the F1-score is Logistic Regression with all features, but the differences are very small. When looking at the confusion matrices it is clear that all algorithms mostly predict 'Fully Paid'. Since this is the dominant class (82%), accuracy scores look good while the algorithm is actually not that great, as can be seen from the confusion matrices. The F1-score metric was chosen because it can deal better with unequal classes, but even that reports a score of 0.74 when Random Forest predicts all loans to be 'Fully Paid'. AUC is in this case a better metric, since with uneven classes a random result still gives 0.5. The algorithms with only grade give an AUC of 0.66, while Logistic Regression with all features gives 0.71 and Random Forest 0.70. The top-5 features algorithm is in between with 0.68. Hence adding all features again gives a slightly better performance (0.04-0.05 higher AUC), and Logistic Regression with all features performs best. This is also displayed in the ROC plot.
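
The weighted F1 value of about 0.74 for an all-'Fully Paid' prediction follows from the class proportions alone. A minimal sketch, assuming the rounded 82%/18% split:

```python
share_paid, share_off = 0.82, 0.18

# Predicting every loan as 'Fully Paid':
precision_paid = share_paid   # 82% of the 'Fully Paid' predictions are correct
recall_paid = 1.0             # every actual 'Fully Paid' loan is found
f1_paid = 2 * precision_paid * recall_paid / (precision_paid + recall_paid)

# No loan is predicted 'Charged Off', so that class scores an F1 of 0.
f1_off = 0.0

# Weighted F1: per-class F1 averaged by class share.
f1_weighted = share_paid * f1_paid + share_off * f1_off
print(round(f1_weighted, 2))  # 0.74
```

A classifier that never identifies a single charged-off loan therefore still reaches a weighted F1 close to the scores reported here, which is why AUC is the more informative metric in this setting.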


In [13]:
# Logistic Regression only grade
clf = LogisticRegression(penalty='l1', solver='liblinear', C=10)
clf.fit(X_train_scaled.loc[:,['grade']], y_train)
prediction = clf.predict(X_test_scaled.loc[:,['grade']])

# F1-score
print('f1_score:', f1_score(y_test, prediction, average='weighted'))

# AUC
y_score = clf.predict_proba(X_test_scaled.loc[:,['grade']])
fpr1, tpr1, thresholds = roc_curve(np.array(y_test), y_score[:,0], pos_label='Charged Off')
auc1 = round(auc(fpr1, tpr1), 2)
print('auc:', auc1)

# Confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test), prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


f1_score: 0.746041404442
auc: 0.66
Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          236       13302    13538
Fully Paid           330       61964    62294
__all__              566       75266    75832


Overall Statistics:

Accuracy: 0.82023420192
95% CI: (0.81748234524080143, 0.82296153591534582)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0194151995092
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off  Fully Paid
Population                                  75832       75832
P: Condition positive                       13538       62294
N: Condition negative                       62294       13538
Test outcome positive                         566       75266
Test outcome negative                       75266         566
TP: True Positive                             236       61964
TN: True Negative                           61964         236
FP: False Positive                            330       13302
FN: False Negative                          13302         330
TPR: (Sensitivity, hit rate, recall)    0.0174324    0.994703
TNR=SPC: (Specificity)                   0.994703   0.0174324
PPV: Pos Pred Value (Precision)          0.416961    0.823267
NPV: Neg Pred Value                      0.823267    0.416961
FPR: False-out                         0.00529746    0.982568
FDR: False Discovery Rate                0.583039    0.176733
FNR: Miss Rate                           0.982568  0.00529746
ACC: Accuracy                            0.820234    0.820234
F1 score                                0.0334657    0.900901
MCC: Matthews correlation coefficient   0.0539922   0.0539922
Informedness                             0.012135    0.012135
Markedness                               0.240228    0.240228
Prevalence                               0.178526    0.821474
LR+: Positive likelihood ratio            3.29071     1.01235
LR-: Negative likelihood ratio             0.9878    0.303886
DOR: Diagnostic odds ratio                3.33135     3.33135
FOR: False omission rate                 0.176733    0.583039
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x113b9d160>

In [14]:
# Logistic Regression all features
clf = LogisticRegression(penalty='l1', solver='liblinear', C=10)
clf.fit(X_train_scaled, y_train)
prediction = clf.predict(X_test_scaled)

# F1-score
print(f1_score(y_test, prediction, average='weighted'))

# AUC
y_score = clf.predict_proba(X_test_scaled)
fpr2, tpr2, thresholds = roc_curve(np.array(y_test), y_score[:,0], pos_label='Charged Off')
auc2 = round(auc(fpr2, tpr2), 2)
print('auc:', auc2)

# Confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test), prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


0.753657223162
auc: 0.71
Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          547       12991    13538
Fully Paid           548       61746    62294
__all__             1095       74737    75832


Overall Statistics:

Accuracy: 0.821460597109
95% CI: (0.8187159962118411, 0.82418058119905324)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0493628862175
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off  Fully Paid
Population                                  75832       75832
P: Condition positive                       13538       62294
N: Condition negative                       62294       13538
Test outcome positive                        1095       74737
Test outcome negative                       74737        1095
TP: True Positive                             547       61746
TN: True Negative                           61746         547
FP: False Positive                            548       12991
FN: False Negative                          12991         548
TPR: (Sensitivity, hit rate, recall)    0.0404048    0.991203
TNR=SPC: (Specificity)                   0.991203   0.0404048
PPV: Pos Pred Value (Precision)          0.499543    0.826177
NPV: Neg Pred Value                      0.826177    0.499543
FPR: False-out                         0.00879699    0.959595
FDR: False Discovery Rate                0.500457    0.173823
FNR: Miss Rate                           0.959595  0.00879699
ACC: Accuracy                            0.821461    0.821461
F1 score                                0.0747625    0.901198
MCC: Matthews correlation coefficient    0.101466    0.101466
Informedness                            0.0316078   0.0316078
Markedness                               0.325721    0.325721
Prevalence                               0.178526    0.821474
LR+: Positive likelihood ratio            4.59302     1.03294
LR-: Negative likelihood ratio           0.968112    0.217722
DOR: Diagnostic odds ratio                4.74431     4.74431
FOR: False omission rate                 0.173823    0.500457
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x112230e80>

In [15]:
# Logistic Regression top-5 features selected with Select-K-Best
# keep the fitted selector so the test set gets the same columns, in the same order
# (in the original run: term, int_rate, installment, grade, sub_grade)
selector = SelectKBest(mutual_info_classif, k=5).fit(X_train_scaled, y_train)
new_X = selector.transform(X_train_scaled)
new_X_test = selector.transform(X_test_scaled)
clf = LogisticRegression(penalty='l1', solver='liblinear', C=10)
clf.fit(new_X, y_train)
prediction = clf.predict(new_X_test)

# F1-score
print(f1_score(y_test, prediction, average='weighted'))

# AUC
y_score = clf.predict_proba(new_X_test)
fpr3, tpr3, thresholds = roc_curve(np.array(y_test), y_score[:,0], pos_label='Charged Off')
auc3 = round(auc(fpr3, tpr3), 2)
print('auc:', auc3)

# Confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test), prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


0.745158335737
auc: 0.68
Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          203       13335    13538
Fully Paid           313       61981    62294
__all__              516       75316    75832


Overall Statistics:

Accuracy: 0.820023209199
95% CI: (0.81727010898909747, 0.82275180291855388)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0159888080352
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off  Fully Paid
Population                                  75832       75832
P: Condition positive                       13538       62294
N: Condition negative                       62294       13538
Test outcome positive                         516       75316
Test outcome negative                       75316         516
TP: True Positive                             203       61981
TN: True Negative                           61981         203
FP: False Positive                            313       13335
FN: False Negative                          13335         313
TPR: (Sensitivity, hit rate, recall)    0.0149948    0.994975
TNR=SPC: (Specificity)                   0.994975   0.0149948
PPV: Pos Pred Value (Precision)          0.393411    0.822946
NPV: Neg Pred Value                      0.822946    0.393411
FPR: False-out                         0.00502456    0.985005
FDR: False Discovery Rate                0.606589    0.177054
FNR: Miss Rate                           0.985005  0.00502456
ACC: Accuracy                            0.820023    0.820023
F1 score                                0.0288886    0.900821
MCC: Matthews correlation coefficient    0.046445    0.046445
Informedness                           0.00997027  0.00997027
Markedness                               0.216357    0.216357
Prevalence                               0.178526    0.821474
LR+: Positive likelihood ratio            2.98431     1.01012
LR-: Negative likelihood ratio           0.989979    0.335086
DOR: Diagnostic odds ratio                3.01451     3.01451
FOR: False omission rate                 0.177054    0.606589
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1050a65c0>

In [16]:
# Random Forest only grade
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train.loc[:,['grade']], y_train)
prediction = clf.predict(X_test.loc[:,['grade']])

# F1-score
print(f1_score(y_test, prediction, average='weighted'))

# AUC
y_score = clf.predict_proba(X_test.loc[:,['grade']])
fpr4, tpr4, thresholds = roc_curve(np.array(y_test), y_score[:,0], pos_label='Charged Off')
auc4 = round(auc(fpr4, tpr4), 2)
print('auc:', auc4)

# Confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test), prediction)
print(confusion_matrix)
confusion_matrix.plot()


0.740959528403
auc: 0.66
Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off            0       13538    13538
Fully Paid             0       62294    62294
__all__                0       75832    75832
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x11f4df588>

In [17]:
# Random Forest all features
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)

# F1-score
print(f1_score(y_test, prediction, average='weighted'))

# AUC
y_score = clf.predict_proba(X_test)
fpr5, tpr5, thresholds = roc_curve(np.array(y_test), y_score[:,0], pos_label='Charged Off')
auc5 = round(auc(fpr5, tpr5), 2)
print('auc:', auc5)

# Confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test), prediction)
print(confusion_matrix)
confusion_matrix.plot()


0.753800718644
auc: 0.7
Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          559       12979    13538
Fully Paid           575       61719    62294
__all__             1134       74698    75832
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x11f50d898>

In [18]:
# ROC-plot with AUC scores.

plt.plot(fpr1, tpr1, label='Logreg grade (auc = %0.2f)' % auc1, linewidth=4)
plt.plot(fpr2, tpr2, label='Logreg all (auc = %0.2f)' % auc2, linewidth=4)
plt.plot(fpr3, tpr3, label='Logreg top-5 (auc = %0.2f)' % auc3, linewidth=4)
plt.plot(fpr4, tpr4, label='RF grade (auc = %0.2f)' % auc4, linewidth=4)
plt.plot(fpr5, tpr5, label='RF all (auc = %0.2f)' % auc5, linewidth=4)
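# chance line and axis labels for reference
plt.plot([0, 1], [0, 1], 'k--', label='random (auc = 0.50)')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc="lower right")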
plt.show()


So adding features does lead to slightly better performance, and it is interesting to see which features drive this increase. For logistic regression, the important features can be found by looking at the coefficients: because the features are scaled, the larger the absolute value of a coefficient, the more the model uses that feature for prediction. For our best performing model, Logistic Regression with all features, the five features with the largest absolute coefficients are interest rate, annual income, subgrade, term and dti. The first and the last two have a negative coefficient and the other two a positive one. sklearn treats the second entry of clf.classes_, 'Fully Paid', as the positive class. A negative coefficient for interest rate therefore means that the higher the interest rate, the smaller the chance of 'fully paid'. This makes sense, since interest rate is related to grade, and the higher the grade (towards G), the higher the chance of 'charged off'. A shorter term gives a smaller chance of 'charged off', which also seems logical, as does the finding that a lower debt-to-income ratio gives a smaller chance that a loan ends up 'charged off'. A higher annual income giving a bigger chance of 'fully paid' is plausible too. Grade is not in the top-5, but the largely redundant feature subgrade is. Strangely, it received a positive coefficient: the higher the subgrade, the bigger the chance of 'fully paid'. This makes no sense on its own and may be an artefact of subgrade being strongly related to interest rate and grade. Subsequently these five features were put in a logistic regression model from the statsmodels package to get p-values. Here the signs of the coefficients are exactly reversed, because the target is coded as y_train == 'Charged Off', which makes 'charged off' the positive class. All features whose sign makes logical sense are significant; only subgrade is not (p = 0.35). This fits the suspicion that the sign of subgrade is spurious.
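
To make the size of these effects easier to read, the coefficients printed by the cell below can be turned into odds ratios. This is a minimal sketch that assumes the scaled features have unit variance, so that one unit equals one standard deviation; the L1 penalty shrinks coefficients, so the numbers are indicative only.

```python
import math

# Coefficients printed by the sklearn model (positive class: 'Fully Paid').
coefs = {
    'int_rate': -0.61312659,
    'annual_inc': 0.29842334,
    'sub_grade': 0.27813056,
    'term': -0.17338965,
    'dti': -0.15910902,
}

# exp(coef) is the multiplicative change in the odds of 'Fully Paid'
# for a one-unit (assumed: one standard deviation) increase in the feature.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

# Taking 'Charged Off' as the positive class negates every coefficient,
# which inverts every odds ratio.
odds_ratios_charged_off = {name: math.exp(-c) for name, c in coefs.items()}

for name in coefs:
    print(name, round(odds_ratios[name], 2), round(odds_ratios_charged_off[name], 2))
```

An odds ratio below 1 lowers the odds of 'Fully Paid' and one above 1 raises them, so the sign discussion above carries over directly: only the choice of positive class decides which way a coefficient points.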


In [19]:
clf = LogisticRegression(penalty='l1', solver='liblinear', C=10)
clf.fit(X_train_scaled, y_train)
coefs = clf.coef_

# find index of top 5 highest coefficients, aka most used features for prediction
positions = abs(coefs[0]).argsort()[-5:][::-1]
features = list(X_train_scaled.columns[positions])
print(features)
print(coefs[0][positions])
print(clf.classes_)

# use statsmodels logistic regression to get p-values for the top-5 most used features
logit = sm.Logit(y_train == 'Charged Off', np.array(X_train_scaled.loc[:, features]))
result = logit.fit()
print(result.summary())


['int_rate', 'annual_inc', 'sub_grade', 'term', 'dti']
[-0.61312659  0.29842334  0.27813056 -0.17338965 -0.15910902]
['Charged Off' 'Fully Paid']
Optimization terminated successfully.
         Current function value: 0.672390
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:            loan_status   No. Observations:               176939
Model:                          Logit   Df Residuals:                   176934
Method:                           MLE   Df Model:                            4
Date:                Sun, 05 Feb 2017   Pseudo R-squ.:                 -0.4312
Time:                        16:17:49   Log-Likelihood:            -1.1897e+05
converged:                       True   LL-Null:                       -83125.
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.3304      0.019     17.575      0.000         0.294     0.367
x2            -0.1166      0.005    -23.045      0.000        -0.127    -0.107
x3            -0.0177      0.019     -0.930      0.353        -0.055     0.020
x4             0.1061      0.006     19.017      0.000         0.095     0.117
x5             0.0902      0.005     17.881      0.000         0.080     0.100
==============================================================================
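
The negative pseudo R-squared and the LLR p-value of 1.000 in the summary above deserve a note. McFadden's pseudo R-squared is 1 - LL_model / LL_null, where the null model contains only an intercept. sm.Logit does not add an intercept by itself, and the call above passes only the five feature columns, so the fitted model cannot reproduce the base rate of 'Charged Off' and ends up with a lower log-likelihood than the null model. The sketch below redoes the arithmetic with numbers copied from the summary; the coin-flip reference is an added illustration.

```python
import math

n_obs = 176939               # "No. Observations"
mean_neg_loglik = 0.672390   # "Current function value": negative log-likelihood per observation
ll_null = -83125.0           # "LL-Null": intercept-only model

ll_model = -mean_neg_loglik * n_obs
pseudo_r2 = 1 - ll_model / ll_null

# Reference point: a model that predicts p = 0.5 for every loan.
ll_coin_flip = n_obs * math.log(0.5)

print(round(ll_model), round(pseudo_r2, 4), round(ll_coin_flip))
```

The fitted model sits only slightly above the coin-flip reference and well below the intercept-only model. A possible remedy, not run here, is to wrap the design matrix in sm.add_constant(...) before fitting; until then the p-values above are best read with some caution.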

Re-creating grade

We saw in the previous section that adding more features only slightly outperforms a model with grade alone. Therefore we look at which features are predictive of grade and are important in that way. First a Logistic Regression model is trained to predict the grades. It predicts almost everything as grade A. The other grades are not predicted well either, except for G, but only 2 loans are predicted as G. Random Forest performs somewhat better. It also predicts most loans as A, but there is some promising coloring on the diagonal of the confusion matrix plot (predicting the right grade), and the precision for the grades other than A is around 0.8. Feature importance in Random Forest, as implemented by sklearn, is the total decrease in node impurity, weighted by the probability of reaching that node (approximated by the proportion of samples reaching it) and averaged over all trees of the ensemble. The most important features are: revolving line utilization rate (the amount of credit used compared to all available credit), installment (monthly payment), revolving balance (total revolving credit balance), loan amount and debt-to-income. So grade seems to be based mostly on the amount borrowed (loan_amnt and installment) and the debt the borrower already has (revol_util, revol_bal, dti). It makes sense that these are important. Nevertheless, our recreated model is by far not as good as the one of Lending Club. If we leave out loans with grade A, then our algorithm simply predicts most loans as grade B, while the remaining grades are predicted a lot more accurately. Either Lending Club trained a much better algorithm, or Lending Club does not make public all the loan characteristics it uses, or both.
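
The impurity-based importance described above can be illustrated for a single split. The sketch assumes the Gini criterion (the sklearn default for classification) and uses made-up class counts; the function names are illustrative. sklearn sums this quantity over all nodes that split on a feature, normalises the result so that the importances of a tree sum to 1, and averages over the trees.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (n_parent / n_total) * (
        gini(parent)
        - (n_left / n_parent) * gini(left)
        - (n_right / n_parent) * gini(right)
    )


# Toy split at the root of a tree: 100 samples, two classes (made-up counts).
decrease = weighted_impurity_decrease(
    parent=[50, 50], left=[40, 10], right=[10, 40], n_total=100
)
print(round(decrease, 2))
```

A split deeper in the tree reaches fewer samples, so the same drop in impurity counts for less there than at the root.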


In [20]:
# split data in train (70%) and test set (30%) stratify by loan_status
X_train, X_test, y_train, y_test = train_test_split(closed_loans_predict.drop(['grade', 'sub_grade', 'int_rate', 'roi', 'loan_status']
                                                                      , axis=1), 
                                                    closed_loans['grade'], test_size=0.3, 
                                                    random_state=123, stratify=closed_loans['loan_status'])

# scaling and normalizing the features (scaler is fitted on the train set only)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)

# scale test set with scaling used in train set
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# binarize the labels for multiclass onevsall prediction
lb = LabelBinarizer()
grades = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
lb.fit(grades)
y_train_2 = lb.transform(y_train)

In [21]:
# Logistic Regression predicting grade from the other features (excluding interest rate and subgrade)
clf = OneVsRestClassifier(LogisticRegression(penalty='l1', solver='liblinear'))
predict_y = clf.fit(X_train_scaled, y_train_2).predict(X_test_scaled)
predict_y = lb.inverse_transform(predict_y)

# confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test, dtype='<U1'), predict_y)
confusion_matrix.plot()  
confusion_matrix.print_stats()


Confusion Matrix:

Predicted      A   B    C    D    E   F  G  __all__
Actual                                             
A          12790   5    3    0    0   0  0    12798
B          22786   7   20    3    0   0  0    22816
C          19439   0   43   27    0   0  0    19509
D          12017   1   42   96    2   0  0    12158
E           5451   0   16  237   16   4  0     5724
F           1836   0    5  297  105   5  0     2248
G            341   0    4  133   97   2  2      579
__all__    74660  13  133  793  220  11  2    75832


Overall Statistics:

Accuracy: 0.170890916763
95% CI: (0.16821786155214871, 0.1735891758885294)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.00280065900501
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                          A            B            C  \
Population                                   75832        75832        75832   
P: Condition positive                        12798        22816        19509   
N: Condition negative                        63034        53016        56323   
Test outcome positive                        74660           13          133   
Test outcome negative                         1172        75819        75699   
TP: True Positive                            12790            7           43   
TN: True Negative                             1164        53010        56233   
FP: False Positive                           61870            6           90   
FN: False Negative                               8        22809        19466   
TPR: (Sensitivity, hit rate, recall)      0.999375  0.000306802   0.00220411   
TNR=SPC: (Specificity)                   0.0184662     0.999887     0.998402   
PPV: Pos Pred Value (Precision)            0.17131     0.538462     0.323308   
NPV: Neg Pred Value                       0.993174     0.699165      0.74285   
FPR: False-out                            0.981534  0.000113173   0.00159793   
FDR: False Discovery Rate                  0.82869     0.461538     0.676692   
FNR: Miss Rate                         0.000625098     0.999693     0.997796   
ACC: Accuracy                             0.184012     0.699138     0.742114   
F1 score                                  0.292483  0.000613255   0.00437837   
MCC: Matthews correlation coefficient    0.0541718   0.00678317   0.00633278   
Informedness                             0.0178411  0.000193629  0.000606185   
Markedness                                0.164484     0.237627    0.0661582   
Prevalence                                0.168768     0.300876     0.257266   
LR+: Positive likelihood ratio             1.01818       2.7109      1.37936   
LR-: Negative likelihood ratio           0.0338509     0.999806     0.999393   
DOR: Diagnostic odds ratio                 30.0783      2.71143       1.3802   
FOR: False omission rate                0.00682594     0.300835      0.25715   

Classes                                         D            E            F  \
Population                                  75832        75832        75832   
P: Condition positive                       12158         5724         2248   
N: Condition negative                       63674        70108        73584   
Test outcome positive                         793          220           11   
Test outcome negative                       75039        75612        75821   
TP: True Positive                              96           16            5   
TN: True Negative                           62977        69904        73578   
FP: False Positive                            697          204            6   
FN: False Negative                          12062         5708         2243   
TPR: (Sensitivity, hit rate, recall)   0.00789604   0.00279525    0.0022242   
TNR=SPC: (Specificity)                   0.989054      0.99709     0.999918   
PPV: Pos Pred Value (Precision)          0.121059    0.0727273     0.454545   
NPV: Neg Pred Value                      0.839257     0.924509     0.970417   
FPR: False-out                          0.0109464    0.0029098  8.15395e-05   
FDR: False Discovery Rate                0.878941     0.927273     0.545455   
FNR: Miss Rate                           0.992104     0.997205     0.997776   
ACC: Accuracy                            0.831746     0.922038     0.970342   
F1 score                                0.0148251   0.00538358   0.00442674   
MCC: Matthews correlation coefficient  -0.0110022  -0.00056262    0.0301753   
Informedness                          -0.00305035 -0.000114548   0.00214266   
Markedness                             -0.0396838  -0.00276339     0.424963   
Prevalence                               0.160328    0.0754826    0.0296445   
LR+: Positive likelihood ratio           0.721337     0.960634      27.2776   
LR-: Negative likelihood ratio            1.00308      1.00011     0.997857   
DOR: Diagnostic odds ratio                0.71912     0.960523      27.3362   
FOR: False omission rate                 0.160743    0.0754907    0.0295828   

Classes                                         G  
Population                                  75832  
P: Condition positive                         579  
N: Condition negative                       75253  
Test outcome positive                           2  
Test outcome negative                       75830  
TP: True Positive                               2  
TN: True Negative                           75253  
FP: False Positive                              0  
FN: False Negative                            577  
TPR: (Sensitivity, hit rate, recall)   0.00345423  
TNR=SPC: (Specificity)                          1  
PPV: Pos Pred Value (Precision)                 1  
NPV: Neg Pred Value                      0.992391  
FPR: False-out                                  0  
FDR: False Discovery Rate                       0  
FNR: Miss Rate                           0.996546  
ACC: Accuracy                            0.992391  
F1 score                               0.00688468  
MCC: Matthews correlation coefficient   0.0585487  
Informedness                           0.00345423  
Markedness                               0.992391  
Prevalence                              0.0076353  
LR+: Positive likelihood ratio                inf  
LR-: Negative likelihood ratio           0.996546  
DOR: Diagnostic odds ratio                    inf  
FOR: False omission rate               0.00760913  
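
One possible contributor to the very large number of 'A' predictions is the way the predictions are decoded. When OneVsRestClassifier is fitted on a binarized label matrix, it treats the task as multilabel, and predict returns a 0/1 indicator row per loan that can be all zeros when none of the seven per-grade classifiers fires. LabelBinarizer.inverse_transform then decodes each row with an argmax, so an all-zero row comes out as the first class, 'A'. The sketch below shows this decoding step on a hypothetical indicator matrix; it is not a rerun of the models above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

grades = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
lb = LabelBinarizer().fit(grades)

# Hypothetical one-vs-rest output for three loans:
# row 0: no per-grade classifier fired, row 1: only 'C', row 2: only 'F'.
indicator = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
])

decoded = lb.inverse_transform(indicator)
print(decoded)
```

If this applies here, fitting OneVsRestClassifier directly on the one-dimensional grade labels (y_train) would make it pick the grade with the highest score for every loan instead. That is a suggestion and has not been verified on this data.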

In [22]:
# Random Forest predicting grade from the other features (excluding interest rate and subgrade)
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
predict_y = clf.fit(X_train, y_train_2).predict(X_test)
predict_y = lb.inverse_transform(predict_y)

# confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test, dtype='<U1'), predict_y)
confusion_matrix.plot()
confusion_matrix.print_stats()

# important features
features = []
for i,j in enumerate(grades):
    print('\n',j)
    feat_imp = clf.estimators_[i].feature_importances_
    positions = abs(feat_imp).argsort()[-5:][::-1]
    features.extend(list(X_train.columns[positions]))
    print(X_train.columns[positions])
    print(feat_imp[positions])

print(pd.Series(features).value_counts())


Confusion Matrix:

Predicted      A      B     C    D    E    F   G  __all__
Actual                                                   
A          12000    792     6    0    0    0   0    12798
B          13570   8912   329    5    0    0   0    22816
C          15533   1119  2798   59    0    0   0    19509
D          11156     84   425  426   66    1   0    12158
E           5078     17    21   91  501   16   0     5724
F           1929      1    10    9   57  241   1     2248
G            481      0     0    1    9   50  38      579
__all__    59747  10925  3589  591  633  308  39    75832


Overall Statistics:

Accuracy: 0.32856841439
95% CI: (0.32522543651930513, 0.33192451240839849)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.170563756245
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                        A          B         C  \
Population                                 75832      75832     75832   
P: Condition positive                      12798      22816     19509   
N: Condition negative                      63034      53016     56323   
Test outcome positive                      59747      10925      3589   
Test outcome negative                      16085      64907     72243   
TP: True Positive                          12000       8912      2798   
TN: True Negative                          15287      51003     55532   
FP: False Positive                         47747       2013       791   
FN: False Negative                           798      13904     16711   
TPR: (Sensitivity, hit rate, recall)    0.937647   0.390603  0.143421   
TNR=SPC: (Specificity)                   0.24252    0.96203  0.985956   
PPV: Pos Pred Value (Precision)         0.200847   0.815744  0.779604   
NPV: Neg Pred Value                     0.950389   0.785786  0.768683   
FPR: False-out                           0.75748  0.0379697  0.014044   
FDR: False Discovery Rate               0.799153   0.184256  0.220396   
FNR: Miss Rate                         0.0623535   0.609397  0.856579   
ACC: Accuracy                           0.359835   0.790102    0.7692   
F1 score                                0.330829   0.528259  0.242272   
MCC: Matthews correlation coefficient   0.165068   0.460564  0.266338   
Informedness                            0.180166   0.352633  0.129377   
Markedness                              0.151235    0.60153  0.548288   
Prevalence                              0.168768   0.300876  0.257266   
LR+: Positive likelihood ratio           1.23785    10.2872   10.2123   
LR-: Negative likelihood ratio          0.257107   0.633449   0.86878   
DOR: Diagnostic odds ratio               4.81454    16.2401   11.7547   
FOR: False omission rate               0.0496114   0.214214  0.231317   

Classes                                         D           E            F  \
Population                                  75832       75832        75832   
P: Condition positive                       12158        5724         2248   
N: Condition negative                       63674       70108        73584   
Test outcome positive                         591         633          308   
Test outcome negative                       75241       75199        75524   
TP: True Positive                             426         501          241   
TN: True Negative                           63509       69976        73517   
FP: False Positive                            165         132           67   
FN: False Negative                          11732        5223         2007   
TPR: (Sensitivity, hit rate, recall)    0.0350387   0.0875262     0.107206   
TNR=SPC: (Specificity)                   0.997409    0.998117     0.999089   
PPV: Pos Pred Value (Precision)          0.720812    0.791469     0.782468   
NPV: Neg Pred Value                      0.844074    0.930544     0.973426   
FPR: False-out                         0.00259132  0.00188281  0.000910524   
FDR: False Discovery Rate                0.279188    0.208531     0.217532   
FNR: Miss Rate                           0.964961    0.912474     0.892794   
ACC: Accuracy                            0.843114    0.929383      0.97265   
F1 score                                0.0668288    0.157622     0.188576   
MCC: Matthews correlation coefficient    0.135385    0.248668     0.283458   
Informedness                            0.0324473   0.0856434     0.106296   
Markedness                               0.564887    0.722013     0.755893   
Prevalence                               0.160328   0.0754826    0.0296445   
LR+: Positive likelihood ratio            13.5215      46.487      117.741   
LR-: Negative likelihood ratio           0.967468    0.914195     0.893607   
DOR: Diagnostic odds ratio                13.9762     50.8502       131.76   
FOR: False omission rate                 0.155926   0.0694557    0.0265743   

Classes                                          G  
Population                                   75832  
P: Condition positive                          579  
N: Condition negative                        75253  
Test outcome positive                           39  
Test outcome negative                        75793  
TP: True Positive                               38  
TN: True Negative                            75252  
FP: False Positive                               1  
FN: False Negative                             541  
TPR: (Sensitivity, hit rate, recall)     0.0656304  
TNR=SPC: (Specificity)                    0.999987  
PPV: Pos Pred Value (Precision)           0.974359  
NPV: Neg Pred Value                       0.992862  
FPR: False-out                         1.32885e-05  
FDR: False Discovery Rate                 0.025641  
FNR: Miss Rate                             0.93437  
ACC: Accuracy                             0.992853  
F1 score                                  0.122977  
MCC: Matthews correlation coefficient     0.251925  
Informedness                             0.0656171  
Markedness                                0.967221  
Prevalence                               0.0076353  
LR+: Positive likelihood ratio             4938.88  
LR-: Negative likelihood ratio            0.934382  
DOR: Diagnostic odds ratio                 5285.72  
FOR: False omission rate                0.00713786  

 A
Index(['revol_util', 'installment', 'loan_amnt', 'revol_bal',
       'days_since_first_credit_line'],
      dtype='object')
[ 0.17681354  0.10284015  0.08882419  0.07493068  0.06483961]

 B
Index(['installment', 'loan_amnt', 'revol_util', 'revol_bal', 'dti'], dtype='object')
[ 0.14983864  0.10604658  0.09345291  0.07722774  0.07306397]

 C
Index(['installment', 'revol_util', 'loan_amnt', 'revol_bal', 'dti'], dtype='object')
[ 0.13479105  0.09429534  0.09103329  0.0834632   0.08130225]

 D
Index(['installment', 'revol_util', 'revol_bal', 'loan_amnt', 'dti'], dtype='object')
[ 0.11722756  0.09791252  0.08453023  0.08349304  0.08331951]

 E
Index(['installment', 'loan_amnt', 'revol_util', 'dti', 'revol_bal'], dtype='object')
[ 0.14337105  0.09633851  0.08293736  0.07429641  0.07275197]

 F
Index(['installment', 'loan_amnt', 'revol_util', 'revol_bal',
       'days_since_first_credit_line'],
      dtype='object')
[ 0.16984233  0.11352206  0.07555623  0.06837688  0.06797365]

 G
Index(['installment', 'loan_amnt', 'revol_util', 'revol_bal',
       'days_since_first_credit_line'],
      dtype='object')
[ 0.17983861  0.10674825  0.08072958  0.07216386  0.07173274]
loan_amnt                       7
revol_util                      7
revol_bal                       7
installment                     7
dti                             4
days_since_first_credit_line    3
dtype: int64

In [23]:
# Excluding loans with grade A
# Split the data into a train (70%) and a test set (30%), stratified by loan_status
not_A = closed_loans['grade'] != 'A'
no_A_loans = closed_loans_predict[not_A]
X_train, X_test, y_train, y_test = train_test_split(
    no_A_loans.drop(['grade', 'sub_grade', 'int_rate', 'roi', 'loan_status'], axis=1),
    closed_loans.loc[not_A, 'grade'],
    test_size=0.3, random_state=123,
    stratify=closed_loans.loc[not_A, 'loan_status'])

# standardize the features (zero mean, unit variance); the scaler is fitted on the train set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)

# scale test set with scaling used in train set
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# binarize the labels for multiclass one-vs-rest prediction
lb = LabelBinarizer()
grades = ['B', 'C', 'D', 'E', 'F', 'G']
lb.fit(grades)
y_train_2 = lb.transform(y_train)

In [24]:
# Excluding loans with grade A
# Random Forest predicting grade from the other features (excluding interest rate and subgrade).
# The unscaled features are used here: tree-based models do not depend on feature scaling.
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
predict_y = clf.fit(X_train, y_train_2).predict(X_test)
predict_y = lb.inverse_transform(predict_y)

# confusion matrix
confusion_matrix = ConfusionMatrix(np.array(y_test, dtype='<U1'), predict_y)
confusion_matrix.plot()
confusion_matrix.print_stats()

# five most important features of each one-vs-rest estimator (one estimator per grade)
features = []
for i, grade in enumerate(grades):
    print('\n', grade)
    feat_imp = clf.estimators_[i].feature_importances_
    positions = feat_imp.argsort()[-5:][::-1]  # indices of the top 5, largest first
    features.extend(list(X_train.columns[positions]))
    print(X_train.columns[positions])
    print(feat_imp[positions])

# how often each feature appears in a top five
print(pd.Series(features).value_counts())


Confusion Matrix:

Predicted      B     C    D    E    F   G  __all__
Actual                                            
B          22348   460    9    0    0   0    22817
C          15715  3920   58    0    0   0    19693
D          10881   567  455   65    1   0    11969
E           5147    37  104  483   21   0     5792
F           1900     9   13   77  248   1     2248
G            513     1    2    8   51  52      627
__all__    56504  4994  641  633  321  53    63146


Overall Statistics:

Accuracy: 0.435593703481
95% CI: (0.43172148330031196, 0.43947184369710834)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.130308247444
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                        B          C           D  \
Population                                 63146      63146       63146   
P: Condition positive                      22817      19693       11969   
N: Condition negative                      40329      43453       51177   
Test outcome positive                      56504       4994         641   
Test outcome negative                       6642      58152       62505   
TP: True Positive                          22348       3920         455   
TN: True Negative                           6173      42379       50991   
FP: False Positive                         34156       1074         186   
FN: False Negative                           469      15773       11514   
TPR: (Sensitivity, hit rate, recall)    0.979445   0.199056   0.0380149   
TNR=SPC: (Specificity)                  0.153066   0.975284    0.996366   
PPV: Pos Pred Value (Precision)         0.395512   0.784942    0.709828   
NPV: Neg Pred Value                     0.929389   0.728763    0.815791   
FPR: False-out                          0.846934  0.0247164  0.00363445   
FDR: False Discovery Rate               0.604488   0.215058    0.290172   
FNR: Miss Rate                         0.0205548   0.800944    0.961985   
ACC: Accuracy                           0.451668   0.733206    0.814715   
F1 score                                0.563483   0.317576   0.0721649   
MCC: Matthews correlation coefficient   0.207492   0.299264    0.134428   
Informedness                            0.132511   0.174339   0.0343804   
Markedness                              0.324901   0.513704    0.525619   
Prevalence                              0.361337   0.311865    0.189545   
LR+: Positive likelihood ratio           1.15646    8.05359     10.4596   
LR-: Negative likelihood ratio          0.134287   0.821243    0.965494   
DOR: Diagnostic odds ratio               8.61182    9.80659     10.8334   
FOR: False omission rate               0.0706113   0.271237    0.184209   

Classes                                         E           F            G  
Population                                  63146       63146        63146  
P: Condition positive                        5792        2248          627  
N: Condition negative                       57354       60898        62519  
Test outcome positive                         633         321           53  
Test outcome negative                       62513       62825        63093  
TP: True Positive                             483         248           52  
TN: True Negative                           57204       60825        62518  
FP: False Positive                            150          73            1  
FN: False Negative                           5309        2000          575  
TPR: (Sensitivity, hit rate, recall)    0.0833909     0.11032    0.0829346  
TNR=SPC: (Specificity)                   0.997385    0.998801     0.999984  
PPV: Pos Pred Value (Precision)          0.763033    0.772586     0.981132  
NPV: Neg Pred Value                      0.915074    0.968166     0.990886  
FPR: False-out                         0.00261534  0.00119873  1.59951e-05  
FDR: False Discovery Rate                0.236967    0.227414    0.0188679  
FNR: Miss Rate                           0.916609     0.88968     0.917065  
ACC: Accuracy                             0.91355    0.967171     0.990878  
F1 score                                  0.15035    0.193071     0.152941  
MCC: Matthews correlation coefficient    0.234039     0.28431     0.283899  
Informedness                            0.0807755    0.109122    0.0829186  
Markedness                               0.678107    0.740751     0.972019  
Prevalence                              0.0917239      0.0356   0.00992937  
LR+: Positive likelihood ratio            31.8853     92.0313      5184.99  
LR-: Negative likelihood ratio           0.919013    0.890747      0.91708  
DOR: Diagnostic odds ratio                34.6952     103.319       5653.8  
FOR: False omission rate                0.0849263   0.0318345   0.00911353  

 B
Index(['installment', 'loan_amnt', 'revol_util', 'term', 'revol_bal'], dtype='object')
[ 0.12828011  0.10228995  0.09732885  0.08131599  0.0735484 ]

 C
Index(['installment', 'loan_amnt', 'revol_util', 'revol_bal', 'dti'], dtype='object')
[ 0.15252059  0.09933147  0.08400224  0.08172237  0.08067989]

 D
Index(['installment', 'revol_util', 'loan_amnt', 'revol_bal', 'dti'], dtype='object')
[ 0.12041984  0.09196381  0.08514548  0.08475313  0.08426566]

 E
Index(['installment', 'loan_amnt', 'revol_util', 'dti', 'revol_bal'], dtype='object')
[ 0.15202173  0.09869723  0.08063187  0.0742424   0.07353761]

 F
Index(['installment', 'loan_amnt', 'revol_util', 'revol_bal',
       'days_since_first_credit_line'],
      dtype='object')
[ 0.1754852   0.11718264  0.07458349  0.06910261  0.06706289]

 G
Index(['installment', 'loan_amnt', 'revol_util', 'revol_bal', 'dti'], dtype='object')
[ 0.17471817  0.10992299  0.07993324  0.07407731  0.07172008]
installment                     6
loan_amnt                       6
revol_util                      6
revol_bal                       6
dti                             4
days_since_first_credit_line    1
term                            1
dtype: int64
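
The per-class statistics printed above can be re-derived by hand from the confusion matrix of the model without grade A. The sketch below is only an illustration: it copies the counts from the output of In [24] and recomputes precision, recall and overall accuracy in plain Python. The helper name `per_class_metrics` is hypothetical and not part of `pandas_confusion`.

```python
# Confusion matrix copied from the output of In [24]
# (rows = actual grade, columns = predicted grade).
grades = ['B', 'C', 'D', 'E', 'F', 'G']
cm = [
    [22348,  460,   9,   0,   0,  0],
    [15715, 3920,  58,   0,   0,  0],
    [10881,  567, 455,  65,   1,  0],
    [ 5147,   37, 104, 483,  21,  0],
    [ 1900,    9,  13,  77, 248,  1],
    [  513,    1,   2,   8,  51, 52],
]


def per_class_metrics(matrix):
    """Return (precision, recall) lists, one value per class (hypothetical helper)."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]                       # actual counts
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]  # predicted counts
    precision = [matrix[i][i] / col_totals[i] if col_totals[i] else float('nan')
                 for i in range(n)]
    recall = [matrix[i][i] / row_totals[i] if row_totals[i] else float('nan')
              for i in range(n)]
    return precision, recall


precision, recall = per_class_metrics(cm)
# overall accuracy = correctly classified loans / all loans
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

for grade, p, r in zip(grades, precision, recall):
    print(f'{grade}: precision={p:.3f} recall={r:.3f}')
print(f'accuracy={accuracy:.3f}')
```

Read this way, the high precision for grades C to G goes together with low recall: the classifier assigns most loans to grade B and only rarely predicts the other grades.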

Discussion

Lending Club provides an interesting opportunity to invest in loans, which could be a more reliable alternative to the stock market. They claim that 4% of their loans default, but we found that of the loans in this dataset that ran to full term, 18% ended in default. Furthermore, they offer a way to foresee which loans will end up in default by grading the loans (A-G). Riskier loans (grades further from A) usually get higher interest rates, in order to keep them attractive to the investor. We found that, on average, only investing in loans with grades A-C is profitable, while loans in the other grades on average end up costing the investor money. This is because when a loan ends in default, not only the remaining interest but also the outstanding principal is lost.

Some characteristics of the loans, such as debt-to-income, interest rate and home ownership, visually seem to influence the default risk, but grade seems the most important. Therefore Logistic Regression and Random Forest classifiers were trained to see whether adding more features predicts default better than grade alone. Adding more features increased the performance only slightly, and Logistic Regression with all features performed best. The most important features were interest rate, annual income, subgrade, term and debt-to-income; of these, subgrade was the only one that was not significant. A higher annual income was found to give a lower chance of default, while a higher interest rate, a longer term and a higher debt-to-income give a higher chance of default, which seems logical.

To see whether other features are also important but already incorporated in grade, we also built multi-class classifiers to predict grade. Here Random Forest performed better than Logistic Regression, but both predicted mostly grade A. With Random Forest the precision for the other grades was around 0.8, which is not bad. The most important features for Random Forest were revolving line utilization rate (the amount of credit used compared to all credit), installment (monthly payment), revolving balance (all credit), loan amount and debt-to-income: features that reflect the amount borrowed and the debt the borrower already has. Nevertheless, it seems that Lending Club either has a much better algorithm or does not publish all the features it uses, or both.
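
The profitability argument in this section (a default also costs the investor outstanding principal, so the interest rate has to compensate for it) can be illustrated with a deliberately simplified calculation. The sketch below is not the ROI computation used in this notebook. It assumes a hypothetical one-period loan with a single repayment, and a `recovery` fraction of the principal returned on default; the function names are illustrative only.

```python
def expected_roi(interest_rate, default_prob, recovery=0.0):
    """Expected return per unit invested in a one-period loan.

    Assumptions (not taken from the notebook): a single repayment of
    principal plus interest, and a defaulted loan pays back only
    `recovery` times the principal.
    """
    repaid = (1.0 - default_prob) * (1.0 + interest_rate)
    defaulted = default_prob * recovery
    return repaid + defaulted - 1.0


def break_even_rate(default_prob, recovery=0.0):
    """Interest rate at which expected_roi is exactly zero."""
    return (1.0 - default_prob * recovery) / (1.0 - default_prob) - 1.0


# Default rates quoted in the text: 4% (claimed) versus 18% (found in this dataset).
for p in (0.04, 0.18):
    print(f'default rate {p:.0%}: break-even interest {break_even_rate(p):.1%}')
```

Under these assumptions, with no recovery, an 18% default rate requires roughly 22% interest just to break even, compared with about 4% interest at a 4% default rate. Real loans are repaid in installments, so part of the principal is usually recovered before a default and the true break-even rate is lower.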

Our best algorithm had an AUC of 0.71, which is comparable to the AUC scores found in other studies (O'Rourke, 2016; Wu, 2014). Chang et al. (2015) did not provide AUC scores and report much higher values on other metrics, but they included current loans, so the performances are not comparable. In this project we chose not to include current loans, because while a loan is still ongoing you cannot know whether it will eventually end up in default; including current loans might therefore paint too rosy a picture. Other studies (O'Rourke, 2016; Wu, 2014) dealt with the unbalanced classes by rebalancing the dataset, while we did not. Nevertheless, their accuracy scores are not better than the ones from this project, so it does not seem to matter much, although this could be checked in future research. We chose to transform categorical variables to numerical ones in a more elaborate way than the Boolean encoding used by others, because we wanted to keep the order present in the categorical data or, for instance, to incorporate a geographical order. This does not seem to have influenced the results, but it could also be checked in future research. Previous studies (Caselli et al., 2008; Chang et al., 2015) found that adding external datasets increased performance. We did not look into this, but it might also be done in the future; however, we wanted to keep the advice to investors as simple as possible, and external datasets complicate this. Furthermore, other studies found the following features to be important for the prediction of default: grade, subgrade, interest rate, term, annual income, revolving line utilization rate, FICO score and debt-to-income (O'Rourke, 2016; Wu, 2014). This project found the same features, except for the FICO score, which was not in our dataset.
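
Rebalancing, mentioned above as something other studies did and this project did not, can take several forms. The sketch below shows one common option, random undersampling of the majority class. It is an assumption for illustration: the exact method used in the cited studies is not stated here, and the `undersample` helper and the toy labels are hypothetical. In practice it would be applied to the training split only, so that the test set keeps the real class proportions.

```python
import random


def undersample(labels, seed=123):
    """Return sorted row indices with every class downsampled to the minority size.

    Hypothetical helper: picks, without replacement, as many rows from each
    class as there are rows in the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    smallest = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, smallest))
    return sorted(keep)


# Toy labels with an 18% default rate (1 = default, 0 = fully paid).
labels = [1] * 18 + [0] * 82
kept = undersample(labels)
balanced = [labels[i] for i in kept]
print(len(balanced), sum(balanced) / len(balanced))
```

The kept indices can then be used to select the matching rows of the feature table before fitting a classifier.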

Besides the points mentioned in the paragraph above, a promising avenue for the future might be to use our algorithm and keep only the loans that it predicts, with high probability, will not default. We might take the top 25%, or the same number of loans as there are in grade A, in order to compare the loans we consider least likely to default with those of Lending Club. A couple of things that can be checked are: is the number of loans that end up in default comparable to that of grade A loans, is the average return on investment of these loans higher than that of grade A loans, and which features differ in these loans compared to the rest of the loans and to grade A loans. For now we recommend that investors only invest in grades A and B, since these give the best return on investment (around 4.5%), despite the higher interest rates on the riskier loans. We strongly advise against investing in the higher-risk loans, since the default rates are a lot higher than Lending Club wants investors to think. Grade is the most important feature to pay attention to, but in addition investors might select loans with shorter terms and borrowers with high annual incomes and low debts who borrow small amounts of money.
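
The proposed comparison can be sketched as follows: rank the loans by predicted default probability, keep the safest 25%, and compare that selection with the grade A loans. The records below are made-up toy data, and the field names (`p_default`, `defaulted`, `roi`) and helper functions are hypothetical. In the notebook the probabilities would come from the fitted classifier, for example via `predict_proba`.

```python
def summarize(loans):
    """Number of loans, observed default rate and mean ROI of a selection."""
    n = len(loans)
    return {
        'n': n,
        'default_rate': sum(loan['defaulted'] for loan in loans) / n,
        'mean_roi': sum(loan['roi'] for loan in loans) / n,
    }


def safest_fraction(loans, fraction=0.25):
    """Keep the given fraction of loans with the lowest predicted default probability."""
    ranked = sorted(loans, key=lambda loan: loan['p_default'])
    n_top = max(1, int(len(ranked) * fraction))
    return ranked[:n_top]


# Made-up example records, for illustration only.
loans = [
    {'grade': 'A', 'p_default': 0.05, 'defaulted': 0, 'roi': 0.06},
    {'grade': 'A', 'p_default': 0.10, 'defaulted': 0, 'roi': 0.05},
    {'grade': 'B', 'p_default': 0.04, 'defaulted': 0, 'roi': 0.08},
    {'grade': 'B', 'p_default': 0.20, 'defaulted': 1, 'roi': -0.40},
    {'grade': 'C', 'p_default': 0.30, 'defaulted': 0, 'roi': 0.11},
    {'grade': 'C', 'p_default': 0.35, 'defaulted': 1, 'roi': -0.55},
    {'grade': 'D', 'p_default': 0.45, 'defaulted': 1, 'roi': -0.60},
    {'grade': 'E', 'p_default': 0.50, 'defaulted': 0, 'roi': 0.15},
]

top = summarize(safest_fraction(loans, 0.25))
grade_a = summarize([loan for loan in loans if loan['grade'] == 'A'])
print('model top 25%:', top)
print('grade A      :', grade_a)
```

Passing `fraction=len(grade_a_loans) / len(loans)` instead of 0.25 would give the second variant mentioned in the text, a selection of the same size as grade A.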

References

Caselli, Stefano, Stefano Gatti, and Francesca Querci. "The sensitivity of the loss given default rate to systematic risk: new empirical evidence on bank loans." Journal of Financial Services Research 34.1 (2008): 1-34.

Chang, Shunpo, Kim, Simon Dae-oong and Kondo, Genki. "Predicting Default Risk of Lending Club Loans." CS229, Oct 2015. http://cs229.stanford.edu/proj2015/199_report.pdf

O’Rourke, Ted. "Lending Club - Predicting Loan Outcomes." June 2016. https://rpubs.com/torourke97/190551

Pandey, Jitendra Nath and Srinivasan, Maheshwaran. "Predicting Probability of Loan Default." Dec 2014. http://cs229.stanford.edu/proj2011/PandeySrinivasan-PredictingProbabilityOfLoanDefault.pdf

Tsai, Kevin, Ramiah, Sivagami and Singh, Sudhanshu. "Peer Lending Risk Predictor."

Wu, Jiayu. "Loan Default Prediction Using Lending Club Data." Dec 2014. http://www.wujiayu.me/assets/projects/loan-default-prediction-Jiayu-Wu.pdf