Identifying safe loans with decision trees

The LendingClub is a peer-to-peer leading company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be charged off and possibly go into default. In this assignment you will:

  • Use SFrames to do some feature engineering.
  • Train a decision-tree on the LendingClub dataset.
  • Visualize the tree.
  • Predict whether a loan will default along with prediction probabilities (on a validation set).
  • Train a complex tree model and compare it to simple tree model.

Let's get started!

Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create. If you don't find the decision tree module, then you would need to upgrade GraphLab Create using

   pip install graphlab-create --upgrade

In [1]:
import pandas as pd

Load LendingClub dataset

We will be using a dataset from the LendingClub. A parsed and cleaned form of the dataset is availiable here. Make sure you download the dataset before running the following command.


In [2]:
loans = pd.read_csv('../../data/lending-club-data.csv')


C:\Users\rohit\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (19,47) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Exploring some features

Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset.


In [3]:
loans.columns


Out[3]:
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans', 'bad_loans',
       'emp_length_num', 'grade_num', 'sub_grade_num', 'delinq_2yrs_zero',
       'pub_rec_zero', 'collections_12_mths_zero', 'short_emp',
       'payment_inc_ratio', 'final_d', 'last_delinq_none', 'last_record_none',
       'last_major_derog_none'],
      dtype='object')

Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.


In [4]:
loans['grade']


Out[4]:
0         B
1         C
2         C
3         C
4         A
5         E
6         F
7         B
8         C
9         B
10        C
11        B
12        B
13        D
14        C
15        A
16        B
17        A
18        B
19        B
20        B
21        B
22        B
23        C
24        D
25        B
26        B
27        A
28        A
29        A
         ..
122577    A
122578    F
122579    B
122580    B
122581    C
122582    D
122583    C
122584    C
122585    D
122586    E
122587    C
122588    A
122589    C
122590    B
122591    C
122592    B
122593    E
122594    C
122595    C
122596    D
122597    E
122598    A
122599    G
122600    D
122601    B
122602    E
122603    D
122604    D
122605    D
122606    A
Name: grade, dtype: object

We can see that over half of the loan grades are assigned values B or C. Each loan is assigned one of these grades, along with a more finely discretized feature called sub_grade (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found here.

Now, let's look at a different feature.


In [5]:
loans['home_ownership']


Out[5]:
0             RENT
1             RENT
2             RENT
3             RENT
4             RENT
5             RENT
6              OWN
7             RENT
8              OWN
9              OWN
10            RENT
11            RENT
12            RENT
13            RENT
14            RENT
15        MORTGAGE
16        MORTGAGE
17            RENT
18            RENT
19             OWN
20            RENT
21            RENT
22        MORTGAGE
23            RENT
24            RENT
25        MORTGAGE
26            RENT
27        MORTGAGE
28        MORTGAGE
29            RENT
            ...   
122577        RENT
122578    MORTGAGE
122579    MORTGAGE
122580        RENT
122581        RENT
122582    MORTGAGE
122583        RENT
122584        RENT
122585         OWN
122586    MORTGAGE
122587    MORTGAGE
122588         OWN
122589    MORTGAGE
122590         OWN
122591    MORTGAGE
122592        RENT
122593    MORTGAGE
122594         OWN
122595    MORTGAGE
122596    MORTGAGE
122597    MORTGAGE
122598    MORTGAGE
122599    MORTGAGE
122600        RENT
122601    MORTGAGE
122602    MORTGAGE
122603    MORTGAGE
122604    MORTGAGE
122605    MORTGAGE
122606         OWN
Name: home_ownership, dtype: object

This feature describes whether the loanee is mortaging, renting, or owns a home. We can see that a small percentage of the loanees own a home.

Exploring the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column 1 means a risky (bad) loan 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

  • +1 as a safe loan,
  • -1 as a risky (bad) loan.

We put this in a new column called safe_loans.


In [6]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
del loans['bad_loans']

Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset.


In [7]:
loans.groupby(['safe_loans']).agg({'safe_loans':"count"}).apply(lambda x: 100 * x / len(loans))


Out[7]:
safe_loans
safe_loans
-1 18.881467
1 81.118533

You should have:

  • Around 81% safe loans
  • Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.

Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.


In [8]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquincy
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to day
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans_data = loans[features + [target]]

One Hot encoding


In [9]:
from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(loans_data.transpose().to_dict().values())
loans_data = pd.get_dummies(loans_data)

for column in loans_data.columns:
    loans_data[column] = loans_data[column].fillna(0)

Now, let's verify that the resulting percentage of safe and risky loans are each nearly 50%.

Split data into training and validation sets

We split the data into training and validation sets using an 80/20 split and specifying seed=1 so everyone gets the same results.

Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that examining performance of various potential models (i.e. models with different parameters) should be on validation set, while evaluation of the final selected model should always be on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.


In [18]:
#train_data, validation_data = loans_data.random_split(.8, seed=1)
import json
with open('../../data/module-5-assignment-1-train-idx.json') as json_data:
    train_idx = json.load(json_data)
    json_data.close()
train_data = loans_data.iloc[train_idx]

with open('../../data/module-5-assignment-1-validation-idx.json') as json_data:
    validation_idx = json.load(json_data)
    json_data.close()
validation_data = loans_data.iloc[validation_idx]

Use decision tree to build a classifier

Now, let's use the built-in GraphLab Create decision tree learner to create a loan prediction model on the training data. (In the next assignment, you will implement your own decision tree learning algorithm.) Our feature columns and target column have already been decided above. Use validation_set=None to get the same results as everyone else.


In [19]:
from sklearn.tree import DecisionTreeClassifier

label = train_data["safe_loans"]
del train_data["safe_loans"]

#print(label)
print(train_data.columns)
decision_tree_model = DecisionTreeClassifier(max_depth=6).fit(train_data.as_matrix(), label)
small_model = DecisionTreeClassifier(max_depth=2).fit(train_data.as_matrix(), label)
#target = train_data['safe_loans']
#train_data = train_data.drop('safe_loans', 1)
#print(train_data.columns)
#print(train_data.as_matrix)
#decision_tree_model =


Index(['short_emp', 'emp_length_num', 'dti', 'last_delinq_none',
       'last_major_derog_none', 'revol_util', 'total_rec_late_fee', 'grade_A',
       'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F', 'grade_G',
       'sub_grade_A1', 'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4',
       'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3',
       'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2',
       'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1',
       'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5',
       'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4',
       'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3',
       'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2',
       'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5',
       'home_ownership_MORTGAGE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'purpose_car', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_home_improvement',
       'purpose_house', 'purpose_major_purchase', 'purpose_medical',
       'purpose_moving', 'purpose_other', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'term_ 36 months',
       'term_ 60 months'],
      dtype='object')

Visualizing a learned model

As noted in the documentation, typically the max depth of the tree is capped at 6. However, such a tree can be hard to visualize graphically. Here, we instead learn a smaller model with max depth of 2 to gain some intuition by visualizing the learned tree.


In [20]:
import graphviz
from sklearn import tree
dotfile = open("dtree2.dot", 'w')
tree.export_graphviz(small_model, out_file = dotfile, feature_names = train_data.columns)
dotfile.close()

In the view that is provided by GraphLab Create, you can see each node, and each split at each node. This visualization is great for considering what happens when this model predicts the target of a new data point.

Note: To better understand this visual:

  • The root node is represented using pink.
  • Intermediate nodes are in green.
  • Leaf nodes in blue and orange.

In [21]:
with open("dtree2.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)


Out[21]:
Tree 0 grade_A <= 0.5 gini = 0.5 samples = 37224 value = [18476, 18748] 1 grade_B <= 0.5 gini = 0.4973 samples = 32094 value = [17218, 14876] 0->1 True 4 total_rec_late_fee <= 14.8301 gini = 0.3702 samples = 5130 value = [1258, 3872] 0->4 False 2 gini = 0.4829 samples = 21728 value = [12875, 8853] 1->2 3 gini = 0.4869 samples = 10366 value = [4343, 6023] 1->3 5 gini = 0.3555 samples = 4987 value = [1153, 3834] 4->5 6 gini = 0.3902 samples = 143 value = [105, 38] 4->6

Making predictions

Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:

  • Predict whether or not a loan is safe.
  • Predict the probability that a loan is safe.

In [27]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data


Out[27]:
short_emp emp_length_num dti last_delinq_none last_major_derog_none revol_util total_rec_late_fee safe_loans grade_A grade_B ... purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_small_business purpose_vacation purpose_wedding term_ 36 months term_ 60 months
19 0 11 11.18 1 1 82.4 0.0 1 0 1 ... 0 0 0 0 0 0 0 0 1 0
79 0 10 16.85 1 1 96.4 0.0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0
24 0 3 13.97 0 1 59.5 0.0 -1 0 0 ... 0 0 0 0 1 0 0 0 0 1
41 0 11 16.33 1 1 62.1 0.0 -1 1 0 ... 0 0 0 0 0 0 0 0 1 0

4 rows × 68 columns

Explore label predictions

Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan.

Hint: Be sure to use the .predict() method.


In [28]:
del sample_validation_data['safe_loans']
decision_tree_model.predict(sample_validation_data.as_matrix())


Out[28]:
array([ 1, -1, -1,  1], dtype=int64)

Quiz Question: What percentage of the predictions on sample_validation_data did decision_tree_model get correct?

Explore probability predictions

For each row in the sample_validation_data, what is the probability (according decision_tree_model) of a loan being classified as safe?

Hint: Set output_type='probability' to make probability predictions using decision_tree_model on sample_validation_data:


In [29]:
decision_tree_model.predict_proba(sample_validation_data.as_matrix())


Out[29]:
array([[ 0.34156543,  0.65843457],
       [ 0.53630646,  0.46369354],
       [ 0.64750958,  0.35249042],
       [ 0.20789474,  0.79210526]])

Quiz Question: Which loan has the highest probability of being classified as a safe loan?

Checkpoint: Can you verify that for all the predictions with probability >= 0.5, the model predicted the label +1?

Tricky predictions!

Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?

Hint: Set output_type='probability' to make probability predictions using small_model on sample_validation_data:


In [32]:
small_model.predict_proba(sample_validation_data.as_matrix())


Out[32]:
array([[ 0.41896585,  0.58103415],
       [ 0.59255339,  0.40744661],
       [ 0.59255339,  0.40744661],
       [ 0.23120112,  0.76879888]])

Quiz Question: Notice that the probability preditions are the exact same for the 2nd and 3rd loans. Why would this happen?

Visualize the prediction on a tree

Note that you should be able to look at the small tree, traverse it yourself, and visualize the prediction being made. Consider the following point in the sample_validation_data


In [ ]:
sample_validation_data[1]

Let's visualize the small tree here to do the traversing for this data point.


In [ ]:
small_model.show(view="Tree")

Note: In the tree visualization above, the values at the leaf nodes are not class predictions but scores (a slightly advanced concept that is out of the scope of this course). You can read more about this here. If the score is $\geq$ 0, the class +1 is predicted. Otherwise, if the score < 0, we predict class -1.

Quiz Question: Based on the visualized tree, what prediction would you make for this data point?

Now, let's verify your prediction by examining the prediction made using GraphLab Create. Use the .predict function on small_model.


In [33]:
small_model.predict(sample_validation_data.as_matrix())


Out[33]:
array([ 1, -1, -1,  1], dtype=int64)

Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows: $$ \mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}} $$

Let us start by evaluating the accuracy of the small_model and decision_tree_model on the training data


In [41]:
print(small_model.score(train_data.as_matrix(), label))
print(decision_tree_model.score(train_data.as_matrix(), label))


0.613502041694
0.640527616591

Checkpoint: You should see that the small_model performs worse than the decision_tree_model on the training data.

Now, let us evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.


In [42]:
validation_label = validation_data['safe_loans']
del validation_data['safe_loans']
print(small_model.score(validation_data.as_matrix(), validation_label))
print(decision_tree_model.score(validation_data.as_matrix(), validation_label))


0.619345109866
0.636148211978

Quiz Question: What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01?

Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.


In [43]:
big_model =  DecisionTreeClassifier(max_depth=10).fit(train_data.as_matrix(), label)

Now, let us evaluate big_model on the training set and validation set.


In [44]:
print(big_model.score(train_data.as_matrix(), label))
print(big_model.score(validation_data.as_matrix(), validation_label))


0.663738448313
0.626669538992

Checkpoint: We should see that big_model has even better performance on the training set than decision_tree_model did on the training set.

Quiz Question: How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?

Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try and quantify the cost of each mistake made by the model.

Assume the following:

  • False negatives: Loans that were actually safe but were predicted to be risky. This results in an oppurtunity cost of losing a loan that would have otherwise been accepted.
  • False positives: Loans that were actually risky but were predicted to be safe. These are much more expensive because it results in a risky loan being given.
  • Correct predictions: All correct predictions don't typically incur any cost.

Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:

  1. First, let us compute the predictions made by the model.
  2. Second, compute the number of false positives.
  3. Third, compute the number of false negatives.
  4. Finally, compute the cost of mistakes made by the model by adding up the costs of true positives and false positives.

First, let us make predictions on validation_data using the decision_tree_model:


In [45]:
predictions = decision_tree_model.predict(validation_data)

False positives are predictions where the model predicts +1 but the true label is -1. Complete the following code block for the number of false positives:


In [56]:
false_positives = 0
for i in range(0, len(predictions)):
    if predictions[i] == 1 and validation_label.iloc[i] == -1:
        false_postives = false_positives + 1

False negatives are predictions where the model predicts -1 but the true label is +1. Complete the following code block for the number of false negatives:


In [55]:
false_negatives = 0
for i in range(0, len(predictions)):
    if (predictions[i] == -1 and validation_label.iloc[i] == 1):
        false_negatives = false_negatives + 1

Quiz Question: Let us assume that each mistake costs money:

  • Assume a cost of \$10,000 per false negative.
  • Assume a cost of \$20,000 per false positive.

What is the total cost of mistakes made by decision_tree_model on validation_data?


In [57]:
false_positives * 20000  + false_negatives * 10000


Out[57]:
17170000

In [ ]: