Identifying safe loans with decision trees

LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be charged off and possibly go into default. In this assignment you will:

Use pandas DataFrames to do some feature engineering.
Train a decision tree on the LendingClub dataset.
Visualize the tree.
Predict whether a loan will default along with prediction probabilities (on a validation set).
Train a complex tree model and compare it to a simple tree model.

Let's get started!

Importing Libraries



In [42]:

    
import json
import numpy as np
import pandas as pd
import sklearn, sklearn.tree
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Load LendingClub Loans dataset

We will be using a dataset from the LendingClub. A parsed and cleaned form of the dataset is available here. Make sure you download the dataset before running the following command.



In [43]:

    
loans = pd.read_csv("lending-club-data_assign_1.csv")

Exploring some features

Let's quickly explore what the dataset looks like. First, let's look at the first few entries of the loans dataframe.



In [44]:

    
loans.head()









    Out[44]:






  
    
      
        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment  grade  sub_grade  ...  sub_grade_num  delinq_2yrs_zero  pub_rec_zero  collections_12_mths_zero  short_emp  payment_inc_ratio          final_d  last_delinq_none  last_record_none  last_major_derog_none
0  1077501    1296599       5000         5000             4975  36 months     10.65       162.87      B         B2  ...            0.4                 1             1                         1          0            8.14350  20141201T000000                 1                 1                      1
1  1077430    1314167       2500         2500             2500  60 months     15.27        59.83      C         C4  ...            0.8                 1             1                         1          1            2.39320  20161201T000000                 1                 1                      1
2  1077175    1313524       2400         2400             2400  36 months     15.96        84.33      C         C5  ...            1.0                 1             1                         1          0            8.25955  20141201T000000                 1                 1                      1
3  1076863    1277178      10000        10000            10000  36 months     13.49       339.31      C         C1  ...            0.2                 1             1                         1          0            8.27585  20141201T000000                 0                 1                      1
4  1075269    1311441       5000         5000             5000  36 months      7.90       156.46      A         A4  ...            0.8                 1             1                         1          0            5.21533  20141201T000000                 1                 1                      1
    
  

5 rows × 68 columns

Now, let's print out the column names to see what features we have in this dataset.



In [45]:

    
loans.columns.values









    Out[45]:





array(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'is_inc_v', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc',
       'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans',
       'bad_loans', 'emp_length_num', 'grade_num', 'sub_grade_num',
       'delinq_2yrs_zero', 'pub_rec_zero', 'collections_12_mths_zero',
       'short_emp', 'payment_inc_ratio', 'final_d', 'last_delinq_none',
       'last_record_none', 'last_major_derog_none'], dtype=object)

Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.



In [46]:

    
plt.figure(figsize=(10,6))
loans['grade'].value_counts().plot(kind='bar')
plt.tick_params(axis='x', labelsize=18)
plt.xticks(rotation='horizontal')
plt.tick_params(axis='y', labelsize=18)
plt.title("Histogram of Loan Grades", fontsize=18)
plt.xlabel("Loan Grades", fontsize=18)
plt.ylabel("Count", fontsize=18)









    Out[46]:





<matplotlib.text.Text at 0x1137a7c50>

We can see that over half of the loan grades are assigned values B or C. Each loan is assigned one of these grades, along with a more finely discretized feature called sub_grade (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found here.

Now, let's look at a different feature.



In [47]:

    
plt.figure(figsize=(10,6))
loans['home_ownership'].value_counts().plot(kind='bar')
plt.tick_params(axis='x', labelsize=18)
plt.xticks(rotation='horizontal')
plt.tick_params(axis='y', labelsize=18)
plt.title("Histogram of Home Ownership", fontsize=18)
plt.xlabel("Home Ownership Type", fontsize=18)
plt.ylabel("Count", fontsize=18)









    Out[47]:





<matplotlib.text.Text at 0x1137a7950>

This feature describes whether the loanee is mortgaging, renting, or owns a home. We can see that a small percentage of the loanees own a home.

Exploring the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

+1 as a safe loan,
-1 as a risky (bad) loan.

We put this in a new column called safe_loans.



In [48]:

    
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans', axis=1)

Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset.



In [49]:

    
plt.figure(figsize=(10,6))
loans['safe_loans'].value_counts().plot(kind='bar')
plt.tick_params(axis='x', labelsize=18)
plt.xticks(rotation='horizontal')
plt.tick_params(axis='y', labelsize=18)
plt.title("Histogram of whether a Loan is safe or risky", fontsize=18)
plt.xlabel("Safe Loan=1, Risky Loan=-1", fontsize=18)
plt.ylabel("Count", fontsize=18)









    Out[49]:





<matplotlib.text.Text at 0x1200ff850>



In [50]:

    
print("Percentage of safe loans: %.1f%%" % ((loans['safe_loans'].value_counts().loc[1] / float(len(loans['safe_loans']))) * 100.0))
print("Percentage of risky loans: %.1f%%" % ((loans['safe_loans'].value_counts().loc[-1] / float(len(loans['safe_loans']))) * 100.0))









    



Percentage of safe loans: 81.1%
Percentage of risky loans: 18.9%

You should have:

Around 81% safe loans
Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
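The same percentages can also be read off in a single call. The following is a minimal sketch on a made-up label column (the toy values are assumptions for illustration, not the LendingClub data), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy labels standing in for loans['safe_loans'] (assumed values, not real data)
toy_labels = pd.Series([1, 1, 1, 1, -1], name="safe_loans")

# normalize=True returns class fractions instead of raw counts
class_share = toy_labels.value_counts(normalize=True) * 100.0

print("Percentage of safe loans: %.1f%%" % class_share.loc[1])
print("Percentage of risky loans: %.1f%%" % class_share.loc[-1])
```

On the real data, the same two lines applied to loans['safe_loans'] should agree with the figures printed above.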

Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.



In [51]:

    
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a subset of features and the target that we will use for the rest of this notebook.

Sample data to balance classes

As we explored above, our data is disproportionally full of safe loans. Let's create two datasets: one with just the safe loans (safe_loans_raw) and one with just the risky loans (risky_loans_raw).



In [52]:

    
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print("Number of safe loans  : %s" % len(safe_loans_raw))
print("Number of risky loans : %s" % len(risky_loans_raw))









    



Number of safe loans  : 99457
Number of risky loans : 23150

Now, write some code below to compute the percentage of safe and risky loans in the dataset, and validate these numbers against what was calculated earlier in the assignment:



In [53]:

    
print("Percentage of safe loans  : %.1f%%" % ((float(len(safe_loans_raw)) / len(loans[target])) * 100.0))
print("Percentage of risky loans : %.1f%%" % ((float(len(risky_loans_raw)) / len(loans[target])) * 100.0))









    



Percentage of safe loans  : 81.1%
Percentage of risky loans : 18.9%

As can be seen, there are many more safe loans than risky loans in the dataset. The training data and validation data we will load combat this class imbalance and contain roughly 50% safe loans and 50% risky loans.
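This notebook loads precomputed index files rather than balancing the classes itself. One common way to obtain such a balanced set is to undersample the majority class. The sketch below shows that idea on a made-up frame; it is an assumption about how balancing could be done, not necessarily how the provided indices were produced.

```python
import pandas as pd

# Toy frame standing in for `loans` (values are made up for illustration)
toy = pd.DataFrame({
    "dti": [5.0, 7.5, 9.1, 12.3, 14.8, 16.0, 18.2, 21.4],
    "safe_loans": [1, 1, 1, 1, 1, 1, -1, -1],
})

safe = toy[toy["safe_loans"] == 1]
risky = toy[toy["safe_loans"] == -1]

# Keep every risky loan and an equally sized random sample of safe loans
safe_sample = safe.sample(n=len(risky), random_state=0)
balanced = pd.concat([risky, safe_sample])

print(balanced["safe_loans"].value_counts())
```

Undersampling discards data from the majority class, which is the price paid for a 50/50 split.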

Performing one-hot encoding with Pandas

Before performing analysis on the data, we need to perform one-hot encoding for all of the categorical data. Once the one-hot encoding is performed on all of the data, we will split the data into a training set and a validation set.



In [54]:

    
loans_one_hot_enc = pd.get_dummies(loans)
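To see what one-hot encoding does, here is a small sketch on a made-up two-column frame (the values are illustrative only). `pd.get_dummies` leaves numeric columns alone and replaces each categorical column with one indicator column per category, named `<column>_<value>`.

```python
import pandas as pd

# Toy data: one categorical column and one numeric column (made-up values)
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "dti": [11.18, 16.85, 13.97],
})

# 'dti' is kept as-is; 'grade' becomes the indicator columns grade_A and grade_B
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded)
```

This naming scheme is why the encoded loans data contains columns such as grade_A, purpose_car, and term_ 36 months.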

Loading the training and validation datasets

Load the JSON files containing the indices of the training data and the validation data into lists.



In [55]:

    
with open('module-5-assignment-1-train-idx.json', 'r') as f:
    train_idx_lst = json.load(f)
train_idx_lst = [int(entry) for entry in train_idx_lst]



In [56]:

    
with open('module-5-assignment-1-validation-idx.json', 'r') as f:
    validation_idx_lst = json.load(f)
validation_idx_lst = [int(entry) for entry in validation_idx_lst]

Use the lists of training and validation indices to get a DataFrame with the training data and a DataFrame with the validation data.



In [57]:

    
# .loc selects rows by index label (.ix has been removed from recent pandas releases)
train_data = loans_one_hot_enc.loc[train_idx_lst]
validation_data = loans_one_hot_enc.loc[validation_idx_lst]

Use decision tree to build a classifier

Now, let's train a decision tree classifier on the training data to create a loan prediction model. (In the next assignment, you will implement your own decision tree learning algorithm.) Our feature columns and target column have already been decided above. The original assignment uses the built-in GraphLab Create decision tree learner (with validation_set=None so that everyone gets the same results); this notebook uses scikit-learn instead.

We use sklearn to learn a decision tree classification model. The first argument to .fit is the training data with the target column "safe_loans" excluded, and the second argument is the target column "safe_loans".

First, training a tree with max_depth=6



In [58]:

    
decision_tree_model = sklearn.tree.DecisionTreeClassifier(max_depth=6)
decision_tree_model.fit(train_data.loc[:, train_data.columns != "safe_loans"], train_data["safe_loans"])









    Out[58]:





DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

Now, training a tree with max_depth=2



In [59]:

    
small_model = sklearn.tree.DecisionTreeClassifier(max_depth=2)
small_model.fit(train_data.loc[:, train_data.columns != "safe_loans"], train_data["safe_loans"])









    Out[59]:





DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')
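The goals listed at the top include visualizing the tree. One lightweight option, sketched here on made-up data, is scikit-learn's `export_text`, which is available in recent scikit-learn releases. The feature names and data below are hypothetical and only illustrate the call.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: two features, labels +1 (safe) / -1 (risky)
X = [[0, 10.0], [0, 20.0], [1, 10.0], [1, 20.0]]
y = [1, 1, -1, -1]
feature_names = ["grade_D", "dti"]  # hypothetical column names

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X, y)

# export_text renders the learned splits as indented plain text
tree_text = export_text(toy_tree, feature_names=feature_names)
print(tree_text)
```

For the models above, the same call would take small_model and the list of training feature column names.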

Making predictions

Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:

Predict whether or not a loan is safe.
Predict the probability that a loan is safe.



In [60]:

    
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

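# pd.concat stacks the two frames (DataFrame.append has been removed from recent pandas releases)
sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])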
sample_validation_data









    Out[60]:






  
    
      
    short_emp  emp_length_num    dti  last_delinq_none  last_major_derog_none  revol_util  total_rec_late_fee  safe_loans  grade_A  grade_B  ...  purpose_house  purpose_major_purchase  purpose_medical  purpose_moving  purpose_other  purpose_small_business  purpose_vacation  purpose_wedding  term_ 36 months  term_ 60 months
19          0              11  11.18                 1                      1        82.4                   0           1        0        1  ...              0                       0                0               0              0                       0                 0                0                1                0
79          0              10  16.85                 1                      1        96.4                   0           1        0        0  ...              0                       0                0               0              0                       0                 0                0                1                0
24          0               3  13.97                 0                      1        59.5                   0          -1        0        0  ...              0                       0                0               0              1                       0                 0                0                0                1
41          0              11  16.33                 1                      1        62.1                   0          -1        1        0  ...              0                       0                0               0              0                       0                 0                0                1                0
    
  

4 rows × 68 columns

Explore label predictions

Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan.

Hint: Be sure to use the .predict() method.



In [61]:

    
samp_vald_data_pred = decision_tree_model.predict(sample_validation_data.loc[:, sample_validation_data.columns != "safe_loans"])



In [62]:

    
samp_vald_data_label = sample_validation_data["safe_loans"].values

Quiz Question: What percentage of the predictions on sample_validation_data did decision_tree_model get correct?



In [63]:

    
print("%.1f%%" % ((np.sum(samp_vald_data_pred == samp_vald_data_label) / float(len(samp_vald_data_pred))) * 100.0))

Explore probability predictions

For each row in the sample_validation_data, what is the probability (according to decision_tree_model) of a loan being classified as safe?



In [64]:

    
samp_vald_data_prob = decision_tree_model.predict_proba(sample_validation_data.loc[:, sample_validation_data.columns != "safe_loans"])[:,1]
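The `[:,1]` in the cell above relies on the second column of `predict_proba` being the probability of the +1 (safe) class. scikit-learn orders those columns like the fitted model's `classes_` attribute, which is sorted, so with labels -1 and +1 the safe class lands in column 1. A minimal sketch on made-up data that looks the column up instead of hard-coding it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data with labels -1 (risky) / +1 (safe)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# classes_ lists the labels in the column order used by predict_proba
safe_col = list(clf.classes_).index(1)
safe_prob = clf.predict_proba(X)[:, safe_col]

print(clf.classes_, safe_col, safe_prob)
```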

Quiz Question: Which loan has the highest probability of being classified as a safe loan?



In [65]:

    
sample_validation_data.index[np.argmax(samp_vald_data_prob)]









    Out[65]:





41



In [66]:

    
sample_validation_data









    Out[66]:






  
    
      
    short_emp  emp_length_num    dti  last_delinq_none  last_major_derog_none  revol_util  total_rec_late_fee  safe_loans  grade_A  grade_B  ...  purpose_house  purpose_major_purchase  purpose_medical  purpose_moving  purpose_other  purpose_small_business  purpose_vacation  purpose_wedding  term_ 36 months  term_ 60 months
19          0              11  11.18                 1                      1        82.4                   0           1        0        1  ...              0                       0                0               0              0                       0                 0                0                1                0
79          0              10  16.85                 1                      1        96.4                   0           1        0        0  ...              0                       0                0               0              0                       0                 0                0                1                0
24          0               3  13.97                 0                      1        59.5                   0          -1        0        0  ...              0                       0                0               0              1                       0                 0                0                0                1
41          0              11  16.33                 1                      1        62.1                   0          -1        1        0  ...              0                       0                0               0              0                       0                 0                0                1                0
    
  

4 rows × 68 columns

41 corresponds to the 4th loan

Tricky predictions!

Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?

Hint: The original (GraphLab Create) assignment sets output_type='probability'; with scikit-learn, use the .predict_proba() method of small_model on sample_validation_data:



In [67]:

    
small_model.predict_proba(sample_validation_data.loc[:, sample_validation_data.columns != "safe_loans"])[:,1]









    Out[67]:





array([ 0.58103415,  0.40744661,  0.40744661,  0.76879888])

Quiz Question: Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?

During tree traversal both examples fall into the same leaf node.
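This can be checked directly: a fitted scikit-learn tree exposes `.apply()`, which returns the index of the leaf each row ends up in, and every row in the same leaf receives the same class probabilities. A minimal sketch on made-up data (not the loan data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data; labels are +1 (safe) / -1 (risky)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, 1, -1, -1])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

leaf_ids = stump.apply(X)                 # leaf index reached by each row
safe_prob = stump.predict_proba(X)[:, 1]  # P(label == +1) for each row

# Rows that share a leaf id print the same probability
for leaf, prob in zip(leaf_ids, safe_prob):
    print(leaf, round(prob, 3))
```

Calling small_model.apply(...) on the feature columns of sample_validation_data would show in the same way whether the 2nd and 3rd loans share a leaf.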

Now, let's consider the 2nd entry in the sample_validation_data



In [68]:

    
sample_validation_data









    Out[68]:






  
    
      
    short_emp  emp_length_num    dti  last_delinq_none  last_major_derog_none  revol_util  total_rec_late_fee  safe_loans  grade_A  grade_B  ...  purpose_house  purpose_major_purchase  purpose_medical  purpose_moving  purpose_other  purpose_small_business  purpose_vacation  purpose_wedding  term_ 36 months  term_ 60 months
19          0              11  11.18                 1                      1        82.4                   0           1        0        1  ...              0                       0                0               0              0                       0                 0                0                1                0
79          0              10  16.85                 1                      1        96.4                   0           1        0        0  ...              0                       0                0               0              0                       0                 0                0                1                0
24          0               3  13.97                 0                      1        59.5                   0          -1        0        0  ...              0                       0                0               0              1                       0                 0                0                0                1
41          0              11  16.33                 1                      1        62.1                   0          -1        1        0  ...              0                       0                0               0              0                       0                 0                0                1                0
    
  

4 rows × 68 columns

The 2nd entry of sample_validation_data has index 79



In [69]:

    
sample_validation_data.loc[79]









    Out[69]:





short_emp                      0.00
emp_length_num                10.00
dti                           16.85
last_delinq_none               1.00
last_major_derog_none          1.00
revol_util                    96.40
total_rec_late_fee             0.00
safe_loans                     1.00
grade_A                        0.00
grade_B                        0.00
grade_C                        0.00
grade_D                        1.00
grade_E                        0.00
grade_F                        0.00
grade_G                        0.00
sub_grade_A1                   0.00
sub_grade_A2                   0.00
sub_grade_A3                   0.00
sub_grade_A4                   0.00
sub_grade_A5                   0.00
sub_grade_B1                   0.00
sub_grade_B2                   0.00
sub_grade_B3                   0.00
sub_grade_B4                   0.00
sub_grade_B5                   0.00
sub_grade_C1                   0.00
sub_grade_C2                   0.00
sub_grade_C3                   0.00
sub_grade_C4                   0.00
sub_grade_C5                   0.00
                              ...  
sub_grade_E4                   0.00
sub_grade_E5                   0.00
sub_grade_F1                   0.00
sub_grade_F2                   0.00
sub_grade_F3                   0.00
sub_grade_F4                   0.00
sub_grade_F5                   0.00
sub_grade_G1                   0.00
sub_grade_G2                   0.00
sub_grade_G3                   0.00
sub_grade_G4                   0.00
sub_grade_G5                   0.00
home_ownership_MORTGAGE        0.00
home_ownership_OTHER           0.00
home_ownership_OWN             0.00
home_ownership_RENT            1.00
purpose_car                    0.00
purpose_credit_card            0.00
purpose_debt_consolidation     1.00
purpose_home_improvement       0.00
purpose_house                  0.00
purpose_major_purchase         0.00
purpose_medical                0.00
purpose_moving                 0.00
purpose_other                  0.00
purpose_small_business         0.00
purpose_vacation               0.00
purpose_wedding                0.00
term_ 36 months                1.00
term_ 60 months                0.00
Name: 79, dtype: float64

Quiz Question: Based on the small_model , what prediction would you make for this data point?



In [70]:

    
# Select row 79 as a one-row DataFrame ([[79]]) so that predict receives 2-D input
small_model.predict(sample_validation_data.loc[[79], sample_validation_data.columns != "safe_loans"])[0]









    Out[70]:





-1

Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows: $$ \mbox{accuracy} = \frac{\mbox{number of correctly classified examples}}{\mbox{total number of examples}} $$
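As a quick worked instance of this formula, the sketch below computes accuracy by hand on made-up predictions and labels (the arrays are assumptions for illustration). The .score() method used in the following cells computes the same quantity for a fitted classifier.

```python
import numpy as np

# Made-up predictions and true labels (+1 safe, -1 risky)
predictions = np.array([1, -1, 1, 1, -1])
true_labels = np.array([1, -1, -1, 1, 1])

# Numerator: number of positions where prediction and label agree
num_correct = np.sum(predictions == true_labels)
# Denominator: total number of examples
accuracy = num_correct / float(len(true_labels))

print("accuracy = %d / %d = %.2f" % (num_correct, len(true_labels), accuracy))
```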

Let us start by evaluating the accuracy of the small_model and decision_tree_model on the training data



In [71]:

    
small_model_train_acc = small_model.score(train_data.loc[:, train_data.columns != "safe_loans"], train_data["safe_loans"])
decision_tree_model_train_acc = decision_tree_model.score(train_data.loc[:, train_data.columns != "safe_loans"], train_data["safe_loans"])
print(small_model_train_acc)
print(decision_tree_model_train_acc)









    



0.613502041694
0.640527616591

Checkpoint: You should see that the small_model performs worse than the decision_tree_model on the training data.



In [72]:

    
decision_tree_model_train_acc > small_model_train_acc









    Out[72]:





True

Now, let us evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.

Quiz Question: What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01?



In [73]:

    
decision_tree_model_valid_acc = decision_tree_model.score(validation_data.loc[:, validation_data.columns != "safe_loans"], validation_data["safe_loans"])



In [74]:

    
print("Accuracy of decision_tree_model on validation set: %.2f" % (decision_tree_model_valid_acc))









    



Accuracy of decision_tree_model on validation set: 0.64

Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.



In [75]:

    
big_model = sklearn.tree.DecisionTreeClassifier(max_depth=10)
big_model.fit(train_data.loc[:, train_data.columns != "safe_loans"], train_data["safe_loans"])









    Out[75]:





DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

Now, let us evaluate the accuracy of the big_model on the training set and validation set.



In [76]:

    
big_model_train_acc = big_model.score(train_data.loc[:, train_data.columns != "safe_loans"], train_data["safe_loans"])
big_model_valid_acc = big_model.score(validation_data.loc[:, validation_data.columns != "safe_loans"], validation_data["safe_loans"])

Checkpoint: We should see that big_model has even better performance on the training set than decision_tree_model did on the training set.



In [77]:

    
big_model_train_acc > decision_tree_model_train_acc









    Out[77]:





True

Quiz Question: How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?



In [78]:

    
big_model_valid_acc > decision_tree_model_valid_acc









    Out[78]:





False

The big_model is a deeper tree with many more splits (it uses the same features as the other models). It performs better on the training dataset but no better on the validation dataset. This is a sign of overfitting.
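The same comparison can be repeated over several depths. The sketch below uses synthetic data from scikit-learn's `make_classification` as a stand-in (it is not the loan data, and the exact numbers depend on the data and the random seed). It records training and validation accuracy for each `max_depth`, so the gap between the two can be inspected.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y); not the loan data
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in [2, 6, 10, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, validation accuracy) for this depth
    results[depth] = (model.score(X_train, y_train),
                      model.score(X_valid, y_valid))
    print(depth, results[depth])
```

A training accuracy that keeps rising with depth while validation accuracy stalls or drops is the pattern described above.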

Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try and quantify the cost of each mistake made by the model.

Assume the following:

False negatives: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.
False positives: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.
Correct predictions: All correct predictions don't typically incur any cost.

Let's write code that can compute the cost of mistakes made by the model.

First, let us make predictions on validation_data using the decision_tree_model. Then, let's store the labels of validation_data.



In [79]:

    
predic_valid_data = decision_tree_model.predict(validation_data.loc[:, validation_data.columns != "safe_loans"])
labels_valid_data = validation_data["safe_loans"].values

Now, let's initialize to 0 the counters that will store the number of false positives and the number of false negatives.



In [80]:

    
N_false_pos = 0
N_false_neg = 0

Now, let's loop over the data to determine the number of false positives and the number of false negatives. False positives are predictions where the model predicts +1 but the true label is -1. False negatives are predictions where the model predicts -1 but the true label is +1.



In [81]:

    
for i in range(len(labels_valid_data)):
    # If we find a mistake
    if predic_valid_data[i] != labels_valid_data[i]:
        # If false positive, increment N_false_pos
        if predic_valid_data[i]==1:
            N_false_pos += 1
        # Else, it's a false negative, increment N_false_neg    
        else:
            N_false_neg += 1

Quiz Question: Let us assume that each mistake costs money:

Assume a cost of \$10,000 per false negative.
Assume a cost of \$20,000 per false positive.

What is the total cost of mistakes made by decision_tree_model on validation_data?



In [82]:

    
10000*N_false_neg + 20000*N_false_pos









    Out[82]:





50390000
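The counting loop and the cost formula can also be written with NumPy boolean masks. The sketch below does this on made-up predictions and labels (the arrays are assumptions for illustration), with the same per-mistake costs as in the quiz question.

```python
import numpy as np

# Made-up predictions and labels (+1 safe, -1 risky)
predictions = np.array([1, 1, -1, -1, 1, -1])
true_labels = np.array([1, -1, 1, -1, -1, -1])

# False positive: predicted safe (+1) but actually risky (-1)
n_false_pos = int(np.sum((predictions == 1) & (true_labels == -1)))
# False negative: predicted risky (-1) but actually safe (+1)
n_false_neg = int(np.sum((predictions == -1) & (true_labels == 1)))

# Same costs as in the quiz: 10,000 per false negative, 20,000 per false positive
total_cost = 10000 * n_false_neg + 20000 * n_false_pos

print(n_false_pos, n_false_neg, total_cost)
```

Applied to predic_valid_data and labels_valid_data, the two masks should give the same counts as the loop.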


