Exploring Ensemble Methods

In this assignment, we will explore the use of boosting. We will use the pre-implemented gradient boosted trees in GraphLab Create. You will:

  • Use SFrames to do some feature engineering.
  • Train a boosted ensemble of decision trees (gradient boosted trees) on the LendingClub dataset.
  • Predict whether a loan will default along with prediction probabilities (on a validation set).
  • Evaluate the trained model and compare it with a baseline.
  • Find the most positive and negative loans using the learned model.
  • Explore how the number of trees influences classification performance.

Let's get started!

Fire up GraphLab Create


In [1]:
import graphlab

Load LendingClub dataset

We will be using the LendingClub data. As discussed earlier, LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors.

Just like we did in previous assignments, we will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

Let us start by loading the data.


In [2]:
loans = graphlab.SFrame('lending-club-data.gl/')



Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset. We have done this in previous assignments, so we won't belabor this here.


In [3]:
loans.head()


Out[3]:
+---------+-----------+-----------+-------------+-----------------+-----------+----------+-------------+-------+-----------+
| id      | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term      | int_rate | installment | grade | sub_grade |
+---------+-----------+-----------+-------------+-----------------+-----------+----------+-------------+-------+-----------+
| 1077501 | 1296599   | 5000      | 5000        | 4975            | 36 months | 10.65    | 162.87      | B     | B2        |
| 1077430 | 1314167   | 2500      | 2500        | 2500            | 60 months | 15.27    | 59.83       | C     | C4        |
| 1077175 | 1313524   | 2400      | 2400        | 2400            | 36 months | 15.96    | 84.33       | C     | C5        |
| 1076863 | 1277178   | 10000     | 10000       | 10000           | 36 months | 13.49    | 339.31      | C     | C1        |
| 1075269 | 1311441   | 5000      | 5000        | 5000            | 36 months | 7.9      | 156.46      | A     | A4        |
| 1072053 | 1288686   | 3000      | 3000        | 3000            | 36 months | 18.64    | 109.43      | E     | E1        |
| 1071795 | 1306957   | 5600      | 5600        | 5600            | 60 months | 21.28    | 152.39      | F     | F2        |
| 1071570 | 1306721   | 5375      | 5375        | 5350            | 60 months | 12.69    | 121.45      | B     | B5        |
| 1070078 | 1305201   | 6500      | 6500        | 6500            | 60 months | 14.65    | 153.45      | C     | C3        |
| 1069908 | 1305008   | 12000     | 12000       | 12000           | 36 months | 12.69    | 402.54      | B     | B5        |
+---------+-----------+-----------+-------------+-----------------+-----------+----------+-------------+-------+-----------+
(The remaining 58 columns, including emp_title, annual_inc, loan_status, purpose, dti, and the payment fields, do not fit the page width and are omitted here.)
[10 rows x 68 columns]


In [4]:
loans.column_names()


Out[4]:
['id',
 'member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'is_inc_v',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'url',
 'desc',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'earliest_cr_line',
 'inq_last_6mths',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'initial_list_status',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'next_pymnt_d',
 'last_credit_pull_d',
 'collections_12_mths_ex_med',
 'mths_since_last_major_derog',
 'policy_code',
 'not_compliant',
 'status',
 'inactive_loans',
 'bad_loans',
 'emp_length_num',
 'grade_num',
 'sub_grade_num',
 'delinq_2yrs_zero',
 'pub_rec_zero',
 'collections_12_mths_zero',
 'short_emp',
 'payment_inc_ratio',
 'final_d',
 'last_delinq_none',
 'last_record_none',
 'last_major_derog_none']

Modifying the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

As in past assignments, in order to make this more intuitive and consistent with the lectures, we reassign the target to be:

  • +1 as a safe loan,
  • -1 as a risky (bad) loan.

We put this in a new column called safe_loans.


In [5]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')
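The remapping itself is simple enough to sketch in plain Python (a toy illustration of the lambda above; no SFrame required):

```python
# Remap the original encoding (bad_loans: 1 = risky, 0 = safe)
# to the lecture convention (safe_loans: +1 = safe, -1 = risky).
def remap_label(bad_loan):
    return +1 if bad_loan == 0 else -1

bad_loans = [0, 1, 0, 1]
safe_loans = [remap_label(x) for x in bad_loans]
print(safe_loans)  # [1, -1, 1, -1]
```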

Selecting features

In this assignment, we will be using a subset of features (categorical and numeric), described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.


In [6]:
target = 'safe_loans'
features = ['grade',                     # grade of the loan (categorical)
            'sub_grade_num',             # sub-grade of the loan as a number from 0 to 1
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'payment_inc_ratio',         # ratio of the monthly payment to income
            'delinq_2yrs',               # number of delinquencies
            'delinq_2yrs_zero',          # no delinquencies in last 2 years
            'inq_last_6mths',            # number of creditor inquiries in last 6 months
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'open_acc',                  # number of open credit accounts
            'pub_rec',                   # number of derogatory public records
            'pub_rec_zero',              # no derogatory public records
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
            'int_rate',                  # interest rate of the loan
            'total_rec_int',             # interest received to date
            'annual_inc',                # annual income of borrower
            'funded_amnt',               # amount committed to the loan
            'funded_amnt_inv',           # amount committed by investors for the loan
            'installment',               # monthly payment owed by the borrower
           ]

Skipping observations with missing values

Recall from the lectures that one common approach to coping with missing values is to skip observations that contain missing values.

We run the following code to do so:


In [7]:
loans, loans_with_na = loans[[target] + features].dropna_split()

# Count the number of rows with missing data
num_rows_with_na = loans_with_na.num_rows()
num_rows = loans.num_rows()
print 'Dropping %s observations; keeping %s ' % (num_rows_with_na, num_rows)


Dropping 29 observations; keeping 122578 

Fortunately, there are not too many missing values. We are retaining most of the data.
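As a rough sketch of what dropna_split does, here is the same idea on a plain list of dicts (the two column names are just for illustration):

```python
# Split rows into those with complete values and those with any missing value.
rows = [
    {'dti': 27.65, 'revol_util': 83.7},
    {'dti': None,  'revol_util': 9.4},   # missing dti -> dropped
    {'dti': 8.72,  'revol_util': 98.5},
]
complete = [r for r in rows if all(v is not None for v in r.values())]
dropped  = [r for r in rows if any(v is None for v in r.values())]
print('Dropping %s observations; keeping %s' % (len(dropped), len(complete)))
```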

Make sure the classes are balanced

We saw in an earlier assignment that this dataset is also imbalanced. We will undersample the larger class (safe loans) in order to balance out our dataset. We used seed=1 to make sure everyone gets the same results.


In [8]:
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)


Percentage of safe loans                 : 0.502247166849
Percentage of risky loans                : 0.497752833151
Total number of loans in our new dataset : 46503

Checkpoint: You should now see that the dataset is balanced (approximately 50-50 safe vs risky loans).

Note: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this paper. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
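The undersampling step above can be sketched with plain Python lists; note that this uses the standard random module as a stand-in for SFrame.sample(percentage, seed=1), so the exact rows kept will differ from GraphLab's sampler:

```python
import random

def undersample_majority(majority, minority, seed=1):
    """Keep each majority-class row with probability
    len(minority)/len(majority), then append all minority rows."""
    rng = random.Random(seed)
    ratio = len(minority) / float(len(majority))
    kept = [x for x in majority if rng.random() < ratio]
    return kept + list(minority)

safe = ['safe'] * 1000    # over-represented class
risky = ['risky'] * 250   # under-represented class
balanced = undersample_majority(safe, risky)
print(balanced.count('safe'), balanced.count('risky'))
```

After the split, the two classes should be roughly the same size, which is exactly the checkpoint below.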

Split data into training and validation sets

We split the data into training data and validation data. We used seed=1 to make sure everyone gets the same results. We will use the validation data to help us select model parameters.


In [9]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

Gradient boosted tree classifier

Gradient boosted trees are a powerful variant of boosting methods; they have been used to win many Kaggle competitions, and have been widely used in industry. We will explore the predictive power of multiple decision trees as opposed to a single decision tree.

Additional reading: if you are interested in gradient boosted trees, there is a wealth of reading material available online.

We will now train models to predict safe_loans using the features above. In this section, we will experiment with training an ensemble of 5 trees. To cap the ensemble classifier at 5 trees, we call the function with max_iterations=5 (recall that each iteration corresponds to adding a tree). We set validation_set=None to make sure everyone gets the same results.


In [10]:
model_5 = graphlab.boosted_trees_classifier.create(train_data, validation_set=None, 
        target = target, features = features, max_iterations = 5)


Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 37219
Number of classes           : 2
Number of feature columns   : 24
Number of unpacked features : 24
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.044356     | 0.657541          | 0.657139          |
| 2         | 0.086682     | 0.656976          | 0.636157          |
| 3         | 0.130382     | 0.664983          | 0.623206          |
| 4         | 0.170571     | 0.668476          | 0.613783          |
| 5         | 0.209219     | 0.673339          | 0.606229          |
+-----------+--------------+-------------------+-------------------+

Making predictions

Just like we did in previous sections, let us consider a few positive and negative examples from the validation set. We will do the following:

  • Predict whether or not a loan is likely to default.
  • Predict the probability with which the loan is likely to default.

In [12]:
# Select all positive and negative examples.
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

# Select 2 examples from the validation set for positive & negative loans
sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

# Append the 4 examples into a single dataset
sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data


Out[12]:
safe_loans grade sub_grade_num short_emp emp_length_num home_ownership dti purpose
1 B 0.2 0 3 MORTGAGE 29.44 credit_card
1 B 0.6 1 1 RENT 12.19 credit_card
-1 D 0.4 0 3 RENT 13.97 other
-1 A 1.0 0 11 MORTGAGE 16.33 debt_consolidation
payment_inc_ratio delinq_2yrs delinq_2yrs_zero inq_last_6mths last_delinq_none last_major_derog_none open_acc
6.30496 0 1 0 1 1 8
13.4952 0 1 0 1 1 8
2.96736 3 0 0 0 1 14
1.90524 0 1 0 1 1 17
pub_rec pub_rec_zero revol_util total_rec_late_fee int_rate total_rec_int annual_inc funded_amnt funded_amnt_inv
0 1 93.9 0.0 9.91 823.48 92000 15000 15000
0 1 59.1 0.0 11.71 1622.21 25000 8500 8500
0 1 59.5 0.0 16.77 719.11 50004 5000 5000
0 1 62.1 0.0 8.9 696.99 100000 5000 5000
installment
483.38
281.15
123.65
158.77
[4 rows x 25 columns]

Predicting on sample validation data

For each row in the sample_validation_data, write code to make model_5 predict whether or not the loan is classified as a safe loan.

Hint: Use the predict method in model_5 for this.


In [16]:
sample_predictions = model_5.predict(sample_validation_data)

Quiz Question: What percentage of the predictions on sample_validation_data did model_5 get correct?

In [17]:
sample_accuracy = (sample_predictions == sample_validation_data['safe_loans']).sum() / float(len(sample_validation_data))
sample_accuracy


Out[17]:
0.75

Prediction probabilities

For each row in the sample_validation_data, what is the probability (according to model_5) of a loan being classified as safe?

Hint: Set output_type='probability' to make probability predictions using model_5 on sample_validation_data:

In [18]:
model_5.predict(sample_validation_data, output_type='probability')


Out[18]:
dtype: float
Rows: 4
[0.7045905590057373, 0.5963408946990967, 0.44925159215927124, 0.6119099855422974]

Quiz Question: According to model_5, which loan is the least likely to be a safe loan?

Checkpoint: Can you verify that for all the predictions with probability >= 0.5, the model predicted the label +1?
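As a sanity check on that, a small sketch that thresholds the (rounded) probabilities from Out[18] at 0.5:

```python
# A predicted probability >= 0.5 maps to the positive class (+1), else -1.
def prob_to_label(p):
    return +1 if p >= 0.5 else -1

probs = [0.7046, 0.5963, 0.4493, 0.6119]   # rounded values from Out[18]
labels = [prob_to_label(p) for p in probs]
print(labels)  # [1, 1, -1, 1]
```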

Evaluating the model on the validation data

Recall that the accuracy is defined as follows: $$ \mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}} $$
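To see the formula in action, a quick sketch using the confusion-matrix counts that .evaluate() reports on the validation data (3054 and 3149 correct, 1618 and 1463 mistaken):

```python
# accuracy = (# correctly classified examples) / (# total examples)
tn, tp = 3054, 3149      # correct predictions (from the confusion matrix)
fp, fn = 1618, 1463      # mistakes
accuracy = (tn + tp) / float(tn + tp + fp + fn)
print(round(accuracy, 6))  # 0.668139
```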

Evaluate the accuracy of the model_5 on the validation_data.

Hint: Use the .evaluate() method in the model.


In [19]:
model_5.evaluate(validation_data)


Out[19]:
{'accuracy': 0.66813873330461,
 'auc': 0.7247215702188436,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      -1      |        1        |  1618 |
 |      1       |        -1       |  1463 |
 |      -1      |        -1       |  3054 |
 |      1       |        1        |  3149 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.6715001599317625,
 'log_loss': 0.6176131769693981,
 'precision': 0.6605831760016782,
 'recall': 0.6827840416305291,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+------+------+
 | threshold | fpr | tpr |  p   |  n   |
 +-----------+-----+-----+------+------+
 |    0.0    | 1.0 | 1.0 | 4612 | 4672 |
 |   1e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   2e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   3e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   4e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   5e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   6e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   7e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   8e-05   | 1.0 | 1.0 | 4612 | 4672 |
 |   9e-05   | 1.0 | 1.0 | 4612 | 4672 |
 +-----------+-----+-----+------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}

Calculate the number of false positives made by the model.


In [22]:
print model_5.evaluate(validation_data)['confusion_matrix']


+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      -1      |        1        |  1618 |
|      1       |        -1       |  1463 |
|      -1      |        -1       |  3054 |
|      1       |        1        |  3149 |
+--------------+-----------------+-------+
[4 rows x 3 columns]

Quiz Question: What is the number of false positives on the validation_data?

Calculate the number of false negatives made by the model.


In [23]:
print 1463


1463

Comparison with decision trees

In the earlier assignment, we saw that the prediction accuracy of the decision trees was around 0.64 (rounded). In this assignment, we saw that model_5 has an accuracy of 0.67 (rounded).

Here, we quantify the benefit of the extra 3% increase in accuracy of model_5 in comparison with a single decision tree from the original decision tree assignment.

As in the earlier assignment, we calculate the cost of the mistakes made by the model. We again use the same costs as follows:

  • False negatives: Assume a cost of \$10,000 per false negative.
  • False positives: Assume a cost of \$20,000 per false positive.

Assume that the number of false positives and false negatives for the learned decision tree was

  • False negatives: 1936
  • False positives: 1503

Using the costs defined above and the number of false positives and false negatives for the decision tree, we can calculate the total cost of the mistakes made by the decision tree model as follows:

cost = $10,000 * 1936  + $20,000 * 1503 = $49,420,000

The total cost of the mistakes of the model is $49.42M. That is a lot of money!
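The same arithmetic as a small sketch, which applies unchanged to the boosted model's mistake counts:

```python
# Cost of mistakes: $10,000 per false negative, $20,000 per false positive.
def mistake_cost(fn, fp, fn_cost=10000, fp_cost=20000):
    return fn_cost * fn + fp_cost * fp

decision_tree_cost = mistake_cost(fn=1936, fp=1503)
print(decision_tree_cost)  # 49420000
```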

Quiz Question: Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (model_5) as evaluated on the validation_set?


In [38]:
model_5_cm = model_5.evaluate(validation_data)['confusion_matrix']

In [40]:
fp = model_5_cm[model_5_cm.apply(lambda x: x['target_label'] == -1 and x['predicted_label'] == 1)]['count'][0]
fn = model_5_cm[model_5_cm.apply(lambda x: x['target_label'] == 1 and x['predicted_label'] == -1)]['count'][0]
model_5_cost = 10000 * fn + 20000 * fp
model_5_cost


Out[40]:
46990000

Reminder: Compare the cost of the mistakes made by the boosted trees model with the decision tree model. The extra 3% improvement in prediction accuracy can translate to several million dollars! And it was easy to obtain simply by boosting our decision trees.

Most positive & negative loans

In this section, we will find the loans that are most likely to be predicted safe. We can do this in a few steps:

  • Step 1: Use the model_5 (the model with 5 trees) and make probability predictions for all the loans in the validation_data.
  • Step 2: Similar to what we did in the very first assignment, add the probability predictions as a column called predictions into the validation_data.
  • Step 3: Sort the data (in descending order) by the probability predictions.

Start here with Step 1 & Step 2. Make predictions using model_5 for examples in the validation_data. Use output_type='probability'.


In [41]:
prob_predictions = model_5.predict(validation_data, output_type='probability')
validation_data['predictions'] = prob_predictions

Checkpoint: For each row, the probabilities should be a number in the range [0, 1]. We have provided a simple check here to make sure your answers are correct.


In [42]:
print "Your loans      : %s\n" % validation_data['predictions'].head(4)
print "Expected answer : %s" % [0.4492515948736132, 0.6119100103640573,
                                0.3835981314851436, 0.3693306705994325]


Your loans      : [0.44925159215927124, 0.6119099855422974, 0.38359811902046204, 0.3693307042121887]

Expected answer : [0.4492515948736132, 0.6119100103640573, 0.3835981314851436, 0.3693306705994325]

Now, we are ready to go to Step 3. You can now use the predictions column to sort the loans in validation_data (in descending order) by prediction probability. Find the top 5 loans with the highest probability of being predicted as a safe loan.


In [54]:
validation_data.sort('predictions', ascending=False)


Out[54]:
safe_loans grade sub_grade_num short_emp emp_length_num home_ownership dti purpose payment_inc_ratio
1 A 0.2 0 11 MORTGAGE 4.21 credit_card 0.955726
1 A 0.4 0 4 MORTGAGE 12.76 car 1.7376
1 A 0.2 0 6 MORTGAGE 10.29 home_improvement 3.22264
1 A 0.2 0 8 MORTGAGE 10.02 wedding 3.49357
1 A 0.6 0 6 MORTGAGE 3.16 home_improvement 2.91713
1 A 0.6 0 5 MORTGAGE 5.2 major_purchase 0.74268
1 A 0.4 0 6 MORTGAGE 5.75 home_improvement 1.66994
1 A 0.6 0 3 RENT 4.76 major_purchase 1.6872
1 A 0.6 1 1 MORTGAGE 3.33 major_purchase 1.64489
1 A 0.6 0 11 MORTGAGE 2.4 car 2.49545
delinq_2yrs delinq_2yrs_zero inq_last_6mths last_delinq_none last_major_derog_none open_acc pub_rec pub_rec_zero
0 1 2 1 1 9 0 1
0 1 2 1 1 11 0 1
0 1 1 1 1 14 0 1
0 1 0 1 1 14 0 1
0 1 0 1 1 16 0 1
0 1 1 1 1 7 0 1
0 1 0 1 1 6 0 1
1 0 0 0 1 14 0 1
0 1 0 0 1 5 0 1
0 1 0 1 1 6 0 1
revol_util total_rec_late_fee int_rate total_rec_int annual_inc funded_amnt funded_amnt_inv installment
7.9 0.0 6.39 179.18 146000 3800 3650 116.28
5.5 0.0 6.76 429.63 85000 4000 4000 123.08
4.5 0.0 6.03 527.44 85000 7500 7500 228.27
7.9 0.0 6.03 161.9 115000 11000 11000 334.8
5.0 0.0 7.14 505.27 85000 10400 9809 206.63
11.2 0.0 7.14 56.58 100000 2000 2000 61.89
0.0 0.0 5.99 182.03 140987 6450 6450 196.2
1.1 0.0 7.12 115.11 220000 10000 10000 309.32
14.7 0.0 6.92 381.04 135000 6000 6000 185.05
0.0 0.0 6.17 671.39 110000 7500 7500 228.75
predictions
0.848508358002
0.848508358002
0.841295421124
0.841295421124
0.841295421124
0.841295421124
0.841295421124
0.841295421124
0.841295421124
0.841295421124
[9284 rows x 26 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Quiz Question: What grades are the top 5 loans?

Let us repeat this exercise to find the 5 loans (in the validation_data) with the lowest probability of being predicted as a safe loan:


In [53]:
validation_data.sort('predictions', ascending=True)


Out[53]:
safe_loans grade sub_grade_num short_emp emp_length_num home_ownership dti purpose
-1 C 0.4 0 4 RENT 8.4 credit_card
-1 C 0.8 1 0 MORTGAGE 17.37 home_improvement
-1 D 0.8 0 3 RENT 8.95 small_business
-1 B 1.0 0 5 RENT 29.42 debt_consolidation
-1 C 0.2 0 5 RENT 30.17 debt_consolidation
-1 E 1.0 0 3 RENT 29.24 debt_consolidation
-1 F 0.2 0 2 MORTGAGE 11.12 car
-1 E 0.2 0 11 MORTGAGE 14.93 debt_consolidation
-1 D 0.2 0 8 MORTGAGE 12.64 debt_consolidation
-1 C 0.4 1 0 MORTGAGE 26.54 credit_card
payment_inc_ratio delinq_2yrs delinq_2yrs_zero inq_last_6mths last_delinq_none last_major_derog_none open_acc
11.8779 0 1 0 1 1 9
12.5753 0 1 0 1 1 8
16.727 0 1 2 1 1 7
14.3733 0 1 0 0 1 14
13.5391 0 1 1 1 1 7
3.69024 0 1 3 1 1 8
5.41577 0 1 1 0 1 8
6.52688 0 1 3 1 1 11
9.39964 0 1 0 0 1 5
9.49582 0 1 0 1 1 9
pub_rec pub_rec_zero revol_util total_rec_late_fee int_rate total_rec_int annual_inc funded_amnt funded_amnt_inv
0 1 60.0 34.64 15.31 2152.67 35000 9950 9950
0 1 46.1 18.86 15.31 1089.84 36000 15750 15750
0 1 41.6 16.7025 15.2 1519.65 24000 14000 14000
0 1 57.5 20.9132 14.09 1891.71 35000 12250 12250
0 1 80.7 19.7362 14.33 1632.01 35000 11500 11500
0 1 38.8 0.0 22.47 0.0 12500 1000 1000
1 0 67.5 0.0 22.95 0.0 31200 5000 5000
0 1 58.6 40.6347 19.99 520.22 75000 15400 15400
0 1 84.6 25.6507 14.59 1379.94 66000 15000 15000
1 0 76.1 17.41 15.31 1875.19 44000 10000 10000
installment predictions
346.44 0.134275108576
377.26 0.134275108576
334.54 0.134275108576
419.22 0.134275108576
394.89 0.134275108576
38.44 0.141768679023
140.81 0.141768679023
407.93 0.145480468869
516.98 0.152203395963
348.18 0.152203395963
[9284 rows x 26 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Checkpoint: You should expect to see 5 loans with the grade ['D', 'C', 'C', 'C', 'B'] or with ['D', 'C', 'B', 'C', 'C'].

Effect of adding more trees

In this assignment, we will train 5 different ensemble classifiers in the form of gradient boosted trees. We will train models with 10, 50, 100, 200, and 500 trees. We use the max_iterations parameter in the boosted tree module.

Let's get started with a model with max_iterations = 10:


In [55]:
model_10 = graphlab.boosted_trees_classifier.create(train_data, validation_set=None, 
        target = target, features = features, max_iterations = 10, verbose=False)

Now, train 4 more models with max_iterations set to:

  • max_iterations = 50
  • max_iterations = 100
  • max_iterations = 200
  • max_iterations = 500

Let us call these models model_50, model_100, model_200, and model_500. You can pass in verbose=False in order to suppress the printed output.

Warning: This could take a couple of minutes to run.


In [56]:
model_50 = graphlab.boosted_trees_classifier.create(train_data, validation_set=None, 
        target = target, features = features, max_iterations = 50, verbose=False)
model_100 = graphlab.boosted_trees_classifier.create(train_data, validation_set=None, 
        target = target, features = features, max_iterations = 100, verbose=False)
model_200 = graphlab.boosted_trees_classifier.create(train_data, validation_set=None, 
        target = target, features = features, max_iterations = 200, verbose=False)
model_500 = graphlab.boosted_trees_classifier.create(train_data, validation_set=None, 
        target = target, features = features, max_iterations = 500, verbose=False)

Compare accuracy on entire validation set

Now we will compare the predictive accuracy of our models on the validation set. Evaluate the accuracy of the 10, 50, 100, 200, and 500 tree models on the validation_data. Use the .evaluate method.


In [58]:
print "Accuracy max iterations 10: " + str(model_10.evaluate(validation_data)['accuracy'])
print "Accuracy max iterations 50: " + str(model_50.evaluate(validation_data)['accuracy'])
print "Accuracy max iterations 100: " + str(model_100.evaluate(validation_data)['accuracy'])
print "Accuracy max iterations 200: " + str(model_200.evaluate(validation_data)['accuracy'])
print "Accuracy max iterations 500: " + str(model_500.evaluate(validation_data)['accuracy'])


Accuracy max iterations 10: 0.672770357604
Accuracy max iterations 50: 0.690758293839
Accuracy max iterations 100: 0.691727703576
Accuracy max iterations 200: 0.684510986644
Accuracy max iterations 500: 0.671800947867

Quiz Question: Which model has the best accuracy on the validation_data?

Quiz Question: Is it always true that the model with the most trees will perform best on test data?

Plot the training and validation error vs. number of trees

Recall from the lecture that the classification error is defined as

$$ \mbox{classification error} = 1 - \mbox{accuracy} $$
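As a quick sketch of the conversion, applied to the validation accuracies printed earlier (values copied from In [58]):

```python
# classification error = 1 - accuracy
accuracies = [0.672770357604, 0.690758293839, 0.691727703576,
              0.684510986644, 0.671800947867]   # from In [58]
errors = [1.0 - a for a in accuracies]
print([round(e, 6) for e in errors])
```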

In this section, we will plot the training and validation errors versus the number of trees to get a sense of how these models are performing. We will compare the 10, 50, 100, 200, and 500 tree models. You will need matplotlib in order to visualize the plots.

First, make sure this block of code runs on your computer.


In [59]:
import matplotlib.pyplot as plt
%matplotlib inline
def make_figure(dim, title, xlabel, ylabel, legend):
    plt.rcParams['figure.figsize'] = dim
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if legend is not None:
        plt.legend(loc=legend, prop={'size':15})
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()

In order to plot the classification errors (on the train_data and validation_data) versus the number of trees, we will need lists of these errors, one entry per model.

Steps to follow:

  • Step 1: Calculate the classification error for each model on the training data (train_data).
  • Step 2: Store the training errors into a list (called training_errors) that looks like this:
    [train_err_10, train_err_50, ..., train_err_500]
  • Step 3: Calculate the classification error of each model on the validation data (validation_data).
  • Step 4: Store the validation classification error into a list (called validation_errors) that looks like this:
    [validation_err_10, validation_err_50, ..., validation_err_500]
    Once that has been completed, the rest of the code should run correctly and generate the plot.

Let us start with Step 1. Write code to compute the classification error on the train_data for models model_10, model_50, model_100, model_200, and model_500.


In [61]:
def calculate_classification_error(data, model):
    accuracy = (model.predict(data) == data['safe_loans']).sum() / float(len(data))
    
    return 1 - accuracy

train_err_10 = calculate_classification_error(train_data, model_10)
train_err_50 = calculate_classification_error(train_data, model_50)
train_err_100 = calculate_classification_error(train_data, model_100)
train_err_200 = calculate_classification_error(train_data, model_200)
train_err_500 = calculate_classification_error(train_data, model_500)

Now, let us run Step 2. Save the training errors into a list called training_errors


In [62]:
training_errors = [train_err_10, train_err_50, train_err_100, 
                   train_err_200, train_err_500]
training_errors


Out[62]:
[0.31174937531905744,
 0.24605712136274482,
 0.20043526155995595,
 0.13643569144791634,
 0.03823316048254921]

Now, onto Step 3. Write code to compute the classification error on the validation_data for models model_10, model_50, model_100, model_200, and model_500.


In [63]:
validation_err_10 = calculate_classification_error(validation_data, model_10)
validation_err_50 = calculate_classification_error(validation_data, model_50)
validation_err_100 = calculate_classification_error(validation_data, model_100)
validation_err_200 = calculate_classification_error(validation_data, model_200)
validation_err_500 = calculate_classification_error(validation_data, model_500)

Now, let us run Step 4. Save the validation errors into a list called validation_errors


In [64]:
validation_errors = [validation_err_10, validation_err_50, validation_err_100, 
                     validation_err_200, validation_err_500]
validation_errors


Out[64]:
[0.3272296423955192,
 0.30924170616113744,
 0.30827229642395515,
 0.31548901335631196,
 0.3281990521327014]

Now, we will plot the training_errors and validation_errors versus the number of trees. We will compare the 10, 50, 100, 200, and 500 tree models. We provide some plotting code to visualize the plots within this notebook.

Run the following code to visualize the plots.


In [65]:
plt.plot([10, 50, 100, 200, 500], training_errors, linewidth=4.0, label='Training error')
plt.plot([10, 50, 100, 200, 500], validation_errors, linewidth=4.0, label='Validation error')

make_figure(dim=(10,5), title='Error vs number of trees',
            xlabel='Number of trees',
            ylabel='Classification error',
            legend='best')


Quiz Question: Does the training error reduce as the number of trees increases?

Quiz Question: Is it always true that the validation error will reduce as the number of trees increases?

