Supervised regression of LC loan data

As my metric (buy rate) is continuous, I have turned a simple classification (good vs. bad loans) into a regression of return. I find this to be a better tool for picking loans to invest in, as it forecasts a predicted return rather than just whether the loan will be paid as agreed. The metric interacts with the loan interest rate, preferring higher-interest loans that are paid as agreed over lower-interest loans that are paid as agreed; a simple classifier cannot make that distinction. Successfully forecasting this metric can be the difference between having no defaulting loans but a 4% return per year (low-interest, safe loans) and having no or few defaulting loans and a 10-15% return per year.
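As a toy illustration (hypothetical loans and returns, not taken from the dataset): a good/bad classifier scores two performing loans identically, while the return metric ranks the higher-interest one above the other.

# Two hypothetical loans, both paid as agreed - a classifier cannot
# distinguish them, but a regression on return prefers the second.
loans = [
    {"grade": "A1", "paid_as_agreed": True, "annual_return": 0.04},
    {"grade": "D3", "paid_as_agreed": True, "annual_return": 0.13},
]
print([l["paid_as_agreed"] for l in loans])                   # [True, True]
print(max(loans, key=lambda l: l["annual_return"])["grade"])  # D3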

Getting started

Data exploration and cleaning were performed in the previous Jupyter notebook, "Data Cleaning.ipynb", where I exported clean features and labels pickles using joblib. I will import them here and graph some preliminary things to start!


In [1]:
import joblib
# Load the cleaned data exported from "Data Cleaning.ipynb"
features=joblib.load('clean_LCfeatures.p')        # cleaned feature DataFrame
labels=joblib.load('clean_LClabels.p')            # continuous return labels (regression target)
clabels=joblib.load('clean_LCclassifierlabel.p')  # binary good/bad labels (classification target)

In [2]:
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualization code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

In [3]:
features.head(n=10)


Out[3]:
sub_grade loan_amnt emp_title emp_length home_ownership annual_inc purpose addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc collections_12_mths_ex_med
0 B2 5000.0 None 10.0 RENT 24000.00 credit_card AZ 27.65 0.0 1985-01-01 1.0 3.0 0.0 13648.0 83.7 9.0 0.0
5 B5 12000.0 UCLA 10.0 OWN 75000.00 debt_consolidation CA 10.78 0.0 1989-10-01 0.0 12.0 0.0 23336.0 67.1 34.0 0.0
6 C1 9000.0 Va. Dept of Conservation/Recreation 0.5 RENT 30000.00 debt_consolidation VA 10.08 0.0 2004-04-01 1.0 4.0 0.0 10452.0 91.7 9.0 0.0
7 B1 3000.0 Target 3.0 RENT 15000.00 credit_card IL 12.56 0.0 2003-07-01 2.0 11.0 0.0 7323.0 43.1 11.0 0.0
9 D1 1000.0 Internal revenue Service 0.5 RENT 28000.00 debt_consolidation MO 20.31 0.0 2007-09-01 1.0 11.0 0.0 6524.0 81.5 23.0 0.0
13 A1 9200.0 Network Interpreting Service 6.0 RENT 77385.19 debt_consolidation CA 9.86 0.0 2001-01-01 0.0 8.0 0.0 7314.0 23.1 28.0 0.0
14 B4 21000.0 Osram Sylvania 10.0 RENT 105000.00 debt_consolidation FL 13.22 0.0 1983-02-01 0.0 7.0 0.0 32135.0 90.3 38.0 0.0
15 B3 10000.0 Value Air 10.0 OWN 50000.00 credit_card TX 11.18 0.0 1985-07-01 0.0 8.0 0.0 10056.0 82.4 21.0 0.0
16 B3 10000.0 Wells Fargo Bank 5.0 RENT 50000.00 debt_consolidation CA 16.01 0.0 2003-04-01 0.0 6.0 0.0 17800.0 91.8 17.0 0.0
18 B1 15000.0 Winfield Pathology Consultants 2.0 MORTGAGE 92000.00 credit_card IL 29.44 0.0 2002-02-01 0.0 8.0 0.0 13707.0 93.9 31.0 0.0

In [4]:
# Reduce earliest_cr_line from a full date to just the year
features['earliest_cr_line']=features.earliest_cr_line.dt.year

In [5]:
features.earliest_cr_line.dtype


Out[5]:
dtype('int64')

In [6]:
import imp
imp.reload(vs)
vs.distribution(features)


There is some weird scaling going on. Just to check my sanity, I'll check the max values of a few of the features. These may need to be log-scaled to deal with the large values; income almost always needs log scaling anyway.


In [7]:
valuespread={'annual_inc':max(features.annual_inc),'delinqmax':max(features.delinq_2yrs),
             'openmax':max(features.open_acc), 'pubmax':max(features.pub_rec),
             'revolbalmax':max(features.revol_bal),'revolutilmax':max(features.revol_util),
             'collectmax':max(features.collections_12_mths_ex_med)}
valuespread


Out[7]:
{'annual_inc': 8900060.0,
 'collectmax': 6.0,
 'delinqmax': 29.0,
 'openmax': 76.0,
 'pubmax': 15.0,
 'revolbalmax': 1743266.0,
 'revolutilmax': 892.29999999999995}

There are some crazy revolving balances and utilizations; I need to check those quickly. The other maxima are extreme, but within ranges I'd assume could exist.


In [8]:
features.query('revol_bal>1000000.0')


Out[8]:
sub_grade loan_amnt emp_title emp_length home_ownership annual_inc purpose addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc collections_12_mths_ex_med
131832 B2 35000.0 plastic surgery 10.0 MORTGAGE 400000.0 debt_consolidation FL 33.48 0.0 1984 4.0 28.0 0.0 1743266.0 29.5 59.0 0.0

I guess this one is pretty understandable - a high-paying job and income to match. Let's check the utilizations - having over 100% seems... not possible.


In [9]:
features.query('revol_util>120')


Out[9]:
sub_grade loan_amnt emp_title emp_length home_ownership annual_inc purpose addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc collections_12_mths_ex_med
68911 B4 7200.0 united states Navy 8.0 MORTGAGE 40000.0 credit_card CT 30.54 0.0 2001 0.0 4.0 0.0 11533.0 128.1 20.0 0.0
72075 D4 35000.0 Blachford, Inc. 0.5 MORTGAGE 294000.0 debt_consolidation GA 7.80 0.0 1982 2.0 6.0 0.0 33066.0 127.6 19.0 0.0
74550 D4 12000.0 Goldman Sachs 3.0 RENT 113000.0 debt_consolidation UT 12.76 4.0 1990 0.0 5.0 0.0 7357.0 120.2 19.0 0.0
210575 A5 12600.0 AVP, Senior Leasing Assistant 8.0 MORTGAGE 96011.0 debt_consolidation MD 11.56 2.0 1981 1.0 9.0 0.0 20033.0 146.1 22.0 0.0
238136 C3 20000.0 Systems Analyst 10.0 MORTGAGE 90000.0 debt_consolidation CA 7.85 0.0 2000 1.0 5.0 0.0 19616.0 127.4 9.0 0.0
242534 C5 4000.0 Doorline 2.0 RENT 35000.0 debt_consolidation TN 18.18 0.0 2008 4.0 14.0 0.0 2956.0 123.2 18.0 0.0
294319 C2 10000.0 Superintendent 10.0 OWN 91000.0 debt_consolidation CA 20.94 0.0 1997 0.0 9.0 0.0 9344.0 150.7 37.0 0.0
294407 B4 3500.0 Budget Analyst 10.0 RENT 45000.0 debt_consolidation CA 14.67 0.0 1998 0.0 2.0 0.0 2677.0 892.3 9.0 0.0
306159 G2 35000.0 HR Director 10.0 RENT 165800.0 debt_consolidation MA 7.42 0.0 2005 0.0 5.0 0.0 16521.0 153.0 5.0 0.0
524608 C4 18275.0 Consultant 1.0 OWN 155000.0 debt_consolidation MA 18.39 0.0 2002 0.0 14.0 1.0 15237.0 141.1 26.0 0.0
528603 B5 9600.0 phlebotomist 0.5 RENT 30912.0 debt_consolidation AK 17.16 0.0 2006 1.0 5.0 0.0 7315.0 126.1 21.0 0.0
547364 D4 9175.0 Deckhand 6.0 RENT 45000.0 debt_consolidation TN 28.67 0.0 2007 1.0 4.0 0.0 18632.0 132.1 10.0 0.0
603535 C5 7350.0 sales manager 3.0 RENT 33000.0 debt_consolidation MT 26.11 0.0 2003 0.0 9.0 0.0 10990.0 126.3 18.0 0.0

Having a utilization over 100% is odd, but a utilization of almost 900% is crazy! I am going to remove just that one loan - not sure what is going on with that outlier. The others are clearly weird, but there is no obvious cutoff beyond removing the huge outlier.


In [10]:
features.drop([294407],inplace=True)
labels.drop([294407],inplace=True)
clabels.drop([294407],inplace=True)
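If more of these turned up, a quantile cutoff would generalize the hand-picked drop above (a sketch only - the 99.99th-percentile threshold is an arbitrary illustration, and this was not run in the notebook):

# Hypothetical alternative: drop all rows above an extreme quantile of
# revol_util instead of removing index 294407 by hand.
cutoff = features['revol_util'].quantile(0.9999)
mask = features['revol_util'] <= cutoff
features, labels, clabels = features[mask], labels[mask], clabels[mask]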

In [11]:
imp.reload(vs)
vs.distribution(features)


Utilization looks much more normal! Now to log transform the skewed distributions.

Log transform skewed data


In [12]:
# Log-transform the skewed features

skewed = ['annual_inc','delinq_2yrs','open_acc', 'pub_rec','revol_bal','total_acc', 'collections_12_mths_ex_med']
# Note: despite the name, features_raw holds the transformed copy;
# 'features' keeps the untransformed originals
features_raw=features.copy()
features_raw[skewed] = features[skewed].apply(lambda x: np.log(x + 1))

# Visualize the new log distributions
imp.reload(vs)
vs.distribution(features_raw, transformed = True)


We see a clear improvement in income, open accounts, and revolving balance. The others are improved but still dominated by low values; that's okay for now. There are extreme outliers, but they are informative. I am tempted to winsorize, but the values are so heavily weighted toward 0 that winsorization would affect all values greater than zero.


In [13]:
# import scipy.stats
# winsors = ['delinq_2yrs','pub_rec','collections_12_mths_ex_med']
# winsor = {}
# for feature in winsors:
#     winsor[feature] = scipy.stats.mstats.winsorize(features[feature], limits=(0, 0.01))
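A quick toy run shows the concern above concretely: when a feature is almost entirely zeros, the 99th percentile is itself zero, so winsorizing the top 1% flattens every positive value (a sketch, not run in this notebook):

import numpy as np
from scipy.stats import mstats

x = np.array([0]*99 + [5])                    # 99% zeros, one positive value
print(mstats.winsorize(x, limits=(0, 0.01)))  # the 5 is clipped down to 0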

Normalization

Now we normalize by scaling each feature to the $[0, 1]$ range using its minimum and maximum values, i.e. $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$.


In [14]:
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler()
numerical = ['loan_amnt','emp_length', 'annual_inc','dti','delinq_2yrs','earliest_cr_line','inq_last_6mths','open_acc', 'pub_rec','revol_bal','revol_util','total_acc', 'collections_12_mths_ex_med']
features_raw[numerical] = scaler.fit_transform(features_raw[numerical])

# Show an example of a record with scaling applied
display(features_raw.head(n = 10))


sub_grade loan_amnt emp_title emp_length home_ownership annual_inc purpose addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc collections_12_mths_ex_med
0 B2 0.125364 None 1.00 RENT 0.230473 credit_card AZ 0.691423 0.0 0.534483 0.125 0.319143 0.0 0.662532 0.547059 0.327121 0.0
5 B5 0.329446 UCLA 1.00 OWN 0.378689 debt_consolidation CA 0.269567 0.0 0.603448 0.000 0.590484 0.0 0.699854 0.438562 0.667499 0.0
6 C1 0.241983 Va. Dept of Conservation/Recreation 0.05 RENT 0.259499 debt_consolidation VA 0.252063 0.0 0.862069 0.125 0.370513 0.0 0.643968 0.599346 0.327121 0.0
7 B1 0.067055 Target 0.30 RENT 0.169337 credit_card IL 0.314079 0.0 0.844828 0.250 0.572058 0.0 0.619215 0.281699 0.376658 0.0
9 D1 0.008746 Internal revenue Service 0.05 RENT 0.250524 debt_consolidation MO 0.507877 0.0 0.913793 0.125 0.572058 0.0 0.611177 0.532680 0.564987 0.0
13 A1 0.247813 Network Interpreting Service 0.60 RENT 0.382761 debt_consolidation CA 0.246562 0.0 0.810345 0.000 0.505829 0.0 0.619130 0.150980 0.616404 0.0
14 B4 0.591837 Osram Sylvania 1.00 RENT 0.422457 debt_consolidation FL 0.330583 0.0 0.500000 0.000 0.478714 0.0 0.722116 0.590196 0.696900 0.0
15 B3 0.271137 Value Air 1.00 OWN 0.325946 credit_card TX 0.279570 0.0 0.534483 0.000 0.505829 0.0 0.641281 0.538562 0.541346 0.0
16 B3 0.271137 Wells Fargo Bank 0.50 RENT 0.325946 debt_consolidation CA 0.400350 0.0 0.844828 0.000 0.447974 0.0 0.681012 0.600000 0.486824 0.0
18 B1 0.416910 Winfield Pathology Consultants 0.20 MORTGAGE 0.405264 credit_card IL 0.736184 0.0 0.827586 0.000 0.505829 0.0 0.662832 0.613725 0.643151 0.0

One Hot Encoding

Up to this point, I have only dealt with the "continuous" numerical features, where the values have some relation to each other - for example, a higher income can be directly compared to a lower income. Now I want to deal with discrete features, where the values are independent. One example is the address state: although we all might have our own way of ranking the states, there is no "correct" relationship between any two of them - they are discrete and independent. These will be one-hot encoded!

Of note is the employment title - this is entered by the person applying for the loan and can be pretty much anything, which would certainly cause an explosion in feature space. At this point, I will simply remove it from the feature space instead of encoding it.
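As a minimal sketch of what one-hot encoding does to a discrete column (toy data, not the loan set):

import pandas as pd

toy = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT']})
print(pd.get_dummies(toy))
#    home_ownership_MORTGAGE  home_ownership_OWN  home_ownership_RENT
# 0                        0                   0                    1
# 1                        0                   1                    0
# 2                        1                   0                    0
# 3                        0                   0                    1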


In [15]:
features_raw


Out[15]:
sub_grade loan_amnt emp_title emp_length home_ownership annual_inc purpose addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc collections_12_mths_ex_med
0 B2 0.125364 None 1.00 RENT 0.230473 credit_card AZ 0.691423 0.000000 0.534483 0.125 0.319143 0.000000 0.662532 0.547059 0.327121 0.000000
5 B5 0.329446 UCLA 1.00 OWN 0.378689 debt_consolidation CA 0.269567 0.000000 0.603448 0.000 0.590484 0.000000 0.699854 0.438562 0.667499 0.000000
6 C1 0.241983 Va. Dept of Conservation/Recreation 0.05 RENT 0.259499 debt_consolidation VA 0.252063 0.000000 0.862069 0.125 0.370513 0.000000 0.643968 0.599346 0.327121 0.000000
7 B1 0.067055 Target 0.30 RENT 0.169337 credit_card IL 0.314079 0.000000 0.844828 0.250 0.572058 0.000000 0.619215 0.281699 0.376658 0.000000
9 D1 0.008746 Internal revenue Service 0.05 RENT 0.250524 debt_consolidation MO 0.507877 0.000000 0.913793 0.125 0.572058 0.000000 0.611177 0.532680 0.564987 0.000000
13 A1 0.247813 Network Interpreting Service 0.60 RENT 0.382761 debt_consolidation CA 0.246562 0.000000 0.810345 0.000 0.505829 0.000000 0.619130 0.150980 0.616404 0.000000
14 B4 0.591837 Osram Sylvania 1.00 RENT 0.422457 debt_consolidation FL 0.330583 0.000000 0.500000 0.000 0.478714 0.000000 0.722116 0.590196 0.696900 0.000000
15 B3 0.271137 Value Air 1.00 OWN 0.325946 credit_card TX 0.279570 0.000000 0.534483 0.000 0.505829 0.000000 0.641281 0.538562 0.541346 0.000000
16 B3 0.271137 Wells Fargo Bank 0.50 RENT 0.325946 debt_consolidation CA 0.400350 0.000000 0.844828 0.000 0.447974 0.000000 0.681012 0.600000 0.486824 0.000000
18 B1 0.416910 Winfield Pathology Consultants 0.20 MORTGAGE 0.405264 credit_card IL 0.736184 0.000000 0.827586 0.000 0.505829 0.000000 0.662832 0.613725 0.643151 0.000000
19 C2 0.416910 nyc transit 0.90 RENT 0.349662 debt_consolidation NY 0.380595 0.000000 0.844828 0.125 0.478714 0.000000 0.603852 0.376471 0.376658 0.000000
20 B3 0.096210 Shands Hospital at the University of Fl 1.00 MORTGAGE 0.423690 debt_consolidation FL 0.140785 0.203795 0.517241 0.000 0.590484 0.000000 0.606616 0.246405 0.735781 0.000000
21 B3 0.227405 Oakridge homes 0.05 RENT 0.235783 credit_card MN 0.304826 0.000000 0.896552 0.000 0.505829 0.000000 0.610287 0.386275 0.398406 0.000000
22 A3 0.107143 None 0.70 MORTGAGE 0.186441 debt_consolidation NY 0.508627 0.000000 0.568966 0.000 0.447974 0.000000 0.648840 0.567974 0.398406 0.000000
23 A4 0.907434 Audubon Mutual Housing Corporation 0.50 MORTGAGE 0.378689 debt_consolidation NJ 0.350838 0.000000 0.465517 0.000 0.590484 0.000000 0.699300 0.179085 0.596989 0.000000
24 A5 0.125364 Good Samaritan Society 0.20 RENT 0.230711 debt_consolidation OR 0.298325 0.000000 0.879310 0.000 0.505829 0.000000 0.536314 0.191503 0.471294 0.000000
25 C5 0.183673 GREG BARRETT DRYWALL 0.70 RENT 0.275780 credit_card CA 0.158790 0.000000 0.913793 0.125 0.447974 0.000000 0.606650 0.395425 0.230212 0.000000
26 B2 0.341108 Sharp Lawn Inc. 1.00 RENT 0.300132 credit_card KY 0.295074 0.000000 0.879310 0.250 0.530085 0.000000 0.648122 0.373856 0.398406 0.000000
28 A4 0.416910 Gateway Hospice 0.10 RENT 0.312241 debt_consolidation OH 0.212053 0.000000 0.862069 0.000 0.478714 0.000000 0.613071 0.329412 0.606870 0.000000
29 B4 0.154519 Cox Communications 0.10 RENT 0.286257 debt_consolidation AZ 0.265566 0.000000 0.913793 0.125 0.478714 0.000000 0.619092 0.434641 0.398406 0.000000
31 A4 0.329446 John Wiley Jr. 1.00 RENT 0.354556 debt_consolidation NJ 0.417604 0.000000 0.689655 0.000 0.638286 0.000000 0.630407 0.137255 0.586735 0.000000
33 D2 0.107872 citizens bank 1.00 RENT 0.338344 debt_consolidation RI 0.500375 0.000000 0.862069 0.000 0.478714 0.000000 0.705303 0.647059 0.376658 0.000000
34 A1 0.154519 Stewart Enterprises, Inc. 1.00 MORTGAGE 0.313964 debt_consolidation LA 0.133533 0.000000 0.706897 0.125 0.447974 0.000000 0.565387 0.212418 0.616404 0.000000
36 A5 0.125364 STERIS Corporation 1.00 MORTGAGE 0.416111 debt_consolidation OH 0.408352 0.000000 0.706897 0.000 0.665401 0.000000 0.780485 0.405882 0.675153 0.000000
38 A1 0.271137 Helicoil 1.00 RENT 0.349662 credit_card CT 0.318580 0.000000 0.655172 0.125 0.572058 0.000000 0.664398 0.127451 0.501514 0.000000
39 A2 0.300292 cognizant technology solutions 0.50 RENT 0.369714 debt_consolidation CT 0.271318 0.000000 0.827586 0.000 0.412486 0.000000 0.651447 0.237908 0.266493 0.000000
40 B1 0.416910 Caterpillar Inc. 0.80 MORTGAGE 0.387084 debt_consolidation IL 0.228057 0.000000 0.655172 0.250 0.530085 0.000000 0.652673 0.416340 0.616404 0.000000
41 B1 0.725948 City of Santa Monica 0.90 RENT 0.428509 credit_card CA 0.392848 0.000000 0.775862 0.000 0.572058 0.000000 0.707611 0.405229 0.606870 0.000000
42 B2 0.183673 Aerotek Scientific 0.05 RENT 0.296920 debt_consolidation CA 0.184546 0.000000 0.862069 0.000 0.478714 0.000000 0.654545 0.607190 0.398406 0.000000
44 B1 0.329446 Scott & White 0.10 RENT 0.315100 credit_card TX 0.202801 0.000000 0.793103 0.125 0.572058 0.000000 0.654402 0.340523 0.398406 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
619822 A3 0.096210 TELESERVICE REP 1.00 MORTGAGE 0.345252 debt_consolidation WI 0.025756 0.000000 0.724138 0.000 0.370513 0.000000 0.325151 0.003268 0.501514 0.000000
619826 E1 0.212828 heat treat operator 1.00 MORTGAGE 0.309318 debt_consolidation TN 0.149287 0.000000 0.896552 0.000 0.412486 0.000000 0.000000 0.000000 0.376658 0.000000
619830 C3 1.000000 Sales 0.50 MORTGAGE 0.535303 debt_consolidation NC 0.307327 0.000000 0.344828 0.000 0.607545 0.000000 0.725668 0.465359 0.643151 0.000000
619841 B1 0.854227 Accounts Payable Coordinator and Leas 0.80 MORTGAGE 0.349662 debt_consolidation OR 0.231558 0.000000 0.706897 0.000 0.623428 0.000000 0.691853 0.341830 0.596989 0.000000
619843 B3 0.125364 Qc inspector 0.50 MORTGAGE 0.315100 credit_card CA 0.202551 0.000000 0.913793 0.000 0.572058 0.000000 0.805274 0.143791 0.553424 0.000000
619844 B4 0.224490 Operator 0.10 MORTGAGE 0.416111 credit_card TX 0.930233 0.323008 0.913793 0.000 0.572058 0.000000 0.605433 0.327451 0.667499 0.000000
619845 C3 0.154519 Account Executive 0.40 RENT 0.349662 debt_consolidation NY 0.527632 0.000000 0.896552 0.125 0.689656 0.000000 0.602946 0.201307 0.703779 0.000000
619858 A5 0.037901 medical biller 0.80 MORTGAGE 0.263764 credit_card CA 0.211303 0.000000 0.862069 0.125 0.607545 0.000000 0.577888 0.110458 0.758918 0.356207
619865 B1 1.000000 Manager 0.10 OWN 0.496134 credit_card CA 0.218055 0.000000 0.706897 0.000 0.552026 0.646241 0.609198 0.505229 0.616404 0.000000
619871 C3 0.183673 Hotel Clerk 1.00 RENT 0.322468 debt_consolidation NY 0.271318 0.203795 0.706897 0.000 0.607545 0.000000 0.652738 0.379739 0.717036 0.000000
619875 A2 0.679300 owner 0.05 MORTGAGE 0.378689 credit_card TX 0.532883 0.000000 0.810345 0.125 0.552026 0.000000 0.669644 0.426144 0.486824 0.000000
619878 E4 0.187318 None 0.00 OWN 0.185618 credit_card FL 0.726182 0.000000 0.810345 0.375 0.530085 0.250000 0.579353 0.252288 0.515450 0.000000
619898 C5 0.854227 Teacher 1.00 MORTGAGE 0.393431 debt_consolidation CA 0.413603 0.000000 0.844828 0.125 0.638286 0.000000 0.717901 0.458824 0.741753 0.000000
619902 B5 0.329446 utility Pre-Craft Trainee 0.10 RENT 0.360074 debt_consolidation CA 0.212303 0.000000 0.672414 0.125 0.623428 0.000000 0.659077 0.291503 0.586735 0.000000
619910 D4 0.353499 Sales Associate 0.60 MORTGAGE 0.290248 debt_consolidation TX 0.225806 0.000000 0.896552 0.000 0.505829 0.000000 0.652867 0.462092 0.576079 0.000000
619920 D1 0.203353 Sub teacher / Coach 0.20 OWN 0.245794 debt_consolidation CA 0.465616 0.000000 0.793103 0.000 0.782999 0.000000 0.667802 0.307190 0.723429 0.000000
619932 C1 0.329446 Groomer 0.40 MORTGAGE 0.338344 debt_consolidation CO 0.664666 0.000000 0.896552 0.250 0.731629 0.000000 0.632275 0.194118 0.741753 0.000000
619936 C5 0.189504 Asst. General Superintendent 0.50 MORTGAGE 0.376943 credit_card TX 0.556389 0.000000 0.879310 0.250 0.505829 0.000000 0.598433 0.298039 0.398406 0.000000
619943 C2 0.562682 Letter Carrier 1.00 MORTGAGE 0.381407 credit_card CA 0.320830 0.203795 0.465517 0.250 0.552026 0.000000 0.646684 0.390196 0.682597 0.000000
619950 E4 0.708455 shop foreman 1.00 MORTGAGE 0.338344 debt_consolidation TN 0.780695 0.000000 0.775862 0.125 0.607545 0.000000 0.697124 0.413072 0.501514 0.000000
619952 D3 0.150875 Sales 0.50 OWN 0.279550 debt_consolidation MS 0.473368 0.000000 0.896552 0.125 0.552026 0.250000 0.552234 0.177124 0.553424 0.000000
619953 D1 0.150875 Elementary School Teacher 0.30 MORTGAGE 0.300132 debt_consolidation SC 0.667667 0.000000 0.896552 0.000 0.552026 0.000000 0.621744 0.403268 0.616404 0.000000
619968 D5 0.329446 Accounting 0.90 RENT 0.303266 debt_consolidation NC 0.668667 0.203795 0.810345 0.000 0.478714 0.000000 0.498652 0.222876 0.541346 0.000000
619971 D5 0.059038 None 0.00 MORTGAGE 0.320049 debt_consolidation FL 0.367342 0.203795 0.810345 0.125 0.552026 0.250000 0.527632 0.220915 0.501514 0.000000
619985 C1 0.416910 Merchandiser 0.60 MORTGAGE 0.378689 debt_consolidation GA 0.583896 0.000000 0.775862 0.000 0.607545 0.000000 0.721374 0.532680 0.586735 0.000000
619992 D1 0.795918 sales manager 0.30 RENT 0.439827 debt_consolidation NY 0.665166 0.000000 0.844828 0.000 0.665401 0.000000 0.736005 0.591503 0.689843 0.000000
619993 B3 0.154519 Office Administrator 0.40 RENT 0.303266 debt_consolidation MD 0.266567 0.000000 0.551724 0.125 0.677848 0.000000 0.584075 0.066667 0.769788 0.000000
620003 A1 0.293732 Coordinator of RSVP 0.05 RENT 0.335957 debt_consolidation FL 0.330583 0.203795 0.362069 0.000 0.530085 0.000000 0.646092 0.168627 0.541346 0.000000
620005 D3 0.161079 Painter 0.20 RENT 0.245794 debt_consolidation FL 0.464616 0.000000 0.982759 0.125 0.319143 0.000000 0.519882 0.637908 0.138792 0.000000
620009 E2 0.295918 None 0.00 OWN 0.267894 debt_consolidation OH 0.736184 0.000000 0.827586 0.125 0.530085 0.000000 0.615948 0.271895 0.528706 0.000000

162959 rows × 18 columns


In [16]:
features_raw=features_raw.drop(['emp_title','addr_state'],axis=1)

In [17]:
features_raw.dtypes


Out[17]:
sub_grade                      object
loan_amnt                     float64
emp_length                    float64
home_ownership                 object
annual_inc                    float64
purpose                        object
dti                           float64
delinq_2yrs                   float64
earliest_cr_line              float64
inq_last_6mths                float64
open_acc                      float64
pub_rec                       float64
revol_bal                     float64
revol_util                    float64
total_acc                     float64
collections_12_mths_ex_med    float64
dtype: object

In [18]:
# One-hot encode the 'features_raw' data using pandas.get_dummies()
feat = pd.get_dummies(features_raw)


# Print the number of features after one-hot encoding
encoded = list(feat.columns)
print ("{} total features after one-hot encoding.".format(len(encoded)))

# Show the encoded feature names
print (encoded)


53 total features after one-hot encoding.
['loan_amnt', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'collections_12_mths_ex_med', 'sub_grade_A1', 'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5', 'home_ownership_MORTGAGE', 'home_ownership_OWN', 'home_ownership_RENT', 'purpose_credit_card', 'purpose_debt_consolidation']

Great! There are the original continuous features, and now a separate feature for each sub-grade, type of home ownership, and loan purpose.

Shuffle and split data

I believe this data is ordered by origination date, so it is important to shuffle it. Luckily, train_test_split shuffles by default.


In [19]:
# Import train_test_split
from sklearn.model_selection import train_test_split #sklearn 0.18.1 and up
# Split the features and return labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(feat, labels, test_size = 0.2, random_state = 1)

# Show the results of the split
print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))


Training set has 130367 samples.
Testing set has 32592 samples.

Testing different models

For almost any supervised learning problem, some version of decision trees is necessary to test, if only for their extreme intuitiveness. At the very least, they may be used to shrink the feature space and improve speed for other algorithms; I will use gradient boosting regression for this. Second, I want to test a polynomial regression (second-order, interaction terms only, feeding a linear regression). Finally, I will test ridge regression as a regularized linear baseline.

Defining a model tester


In [20]:
from sklearn.metrics import r2_score, mean_squared_error
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the number of samples to be drawn from the training set
       - X_train: features training set
       - y_train: return labels training set
       - X_test: features testing set
       - y_test: return labels testing set
    '''
    
    results = {}
    
    # Fit the learner to a sample of the training data; the shared
    # random_state keeps the X and y rows aligned
    start = time() # Get start time
    learner = learner.fit(X_train.sample(n=sample_size,random_state=1),y_train.sample(n=sample_size,random_state=1))
    end = time() # Get end time
    
    # Calculate the training time
    results['train_time'] = end-start
        
    # Get the predictions on the test set,
    #       then get predictions on the first 300 training samples
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    # Calculate the total prediction time
    results['pred_time'] = end-start
            
    # Compute mean squared error on the first 300 training samples
    results['mse_train'] = mean_squared_error(y_train[:300],predictions_train)
        
    # Compute mean squared error on the test set
    results['mse_test'] = mean_squared_error(y_test,predictions_test)
    
    # Compute R^2 on the first 300 training samples
    results['R2_train'] = r2_score(y_train[:300],predictions_train)
        
    # Compute R^2 on the test set
    results['R2_test'] = r2_score(y_test,predictions_test)
       
    # Success
    print ("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
    # Return the results
    return results

In [21]:
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Initialize models

clf_A = make_pipeline(PolynomialFeatures(2,interaction_only=True),LinearRegression())
clf_B = GradientBoostingRegressor(random_state=2)
clf_C = Ridge()

# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = len(y_train.sample(frac=.01))
samples_10 = len(y_train.sample(frac=.1))
samples_100 = len(y_train.sample(frac=1))

# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
        train_predict(clf, samples, X_train, y_train, X_test, y_test)
#         print(results[clf])

# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results,.5,.5)


Pipeline trained on 1304 samples.
Pipeline trained on 13037 samples.
Pipeline trained on 130367 samples.
GradientBoostingRegressor trained on 1304 samples.
GradientBoostingRegressor trained on 13037 samples.
GradientBoostingRegressor trained on 130367 samples.
Ridge trained on 1304 samples.
Ridge trained on 13037 samples.
Ridge trained on 130367 samples.

Checking the polynomial regression

These values look huge; I need to check why they are breaking the graphs.


In [22]:
results


Out[22]:
{'GradientBoostingRegressor': {0: {'R2_test': -0.027685254500754652,
   'R2_train': 0.01810268176249441,
   'mse_test': 0.090920842956927345,
   'mse_train': 0.091524044036148666,
   'pred_time': 0.09604716300964355,
   'train_time': 0.21015286445617676},
  1: {'R2_test': 0.03819812536723366,
   'R2_train': 0.052787195008189847,
   'mse_test': 0.085092042350696048,
   'mse_train': 0.088291051279462518,
   'pred_time': 0.07995486259460449,
   'train_time': 1.7750799655914307},
  2: {'R2_test': 0.049926645164842354,
   'R2_train': 0.080675475466687852,
   'mse_test': 0.084054402760203303,
   'mse_train': 0.085691545036429229,
   'pred_time': 0.0840139389038086,
   'train_time': 23.71283507347107}},
 'Pipeline': {0: {'R2_test': -2.5892925015822432e+22,
   'R2_train': -1.8912946891109295e+22,
   'mse_test': 2.2907855870742744e+21,
   'mse_train': 1.7629026497621302e+21,
   'pred_time': 1.3107481002807617,
   'train_time': 1.0699996948242188},
  1: {'R2_test': -9.4839930127228027e+18,
   'R2_train': -2.1152683435113517e+19,
   'mse_test': 8.3906296751651302e+17,
   'mse_train': 1.9716716750719946e+18,
   'pred_time': 1.1816439628601074,
   'train_time': 3.9656147956848145},
  2: {'R2_test': -1.4727784572763279e+19,
   'R2_train': 0.076118319149359404,
   'mse_test': 1.3029890060008484e+18,
   'mse_train': 0.086116324051220161,
   'pred_time': 1.3325960636138916,
   'train_time': 45.169249057769775}},
 'Ridge': {0: {'R2_test': -0.0048736434966243358,
   'R2_train': -0.017429963309490315,
   'mse_test': 0.088902665803350656,
   'mse_train': 0.094836092365322866,
   'pred_time': 0.007500886917114258,
   'train_time': 0.017422199249267578},
  1: {'R2_test': 0.029448528206297397,
   'R2_train': 0.035674368925567634,
   'mse_test': 0.085866132224927366,
   'mse_train': 0.089886162111194129,
   'pred_time': 0.008438825607299805,
   'train_time': 0.028906822204589844},
  2: {'R2_test': 0.033955390217294434,
   'R2_train': 0.056455815485514149,
   'mse_test': 0.085467403439692946,
   'mse_train': 0.087949093952680885,
   'pred_time': 0.008608102798461914,
   'train_time': 0.20398306846618652}}}

Regression Postmortem

Woof - terrible bias; almost no structure is captured by these models. A negative $R^2$ means a model predicts worse than simply guessing the mean return, and large MSE is bad too. The astronomically large errors from the polynomial pipeline are likely numerical: an unregularized linear fit over many collinear interaction terms can extrapolate wildly on the test set. Most likely, these models are too simple for whatever signal exists. I'd like to try to overfit before I move toward a more modest approach, such as classification. If I can tune a decision tree back from overfitting toward something useful, maybe the metric and this regression still have hope.
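To make the negative-$R^2$ point concrete, a tiny sketch (toy numbers, nothing to do with the loan data): since $R^2 = 1 - SS_{res}/SS_{tot}$, any model whose squared error exceeds that of the predict-the-mean baseline goes negative.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.05, 0.10, -0.02, 0.08])
mean_pred = np.full_like(y_true, y_true.mean())  # baseline: predict the mean
bad_pred = np.array([5.0, -3.0, 7.0, -8.0])      # wildly wrong predictions

print(r2_score(y_true, mean_pred))  # 0.0 by definition
print(r2_score(y_true, bad_pred))   # large negative value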

Attempting to overfit

For this I will use a standard decision tree and attempt to overfit, with the hope of pruning it back afterwards.
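(The scikit-learn version used here has no built-in post-pruning, so "pruning" below really means constraining the tree via grid search. For reference, newer scikit-learn releases - 0.22 and up, newer than this notebook - expose minimal cost-complexity post-pruning; a sketch assuming such a version:)

# Assumes scikit-learn >= 0.22; ccp_alpha > 0 post-prunes the fitted tree.
from sklearn.tree import DecisionTreeRegressor
pruned = DecisionTreeRegressor(random_state=1, ccp_alpha=1e-4).fit(X_train, y_train)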


In [23]:
from sklearn.tree import DecisionTreeRegressor
results2={}
clf_D = DecisionTreeRegressor(random_state=1)
clf_D.fit(X_train,y_train)
predictions_test = clf_D.predict(X_test)
predictions_train = clf_D.predict(X_train[:3000])

# Compute mean squared error on the first 3000 training samples
results2['mse_train'] = mean_squared_error(y_train[:3000],predictions_train)

# Compute mean squared error on the test set
results2['mse_test'] = mean_squared_error(y_test,predictions_test)

# Compute R^2 on the first 3000 training samples
results2['R2_train'] = r2_score(y_train[:3000],predictions_train)

# Compute R^2 on the test set
results2['R2_test'] = r2_score(y_test,predictions_test)

results2


Out[23]:
{'R2_test': -0.96710235974236092,
 'R2_train': 0.99999998067169482,
 'mse_test': 0.17403247146639395,
 'mse_train': 1.6953117790367594e-09}

Overfitting with Decision Trees - Success?

Looks like we now have classic high variance (near-perfect training scores, terrible test scores)! This makes me more confident that the metric is tenable.

Grid search decision tree model tuning

Now I'll try to tune the decision tree using GridSearchCV.


In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# Initialize the regressor
clf = DecisionTreeRegressor(random_state=2)

# Create the list of parameters to tune
parameters = {'max_depth':(4,6,20,None),'min_samples_split':(2,50),'min_samples_leaf':(1,101)}

# Make an R2 scoring object
scorer = make_scorer(r2_score)

# Perform grid search on the regressor using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)

# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train,y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
# Report the before-and-after scores
print ("Unoptimized model\n------")
print ("MSE score on testing data: {:.4f}".format(mean_squared_error(y_test, predictions)))
print ("R2-score on testing data: {:.4f}".format(r2_score(y_test, predictions)))
print ("\nOptimized Model\n------")
print ("Final MSE score on the testing data: {:.4f}".format(mean_squared_error(y_test, best_predictions)))
print ("Final R2-score on the testing data: {:.4f}".format(r2_score(y_test, best_predictions)))


Unoptimized model
------
MSE score on testing data: 0.1725
R2-score on testing data: -0.9501

Optimized Model
------
Final MSE score on the testing data: 0.0858
Final R2-score on the testing data: 0.0305

In [25]:
best_clf


Out[25]:
DecisionTreeRegressor(criterion='mse', max_depth=6, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=101, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=2,
           splitter='best')

In [26]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import make_scorer
# # Initialize the classifier
# clf = GradientBoostingRegressor(random_state=2)

# # Create the parameters list you wish to tune
# parameters = {'learning_rate':(0.01,0.1), 'max_depth':(3,6)} 

# # Make an R2_score scoring object
# scorer = make_scorer(r2_score)

# # TODO: Perform grid search on the classifier using 'scorer' as the scoring method
# grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)

# # TODO: Fit the grid search object to the training data and find the optimal parameters
# grid_fit = grid_obj.fit(X_train,y_train)

# # Get the estimator
# best_clf = grid_fit.best_estimator_

# # Make predictions using the unoptimized and model
# predictions = (clf.fit(X_train, y_train)).predict(X_test)
# best_predictions = best_clf.predict(X_test)
# # Report the before-and-afterscores
# print ("Unoptimized model\n------")
# print ("MSE score on testing data: {:.4f}".format(mean_squared_error(y_test, predictions)))
# print ("R2-score on testing data: {:.4f}".format(r2_score(y_test, predictions)))
# print ("\nOptimized Model\n------")
# print ("Final MSE score on the testing data: {:.4f}".format(mean_squared_error(y_test, best_predictions)))
# print ("Final R2-score on the testing data: {:.4f}".format(r2_score(y_test, best_predictions)))

In [27]:
best_clf


Out[27]:
DecisionTreeRegressor(criterion='mse', max_depth=6, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=101, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=2,
           splitter='best')

Interestingly, when I ran the boosted-regressor grid search as well (the commented-out cell above), the best estimator was the "stock" one - and it was still pretty poor. I think it is time to try classification instead of regression.

Classification of Loan Data

In this case, we use the classification labels, which group "current" and "fully paid" loans together and group all other delinquent loans together. First we must make a new train/test split, and then we will establish a baseline naive predictor.


In [28]:
# Import train_test_split
from sklearn.model_selection import train_test_split #sklearn 0.18.1 and up
# Split the features and classification labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(feat, clabels, test_size = 0.2, random_state = 1)

# Show the results of the split
print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))


Training set has 130367 samples.
Testing set has 32592 samples.

Naive Predictor

Here we make a naive predictor that simply guesses every loan is "good" (or bad, if more loans are bad than good). A model needs to beat these scores before we can really say it is doing anything meaningful. For classification I will use accuracy, the $f_{0.5}$ score, ROC AUC, and Cohen's kappa to judge model quality on this highly imbalanced dataset. I use $f_{0.5}$ specifically, rather than other beta values, because I care most about being precise: I'd rather misclassify a few good loans as bad than have bad loans slip through, since missing a good loan carries much less risk than funding a bad one. (I could also look at tuning the classification by implementing a cost of misclassification.) ROC AUC takes both true positive and false positive rates into account and can deal with unbalanced data. Cohen's kappa is designed specifically for unbalanced data, and describes the agreement between predictions and the true labels beyond what chance alone would produce.
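For reference, these are the standard definitions (with precision $P$, recall $R$, observed agreement $p_o$, and chance agreement $p_e$):

$$F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$

With $\beta = 0.5$, precision is weighted twice as heavily as recall, and $\kappa = 0$ whenever the classifier does no better than chance.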


In [29]:
from sklearn.metrics import accuracy_score,fbeta_score, roc_auc_score, cohen_kappa_score, precision_score
allones=np.ones(len(clabels))
acc=accuracy_score(clabels,allones)
fbeta=fbeta_score(clabels,allones,beta=.5)
auc=roc_auc_score(clabels,allones)
cohen=cohen_kappa_score(clabels,allones)
prec=precision_score(clabels,allones)

# Print the results 
print ("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}, ROC AUC: {:.4f}, Cohen's k: {:.4f}, precision: {:.4f}]".format(acc, fbeta,auc,cohen,prec))


Naive Predictor: [Accuracy score: 0.7983, F-score: 0.8318, ROC AUC: 0.5000, Cohen's k: 0.0000, precision: 0.7983]

Wow! Accuracy and F-score are pretty high. This is because fully paid loans far outnumber the others - roughly 80% of loans are good, as can be seen in the data cleaning notebook. Interestingly, ROC AUC and Cohen's k show just how poor a naive classifier really is. I'll use those to judge my models.

Testing Classifiers

I am choosing three classifiers: decision trees, as it is obligatory to try them on any classification problem - they get 80% of the way there 80% of the time; Gaussian naive Bayes, as it is one of the simplest models and I love to see simplicity win out; and random forests, an ensemble of trees that should curb the variance a single tree suffers from.
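One cheap way to encode the misclassification cost mentioned earlier is class weighting (a sketch only - class_weight is a standard scikit-learn parameter, but the 5:1 cost is an arbitrary illustration and was not used in this notebook):

from sklearn.tree import DecisionTreeClassifier

# Penalize misclassifying a bad loan (class 0, assuming bad loans are
# encoded as 0 here) five times as heavily as a good loan (class 1).
weighted_clf = DecisionTreeClassifier(random_state=5, class_weight={0: 5, 1: 1})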


In [30]:
from sklearn.metrics import fbeta_score, roc_auc_score, cohen_kappa_score
def train_predictclass(learner, sample_size, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the number of samples to be drawn from the training set
       - X_train: features training set
       - y_train: class labels training set
       - X_test: features testing set
       - y_test: class labels testing set
    '''
    
    resultsclass = {}
    
    # Fit the learner to a sample of the training data; the shared
    # random_state keeps the X and y rows aligned
    start = time() # Get start time
    learner = learner.fit(X_train.sample(n=sample_size,random_state=1),y_train.sample(n=sample_size,random_state=1))
    end = time() # Get end time
    
    # Calculate the training time
    resultsclass['train_time'] = end-start
        
    # Get the predictions on the test set,
    #       then get predictions on the first 300 training samples
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    # Calculate the total prediction time
    resultsclass['pred_time'] = end-start
            
    # Compute ROC AUC on the first 300 training samples
    resultsclass['ROC_AUC_train'] = roc_auc_score(y_train[:300],predictions_train)
        
    # Compute ROC AUC on the test set
    resultsclass['ROC_AUC_test'] = roc_auc_score(y_test,predictions_test)
    
    # Compute precision on the first 300 training samples
    resultsclass['precision_train'] = precision_score(y_train[:300],predictions_train)
        
    # Compute precision on the test set
    resultsclass['precision_test'] = precision_score(y_test,predictions_test)
       
    # Success
    print ("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
    # Return the results
    return resultsclass

In [31]:
# Initialize models
from sklearn.ensemble import RandomForestClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB

# Initialize the three models
clf_A = GaussianNB()
clf_B = DecisionTreeClassifier(random_state=5)
clf_C = RandomForestClassifier(n_estimators=50)


# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = len(y_train.sample(frac=.01))
samples_10 = len(y_train.sample(frac=.1))
samples_100 = len(y_train.sample(frac=1))

# Collect results on the learners
resultsclass = {}
for clf in [clf_A, clf_B, clf_C]: 
    clf_name = clf.__class__.__name__
    resultsclass[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        resultsclass[clf_name][i] = \
        train_predictclass(clf, samples, X_train, y_train, X_test, y_test)
#         print(results[clf])


GaussianNB trained on 1304 samples.
GaussianNB trained on 13037 samples.
GaussianNB trained on 130367 samples.
DecisionTreeClassifier trained on 1304 samples.
DecisionTreeClassifier trained on 13037 samples.
DecisionTreeClassifier trained on 130367 samples.
RandomForestClassifier trained on 1304 samples.
RandomForestClassifier trained on 13037 samples.
RandomForestClassifier trained on 130367 samples.

In [32]:
# Run metrics visualization for the three supervised learning models chosen
imp.reload(vs)
vs.classevaluate(resultsclass,auc,prec)


Results

This is promising! The decision tree unsurprisingly shows the best scores on the testing set. I'll try to optimize the decision tree next, attempting to prune it back a bit with grid search.


In [33]:
# Initialize the classifier
clf = DecisionTreeClassifier(random_state=3)

# Create the parameters list you wish to tune
parameters = {'max_depth':(20,50, None),'max_leaf_nodes':(100,300,None)}

# Make a precision scoring object
scorer = make_scorer(precision_score)

# Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)

# Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train,y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

In [34]:
# Report the before-and-after scores
print ("Unoptimized model\n------")
print ("ROC AUC score on testing data: {:.4f}".format(roc_auc_score(y_test, predictions)))
print ("Precision on testing data: {:.4f}".format(precision_score(y_test, predictions)))
print ("\nOptimized Model\n------")
print ("Final ROC AUC score on the testing data: {:.4f}".format(roc_auc_score(y_test, best_predictions)))
print ("Final Precision on the testing data: {:.4f}".format(precision_score(y_test, best_predictions)))


Unoptimized model
------
ROC AUC score on testing data: 0.5355
Precision on testing data: 0.8103

Optimized Model
------
Final ROC AUC score on the testing data: 0.5355
Final Precision on the testing data: 0.8103

Interestingly, the grid search doesn't really improve the precision. There may not be enough bad-loan data to give the model something to hang on to, and potentially we simply haven't collected the data that would be more predictive. Still, the tree is crazy simple, and it actually performs better at reducing false positives than the other models, at least with this training/test split. It is worth noting that the increase in precision over the naive classifier (0.8103 vs. 0.7983, about 1.2 percentage points) is very small, but better than nothing.


In [35]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(best_clf,X_train, y_train, cv=20,scoring=scorer, n_jobs=-1)
scores


Out[35]:
array([ 0.8173913 ,  0.81065089,  0.8139127 ,  0.81353618,  0.81408506,
        0.81191843,  0.81528538,  0.81426622,  0.81876959,  0.8165844 ,
        0.81061334,  0.81434185,  0.81097682,  0.81692067,  0.81386293,
        0.813835  ,  0.81851997,  0.81385281,  0.81400778,  0.81770119])

In [36]:
difference=scores.mean()-prec
import math
from scipy import stats,special
# z-score of the mean cross-validated precision against the naive precision,
# using the spread of the 20 fold scores
zscore=difference/(scores.std())
# one-sided p-value from the standard normal CDF
p_values = 1-special.ndtr(zscore)
print(zscore,p_values)


6.76716616742 6.56641407915e-12
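The z-score above uses the raw standard deviation of the fold scores as the spread. A more conservative sketch (assuming the 20 folds can be treated as roughly independent samples) would divide by the standard error of the mean instead:

# Same comparison, but against the standard error of the mean CV precision.
sem = scores.std(ddof=1) / np.sqrt(len(scores))
z = (scores.mean() - prec) / sem
print(z)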

In [37]:
# rough standard error of the mean CV score (note: 20 folds were used above,
# so sqrt(20) would be more consistent)
(scores.std()/math.sqrt(10))


Out[37]:
0.00076091637747783658

In [38]:
# standard deviation of the CV scores, in percentage points
scores.std()*100


Out[38]:
0.24062288617544125

In [39]:
import joblib
# Persist the fitted classifier for later use
joblib.dump(clf, 'decisiontree.p')


Out[39]:
['decisiontree.p']
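To reuse the saved model later (standard joblib usage):

import joblib
clf = joblib.load('decisiontree.p')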

In [40]:
features.columns


Out[40]:
Index(['sub_grade', 'loan_amnt', 'emp_title', 'emp_length', 'home_ownership',
       'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'collections_12_mths_ex_med'],
      dtype='object')

In [41]:
# from sklearn import tree
# import pydotplus 
# dot_data = tree.export_graphviz(best_clf, out_file=None,feature_names=feat.columns,class_names=['Bad Loan','Fully Paid']) 
# graph = pydotplus.graph_from_dot_data(dot_data)
# graph.write_pdf("LC_decisiontree.pdf")

In [42]:
best_clf.feature_importances_


Out[42]:
array([  8.66396182e-02,   4.99708784e-02,   9.42451166e-02,
         1.24583630e-01,   1.45502733e-02,   7.32010295e-02,
         2.63119280e-02,   5.99036931e-02,   1.24593974e-02,
         1.11600734e-01,   1.10969909e-01,   7.91163851e-02,
         2.22227511e-03,   1.58726560e-03,   1.72467389e-03,
         2.71125210e-03,   3.74790123e-03,   3.99227665e-03,
         4.66614651e-03,   5.39147069e-03,   6.22634185e-03,
         5.67786178e-03,   6.41209653e-03,   8.25049604e-03,
         6.42897545e-03,   7.79081695e-03,   6.09934785e-03,
         6.10777970e-03,   5.62975949e-03,   6.51220714e-03,
         5.46114677e-03,   5.54992670e-03,   3.73997072e-03,
         3.52495842e-03,   3.15483521e-03,   2.37169836e-03,
         1.98641098e-03,   2.33878178e-03,   1.01819033e-03,
         6.42677764e-04,   9.57814003e-04,   4.25767082e-04,
         4.39973005e-04,   3.71987161e-05,   1.09326771e-04,
         6.93904622e-05,   0.00000000e+00,   3.80506013e-05,
         6.49495708e-03,   5.44486661e-03,   7.12187429e-03,
         7.10305978e-03,   7.23758695e-03])

In [58]:
importances = best_clf.feature_importances_

# Plot
import importlib
importlib.reload(vs)
vs.feature_plot(importances, X_train, y_train)



In [44]:
best_clf.tree_.node_count


Out[44]:
44717

In [45]:
# Rank features by importance and inspect the top 10
indices = np.argsort(importances)[::-1]
columns = X_train.columns.values[indices[:10]]
values = importances[indices][:10]
np.cumsum(values)  # cumulative importance of the top 10 (not displayed)
values
columns  # only this final expression is echoed as the cell output


Out[45]:
array(['dti', 'revol_bal', 'revol_util', 'annual_inc', 'loan_amnt',
       'total_acc', 'earliest_cr_line', 'open_acc', 'emp_length',
       'inq_last_6mths'], dtype=object)
