In [420]:
%matplotlib inline

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import ttest_ind

sns.set_style('white')

Prediction

For investors it's interesting to know which characteristics of a loan are predictive of a loan ending in charged off. Lending club has its own algorithms beforehand that they use to predict which loans are riskier and give these a grade (A-F). This correlates well with the probability of charged off as we saw in the exploration of the dataset. The interest rates should reflect the risk (higher interest with more risk) to make the riskier loans still attractive to invest in. Although grade and interest correlates well, it's not a perfect correlation.

We will here use the loans that went to full term to build classifiers that can classify loans into charged off and fully paid. The accuracy measure used is 'f1_weighted' of sklearn. This score can be interpreted as a weighted average of the precision and recall. Also confusion matrices and ROC curves will be used for analysis. Grade is used as baseline prediction for charged off/fully paid. We will look for features that add extra predictive value on top of the grade feature and see if this gives us any insight.

Select loans and features

We selected the loans here that went to full term and add the characteristic whether they were charged off or not. We excluded the one loan that was a joint application. The number of loans left are 255,719. And the percentage of charged_off loans is 18%.


In [341]:
loans = pd.read_csv('../data/loan.csv')
closed_loans = loans[loans['loan_status'].isin(['Fully Paid', 'Charged Off'])]
print(closed_loans.shape)
round(sum(closed_loans['loan_status']=='Charged Off')/len(closed_loans['loan_status'])*100)


(252971, 74)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2902: DtypeWarning: Columns (19,55) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[341]:
18.0

Select features

We selected features that can be included for the prediction. For this we left out features that are not known at the beginning, like 'total payment'. Because these are not useful features to help new investors. Also non-predictive features like 'id' or features that have all the same values are also excluded. All features to do with 'joint' loans are also excluded, since we do not have joint loans. We did add loan_status and charged_off for the prediction. Furthermore, features that were missing in more than 5% of the loans were excluded leaving 24 features (excluding the targets loan_status and charged_off).


In [342]:
include = ['term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 
          'annual_inc', 'purpose', 'zip_code', 'addr_state', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 
          'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 
          'mths_since_last_major_derog', 'acc_now_delinq', 'loan_amnt', 'open_il_6m', 'open_il_12m', 
          'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'dti', 'open_acc_6m', 'tot_cur_bal',
          'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl',
          'inq_last_12m', 'issue_d', 'loan_status']

exclude = ['funded_amnt', 'funded_amnt_inv', 'verfication_status', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
           'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med', 
           'initial_list_status', 'id', 'member_id', 'emp_title', 'pymnt_plan', 'url', 'desc', 'title', 
           'out_prncp', 'out_prncp_inv', 'total_pymnt', 'last_pymnt_amnt', 'next_pymnt_d', 'policy_code', 
           'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'tot_coll_amt',
           ]


# exclude the one joint application
closed_loans = closed_loans[closed_loans['application_type'] == 'INDIVIDUAL']

# make id index
closed_loans.index = closed_loans.id

# include only the features above
closed_loans = closed_loans[include]

# exclude features with more than 5% missing values
columns_not_missing = (closed_loans.isnull().apply(sum, 0) / len(closed_loans)) < 0.1
closed_loans = closed_loans.loc[:,columns_not_missing[columns_not_missing].index]

# delete rows with NANs
print(1 - closed_loans.dropna().shape[0] / closed_loans.shape[0]) # ratio deleted rows
closed_loans = closed_loans.dropna()

# calculate nr of days between earliest creditline and issue date of the loan
# delete the two original features
closed_loans['earliest_cr_line'] = pd.to_datetime(closed_loans['earliest_cr_line'])
closed_loans['issue_d'] = pd.to_datetime(closed_loans['issue_d'])
closed_loans['days_since_first_credit_line'] = closed_loans['issue_d'] - closed_loans['earliest_cr_line']
closed_loans['days_since_first_credit_line'] = closed_loans['days_since_first_credit_line'] / np.timedelta64(1, 'D')
closed_loans = closed_loans.drop(['earliest_cr_line', 'issue_d'], axis=1)

# delete redundant features
#closed_loans = closed_loans.drop(['grade'], axis=1)

# round-up annual_inc and cut-off outliers annual_inc at 200.000
closed_loans['annual_inc'] = np.ceil(closed_loans['annual_inc'] / 1000)
closed_loans.loc[closed_loans['annual_inc'] > 200, 'annual_inc'] = 200

closed_loans.shape


0.0007866545440170514
Out[342]:
(252771, 23)

In [343]:
closed_loans.head()


Out[343]:
term int_rate installment grade sub_grade emp_length home_ownership annual_inc purpose zip_code ... open_acc pub_rec revol_bal revol_util total_acc acc_now_delinq loan_amnt dti loan_status days_since_first_credit_line
id
1077501 36 months 10.65 162.87 B B2 10+ years RENT 24.0 credit_card 860xx ... 3.0 0.0 13648.0 83.7 9.0 0.0 5000.0 27.65 Fully Paid 9830.0
1077430 60 months 15.27 59.83 C C4 < 1 year RENT 30.0 car 309xx ... 3.0 0.0 1687.0 9.4 4.0 0.0 2500.0 1.00 Charged Off 4627.0
1077175 36 months 15.96 84.33 C C5 10+ years RENT 13.0 small_business 606xx ... 2.0 0.0 2956.0 98.5 10.0 0.0 2400.0 8.72 Fully Paid 3682.0
1076863 36 months 13.49 339.31 C C1 10+ years RENT 50.0 other 917xx ... 10.0 0.0 5598.0 21.0 37.0 0.0 10000.0 20.00 Fully Paid 5782.0
1075269 36 months 7.90 156.46 A A4 3 years RENT 36.0 wedding 852xx ... 9.0 0.0 7963.0 28.3 12.0 0.0 5000.0 11.20 Fully Paid 2586.0

5 rows × 23 columns


In [229]:
closed_loans.columns


Out[229]:
Index(['term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length',
       'home_ownership', 'annual_inc', 'purpose', 'zip_code', 'addr_state',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'acc_now_delinq', 'loan_amnt', 'dti',
       'loan_status', 'days_since_first_credit_line'],
      dtype='object')

In [230]:
plt.hist(closed_loans['annual_inc'], bins=100)


Out[230]:
(array([  2.00000000e+00,   1.60000000e+01,   3.50000000e+01,
          1.80000000e+02,   3.33000000e+02,   3.36000000e+02,
          7.76000000e+02,   8.31000000e+02,   1.43700000e+03,
          1.42800000e+03,   2.16300000e+03,   3.31500000e+03,
          2.76100000e+03,   5.69800000e+03,   3.77600000e+03,
          3.77500000e+03,   7.98500000e+03,   4.74700000e+03,
          9.69400000e+03,   6.15900000e+03,   4.64200000e+03,
          9.43800000e+03,   6.27400000e+03,   1.11370000e+04,
          6.03300000e+03,   4.95100000e+03,   8.92800000e+03,
          4.74000000e+03,   1.15990000e+04,   4.05400000e+03,
          3.74300000e+03,   9.12000000e+03,   4.12500000e+03,
          1.26500000e+03,   8.25300000e+03,   4.72300000e+03,
          7.87700000e+03,   2.33300000e+03,   2.87100000e+03,
          7.41700000e+03,   2.74100000e+03,   6.07800000e+03,
          2.12800000e+03,   2.07000000e+03,   5.63600000e+03,
          2.29700000e+03,   3.54800000e+03,   2.03800000e+03,
          1.83700000e+03,   5.42100000e+03,   1.56300000e+03,
          2.47900000e+03,   1.02300000e+03,   1.09000000e+03,
          3.36700000e+03,   8.15000000e+02,   1.79100000e+03,
          7.02000000e+02,   6.23000000e+02,   3.63800000e+03,
          5.06000000e+02,   2.09600000e+03,   4.44000000e+02,
          4.32000000e+02,   1.82700000e+03,   4.29000000e+02,
          1.25000000e+02,   1.04100000e+03,   3.42000000e+02,
          1.48100000e+03,   2.52000000e+02,   3.37000000e+02,
          7.45000000e+02,   2.48000000e+02,   2.07400000e+03,
          1.70000000e+02,   1.98000000e+02,   4.93000000e+02,
          1.65000000e+02,   8.59000000e+02,   9.40000000e+01,
          1.03000000e+02,   4.88000000e+02,   1.31000000e+02,
          5.18000000e+02,   1.12000000e+02,   7.00000000e+01,
          5.67000000e+02,   9.70000000e+01,   6.47000000e+02,
          6.70000000e+01,   8.10000000e+01,   2.95000000e+02,
          7.00000000e+01,   2.67000000e+02,   6.70000000e+01,
          4.10000000e+01,   1.47000000e+02,   6.30000000e+01,
          4.72700000e+03]),
 array([   3.  ,    4.97,    6.94,    8.91,   10.88,   12.85,   14.82,
          16.79,   18.76,   20.73,   22.7 ,   24.67,   26.64,   28.61,
          30.58,   32.55,   34.52,   36.49,   38.46,   40.43,   42.4 ,
          44.37,   46.34,   48.31,   50.28,   52.25,   54.22,   56.19,
          58.16,   60.13,   62.1 ,   64.07,   66.04,   68.01,   69.98,
          71.95,   73.92,   75.89,   77.86,   79.83,   81.8 ,   83.77,
          85.74,   87.71,   89.68,   91.65,   93.62,   95.59,   97.56,
          99.53,  101.5 ,  103.47,  105.44,  107.41,  109.38,  111.35,
         113.32,  115.29,  117.26,  119.23,  121.2 ,  123.17,  125.14,
         127.11,  129.08,  131.05,  133.02,  134.99,  136.96,  138.93,
         140.9 ,  142.87,  144.84,  146.81,  148.78,  150.75,  152.72,
         154.69,  156.66,  158.63,  160.6 ,  162.57,  164.54,  166.51,
         168.48,  170.45,  172.42,  174.39,  176.36,  178.33,  180.3 ,
         182.27,  184.24,  186.21,  188.18,  190.15,  192.12,  194.09,
         196.06,  198.03,  200.  ]),
 <a list of 100 Patch objects>)

Split data

We keep 30% of the data separate for now so we can later use this to reliable test the performance of the classifier. The split is stratified by 'loan_status' in order to equally divide old loans over the split (old loans have a higher 'charged_off' probability). The classes to predict are in the variable 'charged_off'.


In [237]:
X_train, X_test, y_train, y_test = train_test_split(closed_loans, closed_loans['loan_status'], 
                                                    test_size=0.3, random_state=123)
X_train = X_train.drop('loan_status', axis=1)
X_test = X_test.drop('loan_status', axis=1)

Logistic regression

We will first start with the logistic regression classifier. This is a simple classifier that uses a sigmoidal curve to predict from the features to which class the sample belongs. It has one parameter to tune namely the C-parameter. This is the inverse of the regularization strength, smaller values specify stronger regularization. In sklearn the features have to be numerical that we input in this algorithm, so we need to convert the categorical features to numeric. To do this ordered categorical features will have adjacent numbers and unordered features will get an order as best as possible during conversion to numeric, for instance geographical. Also there cannot be nan/inf/-inf values, hence these will be made 0's. With this algorithm we will also have to scale and normalize the features.

Non-numeric features were converted as follows:

  • earliest_cr_line: the date was converted to a timestamp number
  • grade/sub_grade: order of the letters was kept
  • emp_length: nr of years
  • zipcode: numbers kept of zipcode (geographical order)
  • term: in months
  • home_ownership: from none to rent to mortgage to owned
  • purpose: from purposes that might make money to purposes that only cost money
  • addr_state: ordered geographically from west to east, top to bottom (https://theusa.nl/staten/)

In [239]:
# features that are not float or int, so not to be converted:

# ordered:
# sub_grade, emp_length, zip_code, term

# unordered:
# home_ownership, purpose, addr_state (ordered geographically)

# term
X_train['term'] = X_train['term'].apply(lambda x: int(x.split(' ')[1]))

# grade
loans['grade'] = loans['grade'].astype('category')
grade_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
X_train['grade'] = X_train['grade'].apply(lambda x: grade_dict[x])

# emp_length
emp_length_dict = {'n/a':0,
                   '< 1 year':0,
                   '1 year':1,
                   '2 years':2,
                   '3 years':3,
                   '4 years':4,
                   '5 years':5,
                   '6 years':6,
                   '7 years':7,
                   '8 years':8,
                   '9 years':9,
                   '10+ years':10}
X_train['emp_length'] = X_train['emp_length'].apply(lambda x: emp_length_dict[x])

# zipcode
X_train['zip_code'] = X_train['zip_code'].apply(lambda x: int(x[0:3]))

# subgrade
X_train['sub_grade'] = X_train['grade'] + X_train['sub_grade'].apply(lambda x: float(list(x)[1])/10)

# house
house_dict = {'NONE': 0, 'OTHER': 0, 'ANY': 0, 'RENT': 1, 'MORTGAGE': 2, 'OWN': 3}
X_train['home_ownership'] = X_train['home_ownership'].apply(lambda x: house_dict[x])

# purpose
purpose_dict = {'other': 0, 'small_business': 1, 'renewable_energy': 2, 'home_improvement': 3,
                'house': 4, 'educational': 5, 'medical': 6, 'moving': 7, 'car': 8, 
                'major_purchase': 9, 'wedding': 10, 'vacation': 11, 'credit_card': 12, 
                'debt_consolidation': 13}
X_train['purpose'] = X_train['purpose'].apply(lambda x: purpose_dict[x])

# states
state_dict = {'AK': 0, 'WA': 1, 'ID': 2, 'MT': 3, 'ND': 4, 'MN': 5, 
              'OR': 6, 'WY': 7, 'SD': 8, 'WI': 9, 'MI': 10, 'NY': 11, 
              'VT': 12, 'NH': 13, 'MA': 14, 'CT': 15, 'RI': 16, 'ME': 17,
              'CA': 18, 'NV': 19, 'UT': 20, 'CO': 21, 'NE': 22, 'IA': 23, 
              'KS': 24, 'MO': 25, 'IL': 26, 'IN': 27, 'OH': 28, 'PA': 29, 
              'NJ': 30, 'KY': 31, 'WV': 32, 'VA': 33, 'DC': 34, 'MD': 35, 
              'DE': 36, 'AZ': 37, 'NM': 38, 'OK': 39, 'AR': 40, 'TN': 41, 
              'NC': 42, 'TX': 43, 'LA': 44, 'MS': 45, 'AL': 46, 'GA': 47, 
              'SC': 48, 'FL': 49, 'HI': 50}
X_train['addr_state'] = X_train['addr_state'].apply(lambda x: state_dict[x])

# make NA's, inf and -inf 0
X_train = X_train.fillna(0)
X_train = X_train.replace([np.inf, -np.inf], 0)

In [240]:
X_train.columns


Out[240]:
Index(['term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length',
       'home_ownership', 'annual_inc', 'purpose', 'zip_code', 'addr_state',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'acc_now_delinq', 'loan_amnt', 'dti',
       'days_since_first_credit_line'],
      dtype='object')

In [241]:
# scaling and normalizing the features
X_train_scaled = preprocessing.scale(X_train)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)

After the categorical features are conversed to numeric an normalized/scaled, we will first check what the accuracy is when only using the feature 'grade' (A-F) to predict 'charged off' (True/False). This is the classification lending club gave the loans. The closer to F the higher the chance the loan will end in 'charged off'. For the accuracy estimation we will use 'F1-weighted'. This stands for F1 = 2 (precision recall) / (precision + recall). In this way both precision and recall is important for the accuracy. Precision is the number of correct positive results divided by the number of all positive results, and recall is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0. In this case using only 'grade' as feature, using the default parameter value for C (inverse of regularization strength) and using l1/lasso penalization we get an F1-accuracy of 0.744.


In [242]:
clf = LogisticRegression(penalty='l1')
scores = cross_val_score(clf, X_train_scaled.loc[:,['grade']], y_train, cv=10, scoring='f1_weighted')
print(scores)
print(np.mean(scores))


[ 0.7451726   0.74695486  0.74425513  0.7456573   0.74568727  0.7450578
  0.74423565  0.74611306  0.74540367  0.74482549]
0.745336282371

A score of 0.744 looks not really high but still a lot better than random. Nevertheless, if we look into the confusion matrix and the ROC-curve we see a whole other picture. It turns out the algorithm mostly predicts everything in the not charged off group and therefore gets the majority right, because there are a lot more paid loans than charged off loans (18%). The area under the curve even gives only a score of 0.506 while random is 0.5. The prediction with logistic regression and only feature grade is therefore only as good as random.


In [243]:
from sklearn.model_selection import cross_val_predict
from pandas_confusion import ConfusionMatrix

prediction = cross_val_predict(clf, X_train_scaled.loc[:,['grade']], y_train, cv=10)
confusion_matrix = ConfusionMatrix(y_train, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          117       14057    14174
Fully Paid           487       61295    61782
__all__              604       75352    75956


Overall Statistics:

Accuracy: 0.808520722524
95% CI: (0.80570421101228251, 0.81131364877443635)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.000589413953434
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                Charged Off   Fully Paid
Population                                   75956        75956
P: Condition positive                        14174        61782
N: Condition negative                        61782        14174
Test outcome positive                          604        75352
Test outcome negative                        75352          604
TP: True Positive                              117        61295
TN: True Negative                            61295          117
FP: False Positive                             487        14057
FN: False Negative                           14057          487
TPR: (Sensitivity, hit rate, recall)    0.00825455     0.992117
TNR=SPC: (Specificity)                    0.992117   0.00825455
PPV: Pos Pred Value (Precision)           0.193709     0.813449
NPV: Neg Pred Value                       0.813449     0.193709
FPR: False-out                          0.00788255     0.991745
FDR: False Discovery Rate                 0.806291     0.186551
FNR: Miss Rate                            0.991745   0.00788255
ACC: Accuracy                             0.808521     0.808521
F1 score                                 0.0158343     0.893943
MCC: Matthews correlation coefficient   0.00163173   0.00163173
Informedness                           0.000371996  0.000371996
Markedness                              0.00715749   0.00715749
Prevalence                                0.186608     0.813392
LR+: Positive likelihood ratio             1.04719      1.00038
LR-: Negative likelihood ratio            0.999625     0.954934
DOR: Diagnostic odds ratio                 1.04759      1.04759
FOR: False omission rate                  0.186551     0.806291
Out[243]:
<matplotlib.axes._subplots.AxesSubplot at 0x114862978>

In [251]:
y_score = cross_val_predict(clf, X_train_scaled.loc[:,['grade']], y_train, cv=10, method='predict_proba')
fpr, tpr, thresholds = roc_curve(y_train, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.662577306672
Out[251]:
[<matplotlib.lines.Line2D at 0x13d6c5438>]

We can now include all the features we selected (24) and see if the prediction will be better. Because we use regularization, the effect of not useful features will be downgraded automatically. This leads to a slightly better F1-score of 0.751. Also the confusion matrix and the ROC-curve/AUC-score are a little better. Although still not great with an AUC score of 0.515. The top-5 features most used by the algorithm are: 'funded_amnt_inv', 'int_rate', 'sub_grade', 'funded_amnt' and 'annual_inc'. Not even grade itself.


In [256]:
clf = LogisticRegression(penalty='l1')
scores = cross_val_score(clf, X_train_scaled, y_train, cv=10, scoring='f1_weighted')
print(scores)  
print(np.mean(scores))


[ 0.75310091  0.75227635  0.75317398  0.75325794  0.75425814  0.75201058
  0.75253293  0.7538835   0.75249689  0.75422849]
0.75312197205

In [257]:
prediction = cross_val_predict(clf, X_train_scaled, y_train, cv=10)
confusion_matrix = ConfusionMatrix(y_train, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          195       13979    14174
Fully Paid           865       60917    61782
__all__             1060       74896    75956


Overall Statistics:

Accuracy: 0.80457106746
95% CI: (0.80173290951382237, 0.80738594264111385)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: -0.000378008425893
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                Charged Off   Fully Paid
Population                                   75956        75956
P: Condition positive                        14174        61782
N: Condition negative                        61782        14174
Test outcome positive                         1060        74896
Test outcome negative                        74896         1060
TP: True Positive                              195        60917
TN: True Negative                            60917          195
FP: False Positive                             865        13979
FN: False Negative                           13979          865
TPR: (Sensitivity, hit rate, recall)     0.0137576     0.985999
TNR=SPC: (Specificity)                    0.985999    0.0137576
PPV: Pos Pred Value (Precision)           0.183962     0.813355
NPV: Neg Pred Value                       0.813355     0.183962
FPR: False-out                           0.0140008     0.986242
FDR: False Discovery Rate                 0.816038     0.186645
FNR: Miss Rate                            0.986242    0.0140008
ACC: Accuracy                             0.804571     0.804571
F1 score                                 0.0256006     0.891394
MCC: Matthews correlation coefficient -0.000807906 -0.000807906
Informedness                          -0.000243257 -0.000243257
Markedness                             -0.00268322  -0.00268322
Prevalence                                0.186608     0.813392
LR+: Positive likelihood ratio            0.982626     0.999753
LR-: Negative likelihood ratio             1.00025      1.01768
DOR: Diagnostic odds ratio                0.982383     0.982383
FOR: False omission rate                  0.186645     0.816038
Out[257]:
<matplotlib.axes._subplots.AxesSubplot at 0x120ecc748>

In [258]:
y_score = cross_val_predict(clf, X_train_scaled, y_train, cv=10, method='predict_proba')
fpr, tpr, thresholds = roc_curve(y_train, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.700731760785
Out[258]:
[<matplotlib.lines.Line2D at 0x1148a4c50>]

In [281]:
clf = LogisticRegression(penalty='l1', C=10)
clf.fit(X_train_scaled, y_train)
coefs = clf.coef_

# find index of top 5 highest coefficients, aka most used features for prediction
positions = abs(coefs[0]).argsort()[-5:][::-1]
print(X_train_scaled.columns[positions])
print(coefs[0][positions])


Index(['int_rate', 'annual_inc', 'sub_grade', 'term', 'dti'], dtype='object')
[-0.6129477   0.29842243  0.27762066 -0.17339293 -0.15911763]

We can also pick only 5 features and see if this works better. But it works exactly the same as only grade. So SelectKBest does not work as well with 5 features as using all features.


In [260]:
new_X = (SelectKBest(mutual_info_classif, k=5)
        .fit_transform(X_train_scaled, y_train))

In [262]:
print(new_X[0]) # term, int_rate, installement, grade, sub_grade
print(X_train_scaled.head())
new_X = pd.DataFrame(new_X, columns=['term', 'int_rate', 'installment', 'grade', 'sub_grade'])


[-0.53469047 -0.40174267  1.00571969 -0.59548115 -0.60103177]
       term  int_rate  installment     grade  sub_grade  emp_length  \
0 -0.534690 -0.401743     1.005720 -0.595481  -0.601032   -1.503256   
1  1.870241  0.865487     0.347860  0.900654   0.988284   -1.503256   
2 -0.534690 -0.092884    -0.352335  0.152587   0.080104    0.643787   
3  1.870241  0.554357    -0.710339  0.152587   0.231467   -1.503256   
4 -0.534690  0.863216    -0.427319  0.900654   0.761239   -0.698114   

   home_ownership  annual_inc   purpose  zip_code  \
0       -1.055764    0.437807  0.538410  1.171976   
1        0.528760    0.519080  0.538410 -0.778588   
2       -1.055764    0.681624  0.288887 -0.894062   
3       -1.055764   -0.781278  0.288887 -1.324746   
4       -1.055764   -0.510370  0.538410  1.171976   

               ...               inq_last_6mths  open_acc   pub_rec  \
0              ...                    -0.800122  1.851360 -0.329101   
1              ...                    -0.800122  1.442024 -0.329101   
2              ...                     3.886942 -0.809321 -0.329101   
3              ...                     0.137291  3.693369 -0.329101   
4              ...                     0.137291  0.828021 -0.329101   

   revol_bal  revol_util  total_acc  acc_now_delinq  loan_amnt       dti  \
0   0.556035    0.407422   1.183897       -0.051363   0.792242  1.453198   
1   1.029208    0.592675   1.353932       -0.051363   0.792242  0.796525   
2  -0.174204    0.834310  -0.091366       -0.051363  -0.462926  0.648741   
3  -0.510631   -0.095984   1.013862       -0.051363  -0.438315 -0.115878   
4  -0.525372   -0.595362   0.588774       -0.051363  -0.595211  0.043471   

   days_since_first_credit_line  
0                     -0.058769  
1                     -0.262228  
2                     -0.071656  
3                     -1.152993  
4                     -0.403985  

[5 rows x 22 columns]

In [263]:
clf = LogisticRegression(penalty='l1')
scores = cross_val_score(clf, new_X, y_train, cv=10, scoring='f1_weighted')
print(scores)  
print(np.mean(scores))


[ 0.74394672  0.74516986  0.74421423  0.74415496  0.74578639  0.74602528
  0.74268573  0.74441964  0.74488518  0.74455516]
0.744584314293

In [264]:
prediction = cross_val_predict(clf, new_X, y_train, cv=10)
confusion_matrix = ConfusionMatrix(y_train, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off           93       14081    14174
Fully Paid           429       61353    61782
__all__              522       75434    75956


Overall Statistics:

Accuracy: 0.808968350097
95% CI: (0.80615432037334445, 0.81175876027606175)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: -0.000608142860769
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                Charged Off   Fully Paid
Population                                   75956        75956
P: Condition positive                        14174        61782
N: Condition negative                        61782        14174
Test outcome positive                          522        75434
Test outcome negative                        75434          522
TP: True Positive                               93        61353
TN: True Negative                            61353           93
FP: False Positive                             429        14081
FN: False Negative                           14081          429
TPR: (Sensitivity, hit rate, recall)    0.00656131     0.993056
TNR=SPC: (Specificity)                    0.993056   0.00656131
PPV: Pos Pred Value (Precision)           0.178161     0.813334
NPV: Neg Pred Value                       0.813334     0.178161
FPR: False-out                          0.00694377     0.993439
FDR: False Discovery Rate                 0.821839     0.186666
FNR: Miss Rate                            0.993439   0.00694377
ACC: Accuracy                             0.808968     0.808968
F1 score                                 0.0126565     0.894254
MCC: Matthews correlation coefficient  -0.00180362  -0.00180362
Informedness                          -0.000382461 -0.000382461
Markedness                             -0.00850557  -0.00850557
Prevalence                                0.186608     0.813392
LR+: Positive likelihood ratio             0.94492     0.999615
LR-: Negative likelihood ratio             1.00039      1.05829
DOR: Diagnostic odds ratio                0.944557     0.944557
FOR: False omission rate                  0.186666     0.821839
Out[264]:
<matplotlib.axes._subplots.AxesSubplot at 0x129fb3ac8>

In [269]:
y_score = cross_val_predict(clf, new_X, y_train, cv=10, method='predict_proba')
fpr, tpr, thresholds = roc_curve(y_train, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.679367278898
Out[269]:
[<matplotlib.lines.Line2D at 0x1444df208>]

To see the statistical relevance of certain features, we can use the statsmodels package. We first use it with the 5 features selected by SelectKBest. We see there that only term, int_rate and installement are relevant. Confidence interval of all is small, but the coefficients of alle are also very close to 0, so do not seem to have a huge influence.

Subsequently we do the same for the 5 features with the highest coefficients in the regularized logistic regression that uses all features. Of these all features seem useful, except for 'sub_grade'. The coefficients are slightly higher and confidence intervals are small. Although the conclusions are contradictory. Funded_amnt and funded_amnt_inv have the highest coefficients. These two values should be roughly the same, but have a contradictory relation with the target value charged_off. This makes no sense and gives the idea that the algorithm is still pretty random.


In [274]:
y_train == 'Charged Off'


Out[274]:
455306    False
790918    False
354831     True
76630     False
91839      True
130302    False
141508    False
225891    False
125360    False
424340     True
189953    False
425536    False
397052    False
150002    False
456947    False
201937     True
20655     False
82353     False
15805     False
179553     True
203884    False
54652     False
14433      True
186397    False
215405     True
393572    False
155273     True
11953      True
31773     False
24921     False
          ...  
135750    False
200590    False
217457    False
539140    False
394080    False
37762     False
220456    False
696143    False
384529     True
883627    False
264591    False
113698    False
188324    False
870075     True
226221    False
351325    False
102705     True
225235    False
450122     True
61789     False
199408    False
406847    False
217931    False
376823    False
189476    False
370936    False
19669     False
30115     False
17428     False
866544    False
Name: Actual, dtype: bool

In [275]:
import statsmodels.api as sm

print(new_X.columns)
logit = sm.Logit(y_train == 'Charged Off', np.array(new_X))
result = logit.fit()
print(result.summary())


Index(['term', 'int_rate', 'installment', 'grade', 'sub_grade'], dtype='object')
Optimization terminated successfully.
         Current function value: 0.675398
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Actual   No. Observations:               176939
Model:                          Logit   Df Residuals:                   176934
Method:                           MLE   Df Model:                            4
Date:                Thu, 26 Jan 2017   Pseudo R-squ.:                 -0.4376
Time:                        18:22:55   Log-Likelihood:            -1.1950e+05
converged:                       True   LL-Null:                       -83125.
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0903      0.006     16.251      0.000         0.079     0.101
x2             0.3858      0.020     18.999      0.000         0.346     0.426
x3            -0.0162      0.005     -3.255      0.001        -0.026    -0.006
x4             0.0197      0.050      0.391      0.696        -0.079     0.119
x5            -0.0658      0.061     -1.078      0.281        -0.185     0.054
==============================================================================

In [277]:
logit = sm.Logit(y_train == 'Charged Off', np.array(
        X_train_scaled.loc[:,['int_rate', 'annual_inc', 'sub_grade', 'term', 'dti']]))
result = logit.fit()
print(result.summary())


Optimization terminated successfully.
         Current function value: 0.672390
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Actual   No. Observations:               176939
Model:                          Logit   Df Residuals:                   176934
Method:                           MLE   Df Model:                            4
Date:                Thu, 26 Jan 2017   Pseudo R-squ.:                 -0.4312
Time:                        18:24:22   Log-Likelihood:            -1.1897e+05
converged:                       True   LL-Null:                       -83125.
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.3304      0.019     17.575      0.000         0.294     0.367
x2            -0.1166      0.005    -23.045      0.000        -0.127    -0.107
x3            -0.0177      0.019     -0.930      0.353        -0.055     0.020
x4             0.1061      0.006     19.017      0.000         0.095     0.117
x5             0.0902      0.005     17.881      0.000         0.080     0.100
==============================================================================

Another way to possibly increase performance is to tune the C (penalization) parameter. We will do this with the GridSearchCV function of sklearn. The best performing C parameter, although really close with the default, is C=1. Giving an accuracy of 0.752. (code is quoted out because it takes a long time to run)


In [278]:
from sklearn.model_selection import GridSearchCV
dict_Cs = {'C': [0.001, 0.1, 1, 10, 100]}
clf = GridSearchCV(LogisticRegression(penalty='l1'), dict_Cs, 'f1_weighted', cv=10)

clf.fit(X_train_scaled, y_train)
print(clf.best_params_)
print(clf.best_score_)


{'C': 10}
0.753161919566

In [280]:
clf = LogisticRegression(penalty='l1', C=10)
scores = cross_val_score(clf, X_train_scaled, y_train, cv=10, scoring='f1_weighted')
print(scores)  
print(np.mean(scores))


[ 0.75322471  0.75227635  0.75326607  0.75338166  0.75425814  0.75210257
  0.75256493  0.7538835   0.75243284  0.75422849]
0.753161926239

In [282]:
prediction = cross_val_predict(clf, X_train_scaled, y_train, cv=10)
confusion_matrix = ConfusionMatrix(y_train, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          196       13978    14174
Fully Paid           868       60914    61782
__all__             1064       74892    75956


Overall Statistics:

Accuracy: 0.804544736426
95% CI: (0.80170643567460653, 0.80735975642941682)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: -0.000343773069975
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                Charged Off   Fully Paid
Population                                   75956        75956
P: Condition positive                        14174        61782
N: Condition negative                        61782        14174
Test outcome positive                         1064        74892
Test outcome negative                        74892         1064
TP: True Positive                              196        60914
TN: True Negative                            60914          196
FP: False Positive                             868        13978
FN: False Negative                           13978          868
TPR: (Sensitivity, hit rate, recall)     0.0138281     0.985951
TNR=SPC: (Specificity)                    0.985951    0.0138281
PPV: Pos Pred Value (Precision)           0.184211     0.813358
NPV: Neg Pred Value                       0.813358     0.184211
FPR: False-out                           0.0140494     0.986172
FDR: False Discovery Rate                 0.815789     0.186642
FNR: Miss Rate                            0.986172    0.0140494
ACC: Accuracy                             0.804545     0.804545
F1 score                                 0.0257252     0.891377
MCC: Matthews correlation coefficient -0.000733497 -0.000733497
Informedness                          -0.000221263 -0.000221263
Markedness                             -0.00243157  -0.00243157
Prevalence                                0.186608     0.813392
LR+: Positive likelihood ratio            0.984251     0.999776
LR-: Negative likelihood ratio             1.00022        1.016
DOR: Diagnostic odds ratio                 0.98403      0.98403
FOR: False omission rate                  0.186642     0.815789
Out[282]:
<matplotlib.axes._subplots.AxesSubplot at 0x1377aea20>

In [283]:
y_score = cross_val_predict(clf, X_train_scaled, y_train, cv=10, method='predict_proba')
fpr, tpr, thresholds = roc_curve(y_train, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.700731748611
Out[283]:
[<matplotlib.lines.Line2D at 0x147ba8e80>]

Random Forest

To improve accuracy levels we could use a more complicated algorithm that scores well in a lot of cases, namely random forest. This algorithm makes various decision trees from subsets of the samples and uses at each split only a fraction of the features to prevent overfitting. The random forest algorithm is known to be not very sensitive to the values of its parameters: the number of features used at each split and the number of trees in the forest. Nevertheless, the default of sklearn is so low that we will raise the number of trees to 100. The algorithm has feature selection already builtin (at each split) and scaling/normalization is also not necessary.

We will first run the algorithm with only grade. This makes not that much sense for Random Forest, since it builds trees. And you cannot build a tree from only one feature. Nevertheless, this will be our starting point. The F1 score is 0.739 hence slightly lower than logistic regression. As expected is the confusion matrix dramatic, namely the algorithm turns out to just predict everything as fully paid. And that's why the AUC-score is exactly random.


In [284]:
clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X_train.loc[:,['grade']], y_train, cv=10, scoring='f1_weighted')
print(scores)
print(np.mean(scores))


/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
[ 0.74032961  0.74032961  0.74032961  0.74039443  0.74039443  0.74039443
  0.7403803   0.7403803   0.7403803   0.7403803 ]
0.740369331874

In [285]:
prediction = cross_val_predict(clf, X_train.loc[:,['grade']], y_train, cv=10)
confusion_matrix = ConfusionMatrix(y_train, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off            0       14174    14174
Fully Paid             0       61782    61782
__all__                0       75956    75956


Overall Statistics:

Accuracy: 0.813391963768
95% CI: (0.81060277776800005, 0.81615719138182941)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off Fully Paid
Population                                  75956      75956
P: Condition positive                       14174      61782
N: Condition negative                       61782      14174
Test outcome positive                           0      75956
Test outcome negative                       75956          0
TP: True Positive                               0      61782
TN: True Negative                           61782          0
FP: False Positive                              0      14174
FN: False Negative                          14174          0
TPR: (Sensitivity, hit rate, recall)            0          1
TNR=SPC: (Specificity)                          1          0
PPV: Pos Pred Value (Precision)               NaN   0.813392
NPV: Neg Pred Value                      0.813392        NaN
FPR: False-out                                  0          1
FDR: False Discovery Rate                     NaN   0.186608
FNR: Miss Rate                                  1          0
ACC: Accuracy                            0.813392   0.813392
F1 score                                        0   0.897094
MCC: Matthews correlation coefficient         NaN        NaN
Informedness                                    0          0
Markedness                                    NaN        NaN
Prevalence                               0.186608   0.813392
LR+: Positive likelihood ratio                NaN          1
LR-: Negative likelihood ratio                  1        NaN
DOR: Diagnostic odds ratio                    NaN        NaN
FOR: False omission rate                 0.186608        NaN
Out[285]:
<matplotlib.axes._subplots.AxesSubplot at 0x1444f5ef0>

In [287]:
y_score = cross_val_predict(clf, X_train.loc[:,['grade']], y_train, cv=10, method='predict_proba')
fpr, tpr, thresholds = roc_curve(y_train, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.662009408724
Out[287]:
[<matplotlib.lines.Line2D at 0x18e826eb8>]

Trying the algorithm with all the features (24) leads to a slightly higher F1-score of 0.750. But logistic regression with all features was a fraction better than that. Also the confusion matrix and AUC is comparable but slightly worse than the logistic regression algorithm with all features. The random forest classifier does select a different top-5 features, namely 'dti', 'revol_bal', 'revol_util', 'annual_inc' and 'int_rate'.


In [288]:
clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='f1_weighted')
print(scores)
print(np.mean(scores))


[ 0.75355795  0.75200913  0.75151802  0.75171457  0.75187135  0.75392217
  0.75215301  0.75356562  0.75276674  0.75134994]
0.752442849609

In [289]:
prediction = cross_val_predict(clf, X_train, y_train, cv=10)
confusion_matrix = ConfusionMatrix(y_train, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off          224       13950    14174
Fully Paid           892       60890    61782
__all__             1116       74840    75956


Overall Statistics:

Accuracy: 0.804597398494
95% CI: (0.80175938337294439, 0.80741212883302516)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.00211724762002
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off  Fully Paid
Population                                  75956       75956
P: Condition positive                       14174       61782
N: Condition negative                       61782       14174
Test outcome positive                        1116       74840
Test outcome negative                       74840        1116
TP: True Positive                             224       60890
TN: True Negative                           60890         224
FP: False Positive                            892       13950
FN: False Negative                          13950         892
TPR: (Sensitivity, hit rate, recall)    0.0158036    0.985562
TNR=SPC: (Specificity)                   0.985562   0.0158036
PPV: Pos Pred Value (Precision)          0.200717    0.813602
NPV: Neg Pred Value                      0.813602    0.200717
FPR: False-out                          0.0144379    0.984196
FDR: False Discovery Rate                0.799283    0.186398
FNR: Miss Rate                           0.984196   0.0144379
ACC: Accuracy                            0.804597    0.804597
F1 score                                0.0293002    0.891364
MCC: Matthews correlation coefficient  0.00442222  0.00442222
Informedness                           0.00136572  0.00136572
Markedness                              0.0143192   0.0143192
Prevalence                               0.186608    0.813392
LR+: Positive likelihood ratio            1.09459     1.00139
LR-: Negative likelihood ratio           0.998614    0.913582
DOR: Diagnostic odds ratio                1.09611     1.09611
FOR: False omission rate                 0.186398    0.799283
Out[289]:
<matplotlib.axes._subplots.AxesSubplot at 0x1565212b0>

In [290]:
y_score = cross_val_predict(clf, X_train, y_train, cv=10, method='predict_proba')
fpr, tpr, thresholds = roc_curve(y_train, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.693645782868
Out[290]:
[<matplotlib.lines.Line2D at 0x1887b6c88>]

In [291]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
feat_imp = clf.feature_importances_

In [292]:
sns.barplot(x=X_train.columns, y=feat_imp, color='turquoise')
plt.xticks(rotation=90)


Out[292]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21]), <a list of 22 Text xticklabel objects>)

In [293]:
positions = abs(feat_imp).argsort()[-5:][::-1]
print(X_train.columns[positions])
print(feat_imp[positions])


Index(['dti', 'revol_bal', 'days_since_first_credit_line', 'revol_util',
       'int_rate'],
      dtype='object')
[ 0.08211771  0.07660079  0.07606937  0.07533684  0.07394158]

Test set

To test the accuracies of our algorithms we first have to do the same transformations on the test set as we did on the training set. So we will transform the categorical features to numerical and replace nan/inf/-inf with 0. Also for the logistic regression algorithm we normalized and scaled the training set and we saved these transformations, so we can do the exact same tranformation on the test set.


In [294]:
# term
X_test['term'] = X_test['term'].apply(lambda x: int(x.split(' ')[1]))

# grade
X_test['grade'] = X_test['grade'].apply(lambda x: grade_dict[x])

# emp_length
X_test['emp_length'] = X_test['emp_length'].apply(lambda x: emp_length_dict[x])

# zipcode
X_test['zip_code'] = X_test['zip_code'].apply(lambda x: int(x[0:3]))

# subgrade
X_test['sub_grade'] = X_test['grade'] + X_test['sub_grade'].apply(lambda x: float(list(x)[1])/10)

# house
X_test['home_ownership'] = X_test['home_ownership'].apply(lambda x: house_dict[x])

# purpose
X_test['purpose'] = X_test['purpose'].apply(lambda x: purpose_dict[x])

# states
X_test['addr_state'] = X_test['addr_state'].apply(lambda x: state_dict[x])

# make NA's, inf and -inf 0
X_test = X_test.fillna(0)
X_test = X_test.replace([np.inf, -np.inf], 0)

In [295]:
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

logistic regression

For logistic regression we will test both the 'only grade' algorithm (baseline) and the best performing algorithm (C=1, all features with regularization). We find practically the same F-scores/confusion matrices/ROC-curves/AUC-scores as for the training set. Therefore the crossvalidation scheme used on the training set gives reliable accuracy measurements. But it's clear that the predictive value of the algorithm increases slightly with more features, but it's basically predicting that all loans get fully paid and therefore the accuracy scores are practically random.


In [296]:
from sklearn.metrics import f1_score

clf = LogisticRegression(penalty='l1', C=10)
clf.fit(X_train_scaled.loc[:,['grade']], y_train)
prediction = clf.predict(X_test_scaled.loc[:,['grade']])
print(f1_score(y_test, prediction, average='weighted'))
confusion_matrix = ConfusionMatrix(y_test, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


0.746041404442
Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off           19        2529     2548
Fully Paid           101       13201    13302
__all__              120       15730    15850


Overall Statistics:

Accuracy: 0.834069400631
95% CI: (0.82818508469201324, 0.83983084214263837)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: -0.000221228958055
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                Charged Off   Fully Paid
Population                                   15850        15850
P: Condition positive                         2548        13302
N: Condition negative                        13302         2548
Test outcome positive                          120        15730
Test outcome negative                        15730          120
TP: True Positive                               19        13201
TN: True Negative                            13201           19
FP: False Positive                             101         2529
FN: False Negative                            2529          101
TPR: (Sensitivity, hit rate, recall)    0.00745683     0.992407
TNR=SPC: (Specificity)                    0.992407   0.00745683
PPV: Pos Pred Value (Precision)           0.158333     0.839224
NPV: Neg Pred Value                       0.839224     0.158333
FPR: False-out                          0.00759284     0.992543
FDR: False Discovery Rate                 0.841667     0.160776
FNR: Miss Rate                            0.992543   0.00759284
ACC: Accuracy                             0.834069     0.834069
F1 score                                 0.0142429      0.90941
MCC: Matthews correlation coefficient -0.000576352 -0.000576352
Informedness                          -0.000136014 -0.000136014
Markedness                             -0.00244225  -0.00244225
Prevalence                                0.160757     0.839243
LR+: Positive likelihood ratio            0.982087     0.999863
LR-: Negative likelihood ratio             1.00014      1.01824
DOR: Diagnostic odds ratio                0.981952     0.981952
FOR: False omission rate                  0.160776     0.841667
Out[296]:
<matplotlib.axes._subplots.AxesSubplot at 0x1397464a8>

In [297]:
y_score = clf.predict_proba(X_test_scaled.loc[:,['grade']])
print(clf.classes_)
fpr, tpr, thresholds = roc_curve(y_test, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


['Charged Off' 'Fully Paid']
0.662318128932
Out[297]:
[<matplotlib.lines.Line2D at 0x1345e7da0>]

In [298]:
clf = LogisticRegression(penalty='l1', C=10)
clf.fit(X_train_scaled, y_train)
prediction = clf.predict(X_test_scaled)
print(f1_score(y_test, prediction, average='weighted'))
confusion_matrix = ConfusionMatrix(y_test, prediction)
confusion_matrix.print_stats()
confusion_matrix.plot()


0.753664722402
Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off           37        2511     2548
Fully Paid           202       13100    13302
__all__              239       15611    15850


Overall Statistics:

Accuracy: 0.828832807571
95% CI: (0.82287728997885279, 0.83466738743382574)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: -0.00104860773136
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                Charged Off   Fully Paid
Population                                   15850        15850
P: Condition positive                         2548        13302
N: Condition negative                        13302         2548
Test outcome positive                          239        15611
Test outcome negative                        15611          239
TP: True Positive                               37        13100
TN: True Negative                            13100           37
FP: False Positive                             202         2511
FN: False Negative                            2511          202
TPR: (Sensitivity, hit rate, recall)     0.0145212     0.984814
TNR=SPC: (Specificity)                    0.984814    0.0145212
PPV: Pos Pred Value (Precision)           0.154812     0.839152
NPV: Neg Pred Value                       0.839152     0.154812
FPR: False-out                           0.0151857     0.985479
FDR: False Discovery Rate                 0.845188     0.160848
FNR: Miss Rate                            0.985479    0.0151857
ACC: Accuracy                             0.828833     0.828833
F1 score                                 0.0265518     0.906167
MCC: Matthews correlation coefficient  -0.00200279  -0.00200279
Informedness                          -0.000664493 -0.000664493
Markedness                              -0.0060364   -0.0060364
Prevalence                                0.160757     0.839243
LR+: Positive likelihood ratio            0.956242     0.999326
LR-: Negative likelihood ratio             1.00067      1.04576
DOR: Diagnostic odds ratio                0.955597     0.955597
FOR: False omission rate                  0.160848     0.845188
Out[298]:
<matplotlib.axes._subplots.AxesSubplot at 0x1345e94e0>

In [299]:
y_score = clf.predict_proba(X_test_scaled)
print(clf.classes_)
fpr, tpr, thresholds = roc_curve(y_test, y_score[:,0], pos_label='Charged Off')
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


['Charged Off' 'Fully Paid']
0.705200306527
Out[299]:
[<matplotlib.lines.Line2D at 0x17a2dc9e8>]

In [ ]:


In [ ]:


In [ ]:

Try to see if top 25% and bottom 25% are ok (or 10%). Can we at least avoid bad loans?


In [344]:
closed_loans2 = closed_loans.drop(['loan_status'], axis=1)

# term
closed_loans2['term'] = closed_loans2['term'].apply(lambda x: int(x.split(' ')[1]))

# grade
closed_loans2['grade'] = closed_loans2['grade'].apply(lambda x: grade_dict[x])

# emp_length
closed_loans2['emp_length'] = closed_loans2['emp_length'].apply(lambda x: emp_length_dict[x])

# zipcode
closed_loans2['zip_code'] = closed_loans2['zip_code'].apply(lambda x: int(x[0:3]))

# subgrade
closed_loans2['sub_grade'] = closed_loans2['grade'] + closed_loans2['sub_grade'].apply(lambda x: float(list(x)[1])/10)

# house
closed_loans2['home_ownership'] = closed_loans2['home_ownership'].apply(lambda x: house_dict[x])

# purpose
closed_loans2['purpose'] = closed_loans2['purpose'].apply(lambda x: purpose_dict[x])

# states
closed_loans2['addr_state'] = closed_loans2['addr_state'].apply(lambda x: state_dict[x])

# make NA's, inf and -inf 0
closed_loans2 = closed_loans2.fillna(0)
closed_loans2 = closed_loans2.replace([np.inf, -np.inf], 0)

closed_loans_scaled = scaler.transform(closed_loans2)
closed_loans_scaled = pd.DataFrame(closed_loans_scaled, columns=closed_loans2.columns)
closed_loans_scaled.index = closed_loans2.index

In [352]:
closed_loans_scaled


Out[352]:
term int_rate installment grade sub_grade emp_length home_ownership annual_inc purpose zip_code ... inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc acc_now_delinq loan_amnt dti days_since_first_credit_line
id
1077501 -0.534690 -0.706059 -1.042886 -0.595481 -0.676713 1.180548 -1.055764 -1.268913 0.288887 1.047140 ... 0.137291 -1.627992 -0.329101 -0.081789 1.184680 -1.366628 -0.051363 -1.053594 1.426211 1.663403
1077430 1.870241 0.343152 -1.463942 0.152587 0.231467 -1.503256 -1.055764 -1.106368 -0.709206 -0.672477 ... 3.886942 -1.627992 -0.329101 -0.732395 -1.807563 -1.791716 -0.051363 -1.361234 -1.998511 -0.368448
1077175 -0.534690 0.499852 -1.363827 0.152587 0.307149 1.180548 -1.055764 -1.566911 -2.455868 0.254431 ... 1.074704 -1.832660 -0.329101 -0.663369 1.780712 -1.281611 -0.051363 -1.373539 -1.006434 -0.737485
1076863 -0.534690 -0.061090 -0.321892 0.152587 0.004422 1.180548 -1.055764 -0.564552 -2.705391 1.225031 ... 0.137291 -0.195318 -0.329101 -0.519661 -1.340403 1.013862 -0.051363 -0.438315 0.443129 0.082597
1075269 -0.534690 -1.330590 -1.069079 -1.343549 -1.282167 -0.698114 -1.055764 -0.943823 -0.210160 1.022173 ... 2.012117 -0.399986 -0.329101 -0.391019 -1.046414 -1.111576 -0.051363 -1.053594 -0.687736 -1.165490
1072053 -0.534690 1.108486 -1.261260 1.648722 1.518056 0.912168 -1.055764 -0.618734 -0.709206 1.171976 ... 1.074704 -1.423325 -0.329101 -0.376985 1.337715 -1.791716 -0.051363 -1.299706 -1.439504 -1.474387
1071795 1.870241 1.708035 -1.085711 2.396790 2.350555 -0.429734 2.113283 -0.835460 -2.455868 1.352988 ... 1.074704 0.009350 -0.329101 -0.540765 -0.873242 -1.026558 -0.051363 -0.979761 -1.413803 -1.081919
1071570 1.870241 -0.242771 -1.212142 -0.595481 -0.449668 -1.503256 -1.055764 -1.512730 -2.705391 0.778742 ... -0.800122 -1.832660 -0.329101 -0.319436 -0.716180 -1.876733 -0.051363 -1.007448 0.196395 -1.141668
1070078 1.870241 0.202349 -1.081379 0.152587 0.155785 -0.161354 2.113283 0.031445 0.538410 1.025294 ... 1.074704 0.623353 -0.329101 -0.604842 -1.356511 -0.176383 -0.051363 -0.869010 -0.055479 -0.190763
1069908 -0.534690 -0.242771 -0.063512 -0.595481 -0.449668 1.180548 2.113283 0.112718 0.538410 1.212548 ... -0.800122 0.214017 -0.329101 0.445180 0.516157 0.758809 -0.051363 -0.192204 -0.741709 0.986250
1064687 -0.534690 -0.061090 -0.460541 0.152587 0.004422 -1.503256 -1.055764 -1.106368 0.538410 -0.872215 ... 0.137291 -1.423325 -0.329101 -0.255632 1.506859 -1.366628 -0.051363 -0.561371 -0.831664 -1.081919
1069866 -0.534690 -0.874115 -1.313361 -0.595481 -0.752395 -0.698114 -1.055764 -1.512730 0.288887 0.254431 ... 1.074704 0.009350 -0.329101 -0.425831 -0.450381 -1.196593 -0.051363 -1.299706 -0.512966 -0.974528
1069057 -0.534690 -0.706059 -0.377343 -0.595481 -0.676713 -0.698114 -1.055764 0.789987 -2.705391 1.331142 ... 1.074704 0.623353 -0.329101 -0.171594 0.048997 0.333722 -0.051363 -0.438315 -1.219757 0.760922
1069759 -0.534690 0.574796 -1.564139 0.900654 0.761239 -1.503256 -1.055764 -1.160549 0.538410 0.363662 ... 0.137291 0.009350 -0.329101 -0.469292 1.096081 -0.176383 -0.051363 -1.545817 0.482967 -1.569282
1065775 -0.534690 0.343152 -0.286463 0.152587 0.231467 -0.429734 -1.055764 -0.781278 -1.956821 1.237515 ... 1.074704 0.623353 -0.329101 0.483636 0.641002 0.248704 -0.051363 -0.438315 0.263219 -0.297374
1069971 -0.534690 -1.755271 -1.260688 -1.343549 -1.509212 1.180548 0.528760 1.060895 -0.459683 -1.427736 ... -0.800122 1.851360 -0.329101 0.417983 -1.541765 1.438949 -0.051363 -1.225872 -0.775121 0.439528
1062474 -0.534690 -0.465331 -0.897453 -0.595481 -0.601032 -1.234875 0.528760 0.356535 -1.208252 1.140767 ... -0.800122 -1.423325 -0.329101 -0.824158 -0.666644 -0.941541 -0.051363 -0.930538 0.242658 -1.010455
1069742 -0.534690 -1.755271 -0.564212 -1.343549 -1.509212 0.107027 -1.055764 0.193990 0.538410 1.237515 ... -0.800122 -0.604654 -0.329101 -0.426321 -1.255830 0.248704 -0.051363 -0.536760 -0.859936 -0.618768
1069740 1.870241 0.343152 0.271935 0.152587 0.231467 -0.698114 -1.055764 -0.727097 0.538410 0.766259 ... 2.012117 -0.604654 -0.329101 0.144762 1.261198 -0.261401 -0.051363 0.823006 1.282283 -0.166942
1039153 -0.534690 -0.304089 1.159080 -0.595481 -0.525350 1.180548 -1.055764 0.925441 0.538410 -0.591334 ... -0.800122 -0.809321 -0.329101 0.923792 1.450478 1.098879 -0.051363 0.915298 -0.428151 1.936764
1069710 -0.534690 -0.465331 -0.356830 -0.595481 -0.601032 1.180548 2.113283 -0.564552 0.288887 0.856765 ... -0.800122 -0.604654 -0.329101 -0.277172 1.132326 -0.346418 -0.051363 -0.438315 -0.690306 1.592720
1069700 -0.534690 -0.465331 -0.356830 -0.595481 -0.601032 -0.161354 -1.055764 -0.564552 0.538410 1.225031 ... -0.800122 -1.013989 -0.329101 0.144055 1.510887 -0.686488 -0.051363 -0.438315 -0.069615 -0.938991
1069559 -0.534690 -0.465331 -0.897453 -0.595481 -0.601032 -1.234875 -1.055764 0.139808 -0.459683 1.171976 ... 0.137291 -0.809321 -0.329101 -0.499807 -0.990032 -1.536663 -0.051363 -0.930538 -1.818601 -0.677736
1069697 -0.534690 -0.874115 0.266827 -0.595481 -0.752395 -0.966495 0.528760 0.573261 0.288887 0.251310 ... -0.800122 -0.604654 -0.329101 -0.078580 1.595459 0.503757 -0.051363 0.176963 1.656239 -0.773412
1069800 -0.534690 0.116050 0.394566 0.152587 0.080104 0.912168 -1.055764 -0.293644 0.538410 -1.315383 ... 0.137291 -0.809321 -0.329101 -0.504757 0.133569 -1.196593 -0.051363 0.176963 -0.171136 -1.010455
1069657 1.870241 0.683805 -1.203152 0.900654 0.836921 -0.966495 -1.055764 -0.537461 -2.705391 -1.168701 ... -0.800122 0.623353 -0.329101 -0.587816 0.210087 -0.261401 -0.051363 -1.053594 -0.331770 -1.010455
1069799 -0.534690 -0.465331 -1.167764 -0.595481 -0.601032 1.180548 0.528760 0.952532 0.538410 -0.619422 ... -0.800122 0.214017 -0.329101 -0.491811 -0.667853 1.608984 -0.051363 -1.176650 -1.403522 1.723152
1047704 -0.534690 -0.465331 -0.559553 -0.595481 -0.601032 -1.503256 -1.055764 -1.241822 0.288887 0.123353 ... -0.800122 -0.604654 -0.329101 -0.473806 0.193978 -1.111576 -0.051363 -0.622899 -0.560514 -1.450566
1032111 -0.534690 -1.419160 -1.152236 -1.343549 -1.357849 0.375407 0.528760 -1.431457 0.538410 -1.196789 ... -0.800122 -1.013989 -0.329101 -0.214402 1.313552 -1.111576 -0.051363 -1.130504 0.486822 1.247895
1069539 -0.534690 -1.330590 2.360832 -1.343549 -1.282167 -0.161354 0.528760 0.112718 0.538410 -1.387164 ... -0.800122 0.214017 -0.329101 0.435117 -1.082659 0.078669 -0.051363 2.247377 -0.324060 2.115229
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37620521 -0.534690 -1.650804 1.296953 -1.343549 -1.433531 -1.503256 0.528760 0.112718 0.288887 0.766259 ... 0.137291 -0.195318 -0.329101 -0.001884 0.439640 -0.686488 -0.051363 1.284465 0.611474 -0.273553
37670482 1.870241 2.741350 1.324740 3.144857 3.107372 -1.234875 -1.055764 0.112718 -1.707298 -1.387164 ... 1.074704 3.488702 -0.329101 3.711918 -1.481356 4.499580 -0.051363 1.392139 2.276930 1.996903
37810537 -0.534690 2.323482 2.539487 2.396790 2.350555 0.643787 0.528760 2.686343 -2.455868 -1.399648 ... 1.074704 -1.013989 4.256724 -0.708408 -1.235694 -1.366628 -0.051363 1.592105 -1.370110 -1.331849
37197690 1.870241 0.506666 0.675993 0.900654 0.836921 1.180548 0.528760 0.329444 -1.956821 1.037777 ... -0.800122 1.032689 1.963812 -0.145321 -0.696043 1.608984 -0.051363 1.284465 -0.730143 1.925830
37700380 1.870241 -0.310902 -0.332802 0.152587 0.004422 -0.698114 0.528760 -0.700006 0.288887 0.672632 ... 0.137291 0.623353 -0.329101 -0.049098 0.866528 0.758809 -0.051363 0.176963 1.913254 0.618384
37650375 -0.534690 0.279563 1.124264 0.152587 0.307149 1.180548 2.113283 0.248172 -1.956821 -0.940875 ... 0.137291 -0.195318 -0.329101 2.231260 0.584620 -0.601471 -0.051363 0.792242 0.912181 0.831996
37640409 1.870241 1.415074 -0.496133 1.648722 1.669419 -0.698114 -1.055764 0.248172 0.288887 1.390439 ... 1.074704 0.418685 -0.329101 0.361522 0.226196 -0.091366 -0.051363 -0.290648 -0.865076 -1.509143
37650403 -0.534690 0.125134 1.097131 0.152587 0.231467 -0.429734 0.528760 3.499066 -1.956821 -1.281053 ... 1.074704 0.214017 -0.329101 -0.067157 -1.384702 -0.941541 -0.051363 0.792242 -0.904913 -0.938600
37790415 -0.534690 0.279563 2.540631 0.152587 0.307149 1.180548 0.528760 0.356535 0.538410 1.234394 ... 0.137291 0.828021 -0.329101 0.821042 0.641002 1.694002 -0.051363 2.022800 -0.001506 -0.534417
36118265 -0.534690 -0.401743 -0.079939 -0.595481 -0.449668 -1.234875 -1.055764 -0.158190 0.538410 1.193822 ... 0.137291 0.623353 -0.329101 -0.117744 -0.389973 -0.006348 -0.051363 -0.192204 -1.035991 0.867533
36400285 -0.534690 0.767833 0.163729 0.900654 0.988284 0.107027 0.528760 -0.889641 0.538410 0.816193 ... -0.800122 -0.604654 -0.329101 -0.178066 0.661138 -0.091366 -0.051363 -0.090683 -0.966597 -0.950706
37357154 1.870241 1.869278 1.411534 1.648722 1.820783 -1.503256 -1.055764 -0.293644 0.288887 -1.281053 ... 0.137291 0.623353 -0.329101 0.191432 0.278550 -0.431436 -0.051363 1.733619 -0.829094 -1.390426
37068125 -0.534690 0.279563 -0.688641 0.152587 0.307149 -0.161354 0.528760 0.085627 0.288887 0.741291 ... 1.074704 -0.604654 -0.329101 -0.528690 -0.349700 -1.111576 -0.051363 -0.782871 0.732271 -0.819883
37317965 1.870241 -0.310902 0.584254 0.152587 0.004422 -1.503256 0.528760 3.499066 0.288887 0.298123 ... -0.800122 2.465363 -0.329101 2.789832 0.032888 3.054282 -0.051363 1.407521 -0.859936 0.689458
36801355 -0.534690 -0.174641 1.044908 0.152587 0.080104 1.180548 0.528760 0.166899 0.288887 1.287449 ... 1.074704 -0.195318 -0.329101 -0.233004 0.218141 0.928844 -0.051363 0.792242 -0.478269 2.579162
37297854 -0.534690 0.279563 1.124264 0.152587 0.307149 1.180548 0.528760 0.789987 -2.455868 -0.656873 ... -0.800122 1.237356 4.256724 0.056480 -0.740343 0.078669 -0.051363 0.792242 0.923747 0.273169
37247819 1.870241 -0.022482 1.091819 0.152587 0.155785 1.180548 2.113283 -0.239463 0.538410 -0.747379 ... 0.137291 -0.195318 -0.329101 0.569959 0.089269 -0.856523 -0.051363 1.982807 -0.728858 -0.403985
37127712 -0.534690 0.620217 -0.858551 0.900654 0.912602 -0.161354 2.113283 -0.970914 0.538410 -0.407201 ... 0.137291 -0.195318 1.963812 -0.672072 -1.094741 -0.261401 -0.051363 -0.945920 0.305626 -0.986634
37167534 1.870241 2.550584 2.488653 2.396790 2.501918 0.643787 0.528760 0.735806 0.538410 -0.263639 ... 0.137291 4.102705 -0.329101 0.622178 1.237034 3.309335 -0.051363 2.638079 -0.127444 0.570741
37087435 -0.534690 0.931346 0.060917 0.900654 1.063966 0.912168 -1.055764 -0.781278 0.538410 -0.772346 ... -0.800122 -0.809321 -0.329101 -0.753772 -0.812833 -0.346418 -0.051363 -0.192204 1.309269 -0.273553
37227443 -0.534690 0.931346 -1.306618 0.900654 1.063966 -1.503256 0.528760 -0.618734 0.538410 -0.557004 ... 0.137291 -0.195318 1.963812 -0.717383 -0.824915 -0.601471 -0.051363 -1.333546 -0.239245 -0.309089
37077186 -0.534690 0.506666 0.302583 0.900654 0.836921 -0.161354 0.528760 -0.049827 -1.956821 -0.613180 ... -0.800122 -0.399986 -0.329101 -0.402333 0.395340 -0.856523 -0.051363 0.053908 -1.069403 -0.582450
37187152 -0.534690 -0.310902 0.338911 0.152587 0.004422 0.107027 0.528760 0.112718 0.538410 -0.688082 ... -0.800122 0.418685 -0.329101 0.905244 1.096081 -0.006348 -0.051363 0.176963 0.873629 0.094703
36808246 -0.534690 -0.742396 -0.911633 -0.595481 -0.601032 -0.429734 -1.055764 -0.781278 0.538410 -0.975205 ... 0.137291 1.442024 -0.329101 -0.583791 -1.775345 2.119090 -0.051363 -0.930538 -0.757130 1.914114
37041208 1.870241 0.506666 -0.019462 0.900654 0.836921 1.180548 0.528760 -0.185281 0.538410 -1.106283 ... -0.800122 -0.604654 -0.329101 -0.222398 -0.011412 0.248704 -0.051363 0.423075 1.946666 0.011914
36743377 -0.534690 0.506666 -1.105121 0.900654 0.836921 1.180548 0.528760 -0.618734 -1.208252 0.891095 ... -0.800122 0.418685 -0.329101 -0.120137 0.367149 1.694002 -0.051363 -1.152039 2.618760 1.307644
36231718 -0.534690 -1.755271 -0.368313 -1.343549 -1.509212 -1.503256 -1.055764 -0.456189 0.538410 -0.622543 ... -0.800122 -0.399986 -0.329101 -0.238009 -1.147095 -0.346418 -0.051363 -0.342947 -0.428151 3.411350
36241316 -0.534690 0.620217 -0.807921 0.900654 0.912602 -0.966495 -1.055764 -1.187640 0.538410 -0.606939 ... 0.137291 -1.627992 -0.329101 -0.728642 1.744467 -1.791716 -0.051363 -0.902851 0.260649 -1.616925
36421485 -0.534690 -1.155721 -1.191138 -0.595481 -0.752395 1.180548 0.528760 -0.564552 -0.709206 1.346746 ... -0.800122 0.009350 1.963812 -0.731688 -1.960598 0.418739 -0.051363 -1.176650 -0.503970 -0.416091
36260758 -0.534690 1.244747 -0.077815 1.648722 1.593738 -1.503256 2.113283 -1.052186 0.538410 -0.294848 ... 0.137291 -0.399986 -0.329101 -0.444107 -0.510790 -0.431436 -0.051363 -0.333718 1.656239 -0.380163

252771 rows × 22 columns


In [396]:
loans['roi'] = ((loans['total_rec_int'] + loans['total_rec_prncp'] 
                          + loans['total_rec_late_fee'] + loans['recoveries']) / loans['funded_amnt']) -1

In [407]:
prof_loans = loans[loans['id'].isin(closed_loans['loan_status'][y_score[:,1] > 0.9].index.tolist())]

In [399]:
roi = loans.groupby('grade')['roi'].mean()

In [401]:
prof_loans = loans[loans['id'].isin(closed_loans.index.tolist())]

In [413]:
roi = prof_loans.groupby('grade')['roi'].mean()
print(roi)
print(prof_loans['roi'].mean())


grade
A    0.046914
B    0.058231
C    0.048528
D    0.062202
E    0.069116
F   -0.057438
G    0.444977
Name: roi, dtype: float64
0.0512539962515

In [424]:
prof_loans['grade'] = prof_loans['grade'].astype('category', ordered=True)
sns.barplot(data=roi.reset_index(), x='grade', y='roi', color='gray')
plt.show()
roi = prof_loans.groupby('loan_status')['roi'].mean()
sns.barplot(data=roi.reset_index(), x='loan_status', y='roi')
plt.show()
roi = prof_loans.groupby(['grade', 'loan_status'])['roi'].mean()
sns.barplot(data=roi.reset_index(), x='roi', y='grade', hue='loan_status', orient='h')
plt.show()
sns.countplot(data=prof_loans, x='grade', hue='loan_status')
plt.show()


/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

In [409]:
prof_loans


Out[409]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m roi
11 1069908 1305008 12000.0 12000.0 12000.0 36 months 12.69 402.54 B B5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.161923
14 1069057 1303503 10000.0 10000.0 10000.0 36 months 10.65 325.74 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN -0.252801
17 1069971 1304884 3600.0 3600.0 3600.0 36 months 6.03 109.57 A A1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.051394
19 1069742 1304855 9200.0 9200.0 9200.0 36 months 6.03 280.01 A A1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.028257
24 1069559 1304634 6000.0 6000.0 6000.0 36 months 11.71 198.46 B B3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN -0.658310
28 1069799 1304678 4000.0 4000.0 4000.0 36 months 11.71 132.31 B B3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.121198
31 1069539 1304608 31825.0 31825.0 31825.0 36 months 7.90 995.82 A A4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.096185
33 1069591 1304289 5000.0 5000.0 5000.0 36 months 8.90 158.77 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.142918
36 1069361 1304255 10800.0 10800.0 10800.0 36 months 9.91 348.03 B B1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.125667
37 1069357 1304251 15000.0 15000.0 15000.0 36 months 7.90 469.36 A A4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.110840
40 1067573 1301955 9600.0 9600.0 9600.0 36 months 7.51 298.67 A A3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.119767
41 1069506 1304567 12000.0 12000.0 12000.0 36 months 7.90 375.49 A A4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.126365
44 1069469 1304526 6000.0 6000.0 6000.0 36 months 6.03 182.62 A A1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.010977
45 1051117 1282787 14000.0 14000.0 14000.0 36 months 9.91 451.15 B B1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.159994
46 1069465 1304521 5000.0 5000.0 5000.0 36 months 8.90 158.77 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.004274
48 1069287 1304171 10000.0 10000.0 10000.0 36 months 6.03 304.36 A A1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.028267
49 1069453 1303701 11000.0 11000.0 11000.0 36 months 6.62 337.75 A A2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.105216
50 1069248 1304123 15000.0 15000.0 15000.0 36 months 9.91 483.38 B B1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.078518
51 1068120 1302485 25600.0 25600.0 25350.0 36 months 9.91 824.96 B B1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.159985
65 1069102 1303750 3500.0 3500.0 3500.0 36 months 10.65 114.01 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.172334
74 1068893 1303514 14400.0 14400.0 14400.0 36 months 8.90 457.25 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.143022
78 1068997 1303437 15000.0 15000.0 15000.0 36 months 7.90 469.36 A A4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.037176
83 1068967 1303403 4500.0 4500.0 4500.0 36 months 6.03 136.96 A A1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.095622
98 1068350 1302971 3500.0 3500.0 3500.0 36 months 6.03 106.53 A A1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.095617
102 1068508 1302906 6000.0 6000.0 6000.0 36 months 8.90 190.52 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.142862
103 1066641 1300833 7200.0 7200.0 7200.0 36 months 9.91 232.02 B B1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.159839
104 1068315 1302930 9500.0 9500.0 9500.0 36 months 8.90 301.66 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.104601
110 1068273 1302680 5500.0 5500.0 5500.0 36 months 6.62 168.88 A A2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.105069
111 1068274 1302681 11000.0 11000.0 11000.0 36 months 6.62 337.75 A A2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.105169
116 1061814 1293438 10000.0 10000.0 10000.0 36 months 8.90 317.54 A A5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.086892
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886730 37691691 40464722 1500.0 1500.0 1500.0 36 months 8.67 47.47 B B1 ... NaN NaN NaN NaN NaN 11200.0 NaN NaN NaN 0.027747
886739 37831603 40594612 8000.0 8000.0 8000.0 36 months 8.19 251.40 A A5 ... NaN NaN NaN NaN NaN 21500.0 NaN NaN NaN 0.031838
886740 37107603 39870372 30000.0 30000.0 30000.0 36 months 6.99 926.18 A A3 ... NaN NaN NaN NaN NaN 39300.0 NaN NaN NaN 0.048415
886812 37821486 40584462 26000.0 26000.0 26000.0 60 months 8.19 529.56 A A5 ... NaN NaN NaN NaN NaN 93500.0 NaN NaN NaN 0.040067
886828 36231341 38942750 6000.0 6000.0 6000.0 36 months 8.67 189.88 B B1 ... NaN NaN NaN NaN NaN 17000.0 NaN NaN NaN 0.022013
886888 37700422 40473182 35000.0 35000.0 35000.0 36 months 12.39 1169.04 C C1 ... NaN NaN NaN NaN NaN 100100.0 NaN NaN NaN 0.074413
886890 37651307 40414240 20000.0 20000.0 20000.0 60 months 10.49 429.78 B B3 ... NaN NaN NaN NaN NaN 89600.0 NaN NaN NaN 0.063475
886891 37701353 40474288 20950.0 20950.0 20950.0 60 months 17.14 522.24 D D4 ... NaN NaN NaN NaN NaN 15000.0 NaN NaN NaN 0.133916
886928 37721317 40444274 30000.0 30000.0 30000.0 36 months 6.49 919.34 A A2 ... NaN NaN NaN NaN NaN 105300.0 NaN NaN NaN 0.052816
886937 37611321 40374297 8000.0 8000.0 8000.0 36 months 9.49 256.23 B B2 ... NaN NaN NaN NaN NaN 5300.0 NaN NaN NaN -0.823315
886947 37691299 40464249 16000.0 16000.0 16000.0 36 months 9.49 512.46 B B2 ... NaN NaN NaN NaN NaN 33800.0 NaN NaN NaN 0.020843
886953 37601328 40364250 17000.0 17000.0 17000.0 36 months 7.49 528.73 A A4 ... NaN NaN NaN NaN NaN 66500.0 NaN NaN NaN 0.034161
886966 37751241 40514142 13200.0 13200.0 13200.0 36 months 15.59 461.41 D D1 ... NaN NaN NaN NaN NaN 39600.0 NaN NaN NaN 0.025246
886980 37741211 40504131 6000.0 6000.0 6000.0 36 months 8.19 188.55 A A5 ... NaN NaN NaN NaN NaN 31600.0 NaN NaN NaN 0.048925
887000 37731131 40494040 18000.0 18000.0 18000.0 36 months 6.99 555.71 A A3 ... NaN NaN NaN NaN NaN 13500.0 NaN NaN NaN -0.846413
887009 37701109 40474018 11000.0 11000.0 11000.0 36 months 11.44 362.43 B B4 ... NaN NaN NaN NaN NaN 22500.0 NaN NaN NaN 0.082765
887012 37711157 40484084 10000.0 10000.0 10000.0 36 months 15.59 349.55 D D1 ... NaN NaN NaN NaN NaN 5700.0 NaN NaN NaN -0.826957
887032 35948698 38650263 35000.0 35000.0 35000.0 36 months 11.99 1162.34 B B5 ... NaN NaN NaN NaN NaN 181700.0 NaN NaN NaN 0.066983
887093 37630866 40393702 4000.0 4000.0 4000.0 36 months 6.99 123.50 A A3 ... NaN NaN NaN NaN NaN 76000.0 NaN NaN NaN 0.004657
887105 37720840 40443672 35000.0 35000.0 35000.0 36 months 13.66 1190.45 C C3 ... NaN NaN NaN NaN NaN 47500.0 NaN NaN NaN -0.732788
887121 37840758 40603578 30000.0 30000.0 30000.0 36 months 8.67 949.40 B B1 ... NaN NaN NaN NaN NaN 39800.0 NaN NaN NaN 0.051029
887124 36281499 39002924 5000.0 5000.0 5000.0 36 months 10.49 162.49 B B3 ... NaN NaN NaN NaN NaN 148400.0 NaN NaN NaN 0.055844
887128 37750780 40513580 1500.0 1500.0 1500.0 36 months 11.99 49.82 B B5 ... NaN NaN NaN NaN NaN 15500.0 NaN NaN NaN 0.057553
887144 37760680 40523466 2000.0 2000.0 2000.0 36 months 8.19 62.85 A A5 ... NaN NaN NaN NaN NaN 23900.0 NaN NaN NaN 0.031935
887164 37620521 40383278 24000.0 24000.0 24000.0 36 months 6.49 735.47 A A2 ... NaN NaN NaN NaN NaN 23200.0 NaN NaN NaN 0.029352
887194 37650403 40413143 20000.0 20000.0 20000.0 36 months 14.31 686.57 C C4 ... NaN NaN NaN NaN NaN 70000.0 NaN NaN NaN 0.071025
887263 37317965 40080791 25000.0 25000.0 25000.0 60 months 12.39 561.06 C C1 ... NaN NaN NaN NaN NaN 120600.0 NaN NaN NaN 0.111017
887346 36808246 39560970 6000.0 6000.0 6000.0 36 months 10.49 194.99 B B3 ... NaN NaN NaN NaN NaN 43400.0 NaN NaN NaN 0.077563
887364 36231718 38943165 10775.0 10775.0 10775.0 36 months 6.03 327.95 A A1 ... NaN NaN NaN NaN NaN 41700.0 NaN NaN NaN 0.027552
887369 36421485 39142898 4000.0 4000.0 4000.0 36 months 8.67 126.59 B B1 ... NaN NaN NaN NaN NaN 30100.0 NaN NaN NaN 0.039505

65296 rows × 75 columns


In [393]:
closed_loans.index.tolist()


Out[393]:
[1077501,
 1077430,
 1077175,
 1076863,
 1075269,
 1072053,
 1071795,
 1071570,
 1070078,
 1069908,
 1064687,
 1069866,
 1069057,
 1069759,
 1065775,
 1069971,
 1062474,
 1069742,
 1069740,
 1039153,
 1069710,
 1069700,
 1069559,
 1069697,
 1069800,
 1069657,
 1069799,
 1047704,
 1032111,
 1069539,
 1069591,
 1069530,
 1069522,
 1069361,
 1069357,
 1069356,
 1067573,
 1069506,
 1069314,
 1060578,
 1069469,
 1051117,
 1069465,
 1069283,
 1069287,
 1069453,
 1069248,
 1068120,
 1069244,
 1069243,
 1069238,
 1069410,
 1068409,
 1068487,
 1043961,
 1068945,
 1069142,
 1069136,
 1068923,
 1069126,
 1069102,
 1069093,
 1069030,
 1068906,
 1069073,
 1069043,
 1060981,
 1069071,
 1069070,
 1068893,
 1068882,
 1069039,
 1068416,
 1068997,
 1048390,
 1068994,
 1068475,
 1068989,
 1068967,
 1068744,
 1045509,
 1041756,
 1068694,
 1068792,
 1068558,
 1068322,
 1065674,
 1068547,
 1068545,
 1049528,
 1068542,
 1068350,
 1068509,
 1068508,
 1066641,
 1068315,
 1068309,
 1068326,
 1066767,
 1068484,
 1068292,
 1068273,
 1068274,
 1068440,
 1046456,
 1068012,
 1055725,
 1061814,
 1068232,
 1068405,
 1068395,
 1068202,
 1068006,
 1068075,
 1065642,
 1068179,
 1068158,
 1068180,
 1065673,
 1068159,
 1068165,
 1059689,
 1066215,
 1063270,
 1067931,
 1067922,
 1067919,
 1068106,
 1068111,
 1067655,
 1060662,
 1061304,
 1068095,
 1067182,
 822464,
 1063526,
 1068091,
 1068090,
 1068082,
 1068078,
 1057633,
 1046204,
 1067794,
 1067816,
 1068018,
 1052996,
 1067818,
 1064842,
 1067109,
 1067971,
 1057518,
 1067473,
 1062177,
 1067447,
 1059412,
 1062608,
 1067664,
 1067441,
 1067644,
 1062288,
 1067434,
 1067419,
 1067601,
 1067579,
 1067563,
 1067364,
 1067152,
 1062471,
 1066842,
 1067126,
 1067326,
 1066789,
 1067324,
 1067102,
 1067084,
 1067090,
 1067265,
 1067287,
 1067066,
 1067283,
 1066635,
 1067026,
 1067038,
 1067030,
 1065350,
 1067028,
 1061877,
 1067018,
 1067223,
 1067004,
 1031265,
 1066835,
 1067181,
 1067179,
 1067172,
 1066712,
 1066706,
 1066836,
 1066826,
 1063280,
 1057447,
 1040867,
 1065232,
 1052492,
 1066763,
 1066798,
 1066639,
 1066633,
 1060644,
 1066766,
 1066613,
 1065648,
 1066768,
 1066617,
 1062015,
 1066598,
 1066503,
 1066530,
 1066520,
 1065929,
 1066582,
 1066348,
 1061430,
 1061519,
 1066480,
 1066471,
 1066318,
 1066453,
 1063090,
 1061406,
 1066278,
 1066281,
 1056150,
 1066434,
 1048526,
 1028566,
 1066424,
 1066212,
 1066232,
 1056219,
 1066199,
 1060851,
 972383,
 1066191,
 1066379,
 1066364,
 1066140,
 1066171,
 1066162,
 1040154,
 1066155,
 1065382,
 1063447,
 1065932,
 1065936,
 1066112,
 1066084,
 1066081,
 1065896,
 1066065,
 1066071,
 1062094,
 1061154,
 1066061,
 1065863,
 1065862,
 1065854,
 1062260,
 1065842,
 1066018,
 1049352,
 1066016,
 1065997,
 1061891,
 1065752,
 1063126,
 1065738,
 1061467,
 1065721,
 1065717,
 1065026,
 1057001,
 1065513,
 1065698,
 1065681,
 1061788,
 1065484,
 1061681,
 1065661,
 1065663,
 1065480,
 1065658,
 1065447,
 1065470,
 1065467,
 1065649,
 1065430,
 1059462,
 1064102,
 1065572,
 1065567,
 1065355,
 1058717,
 1065348,
 890389,
 1065304,
 1065180,
 1065298,
 1064904,
 1064696,
 1065260,
 1065110,
 1047987,
 1065103,
 1065254,
 1064873,
 1065244,
 1064969,
 1059620,
 1032978,
 1065234,
 1065196,
 1065199,
 1065195,
 1065020,
 1065016,
 1064924,
 1064940,
 1064932,
 1062248,
 1064985,
 1064926,
 1064908,
 1059118,
 1064754,
 1060082,
 1064736,
 1064727,
 1042668,
 1064711,
 1032875,
 1064830,
 1064681,
 1064383,
 1064675,
 1064666,
 1064636,
 1064805,
 1064635,
 1064793,
 1056865,
 1064783,
 1064623,
 1064792,
 1062701,
 1064781,
 1064774,
 1053765,
 1064608,
 1064558,
 1064548,
 1064462,
 1064582,
 1064579,
 1046511,
 1029191,
 1064527,
 1064567,
 1064500,
 1063899,
 1057598,
 1064323,
 1064317,
 1064471,
 1060374,
 1064452,
 1064453,
 1064284,
 1063081,
 1064251,
 1064407,
 1035800,
 1063678,
 1059751,
 1055800,
 1064221,
 1064210,
 1064209,
 1064366,
 1064216,
 1064150,
 1056864,
 1064146,
 1064183,
 1064185,
 1064126,
 1029473,
 1064128,
 1063949,
 1064166,
 1064133,
 1063505,
 1063528,
 1063912,
 1064082,
 1063892,
 1064061,
 1063876,
 1042841,
 1063864,
 1064051,
 1063847,
 1062833,
 1062032,
 1063843,
 1063828,
 1064006,
 1062625,
 1062070,
 1063964,
 1063804,
 1063751,
 1063982,
 1063972,
 1063785,
 1063788,
 1063485,
 1063781,
 1057766,
 1063778,
 1062818,
 1042037,
 1063518,
 1063515,
 1063524,
 1063519,
 1063521,
 984879,
 1063509,
 1063693,
 1063700,
 1056934,
 1063246,
 1057424,
 1063680,
 1063679,
 1063034,
 1063448,
 1060804,
 1027771,
 1063421,
 1062855,
 1062490,
 1063407,
 1063609,
 1063393,
 1063388,
 1063582,
 988402,
 1063564,
 1063121,
 1063275,
 1057756,
 1059960,
 1059525,
 1062577,
 1053188,
 1063098,
 1063238,
 1062043,
 1063228,
 1063197,
 1063069,
 1063062,
 1063199,
 1063187,
 1063040,
 1063020,
 1062929,
 1063003,
 1063001,
 1062997,
 1061315,
 1062808,
 1062918,
 1062983,
 1061530,
 1035958,
 1062894,
 1062895,
 1062976,
 1061972,
 1062897,
 1062857,
 1059828,
 1059486,
 1062887,
 1062851,
 1062844,
 1062753,
 1062873,
 1062860,
 1062738,
 1062864,
 1062734,
 1052370,
 1058564,
 1062584,
 1062253,
 1062795,
 1062614,
 1061378,
 1062781,
 1023158,
 1062654,
 1062665,
 1062555,
 1062606,
 1062643,
 1062535,
 1062536,
 765927,
 1031451,
 1062510,
 1036838,
 1062478,
 1055974,
 1062462,
 1062458,
 1062454,
 1060762,
 1062564,
 1061570,
 1062418,
 1062413,
 1062330,
 1062414,
 1060966,
 1062401,
 1057454,
 1062179,
 1062293,
 1062384,
 1031348,
 1062268,
 1062161,
 1061409,
 1061705,
 1062146,
 1058699,
 1062150,
 1062257,
 1061420,
 1062250,
 1020855,
 997521,
 1058624,
 1062232,
 1062235,
 1060429,
 1060630,
 1062192,
 1059016,
 1056174,
 1062045,
 1059395,
 1062012,
 1031266,
 1061975,
 1061914,
 1061750,
 1061741,
 1040188,
 1057749,
 1061869,
 1060650,
 1059701,
 1061842,
 1061837,
 1057272,
 1061675,
 1061661,
 1061656,
 1061776,
 1061767,
 1061643,
 1061785,
 1061623,
 1030795,
 1061598,
 1061772,
 1058931,
 1061355,
 1061593,
 1061583,
 1059572,
 1053004,
 1061496,
 1051434,
 1058174,
 1061547,
 1061306,
 1061515,
 1058858,
 1060415,
 1061279,
 1058942,
 1042958,
 1061251,
 1058015,
 1059858,
 1061247,
 1061250,
 1061245,
 1060411,
 1061416,
 1040053,
 1061227,
 1061403,
 1061413,
 1061202,
 1061204,
 1061380,
 1061193,
 1061194,
 1060945,
 1057719,
 1061383,
 1050511,
 1061175,
 1061122,
 1061095,
 1061164,
 1060436,
 1061133,
 1060855,
 1058477,
 1060995,
 1061135,
 1059939,
 1060951,
 1046507,
 1060880,
 1060934,
 1061128,
 1061130,
 1061126,
 1060895,
 1061106,
 1060925,
 1024951,
 1061070,
 1061060,
 1055190,
 1051018,
 1059259,
 1060831,
 1058228,
 1060841,
 1056914,
 1060794,
 1061003,
 1061008,
 1060779,
 1060993,
 1060553,
 1058200,
 1060544,
 1060979,
 1060698,
 1060970,
 1060760,
 1060753,
 1060559,
 1060696,
 1060735,
 1060720,
 1060558,
 1060731,
 1060536,
 1060721,
 1060527,
 1060714,
 1060508,
 1059582,
 1060481,
 1054397,
 1055294,
 1060485,
 1060697,
 1060670,
 1060452,
 1051471,
 1059954,
 1059310,
 1055282,
 1058486,
 1060427,
 1060418,
 1060605,
 1060597,
 1053763,
 1060372,
 1060371,
 1060369,
 1060365,
 1060086,
 1058493,
 1059978,
 1059252,
 1060167,
 1059974,
 1060046,
 1060030,
 1059948,
 1060036,
 1055631,
 1060019,
 1053454,
 1060004,
 1060015,
 1059936,
 1052230,
 1059996,
 1059887,
 1010310,
 1059889,
 1059736,
 1059870,
 1059874,
 1032863,
 1059843,
 1057243,
 1059785,
 1059704,
 1059801,
 1059605,
 1058085,
 1059157,
 1023566,
 1059657,
 1059639,
 1059571,
 1059629,
 1059603,
 1059522,
 1059529,
 1059502,
 1058733,
 1059497,
 1059433,
 1059469,
 1059444,
 1059311,
 1059333,
 1055994,
 1059386,
 1059289,
 1058552,
 1059461,
 1059312,
 1058031,
 1059447,
 1059319,
 1059291,
 1059280,
 1059362,
 1059262,
 1059409,
 1057384,
 1058892,
 1052360,
 1059251,
 1059227,
 1059236,
 1059210,
 1059174,
 1057064,
 1059199,
 1058946,
 1059186,
 1050101,
 1058958,
 1032427,
 1058956,
 1058959,
 1055920,
 1059103,
 1059046,
 1058943,
 1059100,
 1058926,
 1034052,
 1057403,
 1058924,
 1058877,
 1059041,
 1059039,
 1059024,
 1054453,
 1058814,
 1058830,
 1053112,
 1058317,
 1058846,
 1058824,
 1058828,
 1058831,
 1052449,
 1058734,
 1058209,
 1058749,
 1058782,
 1056831,
 1054373,
 1058744,
 1047572,
 1056011,
 1058718,
 1058792,
 1058556,
 1058407,
 1029236,
 1058696,
 1058515,
 1058110,
 1058519,
 1058494,
 1058634,
 1058483,
 1055678,
 1058484,
 1058615,
 1058603,
 1058468,
 1058440,
 1058451,
 1058596,
 1058578,
 1058423,
 1057201,
 1058396,
 1046641,
 1053982,
 1058372,
 1058368,
 1058155,
 1057701,
 1058301,
 1058237,
 1058309,
 1058318,
 1058282,
 1051960,
 1058289,
 1058241,
 1058276,
 1058123,
 1058126,
 1058242,
 1058119,
 1058084,
 1058111,
 1053505,
 1058072,
 1058082,
 1058221,
 1058078,
 1058213,
 1058059,
 1058060,
 1055436,
 1058186,
 1054438,
 1058203,
 1058042,
 1058195,
 1053728,
 1058177,
 1057996,
 1058162,
 1055265,
 1018385,
 1054294,
 1057750,
 1057923,
 1057816,
 1057900,
 1056770,
 1057710,
 1057323,
 1057878,
 1057726,
 1057720,
 1049053,
 1057448,
 1057669,
 1050535,
 1057674,
 1057806,
 1057615,
 1057818,
 1057629,
 1057621,
 1057787,
 1057275,
 1057330,
 1057773,
 1057343,
 1057562,
 1057776,
 1057519,
 1057537,
 1057358,
 1057038,
 1057316,
 1057350,
 1057319,
 1052234,
 1041468,
 1057515,
 1057522,
 1057428,
 1057451,
 1048164,
 1057482,
 1057501,
 1057458,
 1057445,
 1057211,
 1057055,
 1057199,
 1057375,
 1057238,
 1057204,
 1057013,
 1057397,
 1056923,
 1057140,
 1055794,
 1057083,
 1056866,
 1056925,
 1057099,
 1056297,
 1056889,
 1057113,
 1056786,
 1057110,
 1057080,
 1057031,
 1056845,
 1057065,
 1057026,
 1056070,
 1056857,
 1056856,
 1057054,
 1056817,
 1057024,
 1056799,
 1056835,
 1056055,
 1055476,
 1055593,
 1056585,
 1056374,
 1055929,
 1056772,
 1055229,
 1056601,
 1011430,
 1056115,
 1056345,
 1054308,
 1056351,
 1056090,
 1056273,
 1053961,
 1056155,
 1056147,
 1056107,
 1056157,
 1056298,
 1056299,
 1056248,
 926180,
 1056261,
 1056081,
 1056062,
 1056239,
 1056093,
 1056080,
 1056234,
 1053252,
 1056186,
 1055960,
 1048116,
 1056204,
 1056163,
 ...]

In [388]:
y_score = clf.predict_proba(closed_loans_scaled)
prediction = clf.predict(closed_loans_scaled)
confusion_matrix = ConfusionMatrix(np.array(closed_loans['loan_status'][y_score[:,1] > 0.9]), prediction[y_score[:,1] > 0.9])
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off            0        4241     4241
Fully Paid             0       61055    61055
__all__                0       65296    65296


Overall Statistics:

Accuracy: 0.935049620191
95% CI: (0.93313232985352268, 0.93692813651530282)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off Fully Paid
Population                                  65296      65296
P: Condition positive                        4241      61055
N: Condition negative                       61055       4241
Test outcome positive                           0      65296
Test outcome negative                       65296          0
TP: True Positive                               0      61055
TN: True Negative                           61055          0
FP: False Positive                              0       4241
FN: False Negative                           4241          0
TPR: (Sensitivity, hit rate, recall)            0          1
TNR=SPC: (Specificity)                          1          0
PPV: Pos Pred Value (Precision)               NaN    0.93505
NPV: Neg Pred Value                       0.93505        NaN
FPR: False-out                                  0          1
FDR: False Discovery Rate                     NaN  0.0649504
FNR: Miss Rate                                  1          0
ACC: Accuracy                             0.93505    0.93505
F1 score                                        0   0.966435
MCC: Matthews correlation coefficient         NaN        NaN
Informedness                                    0          0
Markedness                                    NaN        NaN
Prevalence                              0.0649504    0.93505
LR+: Positive likelihood ratio                NaN          1
LR-: Negative likelihood ratio                  1        NaN
DOR: Diagnostic odds ratio                    NaN        NaN
FOR: False omission rate                0.0649504        NaN
Out[388]:
<matplotlib.axes._subplots.AxesSubplot at 0x1180bc470>

In [386]:
np.array(closed_loans['loan_status'][y_score[:,1] > 0.9])


Out[386]:
array(['Fully Paid', 'Charged Off', 'Fully Paid', ..., 'Fully Paid',
       'Fully Paid', 'Fully Paid'], dtype=object)

In [377]:
prediction[y_score[:,1] > 0.9]


Out[377]:
array(['Fully Paid', 'Fully Paid', 'Fully Paid', ..., 'Fully Paid',
       'Fully Paid', 'Fully Paid'], dtype=object)

In [374]:
y_total[y_score[:,1] > 0.9]


Out[374]:
225891     Fully Paid
424340    Charged Off
189953     Fully Paid
15805      Fully Paid
31773      Fully Paid
24921      Fully Paid
175902     Fully Paid
616528     Fully Paid
78182      Fully Paid
158031     Fully Paid
98202     Charged Off
16376      Fully Paid
176488     Fully Paid
19866      Fully Paid
206337     Fully Paid
203157     Fully Paid
72040      Fully Paid
344699     Fully Paid
29128      Fully Paid
796292     Fully Paid
23385      Fully Paid
90823     Charged Off
303026     Fully Paid
769678     Fully Paid
24073     Charged Off
170393     Fully Paid
189629     Fully Paid
80527      Fully Paid
185266    Charged Off
1919       Fully Paid
             ...     
186852     Fully Paid
25029      Fully Paid
191388     Fully Paid
138782     Fully Paid
285786     Fully Paid
711329     Fully Paid
414029     Fully Paid
69766      Fully Paid
284884     Fully Paid
304993    Charged Off
864821     Fully Paid
371647     Fully Paid
39221      Fully Paid
218513     Fully Paid
35084      Fully Paid
89635      Fully Paid
157741     Fully Paid
406600    Charged Off
276439     Fully Paid
419418     Fully Paid
20197      Fully Paid
101478     Fully Paid
287362     Fully Paid
18142      Fully Paid
63705      Fully Paid
363375     Fully Paid
217825     Fully Paid
752284     Fully Paid
885709     Fully Paid
277938     Fully Paid
Name: Actual, dtype: object

In [371]:
prediction[y_score[:,1] > 0.9]


Out[371]:
array(['Fully Paid', 'Fully Paid', 'Fully Paid', ..., 'Fully Paid',
       'Fully Paid', 'Fully Paid'], dtype=object)

In [ ]:


In [ ]:


In [414]:
X_total = pd.concat([X_train_scaled, X_test_scaled])
y_total = pd.concat([y_train, y_test])

In [415]:
y_score = clf.predict_proba(X_total)
prediction = clf.predict(X_total)
confusion_matrix = ConfusionMatrix(y_total[y_score[:,1] > 0.9], prediction[y_score[:,1] > 0.9])
confusion_matrix.print_stats()
confusion_matrix.plot()


Confusion Matrix:

Predicted    Charged Off  Fully Paid  __all__
Actual                                       
Charged Off            0        1082     1082
Fully Paid             0       15023    15023
__all__                0       16105    16105


Overall Statistics:

Accuracy: 0.932815895685
95% CI: (0.92883974791545454, 0.93663484018407073)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                               Charged Off Fully Paid
Population                                  16105      16105
P: Condition positive                        1082      15023
N: Condition negative                       15023       1082
Test outcome positive                           0      16105
Test outcome negative                       16105          0
TP: True Positive                               0      15023
TN: True Negative                           15023          0
FP: False Positive                              0       1082
FN: False Negative                           1082          0
TPR: (Sensitivity, hit rate, recall)            0          1
TNR=SPC: (Specificity)                          1          0
PPV: Pos Pred Value (Precision)               NaN   0.932816
NPV: Neg Pred Value                      0.932816        NaN
FPR: False-out                                  0          1
FDR: False Discovery Rate                     NaN  0.0671841
FNR: Miss Rate                                  1          0
ACC: Accuracy                            0.932816   0.932816
F1 score                                        0    0.96524
MCC: Matthews correlation coefficient         NaN        NaN
Informedness                                    0          0
Markedness                                    NaN        NaN
Prevalence                              0.0671841   0.932816
LR+: Positive likelihood ratio                NaN          1
LR-: Negative likelihood ratio                  1        NaN
DOR: Diagnostic odds ratio                    NaN        NaN
FOR: False omission rate                0.0671841        NaN
Out[415]:
<matplotlib.axes._subplots.AxesSubplot at 0x154c0d4a8>

In [416]:
diff_mean = X_total[y_score[:,1] > 0.9].mean() - X_total.mean()
abs(diff_mean).sort_values(ascending=False)


Out[416]:
int_rate                        1.029157
sub_grade                       0.942607
grade                           0.925247
annual_inc                      0.608917
revol_util                      0.539280
dti                             0.531208
term                            0.446514
total_acc                       0.251735
home_ownership                  0.236309
inq_last_6mths                  0.227330
days_since_first_credit_line    0.183626
emp_length                      0.114662
loan_amnt                       0.101279
revol_bal                       0.099797
delinq_2yrs                     0.084395
installment                     0.080104
zip_code                        0.068169
pub_rec                         0.066996
addr_state                      0.026992
open_acc                        0.017322
acc_now_delinq                  0.016847
purpose                         0.012195
dtype: float64

In [421]:
for col in X_total.columns:
    result = ttest_ind(X_total[y_score[:,1] > 0.9][col], X_total[col])
    print(col, ':', result)
#X_total[y_score[:,1] > 0.9].mean() - X_total.mean()


term : Ttest_indResult(statistic=-111.1274594789467, pvalue=0.0)
int_rate : Ttest_indResult(statistic=-252.55245382254896, pvalue=0.0)
installment : Ttest_indResult(statistic=-18.291381247594764, pvalue=1.058913389464325e-74)
grade : Ttest_indResult(statistic=-228.05864402831577, pvalue=0.0)
sub_grade : Ttest_indResult(statistic=-232.6328655787587, pvalue=0.0)
emp_length : Ttest_indResult(statistic=26.186558411268546, pvalue=5.4744312026930184e-151)
home_ownership : Ttest_indResult(statistic=54.644952956066348, pvalue=0.0)
annual_inc : Ttest_indResult(statistic=132.54099142904784, pvalue=0.0)
purpose : Ttest_indResult(statistic=-2.8004280871792182, pvalue=0.0051037954727872238)
zip_code : Ttest_indResult(statistic=15.477599274683396, pvalue=5.1433161223269419e-54)
addr_state : Ttest_indResult(statistic=-6.1629362457205694, pvalue=7.1493494476485885e-10)
delinq_2yrs : Ttest_indResult(statistic=-20.014038201633305, pvalue=4.7168489070678436e-89)
inq_last_6mths : Ttest_indResult(statistic=-53.264927971053702, pvalue=0.0)
open_acc : Ttest_indResult(statistic=3.9468393246616409, pvalue=7.9206665281908095e-05)
pub_rec : Ttest_indResult(statistic=-15.482253872746803, pvalue=4.7845952665859297e-54)
revol_bal : Ttest_indResult(statistic=20.188025528349062, pvalue=1.4222443102909632e-90)
revol_util : Ttest_indResult(statistic=-123.34649242750646, pvalue=0.0)
total_acc : Ttest_indResult(statistic=56.881084046760428, pvalue=0.0)
acc_now_delinq : Ttest_indResult(statistic=-4.006449999190246, pvalue=6.1652173430596095e-05)
loan_amnt : Ttest_indResult(statistic=-23.345322420078318, pvalue=1.9429590487125806e-120)
dti : Ttest_indResult(statistic=-124.50997385851153, pvalue=0.0)
days_since_first_credit_line : Ttest_indResult(statistic=41.905153364232731, pvalue=0.0)

In [418]:
X_total[y_score[:,1] > 0.9]['term']


Out[418]:
7       -0.534690
9       -0.534690
10      -0.534690
18      -0.534690
28      -0.534690
29      -0.534690
32      -0.534690
39      -0.534690
51      -0.534690
52      -0.534690
54      -0.534690
56      -0.534690
57      -0.534690
64       1.870241
66      -0.534690
73      -0.534690
74      -0.534690
79      -0.534690
80      -0.534690
87      -0.534690
90      -0.534690
92      -0.534690
109     -0.534690
110     -0.534690
114     -0.534690
127     -0.534690
130     -0.534690
141      1.870241
143     -0.534690
147     -0.534690
           ...   
75737   -0.534690
75738   -0.534690
75741   -0.534690
75743   -0.534690
75745   -0.534690
75760   -0.534690
75765   -0.534690
75768   -0.534690
75769   -0.534690
75773   -0.534690
75779   -0.534690
75781   -0.534690
75782   -0.534690
75785   -0.534690
75786   -0.534690
75789   -0.534690
75790   -0.534690
75793   -0.534690
75796   -0.534690
75798   -0.534690
75802   -0.534690
75804   -0.534690
75806   -0.534690
75815   -0.534690
75821   -0.534690
75824   -0.534690
75825   -0.534690
75826   -0.534690
75827   -0.534690
75828   -0.534690
Name: term, dtype: float64

In [320]:
X_total.mean()


Out[320]:
term                            0.000735
int_rate                        0.000869
installment                     0.000411
grade                           0.001088
sub_grade                       0.001040
emp_length                      0.000231
home_ownership                  0.000497
annual_inc                      0.000595
purpose                        -0.000405
zip_code                       -0.000062
addr_state                      0.000371
delinq_2yrs                     0.001245
inq_last_6mths                 -0.000849
open_acc                       -0.001960
pub_rec                        -0.000001
revol_bal                       0.001173
revol_util                      0.001154
total_acc                      -0.002846
acc_now_delinq                 -0.000471
loan_amnt                       0.000421
dti                            -0.000668
days_since_first_credit_line   -0.000040
dtype: float64

In [309]:
X_total #most interesting features: 'int_rate', 'annual_inc', 'sub_grade', 'term', 'dti'
# vergelijken predicted > 0.9 vs. all?


Out[309]:
term int_rate installment grade sub_grade emp_length home_ownership annual_inc purpose zip_code ... inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc acc_now_delinq loan_amnt dti days_since_first_credit_line
0 -0.534690 -0.401743 1.005720 -0.595481 -0.601032 -1.503256 -1.055764 0.437807 0.538410 1.171976 ... -0.800122 1.851360 -0.329101 0.556035 0.407422 1.183897 -0.051363 0.792242 1.453198 -0.058769
1 1.870241 0.865487 0.347860 0.900654 0.988284 -1.503256 0.528760 0.519080 0.538410 -0.778588 ... -0.800122 1.442024 -0.329101 1.029208 0.592675 1.353932 -0.051363 0.792242 0.796525 -0.262228
2 -0.534690 -0.092884 -0.352335 0.152587 0.080104 0.643787 -1.055764 0.681624 0.288887 -0.894062 ... 3.886942 -0.809321 -0.329101 -0.174204 0.834310 -0.091366 -0.051363 -0.462926 0.648741 -0.071656
3 1.870241 0.554357 -0.710339 0.152587 0.231467 -1.503256 -1.055764 -0.781278 0.288887 -1.324746 ... 0.137291 3.693369 -0.329101 -0.510631 -0.095984 1.013862 -0.051363 -0.438315 -0.115878 -1.152993
4 -0.534690 0.863216 -0.427319 0.900654 0.761239 -0.698114 -1.055764 -0.510370 0.538410 1.171976 ... 0.137291 0.828021 -0.329101 -0.525372 -0.595362 0.588774 -0.051363 -0.595211 0.043471 -0.403985
5 1.870241 0.574796 0.291632 0.152587 0.231467 -1.503256 -1.055764 0.275262 0.538410 0.023484 ... 0.137291 0.214017 -0.329101 -0.374646 -0.325537 2.204107 -0.051363 0.792242 0.057607 -0.368838
6 1.870241 0.910907 -0.056320 0.900654 0.761239 1.180548 -1.055764 -0.618734 0.538410 -1.418373 ... 0.137291 -0.809321 1.963812 -0.152393 0.274523 -0.601471 -0.051363 0.300019 -1.060407 -0.047835
7 -0.534690 -1.103488 -1.059640 -1.343549 -1.206485 -0.698114 -1.055764 -0.700006 0.538410 -1.312263 ... -0.800122 -0.604654 -0.329101 -0.534129 -0.196665 -0.006348 -0.051363 -1.053594 -1.310997 -1.176814
8 -0.534690 0.463516 0.798665 0.152587 0.155785 1.180548 0.528760 -0.158190 0.288887 0.248189 ... 0.137291 -0.195318 -0.329101 0.499356 1.261198 -0.261401 -0.051363 0.484603 -0.086321 -0.416872
9 -0.534690 -1.103488 1.535473 -1.343549 -1.206485 1.180548 0.528760 0.789987 0.538410 1.421648 ... -0.800122 0.418685 -0.329101 0.841766 -0.168474 0.673792 -0.051363 1.407521 -0.637618 0.451244
10 -0.534690 -1.394179 -0.912572 -1.343549 -1.357849 -0.966495 -1.055764 0.925441 -2.705391 1.299933 ... -0.800122 -0.604654 -0.329101 -0.586511 -1.199449 -1.366628 -0.051363 -0.899774 -1.083539 -1.509924
11 -0.534690 1.528625 3.642595 1.648722 1.593738 1.180548 0.528760 1.331803 0.538410 1.468462 ... 1.074704 0.828021 -0.329101 1.557155 1.434369 1.013862 -0.051363 2.638079 0.228522 1.211968
12 -0.534690 0.420367 -1.079745 0.152587 0.307149 1.180548 0.528760 -0.672915 0.538410 -1.199910 ... -0.800122 -0.809321 -0.329101 -0.762584 0.854446 -0.771506 -0.051363 -1.127428 0.431564 -0.333301
13 -0.534690 0.352236 -1.053960 0.152587 0.080104 1.180548 -1.055764 -0.618734 -0.709206 1.196943 ... -0.800122 -1.832660 -0.329101 -0.574381 0.657111 -0.686488 -0.051363 -1.102816 -1.669533 -0.559019
14 -0.534690 -1.330590 -1.005169 -1.343549 -1.282167 -1.503256 -1.055764 -0.889641 0.288887 -0.572609 ... -0.800122 -1.013989 1.963812 -0.623281 -1.098768 -1.026558 -0.051363 -0.992066 1.254011 -0.570344
15 -0.534690 0.352236 3.271228 0.152587 0.080104 1.180548 0.528760 0.654533 0.538410 -0.909666 ... -0.800122 1.032689 -0.329101 0.529545 0.040942 1.864037 -0.051363 2.638079 0.634605 -0.356342
16 -0.534690 -0.158743 1.047646 0.152587 0.080104 -0.698114 -1.055764 0.248172 0.538410 -1.315383 ... 0.137291 0.214017 -0.329101 -0.527221 -0.853106 -0.431436 -0.051363 0.792242 -0.691591 -1.188920
17 -0.534690 2.121361 -0.758027 1.648722 1.745101 0.375407 0.528760 -0.970914 -1.208252 -0.488344 ... -0.800122 0.418685 1.963812 -0.652382 -1.779372 -0.006348 -0.051363 -0.930538 0.234947 -0.071656
18 -0.534690 -1.423702 -0.818750 -1.343549 -1.282167 1.180548 -1.055764 -0.510370 0.538410 1.234394 ... -0.800122 0.214017 1.963812 -0.521129 -0.361782 -0.601471 -0.051363 -0.807483 0.301771 0.605497
19 1.870241 -0.147388 0.321176 -0.595481 -0.525350 1.180548 0.528760 -0.158190 0.538410 -0.113836 ... -0.800122 0.009350 -0.329101 1.109711 0.568511 0.078669 -0.051363 1.010666 1.360672 -0.403985
20 -0.534690 -0.594779 0.168347 -0.595481 -0.676713 1.180548 0.528760 -0.239463 0.288887 -0.544520 ... -0.800122 -0.604654 -0.329101 0.835184 0.878609 -0.261401 -0.051363 0.053908 -0.369038 0.070882
21 -0.534690 -0.628845 1.234350 -0.595481 -0.676713 -0.698114 -1.055764 -0.185281 0.538410 -1.574418 ... -0.800122 0.214017 -0.329101 0.767899 -0.152365 -0.006348 -0.051363 1.038354 0.497102 0.724995
22 1.870241 0.960870 1.403893 1.648722 1.518056 0.375407 -1.055764 -0.347826 0.538410 0.703841 ... -0.800122 -1.013989 -0.329101 -0.303825 -1.259858 -1.111576 -0.051363 2.022800 -0.271372 -0.559410
23 -0.534690 -0.372219 -0.338605 -0.595481 -0.601032 1.180548 2.113283 -1.052186 0.288887 -1.196789 ... -0.800122 0.009350 -0.329101 0.475205 0.858473 -0.431436 -0.051363 -0.429086 1.636963 0.832387
24 -0.534690 -0.372219 0.303809 -0.595481 -0.601032 0.107027 0.528760 -0.564552 0.538410 1.427890 ... -0.800122 0.009350 -0.329101 -0.135476 -0.486627 -0.686488 -0.051363 0.152352 0.001064 -1.188530
25 -0.534690 0.574796 -0.864558 0.900654 0.836921 -0.698114 -1.055764 -0.943823 0.538410 -0.606939 ... 2.012117 3.898037 1.963812 -0.447643 -1.074604 2.714212 -0.051363 -0.948997 -0.730143 0.498887
26 -0.534690 0.129676 -0.726194 0.152587 0.004422 -0.698114 -1.055764 -0.483280 0.538410 1.203185 ... -0.800122 -0.195318 -0.329101 -0.381989 0.496021 -1.026558 -0.051363 -0.807483 -0.049054 -1.117847
27 1.870241 1.051710 -1.143369 1.648722 1.593738 -0.966495 -1.055764 -1.404366 -2.705391 0.351179 ... 0.137291 -0.809321 -0.329101 -0.724998 -0.305401 -0.771506 -0.051363 -1.004372 -0.933185 -1.189311
28 -0.534690 -0.712872 -0.803671 -0.595481 -0.601032 -1.503256 2.113283 -0.700006 0.538410 -1.271691 ... 2.012117 1.237356 -0.329101 -0.803815 -1.936435 -0.346418 -0.051363 -0.832094 -1.280155 -1.272491
29 -0.534690 -0.515294 -0.900027 -0.595481 -0.525350 -0.429734 -1.055764 -0.022736 0.538410 -1.321625 ... 0.137291 0.214017 1.963812 -0.633779 -0.671880 0.078669 -0.051363 -0.930538 -1.836592 0.558635
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
75802 -0.534690 -1.469122 -0.796030 -1.343549 -1.282167 -0.698114 0.528760 -0.185281 0.288887 -0.538279 ... 0.137291 0.828021 -0.329101 -0.011675 -1.465247 -0.091366 -0.051363 -0.782871 1.618972 -0.510595
75803 -0.534690 0.129676 -0.656032 0.152587 0.004422 1.180548 2.113283 0.383626 -2.705391 -1.284174 ... -0.800122 -0.399986 -0.329101 0.254855 1.007481 -0.006348 -0.051363 -0.745955 1.062535 1.854365
75804 -0.534690 -0.501668 -0.359936 -0.595481 -0.601032 0.643787 0.528760 -0.374916 0.288887 1.278087 ... -0.800122 0.009350 -0.329101 -0.134823 -0.216801 -1.026558 -0.051363 -0.438315 -0.989728 -1.271710
75805 1.870241 1.296981 0.432815 0.900654 1.063966 0.375407 0.528760 2.821797 0.538410 -0.135682 ... -0.800122 0.828021 -0.329101 2.006014 1.531023 0.928844 -0.051363 0.792242 0.282495 1.794616
75806 -0.534690 -0.288192 0.478541 -0.595481 -0.449668 0.643787 -1.055764 2.009073 0.288887 1.249999 ... 0.137291 1.032689 -0.329101 0.289287 0.854446 0.928844 -0.051363 0.300019 0.084594 0.807784
75807 -0.534690 0.463516 0.010736 0.152587 0.155785 -0.429734 2.113283 -0.835460 -2.705391 -1.100041 ... 2.012117 -1.423325 -0.329101 -0.613490 0.870555 -1.111576 -0.051363 -0.192204 -1.594998 -0.962812
75808 -0.534690 -0.685620 0.957174 -0.595481 -0.676713 -1.234875 -1.055764 -0.347826 0.538410 -1.324746 ... -0.800122 -0.195318 -0.329101 0.467481 -0.442327 -0.856523 -0.051363 0.792242 -0.228964 1.176040
75809 1.870241 1.233392 0.702227 1.648722 1.669419 -0.698114 0.528760 2.632161 0.538410 0.766259 ... 1.074704 -0.399986 -0.329101 -0.381663 -0.547035 0.418739 -0.051363 1.118340 -0.507826 2.210125
75810 -0.534690 -0.022482 -1.082973 0.152587 0.155785 1.180548 2.113283 -1.052186 0.538410 -0.191858 ... 1.074704 0.214017 4.256724 0.411020 1.128298 0.588774 -0.051363 -1.115122 0.799095 0.713280
75811 -0.534690 -0.501668 -0.090237 -0.595481 -0.601032 -0.966495 -1.055764 -0.564552 0.538410 -1.031381 ... 0.137291 -0.399986 -0.329101 -0.626925 -1.529683 0.588774 -0.051363 -0.192204 1.293848 -0.439522
75812 1.870241 0.563441 -0.534668 0.152587 0.307149 -1.234875 -1.055764 -0.700006 0.538410 1.431011 ... -0.800122 -0.809321 -0.329101 0.427230 0.508103 -0.771506 -0.051363 -0.222968 -0.127444 -0.355951
75813 -0.534690 -0.147388 -0.880985 -0.595481 -0.525350 0.107027 2.113283 -0.239463 0.288887 -0.606939 ... 0.137291 -0.399986 -0.329101 -0.510359 0.681274 -0.346418 -0.051363 -0.930538 -0.943466 0.082988
75814 1.870241 0.166012 -0.554936 0.152587 0.231467 0.107027 -1.055764 -0.781278 0.288887 1.440374 ... -0.800122 0.009350 -0.329101 -0.290172 0.636975 -1.111576 -0.051363 -0.192204 0.579347 -0.523482
75815 -0.534690 -0.853676 0.928692 -0.595481 -0.676713 -0.698114 -1.055764 1.873619 0.538410 1.299933 ... -0.800122 -1.013989 -0.329101 -0.747354 -1.980734 -0.006348 -0.051363 0.792242 -1.452355 -0.463734
75816 -0.534690 0.052461 0.051069 0.152587 0.231467 1.180548 0.528760 -0.293644 0.538410 1.321779 ... 0.137291 0.623353 4.256724 -0.370621 0.310768 1.353932 -0.051363 -0.118370 0.137282 1.176040
75817 -0.534690 1.133467 -0.812784 0.900654 0.912602 1.180548 0.528760 0.681624 0.538410 1.212548 ... 0.137291 0.418685 1.963812 -0.328792 0.592675 2.374142 -0.051363 -0.930538 1.558574 -0.297374
75818 1.870241 0.554357 0.287709 0.152587 0.231467 0.912168 0.528760 2.144527 0.538410 1.212548 ... 1.074704 1.032689 -0.329101 2.494527 0.057051 0.673792 -0.051363 0.792242 -0.136439 2.995451
75819 1.870241 -0.855947 0.424765 -0.595481 -0.752395 -0.966495 0.528760 0.112718 0.538410 1.171976 ... -0.800122 0.214017 -0.329101 0.462368 -0.325537 0.163687 -0.051363 1.355222 -0.538667 3.066525
75820 -0.534690 0.865487 -1.267839 0.900654 0.988284 -0.698114 2.113283 -0.320735 0.538410 1.312417 ... -0.800122 1.237356 -0.329101 0.103042 0.238277 -0.346418 -0.051363 -1.299706 -0.380603 -1.510315
75821 -0.534690 -0.401743 -1.029891 -0.595481 -0.601032 1.180548 0.528760 -0.158190 0.538410 -1.362197 ... -0.800122 -1.218657 -0.329101 -0.639110 0.669193 1.438949 -0.051363 -1.053594 -1.199195 0.440309
75822 -0.534690 0.733768 -0.543045 0.900654 0.912602 -1.234875 -1.055764 -0.943823 -1.208252 1.274966 ... -0.800122 4.307373 -0.329101 -0.605549 0.467830 1.183897 -0.051363 -0.684427 1.004707 -0.820665
75823 -0.534690 -0.147388 -0.715570 0.152587 0.080104 0.107027 -1.055764 -1.106368 0.538410 1.281208 ... -0.800122 -1.423325 1.963812 -0.461949 1.043726 -1.281611 -0.051363 -0.782871 -0.883067 -0.498489
75824 -0.534690 -0.819610 -1.245895 -0.595481 -0.676713 -0.966495 -1.055764 -0.158190 0.538410 -0.981447 ... 0.137291 -1.218657 -0.329101 -0.559695 -1.296103 -0.771506 -0.051363 -1.238178 -0.560514 0.867533
75825 -0.534690 -1.330590 1.871738 -1.343549 -1.282167 -0.161354 0.528760 0.519080 -2.455868 0.248189 ... 2.012117 0.009350 -0.329101 -0.505518 -1.747154 -0.091366 -0.051363 1.776688 -1.071973 0.070882
75826 -0.534690 -1.553150 1.820291 -1.343549 -1.282167 1.180548 0.528760 3.499066 -1.956821 -1.424615 ... 0.137291 1.442024 -0.329101 0.618370 -1.259858 0.673792 -0.051363 1.776688 -1.561586 -0.131796
75827 -0.534690 -1.755271 1.400828 -1.343549 -1.509212 1.180548 0.528760 0.735806 0.288887 -1.199910 ... 0.137291 2.465363 -0.329101 0.773991 -1.396784 1.098879 -0.051363 1.407521 0.871059 0.807784
75828 -0.534690 -0.288192 0.341853 -0.595481 -0.449668 0.643787 0.528760 1.602711 0.538410 -0.688082 ... -0.800122 -0.195318 -0.329101 3.247666 0.512130 -0.856523 -0.051363 0.176963 0.075598 -0.439912
75829 1.870241 1.867007 1.247304 1.648722 1.745101 1.180548 -1.055764 1.087986 0.538410 -1.009535 ... 0.137291 1.032689 1.963812 -0.445141 -0.236937 1.949055 -0.051363 1.555188 -0.241815 -0.284878
75830 -0.534690 1.017645 -0.077937 0.900654 1.063966 0.912168 -1.055764 -0.970914 -2.705391 -0.606939 ... 0.137291 -0.399986 1.963812 -0.446446 -1.070577 -0.601471 -0.051363 -0.315260 0.826081 -0.570344
75831 -0.534690 -0.769648 -0.647900 -0.595481 -0.601032 -0.698114 2.113283 -1.268913 0.538410 1.434132 ... -0.800122 -0.399986 1.963812 -0.141459 0.173841 -1.026558 -0.051363 -0.684427 -0.225109 -1.213132

252771 rows × 22 columns


In [ ]:


In [ ]:


In [40]:
sum(y_score[:,1] > 0.5) / len(y_score[:,1] )


Out[40]:
0.014716617133322906

In [41]:
max(y_score[~prediction, 1])


Out[41]:
0.49995529304248365

In [74]:
diff_thres = y_score[:,1] > 0.18

In [79]:
print(f1_score(y_test, diff_thres, average='weighted'))
confusion_matrix = ConfusionMatrix(y_test, diff_thres)
confusion_matrix.print_stats()
confusion_matrix.plot()


0.686387709202
Confusion Matrix:

Predicted  False  True  __all__
Actual                         
False       8290  5751    14041
True        1677  1168     2845
__all__     9967  6919    16886


Overall Statistics:

Accuracy: 0.560108966007
95% CI: (0.55258278403993444, 0.56761443612344797)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.000610218839777
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                      False        True 
Population                                   16886        16886
P: Condition positive                        14041         2845
N: Condition negative                         2845        14041
Test outcome positive                         9967         6919
Test outcome negative                         6919         9967
TP: True Positive                             8290         1168
TN: True Negative                             1168         8290
FP: False Positive                            1677         5751
FN: False Negative                            5751         1677
TPR: (Sensitivity, hit rate, recall)      0.590414     0.410545
TNR=SPC: (Specificity)                    0.410545     0.590414
PPV: Pos Pred Value (Precision)           0.831745     0.168811
NPV: Neg Pred Value                       0.168811     0.831745
FPR: False-out                            0.589455     0.409586
FDR: False Discovery Rate                 0.168255     0.831189
FNR: Miss Rate                            0.409586     0.589455
ACC: Accuracy                             0.560109     0.560109
F1 score                                  0.690603     0.239246
MCC: Matthews correlation coefficient  0.000729584  0.000729584
Informedness                           0.000958604  0.000958604
Markedness                             0.000555279  0.000555279
Prevalence                                0.831517     0.168483
LR+: Positive likelihood ratio             1.00163      1.00234
LR-: Negative likelihood ratio            0.997665     0.998376
DOR: Diagnostic odds ratio                 1.00397      1.00397
FOR: False omission rate                  0.831189     0.168255
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x124f6da90>

In [ ]:


In [ ]:

Random Forest

Also with the random forest algorithm for both only grade and all features we find the same accuracy measurements as measured with cross-validation on the training set. Therefore the logistic regression algorithm with all features still performance the best, although it performs not very well.


In [46]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train.loc[:,['grade']], y_train)
prediction = clf.predict(X_test.loc[:,['grade']])
print(f1_score(y_test, prediction, average='weighted'))
confusion_matrix = ConfusionMatrix(y_test, prediction)
print(confusion_matrix)
confusion_matrix.plot()


0.739008115086
Predicted  False  True  __all__
Actual                         
False      14041     0    14041
True        2845     0     2845
__all__    16886     0    16886
/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x121651198>

In [47]:
fpr, tpr, thresholds = roc_curve(y_test, prediction, pos_label=True)
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


0.5
Out[47]:
[<matplotlib.lines.Line2D at 0x126add518>]

In [48]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
print(f1_score(y_test, prediction, average='weighted'))
confusion_matrix = ConfusionMatrix(y_test, prediction)
print(confusion_matrix)
confusion_matrix.plot()


0.749727275393
Predicted  False  True  __all__
Actual                         
False      13865   176    14041
True        2806    39     2845
__all__    16671   215    16886
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x126445c18>

In [49]:
y_score = clf.predict_proba(X_test)
print(clf.classes_)
fpr, tpr, thresholds = roc_curve(y_test, y_score[:,1], pos_label=True)
print(auc(fpr, tpr))
plt.plot(fpr, tpr)


[False  True]
0.694945786794
Out[49]:
[<matplotlib.lines.Line2D at 0x123830978>]

Predict grade


In [50]:
X_train, X_test, y_train, y_test = train_test_split(closed_loans.iloc[:, 0:24], 
                                                    closed_loans['grade'], test_size=0.3, 
                                                    random_state=123, stratify=closed_loans['loan_status'])
X_train = X_train.drop(['grade', 'sub_grade', 'int_rate'], axis=1)
X_test = X_test.drop(['grade', 'sub_grade', 'int_rate'], axis=1)

In [51]:
# features that are not float or int, so not to be converted:

# date:
# earliest_cr_line

# ordered:
# emp_length, zip_code, term

# unordered:
# home_ownership, purpose, addr_state (ordered geographically)

# date
X_train['earliest_cr_line'] = pd.to_datetime(X_train['earliest_cr_line']).dt.strftime("%s")
X_train['earliest_cr_line'] = [0 if date=='NaT' else int(date) for date in X_train['earliest_cr_line']]

# term
X_train['term'] = X_train['term'].apply(lambda x: int(x.split(' ')[1]))

# emp_length
emp_length_dict = {'n/a':0,
                   '< 1 year':0,
                   '1 year':1,
                   '2 years':2,
                   '3 years':3,
                   '4 years':4,
                   '5 years':5,
                   '6 years':6,
                   '7 years':7,
                   '8 years':8,
                   '9 years':9,
                   '10+ years':10}
X_train['emp_length'] = X_train['emp_length'].apply(lambda x: emp_length_dict[x])

# zipcode
X_train['zip_code'] = X_train['zip_code'].apply(lambda x: int(x[0:3]))

# house
house_dict = {'NONE': 0, 'OTHER': 0, 'ANY': 0, 'RENT': 1, 'MORTGAGE': 2, 'OWN': 3}
X_train['home_ownership'] = X_train['home_ownership'].apply(lambda x: house_dict[x])

# purpose
purpose_dict = {'other': 0, 'small_business': 1, 'renewable_energy': 2, 'home_improvement': 3,
                'house': 4, 'educational': 5, 'medical': 6, 'moving': 7, 'car': 8, 
                'major_purchase': 9, 'wedding': 10, 'vacation': 11, 'credit_card': 12, 
                'debt_consolidation': 13}
X_train['purpose'] = X_train['purpose'].apply(lambda x: purpose_dict[x])

# states
state_dict = {'AK': 0, 'WA': 1, 'ID': 2, 'MT': 3, 'ND': 4, 'MN': 5, 
              'OR': 6, 'WY': 7, 'SD': 8, 'WI': 9, 'MI': 10, 'NY': 11, 
              'VT': 12, 'NH': 13, 'MA': 14, 'CT': 15, 'RI': 16, 'ME': 17,
              'CA': 18, 'NV': 19, 'UT': 20, 'CO': 21, 'NE': 22, 'IA': 23, 
              'KS': 24, 'MO': 25, 'IL': 26, 'IN': 27, 'OH': 28, 'PA': 29, 
              'NJ': 30, 'KY': 31, 'WV': 32, 'VA': 33, 'DC': 34, 'MD': 35, 
              'DE': 36, 'AZ': 37, 'NM': 38, 'OK': 39, 'AR': 40, 'TN': 41, 
              'NC': 42, 'TX': 43, 'LA': 44, 'MS': 45, 'AL': 46, 'GA': 47, 
              'SC': 48, 'FL': 49, 'HI': 50}
X_train['addr_state'] = X_train['addr_state'].apply(lambda x: state_dict[x])

# make NA's, inf and -inf 0
X_train = X_train.fillna(0)
X_train = X_train.replace([np.inf, -np.inf], 0)


# date
X_test['earliest_cr_line'] = pd.to_datetime(X_test['earliest_cr_line']).dt.strftime("%s")
X_test['earliest_cr_line'] = [0 if date=='NaT' else int(date) for date in X_test['earliest_cr_line']]

# term
X_test['term'] = X_test['term'].apply(lambda x: int(x.split(' ')[1]))

# emp_length
X_test['emp_length'] = X_test['emp_length'].apply(lambda x: emp_length_dict[x])

# zipcode
X_test['zip_code'] = X_test['zip_code'].apply(lambda x: int(x[0:3]))

# house
X_test['home_ownership'] = X_test['home_ownership'].apply(lambda x: house_dict[x])

# purpose
X_test['purpose'] = X_test['purpose'].apply(lambda x: purpose_dict[x])

# states
X_test['addr_state'] = X_test['addr_state'].apply(lambda x: state_dict[x])

# make NA's, inf and -inf 0
X_test = X_test.fillna(0)
X_test = X_test.replace([np.inf, -np.inf], 0)

In [52]:
from sklearn import preprocessing
X_train_scaled = preprocessing.scale(X_train)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)


/Users/ro.d.bruijn/anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:160: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
  warnings.warn("Numerical issues were encountered "

In [53]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

lb = LabelBinarizer()
grades = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
lb.fit(grades)
y_train_2 = lb.transform(y_train)

clf = OneVsRestClassifier(LogisticRegression(penalty='l1'))
predict_y = clf.fit(X_train_scaled, y_train_2).predict(X_test_scaled)
predict_y = lb.inverse_transform(predict_y)

#print(accuracy_score(y_test, predict_y))
confusion_matrix = ConfusionMatrix(np.array(y_test, dtype='<U1'), predict_y)
confusion_matrix.plot()
confusion_matrix.print_stats()

# find index of top 5 highest coefficients, aka most used features for prediction
coefs = clf.coef_
positions = abs(coefs[0]).argsort()[-5:][::-1]
print(X_train_scaled.columns[positions])
print(coefs[0][positions])


Confusion Matrix:

Predicted      A   B    C     D    E  F  G  __all__
Actual                                             
A          12769  17    2     0    0  0  0    12788
B          22809  25   21     7    0  0  0    22862
C          19651  31   53    37    0  0  0    19772
D          12102   1   49   101    2  0  0    12255
E           5636   0   22   282   14  5  2     5961
F           1904   0    6   379  114  3  1     2407
G            370   0    9   201   91  0  0      671
__all__    75241  74  162  1007  221  8  3    76716


Overall Statistics:

Accuracy: 0.16899994786
95% CI: (0.16635413333328417, 0.17167081891471681)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.0028273004168
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                         A            B            C  \
Population                                  76716        76716        76716   
P: Condition positive                       12788        22862        19772   
N: Condition negative                       63928        53854        56944   
Test outcome positive                       75241           74          162   
Test outcome negative                        1475        76642        76554   
TP: True Positive                           12769           25           53   
TN: True Negative                            1456        53805        56835   
FP: False Positive                          62472           49          109   
FN: False Negative                             19        22837        19719   
TPR: (Sensitivity, hit rate, recall)     0.998514   0.00109352   0.00268056   
TNR=SPC: (Specificity)                  0.0227756      0.99909     0.998086   
PPV: Pos Pred Value (Precision)          0.169708     0.337838      0.32716   
NPV: Neg Pred Value                      0.987119      0.70203     0.742417   
FPR: False-out                           0.977224  0.000909867   0.00191416   
FDR: False Discovery Rate                0.830292     0.662162      0.67284   
FNR: Miss Rate                         0.00148577     0.998906     0.997319   
ACC: Accuracy                            0.185424     0.701679      0.74154   
F1 score                                 0.290109   0.00217998   0.00531755   
MCC: Matthews correlation coefficient   0.0577825   0.00270588   0.00730233   
Informedness                            0.0212899   0.00018365  0.000766397   
Markedness                               0.156827    0.0398681    0.0695776   
Prevalence                               0.166693     0.298008      0.25773   
LR+: Positive likelihood ratio            1.02179      1.20184      1.40038   
LR-: Negative likelihood ratio           0.065235     0.999816     0.999232   
DOR: Diagnostic odds ratio                15.6632      1.20206      1.40146   
FOR: False omission rate                0.0128814      0.29797     0.257583   

Classes                                         D            E            F  \
Population                                  76716        76716        76716   
P: Condition positive                       12255         5961         2407   
N: Condition negative                       64461        70755        74309   
Test outcome positive                        1007          221            8   
Test outcome negative                       75709        76495        76708   
TP: True Positive                             101           14            3   
TN: True Negative                           63555        70548        74304   
FP: False Positive                            906          207            5   
FN: False Negative                          12154         5947         2404   
TPR: (Sensitivity, hit rate, recall)   0.00824153    0.0023486   0.00124636   
TNR=SPC: (Specificity)                   0.985945     0.997074     0.999933   
PPV: Pos Pred Value (Precision)          0.100298    0.0633484        0.375   
NPV: Neg Pred Value                      0.839464     0.922256      0.96866   
FPR: False-out                           0.014055   0.00292559  6.72866e-05   
FDR: False Discovery Rate                0.899702     0.936652        0.625   
FNR: Miss Rate                           0.991758     0.997651     0.998754   
ACC: Accuracy                            0.829762     0.919782     0.968598   
F1 score                                0.0152315   0.00452928   0.00248447   
MCC: Matthews correlation coefficient  -0.0187134  -0.00288199    0.0201296   
Informedness                          -0.00581348 -0.000576989   0.00117908   
Markedness                             -0.0602378   -0.0143952      0.34366   
Prevalence                               0.159745    0.0777022    0.0313755   
LR+: Positive likelihood ratio           0.586377     0.802778      18.5232   
LR-: Negative likelihood ratio             1.0059      1.00058     0.998821   
DOR: Diagnostic odds ratio                0.58294     0.802314      18.5451   
FOR: False omission rate                 0.160536    0.0777436    0.0313396   

Classes                                          G  
Population                                   76716  
P: Condition positive                          671  
N: Condition negative                        76045  
Test outcome positive                            3  
Test outcome negative                        76713  
TP: True Positive                                0  
TN: True Negative                            76042  
FP: False Positive                               3  
FN: False Negative                             671  
TPR: (Sensitivity, hit rate, recall)             0  
TNR=SPC: (Specificity)                    0.999961  
PPV: Pos Pred Value (Precision)                  0  
NPV: Neg Pred Value                       0.991253  
FPR: False-out                         3.94503e-05  
FDR: False Discovery Rate                        1  
FNR: Miss Rate                                   1  
ACC: Accuracy                             0.991214  
F1 score                                         0  
MCC: Matthews correlation coefficient -0.000587425  
Informedness                          -3.94503e-05  
Markedness                             -0.00874689  
Prevalence                              0.00874655  
LR+: Positive likelihood ratio                   0  
LR-: Negative likelihood ratio             1.00004  
DOR: Diagnostic odds ratio                       0  
FOR: False omission rate                0.00874689  
Index(['installment', 'funded_amnt', 'term', 'loan_amnt', 'funded_amnt_inv'], dtype='object')
[-35.7243125   34.87986167 -18.0251852    1.45506546   1.4179102 ]

In [54]:
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
predict_y = clf.fit(X_train_scaled, y_train_2).predict(X_test_scaled)
predict_y = lb.inverse_transform(predict_y)

print(accuracy_score(y_test, predict_y))
confusion_matrix = ConfusionMatrix(np.array(y_test, dtype='<U1'), predict_y)
confusion_matrix.plot()
print(confusion_matrix)


0.388171958913
Predicted      A      B     C    D    E    F   G  __all__
Actual                                                   
A          12253    531     4    0    0    0   0    12788
B          11097  11321   438    6    0    0   0    22862
C          14119   1160  4422   70    1    0   0    19772
D          10933     26   624  604   68    0   0    12255
E           4958      0    29  141  802   31   0     5961
F           1991      0     6   10   70  330   0     2407
G            541      0     0    3    8   72  47      671
__all__    55892  13038  5523  834  949  433  47    76716

In [55]:
confusion_matrix.print_stats()


Confusion Matrix:

Predicted      A      B     C    D    E    F   G  __all__
Actual                                                   
A          12253    531     4    0    0    0   0    12788
B          11097  11321   438    6    0    0   0    22862
C          14119   1160  4422   70    1    0   0    19772
D          10933     26   624  604   68    0   0    12255
E           4958      0    29  141  802   31   0     5961
F           1991      0     6   10   70  330   0     2407
G            541      0     0    3    8   72  47      671
__all__    55892  13038  5523  834  949  433  47    76716


Overall Statistics:

Accuracy: 0.388171958913
95% CI: (0.38472121084297828, 0.39163116596403957)
No Information Rate: ToDo
P-Value [Acc > NIR]: 1.0
Kappa: 0.241353230121
Mcnemar's Test P-Value: ToDo


Class Statistics:

Classes                                        A          B          C  \
Population                                 76716      76716      76716   
P: Condition positive                      12788      22862      19772   
N: Condition negative                      63928      53854      56944   
Test outcome positive                      55892      13038       5523   
Test outcome negative                      20824      63678      71193   
TP: True Positive                          12253      11321       4422   
TN: True Negative                          20289      52137      55843   
FP: False Positive                         43639       1717       1101   
FN: False Negative                           535      11541      15350   
TPR: (Sensitivity, hit rate, recall)    0.958164   0.495189    0.22365   
TNR=SPC: (Specificity)                  0.317373   0.968118   0.980665   
PPV: Pos Pred Value (Precision)         0.219226   0.868308   0.800652   
NPV: Neg Pred Value                     0.974308    0.81876   0.784389   
FPR: False-out                          0.682627  0.0318825  0.0193348   
FDR: False Discovery Rate               0.780774   0.131692   0.199348   
FNR: Miss Rate                         0.0418361   0.504811    0.77635   
ACC: Accuracy                           0.424188   0.827181    0.78556   
F1 score                                0.356814   0.630696   0.349634   
MCC: Matthews correlation coefficient   0.230924   0.564201   0.345735   
Informedness                            0.275537   0.463306   0.204315   
Markedness                              0.193535   0.687068   0.585041   
Prevalence                              0.166693   0.298008    0.25773   
LR+: Positive likelihood ratio           1.40364    15.5317    11.5672   
LR-: Negative likelihood ratio           0.13182   0.521436   0.791657   
DOR: Diagnostic odds ratio               10.6482    29.7863    14.6114   
FOR: False omission rate               0.0256915    0.18124   0.215611   

Classes                                         D           E          F  \
Population                                  76716       76716      76716   
P: Condition positive                       12255        5961       2407   
N: Condition negative                       64461       70755      74309   
Test outcome positive                         834         949        433   
Test outcome negative                       75882       75767      76283   
TP: True Positive                             604         802        330   
TN: True Negative                           64231       70608      74206   
FP: False Positive                            230         147        103   
FN: False Negative                          11651        5159       2077   
TPR: (Sensitivity, hit rate, recall)     0.049286    0.134541     0.1371   
TNR=SPC: (Specificity)                   0.996432    0.997922   0.998614   
PPV: Pos Pred Value (Precision)          0.724221      0.8451   0.762125   
NPV: Neg Pred Value                      0.846459     0.93191   0.972772   
FPR: False-out                         0.00356805  0.00207759  0.0013861   
FDR: False Discovery Rate                0.275779      0.1549   0.237875   
FNR: Miss Rate                           0.950714    0.865459     0.8629   
ACC: Accuracy                             0.84513    0.930836   0.971584   
F1 score                                0.0922912    0.232127   0.232394   
MCC: Matthews correlation coefficient    0.161525     0.32082    0.31581   
Informedness                             0.045718    0.132464   0.135714   
Markedness                                0.57068     0.77701   0.734897   
Prevalence                               0.159745   0.0777022  0.0313755   
LR+: Positive likelihood ratio            13.8132     64.7582    98.9104   
LR-: Negative likelihood ratio           0.954118    0.867261   0.864098   
DOR: Diagnostic odds ratio                14.4774     74.6699    114.467   
FOR: False omission rate                 0.153541   0.0680903  0.0272276   

Classes                                         G  
Population                                  76716  
P: Condition positive                         671  
N: Condition negative                       76045  
Test outcome positive                          47  
Test outcome negative                       76669  
TP: True Positive                              47  
TN: True Negative                           76045  
FP: False Positive                              0  
FN: False Negative                            624  
TPR: (Sensitivity, hit rate, recall)    0.0700447  
TNR=SPC: (Specificity)                          1  
PPV: Pos Pred Value (Precision)                 1  
NPV: Neg Pred Value                      0.991861  
FPR: False-out                                  0  
FDR: False Discovery Rate                       0  
FNR: Miss Rate                           0.929955  
ACC: Accuracy                            0.991866  
F1 score                                 0.130919  
MCC: Matthews correlation coefficient     0.26358  
Informedness                            0.0700447  
Markedness                               0.991861  
Prevalence                             0.00874655  
LR+: Positive likelihood ratio                inf  
LR-: Negative likelihood ratio           0.929955  
DOR: Diagnostic odds ratio                    inf  
FOR: False omission rate               0.00813888  

In [56]:
features = []
for i,j in enumerate(grades):
    print('\n',j)
    feat_imp = clf.estimators_[i].feature_importances_
    positions = abs(feat_imp).argsort()[-5:][::-1]
    features.extend(list(X_train.columns[positions]))
    print(X_train.columns[positions])
    print(feat_imp[positions])


 A
Index(['revol_util', 'installment', 'revol_bal', 'funded_amnt_inv',
       'earliest_cr_line'],
      dtype='object')
[ 0.16382844  0.11410527  0.06305903  0.05875622  0.05705625]

 B
Index(['installment', 'revol_util', 'funded_amnt_inv', 'revol_bal', 'dti'], dtype='object')
[ 0.18082548  0.0815748   0.06421254  0.06384274  0.06121603]

 C
Index(['installment', 'revol_util', 'revol_bal', 'dti', 'earliest_cr_line'], dtype='object')
[ 0.15151168  0.08249654  0.07178606  0.07000384  0.06760838]

 D
Index(['installment', 'revol_util', 'revol_bal', 'dti', 'earliest_cr_line'], dtype='object')
[ 0.12177678  0.08687976  0.0728564   0.07252917  0.069689  ]

 E
Index(['installment', 'revol_util', 'funded_amnt_inv', 'dti', 'revol_bal'], dtype='object')
[ 0.16401471  0.07213717  0.06327645  0.06278267  0.06117716]

 F
Index(['installment', 'funded_amnt_inv', 'revol_util', 'funded_amnt',
       'loan_amnt'],
      dtype='object')
[ 0.19441472  0.06760932  0.06559835  0.06320743  0.06153764]

 G
Index(['installment', 'funded_amnt_inv', 'revol_util', 'dti', 'loan_amnt'], dtype='object')
[ 0.18354993  0.0720556   0.0674901   0.06233009  0.06232249]

In [57]:
pd.Series(features).value_counts()


Out[57]:
installment         7
revol_util          7
revol_bal           5
dti                 5
funded_amnt_inv     5
earliest_cr_line    3
loan_amnt           2
funded_amnt         1
dtype: int64

In [ ]:


In [ ]:


In [ ]: