In this project, I build an algorithm to predict whether a particular Lending Club loan will be a success or a failure. A successful loan is one that is fully paid off. A failed loan is one in either the default or the charged-off state, where there is no reasonable expectation that the loan will be paid off.
In [1]:
# Load all the data
import pandas as pd
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer
In [2]:
%matplotlib inline
import mpld3
mpld3.enable_notebook()
In [3]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']
In [4]:
df = pd.read_csv('/Volumes/mac/Work/JobHunt/Incubator/ipynb/LoanStats3a_securev1.csv',skiprows=1)
dfb = pd.read_csv('/Volumes/mac/Work/JobHunt/Incubator/ipynb/LoanStats3b_securev1.csv',skiprows=1)
dfc = pd.read_csv('/Volumes/mac/Work/JobHunt/Incubator/ipynb/LoanStats3c_securev1.csv',skiprows=1)
dfd = pd.read_csv('/Volumes/mac/Work/JobHunt/Incubator/ipynb/LoanStats3d_securev1.csv',skiprows=1)
In [5]:
totdf = df.append(dfb)
totdf = totdf.append(dfc)
totdf = totdf.append(dfd)
print totdf.shape
In [6]:
import numpy as np
In [7]:
# success: 1, failure: -1, others: 0, questionable: 0.5
totdf['stat'] = pd.Series(np.zeros(np.shape(totdf['id'])))
success = ['Fully Paid']
failure = ['Charged Off','Default']
question = ['In Grace Period','Late (16-30 days)','Late (31-120 days)']
for i in success:
    totdf.stat[totdf.loan_status == i] = 1
for i in failure:
    totdf.stat[totdf.loan_status == i] = -1
for i in question:
    totdf.stat[totdf.loan_status == i] = 0.5
# ONLY look at the data for failure and success, no questionable or other type
CleanUpRecord = totdf.loc[(totdf.stat==1)|(totdf.stat==-1)]
df = CleanUpRecord
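A note on the assignment style above: totdf.stat[...] = value is chained assignment, which pandas may flag with a SettingWithCopyWarning. A minimal, warning-free sketch of the same labeling, using a status-to-label map (same mapping as the loops above):
In [ ]:
# Sketch: equivalent labeling via a status-to-label map.
status_map = {'Fully Paid': 1,
              'Charged Off': -1, 'Default': -1,
              'In Grace Period': 0.5, 'Late (16-30 days)': 0.5,
              'Late (31-120 days)': 0.5}
# Statuses not in the map fall back to 0 ("others").
totdf['stat'] = totdf['loan_status'].map(status_map).fillna(0)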
In [8]:
print 'Failure / (Success + Failure)'
print (totdf.stat[totdf.stat == -1]).count()*1.0/((totdf.stat[totdf.stat == 1]).count() + (totdf.stat[totdf.stat == -1]).count())
In [ ]:
In [9]:
df.columns
Out[9]:
In [10]:
# # There are two sets of FICO fields. Which one should we use?
# # Let us explore what the difference is.
# MisMatch = []
# for i in xrange(len(df.url)):
# if not df.last_fico_range_high.iloc[i] == df.fico_range_high.iloc[i]:
# MisMatch.append(i)
In [11]:
relevant = [ 'annual_inc',
'dti',
'emp_length',
'fico_range_high',
'fico_range_low',
'home_ownership',
'loan_amnt',
'open_acc',
'policy_code',
'pub_rec',
'sub_grade',
'term',
'total_acc',
# 'last_credit_pull_d',
# # 'last_fico_range_high',
# 'last_fico_range_low',
# 'earliest_cr_line',
# 'purpose',
# 'funded_amnt',
# 'grade',
]
X = df[relevant]
y = df['stat']
X_y = df[relevant +['stat']]
## reweight the y to deal with the imbalance
# Reweighting does not seem to work now.
# let me just resample
# y_1 = len(y[y== 1])
# y_0 = len(y[y== -1])
# print pd.unique(y)
# y[y==-1] = y_1
# y[y== 1] = y_0
# print pd.unique(y)
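The resampling route is implemented in rebalanceXy below. As an alternative sketch (assuming a scikit-learn version recent enough to support it, roughly 0.17+), the classifier itself can reweight the classes:
In [ ]:
# Alternative to manual resampling (assumption: sklearn supports class_weight):
# 'balanced' reweights samples inversely proportional to class frequencies.
from sklearn.ensemble import RandomForestClassifier
rf_balanced = RandomForestClassifier(n_estimators=300, class_weight='balanced')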
In [344]:
relevant = [ 'annual_inc',
'dti',
'emp_length',
# 'fico_range_high',
# 'fico_range_low',
'home_ownership',
'loan_amnt',
'open_acc',
'policy_code',
'pub_rec',
'sub_grade',
'term',
'total_acc',
# 'last_credit_pull_d',
'last_fico_range_high',
'last_fico_range_low',
# 'earliest_cr_line',
# 'purpose',
# 'funded_amnt',
# 'grade',
]
X_cheat = df[relevant]
# y = df['stat']
# X_y = df[relevant +['stat']]
In [345]:
List2DictT = List2Dict_Transformer(str_keys=['home_ownership'])
X_trans_cheat = List2DictT.fit_transform(X_cheat)
DictVectorT = DictVectorizer(sparse=False)
X_vect_cheat = DictVectorT.fit_transform(X_trans_cheat)
RFfit_cheat = RandomForestClassifier(n_estimators=300)
RFfit_cheat.fit(X_vect_cheat, y)
y_prob_cheat = RFfit_cheat.predict_proba(X_vect_cheat)
In [ ]:
In [ ]:
In [ ]:
In [12]:
############################
## Create orthogonal, class-balanced datasets
def rebalanceXy(X, y):
    # Permute the data, just in case it is ordered:
    import numpy as np
    y = np.array(y)
    X = np.array(X)
    perm = np.random.permutation(len(y))
    X = X[perm]
    y = y[perm]
    # Determine which class is the minority
    y_uniq = np.unique(y)
    if len(y[y == y_uniq[0]]) < len(y) / 2:
        y_less, y_more = y_uniq[0], y_uniq[1]
    else:
        y_less, y_more = y_uniq[1], y_uniq[0]
    n_less = len(y[y == y_less])
    n_more = len(y[y == y_more])
    # Rebalance: each split keeps ALL minority samples plus a disjoint
    # slice of the majority samples, so the splits are (nearly) orthogonal.
    yy_less = y[y == y_less]
    XX_less = X[y == y_less]
    yy_more = y[y == y_more]
    XX_more = X[y == y_more]
    Xy_rebalance = []
    for i in xrange(n_more / n_less + 1):
        St = i * n_less
        Ed = (i + 1) * n_less
        if Ed <= n_more:
            XX_tmp = np.append(XX_more[St:Ed], XX_less, 0)
            yy_tmp = np.append(yy_more[St:Ed], yy_less, 0)
        else:
            # The last slice is short; pad it with randomly drawn majority samples.
            Diff = np.random.random_integers(0, n_more - 1, Ed - n_more)
            XX_tmp = np.append(np.append(XX_more[St:n_more], XX_more[Diff], 0), XX_less, 0)
            yy_tmp = np.append(np.append(yy_more[St:n_more], yy_more[Diff], 0), yy_less, 0)
        perm = np.random.permutation(n_less * 2)
        Xy_rebalance.append([XX_tmp[perm], yy_tmp[perm]])
    return Xy_rebalance
# split into train and test datasets
# from sklearn.cross_validation import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
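A quick synthetic sanity check of rebalanceXy (illustration only; the arrays below are made up):
In [ ]:
# Synthetic check: 900 majority vs 100 minority samples should yield
# 900/100 + 1 = 10 balanced splits of 2 * 100 = 200 samples each.
X_demo = np.random.randn(1000, 3)
y_demo = np.concatenate([np.ones(900), -np.ones(100)])
Xy_demo = rebalanceXy(X_demo, y_demo)
print len(Xy_demo), len(Xy_demo[0][1])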
In [ ]:
In [16]:
len(y[y==-1])
Out[16]:
In [17]:
from sklearn.base import BaseEstimator,RegressorMixin,TransformerMixin
class List2Dict_Transformer(BaseEstimator, TransformerMixin):
    '''
    Expects a data-frame object; turns each row into a feature dict
    suitable for DictVectorizer.
    '''
    def __init__(self, str_keys=[]):
        self.str_keys = str_keys
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        grademap = {'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 50, 'F': 60, 'G': 70, 'H': 80}
        X_dict = []
        X_keys = X.columns
        for i in xrange(len(X)):
            x_dict = {}
            for key in X_keys:
                if key in self.str_keys:
                    # One-hot encode categorical strings, e.g. home_ownership_RENT = 1
                    x_dict[key + '_' + X[key].iloc[i]] = 1
                elif key == u'emp_length':
                    if X[key].iloc[i][0] == 'n':    # 'n/a'
                        x_dict[key] = -1
                    elif X[key].iloc[i][0] == '<':  # '< 1 year'
                        x_dict[key] = 0
                    else:
                        x_dict[key] = int(X[key].iloc[i][:2])
                elif key == 'sub_grade':
                    # Map e.g. 'C1' to 30 + 1 = 31
                    base = grademap[X[key].iloc[i][0]]
                    x_dict[key] = base + int(X[key].iloc[i][1])
                elif key == 'term':
                    # e.g. ' 36 months' -> 36
                    x_dict[key] = int(X[key].iloc[i][:3])
                else:
                    x_dict[key] = X[key].iloc[i]
            X_dict.append(x_dict)
        return X_dict
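A toy look at what the transformer emits for one fabricated record (the values below are made up to mimic the Lending Club format):
In [ ]:
# Toy illustration only; not real data.
toy = pd.DataFrame({'home_ownership': ['RENT'], 'emp_length': ['10+ years'],
                    'sub_grade': ['C1'], 'term': [' 36 months'],
                    'annual_inc': [60000.0]})
print List2Dict_Transformer(str_keys=['home_ownership']).fit_transform(toy)
# Expected: [{'home_ownership_RENT': 1, 'emp_length': 10, 'sub_grade': 31,
#             'term': 36, 'annual_inc': 60000.0}]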
In [18]:
List2DictT = List2Dict_Transformer(str_keys=['home_ownership'])
X_trans = List2DictT.fit_transform(X)
DictVectorT = DictVectorizer(sparse=False)
X_vect = DictVectorT.fit_transform(X_trans)
In [19]:
import dill
with open('./X_vect.pkl','wb') as out_strm:
    dill.dump(X_vect, out_strm)
In [145]:
# Save the fit
# import dill
# with open('./List2DictT.pkl','wb') as out_strm:
# dill.dump(List2DictT,out_strm)
# with open('./DictVectorT.pkl','wb') as out_strm:
# dill.dump(DictVectorT,out_strm)
# with open('./RFfit.pkl','wb') as out_strm:
# dill.dump(RFfit,out_strm)
In [21]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_vect, y, test_size=0.33)
Xy_train_bal = rebalanceXy(X_train,y_train)
In [337]:
len(Xy_train_bal[0][1])
Out[337]:
In [181]:
Out[181]:
In [27]:
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.linear_model import LogisticRegression
TotRFMOD = []
TotGBGMOD = []
TotLGRMOD = []
for i in xrange(len(Xy_train_bal)):
    XX_bal = Xy_train_bal[i][0]
    yy_bal = Xy_train_bal[i][1]
    RFfit = RandomForestClassifier(n_estimators=300)
    RFfit.fit(XX_bal, yy_bal)
    TotRFMOD.append(RFfit)
    clf = GradientBoostingClassifier(n_estimators=1000, max_depth=5,
                                     learning_rate=0.05, max_features='sqrt')
    clf.fit(XX_bal, yy_bal)
    TotGBGMOD.append(clf)
    clf2 = LogisticRegression()
    clf2.fit(XX_bal, yy_bal)
    TotLGRMOD.append(clf2)
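Each balanced split yields its own trio of models. Besides the stacked logistic regression used further below, a simpler combination, sketched here, is to average the predicted probabilities across all models (assuming they share the same classes_ ordering):
In [ ]:
# Sketch: average predict_proba across the resampled models.
def average_proba(models, X):
    probs = np.array([m.predict_proba(X) for m in models])
    return probs.mean(axis=0)
# e.g. avg = average_proba(TotRFMOD + TotGBGMOD + TotLGRMOD, X_test)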
In [28]:
TotModes = TotRFMOD + TotGBGMOD + TotLGRMOD
import dill
with open('TotModes.pkl','wb') as out_strm:
    dill.dump(TotModes, out_strm)
In [78]:
Out[78]:
In [80]:
a=TotRFMOD[1]
a.feature_importances_
importance=[(DictVectorT.feature_names_[i],a.feature_importances_[i]) for i in xrange(len(a.feature_importances_))]
importance = sorted(importance,key = lambda tup:tup[1],reverse=True)
print importance
In [2]:
# from collections import OrderedDict
# import pandas as pd
# from bokeh.charts import Bar, output_file, show
# from bokeh.sampledata.olympics2014 import data
# import bokeh
# from bokeh.models import HoverTool, ColumnDataSource
# # get the countries and we group the data by medal type
# gold = meanFico.Paid
# silver = 1-meanFico.Paid
# countries =[str(meanFico.index[i]) for i in xrange(len(gold))]
# for i in xrange(len(gold)):
# meanFico['WTF'].iloc[i] = (meanFico.index[i])
# gold = gold.iloc[2:]
# silver = silver.iloc[2:]
# countries = countries[2:]
# # build a dict containing the grouped data
# medals = OrderedDict(silver=silver, gold=gold)
# fscor = meanFico.WTF.iloc[2:]
In [1]:
# output_notebook()
# TOOLS = "hover,save"
# bar = Bar(medals, countries, title="Pay Off versus Default",
# stacked=True,width=1500, height=800,tools = TOOLS,
# ylabel='Probability',
# xlabel='FICO credit score')
# source = ColumnDataSource(
# data=dict(
# gold = fscor,
# silver = silver,
# fscor = fscor,
# )
# )
# hover = bar.select(dict(type=HoverTool))
# hover.tooltips = OrderedDict([
# ("Pay off chance", "@gold"),
# ("Default chance", "@silver"),
# ("FICO score", "$x"),
# ])
# show(bar)
In [365]:
Fico = CleanUpRecord
Fico['fico']=Fico.fico_range_high
Fico['Paid'] = Fico['stat']
Fico.Paid[Fico['Paid'] == -1] =0
meanFico=Fico.groupby(['fico']).mean()
cntFico=Fico.groupby(['fico']).count()
# print pd.unique(Fico.stat)
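The interactive Bokeh versions of these FICO charts are commented out around this cell; a plain matplotlib fallback for the pay-off rate by FICO score might look like this (sketch; the bar width is a guess at the FICO bin spacing):
In [ ]:
# Matplotlib fallback for the commented-out Bokeh chart.
import matplotlib.pyplot as plt
plt.figure()
plt.bar(meanFico.index, meanFico.Paid, width=4.0)
plt.xlabel('FICO credit score')
plt.ylabel('Pay-off probability')
plt.title('Pay Off versus Default')
plt.show()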
In [3]:
# output_notebook()
# TOOLS = "hover,save,box_zoom,reset"
# bb = cntFico.Paid.iloc[2:]
# newmedals = OrderedDict(bb=bb)
# bar = Bar(newmedals, countries, title="Distribution of Loans",
# width=1500, height=800,tools = TOOLS,
# ylabel='Number of Loans',
# xlabel='FICO credit score')
# source = ColumnDataSource(
# data=dict(
# bb = bb,
# )
# )
# hover = bar.select(dict(type=HoverTool))
# hover.tooltips = OrderedDict([
# ("Loans (#)", "@bb"),
# # ("Default chance", "@silver"),
# ("FICO score", "$x"),
# ])
# show(bar)
In [390]:
type(silver)
Out[390]:
In [331]:
import matplotlib.pyplot as plt
pp = [importance[i][1] for i in xrange(18)]
labels= [importance[i][0] for i in xrange(18)]
# print pp
# This gives me the plot:
# pNYC10 = sns.factorplot(x=np.linspace(0,18,18),y=np.array(pp),aspect=3)
# pNYC10.set_xticklabels(rotation=30)
# pNYC10.set_axis_labels("Water System - NYC, 2015", "Complaints")
fig, ax = plt.subplots()
ax.bar(np.linspace(0,18,18),np.array(pp),width=0.7)
plt.xticks(np.linspace(0,18,18)+0.5, labels)
fig.autofmt_xdate()
plt.title('Importance of Individual Features in Predicting Loans')
plt.ylabel('Importance')
plt.xlim([-0.5,19])
# Pad margins so that markers don't get clipped by the axes
plt.margins(0.2)
# Tweak spacing to prevent clipping of tick-labels
plt.subplots_adjust(bottom=0.15)
plt.show()
In [51]:
from sklearn.linear_model import LogisticRegression
# combine all the models
# Collect each model's predicted failure probability (column 0) on the
# training set; FinalLGR below stacks these into a combined model.
y_cmb = []
for RF in TotRFMOD:
    y_cmb.append(RF.predict_proba(X_train)[:,0])
for RF in TotGBGMOD:
    y_cmb.append(RF.predict_proba(X_train)[:,0])
for RF in TotLGRMOD:
    y_cmb.append(RF.predict_proba(X_train)[:,0])
# Hard predictions from each model, for per-model scoring later.
y_cmb_pred = []
for RF in TotRFMOD:
    y_cmb_pred.append(RF.predict(X_train))
for RF in TotGBGMOD:
    y_cmb_pred.append(RF.predict(X_train))
for RF in TotLGRMOD:
    y_cmb_pred.append(RF.predict(X_train))
In [53]:
#
np.unique(y_train)
# print (np.array(y_cmb)).shape
# print len(y_train)
Out[53]:
In [68]:
FinalLGR = LogisticRegression(class_weight={-1:0.8,1:0.2})
FinalLGR.fit(np.array(y_cmb)[6:,:].T,y_train)
Out[68]:
In [59]:
# Per-model probabilities and predictions on the test set, mirroring the
# training-set cell above (y_cmb_test is needed by FinalLGR below).
y_cmb_test = []
for RF in TotRFMOD:
    y_cmb_test.append(RF.predict_proba(X_test)[:,0])
for RF in TotGBGMOD:
    y_cmb_test.append(RF.predict_proba(X_test)[:,0])
for RF in TotLGRMOD:
    y_cmb_test.append(RF.predict_proba(X_test)[:,0])
y_cmb_test_pred = []
for RF in TotRFMOD:
    y_cmb_test_pred.append(RF.predict(X_test))
for RF in TotGBGMOD:
    y_cmb_test_pred.append(RF.predict(X_test))
for RF in TotLGRMOD:
    y_cmb_test_pred.append(RF.predict(X_test))
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
# ROC curve (stub: chance line only; see the full ROC sketch further below)
plt.figure()
plt.plot([0, 1], [0, 1], '--')
plt.show()
In [ ]:
y_pred_cmb_train = FinalLGR.predict(np.array(y_cmb)[6:,:].T)
y_pred_cmb_test = FinalLGR.predict(np.array(y_cmb_test)[6:,:].T)
print 'Train ', Compute_TPFN(y_pred_cmb_train, y_train)
print 'Test  ', Compute_TPFN(y_pred_cmb_test, y_test)
for i in range(15):
    print 'MODEL ', i, ' ', Compute_TPFN(y_cmb_pred[i], y_train)
In [165]:
# To demonstrate the proof of concept using MODEL - 7
# crt = 0.78
# for i in xrange(len(y_prob_proof)):
# if y_prob_proof[i]>crt and y_prob_proof[i]<crt+0.1:
# break
# print y_truth_proof[i], y_prob_proof[i], y_pred_proof[i]
In [464]:
MODEL_ID = 7
Grade = 31
X_raw_proof = X_train
y_prob_proof = 1- np.array(y_cmb[MODEL_ID])
y_pred_proof = np.array(y_cmb_pred[MODEL_ID])
y_truth_proof = np.array(y_train)
N = 100
crit = np.linspace(0,1,N+1)
ratio = []
cnt = []
for i in crit:
    tmp = y_truth_proof[(y_prob_proof>=i) & (y_prob_proof<=(i+0.1)) & (X_raw_proof[:,15]==Grade)]
    # print i, len(tmp), len(tmp[tmp==1])
    cnt.append(len(tmp[tmp==1]))
    ratio.append(len(tmp[tmp==1])*1.0/(len(tmp)+0.0001))
ratio = np.array(ratio)
cnt = np.array(cnt)
plt.bar(crit,cnt/(sum(cnt)*0.01),width=0.99/(N*1.5))
plt.bar(crit[73],cnt[73]/(sum(cnt)*0.01),width=0.99/(N*1.5), color='orange')
plt.annotate('Your Loan', xy=(crit[73]+0.005, cnt[73]/(sum(cnt)*0.01)), xytext=(crit[73]+0.1, 2),
arrowprops=dict(facecolor='orange', shrink=0.01)
)
font = {'family' : 'serif',
'color' : 'darkred',
'weight' : 'normal',
'size' : 30,
}
plt.text(crit[0]+0.02, 3.10, 'Grade: C1', fontdict=font)
plt.xlabel('Paying off Score')
plt.ylabel('Percent (%)')
plt.title('Distribution of Paying off Scores')
plt.xlim([0,1])
# plt.plot(1-crit,ratio)
plt.show()
In [ ]:
SubRecord = CleanUpRecord[CleanUpRecord.sub_grade=='C1']
SubRecord['Paid'] = SubRecord['stat']
SubRecord.Paid[SubRecord.Paid==-1] =0
In [309]:
StateData=SubRecord.groupby(['addr_state']).mean()
ZipData=SubRecord.groupby(['zip_code']).mean()
StateData.to_csv('StateData.csv', sep=',')
ZipData.to_csv('ZipData.csv', sep=',')
In [ ]:
# http://cdb.io/1R0a6u
In [ ]:
In [ ]:
In [285]:
from bokeh.plotting import figure, show, output_notebook
In [290]:
type(county_xs)
Out[290]:
In [ ]:
# Color ramp for a choropleth sketch; x and y are assumed to hold county-level values.
colors = ["#%02x%02x%02x" % (r, g, 150) for r, g in zip(np.floor(50+2*x), np.floor(30+2*y))]
In [ ]:
In [283]:
sum(cnt)
Out[283]:
In [284]:
print 'Your loan beats %.2f' % (sum(cnt[0:73])*100.0/sum(cnt)), '% of loans with the same grade'
In [242]:
MODEL_ID = 7
Grade = 23
X_raw_proof = X_train
y_prob_proof = np.array(y_cmb[MODEL_ID])
y_pred_proof = 1 - np.array(y_cmb_pred[MODEL_ID])
y_truth_proof = np.array(y_train)
Eg = []
for i in xrange(len(y_truth_proof)):
    if (X_raw_proof[i,15]==Grade) and y_prob_proof[i] > 0.73 and y_prob_proof[i] < 0.74:
        Eg.append(i)
i = Eg[10]
print i, y_pred_proof[i], y_truth_proof[i], X_raw_proof[i], DictVectorT.feature_names_
In [463]:
MODEL_ID = 7
Grade = 31
X_raw_proof = X_train
y_prob_proof = 1-np.array(y_cmb[MODEL_ID])
y_pred_proof = np.array(y_cmb_pred[MODEL_ID])
y_truth_proof = np.array(y_train)
Eg = []
for i in xrange(len(y_truth_proof)):
    if (X_raw_proof[i,15]==Grade) and y_prob_proof[i] > 0.73 and y_prob_proof[i] < 0.74:
        Eg.append(i)
i = Eg[10]
print i, y_pred_proof[i], y_truth_proof[i], X_raw_proof[i], DictVectorT.feature_names_
print len(y_truth_proof[Eg][y_truth_proof[Eg]==1])*1.0/len(Eg)
In [23]:
y_proba_rf = RFfit.predict_proba(X_test)  # needed for the stacking cell below
y_pred_rf = RFfit.predict(X_test)
y_pred_rf_train = RFfit.predict(Xy_train_bal[0][0])
In [451]:
i = 91441
print i,y_pred_proof[i],y_truth_proof[i],X_raw_proof[i],DictVectorT.feature_names_
In [26]:
def Compute_TPFN(y_pred, y_true):
    df = pd.DataFrame(columns=('y_pred','y_true','count'))
    df['y_pred'] = pd.Series(y_pred)
    df['y_true'] = pd.Series(np.array(y_true))
    df['count'] = 1
    return df.groupby(['y_pred','y_true']).count()
# print Compute_TPFN(y_pred_train,y_train)
print Compute_TPFN(y_pred_rf, y_test)
print Compute_TPFN(y_pred_rf_train, Xy_train_bal[0][1])
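As a cross-check, scikit-learn's built-in confusion matrix should reproduce the same counts:
In [ ]:
# Cross-check of Compute_TPFN: rows are true labels (-1, 1),
# columns are predicted labels.
from sklearn.metrics import confusion_matrix
print confusion_matrix(y_test, y_pred_rf, labels=[-1, 1])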
In [152]:
Out[152]:
In [159]:
y_proba = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)
print Compute_TPFN(y_pred, y_test)
In [155]:
# RFfit =RandomForestClassifier(n_estimators=300)
# RFfit.fit(X_train,y_train,sample_weight=np.array(y_train))
Out[155]:
In [162]:
ytrain_proba_rf= RFfit.predict_proba(X_train)
ytrain_proba= clf.predict_proba(X_train)
In [51]:
y_proba= RFfit.predict_proba(X_vect)
# y_pred = RFfit.predict(X_vect)
# print Compute_TPFN(y_pred,y_test)
In [166]:
from sklearn.linear_model import LogisticRegression
cmb = np.array([y_proba_rf[:,0],y_proba[:,0]]).T
LRclf = LogisticRegression()
LRclf.fit(cmb,y_test)
Out[166]:
In [167]:
y_pred_cmb = LRclf.predict(cmb)
In [170]:
print Compute_TPFN(y_pred_cmb,y_test)
In [ ]:
In [61]:
for i in xrange(len(y_proba)):
    if y_proba[i,0] > 0.7 and y_proba[i,0] < 0.73:
        break
In [223]:
LoanGrade = 'C1'
PayOff_Prob = []
PayOff_True = []
for i in xrange(len(y_proba)):
    if X.sub_grade.iloc[i] == LoanGrade:
        PayOff_Prob.append(y_proba[i,0])
        PayOff_True.append(y_test.iloc[i] == 1)
In [224]:
P = np.linspace(0,1,10)
PayOff = pd.DataFrame()
PayOff_True = np.array(PayOff_True)
PayOff_Prob = np.array(PayOff_Prob)
C = []
for p in P:
    C.append(sum(PayOff_True[(PayOff_Prob>=p) & (PayOff_Prob<=0.1+p)])*1.0
             /(len(PayOff_True[(PayOff_Prob>=p) & (PayOff_Prob<=0.1+p)])+1))
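The loop above is a hand-rolled calibration curve; scikit-learn has a helper for the same computation (assuming a version that ships sklearn.calibration, roughly 0.16+):
In [ ]:
# Same idea via sklearn (assumption: sklearn.calibration is available):
# fraction of positives per predicted-probability bin.
from sklearn.calibration import calibration_curve
frac_pos, mean_pred = calibration_curve(PayOff_True, PayOff_Prob, n_bins=10)
print frac_pos
print mean_pred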
In [228]:
Baseline_P = sum(PayOff_True)*1.0/len(PayOff_True)
# Let us build a benchmark (left unfinished):
# Bench = []
# for i in PayOff_Prob:
#     ...
Out[228]:
In [227]:
Out[227]:
In [175]:
# Histogram of the pay-off probabilities
n = 100
a, b = np.histogram(PayOff_Prob, bins=n)
plt.bar(np.arange(len(b)-1)*1.0/n, a*1.0/a.sum() * 100, width=0.99/n)
plt.xlim(0, 1)
plt.ylabel("Distribution (%)")
plt.xlabel("Probability")
plt.title('Pay Off Probability')
plt.show()
In [124]:
b
Out[124]:
In [110]:
# np.unique(PayOff_Prob)
import matplotlib.pyplot as plt
# plt.plot((a*1.0/a.sum()))
plt.hist(PayOff_Prob,bins=1000)
plt.show()
In [70]:
print X.iloc[i]
print y.iloc[i]
In [ ]:
In [ ]:
In [ ]:
In [44]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
rfpipeline = Pipeline([('List2DictT', List2Dict_Transformer()),
                       ('DictVectorT', DictVectorizer()),
                       ('RFfit', RandomForestClassifier())])
rfpipeline.set_params(List2DictT__str_keys=['home_ownership'],
                      DictVectorT__sparse=False,
                      RFfit__n_estimators=300,
                      RFfit__oob_score=True,
                      # RFfit__sample_weight=y_train,
                      )
rfpipeline.fit(X_train, y_train)
Out[44]:
In [14]:
y_proba_test = rfpipeline.predict_proba(X_test)
y_proba_train = rfpipeline.predict_proba(X_train)
# y_pred_test = rfpipeline.predict(X_test)
# y_pred_train = rfpipeline.predict(X_train)
In [22]:
y_pred_test = rfpipeline.predict(X_test)
y_pred_train = rfpipeline.predict(X_train)
In [23]:
# Let us do a ROC:
def Proba2Pred(y_proba, cutoff=0.5):
    y_pred = np.ones(len(y_proba))
    y_pred[y_proba[:,0] > cutoff] = -1
    return y_pred
def Compute_TPFN(y_pred, y_true):
    df = pd.DataFrame(columns=('y_pred','y_true','count'))
    df['y_pred'] = pd.Series(y_pred)
    df['y_true'] = pd.Series(np.array(y_true))
    df['count'] = 1
    return df.groupby(['y_pred','y_true']).count()
print Compute_TPFN(y_pred_train, y_train)
print Compute_TPFN(y_pred_test, y_test)
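Sweeping the cutoff in Proba2Pred would trace out an ROC curve point by point; scikit-learn's roc_curve does the sweep directly. A sketch on the pipeline's test probabilities (assuming column 1 of predict_proba is the +1 class, the default ordering for classes [-1, 1]):
In [ ]:
# ROC sketch for the pipeline's test-set probabilities.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_proba_test[:, 1], pos_label=1)
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC (AUC = %.3f)' % auc(fpr, tpr))
plt.show()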
In [18]:
Out[18]:
In [ ]:
In [ ]:
importance=[(tot_feat_vec.columns[i],clf2.feature_importances_[i]) for i in xrange(len(clf2.feature_importances_))]
importance = sorted(importance,key = lambda tup:tup[1],reverse=True)
print importance
In [158]:
# print y_pred_test[0],y_proba_test[0],len(y_pred_test)
In [ ]:
# Something is wrong here: the pipeline shows no predictive power.
# Likely a bug in the pipeline configuration below.
# from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction import DictVectorizer
# rfpipeline = Pipeline([('List2DictT',List2Dict_Transformer()),
# ('DictVectorT',DictVectorizer()),
# ('RFfit',RandomForestClassifier())])
# rfpipeline.set_params(List2DictT__str_keys=['home_ownership','sub_grade'],
# DictVectorT__sparse=False,
# RFfit__n_estimators=1000,
# RFfit__oob_score=True)
# rfpipeline.fit(X_train,y_train)
In [234]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
In [235]:
List2Dict_T =List2Dict_Transformer(str_keys=['home_ownership','sub_grade' ])
List2Dict_T.fit(df[relevant])
X_dict= List2Dict_T.transform(df[relevant])
from sklearn.feature_extraction import DictVectorizer
DVect = DictVectorizer()
DVect.fit(X_dict)
X_vect = DVect.transform(X_dict)
In [212]:
DVect.feature_names_
Out[212]:
In [220]:
X_train, X_test, y_train, y_test = train_test_split(X_vect.toarray(), y, test_size=0.33)
In [229]:
len(X)
Out[229]:
In [221]:
#%%
# Random forest classifier
clf2 = RandomForestClassifier(n_estimators=100,  # number of trees in the forest
                              criterion='gini', max_depth=None,
                              min_samples_split=2, min_samples_leaf=1,
                              min_weight_fraction_leaf=0.0, max_features='auto',
                              max_leaf_nodes=None, bootstrap=True, oob_score=True,
                              n_jobs=1, random_state=None, verbose=0,
                              warm_start=False, class_weight=None)
# Fit the model on the training split
clf2.fit(X_train, y_train)
# clf2.fit(X_train, y_train)
# clf2.fit(X_train, y_train==1)
# import dill
# with open('./rf_lendingclub_p.pkl', 'wb') as out_strm:
# dill.dump(clf2,out_strm)
Out[221]:
In [222]:
y_proba_test = clf2.predict_proba(X_test)
y_proba_train = clf2.predict_proba(X_train)
y_pred_test = clf2.predict(X_test)
y_pred_train = clf2.predict(X_train)
In [ ]:
y_pred_test = clf2.predict(X_test)
In [ ]:
In [123]:
# y_prob=(rfpipeline.predict_proba(X_test))
# List2Dict_T =List2Dict_Transformer(str_keys=['home_ownership','sub_grade' ])
# List2Dict_T.fit(df[relevant])
# X_dict= List2Dict_T.transform(df[relevant])
# from sklearn.feature_extraction import DictVectorizer
# DVect = DictVectorizer()
# DVect.fit(X_dict)
# X_vect = DVect.transform(X_dict)
In [148]:
# Single-record example:
# check whether pandas will build a one-row DataFrame from a dict
RequiredDict = {'sub_grade':'Grade assigned by Lending Club',
'term':'Term',
'fico_range_low': 'Fico Score (Low)',
'fico_range_high':'Fico Score (High)',
'loan_amnt':'Total Loan Amount',
'annual_inc': 'Annual Income',
'dti':'Debt to Income ratio',
'emp_length':'Employment length',
'open_acc':'Number of credit lines currently open',
'total_acc':'Total credit lines',
'pub_rec':'Public record',
'policy_code':'Policy code',
'home_ownership':'House Ownership'}
record = pd.DataFrame(RequiredDict,index=[0])
Out[148]:
In [13]:
pd.unique(tot_feat.emp_length)
tot_feat_vec = pd.get_dummies(tot_feat)
print tot_feat_vec.columns
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [14]:
# Explore what is missing. Most of it is employment info,
# which likely corresponds to unemployed borrowers.
# So let us fill it with -1.
#######################
# keys =df.columns
# miss = []
# for key in keys:
# if 'n/a' in pd.unique(df[key]):
# miss.append(key)
# print miss
# # ['emp_title', 'emp_length', 'desc', 'title']
In [15]:
# tot_feat.emp_length[np.isnan(tot_feat.emp_length.astype('float'))] = (tot_feat.emp_length[~np.isnan(tot_feat.emp_length.astype('float'))]).median()
# tot_feat.emp_length = pd.Series(tot_feat.emp_length.astype('int'))
In [16]:
# So far NaNs are filled with the global median, which can be improved:
# use the median of the relevant subgroup instead, dropping irrelevant information.
# To do so, let me explore what has nan.
## Explore, which field has nan.
# keys =tot_feat.columns
# miss = []
# for key in keys:
# print (pd.unique(tot_feat[key]))
# ONLY EMPLOYMENT AND DESC have NAN
In [ ]:
In [17]:
tot_feat_vec.columns
Out[17]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [18]:
X = tot_feat_vec
y = tot_labl
y[y == -1] =0.0
In [19]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
In [ ]:
In [ ]:
In [21]:
importance=[(tot_feat_vec.columns[i],clf2.feature_importances_[i]) for i in xrange(len(clf2.feature_importances_))]
importance = sorted(importance,key = lambda tup:tup[1],reverse=True)
print importance
In [35]:
tot_feat_vec.shape
Out[35]:
In [158]:
y_pred = clf2.predict(tot_feat_vec)
In [165]:
tot_labl[tot_labl == -1] =0
Hits = sum(tot_labl[y_pred==1])
FalseAlarm = len(tot_labl[y_pred==1]) - Hits
Miss = sum(tot_labl[y_pred == 0])
CorrectRejection = len(tot_labl[y_pred == 0]) - Miss
In [13]:
# get the probability from the random forest
import dill
with open('./rf_lendingclub_p.pkl', 'rb') as in_strm:
    clf2 = dill.load(in_strm)
In [42]:
ap=clf2.predict_proba(tot_feat_vec)
In [44]:
import matplotlib.pyplot as plt
plt.hist(ap[:,1])
plt.show()
In [ ]:
In [205]:
LabelB = full.stat[full.grade=='B']
y_pred = np.array(y_pred)
len(y_predB),len(y_pred)
Out[205]:
In [239]:
# tot_labl[tot_labl == -1] =0
Grader = 'A'
LabelB = full.stat[full['grade']==Grader]
y_predB = y_pred[np.array(full.grade==Grader)]
money = full.loan_amnt[full.grade==Grader]
#
LabelB[LabelB==-1] = 0
Hits = sum(LabelB[y_predB==1])
FalseAlarm = len(LabelB[y_predB==1]) - Hits
Miss = sum(LabelB[y_predB == 0])
CorrectRejection = len(LabelB[y_predB == 0]) - Miss
print 'False alarm, should NOT have FUNDED -------',FalseAlarm*100.0/(Hits + FalseAlarm)
print 'Missing, should have FUNDED ---------------',Miss*100.0/(CorrectRejection + Miss)
print 'Default rate ------------------------------',(FalseAlarm + CorrectRejection)*100 / len(LabelB)
HitsMoney = sum(LabelB[y_predB==1]*money[y_predB==1])
FalseAlarmMoney = sum(money[y_predB==1]) - HitsMoney
MissMoney = sum(LabelB[y_predB == 0]*money[y_predB==0])
CorrectRejectionMoney = sum(money[y_predB == 0]) - MissMoney
print 'False alarm, should NOT have FUNDED -------',FalseAlarmMoney*100.0/(HitsMoney + FalseAlarmMoney)
print 'Missing, should have FUNDED ---------------',MissMoney*100.0/(CorrectRejectionMoney + MissMoney)
print 'Default rate ------------------------------',(FalseAlarmMoney + CorrectRejectionMoney)*100 / sum(money)
In [229]:
HitsMoney = sum(LabelB[y_predB==1]*money[y_predB==1])
FalseAlarmMoney = sum(money[y_predB==1]) - HitsMoney
MissMoney = sum(LabelB[y_predB == 0]*money[y_predB==0])
CorrectRejectionMoney = sum(money[y_predB == 0]) - MissMoney
print 'False alarm, should NOT have FUNDED -------',FalseAlarmMoney*100.0/(HitsMoney + FalseAlarmMoney)
print 'Missing, should have FUNDED ---------------',MissMoney*100.0/(CorrectRejectionMoney + MissMoney)
print 'Default rate ------------------------------',(FalseAlarmMoney + CorrectRejectionMoney)*100 / sum(money)
In [252]:
import matplotlib.pyplot as plt
plt.figure()
# plt.subplot(221)
plt.plot(clf2.feature_importances_)
# # plt.show()
# plt.subplot(222)
# plt.plot(clf2.feature_importances_)
plt.show()
In [131]:
y_pred=clf2.predict(X_test)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [89]:
cnt = 0
baseline = 0
for i in xrange(len(y_test)):
    if y_test.iloc[i] == 1:
        baseline += 1
        if y_pred[i]:
            cnt += 1
    elif y_test.iloc[i] == -1 and not y_pred[i]:
        cnt += 1
# test_pred_err = clf2.oob_decision_function_
# test_pred = np.floor(test_pred_err[:,1] *2)
# test_pred[test_pred>=2] = 1
# print sum(abs(test_pred-train_labl))*100.0/len(train_labl)
In [91]:
Out[91]:
In [ ]:
In [88]:
cnt*1.0 / len(y_test)*100
Out[88]:
In [242]:
#%%
tot = train_feat
tot['pred'] = test_pred
tot['actu'] = labl
#%%
bin_edges[-1] += 1
rate = np.zeros(hist.shape)
payrate = np.zeros(hist.shape)
prate = np.zeros(hist.shape)
for i in range(len(bin_edges)-1):
    left_edge = bin_edges[i]
    right_edge = bin_edges[i+1]
    incid = df.stat[(df['loan_amnt'] >= left_edge) & (df['loan_amnt'] < right_edge)]
    amnt = df.loan_amnt[(df['loan_amnt'] >= left_edge) & (df['loan_amnt'] < right_edge)]
    rate[i] = len(incid[incid==1])*1.0 / (len(incid[incid==1]) + len(incid[incid==-1]))
    payrate[i] = sum(amnt[incid==1]) + sum(amnt[incid==-1])
In [241]:
#%%
for i in range(len(bin_edges)-1):
    left_edge = bin_edges[i]
    right_edge = bin_edges[i+1]
    incid = tot[(tot['loan_amnt'] >= left_edge) & (tot['loan_amnt'] < right_edge)]
    rate[i] = len(incid[incid.actu == 1])*100.0/len(incid)
    prate[i] = len(incid[(incid.actu - incid.pred)==0])*100.0/len(incid)
Of the loans funded on Lending Club, about 15 percent are failures and 85 percent are successful. For loans of large amounts, the failure rate is particularly high.
In [240]:
N = 10
successMeans = rate
failureMeans = 100 - rate
ind = np.arange(N)   # the x locations for the groups
width = 0.35         # the width of the bars: can also be len(x) sequence
# plt.subplot(211)
plt.figure()
p1 = plt.bar(ind, successMeans, width, color='r', label='Success loans')
p2 = plt.bar(ind, failureMeans, width, color='y', label='Failure loans (charged off/default)',
             bottom=successMeans)
plt.ylabel('Percentage')
# plt.title('successful loans versus failed loans')
plt.xticks(np.asarray([0,5,9])+width/2., ('5000','20000','35000') )
#plt.yticks(np.arange(0,81,10))
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
ncol=2, mode="expand", borderaxespad=0.)
plt.xlabel('Loan Size ($)')
plt.show()
The goal of my project is to predict whether a loan will be successful. To do that, I downloaded data from Lending Club and selected several features to build a prediction model, including both numerical features, such as annual income and FICO credit score, and categorical features, such as the loan grade assigned by Lending Club.
Using these features, I built a random forest model to predict whether a loan will be successful. The idea is to build a large number of decision trees, each of which generates a prediction, and to take the average of these per-tree predictions as the final prediction. Compared to Lending Club's own strategy, the strategy based on my random forest model significantly decreases the chance of funding a failing loan, an improvement of 40%.
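To make the averaging idea concrete, here is a small illustrative sketch (on synthetic data, not the loan data) showing that a forest's predicted probability is the mean of its individual trees' probabilities:
In [ ]:
# Illustration on synthetic data: predict_proba of the forest equals
# the average of predict_proba over its trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X_demo = np.random.randn(200, 4)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
forest = RandomForestClassifier(n_estimators=50).fit(X_demo, y_demo)
tree_probs = np.mean([t.predict_proba(X_demo)[:, 1] for t in forest.estimators_], axis=0)
print np.allclose(tree_probs, forest.predict_proba(X_demo)[:, 1])  # True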
In [221]:
# for i in range(len(bin_edges)-1):
# left_edge = bin_edges[i]
# right_edge = bin_edges[i+1]
# incid = tot[(tot['loan_amnt'] >= left_edge) & (df['loan_amnt'] < right_edge)]
# rate[i] = len(incid[incid.actu == 1])*100.0/len(incid)
# prate[i] = len(incid[(incid.actu- incid.pred)==0])*100.0/len(incid)
# pold = 100-len(tot[tot.actu==1])*100.0/len(tot)
# pnew = 100-len(tot[(tot.actu-tot.pred)==0])*100.0/len(tot)
# print pold,pnew
# plt.figure()
# #plt.subplot(211)
# p1=plt.plot(bin_edges[:-1],100-rate,'b')
# p2=plt.plot(bin_edges[:-1],100-prate,'r')
# plt.xlabel('Loan Size ($)')
# plt.ylabel('Percent of failured loans')
# plt.legend((p1[0],p2[0]),('Strategy by LendingClub', 'Strategy by my random forest model'))
# plt.title('Improved by %.2f %%' % ((pold-pnew)*100.0/pold))
# plt.show()
Moving forward, I would like to incorporate more features to further improve the model. One feature I am particularly interested in is the free-text description of the loan provided by the borrower, which contains many details about the borrowers. I would also like to build a user interface that recommends whether or not to fund a particular loan.
In [ ]:
In [3]:
x = ['a','b','c']
','.join(x)
Out[3]:
In [ ]: