For this project we will explore publicly available data from LendingClub.com. LendingClub connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to fund borrowers whose profiles suggest a high probability of paying you back. We will try to create a model that helps predict this.
LendingClub had a very interesting year in 2016, so keep that context in mind as we look at their data. This data is from before they even went public.
We will use lending data from 2007-2010 and try to classify and predict whether or not a borrower paid back their loan in full. You can download the data from LendingClub or just use the csv already provided. The provided csv is recommended, as it has been cleaned of NA values.
Here is what the key columns represent:
credit.policy: 1 if the customer meets LendingClub.com's credit underwriting criteria, and 0 otherwise.
purpose: the purpose of the loan, such as "credit_card", "debt_consolidation", or "small_business".
int.rate: the interest rate of the loan as a proportion (a rate of 11% is stored as 0.11).
fico: the FICO credit score of the borrower.
not.fully.paid: 1 if the loan was not paid back in full, and 0 otherwise; this is our target.
The remaining columns describe the borrower's financial profile, as you'll see from loans.info() below.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
loans = pd.read_csv('loan_data.csv')
Check out the info(), head(), and describe() methods on loans.
In [3]:
loans.info()
In [4]:
loans.describe()
Out[4]:
In [5]:
loans.head()
Out[5]:
Let's do some data visualization! We'll use seaborn and pandas' built-in plotting capabilities, but feel free to use whatever library you want. Don't worry about the colors matching; just worry about getting the main idea of the plot.
Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.
Note: this is pretty tricky; feel free to reference the solutions. You'll probably need one line of code for each histogram, and I recommend just using pandas' built-in .hist().
In [6]:
plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',
                                              bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',
                                              bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')
Out[6]:
Create a similar figure, except this time select by the not.fully.paid column.
In [7]:
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',
                                               bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',
                                               bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')
Out[7]:
Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.
In [8]:
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=loans,palette='Set1')
Out[8]:
Let's see the trend between FICO score and interest rate. Recreate the following jointplot.
In [9]:
sns.jointplot(x='fico',y='int.rate',data=loans,color='purple')
Out[9]:
Create the following lmplots to see if the trend differed between not.fully.paid and credit.policy. Check the documentation for lmplot() if you can't figure out how to separate it into columns.
In [10]:
# lmplot creates its own figure, so a preceding plt.figure() call would
# just open an extra empty figure; lmplot handles its own sizing.
sns.lmplot(y='int.rate',x='fico',data=loans,hue='credit.policy',
           col='not.fully.paid',palette='Set1')
Out[10]:
In [12]:
loans.info()
Notice that the purpose column is categorical.
That means we need to transform it using dummy variables so sklearn will be able to understand it. Let's do this in one clean step using pd.get_dummies.
Let's show you a way of dealing with such columns that can be expanded to multiple categorical features if necessary (see the sketch after final_data.info() below).
Create a list of 1 element containing the string 'purpose'. Call this list cat_feats.
In [36]:
cat_feats = ['purpose']
Now use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a new, larger dataframe that has dummy-variable feature columns in place of purpose. Set this dataframe as final_data.
In [37]:
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
In [38]:
final_data.info()
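As promised, this pattern extends to multiple categorical features: list every categorical column in cat_feats and pd.get_dummies encodes them all in one call. A minimal sketch, assuming the data also had a second categorical column named 'term' (hypothetical; it is not in this dataset):
# 'term' is hypothetical here and only illustrates the multi-column pattern
cat_feats = ['purpose','term']
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)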
In [20]:
from sklearn.model_selection import train_test_split
In [21]:
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
In [22]:
from sklearn.tree import DecisionTreeClassifier
Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.
In [23]:
dtree = DecisionTreeClassifier()
In [24]:
dtree.fit(X_train,y_train)
Out[24]:
In [25]:
predictions = dtree.predict(X_test)
In [26]:
from sklearn.metrics import classification_report,confusion_matrix
In [27]:
print(classification_report(y_test,predictions))
In [28]:
print(confusion_matrix(y_test,predictions))
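A quick aside on reading this output: scikit-learn's confusion_matrix puts the true labels on the rows and the predicted labels on the columns, in sorted label order. If that's hard to keep straight, wrapping it in a labeled DataFrame helps; the row and column names below are just illustrative:
# Rows = actual labels, columns = predicted labels (order: 0, then 1)
cm = confusion_matrix(y_test,predictions)
print(pd.DataFrame(cm,index=['actual 0 (paid)','actual 1 (not fully paid)'],
                   columns=['predicted 0','predicted 1']))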
In [29]:
from sklearn.ensemble import RandomForestClassifier
In [30]:
rfc = RandomForestClassifier(n_estimators=600)
In [31]:
rfc.fit(X_train,y_train)
Out[31]:
In [32]:
predictions = rfc.predict(X_test)
Now create a classification report from the results. Do you get anything strange or some sort of warning? (Hint: if the forest predicts no samples at all for class 1, scikit-learn warns that precision and F-score are ill-defined for that class.)
In [33]:
from sklearn.metrics import classification_report,confusion_matrix
In [34]:
print(classification_report(y_test,predictions))
Show the Confusion Matrix for the predictions.
In [35]:
print(confusion_matrix(y_test,predictions))
What performed better, the random forest or the decision tree?
In [36]:
# It depends on which metric you are trying to optimize for.
# Notice the recall for each class for both models.
# Neither did very well; more feature engineering is needed.
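To make that recall comparison concrete, here is a minimal sketch, assuming dtree and rfc are still fitted from the cells above. Note that the predictions variable was overwritten by the random forest's output, so we recompute predictions for both models:
from sklearn.metrics import recall_score
# average=None returns one recall value per class, ordered [class 0, class 1]
dtree_preds = dtree.predict(X_test)
rfc_preds = rfc.predict(X_test)
print('Decision tree recall per class:',recall_score(y_test,dtree_preds,average=None))
print('Random forest recall per class:',recall_score(y_test,rfc_preds,average=None))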