Lending Club


In [9]:
# dependencies load
%matplotlib inline
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# set seaborn style
sns.set_style("ticks")

Data Loading


In [10]:
# data loading
data_raw = pd.read_csv("loan.csv")

After loading the data we can have a look at its first several rows.


In [ ]:
# data exploration
print(data_raw.shape)
data_raw.head(n = 5)


(887379, 74)
Out[ ]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 1077501 1296599 5000.0 5000.0 4975.0 36 months 10.65 162.87 B B2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1077430 1314167 2500.0 2500.0 2500.0 60 months 15.27 59.83 C C4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1077175 1313524 2400.0 2400.0 2400.0 36 months 15.96 84.33 C C5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1076863 1277178 10000.0 10000.0 10000.0 36 months 13.49 339.31 C C1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1075358 1311748 3000.0 3000.0 3000.0 60 months 12.69 67.79 B B5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 74 columns

We can already see that some of the data is missing. To inspect this we can use the handy missingno package.


In [ ]:
# missing data visualisation
msno.matrix(data_raw)

It appears that most of the numeric data is available, while the categorical columns have quite a few missing values. For the purposes of this analysis we will therefore focus mostly on the numerical values. We are ready to proceed to the next step.
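
The visual impression from the matrix plot can be cross-checked with numbers. The following is a minimal sketch, not part of the original analysis: the helper name missing_share_by_kind is hypothetical, and the toy frame (with made-up values) only stands in for data_raw.

```python
import numpy as np
import pandas as pd

def missing_share_by_kind(df):
    """Average fraction of missing values for numeric vs. non-numeric columns.

    Hypothetical helper for illustration only.
    """
    share = df.isna().mean()  # fraction of missing values per column
    numeric_cols = df.select_dtypes(include="number").columns
    other_cols = df.columns.difference(numeric_cols)
    return pd.Series({
        "numeric": share[numeric_cols].mean(),
        "non_numeric": share[other_cols].mean(),
    })

# toy frame standing in for data_raw (values are made up)
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, np.nan, 10000.0],
    "int_rate": [10.65, 15.27, 15.96, 13.49],
    "desc": ["car", None, None, None],
})

summary = missing_share_by_kind(toy)
print(summary)
```

On the real data the same call would be missing_share_by_kind(data_raw). Looking at data_raw.isna().mean().sort_values() as well would show which individual columns drive the result.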

Exploratory Data Analysis

One initial exploration would be to see if there is a difference in interest rates between the two terms, 36 and 60 months. For this we can create a box plot.


In [ ]:
# boxplot of int_rate
sns.boxplot(x = data_raw.term, y = data_raw.int_rate)

Visually it does seem as if there is a difference, but to make sure we should run a statistical test. Since we are comparing the mean interest rate of two independent groups, a two-sample t-test is appropriate. First we should do some data wrangling to get our data in order. To speed up the analysis we will look at a 5% random sample of the data.


In [ ]:
# small t-test
# subset some rows (fixed seed so the sample is reproducible)
data_sample = data_raw[['term','int_rate']].sample(frac = 0.05, random_state = 42)

print(data_sample.shape)
data_sample.head()

And now we can run the test.


In [ ]:
term36 = data_sample['int_rate'][data_sample['term'].str.contains("36 months")]
term60 = data_sample['int_rate'][data_sample['term'].str.contains("60 months")]

stats.ttest_ind(term36, term60)
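
By default stats.ttest_ind assumes that both groups have equal variances, and a missing value in either group makes the result NaN. As a hedged alternative, the sketch below runs Welch's t-test (equal_var=False) after dropping missing rows and stripping whitespace from term. The helper name welch_by_term is hypothetical, and the toy data is simulated rather than taken from loan.csv.

```python
import numpy as np
import pandas as pd
from scipy import stats

def welch_by_term(df):
    """Welch's t-test on int_rate between 36- and 60-month loans.

    Hypothetical helper for illustration only.
    """
    clean = df[["term", "int_rate"]].dropna()
    term = clean["term"].str.strip()  # guard against stray whitespace
    rate36 = clean.loc[term == "36 months", "int_rate"]
    rate60 = clean.loc[term == "60 months", "int_rate"]
    # equal_var=False selects Welch's test (no equal-variance assumption)
    return stats.ttest_ind(rate36, rate60, equal_var=False)

# simulated data standing in for data_sample (not real loan data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "term": [" 36 months"] * 200 + [" 60 months"] * 200,
    "int_rate": np.concatenate([
        rng.normal(12.0, 3.0, 200),
        rng.normal(16.0, 4.0, 200),
    ]),
})

result = welch_by_term(toy)
print(result.statistic, result.pvalue)
```

On the real sample the call would be welch_by_term(data_sample). A p-value below the chosen significance level would indicate that the mean interest rates of the two terms differ.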

The t-test result supports our hypothesis that interest rates differ between the two terms. Next we can see whether the interest rate (int_rate) differs among the grade groups (grade). Again we can use boxplots to have a look.


In [ ]:
sns.boxplot(x = data_raw.grade, y = data_raw.int_rate)

It looks like loans with grade A have the lowest interest rates, while grade G loans have the highest (there seem to be quite a few outliers, so we have to be careful when drawing conclusions from these data alone).
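
Because of the outliers, a per-grade summary based on the median can complement the boxplot. This is a small sketch under assumptions: rate_summary_by_grade is a hypothetical helper, and the toy frame uses made-up values in place of data_raw.

```python
import pandas as pd

def rate_summary_by_grade(df):
    """Count, median, min and max of int_rate for each grade.

    Hypothetical helper for illustration only.
    """
    return df.groupby("grade")["int_rate"].agg(["count", "median", "min", "max"])

# toy frame standing in for data_raw (values are made up)
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "G"],
    "int_rate": [6.0, 8.0, 10.65, 12.69, 25.0],
})

summary = rate_summary_by_grade(toy)
print(summary)
```

The median is less sensitive to extreme values than the mean, so it is a safer basis for comparing grades when outliers are present.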

Now let's look at histograms of the requested loan amounts and of the amounts that were actually funded.


In [ ]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data_raw['loan_amnt'], kde = True);

In [ ]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data_raw['funded_amnt'], kde = True)

And finally we can examine the relationship between the two (we expect a linear one).


In [ ]:
sns.jointplot(x = 'loan_amnt', y = 'funded_amnt', data = data_raw)

And indeed the relationship is linear. We could use this plot to spot outliers, that is, loans where much less was funded than requested.
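
Such outliers can also be listed directly instead of being read off the plot. The sketch below is one possible approach, not the original author's method: the helper funding_shortfalls and its 0.5 threshold are assumptions chosen for illustration, and the toy frame contains made-up values.

```python
import pandas as pd

def funding_shortfalls(df, min_ratio=0.5):
    """Rows where funded_amnt / loan_amnt falls below min_ratio.

    Hypothetical helper; the 0.5 default is an arbitrary illustrative cut-off.
    """
    out = df.assign(funded_ratio=df["funded_amnt"] / df["loan_amnt"])
    # keep only clear shortfalls, most severe first
    return out[out["funded_ratio"] < min_ratio].sort_values("funded_ratio")

# toy frame standing in for data_raw (values are made up)
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, 8000.0, 20000.0],
    "funded_amnt": [5000.0, 4000.0, 8000.0, 2000.0],
})

shortfalls = funding_shortfalls(toy)
print(shortfalls)
```

On the real data the call would be funding_shortfalls(data_raw), with min_ratio adjusted to whatever counts as "much less" for the analysis at hand.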