Lending Club



In [9]:

    
# dependencies load
%matplotlib inline
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# set seaborn style
sns.set_style("ticks")

Data Loading



In [10]:

    
# data loading
data_raw = pd.read_csv("loan.csv")

After loading the data we can have a look at its first several rows.



In [ ]:

    
# data exploration
print(data_raw.shape)
data_raw.head(n = 5)









    



(887379, 74)






    Out[ ]:

	id	member_id	loan_amnt	funded_amnt	funded_amnt_inv	term	int_rate	installment	grade	sub_grade	...	total_bal_il	il_util	open_rv_12m	open_rv_24m	max_bal_bc	all_util	total_rev_hi_lim	inq_fi	total_cu_tl	inq_last_12m
0	1077501	1296599	5000.0	5000.0	4975.0	36 months	10.65	162.87	B	B2	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1077430	1314167	2500.0	2500.0	2500.0	60 months	15.27	59.83	C	C4	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1077175	1313524	2400.0	2400.0	2400.0	36 months	15.96	84.33	C	C5	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1076863	1277178	10000.0	10000.0	10000.0	36 months	13.49	339.31	C	C1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1075358	1311748	3000.0	3000.0	3000.0	60 months	12.69	67.79	B	B5	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 74 columns

We can already see that some of the data is missing. To visualise where the gaps are we can use the handy missingno package.



In [ ]:

    
# missing data visualisation
msno.matrix(data_raw)

It appears that most of the numeric data is available, while the categorical columns have quite a few missing values. For the purposes of this analysis we will therefore focus mostly on the numerical values. We are ready to proceed to the next step.
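The matrix plot gives a visual impression only, so it can help to put numbers on it. The following is a minimal sketch, assuming we just want the share of missing values per column; the helper name missing_fraction and the toy frame are hypothetical stand-ins for data_raw, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Toy frame standing in for data_raw (values are invented for illustration)
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
    "il_util": [np.nan, np.nan, np.nan, 55.0],
    "grade": ["B", "C", None, "C"],
})

frac = missing_fraction(toy)
print(frac)
```

On the real data the same call would be missing_fraction(data_raw), which could then be used to decide which columns to keep.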

Exploratory Data Analysis

One initial exploration would be to see if there is a difference in interest rates between the two terms, 36 and 60 months. For this we can create a box plot.



In [ ]:

    
# boxplot of int_rate
sns.boxplot(x = data_raw.term, y = data_raw.int_rate)

Visually it does seem as if there is a difference, but to make sure we should run a statistical test. Since we are comparing the mean interest rate of two groups, a t-test is appropriate. First we should do some data wrangling to get our data in order. To speed up the analysis we will look at a 5% sample of the data.



In [ ]:

    
# small t-test
# subset some rows
data_sample = data_raw[['term','int_rate']].sample(frac = 0.05)

print(data_sample.shape)
data_sample.head()

And now we can run the test.



In [ ]:

    
# split the sampled interest rates by loan term
term36 = data_sample['int_rate'][data_sample['term'].str.contains("36 months")]
term60 = data_sample['int_rate'][data_sample['term'].str.contains("60 months")]

stats.ttest_ind(term36, term60)
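By default stats.ttest_ind assumes that both groups have the same variance. If we are not sure about that, one option is Welch's variant of the test, requested with equal_var=False. This is a sketch on synthetic data, not the Lending Club sample: the arrays group_a and group_b and their parameters are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for term36 and term60 (not real Lending Club data)
group_a = rng.normal(loc=12.0, scale=3.0, size=3000)
group_b = rng.normal(loc=16.0, scale=4.5, size=1300)

# equal_var=False requests Welch's t-test, which does not assume equal variances;
# nan_policy="omit" skips missing values instead of returning NaN
result = stats.ttest_ind(group_a, group_b, equal_var=False, nan_policy="omit")
print(result.statistic, result.pvalue)
```

With the notebook's variables the equivalent call would be stats.ttest_ind(term36, term60, equal_var=False).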

The t-test result confirms our hypothesis that interest rates do differ between the two terms. Next we can check whether the interest rate (int_rate) differs among the grade groups (grade). Again we can use box plots to have a look.



In [ ]:

    
sns.boxplot(x = data_raw.grade, y = data_raw.int_rate)

It looks like loans with grade A have the lowest interest rates, while grade G loans have the highest (there seem to be quite a few outliers, so we have to be careful when drawing conclusions from these data alone).
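Because of the outliers, it can be useful to back up the box plot with a per-grade summary that relies on the median rather than the mean. The sketch below uses a small made-up frame in place of data_raw, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy frame standing in for data_raw[['grade', 'int_rate']] (invented values)
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "G", "G"],
    "int_rate": [6.0, 8.0, 10.0, 12.0, 24.0, 26.0],
})

# count, median and range of the interest rate within each grade
summary = toy.groupby("grade")["int_rate"].agg(["count", "median", "min", "max"])
print(summary)
```

On the real data the same expression can be applied by replacing toy with data_raw.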

Now let's look at a histogram of the loan amounts, and of the amounts that were actually funded.



In [ ]:

    
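# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(data_raw['loan_amnt'], kde=True, stat="density");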



In [ ]:

    
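# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(data_raw['funded_amnt'], kde=True, stat="density")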

And finally we can examine the relationship between the two (we expect a linear one).



In [ ]:

    
sns.jointplot(x = 'loan_amnt', y = 'funded_amnt', data = data_raw)

And indeed the relationship is linear. We could use this plot to look for outliers, where the funded amount was much lower than the amount requested.
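Such outliers can also be pulled out directly from the table. The following is a minimal sketch, assuming we define the shortfall as the unfunded share of the requested amount; the helper funding_shortfalls, the shortfall column and the min_gap threshold of 0.2 are hypothetical choices, and the toy frame uses invented values.

```python
import pandas as pd


def funding_shortfalls(df: pd.DataFrame, min_gap: float = 0.2) -> pd.DataFrame:
    """Return rows where at least min_gap of the requested amount was not funded."""
    # share of the requested amount that was not funded (0 means fully funded)
    gap = (df["loan_amnt"] - df["funded_amnt"]) / df["loan_amnt"]
    out = df.assign(shortfall=gap)
    return out[out["shortfall"] >= min_gap].sort_values("shortfall", ascending=False)


# Toy frame standing in for data_raw (values are invented for illustration)
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, 8000.0, 20000.0],
    "funded_amnt": [5000.0, 4000.0, 7600.0, 15000.0],
})

flagged = funding_shortfalls(toy, min_gap=0.2)
print(flagged)
```

The threshold is arbitrary; on the real data it would be worth trying a few values and comparing the flagged rows against the joint plot.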
