In [9]:
# load dependencies
%matplotlib inline
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# set seaborn style
sns.set_style("ticks")
In [10]:
# data loading
data_raw = pd.read_csv("loan.csv")
After loading the data we can have a look at its first several rows.
In [ ]:
# data exploration
print(data_raw.shape)
data_raw.head(n = 5)
Out[ ]:
We can already see that some of the data is missing. To visualise the gaps we can use the handy missingno package.
In [ ]:
# missing data visualisation
msno.matrix(data_raw)
It appears that most of the numeric data is available, while the categorical columns have quite a few missing values. For the purposes of this analysis we will therefore focus mostly on the numerical values. We are ready to proceed to the next step.
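The visual impression from the matrix plot can be backed up with numbers. The sketch below computes the share of missing values per column and keeps the numeric columns that are mostly complete. It runs on a small synthetic frame `demo` (hypothetical values, not taken from loan.csv), and the 50% cut-off is an assumption chosen only for illustration; the same calls should work on `data_raw`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_raw (hypothetical values, not from loan.csv)
demo = pd.DataFrame({
    "loan_amnt": [5000.0, 2400.0, 10000.0, 3000.0],
    "int_rate": [10.65, 15.27, np.nan, 12.69],
    "desc": [None, "car", None, None],
})

# Fraction of missing values per column, largest first
missing_share = demo.isna().mean().sort_values(ascending=False)
print(missing_share)

# Numeric columns with less than 50% missing values (the threshold is an assumption)
numeric_cols = demo.select_dtypes(include="number").columns
usable = [c for c in numeric_cols if missing_share[c] < 0.5]
print(usable)
```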
One initial exploration would be to see if there is a difference in interest rates between the two terms, 36 and 60 months. For this we can create a box plot.
In [ ]:
# boxplot of int_rate
sns.boxplot(x = data_raw.term, y = data_raw.int_rate)
Visually it does seem as if there is a difference, but to make sure we should run a statistical test. In this case a two-sample t-test is a reasonable choice. First we should do some data wrangling to get our data in order. To speed up the analysis we will have a look at a 5% sample of the data.
In [ ]:
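# draw a 5% random sample of the two columns needed for the t-test
# (random_state fixes the seed so the sample is reproducible)
data_sample = data_raw[['term','int_rate']].sample(frac = 0.05, random_state = 42)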
print(data_sample.shape)
data_sample.head()
And now we can run the test.
In [ ]:
# split the sampled interest rates by loan term
term36 = data_sample['int_rate'][data_sample['term'].str.contains("36 months")]
term60 = data_sample['int_rate'][data_sample['term'].str.contains("60 months")]
# independent two-sample t-test on the two groups
stats.ttest_ind(term36, term60)
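By default `stats.ttest_ind` assumes that both groups have equal variances. If the two terms differ in spread or in group size, Welch's variant (`equal_var=False`) may be the safer option. The sketch below shows the call on synthetic interest rates; the means, standard deviations and group sizes are made up for illustration and are not estimates from loan.csv.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic interest rates (hypothetical): the 60-month group has a higher
# mean, a wider spread and fewer observations than the 36-month group
term36_demo = rng.normal(loc=12.0, scale=3.0, size=700)
term60_demo = rng.normal(loc=16.0, scale=4.0, size=300)

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(term36_demo, term60_demo, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"difference in means = {term60_demo.mean() - term36_demo.mean():.2f}")
```

With the real data, `term36` and `term60` from the cell above would take the place of the two synthetic arrays.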
The t-test on the sample supports our hypothesis that interest rates differ between the two terms. Next we can see whether the interest rate (int_rate) differs among the grade groups (grade). Again we can use boxplots to have a look.
In [ ]:
sns.boxplot(x = data_raw.grade, y = data_raw.int_rate)
It looks like loans with grade A have the lowest interest rates, while grade G loans have the highest ones (there seem to be quite a few outliers, so we have to be careful when drawing conclusions from these data alone).
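Because the medians are less sensitive to outliers than the means, a per-grade summary table can complement the boxplot. This is a minimal sketch on a synthetic frame `demo` with invented grades and rates; on the real data the same `groupby` call would be applied to `data_raw`.

```python
import pandas as pd

# Synthetic stand-in for data_raw (hypothetical values, not from loan.csv)
demo = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "G", "G"],
    "int_rate": [6.0, 7.0, 10.0, 11.0, 24.0, 26.0],
})

# Count, median and range of the interest rate within each grade
summary = demo.groupby("grade")["int_rate"].agg(["count", "median", "min", "max"])
print(summary)
```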
Now let's look at histograms of the loan amounts requested and of the amounts that were actually funded.
In [ ]:
sns.distplot(data_raw['loan_amnt']);
In [ ]:
sns.distplot(data_raw['funded_amnt'])
And finally we can examine the relationship between the two (we expect a linear one).
In [ ]:
sns.jointplot(x = 'loan_amnt', y = 'funded_amnt', data = data_raw)
And indeed the relationship is linear. We could use this plot to look for outliers, where the funded amount was much lower than the amount requested.
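Such outliers can also be picked out directly instead of being read off the plot. The sketch below flags loans whose funded amount falls below a given share of the requested amount. The frame `demo` is synthetic and the 80% threshold is an assumption for illustration, not a value from the analysis.

```python
import pandas as pd

# Synthetic stand-in for data_raw (hypothetical values, not from loan.csv)
demo = pd.DataFrame({
    "loan_amnt": [10000.0, 5000.0, 20000.0, 8000.0],
    "funded_amnt": [10000.0, 5000.0, 8000.0, 7900.0],
})

# Share of the requested amount that was actually funded
funded_ratio = demo["funded_amnt"] / demo["loan_amnt"]

# Flag loans funded at less than 80% of the request (the threshold is an assumption)
threshold = 0.8
underfunded = demo[funded_ratio < threshold]
print(underfunded)
```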