Campaign for selling personal loans

This case is about a bank (Thera Bank) with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, the management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

In this notebook, we will build a model to help the department identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign.


In [39]:
# This is used to plot inline
%matplotlib inline

# Importing all the required modules
import pandas as pd
import numpy as np

# seaborn is a plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# scikit learn for ml algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# To find the cross validation score
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Using grid search and confusion matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

In [36]:
import warnings  # Lets ignore the module warnings
warnings.filterwarnings("ignore")

In [2]:
# Read the input file from the hard disk
# It is read as a pandas DataFrame and stored in the df variable
df = pd.read_excel("Bank_Personal_Loan_Modelling.xlsx", sheet_name="Data")

In [3]:
# Find the shape of input
print("Number of customers data available: {}".format(df.shape[0]))
print("Number of features in the data: {}".format(df.shape[1]))


Number of customers data available: 5000
Number of features in the data: 14

In [4]:
# Check for null values in the data
if not df.isnull().values.any():
    print("No null values in this data")


No null values in this data

Insights


We have data on 5000 customers, each with 14 features. Fortunately there are no null values in the data.

Description of the data set:

  • ID: Customer ID
  • Age: Customer's age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer ($000)
  • ZIP Code: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month ($000)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any ($000)
  • Personal Loan: Did this customer accept the personal loan offered in the last campaign?
  • Securities Account: Does the customer have a securities account with the bank?
  • CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
  • Online: Does the customer use internet banking facilities?
  • CreditCard: Does the customer use a credit card issued by the bank?

We are interested in the Personal Loan feature of this dataset. Let's investigate it further.


In [5]:
# Finding the number of customers who accepted the personal loan that was offered to them in the campaign.
len(df[df["Personal Loan"] == 1])


Out[5]:
480

In [6]:
# Current success ratio in percentage
print(str(len(df[df["Personal Loan"] == 1]) / df.shape[0] * 100) + "%")


9.6%

Insights


Only 480 customers accepted the personal loan that was offered to them in the campaign, which implies a success ratio of 9.6%. Our target will be to increase this success ratio.

Univariate Analysis


In [7]:
# Testing the spread of data
df.describe().transpose()


Out[7]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937913 1.747666 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Insights


  • Age is roughly normally distributed, with mean ≈ median
  • Experience has some invalid entries: the minimum is -3, and experience cannot be negative
  • Income is right skewed
  • Average family size of the customer is around 2 people
  • Average credit card spending per month (CCAvg) is slightly right skewed
  • Mortgage distribution seems to have outliers and is heavily right skewed (checked in the sketch below)
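
A quick way to back up these skewness observations, using the df already loaded above (a minimal check, not part of the original analysis; positive values indicate right skew):

# Quantify the skewness of the features discussed above
print(df[["Age", "Income", "CCAvg", "Mortgage"]].skew())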

In [8]:
# Find number of entries with negative experience
len(df[df.Experience < 0])


Out[8]:
52

In [37]:
# Fill the negative Experience values with the mean (using .loc to avoid chained assignment)
df.loc[df.Experience < 0, "Experience"] = df.Experience.mean()
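
Note that df.Experience.mean() above is computed while the negative entries are still in the column. An alternative, shown here only as a sketch, is to impute with the mean of the valid (non-negative) values:

# Alternative: impute negative Experience with the mean of the valid entries only
validMean = df.loc[df.Experience >= 0, "Experience"].mean()
df.loc[df.Experience < 0, "Experience"] = validMean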

In [10]:
# Verify that the negative values are removed
len(df[df.Experience < 0])


Out[10]:
0

Bivariate Analysis

Constructing a Pearson correlation heatmap using the seaborn module.


In [11]:
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize=(15,15))
plt.title('Pearson Correlation for features set', y=1.05, size=19)
sns.heatmap(df.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5fd99c2eb8>

Insights


  • There is a very strong positive correlation between Age and Experience
  • And a weaker positive correlation between CCAvg and Income

This analysis tells us that we could potentially keep only one of Age and Experience, since they carry nearly the same information.
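
If we decide to keep only one of the two, a minimal sketch of dropping Experience could look like the following (X_reduced is a hypothetical name; the models below keep both columns):

# Drop Experience, since it is almost perfectly correlated with Age
X_reduced = df.drop(["Personal Loan", "Experience"], axis=1)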


In [12]:
# Draw the pair plot
pairPlot = sns.pairplot(df, hue="Personal Loan")


Univariate analysis on Personal Loan


In [13]:
loanNotTaken = len(df[df["Personal Loan"] == 0])
loanTaken = len(df[df["Personal Loan"] == 1])

sns.barplot(x=[0,1], y=[loanNotTaken, loanTaken])
plt.title("Personal Loan Distribution")
plt.xlabel("Loan Distribution")
plt.ylabel("Number of customers")
plt.xticks([0,1], ["Loan Not Taken", "Loan Taken"])


Out[13]:
([<matplotlib.axis.XTick at 0x7f5fc1f03eb8>,
  <matplotlib.axis.XTick at 0x7f5fc49c0438>],
 <a list of 2 Text xticklabel objects>)

Insights


  • The data is strongly biased towards customers who have not taken the loan
  • Only 9.6% of the customers have taken the loan

Split dataset

Let's split the data into a training and a test set using the train_test_split function from sklearn. The ratio of the split is 70:30.


In [38]:
# Target feature
Y = df["Personal Loan"]
le = preprocessing.LabelEncoder()
# Encode the labels for the classifier
Y = le.fit_transform(Y)

In [15]:
# Independent variables (predictors)
X = df.drop("Personal Loan", axis=1)

In [16]:
test_size = 0.30 # taking a 70:30 training and test split
seed = 2  # Random number seeding for repeatability of the code

xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=test_size, random_state=seed)

Decision Tree Classifier


In [17]:
seed = 7  # Set the seed for repeatability
# Creating an entropy-based decision tree classifier
dtModel = DecisionTreeClassifier(criterion = 'entropy', random_state=seed)

In [18]:
# Fit the model
dtModel.fit(xTrain, yTrain)


Out[18]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=7, splitter='best')

In [19]:
dtModel.score(xTest, yTest)


Out[19]:
0.98466666666666669

K Neighbors Classifier (KNN)


In [20]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(xTrain, yTrain)


Out[20]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [21]:
knn.score(xTest, yTest)


Out[21]:
0.89600000000000002

Using grid search cross validation to find the best hyperparameters


In [22]:
knn = KNeighborsClassifier()

param_grid = { 
    'n_neighbors': list(range(3, 50, 2)),
    'leaf_size': list(range(3, 50, 2))
}

CV_rfc = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, verbose=1)

# Using cross validation
CV_rfc.fit(X, Y)
print(CV_rfc.best_params_)
print(CV_rfc.best_estimator_)
print(CV_rfc.best_score_)


Fitting 5 folds for each of 576 candidates, totalling 2880 fits
{'leaf_size': 3, 'n_neighbors': 21}
KNeighborsClassifier(algorithm='auto', leaf_size=3, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=21, p=2,
           weights='uniform')
0.9042
[Parallel(n_jobs=1)]: Done 2880 out of 2880 | elapsed:   54.4s finished

In [23]:
dtModel = DecisionTreeClassifier(criterion = 'entropy', random_state=seed)

param_grid = { 
    'criterion': ["gini", "entropy"]
}

CV_rfc = GridSearchCV(estimator=dtModel, param_grid=param_grid, cv=5, verbose=1)

# Using cross validation
CV_rfc.fit(X, Y)
print(CV_rfc.best_params_)
print(CV_rfc.best_estimator_)
print(CV_rfc.best_score_)


Fitting 5 folds for each of 2 candidates, totalling 10 fits
{'criterion': 'gini'}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=7, splitter='best')
0.9754
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished

Why we chose DT over KNN


  • We will use the Decision Tree classifier because it gives much better accuracy than KNN (cross-validated 0.9754 vs 0.9042); note that the grid search above actually picked the gini criterion.
  • Decision Trees are also very flexible, easy to understand, and easy to debug.
  • KNN doesn't know which attributes are more important: when computing the distance between data points (usually Euclidean distance or a generalisation of it), each attribute normally contributes equally to the total distance. This means that less important attributes have the same influence on the distance as more important ones, and features with large numeric ranges dominate it (a feature-scaling sketch below shows a common partial mitigation).

Using DT, we got a cross-validated accuracy score of 0.9754.
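
Scaling does not tell KNN which attributes matter, but it does stop large-range features such as ZIP Code or Income from dominating the Euclidean distance. A minimal sketch, assuming the train/test split from above and the n_neighbors=21 value found by the grid search (results not shown here):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize all features before computing distances, then fit KNN
scaledKnn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=21))
])
scaledKnn.fit(xTrain, yTrain)
print(scaledKnn.score(xTest, yTest))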

Ensemble techniques to improve the performance

We have a good accuracy score using DT, but we also have to keep in mind that the data is highly imbalanced. We will try to get better accuracy using some ensemble techniques.


In [24]:
rf = RandomForestClassifier(random_state=seed)
rf.fit(xTrain, yTrain)


Out[24]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=7,
            verbose=0, warm_start=False)

In [25]:
rf.score(xTest, yTest)


Out[25]:
0.97599999999999998

Insights


There is a slight improvement in the accuracy score, but we can improve it further by tuning the hyperparameters.


In [26]:
rf = RandomForestClassifier(random_state=seed)

param_grid = { 
    'n_estimators': list(range(10, 30)),
    'max_depth': list(range(3, 30, 2))
}

CV_rfc = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, verbose=1)

# Using cross validation
CV_rfc.fit(X, Y)
print(CV_rfc.best_params_)
print(CV_rfc.best_estimator_)
print(CV_rfc.best_score_)


Fitting 5 folds for each of 280 candidates, totalling 1400 fits
{'max_depth': 17, 'n_estimators': 19}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=17, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=19, n_jobs=1, oob_score=False, random_state=7,
            verbose=0, warm_start=False)
0.9846
[Parallel(n_jobs=1)]: Done 1400 out of 1400 | elapsed:  1.8min finished

In [27]:
# Using the best hyperparameters
rf = RandomForestClassifier(max_depth=17, n_estimators=19, random_state=seed)

yPredicted = rf.fit(xTrain, yTrain).predict(xTest)

In [28]:
# Compute confusion matrix
cnfMatrix = confusion_matrix(yTest, yPredicted)

In [29]:
cnfMatrix


Out[29]:
array([[1359,    3],
       [  28,  110]])

In [30]:
sns.heatmap(cnfMatrix)
plt.title("Actual vs Predicted")
plt.xlabel("Predicted labels")
plt.ylabel("Actual labels")


Out[30]:
<matplotlib.text.Text at 0x7f5fc3c4c978>
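
Because the classes are so imbalanced, accuracy alone can hide the 28 loan takers that the model missed. A short sketch of per-class precision and recall, assuming the yTest and yPredicted arrays from above:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1; the 'Loan Taken' row reflects the 28 false negatives
print(classification_report(yTest, yPredicted, target_names=["Loan Not Taken", "Loan Taken"]))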

Conclusion


We got a very good accuracy score using the Random Forest classifier. The main challenge with this data is that it is highly imbalanced toward 'Loan Not Taken'. This could be mitigated by collecting more samples of the minority class.

Future scope

Apply stratified sampling so that the class imbalance is handled consistently across the training and test splits.


In [31]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

for train_index, test_index in sss.split(X, Y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

In [32]:
# Using the best hyperparameters
rf = RandomForestClassifier(max_depth=17, n_estimators=19, random_state=seed)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)


Out[32]:
0.98533333333333328

Insights


This is a better approach because StratifiedShuffleSplit from sklearn creates splits that preserve the percentage of samples for each class; the sketch below verifies this on the split we just made.
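
As a quick check, assuming the y_train and y_test arrays produced by the stratified split above, both subsets should show roughly the same 9.6% positive rate:

# Compare the positive-class proportion in the stratified train and test splits
print("Train positive rate: {:.4f}".format(y_train.mean()))
print("Test positive rate:  {:.4f}".format(y_test.mean()))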