This case is about a bank (Thera Bank) with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.
In this notebook, we will build a model that helps the department identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign.
In [39]:
# This is used to plot inline
%matplotlib inline
# Importing all the required modules
import pandas as pd
import numpy as np
# seaborn is a plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# scikit learn for ml algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# To find the cross validation score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
# Using grid search and confusion matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
In [36]:
import warnings # Let's ignore the module warnings
warnings.filterwarnings("ignore")
In [2]:
# Read the input file from the hard disk
# It is read as a pandas data frame and stored in the df variable
df = pd.read_excel("Bank_Personal_Loan_Modelling.xlsx", sheet_name="Data")
In [3]:
# Find the shape of input
print("Number of customers data available: {}".format(df.shape[0]))
print("Number of features in the data: {}".format(df.shape[1]))
In [4]:
# Check for null values in the data
if not df.isnull().values.any():
    print("No null values in this data")
Description of the data set:
We are interested in the Personal Loan feature of this dataset. Let's investigate it further.
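The dataset's columns are not listed explicitly here; an optional check (not part of the original notebook's cells) can confirm which features are available before focusing on Personal Loan:
In [ ]:
# Inspect the column data types and the first few rows of the data
print(df.dtypes)
df.head()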
In [5]:
# Finding the number of customers who accepted the personal loan that was offered to them in the campaign.
len(df[df["Personal Loan"] == 1])
Out[5]:
In [6]:
# Current success ratio in percentage
print(str(len(df[df["Personal Loan"] == 1]) / df.shape[0] * 100) + "%")
In [7]:
# Testing the spread of data
df.describe().transpose()
Out[7]:
In [8]:
# Find number of entries with negative experience
len(df[df.Experience < 0])
Out[8]:
In [37]:
# Replace the negative experience values with the mean experience
df.loc[df.Experience < 0, "Experience"] = df.Experience.mean()
In [10]:
# Verify that the negative values are removed
len(df[df.Experience < 0])
Out[10]:
In [11]:
colormap = plt.cm.viridis  # Color range to be used in the heatmap
plt.figure(figsize=(15, 15))
plt.title('Pearson Correlation for the feature set', y=1.05, size=19)
sns.heatmap(df.corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)
Out[11]:
In [12]:
# Draw the pair plot
pairPlot = sns.pairplot(df, hue="Personal Loan")
In [13]:
loanNotTaken = len(df[df["Personal Loan"] == 0])
loanTaken = len(df[df["Personal Loan"] == 1])
sns.barplot(x=[0,1], y=[loanNotTaken, loanTaken])
plt.title("Personal Loan Distribution")
plt.xlabel("Loan Distribution")
plt.ylabel("Number of customers")
plt.xticks([0,1], ["Loan Not Taken", "Loan Taken"])
Out[13]:
In [38]:
# Target feature
Y = df["Personal Loan"]
le = preprocessing.LabelEncoder()
# Encode the labels for the classifier
Y = le.fit_transform(Y)
In [15]:
# Independent variables (predictor features)
X = df.drop("Personal Loan", axis=1)
In [16]:
test_size = 0.30 # taking a 70:30 training and test split
seed = 2 # Random number seeding for repeatability of the code
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=test_size, random_state=seed)
In [17]:
seed = 7 # Set the seed for repeatability
# Create an entropy-based decision tree classifier
dtModel = DecisionTreeClassifier(criterion='entropy', random_state=seed)
In [18]:
# Fit the model
dtModel.fit(xTrain, yTrain)
Out[18]:
In [19]:
dtModel.score(xTest, yTest)
Out[19]:
In [20]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(xTrain, yTrain)
Out[20]:
In [21]:
knn.score(xTest, yTest)
Out[21]:
In [22]:
knn = KNeighborsClassifier()
param_grid = {
    'n_neighbors': list(range(3, 50, 2)),
    'leaf_size': list(range(3, 50, 2))
}
CV_knn = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, verbose=1)
# Using cross validation
CV_knn.fit(X, Y)
print(CV_knn.best_params_)
print(CV_knn.best_estimator_)
print(CV_knn.best_score_)
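KNN is distance based, so unscaled features such as Income or Mortgage can dominate the neighbor search. One possible variant (not run in the original notebook) repeats the grid search on scaled features using scikit-learn's Pipeline and StandardScaler; the parameter grid mirrors the one above, with the knn__ prefix addressing the classifier step inside the pipeline.
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features so that large-valued columns do not dominate the distance metric
scaledKnn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])
param_grid = {
    "knn__n_neighbors": list(range(3, 50, 2)),
    "knn__leaf_size": list(range(3, 50, 2))
}
CV_scaled = GridSearchCV(estimator=scaledKnn, param_grid=param_grid, cv=5, verbose=1)
CV_scaled.fit(X, Y)
print(CV_scaled.best_params_)
print(CV_scaled.best_score_)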
In [23]:
dtModel = DecisionTreeClassifier(criterion='entropy', random_state=seed)
param_grid = {
    'criterion': ["gini", "entropy"]
}
CV_dt = GridSearchCV(estimator=dtModel, param_grid=param_grid, cv=5, verbose=1)
# Using cross validation
CV_dt.fit(X, Y)
print(CV_dt.best_params_)
print(CV_dt.best_estimator_)
print(CV_dt.best_score_)
Using the decision tree, we got an accuracy score of 0.9754.
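cross_val_score was imported earlier but never used; as an optional sanity check (not part of the original run), the tuned tree's cross-validated accuracy can be computed directly:
In [ ]:
# 5-fold cross-validated accuracy of the entropy-based decision tree on the full data
scores = cross_val_score(DecisionTreeClassifier(criterion='entropy', random_state=seed), X, Y, cv=5)
print(scores.mean())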
In [24]:
rf = RandomForestClassifier(random_state=seed)
rf.fit(xTrain, yTrain)
Out[24]:
In [25]:
rf.score(xTest, yTest)
Out[25]:
In [26]:
rf = RandomForestClassifier(random_state=seed)
param_grid = {
    'n_estimators': list(range(10, 30)),
    'max_depth': list(range(3, 30, 2))
}
CV_rfc = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, verbose=1)
# Using cross validation
CV_rfc.fit(X, Y)
print(CV_rfc.best_params_)
print(CV_rfc.best_estimator_)
print(CV_rfc.best_score_)
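Since GridSearchCV refits the best configuration on the data it was given (refit=True by default), the tuned model can also be reused directly instead of re-typing the hyperparameters in the next cell; a small optional check (not in the original notebook):
In [ ]:
# Reuse the refit best estimator from the grid search and confirm its hyperparameters
bestRf = CV_rfc.best_estimator_
print(bestRf.get_params()["max_depth"], bestRf.get_params()["n_estimators"])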
In [27]:
# Using the best hyperparameters found by the grid search
rf = RandomForestClassifier(max_depth=17, n_estimators=19, random_state=seed)
yPredicted = rf.fit(xTrain, yTrain).predict(xTest)
In [28]:
# Compute confusion matrix
cnfMatrix = confusion_matrix(yTest, yPredicted)
In [29]:
cnfMatrix
Out[29]:
In [30]:
sns.heatmap(cnfMatrix)
plt.title("Actual vs Predicted")
plt.xlabel("Predicted labels")
plt.ylabel("Actual labels")
Out[30]:
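Because the business goal is to target the customers most likely to take the loan, per-class precision and recall are more informative than overall accuracy. An optional check (not in the original notebook) using scikit-learn's classification_report:
In [ ]:
from sklearn.metrics import classification_report

# Precision and recall per class; "Loan Taken" corresponds to Personal Loan == 1
print(classification_report(yTest, yPredicted, target_names=["Loan Not Taken", "Loan Taken"]))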
In [31]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_index, test_index in sss.split(X, Y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
In [32]:
# Using the best hyperparameters found by the grid search
rf = RandomForestClassifier(max_depth=17, n_estimators=19, random_state=seed)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
Out[32]:
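To support better target marketing, the fitted forest's feature importances can be ranked as a final optional step (an addition, not part of the original notebook); it relies only on the feature_importances_ attribute of the trained model and the column names of X:
In [ ]:
# Rank the features by their importance in the final random forest model
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)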