The client, bank XYZ, is running a direct marketing campaign. It wants to identify customers who are likely to buy its new term deposit plan.
The data comes from the UCI Machine Learning Repository: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
It covers a direct marketing campaign (phone calls) run by a Portuguese bank.
y - has the client subscribed to a term deposit? (binary: 'yes','no')
For this workshop, the data has been randomly split into train and test sets. Build the model on train and use it to predict on test.
In [1]:
#Import the necessary libraries
import numpy as np
import pandas as pd
In [2]:
#Read the train and test data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
Exercise 1
Print the number of rows and columns of train and test
In [16]:
Exercise 2
Print the first 10 rows of train
In [4]:
Out[4]:
Exercise 3
Print the column types of train and test. Are they the same in both train and test?
In [5]:
#train
Out[5]:
In [6]:
#test
Out[6]:
In [7]:
#Are they the same?
In [64]:
#Combine train and test
frames = [train, test]
input = pd.concat(frames)
In [9]:
#Print first 10 records of input
Out[9]:
Exercise 4
Find out whether any column has missing values
Hint: there is a pd.isnull function. How would you use it?
In [12]:
Out[12]:
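As a hint for the missing-value check, here is a minimal sketch on a made-up frame. The name toy and its columns are illustrative assumptions, not part of the workshop data.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only; not the workshop data
toy = pd.DataFrame({
    "age": [25, np.nan, 40],
    "job": ["admin", "technician", None],
    "balance": [100, 250, 300],
})

# pd.isnull returns a boolean frame with the same shape as its input
null_mask = pd.isnull(toy)

# any() reduces each column to True if it holds at least one missing value
columns_with_missing = null_mask.any()

# sum() counts the missing values per column
missing_counts = null_mask.sum()

print(columns_with_missing)
print(missing_counts)
```

The same two reductions should carry over to the combined workshop frame.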
In [65]:
#Replace deposit with a numeric column
#First, set all labels to be 0
input["depositLabel"] = 0
#Now, set depositLabel to 1 whenever deposit is yes
#(.loc accepts a boolean mask; .at only addresses a single cell)
input.loc[input.deposit=="yes", "depositLabel"] = 1
In [ ]:
Exercise 5
Find % of customers in the input dataset who have purchased the term deposit
In [72]:
Out[72]:
In [75]:
#Create the labels
labels =
labels
Out[75]:
In [83]:
#Drop the deposit column
input.drop(["deposit", "depositLabel"], axis=1)
Exercise 6
Did it drop? If not, what has to be done?
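A small sketch on a made-up frame may help here. It illustrates how drop behaves by default; the frame toy and its columns are assumptions for illustration only.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame; without assignment the original is unchanged
toy.drop(["b"], axis=1)
columns_before = list(toy.columns)

# Assigning the result back keeps the change
toy = toy.drop(["b"], axis=1)
columns_after = list(toy.columns)

print(columns_before)
print(columns_after)
```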
Exercise 7
Print column names of input
In [ ]:
In [85]:
#Get list of columns that are continuous/integer
continuous_variables = input.dtypes[input.dtypes != "object"].index
In [86]:
continuous_variables
Out[86]:
In [87]:
#Get list of columns that are categorical
categorical_variables = input.dtypes[input.dtypes=="object"].index
In [88]:
categorical_variables
Out[88]:
Exercise 8
Create inputInteger and inputCategorical - two datasets - one having integer variables and another having categorical variables
In [89]:
inputInteger =
In [91]:
#print inputInteger
inputInteger.head()
Out[91]:
In [93]:
inputCategorical =
In [94]:
#print inputCategorical
inputCategorical.head()
Out[94]:
In [101]:
#Convert categorical variables into Labels using labelEncoder
inputCategorical = np.array(inputCategorical)
Exercise 9
Find length of categorical_variables
In [102]:
Out[102]:
In [119]:
#Load the preprocessing module
from sklearn import preprocessing
In [103]:
for i in range(len(categorical_variables)):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(inputCategorical[:,i]))
    inputCategorical[:, i] = lbl.transform(inputCategorical[:, i])
In [105]:
#print inputCategorical
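To see what the label encoding step does to a single column, here is a minimal sketch on a made-up list of values. The list colours is an assumption for illustration only.

```python
from sklearn import preprocessing

# Made-up categorical values for illustration only
colours = ["red", "green", "blue", "green"]

lbl = preprocessing.LabelEncoder()
lbl.fit(colours)

# classes_ holds the distinct values in sorted order;
# each value is replaced by its position in that order
encoded = lbl.transform(colours)

# inverse_transform maps the integers back to the original strings
decoded = lbl.inverse_transform(encoded)

print(list(lbl.classes_))
print(encoded)
```

Note that the integers carry an ordering the categories themselves may not have, which is one reason to compare against one-hot encoding later.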
Exercise 10
Convert inputInteger to a numpy array
In [107]:
inputInteger =
inputInteger
Out[107]:
Exercise 11
Now, create the inputUpdated array that has both inputInteger and inputCategorical concatenated
Hint: check the functions called vstack and hstack
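The difference between the two functions can be sketched on small made-up arrays. The arrays a, b and c are assumptions for illustration only.

```python
import numpy as np

# Made-up arrays for illustration only
a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5], [6]])         # shape (2, 1)
c = np.array([[7, 8]])           # shape (1, 2)

# hstack joins arrays side by side: row counts must match, columns add up
side_by_side = np.hstack((a, b))

# vstack joins arrays on top of each other: column counts must match, rows add up
stacked = np.vstack((a, c))

print(side_by_side.shape)
print(stacked.shape)
```

Which one fits depends on whether the two arrays share their rows or their columns.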
In [ ]:
In [118]:
inputUpdated.shape
Out[118]:
In [125]:
from sklearn import tree
#sklearn.externals.six was removed from scikit-learn; the standard library provides StringIO
from io import StringIO
import pydot
In [126]:
bankModelDT = tree.DecisionTreeClassifier(max_depth=2)
In [127]:
bankModelDT.fit(inputUpdated[:train.shape[0],:], labels[:train.shape[0]])
Out[127]:
In [128]:
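dot_data = StringIO()
tree.export_graphviz(bankModelDT, out_file=dot_data)
#Recent pydot versions return a list of graphs; older ones return a single graph
graphs = pydot.graph_from_dot_data(dot_data.getvalue())
graph = graphs[0] if isinstance(graphs, list) else graphs
graph.write_pdf("bankDT.pdf")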
Out[128]:
In [129]:
#Check the pdf
Exercise 12
Now, set max_depth=6 and check the results.
Then, set max_depth=None and check the results.
In [ ]:
In [144]:
# Prediction
prediction_DT = bankModelDT.predict(inputUpdated[train.shape[0]:,:])
In [133]:
#Compute the error metrics
In [134]:
import sklearn.metrics
In [135]:
#auc() integrates arbitrary x/y curve points; roc_auc_score takes true labels and scores
sklearn.metrics.roc_auc_score(labels[train.shape[0]:], prediction_DT)
Out[135]:
In [136]:
#What does that tell?
In [137]:
#What's the AUC for the other decision tree models?
Exercise 13
Instead of predicting classes directly, predict the probability and check the auc
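As a hint, the sketch below shows the general shape of this on synthetic data. The data from make_classification, the variable names, and the use of roc_auc_score as the AUC metric are all assumptions for illustration; this is not the workshop solution.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(X_test)

# Column 1 is the probability of the positive class (label 1)
positive_scores = proba[:, 1]

# AUC from hard class predictions versus AUC from probabilities
auc_from_classes = roc_auc_score(y_test, model.predict(X_test))
auc_from_proba = roc_auc_score(y_test, positive_scores)

print(auc_from_classes, auc_from_proba)
```

Probabilities let the metric rank the test rows, whereas hard class predictions collapse the ranking to two levels.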
In [ ]:
In [142]:
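#prediction_DT holds predict_proba output here; column 1 is the probability of the positive class
sklearn.metrics.roc_auc_score(labels[train.shape[0]:], prediction_DT[:,1])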
Out[142]:
In [147]:
#Precision and Recall
In [145]:
sklearn.metrics.precision_score(labels[train.shape[0]:], prediction_DT)
Out[145]:
In [146]:
sklearn.metrics.recall_score(labels[train.shape[0]:], prediction_DT)
Out[146]:
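To make the two metrics concrete, here is a small worked example on made-up labels and predictions. The lists y_true and y_pred are assumptions for illustration only.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1]

# Counting by hand: 2 true positives, 2 false positives, 1 false negative
# precision = TP / (TP + FP) = 2 / 4
precision = precision_score(y_true, y_pred)

# recall = TP / (TP + FN) = 2 / 3
recall = recall_score(y_true, y_pred)

print(precision, recall)
```

Precision answers "of the customers flagged as buyers, how many bought?", while recall answers "of the customers who bought, how many were flagged?".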
In [148]:
from sklearn.ensemble import RandomForestClassifier
In [157]:
bankModelRF = RandomForestClassifier(n_jobs=-1, oob_score=True)
In [158]:
bankModelRF.fit(inputUpdated[:train.shape[0],:], labels[:train.shape[0]])
Out[158]:
In [156]:
bankModelRF.oob_score_
Out[156]:
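The out-of-bag score can be read as a built-in validation estimate: each tree is evaluated on the training rows that were left out of its bootstrap sample. A minimal sketch on synthetic data follows; the data and the explicit n_estimators and random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True asks the forest to score each row using only the trees
# that did not see that row during training
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# For a classifier, oob_score_ is an accuracy between 0 and 1
oob_accuracy = forest.oob_score_

print(oob_accuracy)
```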
Exercise 14
Do the following
In [ ]:
In [160]:
import xgboost as xgb
In [176]:
params = {}
params["min_child_weight"] = 3
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 1
params["silent"] = 0
params["max_depth"] = 4
params["nthread"] = 6
params["gamma"] = 1
params["objective"] = "binary:logistic"
params["eta"] = 0.005
params["base_score"] = 0.1
params["eval_metric"] = "auc"
params["seed"] = 123
In [177]:
plst = list(params.items())
num_rounds = 120
In [178]:
xgtrain_pv = xgb.DMatrix(inputUpdated[:train.shape[0],:], label=labels[:train.shape[0]])
watchlist = [(xgtrain_pv, 'train')]
#Pass the watchlist so the eval metric is reported for each boosting round
bankModelXGB = xgb.train(plst, xgtrain_pv, num_rounds, evals=watchlist)
In [179]:
prediction_XGB = bankModelXGB.predict(xgb.DMatrix(inputUpdated[train.shape[0]:,:]))
In [180]:
sklearn.metrics.roc_auc_score(labels[train.shape[0]:], prediction_XGB)
Out[180]:
In [175]:
inputOneHot = pd.get_dummies(input)
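To see what one-hot encoding produces, here is a minimal sketch on a made-up frame. The frame toy and its columns are assumptions for illustration only.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({"age": [30, 45], "marital": ["single", "married"]})

# Numeric columns pass through unchanged; each category of an object column
# becomes its own indicator column named <column>_<category>
one_hot = pd.get_dummies(toy)

print(list(one_hot.columns))
```

Unlike label encoding, this does not impose an order on the categories, at the cost of more columns.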
Exercise 15
On the one hot encoded data, train
Which one works best on the test dataset?
In [ ]: