[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
Dataset description from the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
In [191]:
import graphlab as gl
In [192]:
# already stored dataset locally as Turi SFrame
train = gl.SFrame('bankCustomerTrain.sf')
test = gl.SFrame('bankCustomerTest.sf')
# data = graphlab.SFrame('s3://' or 'hdfs://')
# data # pySpark RDD or SchemaRDD / Spark DataFrame
# data = graphlab.SFrame.read_json('')
# With a DB: configure ODBC manager / driver on the machine
# graphlab.connect_odbc?
# graphlab.from_sql?
The original dataset came with the following attribute information:

Field Num | Field Name | Description |
---|---|---|
1 | age | age of the client (numeric) |
2 | job | type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown') |
3 | marital | marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed) |
4 | education | (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown') |
5 | default | has credit in default? (categorical: 'no','yes','unknown') |
6 | housing | has housing loan? (categorical: 'no','yes','unknown') |
7 | loan | has personal loan? (categorical: 'no','yes','unknown') |
8 | contact | contact communication type (categorical: 'cellular','telephone') |
9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |
10 | day_of_week | last contact day of the week (categorical: 'mon','tue','wed','thu','fri') |
11 | duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the call ends y is obviously known. Thus this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. |
12 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
13 | pdays | number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted) |
14 | previous | number of contacts performed before this campaign and for this client (numeric) |
15 | poutcome | outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') |
16 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |
17 | cons.price.idx | consumer price index - monthly indicator (numeric) |
18 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
19 | euribor3m | euribor 3 month rate - daily indicator (numeric) |
20 | nr.employed | number of employees - quarterly indicator (numeric) |
21 | y | has the client subscribed to a term deposit? (binary: 'yes','no') |
In [193]:
gl.canvas.set_target('browser')
train.show()
Before we start, let's assume that each phone call to a contact costs \$1 and that the customer lifetime value for a contact that purchases a term deposit is \$100. Then the ROI for calling all the customers in our training dataset is:
In [194]:
def calc_call_roi(contactList, leadScore, percentToCall):
    # assumptions: each call costs $1, a subscribing customer is worth $100
    costOfCall = 1.00
    custLTV = 100.00
    numberCalls = int(len(contactList) * percentToCall)
    # rank contacts by lead score and call only the top slice
    if 'lead_score' in contactList.column_names():
        contactList.remove_column('lead_score')
    contactList = contactList.add_column(leadScore, name='lead_score')
    sortedByModel = contactList.sort('lead_score', ascending=False)
    callList = sortedByModel[:numberCalls]
    numSubscriptions = len(callList[callList['y'] == 'yes'])
    # express ROI as a percentage, matching the '%' in the print statements below
    roi = (numSubscriptions * custLTV - numberCalls * costOfCall) \
          / (numberCalls * costOfCall) * 100
    return roi
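The same arithmetic can be checked outside of GraphLab with plain Python. This is a minimal sketch of the ranking-and-ROI logic above on a hypothetical toy contact list (the $1 call cost and $100 lifetime value are the assumptions stated in the text; the helper name and data are made up):

```python
COST_OF_CALL = 1.00   # assumption from the text: $1 per call
CUST_LTV = 100.00     # assumption from the text: $100 customer lifetime value

def toy_call_roi(contacts, percent_to_call):
    """contacts: list of (lead_score, subscribed) tuples."""
    n_calls = int(len(contacts) * percent_to_call)
    # rank by lead score, call the top slice
    ranked = sorted(contacts, key=lambda c: c[0], reverse=True)
    call_list = ranked[:n_calls]
    n_subs = sum(1 for _, subscribed in call_list if subscribed)
    cost = n_calls * COST_OF_CALL
    return (n_subs * CUST_LTV - cost) / cost * 100  # as a percentage

# 10 hypothetical contacts; the 2 subscribers sit near the top of the scores
contacts = [(0.9, True), (0.8, False), (0.7, True), (0.6, False),
            (0.5, False), (0.4, False), (0.3, False), (0.2, False),
            (0.1, False), (0.0, False)]
print(toy_call_roi(contacts, 0.5))  # top 5 calls, 2 subs: (200 - 5)/5 * 100 = 3900.0
```

Calling everyone (percent_to_call=1.0) yields a lower ROI here, which is exactly the effect the lead-scoring models below aim to exploit.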
In [195]:
initLeadScores = gl.SArray([1 for _ in test])
initROI = calc_call_roi(test, initLeadScores, 1)
print 'ROI for calling all contacts: ' + '{0:.2f}'.format(initROI) + '%'
In [196]:
initLeadScores = gl.SArray([1 for _ in test])
initROI = calc_call_roi(test, initLeadScores, 0.2)
print 'ROI for calling a 20% subset of contacts: ' + '{0:.2f}'.format(initROI) + '%'
The SFrame, one of Turi's underlying data structures, makes it easy to build flexible pipelines. Here we quickly compute the percentage of clients who opened a deposit account, along with the same rate among clients younger than the median age.
In [197]:
numClients = float(len(train))
numY = gl.Sketch(train['y']).frequency_count('yes')
print "%.2f%% of clients in training set opened deposit accounts." % (numY/numClients*100.0)
medianAge = gl.Sketch(train['age']).quantile(0.5)
numUnderMedianAge = float(len(train[train['age']<medianAge]))
numPurchasingAndUnderMedianAge = sum(train.apply(lambda x: 1 if x['age'] < medianAge
and x['y'] == 'yes' else 0))
probYGivenUnderMedianAge = numPurchasingAndUnderMedianAge/numUnderMedianAge*100
print "%.2f%% clients with age < %g (median) opened deposit account." % (probYGivenUnderMedianAge, medianAge)
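The median/conditional-rate computation above can be sketched without gl.Sketch in plain Python. This toy version uses made-up (age, outcome) records, not the bank dataset, and computes the median by hand so it runs on both Python 2 and 3:

```python
# Hypothetical toy records: (age, subscribed) pairs, not the real dataset.
records = [(25, 'yes'), (31, 'no'), (38, 'no'), (42, 'yes'),
           (29, 'yes'), (55, 'no'), (47, 'no'), (33, 'no')]

# median age, computed by hand
ages = sorted(a for a, _ in records)
mid = len(ages) // 2
median_age = (ages[mid - 1] + ages[mid]) / 2.0 if len(ages) % 2 == 0 else ages[mid]

# conditional subscription rate among clients under the median age
under = [(a, y) for a, y in records if a < median_age]
rate = sum(1 for _, y in under if y == 'yes') / float(len(under)) * 100

print("median age: %g" % median_age)
print("%.2f%% of clients under the median age subscribed." % rate)
```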
In [198]:
ageTargetingROI = calc_call_roi(test, test['age'].apply(lambda x: 1 if x < medianAge else 0), 0.2)
print 'ROI for age targeted calls to 20% of contacts: ' + '{0:.2f}'.format(ageTargetingROI) + '%'
In [199]:
# remove features that give away results/prediction
features = train.column_names()
features.remove('duration')
features.remove('y')
Turi's classifier toolkit can help marketers predict whether a client is likely to open an account.
In [200]:
toolkit_model = gl.classifier.create(train, features = features, target='y')
In [201]:
results = toolkit_model.evaluate(test)
print "accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall'])
This initial model can be considered accurate, since it correctly predicts the purchasing decisions of about 90% of the contacts. However, it leaves room for improvement: only 64% of predicted sales actually convert to sales, and only 24% of actual sales are caught by the model. To understand the model better, we can review the importance of its input features.
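As a reminder of what those two numbers measure, precision and recall fall directly out of a confusion matrix. This sketch uses hypothetical counts chosen to mirror the rates quoted above, not the model's actual confusion matrix:

```python
# Hypothetical confusion-matrix counts (not from the model above):
# tp = true positives, fp = false positives, fn = false negatives, tn = true negatives
tp, fp, fn, tn = 64, 36, 202, 698

precision = float(tp) / (tp + fp)        # of predicted 'yes', how many were right
recall = float(tp) / (tp + fn)           # of actual 'yes', how many we caught
accuracy = float(tp + tn) / (tp + fp + fn + tn)

print("accuracy: %g, precision: %g, recall: %g" % (accuracy, precision, recall))
```

With these counts, precision is 0.64 and recall is about 0.24: for lead scoring, low recall means many willing subscribers never get a call.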
In [202]:
toolkit_model.get_feature_importance()
Out[202]:
After scoring the list by probability to purchase, the ROI for calling the top 20% of the list is:
In [203]:
toolkitLeadScore = toolkit_model.predict(test,output_type='probability')
toolkitROI = calc_call_roi(test, toolkitLeadScore, 0.2 )
print 'ROI for calling 20% of highest predicted contacts: ' + '{0:.2f}'.format(toolkitROI) + '%'
# One option to explore is using quadratic features to see if interactions between the features have predictive power.
quadratic = gl.feature_engineering.create(train,
gl.feature_engineering.QuadraticFeatures(features=['campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'euribor3m',
'nr.employed']))
# Transform the training data.
qTrain = quadratic.transform(train)
qFeatures = qTrain.column_names()
qFeatures.remove('y')
qFeatures.remove('duration')
# We train a random forest classifier on the enriched dataset.
new_rf_model = gl.random_forest_classifier.create(qTrain, features=qFeatures, target='y',
                                                  class_weights='auto', max_depth=50,
                                                  row_subsample=0.75, max_iterations=50,
                                                  column_subsample=0.5)
results = new_rf_model.evaluate(quadratic.transform(test))
print "accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall'])
# see which features are most important in this tree model
new_rf_model.get_feature_importance()
# show ROI for the experimentation model
rfLeadScore = new_rf_model.predict(test, output_type='probability')
rfROI = calc_call_roi(test, rfLeadScore, 0.1)
print 'ROI for calling top 10% of predicted contacts: ' + '{0:.2f}'.format(rfROI) + '%'
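The QuadraticFeatures transformer used above generates pairwise products of the chosen numeric columns. A minimal pure-Python sketch of that idea (a hypothetical helper, not the Turi implementation) looks like:

```python
from itertools import combinations_with_replacement

def quadratic_features(row, columns):
    """Add pairwise products (including squares) of the named numeric columns."""
    out = dict(row)
    for a, b in combinations_with_replacement(columns, 2):
        out['%s*%s' % (a, b)] = row[a] * row[b]
    return out

# hypothetical single row with a subset of the dataset's numeric fields
row = {'campaign': 2, 'previous': 1, 'euribor3m': 4.857}
expanded = quadratic_features(row, ['campaign', 'previous'])
print(expanded['campaign*previous'])  # 2
```

These interaction terms let a model pick up effects that depend on two features jointly, at the cost of a wider feature space.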
In [204]:
# calc_call_roi attached the model's lead_score column to test, so we can rank by it
rfList = test.sort('lead_score', ascending=False)
rfList['lead_score', 'age', 'campaign', 'euribor3m', 'job', 'loan'].print_rows(num_rows=20)
We can deploy a real-time model to help marketers understand potential clients as soon as they contact the bank. Here we deploy on AWS, but Turi also supports hosting models on-premises and on Azure.
# define the state path - this is where Turi will store the models, logs, and metadata for this deployment
ps_state_path = 's3://gl-rajat-testing/predictive_service/lead_scoring_app'
# setup your own AWS credentials.
# gl.aws.set_credentials(<key>,<secret key>)
# create an EC2 config - this is how you define the EC2 configuration for the cluster being deployed
ec2 = gl.deploy.Ec2Config(region='us-west-2', instance_type='m3.xlarge')
# use the EC2 config to launch a new Predictive Service
# num_hosts specifies how many machines the Predictive Service cluster has.
# You can scale up and down later after initial creation.
deployment = gl.deploy.predictive_service.create(name = 'rajat-lead-scoring-app',
ec2_config = ec2, state_path = ps_state_path, num_hosts = 3)
In [205]:
ps_state_path = 's3://gl-rajat-testing/predictive_service/lead_scoring_app'
deployment = gl.deploy.predictive_service.load(ps_state_path)
In [206]:
# see the status of, and what's deployed on, this deployment
deployment
Out[206]:
In [207]:
# the inputs and return value of this function map directly to the I/O of the REST endpoint
def get_lead_score(json_row):
    # wrap each value in a list so the record becomes a one-row SFrame
    json_row = {key: [value] for key, value in json_row.items()}
    client_info = quadratic.transform(gl.SFrame(json_row))
    # score with the random forest model, which was trained on the quadratic features
    client_info['lead_score'] = new_rf_model.predict(client_info, output_type='probability')
    return client_info
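The shape of that endpoint function can be exercised without GraphLab by swapping in stand-in objects. This sketch (all names hypothetical) mirrors the reshape-then-score flow of get_lead_score above:

```python
# Stand-in for the deployed model (hypothetical; returns a fake probability).
class StubModel(object):
    def predict(self, rows):
        return [0.42 for _ in rows]

def get_lead_score_stub(json_row, model):
    # reshape a single JSON record into columns of length-one lists,
    # mirroring the first line of get_lead_score
    columns = {key: [value] for key, value in json_row.items()}
    columns['lead_score'] = model.predict([json_row])
    return columns

out = get_lead_score_stub({'age': 35, 'job': 'technician'}, StubModel())
print(out['lead_score'])  # [0.42]
```

Structuring the handler this way keeps the feature engineering and the model call inside one function, so the REST client only ever sends raw contact fields.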
In [208]:
deployment.update('lead_score', get_lead_score)
In [209]:
deployment.apply_changes()
In [210]:
deployment.get_status()
Out[210]:
Now we can score incoming contacts using the REST endpoint.
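Outside the notebook, a deployed endpoint is typically hit over HTTP with a JSON body. This is a hedged sketch of building such a request payload; the endpoint path and the api_key/data field names are assumptions here, so check the client snippet your own deployment generates:

```python
import json

# Placeholder endpoint and key; the exact URL and fields depend on your deployment.
endpoint = 'https://<your-predictive-service-host>/query/lead_score'
payload = json.dumps({'api_key': '<your-api-key>',
                      'data': {'age': 35, 'job': 'technician'}})
print(payload)
# a real client would POST this body to the endpoint, e.g. with the requests library
```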
In [214]:
# High lead score: 7720, 8070, 7924
# Low lead score: 0, 5000, 7000
deployment.query('lead_score', test[5000])
Out[214]:
In [ ]:
# deployment.terminate_service()
In [ ]: