Know your customer - marketing a new product to customers

Data Source:

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Dataset description from the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Load Data



In [191]:

    
import graphlab as gl



In [192]:

    
# already stored dataset locally as Turi SFrame
train = gl.SFrame('bankCustomerTrain.sf')
test = gl.SFrame('bankCustomerTest.sf')

# data = graphlab.SFrame('s3://' or 'hdfs://')
# data # pySpark RDD or SchemaRDD / Spark DataFrame
# data = graphlab.SFrame.read_json('')
# With a DB: configure ODBC manager / driver on the machine
#    graphlab.connect_odbc?
#    graphlab.from_sql?

Data Dictionary

The original dataset came with the following attribute information:

Field Num	Field Name	Description
1	age	(numeric)
2	job	type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3	marital	marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4	education	(categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5	default	has credit in default? (categorical: 'no','yes','unknown')
6	housing	has housing loan? (categorical: 'no','yes','unknown')
7	loan	has personal loan? (categorical: 'no','yes','unknown')
---	---	---
8	contact	contact communication type (categorical: 'cellular','telephone')
9	month	last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10	day_of_week	last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11	duration	last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
---	---	---
12	campaign	number of contacts performed during this campaign and for this client (numeric, includes last contact)
13	pdays	number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14	previous	number of contacts performed before this campaign and for this client (numeric)
15	poutcome	outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
---	---	---
16	emp.var.rate	employment variation rate - quarterly indicator (numeric)
17	cons.price.idx	consumer price index - monthly indicator (numeric)
18	cons.conf.idx	consumer confidence index - monthly indicator (numeric)
19	euribor3m	euribor 3 month rate - daily indicator (numeric)
20	nr.employed	number of employees - quarterly indicator (numeric)
---	---	---
21	y	has the client subscribed a term deposit? (binary: 'yes','no')

Data Exploration - get a sense of the data with GraphLab Canvas



In [193]:

    
gl.canvas.set_target('browser')
train.show()









    



Canvas is accessible via web browser at the URL: http://localhost:53104/index.html
Opening Canvas in default web browser.

ROI Calculation - how we will measure the effectiveness of our lead scoring model

Before we start, let's assume that each phone call to a contact costs \$1 and that the customer lifetime value for a contact that purchases a term deposit is \$100. Then the ROI for calling all the customers in our training dataset is:



In [194]:

    
def calc_call_roi(contactList, leadScore, percentToCall):
    #assumptions
    costOfCall = 1.00
    custLTV = 100.00
    
    numberCalls = int(len(contactList)*percentToCall)
    if 'lead_score' in contactList.column_names():
        contactList.remove_column('lead_score')
    contactList = contactList.add_column(leadScore,name='lead_score')
    sortedByModel = contactList.sort('lead_score', ascending=False)
    callList = sortedByModel[:numberCalls]
    numSubscriptions = len(callList[callList['y']=='yes']) 
    roi = (numSubscriptions*custLTV - numberCalls*costOfCall) / (numberCalls*costOfCall)
    return roi

Call everyone (assuming you have budget & time), ROI is 9.59%



In [195]:

    
initLeadScores = gl.SArray([1 for _ in test])
initROI = calc_call_roi(test, initLeadScores, 1)
print 'ROI for calling all contacts: ' + '{0:.2f}'.format(initROI) + '%'









    



ROI for calling all contacts: 9.59%

Call only the first 20%, ROI drops to 1.47%



In [196]:

    
initLeadScores = gl.SArray([1 for _ in test])
initROI = calc_call_roi(test, initLeadScores, 0.2)
print 'ROI for calling a 20% subset of contacts: ' + '{0:.2f}'.format(initROI) + '%'









    



ROI for calling a 20% subset of contacts: 1.47%

Modeling Part 1 - Query Data for customer segment - age less than median age of 38.

The SFrame, one of Turi's underlying data structures, allows users to build flexible pipelines. Here we show how to quickly retrieve the percentage of clients who are going to make a deposit along with the percentage of clients who represent the specific demographic of single students.



In [197]:

    
numClients = float(len(train))
numY = gl.Sketch(train['y']).frequency_count('yes')
print "%.2f%% of clients in training set opened deposit accounts." % (numY/numClients*100.0)

medianAge = gl.Sketch(train['age']).quantile(0.5)
numUnderMedianAge = float(len(train[train['age']<medianAge]))
numPurchasingAndUnderMedianAge = sum(train.apply(lambda x: 1 if x['age'] < medianAge 
                                           and x['y'] == 'yes' else 0))
probYGivenUnderMedianAge = numPurchasingAndUnderMedianAge/numUnderMedianAge*100

print "%.2f%% clients with age < %g (median) opened deposit account." % (probYGivenUnderMedianAge, medianAge)









    



11.43% of clients in training set opened deposit accounts.
12.16% clients with age < 38 (median) opened deposit account.

From this analysis we see that a larger percentage of people under 38 opened accounts than overall. So let's target them as leads and measure our ROI.



In [198]:

    
ageTargetingROI = calc_call_roi(test, test['age'].apply(lambda x: 1 if x < medianAge else 0), 0.2)
print 'ROI for age targeted calls to 20% of contacts: ' + '{0:.2f}'.format(ageTargetingROI) + '%'









    



ROI for age targeted calls to 20% of contacts: 15.71%

ROI for age targeted 20% of contacts: 15.71% - big jump over calling everyone and huge jump over random 20% - this is a good start, but we can do better.

Modeling Part 2 - Train a Machine Learning model instead - learn from ALL features, not just age, use GraphLab Create AutoML to choose the most effective classifer model automatically.



In [199]:

    
# remove features that give away results/prediction
features = train.column_names()
features.remove('duration')
features.remove('y')

Turi's classifier toolkit that can help marketers predict if a client is likely to open an account.



In [200]:

    
toolkit_model = gl.classifier.create(train, features = features, target='y')









    




Boosted trees classifier:






    




--------------------------------------------------------






    




Number of examples          : 31519






    




Number of classes           : 2






    




Number of feature columns   : 19






    




Number of unpacked features : 19






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




| Iteration | Elapsed Time | Training-accuracy | Training-log_loss | Validation-accuracy | Validation-log_loss |






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




| 1         | 0.029432     | 0.905041          | 0.515452          | 0.890103            | 0.520484            |






    




| 2         | 0.060089     | 0.906279          | 0.419195          | 0.890746            | 0.428380            |






    




| 3         | 0.090442     | 0.906659          | 0.361340          | 0.889460            | 0.374441            |






    




| 4         | 0.116852     | 0.907453          | 0.324721          | 0.890746            | 0.340268            |






    




| 5         | 0.149560     | 0.908373          | 0.301003          | 0.893316            | 0.318993            |






    




| 6         | 0.180293     | 0.907706          | 0.285728          | 0.892673            | 0.305945            |






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




Random forest classifier:






    




--------------------------------------------------------






    




Number of examples          : 31519






    




Number of classes           : 2






    




Number of feature columns   : 19






    




Number of unpacked features : 19






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




| Iteration | Elapsed Time | Training-accuracy | Training-log_loss | Validation-accuracy | Validation-log_loss |






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




| 1         | 0.024752     | 0.902059          | 0.310494          | 0.889460            | 0.327177            |






    




| 2         | 0.052019     | 0.903709          | 0.308423          | 0.890746            | 0.325088            |






    




| 3         | 0.081392     | 0.904661          | 0.307518          | 0.892031            | 0.324166            |






    




| 4         | 0.107434     | 0.905454          | 0.306875          | 0.892031            | 0.324186            |






    




| 5         | 0.136001     | 0.905390          | 0.306996          | 0.888175            | 0.325083            |






    




| 6         | 0.165595     | 0.905263          | 0.307049          | 0.890746            | 0.324401            |






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




Decision tree classifier:






    




--------------------------------------------------------






    




Number of examples          : 31519






    




Number of classes           : 2






    




Number of feature columns   : 19






    




Number of unpacked features : 19






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




| Iteration | Elapsed Time | Training-accuracy | Training-log_loss | Validation-accuracy | Validation-log_loss |






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




| 1         | 0.029335     | 0.905041          | 0.515452          | 0.890103            | 0.520484            |






    




+-----------+--------------+-------------------+-------------------+---------------------+---------------------+






    




SVM:






    




--------------------------------------------------------






    




Number of examples          : 31519






    




Number of classes           : 2






    




Number of feature columns   : 19






    




Number of unpacked features : 19






    




Number of coefficients    : 53






    




Starting L-BFGS






    




--------------------------------------------------------






    




+-----------+----------+-----------+--------------+-------------------+---------------------+






    




| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |






    




+-----------+----------+-----------+--------------+-------------------+---------------------+






    




| 1         | 4        | 0.000016  | 0.038233     | 0.886100          | 0.877249            |






    




| 2         | 7        | 5.000000  | 0.085126     | 0.893874          | 0.880463            |






    




| 3         | 8        | 5.000000  | 0.112689     | 0.114217          | 0.122751            |






    




| 4         | 10       | 1.000000  | 0.149538     | 0.895650          | 0.881105            |






    




| 5         | 11       | 1.000000  | 0.175710     | 0.890193          | 0.882391            |






    




| 6         | 16       | 2.000000  | 0.238406     | 0.895650          | 0.882391            |






    




+-----------+----------+-----------+--------------+-------------------+---------------------+






    




TERMINATED: Iteration limit reached.






    




This model may not be optimal. To improve it, consider increasing `max_iterations`.






    




Logistic regression:






    




--------------------------------------------------------






    




Number of examples          : 31519






    




Number of classes           : 2






    




Number of feature columns   : 19






    




Number of unpacked features : 19






    




Number of coefficients    : 53






    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, DecisionTreeClassifier, SVMClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.893958866596
PROGRESS: RandomForestClassifier          : 0.888817489147
PROGRESS: DecisionTreeClassifier          : 0.890102803707
PROGRESS: SVMClassifier                   : 0.884319
PROGRESS: LogisticClassifier              : 0.888175
PROGRESS: ---------------------------------------------
PROGRESS: Selecting BoostedTreesClassifier based on validation set performance.






    




--------------------------------------------------------






    




+-----------+----------+--------------+-------------------+---------------------+






    




+-----------+----------+--------------+-------------------+---------------------+






    




| 1         | 2        | 0.151609     | 0.899235          | 0.888817            |






    




| 2         | 3        | 0.237565     | 0.899457          | 0.888175            |






    




| 3         | 4        | 0.308188     | 0.899648          | 0.888175            |






    




| 4         | 5        | 0.381385     | 0.899806          | 0.888175            |






    




| 5         | 6        | 0.462933     | 0.899775          | 0.888175            |






    




| 6         | 7        | 0.543236     | 0.899775          | 0.888175            |






    




+-----------+----------+--------------+-------------------+---------------------+






    




SUCCESS: Optimal solution found.

The toolkit automatically evaluates several types of algorithms, including: Boosted Trees, Random Forests, Decision Trees, Support Vector Machines, Logistic regression - with intelligent default paramters. Based on a validation set, it chooses the most accurate model. We can then evaluate this model on the test dataset.



In [201]:

    
results = toolkit_model.evaluate(test)
print "accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall'])









    



accuracy: 0.904351, precision: 0.627692, recall: 0.237485

This initial model can be considered accurate given that it correctly predicts the purchasing decisions of 90% of the contacts. However, the toolkit model leaves room for improvement. Specifically only 64% of predicted sales actually convert to sales. Furthermore only 24% of actual sales were actually predicted by the model. In order to understand the model we can review the importance of the input features.



In [202]:

    
toolkit_model.get_feature_importance()









    Out[202]:





    
        name
        index
        count
    
    
        age
        None
        95
    
    
        euribor3m
        None
        88
    
    
        campaign
        None
        59
    
    
        pdays
        None
        32
    
    
        contact
        telephone
        18
    
    
        job
        admin.
        16
    
    
        day_of_week
        mon
        16
    
    
        cons.conf.idx
        None
        14
    
    
        education
        high.school
        14
    
    
        nr.employed
        None
        13
    

[62 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

After scoring the list by probability to purchase, the ROI for calling the top 50% of the list is:



In [203]:

    
toolkitLeadScore = toolkit_model.predict(test,output_type='probability')
toolkitROI = calc_call_roi(test, toolkitLeadScore, 0.2 )
print 'ROI for calling 20% of highest predicted contacts: ' + '{0:.2f}'.format(toolkitROI) + '%'









    



ROI for calling 20% of highest predicted contacts: 33.40%

Huge improvement in ROI for 20% called: 32.79% (over 3x improvement from calling everyone, while only calling 20% of the contacts).

Modeling Part 3 - Continued experimentation / iteration - we can continue to tweak the model by generating features, doing experiments, adding different data sources etc.

# One option to explore is using quadratic features to see if interactions between the features have predictive power.
quadratic = gl.feature_engineering.create(train,
                gl.feature_engineering.QuadraticFeatures(features=['campaign',
                                                                   'pdays',
                                                                   'previous',
                                                                   'emp.var.rate',
                                                                   'cons.price.idx',
                                                                   'euribor3m',
                                                                   'nr.employed']))

# Transform the training data.
qTrain = quadratic.transform(train)

qFeatures = qTrain.column_names()
qFeatures.remove('y')
qFeatures.remove('duration')

# We create a boosted trees classifier with the enriched dataset.
new_rf_model = gl.random_forest_classifier.create(qTrain, features = qFeatures, target='y', 
                                                  class_weights='auto', max_depth = 50,
                                                  row_subsample = 0.75, max_iterations = 50, column_subsample=0.5)

results = new_rf_model.evaluate(quadratic.transform(test))
print "accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall'])       

# see which features are most important in this tree model
new_rf_model.get_feature_importance()

# show ROI for experimentation model
rfLeadScore = new_rf_model.predict(test,output_type='probability')
rfROI = calc_call_roi(test, rfLeadScore, 0.1 )
print 'ROI for calling predicted contacts: ' + '{0:.2f}'.format(rfROI) + '%'

Integration Part 1 - Ranked Lists for Marketing / Sales - who should be prioritized to be called next!



In [204]:

    
rfList = test.sort('lead_score', ascending=False)
rfList['lead_score', 'age','campaign','euribor3m','job','loan'].print_rows(num_rows=20)









    



+----------------+-----+----------+-----------+-------------+------+
|   lead_score   | age | campaign | euribor3m |     job     | loan |
+----------------+-----+----------+-----------+-------------+------+
| 0.883403599262 |  48 |    3     |   0.904   |    admin.   |  no  |
| 0.883403599262 |  58 |    1     |    0.9    | blue-collar |  no  |
| 0.883403599262 |  51 |    1     |   0.903   | blue-collar |  no  |
| 0.882390379906 |  61 |    1     |   0.695   | blue-collar | yes  |
| 0.882390379906 |  77 |    1     |   0.682   |   retired   |  no  |
| 0.876977026463 |  53 |    1     |    0.84   | blue-collar |  no  |
| 0.876977026463 |  55 |    1     |   0.802   |    admin.   |  no  |
| 0.866948366165 |  58 |    1     |   0.878   |    admin.   |  no  |
| 0.866948366165 |  63 |    1     |   0.846   |   retired   |  no  |
| 0.863215148449 |  60 |    1     |   0.861   |    admin.   |  no  |
| 0.859958946705 |  66 |    1     |   0.655   |   retired   |  no  |
| 0.859958946705 |  55 |    3     |   0.652   |  unemployed |  no  |
| 0.859958946705 |  64 |    2     |   0.652   |   retired   |  no  |
| 0.858869791031 |  92 |    1     |   0.903   |   retired   | yes  |
| 0.858432531357 |  27 |    3     |   0.899   |    admin.   |  no  |
| 0.858266115189 |  50 |    2     |   0.682   |    admin.   |  no  |
| 0.858266115189 |  60 |    1     |   0.682   |    admin.   |  no  |
| 0.857237398624 |  18 |    1     |   0.677   |   student   |  no  |
| 0.855425596237 |  32 |    2     |   0.884   |    admin.   |  no  |
| 0.852907836437 |  48 |    1     |   0.704   |    admin.   |  no  |
+----------------+-----+----------+-----------+-------------+------+
[8113 rows x 6 columns]

Integration Part 2 - Deploy models as a fault tolerant scalable REST service, so marketing and sales dashboards (SalesForce/Tableau) can easily integrate lead score

We can deploy a real-time model to help the marketers understand potential clients as soon as the contacts come to the bank. Here we deploy on AWS, but Turi also supports hosting models on premise and on Azure.

# define the state path - this is where Turi will store the models, logs, and metadata for this deployment
ps_state_path = 's3://gl-rajat-testing/predictive_service/lead_scoring_app'


# setup your own AWS credentials.
# gl.aws.set_credentials(<key>,<secret key>)

# create an EC2 config - this is how you define the EC2 configuration for the cluster being deployed
ec2 = gl.deploy.Ec2Config(region='us-west-2', instance_type='m3.xlarge')

# use the EC2 config to launch a new Predictive Service
# num_hosts specifies how many machines the Predictive Service cluster has. 
#     You can scale up and down later after initial creation.

deployment = gl.deploy.predictive_service.create(name = 'rajat-lead-scoring-app', 
                                                    ec2_config = ec2, state_path = ps_state_path, num_hosts = 3)



In [205]:

    
ps_state_path = 's3://gl-rajat-testing/predictive_service/lead_scoring_app'
deployment = gl.deploy.predictive_service.load(ps_state_path)









    



2016-05-26 11:51:41,280 [WARNING] graphlab.deploy.predictive_service, 384: Overwriting existing Predictive Service "rajat-lead-scoring-app" in local session.



In [206]:

    
# see the status of and what's deployed on my_deployment
deployment









    Out[206]:





Name                  : rajat-lead-scoring-app
State Path            : s3://gl-rajat-testing/predictive_service/lead_scoring_app
Description           : None
API Key               : 8a0244c4-497b-4969-a18d-3a3bfdfc8fcd
CORS origin           : 
Global Cache State    : enabled
Load Balancer DNS Name: rajat-lead-scoring-app-1226522522.us-west-2.elb.amazonaws.com

Deployed endpoints:
	lead_score [model]

No Pending changes.

Creating a new intelligent service is as simple as defining a Python function (can deploy anything in Python)



In [207]:

    
# inputs and returns of this function map directly to the io of the endpoint for the REST service
def get_lead_score(json_row):
    json_row = {key:[value] for key,value in json_row.items()}
    client_info = quadratic.transform(gl.SFrame(json_row))
    client_info['lead_score'] = toolkit_model.predict(client_info, output_type='probability')
    return client_info



In [208]:

    
deployment.update('lead_score', get_lead_score)









    



2016-05-26 11:52:36,006 [INFO] graphlab.deploy._predictive_service._predictive_service, 1527: Endpoint 'lead_score' is updated. Use apply_changes to deploy all pending changes, or continue other modification.



In [209]:

    
deployment.apply_changes()









    



2016-05-26 11:52:37,246 [INFO] graphlab.deploy._predictive_service._predictive_service, 1733: Persisting endpoint changes.
2016-05-26 11:52:37,260 [INFO] graphlab.util.file_util, 190: Uploading local path /var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv to s3 path: s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3






    



upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/ef811c97-fd83-453f-8b32-656dc5901a7c/objects.bin to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/ef811c97-fd83-453f-8b32-656dc5901a7c/objects.bin
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/dir_archive.ini to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/dir_archive.ini
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/version to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/version
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.sidx to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.sidx
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.frame_idx to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.frame_idx
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/pickle_archive to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/pickle_archive
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/ef811c97-fd83-453f-8b32-656dc5901a7c/dir_archive.ini to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/ef811c97-fd83-453f-8b32-656dc5901a7c/dir_archive.ini
upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.0000 to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/m_6d53cd4bb3428a97.0000
Completed 8 of 9 part(s) with 1 file(s) remaining





    



2016-05-26 11:52:39,066 [INFO] graphlab.util.file_util, 245: Successfully uploaded to s3 path s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3






    



upload: ../../../../var/folders/14/__zdljwj6yq7fn1rs93c8nhm0000gn/T/predictive_object_NVnNJv/91303ef3-eac1-409f-b1b7-baee17e3bce4/objects.bin to s3://gl-rajat-testing/predictive_service/lead_scoring_app/predictive_objects/lead_score/3/91303ef3-eac1-409f-b1b7-baee17e3bce4/objects.bin



In [210]:

    
deployment.get_status()









    Out[210]:





[{u'cache': {u'healthy': True, u'num_keys': 64, u'type': u'cluster'},
  u'dns_name': u'ec2-52-26-115-238.us-west-2.compute.amazonaws.com',
  u'graphlab_service_status': {u'ip-10-0-0-53:10000': {u'reason': None,
    u'status': u'healthy'}},
  u'id': u'i-ecd60631',
  u'models': [{u'lead_score': {u'ip-10-0-0-53:10000': {u'cache_enabled': True,
      u'reason': None,
      u'status': u'LoadSuccessful',
      u'type': u'model',
      u'version': 3}}}],
  u'reason': u'N/A',
  u'state': u'InService',
  u'system': {u'cpu_count': 4,
   u'cpu_usage': [0.3, 0.1, 0.1, 0.0],
   u'disk_usage': {u'root': {u'free': 3858726912,
     u'percent': 48.3,
     u'total': 8320901120,
     u'used': 4015902720},
    u'tmp': {u'free': 37426188288,
     u'percent': 0.1,
     u'total': 39490912256,
     u'used': 51879936}},
   u'memory': {u'active': 804380672,
    u'available': 15022342144,
    u'buffers': 91893760,
    u'cached': 2650820608,
    u'free': 12279627776,
    u'inactive': 2304663552,
    u'percent': 4.8,
    u'total': 15773601792,
    u'used': 3493974016}}},
 {u'cache': {u'healthy': True, u'num_keys': 64, u'type': u'cluster'},
  u'dns_name': u'ec2-52-34-159-226.us-west-2.compute.amazonaws.com',
  u'graphlab_service_status': {u'ip-10-0-0-54:10000': {u'reason': None,
    u'status': u'healthy'}},
  u'id': u'i-ebd60636',
  u'models': [{u'lead_score': {u'ip-10-0-0-54:10000': {u'cache_enabled': True,
      u'reason': None,
      u'status': u'LoadSuccessful',
      u'type': u'model',
      u'version': 3}}}],
  u'reason': u'N/A',
  u'state': u'InService',
  u'system': {u'cpu_count': 4,
   u'cpu_usage': [0.3, 0.1, 0.1, 0.0],
   u'disk_usage': {u'root': {u'free': 3858726912,
     u'percent': 48.3,
     u'total': 8320901120,
     u'used': 4015902720},
    u'tmp': {u'free': 37426192384,
     u'percent': 0.1,
     u'total': 39490912256,
     u'used': 51875840}},
   u'memory': {u'active': 777154560,
    u'available': 15049261056,
    u'buffers': 92016640,
    u'cached': 2651025408,
    u'free': 12306219008,
    u'inactive': 2304712704,
    u'percent': 4.6,
    u'total': 15773601792,
    u'used': 3467382784}}},
 {u'cache': {u'healthy': True, u'num_keys': 64, u'type': u'cluster'},
  u'dns_name': u'ec2-52-26-179-255.us-west-2.compute.amazonaws.com',
  u'graphlab_service_status': {u'ip-10-0-0-52:10000': {u'reason': None,
    u'status': u'healthy'}},
  u'id': u'i-edd60630',
  u'models': [{u'lead_score': {u'ip-10-0-0-52:10000': {u'cache_enabled': True,
      u'reason': None,
      u'status': u'LoadSuccessful',
      u'type': u'model',
      u'version': 3}}}],
  u'reason': u'N/A',
  u'state': u'InService',
  u'system': {u'cpu_count': 4,
   u'cpu_usage': [0.4, 0.1, 0.1, 0.0],
   u'disk_usage': {u'root': {u'free': 3858333696,
     u'percent': 48.3,
     u'total': 8320901120,
     u'used': 4016295936},
    u'tmp': {u'free': 37426196480,
     u'percent': 0.1,
     u'total': 39490912256,
     u'used': 51871744}},
   u'memory': {u'active': 787660800,
    u'available': 15040929792,
    u'buffers': 93360128,
    u'cached': 2651185152,
    u'free': 12296384512,
    u'inactive': 2305101824,
    u'percent': 4.6,
    u'total': 15773601792,
    u'used': 3477217280}}}]

Now we can score incoming contacts using the REST endpoint.



In [214]:

    
# High lead score: 7720, 8070, 7924
# Low lead score: 0, 5000, 7000
deployment.query('lead_score', test[5000])









    Out[214]:





{u'from_cache': False,
 u'model': u'lead_score',
 u'response': [{u'age': 29,
   u'campaign': 2,
   u'cons.conf.idx': -42.0,
   u'cons.price.idx': 93.2,
   u'contact': u'cellular',
   u'day_of_week': u'wed',
   u'default': u'no',
   u'duration': 117,
   u'education': u'basic.9y',
   u'emp.var.rate': -0.1,
   u'euribor3m': 4.12,
   u'housing': u'yes',
   u'job': u'management',
   u'lead_score': 0.0692317932844162,
   u'loan': u'no',
   u'marital': u'married',
   u'month': u'nov',
   u'nr.employed': 5195,
   u'pdays': 999,
   u'poutcome': u'nonexistent',
   u'previous': 0,
   u'quadratic_features': {u'campaign, campaign': 4,
    u'campaign, cons.price.idx': 186,
    u'campaign, emp.var.rate': 0,
    u'campaign, euribor3m': 8,
    u'campaign, nr.employed': 10390,
    u'campaign, pdays': 1998,
    u'campaign, previous': 0,
    u'cons.price.idx, cons.price.idx': 8686.24,
    u'cons.price.idx, emp.var.rate': -9.32,
    u'cons.price.idx, euribor3m': 383.98400000000004,
    u'cons.price.idx, nr.employed': 484174.0,
    u'cons.price.idx, pdays': 93106.8,
    u'cons.price.idx, previous': 0.0,
    u'emp.var.rate, emp.var.rate': 0.010000000000000002,
    u'emp.var.rate, euribor3m': -0.41200000000000003,
    u'emp.var.rate, nr.employed': -519.5,
    u'emp.var.rate, pdays': -99.9,
    u'emp.var.rate, previous': -0.0,
    u'euribor3m, euribor3m': 16.9744,
    u'euribor3m, nr.employed': 21403.4,
    u'euribor3m, pdays': 4115.88,
    u'euribor3m, previous': 0.0,
    u'nr.employed, nr.employed': 26988025,
    u'nr.employed, pdays': 5189805,
    u'nr.employed, previous': 0,
    u'pdays, pdays': 998001,
    u'pdays, previous': 0,
    u'previous, previous': 0},
   u'y': u'no'}],
 u'uuid': u'b2119dd8-4aca-4a02-b356-6b527cd70ff7',
 u'version': 3}



In [ ]:

    
# deployment.terminate_service()

Summary: With Machine Learning and leveraging our existing historical customer data we can prioritize which customers have the largest propensity to buy a new product.

Using Turi's Platform a development team can easily implement a lead scoring model and deploy it as a REST API for integration into Marketing tools and Dashboards



In [ ]:

name	index	count
age	None	95
euribor3m	None	88
campaign	None	59
pdays	None	32
contact	telephone	18
job	admin.	16
day_of_week	mon	16
cons.conf.idx	None	14
education	high.school	14
nr.employed	None	13