1. Introduction

The scenario: suppose we run an online travel agency. We would like to convince our users to book overseas vacations rather than domestic ones. Every user in this dataset books something by the end of a given trial period, i.e. we are looking only at engaged customers.

Goals:

  1. predict which new users are most likely to book an overseas trip,
  2. generate segmentation rules to group similar users based on their features and propensity to convert.

Data: mimics the Airbnb challenge on Kaggle.

  • Users
  • Website or app sessions

I've simulated data that's very similar in terms of features and distributions, but I've added timestamps to the sessions, and changed the target from country to a binary domestic vs. international variable.

Sections:

  1. Introduction
  2. The basic scenario - account data only
  3. What's happening under the hood?
  4. Incorporating activity data

In [ ]:
from __future__ import print_function
import graphlab as gl

2. The basic scenario

Import the data: sales accounts

  • Sales accounts need not be synonymous with users, although that is the case here. At Turi, our sales accounts consist of a mix of individual users, companies, and teams within large companies.

  • The accounts dataset typically comes from a customer relationship management (CRM) tool, like Salesforce, SAP, or Hubspot. In practice there is an extra step here of extracting the data from that system into an SFrame.


In [ ]:
users = gl.SFrame('synthetic_airbnb_users.sfr')
users.print_rows(3)

In [ ]:
users['status'].sketch_summary()

Encode the target variable

There are three types of accounts:

  • Successful accounts, i.e. conversions, are coded as 1.
  • Failed accounts are coded as -1.
  • Open accounts, i.e. accounts where no decision has been reached yet, are coded as 0.

Together, successful and failed accounts constitute the training accounts.


In [ ]:
status_code = {'international': 1,
               'domestic': -1,
               'new': 0}

users['outcome'] = users['status'].apply(lambda x: status_code[x])
users[['status', 'outcome']].print_rows(10)

Define the schema

In a complex problem like lead scoring, there are potentially many columns with "meaning". To help the lead scoring tool recognize these columns, we define a dictionary that maps standard lead scoring inputs to the columns in our particular dataset.


In [ ]:
user_schema = {'conversion_status': 'outcome',
               'account_id': 'id',
               'features': ['gender', 'age', 'signup_method', 'signup_app',
                            'first_device_type', 'first_browser']}

Create the lead scoring tool

All accounts are passed to the tool when it's created. There is no separate predict method.

  • We typically want to score the same set of open accounts each day during the trial period.
  • Very rarely do we want to predict lead scores for different accounts.
  • It makes more sense to keep the open accounts in the model, so we can incrementally update the lead scores and market segments, as new data comes in.
  • The update method is not yet implemented :(

In [ ]:
scorer = gl.lead_scoring.create(users, user_schema)

Retrieve the model output and export

There's a lot of stuff in the lead scoring model's summary. Let's focus on the accessible fields, three in particular:

  • open_account_scores: conversion probability and market segment for open accounts
  • training_account_scores: conversion probability and market segment for existing successes and failures
  • segment_descriptions: definitions and summary statistics for the market segments

In [ ]:
print(scorer)

In [ ]:
scorer.open_account_scores.head(3)

In [ ]:
scorer.open_account_scores.topk('conversion_prob', k=3)

In [ ]:
scorer.training_account_scores.head(3)

In [ ]:
scorer.segment_descriptions.head(3)

In [ ]:
scorer.segment_descriptions[['segment_id', 'segment_features']].print_rows(max_column_width=65)

To get the training or open accounts that belong to a particular market segment, use the respective SFrame's filter_by method.


In [ ]:
seg = scorer.training_account_scores.filter_by(8, 'segment_id').head(3)
print(seg)

3. What's happening under the hood?

The scoring model: gradient boosted trees


In [ ]:
print(scorer.scoring_model)

Additional keyword arguments to the lead scoring create function are passed through to the gradient boosted trees model.


In [ ]:
scorer2 = gl.lead_scoring.create(users, user_schema, max_iterations=20, verbose=False)
print("Original num trees:", scorer.scoring_model.num_trees)
print("New num trees:", scorer2.scoring_model.num_trees)

Validating the scoring model

By default, the gradient boosted trees model withholds ??? percent of the training accounts as a validation set. The validation accuracy is accessible as a model field.


In [ ]:
print("Validation accuracy:", scorer.scoring_model.validation_accuracy)

The segmentation model: decision tree


In [ ]:
print(scorer.segmentation_model)

Because training the lead scoring tool can take some time with large datasets, the number of segments can be changed after a lead scoring tool has been created. This function creates a new model; the original model is immutable.


In [ ]:
scorer2 = scorer.resize_segmentation_model(max_segments=20)

print("original number of segments:", scorer.segment_descriptions.num_rows())
print("new number of segments:", scorer2.segment_descriptions.num_rows())

4. Incorporating activity data

Account activity data describes interactions between accounts and aspects of your business, like web assets, email campaigns, or products. Conceptually, each interaction involves at a minimum:

  • an account
  • a timestamp

Interactions may also have:

  • an "item"
  • a user
  • other features
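
Concretely, a single interaction might look like the following record. This is a plain-Python sketch with invented values; the field names mirror the session columns used in this scenario:

```python
import datetime as dt

# A minimal interaction record: an account and a timestamp are required;
# an "item" (and, in other scenarios, a user) is optional.
interaction = {
    'user_id': 'a1',                        # the account
    'timestamp': dt.datetime(2015, 6, 2),   # when the interaction happened
    'action_detail': 'view_search_results', # the "item" interacted with
}
```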

In [ ]:
sessions = gl.SFrame('synthetic_airbnb_sessions.sfr')
sessions = gl.TimeSeries(sessions, index='timestamp')
sessions.head(5)

As with the accounts table, we need to indicate what the columns in the activity table mean. If we had a column indicating which user was involved, we could specify it here as well. In this scenario, users are not distinct from accounts.


In [ ]:
session_schema = {'account_id': 'user_id',
                  'item': 'action_detail'}

Define relevant dates

To use account activity data, a lead scoring tool needs to know the time window for each account's relevant interactions. There are three key dates for each account.

  • open date: when a new sales account was created
  • close date: when the trial period ends for a new sales account
  • decision date: when a training account reached its final decision, either success (conversion) or failure. This may be before or after the close date.

The trial duration is the difference between the open date and the close date. The lead scoring tool in GLC assumes this is fixed for all accounts, but in general this need not be the case.

Open accounts do not yet have a decision date, by definition. They may or may not still be within the trial period.
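
The relationship between the three dates can be sketched in plain Python. The dates here are invented for illustration; only the standard library is used:

```python
import datetime as dt

trial_duration = dt.timedelta(days=30)   # assumed fixed for all accounts

open_date = dt.datetime(2015, 6, 1)      # when the account was created
close_date = open_date + trial_duration  # when the trial period ends

# A training account's decision date may fall before or after the close date.
decision_date = dt.datetime(2015, 6, 20)
decided_within_trial = decision_date <= close_date
```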


In [ ]:
user_schema.update({'open_date': 'date_account_created',
                    'decision_date': 'booking_date'})

The trial duration is represented by an instance of the datetime package's timedelta class.

Create the lead scoring tool


In [ ]:
import datetime as dt

scorer3 = gl.lead_scoring.create(users, user_schema,
                                 sessions, session_schema,
                                 trial_duration=dt.timedelta(days=30))

In [ ]:
print(scorer3)

Under the hood: date-based data validation

Invalid accounts have a decision date earlier than their open date. Since that is impossible, these accounts are simply dropped from the set of training accounts.


In [ ]:
invalid_ids = scorer3.invalid_accounts
print(invalid_ids)

invalid_accounts = users.filter_by(invalid_ids, 'id')
invalid_accounts[['id', 'date_account_created', 'booking_date']].print_rows(3)

Implicit failure accounts are accounts that are open, but have been open for so long they are extremely unlikely to convert.

  • The threshold for implicit failure is the 95th percentile of the time it took training accounts to reach a decision, or the trial period duration, whichever is longer.

  • Implicit failures are included in both the training and open account output: they are used to train the scoring and segmentation models, but are technically still open.

  • The user doesn't have to explicitly specify failure accounts - the model can do that automatically.
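
The threshold rule can be sketched in plain Python. The decision times and the nearest-rank percentile helper here are illustrative, not the tool's actual implementation:

```python
import datetime as dt

trial_duration = dt.timedelta(days=30)

# Hypothetical time-to-decision (in days) for ten training accounts.
days_to_decision = [5, 12, 18, 25, 33, 41, 47, 52, 60, 75]

def percentile(values, q):
    """Nearest-rank percentile; sufficient for a sketch."""
    vals = sorted(values)
    k = min(len(vals) - 1, max(0, int(round(q / 100.0 * len(vals))) - 1))
    return vals[k]

p95 = dt.timedelta(days=percentile(days_to_decision, 95))

# The implicit-failure threshold is the longer of the 95th percentile
# of time-to-decision and the trial duration.
threshold = max(p95, trial_duration)

# An open account that has been open longer than the threshold is
# treated as an implicit failure.
account_age = dt.timedelta(days=90)
is_implicit_failure = account_age > threshold
```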


In [ ]:
print(scorer3.num_implicit_failures)

Under the hood: activity-based feature engineering

The lead scoring tool constructs account-level features based on the number of interactions, items, and users (not applicable in this scenario) per day that the accounts are open (up to the maximum of the trial duration). The names of these features are accessible as a model field.
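
The flavor of this aggregation can be sketched with plain Python dicts. The session records, feature names, and per-day normalization here are assumptions for illustration, not the tool's exact output:

```python
import datetime as dt

trial_duration = dt.timedelta(days=30)

# Hypothetical session records: (account_id, item, timestamp).
sessions = [
    ('a1', 'view_search_results', dt.datetime(2015, 6, 2)),
    ('a1', 'wishlist_content_update', dt.datetime(2015, 6, 3)),
    ('a1', 'view_search_results', dt.datetime(2015, 6, 10)),
    ('a2', 'view_search_results', dt.datetime(2015, 6, 5)),
]
open_dates = {'a1': dt.datetime(2015, 6, 1), 'a2': dt.datetime(2015, 6, 4)}
today = dt.datetime(2015, 6, 15)

features = {}
for acct, opened in open_dates.items():
    # Days the account has been open, capped at the trial duration.
    days_open = min(today - opened, trial_duration).days
    acct_sessions = [s for s in sessions if s[0] == acct]
    features[acct] = {
        'interactions_per_day': len(acct_sessions) / float(days_open),
        'unique_items_per_day': len({s[1] for s in acct_sessions}) / float(days_open),
    }
```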


In [ ]:
scorer3.final_features

The values for these features are included in the primary model outputs (training_account_scores and open_account_scores).


In [ ]:
scorer3.open_account_scores.print_rows(3)

The activity-based features are also used to define market segments.


In [ ]:
cols = ['segment_features', 'median_conversion_prob', 'num_training_accounts']
scorer3.segment_descriptions[cols].print_rows(max_row_width=80, max_column_width=60)

Results: improved validation accuracy


In [ ]:
print("Account-only validation accuracy:", scorer.scoring_model.validation_accuracy)
print("Validation accuracy including activity features:", scorer3.scoring_model.validation_accuracy)