The scenario: suppose we run an online travel agency, and we would like to convince our users to book overseas vacations rather than domestic ones. Every user in this dataset books something by the end of a given trial period, i.e. we are looking only at engaged customers.
Goals:
Data: mimics the AirBnB challenge on Kaggle.
I've simulated data that's very similar in terms of features and distributions, but I've added timestamps to the sessions, and changed the target from country to a binary domestic vs. international variable.
Sections:
In [ ]:
from __future__ import print_function
import graphlab as gl
Sales accounts need not be synonymous with users, although that is the case here. At Turi, our sales accounts consist of a mix of individual users, companies, and teams within large companies.
The accounts dataset typically comes from a customer relationship management (CRM) tool, like Salesforce, SAP, or Hubspot. In practice there is an extra step here of extracting the data from that system into an SFrame.
In [ ]:
users = gl.SFrame('synthetic_airbnb_users.sfr')
users.print_rows(3)
In [ ]:
users['status'].sketch_summary()
There are three types of accounts: successful (international bookings), failed (domestic bookings), and new (still open). Together, the successful and failed accounts constitute the training accounts.
In [ ]:
status_code = {'international': 1,
               'domestic': -1,
               'new': 0}
users['outcome'] = users['status'].apply(lambda x: status_code[x])
users[['status', 'outcome']].print_rows(10)
In a complex problem like lead scoring, there are potentially many columns with "meaning". To help the lead scoring tool recognize these columns, we define a dictionary that maps standard lead scoring inputs to the columns in our particular dataset.
In [ ]:
user_schema = {'conversion_status': 'outcome',
               'account_id': 'id',
               'features': ['gender', 'age', 'signup_method', 'signup_app',
                            'first_device_type', 'first_browser']}
All accounts are passed to the tool when it's created; there is no separate predict method. An update method is not yet implemented :(
In [ ]:
scorer = gl.lead_scoring.create(users, user_schema)
There's a lot of information in the lead scoring model's summary. Let's focus on three accessible fields in particular:
In [ ]:
print(scorer)
In [ ]:
scorer.open_account_scores.head(3)
In [ ]:
scorer.open_account_scores.topk('conversion_prob', k=3)
In [ ]:
scorer.training_account_scores.head(3)
In [ ]:
scorer.segment_descriptions.head(3)
In [ ]:
scorer.segment_descriptions[['segment_id', 'segment_features']].print_rows(max_column_width=65)
To get the training or open accounts that belong to a particular market segment, use the respective SFrame's filter_by method.
In [ ]:
seg = scorer.training_account_scores.filter_by(8, 'segment_id').head(3)
print(seg)
In [ ]:
print(scorer.scoring_model)
Additional keyword arguments to the lead scoring create function are passed through to the gradient boosted trees model.
In [ ]:
scorer2 = gl.lead_scoring.create(users, user_schema, max_iterations=20, verbose=False)
print("Original num trees:", scorer.scoring_model.num_trees)
print("New num trees:", scorer2.scoring_model.num_trees)
By default, the gradient boosted trees model withholds ??? percent of the training accounts as a validation set. The validation accuracy can be accessed as a model field.
In [ ]:
print("Validation accuracy:", scorer.scoring_model.validation_accuracy)
In [ ]:
print(scorer.segmentation_model)
Because training the lead scoring tool can take some time on large datasets, the number of segments can be changed after the tool has been created. This creates a new model; the original model is immutable.
In [ ]:
scorer2 = scorer.resize_segmentation_model(max_segments=20)
print("original number of segments:", scorer.segment_descriptions.num_rows())
print("new number of segments:", scorer2.segment_descriptions.num_rows())
Account activity data describes interactions between accounts and aspects of your business, like web assets, email campaigns, or products. Conceptually, each interaction involves at a minimum:
Interactions may also have:
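As a rough illustration of the shape of one interaction (field names mirror this dataset's columns; the values are made up), a minimal record pairs an account with an item, optionally stamped with when it happened:

```python
import datetime as dt

# Hypothetical single interaction record: which account touched which
# item, and when. Only the account and item are strictly required.
interaction = {
    "user_id": "a1b2c3",                            # the account involved
    "action_detail": "view_search_results",         # the item interacted with
    "timestamp": dt.datetime(2014, 7, 1, 12, 30),   # optional: when it happened
}
```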
In [ ]:
sessions = gl.SFrame('synthetic_airbnb_sessions.sfr')
sessions = gl.TimeSeries(sessions, index='timestamp')
sessions.head(5)
As with the accounts table, we need to indicate what each column in the activity table means. If we had a column indicating which user was involved, we could specify that here as well, but in this scenario we don't have users that are distinct from accounts.
In [ ]:
session_schema = {'account_id': 'user_id',
                  'item': 'action_detail'}
To use account activity data, a lead scoring tool needs to know the time window for each account's relevant interactions. There are three key dates for each account.
The trial duration is the difference between the open date and the close date. The lead scoring tool in GraphLab Create assumes this is fixed for all accounts, but in general this need not be the case.
Open accounts do not have a decision date yet, by definition. They may or may not still be within the trial period.
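The relationship between these dates can be sketched with plain datetime arithmetic (the dates below are made up for illustration; the comments reference this dataset's column names):

```python
import datetime as dt

# Hedged sketch of the key dates for one hypothetical account.
open_date = dt.datetime(2014, 1, 1)       # date_account_created
decision_date = dt.datetime(2014, 1, 21)  # booking_date; None for open accounts
trial_duration = dt.timedelta(days=30)

close_date = open_date + trial_duration   # when the trial window ends
time_to_decision = decision_date - open_date

print(close_date)        # 2014-01-31 00:00:00
print(time_to_decision)  # 20 days, 0:00:00
```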
In [ ]:
user_schema.update({'open_date': 'date_account_created',
                    'decision_date': 'booking_date'})
The trial duration is represented by an instance of the datetime module's timedelta class.
In [ ]:
import datetime as dt
scorer3 = gl.lead_scoring.create(users, user_schema,
                                 sessions, session_schema,
                                 trial_duration=dt.timedelta(days=30))
In [ ]:
print(scorer3)
Invalid accounts have a decision date earlier than their open date. Since this is impossible, these accounts are simply dropped from the set of training accounts.
In [ ]:
invalid_ids = scorer3.invalid_accounts
print(invalid_ids)
invalid_accounts = users.filter_by(invalid_ids, 'id')
invalid_accounts[['id', 'date_account_created', 'booking_date']].print_rows(3)
Implicit failure accounts are accounts that are open, but have been open for so long they are extremely unlikely to convert.
The threshold for implicit failure is the 95th percentile of the time it took training accounts to reach a decision, or the trial period duration, whichever is longer.
Implicit failures are included in both the training and open account output: they are used to train the scoring and segmentation models, but are technically still open.
The user doesn't have to explicitly specify failure accounts - the model can do that automatically.
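The threshold rule above can be sketched in plain Python (the helper name is hypothetical, not part of the tool's API, and the nearest-rank percentile here is one of several ways the 95th percentile could be computed):

```python
import datetime as dt

def implicit_failure_threshold(decision_times, trial_duration):
    """Nearest-rank 95th percentile of time-to-decision, floored at the
    trial duration. Hypothetical helper, for illustration only."""
    ordered = sorted(decision_times)
    n = len(ordered)
    rank = (95 * n + 99) // 100   # ceil(0.95 * n), exact integer math
    p95 = ordered[rank - 1]
    return max(p95, trial_duration)

# Training accounts that took 1..100 days to reach a decision:
times = [dt.timedelta(days=d) for d in range(1, 101)]
print(implicit_failure_threshold(times, dt.timedelta(days=30)))  # 95 days, 0:00:00
```

With a 30-day trial, the 95-day percentile dominates; if decisions were all faster than the trial, the trial duration itself would be the threshold.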
In [ ]:
print(scorer3.num_implicit_failures)
The lead scoring tool constructs account-level features based on the number of interactions, items, and users (not applicable in this scenario) per day that the accounts are open (up to the maximum of the trial duration). The names of these features are accessible as a model field.
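A plain-Python sketch of how such per-day features might be computed for one account (function and feature names here are hypothetical; the tool's actual feature engineering may differ):

```python
import datetime as dt

def activity_features(interactions, open_date, as_of, trial_duration):
    """Per-day activity counts for one account, with the 'days open'
    denominator capped at the trial duration. Illustrative only."""
    days_open = min(as_of - open_date, trial_duration).days
    days_open = max(days_open, 1)  # avoid division by zero on day 0
    distinct_items = {item for _, item in interactions}
    return {
        "interactions_per_day": len(interactions) / float(days_open),
        "items_per_day": len(distinct_items) / float(days_open),
    }

# Ten 'search' events over an account's first 20 days:
events = [(dt.datetime(2014, 7, d), "search") for d in range(1, 11)]
feats = activity_features(events, dt.datetime(2014, 7, 1),
                          dt.datetime(2014, 7, 21), dt.timedelta(days=30))
print(feats)  # {'interactions_per_day': 0.5, 'items_per_day': 0.05}
```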
In [ ]:
scorer3.final_features
The values for these features are included in the primary model outputs (training_account_scores and open_account_scores).
In [ ]:
scorer3.open_account_scores.print_rows(3)
The activity-based features are also used to define market segments.
In [ ]:
cols = ['segment_features', 'median_conversion_prob', 'num_training_accounts']
scorer3.segment_descriptions[cols].print_rows(max_row_width=80, max_column_width=60)
In [ ]:
print("Account-only validation accuracy:", scorer.scoring_model.validation_accuracy)
print("Validation accuracy including activity features:", scorer3.scoring_model.validation_accuracy)