Forecasting customer churn

Churn prediction is the task of identifying users who are likely to stop using a service, product, or website. In this notebook, you will learn how to train and consume a model that forecasts user churn:

  • Define the boundary at which churn happens.
  • Define a churn period.
  • Train a model using data from the past.
  • Make predictions for probability of churn for each user.

Let's get started!


In [13]:
import graphlab as gl
import datetime
gl.canvas.set_target('ipynb') # make sure plots appear inline

Load previously saved data

In the previous notebook, we saved the data in a binary format. Let us load it back.


In [4]:
interactions_ts = gl.TimeSeries("data/user_activity_data.ts/")
users = gl.SFrame("data/users.sf/")

Training a churn predictor

We define churn as no activity within a given period of time (called the churn_period). That is, a user/customer is said to have churned if a period of activity is followed by no activity for a churn_period (for example, 30 days).

<img src="https://dato.com/learn/userguide/churn_prediction/images/churn-illustration.png" align="left">
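
The churn_period itself is just a datetime.timedelta. As a minimal sketch (the churn_period keyword of gl.churn_predictor.create is an assumption about the API, not shown in this notebook; 30 days matches the default reported in the model summary below):

# The churn window is a plain timedelta; 30 days mirrors the default
# reported in the model summary further below.
churn_window = datetime.timedelta(days=30)
# Hypothetical explicit call (churn_period keyword assumed):
#   gl.churn_predictor.create(train, user_id='CustomerID',
#                             churn_period=churn_window)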


In [7]:
# Time boundary for training: churn is determined by (in)activity in
# the churn_period immediately following this timestamp.
churn_period_oct = datetime.datetime(year = 2011, month = 10, day = 1)

Making a train-validation split

Next, we perform a train-validation split, randomly assigning each user to one of two splits: the first contains all data for a fraction (here 90%) of the users, while the second contains all data for the remaining users. Splitting by user rather than by row ensures the model is validated on users it has never seen during training.


In [8]:
(train, valid) = gl.churn_predictor.random_split(interactions_ts, user_id = 'CustomerID', fraction = 0.9, seed = 12)

In [9]:
print "Users in the training dataset   : %s" % len(train['CustomerID'].unique())
print "Users in the validation dataset : %s" % len(valid['CustomerID'].unique())


Users in the training dataset   : 3899
Users in the validation dataset : 441

Training a churn predictor model


In [10]:
model = gl.churn_predictor.create(train, user_id='CustomerID',
                                  user_data=users, time_boundaries=[churn_period_oct])


PROGRESS: Grouping observation_data by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
InvoiceNo is a categorical variable with too many different values (16841) and will be ignored.
StockCode is a categorical variable with too many different values (3649) and will be ignored.
Description is a categorical variable with too many different values (3845) and will be ignored.
PROGRESS: Generating features at time-boundaries.
PROGRESS: --------------------------------------------------
PROGRESS: Features for 2011-09-30 17:00:00
PROGRESS: Joining user_data with aggregated features.
PROGRESS: --------------------------------------------------
PROGRESS: Training a classifier model.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 3242
Number of classes           : 2
Number of feature columns   : 17
Number of unpacked features : 152
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.016911     | 0.783159          | 0.588051          |
| 2         | 0.031428     | 0.795188          | 0.528285          |
| 3         | 0.046039     | 0.808452          | 0.487217          |
| 4         | 0.063003     | 0.805367          | 0.461014          |
| 5         | 0.076169     | 0.810611          | 0.439372          |
| 6         | 0.095362     | 0.812461          | 0.422827          |
+-----------+--------------+-------------------+-------------------+
Decision tree regression:
--------------------------------------------------------
Number of examples          : 3242
Number of features          : 17
Number of unpacked features : 152
+-----------+--------------+--------------------+---------------+
| Iteration | Elapsed Time | Training-max_error | Training-rmse |
+-----------+--------------+--------------------+---------------+
| 1         | 0.019569     | 0.381705           | 0.224819      |
+-----------+--------------+--------------------+---------------+
PROGRESS: --------------------------------------------------
PROGRESS: Model training complete: Next steps
PROGRESS: --------------------------------------------------
PROGRESS: (1) Evaluate the model at various timestamps in the past:
PROGRESS:       metrics = model.evaluate(data, time_in_past)
PROGRESS: (2) Make a churn forecast for a timestamp in the future:
PROGRESS:       predictions = model.predict(data, time_in_future)

In [11]:
model


Out[11]:
Class                          : ChurnPredictor

Schema
------
Number of observations         : 362700
Number of users                : 3899
Number of feature columns      : 5
Features used                  : ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice']

Parameters
----------
Lookback periods               : [7, 14, 21, 60, 90]
Number of time boundaries      : 1
Time period                    : 1 day, 0:00:00
Churn period                   : 30 days, 0:00:00

Consuming predictions made by the model

The question to ask here is: will a user churn within the churn_period that follows a given timestamp? Since no ground truth exists yet for a window in the future, predictions are validated by choosing a boundary in the past and checking whether each user was in fact active after it, which is exactly what the evaluation section below does. Note that this is customer churn (no activity at all), not usage churn (a drop in activity).


In [12]:
predictions = model.predict(valid, user_data=users)
predictions


PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-12-09 12:08:00
PROGRESS:  End   : 2012-01-08 12:08:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
InvoiceNo is a categorical variable with too many different values (16841) and will be ignored.
StockCode is a categorical variable with too many different values (3649) and will be ignored.
Description is a categorical variable with too many different values (3845) and will be ignored.
PROGRESS: Generating features for boundary 2011-12-09 12:08:00.
PROGRESS: Joining user_data with aggregated features.
Out[12]:
+------------+----------------+
| CustomerID |  probability   |
+------------+----------------+
|   16200    | 0.38116106391  |
|   17383    | 0.885241806507 |
|   15910    | 0.740143716335 |
|   16718    | 0.783465206623 |
|   16222    | 0.783465206623 |
|   16899    | 0.143798291683 |
|   12732    | 0.946555435658 |
|   13194    | 0.946781158447 |
|   14625    | 0.743798315525 |
|   13242    | 0.918005168438 |
+------------+----------------+
[441 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [15]:
predictions['probability'].show()
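
Downstream consumers of these scores usually want a ranked list or a binary flag rather than raw probabilities. A minimal sketch using standard SFrame operations (the 0.5 cutoff is an arbitrary choice, not something the model prescribes):

# Flag users above an arbitrary 0.5 probability cutoff and rank them
# by churn risk; filtering and sorting are plain SFrame operations.
at_risk = predictions[predictions['probability'] > 0.5]
at_risk = at_risk.sort('probability', ascending=False)
print "Users flagged as likely churners: %s" % len(at_risk)
at_risk.print_rows(num_rows=5)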


Evaluating the model

To evaluate, we pick a time boundary in the past (here, October 1) where the ground truth is already known, and compare the model's churn forecasts against what users actually did.


In [16]:
metrics = model.evaluate(valid, user_data=users, time_boundary=churn_period_oct)
metrics


PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-10-01 00:00:00
PROGRESS:  End   : 2011-10-31 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
InvoiceNo is a categorical variable with too many different values (16841) and will be ignored.
StockCode is a categorical variable with too many different values (3649) and will be ignored.
Description is a categorical variable with too many different values (3845) and will be ignored.
PROGRESS: Generating features for boundary 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 66 user(s). 
Out[16]:
{'auc': 0.7990228370663153, 'evaluation_data': Columns:
 	CustomerID	str
 	probability	float
 	label	int
 
 Rows: 375
 
 Data:
 +------------+----------------+-------+
 | CustomerID |  probability   | label |
 +------------+----------------+-------+
 |   16200    | 0.632646918297 |   1   |
 |   15910    | 0.430852562189 |   0   |
 |   16718    | 0.703077316284 |   1   |
 |   16222    | 0.768735051155 |   1   |
 |   16899    | 0.583611965179 |   1   |
 |   12732    | 0.894502520561 |   1   |
 |   13194    | 0.817718148232 |   1   |
 |   14625    | 0.618298172951 |   1   |
 |   13242    | 0.940870046616 |   1   |
 |   15894    | 0.828248143196 |   1   |
 +------------+----------------+-------+
 [375 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'precision': 0.7863501483679525, 'precision_recall_curve': Columns:
 	cutoffs	float
 	precision	float
 	recall	float
 
 Rows: 5
 
 Data:
 +---------+----------------+----------------+
 | cutoffs |   precision    |     recall     |
 +---------+----------------+----------------+
 |   0.1   | 0.743243243243 | 0.996376811594 |
 |   0.25  | 0.763231197772 | 0.992753623188 |
 |   0.5   | 0.786350148368 | 0.960144927536 |
 |   0.75  | 0.912371134021 | 0.641304347826 |
 |   0.9   | 0.953703703704 | 0.373188405797 |
 +---------+----------------+----------------+
 [5 rows x 3 columns], 'recall': 0.9601449275362319, 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-----+----+
 | threshold | fpr | tpr |  p  | n  |
 +-----------+-----+-----+-----+----+
 |    0.0    | 1.0 | 1.0 | 276 | 99 |
 |   1e-05   | 1.0 | 1.0 | 276 | 99 |
 |   2e-05   | 1.0 | 1.0 | 276 | 99 |
 |   3e-05   | 1.0 | 1.0 | 276 | 99 |
 |   4e-05   | 1.0 | 1.0 | 276 | 99 |
 |   5e-05   | 1.0 | 1.0 | 276 | 99 |
 |   6e-05   | 1.0 | 1.0 | 276 | 99 |
 |   7e-05   | 1.0 | 1.0 | 276 | 99 |
 |   8e-05   | 1.0 | 1.0 | 276 | 99 |
 |   9e-05   | 1.0 | 1.0 | 276 | 99 |
 +-----------+-----+-----+-----+----+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
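
The returned dictionary bundles scalar metrics together with full curves stored as SFrames, so a specific operating point can be read off directly. A small sketch (the 0.75 cutoff is simply one of the rows shown above):

# Read the AUC, and the precision/recall trade-off at a stricter
# cutoff, straight out of the evaluation dictionary.
print "AUC : %s" % metrics['auc']
pr_curve = metrics['precision_recall_curve']
print pr_curve[pr_curve['cutoffs'] == 0.75]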

In [17]:
model.save('data/churn_model.mdl')
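
To reuse the model in a later session, it can be loaded back with gl.load_model, the standard GraphLab Create counterpart to model.save:

# Restore the saved model from disk.
loaded_model = gl.load_model('data/churn_model.mdl')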