Forecasting customer churn

Churn prediction is the task of identifying users who are likely to stop using a service, product, or website. In this notebook, you will learn how to train and consume a model to forecast user churn:

Define the boundary at which churn happens.
Define a churn period.
Train a model using data from the past.
Predict the probability of churn for each user.

Let's get started!



In [13]:

    
import graphlab as gl
import datetime
gl.canvas.set_target('ipynb') # make sure plots appear inline

Load previously saved data

In the previous notebook, we saved the data in a binary format. Let us load it back.



In [4]:

    
interactions_ts = gl.TimeSeries("data/user_activity_data.ts/")
users = gl.SFrame("data/users.sf/")

Training a churn predictor

We define churn as no activity within a period of time (called the churn_period). Hence, a user/customer is said to have churned if a period of activity is followed by no activity for a full churn_period (for example, 30 days).



In [7]:

    
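# Despite its name, this is the time boundary at which the churn period
# starts, not the length of the churn period.
churn_period_oct = datetime.datetime(year=2011, month=10, day=1)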
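
To make the definition concrete, here is a minimal sketch in plain Python of how a churn label could be derived for one user. It is an illustration only, not GraphLab Create's implementation: the function `has_churned`, the half-open window `[boundary, boundary + churn_period)`, and the two activity logs are assumptions made for this example.

```python
import datetime


def has_churned(activity_times, boundary, churn_period):
    """Return True if the user was active before `boundary` but has no
    activity in the window [boundary, boundary + churn_period)."""
    window_end = boundary + churn_period
    active_before = any(t < boundary for t in activity_times)
    active_during = any(boundary <= t < window_end for t in activity_times)
    return active_before and not active_during


boundary = datetime.datetime(2011, 10, 1)
churn_period = datetime.timedelta(days=30)

# Hypothetical activity logs for two made-up users.
returning_user = [datetime.datetime(2011, 9, 12), datetime.datetime(2011, 10, 20)]
lapsed_user = [datetime.datetime(2011, 8, 3), datetime.datetime(2011, 11, 15)]

print(has_churned(returning_user, boundary, churn_period))  # False
print(has_churned(lapsed_user, boundary, churn_period))     # True
```

The second user counts as churned even though they come back on 15 November, because that activity falls after the 30-day window has closed.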

Making a train-validation split

Next, we perform a train-validation split. The data is split randomly by user, so that one split contains all the data for a fraction of the users and the other split contains all the data for the remaining users.



In [8]:

    
(train, valid) = gl.churn_predictor.random_split(
    interactions_ts, user_id='CustomerID', fraction=0.9, seed=12)
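
The idea behind a user-level split can be sketched in plain Python as follows. This is a simplified illustration, not how `gl.churn_predictor.random_split` is implemented: the helper `split_by_user` and the made-up rows are assumptions for this example. The point is that the random draw happens once per user rather than once per row, so no user ends up in both splits.

```python
import random


def split_by_user(rows, user_key, fraction, seed):
    """Split `rows` (a list of dicts) so that all rows of a given user land
    in the same split; roughly `fraction` of the users go to the first."""
    rng = random.Random(seed)
    users = sorted(set(row[user_key] for row in rows))
    first_users = set(u for u in users if rng.random() < fraction)
    first = [row for row in rows if row[user_key] in first_users]
    second = [row for row in rows if row[user_key] not in first_users]
    return first, second


# Hypothetical interaction rows: 100 made-up users with 3 rows each.
rows = [{'CustomerID': str(u), 'Quantity': q}
        for u in range(100) for q in (1, 2, 3)]
train_rows, valid_rows = split_by_user(rows, 'CustomerID', fraction=0.9, seed=12)

train_users = set(r['CustomerID'] for r in train_rows)
valid_users = set(r['CustomerID'] for r in valid_rows)
print(len(train_users & valid_users))  # 0: no user appears in both splits
```

The user counts printed by the next cell (3899 and 441) put about 90% of the users in the training split, in line with `fraction=0.9`.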



In [9]:

    
print("Users in the training dataset   : %s" % len(train['CustomerID'].unique()))
print("Users in the validation dataset : %s" % len(valid['CustomerID'].unique()))

Users in the training dataset   : 3899
Users in the validation dataset : 441

Training a churn predictor model



In [10]:

    
model = gl.churn_predictor.create(train, user_id='CustomerID',
                                  user_data=users,
                                  time_boundaries=[churn_period_oct])

PROGRESS: Grouping observation_data by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
InvoiceNo is a categorical variable with too many different values (16841) and will be ignored.
StockCode is a categorical variable with too many different values (3649) and will be ignored.
Description is a categorical variable with too many different values (3845) and will be ignored.
PROGRESS: Generating features at time-boundaries.
PROGRESS: --------------------------------------------------
PROGRESS: Features for 2011-09-30 17:00:00
PROGRESS: Joining user_data with aggregated features.
PROGRESS: --------------------------------------------------
PROGRESS: Training a classifier model.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 3242
Number of classes           : 2
Number of feature columns   : 17
Number of unpacked features : 152
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.016911     | 0.783159          | 0.588051          |
| 2         | 0.031428     | 0.795188          | 0.528285          |
| 3         | 0.046039     | 0.808452          | 0.487217          |
| 4         | 0.063003     | 0.805367          | 0.461014          |
| 5         | 0.076169     | 0.810611          | 0.439372          |
| 6         | 0.095362     | 0.812461          | 0.422827          |
+-----------+--------------+-------------------+-------------------+
Decision tree regression:
--------------------------------------------------------
Number of examples          : 3242
Number of features          : 17
Number of unpacked features : 152
+-----------+--------------+--------------------+---------------+
| Iteration | Elapsed Time | Training-max_error | Training-rmse |
+-----------+--------------+--------------------+---------------+
| 1         | 0.019569     | 0.381705           | 0.224819      |
+-----------+--------------+--------------------+---------------+
PROGRESS: --------------------------------------------------
PROGRESS: Model training complete: Next steps
PROGRESS: --------------------------------------------------
PROGRESS: (1) Evaluate the model at various timestamps in the past:
PROGRESS:       metrics = model.evaluate(data, time_in_past)
PROGRESS: (2) Make a churn forecast for a timestamp in the future:
PROGRESS:       predictions = model.predict(data, time_in_future)



In [11]:

    
model

    Out[11]:

Class                          : ChurnPredictor

Schema
------
Number of observations         : 362700
Number of users                : 3899
Number of feature columns      : 5
Features used                  : ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice']

Parameters
----------
Lookback periods               : [7, 14, 21, 60, 90]
Number of time boundaries      : 1
Time period                    : 1 day, 0:00:00
Churn period                   : 30 days, 0:00:00
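
The summary lists lookback periods of 7, 14, 21, 60 and 90 days. One plausible way to picture them is as windows ending at the time boundary, over which each user's activity is aggregated into features. The sketch below counts a user's events in each window. It is only an assumption about the general idea, not a description of the features GraphLab Create actually computes: the helper `lookback_counts` and the timestamps are made up for this example.

```python
import datetime


def lookback_counts(activity_times, boundary, lookback_days):
    """For each lookback length, count the events that fall in the window
    [boundary - days, boundary)."""
    counts = {}
    for days in lookback_days:
        start = boundary - datetime.timedelta(days=days)
        counts[days] = sum(1 for t in activity_times if start <= t < boundary)
    return counts


boundary = datetime.datetime(2011, 10, 1)

# Hypothetical purchase timestamps for one made-up user.
activity = [datetime.datetime(2011, 7, 20),
            datetime.datetime(2011, 9, 5),
            datetime.datetime(2011, 9, 28)]

counts = lookback_counts(activity, boundary, [7, 14, 21, 60, 90])
for days in sorted(counts):
    print("last %2d days: %d event(s)" % (days, counts[days]))
```

Shorter windows capture recent behaviour, while longer ones capture the user's overall level of activity.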

Consuming predictions made by the model

The question the model answers is whether a user will churn during the churn period that follows a given point in time. To validate a prediction, we can check whether the user was active during that period. Note that churn here refers to a lapse in activity, as defined above, and should not be confused with an expiration time.



In [12]:

    
predictions = model.predict(valid, user_data=users)
predictions

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-12-09 12:08:00
PROGRESS:  End   : 2012-01-08 12:08:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
InvoiceNo is a categorical variable with too many different values (16841) and will be ignored.
StockCode is a categorical variable with too many different values (3649) and will be ignored.
Description is a categorical variable with too many different values (3845) and will be ignored.
PROGRESS: Generating features for boundary 2011-12-09 12:08:00.
PROGRESS: Joining user_data with aggregated features.

    Out[12]:

+------------+----------------+
| CustomerID |  probability   |
+------------+----------------+
|   16200    | 0.38116106391  |
|   17383    | 0.885241806507 |
|   15910    | 0.740143716335 |
|   16718    | 0.783465206623 |
|   16222    | 0.783465206623 |
|   16899    | 0.143798291683 |
|   12732    | 0.946555435658 |
|   13194    | 0.946781158447 |
|   14625    | 0.743798315525 |
|   13242    | 0.918005168438 |
+------------+----------------+
[441 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [15]:

    
predictions['probability'].show()
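
A common way to consume such predictions is to flag the users whose churn probability exceeds some threshold, for example to target them with a retention campaign. The sketch below does this in plain Python on the ten rows printed above. The helper `users_at_risk` and the threshold of 0.9 are assumptions chosen for illustration, and treating a probability at or above the threshold as "at risk" is a convention picked for this example.

```python
# First ten (CustomerID, probability) pairs from the prediction output above.
predictions_head = [
    ('16200', 0.38116106391), ('17383', 0.885241806507),
    ('15910', 0.740143716335), ('16718', 0.783465206623),
    ('16222', 0.783465206623), ('16899', 0.143798291683),
    ('12732', 0.946555435658), ('13194', 0.946781158447),
    ('14625', 0.743798315525), ('13242', 0.918005168438),
]


def users_at_risk(pairs, threshold):
    """Return the (user, probability) pairs at or above `threshold`,
    ordered from highest to lowest probability."""
    flagged = [(user, p) for user, p in pairs if p >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for user, p in users_at_risk(predictions_head, threshold=0.9):
    print("%s  %.3f" % (user, p))
```

With a threshold of 0.9, three of the ten users shown are flagged: 13194, 12732 and 13242.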

Evaluating the model



In [16]:

    
metrics = model.evaluate(valid, user_data=users, time_boundary=churn_period_oct)
metrics

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-10-01 00:00:00
PROGRESS:  End   : 2011-10-31 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
InvoiceNo is a categorical variable with too many different values (16841) and will be ignored.
StockCode is a categorical variable with too many different values (3649) and will be ignored.
Description is a categorical variable with too many different values (3845) and will be ignored.
PROGRESS: Generating features for boundary 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 66 user(s).

    Out[16]:

{'auc': 0.7990228370663153, 'evaluation_data': Columns:
 	CustomerID	str
 	probability	float
 	label	int
 
 Rows: 375
 
 Data:
 +------------+----------------+-------+
 | CustomerID |  probability   | label |
 +------------+----------------+-------+
 |   16200    | 0.632646918297 |   1   |
 |   15910    | 0.430852562189 |   0   |
 |   16718    | 0.703077316284 |   1   |
 |   16222    | 0.768735051155 |   1   |
 |   16899    | 0.583611965179 |   1   |
 |   12732    | 0.894502520561 |   1   |
 |   13194    | 0.817718148232 |   1   |
 |   14625    | 0.618298172951 |   1   |
 |   13242    | 0.940870046616 |   1   |
 |   15894    | 0.828248143196 |   1   |
 +------------+----------------+-------+
 [375 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'precision': 0.7863501483679525, 'precision_recall_curve': Columns:
 	cutoffs	float
 	precision	float
 	recall	float
 
 Rows: 5
 
 Data:
 +---------+----------------+----------------+
 | cutoffs |   precision    |     recall     |
 +---------+----------------+----------------+
 |   0.1   | 0.743243243243 | 0.996376811594 |
 |   0.25  | 0.763231197772 | 0.992753623188 |
 |   0.5   | 0.786350148368 | 0.960144927536 |
 |   0.75  | 0.912371134021 | 0.641304347826 |
 |   0.9   | 0.953703703704 | 0.373188405797 |
 +---------+----------------+----------------+
 [5 rows x 3 columns], 'recall': 0.9601449275362319, 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-----+----+
 | threshold | fpr | tpr |  p  | n  |
 +-----------+-----+-----+-----+----+
 |    0.0    | 1.0 | 1.0 | 276 | 99 |
 |   1e-05   | 1.0 | 1.0 | 276 | 99 |
 |   2e-05   | 1.0 | 1.0 | 276 | 99 |
 |   3e-05   | 1.0 | 1.0 | 276 | 99 |
 |   4e-05   | 1.0 | 1.0 | 276 | 99 |
 |   5e-05   | 1.0 | 1.0 | 276 | 99 |
 |   6e-05   | 1.0 | 1.0 | 276 | 99 |
 |   7e-05   | 1.0 | 1.0 | 276 | 99 |
 |   8e-05   | 1.0 | 1.0 | 276 | 99 |
 |   9e-05   | 1.0 | 1.0 | 276 | 99 |
 +-----------+-----+-----+-----+----+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
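
The precision and recall values in the output above depend on the probability cutoff used to call a user "churned". As a rough illustration of how such numbers come about, the sketch below computes both metrics in plain Python on the ten `evaluation_data` rows that were printed. The helper `precision_recall` is written for this example, and treating a probability at or above the cutoff as a churn prediction is an assumption, not necessarily the convention `model.evaluate` uses.

```python
# First ten (probability, label) pairs from evaluation_data above;
# label 1 means the user churned, 0 means the user did not.
evaluation_head = [
    (0.632646918297, 1), (0.430852562189, 0), (0.703077316284, 1),
    (0.768735051155, 1), (0.583611965179, 1), (0.894502520561, 1),
    (0.817718148232, 1), (0.618298172951, 1), (0.940870046616, 1),
    (0.828248143196, 1),
]


def precision_recall(pairs, cutoff):
    """Compute (precision, recall) when predicting churn for every
    probability at or above `cutoff`."""
    tp = sum(1 for p, y in pairs if p >= cutoff and y == 1)
    fp = sum(1 for p, y in pairs if p >= cutoff and y == 0)
    fn = sum(1 for p, y in pairs if p < cutoff and y == 1)
    precision = float(tp) / (tp + fp) if tp + fp else float('nan')
    recall = float(tp) / (tp + fn) if tp + fn else float('nan')
    return precision, recall


for cutoff in (0.5, 0.75, 0.9):
    precision, recall = precision_recall(evaluation_head, cutoff)
    print("cutoff %.2f: precision %.3f, recall %.3f" % (cutoff, precision, recall))
```

Ten rows are far too few to reproduce the reported metrics, but the trend matches the full precision_recall_curve table: raising the cutoff from 0.5 to 0.9 increases precision from about 0.786 to 0.954 while recall drops from about 0.960 to 0.373.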



In [17]:

    
model.save('data/churn_model.mdl')
