Customer Churn Prediction

In this webinar, we will load data from the UCI Online Retail data set (http://archive.ics.uci.edu/ml/datasets/Online+Retail) and predict which customers are likely to churn given their purchase activity.

Churn can be defined in many ways. We define churn as no activity within a period of time (called the churn_period). Under this definition, a user/customer is said to have churned if any form of activity is followed by no activity for an entire churn_period (by default, we assume 30 days). The following figure better illustrates this concept.

(from our user guide: https://turi.com/learn/userguide/churn_prediction/churn-prediction.html)
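To make the definition concrete, here's a minimal sketch in plain Python (an illustration of the definition only, not the toolkit's implementation) that labels a single user as churned at a given boundary date:

import datetime

def has_churned(activity_times, boundary, churn_period=datetime.timedelta(days=30)):
    # Churned at `boundary` means: no activity in (boundary, boundary + churn_period]
    return not any(boundary < t <= boundary + churn_period for t in activity_times)

# Hypothetical example: last purchase on 2011-06-15, boundary 2011-08-01 -> churned
purchases = [datetime.datetime(2011, 5, 2), datetime.datetime(2011, 6, 15)]
print has_churned(purchases, datetime.datetime(2011, 8, 1))  # True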

We will dig deeper into the different parameters of the Churn Prediction toolkit, but let's start by loading some data!


In [1]:
# Let's import GraphLab Create and a few other libraries
import graphlab as gl
import graphlab.aggregate
import datetime
import time

Import data from a hosted copy of the UCI data set

GraphLab Create supports loading data from live databases as well as from files. Since we're working with a fixed dataset, we will load it from a CSV copy hosted on static.turi.com.


In [2]:
# Data could come directly from a SQL database; for this webinar, we load a hosted CSV copy
data = gl.SFrame("https://static.turi.com/datasets/churn-prediction/online_retail.csv")
data


Downloading https://static.turi.com/datasets/churn-prediction/online_retail.csv to /var/tmp/graphlab-turi/3267/0e060175-ebe8-4df5-a3d1-2b2362b49927.csv
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail.csv
Parsing completed. Parsed 100 lines in 1.2468 secs.
------------------------------------------------------
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail.csv
Parsing completed. Parsed 541909 lines in 1.41061 secs.
Inferred types from first line of file as 
column_type_hints=[int,str,str,int,str,float,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Out[2]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER ... 6 12/1/10 8:26 2.55 17850 United Kingdom
536365 71053 WHITE METAL LANTERN 6 12/1/10 8:26 3.39 17850 United Kingdom
536365 84406B CREAM CUPID HEARTS COAT HANGER ... 8 12/1/10 8:26 2.75 17850 United Kingdom
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE ... 6 12/1/10 8:26 3.39 17850 United Kingdom
536365 84029E RED WOOLLY HOTTIE WHITE HEART. ... 6 12/1/10 8:26 3.39 17850 United Kingdom
536365 22752 SET 7 BABUSHKA NESTING BOXES ... 2 12/1/10 8:26 7.65 17850 United Kingdom
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER ... 6 12/1/10 8:26 4.25 17850 United Kingdom
536366 22633 HAND WARMER UNION JACK 6 12/1/10 8:28 1.85 17850 United Kingdom
536366 22632 HAND WARMER RED POLKA DOT 6 12/1/10 8:28 1.85 17850 United Kingdom
536367 84879 ASSORTED COLOUR BIRD ORNAMENT ... 32 12/1/10 8:34 1.69 13047 United Kingdom
[541909 rows x 8 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We need to do some cleanup first. The InvoiceNo and Description columns are not going to help the model, and should be removed.


In [3]:
data = data.remove_columns(['InvoiceNo', 'Description'])
data


Out[3]:
StockCode Quantity InvoiceDate UnitPrice CustomerID Country
85123A 6 12/1/10 8:26 2.55 17850 United Kingdom
71053 6 12/1/10 8:26 3.39 17850 United Kingdom
84406B 8 12/1/10 8:26 2.75 17850 United Kingdom
84029G 6 12/1/10 8:26 3.39 17850 United Kingdom
84029E 6 12/1/10 8:26 3.39 17850 United Kingdom
22752 2 12/1/10 8:26 7.65 17850 United Kingdom
21730 6 12/1/10 8:26 4.25 17850 United Kingdom
22633 6 12/1/10 8:28 1.85 17850 United Kingdom
22632 6 12/1/10 8:28 1.85 17850 United Kingdom
84879 32 12/1/10 8:34 1.69 13047 United Kingdom
[541909 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Now we need to convert the InvoiceDate values (which are strings) into Python datetime objects


In [4]:
def string_time_to_datetime(x):
    # Imported inside the function so it's available to GraphLab's lambda workers
    from dateutil import parser
    return parser.parse(x)

data['InvoiceDate'] = data['InvoiceDate'].apply(string_time_to_datetime)

Finally, we want to split the users into a training set and a validation set, making sure the validation users do not appear in the training set, and create TimeSeries objects out of each.


In [5]:
(train, valid) = gl.churn_predictor.random_split(data, user_id = 'CustomerID', fraction = 0.9, seed = 12)
train_trial = gl.TimeSeries(train, index = 'InvoiceDate')
valid_trial = gl.TimeSeries(valid, index = 'InvoiceDate')

Now we can load user information, which can be used to augment the churn prediction model.


In [6]:
userdata = gl.SFrame("https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv")
userdata


Downloading https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv to /var/tmp/graphlab-turi/3267/fa6fe133-e384-4926-8369-0a18202dfbd2.csv
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv
Parsing completed. Parsed 100 lines in 0.033971 secs.
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv
Parsing completed. Parsed 4380 lines in 0.012869 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Out[6]:
CustomerID Gender Age Country
13097 Male 57 United Kingdom
16817 Male 57 United Kingdom
14499 Male 61 United Kingdom
16185 Male 33 United Kingdom
14285 Male 33 United Kingdom
16837 Male 57 United Kingdom
13969 Male 41 United Kingdom
12831 Male 45 United Kingdom
16697 Male 57 United Kingdom
17671 Male 45 United Kingdom
[4380 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Training the model

Let's now train the model.

We already created a user-based train/validation split above.

First, let's look at the data and see what the time range looks like.


In [7]:
print "Start date : %s" % train_trial.min_time
print "End date   : %s" % train_trial.max_time


Start date : 2010-12-01 08:26:00
End date   : 2011-12-09 12:50:00

In [8]:
# Period of inactivity that defines churn -- meaning that if a user stops purchasing
# items for 30 days, we'll consider them as having churned.
churn_period_trial = datetime.timedelta(days = 30) 

# Candidate time boundaries at the beginning of different months
churn_boundary_aug = datetime.datetime(year = 2011, month = 8, day = 1) 
churn_boundary_sep = datetime.datetime(year = 2011, month = 9, day = 1) 
churn_boundary_oct = datetime.datetime(year = 2011, month = 10, day = 1)

In [9]:
model = gl.churn_predictor.create(train_trial,
                                  user_data = userdata,
                                  user_id='CustomerID',
                                  churn_period = churn_period_trial,
                                  time_boundaries = [churn_boundary_aug, churn_boundary_sep, churn_boundary_oct])


PROGRESS: Grouping observation_data by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
StockCode is a categorical variable with too many different values (4063) and will be ignored.
PROGRESS: Generating features for time-boundary.
PROGRESS: --------------------------------------------------
PROGRESS: Features for 2011-08-01 03:00:00.
PROGRESS: Features for 2011-09-01 03:00:00.
PROGRESS: Features for 2011-10-01 03:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: --------------------------------------------------
PROGRESS: Training a classifier model.
WARNING: Detected extremely low variance for feature(s) 'Quantity_features_7', 'UnitPrice_features_7', 'Country_features_7', '__internal__count_7', 'Quantity_features_14', 'UnitPrice_features_14', 'Country_features_14', '__internal__count_14', 'Quantity_features_21', 'UnitPrice_features_21', 'Country_features_21', '__internal__count_21', 'Quantity_features_60', 'UnitPrice_features_60', 'Country_features_60', '__internal__count_60', 'Quantity_features_90', 'UnitPrice_features_90', 'Country_features_90', '__internal__count_90' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 9196
Number of classes           : 2
Number of feature columns   : 23
Number of unpacked features : 2282
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.082029     | 0.790670          | 0.600178          |
| 2         | 0.152539     | 0.793497          | 0.550899          |
| 3         | 0.232402     | 0.795020          | 0.519576          |
| 4         | 0.307320     | 0.798717          | 0.498738          |
| 5         | 0.377764     | 0.802305          | 0.485382          |
| 6         | 0.446069     | 0.802305          | 0.476601          |
+-----------+--------------+-------------------+-------------------+
PROGRESS: --------------------------------------------------
PROGRESS: Model training complete: Next steps
PROGRESS: --------------------------------------------------
PROGRESS: (1) Evaluate the model at various timestamps in the past:
PROGRESS:       metrics = model.evaluate(data, time_in_past)
PROGRESS: (2) Make a churn forecast for a timestamp in the future:
PROGRESS:       predictions = model.predict(data, time_in_future)

Evaluating the model (post-hoc analysis)


In [10]:
# Evaluate this model in October
evaluation_time = churn_boundary_oct

In [11]:
metrics = model.evaluate(valid_trial, evaluation_time, user_data = userdata)


PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
StockCode is a categorical variable with too many different values (4063) and will be ignored.
PROGRESS: Generating features for boundary 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.

In [12]:
print(metrics)


{'auc': 0.6970680083275501, 'recall': 0.9465648854961832, 'precision': 0.7515151515151515, 'roc_curve': Columns:
	threshold	float
	fpr	float
	tpr	float
	p	int
	n	int

Rows: 100001

Data:
+-----------+-----+-----+-----+-----+
| threshold | fpr | tpr |  p  |  n  |
+-----------+-----+-----+-----+-----+
|    0.0    | 1.0 | 1.0 | 262 | 110 |
|   1e-05   | 1.0 | 1.0 | 262 | 110 |
|   2e-05   | 1.0 | 1.0 | 262 | 110 |
|   3e-05   | 1.0 | 1.0 | 262 | 110 |
|   4e-05   | 1.0 | 1.0 | 262 | 110 |
|   5e-05   | 1.0 | 1.0 | 262 | 110 |
|   6e-05   | 1.0 | 1.0 | 262 | 110 |
|   7e-05   | 1.0 | 1.0 | 262 | 110 |
|   8e-05   | 1.0 | 1.0 | 262 | 110 |
|   9e-05   | 1.0 | 1.0 | 262 | 110 |
+-----------+-----+-----+-----+-----+
[100001 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'evaluation_data': Columns:
	CustomerID	int
	probability	float
	label	int

Rows: 372

Data:
+------------+-----------------+-------+
| CustomerID |   probability   | label |
+------------+-----------------+-------+
|   13761    |  0.790894567966 |   0   |
|   12377    |  0.929564416409 |   1   |
|   13715    |  0.870798766613 |   1   |
|   17725    |  0.224802270532 |   0   |
|   15437    |  0.881482243538 |   1   |
|   12739    |  0.741246521473 |   1   |
|   16523    | 0.0716642737389 |   0   |
|   14711    |  0.652524530888 |   1   |
|   12851    |  0.718903779984 |   1   |
|   14739    |  0.77179646492  |   0   |
+------------+-----------------+-------+
[372 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'precision_recall_curve': Columns:
	cutoffs	float
	precision	float
	recall	float

Rows: 5

Data:
+---------+----------------+-----------------+
| cutoffs |   precision    |      recall     |
+---------+----------------+-----------------+
|   0.1   | 0.707317073171 |  0.996183206107 |
|   0.25  | 0.72268907563  |  0.984732824427 |
|   0.5   | 0.751515151515 |  0.946564885496 |
|   0.75  | 0.80612244898  |  0.603053435115 |
|   0.9   | 0.882352941176 | 0.0572519083969 |
+---------+----------------+-----------------+
[5 rows x 3 columns]
}

In [13]:
# metrics['precision_recall_curve'].show()
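If you'd rather pick an operating threshold programmatically than read it off the table above, here is a small sketch (assuming the metrics dict returned by model.evaluate in In [11]):

pr = metrics['precision_recall_curve']
# Keep cutoffs that retain at least 0.9 recall, then take the highest (most precise) one
candidates = pr[pr['recall'] >= 0.9].sort('cutoffs', ascending=False)
print candidates[0]  # here: cutoff 0.5, precision ~0.75, recall ~0.95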

Make predictions in the future

Here, the question to ask is whether a user will churn after a certain period of time. To validate, we can check whether the user was active after that evaluation period; see the sketch below. Note that this is customer churn (no purchase activity at all), not usage churn.
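As a hedged sketch of what that manual check could look like on the raw SFrame (model.evaluate already does this bookkeeping for us; valid, churn_boundary_oct, and churn_period_trial are defined above):

boundary = churn_boundary_oct
window_end = boundary + churn_period_trial
# Users seen on or before the boundary...
seen_before = set(valid[valid['InvoiceDate'] <= boundary]['CustomerID'])
# ...with no activity inside the churn window are labeled churned
active_in_window = set(valid[(valid['InvoiceDate'] > boundary) &
                             (valid['InvoiceDate'] <= window_end)]['CustomerID'])
churned_users = seen_before - active_in_window
print "%d of %d validation users churned" % (len(churned_users), len(seen_before))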


In [14]:
# Make predictions in the future.

predictions_trial = model.predict(valid_trial, user_data = userdata)
predictions_trial.print_rows()


PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start of in-activity : 2011-12-09 11:20:00
PROGRESS:  End of in-activity   : 2012-01-08 11:20:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
StockCode is a categorical variable with too many different values (4063) and will be ignored.
PROGRESS: Generating features for boundary 2011-12-09 11:20:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 0 user(s). Returning `probability` = None.
+------------+-----------------+
| CustomerID |   probability   |
+------------+-----------------+
|   13761    |  0.675416588783 |
|   12789    |  0.786019265652 |
|   12377    |  0.92285823822  |
|   13715    |  0.88807028532  |
|   17725    |  0.580599963665 |
|   15437    |  0.88807028532  |
|   12739    |  0.790894567966 |
|   16523    | 0.0636421069503 |
|   14711    |  0.444539040327 |
|   12851    |  0.790894567966 |
+------------+-----------------+
[442 rows x 2 columns]


In [15]:
predictions_trial.sort('probability', ascending=False).print_rows(20, max_column_width=20)


+------------+----------------+
| CustomerID |  probability   |
+------------+----------------+
|   12866    | 0.930307924747 |
|   12623    | 0.930261075497 |
|   16980    | 0.928374707699 |
|   13863    |  0.9267988801  |
|   12811    | 0.926037430763 |
|   15442    | 0.925807058811 |
|   13043    | 0.923144876957 |
|   16721    | 0.923144876957 |
|   15303    | 0.923144876957 |
|   15083    | 0.923144876957 |
|   12881    | 0.923144876957 |
|   12686    | 0.92285823822  |
|   12377    | 0.92285823822  |
|   14770    | 0.919770300388 |
|   17990    | 0.919770300388 |
|   14339    | 0.913970410824 |
|   16957    | 0.913970410824 |
|   16947    | 0.913970410824 |
|   16617    | 0.913970410824 |
|   12738    | 0.90664768219  |
+------------+----------------+
[442 rows x 2 columns]


In [16]:
predictions_trial.sort('probability', ascending=False)[200:300].print_rows(20, max_column_width=20)


+------------+----------------+
| CustomerID |  probability   |
+------------+----------------+
|   16086    | 0.764054119587 |
|   12990    | 0.760963022709 |
|   12993    | 0.760963022709 |
|   14080    | 0.760963022709 |
|   16596    | 0.760963022709 |
|   16638    | 0.760963022709 |
|   16784    | 0.760963022709 |
|   14987    | 0.760963022709 |
|   15532    | 0.760936200619 |
|   15025    | 0.760936200619 |
|   17772    | 0.760792136192 |
|   17075    | 0.757927417755 |
|   14049    | 0.757856547832 |
|   13972    | 0.757027626038 |
|   15783    | 0.752182364464 |
|   13329    | 0.748927116394 |
|   17636    | 0.74851256609  |
|   16527    | 0.744530022144 |
|   13061    | 0.744530022144 |
|   16122    | 0.740379273891 |
+------------+----------------+
[100 rows x 2 columns]

Inside the model


In [17]:
model.trained_model


Out[17]:
Class                         : BoostedTreesClassifier

Schema
------
Number of examples            : 9196
Number of feature columns     : 23
Number of unpacked features   : 2282
Number of classes             : 2

Settings
--------
Number of trees               : 10
Max tree depth                : 6
Training time (sec)           : 0.7192
Training accuracy             : 0.8129
Validation accuracy           : None
Training log_loss             : 0.4525
Validation log_loss           : None

In [18]:
model.trained_model.get_feature_importance()


Out[18]:
name index count
Quantity_features_7 user_user_timesinceseen 73
Age None 29
Quantity_features_90 sum_sum 12
Quantity_features_60 count_sum 12
Quantity_features_60 sum_ratio 11
UnitPrice_features_7 sum_sum 11
Quantity_features_60 sum_sum 10
Quantity_features_90 sum_firstinteraction_timesinceseen ... 10
UnitPrice_features_90 sum_sum 9
UnitPrice_features_90 sum_max 8
[2318 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
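The name/index pairs above are unpacked sub-features. For a coarser view, one could aggregate the counts per feature family; a sketch, assuming count is the number of tree splits that use the feature:

fi = model.trained_model.get_feature_importance()
fi.groupby('name', {'total_count': gl.aggregate.SUM('count')}) \
  .sort('total_count', ascending=False).print_rows(10)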