Customer Churn Prediction

In this webinar, we will load data from the UCI Online Retail dataset (http://archive.ics.uci.edu/ml/datasets/Online+Retail) and predict which customers are likely to churn given their purchase activity.

Churn can be defined in many ways. We define churn as no activity within a period of time (called the churn_period). Under this definition, a user/customer is said to have churned if any form of activity is followed by no activity for an entire churn_period (by default, we assume 30 days). The following figure better illustrates this concept.

(from our user guide: https://turi.com/learn/userguide/churn_prediction/churn-prediction.html)
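As a rough illustration of this definition (not the toolkit's implementation), the sketch below checks a single user's activity timestamps against a boundary. The helper name has_churned and the half-open window [boundary, boundary + churn_period) are assumptions made for this example; the toolkit may treat the window edges differently.

```python
import datetime

def has_churned(activity_times, boundary, churn_period=datetime.timedelta(days=30)):
    """Hypothetical helper: True if the user was active before `boundary`
    but shows no activity in [boundary, boundary + churn_period)."""
    active_before = any(t < boundary for t in activity_times)
    active_during = any(boundary <= t < boundary + churn_period
                        for t in activity_times)
    return active_before and not active_during

boundary = datetime.datetime(2011, 10, 1)
# Purchases again two weeks after the boundary: not churned.
loyal = [datetime.datetime(2011, 9, 20), datetime.datetime(2011, 10, 15)]
# Comes back only after the 30-day window has passed: counted as churned.
lapsed = [datetime.datetime(2011, 9, 20), datetime.datetime(2011, 11, 20)]

print(has_churned(loyal, boundary))   # False
print(has_churned(lapsed, boundary))  # True
```

Note that a user who returns after the window has closed still counts as churned for that boundary, which is why the choice of churn_period matters.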

We will dig deeper into the different parameters of the Churn Prediction toolkit, but let's start by loading some data!



In [1]:

    
# Let's import GraphLab Create and a few other libraries
import graphlab as gl
import graphlab.aggregate
import datetime
import time

Import data from a locally downloaded copy of the UCI data set

GraphLab Create supports loading data from live databases, as well as from files. In this case, since we're working with a fixed dataset, we will load it from a static CSV file.



In [2]:

    
# Data can come directly from a SQL database; for this webinar, we load a static CSV copy
data = gl.SFrame("https://static.turi.com/datasets/churn-prediction/online_retail.csv")
data

2016-03-16 11:14:52,357 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.4 started. Logging: /tmp/graphlab_server_1458119690.log
Downloading https://static.turi.com/datasets/churn-prediction/online_retail.csv to /var/tmp/graphlab-turi/3267/0e060175-ebe8-4df5-a3d1-2b2362b49927.csv
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail.csv
Parsing completed. Parsed 100 lines in 1.2468 secs.
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail.csv
Parsing completed. Parsed 541909 lines in 1.41061 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str,str,int,str,float,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

    Out[2]:

+-----------+-----------+-----------------------------------------+----------+--------------+-----------+------------+----------------+
| InvoiceNo | StockCode | Description                             | Quantity | InvoiceDate  | UnitPrice | CustomerID | Country        |
+-----------+-----------+-----------------------------------------+----------+--------------+-----------+------------+----------------+
| 536365    | 85123A    | WHITE HANGING HEART T-LIGHT HOLDER ...  | 6        | 12/1/10 8:26 | 2.55      | 17850      | United Kingdom |
| 536365    | 71053     | WHITE METAL LANTERN                     | 6        | 12/1/10 8:26 | 3.39      | 17850      | United Kingdom |
| 536365    | 84406B    | CREAM CUPID HEARTS COAT HANGER ...      | 8        | 12/1/10 8:26 | 2.75      | 17850      | United Kingdom |
| 536365    | 84029G    | KNITTED UNION FLAG HOT WATER BOTTLE ... | 6        | 12/1/10 8:26 | 3.39      | 17850      | United Kingdom |
| 536365    | 84029E    | RED WOOLLY HOTTIE WHITE HEART. ...      | 6        | 12/1/10 8:26 | 3.39      | 17850      | United Kingdom |
| 536365    | 22752     | SET 7 BABUSHKA NESTING BOXES ...        | 2        | 12/1/10 8:26 | 7.65      | 17850      | United Kingdom |
| 536365    | 21730     | GLASS STAR FROSTED T-LIGHT HOLDER ...   | 6        | 12/1/10 8:26 | 4.25      | 17850      | United Kingdom |
| 536366    | 22633     | HAND WARMER UNION JACK                  | 6        | 12/1/10 8:28 | 1.85      | 17850      | United Kingdom |
| 536366    | 22632     | HAND WARMER RED POLKA DOT               | 6        | 12/1/10 8:28 | 1.85      | 17850      | United Kingdom |
| 536367    | 84879     | ASSORTED COLOUR BIRD ORNAMENT ...       | 32       | 12/1/10 8:34 | 1.69      | 13047      | United Kingdom |
+-----------+-----------+-----------------------------------------+----------+--------------+-----------+------------+----------------+

[541909 rows x 8 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We need to do some cleanup first. The InvoiceNo and Description columns are not going to help the model, so we remove them.



In [3]:

    
data = data.remove_columns(['InvoiceNo', 'Description'])
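data

    Out[3]:

+-----------+----------+--------------+-----------+------------+----------------+
| StockCode | Quantity | InvoiceDate  | UnitPrice | CustomerID | Country        |
+-----------+----------+--------------+-----------+------------+----------------+
| 85123A    | 6        | 12/1/10 8:26 | 2.55      | 17850      | United Kingdom |
| 71053     | 6        | 12/1/10 8:26 | 3.39      | 17850      | United Kingdom |
| 84406B    | 8        | 12/1/10 8:26 | 2.75      | 17850      | United Kingdom |
| 84029G    | 6        | 12/1/10 8:26 | 3.39      | 17850      | United Kingdom |
| 84029E    | 6        | 12/1/10 8:26 | 3.39      | 17850      | United Kingdom |
| 22752     | 2        | 12/1/10 8:26 | 7.65      | 17850      | United Kingdom |
| 21730     | 6        | 12/1/10 8:26 | 4.25      | 17850      | United Kingdom |
| 22633     | 6        | 12/1/10 8:28 | 1.85      | 17850      | United Kingdom |
| 22632     | 6        | 12/1/10 8:28 | 1.85      | 17850      | United Kingdom |
| 84879     | 32       | 12/1/10 8:34 | 1.69      | 13047      | United Kingdom |
+-----------+----------+--------------+-----------+------------+----------------+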

[541909 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Now we need to convert the InvoiceDate column (which holds strings) into Python datetime objects.



In [4]:

    
import dateutil
from dateutil import parser

def string_time_to_datetime(x):
    # Parse a date string such as '12/1/10 8:26' into a datetime.datetime
    import datetime
    import pytz
    return dateutil.parser.parse(x)

data['InvoiceDate'] = data['InvoiceDate'].apply(string_time_to_datetime)
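When the date format is known in advance, an explicit format string is a possible alternative to dateutil's format guessing. The sketch below assumes the month/day/two-digit-year layout seen in the rows above (12/1/10 8:26 is 1 December 2010, which matches the start date of the data); the helper name parse_invoice_date is made up for this example.

```python
import datetime

def parse_invoice_date(text):
    """Hypothetical helper: parse strings like '12/1/10 8:26'.

    Assumes month/day/two-digit-year followed by hour:minute.
    """
    return datetime.datetime.strptime(text, "%m/%d/%y %H:%M")

print(parse_invoice_date("12/1/10 8:26"))  # 2010-12-01 08:26:00
```

An explicit format fails loudly on unexpected input instead of guessing, which can be useful for ambiguous day/month orderings.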

Finally, we split the users into training and validation sets, making sure that no validation user appears in the training set, and create a TimeSeries object from each split.



In [5]:

    
(train, valid) = gl.churn_predictor.random_split(data, user_id = 'CustomerID', fraction = 0.9, seed = 12)
train_trial = gl.TimeSeries(train, index = 'InvoiceDate')
valid_trial = gl.TimeSeries(valid, index = 'InvoiceDate')
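To make the idea of a user-level split concrete, here is a minimal pure-Python sketch. It is not how gl.churn_predictor.random_split is implemented; the function split_by_user and the toy rows are assumptions for illustration only. The point is that whole users, not individual rows, are assigned to one side or the other.

```python
import random

def split_by_user(rows, user_key, fraction=0.9, seed=12):
    """Hypothetical helper: put `fraction` of the users (with all of their
    rows) in the training set and the remaining users in validation."""
    users = sorted({row[user_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(users)
    cut = int(round(fraction * len(users)))
    train_users = set(users[:cut])
    train = [row for row in rows if row[user_key] in train_users]
    valid = [row for row in rows if row[user_key] not in train_users]
    return train, valid

# Toy data: 10 made-up customers with 5 rows each.
rows = [{"CustomerID": 17850 + (i % 10), "Quantity": i} for i in range(50)]
train, valid = split_by_user(rows, "CustomerID")

train_ids = {row["CustomerID"] for row in train}
valid_ids = {row["CustomerID"] for row in valid}
# No customer may appear on both sides of the split.
assert train_ids.isdisjoint(valid_ids)
print(len(train_ids), len(valid_ids))
```

Splitting by row instead would leak a customer's later purchases into training while evaluating on the same customer, which would make validation metrics look better than they should.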

Now we can load user information, which can be used to augment the churn prediction model.



In [6]:

    
userdata = gl.SFrame("https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv")
userdata

Downloading https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv to /var/tmp/graphlab-turi/3267/fa6fe133-e384-4926-8369-0a18202dfbd2.csv
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv
Parsing completed. Parsed 100 lines in 0.033971 secs.
Finished parsing file https://static.turi.com/datasets/churn-prediction/online_retail_side_data_extended.csv
Parsing completed. Parsed 4380 lines in 0.012869 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

    Out[6]:

+------------+--------+-----+----------------+
| CustomerID | Gender | Age | Country        |
+------------+--------+-----+----------------+
| 13097      | Male   | 57  | United Kingdom |
| 16817      | Male   | 57  | United Kingdom |
| 14499      | Male   | 61  | United Kingdom |
| 16185      | Male   | 33  | United Kingdom |
| 14285      | Male   | 33  | United Kingdom |
| 16837      | Male   | 57  | United Kingdom |
| 13969      | Male   | 41  | United Kingdom |
| 12831      | Male   | 45  | United Kingdom |
| 16697      | Male   | 57  | United Kingdom |
| 17671      | Male   | 45  | United Kingdom |
+------------+--------+-----+----------------+

[4380 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Training the model

Let's now train the model.

Create a train-test split based on users

First, let's look at the data and see what time range it covers.



In [7]:

    
print("Start date : %s" % train_trial.min_time)
print("End date   : %s" % train_trial.max_time)

Start date : 2010-12-01 08:26:00
End date   : 2011-12-09 12:50:00



In [8]:

    
# Period of inactivity that defines churn -- meaning that if a user stops purchasing
# items for 30 days, we'll consider them as having churned.
churn_period_trial = datetime.timedelta(days = 30)

# Time boundaries: the first day of August, September, and October 2011
churn_boundary_aug = datetime.datetime(year = 2011, month = 8, day = 1) 
churn_boundary_sep = datetime.datetime(year = 2011, month = 9, day = 1) 
churn_boundary_oct = datetime.datetime(year = 2011, month = 10, day = 1)
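Each boundary is followed by a window of one churn_period in which activity is checked, so that window has to end before the data does. The sketch below works this out for the three boundaries; label_window is a hypothetical helper, and data_end is copied from the "End date" printed earlier.

```python
import datetime

def label_window(boundary, churn_period):
    """Hypothetical helper: (start, end) of the window that is checked
    for activity after a time boundary."""
    return boundary, boundary + churn_period

churn_period = datetime.timedelta(days=30)
boundaries = [datetime.datetime(2011, month, 1) for month in (8, 9, 10)]
# Last timestamp in the training data, taken from the "End date" output.
data_end = datetime.datetime(2011, 12, 9, 12, 50)

windows = [label_window(b, churn_period) for b in boundaries]
for start, end in windows:
    # A label is only reliable if the whole window lies inside the data.
    print("%s -> %s  fully observed: %s"
          % (start.date(), end.date(), end <= data_end))
```

With a 30-day churn period, all three windows close well before 9 December 2011, so every boundary yields usable labels.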



In [9]:

    
model = gl.churn_predictor.create(train_trial,
                                  user_data = userdata,
                                  user_id='CustomerID',
                                  churn_period = churn_period_trial,
                                  time_boundaries = [churn_boundary_aug, churn_boundary_sep, churn_boundary_oct])

PROGRESS: Grouping observation_data by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
StockCode is a categorical variable with too many different values (4063) and will be ignored.
PROGRESS: Generating features for time-boundary.
PROGRESS: --------------------------------------------------
PROGRESS: Features for 2011-08-01 03:00:00.
PROGRESS: Features for 2011-09-01 03:00:00.
PROGRESS: Features for 2011-10-01 03:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: --------------------------------------------------
PROGRESS: Training a classifier model.
WARNING: Detected extremely low variance for feature(s) 'Quantity_features_7', 'UnitPrice_features_7', 'Country_features_7', '__internal__count_7', 'Quantity_features_14', 'UnitPrice_features_14', 'Country_features_14', '__internal__count_14', 'Quantity_features_21', 'UnitPrice_features_21', 'Country_features_21', '__internal__count_21', 'Quantity_features_60', 'UnitPrice_features_60', 'Country_features_60', '__internal__count_60', 'Quantity_features_90', 'UnitPrice_features_90', 'Country_features_90', '__internal__count_90' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 9196
Number of classes           : 2
Number of feature columns   : 23
Number of unpacked features : 2282
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.082029     | 0.790670          | 0.600178          |
| 2         | 0.152539     | 0.793497          | 0.550899          |
| 3         | 0.232402     | 0.795020          | 0.519576          |
| 4         | 0.307320     | 0.798717          | 0.498738          |
| 5         | 0.377764     | 0.802305          | 0.485382          |
| 6         | 0.446069     | 0.802305          | 0.476601          |
+-----------+--------------+-------------------+-------------------+
PROGRESS: --------------------------------------------------
PROGRESS: Model training complete: Next steps
PROGRESS: --------------------------------------------------
PROGRESS: (1) Evaluate the model at various timestamps in the past:
PROGRESS:       metrics = model.evaluate(data, time_in_past)
PROGRESS: (2) Make a churn forecast for a timestamp in the future:
PROGRESS:       predictions = model.predict(data, time_in_future)

Evaluating the model (post-hoc analysis)



In [10]:

    
# Evaluate this model in October
evaluation_time = churn_boundary_oct



In [11]:

    
metrics = model.evaluate(valid_trial, evaluation_time, user_data = userdata)

PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
StockCode is a categorical variable with too many different values (4063) and will be ignored.
PROGRESS: Generating features for boundary 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.



In [12]:

    
print(metrics)

{'auc': 0.6970680083275501, 'recall': 0.9465648854961832, 'precision': 0.7515151515151515, 'roc_curve': Columns:
	threshold	float
	fpr	float
	tpr	float
	p	int
	n	int

Rows: 100001

Data:
+-----------+-----+-----+-----+-----+
| threshold | fpr | tpr |  p  |  n  |
+-----------+-----+-----+-----+-----+
|    0.0    | 1.0 | 1.0 | 262 | 110 |
|   1e-05   | 1.0 | 1.0 | 262 | 110 |
|   2e-05   | 1.0 | 1.0 | 262 | 110 |
|   3e-05   | 1.0 | 1.0 | 262 | 110 |
|   4e-05   | 1.0 | 1.0 | 262 | 110 |
|   5e-05   | 1.0 | 1.0 | 262 | 110 |
|   6e-05   | 1.0 | 1.0 | 262 | 110 |
|   7e-05   | 1.0 | 1.0 | 262 | 110 |
|   8e-05   | 1.0 | 1.0 | 262 | 110 |
|   9e-05   | 1.0 | 1.0 | 262 | 110 |
+-----------+-----+-----+-----+-----+
[100001 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'evaluation_data': Columns:
	CustomerID	int
	probability	float
	label	int

Rows: 372

Data:
+------------+-----------------+-------+
| CustomerID |   probability   | label |
+------------+-----------------+-------+
|   13761    |  0.790894567966 |   0   |
|   12377    |  0.929564416409 |   1   |
|   13715    |  0.870798766613 |   1   |
|   17725    |  0.224802270532 |   0   |
|   15437    |  0.881482243538 |   1   |
|   12739    |  0.741246521473 |   1   |
|   16523    | 0.0716642737389 |   0   |
|   14711    |  0.652524530888 |   1   |
|   12851    |  0.718903779984 |   1   |
|   14739    |  0.77179646492  |   0   |
+------------+-----------------+-------+
[372 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'precision_recall_curve': Columns:
	cutoffs	float
	precision	float
	recall	float

Rows: 5

Data:
+---------+----------------+-----------------+
| cutoffs |   precision    |      recall     |
+---------+----------------+-----------------+
|   0.1   | 0.707317073171 |  0.996183206107 |
|   0.25  | 0.72268907563  |  0.984732824427 |
|   0.5   | 0.751515151515 |  0.946564885496 |
|   0.75  | 0.80612244898  |  0.603053435115 |
|   0.9   | 0.882352941176 | 0.0572519083969 |
+---------+----------------+-----------------+
[5 rows x 3 columns]
}
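To see how the rows of precision_recall_curve relate to evaluation_data, the sketch below recomputes precision and recall at a few cutoffs from (probability, label) pairs. It uses only the ten evaluation_data rows printed above, so the numbers will not match the full 372-user table. The function precision_recall and the choice of >= for the cutoff comparison are assumptions for this example, not the toolkit's code.

```python
def precision_recall(pairs, cutoff):
    """Hypothetical helper: precision and recall when every user with
    probability >= cutoff is predicted to churn (label 1 = churned)."""
    tp = sum(1 for prob, label in pairs if prob >= cutoff and label == 1)
    fp = sum(1 for prob, label in pairs if prob >= cutoff and label == 0)
    fn = sum(1 for prob, label in pairs if prob < cutoff and label == 1)
    precision = tp / float(tp + fp) if tp + fp else None
    recall = tp / float(tp + fn) if tp + fn else None
    return precision, recall

# The ten (probability, label) rows shown in the evaluation_data head.
head = [
    (0.790894567966, 0), (0.929564416409, 1), (0.870798766613, 1),
    (0.224802270532, 0), (0.881482243538, 1), (0.741246521473, 1),
    (0.0716642737389, 0), (0.652524530888, 1), (0.718903779984, 1),
    (0.77179646492, 0),
]

for cutoff in (0.5, 0.75, 0.9):
    print(cutoff, precision_recall(head, cutoff))
```

Raising the cutoff trades recall for precision, the same pattern visible in the full precision_recall_curve table.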



In [13]:

    
# metrics['precision_recall_curve'].show()

Make predictions in the future

Here the question is whether a customer will churn during the churn_period that follows the prediction time. To validate a forecast, we can check whether the user was active at any point during that period. Note that churn here is defined by inactivity, not by an expiration date.



In [14]:

    
# Make predictions in the future.

predictions_trial = model.predict(valid_trial, user_data = userdata)
predictions_trial.print_rows()

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start of in-activity : 2011-12-09 11:20:00
PROGRESS:  End of in-activity   : 2012-01-08 11:20:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.
StockCode is a categorical variable with too many different values (4063) and will be ignored.
PROGRESS: Generating features for boundary 2011-12-09 11:20:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 0 user(s). Returning `probability` = None.
+------------+-----------------+
| CustomerID |   probability   |
+------------+-----------------+
|   13761    |  0.675416588783 |
|   12789    |  0.786019265652 |
|   12377    |  0.92285823822  |
|   13715    |  0.88807028532  |
|   17725    |  0.580599963665 |
|   15437    |  0.88807028532  |
|   12739    |  0.790894567966 |
|   16523    | 0.0636421069503 |
|   14711    |  0.444539040327 |
|   12851    |  0.790894567966 |
+------------+-----------------+
[442 rows x 2 columns]



In [15]:

    
predictions_trial.sort('probability', ascending=False).print_rows(20,max_column_width=20)

+------------+----------------+
| CustomerID |  probability   |
+------------+----------------+
|   12866    | 0.930307924747 |
|   12623    | 0.930261075497 |
|   16980    | 0.928374707699 |
|   13863    |  0.9267988801  |
|   12811    | 0.926037430763 |
|   15442    | 0.925807058811 |
|   13043    | 0.923144876957 |
|   16721    | 0.923144876957 |
|   15303    | 0.923144876957 |
|   15083    | 0.923144876957 |
|   12881    | 0.923144876957 |
|   12686    | 0.92285823822  |
|   12377    | 0.92285823822  |
|   14770    | 0.919770300388 |
|   17990    | 0.919770300388 |
|   14339    | 0.913970410824 |
|   16957    | 0.913970410824 |
|   16947    | 0.913970410824 |
|   16617    | 0.913970410824 |
|   12738    | 0.90664768219  |
+------------+----------------+
[442 rows x 2 columns]
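Once predictions are ranked, a typical next step is to pick the customers worth contacting. The sketch below does this in plain Python on the ten prediction rows printed by the earlier print_rows() call. The at_risk helper, the 0.75 threshold, and top_k are all assumptions for illustration; the None check reflects the log message that users without enough data get probability None.

```python
def at_risk(predictions, threshold=0.75, top_k=None):
    """Hypothetical helper: (customer_id, probability) pairs at or above
    `threshold`, highest probability first, optionally limited to top_k."""
    scored = [p for p in predictions if p[1] is not None and p[1] >= threshold]
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# The ten rows printed by predictions_trial.print_rows() above.
predictions = [
    (13761, 0.675416588783), (12789, 0.786019265652), (12377, 0.92285823822),
    (13715, 0.88807028532), (17725, 0.580599963665), (15437, 0.88807028532),
    (12739, 0.790894567966), (16523, 0.0636421069503), (14711, 0.444539040327),
    (12851, 0.790894567966),
]

for customer_id, probability in at_risk(predictions, top_k=3):
    print(customer_id, probability)
```

The right threshold depends on the cost of contacting a customer versus the cost of losing one, so 0.75 here is only a placeholder.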



In [16]:

    
predictions_trial.sort('probability', ascending=False)[200:300].print_rows(20,max_column_width=20)

+------------+----------------+
| CustomerID |  probability   |
+------------+----------------+
|   16086    | 0.764054119587 |
|   12990    | 0.760963022709 |
|   12993    | 0.760963022709 |
|   14080    | 0.760963022709 |
|   16596    | 0.760963022709 |
|   16638    | 0.760963022709 |
|   16784    | 0.760963022709 |
|   14987    | 0.760963022709 |
|   15532    | 0.760936200619 |
|   15025    | 0.760936200619 |
|   17772    | 0.760792136192 |
|   17075    | 0.757927417755 |
|   14049    | 0.757856547832 |
|   13972    | 0.757027626038 |
|   15783    | 0.752182364464 |
|   13329    | 0.748927116394 |
|   17636    | 0.74851256609  |
|   16527    | 0.744530022144 |
|   13061    | 0.744530022144 |
|   16122    | 0.740379273891 |
+------------+----------------+
[100 rows x 2 columns]

Inside the model



In [17]:

    
model.trained_model

    Out[17]:

Class                         : BoostedTreesClassifier

Schema
------
Number of examples            : 9196
Number of feature columns     : 23
Number of unpacked features   : 2282
Number of classes             : 2

Settings
--------
Number of trees               : 10
Max tree depth                : 6
Training time (sec)           : 0.7192
Training accuracy             : 0.8129
Validation accuracy           : None
Training log_loss             : 0.4525
Validation log_loss           : None



In [18]:

    
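model.trained_model.get_feature_importance()

    Out[18]:

+-----------------------+-----------------------------------------+-------+
| name                  | index                                   | count |
+-----------------------+-----------------------------------------+-------+
| Quantity_features_7   | user_user_timesinceseen                 | 73    |
| Age                   | None                                    | 29    |
| Quantity_features_90  | sum_sum                                 | 12    |
| Quantity_features_60  | count_sum                               | 12    |
| Quantity_features_60  | sum_ratio                               | 11    |
| UnitPrice_features_7  | sum_sum                                 | 11    |
| Quantity_features_60  | sum_sum                                 | 10    |
| Quantity_features_90  | sum_firstinteraction_time sinceseen ... | 10    |
| UnitPrice_features_90 | sum_sum                                 | 9     |
| UnitPrice_features_90 | sum_max                                 | 8     |
+-----------------------+-----------------------------------------+-------+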

[2318 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
