Churn prediction is the task of identifying whether users are likely to stop using a service, product, or website. Companies and large corporations strive to acquire new customers, a process that usually requires a lot of time and money, and they will do everything they can to prevent an existing customer from leaving. Acquiring a new customer is often far more costly than making an offer to a client who is about to leave. More importantly, customers rarely reveal their complaints about a service or product they use, or their intention to leave a company. It is therefore important for companies to regularly review historical user behavior patterns and accurately forecast the probability that a given customer will churn.
Furthermore, companies usually distinguish between voluntary churn and involuntary churn. Voluntary churn occurs when a customer decides to switch to another company or service provider, whereas involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or a move to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically arises from aspects of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.
Here, we are going to study a transactional data set which contains all the transactions occurring between December 1st, 2010 (01/12/2010) and December 9th, 2011 (09/12/2011) for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. The data set is provided through the UCI Machine Learning Repository (Online Retail), and our task will be to predict which customers are likely to churn given their purchasing activity. We will assume that none of these customers stopped buying from this online retailer for involuntary reasons, and that those who did stop actually intended to leave the company.
It is important to note that customer churn can be defined in many ways. In the GraphLab Create churn_predictor toolkit that we are going to use below, churn is defined as no activity for a fixed period of time (called the churn_period). More specifically, a user is said to have churned if he/she has no activity for a duration of time known as the churn_period (by default, this is set to 30 days).
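To make this definition concrete, the minimal sketch below (plain Python, with hypothetical customer IDs and activity timestamps) labels a customer as churned at a given time boundary if he/she shows no activity within the churn_period that follows it. This only illustrates the labeling rule; it is not the toolkit's internal implementation.
In [ ]:
import datetime as dt

churn_period = dt.timedelta(days=30)
boundary = dt.datetime(2011, 10, 1)

# hypothetical activity timestamps per customer (illustration only)
activity = {'12607': [dt.datetime(2011, 9, 20), dt.datetime(2011, 10, 12)],
            '18139': [dt.datetime(2011, 8, 5)]}

def has_churned(timestamps, boundary, churn_period):
    # churned at `boundary` if no activity falls in (boundary, boundary + churn_period]
    window_end = boundary + churn_period
    return not any(boundary < t <= window_end for t in timestamps)

for customer_id, timestamps in activity.items():
    print customer_id, has_churned(timestamps, boundary, churn_period)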
[Chen et al. 2012] Daqing Chen, Sai Liang Sain, and Kun Guo, "Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining", Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (published online 27 August 2012, doi: 10.1057/dbm.2012.17).
The data set has been provided through the UCI ML Repository: http://archive.ics.uci.edu/ml/datasets/Online+Retail
In [1]:
import graphlab as gl
import graphlab.aggregate as agg
import datetime as dt
from visualization_helper_functions import *
In [2]:
purchasing_data = gl.SFrame.read_csv('./../../04.UCI.ML.REPO/00352.Online_Retail/Online_Retail.csv',
column_type_hints=[str,str,str,int,str,float,str,str])
As shown below, purchasing_data keeps the transactional data of purchases made by various customers of this online retailer during the period from 01/12/2010 to 09/12/2011. It keeps customer records per InvoiceNo and StockCode, the Quantity and the UnitPrice of each product procured, and the InvoiceDate on which the corresponding invoice was issued. The customer's country of residency, Country, as well as a short description of each StockCode, is also provided.
In [3]:
purchasing_data.head(5)
Out[3]:
In [4]:
purchasing_data.dtype
Out[4]:
Now we need to convert the InvoiceDate (which is a string) into a Python DateTime object.
In [5]:
def _str_to_datetime(x):
    import datetime as dt
    try:
        return dt.datetime.strptime(x, '%m/%d/%Y %H:%M')
    except (TypeError, ValueError):
        return None
In [6]:
purchasing_data['InvoiceDate'] = purchasing_data['InvoiceDate'].apply(_str_to_datetime)
purchasing_data.head(5)
Out[6]:
In [7]:
gl.canvas.set_target('ipynb')
purchasing_data.show()
Here, we have used the special SFrame functionality, .show(), to draw these summary statistics. An important advantage of this GraphLab Create method is its ability to draw plots of large data sets with unparalleled ease. However, for reasons of general compatibility, below we use the matplotlib and seaborn Python libraries through the visualization_helper_functions module which we imported initially, providing some basic visualizations to support our discussion.
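The source of visualization_helper_functions is not listed here. As a rough idea of what such a univariate summary plot might do, the hypothetical sketch below converts the SFrame to a pandas DataFrame and draws a histogram for each numeric column and a count plot for each categorical column; the actual helper may differ in details and styling.
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

def simple_univariate_plot(sframe, columns, nsubplots_inrow=2):
    # hypothetical stand-in for univariate_summary_statistics_plot()
    df = sframe.to_dataframe()
    nrows = (len(columns) + nsubplots_inrow - 1) // nsubplots_inrow
    fig, axes = plt.subplots(nrows, nsubplots_inrow,
                             figsize=(6 * nsubplots_inrow, 4 * nrows),
                             squeeze=False)
    for ax, col in zip(axes.ravel(), columns):
        if df[col].dtype.kind in 'if':
            # numeric attribute: draw a histogram
            sns.distplot(df[col].dropna(), kde=False, ax=ax)
        else:
            # categorical attribute: draw a count plot of the most frequent values
            sns.countplot(y=col, data=df, ax=ax,
                          order=df[col].value_counts().index[:15])
        ax.set_title(col)
    fig.tight_layout()

simple_univariate_plot(purchasing_data, ['Quantity', 'UnitPrice', 'Country'])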
Note that not all the features of the purchasing_data set are useful for a univariate summary statistics analysis. We have 25900 different values recorded as InvoiceNo and more than 4000 different StockCodes and CustomerIDs, all of which are nominal categorical variables. Furthermore, the product Description cannot help our analysis in any way, and it is not meaningful to perform a univariate analysis of the datetime field, InvoiceDate.
In [8]:
purchasing_data_df = purchasing_data.to_dataframe()
print purchasing_data_df.describe(include='all')
So by excluding the attributes Description and InvoiceNo from our consideration, and setting aside StockCode and CustomerID, which we will draw in separate frequency plots later, the univariate summary statistics of the remaining features can be visualized as follows.
In [10]:
%matplotlib inline
univariate_summary_statistics_plot(purchasing_data,
['Quantity','UnitPrice', 'Country'],
nsubplots_inrow=2, subplots_wspace=0.5)
In [11]:
gl.Sketch(purchasing_data['CustomerID'])
Out[11]:
In [12]:
gl.Sketch(purchasing_data['StockCode'])
Out[12]:
Note that some values in this data set seem peculiar. First, some records have negative Quantity and UnitPrice values, whereas a large number of CustomerIDs are missing. We have more than 135000 such occurrences (anonymous 'X2' values) in the corresponding gl.Sketch() above. However, since we want to predict which customers are likely to churn, record lines with missing CustomerIDs cannot help our model, and we had better exclude them. The purchasing_data set now becomes:
In [13]:
purchasing_data = purchasing_data[purchasing_data['CustomerID']!='']
purchasing_data.show()
In [14]:
%matplotlib inline
univariate_summary_statistics_plot(purchasing_data,
['Quantity','UnitPrice', 'Country'],
nsubplots_inrow=2, subplots_wspace=0.5)
Note that the UnitPrice attribute no longer contains any negative values, and except for some incredibly expensive outliers, everything looks much more reasonable now.
In [16]:
gl.Sketch(purchasing_data['UnitPrice'])
Out[16]:
Note also that beyond the 99% quantile most of the transactions are manual cancellations [Credit Notes (C-prefixed invoices) with negative values in the Quantity field] and the remaining ones are expensive products. Therefore, we had better keep all these records despite them being outliers.
In [17]:
unit_price_outliers = purchasing_data[purchasing_data['UnitPrice'] > 15.95]
unit_price_outliers.sort('UnitPrice', ascending=False)
Out[17]:
For completeness, the UnitPrice distribution beyond the 99% quantile is as follows:
In [18]:
%matplotlib inline
# transform the SFrame of interest into a Pandas DataFrame
unit_price_outliers_df = unit_price_outliers.to_dataframe()
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=True)
# draw the UnitPrice distplot
ax = sns.distplot(unit_price_outliers_df.UnitPrice,
bins=None, hist=True, kde=False, rug=False, color='b')
or zooming into the (15.95, 500] UnitPrice interval:
In [19]:
%matplotlib inline
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=True)
# draw the UnitPrice distplot
ax = sns.distplot(unit_price_outliers_df[unit_price_outliers_df['UnitPrice']<500].UnitPrice,\
bins=None, hist=True, kde=False, rug=False, color='b')
Furthermore, taking a closer look at the part of the record with negative procured Quantities, we can safely assume that these cases concern product returns, discounts (StockCode='D') and manual corrections (StockCode='M'). Therefore, having a complete purchasing record per customer requires keeping all these transactions.
In [20]:
purchasing_data_neg_quantity = purchasing_data[purchasing_data['Quantity'] < 0]
purchasing_data_neg_quantity.show()
In [21]:
gl.Sketch(purchasing_data_neg_quantity['Quantity'])
Out[21]:
Indeed, the customer with CustomerID='12607' procured a number of various products on 10/10/2011 at a total cost of $1579.51 and returned all of them, canceling the previously issued invoice (InvoiceNo='570467') two days later. Note the 'C' prefix in the InvoiceNo of the Credit Note. Of course, this specific customer is unlikely to make another purchase, a fact that we would miss if we dropped the lines of product returns. More importantly, we need to take this event of complete purchase cancellation seriously and reflect it appropriately in the predictive model we will build.
In [22]:
purchasing_data_cust_12607 = purchasing_data.filter_by('12607', 'CustomerID')
purchasing_data_cust_12607[['InvoiceNo', 'StockCode', 'Description', 'Quantity',\
'InvoiceDate']].\
print_rows(num_rows=205, max_column_width=20)
In [23]:
purchasing_data_cust_12607['CostPerStockCode'] =\
purchasing_data_cust_12607['Quantity'] * purchasing_data_cust_12607['UnitPrice']
purchasing_data_cust_12607.groupby(key_columns = ['InvoiceNo', 'InvoiceDate', 'CustomerID'],
operations = {'Total Cost of Invoice':\
agg.SUM('CostPerStockCode')}).sort('InvoiceDate')
Out[23]:
As another example, the customer with CustomerID='18139' made a series of purchases on 11/21/2011 and the day after, 11/22/2011, at a total cost of $8393.22. Note that among the issued invoices, two of them (with InvoiceNo 'C578073' and 'C578076') involved a manual correction (StockCode='M') of the paid amount. These kinds of transactions may also be important for achieving better predictive performance.
In [24]:
purchasing_data_cust_18139 = purchasing_data.filter_by('18139', 'CustomerID')
purchasing_data_cust_18139['CostPerStockCode'] =\
purchasing_data_cust_18139['Quantity'] * purchasing_data_cust_18139['UnitPrice']
purchasing_data_cust_18139[['InvoiceNo', 'StockCode', 'Description', 'Quantity',\
'InvoiceDate']].\
print_rows(num_rows=170, max_column_width=20)
In [25]:
purchasing_data_cust_18139['CostPerStockCode'] =\
purchasing_data_cust_18139['Quantity'] * purchasing_data_cust_18139['UnitPrice']
purchasing_data_cust_18139.groupby(key_columns = ['InvoiceNo', 'InvoiceDate', 'CustomerID'],
operations = {'Total Cost of Invoice':\
agg.SUM('CostPerStockCode')}).sort('InvoiceDate')
Out[25]:
In [26]:
purchasing_data_cust_18139.groupby(key_columns = 'CustomerID',
operations = {'Total Cost of Invoice':\
agg.SUM('CostPerStockCode')})
Out[26]:
In [27]:
purchasing_data.print_rows(num_rows=10, max_column_width=20)
In [28]:
purchasing_data.show()
Among the available attributes, the Description column is not going to help the model and should be excluded. Furthermore, the nominal categorical attribute InvoiceNo has too many different values (more than 22000) to be helpful for the GraphLab Create churn_predictor toolkit. However, as we saw earlier, what is important in the historical purchasing record of a customer is not the specific InvoiceNo against which he/she procured some goods but the type of the issued invoice: was it a Purchase/Standard invoice (Invoice Receipt) or a Cancelling invoice (Credit Note)? This type in fact determines whether an actual purchase or a product return happened, and our learning algorithm does not need to know anything more from this specific attribute. Therefore, we had better replace each Purchase/Standard invoice number (Invoice Receipt) in the InvoiceNo column with an 'IR' string and each Cancelling invoice number (Credit Note) with a 'CN' string.
In [29]:
purchasing_data.remove_column('Description')
purchasing_data['InvoiceNo'] = purchasing_data['InvoiceNo'].apply(lambda x: 'CN' if 'C' in x else 'IR')
purchasing_data.print_rows(num_rows=20, max_column_width=20)
The purchasing_data set now becomes:
In [30]:
purchasing_data.show()
In [31]:
univariate_summary_statistics_plot(purchasing_data,
['InvoiceNo','Quantity','UnitPrice','Country'],
nsubplots_inrow=2,
subplots_wspace=0.7)
In [32]:
purchasing_data.dtype
Out[32]:
The frequency plots for the multi-valued, nominal categorical variables, CustomerID and StockCode, are shown below.
In [33]:
%matplotlib inline
item_freq_plot(purchasing_data, 'CustomerID', topk=30)
In [34]:
%matplotlib inline
item_freq_plot(purchasing_data, 'StockCode', topk=30)
To better utilize our training set, a joint consideration of the two attributes InvoiceNo and StockCode would be advisable. This can be achieved by adding an extra derived attribute that concatenates the values of the other two. By doing so, we would be able to take into account both the type of the issued invoice [Invoice Receipt ('IR') or Credit Note ('CN')] and the specific StockCode which was procured or returned per record line. However, as is immediately apparent below, the percentages of distinct StockCodes appearing in the issued Credit Notes ('CN') are particularly low, so the idea of jointly considering invoice types and product codes cannot help much.
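For reference, such a derived attribute could be built as sketched below; the column name invoice_type_stockcode is just an illustrative choice, and we do not apply it here for the reason explained above.
In [ ]:
# sketch only (not applied here): derive a joint InvoiceNo/StockCode attribute
invoice_type_stockcode = purchasing_data.apply(
    lambda row: row['InvoiceNo'] + '_' + row['StockCode'])
invoice_type_stockcode.head(5)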
In [35]:
%matplotlib inline
item_freq_plot(purchasing_data, 'StockCode', hue='InvoiceNo', topk=30,
pct_threshold= 0.01, reverse=True)
In [36]:
purchasing_data_cn = purchasing_data[purchasing_data['InvoiceNo']=='CN']
purchasing_data_cn.groupby(['StockCode','InvoiceNo'], agg.COUNT()).\
sort('Count', ascending=False).print_rows(num_rows=30)
Finally, we want to split the users into a train set and a test set, making sure the test users are not in the training set, and to create TimeSeries objects out of them.
In [37]:
(train, test) = gl.churn_predictor.\
random_split(purchasing_data, user_id='CustomerID', fraction=0.95, seed=1)
train_trial = gl.TimeSeries(train, index='InvoiceDate')
test_trial = gl.TimeSeries(test, index='InvoiceDate')
The train/test data sets have been split in such a way that they contain different customers (CustomerID). This is because we want our learning algorithm to be blind to the data set we are going to use for its validation and final evaluation.
In [38]:
# load NumPy Library
import numpy as np
# verify that train and test sets have different CustomerIDs
train_customer_ids = train['CustomerID'].unique().sort().to_numpy()
test_customer_ids = test['CustomerID'].unique().sort().to_numpy()
print 'Number of same \'CustomerID\'s in test and train set: %d' %\
len(test_customer_ids[np.in1d(test_customer_ids, train_customer_ids, assume_unique=True)])
In [39]:
print "Train Start date : %s" % train_trial.min_time
print "Train End date : %s" % train_trial.max_time
In [40]:
print "Test Start date : %s" % test_trial.min_time
print "Test End date : %s" % test_trial.max_time
In [41]:
userdata = gl.SFrame('./userdata_sf/')
userdata
Out[41]:
In [42]:
userdata.dtype
Out[42]:
We also change the dtype of the nominal categorical attribute CustomerID to str.
In [43]:
userdata['CustomerID'] = userdata['CustomerID'].astype(str)
Here, we train the Churn Predictor model. We determine the period of inactivity (churn_period_trial) after which a customer will be considered churned. Furthermore, in order to extract more signal out of the data, we choose multiple churn boundaries for the same users. Remember that both the train and test TimeSeries start on 2010-12-01 and end on 2011-12-09.
In [41]:
## churn period of inactivity
churn_period_trial = dt.timedelta(days = 30)
## In order to extract more signal out of the data,
## Define multiple churn boundaries for the same users.
churn_boundary_aug = dt.datetime(year = 2011, month = 8, day = 1)
churn_boundary_sep = dt.datetime(year = 2011, month = 9, day = 1)
churn_boundary_oct = dt.datetime(year = 2011, month = 10, day = 1)
In [42]:
train.head(10)
Out[42]:
In [44]:
## train the Churn Predictor model
model = gl.churn_predictor.create(train_trial,
user_id = 'CustomerID',
features = None,
user_data = userdata,
churn_period = churn_period_trial,
# The time-scale/granularity
# at which features are computed (1 day here)
time_period = dt.timedelta(1),
# The various multiples of `time_period` used
# while computing features
lookback_periods = [7, 14, 21, 60, 90],
# Multiple churn boundaries to extract more signal out of data
time_boundaries = [churn_boundary_aug, churn_boundary_sep,\
churn_boundary_oct],
use_advanced_features=True)
In [30]:
# Evaluate this model in October
evaluation_time = churn_boundary_oct
In [31]:
# define cutoffs to consider
cutoffs_list = np.linspace(0.1, 1.0, num=100)
cutoffs_list = cutoffs_list.tolist()
# calculate metrics using test set
metrics = model.evaluate(test_trial, evaluation_time,\
user_data=userdata, cutoffs=cutoffs_list)
In [32]:
print(metrics)
In [33]:
def plot_pr_curve(precision, recall, title):
    import matplotlib.pyplot as plt
    plt.rcParams['figure.figsize'] = 7, 5
    plt.plot(precision, recall, 'g-', linewidth=2.0)
    plt.plot(precision, recall, 'bo')
    plt.title(title)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.rcParams.update({'font.size': 16})

def plot_roc_curve(fpr, tpr, title):
    import matplotlib.pyplot as plt
    plt.rcParams['figure.figsize'] = 7, 5
    plt.plot(fpr, tpr, 'g-', linewidth=2.0)
    plt.title(title)
    plt.xlabel('FP Rate')
    plt.ylabel('TP Rate')
    plt.rcParams.update({'font.size': 16})
In [34]:
%matplotlib inline
precision_all = metrics['precision_recall_curve']['precision']
recall_all = metrics['precision_recall_curve']['recall']
plot_pr_curve(precision_all, recall_all, 'Precision recall curve (Churn Predictor)')
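For completeness, the plot_roc_curve() helper defined above can be used in the same way. The sketch below assumes that the metrics dictionary returned by model.evaluate() also contains an 'roc_curve' SFrame with 'fpr' and 'tpr' columns; check the print(metrics) output above to confirm the exact keys.
In [ ]:
%matplotlib inline
# assumes metrics['roc_curve'] exists with 'fpr' and 'tpr' columns (see print(metrics) above)
fpr_all = metrics['roc_curve']['fpr']
tpr_all = metrics['roc_curve']['tpr']
plot_roc_curve(fpr_all, tpr_all, 'ROC curve (Churn Predictor)')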