Practical Code to Calculating Customer Lifetime Value (CLV)

Customer Lifetime Value (CLV) is an estimation of the entire net profit attributed to a single customer. It’s an important metric to understand because it helps businesses determine how much is too much to spend on advertising to acquire a single customer.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

pd.set_option('max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2

%matplotlib inline

Data Exploration

For this example we’ll calculate CLV from a dataset of roughly 4,200 transactions.

In [3]:

Out[3]:
TransactionID TransactionDate CustomerID Amount
0 1 2012-09-04 1 20.96
1 2 2012-05-15 2 10.87
2 3 2014-05-23 2 2.21
3 4 2014-10-24 2 10.48
4 5 2012-10-13 2 3.94
5 6 2013-01-23 2 12.37

In [176]:
data.tail(6)

Out[176]:
TransactionID TransactionDate CustomerID Amount
4175 4176 2012-09-18 1000 9.69
4176 4177 2013-06-23 1000 3.86
4177 4178 2011-08-07 1000 4.00
4178 4179 2012-10-07 1000 18.37
4179 4180 2014-01-09 1000 3.65
4180 4181 2011-04-30 1000 5.18

In [177]:
Transactions = data['CustomerID'].count()

In [178]:
Customers = data['CustomerID'].max()

In [179]:
MinTransactionDate = data['TransactionDate'].min()

In [180]:
MaxTransactionDate = data['TransactionDate'].max()

In [181]:
Amount = data['Amount'].sum()

In [182]:
summary = [Transactions, Customers, MinTransactionDate, MaxTransactionDate, round(Amount, 2)]
summary

Out[182]:
[4181, 1000, '2010-01-04', '2015-12-31', 33729.91]

As with any analysis, the first thing we’ll do is look at some basic summary statistics.

In [9]:
data = {'Transactions': [4181],
'Customers': [1000],
'MinTransactionDate': ['2010-01-04'],
'MaxTransactionDate': ['2015-12-31'],
'Amount': [33729.91]}
df = pd.DataFrame(data, index = [''])
df

Out[9]:
Amount Customers MaxTransactionDate MinTransactionDate Transactions
33729.91 1000 2015-12-31 2010-01-04 4181

In [210]:
TransactionsPerCustomer = round(Transactions / Customers, 2)
TransactionsPerCustomer

Out[210]:
4.0

In [211]:
AmountPerTransaction = round(Amount / Transactions, 2)
AmountPerTransaction

Out[211]:
8.07

In [212]:
AmountPerCustomer = round(Amount / Customers, 2)
AmountPerCustomer

Out[212]:
33.73

Note that the data consists of 1000 customers who made transactions between 2010 and 2015. Furthermore, each customer made about 4 transactions for 8 bucks a piece, totaling close to \$34. This amount can be considered a lower bound on CLV since it’s the total amount spent by each customer, but we still expect existing customers to make future purchases.

In [213]:
data = {'TransactionsPerCustomer': [4.0],
'AmountPerTransaction': [8.07],
'AmountPerCustomer': [33.73]}
df = pd.DataFrame(data, index = [''])
df

Out[213]:
AmountPerCustomer AmountPerTransaction TransactionsPerCustomer
33.73 8.07 4

In [214]:
more_summary = [TransactionsPerCustomer, AmountPerTransaction, AmountPerCustomer]
more_summary

Out[214]:
[4.0, 8.07, 33.73]

We need to consider outlier transactions and should remove the transactions from the data entirely. Here we inspect the largest transactions.

In [4]:
data.loc[data['Amount'] >= 29.99]

Out[4]:
TransactionID TransactionDate CustomerID Amount
52 53 2015-01-29 13 37.27
260 261 2012-06-21 69 38.29
415 416 2013-12-31 100 30.31
582 583 2013-03-16 145 30.43
1341 1342 2013-06-28 345 29.99
2270 2271 2013-04-13 573 32.81
2487 2488 2011-07-13 632 36.94
2641 2642 2011-05-16 663 31.40
2757 2758 2013-08-31 691 38.35
3960 3961 2013-07-28 957 30.01

In [6]:
import seaborn as sns
sns.set(color_codes=True)

Plotting Univariate Distributions

We could use a statistical test to check for outliers, but here it’s pretty clear that none exist. Plotting the entire distribution of transaction amounts should give us more confidence in our assertion.

In [8]:
plt.title('Distribution of Transaction Amounts', fontsize=14, fontweight="bold")
sns.distplot(data.Amount, color='#3498db')

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xac12780>

Measuring Historic CLV

Now we need to consider the biggest source of error in our \$34 CLV lower bound – some of the underlying customers are brand new and others have been customers for almost five years. Obviously the newer customers will have (generally) spent less on average than the old ones. So, we need to separate the customers into groups based on how long ago they were acquired (e.g. customers acquired in 2010, vs customers acquired in 2011, …).

Since we have 5 years worth of data, let’s separate customers into annual origin periods starting on 2010-01-01, and measure their purchases annually.

In [ ]: