Analyzing Churn Using the Survivor Model

Alan Burke

March 23, 2016

I recently gave a talk at GDC on using the survivor model to identify casual factors in churn. This post is to publish the code I used in that presentation and is useful for individuals who would like to practice creating a survivor model. Survivor models are popular for modeling single incidence events such as player churn. Another popular use of survivor model is to analyze a player's first MTX purchase.

Scenario 1 - New Ad

Imagine a situation where your user acquisition team launched a new advertisement campaign. They used a provocative ad that generates a lot of new users. However, your game is free to play and you monetize steadily over the life of the player. You want to use churn as a proxy for quality, since if the players do not stay in your game long enough, it is impossible to monetize on them.

The code below, when pasted into Excel, will create some sample data for the first scenario. Please paste the formula in at least 1000 rows so you have enough data.


In [ ]:
Random	Start	End	Churned	DaysPlayed	 Amazon 	Google	Other	PurchasePrice	FirstSessionLength
=RAND()	=ROUND(RAND()*5,0)	=B2+ROUND(RAND()*10,0)+1	=IF((((C2-B2)^-2)*A2)<0.02,1,0)	=C2-B2	=IF(0.5+D2*0.25-RAND()^0.5<-0.15,1,0)	=IF(F2=0,IF(0.5+D2*0.25-RAND()^0.5<0,1,0),0)	=IF(G2+F2=0,1,0)	=ROUND((0.25-D2*0.25+RAND()^0.3)*50,-1)	=ROUND((0.5-D2*0.5+3*(RAND()^0.5))*50,0)

Since all numbers where generated without a set seed, I include the actual values used in this analysis below. In constructing this dataset, each row represents one player.

The next piece is to run the R code to generate the survivor model.


In [ ]:
# Read in your data
churn <- read.csv("path_to_your_data/Ad Data.csv")
# Attaching allows you to refer to your data with shorthand
attach(churn)
# Initialize the survival library.  This steps assumes you have already installed the library
library("survival")

# Please see the survival library documentation for a full reference of
# all the survival functions.  

#Here we calculate the incident of churn each day 
# from the number of players that are still active
churn.surv <- survfit(Surv(DaysPlayed, Churned)~ 1, conf.type="none")
summary(churn.surv)
# Plotting survival probability
plot(churn.surv, xlab="DaysPlayed", ylab="Survival Probability")

# The first churn model uses the exponential distribution
churn.survexp <- survreg(Surv(DaysPlayed, Churned) ~ 
    Amazon+Google+FirstSessionLength+PurchasePrice, dist="exponential")
summary(churn.survexp)

# The second churn model uses the Weibull distribution
churn.survWeibull <- survreg(Surv(DaysPlayed, Churned) ~ 
    Amazon+Google+FirstSessionLength+PurchasePrice, dist="weibull")
summary(churn.survWeibull)

library(lmtest)
# The test results show that the models are stastically different 
# and the Weibull fits the data better
lrtest(churn.survexp,churn.survWeibull)

# clean up
detach(churn)

Here is the condensed output from the above code:


In [ ]:
survfit(formula = Surv(DaysPlayed, Churned) ~ 1, conf.type = "none")

 time n.risk n.event survival  std.err
    1 104856      98    0.999 9.44e-05
    2  99529     846    0.991 3.05e-04
    3  89208    1854    0.970 5.60e-04
    4  78741    3382    0.928 8.82e-04
    5  68210    5222    0.857 1.25e-03
    6  57770    7564    0.745 1.62e-03
    7  47249   10480    0.580 1.90e-03
    8  36573   10386    0.415 1.93e-03
    9  26187   10372    0.251 1.71e-03
   10  15815   10514    0.084 1.10e-03
   11   5301    5301    0.000      NaN
   

survreg(formula = Surv(DaysPlayed, Churned) ~ Amazon + Google + 
    FirstSessionLength + PurchasePrice, dist = "exponential")
                     Value Std. Error    z         p
(Intercept)        1.18025   0.017063 69.2  0.00e+00
Amazon             0.40787   0.010726 38.0  0.00e+00
Google             0.11250   0.008739 12.9  6.32e-38
FirstSessionLength 0.00272   0.000105 25.9 1.44e-147
PurchasePrice      0.01661   0.000351 47.3  0.00e+00

Scale fixed at 1 

Exponential distribution
Loglik(model)= -212158.6   Loglik(intercept only)= -214884.3
	Chisq= 5451.41 on 4 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 4 
n= 104856 


survreg(formula = Surv(DaysPlayed, Churned) ~ Amazon + Google + 
    FirstSessionLength + PurchasePrice, dist = "weibull")
                       Value Std. Error       z        p
(Intercept)         2.098805   4.50e-03  466.85 0.00e+00
Amazon              0.012933   2.48e-03    5.21 1.84e-07
Google              0.003131   2.03e-03    1.54 1.24e-01
FirstSessionLength  0.000143   2.54e-05    5.62 1.95e-08
PurchasePrice       0.000690   9.01e-05    7.65 1.96e-14
Log(scale)         -1.456626   3.02e-03 -482.26 0.00e+00

Scale= 0.233 

Weibull distribution
Loglik(model)= -146039.7   Loglik(intercept only)= -146101
	Chisq= 122.62 on 4 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 15 
n= 104856 


lrtest(churn.survexp,churn.survWeibull)

Likelihood ratio test
Model 1: Surv(DaysPlayed, Churned) ~ Amazon + Google + FirstSessionLength + 
    PurchasePrice
Model 2: Surv(DaysPlayed, Churned) ~ Amazon + Google + FirstSessionLength + 
    PurchasePrice
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   5 -212159                         
2   6 -146040  1 132238  < 2.2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05.0.1 ‘ ’ 1

By comparing the two models, we can see that they are statistically different and the Weibull distribution fits the data better. Now that we have identified our game data as having a Weibull distribution in churn, let us analyze the effect of the new ad.


In [ ]:
# Read in the ad data
cyborg <- read.csv("path_to_your_data//Churn Data with Cyborg Campaign.csv")
attach(cyborg)
library("survival")

# Run the Weibull model again, but with the added CyborgAd variable
cyborg.survWeibull <- survreg(Surv(DaysPlayed, Churned) ~ 
    CyborgAd+ Amazon+Google+FirstSessionLength+PurchasePrice, dist="weibull")
summary(cyborg.survWeibull)

detach(cyborg)

With the following condensed output:


In [ ]:
survreg(formula = Surv(DaysPlayed, Churned) ~ CyborgAd + Amazon + 
    Google + FirstSessionLength + PurchasePrice, dist = "weibull")
                       Value Std. Error       z        p
(Intercept)         2.098137   5.33e-03  393.98 0.00e+00
CyborgAd           -0.095127   2.12e-03  -44.81 0.00e+00
Amazon              0.012497   2.85e-03    4.39 1.12e-05
Google             -0.002237   2.33e-03   -0.96 3.37e-01
FirstSessionLength  0.000121   2.94e-05    4.11 3.91e-05
PurchasePrice       0.000502   1.03e-04    4.88 1.08e-06
Log(scale)         -1.233473   2.82e-03 -437.16 0.00e+00

Scale= 0.291 

Weibull distribution
Loglik(model)= -180633.5   Loglik(intercept only)= -181713.9
	Chisq= 2160.92 on 5 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 11 
n= 104856

Here, negative coefficients indicate an increased likelihood to churn. The results show that although the Cyborg ad had a solid acquisition rate, the players did not retain as well. To calculate whether the campaign itself was profitable, you would have to look at your monetization of players with the shorter duration of play and compare it to your acquisition costs.

Scenario 2 - Patching a Title

The premise of the second scenario is that the game is patched. The patch makes these changes:

  • the patch goes live on day 4 to 50% of the player base
  • it includes a critical bug fix
  • it has a UI change to make warriors more prominent
  • it includes an update to the graphics

Unbeknownst to the developers, the patch also includes a critical bug that only affects wizards.

Here is the SQL code used to generate the random data. The SQL code uses window analytic functions and should run in systems using PostgreSQL style syntax.


In [ ]:
/* A rough order of operations for how the code runs:
Create a table of random variables.
Apply player specific qualities based on those stats, such as which class they choose.
Create daily specific variables, including some additional 
random variables such as whether they had a bug that day.
Calculate when a player churns.  Since this is fake data, a player can "churn" multiple times.  
To correct the fake data, anyone marked as churned has all data truncated after their first 
churn flag using window functions.
*/


drop table if exists schema_name_here.churn_test_model;

-- create a permanent table
create table schema_name_here.churn_test_model as (
with random_sequence as (
	-- create a set of random numbers to build the data set
	select setseed(0) as null_seed, null as quit_seed, null as start_date_seed
		, null as class_seed, null as patch_seed
	union all
	select null
		-- approximate a random normal variable
		, random() + random() + random() + random() + random() + random() + random() 
			+ random() + random() + random() + random() + random()-6 
		, random()
		, random() 
		, random()
	from generate_series(1, 100000)
	offset 1) 
, seed_values as
	-- create the player specific data
	(select *
		, trunc(start_date_seed*8) as start_date
		, case when start_date_seed*8<3 and class_seed <.6 then 1 -- warrior, pre patch
			when start_date_seed*8<3 then 2 -- wizard, pre patch 
			when class_seed<.6 and patch_seed<.5 then 1 -- warrior, no patch
			when patch_seed<.5 then 2 -- wizard, no patch
			when class_seed <.8 then 1  -- warrior, patch 
			else 2 -- wizard, patch
			end as class
		, case when patch_seed>=.5 then 1
		else 0 end as patch_group
	from random_sequence
	)
, random_day as 
	-- start creating the daily variables
	(select * 
		, start_date+age_of_account as start
		, start_date+age_of_account+1 as stop
		, random() as bug_seed
		, random()+random()*random()-3 minutes_played_seed 
		, case when patch_seed>=.5 and start_date+age_of_account>3 then 1
			else 0 end as patch
        -- make the daily calculations have noise
		, (random() + random() + random() + random() + random() + random() 
			+ random() + random() + random() +
			random() + random() + random()-6)/4 as random_noise  
	from seed_values
	cross join
		(select row_number() over() as age_of_account
		from generate_series(1, 10)) as study_days
	)
, day_event as
	-- add in daily bugs
	(select *
		-- bugs vary depending on patch and class
		, case when class=2 then 1 else 0 end as is_wizard
		, case when bug_seed>.9 and patch=0 then 1 
			when bug_seed>.85 and patch=1 and class=2 then 1
			when bug_seed>.95 and patch=1 and class=1 then 1
			else 0 end as bug
		-- time played varies on bug chance
		, case when bug_seed>.9 and patch=0 then @minutes_played_seed*15
			when bug_seed>.85 and patch=1 and class=2 then @minutes_played_seed*15
			when bug_seed>.95 and patch=1 and class=1 then @minutes_played_seed*15
			else @minutes_played_seed*20 end as minutes_played
	from random_day
	) 
, input_variable as
	-- calculate when the player churns
	(select *
	, -log(age_of_account+1)*0.5+ln(minutes_played)/20
         +(@quit_seed)/4+patch*0.025+random_noise/6
		-bug*0.12-is_wizard*patch*0.04 as test
	, case when -log(age_of_account+1)*0.5+ln(minutes_played)/20
		+(@quit_seed)/4+patch*0.025
		+random_noise/6-bug*0.12-is_wizard*patch*0.04<0 
		then 1 else 0 end as churn
	 from day_event
	) 
, order_data as
	-- ordering the data by individual player churn
	(select start, stop, age_of_account, patch, bug, minutes_played, quit_seed
		, patch_group, is_wizard, test, bug_seed
		, max(churn) over 
			(partition by quit_seed order by quit_seed
			, age_of_account desc range between current row and unbounded following) as churn
	from input_variable
	) 
, limiting as 
	-- select the first time a player churns
    (select *
        , lag(churn) over (partition by quit_seed order by age_of_account) as lag_churn
    from order_data
    ) 
select start, stop, age_of_account, patch, bug, minutes_played, is_wizard, churn
	, patch_group, quit_seed, test, bug_seed
from limiting
where (lag_churn<>1 or lag_churn is null) and stop<=10
)

This code generates a random set of data for our model. The actual data used can be at the bottom of this post. Notice that each day of data is its own row, and the quit_seed variable actually functions as the player id. The data is organized similar to a panel dataset.

Here is the R code used to run the model:


In [ ]:
library(RODBC)

myconn<-odbcConnect('This_is_your_odbc_connection')

patch_data =  sqlQuery(myconn,"select * from schema_name_here.churn_test_model")

attach(patch_data)

summary(coxph(Surv(start, stop, churn) ~ 
    bug+is_wizard+patch+age_of_account+minutes_played+cluster(quit_seed), data=patch_data))

The results of the code can be found here:


In [ ]:
coxph(formula = Surv(start, stop, churn) ~ bug + is_wizard + 
    patch + age_of_account + minutes_played + cluster(quit_seed), 
    data = patch_data)

  n= 382440, number of events= 59227 

                     coef  exp(coef)   se(coef)  robust se       z Pr(>|z|)    
bug             1.3921964  4.0236780  0.0112569  0.0110634 125.838   <2e-16 ***
is_wizard       0.0830510  1.0865972  0.0086099  0.0092643   8.965   <2e-16 ***
patch          -0.0828029  0.9205325  0.0089436  0.0094945  -8.721   <2e-16 ***
age_of_account  0.1582036  1.1714047  0.0023148  0.0024072  65.720   <2e-16 ***
minutes_played -0.0196435  0.9805482  0.0005978  0.0005983 -32.832   <2e-16 ***

Here, positive coefficients indicate a higher likelihood to churn. The initial results show that the patch reduces churn. However, in the forums, customer service reports an increased number of complaints from wizards. Forums are notoriously hard to extrapolate to the entire player community; so instead, let us use the survivor model to see what is happening with wizards and the patch by creating an interaction variable.


In [ ]:
patch_data$wizard_bug <- is_wizard*bug

summary(coxph(Surv(start, stop, churn) ~ 
    bug+is_wizard+patch+age_of_account+minutes_played+
    wizard_patch+cluster(quit_seed), data=patch_data))

detach(patch_data)

The formula stays the same except for the addition of the new variable, wizard_patch.


In [ ]:
coxph(formula = Surv(start, stop, churn) ~ bug + is_wizard + 
    patch + age_of_account + minutes_played + wizard_patch + 
    cluster(quit_seed), data = patch_data)

  n= 382440, number of events= 59227 

                     coef  exp(coef)   se(coef)  robust se       z Pr(>|z|)    
bug             1.3777800  3.9660871  0.0113080  0.0111103 124.009   <2e-16 ***
is_wizard      -0.0054967  0.9945184  0.0109240  0.0115514  -0.476    0.634    
patch          -0.1666453  0.8464998  0.0109573  0.0112832 -14.769   <2e-16 ***
age_of_account  0.1560963  1.1689388  0.0023195  0.0024141  64.659   <2e-16 ***
minutes_played -0.0196420  0.9805496  0.0005978  0.0005983 -32.830   <2e-16 ***
wizard_patch    0.2353473  1.2653481  0.0176368  0.0188478  12.487   <2e-16 ***

When you separate out wizards who receive the patch, you find that they had increased likelihood to churn due to the increased incidence in bugs.

Summary

While the examples used in this blog are trivial, they are useful for showing how the survival model can fit single incidence events with non-normal distributions such as churn quite well. Feeding the data into the model is straightforward; and, in the modelling stage, you have a lot of flexibility in what variables you include. Ideally, you would have some sort of experiment running to give you clear causation, but even correlations can help you understand how different factors relate to each other.