In [117]:
%load_ext rmagic
%R library(ggplot2)


Out[117]:
array(['ggplot2', 'tools', 'stats', 'graphics', 'grDevices', 'utils',
       'datasets', 'methods', 'base'], 
      dtype='|S9')

Bayesian Split Testing

Smarter split tests

There's a problem with split tests rooted in classical statistics: they have to run to completion even when it's highly unlikely that their results will change. But there's a better way.

By using Bayesian methods we can continuously evaluate the results of our split tests and stop a test once we've reached a high enough level of certainty.

Exploring the concepts

Let's pretend we have a split test with a variant and a control. We want to design an approach that will tell us once we are 95% sure that one is better than the other. What does "better" mean in this situation? Well, as we're running the split test we'll be looking for the control's and variant's distributions to diverge significantly. The way we'll do this is by testing until the 95% credibility intervals no longer overlap. Conceptually this means none of the credible hypotheses for our variant's performance remain credible hypotheses for how our control performs.
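
Here's a minimal sketch of that stopping rule, just to make it concrete. The credible_interval argument is a stand-in for the real helper (highest_density_interval, defined further down); none of this is the final implementation.

def intervals_diverged(control_pmf, variant_pmf, credible_interval):
  # credible_interval is assumed to return a (low, high) tuple covering
  # 95% of the probability mass for the given pmf
  control_low, control_high = credible_interval(control_pmf)
  variant_low, variant_high = credible_interval(variant_pmf)
  # stop the test once the two intervals no longer overlap
  return variant_low > control_high or control_low > variant_high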

I'm not sure if this is the most mathematically sound method. One particular reason for my lack of confidence is that binomial tests seem to have more statistical power than this method. In a simulation I concocted, I ran 5,000 binomial tests comparing a 27% success rate against a variation's 28% success rate, and it had 90% statistical power when run with a sample size of 21,000. What this means is that 90% of the time the simulation was able to correctly identify the improvement in the variation over my control.

In a similar simulation I performed with this technique, using a maximum of 21,000 samples and stopping once the two distributions had diverged, I could only get to a power of 80%. An advantage of this technique, however, is that I only needed an average of 11,000 samples per test to identify the improvement.
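
For reference, this is roughly how a binomial power simulation like that could be reproduced. It's a sketch rather than the exact code I ran: it assumes SciPy is available and guesses that each run tests the variant's 21,000 conversions against the fixed 27% control rate with a two-sided binomial test.

import numpy as np
from scipy.stats import binom_test

def binomial_power(true_rate=0.28, null_rate=0.27, n=21000, runs=5000, alpha=0.05):
  # fraction of simulated tests that correctly reject the 27% null
  wins = 0
  for _ in xrange(runs):
    conversions = np.random.binomial(n, true_rate)
    if binom_test(conversions, n, null_rate) < alpha:
      wins += 1
  return float(wins) / runs

# binomial_power() should come out somewhere near the 90% figure quoted above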

Let's step through the process in my brain and see how I expect it to progress.

First, we start with a variation and control with the exact same distribution.


In [118]:
def normalize(pmf):
  # scale the likelihoods so they sum to one
  likelihood_sum = reduce(lambda acc, x: acc + x[1], pmf, 0)
  return [(hypothesis, likelihood/likelihood_sum) for (hypothesis, likelihood) in pmf]

def create_pmf():
  # one hypothesis for each percentage point of conversion rate: 0.00, 0.01, ..., 1.00
  hypotheses = [i/100.0 for i in xrange(0, 101)]
  # start with every hypothesis equally likely (a uniform prior)
  possibilities = [1.0]*101
  pmf = normalize(zip(hypotheses, possibilities))
  return pmf
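
As a quick sanity check on that prior (a check I'm adding here, not part of the original flow), every hypothesis should start with the same share of the probability mass and the whole thing should sum to one:

prior = create_pmf()
print len(prior)                                    # 101 hypotheses: 0.00, 0.01, ..., 1.00
print sum(likelihood for (_, likelihood) in prior)  # 1.0, give or take floating point
print prior[27]                                     # (0.27, 0.0099...) -- each hypothesis gets 1/101 of the mass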

In [134]:
%%R
create_chart <- function(hyp, likelihoods){
    chart <- qplot(x=hyp, y=likelihoods, geom="bar", stat="identity", fill = I("blue")) +
      xlab("Hypothesis for chance of success") +
      ylab("Likelihood of hypothesis") +
      ylim(0,0.05)
    print(chart)
}

In [135]:
control_hypotheses = create_pmf()
hypotheses = [hypothesis for (hypothesis, likelihood) in control_hypotheses]
control_hypothesis_likelihoods = [likelihood for (hypothesis, likelihood) in control_hypotheses]

%Rpush hypotheses control_hypothesis_likelihoods
%R create_chart(hypotheses, control_hypothesis_likelihoods)
"""The chart showing the initial beliefs for our control and variant hypotheses."""


Out[135]:
'The chart showing the initial beliefs for our control and variant hypotheses.'

That's our prior set of hypotheses. Ideally we'd have bins for infinitesimally small steps between zero and one, but we only have a limited amount of computer memory, so I just make one bin for each percentage point. Now, what might the two distributions look like after the experiment gets a few participants in both the control and the variant? This is one possibility:


In [121]:
%%R
create_dual_chart <- function(hypotheses, control_likelihoods, variant_likelihoods){
    likelihoods <- data.frame(hypotheses=hypotheses, control=control_likelihoods, variant=variant_likelihoods)
    chart <- ggplot(likelihoods, aes(x=hypotheses)) +
      #geom_vline(x=.27, color="red") +
      geom_bar(aes(y=control), stat="identity", binwidth=.01, position = position_dodge(), alpha = 0.5, colour="red", fill="red") +
      geom_bar(aes(y=variant), stat="identity", binwidth=.01, position = position_dodge(width = 0.005), alpha = 0.5, colour="blue", fill="blue") +
      xlim(0,1) +
      xlab("Hypothesis for chance of success") + 
      ylab("Likelihood of hypothesis")
    print(chart)
}

In [122]:
def update_success(pmf):
  # Bayes update for one success: weight each hypothesis by its chance of success
  resulting_pmf = [(hypothesis, likelihood*hypothesis) for (hypothesis, likelihood) in pmf]
  return normalize(resulting_pmf)

def update_failure(pmf):
  # Bayes update for one failure: weight each hypothesis by its chance of failure
  resulting_pmf = [(hypothesis, likelihood*(1-hypothesis)) for (hypothesis, likelihood) in pmf]
  return normalize(resulting_pmf)
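
To see what a single update does, here's a small worked example (mine, not from the original notebook). One observed success multiplies each hypothesis by the hypothesis itself before renormalizing, so the mass shifts toward higher conversion rates:

pmf = update_success(create_pmf())
print pmf[0]    # (0.0, 0.0) -- "never converts" is ruled out by a single success
print pmf[100]  # (1.0, 0.0198...) -- "always converts" is now the single most likely hypothesis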

In [123]:
control_hypotheses = create_pmf()
variant_hypotheses = create_pmf()

control_hypotheses = update_failure(control_hypotheses)
control_hypotheses = update_failure(control_hypotheses)
control_hypotheses = update_failure(control_hypotheses)
control_hypotheses = update_success(control_hypotheses)

variant_hypotheses = update_success(variant_hypotheses)
variant_hypotheses = update_success(variant_hypotheses)
variant_hypotheses = update_failure(variant_hypotheses)
variant_hypotheses = update_success(variant_hypotheses)
variant_hypotheses = update_success(variant_hypotheses)

control_hypothesis_likelihoods = [likelihood for (hypothesis, likelihood) in control_hypotheses]
variant_hypothesis_likelihoods = [likelihood for (hypothesis, likelihood) in variant_hypotheses]

%Rpush hypotheses control_hypothesis_likelihoods variant_hypothesis_likelihoods
%R create_dual_chart(hypotheses, control_hypothesis_likelihoods, variant_hypothesis_likelihoods)

"""How our control and variant compare."""


Out[123]:
'How our control and variant compare.'

You can see that both distributions are still pretty wide and still overlap a ton (the purple region). The 95% credibility interval of each is:


In [124]:
def highest_density_interval(pmf, percentage=.95):
  # walk the hypotheses from most to least likely until we've covered the
  # requested share of the probability mass, then report the range they span
  most_likely_first = sorted(pmf, key=lambda pair: pair[1], reverse=True)
  minimum_hypothesis = 1
  maximum_hypothesis = 0
  the_sum = 0
  for (hypothesis, likelihood) in most_likely_first:
    if the_sum >= percentage:
      break
    minimum_hypothesis = min(minimum_hypothesis, hypothesis)
    maximum_hypothesis = max(maximum_hypothesis, hypothesis)
    the_sum += likelihood
  return (minimum_hypothesis, maximum_hypothesis)
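
A tiny hand-checkable example (my own toy numbers, not part of the original): four hypotheses where the top three likelihoods (0.5 + 0.3 + 0.15) just reach 95% of the mass, so the least likely hypothesis is excluded from the interval:

toy_pmf = [(0.1, 0.5), (0.2, 0.3), (0.3, 0.15), (0.4, 0.05)]
print highest_density_interval(toy_pmf)  # (0.1, 0.3)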

In [125]:
print "Credibility interval for control: " + str(highest_density_interval(control_hypotheses))
print "Credibility interval for variant: " + str(highest_density_interval(variant_hypotheses))


Credibility interval for control: (0.03, 0.67)
Credibility interval for variant: (0.41, 0.98)

So the credible conversion rates for our control are anywhere between 3% and 67%, while the credible conversion rates for our variant are anywhere between 41% and 98%. We really don't know very much at this point. Let's skip ahead a bit further, to where these intervals no longer overlap.


In [131]:
def divergent_distributions(control_hypotheses, variant_hypotheses):
    control_hypotheses = update_failure(control_hypotheses)
    control_hypotheses = update_failure(control_hypotheses)
    control_hypotheses = update_failure(control_hypotheses)
    control_hypotheses = update_success(control_hypotheses)
    control_hypotheses = update_failure(control_hypotheses)
    control_hypotheses = update_failure(control_hypotheses)
    
    variant_hypotheses = update_success(variant_hypotheses)
    variant_hypotheses = update_success(variant_hypotheses)
    variant_hypotheses = update_failure(variant_hypotheses)
    variant_hypotheses = update_success(variant_hypotheses)
    variant_hypotheses = update_failure(variant_hypotheses)
    variant_hypotheses = update_success(variant_hypotheses)
    variant_hypotheses = update_success(variant_hypotheses)
    
    control_hypothesis_likelihoods = [likelihood for (hypothesis, likelihood) in control_hypotheses]
    variant_hypothesis_likelihoods = [likelihood for (hypothesis, likelihood) in variant_hypotheses]
    
    %Rpush hypotheses control_hypothesis_likelihoods variant_hypothesis_likelihoods
    %R create_dual_chart(hypotheses, control_hypothesis_likelihoods, variant_hypothesis_likelihoods)
    print "Credibility interval for control: " + str(highest_density_interval(control_hypotheses))
    print "Credibility interval for variant: " + str(highest_density_interval(variant_hypotheses))
    
divergent_distributions(control_hypotheses, variant_hypotheses)


Credibility interval for control: (0.04, 0.48)
Credibility interval for variant: (0.49, 0.92)

This is exactly the point where each distribution has 95% of its likelihood concentrated in hypotheses it doesn't share with the other.


In [13]:
def is_better_than(pmf_a, pmf_b):
  # a is "better" than b once a's entire credibility interval sits above b's
  a_credibility_interval = highest_density_interval(pmf_a)
  b_credibility_interval = highest_density_interval(pmf_b)
  return a_credibility_interval[0] > b_credibility_interval[1]

def update_possibilities(successful, hypotheses):
  # route each observed visitor to the appropriate Bayes update
  if successful:
    return update_success(hypotheses)
  else:
    return update_failure(hypotheses)
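
And a quick toy illustration of the comparison (again, my own numbers): two pmfs whose credible hypotheses are completely separated, so the variant wins in one direction and not the other:

control_toy = normalize([(0.2, 1.0), (0.3, 1.0)])
variant_toy = normalize([(0.6, 1.0), (0.7, 1.0)])
print is_better_than(variant_toy, control_toy)  # True  -- interval (0.6, 0.7) sits entirely above (0.2, 0.3)
print is_better_than(control_toy, variant_toy)  # False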


In [35]:
def run_simulation():
    control_conversion_rate = .27
    variation_a_rate = .3
    possible_sample_size = 8000
    
    %Rpush control_conversion_rate variation_a_rate possible_sample_size
    control_distribution = %R rbinom(possible_sample_size, 1, control_conversion_rate)
    variation_a_distribution = %R rbinom(possible_sample_size, 1, variation_a_rate)
    
    control_hypotheses = create_pmf()
    variation_a_hypotheses = create_pmf()
    
    hyp = [hypothesis for (hypothesis, likelihood) in variation_a_hypotheses]
    # charting is commented out here (and below) so that twenty simulations
    # don't each print a plot of the same uniform priors
    #prior_likelihoods = [likelihood for (hypothesis, likelihood) in variation_a_hypotheses]
    #%Rpush hyp prior_likelihoods
    #%R create_chart(hyp, prior_likelihoods)
        
    for i in xrange(0, possible_sample_size):
      control_hypotheses = update_possibilities(control_distribution[i], control_hypotheses)
      variation_a_hypotheses = update_possibilities(variation_a_distribution[i], variation_a_hypotheses)
        
      if is_better_than(variation_a_hypotheses, control_hypotheses):
        control_likelihoods = [likelihood for (hypothesis, likelihood) in control_hypotheses]
        variation_likelihoods = [likelihood for (hypothesis, likelihood) in variation_a_hypotheses]
        #%Rpush hyp control_likelihoods variation_likelihoods
        #%R create_dual_chart(hyp, control_likelihoods, variation_likelihoods)
        return ('A', i+1)
    control_likelihoods = [likelihood for (hypothesis, likelihood) in control_hypotheses]
    variation_likelihoods = [likelihood for (hypothesis, likelihood) in variation_a_hypotheses]
    #%Rpush hyp control_likelihoods variation_likelihoods
    #%R create_dual_chart(hyp, control_likelihoods, variation_likelihoods)
    return ('?', i+1)

simulation_count = 20
simulation_results = []

for i in xrange(0, simulation_count):
  simulation_results.append(run_simulation())

correct = reduce(lambda acc, x: acc+1 if x[0] == 'A' else acc, simulation_results, 0)
simulations = len(simulation_results)

samples = reduce(lambda acc, x: acc + x[1], simulation_results, 0)
print "Average samples needed: " + str(samples/simulation_count)
print "Power: " + str((1.0*correct)/simulations)


Average samples needed: 2289
Power: 1.0

In [ ]: