In [1]:
# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../notebook_format')
from formats import load_style
load_style()
Out[1]:
In [2]:
os.chdir(path)
import numpy as np
import scipy.stats as stats
# magic to print version
%load_ext watermark
%watermark -a 'Ethen' -d -t -v -p numpy,scipy
A/B testing is a general methodology used when you want to test whether a new feature or change is better. You show one set of features, the control set (your existing feature), to one user group, and another set, the experiment set (your new feature), to another user group, then measure how the two groups respond differently so you can determine which set is better. If the ultimate goal is to decide which model or design is best, then A/B testing is the right framework, along with its many gotchas to watch out for.
The following section describes a possible workflow for conducting A/B testing.
After defining the high-level goal, find out (not guess) which parts of your business are underperforming or trending and why. Quantitative methods do a much better job of answering how many and how much types of questions, whereas qualitative methods such as user experience groups (you go really deep with a few users; this can take the form of observing users doing tasks or asking users to self-document their behaviors) and surveys are much better suited for answering questions about why, or about how to fix a problem.
Take a look at your conversion funnel. Examine the flow from the persuasive end (top of the funnel) to the transactional end (bottom of the funnel). e.g. You can identify problems by starting from the top 5 highest bounce rate pages. During the examination, segment to spot underlying underperformance or trends.
Consider segmenting your users into different buckets and testing against each of them, because mobile visitors perform differently from desktop ones, new visitors are different from returning visitors, and e-mail traffic is different from organic traffic. Start thinking "segment first."
During the process ask yourself: 1) Why is it happening? 2) How can we spread the success to other areas of the site?
e.g. You’re looking at your metric of total active users over time and you see a spike in one of the timelines. After confirming that this is not caused by seasonal variation, we can look at different segments of our visitors to see if one of them is causing the spike. Suppose we have chosen to segment geographically; it might turn out that a large proportion of the traffic is generated by a specific region, and it would be best for us to dig deeper and understand why.
Three simple ideas for gathering qualitative data to understand the why
Now that you've identified the overall business goal and the possible problem (e.g. less than one percent of visitors sign up for our newsletter), it's time to prioritize your website goals. Three categories of goals include:
What you need to do is decide how to assign users to either the control or the experiment group. There are three commonly used categories, namely user id, anonymous id (cookie) and event.
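To make this concrete, here's a minimal sketch (not something prescribed by the template) of one common way to do the assignment when the unit is a user id or a cookie: hash the id together with the experiment name and take the result modulo the number of buckets. The ids and experiment name below are hypothetical.

import hashlib

def assign_bucket(user_id, experiment_name, n_buckets = 2):
    # hashing the id together with the experiment name keeps the assignment
    # stable for the same user while staying independent across experiments
    key = '{}-{}'.format(experiment_name, user_id).encode('utf-8')
    bucket = int(hashlib.md5(key).hexdigest(), 16) % n_buckets
    return 'control' if bucket == 0 else 'experiment'

# hypothetical user ids
for user_id in ['user-001', 'user-002', 'user-003']:
    print(user_id, assign_bucket(user_id, 'newsletter-signup-test'))

The same user always lands in the same bucket, and a cookie id can be hashed in exactly the same way when users aren't logged in.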
If you think you can identify which population of your users will be affected by your experiment, you might want to target your experiment to that traffic (e.g. changing features specific to one language’s users) so that the rest of the population won’t dilute the effect.
Next, depending on the problem you’re looking at, you might want to use a cohort instead of a population. A cohort makes much more sense than looking at the entire population when testing out learning effects, examining user retention or anything else that requires the users to be established for some reason.
A quick note on cohorts. The gist of cohort analysis is basically putting your customers into buckets so you can track their behavior over a period of time. A cohort is a group of customers grouped by the time window (it can be a week or a month) in which they first made a purchase (or performed some other action that's valuable to the business). Within a cohort, your two user groups share roughly the same parameters, which makes them more comparable.
e.g. You’re an educational platform with an existing course that’s already up and running. Some of the students have completed the course, some of them are midway through, and some have not yet started. Suppose you want to change the structure of one of the lessons to see if it improves the completion rate of the entire course, and you start the experiment at time X. Students who started before the experiment was initiated may have already finished the lesson, so they may never even see the change. Taking the whole population of students and running the experiment on them isn’t what you want. Instead, you want to segment out the cohort, the group of students that started the lesson after the experiment was launched, and split that into an experiment and a control group.
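As a rough sketch of the cohort idea (this assumes pandas, which isn't imported in this notebook, and the students and dates below are made up), we can keep only the students who started the lesson after the experiment launched and bucket them by the month they started:

import pandas as pd

# hypothetical log: one row per student, with the date they started the
# lesson and whether they eventually completed the course
df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'started_at': pd.to_datetime(['2017-01-02', '2017-02-10',
                                  '2017-03-05', '2017-03-20', '2017-04-01']),
    'completed': [True, False, True, False, True]})

# only the cohort that started after the experiment launched should be
# split into control and experiment groups
experiment_launch = pd.Timestamp('2017-03-01')
cohort = df[df['started_at'] >= experiment_launch]

# bucketing by the month students started lets us track completion per cohort
print(cohort.groupby(cohort['started_at'].dt.to_period('M'))['completed'].mean())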
When do I want to run the experiment, and for how long?
e.g. Suppose we’ve chosen the goal of increasing click-through rate, which is defined as the unique number of people who click the button divided by the number of users who visited the page where the button is located. But to actually use this definition, we’ll also have to address some other questions. For instance, if the same user visits the page once and comes back a week or two later, do we still only want to count that once? Thus we’ll also need to specify a time period.
To account for this, if 99% of your visitors convert after 1 week, then you should do the following.
So one version of the fully-defined metric will be: For each week, the number of cookies that clicked divided by the number of cookies that interacted with the page (also add the population definition).
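To make the definition concrete, here's a small sketch of computing this weekly click-through rate from a hypothetical event log (pandas is assumed here, and the weeks, cookies and clicks are all made up):

import pandas as pd

# hypothetical event log: one row per page interaction, recording the week,
# the cookie that interacted with the page and whether the button was clicked
events = pd.DataFrame({
    'week': ['2017-W01', '2017-W01', '2017-W01', '2017-W02', '2017-W02'],
    'cookie_id': ['a', 'b', 'c', 'a', 'd'],
    'clicked': [True, False, False, True, True]})

def weekly_ctr(group):
    # unique cookies that clicked / unique cookies that interacted with the page
    clicked = group.loc[group['clicked'], 'cookie_id'].nunique()
    interacted = group['cookie_id'].nunique()
    return clicked / interacted

print(events.groupby('week').apply(weekly_ctr))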
Running the test for at least a week is advised, since it ensures that the experiment captures the different user behaviors of weekdays versus weekends. Also try to avoid holidays ....
If your population is defined and you have enough traffic, another consideration is what fraction of the traffic you are going to send through the experiment. There are reasons why you might not want to run the experiment on all of your traffic, even though doing so would get you the result faster.
After collating all the ideas, prioritize them by scoring each one on three simple metrics:
Every test that's developed is documented so that we can review and prioritize ideas that are inspired by winning tests.
Some ideas worth experimenting are: Headlines, CTA (call to actions), check-out pages, forms and the elements include:
This section lists out some caveats that are not really mentioned or covered briefly in the template above.
NO PEEKING. When you run an A/B test, you should avoid stopping the experiment as soon as the results "look" significant. Using a stopping time that is dependent upon the results of the experiment can inflate your false-positive rate substantially.
To understand why this is so, let's look at a simpler experimental problem. Let's say that we have a coin in front of us, and we want to know whether it's biased -- whether it lands heads-up with probability other than 50%. If we flip the coin $n$ times and it lands heads-up on $k$ of them, then the posterior distribution for the coin's bias is $p \sim \text{Beta}(k + 1, n - k + 1)$. So if we do this and 0.5 isn't within a 95% credible interval for $p$, then we would conclude that the coin is biased with p-value $\leq 0.05$. This is all fine as long as the number of flips we perform, $n$, doesn't depend on the results of the previous flips. If the number of flips does depend on the outcomes of earlier flips, then we bias the experiment to favor extreme outcomes.
Let's clarify this by simulating these two experimental procedures in code.
How many false positives do you think the biased procedure will produce? We chose our p-value to be 0.05, so the false positive rate should be about 5%. Let's repeat each procedure 10,000 times, assuming that the coin really is fair, and see what the false positive rates really are:
In [3]:
def unbiased_procedure(n):
    """
    Parameters
    ----------
    n : int
        number of experiments (1000 coin flips) to run
    """
    false_positives = 0
    for _ in range(n):
        # success[-1] : total number of heads after the 1000 flips
        success = ( np.random.random(size = 1000) > 0.5 ).cumsum()
        beta_cdf = stats.beta( success[-1] + 1, 1000 - success[-1] + 1 ).cdf(0.5)
        if beta_cdf >= 0.975 or beta_cdf <= 0.025:
            false_positives += 1

    return false_positives / n
In [4]:
def biased_procedure(n):
    """
    Parameters
    ----------
    n : int
        number of experiments (1000 coin flips) to run
    """
    false_positives = 0
    for _ in range(n):
        # running total of heads after each of the 1000 flips
        success = ( np.random.random(size = 1000) > 0.5 ).cumsum()
        trials = np.arange( 1, 1001 )
        # posterior cdf at 0.5 after every flip; flagging a positive as soon as
        # any of them crosses the threshold mimics peeking and stopping early
        history = stats.beta( success + 1, trials - success + 1 ).cdf(0.5)
        if ( history >= 0.975 ).any() or ( history <= 0.025 ).any():
            false_positives += 1
    return false_positives / n
In [5]:
# simulating 10k experiments under each procedure
print( unbiased_procedure(10000) )
print( biased_procedure(10000) )
Almost half of the experiments under the biased procedure produced a false positive. Conversely, only around 5% of the unbiased procedure's experiments resulted in a false positive, which is close to the 5% false positive rate that we expected given our p-value.
The easiest way to avoid this problem is to choose a stopping time that's independent of the test results. You could, for example, decide in advance to run the test for a fixed amount of time, no matter what results you observe during the test's tenure. Thus, just like in the template above, if 99% of your visitors convert within 1 week, you would run the test for at least a week.
Or you could decide to run the test until each bucket has received more than 10,000 visitors, again ignoring the test results until that condition is met. There are also methods, such as power analysis, that let you determine how many samples you need before drawing a conclusion about the result. But you should be VERY CAREFUL with this, because the truth is: it's not really the number of conversions that matters; it's whether the time frame of the test is long enough to capture the variations on your site.
For instance, website traffic may behave one way during the day and another way at night (the same holds for weekdays versus weekends). Then there's the novelty effect: when users are switched to a new experience, their initial reactions may not be their long-term reactions. In other words, if you are testing a new color for a button, the user may initially love the button and click it more often just because it's novel, or she may hate the new color and never touch it, but eventually she would get used to the new color and behave as she did before. It's important to run the trial long enough to get past the period of the "shock of the new".
In sum, setting a results-independent stopping time (e.g. a week) is the easiest and most reliable way to avoid biased stopping times.
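As a side note on the "more than 10,000 visitors per bucket" style of stopping rule mentioned above, a rough, back-of-the-envelope per-group sample size for comparing two proportions can be computed with the normal approximation. This is a standard textbook approximation rather than something prescribed by the template, and the baseline and expected click-through rates below are made up.

import numpy as np
import scipy.stats as stats

def required_sample_size(p_baseline, p_expected, alpha = 0.05, power = 0.8):
    """
    Rough per-group sample size for detecting a change from p_baseline to
    p_expected with a two-sided two-proportion z-test, using the normal
    approximation. alpha is the significance level and power is the
    probability of detecting the effect if it's real.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = abs(p_expected - p_baseline)
    return int(np.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2))

# e.g. detecting a lift from a 10% to an 11% click-through rate
print(required_sample_size(0.10, 0.11))

A formula like this tells you roughly how many visitors you need per group, but as stated above, the test still has to run long enough to capture weekly patterns on your site.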
If you're running a lot of A/B tests, you should run follow-up tests and pay attention to your base success rate.
Let's talk about these in reverse order. Imagine that you do everything right. You set your stopping time in advance, and keep it independent from the test results. You set a relatively high success criterion: A probability of at least 95% that the variant is better than the control (formally, $p \leq 0.05$). You do all of that.
Then you run 100 tests, each with all the rigor just described. In the end, of those 100 tests, 5 of them claim that the variant beats the control. How many of those variants do you think are really better than the control, though? If you run 20 tests in a row in which the "best" variant is worse than or statistically indistinguishable from the control, then you should be suspicious when your 21st test comes out positive. If a button-color test failed to elicit a winner six months ago, but did produce one today, you should be skeptical. Why now but not then?
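One way to build intuition for this is a quick simulation. Purely as an assumption, suppose that only 10% of the variants we ever test are genuinely better, and that our test detects a genuinely better variant 80% of the time (its power). Then far fewer of the "winning" tests are real than the 5% false positive rate alone would suggest:

np.random.seed(123)
n_tests = 100000   # simulate many tests to get a stable estimate
alpha, power, base_rate = 0.05, 0.8, 0.1

# which tests actually have a genuinely better variant (assumed 10% base rate)
truly_better = np.random.random(n_tests) < base_rate

# a genuinely better variant comes out significant with probability `power`,
# a no-better variant still comes out significant with probability `alpha`
significant = np.where(truly_better,
                       np.random.random(n_tests) < power,
                       np.random.random(n_tests) < alpha)

print('share of tests that come out significant:',
      significant.mean())
print('share of significant tests that are genuinely better:',
      truly_better[significant].mean())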
Here's an intuitive way of thinking about this problem. Let’s say you have a class of students who each take a 100-item true/false test on a certain subject. Suppose each student chooses randomly on all questions. Each student would achieve a random score between 0 and 100, with an average of 50.
Now take only the top-scoring 10% of the class, declare them "winners", and give them a second test on which they again choose randomly. They will tend to score lower on the second test than on the first. That’s because, no matter what they scored on the first test, they will still average 50 correct answers on the second test. This is what's called regression to the mean, and it explains why tests that seem successful often lose their uplift over time.
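A quick simulation of the true/false test example (the class size below is arbitrary) makes the effect visible: the "winners" look well above average on the first test and fall right back to about 50 on the second.

np.random.seed(123)
n_students, n_questions = 1000, 100

# every student guesses randomly on a 100-item true/false test
first_test = np.random.binomial(n_questions, 0.5, size = n_students)

# take the top scoring 10% of the class and have them guess randomly again
winners = first_test >= np.percentile(first_test, 90)
second_test = np.random.binomial(n_questions, 0.5, size = winners.sum())

print('winners, first test average: ', first_test[winners].mean())
print('winners, second test average:', second_test.mean())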
It can be wise to run your A/B tests twice (a validation test). You’ll find that doing so helps to eliminate illusory results. If the results of the first test aren’t robust, you’ll see a noticeable decay in the second. But if the uplift is real, you should still see it during the second test. This approach isn’t fail-safe, but it will help check whether your results are robust. e.g. In a multiple-variant test, you tried out three variants, B, C and D, against the control A, and variant C won. Don't deploy it fully yet. Drive 50% of your traffic to variant C and 50% to variant A (or some modification of this; the percentage split is not important as long as you will have reasonable statistical power within an acceptable time period), as this will give you more information about C's true performance relative to A.
Given the situation above, it's better to keep a record of previous tests: when they were run, the variants that were tried, etc., since this historical record gives you an idea of what's reasonable. Even though this information is not directly informative of the rates you should expect from future tests (the absolute numbers are extremely time-dependent, so the raw numbers that you get today will be completely different from the ones you would have gotten six months later), it gives you an idea of what's plausible in terms of each test's relative performance.
Also, by keeping a record of previous tests, we can avoid:
Let's say you deploy a new feature to your product and wish to see if it increases the product's activation rate (or any other metric or KPI that's relevant to you). Currently the baseline of the product's activation rate is somewhere around 40%. After running the test, you realize that it WORKED: activation went up to 50%. So you're like, YES! I just raised activation by 25%! You send this info to the head of product and ask for a raise.
After two months, the head of product comes back to you and says, "You told me you raised the activation rate by 25%. Shouldn't that mean I should see a big jump in the overall activation? What's going on?" Well, what's going on is that you did raise activation by 25%, but only for users who use that feature. So if only 10 percent of your users use the feature, then the overall increase in activation rate will probably only be around 2.5% (25% * 10%). That is still probably very good, but the expectation you've set by misreporting can get you into trouble.
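Spelling out the arithmetic with the (made-up) numbers from this example:

baseline_activation = 0.40   # product-wide activation rate before the feature
feature_activation = 0.50    # activation rate among users who use the feature
feature_adoption = 0.10      # assumed share of users who use the feature

# relative lift among the users who actually use the feature
relative_lift = (feature_activation - baseline_activation) / baseline_activation

# overall activation: 90% of users stay at the baseline rate, 10% move to the new rate
overall = (1 - feature_adoption) * baseline_activation + feature_adoption * feature_activation
overall_relative_lift = (overall - baseline_activation) / baseline_activation

print('lift among feature users: {:.1%}'.format(relative_lift))
print('overall lift:             {:.1%}'.format(overall_relative_lift))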
Suppose you have different types of users (or users with different usage patterns) using your product, e.g. business users and students. Then what can happen is that your A/B test gives different results in July versus October. The reason may be that in July all your student users are out on vacation (not using your product), and in October, after school starts, they begin using it again. This is simply saying that the weighting of your user population may be different at different times of the year (seasonality). Thus, you should be clear with yourself about who you're targeting.
Two different product teams both deployed new features on your landing page and ran their A/B tests over the same period of time. This is more of an organizational problem. You should probably require product teams to register their tests, and make sure that multiple tests on the same thing are not running at the same time, or else you might be tracking the effect of the other team's test.
Despite its useful functionality, there are still places where A/B testing isn't as useful. For example: