In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../notebook_format')
from formats import load_style
load_style()


Out[1]:

In [2]:
os.chdir(path)
import numpy as np
import scipy.stats as stats

# magic to print version
%load_ext watermark

%watermark -a 'Ethen' -d -t -v -p numpy,scipy


Ethen 2016-08-23 15:27:13 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
scipy 0.18.0

Template For A/B Testing

A/B testing is a general methodology used when you want to test whether a new feature or change is better. You show one set of features, the control set (your existing feature), to one user group and another set, the experiment set (your new feature), to another user group, then measure how the two groups respond differently so that you can determine which version is better. If the ultimate goal is to decide which model or design is the best, then A/B testing is the right framework, along with its many gotchas to watch out for.

The following section describes a possible workflow for conducting A/B testing.

Generate High Level Business Goal

  • Define your business objectives. e.g. A business objective for an online flower store is to "Increase our sales by receiving online orders for our bouquets."
  • Define your Key Performance Indicators. e.g. Our flower store’s business objective is to sell bouquets. Our KPI could be number of bouquets sold online.
  • Define your target metrics. e.g. For our imaginary flower store, we can define a monthly target of 175 bouquets sold.

Segmenting And Understanding The Whys

After defining the high level goal, find out (not guess) which parts of your business are underperforming or trending and why. Quantitative methods do a much better job answering how many and how much types of questions, whereas qualitative methods such as user experience groups (you go really deep with a few users; this can take the form of observing users doing tasks or asking users to self-document their behaviors) and surveys are much better suited for answering questions about why or how to fix a problem.

Take a look at your conversion funnel. Examine the flow from the persuasive end (top of the funnel) to the transactional end (bottom of the funnel). e.g. You can identify problems by starting from the top 5 highest bounce rate pages. During the examination, segment to spot underlying underperformance or trends.

  • Segment by source: Separate people who arrive on your website from e-mail campaigns, google, twitter, youtube, etc. Find answers to questions like: Is there a difference between bounce rates for those segments? Is there a difference in Visitor Loyalty between those who came from Youtube versus those who came from Twitter? What products do people who come from Youtube care about more than people who come from Google?
  • Segment by behavior: Focus on groups of people who have similar behaviors. For example, you can separate out people who visit more than ten times a month versus those that visit only twice. Do these people look for products in different price ranges? Are they from different regions? Or separate people out by the products they purchase, by order size, or by whether they have signed up.

Consider segmenting your users into different buckets and testing against that, because mobile visitors perform differently than desktop ones, new visitors are different than returning visitors, and e-mail traffic is different than organic. Start thinking "segment first."

During the process ask yourself: 1) Why is it happening? 2) How can we spread the success to other areas of the site?

e.g. You’re looking at your metric of total active users over time and you see a spike in one of the timelines. After confirming that this is not caused by seasonal variation, we can look at different segments of our visitors to see if one of them is causing the spike. Suppose we have chosen to segment geographically; it might turn out that a large proportion of the traffic is generated by a specific region, in which case it’s best to dig deeper and understand why.

Three simple ideas for gathering qualitative data to understand the why:

  • Add an exit survey on your site, asking why your visitors did/didn't complete the goal of the site.
  • Send out feedback surveys to your clients to find out more about them and their motives.
  • Simply track what your customers are saying in social media and on review sites.

Generate a Well-Defined Metric

Set the "Lower" Level Goals

Now that you've identified the overall business goal and the possible problem (e.g. less than one percent of visitors sign up for our newsletter), it's time to prioritize your website goals. Three categories of goals include:

  • Do x: Add better product images.
  • Increase y: Increase click-through rates.
  • Reduce z: Reduce our shopping cart abandonment rate.

Define the Subject

What you need to do is decide how to assign users to either the control or the experiment group. There are three commonly used categories, namely user id, anonymous id (cookie) and event; a hashing-based assignment sketch follows the list below.

  • user id: e.g. Logged-in user names. Choosing this as the proxy for your user means that all the events that correspond to the same user id are either in the control or the experiment group, regardless of whether that user switches between a mobile phone and a desktop. This also means that if the user has not logged in, he / she will not be assigned to either the control or the experiment group.
  • anonymous id (cookie): The cookie is specific to a browser and device, thus if the user switches from Chrome to Firefox, they’ll be assigned a different cookie. Also note that users can clear their cookies, in which case the next time they visit the website they’ll get assigned a new cookie even if they’re still using the same browser and device. For experiments that cross the sign-in border, using a cookie is preferred. e.g. If you’re changing the layout of the page or the location of the sign-in bar, then you should use a cookie.
  • event: Should only be used when you’re testing a non-user-visible change, e.g. page load time. If not, what will happen is: the user will see the change when they first visit the page, and after reloading the page the user will not see the change, leading to confusion.
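A common way to make the assignment sticky is to hash the chosen subject identifier (user id or cookie) together with an experiment name and split on the hash value, so the same subject always lands in the same group. The sketch below is only a minimal illustration of this idea; the experiment name, identifier and 50/50 split are made-up assumptions, not part of this notebook's pipeline.

import hashlib

def assign_group(subject_id, experiment_name, control_fraction = 0.5):
    """
    Deterministically assign a subject (user id or cookie) to
    'control' or 'experiment' by hashing the id with the experiment
    name; the same subject always gets the same group.
    """
    key = '{}-{}'.format(experiment_name, subject_id).encode('utf-8')
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
    return 'control' if bucket < control_fraction * 100 else 'experiment'

# e.g. a cookie-based subject for a hypothetical sign-in bar experiment
print(assign_group('cookie_12345', 'signin_bar_layout'))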

Define the Population

If you think you can identify which population of your users will be affected by your experiment, you might want to target your experiment to that traffic (e.g. changing features specific to one language’s users) so that the rest of the population won’t dilute the effect.

Next, depending on the problem you’re looking at, you might want to use a cohort instead of a population. A cohort makes much more sense than looking at the entire population when testing out learning effects, examining user retention or anything else that requires the users to be established for some reason.

A quick note on cohorts. The gist of cohort analysis is putting your customers into buckets so you can track their behaviors over a period of time. A cohort is a group of customers grouped by the time period (it can be a week or a month) in which they first made a purchase (or took some other action that's valuable to the business). Within a cohort, the two user groups share roughly the same parameters, which makes them more comparable.

e.g. You’re an educational platform that has an existing course that’s already up and running. Some of the students have completed the course, some of them are midway through and some have not yet started. Suppose you want to change the structure of one of the lessons to see if it improves the completion rate of the entire course, and you start the experiment at time X. Students who started before the experiment was initiated may have already finished the lesson, meaning they may never even see the change. So taking the whole population of students and running the experiment on them isn’t what you want. Instead, you want to segment out the cohort, the group of students that started the lesson after the experiment was launched, and split that into an experiment and a control group.
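As a rough illustration of that cohort selection (the student records and launch date below are hypothetical), you would first filter down to students who started the lesson after the launch, and only then assign the remaining subjects to groups, e.g. with the hashing scheme sketched earlier.

from datetime import datetime

# made-up records: (student_id, lesson_start_time)
students = [
    ('student_1', datetime(2016, 8, 1)),
    ('student_2', datetime(2016, 8, 20)),
    ('student_3', datetime(2016, 8, 25)),
]

experiment_start = datetime(2016, 8, 15)  # time X, an assumed launch date

# keep only the cohort that started the lesson after the launch
cohort = [student_id for student_id, started in students
          if started >= experiment_start]
print(cohort)  # ['student_2', 'student_3']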

Define the Size and Duration

When do I want to run the experiment and for how long?

e.g. Suppose we’ve chosen the goal of increasing click-through rates, defined as the unique number of people who click the button divided by the number of users who visited the page where the button is located. But to actually use this definition, we’ll also have to address some other questions, such as: if the same user visits the page once and comes back a week or two later, do we still only want to count that once? Thus we’ll also need to specify a time period.

To account for this, if 99% of your visitors convert within 1 week of their first visit, then you should do the following.

  • Run your test for two weeks.
  • Include in the test only users who show up in the first week. If a user shows up on day 13, you have not given them enough time to convert (click-through).
  • At the end of the test, if a user who showed up on day 2 converts more than 7 days after he first arrived, he must be counted as a non-conversion.

So one version of the fully-defined metric will be: For each week, the number of cookies that clicked divided by the number of cookies that interacted with the page (also add the population definition).
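To make the metric concrete, here is a tiny sketch of computing it from raw events for a single week; the cookie ids and event tuples are invented for illustration.

# hypothetical raw events for one week: (cookie, action)
events = [
    ('cookie_1', 'pageview'),
    ('cookie_1', 'click'),
    ('cookie_2', 'pageview'),
    ('cookie_3', 'pageview'),
    ('cookie_1', 'click'),  # repeat clicks from the same cookie count once
]

visited = {cookie for cookie, action in events if action == 'pageview'}
clicked = {cookie for cookie, action in events if action == 'click'}

# unique cookies that clicked divided by unique cookies that saw the page
click_through_rate = len(clicked & visited) / len(visited)
print(click_through_rate)  # 1 out of 3 cookies clicked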

Running the test for at least a week is advised, since it makes sure the experiment captures the different user behaviour of weekdays and weekends; also try to avoid holidays.

If your population is defined and you have large enough traffic, another consideration is what fraction of the traffic you are going to send through the experiment. There are reasons why you might not want to run the experiment on all of your traffic, even though doing so would get you the result faster (a back-of-the-envelope sizing sketch follows the list below).

  • The first consideration might be that you’re simply uncertain how your users will react to the new feature, so you might want to test it out a bit before you get users blogging about it.
  • The same notion applies to riskier cases, such as when you’re completely switching your backend system; if it doesn’t work well, the site might go down.
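For the sizing itself, a back-of-the-envelope calculation based on the normal approximation for comparing two proportions gives a rough sample size per group, and dividing by the eligible daily traffic times the diverted fraction gives a rough duration. The baseline rate, minimum detectable effect, significance level, power, daily traffic and diversion fraction below are all illustrative assumptions.

import numpy as np
import scipy.stats as stats

def sample_size_per_group(p_baseline, min_detectable_effect, alpha = 0.05, power = 0.8):
    """
    Rough sample size per group for a two-sided comparison of two
    proportions, using the normal approximation.
    """
    p1 = p_baseline
    p2 = p_baseline + min_detectable_effect
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * pooled_var / (p2 - p1) ** 2
    return int(np.ceil(n))

# e.g. a 10% baseline click-through rate and a 2% absolute lift we care about
n_per_group = sample_size_per_group(0.10, 0.02)

# assumed numbers: 5000 eligible cookies per day, 20% of them diverted into the test
daily_traffic = 5000
diverted_fraction = 0.2
days = 2 * n_per_group / (daily_traffic * diverted_fraction)
print(n_per_group, np.ceil(days))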

Prioritize

After collating all the ideas, prioritize them based on three simple metrics, giving each a score (a small scoring sketch follows the list below):

  • Potential: How much potential is there for a conversion rate increase? You can check whether this kind of idea has worked before.
  • Importance: How many visitors will be impacted by the test?
  • Ease: How easy is it to implement the test? Go for the low-hanging fruit first.

Every test that's developed is documented so that we can review and prioritize ideas that are inspired by winning tests.
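As an illustration of that kind of record keeping and scoring (the ideas and scores below are invented), you can keep each candidate test with its Potential / Importance / Ease scores and sort by the combined score.

# hypothetical backlog of test ideas, scored 1-10 on each criterion
ideas = [
    {'idea': 'rewrite call to action', 'potential': 8, 'importance': 6, 'ease': 9},
    {'idea': 'shorten checkout form',  'potential': 9, 'importance': 8, 'ease': 4},
    {'idea': 'swap stock photo',       'potential': 3, 'importance': 5, 'ease': 10},
]

for idea in ideas:
    idea['score'] = (idea['potential'] + idea['importance'] + idea['ease']) / 3

# highest combined score first
for idea in sorted(ideas, key = lambda x: x['score'], reverse = True):
    print(round(idea['score'], 1), idea['idea'])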

Some ideas worth experimenting with are: headlines, CTAs (calls to action), check-out pages and forms, and the elements to test include:

  • Wording. e.g. Call to action or value proposition.
  • Image. e.g. Replacing a general logistics image with the image of an actual employee.
  • Layout. e.g. Increase the size of the contact form or the amount of content on the page.

A/B Testing Caveats

This section lists some caveats that were either not mentioned above or only covered briefly.

Avoid Biased Stopping Times

NO PEEKING. When you run an A/B test, you should avoid stopping the experiment as soon as the results "look" significant. Using a stopping time that is dependent upon the results of the experiment can inflate your false-positive rate substantially.

To understand why this is so, let's look at a simpler experimental problem. Let's say that we have a coin in front of us, and we want to know whether it's biased -- whether it lands heads-up with probability other than 50%. If we flip the coin n times and it lands heads-up on k of them, then we know that the posterior distribution for the coin's bias is $p \sim Beta(k+1, n-k+1)$. So if we do this and 0.5 isn't within a 95% credible interval for $p$, then we would conclude that the coin is biased with p-value $\leq 0.05$. This is all fine as long as the number of flips we perform, n, doesn't depend on the results of the previous flips. If it does, then we bias the experiment to favor extremal outcomes.

Let's clarify this by simulating these two experimental procedures in code.

  • Unbiased Procedure: We flip the coin 1000 times. Let k be the number of times that the coin landed heads-up. After all 1000 flips, we look at the $p \sim Beta(k+1, 1000-k+1)$ distribution. If 0.5 lies outside the 95% credible interval for $p$, then we conclude that $p \neq 0.5$; if 0.5 does lie within the 95% credible interval, then we're not sure -- we don't reject the idea that $p = 0.5$.
  • Biased Procedure: We start flipping the coin. For each n with $1 < n \leq 1000$, let $k_n$ be the number of times the coin lands heads-up in the first n flips. After each flip, we look at the distribution $p \sim Beta(k_n+1, n-k_n+1)$. If 0.5 lies outside the 95% credible interval for $p$, then we immediately halt the experiment and conclude that $p \neq 0.5$; if 0.5 does lie within the 95% credible interval, we continue flipping. If we make it to 1000 flips, we stop completely and follow the unbiased procedure.

How many false positives do you think that the Biased Procedure will produce? We chose our p-value to be 0.05, so that the false positive rate would be about 5%. Let's repeat each procedure 1000 times, assuming that the coin really is fair, and see what the false positive rates really are:


In [3]:
def unbiased_procedure(n):
    """
    Parameters
    ----------
    n : int
        number of experiments (1000 coin flips) to run
    """  
    false_positives = 0
    
    for _ in range(n):
        # success[-1] : total number of heads after the 1000 flips
        success  = ( np.random.random(size = 1000) > 0.5 ).cumsum()
        beta_cdf = stats.beta( success[-1] + 1, 1000 - success[-1] + 1 ).cdf(0.5)

        if beta_cdf >= 0.975 or beta_cdf <= 0.025:
            false_positives += 1

    return false_positives / n

In [4]:
def biased_procedure(n):
    """
    Parameters
    ----------
    n : int
        number of experiments to run; each experiment flips the coin
        up to 1000 times and checks the credible interval after every flip
    """
    false_positives = 0
    
    for _ in range(n):
        success = ( np.random.random(size = 1000) > 0.5 ).cumsum()
        trials  = np.arange( 1, 1001 )
        history = stats.beta( success + 1, trials - success + 1 ).cdf(0.5)

        if ( history >= 0.975 ).any() or ( history <= 0.025 ).any():
            false_positives += 1

    return false_positives / n

In [5]:
# simulating 10k experiments under each procedure
print( unbiased_procedure(10000) )
print( biased_procedure(10000) )


0.0527
0.4871

Almost half of the experiments under the biased procedure produced a false positive. Conversely, only around 5% of the unbiased procedure experiments resulted in a false positive, which is close to the 5% false positive rate that we expected, given our p-value.


The easiest way to avoid this problem is to choose a stopping time that's independent of the test results. You could, for example, decide in advance to run the test for a fixed amount of time, no matter what results you observe during the test's tenure. Thus, just like in the template above, if 99% of your visitors convert within 1 week, then you should do the following.

  • Run your test for two weeks.
  • Include in the test only users who show up in the first week. If a user shows up on day 13, you have not given them enough time to convert.
  • At the end of the test, if a user who showed up on day 2 converts more than 7 days after he first arrived, he must be counted as a non-conversion.

Or you could decide to run the test until each bucket has received more than 10,000 visitors, again ignoring the test results until that condition is met. There are also statistical tools, such as power analysis, that let you determine how many observations you need to collect before drawing a conclusion from the result. You should be VERY CAREFUL with this, though, because the truth is: it’s not really the number of conversions that matters; it’s whether the time frame of the test is long enough to capture the variations on your site.

For instance, the website traffic may behave one way during the day and another way at night (the same holds for weekdays versus weekends). Then there's the novelty effect: when users are switched to a new experience, their initial reactions may not be their long-term reactions. In other words, if you are testing a new color for a button, the user may initially love the button and click it more often, just because it’s novel, or she may hate the new color and never touch it, but eventually she would get used to the new color and behave as she did before. It’s important to run the trial long enough to get past the period of the "shock of the new".

In sum, setting a results-independent stopping time (e.g. a week) is the easiest and most reliable way to avoid biased stopping times.

Do Follow Up Tests and Watch your Overall Success Rate

If you're running a lot of A/B tests, you should run follow-up tests and pay attention to your base success rate.

Let's talk about these in reverse order. Imagine that you do everything right. You set your stopping time in advance, and keep it independent from the test results. You set a relatively high success criterion: A probability of at least 95% that the variant is better than the control (formally, $p \leq 0.05$). You do all of that.

Then you run 100 tests, each with all the rigor just described. In the end, of those 100 tests, 5 of them claim that the variant beats the control. How many of those variants do you think are really better than the control, though? If you run 20 tests in a row in which the "best" variant is worse than or statistically indistinguishable from the control, then you should be suspicious when your 21st test comes out positive. If a button-color test failed to elicit a winner six months ago, but did produce one today, you should be skeptical. Why now but not then?
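One way to get a feel for this is a back-of-the-envelope calculation; the base rate of genuinely better ideas and the test's power below are assumptions, not measured numbers.

n_tests = 100
base_rate = 0.10  # assumed fraction of ideas that are genuinely better
power = 0.80      # assumed chance a genuinely better variant is detected
alpha = 0.05      # false positive rate of each individual test

true_positives = n_tests * base_rate * power          # expected real winners (~8)
false_positives = n_tests * (1 - base_rate) * alpha   # expected flukes (~4.5)
print(round(true_positives, 1), round(false_positives, 1))

# under these assumptions, roughly a third of the declared "winners" are flukes
print(round(false_positives / (true_positives + false_positives), 2))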

Here's an intuitive way of thinking about this problem. Let’s say you have a class of students who each take a 100-item true/false test on a certain subject. Suppose each student chooses randomly on all questions. Each student would achieve a random score between 0 and 100, with an average of 50.

Now take only the top-scoring 10% of the class, declare them "winners", and give them a second test, on which they again choose randomly. They will score lower on the second test than on the first. That’s because, no matter what they scored on the first test, they will still average 50 correct answers on the second. This is what's called regression to the mean, and it's why tests that seem successful often lose their uplift over time.
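A quick simulation of that thought experiment (the class size and random seed are arbitrary) shows the effect directly.

import numpy as np

np.random.seed(123)

# 1000 students each answer 100 true/false questions completely at random,
# on a first test and again on a second test
n_students = 1000
first_test = np.random.binomial(n = 100, p = 0.5, size = n_students)
second_test = np.random.binomial(n = 100, p = 0.5, size = n_students)

# the top-scoring 10% of the first test, our "winners"
winners = first_test >= np.percentile(first_test, 90)

print(first_test[winners].mean())   # well above 50
print(second_test[winners].mean())  # back to roughly 50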

It can be wise to run your A/B tests twice (a validation test). You’ll find that doing so helps eliminate illusory results. If the results of the first test aren’t robust, you’ll see noticeable decay in the second. But if the uplift is real, you should still see it during the second test. This approach isn’t fail-safe, but it will help check whether your results are robust. e.g. In a multiple-variant test, you tried out three variants, B, C, and D against the control A. Variant C won. Don't deploy it fully yet. Drive 50% of your traffic to Variant C and 50% to Variant A (or some modification on this; the exact split is not important as long as you will have reasonable statistical power within an acceptable time period), as this will give you more information about C's true performance relative to A.

Given the situation above, it's better to keep a record of previous tests: when they were run, which variants were tried, etc., since this historical record gives you an idea of what's reasonable. Although this information is not directly informative of the rates you should expect from future tests (the absolute numbers are extremely time dependent, so the raw numbers that you get today will be completely different from the ones you would have gotten six months later), it gives you an idea of what's plausible in terms of each test's relative performance.

Also, by keeping a record of previous tests, we can avoid:

  • Falling into the trap of "We already tried that". A hypothesis can be implemented in so many different ways. If you just do one headline test and say "we tried that," you’re really selling yourself short.
  • Not testing continually or not retesting after months or years. Just because you tested a variation in the past doesn’t necessarily mean that those results are going to be valid a year or two from now (Because we have the record of what we did, we can easily reproduce the test).

Non-Randomized Bucketing

Double check that you're actually randomly splitting your users; this will most likely burn you if your system assigns user ids in a systematic way, e.g. if user ids whose last two digits are 70 all come from China.
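One quick sanity check (a sketch with made-up counts) is to compare the observed control / experiment counts against the split you intended using a chi-square goodness-of-fit test; a very small p-value suggests the bucketing is off.

import scipy.stats as stats

# made-up daily assignment counts versus an intended 50/50 split
observed = [50312, 49127]
total = sum(observed)
expected = [0.5 * total, 0.5 * total]

chi2, p_value = stats.chisquare(observed, f_exp = expected)
print(p_value)  # a tiny p-value hints at a sample ratio mismatch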

False Reporting

Let's say you deploy a new feature to your product and wish to see if it increases the product's activation rate (or any other metric or KPI that's relevant to you). Currently the baseline of the product's activation rate is somewhere around 40%. After running the test, you realize that it WORKED: the activation went up to 50%. So you're like, YES! I just raised activation by 25%! You send this info to the head of product and ask for a raise.

After two months, the head of product comes back to you and says, "You told me you raised the activation rate by 25%; shouldn't that mean I should see a big jump in the overall activation? What's going on?" Well, what's going on is: you did raise activation by 25%, but only for users who use that feature. So if only 10 percent of your users use that feature, then the overall increase in activation rate will probably only be around 2.5% (25% * 10%). Which is still probably very good, but the expectation you've set by mis-reporting can get you into trouble.
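Spelling that arithmetic out with the same illustrative numbers:

baseline = 0.40        # overall activation rate before the change
feature_lift = 0.25    # relative lift for users who use the feature (40% -> 50%)
feature_usage = 0.10   # fraction of users who actually use the feature

overall_relative_lift = feature_lift * feature_usage
new_overall_rate = baseline * (1 + overall_relative_lift)
print(overall_relative_lift)       # ~0.025, i.e. a 2.5% relative lift overall
print(round(new_overall_rate, 3))  # ~0.41: activation goes 40% -> 41%, not 40% -> 50%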

Seasonality / Not Running it Against the Correct Target

Suppose you have different types of users (or users with different usage patterns) using your product, e.g. business users and students. Then your A/B test can produce different results in July versus October. The reason may be that in July all your student users are out on vacation (not using your product) and in October, after school starts, they start using it again. This simply means that the weighting of your user population may differ at different times of the year (seasonality). Thus, you should be clear with yourself about who you're targeting.

Conflicting Tests

Two different product teams both deployed new features on your landing page and ran their A/B tests during the same period of time. This is more of an organizational problem. You should probably require product teams to register their tests, and make sure that multiple tests on the same pages are not running at the same time, or else you might be tracking the effect of the other test.

Others

Despite its useful functionality, there are still places where A/B testing isn't as useful. For example:

  • A/B testing can't tell you if you're missing something. It can tell you whether A performs better than B or vice versa, but it can't tell you that if you use C, it will actually perform better than the former two.
  • Testing out products that people rarely buy, e.g. cars, apartments. It might take too long before the user actually decides to take action after seeing the information, and you might be unaware of the actual motivation.
  • Optimizing for the funnel, rather than the product. The goal is understanding what the customers want so that you can make the product better; ultimately, you can't simply test your headlines and get people to like your product more.
  • There are still occasions when A/B testing should not be trusted to make decisions. The best example is probably noting that a higher click-through rate doesn't necessarily mean higher relevance. To be explicit, poor search results mean people perform more searches, and thereby click on more ads. While this seems good in the short term, it's terrible in the long term, as users get more and more frustrated with the search engine. [Quora]