Lesson 01 - Overview of A/B Testing

Course Overview

  • how to run an A/B test for businesses that have an online presence
  • lets you try possible changes on real users and see which version performs better
  • a scientific way to find what works best, rather than relying on intuition
  • covers
    • how to design a test
    • how to choose metrics
    • how to analyze the results
  • Take two sets of users and show each a different version
    • One set - the control set - is shown the existing product
    • The other set - the experiment set - is shown the new product
  • Not useful for testing brand-new features; rather, it tells you whether a change to an existing feature improved things or not
  • If a change's results take time to show up, A/B testing won't help. Say you are a house-rental website and you want to know how to increase user referrals. Referrals don't happen very often, so the test would have to run for a very long time and thus wouldn't be very useful.

Examples of A/B testing in industry:

Examples mentioned in the video:

Other Examples:

When can you use A/B Testing

  • A website selling cars is not a good candidate, as it will take too long to see whether there is an effect
  • Should we launch a Pro service? Not a good option for A/B testing, because you cannot assign people randomly to control and experiment - people need to opt in.
  • Updating a brand/logo can be an emotional thing for users; you can A/B test it, but only over the long term
  • Testing the layout of a page is definitely something you can A/B test

Other Techniques

Sometimes A/B testing isn't helpful. What to do in those cases?

  • can collect data that is complementary to an A/B test: logs can be analyzed, a hypothesis can be developed, and that hypothesis can then be tested with an A/B test

A/B testing has been around for a while; in medicine, for example, clinical trials are effectively A/B tests.

Business Example

Audacity makes online finance courses

  • Our experiment would be to test the hypothesis "changing the start now button from orange to pink will increase how many students explore Audacity's courses."
  • Which metric to choose?
    • total number of courses completed? No, as courses take too long to complete
    • number of clicks? No. If one of the two groups happens to get more visits, the raw click count would give us misleading results
    • (no. of clicks) / (no. of page views) - click-through rate
    • (unique visitors who click) / (unique visitors to page) - click-through probability
  • updated hypothesis: "changing the start now button from orange to pink will increase the click-through probability"
  • Why probability instead of rate?
    • rate should be used when you want to measure the usability of the site. Say we have a button; the rate tells us how often users actually find that button.
    • probability should be used when you want to measure impact. For finding how often people go on to explore, you don't want to count page refreshes, double clicks, etc.
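To make the rate vs. probability distinction concrete, here is a small Python sketch over a hypothetical page-view log (the visitor IDs and click values are made up for illustration):

```python
# Hypothetical page-view log: (visitor_id, clicked) pairs.
# A visitor may appear multiple times (refreshes, double clicks).
log = [
    ("u1", True), ("u1", True),   # u1 double-clicked
    ("u2", False),                # u2 never clicked
    ("u3", True),
    ("u4", False),
]

# Click-through rate: total clicks over total page views.
clicks = sum(1 for _, clicked in log if clicked)
page_views = len(log)
click_through_rate = clicks / page_views              # 3 / 5 = 0.6

# Click-through probability: unique clickers over unique visitors,
# so refreshes and double clicks don't inflate the number.
unique_visitors = {uid for uid, _ in log}
unique_clickers = {uid for uid, clicked in log if clicked}
click_through_probability = len(unique_clickers) / len(unique_visitors)  # 2 / 4 = 0.5
```

Note how u1's double click raises the rate but not the probability, which is exactly why probability is the better impact metric here.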

Repeating experiment

  • If we repeat an experiment and the results differ too much, we would be surprised
  • "too different" is measured relative to the standard error of our sample
  • standard error describes how variable our estimates would be if we took different samples of the same size

Which Distribution

  • When should we be surprised by the difference between expected results and actual results?
  • Depends on the distribution

Binomial Distribution

We flip a coin multiple times, each flip having a fixed probability of success. As we keep increasing the number of flips, the distribution of the number of successes approaches a normal distribution.

Try using this website to get a better feel for the binomial distribution. At the top, you can choose n (the number of events to generate) and p (the probability of success for each event) for two binomial distributions and compare them. The site will show one distribution as bars and the other as dots, and two graphs will be shown. The top graph shows the probability that exactly some number of successes k will occur for each k, and the bottom graph shows the cumulative probability that k or fewer successes will occur (so the probability on the far right (k >= n) will always be 1).

At the bottom, you can play a game where you throw a dice many times (with various definitions of success), keeping track of how many successes come up. This will let you see how the distribution you get resembles the binomial more and more as you keep playing.
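The coin-flip intuition can also be simulated directly. Here is a minimal Python sketch (the seed, n, and trial count are arbitrary choices for illustration):

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the simulation is reproducible

n, p, trials = 20, 0.5, 10_000

# Each trial: flip a fair coin n times and count the successes.
counts = Counter(
    sum(random.random() < p for _ in range(n))
    for _ in range(trials)
)

# The empirical distribution of success counts clusters around n * p = 10,
# and for large n its shape approaches a normal (bell) curve.
mean_successes = sum(k * v for k, v in counts.items()) / trials
```

Plotting `counts` as a bar chart shows the bell shape forming around n * p, just like the website's top graph.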

Calculating Confidence Intervals

  • We can use the fact that our data follows a binomial distribution to find a confidence interval.
  • This lets us find the interval around the mean where we expect values to fall
  • Steps:
    • Calculate the point estimate. This is the center of the confidence interval.
    • Calculate the width of the confidence interval, which is based on the SE of the distribution
  • An example of calculating a 99% confidence interval is below

Bias

Estimate of probability

$\hat{p} = X / N$

Instead of the binomial distribution, we can use the normal distribution if

$N * \hat{p} > 5$

$N * (1 - \hat{p}) > 5$

For small probabilities, the first condition is the more important one

N = 2000
X = 300

$\hat{p}$ = (300/2000) = 0.15

$N * \hat{p} = 2000 * 0.15 = 300$

$N * (1 - \hat{p}) = 2000 * 0.85 = 1700$

As both are greater than 5 we can consider using a normal distribution instead of binomial distribution.

$SE = \sqrt{ (\hat{p} * (1 - \hat{p})) / N } = \sqrt{ (0.15 * 0.85) / 2000 } = 0.007984359711335657$

$m = SE \times Z = 0.00798 \times 2.575 = 0.0205485$

where $Z = 2.575$ is the z-score for a 99% confidence interval.

$\text{Lower bound} = \hat{p} - m = 0.15 - 0.0205485 = 0.1294515$

$\text{Upper bound} = \hat{p} + m = 0.15 + 0.0205485 = 0.1705485$
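The worked example above can be reproduced in a few lines of Python (same N, X, and z-score; using the unrounded SE gives slightly different final digits than the hand calculation above):

```python
from math import sqrt

N, X = 2000, 300
p_hat = X / N                        # point estimate: 0.15

# Normal-approximation check: both N*p_hat and N*(1-p_hat) must exceed 5.
assert N * p_hat > 5 and N * (1 - p_hat) > 5

SE = sqrt(p_hat * (1 - p_hat) / N)   # standard error, ~0.00798
z = 2.575                            # z-score for a 99% confidence interval
m = SE * z                           # margin of error

lower, upper = p_hat - m, p_hat + m
print(f"99% CI: [{lower:.4f}, {upper:.4f}]")
```

Swapping z for 1.96 would give the narrower 95% interval instead.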

Note that the SE is dependent on both

  • proportion of successes
  • number of samples

So when deciding how many samples we need, we should take the expected proportion of successes into account. This will be covered again later in detail.

Note

So far what we have done is

  • took a sample of clicks on the original website
  • computed a confidence interval on that sample, before making any change
  • this helps us establish statistical (in)significance after making the change

Establish statistical significance using Hypothesis Testing

  • Need to calculate how likely it is that our results are due to chance

Say we run an experiment of changing the checkout flow of online shopping website

Null hypothesis

Both groups have the same probability of completing a checkout

Alternate hypothesis

Both groups have different probability of completing a checkout

Comparing 2 samples

When comparing 2 samples, we need an SE that gives us a good comparison of both. We will use the pooled SE for this purpose.
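As a sketch of how this works, here is the standard pooled-proportion construction in Python (the formula is the usual one for comparing two proportions, and the checkout counts are made up for illustration):

```python
from math import sqrt

# Hypothetical results: checkout completions in control vs. experiment.
X_cont, N_cont = 974, 10072   # completions / users, control group
X_exp,  N_exp  = 1242, 9886   # completions / users, experiment group

# Pooled probability: treat both groups as one sample under the null.
p_pool = (X_cont + X_exp) / (N_cont + N_exp)
SE_pool = sqrt(p_pool * (1 - p_pool) * (1 / N_cont + 1 / N_exp))

d_hat = X_exp / N_exp - X_cont / N_cont   # observed difference
z = 1.96                                  # 95% confidence
m = SE_pool * z                           # margin of error on the difference

# If the confidence interval around d_hat excludes 0, we reject the null
# hypothesis that both groups share the same completion probability.
significant = (d_hat - m > 0) or (d_hat + m < 0)
```

Under the null hypothesis the true difference is 0, which is why pooling the two samples into one probability estimate is the natural way to compute the SE of the difference.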

Practical, or Substantive, Significance ($d_{min}$)

  • From a business perspective what change matters to us?
  • A statistician would ask what is substantive, along with what is statistically significant
  • statistical significance alone is not enough, as it only measures repeatability
  • we also need to measure something from a business perspective
  • e.g.
    • launching a new drug has costs - training, distribution
  • the threshold differs between businesses, e.g.
    • 5% - 15% may be necessary for medicine and traditional science
    • 1% may be enough for Google

Size vs Power Trade Off

  • we need to decide how many people go into the experiment; this decision is governed by statistical power
  • the smaller the change you want to detect, or the higher the confidence you want to have, the larger the sample size needs to be

How many page views

alpha = P(reject null | null true)

beta = P(fail to reject null | null false)

  • alpha - the probability of falsely concluding there was a difference
  • beta - the probability of failing to detect a true difference

As the sample size increases, the distributions get tighter, so beta gets lower (there is less overlap between the null and alternative distributions) while alpha remains the same.

sensitivity = 1 - beta

Calculating number of page views

  • built-in library
  • lookup answer in a table
  • online calculator
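As a rough alternative to a table or online calculator, here is a Python sketch of the standard normal-approximation sample-size formula for comparing two proportions (the baseline rate and minimum detectable effect below are illustrative; real calculators may make slightly different assumptions and give somewhat different numbers):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_base, d_min, alpha=0.05, beta=0.2):
    """Approximate page views needed per group to detect an absolute
    change of d_min from baseline rate p_base (two-sided test at
    significance alpha with sensitivity 1 - beta)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p_alt = p_base + d_min
    n = ((z_alpha * sqrt(2 * p_base * (1 - p_base))
          + z_beta * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))) ** 2
         / d_min ** 2)
    return ceil(n)

# e.g. baseline click-through probability 0.1, detect an absolute +0.02
n = sample_size_per_group(0.1, 0.02)
```

Note that d_min here is an absolute difference, which matches the instruction below about the calculator. Halving d_min roughly quadruples the required n, which is the size vs. power trade-off in action.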

Statistics textbooks frequently define power to mean the same thing as sensitivity, that is, 1 - beta. However, conversationally power often means the probability that your test draws the correct conclusions, and this probability depends on both alpha and beta. In this course, we'll use the second definition, and we'll use sensitivity to refer to 1 - beta.

Use this calculator to determine how many page views we'll need to collect in our experiment. Make sure to choose an absolute difference, not a relative difference.

How Does Number of Page Views Vary

In each case we keep the sensitivity the same and see how the required number of page views changes

After getting results we need to analyze them

For the various confidence intervals, we decide whether or not we want to launch the change

If you are not certain, the risk needs to be communicated to the business decision makers. They need to weigh other factors and decide whether they want to launch the change or not.