Sometimes A/B testing isn't helpful. What should we do in those cases?
A/B testing has been around for a while; in medicine, for example, clinical trials are effectively A/B tests.
Audacity (the example company in this course) makes online finance courses.
Suppose we flip a coin multiple times. Each flip has a fixed probability of coming up heads, so the number of heads follows a binomial distribution. As we keep increasing the number of flips, that distribution approaches a normal distribution.
Try using this website to get a better feel for the binomial distribution. At the top, you can choose n (the number of events to generate) and p (the probability of success for each event) for two binomial distributions and compare them. The site shows one distribution as bars and the other as dots, across two graphs: the top graph shows the probability that exactly k successes occur, for each k, and the bottom graph shows the cumulative probability that k or fewer successes occur (so the probability at the far right, k = n, is always 1).
At the bottom, you can play a game where you throw a die many times (with various definitions of success), keeping track of how many successes come up. This lets you see how the distribution you get resembles the binomial more and more as you keep playing.
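The dice game can be sketched in a few lines of Python. This is a hypothetical re-creation, not the site's actual code: here a "success" is assumed to be throwing a 5 or 6 (p = 1/3), with 10 throws per game. As the number of games grows, the empirical frequencies approach the exact binomial probabilities.

```python
import random
from collections import Counter
from math import comb

def play_rounds(rounds, n=10, p=1/3, seed=42):
    """Play `rounds` games of n throws each; tally successes per game."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        successes = sum(rng.random() < p for _ in range(n))
        counts[successes] += 1
    return counts

def binomial_pmf(k, n, p):
    """Exact probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

empirical = play_rounds(100_000)
for k in range(11):
    print(f"k={k:2d}  empirical={empirical[k] / 100_000:.4f}  "
          f"binomial={binomial_pmf(k, 10, 1/3):.4f}")
```

With 100,000 games the two columns agree to roughly two decimal places, which is the convergence the site is illustrating.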
Estimate of probability: $\hat{p} = X / N$, where X is the number of successes and N is the number of trials.
Instead of the binomial distribution we can use the normal distribution if both:
$N * \hat{p} > 5$
$N * (1 - \hat{p}) > 5$
For small probabilities the first condition is the binding one, since $N * (1 - \hat{p})$ will be large anyway.
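The rule of thumb above is easy to encode as a helper (a sketch; the threshold of 5 is the one used in these notes, and some texts use 10 instead):

```python
def normal_approx_ok(n, p_hat, threshold=5):
    """Rule of thumb: the normal approximation to the binomial is
    reasonable when both N*p_hat and N*(1-p_hat) exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold

print(normal_approx_ok(2000, 0.15))   # True: 300 and 1700 both exceed 5
print(normal_approx_ok(100, 0.01))    # False: N*p_hat = 1 fails for small p
```

The second call shows why the first condition matters most for small probabilities: $N(1-\hat{p}) = 99$ passes easily while $N\hat{p} = 1$ fails.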
N = 2000
X = 300
$\hat{p} = 300 / 2000 = 0.15$
$N * \hat{p} = 2000 * 0.15 = 300$
$N * (1 - \hat{p}) = 2000 * 0.85 = 1700$
As both are greater than 5, we can use the normal distribution in place of the binomial distribution.
$SE = \sqrt{ \hat{p} (1 - \hat{p}) / N } = \sqrt{ (0.15 \cdot 0.85) / 2000 } \approx 0.00798$
margin of error: $m = SE \cdot Z$, where Z = 2.575 is the z-score for a 99% confidence level
= 0.00798 * 2.575
= 0.0205485
$\text{Lower bound} = \hat{p} - m = 0.15 - 0.0205485 = 0.1294515$
$\text{Upper bound} = \hat{p} + m = 0.15 + 0.0205485 = 0.1705485$
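The worked example above can be reproduced with the standard library alone; `NormalDist().inv_cdf` recovers the z-score for 99% confidence instead of hard-coding 2.575:

```python
from math import sqrt
from statistics import NormalDist

# X = 300 successes out of N = 2000 trials, 99% confidence interval.
N, X = 2000, 300
p_hat = X / N                              # 0.15
se = sqrt(p_hat * (1 - p_hat) / N)         # ~0.00798
z = NormalDist().inv_cdf(1 - 0.01 / 2)     # ~2.576 (the notes round to 2.575)
m = z * se                                 # margin of error, ~0.0206
print(f"p_hat = {p_hat}, SE = {se:.5f}")
print(f"99% CI: [{p_hat - m:.4f}, {p_hat + m:.4f}]")
```

This matches the bounds computed by hand above to three decimal places (the tiny difference comes from using the exact z-score rather than 2.575).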
Note that the SE depends on both $\hat{p}$ and N.
So when deciding how many samples to collect, we should consider the expected proportion of successes. This will be covered again later in detail.
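Both dependencies are easy to see numerically (a small sketch; the values below are illustrative):

```python
from math import sqrt

def standard_error(p_hat, n):
    """SE of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return sqrt(p_hat * (1 - p_hat) / n)

# SE shrinks as N grows, for a fixed proportion ...
for n in (500, 2000, 8000):
    print(f"N={n:5d}  SE={standard_error(0.15, n):.5f}")

# ... and, for a fixed N, SE is largest when p_hat = 0.5.
for p in (0.05, 0.15, 0.5):
    print(f"p_hat={p:.2f}  SE={standard_error(p, 2000):.5f}")
```

Quadrupling N halves the SE, which is why sample-size planning has to take the expected proportion into account.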
So far, what we have done is estimate a proportion from a single sample and build a confidence interval around it.
Say we run an experiment that changes the checkout flow of an online shopping website.
Null hypothesis: both groups have the same probability of completing a checkout.
Alternative hypothesis: the two groups have different probabilities of completing a checkout.
When comparing two samples, we need an SE that reflects both of them; we will use the pooled SE for this purpose.
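A sketch of the pooled SE computation (the group sizes and success counts below are made-up checkout numbers, not from the notes):

```python
from math import sqrt

def pooled_se(x_cont, n_cont, x_exp, n_exp):
    """Pooled standard error for comparing two sample proportions:
    pool the successes to get p_pool, then combine the two group sizes."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    return p_pool, se

# Hypothetical control vs. experiment checkout counts:
p_pool, se = pooled_se(x_cont=974, n_cont=10072, x_exp=1242, n_exp=9886)
d_hat = 1242 / 9886 - 974 / 10072   # observed difference in proportions
print(f"p_pool={p_pool:.4f}  SE_pool={se:.4f}  d_hat={d_hat:.4f}")
```

If the observed difference `d_hat` is more than about two pooled SEs from zero, the null hypothesis (equal checkout probabilities) looks implausible.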
alpha = P (reject null | null true)
beta = P (fail to reject null | null false)
As the sample size increases, the distributions get tighter; thus beta gets lower (as there is less overlap between the null and alternative distributions) while alpha remains the same.
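This claim can be checked analytically. The sketch below fixes alpha for a two-sided z-test on a difference of proportions and computes beta at several sample sizes; the baseline (10%) and true effect (13%) are made-up values for illustration:

```python
from math import sqrt
from statistics import NormalDist

def beta_for(n, p0=0.10, p1=0.13, alpha=0.05):
    """beta = P(fail to reject null | true probabilities are p0 vs p1),
    for equal group sizes n and a two-sided test at level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se_null = sqrt(2 * p0 * (1 - p0) / n)              # SE if both groups ~ p0
    se_alt = sqrt((p0*(1-p0) + p1*(1-p1)) / n)         # SE under the alternative
    # Probability the observed difference lands inside the acceptance region:
    diff = NormalDist(mu=p1 - p0, sigma=se_alt)
    return diff.cdf(z * se_null) - diff.cdf(-z * se_null)

for n in (500, 1000, 2000, 4000):
    b = beta_for(n)
    print(f"N={n:4d}  beta={b:.3f}  sensitivity={1 - b:.3f}")
```

Alpha stays at 0.05 throughout, while beta falls steadily as N grows, exactly the "less overlap" picture described above.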
sensitivity = 1 - beta
Statistics textbooks frequently define power to mean the same thing as sensitivity, that is, 1 - beta. However, conversationally power often means the probability that your test draws the correct conclusions, and this probability depends on both alpha and beta. In this course, we'll use the second definition, and we'll use sensitivity to refer to 1 - beta.
Use this calculator to determine how many page views we'll need to collect in our experiment. Make sure to choose an absolute difference, not a relative difference.
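As a rough sanity check on the calculator's output, here is an approximate closed-form version of the standard two-proportion sample-size formula (the baseline conversion and minimum detectable effect below are illustrative assumptions, not values from the notes):

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_group(p_base, d_min, alpha=0.05, beta=0.2):
    """Approximate N per group to detect an absolute difference d_min
    from baseline p_base with a two-sided test (level alpha, power 1-beta)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(1 - beta)
    p_alt = p_base + d_min
    term = (z_a * sqrt(2 * p_base * (1 - p_base))
            + z_b * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt)))
    return ceil((term / d_min) ** 2)

# e.g. 20% baseline click-through, detect an absolute change of 2 points:
print(sample_size_per_group(0.20, 0.02))
```

Note how sensitive the answer is to d_min: halving the detectable effect roughly quadruples the required sample size, which is why choosing an absolute (not relative) difference in the calculator matters.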
For the first case, we need the same sensitivity.
After getting the results, we need to analyze them.
Depending on where the confidence interval falls, we decide whether or not to launch the change.
If you are not certain, the risk needs to be communicated to the business decision makers; they must weigh other factors and decide whether or not they want to launch the change.