Analysis of the multi-armed bandit strategy for web service optimization

I recently stumbled across a reddit link (https://redd.it/4dlioz) to a now 3-year-old blog post about a method, the multi-armed bandit strategy, which is claimed to always yield better results than A/B testing for web optimization.

The large number of upvotes this post gathered (235 at the time of writing) was surprising, given its tone, the vagueness of its many unsupported claims and the presence of easily spotted inaccuracies (more about that later). However, we should not ditch the whole method just because of a single enthusiastic and inaccurate presentation. That is why we are going to explain the strategy, compare it to the widely used A/B testing method through statistics and simulations, and finally explain when it is okay to use the multi-armed bandit strategy.

Table of contents

  1. What are we talking about?
     1.1. Basics
     1.2. Blog post case study
     1.3. A/B testing
     1.4. Multi-armed-bandit-based method
     1.5. Preliminary conclusions about the multi-armed bandit strategy
  2. Numerical modelling and simulation
     2.1. The webserver
     2.2. User clicking behavior (click through rate)
     2.3. Campaigns
     2.4. Fisher's exact test
  3. Testing the claims
     3.1. The multi-armed bandit strategy cannot outperform A/B testing to answer the question "is there a difference?"
     3.2. The underlying probability density function does not follow the beta distribution
     3.3. The Christmas effect, or varying $P(O|X)$
  4. What does that mean in practice?
     4.1. Can we use it?
     4.2. Is multi-armed bandit always better than A/B testing?

1. What are we talking about?

1.1. Basics

People (among other things) tend to react in a predictable way "on average" to a given stimulus (a.k.a. an event): e.g. most people tend to switch the light on upon entering a dark room.

Given a particular event, one can measure a particular outcome. For instance, on the event "going out on a rainy day", one can measure/count the particular outcome "number of persons carrying an umbrella". Similarly, one could measure the same outcome on a sunny day.

1.2. Blog post case study

Instead of the weather, the blogger decided to study the specific effect of the color of a "Buy Now!" button on its click-through rate. For the purpose of his blog post, he chose to compare the effect of three colors:

  • orange,
  • green and
  • white.

The user connecting to the website would therefore be subjected to three possible events:

  • Getting the version of the website with the orange button (we call that "event $A$", or just $A$);
  • Getting that same website with a green button instead (event $B$);
  • Same thing with a white button instead (event $C$).

The outcome (which we call $O$), the "thing" we measure, is the "number of clicks on that button".

But this raw number is not interesting by itself: we would like to unveil something deeper, a kind of hidden truth about people's behavior in general. That is why we have to resort to probabilities. So instead of analysing $O$, the number of clicks on the "Buy Now!" button, we will study $P(O)$, the probability that a user clicks on the button.

To be more specific, $P(O)$ is not that interesting either, because our purpose is to determine whether a specific event gives a different outcome than the others, that is, whether $P(O)$ differs depending on whether the user is subjected to $A$, $B$ or $C$. These conditional probabilities are written $P(O|A)$, $P(O|B)$ and $P(O|C)$ and are usually called "conversion rates" in web parlance.
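
To make this concrete, here is a minimal numerical sketch (the counts are made up purely for illustration) of how such conversion rates are estimated from raw clicks and impressions:

>>> impressions = {"A": 1000, "B": 1000, "C": 1000} # users who saw each button color
>>> clicks = {"A": 12, "B": 17, "C": 49}            # users who clicked, per color
>>> # empirical estimates of P(O|A), P(O|B) and P(O|C): clicks divided by impressions
>>> {event: clicks[event] / impressions[event] for event in impressions}
{'A': 0.012, 'B': 0.017, 'C': 0.049}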

To sum up, we defined the following probabilities:

  • $P(A)$: the probability that a user gets the website version with the orange button;
  • $P(B)$: the probability of getting the green button;
  • $P(C)$: the probability of getting the white button;
  • $P(O)$: the probability of any user clicking on the button;
  • $P(O|A)$: the probability that a user clicks on the button given that that button is orange;
  • $P(O|B)$: the probability of clicking on the button given it is green;
  • $P(O|C)$: the probability of clicking on the button given it is white;

1.3. A/B testing

The purpose of A/B testing is to answer two different questions:

  • Are $P(O|A)$, $P(O|B)$, etc. significantly different?
  • If so, what is this difference exactly?

To do that, we have to ensure that each event occurrence is equiprobable, i.e. that about as many users get an orange, a green or a white button at any given time ($P(A) = P(B) = P(C)$). This is important if $P(O)$ varies with time: very little voodoo magic is required to guess that people tend to buy more things around Christmas, for instance. More people would come to the site, knowing exactly what they want, and click on the "Buy Now!" button regardless of its color. By ensuring that $P(A) = P(B) = P(C)$, no specific event is ever artificially favored.

Whether or not the difference between $P(O|A)$, $P(O|B)$ and $P(O|C)$ is significant is assessed by a statistical test (in that case, it would certainly be Fisher's exact test). This test assumes that the error on $P(O|X)$ (with $X$ being any of the events $A$, $B$ or $C$) follows a beta distribution.

Finally, if the differences are proven significant, the differences themselves can be calculated with a known precision, degree of significance and statistical power.

1.4. Multi-armed-bandit-based method

One of the drawbacks of the A/B testing method is that even if one event clearly yields a far better outcome than the others, we have to wait for our testing campaign to end before actually analyzing results. That patience is required for sound reasons, the main one being that we designed our campaign in order to reach a certain level of trust in our results (see the end of the previous subsection).

The multi-armed-bandit-based method proposed by the blogger ignores that. It assumes that the $P(O|X)$ obtained through the method are significant and can be used to make a decision at any given time (even at the very start, when we have no actual data). $P(O|A)$, $P(O|B)$ and $P(O|C)$ are assumed to be different and that difference is assumed to be significant at any time.

By making these rather bold assumptions, we are allowed to make the following change in the A/B testing data gathering method:

If $A$ is found to yield the highest click probability so far, i.e. $$P(O|A) \geq P(O|B) \text{ and } P(O|A) \geq P(O|C)$$ at impression $n$, then we specifically set $$P(A) > P(B) = P(C)$$ at impression $n+1$ (an impression is one user visiting the page, whether she clicks or not; an "impression" is also often called a "sample" or a "trial").

This inequality is actually the only difference, at the data gathering stage, between the multi-armed bandit method and typical A/B testing. In the blog post, the author devised a rather complicated scheme to determine these probabilities, by splitting the general web serving algorithm into two phases with two different probabilities of occurrence:

  • One phase of exploration (happening 10% of the time in his example) in which any event can occur with the same probability (like in proper A/B testing).
  • One phase of exploitation (occurring 90% of the time in his example) with the favored event always being served.

Ultimately, these two phases can be expressed more simply as raw probabilities: $$P(\text{favored event}) = P(\text{exploitation}) + P(\text{disfavored event}) \quad \text{ and } \quad P(\text{disfavored event}) = \frac{P(\text{exploration})}{\text{number of events}}.$$
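
As a minimal sketch, with the 90%/10% split of the blog post and the three colors above, these formulas translate directly into code:

>>> P_exploitation = 0.9
>>> P_exploration = 1 - P_exploitation
>>> number_of_events = 3
>>> P_disfavored = P_exploration / number_of_events # each event's share of the exploration phase, ~0.033
>>> P_favored = P_exploitation + P_disfavored       # the favored event also gets the whole exploitation phase, ~0.933

These are exactly the fav and disfav probabilities computed in the simulation code of section 2.3.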

It is worth noting that, as we understand it, the blog post does not suggest performing any actual statistical test; the algorithm is simply left to run ("set and forget").

1.5. Preliminary conclusions about the multi-armed bandit strategy

At this stage, we have already said most of what should matter to make a decision:

  1. The multi-armed bandit method is actually a strategy to maximize a specific outcome rather than a proper statistical method.
  2. In particular, this strategy cannot outperform A/B testing to answer the question "is there actually a difference?".
  3. As $P(O|X)$ is assumed significant at any given time, any underestimated $P(O|X)$ will be systematically disfavored with few opportunities to ever get corrected.
  4. The underlying distribution of errors on the disfavored event does not follow the beta distribution: it is skewed and no amount of extra samples would change that.
  5. That means that we can't really trust Fisher's exact test with that strategy.

If, for any reason, we are fine with these (say, because we only want the best performing slot machine and the worst case scenario is "they all behave the same"), then why not... This algorithm is, after all, only a strategy to maximize reward, not really a testing method per se. But it is not clear at this stage whether that strategy would really systematically or generally outperform A/B testing, let alone yield 2-3 times "better" results (cf. the blog post). That is why we are going to rely on simulations.

2. Numerical modelling and simulation

We understand that most points of the conclusion above are not straightforward to everyone. We would also like to see by how much this strategy would outperform A/B testing (i.e. how many more clicks would be gathered in the long run). This is why we are going to resort to numerical simulations to illustrate and justify them. We will proceed as follows:

  1. Model the webserver as a function which returns event 'A', 'B' or 'C' (i.e. a button of a given color) according to the normalized probabilities in P_X (a dictionary mapping event $X$ to $P(X)$).
  2. Model the behavior of people with a function returning whether or not a particular user has clicked on the button, given its color and P_O_X: a dictionary containing the "hidden truth", that is, the underlying exact probabilities that a user clicks on the button given a particular color (the ideal click-through rates).
  3. Model campaigns based on each strategy (A/B, MAB) to estimate the experimental $\tilde{P}(O|X)$ which is the estimate of $P(O|X)$ obtained through sampling according to these strategies.
  4. Estimate whether the differences in $\tilde{P}(O|X)$ are significant (by using Fisher's exact test).

These models will be used in the next section to test various claims.

2.1. The webserver

The web server should provide websites which differ only by the color of a button. As we control the server, we can decide the individual probability of sending any of these colors: $P(X)$ with $X$ being either $A$, $B$ or $C$.

>>> import numpy as np
>>> events = {"A": "got an orange button", "B": "got a green button", "C": "got a white button"}
>>> P_X = {"A": 1/3, "B": 1/3, "C": 1/3}
>>> def fetch_website_and_return_event(P_X):
>>>     event_list = [c for c in events]
>>>     probability_of_event = [P_X[event] for event in event_list]
>>>     return np.random.choice(event_list, p=probability_of_event)

>>> for user_number in range(100): # 100 users visit the website
>>>     event = fetch_website_and_return_event(P_X)
>>>     print("The user #%2d was subjected to event %s: she %s"%(user_number, event, events[event]))

The user # 0 was subjected to event A: she got an orange button
The user # 1 was subjected to event B: she got a green button
[...]
The user #98 was subjected to event C: she got a white button
The user #99 was subjected to event A: she got an orange button

2.2. User clicking behavior (click through rate)

In a similar way, we can represent the behavior of the users by setting $P(O|X)$ (with $X$ being either $A$, $B$ or $C$) and "drawing" individual outcomes from it:

>>> P_O_X = {"A": 0.001, "B": 0.01, "C": 0.5}
>>> def has_user_clicked_on_the_button(event, P_O_X):
>>>     return np.random.uniform(low=0.0, high=1.0, size=None) < P_O_X[event]

>>> for user_number in range(100):
>>>     event = fetch_website_and_return_event(P_X)
>>>     if has_user_clicked_on_the_button(event, P_O_X):
>>>         has_clicked = "and clicked on it!"
>>>     else:
>>>         has_clicked = "and DIDN'T click on it!"
>>>     print("The user #%2d %s"%(user_number, events[event]), has_clicked)

The user # 0 got a white button and clicked on it!
The user # 1 got a white button and DIDN'T click on it!
[...]
The user #98 got a white button and DIDN'T click on it!
The user #99 got an orange button and DIDN'T click on it!


2.3. Campaigns

As we said, in A/B testing, $P(A) = P(B) = P(C)$, therefore the code above is already suited to this use case: all we have to do is to gather the results.

In the case of the multi-armed bandit, $P(X)$ should be updated at each iteration/impression to favor the event which yields the best outcome at that time.
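
For illustration, here is a readable (but slow) sketch of that update rule; it reuses the helpers and the three-color events of sections 2.1 and 2.2 and is only meant to convey the idea, not to reproduce the blog post's exact implementation:

def naive_MAB_campaign(number_of_users, P_O_X, f=0.9):
    # Readable sketch only; assumes fetch_website_and_return_event and
    # has_user_clicked_on_the_button (and the three events A, B, C) are in scope.
    event_names = list(P_O_X)
    disfav = (1 - f) / len(event_names)
    fav = f + disfav
    # Start the counters at 1 to avoid dividing by zero before any data is seen
    clicks_per_X = {event: 1 for event in event_names}
    events_per_X = {event: 1 for event in event_names}
    click_per_user, event_per_user = [], []
    for user_number in range(number_of_users):
        # Favor the event with the best click-through rate observed so far
        favored = max(event_names, key=lambda e: clicks_per_X[e]/events_per_X[e])
        P_X = {event: (fav if event == favored else disfav) for event in event_names}
        event = fetch_website_and_return_event(P_X)
        clicked = has_user_clicked_on_the_button(event, P_O_X)
        events_per_X[event] += 1
        clicks_per_X[event] += clicked
        event_per_user.append(event)
        click_per_user.append(clicked)
    return np.array(click_per_user), np.array(event_per_user)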

However, our partial model implementations are too slow to be used in real simulations. That is why we provide the following code which is less readable but much faster and strictly equivalent.


In [1]:
import numpy as np
def AB_campaign(number_of_users, P_O_X, P_X=None):
    ordered_events = [event for event in P_O_X]
    if P_X:
        P_X = [P_X[event] for event in ordered_events]
    event_per_user = np.random.choice(ordered_events, p=P_X, size=number_of_users)
    click_proba_per_user = np.empty_like(event_per_user, dtype=np.float64)
    for event, proba in P_O_X.items():
        click_proba_per_user[event_per_user == event] = proba
    click_per_user = np.random.uniform(low=0.0, high=1.0, size=number_of_users) < click_proba_per_user
    return click_per_user, event_per_user

Rough timings comparing the given implementation with one (naive_AB_campaign) based on the implementation of section 2.2:

>>> %timeit AB_campaign(100000, P_O_X)
100 loops, best of 3: 17.3 ms per loop
>>> %timeit naive_AB_campaign(100000, P_O_X)
1 loop, best of 3: 8.51 s per loop
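
The naive_AB_campaign used for this comparison is not shown in the original post; a minimal sketch, assuming the helpers and the three-color events of sections 2.1 and 2.2 are in scope, could look like this:

def naive_AB_campaign(number_of_users, P_O_X, P_X=None):
    # Naive reference version: one call to the "webserver" and one to the
    # "user" per impression, serving each color with equal probability
    # unless P_X is given explicitly.
    if P_X is None:
        P_X = {event: 1/len(P_O_X) for event in P_O_X}
    click_per_user, event_per_user = [], []
    for user_number in range(number_of_users):
        event = fetch_website_and_return_event(P_X)
        event_per_user.append(event)
        click_per_user.append(has_user_clicked_on_the_button(event, P_O_X))
    return np.array(click_per_user), np.array(event_per_user)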

Now let's define the multi-armed bandit version in a similar way. The coefficient f corresponds to $P(\text{exploitation})$ as described in the blog post. Both click_hist and event_hist are dictionaries containing, respectively, the total number of clicks and the total number of impressions per event. They can optionally be used to resume a previous multi-armed bandit campaign from the exact place where it stopped.


In [2]:
import numpy as np
def MultiArmedBandit_campaign(number_of_users, P_O_X, f=0.9, click_hist=None, event_hist=None):
    l = len(P_O_X)
    disfav = (1-f)/l
    fav = f + disfav
    # Let's put everything in lists and use indices instead of dict/event name
    indices = np.array(list(range(l)))
    event_list = np.array([event for event in P_O_X]) # provide an order
    P_O_X_list = np.array([P_O_X[event] for event in event_list])
    
    # Keep running totals of clicks and impressions for each event.
    # Counters start at 1 to avoid dividing by zero before any data is gathered.
    clicks_per_X = np.ones_like(event_list, dtype=int)
    events_per_X = np.ones_like(event_list, dtype=int)
    if click_hist:
        c = [click_hist[event] for event in event_list]
        clicks_per_X *= c
    if event_hist:
        e = [event_hist[event] for event in event_list]
        events_per_X *= e
    
    # Generate all the random draws here and now 
    rdm_click = np.random.uniform(low=0.0, high=1.0, size=number_of_users)
    
    # Pregenerate all the lists of random number from the start
    # Each column i corresponds to favoring the ith event
    # Bad for memory, good for speed!
    P_X_fav = np.zeros((number_of_users, l), dtype=int)
    for i in range(l):
        P_X_list = [disfav] * l
        P_X_list[i] = fav
        P_X_fav[:, i] = np.random.choice(indices, p=P_X_list, size=number_of_users)
    
    # Init all the relevant arrays
    event_idx_per_user = np.zeros(number_of_users, dtype=int)
    fav_idx_per_user = np.zeros(number_of_users, dtype=int)
    click_per_user = np.zeros(number_of_users, dtype=bool)
    
    # Now perform the expensive loop
    for user_number, cmp_click in enumerate(rdm_click):
        # Get event to favor
        fav_idx = np.argmax(clicks_per_X/events_per_X)
        # Get event idx depending on fav:
        # just extract it from the correct column
        # of pregenerated random events
        event_idx = P_X_fav[user_number, fav_idx]
        
        # Update relevant arrays and cumulated sums
        event_idx_per_user[user_number] = event_idx
        fav_idx_per_user[user_number] = fav_idx
        events_per_X[event_idx] += 1
        has_clicked = (cmp_click < P_O_X_list[event_idx])
        clicks_per_X[event_idx] += has_clicked
        click_per_user[user_number] = has_clicked
        
    # Finally convert indices into event names
    event_per_user = np.empty_like(event_idx_per_user, dtype=event_list.dtype)
    fav_per_user = np.empty_like(fav_idx_per_user, dtype=event_list.dtype)
    for idx, event in enumerate(event_list):
        event_per_user[event_idx_per_user == idx] = event
        fav_per_user[fav_idx_per_user == idx] = event
    return click_per_user, event_per_user, fav_per_user
>>> %timeit MultiArmedBandit_campaign(10000, P_O_X)
10 loops, best of 3: 133 ms per loop
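
As a short usage sketch (the numbers are arbitrary), click_hist and event_hist make it possible to resume a campaign from previously accumulated totals instead of starting from scratch, which is what we will do in section 3.3:

>>> P_O_X = {"A": 0.10, "B": 0.15}
>>> clicks, events, favored = MultiArmedBandit_campaign(5000, P_O_X)
>>> click_hist = {e: int(np.sum(clicks[events == e])) for e in P_O_X}
>>> event_hist = {e: int(np.sum(events == e)) for e in P_O_X}
>>> # continue for 5000 more users without resetting the accumulated counts
>>> MultiArmedBandit_campaign(5000, P_O_X, click_hist=click_hist, event_hist=event_hist)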

2.4. Fisher's exact test

We need a way to determine whether the differences between the measured $\tilde{P}(O|X)$ are significant. As stated previously, Fisher's exact test is usually used to answer such a question, as it is suited to click-through rates.

To do that, we must first create a contingency table giving the number of clicks and the number of "non-click" per event, for instance:

        click   non-click
   A       15          26
   B      541         432

Then the test itself is performed (here, we will be using fisher_exact from scipy.stats). There are a few ways to handle the resulting $p$-values. In our case, we chose to arbitrarily set a level of significance ($95\%$ by default) and to have the program answer the question "given that level, are the outcomes significantly different?".


In [3]:
from scipy.stats import fisher_exact 
def get_nclicks_nevents(clicks, events, event_name):
    mask = (events == event_name)
    nclicks = np.sum(clicks[mask])
    nevents = np.sum(mask)
    return nclicks, nevents

def contingency_tbl_2x2(clicks, events, A="A", B="B"):
    A_nclicks, A_nevents = get_nclicks_nevents(clicks, events, A)
    B_nclicks, B_nevents = get_nclicks_nevents(clicks, events, B)
    return np.array([
                    [A_nclicks, A_nevents-A_nclicks],
                    [B_nclicks, B_nevents-B_nclicks]
        ])

def is_significant(clicks, events, A="A", B="B", significance=0.05):
    oddsratio, p = fisher_exact(contingency_tbl_2x2(clicks, events, A, B))
    return p < significance
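
A quick usage sketch of these helpers on a simulated A/B campaign (the sample size is arbitrary):

>>> P_O_X = {"A": 0.10, "B": 0.15}
>>> clicks, events = AB_campaign(3000, P_O_X)
>>> contingency_tbl_2x2(clicks, events) # 2x2 table of clicks/non-clicks for A and B
>>> is_significant(clicks, events)      # True or False, depending on the random draw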

As a last step, we should determine the minimum sample size to guarantee the detection of a given difference. This is beyond the scope of this article and we will only use the following result:

To reliably detect a difference of $5\%$ from an original conversion rate $P(O|X) = 10\%$ we need around $1500$ users/trials/impressions per event. In code, that would mean that the following,

>>> P_O_X = {"A": 0.1, "B": 0.15}
>>> AB_campaign(1500 * len(P_O_X), P_O_X)

should allow us to determine that $P(O|B) > P(O|A)$ with

  1. only a $1\%$ chance of finding a difference while there is none (the significance level)
  2. and a $100-95 = 5\%$ chance of missing an existing difference (i.e. a statistical power of $95\%$).

In [4]:
M = 600
P_O_X = {"A": 0.1, "B": 0.15}
significance = []
for i in range(M):
    clicks, events = AB_campaign(1500*len(P_O_X), P_O_X)
    significance.append(is_significant(clicks, events, significance=0.01))
    
print("We should expect missing around 5%% of the differences: missed %.2f%% this time."%(100-np.sum(significance)*100/M))

P_O_X = {"A": 0.1, "B": 0.1}
significance = []
for i in range(M):
    clicks, events = AB_campaign(1500*len(P_O_X), P_O_X)
    significance.append(is_significant(clicks, events, significance=0.01))
    
print("We should expect wrongly finding around 1%% of them different: found %.2f%% this time."%(np.sum(significance)*100/M))


We should expect missing around 5% of the differences: missed 5.00% this time.
We should expect wrongly finding around 1% of them different: found 1.00% this time.

3. Testing the claims

Now that we have all the required elements, we can start to test a few claims and use cases...

3.1. The multi-armed bandit strategy cannot outperform A/B testing to answer the question "is there a difference?"

Let's run multiple campaigns with different values of $P(O|X)$ and see how many times the differences are found significant...

>>> P_O_X = {"A": 0.15, "B": 0.15}
>>> P_O_X = {"A": 0.10, "B": 0.15}
>>> P_O_X = {"A": 0.40, "B": 0.20}

In [5]:
M = 100
P_O_X_list = [
    {"A": 0.15, "B": 0.15},
    {"A": 0.10, "B": 0.15},
    {"A": 0.40, "B": 0.20}
]
significance_AB = [[] for p in P_O_X_list]
significance_MAB = [[] for p in P_O_X_list]
for count in range(M):
    for i, P_O_X in enumerate(P_O_X_list):
        # AB testing
        clicks, events = AB_campaign(1500*len(P_O_X), P_O_X)
        significance_AB[i].append(is_significant(clicks, events, significance=0.01))
        # MAB strategy
        clicks, events = MultiArmedBandit_campaign(1500*len(P_O_X), P_O_X)[:2]
        significance_MAB[i].append(is_significant(clicks, events, significance=0.01))
  
print("Identical:")
print("    - AB: found a nonexistent difference in %.2f%% of the campaigns."%(np.sum(significance_AB[0])*100./M))
print("    - MAB: found a nonexistent difference in %.2f%% of the campaigns."%(np.sum(significance_MAB[0])*100./M))

print("Small differences:")
print("    - AB: missed an existing difference in %.2f%% of the campaigns."%(100-np.sum(significance_AB[1])*100./M))
print("    - MAB: missed an existing difference in %.2f%% of the campaigns."%(100-np.sum(significance_MAB[1])*100./M))

print("Large differences:")
print("    - AB: missed an existing difference in %.2f%% of the campaigns."%(100-np.sum(significance_AB[2])*100./M))
print("    - MAB: missed an existing difference in %.2f%% of the campaigns."%(100-np.sum(significance_MAB[2])*100./M))


Identical:
    - AB: found a nonexistent difference in 1.00% of the campaigns.
    - MAB: found a nonexistent difference in 2.00% of the campaigns.
Small differences:
    - AB: missed an existing difference in 3.00% of the campaigns.
    - MAB: missed an existing difference in 63.00% of the campaigns.
Large differences:
    - AB: missed an existing difference in 0.00% of the campaigns.
    - MAB: missed an existing difference in 0.00% of the campaigns.

Using the given algorithm with a somewhat larger number of repetitions (M = 2000), we obtained:

Identical:
    - AB: found a nonexistent difference in 0.80% of the campaigns.
    - MAB: found a nonexistent difference in 1.10% of the campaigns.
Small differences:
    - AB: missed an existing difference in 4.70% of the campaigns.
    - MAB: missed an existing difference in 67.20% of the campaigns.
Large differences:
    - AB: missed an existing difference in 0.00% of the campaigns.
    - MAB: missed an existing difference in 0.60% of the campaigns.

These data clearly demonstrate that the A/B testing data gathering process is better than that of the multi-armed bandit strategy, in the sense that its detection power is higher: where the sample size is barely enough to guarantee a $5\%$ miss rate in A/B testing, the multi-armed bandit strategy makes it far worse, at above $50\%$.

Why is that? A/B testing is very good at minimizing the uncertainty on all the $\tilde{P}(O|X)$ because all of them are sampled evenly. With the multi-armed bandit strategy, the uncertainty on the most favored $\tilde{P}(O|X)$ is very low, but the one on the least favored $\tilde{P}(O|X)$ is very large (we are going to illustrate that in the next subsection). In practice, sampling one particular event much more often than the others is statistically less efficient than sampling all the events evenly.
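
A back-of-the-envelope sketch makes this concrete. The standard error of a proportion $p$ estimated over $n$ impressions is $\sqrt{p(1-p)/n}$; with the default $f = 0.9$ and two events, the multi-armed bandit strategy allocates about $95\%$ and $5\%$ of the impressions to the favored and disfavored events respectively, so:

>>> def std_error(p, n):
>>>     return np.sqrt(p*(1 - p)/n)
>>> N, p = 3000, 0.1
>>> std_error(p, 0.5*N)  # A/B testing: each of the two arms gets N/2 impressions
>>> std_error(p, 0.95*N) # MAB: the favored event, slightly smaller error than A/B
>>> std_error(p, 0.05*N) # MAB: the disfavored event, much larger error

Since the uncertainty on a difference of proportions is dominated by the arm with the fewest impressions, the even split is the most efficient allocation for detecting a difference.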

That means that the multi-armed bandit strategy always needs more samples to reach the exact same significance as A/B testing, and this is the main reason why we can claim that A/B testing is always better at answering the question "is there actually a difference?".

However, this is not the only reason: as we are going to see in the following subsection, the multi-armed bandit strategy's answer to that specific question is actually even worse than this.

3.2. The underlying probability density function does not follow the beta distribution

At this stage, we could naturally argue that a few more samples are all that is needed to get a better answer. Well, yes and no.

  • Yes because this would indeed reduce the uncertainties.
  • No because this is irrelevant:
    1. With a given amount of data, A/B testing still gives the most significant results.
    2. The underlying probability density function is actually skewed due to the strategy: we still cannot guarantee any level of significance from our results.

Knowing the underlying density function is useful in many ways. It is what grants us the right to use Fisher's test to "measure" the significance of our results. It is also important to predict in which range we expect to find our results. There are two ways of estimating these probability density functions: either "experimentally", by running a lot of campaigns with the same sample size, or "theoretically", by using known results from statistics: that density function should be the beta distribution.

We could use histograms as an estimate of the experimental density function, but we chose Gaussian kernel density estimation instead: it provides a natural way of interpolating values and converges faster with the number of repetitions (especially since the beta density function can be very similar to a Gaussian for central values).


In [6]:
from scipy.stats import beta, gaussian_kde
def experimental_error(x, X, O_X):
    # Gaussian kernel density estimation
    # Defaults to using Scott's rule
    # which is optimal if the underlying
    # distribution is roughly Gaussian
    y = gaussian_kde(O_X/X).evaluate(x)
    return y

def theoretical_error(x, X, O_X):
    # use Beta from scipy
    # X: individual events (A, B, C)
    # corresponding to a (orange, green, white) button
    # O_X: number of clicks given X (i.e. given the color)
    k = np.mean(O_X)
    n = np.mean(X)
    a = k + 1
    b = n - k + 1
    y = beta.pdf(x, a, b)
    return y

In [7]:
# Setup the plots
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt, rcParams
rcParams['figure.figsize'] = (12,4)
rcParams['axes.titlesize'] = 16
rcParams['axes.labelsize'] = 14
rcParams['legend.fontsize'] = 14
import seaborn as sns
sns.set_style("white")

N = 1000 # Number of samples/users/etc. per campaign
M = 4000 # Number of campaigns

# Set equal conversion rates 
P_O_X = {"A": 0.1, "B": 0.1}

# Set the interpolation grid
x = np.linspace(0, 0.5, 2000)

# Set the output variables as dicts(event_name -> P(O|X) array of M elements)
AB_est_nclicks_identical = {event: np.zeros(M) for event in P_O_X}
MAB_est_nclicks_identical = {event: np.zeros(M) for event in P_O_X}
AB_est_nevents_identical = {event: np.zeros(M) for event in P_O_X}
MAB_est_nevents_identical = {event: np.zeros(M) for event in P_O_X}

# Do all the calculations from the start
for count in range(M):
    AB_clicks, AB_events = AB_campaign(N, P_O_X)[:2]
    MAB_clicks, MAB_events = MultiArmedBandit_campaign(N, P_O_X)[:2]
    for event in P_O_X:
        AB_est_nclicks_identical[event][count], AB_est_nevents_identical[event][count] = get_nclicks_nevents(AB_clicks, AB_events, event_name=event)
        MAB_est_nclicks_identical[event][count], MAB_est_nevents_identical[event][count] = get_nclicks_nevents(MAB_clicks, MAB_events, event_name=event)

# Sum over all events to get the 'averaged' effect.
y_theo_AB = np.zeros_like(x)
y_theo_MAB = np.zeros_like(x)
y_exp_AB = np.zeros_like(x)
y_exp_MAB = np.zeros_like(x)
for event in P_O_X:
    y_theo_AB += theoretical_error(x, AB_est_nevents_identical[event], AB_est_nclicks_identical[event])
    y_exp_AB += experimental_error(x, AB_est_nevents_identical[event], AB_est_nclicks_identical[event])
    y_theo_MAB += theoretical_error(x, MAB_est_nevents_identical[event], MAB_est_nclicks_identical[event])
    y_exp_MAB += experimental_error(x, MAB_est_nevents_identical[event], MAB_est_nclicks_identical[event])

In [8]:
# Actual plotting

fig = plt.figure(figsize=(12, 9))
ax_AB = fig.add_subplot(211)
ax_MAB = fig.add_subplot(212)

ax_AB.set_title("Distribution of errors in A/B testing:")
ax_MAB.set_title("Distribution of errors in MAB strategy:")

ax_AB.set_ylabel("$\propto$ Probability")

ax_MAB.set_xlabel("$P(O|X)$ estimated for groups of %d samples"%N)
ax_MAB.set_ylabel("$\propto$ Probability")

ax_AB.plot(x, y_exp_AB, label="'Experimental' error (KDE)")
ax_AB.plot(x, y_theo_AB, "k--", label="Theoretical error (Beta PDF)")
ax_AB.legend(loc=0)

ax_MAB.plot(x, y_exp_MAB, label="'Experimental' error (KDE)")
ax_MAB.plot(x, y_theo_MAB, "k--", label="Theoretical error (Beta PDF)")
ax_MAB.legend(loc=0)


Out[8]:
<matplotlib.legend.Legend at 0x7f4d59568a20>

That nails it:

  1. A/B testing provides actual, simulated, "experimental" values that are distributed correctly.
  2. The multi-armed bandit strategy provides values that are not correctly distributed which means that Fisher's test cannot be reliably used to quantify the significance of the result.

Feel free to alter the code to convince yourself that these total distributions are similar to per event distributions.

How can we explain a tail like this? Let's alter the code to separate the results, not per event, but per most favored/disfavored event:


In [9]:
N = 1000 # Number of samples/users/etc. per campaign
M = 4000 # Number of campaigns

# Set equal conversion rates 
P_O_X = {"A": 0.1, "B": 0.1}

# Set the interpolation grid
x = np.linspace(0, 0.5, 2000)

# Set the output variables as dicts(event_name -> P(O|X) array of M elements)
fav_est_nclicks_identical = np.zeros(M)
disfav_est_nclicks_identical = np.zeros(M)
fav_est_nevents_identical = np.zeros(M)
disfav_est_nevents_identical = np.zeros(M)

# Do all the calculations from the start
event_list = [event for event in P_O_X]
A, B = event_list
for count in range(M):
    MAB_clicks, MAB_events, MAB_fav = MultiArmedBandit_campaign(N, P_O_X)[:3]
    if np.sum(MAB_fav == A) >= N/2:
        # A was favored
        fav_event, disfav_event = A, B
    else:
        # B was favored
        fav_event, disfav_event = B, A
    fav_est_nclicks_identical[count], fav_est_nevents_identical[count] = get_nclicks_nevents(MAB_clicks, MAB_events, event_name=fav_event)
    disfav_est_nclicks_identical[count], disfav_est_nevents_identical[count] = get_nclicks_nevents(MAB_clicks, MAB_events, event_name=disfav_event)

# There is no per-event loop here: compute the distributions once for the
# favored group and once for the disfavored group.
y_theo_fav = theoretical_error(x, fav_est_nevents_identical, fav_est_nclicks_identical)
y_exp_fav = experimental_error(x, fav_est_nevents_identical, fav_est_nclicks_identical)
y_theo_disfav = theoretical_error(x, disfav_est_nevents_identical, disfav_est_nclicks_identical)
y_exp_disfav = experimental_error(x, disfav_est_nevents_identical, disfav_est_nclicks_identical)

In [10]:
# Actual plotting

fig = plt.figure(figsize=(12, 9))
ax_fav = fig.add_subplot(211)
ax_disfav = fig.add_subplot(212)

ax_fav.set_title("Distribution of errors of mostly favored events (MAB):")
ax_disfav.set_title("Distribution of errors of mostly disfavored events (MAB):")

ax_fav.set_ylabel("$\propto$ Probability")

ax_disfav.set_xlabel("$P(O|X)$ estimated for groups of %d samples"%N)
ax_disfav.set_ylabel("$\propto$ Probability")

ax_fav.plot(x, y_exp_fav, label="'Experimental' error (favored event)")
ax_fav.plot(x, y_theo_fav, "k--", label="Theoretical error (favored event)")
ax_fav.legend(loc=0)

ax_disfav.plot(x, y_exp_disfav, label="'Experimental' error (disfavored event)")
ax_disfav.plot(x, y_theo_disfav, "k--", label="Theoretical error (disfavored event)")
ax_disfav.legend(loc=0)


Out[10]:
<matplotlib.legend.Legend at 0x7f4d559f5e80>

We observe that the distribution of $\tilde{P}(O|X)$ for the disfavored events is the skewed one. If we look more closely, we should find that the mode of both the experimental and the theoretical distributions systematically underestimates the true $P(O|X)$ (feel free to run the code a few times to convince yourself of this).

"Aha!", we could say, "the distribution is not actually skewed: just throw in more samples and we should get the expected distribution; we just lack a proper number of samples for that particular event." Sorry, that won't work: the "theoretical" distribution already takes that "improper sample size" into account. But please go ahead, do a little experiment... The distribution of $\tilde{P}(O|X)$ for the favored event seems to be correctly distributed, doesn't it? To reach that exact same statistical power in the disfavored case, we would need $0.95/0.05 = 19$ times more samples, i.e. $N = 19000$. However, with that sample size, we would see the same kind of underestimation and tail for the disfavored event.

Why is that? Well, think of it this way: the problem is that we choose how to change $P(X)$ using the results themselves. Likewise, we could toss a coin and decide to toss it a second time if, and only if, we get a head. Nobody would be the least surprised that, on average over many such experiments, the fraction of tails exceeds one half, even though the coin is known to be fair. The same kind of problem arises here, only in a more subtle way: the algorithm actually favors underestimated probabilities for the disfavored events.

This bias is another reason why A/B testing always performs better to determine the existence and the exact value of the differences: the multi-armed bandit strategy generally overestimates the differences and no statistical test is readily available to handle such a skewed distribution. In practice, this would mean that even with a proper sample size, applying Fisher's test to such data would result in an overestimation of the number of false positives (finding differences where there are none).

3.3. The Christmas effect, or varying $P(O|X)$

One last thing we want to test is the effect of a varying $P(O)$ during the campaign, since some say that the multi-armed bandit strategy could fail very badly in that case.
Let's imagine that we want to start a one-month campaign which happens to start in the middle of November. We are expecting $1000$ impressions during the first two weeks and $2000$ impressions during the last two weeks (Christmas powa!). Now, let's assume that the real conversion rates in November happen to be:

>>> P_O_X_nov = {"A": 0.10, "B": 0.15}

and in December:

>>> P_O_X_dec = {"A": 0.20, "B": 0.25}

so that $B$ is clearly always the best solution in spite of the slight increase in December:


In [11]:
P_O_X_nov = {"A": 0.10, "B": 0.15}
P_O_X_dec = {"A": 0.20, "B": 0.25}

M = 1000

which_AB_significant = [] 
which_MAB_significant = []
print("Christmas with always better 'B':")
# A/B testing
print(" Simulating A/B...", end="")
for campaign in range(M):
    AB_clicks_nov, AB_events_nov = AB_campaign(1000, P_O_X_nov)[:2]
    AB_clicks_dec, AB_events_dec = AB_campaign(2000, P_O_X_dec)[:2]
    
    AB_clicks = np.append(AB_clicks_nov, AB_clicks_dec)
    AB_events = np.append(AB_events_nov, AB_events_dec)
    
    if is_significant(AB_clicks, AB_events):
        nclicks_A, nevents_A = get_nclicks_nevents(AB_clicks, AB_events, "A")
        nclicks_B, nevents_B = get_nclicks_nevents(AB_clicks, AB_events, "B")
        ratio_A = nclicks_A/nevents_A
        ratio_B = nclicks_B/nevents_B
        if ratio_A > ratio_B:
            which_AB_significant.append("A")
        elif ratio_A < ratio_B:
            which_AB_significant.append("B")
        else:
            print("AB: We have a problem here: the ratios cannot be equal if the difference is significant.")
print(" Finished")
    
# MAB strategy
print(" Simulating Multi-armed Bandit...", end="")
for campaign in range(M):
    MAB_clicks_nov, MAB_events_nov = MultiArmedBandit_campaign(1000, P_O_X_nov)[:2]
    nclicks_A, nevents_A = get_nclicks_nevents(MAB_clicks_nov, MAB_events_nov, event_name="A")
    nclicks_B, nevents_B = get_nclicks_nevents(MAB_clicks_nov, MAB_events_nov, event_name="B")
    click_hist = {"A": nclicks_A, "B": nclicks_B} # We don't want to restart from scratch 
    event_hist = {"A": nevents_A, "B": nevents_B} # but to do as if P_O_X was changed in the middle
    MAB_clicks_dec, MAB_events_dec = MultiArmedBandit_campaign(2000, P_O_X_dec, click_hist=click_hist, event_hist=event_hist)[:2]
    
    MAB_clicks = np.append(MAB_clicks_nov, MAB_clicks_dec)
    MAB_events = np.append(MAB_events_nov, MAB_events_dec)
    if is_significant(MAB_clicks, MAB_events):
        nclicks_A, nevents_A = get_nclicks_nevents(MAB_clicks, MAB_events, "A")
        nclicks_B, nevents_B = get_nclicks_nevents(MAB_clicks, MAB_events, "B")
        ratio_A = nclicks_A/nevents_A
        ratio_B = nclicks_B/nevents_B
        if ratio_A > ratio_B:
            which_MAB_significant.append("A")
        elif ratio_A < ratio_B:
            which_MAB_significant.append("B")
        else:
            print("MAB: the ratios cannot be equal if the difference is significant.")
print(" Finished")
print()
print("A/B testing:")
print("    - found the result significant %.1f%% of the time."%(100.*len(which_AB_significant)/M))
print("    - found the correct result %.1f%% of the time."%(100.*np.sum(np.asarray(which_AB_significant)=="B")/len(which_AB_significant)))

print("MAB strategy:")
print("    - found the result significant %.1f%% of the time."%(100.*len(which_MAB_significant)/M))
print("    - found the correct result %.1f%% of the time."%(100.*np.sum(np.asarray(which_MAB_significant)=="B")/len(which_MAB_significant)))


Christmas with always better 'B':
 Simulating A/B... Finished
 Simulating Multi-armed Bandit... Finished

A/B testing:
    - found the result significant 92.4% of the time.
    - found the correct result 100.0% of the time.
MAB strategy:
    - found the result significant 61.5% of the time.
    - found the correct result 100.0% of the time.

Remarkably, the multi-armed bandit strategy behaves and adapts well. The only issue is that the difference was found significant only around $65\%$ of the time. We expected that: remember that the less promising event is underrepresented, and we have already shown that

  1. the multi-armed bandit strategy does not beat A/B testing on stating whether the differences are significant;
  2. and that Fisher's test is not very well suited to analyze results obtained from that strategy.

Now, we could tweak $P(O|X)$ to reduce the relative difference. In that case, we could see the multi-armed bandit strategy seemingly give a higher proportion of significant results than A/B testing. These would actually be false positives: the event selected might be the correct one, but we cannot say that we trust this result at a $99\%$ level of significance (we actually have no idea of what the significance would be). We already explained that at length in the previous subsection.

4. What does that mean in practice?

4.1. Can we use it?

The multi-armed bandit strategy is generally fine for web optimization.

Really?! All these statistical considerations and simulations to show the limits of the method, and it turns out to be fine?

Most likely yes, precisely because these limits were found not to be relevant to web optimization. Basically, if you are fine saying:

We determined that $A$ yields similar or better results than $B$, $C$, etc.

then the multi-armed bandit strategy might be worth a try. If you want to say:

We are pretty sure that $A$ yields better results than $B$, $C$, etc.

then A/B testing is a much better option (remember, even if you ended up gathering enough data for the difference to be "statistically significant" in the A/B framework, you cannot conclude within the multi-armed bandit framework because the distribution is skewed).

This skewness is only ever a problem in two cases:

  • if the real conversion rates are actually the same (because you should see many false positives);
  • if we are interested in quantifying the differences (because you cannot easily calculate a significance level).

If there was a difference from the start, it tends to be overestimated by the strategy, which is generally not a problem.

In other words, A/B testing is the correct method to answer "is there a difference? if so, how large is it?" whereas the multi-armed bandit method is only a strategy to find an event yielding the best results ("best" can be "same as the others", but you won't know it). Of course we knew that, we said it from the start, it is everywhere... but now we have proof and tools.

4.2. Is multi-armed bandit always better than A/B testing?

The strategy given in the original blog post turned out to be usable even though there were a lot of mistakes in the post itself... Yes, the distribution of values is skewed by the method; yes, we can use more than two options in A/B testing; yes, we can add or remove options as we like (well, almost); no, this should not be used in clinical trials; no, you can't set it and forget it (if you want maximal reward); etc.

One claim stands out as particularly strong... and generally false:

you can always do better than A/B testing -- sometimes, two or three times better

As we have seen, this is completely wrong in at least one case, the very one A/B testing was designed for: proving that there actually is a difference (in clinical trials for instance). Even if the tested drug is no better at curing cancer than sugar, the multi-armed bandit method could favor it, and the statistical test could declare the difference significant while it isn't really.

We can also safely say that the claim is not correct in general if we "set it and forget it", i.e. for continuous optimization. As a simple example, let's set $$P(O|A) = 0.1 \quad \text{ and } \quad P(O|B) = 0.2$$ assuming that we have $10000$ impressions per month. Let's A/B test on $2000$ impressions to get a rough estimate of these conversion rates and then send the optimized version to the remaining users (statistical power of $99\%$). We should obtain $$N_\text{clicks}^\text{AB} = 1000 \times 0.2 + 1000 \times 0.1 + 0.99 \times 0.2 \times (10000-1000) + 0.01 \times 0.1 \times (10000-1000) = 2091$$ clicks by the end of the month, on average. This is to be compared with the multi-armed bandit strategy, which would yield only $$N_\text{clicks}^\text{MAB} = 0.95 \times 0.2 \times 10000 + 0.05 \times 0.1 \times 10000 = 1950$$ clicks (with the default $f = 0.9$ and two events, and assuming that the favored event is found immediately)... In just one month, A/B testing performs better in that particular case.

The "set and forget" part is actually more of a workaround than a feature: it is not as easy to decide when to stop using that strategy as it is with proper A/B testing. Comparing overall click-through rates at constant sample size is not relevant. You would have to estimate a proper sample size for the multi-armed bandit strategy by using the code above, running simulations and estimating the "experimental" distributions for different $N$ until reaching the proper overlap (i.e. the proper statistical level).
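
As a rough sketch of a simpler variant of that idea (the candidate sample sizes and the number of repetitions below are arbitrary), one can reuse the functions defined earlier and simply measure, for each candidate $N$, how often Fisher's test flags a difference on simulated multi-armed bandit campaigns, instead of comparing the full distributions:

def mab_detection_rate(N, P_O_X, repetitions=200, alpha=0.01):
    # Fraction of simulated multi-armed bandit campaigns of size N in which
    # Fisher's test reports a significant difference at level alpha
    hits = 0
    for _ in range(repetitions):
        clicks, events = MultiArmedBandit_campaign(N, P_O_X)[:2]
        hits += is_significant(clicks, events, significance=alpha)
    return hits/repetitions

for N in (3000, 6000, 12000):
    print(N, mab_detection_rate(N, {"A": 0.10, "B": 0.15}))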

Likewise, multivariate analysis (say, to identify interactions between the button color, shape and font size) requires the same setup as basic A/B testing. In particular, this kind of analysis cannot be performed using the multi-armed bandit strategy because we need significant estimates of the differences.

All that said, there are of course numerous cases where we expect the multi-armed bandit strategy to outperform A/B testing. But not "always" or "systematically"... not even "generally"! The two methods answer different questions, and either one can yield far higher click-through rates than the other depending on the case and on how the methods are used.