The scenario

Imagine you are in front of three slot machines, each with different likelihoods of paying out. How should you play them in order to maximize your payout?

This is the problem the multi-armed bandit algorithm solves.

You might say the solution lies in just trying them all out and seeing which one does better and you'd be right! But how much should you play each slot machine? How little is too little? Here is where it gets tricky. It isn't too bad though. A little Bayesian stats trick can make this pretty easy. Really all we need to do is play the machine most likely to payout as often as possible given what we know.

Reducing the unknown iteratively

Now that we know we don't need to know the exact likelihood for any of the options let's explore how we can find the best option.

Let's say we start with machine A, drop a quarter in, and pull back the handle. No pay out. How does that affect our belief in this machine? Turns out it's just like the way we updated our beliefs in a coin flip in my earlier post. (You can read that here: http://www.databozo.com/2013/09/15/Bayesian_updating_of_probability_distributions.html)

A key feature we want is experimentation of the different options. Just because we win two times in a row, we don't believe the machine will pay out 100% of the time... though we also can't rule that out. We can do that by collecting a set of credible hypotheses for the likelihood of a payout and randomly choosing one to try. There is a really handy statistical function we can use to accomplish this called the "Beta" function. The Beta function is defined like this:

random likelihood = Beta(1 + plays resulting in payout, 1 + plays NOT resulting in payout)

Imagine you've played slot machine A ten times without a payout. The beta function would look like this:


In [2]:
%load_ext rmagic
from pretty import pprint

In [22]:
%%R
curve(dbeta(x, 1+70, 1+130)/sum(dbeta(x, 1+70, 1+130)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 1+100, 1+100)/sum(dbeta(x, 1+100, 1+100)), add=TRUE)
curve(dbeta(x, 1+60, 1+140)/sum(dbeta(x, 1+60, 1+140)), add=TRUE)



In [10]:
%%R
wins = 0
losses = 10
curve(dbeta(x, 1+wins, 1+losses)/sum(dbeta(x, 1+wins, 1+losses)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")


This shows how the likelihoods nearest zero are most likely. Anything above .4 is practically impossible. Of course zero is the single most likely likelihood but it is not at all the only possibility. So now if we were to sample a likelihood from the above beta function we should see that we get a low likelihood for this machine.


In [37]:
%%R

print(rbeta(1,1,11))


[1] 0.04001839

Here's where it gets cool! We can use this same technique even for the slot machines we haven't tried yet. Let's see if we should try slot machine B next. We'll evaluate the beta function with no prior information. That's a Beta(1,1). Above we use rbeta with an extra parameter (the first) to specify how many samples we want to pull from the function. So this will be all one's.


In [38]:
%%R
print("Likelihood for Slot Machine B to payout")
print(rbeta(1,1,1))


[1] "Likelihood for Slot Machine B to payout"
[1] 0.8719912

Well shucks. If Slot Machine A only has a 3.5% chance to pay out but Slot Machine B has a 57.8% chance to pay out, well, we'd be crazy to play Machine A. But, let's see what Slot Machine C's likelihood is.


In [26]:
%%R
print("Likelihood for Slot Machine C to payout")
print(rbeta(1,1,1))


[1] "Likelihood for Slot Machine C to payout"
[1] 0.8984356

Just so happens that we sampled an even higher likelihood. Then We play Slot Machine C and track whether we get a success or failure. We can safely assume we fail the next play. We sample from the beta function again for each slot machine with the updated parameters.


In [40]:
%%R

print("Likelihood for Slot Machine A to payout")
print(rbeta(1,1,11))

print("Likelihood for Slot Machine B to payout")
print(rbeta(1,1,2))

print("Likelihood for Slot Machine C to payout")
print(rbeta(1,1,2))


[1] "Likelihood for Slot Machine A to payout"
[1] 0.03594639
[1] "Likelihood for Slot Machine B to payout"
[1] 0.2287346
[1] "Likelihood for Slot Machine C to payout"
[1] 0.2912565

Since Slot Machine B samples the highest we play that and go on and on. Let's run a full simulation and see how our results trend over time following this algorithm.

This data structure represents our initial conditions:


In [55]:
def create_slot_machines():
    return {
        'A': {'true_payout_rate': .1, 'payouts':0, 'plays':0},
        'B': {'true_payout_rate': .06, 'payouts':0, 'plays':0},
        'C': {'true_payout_rate': .03, 'payouts':0, 'plays':0}
    }
create_slot_machines()


Out[55]:
{'A': {'payouts': 0, 'plays': 0, 'true_payout_rate': 0.1},
 'B': {'payouts': 0, 'plays': 0, 'true_payout_rate': 0.06},
 'C': {'payouts': 0, 'plays': 0, 'true_payout_rate': 0.03}}

These are the true likelihoods for each slot machine. Our multi-armed bandit algorithm won't know about them. We just use them to decide if we let the machine pay out. We will also track the stats on how each machine has performed.

Now if the simulation works, what we should see is the best slot machine getting played a lot more than the worst, and the middle one getting somewhere in between them. The graph below tracks the payout rate we achieve throughout the simulation.


In [62]:
from numpy.random import beta
from random import random
import matplotlib.pyplot as plt

def sample_likelihood(machine_info):
    payouts = machine_info['payouts']
    plays = machine_info['plays']
    return beta(1 + payouts, 1 + plays - payouts, 1)[0]

def machine_to_play(machines):
    expected_likelihoods = [(name, sample_likelihood(machine_info)) for name, machine_info in machines.iteritems()]
    #expected_likelihoods = [(name, sample_likelihood_simple(machine_info)) for name, machine_info in machines.iteritems()]
    expected_likelihoods.sort(key=lambda element: element[1], reverse=True)
    return expected_likelihoods[:1][0][0]

def play_machine(chosen_machine, slot_machines):
    decision_number = random()
    slot_machines[chosen_machine]['plays'] += 1
    if decision_number <= slot_machines[chosen_machine]['true_payout_rate']:
        slot_machines[chosen_machine]['payouts'] += 1
        return True
    else:
        return False

def run_sim(iterations=1000):
    slot_machines = create_slot_machines()
    player_wins = 0
    player_plays = 0
    sim_history = []
    for i in xrange(iterations):
        chosen_machine = machine_to_play(slot_machines)
        machine_won = play_machine(chosen_machine, slot_machines)
        if machine_won:
            player_wins += 1
        player_plays += 1
        sim_history.append(player_wins/(1.0*player_plays))
    figure = plt.figure()
    ax = figure.add_subplot(111)
    ax.set_xlabel('Iteration #')
    ax.set_ylabel('Measured payout rate')
    ax.set_title('Payout rate improvement over time')
    plt.plot(sim_history)
    return slot_machines
    
run_sim(500)


Out[62]:
{'A': {'payouts': 46, 'plays': 384, 'true_payout_rate': 0.1},
 'B': {'payouts': 3, 'plays': 79, 'true_payout_rate': 0.06},
 'C': {'payouts': 0, 'plays': 37, 'true_payout_rate': 0.03}}

The following chart shows the probability distributions for each machine. Machine A is the line with the rightmost bump.


In [128]:
%%R
curve(dbeta(x, 47, 385)/sum(dbeta(x, 47, 385)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 3, 79)/sum(dbeta(x, 3, 79)), add=TRUE)
curve(dbeta(x, 1, 38)/sum(dbeta(x, 1, 38)), add=TRUE)



In [64]:
run_sim(1000)


Out[64]:
{'A': {'payouts': 87, 'plays': 831, 'true_payout_rate': 0.1},
 'B': {'payouts': 6, 'plays': 123, 'true_payout_rate': 0.06},
 'C': {'payouts': 0, 'plays': 46, 'true_payout_rate': 0.03}}

In [127]:
%%R
curve(dbeta(x, 88, 832)/sum(dbeta(x, 88, 832)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 7, 124)/sum(dbeta(x, 7, 124)), add=TRUE)
curve(dbeta(x, 1, 47)/sum(dbeta(x, 1, 47)), add=TRUE)


Whew. So out of 1,000 plays, this algorithm played the best slot machine ~75% of the time and played the worst one < 5% of the time. The payout rate of our algorithm so far is about 9.6%. Not much below our ideal of 10%. If we just randomly selected a slot machine we'd expect a payout rate of 6.3% which means we're doing 56% better than random. What happens if we ratchet up to 10,000?


In [66]:
run_sim(10000)


Out[66]:
{'A': {'payouts': 902, 'plays': 9504, 'true_payout_rate': 0.1},
 'B': {'payouts': 12, 'plays': 266, 'true_payout_rate': 0.06},
 'C': {'payouts': 9, 'plays': 230, 'true_payout_rate': 0.03}}

In [126]:
%%R
curve(dbeta(x, 903, 9505)/sum(dbeta(x, 903, 9505)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 13, 267)/sum(dbeta(x, 13, 267)), add=TRUE)
curve(dbeta(x, 10, 231)/sum(dbeta(x, 10, 231)), add=TRUE)


Looks like ~92 % of the time it chooses to play the better slot machine and ~0.9% of the time is playing the worst one. Also it looks like our payout rate is at ~9.8% and approaching our ideal rate of 10% (the best possible given the slot machines we have). To me this means that after a thousand or so pulls, the algorithm has found the arm it should exploit. It should never reach our ideal payout rate since it will still occasionally explore the other arms.

You can see in the probability distribution above however that Machine A has hardly any chance of performing worse than the other two.

A little more realistic

What happens if we simulate payout rates a bit closer to each other? How long will it take to converge? Let's redefine the payout rates.


In [67]:
def create_slot_machines():
    return {
        'A': {'true_payout_rate': .1, 'payouts':0, 'plays':0},
        'B': {'true_payout_rate': .102, 'payouts':0, 'plays':0},
        'C': {'true_payout_rate': .098, 'payouts':0, 'plays':0}
    }

So now we're dealing with differences of just +/-2%. Let's see how long this takes.


In [98]:
run_sim(500)


Out[98]:
{'A': {'payouts': 12, 'plays': 128, 'true_payout_rate': 0.1},
 'B': {'payouts': 1, 'plays': 43, 'true_payout_rate': 0.102},
 'C': {'payouts': 42, 'plays': 329, 'true_payout_rate': 0.098}}

At 500 plays, it's still very possible the algorithm might play the suboptimal machine more often. In this simulation for example, we can see that the worst machine was still performing well above its true payout rate with a measured payout rate of 12.8%. The machine with the highest payout rate has only won once out of 43 plays which is WAY below its 10.2% ideal.


In [101]:
run_sim(1000)


Out[101]:
{'A': {'payouts': 8, 'plays': 129, 'true_payout_rate': 0.1},
 'B': {'payouts': 68, 'plays': 616, 'true_payout_rate': 0.102},
 'C': {'payouts': 26, 'plays': 255, 'true_payout_rate': 0.098}}

At one thousand plays we can see the algorithm starting to exploit the best machine. Looking at the graph we can also see the payout rate beginning to settle.


In [79]:
run_sim(10000)


Out[79]:
{'A': {'payouts': 322, 'plays': 3199, 'true_payout_rate': 0.1},
 'B': {'payouts': 536, 'plays': 4997, 'true_payout_rate': 0.102},
 'C': {'payouts': 160, 'plays': 1804, 'true_payout_rate': 0.098}}

Since all of the true payout rates are fairly close to each other, the algorithm is still playing the suboptimal options fairly often. It will take more iterations until it begins to truly exploit the ideal option. Let's see what happens at 40,000 iterations:


In [81]:
run_sim(40000)


Out[81]:
{'A': {'payouts': 625, 'plays': 6429, 'true_payout_rate': 0.1},
 'B': {'payouts': 2833, 'plays': 27995, 'true_payout_rate': 0.102},
 'C': {'payouts': 528, 'plays': 5576, 'true_payout_rate': 0.098}}

The algorithm has achieved a payout rate of 9.97% and is exploiting the ideal machine fairly often.

Reflecting

I'm a little surprised to see how long it takes for the algorithm to begin exploiting an option. It doesn't mean this isn't optimal but I'm willing to be there are probably some hacks one could make or some general prior knowledge one could feed into the algorithm to improve it.

EDIT: The algorithm converged much slower due to a bug I had in my usage of the beta function. R and Numpy seem to parameterize it differently. Sigh. So now my stance has changed. On top of that, it seems as though driving the exploration/exploitation phases solely by sampling from the beta distribution is near-optimal. Here are a couple of citations to this effect:

That's all there is to a simple multi-armed bandit algorithm. Hopefully I've explained it well enough that you can think of new ways to apply it on your own. Really it all comes down to an understanding of what the beta function is doing and what the parameters you pass to it mean.

Thanks for reading!