Imagine you are in front of three slot machines, each with different likelihoods of paying out. How should you play them in order to maximize your payout?

This is the problem the multi-armed bandit algorithm solves.

You might say the solution lies in just trying them all out and seeing which one does better, and you'd be right! But how much should you play each slot machine? How little is too little? Here is where it gets tricky. It isn't too bad, though: a little Bayesian stats trick makes it pretty easy. Really, all we need to do is play the machine most likely to pay out as often as possible, given what we know.

Notice that we don't need to know the exact payout likelihood of any machine. With that in mind, let's explore how we can find the best option.

Let's say we start with machine A, drop a quarter in, and pull back the handle. No payout. How does that affect our belief in this machine? It turns out it's just like the way we updated our beliefs about a coin flip in my earlier post. (You can read that here: http://www.databozo.com/2013/09/15/Bayesian_updating_of_probability_distributions.html)
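In code, that belief update is nothing more than counting. Here's a minimal sketch (the tally variables are mine), assuming we start from a flat belief where every payout rate is equally plausible:

```python
# A machine's history is just two tallies; updating our belief after a
# play means incrementing one of them.
wins, losses = 0, 0

# Drop a quarter in machine A, pull the handle... no payout:
losses += 1

# Starting from a flat belief, the expected payout rate works out to
# (1 + wins) / (2 + wins + losses); one loss drags it below 50%.
expected_rate = (1 + wins) / (2 + wins + losses)
print(expected_rate)  # 1/3, about 0.333
```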

A key feature we want is experimentation across the different options. Just because we win two times in a row, we don't believe the machine will pay out 100% of the time... though we also can't rule that out. We can get this behavior by keeping a distribution of credible hypotheses for the likelihood of a payout and randomly choosing one to try. There is a really handy statistical distribution we can use to accomplish this, called the Beta distribution. We draw from it like this:

```
random likelihood = Beta(1 + plays resulting in payout, 1 + plays NOT resulting in payout)
```
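In Python, that draw is a single NumPy call. A quick sketch (the function and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

def random_likelihood(payouts, non_payouts):
    # One random draw from the credible payout-rate hypotheses
    return rng.beta(1 + payouts, 1 + non_payouts)

# With no plays at all, draws spread uniformly over [0, 1]; after
# 2 payouts in 10 plays they cluster around 0.25.
draws = [random_likelihood(2, 8) for _ in range(10000)]
print(sum(draws) / len(draws))  # close to the Beta(3, 9) mean of 0.25
```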

First, a bit of notebook setup so we can plot with R:

In [2]:

```
%load_ext rmagic
from pretty import pprint
```

Imagine you've played slot machine A ten times without a payout. The Beta distribution would look like this:

In [10]:

```
%%R
wins = 0
losses = 10
curve(dbeta(x, 1+wins, 1+losses)/sum(dbeta(x, 1+wins, 1+losses)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
```

For comparison, here's how the distributions look for three machines that have each been played 200 times, with 70, 100, and 60 payouts respectively:

In [22]:

```
%%R
curve(dbeta(x, 1+70, 1+130)/sum(dbeta(x, 1+70, 1+130)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 1+100, 1+100)/sum(dbeta(x, 1+100, 1+100)), add=TRUE)
curve(dbeta(x, 1+60, 1+140)/sum(dbeta(x, 1+60, 1+140)), add=TRUE)
```
To pick which machine to play next, we draw one random sample from each machine's distribution and play whichever machine's sample comes out highest. Here's a draw for machine A after its ten payout-free plays:

In [37]:

```
%%R
print(rbeta(1,1,11))
```

Machines B and C haven't been played yet, so their distributions are flat and every payout rate is equally credible:

In [38]:

```
%%R
print("Likelihood for Slot Machine B to payout")
print(rbeta(1,1,1))
```

In [26]:

```
%%R
print("Likelihood for Slot Machine C to payout")
print(rbeta(1,1,1))
```

Now suppose B and C have each been played once without a payout. Sampling all three machines gives us something like:

In [40]:

```
%%R
print("Likelihood for Slot Machine A to payout")
print(rbeta(1,1,11))
print("Likelihood for Slot Machine B to payout")
print(rbeta(1,1,2))
print("Likelihood for Slot Machine C to payout")
print(rbeta(1,1,2))
```

Since Slot Machine B samples the highest, we play it, record the result, and repeat. Let's run a full simulation and see how our results trend over time following this algorithm.
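That selection rule (sample once per machine, play the highest sample) fits in a few lines of Python. A sketch with hypothetical win/loss tallies; `pick_machine` and the numbers are mine:

```python
import random

random.seed(1)  # for repeatability

# Hypothetical records: A has 0 payouts in 10 plays; B and C, 0 in 1.
records = {'A': (0, 10), 'B': (0, 1), 'C': (0, 1)}

def pick_machine(records):
    # One draw per machine from Beta(1 + payouts, 1 + non-payouts)...
    draws = {name: random.betavariate(1 + payouts, 1 + plays - payouts)
             for name, (payouts, plays) in records.items()}
    # ...and the highest draw wins the next quarter.
    return max(draws, key=draws.get)

plays = [pick_machine(records) for _ in range(1000)]
# A's distribution hugs zero, so it should be picked far less often:
print(plays.count('A'), plays.count('B'), plays.count('C'))
```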

This data structure represents our initial conditions:

In [55]:

```
def create_slot_machines():
    return {
        'A': {'true_payout_rate': .1, 'payouts': 0, 'plays': 0},
        'B': {'true_payout_rate': .06, 'payouts': 0, 'plays': 0},
        'C': {'true_payout_rate': .03, 'payouts': 0, 'plays': 0}
    }
create_slot_machines()
```

These are the true payout likelihoods for each slot machine. Our multi-armed bandit algorithm won't know about them; we just use them to decide whether a machine pays out on a given play. We will also track the stats on how each machine has performed.

Now if the simulation works, what we should see is the best slot machine getting played a lot more than the worst, and the middle one getting somewhere in between them. The graph below tracks the payout rate we achieve throughout the simulation.

In [62]:

```
from numpy.random import beta
from random import random
import matplotlib.pyplot as plt

def sample_likelihood(machine_info):
    # Draw one credible payout rate from this machine's Beta distribution
    payouts = machine_info['payouts']
    plays = machine_info['plays']
    return beta(1 + payouts, 1 + plays - payouts)

def machine_to_play(machines):
    # Sample a rate for each machine and play the highest sample
    expected_likelihoods = [(name, sample_likelihood(machine_info))
                            for name, machine_info in machines.items()]
    expected_likelihoods.sort(key=lambda element: element[1], reverse=True)
    return expected_likelihoods[0][0]

def play_machine(chosen_machine, slot_machines):
    # Use the hidden true rate to decide whether this play pays out
    decision_number = random()
    slot_machines[chosen_machine]['plays'] += 1
    if decision_number <= slot_machines[chosen_machine]['true_payout_rate']:
        slot_machines[chosen_machine]['payouts'] += 1
        return True
    return False

def run_sim(iterations=1000):
    slot_machines = create_slot_machines()
    player_wins = 0
    player_plays = 0
    sim_history = []
    for i in range(iterations):
        chosen_machine = machine_to_play(slot_machines)
        if play_machine(chosen_machine, slot_machines):
            player_wins += 1
        player_plays += 1
        sim_history.append(player_wins / (1.0 * player_plays))
    figure = plt.figure()
    ax = figure.add_subplot(111)
    ax.set_xlabel('Iteration #')
    ax.set_ylabel('Measured payout rate')
    ax.set_title('Payout rate improvement over time')
    plt.plot(sim_history)
    return slot_machines

run_sim(500)
```

Here's how the three machines' Beta distributions look after a run:

In [128]:

```
%%R
curve(dbeta(x, 47, 385)/sum(dbeta(x, 47, 385)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 3, 79)/sum(dbeta(x, 3, 79)), add=TRUE)
curve(dbeta(x, 1, 38)/sum(dbeta(x, 1, 38)), add=TRUE)
```

Now for 1,000 iterations:

In [64]:

```
run_sim(1000)
```

In [127]:

```
%%R
curve(dbeta(x, 88, 832)/sum(dbeta(x, 88, 832)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 7, 124)/sum(dbeta(x, 7, 124)), add=TRUE)
curve(dbeta(x, 1, 47)/sum(dbeta(x, 1, 47)), add=TRUE)
```

And 10,000:

In [66]:

```
run_sim(10000)
```

In [126]:

```
%%R
curve(dbeta(x, 903, 9505)/sum(dbeta(x, 903, 9505)), ylab="Likelihood of hypothesis", xlab="Hypothesis for ideal payout rate")
curve(dbeta(x, 13, 267)/sum(dbeta(x, 13, 267)), add=TRUE)
curve(dbeta(x, 10, 231)/sum(dbeta(x, 10, 231)), add=TRUE)
```
In [67]:

```
def create_slot_machines():
    return {
        'A': {'true_payout_rate': .1, 'payouts': 0, 'plays': 0},
        'B': {'true_payout_rate': .102, 'payouts': 0, 'plays': 0},
        'C': {'true_payout_rate': .098, 'payouts': 0, 'plays': 0}
    }
```
So now we're dealing with payout rates of 10%, 10.2%, and 9.8%: relative differences of just +/-2%. Let's see how long this takes.
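To see why machines this close are hard to tell apart, we can estimate how often a sample from the slightly-better machine's distribution actually beats a sample from the slightly-worse one. A rough Monte Carlo sketch; the function name and tallies are hypothetical:

```python
import random

random.seed(0)

def p_first_sample_higher(payouts1, plays1, payouts2, plays2, trials=20000):
    # Estimate how often a draw from machine 1's Beta distribution
    # exceeds a draw from machine 2's.
    higher = 0
    for _ in range(trials):
        d1 = random.betavariate(1 + payouts1, 1 + plays1 - payouts1)
        d2 = random.betavariate(1 + payouts2, 1 + plays2 - payouts2)
        if d1 > d2:
            higher += 1
    return higher / trials

# Hypothetical tallies after 1,000 plays each, at roughly 10.2% and 10.0%:
print(p_first_sample_higher(102, 1000, 100, 1000))  # only slightly above 0.5
```

Even with 1,000 plays of evidence per machine, the posteriors overlap so much that the better machine wins the sampling contest barely more than half the time.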

In [98]:

```
run_sim(500)
```

In [101]:

```
run_sim(1000)
```

In [79]:

```
run_sim(10000)
```

In [81]:

```
run_sim(40000)
```

The algorithm has achieved a payout rate of 9.97% and is exploiting the ideal machine fairly often.

~~I'm a little surprised to see how long it takes for the algorithm to begin exploiting an option. It doesn't mean this isn't optimal, but I'm willing to bet there are some hacks one could make, or some general prior knowledge one could feed into the algorithm, to improve it.~~

EDIT: The algorithm converged much more slowly due to a bug in my usage of the beta function. R and NumPy parameterize it differently. Sigh. So now my stance has changed. On top of that, it seems as though driving the exploration/exploitation phases solely by sampling from the Beta distribution is near-optimal. Here are a couple of citations to this effect:

- From the people who brought you pymc: http://camdp.com/blogs/multi-armed-bandits
- Compares this random method with other technically more optimal solutions: http://www.economics.uci.edu/~ivan/asmb.874.pdf
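A note on that parameterization gotcha, as I understand it: R's `rbeta` takes the number of draws as its first argument, while NumPy's `beta` takes the two shape parameters first, with an optional size argument last. A quick sketch:

```python
from numpy.random import beta

wins, losses = 0, 10

# R:     rbeta(1, 1 + wins, 1 + losses)   -- the draw count comes FIRST
# NumPy: beta(1 + wins, 1 + losses)       -- shape parameters come first,
#                                            optional size argument last
single_draw = beta(1 + wins, 1 + losses)       # one float in [0, 1]
many_draws = beta(1 + wins, 1 + losses, size=1000)  # array of 1000 draws
print(len(many_draws))  # 1000
```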

That's all there is to a simple multi-armed bandit algorithm. Hopefully I've explained it well enough that you can think of new ways to apply it on your own. Really, it all comes down to understanding what the Beta distribution is doing and what the parameters you pass to it mean.

Thanks for reading!