Bandit Algorithms 101

2015 August 28

Agenda

  1. A simulation approach to building intuition about A/B testing
  2. Practicalities of testing and operation
  3. Optimizing outcomes with multiple options and different payoffs per option

1. A simulation approach to building intuition about A/B testing

This section isn't a cookbook for A/B testing. Rather, it points to some key aspects of how we design and analyze A/B tests that will be useful when we get to the section on bandit algorithms.

When thinking about A/B tests, let's focus on 3 key components:

  • Treatment (e.g. content presented, vaccination, etc.)
  • Reward (e.g. impression, click, proceeds of sale, immunity)
  • Audience and context (The context is an arbitrary but specific set of circumstances under which the audience experiences the treatment.)

In [ ]:
from IPython.display import Image 
Image(filename='img/treat_aud_reward.jpg')

In [ ]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from numpy.random import binomial
from ggplot import *
import random
import sys
plt.figure(figsize=(6,6),dpi=80);
%matplotlib inline

Simulating A/B testing to build intuition

Situation:

  • 2 treatments
  • 1 uniform audience, split into two random groups
  • reward is merely the outcome

Simulation:

  • discrete outcomes
  • 50/50 chance of each outcome
  • Many trials

In [ ]:
Image(filename='img/ab.jpg')

Each test is like flipping a fair coin N times


In [ ]:
Image(filename='img/a.jpg')

In [ ]:
# This is just "A" testing -- a single arm!
# This is the result of 1 arm, 100 trials
df = pd.DataFrame({"coin_toss":binomial(1,0.5,100)})
df.hist()
# Everyone got the same treatment, this is the distribution of the outcome
# reward is the total height of the right-hand bar
plt.show()

Run the cell above a few times.

Can you easily make the case that the average reward is 50?
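
One way to check (a minimal sketch, reusing the `binomial` sampler already imported above): repeat the 100-flip experiment many times and look at how the total reward per experiment spreads around 50.

In [ ]:
# Repeat the 100-flip experiment many times; the total reward
# (number of heads) should center on 50 with some spread
totals = np.array([binomial(1, 0.5, 100).sum() for _ in range(1000)])
print "Mean total reward =", totals.mean()
print "Standard deviation of total reward =", totals.std()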


In [ ]:
# every sample is 0/1, heads or tails
df.head()

In [ ]:
# now with a high probability of heads
df = pd.DataFrame({"coin_toss":binomial(1,0.6,100)})
df.hist()
plt.show()

In [ ]:
# Compare the variability across many different experiments
# of 100 flips each (variability of the mean)
df = pd.DataFrame({"coin_%i"%i:binomial(1,0.5,100) for i in range(20)})
df.hist()
plt.show()

In [ ]:
# Can we distinguish a small difference in probability?
df = pd.DataFrame({"coin_%i"%i:binomial(1,0.52,100) for i in range(20)})
df.hist()
plt.show()

Rewards of the actions might be different

What if the reward for choice 0 is -0.1 and for choice 1 is 0.5?


In [ ]:
# 1 arm
payoff = [-0.1,0.5]
a = np.bincount(binomial(1,0.5,100))
print "Number of 0s and 1s:", a
print "Total reward with pay off specified =", np.dot(a, payoff)

Add treatment B to make it more interesting

Now we want to compare the rewards for individuals in the audience choosing A or B.


In [ ]:
# 2-arm, equal unity reward per coin
# (4 outcomes, but (1,0) and (0,1) are equivalent with this payoff vector)
payoff = [0,1,2]
a = np.bincount(binomial(2,0.5,100))
print a
print np.dot(a, payoff)

Simulating multiple arms with different payoffs

But more often,

  • Arm A (outcome 1/0) has a reward (e.g. 1.0 below) and
  • Arm B (outcome 1/0) has a different reward (e.g. 1.05 below)

For example, imagine the case of tweet engagement (a short sketch follows this list).

  • a user can click on a URL in a tweet (outcome 1/0)
  • a user can retweet (outcome 1/0)
  • rewards for a site visit might be 10x those of a retweet
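
To make the tweet-engagement example concrete, here is a minimal sketch; the click and retweet rates below are made up for illustration.

In [ ]:
# A hypothetical tweet: URL clicks are rarer than retweets,
# but a site visit is worth 10x a retweet (rates are made up)
clicks = binomial(1, 0.05, 100)     # URL click outcome, 1/0 per user
retweets = binomial(1, 0.20, 100)   # retweet outcome, 1/0 per user
print "Total reward =", 10.0 * clicks.sum() + 1.0 * retweets.sum()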

In [ ]:
payoff1=[0,1]
reward1 = np.dot(np.bincount(binomial(1,0.5,100)), payoff1)
print "Arm A reward = ", reward1
payoff2=[0,1.05]
reward2 = np.dot(np.bincount(binomial(1,0.5,100)), payoff2)
print "Arm B reward = ", reward2
total_reward = reward1 + reward2
print "Total reward for arms A and B = ", total_reward

Why worry about the total reward? I thought we wanted to know if A > B?

Hold that thought for a couple more cells...

From now on, assume reward = 0 for outcome 0 to keep things a little simpler. Everything we do can be generalized to a reward vector as above.

How easy is it to differentiate two arms with different payoffs?

Let's ask about choosing a winner.


In [ ]:
def a_b_test(one_payoff=[1, 1.01]):
    # assume payoff for outcome 0 is 0
    # minlength=2 guards against the (vanishingly rare) case of zero successes
    reward1 = np.bincount(binomial(1,0.5,100), minlength=2)[1] * one_payoff[0]
    reward2 = np.bincount(binomial(1,0.5,100), minlength=2)[1] * one_payoff[1]
    return reward1, reward2, reward1 + reward2, reward1-reward2

n_tests = 1000
sim = np.array([a_b_test() for i in range(n_tests)])
df = pd.DataFrame(sim, columns=["t1", "t2", "tot", "diff"])
print "Number of tests in which Arm B won (expect > {} because of payoff) = {}".format(
    n_tests/2
    , len(df[df["diff"] <= 0.0]))
df.hist()
plt.show()

Now is the time to jump to your power and significance testing expertise.

We are going to continue building intuition through simulation.
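
If you want to connect this back to classical testing, here is one way (a minimal sketch, assuming scipy is available; it is not imported above): a chi-squared contingency test on the two arms' success counts.

In [ ]:
from scipy.stats import chi2_contingency

# One experiment: 1000 individuals per arm, slightly different probabilities
n = 1000
successes_a = binomial(1, 0.50, n).sum()
successes_b = binomial(1, 0.52, n).sum()
table = [[successes_a, n - successes_a],
         [successes_b, n - successes_b]]
chi2, p_value, dof, expected = chi2_contingency(table)
print "p-value for 'A and B have the same success rate' =", p_value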

Can we differentiate two arms with different probabilities?


In [ ]:
def a_b_test(ps=[0.5, 0.51], one_payoff=[1, 1]):
    # minlength=2 guards against the (vanishingly rare) case of zero successes
    reward1 = np.bincount(binomial(1,ps[0],100), minlength=2)[1] * one_payoff[0]
    reward2 = np.bincount(binomial(1,ps[1],100), minlength=2)[1] * one_payoff[1]
    return reward1, reward2, reward1 + reward2, reward1-reward2
n_tests= 100
sim = np.array([a_b_test() for i in range(n_tests)])
df = pd.DataFrame(sim, columns=["t1", "t2", "tot", "diff"])
print "Number of tests in which Arm B won (expect > {} because of probability) = {}".format(
    n_tests/2
    , len(df[df["diff"] <= 0.0]))
df.hist()
plt.show()

More Arms


In [ ]:
Image(filename='img/abcd.jpg')

In [ ]:
# repeating what we did before with equal payoffs, more arms
# remember the degenerate outcomes
df = pd.DataFrame({"tot_reward":binomial(2,0.5,100)})
df.hist()
plt.show()

In [ ]:
# ok, now with 4
df = pd.DataFrame({"tot_reward":binomial(4,0.5,100)})
df.hist()
plt.show()

Quick aside

It's easy to simulate many arms with different probabilities when we have a 0/1 reward.

It may be interesting to compare to the uniform-probability case.


In [ ]:
# a little more practice with total reward distribution
trials = 100
probabilities = [0.1, 0.1, 0.9]
reward = np.zeros(trials)
for m in probabilities:
    # equal rewards of 1 or 0
    reward += binomial(1,m,trials)
df = pd.DataFrame({"reward":reward, "fair__uniform_reward":binomial(3,0.5,trials)})
df.hist()
plt.show()

2. Practicalities of testing and operation

Explore vs Exploit?

  • A marketer is going to have a hard time waiting for rigorous testing to finish while winners appear to be emerging
  • Implement online system vs. testing system
  • Chase successes vs. analyzing failures

So maybe set some new objectives instead of just asking whether A > B:

  • Want to automate balance of explore and exploit and run continuously
  • Implement one system

Careful:

  • Some danger in forgetting about significance and power

Bandit Book Utilities

from "Bandit Algorithms" by John Myles White

Three utilities:

  • arms
  • algorithms
  • monte carlo testing framework

In [ ]:
sys.path.append('../../BanditsBook/python')
from core import *

3. Optimizing outcomes with multiple options and different payoffs per option

How can we explore arms and exploit the best arm more often, but still explore?

Answer 1: occasionally, we randomly explore losers.

Epsilon Greedy

  1. Start up by visiting all of the arms
  2. Keep track of the current estimated probabilities (mean rewards) for each arm based on the visits so far
  3. After startup, for each new individual from the audience, explore with probability epsilon and exploit (choose the current best arm) otherwise (a minimal sketch of this selection rule follows the notes below)

Notes:

  • epsilon is the fraction of exploration
  • randomize everything all the time
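
Here is a minimal sketch of the selection step just described (not the library's implementation; `estimates` is a hypothetical list of current mean-reward estimates per arm):

In [ ]:
def epsilon_greedy_select(estimates, epsilon):
    # explore: with probability epsilon, pick an arm uniformly at random
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    # exploit: otherwise pick the arm with the highest estimated reward
    return int(np.argmax(estimates))

# e.g. three arms with current estimates 0.1, 0.5, 0.3
print epsilon_greedy_select([0.1, 0.5, 0.3], 0.2)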

What's a good value for $\epsilon$?

  • Depends on number of arms
  • Depends on individuals per time and total time

What are these parameters when one of the options is 9 times better than all of the others?

Needs a simulation!

(To keep it simple, outcome 1/0 has reward 1/0.)


In [ ]:
random.seed(1)
# Mean (arm probabilities) (Bernoulli)
means = [0.1, 0.1, 0.1, 0.1, 0.9]
# Multiple arms!
n_arms = len(means)
random.shuffle(means)
arms = [BernoulliArm(mu) for mu in means]
print("Best arm is " + str(ind_max(means)))

t_horizon = 250
n_sims = 1000

data = []
for epsilon in [0.1, 0.2, 0.3, 0.4, 0.5]:
    algo = EpsilonGreedy(epsilon, [], [])
    algo.initialize(n_arms)
    # results are column oriented
    # simulation_num, time, chosen arm, reward, cumulative reward
    results = test_algorithm(algo, arms, n_sims, t_horizon)
    results.append([epsilon]*len(results[0]))
    data.extend(np.array(results).T)
    
df = pd.DataFrame(data
            , columns = ["Sim"
                         , "T"
                         , "ChosenArm"
                         , "Reward"
                         , "CumulativeReward"
                         , "Epsilon"])
df.head()

In [ ]:
a=df.groupby(["Epsilon", "T"]).mean().reset_index()
a.head()

In [ ]:
ggplot(aes(x="T",y="Reward", color="Epsilon"), data=a) + geom_line()

In [ ]:
ggplot(aes(x="T",y="CumulativeReward", color="Epsilon"), data=a) + geom_line()

Annealing Softmax

Upgrades to $\epsilon$-Greedy:

  • Need to run more experiments if rewards appear to be nearly equal
  • Keep track of results for exploration as well as exploitation

We're tempted to choose each arm in proportion to its current estimated value, i.e.:

$p(A) \propto \frac{r_A}{r_A + r_B}$

$p(B) \propto \frac{r_B}{r_A + r_B}$

Remember Boltzmann, and add a temperature, $\tau$:

$p(A) \propto \frac{\exp(r_A/\tau)}{\exp(r_A/\tau) + \exp(r_B/\tau)}$

$p(B) \propto \frac{\exp(r_B/\tau)}{\exp(r_A/\tau) + \exp(r_B/\tau)}$

And what is annealing?

  • $\tau = 0$ is deterministic case of winner takes all
  • $\tau = \infty$ is all random, all time
  • Let the temperature go to zero over time to settle into the best state slowly (adiabatically); a minimal sketch of this selection rule follows
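
Putting the Boltzmann weights and the cooling schedule together, a minimal sketch (not the library's `AnnealingSoftmax`; the `1/log(t + 2)` schedule below is just one common choice):

In [ ]:
def annealing_softmax_select(estimates, t):
    # cool the temperature as the trial count t grows
    tau = 1.0 / np.log(t + 2.0)
    weights = np.exp(np.array(estimates) / tau)
    probs = weights / weights.sum()
    # choose an arm in proportion to its Boltzmann weight
    return np.random.choice(len(probs), p=probs)

print annealing_softmax_select([0.1, 0.5, 0.3], t=10)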

In [ ]:
t_horizon = 250
n_sims = 1000

algo = AnnealingSoftmax([], [])
algo.initialize(n_arms)
data = np.array(test_algorithm(algo, arms, n_sims, t_horizon)).T
df = pd.DataFrame(data)
#df.head()
df.columns = ["Sim", "T", "ChosenArm", "Reward", "CumulativeReward"]
df.head()
a=df.groupby(["T"]).mean().reset_index()
a.head()

In [ ]:
ggplot(aes(x="T",y="Reward", color="Sim"), data=a) + geom_line()

In [ ]:
ggplot(aes(x="T",y="CumulativeReward", color="Sim"), data=a) + geom_line()

UCB2

Add a confidence measure to our estimates of the averages!
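
The flavor of the idea, as a minimal sketch (this uses the simpler UCB1 bonus, not UCB2's epoch-based rule): add to each arm's estimate a bonus that grows with the total number of trials and shrinks with how often that arm has been tried.

In [ ]:
def ucb1_select(estimates, counts):
    # counts[i] = number of times arm i has been played so far
    counts = np.array(counts, dtype=float)
    if (counts == 0).any():
        # play any arm we have never tried
        return int(np.argmax(counts == 0))
    # confidence bonus: larger for rarely tried arms
    bonus = np.sqrt(2.0 * np.log(counts.sum()) / counts)
    return int(np.argmax(np.array(estimates) + bonus))

print ucb1_select([0.1, 0.5, 0.3], [10, 5, 3])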


In [ ]:
t_horizon = 250
n_sims = 1000

data = []
for alpha in [0.1, 0.3, 0.5, 0.7, 0.9]:
    algo = UCB2(alpha, [], [])
    algo.initialize(n_arms)
    results = test_algorithm(algo, arms, n_sims, t_horizon)
    results.append([alpha]*len(results[0]))
    data.extend(np.array(results).T)
    
df = pd.DataFrame(data, columns = ["Sim", "T", "ChosenArm", "Reward", "CumulativeReward", "Alpha"])
df.head()

In [ ]:
a=df.groupby(["Alpha", "T"]).mean().reset_index()
a.head()

In [ ]:
ggplot(aes(x="T",y="Reward", color="Alpha"), data=a) + geom_line()

In [ ]:
ggplot(aes(x="T",y="CumulativeReward", color="Alpha"), data=a) + geom_line()

The end