Bandit Algorithms 101

2015 August 28

Agenda

  1. A simulation approach to building intuition about A/B testing
  2. Practical requirements of testing and operation
  3. Optimizing outcomes with multiple options with different payoffs

1. A simulation approach to building intuition about A/B testing

A/B Test:

  • Treatment (e.g. content presented)
  • Reward (e.g. impression, click, proceeds of sale)
  • Audience-in-context
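
Each logged event of a test ties these three pieces together. A minimal sketch of such a record, with illustrative field names (not taken from this notebook's code):

In [ ]:
from collections import namedtuple

# One logged observation: the treatment shown, the audience/context it was
# shown to, and the reward that resulted (field names are illustrative).
Observation = namedtuple("Observation", ["treatment", "audience_context", "reward"])

obs = Observation(treatment="B", audience_context={"device": "mobile"}, reward=1.0)
print(obs.reward)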

In [ ]:
from IPython.display import Image 
Image(filename='img/treat_aud_reward.jpg')

In [ ]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from numpy.random import binomial
from ggplot import *
import random
import sys
plt.figure(figsize=(6,6),dpi=80);
%matplotlib inline

Simulating A/B testing


In [ ]:
Image(filename='img/ab.jpg')

Each test is like flipping a fair coin N times


In [ ]:
Image(filename='img/a.jpg')

In [ ]:
df = pd.DataFrame({"coin_toss":binomial(1,0.5,100)})
df.hist()
plt.show()

In [ ]:
df.head()

In [ ]:
df = pd.DataFrame({"coin_toss":binomial(1,0.6,100)})
df.hist()
plt.show()

In [ ]:
df = pd.DataFrame({"coin_%i"%i:binomial(1,0.5,100) for i in range(20)})
df.hist()
plt.show()

In [ ]:
df = pd.DataFrame({"coin_%i"%i:binomial(1,0.52,100) for i in range(20)})
df.hist()
plt.show()

Rewards for choosing A or B


In [ ]:
# one arm with Bernoulli(0.5) outcomes: a miss pays -0.1, a hit pays 0.5
payoff = [-0.1,0.5]
# bincount gives [number of misses, number of hits] over 100 pulls
a = np.bincount(binomial(1,0.5,100))
print np.dot(a, payoff)

Rewards for choosing A or B or C


In [ ]:
# 2-arm, equal reward: each arm pays 1 per success, so per-trial totals are 0, 1, or 2
payoff = [0,1,2]
a = np.bincount(binomial(2,0.5,100))
print np.dot(a, payoff)

Simulating multiple options with different payoffs


In [ ]:
payoff=[1,1.05]
reward1 = np.bincount(binomial(1,0.5,100))[1] * payoff[0]
print reward1
reward2 = np.bincount(binomial(1,0.5,100))[1] * payoff[1]
print reward2
total_reward = reward1 + reward2
print total_reward

Can we differentiate two arms with different payoffs?


In [ ]:
def trial(payoff=[1, 1.01]):
    # 100 Bernoulli(0.5) pulls per arm; only the payoffs differ
    reward1 = np.bincount(binomial(1,0.5,100))[1] * payoff[0]
    reward2 = np.bincount(binomial(1,0.5,100))[1] * payoff[1]
    return reward1, reward2, reward1 + reward2, reward1-reward2

trials = 100
sim = np.array([trial() for i in range(trials)])
df = pd.DataFrame(sim, columns=["t1", "t2", "tot", "diff"])
# number of trials in which the higher-payoff arm came out ahead (or tied)
print len(df[df["diff"] <= 0.0])
df.hist()
plt.show()

Can we differentiate two arms with different probabilities?


In [ ]:
def trial(ps=[0.5, 0.51], payoff=[1, 1]):
    # equal payoffs; only the success probabilities differ
    reward1 = np.bincount(binomial(1,ps[0],100))[1] * payoff[0]
    reward2 = np.bincount(binomial(1,ps[1],100))[1] * payoff[1]
    return reward1, reward2, reward1 + reward2, reward1-reward2
trials = 100
sim = np.array([trial() for i in range(trials)])
df = pd.DataFrame(sim, columns=["t1", "t2", "tot", "diff"])
# number of trials in which the higher-probability arm came out ahead (or tied)
print len(df[df["diff"] <= 0.0])
df.hist()
plt.show()

More Arms


In [ ]:
Image(filename='img/abcd.jpg')

In [ ]:
df = pd.DataFrame({"tot_reward":binomial(2,0.5,100)})
df.hist()
plt.show()

In [ ]:
df = pd.DataFrame({"tot_reward":binomial(3,0.5,100)})
df.hist()
plt.show()

In [ ]:
trials = 100
means = [0.1, 0.1, 0.9]
reward = np.zeros(trials)
for m in means:
    # each arm pays 1 on success, 0 on failure
    reward += binomial(1,m,trials)
# compare against three fair (p=0.5) arms with the same payoff
df = pd.DataFrame({"reward":reward, "fair_reward":binomial(3,0.5,trials)})
df.hist()
plt.show()

2. Practical requirements of testing and operation

Explore vs Exploit?

  • A marketer will have a hard time waiting for a rigorous test to finish once a winner appears to emerge
  • We are implementing an online, always-on system, not just a testing system
  • Flavor: chase successes rather than analyze failures
  • We want to automate the balance of explore and exploit and run it continuously
  • There is some danger in forgetting about statistical significance and power (see the sketch below)
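
As a reminder of what we risk skipping, here is a hedged sketch of a conventional significance check on raw A/B counts. It uses scipy.stats.chi2_contingency, which is not imported elsewhere in this notebook, and the counts are made up for illustration.

In [ ]:
# Hypothetical counts, for illustration only: clicks vs. non-clicks per variant.
from scipy.stats import chi2_contingency

clicks_a, trials_a = 52, 1000
clicks_b, trials_b = 61, 1000
table = [[clicks_a, trials_a - clicks_a],
         [clicks_b, trials_b - clicks_b]]
chi2, p_value, dof, expected = chi2_contingency(table)
# A large p-value means the observed difference could easily be noise.
print(p_value)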

Bandit Book Utilities

from "Bandit Algorithms" by John Myles White

Three utilities:

  • arms
  • algorithms
  • Monte Carlo testing framework

In [ ]:
sys.path.append('../../BanditsBook/python')
from core import *
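
For orientation, the arm abstraction in that repo is roughly the following: an arm knows its own success probability and returns a reward each time it is drawn. This is a hedged paraphrase, not the library's exact source, so the class is given a distinct name.

In [ ]:
import random

# Rough sketch of a Bernoulli arm, paraphrasing the BanditsBook abstraction.
class SketchBernoulliArm(object):
    def __init__(self, p):
        self.p = p  # probability of paying out 1.0

    def draw(self):
        # one pull: reward 1.0 with probability p, else 0.0
        return 1.0 if random.random() < self.p else 0.0

arm = SketchBernoulliArm(0.1)
print(sum(arm.draw() for _ in range(1000)))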

3. Optimizing outcomes with multiple options with different payoffs

Epsilon Greedy

  1. Split traffic on audience + context + creative
  2. Keep track of past results
  3. With probability epsilon, experiment; otherwise exploit the best-performing offering so far

Notes:

  • epsilon is the fraction of traffic spent exploring
  • randomize everything, all the time (see the sketch below)
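
The core selection rule is small. Here is a minimal sketch of it; the EpsilonGreedy class used below (imported from the BanditsBook code) also handles initialization and updating the per-arm value estimates.

In [ ]:
import random

# Minimal epsilon-greedy selection rule (sketch; the real class comes from core).
def epsilon_greedy_select(epsilon, estimated_values):
    if random.random() < epsilon:
        # explore: pick any arm uniformly at random
        return random.randrange(len(estimated_values))
    # exploit: pick the arm with the highest estimated value
    return max(range(len(estimated_values)), key=lambda i: estimated_values[i])

print(epsilon_greedy_select(0.1, [0.12, 0.08, 0.31]))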

In [ ]:
random.seed(1)
# Mean arm payoff (Bernoulli)
means = [0.1, 0.1, 0.1, 0.1, 0.9]
# Multiple arms!
n_arms = len(means)
random.shuffle(means)
arms = [BernoulliArm(mu) for mu in means]
print("Best arm is " + str(ind_max(means)))

t_horizon = 250
n_sims = 1000

data = []
for epsilon in [0.1, 0.2, 0.3, 0.4, 0.5]:
    algo = EpsilonGreedy(epsilon, [], [])
    algo.initialize(n_arms)
    # results are column oriented
    # simulation_num, time, chosen arm, reward, cumulative reward
    results = test_algorithm(algo, arms, n_sims, t_horizon)
    results.append([epsilon]*len(results[0]))
    data.extend(np.array(results).T)
    
df = pd.DataFrame(data, columns = ["Sim", "T", "ChosenArm", "Reward", "CumulativeReward", "Epsilon"])
df.head()

In [ ]:
a=df.groupby(["Epsilon", "T"]).mean().reset_index()
a.head()

In [ ]:
ggplot(aes(x="T",y="Reward", color="Epsilon"), data=a) + geom_line()

In [ ]:
ggplot(aes(x="T",y="CumulativeReward", color="Epsilon"), data=a) + geom_line()

Annealing Softmax

Upgrades to $\epsilon$-Greedy:

  • Need to run more experiments if rewards appear to be nearly equal
  • Keep track of results for exploration as well as exploitation

We are tempted to choose each arm in proportion to its current estimated value, i.e.:

$p(A) = \frac{r_A}{r_A + r_B}$

$p(B) = \frac{r_B}{r_A + r_B}$

Remembering Boltzmann, we instead exponentiate and introduce a temperature $\tau$:

$p(A) = \frac{\exp(r_A/\tau)}{\exp(r_A/\tau) + \exp(r_B/\tau)}$

$p(B) = \frac{\exp(r_B/\tau)}{\exp(r_A/\tau) + \exp(r_B/\tau)}$

And what is annealing?

  • $\tau = 0$ is the deterministic, winner-takes-all case
  • $\tau = \infty$ is completely random, all the time
  • Let the temperature fall to zero over time so the system settles slowly (adiabatically) into the best state (see the sketch below)
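
A minimal sketch of temperature-controlled softmax selection at a fixed $\tau$; the AnnealingSoftmax class used below (from the BanditsBook code) additionally lowers the temperature as pulls accumulate.

In [ ]:
import math
import random

# Softmax arm selection at a fixed temperature (sketch; the imported
# AnnealingSoftmax also decreases tau over time).
def softmax_select(estimated_values, tau):
    weights = [math.exp(v / tau) for v in estimated_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # sample an arm index according to probs
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

print(softmax_select([0.12, 0.08, 0.31], tau=0.1))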

In [ ]:
t_horizon = 250
n_sims = 1000

algo = AnnealingSoftmax([], [])
algo.initialize(n_arms)
data = np.array(test_algorithm(algo, arms, n_sims, t_horizon)).T
df = pd.DataFrame(data, columns=["Sim", "T", "ChosenArm", "Reward", "CumulativeReward"])
df.head()
a=df.groupby(["T"]).mean().reset_index()
a.head()

In [ ]:
ggplot(aes(x="T",y="Reward", color="Sim"), data=a) + geom_line()

In [ ]:
ggplot(aes(x="T",y="CumulativeReward", color="Sim"), data=a) + geom_line()

UCB2

Add a confidence bound to our estimates of the averages!
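
For intuition, here is a hedged sketch of the simpler UCB1 rule, which adds an upper confidence bound to each arm's estimated mean; UCB2 (used below) refines this with a related, epoch-based bound controlled by the alpha parameter.

In [ ]:
import math

# UCB1-style selection (sketch for intuition; the imported UCB2 differs in detail).
def ucb1_select(counts, values):
    # pull each arm once before trusting any bound
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    bounds = [values[i] + math.sqrt(2.0 * math.log(total) / counts[i])
              for i in range(len(counts))]
    return max(range(len(bounds)), key=lambda i: bounds[i])

print(ucb1_select([10, 10, 3], [0.12, 0.08, 0.31]))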


In [ ]:
t_horizon = 250
n_sims = 1000

data = []
for alpha in [0.1, 0.3, 0.5, 0.7, 0.9]:
    algo = UCB2(alpha, [], [])
    algo.initialize(n_arms)
    results = test_algorithm(algo, arms, n_sims, t_horizon)
    results.append([alpha]*len(results[0]))
    data.extend(np.array(results).T)
    
df = pd.DataFrame(data, columns = ["Sim", "T", "ChosenArm", "Reward", "CumulativeReward", "Alpha"])
df.head()

In [ ]:
a=df.groupby(["Alpha", "T"]).mean().reset_index()
a.head()

In [ ]:
ggplot(aes(x="T",y="Reward", color="Alpha"), data=a) + geom_line()

In [ ]:
ggplot(aes(x="T",y="CumulativeReward", color="Alpha"), data=a) + geom_line()

The end