Title: Multi-Armed Bandit Problem Slug: multi-armed-bandit Summary: Exploring and implementing some of the approaches found to solving the multi-armed bandit problem. Date: 2018-01-22 10:49 Category: Statistics Tags: Basics Authors: Thomas Pinder

I first encountered the multi-armed bandit problem when learning about reinforcement learning. It is a classic problem within machine learning and statistics, concerning multiple slot machines, each with a different success probability. The task is deciding which slot machine's arm to pull so as to give yourself the best chance of success whilst losing as little money as possible. The notes here are based upon content from the Udemy course here by the Lazy Programmer, with my own interpretations and annotations added in to aid my own learning.

Epsilon-Greedy

An initial approach to solving this problem is the epsilon-greedy strategy. Using this strategy, a small value of epsilon is chosen; this is our probability of exploration. At each step we make a random draw from a uniform distribution on [0, 1], and if the value of this draw is less than epsilon we explore, meaning that we pull the arm of a randomly chosen machine. If the draw is greater than epsilon, we exploit, pulling the arm of the best-performing machine so far. To demonstrate this, let's implement it in the setting of three machines.


import numpy as np
import matplotlib.pyplot as plt

class Bandit:
    def __init__(self, m):
        self.m = m        # the bandit's true mean reward
        self.mean = 0     # running estimate of the mean reward
        self.N = 0        # number of times this bandit has been pulled

    def pull(self):
        # Reward drawn from a unit-variance Gaussian centred on the true mean
        return np.random.randn() + self.m

    def update(self, x):
        # Incrementally update the sample mean with the latest reward x
        self.N += 1
        self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x
    

def run_simulation(m1, m2, m3, epsilon, N):
    bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]
    data = np.empty(N)
    for i in range(N):
        draw = np.random.random()
        if draw < epsilon:
            # Explore: pick one of the three bandits at random
            j = np.random.choice(3)
        else:
            # Exploit: pick the bandit with the highest sample mean so far
            j = np.argmax([b.mean for b in bandits])
        x = bandits[j].pull()
        bandits[j].update(x)
        data[i] = x
    cumulative_avg = np.cumsum(data)/(np.arange(N) + 1)

    # Plot the cumulative average reward against each bandit's true mean
    plt.plot(cumulative_avg)
    plt.plot(np.ones(N)*m1)
    plt.plot(np.ones(N)*m2)
    plt.plot(np.ones(N)*m3)
    plt.xscale('log')
    plt.show()

    for i, b in enumerate(bandits, start=1):
        print("Bandit {}'s Mean: {}".format(i, b.mean))

    return cumulative_avg

if __name__ == "__main__":
    c_1 = run_simulation(1.0, 2.0, 3.0, 0.1, 10000)
    c_05 = run_simulation(1.0, 2.0, 3.0, 0.05, 10000)
    c_001 = run_simulation(1.0, 2.0, 3.0, 0.001, 10000)

    # Compare the three values of epsilon on a single plot
    plt.plot(c_1, label='eps=0.1')
    plt.plot(c_05, label='eps=0.05')
    plt.plot(c_001, label='eps=0.001')
    plt.xscale('log')
    plt.legend()
    plt.show()


Bandit 1's Mean: 0.9945925152354567
Bandit 2's Mean: 2.031884525867273
Bandit 3's Mean: 3.0027190916428332
Bandit 1's Mean: 0.9232646374694955
Bandit 2's Mean: 2.0148010108311833
Bandit 3's Mean: 3.004720907443339
Bandit 1's Mean: 0.9495607424791751
Bandit 2's Mean: 1.0253274264675913
Bandit 3's Mean: 3.0044380931662

So what's going on here? Well, we first create a Bandit class and give it the two methods that will be needed. The first, pull(), replicates the act of pulling the bandit's arm: it returns a reward drawn from a normal distribution centred on the bandit's true mean. The second method, update(), takes a single argument x, the reward from the latest pull, and updates the bandit's sample mean whilst incrementing the counter N.
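The update() method is worth a brief aside: it is the standard incremental form of the sample mean, so we never need to store every past reward. As a quick sanity check (my own snippet, not from the course), we can compare the incremental estimate against np.mean over the same rewards:

import numpy as np

np.random.seed(0)
rewards = np.random.randn(1000) + 3.0  # simulated pulls from a bandit with true mean 3

mean, N = 0.0, 0
for x in rewards:
    N += 1
    mean = (1 - 1.0/N)*mean + (1.0/N)*x  # same update as Bandit.update()

print(mean, np.mean(rewards))  # the two estimates agree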

With the Bandit class constructed, we can run the simulations from the run_simulation() function. This function takes five arguments, the first three of which are the true means of our bandits, i.e. the average reward each machine pays out. The fourth argument is our epsilon, which determines the probability of exploring rather than exploiting. Finally, we pass the function N, the number of iterations to run. From here, an empty array is initialised for us to store our bandits' rewards. We then loop through the iterations, each time taking a random draw and comparing it to epsilon. If the draw is less than epsilon we randomly select one of our three bandit machines; otherwise we choose the machine with the highest sample mean. Once a machine has been selected, we pull its arm, update its sample mean and record the reward. The function plots and returns the cumulative average reward, calculated by taking the cumulative sum of the rewards and dividing by the number of pulls so far.

UCB1

An alternative to the epsilon-greedy approach is the Upper Confidence Bound (UCB1) algorithm. The key difference is that we are now greedy not with respect to the sample mean alone, but with respect to the upper limit of a confidence bound around the sample mean, derived from the Chernoff-Hoeffding inequality. This has advantages: bandits that have been pulled many times have tight bounds, so their upper limits sit close to their sample means, whilst bandits with little data have wide bounds and so continue to be selected occasionally, meaning exploration happens automatically without a separate epsilon.
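I haven't reproduced the course's UCB1 code here, but as a rough sketch of the selection rule (reusing the Bandit class from above; the names ucb and run_ucb1 are my own), at total pull n we choose the machine that maximises its sample mean plus sqrt(2 * ln(n) / N_j), where N_j is the number of times machine j has been pulled:

import numpy as np

def ucb(mean, n, n_j):
    # Upper confidence bound for a bandit pulled n_j times out of n total pulls
    return mean + np.sqrt(2 * np.log(n) / n_j)

def run_ucb1(m1, m2, m3, N):
    bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]  # Bandit class defined above
    data = np.empty(N)
    # Pull each machine once so that N_j > 0 before applying the UCB formula
    for j, b in enumerate(bandits):
        x = b.pull()
        b.update(x)
        data[j] = x
    for i in range(3, N):
        # Greedy with respect to the upper confidence bound, not the raw mean
        j = np.argmax([ucb(b.mean, i, b.N) for b in bandits])
        x = bandits[j].pull()
        bandits[j].update(x)
        data[i] = x
    return np.cumsum(data) / (np.arange(N) + 1)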

References