7. Bandit Algorithms

Recommender systems are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item.

k-armed bandits are one way to solve this recommendation problem. They can also be used in other similar contexts, such as clinical trials and (financial) portfolio optimization.

General concepts

The (instantaneous) regret is the difference between the $\mu_i$ of the arm we pick, versus the $\mu^{*}$ of the best arm.

The total regret is the sum of the instantaneous regrets over time. $i_1, \dots, i_T$ are the decisions we made over time.

\begin{equation} R_T = \sum_{t = 1}^T r_t = \sum_{t = 1}^T \mu^{*} - \mu_{i_t} \end{equation}

Ideally, we pick the perfect arm from the beginning, and we get a total regret of 0.

In practice, if the regret goes to zero over time, we're happy enough, and the algorithm with this propery is called no regret, even though there might be a little bit of regret.

Stochastic k-armed bandits

Like in online supervised learning, the algorithm runs over time, and at every time point we need to make a decision.

k arms to pull, each arm can win ($y = $ reward $ = 1$) with probability $\mu_i$ (unknown). Note that the reward is always constant, it's just the probability which varies.

However, unlike online learning, we only receive information about the action we choose, i.e. we only have one single "try".

In e.g. OCP, at any time point we get a new $x$, and we can try a plethora of different ways to update our model ($w$) (hypothetically) and see how good the new potential model is. With bandits, you don't know how good your choice is until you commit to it and do it (pull the arm), by which time, you can no longer change it.

$\epsilon$-greedy algorithm

At every time, pick random arm with probability $\epsilon$, and pick the current best arm otherwise.

This works surprisingly well (it's no-regret), but could be better.

Hoeffding's inequality

[...] provides an upper bound on the probability that the sum of random variables deviates from its expected value.

-- Wikipedia

Let $X_1,\dots,X_m$ be i.i.d. random variables taking values in $[0, 1]$.

The real mean $\mu = \mathbb{E}[X]$ is unknown.

We have an empirical estimate based on our $m$ trials:

\begin{equation} \hat{\mu}_m = \frac{1}{m} \sum_{i = 1}^{m} X_i \end{equation}

Then we have:

\begin{equation} P\left(\left|\mu - \hat{\mu}_m\right| \ge b\right) \le 2 \exp\left(-2b^2m \right) = \delta_m \end{equation}

That is, the probability of our operation being more than $b$ off the real value is smaller than a computable threshold.

We just want a lower bound $b$ for any given fixed probability bound. So we fix the probability bound to, say, $\delta_m$, and then compute the corresponding $b$, as a function of $\delta_m$ and $m$.

For a fixed upper probability bound, we fix $\delta_m$ and get $b = \sqrt{\frac{1}{2m} \log{\frac{2}{\delta_m}}}$.

All we need now is to decide what $\delta_m$ should be.

Now we also want to set an upper bound on the probability not just for a single $m$, but for all $m$. Want upper bound for $E_m = P(|\mu - \hat{\mu}_m| \ge b)$, i.e. a lower bound for $P(|\mu - \hat{\mu}_t| \ge b, \> \forall t)$.

So we get:

\begin{equation} \begin{aligned} P(|\mu - \hat{\mu}_t| \le b, \> \forall t) & = 1 - P(E_1 \cup E_2 \cup \dots) \\ & \ge 1 - \sum_{t=1}^{\infty} P(E_t) \\ & \ge 1 - \sum_{t=1}^{\infty} \delta_t \\ & \ge 1 - \delta \> \text{with} \> \delta < \sum_{t=1}^{\infty}\delta_t \end{aligned} \end{equation}

2nd row smaller since the sum we're subtracting is bigger (longer than the prev.). 3rd row smaller because all deltas are larger than their corresponding Es, as defined by Hoeffding's inequality itself.

We therefore want a sum of $\delta_t$ which is bounded. Setting $\delta_t = \frac{c}{t^2}$ works well, since the correspoinding sum converges, so the upper bound $\delta$ exists and is finite.

We now have a good heuristic: at any time step $t$, our upper bound should be $\delta_t = \frac{c}{t^2}$.

Recall that we want to express $b$ as a function of $\delta_t$, since $b$ is the value which actually defines our upper confidence bound.

(Note that this probability shrinks quadratically over time, so it keeps getting tighter and tighter for all arms, as we keep playing.)


All we need to do now is shove our $\delta_t$ in the formula for $b$, and we get (setting $c := 2$):

\begin{aligned} \operatorname{UCB}(i) & = \hat{\mu}_i + \sqrt{\frac{1}{n_i} \ln (2 \frac{t^2}{2}) } \\ & = \hat{\mu}_i + \sqrt{\frac{\ln{t^2}}{n_i}} \\ & = \hat{\mu}_i + \sqrt{\frac{2 \ln{t}}{n_i}} \end{aligned}

We can plug this formula right into a program! See bonus/tutorial-bandits.ipynb for an implementation.

This is an algorithm which is much smarter than $\epsilon$-greedy about what it explores.

TODO: there seems to be a "2" missing somewhere. Low priority.

It can be shown that UCB is a no-regret algorithm. ($R_T / T \rightarrow 0 \> \text{as} \> T \rightarrow \infty$)

Applications of bandit algorithms


  • Clinical trials (give the best possible cure to a patient, while also working on improving the accuracy of our diagnostics)
  • Matching markets (TODO: more info)
  • Asset pricing
  • Adaptive routing
  • Go


  • Advertising
  • Optimizing relevance (e.g. news article recommendations)
  • Scheduling web crawlers
  • Optimizing user interfaces (e.g. smart A/B testing)

Contextual bandits

Also incorporate some info about every arm and every user. Useful when e.g. recommending articles, since it takes users topic preferences into account.

We still use cummulative (contextual) regret as a metric, $R_t = \sum_{t=1}^{T}r_t$.

Can achieve sublinear regret by learning optimal mapping from contexts (e.g. (user, article features) tuples) to actions.


  • Observe context: $z_t \in \mathcal{Z}$, and, e.g. $\mathcal{Z} \subseteq \mathbb{R}^{d}$
    • Pick arm from set of possible arms, $x_t \in \mathcal{A}_t$
    • Observe reward $y_t$, which depends on the picked arm and the context, plus some possible noise: $y_t = f(x_t, z_t) + \epsilon_t$
    • Incur regret: $r_t = \max_{x \in \mathcal{A}_t}(f(x, z_t)) - f(x_t, z_t)$ (like before, difference between the best arm, given the context, and the arm we actually picked)

Linear recommendations

Want to minimize regularized square loss for

\begin{equation} \hat{w}_i = \arg \min_w \sum_{t=1}^{m} (y_t - w^T z_t)^2 + \|w\|_2^2 \end{equation}

Note: This model can take features from the user, the article, or both into account. And every article has its own $\hat{w}$. This is linear regression and it's easy to solve.

Key idea: Want to merge UCB and regression by having an upper confidence bound (UCB) for our $w$s. Ideally, just as in UCB1, this bound will shrink towards $w$ as time goes on.


\begin{aligned} \left| \> \text{estimated reward} - \text{true reward} \> \right| & \le \text{some bound} \quad \text{(with some probability)} \\ \left| \hat{w}^T_i z_t - w^T_i z_t \right| & \le \alpha\sqrt{z^T_t(D^T_i D_i + I)^{-1}z_t}, \> p \ge 1 - \delta \\ \left| \hat{w}^T_i z_t - w^T_i z_t \right| & \le \alpha\sqrt{z^T_t M_i z_t}, \> p \ge 1 - \delta \end{aligned}

This holds as long as $\alpha = 1 + \sqrt{\frac{1}{2} \ln \left( \frac{2}{\delta} \right)}$.

We set our desired probability bound, compute $\alpha$ and we have an algorithm!

Same as UCB1, but compute an arm's UCB as:

\begin{aligned} M_x \in \mathbb{R}^{d \times d}, b_x \in \mathbb{R}^{d} & \quad \text{(the arm's model)} \\ \hat{w} = M_x^{-1} b & \quad \text{(the model used for the primary payoff prediction)} \\ \operatorname{UCB}_x = \hat{w}_x^T z_t + \alpha \sqrt{z_t^T M_t^{-1} z_t} & \quad \text{(arm UCB given z)} \end{aligned}

Not storing $M_x$ and $b_x$ together because we need $M_x^{-1}$ to compute the upper confidence bound of our predicted payoff.

LinUCB is also no-regret (i.e. regret sub-linear in T).

Learning from $y_t$

If the payoff $y_t > 0$ (see rejection sampling):

  • $M_x \leftarrow M_x + z_tz_t^T$ (outer product)
  • $b_x \leftarrow b_x + y_t z_t$

Problem with linear recommendations

No shared effect modeling. We optimize every arm separately based on what users like it, but there's no way to directly exploit the fact that similar users may like similar articles.

Use hybrid models!

Hybrid LinUCB

\begin{equation} y_t = w_i^T z_t + \beta^T \phi(x_i, z_t) + \epsilon_t \end{equation}
  • $\phi(x, z)$ simply flattens (like numpy.ravel) the outer product $x_i z_t^T$.
  • $w_i$ is an arm's model
  • $\beta$ captures user-article similarity (i.e. user interests).

Can also solve this using regularized regression. We also need to compute confidence intervals for this bad boy. The algorithm is fluffy, but it works.

Practical implementation of contextual bandits

Sample case:

  • 1193 user features, 81 article features
  • We need to perform dimensionality reduction!

Extracting feature vectors

  • Data consists of triplets of form (article_features, userfeatures, reward): $D = \left{ (\phi{a,1}, \phi_{u,1}, y1), \dots, (\phi{a,n}, \phi_{u,n}, y_n) \right}$
  • Learn the model parameters $W$ with logistic regression (remember that our reward $y_i$ is either 1 or 0, e.g. click or no click). This (super) model now predicts rewards based on both article and user features. It incorporates every arm's model.
  • Want per-arm models like before
  • Set: $\psi_{a,i} = \phi^T_{a,i} W$ (vector); in effect, this splits $W$ back to per-arm models;
  • $\psi_{a,i}$ is still hugely dimensional
  • k-means cluster $\psi_{a, i}$ our arm models (i.e. over i datapoints, with $i = 1, \dots, n$)
  • Obtain $j < n$ clusters; the final article features for article $i$ are $x_{i, j} = \frac{1}{Z} \exp{\left( -\| \psi_{a, i} - \mu_j \|_2^2 \right)}, \> x_{i, j} \in \mathbb{R}^{k}$
  • i.e. compute some clusters and model articles relative to them, i.e. every article's feature is its distance to that cluster (and exp + constant, but the principle stays the same). This way we can express our articles and users using much fewer features.

Evaluating bandit algorithms

Gather data with pure exploration (random).

Learn from log using rejection sampling. Go through log and try to predict the click at every step. If we're wrong, reject the sample (ignore log line), if we're right, feed back that reward into the algorithm.

Stop when T events kept.

This is what we did in the last project!

This approach is unbiased, and the expected number of needed events are $kT$, with $k$ being the (post-dim-red) number of article features.

In general, UCB algorithms tend to perform much better than greedy algorithms when there isn't a lot of training data. And hybrid LinUCB seems to be the best. [Li et al WWW '10]

Sharing observations across users

  • Use stereotypes
  • Describe stereotypes in lower-dim space (estimate using PCA/SVD, so dim-reduction).
  • First explore in stereotype subspace, then in the full space (whose exploration is significantly more expensive). This is coarse to fine bandits.

Sets of k recommendations

  • In many cases (ads, news) want to recommend more than one thing at a time.
  • Want to choose set that's relevant to as many users as possible.
  • $\implies$ optimize relevance and diversity
  • Want to cover as many users as possible with a (limited) set of e.g. ads.
  • Every article $i$ is relevant to a set of users $S_i$. Suppose this is known.

This is a maximum (set) coverage problem

  • Define the coverage of a set $A$ of articles:
\begin{equation} F(A) = \left| \bigcup_{i \in A}S_i \right| \end{equation}
  • And we want to maximize this coverage: $\max_{|A| \le k} F(A)$
    • nr. sets A grows exponentially in $k$
    • finding the optimal A is NP-hard.
    • Let's try a greedy solution!
      • Start with $A_0$ empty, and always add the article which increases the coverage of $A$ the most.
      • Turns out, this solution is "good enough" (~63% of optimal)
      • $F(A_{\text{greedy}}) \ge \left( 1 - \frac{1}{e} \right) F(A_{\text{opt}})$
      • $F(\> \text{greedy set of size} \> l \>) \ge \left(1 - e\left( -\frac{l}{k}\right)\right) \max_{|A| \le k}F(A)$ TODO: hard to tell from prof.'s writing; double check!
      • this works because F is non-negative monotone and submodular


[EXAM] Submodularity is a property of set functions.

$F : 2^V \rightarrow \mathbb{R} \> \text{submodular} \iff \forall A \subseteq B, s \not\in B: F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$

Adding a set earlier cannot be worse than adding it later.

Marginal benefits can never increase, i.e. our delta improvement at every step only gets smaller and smaller.

Closedness: A weighted sum of submodular functions is also submodular (positive weights). (Closed under nonnegative linear combinations.)

  • Allows multi-objective optimization with weights, as long as each objective is itself submodular: $F(A) = \sum_i \lambda_i F_i(A)$ is submodular, if $F_1, \dots F_n$ are submodular

"Lazy" greedy algorithm

  • First iteration as usual.
  • Keep ordered list of marginal benefits $\Delta_i$ for every option from previous iteration (marginal benefit = increase in coverage = # new elements we would get by adding the ith set).
  • Re-evaluate $\Delta_i$ only for top element.
  • If $\Delta_i$ stays on top, use it; otherwise, re-sort.

This works because of submodularity. If $\Delta_i$ is on top, there's no way some other $\Delta_{i'}$ will "grow" in a subsequent step and overtake it. The only thing that can happen is for $\Delta_i$ itself to "drop down".

In practice, this means we can solve greedy problems with submodular objective functions really fast. Examples include sensor placement and blog recommendation selection.

General idea for recommending sets of k articles: Select article from pool. Iterations represent adding additional articles in order to maximize the user interest coverage.

  • Bandit submodular optimization: learn from observing marginal gains
  • $F_t(A_t)$ is the feedback at time $t$, given that the set of articles $A_t$ was shown.

So how do we measure user coverage for articles?

Submodular bandits

Simple abstract model

  • Given set of articles $V$, $\lvert V \rvert = n$.
  • Every round $t = 1 : T$ do:
    • $\exists$ an unknown subset of $V$ in which the user is interested: $W_t \subseteq V$
    • recommend a set of articles $A_t \subseteq V$ (how do we pick this? This is part of the challenge.)
    • if we recommended anything in which the user is interested, they click and we get a reward:
\begin{equation} F_t(A_t) = \left\{ \begin{array}{ll} 1 & \> \text{if} A_t \cap W_t \not= \varnothing \\ 0 & \> \text{otherwise} \end{array} \right. \end{equation}


  • Intialize $k$ multi-armed bandit algorithms for $k$ out of $n$ item selections.
  • In every round $t$, every bandit picks an article, and gets as feedback the difference between the reward for bandits up to, and including him, and the reward from all arms up to, but not including him (i.e. $\Delta_i$).

Can show that submodular bandits using semi-bandit feedback have sublinear regret.

SF = Submodular Function


  • Bandit algorithm for context-aware article set recommendations.
  • No-regret

In [ ]: