Recommender systems are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' a user would give to an item.
k-armed bandits are one way to solve this recommendation problem. They can also be used in other similar contexts, such as clinical trials and (financial) portfolio optimization.
The (instantaneous) regret is the difference between the $\mu^{*}$ of the best arm and the $\mu_i$ of the arm we pick.
The total regret is the sum of the instantaneous regrets over time. $i_1, \dots, i_T$ are the decisions we made over time.
\begin{equation} R_T = \sum_{t = 1}^T r_t = \sum_{t = 1}^T \mu^{*} - \mu_{i_t} \end{equation}Ideally, we pick the perfect arm from the beginning, and we get a total regret of 0.
In practice, we're happy enough if the average regret goes to zero over time; an algorithm with this property is called no-regret, even though there might still be a little bit of regret.
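As a tiny sanity check (made-up numbers, not from the lecture), here's how $R_T$ and the average regret $R_T / T$ can be computed for a recorded sequence of pulls:

```python
import numpy as np

# Hypothetical example: true (unknown) arm means and a recorded sequence of pulls.
mu = np.array([0.3, 0.5, 0.7])           # mu_i for each arm; mu* = 0.7
pulls = np.array([0, 1, 2, 2, 2, 1, 2])  # i_t: which arm was pulled at each step

inst_regret = mu.max() - mu[pulls]       # r_t = mu* - mu_{i_t}
total_regret = inst_regret.cumsum()      # R_T for every prefix
avg_regret = total_regret / np.arange(1, len(pulls) + 1)

print(total_regret[-1], avg_regret[-1])  # no-regret <=> avg_regret -> 0 as T grows
```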
Like in online supervised learning, the algorithm runs over time, and at every time point we need to make a decision.
There are $k$ arms to pull; each arm $i$ wins ($y = $ reward $ = 1$) with unknown probability $\mu_i$. Note that the reward magnitude is always the same; it's only the win probability that varies from arm to arm.
However, unlike online learning, we only receive information about the action we choose, i.e. we only have one single "try".
In e.g. OCP, at any time point we get a new $x$, and we can (hypothetically) try a plethora of different ways to update our model ($w$) and see how good each potential model would be. With bandits, we don't know how good our choice is until we commit to it and pull the arm, by which time we can no longer change it.
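To make the bandit-feedback restriction concrete, here's a minimal Bernoulli bandit environment sketch (my own illustration; the arm means are made up): pulling an arm returns only that arm's 0/1 reward and tells us nothing about the other arms.

```python
import numpy as np

class BernoulliBandit:
    """k arms; arm i pays reward 1 with (unknown) probability mu_i, else 0."""
    def __init__(self, mus, seed=0):
        self.mus = np.asarray(mus)
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # Bandit feedback: we only observe the reward of the arm we committed to.
        return int(self.rng.random() < self.mus[i])

bandit = BernoulliBandit([0.3, 0.5, 0.7])
print(bandit.pull(2))  # 1 with probability 0.7
```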
[...] provides an upper bound on the probability that the sum of random variables deviates from its expected value.
-- Wikipedia
Let $X_1,\dots,X_m$ be i.i.d. random variables taking values in $[0, 1]$.
The real mean $\mu = \mathbb{E}[X]$ is unknown.
We have an empirical estimate based on our $m$ trials:
\begin{equation} \hat{\mu}_m = \frac{1}{m} \sum_{i = 1}^{m} X_i \end{equation}Then we have:
\begin{equation} P\left(\left|\mu - \hat{\mu}_m\right| \ge b\right) \le 2 \exp\left(-2b^2m \right) = \delta_m \end{equation}That is, the probability of our estimate being more than $b$ off the real value is smaller than a computable threshold.
We want to find the deviation bound $b$ corresponding to any given fixed probability bound. So we fix the probability bound to, say, $\delta_m$, and then compute the corresponding $b$ as a function of $\delta_m$ and $m$.
For a fixed upper probability bound, we fix $\delta_m$ and get $b = \sqrt{\frac{1}{2m} \log{\frac{2}{\delta_m}}}$.
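A quick numeric sketch of this inversion (illustrative values only): fix $\delta_m$, compute $b$, and sanity-check the bound by simulation.

```python
import numpy as np

def hoeffding_b(delta_m, m):
    """Invert 2 * exp(-2 * b^2 * m) = delta_m for b."""
    return np.sqrt(np.log(2.0 / delta_m) / (2.0 * m))

m, delta_m = 100, 0.05
b = hoeffding_b(delta_m, m)

# Sanity check by simulation: Bernoulli(0.6) variables lie in [0, 1].
rng = np.random.default_rng(0)
mu = 0.6
trials = rng.random((10_000, m)) < mu          # 10k repetitions of m trials each
deviations = np.abs(trials.mean(axis=1) - mu)
print(b, (deviations >= b).mean())             # empirical frequency should be <= delta_m
```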
All we need now is to decide what $\delta_m$ should be.
Now we also want an upper bound on the failure probability not just for a single $m$, but for all $m$ simultaneously. That is, we want an upper bound on $P(E_t)$, where $E_t$ is the event $|\mu - \hat{\mu}_t| \ge b$, i.e. a lower bound on $P(|\mu - \hat{\mu}_t| \le b, \> \forall t)$.
So we get:
\begin{equation} \begin{aligned} P(|\mu - \hat{\mu}_t| \le b, \> \forall t) & = 1 - P(E_1 \cup E_2 \cup \dots) \\ & \ge 1 - \sum_{t=1}^{\infty} P(E_t) \\ & \ge 1 - \sum_{t=1}^{\infty} \delta_t \\ & \ge 1 - \delta \> \text{with} \> \delta \ge \sum_{t=1}^{\infty}\delta_t \end{aligned} \end{equation}The 2nd row is smaller by the union bound: the sum we're subtracting is at least as large as the probability of the union. The 3rd row is smaller because each $\delta_t$ upper-bounds its corresponding $P(E_t)$, by Hoeffding's inequality itself.
We therefore want the sum of the $\delta_t$ to be bounded. Setting $\delta_t = \frac{c}{t^2}$ works well, since the corresponding sum converges (to $\frac{c \pi^2}{6}$), so the upper bound $\delta$ exists and is finite.
We now have a good heuristic: at any time step $t$, our upper bound should be $\delta_t = \frac{c}{t^2}$.
Recall that we want to express $b$ as a function of $\delta_t$, since $b$ is the value which actually defines our upper confidence bound.
(Note that this probability shrinks quadratically over time, so it keeps getting tighter and tighter for all arms, as we keep playing.)
All we need to do now is shove our $\delta_t$ in the formula for $b$, and we get (setting $c := 2$):
\begin{aligned} \operatorname{UCB}(i) & = \hat{\mu}_i + \sqrt{\frac{1}{2 n_i} \ln \left( \frac{2}{\delta_t} \right) } \\ & = \hat{\mu}_i + \sqrt{\frac{\ln{t^2}}{2 n_i}} \\ & = \hat{\mu}_i + \sqrt{\frac{\ln{t}}{n_i}} \end{aligned}We can plug this formula right into a program! See bonus/tutorial-bandits.ipynb
for an implementation.
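For reference, here's a compact UCB1 sketch (reusing the hypothetical BernoulliBandit from above; the notebook's implementation may differ in details like tie-breaking and the bonus constant):

```python
import numpy as np

def ucb1(bandit, k, T):
    """UCB1 sketch for any environment with a .pull(i) -> {0, 1} method."""
    counts = np.zeros(k, dtype=int)   # n_i: how often arm i was pulled
    means = np.zeros(k)               # empirical estimates mu_hat_i
    for t in range(1, T + 1):
        if t <= k:
            i = t - 1                 # initialization: pull every arm once
        else:
            # Bonus as derived above; the classical UCB1 bonus puts a 2 under the sqrt.
            i = int(np.argmax(means + np.sqrt(np.log(t) / counts)))
        r = bandit.pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
    return counts, means

counts, means = ucb1(BernoulliBandit([0.3, 0.5, 0.7]), k=3, T=2000)
print(counts, means)   # the best arm should dominate the pull counts
```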
This is an algorithm which is much smarter than $\epsilon$-greedy about what it explores.
Compared to the classical UCB1 bonus $\sqrt{\frac{2 \ln t}{n_i}}$, a factor of 2 is missing under the square root; it reappears if we choose $\delta_t = \frac{2}{t^4}$ instead of $\frac{2}{t^2}$, since then $\sqrt{\frac{1}{2 n_i} \ln t^4} = \sqrt{\frac{2 \ln t}{n_i}}$.
It can be shown that UCB is a no-regret algorithm. ($R_T / T \rightarrow 0 \> \text{as} \> T \rightarrow \infty$)
Non-DM:
DM:
Contextual bandits also incorporate information about every arm and every user. This is useful when e.g. recommending articles, since it takes users' topic preferences into account.
We still use the cumulative (contextual) regret as a metric, $R_T = \sum_{t=1}^{T} r_t$.
Can achieve sublinear regret by learning optimal mapping from contexts (e.g. (user, article features) tuples) to actions.
We want to minimize the regularized squared loss for each arm's weights:
\begin{equation} \hat{w}_i = \arg \min_w \sum_{t=1}^{m} (y_t - w^T z_t)^2 + \|w\|_2^2 \end{equation}Note: This model can take features from the user, the article, or both into account. And every article has its own $\hat{w}$. This is linear regression and it's easy to solve.
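In closed form, one arm's weights can be computed like this (a sketch with made-up variable names; Z holds the observed feature vectors $z_t$ as rows, y the corresponding rewards):

```python
import numpy as np

def ridge_weights(Z, y, lam=1.0):
    """w_hat = argmin_w sum_t (y_t - w^T z_t)^2 + lam * ||w||^2 (lam = 1 in the notes)."""
    d = Z.shape[1]
    M = Z.T @ Z + lam * np.eye(d)      # d x d
    b = Z.T @ y                        # d
    return np.linalg.solve(M, b)

# Tiny made-up example: 5 observations, 3 features.
rng = np.random.default_rng(0)
Z = rng.random((5, 3))
y = rng.random(5)
print(ridge_weights(Z, y))
```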
Key idea: Want to merge UCB and regression by having an upper confidence bound (UCB) for our $w$s. Ideally, just as in UCB1, this bound will shrink towards $w$ as time goes on.
This is LinUCB. [CHEATSHEET]
This holds as long as $\alpha = 1 + \sqrt{\frac{1}{2} \ln \left( \frac{2}{\delta} \right)}$.
We set our desired probability bound, compute $\alpha$ and we have an algorithm!
Same as UCB1, but compute an arm's UCB as:
\begin{aligned} M_x \in \mathbb{R}^{d \times d}, b_x \in \mathbb{R}^{d} & \quad \text{(the arm's model)} \\ \hat{w}_x = M_x^{-1} b_x & \quad \text{(the model used for the primary payoff prediction)} \\ \operatorname{UCB}_x = \hat{w}_x^T z_t + \alpha \sqrt{z_t^T M_x^{-1} z_t} & \quad \text{(arm UCB given $z_t$)} \end{aligned}We store $M_x$ and $b_x$ separately (rather than just $\hat{w}_x$) because we also need $M_x^{-1}$ to compute the confidence width of our predicted payoff.
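Putting the pieces together, a per-arm LinUCB sketch in the same notation (my own condensation, not a reference implementation; the context $z_t$ and $\delta$ below are placeholder values):

```python
import numpy as np

class LinUCBArm:
    def __init__(self, d, alpha):
        self.M = np.eye(d)      # M_x, initialized to the identity (ridge regularizer)
        self.b = np.zeros(d)    # b_x
        self.alpha = alpha

    def ucb(self, z):
        Minv = np.linalg.inv(self.M)
        w = Minv @ self.b                                   # w_hat_x = M_x^{-1} b_x
        return w @ z + self.alpha * np.sqrt(z @ Minv @ z)   # payoff estimate + confidence width

    def update(self, z, reward):
        self.M += np.outer(z, z)
        self.b += reward * z

# Choosing among arms for a context z_t: pick the arm with the largest UCB.
d, alpha = 4, 1 + np.sqrt(0.5 * np.log(2 / 0.05))   # delta = 0.05 (example value)
arms = [LinUCBArm(d, alpha) for _ in range(3)]
z_t = np.ones(d) / d
best = max(range(len(arms)), key=lambda i: arms[i].ucb(z_t))
print(best)
```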
LinUCB is also no-regret (i.e. regret sub-linear in T).
Hybrid LinUCB: combine arm and user features by flattening (e.g. via numpy.ravel) the outer product $x_i z_t^T$. This can also be solved using regularized regression; we just need to compute confidence intervals for this bad boy as well. The algorithm is fluffy, but it works.
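The flattening step, concretely (illustrative shapes only):

```python
import numpy as np

x_i = np.array([1.0, 0.0, 2.0])        # arm/article features, shape (d_x,)
z_t = np.array([0.5, 0.5])             # user/context features, shape (d_z,)

phi = np.outer(x_i, z_t).ravel()       # flattened outer product x_i z_t^T, shape (d_x * d_z,)
print(phi.shape, phi)
```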
Sample case:
Gather data with pure exploration (random).
Learn from the log using rejection sampling: go through the log and, at every step, check whether our algorithm would have picked the logged article. If not, reject the sample (ignore the log line); if it matches, feed the observed click back into the algorithm as its reward.
Stop when T events kept.
This is what we did in the last project!
This approach is unbiased, and the expected number of logged events needed is $kT$, with $k$ being the number of arms (articles in the pool).
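A sketch of the rejection-sampling evaluator (the log format and the policy.recommend / policy.update interface are assumptions; unbiasedness relies on the logged articles having been chosen uniformly at random):

```python
def evaluate_on_log(policy, log, T):
    """log: iterable of (context, logged_article, click) from a uniformly random logger."""
    kept, total_reward = 0, 0
    for context, logged_article, click in log:
        if policy.recommend(context) != logged_article:
            continue                                       # mismatch with the log line: reject it
        policy.update(context, logged_article, click)      # feed the observed reward back
        total_reward += click
        kept += 1
        if kept == T:
            break
    return total_reward / max(kept, 1)                     # estimated click-through rate on kept events
```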
In general, UCB algorithms tend to perform much better than greedy algorithms when there isn't a lot of training data. And hybrid LinUCB seems to be the best. [Li et al WWW '10]
[EXAM] Submodularity is a property of set functions.
$F : 2^V \rightarrow \mathbb{R} \> \text{submodular} \iff \forall A \subseteq B, s \not\in B: F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$
Adding an element earlier cannot be worse than adding it later.
Marginal benefits can never increase, i.e. our delta improvement at every step only gets smaller and smaller.
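A quick illustration with a coverage function, which is a classic submodular example (the topic sets are made up): $F(A)$ = number of distinct topics covered by the articles in $A$; the marginal gain of a new article never increases as the set grows.

```python
# F(A) = number of distinct topics covered by the articles in A (a coverage function).
topics = {
    "a1": {"politics", "economy"},
    "a2": {"economy", "sports"},
    "a3": {"sports"},
}

def F(A):
    return len(set().union(*[topics[a] for a in A])) if A else 0

def gain(A, s):
    return F(A | {s}) - F(A)

A, B = {"a1"}, {"a1", "a2"}          # A is a subset of B, and s = "a3" is in neither
print(gain(A, "a3"), gain(B, "a3"))  # 1 >= 0: adding earlier is at least as good
```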
Closedness: a nonnegative weighted sum of submodular functions is also submodular, i.e. submodular functions are closed under nonnegative linear combinations.
This works because of submodularity. If $\Delta_i$ is on top, there's no way some other $\Delta_{i'}$ will "grow" in a subsequent step and overtake it. The only thing that can happen is for $\Delta_i$ itself to "drop down".
In practice, this means we can run greedy maximization of submodular objective functions really fast (lazy evaluation; see the sketch below). Examples include sensor placement and blog recommendation selection.
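A lazy greedy sketch that exploits exactly this argument (reusing the coverage function $F$ from above; stale marginal gains in the heap act as optimistic upper bounds):

```python
import heapq

def lazy_greedy(F, V, k):
    """Pick k elements greedily; stale gains stay valid upper bounds thanks to submodularity."""
    S = set()
    # Max-heap of (negated) marginal gains, initially computed w.r.t. the empty set.
    heap = [(-(F({v}) - F(set())), v) for v in V]
    heapq.heapify(heap)
    while len(S) < k and heap:
        neg_gain, v = heapq.heappop(heap)
        fresh = F(S | {v}) - F(S)                  # re-evaluate only the top candidate
        if not heap or fresh >= -heap[0][0]:
            S.add(v)                               # still on top: safe to take it
        else:
            heapq.heappush(heap, (-fresh, v))      # dropped down: push back with the fresh gain
    return S

print(lazy_greedy(F, list(topics), k=2))           # e.g. {'a1', 'a2'} covers all three topics
```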
General idea for recommending sets of k articles: select an article from the pool; each iteration adds another article so as to maximize the coverage of user interests.
So how do we measure user coverage for articles?
Can show that submodular bandits using semi-bandit feedback have sublinear regret.
SF = Submodular Function