Reinforcement learning

  • State
  • Action
  • Reward

Multi-armed Bandits

Methods

  • $\epsilon$-greedy
  • UCB

    "Optimism in the Face of Uncertainty": $$ a_t = \arg\max_{a \in A} \left\{ \hat{r}_a + CB_a \right\} $$ UCB1 uses the confidence bound $CB_a = \sqrt{2 \log t / n_a}$, where $n_a$ is the number of times arm $a$ has been pulled so far: $$ a_t = \arg\max_{a \in A} \left\{ \hat{r}_a + \sqrt{\frac{2 \log t}{n_a}} \right\} $$

  • EXP3

    Auer, Peter, et al. "Gambling in a rigged casino: The adversarial multi-armed bandit problem." Proceedings of the 36th Annual Symposium on Foundations of Computer Science, IEEE, 1995.

  • Thompson sampling: maintain a posterior over each arm's reward, $$p_a(r) = p(r \mid a),$$ draw a sample $\hat{r}_a \sim p_a$ for every arm, and play the arm with the highest sample.

    Chapelle, Olivier, and Lihong Li. "An Empirical Evaluation of Thompson Sampling." Advances in Neural Information Processing Systems. 2011.
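The $\epsilon$-greedy and UCB1 rules above can be sketched in Python. A minimal sketch: the Bernoulli test bed (arm means 0.2/0.5/0.8), the horizon, and the seed are illustrative choices, not from the notes.

```python
import math
import random

def epsilon_greedy(pulls, sums, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise exploit the best empirical mean."""
    if rng.random() < epsilon or 0 in pulls:
        return rng.randrange(len(pulls))
    return max(range(len(pulls)), key=lambda a: sums[a] / pulls[a])

def ucb1(pulls, sums, t):
    """UCB1: empirical mean plus the sqrt(2 log t / n_a) confidence bonus."""
    for a, n in enumerate(pulls):
        if n == 0:
            return a  # pull every arm once before the bonus is defined
    return max(range(len(pulls)),
               key=lambda a: sums[a] / pulls[a] + math.sqrt(2 * math.log(t) / pulls[a]))

def run_ucb1(probs, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    pulls = [0] * len(probs)
    sums = [0.0] * len(probs)
    for t in range(1, horizon + 1):
        a = ucb1(pulls, sums, t)
        r = 1.0 if rng.random() < probs[a] else 0.0
        pulls[a] += 1
        sums[a] += r
    return pulls

pulls = run_ucb1([0.2, 0.8, 0.5][:3], horizon=2000)
```

By the end of the run, most pulls go to the arm with the highest mean, while each suboptimal arm is pulled only $O(\log t)$ times.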
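EXP3 (from the Auer et al. reference above) maintains exponential weights over arms and mixes in uniform exploration; a minimal sketch, where the two-arm Bernoulli environment, $\gamma = 0.1$, and the horizon are illustrative assumptions:

```python
import math
import random

def exp3_probs(weights, gamma):
    """Mix the normalized weights with uniform exploration at rate gamma."""
    K = len(weights)
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / K for w in weights]

def run_exp3(arm_means, horizon, gamma=0.1, seed=0):
    """Run EXP3 on Bernoulli arms; return the final weight vector."""
    rng = random.Random(seed)
    K = len(arm_means)
    weights = [1.0] * K
    for _ in range(horizon):
        probs = exp3_probs(weights, gamma)
        a = rng.choices(range(K), weights=probs)[0]
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        est = r / probs[a]  # importance-weighted, unbiased reward estimate
        weights[a] *= math.exp(gamma * est / K)
    return weights

weights = run_exp3([0.2, 0.8], horizon=1000)
```

The importance weighting `r / probs[a]` keeps the reward estimate unbiased even though only the played arm's reward is observed, which is what makes the method work in the adversarial setting.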

Example: A/B testing
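For A/B testing with binary conversions, Thompson sampling keeps a Beta posterior per variant; a minimal sketch, where the conversion rates (0.1 vs 0.3), the horizon, and the uniform Beta(1, 1) priors are illustrative assumptions:

```python
import random

def thompson_arm(successes, failures, rng):
    """Sample a conversion rate from each variant's Beta posterior
    (Beta(1, 1) prior) and pick the variant with the highest sample."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_ab_test(rates, horizon, seed=0):
    """Simulate an A/B test; return how much traffic each variant received."""
    rng = random.Random(seed)
    succ = [0] * len(rates)
    fail = [0] * len(rates)
    for _ in range(horizon):
        a = thompson_arm(succ, fail, rng)
        if rng.random() < rates[a]:
            succ[a] += 1
        else:
            fail[a] += 1
    return [s + f for s, f in zip(succ, fail)]

traffic = run_ab_test([0.1, 0.3], horizon=2000)
```

Unlike a fixed 50/50 split, the posterior sampling gradually routes most traffic to the better-converting variant while it is still collecting evidence.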

Contextual Multi-armed Bandits

$x$ is the context (a feature vector observed before choosing the arm): $$ a_t(x) = \arg\max_{a \in A} \left\{ \hat{r}(x, w_a) + CB_a(x) \right\} $$

$$ \hat{r}(x, w_a) = \begin{cases} x \cdot w_a, & \text{linear} \\ (1 + \exp{(-x \cdot w_a)})^{-1}, & \text{logit} \\ \Phi(x \cdot w_a), & \text{probit} \end{cases} $$

Li, Lihong, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." Proceedings of the 19th International Conference on World Wide Web, ACM, 2010.

Example: Ads
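The contextual rule above with the linear $\hat{r}(x, w_a)$ and a per-arm confidence bound is what LinUCB (Li et al., 2010) implements. A sketch of the disjoint-models variant; the 2-D environment, the value of $\alpha$, and the training loop are illustrative assumptions:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm, scored as
    x . theta_a + alpha * sqrt(x^T A_a^{-1} x)."""
    def __init__(self, n_arms, dim, alpha=0.5):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # Gram matrix + ridge term
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted contexts

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # per-arm estimate of w_a
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += r * x

# Hypothetical ad environment: the reward of arm a is x . w_a plus noise,
# so which ad is best depends on the context x.
rng = np.random.default_rng(0)
true_w = np.array([[1.0, 0.0], [0.0, 1.0]])
bandit = LinUCB(n_arms=2, dim=2)
for _ in range(500):
    x = rng.random(2)
    a = bandit.choose(x)
    bandit.update(a, x, x @ true_w[a] + 0.1 * rng.normal())
```

After training, the learned per-arm weights track `true_w`, so the chosen ad switches with the context: contexts dominated by the first feature favor arm 0, the rest favor arm 1.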

Markov Decision Process