UCB
"Optimism in the Face of Uncertainty" $$ a_t = argmax_{a \in A} \left\{ \hat{r}_a + CB_a\right\}$$ UCB1: $$ a_t = argmax_{a \in A} \left\{ \hat{r}_a + \sqrt{2\log{\frac{t}{n_a}}} \right\}$$
Auer, Peter, et al. "Gambling in a rigged casino: The adversarial multi-armed bandit problem." Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995.
Thompson sampling $$p_a(r) = p(r|a)$$
Chapelle, Olivier, and Lee Hong Li. An empirical evaluaTIon of Thompson sampling. Advances in Neural Information Processing Systems. 2011
$x$ - context $$ a_t(x) = argmax_{a \in A} \left\{ \hat{r}(x, w_a) + CB_a(x)\right\}$$
$$ \hat{r}(x, w_a) = \begin{cases} x \cdot w_a, \> linear \\ (1 + \exp{(-x \cdot w_a)})^{-1}, \> logit \\ \Phi(x \cdot w_a), \> probit \end{cases} $$Li, Lihong, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation" Proceedings of the 19th international conference on World wide web, ACM 2010