Off-policy evaluation for slate recommendation

https://arxiv.org/pdf/1605.04812.pdf

  • Combinatorial bandits
  • Uses log data to estimate policy performance
  • Policy -> content-serving algorithm
  • Inverse propensity scoring (IPS) requires lots of logged data to give accurate estimates (see the sketch after this list)
  • Challenge -> the number of possible lists (i.e. slates) is combinatorially large, so the target policy rarely chooses the same slates that appear in the collected data
  • Previous work
    • (1) restricts attention to a small slate space to increase the chance of matching
    • (2) uses parametric models of the reward
    • They need less data but can have large bias, so they still require long A/B testing to validate
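
A minimal sketch of whole-slate IPS to make the data-hunger concrete (the log tuple layout and the names `target_policy` and `logging_prob` are my own assumptions, not the paper's notation): each logged reward is reweighted by how likely the target policy is to show the exact same slate, and with combinatorially many slates that weight is near zero almost everywhere, so the estimator needs enormous amounts of data to be useful.

```python
import numpy as np

def slate_ips_estimate(logs, target_policy):
    """Whole-slate IPS sketch.

    logs: iterable of (context, slate, reward, logging_prob) tuples, where
          logging_prob = P(slate | context) under the data-collecting policy.
    target_policy(context, slate) -> P(slate | context) under the policy
          we want to evaluate.
    """
    weighted_rewards = []
    for context, slate, reward, logging_prob in logs:
        weight = target_policy(context, slate) / logging_prob
        weighted_rewards.append(weight * reward)
    # Unbiased, but when the target policy rarely picks the logged slates the
    # weights are ~0 almost everywhere (and huge elsewhere), so variance blows up.
    return float(np.mean(weighted_rewards))
```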

Combinatorial contextual bandits formulation

  • For each context -> the policy selects a slate consisting of component actions (one per position)
  • Reward is observed for the entire slate (e.g. time to success, NDCG)
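
A small sketch of this formulation (field names and the numbers below are illustrative, not from the paper): each logged interaction pairs a context with one component action per position and a single slate-level reward, and even a modest catalogue makes the slate space astronomically large.

```python
from dataclasses import dataclass
from math import perm
from typing import Tuple

@dataclass
class SlateInteraction:
    context: dict            # e.g. user/query features
    slate: Tuple[int, ...]   # one component action (item id) per position
    reward: float            # single slate-level signal, e.g. NDCG or time to success

# Size of the slate space: k candidate items ranked into m ordered positions,
# no repeats -> k * (k-1) * ... * (k-m+1) possible slates.
k, m = 100, 10
print(f"{perm(k, m):.2e}")   # ~6.3e+19, so exact slate matches in the logs are rare
```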

On vs. off policy

StackExchange:

The difference between off-policy and on-policy methods is that with the former you do not need to follow any specific policy: your agent could even behave randomly, and despite this, off-policy methods can still find the optimal policy. On-policy methods, on the other hand, depend on the policy being followed. In the case of Q-learning, which is off-policy, it will find the optimal policy independently of the policy used during exploration; however, this is only true if you visit the different states enough times.

Similar answer from the same post:

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the **greedy action a′**. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the **current policy's action a″**. It estimates the return for state-action pairs assuming the current policy continues to be followed.
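
A minimal tabular sketch of the two update rules the answers contrast (array-indexed Q table and toy hyperparameters are my own choices): Q-learning bootstraps from the greedy action in s′, while SARSA bootstraps from the action the current policy actually takes in s′.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses max_a' Q(s', a'), the greedy action,
    regardless of what the behaviour policy will actually do in s'."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses Q(s', a') for the action a' that the
    current (e.g. epsilon-greedy) policy actually selected in s'."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# Toy usage: 5 states, 3 actions
Q = np.zeros((5, 3))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```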

More reading