"Model Free Control", David Silver lecture 5

Model-Free Reinforcement Learning

Last lecture:

  • model-free prediction
  • estimate the value function of an unknown MDP

This lecture:

  • model-free control
  • optimize the value function

Uses of model-free control

Used in cases where either:

  • MDP model is unknown, but experience can be sampled, or
  • MDP model is known, but is too big to use, except by samples

Model-free control can solve these problems

On and Off-Policy Learning

On-policy learning:

  • "learn on the job"
  • learn about policy $\pi$ from experience sampled from $\pi$

Off-policy learning:

  • "look over someone's shoulder"
  • learn about policy $\pi$ from experience sampled from $\mu$

Generalized policy iteration (refresher)

Policy evaluation: estimate $v_\pi$

  • eg iterative policy evaluation

Policy improvement: generate $\pi' \ge \pi$

  • eg greedy policy improvement

Generalized policy iteration with Monte-Carlo evaluation

Policy evaluation: Monte-Carlo policy evaluation, $V = v_\pi$?

Policy improvement: greedy policy improvement?

Model-free policy iteration using action-value function

  • Greedy policy improvement over $V(s)$ requires model of MDP
$$\def\argmax{\text{argmax}} \pi'(s) = \argmax_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a V(s') \right) $$
  • greedy policy improvement over $Q(s, a)$ is model-free
$$ \pi'(s) = \argmax_{a \in \mathcal{A}} Q(s,a) $$
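
As a concrete illustration (my own sketch, not from the lecture), the improvement step over a tabular $Q$ needs nothing beyond the table itself; the `greedy_policy` helper and the toy $Q$ values are made up:

```python
import numpy as np

def greedy_policy(Q):
    """Greedy policy improvement over a tabular Q: pi'(s) = argmax_a Q(s, a).

    Q is an (n_states, n_actions) array; no reward or transition model is needed.
    """
    return np.argmax(Q, axis=1)

# Toy example: 3 states, 2 actions.
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 3.5]])
print(greedy_policy(Q))  # -> [1 0 1]
```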

Generalized policy iteration with action-value function

Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$

Policy improvement: greedy policy improvement?

$\epsilon$-Greedy exploration

  • simplest idea for ensuring continual exploration
  • all $m$ actions are tried with non-zero probability
  • with probability $1 - \epsilon$ choose the greedy action
  • with probability $\epsilon$ choose an action at random
$$ \pi(a \mid s) = \begin{cases}\epsilon/m + 1 - \epsilon & \text{if }a = a^* = \argmax_{a' \in \mathcal{A}} Q(s,a') \\ \epsilon / m & \text{otherwise} \end{cases} $$
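
A minimal Python sketch of this rule (function and variable names are my own):

```python
import numpy as np

def epsilon_greedy_action(Q_s, epsilon, rng):
    """Sample an action from the epsilon-greedy distribution over Q(s, .).

    With probability 1 - epsilon pick the greedy action; with probability
    epsilon pick uniformly among all m actions, so the greedy action ends up
    with total probability epsilon/m + 1 - epsilon.
    """
    m = len(Q_s)
    if rng.random() < epsilon:
        return int(rng.integers(m))      # exploratory: uniform over all actions
    return int(np.argmax(Q_s))           # greedy action

rng = np.random.default_rng(0)
Q_s = np.array([0.1, 0.7, 0.2])
actions = [epsilon_greedy_action(Q_s, 0.1, rng) for _ in range(1000)]
print(np.bincount(actions, minlength=3) / 1000)  # roughly [0.03, 0.93, 0.03]
```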

$\epsilon$-Greedy policy improvement

Theorem For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, $v_{\pi'}(s) \ge v_\pi(s)$

$$ v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s] \\ v_{\pi'}(s) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_{\pi'}(S_{t+1}, A_{t+1}) \mid S_t = s] $$
$$ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) $$
$$ v_{\pi'}(s) = \sum_{a \in \mathcal{A}} \pi'(a \mid s) \, q_{\pi'}(s, a) $$
$$ q_\pi(s, \pi'(s)) = \sum_{a \in \mathcal{A}}\left( \pi'(a \mid s)\, q_\pi(s, a) \right) $$
$$ = \sum_{a \in \mathcal{A}} \left( \frac{\epsilon}{m} q_\pi(s, a) \right) + (1 - \epsilon)\, q_\pi(s, \argmax_{a} q_\pi(s, a)) $$
$$ = \sum_{a \in \mathcal{A}} \left( \frac{\epsilon}{m} q_\pi(s, a) \right) + (1 - \epsilon)\, \max_{a \in \mathcal{A}} q_\pi(s, a) $$

The max over all actions must be at least as large as any weighted average

$$ \sum_{a \in \mathcal{A}} p(a \mid s)\, q_\pi(s, a) $$

...where the weights $p(a \mid s)$ are non-negative and sum to one.

And we ideally want to form something that combines with the first term, $\sum_{a} (\epsilon/m)\, q_\pi(s,a)$, to sum to:

$$ \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_\pi(s,a) $$

...which would mean that, for each action, the second term should contribute a weight of:

$$ \pi(a \mid s) - \epsilon/m $$

... which means that the weight standing in for the max term, bearing in mind the factor of $(1-\epsilon)$, should ideally be:

$$ \frac{1}{1-\epsilon}(\pi(a \mid s) - \epsilon/m) $$

Does this sum to 1? If we sum this over all $m$ actions, the sum will be:

$$ \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon} $$

We have:

$$ \sum_{a \in \mathcal{A}} \pi(a \mid s) = 1 $$

... because we are summing over a probability distribution.

Similarly, because there are $m$ values for $a$, we have:

$$ \sum_{a \in \mathcal{A}} \epsilon/m = m \cdot \epsilon/m = \epsilon $$

Therefore:

$$ \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon} = \frac{1 - \epsilon} {1 - \epsilon} = 1 $$

... as required

We also have:

$$ v_\pi(s) = q_\pi(s, \pi(s)) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) $$

Meanwhile:

$$ v_{\pi'}(s) = q_{\pi'}(s, \pi'(s)) = \sum_{a \in \mathcal{A}} \pi'(a \mid s)\, q_{\pi'}(s, a) $$

Let's follow the $\epsilon$-greedy policy $\pi'$ for just the first action, and then follow $\pi$ (whose value is $q_\pi$) thereafter, i.e. we want to examine:

$$ q_\pi(s, \pi'(s)) $$

...where $\pi'(s)$ is the $\epsilon$-greedy policy

$$ q_\pi(s, \pi'(s)) = \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) $$
$$ \ge \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon) \sum_{a \in \mathcal{A}} \left( \frac{1}{1 - \epsilon} (\pi(a \mid s) - \epsilon/m) \, q_\pi(s, a) \right) $$
$$ = \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + \sum_{a \in \mathcal{A}} (\pi(a \mid s) - \epsilon/m)\, q_\pi(s, a) $$
$$ = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_\pi(s, a) $$
$$ = v_\pi(s) $$

Therefore:

$$ q_\pi(s, \pi'(s)) \ge v_\pi(s) $$

... and so, by the policy improvement theorem, $v_{\pi'}(s) \ge v_\pi(s)$, as claimed.
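
As an informal numerical sanity check (mine, not from the lecture): for arbitrary action-values and any $\epsilon$-greedy distribution $\pi$, the quantity $\epsilon/m \sum_a q + (1-\epsilon)\max_a q$ should never fall below the weighted average $\sum_a \pi(a \mid s)\, q(s,a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, epsilon = 4, 0.3

for _ in range(1000):
    q = rng.normal(size=m)                   # arbitrary q_pi(s, .)
    # Any epsilon-greedy pi: epsilon/m everywhere, plus 1 - epsilon on one action.
    pi = np.full(m, epsilon / m)
    pi[rng.integers(m)] += 1 - epsilon
    v_pi = pi @ q                            # sum_a pi(a|s) q(s,a)
    # pi' is epsilon-greedy with respect to q itself.
    q_s_pi_prime = (epsilon / m) * q.sum() + (1 - epsilon) * q.max()
    assert q_s_pi_prime >= v_pi - 1e-12
print("q_pi(s, pi'(s)) >= v_pi(s) held in all sampled cases")
```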

GLIE

Definition Greedy in the Limit with Infinite Exploration (GLIE)

  • all state-action pairs are explored infinitely many times,
$$ \lim_{k \rightarrow \infty} N_k(s, a) = \infty $$
  • the policy converges on a greedy policy,
$$ \lim_{k \rightarrow \infty} \pi_k(a \mid s) = \mathbf{1}(a = \argmax_{a' \in \mathcal{A}} Q_k(s, a')) $$
  • for example, $\epsilon$-greedy is GLIE if $\epsilon$ reduces to zero at $\epsilon_k = 1/k$

GLIE Monte-Carlo Control

  • sample $k$th episode using $\pi: \{S_1, A_1, R_2, \dots, S_T \} \sim \pi$
  • for each state $S_t$ and action $A_t$ in the episode,
$$ N(S_t, A_t) \leftarrow N(S_t, A_t) + 1 \\ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} (G_t - Q(S_t, A_t)) $$
  • improve policy based on new action-value function:
$$ \epsilon \leftarrow 1/k \\ \pi \leftarrow \epsilon\text{-greedy}(Q) $$

Theorem GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow q_*(s,a)$
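
A sketch of this loop in Python. The environment interface (gym-style `reset()`/`step()` returning `(state, reward, done, info)`, plus an `n_actions` attribute) and all helper names are assumptions of mine, not part of the lecture:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control with every-visit updates and epsilon_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))
    N = defaultdict(lambda: np.zeros(env.n_actions))

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                        # GLIE schedule
        # Sample the kth episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(env.n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Incremental every-visit MC update of Q towards the return G_t.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```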

MC vs TD Control

  • temporal-difference (TD) learning has several advantages over Monte-Carlo (MC):
    • lower variance
    • online
    • incomplete sequences
  • natural idea: use TD instead of MC in our control loop
    • apply TD to $Q(S,A)$
    • use $\epsilon$-greedy policy improvement
    • update every time-step

Update action-value functions with Sarsa

$$ Q(S,A) \leftarrow Q(S,A) + \alpha\left( R + \gamma Q(S', A') - Q(S,A) \right) $$
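
Putting the pieces together, a sketch of the resulting on-policy control loop (Sarsa update every time-step, $\epsilon$-greedy improvement), under the same assumed environment interface as the Monte-Carlo sketch above:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy TD control: update Q(S,A) towards R + gamma * Q(S',A') every step."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    def eps_greedy(state):
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = eps_greedy(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = eps_greedy(next_state)
            # Bootstrap from the action actually taken next (on-policy).
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```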


Off-policy learning

  • evaluate target policy $\pi(a \mid s)$ to compute $v_\pi(s)$ or $q_\pi(s,a)$
  • while following behavior policy $\mu(a \mid s)$ $$ \{S_1, A_1, R_2, \dots, S_T \} \sim \mu $$
  • why is this important?
    • learn from observing humans or other agents
    • re-use experience generated from old policies $\pi_1, \pi_2, \dots, \pi_{t-1}$
    • learn about optimal policy while following exploratory policy
    • learn about multiple policies while following one policy

Importance Sampling for Off-Policy TD

  • use TD targets generated from $\mu$ to evaluate $\pi$
  • weight TD target $R + \gamma V(S')$ by importance sampling
  • only need a single importance sampling correction (see the sketch after this list)
$$ V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right) $$
  • much lower variance than Monte-Carlo importance sampling
  • policies only need to be similar over a single step
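
A sketch of this single-correction update in Python; here `V` is a dict of state values, and `pi`/`mu` are assumed to map a state to a probability vector over actions (all names are mine):

```python
def off_policy_td_step(V, state, action, reward, next_state, pi, mu,
                       alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update of V towards the importance-weighted TD target.

    pi and mu are the target and behaviour policies; the transition
    (state, action, reward, next_state) was generated by mu.
    """
    ratio = pi(state)[action] / mu(state)[action]   # single-step IS correction
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (ratio * td_target - V[state])
    return V
```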

Q-learning

  • we now consider off-policy learning of action-values $Q(s,a)$
  • no importance sampling is required
  • next action is chosen using behavior policy $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
  • but we consider alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$
  • and update $Q(S_t, A_t)$ towards value of alternative action
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)) $$
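
The same update as a Python sketch, for a general (not necessarily greedy) target policy $\pi$; the policy representation follows the same assumption as the importance-sampling sketch above:

```python
import numpy as np

def q_learning_step(Q, state, action, reward, next_state, pi, rng,
                    alpha=0.1, gamma=1.0):
    """Off-policy update of Q(S_t, A_t) towards R + gamma * Q(S_{t+1}, A'),
    where the alternative action A' is sampled from the target policy pi
    (the action actually executed next comes from the behaviour policy mu)."""
    a_prime = rng.choice(len(Q[next_state]), p=pi(next_state))  # A' ~ pi(.|S_{t+1})
    target = reward + gamma * Q[next_state][a_prime]
    Q[state][action] += alpha * (target - Q[state][action])
    return Q
```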

Off-Policy control with Q-learning

  • we now allow both behavior and target policies to improve
  • the target policy $\pi$ is greedy wrt $Q(s,a)$
$$ \pi(S_{t+1}) = \argmax_{a'} Q(S_{t+1}, a') $$
  • the behavior policy $\mu$ is eg $\epsilon$-greedy wrt $Q(s,a)$
  • the Q-learning target then simplifies:
$$ R_{t+1} + \gamma Q(S_{t+1}, A') \\ = R_{t+1} + \gamma Q(S_{t+1}, \argmax_{a'} Q(S_{t+1}, a')) \\ = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') $$
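
With the greedy target this becomes the familiar Q-learning control loop; a sketch in the same style as the Sarsa one above (same assumed environment interface):

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Off-policy control: behave epsilon-greedily, bootstrap from the greedy (max) action."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy mu: epsilon-greedy with respect to the current Q.
            if rng.random() < epsilon:
                action = int(rng.integers(env.n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Target policy pi: greedy, so the target uses max_a' Q(S', a').
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```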

Relationship between DP and TD

| Full Backup (DP) | Sample Backup (TD) |
| --- | --- |
| Iterative policy evaluation: $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD learning: $V(S) \leftarrow_\alpha R + \gamma V(S')$ |
| Q-policy iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma Q(S', A') \mid s,a]$ | Sarsa: $Q(S,A) \leftarrow_\alpha R + \gamma Q(S', A')$ |
| Q-value iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma \max_{a' \in \mathcal{A}} Q(S', a') \mid s,a]$ | Q-learning: $Q(S,A) \leftarrow_\alpha R + \gamma \max_{a' \in \mathcal{A}} Q(S', a')$ |

...where $x \leftarrow_\alpha y$ is shorthand for $x \leftarrow x + \alpha (y - x)$.