Model-Free Reinforcement Learning
Last lecture: model-free prediction (estimate the value function of an unknown MDP).
This lecture: model-free control (optimise the value function of an unknown MDP).
Uses of model-free control
Used in cases where either the MDP model is unknown but experience can be sampled, or the model is known but too big to use directly, except by sampling.
Model-free control can solve these problems
On and Off-Policy Learning
On-policy learning: learn about policy $\pi$ from experience sampled from $\pi$ itself.
Off-policy learning: learn about policy $\pi$ from experience sampled from a different behaviour policy $\mu$.
Generalized policy iteration (refresher)
Policy evaluation: estimate $v_\pi$
Policy improvement: generate $\pi' \ge \pi$
Generalized policy iteration with Monte-Carlo evaluation
Policy evaluation: Monte-Carlo policy evaluation, $V = v_\pi$?
Policy improvement: greedy policy improvement?
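One issue with this scheme (and the motivation for the next section): greedy policy improvement over $V(s)$ requires a model of the MDP, whereas greedy improvement over $Q(s,a)$ is model-free. As a sketch, using the usual reward/transition notation $\mathcal{R}^a_s$, $\mathcal{P}^a_{ss'}$ (not defined in these notes):
$$ \pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V(s') \right) \qquad \text{vs.} \qquad \pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(s, a) $$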
Model-free policy iteration using action-value function
Generalized policy iteration with action-value function
Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
Policy improvement: greedy policy improvement?
$\epsilon$-Greedy exploration
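A minimal sketch of $\epsilon$-greedy action selection over a tabular action-value array (illustrative code, not from the lecture; the array shape and function name are assumptions):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """Pick an action epsilon-greedily from a tabular Q of shape (n_states, n_actions).

    With probability epsilon choose uniformly at random (exploration),
    otherwise take the greedy action argmax_a Q[state, a] (exploitation).
    """
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: any of the m actions
    return int(np.argmax(Q[state]))           # exploit: greedy action
```

Under this scheme every action has probability at least $\epsilon/m$ and the greedy action has probability $1 - \epsilon + \epsilon/m$, which is exactly the decomposition used in the proof below.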
$\epsilon$-Greedy policy improvement
Theorem For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement: $v_{\pi'}(s) \ge v_\pi(s)$
Under the $\epsilon$-greedy policy $\pi'$, every action receives probability $\epsilon/m$ and the greedy action receives the remaining $1-\epsilon$, so
$$ q_\pi(s, \pi'(s)) = \sum_{a \in \mathcal{A}} \pi'(a \mid s)\, q_\pi(s, a) = \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1-\epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) $$The max over all actions must be at least as large as any weighted average:
$$ \max_{a \in \mathcal{A}} q_\pi(s, a) \ge \sum_{a \in \mathcal{A}} p(a \mid s)\, q_\pi(s, a) $$...for any weights $p(a \mid s) \ge 0$ that sum to 1. We ideally want to choose the weights so that, combined with the first term ($\frac{\epsilon}{m} \sum_a q_\pi(s,a)$), the whole expression sums to:
$$ \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_\pi(s,a) $$...which would mean that the weight contributed by the second term must be:
$$ \pi(a \mid s) - \epsilon/m $$...which means that the weights inside the max term, bearing in mind the factor of $(1-\epsilon)$, should ideally be:
$$ \frac{1}{1-\epsilon}\left(\pi(a \mid s) - \epsilon/m\right) $$Do these weights sum to 1? Summing over all $m$ actions gives:
$$ \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon} $$We have:
$$ \sum_{a \in \mathcal{A}} \pi(a \mid s) = 1 $$... because we are summing over a probability distribution.
Similarly, because there are $m$ values for $a$, we have:
$$ \sum_{a \in \mathcal{A}} \epsilon/m = m \cdot \epsilon/m = \epsilon $$Therefore:
$$ \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon} = \frac{1 - \epsilon}{1 - \epsilon} = 1 $$... as required.
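As a quick numeric check (illustrative values $m = 4$, $\epsilon = 0.2$, not from the lecture), each weight is $(\pi(a \mid s) - 0.05)/0.8$, and:
$$ \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon} = \frac{1 - 4 \times 0.05}{0.8} = \frac{0.8}{0.8} = 1 $$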
We also have:
$$ v_\pi(s) = q_\pi(s, \pi(s)) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) $$Meanwhile:
$$ v_{\pi'}(s) = q_{\pi'}(s, \pi'(s)) = \sum_{a \in \mathcal{A}} \pi'(a \mid s)\, q_{\pi'}(s, a) $$Let's follow the $\epsilon$-greedy policy $\pi'$ just for the first action, and then follow $\pi$ thereafter (whose value is given by $q_\pi$), i.e. we want to examine:
$$ q_\pi(s, \pi'(s)) $$...where $\pi'$ is the $\epsilon$-greedy policy
Combining the $\epsilon$-greedy expansion of $q_\pi(s, \pi'(s))$ with the weighted-average bound, using the weights constructed above:
$$ q_\pi(s, \pi'(s)) \ge \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1-\epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1-\epsilon}\, q_\pi(s, a) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s) $$Therefore:
$$ q_\pi(s, \pi'(s)) \ge v_\pi(s) $$...and the policy improvement theorem then gives $v_{\pi'}(s) \ge v_\pi(s)$ in every state.
GLIE
Definition Greedy in the Limit with Infinite Exploration (GLIE)
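For reference, the standard definition has two conditions: all state-action pairs are explored infinitely many times, and the policy converges to a greedy policy:
$$ \lim_{k \rightarrow \infty} N_k(s, a) = \infty \qquad \lim_{k \rightarrow \infty} \pi_k(a \mid s) = \mathbf{1}\!\left(a = \operatorname*{argmax}_{a' \in \mathcal{A}} Q_k(s, a')\right) $$For example, $\epsilon$-greedy with $\epsilon_k = 1/k$ is GLIE.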
GLIE Monte-Carlo Control
Theorem GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow q_*(s,a)$
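A minimal sketch of GLIE Monte-Carlo control for a tabular problem, assuming a Gym-style environment interface (`env.reset()`, `env.step(a)`, `env.n_actions` are assumed names, not from the lecture):

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control sketch: every-visit MC with eps_k = 1/k.

    Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done),
    env.n_actions -> number of actions. States and actions are hashable.
    """
    rng = np.random.default_rng(seed)
    Q = defaultdict(float)   # Q[(s, a)]: action-value estimates
    N = defaultdict(int)     # N[(s, a)]: visit counts

    for k in range(1, n_episodes + 1):
        eps = 1.0 / k        # GLIE schedule: exploration decays to zero

        # Sample one episode with the current eps-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(env.n_actions))
            else:
                a = max(range(env.n_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Every-visit MC: incremental mean update Q <- Q + (G - Q) / N.
        G = 0.0
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```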
MC vs TD Control
Update action-value functions with Sarsa
$$ Q(S,A) \leftarrow Q(S,A) + \alpha\left( R + \gamma Q(S', A') - Q(S,A) \right) $$
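A minimal sketch of this Sarsa backup as applied at every time step, using the same tabular `Q[(s, a)]` dictionary convention as above (illustrative, not the lecture's pseudocode):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa backup: move Q(S,A) toward the TD target R + gamma * Q(S', A').

    a_next is the action actually selected (e.g. eps-greedily) in s_next,
    which is what makes Sarsa an on-policy method.
    """
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```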
Off-policy learning
Importance Sampling for Off-Policy TD
$$ V(S_t) \leftarrow V(S_t) + \alpha\left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \left(R_{t+1} + \gamma V(S_{t+1})\right) - V(S_t) \right) $$
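A minimal sketch of this one-step importance-sampled TD(0) update, where `pi_prob(s, a)` and `mu_prob(s, a)` return the target and behaviour policy probabilities of the action actually taken (hypothetical names, not from the lecture):

```python
def off_policy_td_update(V, s, a, r, s_next, alpha, gamma, pi_prob, mu_prob):
    """TD(0) update toward the target policy pi using an action sampled from mu.

    The TD target is reweighted by the single-step importance ratio pi(a|s) / mu(a|s);
    unlike Monte-Carlo importance sampling, only one ratio is needed.
    """
    rho = pi_prob(s, a) / mu_prob(s, a)            # importance sampling ratio
    td_target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (td_target - V[s])
```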
Q-learning
Off-Policy control with Q-learning
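A minimal sketch of the Q-learning backup (the sample backup in the table below), again using the tabular `Q[(s, a)]` convention; no next action is needed because the target bootstraps from the greedy (max) action in the next state:

```python
def q_learning_update(Q, n_actions, s, a, r, s_next, alpha, gamma):
    """One Q-learning backup: move Q(S,A) toward R + gamma * max_a' Q(S', a').

    Off-policy: the behaviour policy (e.g. eps-greedy) chooses A, but the
    target uses the greedy action in S', so the update learns about q*.
    """
    best_next = max(Q[(s_next, a_)] for a_ in range(n_actions))
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```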
Relationship between DP and TD
| Full backup (DP) | Sample backup (TD) |
|---|---|
| Iterative policy evaluation $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD learning $V(S) \leftarrow_\alpha R + \gamma V(S')$ |
| Q-policy iteration $Q(s,a) \leftarrow \mathbb{E}[R + \gamma Q(S', A') \mid s,a]$ | Sarsa $Q(S,A) \leftarrow_\alpha R + \gamma Q(S', A')$ |
| Q-value iteration $Q(s,a) \leftarrow \mathbb{E}[R + \gamma \max_{a' \in \mathcal{A}} Q(S', a') \mid s,a]$ | Q-learning $Q(S,A) \leftarrow_\alpha R + \gamma \max_{a' \in \mathcal{A}} Q(S', a')$ |