In [1]:
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from IPython.display import Image
constant-$\alpha$ MC: $V(S_t) \gets V(S_t) + \alpha \underbrace{\left [ G_t - V(S_t) \right ]}_{= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k}$ (the sum-of-TD-errors identity holds exactly only if $V$ does not change during the episode)
\begin{align} v_\pi(s) &\doteq \mathbb{E}_\pi [ G_t \mid S_t = s] \qquad \text{Monte Carlo} \\ &= \mathbb{E}_\pi [ R_{t+1} + \gamma \color{blue}{v_\pi(S_{t+1})} \mid S_t = s ] \quad \text{DP} \end{align}
one-step TD, or TD(0): $V(S_t) \gets V(S_t) + \alpha \left [ \underbrace{R_{t+1} + \gamma \color{blue}{V(S_{t+1})} - V(S_t)}_{\text{TD error: } \delta_t} \right ]$
TD: samples the expected value (like MC) and uses the current estimate $V$ in place of the true $v_\pi$ (like DP).
In [3]:
Image('./res/fig6_1.png')
Out[3]:
In [2]:
Image('./res/TD_0.png')
Out[2]:
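A minimal sketch of tabular TD(0) prediction corresponding to the boxed algorithm above, assuming a Gym-style env whose step(action) returns (next_state, reward, done, info) and a callable policy(state) (both hypothetical here, not from the book):
In [ ]:
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]."""
    V = defaultdict(float)                            # arbitrary init; terminal states stay 0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])   # move V(S_t) toward the TD target
            state = next_state
    return V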
TD: they learn a guess from a guess - they bootstrap.
TD: convergence to $v_\pi$ is guaranteed for a fixed policy (with probability 1 under the usual decreasing step-size conditions).
In practice, TD methods have usually been found to converge faster than constant-$\alpha$ MC methods on stochastic tasks.
batch updating: increments are accumulated over each complete pass through the batch of training data and applied only at the end of the pass, repeated until the value function converges.
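A rough sketch of batch TD(0) under this scheme, assuming each episode in the batch is stored as a list of (state, reward, next_state, done) transitions (the data layout is illustrative):
In [ ]:
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    """Sweep the fixed batch of episodes repeatedly; accumulate every TD(0)
    increment during a sweep and apply them all at once, until V converges."""
    V = defaultdict(float)
    while True:
        increments = defaultdict(float)
        for episode in episodes:                      # episode: list of (s, r, s_next, done)
            for s, r, s_next, done in episode:
                target = r if done else r + gamma * V[s_next]
                increments[s] += alpha * (target - V[s])
        for s, delta in increments.items():           # apply the summed increments once per sweep
            V[s] += delta
        if max((abs(d) for d in increments.values()), default=0.0) < tol:
            return V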
In [3]:
Image('./res/sarsa.png')
Out[3]:
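A compact sketch of Sarsa control matching the box above, with an ε-greedy behaviour policy; the env interface is the same assumed Gym-style one as before:
In [ ]:
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax([Q[(state, a)] for a in range(n_actions)]))

def sarsa(env, n_actions, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Sarsa: Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)],
    where A' is the action actually taken next (on-policy)."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q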
In [4]:
Image('./res/q_learn_off_policy.png')
Out[4]:
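The same loop with the off-policy Q-learning target: behaviour stays ε-greedy, but the update uses the maximum over next actions (a sketch under the same assumed env interface):
In [ ]:
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Q-learning: Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon_greedy is the helper defined in the Sarsa sketch above
            action = epsilon_greedy(Q, state, n_actions, epsilon)
            next_state, reward, done, _ = env.step(action)
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            target = reward if done else reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q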
Expected Sarsa: use the expected value, taking into account how likely each action is under the current policy.
\begin{align} Q(S_t, A_t) & \gets Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \color{blue}{\mathbb{E}_\pi [Q(S_{t+1}, A_{t+1}) \mid S_{t+1}]} - Q(S_t, A_t) \right ] \\ & \gets Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \color{blue}{\sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a)} - Q(S_t, A_t) \right ] \end{align}
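Only the target changes relative to Sarsa; a small sketch of the expected target under an ε-greedy policy, with Q keyed by (state, action) tuples as in the sketches above (names are illustrative):
In [ ]:
import numpy as np

def expected_sarsa_target(Q, next_state, n_actions, reward, gamma, epsilon):
    """Expected Sarsa target: R + gamma * sum_a pi(a|S') Q(S', a) under an epsilon-greedy pi."""
    q_next = np.array([Q[(next_state, a)] for a in range(n_actions)])
    probs = np.full(n_actions, epsilon / n_actions)   # exploration mass spread uniformly
    probs[np.argmax(q_next)] += 1.0 - epsilon         # greedy action gets the remaining mass
    return reward + gamma * np.dot(probs, q_next)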
maximization bias:
root of the problem: using the same samples (plays) both to determine the maximizing action and to estimate its value. => divide the plays into two sets and use them to learn two independent estimates $Q_1$ and $Q_2$ (double learning).
$Q_1(S_t, A_t) \gets Q_1(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma Q_2 \left( S_{t+1}, \operatorname{argmax}_a Q_1(S_{t+1}, a) \right) - Q_1(S_t, A_t) \right ]$
In [5]:
Image('./res/double_learn.png')
Out[5]:
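A sketch of double Q-learning matching the box above: a coin flip on each step decides which table is updated, so the maximizing action and its value estimate come from independent samples (same assumed env interface as before):
In [ ]:
import numpy as np
from collections import defaultdict

def double_q_learning(env, n_actions, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Double Q-learning: Q1 selects the argmax action and Q2 evaluates it (and vice versa)."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily with respect to Q1 + Q2
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax([Q1[(state, a)] + Q2[(state, a)] for a in range(n_actions)]))
            next_state, reward, done, _ = env.step(action)
            if np.random.rand() < 0.5:                # with prob. 0.5 update Q1 using Q2's value
                a_star = int(np.argmax([Q1[(next_state, a)] for a in range(n_actions)]))
                target = reward if done else reward + gamma * Q2[(next_state, a_star)]
                Q1[(state, action)] += alpha * (target - Q1[(state, action)])
            else:                                     # otherwise update Q2 using Q1's value
                a_star = int(np.argmax([Q2[(next_state, a)] for a in range(n_actions)]))
                target = reward if done else reward + gamma * Q1[(next_state, a_star)]
                Q2[(state, action)] += alpha * (target - Q2[(state, action)])
            state = next_state
    return Q1, Q2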
In [ ]: