In [1]:
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
#sns.set(font='SimHei', font_scale=2.5)
#plt.rcParams['axes.grid'] = False
import numpy as np
import pandas as pd
#pd.options.display.max_rows = 20
#import sklearn
#import itertools
#import logging
#logging.basicConfig()
#logger = logging.getLogger()
#logger.setLevel(logging.DEBUG)
from IPython.display import Image
the value of a state: expected return starting from that state.
An obvious way to estimate it from experience: simply to average the returns observed after visits to that state.
$s$ may be visited multiple times in the same episode; first-visit MC averages only the returns following the first visit to $s$ in each episode, while every-visit MC averages the returns following all visits:
In [3]:
Image('./res/first_visit_mc.png')
Out[3]:
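A minimal sketch of first-visit MC prediction, assuming episodes are lists of $(S_t, A_t, R_{t+1})$ triples and a hypothetical generate_episode(policy) helper that samples one episode by following the given policy:
In [ ]:
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    # V(s): average of returns following the first visit to s in each episode
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)       # list of (state, action, reward)
        G = 0.0
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward               # return following time t
            if all(s != state for s, _, _ in episode[:t]):   # first visit to state
                returns_count[state] += 1
                returns_sum[state] += G
                V[state] = returns_sum[state] / returns_count[state]
    return V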
Monte Carlo methods do not bootstrap: the estimate for one state does not build upon the estimate of any other state.
The computational expense of estimating the value of a single state is independent of the number of states.
=> One can generate many sample episodes starting from the states of interest, averaging returns from only these states, ignoring all others.
If a model is not available, it is particularly useful to estimate action values $q_\pi(s, a)$ rather than state values.
The problem of maintaining exploration: with a deterministic policy, many state-action pairs may never be visited. One remedy is exploring starts: every episode begins at a state-action pair, and every pair has nonzero probability of being selected as the start.
In [4]:
Image('./res/gpi.png')
Out[4]:
In [5]:
Image('./res/monte_carlo_es.png')
Out[5]:
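A sketch of Monte Carlo ES, assuming a known finite set of states and actions and a hypothetical generate_episode_es(policy) helper that starts each episode from a randomly chosen state-action pair (exploring starts) and then follows the policy:
In [ ]:
import random
from collections import defaultdict

def mc_es(generate_episode_es, states, actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)                       # action-value estimates Q[(s, a)]
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}   # arbitrary initial policy

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)    # list of (s, a, r), random (s0, a0)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s_, a_) != (s, a) for s_, a_, _ in episode[:t]):  # first visit
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]          # average returns
                policy[s] = max(actions, key=lambda act: Q[(s, act)])  # greedy improvement
    return policy, Q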
The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them:
policy is soft, meaning that $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$, but gradually shifted closer and closer to a deterministic optimal policy.
$\epsilon$-soft policies: $\pi(a \mid s) \geq \frac{\epsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\epsilon > 0$.
In [6]:
Image('./res/on_epsilon_soft.png')
Out[6]:
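An on-policy first-visit MC control sketch using an $\epsilon$-greedy (hence $\epsilon$-soft) policy; the action list and the generate_episode helper are assumptions:
In [ ]:
import random
from collections import defaultdict

def on_policy_mc_control(generate_episode, actions, num_episodes,
                         epsilon=0.1, gamma=1.0):
    Q = defaultdict(float)
    counts = defaultdict(int)

    def policy(s):
        # epsilon-soft: explore uniformly with prob. epsilon, else act greedily w.r.t. Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = generate_episode(policy)       # list of (s, a, r)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s_, a_) != (s, a) for s_, a_, _ in episode[:t]):  # first visit
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return policy, Q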
Off-policy methods use two policies: the target policy $\pi$ that is being learned about, and the behavior policy $b$ that is used to generate behavior.
assumption of coverage: $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$.
we wish to estimate $v_\pi$ or $q_\pi$, but all we have are episodes following another policy $b$, where $b \neq \pi$.
=> importance sampling: a general technique for estimating expected values under one distribution given samples from another.
=> importance-sampling ratio $\rho_{t:T-1}$:
Given a starting state $S_t$, the probability of the subsequent state-action trajectory under $\pi$ is:
\begin{equation} \operatorname{Pr}\{A_t, S_{t+1}, A_{t+1}, \cdots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k) \end{equation}
\begin{equation} \rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \end{equation}
The ratio transforms the returns to have the right expected value:
\begin{align} \mathbb{E}[G_t \mid S_t] &= v_b(S_t) \\ \mathbb{E}[\rho_{t:T-1} G_t \mid S_t] &= v_\pi(S_t) \end{align}
To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results:
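A sketch of this ordinary importance-sampling estimate of $v_\pi(s)$ from first visits, assuming hypothetical callables pi(a, s) and b(a, s) that return the action probabilities of the two policies, and that the episodes were generated by $b$:
In [ ]:
import numpy as np

def ordinary_is_estimate(episodes, s, pi, b, gamma=1.0):
    # average of rho_{t:T-1} * G_t over first visits to s
    scaled_returns = []
    for episode in episodes:                     # episode: list of (s, a, r)
        for t, (s_t, _, _) in enumerate(episode):
            if s_t == s:                         # first visit to s
                G, rho, discount = 0.0, 1.0, 1.0
                for s_k, a_k, r_k in episode[t:]:
                    G += discount * r_k                      # return G_t
                    rho *= pi(a_k, s_k) / b(a_k, s_k)        # importance-sampling ratio
                    discount *= gamma
                scaled_returns.append(rho * G)
                break
    return np.mean(scaled_returns) if scaled_returns else 0.0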
In [3]:
Image('./res/off_policy_predict.png')
Out[3]:
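A sketch of the incremental off-policy prediction algorithm above, using weighted importance sampling; pi(a, s) and b(a, s) are assumed to return action probabilities, and generate_episode(b) to sample an episode from the behavior policy:
In [ ]:
from collections import defaultdict

def off_policy_mc_prediction(generate_episode, pi, b, num_episodes, gamma=1.0):
    Q = defaultdict(float)
    C = defaultdict(float)                       # cumulative sums of the weights

    for _ in range(num_episodes):
        episode = generate_episode(b)            # list of (s, a, r) generated by b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # weighted-average update
            W *= pi(a, s) / b(a, s)
            if W == 0.0:                         # all earlier weights would be zero
                break
    return Q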
In [4]:
Image('./res/off_policy_control.png')
Out[4]:
Potential problem: this method learns only from the tails of episodes, when all of the remaining actions in the episode are greedy. If nongreedy actions are common, learning can be greatly slowed.
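A sketch of the corresponding off-policy MC control algorithm, where the target policy is greedy with respect to Q and a soft behavior policy b(a, s) (an assumption, as above) generates the episodes:
In [ ]:
from collections import defaultdict

def off_policy_mc_control(generate_episode, actions, b, num_episodes, gamma=1.0):
    Q = defaultdict(float)
    C = defaultdict(float)                       # cumulative sums of the weights
    target = {}                                  # greedy target policy

    for _ in range(num_episodes):
        episode = generate_episode(b)            # list of (s, a, r) generated by b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != target[s]:                   # pi(a|s) = 0: exit the inner loop
                break
            W *= 1.0 / b(a, s)                   # pi(a|s) = 1 for the greedy action
    return target, Q
The break when the behavior action disagrees with the greedy target is exactly why learning proceeds only from the tails of episodes.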
In [ ]: