In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
a model of the environment: anything that an agent can use to predict how the environment will respond to its actions.
distribution models: produce a description of all possibilities and their probabilities (vs. sample models: produce just one possibility, sampled according to those probabilities).
common structure shared by all state-space planning methods:
\begin{equation*} \text{model} \longrightarrow \text{simulated experience} \xrightarrow{\text{backups}} \text{values} \longrightarrow \text{policy} \end{equation*}
planning uses simulated experience vs. learning uses real experience.
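As a minimal concrete instance of this pipeline, the cell below sketches random-sample one-step tabular Q-planning: repeatedly sample a previously seen state-action pair, query a sample model for a simulated reward and next state, and apply a one-step Q-learning backup. The sample_model interface (a dict of callables) and the hyperparameters are assumptions for illustration, not from the text.
In [ ]:
import random
from collections import defaultdict

def q_planning(sample_model, q, n_updates, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning (illustrative sketch).

    sample_model: dict mapping (state, action) -> callable returning (reward, next_state)
    q: dict mapping (state, action) -> value, updated in place
    """
    actions_in = defaultdict(list)       # actions the model has seen in each state
    for (s, a) in sample_model:
        actions_in[s].append(a)
    pairs = list(sample_model)

    for _ in range(n_updates):
        s, a = random.choice(pairs)              # 1. sample a previously observed pair
        r, s_next = sample_model[(s, a)]()       # 2. simulated experience from the model
        best_next = max((q.get((s_next, b), 0.0) for b in actions_in[s_next]), default=0.0)
        td_error = r + gamma * best_next - q.get((s, a), 0.0)
        q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error   # 3. one-step Q-learning backup
    return q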
In [3]:
Image('./res/fig8_1.png')
Out[3]:
In [4]:
Image('./res/fig8_2.png')
Out[4]:
Learning and planning are deeply integrated in the sense that they share almost all the same machinery, differing only in the source of their experience.
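The tabular Dyna-Q sketch below makes this concrete, assuming a deterministic environment so the model can store a single (reward, next state) per pair: each real step does direct RL, model learning, and then n planning updates that reuse the same Q-learning backup on simulated experience. The env.reset() / env.step(a) interface and the hyperparameters are illustrative assumptions.
In [ ]:
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    q = defaultdict(float)       # action-value table Q(s, a)
    model = {}                   # deterministic model: (s, a) -> (reward, next_state)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:            # epsilon-greedy in the real environment
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q[(s, b)])
            s_next, r, done = env.step(a)                         # real experience
            target = r + (0.0 if done else gamma * max(q[(s_next, b)] for b in actions))
            q[(s, a)] += alpha * (target - q[(s, a)])             # direct RL update
            model[(s, a)] = (r, s_next)                           # model learning
            for _ in range(n_planning):                           # planning on simulated experience
                ps, pa = random.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                ptarget = pr + gamma * max(q[(ps_next, b)] for b in actions)
                q[(ps, pa)] += alpha * (ptarget - q[(ps, pa)])
            s = s_next
    return q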
Models may be incorrect because:
=> conflict between exploration (trying actions that improve the model) and exploitation (behaving as well as possible given the current model) => simple heuristics are often effective.
Dyna-Q+ agent: keeps track, for each state-action pair, of how many time steps $\tau$ have elapsed since the pair was last tried in a real interaction with the environment. The more time that has elapsed, the greater the chance that the environment has changed and the model is wrong => special "bonus reward" on simulated experience involving that pair: $r + \kappa \sqrt{\tau}$.
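A minimal sketch of how such a bonus could enter the planning backups (the last_tried bookkeeping, the current step t, and kappa are illustrative names, not the book's pseudocode):
In [ ]:
import math

def planning_reward(model_reward, state_action, last_tried, t, kappa=1e-3):
    """Dyna-Q+ style bonus: augment the model's reward with kappa * sqrt(tau),
    where tau is the number of time steps since state_action was last tried
    in real interaction (last_tried and t are illustrative bookkeeping)."""
    tau = t - last_tried.get(state_action, 0)
    return model_reward + kappa * math.sqrt(tau)
In the planning loop of the Dyna-Q sketch above, the simulated reward pr would become planning_reward(pr, (ps, pa), last_tried, t) before the backup.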
In [2]:
Image('./res/fig8_5.png')
Out[2]:
In [3]:
Image('./res/fig8_6.png')
Out[3]:
uniform selection is usually not the best; planning can be much more efficient if simulated transitions and updates are focused on particular state-action pairs.
backward focusing of planning computations: work backward from arbitrary states whose values have changed, propagating the change to their predecessor state-action pairs.
prioritized sweeping: prioritize the updates according to a measure of their urgency, and perform them in order of priority.
In [4]:
Image('./res/prioritized_sweeping.png')
Out[4]:
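A condensed sketch of the planning loop in that pseudocode, assuming a deterministic environment: q is a defaultdict(float), model maps (s, a) -> (r, s'), predecessors maps a state to the (s, a) pairs the model predicts lead to it, and pqueue holds (-priority, tie_breaker, (s, a)) entries pushed whenever an update of magnitude greater than theta is anticipated. The names and thresholds are illustrative assumptions.
In [ ]:
import heapq
import itertools

_tie = itertools.count()   # tie-breaker so heapq never compares (s, a) pairs directly

def prioritized_sweeping_planning(q, model, predecessors, pqueue, actions,
                                  n_updates, alpha=0.1, gamma=0.95, theta=1e-4):
    """Process up to n_updates entries of the priority queue, most urgent first."""
    for _ in range(n_updates):
        if not pqueue:
            break
        _, _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]                         # deterministic model lookup
        target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])         # one-step Q-learning backup
        # backward focusing: push predecessors whose own update would now be large
        for (sp, ap) in predecessors.get(s, ()):
            rp, _ = model[(sp, ap)]
            priority = abs(rp + gamma * max(q[(s, b)] for b in actions) - q[(sp, ap)])
            if priority > theta:
                heapq.heappush(pqueue, (-priority, next(_tie), (sp, ap)))
    return q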
Three questions:
In [5]:
Image('./res/fig8_7.png')
Out[5]:
Two ways of distributing updates (a trajectory-sampling sketch follows at the end of this section):
three key ideas in common:
important dimensions along which the methods vary:
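Regarding the two ways of distributing updates noted above (exhaustive, uniform sweeps vs. sampling according to the on-policy distribution), the cell below is a minimal trajectory-sampling sketch, assuming the same deterministic model and defaultdict q as in the cells above: simulated experience is generated by following the current epsilon-greedy policy through the model, and only the pairs actually encountered are updated.
In [ ]:
import random
from collections import defaultdict

def trajectory_sampling(model, q, start_state, n_steps, alpha=0.1, gamma=0.95, eps=0.1):
    """Distribute planning updates along simulated on-policy trajectories."""
    known = defaultdict(list)                 # actions the model knows in each state
    for (s, a) in model:
        known[s].append(a)

    s = start_state
    for _ in range(n_steps):
        if not known[s]:                      # dead end in the model: restart the trajectory
            s = start_state
            continue
        if random.random() < eps:             # epsilon-greedy over modeled actions
            a = random.choice(known[s])
        else:
            a = max(known[s], key=lambda b: q[(s, b)])
        r, s_next = model[(s, a)]             # simulated transition
        target = r + gamma * max((q[(s_next, b)] for b in known[s_next]), default=0.0)
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s_next
    return q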