Models and Planning

  • We use "PLANNING" to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment
  • State-space planning
    • a search through the state space for an optimal policy or path to a goal.
  • Plan-space planning
    • Operators transform one plan into another, and value functions, if any, are defined over the space of plans.
    • Plan-space methods are difficult to apply efficiently to the stochastic optimal control problems that are the focus in reinforcement learning, and we do not consider them further (see, e.g., Russell and Norvig, 2010).
  • In this chapter we argue that various other state-space planning methods also fit this structure, with individual methods differing only in the kinds of backups they do, the order in which they do them, and in how long the backed-up information is retained (see the sketch below for one instance).
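
As one concrete instance of this shared structure, here is a minimal sketch of random-sample one-step tabular Q-planning in Python. The `sample_model` callable, the integer state/action encoding, and the hyperparameters are illustrative assumptions: the planner repeatedly picks a state–action pair at random, asks the model for a simulated reward and next state, and applies a one-step Q-learning backup to that simulated experience.

```python
import numpy as np

def q_planning(sample_model, num_states, num_actions,
               num_updates=10_000, alpha=0.1, gamma=0.95, rng=None):
    """Random-sample one-step tabular Q-planning (sketch).

    sample_model(s, a) -> (reward, next_state) is assumed to be a sample
    model of the environment; all other names are illustrative.
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_updates):
        # 1. Select a state and an action at random.
        s = rng.integers(num_states)
        a = rng.integers(num_actions)
        # 2. Ask the sample model for a simulated reward and next state.
        r, s_next = sample_model(s, a)
        # 3. One-step tabular Q-learning backup applied to simulated experience.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q
```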

Dyna: Integrating Planning, Acting, and Learning

  • trial-and-error learning
    • importance of cognition
  • reactive decision-making combined with deliberative planning
  • random-sample one-step tabular Q-planning (planning from simulated experience)
  • one-step tabular Q-learning (direct learning from real experience)

If n == 0

  • just direct RL: one-step Q-learning, no planning
  • during the first episode, except for its last step, the Q-table entries are not updated and remain at their initial (random) values

If n == 50

  • After the last real step of the first episode (which updates Q for the action, e.g. "Up", leading into the goal), the planning step (f) uses the learned Model to simulate transitions and apply up to 50 further Q updates to other state–action pairs (see the Dyna-Q sketch below).
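
To make the n == 0 and n == 50 cases concrete, below is a minimal sketch of Tabular Dyna-Q for a deterministic environment. The `env.reset()`/`env.step()` interface, the ε-greedy helper, and all hyperparameters are assumptions for illustration, not the book's exact code. The direct-RL update corresponds to step (d) and the planning loop to step (f); setting n = 0 reduces the agent to plain one-step Q-learning.

```python
import numpy as np

def dyna_q(env, num_states, num_actions, num_episodes=50,
           n=50, alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """Tabular Dyna-Q sketch (deterministic model); names and defaults are illustrative."""
    rng = rng or np.random.default_rng()
    Q = np.zeros((num_states, num_actions))
    model = {}  # (s, a) -> (r, s'), learned from real experience

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)     # act in the real environment
            # (d) direct RL: one-step Q-learning on the real transition
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            # (e) model learning: remember the observed (deterministic) transition
            model[(s, a)] = (r, s_next)
            # (f) planning: n one-step backups on transitions replayed from the model
            pairs = list(model.keys())
            for _ in range(n):
                ps, pa = pairs[rng.integers(len(pairs))]
                pr, ps_next = model[(ps, pa)]
                Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
            s = s_next
    return Q
```

With n = 50, each real step is followed by 50 simulated backups drawn from previously observed transitions, which is what spreads value back from the goal so much faster than for the n = 0 agent.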

The general problem here is another version of the conflict between exploration and exploitation. In a planning context, exploration means trying actions that improve the model, whereas exploitation means behaving in the optimal way given the current model.
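
One heuristic answer from the same chapter is the Dyna-Q+ exploration bonus: during planning, simulated rewards are increased by κ√τ, where τ is the number of time steps since that state–action pair was last tried in real experience, so the agent keeps being nudged to re-test parts of the model that may have become stale. A minimal sketch of a single planning backup with this bonus (argument names and default values are illustrative assumptions):

```python
import math
import numpy as np

def planning_backup_with_bonus(Q, model, tau, s, a,
                               alpha=0.1, gamma=0.95, kappa=1e-3):
    """One Dyna-Q+-style planning backup (sketch; names/defaults are illustrative).

    tau[(s, a)] is assumed to hold the number of real time steps since the
    pair (s, a) last occurred in real experience.
    """
    r, s_next = model[(s, a)]
    # Inflate the simulated reward for long-untried pairs, encouraging the
    # greedy policy to revisit them and thereby improve the model.
    r_plus = r + kappa * math.sqrt(tau[(s, a)])
    Q[s, a] += alpha * (r_plus + gamma * Q[s_next].max() - Q[s, a])
    return Q
```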

