# "Reinforcement learning", David Silver

## Lecture 1

Notation:

• history: $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$
• State, a function of history: $S_t = f(H_t)$
• environment state $S_t^e$
• agent state $S_t^a$
• can also be any function of history, $S_t^a = f(H_t)$

Information State

A state $S_t$ is Markov if and only if:

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1,\dots, S_t)$$
• the future is independent of the past, given the present:
$$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$$
• once the state is known, the history may be thrown away
• the environment state $S_t^e$ is Markov
• the history $H_t$ is Markov

Fully Observable Environments

Full observability: agent directly observes environment state:

$$O_t = S_t^a = S_t^e$$
• agent state = environment state = information state
• formally this is a Markov decision process (MDP)

Partially Observable Environments

• agent indirectly observes environment
• now agent state $\ne$ environment state
• formally this is a partially observable Markov decision process (POMDP)
• agent must construct its own state representation $S_t^a$, eg:
• complete history $S_t^a = H_t$
• beliefs of environment state $S_t^a = (P(S_t^e=s^1), \dots, P(S_t^e=s^n))$
• recurrent neural network: $S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o)$
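
A minimal PyTorch sketch of the recurrent update above; the dimensions and the random weights are placeholders, not anything from the lecture:

```python
import torch

# S_t^a = sigma(S_{t-1}^a W_s + O_t W_o), with made-up sizes
state_dim, obs_dim = 8, 4
W_s = torch.randn(state_dim, state_dim)  # recurrent weights (placeholder)
W_o = torch.randn(obs_dim, state_dim)    # observation weights (placeholder)

s_prev = torch.zeros(1, state_dim)       # agent state S_{t-1}^a
o_t = torch.randn(1, obs_dim)            # observation O_t

s_t = torch.sigmoid(s_prev @ W_s + o_t @ W_o)  # new agent state S_t^a
print(s_t.shape)  # torch.Size([1, 8])
```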

Major components of an RL agent

• an RL agent may include one or more of these components:
• policy: agent's behavior function
• value function: how good is each state and/or action
• model: agent's representation of the environment

Policy

• a policy is the agent's behavior
• it is a map from state to action, eg:
• deterministic policy: $a = \pi(s)$
• stochastic policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$

Value Function

• value function is a prediction of future reward
• used to evaluate the goodness/badness of states
• and therefore to select between actions, eg:
$$v_\pi(s) = \mathbb{E}_\pi(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s)$$

Model

• a model predicts what the environment will do next
• $\mathcal{P}$ predicts the next state
• $\mathcal{R}$ predicts the next (immediate) reward, eg:
$$\mathcal{P}_{ss'}^a = P(S_{t+1} = s' \mid S_t = s, A_t = a) \\ \mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$

Categorizing RL agents (1)

• Value based
  • No Policy (implicit)
  • Value Function
• Policy based
  • Policy
  • No Value Function
• Actor Critic
  • Policy
  • Value Function

## Lecture 2

• MDPs formally describe an environment for reinforcement learning
• where the environment is fully observable
• the current state completely characterises the process
• almost all RL problems can be formalised as MDPs, eg:
• optimal control primarily deals with continuous MDPs
• partially observable problems can be converted into MDPs
• bandits are MDPs with one state

Markov property

• "the future is independent of the past given the present"
• state captures all relevant information from the history
• once the state is known, the history may be thrown away
• ie the state is a sufficient statistic of the future

State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is defined by:

$$\mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s)$$

State transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$,

$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{bmatrix}$$

... where each row of the matrix sums to 1.

Markov Process

A Markov process is a memoryless random process, ie a sequence of random states $S_1, S_2, \dots$ with the Markov property.

A Markov process (or Markov Chain) is a tuple $<\mathcal{S}, \mathcal{P}>$:

• $\mathcal{S}$ is a (finite) set of states
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s)$
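
As a quick illustration, a numpy sketch of a hypothetical 3-state Markov chain: each row of $\mathcal{P}$ sums to 1, and a trajectory is sampled by repeatedly drawing the next state from the current state's row:

```python
import numpy as np

# hypothetical 3-state Markov chain; the numbers are made up
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.2, 0.8]])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution

rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    s = int(rng.choice(3, p=P[s]))  # S_{t+1} ~ P[s, .]
    trajectory.append(s)
print(trajectory)
```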

Markov Reward Process

A Markov reward process is a Markov chain with values

A Markov Reward Process is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$.

• $\mathcal{S}$ is a finite set of states
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s)$
• $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
• $\gamma$ is a discount factor, $\gamma \in [0,1]$

Return

The return $G_t$ is the total discounted reward from time-step $t$

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
• the discount $\gamma \in [0,1]$ is the present value of future rewards
• the value of receiving reward $R$ after $k+1$ time-steps is $\gamma^kR$
• this values immediate reward above delayed reward
• $\gamma$ close to $0$ leads to 'myopic' evaluation
• $\gamma$ close to $1$ leads to 'far-sighted' evaluation
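
A quick numeric check of the return formula, on a made-up reward sequence:

```python
import numpy as np

# G_t = sum_k gamma^k R_{t+k+1} on an arbitrary reward sequence
rewards = np.array([1.0, -1.0, 2.0, 0.5])  # R_{t+1}, R_{t+2}, ...
gamma = 0.9
G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)  # 1.0 - 0.9 + 1.62 + 0.3645 = 2.0845
```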

Value Function

The value function $v(s)$ gives the long-term value of state $s$.

The state value function $v(s)$ of an MRP is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t =s ]$$

Bellman Equation for MRPs

The value function can be decomposed into two parts:

• immediate reward $R_{t+1}$
• discounted value of successor state $\gamma v(S_{t+1})$
$$v(s) = \mathbb{E}[G_t \mid S_t = s] \\ = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\ = \mathbb{E}[R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \dots ) \mid S_t = s] \\ = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] \\ = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices:

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

where:

• $v$ is a column vector, with one entry per state
• $\mathcal{R}$ is a column vector with one entry per state
$$\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{bmatrix} \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix}$$

Solving the Bellman Equation

• the Bellman equation is a linear equation
• it can be solved directly (see the sketch after this list):
$$v = \mathcal{R} + \gamma \mathcal{P} v \\ (I - \gamma \mathcal{P})v = \mathcal{R} \\ v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$
• computational complexity is $O(n^3)$ for $n$ states
• direct solution only possible for small MRPs
• many iterative methods for large MRPs, eg:
• dynamic programming
• monte-carlo evaluation
• temporal-difference learning
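
A minimal numpy sketch of the direct solution above, on a hypothetical two-state MRP (state 1 is absorbing with zero reward; all numbers are made up):

```python
import numpy as np

# v = (I - gamma P)^{-1} R for a tiny two-state MRP
P = np.array([[0.5, 0.5],    # state 0: stay or move to state 1
              [0.0, 1.0]])   # state 1: absorbing
R = np.array([1.0, 0.0])
gamma = 0.9

# solve (I - gamma P) v = R rather than forming the inverse explicitly
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # v(0) = 1 / (1 - 0.45) ≈ 1.818, v(1) = 0
```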

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

A Markov Decision Process is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$

• $\mathcal{S}$ is a finite set of states
• $\mathcal{A}$ is a finite set of actions
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
• $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t =a ]$
• $\gamma$ is a discount factor $\gamma \in [0,1]$

Policies

A policy $\pi$ is a distribution over actions given states,

$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$
• a policy fully defines the behavior of an agent
• MDP policies depend on the current state (not the history)
• ie policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$

Policies (2)

• given an MDP $\mathcal{M} = <\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and a policy $\pi$
• the state sequence $S_1, S_2, \dots$ is a Markov process $<\mathcal{S}, \mathcal{P}^\pi>$
• the state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $<\mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma>$, where:
$$\mathcal{P}_{s,s'}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s ) \mathcal{P}_{ss'}^a \\ \mathcal{R}_{s}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s ) \mathcal{R}_{s}^a$$
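
A sketch of averaging the dynamics and rewards over a policy, on a made-up two-state, two-action MDP with a uniform random policy, then solving for $v_\pi$ with the same linear solve as for MRPs:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # P[a, s, s'] (made-up MDP)
P[0] = [[1.0, 0.0], [0.0, 1.0]]  # action 0: stay
P[1] = [[0.0, 1.0], [1.0, 0.0]]  # action 1: switch
R = np.array([[0.0, 1.0],        # R[s, a]
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],       # pi[s, a]: uniform random policy
               [0.5, 0.5]])

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
print(v_pi)  # [7.25, 7.75]
```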

Value function

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation

For the state-value function:

$$v_\pi(s) = \mathbb{E}_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ]$$

For the action-value function:

$$q_\pi(s, a) = \mathbb{E}_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ]$$

Relating $v_\pi$ and $q_\pi$ to each other:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')$$

Substituting each into the other gives recursions in $v_\pi$ and $q_\pi$ alone:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$
$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$

Bellman Expectation Equation (matrix form)

The Bellman expectation equation can be expressed concisely using the induced MRP:

$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$$

with direct solution:

$$v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$$

Optimal policy

Define a partial ordering over policies:

$$\pi \ge \pi' \text{ if } v_\pi(s) \ge v_{\pi'}(s), \forall s$$

For any Markov Decision Process:

• there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \forall \pi$
• all optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$
• all optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s,a)$

Bellman Optimality Equation for $v_*$

The optimal value functions are recursively related by the Bellman optimality equations:

$$v_*(s) = \max_a q_*(s, a) \\ q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s')$$
$$v_*(s) = \max_a \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right)$$
$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$

Solving the Bellman Optimality Equation

• Bellman Optimality Equation is non-linear
• No closed form solution (in general)
• Many iterative solution methods:
• value iteration
• policy iteration
• Q-learning
• Sarsa

## Lecture 3: Planning by Dynamic Programming

What is Dynamic Programming

• method for solving complex problems
• break down into subproblems
• solve the subproblems
• combine solutions to subproblems

Requirements for dynamic programming

• optimal substructure
  • optimal solution can be decomposed into subproblems
• overlapping subproblems
  • subproblems recur many times
  • solutions can be cached and reused
• Markov decision processes satisfy both properties
  • Bellman equation gives recursive decomposition
  • Value function stores and reuses solutions

Planning by dynamic programming

• DP assumes full knowledge of the MDP
• used for planning in an MDP
• for prediction:
• input: MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and policy $\pi$
• or: MRP $<\mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma>$
• output: value function $v_\pi$
• or for control:
• input: MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$
• output: optimal value function $v_*$
• and optimal policy $\pi_*$

Iterative policy evaluation

• problem: evaluate a given policy $\pi$
• solution: iterative application of Bellman expectation backup
• $v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_\pi$
• using synchronous backups:
• at each iteration $k + 1$
• for all states $s \in \mathcal{S}$
• update $v_{k+1}(s)$ from $v_k(s')$
$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right)$$


In [8]:

```python
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

gridworld_str = """*---
----
----
---*"""  # * => exit. - => empty square

reward_by_symbol = {
    '*': 0,
    '-': -1
}

rows = len(gridworld_str.split('\n'))
cols = len(gridworld_str.split('\n')[0])
print('rows %s cols %s' % (rows, cols))
V = torch.zeros(rows, cols)
print('V', V)

# parse the grid string into a 2-D list of symbols
gridworld = []
gridworld_rows_str = gridworld_str.split('\n')
for r in range(rows):
    row_str = gridworld_rows_str[r]
    row = []
    for c in range(cols):
        sym = row_str[c]
        row.append(sym)
    gridworld.append(row)

def get_R(gridworld):
    # reward matrix: -1 per step, 0 at the exit squares
    rows = len(gridworld)
    cols = len(gridworld[0])
    R = torch.zeros(rows, cols)
    for r in range(rows):
        row = gridworld[r]
        for c in range(cols):
            char = row[c]
            reward = reward_by_symbol[char]
            R[r][c] = reward
    return R

R = get_R(gridworld)
print('R', R)

def random_policy(V, s):
    # uniform random policy: each move (left, right, up, down) with p = 0.25
    action_probs = [
        {'a': [0, -1], 'p': 0.25},
        {'a': [0, 1], 'p': 0.25},
        {'a': [-1, 0], 'p': 0.25},
        {'a': [1, 0], 'p': 0.25}
    ]
    for a_p in action_probs:
        a_p['a'] = torch.IntTensor(a_p['a'])
    return action_probs

def apply_a(s, a):
    # take action a from state s, clamping at the grid boundary
    s_new = s + a
    s_new = s_new.clamp(min=0)
    s_new[0] = min(s_new[0], rows - 1)
    s_new[1] = min(s_new[1], cols - 1)
    return s_new

def a_to_str(a):
    # (unused helper) compact string form of an action tensor
    return '\n'.join(str(a.view(1, -1)).split('\n')[1:-2])

def s_to_str(s):
    # (unused helper) compact string form of a state tensor
    return '\n'.join(str(s.view(1, -1)).split('\n')[1:-2])

def is_terminal(s):
    return gridworld[s[0]][s[1]] == '*'

K = 100

def value_iterate(p_a_for_s):
    # iterative policy evaluation: K synchronous sweeps of the
    # Bellman expectation backup under the policy p_a_for_s
    V = torch.zeros(rows, cols)
    for k in range(K):
        V_new = torch.zeros(rows, cols)
        for s_i in range(rows):
            for s_j in range(cols):
                s = torch.IntTensor([s_i, s_j])
                v_new = 0
                if not is_terminal(s):
                    for a_p in p_a_for_s(V, s):
                        a, p = a_p['a'], a_p['p']
                        s_new = apply_a(s, a)
                        v_next = V[s_new[0], s_new[1]]
                        v_new += p * v_next
                    r = R[s[0], s[1]]
                    v_new += r
                V_new[s_i, s_j] = v_new
        V = V_new
        if k < 3:
            print('k', k + 1)
            print(V)
    return V

V_random = value_iterate(random_policy)
print('V_random', V_random)

def optimum_policy_hardcoded(V, s):
    # hand-written policy: the letters give the equiprobable
    # actions (u/d/l/r) taken from each square
    policy = [
        ['*', 'l', 'l', 'lr'],
        ['u', 'ul', 'lr', 'd'],
        ['u', 'ru', 'dr', 'd'],
        ['ru', 'r', 'r', '*']
    ]
    actions = policy[s[0]][s[1]]
    p = 1 / len(actions)
    action_probs = []
    for c in actions:
        a = {
            'u': [-1, 0],
            'd': [1, 0],
            'l': [0, -1],
            'r': [0, 1]
        }[c]
        a = torch.IntTensor(a)
        action_probs.append({'a': a, 'p': p})
    return action_probs

V_optimum = value_iterate(optimum_policy_hardcoded)
print('V_optimum', V_optimum)
```




```
rows 4 cols 4
V
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
[torch.FloatTensor of size 4x4]

R
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 1
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 2
 0.0000 -1.7500 -2.0000 -2.0000
-1.7500 -2.0000 -2.0000 -2.0000
-2.0000 -2.0000 -2.0000 -1.7500
-2.0000 -2.0000 -1.7500  0.0000
[torch.FloatTensor of size 4x4]

k 3
 0.0000 -2.4375 -2.9375 -3.0000
-2.4375 -2.8750 -3.0000 -2.9375
-2.9375 -3.0000 -2.8750 -2.4375
-3.0000 -2.9375 -2.4375  0.0000
[torch.FloatTensor of size 4x4]

V_random
  0.0000 -13.9426 -19.9149 -21.9048
-13.9426 -17.9251 -19.9155 -19.9149
-19.9149 -19.9155 -17.9251 -13.9426
-21.9048 -19.9149 -13.9426   0.0000
[torch.FloatTensor of size 4x4]

k 1
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 2
 0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -1
-2 -2 -1  0
[torch.FloatTensor of size 4x4]

k 3
 0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]

V_optimum
 0 -1 -2 -4
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]
```



How to improve a policy

• given a policy $\pi$:
• evaluate the policy $\pi$: $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]$
• improve the policy by acting greedily with respect to $v_\pi$: $\pi' = \text{greedy}(v_\pi)$
• in Small Gridworld the improved policy was optimal, $\pi' = \pi_*$
• in general, more iterations of improvement/evaluation are needed
• but this process of policy iteration always converges to $\pi_*$ (see the sketch below)
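
A compact numpy sketch of the whole policy-iteration loop, reusing the made-up two-state MDP from the Lecture 2 sketches: exact evaluation via the induced-MRP linear solve, then greedy improvement, until the policy is stable:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # P[a, s, s'] (made-up MDP)
P[0] = [[1.0, 0.0], [0.0, 1.0]]  # action 0: stay
P[1] = [[0.0, 1.0], [1.0, 0.0]]  # action 1: switch
R = np.array([[0.0, 1.0],        # R[s, a]
              [2.0, 0.0]])

pi = np.zeros(n_s, dtype=int)    # deterministic policy: a = pi[s]
while True:
    # policy evaluation: v_pi = (I - gamma P^pi)^{-1} R^pi
    P_pi = P[pi, np.arange(n_s)]         # row s of P[pi[s]]
    R_pi = R[np.arange(n_s), pi]
    v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
    # policy improvement: act greedily with respect to q_pi
    q = R.T + gamma * P @ v              # q[a, s]
    pi_new = q.argmax(axis=0)
    if np.array_equal(pi_new, pi):
        break                            # greedy policy stable => optimal
    pi = pi_new
print('pi_*', pi, 'v_*', v)  # pi_* = [1 0], v_* = [19, 20]
```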

Policy improvement

• consider a deterministic policy $a = \pi(s)$
• we can improve the policy by acting greedily: $$\def\argmax{\text{argmax}} \pi'(s) = \argmax_{a \in \mathcal{A}} q_\pi(s, a)$$
• this improves the value from any state $s$ over one step $$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$
• it therefore improves the value function, $v_{\pi'}(s) \ge v_\pi(s)$
$$v_\pi(s) = q_\pi(s, \pi(s)) \le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \le \dots \le v_{\pi'}(s)$$

(the full chain of inequalities is expanded step by step in "Notes on v and q" below)

Policy improvement (2)

• if improvements stop: $$v_\pi(s) = q_\pi(s, \pi(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi'(s))$$

ie: $$v_\pi(s) = \max_{a \in \mathcal{A}}q_\pi(s, a)$$

This is the Bellman optimality equation

Therefore:

$$v_\pi(s) = v_*(s) \forall s \in \mathcal{S}$$

Therefore, $\pi$ is an optimal policy

up to 59:15

Notes on v and q

$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]$$

For deterministic policy

$$a = \pi(s) \\ v_\pi(s) = q_\pi(s, \pi(s))$$

If we have the greedy deterministic policy: $$\pi'(s) = \argmax_{a \in \mathcal{A}} q_\pi(s, a)$$ ... then: $$q_{\pi}(s,\pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a)$$ And we have: $$\max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$ and: 

$$v_\pi(s) \\ = q_\pi(s, \pi(s)) \\ \le q_\pi(s, \pi'(s)) \\ = \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]$$
$$= \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi(S_{t+1})) \mid S_t = s\right]$$
$$\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right]$$
$$= \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi(S_{t+2})) \mid S_t = s \right]$$
$$\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right]$$
$$\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$$
$$=v_{\pi'}(s)$$


In [9]:

```python
def optimum_policy(V, s):
    # greedy policy improvement: pick the action whose successor
    # state has the highest value under the current V (gamma = 1)
    a_best = None
    v_best = None
    for a in [
        [-1, 0], [1, 0], [0, -1], [0, 1]
    ]:
        a = torch.IntTensor(a)
        s_new = apply_a(s, a)
        v_new = R[s[0], s[1]] + V[s_new[0], s_new[1]]
        if v_best is None or v_new > v_best:
            a_best = a
            v_best = v_new
    action_probs = [{'a': a_best, 'p': 1.0}]
    return action_probs

V_optimum = value_iterate(optimum_policy)
print('V_optimum', V_optimum)
```




```
k 1
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 2
 0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -1
-2 -2 -1  0
[torch.FloatTensor of size 4x4]

k 3
 0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]

V_optimum
 0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]
```



Principle of Optimality

Any optimal policy can be subdivided into two components:

• an optimal first action $A_*$
• followed by an optimal policy from successor state $S'$

Theorem (Principle of optimality)

• a policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if:
• for any state $s'$ reachable from $s$
• $\pi$ achieves the optimal value from state $s'$, $v_\pi(s') = v_*(s')$

Deterministic value iteration

• if we know the solution to subproblems $v_*(s')$
• then the solution $v_*(s)$ can be found by one-step lookahead $$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$$
• the idea of value iteration is to apply these updates iteratively
• still works with loopy, stochastic MDPs

Value iteration

• problem: find the optimal value function $v_*$
• solution: iterative application of the Bellman optimality backup, $v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_*$, in matrix form:
$$\mathbf{v}_{k+1} = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v}_k \right)$$
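
A vectorized numpy sketch of this backup, on the same made-up two-state MDP as the earlier sketches; the fixed point matches the policy-iteration result:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # P[a, s, s'] (made-up MDP)
P[0] = [[1.0, 0.0], [0.0, 1.0]]
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])       # R[s, a]

v = np.zeros(n_s)
for k in range(1000):
    v_new = (R.T + gamma * P @ v).max(axis=0)  # max over actions
    if np.max(np.abs(v_new - v)) < 1e-8:       # stop near the fixed point
        break
    v = v_new
print(v)  # ≈ [19, 20]
```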

Synchronous dynamic programming algorithms

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
• algorithms are based on state-value function $v_\pi(s)$ or $v_*(s)$
• complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
• could also apply to action-value function $q_\pi(s,a)$ or $q_*(s,a)$
• complexity $O(m^2n^2)$ per iteration

Asynchronous dynamic programming

Three simple ideas for asynchronous dynamic programming:

• in-place dynamic programming
• prioritized sweeping
• real-time dynamic programming

In-place dynamic programming

• only store one copy of value function for all $s$ in $\mathcal{S}$
$$v(s) \leftarrow \max_{a \in \mathcal{A}}\left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \right)$$
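
A sketch of the in-place variant on the same made-up MDP: a single value array, so states later in a sweep already see this sweep's updates (contrast with the synchronous two-array version above):

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # made-up MDP, as in the earlier sketches
P[0] = [[1.0, 0.0], [0.0, 1.0]]
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

v = np.zeros(n_s)                # only one copy of the value function
for sweep in range(300):
    for s in range(n_s):
        # this update is immediately visible to later states in the sweep
        v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_a))
print(v)  # ≈ [19, 20]
```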

Prioritized sweeping

• use magnitude of Bellman error to guide state selection, eg: $$\left| \max_{a \in \mathcal{A}} \left(  \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \right) - v(s) \right|$$
• backup the state with the largest remaining Bellman error
• update Bellman error of affected states after each backup
• requires knowledge of reverse dynamics (predecessor states)
• can be implemented efficiently by maintaining a priority queue
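
A rough sketch with a binary heap (`heapq`), again on the made-up two-state MDP; a real implementation would track and discard stale queue entries rather than re-pushing freely:

```python
import heapq
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # made-up MDP, as in the earlier sketches
P[0] = [[1.0, 0.0], [0.0, 1.0]]
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def bellman_error(v, s):
    backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_a))
    return abs(backup - v[s])

# reverse dynamics: predecessors[t] = states that can transition into t
predecessors = [{s for a in range(n_a) for s in range(n_s) if P[a, s, t] > 0}
                for t in range(n_s)]

v = np.zeros(n_s)
queue = [(-bellman_error(v, s), s) for s in range(n_s)]  # max-heap via negation
heapq.heapify(queue)
for _ in range(2000):
    neg_err, s = heapq.heappop(queue)    # state with largest recorded error
    if -neg_err < 1e-8:
        break
    v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_a))
    for p in predecessors[s]:            # this backup changed their errors
        heapq.heappush(queue, (-bellman_error(v, p), p))
print(v)  # ≈ [19, 20]
```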

Real-time dynamic programming

• idea: only update states that are relevant to the agent
• use agent's experience to guide the selection of states
• after each time-step $S_t$, $A_t$, $R_{t+1}$
• backup the state $S_t$ $$v(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{S_t s'} v(s') \right)$$

Full-Width Backups

• DP uses full-width backups
• for each backup (sync or async)
• every successor state and action is considered
• using knowledge of the MDP transitions and reward functions
• DP is effective for medium-sized problems (millions of states)
• for large problems DP suffers Bellman's curse of dimensionality
• number of states $n = |\mathcal{S}|$ grows exponentially with number of state variables
• even one backup can be too expensive

Sample backups

• in subsequent lectures we will consider sample backups
• using sample rewards and sample transitions $<\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{S'}>$
• instead of reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$
• cost of backup is constant, independent of $n = |\mathcal{S}|$
Approximate dynamic programming

• using a function approximator $\hat{v}(s, \mathbf{w})$
• apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$
• eg Fitted Value Iteration repeats at each iteration $k$:
• sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$
• for each state $s \in \tilde{\mathcal{S}}$, estimate the target value using the Bellman optimality equation, $$\tilde{v}_k(s) = \max_{a \in \mathcal{A}}\left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, \hat{v}(s', \mathbf{w}_k) \right)$$
• train the next value function $\hat{v}(\cdot, \mathbf{w}_{k+1})$ using targets $\{<s, \tilde{v}_k(s)>\}$
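
A rough PyTorch sketch of this loop on a small random MDP; the one-hot features, the linear $\hat{v}$, and all sizes are placeholders, and "sampled states" is simply all states here:

```python
import torch

torch.manual_seed(0)
n_s, n_a, gamma = 8, 2, 0.9
P = torch.softmax(torch.randn(n_a, n_s, n_s), dim=-1)  # random dynamics P[a, s, s']
R = torch.randn(n_s, n_a)                              # random rewards R[s, a]

phi = torch.eye(n_s)              # one-hot state features (placeholder)
v_hat = torch.nn.Linear(n_s, 1)   # linear function approximator v_hat(s, w)
opt = torch.optim.SGD(v_hat.parameters(), lr=0.5)

for k in range(200):
    # targets from the Bellman optimality backup through v_hat(., w_k)
    with torch.no_grad():
        v = v_hat(phi).squeeze(-1)                    # v_hat(s', w_k)
        targets = (R.T + gamma * P @ v).max(dim=0).values
    # fit w_{k+1} by regressing v_hat onto the frozen targets
    for _ in range(20):
        loss = ((v_hat(phi).squeeze(-1) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
print(v_hat(phi).squeeze(-1))
```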