Notation:
Information State
A state $S_t$ is Markov if and only if:
$$ P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1,\dots, S_t) $$
Fully Observable Environments
Full observability: the agent directly observes the environment state:
$$ O_t = S_t^a = S_t^e $$
Partially Observable Environments
Major components of an RL agent
Policy
Value Function
Model
Categorizing RL agents (1)
Markov property
State Transition Matrix
For a Markov state $s$ and successor state $s'$, the state transition probability is defined by:
$$ \mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s) $$
State transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$,
$$ \mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix} $$
... where rows index the current ("from") state, columns index the successor ("to") state, and each row of the matrix sums to 1.
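A minimal sketch (the 3-state chain below is made up for illustration, not taken from the lectures): store $\mathcal{P}$ as a tensor whose rows sum to 1; multiplying a state distribution by $\mathcal{P}$ advances it one step.

import torch

# hypothetical 3-state Markov chain; entry [i, j] = P(next state = j | current state = i)
P = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
])
assert torch.allclose(P.sum(dim=1), torch.ones(3))  # each row sums to 1

# distribution over states after 3 steps, starting in state 0
mu0 = torch.tensor([1.0, 0.0, 0.0])
mu3 = mu0 @ torch.matrix_power(P, 3)
print(mu3)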
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
A Markov process (or Markov Chain) is a tuple $<\mathcal{S}, \mathcal{P}>$, where $\mathcal{S}$ is a (finite) set of states and $\mathcal{P}$ is the state transition probability matrix.
Markov Reward Process
A Markov reward process is a Markov chain with values
A Markov Reward Process is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$, where $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, and $\gamma \in [0, 1]$ is a discount factor.
Return
The return $G_t$ is the total discounted reward from time-step $t$
$$ G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} $$
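A minimal sketch (the reward sequence here is made up for illustration): the return over a finite episode can be accumulated backwards via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma * R_{t+2} + ..., computed backwards as G_t = R_{t+1} + gamma * G_{t+1}
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([-1, -1, -1, 0], gamma=0.9))  # rewards received after time t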
Value Function
The value function $v(s)$ gives the long-term value of state $s$.
The state value function $v(s)$ of an MRP is the expected return starting from state $s$:
$$ v(s) = \mathbb{E}[G_t \mid S_t = s] $$
Bellman Equation for MRPs
The value function can be decomposed into two parts: the immediate reward $R_{t+1}$, and the discounted value of the successor state $\gamma v(S_{t+1})$:
$$ v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] $$
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices:
$$ v = \mathcal{R} + \gamma \mathcal{P} v $$
... where $v$ is a column vector with one entry per state, $\mathcal{R}$ is the corresponding vector of expected rewards, and $\mathcal{P}$ is the state transition matrix.
Solving the Bellman Equation
The Bellman equation is linear and can be solved directly:
$$ v = (I - \gamma \mathcal{P})^{-1} \mathcal{R} $$
The matrix inversion costs $O(n^3)$ for $n$ states, so the direct solution is only practical for small MRPs; larger problems use iterative methods such as dynamic programming, Monte-Carlo evaluation, and temporal-difference learning.
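A minimal sketch of the direct solution, using the same hypothetical 3-state chain as above (the rewards are made up for illustration):

import torch

gamma = 0.9
P = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R = torch.tensor([-1.0, -2.0, 0.0])  # made-up expected immediate reward for each state

# v = (I - gamma * P)^{-1} R
v = torch.inverse(torch.eye(3) - gamma * P) @ R
print(v)

# check that v satisfies the Bellman equation v = R + gamma * P v
assert torch.allclose(v, R + gamma * P @ v, atol=1e-5)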
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
A Markov Decision Process is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$, where $\mathcal{A}$ is a finite set of actions, $\mathcal{P}_{ss'}^a = P(S_{t+1} = s' \mid S_t = s, A_t = a)$, and $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
Policies
A policy $\pi$ is a distribution over actions given states,
$$ \pi(a \mid s) = P(A_t = a \mid S_t = s) $$
Policies (2)
Value function
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
$$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] $$
The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$ q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] $$
Bellman Expectation Equation
For the value function:
$$ v_\pi(s) = \mathbb{E}_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ] $$
For the action-value function:
$$ q_\pi(s, a) = \mathbb{E}_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ] $$
Relating the two:
$$ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) $$
$$ q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') $$
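A minimal sketch on a small randomly generated MDP (all tensors below are made-up placeholders, not quantities from the lectures): iterating the expectation backup converges to $v_\pi$ and $q_\pi$, and the two identities above hold at the fixed point.

import torch

torch.manual_seed(0)
n_states, n_actions, gamma = 4, 2, 0.9

# random MDP: P[s, a] is a distribution over successor states, R[s, a] an expected reward
P = torch.softmax(torch.randn(n_states, n_actions, n_states), dim=-1)
R = torch.randn(n_states, n_actions)
pi = torch.softmax(torch.randn(n_states, n_actions), dim=-1)  # pi(a | s)

v = torch.zeros(n_states)
for _ in range(1000):
    q = R + gamma * torch.einsum('kas,s->ka', P, v)  # q_pi(s,a) = R_s^a + gamma * sum_s' P^a_ss' v(s')
    v = (pi * q).sum(dim=1)                          # v_pi(s)  = sum_a pi(a|s) q_pi(s,a)

# at the fixed point the Bellman expectation equation holds
assert torch.allclose(q, R + gamma * torch.einsum('kas,s->ka', P, v), atol=1e-4)
print(v)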
Bellman Expectation Equation (matrix form)
The Bellman expectation equation can be expressed concisely using the induced MRP:
$$ v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi $$
with direct solution:
$$ v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi $$
Optimal policy
Define a partial ordering over policies:
$$ \pi \ge \pi' \text{ if } v_\pi(s) \ge v_{\pi'}(s), \forall s $$
For any Markov Decision Process there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \forall \pi$; all optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$, and the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$.
Bellman Optimality Equation for $v_*$
The optimal value functions are recursively related by the Bellman optimality equations:
$$ v_*(s) = \max_a q_*(s, a) \\ q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') $$
Solving the Bellman Optimality Equation
The Bellman optimality equation is non-linear and has no closed-form solution in general; it is solved by iterative methods such as value iteration, policy iteration, Q-learning, and Sarsa.
What is Dynamic Programming
Requirements for dynamic programming
Dynamic programming requires two properties: optimal substructure (the optimal solution can be decomposed into subproblems) and overlapping subproblems (subproblems recur many times, so their solutions can be cached and reused).
Markov decision processes satisfy both properties
Planning by dynamic programming
Iterative policy evaluation
In [8]:
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

gridworld_str = """*---
----
----
---*"""  # * => exit. - => empty square

reward_by_symbol = {
    '*': 0,
    '-': -1
}

rows = len(gridworld_str.split('\n'))
cols = len(gridworld_str.split('\n')[0])
print('rows %s cols %s' % (rows, cols))

V = torch.zeros(rows, cols)
print('V', V)

# parse the gridworld string into a 2-d list of symbols
gridworld = []
gridworld_rows_str = gridworld_str.split('\n')
for r in range(rows):
    row_str = gridworld_rows_str[r]
    row = []
    for c in range(cols):
        sym = row_str[c]
        row.append(sym)
    gridworld.append(row)

def get_R(gridworld):
    # reward for each square, looked up by its symbol
    rows = len(gridworld)
    cols = len(gridworld[0])
    R = torch.zeros(rows, cols)
    for r in range(rows):
        row = gridworld[r]
        for c in range(cols):
            char = row[c]
            reward = reward_by_symbol[char]
            R[r][c] = reward
    return R

R = get_R(gridworld)
print('R', R)

def random_policy(V, s):
    # uniform random policy: each of the four moves with probability 0.25
    action_probs = [
        {'a': [0, -1], 'p': 0.25},
        {'a': [0, 1], 'p': 0.25},
        {'a': [-1, 0], 'p': 0.25},
        {'a': [1, 0], 'p': 0.25}
    ]
    for a_p in action_probs:
        a_p['a'] = torch.IntTensor(a_p['a'])
    return action_probs

def apply_a(s, a):
    # moves that would leave the grid leave the state unchanged along that axis
    s_new = s + a
    s_new = s_new.clamp(min=0)
    s_new[0] = min(s_new[0], rows - 1)
    s_new[1] = min(s_new[1], cols - 1)
    return s_new

def a_to_str(a):
    return '\n'.join(str(a.view(1, -1)).split('\n')[1:-2])

def s_to_str(s):
    return '\n'.join(str(s.view(1, -1)).split('\n')[1:-2])

def is_terminal(s):
    return gridworld[s[0]][s[1]] == '*'

K = 100

def value_iterate(p_a_for_s):
    # iterative policy evaluation: K synchronous sweeps of the Bellman expectation
    # backup for the policy given by p_a_for_s(V, s)
    V = torch.zeros(rows, cols)
    # print('V', V)
    for k in range(K):
        V_new = torch.zeros(rows, cols)
        for s_i in range(rows):
            for s_j in range(cols):
                s = torch.IntTensor([s_i, s_j])
                v_new = 0
                if not is_terminal(s):
                    for a_p in p_a_for_s(V, s):
                        a, p = a_p['a'], a_p['p']
                        s_new = apply_a(s, a)
                        v_next = V[s_new[0], s_new[1]]
                        v_new += p * v_next
                    r = R[s[0], s[1]]
                    v_new += r
                V_new[s_i, s_j] = v_new
        V = V_new
        if k < 3:
            print('k', k + 1)
            print(V)
    return V

V_random = value_iterate(random_policy)
print('V_random', V_random)

def optimum_policy_hardcoded(V, s):
    # hand-written optimal policy for this gridworld; each letter is an equally
    # likely move: u/d/l/r => up/down/left/right
    policy = [
        ['*', 'l', 'l', 'lr'],
        ['u', 'ul', 'lr', 'd'],
        ['u', 'ru', 'dr', 'd'],
        ['ru', 'r', 'r', '*']
    ]
    actions = policy[s[0]][s[1]]
    p = 1 / len(actions)
    action_probs = []
    for c in actions:
        a = {
            'u': [-1, 0],
            'd': [1, 0],
            'l': [0, -1],
            'r': [0, 1]
        }[c]
        a = torch.IntTensor(a)
        action_probs.append({'a': a, 'p': p})
    # print(action_probs)
    return action_probs

V_optimum = value_iterate(optimum_policy_hardcoded)
print('V_optimum', V_optimum)
How to improve a policy
Policy improvement
Policy improvement (2)
If improvements stop, i.e.: $$ v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s, a) $$
This is the Bellman optimality equation
Therefore:
$$ v_\pi(s) = v_*(s) \quad \forall s \in \mathcal{S} $$
Therefore, $\pi$ is an optimal policy.
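The argument above suggests policy iteration: repeatedly evaluate the current policy and then act greedily with respect to its value function until the policy stops changing. A rough sketch on the gridworld above (assuming the earlier cell defining rows, cols, R, apply_a, is_terminal and value_iterate has been run; the greedy_action helper is not part of the original code):

ACTIONS = [torch.IntTensor(a) for a in ([-1, 0], [1, 0], [0, -1], [0, 1])]

def greedy_action(V, s):
    # argmax_a [ R(s) + V(s') ]; the gridworld is undiscounted and episodic
    best_a, best_v = None, None
    for a in ACTIONS:
        s_new = apply_a(s, a)
        v = R[s[0], s[1]] + V[s_new[0], s_new[1]]
        if best_v is None or v > best_v:
            best_a, best_v = a, v
    return best_a

# start from the greedy policy w.r.t. V = 0, then alternate evaluation and improvement
policy = [[greedy_action(torch.zeros(rows, cols), torch.IntTensor([r, c]))
           for c in range(cols)] for r in range(rows)]
for it in range(10):
    def pi(V, s, policy=policy):
        return [{'a': policy[s[0]][s[1]], 'p': 1.0}]
    V_pi = value_iterate(pi)  # policy evaluation: K sweeps of the expectation backup
    new_policy = [[greedy_action(V_pi, torch.IntTensor([r, c]))
                   for c in range(cols)] for r in range(rows)]
    if all(torch.equal(new_policy[r][c], policy[r][c])
           for r in range(rows) for c in range(cols)):
        print('policy stable after', it + 1, 'iterations')
        break
    policy = new_policy
print('V_pi', V_pi)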
Notes on v and q
$$ v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] $$
For a deterministic policy:
$$ a = \pi(s) \\ v_\pi(s) = q_\pi(s, \pi(s)) $$
If we have the greedy deterministic policy:
$$ \pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s, a) $$
... then:
$$ q_{\pi}(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) $$
And we have:
$$ \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s) $$
and therefore, by the policy improvement theorem:
$$ v_{\pi'}(s) \ge v_\pi(s) $$
In [9]:
def optimum_policy(V, s):
    # greedy policy improvement: pick the action maximising R(s) + V(s')
    # with respect to the current value estimate V
    p = 1.0
    a_best = None
    v_best = None
    for a in [
        [-1, 0], [1, 0], [0, -1], [0, 1]
    ]:
        a = torch.IntTensor(a)
        # print('s', s)
        s_new = apply_a(s, a)
        v_new = R[s[0], s[1]] + V[s_new[0], s_new[1]]
        if v_best is None or v_new > v_best:
            a_best = a
            v_best = v_new
    # print('a_best', a_best, 'v_best', v_best)
    action_probs = [{'a': a_best, 'p': p}]
    return action_probs

V_optimum = value_iterate(optimum_policy)
print('V_optimum', V_optimum)
Principle of Optimality
Any optimal policy can be subdivided into two components: an optimal first action $A_*$, followed by an optimal policy from the successor state $S'$.
Theorem (Principle of optimality)
A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if, for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$.
Deterministic value iteration
If we know the solution to the subproblems $v_*(s')$, then $v_*(s)$ can be found by a one-step lookahead; value iteration applies this update iteratively, intuitively working backwards from the "end" of the problem.
Value iteration
$$ \mathbf{v}_{k+1} = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v}_k \right) $$
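A rough sketch of value iteration on the gridworld above (assuming the earlier cell defining rows, cols, R, apply_a and is_terminal has been run; this loop is not part of the original notebook code):

ACTIONS = [torch.IntTensor(a) for a in ([-1, 0], [1, 0], [0, -1], [0, 1])]

V = torch.zeros(rows, cols)
for k in range(100):
    V_new = torch.zeros(rows, cols)
    for r in range(rows):
        for c in range(cols):
            s = torch.IntTensor([r, c])
            if is_terminal(s):
                continue
            candidates = []
            for a in ACTIONS:
                s_new = apply_a(s, a)
                candidates.append(R[r, c] + V[s_new[0], s_new[1]])
            # v_{k+1}(s) = max_a [ R(s) + v_k(s') ]   (gamma = 1 in this episodic gridworld)
            V_new[r, c] = max(candidates)
    if torch.equal(V_new, V):
        print('converged after', k, 'sweeps')
        break
    V = V_new
print(V)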
Synchronous dynamic programming algorithms

Problem | Bellman Equation | Algorithm |
---|---|---|
Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
Control | Bellman Optimality Equation | Value Iteration |
Asynchronous dynamic programming
Three simple ideas for asynchronous dynamic programming:
In-place dynamic programming
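For illustration only (not from the original notebook): synchronous value iteration keeps two copies of the value function, whereas in-place value iteration overwrites a single copy, so later states in the same sweep already see the new values. A minimal sketch on the gridworld above, reusing rows, cols, R, apply_a and is_terminal from the earlier cell:

ACTIONS = [torch.IntTensor(a) for a in ([-1, 0], [1, 0], [0, -1], [0, 1])]

V = torch.zeros(rows, cols)  # a single value array, updated in place
for sweep in range(100):
    delta = 0.0
    for r in range(rows):
        for c in range(cols):
            s = torch.IntTensor([r, c])
            if is_terminal(s):
                continue
            candidates = []
            for a in ACTIONS:
                s_new = apply_a(s, a)
                candidates.append(R[r, c] + V[s_new[0], s_new[1]])
            v_new = max(candidates)
            delta = max(delta, abs(float(v_new - V[r, c])))
            V[r, c] = v_new  # overwrite immediately; later states in this sweep see the new value
    if delta < 1e-9:
        print('converged after', sweep + 1, 'in-place sweeps')
        break
print(V)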
Prioritized sweeping
Use the magnitude of the Bellman error to guide state selection:
$$ \left| \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') - v(s) \right) \right| $$
Real-time dynamic programming
Full-Width Backups
Sample backups
Approximate dynamic programming