# "Reinforcement learning", David Silver

## Lecture 1

Notation:

• history: $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$
• State, a function of history: $S_t = f(H_t)$
• environment state $S_t^e$
• agent state $S_t^a$
• can also be any function of history, $S_t^a = f(H_t)$

Information State

A state $S_t$ is Markov if and only if:

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1,\dots, S_t)$$
• the future is independent of the past, given the present:
$$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$$
• once the state is known, the history may be thrown away
• the environment state $S_t^e$ is Markov
• the history $H_t$ is Markov

Fully Observable Environments

Full observability: agent directly observes environment state:

$$O_t = S_t^a = S_t^e$$
• agent state = environment state = information state
• formally this is a Markov decision process (MDP)

Partially Observable Environments

• agent indirectly observes environment
• now agent state $\ne$ environment state
• formally this is a partially observable Markov decision process (POMDP)
• agent must construct its own state representation $S_t^a$, eg:
• complete history $S_t^a = H_t$
• beliefs of environment state $S_t^a = (P(S_t^e=s^1), \dots, P(S_t^e=s^n))$
• recurrent neural network: $S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o)$
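
A minimal PyTorch sketch of the recurrent update above; the dimensions and the random weights are placeholders, not anything from the lecture:

```python
import torch

# S_t^a = sigma(S_{t-1}^a W_s + O_t W_o), with made-up sizes
state_dim, obs_dim = 8, 4
W_s = torch.randn(state_dim, state_dim)  # recurrent weights (placeholder)
W_o = torch.randn(obs_dim, state_dim)    # observation weights (placeholder)

s_prev = torch.zeros(1, state_dim)       # agent state S_{t-1}^a
o_t = torch.randn(1, obs_dim)            # observation O_t

s_t = torch.sigmoid(s_prev @ W_s + o_t @ W_o)  # new agent state S_t^a
print(s_t.shape)  # torch.Size([1, 8])
```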

Major components of an RL agent

• an RL agent may include one or more of these components:
• policy: agent's behavior function
• value function: how good is each state and/or action
• model: agent's representation of the environment

Policy

• a policy is the agent's behavior
• it is a map from state to action, eg:
• deterministic policy: $a = \pi(s)$
• stochastic policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$

Value Function

• value function is a prediction of future reward
• used to evaluate the goodness/badness of states
• and therefore to select between actions, eg:
$$v_\pi(s) = \mathbb{E}_\pi(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s)$$

Model

• a model predicts what the environment will do next
• $\mathcal{P}$ predicts the next state
• $\mathcal{R}$ predicts the next (immediate) reward, eg:
$$\mathcal{P}_{ss'}^a = P(S_{t+1} = s' \mid S_t = s, A_t = a) \\ \mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$

Categorizing RL agents (1)

• Value based
  • No Policy (implicit)
  • Value Function
• Policy based
  • Policy
  • No Value Function
• Actor Critic
  • Policy
  • Value Function

## Lecture 2

• MDPs formally describe an environment for reinforcement learning
• where the environment is fully observable
• the current state completely characterises the process
• almost all RL problems can be formalised as MDPs, eg:
• optimal control primarily deals with continuous MDPs
• partially observable problems can be converted into MDPs
• bandits are MDPs with one state

Markov property

• "the future is independent of the past given the present"
• state captures all relevant information from the history
• once the state is known, the history may be thrown away
• ie the state is a sufficient statistic of the future

State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is defined by:

$$\mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s)$$

State transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$,

$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{bmatrix}$$

... where each row of the matrix sums to 1.

Markov Process

A Markov process is a memoryless random process, ie a sequence of random states $S_1, S_2, \dots$ with the Markov property.

A Markov process (or Markov Chain) is a tuple $<\mathcal{S}, \mathcal{P}>$:

• $\mathcal{S}$ is a (finite) set of states
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s)$
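
As a quick illustration, a numpy sketch of a hypothetical 3-state Markov chain: each row of $\mathcal{P}$ sums to 1, and a trajectory is sampled by repeatedly drawing the next state from the current state's row:

```python
import numpy as np

# hypothetical 3-state Markov chain; the numbers are made up
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.2, 0.8]])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution

rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    s = int(rng.choice(3, p=P[s]))  # S_{t+1} ~ P[s, .]
    trajectory.append(s)
print(trajectory)
```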

Markov Reward Process

A Markov reward process is a Markov chain with values

A Markov Reward Process is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$.

• $\mathcal{S}$ is a finite set of states
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s)$
• $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
• $\gamma$ is a discount factor, $\gamma \in [0,1]$

Return

The return $G_t$ is the total discounted reward from time-step $t$

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
• the discount $\gamma \in [0,1]$ is the present value of future rewards
• the value of receiving reward $R$ after $k+1$ time-steps is $\gamma^kR$
• this values immediate reward above delayed reward
• $\gamma$ close to $0$ leads to 'myopic' evaluation
• $\gamma$ close to $1$ leads to 'far-sighted' evaluation
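
A quick numeric check of the return formula, on a made-up reward sequence:

```python
import numpy as np

# G_t = sum_k gamma^k R_{t+k+1} on an arbitrary reward sequence
rewards = np.array([1.0, -1.0, 2.0, 0.5])  # R_{t+1}, R_{t+2}, ...
gamma = 0.9
G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)  # 1.0 - 0.9 + 1.62 + 0.3645 = 2.0845
```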

Value Function

The value function $v(s)$ gives the long-term value of state $s$.

The state value function $v(s)$ of an MRP is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t =s ]$$

Bellman Equation for MRPs

The value function can be decomposed into two parts:

• immediate reward $R_{t+1}$
• discounted value of successor state $\gamma v(S_{t+1})$
$$v(s) = \mathbb{E}[G_t \mid S_t = s] \\ = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\ = \mathbb{E}[R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \dots ) \mid S_t = s] \\ = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] \\ = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices:

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

where:

• $v$ is a column vector, with one entry per state
• $\mathcal{R}$ is a column vector with one entry per state
$$\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{bmatrix} \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix}$$

Solving the Bellman Equation

• the Bellman equation is a linear equation
• it can be solved directly (see the sketch after this list):
$$v = \mathcal{R} + \gamma \mathcal{P} v \\ (I - \gamma \mathcal{P})v = \mathcal{R} \\ v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$
• computational complexity is $O(n^3)$ for $n$ states
• direct solution only possible for small MRPs
• many iterative methods for large MRPs, eg:
• dynamic programming
• monte-carlo evaluation
• temporal-difference learning
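
A minimal numpy sketch of the direct solution above, on a hypothetical two-state MRP (state 1 is absorbing with zero reward; all numbers are made up):

```python
import numpy as np

# v = (I - gamma P)^{-1} R for a tiny two-state MRP
P = np.array([[0.5, 0.5],    # state 0: stay or move to state 1
              [0.0, 1.0]])   # state 1: absorbing
R = np.array([1.0, 0.0])
gamma = 0.9

# solve (I - gamma P) v = R rather than forming the inverse explicitly
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # v(0) = 1 / (1 - 0.45) ≈ 1.818, v(1) = 0
```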

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

A Markov Decision Process is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$

• $\mathcal{S}$ is a finite set of states
• $\mathcal{A}$ is a finite set of actions
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
• $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t =a ]$
• $\gamma$ is a discount factor $\gamma \in [0,1]$

Policies

A policy $\pi$ is a distribution over actions given states,

$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$
• a policy fully defines the behavior of an agent
• MDP policies depend on the current state (not the history)
• ie policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$

Policies (2)

• given an MDP $\mathcal{M} = <\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and a policy $\pi$
• the state sequence $S_1, S_2, \dots$ is a Markov process $<\mathcal{S}, \mathcal{P}^\pi>$
• the state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $<\mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma>$, where:
$$\mathcal{P}_{s,s'}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s ) \mathcal{P}_{ss'}^a \\ \mathcal{R}_{s}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s ) \mathcal{R}_{s}^a$$
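
A sketch of averaging the dynamics and rewards over a policy, on a made-up two-state, two-action MDP with a uniform random policy, then solving for $v_\pi$ with the same linear solve as for MRPs:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # P[a, s, s'] (made-up MDP)
P[0] = [[1.0, 0.0], [0.0, 1.0]]  # action 0: stay
P[1] = [[0.0, 1.0], [1.0, 0.0]]  # action 1: switch
R = np.array([[0.0, 1.0],        # R[s, a]
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],       # pi[s, a]: uniform random policy
               [0.5, 0.5]])

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
print(v_pi)  # [7.25, 7.75]
```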

Value function

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation

For the state-value function:

$$v_\pi(s) = \mathbb{E}_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ]$$

For the action-value function:

$$q_\pi(s, a) = \mathbb{E}_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ]$$

Relating $v_\pi$ and $q_\pi$ to each other:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')$$

Substituting each into the other gives recursions in $v_\pi$ and $q_\pi$ alone:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$
$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$

Bellman Expectation Equation (matrix form)

The Bellman expectation equation can be expressed concisely using the induced MRP:

$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$$

with direct solution:

$$v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$$

Optimal policy

Define a partial ordering over policies:

$$\pi \ge \pi' \text{ if } v_\pi(s) \ge v_{\pi'}(s), \forall s$$

For any Markov Decision Process:

• there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \forall \pi$
• all optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$
• all optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s,a)$

Bellman Optimality Equation for $v_*$

The optimal value functions are recursively related by the Bellman optimality equations:

$$v_*(s) = \max_a q_*(s, a) \\ q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s')$$
$$v_*(s) = \max_a \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right)$$
$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$

Solving the Bellman Optimality Equation

• Bellman Optimality Equation is non-linear
• No closed form solution (in general)
• Many iterative solution methods:
• value iteration
• policy iteration
• Q-learning
• Sarsa

## Lecture 3: Planning by Dynamic Programming

What is Dynamic Programming

• method for solving complex problems
• break down into subproblems
• solve the subproblems
• combine solutions to subproblems

Requirements for dynamic programming

• optimal substructure
  • optimal solution can be decomposed into subproblems
• overlapping subproblems
  • subproblems recur many times
  • solutions can be cached and reused
• Markov decision processes satisfy both properties
  • Bellman equation gives recursive decomposition
  • Value function stores and reuses solutions

Planning by dynamic programming

• DP assumes full knowledge of the MDP
• used for planning in an MDP
• for prediction:
• input: MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and policy $\pi$
• or: MRP $<\mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma>$
• output: value function $v_\pi$
• or for control:
• input: MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$
• output: optimal value function $v_*$
• and optimal policy $\pi_*$

Iterative policy evaluation

• problem: evaluate a given policy $\pi$
• solution: iterative application of Bellman expectation backup
• $v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_\pi$
• using synchronous backups:
• at each iteration $k + 1$
• for all states $s \in \mathcal{S}$
• update $v_{k+1}(s)$ from $v_k(s')$
$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right)$$


In [8]:

```python
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

gridworld_str = """*---
----
----
---*"""  # * => exit. - => empty square

reward_by_symbol = {
    '*': 0,
    '-': -1
}

rows = len(gridworld_str.split('\n'))
cols = len(gridworld_str.split('\n')[0])
print('rows %s cols %s' % (rows, cols))
V = torch.zeros(rows, cols)
print('V', V)

# parse the grid string into a 2-D list of symbols
gridworld = []
gridworld_rows_str = gridworld_str.split('\n')
for r in range(rows):
    row_str = gridworld_rows_str[r]
    row = []
    for c in range(cols):
        sym = row_str[c]
        row.append(sym)
    gridworld.append(row)

def get_R(gridworld):
    # reward matrix: -1 per step, 0 at the exit squares
    rows = len(gridworld)
    cols = len(gridworld[0])
    R = torch.zeros(rows, cols)
    for r in range(rows):
        row = gridworld[r]
        for c in range(cols):
            char = row[c]
            reward = reward_by_symbol[char]
            R[r][c] = reward
    return R

R = get_R(gridworld)
print('R', R)

def random_policy(V, s):
    # uniform random policy: each move (left, right, up, down) with p = 0.25
    action_probs = [
        {'a': [0, -1], 'p': 0.25},
        {'a': [0, 1], 'p': 0.25},
        {'a': [-1, 0], 'p': 0.25},
        {'a': [1, 0], 'p': 0.25}
    ]
    for a_p in action_probs:
        a_p['a'] = torch.IntTensor(a_p['a'])
    return action_probs

def apply_a(s, a):
    # take action a from state s, clamping at the grid boundary
    s_new = s + a
    s_new = s_new.clamp(min=0)
    s_new[0] = min(s_new[0], rows - 1)
    s_new[1] = min(s_new[1], cols - 1)
    return s_new

def a_to_str(a):
    # (unused helper) compact string form of an action tensor
    return '\n'.join(str(a.view(1, -1)).split('\n')[1:-2])

def s_to_str(s):
    # (unused helper) compact string form of a state tensor
    return '\n'.join(str(s.view(1, -1)).split('\n')[1:-2])

def is_terminal(s):
    return gridworld[s[0]][s[1]] == '*'

K = 100

def value_iterate(p_a_for_s):
    # iterative policy evaluation: K synchronous sweeps of the
    # Bellman expectation backup under the policy p_a_for_s
    V = torch.zeros(rows, cols)
    for k in range(K):
        V_new = torch.zeros(rows, cols)
        for s_i in range(rows):
            for s_j in range(cols):
                s = torch.IntTensor([s_i, s_j])
                v_new = 0
                if not is_terminal(s):
                    for a_p in p_a_for_s(V, s):
                        a, p = a_p['a'], a_p['p']
                        s_new = apply_a(s, a)
                        v_next = V[s_new[0], s_new[1]]
                        v_new += p * v_next
                    r = R[s[0], s[1]]
                    v_new += r
                V_new[s_i, s_j] = v_new
        V = V_new
        if k < 3:
            print('k', k + 1)
            print(V)
    return V

V_random = value_iterate(random_policy)
print('V_random', V_random)

def optimum_policy_hardcoded(V, s):
    # hand-written policy: the letters give the equiprobable
    # actions (u/d/l/r) taken from each square
    policy = [
        ['*', 'l', 'l', 'lr'],
        ['u', 'ul', 'lr', 'd'],
        ['u', 'ru', 'dr', 'd'],
        ['ru', 'r', 'r', '*']
    ]
    actions = policy[s[0]][s[1]]
    p = 1 / len(actions)
    action_probs = []
    for c in actions:
        a = {
            'u': [-1, 0],
            'd': [1, 0],
            'l': [0, -1],
            'r': [0, 1]
        }[c]
        a = torch.IntTensor(a)
        action_probs.append({'a': a, 'p': p})
    return action_probs

V_optimum = value_iterate(optimum_policy_hardcoded)
print('V_optimum', V_optimum)
```




```
rows 4 cols 4
V
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
[torch.FloatTensor of size 4x4]

R
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 1
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 2
 0.0000 -1.7500 -2.0000 -2.0000
-1.7500 -2.0000 -2.0000 -2.0000
-2.0000 -2.0000 -2.0000 -1.7500
-2.0000 -2.0000 -1.7500  0.0000
[torch.FloatTensor of size 4x4]

k 3
 0.0000 -2.4375 -2.9375 -3.0000
-2.4375 -2.8750 -3.0000 -2.9375
-2.9375 -3.0000 -2.8750 -2.4375
-3.0000 -2.9375 -2.4375  0.0000
[torch.FloatTensor of size 4x4]

V_random
  0.0000 -13.9426 -19.9149 -21.9048
-13.9426 -17.9251 -19.9155 -19.9149
-19.9149 -19.9155 -17.9251 -13.9426
-21.9048 -19.9149 -13.9426   0.0000
[torch.FloatTensor of size 4x4]

k 1
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 2
 0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -1
-2 -2 -1  0
[torch.FloatTensor of size 4x4]

k 3
 0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]

V_optimum
 0 -1 -2 -4
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]
```



How to improve a policy

• given a policy $\pi$:
• evaluate the policy $\pi$: $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]$
• improve the policy by acting greedily with respect to $v_\pi$: $\pi' = \text{greedy}(v_\pi)$
• in Small Gridworld the improved policy was optimal, $\pi' = \pi_*$
• in general, more iterations of improvement/evaluation are needed
• but this process of policy iteration always converges to $\pi_*$ (see the sketch below)
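
A compact numpy sketch of the whole policy-iteration loop, reusing the made-up two-state MDP from the Lecture 2 sketches: exact evaluation via the induced-MRP linear solve, then greedy improvement, until the policy is stable:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # P[a, s, s'] (made-up MDP)
P[0] = [[1.0, 0.0], [0.0, 1.0]]  # action 0: stay
P[1] = [[0.0, 1.0], [1.0, 0.0]]  # action 1: switch
R = np.array([[0.0, 1.0],        # R[s, a]
              [2.0, 0.0]])

pi = np.zeros(n_s, dtype=int)    # deterministic policy: a = pi[s]
while True:
    # policy evaluation: v_pi = (I - gamma P^pi)^{-1} R^pi
    P_pi = P[pi, np.arange(n_s)]         # row s of P[pi[s]]
    R_pi = R[np.arange(n_s), pi]
    v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
    # policy improvement: act greedily with respect to q_pi
    q = R.T + gamma * P @ v              # q[a, s]
    pi_new = q.argmax(axis=0)
    if np.array_equal(pi_new, pi):
        break                            # greedy policy stable => optimal
    pi = pi_new
print('pi_*', pi, 'v_*', v)  # pi_* = [1 0], v_* = [19, 20]
```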

Policy improvement

• consider a deterministic policy $a = \pi(s)$
• we can improve the policy by acting greedily: $$\def\argmax{\text{argmax}} \pi'(s) = \argmax_{a \in \mathcal{A}} q_\pi(s, a)$$
• this improves the value from any state $s$ over one step $$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$
• it therefore improves the value function, $v_{\pi'}(s) \ge v_\pi(s)$
$$v_\pi(s) = q_\pi(s, \pi(s)) \le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \le \dots \le v_{\pi'}(s)$$

(the full chain of inequalities is expanded step by step in "Notes on v and q" below)

Policy improvement (2)

• if improvements stop: $$v_\pi(s) = q_\pi(s, \pi(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi'(s))$$

ie: $$v_\pi(s) = \max_{a \in \mathcal{A}}q_\pi(s, a)$$

This is the Bellman optimality equation

Therefore:

$$v_\pi(s) = v_*(s) \forall s \in \mathcal{S}$$

Therefore, $\pi$ is an optimal policy

up to 59:15

Notes on v and q

$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]$$

For deterministic policy

$$a = \pi(s) \\ v_\pi(s) = q_\pi(s, \pi(s))$$

If we have the greedy deterministic policy: $$\pi'(s) = \argmax_{a \in \mathcal{A}} q_\pi(s, a)$$ ... then: $$q_{\pi}(s,\pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a)$$ And we have: $$\max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$ and: 

$$v_\pi(s) \\ = q_\pi(s, \pi(s)) \\ \le q_\pi(s, \pi'(s)) \\ = \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]$$
$$= \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi(S_{t+1})) \mid S_t = s\right]$$
$$\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right]$$
$$= \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi(S_{t+2})) \mid S_t = s \right]$$
$$\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right]$$
$$\le \mathbb{E}_{\pi'} \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$$
$$=v_{\pi'}(s)$$


In [9]:

```python
def optimum_policy(V, s):
    # greedy policy improvement: pick the action whose successor
    # state has the highest value under the current V (gamma = 1)
    a_best = None
    v_best = None
    for a in [
        [-1, 0], [1, 0], [0, -1], [0, 1]
    ]:
        a = torch.IntTensor(a)
        s_new = apply_a(s, a)
        v_new = R[s[0], s[1]] + V[s_new[0], s_new[1]]
        if v_best is None or v_new > v_best:
            a_best = a
            v_best = v_new
    action_probs = [{'a': a_best, 'p': 1.0}]
    return action_probs

V_optimum = value_iterate(optimum_policy)
print('V_optimum', V_optimum)
```




```
k 1
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1  0
[torch.FloatTensor of size 4x4]

k 2
 0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -1
-2 -2 -1  0
[torch.FloatTensor of size 4x4]

k 3
 0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]

V_optimum
 0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1  0
[torch.FloatTensor of size 4x4]
```



Principle of Optimality

Any optimal policy can be subdivided into two components:

• an optimal first action $A_*$
• followed by an optimal policy from successor state $S'$

Theorem (Principle of optimality)

• a policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if:
• for any state $s'$ reachable from $s$
• $\pi$ achieves the optimal value from state $s'$, $v_\pi(s') = v_*(s')$

Deterministic value iteration

• if we know the solution to subproblems $v_*(s')$
• then the solution $v_*(s)$ can be found by one-step lookahead $$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$$
• the idea of value iteration is to apply these updates iteratively
• still works with loopy, stochastic MDPs

Value iteration

• problem: find the optimal value function $v_*$
• solution: iterative application of the Bellman optimality backup, $v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_*$, in matrix form:
$$\mathbf{v}_{k+1} = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v}_k \right)$$
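
A vectorized numpy sketch of this backup, on the same made-up two-state MDP as the earlier sketches; the fixed point matches the policy-iteration result:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # P[a, s, s'] (made-up MDP)
P[0] = [[1.0, 0.0], [0.0, 1.0]]
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])       # R[s, a]

v = np.zeros(n_s)
for k in range(1000):
    v_new = (R.T + gamma * P @ v).max(axis=0)  # max over actions
    if np.max(np.abs(v_new - v)) < 1e-8:       # stop near the fixed point
        break
    v = v_new
print(v)  # ≈ [19, 20]
```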

Synchronous dynamic programming algorithms

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
• algorithms are based on state-value function $v_\pi(s)$ or $v_*(s)$
• complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
• could also apply to action-value function $q_\pi(s,a)$ or $q_*(s,a)$
• complexity $O(m^2n^2)$ per iteration

Asynchronous dynamic programming

Three simple ideas for asynchronous dynamic programming:

• in-place dynamic programming
• prioritized sweeping
• real-time dynamic programming

In-place dynamic programming

• only store one copy of value function for all $s$ in $\mathcal{S}$
$$v(s) \leftarrow \max_{a \in \mathcal{A}}\left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \right)$$
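
A sketch of the in-place variant on the same made-up MDP: a single value array, so states later in a sweep already see this sweep's updates (contrast with the synchronous two-array version above):

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # made-up MDP, as in the earlier sketches
P[0] = [[1.0, 0.0], [0.0, 1.0]]
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

v = np.zeros(n_s)                # only one copy of the value function
for sweep in range(300):
    for s in range(n_s):
        # this update is immediately visible to later states in the sweep
        v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_a))
print(v)  # ≈ [19, 20]
```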

Prioritized sweeping

• use magnitude of Bellman error to guide state selection, eg: $$\left| \max_{a \in \mathcal{A}} \left(  \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \right) - v(s) \right|$$
• backup the state with the largest remaining Bellman error
• update Bellman error of affected states after each backup
• requires knowledge of reverse dynamics (predecessor states)
• can be implemented efficiently by maintaining a priority queue
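
A rough sketch with a binary heap (`heapq`), again on the made-up two-state MDP; a real implementation would track and discard stale queue entries rather than re-pushing freely:

```python
import heapq
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_a, n_s, n_s))    # made-up MDP, as in the earlier sketches
P[0] = [[1.0, 0.0], [0.0, 1.0]]
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def bellman_error(v, s):
    backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_a))
    return abs(backup - v[s])

# reverse dynamics: predecessors[t] = states that can transition into t
predecessors = [{s for a in range(n_a) for s in range(n_s) if P[a, s, t] > 0}
                for t in range(n_s)]

v = np.zeros(n_s)
queue = [(-bellman_error(v, s), s) for s in range(n_s)]  # max-heap via negation
heapq.heapify(queue)
for _ in range(2000):
    neg_err, s = heapq.heappop(queue)    # state with largest recorded error
    if -neg_err < 1e-8:
        break
    v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_a))
    for p in predecessors[s]:            # this backup changed their errors
        heapq.heappush(queue, (-bellman_error(v, p), p))
print(v)  # ≈ [19, 20]
```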

Real-time dynamic programming

• idea: only update states that are relevant to the agent
• use agent's experience to guide the selection of states
• after each time-step $S_t$, $A_t$, $R_{t+1}$
• backup the state $S_t$ $$v(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{S_t s'} v(s') \right)$$

Full-Width Backups

• DP uses full-width backups
• for each backup (sync or async)
• every successor state and action is considered
• using knowledge of the MDP transitions and reward functions
• DP is effective for medium-sized problems (millions of states)
• for large problems DP suffers Bellman's curse of dimensionality
• number of states $n = |\mathcal{S}|$ grows exponentially with number of state variables
• even one backup can be too expensive

Sample backups

• in subsequent lectures we will consider sample backups
• using sample rewards and sample transitions $<\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{S'}>$
• instead of reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$
• cost of backup is constant, independent of $n = |\mathcal{S}|$
Approximate dynamic programming

• using a function approximator $\hat{v}(s, \mathbf{w})$
• apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$
• eg Fitted Value Iteration repeats at each iteration $k$:
• sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$
• for each state $s \in \tilde{\mathcal{S}}$, estimate the target value using the Bellman optimality equation, $$\tilde{v}_k(s) = \max_{a \in \mathcal{A}}\left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, \hat{v}(s', \mathbf{w}_k) \right)$$
• train the next value function $\hat{v}(\cdot, \mathbf{w}_{k+1})$ using targets $\{<s, \tilde{v}_k(s)>\}$
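
A rough PyTorch sketch of this loop on a small random MDP; the one-hot features, the linear $\hat{v}$, and all sizes are placeholders, and "sampled states" is simply all states here:

```python
import torch

torch.manual_seed(0)
n_s, n_a, gamma = 8, 2, 0.9
P = torch.softmax(torch.randn(n_a, n_s, n_s), dim=-1)  # random dynamics P[a, s, s']
R = torch.randn(n_s, n_a)                              # random rewards R[s, a]

phi = torch.eye(n_s)              # one-hot state features (placeholder)
v_hat = torch.nn.Linear(n_s, 1)   # linear function approximator v_hat(s, w)
opt = torch.optim.SGD(v_hat.parameters(), lr=0.5)

for k in range(200):
    # targets from the Bellman optimality backup through v_hat(., w_k)
    with torch.no_grad():
        v = v_hat(phi).squeeze(-1)                    # v_hat(s', w_k)
        targets = (R.T + gamma * P @ v).max(dim=0).values
    # fit w_{k+1} by regressing v_hat onto the frozen targets
    for _ in range(20):
        loss = ((v_hat(phi).squeeze(-1) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
print(v_hat(phi).squeeze(-1))
```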