Notation:
Information State
A state $S_t$ is Markov if and only if:
$$ P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1,\dots, S_t) $$
Fully Observable Environments
Full observability: the agent directly observes the environment state:
$$ O_t = S_t^a = S_t^e $$
Partially Observable Environments
Major components of an RL agent
Policy
Value Function
Model
Categorizing RL agents (1)
Markov property
State Transition Matrix
For a Markov state $s$ and successor state $s'$, the state transition probability is defined by:
$$ \mathcal{P}_{ss'} = P(S_{t+1} = s' \mid S_t = s) $$
State transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$,
$$ \mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix} $$
... where rows index the current ("from") state, columns index the successor ("to") state, and each row of the matrix sums to 1.
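A minimal sketch (the 3-state chain below is made up for illustration, not taken from the lectures): store $\mathcal{P}$ as a tensor whose rows sum to 1; multiplying a state distribution by $\mathcal{P}$ advances it one step.

import torch

# hypothetical 3-state Markov chain; entry [i, j] = P(next state = j | current state = i)
P = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
])
assert torch.allclose(P.sum(dim=1), torch.ones(3))  # each row sums to 1

# distribution over states after 3 steps, starting in state 0
mu0 = torch.tensor([1.0, 0.0, 0.0])
mu3 = mu0 @ torch.matrix_power(P, 3)
print(mu3)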
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
A Markov process (or Markov Chain) is a tuple $<\mathcal{S}, \mathcal{P}>$, where $\mathcal{S}$ is a (finite) set of states and $\mathcal{P}$ is the state transition probability matrix.
Markov Reward Process
A Markov reward process is a Markov chain with values
A Markov Reward Process is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$, where $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, and $\gamma \in [0, 1]$ is a discount factor.
Return
The return $G_t$ is the total discounted reward from time-step $t$
$$ G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} $$
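A minimal sketch (the reward sequence here is made up for illustration): the return over a finite episode can be accumulated backwards via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma * R_{t+2} + ..., computed backwards as G_t = R_{t+1} + gamma * G_{t+1}
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([-1, -1, -1, 0], gamma=0.9))  # rewards received after time t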
Value Function
The value function $v(s)$ gives the long-term value of state $s$.
The state value function $v(s)$ of an MRP is the expected return starting from state $s$:
$$ v(s) = \mathbb{E}[G_t \mid S_t = s] $$
Bellman Equation for MRPs
The value function can be decomposed into two parts: the immediate reward $R_{t+1}$, and the discounted value of the successor state $\gamma v(S_{t+1})$:
$$ v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] $$
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices:
$$ v = \mathcal{R} + \gamma \mathcal{P} v $$
... where $v$ is a column vector with one entry per state, $\mathcal{R}$ is the corresponding vector of expected rewards, and $\mathcal{P}$ is the state transition matrix.
Solving the Bellman Equation
The Bellman equation is linear and can be solved directly:
$$ v = (I - \gamma \mathcal{P})^{-1} \mathcal{R} $$
The matrix inversion costs $O(n^3)$ for $n$ states, so the direct solution is only practical for small MRPs; larger problems use iterative methods such as dynamic programming, Monte-Carlo evaluation, and temporal-difference learning.
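A minimal sketch of the direct solution, using the same hypothetical 3-state chain as above (the rewards are made up for illustration):

import torch

gamma = 0.9
P = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R = torch.tensor([-1.0, -2.0, 0.0])  # made-up expected immediate reward for each state

# v = (I - gamma * P)^{-1} R
v = torch.inverse(torch.eye(3) - gamma * P) @ R
print(v)

# check that v satisfies the Bellman equation v = R + gamma * P v
assert torch.allclose(v, R + gamma * P @ v, atol=1e-5)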
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
A Markov Decision Process is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$, where $\mathcal{A}$ is a finite set of actions, $\mathcal{P}_{ss'}^a = P(S_{t+1} = s' \mid S_t = s, A_t = a)$, and $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
Policies
A policy $\pi$ is a distribution over actions given states,
$$ \pi(a \mid s) = P(A_t = a \mid S_t = s) $$
Policies (2)
Value function
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
$$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] $$
The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$ q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] $$
Bellman Expectation Equation
For the value function:
$$ v_\pi(s) = \mathbb{E}_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ] $$
For the action-value function:
$$ q_\pi(s, a) = \mathbb{E}_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ] $$
Relating the two:
$$ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) $$
$$ q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') $$
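A minimal sketch on a small randomly generated MDP (all tensors below are made-up placeholders, not quantities from the lectures): iterating the expectation backup converges to $v_\pi$ and $q_\pi$, and the two identities above hold at the fixed point.

import torch

torch.manual_seed(0)
n_states, n_actions, gamma = 4, 2, 0.9

# random MDP: P[s, a] is a distribution over successor states, R[s, a] an expected reward
P = torch.softmax(torch.randn(n_states, n_actions, n_states), dim=-1)
R = torch.randn(n_states, n_actions)
pi = torch.softmax(torch.randn(n_states, n_actions), dim=-1)  # pi(a | s)

v = torch.zeros(n_states)
for _ in range(1000):
    q = R + gamma * torch.einsum('kas,s->ka', P, v)  # q_pi(s,a) = R_s^a + gamma * sum_s' P^a_ss' v(s')
    v = (pi * q).sum(dim=1)                          # v_pi(s)  = sum_a pi(a|s) q_pi(s,a)

# at the fixed point the Bellman expectation equation holds
assert torch.allclose(q, R + gamma * torch.einsum('kas,s->ka', P, v), atol=1e-4)
print(v)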
Bellman Expectation Equation (matrix form)
The Bellman expectation equation can be expressed concisely using the induced MRP:
$$ v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi $$
with direct solution:
$$ v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi $$
Optimal policy
Define a partial ordering over policies:
$$ \pi \ge \pi' \text{ if } v_\pi(s) \ge v_{\pi'}(s), \forall s $$
For any Markov Decision Process there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \forall \pi$; all optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$, and the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$.
Bellman Optimality Equation for $v_*$
The optimal value functions are recursively related by the Bellman optimality equations:
$$ v_*(s) = \max_a q_*(s, a) \\ q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') $$
Solving the Bellman Optimality Equation
The Bellman optimality equation is non-linear and has no closed-form solution in general; it is solved by iterative methods such as value iteration, policy iteration, Q-learning, and Sarsa.
What is Dynamic Programming
Requirements for dynamic programming
Dynamic programming requires two properties: optimal substructure (the optimal solution can be decomposed into subproblems) and overlapping subproblems (subproblems recur many times, so their solutions can be cached and reused).
Markov decision processes satisfy both properties
Planning by dynamic programming
Iterative policy evaluation
In [8]:
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

gridworld_str = """*---
----
----
---*"""  # * => exit. - => empty square

reward_by_symbol = {
    '*': 0,
    '-': -1
}

rows = len(gridworld_str.split('\n'))
cols = len(gridworld_str.split('\n')[0])
print('rows %s cols %s' % (rows, cols))

V = torch.zeros(rows, cols)
print('V', V)

# parse the gridworld string into a 2-d list of symbols
gridworld = []
gridworld_rows_str = gridworld_str.split('\n')
for r in range(rows):
    row_str = gridworld_rows_str[r]
    row = []
    for c in range(cols):
        sym = row_str[c]
        row.append(sym)
    gridworld.append(row)

def get_R(gridworld):
    # reward for each square, looked up by its symbol
    rows = len(gridworld)
    cols = len(gridworld[0])
    R = torch.zeros(rows, cols)
    for r in range(rows):
        row = gridworld[r]
        for c in range(cols):
            char = row[c]
            reward = reward_by_symbol[char]
            R[r][c] = reward
    return R

R = get_R(gridworld)
print('R', R)

def random_policy(V, s):
    # uniform random policy: each of the four moves with probability 0.25
    action_probs = [
        {'a': [0, -1], 'p': 0.25},
        {'a': [0, 1], 'p': 0.25},
        {'a': [-1, 0], 'p': 0.25},
        {'a': [1, 0], 'p': 0.25}
    ]
    for a_p in action_probs:
        a_p['a'] = torch.IntTensor(a_p['a'])
    return action_probs

def apply_a(s, a):
    # moves that would leave the grid leave the state unchanged along that axis
    s_new = s + a
    s_new = s_new.clamp(min=0)
    s_new[0] = min(s_new[0], rows - 1)
    s_new[1] = min(s_new[1], cols - 1)
    return s_new

def a_to_str(a):
    return '\n'.join(str(a.view(1, -1)).split('\n')[1:-2])

def s_to_str(s):
    return '\n'.join(str(s.view(1, -1)).split('\n')[1:-2])

def is_terminal(s):
    return gridworld[s[0]][s[1]] == '*'

K = 100

def value_iterate(p_a_for_s):
    # iterative policy evaluation: K synchronous sweeps of the Bellman expectation
    # backup for the policy given by p_a_for_s(V, s)
    V = torch.zeros(rows, cols)
    # print('V', V)
    for k in range(K):
        V_new = torch.zeros(rows, cols)
        for s_i in range(rows):
            for s_j in range(cols):
                s = torch.IntTensor([s_i, s_j])
                v_new = 0
                if not is_terminal(s):
                    for a_p in p_a_for_s(V, s):
                        a, p = a_p['a'], a_p['p']
                        s_new = apply_a(s, a)
                        v_next = V[s_new[0], s_new[1]]
                        v_new += p * v_next
                    r = R[s[0], s[1]]
                    v_new += r
                V_new[s_i, s_j] = v_new
        V = V_new
        if k < 3:
            print('k', k + 1)
            print(V)
    return V

V_random = value_iterate(random_policy)
print('V_random', V_random)

def optimum_policy_hardcoded(V, s):
    # hand-written optimal policy for this gridworld; each letter is an equally
    # likely move: u/d/l/r => up/down/left/right
    policy = [
        ['*', 'l', 'l', 'lr'],
        ['u', 'ul', 'lr', 'd'],
        ['u', 'ru', 'dr', 'd'],
        ['ru', 'r', 'r', '*']
    ]
    actions = policy[s[0]][s[1]]
    p = 1 / len(actions)
    action_probs = []
    for c in actions:
        a = {
            'u': [-1, 0],
            'd': [1, 0],
            'l': [0, -1],
            'r': [0, 1]
        }[c]
        a = torch.IntTensor(a)
        action_probs.append({'a': a, 'p': p})
    # print(action_probs)
    return action_probs

V_optimum = value_iterate(optimum_policy_hardcoded)
print('V_optimum', V_optimum)
How to improve a policy
Policy improvement
Policy improvement (2)
If improvements stop, i.e.: $$ v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s, a) $$
This is the Bellman optimality equation
Therefore:
$$ v_\pi(s) = v_*(s) \quad \forall s \in \mathcal{S} $$
Therefore, $\pi$ is an optimal policy.
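The argument above suggests policy iteration: repeatedly evaluate the current policy and then act greedily with respect to its value function until the policy stops changing. A rough sketch on the gridworld above (assuming the earlier cell defining rows, cols, R, apply_a, is_terminal and value_iterate has been run; the greedy_action helper is not part of the original code):

ACTIONS = [torch.IntTensor(a) for a in ([-1, 0], [1, 0], [0, -1], [0, 1])]

def greedy_action(V, s):
    # argmax_a [ R(s) + V(s') ]; the gridworld is undiscounted and episodic
    best_a, best_v = None, None
    for a in ACTIONS:
        s_new = apply_a(s, a)
        v = R[s[0], s[1]] + V[s_new[0], s_new[1]]
        if best_v is None or v > best_v:
            best_a, best_v = a, v
    return best_a

# start from the greedy policy w.r.t. V = 0, then alternate evaluation and improvement
policy = [[greedy_action(torch.zeros(rows, cols), torch.IntTensor([r, c]))
           for c in range(cols)] for r in range(rows)]
for it in range(10):
    def pi(V, s, policy=policy):
        return [{'a': policy[s[0]][s[1]], 'p': 1.0}]
    V_pi = value_iterate(pi)  # policy evaluation: K sweeps of the expectation backup
    new_policy = [[greedy_action(V_pi, torch.IntTensor([r, c]))
                   for c in range(cols)] for r in range(rows)]
    if all(torch.equal(new_policy[r][c], policy[r][c])
           for r in range(rows) for c in range(cols)):
        print('policy stable after', it + 1, 'iterations')
        break
    policy = new_policy
print('V_pi', V_pi)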
Notes on v and q
$$ v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] $$
For a deterministic policy:
$$ a = \pi(s) \\ v_\pi(s) = q_\pi(s, \pi(s)) $$
If we have the greedy deterministic policy:
$$ \pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s, a) $$
... then:
$$ q_{\pi}(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) $$
And we have:
$$ \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s) $$
and therefore, by the policy improvement theorem:
$$ v_{\pi'}(s) \ge v_\pi(s) $$
In [9]:
def optimum_policy(V, s):
    # greedy policy improvement: pick the action maximising R(s) + V(s')
    # with respect to the current value estimate V
    p = 1.0
    a_best = None
    v_best = None
    for a in [
        [-1, 0], [1, 0], [0, -1], [0, 1]
    ]:
        a = torch.IntTensor(a)
        # print('s', s)
        s_new = apply_a(s, a)
        v_new = R[s[0], s[1]] + V[s_new[0], s_new[1]]
        if v_best is None or v_new > v_best:
            a_best = a
            v_best = v_new
    # print('a_best', a_best, 'v_best', v_best)
    action_probs = [{'a': a_best, 'p': p}]
    return action_probs

V_optimum = value_iterate(optimum_policy)
print('V_optimum', V_optimum)
Principle of Optimality
Any optimal policy can be subdivided into two components: an optimal first action $A_*$, followed by an optimal policy from the successor state $S'$.
Theorem (Principle of optimality)
A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if, for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$.
Deterministic value iteration
If we know the solution to the subproblems $v_*(s')$, then $v_*(s)$ can be found by a one-step lookahead; value iteration applies this update iteratively, intuitively working backwards from the "end" of the problem.
Value iteration
$$ \mathbf{v}_{k+1} = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v}_k \right) $$
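A rough sketch of value iteration on the gridworld above (assuming the earlier cell defining rows, cols, R, apply_a and is_terminal has been run; this loop is not part of the original notebook code):

ACTIONS = [torch.IntTensor(a) for a in ([-1, 0], [1, 0], [0, -1], [0, 1])]

V = torch.zeros(rows, cols)
for k in range(100):
    V_new = torch.zeros(rows, cols)
    for r in range(rows):
        for c in range(cols):
            s = torch.IntTensor([r, c])
            if is_terminal(s):
                continue
            candidates = []
            for a in ACTIONS:
                s_new = apply_a(s, a)
                candidates.append(R[r, c] + V[s_new[0], s_new[1]])
            # v_{k+1}(s) = max_a [ R(s) + v_k(s') ]   (gamma = 1 in this episodic gridworld)
            V_new[r, c] = max(candidates)
    if torch.equal(V_new, V):
        print('converged after', k, 'sweeps')
        break
    V = V_new
print(V)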
Synchronous dynamic programming algorithms

Problem | Bellman Equation | Algorithm |
---|---|---|
Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
Control | Bellman Optimality Equation | Value Iteration |
Asynchronous dynamic programming
Three simple ideas for asynchronous dynamic programming:
In-place dynamic programming
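For illustration only (not from the original notebook): synchronous value iteration keeps two copies of the value function, whereas in-place value iteration overwrites a single copy, so later states in the same sweep already see the new values. A minimal sketch on the gridworld above, reusing rows, cols, R, apply_a and is_terminal from the earlier cell:

ACTIONS = [torch.IntTensor(a) for a in ([-1, 0], [1, 0], [0, -1], [0, 1])]

V = torch.zeros(rows, cols)  # a single value array, updated in place
for sweep in range(100):
    delta = 0.0
    for r in range(rows):
        for c in range(cols):
            s = torch.IntTensor([r, c])
            if is_terminal(s):
                continue
            candidates = []
            for a in ACTIONS:
                s_new = apply_a(s, a)
                candidates.append(R[r, c] + V[s_new[0], s_new[1]])
            v_new = max(candidates)
            delta = max(delta, abs(float(v_new - V[r, c])))
            V[r, c] = v_new  # overwrite immediately; later states in this sweep see the new value
    if delta < 1e-9:
        print('converged after', sweep + 1, 'in-place sweeps')
        break
print(V)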
Prioritized sweeping
Use the magnitude of the Bellman error to guide state selection:
$$ \left| \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') - v(s) \right) \right| $$
Real-time dynamic programming
Full-Width Backups
Sample backups
Approximate dynamic programming