Next we will look at different ways to compute the value function induced by a stochastic policy $\pi(a\mid s) = p(a\mid s)$.
We want to calculate $V_{\pi}(s)$ (the state-value function under a given policy).
Here we consider a Markov Decision Process (MDP) with three states $\mathcal{S}=\{s_1,s_2,s_3\}$ and three possible actions $\mathcal{A}=\{a_1,a_2,a_3\}$: moving to state $s_1$, moving to state $s_2$, and moving to state $s_3$.
In [ ]:
%load_ext autoreload
%autoreload 2
For the MDP represented above we define the state transition probability matrix $\mathcal{P}^a_{ss'}=p(S_{t+1}=s'\mid S_{t}=s, A_t=a)$. In this MDP we assume that when we choose to move to state $s_i$, $i\in\{1,2,3\}$, we always end up in that state, meaning that $\mathcal{P}^{a_i}_{s s_i}=p(S_{t+1}=s_i\mid S_{t}=s, A_t=a_i)=1$. In this case $\mathcal{P}^{\pi}_{ss'}=\sum_{a\in\mathcal{A}}\pi(a\mid s)\,\mathcal{P}^a_{ss'} = \pi(a_{s'}\mid s)$, where $a_{s'}$ denotes the action that moves to state $s'$, so the Bellman expectation equation (see pages 14 and 16 of the lecture slides) becomes: $$ V_{\pi}(s) = \sum_{a\in\mathcal{A}} \pi(a\mid s)\left( \mathcal{R}^a_s + \gamma \sum_{s'\in \mathcal{S}}\mathcal{P}^a_{ss'}V_{\pi}(s')\right) = \mathcal{R}^{\pi}_s + \gamma \sum_{s'\in \mathcal{S}}\pi(a_{s'}\mid s)V_{\pi}(s') $$
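To make the reduction $\mathcal{P}^{\pi}=\pi(a\mid s)$ concrete, here is a small numerical check: each action $a_i$ leads to state $s_i$ with probability 1, so every per-action transition matrix has a single column of ones, and mixing them with the policy weights reproduces the policy matrix itself. This is only a sketch; it reuses the same policy matrix that is defined in the next cell.
In [ ]:
import numpy as np

# Each action a_i moves deterministically to state s_i, so the per-action
# transition matrix P[i] has a 1 in column i of every row.
P = [np.zeros((3, 3)) for _ in range(3)]
for i in range(3):
    P[i][:, i] = 1.0

# The policy pi(a|s) used below; rows index states s, columns index actions a.
pi = np.array([[0.3, 0.2, 0.5],
               [0.5, 0.4, 0.1],
               [0.8, 0.1, 0.1]])

# P^pi_{ss'} = sum_a pi(a|s) * P^a_{ss'}
P_pi = sum(pi[:, a][:, None] * P[a] for a in range(3))
print(np.allclose(P_pi, pi))  # True: P^pi equals the policy matrix itself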
In [ ]:
import numpy as np
policy = np.array([[0.3, 0.2, 0.5], [0.5, 0.4, 0.1], [0.8, 0.1, 0.1]])
print("This is represents the policy with 3 states and 3 actions p(row=a|col=s):\n", np.matrix(policy))
# 'raw_rewards' variable contains rewards obtained after transition to each state
# In our example it doesn't depend on source state
raw_rewards = np.array([1.5, -1.833333333, 19.833333333])
# 'rewards' variable contains expected values of the next reward for each state
rewards = np.matmul(policy, raw_rewards)
assert np.allclose(rewards, np.array([10., 2., 3.]))
gamma = 0.1
print('These are the expected rewards for each state under the policy:\n', rewards)
state_value_function = np.zeros(3)
print('Policy evaluation:')
for i in range(20):
    print('V_{}={}'.format(i, state_value_function))
    state_value_function = rewards + gamma * np.matmul(policy, state_value_function)
print('\nV={}'.format(state_value_function))
In [ ]:
solution = np.matmul(np.linalg.inv(np.eye(3) - gamma * policy), rewards)
print('Solution by inversion:\nV={}'.format(solution))
The result stays the same: writing the Bellman expectation equation in matrix form, $V_{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} V_{\pi}$, gives the closed-form solution $V_{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1}\mathcal{R}^{\pi}$ computed above.
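As a quick sanity check (assuming the two cells above have been run), the iterative estimate after 20 sweeps has already converged to the closed-form solution up to numerical tolerance:
In [ ]:
# Compare 20 iterations of policy evaluation with the matrix-inversion solution
print(np.allclose(solution, state_value_function))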
We can design yet another way of evaluating the value of a given policy $\pi$; see lecture slides, page 20. The intuition is to incrementally estimate the expected return from sampled episodes, i.e. sequences of triplets $\{(s_i,a_i,r_{i})\}_{i=1}^N$. The function $\color{blue}{gt}$ computes the total discounted reward from a list of sequential rewards obtained by sampling the policy: $G_t=r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\dots+\gamma^N r_{t+N}$.
The value of a policy can also be computed by looking at its empirical expected cumulative discounted return: $$ V_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t\mid S_t=s\right] $$
In [ ]:
import random
from collections import defaultdict
reward_counter = np.array([0., 0., 0.])
visit_counter = np.array([0., 0., 0.])
nIterations = 400
def gt(rewardlist, gamma=0.1):
    '''
    Function to calculate the total discounted reward
    >>> gt([10, 2, 3], gamma=0.1)
    10.23
    '''
    total_disc_return = 0
    for (i, value) in enumerate(rewardlist):
        total_disc_return += (gamma ** i) * value
    return total_disc_return

for i in range(nIterations):
    start_state = random.randint(0, 2)
    next_state = start_state
    rewardlist = []
    occurence = defaultdict(list)
    for t in range(250):  # draw samples from the policy over a horizon of N=250
        rewardlist.append(rewards[next_state])
        occurence[next_state].append(len(rewardlist) - 1)
        action = np.random.choice(np.arange(0, 3), p=policy[next_state])
        next_state = action
    for state in occurence:
        for value in occurence[state]:  # update state-value function E[G_t|s] = S(s)/N(s)
            rew = gt(rewardlist[value:])
            reward_counter[state] += rew  # S(s)
            visit_counter[state] += 1     # N(s)
print("MC policy evaluation V=", reward_counter / visit_counter)
As can be seen, the result is nearly the same as the state-value function calculated above.
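To quantify "nearly the same", we can compare the Monte Carlo estimate with the exact solution obtained by matrix inversion (a sketch, assuming `solution`, `reward_counter`, and `visit_counter` from the cells above are still in scope):
In [ ]:
mc_estimate = reward_counter / visit_counter
print('MC estimate:    ', mc_estimate)
print('Exact solution: ', solution)
print('Absolute error: ', np.abs(mc_estimate - solution))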
So far we have seen different ways of computing the value $V_{\pi}(s)$ of a known policy $\pi(a\mid s)$. Next, we wish to find the optimal policy $\pi^\ast(s)$ for the MDP in the example. The next cell uses Q-learning on randomly sampled state-action pairs, updating the action-value table as $Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$; the cell after it applies value iteration to the state-value function directly.
In [ ]:
q_table = np.zeros((3, 3))  # state-action value function, the Q-table
gamma = 0.1
alpha = 1.0
for i in range(1001):
    state = random.randint(0, 2)
    action = random.randint(0, 2)
    next_state = action
    reward = raw_rewards[next_state]
    next_q = max(q_table[next_state])  # state-action value evaluation at the next state
    q_table[state, action] = q_table[state, action] + alpha * (
        reward + gamma * next_q - q_table[state, action])  # Q-table update
    if i % 200 == 0:
        print("Q_{}(s,a)=".format(i), q_table)
In [ ]:
import numpy as np
raw_rewards = np.array([1.5, -1.833333333, 19.833333333])
gamma = 0.1
state_value_function = np.zeros(3)
print('V_{} = {}'.format(0, state_value_function))
for i in range(1000):
    for s in range(3):
        # Bellman optimality backup: V(s) <- max_{s'} [R(s') + gamma * V(s')]
        Q_s = [raw_rewards[s_next] + gamma * state_value_function[s_next]
               for s_next in range(3)]
        state_value_function[s] = max(Q_s)
    if i % 100 == 99:
        print('V_{} = {}'.format(i + 1, state_value_function))
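The converged value function also yields the greedy optimal policy, $\pi^\ast(s)=\arg\max_{s'}\left(\mathcal{R}_{s'} + \gamma V^\ast(s')\right)$. Because every action here leads deterministically to its target state regardless of the source state, the greedy action is the same from every state, and it should agree with the policy read off the Q-table above (a sketch, assuming the variables from the previous cell):
In [ ]:
# Greedy choice of next state; identical for every source state in this MDP
greedy_target = np.argmax(raw_rewards + gamma * state_value_function)
print('Greedy policy: always move to state s_{}'.format(greedy_target + 1))
print('V*(s) =', state_value_function)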