Next we will look at different ways to compute the value function induced by a stochastic policy $\pi(a\mid s) = p(a\mid s)$.
We want to calculate $V_{\pi}(s)$ (the state-value function under a given policy).
Here we consider a Markov Decision Process (MDP) with three states $\mathcal{S}=\{s_1,s_2,s_3\}$ and three possible actions $\mathcal{A}=\{a_1,a_2,a_3\}$: moving to state $s_1$, moving to state $s_2$, and moving to state $s_3$.
In [ ]:
%load_ext autoreload
%autoreload 2
For the MDP represented above we define the state transition probability matrix $\mathcal{P}^a_{ss'}=p(S_{t+1}=s'\mid S_{t}=s, A_t=a)$. In this MDP we assume that when we choose to move to state $s_i$, $i\in\{1,2,3\}$, we always end up in that state, meaning that $\mathcal{P}^{a_i}_{s s_i}=p(S_{t+1}=s_i\mid S_{t}=s, A_t=a_i)=1$. In this case $\mathcal{P}^{\pi}_{ss'}=\sum_{a\in\mathcal{A}}\pi(a\mid s)\,\mathcal{P}^a_{ss'} = \pi(a_{s'}\mid s)$, where $a_{s'}$ denotes the action that moves to state $s'$, so the Bellman expectation equation (see pages 14 and 16 of the lecture slides) becomes: $$ V_{\pi}(s) = \sum_{a\in\mathcal{A}} \pi(a\mid s)\left( \mathcal{R}^a_s + \gamma \sum_{s'\in \mathcal{S}}\mathcal{P}^a_{ss'}V_{\pi}(s')\right) = \mathcal{R}^{\pi}_s + \gamma \sum_{s'\in \mathcal{S}}\pi(a_{s'}\mid s)V_{\pi}(s') $$
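To make the reduction $\mathcal{P}^{\pi}=\pi(a\mid s)$ concrete, here is a small numerical check: each action $a_i$ leads to state $s_i$ with probability 1, so every per-action transition matrix has a single column of ones, and mixing them with the policy weights reproduces the policy matrix itself. This is only a sketch; it reuses the same policy matrix that is defined in the next cell.
In [ ]:
import numpy as np

# Each action a_i moves deterministically to state s_i, so the per-action
# transition matrix P[i] has a 1 in column i of every row.
P = [np.zeros((3, 3)) for _ in range(3)]
for i in range(3):
    P[i][:, i] = 1.0

# The policy pi(a|s) used below; rows index states s, columns index actions a.
pi = np.array([[0.3, 0.2, 0.5],
               [0.5, 0.4, 0.1],
               [0.8, 0.1, 0.1]])

# P^pi_{ss'} = sum_a pi(a|s) * P^a_{ss'}
P_pi = sum(pi[:, a][:, None] * P[a] for a in range(3))
print(np.allclose(P_pi, pi))  # True: P^pi equals the policy matrix itself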
In [ ]:
import numpy as np
policy = np.array([[0.3, 0.2, 0.5], [0.5, 0.4, 0.1], [0.8, 0.1, 0.1]])
print("This is represents the policy with 3 states and 3 actions p(row=a|col=s):\n", np.matrix(policy))
# 'raw_rewards' variable contains rewards obtained after transition to each state
# In our example it doesn't depend on source state
raw_rewards = np.array([1.5, -1.833333333, 19.833333333])
# 'rewards' variable contains expected values of the next reward for each state
rewards = np.matmul(policy, raw_rewards)
assert np.allclose(rewards, np.array([10., 2., 3.]))
gamma = 0.1
print('These are the expected rewards for each state under the policy:\n', rewards)
state_value_function = np.zeros(3)
print('Policy evaluation:')
for i in range(20):
    print('V_{}={}'.format(i, state_value_function))
    state_value_function = rewards + gamma * np.matmul(policy, state_value_function)
print('\nV={}'.format(state_value_function))
In [ ]:
solution = np.matmul(np.linalg.inv(np.eye(3) - gamma * policy), rewards)
print('Solution by inversion:\nV={}'.format(solution))
The result stays the same: writing the Bellman expectation equation in matrix form, $V_{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} V_{\pi}$, gives the closed-form solution $V_{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1}\mathcal{R}^{\pi}$ computed above.
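As a quick sanity check (assuming the two cells above have been run), the iterative estimate after 20 sweeps has already converged to the closed-form solution up to numerical tolerance:
In [ ]:
# Compare 20 iterations of policy evaluation with the matrix-inversion solution
print(np.allclose(solution, state_value_function))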
We can design yet another way of evaluating the value of a given policy $\pi$; see lecture slides, page 20. The intuition is to incrementally estimate the expected return from sampled episodes, i.e. sequences of triplets $\{(s_i,a_i,r_{i})\}_{i=1}^N$. The function $\color{blue}{gt}$ computes the total discounted reward from a list of sequential rewards obtained by sampling the policy: $G_t=r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\dots+\gamma^N r_{t+N}$.
The value of a policy can also be computed by looking at its empirical expected cumulative discounted return: $$ V_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t\mid S_t=s\right] $$
In [ ]:
import random
from collections import defaultdict
reward_counter = np.array([0., 0., 0.])
visit_counter = np.array([0., 0., 0.])
nIterations = 400
def gt(rewardlist, gamma=0.1):
    '''
    Function to calculate the total discounted reward
    >>> gt([10, 2, 3], gamma=0.1)
    10.23
    '''
    total_disc_return = 0
    for (i, value) in enumerate(rewardlist):
        total_disc_return += (gamma ** i) * value
    return total_disc_return

for i in range(nIterations):
    start_state = random.randint(0, 2)
    next_state = start_state
    rewardlist = []
    occurence = defaultdict(list)
    for t in range(250):  # draw samples from the policy over a horizon of N=250
        rewardlist.append(rewards[next_state])
        occurence[next_state].append(len(rewardlist) - 1)
        action = np.random.choice(np.arange(0, 3), p=policy[next_state])
        next_state = action
    for state in occurence:
        for value in occurence[state]:  # update state-value function E[G_t|s] = S(s)/N(s)
            rew = gt(rewardlist[value:])
            reward_counter[state] += rew  # S(s)
            visit_counter[state] += 1     # N(s)
print("MC policy evaluation V=", reward_counter / visit_counter)
As can be seen, the result is nearly the same as the state-value function calculated above.
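To quantify "nearly the same", we can compare the Monte Carlo estimate with the exact solution obtained by matrix inversion (a sketch, assuming `solution`, `reward_counter`, and `visit_counter` from the cells above are still in scope):
In [ ]:
mc_estimate = reward_counter / visit_counter
print('MC estimate:    ', mc_estimate)
print('Exact solution: ', solution)
print('Absolute error: ', np.abs(mc_estimate - solution))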
So far we have seen different ways of computing the value $V_{\pi}(s)$ of a known policy $\pi(a\mid s)$. Next, we wish to find the optimal policy $\pi^\ast(s)$ for the MDP in the example. The next cell uses Q-learning on randomly sampled state-action pairs, updating the action-value table as $Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$; the cell after it applies value iteration to the state-value function directly.
In [ ]:
q_table = np.zeros((3, 3))  # state-action value function, the Q-table
gamma = 0.1
alpha = 1.0
for i in range(1001):
    state = random.randint(0, 2)
    action = random.randint(0, 2)
    next_state = action
    reward = raw_rewards[next_state]
    next_q = max(q_table[next_state])  # state-action value evaluation at the next state
    q_table[state, action] = q_table[state, action] + alpha * (
        reward + gamma * next_q - q_table[state, action])  # Q-table update
    if i % 200 == 0:
        print("Q_{}(s,a)=".format(i), q_table)
In [ ]:
import numpy as np
raw_rewards = np.array([1.5, -1.833333333, 19.833333333])
gamma = 0.1
state_value_function = np.zeros(3)
print('V_{} = {}'.format(0, state_value_function))
for i in range(1000):
    for s in range(3):
        # Bellman optimality backup: V(s) <- max_{s'} [R(s') + gamma * V(s')]
        Q_s = [raw_rewards[s_next] + gamma * state_value_function[s_next]
               for s_next in range(3)]
        state_value_function[s] = max(Q_s)
    if i % 100 == 99:
        print('V_{} = {}'.format(i + 1, state_value_function))
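The converged value function also yields the greedy optimal policy, $\pi^\ast(s)=\arg\max_{s'}\left(\mathcal{R}_{s'} + \gamma V^\ast(s')\right)$. Because every action here leads deterministically to its target state regardless of the source state, the greedy action is the same from every state, and it should agree with the policy read off the Q-table above (a sketch, assuming the variables from the previous cell):
In [ ]:
# Greedy choice of next state; identical for every source state in this MDP
greedy_target = np.argmax(raw_rewards + gamma * state_value_function)
print('Greedy policy: always move to state s_{}'.format(greedy_target + 1))
print('V*(s) =', state_value_function)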