DiscreteDP

Implementation Details

Daisuke Oyama
Faculty of Economics, University of Tokyo

This notebook describes the implementation details of the DiscreteDP class.

For the theoretical background and notation, see the lecture Discrete Dynamic Programming.

Solution methods

The DiscreteDP class currently implements the following solution algorithms:

value iteration;
policy iteration (default);
modified policy iteration.

Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an $\varepsilon$-optimal policy for a prespecified value of $\varepsilon$.

Value iteration relies on (only) the fact that the Bellman operator $T$ is a contraction mapping and thus iterative application of $T$ to any initial function $v^0$ converges to its unique fixed point $v^*$.

Policy iteration more closely exploits the particular structure of the problem, where each iteration consists of a policy evaluation step, which computes the value $v_{\sigma}$ of a policy $\sigma$ by solving the linear equation $v = T_{\sigma} v$, and a policy improvement step, which computes a $v_{\sigma}$-greedy policy.

Modified policy iteration replaces the policy evaluation step in policy iteration with "partial policy evaluation", which computes an approximation of the value of a policy $\sigma$ by iterating $T_{\sigma}$ for a specified number of times.

Below we describe our implementation of these algorithms more in detail.
(While not explicit, in the actual implementation each algorithm is terminated when the number of iterations reaches iter_max.)

Value iteration

DiscreteDP.value_iteration(v_init, epsilon, iter_max)

Choose any $v^0 \in \mathbb{R}^n$, and specify $\varepsilon > 0$; set $i = 0$.
Compute $v^{i+1} = T v^i$.
If $\lVert v^{i+1} - v^i\rVert < [(1 - \beta) / (2\beta)] \varepsilon$, then go to step 4; otherwise, set $i = i + 1$ and go to step 2.
Compute a $v^{i+1}$-greedy policy $\sigma$, and return $v^{i+1}$ and $\sigma$.

Given $\varepsilon > 0$, the value iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value funciton and an $\varepsilon$-optimal policy function (unless iter_max is reached).

Policy iteration

DiscreteDP.policy_iteration(v_init, iter_max)

Choose any $v^0 \in \mathbb{R}^n$ and compute a $v^0$-greedy policy $\sigma^0$; set $i = 0$.
[Policy evaluation] Compute the value $v_{\sigma^i}$ by solving the equation $v = T_{\sigma^i} v$.
[Policy improvement] Compute a $v_{\sigma^i}$-greedy policy $\sigma^{i+1}$; let $\sigma^{i+1} = \sigma^i$ if possible.
If $\sigma^{i+1} = \sigma^i$, then return $v_{\sigma^i}$ and $\sigma^{i+1}$; otherwise, set $i = i + 1$ and go to step 2.

The policy iteration algorithm terminates in a finite number of iterations, and returns an optimal value function and an optimal policy function (unless iter_max is reached).

Modified policy iteration

DiscreteDP.modified_policy_iteration(v_init, epsilon, iter_max, k)

Choose any $v^0 \in \mathbb{R}^n$, and specify $\varepsilon > 0$ and $k \geq 0$; set $i = 0$.
[Policy improvement] Compute a $v^i$-greedy policy $\sigma^{i+1}$; let $\sigma^{i+1} = \sigma^i$ if possible (for $i \geq 1$).
Compute $u = T v^i$ ($= T_{\sigma^{i+1}} v^i$). If $\mathrm{span}(u - v^i) < [(1 - \beta) / \beta] \varepsilon$, then go to step 5; otherwise go to step 4.
[Partial policy evaluation] Compute $v^{i+1} = (T_{\sigma^{i+1}})^k u$ ($= (T_{\sigma^{i+1}})^{k+1} v^i$). Set $i = i + 1$ and go to step 2.
Return $v = u + [\beta / (1 - \beta)] [(\min(u - v^i) + \max(u - v^i)) / 2] \mathbf{1}$ and $\sigma_{i+1}$.

Given $\varepsilon > 0$, provided that $v^0$ is such that $T v^0 \geq v^0$, the modified policy iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value funciton and an $\varepsilon$-optimal policy function (unless iter_max is reached).

Remarks

Here we employ the termination criterion based on the span semi-norm, where $\mathrm{span}(z) = \max(z) - \min(z)$ for $z \in \mathbb{R}^n$. Since $\mathrm{span}(T v - v) \leq 2\lVert T v - v\rVert$, this reaches $\varepsilon$-optimality faster than the norm-based criterion as employed in the value iteration above.
Except for the termination criterion, modified policy is equivalent to value iteration if $k = 0$ and to policy iteration in the limit as $k \to \infty$.
Thus, if one would like to have value iteration with the span-based rule, run modified policy iteration with $k = 0$.
In returning a value function, our implementation is slightly different from that by Puterman (2005), Section 6.6.3, pp.201-202, which uses $u + [\beta / (1 - \beta)] \min(u - v^i) \mathbf{1}$.
The condition for convergence, $T v^0 \geq v^0$, is satisfied for example when $v^0 = v_{\sigma}$ for some policy $\sigma$, or when $v^0(s) = \min_{(s', a)} r(s', a)$ for all $s$. If v_init is not specified, it is set to the latter, $\min_{(s', a)} r(s', a))$.

Illustration

We illustrate the algorithms above by the simple example from Puterman (2005), Section 3.1, pp.33-35.



In [1]:

    
import numpy as np
import pandas as pd
from quantecon.markov import DiscreteDP



In [2]:

    
n = 2  # Number of states
m = 2  # Number of actions

# Reward array
R = [[5, 10],
     [-1, -float('inf')]]

# Transition probability array
Q = [[(0.5, 0.5), (0, 1)],
     [(0, 1), (0.5, 0.5)]] # Probabilities in Q[1, 1] are arbitrary

# Discount rate
beta = 0.95

ddp = DiscreteDP(R, Q, beta)

Analytical solution:



In [3]:

    
def sigma_star(beta):
    sigma = np.empty(2, dtype=int)
    sigma[1] = 0
    if beta > 10/11:
        sigma[0] = 0
    else:
        sigma[0] = 1
    return sigma

def v_star(beta):
    v = np.empty(2)
    v[1] = -1 / (1 - beta)
    if beta > 10/11:
        v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
    else:
        v[0] = (10 - 11*beta) / (1 - beta)
    return v



In [4]:

    
sigma_star(beta=beta)









    Out[4]:





array([0, 0])



In [5]:

    
v_star(beta=beta)









    Out[5]:





array([ -8.57142857, -20.        ])

Value iteration

Solve the problem by value iteration; see Example 6.3.1, p.164 in Puterman (2005).



In [6]:

    
epsilon = 1e-2
v_init = [0, 0]
res_vi = ddp.solve(method='value_iteration', v_init=v_init, epsilon=epsilon)

The number of iterations required to satisfy the termination criterion:



In [7]:

    
res_vi.num_iter









    Out[7]:





162

The returned value function:



In [8]:

    
res_vi.v









    Out[8]:





array([ -8.5665053 , -19.99507673])

It is indeed an $\varepsilon/2$-approximation of $v^*$:



In [9]:

    
np.abs(res_vi.v - v_star(beta=beta)).max() < epsilon/2









    Out[9]:





True

The returned policy function:



In [10]:

    
res_vi.sigma









    Out[10]:





array([0, 0])

Value iteration converges very slowly. Let us replicate Table 6.3.1 on p.165:



In [11]:

    
num_reps = 164
values = np.empty((num_reps, n))
diffs = np.empty(num_reps)
spans = np.empty(num_reps)
v = np.array([0, 0])

values[0] = v
diffs[0] = np.nan
spans[0] = np.nan

for i in range(1, num_reps):
    v_new = ddp.bellman_operator(v)
    values[i] = v_new
    diffs[i] = np.abs(v_new - v).max()
    spans[i] = (v_new - v).max() - (v_new - v).min()
    v = v_new



In [12]:

    
df = pd.DataFrame()
df[0], df[1], df[2], df[3] = values[:, 0], values[:, 1], diffs, spans
df.columns = '$v^i(0)$', '$v^i(1)$', \
             '$\\lVert v^i - v^{i-1}\\rVert$', '$\\mathrm{span}(v^i - v^{i-1})$'

iter_nums = pd.Series(list(range(num_reps)), name='$i$')
df.index = iter_nums

display_nums = \
    list(range(10)) + [10*i for i in range(1, 16)] + [160+i for i in range(4)]
df.iloc[display_nums, [0, 1, 2]]









    Out[12]:







  
    
      
      $v^i(0)$
      $v^i(1)$
      $\lVert v^i - v^{i-1}\rVert$
    
    
      $i$
      
      
      
    
  
  
    
      0
      0.000000
      0.000000
      NaN
    
    
      1
      10.000000
      -1.000000
      10.000000
    
    
      2
      9.275000
      -1.950000
      0.950000
    
    
      3
      8.479375
      -2.852500
      0.902500
    
    
      4
      7.672766
      -3.709875
      0.857375
    
    
      5
      6.882373
      -4.524381
      0.814506
    
    
      6
      6.120046
      -5.298162
      0.773781
    
    
      7
      5.390395
      -6.033254
      0.735092
    
    
      8
      4.694642
      -6.731591
      0.698337
    
    
      9
      4.032449
      -7.395012
      0.663420
    
    
      10
      3.402783
      -8.025261
      0.630249
    
    
      20
      -1.401710
      -12.830282
      0.377354
    
    
      30
      -4.278653
      -15.707225
      0.225936
    
    
      40
      -6.001185
      -17.429757
      0.135276
    
    
      50
      -7.032529
      -18.461100
      0.080995
    
    
      60
      -7.650033
      -19.078604
      0.048495
    
    
      70
      -8.019755
      -19.448326
      0.029035
    
    
      80
      -8.241121
      -19.669693
      0.017385
    
    
      90
      -8.373661
      -19.802233
      0.010409
    
    
      100
      -8.453018
      -19.881589
      0.006232
    
    
      110
      -8.500532
      -19.929103
      0.003731
    
    
      120
      -8.528980
      -19.957551
      0.002234
    
    
      130
      -8.546013
      -19.974584
      0.001338
    
    
      140
      -8.556211
      -19.984783
      0.000801
    
    
      150
      -8.562317
      -19.990889
      0.000480
    
    
      160
      -8.565973
      -19.994545
      0.000287
    
    
      161
      -8.566246
      -19.994818
      0.000273
    
    
      162
      -8.566505
      -19.995077
      0.000259
    
    
      163
      -8.566751
      -19.995323
      0.000246

On the other hand, the span decreases faster than the norm; the following replicates Table 6.6.1, page 205:



In [13]:

    
df.iloc[list(range(1, 13)) + [10*i for i in  range(2, 7)], [2, 3]]









    Out[13]:







  
    
      
      $\lVert v^i - v^{i-1}\rVert$
      $\mathrm{span}(v^i - v^{i-1})$
    
    
      $i$
      
      
    
  
  
    
      1
      10.000000
      1.100000e+01
    
    
      2
      0.950000
      2.250000e-01
    
    
      3
      0.902500
      1.068750e-01
    
    
      4
      0.857375
      5.076562e-02
    
    
      5
      0.814506
      2.411367e-02
    
    
      6
      0.773781
      1.145399e-02
    
    
      7
      0.735092
      5.440647e-03
    
    
      8
      0.698337
      2.584307e-03
    
    
      9
      0.663420
      1.227546e-03
    
    
      10
      0.630249
      5.830844e-04
    
    
      11
      0.598737
      2.769651e-04
    
    
      12
      0.568800
      1.315584e-04
    
    
      20
      0.377354
      3.409318e-07
    
    
      30
      0.225936
      1.993445e-10
    
    
      40
      0.135276
      1.154632e-13
    
    
      50
      0.080995
      0.000000e+00
    
    
      60
      0.048495
      1.776357e-15

The span-based termination criterion is satisfied when $i = 11$:



In [14]:

    
epsilon * (1-beta) / beta









    Out[14]:





0.0005263157894736847



In [15]:

    
spans[11] < epsilon * (1-beta) / beta









    Out[15]:





True

In fact, modified policy iteration with $k = 0$ terminates with $11$ iterations:



In [16]:

    
epsilon = 1e-2
v_init = [0, 0]
k = 0
res_mpi_1 = ddp.solve(method='modified_policy_iteration',
                      v_init=v_init, epsilon=epsilon, k=k)



In [17]:

    
res_mpi_1.num_iter









    Out[17]:





11



In [18]:

    
res_mpi_1.v









    Out[18]:





array([ -8.56904799, -19.99736883])

Policy iteration

If $\{\sigma^i\}$ is the sequence of policies obtained by policy iteration with an initial policy $\sigma^0$, one can show that $T^i v_{\sigma^0} \leq v_{\sigma^i}$ ($\leq v^*$), so that the number of iterations required for policy iteration is smaller than that for value iteration at least weakly, and indeed in many cases, the former is significantly smaller than the latter.



In [19]:

    
v_init = [0, 0]
res_pi = ddp.solve(method='policy_iteration', v_init=v_init)



In [20]:

    
res_pi.num_iter









    Out[20]:





2

Policy iteration returns the exact optimal value function (up to rounding errors):



In [21]:

    
res_pi.v









    Out[21]:





array([ -8.57142857, -20.        ])



In [22]:

    
np.abs(res_pi.v - v_star(beta=beta)).max()









    Out[22]:





3.5527136788005009e-15

To look into the iterations:



In [23]:

    
v = np.array([0, 0])
sigma = np.array([-1, -1])  # Dummy
sigma_new = ddp.compute_greedy(v)
i = 0

while True:
    print('Iterate {0}'.format(i))
    print(' value:  {0}'.format(v))
    print(' policy: {0}'.format(sigma_new))
    if np.array_equal(sigma_new, sigma):
        break
    sigma[:] = sigma_new
    v = ddp.evaluate_policy(sigma)
    sigma_new = ddp.compute_greedy(v)
    i += 1

print('Terminated')









    



Iterate 0
 value:  [0 0]
 policy: [1 0]
Iterate 1
 value:  [ -9. -20.]
 policy: [0 0]
Iterate 2
 value:  [ -8.57142857 -20.        ]
 policy: [0 0]
Terminated

See Example 6.4.1, pp.176-177.

Modified policy iteration

The evaluation step in policy iteration which solves the linear equation $v = T_{\sigma} v$ to obtain the policy value $v_{\sigma}$ can be expensive for problems with a large number of states. Modified policy iteration is to reduce the cost of this step by using an approximation of $v_{\sigma}$ obtained by iteration of $T_{\sigma}$. The tradeoff is that this approach only computes an $\varepsilon$-optimal policy, and for small $\varepsilon$, takes a larger number of iterations than policy iteration (but much smaller than value iteration).



In [24]:

    
epsilon = 1e-2
v_init = [0, 0]
k = 6
res_mpi = ddp.solve(method='modified_policy_iteration',
                    v_init=v_init, epsilon=epsilon, k=k)



In [25]:

    
res_mpi.num_iter









    Out[25]:





4

The returned value function:



In [26]:

    
res_mpi.v









    Out[26]:





array([ -8.57137101, -19.99993638])

It is indeed an $\varepsilon/2$-approximation of $v^*$:



In [27]:

    
np.abs(res_mpi.v - v_star(beta=beta)).max() < epsilon/2









    Out[27]:





True

To look into the iterations:



In [28]:

    
epsilon = 1e-2
v = np.array([0, 0])
k = 6
i = 0
print('Iterate {0}'.format(i))
print(' v: {0}'.format(v))

sigma = np.empty(n, dtype=int)  # Store the policy function

while True:
    i += 1
    u = ddp.bellman_operator(v, sigma=sigma)
    diff = u - v
    span = diff.max() - diff.min()
    print('Iterate {0}'.format(i))
    print(' sigma: {0}'.format(sigma))
    print(' T_sigma(v): {0}'.format(u))
    print(' span: {0}'.format(span))
    if span < epsilon * (1-ddp.beta) / ddp.beta:
        v = u + ((diff.max() + diff.min()) / 2) * \
            (ddp.beta / (1 - ddp.beta))
        break
    ddp.operator_iteration(ddp.T_sigma(sigma), v=u, max_iter=k)
    v = u
    print(' T_sigma^k+1(v): {0}'.format(v))

print('Terminated')
print(' sigma: {0}'.format(sigma))
print(' v: {0}'.format(v))









    



Iterate 0
 v: [0 0]
Iterate 1
 sigma: [1 0]
 T_sigma(v): [ 10.  -1.]
 span: 11.0
 T_sigma^k+1(v): [ 4.96674592 -6.03325408]
Iterate 2
 sigma: [0 0]
 T_sigma(v): [ 4.49340863 -6.73159137]
 span: 0.22499999999999964
 T_sigma^k+1(v): [  1.17973283 -10.24650042]
Iterate 3
 sigma: [0 0]
 T_sigma(v): [  0.69328539 -10.7341754 ]
 span: 0.0012275460282897832
 T_sigma^k+1(v): [ -1.7602088  -13.18876747]
Iterate 4
 sigma: [0 0]
 T_sigma(v): [ -2.10076373 -13.5293291 ]
 span: 6.6971966727891186e-06
Terminated
 sigma: [0 0]
 v: [ -8.57137101 -19.99993638]

Compare this with the implementation with the norm-based termination rule as described in Example 6.5.1, pp.187-188.

Reference

M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley-Interscience, 2005.



In [ ]:

	$v^i(0)$	$v^i(1)$	$\lVert v^i - v^{i-1}\rVert$
$i$
0	0.000000	0.000000	NaN
1	10.000000	-1.000000	10.000000
2	9.275000	-1.950000	0.950000
3	8.479375	-2.852500	0.902500
4	7.672766	-3.709875	0.857375
5	6.882373	-4.524381	0.814506
6	6.120046	-5.298162	0.773781
7	5.390395	-6.033254	0.735092
8	4.694642	-6.731591	0.698337
9	4.032449	-7.395012	0.663420
10	3.402783	-8.025261	0.630249
20	-1.401710	-12.830282	0.377354
30	-4.278653	-15.707225	0.225936
40	-6.001185	-17.429757	0.135276
50	-7.032529	-18.461100	0.080995
60	-7.650033	-19.078604	0.048495
70	-8.019755	-19.448326	0.029035
80	-8.241121	-19.669693	0.017385
90	-8.373661	-19.802233	0.010409
100	-8.453018	-19.881589	0.006232
110	-8.500532	-19.929103	0.003731
120	-8.528980	-19.957551	0.002234
130	-8.546013	-19.974584	0.001338
140	-8.556211	-19.984783	0.000801
150	-8.562317	-19.990889	0.000480
160	-8.565973	-19.994545	0.000287
161	-8.566246	-19.994818	0.000273
162	-8.566505	-19.995077	0.000259
163	-8.566751	-19.995323	0.000246

	$\lVert v^i - v^{i-1}\rVert$	$\mathrm{span}(v^i - v^{i-1})$
$i$
1	10.000000	1.100000e+01
2	0.950000	2.250000e-01
3	0.902500	1.068750e-01
4	0.857375	5.076562e-02
5	0.814506	2.411367e-02
6	0.773781	1.145399e-02
7	0.735092	5.440647e-03
8	0.698337	2.584307e-03
9	0.663420	1.227546e-03
10	0.630249	5.830844e-04
11	0.598737	2.769651e-04
12	0.568800	1.315584e-04
20	0.377354	3.409318e-07
30	0.225936	1.993445e-10
40	0.135276	1.154632e-13
50	0.080995	0.000000e+00
60	0.048495	1.776357e-15