Daisuke Oyama
Faculty of Economics, University of Tokyo
This notebook describes the implementation details of the DiscreteDP
type and its methods.
For the theoretical background and notation, see the lecture Discrete Dynamic Programming.
The following solution algorithms are currently implemented for the DiscreteDP
type: value iteration, policy iteration, and modified policy iteration.
Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an $\varepsilon$-optimal policy for a prespecified value of $\varepsilon$.
Value iteration relies on (only) the fact that the Bellman operator $T$ is a contraction mapping and thus iterative application of $T$ to any initial function $v^0$ converges to its unique fixed point $v^*$.
Policy iteration more closely exploits the particular structure of the problem, where each iteration consists of a policy evaluation step, which computes the value $v_{\sigma}$ of a policy $\sigma$ by solving the linear equation $v = T_{\sigma} v$, and a policy improvement step, which computes a $v_{\sigma}$-greedy policy.
Modified policy iteration replaces the policy evaluation step in policy iteration with "partial policy evaluation", which computes an approximation of the value of a policy $\sigma$ by iterating $T_{\sigma}$ for a specified number of times.
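To fix ideas, here is a minimal sketch of a value iteration loop written directly in terms of QuantEcon's bellman_operator and compute_greedy (both used later in this notebook). The function name vfi_sketch and the exact termination test are illustrative assumptions, not the solver's actual code:

# Illustrative sketch only; the DiscreteDP solver's internal code may differ.
function vfi_sketch(ddp::DiscreteDP, v_init; epsilon=1e-2, max_iter=1000)
    v = copy(v_init)
    for i in 1:max_iter
        v_new = bellman_operator(ddp, v)   # v_new = T v
        # Standard contraction-based test: if successive iterates are this close,
        # v_new is within epsilon/2 of v* and its greedy policy is epsilon-optimal.
        if maximum(abs, v_new - v) < epsilon * (1 - ddp.beta) / (2 * ddp.beta)
            return v_new, compute_greedy(ddp, v_new)
        end
        v = v_new
    end
    return v, compute_greedy(ddp, v)       # max_iter reached
end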
Below we describe our implementation of these algorithms in more detail.
(While not stated explicitly below, in the actual implementation each algorithm also terminates when the number of iterations reaches max_iter.)
solve(ddp, v_init, VFI; max_iter, epsilon)
Given $\varepsilon > 0$, the value iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value function and an $\varepsilon$-optimal policy function (unless max_iter is reached).
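To see where the $\varepsilon/2$ and $\varepsilon$ guarantees come from (the solver's actual termination test may be stated slightly differently): if the iteration stops at the first $i$ with $\lVert v^{i+1} - v^i \rVert < \varepsilon(1 - \beta)/(2\beta)$ (sup norm), then the contraction property gives $\lVert v^{i+1} - v^* \rVert \leq \frac{\beta}{1 - \beta} \lVert v^{i+1} - v^i \rVert < \varepsilon/2$, and a $v^{i+1}$-greedy policy $\sigma$ satisfies $\lVert v_{\sigma} - v^* \rVert \leq \frac{2\beta}{1 - \beta} \lVert v^{i+1} - v^i \rVert < \varepsilon$; cf. Puterman (2005), Section 6.3.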
solve(ddp, v_init, PFI; max_iter)
The policy iteration algorithm terminates in a finite number of iterations, and returns an optimal value function and an optimal policy function (unless max_iter is reached).
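For intuition, the policy evaluation step amounts to one linear solve per iteration. Here is a minimal sketch; the name evaluate_policy_sketch is illustrative, while the library's evaluate_policy (used below) performs this computation, possibly by a different method:

# Illustrative sketch of policy evaluation: solve v = r_sigma + beta * Q_sigma * v.
# (In Julia 1.0+, `I` requires `using LinearAlgebra`.)
function evaluate_policy_sketch(ddp::DiscreteDP, sigma)
    R_sigma, Q_sigma = RQ_sigma(ddp, sigma)     # rewards and transitions under sigma
    return (I - ddp.beta * Q_sigma) \ R_sigma   # v_sigma = (I - beta Q_sigma)^{-1} r_sigma
end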
solve(ddp, v_init, MPFI; max_iter, epsilon, k)
Given $\varepsilon > 0$, provided that $v^0$ is such that $T v^0 \geq v^0$, the modified policy iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value function and an $\varepsilon$-optimal policy function (unless max_iter is reached).
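One simple sufficient condition for $T v^0 \geq v^0$ (not necessarily the one used for the default v_init): take $v^0$ constant, $v^0(s) = c$ for all $s$, with $c \leq \min_{(s, a)} r(s, a)/(1 - \beta)$, where the minimum is over feasible state-action pairs. Then for every $s$, $(T v^0)(s) = \max_a \{r(s, a) + \beta c\} \geq \min_{(s, a)} r(s, a) + \beta c \geq (1 - \beta)c + \beta c = c = v^0(s)$.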
Remarks
If v_init is not specified, it is set to the latter, $\min_{(s', a)} r(s', a)$.

We illustrate the algorithms above with the simple example from Puterman (2005), Section 3.1, pp.33-35.
In [1]:
using QuantEcon
using DataFrames
In [2]:
n = 2 # Number of states
m = 2 # Number of actions
# Reward array
R = [5 10; -1 -Inf]  # -Inf marks action 2 as infeasible in state 2
# Transition probability array
Q = Array{Float64}(n, m, n)
Q[1, 1, :] = [0.5, 0.5]
Q[1, 2, :] = [0, 1]
Q[2, 1, :] = [0, 1]
Q[2, 2, :] = [0.5, 0.5] # Arbitrary, since action 2 is infeasible in state 2
# Discount rate
beta = 0.95
ddp = DiscreteDP(R, Q, beta);
Analytical solution:
In [3]:
function sigma_star(beta)
sigma = Vector{Int64}(2)
sigma[2] = 1
if beta > 10/11
sigma[1] = 1
else
sigma[1] = 2
end
return sigma
end
function v_star(beta)
v = Vector{Float64}(2)
v[2] = -1 / (1 - beta)
if beta > 10/11
v[1] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
else
v[1] = (10 - 11*beta) / (1 - beta)
end
return v
end;
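To see where the threshold $\beta = 10/11$ comes from: in state 2 only action 1 is feasible, so $v^*(2) = -1/(1 - \beta)$. In state 1, the stationary policy choosing action 2 yields $10 + \beta v^*(2) = (10 - 11\beta)/(1 - \beta)$, while the one choosing action 1 yields the $v$ solving $v = 5 + \beta(0.5 v + 0.5 v^*(2))$, i.e., $v = (5 - 5.5\beta)/((1 - 0.5\beta)(1 - \beta))$. The latter is at least the former if and only if $11\beta^2 - 21\beta + 10 \leq 0$, i.e., $(11\beta - 10)(\beta - 1) \leq 0$, i.e., $\beta \geq 10/11$ (with indifference at $\beta = 10/11$).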
In [4]:
sigma_star(beta)
Out[4]:
In [5]:
v_star(beta)
Out[5]:
Solve the problem by value iteration; see Example 6.3.1, p.164 in Puterman (2005).
In [6]:
epsilon = 1e-2
v_init = [0., 0.]
res_vi = solve(ddp, v_init, VFI, epsilon=epsilon);
The number of iterations required to satisfy the termination criterion:
In [7]:
res_vi.num_iter
Out[7]:
The returned value function:
In [8]:
res_vi.v
Out[8]:
It is indeed an $\varepsilon/2$-approximation of $v^*$:
In [9]:
maximum(abs, res_vi.v - v_star(beta)) < epsilon/2
Out[9]:
The returned policy function:
In [10]:
res_vi.sigma
Out[10]:
Value iteration converges very slowly: since $\lVert v^i - v^* \rVert \leq \beta^i \lVert v^0 - v^* \rVert$ and $\beta = 0.95$, reducing the error bound by a factor of $10$ takes roughly $45$ iterations. Let us replicate Table 6.3.1 on p.165:
In [11]:
num_reps = 164
values = Matrix{Float64}(num_reps, n)
diffs = Vector{Float64}(num_reps)
spans = Vector{Float64}(num_reps)
v = [0, 0]
values[1, :] = v
diffs[1] = NaN
spans[1] = NaN
for i in 2:num_reps
v_new = bellman_operator(ddp, v)
values[i, :] = v_new
diffs[i] = maximum(abs, v_new - v)
spans[i] = maximum(v_new - v) - minimum(v_new - v)
v = v_new
end
In [12]:
col_names = map(Symbol, ["i", "v^i(1)", "v^i(2)", "‖v^i - v^(i-1)‖", "span(v^i - v^(i-1))"])
df = DataFrame(Any[0:num_reps-1, values[:, 1], values[:, 2], diffs, spans], col_names)
display_nums = [i+1 for i in 0:9]
append!(display_nums, [10*i+1 for i in 1:16])
append!(display_nums, [160+i+1 for i in 1:3])
df[display_nums, [1, 2, 3, 4]]
Out[12]:
On the other hand, the span decreases faster than the norm; the following replicates Table 6.6.1, page 205:
In [13]:
display_nums = [i+1 for i in 1:12]
append!(display_nums, [10*i+1 for i in 2:6])
df[display_nums, [1, 4, 5]]
Out[13]:
The span-based termination criterion is satisfied when $i = 11$:
In [14]:
epsilon * (1-beta) / beta
Out[14]:
In [15]:
spans[12] < epsilon * (1-beta) / beta
Out[15]:
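The rule being checked here is the one used in the modified policy iteration loop in In [29] below: stop at the first $i$ with $\mathrm{span}(T v^i - v^i) < \varepsilon(1 - \beta)/\beta$, where $\mathrm{span}(w) = \max w - \min w$, return the $v^i$-greedy policy (which is then $\varepsilon$-optimal), and correct the value estimate by adding the constant $\frac{\beta}{1 - \beta} \cdot \frac{\max(T v^i - v^i) + \min(T v^i - v^i)}{2}$ to $T v^i$; see Puterman (2005), Section 6.6.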
In fact, modified policy iteration with $k = 0$ (which amounts to value iteration with the span-based termination rule) terminates in $11$ iterations:
In [16]:
epsilon = 1e-2
v_init = [0., 0.]
k = 0
res_mpi_1 = solve(ddp, v_init, MPFI, epsilon=epsilon, k=k);
In [17]:
res_mpi_1.num_iter
Out[17]:
In [18]:
res_mpi_1.v
Out[18]:
If $\{\sigma^i\}$ is the sequence of policies obtained by policy iteration with an initial policy $\sigma^0$, one can show that $T^i v_{\sigma^0} \leq v_{\sigma^i}$ ($\leq v^*$), so that the number of iterations required for policy iteration is at most that for value iteration; indeed, in many cases the former is significantly smaller than the latter.
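One way to see this: since $\sigma^{i+1}$ is $v_{\sigma^i}$-greedy, $T_{\sigma^{i+1}} v_{\sigma^i} = T v_{\sigma^i} \geq T_{\sigma^i} v_{\sigma^i} = v_{\sigma^i}$, so by the monotonicity of $T_{\sigma^{i+1}}$ we have $v_{\sigma^{i+1}} = \lim_{k \to \infty} T_{\sigma^{i+1}}^k v_{\sigma^i} \geq T v_{\sigma^i}$; induction and the monotonicity of $T$ then give $v_{\sigma^i} \geq T v_{\sigma^{i-1}} \geq T^i v_{\sigma^0}$.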
In [19]:
v_init = [0., 0.]
res_pi = solve(ddp, v_init, PFI);
In [20]:
res_pi.num_iter
Out[20]:
Policy iteration returns the exact optimal value function (up to rounding errors):
In [21]:
res_pi.v
Out[21]:
In [22]:
maximum(abs, res_pi.v - v_star(beta))
Out[22]:
To look into the iterations:
In [23]:
v = [0., 0.]
sigma = [0, 0] # Dummy
sigma_new = compute_greedy(ddp, v)
i = 0
while true
println("Iterate $i")
println(" value: $v")
println(" policy: $sigma_new")
if all(sigma_new .== sigma)
break
end
copy!(sigma, sigma_new)
v = evaluate_policy(ddp, sigma)
sigma_new = compute_greedy(ddp, v)
i += 1
end
println("Terminated")
See Example 6.4.1, pp.176-177.
The evaluation step in policy iteration, which solves the linear equation $v = T_{\sigma} v$ to obtain the policy value $v_{\sigma}$, can be expensive for problems with a large number of states. Modified policy iteration reduces the cost of this step by using an approximation of $v_{\sigma}$ obtained by iterating $T_{\sigma}$ a specified number of times. The tradeoff is that this approach only computes an $\varepsilon$-optimal policy and, for small $\varepsilon$, takes more iterations than policy iteration (though far fewer than value iteration).
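As a rough sketch of the "partial policy evaluation" idea (the function name partial_policy_evaluation is illustrative; the actual solver is exercised below), one simply applies $T_{\sigma}$ a fixed number of times instead of solving the linear system:

# Illustrative sketch: approximate v_sigma by iterating T_sigma k times from v,
# instead of solving (I - beta Q_sigma) v = r_sigma exactly.
function partial_policy_evaluation(ddp::DiscreteDP, sigma, v, k)
    R_sigma, Q_sigma = RQ_sigma(ddp, sigma)   # rewards and transitions under sigma
    for j in 1:k
        v = R_sigma + ddp.beta * Q_sigma * v  # one application of T_sigma
    end
    return v
end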
In [24]:
epsilon = 1e-2
v_init = [0., 0.]
k = 6
res_mpi = solve(ddp, v_init, MPFI, epsilon=epsilon, k=k);
In [25]:
res_mpi.num_iter
Out[25]:
The returned value function:
In [26]:
res_mpi.v
Out[26]:
It is indeed an $\varepsilon/2$-approximation of $v^*$:
In [27]:
maximum(abs, res_mpi.v - v_star(beta)) < epsilon/2
Out[27]:
To look into the iterations:
In [28]:
# T_sigma operator
function T_sigma{T<:Integer}(ddp::DiscreteDP, sigma::Array{T})
R_sigma, Q_sigma = RQ_sigma(ddp, sigma)
return v -> R_sigma + ddp.beta * Q_sigma * v
end;
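As a quick (illustrative) check, $v_{\sigma}$ is a fixed point of $T_{\sigma}$, so applying this operator with the policy returned by policy iteration to the corresponding value should leave it approximately unchanged:

T_sigma(ddp, res_pi.sigma)(res_pi.v) ≈ res_pi.v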
In [29]:
epsilon = 1e-2
v = [0, 0]
k = 6
i = 0
println("Iterate $i")
println(" v: $v")
sigma = Vector{Int64}(n)
u = Vector{Float64}(n)
while true
i += 1
bellman_operator!(ddp, v, u, sigma) # u and sigma are modified in place
diff = u - v
span = maximum(diff) - minimum(diff)
println("Iterate $i")
println(" sigma: $sigma")
println(" T_sigma(v): $u")
println(" span: $span")
if span < epsilon * (1-ddp.beta) / ddp.beta
v = u + ((maximum(diff) + minimum(diff)) / 2) *
(ddp.beta / (1 - ddp.beta))
break
end
v = compute_fixed_point(T_sigma(ddp, sigma), u,
err_tol=0, max_iter=k, verbose=false)
#The above is equivalent to the following:
#for j in 1:k
# v = T_sigma(ddp, sigma)(u)
# copy!(u, v)
#end
#copy!(v, u)
println(" T_sigma^k+1(v): $v")
end
println("Terminated")
println(" sigma: $sigma")
println(" v: $v")
Compare this with an implementation using the norm-based termination rule, as described in Example 6.5.1, pp.187-188.