Daisuke Oyama
Faculty of Economics, University of Tokyo
This notebook describes the implementation details of the DiscreteDP
type and its methods.
For the theoretical background and notation, see the lecture Discrete Dynamic Programming.
The following solution algorithms are currently implemented for the DiscreteDP
type: value iteration, policy iteration, and modified policy iteration.
Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an $\varepsilon$-optimal policy for a prespecified value of $\varepsilon$.
Value iteration relies on (only) the fact that the Bellman operator $T$ is a contraction mapping and thus iterative application of $T$ to any initial function $v^0$ converges to its unique fixed point $v^*$.
Policy iteration more closely exploits the particular structure of the problem, where each iteration consists of a policy evaluation step, which computes the value $v_{\sigma}$ of a policy $\sigma$ by solving the linear equation $v = T_{\sigma} v$, and a policy improvement step, which computes a $v_{\sigma}$-greedy policy.
Modified policy iteration replaces the policy evaluation step in policy iteration with "partial policy evaluation", which computes an approximation of the value of a policy $\sigma$ by iterating $T_{\sigma}$ for a specified number of times.
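To fix ideas, here is a minimal sketch of a value iteration loop written directly in terms of QuantEcon's bellman_operator and compute_greedy (both used later in this notebook). The function name vfi_sketch and the exact termination test are illustrative assumptions, not the solver's actual code:

# Illustrative sketch only; the DiscreteDP solver's internal code may differ.
function vfi_sketch(ddp::DiscreteDP, v_init; epsilon=1e-2, max_iter=1000)
    v = copy(v_init)
    for i in 1:max_iter
        v_new = bellman_operator(ddp, v)   # v_new = T v
        # Standard contraction-based test: if successive iterates are this close,
        # v_new is within epsilon/2 of v* and its greedy policy is epsilon-optimal.
        if maximum(abs, v_new - v) < epsilon * (1 - ddp.beta) / (2 * ddp.beta)
            return v_new, compute_greedy(ddp, v_new)
        end
        v = v_new
    end
    return v, compute_greedy(ddp, v)       # max_iter reached
end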
Below we describe our implementation of these algorithms in more detail.
(While not stated explicitly below, in the actual implementation each algorithm also terminates when the number of iterations reaches max_iter.)
solve(ddp, v_init, VFI; max_iter, epsilon)
Given $\varepsilon > 0$, the value iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value function and an $\varepsilon$-optimal policy function (unless max_iter is reached).
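To see where the $\varepsilon/2$ and $\varepsilon$ guarantees come from (the solver's actual termination test may be stated slightly differently): if the iteration stops at the first $i$ with $\lVert v^{i+1} - v^i \rVert < \varepsilon(1 - \beta)/(2\beta)$ (sup norm), then the contraction property gives $\lVert v^{i+1} - v^* \rVert \leq \frac{\beta}{1 - \beta} \lVert v^{i+1} - v^i \rVert < \varepsilon/2$, and a $v^{i+1}$-greedy policy $\sigma$ satisfies $\lVert v_{\sigma} - v^* \rVert \leq \frac{2\beta}{1 - \beta} \lVert v^{i+1} - v^i \rVert < \varepsilon$; cf. Puterman (2005), Section 6.3.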
solve(ddp, v_init, PFI; max_iter)
The policy iteration algorithm terminates in a finite number of iterations, and returns an optimal value function and an optimal policy function (unless max_iter is reached).
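For intuition, the policy evaluation step amounts to one linear solve per iteration. Here is a minimal sketch; the name evaluate_policy_sketch is illustrative, while the library's evaluate_policy (used below) performs this computation, possibly by a different method:

# Illustrative sketch of policy evaluation: solve v = r_sigma + beta * Q_sigma * v.
# (In Julia 1.0+, `I` requires `using LinearAlgebra`.)
function evaluate_policy_sketch(ddp::DiscreteDP, sigma)
    R_sigma, Q_sigma = RQ_sigma(ddp, sigma)     # rewards and transitions under sigma
    return (I - ddp.beta * Q_sigma) \ R_sigma   # v_sigma = (I - beta Q_sigma)^{-1} r_sigma
end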
solve(ddp, v_init, MPFI; max_iter, epsilon, k)
Given $\varepsilon > 0$, provided that $v^0$ is such that $T v^0 \geq v^0$, the modified policy iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value function and an $\varepsilon$-optimal policy function (unless max_iter is reached).
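One simple sufficient condition for $T v^0 \geq v^0$ (not necessarily the one used for the default v_init): take $v^0$ constant, $v^0(s) = c$ for all $s$, with $c \leq \min_{(s, a)} r(s, a)/(1 - \beta)$, where the minimum is over feasible state-action pairs. Then for every $s$, $(T v^0)(s) = \max_a \{r(s, a) + \beta c\} \geq \min_{(s, a)} r(s, a) + \beta c \geq (1 - \beta)c + \beta c = c = v^0(s)$.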
Remarks
If v_init is not specified, it is set to the latter, $\min_{(s', a)} r(s', a)$.

We illustrate the algorithms above with the simple example from Puterman (2005), Section 3.1, pp.33-35.
In [1]:
using QuantEcon
using DataFrames
In [2]:
n = 2 # Number of states
m = 2 # Number of actions
# Reward array
R = [5 10; -1 -Inf]  # -Inf marks action 2 as infeasible in state 2
# Transition probability array
Q = Array{Float64}(n, m, n)
Q[1, 1, :] = [0.5, 0.5]
Q[1, 2, :] = [0, 1]
Q[2, 1, :] = [0, 1]
Q[2, 2, :] = [0.5, 0.5] # Arbitrary, since action 2 is infeasible in state 2
# Discount rate
beta = 0.95
ddp = DiscreteDP(R, Q, beta);
Analytical solution:
In [3]:
function sigma_star(beta)
sigma = Vector{Int64}(2)
sigma[2] = 1
if beta > 10/11
sigma[1] = 1
else
sigma[1] = 2
end
return sigma
end
function v_star(beta)
v = Vector{Float64}(2)
v[2] = -1 / (1 - beta)
if beta > 10/11
v[1] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
else
v[1] = (10 - 11*beta) / (1 - beta)
end
return v
end;
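To see where the threshold $\beta = 10/11$ comes from: in state 2 only action 1 is feasible, so $v^*(2) = -1/(1 - \beta)$. In state 1, the stationary policy choosing action 2 yields $10 + \beta v^*(2) = (10 - 11\beta)/(1 - \beta)$, while the one choosing action 1 yields the $v$ solving $v = 5 + \beta(0.5 v + 0.5 v^*(2))$, i.e., $v = (5 - 5.5\beta)/((1 - 0.5\beta)(1 - \beta))$. The latter is at least the former if and only if $11\beta^2 - 21\beta + 10 \leq 0$, i.e., $(11\beta - 10)(\beta - 1) \leq 0$, i.e., $\beta \geq 10/11$ (with indifference at $\beta = 10/11$).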
In [4]:
sigma_star(beta)
Out[4]:
In [5]:
v_star(beta)
Out[5]:
Solve the problem by value iteration; see Example 6.3.1, p.164 in Puterman (2005).
In [6]:
epsilon = 1e-2
v_init = [0., 0.]
res_vi = solve(ddp, v_init, VFI, epsilon=epsilon);
The number of iterations required to satisfy the termination criterion:
In [7]:
res_vi.num_iter
Out[7]:
The returned value function:
In [8]:
res_vi.v
Out[8]:
It is indeed an $\varepsilon/2$-approximation of $v^*$:
In [9]:
maximum(abs, res_vi.v - v_star(beta)) < epsilon/2
Out[9]:
The returned policy function:
In [10]:
res_vi.sigma
Out[10]:
Value iteration converges very slowly: since $\lVert v^i - v^* \rVert \leq \beta^i \lVert v^0 - v^* \rVert$ and $\beta = 0.95$, reducing the error bound by a factor of $10$ takes roughly $45$ iterations. Let us replicate Table 6.3.1 on p.165:
In [11]:
num_reps = 164
values = Matrix{Float64}(num_reps, n)
diffs = Vector{Float64}(num_reps)
spans = Vector{Float64}(num_reps)
v = [0, 0]
values[1, :] = v
diffs[1] = NaN
spans[1] = NaN
for i in 2:num_reps
v_new = bellman_operator(ddp, v)
values[i, :] = v_new
diffs[i] = maximum(abs, v_new - v)
spans[i] = maximum(v_new - v) - minimum(v_new - v)
v = v_new
end
In [12]:
col_names = map(Symbol, ["i", "v^i(1)", "v^i(2)", "‖v^i - v^(i-1)‖", "span(v^i - v^(i-1))"])
df = DataFrame(Any[0:num_reps-1, values[:, 1], values[:, 2], diffs, spans], col_names)
display_nums = [i+1 for i in 0:9]
append!(display_nums, [10*i+1 for i in 1:16])
append!(display_nums, [160+i+1 for i in 1:3])
df[display_nums, [1, 2, 3, 4]]
Out[12]:
On the other hand, the span decreases faster than the norm; the following replicates Table 6.6.1, page 205:
In [13]:
display_nums = [i+1 for i in 1:12]
append!(display_nums, [10*i+1 for i in 2:6])
df[display_nums, [1, 4, 5]]
Out[13]:
The span-based termination criterion is satisfied when $i = 11$:
In [14]:
epsilon * (1-beta) / beta
Out[14]:
In [15]:
spans[12] < epsilon * (1-beta) / beta
Out[15]:
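The rule being checked here is the one used in the modified policy iteration loop in In [29] below: stop at the first $i$ with $\mathrm{span}(T v^i - v^i) < \varepsilon(1 - \beta)/\beta$, where $\mathrm{span}(w) = \max w - \min w$, return the $v^i$-greedy policy (which is then $\varepsilon$-optimal), and correct the value estimate by adding the constant $\frac{\beta}{1 - \beta} \cdot \frac{\max(T v^i - v^i) + \min(T v^i - v^i)}{2}$ to $T v^i$; see Puterman (2005), Section 6.6.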
In fact, modified policy iteration with $k = 0$ (which amounts to value iteration with the span-based termination rule) terminates in $11$ iterations:
In [16]:
epsilon = 1e-2
v_init = [0., 0.]
k = 0
res_mpi_1 = solve(ddp, v_init, MPFI, epsilon=epsilon, k=k);
In [17]:
res_mpi_1.num_iter
Out[17]:
In [18]:
res_mpi_1.v
Out[18]:
If $\{\sigma^i\}$ is the sequence of policies obtained by policy iteration with an initial policy $\sigma^0$, one can show that $T^i v_{\sigma^0} \leq v_{\sigma^i}$ ($\leq v^*$), so that the number of iterations required for policy iteration is at most that for value iteration; indeed, in many cases the former is significantly smaller than the latter.
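One way to see this: since $\sigma^{i+1}$ is $v_{\sigma^i}$-greedy, $T_{\sigma^{i+1}} v_{\sigma^i} = T v_{\sigma^i} \geq T_{\sigma^i} v_{\sigma^i} = v_{\sigma^i}$, so by the monotonicity of $T_{\sigma^{i+1}}$ we have $v_{\sigma^{i+1}} = \lim_{k \to \infty} T_{\sigma^{i+1}}^k v_{\sigma^i} \geq T v_{\sigma^i}$; induction and the monotonicity of $T$ then give $v_{\sigma^i} \geq T v_{\sigma^{i-1}} \geq T^i v_{\sigma^0}$.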
In [19]:
v_init = [0., 0.]
res_pi = solve(ddp, v_init, PFI);
In [20]:
res_pi.num_iter
Out[20]:
Policy iteration returns the exact optimal value function (up to rounding errors):
In [21]:
res_pi.v
Out[21]:
In [22]:
maximum(abs, res_pi.v - v_star(beta))
Out[22]:
To look into the iterations:
In [23]:
v = [0., 0.]
sigma = [0, 0] # Dummy
sigma_new = compute_greedy(ddp, v)
i = 0
while true
println("Iterate $i")
println(" value: $v")
println(" policy: $sigma_new")
if all(sigma_new .== sigma)
break
end
copy!(sigma, sigma_new)
v = evaluate_policy(ddp, sigma)
sigma_new = compute_greedy(ddp, v)
i += 1
end
println("Terminated")
See Example 6.4.1, pp.176-177.
The evaluation step in policy iteration, which solves the linear equation $v = T_{\sigma} v$ to obtain the policy value $v_{\sigma}$, can be expensive for problems with a large number of states. Modified policy iteration reduces the cost of this step by using an approximation of $v_{\sigma}$ obtained by iterating $T_{\sigma}$ a specified number of times. The tradeoff is that this approach only computes an $\varepsilon$-optimal policy and, for small $\varepsilon$, takes more iterations than policy iteration (though far fewer than value iteration).
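As a rough sketch of the "partial policy evaluation" idea (the function name partial_policy_evaluation is illustrative; the actual solver is exercised below), one simply applies $T_{\sigma}$ a fixed number of times instead of solving the linear system:

# Illustrative sketch: approximate v_sigma by iterating T_sigma k times from v,
# instead of solving (I - beta Q_sigma) v = r_sigma exactly.
function partial_policy_evaluation(ddp::DiscreteDP, sigma, v, k)
    R_sigma, Q_sigma = RQ_sigma(ddp, sigma)   # rewards and transitions under sigma
    for j in 1:k
        v = R_sigma + ddp.beta * Q_sigma * v  # one application of T_sigma
    end
    return v
end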
In [24]:
epsilon = 1e-2
v_init = [0., 0.]
k = 6
res_mpi = solve(ddp, v_init, MPFI, epsilon=epsilon, k=k);
In [25]:
res_mpi.num_iter
Out[25]:
The returned value function:
In [26]:
res_mpi.v
Out[26]:
It is indeed an $\varepsilon/2$-approximation of $v^*$:
In [27]:
maximum(abs, res_mpi.v - v_star(beta)) < epsilon/2
Out[27]:
To look into the iterations:
In [28]:
# T_sigma operator
function T_sigma{T<:Integer}(ddp::DiscreteDP, sigma::Array{T})
R_sigma, Q_sigma = RQ_sigma(ddp, sigma)
return v -> R_sigma + ddp.beta * Q_sigma * v
end;
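As a quick (illustrative) check, $v_{\sigma}$ is a fixed point of $T_{\sigma}$, so applying this operator with the policy returned by policy iteration to the corresponding value should leave it approximately unchanged:

T_sigma(ddp, res_pi.sigma)(res_pi.v) ≈ res_pi.v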
In [29]:
epsilon = 1e-2
v = [0, 0]
k = 6
i = 0
println("Iterate $i")
println(" v: $v")
sigma = Vector{Int64}(n)
u = Vector{Float64}(n)
while true
i += 1
bellman_operator!(ddp, v, u, sigma) # u and sigma are modified in place
diff = u - v
span = maximum(diff) - minimum(diff)
println("Iterate $i")
println(" sigma: $sigma")
println(" T_sigma(v): $u")
println(" span: $span")
if span < epsilon * (1-ddp.beta) / ddp.beta
v = u + ((maximum(diff) + minimum(diff)) / 2) *
(ddp.beta / (1 - ddp.beta))
break
end
v = compute_fixed_point(T_sigma(ddp, sigma), u,
err_tol=0, max_iter=k, verbose=false)
#The above is equivalent to the following:
#for j in 1:k
# v = T_sigma(ddp, sigma)(u)
# copy!(u, v)
#end
#copy!(v, u)
println(" T_sigma^k+1(v): $v")
end
println("Terminated")
println(" sigma: $sigma")
println(" v: $v")
Compare this with an implementation using the norm-based termination rule, as described in Example 6.5.1, pp.187-188.