In dynamic programming an agent finds a policy (an action to take given their current state) that maximizes their present discounted reward, given
\begin{align*} \beta &\quad\text{discount factor}\\ R_{ik} &\quad\text{reward from transitioning to $i$ using control $k$} \\ P_{ijk} &\quad\text{probability of transitioning from state $i$ to $j$ using control $k$} \\ V_{i} &\quad\text{present discounted value of being in state $i$} \end{align*}
Value iteration is a simple way to solve discrete infinite horizon dynamic programs. In value iteration we initialize the present discounted value of each state to an arbitrary guess and iterate on the Bellman equation until convergence:
\begin{align*} &\text{do} \\ &\quad n += 1\\ &\quad V^{(n+1)} = \max \left(R + \beta P V^{(n)}\right) \\ &\text{until}\quad || V^{(n+1)} - V^{(n)} || < \epsilon \end{align*}
In policy iteration we instead start with an initial policy and compute the present discounted value $V$ of following that policy forever. We then switch to the policy that greedily maximizes the one-period payoff plus the discounted continuation value $V$, and repeat until the policy stops changing.
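To make both loops concrete, here is a minimal plain-Julia sketch. The function names and the dense array layout are my own illustration, not the interface of the MDP package used below: R[i,k] holds the expected reward of playing control k in state i, and P[i,j,k] the probability of moving from state i to state j under control k.

# R[i,k]   : expected reward of playing control k in state i
# P[i,j,k] : probability of moving from state i to state j under control k
function value_iterate(R, P, beta; eps=1e-8)
    nstates, ncontrols = size(R)
    V = zeros(nstates)
    err = Inf
    while err > eps
        Vnew = zeros(nstates)
        for i in 1:nstates
            best = -Inf
            for k in 1:ncontrols
                q = R[i,k]                        # immediate reward
                for j in 1:nstates
                    q += beta * P[i,j,k] * V[j]   # discounted continuation value
                end
                best = max(best, q)
            end
            Vnew[i] = best
        end
        err = maximum(abs(Vnew[i] - V[i]) for i in 1:nstates)
        V = Vnew
    end
    return V
end

function policy_iterate(R, P, beta; maxit=1000)
    nstates, ncontrols = size(R)
    policy = ones(Int, nstates)                   # arbitrary initial policy
    for it in 1:maxit
        # policy evaluation: solve (I - beta*Ppol) V = Rpol for the current policy
        A = [(i == j ? 1.0 : 0.0) - beta*P[i,j,policy[i]] for i in 1:nstates, j in 1:nstates]
        Rpol = [R[i,policy[i]] for i in 1:nstates]
        V = A \ Rpol
        # policy improvement: act greedily against V
        newpolicy = copy(policy)
        for i in 1:nstates
            best = -Inf
            for k in 1:ncontrols
                q = R[i,k] + beta*sum(P[i,j,k]*V[j] for j in 1:nstates)
                if q > best
                    best = q
                    newpolicy[i] = k
                end
            end
        end
        if newpolicy == policy
            return V, policy
        end
        policy = newpolicy
    end
    error("policy iteration did not converge")
end

The MDP package used in the cells below stores the same information more compactly, with one entry of R and one row of P per feasible state-action pair.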
Following an example given in Foundations of Machine Learning, suppose we have an infinite horizon dynamic program corresponding to the graph shown below:
Notice that the edges in the graph carry both probabilities and rewards. For instance, if you choose to play $a$ in state 1 there is a 75% chance that you will remain in state 1 (with a reward of 2) and a 25% chance that you will end up in state 2 (also with a reward of 2).
In Julia this problem can be represented using arrays:
In [4]:
using MDP

# Each row of the transition data corresponds to one state-action pair
# (1,a), (1,b), (2,c), (2,d); the columns are the destination states.
I = [1,1,2,3,4]
J = [1,2,2,2,1]
# Rewards attached to each possible transition (shown for reference;
# R is redefined below as the expected reward of each state-action pair)
R = sparse(I, J, Float64[2,2,2,2,3])
# Transition probabilities: row = state-action pair, column = destination state
P = sparse(I, J, [.75,.25,1.00,1.00,1.00])
# Enumeration of the feasible state-action pairs
SA = [
    ((1,),('a',)),
    ((1,),('b',)),
    ((2,),('c',)),
    ((2,),('d',))
]
# Expected immediate reward of each state-action pair
# (e.g. playing a in state 1 gives .75*2 + .25*2 = 2)
R = Float64[2,2,2,3]
indvec = [2,4]
# indvec describes which choices correspond to each state:
# the "2" says that the first and second elements of R correspond to playing "a" or "b" in state 1
# the "4" says that the third and fourth elements of R correspond to playing "c" or "d" in state 2
V0 = Float64[-1; 1]  # arbitrary initial guess for the value function
β = 0.5
mdp = SimpleMDP(R, P, SA, indvec, β, IClock())
valueiteration(mdp; Vstart=V0)
Out[4]:
The output of value iteration shows that the optimal policy is to play $b$ in state 1 and $d$ in state 2. The policy makes intuitive sense given the graph above: playing $b$ moves the agent to state 2 with certainty, which maximizes the chance of being able to take the high-reward action $d$.
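As a quick check (the arithmetic here is mine, since the notebook output is not reproduced), solving the Bellman equations under the policy $(b,d)$ with $\beta = 0.5$ gives
$$ V_1 = 2 + \tfrac{1}{2}V_2, \qquad V_2 = 3 + \tfrac{1}{2}V_1 \quad\Longrightarrow\quad V_1 = \tfrac{14}{3} \approx 4.67, \quad V_2 = \tfrac{16}{3} \approx 5.33, $$
and neither one-shot deviation improves on this: playing $a$ in state 1 yields $2 + \tfrac{1}{2}(0.75\,V_1 + 0.25\,V_2) \approx 4.42 < V_1$, and playing $c$ in state 2 yields $2 + \tfrac{1}{2}V_2 \approx 4.67 < V_2$.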
In [5]:
policyiteration(mdp)
Out[5]:
Now consider a deterministic infinite horizon optimal growth problem used by Judd in Numerical Methods in Economics.
In this problem the agent maximizes their present discounted utility by choosing how much to consume. A higher rate of consumption means that less capital is accumulated, which in turn lowers future consumption.
$$ V(k) = \max_c u(c) + \beta V(k^+(k,c)) $$
where $u(c) = \frac{c^{1+\gamma}}{1+\gamma}$, $F(k) = k + f(k) = k + \frac{(1-\beta)}{\alpha\beta}k^\alpha$, and $k^+ = F(k) - c$.
The problem with this formulation is that it is difficult to discretize $c$ in a way that is compatible with the discretized levels of $k$. To circumvent this problem we make the control variable next period's capital stock rather than consumption. This gives
$$ V(k) = \max_{k^+} u(F(k) - k^+) + \beta V(k^+) $$
Assuming $\gamma = -2$, $\alpha = 0.25$ and $\beta = 0.96$, the steady state level of capital is one. Since every strictly positive initial capital stock converges to the steady state, discretizing a region around the steady state ensures that the optimal policy never leaves the discretized region. In the code below I discretize the capital stock into 401 equally spaced points between 0.8 and 1.2.
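To see why the steady state is one (a standard Euler equation argument, reconstructed here rather than taken from the notebook), combine the first order condition $u'(c) = \beta V'(k^+)$ with the envelope condition $V'(k) = u'(c)F'(k)$. At a steady state consumption is constant, so $\beta F'(k^*) = 1$; since $F'(k) = 1 + \frac{1-\beta}{\beta}k^{\alpha-1}$ this becomes
$$ \beta + (1-\beta)\,(k^*)^{\alpha-1} = 1 \quad\Longrightarrow\quad (k^*)^{\alpha-1} = 1 \quad\Longrightarrow\quad k^* = 1, $$
independently of $\gamma$.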
In [15]:
using MDP
# reward function
util(c; g=-2) = c^(g+1)/(g+1)
# resources available for next period's capital and consumption: F(k) = k + f(k)
transition(k; alpha=0.25, beta=0.96) = k + ((1-beta)/(alpha*beta))*k^alpha
# a (k, k+) pair is feasible only if the implied consumption is nonnegative
reachable(c) = c >= 0
ks = 0.8:0.001:1.2 # capital levels at which the value function is computed
n = length(ks)
# enumerate the feasible state-action pairs
R = Float64[] # rewards for all feasible state-action pairs
SA = ((Float64,),(Float64,))[] # enumeration of all feasible state-action pairs
indvec = Int64[] # index of the last state-action pair belonging to each state
k = 1 # k-1 is the number of feasible state-action pairs found so far
for i=1:n
if i>1 push!(indvec, k-1) end
for j=1:n
c = transition(ks[i])-ks[j]
if reachable(c)
push!(SA, ((ks[i],),(ks[j],)))
push!(R, util(c))
k+=1
end
end
end
push!(indvec, k-1)
# build the (state-action pair) x (state) transition matrix;
# choosing k+ = ks[j] moves the agent to state j with certainty
P = sparse(Int64[],Int64[],Float64[],k-1,n)
P[1:indvec[1], 1:indvec[1]] = eye(indvec[1])
for i=2:length(indvec)
lb = indvec[i-1]+1
ub = indvec[i]
#@printf "lb = %0.0f\n" lb
#@printf "ub = %0.0f\n" ub
P[lb:ub,1:(ub-lb+1)] = eye(ub-lb+1)
end
β = 0.96
dp = SimpleMDP(R,P,SA,indvec,β,IClock())
Out[15]:
In [16]:
v = policyiteration(dp; eps=1e-2)
Out[16]:
In [34]:
# Follow the optimal policy forward from a given starting state and record
# the capital stock chosen at each step.
function computepath(startindx, mdpsol; len=10)
    policy = mdpsol.policy       # index of the chosen state-action pair for each state
    action = copy(policy)
    indvec = mdpsol.indvec
    # convert the state-action pair index into a within-state action index,
    # which here equals the index of next period's capital level
    for i=2:length(policy)
        action[i] -= indvec[i-1]
    end
    SA = mdpsol.SA
    path = Array(Float64,len+1)
    indx = startindx
    path[1] = SA[policy[indx]][2][1]   # capital level chosen in the starting state
    for i=1:len
        indx = action[indx]            # move to the state implied by the chosen capital level
        path[i+1] = SA[policy[indx]][2][1]
    end
    path
end
Out[34]:
In [42]:
using DataFrames
df = DataFrame(low=computepath(1,v;len=99), high=computepath(401,v; len=99),time=1:100)
Out[42]:
In [46]:
using Gadfly
plot(df, layer(x="time", y="low", Geom.line), layer(x="time", y="high", Geom.line),
Guide.xlabel("time (t)"), Guide.ylabel("Capital Stock (k)"))
Out[46]:
Plots of the capital stocks chosen by the optimal policy over time, starting from either end of the discretized region, show that there is global convergence to a neighborhood of the steady state value. They also show that the discretization turns values close to the true steady state into steady states themselves, because the current cost of increasing or decreasing the capital stock by a whole grid step exceeds the discounted future reward from doing so.