Markov Decision Process (MDP)

Infinite Horizon Dynamic Programming

Overview

In dynamic programming an agent finds a policy (an action to take in each state) that maximizes their present discounted reward, given

\begin{align*} \beta &\quad\text{discount rate}\\ R_{ik} &\quad\text{reward from transitioning to $i$ using control $k$} \\ P_{ijk} &\quad\text{probability of transitioning from state $i$ to $j$ using control $k$} \\ V_{i} &\quad\text{present discounted value of being in state $i$} \end{align*}

Value Iteration

Value iteration is a simple way to solve discrete infinite horizon dynamic programs. In value iteration we initialize the present discounted value of each state to an arbitrary value and iterate on the Bellman equation until convergence:

\begin{align*} &\text{do} \\ &\quad n += 1\\ &\quad V_{n} = \max (R + \beta P V_{n-1}) \\ &\text{until}\quad || V_{n} - V_{n-1} || < \epsilon \end{align*}
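
As a concrete reference, here is a minimal sketch of value iteration written in current Julia syntax, independent of the MDP package used in the examples below but assuming the same data layout: R is the vector of expected rewards for each state-action pair (rows grouped by state), P is the matrix of transition probabilities from state-action pairs to next states, and indvec[i] is the index of the last state-action pair belonging to state i. The function name and keyword arguments are illustrative.

# Minimal value iteration sketch (current Julia syntax).
# R[s]      expected reward of state-action pair s (rows grouped by state)
# P[s, j]   probability that state-action pair s leads to state j
# indvec[i] index of the last state-action pair belonging to state i
function valueiteration_sketch(R, P, indvec, β; V=zeros(length(indvec)), eps=1e-8)
    while true
        Q = R + β * (P * V)             # value of every state-action pair
        Vnew = similar(V)
        lb = 1
        for i in eachindex(indvec)      # maximize within each state's block of actions
            Vnew[i] = maximum(Q[lb:indvec[i]])
            lb = indvec[i] + 1
        end
        maximum(abs.(Vnew - V)) < eps && return Vnew
        V = Vnew
    end
end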

Policy Iteration

In policy iteration we start with an initial policy and compute the present discounted value $V$ of following that policy forever. We then switch to the policy that greedily maximizes the current reward plus the discounted continuation value $V$, and repeat these two steps until the policy stops changing.
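
The following is a minimal sketch of this loop under the same assumptions and data layout as the value iteration sketch above, with the evaluation step done by solving the linear system $V = R_\pi + \beta P_\pi V$ directly. The function name and the choice of starting policy (the first feasible action in every state) are illustrative, not the MDP package's implementation.

using LinearAlgebra   # for the identity I in the evaluation step

# Minimal policy iteration sketch (current Julia syntax); returns (V, policy),
# where policy[i] is the row of the state-action pair chosen in state i.
function policyiteration_sketch(R, P, indvec, β)
    n = length(indvec)
    policy = [i == 1 ? 1 : indvec[i-1] + 1 for i in 1:n]   # start with the first feasible action
    while true
        # policy evaluation: solve V = R_π + β P_π V
        Rπ = R[policy]
        Pπ = P[policy, :]
        V = (I - β * Pπ) \ Rπ
        # policy improvement: greedy one-step lookahead
        Q = R + β * (P * V)
        newpolicy = similar(policy)
        lb = 1
        for i in 1:n
            newpolicy[i] = (lb - 1) + argmax(Q[lb:indvec[i]])
            lb = indvec[i] + 1
        end
        newpolicy == policy && return V, policy
        policy = newpolicy
    end
end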

Examples

Ex. 1

Following an example given in Foundations of Machine Learning, suppose we have an infinite horizon dynamic program described by the graph shown below:

Notice that the edges in the graph carry both probabilities and rewards. For instance, if you choose to play $a$ in state 1 there is a 75% chance that you will end up in state 1 (with a reward of 2) and a 25% chance that you will end up in state 2 (also with a reward of 2).
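
Reading the remaining edges the same way ($b$ moves from state 1 to state 2 with a reward of 2, $c$ keeps the agent in state 2 with a reward of 2, and $d$ moves from state 2 back to state 1 with a reward of 3), the Bellman equations for this example are

\begin{align*} V_1 &= \max\{\, 2 + \beta(0.75\,V_1 + 0.25\,V_2),\; 2 + \beta V_2 \,\} \\ V_2 &= \max\{\, 2 + \beta V_2,\; 3 + \beta V_1 \,\} \end{align*}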

In Julia this problem can be represented using arrays


In [4]:
using MDP

# rows of R and P correspond to the state-action pairs (1,a), (1,b), (2,c), (2,d);
# columns correspond to the next state
I = [1,1,2,3,4]
J = [1,2,2,2,1]
R = sparse(I, J, Float64[2,2,2,2,3])        # reward attached to each transition
P = sparse(I,J,[.75,.25,1.00,1.00,1.00])    # transition probabilities
SA = [
    ((1,),('a',)),
    ((1,),('b',)),
    ((2,),('c',)),
    ((2,),('d',))
]
R = Float64[2,2,2,3]   # expected reward of each state-action pair (overwrites the transition rewards above)
indvec = [2,4]
# indvec describes what choices correspond to each state
# the "2" says that the first and second elements of R correspond to playing "a" or "b" in state 1
# the "4" says that the third and fourth elements of R correspond to playing "c" or "d" in state 2
V0 = Float64[-1; 1]    # arbitrary starting values for value iteration
β=0.5

mdp = SimpleMDP(R,P,SA,indvec,β,IClock())

valueiteration(mdp; Vstart=V0)


Out[4]:
MDPSolution([4.66614,5.33264],[2,4])

The output of value iteration shows that the optimal policy is to play $b$ in state 1 and $d$ in state 2. The policy makes intuitive sense given the above graph: playing $b$ sends the agent to state 2 for certain, where the high reward action $d$ can be played.
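
We can verify these numbers by hand. Under the policy of playing $b$ in state 1 and $d$ in state 2 the agent alternates between the two states, so

$$ V_1 = 2 + \beta V_2, \qquad V_2 = 3 + \beta V_1, $$

which with $\beta = 0.5$ gives $V_1 = 14/3 \approx 4.667$ and $V_2 = 16/3 \approx 5.333$. Value iteration stops within its tolerance of these values, and policy iteration below recovers them exactly.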


In [5]:
policyiteration(mdp)


V
[4.666666666666666,5.333333333333333]
Out[5]:
MDPSolution([4.66667,5.33333],[2,4])

Ex. 2

Consider a deterministic infinite horizon optimal growth problem used by Judd in Numerical Methods in Economics.

In the problem the agent maximizes their present discounted utility by choosing their consumption. A higher rate of consumption means that less capital is accumulated, which in turn lowers future consumption.

$$ V(k) = \max_c u(c) + \beta V(k^+(k,c)) $$

where $u(c) = \frac{c^{1+\gamma}}{1+\gamma}$, $F(k) = k + f(k) = k + \frac{(1-\beta)}{\alpha\beta}k^\alpha$ and $k^+ = F(k) - c$

The problem with this formulation is that it is difficult to discretize $c$ in a way that is compatible with the levels of $k$. To circumvent this problem we make the control variable next period's capital stock rather than consumption. This gives

$$ V(k) = \max_{k^+} u(F(k) - k^+) + \beta V(k^+) $$

Assuming $\gamma = -2$, $\alpha = 0.25$ and $\beta = 0.96$, the steady state level of capital is one. Since every positive initial capital stock converges to the steady state, discretizing a region around the steady state means the optimal policy never takes the capital stock outside the discretized region. In the code below I discretize the capital stock into 401 equally spaced levels between 0.8 and 1.2.
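
To see why the steady state is one, combine the first order condition $u'(c) = \beta V'(k^+)$ with the envelope condition $V'(k) = u'(c)F'(k)$: at a steady state consumption is constant, so $\beta F'(k^*) = 1$. Since $F'(k) = 1 + \frac{1-\beta}{\beta}k^{\alpha-1}$, this gives

$$ \beta + (1-\beta)(k^*)^{\alpha-1} = 1 \quad\Longrightarrow\quad (k^*)^{\alpha-1} = 1 \quad\Longrightarrow\quad k^* = 1. $$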


In [15]:
using MDP
# reward function
util(c; g=-2) = c^(g+1)/(g+1)

# Law of Motion for the capital stock
transition(k; alpha=0.25, beta=0.96) = k + ((1-beta)/(alpha*beta))*k^alpha

# function to determine what actions are feasible in a given state
reachable(c) = c >= 0

ks = 0.8:0.001:1.2 # grid of capital stock levels (401 equally spaced points between 0.8 and 1.2)
n = length(ks)

# determine the feasible state action pairs
R = Float64[] # rewards for all feasible state action pairs
SA = ((Float64,),(Float64,))[] # enumeration of all feasible state action pairs
indvec = Int64[]

k = 1
for i=1:n
    if i>1 push!(indvec, k-1) end
    for j=1:n
        c = transition(ks[i])-ks[j]
        if reachable(c)
            push!(SA, ((ks[i],),(ks[j],)))
            push!(R, util(c))
            k+=1
        end
    end
end
push!(indvec, k-1)

# build the transition matrix: the problem is deterministic, so the rows belonging
# to state i map its j-th feasible choice of next period's capital to state j with probability one
P = sparse(Int64[],Int64[],Float64[],k-1,n)
P[1:indvec[1], 1:indvec[1]] = eye(indvec[1])
for i=2:length(indvec)
    lb = indvec[i-1]+1
    ub = indvec[i]
    P[lb:ub,1:(ub-lb+1)] = eye(ub-lb+1)
end

β = 0.96
dp = SimpleMDP(R,P,SA,indvec,β,IClock())


Out[15]:
SimpleMDP{IClock}([-6.34423,-6.38473,-6.42576,-6.46732,-6.50942,-6.55207,-6.59528,-6.63907,-6.68344,-6.72841  …  -5.4514,-5.48128,-5.51149,-5.54203,-5.57292,-5.60415,-5.63573,-5.66768,-5.69998,-5.73266],132481x401 sparse matrix with 132481 Float64 entries:
	[1     ,      1]  =  1.0
	[159   ,      1]  =  1.0
	[318   ,      1]  =  1.0
	[478   ,      1]  =  1.0
	[639   ,      1]  =  1.0
	[801   ,      1]  =  1.0
	[964   ,      1]  =  1.0
	[1128  ,      1]  =  1.0
	[1293  ,      1]  =  1.0
	[1460  ,      1]  =  1.0
	⋮
	[128471,    401]  =  1.0
	[128872,    401]  =  1.0
	[129273,    401]  =  1.0
	[129674,    401]  =  1.0
	[130075,    401]  =  1.0
	[130476,    401]  =  1.0
	[130877,    401]  =  1.0
	[131278,    401]  =  1.0
	[131679,    401]  =  1.0
	[132080,    401]  =  1.0
	[132481,    401]  =  1.0,[((0.8,),(0.8,)),((0.8,),(0.801,)),((0.8,),(0.802,)),((0.8,),(0.803,)),((0.8,),(0.804,)),((0.8,),(0.805,)),((0.8,),(0.806,)),((0.8,),(0.807,)),((0.8,),(0.808,)),((0.8,),(0.809,))  …  ((1.2,),(1.191,)),((1.2,),(1.192,)),((1.2,),(1.193,)),((1.2,),(1.194,)),((1.2,),(1.195,)),((1.2,),(1.196,)),((1.2,),(1.197,)),((1.2,),(1.198,)),((1.2,),(1.199,)),((1.2,),(1.2,))],[158,317,477,638,800,963,1127,1292,1459,1627  …  128872,129273,129674,130075,130476,130877,131278,131679,132080,132481],0.96,IClock())

In [16]:
v = policyiteration(dp; eps=1e-2)


Out[16]:
MDPSolution([-158.288,-158.242,-158.196,-158.15,-158.105,-158.059,-158.013,-157.968,-157.922,-157.877  …  -143.404,-143.372,-143.34,-143.308,-143.276,-143.245,-143.213,-143.181,-143.15,-143.118],[7,166,326,487,649,812,976,1141,1307,1475  …  128857,129259,129661,130063,130465,130867,131269,131671,132073,132474],[158,317,477,638,800,963,1127,1292,1459,1627  …  128872,129273,129674,130075,130476,130877,131278,131679,132080,132481],[((0.8,),(0.8,)),((0.8,),(0.801,)),((0.8,),(0.802,)),((0.8,),(0.803,)),((0.8,),(0.804,)),((0.8,),(0.805,)),((0.8,),(0.806,)),((0.8,),(0.807,)),((0.8,),(0.808,)),((0.8,),(0.809,))  …  ((1.2,),(1.191,)),((1.2,),(1.192,)),((1.2,),(1.193,)),((1.2,),(1.194,)),((1.2,),(1.195,)),((1.2,),(1.196,)),((1.2,),(1.197,)),((1.2,),(1.198,)),((1.2,),(1.199,)),((1.2,),(1.2,))])

In [34]:
# trace the sequence of capital stock levels chosen by the optimal policy,
# starting from the state with index startindx
function computepath(startindx, mdpsol; len=10)
    policy = mdpsol.policy
    action = copy(policy)
    indvec = mdpsol.indvec

    # convert the global state-action indices into within-state indices, which
    # (because the transitions are deterministic) equal the index of the next state
    for i=2:length(policy)
        action[i] -= indvec[i-1]
    end
    SA = mdpsol.SA

    path = Array(Float64,len+1)
    indx = startindx
    path[1] = SA[policy[indx]][2][1]   # capital level chosen in the starting state
    for i=1:len
        indx = action[indx]            # move to the chosen next state
        path[i+1] = SA[policy[indx]][2][1]
    end
    path
end

computepath(401, v; len=100)   # path starting from the highest grid point, k = 1.2


Out[34]:
101-element Array{Float64,1}:
 1.193
 1.187
 1.181
 1.175
 1.169
 1.163
 1.158
 1.153
 1.148
 1.143
 1.138
 1.133
 1.129
 ⋮    
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008

In [42]:
using DataFrames
df = DataFrame(low=computepath(1,v;len=99), high=computepath(401,v; len=99),time=1:100)


Out[42]:
     low    high   time
1    0.806  1.193  1
2    0.812  1.187  2
3    0.818  1.181  3
4    0.824  1.175  4
5    0.83   1.169  5
6    0.836  1.163  6
7    0.841  1.158  7
8    0.846  1.153  8
9    0.851  1.148  9
10   0.856  1.143  10
11   0.861  1.138  11
12   0.866  1.133  12
13   0.87   1.129  13
14   0.874  1.125  14
15   0.878  1.121  15
16   0.882  1.117  16
17   0.886  1.113  17
18   0.89   1.109  18
19   0.894  1.105  19
20   0.897  1.102  20
21   0.9    1.099  21
22   0.903  1.096  22
23   0.906  1.093  23
24   0.909  1.09   24
25   0.912  1.087  25
26   0.915  1.084  26
27   0.918  1.081  27
28   0.921  1.078  28
29   0.924  1.075  29
30   0.926  1.073  30
⋮    ⋮      ⋮      ⋮

In [46]:
using Gadfly

plot(df, layer(x="time", y="low", Geom.line), layer(x="time", y="high", Geom.line),
Guide.xlabel("time (t)"), Guide.ylabel("Capital Stock (k)"))


Out[46]:
[Gadfly plot: Capital Stock (k) against time (t) for the paths starting at k = 0.8 and k = 1.2]
Plots of the states chosen by the optimal policy over time, starting from either end of the discretized region, show that there is convergence to a neighborhood of the steady state value. They also show that the discretization causes values close to the true steady state to become steady states themselves, because the current cost of increasing or decreasing the capital stock by a whole grid step is more than the discounted future reward.