Markov Decision Process (MDP)

Infinite Horizon Dynamic Programming

Overview

In dynamic programming an agent finds a policy (an action given their current state) that maximizes their present discounted award given

\begin{align*} \beta &\quad\text{discount rate}\\ R_{ik} &\quad\text{reward from transitioning to $i$ using control $k$} \\ P_{ijk} &\quad\text{probability of transitioning from state $i$ to $j$ using control $k$} \\ V_{i} &\quad\text{present discounted value of being in state $i$} \end{align*}

Value Iteration

Value iteration is a simple way to estimate discrete infinite horizon dynamic programs. In value iteration we set our present discounted value of being in a particular state to arbitrary values and iterate on the Bellman equation until convergence

\begin{align*} &\text{do} \\ &\quad i += 1\\ &\quad V_{i+1} = \max (R + \beta PV_i) \\ &\text{until}\quad || V_{i+1} - V_i || < \epsilon \end{align*}

Policy Iteration

In policy iteration we start with an initial policy and then find the present discounted value $V$ of the policy. Then the policy that greedily maximizes the payoff for next period is taken.

Examples

Ex. 1

Following an example given in Foundations of Machine Learning suppose we have a infinite horizon dynamic program identical to the one given in the graph shown below:

Notice the edges in the graph correspond have probabilities and rewards. For instance, if you choose to play $a$ in state 1 there is a 75% chance that you will end up in state 1 (with a reward of 2) and 25% percent chance that you will end up in state 2 (also with a reward of 2).

In Julia this problem can be represented using arrays



In [4]:

    
using MDP

I = [1,1,2,3,4]
J = [1,2,2,2,1]
R = sparse(I, J, Float64[2,2,2,2,3])
P = sparse(I,J,[.75,.25,1.00,1.00,1.00])
SA = [
    ((1,),('a',)),
    ((1,),('b',)),
    ((2,),('c',)),
    ((2,),('d',))
]
R = Float64[2,2,2,3]
indvec = [2,4]
# indvec describes what choices correspond to each state
# the "2" says that the first to second elements of R correspond to playing "a" or "b" in state 1
# the "4" says that the third to fourth elements of R correspond to playing "c" or "d" in state 2
V0 = Float64[-1; 1]
β=0.5

mdp = SimpleMDP(R,P,SA,indvec,β,IClock())

valueiteration(mdp; Vstart=V0)









    Out[4]:





MDPSolution([4.66614,5.33264],[2,4])

The output of value iteration shows that the optimal policy is to play $b$ in state 1 and $d$ in state 2. The policy makes intuitive sense given the above graph because the agent wants to maximize the chance of getting the high reward choice $d$.



In [5]:

    
policyiteration(mdp)









    



V
[4.666666666666666,5.333333333333333]






    Out[5]:





MDPSolution([4.66667,5.33333],[2,4])

Ex. 2

Consider a deterministic infinite horizon optimal growth problem used by Judd in Numerical Methods for Economists

In the problem the agent maximizes their present discounted utility by changing their consumption. A higher rate of consumption lowers means that less capital is being produced which in turn lowers future consumption.

$$ V(k) = \max_c u(c) + \beta V(k^+(k,c)) $$

where $u(c) = \frac{c^{1+\gamma}}{1+\gamma}$, $F(k) = k + f(k) = k + \frac{(1-\beta)}{\alpha\beta}k^\alpha$ and $k^+ = F(k) - c$

The problem with this formulation is that it is difficult to discretize $c$ in a way that is compatible with the levels of $k$. In order to circumvent this problem we make control variable the state variable of the next period. This gives

$$ V(k) = \max_{k^+} u(F(k) - k^+) + \beta V(k^+) $$

Assuming $\gamma = -2$, $\alpha = 0.25$ and $\beta = 0.96$ the steady state level of capital is one. Since all initial capital values but zero converge to the steady state if a region around the steady state is discretized then the policy will not have an optimal policy go outside the discretized region. In the code below I chose to discretize the capital stock levels into 401 equally sized bins between 0.8 and 1.2.



In [15]:

    
using MDP
# reward function
util(c; g=-2) = c^(g+1)/(g+1)

# Law of Motion for the capital stock
transition(k; alpha=0.25, beta=0.96) = k + ((1-beta)/(alpha*beta))*k^alpha

# function to determine what actions are feasible in a given state
reachable(c) = c >= 0

ks = 0.8:0.001:1.2 # capital levels at which the 
n = length(ks)

# determine the feasible state action pairs
R = Float64[] # rewards for all feasible state action pairs
SA = ((Float64,),(Float64,))[] # enumeration of all feasible state action pairs
indvec = Int64[]

k = 1
for i=1:n
    if i>1 push!(indvec, k-1) end
    for j=1:n
        c = transition(ks[i])-ks[j]
        if reachable(c)
            push!(SA, ((ks[i],),(ks[j],)))
            push!(R, util(c))
            k+=1
        end
    end
end
push!(indvec, k-1)

# build the transition matrix
P = sparse(Int64[],Int64[],Float64[],k-1,n)
P[1:indvec[1], 1:indvec[1]] = eye(indvec[1])
for i=2:length(indvec)
    lb = indvec[i-1]+1
    ub = indvec[i]
    #@printf "lb = %0.0f\n" lb
    #@printf "ub = %0.0f\n" ub
    P[lb:ub,1:(ub-lb+1)] = eye(ub-lb+1)
end

β = 0.96
dp = SimpleMDP(R,P,SA,indvec,β,IClock())









    Out[15]:





SimpleMDP{IClock}([-6.34423,-6.38473,-6.42576,-6.46732,-6.50942,-6.55207,-6.59528,-6.63907,-6.68344,-6.72841  …  -5.4514,-5.48128,-5.51149,-5.54203,-5.57292,-5.60415,-5.63573,-5.66768,-5.69998,-5.73266],132481x401 sparse matrix with 132481 Float64 entries:
	[1     ,      1]  =  1.0
	[159   ,      1]  =  1.0
	[318   ,      1]  =  1.0
	[478   ,      1]  =  1.0
	[639   ,      1]  =  1.0
	[801   ,      1]  =  1.0
	[964   ,      1]  =  1.0
	[1128  ,      1]  =  1.0
	[1293  ,      1]  =  1.0
	[1460  ,      1]  =  1.0
	⋮
	[128471,    401]  =  1.0
	[128872,    401]  =  1.0
	[129273,    401]  =  1.0
	[129674,    401]  =  1.0
	[130075,    401]  =  1.0
	[130476,    401]  =  1.0
	[130877,    401]  =  1.0
	[131278,    401]  =  1.0
	[131679,    401]  =  1.0
	[132080,    401]  =  1.0
	[132481,    401]  =  1.0,[((0.8,),(0.8,)),((0.8,),(0.801,)),((0.8,),(0.802,)),((0.8,),(0.803,)),((0.8,),(0.804,)),((0.8,),(0.805,)),((0.8,),(0.806,)),((0.8,),(0.807,)),((0.8,),(0.808,)),((0.8,),(0.809,))  …  ((1.2,),(1.191,)),((1.2,),(1.192,)),((1.2,),(1.193,)),((1.2,),(1.194,)),((1.2,),(1.195,)),((1.2,),(1.196,)),((1.2,),(1.197,)),((1.2,),(1.198,)),((1.2,),(1.199,)),((1.2,),(1.2,))],[158,317,477,638,800,963,1127,1292,1459,1627  …  128872,129273,129674,130075,130476,130877,131278,131679,132080,132481],0.96,IClock())



In [16]:

    
v = policyiteration(dp; eps=1e-2)









    Out[16]:





MDPSolution([-158.288,-158.242,-158.196,-158.15,-158.105,-158.059,-158.013,-157.968,-157.922,-157.877  …  -143.404,-143.372,-143.34,-143.308,-143.276,-143.245,-143.213,-143.181,-143.15,-143.118],[7,166,326,487,649,812,976,1141,1307,1475  …  128857,129259,129661,130063,130465,130867,131269,131671,132073,132474],[158,317,477,638,800,963,1127,1292,1459,1627  …  128872,129273,129674,130075,130476,130877,131278,131679,132080,132481],[((0.8,),(0.8,)),((0.8,),(0.801,)),((0.8,),(0.802,)),((0.8,),(0.803,)),((0.8,),(0.804,)),((0.8,),(0.805,)),((0.8,),(0.806,)),((0.8,),(0.807,)),((0.8,),(0.808,)),((0.8,),(0.809,))  …  ((1.2,),(1.191,)),((1.2,),(1.192,)),((1.2,),(1.193,)),((1.2,),(1.194,)),((1.2,),(1.195,)),((1.2,),(1.196,)),((1.2,),(1.197,)),((1.2,),(1.198,)),((1.2,),(1.199,)),((1.2,),(1.2,))])



In [34]:

    
function computepath(startindx, mdpsol; len=10)
    policy = mdpsol.policy
    action = copy(policy)
    indvec = mdpsol.indvec

    for i=2:length(policy)
        action[i] -= indvec[i-1]
    end
    SA = mdpsol.SA
    
    path = Array(Float64,len+1) 
    indx = startindx
    path[1] = SA[policy[indx]][2][1]
    for i=1:len
        indx = action[indx]
        path[i+1] = SA[policy[indx]][2][1]
    end
    path
end









    Out[34]:





101-element Array{Float64,1}:
 1.193
 1.187
 1.181
 1.175
 1.169
 1.163
 1.158
 1.153
 1.148
 1.143
 1.138
 1.133
 1.129
 ⋮    
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008
 1.008



In [42]:

    
using DataFrames
df = DataFrame(low=computepath(1,v;len=99), high=computepath(401,v; len=99),time=1:100)









    Out[42]:




low high time
1 0.806 1.193 1
2 0.812 1.187 2
3 0.818 1.181 3
4 0.824 1.175 4
5 0.83 1.169 5
6 0.836 1.163 6
7 0.841 1.158 7
8 0.846 1.153 8
9 0.851 1.148 9
10 0.856 1.143 10
11 0.861 1.138 11
12 0.866 1.133 12
13 0.87 1.129 13
14 0.874 1.125 14
15 0.878 1.121 15
16 0.882 1.117 16
17 0.886 1.113 17
18 0.89 1.109 18
19 0.894 1.105 19
20 0.897 1.102 20
21 0.9 1.099 21
22 0.903 1.096 22
23 0.906 1.093 23
24 0.909 1.09 24
25 0.912 1.087 25
26 0.915 1.084 26
27 0.918 1.081 27
28 0.921 1.078 28
29 0.924 1.075 29
30 0.926 1.073 30
&vellip &vellip &vellip &vellip



In [46]:

    
using Gadfly

plot(df, layer(x="time", y="low", Geom.line), layer(x="time", y="high", Geom.line),
Guide.xlabel("time (t)"), Guide.ylabel("Capital Stock (k)"))









    Out[46]:

Plots of the states chosen by the optimal policy over time given an initial capital somewhere in the discretized region show that their is global convergence to around the steady state value. It also shows that the discretization results in values close to the true steady state also being steady states values because the current cost of increasing decreasing capital stock by a whole bin is more than the future reward.

	low	high	time
1	0.806	1.193	1
2	0.812	1.187	2
3	0.818	1.181	3
4	0.824	1.175	4
5	0.83	1.169	5
6	0.836	1.163	6
7	0.841	1.158	7
8	0.846	1.153	8
9	0.851	1.148	9
10	0.856	1.143	10
11	0.861	1.138	11
12	0.866	1.133	12
13	0.87	1.129	13
14	0.874	1.125	14
15	0.878	1.121	15
16	0.882	1.117	16
17	0.886	1.113	17
18	0.89	1.109	18
19	0.894	1.105	19
20	0.897	1.102	20
21	0.9	1.099	21
22	0.903	1.096	22
23	0.906	1.093	23
24	0.909	1.09	24
25	0.912	1.087	25
26	0.915	1.084	26
27	0.918	1.081	27
28	0.921	1.078	28
29	0.924	1.075	29
30	0.926	1.073	30
&vellip	&vellip	&vellip	&vellip