After installing the package, we first load PLite into the current process.
In [1]:
# Pkg.clone("https://github.com/haoyio/PLite.jl") # do this once
using PLite
The idea in this section is to define the problem mathematically without worrying about how we might later choose to solve it (e.g., with a discretized representation).
In general, given the wide variety of MDP solvers available, it's important not to let the eventual choice of solver restrict our problem representation. Rather, we want to understand the problem first (e.g., find structure we can exploit) and then pick the best solver.
In [2]:
# constants
const MinX = 0
const MaxX = 100
const StepX = 20
# mdp definition
mdp = MDP()
Out[2]:
For each state and action variable, we define either a continuous or a discrete variable. Note that we use the state variables' "natural representation" and avoid discretizing x, even though value iteration will eventually require it.
In [3]:
# state space
statevariable!(mdp, "x", MinX, MaxX) # continuous
statevariable!(mdp, "goal", ["no", "yes"]) # discrete
# action space
actionvariable!(mdp, "move", ["W", "E", "stop"]) # discrete
Out[3]:
To define the transition or reward function, we pass in the MDP object, the function itself, and an ordered list of the function's argument names.
In the case of the transition function, we can use either the $T(s,a)$ or the $T(s,a,s')$ format; our example uses the former. In the case of the reward function, we use the $R(s,a)$ format.
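Before writing the full model, it may help to see the shape of a $T(s,a)$-style function in isolation. The toy function below is purely illustrative and not part of our gridworld: it returns a list of (next state, probability) pairs, where each next state is a vector of state variable values and the probabilities sum to one.
# Illustrative only: a T(s,a)-style transition in which the agent always stays put.
function stayput(x::Float64, goal::AbstractString, move::AbstractString)
    return [([x, goal], 1.0)]
end
The real model below is richer: movement is stochastic, boundary states fold the off-grid probability mass back onto staying put, and the goal region is absorbing.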
First define the transition function.
In [4]:
# a state is a goal state if x is within one step of the grid center
function isgoal(x::Float64)
    if abs(x - MaxX / 2) < StepX
        return "yes"
    else
        return "no"
    end
end

# T(s,a)-style transition: returns a list of ([x', goal'], probability) pairs
function mytransition(x::Float64, goal::AbstractString, move::AbstractString)
    # the goal region is absorbing: once there with the goal flag set, stay put
    if isgoal(x) == "yes" && goal == "yes"
        return [([x, isgoal(x)], 1.0)]
    end

    if move == "E"
        if x >= MaxX
            return [
                ([x, isgoal(x)], 0.9),
                ([x - StepX, isgoal(x - StepX)], 0.1)]
        elseif x <= MinX
            return [
                ([x, isgoal(x)], 0.2),
                ([x + StepX, isgoal(x + StepX)], 0.8)]
        else
            return [
                ([x, isgoal(x)], 0.1),
                ([x - StepX, isgoal(x - StepX)], 0.1),
                ([x + StepX, isgoal(x + StepX)], 0.8)]
        end
    elseif move == "W"
        if x >= MaxX
            return [
                ([x, isgoal(x)], 0.1),
                ([x - StepX, isgoal(x - StepX)], 0.9)]
        elseif x <= MinX
            return [
                ([x, isgoal(x)], 0.9),
                ([x + StepX, isgoal(x + StepX)], 0.1)]
        else
            return [
                ([x, isgoal(x)], 0.1),
                ([x - StepX, isgoal(x - StepX)], 0.8),
                ([x + StepX, isgoal(x + StepX)], 0.1)]
        end
    elseif move == "stop"
        return [([x, isgoal(x)], 1.0)]
    end
end
Out[4]:
After defining the transition function, we pass it into mdp.
In [5]:
transition!(mdp, ["x", "goal", "move"], mytransition)
Out[5]:
Repeat this process for the reward function.
In [6]:
# R(s,a)-style reward: 1 for choosing "stop" at the goal, 0 otherwise
function myreward(x::Float64, goal::AbstractString, move::AbstractString)
    if goal == "yes" && move == "stop"
        return 1
    else
        return 0
    end
end

reward!(mdp, ["x", "goal", "move"], myreward)
Out[6]:
For an infinite horizon problem with discount $\gamma$, it can be shown that the value function of an optimal policy satisfies the Bellman equation $$U^{\star}(s) = \max_{a}\left( R\left(s,a\right) + \gamma\sum_{s'}T\left(s'\mid s, a\right) U^{\star}\left(s'\right) \right)\text{.}$$ For the convergence proof, see http://web.stanford.edu/class/ee266/lectures/dpproof.pdf.
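To make the backup concrete before handing things off to PLite, here is a minimal hand-rolled value iteration sketch on a toy three-state, two-action MDP, written in plain Julia and independent of PLite. All transition probabilities and rewards below are made up for illustration.
# Repeatedly apply the Bellman backup to a toy MDP for a fixed iteration budget.
# T[s, a, sp] is the probability of moving from state s to sp under action a;
# R[s, a] is the immediate reward; gamma is the discount factor.
function toy_value_iteration(T, R, gamma; iters=500)
    nS, nA = size(R, 1), size(R, 2)
    U = zeros(nS)
    for iter in 1:iters
        Unew = zeros(nS)
        for s in 1:nS
            best = -Inf
            for a in 1:nA
                q = R[s, a]
                for sp in 1:nS
                    q += gamma * T[s, a, sp] * U[sp]   # expected future value
                end
                best = max(best, q)
            end
            Unew[s] = best
        end
        U = Unew
    end
    return U
end

# Toy problem: state 3 is absorbing and the only rewarding state.
nS, nA = 3, 2
T = zeros(nS, nA, nS)
T[1, 1, :] = [0.9, 0.1, 0.0];  T[1, 2, :] = [0.1, 0.9, 0.0]
T[2, 1, :] = [0.8, 0.2, 0.0];  T[2, 2, :] = [0.0, 0.1, 0.9]
T[3, 1, :] = [0.0, 0.0, 1.0];  T[3, 2, :] = [0.0, 0.0, 1.0]
R = [0.0 0.0; 0.0 0.0; 1.0 1.0]
U = toy_value_iteration(T, R, 0.9)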
To initialize the serial value iteration solver, simply type the following. The algorithm's parameters, namely maxiter, tol, and discount, are keyword arguments with default values. If we don't like the defaults, we can change them.
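For instance, we could tighten the tolerance and set an explicit discount when constructing the solver; the values below are purely illustrative, not the package defaults.
SerialValueIteration(maxiter=10000, tol=1e-6, discount=0.99)  # illustrative values only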
In [7]:
solver = SerialValueIteration()
Out[7]:
In PLite, value iteration requires all state variables to be discretized, so in this problem we need to discretize x. Note that when a $T(s,a)$-type transition is defined, the solver uses the GridInterpolations.jl package to approximate values between the discretized state variable values with multilinear interpolation.
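To get a feel for what that interpolation does, here is a plain-Julia sketch of the one-dimensional case (this is not the GridInterpolations.jl API, just the underlying idea): a query point between two grid points gets a distance-weighted average of its neighbors' values.
# 1-D linear interpolation over a sorted grid, for illustration only.
function interp1d(grid::Vector{Float64}, values::Vector{Float64}, x::Float64)
    x <= grid[1]   && return values[1]        # clamp below the grid
    x >= grid[end] && return values[end]      # clamp above the grid
    i = searchsortedlast(grid, x)             # grid point just left of x
    w = (x - grid[i]) / (grid[i + 1] - grid[i])
    return (1 - w) * values[i] + w * values[i + 1]
end

grid = collect(0.0:20.0:100.0)          # the discretized x values
vals = [0.0, 0.2, 1.0, 1.0, 0.2, 0.0]   # made-up values at the grid points
interp1d(grid, vals, 12.0)              # blends the values at x = 0 and x = 20
With that picture in mind, we discretize x as follows.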
In [8]:
discretize_statevariable!(solver, "x", StepX)
Out[8]:
In value iteration, we compute the optimal value function $U_{n}$ associated with a finite horizon of $n$ and no discounting. If $n=0$, then $U_{0}(s)=0$ for all $s$. We can compute $U_{n}$ recursively from this base case: $$U_{n}(s) = \max_{a}\left( R\left(s,a\right) + \sum_{s'}T\left(s'\mid s, a\right) U_{n-1}\left(s'\right) \right)\text{.}$$
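As a quick sanity check, consider the first backup for our gridworld. Since $U_{0}(s)=0$, the sum over next states vanishes and $$U_{1}(s) = \max_{a} R\left(s,a\right) = \begin{cases} 1 & \text{if goal} = \text{yes} \\ 0 & \text{otherwise,} \end{cases}$$ because the only nonzero reward comes from choosing "stop" in a goal state. Subsequent backups propagate this value outward through the transition model.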
Notice that the value iteration solver was built with infinite horizon problems in mind. It's easy, however, to modify it to solve finite horizon problems by simply changing the solver's parameters.
When the iterations terminate, it'll give a warning that the maximum number of iterations has been reached. This message was built to warn the user about convergence issues for infinite horizon problems, so just ignore it since we know what we're doing.
For an MDP with a horizon of 40 and no discounting, we can define the solver as follows. (Note that we re-applied the discretization scheme since it was lost when we overwrote the solver variable.)
In [9]:
solver = SerialValueIteration(maxiter=40, discount=1, verbose=false)
discretize_statevariable!(solver, "x", StepX)
Out[9]:
The optimal value function $U^{\star}$ appears on both sides of the equation. Value iteration approximates $U^{\star}$ by iteratively updating the estimate of $U^{\star}$ using the above equation. Once we know $U^{\star}$, we can extract an optimal policy via $$\pi\left(s\right) \leftarrow \underset{a} {\mathrm{argmax}} \left( R\left(s,a\right) + \gamma\sum_{s'}T\left(s'\mid s, a\right) U^{\star}\left(s'\right) \right)\text{.}$$
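Continuing the toy sketch from earlier (again plain Julia, not the PLite API), greedy policy extraction from a value function $U$ looks like this:
# Pick, for each state, the action maximizing the one-step lookahead value.
function toy_extract_policy(T, R, gamma, U)
    nS, nA = size(R, 1), size(R, 2)
    policy = zeros(Int, nS)
    for s in 1:nS
        best_a, best_q = 1, -Inf
        for a in 1:nA
            q = R[s, a]
            for sp in 1:nS
                q += gamma * T[s, a, sp] * U[sp]
            end
            if q > best_q
                best_q, best_a = q, a
            end
        end
        policy[s] = best_a
    end
    return policy
end

toy_extract_policy(T, R, 0.9, U)   # reuses T, R, and U from the earlier toy sketch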
To generate the solution object, we simply input the following.
In [10]:
solution = solve(mdp, solver)
Out[10]:
Finally, to generate the optimal policy function, we type in the following.
In [11]:
policy = getpolicy(mdp, solution)
Out[11]:
We can then extract the optimal action at a given state by querying policy as follows.
In [12]:
stateq = (12, "no")
actionq = policy(stateq...) # equally valid to type actionq = policy(12, "no")
Out[12]:
The result says that we should move "E" (east), i.e., move right on the grid. This makes sense since we're to the left of the grid center. :D
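As an optional last check, we can sweep the policy over every grid point, passing the x values as floats since that is how the state variable was declared. The exact output depends on the solver run, but we'd expect "E" to the left of the goal region, "stop" inside it, and "W" to the right of it.
# Query the learned policy at each discretized x value, deriving the goal flag from x.
for x in MinX:StepX:MaxX
    xf = Float64(x)
    println(xf, " => ", policy(xf, isgoal(xf)))
end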