Serial Value Iteration

After installing the package, we first load PLite into the current process.


In [1]:
# Pkg.clone("https://github.com/haoyio/PLite.jl")  # do this once
using PLite

Problem definition

For this example, we define a simple 1-D gridworld-type problem in which the agent's goal is to move to the center of the grid and stop there, starting from anywhere.

The idea of this section is to define the problem mathematically without fussing about how we might choose to solve it later (e.g., use a discretized representation).

In general, given the wide variety of MDP solvers available, it's important not to restrict our problem representation by what solver we might eventually choose. Rather, we want to choose the best solver after understanding our problem (e.g., find some problem structure that we can exploit).

MDP initialization

We first define some constants and initialize the empty MDP object.


In [2]:
# constants
const MinX = 0
const MaxX = 100
const StepX = 20

# mdp definition
mdp = MDP()


Out[2]:
PLite.MDP(Dict{AbstractString,PLite.LazyVar}(),Dict{AbstractString,PLite.LazyVar}(),PLite.LazyFunc(true,ASCIIString[],##emptyfunc#7358),PLite.LazyFunc(true,ASCIIString[],##emptyfunc#7357))

State and action spaces definition

We define the state and action spaces using a factored representation. If we can't factor the representation, we can simply define the space using a single discrete or continuous variable.

For each state and action variable, we define either a continuous or a discrete variable. Note that we use the state variables' "natural representation" and avoid discretizing x, even though we will eventually have to for value iteration.


In [3]:
# state space
statevariable!(mdp, "x", MinX, MaxX)  # continuous
statevariable!(mdp, "goal", ["no", "yes"])  # discrete

# action space
actionvariable!(mdp, "move", ["W", "E", "stop"])  # discrete


Out[3]:
PLite.ValuesVar("move",ASCIIString["W","E","stop"])

Transition and reward functions definition

To define either the transition or the reward function, we pass in the MDP object, an ordered list of the function's argument names, and the function itself.

In the case of the transition function, we can define it using either the $T(s,a)$ or the $T(s,a,s')$ format; our example uses the former. In the case of the reward function, we define it using the $R(s,a)$ format.
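Concretely, a $T(s,a)$-style function returns a list of (next state, probability) pairs, where each next state is a vector of state variable values (here, [x, goal], matching the example below). As a minimal hedged sketch of the format (not part of the running example):

# hypothetical deterministic transition illustrating the T(s, a) return format
function dummytransition(x::Float64, goal::AbstractString, move::AbstractString)
  return [([x, goal], 1.0)]  # stay in the current state with probability 1
end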

First define the transition function.


In [4]:
# returns "yes" if x is within one step (StepX) of the grid center, "no" otherwise
function isgoal(x::Float64)
  if abs(x - MaxX / 2) < StepX
    return "yes"
  else
    return "no"
  end
end

# T(s, a)-style transition: returns a list of ([x', goal'], probability) pairs
function mytransition(x::Float64, goal::AbstractString, move::AbstractString)
  # once the agent is at the center with the goal flag set, the state is absorbing
  if isgoal(x) == "yes" && goal == "yes"
    return [([x, isgoal(x)], 1.0)]
  end

  if move == "E"
    if x >= MaxX  # right boundary: can't move further east
      return [
        ([x, isgoal(x)], 0.9),
        ([x - StepX, isgoal(x - StepX)], 0.1)]
    elseif x <= MinX  # left boundary
      return [
        ([x, isgoal(x)], 0.2),
        ([x + StepX, isgoal(x + StepX)], 0.8)]
    else
      return [
        ([x, isgoal(x)], 0.1),
        ([x - StepX, isgoal(x - StepX)], 0.1),
        ([x + StepX, isgoal(x + StepX)], 0.8)]
    end
  elseif move == "W"
    if x >= MaxX  # right boundary
      return [
        ([x, isgoal(x)], 0.1),
        ([x - StepX, isgoal(x - StepX)], 0.9)]
    elseif x <= MinX  # left boundary: can't move further west
      return [
        ([x, isgoal(x)], 0.9),
        ([x + StepX, isgoal(x + StepX)], 0.1)]
    else
      return [
        ([x, isgoal(x)], 0.1),
        ([x - StepX, isgoal(x - StepX)], 0.8),
        ([x + StepX, isgoal(x + StepX)], 0.1)]
    end
  elseif move == "stop"
    return [([x, isgoal(x)], 1.0)]
  end
end


Out[4]:
mytransition (generic function with 1 method)

After defining the transition function, we pass it into mdp.


In [5]:
transition!(mdp, ["x", "goal", "move"], mytransition)


Out[5]:
PLite.LazyFunc(false,ASCIIString["x","goal","move"],mytransition)

Repeat this process for the reward function.


In [6]:
# R(s, a): reward of 1 for choosing "stop" when the goal flag is set, 0 otherwise
function myreward(x::Float64, goal::AbstractString, move::AbstractString)
  if goal == "yes" && move == "stop"
    return 1
  else
    return 0
  end
end

reward!(mdp, ["x", "goal", "move"], myreward)


Out[6]:
PLite.LazyFunc(false,ASCIIString["x","goal","move"],myreward)

Solver selection

So far, we've only been concerned with coding up the mathematical formulation of the problem. From here on, we focus on selecting a solver and giving the solver object the additional information it needs to solve the MDP (such as the discretization scheme).

Infinite horizon MDP

For an infinite horizon problem with discount $\gamma$, it can be proven that the value of an optimal policy satisfies the Bellman equation $$U^{\star}(s) = \max_{a}\left( R\left(s,a\right) + \gamma\sum_{s'}T\left(s'\mid s, a\right) U^{\star}\left(s'\right) \right)\text{.}$$ For the convergence proof, see http://web.stanford.edu/class/ee266/lectures/dpproof.pdf.
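Schematically, each value iteration sweep applies this update to every (discretized) state. A minimal sketch of one such sweep, assuming R(s, a) and T(s, a) take a state and an action directly, T returns (next state, probability) pairs, and U maps states to value estimates (illustrative only, not PLite's implementation):

# one Bellman backup sweep over a discrete state set (illustrative only)
function bellmansweep(states, actions, R, T, U, discount)
  Unew = Dict()
  for s in states
    best = -Inf
    for a in actions
      v = R(s, a)
      for (sp, p) in T(s, a)  # (next state, probability) pairs
        v += discount * p * U[sp]
      end
      best = max(best, v)
    end
    Unew[s] = best
  end
  return Unew
end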

To initialize the serial value iteration solver, simply type the following. The algorithm's parameters, namely maxiter, tol, and discount, are keyword arguments with default values; if the defaults don't suit the problem, we can override them.


In [7]:
solver = SerialValueIteration()


Out[7]:
PLite.SerialValueIteration(true,1000,0.0001,0.99,Dict{AbstractString,PLite.LazyDiscrete}(),Dict{AbstractString,PLite.LazyDiscrete}(),GridInterpolations.RectangleGrid with 1 points,GridInterpolations.RectangleGrid with 1 points)
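If the defaults don't suit the problem, we could construct the solver with keyword arguments instead (not run here; the values are purely illustrative):

solver = SerialValueIteration(maxiter=500, tol=1e-6, discount=0.95)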

In PLite, value iteration requires all continuous variables to be discretized. In the above problem, we need to discretize x, so we write the following.

Note that when a $T(s,a)$-type transition is defined, the solver uses the GridInterpolations.jl package to approximate values between the discretized state variable values via multilinear interpolation.
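For intuition, here is a rough standalone illustration of that kind of interpolation on a 1-D grid, with made-up values (this is not the solver's internal code):

using GridInterpolations

grid = RectangleGrid([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])  # discretized values of x
vals = [0.0, 0.5, 2.0, 0.5, 0.0, 0.0]                       # hypothetical values at the grid points
interpolate(grid, vals, [30.0])  # multilinear interpolation between the points at 20 and 40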


In [8]:
discretize_statevariable!(solver, "x", StepX)  # grid spacing of StepX = 20 for x


Out[8]:
PLite.LazyDiscrete("x",20.0)

Finite horizon MDP

In value iteration, we compute the optimal value function $U_{n}$ associated with a finite horizon of $n$ and no discounting. If $n=0$, then $U_{0}(s)=0$ for all $s$. We can compute $U_{n}$ recursively from this base case: $$U_{n}(s) = \max_{a}\left( R\left(s,a\right) + \sum_{s'}T\left(s'\mid s, a\right) U_{n-1}\left(s'\right) \right)\text{.}$$

Note that the value iteration solver was built with infinite horizon problems in mind. It's easy, however, to solve finite horizon problems with it by simply changing the solver's parameters.

When the iterations terminate, the solver will warn that the maximum number of iterations has been reached. This message is intended to flag convergence issues for infinite horizon problems, so we can safely ignore it here since the iteration cap is exactly our horizon.

For an MDP with a horizon of 40 and no discounting, we can define the solver as follows. (Note that we redefine the discretization scheme since it was lost when we overwrote the solver variable.)


In [9]:
solver = SerialValueIteration(maxiter=40, discount=1, verbose=false)
discretize_statevariable!(solver, "x", StepX)


Out[9]:
PLite.LazyDiscrete("x",20.0)

Solution and policy extraction

The optimal value function $U^{\star}$ appears on both sides of the Bellman equation above. Value iteration approximates $U^{\star}$ by iteratively updating its estimate using that equation. Once we know $U^{\star}$, we can extract an optimal policy via $$\pi\left(s\right) \leftarrow \underset{a}{\mathrm{argmax}} \left( R\left(s,a\right) + \gamma\sum_{s'}T\left(s'\mid s, a\right) U^{\star}\left(s'\right) \right)\text{.}$$
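A generic sketch of this extraction step for a single state (illustrative only, not getpolicy's internals; as before, R(s, a) and T(s, a) take a state and an action directly, and U maps states to value estimates):

# pick the action maximizing the one-step lookahead value at state s (illustrative only)
function extractaction(s, actions, R, T, U, discount)
  besta, bestv = actions[1], -Inf
  for a in actions
    v = R(s, a)
    for (sp, p) in T(s, a)
      v += discount * p * U[sp]
    end
    if v > bestv
      besta, bestv = a, v
    end
  end
  return besta
end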

To generate the solution object, we simply input the following.


In [10]:
solution = solve(mdp, solver)


INFO: mdp and value iteration solver passed basic checks
WARNING: maximum number of iterations reached; check accuracy of solutions
Out[10]:
PLite.ValueIterationSolution(3x12 Array{Float64,2}:
 36.4688  36.7344  37.875   38.8611  38.6111  …  39.0  39.0  38.6111  37.5   
 37.3438  38.5938  38.8594  37.8889  36.8611     39.0  39.0  36.8611  36.6111
 36.3438  37.5938  39.0     39.0     37.6111     40.0  40.0  38.6111  37.5   ,GridInterpolations.RectangleGrid with 12 points,GridInterpolations.RectangleGrid with 3 points,0.06014998299999998,40,1.0)
INFO: value iteration solution generated
cputime [s] = 0.06014998299999998
number of iterations = 40
residual = 1.0

Finally, to generate the optimal policy function, we type in the following.


In [11]:
policy = getpolicy(mdp, solution)


Out[11]:
policy (generic function with 1 method)

We can then extract the optimal action at a given state by querying policy as follows.


In [12]:
stateq = (12, "no")
actionq = policy(stateq...)  # equally valid to type actionq = policy(12, "no")


Out[12]:
1-element Array{Any,1}:
 "E"

The result says that we should move "east," or right on the grid. This makes sense since x = 12 is to the left of the grid center at x = 50. :D
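As a quick sanity check, we could query a few more states; by the symmetry of the problem we'd expect westward movement to the right of center and a stop once the goal flag is set at the center (outputs not shown here):

policy(88.0, "no")   # right of center; we'd expect "W"
policy(50.0, "yes")  # at the center with the goal flag set; we'd expect "stop"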