After installing the package, we first load PLite into the current process.

In [1]:

```julia
# Pkg.clone("https://github.com/haoyio/PLite.jl") # do this once
using PLite
```

The idea of this section is to define the problem mathematically without fussing about how we might choose to solve it later (e.g., use a discretized representation).

In general, given the wide variety of MDP solvers available, it's important not to restrict our problem representation by what solver we might eventually choose. Rather, we want to choose the best solver after understanding our problem (e.g., find some problem structure that we can exploit).

In [2]:

```julia
# constants
const MinX = 0
const MaxX = 100
const StepX = 20

# mdp definition
mdp = MDP()
```

Notice that we can declare the state variable `x` as continuous here, even though for value iteration we would eventually have to discretize it.

In [3]:

```julia
# state space
statevariable!(mdp, "x", MinX, MaxX)              # continuous
statevariable!(mdp, "goal", ["no", "yes"])        # discrete

# action space
actionvariable!(mdp, "move", ["W", "E", "stop"])  # discrete
```

To define either the transition or reward function, we pass in the `MDP` object, the function itself, and an ordered list of the function's argument names.

In the case of the transition function, we can define it using either the $T(s,a)$ or the $T(s,a,s')$ format; our example uses the former. The reward function is defined using the $R(s,a)$ format.

First define the transition function.

In [4]:

```julia
function isgoal(x::Float64)
    if abs(x - MaxX / 2) < StepX
        return "yes"
    else
        return "no"
    end
end

function mytransition(x::Float64, goal::AbstractString, move::AbstractString)
    # the goal state is absorbing
    if isgoal(x) == "yes" && goal == "yes"
        return [([x, isgoal(x)], 1.0)]
    end

    if move == "E"
        if x >= MaxX
            return [
                ([x, isgoal(x)], 0.9),
                ([x - StepX, isgoal(x - StepX)], 0.1)]
        elseif x <= MinX
            return [
                ([x, isgoal(x)], 0.2),
                ([x + StepX, isgoal(x + StepX)], 0.8)]
        else
            return [
                ([x, isgoal(x)], 0.1),
                ([x - StepX, isgoal(x - StepX)], 0.1),
                ([x + StepX, isgoal(x + StepX)], 0.8)]
        end
    elseif move == "W"
        if x >= MaxX
            return [
                ([x, isgoal(x)], 0.1),
                ([x - StepX, isgoal(x - StepX)], 0.9)]
        elseif x <= MinX
            return [
                ([x, isgoal(x)], 0.9),
                ([x + StepX, isgoal(x + StepX)], 0.1)]
        else
            return [
                ([x, isgoal(x)], 0.1),
                ([x - StepX, isgoal(x - StepX)], 0.8),
                ([x + StepX, isgoal(x + StepX)], 0.1)]
        end
    elseif move == "stop"
        return [([x, isgoal(x)], 1.0)]
    end
end
```

After defining the transition function, we pass it into `mdp`.

In [5]:

```julia
transition!(mdp, ["x", "goal", "move"], mytransition)
```

Repeat this process for the reward function.

In [6]:

```julia
function myreward(x::Float64, goal::AbstractString, move::AbstractString)
    if goal == "yes" && move == "stop"
        return 1
    else
        return 0
    end
end

reward!(mdp, ["x", "goal", "move"], myreward)
```

For an infinite horizon problem with discount $\gamma$, it can be proven that the value of an optimal policy satisfies the *Bellman equation*
$$U^{\star}(s) = \max_{a}\left( R\left(s,a\right) + \gamma\sum_{s'}T\left(s'\mid s, a\right) U^{\star}\left(s'\right) \right)\text{.}$$
For the convergence proof, see http://web.stanford.edu/class/ee266/lectures/dpproof.pdf.
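To make the fixed-point idea concrete, here is a minimal self-contained sketch, independent of PLite: the two-state MDP, its rewards, and the discount are made up for illustration. It iterates the Bellman update until the value estimates stop changing.

```julia
# Hypothetical toy two-state MDP (for illustration only).
# States 1 and 2; actions 1 ("stay") and 2 ("switch").
# T[s, a, s2] is the probability of moving from s to s2 under action a.
T = zeros(2, 2, 2)
T[1, 1, 1] = 1.0   # staying in state 1 always succeeds
T[1, 2, 2] = 1.0   # switching from state 1 reaches state 2
T[2, 1, 2] = 1.0
T[2, 2, 1] = 1.0
R = [0.0 0.0;      # R[s, a]: reward 1 only for staying in state 2
     1.0 0.0]

function value_iteration(T, R, γ; tol=1e-8, maxiter=1000)
    ns, na = size(R)
    U = zeros(ns)
    for _ in 1:maxiter
        # one Bellman backup for every state
        Unew = [maximum(R[s, a] + γ * sum(T[s, a, s2] * U[s2] for s2 in 1:ns)
                        for a in 1:na) for s in 1:ns]
        maximum(abs.(Unew .- U)) < tol && return Unew
        U = Unew
    end
    return U
end

U = value_iteration(T, R, 0.9)
# Fixed point: U[2] = 1 / (1 - 0.9) = 10 and U[1] = 0.9 * U[2] = 9
```

Because the backup is a contraction for $\gamma < 1$, the iterates converge to the unique fixed point regardless of the starting guess.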

The solver can be initialized with three parameters: `maxiter`, `tol`, and `discount`. These are keyword arguments with default values; if we don't like them, we can change their values.

In [7]:

```julia
solver = SerialValueIteration()
```

In PLite, value iteration requires all state variables to be discretized. In the above problem, we need to discretize `x`, so we write the following.

Note that the solver uses the `GridInterpolations.jl` package for multilinear interpolation to approximate the values between the discretized state variable values when the $T(s,a)$ type transition is defined.
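To give a feel for what that interpolation does, here is a simplified one-dimensional sketch. The `interp_value` helper and the sample values are hypothetical, not the `GridInterpolations.jl` API; they just show how a value function stored at grid points can be evaluated between them.

```julia
# Hypothetical helper: linearly interpolate a value function defined on a
# sorted, uniform grid xs with values Us.
function interp_value(xs, Us, x)
    x <= xs[1]   && return Us[1]      # clamp below the grid
    x >= xs[end] && return Us[end]    # clamp above the grid
    i = searchsortedlast(xs, x)       # index of the grid point at or left of x
    w = (x - xs[i]) / (xs[i+1] - xs[i])  # fractional distance to the next point
    return (1 - w) * Us[i] + w * Us[i+1]
end

xs = 0.0:20.0:100.0                  # discretized x values (StepX = 20)
Us = [0.0, 0.2, 1.0, 0.9, 0.3, 0.1]  # made-up values at the grid points
interp_value(xs, Us, 50.0)           # halfway between the values at 40 and 60
```

Multilinear interpolation generalizes this weighted average to grids over several state variables at once.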

In [8]:

```julia
# StepX was already defined as a constant above, so we simply reuse it
discretize_statevariable!(solver, "x", StepX)
```

In value iteration, we compute optimal value function $U_{n}$ associated with a finite horizon of $n$ and no discounting. If $n=0$, then $U_{0}(s)=0$ for all $s$. We can compute $U_{n}$ recursively from this base case $$U_{n}(s) = \max_{a}\left( R\left(s,a\right) + \sum_{s'}T\left(s'\mid s, a\right) U_{n-1}\left(s'\right) \right)\text{.}$$
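As a sanity check on this recursion, here is a minimal sketch, independent of PLite and using a made-up two-state MDP, that computes $U_n$ by applying exactly $n$ undiscounted backups to the base case $U_0 = 0$.

```julia
# Hypothetical two-state MDP: states 1 and 2, actions 1 ("stay"), 2 ("switch").
T = zeros(2, 2, 2)
T[1, 1, 1] = 1.0; T[1, 2, 2] = 1.0
T[2, 1, 2] = 1.0; T[2, 2, 1] = 1.0
R = [0.0 0.0; 1.0 0.0]     # reward 1 only for staying in state 2

function finite_horizon_values(T, R, n)
    ns, na = size(R)
    U = zeros(ns)          # base case: U_0(s) = 0 for all s
    for _ in 1:n           # each pass applies one undiscounted Bellman backup
        U = [maximum(R[s, a] + sum(T[s, a, s2] * U[s2] for s2 in 1:ns)
                     for a in 1:na) for s in 1:ns]
    end
    return U
end

finite_horizon_values(T, R, 2)   # → [1.0, 2.0]
```

With two steps remaining, state 1 can switch into state 2 and collect one reward, while state 2 can collect a reward at both steps, which matches the returned values.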

Notice that the value iteration solver was built with infinite horizon problems in mind. It's easy, however, to modify it to solve finite horizon problems by simply changing the solver's parameters.

When the iterations terminate, the solver will warn that the maximum number of iterations has been reached. This message exists to flag convergence issues for infinite horizon problems, so we can safely ignore it here since we know what we're doing.

In [9]:

```julia
solver = SerialValueIteration(maxiter=40, discount=1, verbose=false)
discretize_statevariable!(solver, "x", StepX)
```

The optimal value function $U^{\star}$ appears on both sides of the equation. Value iteration approximates $U^{\star}$ by iteratively updating the estimate of $U^{\star}$ using the above equation. Once we know $U^{\star}$, we can extract an optimal policy via $$\pi\left(s\right) \leftarrow \underset{a} {\mathrm{argmax}} \left( R\left(s,a\right) + \gamma\sum_{s'}T\left(s'\mid s, a\right) U^{\star}\left(s'\right) \right)\text{.}$$
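As an illustration of this extraction step, here is a minimal sketch, independent of PLite and using a made-up two-state MDP, that takes a converged value function and recovers the greedy policy via the argmax over one-step lookahead values.

```julia
# Hypothetical two-state MDP: states 1 and 2, actions 1 ("stay"), 2 ("switch").
T = zeros(2, 2, 2)
T[1, 1, 1] = 1.0; T[1, 2, 2] = 1.0
T[2, 1, 2] = 1.0; T[2, 2, 1] = 1.0
R = [0.0 0.0; 1.0 0.0]   # reward 1 only for staying in state 2
γ = 0.9
U = [9.0, 10.0]          # converged optimal values for this MDP under γ = 0.9

# greedy action in each state: argmax over the one-step lookahead values
greedy = [argmax([R[s, a] + γ * sum(T[s, a, s2] * U[s2] for s2 in 1:2)
                  for a in 1:2]) for s in 1:2]
# greedy == [2, 1]: switch out of state 1, then stay in state 2 forever
```

Any policy that is greedy with respect to $U^{\star}$ is optimal, so this one-step lookahead is all that is needed once the values have converged.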

To generate the solution object, we simply input the following.

In [10]:

```julia
solution = solve(mdp, solver)
```

Finally, to generate the optimal policy function, we type in the following.

In [11]:

```julia
policy = getpolicy(mdp, solution)
```

We can then extract the optimal action at a given state by querying `policy` as follows.

In [12]:

```julia
stateq = (12, "no")
actionq = policy(stateq...)  # equally valid: actionq = policy(12, "no")
```