**Daisuke Oyama**

*Faculty of Economics, University of Tokyo*

This notebook demonstrates via a simple example how to use the `DiscreteDP` module.

```
In [1]:
```

```
import numpy as np
from scipy import sparse
from quantecon.markov import DiscreteDP
```

Let us consider the following two-state dynamic program, taken from Puterman (2005), Section 3.1, pp. 33-35; see also Example 6.2.1, pp. 155-156.

There are two possible states $0$ and $1$.

At state $0$, you may choose either "stay", say action $0$, or "move", action $1$.

At state $1$, there is no way to move, so that you can only stay, i.e., $0$ is the only available action. (You may alternatively distinguish between the action "stay" at state $0$ and that at state $1$, and call the latter action $2$; but here we choose to refer to both actions as action $0$.)

At state $0$, if you choose action $0$ (stay), then you receive a reward $5$, and in the next period the state will remain at $0$ with probability $1/2$, but it moves to $1$ with probability $1/2$.

If you choose action $1$ (move), then you receive a reward $10$, and the state in the next period will be $1$ with probability $1$.

At state $1$, where the only action you can take is $0$ (stay), you receive a reward $-1$, and the state will remain at $1$ with probability $1$.

You want to maximize the sum of discounted expected reward flows with discount factor $\beta \in [0, 1)$.

The optimization problem consists of:

- the state space $S = \{0, 1\}$;
- the action space $A = \{0, 1\}$;
- the set of feasible state-action pairs $\mathit{SA} = \{(0, 0), (0, 1), (1, 0)\} \subset S \times A$;
- the reward function $r\colon \mathit{SA} \to \mathbb{R}$, where $$r(0, 0) = 5,\ r(0, 1) = 10,\ r(1, 0) = -1;$$
- the transition probability function $q \colon \mathit{SA} \to \Delta(S)$, where $$\begin{aligned} &(q(0 | 0, 0), q(1 | 0, 0)) = (1/2, 1/2), \\ &(q(0 | 0, 1), q(1 | 0, 1)) = (0, 1), \\ &(q(0 | 1, 0), q(1 | 1, 0)) = (0, 1); \end{aligned}$$
- the discount factor $\beta \in [0, 1)$.
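
From these data the optimal value function $v^*$ can be computed by hand via the Bellman equation:

$$\begin{aligned} &v^*(0) = \max\left\{5 + \beta\left(\tfrac{1}{2} v^*(0) + \tfrac{1}{2} v^*(1)\right),\ 10 + \beta v^*(1)\right\}, \\ &v^*(1) = -1 + \beta v^*(1). \end{aligned}$$

The second equation gives $v^*(1) = -1/(1-\beta)$; substituting this into the first and comparing the two branches shows that "move" is optimal at state $0$ if and only if $\beta \leq 10/11$. The following function implements the resulting closed-form solution: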

```
In [2]:
```

```
def v_star(beta):
    v = np.empty(2)
    v[1] = -1 / (1 - beta)
    if beta > 10/11:
        v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
    else:
        v[0] = (10 - 11*beta) / (1 - beta)
    return v
```

We want to solve this problem numerically by using the `DiscreteDP` class.

```
In [3]:
```

```
v_star(beta=0.95)
```

```
Out[3]:
```

There are two ways to represent the data for instantiating a `DiscreteDP` object.
Let $n$, $m$, and $L$ denote the numbers of states, actions, and feasible state-action pairs, respectively; in the above example, $n = 2$, $m = 2$, and $L = 3$.

1. `DiscreteDP(R, Q, beta)` with parameters:

   - $n \times m$ reward array `R`,
   - $n \times m \times n$ transition probability array `Q`, and
   - discount factor `beta`,

   where `R[s, a]` is the reward for action `a` when the state is `s`, and `Q[s, a, s']` is the probability that the state in the next period is `s'` when the current state is `s` and the action chosen is `a`.

2. `DiscreteDP(R, Q, beta, s_indices, a_indices)` with parameters:

   - length $L$ reward vector `R`,
   - $L \times n$ transition probability array `Q`,
   - discount factor `beta`,
   - length $L$ array `s_indices`, and
   - length $L$ array `a_indices`,

   where the pairs `(s_indices[0], a_indices[0])`, ..., `(s_indices[L-1], a_indices[L-1])` enumerate feasible state-action pairs, and `R[i]` is the reward for action `a_indices[i]` when the state is `s_indices[i]`, and `Q[i, s']` is the probability that the state in the next period is `s'` when the current state is `s_indices[i]` and the action chosen is `a_indices[i]`.

Let us illustrate the two formulations by the simple example at the outset.

In the product formulation, we follow the convention of setting `R[s, a]` to $-\infty$ when action `a` is infeasible under state `s`.

The reward array `R` is an $n \times m$ 2-dimensional array:

```
In [4]:
```

```
R = [[5, 10],
     [-1, -float('inf')]]
```

The transition probability array `Q` is an $n \times m \times n$ 3-dimensional array:

```
In [5]:
```

```
Q = [[(0.5, 0.5), (0, 1)],
     [(0, 1), (0.5, 0.5)]]  # Probabilities in Q[1, 1] are arbitrary
```
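
As a quick sanity check (a sketch, not part of the original example), each probability vector in `Q` should sum to one, including the arbitrary one attached to the infeasible pair $(1, 1)$:

```
# Sketch: check that every Q[s, a] is a probability vector.
assert np.allclose(np.asarray(Q).sum(axis=-1), 1)
```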

Let us set the discount factor $\beta$ to be $0.95$:

```
In [6]:
```

```
beta = 0.95
```

We are ready to create a `DiscreteDP` instance:

```
In [7]:
```

```
ddp = DiscreteDP(R, Q, beta)
```

In the state-action pairs formulation, we take `R` to be a 1-dimensional array of length `L` and `Q` to be a 2-dimensional array of shape `(L, n)`, where `L` is the number of feasible state-action pairs.

The arrays `s_indices` and `a_indices` of length $3$ contain the indices of states and actions, respectively.

```
In [8]:
```

```
s_indices = [0, 0, 1]  # State indices
a_indices = [0, 1, 0]  # Action indices
```
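
In larger problems these index arrays need not be typed by hand; for instance (a sketch using the dense reward array from In [4], with $-\infty$ marking infeasible pairs), they can be recovered with `np.where`:

```
# Sketch: recover the feasible (s, a) pairs from a dense reward array
# in which -inf marks infeasible actions.
R_dense = np.array([[5, 10], [-1, -np.inf]])
s_idx, a_idx = np.where(R_dense > -np.inf)
print(s_idx, a_idx)  # [0 0 1] [0 1 0]
```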

The reward vector `R` is a length $L$ 1-dimensional array:

```
In [9]:
```

```
# Rewards for (s, a) = (0, 0), (0, 1), (1, 0), respectively
R = [5, 10, -1]
```

The transition probability array `Q` is an $L \times n$ 2-dimensional array:

```
In [10]:
```

```
# Probability vectors for (s, a) = (0, 0), (0, 1), (1, 0), respectively
Q = [(0.5, 0.5), (0, 1), (0, 1)]
```

For the discount factor, set $\beta = 0.95$ as before:

```
In [11]:
```

```
beta = 0.95
```

Now create a `DiscreteDP` instance:

```
In [12]:
```

```
ddp_sa = DiscreteDP(R, Q, beta, s_indices, a_indices)
```
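
Since `ddp` and `ddp_sa` encode the same problem, they should yield identical solutions; a minimal cross-check (a sketch, using the policy iteration method discussed below):

```
# Sketch: the two formulations should agree on the solution.
res_prod = ddp.solve(method='policy_iteration')
res_sa = ddp_sa.solve(method='policy_iteration')
print(np.allclose(res_prod.v, res_sa.v))       # expected: True
print((res_prod.sigma == res_sa.sigma).all())  # expected: True
```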

With the state-action pairs formulation, one can also supply `Q` as a `scipy.sparse` matrix (of any format), which is useful for large and sparse problems.

For example, let us convert the above ndarray `Q` to the Coordinate (coo) format:

```
In [13]:
```

```
import scipy.sparse
Q = scipy.sparse.coo_matrix(Q)
```

Pass it to `DiscreteDP` with the other parameters:

```
In [14]:
```

```
ddp_sparse = DiscreteDP(R, Q, beta, s_indices, a_indices)
```

Internally, the matrix `Q` is converted to the Compressed Sparse Row (csr) format:

```
In [15]:
```

```
ddp_sparse.Q
```

```
Out[15]:
```

```
In [16]:
```

```
ddp_sparse.Q.toarray()
```

```
Out[16]:
```

Now let us solve our model.
Currently, `DiscreteDP` supports the following solution algorithms:

- policy iteration;
- value iteration;
- modified policy iteration.

(The methods are the same across the formulations.)
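Each is demonstrated in turn below; as a compact preview (a sketch assuming the `ddp` instance from In [7]), all three methods share the same `solve` interface:

```
# Sketch: run each solution method and report its iteration count.
for method in ['policy_iteration', 'value_iteration',
               'modified_policy_iteration']:
    res_m = ddp.solve(method=method, v_init=[0, 0])
    print(method, res_m.num_iter)
```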

We solve the model first by policy iteration, which gives the exact solution:

```
In [17]:
```

```
v_init = [0, 0]  # Initial value function, optional (default: max_a r(s, a))
res = ddp.solve(method='policy_iteration', v_init=v_init)
```

`res` contains the information about the solution result:

```
In [18]:
```

```
res
```

```
Out[18]:
```

The optimal policy function:

```
In [19]:
```

```
res.sigma
```

```
Out[19]:
```

The optimal value function:

```
In [20]:
```

```
res.v
```

```
Out[20]:
```

This coincides with the analytical solution:

```
In [21]:
```

```
v_star(beta)
```

```
Out[21]:
```

```
In [22]:
```

```
np.allclose(res.v, v_star(beta))
```

```
Out[22]:
```

The number of iterations:

```
In [23]:
```

```
res.num_iter
```

```
Out[23]:
```

Verify that the value of the policy `[0, 0]` actually equals the optimal value `res.v`:

```
In [24]:
```

```
ddp.evaluate_policy(res.sigma)
```

```
Out[24]:
```

```
In [25]:
```

```
ddp.evaluate_policy(res.sigma) == res.v
```

```
Out[25]:
```
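
For a stationary policy $\sigma$, `evaluate_policy` computes its value $v_{\sigma}$, the solution to $$v_{\sigma} = r_{\sigma} + \beta Q_{\sigma} v_{\sigma}, \quad\text{i.e.,}\quad v_{\sigma} = (I - \beta Q_{\sigma})^{-1} r_{\sigma},$$ where $r_{\sigma}$ and $Q_{\sigma}$ denote the reward vector and the transition probability matrix under $\sigma$.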

`res.mc` is the controlled Markov chain given by the optimal policy `[0, 0]`:

```
In [26]:
```

```
res.mc
```

```
Out[26]:
```
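
Since `res.mc` is a `MarkovChain` instance from `quantecon`, its standard attributes are available; for example (a sketch; under the optimal policy, state $1$ is absorbing):

```
# Sketch: inspect the chain induced by the optimal policy [0, 0].
print(res.mc.P)                         # [[0.5 0.5], [0.  1. ]]
print(res.mc.stationary_distributions)  # all mass on the absorbing state 1
```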

Next, solve the model by value iteration, which returns an $\varepsilon$-optimal solution for a specified value of $\varepsilon$:

```
In [27]:
```

```
epsilon = 1e-2   # Convergence tolerance, optional (default: 1e-3)
v_init = [0, 0]  # Initial value function, optional (default: max_a r(s, a))
res_vi = ddp.solve(method='value_iteration', v_init=v_init,
                   epsilon=epsilon)
```

```
In [28]:
```

```
res_vi
```

```
Out[28]:
```

`res_vi.sigma` is an $\varepsilon$-optimal policy, and the value function `res_vi.v` is an $\varepsilon/2$-approximation of the true optimal value function; with $\varepsilon = 10^{-2}$, the maximum deviation computed below should be at most $5 \times 10^{-3}$.

```
In [29]:
```

```
np.abs(v_star(beta) - res_vi.v).max()
```

```
Out[29]:
```

Finally, solve the model by modified policy iteration:

```
In [30]:
```

```
epsilon = 1e-2   # Convergence tolerance, optional (default: 1e-3)
v_init = [0, 0]  # Initial value function, optional (default: max_a r(s, a))
res_mpi = ddp.solve(method='modified_policy_iteration', v_init=v_init,
                    epsilon=epsilon)
```

```
In [31]:
```

```
res_mpi
```

```
Out[31]:
```

```
In [32]:
```

```
np.abs(v_star(beta) - res_mpi.v).max()
```

```
Out[32]:
```

- M. L. Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley-Interscience, 2005.
