**Daisuke Oyama**

*Faculty of Economics, University of Tokyo*

This notebook demonstrates via a simple example how to use the `DiscreteDP` module.

```
In [1]:
```

```
import numpy as np
from scipy import sparse
from quantecon.markov import DiscreteDP
```

Let us consider the following two-state dynamic program, taken from Puterman (2005), Section 3.1, pp. 33-35; see also Example 6.2.1, pp. 155-156.

There are two possible states $0$ and $1$.

At state $0$, you may choose either "stay", say action $0$, or "move", action $1$.

At state $1$, there is no way to move, so that you can only stay, i.e., $0$ is the only available action. (You may alternatively distinguish between the action "stay" at state $0$ and that at state $1$, and call the latter action $2$; but here we choose to refer to both actions as action $0$.)

At state $0$, if you choose action $0$ (stay), then you receive a reward $5$, and in the next period the state will remain at $0$ with probability $1/2$, but it moves to $1$ with probability $1/2$.

If you choose action $1$ (move), then you receive a reward $10$, and the state in the next period will be $1$ with probability $1$.

At state $1$, where the only action you can take is $0$ (stay), you receive a reward $-1$, and the state will remain at $1$ with probability $1$.

You want to maximize the sum of discounted expected reward flows with discount factor $\beta \in [0, 1)$.

The optimization problem consists of:

- the state space $S = \{0, 1\}$;
- the action space $A = \{0, 1\}$;
- the set of feasible state-action pairs $\mathit{SA} = \{(0, 0), (0, 1), (1, 0)\} \subset S \times A$;
- the reward function $r\colon \mathit{SA} \to \mathbb{R}$, where $$r(0, 0) = 5,\ r(0, 1) = 10,\ r(1, 0) = -1;$$
- the transition probability function $q \colon \mathit{SA} \to \Delta(S)$, where $$\begin{aligned} &(q(0 | 0, 0), q(1 | 0, 0)) = (1/2, 1/2), \\ &(q(0 | 0, 1), q(1 | 0, 1)) = (0, 1), \\ &(q(0 | 1, 0), q(1 | 1, 0)) = (0, 1); \end{aligned}$$
- the discount factor $\beta \in [0, 1)$.
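
From these data the optimal value function $v^*$ can be computed by hand via the Bellman equation:

$$\begin{aligned} &v^*(0) = \max\left\{5 + \beta\left(\tfrac{1}{2} v^*(0) + \tfrac{1}{2} v^*(1)\right),\ 10 + \beta v^*(1)\right\}, \\ &v^*(1) = -1 + \beta v^*(1). \end{aligned}$$

The second equation gives $v^*(1) = -1/(1-\beta)$; substituting this into the first and comparing the two branches shows that "move" is optimal at state $0$ if and only if $\beta \leq 10/11$. The following function implements the resulting closed-form solution: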

```
In [2]:
```

```
def v_star(beta):
    v = np.empty(2)
    v[1] = -1 / (1 - beta)
    if beta > 10/11:
        v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
    else:
        v[0] = (10 - 11*beta) / (1 - beta)
    return v
```

We want to solve this problem numerically by using the `DiscreteDP` class.

```
In [3]:
```

```
v_star(beta=0.95)
```

```
Out[3]:
```

There are two ways to represent the data for instantiating a `DiscreteDP` object.
Let $n$, $m$, and $L$ denote the numbers of states, actions, and feasible state-action pairs, respectively; in the above example, $n = 2$, $m = 2$, and $L = 3$.

1. `DiscreteDP(R, Q, beta)` with parameters:

   - $n \times m$ reward array `R`,
   - $n \times m \times n$ transition probability array `Q`, and
   - discount factor `beta`,

   where `R[s, a]` is the reward for action `a` when the state is `s`, and `Q[s, a, s']` is the probability that the state in the next period is `s'` when the current state is `s` and the action chosen is `a`.

2. `DiscreteDP(R, Q, beta, s_indices, a_indices)` with parameters:

   - length $L$ reward vector `R`,
   - $L \times n$ transition probability array `Q`,
   - discount factor `beta`,
   - length $L$ array `s_indices`, and
   - length $L$ array `a_indices`,

   where the pairs `(s_indices[0], a_indices[0])`, ..., `(s_indices[L-1], a_indices[L-1])` enumerate feasible state-action pairs, and `R[i]` is the reward for action `a_indices[i]` when the state is `s_indices[i]`, and `Q[i, s']` is the probability that the state in the next period is `s'` when the current state is `s_indices[i]` and the action chosen is `a_indices[i]`.

Let us illustrate the two formulations by the simple example at the outset.

In the product formulation, we follow the convention of setting `R[s, a]` to $-\infty$ when action `a` is infeasible under state `s`.

The reward array `R` is an $n \times m$ 2-dimensional array:

```
In [4]:
```

```
R = [[5, 10],
     [-1, -float('inf')]]
```

The transition probability array `Q` is an $n \times m \times n$ 3-dimensional array:

```
In [5]:
```

```
Q = [[(0.5, 0.5), (0, 1)],
     [(0, 1), (0.5, 0.5)]]  # Probabilities in Q[1, 1] are arbitrary
```
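
As a quick sanity check (a sketch, not part of the original example), each probability vector in `Q` should sum to one, including the arbitrary one attached to the infeasible pair $(1, 1)$:

```
# Sketch: check that every Q[s, a] is a probability vector.
assert np.allclose(np.asarray(Q).sum(axis=-1), 1)
```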

Let us set the discount factor $\beta$ to be $0.95$:

```
In [6]:
```

```
beta = 0.95
```

We are ready to create a `DiscreteDP` instance:

```
In [7]:
```

```
ddp = DiscreteDP(R, Q, beta)
```

In the state-action pairs formulation, we take `R` to be a 1-dimensional array of length `L` and `Q` to be a 2-dimensional array of shape `(L, n)`, where `L` is the number of feasible state-action pairs.

The arrays `s_indices` and `a_indices` of length $3$ contain the indices of states and actions, respectively.

```
In [8]:
```

```
s_indices = [0, 0, 1]  # State indices
a_indices = [0, 1, 0]  # Action indices
```
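
In larger problems these index arrays need not be typed by hand; for instance (a sketch using the dense reward array from In [4], with $-\infty$ marking infeasible pairs), they can be recovered with `np.where`:

```
# Sketch: recover the feasible (s, a) pairs from a dense reward array
# in which -inf marks infeasible actions.
R_dense = np.array([[5, 10], [-1, -np.inf]])
s_idx, a_idx = np.where(R_dense > -np.inf)
print(s_idx, a_idx)  # [0 0 1] [0 1 0]
```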

The reward vector `R` is a length $L$ 1-dimensional array:

```
In [9]:
```

```
# Rewards for (s, a) = (0, 0), (0, 1), (1, 0), respectively
R = [5, 10, -1]
```

The transition probability array `Q` is an $L \times n$ 2-dimensional array:

```
In [10]:
```

```
# Probability vectors for (s, a) = (0, 0), (0, 1), (1, 0), respectively
Q = [(0.5, 0.5), (0, 1), (0, 1)]
```

For the discount factor, set $\beta = 0.95$ as before:

```
In [11]:
```

```
beta = 0.95
```

Now create a `DiscreteDP` instance:

```
In [12]:
```

```
ddp_sa = DiscreteDP(R, Q, beta, s_indices, a_indices)
```
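
Since `ddp` and `ddp_sa` encode the same problem, they should yield identical solutions; a minimal cross-check (a sketch, using the policy iteration method discussed below):

```
# Sketch: the two formulations should agree on the solution.
res_prod = ddp.solve(method='policy_iteration')
res_sa = ddp_sa.solve(method='policy_iteration')
print(np.allclose(res_prod.v, res_sa.v))       # expected: True
print((res_prod.sigma == res_sa.sigma).all())  # expected: True
```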

With the state-action pairs formulation, one can also supply `Q` as a `scipy.sparse` matrix (of any format), which is useful for large and sparse problems.

For example, let us convert the above ndarray `Q` to the Coordinate (coo) format:

```
In [13]:
```

```
import scipy.sparse
Q = scipy.sparse.coo_matrix(Q)
```

Pass it to `DiscreteDP` with the other parameters:

```
In [14]:
```

```
ddp_sparse = DiscreteDP(R, Q, beta, s_indices, a_indices)
```

Internally, the matrix `Q` is converted to the Compressed Sparse Row (csr) format:

```
In [15]:
```

```
ddp_sparse.Q
```

```
Out[15]:
```

```
In [16]:
```

```
ddp_sparse.Q.toarray()
```

```
Out[16]:
```

Now let us solve our model.
Currently, `DiscreteDP` supports the following solution algorithms:

- policy iteration;
- value iteration;
- modified policy iteration.

(The methods are the same across the formulations.)
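Each is demonstrated in turn below; as a compact preview (a sketch assuming the `ddp` instance from In [7]), all three methods share the same `solve` interface:

```
# Sketch: run each solution method and report its iteration count.
for method in ['policy_iteration', 'value_iteration',
               'modified_policy_iteration']:
    res_m = ddp.solve(method=method, v_init=[0, 0])
    print(method, res_m.num_iter)
```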

We solve the model first by policy iteration, which gives the exact solution:

```
In [17]:
```

```
v_init = [0, 0]  # Initial value function, optional (default: max_a r(s, a))
res = ddp.solve(method='policy_iteration', v_init=v_init)
```

`res` contains the information about the solution result:

```
In [18]:
```

```
res
```

```
Out[18]:
```

The optimal policy function:

```
In [19]:
```

```
res.sigma
```

```
Out[19]:
```

The optimal value function:

```
In [20]:
```

```
res.v
```

```
Out[20]:
```

This coincides with the analytical solution:

```
In [21]:
```

```
v_star(beta)
```

```
Out[21]:
```

```
In [22]:
```

```
np.allclose(res.v, v_star(beta))
```

```
Out[22]:
```

The number of iterations:

```
In [23]:
```

```
res.num_iter
```

```
Out[23]:
```

Verify that the value of the policy `[0, 0]` actually equals the optimal value `res.v`:

```
In [24]:
```

```
ddp.evaluate_policy(res.sigma)
```

```
Out[24]:
```

```
In [25]:
```

```
ddp.evaluate_policy(res.sigma) == res.v
```

```
Out[25]:
```
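
For a stationary policy $\sigma$, `evaluate_policy` computes its value $v_{\sigma}$, the solution to $$v_{\sigma} = r_{\sigma} + \beta Q_{\sigma} v_{\sigma}, \quad\text{i.e.,}\quad v_{\sigma} = (I - \beta Q_{\sigma})^{-1} r_{\sigma},$$ where $r_{\sigma}$ and $Q_{\sigma}$ denote the reward vector and the transition probability matrix under $\sigma$.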

`res.mc` is the controlled Markov chain given by the optimal policy `[0, 0]`:

```
In [26]:
```

```
res.mc
```

```
Out[26]:
```
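
Since `res.mc` is a `MarkovChain` instance from `quantecon`, its standard attributes are available; for example (a sketch; under the optimal policy, state $1$ is absorbing):

```
# Sketch: inspect the chain induced by the optimal policy [0, 0].
print(res.mc.P)                         # [[0.5 0.5], [0.  1. ]]
print(res.mc.stationary_distributions)  # all mass on the absorbing state 1
```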

Next, solve the model by value iteration, which returns an $\varepsilon$-optimal solution for a specified value of $\varepsilon$:

```
In [27]:
```

```
epsilon = 1e-2   # Convergence tolerance, optional (default: 1e-3)
v_init = [0, 0]  # Initial value function, optional (default: max_a r(s, a))
res_vi = ddp.solve(method='value_iteration', v_init=v_init,
                   epsilon=epsilon)
```

```
In [28]:
```

```
res_vi
```

```
Out[28]:
```

`res_vi.sigma` is an $\varepsilon$-optimal policy, and the value function `res_vi.v` is an $\varepsilon/2$-approximation of the true optimal value function; with $\varepsilon = 10^{-2}$, the maximum deviation computed below should be at most $5 \times 10^{-3}$.

```
In [29]:
```

```
np.abs(v_star(beta) - res_vi.v).max()
```

```
Out[29]:
```

Finally, solve the model by modified policy iteration:

```
In [30]:
```

```
epsilon = 1e-2   # Convergence tolerance, optional (default: 1e-3)
v_init = [0, 0]  # Initial value function, optional (default: max_a r(s, a))
res_mpi = ddp.solve(method='modified_policy_iteration', v_init=v_init,
                    epsilon=epsilon)
```

```
In [31]:
```

```
res_mpi
```

```
Out[31]:
```

```
In [32]:
```

```
np.abs(v_star(beta) - res_mpi.v).max()
```

```
Out[32]:
```

- M. L. Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley-Interscience, 2005.
