Exercise 5.1

Consider the diagrams on the right in Figure 5.2. Why does the value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the frontmost values higher in the upper diagrams than in the lower?

The value function jumps up for the last two rows in the rear because sticking on a player sum of 20 or 21 usually results in a win. It drops off for the whole last row on the left because a dealer showing an ace is bad news for the player: the dealer can count the ace as 1 or 11, which gives the dealer much more flexibility. Finally, the frontmost values are higher in the upper diagrams than in the lower because a usable ace gives the player that same flexibility, making it less likely that a hit will bust the player.

Exercise 5.2

The backup diagram for the Monte Carlo estimation of $Q^\pi$ is similar to the backup diagram for $V^\pi$, but the root is a state-action pair, not just a state. The diagram ends in a terminal state.

Some notes

"Without a model (as we had in DP chapter 4)... state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. "

By a model, the author means having the transition probabilities and the expected transition rewards available at hand.
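
As a concrete (hypothetical) illustration, a tabular model could be stored as a lookup table from state-action pairs to transition distributions; the states, actions, and numbers below are invented for illustration and are not from the book:

```python
# A minimal sketch of what "having a model" means: for every (state, action)
# pair we can look up the transition probabilities and expected rewards.
model = {
    ("s0", "left"):  [(0.8, "s1", -1.0), (0.2, "s0", -1.0)],  # (prob, next state, reward)
    ("s0", "right"): [(1.0, "s2", 0.0)],
}

def expected_backup(V, state, action, gamma=1.0):
    """A DP-style expected update, which is only possible when the model is known."""
    return sum(p * (r + gamma * V[next_s]) for p, next_s, r in model[(state, action)])
```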

" For policy evaluation to work for action values, we must assure continual exploration. One way to do this is by specifying that the first step of each episode starts at a state-action pair, and that every such pair has a nonzero probability of being selected as the start. This guarantees that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes. We call this the assumption of exploring starts."

Has the following ever been proved for Monte Carlo ES? "Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but has not yet been formally proved. In our opinion, this is one of the most fundamental open questions in reinforcement learning."

Questions

Question: In 5.6, in Figure 5.7 (c), why is there a 1 in the numerator of the update to $w$? It's because $\pi$ is deterministic: every action taken after time $\tau$ agrees with $\pi$, so $\pi(s_k, a_k) = 1$ for those actions and the numerator of the importance-sampling ratio is 1.
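
Spelling this out (the indexing below is my reading of Figure 5.7, with $\pi'$ the behaviour policy):

$$ w = \prod_{k=t+1}^{T-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)} = \prod_{k=t+1}^{T-1} \frac{1}{\pi'(s_k, a_k)}, $$

since $\pi(s_k, a_k) = 1$ for every action that the deterministic target policy would itself have taken.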

Why are we taking $\tau$ to be the latest time at which the behaviour action differs from the greedy policy? Is it because we can only learn from the tails of episodes after the last nongreedy action, as stated in the section?

Question: In 5.4, the conditions for the policy improvement theorem require the Markov property, right? So even though Monte Carlo is more robust to violations of the Markov property since it doesn't bootstrap $V$ or $Q$, we are still relying on the Markov property when we assume that greedy updates in GPI (generalized policy iteration) will allow us to arrive at the optimal action-value function?

Exercise 5.3

What is the Monte Carlo estimate analogous to (5.3) for action values, given returns generated using $\pi'$?

I'm not sure, but I think it would be something like:

$$Q(s,a) = \frac{\sum_{i=1}^{n_{sa}} \frac{p_i(s,a)}{p_i'(s,a)} R_i(s,a) }{ \sum_{i=1}^{n_{sa}} \frac{p_i(s,a)}{p_i'(s,a)} }$$

Where $n_{sa}$ is the number of episodes in which the pair $(s,a)$ occurs, $R_i(s,a)$ is the return following the first occurrence of $(s,a)$ in the $i$-th episode, and $p_i(s_t, a_t) = P^{a_t}_{s_t, s_{t+1}} \prod_{k=t+1}^{T_i(s_t, a_t) - 1} \pi(s_k, a_k)P^{a_k}_{s_k, s_{k+1}}$, with $p_i'$ defined analogously using $\pi'$ in place of $\pi$.
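
Since the transition probabilities appear in both $p_i$ and $p_i'$, they cancel in the ratio, so only the policy probabilities are needed in practice. A rough sketch of computing this estimate from logged episodes; the episode format and the `pi`/`pi_prime` callables are my assumptions:

```python
def q_estimate(episodes, s, a, pi, pi_prime):
    """Weighted importance-sampling estimate of Q(s, a) from episodes generated
    under the behaviour policy pi_prime (sketch only).

    Each episode is assumed to be a list of (state, action, return_from_here)
    tuples; pi(s, a) and pi_prime(s, a) give action probabilities.
    """
    num, den = 0.0, 0.0
    for episode in episodes:
        for t, (s_t, a_t, ret) in enumerate(episode):
            if (s_t, a_t) != (s, a):
                continue
            # Probability ratio for the trajectory *after* the first occurrence;
            # the first action is given, so its probability is excluded, and the
            # environment's transition probabilities cancel out.
            ratio = 1.0
            for (s_k, a_k, _) in episode[t + 1:]:
                ratio *= pi(s_k, a_k) / pi_prime(s_k, a_k)
            num += ratio * ret
            den += ratio
            break  # first-visit: use only the first occurrence of (s, a)
    return num / den if den > 0 else 0.0
```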

Exercise 5.4

See the code in chapter5_racetrack for the solution!

Exercise 5.5

Modify the algorithm for first-visit MC policy evaluation (Figure 5.1) to use the incremental implementation for stationary averages described in Section 2.5.

We just need to update the algorithm so that $Returns(s)$ is a single running average (rather than a list) for each $s \in S$, together with a per-state visit count $k(s)$. In the initialization step we set $Returns(s) \leftarrow 0$ and $k(s) \leftarrow 0$ for all $s \in S$, and then in the loop of part (b), with $R$ the return following the first occurrence of $s$, we do:

  • $Returns(s) \leftarrow Returns(s) + \frac{1}{k(s) + 1}[R - Returns(s)]$
  • $k(s) \leftarrow k(s) + 1$

$V(s)$ is then simply $Returns(s)$. A short sketch of this update in code follows below.
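
A minimal sketch of the incremental update for one episode, assuming `V` and `counts` are dictionaries keyed by state and `episode_returns` maps each state visited in the episode to the return following its first visit (these data structures are my own, not Figure 5.1's):

```python
def first_visit_update(V, counts, episode_returns):
    """Incremental first-visit MC update for one episode (sketch)."""
    for s, R in episode_returns.items():
        counts[s] = counts.get(s, 0) + 1
        # Incremental sample average: V(s) <- V(s) + (1/N(s)) [R - V(s)]
        V[s] = V.get(s, 0.0) + (R - V.get(s, 0.0)) / counts[s]
```
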
Exercise 5.6

Derive the weighted-average update rule (5.5) from (5.4). Follow the pattern of the derivation of the unweighted rule (2.4) from (2.1).

We have,

$ \begin{equation} \begin{split} V_{n+1} =& \frac{\sum_{k=1}^{n+1} w_k R_k }{\sum_{k=1}^{n+1} w_k}\\ =& \frac{w_{n+1} R_{n+1} + \sum_{k=1}^n w_k R_k }{W_{n+1}} \end{split} \end{equation} $

where $W_{n+1} = W_n + w_{n+1}$ and $W_0 = 0$. Then we have:

$ \begin{equation} \begin{split} V_{n+1} =& \frac{1}{W_{n+1}} [w_{n+1}R_{n+1} + V_n W_n]\\ =& \frac{1}{W_{n+1}} [w_{n+1}R_{n+1} + V_n (W_{n+1} - w_{n+1})]\\ =& V_n + \frac{w_{n+1}}{W_{n+1}} [R_{n+1} - V_n ] \end{split} \end{equation} $
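
A quick numerical check of the incremental rule against the direct weighted average (the weights and returns below are arbitrary numbers chosen for illustration):

```python
returns = [1.0, 0.0, 2.0, 1.0]
weights = [0.5, 2.0, 1.0, 0.25]

# Direct weighted average.
direct = sum(w * r for w, r in zip(weights, returns)) / sum(weights)

# Incremental rule: V <- V + (w_{n+1} / W_{n+1}) [R_{n+1} - V], with W accumulating the weights.
V, W = 0.0, 0.0
for w, r in zip(weights, returns):
    W += w
    V += (w / W) * (r - V)

assert abs(V - direct) < 1e-12
```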

Exercise 5.7

Modify the algorithm for the off-policy Monte Carlo control algorithm (Figure 5.7) to use the method described above for incrementally computing weighted averages.

Before the repeat loop, we initialize $W(s,a) = 0$ for all $s \in S$, $a \in A(s)$, in place of $N(s,a)$ and $D(s,a)$. Then in the repeat loop, for each pair $(s,a)$ appearing in the episode at time $\tau$ or later, we do:

  • get $w$ and $t$ as usual
  • delete the lines for $N(s,a)$ and $D(s,a)$
  • $W(s,a) \leftarrow W(s,a) + w$
  • $Q(s,a) \leftarrow Q(s,a) + \frac{w}{W(s,a)} [R_t - Q(s,a)]$

A short sketch of this update in code follows below.
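
A minimal sketch of the per-pair update, assuming `Q` and `W` are dictionaries keyed by $(s, a)$ and that `w` and `R_t` have already been computed for this pair as in Figure 5.7 (the data structures are my own):

```python
def weighted_q_update(Q, W, s, a, w, R_t):
    """One incremental weighted-average update, replacing the N(s,a)/D(s,a) bookkeeping."""
    # W(s,a) <- W(s,a) + w
    W[(s, a)] = W.get((s, a), 0.0) + w
    # Q(s,a) <- Q(s,a) + (w / W(s,a)) [R_t - Q(s,a)]
    Q[(s, a)] = Q.get((s, a), 0.0) + (w / W[(s, a)]) * (R_t - Q.get((s, a), 0.0))
```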
