"For our purposes, it is convenient to place the boundary of the learning agent not at the limit of its physical body, but at the limit of its control." - 3.2
Is the reinforcement learning framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?
The framework works poorly for tasks where failure is not an option (i.e., the problem cannot be reset). One example would be learning to walk near a cliff, where failure means death and there is no way to restart the episode.
Another exception would be a task in which the rewards cannot easily be reduced to a single number. For example, in teaching a car how to drive we would want to minimize travel time while maximizing safety. With the current framework, we would have to collapse all these desirable outcomes into a single scalar returned by the environment.
Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?
For the episodic task with these rewards, the return at time $t$ is exactly $R_t = -\gamma^{K - t - 1}$, where $K$ is the time step at which failure occurs, since the episode ends there. In the continuing formulation, every future failure contributes a discounted $-1$, so $R_t = -\gamma^{K_1 - t - 1} - \gamma^{K_2 - t - 1} - \dots$ for the sequence of failure times $K_1, K_2, \dots$
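For concreteness, here is a tiny sketch comparing the two returns from $t = 0$; the failure times below are made up purely for illustration.
In [ ]:
gamma = 0.9
failures = [10, 25, 70]  # hypothetical time steps at which the pole falls

# Episodic return from t = 0: only the first failure counts, since the episode ends there.
print(-gamma ** (failures[0] - 1))
# Continuing return from t = 0: every future failure contributes a discounted -1.
print(sum(-gamma ** (K - 1) for K in failures))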
Assuming a finite MDP with a finite number of reward values, write an equation for the transition probabilities and the expected rewards in terms of the joint conditional distribution in (3.5) $Pr(s_{t+1} = s', r_{t+1} = r| s_t, a_t)$
$P^a_{ss'} = Pr(s_{t+1} = s' \mid s_t = s, a_t = a) = \sum_r Pr(s_{t+1} = s', r_{t+1} = r \mid s_t = s, a_t = a)$
$R^a_{ss'} = E[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'] = \frac{\sum_r r \, Pr(s_{t+1} = s', r_{t+1} = r \mid s_t = s, a_t = a)}{P^a_{ss'}}$
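As a sanity check, here is a minimal sketch that recovers $P^a_{ss'}$ and $R^a_{ss'}$ from a tabulated joint distribution; the distribution below (one fixed state-action pair, two possible next states) is made up for illustration.
In [ ]:
from collections import defaultdict

# Hypothetical Pr(s_{t+1} = s', r_{t+1} = r | s_t = s, a_t = a) for one fixed (s, a).
joint = {('s1', 0.0): 0.2, ('s1', 1.0): 0.3, ('s2', -1.0): 0.5}

P = defaultdict(float)      # P^a_{ss'}
R_num = defaultdict(float)  # numerator of R^a_{ss'}
for (s_next, r), p in joint.items():
    P[s_next] += p           # marginalize out the reward
    R_num[s_next] += r * p   # accumulate r * Pr(s', r | s, a)

R = {s_next: R_num[s_next] / P[s_next] for s_next in P}
print(dict(P))  # {'s1': 0.5, 's2': 0.5}
print(R)        # {'s1': 0.6, 's2': -1.0}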
What is the Bellman equation for action values?
$ \begin{equation} \begin{split} Q^{\pi}(s, a) &= E_{\pi}[R_t \mid s_t = s, a_t = a]\\ &= E_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big]\\ &= E_{\pi}\Big[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_t = s, a_t = a\Big]\\ &= \sum_{s'} P^a_{ss'} \Big[R^a_{ss'} + \gamma\, E_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_{t+1} = s'\Big]\Big]\\ &= \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^{\pi}(s')\big]\\ &= \sum_{s'} P^a_{ss'} \Big[R^a_{ss'} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a')\Big] \end{split} \end{equation} $
The Bellman equation (3.10) must hold for each state for the value function shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at +0.7.
$V^{\pi}(s_{mid}) = \frac{1}{4} \big[(0 + 0.9 \cdot 2.3) + (0 + 0.9 \cdot 0.7) + (0 + 0.9 \cdot 0.4) + (0 + 0.9 \cdot (-0.4))\big] = 0.675 \approx 0.7$
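The same check in code, with the neighbor values read off Figure 3.5b and the equiprobable random policy ($\gamma = 0.9$, all rewards 0 for these moves):
In [ ]:
gamma = 0.9
neighbors = [2.3, 0.7, 0.4, -0.4]  # values of the four states adjacent to the center state
v_mid = sum(0.25 * (0 + gamma * v) for v in neighbors)  # each move is taken with probability 1/4
print(v_mid)  # 0.675, which rounds to the 0.7 shown in the figure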
In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant $C$ to all the rewards adds a constant, $K$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $K$ in terms of $C$ and $\gamma$?
We have from previously that $V^{\pi}(s) = E_{\pi} [\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s]$.
Let's add a constant $C$ to all rewards so that $r'_{t+1} = r_{t+1} + C$.
Then, $$V'^{\pi}(s) = E_{\pi} [\sum_{k=0}^{\infty} \gamma^k (r_{t+k+1} + C) | s_t = s]$$ $$ = V^{\pi}(s) + E_{\pi}[ \sum_{k=0}^{\infty} \gamma^k C | s_t = s ]$$ $$ = V^{\pi}(s) + K $$
We have,
$ \begin{equation} \begin{split} K &= E_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k C \,\Big|\, s_t = s\Big]\\ &= \sum_a \pi(s, a) \sum_{s'} P^a_{ss'}\, \frac{C}{1 - \gamma}\\ &= \frac{C}{1 - \gamma} \end{split} \end{equation} $
Now consider adding a constant C to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.
For episodic tasks we have,
$$K_N = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \sum_{k=0}^{N-1} C \gamma^k = C\,\frac{1 - \gamma^N}{1 - \gamma},$$ where $N$ is the number of time steps remaining until the episode ends. Because $K_N$ grows with $N$, adding a positive constant does change the task: states from which the episode lasts longer receive a larger bonus, so the agent is encouraged to keep the episode going. In the maze example, a large enough positive $C$ would make wandering the maze indefinitely more attractive than reaching the goal.
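A small numerical sketch of the contrast (the values of $C$ and $\gamma$ are chosen arbitrarily): in the continuing case the added constant shifts every state's value by the same $\frac{C}{1-\gamma}$, while in the episodic case the shift $C\,\frac{1-\gamma^N}{1-\gamma}$ depends on how many steps remain.
In [ ]:
C, gamma = 1.0, 0.9
print(C / (1 - gamma))  # continuing case: every value shifts by 10.0
for N in [1, 5, 20, 100]:
    K_N = C * (1 - gamma ** N) / (1 - gamma)  # episodic case with N remaining steps
    print(N, round(K_N, 3))  # 1.0, 4.095, 8.784, ~10.0 -- longer episodes collect a larger bonus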
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
$ \begin{equation} \begin{split} V^{\pi}(s) &= E_{\pi}[Q^{\pi}(s, a) \mid s_t = s]\\ &= \sum_a \pi(s, a)\, Q^{\pi}(s, a) \end{split} \end{equation} $
The value of an action, $Q^{\pi}(s, a)$, can be divided into two parts: the expected next reward, which does not depend on the policy $\pi$, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching to the possible next states:
$ \begin{equation} \begin{split} Q^{\pi}(s, a) &= E[r_{t+1} \mid s_t = s, a_t = a] + \gamma E[V^{\pi}(s_{t+1}) \mid s_t = s, a_t = a]\\ &= \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^{\pi}(s')\big] \end{split} \end{equation} $
Give the Bellman equation for $Q^*$ for the recycling robot. (s = search, w = wait, r = recharge, h = high, l = low)
$Q^*(s, a) = \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')\big]$
$Q^*(h, s) = \alpha \big(R^s + \gamma \max_{a'} Q^*(h, a')\big) + (1 - \alpha) \big(R^s + \gamma \max_{a'} Q^*(l, a')\big)$
$Q^*(l, s) = \beta \big(R^s + \gamma \max_{a'} Q^*(l, a')\big) + (1 - \beta) \big(-3 + \gamma \max_{a'} Q^*(h, a')\big)$
$Q^*(l, w) = R^w + \gamma \max_{a'} Q^*(l, a')$
$Q^*(h, w) = R^w + \gamma \max_{a'} Q^*(h, a')$
$Q^*(l, r) = 0 + \gamma \max_{a'} Q^*(h, a')$
$Q^*(h, r) = 0$
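To see these equations produce numbers, here is a minimal value-iteration sketch on the action values. The parameter settings below ($\alpha$, $\beta$, $R^s$, $R^w$, $\gamma$) are arbitrary choices for illustration, not values from the book, and recharge is treated as available only in the low state.
In [ ]:
alpha, beta = 0.7, 0.6           # assumed battery-transition probabilities
R_s, R_w, gamma = 2.0, 1.0, 0.9  # assumed search/wait rewards and discount

Q = {('h', 's'): 0.0, ('h', 'w'): 0.0,
     ('l', 's'): 0.0, ('l', 'w'): 0.0, ('l', 'r'): 0.0}

def max_q(state):
    # max over the actions available in the given state
    return max(q for (s, a), q in Q.items() if s == state)

for _ in range(1000):  # sweep the Bellman optimality equations to a fixed point
    Q = {
        ('h', 's'): alpha * (R_s + gamma * max_q('h')) + (1 - alpha) * (R_s + gamma * max_q('l')),
        ('l', 's'): beta * (R_s + gamma * max_q('l')) + (1 - beta) * (-3 + gamma * max_q('h')),
        ('l', 'w'): R_w + gamma * max_q('l'),
        ('h', 'w'): R_w + gamma * max_q('h'),
        ('l', 'r'): 0 + gamma * max_q('h'),
    }
print(Q)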
Figure 3.8 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.2) to express this value symbolically, and then to compute it to three decimal places.
$V^*(s) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^*(s')\big]$
So, following the optimal policy from the best state $A$: the agent receives +10 and is sent to $A'$, then takes four zero-reward steps back up to $A$, after which the cycle repeats. We have,
$V^*(A) = \max_a E\big[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\big|\, s_t = A, a_t = a\big]$
$ = 10 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 0 + \gamma^4 \cdot 0 + \gamma^5 \cdot 10 + \dots$
$ = 10 \sum_{k=0}^{\infty} \gamma^{5k}$
$ = \frac{10}{1 - \gamma^5} \approx 24.419$
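A quick numerical check of the closed form:
In [ ]:
gamma = 0.9
print(10 / (1 - gamma ** 5))                            # closed form: 24.419...
print(sum(10 * gamma ** (5 * k) for k in range(1000)))  # truncated series agrees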
In [1]:
# https://raw.githubusercontent.com/stober/cartpole/master/src/__init__.py
import numpy as np
import math
import random
class CartPole(object):
def __init__(self):
# Constants
self.gravity = 9.8
self.masscart = 1.0
self.masspole = 0.1
self.total_mass = (self.masspole + self.masscart)
self.length = 0.5 # actually half the pole's length
self.polemass_length = (self.masspole * self.length)
self.force_mag = 10.0
self.tau = 0.02 # seconds between state updates
self.fourthirds = 1.3333333333333
def cart_pole(self, action, x = 0.0, xdot = 0.0, theta = 0.0, thetadot = 0.0):
# action must be binary
force = self.force_mag if action > 0 else -self.force_mag
costheta = math.cos(theta)
sintheta = math.sin(theta)
tmp = (force + self.polemass_length * (thetadot ** 2) * sintheta) / self.total_mass
thetaacc = (self.gravity * sintheta - costheta * tmp) / (self.length * (self.fourthirds - self.masspole * costheta ** 2 / self.total_mass))
xacc = tmp - self.polemass_length * thetaacc * costheta / self.total_mass
# Update the four state variables, using Euler's method
x += self.tau * xdot
xdot += self.tau * xacc
theta += self.tau * thetadot
thetadot += self.tau * thetaacc
return [x, xdot, theta, thetadot]
def get_box(self, x = 0.0, xdot = 0.0, theta = 0.0, thetadot = 0.0):
# get_box: Given the current state, returns a number from 1 to 162
# designating the region of the state space encompassing the current state.
# Returns a value of -1 if a failure state is encountered.
one_degree = 0.0174532 # 2pi/360
six_degrees = 0.1047192
twelve_degrees = 0.2094384
fifty_degrees = 0.87266
if (x < -2.4 or x > 2.4) or (theta < -twelve_degrees or theta > twelve_degrees):
return -1
if x < -0.8:
box = 0
elif x < 0.8:
box = 1
else:
box = 2
if xdot < -0.5:
pass
elif xdot < 0.5:
box += 3
else:
box += 6
if theta < -six_degrees:
pass
elif theta < -one_degree:
box += 9
elif theta < 0:
box += 18
elif theta < one_degree:
box += 27
elif theta < six_degrees:
box += 36
else:
box += 45
if thetadot < -fifty_degrees:
pass
elif thetadot < fifty_degrees:
box += 54
else:
box += 108
return box
def prob_push_right(self, s):
return (1.0 / (1.0 + np.exp(-max(-50.0, min(s, 50.0)))))
In [17]:
cp = CartPole()
N_BOXES = 162
MAX_FAILURES = 100
MAX_STEPS = 100000
LAMBDAw = 0.9 #/* Decay rate for w eligibility trace. */
LAMBDAv = 0.8 # /* Decay rate for v eligibility trace. */
GAMMA = 0.95 # /* Discount factor for critic. */
ALPHA = 1000 # /* Learning rate for action weights, w. */
BETA = 0.5 # /* Learning rate for critic weights, v. */
failures = steps = 0
w = []; v = []; xbar = []; e = []
# Initialize action and heuristic critic weights and traces.
for i in range(N_BOXES):
w.append(0.0); v.append(0.0); xbar.append(0.0); e.append(0.0)
# Starting state is (0 0 0 0)
x = x_dot = theta = theta_dot = 0.0
# Find box in state space containing start state
box = cp.get_box(x, x_dot, theta, theta_dot)
# Iterate through the action-learn loop. ---*/
while steps < MAX_STEPS and failures < MAX_FAILURES:
# Choose action randomly, biased by current weight
y = (random.uniform(0, 1) < cp.prob_push_right(w[box]))
# Update traces.
e[box] += (1.0 - LAMBDAw) * (y - 0.5)
xbar[box] += (1.0 - LAMBDAv)
    # Remember prediction of failure for the current state
oldp = v[box]
# /*--- Apply action to the simulated cart-pole ---*/
x, x_dot, theta, theta_dot = cp.cart_pole(y, x, x_dot, theta, theta_dot)
# /*--- Get box of state space containing the resulting state. ---*/
box = cp.get_box(x, x_dot, theta, theta_dot)
if box < 0:
# /*--- Failure occurred. ---*/
failed = 1
failures += 1
print("Trial %d was %d steps.\n" %(failures, steps))
steps = 0
# /*--- Reset state to (0 0 0 0). Find the box. ---*/
x = x_dot = theta = theta_dot = 0.0
box = cp.get_box(x, x_dot, theta, theta_dot)
# /*--- Reinforcement upon failure is -1. Prediction of failure is 0. ---*/
r = -1.0
p = 0.0
else:
# /*--- Not a failure. ---*/
failed = 0
# /*--- Reinforcement is 0. Prediction of failure given by v weight. ---*/
r = 0
p = v[box]
steps += 1
# /*--- Heuristic reinforcement is: current reinforcement
# + gamma * new failure prediction - previous failure prediction ---*/
rhat = r + GAMMA * p - oldp
for i in range(N_BOXES):
# /*--- Update all weights. ---*/
w[i] += ALPHA * rhat * e[i]
v[i] += BETA * rhat * xbar[i]
        if v[i] < -1.0:
            v[i] = -1.0  # clamp the failure prediction at -1 (the original v[i] = v[i] was a no-op)
if failed == 1:
#/*--- If failure, zero all traces. ---*/
e[i] = 0.0
xbar[i] = 0.0
else:
# /*--- Otherwise, update (decay) the traces. ---*/
e[i] *= LAMBDAw
xbar[i] *= LAMBDAv
if (failures == MAX_FAILURES):
print("Pole not balanced. Stopping after %d failures." %(failures))
else:
print("Pole balanced successfully for at least %d steps\n" %(steps))
In [10]:
# Use OpenAI Gym
import gym
import random
env = gym.make('CartPole-v0')
cp = CartPole()
N_BOXES = 162
MAX_FAILURES = 100
MAX_STEPS = 100000
LAMBDAw = 0.9 #/* Decay rate for w eligibility trace. */
LAMBDAv = 0.8 # /* Decay rate for v eligibility trace. */
GAMMA = 0.95 # /* Discount factor for critic. */
ALPHA = 1000 # /* Learning rate for action weights, w. */
BETA = 0.5 # /* Learning rate for critic weights, v. */
failures = steps = 0
w = []; v = []; xbar = []; e = []
# Initialize action and heuristic critic weights and traces.
for i in range(N_BOXES):
w.append(0.0); v.append(0.0); xbar.append(0.0); e.append(0.0)
# Starting state is (0 0 0 0)
observation = env.reset()
x, x_dot, theta, theta_dot = observation
# Find box in state space containing start state
box = cp.get_box(x, x_dot, theta, theta_dot)
# Iterate through the action-learn loop. ---*/
while steps < MAX_STEPS and failures < MAX_FAILURES:
env.render()
# Choose action randomly, biased by current weight
y = (random.uniform(0, 1) < cp.prob_push_right(w[box]))
# Update traces.
e[box] += (1.0 - LAMBDAw) * (y - 0.5)
xbar[box] += (1.0 - LAMBDAv)
    # Remember prediction of failure for the current state
oldp = v[box]
# /*--- Apply action to the simulated cart-pole ---*/
    observation, reward, done, info = env.step(int(y))  # cast the boolean action to the 0/1 integer gym expects
x, x_dot, theta, theta_dot = observation
# /*--- Get box of state space containing the resulting state. ---*/
box = cp.get_box(x, x_dot, theta, theta_dot)
if done:
# /*--- Failure occurred. ---*/
failed = 1
failures += 1
print("Trial %d was %d steps.\n" %(failures, steps))
steps = 0
# /*--- Reset state to (0 0 0 0). Find the box. ---*/
observation = env.reset()
x, x_dot, theta, theta_dot = observation
box = cp.get_box(x, x_dot, theta, theta_dot)
# /*--- Reinforcement upon failure is -1. Prediction of failure is 0. ---*/
r = -1
p = 0.0
else:
# /*--- Not a failure. ---*/
failed = 0
# /*--- Reinforcement is 0. Prediction of failure given by v weight. ---*/
r = 0
p = v[box]
steps += 1
# /*--- Heuristic reinforcement is: current reinforcement
# + gamma * new failure prediction - previous failure prediction ---*/
rhat = r + GAMMA * p - oldp
for i in range(N_BOXES):
# /*--- Update all weights. ---*/
w[i] += ALPHA * rhat * e[i]
v[i] += BETA * rhat * xbar[i]
        if v[i] < -1.0:
            v[i] = -1.0  # clamp the failure prediction at -1 (the original v[i] = v[i] was a no-op)
if failed == 1:
#/*--- If failure, zero all traces. ---*/
e[i] = 0.0
xbar[i] = 0.0
else:
# /*--- Otherwise, update (decay) the traces. ---*/
e[i] *= LAMBDAw
xbar[i] *= LAMBDAv
if (failures == MAX_FAILURES):
print("Pole not balanced. Stopping after %d failures." %(failures))
else:
print("Pole balanced successfully for at least %d steps\n" %(steps))
In [1]:
import gym
import random
env = gym.make('CartPole-v0')
# env.monitor.start('/tmp/cartpole-experiment-3', force=True)
bestSteps = 0
best = [0, 0, 0, 0]
alpha = 1
for i_episode in range(80):
test = [best[i] + (random.random() - 0.5) * alpha for i in range(4)]
score = 0
    for ep in range(10):  # key point: evaluate each candidate weight vector over 10 episodes
observation = env.reset()
        for t in range(200):  # episodes are capped at 200 steps, so score has to accumulate across the 10 episodes
env.render()
if sum(observation[i] * test[i] for i in range(4)) > 0:
action = 1
else:
action = 0
observation, reward, done, info = env.step(action)
if done:
break
score += t
if bestSteps < score:
bestSteps = score
best = test
alpha *= .9
print "test:", "[%+1.2f %+1.2f %+1.2f %+1.2f]" % tuple(test), score,
print "best:", "[%+1.2f %+1.2f %+1.2f %+1.2f]" % tuple(best), bestSteps, alpha
print "best", best, bestSteps
# env.monitor.close()