In [12]:
import itertools
import gym
import numpy as np
import tensorflow as tf
I'm reimplementing my current version of Sarsa(λ), the one in hiora_cartpole.linfa. That version was based on the 10/2015 draft of Sutton & Barto, p. 211. The 09/2016 draft is quite different, for reasons I don't understand at a glance.
In [4]:
env = gym.make("CartPole-v1")
In [9]:
high = np.array([2.5, 4.4, 0.28, 3.9])
state_ranges = np.array([-high, high])
order = 3
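These bounds are hand-picked rather than read off env.observation_space, because the velocity entries of CartPole's observation space are effectively unbounded (about 3.4e38, the float32 maximum), which would make the normalization below useless. A quick check using the standard gym attributes:
In [ ]:
print(env.observation_space.low)
print(env.observation_space.high)
# The velocity bounds are ±3.4e38, so dividing by those intervals would map
# every observation to roughly 0.5.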
In [37]:
state = tf.placeholder(tf.float64, high[None,:].shape)
In [32]:
n_dims = state_ranges.shape[1]
n_entries = (order + 1)**n_dims
intervals = np.diff(state_ranges, axis=0)
# All entries of the cartesian product {0, …, order}^n_dims.
c_matrix = np.array(
    list(itertools.product(range(order + 1), repeat=n_dims)),
    dtype=np.int32)
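A quick sanity check on the coefficient matrix: for order 3 and 4 state dimensions it should have one row per basis function, i.e. shape (256, 4), with the last column varying fastest.
In [ ]:
assert c_matrix.shape == (n_entries, n_dims)  # (256, 4)
c_matrix[:5]
# array([[0, 0, 0, 0],
#        [0, 0, 0, 1],
#        [0, 0, 0, 2],
#        [0, 0, 0, 3],
#        [0, 0, 1, 0]], dtype=int32)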
In [45]:
C = tf.constant(c_matrix, dtype=tf.float64)
LOW = tf.constant(-high)
INTERVALS = tf.constant(intervals)
PI = tf.constant(np.pi, dtype=tf.float64)
In [46]:
def phi(o):
    """
    Fourier basis features of an observation.

    o: a placeholder holding observations, shape (batch, n_dims)
    """
    nc_o = tf.div(tf.subtract(o, LOW), INTERVALS)  # normalized: shifted by LOW and scaled into [0, 1]
    return tf.cos(tf.mul(PI, tf.matmul(C, tf.transpose(nc_o))))
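A minimal check of phi, assuming a TF version old enough to still have tf.mul and tf.div: feed a single observation and look at the resulting feature shape.
In [ ]:
o_check = tf.placeholder(tf.float64, (1, n_dims))
features = phi(o_check)
with tf.Session() as check_sess:
    print(check_sess.run(features, feed_dict={o_check: env.reset()[None, :]}).shape)
# Expected: (256, 1), one Fourier feature per row of C.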
Need to be able to get the value for one action separately.
In [ ]:
tQall = tf.matmul(vtheta, tf.transpose(phi(to)))
tQga = tf.reduce_max(tQall)    # value of the greedy action
tga = tf.argmax(tQall, 0)      # index of the greedy action
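One way to get the value of a single action, assuming tQall ends up holding one entry per action (ta and tQa are hypothetical names, not from linfa):
In [ ]:
ta = tf.placeholder(tf.int32, ())              # index of the chosen action
tQa = tf.gather(tf.reshape(tQall, [-1]), ta)   # Q-value of that action alone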
Call number $n$ to learn has available or (maybe) calculates:
math | code | explanation |
---|---|---|
$o_{n-1}$ | po | The observation passed in call $n-1$. |
$r_{n-1}$ | pr | The reward passed in call $n-1$. |
$a_{n-1}$ | pa | The action returned in call $n-1$. |
$\theta_{n-1}$ | | The weights as updated in call $n-1$. |
$Q_{\theta_{n-1}}(o_{n-1}, a_{n-1})$ | pQpopa | The value of the previous observation-action pair under $\theta_{n-1}$. |
$r_n$ | r | The reward passed in call $n$. |
$o_n$ | o | The observation passed in call $n$. |
$a_n$ | a | The action that will be returned in call $n$. |
$\hat{a}_n$ | ga | The action that a greedy policy would return in call $n$. |
$Q_{\theta_{n-1}}(o_n, a_n)$ | pQoa | The value of the current observation-action pair under $\theta_{n-1}$. |
There is no equivalent for $\theta_{n-1}$ and $\theta_n$ in the code, because $\theta$ gets updated in-place by the optimizer. So before the update, theta
is $\theta_{n-1}$ and after the update it's $\theta_n$.
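For reference, the quantity the next cell works with is the (negated) Sarsa TD error, with $\gamma = 1$ since no discount appears in the loss:

$$\delta_n = r_n + Q_{\theta_{n-1}}(o_n, a_n) - Q_{\theta_{n-1}}(o_{n-1}, a_{n-1}),$$

or in the code names from the table: r + pQoa - pQpopa.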
In [ ]:
pQpopa = Qfunc(po, pa)
pQoa = Qfunc(o, a)
loss = tf.mul(tf.sub(pQpopa, tf.add(r, pQoa)), elig)  # = -(TD error) * eligibility trace
optimizer = tf.train.GradientDescentOptimizer(learning_rate=alpha)
update_model = optimizer.minimize(loss)
In [47]:
sess = tf.Session()
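Before anything can run, the variables (at least vtheta and elig, once they are actual tf.Variables) need initializing; in the TF versions that still have tf.mul this is:
In [ ]:
sess.run(tf.initialize_all_variables())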
In [ ]: