In [12]:
import itertools
import gym
import numpy as np
import tensorflow as tf
I'm reimplementing my current version of Sarsa(λ), the one in hiora_cartpole.linfa. That version was based on the 10/2015 draft of Sutton & Barto, p. 211. The 09/2016 draft is quite different, for reasons I don't understand at a glance.
In [4]:
env = gym.make("CartPole-v1")
In [9]:
high = np.array([2.5, 4.4, 0.28, 3.9])
state_ranges = np.array([-high, high])
order = 3
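These bounds are hand-picked rather than read off env.observation_space, because the velocity entries of CartPole's observation space are effectively unbounded (about 3.4e38, the float32 maximum), which would make the normalization below useless. A quick check using the standard gym attributes:
In [ ]:
print(env.observation_space.low)
print(env.observation_space.high)
# The velocity bounds are ±3.4e38, so dividing by those intervals would map
# every observation to roughly 0.5.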
In [37]:
state = tf.placeholder(tf.float64, high[None,:].shape)
In [32]:
n_dims = state_ranges.shape[1]
n_entries = (order + 1)**n_dims
intervals = np.diff(state_ranges, axis=0)
# All entries of the cartesian product {0, …, order}^n_dims.
c_matrix = np.array(
    list(itertools.product(range(order + 1), repeat=n_dims)),
    dtype=np.int32)
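A quick sanity check on the coefficient matrix: for order 3 and 4 state dimensions it should have one row per basis function, i.e. shape (256, 4), with the last column varying fastest.
In [ ]:
assert c_matrix.shape == (n_entries, n_dims)  # (256, 4)
c_matrix[:5]
# array([[0, 0, 0, 0],
#        [0, 0, 0, 1],
#        [0, 0, 0, 2],
#        [0, 0, 0, 3],
#        [0, 0, 1, 0]], dtype=int32)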
In [45]:
C = tf.constant(c_matrix, dtype=tf.float64)
LOW = tf.constant(-high)
INTERVALS = tf.constant(intervals)
PI = tf.constant(np.pi, dtype=tf.float64)
In [46]:
def phi(o):
    """
    Fourier basis features of an observation.

    o: a placeholder holding observations, shape (batch, n_dims)
    """
    nc_o = tf.div(tf.subtract(o, LOW), INTERVALS)  # normalized: shifted by LOW and scaled into [0, 1]
    return tf.cos(tf.mul(PI, tf.matmul(C, tf.transpose(nc_o))))
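A minimal check of phi, assuming a TF version old enough to still have tf.mul and tf.div: feed a single observation and look at the resulting feature shape.
In [ ]:
o_check = tf.placeholder(tf.float64, (1, n_dims))
features = phi(o_check)
with tf.Session() as check_sess:
    print(check_sess.run(features, feed_dict={o_check: env.reset()[None, :]}).shape)
# Expected: (256, 1), one Fourier feature per row of C.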
Need to be able to get the value for one action separately.
In [ ]:
tQall = tf.matmul(vtheta, tf.transpose(phi(to)))
tQga = tf.reduce_max(tQall)    # value of the greedy action
tga = tf.argmax(tQall, 0)      # index of the greedy action
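One way to get the value of a single action, assuming tQall ends up holding one entry per action (ta and tQa are hypothetical names, not from linfa):
In [ ]:
ta = tf.placeholder(tf.int32, ())              # index of the chosen action
tQa = tf.gather(tf.reshape(tQall, [-1]), ta)   # Q-value of that action alone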
Call number $n$ to learn has available or (maybe) calculates:
math | code | explanation |
---|---|---|
$o_{n-1}$ | po | The observation passed in call $n-1$. |
$r_{n-1}$ | pr | The reward passed in call $n-1$. |
$a_{n-1}$ | pa | The action returned in call $n-1$. |
$\theta_{n-1}$ | | The weights as updated in call $n-1$. |
$Q_{\theta_{n-1}}(o_{n-1}, a_{n-1})$ | pQpopa | The value of the previous observation-action pair under $\theta_{n-1}$. |
$r_n$ | r | The reward passed in call $n$. |
$o_n$ | o | The observation passed in call $n$. |
$a_n$ | a | The action that will be returned in call $n$. |
$\hat{a}_n$ | ga | The action that a greedy policy would return in call $n$. |
$Q_{\theta_{n-1}}(o_n, a_n)$ | pQoa | The value of the current observation-action pair under $\theta_{n-1}$. |
There is no equivalent for $\theta_{n-1}$ and $\theta_n$ in the code, because $\theta$ gets updated in-place by the optimizer. So before the update, theta
is $\theta_{n-1}$ and after the update it's $\theta_n$.
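For reference, the quantity the next cell works with is the (negated) Sarsa TD error, with $\gamma = 1$ since no discount appears in the loss:

$$\delta_n = r_n + Q_{\theta_{n-1}}(o_n, a_n) - Q_{\theta_{n-1}}(o_{n-1}, a_{n-1}),$$

or in the code names from the table: r + pQoa - pQpopa.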
In [ ]:
pQpopa = Qfunc(po, pa)
pQoa = Qfunc(o, a)
loss = tf.mul(tf.sub(pQpopa, tf.add(r, pQoa)), elig)  # = -(TD error) * eligibility trace
optimizer = tf.train.GradientDescentOptimizer(learning_rate=alpha)
update_model = optimizer.minimize(loss)
In [47]:
sess = tf.Session()
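Before anything can run, the variables (at least vtheta and elig, once they are actual tf.Variables) need initializing; in the TF versions that still have tf.mul this is:
In [ ]:
sess.run(tf.initialize_all_variables())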
In [ ]: