Pseudo-implementation of SARSA-Max, for reference
In [ ]:
# DON'T RUN ME!
##############
# Say we have broken the environment into an episodic list of some kind
episodes = []
# We first initialize Q(s, a) to an arbitrary starting condition.
init_Q(s, a)
for e in episodes:
    init_s()
    for step in e:
        a = choose_action(s, some_policy)  # derived from Q. Could be epsilon-greedy, softmax, etc.
        execute_a(a); observe(r, s_prime)  # take the action, observe the reward and next state
        update_Q()   # <- this is the interesting bit. See below.
        update_s()   # s <- s_prime
        if s == terminal():
            break
The update rule (the SARSA-Max, or Q-learning, update) is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

$\alpha$ is a learning rate between 0 and 1.

$\gamma$ is a discount factor between 0 and 1. Future rewards are worth less than immediate rewards.

$\max_{a'} Q(s', a')$ is the value of the best action available in the next state $s'$.
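For a tabular representation this update is a single line of array arithmetic. Below is a minimal sketch of the update_Q step referenced in the pseudocode, assuming Q is a NumPy array indexed by (state, action); the names Q, alpha and gamma follow the symbols above but are otherwise illustrative.

In [ ]:
import numpy as np

def update_Q(Q, s, a, r, s_prime, alpha=0.1, gamma=0.99):
    # TD target bootstraps from the best action in the next state (the "max" in SARSA-Max)
    td_target = r + gamma * np.max(Q[s_prime])
    # Move Q(s, a) a fraction alpha of the way towards the TD target
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q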
Each environment has a step function.
step returns four values:
observation: an environment-specific object representing your observation of the environment.

reward: amount of reward achieved by the previous action.

done: has the environment reached the end of a well-defined episode? If so, reset.

info: diagnostic data for debugging.
In [ ]:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        # Take a random action; no learning happens in this loop
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
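Putting the two pieces together, here is a minimal sketch of tabular SARSA-Max with an epsilon-greedy policy on FrozenLake-v0, chosen because its discrete observation space fits a Q table (CartPole's continuous observations would need discretization or function approximation first). The values of alpha, gamma, epsilon and the episode count are illustrative, not tuned.

In [ ]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for i_episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy policy derived from the current Q estimates
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s])
        s_prime, r, done, info = env.step(a)
        # SARSA-Max update: bootstrap from the greedy value of the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
        s = s_prime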