Pseudo-implementation of SARSA-Max, for reference
In [ ]:
# DON'T RUN ME!
##############
# Say we have broken the environment into an episodic list of some kind
episodes = []
# We first initialize Q(s, a) to an arbitrary starting condition.
init_Q(s, a)
for e in episodes:
    init_s()
    for step in e:
        a = choose_action(s, some_policy)  # derived from Q. Could be epsilon-greedy, softmax, etc.
        execute_a(a); observe(r, s_prime)  # take the action, observe the reward and next state
        update_Q()   # <- this is the interesting bit. See below.
        update_s()   # s <- s_prime
        if s == terminal():
            break
The update rule (the SARSA-Max, or Q-learning, update) is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

$\alpha$ is a learning rate between 0 and 1.

$\gamma$ is a discount factor between 0 and 1. Future rewards are worth less than immediate rewards.

$\max_{a'} Q(s', a')$ is the value of the best action available in the next state $s'$.
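For a tabular representation this update is a single line of array arithmetic. Below is a minimal sketch of the update_Q step referenced in the pseudocode, assuming Q is a NumPy array indexed by (state, action); the names Q, alpha and gamma follow the symbols above but are otherwise illustrative.

In [ ]:
import numpy as np

def update_Q(Q, s, a, r, s_prime, alpha=0.1, gamma=0.99):
    # TD target bootstraps from the best action in the next state (the "max" in SARSA-Max)
    td_target = r + gamma * np.max(Q[s_prime])
    # Move Q(s, a) a fraction alpha of the way towards the TD target
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q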
Each environment has a step function.
step returns four values:
observation: an environment-specific object representing your observation of the environment.

reward: amount of reward achieved by the previous action.

done: has the environment reached the end of a well-defined episode? If so, reset.

info: diagnostic data for debugging.
In [ ]:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        # Take a random action; no learning happens in this loop
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
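Putting the two pieces together, here is a minimal sketch of tabular SARSA-Max with an epsilon-greedy policy on FrozenLake-v0, chosen because its discrete observation space fits a Q table (CartPole's continuous observations would need discretization or function approximation first). The values of alpha, gamma, epsilon and the episode count are illustrative, not tuned.

In [ ]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for i_episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy policy derived from the current Q estimates
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s])
        s_prime, r, done, info = env.step(a)
        # SARSA-Max update: bootstrap from the greedy value of the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
        s = s_prime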