In this seminar you'll deploy a recurrent neural network inside a SARSA agent.
The environment it plays is a simple POMDP: a rock-paper-scissors game against an exploitable opponent.
First, read through the code and run it as you read. The code will create a feedforward neural network and train it with SARSA.
Since the game is partially observable, the default algorithm won't reach the optimal score. In fact, it's unstable and may even end up worse than random.
Once you have run the code, find the two #YOUR CODE HERE chunks (Ctrl+F helps) and implement a recurrent memory.
Re-run the experiment and compare the performance of the feedforward vs. the recurrent agent. The RNN should do much better: session reward > 50.
After you're done with that, proceed to the next part; it's going to be much more interesting.
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# number of parallel agents and batch sequence length (frames)
N_AGENTS = 10
SEQ_LENGTH = 25
The environment we're going to use here is not a default gym env. It was instead written from scratch in rockpaperscissors.py.
Moral: you can easily make your own gym environments out of anything you want, including the OS or the web (e.g. via Selenium).
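For reference, a custom environment only needs to subclass gym.Env, define action_space and observation_space, and implement reset() and step(). Below is a minimal illustrative sketch under the old Gym API used in this notebook (reset() returns an observation, step() returns (obs, reward, done, info)); the class name and the copycat opponent are made up for illustration and are not what rockpaperscissors.py actually does.
In [ ]:
import numpy as np
import gym
from gym import spaces

class CopycatRPS(gym.Env):
    """Toy rock-paper-scissors env whose opponent repeats the player's previous
    move. Purely illustrative -- NOT the seminar's rockpaperscissors.py."""

    def __init__(self):
        self.action_space = spaces.Discrete(3)                 # 0=rock, 1=paper, 2=scissors
        self.observation_space = spaces.Box(0, 1, shape=(3,))  # one-hot of the opponent's move
        self._opponent_move = 0

    def reset(self):
        self._opponent_move = np.random.randint(3)
        return self._encode()

    def step(self, action):
        # +1 if the action beats the opponent's move, -1 if it loses, 0 on a draw
        reward = [0., 1., -1.][(action - self._opponent_move) % 3]
        self._opponent_move = action  # the opponent copies the player's last move
        return self._encode(), reward, False, {}

    def _encode(self):
        obs = np.zeros(3, dtype="float32")
        obs[self._opponent_move] = 1.
        return obs

Such an env could then be wrapped the same way make_env does below, e.g. gym.wrappers.TimeLimit(CopycatRPS(), max_episode_steps=100).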
In [ ]:
import gym
from rockpaperscissors import RockPaperScissors
def make_env():
env = RockPaperScissors()
return gym.wrappers.TimeLimit(env, max_episode_steps=100)
# spawn game instance
env = make_env()
observation_shape = env.observation_space.shape
n_actions = env.action_space.n
env.reset()
obs = env.step(env.action_space.sample())[0]
print(obs)
In [ ]:
# setup theano/lasagne. Prefer CPU
%env THEANO_FLAGS=device=cpu,floatX=float32
import theano
import lasagne
import theano.tensor as T
from lasagne.layers import *
In [ ]:
# observation input followed by a small dense layer
obs = InputLayer((None,) + observation_shape)
nn = DenseLayer(obs, 32, nonlinearity=T.nnet.elu)
In [ ]:
from agentnet.memory import RNNCell, GRUCell, LSTMCell
# YOUR CODE HERE
# Implement a recurrent agent memory by uncommenting the code below and defining h_new
# h_prev = InputLayer((None, 50), name="previous memory state with 50 units")
# h_new = RNNCell(<what's the previous state>, <what's the input>, nonlinearity=T.nnet.elu)
# (IMPORTANT!) use the new cell to compute q-values instead of the dense layer
# nn = h_new
In [ ]:
from agentnet.resolver import EpsilonGreedyResolver
l_qvalues = DenseLayer(nn, n_actions)
l_actions = EpsilonGreedyResolver(l_qvalues)
In [ ]:
from agentnet.agent import Agent
# YOUR CODE HERE
# uncomment agent_states and define which layers should be used
agent = Agent(observation_layers=obs,
              policy_estimators=l_qvalues,
              # agent_states={<new rnn state>: <layer it should become at the next time-step>},
              action_layers=l_actions)
In [ ]:
from agentnet.experiments.openai_gym.pool import EnvPool
pool = EnvPool(agent, make_env, n_games=N_AGENTS)  # number of parallel games; may need to adjust
pool.update(SEQ_LENGTH)
In [ ]:
replay = pool.experience_replay
qvalues_seq = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,
    unroll_scan=False,  # makes compilation ~100x faster at the cost of a slightly slower runtime
)[-1]
auto_updates = agent.get_automatic_updates() # required if unroll_scan=False
In [ ]:
# get SARSA mse loss
from agentnet.learning import sarsa
elemwise_mse = sarsa.get_elementwise_objective(qvalues_seq,
                                               actions=replay.actions[0],
                                               rewards=replay.rewards,
                                               is_alive=replay.is_alive)
loss = elemwise_mse.mean()
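For reference, in standard SARSA terms the element-wise objective above is the squared one-step TD error $\big(Q(s_t, a_t) - [\,r_t + \gamma \, Q(s_{t+1}, a_{t+1})\,]\big)^2$, where $a_{t+1}$ is the action the agent actually took at the next step (on-policy) and $\gamma$ is the discount factor; is_alive masks out frames after the session has ended, and we average the surviving errors into a single scalar loss.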
In [ ]:
# Compute weights and updates
weights = lasagne.layers.get_all_params([l_actions], trainable=True)
updates = lasagne.updates.adam(loss, weights)
# compile train function
train_step = theano.function([], loss, updates=auto_updates+updates)
In [ ]:
# baseline: average session reward of the untrained agent
untrained_reward = np.mean(pool.evaluate(save_path="./records", n_games=10,
                                         record_video=False, use_monitor=False))
In [ ]:
# starting epoch
epoch_counter = 1
# full game rewards
rewards = {0: untrained_reward}
loss, reward = 0, untrained_reward
In [ ]:
from tqdm import trange
from IPython.display import clear_output
for i in trange(10000):
    # play
    pool.update(SEQ_LENGTH)

    # train
    loss = train_step()

    # anneal epsilon
    new_epsilon = max(0.01, 1 - 2e-4 * epoch_counter)
    l_actions.epsilon.set_value(np.float32(new_epsilon))

    # record current learning progress and show learning curves
    if epoch_counter % 100 == 0:
        clear_output(True)
        print("iter=%i, loss=%.3f, epsilon=%.3f" %
              (epoch_counter, loss, new_epsilon))

        # exponential moving average of the evaluation reward
        reward = 0.9 * reward + 0.1 * np.mean(pool.evaluate(save_path="./records", n_games=10,
                                                            record_video=False, use_monitor=False))
        rewards[epoch_counter] = reward

        plt.plot(*zip(*sorted(rewards.items(), key=lambda kv: kv[0])))
        plt.grid()
        plt.show()

    epoch_counter += 1
In [ ]:
plt.plot(*zip(*sorted(rewards.items(), key=lambda k: k[0])))
plt.grid()
Compare two types of nonlinearities for the RNN:
- T.nnet.elu
- T.nnet.sigmoid

Re-train the agent at least 10 times. It's probably a good idea to automate the process (see the sketch below).
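A possible skeleton for automating the comparison is sketched below. The helper name build_and_train and the run counts are ours, not part of the seminar code: you would fill the function in yourself by wrapping the graph-building and training cells above (rebuilding the network with the requested RNNCell nonlinearity each time).
In [ ]:
import numpy as np
import theano.tensor as T

def build_and_train(nonlinearity, n_iters=10000):
    """Hypothetical wrapper: rebuild the recurrent agent with the given RNNCell
    nonlinearity, train it from scratch as in the cells above, and return the
    final evaluation reward."""
    raise NotImplementedError("wrap the notebook cells above into this function")

results = {}
for name, nonlinearity in [("elu", T.nnet.elu), ("sigmoid", T.nnet.sigmoid)]:
    # several independent re-trainings per nonlinearity
    results[name] = [build_and_train(nonlinearity) for _ in range(10)]

for name, run_rewards in results.items():
    print("%s: mean=%.1f, std=%.1f" % (name, np.mean(run_rewards), np.std(run_rewards)))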
Notice something weird? Any clue why this happens and how to fix it?
Running the experiment and reporting results gets you 1 point. The reward gets much higher as you go down the rabbit hole! Don't forget to submit this notebook to Anytask and mention that you went for this bonus.
In [ ]:
# results, ideas, solutions...
In [ ]:
In [ ]: