In this seminar you'll deploy a recurrent neural network inside a SARSA agent.
The environment it plays is a simple POMDP: a rock-paper-scissors game against an exploitable opponent.
First, read through the code and run it as you read. The code will create a feedforward neural network and train it with SARSA.
Since the game is partially observable, the default algorithm won't reach the optimal score. In fact, it's unstable and may even end up worse than random.
Once you have run the code, find the two #YOUR CODE HERE chunks (Ctrl+F helps) and implement a recurrent memory.
Re-run the experiment and compare the performance of the feedforward vs. the recurrent agent. The RNN should do much better: session reward > 50.
After you're done with that, proceed to the next part; it's going to be much more interesting.
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# number of parallel agents and batch sequence length (frames)
N_AGENTS = 10
SEQ_LENGTH = 25
The environment we're going to use here is not a default gym env. It was instead written from scratch in rockpaperscissors.py.
Moral: you can easily make your own gym environments out of anything you want, including the OS or the web (e.g. via Selenium).
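For reference, a custom environment only needs to subclass gym.Env, define action_space and observation_space, and implement reset() and step(). Below is a minimal illustrative sketch under the old Gym API used in this notebook (reset() returns an observation, step() returns (obs, reward, done, info)); the class name and the copycat opponent are made up for illustration and are not what rockpaperscissors.py actually does.
In [ ]:
import numpy as np
import gym
from gym import spaces

class CopycatRPS(gym.Env):
    """Toy rock-paper-scissors env whose opponent repeats the player's previous
    move. Purely illustrative -- NOT the seminar's rockpaperscissors.py."""

    def __init__(self):
        self.action_space = spaces.Discrete(3)                 # 0=rock, 1=paper, 2=scissors
        self.observation_space = spaces.Box(0, 1, shape=(3,))  # one-hot of the opponent's move
        self._opponent_move = 0

    def reset(self):
        self._opponent_move = np.random.randint(3)
        return self._encode()

    def step(self, action):
        # +1 if the action beats the opponent's move, -1 if it loses, 0 on a draw
        reward = [0., 1., -1.][(action - self._opponent_move) % 3]
        self._opponent_move = action  # the opponent copies the player's last move
        return self._encode(), reward, False, {}

    def _encode(self):
        obs = np.zeros(3, dtype="float32")
        obs[self._opponent_move] = 1.
        return obs

Such an env could then be wrapped the same way make_env does below, e.g. gym.wrappers.TimeLimit(CopycatRPS(), max_episode_steps=100).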
In [ ]:
import gym
from rockpaperscissors import RockPaperScissors
def make_env():
env = RockPaperScissors()
return gym.wrappers.TimeLimit(env, max_episode_steps=100)
# spawn game instance
env = make_env()
observation_shape = env.observation_space.shape
n_actions = env.action_space.n
env.reset()
obs = env.step(env.action_space.sample())[0]
print(obs)
In [ ]:
# setup theano/lasagne. Prefer CPU
%env THEANO_FLAGS=device=cpu,floatX=float32
import theano
import lasagne
import theano.tensor as T
from lasagne.layers import *
In [ ]:
# observation input followed by a small dense layer
obs = InputLayer((None,) + observation_shape)
nn = DenseLayer(obs, 32, nonlinearity=T.nnet.elu)
In [ ]:
from agentnet.memory import RNNCell, GRUCell, LSTMCell
# YOUR CODE HERE
# Implement a recurrent agent memory by uncommenting the code below and defining h_new
# h_prev = InputLayer((None, 50), name="previous memory state with 50 units")
# h_new = RNNCell(<what's the previous state>, <what's the input>, nonlinearity=T.nnet.elu)
# (IMPORTANT!) use the new cell to compute q-values instead of the dense layer
# nn = h_new
In [ ]:
from agentnet.resolver import EpsilonGreedyResolver
l_qvalues = DenseLayer(nn, n_actions)
l_actions = EpsilonGreedyResolver(l_qvalues)
In [ ]:
from agentnet.agent import Agent
# YOUR CODE HERE
# uncomment agent_states and define which layers should be used
agent = Agent(observation_layers=obs,
              policy_estimators=l_qvalues,
              # agent_states={<new rnn state>: <layer it should become at the next time-step>},
              action_layers=l_actions)
In [ ]:
from agentnet.experiments.openai_gym.pool import EnvPool
pool = EnvPool(agent, make_env, n_games=N_AGENTS)  # number of parallel games; may need to adjust
pool.update(SEQ_LENGTH)
In [ ]:
replay = pool.experience_replay
qvalues_seq = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,
    unroll_scan=False,  # makes compilation ~100x faster at the cost of a slightly slower runtime
)[-1]
auto_updates = agent.get_automatic_updates() # required if unroll_scan=False
In [ ]:
# get SARSA mse loss
from agentnet.learning import sarsa
elemwise_mse = sarsa.get_elementwise_objective(qvalues_seq,
                                               actions=replay.actions[0],
                                               rewards=replay.rewards,
                                               is_alive=replay.is_alive)
loss = elemwise_mse.mean()
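For reference, in standard SARSA terms the element-wise objective above is the squared one-step TD error $\big(Q(s_t, a_t) - [\,r_t + \gamma \, Q(s_{t+1}, a_{t+1})\,]\big)^2$, where $a_{t+1}$ is the action the agent actually took at the next step (on-policy) and $\gamma$ is the discount factor; is_alive masks out frames after the session has ended, and we average the surviving errors into a single scalar loss.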
In [ ]:
# Compute weights and updates
weights = lasagne.layers.get_all_params([l_actions], trainable=True)
updates = lasagne.updates.adam(loss, weights)
# compile train function
train_step = theano.function([], loss, updates=auto_updates+updates)
In [ ]:
# baseline: average session reward of the untrained agent
untrained_reward = np.mean(pool.evaluate(save_path="./records", n_games=10,
                                         record_video=False, use_monitor=False))
In [ ]:
# starting epoch
epoch_counter = 1
# full game rewards
rewards = {0: untrained_reward}
loss, reward = 0, untrained_reward
In [ ]:
from tqdm import trange
from IPython.display import clear_output
for i in trange(10000):
    # play
    pool.update(SEQ_LENGTH)

    # train
    loss = train_step()

    # anneal epsilon
    new_epsilon = max(0.01, 1 - 2e-4 * epoch_counter)
    l_actions.epsilon.set_value(np.float32(new_epsilon))

    # record current learning progress and show learning curves
    if epoch_counter % 100 == 0:
        clear_output(True)
        print("iter=%i, loss=%.3f, epsilon=%.3f" %
              (epoch_counter, loss, new_epsilon))

        # exponential moving average of the evaluation reward
        reward = 0.9 * reward + 0.1 * np.mean(pool.evaluate(save_path="./records", n_games=10,
                                                            record_video=False, use_monitor=False))
        rewards[epoch_counter] = reward

        plt.plot(*zip(*sorted(rewards.items(), key=lambda kv: kv[0])))
        plt.grid()
        plt.show()

    epoch_counter += 1
In [ ]:
plt.plot(*zip(*sorted(rewards.items(), key=lambda k: k[0])))
plt.grid()
Compare two types of nonlinearities for the RNN:
- T.nnet.elu
- T.nnet.sigmoid

Re-train the agent at least 10 times. It's probably a good idea to automate the process (see the sketch below).
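A possible skeleton for automating the comparison is sketched below. The helper name build_and_train and the run counts are ours, not part of the seminar code: you would fill the function in yourself by wrapping the graph-building and training cells above (rebuilding the network with the requested RNNCell nonlinearity each time).
In [ ]:
import numpy as np
import theano.tensor as T

def build_and_train(nonlinearity, n_iters=10000):
    """Hypothetical wrapper: rebuild the recurrent agent with the given RNNCell
    nonlinearity, train it from scratch as in the cells above, and return the
    final evaluation reward."""
    raise NotImplementedError("wrap the notebook cells above into this function")

results = {}
for name, nonlinearity in [("elu", T.nnet.elu), ("sigmoid", T.nnet.sigmoid)]:
    # several independent re-trainings per nonlinearity
    results[name] = [build_and_train(nonlinearity) for _ in range(10)]

for name, run_rewards in results.items():
    print("%s: mean=%.1f, std=%.1f" % (name, np.mean(run_rewards), np.std(run_rewards)))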
Notice something weird? Any clue why this happens and how to fix it?
Running the experiment and reporting results gets you 1 point. The reward gets much higher as you go down the rabbit hole! Don't forget to submit this notebook to Anytask and mention that you went for this bonus.
In [ ]:
# results, ideas, solutions...
In [ ]:
In [ ]: