This time we're going to learn something harder than CartPole :)
Gym Atari games only expose raw image pixels as observations, which demands a more powerful agent network to extract meaningful features. We will use a convolutional neural network for this task.
Most of the code in this notebook is written for you; however, you are strongly encouraged to experiment with it to find a better agent configuration and/or learning algorithm.
In [2]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#setup theano/lasagne. Prefer GPU
%env THEANO_FLAGS=device=gpu,floatX=float32
#If you are running on a server, launch xvfb to record game videos
#Please make sure you have xvfb installed (apt-get install xvfb, see gym readme on xvfb)
import os
if os.environ.get("DISPLAY") is str and len(os.environ.get("DISPLAY"))!=0:
!bash xvfb start
%env DISPLAY=:1
In [3]:
from gym.core import ObservationWrapper
from gym.spaces import Box
from scipy.misc import imresize
class PreprocessAtari(ObservationWrapper):
    def __init__(self, env):
        """A gym wrapper that crops, scales the image into the desired shape and optionally grayscales it."""
        ObservationWrapper.__init__(self, env)
        self.img_size = (64, 64)
        # channel-first observation: (1, 64, 64) for grayscale (use 3 channels instead of 1 if you keep color)
        self.observation_space = Box(0.0, 1.0, (1,) + self.img_size)

    def _observation(self, img):
        """what happens to each observation"""
        # Here's what you need to do:
        #  * crop image, remove irrelevant parts
        #  * resize image to self.img_size
        #    (use imresize imported above or any library you want,
        #     e.g. opencv, skimage, PIL, keras)
        #  * cast image to grayscale
        #  * convert image pixels to (0,1) range, float32 type
        <Your code here>
        return <...>
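For reference, here is one possible way to fill in `_observation`, assuming a channel-first grayscale output of shape (1, 64, 64); the crop boundaries below are rough guesses for KungFuMaster and may need tuning:

    def _observation(self, img):
        img = img[60:-30, 5:]                  # crop away the score bar and borders (rough guess)
        img = imresize(img, self.img_size)     # resize to (64, 64)
        img = img.mean(-1, keepdims=True)      # grayscale by averaging the RGB channels
        img = np.transpose(img, (2, 0, 1))     # to channel-first shape (1, 64, 64)
        return (img / 255.).astype('float32')  # scale pixels to [0, 1] as float32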
In [4]:
import gym
#environment factory; for other games, see https://gym.openai.com/envs
def make_env():
    env = gym.make("KungFuMaster-v0")
    return PreprocessAtari(env)
#spawn game instance
env = make_env()
observation_shape = env.observation_space.shape
n_actions = env.action_space.n
obs = env.reset()
plt.imshow(obs[0],interpolation='none',cmap='gray')
Out[4]:
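Before moving on, you may want to check that your preprocessing behaves as intended (a small sanity-check sketch; it assumes the wrapper returns float32 pixels in the [0, 1] range):

print(observation_shape, n_actions)
assert obs.dtype == np.float32
assert 0.0 <= obs.min() and obs.max() <= 1.0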
In [5]:
import theano, lasagne
import theano.tensor as T
from lasagne.layers import *
from agentnet.memory import WindowAugmentation
In [6]:
#observation goes here
observation_layer = InputLayer((None,)+observation_shape,)
#4-tick window over images
prev_wnd = InputLayer((None,4)+observation_shape,name='window from last tick')
new_wnd = WindowAugmentation(observation_layer,prev_wnd,name='updated window')
#reshape to (batch, frames*channels, h, w). If you don't use grayscale, 4 should become 12.
wnd_reshape = reshape(new_wnd, (-1,4*observation_shape[0])+observation_shape[1:])
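If you want to double-check the shapes, a minimal sanity check (assuming grayscale (1, 64, 64) observations) could look like this:

# the window stacks the last 4 observations; the reshape merges frames into channels
print(new_wnd.output_shape)      # expected: (None, 4, 1, 64, 64)
print(wnd_reshape.output_shape)  # expected: (None, 4, 64, 64)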
Here you will need to build a convolutional network that consists of four layers: three convolutional layers followed by a dense "neck" layer.
You may find a template for such a network below.
In [9]:
from lasagne.nonlinearities import rectify,elu,tanh,softmax
#network body
conv0 = Conv2DLayer(wnd_reshape,<...>)
conv1 = <another convolutional layer, growing from conv0>
conv2 = <yet another layer...>
dense = DenseLayer(<what is its input?>,
                   nonlinearity=tanh,
                   name='dense "neck" layer')
You will now need to build the output layers. Since we're building an advantage actor-critic algorithm, our network will require two outputs: action logits for the policy (the actor head) and a scalar state-value estimate V(s) (the critic head).
Both of these layers will grow from the final dense layer of the network body.
In [11]:
#actor head
logits_layer = DenseLayer(dense,n_actions,nonlinearity=None)
#^^^ separately define pre-softmax policy logits to regularize them later
policy_layer = NonlinearityLayer(logits_layer,softmax)
#critic head
V_layer = DenseLayer(dense,1,nonlinearity=None)
#sample actions proportionally to policy_layer
from agentnet.resolver import ProbabilisticResolver
action_layer = ProbabilisticResolver(policy_layer)
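The resolver samples actions in proportion to the policy probabilities. Conceptually (plain numpy illustration, not agentnet internals):

probs = np.array([0.1, 0.7, 0.2])               # example policy over 3 actions
action = np.random.choice(len(probs), p=probs)  # action 1 gets sampled ~70% of the time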
In [13]:
from agentnet.agent import Agent
#all together
agent = Agent(observation_layers=observation_layer,
              policy_estimators=(logits_layer,V_layer),
              agent_states={new_wnd:prev_wnd},
              action_layers=action_layer)
In [14]:
#Since it's a single lasagne network, one can get its weights, outputs, etc.
weights = lasagne.layers.get_all_params([V_layer,policy_layer],trainable=True)
weights
Out[14]:
In [15]:
from agentnet.experiments.openai_gym.pool import EnvPool
#number of parallel agents
N_AGENTS = 10
pool = EnvPool(agent, make_env, N_AGENTS) #you may need to adjust N_AGENTS to fit your machine
In [16]:
%%time
#interact for 10 ticks
_,action_log,reward_log,_,_,_ = pool.interact(10)
print('actions:')
print(action_log[0])
print("rewards")
print(reward_log[0])
In [17]:
# batch sequence length (frames)
SEQ_LENGTH = 25
#load first sessions (this function calls interact and remembers sessions)
pool.update(SEQ_LENGTH)
Such sessions are stored as sequences of observations, agent memory states, actions, Q-values, etc.
The SessionPool also stores rewards, "is alive" indicators, and so on.
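If you want a quick look at what has just been recorded, the replayed rewards live in a theano shared variable (a minimal check; the expected shape assumes the pool settings above):

# rewards recorded for N_AGENTS parallel sessions, SEQ_LENGTH ticks each
print(pool.experience_replay.rewards.get_value().shape)  # expected: (10, 25)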
In [18]:
#replay the recorded sessions through the agent to obtain policy logits and state values
#we don't unroll scan here and instead propagate automatic updates,
#which speeds up compilation at the cost of some runtime speed
replay = pool.experience_replay
_,_,_,_,(logits_seq,V_seq) = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,
    unroll_scan=False,
)
auto_updates = agent.get_automatic_updates()
In [19]:
# compute pi(a|s) and log(pi(a|s)) manually [use logsoftmax]
# we can't rely on theano to optimize log(softmax(...)) into logsoftmax automatically, since that feature is still in development
logits_flat = logits_seq.reshape([-1,logits_seq.shape[-1]])
policy_seq = T.nnet.softmax(logits_flat).reshape(logits_seq.shape)
logpolicy_seq = T.nnet.logsoftmax(logits_flat).reshape(logits_seq.shape)
# get policy gradient
from agentnet.learning import a2c
elwise_actor_loss,elwise_critic_loss = a2c.get_elementwise_objective(policy=logpolicy_seq,
                                                                     treat_policy_as_logpolicy=True,
                                                                     state_values=V_seq[:,:,0],
                                                                     actions=replay.actions[0],
                                                                     rewards=replay.rewards/100.,
                                                                     is_alive=replay.is_alive,
                                                                     gamma_or_gammas=0.99,
                                                                     n_steps=None,
                                                                     return_separate=True)
# (you can change these coefficients more or less harmlessly; this usually just makes learning faster or slower)
# also regularize to encourage exploration
reg_logits = T.mean(logits_seq**2)
reg_entropy = T.mean(T.sum(policy_seq*logpolicy_seq,axis=-1))
#add up the loss components with hand-picked coefficients
loss = 0.1*elwise_actor_loss.mean() +\
       0.25*elwise_critic_loss.mean() +\
       1e-3*reg_entropy +\
       1e-3*reg_logits
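For intuition, the element-wise losses above correspond to the standard advantage actor-critic objective (a sketch of the usual formulation; agentnet's exact handling of n-step returns may differ in details): the actor loss is $-\log\pi(a_t|s_t)\,A_t$ with advantage $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ treated as constant with respect to the policy, and the critic loss is the squared TD error $(r_t + \gamma V(s_{t+1}) - V(s_t))^2$. Note that $\sum_a \pi(a|s)\log\pi(a|s)$ is the negative entropy, so adding reg_entropy with a small positive weight pushes the policy towards higher entropy, i.e. more exploration.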
In [20]:
# Compute weight updates, clip by norm
grads = T.grad(loss,weights)
grads = lasagne.updates.total_norm_constraint(grads,10)
updates = lasagne.updates.adam(grads, weights,1e-4)
#compile train function
train_step = theano.function([],loss,updates=auto_updates+updates)
In [21]:
untrained_reward = np.mean(pool.evaluate(save_path="./records",
record_video=True))
In [22]:
#show video
from IPython.display import HTML
import os
video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./records/")))
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./records/"+video_names[-1])) #this may or may not be _last_ video. Try other indices
Out[22]:
In [24]:
#starting epoch
epoch_counter = 1
#full game rewards
rewards = {}
loss,reward_per_tick,reward =0,0,0
In [ ]:
from tqdm import trange
from IPython.display import clear_output
#the algorithm almost converges by 15k iterations; 50k is enough for full convergence
for i in trange(150000):
    #play
    pool.update(SEQ_LENGTH)
    #train
    loss = 0.95*loss + 0.05*train_step()

    if epoch_counter%10==0:
        #average reward per game tick in the current experience replay pool
        reward_per_tick = 0.95*reward_per_tick + 0.05*pool.experience_replay.rewards.get_value().mean()
        print("iter=%i\tloss=%.3f\treward/tick=%.3f"%(epoch_counter,
                                                      loss,
                                                      reward_per_tick))

    ##record current learning progress and show learning curves
    if epoch_counter%100 ==0:
        reward = 0.95*reward + 0.05*np.mean(pool.evaluate(record_video=False))
        rewards[epoch_counter] = reward

        clear_output(True)
        plt.plot(*zip(*sorted(rewards.items(), key=lambda kv: kv[0])))
        plt.show()

    epoch_counter +=1
# Time to drink some coffee!
In [ ]:
import pandas as pd
plt.plot(*zip(*sorted(rewards.items(),key=lambda k:k[0])))
In [ ]:
from agentnet.utils.persistence import save
save(action_layer,"kung_fu.pcl")
In [ ]:
###LOAD FROM HERE
from agentnet.utils.persistence import load
load(action_layer,"kung_fu.pcl")
In [25]:
rw = pool.evaluate(n_games=20,save_path="./records",record_video=True)
print("mean session score=%f.5"%np.mean(rw))
In [27]:
#show video
from IPython.display import HTML
import os
video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./records/")))
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./records/"+video_names[-1])) #this may or may not be _last_ video. Try other indices
Out[27]:
In [ ]: