Frameworks - we'll accept this homework in any deep learning framework. For example, it translates to TensorFlow almost line by line. However, we recommend sticking to theano/lasagne unless you're confident in your skills with the framework of your choice.
In [1]:
%env THEANO_FLAGS='floatX=float32'
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    %env DISPLAY=:1
In [2]:
import gym
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make("CartPole-v0")
env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape
plt.imshow(env.render("rgb_array"))
Out[2]:
For REINFORCE algorithm, we'll need a model that predicts action probabilities given states.
In [3]:
import theano
import theano.tensor as T
#create input variables. We'll support multiple states at once
states = T.matrix("states[batch,units]")
actions = T.ivector("action_ids[batch]")
cumulative_rewards = T.vector("R[batch] = r + gamma*r' + gamma^2*r'' + ...")
In [4]:
import lasagne
from lasagne.layers import *
#input layer
l_states = InputLayer((None,)+state_dim,input_var=states)
<Your architecture. Please start with 1-2 layers of 50-200 neurons each; see the sketch after this cell for one option>
#output layer
#this time we need to predict action probabilities,
#so make sure your nonlinearity forces p>0 and sum_p = 1
l_action_probas = DenseLayer(<...>,
                             num_units=<...>,
                             nonlinearity=<...>)
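One possible way to fill in the blanks above - a minimal sketch, not the required architecture: a single tanh hidden layer followed by a softmax output over n_actions (softmax forces p>0 and sum_p = 1). The layer size is an assumption picked from the suggested 50-200 range.
#sketch of one possible architecture (hidden layer size is a free choice)
l_hidden = DenseLayer(l_states, num_units=100,
                      nonlinearity=lasagne.nonlinearities.tanh)
l_action_probas = DenseLayer(l_hidden,
                             num_units=n_actions,
                             nonlinearity=lasagne.nonlinearities.softmax)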
In [5]:
#get probabilities of actions
predicted_probas = get_output(l_action_probas)
#predict action probability given state
#if you use float32, set allow_input_downcast=True
predict_proba = <compile a function that takes states and returns predicted_probas>
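For reference, a sketch of that compilation step (allow_input_downcast=True lets float64 numpy inputs be cast down to float32 automatically):
#sketch: compile a function from states to action probabilities
predict_proba = theano.function([states], predicted_probas, allow_input_downcast=True)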
We now need to define the objective and the weight updates for the policy gradient.
The objective function can be defined as follows:
$$ J \approx \sum_i \log \pi_\theta (a_i \mid s_i) \cdot R(s_i, a_i) $$
When you compute the gradient of this function with respect to the network weights $\theta$, it becomes exactly the policy gradient.
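To see why (a sketch of the standard REINFORCE argument; the sampled return $R(s_i, a_i)$ is treated as a constant with respect to $\theta$):
$$ \nabla_\theta J \approx \sum_i \nabla_\theta \log \pi_\theta (a_i \mid s_i) \cdot R(s_i, a_i), $$
which is the Monte-Carlo estimate of $\nabla_\theta \, \mathbb{E}\left[ R \right] = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta (a \mid s) \cdot R(s, a) \right]$.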
In [6]:
#select probabilities for chosen actions, pi(a_i|s_i)
predicted_probas_for_actions = predicted_probas[T.arange(actions.shape[0]),actions]
In [7]:
#REINFORCE objective function
J = #<policy objective as per formula above>
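A sketch of one way to write it (taking the mean over the batch instead of the sum also works and only rescales the learning rate):
#sketch: J = sum_i log pi(a_i|s_i) * R(s_i,a_i)
J = T.sum(T.log(predicted_probas_for_actions) * cumulative_rewards)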
In [8]:
#all network weights
all_weights = <get all "thetas" aka network weights using lasagne>
#weight updates. maximize J = minimize -J
updates = lasagne.updates.sgd(-J,all_weights,learning_rate=0.01)
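For the blank above, a sketch using lasagne's parameter collection (get_all_params walks the whole graph starting from the output layer):
#sketch: collect all trainable network parameters ("thetas")
all_weights = get_all_params(l_action_probas, trainable=True)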
In [9]:
train_step = theano.function([states, actions, cumulative_rewards], updates=updates,
                             allow_input_downcast=True)
In [10]:
def get_cumulative_rewards(rewards,    #rewards at each step
                           gamma=0.99  #discount for reward
                           ):
    """
    Take a list of immediate rewards r(s,a) for the whole session
    and compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16):
    R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    The simple way to compute cumulative rewards is to iterate from the last time tick to the first
    and compute R_t = r_t + gamma*R_{t+1} recurrently.
    You must return an array/list of cumulative rewards with as many elements as in the initial rewards.
    """
    <your code here>
    return <array of cumulative rewards>
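A sketch of the reverse-iteration approach described in the docstring (one possible way to fill in the body; it is consistent with the checks below):
#sketch: iterate from the last reward to the first, accumulating the discounted return
def get_cumulative_rewards(rewards, gamma=0.99):
    cumulative_rewards = []
    running_return = 0.0
    for r in reversed(rewards):
        running_return = r + gamma * running_return
        cumulative_rewards.append(running_return)
    return cumulative_rewards[::-1]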
In [11]:
assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0,0,1,0,0,1,0],gamma=0.9),[1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards([0,0,1,-2,3,-4,0],gamma=0.5), [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards([0,0,1,2,3,4,0],gamma=0), [0, 0, 1, 2, 3, 4, 0])
print("looks good!")
In [12]:
def generate_session(t_max=1000):
    """play env with REINFORCE agent and train at the session end"""
    #arrays to record session
    states, actions, rewards = [], [], []
    s = env.reset()
    for t in range(t_max):
        #action probabilities array aka pi(a|s)
        action_probas = predict_proba([s])[0]
        a = <sample action with given probabilities>
        new_s, r, done, info = env.step(a)
        #record session history to train later
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = new_s
        if done: break
    cumulative_rewards = get_cumulative_rewards(rewards)
    train_step(states, actions, cumulative_rewards)
    return sum(rewards)
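The sampling line above can be filled in with np.random.choice (a sketch; action_probas must sum to 1, which the softmax output guarantees):
#sketch: sample an action index a in [0, n_actions) with probabilities pi(a|s)
a = np.random.choice(n_actions, p=action_probas)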
In [13]:
for i in range(100):
    rewards = [generate_session() for _ in range(100)]  #generate new sessions
    print("mean reward: %.3f" % (np.mean(rewards)))
    if np.mean(rewards) > 300:
        print("You Win!")
        break
In [14]:
#record sessions
import gym.wrappers
env = gym.wrappers.Monitor(gym.make("CartPole-v0"),directory="videos",force=True)
sessions = [generate_session() for _ in range(100)]
env.close()
In [15]:
#show video
from IPython.display import HTML
import os
video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./videos/")))
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1])) #this may or may not be _last_ video. Try other indices
Out[15]:
In [ ]: