This tutorial provides a step-by-step guide on how to define a simple reinforcement learning problem and use Chimp to solve it. We use the classic mountain car problem: you control a car that is trying to reach the top of a hill. The car can accelerate forward, accelerate backward, or do nothing. The challenge lies in the fact that the car can't drive directly up the hill (boo gravity), and must learn to rock back and forth to build up enough momentum to reach the top.
If you have a problem that you would like to solve using deep reinforcement learning, there are two things you must define: a simulator and the deep Q-Network (DQN) model.
The simulator class serves as a generative model that is used by Chimp. It provides access to the dynamics and the reward structure of the problem.
Your simulator class must contain the following functions:
def act(self, a):
    # transitions the simulator by performing action a
    pass
def reward(self):
    # returns the reward R(s, a), where s is the state before calling act and a is the action passed to act
    pass
def get_screenshot(self):
    # returns the current state of the system
    pass
def episode_over(self):
    # returns True if the current state is terminal, False otherwise
    pass
def reset_episode(self):
    # reinitializes the current state and reward
    pass
Your simulator class must also contain the following fields:
n_actions # the number of actions available to the agent
model_dims # tuple containing the model dimensions for the state of the system
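To see how these pieces fit together, here is a minimal sketch of a rollout loop driving any simulator that implements this interface. It is not Chimp's own training loop; run_episode and choose_action are made-up names used only for illustration.
In [ ]:
def run_episode(sim, choose_action, max_steps=1000):
    # drive any simulator that implements the interface above
    sim.reset_episode()
    total_reward = 0.0
    for _ in range(max_steps):
        s = sim.get_screenshot()       # current state
        sim.act(choose_action(s))      # transition the simulator
        total_reward += sim.reward()   # reward for the state/action pair
        if sim.episode_over():
            break
    return total_reward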
In [1]:
# Inline for plotting
%matplotlib inline
In [2]:
##########################################################
################### MOUNTAIN CAR #########################
##########################################################
# you can find this source code in chimp/simulators/mdp/mountain_car.py
# and chimp/simulators/mdp/mdp_simulator.py
import numpy as np
class MountainCar():

    def __init__(self,
                 term_r = 10.0,
                 nonterm_r = -1.0,
                 discount = 0.95):
        # initialize fields required by Chimp
        self.n_actions = 3
        self.model_dims = (2,)
        # initialize simulator vars
        self.current_state = np.zeros(2, dtype=np.float32) # must be float32
        self.last_reward = 0.0
        # initialize mountain car constants
        self.actions = np.array([-1.0, 0.0, 1.0])
        self.vmin, self.vmax = (-0.07, 0.07)
        self.xmin, self.xmax = (-1.2, 0.6)
        self.term_r = term_r
        self.nonterm_r = nonterm_r

    def act(self, a):
        """
        Transitions to the next state and computes the reward
        """
        x_old, v_old = self.current_state
        # compute the reward
        r = self.term_r if x_old >= self.xmax else self.nonterm_r
        # transition the state
        v = v_old + 0.001 * self.actions[a] - 0.0025 * np.cos(3 * x_old)
        v = np.clip(v, self.vmin, self.vmax)
        x = x_old + v
        x = np.clip(x, self.xmin, self.xmax)
        self.last_reward = r
        self.current_state[0] = x
        self.current_state[1] = v

    def reward(self):
        return self.last_reward

    def get_screenshot(self):
        return self.current_state

    def episode_over(self):
        """
        Checks if the car reached the top of the mountain
        """
        if self.current_state[0] >= self.xmax:
            return True
        return False

    def reset_episode(self):
        """
        Reinitializes the state of the car and the last reward
        """
        self.current_state[0] = np.random.uniform(self.xmin, self.xmax * 0.9)
        self.current_state[1] = 0.0
        self.last_reward = 0.0

    def simulate(self, nsteps, policy):
        """
        Runs a simulation using the provided DQN policy for nsteps
        """
        self.reset_episode()
        rtot = 0.0
        xpos = np.zeros(nsteps)
        vel = np.zeros(nsteps)
        # state must be in (1, dims) form for forward prop
        input_state = np.zeros((1, 2), dtype=np.float32)
        # run the simulation
        for i in xrange(nsteps):
            state = self.get_screenshot()
            input_state[0] = state
            a = policy.action((input_state, None)) # input format for the DQN is (state, action); the action is not used here
            self.act(a)
            r = self.reward()
            rtot += r
            xpos[i], vel[i] = state
            if self.episode_over():
                break
        return rtot, xpos, vel
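Before wiring the simulator into Chimp, it can be worth a quick sanity check on its own. The cell below is only illustrative: it resets the car, takes a few random actions, and prints the state, reward, and terminal flag.
In [ ]:
# quick smoke test of the simulator (illustrative only)
sim = MountainCar()
sim.reset_episode()
for _ in range(5):
    a = np.random.randint(sim.n_actions)  # pick a random action index
    sim.act(a)
    print(sim.get_screenshot(), sim.reward(), sim.episode_over())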
We use the Chainer framework for defining and training the network used for the Q-value approximation. Currently Chimp only supports Chainer (TensorFlow support is on its way), so you must define the DQN using the Chainer API. A simple, fully connected network with two hidden layers, batch normalization, and ReLU non-linearities looks like this:
In [3]:
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import cuda, Function, gradient_check, Variable, optimizers, serializers, utils
from chainer import Link, Chain, ChainList
class CarNet(Chain):

    def __init__(self):
        super(CarNet, self).__init__(
            l1=F.Linear(2, 20), # 2 dimensional state space
            bn1=L.BatchNormalization(20),
            l2=F.Linear(20, 10),
            bn2=L.BatchNormalization(10),
            lout=F.Linear(10, 3) # 3 output actions
        )
        # need this variable for batch normalization and dropout
        self.train = True
        # init avg_var to prevent divide by zero during the first forward pass in batch norm
        self.bn1.avg_var.fill(0.1)
        self.bn2.avg_var.fill(0.1)

    def __call__(self, state, a):
        # Chimp passes in actions by default if requested; they are not used here
        h = F.relu(self.l1(state))
        h = self.bn1(h, test=not self.train)
        h = F.relu(self.l2(h))
        h = self.bn2(h, test=not self.train)
        output = self.lout(h)
        return output
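As a quick check (this cell is illustrative and not part of Chimp's API; probe_net and dummy_states are made-up names), you can push a dummy batch of states through the network and confirm that it returns one Q-value per action:
In [ ]:
# sanity check: a batch of 4 two-dimensional states should map to 4 x 3 Q-values
probe_net = CarNet()
dummy_states = Variable(np.random.rand(4, 2).astype(np.float32))
q_values = probe_net(dummy_states, None)  # the action argument is ignored by CarNet
print(q_values.data.shape)  # expect (4, 3): one Q-value per action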
Chimp trains the DQN by interfacing with your simulator through a DQNAgent class. Chimp also provides a memory class for storing the replay memory, and a learner class that uses Chainer to do the training. Before starting the training, we have to define some hyper-parameters for the DQN:
In [4]:
settings = {
    # agent settings
    'batch_size' : 32, # size of the minibatch
    'print_every' : 1000, # frequency of text output
    'save_dir' : 'results/mountain_car', # everything is saved here
    'iterations' : 2001, # length of training (low value as an example)
    'eval_iterations' : 100, # length of the evaluation simulation
    'eval_every' : 1000, # frequency of the evaluation simulation
    'save_every' : 20000, # frequency of network saving
    'initial_exploration' : 50000, # number of transitions initially in the replay memory
    'epsilon_decay' : 0.000001, # subtracted from epsilon every step
    'eval_epsilon' : 0, # epsilon used in evaluation, 0 means no random actions
    'epsilon' : 1.0, # initial exploration rate
    'history_sizes' : (1, 0, 0), # sizes of histories to use as nn inputs (s, a, r)
    'model_dims' : (2,),
    # simulator settings
    'viz' : False, # no visualization (used in Atari)
    # replay memory settings
    'memory_size' : 100000, # size of the replay memory
    # learner settings
    'learning_rate' : 0.00001, # learning rate used by the optimizer
    'discount' : 0.95, # discount rate for RL
    'clip_err' : False, # clip gradients
    'clip_reward' : False, # value to clip reward values to
    'target_net_update' : 1000, # frequency of target net updates
    'optim_name' : 'ADAM', # currently supports "RMSprop", "ADADELTA", "ADAM" and "SGD"
    'gpu' : False, # not using the GPU
    'reward_rescale': False, # rescale rewards to [0, 1]
    'decay_rate' : 0.99, # decay rate for RMSprop, otherwise not used
    # random number seeds
    'seed_general' : 1723,
    'seed_simulator' : 5632,
    'seed_agent' : 9826,
    'seed_memory' : 7563
}
There are a lot of hyper-parameters, but most of them can be left at their default values. The ones to experiment with are: iterations, eval_iterations, epsilon_decay, learning_rate, and target_net_update.
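For example, you could override those keys before training. The values below are purely illustrative, not recommendations from this tutorial; note that since epsilon starts at 1.0 and epsilon_decay is subtracted every step, exploration anneals to zero after roughly 1 / epsilon_decay steps.
In [ ]:
# illustrative overrides of the most impactful hyper-parameters
# (hypothetical values; the rest of this tutorial keeps the short run defined above)
tuned_settings = dict(settings)
tuned_settings.update({
    'iterations' : 100000, # train much longer than the 2001-step example
    'eval_iterations' : 200, # longer evaluation rollouts
    'epsilon_decay' : 0.00001, # epsilon starts at 1.0, so it reaches 0 after 1.0 / 0.00001 = 100,000 steps
    'learning_rate' : 0.0001,
    'target_net_update' : 2000, # refresh the target network every 2000 steps
})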
Now we are ready to train the DQN!
In [8]:
# Import the necessary Chimp modules
# Memory
from chimp.memories import ReplayMemoryHDF5
# Learner (Brain)
from chimp.learners.dqn_learner import DQNLearner
from chimp.learners.chainer_backend import ChainerBackend
# Agent Framework
from chimp.agents import DQNAgent
# Policy class for evaluation
from chimp.utils.policies import DQNPolicy
# initialize our mountain car simulator
simulator = MountainCar()
# initialize the network
net = CarNet()
# Initialize the learner with a Chainer backend and our net
backend = ChainerBackend(settings) # initialize with the settings dictionary
backend.set_net(net) # set the net for our Chainer backend
learner = DQNLearner(settings, backend) # create the learner
# Initialize replay memory
memory = ReplayMemoryHDF5(settings)
# Initialize the DQNAgent
agent = DQNAgent(learner, memory, simulator, settings) # pass in all 3 and settings
# Start training
agent.train(verbose=True)
We can plot the training loss.
In [9]:
from matplotlib import pyplot
agent.plot_loss()
We can also plot the evaluation history.
In [10]:
agent.plot_eval_reward()
Nothing interesting is happening yet. Our DQN hasn't learned how to reach the goal state. In fact, mountain car is a difficult problem for reinforcement learning because of its sparse reward structure: the only positive reward comes from reaching the top of the mountain. Brute-force training takes quite a while on this problem, so we use a pre-trained network instead.
In [11]:
# re-create the architecture of our pre-trained network so we can load it from a pickle file
class TestNet(Chain):

    def __init__(self):
        super(TestNet, self).__init__(
            l1=F.Linear(2, 20, bias=0.0),
            l2=F.Linear(20, 10, bias=0.0),
            bn1=L.BatchNormalization(10),
            l3=F.Linear(10, 10),
            l4=F.Linear(10, 10),
            bn2=L.BatchNormalization(10),
            lout=F.Linear(10, 3)
        )
        self.train = True
        # initialize avg_var to prevent divide by zero
        self.bn1.avg_var.fill(0.1)
        self.bn2.avg_var.fill(0.1)

    def __call__(self, ohist, ahist):
        h = F.relu(self.l1(ohist))
        h = F.relu(self.l2(h))
        h = self.bn1(h, test=not self.train)
        h = F.relu(self.l3(h))
        h = F.relu(self.l4(h))
        h = self.bn2(h, test=not self.train)
        output = self.lout(h)
        return output
Let's load the network.
In [12]:
import pickle
net = pickle.load(open("../chimp/pre_trained_nets/mountain_car.net", "rb"))
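If you train a network of your own, you can persist it the same way; the file name below is just a placeholder.
In [ ]:
# a sketch of saving your own trained network for later reuse (hypothetical path)
pickle.dump(net, open("results/mountain_car/my_trained.net", "wb"))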
We use a simple simulator function to evaluate the DQN:
In [15]:
def car_sim(nsteps, simulator, policy, verbose=False):
    # re-initialize the model
    simulator.reset_episode()
    rtot = 0.0
    xpos = np.zeros(nsteps)
    vel = np.zeros(nsteps)
    # run the simulation
    input_state = np.zeros((1, 2), dtype=np.float32)
    for i in xrange(nsteps):
        state = simulator.get_screenshot()
        input_state[0] = state
        a = policy.action((input_state, None))
        simulator.act(a)
        r = simulator.reward()
        rtot += r
        xpos[i], vel[i] = state
        if simulator.episode_over():
            break
    return rtot, xpos, vel
We'll run a quick simulation and collect some statistics.
In [18]:
backend = ChainerBackend(settings)
backend.set_net(net)
learner = DQNLearner(settings, backend)
policy = DQNPolicy(learner)
r, xtrace, vtrace = car_sim(300, simulator, policy, verbose=True)
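As a quick check before plotting, you can print the total reward collected during the rollout:
In [ ]:
print("Total reward collected over the rollout: %.1f" % r)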
Now let's plot the positions and velocities. Note that the car reached the terminal state at around 200 time-steps.
In [20]:
pyplot.plot(xtrace, label='Position'); pyplot.plot(10.0*vtrace, label='Velocity') # velocity scaled by 10 so it is visible on the same axis
pyplot.legend(loc=4)
pyplot.show()
If you would like to see more examples of using Chimp, check out the examples folder. If you want to see more simulator implementations, check out chimp/simulators.