Simple RL

Welcome! Here we'll showcase some basic examples of typical RL programming tasks.

Example 1: Grid World

First, we'll grab our relevant imports: some agents, an MDP, and a function to facilitate running experiments and plotting:


In [1]:
# Add simple_rl to system path.
import os
import sys
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.insert(0, parent_dir)

from simple_rl.agents import QLearningAgent, RandomAgent
from simple_rl.tasks import GridWorldMDP
from simple_rl.run_experiments import run_agents_on_mdp

Next, we make an MDP and a few agents:


In [2]:
# Setup MDP.
mdp = GridWorldMDP(width=6, height=6, init_loc=(1,1), goal_locs=[(6,6)])

# Setup Agents.
ql_agent = QLearningAgent(actions=mdp.get_actions()) 
rand_agent = RandomAgent(actions=mdp.get_actions())

The real meat of simple_rl is its set of experiment-running functions. The first of these takes a list of agents and an MDP and simulates their interaction:


In [3]:
# Run experiment and make plot.
run_agents_on_mdp([ql_agent, rand_agent], mdp, instances=5, episodes=100, steps=40, reset_at_terminal=True, verbose=False)


Running experiment: 
(MDP)
	gridworld_h-6_w-6
(Agents)
	qlearner
	random
(Params)
	instances : 5
	episodes : 100
	steps : 40

qlearner is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.

random is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.


--- TIMES ---
qlearner agent took 0.89 seconds.
random agent took 0.35 seconds.
-------------

We can throw R-Max, introduced by [Brafman and Tennenholtz, 2002], into the mix, too:


In [43]:
from simple_rl.agents import RMaxAgent

rmax_agent = RMaxAgent(actions=mdp.get_actions(), horizon=3, s_a_threshold=1)

# Run experiment and make plot.
run_agents_on_mdp([rmax_agent, ql_agent, rand_agent], mdp, instances=5, episodes=100, steps=20, reset_at_terminal=True, verbose=False)


Running experiment: 
(MDP)
	gridworld_h-6_w-6
(Agents)
	rmax-h3
	qlearner
	random
(Params)
	instances : 5
	episodes : 100
	steps : 20

rmax-h3 is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.

qlearner is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.

random is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.


--- TIMES ---
random agent took 0.27 seconds.
rmax-h3 agent took 70.88 seconds.
qlearner agent took 0.75 seconds.
-------------

Each experiment we run generates an Experiment object, which handles recording results, writing the relevant files, and plotting. When run_agents_on_mdp is called, a results directory is created containing the experiment data. There should be a subdirectory in results named after the MDP you ran experiments on -- this is where the plot, per-agent results, and the parameters.txt file are stored.
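
As a quick sanity check, you can peek at that directory after a run. Here's a minimal sketch, assuming the experiment above was run from the current working directory (the listing comment is illustrative, not a guaranteed set of files):


In [ ]:
import os

# Hypothetical peek at the results directory created by run_agents_on_mdp.
# The subdirectory name matches the MDP's name (here, the 6x6 grid world).
results_dir = os.path.join("results", "gridworld_h-6_w-6")
if os.path.isdir(results_dir):
    print(sorted(os.listdir(results_dir)))  # e.g. agent result files, the plot, parameters.txt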

All of the above code is contained in the simple_example.py file.

Example 2: Visuals (requires pygame)

First, let's make a FourRoomMDP from [Sutton, Precup, Singh 1999], which is more visually interesting than a basic grid world.


In [ ]:
from simple_rl.tasks import FourRoomMDP
four_room_mdp = FourRoomMDP(9, 9, goal_locs=[(9, 9)], gamma=0.95)

# Visualize the value function.
four_room_mdp.visualize_value()

Or we can visualize a policy:
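
Here's a minimal sketch that plans with simple_rl's ValueIteration and then renders the resulting policy; this mirrors what examples/viz_example.py does, but treat the exact method names (run_vi, visualize_policy) as assumptions and check that file if they've changed:


In [ ]:
from simple_rl.planning import ValueIteration

# Plan on the Four Rooms MDP, then render the greedy policy (requires pygame).
value_iter = ValueIteration(four_room_mdp)
value_iter.run_vi()
four_room_mdp.visualize_policy(value_iter.policy)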

Both of these are in examples/viz_example.py. If you need pygame in Anaconda, give this a shot:

> conda install -c cogsci pygame

If you get an SDL font-related error on Mac/Linux, try installing SDL and SDL_ttf via Homebrew:

> brew install sdl sdl_ttf

We can also make grid worlds with a text file. For instance, we can construct the grid problem from [Barto and Pickett 2002] by making a text file:

--w-----w---w----g
--------w---------
--w-----w---w-----
--w-----w---w-----
wwwww-wwwwwwwww-ww
---w----w----w----
---w---------w----
--------w---------
wwwwwwwww---------
w-------wwwwwww-ww
--w-----w---w-----
--------w---------
--w---------w-----
--w-----w---w-----
wwwww-wwwwwwwww-ww
---w-----w---w----
---w-----w---w----
a--------w--------

Then, we make a grid world out of it:


In [30]:
from simple_rl.tasks.grid_world import GridWorldMDPClass

pblocks_mdp = GridWorldMDPClass.make_grid_world_from_file("pblocks_grid.txt", randomize=False)
pblocks_mdp.visualize_value()


Press anything to quit.

This produces a pygame rendering of the value function over the grid.

Example 3: OOMDPs, Taxi

There's also a Taxi MDP, which is built on top of an Object-Oriented MDP (OOMDP) abstract class from [Diuk, Cohen, Littman 2008].


In [4]:
from simple_rl.tasks import TaxiOOMDP
from simple_rl.run_experiments import run_agents_on_mdp
from simple_rl.agents import QLearningAgent, RandomAgent

# Taxi initial state attributes.
agent = {"x":1, "y":1, "has_passenger":0}
passengers = [{"x":3, "y":2, "dest_x":2, "dest_y":3, "in_taxi":0}]
taxi_mdp = TaxiOOMDP(width=4, height=4, agent=agent, walls=[], passengers=passengers)

# Make agents.
ql_agent = QLearningAgent(actions=taxi_mdp.get_actions()) 
rand_agent = RandomAgent(actions=taxi_mdp.get_actions())

Above, we specify the objects of the OOMDP and their attributes. Now, just as before, we can let some agents interact with the MDP:


In [5]:
# Run experiment and make plot.
run_agents_on_mdp([ql_agent, rand_agent], taxi_mdp, instances=5, episodes=100, steps=150, reset_at_terminal=True)


Running experiment: 
(MDP)
	taxi_h-4_w-4
(Agents)
	qlearner
	random
(Params)
	instances : 5
	episodes : 100
	steps : 150

qlearner is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.

random is learning.
  Instance 1 of 5.
  Instance 2 of 5.
  Instance 3 of 5.
  Instance 4 of 5.
  Instance 5 of 5.


--- TIMES ---
random agent took 7.68 seconds.
qlearner agent took 15.32 seconds.
-------------

More on OOMDPs can be found in examples/oomdp_example.py.
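
If you want to poke at the OOMDP state directly, here is a minimal sketch of inspecting the objects and attributes we specified above; the accessors (get_init_state, get_objects_of_class, get_attribute) are assumptions based on the OOMDP classes, so check simple_rl/mdp/oomdp if they differ:


In [ ]:
# Hypothetical inspection of the taxi OOMDP's initial state.
# Accessor names are assumptions; see the OOMDP state/object classes in simple_rl.
init_state = taxi_mdp.get_init_state()
for passenger in init_state.get_objects_of_class("passenger"):
    print(passenger.get_attribute("x"), passenger.get_attribute("y"))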

Example 4: Markov Games

I've added a few Markov games, including Rock-Paper-Scissors, grid games, and the Prisoner's Dilemma. Just as before, there's a function that simulates learning and makes a plot:


In [75]:
from simple_rl.run_experiments import play_markov_game
from simple_rl.agents import QLearningAgent, FixedPolicyAgent
from simple_rl.tasks import RockPaperScissorsMDP

import random

# Setup MDP, Agents.
markov_game = RockPaperScissorsMDP()
ql_agent = QLearningAgent(actions=markov_game.get_actions(), epsilon=0.2) 
fixed_action = random.choice(markov_game.get_actions())
fixed_agent = FixedPolicyAgent(policy=lambda s:fixed_action)

# Run experiment and make plot.
play_markov_game([ql_agent, fixed_agent], markov_game, instances=10, episodes=1, steps=10)


Running experiment: 
(Markov Game MDP)
	rock_paper_scissors
(Agents)
	fixed-policy
	qlearner
(Params)
	instances : 10

	Instance 1 of 10.
	Instance 2 of 10.
	Instance 3 of 10.
	Instance 4 of 10.
	Instance 5 of 10.
	Instance 6 of 10.
	Instance 7 of 10.
	Instance 8 of 10.
	Instance 9 of 10.
	Instance 10 of 10.
Experiment took 0.03 seconds.

Example 5: Gym MDP

Recently I added support for making OpenAI Gym MDPs. It's again only a few lines of code:


In [53]:
from simple_rl.tasks import GymMDP
from simple_rl.agents import LinearQLearningAgent, RandomAgent
from simple_rl.run_experiments import run_agents_on_mdp

# Gym MDP.
gym_mdp = GymMDP(env_name='CartPole-v0', render=False) # If render is true, visualizes interactions.
num_feats = gym_mdp.get_num_state_feats()

# Setup agents and run.
lin_agent = LinearQLearningAgent(gym_mdp.get_actions(), num_features=num_feats, alpha=0.2, epsilon=0.4, rbf=True)

run_agents_on_mdp([lin_agent], gym_mdp, instances=3, episodes=1, steps=50)


[2017-08-07 12:36:53,546] Making new env: CartPole-v0
Running experiment: 
(MDP)
	gym-CartPole-v0
(Agents)
	ql-linear-rbf
(Params)
	instances : 3
	episodes : 1
	steps : 50

ql-linear-rbf is learning.
  Instance 1 of 3.
  Instance 2 of 3.
  Instance 3 of 3.


--- TIMES ---
ql-linear-rbf agent took 0.04 seconds.
-------------

