Ben Mann
2017.09
Model-based learning
DQN learns CartPole much faster when the input is [position, velocity, angle, angular velocity] than when it is pixels.
Can we use an unsupervised generative model to collapse a high-dimensional state representation and thereby speed up model-free learning? Evaluate on classic control problems and basic Atari games.
[5] uses a deep VAE to learn a low-dimensional representation of classic control problems, but doesn't use reinforcement learning on top of it.
[7] and [9] build on Atari next video frame prediction [6] but fail to beat DQN performance.
[8] merges model-based and model-free techniques but doesn't report Atari success.
[2] combines model-based and model-free techniques for data efficiency, but it doesn't operate on pixels, and in MuJoCo the state and action should perfectly predict the next state, unlike Atari, where other agents can act unpredictably.
So the big difference here is that we're aiming to beat DQN at data efficiency on Atari using an approximation of f(s, a) -> s'. We start with CartPole to validate the approach and move on to Pong.
- CartPole, MountainCar, Pendulum
- Pong
  - Is a dense representation learned from full-resolution color better than one learned from a downsample?
  - Can we learn faster than from pixels?
- Breakout (?)
- Montezuma (stretch goal)
- Train an autoencoder to reproduce frames (a/b split here)
- Use the dense autoencoder representation to preprocess frames and train a model
Originally we thought that to train a good environment model, we should start collecting data with a pre-trained agent, then gradually degrade its performance to random by substituting more and more random actions for the agent's suggestions. In a real-world task, you could imagine using human demonstrations to seed the world model instead.
For CartPole, completely random actions explore the state space well enough, so we never bothered.
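For reference, a minimal sketch of that degradation scheme (never used here; agent is a hypothetical pre-trained policy with an act() method, and the annealing schedule is illustrative):
import numpy as np

# Sketch only: take a random action with probability p_random, which anneals
# from 0 (fully on-policy) to 1 (fully random) over the collection run.
def collect_episode(env, agent, p_random, max_frames=1000):
    observation = env.reset()
    trajectory = []
    for _ in range(max_frames):
        if np.random.rand() < p_random:
            action = env.action_space.sample()
        else:
            action = agent.act(observation)
        observation, reward, done, _ = env.step(action)
        trajectory.append((observation, action, reward))
        if done:
            break
    return trajectory

# e.g. for i in range(n_episodes): collect_episode(env, agent, i / float(n_episodes))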
In [1]:
import time
import numpy as np
import gym
from tqdm import tqdm_notebook
from skimage.transform import resize
%load_ext autoreload
%autoreload 2
env = gym.make('CartPole-v0')
In [2]:
def downsample(im):
    # Grayscale, then shrink 4x per side (integer shape for skimage's resize)
    return np.uint8(resize(np.mean(im, axis=-1),
                           (im.shape[0] // 4, im.shape[1] // 4), mode='edge'))
print(env.action_space, env.observation_space, env.observation_space.high.shape)
frames, rewards, actions, observations = [], [], [], []
n_frames = 0
# CartPole ends at 200, but useful for other envs?
MAX_FRAMES_PER_EPISODE = 1000
# ~2GB of data for CartPole
FRAMES_TO_COLLECT = 66000
t = time.time()
with tqdm_notebook(total=FRAMES_TO_COLLECT) as pbar:
    while n_frames < FRAMES_TO_COLLECT:
        observation = env.reset()
        fs = []
        rs = []
        as_ = []
        os = []
        for _ in range(MAX_FRAMES_PER_EPISODE):
            fs.append(downsample(env.render(mode='rgb_array')))
            action = env.action_space.sample()
            observation, reward, done, _ = env.step(action)
            as_.append(action)
            rs.append(reward)
            os.append(observation)
            n_frames += 1
            if done:
                frames.append(fs)
                rewards.append(rs)
                actions.append(as_)
                observations.append(os)
                pbar.update(len(fs))
                break
In [3]:
# Save the data
env_name = 'cartpole'
np.save('%s_frames' % env_name, frames)
np.save('%s_rewards' % env_name, rewards)
np.save('%s_observations' % env_name, observations)
np.save('%s_actions' % env_name, actions)
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
all_frames = np.vstack([np.stack(x) for x in frames])
print('Shape of stacked frames', all_frames.shape)
combined = np.min(all_frames, axis=0)
plt.imshow(combined, cmap='gray', vmin=0, vmax=255)
print('Range of values', np.min(combined), 'to', np.max(combined))
def bbox(img):
    '''Returns y_min, y_max, x_min, x_max
    https://stackoverflow.com/a/31402351/614529
    '''
    a = np.where(img != 255)
    bounds = np.min(a[0]), np.max(a[0]), np.min(a[1]), np.max(a[1])
    return bounds
# This would be useful if we wanted to reduce input dimensionality
# by cropping away the extra whitespace.
print('Bounding box', bbox(combined))
Next, we use this collected data to train an autoencoder that takes a stack of frames and actions as inputs and outputs the next frame. This architecture requires no domain-specific information, though for games like Atari the framestack needs to be 4 frames deep to still pick up dynamics despite flickering.
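The windowing that produces those (frame stack, action, next frame) training examples lives in autoencoder.py; a minimal sketch of the idea, with hypothetical names, looks like this:
import numpy as np

# Sketch of the windowing step (the real versions are load_data and
# eps_to_stacked_window in autoencoder.py; names and details here are illustrative).
def make_windows(episode_frames, episode_actions, window=3):
    stacks, acts, nexts = [], [], []
    for frames, actions in zip(episode_frames, episode_actions):
        for i in range(len(frames) - window):
            stacks.append(np.stack(frames[i:i + window]))  # WINDOW consecutive frames
            acts.append(actions[i + window - 1])           # action taken on the last one
            nexts.append(frames[i + window])               # frame that followed
    return np.array(stacks), np.array(acts), np.array(nexts)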
For a simple autoencoder we started with https://blog.keras.io/building-autoencoders-in-keras.html and modified it to our purposes. The key thing to note: train with fit_generator, else even my 64GB RAM machine OOMs.
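make_model itself isn't reproduced in this notebook; here's a rough sketch of its shape. Layer sizes are illustrative, but the two inputs, the 4-unit tanh bottleneck named 'bottleneck', and the next-frame output match how the model is used in the cells below:
from keras.layers import Input, Permute, Reshape, Conv2D, Flatten, Dense, concatenate
from keras.models import Model

def make_model_sketch(window=3, h=100, w=150, n_actions=2, n_latent=4):
    frames_in = Input(shape=(window, h, w, 1))   # stack of preprocessed frames
    action_in = Input(shape=(n_actions,))        # one-hot action
    x = Permute((2, 3, 1, 4))(frames_in)         # -> (h, w, window, 1)
    x = Reshape((h, w, window))(x)               # fold the window into channels
    x = Conv2D(16, (3, 3), strides=2, activation='relu')(x)
    x = Conv2D(16, (3, 3), strides=2, activation='relu')(x)
    x = Flatten()(x)
    x = concatenate([x, action_in])
    # The notebook pulls this layer out by name to build encoder_model.
    code = Dense(n_latent, activation='tanh', name='bottleneck')(x)
    x = Dense(h * w, activation='sigmoid')(code)
    next_frame = Reshape((h, w, 1))(x)
    model = Model([frames_in, action_in], next_frame)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model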
In [5]:
# See autoencoder.py for the meat here.
from autoencoder import load_data, make_model, train
_, windowed_frames, windowed_frames_next, windowed_actions = load_data(window=3)
In [6]:
model = make_model(windowed_frames)
train(model, windowed_frames, windowed_frames_next, windowed_actions)
In [7]:
# Load the model from disk in case we want to start from here without training.
from keras.models import load_model, Model
model = load_model('autoencoder.h5')
encoder_model = Model(model.input, model.get_layer('bottleneck').output, name='encoder')
print(model.input_shape, model.output_shape, encoder_model.output_shape)
Let's verify that the model picked up the properties we expect.
When I initially ran this data exploration, the scatter chart looked pretty bad and the imagination rollouts were nearly the same regardless of the input actions. After a lot of experimentation, it got much better.
In [8]:
WINDOW = 3
def one_hot(x, n_classes):
    size = x.shape[0]
    one_hot = np.zeros((size, n_classes))
    one_hot[range(size), x] = 1
    return one_hot
one_hot_actions = one_hot(windowed_actions, 2)
print(one_hot_actions.shape)
First let's see if the encoded representation looks at all similar to the actual CartPole data (position, velocity, angle, angular velocity). Visually, there seems to be some correspondence between the top two charts. The delta between the two charts on the right is hard to see, so take a look at the bottom left chart. Values range between -0.3 and +0.3, which seems reasonable given the tanh nonlinearity.
In [9]:
N_FRAMES = 100
from autoencoder import preprocess, eps_to_stacked_window
windowed_observations = eps_to_stacked_window(observations, window=WINDOW, offset=True)
left_pred = encoder_model.predict([preprocess(windowed_frames[:N_FRAMES]), one_hot(np.zeros(N_FRAMES, dtype=np.uint8), 2)])
right_pred = encoder_model.predict([preprocess(windowed_frames[:N_FRAMES]), one_hot(np.ones(N_FRAMES, dtype=np.uint8), 2)])
pred = encoder_model.predict([preprocess(windowed_frames[:N_FRAMES]), one_hot_actions[:N_FRAMES]])
plt.figure(figsize=(18, 12))
plt.subplot(2,2,1)
plt.title('Classic cartpole observation of next frame')
plt.plot(windowed_observations[:N_FRAMES, WINDOW-1, :])
plt.subplot(2,2,2)
plt.title('Encoded representation for action taken')
plt.plot(pred)
plt.subplot(2,2,4)
plt.plot(left_pred)
plt.title('Encoded representation for LEFT action')
plt.subplot(2,2,3)
plt.title('Delta between left and right actions')
plt.plot(left_pred - right_pred)
plt.show()
Next, we hope that at least one of the encoded channels picked up each of the real variables, even though the model knows nothing about them. What we're looking for is a correlation coefficient close to 1 or -1 in every row. The position row reaches 0.92, the angle row 0.77, the velocity row 0.47, and the angular velocity row 0.41. So the encoder had a harder time learning the dynamic channels than the static ones.
In [10]:
plt.figure(figsize=(18, 12))
row_labels = ['position', 'velocity', 'angle', 'angular\nvelocity']
for i in range(4):
    for j in range(4):
        ax = plt.subplot(4, 4, i * 4 + j + 1)
        x, y = windowed_observations[:N_FRAMES, WINDOW-1, i], pred[:N_FRAMES, j]
        if j == 0:
            ax.set_ylabel(row_labels[i], rotation=0, labelpad=30, size='large')
        plt.title('r: %.2f' % np.corrcoef(x, y)[0, 1])
        plt.scatter(x, y)
Our model takes a stack of frames and an action and outputs the predicted next frame. An imagination rollout takes an initial condition and feeds the model its own predictions back as input for N steps (15 in this case).
We're looking for the cart to move left when we give it a stream of left actions, and right when we give it a stream of right actions. When I first ran this, it moved left no matter what actions it was given. You can see that one moves slower because initially the pole was tipping the other way. Seems reasonable!
In [11]:
from array2gif import write_gif
from IPython.display import Image
def unprocess(frames):
    '''Invert, scale, remove extra dims'''
    return (1 - np.squeeze(frames)) * 255

def to_rgb(im, scale=None):
    '''Convert a grayscale image into the format array2gif expects.'''
    if scale:
        im = np.stack([resize(x, (im.shape[1] * scale, im.shape[2] * scale), mode='edge')
                       for x in im])
    return [np.stack((x,) * 3).astype(np.uint8) for x in im]

def rollout(action):
    frames = list(preprocess(windowed_frames[0]))
    # How many steps to roll out
    N_STEPS = 15
    for _ in range(N_STEPS):
        x = np.expand_dims(np.array(frames[-WINDOW:]), axis=0)
        pred = model.predict([x, np.array(action)])
        # Put the extra dimension on the end
        frames.append(np.expand_dims(np.squeeze(pred), axis=-1))
    return np.array(frames)

def display(frames, name):
    write_gif(to_rgb(frames, scale=2), filename=name, fps=20)
    return Image(filename=name)
left = rollout([[0,1]])
right = rollout([[1,0]])
display(unprocess(np.concatenate([left, right], axis=1)), 'test.gif')
Out[11]: [animated GIF: imagination rollouts for the all-LEFT and all-RIGHT action streams, side by side]
In [1]:
# Must restart kernel for this to not OOM.
# Apparently there's no other way to release GPU resources :(
%env OPENAI_LOGDIR=/tmp/logs
# Baseline CartPole
!python run_cartpole.py --policy=fc --nstack=1
In [ ]:
# Use only position and angle to solve
!python run_cartpole.py --policy=fc --nstack=2 --nsteps=20 --use_static_wrapper
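run_cartpole.py isn't shown in this notebook. What --use_static_wrapper presumably does, sketched here with hypothetical names, is hide the velocity components so the agent sees only position and angle; nstack=2 then lets it recover the dynamics from consecutive observations:
import gym

# Hypothetical sketch; the real wrapper lives in run_cartpole.py.
class StaticCartPole(object):
    '''Expose only [position, angle] from CartPole's
    [position, velocity, angle, angular velocity] observation.'''
    def __init__(self, env):
        self.env = env
        self.action_space = env.action_space

    def reset(self):
        return self.env.reset()[[0, 2]]

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        return observation[[0, 2]], reward, done, info

env = StaticCartPole(gym.make('CartPole-v0'))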
In [ ]:
# Solve from pixels
!python run_cartpole.py --policy=cnn --nstack=2 --nsteps=20
In [ ]:
# Use imagination to solve from pixels
!python run_cartpole.py --policy=fc --nstack=2 --nsteps=20 --use_encoded_imagination
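Again the wiring is inside run_cartpole.py, but conceptually --use_encoded_imagination replaces the pixel observation with the frozen encoder's bottleneck activations. A hypothetical sketch of that preprocessing step:
from keras.models import load_model, Model

# Hypothetical sketch; the actual hookup is in run_cartpole.py.
# Reuses the frozen encoder built in the cells above.
model = load_model('autoencoder.h5')
encoder = Model(model.input, model.get_layer('bottleneck').output)

def encode_observation(frame_window, one_hot_action):
    '''Map a (WINDOW, h, w, 1) stack of preprocessed frames plus a one-hot
    action to the low-dimensional code the policy is trained on.'''
    return encoder.predict([frame_window[None], one_hot_action[None]])[0]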
In [32]:
# Render learning graphs
import glob
import tensorflow as tf
files = sorted(glob.glob('/tmp/logs/tb/**/events*', recursive=True), reverse=True)
def read_tensorboard(path_to_events_file, tag):
    """This example supposes that the events file contains summaries with a
    summary value tag. These could have been added by calling `add_summary()`,
    passing the output of a scalar summary op created with
    `tf.scalar_summary(['loss'], loss_tensor)`.
    """
    for e in tf.train.summary_iterator(path_to_events_file):
        for v in e.summary.value:
            if v.tag == tag:
                yield v.simple_value
import matplotlib.pyplot as plt
%matplotlib inline
labels = ['classic', 'static', 'pixels', 'imagination']
plt.figure(figsize=(12, 6))
for i in range(min(len(files), 6)):
    data = list(read_tensorboard(files[i], 'mean_episode_length'))
    # Skip runs that never reached CartPole's solved threshold of 195
    if data[-1] < 195:
        continue
    plt.plot(data, label=labels.pop())
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()