Remember the idea behind Evolution Strategies? Here's a neat blog post about 'em.
Can you reproduce their success? To find out, you will have to implement evolution strategies and see how they work.
This project is optional; it has several milestones, each worth a number of points [and swag].
Milestones:
Rules:
It would be very convenient later if you implemented a function that takes policy weights, generates a session, and returns the suggested policy changes, so that you could then run a bunch of them in parallel.
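Here is a minimal sketch of what such a function could look like, assuming the policy is stored as a flat numpy vector of weights and that evaluate(weights) is your own (hypothetical) helper that plays one session and returns the total reward:
In [ ]:
import numpy as np

def sample_policy_change(weights, evaluate, sigma=0.05):
    """Perturb the policy, play one session, return the reward-weighted noise.

    Averaging many of these values over workers gives the usual ES
    gradient estimate (up to scaling by sigma and the number of samples).
    `evaluate` is a hypothetical helper you write yourself.
    """
    noise = np.random.normal(size=weights.shape)
    reward = evaluate(weights + sigma * noise)
    return reward * noise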
The simplest way to do multiprocessing is to use joblib.
When using joblib, make sure random variables are independent in each job: simply add np.random.seed() at the beginning of your "job" function.
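A rough sketch of how this could look with joblib, reusing the hypothetical sample_policy_change and evaluate helpers from the sketch above:
In [ ]:
from joblib import Parallel, delayed
import numpy as np

def job(weights):
    np.random.seed()  # re-seed so every process draws independent noise
    return sample_policy_change(weights, evaluate)  # helpers from the sketch above

weights = np.zeros(100)  # placeholder for your current flat policy weights

# play 8 noisy sessions in parallel processes and average the suggested changes
changes = Parallel(n_jobs=-1)(delayed(job)(weights) for _ in range(8))
update = np.mean(changes, axis=0)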
Later, once you go distributed, you may need a storage that gathers gradients from all workers. In that case we recommend Redis due to its simplicity.
Here's a speed-optimized saver/loader to store numpy arrays in Redis as strings.
In [ ]:
import joblib
from io import BytesIO


def dumps(data):
    """Converts whatever to a string of bytes."""
    s = BytesIO()
    joblib.dump(data, s)
    return s.getvalue()


def loads(string):
    """Converts a string back to whatever was dumps'ed into it."""
    return joblib.load(BytesIO(string))
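A rough sketch of how these helpers could be used with Redis, assuming a local Redis server, the redis python package (pip install redis), and the update vector from the joblib sketch above:
In [ ]:
import redis

db = redis.Redis(host="localhost", port=6379)

# a worker pushes its policy change onto a shared list ...
db.rpush("policy_changes", dumps(update))

# ... and the master pops and decodes everything gathered so far
gathered = [loads(db.lpop("policy_changes")) for _ in range(db.llen("policy_changes"))]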
You will also need to pip install Image and pip install gym[atari]. May the force be with you!
In [ ]:
from pong import make_pong
import numpy as np
env = make_pong()
print(env.action_space)
In [ ]:
# get the initial state
s = env.reset()
print(s.shape)
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
# plot first observation. Only one frame
plt.imshow(s.swapaxes(1, 2).reshape(-1, s.shape[-1]).T)
In [ ]:
# next frame
new_s, r, done, _ = env.step(env.action_space.sample())
plt.imshow(new_s.swapaxes(1, 2).reshape(-1, s.shape[-1]).T)
In [ ]:
# after 10 frames
for _ in range(10):
    new_s, r, done, _ = env.step(env.action_space.sample())

plt.imshow(new_s.swapaxes(1, 2).reshape(-1, s.shape[-1]).T, vmin=0)
In [ ]:
<YOUR CODE: tons of it here or elsewhere>