While RL algorithms require a reward signal to be given to the agent at every timestep, ES algorithms only care about the final cumulative reward that an agent collects at the end of its rollout in an environment. In many problems we only know the outcome at the end of the task, such as whether the agent wins or loses, whether the robot arm picks up the object, or whether the agent has survived. These are the problems where ES may have an advantage over traditional RL.[1]
In [5]:
import gym
In [3]:
# taken from [1]
def rollout(agent, env):
    # run one full episode and return the total (episodic) reward
    obs = env.reset()
    done = False
    total_reward = 0
    while not done:
        a = agent.get_action(obs)
        obs, reward, done, info = env.step(a)  # gym's step returns a 4-tuple
        total_reward += reward
    return total_reward
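The rollout function above only assumes that the agent exposes a get_action method. As a minimal sketch of that interface (the LinearAgent class below is a hypothetical stand-in, not taken from [1]), a simple linear policy over discrete actions could look like:
In [ ]:
import numpy as np

# Hypothetical agent: a linear policy parameterized by a weight matrix W.
# rollout above only relies on the get_action(obs) interface.
class LinearAgent:
    def __init__(self, obs_dim, n_actions):
        self.W = np.random.randn(obs_dim, n_actions)

    def get_action(self, obs):
        # score each discrete action linearly and pick the best one
        return int(np.argmax(obs @ self.W))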
In [6]:
env = gym.make('CartPole-v1')  # swapped in a registered env so the code runs; 'worlddomination-v0' is not in gym's registry
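With an environment and a rollout function in hand, an ES outer loop treats the episodic return as the fitness of a parameter vector. The sketch below is an illustrative simple Gaussian-perturbation ES, in the spirit of [1], but the population size, noise scale sigma, and step size alpha are arbitrary choices, not values from the source:
In [ ]:
import numpy as np

def es_step(agent, env, pop_size=50, sigma=0.1, alpha=0.01):
    # one generation of a simple evolution strategy:
    # the fitness of each perturbed parameter vector is the total reward of one rollout
    theta = agent.W.copy()
    noise = np.random.randn(pop_size, *theta.shape)
    fitness = np.zeros(pop_size)
    for i in range(pop_size):
        agent.W = theta + sigma * noise[i]   # perturb the parameters
        fitness[i] = rollout(agent, env)     # ES only sees the final cumulative reward
    # normalize fitness and move the parameters toward the better perturbations
    advantage = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    agent.W = theta + alpha / (pop_size * sigma) * np.einsum('i,ijk->jk', advantage, noise)
    return fitness.mean()

agent = LinearAgent(env.observation_space.shape[0], env.action_space.n)
for generation in range(10):
    print(generation, es_step(agent, env))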