Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull its contents into the gym directory.


In [2]:
# Create the game environment
env = gym.make('MountainCar-v0')


[2017-05-02 17:29:25,338] Making new env: MountainCar-v0

We interact with the simulation through env. To show the simulation running, call env.render() to render one frame. Passing an action (an integer) to env.step advances the simulation by one step. env.action_space shows how many actions are possible, and env.action_space.sample() returns a random action. This interface is common to all Gym games. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(1000):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

In [4]:
env.action_space


Out[4]:
Discrete(3)

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [5]:
print(rewards[-20:])


[]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'}{Q(s', a')} $$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor for future rewards, and $s'$ is the next state reached from state $s$ by taking action $a$.

Previously, we used this equation to learn values for a Q-table. For this game, however, there is a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
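
As a quick worked example with made-up numbers (not taken from the simulation), suppose the agent receives $r = 1$, the discount is $\gamma = 0.99$, and the network currently estimates $Q(s', 0) = 0.5$ and $Q(s', 1) = 2.0$ for the next state. If the current estimate for the action taken is $Q(s, a) = 2.5$, then:

$$ \hat{Q}(s,a) = 1 + 0.99 \times \max(0.5,\ 2.0) = 1 + 1.98 = 2.98 $$

$$ (\hat{Q}(s,a) - Q(s,a))^2 = (2.98 - 2.5)^2 = 0.48^2 = 0.2304 $$

The optimizer would then nudge the weights so that $Q(s, a)$ moves from 2.5 toward 2.98.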

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seem to be good enough; three might be better. Feel free to try it out.


In [6]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=2, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
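
To make the one-hot selection step in the network a little more concrete, here is a minimal plain-Python sketch of the same idea. The helper names (one_hot, select_q) and the Q-values are made up for illustration, and it uses lists instead of tensors so it runs without TensorFlow.

```python
# Illustrative only: a plain-Python version of the one-hot masking used to
# pick out the Q-value of the action that was actually taken.

def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def select_q(outputs, actions):
    """For each row of Q-values, keep only the value of the chosen action."""
    depth = len(outputs[0])
    return [
        sum(q * m for q, m in zip(row, one_hot(action, depth)))
        for row, action in zip(outputs, actions)
    ]


# Made-up Q-values for a batch of three states and two actions
outputs = [[0.2, 1.5], [3.0, -1.0], [0.0, 0.7]]
actions = [1, 0, 1]

selected = select_q(outputs, actions)
print(selected)  # [1.5, 3.0, 0.7]
```

Note that in this sketch an action index outside range(depth) produces an all-zero mask, so the selected value is 0. The number of network outputs should therefore match the number of actions in the environment.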

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [7]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
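
If the eviction behavior of a bounded deque is new to you, here is a small sketch using only the standard library. The string "experiences" are placeholders, and random.sample stands in for the NumPy-based sampling in the Memory class.

```python
from collections import deque
import random

# A buffer with room for three experiences
buffer = deque(maxlen=3)

for experience in ["e1", "e2", "e3", "e4", "e5"]:
    # Appending to a full deque silently drops the oldest item
    buffer.append(experience)

print(list(buffer))  # ['e3', 'e4', 'e5']

# Draw a mini-batch of two distinct experiences, without replacement
batch = random.sample(list(buffer), 2)
print(batch)
```

Sampling without replacement needs at least as many stored experiences as the batch size, which is why the memory gets pre-populated before training starts.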

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
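
Here is a minimal sketch of an $\epsilon$-greedy policy with a decaying exploration probability. The decay has the same exponential form the training cell uses further down. The function names and default values are illustrative assumptions, and the Q-values are passed in as a plain list rather than coming from a network.

```python
import math
import random


def exploration_probability(step, explore_start=1.0, explore_stop=0.01,
                            decay_rate=0.0001):
    """Decay epsilon exponentially from explore_start toward explore_stop."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Take a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon shrinks as the total number of training steps grows
for step in (0, 10000, 50000):
    print(step, round(exploration_probability(step), 4))

# With epsilon = 0 the policy is purely greedy
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```

Early in training $\epsilon$ is close to 1, so almost every action is random. As the step count grows, $\epsilon$ approaches its floor and the agent mostly follows its Q-values.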

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
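
The target-setting step in the loop above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: each transition carries an explicit done flag (the notebook's code marks terminal transitions with an all-zero next state instead), states are single numbers, and toy_q is a hypothetical stand-in for the Q-network.

```python
def td_targets(batch, q_fn, gamma=0.99):
    """Compute a target for each (state, action, reward, next_state, done)."""
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            # Episode ended: there is no future reward to bootstrap from
            targets.append(reward)
        else:
            # Reward plus the discounted best Q-value of the next state
            targets.append(reward + gamma * max(q_fn(next_state)))
    return targets


def toy_q(state):
    """Hypothetical stand-in for the network: two Q-values per state."""
    return [0.5 * state, 1.0 * state]


batch = [
    (1.0, 0, 1.0, 2.0, False),  # ordinary transition
    (2.0, 1, 1.0, 0.0, True),   # terminal transition
]

print(td_targets(batch, toy_q, gamma=0.5))  # [2.0, 1.0]
```

For the first transition the target is $1.0 + 0.5 \times \max(1.0, 2.0) = 2.0$. For the terminal one it is just the reward, 1.0.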

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.


In [8]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences used to pre-populate the memory

In [9]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [10]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [12]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            #env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = 0
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: -76.0 Training loss: 0.7905 Explore P: 0.9925
Episode: 2 Total reward: -199.0 Training loss: 0.8030 Explore P: 0.9731
Episode: 3 Total reward: -199.0 Training loss: 1.0011 Explore P: 0.9542
Episode: 4 Total reward: -199.0 Training loss: 1.2440 Explore P: 0.9356
Episode: 5 Total reward: -199.0 Training loss: 5.1232 Explore P: 0.9173
Episode: 6 Total reward: -199.0 Training loss: 14.1443 Explore P: 0.8995
Episode: 7 Total reward: -199.0 Training loss: 57.2329 Explore P: 0.8819
Episode: 8 Total reward: -199.0 Training loss: 33.6128 Explore P: 0.8647
Episode: 9 Total reward: -199.0 Training loss: 94.0645 Explore P: 0.8479
Episode: 10 Total reward: -199.0 Training loss: 60.6322 Explore P: 0.8314
Episode: 11 Total reward: -199.0 Training loss: 222.5308 Explore P: 0.8152
Episode: 12 Total reward: -199.0 Training loss: 378.6078 Explore P: 0.7993
Episode: 13 Total reward: -199.0 Training loss: 554.1976 Explore P: 0.7838
Episode: 14 Total reward: -199.0 Training loss: 316.5712 Explore P: 0.7685
Episode: 15 Total reward: -199.0 Training loss: 803.1554 Explore P: 0.7536
Episode: 16 Total reward: -199.0 Training loss: 289.6391 Explore P: 0.7389
Episode: 17 Total reward: -199.0 Training loss: 538.9788 Explore P: 0.7246
Episode: 18 Total reward: -199.0 Training loss: 1090.0771 Explore P: 0.7105
Episode: 19 Total reward: -199.0 Training loss: 303.7308 Explore P: 0.6967
Episode: 20 Total reward: -199.0 Training loss: 879.3655 Explore P: 0.6832
Episode: 21 Total reward: -199.0 Training loss: 1019.7370 Explore P: 0.6699
Episode: 22 Total reward: -199.0 Training loss: 1219.4546 Explore P: 0.6569
Episode: 23 Total reward: -199.0 Training loss: 1241.9463 Explore P: 0.6442
Episode: 24 Total reward: -199.0 Training loss: 775.4604 Explore P: 0.6317
Episode: 25 Total reward: -199.0 Training loss: 419.3268 Explore P: 0.6194
Episode: 26 Total reward: -199.0 Training loss: 958.2316 Explore P: 0.6074
Episode: 27 Total reward: -199.0 Training loss: 441.5667 Explore P: 0.5956
Episode: 28 Total reward: -199.0 Training loss: 545.5043 Explore P: 0.5841
Episode: 29 Total reward: -199.0 Training loss: 1267.8416 Explore P: 0.5728
Episode: 30 Total reward: -199.0 Training loss: 788.0515 Explore P: 0.5617
Episode: 31 Total reward: -199.0 Training loss: 1012.7548 Explore P: 0.5508
Episode: 32 Total reward: -199.0 Training loss: 434.8347 Explore P: 0.5402
Episode: 33 Total reward: -199.0 Training loss: 1255.2883 Explore P: 0.5297
Episode: 34 Total reward: -199.0 Training loss: 721.1035 Explore P: 0.5195
Episode: 35 Total reward: -199.0 Training loss: 976.3445 Explore P: 0.5094
Episode: 36 Total reward: -199.0 Training loss: 764.6820 Explore P: 0.4996
Episode: 37 Total reward: -199.0 Training loss: 966.5424 Explore P: 0.4900
Episode: 38 Total reward: -199.0 Training loss: 1215.2361 Explore P: 0.4805
Episode: 39 Total reward: -199.0 Training loss: 805.3513 Explore P: 0.4712
Episode: 40 Total reward: -199.0 Training loss: 1428.1522 Explore P: 0.4621
Episode: 41 Total reward: -199.0 Training loss: 1167.5590 Explore P: 0.4532
Episode: 42 Total reward: -199.0 Training loss: 538.0898 Explore P: 0.4445
Episode: 43 Total reward: -199.0 Training loss: 732.2590 Explore P: 0.4359
Episode: 44 Total reward: -199.0 Training loss: 637.7289 Explore P: 0.4276
Episode: 45 Total reward: -199.0 Training loss: 1134.6018 Explore P: 0.4193
Episode: 46 Total reward: -199.0 Training loss: 1494.2922 Explore P: 0.4113
Episode: 47 Total reward: -199.0 Training loss: 612.2845 Explore P: 0.4034
Episode: 48 Total reward: -199.0 Training loss: 1262.4821 Explore P: 0.3956
Episode: 49 Total reward: -199.0 Training loss: 961.2111 Explore P: 0.3880
Episode: 50 Total reward: -199.0 Training loss: 1000.9906 Explore P: 0.3806
Episode: 51 Total reward: -199.0 Training loss: 447.8260 Explore P: 0.3733
Episode: 52 Total reward: -199.0 Training loss: 443.1601 Explore P: 0.3661
Episode: 53 Total reward: -199.0 Training loss: 1320.5049 Explore P: 0.3591
Episode: 54 Total reward: -199.0 Training loss: 897.0618 Explore P: 0.3522
Episode: 55 Total reward: -199.0 Training loss: 420.2852 Explore P: 0.3455
Episode: 56 Total reward: -199.0 Training loss: 1431.0924 Explore P: 0.3389
Episode: 57 Total reward: -199.0 Training loss: 635.1156 Explore P: 0.3324
Episode: 58 Total reward: -199.0 Training loss: 678.0747 Explore P: 0.3260
Episode: 59 Total reward: -199.0 Training loss: 1121.5293 Explore P: 0.3198
Episode: 60 Total reward: -199.0 Training loss: 1121.2013 Explore P: 0.3137
Episode: 61 Total reward: -199.0 Training loss: 1116.9534 Explore P: 0.3077
Episode: 62 Total reward: -199.0 Training loss: 823.8275 Explore P: 0.3018
Episode: 63 Total reward: -199.0 Training loss: 663.0596 Explore P: 0.2961
Episode: 64 Total reward: -199.0 Training loss: 388.7497 Explore P: 0.2905
Episode: 65 Total reward: -199.0 Training loss: 408.5754 Explore P: 0.2849
Episode: 66 Total reward: -199.0 Training loss: 765.1489 Explore P: 0.2795
Episode: 67 Total reward: -199.0 Training loss: 382.2531 Explore P: 0.2742
Episode: 68 Total reward: -199.0 Training loss: 424.6395 Explore P: 0.2690
Episode: 69 Total reward: -199.0 Training loss: 512.5395 Explore P: 0.2639
Episode: 70 Total reward: -199.0 Training loss: 1336.6051 Explore P: 0.2589
Episode: 71 Total reward: -199.0 Training loss: 679.4982 Explore P: 0.2540
Episode: 72 Total reward: -199.0 Training loss: 416.8826 Explore P: 0.2492
Episode: 73 Total reward: -199.0 Training loss: 660.0294 Explore P: 0.2445
Episode: 74 Total reward: -199.0 Training loss: 0.2235 Explore P: 0.2398
Episode: 75 Total reward: -199.0 Training loss: 1129.3567 Explore P: 0.2353
Episode: 76 Total reward: -199.0 Training loss: 210.6980 Explore P: 0.2309
Episode: 77 Total reward: -199.0 Training loss: 549.5284 Explore P: 0.2265
Episode: 78 Total reward: -199.0 Training loss: 191.7019 Explore P: 0.2223
Episode: 79 Total reward: -199.0 Training loss: 395.1107 Explore P: 0.2181
Episode: 80 Total reward: -199.0 Training loss: 525.7328 Explore P: 0.2140
Episode: 81 Total reward: -199.0 Training loss: 539.1795 Explore P: 0.2100
Episode: 82 Total reward: -199.0 Training loss: 503.6017 Explore P: 0.2060
Episode: 83 Total reward: -199.0 Training loss: 370.4332 Explore P: 0.2022
Episode: 84 Total reward: -199.0 Training loss: 744.0920 Explore P: 0.1984
Episode: 85 Total reward: -199.0 Training loss: 756.7369 Explore P: 0.1947
Episode: 86 Total reward: -199.0 Training loss: 218.0375 Explore P: 0.1910
Episode: 87 Total reward: -199.0 Training loss: 905.3496 Explore P: 0.1875
Episode: 88 Total reward: -199.0 Training loss: 657.7096 Explore P: 0.1840
Episode: 89 Total reward: -199.0 Training loss: 179.6094 Explore P: 0.1805
Episode: 90 Total reward: -199.0 Training loss: 235.6754 Explore P: 0.1772
Episode: 91 Total reward: -199.0 Training loss: 0.3124 Explore P: 0.1739
Episode: 92 Total reward: -199.0 Training loss: 226.4112 Explore P: 0.1706
Episode: 93 Total reward: -199.0 Training loss: 622.8131 Explore P: 0.1675
Episode: 94 Total reward: -199.0 Training loss: 0.2598 Explore P: 0.1644
Episode: 95 Total reward: -199.0 Training loss: 264.3078 Explore P: 0.1613
Episode: 96 Total reward: -199.0 Training loss: 628.8370 Explore P: 0.1584
Episode: 97 Total reward: -199.0 Training loss: 236.0284 Explore P: 0.1554
Episode: 98 Total reward: -199.0 Training loss: 762.6096 Explore P: 0.1526
Episode: 99 Total reward: -199.0 Training loss: 207.3579 Explore P: 0.1498
Episode: 100 Total reward: -199.0 Training loss: 249.4463 Explore P: 0.1470
Episode: 101 Total reward: -199.0 Training loss: 438.5832 Explore P: 0.1443
Episode: 102 Total reward: -199.0 Training loss: 223.8933 Explore P: 0.1417
Episode: 103 Total reward: -199.0 Training loss: 0.3500 Explore P: 0.1391
Episode: 104 Total reward: -199.0 Training loss: 747.4609 Explore P: 0.1365
Episode: 105 Total reward: -199.0 Training loss: 249.5469 Explore P: 0.1340
Episode: 106 Total reward: -199.0 Training loss: 502.5314 Explore P: 0.1316
Episode: 107 Total reward: -199.0 Training loss: 0.3052 Explore P: 0.1292
Episode: 108 Total reward: -199.0 Training loss: 413.8648 Explore P: 0.1268
Episode: 109 Total reward: -199.0 Training loss: 227.6606 Explore P: 0.1245
Episode: 110 Total reward: -199.0 Training loss: 227.5628 Explore P: 0.1223
Episode: 111 Total reward: -199.0 Training loss: 206.6959 Explore P: 0.1201
Episode: 112 Total reward: -199.0 Training loss: 542.3402 Explore P: 0.1179
Episode: 113 Total reward: -199.0 Training loss: 501.2343 Explore P: 0.1158
Episode: 114 Total reward: -199.0 Training loss: 0.3336 Explore P: 0.1137
Episode: 115 Total reward: -199.0 Training loss: 210.8261 Explore P: 0.1116
Episode: 116 Total reward: -199.0 Training loss: 193.2682 Explore P: 0.1096
Episode: 117 Total reward: -199.0 Training loss: 400.4077 Explore P: 0.1077
Episode: 118 Total reward: -199.0 Training loss: 211.6224 Explore P: 0.1058
Episode: 119 Total reward: -199.0 Training loss: 250.4275 Explore P: 0.1039
Episode: 120 Total reward: -199.0 Training loss: 364.1627 Explore P: 0.1020
Episode: 121 Total reward: -199.0 Training loss: 259.0563 Explore P: 0.1002
Episode: 122 Total reward: -199.0 Training loss: 0.3016 Explore P: 0.0984
Episode: 123 Total reward: -199.0 Training loss: 446.9735 Explore P: 0.0967
Episode: 124 Total reward: -199.0 Training loss: 416.1559 Explore P: 0.0950
Episode: 125 Total reward: -199.0 Training loss: 504.1576 Explore P: 0.0933
Episode: 126 Total reward: -199.0 Training loss: 0.3161 Explore P: 0.0917
Episode: 127 Total reward: -199.0 Training loss: 259.3651 Explore P: 0.0901
Episode: 128 Total reward: -199.0 Training loss: 178.3633 Explore P: 0.0885
Episode: 129 Total reward: -199.0 Training loss: 211.4234 Explore P: 0.0869
Episode: 130 Total reward: -199.0 Training loss: 0.1329 Explore P: 0.0854
Episode: 131 Total reward: -199.0 Training loss: 179.9212 Explore P: 0.0839
Episode: 132 Total reward: -199.0 Training loss: 149.2860 Explore P: 0.0825
Episode: 133 Total reward: -199.0 Training loss: 188.9923 Explore P: 0.0810
Episode: 134 Total reward: -199.0 Training loss: 547.8625 Explore P: 0.0796
Episode: 135 Total reward: -199.0 Training loss: 160.6921 Explore P: 0.0783
Episode: 136 Total reward: -199.0 Training loss: 0.1390 Explore P: 0.0769
Episode: 137 Total reward: -199.0 Training loss: 0.3331 Explore P: 0.0756
Episode: 138 Total reward: -199.0 Training loss: 431.8025 Explore P: 0.0743
Episode: 139 Total reward: -199.0 Training loss: 382.0038 Explore P: 0.0730
Episode: 140 Total reward: -199.0 Training loss: 0.3481 Explore P: 0.0718
Episode: 141 Total reward: -199.0 Training loss: 260.4589 Explore P: 0.0706
Episode: 142 Total reward: -199.0 Training loss: 414.5254 Explore P: 0.0694
Episode: 143 Total reward: -199.0 Training loss: 458.9545 Explore P: 0.0682
Episode: 144 Total reward: -199.0 Training loss: 192.2069 Explore P: 0.0671
Episode: 145 Total reward: -199.0 Training loss: 0.2772 Explore P: 0.0660
Episode: 146 Total reward: -199.0 Training loss: 0.3193 Explore P: 0.0649
Episode: 147 Total reward: -199.0 Training loss: 0.3198 Explore P: 0.0638
Episode: 148 Total reward: -199.0 Training loss: 187.7933 Explore P: 0.0627
Episode: 149 Total reward: -199.0 Training loss: 200.5688 Explore P: 0.0617
Episode: 150 Total reward: -199.0 Training loss: 201.0311 Explore P: 0.0607
Episode: 151 Total reward: -199.0 Training loss: 404.1075 Explore P: 0.0597
Episode: 152 Total reward: -199.0 Training loss: 174.1928 Explore P: 0.0587
Episode: 153 Total reward: -199.0 Training loss: 0.3435 Explore P: 0.0577
Episode: 154 Total reward: -199.0 Training loss: 0.4429 Explore P: 0.0568
Episode: 155 Total reward: -199.0 Training loss: 0.2593 Explore P: 0.0559
Episode: 156 Total reward: -199.0 Training loss: 0.3926 Explore P: 0.0550
Episode: 157 Total reward: -199.0 Training loss: 375.1106 Explore P: 0.0541
Episode: 158 Total reward: -199.0 Training loss: 0.1714 Explore P: 0.0532
Episode: 159 Total reward: -199.0 Training loss: 241.5639 Explore P: 0.0523
Episode: 160 Total reward: -199.0 Training loss: 0.2339 Explore P: 0.0515
Episode: 161 Total reward: -199.0 Training loss: 170.6426 Explore P: 0.0507
Episode: 162 Total reward: -199.0 Training loss: 188.4956 Explore P: 0.0499
Episode: 163 Total reward: -199.0 Training loss: 155.8701 Explore P: 0.0491
Episode: 164 Total reward: -199.0 Training loss: 0.3392 Explore P: 0.0483
Episode: 165 Total reward: -199.0 Training loss: 333.2675 Explore P: 0.0476
Episode: 166 Total reward: -199.0 Training loss: 390.1407 Explore P: 0.0468
Episode: 167 Total reward: -199.0 Training loss: 0.3041 Explore P: 0.0461
Episode: 168 Total reward: -199.0 Training loss: 160.3161 Explore P: 0.0454
Episode: 169 Total reward: -199.0 Training loss: 0.3631 Explore P: 0.0447
Episode: 170 Total reward: -199.0 Training loss: 0.3844 Explore P: 0.0440
Episode: 171 Total reward: -199.0 Training loss: 233.2304 Explore P: 0.0434
Episode: 172 Total reward: -199.0 Training loss: 0.2985 Explore P: 0.0427
Episode: 173 Total reward: -199.0 Training loss: 219.5496 Explore P: 0.0421
Episode: 174 Total reward: -199.0 Training loss: 0.1977 Explore P: 0.0414
Episode: 175 Total reward: -199.0 Training loss: 0.3921 Explore P: 0.0408
Episode: 176 Total reward: -199.0 Training loss: 0.3744 Explore P: 0.0402
Episode: 177 Total reward: -199.0 Training loss: 0.5370 Explore P: 0.0396
Episode: 178 Total reward: -199.0 Training loss: 0.4136 Explore P: 0.0390
Episode: 179 Total reward: -199.0 Training loss: 0.4482 Explore P: 0.0384
Episode: 180 Total reward: -199.0 Training loss: 187.1818 Explore P: 0.0379
Episode: 181 Total reward: -199.0 Training loss: 0.2779 Explore P: 0.0373
Episode: 182 Total reward: -199.0 Training loss: 0.5098 Explore P: 0.0368
Episode: 183 Total reward: -199.0 Training loss: 170.8258 Explore P: 0.0363
Episode: 184 Total reward: -199.0 Training loss: 355.3206 Explore P: 0.0357
Episode: 185 Total reward: -199.0 Training loss: 189.5893 Explore P: 0.0352
Episode: 186 Total reward: -199.0 Training loss: 0.2723 Explore P: 0.0347
Episode: 187 Total reward: -199.0 Training loss: 0.3043 Explore P: 0.0343
Episode: 188 Total reward: -199.0 Training loss: 0.4489 Explore P: 0.0338
Episode: 189 Total reward: -199.0 Training loss: 210.0493 Explore P: 0.0333
Episode: 190 Total reward: -199.0 Training loss: 0.2134 Explore P: 0.0329
Episode: 191 Total reward: -199.0 Training loss: 461.1664 Explore P: 0.0324
Episode: 192 Total reward: -199.0 Training loss: 262.2762 Explore P: 0.0320
Episode: 193 Total reward: -199.0 Training loss: 0.2448 Explore P: 0.0315
Episode: 194 Total reward: -199.0 Training loss: 0.1419 Explore P: 0.0311
Episode: 195 Total reward: -199.0 Training loss: 220.6318 Explore P: 0.0307
Episode: 196 Total reward: -199.0 Training loss: 188.6056 Explore P: 0.0303
Episode: 197 Total reward: -199.0 Training loss: 0.1893 Explore P: 0.0299
Episode: 198 Total reward: -199.0 Training loss: 0.1477 Explore P: 0.0295
Episode: 199 Total reward: -199.0 Training loss: 208.5783 Explore P: 0.0291
Episode: 200 Total reward: -199.0 Training loss: 0.2213 Explore P: 0.0287
Episode: 201 Total reward: -199.0 Training loss: 0.1511 Explore P: 0.0284
Episode: 202 Total reward: -199.0 Training loss: 0.2858 Explore P: 0.0280
Episode: 203 Total reward: -199.0 Training loss: 229.0875 Explore P: 0.0276
Episode: 204 Total reward: -199.0 Training loss: 0.2882 Explore P: 0.0273
Episode: 205 Total reward: -199.0 Training loss: 0.2085 Explore P: 0.0270
Episode: 206 Total reward: -199.0 Training loss: 232.6019 Explore P: 0.0266
Episode: 207 Total reward: -199.0 Training loss: 444.3925 Explore P: 0.0263
Episode: 208 Total reward: -199.0 Training loss: 220.9616 Explore P: 0.0260
Episode: 209 Total reward: -199.0 Training loss: 0.2074 Explore P: 0.0257
Episode: 210 Total reward: -199.0 Training loss: 0.3052 Explore P: 0.0253
Episode: 211 Total reward: -199.0 Training loss: 179.1043 Explore P: 0.0250
Episode: 212 Total reward: -199.0 Training loss: 0.2077 Explore P: 0.0247
Episode: 213 Total reward: -199.0 Training loss: 0.1906 Explore P: 0.0245
Episode: 214 Total reward: -199.0 Training loss: 0.2553 Explore P: 0.0242
Episode: 215 Total reward: -199.0 Training loss: 0.1865 Explore P: 0.0239
Episode: 216 Total reward: -199.0 Training loss: 173.1716 Explore P: 0.0236
Episode: 217 Total reward: -199.0 Training loss: 227.6028 Explore P: 0.0234
Episode: 218 Total reward: -199.0 Training loss: 248.8235 Explore P: 0.0231
Episode: 219 Total reward: -199.0 Training loss: 0.1669 Explore P: 0.0228
Episode: 220 Total reward: -199.0 Training loss: 0.2071 Explore P: 0.0226
Episode: 221 Total reward: -199.0 Training loss: 200.6746 Explore P: 0.0223
Episode: 222 Total reward: -199.0 Training loss: 0.3353 Explore P: 0.0221
Episode: 223 Total reward: -199.0 Training loss: 204.6123 Explore P: 0.0218
Episode: 224 Total reward: -199.0 Training loss: 0.1466 Explore P: 0.0216
Episode: 225 Total reward: -199.0 Training loss: 0.1270 Explore P: 0.0214
Episode: 226 Total reward: -199.0 Training loss: 0.1176 Explore P: 0.0212
Episode: 227 Total reward: -199.0 Training loss: 187.9116 Explore P: 0.0209
Episode: 228 Total reward: -199.0 Training loss: 0.1846 Explore P: 0.0207
Episode: 229 Total reward: -199.0 Training loss: 0.2208 Explore P: 0.0205
Episode: 230 Total reward: -199.0 Training loss: 0.2516 Explore P: 0.0203
Episode: 231 Total reward: -199.0 Training loss: 144.2259 Explore P: 0.0201
Episode: 232 Total reward: -199.0 Training loss: 0.2312 Explore P: 0.0199
Episode: 233 Total reward: -199.0 Training loss: 188.7431 Explore P: 0.0197
Episode: 234 Total reward: -199.0 Training loss: 186.7947 Explore P: 0.0195
Episode: 235 Total reward: -199.0 Training loss: 0.2145 Explore P: 0.0193
Episode: 236 Total reward: -199.0 Training loss: 149.7251 Explore P: 0.0191
Episode: 237 Total reward: -199.0 Training loss: 0.1095 Explore P: 0.0190
Episode: 238 Total reward: -199.0 Training loss: 306.9872 Explore P: 0.0188
Episode: 239 Total reward: -199.0 Training loss: 0.1559 Explore P: 0.0186
Episode: 240 Total reward: -199.0 Training loss: 0.2362 Explore P: 0.0184
Episode: 241 Total reward: -199.0 Training loss: 0.2699 Explore P: 0.0183
Episode: 242 Total reward: -199.0 Training loss: 0.3265 Explore P: 0.0181
Episode: 243 Total reward: -199.0 Training loss: 222.4182 Explore P: 0.0180
Episode: 244 Total reward: -199.0 Training loss: 0.3009 Explore P: 0.0178
Episode: 245 Total reward: -199.0 Training loss: 173.0432 Explore P: 0.0176
Episode: 246 Total reward: -199.0 Training loss: 169.4193 Explore P: 0.0175
Episode: 247 Total reward: -199.0 Training loss: 0.1221 Explore P: 0.0174
Episode: 248 Total reward: -199.0 Training loss: 170.5063 Explore P: 0.0172
Episode: 249 Total reward: -199.0 Training loss: 0.2005 Explore P: 0.0171
Episode: 250 Total reward: -199.0 Training loss: 0.2351 Explore P: 0.0169
Episode: 251 Total reward: -199.0 Training loss: 0.1979 Explore P: 0.0168
Episode: 252 Total reward: -199.0 Training loss: 0.2276 Explore P: 0.0167
Episode: 253 Total reward: -199.0 Training loss: 171.0335 Explore P: 0.0165
Episode: 254 Total reward: -199.0 Training loss: 0.1489 Explore P: 0.0164
Episode: 255 Total reward: -199.0 Training loss: 0.1683 Explore P: 0.0163
Episode: 256 Total reward: -199.0 Training loss: 178.9486 Explore P: 0.0161
Episode: 257 Total reward: -199.0 Training loss: 0.1275 Explore P: 0.0160
Episode: 258 Total reward: -199.0 Training loss: 0.1386 Explore P: 0.0159
Episode: 259 Total reward: -199.0 Training loss: 0.2444 Explore P: 0.0158
Episode: 260 Total reward: -199.0 Training loss: 0.3667 Explore P: 0.0157
Episode: 261 Total reward: -199.0 Training loss: 0.1317 Explore P: 0.0156
Episode: 262 Total reward: -199.0 Training loss: 221.3358 Explore P: 0.0155
Episode: 263 Total reward: -199.0 Training loss: 247.9454 Explore P: 0.0153
Episode: 264 Total reward: -199.0 Training loss: 0.2887 Explore P: 0.0152
Episode: 265 Total reward: -199.0 Training loss: 0.1666 Explore P: 0.0151
Episode: 266 Total reward: -199.0 Training loss: 0.1778 Explore P: 0.0150
Episode: 267 Total reward: -199.0 Training loss: 0.2083 Explore P: 0.0149
Episode: 268 Total reward: -199.0 Training loss: 0.2114 Explore P: 0.0148
Episode: 269 Total reward: -199.0 Training loss: 0.1361 Explore P: 0.0147
Episode: 270 Total reward: -199.0 Training loss: 235.1025 Explore P: 0.0147
Episode: 271 Total reward: -199.0 Training loss: 237.1890 Explore P: 0.0146
Episode: 272 Total reward: -199.0 Training loss: 0.1269 Explore P: 0.0145
Episode: 273 Total reward: -199.0 Training loss: 0.2421 Explore P: 0.0144
Episode: 274 Total reward: -199.0 Training loss: 0.1728 Explore P: 0.0143
Episode: 275 Total reward: -199.0 Training loss: 0.3069 Explore P: 0.0142
Episode: 276 Total reward: -199.0 Training loss: 0.1538 Explore P: 0.0141
Episode: 277 Total reward: -199.0 Training loss: 0.2706 Explore P: 0.0140
Episode: 278 Total reward: -199.0 Training loss: 0.1861 Explore P: 0.0140
Episode: 279 Total reward: -199.0 Training loss: 0.2539 Explore P: 0.0139
Episode: 280 Total reward: -199.0 Training loss: 0.0896 Explore P: 0.0138
Episode: 281 Total reward: -199.0 Training loss: 0.2041 Explore P: 0.0137
Episode: 282 Total reward: -199.0 Training loss: 0.2109 Explore P: 0.0137
Episode: 283 Total reward: -199.0 Training loss: 0.2082 Explore P: 0.0136
Episode: 284 Total reward: -199.0 Training loss: 0.0725 Explore P: 0.0135
Episode: 285 Total reward: -199.0 Training loss: 0.1411 Explore P: 0.0135
Episode: 286 Total reward: -199.0 Training loss: 0.1545 Explore P: 0.0134
Episode: 287 Total reward: -199.0 Training loss: 0.1925 Explore P: 0.0133
Episode: 288 Total reward: -199.0 Training loss: 0.2369 Explore P: 0.0133
Episode: 289 Total reward: -199.0 Training loss: 0.2650 Explore P: 0.0132
Episode: 290 Total reward: -199.0 Training loss: 0.0999 Explore P: 0.0131
Episode: 291 Total reward: -199.0 Training loss: 0.2690 Explore P: 0.0131
Episode: 292 Total reward: -199.0 Training loss: 0.2257 Explore P: 0.0130
Episode: 293 Total reward: -199.0 Training loss: 0.1523 Explore P: 0.0129
Episode: 294 Total reward: -199.0 Training loss: 0.1804 Explore P: 0.0129
Episode: 295 Total reward: -199.0 Training loss: 0.1617 Explore P: 0.0128
Episode: 296 Total reward: -199.0 Training loss: 0.1388 Explore P: 0.0128
Episode: 297 Total reward: -199.0 Training loss: 0.1744 Explore P: 0.0127
Episode: 298 Total reward: -199.0 Training loss: 0.1491 Explore P: 0.0127
Episode: 299 Total reward: -199.0 Training loss: 0.2122 Explore P: 0.0126
Episode: 300 Total reward: -199.0 Training loss: 0.2538 Explore P: 0.0126
Episode: 301 Total reward: -199.0 Training loss: 0.1901 Explore P: 0.0125
Episode: 302 Total reward: -199.0 Training loss: 0.2500 Explore P: 0.0125
Episode: 303 Total reward: -199.0 Training loss: 0.2867 Explore P: 0.0124
Episode: 304 Total reward: -199.0 Training loss: 153.0774 Explore P: 0.0124
Episode: 305 Total reward: -199.0 Training loss: 0.2533 Explore P: 0.0123
Episode: 306 Total reward: -199.0 Training loss: 0.2906 Explore P: 0.0123
Episode: 307 Total reward: -199.0 Training loss: 0.2015 Explore P: 0.0122
Episode: 308 Total reward: -199.0 Training loss: 0.1819 Explore P: 0.0122
Episode: 309 Total reward: -199.0 Training loss: 0.2491 Explore P: 0.0121
Episode: 310 Total reward: -199.0 Training loss: 0.2295 Explore P: 0.0121
Episode: 311 Total reward: -199.0 Training loss: 0.1526 Explore P: 0.0121
Episode: 312 Total reward: -199.0 Training loss: 0.1605 Explore P: 0.0120
Episode: 313 Total reward: -199.0 Training loss: 0.2038 Explore P: 0.0120
Episode: 314 Total reward: -199.0 Training loss: 0.2371 Explore P: 0.0119
Episode: 315 Total reward: -199.0 Training loss: 195.4897 Explore P: 0.0119
Episode: 316 Total reward: -199.0 Training loss: 192.3690 Explore P: 0.0119
Episode: 317 Total reward: -199.0 Training loss: 0.3275 Explore P: 0.0118
Episode: 318 Total reward: -199.0 Training loss: 0.2586 Explore P: 0.0118
Episode: 319 Total reward: -199.0 Training loss: 0.3217 Explore P: 0.0118
Episode: 320 Total reward: -199.0 Training loss: 344.5847 Explore P: 0.0117
Episode: 321 Total reward: -199.0 Training loss: 0.1719 Explore P: 0.0117
Episode: 322 Total reward: -199.0 Training loss: 0.3264 Explore P: 0.0117
Episode: 323 Total reward: -199.0 Training loss: 0.1806 Explore P: 0.0116
Episode: 324 Total reward: -199.0 Training loss: 0.2007 Explore P: 0.0116
Episode: 325 Total reward: -199.0 Training loss: 0.1914 Explore P: 0.0116
Episode: 326 Total reward: -199.0 Training loss: 0.2134 Explore P: 0.0115
Episode: 327 Total reward: -199.0 Training loss: 0.3248 Explore P: 0.0115
Episode: 328 Total reward: -199.0 Training loss: 0.2305 Explore P: 0.0115
Episode: 329 Total reward: -199.0 Training loss: 0.2007 Explore P: 0.0114
Episode: 330 Total reward: -199.0 Training loss: 0.1097 Explore P: 0.0114
Episode: 331 Total reward: -199.0 Training loss: 0.2256 Explore P: 0.0114
Episode: 332 Total reward: -199.0 Training loss: 0.1196 Explore P: 0.0114
Episode: 333 Total reward: -199.0 Training loss: 0.1835 Explore P: 0.0113
Episode: 334 Total reward: -199.0 Training loss: 0.1319 Explore P: 0.0113
Episode: 335 Total reward: -199.0 Training loss: 0.2195 Explore P: 0.0113
Episode: 336 Total reward: -199.0 Training loss: 0.1854 Explore P: 0.0113
Episode: 337 Total reward: -199.0 Training loss: 0.1141 Explore P: 0.0112
Episode: 338 Total reward: -199.0 Training loss: 0.1170 Explore P: 0.0112
Episode: 339 Total reward: -199.0 Training loss: 0.2163 Explore P: 0.0112
Episode: 340 Total reward: -199.0 Training loss: 0.1248 Explore P: 0.0112
Episode: 341 Total reward: -199.0 Training loss: 0.2127 Explore P: 0.0111
Episode: 342 Total reward: -199.0 Training loss: 0.2371 Explore P: 0.0111
Episode: 343 Total reward: -199.0 Training loss: 0.1952 Explore P: 0.0111
Episode: 344 Total reward: -199.0 Training loss: 0.2230 Explore P: 0.0111
Episode: 345 Total reward: -199.0 Training loss: 0.2388 Explore P: 0.0110
Episode: 346 Total reward: -199.0 Training loss: 0.1341 Explore P: 0.0110
Episode: 347 Total reward: -199.0 Training loss: 0.2146 Explore P: 0.0110
Episode: 348 Total reward: -199.0 Training loss: 0.1855 Explore P: 0.0110
Episode: 349 Total reward: -199.0 Training loss: 0.1924 Explore P: 0.0110
Episode: 350 Total reward: -199.0 Training loss: 0.3075 Explore P: 0.0109
Episode: 351 Total reward: -199.0 Training loss: 173.8871 Explore P: 0.0109
Episode: 352 Total reward: -199.0 Training loss: 0.1320 Explore P: 0.0109
Episode: 353 Total reward: -199.0 Training loss: 191.6758 Explore P: 0.0109
Episode: 354 Total reward: -199.0 Training loss: 0.2312 Explore P: 0.0109
Episode: 355 Total reward: -199.0 Training loss: 0.0942 Explore P: 0.0109
Episode: 356 Total reward: -199.0 Training loss: 0.1078 Explore P: 0.0108
Episode: 357 Total reward: -199.0 Training loss: 0.1075 Explore P: 0.0108
Episode: 358 Total reward: -199.0 Training loss: 0.2113 Explore P: 0.0108
Episode: 359 Total reward: -199.0 Training loss: 0.1679 Explore P: 0.0108
Episode: 360 Total reward: -199.0 Training loss: 0.3904 Explore P: 0.0108
Episode: 361 Total reward: -199.0 Training loss: 0.1329 Explore P: 0.0108
Episode: 362 Total reward: -199.0 Training loss: 0.1698 Explore P: 0.0107
Episode: 363 Total reward: -199.0 Training loss: 0.2482 Explore P: 0.0107
Episode: 364 Total reward: -199.0 Training loss: 0.2616 Explore P: 0.0107
Episode: 365 Total reward: -199.0 Training loss: 188.7820 Explore P: 0.0107
Episode: 366 Total reward: -199.0 Training loss: 209.7603 Explore P: 0.0107
Episode: 367 Total reward: -199.0 Training loss: 194.4615 Explore P: 0.0107
Episode: 368 Total reward: -199.0 Training loss: 0.1107 Explore P: 0.0107
Episode: 369 Total reward: -199.0 Training loss: 0.1547 Explore P: 0.0106
Episode: 370 Total reward: -199.0 Training loss: 0.0941 Explore P: 0.0106
Episode: 371 Total reward: -199.0 Training loss: 0.2030 Explore P: 0.0106
Episode: 372 Total reward: -199.0 Training loss: 0.1910 Explore P: 0.0106
Episode: 373 Total reward: -199.0 Training loss: 0.2101 Explore P: 0.0106
Episode: 374 Total reward: -199.0 Training loss: 0.2469 Explore P: 0.0106
Episode: 375 Total reward: -199.0 Training loss: 176.2014 Explore P: 0.0106
Episode: 376 Total reward: -199.0 Training loss: 0.2758 Explore P: 0.0106
Episode: 377 Total reward: -199.0 Training loss: 0.2334 Explore P: 0.0106
Episode: 378 Total reward: -199.0 Training loss: 0.1436 Explore P: 0.0105
Episode: 379 Total reward: -199.0 Training loss: 0.1419 Explore P: 0.0105
Episode: 380 Total reward: -199.0 Training loss: 0.2085 Explore P: 0.0105
Episode: 381 Total reward: -199.0 Training loss: 0.2079 Explore P: 0.0105
Episode: 382 Total reward: -199.0 Training loss: 0.2043 Explore P: 0.0105
Episode: 383 Total reward: -199.0 Training loss: 0.2156 Explore P: 0.0105
Episode: 384 Total reward: -199.0 Training loss: 0.2041 Explore P: 0.0105
Episode: 385 Total reward: -199.0 Training loss: 0.1285 Explore P: 0.0105
Episode: 386 Total reward: -199.0 Training loss: 0.1892 Explore P: 0.0105
Episode: 387 Total reward: -199.0 Training loss: 0.2205 Explore P: 0.0105
Episode: 388 Total reward: -199.0 Training loss: 0.1413 Explore P: 0.0104
Episode: 389 Total reward: -199.0 Training loss: 0.1637 Explore P: 0.0104
Episode: 390 Total reward: -199.0 Training loss: 193.9095 Explore P: 0.0104
Episode: 391 Total reward: -199.0 Training loss: 0.1569 Explore P: 0.0104
Episode: 392 Total reward: -199.0 Training loss: 0.1666 Explore P: 0.0104
Episode: 393 Total reward: -199.0 Training loss: 0.1562 Explore P: 0.0104
Episode: 394 Total reward: -199.0 Training loss: 0.1296 Explore P: 0.0104
Episode: 395 Total reward: -199.0 Training loss: 0.2225 Explore P: 0.0104
Episode: 396 Total reward: -199.0 Training loss: 0.1997 Explore P: 0.0104
Episode: 397 Total reward: -199.0 Training loss: 0.1694 Explore P: 0.0104
Episode: 398 Total reward: -199.0 Training loss: 357.1644 Explore P: 0.0104
Episode: 399 Total reward: -199.0 Training loss: 0.1723 Explore P: 0.0104
Episode: 400 Total reward: -199.0 Training loss: 0.1212 Explore P: 0.0103
Episode: 401 Total reward: -199.0 Training loss: 0.1810 Explore P: 0.0103
Episode: 402 Total reward: -199.0 Training loss: 0.1531 Explore P: 0.0103
Episode: 403 Total reward: -199.0 Training loss: 0.1597 Explore P: 0.0103
Episode: 404 Total reward: -199.0 Training loss: 0.2003 Explore P: 0.0103
Episode: 405 Total reward: -199.0 Training loss: 0.1737 Explore P: 0.0103
Episode: 406 Total reward: -199.0 Training loss: 0.1560 Explore P: 0.0103
Episode: 407 Total reward: -199.0 Training loss: 0.2563 Explore P: 0.0103
Episode: 408 Total reward: -199.0 Training loss: 0.2794 Explore P: 0.0103
Episode: 409 Total reward: -199.0 Training loss: 0.2156 Explore P: 0.0103
Episode: 410 Total reward: -199.0 Training loss: 0.3351 Explore P: 0.0103
Episode: 411 Total reward: -199.0 Training loss: 0.2425 Explore P: 0.0103
Episode: 412 Total reward: -199.0 Training loss: 160.2139 Explore P: 0.0103
Episode: 413 Total reward: -199.0 Training loss: 0.3076 Explore P: 0.0103
Episode: 414 Total reward: -199.0 Training loss: 172.1516 Explore P: 0.0103
Episode: 415 Total reward: -199.0 Training loss: 262.8005 Explore P: 0.0103
Episode: 416 Total reward: -199.0 Training loss: 0.2402 Explore P: 0.0103
Episode: 417 Total reward: -199.0 Training loss: 0.2414 Explore P: 0.0102
Episode: 418 Total reward: -199.0 Training loss: 0.3431 Explore P: 0.0102
Episode: 419 Total reward: -199.0 Training loss: 0.1740 Explore P: 0.0102
Episode: 420 Total reward: -199.0 Training loss: 0.2017 Explore P: 0.0102
Episode: 421 Total reward: -199.0 Training loss: 0.2778 Explore P: 0.0102
Episode: 422 Total reward: -199.0 Training loss: 0.1687 Explore P: 0.0102
Episode: 423 Total reward: -199.0 Training loss: 0.1884 Explore P: 0.0102
Episode: 424 Total reward: -199.0 Training loss: 0.2281 Explore P: 0.0102
Episode: 425 Total reward: -199.0 Training loss: 0.2266 Explore P: 0.0102
Episode: 426 Total reward: -199.0 Training loss: 0.1827 Explore P: 0.0102
Episode: 427 Total reward: -199.0 Training loss: 0.2772 Explore P: 0.0102
Episode: 428 Total reward: -199.0 Training loss: 0.1363 Explore P: 0.0102
Episode: 429 Total reward: -199.0 Training loss: 0.2090 Explore P: 0.0102
Episode: 430 Total reward: -199.0 Training loss: 173.3793 Explore P: 0.0102
Episode: 431 Total reward: -199.0 Training loss: 397.8823 Explore P: 0.0102
Episode: 432 Total reward: -199.0 Training loss: 175.1239 Explore P: 0.0102
Episode: 433 Total reward: -199.0 Training loss: 0.2176 Explore P: 0.0102
Episode: 434 Total reward: -199.0 Training loss: 199.5604 Explore P: 0.0102
Episode: 435 Total reward: -199.0 Training loss: 175.4117 Explore P: 0.0102
Episode: 436 Total reward: -199.0 Training loss: 0.3377 Explore P: 0.0102
Episode: 437 Total reward: -199.0 Training loss: 175.1428 Explore P: 0.0102
Episode: 438 Total reward: -199.0 Training loss: 0.2599 Explore P: 0.0102
Episode: 439 Total reward: -199.0 Training loss: 0.2974 Explore P: 0.0102
Episode: 440 Total reward: -199.0 Training loss: 0.2046 Explore P: 0.0102
Episode: 441 Total reward: -199.0 Training loss: 0.2252 Explore P: 0.0102
Episode: 442 Total reward: -199.0 Training loss: 0.1918 Explore P: 0.0102
Episode: 443 Total reward: -199.0 Training loss: 0.3807 Explore P: 0.0101
Episode: 444 Total reward: -199.0 Training loss: 389.9805 Explore P: 0.0101
Episode: 445 Total reward: -199.0 Training loss: 0.2469 Explore P: 0.0101
Episode: 446 Total reward: -199.0 Training loss: 0.1817 Explore P: 0.0101
Episode: 447 Total reward: -199.0 Training loss: 345.8887 Explore P: 0.0101
Episode: 448 Total reward: -199.0 Training loss: 0.2368 Explore P: 0.0101
Episode: 449 Total reward: -199.0 Training loss: 0.2635 Explore P: 0.0101
Episode: 450 Total reward: -199.0 Training loss: 0.2544 Explore P: 0.0101
Episode: 451 Total reward: -199.0 Training loss: 0.2981 Explore P: 0.0101
Episode: 452 Total reward: -199.0 Training loss: 0.3888 Explore P: 0.0101
Episode: 453 Total reward: -199.0 Training loss: 0.3009 Explore P: 0.0101
Episode: 454 Total reward: -199.0 Training loss: 0.1562 Explore P: 0.0101
Episode: 455 Total reward: -199.0 Training loss: 0.1799 Explore P: 0.0101
Episode: 456 Total reward: -199.0 Training loss: 0.2558 Explore P: 0.0101
Episode: 457 Total reward: -199.0 Training loss: 0.4169 Explore P: 0.0101
Episode: 458 Total reward: -199.0 Training loss: 0.3775 Explore P: 0.0101
Episode: 459 Total reward: -199.0 Training loss: 0.3452 Explore P: 0.0101
Episode: 460 Total reward: -199.0 Training loss: 173.1974 Explore P: 0.0101
Episode: 461 Total reward: -199.0 Training loss: 0.2484 Explore P: 0.0101
Episode: 462 Total reward: -199.0 Training loss: 0.1940 Explore P: 0.0101
Episode: 463 Total reward: -199.0 Training loss: 0.2862 Explore P: 0.0101
Episode: 464 Total reward: -199.0 Training loss: 0.3083 Explore P: 0.0101
Episode: 465 Total reward: -199.0 Training loss: 0.2169 Explore P: 0.0101
Episode: 466 Total reward: -199.0 Training loss: 0.4498 Explore P: 0.0101
Episode: 467 Total reward: -199.0 Training loss: 0.1170 Explore P: 0.0101
Episode: 468 Total reward: -199.0 Training loss: 0.2235 Explore P: 0.0101
Episode: 469 Total reward: -199.0 Training loss: 0.4018 Explore P: 0.0101
Episode: 470 Total reward: -199.0 Training loss: 0.1936 Explore P: 0.0101
Episode: 471 Total reward: -199.0 Training loss: 0.4480 Explore P: 0.0101
Episode: 472 Total reward: -199.0 Training loss: 0.2811 Explore P: 0.0101
Episode: 473 Total reward: -199.0 Training loss: 0.3957 Explore P: 0.0101
Episode: 474 Total reward: -199.0 Training loss: 0.3318 Explore P: 0.0101
Episode: 475 Total reward: -199.0 Training loss: 230.4907 Explore P: 0.0101
Episode: 476 Total reward: -199.0 Training loss: 0.4266 Explore P: 0.0101
Episode: 477 Total reward: -199.0 Training loss: 0.2583 Explore P: 0.0101
Episode: 478 Total reward: -199.0 Training loss: 0.1368 Explore P: 0.0101
Episode: 479 Total reward: -199.0 Training loss: 0.1571 Explore P: 0.0101
Episode: 480 Total reward: -199.0 Training loss: 0.3806 Explore P: 0.0101
Episode: 481 Total reward: -199.0 Training loss: 0.2464 Explore P: 0.0101
Episode: 482 Total reward: -199.0 Training loss: 0.3234 Explore P: 0.0101
Episode: 483 Total reward: -199.0 Training loss: 0.1794 Explore P: 0.0101
Episode: 484 Total reward: -199.0 Training loss: 0.2249 Explore P: 0.0101
Episode: 485 Total reward: -199.0 Training loss: 0.1801 Explore P: 0.0101
Episode: 486 Total reward: -199.0 Training loss: 329.1386 Explore P: 0.0101
Episode: 487 Total reward: -199.0 Training loss: 0.2233 Explore P: 0.0101
Episode: 488 Total reward: -199.0 Training loss: 0.2515 Explore P: 0.0101
Episode: 489 Total reward: -199.0 Training loss: 0.3228 Explore P: 0.0101
Episode: 490 Total reward: -199.0 Training loss: 0.2202 Explore P: 0.0101
Episode: 491 Total reward: -199.0 Training loss: 245.9888 Explore P: 0.0101
Episode: 492 Total reward: -199.0 Training loss: 0.3767 Explore P: 0.0101
Episode: 493 Total reward: -199.0 Training loss: 0.3740 Explore P: 0.0101
Episode: 494 Total reward: -199.0 Training loss: 0.4880 Explore P: 0.0101
Episode: 495 Total reward: -199.0 Training loss: 0.2875 Explore P: 0.0101
Episode: 496 Total reward: -199.0 Training loss: 0.2433 Explore P: 0.0101
Episode: 497 Total reward: -199.0 Training loss: 131.5013 Explore P: 0.0101
Episode: 498 Total reward: -199.0 Training loss: 0.3725 Explore P: 0.0100
Episode: 499 Total reward: -199.0 Training loss: 0.3769 Explore P: 0.0100
Episode: 500 Total reward: -199.0 Training loss: 0.1589 Explore P: 0.0100
Episode: 501 Total reward: -199.0 Training loss: 0.2043 Explore P: 0.0100
Episode: 502 Total reward: -199.0 Training loss: 177.5684 Explore P: 0.0100
Episode: 503 Total reward: -199.0 Training loss: 0.2921 Explore P: 0.0100
Episode: 504 Total reward: -199.0 Training loss: 0.2247 Explore P: 0.0100
Episode: 505 Total reward: -199.0 Training loss: 0.3291 Explore P: 0.0100
Episode: 506 Total reward: -199.0 Training loss: 0.4327 Explore P: 0.0100
Episode: 507 Total reward: -199.0 Training loss: 0.2108 Explore P: 0.0100
Episode: 508 Total reward: -199.0 Training loss: 0.2577 Explore P: 0.0100
Episode: 509 Total reward: -199.0 Training loss: 0.4170 Explore P: 0.0100
Episode: 510 Total reward: -199.0 Training loss: 0.2681 Explore P: 0.0100
Episode: 511 Total reward: -199.0 Training loss: 0.3759 Explore P: 0.0100
Episode: 512 Total reward: -199.0 Training loss: 0.3135 Explore P: 0.0100
Episode: 513 Total reward: -199.0 Training loss: 0.3037 Explore P: 0.0100
Episode: 514 Total reward: -199.0 Training loss: 0.3912 Explore P: 0.0100
Episode: 515 Total reward: -199.0 Training loss: 0.5147 Explore P: 0.0100
Episode: 516 Total reward: -199.0 Training loss: 0.2262 Explore P: 0.0100
Episode: 517 Total reward: -199.0 Training loss: 0.4158 Explore P: 0.0100
Episode: 518 Total reward: -199.0 Training loss: 0.2598 Explore P: 0.0100
Episode: 519 Total reward: -199.0 Training loss: 0.2795 Explore P: 0.0100
Episode: 520 Total reward: -199.0 Training loss: 0.2910 Explore P: 0.0100
Episode: 521 Total reward: -199.0 Training loss: 118.1537 Explore P: 0.0100
Episode: 522 Total reward: -199.0 Training loss: 85.9184 Explore P: 0.0100
Episode: 523 Total reward: -199.0 Training loss: 0.4335 Explore P: 0.0100
Episode: 524 Total reward: -199.0 Training loss: 0.2772 Explore P: 0.0100
Episode: 525 Total reward: -199.0 Training loss: 203.2148 Explore P: 0.0100
Episode: 526 Total reward: -199.0 Training loss: 0.7822 Explore P: 0.0100
Episode: 527 Total reward: -199.0 Training loss: 0.6102 Explore P: 0.0100
Episode: 528 Total reward: -199.0 Training loss: 176.7691 Explore P: 0.0100
Episode: 529 Total reward: -199.0 Training loss: 0.5512 Explore P: 0.0100
Episode: 530 Total reward: -199.0 Training loss: 0.2025 Explore P: 0.0100
Episode: 531 Total reward: -199.0 Training loss: 0.2187 Explore P: 0.0100
Episode: 532 Total reward: -199.0 Training loss: 0.2658 Explore P: 0.0100
Episode: 533 Total reward: -199.0 Training loss: 0.4832 Explore P: 0.0100
Episode: 534 Total reward: -199.0 Training loss: 233.2332 Explore P: 0.0100
Episode: 535 Total reward: -199.0 Training loss: 0.3226 Explore P: 0.0100
Episode: 536 Total reward: -199.0 Training loss: 0.4183 Explore P: 0.0100
Episode: 537 Total reward: -199.0 Training loss: 0.5390 Explore P: 0.0100
Episode: 538 Total reward: -199.0 Training loss: 135.8805 Explore P: 0.0100
Episode: 539 Total reward: -199.0 Training loss: 0.3904 Explore P: 0.0100
Episode: 540 Total reward: -199.0 Training loss: 196.6689 Explore P: 0.0100
Episode: 541 Total reward: -199.0 Training loss: 0.5956 Explore P: 0.0100
Episode: 542 Total reward: -199.0 Training loss: 0.4380 Explore P: 0.0100
Episode: 543 Total reward: -199.0 Training loss: 0.4636 Explore P: 0.0100
Episode: 544 Total reward: -199.0 Training loss: 0.2705 Explore P: 0.0100
Episode: 545 Total reward: -199.0 Training loss: 0.4251 Explore P: 0.0100
Episode: 546 Total reward: -199.0 Training loss: 0.4744 Explore P: 0.0100
Episode: 547 Total reward: -199.0 Training loss: 0.4335 Explore P: 0.0100
Episode: 548 Total reward: -199.0 Training loss: 200.4796 Explore P: 0.0100
Episode: 549 Total reward: -199.0 Training loss: 0.4135 Explore P: 0.0100
Episode: 550 Total reward: -199.0 Training loss: 0.6649 Explore P: 0.0100
Episode: 551 Total reward: -199.0 Training loss: 0.6064 Explore P: 0.0100
Episode: 552 Total reward: -199.0 Training loss: 0.4469 Explore P: 0.0100
Episode: 553 Total reward: -199.0 Training loss: 0.4995 Explore P: 0.0100
Episode: 554 Total reward: -199.0 Training loss: 0.5332 Explore P: 0.0100
Episode: 555 Total reward: -199.0 Training loss: 0.4280 Explore P: 0.0100
Episode: 556 Total reward: -199.0 Training loss: 0.5335 Explore P: 0.0100
Episode: 557 Total reward: -199.0 Training loss: 0.4807 Explore P: 0.0100
Episode: 558 Total reward: -199.0 Training loss: 0.2537 Explore P: 0.0100
Episode: 559 Total reward: -199.0 Training loss: 0.5142 Explore P: 0.0100
Episode: 560 Total reward: -199.0 Training loss: 0.3383 Explore P: 0.0100
Episode: 561 Total reward: -199.0 Training loss: 0.3223 Explore P: 0.0100
Episode: 562 Total reward: -199.0 Training loss: 69.7287 Explore P: 0.0100
Episode: 563 Total reward: -199.0 Training loss: 179.1216 Explore P: 0.0100
Episode: 564 Total reward: -199.0 Training loss: 0.5061 Explore P: 0.0100
Episode: 565 Total reward: -199.0 Training loss: 0.3119 Explore P: 0.0100
Episode: 566 Total reward: -199.0 Training loss: 0.2207 Explore P: 0.0100
Episode: 567 Total reward: -199.0 Training loss: 0.3274 Explore P: 0.0100
Episode: 568 Total reward: -199.0 Training loss: 147.9911 Explore P: 0.0100
Episode: 569 Total reward: -199.0 Training loss: 200.7127 Explore P: 0.0100
Episode: 570 Total reward: -199.0 Training loss: 0.1470 Explore P: 0.0100
Episode: 571 Total reward: -199.0 Training loss: 0.4204 Explore P: 0.0100
Episode: 572 Total reward: -199.0 Training loss: 0.2805 Explore P: 0.0100
Episode: 573 Total reward: -199.0 Training loss: 0.3985 Explore P: 0.0100
Episode: 574 Total reward: -199.0 Training loss: 0.4915 Explore P: 0.0100
Episode: 575 Total reward: -199.0 Training loss: 0.4017 Explore P: 0.0100
Episode: 576 Total reward: -199.0 Training loss: 0.2963 Explore P: 0.0100
Episode: 577 Total reward: -199.0 Training loss: 0.1236 Explore P: 0.0100
Episode: 578 Total reward: -199.0 Training loss: 0.1180 Explore P: 0.0100
Episode: 579 Total reward: -199.0 Training loss: 0.2860 Explore P: 0.0100
Episode: 580 Total reward: -199.0 Training loss: 0.2659 Explore P: 0.0100
Episode: 581 Total reward: -199.0 Training loss: 0.2684 Explore P: 0.0100
Episode: 582 Total reward: -199.0 Training loss: 0.3838 Explore P: 0.0100
Episode: 583 Total reward: -199.0 Training loss: 0.3034 Explore P: 0.0100
Episode: 584 Total reward: -199.0 Training loss: 0.2662 Explore P: 0.0100
Episode: 585 Total reward: -199.0 Training loss: 0.1637 Explore P: 0.0100
Episode: 586 Total reward: -199.0 Training loss: 0.3345 Explore P: 0.0100
Episode: 587 Total reward: -199.0 Training loss: 0.1574 Explore P: 0.0100
Episode: 588 Total reward: -199.0 Training loss: 0.2474 Explore P: 0.0100
Episode: 589 Total reward: -199.0 Training loss: 0.3318 Explore P: 0.0100
Episode: 590 Total reward: -199.0 Training loss: 0.1838 Explore P: 0.0100
Episode: 591 Total reward: -199.0 Training loss: 0.3508 Explore P: 0.0100
Episode: 592 Total reward: -199.0 Training loss: 0.1968 Explore P: 0.0100
Episode: 593 Total reward: -199.0 Training loss: 0.3649 Explore P: 0.0100
Episode: 594 Total reward: -199.0 Training loss: 0.2493 Explore P: 0.0100
Episode: 595 Total reward: -199.0 Training loss: 0.3070 Explore P: 0.0100
Episode: 596 Total reward: -199.0 Training loss: 0.2496 Explore P: 0.0100
Episode: 597 Total reward: -199.0 Training loss: 0.2715 Explore P: 0.0100
Episode: 598 Total reward: -199.0 Training loss: 190.8210 Explore P: 0.0100
Episode: 599 Total reward: -199.0 Training loss: 0.3722 Explore P: 0.0100
Episode: 600 Total reward: -199.0 Training loss: 0.1878 Explore P: 0.0100
Episode: 601 Total reward: -199.0 Training loss: 0.2962 Explore P: 0.0100
Episode: 602 Total reward: -199.0 Training loss: 182.9186 Explore P: 0.0100
Episode: 603 Total reward: -199.0 Training loss: 261.9647 Explore P: 0.0100
Episode: 604 Total reward: -199.0 Training loss: 190.9596 Explore P: 0.0100
Episode: 605 Total reward: -199.0 Training loss: 205.3884 Explore P: 0.0100
Episode: 606 Total reward: -199.0 Training loss: 0.1871 Explore P: 0.0100
Episode: 607 Total reward: -199.0 Training loss: 222.0581 Explore P: 0.0100
Episode: 608 Total reward: -199.0 Training loss: 0.1910 Explore P: 0.0100
Episode: 609 Total reward: -199.0 Training loss: 0.2323 Explore P: 0.0100
Episode: 610 Total reward: -199.0 Training loss: 0.2548 Explore P: 0.0100
Episode: 611 Total reward: -199.0 Training loss: 151.1659 Explore P: 0.0100
Episode: 612 Total reward: -199.0 Training loss: 0.1526 Explore P: 0.0100
Episode: 613 Total reward: -199.0 Training loss: 175.1818 Explore P: 0.0100
Episode: 614 Total reward: -199.0 Training loss: 0.0858 Explore P: 0.0100
Episode: 615 Total reward: -199.0 Training loss: 0.2319 Explore P: 0.0100
Episode: 616 Total reward: -199.0 Training loss: 0.1812 Explore P: 0.0100
Episode: 617 Total reward: -199.0 Training loss: 0.2872 Explore P: 0.0100
Episode: 618 Total reward: -199.0 Training loss: 0.3754 Explore P: 0.0100
Episode: 619 Total reward: -199.0 Training loss: 0.2732 Explore P: 0.0100
Episode: 620 Total reward: -199.0 Training loss: 0.2553 Explore P: 0.0100
Episode: 621 Total reward: -199.0 Training loss: 0.3335 Explore P: 0.0100
Episode: 622 Total reward: -199.0 Training loss: 0.2304 Explore P: 0.0100
Episode: 623 Total reward: -199.0 Training loss: 0.2819 Explore P: 0.0100
Episode: 624 Total reward: -199.0 Training loss: 161.1583 Explore P: 0.0100
Episode: 625 Total reward: -199.0 Training loss: 0.2225 Explore P: 0.0100
Episode: 626 Total reward: -199.0 Training loss: 0.4480 Explore P: 0.0100
Episode: 627 Total reward: -199.0 Training loss: 199.8566 Explore P: 0.0100
Episode: 628 Total reward: -199.0 Training loss: 219.4838 Explore P: 0.0100
Episode: 629 Total reward: -199.0 Training loss: 0.3049 Explore P: 0.0100
Episode: 630 Total reward: -199.0 Training loss: 0.1130 Explore P: 0.0100
Episode: 631 Total reward: -199.0 Training loss: 0.2157 Explore P: 0.0100
Episode: 632 Total reward: -199.0 Training loss: 215.6377 Explore P: 0.0100
Episode: 633 Total reward: -199.0 Training loss: 0.3180 Explore P: 0.0100
Episode: 634 Total reward: -199.0 Training loss: 0.1967 Explore P: 0.0100
Episode: 635 Total reward: -199.0 Training loss: 0.1931 Explore P: 0.0100
Episode: 636 Total reward: -199.0 Training loss: 0.2965 Explore P: 0.0100
Episode: 637 Total reward: -199.0 Training loss: 0.1510 Explore P: 0.0100
Episode: 638 Total reward: -199.0 Training loss: 235.7630 Explore P: 0.0100
Episode: 639 Total reward: -199.0 Training loss: 0.2170 Explore P: 0.0100
Episode: 640 Total reward: -199.0 Training loss: 0.3534 Explore P: 0.0100
Episode: 641 Total reward: -199.0 Training loss: 0.1218 Explore P: 0.0100
Episode: 642 Total reward: -199.0 Training loss: 0.1501 Explore P: 0.0100
Episode: 643 Total reward: -199.0 Training loss: 0.2607 Explore P: 0.0100
Episode: 644 Total reward: -199.0 Training loss: 0.1936 Explore P: 0.0100
Episode: 645 Total reward: -199.0 Training loss: 0.1475 Explore P: 0.0100
Episode: 646 Total reward: -199.0 Training loss: 0.2211 Explore P: 0.0100
Episode: 647 Total reward: -199.0 Training loss: 0.2157 Explore P: 0.0100
Episode: 648 Total reward: -199.0 Training loss: 0.1810 Explore P: 0.0100
Episode: 649 Total reward: -199.0 Training loss: 0.1911 Explore P: 0.0100
Episode: 650 Total reward: -199.0 Training loss: 0.2408 Explore P: 0.0100
Episode: 651 Total reward: -199.0 Training loss: 0.1268 Explore P: 0.0100
Episode: 652 Total reward: -199.0 Training loss: 0.3352 Explore P: 0.0100
Episode: 653 Total reward: -199.0 Training loss: 0.1632 Explore P: 0.0100
Episode: 654 Total reward: -199.0 Training loss: 0.2490 Explore P: 0.0100
Episode: 655 Total reward: -199.0 Training loss: 0.1889 Explore P: 0.0100
Episode: 656 Total reward: -199.0 Training loss: 0.2208 Explore P: 0.0100
Episode: 657 Total reward: -199.0 Training loss: 0.1809 Explore P: 0.0100
Episode: 658 Total reward: -199.0 Training loss: 0.2297 Explore P: 0.0100
Episode: 659 Total reward: -199.0 Training loss: 0.2353 Explore P: 0.0100
Episode: 660 Total reward: -199.0 Training loss: 0.2549 Explore P: 0.0100
Episode: 661 Total reward: -199.0 Training loss: 0.3172 Explore P: 0.0100
Episode: 662 Total reward: -199.0 Training loss: 188.3617 Explore P: 0.0100
Episode: 663 Total reward: -199.0 Training loss: 0.2815 Explore P: 0.0100
Episode: 664 Total reward: -199.0 Training loss: 0.2906 Explore P: 0.0100
Episode: 665 Total reward: -199.0 Training loss: 0.3409 Explore P: 0.0100
Episode: 666 Total reward: -199.0 Training loss: 203.6185 Explore P: 0.0100
Episode: 667 Total reward: -199.0 Training loss: 0.2060 Explore P: 0.0100
Episode: 668 Total reward: -199.0 Training loss: 0.2385 Explore P: 0.0100
Episode: 669 Total reward: -199.0 Training loss: 0.1937 Explore P: 0.0100
Episode: 670 Total reward: -199.0 Training loss: 132.3429 Explore P: 0.0100
Episode: 671 Total reward: -199.0 Training loss: 0.1953 Explore P: 0.0100
Episode: 672 Total reward: -199.0 Training loss: 0.2245 Explore P: 0.0100
Episode: 673 Total reward: -199.0 Training loss: 0.2215 Explore P: 0.0100
Episode: 674 Total reward: -199.0 Training loss: 0.1442 Explore P: 0.0100
Episode: 675 Total reward: -199.0 Training loss: 204.4820 Explore P: 0.0100
Episode: 676 Total reward: -199.0 Training loss: 203.4896 Explore P: 0.0100
Episode: 677 Total reward: -199.0 Training loss: 0.1900 Explore P: 0.0100
Episode: 678 Total reward: -199.0 Training loss: 0.1302 Explore P: 0.0100
Episode: 679 Total reward: -199.0 Training loss: 0.2265 Explore P: 0.0100
Episode: 680 Total reward: -199.0 Training loss: 0.1732 Explore P: 0.0100
Episode: 681 Total reward: -199.0 Training loss: 0.3666 Explore P: 0.0100
Episode: 682 Total reward: -199.0 Training loss: 0.1864 Explore P: 0.0100
Episode: 683 Total reward: -199.0 Training loss: 0.2057 Explore P: 0.0100
Episode: 684 Total reward: -199.0 Training loss: 0.1493 Explore P: 0.0100
Episode: 685 Total reward: -199.0 Training loss: 0.0825 Explore P: 0.0100
Episode: 686 Total reward: -199.0 Training loss: 0.2431 Explore P: 0.0100
Episode: 687 Total reward: -199.0 Training loss: 0.1248 Explore P: 0.0100
Episode: 688 Total reward: -199.0 Training loss: 0.2648 Explore P: 0.0100
Episode: 689 Total reward: -199.0 Training loss: 0.1784 Explore P: 0.0100
Episode: 690 Total reward: -199.0 Training loss: 0.1732 Explore P: 0.0100
Episode: 691 Total reward: -199.0 Training loss: 205.6966 Explore P: 0.0100
Episode: 692 Total reward: -199.0 Training loss: 0.2391 Explore P: 0.0100
Episode: 693 Total reward: -199.0 Training loss: 197.9220 Explore P: 0.0100
Episode: 694 Total reward: -199.0 Training loss: 0.1047 Explore P: 0.0100
Episode: 695 Total reward: -199.0 Training loss: 0.4428 Explore P: 0.0100
Episode: 696 Total reward: -199.0 Training loss: 0.2741 Explore P: 0.0100
Episode: 697 Total reward: -199.0 Training loss: 0.1775 Explore P: 0.0100
Episode: 698 Total reward: -199.0 Training loss: 0.2431 Explore P: 0.0100
Episode: 699 Total reward: -199.0 Training loss: 0.2721 Explore P: 0.0100
Episode: 700 Total reward: -199.0 Training loss: 0.3106 Explore P: 0.0100
Episode: 701 Total reward: -199.0 Training loss: 231.0549 Explore P: 0.0100
Episode: 702 Total reward: -199.0 Training loss: 0.1376 Explore P: 0.0100
Episode: 703 Total reward: -199.0 Training loss: 232.6574 Explore P: 0.0100
Episode: 704 Total reward: -199.0 Training loss: 0.1534 Explore P: 0.0100
Episode: 705 Total reward: -199.0 Training loss: 0.2187 Explore P: 0.0100
Episode: 706 Total reward: -199.0 Training loss: 0.3233 Explore P: 0.0100
Episode: 707 Total reward: -199.0 Training loss: 0.1929 Explore P: 0.0100
Episode: 708 Total reward: -199.0 Training loss: 0.3187 Explore P: 0.0100
Episode: 709 Total reward: -199.0 Training loss: 0.3415 Explore P: 0.0100
Episode: 710 Total reward: -199.0 Training loss: 0.2746 Explore P: 0.0100
Episode: 711 Total reward: -199.0 Training loss: 144.5928 Explore P: 0.0100
Episode: 712 Total reward: -199.0 Training loss: 0.2237 Explore P: 0.0100
Episode: 713 Total reward: -199.0 Training loss: 0.2947 Explore P: 0.0100
Episode: 714 Total reward: -199.0 Training loss: 209.4447 Explore P: 0.0100
Episode: 715 Total reward: -199.0 Training loss: 0.2546 Explore P: 0.0100
Episode: 716 Total reward: -199.0 Training loss: 202.8853 Explore P: 0.0100
Episode: 717 Total reward: -199.0 Training loss: 233.4455 Explore P: 0.0100
Episode: 718 Total reward: -199.0 Training loss: 0.3583 Explore P: 0.0100
Episode: 719 Total reward: -199.0 Training loss: 0.1671 Explore P: 0.0100
Episode: 720 Total reward: -199.0 Training loss: 0.0886 Explore P: 0.0100
Episode: 721 Total reward: -199.0 Training loss: 0.3651 Explore P: 0.0100
Episode: 722 Total reward: -199.0 Training loss: 0.2372 Explore P: 0.0100
Episode: 723 Total reward: -199.0 Training loss: 0.2240 Explore P: 0.0100
Episode: 724 Total reward: -199.0 Training loss: 0.4395 Explore P: 0.0100
Episode: 725 Total reward: -199.0 Training loss: 0.1713 Explore P: 0.0100
Episode: 726 Total reward: -199.0 Training loss: 0.2706 Explore P: 0.0100
Episode: 727 Total reward: -199.0 Training loss: 0.1772 Explore P: 0.0100
Episode: 728 Total reward: -199.0 Training loss: 0.1612 Explore P: 0.0100
Episode: 729 Total reward: -199.0 Training loss: 0.2414 Explore P: 0.0100
Episode: 730 Total reward: -199.0 Training loss: 0.2971 Explore P: 0.0100
Episode: 731 Total reward: -199.0 Training loss: 445.3287 Explore P: 0.0100
Episode: 732 Total reward: -199.0 Training loss: 225.8912 Explore P: 0.0100
Episode: 733 Total reward: -199.0 Training loss: 0.1831 Explore P: 0.0100
Episode: 734 Total reward: -199.0 Training loss: 214.1418 Explore P: 0.0100
Episode: 735 Total reward: -199.0 Training loss: 0.1507 Explore P: 0.0100
Episode: 736 Total reward: -199.0 Training loss: 0.2687 Explore P: 0.0100
Episode: 737 Total reward: -199.0 Training loss: 0.3526 Explore P: 0.0100
Episode: 738 Total reward: -199.0 Training loss: 0.2002 Explore P: 0.0100
Episode: 739 Total reward: -199.0 Training loss: 0.2439 Explore P: 0.0100
Episode: 740 Total reward: -199.0 Training loss: 0.2013 Explore P: 0.0100
Episode: 741 Total reward: -199.0 Training loss: 0.2253 Explore P: 0.0100
Episode: 742 Total reward: -199.0 Training loss: 139.9890 Explore P: 0.0100
Episode: 743 Total reward: -199.0 Training loss: 0.1576 Explore P: 0.0100
Episode: 744 Total reward: -199.0 Training loss: 0.2544 Explore P: 0.0100
Episode: 745 Total reward: -199.0 Training loss: 0.1641 Explore P: 0.0100
Episode: 746 Total reward: -199.0 Training loss: 0.3086 Explore P: 0.0100
Episode: 747 Total reward: -199.0 Training loss: 0.1141 Explore P: 0.0100
Episode: 748 Total reward: -199.0 Training loss: 0.3473 Explore P: 0.0100
Episode: 749 Total reward: -199.0 Training loss: 0.1646 Explore P: 0.0100
Episode: 750 Total reward: -199.0 Training loss: 0.2884 Explore P: 0.0100
Episode: 751 Total reward: -199.0 Training loss: 0.1658 Explore P: 0.0100
Episode: 752 Total reward: -199.0 Training loss: 0.1450 Explore P: 0.0100
Episode: 753 Total reward: -199.0 Training loss: 0.1312 Explore P: 0.0100
Episode: 754 Total reward: -199.0 Training loss: 0.1334 Explore P: 0.0100
Episode: 755 Total reward: -199.0 Training loss: 0.3282 Explore P: 0.0100
Episode: 756 Total reward: -199.0 Training loss: 0.3560 Explore P: 0.0100
Episode: 757 Total reward: -199.0 Training loss: 0.1770 Explore P: 0.0100
Episode: 758 Total reward: -199.0 Training loss: 192.9929 Explore P: 0.0100
Episode: 759 Total reward: -199.0 Training loss: 0.1473 Explore P: 0.0100
Episode: 760 Total reward: -199.0 Training loss: 0.3295 Explore P: 0.0100
Episode: 761 Total reward: -199.0 Training loss: 0.1391 Explore P: 0.0100
Episode: 762 Total reward: -199.0 Training loss: 0.2035 Explore P: 0.0100
Episode: 763 Total reward: -199.0 Training loss: 0.2369 Explore P: 0.0100
Episode: 764 Total reward: -199.0 Training loss: 0.2118 Explore P: 0.0100
Episode: 765 Total reward: -199.0 Training loss: 0.2241 Explore P: 0.0100
Episode: 766 Total reward: -199.0 Training loss: 0.2836 Explore P: 0.0100
Episode: 767 Total reward: -199.0 Training loss: 0.1866 Explore P: 0.0100
Episode: 768 Total reward: -199.0 Training loss: 240.1953 Explore P: 0.0100
Episode: 769 Total reward: -199.0 Training loss: 0.2546 Explore P: 0.0100
Episode: 770 Total reward: -199.0 Training loss: 0.2403 Explore P: 0.0100
Episode: 771 Total reward: -199.0 Training loss: 0.2379 Explore P: 0.0100
Episode: 772 Total reward: -199.0 Training loss: 312.6153 Explore P: 0.0100
Episode: 773 Total reward: -199.0 Training loss: 0.3126 Explore P: 0.0100
Episode: 774 Total reward: -199.0 Training loss: 0.1835 Explore P: 0.0100
Episode: 775 Total reward: -199.0 Training loss: 0.3105 Explore P: 0.0100
Episode: 776 Total reward: -199.0 Training loss: 0.2702 Explore P: 0.0100
Episode: 777 Total reward: -199.0 Training loss: 0.2469 Explore P: 0.0100
Episode: 778 Total reward: -199.0 Training loss: 0.2164 Explore P: 0.0100
Episode: 779 Total reward: -199.0 Training loss: 0.2118 Explore P: 0.0100
Episode: 780 Total reward: -199.0 Training loss: 0.2035 Explore P: 0.0100
Episode: 781 Total reward: -199.0 Training loss: 0.1374 Explore P: 0.0100
Episode: 782 Total reward: -199.0 Training loss: 0.2518 Explore P: 0.0100
Episode: 783 Total reward: -199.0 Training loss: 0.1851 Explore P: 0.0100
Episode: 784 Total reward: -199.0 Training loss: 0.3484 Explore P: 0.0100
Episode: 785 Total reward: -199.0 Training loss: 0.1502 Explore P: 0.0100
Episode: 786 Total reward: -199.0 Training loss: 0.3623 Explore P: 0.0100
Episode: 787 Total reward: -199.0 Training loss: 0.2813 Explore P: 0.0100
Episode: 788 Total reward: -199.0 Training loss: 0.3586 Explore P: 0.0100
Episode: 789 Total reward: -199.0 Training loss: 0.2523 Explore P: 0.0100
Episode: 790 Total reward: -199.0 Training loss: 0.4354 Explore P: 0.0100
Episode: 791 Total reward: -199.0 Training loss: 0.2625 Explore P: 0.0100
Episode: 792 Total reward: -199.0 Training loss: 0.1624 Explore P: 0.0100
Episode: 793 Total reward: -199.0 Training loss: 0.3437 Explore P: 0.0100
Episode: 794 Total reward: -199.0 Training loss: 0.2505 Explore P: 0.0100
Episode: 795 Total reward: -199.0 Training loss: 142.5487 Explore P: 0.0100
Episode: 796 Total reward: -199.0 Training loss: 0.3008 Explore P: 0.0100
Episode: 797 Total reward: -199.0 Training loss: 0.1902 Explore P: 0.0100
Episode: 798 Total reward: -199.0 Training loss: 0.3046 Explore P: 0.0100
Episode: 799 Total reward: -199.0 Training loss: 0.1802 Explore P: 0.0100
Episode: 800 Total reward: -199.0 Training loss: 0.2231 Explore P: 0.0100
Episode: 801 Total reward: -199.0 Training loss: 0.2023 Explore P: 0.0100
Episode: 802 Total reward: -199.0 Training loss: 0.2093 Explore P: 0.0100
Episode: 803 Total reward: -199.0 Training loss: 227.3339 Explore P: 0.0100
Episode: 804 Total reward: -199.0 Training loss: 0.3143 Explore P: 0.0100
Episode: 805 Total reward: -199.0 Training loss: 0.1229 Explore P: 0.0100
Episode: 806 Total reward: -199.0 Training loss: 0.2015 Explore P: 0.0100
Episode: 807 Total reward: -199.0 Training loss: 0.1782 Explore P: 0.0100
Episode: 808 Total reward: -199.0 Training loss: 0.3613 Explore P: 0.0100
Episode: 809 Total reward: -199.0 Training loss: 207.9070 Explore P: 0.0100
Episode: 810 Total reward: -199.0 Training loss: 0.2224 Explore P: 0.0100
Episode: 811 Total reward: -199.0 Training loss: 211.4654 Explore P: 0.0100
Episode: 812 Total reward: -199.0 Training loss: 0.1632 Explore P: 0.0100
Episode: 813 Total reward: -199.0 Training loss: 195.5205 Explore P: 0.0100
Episode: 814 Total reward: -199.0 Training loss: 209.6518 Explore P: 0.0100
Episode: 815 Total reward: -199.0 Training loss: 0.2534 Explore P: 0.0100
Episode: 816 Total reward: -199.0 Training loss: 0.4137 Explore P: 0.0100
Episode: 817 Total reward: -199.0 Training loss: 0.2295 Explore P: 0.0100
Episode: 818 Total reward: -199.0 Training loss: 0.4251 Explore P: 0.0100
Episode: 819 Total reward: -199.0 Training loss: 0.2143 Explore P: 0.0100
Episode: 820 Total reward: -199.0 Training loss: 0.3149 Explore P: 0.0100
Episode: 821 Total reward: -199.0 Training loss: 0.2999 Explore P: 0.0100
Episode: 822 Total reward: -199.0 Training loss: 234.3014 Explore P: 0.0100
Episode: 823 Total reward: -199.0 Training loss: 0.1493 Explore P: 0.0100
Episode: 824 Total reward: -199.0 Training loss: 0.2728 Explore P: 0.0100
Episode: 825 Total reward: -199.0 Training loss: 0.2290 Explore P: 0.0100
Episode: 826 Total reward: -199.0 Training loss: 0.2526 Explore P: 0.0100
Episode: 827 Total reward: -199.0 Training loss: 0.2517 Explore P: 0.0100
Episode: 828 Total reward: -199.0 Training loss: 0.1838 Explore P: 0.0100
Episode: 829 Total reward: -199.0 Training loss: 0.3393 Explore P: 0.0100
Episode: 830 Total reward: -199.0 Training loss: 0.5729 Explore P: 0.0100
Episode: 831 Total reward: -199.0 Training loss: 383.2180 Explore P: 0.0100
Episode: 832 Total reward: -199.0 Training loss: 0.2593 Explore P: 0.0100
Episode: 833 Total reward: -199.0 Training loss: 0.4988 Explore P: 0.0100
Episode: 834 Total reward: -199.0 Training loss: 0.1454 Explore P: 0.0100
Episode: 835 Total reward: -199.0 Training loss: 0.4293 Explore P: 0.0100
Episode: 836 Total reward: -199.0 Training loss: 0.1764 Explore P: 0.0100
Episode: 837 Total reward: -199.0 Training loss: 234.1080 Explore P: 0.0100
Episode: 838 Total reward: -199.0 Training loss: 0.2488 Explore P: 0.0100
Episode: 839 Total reward: -199.0 Training loss: 0.3189 Explore P: 0.0100
Episode: 840 Total reward: -199.0 Training loss: 163.2121 Explore P: 0.0100
Episode: 841 Total reward: -199.0 Training loss: 155.3551 Explore P: 0.0100
Episode: 842 Total reward: -199.0 Training loss: 0.2007 Explore P: 0.0100
Episode: 843 Total reward: -199.0 Training loss: 0.4611 Explore P: 0.0100
Episode: 844 Total reward: -199.0 Training loss: 0.3375 Explore P: 0.0100
Episode: 845 Total reward: -199.0 Training loss: 0.3794 Explore P: 0.0100
Episode: 846 Total reward: -199.0 Training loss: 0.2282 Explore P: 0.0100
Episode: 847 Total reward: -199.0 Training loss: 233.1985 Explore P: 0.0100
Episode: 848 Total reward: -199.0 Training loss: 0.2178 Explore P: 0.0100
Episode: 849 Total reward: -199.0 Training loss: 0.2025 Explore P: 0.0100
Episode: 850 Total reward: -199.0 Training loss: 435.2504 Explore P: 0.0100
Episode: 851 Total reward: -199.0 Training loss: 183.4550 Explore P: 0.0100
Episode: 852 Total reward: -199.0 Training loss: 0.2279 Explore P: 0.0100
Episode: 853 Total reward: -199.0 Training loss: 0.4090 Explore P: 0.0100
Episode: 854 Total reward: -199.0 Training loss: 0.2338 Explore P: 0.0100
Episode: 855 Total reward: -199.0 Training loss: 206.5216 Explore P: 0.0100
Episode: 856 Total reward: -199.0 Training loss: 0.2220 Explore P: 0.0100
Episode: 857 Total reward: -199.0 Training loss: 0.1915 Explore P: 0.0100
Episode: 858 Total reward: -199.0 Training loss: 0.1782 Explore P: 0.0100
Episode: 859 Total reward: -199.0 Training loss: 0.4351 Explore P: 0.0100
Episode: 860 Total reward: -199.0 Training loss: 0.2099 Explore P: 0.0100
Episode: 861 Total reward: -199.0 Training loss: 0.4031 Explore P: 0.0100
Episode: 862 Total reward: -199.0 Training loss: 0.2512 Explore P: 0.0100
Episode: 863 Total reward: -199.0 Training loss: 0.3328 Explore P: 0.0100
Episode: 864 Total reward: -199.0 Training loss: 0.2312 Explore P: 0.0100
Episode: 865 Total reward: -199.0 Training loss: 0.3737 Explore P: 0.0100
Episode: 866 Total reward: -199.0 Training loss: 0.4141 Explore P: 0.0100
Episode: 867 Total reward: -199.0 Training loss: 0.2321 Explore P: 0.0100
Episode: 868 Total reward: -199.0 Training loss: 0.1408 Explore P: 0.0100
Episode: 869 Total reward: -199.0 Training loss: 0.3677 Explore P: 0.0100
Episode: 870 Total reward: -199.0 Training loss: 0.5278 Explore P: 0.0100
Episode: 871 Total reward: -199.0 Training loss: 169.9983 Explore P: 0.0100
Episode: 872 Total reward: -199.0 Training loss: 0.2531 Explore P: 0.0100
Episode: 873 Total reward: -199.0 Training loss: 0.2837 Explore P: 0.0100
Episode: 874 Total reward: -199.0 Training loss: 0.3285 Explore P: 0.0100
Episode: 875 Total reward: -199.0 Training loss: 275.5761 Explore P: 0.0100
Episode: 876 Total reward: -199.0 Training loss: 0.3628 Explore P: 0.0100
Episode: 877 Total reward: -199.0 Training loss: 0.3870 Explore P: 0.0100
Episode: 878 Total reward: -199.0 Training loss: 0.2736 Explore P: 0.0100
Episode: 879 Total reward: -199.0 Training loss: 234.8757 Explore P: 0.0100
Episode: 880 Total reward: -199.0 Training loss: 0.2269 Explore P: 0.0100
Episode: 881 Total reward: -199.0 Training loss: 0.4525 Explore P: 0.0100
Episode: 882 Total reward: -199.0 Training loss: 0.3157 Explore P: 0.0100
Episode: 883 Total reward: -199.0 Training loss: 0.3407 Explore P: 0.0100
Episode: 884 Total reward: -199.0 Training loss: 0.3286 Explore P: 0.0100
Episode: 885 Total reward: -199.0 Training loss: 0.1478 Explore P: 0.0100
Episode: 886 Total reward: -199.0 Training loss: 0.3074 Explore P: 0.0100
Episode: 887 Total reward: -199.0 Training loss: 0.2055 Explore P: 0.0100
Episode: 888 Total reward: -199.0 Training loss: 0.4156 Explore P: 0.0100
Episode: 889 Total reward: -199.0 Training loss: 0.2445 Explore P: 0.0100
Episode: 890 Total reward: -199.0 Training loss: 0.3337 Explore P: 0.0100
Episode: 891 Total reward: -199.0 Training loss: 0.3830 Explore P: 0.0100
Episode: 892 Total reward: -199.0 Training loss: 0.1191 Explore P: 0.0100
Episode: 893 Total reward: -199.0 Training loss: 0.3252 Explore P: 0.0100
Episode: 894 Total reward: -199.0 Training loss: 0.1372 Explore P: 0.0100
Episode: 895 Total reward: -199.0 Training loss: 162.3221 Explore P: 0.0100
Episode: 896 Total reward: -199.0 Training loss: 0.3312 Explore P: 0.0100
Episode: 897 Total reward: -199.0 Training loss: 0.2989 Explore P: 0.0100
Episode: 898 Total reward: -199.0 Training loss: 0.3034 Explore P: 0.0100
Episode: 899 Total reward: -199.0 Training loss: 181.2942 Explore P: 0.0100
Episode: 900 Total reward: -199.0 Training loss: 0.2649 Explore P: 0.0100
Episode: 901 Total reward: -199.0 Training loss: 389.5145 Explore P: 0.0100
Episode: 902 Total reward: -199.0 Training loss: 0.2454 Explore P: 0.0100
Episode: 903 Total reward: -199.0 Training loss: 0.2255 Explore P: 0.0100
Episode: 904 Total reward: -199.0 Training loss: 0.3743 Explore P: 0.0100
Episode: 905 Total reward: -199.0 Training loss: 0.2902 Explore P: 0.0100
Episode: 906 Total reward: -199.0 Training loss: 183.2758 Explore P: 0.0100
Episode: 907 Total reward: -199.0 Training loss: 0.1473 Explore P: 0.0100
Episode: 908 Total reward: -199.0 Training loss: 0.3770 Explore P: 0.0100
Episode: 909 Total reward: -199.0 Training loss: 0.2156 Explore P: 0.0100
Episode: 910 Total reward: -199.0 Training loss: 0.3350 Explore P: 0.0100
Episode: 911 Total reward: -199.0 Training loss: 0.3188 Explore P: 0.0100
Episode: 912 Total reward: -199.0 Training loss: 0.1816 Explore P: 0.0100
Episode: 913 Total reward: -199.0 Training loss: 0.4084 Explore P: 0.0100
Episode: 914 Total reward: -199.0 Training loss: 0.2777 Explore P: 0.0100
Episode: 915 Total reward: -199.0 Training loss: 0.3347 Explore P: 0.0100
Episode: 916 Total reward: -199.0 Training loss: 0.3623 Explore P: 0.0100
Episode: 917 Total reward: -199.0 Training loss: 0.1669 Explore P: 0.0100
Episode: 918 Total reward: -199.0 Training loss: 0.1649 Explore P: 0.0100
Episode: 919 Total reward: -199.0 Training loss: 0.3714 Explore P: 0.0100
Episode: 920 Total reward: -199.0 Training loss: 0.2470 Explore P: 0.0100
Episode: 921 Total reward: -199.0 Training loss: 212.1552 Explore P: 0.0100
Episode: 922 Total reward: -199.0 Training loss: 179.9927 Explore P: 0.0100
Episode: 923 Total reward: -199.0 Training loss: 0.1663 Explore P: 0.0100
Episode: 924 Total reward: -199.0 Training loss: 0.1426 Explore P: 0.0100
Episode: 925 Total reward: -199.0 Training loss: 0.3930 Explore P: 0.0100
Episode: 926 Total reward: -199.0 Training loss: 0.1050 Explore P: 0.0100
Episode: 927 Total reward: -199.0 Training loss: 0.2968 Explore P: 0.0100
Episode: 928 Total reward: -199.0 Training loss: 0.2668 Explore P: 0.0100
Episode: 929 Total reward: -199.0 Training loss: 0.3368 Explore P: 0.0100
Episode: 930 Total reward: -199.0 Training loss: 0.3935 Explore P: 0.0100
Episode: 931 Total reward: -199.0 Training loss: 0.2818 Explore P: 0.0100
Episode: 932 Total reward: -199.0 Training loss: 0.1486 Explore P: 0.0100
Episode: 933 Total reward: -199.0 Training loss: 203.0708 Explore P: 0.0100
Episode: 934 Total reward: -199.0 Training loss: 0.3182 Explore P: 0.0100
Episode: 935 Total reward: -199.0 Training loss: 0.1482 Explore P: 0.0100
Episode: 936 Total reward: -199.0 Training loss: 0.2728 Explore P: 0.0100
Episode: 937 Total reward: -199.0 Training loss: 0.2977 Explore P: 0.0100
Episode: 938 Total reward: -199.0 Training loss: 229.1917 Explore P: 0.0100
Episode: 939 Total reward: -199.0 Training loss: 0.2137 Explore P: 0.0100
Episode: 940 Total reward: -199.0 Training loss: 0.2118 Explore P: 0.0100
Episode: 941 Total reward: -199.0 Training loss: 0.2180 Explore P: 0.0100
Episode: 942 Total reward: -199.0 Training loss: 165.8180 Explore P: 0.0100
Episode: 943 Total reward: -199.0 Training loss: 0.1593 Explore P: 0.0100
Episode: 944 Total reward: -199.0 Training loss: 0.3733 Explore P: 0.0100
Episode: 945 Total reward: -199.0 Training loss: 0.2860 Explore P: 0.0100
Episode: 946 Total reward: -199.0 Training loss: 167.4956 Explore P: 0.0100
Episode: 947 Total reward: -199.0 Training loss: 0.1981 Explore P: 0.0100
Episode: 948 Total reward: -199.0 Training loss: 0.2183 Explore P: 0.0100
Episode: 949 Total reward: -199.0 Training loss: 0.3491 Explore P: 0.0100
Episode: 950 Total reward: -199.0 Training loss: 420.4178 Explore P: 0.0100
Episode: 951 Total reward: -199.0 Training loss: 0.1780 Explore P: 0.0100
Episode: 952 Total reward: -199.0 Training loss: 0.2246 Explore P: 0.0100
Episode: 953 Total reward: -199.0 Training loss: 0.3110 Explore P: 0.0100
Episode: 954 Total reward: -199.0 Training loss: 0.3256 Explore P: 0.0100
Episode: 955 Total reward: -199.0 Training loss: 0.2453 Explore P: 0.0100
Episode: 956 Total reward: -199.0 Training loss: 0.3567 Explore P: 0.0100
Episode: 957 Total reward: -199.0 Training loss: 0.2721 Explore P: 0.0100
Episode: 958 Total reward: -199.0 Training loss: 0.1810 Explore P: 0.0100
Episode: 959 Total reward: -199.0 Training loss: 0.2095 Explore P: 0.0100
Episode: 960 Total reward: -199.0 Training loss: 186.3401 Explore P: 0.0100
Episode: 961 Total reward: -199.0 Training loss: 0.1751 Explore P: 0.0100
Episode: 962 Total reward: -199.0 Training loss: 152.5579 Explore P: 0.0100
Episode: 963 Total reward: -199.0 Training loss: 0.2507 Explore P: 0.0100
Episode: 964 Total reward: -199.0 Training loss: 0.1786 Explore P: 0.0100
Episode: 965 Total reward: -199.0 Training loss: 0.2115 Explore P: 0.0100
Episode: 966 Total reward: -199.0 Training loss: 0.1193 Explore P: 0.0100
Episode: 967 Total reward: -199.0 Training loss: 0.1874 Explore P: 0.0100
Episode: 968 Total reward: -199.0 Training loss: 0.1487 Explore P: 0.0100
Episode: 969 Total reward: -199.0 Training loss: 0.2712 Explore P: 0.0100
Episode: 970 Total reward: -199.0 Training loss: 0.2823 Explore P: 0.0100
Episode: 971 Total reward: -199.0 Training loss: 0.1812 Explore P: 0.0100
Episode: 972 Total reward: -199.0 Training loss: 221.8918 Explore P: 0.0100
Episode: 973 Total reward: -199.0 Training loss: 0.1540 Explore P: 0.0100
Episode: 974 Total reward: -199.0 Training loss: 0.2180 Explore P: 0.0100
Episode: 975 Total reward: -199.0 Training loss: 0.1865 Explore P: 0.0100
Episode: 976 Total reward: -199.0 Training loss: 0.1680 Explore P: 0.0100
Episode: 977 Total reward: -199.0 Training loss: 0.2086 Explore P: 0.0100
Episode: 978 Total reward: -199.0 Training loss: 0.1616 Explore P: 0.0100
Episode: 979 Total reward: -199.0 Training loss: 0.1706 Explore P: 0.0100
Episode: 980 Total reward: -199.0 Training loss: 0.1780 Explore P: 0.0100
Episode: 981 Total reward: -199.0 Training loss: 0.1766 Explore P: 0.0100
Episode: 982 Total reward: -199.0 Training loss: 0.1445 Explore P: 0.0100
Episode: 983 Total reward: -199.0 Training loss: 0.1661 Explore P: 0.0100
Episode: 984 Total reward: -199.0 Training loss: 0.1470 Explore P: 0.0100
Episode: 985 Total reward: -199.0 Training loss: 0.2259 Explore P: 0.0100
Episode: 986 Total reward: -199.0 Training loss: 0.2311 Explore P: 0.0100
Episode: 987 Total reward: -199.0 Training loss: 171.6832 Explore P: 0.0100
Episode: 988 Total reward: -199.0 Training loss: 0.2393 Explore P: 0.0100
Episode: 989 Total reward: -199.0 Training loss: 0.2128 Explore P: 0.0100
Episode: 990 Total reward: -199.0 Training loss: 0.2058 Explore P: 0.0100
Episode: 991 Total reward: -199.0 Training loss: 0.1963 Explore P: 0.0100
Episode: 992 Total reward: -199.0 Training loss: 0.3025 Explore P: 0.0100
Episode: 993 Total reward: -199.0 Training loss: 0.1348 Explore P: 0.0100
Episode: 994 Total reward: -199.0 Training loss: 0.1557 Explore P: 0.0100
Episode: 995 Total reward: -199.0 Training loss: 0.1774 Explore P: 0.0100
Episode: 996 Total reward: -199.0 Training loss: 0.1735 Explore P: 0.0100
Episode: 997 Total reward: -199.0 Training loss: 0.2440 Explore P: 0.0100
Episode: 998 Total reward: -199.0 Training loss: 0.1847 Explore P: 0.0100
Episode: 999 Total reward: -199.0 Training loss: 201.1085 Explore P: 0.0100

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [13]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Mean over every length-N window, computed from differences of the cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
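
To see what the cumulative-sum trick in running_mean does, here is a small check on a toy array. The input values and the comparison against np.convolve are my own illustration, not part of the training run; the point is only that each output entry is the mean of one window of N consecutive values, so the result has len(x) - N + 1 entries.

```python
import numpy as np

def running_mean(x, N):
    # Same helper as in the notebook: windowed mean via cumulative sums
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Toy input (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
N = 3

smoothed = running_mean(x, N)

# Reference: a plain moving average over the windows [1,2,3], [2,3,4], [3,4,5]
reference = np.convolve(x, np.ones(N) / N, mode='valid')

print(smoothed)
print(reference)
```

Because the smoothed array is shorter than the input, the plotting cell uses eps[-len(smoothed_rews):] so that each average lines up with the last episode in its window.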

In [14]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[14]:
<matplotlib.text.Text at 0x1214bad68>

Testing

Let's check out how our trained agent plays the game.


In [1]:
test_episodes = 10
test_max_steps = 400
env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    # Take one random step so that `state` is defined before the first prediction
    state, reward, done, _ = env.step(env.action_space.sample())
    
    for ep in range(1, test_episodes + 1):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-d9d7ecad5f1e> in <module>()
      1 test_episodes = 10
      2 test_max_steps = 400
----> 3 env.reset()
      4 with tf.Session() as sess:
      5     saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))

NameError: name 'env' is not defined
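
The greedy step in the test loop, np.argmax(Qs), can be looked at in isolation. This is a minimal sketch that assumes a made-up array of Q-values for three actions; the numbers are purely illustrative and are not output from the trained network.

```python
import numpy as np

# Hypothetical Q-values for a single state, shaped (1, n_actions)
# like the output of the Q-network for a batch of one state
Qs = np.array([[-41.2, -39.8, -40.5]])

# Greedy policy: choose the action with the largest estimated return.
# np.argmax flattens the array, which is fine for a batch of one.
action = int(np.argmax(Qs))

print(action)
```

During testing there is no exploration, so the agent always takes this highest-valued action.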

In [17]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd want to use convolutional layers to extract the state from the screen images.
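
As a rough idea of what the input to those convolutional layers might look like, here is a sketch of turning raw screen frames into a state array. Everything in it is an assumption for illustration: the helper names preprocess_frame and stack_frames are hypothetical, the frames are random stand-ins with Atari screen dimensions (210 x 160 RGB), and the simple channel averaging and stride-2 downsampling are simplifications rather than the exact preprocessing from the paper.

```python
import numpy as np

def preprocess_frame(frame):
    """Convert an RGB frame (H, W, 3) to a downsampled grayscale image in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2)  # average the colour channels
    small = gray[::2, ::2]                        # keep every second pixel
    return small / 255.0

def stack_frames(frames):
    """Stack several preprocessed frames along the last axis so motion is visible."""
    return np.stack([preprocess_frame(f) for f in frames], axis=-1)

# Random stand-in frames (assumed 210 x 160 RGB screens, not real game output)
rng = np.random.RandomState(0)
frames = [rng.randint(0, 256, size=(210, 160, 3)).astype(np.uint8)
          for _ in range(4)]

state = stack_frames(frames)
print(state.shape)
```

Stacking a few consecutive frames is one way to let the network infer velocities, which a single still image cannot show.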

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.