Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [23]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull its contents into the gym directory.


In [24]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-04-22 21:04:24,512] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, use env.render() to render one frame. Passing an action as an integer to env.step generates the next step in the simulation. env.action_space shows how many actions are possible, and env.action_space.sample() returns a random action. This is general to all Gym games. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [25]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [26]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount applied to future rewards, and $s'$ is the next state reached from state $s$ by taking action $a$.

Before, we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now the Q-value $Q(s, a)$ is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
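
To make the target and the loss concrete, here is a minimal sketch for a single transition. The helper names td_target and squared_error and all of the numbers are made up for illustration; they are not part of the notebook's network code.

```python
def td_target(reward, next_q_values, gamma=0.99):
    """Target Q-hat = r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)


def squared_error(target, q_value):
    """Per-transition loss (Q-hat - Q)^2."""
    return (target - q_value) ** 2


# Hypothetical values for one transition <s, a, r, s'>
reward = 1.0                 # reward observed after taking action a in state s
next_q_values = [2.0, 3.0]   # network output Q(s', .) for the two actions
current_q = 3.5              # network output Q(s, a) for the action taken

target = td_target(reward, next_q_values, gamma=0.99)
loss = squared_error(target, current_q)
print('target:', round(target, 4), 'loss:', round(loss, 4))
```

Only the larger of the two next-state Q-values enters the target, which is why the network needs an output for every action even though a single action was taken.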

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ELU activations (tf.nn.elu in the code). Two layers seem to be good enough; three might be better. Feel free to try it out.


In [27]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')

            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)

            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')

            # ELU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size, activation_fn=tf.nn.elu)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size, activation_fn=tf.nn.elu)
            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)

            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)

            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
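
The one-hot trick used for self.Q can be hard to picture, so here is a small plain-Python sketch of the same idea: multiply each row of Q-values by a one-hot mask for the chosen action, then sum across the row. The functions one_hot and select_q are illustrative stand-ins, not TensorFlow calls, and the numbers are invented.

```python
def one_hot(index, size):
    """Return a list of zeros with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def select_q(q_rows, actions):
    """Pick one Q-value per row: the one belonging to the action taken."""
    selected = []
    for q_row, action in zip(q_rows, actions):
        mask = one_hot(action, len(q_row))
        # Every entry except the chosen action is multiplied by zero
        selected.append(sum(q * m for q, m in zip(q_row, mask)))
    return selected


q_rows = [[0.5, 1.5], [2.0, -1.0], [0.0, 3.0]]  # hypothetical network outputs
actions = [1, 0, 1]                             # actions taken in each row
print(select_q(q_rows, actions))  # [1.5, 2.0, 3.0]
```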

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.
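
As a quick illustration of the tube behavior, here is a tiny sketch with a capacity of three. The name demo_buffer and the string experiences are placeholders for this example only.

```python
from collections import deque

# A deque with maxlen=3 keeps only the three most recent items
demo_buffer = deque(maxlen=3)
for experience in ['e1', 'e2', 'e3', 'e4', 'e5']:
    demo_buffer.append(experience)  # once full, the oldest item drops off the left

print(list(demo_buffer))  # ['e3', 'e4', 'e5']
```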


In [28]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
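
A sampled batch is a list of transition tuples, so it usually has to be split back into separate columns before training. The sketch below shows one way to do that using only the standard library; it uses random.sample in place of np.random.choice, and the names demo_memory and sampled_* as well as the fake transitions are assumptions made for this example.

```python
import random
from collections import deque

demo_memory = deque(maxlen=100)

# Fill the buffer with hypothetical transitions (state, action, reward, next_state)
for t in range(10):
    state = (float(t), 0.0, 0.0, 0.0)
    next_state = (float(t + 1), 0.0, 0.0, 0.0)
    demo_memory.append((state, t % 2, 1.0, next_state))

# Draw 4 distinct transitions at random (sampling without replacement)
demo_batch = random.sample(list(demo_memory), 4)

# zip(*batch) turns a list of tuples into one tuple per field
sampled_states, sampled_actions, sampled_rewards, sampled_next_states = zip(*demo_batch)
print(sampled_actions, sampled_rewards)
```

Because sampling is without replacement, the batch size can't exceed the number of stored experiences, which is one reason to pre-populate the memory before training.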

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ it will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
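
Here is a minimal sketch of $\epsilon$-greedy action selection in plain Python. The function name epsilon_greedy and its rng parameter are assumptions for this example; the Q-values are passed in as a plain list rather than coming from a network.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9]                        # hypothetical Q(s, a) for two actions
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy, so this prints 1
print(epsilon_greedy(q_values, epsilon=1.0))  # always random: 0 or 1
```

Lowering epsilon over the course of training moves the agent from mostly exploring to mostly exploiting.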

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
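
To see the steps above working end to end without TensorFlow or Gym, here is a small sketch on a made-up four-state chain task. It is not the Cart-Pole agent: a lookup table stands in for the Q-network, the table update plays the role of the gradient step, and the environment, the function names step and train, and all parameter values are assumptions chosen for illustration.

```python
import random
from collections import deque

N_STATES = 4        # states 0..3; reaching state 3 ends the episode (toy task)
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Toy chain environment: reaching the last state pays 1.0 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=300, max_steps=50, gamma=0.9, alpha=0.5,
          epsilon=0.5, batch_size=8, seed=0):
    rng = random.Random(seed)
    memory = deque(maxlen=1000)                             # memory D
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]   # table in place of a network
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # With probability epsilon act randomly, otherwise act greedily
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[state][a])
            # Execute the action and store the transition
            next_state, reward, done = step(state, action)
            memory.append((state, action, reward, next_state, done))
            # Sample a random mini-batch and move each Q(s, a) toward its target
            batch = rng.sample(list(memory), min(batch_size, len(memory)))
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(Q[s2])
                Q[s][a] += alpha * (target - Q[s][a])
            if done:
                break
            state = next_state
    return Q


Q = train()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy should prefer moving right in every non-terminal state, since that is the only way to reach the reward.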

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [29]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number experiences to pretrain the memory
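
The three exploration parameters define a decay schedule: the exploration probability starts at explore_start and decays exponentially toward explore_stop as the total step count grows. The sketch below uses the same formula as the training loop further down; the function name explore_probability is made up for this example.

```python
import math


def explore_probability(step, start=1.0, stop=0.01, decay=0.0001):
    """Exploration probability after `step` total training steps."""
    return stop + (start - stop) * math.exp(-decay * step)


# Print the schedule at a few step counts to see how quickly it decays
for step_count in (0, 1000, 10000, 50000):
    print(step_count, round(explore_probability(step_count), 4))
```

With decay_rate = 0.0001 the probability falls slowly, so the agent still explores a good deal after tens of thousands of steps. A larger decay rate shifts it toward exploitation sooner.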

In [13]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [30]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. Rendering slows training down, because frames are drawn more slowly than the network can train. But it's cool to watch the agent get better at the game.


In [32]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
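    step = 0
    # Placeholder so the log line works if an episode ends before the first training step
    loss = float('nan')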
    for ep in range(1, train_episodes + 1):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 8.0 Training loss: 1.0204 Explore P: 0.9992
Episode: 2 Total reward: 8.0 Training loss: 1.0859 Explore P: 0.9984
Episode: 3 Total reward: 43.0 Training loss: 1.1526 Explore P: 0.9942
Episode: 4 Total reward: 13.0 Training loss: 1.0541 Explore P: 0.9929
Episode: 5 Total reward: 15.0 Training loss: 1.2072 Explore P: 0.9914
Episode: 6 Total reward: 39.0 Training loss: 1.1955 Explore P: 0.9876
Episode: 7 Total reward: 8.0 Training loss: 1.0867 Explore P: 0.9868
Episode: 8 Total reward: 9.0 Training loss: 1.2552 Explore P: 0.9859
Episode: 9 Total reward: 8.0 Training loss: 1.0279 Explore P: 0.9852
Episode: 10 Total reward: 24.0 Training loss: 1.1968 Explore P: 0.9828
Episode: 11 Total reward: 13.0 Training loss: 1.3793 Explore P: 0.9816
Episode: 12 Total reward: 9.0 Training loss: 1.3028 Explore P: 0.9807
Episode: 13 Total reward: 12.0 Training loss: 1.0339 Explore P: 0.9795
Episode: 14 Total reward: 10.0 Training loss: 1.1731 Explore P: 0.9786
Episode: 15 Total reward: 30.0 Training loss: 1.2816 Explore P: 0.9757
Episode: 16 Total reward: 11.0 Training loss: 1.3286 Explore P: 0.9746
Episode: 17 Total reward: 38.0 Training loss: 1.1433 Explore P: 0.9709
Episode: 18 Total reward: 34.0 Training loss: 1.2034 Explore P: 0.9677
Episode: 19 Total reward: 37.0 Training loss: 1.0708 Explore P: 0.9641
Episode: 20 Total reward: 11.0 Training loss: 1.0596 Explore P: 0.9631
Episode: 21 Total reward: 20.0 Training loss: 1.0334 Explore P: 0.9612
Episode: 22 Total reward: 47.0 Training loss: 1.2933 Explore P: 0.9567
Episode: 23 Total reward: 17.0 Training loss: 1.1395 Explore P: 0.9551
Episode: 24 Total reward: 34.0 Training loss: 1.1217 Explore P: 0.9519
Episode: 25 Total reward: 14.0 Training loss: 0.9869 Explore P: 0.9506
Episode: 26 Total reward: 41.0 Training loss: 1.3016 Explore P: 0.9467
Episode: 27 Total reward: 36.0 Training loss: 1.3272 Explore P: 0.9434
Episode: 28 Total reward: 28.0 Training loss: 1.1585 Explore P: 0.9408
Episode: 29 Total reward: 20.0 Training loss: 1.4082 Explore P: 0.9389
Episode: 30 Total reward: 14.0 Training loss: 1.1083 Explore P: 0.9376
Episode: 31 Total reward: 15.0 Training loss: 1.1026 Explore P: 0.9362
Episode: 32 Total reward: 11.0 Training loss: 1.2441 Explore P: 0.9352
Episode: 33 Total reward: 18.0 Training loss: 2.3168 Explore P: 0.9335
Episode: 34 Total reward: 52.0 Training loss: 1.1424 Explore P: 0.9287
Episode: 35 Total reward: 14.0 Training loss: 0.9834 Explore P: 0.9275
Episode: 36 Total reward: 40.0 Training loss: 2.6916 Explore P: 0.9238
Episode: 37 Total reward: 20.0 Training loss: 3.7834 Explore P: 0.9220
Episode: 38 Total reward: 27.0 Training loss: 1.1253 Explore P: 0.9195
Episode: 39 Total reward: 20.0 Training loss: 1.1930 Explore P: 0.9177
Episode: 40 Total reward: 11.0 Training loss: 0.9487 Explore P: 0.9167
Episode: 41 Total reward: 9.0 Training loss: 1.1304 Explore P: 0.9159
Episode: 42 Total reward: 14.0 Training loss: 9.0711 Explore P: 0.9146
Episode: 43 Total reward: 10.0 Training loss: 5.0962 Explore P: 0.9137
Episode: 44 Total reward: 10.0 Training loss: 9.4877 Explore P: 0.9128
Episode: 45 Total reward: 20.0 Training loss: 4.5862 Explore P: 0.9110
Episode: 46 Total reward: 11.0 Training loss: 6.3197 Explore P: 0.9100
Episode: 47 Total reward: 14.0 Training loss: 5.3832 Explore P: 0.9088
Episode: 48 Total reward: 15.0 Training loss: 5.2645 Explore P: 0.9074
Episode: 49 Total reward: 20.0 Training loss: 5.9387 Explore P: 0.9056
Episode: 50 Total reward: 21.0 Training loss: 0.8924 Explore P: 0.9037
Episode: 51 Total reward: 36.0 Training loss: 7.7714 Explore P: 0.9005
Episode: 52 Total reward: 10.0 Training loss: 0.9385 Explore P: 0.8996
Episode: 53 Total reward: 11.0 Training loss: 0.9144 Explore P: 0.8987
Episode: 54 Total reward: 13.0 Training loss: 7.7032 Explore P: 0.8975
Episode: 55 Total reward: 14.0 Training loss: 0.8763 Explore P: 0.8963
Episode: 56 Total reward: 13.0 Training loss: 0.8556 Explore P: 0.8951
Episode: 57 Total reward: 10.0 Training loss: 0.9919 Explore P: 0.8942
Episode: 58 Total reward: 35.0 Training loss: 0.8738 Explore P: 0.8911
Episode: 59 Total reward: 20.0 Training loss: 0.8028 Explore P: 0.8894
Episode: 60 Total reward: 16.0 Training loss: 23.4595 Explore P: 0.8880
Episode: 61 Total reward: 62.0 Training loss: 12.6409 Explore P: 0.8825
Episode: 62 Total reward: 13.0 Training loss: 0.9160 Explore P: 0.8814
Episode: 63 Total reward: 73.0 Training loss: 13.1499 Explore P: 0.8751
Episode: 64 Total reward: 22.0 Training loss: 12.6787 Explore P: 0.8732
Episode: 65 Total reward: 24.0 Training loss: 26.3408 Explore P: 0.8711
Episode: 66 Total reward: 26.0 Training loss: 1.0181 Explore P: 0.8689
Episode: 67 Total reward: 30.0 Training loss: 29.2791 Explore P: 0.8663
Episode: 68 Total reward: 22.0 Training loss: 19.3929 Explore P: 0.8644
Episode: 69 Total reward: 20.0 Training loss: 1.1861 Explore P: 0.8627
Episode: 70 Total reward: 15.0 Training loss: 34.5604 Explore P: 0.8614
Episode: 71 Total reward: 13.0 Training loss: 30.9300 Explore P: 0.8603
Episode: 72 Total reward: 13.0 Training loss: 38.6928 Explore P: 0.8592
Episode: 73 Total reward: 10.0 Training loss: 19.2172 Explore P: 0.8584
Episode: 74 Total reward: 20.0 Training loss: 38.1172 Explore P: 0.8567
Episode: 75 Total reward: 14.0 Training loss: 16.5710 Explore P: 0.8555
Episode: 76 Total reward: 29.0 Training loss: 0.8035 Explore P: 0.8530
Episode: 77 Total reward: 9.0 Training loss: 36.8473 Explore P: 0.8523
Episode: 78 Total reward: 13.0 Training loss: 20.9793 Explore P: 0.8512
Episode: 79 Total reward: 27.0 Training loss: 40.7055 Explore P: 0.8489
Episode: 80 Total reward: 17.0 Training loss: 1.0190 Explore P: 0.8475
Episode: 81 Total reward: 13.0 Training loss: 0.9315 Explore P: 0.8464
Episode: 82 Total reward: 15.0 Training loss: 0.9954 Explore P: 0.8451
Episode: 83 Total reward: 60.0 Training loss: 17.1273 Explore P: 0.8401
Episode: 84 Total reward: 21.0 Training loss: 0.6857 Explore P: 0.8384
Episode: 85 Total reward: 16.0 Training loss: 41.6072 Explore P: 0.8371
Episode: 86 Total reward: 15.0 Training loss: 22.3827 Explore P: 0.8358
Episode: 87 Total reward: 65.0 Training loss: 22.1762 Explore P: 0.8305
Episode: 88 Total reward: 20.0 Training loss: 0.9099 Explore P: 0.8289
Episode: 89 Total reward: 37.0 Training loss: 22.2153 Explore P: 0.8258
Episode: 90 Total reward: 26.0 Training loss: 21.5167 Explore P: 0.8237
Episode: 91 Total reward: 27.0 Training loss: 22.0183 Explore P: 0.8215
Episode: 92 Total reward: 54.0 Training loss: 24.3114 Explore P: 0.8171
Episode: 93 Total reward: 42.0 Training loss: 0.7780 Explore P: 0.8138
Episode: 94 Total reward: 19.0 Training loss: 22.3843 Explore P: 0.8122
Episode: 95 Total reward: 11.0 Training loss: 25.1736 Explore P: 0.8114
Episode: 96 Total reward: 20.0 Training loss: 24.9336 Explore P: 0.8098
Episode: 97 Total reward: 13.0 Training loss: 24.9150 Explore P: 0.8087
Episode: 98 Total reward: 71.0 Training loss: 0.8418 Explore P: 0.8031
Episode: 99 Total reward: 13.0 Training loss: 0.7291 Explore P: 0.8020
Episode: 100 Total reward: 47.0 Training loss: 24.2022 Explore P: 0.7983
Episode: 101 Total reward: 23.0 Training loss: 23.7120 Explore P: 0.7965
Episode: 102 Total reward: 44.0 Training loss: 45.1353 Explore P: 0.7931
Episode: 103 Total reward: 30.0 Training loss: 0.7672 Explore P: 0.7907
Episode: 104 Total reward: 43.0 Training loss: 0.7378 Explore P: 0.7874
Episode: 105 Total reward: 39.0 Training loss: 23.4430 Explore P: 0.7843
Episode: 106 Total reward: 12.0 Training loss: 0.7848 Explore P: 0.7834
Episode: 107 Total reward: 14.0 Training loss: 26.4749 Explore P: 0.7823
Episode: 108 Total reward: 47.0 Training loss: 0.7766 Explore P: 0.7787
Episode: 109 Total reward: 15.0 Training loss: 0.6827 Explore P: 0.7776
Episode: 110 Total reward: 33.0 Training loss: 0.6760 Explore P: 0.7750
Episode: 111 Total reward: 24.0 Training loss: 0.6785 Explore P: 0.7732
Episode: 112 Total reward: 31.0 Training loss: 53.4103 Explore P: 0.7708
Episode: 113 Total reward: 30.0 Training loss: 26.9520 Explore P: 0.7685
Episode: 114 Total reward: 48.0 Training loss: 0.6371 Explore P: 0.7649
Episode: 115 Total reward: 61.0 Training loss: 27.5290 Explore P: 0.7603
Episode: 116 Total reward: 18.0 Training loss: 0.6383 Explore P: 0.7590
Episode: 117 Total reward: 30.0 Training loss: 28.0261 Explore P: 0.7567
Episode: 118 Total reward: 52.0 Training loss: 0.6961 Explore P: 0.7529
Episode: 119 Total reward: 50.0 Training loss: 28.4170 Explore P: 0.7492
Episode: 120 Total reward: 29.0 Training loss: 29.2863 Explore P: 0.7470
Episode: 121 Total reward: 15.0 Training loss: 0.9167 Explore P: 0.7459
Episode: 122 Total reward: 41.0 Training loss: 26.6380 Explore P: 0.7429
Episode: 123 Total reward: 12.0 Training loss: 0.8303 Explore P: 0.7420
Episode: 124 Total reward: 16.0 Training loss: 0.9824 Explore P: 0.7408
Episode: 125 Total reward: 14.0 Training loss: 0.8126 Explore P: 0.7398
Episode: 126 Total reward: 15.0 Training loss: 27.1164 Explore P: 0.7387
Episode: 127 Total reward: 20.0 Training loss: 53.3544 Explore P: 0.7373
Episode: 128 Total reward: 16.0 Training loss: 0.8118 Explore P: 0.7361
Episode: 129 Total reward: 34.0 Training loss: 27.5105 Explore P: 0.7336
Episode: 130 Total reward: 25.0 Training loss: 0.6521 Explore P: 0.7318
Episode: 131 Total reward: 40.0 Training loss: 28.8377 Explore P: 0.7290
Episode: 132 Total reward: 37.0 Training loss: 0.6969 Explore P: 0.7263
Episode: 133 Total reward: 49.0 Training loss: 26.8942 Explore P: 0.7228
Episode: 134 Total reward: 36.0 Training loss: 0.6962 Explore P: 0.7202
Episode: 135 Total reward: 29.0 Training loss: 29.1698 Explore P: 0.7182
Episode: 136 Total reward: 40.0 Training loss: 27.6288 Explore P: 0.7154
Episode: 137 Total reward: 23.0 Training loss: 20.1060 Explore P: 0.7137
Episode: 138 Total reward: 24.0 Training loss: 0.7833 Explore P: 0.7121
Episode: 139 Total reward: 26.0 Training loss: 54.2095 Explore P: 0.7102
Episode: 140 Total reward: 28.0 Training loss: 0.6870 Explore P: 0.7083
Episode: 141 Total reward: 25.0 Training loss: 0.6927 Explore P: 0.7065
Episode: 142 Total reward: 10.0 Training loss: 27.4986 Explore P: 0.7058
Episode: 143 Total reward: 23.0 Training loss: 0.6401 Explore P: 0.7042
Episode: 144 Total reward: 40.0 Training loss: 24.2329 Explore P: 0.7015
Episode: 145 Total reward: 61.0 Training loss: 0.7725 Explore P: 0.6973
Episode: 146 Total reward: 23.0 Training loss: 0.7294 Explore P: 0.6957
Episode: 147 Total reward: 29.0 Training loss: 75.1497 Explore P: 0.6937
Episode: 148 Total reward: 34.0 Training loss: 31.9683 Explore P: 0.6914
Episode: 149 Total reward: 48.0 Training loss: 0.8060 Explore P: 0.6881
Episode: 150 Total reward: 41.0 Training loss: 0.5530 Explore P: 0.6853
Episode: 151 Total reward: 44.0 Training loss: 54.6262 Explore P: 0.6824
Episode: 152 Total reward: 11.0 Training loss: 0.5748 Explore P: 0.6816
Episode: 153 Total reward: 48.0 Training loss: 33.2155 Explore P: 0.6784
Episode: 154 Total reward: 26.0 Training loss: 33.0692 Explore P: 0.6767
Episode: 155 Total reward: 60.0 Training loss: 57.6767 Explore P: 0.6727
Episode: 156 Total reward: 71.0 Training loss: 96.9794 Explore P: 0.6680
Episode: 157 Total reward: 54.0 Training loss: 62.5053 Explore P: 0.6645
Episode: 158 Total reward: 40.0 Training loss: 0.5615 Explore P: 0.6618
Episode: 159 Total reward: 67.0 Training loss: 23.5234 Explore P: 0.6575
Episode: 160 Total reward: 42.0 Training loss: 32.5320 Explore P: 0.6548
Episode: 161 Total reward: 55.0 Training loss: 32.4685 Explore P: 0.6512
Episode: 162 Total reward: 52.0 Training loss: 25.3897 Explore P: 0.6479
Episode: 163 Total reward: 65.0 Training loss: 33.1978 Explore P: 0.6438
Episode: 164 Total reward: 24.0 Training loss: 0.7051 Explore P: 0.6423
Episode: 165 Total reward: 36.0 Training loss: 24.9944 Explore P: 0.6400
Episode: 166 Total reward: 46.0 Training loss: 0.7812 Explore P: 0.6371
Episode: 167 Total reward: 26.0 Training loss: 0.9497 Explore P: 0.6355
Episode: 168 Total reward: 68.0 Training loss: 0.6880 Explore P: 0.6312
Episode: 169 Total reward: 31.0 Training loss: 0.9609 Explore P: 0.6293
Episode: 170 Total reward: 61.0 Training loss: 0.8278 Explore P: 0.6255
Episode: 171 Total reward: 59.0 Training loss: 0.9248 Explore P: 0.6219
Episode: 172 Total reward: 45.0 Training loss: 0.5173 Explore P: 0.6192
Episode: 173 Total reward: 28.0 Training loss: 0.6155 Explore P: 0.6175
Episode: 174 Total reward: 32.0 Training loss: 0.7597 Explore P: 0.6155
Episode: 175 Total reward: 19.0 Training loss: 0.8004 Explore P: 0.6144
Episode: 176 Total reward: 45.0 Training loss: 1.1509 Explore P: 0.6117
Episode: 177 Total reward: 60.0 Training loss: 1.0419 Explore P: 0.6081
Episode: 178 Total reward: 38.0 Training loss: 37.6007 Explore P: 0.6058
Episode: 179 Total reward: 26.0 Training loss: 39.3858 Explore P: 0.6043
Episode: 180 Total reward: 22.0 Training loss: 0.9429 Explore P: 0.6029
Episode: 181 Total reward: 38.0 Training loss: 1.0202 Explore P: 0.6007
Episode: 182 Total reward: 34.0 Training loss: 1.1672 Explore P: 0.5987
Episode: 183 Total reward: 24.0 Training loss: 22.7279 Explore P: 0.5973
Episode: 184 Total reward: 14.0 Training loss: 0.9491 Explore P: 0.5965
Episode: 185 Total reward: 32.0 Training loss: 0.6062 Explore P: 0.5946
Episode: 186 Total reward: 37.0 Training loss: 1.0453 Explore P: 0.5924
Episode: 187 Total reward: 60.0 Training loss: 0.7647 Explore P: 0.5889
Episode: 188 Total reward: 12.0 Training loss: 0.5757 Explore P: 0.5882
Episode: 189 Total reward: 26.0 Training loss: 67.3478 Explore P: 0.5867
Episode: 190 Total reward: 31.0 Training loss: 41.1696 Explore P: 0.5850
Episode: 191 Total reward: 59.0 Training loss: 0.7647 Explore P: 0.5816
Episode: 192 Total reward: 37.0 Training loss: 0.9386 Explore P: 0.5795
Episode: 193 Total reward: 47.0 Training loss: 0.6044 Explore P: 0.5768
Episode: 194 Total reward: 50.0 Training loss: 0.7293 Explore P: 0.5740
Episode: 195 Total reward: 17.0 Training loss: 0.8639 Explore P: 0.5730
Episode: 196 Total reward: 27.0 Training loss: 39.6886 Explore P: 0.5715
Episode: 197 Total reward: 68.0 Training loss: 0.7571 Explore P: 0.5677
Episode: 198 Total reward: 167.0 Training loss: 0.8458 Explore P: 0.5585
Episode: 199 Total reward: 54.0 Training loss: 0.9568 Explore P: 0.5555
Episode: 200 Total reward: 55.0 Training loss: 46.3290 Explore P: 0.5525
Episode: 201 Total reward: 63.0 Training loss: 35.8956 Explore P: 0.5491
Episode: 202 Total reward: 14.0 Training loss: 24.4381 Explore P: 0.5483
Episode: 203 Total reward: 72.0 Training loss: 0.7917 Explore P: 0.5445
Episode: 204 Total reward: 30.0 Training loss: 19.7669 Explore P: 0.5429
Episode: 205 Total reward: 29.0 Training loss: 118.0056 Explore P: 0.5413
Episode: 206 Total reward: 23.0 Training loss: 1.2818 Explore P: 0.5401
Episode: 207 Total reward: 61.0 Training loss: 0.9182 Explore P: 0.5369
Episode: 208 Total reward: 63.0 Training loss: 1.0643 Explore P: 0.5336
Episode: 209 Total reward: 54.0 Training loss: 48.7055 Explore P: 0.5308
Episode: 210 Total reward: 37.0 Training loss: 1.5199 Explore P: 0.5288
Episode: 211 Total reward: 29.0 Training loss: 67.9208 Explore P: 0.5273
Episode: 212 Total reward: 21.0 Training loss: 1.5846 Explore P: 0.5263
Episode: 213 Total reward: 51.0 Training loss: 19.8461 Explore P: 0.5236
Episode: 214 Total reward: 9.0 Training loss: 0.4090 Explore P: 0.5232
Episode: 215 Total reward: 34.0 Training loss: 1.5298 Explore P: 0.5214
Episode: 216 Total reward: 39.0 Training loss: 0.6481 Explore P: 0.5194
Episode: 217 Total reward: 55.0 Training loss: 0.9454 Explore P: 0.5166
Episode: 218 Total reward: 23.0 Training loss: 1.8787 Explore P: 0.5155
Episode: 219 Total reward: 42.0 Training loss: 1.0642 Explore P: 0.5134
Episode: 220 Total reward: 42.0 Training loss: 32.4379 Explore P: 0.5112
Episode: 221 Total reward: 31.0 Training loss: 90.3222 Explore P: 0.5097
Episode: 222 Total reward: 43.0 Training loss: 1.8235 Explore P: 0.5076
Episode: 223 Total reward: 62.0 Training loss: 49.6423 Explore P: 0.5045
Episode: 224 Total reward: 25.0 Training loss: 1.2363 Explore P: 0.5032
Episode: 225 Total reward: 68.0 Training loss: 1.4993 Explore P: 0.4999
Episode: 226 Total reward: 51.0 Training loss: 0.8124 Explore P: 0.4974
Episode: 227 Total reward: 46.0 Training loss: 1.2313 Explore P: 0.4952
Episode: 228 Total reward: 34.0 Training loss: 22.7233 Explore P: 0.4935
Episode: 229 Total reward: 27.0 Training loss: 2.0683 Explore P: 0.4922
Episode: 230 Total reward: 21.0 Training loss: 1.7608 Explore P: 0.4912
Episode: 231 Total reward: 32.0 Training loss: 0.9545 Explore P: 0.4897
Episode: 232 Total reward: 27.0 Training loss: 3.0192 Explore P: 0.4884
Episode: 233 Total reward: 20.0 Training loss: 85.9907 Explore P: 0.4874
Episode: 234 Total reward: 50.0 Training loss: 1.3304 Explore P: 0.4850
Episode: 235 Total reward: 47.0 Training loss: 1.4128 Explore P: 0.4828
Episode: 236 Total reward: 45.0 Training loss: 32.4992 Explore P: 0.4807
Episode: 237 Total reward: 37.0 Training loss: 39.6504 Explore P: 0.4790
Episode: 238 Total reward: 26.0 Training loss: 2.2041 Explore P: 0.4777
Episode: 239 Total reward: 68.0 Training loss: 33.0484 Explore P: 0.4746
Episode: 240 Total reward: 49.0 Training loss: 50.8525 Explore P: 0.4723
Episode: 241 Total reward: 94.0 Training loss: 31.9860 Explore P: 0.4680
Episode: 242 Total reward: 95.0 Training loss: 0.8183 Explore P: 0.4636
Episode: 243 Total reward: 50.0 Training loss: 85.4165 Explore P: 0.4614
Episode: 244 Total reward: 69.0 Training loss: 25.9743 Explore P: 0.4583
Episode: 245 Total reward: 59.0 Training loss: 45.6854 Explore P: 0.4556
Episode: 246 Total reward: 93.0 Training loss: 1.7620 Explore P: 0.4515
Episode: 247 Total reward: 59.0 Training loss: 115.9948 Explore P: 0.4489
Episode: 248 Total reward: 32.0 Training loss: 37.7877 Explore P: 0.4475
Episode: 249 Total reward: 27.0 Training loss: 1.8649 Explore P: 0.4463
Episode: 250 Total reward: 58.0 Training loss: 43.3030 Explore P: 0.4438
Episode: 251 Total reward: 64.0 Training loss: 2.9322 Explore P: 0.4410
Episode: 252 Total reward: 41.0 Training loss: 44.3996 Explore P: 0.4393
Episode: 253 Total reward: 58.0 Training loss: 58.5384 Explore P: 0.4368
Episode: 254 Total reward: 80.0 Training loss: 95.9258 Explore P: 0.4334
Episode: 255 Total reward: 39.0 Training loss: 1.7450 Explore P: 0.4317
Episode: 256 Total reward: 17.0 Training loss: 3.9076 Explore P: 0.4310
Episode: 257 Total reward: 51.0 Training loss: 11.2743 Explore P: 0.4289
Episode: 258 Total reward: 47.0 Training loss: 2.6967 Explore P: 0.4269
Episode: 259 Total reward: 40.0 Training loss: 43.4216 Explore P: 0.4253
Episode: 260 Total reward: 51.0 Training loss: 55.9784 Explore P: 0.4231
Episode: 261 Total reward: 80.0 Training loss: 92.0249 Explore P: 0.4199
Episode: 262 Total reward: 54.0 Training loss: 0.7693 Explore P: 0.4176
Episode: 263 Total reward: 83.0 Training loss: 61.1533 Explore P: 0.4143
Episode: 264 Total reward: 37.0 Training loss: 68.8414 Explore P: 0.4128
Episode: 265 Total reward: 48.0 Training loss: 51.3700 Explore P: 0.4109
Episode: 266 Total reward: 59.0 Training loss: 4.0606 Explore P: 0.4085
Episode: 267 Total reward: 43.0 Training loss: 4.0006 Explore P: 0.4068
Episode: 268 Total reward: 14.0 Training loss: 4.6051 Explore P: 0.4062
Episode: 269 Total reward: 23.0 Training loss: 3.2992 Explore P: 0.4053
Episode: 270 Total reward: 56.0 Training loss: 24.4933 Explore P: 0.4031
Episode: 271 Total reward: 47.0 Training loss: 2.5349 Explore P: 0.4013
Episode: 272 Total reward: 41.0 Training loss: 1.4561 Explore P: 0.3997
Episode: 273 Total reward: 63.0 Training loss: 4.0104 Explore P: 0.3972
Episode: 274 Total reward: 129.0 Training loss: 60.7941 Explore P: 0.3923
Episode: 275 Total reward: 54.0 Training loss: 115.9683 Explore P: 0.3902
Episode: 276 Total reward: 27.0 Training loss: 3.7012 Explore P: 0.3892
Episode: 277 Total reward: 50.0 Training loss: 50.3574 Explore P: 0.3873
Episode: 278 Total reward: 12.0 Training loss: 6.3980 Explore P: 0.3868
Episode: 279 Total reward: 89.0 Training loss: 41.9555 Explore P: 0.3835
Episode: 280 Total reward: 47.0 Training loss: 1.2281 Explore P: 0.3817
Episode: 281 Total reward: 47.0 Training loss: 2.8331 Explore P: 0.3800
Episode: 282 Total reward: 29.0 Training loss: 118.5354 Explore P: 0.3789
Episode: 283 Total reward: 69.0 Training loss: 6.3463 Explore P: 0.3764
Episode: 284 Total reward: 96.0 Training loss: 42.6366 Explore P: 0.3729
Episode: 285 Total reward: 108.0 Training loss: 15.4436 Explore P: 0.3690
Episode: 286 Total reward: 89.0 Training loss: 51.2837 Explore P: 0.3658
Episode: 287 Total reward: 142.0 Training loss: 80.7182 Explore P: 0.3608
Episode: 288 Total reward: 99.0 Training loss: 4.8525 Explore P: 0.3573
Episode: 289 Total reward: 93.0 Training loss: 3.6874 Explore P: 0.3541
Episode: 290 Total reward: 71.0 Training loss: 2.1668 Explore P: 0.3517
Episode: 291 Total reward: 199.0 Training loss: 47.5472 Explore P: 0.3450
Episode: 292 Total reward: 125.0 Training loss: 4.0496 Explore P: 0.3408
Episode: 293 Total reward: 95.0 Training loss: 4.6975 Explore P: 0.3377
Episode: 294 Total reward: 53.0 Training loss: 3.5154 Explore P: 0.3359
Episode: 295 Total reward: 102.0 Training loss: 1.7590 Explore P: 0.3326
Episode: 296 Total reward: 140.0 Training loss: 2.4827 Explore P: 0.3281
Episode: 297 Total reward: 137.0 Training loss: 2.1178 Explore P: 0.3238
Episode: 298 Total reward: 137.0 Training loss: 1.7693 Explore P: 0.3195
Episode: 299 Total reward: 131.0 Training loss: 1.7156 Explore P: 0.3155
Episode: 300 Total reward: 132.0 Training loss: 132.0682 Explore P: 0.3115
Episode: 301 Total reward: 180.0 Training loss: 3.1649 Explore P: 0.3061
Episode: 302 Total reward: 159.0 Training loss: 77.1155 Explore P: 0.3015
Episode: 303 Total reward: 130.0 Training loss: 2.7920 Explore P: 0.2977
Episode: 304 Total reward: 155.0 Training loss: 1.7919 Explore P: 0.2933
Episode: 305 Total reward: 134.0 Training loss: 70.2758 Explore P: 0.2895
Episode: 306 Total reward: 131.0 Training loss: 3.9819 Explore P: 0.2859
Episode: 307 Total reward: 199.0 Training loss: 1.9613 Explore P: 0.2804
Episode: 308 Total reward: 199.0 Training loss: 2.6305 Explore P: 0.2751
Episode: 309 Total reward: 125.0 Training loss: 2.3936 Explore P: 0.2718
Episode: 310 Total reward: 199.0 Training loss: 2.0042 Explore P: 0.2666
Episode: 311 Total reward: 166.0 Training loss: 4.6888 Explore P: 0.2624
Episode: 312 Total reward: 156.0 Training loss: 3.8886 Explore P: 0.2585
Episode: 313 Total reward: 199.0 Training loss: 0.7918 Explore P: 0.2536
Episode: 314 Total reward: 199.0 Training loss: 4.9219 Explore P: 0.2488
Episode: 315 Total reward: 147.0 Training loss: 2.4732 Explore P: 0.2453
Episode: 316 Total reward: 199.0 Training loss: 1.7165 Explore P: 0.2407
Episode: 317 Total reward: 199.0 Training loss: 2.7628 Explore P: 0.2362
Episode: 318 Total reward: 199.0 Training loss: 1.5545 Explore P: 0.2317
Episode: 319 Total reward: 199.0 Training loss: 241.3593 Explore P: 0.2273
Episode: 320 Total reward: 199.0 Training loss: 2.4671 Explore P: 0.2230
Episode: 321 Total reward: 199.0 Training loss: 1.1726 Explore P: 0.2188
Episode: 322 Total reward: 199.0 Training loss: 207.3331 Explore P: 0.2147
Episode: 323 Total reward: 199.0 Training loss: 2.1245 Explore P: 0.2107
Episode: 324 Total reward: 199.0 Training loss: 170.8571 Explore P: 0.2067
Episode: 325 Total reward: 199.0 Training loss: 1.7720 Explore P: 0.2029
Episode: 326 Total reward: 199.0 Training loss: 2.8243 Explore P: 0.1991
Episode: 327 Total reward: 199.0 Training loss: 1.4296 Explore P: 0.1953
Episode: 328 Total reward: 199.0 Training loss: 0.8294 Explore P: 0.1917
Episode: 329 Total reward: 199.0 Training loss: 0.7705 Explore P: 0.1881
Episode: 330 Total reward: 199.0 Training loss: 1.4468 Explore P: 0.1846
Episode: 331 Total reward: 199.0 Training loss: 2.1686 Explore P: 0.1812
Episode: 332 Total reward: 199.0 Training loss: 1.3349 Explore P: 0.1778
Episode: 333 Total reward: 199.0 Training loss: 2.1375 Explore P: 0.1745
Episode: 334 Total reward: 199.0 Training loss: 2.3702 Explore P: 0.1712
Episode: 335 Total reward: 199.0 Training loss: 0.4778 Explore P: 0.1681
Episode: 336 Total reward: 199.0 Training loss: 1.1306 Explore P: 0.1650
Episode: 337 Total reward: 199.0 Training loss: 1.2535 Explore P: 0.1619
Episode: 338 Total reward: 199.0 Training loss: 269.0118 Explore P: 0.1589
Episode: 339 Total reward: 199.0 Training loss: 1.0376 Explore P: 0.1560
Episode: 340 Total reward: 199.0 Training loss: 0.3805 Explore P: 0.1531
Episode: 341 Total reward: 199.0 Training loss: 279.3161 Explore P: 0.1503
Episode: 342 Total reward: 199.0 Training loss: 295.7319 Explore P: 0.1475
Episode: 343 Total reward: 199.0 Training loss: 1.3953 Explore P: 0.1448
Episode: 344 Total reward: 199.0 Training loss: 0.3477 Explore P: 0.1421
Episode: 345 Total reward: 193.0 Training loss: 1.2232 Explore P: 0.1396
Episode: 346 Total reward: 199.0 Training loss: 0.4151 Explore P: 0.1371
Episode: 347 Total reward: 199.0 Training loss: 0.5475 Explore P: 0.1346
Episode: 348 Total reward: 199.0 Training loss: 0.1626 Explore P: 0.1321
Episode: 349 Total reward: 197.0 Training loss: 0.3579 Explore P: 0.1297
Episode: 350 Total reward: 150.0 Training loss: 0.2820 Explore P: 0.1279
Episode: 351 Total reward: 168.0 Training loss: 0.5214 Explore P: 0.1260
Episode: 352 Total reward: 181.0 Training loss: 0.8092 Explore P: 0.1239
Episode: 353 Total reward: 167.0 Training loss: 0.1503 Explore P: 0.1220
Episode: 354 Total reward: 141.0 Training loss: 0.3171 Explore P: 0.1204
Episode: 355 Total reward: 175.0 Training loss: 0.3282 Explore P: 0.1185
Episode: 356 Total reward: 146.0 Training loss: 0.5879 Explore P: 0.1170
Episode: 357 Total reward: 174.0 Training loss: 0.5431 Explore P: 0.1151
Episode: 358 Total reward: 149.0 Training loss: 0.3499 Explore P: 0.1136
Episode: 359 Total reward: 155.0 Training loss: 0.3705 Explore P: 0.1120
Episode: 360 Total reward: 174.0 Training loss: 0.4063 Explore P: 0.1102
Episode: 361 Total reward: 154.0 Training loss: 0.4857 Explore P: 0.1087
Episode: 362 Total reward: 199.0 Training loss: 0.3761 Explore P: 0.1067
Episode: 363 Total reward: 182.0 Training loss: 0.1734 Explore P: 0.1050
Episode: 364 Total reward: 199.0 Training loss: 0.3836 Explore P: 0.1031
Episode: 365 Total reward: 199.0 Training loss: 23.5343 Explore P: 0.1013
Episode: 366 Total reward: 199.0 Training loss: 163.4232 Explore P: 0.0995
Episode: 367 Total reward: 184.0 Training loss: 0.5552 Explore P: 0.0978
Episode: 368 Total reward: 170.0 Training loss: 0.2203 Explore P: 0.0964
Episode: 369 Total reward: 199.0 Training loss: 0.3429 Explore P: 0.0947
Episode: 370 Total reward: 175.0 Training loss: 0.2297 Explore P: 0.0932
Episode: 371 Total reward: 162.0 Training loss: 0.1963 Explore P: 0.0919
Episode: 372 Total reward: 195.0 Training loss: 0.3936 Explore P: 0.0903
Episode: 373 Total reward: 182.0 Training loss: 0.5204 Explore P: 0.0888
Episode: 374 Total reward: 194.0 Training loss: 0.2908 Explore P: 0.0873
Episode: 375 Total reward: 187.0 Training loss: 0.1896 Explore P: 0.0859
Episode: 376 Total reward: 199.0 Training loss: 0.2128 Explore P: 0.0844
Episode: 377 Total reward: 192.0 Training loss: 0.3060 Explore P: 0.0830
Episode: 378 Total reward: 181.0 Training loss: 0.3358 Explore P: 0.0817
Episode: 379 Total reward: 199.0 Training loss: 0.1381 Explore P: 0.0803
Episode: 380 Total reward: 199.0 Training loss: 0.1148 Explore P: 0.0789
Episode: 381 Total reward: 199.0 Training loss: 0.5321 Explore P: 0.0775
Episode: 382 Total reward: 199.0 Training loss: 3.9040 Explore P: 0.0762
Episode: 383 Total reward: 197.0 Training loss: 0.1373 Explore P: 0.0749
Episode: 384 Total reward: 199.0 Training loss: 0.1381 Explore P: 0.0736
Episode: 385 Total reward: 191.0 Training loss: 0.2227 Explore P: 0.0724
Episode: 386 Total reward: 182.0 Training loss: 0.2101 Explore P: 0.0713
Episode: 387 Total reward: 165.0 Training loss: 0.2979 Explore P: 0.0703
Episode: 388 Total reward: 166.0 Training loss: 0.1507 Explore P: 0.0693
Episode: 389 Total reward: 184.0 Training loss: 0.1404 Explore P: 0.0682
Episode: 390 Total reward: 188.0 Training loss: 0.3353 Explore P: 0.0671
Episode: 391 Total reward: 178.0 Training loss: 0.1795 Explore P: 0.0661
Episode: 392 Total reward: 199.0 Training loss: 0.0793 Explore P: 0.0650
Episode: 393 Total reward: 199.0 Training loss: 0.1009 Explore P: 0.0639
Episode: 394 Total reward: 199.0 Training loss: 0.0890 Explore P: 0.0629
Episode: 395 Total reward: 199.0 Training loss: 0.3710 Explore P: 0.0618
Episode: 396 Total reward: 199.0 Training loss: 12.1209 Explore P: 0.0608
Episode: 397 Total reward: 199.0 Training loss: 0.1645 Explore P: 0.0598
Episode: 398 Total reward: 199.0 Training loss: 0.0638 Explore P: 0.0588
Episode: 399 Total reward: 199.0 Training loss: 0.0929 Explore P: 0.0579
Episode: 400 Total reward: 199.0 Training loss: 0.0536 Explore P: 0.0569
Episode: 401 Total reward: 199.0 Training loss: 0.1208 Explore P: 0.0560
Episode: 402 Total reward: 199.0 Training loss: 0.3963 Explore P: 0.0551
Episode: 403 Total reward: 199.0 Training loss: 23.3142 Explore P: 0.0542
Episode: 404 Total reward: 199.0 Training loss: 0.0554 Explore P: 0.0533
Episode: 405 Total reward: 199.0 Training loss: 0.0982 Explore P: 0.0525
Episode: 406 Total reward: 199.0 Training loss: 0.0246 Explore P: 0.0516
Episode: 407 Total reward: 184.0 Training loss: 0.0798 Explore P: 0.0509
Episode: 408 Total reward: 199.0 Training loss: 0.2834 Explore P: 0.0501
Episode: 409 Total reward: 199.0 Training loss: 0.0684 Explore P: 0.0493
Episode: 410 Total reward: 199.0 Training loss: 49.0025 Explore P: 0.0485
Episode: 411 Total reward: 199.0 Training loss: 0.0639 Explore P: 0.0477
Episode: 412 Total reward: 199.0 Training loss: 0.0681 Explore P: 0.0470
Episode: 413 Total reward: 199.0 Training loss: 0.0310 Explore P: 0.0463
Episode: 414 Total reward: 199.0 Training loss: 0.0890 Explore P: 0.0456
Episode: 415 Total reward: 199.0 Training loss: 1.7221 Explore P: 0.0449
Episode: 416 Total reward: 199.0 Training loss: 0.0717 Explore P: 0.0442
Episode: 417 Total reward: 199.0 Training loss: 0.0930 Explore P: 0.0435
Episode: 418 Total reward: 199.0 Training loss: 0.0288 Explore P: 0.0428
Episode: 419 Total reward: 199.0 Training loss: 0.0135 Explore P: 0.0422
Episode: 420 Total reward: 199.0 Training loss: 0.3659 Explore P: 0.0416
Episode: 421 Total reward: 199.0 Training loss: 0.0705 Explore P: 0.0409
Episode: 422 Total reward: 199.0 Training loss: 0.0188 Explore P: 0.0403
Episode: 423 Total reward: 199.0 Training loss: 0.2994 Explore P: 0.0397
Episode: 424 Total reward: 199.0 Training loss: 0.0652 Explore P: 0.0391
Episode: 425 Total reward: 199.0 Training loss: 0.0565 Explore P: 0.0386
Episode: 426 Total reward: 199.0 Training loss: 13.9379 Explore P: 0.0380
Episode: 427 Total reward: 199.0 Training loss: 0.4545 Explore P: 0.0375
Episode: 428 Total reward: 199.0 Training loss: 0.0748 Explore P: 0.0369
Episode: 429 Total reward: 199.0 Training loss: 0.1284 Explore P: 0.0364
Episode: 430 Total reward: 199.0 Training loss: 0.0243 Explore P: 0.0359
Episode: 431 Total reward: 199.0 Training loss: 0.0316 Explore P: 0.0354
Episode: 432 Total reward: 199.0 Training loss: 26.3495 Explore P: 0.0349
Episode: 433 Total reward: 199.0 Training loss: 0.0325 Explore P: 0.0344
Episode: 434 Total reward: 199.0 Training loss: 0.0325 Explore P: 0.0339
Episode: 435 Total reward: 199.0 Training loss: 0.0518 Explore P: 0.0334
Episode: 436 Total reward: 199.0 Training loss: 0.0921 Explore P: 0.0330
Episode: 437 Total reward: 199.0 Training loss: 0.1519 Explore P: 0.0325
Episode: 438 Total reward: 199.0 Training loss: 0.0935 Explore P: 0.0321
Episode: 439 Total reward: 199.0 Training loss: 0.0737 Explore P: 0.0316
Episode: 440 Total reward: 199.0 Training loss: 150.1964 Explore P: 0.0312
Episode: 441 Total reward: 199.0 Training loss: 0.0316 Explore P: 0.0308
Episode: 442 Total reward: 199.0 Training loss: 0.0338 Explore P: 0.0304
Episode: 443 Total reward: 199.0 Training loss: 0.0354 Explore P: 0.0300
Episode: 444 Total reward: 199.0 Training loss: 0.0311 Explore P: 0.0296
Episode: 445 Total reward: 199.0 Training loss: 0.0135 Explore P: 0.0292
Episode: 446 Total reward: 199.0 Training loss: 0.0211 Explore P: 0.0288
Episode: 447 Total reward: 199.0 Training loss: 0.0697 Explore P: 0.0284
Episode: 448 Total reward: 199.0 Training loss: 0.1316 Explore P: 0.0281
Episode: 449 Total reward: 199.0 Training loss: 0.0452 Explore P: 0.0277
Episode: 450 Total reward: 199.0 Training loss: 0.1043 Explore P: 0.0274
Episode: 451 Total reward: 199.0 Training loss: 0.0350 Explore P: 0.0270
Episode: 452 Total reward: 199.0 Training loss: 0.0552 Explore P: 0.0267
Episode: 453 Total reward: 199.0 Training loss: 0.0249 Explore P: 0.0264
Episode: 454 Total reward: 199.0 Training loss: 185.4973 Explore P: 0.0260
Episode: 455 Total reward: 199.0 Training loss: 0.0253 Explore P: 0.0257
Episode: 456 Total reward: 199.0 Training loss: 0.1039 Explore P: 0.0254
Episode: 457 Total reward: 199.0 Training loss: 0.0613 Explore P: 0.0251
Episode: 458 Total reward: 199.0 Training loss: 0.2171 Explore P: 0.0248
Episode: 459 Total reward: 199.0 Training loss: 0.0429 Explore P: 0.0245
Episode: 460 Total reward: 199.0 Training loss: 0.0643 Explore P: 0.0242
Episode: 461 Total reward: 199.0 Training loss: 0.0894 Explore P: 0.0240
Episode: 462 Total reward: 199.0 Training loss: 0.0565 Explore P: 0.0237
Episode: 463 Total reward: 199.0 Training loss: 89.6411 Explore P: 0.0234
Episode: 464 Total reward: 199.0 Training loss: 0.0718 Explore P: 0.0231
Episode: 465 Total reward: 199.0 Training loss: 0.2295 Explore P: 0.0229
Episode: 466 Total reward: 199.0 Training loss: 0.0424 Explore P: 0.0226
Episode: 467 Total reward: 199.0 Training loss: 0.0822 Explore P: 0.0224
Episode: 468 Total reward: 199.0 Training loss: 0.0308 Explore P: 0.0221
Episode: 469 Total reward: 199.0 Training loss: 74.5628 Explore P: 0.0219
Episode: 470 Total reward: 199.0 Training loss: 0.0956 Explore P: 0.0217
Episode: 471 Total reward: 199.0 Training loss: 0.0490 Explore P: 0.0214
Episode: 472 Total reward: 199.0 Training loss: 0.0525 Explore P: 0.0212
Episode: 473 Total reward: 199.0 Training loss: 0.0273 Explore P: 0.0210
Episode: 474 Total reward: 199.0 Training loss: 0.2669 Explore P: 0.0208
Episode: 475 Total reward: 199.0 Training loss: 0.0389 Explore P: 0.0206
Episode: 476 Total reward: 199.0 Training loss: 0.0486 Explore P: 0.0204
Episode: 477 Total reward: 199.0 Training loss: 0.0340 Explore P: 0.0202
Episode: 478 Total reward: 199.0 Training loss: 14.1068 Explore P: 0.0200
Episode: 479 Total reward: 199.0 Training loss: 17.4078 Explore P: 0.0198
Episode: 480 Total reward: 199.0 Training loss: 0.0408 Explore P: 0.0196
Episode: 481 Total reward: 199.0 Training loss: 0.0869 Explore P: 0.0194
Episode: 482 Total reward: 199.0 Training loss: 0.0722 Explore P: 0.0192
Episode: 483 Total reward: 199.0 Training loss: 0.0531 Explore P: 0.0190
Episode: 484 Total reward: 199.0 Training loss: 0.0625 Explore P: 0.0188
Episode: 485 Total reward: 199.0 Training loss: 0.0571 Explore P: 0.0187
Episode: 486 Total reward: 199.0 Training loss: 0.0278 Explore P: 0.0185
Episode: 487 Total reward: 199.0 Training loss: 0.0449 Explore P: 0.0183
Episode: 488 Total reward: 199.0 Training loss: 0.0604 Explore P: 0.0182
Episode: 489 Total reward: 199.0 Training loss: 0.0509 Explore P: 0.0180
Episode: 490 Total reward: 199.0 Training loss: 0.0459 Explore P: 0.0178
Episode: 491 Total reward: 199.0 Training loss: 38.9527 Explore P: 0.0177
Episode: 492 Total reward: 199.0 Training loss: 0.0400 Explore P: 0.0175
Episode: 493 Total reward: 199.0 Training loss: 4.6835 Explore P: 0.0174
Episode: 494 Total reward: 199.0 Training loss: 0.0439 Explore P: 0.0172
Episode: 495 Total reward: 199.0 Training loss: 0.0394 Explore P: 0.0171
Episode: 496 Total reward: 199.0 Training loss: 0.0535 Explore P: 0.0170
Episode: 497 Total reward: 199.0 Training loss: 0.0538 Explore P: 0.0168
Episode: 498 Total reward: 199.0 Training loss: 0.0432 Explore P: 0.0167
Episode: 499 Total reward: 199.0 Training loss: 0.0657 Explore P: 0.0166
Episode: 500 Total reward: 199.0 Training loss: 0.0407 Explore P: 0.0164
Episode: 501 Total reward: 199.0 Training loss: 0.0492 Explore P: 0.0163
Episode: 502 Total reward: 199.0 Training loss: 335.2382 Explore P: 0.0162
Episode: 503 Total reward: 199.0 Training loss: 0.0400 Explore P: 0.0161
Episode: 504 Total reward: 199.0 Training loss: 0.0377 Explore P: 0.0159
Episode: 505 Total reward: 199.0 Training loss: 0.0643 Explore P: 0.0158
Episode: 506 Total reward: 199.0 Training loss: 0.0476 Explore P: 0.0157
Episode: 507 Total reward: 199.0 Training loss: 0.1133 Explore P: 0.0156
Episode: 508 Total reward: 199.0 Training loss: 0.1104 Explore P: 0.0155
Episode: 509 Total reward: 199.0 Training loss: 351.5255 Explore P: 0.0154
Episode: 510 Total reward: 199.0 Training loss: 332.4712 Explore P: 0.0153
Episode: 511 Total reward: 199.0 Training loss: 0.0920 Explore P: 0.0152
Episode: 512 Total reward: 199.0 Training loss: 0.0613 Explore P: 0.0151
Episode: 513 Total reward: 199.0 Training loss: 0.0377 Explore P: 0.0150
Episode: 514 Total reward: 199.0 Training loss: 0.1154 Explore P: 0.0149
Episode: 515 Total reward: 187.0 Training loss: 0.1309 Explore P: 0.0148
Episode: 516 Total reward: 199.0 Training loss: 0.1930 Explore P: 0.0147
Episode: 517 Total reward: 199.0 Training loss: 0.0759 Explore P: 0.0146
Episode: 518 Total reward: 199.0 Training loss: 0.1122 Explore P: 0.0145
Episode: 519 Total reward: 199.0 Training loss: 0.2027 Explore P: 0.0144
Episode: 520 Total reward: 199.0 Training loss: 1.0731 Explore P: 0.0143
Episode: 521 Total reward: 199.0 Training loss: 0.1532 Explore P: 0.0142
Episode: 522 Total reward: 199.0 Training loss: 0.0984 Explore P: 0.0142
Episode: 523 Total reward: 199.0 Training loss: 0.0932 Explore P: 0.0141
Episode: 524 Total reward: 199.0 Training loss: 0.1806 Explore P: 0.0140
Episode: 525 Total reward: 199.0 Training loss: 0.1186 Explore P: 0.0139
Episode: 526 Total reward: 199.0 Training loss: 0.1081 Explore P: 0.0138
Episode: 527 Total reward: 199.0 Training loss: 242.3724 Explore P: 0.0138
Episode: 528 Total reward: 199.0 Training loss: 0.1020 Explore P: 0.0137
Episode: 529 Total reward: 199.0 Training loss: 0.1223 Explore P: 0.0136
Episode: 530 Total reward: 199.0 Training loss: 0.1032 Explore P: 0.0135
Episode: 531 Total reward: 199.0 Training loss: 0.0802 Explore P: 0.0135
Episode: 532 Total reward: 199.0 Training loss: 0.0796 Explore P: 0.0134
Episode: 533 Total reward: 199.0 Training loss: 0.1804 Explore P: 0.0133
Episode: 534 Total reward: 199.0 Training loss: 234.2017 Explore P: 0.0133
Episode: 535 Total reward: 199.0 Training loss: 0.1887 Explore P: 0.0132
Episode: 536 Total reward: 199.0 Training loss: 0.0870 Explore P: 0.0131
Episode: 537 Total reward: 199.0 Training loss: 0.1315 Explore P: 0.0131
Episode: 538 Total reward: 199.0 Training loss: 0.2050 Explore P: 0.0130
Episode: 539 Total reward: 199.0 Training loss: 0.2127 Explore P: 0.0130
Episode: 540 Total reward: 199.0 Training loss: 0.1648 Explore P: 0.0129
Episode: 541 Total reward: 199.0 Training loss: 0.0967 Explore P: 0.0128
Episode: 542 Total reward: 199.0 Training loss: 0.1565 Explore P: 0.0128
Episode: 543 Total reward: 199.0 Training loss: 0.1203 Explore P: 0.0127
Episode: 544 Total reward: 199.0 Training loss: 0.1818 Explore P: 0.0127
Episode: 545 Total reward: 199.0 Training loss: 0.0733 Explore P: 0.0126
Episode: 546 Total reward: 199.0 Training loss: 0.1533 Explore P: 0.0126
Episode: 547 Total reward: 199.0 Training loss: 0.2199 Explore P: 0.0125
Episode: 548 Total reward: 199.0 Training loss: 0.2520 Explore P: 0.0125
Episode: 549 Total reward: 199.0 Training loss: 345.5298 Explore P: 0.0124
Episode: 550 Total reward: 199.0 Training loss: 0.2136 Explore P: 0.0124
Episode: 551 Total reward: 199.0 Training loss: 0.2611 Explore P: 0.0123
Episode: 552 Total reward: 199.0 Training loss: 0.1915 Explore P: 0.0123
Episode: 553 Total reward: 199.0 Training loss: 0.2601 Explore P: 0.0122
Episode: 554 Total reward: 199.0 Training loss: 0.1247 Explore P: 0.0122
Episode: 555 Total reward: 199.0 Training loss: 0.1253 Explore P: 0.0122
Episode: 556 Total reward: 199.0 Training loss: 0.1937 Explore P: 0.0121
Episode: 557 Total reward: 199.0 Training loss: 0.1109 Explore P: 0.0121
Episode: 558 Total reward: 199.0 Training loss: 0.2190 Explore P: 0.0120
Episode: 559 Total reward: 199.0 Training loss: 0.1018 Explore P: 0.0120
Episode: 560 Total reward: 199.0 Training loss: 0.1990 Explore P: 0.0119
Episode: 561 Total reward: 199.0 Training loss: 321.0490 Explore P: 0.0119
Episode: 562 Total reward: 199.0 Training loss: 0.2080 Explore P: 0.0119
Episode: 563 Total reward: 199.0 Training loss: 0.2606 Explore P: 0.0118
Episode: 564 Total reward: 199.0 Training loss: 0.1509 Explore P: 0.0118
Episode: 565 Total reward: 199.0 Training loss: 0.1733 Explore P: 0.0118
Episode: 566 Total reward: 199.0 Training loss: 0.1428 Explore P: 0.0117
Episode: 567 Total reward: 199.0 Training loss: 0.2346 Explore P: 0.0117
Episode: 568 Total reward: 199.0 Training loss: 254.6623 Explore P: 0.0117
Episode: 569 Total reward: 199.0 Training loss: 0.1060 Explore P: 0.0116
Episode: 570 Total reward: 199.0 Training loss: 0.1027 Explore P: 0.0116
Episode: 571 Total reward: 199.0 Training loss: 0.2316 Explore P: 0.0116
Episode: 572 Total reward: 199.0 Training loss: 0.1613 Explore P: 0.0115
Episode: 573 Total reward: 199.0 Training loss: 0.2216 Explore P: 0.0115
Episode: 574 Total reward: 199.0 Training loss: 0.1497 Explore P: 0.0115
Episode: 575 Total reward: 199.0 Training loss: 0.3359 Explore P: 0.0114
Episode: 576 Total reward: 199.0 Training loss: 0.1261 Explore P: 0.0114
Episode: 577 Total reward: 199.0 Training loss: 0.1663 Explore P: 0.0114
Episode: 578 Total reward: 199.0 Training loss: 0.0890 Explore P: 0.0114
Episode: 579 Total reward: 199.0 Training loss: 0.1580 Explore P: 0.0113
Episode: 580 Total reward: 199.0 Training loss: 0.1528 Explore P: 0.0113
Episode: 581 Total reward: 199.0 Training loss: 0.1465 Explore P: 0.0113
Episode: 582 Total reward: 199.0 Training loss: 0.2152 Explore P: 0.0113
Episode: 583 Total reward: 199.0 Training loss: 0.1573 Explore P: 0.0112
Episode: 584 Total reward: 199.0 Training loss: 0.1053 Explore P: 0.0112
Episode: 585 Total reward: 199.0 Training loss: 0.1290 Explore P: 0.0112
Episode: 586 Total reward: 199.0 Training loss: 0.2272 Explore P: 0.0112
Episode: 587 Total reward: 199.0 Training loss: 0.1145 Explore P: 0.0111
Episode: 588 Total reward: 199.0 Training loss: 0.0994 Explore P: 0.0111
Episode: 589 Total reward: 199.0 Training loss: 0.1268 Explore P: 0.0111
Episode: 590 Total reward: 199.0 Training loss: 315.9100 Explore P: 0.0111
Episode: 591 Total reward: 199.0 Training loss: 0.1965 Explore P: 0.0111
Episode: 592 Total reward: 199.0 Training loss: 0.1144 Explore P: 0.0110
Episode: 593 Total reward: 199.0 Training loss: 0.1908 Explore P: 0.0110
Episode: 594 Total reward: 199.0 Training loss: 0.2207 Explore P: 0.0110
Episode: 595 Total reward: 199.0 Training loss: 0.2540 Explore P: 0.0110
Episode: 596 Total reward: 199.0 Training loss: 0.2942 Explore P: 0.0110
Episode: 597 Total reward: 199.0 Training loss: 0.2093 Explore P: 0.0109
Episode: 598 Total reward: 199.0 Training loss: 0.3063 Explore P: 0.0109
Episode: 599 Total reward: 199.0 Training loss: 241.7867 Explore P: 0.0109
Episode: 600 Total reward: 199.0 Training loss: 0.1411 Explore P: 0.0109
Episode: 601 Total reward: 199.0 Training loss: 0.1156 Explore P: 0.0109
Episode: 602 Total reward: 199.0 Training loss: 0.1705 Explore P: 0.0108
Episode: 603 Total reward: 199.0 Training loss: 0.2853 Explore P: 0.0108
Episode: 604 Total reward: 199.0 Training loss: 0.2207 Explore P: 0.0108
Episode: 605 Total reward: 199.0 Training loss: 0.2286 Explore P: 0.0108
Episode: 606 Total reward: 199.0 Training loss: 0.1580 Explore P: 0.0108
Episode: 607 Total reward: 199.0 Training loss: 0.2028 Explore P: 0.0108
Episode: 608 Total reward: 199.0 Training loss: 0.3113 Explore P: 0.0107
Episode: 609 Total reward: 199.0 Training loss: 0.2346 Explore P: 0.0107
Episode: 610 Total reward: 199.0 Training loss: 0.2226 Explore P: 0.0107
Episode: 611 Total reward: 199.0 Training loss: 130.3791 Explore P: 0.0107
Episode: 612 Total reward: 199.0 Training loss: 0.2135 Explore P: 0.0107
Episode: 613 Total reward: 199.0 Training loss: 0.2290 Explore P: 0.0107
Episode: 614 Total reward: 199.0 Training loss: 201.3921 Explore P: 0.0107
Episode: 615 Total reward: 199.0 Training loss: 0.3236 Explore P: 0.0107
Episode: 616 Total reward: 199.0 Training loss: 0.4551 Explore P: 0.0106
Episode: 617 Total reward: 199.0 Training loss: 0.7620 Explore P: 0.0106
Episode: 618 Total reward: 199.0 Training loss: 0.3151 Explore P: 0.0106
Episode: 619 Total reward: 199.0 Training loss: 0.3650 Explore P: 0.0106
Episode: 620 Total reward: 199.0 Training loss: 0.6641 Explore P: 0.0106
Episode: 621 Total reward: 33.0 Training loss: 0.5925 Explore P: 0.0106
Episode: 622 Total reward: 64.0 Training loss: 0.6203 Explore P: 0.0106
Episode: 623 Total reward: 64.0 Training loss: 0.4698 Explore P: 0.0106
Episode: 624 Total reward: 23.0 Training loss: 0.7696 Explore P: 0.0106
Episode: 625 Total reward: 24.0 Training loss: 0.8071 Explore P: 0.0106
Episode: 626 Total reward: 21.0 Training loss: 0.6114 Explore P: 0.0106
Episode: 627 Total reward: 19.0 Training loss: 1.1263 Explore P: 0.0106
Episode: 628 Total reward: 23.0 Training loss: 0.7505 Explore P: 0.0106
Episode: 629 Total reward: 23.0 Training loss: 0.5577 Explore P: 0.0106
Episode: 630 Total reward: 63.0 Training loss: 78.4654 Explore P: 0.0106
Episode: 631 Total reward: 61.0 Training loss: 136.9949 Explore P: 0.0106
Episode: 632 Total reward: 26.0 Training loss: 0.6719 Explore P: 0.0106
Episode: 633 Total reward: 18.0 Training loss: 0.9558 Explore P: 0.0106
Episode: 634 Total reward: 20.0 Training loss: 0.9147 Explore P: 0.0106
Episode: 635 Total reward: 57.0 Training loss: 0.9010 Explore P: 0.0106
Episode: 636 Total reward: 19.0 Training loss: 1.1777 Explore P: 0.0106
Episode: 637 Total reward: 16.0 Training loss: 1.3046 Explore P: 0.0106
Episode: 638 Total reward: 16.0 Training loss: 1.9484 Explore P: 0.0106
Episode: 639 Total reward: 14.0 Training loss: 1.6212 Explore P: 0.0106
Episode: 640 Total reward: 13.0 Training loss: 2.2370 Explore P: 0.0106
Episode: 641 Total reward: 15.0 Training loss: 87.2518 Explore P: 0.0106
Episode: 642 Total reward: 11.0 Training loss: 2.2667 Explore P: 0.0106
Episode: 643 Total reward: 14.0 Training loss: 527.6612 Explore P: 0.0106
Episode: 644 Total reward: 13.0 Training loss: 2.7098 Explore P: 0.0106
Episode: 645 Total reward: 11.0 Training loss: 2.5510 Explore P: 0.0106
Episode: 646 Total reward: 12.0 Training loss: 1.9252 Explore P: 0.0106
Episode: 647 Total reward: 10.0 Training loss: 2.5396 Explore P: 0.0106
Episode: 648 Total reward: 11.0 Training loss: 1048.4934 Explore P: 0.0105
Episode: 649 Total reward: 14.0 Training loss: 830.4617 Explore P: 0.0105
Episode: 650 Total reward: 12.0 Training loss: 1.2947 Explore P: 0.0105
Episode: 651 Total reward: 13.0 Training loss: 1.7871 Explore P: 0.0105
Episode: 652 Total reward: 14.0 Training loss: 1.6186 Explore P: 0.0105
Episode: 653 Total reward: 17.0 Training loss: 1.8100 Explore P: 0.0105
Episode: 654 Total reward: 16.0 Training loss: 433.5859 Explore P: 0.0105
Episode: 655 Total reward: 12.0 Training loss: 2.1233 Explore P: 0.0105
Episode: 656 Total reward: 13.0 Training loss: 2.4527 Explore P: 0.0105
Episode: 657 Total reward: 16.0 Training loss: 1.8832 Explore P: 0.0105
Episode: 658 Total reward: 14.0 Training loss: 332.3200 Explore P: 0.0105
Episode: 659 Total reward: 56.0 Training loss: 227.7033 Explore P: 0.0105
Episode: 660 Total reward: 60.0 Training loss: 1.7859 Explore P: 0.0105
Episode: 661 Total reward: 21.0 Training loss: 0.6107 Explore P: 0.0105
Episode: 662 Total reward: 56.0 Training loss: 0.8468 Explore P: 0.0105
Episode: 663 Total reward: 199.0 Training loss: 407.6404 Explore P: 0.0105
Episode: 664 Total reward: 199.0 Training loss: 0.8110 Explore P: 0.0105
Episode: 665 Total reward: 69.0 Training loss: 0.9736 Explore P: 0.0105
Episode: 666 Total reward: 199.0 Training loss: 0.8866 Explore P: 0.0105
Episode: 667 Total reward: 199.0 Training loss: 1.4770 Explore P: 0.0105
Episode: 668 Total reward: 199.0 Training loss: 0.7515 Explore P: 0.0105
Episode: 669 Total reward: 199.0 Training loss: 0.6267 Explore P: 0.0105
Episode: 670 Total reward: 199.0 Training loss: 0.3518 Explore P: 0.0105
Episode: 671 Total reward: 199.0 Training loss: 1.0893 Explore P: 0.0105
Episode: 672 Total reward: 199.0 Training loss: 0.3681 Explore P: 0.0104
Episode: 673 Total reward: 199.0 Training loss: 1.3923 Explore P: 0.0104
Episode: 674 Total reward: 199.0 Training loss: 0.3379 Explore P: 0.0104
Episode: 675 Total reward: 199.0 Training loss: 0.7462 Explore P: 0.0104
Episode: 676 Total reward: 199.0 Training loss: 0.4778 Explore P: 0.0104
Episode: 677 Total reward: 199.0 Training loss: 0.8289 Explore P: 0.0104
Episode: 678 Total reward: 199.0 Training loss: 1.2137 Explore P: 0.0104
Episode: 679 Total reward: 199.0 Training loss: 0.9096 Explore P: 0.0104
Episode: 680 Total reward: 199.0 Training loss: 0.4147 Explore P: 0.0104
Episode: 681 Total reward: 199.0 Training loss: 0.4882 Explore P: 0.0104
Episode: 682 Total reward: 199.0 Training loss: 0.7902 Explore P: 0.0104
Episode: 683 Total reward: 199.0 Training loss: 1.6665 Explore P: 0.0104
Episode: 684 Total reward: 186.0 Training loss: 1.0589 Explore P: 0.0103
Episode: 685 Total reward: 199.0 Training loss: 0.8236 Explore P: 0.0103
Episode: 686 Total reward: 199.0 Training loss: 1.0880 Explore P: 0.0103
Episode: 687 Total reward: 176.0 Training loss: 1.2993 Explore P: 0.0103
Episode: 688 Total reward: 190.0 Training loss: 73.9958 Explore P: 0.0103
Episode: 689 Total reward: 199.0 Training loss: 47.9360 Explore P: 0.0103
Episode: 690 Total reward: 171.0 Training loss: 2.9647 Explore P: 0.0103
Episode: 691 Total reward: 154.0 Training loss: 0.9711 Explore P: 0.0103
Episode: 692 Total reward: 142.0 Training loss: 1.2004 Explore P: 0.0103
Episode: 693 Total reward: 131.0 Training loss: 2.5935 Explore P: 0.0103
Episode: 694 Total reward: 109.0 Training loss: 159.1599 Explore P: 0.0103
Episode: 695 Total reward: 96.0 Training loss: 248.9795 Explore P: 0.0103
Episode: 696 Total reward: 79.0 Training loss: 218.3204 Explore P: 0.0103
Episode: 697 Total reward: 35.0 Training loss: 2.5556 Explore P: 0.0103
Episode: 698 Total reward: 31.0 Training loss: 4.2871 Explore P: 0.0103
Episode: 699 Total reward: 31.0 Training loss: 3.7856 Explore P: 0.0103
Episode: 700 Total reward: 34.0 Training loss: 4.6049 Explore P: 0.0103
Episode: 701 Total reward: 37.0 Training loss: 2.9799 Explore P: 0.0103
Episode: 702 Total reward: 27.0 Training loss: 1.1967 Explore P: 0.0103
Episode: 703 Total reward: 35.0 Training loss: 2.4179 Explore P: 0.0103
Episode: 704 Total reward: 27.0 Training loss: 3.8980 Explore P: 0.0103
Episode: 705 Total reward: 34.0 Training loss: 2.6787 Explore P: 0.0103
Episode: 706 Total reward: 39.0 Training loss: 2.4240 Explore P: 0.0103
Episode: 707 Total reward: 29.0 Training loss: 2.0758 Explore P: 0.0103
Episode: 708 Total reward: 24.0 Training loss: 2.9076 Explore P: 0.0103
Episode: 709 Total reward: 22.0 Training loss: 2.6522 Explore P: 0.0103
Episode: 710 Total reward: 31.0 Training loss: 375.2794 Explore P: 0.0103
Episode: 711 Total reward: 22.0 Training loss: 3.0527 Explore P: 0.0103
Episode: 712 Total reward: 28.0 Training loss: 4.1684 Explore P: 0.0103
Episode: 713 Total reward: 105.0 Training loss: 1.6832 Explore P: 0.0103
Episode: 714 Total reward: 112.0 Training loss: 1.6475 Explore P: 0.0103
Episode: 715 Total reward: 124.0 Training loss: 0.9231 Explore P: 0.0103
Episode: 716 Total reward: 129.0 Training loss: 3.4517 Explore P: 0.0103
Episode: 717 Total reward: 116.0 Training loss: 354.8233 Explore P: 0.0103
Episode: 718 Total reward: 100.0 Training loss: 106.1257 Explore P: 0.0103
Episode: 719 Total reward: 99.0 Training loss: 422.5796 Explore P: 0.0103
Episode: 720 Total reward: 99.0 Training loss: 5.4680 Explore P: 0.0103
Episode: 721 Total reward: 110.0 Training loss: 4.4735 Explore P: 0.0102
Episode: 722 Total reward: 120.0 Training loss: 2.0519 Explore P: 0.0102
Episode: 723 Total reward: 126.0 Training loss: 3.9607 Explore P: 0.0102
Episode: 724 Total reward: 174.0 Training loss: 1.4092 Explore P: 0.0102
Episode: 725 Total reward: 154.0 Training loss: 1.8270 Explore P: 0.0102
Episode: 726 Total reward: 173.0 Training loss: 2.8875 Explore P: 0.0102
Episode: 727 Total reward: 189.0 Training loss: 2.0164 Explore P: 0.0102
Episode: 728 Total reward: 199.0 Training loss: 0.7601 Explore P: 0.0102
Episode: 729 Total reward: 199.0 Training loss: 2.6114 Explore P: 0.0102
Episode: 730 Total reward: 199.0 Training loss: 1.6354 Explore P: 0.0102
Episode: 731 Total reward: 199.0 Training loss: 441.0022 Explore P: 0.0102
Episode: 732 Total reward: 187.0 Training loss: 0.4356 Explore P: 0.0102
Episode: 733 Total reward: 191.0 Training loss: 2.5610 Explore P: 0.0102
Episode: 734 Total reward: 177.0 Training loss: 1.3549 Explore P: 0.0102
Episode: 735 Total reward: 199.0 Training loss: 681.3746 Explore P: 0.0102
Episode: 736 Total reward: 199.0 Training loss: 0.2224 Explore P: 0.0102
Episode: 737 Total reward: 199.0 Training loss: 0.8711 Explore P: 0.0102
Episode: 738 Total reward: 199.0 Training loss: 0.2639 Explore P: 0.0102
Episode: 739 Total reward: 199.0 Training loss: 0.4047 Explore P: 0.0102
Episode: 740 Total reward: 199.0 Training loss: 0.5337 Explore P: 0.0102
Episode: 741 Total reward: 199.0 Training loss: 0.2172 Explore P: 0.0102
Episode: 742 Total reward: 199.0 Training loss: 1.0681 Explore P: 0.0102
Episode: 743 Total reward: 199.0 Training loss: 0.4679 Explore P: 0.0102
Episode: 744 Total reward: 199.0 Training loss: 0.1104 Explore P: 0.0102
Episode: 745 Total reward: 199.0 Training loss: 0.2908 Explore P: 0.0102
Episode: 746 Total reward: 199.0 Training loss: 0.5388 Explore P: 0.0102
Episode: 747 Total reward: 199.0 Training loss: 0.1109 Explore P: 0.0102
Episode: 748 Total reward: 199.0 Training loss: 0.4945 Explore P: 0.0102
Episode: 749 Total reward: 199.0 Training loss: 0.1857 Explore P: 0.0101
Episode: 750 Total reward: 199.0 Training loss: 0.2923 Explore P: 0.0101
Episode: 751 Total reward: 199.0 Training loss: 0.2363 Explore P: 0.0101
Episode: 752 Total reward: 199.0 Training loss: 0.3642 Explore P: 0.0101
Episode: 753 Total reward: 199.0 Training loss: 0.1366 Explore P: 0.0101
Episode: 754 Total reward: 199.0 Training loss: 0.5463 Explore P: 0.0101
Episode: 755 Total reward: 199.0 Training loss: 46.7921 Explore P: 0.0101
Episode: 756 Total reward: 199.0 Training loss: 0.2444 Explore P: 0.0101
Episode: 757 Total reward: 197.0 Training loss: 0.1612 Explore P: 0.0101
Episode: 758 Total reward: 199.0 Training loss: 0.2608 Explore P: 0.0101
Episode: 759 Total reward: 199.0 Training loss: 339.4979 Explore P: 0.0101
Episode: 760 Total reward: 199.0 Training loss: 0.1711 Explore P: 0.0101
Episode: 761 Total reward: 199.0 Training loss: 123.1421 Explore P: 0.0101
Episode: 762 Total reward: 176.0 Training loss: 0.2372 Explore P: 0.0101
Episode: 763 Total reward: 199.0 Training loss: 0.1817 Explore P: 0.0101
Episode: 764 Total reward: 165.0 Training loss: 0.2691 Explore P: 0.0101
Episode: 765 Total reward: 199.0 Training loss: 383.5774 Explore P: 0.0101
Episode: 766 Total reward: 199.0 Training loss: 0.0757 Explore P: 0.0101
Episode: 767 Total reward: 199.0 Training loss: 0.1875 Explore P: 0.0101
Episode: 768 Total reward: 199.0 Training loss: 0.0999 Explore P: 0.0101
Episode: 769 Total reward: 199.0 Training loss: 0.0805 Explore P: 0.0101
Episode: 770 Total reward: 199.0 Training loss: 0.1810 Explore P: 0.0101
Episode: 771 Total reward: 199.0 Training loss: 0.0777 Explore P: 0.0101
Episode: 772 Total reward: 199.0 Training loss: 0.1014 Explore P: 0.0101
Episode: 773 Total reward: 199.0 Training loss: 0.1020 Explore P: 0.0101
Episode: 774 Total reward: 199.0 Training loss: 0.0602 Explore P: 0.0101
Episode: 775 Total reward: 199.0 Training loss: 0.0197 Explore P: 0.0101
Episode: 776 Total reward: 199.0 Training loss: 0.0680 Explore P: 0.0101
Episode: 777 Total reward: 199.0 Training loss: 0.0506 Explore P: 0.0101
Episode: 778 Total reward: 199.0 Training loss: 0.0610 Explore P: 0.0101
Episode: 779 Total reward: 199.0 Training loss: 0.0399 Explore P: 0.0101
Episode: 780 Total reward: 199.0 Training loss: 0.3245 Explore P: 0.0101
Episode: 781 Total reward: 199.0 Training loss: 0.1018 Explore P: 0.0101
Episode: 782 Total reward: 199.0 Training loss: 0.1119 Explore P: 0.0101
Episode: 783 Total reward: 199.0 Training loss: 0.0594 Explore P: 0.0101
Episode: 784 Total reward: 199.0 Training loss: 0.1734 Explore P: 0.0101
Episode: 785 Total reward: 199.0 Training loss: 0.0934 Explore P: 0.0101
Episode: 786 Total reward: 199.0 Training loss: 0.1066 Explore P: 0.0101
Episode: 787 Total reward: 199.0 Training loss: 0.0538 Explore P: 0.0101
Episode: 788 Total reward: 199.0 Training loss: 0.1206 Explore P: 0.0101
Episode: 789 Total reward: 199.0 Training loss: 0.0492 Explore P: 0.0101
Episode: 790 Total reward: 199.0 Training loss: 0.0502 Explore P: 0.0101
Episode: 791 Total reward: 199.0 Training loss: 0.0668 Explore P: 0.0101
Episode: 792 Total reward: 199.0 Training loss: 0.0354 Explore P: 0.0101
Episode: 793 Total reward: 199.0 Training loss: 307.0701 Explore P: 0.0101
Episode: 794 Total reward: 199.0 Training loss: 0.0315 Explore P: 0.0101
Episode: 795 Total reward: 199.0 Training loss: 0.1421 Explore P: 0.0101
Episode: 796 Total reward: 199.0 Training loss: 0.0937 Explore P: 0.0101
Episode: 797 Total reward: 199.0 Training loss: 0.1802 Explore P: 0.0101
Episode: 798 Total reward: 199.0 Training loss: 0.1766 Explore P: 0.0101
Episode: 799 Total reward: 199.0 Training loss: 0.0301 Explore P: 0.0101
Episode: 800 Total reward: 199.0 Training loss: 0.0919 Explore P: 0.0101
Episode: 801 Total reward: 199.0 Training loss: 0.1011 Explore P: 0.0101
Episode: 802 Total reward: 199.0 Training loss: 0.1208 Explore P: 0.0101
Episode: 803 Total reward: 199.0 Training loss: 0.0676 Explore P: 0.0101
Episode: 804 Total reward: 199.0 Training loss: 28.2580 Explore P: 0.0100
Episode: 805 Total reward: 199.0 Training loss: 0.1701 Explore P: 0.0100
Episode: 806 Total reward: 199.0 Training loss: 3.0245 Explore P: 0.0100
Episode: 807 Total reward: 199.0 Training loss: 0.0408 Explore P: 0.0100
Episode: 808 Total reward: 199.0 Training loss: 0.0701 Explore P: 0.0100
Episode: 809 Total reward: 199.0 Training loss: 0.1321 Explore P: 0.0100
Episode: 810 Total reward: 199.0 Training loss: 0.0434 Explore P: 0.0100
Episode: 811 Total reward: 199.0 Training loss: 62.8858 Explore P: 0.0100
Episode: 812 Total reward: 199.0 Training loss: 0.0395 Explore P: 0.0100
Episode: 813 Total reward: 199.0 Training loss: 0.2389 Explore P: 0.0100
Episode: 814 Total reward: 199.0 Training loss: 0.2632 Explore P: 0.0100
Episode: 815 Total reward: 199.0 Training loss: 0.0942 Explore P: 0.0100
Episode: 816 Total reward: 199.0 Training loss: 0.0435 Explore P: 0.0100
Episode: 817 Total reward: 199.0 Training loss: 0.0654 Explore P: 0.0100
Episode: 818 Total reward: 199.0 Training loss: 0.1198 Explore P: 0.0100
Episode: 819 Total reward: 199.0 Training loss: 0.0970 Explore P: 0.0100
Episode: 820 Total reward: 199.0 Training loss: 0.1725 Explore P: 0.0100
Episode: 821 Total reward: 199.0 Training loss: 0.1108 Explore P: 0.0100
Episode: 822 Total reward: 199.0 Training loss: 225.9124 Explore P: 0.0100
Episode: 823 Total reward: 199.0 Training loss: 0.1551 Explore P: 0.0100
Episode: 824 Total reward: 199.0 Training loss: 0.1313 Explore P: 0.0100
Episode: 825 Total reward: 199.0 Training loss: 0.1840 Explore P: 0.0100
Episode: 826 Total reward: 199.0 Training loss: 0.0979 Explore P: 0.0100
Episode: 827 Total reward: 199.0 Training loss: 0.1076 Explore P: 0.0100
Episode: 828 Total reward: 199.0 Training loss: 0.1243 Explore P: 0.0100
Episode: 829 Total reward: 199.0 Training loss: 0.1730 Explore P: 0.0100
Episode: 830 Total reward: 199.0 Training loss: 0.2575 Explore P: 0.0100
Episode: 831 Total reward: 199.0 Training loss: 0.2422 Explore P: 0.0100
Episode: 832 Total reward: 199.0 Training loss: 0.1743 Explore P: 0.0100
Episode: 833 Total reward: 199.0 Training loss: 0.2963 Explore P: 0.0100
Episode: 834 Total reward: 199.0 Training loss: 0.1904 Explore P: 0.0100
Episode: 835 Total reward: 199.0 Training loss: 0.1775 Explore P: 0.0100
Episode: 836 Total reward: 199.0 Training loss: 0.2994 Explore P: 0.0100
Episode: 837 Total reward: 199.0 Training loss: 0.1946 Explore P: 0.0100
Episode: 838 Total reward: 199.0 Training loss: 0.1895 Explore P: 0.0100
Episode: 839 Total reward: 199.0 Training loss: 0.2031 Explore P: 0.0100
Episode: 840 Total reward: 199.0 Training loss: 0.2109 Explore P: 0.0100
Episode: 841 Total reward: 199.0 Training loss: 0.2789 Explore P: 0.0100
Episode: 842 Total reward: 199.0 Training loss: 0.1234 Explore P: 0.0100
Episode: 843 Total reward: 199.0 Training loss: 21.5182 Explore P: 0.0100
Episode: 844 Total reward: 199.0 Training loss: 0.3058 Explore P: 0.0100
Episode: 845 Total reward: 199.0 Training loss: 0.2443 Explore P: 0.0100
Episode: 846 Total reward: 199.0 Training loss: 308.8051 Explore P: 0.0100
Episode: 847 Total reward: 199.0 Training loss: 0.1171 Explore P: 0.0100
Episode: 848 Total reward: 199.0 Training loss: 0.1672 Explore P: 0.0100
Episode: 849 Total reward: 199.0 Training loss: 0.2864 Explore P: 0.0100
Episode: 850 Total reward: 199.0 Training loss: 0.1954 Explore P: 0.0100
Episode: 851 Total reward: 199.0 Training loss: 0.3247 Explore P: 0.0100
Episode: 852 Total reward: 199.0 Training loss: 0.5964 Explore P: 0.0100
Episode: 853 Total reward: 199.0 Training loss: 0.1907 Explore P: 0.0100
Episode: 854 Total reward: 199.0 Training loss: 0.2371 Explore P: 0.0100
Episode: 855 Total reward: 199.0 Training loss: 5.5391 Explore P: 0.0100
Episode: 856 Total reward: 199.0 Training loss: 0.1567 Explore P: 0.0100
Episode: 857 Total reward: 199.0 Training loss: 0.3254 Explore P: 0.0100
Episode: 858 Total reward: 199.0 Training loss: 0.5803 Explore P: 0.0100
Episode: 859 Total reward: 199.0 Training loss: 0.1490 Explore P: 0.0100
Episode: 860 Total reward: 199.0 Training loss: 0.2411 Explore P: 0.0100
Episode: 861 Total reward: 199.0 Training loss: 0.5081 Explore P: 0.0100
Episode: 862 Total reward: 199.0 Training loss: 0.1990 Explore P: 0.0100
Episode: 863 Total reward: 199.0 Training loss: 217.2373 Explore P: 0.0100
Episode: 864 Total reward: 199.0 Training loss: 0.1903 Explore P: 0.0100
Episode: 865 Total reward: 199.0 Training loss: 0.1625 Explore P: 0.0100
Episode: 866 Total reward: 199.0 Training loss: 0.3460 Explore P: 0.0100
Episode: 867 Total reward: 199.0 Training loss: 0.2791 Explore P: 0.0100
Episode: 868 Total reward: 199.0 Training loss: 0.3351 Explore P: 0.0100
Episode: 869 Total reward: 199.0 Training loss: 0.3419 Explore P: 0.0100
Episode: 870 Total reward: 199.0 Training loss: 0.1507 Explore P: 0.0100
Episode: 871 Total reward: 199.0 Training loss: 0.7624 Explore P: 0.0100
Episode: 872 Total reward: 199.0 Training loss: 0.2626 Explore P: 0.0100
Episode: 873 Total reward: 199.0 Training loss: 0.2825 Explore P: 0.0100
Episode: 874 Total reward: 199.0 Training loss: 0.3144 Explore P: 0.0100
Episode: 875 Total reward: 199.0 Training loss: 0.2181 Explore P: 0.0100
Episode: 876 Total reward: 199.0 Training loss: 0.2765 Explore P: 0.0100
Episode: 877 Total reward: 199.0 Training loss: 0.3421 Explore P: 0.0100
Episode: 878 Total reward: 199.0 Training loss: 0.5248 Explore P: 0.0100
Episode: 879 Total reward: 199.0 Training loss: 0.3468 Explore P: 0.0100
Episode: 880 Total reward: 199.0 Training loss: 0.4537 Explore P: 0.0100
Episode: 881 Total reward: 199.0 Training loss: 0.3628 Explore P: 0.0100
Episode: 882 Total reward: 199.0 Training loss: 0.5053 Explore P: 0.0100
Episode: 883 Total reward: 199.0 Training loss: 0.4808 Explore P: 0.0100
Episode: 884 Total reward: 199.0 Training loss: 0.4020 Explore P: 0.0100
Episode: 885 Total reward: 199.0 Training loss: 0.5495 Explore P: 0.0100
Episode: 886 Total reward: 199.0 Training loss: 0.6240 Explore P: 0.0100
Episode: 887 Total reward: 199.0 Training loss: 0.6733 Explore P: 0.0100
Episode: 888 Total reward: 199.0 Training loss: 0.4623 Explore P: 0.0100
Episode: 889 Total reward: 199.0 Training loss: 0.4760 Explore P: 0.0100
Episode: 890 Total reward: 173.0 Training loss: 0.5177 Explore P: 0.0100
Episode: 891 Total reward: 196.0 Training loss: 0.3087 Explore P: 0.0100
Episode: 892 Total reward: 199.0 Training loss: 0.3225 Explore P: 0.0100
Episode: 893 Total reward: 181.0 Training loss: 0.7549 Explore P: 0.0100
Episode: 894 Total reward: 163.0 Training loss: 0.5879 Explore P: 0.0100
Episode: 895 Total reward: 169.0 Training loss: 0.7871 Explore P: 0.0100
Episode: 896 Total reward: 179.0 Training loss: 0.1437 Explore P: 0.0100
Episode: 897 Total reward: 155.0 Training loss: 0.1887 Explore P: 0.0100
Episode: 898 Total reward: 165.0 Training loss: 0.3452 Explore P: 0.0100
Episode: 899 Total reward: 175.0 Training loss: 0.7238 Explore P: 0.0100
Episode: 900 Total reward: 145.0 Training loss: 0.4003 Explore P: 0.0100
Episode: 901 Total reward: 151.0 Training loss: 0.4942 Explore P: 0.0100
Episode: 902 Total reward: 153.0 Training loss: 0.4165 Explore P: 0.0100
Episode: 903 Total reward: 127.0 Training loss: 0.9265 Explore P: 0.0100
Episode: 904 Total reward: 130.0 Training loss: 0.6440 Explore P: 0.0100
Episode: 905 Total reward: 145.0 Training loss: 0.6884 Explore P: 0.0100
Episode: 906 Total reward: 107.0 Training loss: 0.9116 Explore P: 0.0100
Episode: 907 Total reward: 107.0 Training loss: 1.2153 Explore P: 0.0100
Episode: 908 Total reward: 127.0 Training loss: 0.5353 Explore P: 0.0100
Episode: 909 Total reward: 138.0 Training loss: 0.4765 Explore P: 0.0100
Episode: 910 Total reward: 108.0 Training loss: 0.5273 Explore P: 0.0100
Episode: 911 Total reward: 100.0 Training loss: 0.8100 Explore P: 0.0100
Episode: 912 Total reward: 98.0 Training loss: 0.9912 Explore P: 0.0100
Episode: 913 Total reward: 87.0 Training loss: 333.5301 Explore P: 0.0100
Episode: 914 Total reward: 110.0 Training loss: 0.5966 Explore P: 0.0100
Episode: 915 Total reward: 82.0 Training loss: 1.4260 Explore P: 0.0100
Episode: 916 Total reward: 91.0 Training loss: 0.5004 Explore P: 0.0100
Episode: 917 Total reward: 79.0 Training loss: 0.9163 Explore P: 0.0100
Episode: 918 Total reward: 88.0 Training loss: 0.8615 Explore P: 0.0100
Episode: 919 Total reward: 94.0 Training loss: 77.4066 Explore P: 0.0100
Episode: 920 Total reward: 90.0 Training loss: 0.4633 Explore P: 0.0100
Episode: 921 Total reward: 94.0 Training loss: 387.3761 Explore P: 0.0100
Episode: 922 Total reward: 90.0 Training loss: 108.1568 Explore P: 0.0100
Episode: 923 Total reward: 80.0 Training loss: 1.2973 Explore P: 0.0100
Episode: 924 Total reward: 77.0 Training loss: 0.7103 Explore P: 0.0100
Episode: 925 Total reward: 86.0 Training loss: 0.7282 Explore P: 0.0100
Episode: 926 Total reward: 83.0 Training loss: 0.7079 Explore P: 0.0100
Episode: 927 Total reward: 90.0 Training loss: 0.8154 Explore P: 0.0100
Episode: 928 Total reward: 95.0 Training loss: 0.3933 Explore P: 0.0100
Episode: 929 Total reward: 99.0 Training loss: 529.7502 Explore P: 0.0100
Episode: 930 Total reward: 112.0 Training loss: 0.7860 Explore P: 0.0100
Episode: 931 Total reward: 117.0 Training loss: 0.5571 Explore P: 0.0100
Episode: 932 Total reward: 120.0 Training loss: 20.1799 Explore P: 0.0100
Episode: 933 Total reward: 125.0 Training loss: 1.1010 Explore P: 0.0100
Episode: 934 Total reward: 120.0 Training loss: 0.6735 Explore P: 0.0100
Episode: 935 Total reward: 152.0 Training loss: 0.3608 Explore P: 0.0100
Episode: 936 Total reward: 143.0 Training loss: 0.8572 Explore P: 0.0100
Episode: 937 Total reward: 143.0 Training loss: 0.6397 Explore P: 0.0100
Episode: 938 Total reward: 146.0 Training loss: 0.8972 Explore P: 0.0100
Episode: 939 Total reward: 188.0 Training loss: 0.6078 Explore P: 0.0100
Episode: 940 Total reward: 199.0 Training loss: 0.8817 Explore P: 0.0100
Episode: 941 Total reward: 199.0 Training loss: 0.4485 Explore P: 0.0100
Episode: 942 Total reward: 199.0 Training loss: 126.8505 Explore P: 0.0100
Episode: 943 Total reward: 194.0 Training loss: 0.3516 Explore P: 0.0100
Episode: 944 Total reward: 199.0 Training loss: 0.1969 Explore P: 0.0100
Episode: 945 Total reward: 199.0 Training loss: 0.3375 Explore P: 0.0100
Episode: 946 Total reward: 199.0 Training loss: 49.2055 Explore P: 0.0100
Episode: 947 Total reward: 199.0 Training loss: 0.2147 Explore P: 0.0100
Episode: 948 Total reward: 199.0 Training loss: 0.4307 Explore P: 0.0100
Episode: 949 Total reward: 199.0 Training loss: 0.3204 Explore P: 0.0100
Episode: 950 Total reward: 199.0 Training loss: 16.5419 Explore P: 0.0100
Episode: 951 Total reward: 199.0 Training loss: 0.3334 Explore P: 0.0100
Episode: 952 Total reward: 199.0 Training loss: 0.1645 Explore P: 0.0100
Episode: 953 Total reward: 199.0 Training loss: 0.4363 Explore P: 0.0100
Episode: 954 Total reward: 199.0 Training loss: 328.4742 Explore P: 0.0100
Episode: 955 Total reward: 199.0 Training loss: 0.3174 Explore P: 0.0100
Episode: 956 Total reward: 199.0 Training loss: 0.2211 Explore P: 0.0100
Episode: 957 Total reward: 199.0 Training loss: 0.1314 Explore P: 0.0100
Episode: 958 Total reward: 199.0 Training loss: 0.1448 Explore P: 0.0100
Episode: 959 Total reward: 199.0 Training loss: 0.2396 Explore P: 0.0100
Episode: 960 Total reward: 199.0 Training loss: 0.1311 Explore P: 0.0100
Episode: 961 Total reward: 199.0 Training loss: 347.0466 Explore P: 0.0100
Episode: 962 Total reward: 199.0 Training loss: 0.0901 Explore P: 0.0100
Episode: 963 Total reward: 199.0 Training loss: 0.0822 Explore P: 0.0100
Episode: 964 Total reward: 199.0 Training loss: 0.1254 Explore P: 0.0100
Episode: 965 Total reward: 199.0 Training loss: 0.1834 Explore P: 0.0100
Episode: 966 Total reward: 199.0 Training loss: 0.0991 Explore P: 0.0100
Episode: 967 Total reward: 199.0 Training loss: 0.1823 Explore P: 0.0100
Episode: 968 Total reward: 199.0 Training loss: 0.0422 Explore P: 0.0100
Episode: 969 Total reward: 199.0 Training loss: 0.0771 Explore P: 0.0100
Episode: 970 Total reward: 199.0 Training loss: 0.1289 Explore P: 0.0100
Episode: 971 Total reward: 199.0 Training loss: 0.1936 Explore P: 0.0100
Episode: 972 Total reward: 199.0 Training loss: 0.2115 Explore P: 0.0100
Episode: 973 Total reward: 160.0 Training loss: 0.3105 Explore P: 0.0100
Episode: 974 Total reward: 157.0 Training loss: 0.1103 Explore P: 0.0100
Episode: 975 Total reward: 196.0 Training loss: 0.1942 Explore P: 0.0100
Episode: 976 Total reward: 199.0 Training loss: 0.1672 Explore P: 0.0100
Episode: 977 Total reward: 199.0 Training loss: 0.1708 Explore P: 0.0100
Episode: 978 Total reward: 164.0 Training loss: 0.1152 Explore P: 0.0100
Episode: 979 Total reward: 173.0 Training loss: 10.5364 Explore P: 0.0100
Episode: 980 Total reward: 192.0 Training loss: 0.1423 Explore P: 0.0100
Episode: 981 Total reward: 199.0 Training loss: 0.2086 Explore P: 0.0100
Episode: 982 Total reward: 175.0 Training loss: 0.1433 Explore P: 0.0100
Episode: 983 Total reward: 187.0 Training loss: 0.1492 Explore P: 0.0100
Episode: 984 Total reward: 179.0 Training loss: 0.2051 Explore P: 0.0100
Episode: 985 Total reward: 199.0 Training loss: 0.1493 Explore P: 0.0100
Episode: 986 Total reward: 165.0 Training loss: 0.1331 Explore P: 0.0100
Episode: 987 Total reward: 181.0 Training loss: 0.0861 Explore P: 0.0100
Episode: 988 Total reward: 199.0 Training loss: 193.9763 Explore P: 0.0100
Episode: 989 Total reward: 199.0 Training loss: 0.0465 Explore P: 0.0100
Episode: 990 Total reward: 199.0 Training loss: 0.1077 Explore P: 0.0100
Episode: 991 Total reward: 199.0 Training loss: 0.0434 Explore P: 0.0100
Episode: 992 Total reward: 199.0 Training loss: 0.0697 Explore P: 0.0100
Episode: 993 Total reward: 199.0 Training loss: 0.1653 Explore P: 0.0100
Episode: 994 Total reward: 199.0 Training loss: 0.0617 Explore P: 0.0100
Episode: 995 Total reward: 199.0 Training loss: 252.0712 Explore P: 0.0100
Episode: 996 Total reward: 199.0 Training loss: 0.2247 Explore P: 0.0100
Episode: 997 Total reward: 199.0 Training loss: 0.1084 Explore P: 0.0100
Episode: 998 Total reward: 199.0 Training loss: 0.1546 Explore P: 0.0100
Episode: 999 Total reward: 199.0 Training loss: 0.1452 Explore P: 0.0100

In [ ]:
saver = tf.train.Saver()
with tf.Session() as sess:
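    # Caution: this initializes fresh (untrained) variables and saves them,
    # overwriting any trained checkpoint already stored at this path.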
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "checkpoints/cartpole.ckpt")

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [33]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N
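
To make the cumulative-sum trick in running_mean easier to follow, here is a small worked example on made-up reward values (the numbers are illustrative only, not taken from the training run). It also checks the result against a plain moving average computed with np.convolve, which should give the same values.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a 0 so that cumsum[i] holds the sum of the first i values of x
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of two cumulative sums N apart is the sum of one window
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical per-episode rewards, for illustration only
example_rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
smoothed = running_mean(example_rewards, 3)
print(smoothed)  # [20. 30. 40.]

# Same result from an explicit moving average over full windows only
reference = np.convolve(example_rewards, np.ones(3) / 3, mode='valid')
print(np.allclose(smoothed, reference))  # True
```

Note that the smoothed array is shorter than the input by N - 1 values, which is why the plotting cell uses eps[-len(smoothed_rews):] for the x-axis.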

In [34]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[34]:
<matplotlib.text.Text at 0x14e16832748>

Testing

Let's check out how our trained agent plays the game.


In [35]:
test_episodes = 10
test_max_steps = 400
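env.reset()
# Take one random step so that `state` is defined before the first Q-network query
state, reward, done, _ = env.step(env.action_space.sample())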
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(test_episodes):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1
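
During testing the agent acts greedily: it adds a batch dimension to the state, asks the network for one Q-value per action, and picks the action with the largest value. Here is a minimal NumPy sketch of just that step. The fake_q_network function and its weights are hypothetical stand-ins for mainQN, used only so the example runs without TensorFlow.

```python
import numpy as np

def fake_q_network(batch):
    # Hypothetical stand-in for the trained Q-network: a fixed linear map
    # from 4 state values to 2 action values (weights are made up).
    weights = np.array([[1.0, -1.0],
                        [0.5, 0.5],
                        [-2.0, 2.0],
                        [0.0, 1.0]])
    return batch @ weights

# A made-up Cart-Pole-like state with 4 values
state = np.array([0.01, -0.02, 0.03, 0.04])

# Add a leading batch dimension: shape (4,) becomes (1, 4)
batched_state = state.reshape((1, *state.shape))

# One row of Q-values, one column per action
Qs = fake_q_network(batched_state)

# Greedy choice: index of the largest Q-value
action = int(np.argmax(Qs))
print(batched_state.shape, Qs.shape, action)  # (1, 4) (1, 2) 1
```

There is no exploration here, unlike in training, where a random action is taken with probability Explore P.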

In [36]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a low-dimensional state vector like the one we're using here, though, you'd feed the screen images through convolutional layers and let the network learn the state representation from the pixels.
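
To get a feel for how convolutional layers shrink a screen image down to a feature vector, here is a small shape calculation. The input size (84x84 with 4 stacked frames) and the layer settings are assumptions based on the architecture described in the DQN paper linked below; treat them as an illustration rather than a specification for your own network.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial size after a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1

# Assumed layer settings: (kernel_size, stride, number_of_filters)
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

# Assumed input: 84x84 pixels with 4 stacked frames as channels
size, channels = 84, 4
shapes = []
for kernel_size, stride, filters in layers:
    size = conv_output_size(size, kernel_size, stride)
    channels = filters
    shapes.append((size, size, channels))
    print(shapes[-1])

# Number of features fed to the fully connected layers
flat_features = size * size * channels
print(flat_features)  # 3136
```

The flattened features play the role that the raw state vector plays in this notebook: they feed the fully connected layers that output one Q-value per action.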

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. The original paper, "Human-level control through deep reinforcement learning" (Mnih et al., Nature, 2015), is a good place to start: http://www.davidqiu.com:8888/research/nature14236.pdf.