Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [4]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym directory.


In [5]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-05-22 19:47:47,804] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and you can get a random action with env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [6]:
env.reset()
rewards = []
for _ in range(1000):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [7]:
print(rewards[-20:])


[1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

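$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$. The maximum is taken over the actions $a'$ available in $s'$.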

Previously, we used this equation to learn values for a Q-table. For this game, however, there is a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network uses fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
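
As a quick worked example with made-up numbers (these are assumptions for illustration, not values from a training run): suppose $r = 1$, $\gamma = 0.99$, the network predicts $Q(s', 0) = 0.5$ and $Q(s', 1) = 1.2$ for the next state, and its current estimate is $Q(s, a) = 2.0$. Then the target and the squared error work out to

$$ \hat{Q}(s,a) = 1 + 0.99 \times \max(0.5, 1.2) = 1 + 1.188 = 2.188 $$

$$ (\hat{Q}(s,a) - Q(s,a))^2 = (2.188 - 2.0)^2 \approx 0.0353 $$

so the gradient step would nudge $Q(s, a)$ upward, toward 2.188.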

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [8]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
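
To make the one-hot selection step in the network more concrete, here is a minimal sketch in plain Python, with lists standing in for tensors. The helper name select_q_values and the numbers are hypothetical and only for illustration; the idea is the same as multiplying the output by the one-hot actions and summing along each row.

```python
def select_q_values(q_rows, actions, action_size=2):
    """Pick one Q-value per row by masking with a one-hot vector and summing."""
    selected = []
    for q_row, action in zip(q_rows, actions):
        # One-hot encode the action, e.g. action 1 -> [0.0, 1.0]
        one_hot = [1.0 if i == action else 0.0 for i in range(action_size)]
        # Elementwise multiply, then sum: only the chosen action's Q-value survives
        selected.append(sum(q * m for q, m in zip(q_row, one_hot)))
    return selected

# Hypothetical network outputs for a batch of three states (made-up numbers)
q_rows = [[0.2, 1.5], [0.9, -0.3], [2.0, 2.5]]
actions = [1, 0, 0]
picked = select_q_values(q_rows, actions)
print(picked)
```

Each row contributes exactly one value to the loss, the Q-value of the action that was actually taken in that transition.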

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [9]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
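
To see the eviction behavior of deque described above, here is a small sketch with a deliberately tiny capacity. The string labels are placeholders standing in for real transitions.

```python
from collections import deque

# A buffer that holds at most three items
buffer = deque(maxlen=3)

# Append five placeholder experiences; the oldest ones fall out the other end
for experience in ["e1", "e2", "e3", "e4", "e5"]:
    buffer.append(experience)

contents = list(buffer)
print(contents)  # ['e3', 'e4', 'e5']
```

One thing to keep in mind with the Memory class: because sample draws without replacement, the buffer needs to hold at least batch_size experiences before sampling, otherwise np.random.choice raises a ValueError.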

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
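
Here is a minimal sketch of an $\epsilon$-greedy policy with a decaying $\epsilon$, written in plain Python. The function names are hypothetical; the decay formula and default values mirror the ones used in the training code later in this notebook.

```python
import math
import random

def exploration_probability(step, explore_start=1.0, explore_stop=0.01,
                            decay_rate=0.0001):
    """Exponentially decay epsilon from explore_start toward explore_stop."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

p0 = exploration_probability(0)           # start of training: explore almost always
p_late = exploration_probability(100000)  # much later: close to explore_stop
greedy_action = epsilon_greedy([0.1, 0.7], epsilon=0.0)  # never explores

print(p0, p_late, greedy_action)
```

With these settings the agent starts out acting almost entirely at random and gradually shifts toward the actions its Q-network prefers, without ever dropping below the minimum exploration probability.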

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
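
The target-setting step in the loop above can be sketched in plain Python as follows. The helper td_targets and the sample numbers are hypothetical; the point is that transitions ending an episode use only the reward, while the others add the discounted maximum Q-value of the next state.

```python
def td_targets(rewards, next_q_rows, episode_ends, gamma=0.99):
    """Compute Q-learning targets for a mini-batch of transitions."""
    targets = []
    for r, next_qs, ended in zip(rewards, next_q_rows, episode_ends):
        # No future reward if the episode ended on this transition
        future = 0.0 if ended else max(next_qs)
        targets.append(r + gamma * future)
    return targets

# Two made-up transitions: the second one ends its episode
rewards = [1.0, 1.0]
next_q_rows = [[0.5, 2.0], [3.0, 4.0]]
episode_ends = [False, True]

targets = td_targets(rewards, next_q_rows, episode_ends)
print(targets)
```

The training cell later in the notebook gets the same effect by zeroing out the next-state Q-values for transitions whose stored next state is all zeros.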

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [10]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pre-populate the memory with

In [11]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [12]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. Training is slower this way because rendering a frame takes longer than a training step. But, it's cool to watch the agent get better at the game.


In [13]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 5.0 Training loss: 1.0588 Explore P: 0.9995
Episode: 2 Total reward: 24.0 Training loss: 1.0618 Explore P: 0.9971
Episode: 3 Total reward: 34.0 Training loss: 1.0617 Explore P: 0.9938
Episode: 4 Total reward: 13.0 Training loss: 1.0393 Explore P: 0.9925
Episode: 5 Total reward: 12.0 Training loss: 1.0892 Explore P: 0.9913
Episode: 6 Total reward: 17.0 Training loss: 1.1700 Explore P: 0.9897
Episode: 7 Total reward: 40.0 Training loss: 1.0428 Explore P: 0.9857
Episode: 8 Total reward: 15.0 Training loss: 1.1843 Explore P: 0.9843
Episode: 9 Total reward: 16.0 Training loss: 0.9488 Explore P: 0.9827
Episode: 10 Total reward: 16.0 Training loss: 1.0449 Explore P: 0.9812
Episode: 11 Total reward: 31.0 Training loss: 0.9862 Explore P: 0.9782
Episode: 12 Total reward: 14.0 Training loss: 1.1230 Explore P: 0.9768
Episode: 13 Total reward: 14.0 Training loss: 1.4301 Explore P: 0.9755
Episode: 14 Total reward: 14.0 Training loss: 1.5071 Explore P: 0.9741
Episode: 15 Total reward: 13.0 Training loss: 1.4229 Explore P: 0.9729
Episode: 16 Total reward: 21.0 Training loss: 1.1063 Explore P: 0.9708
Episode: 17 Total reward: 13.0 Training loss: 1.2767 Explore P: 0.9696
Episode: 18 Total reward: 11.0 Training loss: 1.5523 Explore P: 0.9685
Episode: 19 Total reward: 36.0 Training loss: 1.4361 Explore P: 0.9651
Episode: 20 Total reward: 37.0 Training loss: 1.5553 Explore P: 0.9616
Episode: 21 Total reward: 29.0 Training loss: 1.6906 Explore P: 0.9588
Episode: 22 Total reward: 20.0 Training loss: 1.5591 Explore P: 0.9569
Episode: 23 Total reward: 23.0 Training loss: 1.8850 Explore P: 0.9547
Episode: 24 Total reward: 32.0 Training loss: 1.6303 Explore P: 0.9517
Episode: 25 Total reward: 13.0 Training loss: 2.8790 Explore P: 0.9505
Episode: 26 Total reward: 17.0 Training loss: 2.5971 Explore P: 0.9489
Episode: 27 Total reward: 31.0 Training loss: 2.9293 Explore P: 0.9460
Episode: 28 Total reward: 25.0 Training loss: 3.7511 Explore P: 0.9437
Episode: 29 Total reward: 36.0 Training loss: 3.5197 Explore P: 0.9403
Episode: 30 Total reward: 20.0 Training loss: 5.7538 Explore P: 0.9384
Episode: 31 Total reward: 13.0 Training loss: 2.3081 Explore P: 0.9372
Episode: 32 Total reward: 13.0 Training loss: 2.8482 Explore P: 0.9360
Episode: 33 Total reward: 13.0 Training loss: 3.0021 Explore P: 0.9348
Episode: 34 Total reward: 21.0 Training loss: 2.4114 Explore P: 0.9329
Episode: 35 Total reward: 19.0 Training loss: 3.6498 Explore P: 0.9311
Episode: 36 Total reward: 17.0 Training loss: 3.8424 Explore P: 0.9296
Episode: 37 Total reward: 9.0 Training loss: 5.2813 Explore P: 0.9287
Episode: 38 Total reward: 12.0 Training loss: 4.7760 Explore P: 0.9276
Episode: 39 Total reward: 12.0 Training loss: 3.4503 Explore P: 0.9265
Episode: 40 Total reward: 16.0 Training loss: 12.3317 Explore P: 0.9251
Episode: 41 Total reward: 17.0 Training loss: 5.0513 Explore P: 0.9235
Episode: 42 Total reward: 16.0 Training loss: 4.2114 Explore P: 0.9221
Episode: 43 Total reward: 9.0 Training loss: 4.6075 Explore P: 0.9212
Episode: 44 Total reward: 13.0 Training loss: 8.0044 Explore P: 0.9201
Episode: 45 Total reward: 25.0 Training loss: 4.6196 Explore P: 0.9178
Episode: 46 Total reward: 12.0 Training loss: 10.8814 Explore P: 0.9167
Episode: 47 Total reward: 42.0 Training loss: 20.0361 Explore P: 0.9129
Episode: 48 Total reward: 11.0 Training loss: 6.6052 Explore P: 0.9119
Episode: 49 Total reward: 36.0 Training loss: 8.6602 Explore P: 0.9087
Episode: 50 Total reward: 42.0 Training loss: 7.5404 Explore P: 0.9049
Episode: 51 Total reward: 13.0 Training loss: 14.1788 Explore P: 0.9037
Episode: 52 Total reward: 11.0 Training loss: 45.1842 Explore P: 0.9027
Episode: 53 Total reward: 26.0 Training loss: 24.7527 Explore P: 0.9004
Episode: 54 Total reward: 25.0 Training loss: 6.0091 Explore P: 0.8982
Episode: 55 Total reward: 13.0 Training loss: 10.4165 Explore P: 0.8971
Episode: 56 Total reward: 12.0 Training loss: 33.2102 Explore P: 0.8960
Episode: 57 Total reward: 20.0 Training loss: 11.9650 Explore P: 0.8942
Episode: 58 Total reward: 16.0 Training loss: 6.3694 Explore P: 0.8928
Episode: 59 Total reward: 22.0 Training loss: 7.1557 Explore P: 0.8909
Episode: 60 Total reward: 49.0 Training loss: 35.1889 Explore P: 0.8866
Episode: 61 Total reward: 14.0 Training loss: 7.4747 Explore P: 0.8853
Episode: 62 Total reward: 35.0 Training loss: 18.3602 Explore P: 0.8823
Episode: 63 Total reward: 11.0 Training loss: 8.4076 Explore P: 0.8813
Episode: 64 Total reward: 9.0 Training loss: 124.4428 Explore P: 0.8805
Episode: 65 Total reward: 21.0 Training loss: 11.8855 Explore P: 0.8787
Episode: 66 Total reward: 36.0 Training loss: 26.9472 Explore P: 0.8756
Episode: 67 Total reward: 22.0 Training loss: 60.4652 Explore P: 0.8737
Episode: 68 Total reward: 28.0 Training loss: 19.5380 Explore P: 0.8713
Episode: 69 Total reward: 11.0 Training loss: 124.2935 Explore P: 0.8703
Episode: 70 Total reward: 46.0 Training loss: 54.5932 Explore P: 0.8664
Episode: 71 Total reward: 12.0 Training loss: 39.5404 Explore P: 0.8653
Episode: 72 Total reward: 12.0 Training loss: 207.8224 Explore P: 0.8643
Episode: 73 Total reward: 32.0 Training loss: 14.5616 Explore P: 0.8616
Episode: 74 Total reward: 10.0 Training loss: 10.4682 Explore P: 0.8607
Episode: 75 Total reward: 20.0 Training loss: 25.3334 Explore P: 0.8590
Episode: 76 Total reward: 9.0 Training loss: 123.7935 Explore P: 0.8583
Episode: 77 Total reward: 28.0 Training loss: 35.3735 Explore P: 0.8559
Episode: 78 Total reward: 15.0 Training loss: 11.2103 Explore P: 0.8546
Episode: 79 Total reward: 8.0 Training loss: 53.1281 Explore P: 0.8540
Episode: 80 Total reward: 13.0 Training loss: 67.8966 Explore P: 0.8529
Episode: 81 Total reward: 14.0 Training loss: 84.4432 Explore P: 0.8517
Episode: 82 Total reward: 17.0 Training loss: 37.2997 Explore P: 0.8503
Episode: 83 Total reward: 26.0 Training loss: 13.7211 Explore P: 0.8481
Episode: 84 Total reward: 11.0 Training loss: 84.0106 Explore P: 0.8472
Episode: 85 Total reward: 27.0 Training loss: 276.7153 Explore P: 0.8449
Episode: 86 Total reward: 13.0 Training loss: 18.4485 Explore P: 0.8438
Episode: 87 Total reward: 10.0 Training loss: 67.7065 Explore P: 0.8430
Episode: 88 Total reward: 8.0 Training loss: 70.5460 Explore P: 0.8423
Episode: 89 Total reward: 11.0 Training loss: 23.1362 Explore P: 0.8414
Episode: 90 Total reward: 34.0 Training loss: 98.7986 Explore P: 0.8386
Episode: 91 Total reward: 33.0 Training loss: 122.8112 Explore P: 0.8358
Episode: 92 Total reward: 12.0 Training loss: 138.0466 Explore P: 0.8349
Episode: 93 Total reward: 25.0 Training loss: 138.5815 Explore P: 0.8328
Episode: 94 Total reward: 94.0 Training loss: 169.8482 Explore P: 0.8251
Episode: 95 Total reward: 10.0 Training loss: 53.6118 Explore P: 0.8243
Episode: 96 Total reward: 13.0 Training loss: 125.9160 Explore P: 0.8232
Episode: 97 Total reward: 27.0 Training loss: 64.4772 Explore P: 0.8210
Episode: 98 Total reward: 21.0 Training loss: 17.1346 Explore P: 0.8193
Episode: 99 Total reward: 16.0 Training loss: 54.8811 Explore P: 0.8180
Episode: 100 Total reward: 11.0 Training loss: 313.1962 Explore P: 0.8171
Episode: 101 Total reward: 40.0 Training loss: 10.3968 Explore P: 0.8139
Episode: 102 Total reward: 35.0 Training loss: 255.4780 Explore P: 0.8111
Episode: 103 Total reward: 27.0 Training loss: 113.6043 Explore P: 0.8090
Episode: 104 Total reward: 11.0 Training loss: 63.6114 Explore P: 0.8081
Episode: 105 Total reward: 12.0 Training loss: 117.9233 Explore P: 0.8071
Episode: 106 Total reward: 46.0 Training loss: 64.5221 Explore P: 0.8035
Episode: 107 Total reward: 51.0 Training loss: 74.7138 Explore P: 0.7994
Episode: 108 Total reward: 11.0 Training loss: 16.3486 Explore P: 0.7986
Episode: 109 Total reward: 45.0 Training loss: 15.3367 Explore P: 0.7950
Episode: 110 Total reward: 35.0 Training loss: 455.5434 Explore P: 0.7923
Episode: 111 Total reward: 13.0 Training loss: 12.3061 Explore P: 0.7913
Episode: 112 Total reward: 25.0 Training loss: 143.6290 Explore P: 0.7893
Episode: 113 Total reward: 15.0 Training loss: 236.9164 Explore P: 0.7881
Episode: 114 Total reward: 17.0 Training loss: 188.2619 Explore P: 0.7868
Episode: 115 Total reward: 20.0 Training loss: 213.6358 Explore P: 0.7853
Episode: 116 Total reward: 21.0 Training loss: 74.3607 Explore P: 0.7836
Episode: 117 Total reward: 10.0 Training loss: 84.4644 Explore P: 0.7829
Episode: 118 Total reward: 22.0 Training loss: 13.5751 Explore P: 0.7812
Episode: 119 Total reward: 12.0 Training loss: 12.9313 Explore P: 0.7802
Episode: 120 Total reward: 15.0 Training loss: 82.5695 Explore P: 0.7791
Episode: 121 Total reward: 13.0 Training loss: 308.4070 Explore P: 0.7781
Episode: 122 Total reward: 18.0 Training loss: 13.5162 Explore P: 0.7767
Episode: 123 Total reward: 12.0 Training loss: 11.6037 Explore P: 0.7758
Episode: 124 Total reward: 26.0 Training loss: 9.9916 Explore P: 0.7738
Episode: 125 Total reward: 11.0 Training loss: 565.5924 Explore P: 0.7730
Episode: 126 Total reward: 41.0 Training loss: 91.0995 Explore P: 0.7698
Episode: 127 Total reward: 20.0 Training loss: 191.0064 Explore P: 0.7683
Episode: 128 Total reward: 26.0 Training loss: 11.1226 Explore P: 0.7664
Episode: 129 Total reward: 14.0 Training loss: 182.5249 Explore P: 0.7653
Episode: 130 Total reward: 12.0 Training loss: 6.3756 Explore P: 0.7644
Episode: 131 Total reward: 15.0 Training loss: 314.0054 Explore P: 0.7633
Episode: 132 Total reward: 15.0 Training loss: 125.9996 Explore P: 0.7621
Episode: 133 Total reward: 12.0 Training loss: 129.4295 Explore P: 0.7612
Episode: 134 Total reward: 12.0 Training loss: 7.8751 Explore P: 0.7603
Episode: 135 Total reward: 17.0 Training loss: 6.1673 Explore P: 0.7590
Episode: 136 Total reward: 15.0 Training loss: 3.9564 Explore P: 0.7579
Episode: 137 Total reward: 14.0 Training loss: 6.8520 Explore P: 0.7569
Episode: 138 Total reward: 10.0 Training loss: 332.1107 Explore P: 0.7561
Episode: 139 Total reward: 14.0 Training loss: 6.6234 Explore P: 0.7551
Episode: 140 Total reward: 9.0 Training loss: 4.1617 Explore P: 0.7544
Episode: 141 Total reward: 16.0 Training loss: 5.7265 Explore P: 0.7532
Episode: 142 Total reward: 16.0 Training loss: 159.0388 Explore P: 0.7520
Episode: 143 Total reward: 39.0 Training loss: 248.1977 Explore P: 0.7492
Episode: 144 Total reward: 10.0 Training loss: 5.1094 Explore P: 0.7484
Episode: 145 Total reward: 18.0 Training loss: 227.9258 Explore P: 0.7471
Episode: 146 Total reward: 12.0 Training loss: 297.5279 Explore P: 0.7462
Episode: 147 Total reward: 9.0 Training loss: 87.0428 Explore P: 0.7455
Episode: 148 Total reward: 24.0 Training loss: 329.7943 Explore P: 0.7438
Episode: 149 Total reward: 16.0 Training loss: 97.4512 Explore P: 0.7426
Episode: 150 Total reward: 13.0 Training loss: 256.7272 Explore P: 0.7417
Episode: 151 Total reward: 18.0 Training loss: 138.3042 Explore P: 0.7403
Episode: 152 Total reward: 34.0 Training loss: 123.3755 Explore P: 0.7379
Episode: 153 Total reward: 15.0 Training loss: 100.3941 Explore P: 0.7368
Episode: 154 Total reward: 23.0 Training loss: 2.8372 Explore P: 0.7351
Episode: 155 Total reward: 39.0 Training loss: 271.7397 Explore P: 0.7323
Episode: 156 Total reward: 15.0 Training loss: 115.8956 Explore P: 0.7312
Episode: 157 Total reward: 14.0 Training loss: 5.4791 Explore P: 0.7302
Episode: 158 Total reward: 14.0 Training loss: 198.9180 Explore P: 0.7292
Episode: 159 Total reward: 14.0 Training loss: 84.0318 Explore P: 0.7282
Episode: 160 Total reward: 13.0 Training loss: 86.8574 Explore P: 0.7272
Episode: 161 Total reward: 15.0 Training loss: 223.5095 Explore P: 0.7262
Episode: 162 Total reward: 26.0 Training loss: 90.9456 Explore P: 0.7243
Episode: 163 Total reward: 29.0 Training loss: 229.2346 Explore P: 0.7222
Episode: 164 Total reward: 30.0 Training loss: 86.6350 Explore P: 0.7201
Episode: 165 Total reward: 20.0 Training loss: 93.6757 Explore P: 0.7187
Episode: 166 Total reward: 10.0 Training loss: 83.5062 Explore P: 0.7180
Episode: 167 Total reward: 8.0 Training loss: 175.0676 Explore P: 0.7174
Episode: 168 Total reward: 14.0 Training loss: 76.6483 Explore P: 0.7164
Episode: 169 Total reward: 11.0 Training loss: 310.2247 Explore P: 0.7156
Episode: 170 Total reward: 33.0 Training loss: 1.3719 Explore P: 0.7133
Episode: 171 Total reward: 7.0 Training loss: 2.1521 Explore P: 0.7128
Episode: 172 Total reward: 16.0 Training loss: 1.8970 Explore P: 0.7117
Episode: 173 Total reward: 12.0 Training loss: 62.3190 Explore P: 0.7109
Episode: 174 Total reward: 13.0 Training loss: 1.4988 Explore P: 0.7099
Episode: 175 Total reward: 12.0 Training loss: 69.4134 Explore P: 0.7091
Episode: 176 Total reward: 13.0 Training loss: 84.7166 Explore P: 0.7082
Episode: 177 Total reward: 11.0 Training loss: 125.9708 Explore P: 0.7074
Episode: 178 Total reward: 13.0 Training loss: 62.0274 Explore P: 0.7065
Episode: 179 Total reward: 16.0 Training loss: 55.0137 Explore P: 0.7054
Episode: 180 Total reward: 14.0 Training loss: 65.8498 Explore P: 0.7044
Episode: 181 Total reward: 14.0 Training loss: 131.0564 Explore P: 0.7035
Episode: 182 Total reward: 15.0 Training loss: 68.6104 Explore P: 0.7024
Episode: 183 Total reward: 15.0 Training loss: 75.1950 Explore P: 0.7014
Episode: 184 Total reward: 15.0 Training loss: 1.8987 Explore P: 0.7004
Episode: 185 Total reward: 19.0 Training loss: 196.3994 Explore P: 0.6990
Episode: 186 Total reward: 8.0 Training loss: 234.4035 Explore P: 0.6985
Episode: 187 Total reward: 28.0 Training loss: 1.7446 Explore P: 0.6966
Episode: 188 Total reward: 16.0 Training loss: 1.6030 Explore P: 0.6955
Episode: 189 Total reward: 13.0 Training loss: 128.4129 Explore P: 0.6946
Episode: 190 Total reward: 15.0 Training loss: 1.3983 Explore P: 0.6936
Episode: 191 Total reward: 15.0 Training loss: 71.0388 Explore P: 0.6925
Episode: 192 Total reward: 14.0 Training loss: 1.8310 Explore P: 0.6916
Episode: 193 Total reward: 38.0 Training loss: 284.1143 Explore P: 0.6890
Episode: 194 Total reward: 12.0 Training loss: 105.4411 Explore P: 0.6882
Episode: 195 Total reward: 9.0 Training loss: 1.7993 Explore P: 0.6876
Episode: 196 Total reward: 15.0 Training loss: 46.5499 Explore P: 0.6865
Episode: 197 Total reward: 11.0 Training loss: 2.4397 Explore P: 0.6858
Episode: 198 Total reward: 11.0 Training loss: 1.7653 Explore P: 0.6851
Episode: 199 Total reward: 16.0 Training loss: 144.7834 Explore P: 0.6840
Episode: 200 Total reward: 16.0 Training loss: 1.7505 Explore P: 0.6829
Episode: 201 Total reward: 11.0 Training loss: 45.1649 Explore P: 0.6822
Episode: 202 Total reward: 13.0 Training loss: 1.2387 Explore P: 0.6813
Episode: 203 Total reward: 9.0 Training loss: 45.2931 Explore P: 0.6807
Episode: 204 Total reward: 11.0 Training loss: 93.8861 Explore P: 0.6800
Episode: 205 Total reward: 42.0 Training loss: 60.6165 Explore P: 0.6771
Episode: 206 Total reward: 15.0 Training loss: 162.9738 Explore P: 0.6761
Episode: 207 Total reward: 12.0 Training loss: 88.6133 Explore P: 0.6753
Episode: 208 Total reward: 8.0 Training loss: 172.6174 Explore P: 0.6748
Episode: 209 Total reward: 10.0 Training loss: 2.2126 Explore P: 0.6741
Episode: 210 Total reward: 21.0 Training loss: 127.5842 Explore P: 0.6728
Episode: 211 Total reward: 18.0 Training loss: 36.9022 Explore P: 0.6716
Episode: 212 Total reward: 11.0 Training loss: 40.7589 Explore P: 0.6708
Episode: 213 Total reward: 17.0 Training loss: 127.2611 Explore P: 0.6697
Episode: 214 Total reward: 12.0 Training loss: 59.2900 Explore P: 0.6689
Episode: 215 Total reward: 16.0 Training loss: 2.3508 Explore P: 0.6679
Episode: 216 Total reward: 28.0 Training loss: 97.4240 Explore P: 0.6660
Episode: 217 Total reward: 11.0 Training loss: 69.2299 Explore P: 0.6653
Episode: 218 Total reward: 13.0 Training loss: 153.9853 Explore P: 0.6645
Episode: 219 Total reward: 22.0 Training loss: 71.7378 Explore P: 0.6630
Episode: 220 Total reward: 14.0 Training loss: 2.0904 Explore P: 0.6621
Episode: 221 Total reward: 14.0 Training loss: 66.6762 Explore P: 0.6612
Episode: 222 Total reward: 47.0 Training loss: 102.3139 Explore P: 0.6581
Episode: 223 Total reward: 12.0 Training loss: 2.8000 Explore P: 0.6574
Episode: 224 Total reward: 16.0 Training loss: 224.2620 Explore P: 0.6563
Episode: 225 Total reward: 11.0 Training loss: 2.9411 Explore P: 0.6556
Episode: 226 Total reward: 16.0 Training loss: 2.1921 Explore P: 0.6546
Episode: 227 Total reward: 36.0 Training loss: 149.3751 Explore P: 0.6523
Episode: 228 Total reward: 17.0 Training loss: 117.4525 Explore P: 0.6512
Episode: 229 Total reward: 12.0 Training loss: 2.1700 Explore P: 0.6504
Episode: 230 Total reward: 9.0 Training loss: 95.6893 Explore P: 0.6498
Episode: 231 Total reward: 25.0 Training loss: 31.4540 Explore P: 0.6482
Episode: 232 Total reward: 21.0 Training loss: 30.7645 Explore P: 0.6469
Episode: 233 Total reward: 8.0 Training loss: 1.8974 Explore P: 0.6464
Episode: 234 Total reward: 16.0 Training loss: 2.6009 Explore P: 0.6454
Episode: 235 Total reward: 15.0 Training loss: 29.7990 Explore P: 0.6444
Episode: 236 Total reward: 13.0 Training loss: 69.0963 Explore P: 0.6436
Episode: 237 Total reward: 12.0 Training loss: 1.9837 Explore P: 0.6428
Episode: 238 Total reward: 18.0 Training loss: 1.8708 Explore P: 0.6417
Episode: 239 Total reward: 26.0 Training loss: 87.1172 Explore P: 0.6401
Episode: 240 Total reward: 14.0 Training loss: 2.0180 Explore P: 0.6392
Episode: 241 Total reward: 20.0 Training loss: 29.0890 Explore P: 0.6379
Episode: 242 Total reward: 10.0 Training loss: 90.4796 Explore P: 0.6373
Episode: 243 Total reward: 17.0 Training loss: 32.5607 Explore P: 0.6362
Episode: 244 Total reward: 15.0 Training loss: 2.3887 Explore P: 0.6353
Episode: 245 Total reward: 9.0 Training loss: 28.2081 Explore P: 0.6347
Episode: 246 Total reward: 9.0 Training loss: 118.6945 Explore P: 0.6342
Episode: 247 Total reward: 21.0 Training loss: 41.7661 Explore P: 0.6328
Episode: 248 Total reward: 10.0 Training loss: 2.0932 Explore P: 0.6322
Episode: 249 Total reward: 17.0 Training loss: 30.1837 Explore P: 0.6312
Episode: 250 Total reward: 8.0 Training loss: 94.7369 Explore P: 0.6307
Episode: 251 Total reward: 17.0 Training loss: 86.2867 Explore P: 0.6296
Episode: 252 Total reward: 20.0 Training loss: 72.6912 Explore P: 0.6284
Episode: 253 Total reward: 10.0 Training loss: 2.1940 Explore P: 0.6278
Episode: 254 Total reward: 22.0 Training loss: 77.5394 Explore P: 0.6264
Episode: 255 Total reward: 16.0 Training loss: 48.2741 Explore P: 0.6254
Episode: 256 Total reward: 12.0 Training loss: 30.9249 Explore P: 0.6247
Episode: 257 Total reward: 12.0 Training loss: 2.1257 Explore P: 0.6239
Episode: 258 Total reward: 10.0 Training loss: 26.8110 Explore P: 0.6233
Episode: 259 Total reward: 32.0 Training loss: 1.3336 Explore P: 0.6214
Episode: 260 Total reward: 10.0 Training loss: 62.2616 Explore P: 0.6208
Episode: 261 Total reward: 21.0 Training loss: 32.1638 Explore P: 0.6195
Episode: 262 Total reward: 14.0 Training loss: 57.3722 Explore P: 0.6186
Episode: 263 Total reward: 13.0 Training loss: 28.5761 Explore P: 0.6178
Episode: 264 Total reward: 21.0 Training loss: 1.7326 Explore P: 0.6166
Episode: 265 Total reward: 11.0 Training loss: 114.8702 Explore P: 0.6159
Episode: 266 Total reward: 13.0 Training loss: 1.8172 Explore P: 0.6151
Episode: 267 Total reward: 16.0 Training loss: 30.8546 Explore P: 0.6141
Episode: 268 Total reward: 13.0 Training loss: 54.6504 Explore P: 0.6134
Episode: 269 Total reward: 9.0 Training loss: 76.2201 Explore P: 0.6128
Episode: 270 Total reward: 8.0 Training loss: 101.3165 Explore P: 0.6123
Episode: 271 Total reward: 16.0 Training loss: 1.8956 Explore P: 0.6114
Episode: 272 Total reward: 10.0 Training loss: 83.8449 Explore P: 0.6108
Episode: 273 Total reward: 12.0 Training loss: 1.7881 Explore P: 0.6100
Episode: 274 Total reward: 16.0 Training loss: 109.1081 Explore P: 0.6091
Episode: 275 Total reward: 18.0 Training loss: 27.2456 Explore P: 0.6080
Episode: 276 Total reward: 12.0 Training loss: 44.1704 Explore P: 0.6073
Episode: 277 Total reward: 14.0 Training loss: 1.4409 Explore P: 0.6065
Episode: 278 Total reward: 17.0 Training loss: 1.9952 Explore P: 0.6054
Episode: 279 Total reward: 16.0 Training loss: 26.4580 Explore P: 0.6045
Episode: 280 Total reward: 8.0 Training loss: 122.6257 Explore P: 0.6040
Episode: 281 Total reward: 22.0 Training loss: 1.6973 Explore P: 0.6027
Episode: 282 Total reward: 38.0 Training loss: 26.1936 Explore P: 0.6005
Episode: 283 Total reward: 13.0 Training loss: 76.7918 Explore P: 0.5997
Episode: 284 Total reward: 15.0 Training loss: 30.3083 Explore P: 0.5988
Episode: 285 Total reward: 11.0 Training loss: 28.8462 Explore P: 0.5982
Episode: 286 Total reward: 18.0 Training loss: 26.5176 Explore P: 0.5971
Episode: 287 Total reward: 40.0 Training loss: 65.0132 Explore P: 0.5948
Episode: 288 Total reward: 15.0 Training loss: 1.2872 Explore P: 0.5939
Episode: 289 Total reward: 22.0 Training loss: 36.9935 Explore P: 0.5926
Episode: 290 Total reward: 12.0 Training loss: 66.5094 Explore P: 0.5919
Episode: 291 Total reward: 7.0 Training loss: 39.6055 Explore P: 0.5915
Episode: 292 Total reward: 10.0 Training loss: 1.0844 Explore P: 0.5909
Episode: 293 Total reward: 54.0 Training loss: 0.9500 Explore P: 0.5878
Episode: 294 Total reward: 20.0 Training loss: 0.7284 Explore P: 0.5866
Episode: 295 Total reward: 9.0 Training loss: 25.2031 Explore P: 0.5861
Episode: 296 Total reward: 12.0 Training loss: 74.5184 Explore P: 0.5854
Episode: 297 Total reward: 12.0 Training loss: 0.7556 Explore P: 0.5847
Episode: 298 Total reward: 10.0 Training loss: 0.5896 Explore P: 0.5842
Episode: 299 Total reward: 15.0 Training loss: 54.9766 Explore P: 0.5833
Episode: 300 Total reward: 15.0 Training loss: 32.2130 Explore P: 0.5824
Episode: 301 Total reward: 12.0 Training loss: 38.7178 Explore P: 0.5818
Episode: 302 Total reward: 18.0 Training loss: 0.7695 Explore P: 0.5807
Episode: 303 Total reward: 17.0 Training loss: 1.0414 Explore P: 0.5798
Episode: 304 Total reward: 16.0 Training loss: 92.4873 Explore P: 0.5788
Episode: 305 Total reward: 34.0 Training loss: 59.9632 Explore P: 0.5769
Episode: 306 Total reward: 34.0 Training loss: 29.4838 Explore P: 0.5750
Episode: 307 Total reward: 11.0 Training loss: 84.9022 Explore P: 0.5744
Episode: 308 Total reward: 25.0 Training loss: 0.7962 Explore P: 0.5730
Episode: 309 Total reward: 24.0 Training loss: 18.8024 Explore P: 0.5716
Episode: 310 Total reward: 16.0 Training loss: 0.9373 Explore P: 0.5707
Episode: 311 Total reward: 29.0 Training loss: 19.8597 Explore P: 0.5691
Episode: 312 Total reward: 9.0 Training loss: 76.2919 Explore P: 0.5686
Episode: 313 Total reward: 51.0 Training loss: 0.6059 Explore P: 0.5657
Episode: 314 Total reward: 45.0 Training loss: 0.7192 Explore P: 0.5632
Episode: 315 Total reward: 34.0 Training loss: 21.0033 Explore P: 0.5614
Episode: 316 Total reward: 43.0 Training loss: 0.8297 Explore P: 0.5590
Episode: 317 Total reward: 57.0 Training loss: 15.3465 Explore P: 0.5559
Episode: 318 Total reward: 29.0 Training loss: 22.7318 Explore P: 0.5543
Episode: 319 Total reward: 33.0 Training loss: 0.9356 Explore P: 0.5525
Episode: 320 Total reward: 32.0 Training loss: 0.8677 Explore P: 0.5508
Episode: 321 Total reward: 22.0 Training loss: 0.7380 Explore P: 0.5496
Episode: 322 Total reward: 49.0 Training loss: 42.5348 Explore P: 0.5469
Episode: 323 Total reward: 48.0 Training loss: 20.6781 Explore P: 0.5444
Episode: 324 Total reward: 40.0 Training loss: 20.9402 Explore P: 0.5422
Episode: 325 Total reward: 26.0 Training loss: 21.8174 Explore P: 0.5409
Episode: 326 Total reward: 62.0 Training loss: 16.5017 Explore P: 0.5376
Episode: 327 Total reward: 62.0 Training loss: 16.5181 Explore P: 0.5343
Episode: 328 Total reward: 71.0 Training loss: 16.2410 Explore P: 0.5306
Episode: 329 Total reward: 58.0 Training loss: 0.7613 Explore P: 0.5276
Episode: 330 Total reward: 31.0 Training loss: 20.9528 Explore P: 0.5260
Episode: 331 Total reward: 40.0 Training loss: 0.7386 Explore P: 0.5239
Episode: 332 Total reward: 57.0 Training loss: 21.8223 Explore P: 0.5210
Episode: 333 Total reward: 54.0 Training loss: 20.9304 Explore P: 0.5183
Episode: 334 Total reward: 18.0 Training loss: 16.2166 Explore P: 0.5174
Episode: 335 Total reward: 21.0 Training loss: 34.0123 Explore P: 0.5163
Episode: 336 Total reward: 81.0 Training loss: 30.8050 Explore P: 0.5122
Episode: 337 Total reward: 56.0 Training loss: 15.3068 Explore P: 0.5094
Episode: 338 Total reward: 41.0 Training loss: 17.1834 Explore P: 0.5074
Episode: 339 Total reward: 28.0 Training loss: 1.0596 Explore P: 0.5060
Episode: 340 Total reward: 23.0 Training loss: 30.9592 Explore P: 0.5048
Episode: 341 Total reward: 27.0 Training loss: 17.3109 Explore P: 0.5035
Episode: 342 Total reward: 55.0 Training loss: 33.0047 Explore P: 0.5008
Episode: 343 Total reward: 77.0 Training loss: 1.2227 Explore P: 0.4970
Episode: 344 Total reward: 53.0 Training loss: 43.5802 Explore P: 0.4944
Episode: 345 Total reward: 14.0 Training loss: 1.3022 Explore P: 0.4938
Episode: 346 Total reward: 151.0 Training loss: 1.2523 Explore P: 0.4865
Episode: 347 Total reward: 41.0 Training loss: 1.1699 Explore P: 0.4846
Episode: 348 Total reward: 61.0 Training loss: 20.7276 Explore P: 0.4817
Episode: 349 Total reward: 30.0 Training loss: 19.4818 Explore P: 0.4803
Episode: 350 Total reward: 35.0 Training loss: 11.5242 Explore P: 0.4786
Episode: 351 Total reward: 29.0 Training loss: 1.5608 Explore P: 0.4773
Episode: 352 Total reward: 27.0 Training loss: 1.1080 Explore P: 0.4760
Episode: 353 Total reward: 42.0 Training loss: 1.0444 Explore P: 0.4741
Episode: 354 Total reward: 34.0 Training loss: 21.3482 Explore P: 0.4725
Episode: 355 Total reward: 78.0 Training loss: 1.2409 Explore P: 0.4689
Episode: 356 Total reward: 28.0 Training loss: 15.1475 Explore P: 0.4676
Episode: 357 Total reward: 94.0 Training loss: 23.9069 Explore P: 0.4633
Episode: 358 Total reward: 25.0 Training loss: 40.0987 Explore P: 0.4622
Episode: 359 Total reward: 64.0 Training loss: 16.8704 Explore P: 0.4593
Episode: 360 Total reward: 36.0 Training loss: 1.3485 Explore P: 0.4577
Episode: 361 Total reward: 141.0 Training loss: 16.4264 Explore P: 0.4514
Episode: 362 Total reward: 48.0 Training loss: 0.9454 Explore P: 0.4493
Episode: 363 Total reward: 72.0 Training loss: 12.6144 Explore P: 0.4462
Episode: 364 Total reward: 15.0 Training loss: 22.8255 Explore P: 0.4455
Episode: 365 Total reward: 131.0 Training loss: 16.0637 Explore P: 0.4398
Episode: 366 Total reward: 39.0 Training loss: 1.4317 Explore P: 0.4382
Episode: 367 Total reward: 27.0 Training loss: 1.5250 Explore P: 0.4370
Episode: 368 Total reward: 40.0 Training loss: 22.6081 Explore P: 0.4353
Episode: 369 Total reward: 60.0 Training loss: 24.8746 Explore P: 0.4328
Episode: 370 Total reward: 77.0 Training loss: 2.3809 Explore P: 0.4295
Episode: 371 Total reward: 55.0 Training loss: 35.0495 Explore P: 0.4272
Episode: 372 Total reward: 42.0 Training loss: 29.4056 Explore P: 0.4255
Episode: 373 Total reward: 90.0 Training loss: 1.1990 Explore P: 0.4217
Episode: 374 Total reward: 16.0 Training loss: 15.5992 Explore P: 0.4211
Episode: 375 Total reward: 22.0 Training loss: 1.0841 Explore P: 0.4202
Episode: 376 Total reward: 37.0 Training loss: 1.5254 Explore P: 0.4187
Episode: 377 Total reward: 69.0 Training loss: 30.1577 Explore P: 0.4159
Episode: 378 Total reward: 30.0 Training loss: 1.4075 Explore P: 0.4146
Episode: 379 Total reward: 40.0 Training loss: 1.4806 Explore P: 0.4130
Episode: 380 Total reward: 44.0 Training loss: 34.2348 Explore P: 0.4113
Episode: 381 Total reward: 32.0 Training loss: 0.8458 Explore P: 0.4100
Episode: 382 Total reward: 56.0 Training loss: 29.9617 Explore P: 0.4077
Episode: 383 Total reward: 57.0 Training loss: 19.1436 Explore P: 0.4055
Episode: 384 Total reward: 43.0 Training loss: 1.1400 Explore P: 0.4038
Episode: 385 Total reward: 19.0 Training loss: 55.2904 Explore P: 0.4030
Episode: 386 Total reward: 36.0 Training loss: 40.7717 Explore P: 0.4016
Episode: 387 Total reward: 51.0 Training loss: 51.4531 Explore P: 0.3996
Episode: 388 Total reward: 69.0 Training loss: 117.1470 Explore P: 0.3970
Episode: 389 Total reward: 185.0 Training loss: 18.9314 Explore P: 0.3899
Episode: 390 Total reward: 89.0 Training loss: 34.4780 Explore P: 0.3865
Episode: 391 Total reward: 63.0 Training loss: 16.3456 Explore P: 0.3841
Episode: 392 Total reward: 127.0 Training loss: 1.5270 Explore P: 0.3794
Episode: 393 Total reward: 157.0 Training loss: 18.8347 Explore P: 0.3737
Episode: 394 Total reward: 52.0 Training loss: 75.2706 Explore P: 0.3718
Episode: 395 Total reward: 45.0 Training loss: 1.9994 Explore P: 0.3701
Episode: 396 Total reward: 93.0 Training loss: 1.5925 Explore P: 0.3668
Episode: 397 Total reward: 76.0 Training loss: 1.6710 Explore P: 0.3641
Episode: 398 Total reward: 119.0 Training loss: 14.2343 Explore P: 0.3599
Episode: 399 Total reward: 100.0 Training loss: 1.3071 Explore P: 0.3564
Episode: 400 Total reward: 43.0 Training loss: 41.6047 Explore P: 0.3550
Episode: 401 Total reward: 104.0 Training loss: 38.9699 Explore P: 0.3514
Episode: 402 Total reward: 169.0 Training loss: 35.7481 Explore P: 0.3457
Episode: 403 Total reward: 44.0 Training loss: 18.7158 Explore P: 0.3442
Episode: 404 Total reward: 93.0 Training loss: 21.3677 Explore P: 0.3411
Episode: 405 Total reward: 188.0 Training loss: 1.4079 Explore P: 0.3349
Episode: 406 Total reward: 182.0 Training loss: 2.4916 Explore P: 0.3291
Episode: 407 Total reward: 112.0 Training loss: 1.6176 Explore P: 0.3255
Episode: 408 Total reward: 135.0 Training loss: 1.3305 Explore P: 0.3213
Episode: 409 Total reward: 199.0 Training loss: 2.3812 Explore P: 0.3152
Episode: 410 Total reward: 141.0 Training loss: 1.2047 Explore P: 0.3109
Episode: 411 Total reward: 174.0 Training loss: 1.1658 Explore P: 0.3057
Episode: 412 Total reward: 199.0 Training loss: 33.4217 Explore P: 0.2999
Episode: 413 Total reward: 101.0 Training loss: 2.1347 Explore P: 0.2969
Episode: 414 Total reward: 97.0 Training loss: 1.1805 Explore P: 0.2942
Episode: 415 Total reward: 165.0 Training loss: 28.0199 Explore P: 0.2895
Episode: 416 Total reward: 137.0 Training loss: 34.9022 Explore P: 0.2857
Episode: 417 Total reward: 105.0 Training loss: 27.8470 Explore P: 0.2828
Episode: 418 Total reward: 53.0 Training loss: 27.3613 Explore P: 0.2814
Episode: 419 Total reward: 41.0 Training loss: 2.7112 Explore P: 0.2803
Episode: 420 Total reward: 65.0 Training loss: 3.7474 Explore P: 0.2785
Episode: 421 Total reward: 128.0 Training loss: 55.5022 Explore P: 0.2751
Episode: 422 Total reward: 68.0 Training loss: 2.4944 Explore P: 0.2733
Episode: 423 Total reward: 91.0 Training loss: 2.3039 Explore P: 0.2709
Episode: 424 Total reward: 76.0 Training loss: 44.1019 Explore P: 0.2690
Episode: 425 Total reward: 36.0 Training loss: 1.4692 Explore P: 0.2680
Episode: 426 Total reward: 35.0 Training loss: 2.3150 Explore P: 0.2671
Episode: 427 Total reward: 57.0 Training loss: 23.9872 Explore P: 0.2657
Episode: 428 Total reward: 42.0 Training loss: 1.1378 Explore P: 0.2646
Episode: 429 Total reward: 48.0 Training loss: 2.2785 Explore P: 0.2634
Episode: 430 Total reward: 111.0 Training loss: 63.8442 Explore P: 0.2606
Episode: 431 Total reward: 104.0 Training loss: 1.4861 Explore P: 0.2580
Episode: 432 Total reward: 39.0 Training loss: 1.5713 Explore P: 0.2570
Episode: 433 Total reward: 47.0 Training loss: 1.8032 Explore P: 0.2559
Episode: 434 Total reward: 121.0 Training loss: 2.2711 Explore P: 0.2529
Episode: 435 Total reward: 129.0 Training loss: 83.0015 Explore P: 0.2498
Episode: 436 Total reward: 153.0 Training loss: 60.7962 Explore P: 0.2462
Episode: 437 Total reward: 199.0 Training loss: 3.2239 Explore P: 0.2415
Episode: 438 Total reward: 125.0 Training loss: 0.8784 Explore P: 0.2386
Episode: 439 Total reward: 199.0 Training loss: 2.1103 Explore P: 0.2341
Episode: 440 Total reward: 99.0 Training loss: 2.5652 Explore P: 0.2319
Episode: 441 Total reward: 142.0 Training loss: 18.2520 Explore P: 0.2288
Episode: 442 Total reward: 163.0 Training loss: 1.1996 Explore P: 0.2253
Episode: 443 Total reward: 199.0 Training loss: 1.2117 Explore P: 0.2210
Episode: 444 Total reward: 132.0 Training loss: 86.1816 Explore P: 0.2182
Episode: 445 Total reward: 169.0 Training loss: 1.1743 Explore P: 0.2148
Episode: 446 Total reward: 199.0 Training loss: 1.9266 Explore P: 0.2107
Episode: 447 Total reward: 199.0 Training loss: 0.7718 Explore P: 0.2068
Episode: 448 Total reward: 199.0 Training loss: 1.1419 Explore P: 0.2029
Episode: 449 Total reward: 199.0 Training loss: 0.7290 Explore P: 0.1991
Episode: 450 Total reward: 199.0 Training loss: 86.3484 Explore P: 0.1954
Episode: 451 Total reward: 199.0 Training loss: 1.6509 Explore P: 0.1917
Episode: 452 Total reward: 132.0 Training loss: 1.3319 Explore P: 0.1893
Episode: 453 Total reward: 99.0 Training loss: 1.0836 Explore P: 0.1876
Episode: 454 Total reward: 130.0 Training loss: 130.0077 Explore P: 0.1853
Episode: 455 Total reward: 23.0 Training loss: 1.2447 Explore P: 0.1849
Episode: 456 Total reward: 48.0 Training loss: 1.2189 Explore P: 0.1840
Episode: 457 Total reward: 106.0 Training loss: 1.0335 Explore P: 0.1822
Episode: 458 Total reward: 108.0 Training loss: 176.6644 Explore P: 0.1803
Episode: 459 Total reward: 42.0 Training loss: 1.3266 Explore P: 0.1796
Episode: 460 Total reward: 61.0 Training loss: 202.2074 Explore P: 0.1786
Episode: 461 Total reward: 21.0 Training loss: 1.2708 Explore P: 0.1782
Episode: 462 Total reward: 41.0 Training loss: 1.2701 Explore P: 0.1776
Episode: 463 Total reward: 112.0 Training loss: 1.1035 Explore P: 0.1757
Episode: 464 Total reward: 125.0 Training loss: 186.4967 Explore P: 0.1736
Episode: 465 Total reward: 101.0 Training loss: 0.8162 Explore P: 0.1720
Episode: 466 Total reward: 120.0 Training loss: 1.0107 Explore P: 0.1701
Episode: 467 Total reward: 116.0 Training loss: 243.9276 Explore P: 0.1682
Episode: 468 Total reward: 101.0 Training loss: 218.0991 Explore P: 0.1666
Episode: 469 Total reward: 133.0 Training loss: 1.4167 Explore P: 0.1645
Episode: 470 Total reward: 128.0 Training loss: 1.1612 Explore P: 0.1626
Episode: 471 Total reward: 35.0 Training loss: 1.8624 Explore P: 0.1620
Episode: 472 Total reward: 49.0 Training loss: 2.1176 Explore P: 0.1613
Episode: 473 Total reward: 32.0 Training loss: 1.5792 Explore P: 0.1608
Episode: 474 Total reward: 106.0 Training loss: 1.6341 Explore P: 0.1592
Episode: 475 Total reward: 67.0 Training loss: 2.0074 Explore P: 0.1582
Episode: 476 Total reward: 31.0 Training loss: 2.2820 Explore P: 0.1578
Episode: 477 Total reward: 32.0 Training loss: 1.8767 Explore P: 0.1573
Episode: 478 Total reward: 32.0 Training loss: 265.7396 Explore P: 0.1568
Episode: 479 Total reward: 23.0 Training loss: 2.6004 Explore P: 0.1565
Episode: 480 Total reward: 32.0 Training loss: 0.9482 Explore P: 0.1560
Episode: 481 Total reward: 59.0 Training loss: 1.2068 Explore P: 0.1552
Episode: 482 Total reward: 109.0 Training loss: 1.2989 Explore P: 0.1536
Episode: 483 Total reward: 121.0 Training loss: 1.4157 Explore P: 0.1519
Episode: 484 Total reward: 113.0 Training loss: 0.9474 Explore P: 0.1503
Episode: 485 Total reward: 111.0 Training loss: 267.5251 Explore P: 0.1487
Episode: 486 Total reward: 144.0 Training loss: 1.3143 Explore P: 0.1467
Episode: 487 Total reward: 141.0 Training loss: 1.1825 Explore P: 0.1448
Episode: 488 Total reward: 135.0 Training loss: 1.5205 Explore P: 0.1430
Episode: 489 Total reward: 121.0 Training loss: 0.9781 Explore P: 0.1414
Episode: 490 Total reward: 199.0 Training loss: 0.8702 Explore P: 0.1388
Episode: 491 Total reward: 199.0 Training loss: 0.3945 Explore P: 0.1363
Episode: 492 Total reward: 199.0 Training loss: 1.1744 Explore P: 0.1338
Episode: 493 Total reward: 199.0 Training loss: 1.8055 Explore P: 0.1314
Episode: 494 Total reward: 199.0 Training loss: 310.5743 Explore P: 0.1290
Episode: 495 Total reward: 199.0 Training loss: 1.3793 Explore P: 0.1266
Episode: 496 Total reward: 183.0 Training loss: 227.0763 Explore P: 0.1245
Episode: 497 Total reward: 189.0 Training loss: 2.4799 Explore P: 0.1224
Episode: 498 Total reward: 199.0 Training loss: 1.0757 Explore P: 0.1202
Episode: 499 Total reward: 157.0 Training loss: 0.4421 Explore P: 0.1184
Episode: 500 Total reward: 199.0 Training loss: 330.7383 Explore P: 0.1163
Episode: 501 Total reward: 199.0 Training loss: 0.8849 Explore P: 0.1142
Episode: 502 Total reward: 199.0 Training loss: 1.0228 Explore P: 0.1122
Episode: 503 Total reward: 199.0 Training loss: 0.7168 Explore P: 0.1101
Episode: 504 Total reward: 199.0 Training loss: 1.2293 Explore P: 0.1082
Episode: 505 Total reward: 199.0 Training loss: 1.4046 Explore P: 0.1062
Episode: 506 Total reward: 199.0 Training loss: 0.5917 Explore P: 0.1043
Episode: 507 Total reward: 199.0 Training loss: 0.6061 Explore P: 0.1025
Episode: 508 Total reward: 199.0 Training loss: 76.1845 Explore P: 0.1007
Episode: 509 Total reward: 199.0 Training loss: 0.6411 Explore P: 0.0989
Episode: 510 Total reward: 140.0 Training loss: 1.0399 Explore P: 0.0976
Episode: 511 Total reward: 199.0 Training loss: 0.5945 Explore P: 0.0959
Episode: 512 Total reward: 199.0 Training loss: 195.7922 Explore P: 0.0942
Episode: 513 Total reward: 199.0 Training loss: 105.8392 Explore P: 0.0926
Episode: 514 Total reward: 199.0 Training loss: 0.7889 Explore P: 0.0909
Episode: 515 Total reward: 199.0 Training loss: 0.4795 Explore P: 0.0893
Episode: 516 Total reward: 199.0 Training loss: 0.4733 Explore P: 0.0878
Episode: 517 Total reward: 199.0 Training loss: 0.4458 Explore P: 0.0862
Episode: 518 Total reward: 199.0 Training loss: 0.7046 Explore P: 0.0847
Episode: 519 Total reward: 199.0 Training loss: 0.5056 Explore P: 0.0833
Episode: 520 Total reward: 199.0 Training loss: 0.5025 Explore P: 0.0818
Episode: 521 Total reward: 199.0 Training loss: 0.4864 Explore P: 0.0804
Episode: 522 Total reward: 199.0 Training loss: 57.1895 Explore P: 0.0790
Episode: 523 Total reward: 199.0 Training loss: 0.2560 Explore P: 0.0777
Episode: 524 Total reward: 199.0 Training loss: 0.4628 Explore P: 0.0763
Episode: 525 Total reward: 199.0 Training loss: 0.3177 Explore P: 0.0750
Episode: 526 Total reward: 199.0 Training loss: 0.2303 Explore P: 0.0737
Episode: 527 Total reward: 199.0 Training loss: 0.2680 Explore P: 0.0725
Episode: 528 Total reward: 199.0 Training loss: 0.5300 Explore P: 0.0713
Episode: 529 Total reward: 199.0 Training loss: 0.3013 Explore P: 0.0700
Episode: 530 Total reward: 199.0 Training loss: 95.5958 Explore P: 0.0689
Episode: 531 Total reward: 199.0 Training loss: 0.3557 Explore P: 0.0677
Episode: 532 Total reward: 199.0 Training loss: 0.2714 Explore P: 0.0666
Episode: 533 Total reward: 199.0 Training loss: 0.2603 Explore P: 0.0655
Episode: 534 Total reward: 199.0 Training loss: 0.6350 Explore P: 0.0644
Episode: 535 Total reward: 199.0 Training loss: 0.4335 Explore P: 0.0633
Episode: 536 Total reward: 199.0 Training loss: 0.1784 Explore P: 0.0622
Episode: 537 Total reward: 199.0 Training loss: 0.4273 Explore P: 0.0612
Episode: 538 Total reward: 199.0 Training loss: 0.2949 Explore P: 0.0602
Episode: 539 Total reward: 199.0 Training loss: 0.2344 Explore P: 0.0592
Episode: 540 Total reward: 199.0 Training loss: 0.1694 Explore P: 0.0582
Episode: 541 Total reward: 199.0 Training loss: 0.5521 Explore P: 0.0573
Episode: 542 Total reward: 199.0 Training loss: 0.2866 Explore P: 0.0564
Episode: 543 Total reward: 199.0 Training loss: 160.4642 Explore P: 0.0554
Episode: 544 Total reward: 199.0 Training loss: 0.0950 Explore P: 0.0545
Episode: 545 Total reward: 199.0 Training loss: 0.3099 Explore P: 0.0537
Episode: 546 Total reward: 199.0 Training loss: 0.3587 Explore P: 0.0528
Episode: 547 Total reward: 199.0 Training loss: 0.4133 Explore P: 0.0520
Episode: 548 Total reward: 199.0 Training loss: 0.1365 Explore P: 0.0511
Episode: 549 Total reward: 199.0 Training loss: 0.3828 Explore P: 0.0503
Episode: 550 Total reward: 199.0 Training loss: 0.2443 Explore P: 0.0495
Episode: 551 Total reward: 199.0 Training loss: 0.3509 Explore P: 0.0488
Episode: 552 Total reward: 199.0 Training loss: 0.2200 Explore P: 0.0480
Episode: 553 Total reward: 199.0 Training loss: 0.1544 Explore P: 0.0472
Episode: 554 Total reward: 199.0 Training loss: 0.3632 Explore P: 0.0465
Episode: 555 Total reward: 199.0 Training loss: 0.1194 Explore P: 0.0458
Episode: 556 Total reward: 199.0 Training loss: 0.2978 Explore P: 0.0451
Episode: 557 Total reward: 199.0 Training loss: 0.3165 Explore P: 0.0444
Episode: 558 Total reward: 199.0 Training loss: 0.0901 Explore P: 0.0437
Episode: 559 Total reward: 199.0 Training loss: 0.1127 Explore P: 0.0431
Episode: 560 Total reward: 199.0 Training loss: 0.3547 Explore P: 0.0424
Episode: 561 Total reward: 199.0 Training loss: 277.6599 Explore P: 0.0418
Episode: 562 Total reward: 199.0 Training loss: 0.1756 Explore P: 0.0411
Episode: 563 Total reward: 199.0 Training loss: 0.2387 Explore P: 0.0405
Episode: 564 Total reward: 199.0 Training loss: 0.1874 Explore P: 0.0399
Episode: 565 Total reward: 199.0 Training loss: 0.1467 Explore P: 0.0393
Episode: 566 Total reward: 199.0 Training loss: 237.9963 Explore P: 0.0388
Episode: 567 Total reward: 199.0 Training loss: 0.2270 Explore P: 0.0382
Episode: 568 Total reward: 199.0 Training loss: 0.2110 Explore P: 0.0376
Episode: 569 Total reward: 199.0 Training loss: 0.1945 Explore P: 0.0371
Episode: 570 Total reward: 199.0 Training loss: 0.1601 Explore P: 0.0366
Episode: 571 Total reward: 199.0 Training loss: 0.2192 Explore P: 0.0360
Episode: 572 Total reward: 199.0 Training loss: 0.1538 Explore P: 0.0355
Episode: 573 Total reward: 199.0 Training loss: 0.3557 Explore P: 0.0350
Episode: 574 Total reward: 199.0 Training loss: 0.1385 Explore P: 0.0345
Episode: 575 Total reward: 199.0 Training loss: 0.1644 Explore P: 0.0340
Episode: 576 Total reward: 199.0 Training loss: 0.1334 Explore P: 0.0336
Episode: 577 Total reward: 199.0 Training loss: 0.2085 Explore P: 0.0331
Episode: 578 Total reward: 199.0 Training loss: 0.2997 Explore P: 0.0326
Episode: 579 Total reward: 199.0 Training loss: 0.3022 Explore P: 0.0322
Episode: 580 Total reward: 199.0 Training loss: 0.3017 Explore P: 0.0318
Episode: 581 Total reward: 199.0 Training loss: 0.3489 Explore P: 0.0313
Episode: 582 Total reward: 199.0 Training loss: 0.4957 Explore P: 0.0309
Episode: 583 Total reward: 199.0 Training loss: 0.1997 Explore P: 0.0305
Episode: 584 Total reward: 199.0 Training loss: 0.1433 Explore P: 0.0301
Episode: 585 Total reward: 199.0 Training loss: 0.2564 Explore P: 0.0297
Episode: 586 Total reward: 199.0 Training loss: 0.1715 Explore P: 0.0293
Episode: 587 Total reward: 199.0 Training loss: 0.2970 Explore P: 0.0289
Episode: 588 Total reward: 199.0 Training loss: 0.1432 Explore P: 0.0286
Episode: 589 Total reward: 199.0 Training loss: 0.0965 Explore P: 0.0282
Episode: 590 Total reward: 199.0 Training loss: 0.2219 Explore P: 0.0278
Episode: 591 Total reward: 199.0 Training loss: 0.1356 Explore P: 0.0275
Episode: 592 Total reward: 199.0 Training loss: 0.2376 Explore P: 0.0271
Episode: 593 Total reward: 199.0 Training loss: 0.1507 Explore P: 0.0268
Episode: 594 Total reward: 199.0 Training loss: 0.1480 Explore P: 0.0265
Episode: 595 Total reward: 199.0 Training loss: 0.1078 Explore P: 0.0261
Episode: 596 Total reward: 199.0 Training loss: 0.1562 Explore P: 0.0258
Episode: 597 Total reward: 199.0 Training loss: 0.1625 Explore P: 0.0255
Episode: 598 Total reward: 199.0 Training loss: 261.3425 Explore P: 0.0252
Episode: 599 Total reward: 199.0 Training loss: 0.1971 Explore P: 0.0249
Episode: 600 Total reward: 199.0 Training loss: 0.1405 Explore P: 0.0246
Episode: 601 Total reward: 199.0 Training loss: 0.2187 Explore P: 0.0243
Episode: 602 Total reward: 199.0 Training loss: 102.8862 Explore P: 0.0240
Episode: 603 Total reward: 199.0 Training loss: 0.2504 Explore P: 0.0238
Episode: 604 Total reward: 199.0 Training loss: 0.1673 Explore P: 0.0235
Episode: 605 Total reward: 199.0 Training loss: 0.2498 Explore P: 0.0232
Episode: 606 Total reward: 199.0 Training loss: 0.1287 Explore P: 0.0230
Episode: 607 Total reward: 199.0 Training loss: 0.2856 Explore P: 0.0227
Episode: 608 Total reward: 199.0 Training loss: 0.2168 Explore P: 0.0225
Episode: 609 Total reward: 199.0 Training loss: 0.1997 Explore P: 0.0222
Episode: 610 Total reward: 199.0 Training loss: 0.1957 Explore P: 0.0220
Episode: 611 Total reward: 199.0 Training loss: 259.5076 Explore P: 0.0217
Episode: 612 Total reward: 199.0 Training loss: 0.2224 Explore P: 0.0215
Episode: 613 Total reward: 199.0 Training loss: 0.1877 Explore P: 0.0213
Episode: 614 Total reward: 199.0 Training loss: 0.1575 Explore P: 0.0211
Episode: 615 Total reward: 199.0 Training loss: 0.1257 Explore P: 0.0208
Episode: 616 Total reward: 199.0 Training loss: 0.1946 Explore P: 0.0206
Episode: 617 Total reward: 199.0 Training loss: 0.1187 Explore P: 0.0204
Episode: 618 Total reward: 199.0 Training loss: 228.4958 Explore P: 0.0202
Episode: 619 Total reward: 199.0 Training loss: 0.1705 Explore P: 0.0200
Episode: 620 Total reward: 199.0 Training loss: 0.1089 Explore P: 0.0198
Episode: 621 Total reward: 199.0 Training loss: 0.1782 Explore P: 0.0196
Episode: 622 Total reward: 199.0 Training loss: 0.2536 Explore P: 0.0194
Episode: 623 Total reward: 199.0 Training loss: 0.1259 Explore P: 0.0192
Episode: 624 Total reward: 199.0 Training loss: 0.1616 Explore P: 0.0191
Episode: 625 Total reward: 199.0 Training loss: 0.1908 Explore P: 0.0189
Episode: 626 Total reward: 199.0 Training loss: 0.1694 Explore P: 0.0187
Episode: 627 Total reward: 199.0 Training loss: 0.1473 Explore P: 0.0185
Episode: 628 Total reward: 199.0 Training loss: 0.1398 Explore P: 0.0184
Episode: 629 Total reward: 199.0 Training loss: 0.2426 Explore P: 0.0182
Episode: 630 Total reward: 199.0 Training loss: 244.6785 Explore P: 0.0180
Episode: 631 Total reward: 199.0 Training loss: 0.1415 Explore P: 0.0179
Episode: 632 Total reward: 199.0 Training loss: 0.0994 Explore P: 0.0177
Episode: 633 Total reward: 199.0 Training loss: 0.2877 Explore P: 0.0176
Episode: 634 Total reward: 199.0 Training loss: 0.2712 Explore P: 0.0174
Episode: 635 Total reward: 184.0 Training loss: 213.3642 Explore P: 0.0173
Episode: 636 Total reward: 177.0 Training loss: 0.1722 Explore P: 0.0172
Episode: 637 Total reward: 199.0 Training loss: 0.2212 Explore P: 0.0170
Episode: 638 Total reward: 156.0 Training loss: 152.4958 Explore P: 0.0169
Episode: 639 Total reward: 172.0 Training loss: 0.3249 Explore P: 0.0168
Episode: 640 Total reward: 152.0 Training loss: 0.2885 Explore P: 0.0167
Episode: 641 Total reward: 158.0 Training loss: 0.2195 Explore P: 0.0166
Episode: 642 Total reward: 124.0 Training loss: 0.3330 Explore P: 0.0165
Episode: 643 Total reward: 126.0 Training loss: 0.2300 Explore P: 0.0164
Episode: 644 Total reward: 112.0 Training loss: 0.2952 Explore P: 0.0164
Episode: 645 Total reward: 148.0 Training loss: 247.1076 Explore P: 0.0163
Episode: 646 Total reward: 109.0 Training loss: 0.2641 Explore P: 0.0162
Episode: 647 Total reward: 136.0 Training loss: 0.4088 Explore P: 0.0161
Episode: 648 Total reward: 153.0 Training loss: 0.5466 Explore P: 0.0160
Episode: 649 Total reward: 117.0 Training loss: 0.4589 Explore P: 0.0159
Episode: 650 Total reward: 127.0 Training loss: 0.4250 Explore P: 0.0159
Episode: 651 Total reward: 150.0 Training loss: 0.5285 Explore P: 0.0158
Episode: 652 Total reward: 111.0 Training loss: 0.4845 Explore P: 0.0157
Episode: 653 Total reward: 113.0 Training loss: 0.8690 Explore P: 0.0157
Episode: 654 Total reward: 131.0 Training loss: 0.2932 Explore P: 0.0156
Episode: 655 Total reward: 113.0 Training loss: 0.8862 Explore P: 0.0155
Episode: 656 Total reward: 111.0 Training loss: 0.5876 Explore P: 0.0155
Episode: 657 Total reward: 59.0 Training loss: 0.5644 Explore P: 0.0154
Episode: 658 Total reward: 50.0 Training loss: 0.4542 Explore P: 0.0154
Episode: 659 Total reward: 101.0 Training loss: 0.6716 Explore P: 0.0153
Episode: 660 Total reward: 19.0 Training loss: 0.8167 Explore P: 0.0153
Episode: 661 Total reward: 24.0 Training loss: 0.6344 Explore P: 0.0153
Episode: 662 Total reward: 21.0 Training loss: 0.5326 Explore P: 0.0153
Episode: 663 Total reward: 29.0 Training loss: 240.1963 Explore P: 0.0153
Episode: 664 Total reward: 20.0 Training loss: 380.8177 Explore P: 0.0153
Episode: 665 Total reward: 26.0 Training loss: 496.0798 Explore P: 0.0153
Episode: 666 Total reward: 20.0 Training loss: 1.1411 Explore P: 0.0153
Episode: 667 Total reward: 17.0 Training loss: 1.0363 Explore P: 0.0153
Episode: 668 Total reward: 14.0 Training loss: 68.5867 Explore P: 0.0152
Episode: 669 Total reward: 17.0 Training loss: 2.0600 Explore P: 0.0152
Episode: 670 Total reward: 15.0 Training loss: 1.4748 Explore P: 0.0152
Episode: 671 Total reward: 12.0 Training loss: 1.4939 Explore P: 0.0152
Episode: 672 Total reward: 20.0 Training loss: 0.9272 Explore P: 0.0152
Episode: 673 Total reward: 30.0 Training loss: 407.0804 Explore P: 0.0152
Episode: 674 Total reward: 43.0 Training loss: 0.5206 Explore P: 0.0152
Episode: 675 Total reward: 22.0 Training loss: 1.0747 Explore P: 0.0152
Episode: 676 Total reward: 19.0 Training loss: 1.5912 Explore P: 0.0152
Episode: 677 Total reward: 27.0 Training loss: 0.4446 Explore P: 0.0151
Episode: 678 Total reward: 30.0 Training loss: 0.4512 Explore P: 0.0151
Episode: 679 Total reward: 112.0 Training loss: 0.3363 Explore P: 0.0151
Episode: 680 Total reward: 143.0 Training loss: 129.8861 Explore P: 0.0150
Episode: 681 Total reward: 172.0 Training loss: 0.3649 Explore P: 0.0149
Episode: 682 Total reward: 150.0 Training loss: 0.3066 Explore P: 0.0148
Episode: 683 Total reward: 148.0 Training loss: 0.8127 Explore P: 0.0148
Episode: 684 Total reward: 199.0 Training loss: 194.4067 Explore P: 0.0147
Episode: 685 Total reward: 116.0 Training loss: 0.4825 Explore P: 0.0146
Episode: 686 Total reward: 199.0 Training loss: 0.7970 Explore P: 0.0145
Episode: 687 Total reward: 199.0 Training loss: 305.9972 Explore P: 0.0144
Episode: 688 Total reward: 199.0 Training loss: 0.5788 Explore P: 0.0144
Episode: 689 Total reward: 186.0 Training loss: 0.3936 Explore P: 0.0143
Episode: 690 Total reward: 199.0 Training loss: 0.4774 Explore P: 0.0142
Episode: 691 Total reward: 175.0 Training loss: 0.1770 Explore P: 0.0141
Episode: 692 Total reward: 199.0 Training loss: 68.4718 Explore P: 0.0140
Episode: 693 Total reward: 199.0 Training loss: 0.4610 Explore P: 0.0140
Episode: 694 Total reward: 199.0 Training loss: 0.4406 Explore P: 0.0139
Episode: 695 Total reward: 199.0 Training loss: 0.3950 Explore P: 0.0138
Episode: 696 Total reward: 199.0 Training loss: 0.3752 Explore P: 0.0137
Episode: 697 Total reward: 199.0 Training loss: 0.4714 Explore P: 0.0137
Episode: 698 Total reward: 199.0 Training loss: 216.6983 Explore P: 0.0136
Episode: 699 Total reward: 199.0 Training loss: 0.3031 Explore P: 0.0135
Episode: 700 Total reward: 199.0 Training loss: 0.3654 Explore P: 0.0134
Episode: 701 Total reward: 199.0 Training loss: 0.3788 Explore P: 0.0134
Episode: 702 Total reward: 199.0 Training loss: 0.1714 Explore P: 0.0133
Episode: 703 Total reward: 199.0 Training loss: 0.2205 Explore P: 0.0132
Episode: 704 Total reward: 199.0 Training loss: 0.5542 Explore P: 0.0132
Episode: 705 Total reward: 199.0 Training loss: 0.3479 Explore P: 0.0131
Episode: 706 Total reward: 199.0 Training loss: 0.6849 Explore P: 0.0131
Episode: 707 Total reward: 199.0 Training loss: 0.1594 Explore P: 0.0130
Episode: 708 Total reward: 199.0 Training loss: 0.3541 Explore P: 0.0129
Episode: 709 Total reward: 199.0 Training loss: 0.3850 Explore P: 0.0129
Episode: 710 Total reward: 199.0 Training loss: 0.5600 Explore P: 0.0128
Episode: 711 Total reward: 199.0 Training loss: 0.2241 Explore P: 0.0128
Episode: 712 Total reward: 199.0 Training loss: 0.5423 Explore P: 0.0127
Episode: 713 Total reward: 165.0 Training loss: 0.6441 Explore P: 0.0127
Episode: 714 Total reward: 199.0 Training loss: 266.1266 Explore P: 0.0126
Episode: 715 Total reward: 199.0 Training loss: 0.5780 Explore P: 0.0126
Episode: 716 Total reward: 189.0 Training loss: 0.9294 Explore P: 0.0125
Episode: 717 Total reward: 199.0 Training loss: 1.1765 Explore P: 0.0125
Episode: 718 Total reward: 161.0 Training loss: 1.3791 Explore P: 0.0124
Episode: 719 Total reward: 187.0 Training loss: 0.6919 Explore P: 0.0124
Episode: 720 Total reward: 183.0 Training loss: 0.6674 Explore P: 0.0123
Episode: 721 Total reward: 199.0 Training loss: 0.8920 Explore P: 0.0123
Episode: 722 Total reward: 199.0 Training loss: 1.2152 Explore P: 0.0122
Episode: 723 Total reward: 199.0 Training loss: 0.8948 Explore P: 0.0122
Episode: 724 Total reward: 199.0 Training loss: 1.1851 Explore P: 0.0122
Episode: 725 Total reward: 199.0 Training loss: 1.0658 Explore P: 0.0121
Episode: 726 Total reward: 199.0 Training loss: 1.4604 Explore P: 0.0121
Episode: 727 Total reward: 197.0 Training loss: 1.6631 Explore P: 0.0120
Episode: 728 Total reward: 199.0 Training loss: 0.7172 Explore P: 0.0120
Episode: 729 Total reward: 199.0 Training loss: 341.3912 Explore P: 0.0120
Episode: 730 Total reward: 199.0 Training loss: 0.7328 Explore P: 0.0119
Episode: 731 Total reward: 199.0 Training loss: 170.4263 Explore P: 0.0119
Episode: 732 Total reward: 193.0 Training loss: 0.5403 Explore P: 0.0118
Episode: 733 Total reward: 199.0 Training loss: 0.7921 Explore P: 0.0118
Episode: 734 Total reward: 199.0 Training loss: 0.4231 Explore P: 0.0118
Episode: 735 Total reward: 199.0 Training loss: 0.1503 Explore P: 0.0117
Episode: 736 Total reward: 199.0 Training loss: 0.1847 Explore P: 0.0117
Episode: 737 Total reward: 199.0 Training loss: 0.3195 Explore P: 0.0117
Episode: 738 Total reward: 199.0 Training loss: 0.0523 Explore P: 0.0116
Episode: 739 Total reward: 199.0 Training loss: 0.2424 Explore P: 0.0116
Episode: 740 Total reward: 199.0 Training loss: 174.1876 Explore P: 0.0116
Episode: 741 Total reward: 199.0 Training loss: 0.2773 Explore P: 0.0115
Episode: 742 Total reward: 199.0 Training loss: 0.1173 Explore P: 0.0115
Episode: 743 Total reward: 199.0 Training loss: 0.0954 Explore P: 0.0115
Episode: 744 Total reward: 199.0 Training loss: 0.1808 Explore P: 0.0115
Episode: 745 Total reward: 199.0 Training loss: 0.2247 Explore P: 0.0114
Episode: 746 Total reward: 199.0 Training loss: 0.2289 Explore P: 0.0114
Episode: 747 Total reward: 199.0 Training loss: 0.2203 Explore P: 0.0114
Episode: 748 Total reward: 199.0 Training loss: 0.1126 Explore P: 0.0113
Episode: 749 Total reward: 199.0 Training loss: 0.2750 Explore P: 0.0113
Episode: 750 Total reward: 199.0 Training loss: 0.1353 Explore P: 0.0113
Episode: 751 Total reward: 199.0 Training loss: 0.1066 Explore P: 0.0113
Episode: 752 Total reward: 199.0 Training loss: 220.4093 Explore P: 0.0112
Episode: 753 Total reward: 199.0 Training loss: 0.1187 Explore P: 0.0112
Episode: 754 Total reward: 199.0 Training loss: 0.2211 Explore P: 0.0112
Episode: 755 Total reward: 199.0 Training loss: 0.1754 Explore P: 0.0112
Episode: 756 Total reward: 199.0 Training loss: 0.2295 Explore P: 0.0111
Episode: 757 Total reward: 199.0 Training loss: 0.1516 Explore P: 0.0111
Episode: 758 Total reward: 199.0 Training loss: 0.1203 Explore P: 0.0111
Episode: 759 Total reward: 199.0 Training loss: 0.3891 Explore P: 0.0111
Episode: 760 Total reward: 166.0 Training loss: 0.1123 Explore P: 0.0111
Episode: 761 Total reward: 199.0 Training loss: 16.0004 Explore P: 0.0110
Episode: 762 Total reward: 199.0 Training loss: 0.1781 Explore P: 0.0110
Episode: 763 Total reward: 199.0 Training loss: 0.2521 Explore P: 0.0110
Episode: 764 Total reward: 180.0 Training loss: 0.2367 Explore P: 0.0110
Episode: 765 Total reward: 199.0 Training loss: 0.1989 Explore P: 0.0110
Episode: 766 Total reward: 199.0 Training loss: 0.2237 Explore P: 0.0109
Episode: 767 Total reward: 199.0 Training loss: 0.0315 Explore P: 0.0109
Episode: 768 Total reward: 199.0 Training loss: 0.0455 Explore P: 0.0109
Episode: 769 Total reward: 199.0 Training loss: 0.2043 Explore P: 0.0109
Episode: 770 Total reward: 199.0 Training loss: 0.3112 Explore P: 0.0109
Episode: 771 Total reward: 199.0 Training loss: 0.2957 Explore P: 0.0109
Episode: 772 Total reward: 199.0 Training loss: 0.2515 Explore P: 0.0108
Episode: 773 Total reward: 199.0 Training loss: 148.8838 Explore P: 0.0108
Episode: 774 Total reward: 199.0 Training loss: 0.1759 Explore P: 0.0108
Episode: 775 Total reward: 199.0 Training loss: 0.1865 Explore P: 0.0108
Episode: 776 Total reward: 199.0 Training loss: 0.2494 Explore P: 0.0108
Episode: 777 Total reward: 199.0 Training loss: 0.0643 Explore P: 0.0108
Episode: 778 Total reward: 199.0 Training loss: 0.1234 Explore P: 0.0107
Episode: 779 Total reward: 199.0 Training loss: 0.2104 Explore P: 0.0107
Episode: 780 Total reward: 199.0 Training loss: 10.3575 Explore P: 0.0107
Episode: 781 Total reward: 199.0 Training loss: 0.0709 Explore P: 0.0107
Episode: 782 Total reward: 199.0 Training loss: 0.2295 Explore P: 0.0107
Episode: 783 Total reward: 199.0 Training loss: 0.1370 Explore P: 0.0107
Episode: 784 Total reward: 199.0 Training loss: 0.2532 Explore P: 0.0107
Episode: 785 Total reward: 199.0 Training loss: 0.2556 Explore P: 0.0106
Episode: 786 Total reward: 162.0 Training loss: 0.1471 Explore P: 0.0106
Episode: 787 Total reward: 199.0 Training loss: 0.1446 Explore P: 0.0106
Episode: 788 Total reward: 199.0 Training loss: 0.4440 Explore P: 0.0106
Episode: 789 Total reward: 199.0 Training loss: 0.1137 Explore P: 0.0106
Episode: 790 Total reward: 199.0 Training loss: 0.1232 Explore P: 0.0106
Episode: 791 Total reward: 199.0 Training loss: 0.3117 Explore P: 0.0106
Episode: 792 Total reward: 169.0 Training loss: 0.2757 Explore P: 0.0106
Episode: 793 Total reward: 122.0 Training loss: 0.0956 Explore P: 0.0106
Episode: 794 Total reward: 199.0 Training loss: 0.0818 Explore P: 0.0105
Episode: 795 Total reward: 199.0 Training loss: 0.2768 Explore P: 0.0105
Episode: 796 Total reward: 199.0 Training loss: 0.2933 Explore P: 0.0105
Episode: 797 Total reward: 199.0 Training loss: 0.1893 Explore P: 0.0105
Episode: 798 Total reward: 199.0 Training loss: 0.2017 Explore P: 0.0105
Episode: 799 Total reward: 151.0 Training loss: 0.1476 Explore P: 0.0105
Episode: 800 Total reward: 199.0 Training loss: 0.1446 Explore P: 0.0105
Episode: 801 Total reward: 199.0 Training loss: 0.0932 Explore P: 0.0105
Episode: 802 Total reward: 199.0 Training loss: 0.1016 Explore P: 0.0105
Episode: 803 Total reward: 199.0 Training loss: 0.2000 Explore P: 0.0105
Episode: 804 Total reward: 199.0 Training loss: 0.1450 Explore P: 0.0105
Episode: 805 Total reward: 199.0 Training loss: 0.1241 Explore P: 0.0104
Episode: 806 Total reward: 199.0 Training loss: 0.1874 Explore P: 0.0104
Episode: 807 Total reward: 199.0 Training loss: 0.1097 Explore P: 0.0104
Episode: 808 Total reward: 199.0 Training loss: 0.1322 Explore P: 0.0104
Episode: 809 Total reward: 199.0 Training loss: 0.2560 Explore P: 0.0104
Episode: 810 Total reward: 199.0 Training loss: 0.1620 Explore P: 0.0104
Episode: 811 Total reward: 199.0 Training loss: 0.1449 Explore P: 0.0104
Episode: 812 Total reward: 199.0 Training loss: 0.2231 Explore P: 0.0104
Episode: 813 Total reward: 199.0 Training loss: 0.0912 Explore P: 0.0104
Episode: 814 Total reward: 199.0 Training loss: 0.1909 Explore P: 0.0104
Episode: 815 Total reward: 199.0 Training loss: 0.1575 Explore P: 0.0104
Episode: 816 Total reward: 199.0 Training loss: 0.1648 Explore P: 0.0104
Episode: 817 Total reward: 199.0 Training loss: 0.1968 Explore P: 0.0103
Episode: 818 Total reward: 199.0 Training loss: 23.4868 Explore P: 0.0103
Episode: 819 Total reward: 199.0 Training loss: 0.2619 Explore P: 0.0103
Episode: 820 Total reward: 199.0 Training loss: 0.1366 Explore P: 0.0103
Episode: 821 Total reward: 199.0 Training loss: 0.1556 Explore P: 0.0103
Episode: 822 Total reward: 199.0 Training loss: 0.2814 Explore P: 0.0103
Episode: 823 Total reward: 199.0 Training loss: 0.1413 Explore P: 0.0103
Episode: 824 Total reward: 199.0 Training loss: 0.2014 Explore P: 0.0103
Episode: 825 Total reward: 199.0 Training loss: 0.1342 Explore P: 0.0103
Episode: 826 Total reward: 199.0 Training loss: 0.2392 Explore P: 0.0103
Episode: 827 Total reward: 199.0 Training loss: 0.2318 Explore P: 0.0103
Episode: 828 Total reward: 199.0 Training loss: 0.4747 Explore P: 0.0103
Episode: 829 Total reward: 199.0 Training loss: 0.4721 Explore P: 0.0103
Episode: 830 Total reward: 199.0 Training loss: 0.2209 Explore P: 0.0103
Episode: 831 Total reward: 199.0 Training loss: 305.5102 Explore P: 0.0103
Episode: 832 Total reward: 199.0 Training loss: 0.2065 Explore P: 0.0103
Episode: 833 Total reward: 199.0 Training loss: 0.4136 Explore P: 0.0103
Episode: 834 Total reward: 199.0 Training loss: 0.4226 Explore P: 0.0102
Episode: 835 Total reward: 199.0 Training loss: 0.4490 Explore P: 0.0102
Episode: 836 Total reward: 199.0 Training loss: 0.1671 Explore P: 0.0102
Episode: 837 Total reward: 199.0 Training loss: 0.3485 Explore P: 0.0102
Episode: 838 Total reward: 199.0 Training loss: 0.3309 Explore P: 0.0102
Episode: 839 Total reward: 199.0 Training loss: 0.2201 Explore P: 0.0102
Episode: 840 Total reward: 199.0 Training loss: 0.5826 Explore P: 0.0102
Episode: 841 Total reward: 199.0 Training loss: 353.4071 Explore P: 0.0102
Episode: 842 Total reward: 199.0 Training loss: 0.4890 Explore P: 0.0102
Episode: 843 Total reward: 199.0 Training loss: 360.8519 Explore P: 0.0102
Episode: 844 Total reward: 199.0 Training loss: 88.0021 Explore P: 0.0102
Episode: 845 Total reward: 199.0 Training loss: 0.3976 Explore P: 0.0102
Episode: 846 Total reward: 199.0 Training loss: 302.2032 Explore P: 0.0102
Episode: 847 Total reward: 199.0 Training loss: 0.3547 Explore P: 0.0102
Episode: 848 Total reward: 199.0 Training loss: 0.2315 Explore P: 0.0102
Episode: 849 Total reward: 199.0 Training loss: 0.3631 Explore P: 0.0102
Episode: 850 Total reward: 199.0 Training loss: 0.5688 Explore P: 0.0102
Episode: 851 Total reward: 199.0 Training loss: 0.5413 Explore P: 0.0102
Episode: 852 Total reward: 199.0 Training loss: 0.2778 Explore P: 0.0102
Episode: 853 Total reward: 199.0 Training loss: 0.3436 Explore P: 0.0102
Episode: 854 Total reward: 199.0 Training loss: 0.5269 Explore P: 0.0102
Episode: 855 Total reward: 199.0 Training loss: 0.4064 Explore P: 0.0102
Episode: 856 Total reward: 199.0 Training loss: 0.6995 Explore P: 0.0102
Episode: 857 Total reward: 175.0 Training loss: 0.6084 Explore P: 0.0102
Episode: 858 Total reward: 143.0 Training loss: 0.6593 Explore P: 0.0102
Episode: 859 Total reward: 35.0 Training loss: 0.6223 Explore P: 0.0102
Episode: 860 Total reward: 26.0 Training loss: 0.9752 Explore P: 0.0102
Episode: 861 Total reward: 28.0 Training loss: 1.0972 Explore P: 0.0102
Episode: 862 Total reward: 21.0 Training loss: 355.9939 Explore P: 0.0102
Episode: 863 Total reward: 18.0 Training loss: 1.1959 Explore P: 0.0102
Episode: 864 Total reward: 24.0 Training loss: 0.6682 Explore P: 0.0102
Episode: 865 Total reward: 29.0 Training loss: 1.2094 Explore P: 0.0102
Episode: 866 Total reward: 28.0 Training loss: 1.1059 Explore P: 0.0102
Episode: 867 Total reward: 23.0 Training loss: 475.2639 Explore P: 0.0102
Episode: 868 Total reward: 25.0 Training loss: 310.2563 Explore P: 0.0102
Episode: 869 Total reward: 27.0 Training loss: 0.6235 Explore P: 0.0102
Episode: 870 Total reward: 20.0 Training loss: 1.1930 Explore P: 0.0102
Episode: 871 Total reward: 19.0 Training loss: 0.6976 Explore P: 0.0102
Episode: 872 Total reward: 22.0 Training loss: 0.9629 Explore P: 0.0101
Episode: 873 Total reward: 21.0 Training loss: 0.6745 Explore P: 0.0101
Episode: 874 Total reward: 26.0 Training loss: 0.6856 Explore P: 0.0101
Episode: 875 Total reward: 31.0 Training loss: 0.9854 Explore P: 0.0101
Episode: 876 Total reward: 16.0 Training loss: 0.8239 Explore P: 0.0101
Episode: 877 Total reward: 20.0 Training loss: 0.9013 Explore P: 0.0101
Episode: 878 Total reward: 26.0 Training loss: 1.1120 Explore P: 0.0101
Episode: 879 Total reward: 30.0 Training loss: 0.8887 Explore P: 0.0101
Episode: 880 Total reward: 121.0 Training loss: 0.6154 Explore P: 0.0101
Episode: 881 Total reward: 150.0 Training loss: 0.2683 Explore P: 0.0101
Episode: 882 Total reward: 140.0 Training loss: 0.3445 Explore P: 0.0101
Episode: 883 Total reward: 199.0 Training loss: 0.1293 Explore P: 0.0101
Episode: 884 Total reward: 138.0 Training loss: 231.2847 Explore P: 0.0101
Episode: 885 Total reward: 162.0 Training loss: 333.9309 Explore P: 0.0101
Episode: 886 Total reward: 133.0 Training loss: 0.2873 Explore P: 0.0101
Episode: 887 Total reward: 127.0 Training loss: 322.9952 Explore P: 0.0101
Episode: 888 Total reward: 148.0 Training loss: 0.2162 Explore P: 0.0101
Episode: 889 Total reward: 128.0 Training loss: 0.4227 Explore P: 0.0101
Episode: 890 Total reward: 113.0 Training loss: 0.1457 Explore P: 0.0101
Episode: 891 Total reward: 170.0 Training loss: 0.2704 Explore P: 0.0101
Episode: 892 Total reward: 110.0 Training loss: 0.3781 Explore P: 0.0101
Episode: 893 Total reward: 82.0 Training loss: 273.0876 Explore P: 0.0101
Episode: 894 Total reward: 80.0 Training loss: 0.2974 Explore P: 0.0101
Episode: 895 Total reward: 80.0 Training loss: 142.4489 Explore P: 0.0101
Episode: 896 Total reward: 75.0 Training loss: 0.4100 Explore P: 0.0101
Episode: 897 Total reward: 116.0 Training loss: 0.6308 Explore P: 0.0101
Episode: 898 Total reward: 83.0 Training loss: 241.5772 Explore P: 0.0101
Episode: 899 Total reward: 199.0 Training loss: 0.2050 Explore P: 0.0101
Episode: 900 Total reward: 199.0 Training loss: 232.5823 Explore P: 0.0101
Episode: 901 Total reward: 199.0 Training loss: 236.9583 Explore P: 0.0101
Episode: 902 Total reward: 199.0 Training loss: 0.4146 Explore P: 0.0101
Episode: 903 Total reward: 199.0 Training loss: 0.4000 Explore P: 0.0101
Episode: 904 Total reward: 199.0 Training loss: 100.7962 Explore P: 0.0101
Episode: 905 Total reward: 199.0 Training loss: 0.3839 Explore P: 0.0101
Episode: 906 Total reward: 199.0 Training loss: 0.7265 Explore P: 0.0101
Episode: 907 Total reward: 199.0 Training loss: 0.5445 Explore P: 0.0101
Episode: 908 Total reward: 199.0 Training loss: 0.4232 Explore P: 0.0101
Episode: 909 Total reward: 199.0 Training loss: 0.4788 Explore P: 0.0101
Episode: 910 Total reward: 199.0 Training loss: 0.5663 Explore P: 0.0101
Episode: 911 Total reward: 199.0 Training loss: 194.0769 Explore P: 0.0101
Episode: 912 Total reward: 199.0 Training loss: 211.6273 Explore P: 0.0101
Episode: 913 Total reward: 199.0 Training loss: 0.3323 Explore P: 0.0101
Episode: 914 Total reward: 199.0 Training loss: 0.4045 Explore P: 0.0101
Episode: 915 Total reward: 199.0 Training loss: 0.5198 Explore P: 0.0101
Episode: 916 Total reward: 199.0 Training loss: 0.2656 Explore P: 0.0101
Episode: 917 Total reward: 199.0 Training loss: 0.3619 Explore P: 0.0101
Episode: 918 Total reward: 199.0 Training loss: 0.2266 Explore P: 0.0101
Episode: 919 Total reward: 199.0 Training loss: 0.2619 Explore P: 0.0101
Episode: 920 Total reward: 199.0 Training loss: 0.2751 Explore P: 0.0101
Episode: 921 Total reward: 199.0 Training loss: 0.3718 Explore P: 0.0101
Episode: 922 Total reward: 199.0 Training loss: 0.2766 Explore P: 0.0101
Episode: 923 Total reward: 199.0 Training loss: 0.2482 Explore P: 0.0101
Episode: 924 Total reward: 199.0 Training loss: 0.5159 Explore P: 0.0101
Episode: 925 Total reward: 199.0 Training loss: 0.6251 Explore P: 0.0101
Episode: 926 Total reward: 199.0 Training loss: 0.4796 Explore P: 0.0101
Episode: 927 Total reward: 199.0 Training loss: 0.4825 Explore P: 0.0101
Episode: 928 Total reward: 199.0 Training loss: 0.4695 Explore P: 0.0101
Episode: 929 Total reward: 199.0 Training loss: 0.2777 Explore P: 0.0101
Episode: 930 Total reward: 199.0 Training loss: 86.6087 Explore P: 0.0101
Episode: 931 Total reward: 199.0 Training loss: 0.4040 Explore P: 0.0101
Episode: 932 Total reward: 199.0 Training loss: 0.3830 Explore P: 0.0101
Episode: 933 Total reward: 199.0 Training loss: 0.3233 Explore P: 0.0101
Episode: 934 Total reward: 199.0 Training loss: 266.7009 Explore P: 0.0101
Episode: 935 Total reward: 199.0 Training loss: 424.4490 Explore P: 0.0101
Episode: 936 Total reward: 199.0 Training loss: 0.3335 Explore P: 0.0101
Episode: 937 Total reward: 199.0 Training loss: 0.1830 Explore P: 0.0101
Episode: 938 Total reward: 199.0 Training loss: 0.3250 Explore P: 0.0101
Episode: 939 Total reward: 199.0 Training loss: 0.3497 Explore P: 0.0101
Episode: 940 Total reward: 199.0 Training loss: 0.3175 Explore P: 0.0101
Episode: 941 Total reward: 199.0 Training loss: 0.3040 Explore P: 0.0100
Episode: 942 Total reward: 199.0 Training loss: 0.1566 Explore P: 0.0100
Episode: 943 Total reward: 199.0 Training loss: 0.5466 Explore P: 0.0100
Episode: 944 Total reward: 199.0 Training loss: 293.8184 Explore P: 0.0100
Episode: 945 Total reward: 199.0 Training loss: 0.3891 Explore P: 0.0100
Episode: 946 Total reward: 199.0 Training loss: 0.4408 Explore P: 0.0100
Episode: 947 Total reward: 199.0 Training loss: 0.3561 Explore P: 0.0100
Episode: 948 Total reward: 199.0 Training loss: 0.3647 Explore P: 0.0100
Episode: 949 Total reward: 199.0 Training loss: 0.3244 Explore P: 0.0100
Episode: 950 Total reward: 199.0 Training loss: 0.5821 Explore P: 0.0100
Episode: 951 Total reward: 199.0 Training loss: 0.1819 Explore P: 0.0100
Episode: 952 Total reward: 199.0 Training loss: 0.2922 Explore P: 0.0100
Episode: 953 Total reward: 199.0 Training loss: 0.3537 Explore P: 0.0100
Episode: 954 Total reward: 199.0 Training loss: 268.4037 Explore P: 0.0100
Episode: 955 Total reward: 199.0 Training loss: 0.5031 Explore P: 0.0100
Episode: 956 Total reward: 199.0 Training loss: 0.1812 Explore P: 0.0100
Episode: 957 Total reward: 199.0 Training loss: 0.3884 Explore P: 0.0100
Episode: 958 Total reward: 199.0 Training loss: 145.5590 Explore P: 0.0100
Episode: 959 Total reward: 199.0 Training loss: 0.2918 Explore P: 0.0100
Episode: 960 Total reward: 199.0 Training loss: 0.1689 Explore P: 0.0100
Episode: 961 Total reward: 199.0 Training loss: 0.2069 Explore P: 0.0100
Episode: 962 Total reward: 199.0 Training loss: 0.1569 Explore P: 0.0100
Episode: 963 Total reward: 199.0 Training loss: 0.1290 Explore P: 0.0100
Episode: 964 Total reward: 199.0 Training loss: 0.2340 Explore P: 0.0100
Episode: 965 Total reward: 199.0 Training loss: 0.3582 Explore P: 0.0100
Episode: 966 Total reward: 199.0 Training loss: 213.1660 Explore P: 0.0100
Episode: 967 Total reward: 199.0 Training loss: 225.4590 Explore P: 0.0100
Episode: 968 Total reward: 199.0 Training loss: 0.2615 Explore P: 0.0100
Episode: 969 Total reward: 199.0 Training loss: 0.1657 Explore P: 0.0100
Episode: 970 Total reward: 199.0 Training loss: 0.3544 Explore P: 0.0100
Episode: 971 Total reward: 199.0 Training loss: 0.1730 Explore P: 0.0100
Episode: 972 Total reward: 199.0 Training loss: 0.2901 Explore P: 0.0100
Episode: 973 Total reward: 199.0 Training loss: 0.1439 Explore P: 0.0100
Episode: 974 Total reward: 199.0 Training loss: 0.1673 Explore P: 0.0100
Episode: 975 Total reward: 199.0 Training loss: 0.1088 Explore P: 0.0100
Episode: 976 Total reward: 199.0 Training loss: 0.1290 Explore P: 0.0100
Episode: 977 Total reward: 199.0 Training loss: 0.0475 Explore P: 0.0100
Episode: 978 Total reward: 199.0 Training loss: 0.2114 Explore P: 0.0100
Episode: 979 Total reward: 199.0 Training loss: 0.2853 Explore P: 0.0100
Episode: 980 Total reward: 199.0 Training loss: 0.3130 Explore P: 0.0100
Episode: 981 Total reward: 199.0 Training loss: 183.1573 Explore P: 0.0100
Episode: 982 Total reward: 199.0 Training loss: 0.3024 Explore P: 0.0100
Episode: 983 Total reward: 199.0 Training loss: 0.2258 Explore P: 0.0100
Episode: 984 Total reward: 199.0 Training loss: 0.4382 Explore P: 0.0100
Episode: 985 Total reward: 199.0 Training loss: 0.2068 Explore P: 0.0100
Episode: 986 Total reward: 199.0 Training loss: 0.2196 Explore P: 0.0100
Episode: 987 Total reward: 199.0 Training loss: 0.1244 Explore P: 0.0100
Episode: 988 Total reward: 199.0 Training loss: 167.6532 Explore P: 0.0100
Episode: 989 Total reward: 199.0 Training loss: 0.1947 Explore P: 0.0100
Episode: 990 Total reward: 199.0 Training loss: 0.0978 Explore P: 0.0100
Episode: 991 Total reward: 199.0 Training loss: 0.1817 Explore P: 0.0100
Episode: 992 Total reward: 199.0 Training loss: 0.2728 Explore P: 0.0100
Episode: 993 Total reward: 199.0 Training loss: 0.0937 Explore P: 0.0100
Episode: 994 Total reward: 199.0 Training loss: 0.1767 Explore P: 0.0100
Episode: 995 Total reward: 199.0 Training loss: 0.2021 Explore P: 0.0100
Episode: 996 Total reward: 199.0 Training loss: 140.6136 Explore P: 0.0100
Episode: 997 Total reward: 199.0 Training loss: 131.1091 Explore P: 0.0100
Episode: 998 Total reward: 199.0 Training loss: 0.1010 Explore P: 0.0100
Episode: 999 Total reward: 199.0 Training loss: 0.0795 Explore P: 0.0100

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a rolling average over 10 episodes in blue.


In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Rolling average over windows of N values, computed from cumulative sums
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
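
To see what running_mean computes, here is a small self-contained check. It repeats the function from the cell above, applies it to a made-up list of rewards (the numbers are illustrative, not taken from the training run), and compares the result with a direct windowed average.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a zero so cumsum[i] is the sum of the first i values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of two cumulative sums is the sum over a window of N values
    return (cumsum[N:] - cumsum[:-N]) / N

# Illustrative rewards, not taken from the training log
example_rewards = [10.0, 20.0, 30.0, 40.0, 50.0]
smoothed = running_mean(example_rewards, 3)

# Direct computation of the same windowed averages, for comparison
direct = [np.mean(example_rewards[i:i + 3]) for i in range(len(example_rewards) - 2)]

print(smoothed)  # [20. 30. 40.]
```

The result has len(x) - N + 1 entries, one per full window, which is why the plotting cell lines it up with the last len(smoothed_rews) episodes.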

In [15]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[15]:
<matplotlib.text.Text at 0x1265c13c8>

Testing

Let's check out how our trained agent plays the game.


In [17]:
test_episodes = 10
test_max_steps = 400
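# Reset the environment and take one random step to get a starting state
env.reset()
state, reward, done, _ = env.step(env.action_space.sample())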
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes + 1):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1
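
The action choice in the loop above is purely greedy: the network returns one Q-value per action, and np.argmax picks the largest. A minimal sketch with made-up Q-values shows the shapes involved. The numbers and the helper name greedy_action are hypothetical, standing in for the array returned by sess.run(mainQN.output, ...).

```python
import numpy as np

def greedy_action(q_values):
    # q_values has shape (1, n_actions): one row because we feed a single state.
    # np.argmax flattens the array, so for one row it returns the action index.
    return int(np.argmax(q_values))

# Hypothetical Q-values for Cart-Pole's two actions (0 = left, 1 = right)
example_qs = np.array([[0.7, 1.3]])
action = greedy_action(example_qs)
print(action)  # 1, since 1.3 > 0.7
```

There is no random exploration here, unlike during training, so the agent always takes the action it currently believes is best.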

In [18]:
env.close()

Extending this

Cart-Pole is a pretty simple game, but the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we get from the simulation here, though, you'd feed the screen images through convolutional layers and let the network learn the state from the pixels.
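
For image-based games, the state fed to the network is usually built from the screen itself. The sketch below is one common preprocessing recipe, not something from this notebook: convert a frame to grayscale, downsample it, and stack the most recent frames so that motion is visible. The 210x160 RGB frame size, the factor-of-two downsampling, the stack of 4, and the names preprocess and fake_frame are all assumptions for illustration.

```python
import numpy as np
from collections import deque

def preprocess(frame):
    # Average the RGB channels to get a grayscale image
    gray = frame.mean(axis=2)
    # Keep every second pixel in each direction to shrink the image
    small = gray[::2, ::2]
    # Scale pixel values from [0, 255] to [0, 1]
    return (small / 255.0).astype(np.float32)

# Stand-in for a screen image; a real frame would come from the environment
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

# Keep the last 4 processed frames so the network can infer motion
stack_size = 4
frames = deque([preprocess(fake_frame)] * stack_size, maxlen=stack_size)
state = np.stack(list(frames), axis=2)
print(state.shape)  # (105, 80, 4)
```

A convolutional Q-network would take this stacked array as its input in place of the Cart-Pole state vector.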

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.