Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [6]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull its contents into the gym directory.


In [7]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-17 18:42:42,785] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, use env.render() to render one frame. Passing an action (an integer) to env.step advances the simulation by one step. env.action_space tells you how many actions are possible, and env.action_space.sample() returns a random action. This interface is common to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [8]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().


In [9]:
env.close()

If you ran the simulation above, we can look at the rewards:


In [10]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$.

Previously, we used this equation to learn values for a Q-table. For this game, however, there are a huge number of possible states. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
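As a quick sanity check of the target and loss above, here is a small worked example in plain Python. All of the numbers (the reward, the discount, and the Q-values) are made up for illustration and do not come from a Cart-Pole run.

```python
# Hypothetical numbers for a single transition (s, a, r, s')
gamma = 0.99                 # discount factor
reward = 1.0                 # reward r received for taking action a in state s
next_q_values = [2.0, 3.0]   # assumed network outputs Q(s', a') for a' = 0 and a' = 1
current_q = 3.5              # assumed network output Q(s, a) for the action taken

# Target: r + gamma * max over a' of Q(s', a')
target_q = reward + gamma * max(next_q_values)

# Squared error that the optimizer would minimize for this transition
loss = (target_q - current_q) ** 2

print(target_q, loss)
```

Here the target works out to $1 + 0.99 \times 3 = 3.97$, so a gradient step would nudge $Q(s, a)$ up from 3.5 toward that value.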

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
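To make the input and output shapes concrete, here is a rough NumPy sketch of a forward pass through a network of this shape. The weights are random and untrained, and the names W1, b1, and q_values are illustrative assumptions of mine; this is not the TensorFlow implementation used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, hidden_size, action_size = 4, 10, 2

# Randomly initialised weights and zero biases (illustrative only, not trained)
W1 = rng.normal(scale=0.1, size=(state_size, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b2 = np.zeros(hidden_size)
W3 = rng.normal(scale=0.1, size=(hidden_size, action_size))
b3 = np.zeros(action_size)

def q_values(states):
    """Map a batch of states, shape (batch, 4), to Q-values, shape (batch, 2)."""
    h1 = np.maximum(0.0, states @ W1 + b1)   # first ReLU hidden layer
    h2 = np.maximum(0.0, h1 @ W2 + b2)       # second ReLU hidden layer
    return h2 @ W3 + b3                      # linear output, one value per action

batch = rng.normal(size=(3, state_size))     # three made-up states
qs = q_values(batch)
print(qs.shape)                # one row per state, one column per action
print(np.argmax(qs, axis=1))   # greedy action for each state
```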

Below is my implementation of the Q-network. I used two fully connected hidden layers with ReLU activations. Two seem to be good enough; three might be better. Feel free to try it out.


In [11]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
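The one-hot trick used to compute self.Q can be easier to follow outside of TensorFlow. Here is a small NumPy sketch of the same idea with made-up numbers. I'm assuming np.eye(2)[actions] as a stand-in for tf.one_hot(actions, 2); the values in output are not real network outputs.

```python
import numpy as np

# Made-up network outputs for a batch of three states (two actions each)
output = np.array([[0.5, 1.5],
                   [2.0, 0.1],
                   [0.3, 0.4]])
actions = np.array([1, 0, 1])           # action taken in each stored transition

# Rows of the identity matrix give the one-hot encoding of each action
one_hot_actions = np.eye(2)[actions]

# Multiply and sum over the action axis, which keeps one Q-value per row
selected_q = np.sum(output * one_hot_actions, axis=1)

print(selected_q)
```

Each row keeps only the Q-value of the action that was actually taken, which is the value the loss compares against the target.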

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [12]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
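To see the "tube" behavior described above, here is a short sketch using a deque with a tiny capacity. The string labels stand in for real experience tuples and are purely illustrative.

```python
from collections import deque
import numpy as np

buffer = deque(maxlen=3)          # tiny capacity so the eviction is easy to see
for experience in ["e0", "e1", "e2", "e3", "e4"]:
    buffer.append(experience)     # once full, each append drops the oldest item

print(list(buffer))               # only the three most recent experiences remain

# Draw two distinct indices at random, the same way Memory.sample does
idx = np.random.choice(np.arange(len(buffer)), size=2, replace=False)
sample = [buffer[ii] for ii in idx]
print(sample)
```

Because the sampling uses replace=False, the buffer has to hold at least batch_size experiences before the first call to sample, otherwise np.random.choice raises an error.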

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
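Here is a minimal sketch of an $\epsilon$-greedy action choice in NumPy. The function name epsilon_greedy and the Q-values are assumptions made for illustration; in the notebook the Q-values would come from the network.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.8])     # made-up Q-values for the two actions

# With epsilon = 0 the agent always exploits; with epsilon = 1 it always explores
greedy_actions = [epsilon_greedy(q_values, 0.0, rng) for _ in range(5)]
random_actions = [epsilon_greedy(q_values, 1.0, rng) for _ in range(1000)]
print(greedy_actions)
print(sum(random_actions) / len(random_actions))   # fraction of action 1, near 0.5
```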

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
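The target step in the loop above can be sketched for a whole mini-batch at once. This is a NumPy illustration under stated assumptions: the transitions and the next_qs values are made up, and a terminal transition is marked by an all-zero next state, which is the convention this notebook uses when it stores experiences.

```python
import numpy as np

gamma = 0.99

# A made-up mini-batch of three transitions; the last one ended its episode
rewards = np.array([1.0, 1.0, 1.0])
next_states = np.array([[0.01, 0.20, -0.03, 0.10],
                        [0.05, -0.10, 0.02, 0.30],
                        [0.00, 0.00, 0.00, 0.00]])   # all zeros marks a terminal transition

# Stand-in for the network's output Q(s', a') on next_states (assumed values)
next_qs = np.array([[2.0, 3.0],
                    [1.5, 0.5],
                    [4.0, 6.0]])

# Zero the Q-values of terminal transitions so their target is just the reward
episode_ends = (next_states == 0).all(axis=1)
next_qs[episode_ends] = 0.0

targets = rewards + gamma * np.max(next_qs, axis=1)
print(targets)
```

The first two targets include the discounted best next Q-value, while the terminal transition's target is only its reward.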

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [13]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory
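To get a feel for the exploration parameters, here is a small sketch that evaluates the exponential decay used in the training loop at a few step counts. The helper name explore_probability is my own; the formula and the three parameter values are the ones from this notebook.

```python
import numpy as np

explore_start = 1.0
explore_stop = 0.01
decay_rate = 0.0001

def explore_probability(step):
    """Exploration probability after `step` total training steps."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)

# The probability starts at explore_start and decays toward explore_stop
for step in [0, 1000, 10000, 50000]:
    print(step, round(float(explore_probability(step)), 4))
```

With decay_rate = 0.0001 the decay is slow: it takes on the order of ten thousand steps for the agent to explore less than half the time.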

In [14]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [15]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [16]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 2.0 Training loss: 1.1709 Explore P: 0.9998
Episode: 2 Total reward: 15.0 Training loss: 1.1415 Explore P: 0.9983
Episode: 3 Total reward: 27.0 Training loss: 1.0786 Explore P: 0.9957
Episode: 4 Total reward: 14.0 Training loss: 1.0205 Explore P: 0.9943
Episode: 5 Total reward: 17.0 Training loss: 1.0519 Explore P: 0.9926
Episode: 6 Total reward: 14.0 Training loss: 1.0505 Explore P: 0.9912
Episode: 7 Total reward: 8.0 Training loss: 1.1451 Explore P: 0.9904
Episode: 8 Total reward: 28.0 Training loss: 1.0974 Explore P: 0.9877
Episode: 9 Total reward: 37.0 Training loss: 1.1408 Explore P: 0.9841
Episode: 10 Total reward: 14.0 Training loss: 1.1452 Explore P: 0.9827
Episode: 11 Total reward: 11.0 Training loss: 1.1193 Explore P: 0.9817
Episode: 12 Total reward: 18.0 Training loss: 1.1560 Explore P: 0.9799
Episode: 13 Total reward: 13.0 Training loss: 1.2576 Explore P: 0.9787
Episode: 14 Total reward: 12.0 Training loss: 1.1699 Explore P: 0.9775
Episode: 15 Total reward: 24.0 Training loss: 1.2659 Explore P: 0.9752
Episode: 16 Total reward: 8.0 Training loss: 1.3200 Explore P: 0.9744
Episode: 17 Total reward: 24.0 Training loss: 1.3965 Explore P: 0.9721
Episode: 18 Total reward: 21.0 Training loss: 1.3456 Explore P: 0.9701
Episode: 19 Total reward: 13.0 Training loss: 1.4171 Explore P: 0.9688
Episode: 20 Total reward: 17.0 Training loss: 1.8448 Explore P: 0.9672
Episode: 21 Total reward: 16.0 Training loss: 1.8148 Explore P: 0.9657
Episode: 22 Total reward: 8.0 Training loss: 3.5678 Explore P: 0.9649
Episode: 23 Total reward: 16.0 Training loss: 1.8703 Explore P: 0.9634
Episode: 24 Total reward: 15.0 Training loss: 1.6495 Explore P: 0.9619
Episode: 25 Total reward: 18.0 Training loss: 2.0888 Explore P: 0.9602
Episode: 26 Total reward: 12.0 Training loss: 3.9700 Explore P: 0.9591
Episode: 27 Total reward: 13.0 Training loss: 4.2566 Explore P: 0.9579
Episode: 28 Total reward: 17.0 Training loss: 2.6608 Explore P: 0.9562
Episode: 29 Total reward: 23.0 Training loss: 4.2741 Explore P: 0.9541
Episode: 30 Total reward: 14.0 Training loss: 2.5614 Explore P: 0.9528
Episode: 31 Total reward: 27.0 Training loss: 1.8744 Explore P: 0.9502
Episode: 32 Total reward: 14.0 Training loss: 2.7927 Explore P: 0.9489
Episode: 33 Total reward: 35.0 Training loss: 3.5873 Explore P: 0.9456
Episode: 34 Total reward: 25.0 Training loss: 20.4900 Explore P: 0.9433
Episode: 35 Total reward: 12.0 Training loss: 4.6836 Explore P: 0.9422
Episode: 36 Total reward: 32.0 Training loss: 4.6049 Explore P: 0.9392
Episode: 37 Total reward: 29.0 Training loss: 9.0972 Explore P: 0.9365
Episode: 38 Total reward: 30.0 Training loss: 13.0535 Explore P: 0.9337
Episode: 39 Total reward: 13.0 Training loss: 19.3627 Explore P: 0.9325
Episode: 40 Total reward: 50.0 Training loss: 9.8427 Explore P: 0.9279
Episode: 41 Total reward: 34.0 Training loss: 11.9346 Explore P: 0.9248
Episode: 42 Total reward: 27.0 Training loss: 6.0418 Explore P: 0.9223
Episode: 43 Total reward: 12.0 Training loss: 9.8117 Explore P: 0.9212
Episode: 44 Total reward: 20.0 Training loss: 17.3960 Explore P: 0.9194
Episode: 45 Total reward: 12.0 Training loss: 6.7082 Explore P: 0.9183
Episode: 46 Total reward: 29.0 Training loss: 12.8507 Explore P: 0.9157
Episode: 47 Total reward: 31.0 Training loss: 7.9261 Explore P: 0.9129
Episode: 48 Total reward: 28.0 Training loss: 55.7063 Explore P: 0.9104
Episode: 49 Total reward: 16.0 Training loss: 8.4813 Explore P: 0.9089
Episode: 50 Total reward: 10.0 Training loss: 22.6348 Explore P: 0.9080
Episode: 51 Total reward: 17.0 Training loss: 51.3870 Explore P: 0.9065
Episode: 52 Total reward: 21.0 Training loss: 32.9819 Explore P: 0.9046
Episode: 53 Total reward: 15.0 Training loss: 20.3821 Explore P: 0.9033
Episode: 54 Total reward: 15.0 Training loss: 38.8090 Explore P: 0.9019
Episode: 55 Total reward: 16.0 Training loss: 27.3016 Explore P: 0.9005
Episode: 56 Total reward: 19.0 Training loss: 28.1849 Explore P: 0.8988
Episode: 57 Total reward: 11.0 Training loss: 95.4361 Explore P: 0.8979
Episode: 58 Total reward: 20.0 Training loss: 14.9822 Explore P: 0.8961
Episode: 59 Total reward: 25.0 Training loss: 69.3418 Explore P: 0.8939
Episode: 60 Total reward: 9.0 Training loss: 9.3469 Explore P: 0.8931
Episode: 61 Total reward: 29.0 Training loss: 11.0135 Explore P: 0.8905
Episode: 62 Total reward: 15.0 Training loss: 9.8287 Explore P: 0.8892
Episode: 63 Total reward: 19.0 Training loss: 54.9833 Explore P: 0.8875
Episode: 64 Total reward: 14.0 Training loss: 9.6541 Explore P: 0.8863
Episode: 65 Total reward: 11.0 Training loss: 57.3257 Explore P: 0.8853
Episode: 66 Total reward: 11.0 Training loss: 7.9105 Explore P: 0.8844
Episode: 67 Total reward: 14.0 Training loss: 13.0389 Explore P: 0.8831
Episode: 68 Total reward: 14.0 Training loss: 52.1695 Explore P: 0.8819
Episode: 69 Total reward: 17.0 Training loss: 100.4266 Explore P: 0.8804
Episode: 70 Total reward: 21.0 Training loss: 33.6046 Explore P: 0.8786
Episode: 71 Total reward: 17.0 Training loss: 39.6531 Explore P: 0.8771
Episode: 72 Total reward: 36.0 Training loss: 14.8078 Explore P: 0.8740
Episode: 73 Total reward: 19.0 Training loss: 45.1692 Explore P: 0.8724
Episode: 74 Total reward: 23.0 Training loss: 80.2075 Explore P: 0.8704
Episode: 75 Total reward: 11.0 Training loss: 46.1176 Explore P: 0.8695
Episode: 76 Total reward: 37.0 Training loss: 28.7535 Explore P: 0.8663
Episode: 77 Total reward: 23.0 Training loss: 50.4762 Explore P: 0.8643
Episode: 78 Total reward: 18.0 Training loss: 56.1125 Explore P: 0.8628
Episode: 79 Total reward: 48.0 Training loss: 47.1231 Explore P: 0.8587
Episode: 80 Total reward: 31.0 Training loss: 100.4988 Explore P: 0.8561
Episode: 81 Total reward: 29.0 Training loss: 10.4337 Explore P: 0.8536
Episode: 82 Total reward: 20.0 Training loss: 12.2359 Explore P: 0.8519
Episode: 83 Total reward: 18.0 Training loss: 141.0979 Explore P: 0.8504
Episode: 84 Total reward: 11.0 Training loss: 70.8232 Explore P: 0.8495
Episode: 85 Total reward: 15.0 Training loss: 95.4053 Explore P: 0.8482
Episode: 86 Total reward: 10.0 Training loss: 13.4467 Explore P: 0.8474
Episode: 87 Total reward: 10.0 Training loss: 298.8007 Explore P: 0.8466
Episode: 88 Total reward: 11.0 Training loss: 7.8934 Explore P: 0.8456
Episode: 89 Total reward: 10.0 Training loss: 11.4517 Explore P: 0.8448
Episode: 90 Total reward: 15.0 Training loss: 73.2631 Explore P: 0.8436
Episode: 91 Total reward: 9.0 Training loss: 115.8266 Explore P: 0.8428
Episode: 92 Total reward: 24.0 Training loss: 40.7382 Explore P: 0.8408
Episode: 93 Total reward: 44.0 Training loss: 213.0003 Explore P: 0.8372
Episode: 94 Total reward: 9.0 Training loss: 10.7542 Explore P: 0.8364
Episode: 95 Total reward: 27.0 Training loss: 10.5816 Explore P: 0.8342
Episode: 96 Total reward: 15.0 Training loss: 185.7386 Explore P: 0.8330
Episode: 97 Total reward: 32.0 Training loss: 99.5904 Explore P: 0.8303
Episode: 98 Total reward: 14.0 Training loss: 46.6704 Explore P: 0.8292
Episode: 99 Total reward: 12.0 Training loss: 10.7374 Explore P: 0.8282
Episode: 100 Total reward: 12.0 Training loss: 135.2496 Explore P: 0.8272
Episode: 101 Total reward: 27.0 Training loss: 44.6197 Explore P: 0.8250
Episode: 102 Total reward: 17.0 Training loss: 82.2934 Explore P: 0.8236
Episode: 103 Total reward: 10.0 Training loss: 159.9242 Explore P: 0.8228
Episode: 104 Total reward: 33.0 Training loss: 9.1134 Explore P: 0.8201
Episode: 105 Total reward: 9.0 Training loss: 9.1957 Explore P: 0.8194
Episode: 106 Total reward: 16.0 Training loss: 62.3819 Explore P: 0.8181
Episode: 107 Total reward: 13.0 Training loss: 8.3292 Explore P: 0.8171
Episode: 108 Total reward: 45.0 Training loss: 158.4720 Explore P: 0.8134
Episode: 109 Total reward: 15.0 Training loss: 59.4217 Explore P: 0.8122
Episode: 110 Total reward: 38.0 Training loss: 215.3996 Explore P: 0.8092
Episode: 111 Total reward: 28.0 Training loss: 210.0052 Explore P: 0.8070
Episode: 112 Total reward: 27.0 Training loss: 440.5213 Explore P: 0.8048
Episode: 113 Total reward: 18.0 Training loss: 63.6101 Explore P: 0.8034
Episode: 114 Total reward: 9.0 Training loss: 51.1088 Explore P: 0.8027
Episode: 115 Total reward: 10.0 Training loss: 7.2532 Explore P: 0.8019
Episode: 116 Total reward: 14.0 Training loss: 47.0008 Explore P: 0.8008
Episode: 117 Total reward: 10.0 Training loss: 143.8552 Explore P: 0.8000
Episode: 118 Total reward: 14.0 Training loss: 409.3321 Explore P: 0.7989
Episode: 119 Total reward: 9.0 Training loss: 82.0720 Explore P: 0.7982
Episode: 120 Total reward: 11.0 Training loss: 222.1376 Explore P: 0.7973
Episode: 121 Total reward: 18.0 Training loss: 127.8246 Explore P: 0.7959
Episode: 122 Total reward: 10.0 Training loss: 6.6337 Explore P: 0.7951
Episode: 123 Total reward: 49.0 Training loss: 59.3078 Explore P: 0.7913
Episode: 124 Total reward: 15.0 Training loss: 5.7666 Explore P: 0.7901
Episode: 125 Total reward: 20.0 Training loss: 130.4575 Explore P: 0.7885
Episode: 126 Total reward: 26.0 Training loss: 98.2357 Explore P: 0.7865
Episode: 127 Total reward: 12.0 Training loss: 222.7195 Explore P: 0.7856
Episode: 128 Total reward: 10.0 Training loss: 5.9802 Explore P: 0.7848
Episode: 129 Total reward: 28.0 Training loss: 164.7927 Explore P: 0.7826
Episode: 130 Total reward: 23.0 Training loss: 138.6818 Explore P: 0.7809
Episode: 131 Total reward: 11.0 Training loss: 146.3478 Explore P: 0.7800
Episode: 132 Total reward: 18.0 Training loss: 107.4105 Explore P: 0.7786
Episode: 133 Total reward: 13.0 Training loss: 66.5617 Explore P: 0.7776
Episode: 134 Total reward: 10.0 Training loss: 110.7497 Explore P: 0.7769
Episode: 135 Total reward: 20.0 Training loss: 84.6903 Explore P: 0.7753
Episode: 136 Total reward: 11.0 Training loss: 114.5353 Explore P: 0.7745
Episode: 137 Total reward: 18.0 Training loss: 115.6292 Explore P: 0.7731
Episode: 138 Total reward: 15.0 Training loss: 3.6356 Explore P: 0.7720
Episode: 139 Total reward: 18.0 Training loss: 108.1531 Explore P: 0.7706
Episode: 140 Total reward: 13.0 Training loss: 3.2026 Explore P: 0.7696
Episode: 141 Total reward: 11.0 Training loss: 3.7691 Explore P: 0.7688
Episode: 142 Total reward: 28.0 Training loss: 3.3158 Explore P: 0.7667
Episode: 143 Total reward: 36.0 Training loss: 61.6117 Explore P: 0.7639
Episode: 144 Total reward: 12.0 Training loss: 3.4737 Explore P: 0.7630
Episode: 145 Total reward: 57.0 Training loss: 73.6277 Explore P: 0.7587
Episode: 146 Total reward: 12.0 Training loss: 53.7302 Explore P: 0.7579
Episode: 147 Total reward: 23.0 Training loss: 2.5490 Explore P: 0.7561
Episode: 148 Total reward: 10.0 Training loss: 57.4684 Explore P: 0.7554
Episode: 149 Total reward: 19.0 Training loss: 56.6453 Explore P: 0.7540
Episode: 150 Total reward: 13.0 Training loss: 2.1148 Explore P: 0.7530
Episode: 151 Total reward: 43.0 Training loss: 58.5362 Explore P: 0.7498
Episode: 152 Total reward: 11.0 Training loss: 49.9742 Explore P: 0.7490
Episode: 153 Total reward: 13.0 Training loss: 1.8630 Explore P: 0.7480
Episode: 154 Total reward: 9.0 Training loss: 112.6122 Explore P: 0.7474
Episode: 155 Total reward: 19.0 Training loss: 115.6578 Explore P: 0.7460
Episode: 156 Total reward: 18.0 Training loss: 159.8576 Explore P: 0.7447
Episode: 157 Total reward: 20.0 Training loss: 1.6961 Explore P: 0.7432
Episode: 158 Total reward: 17.0 Training loss: 1.7310 Explore P: 0.7419
Episode: 159 Total reward: 12.0 Training loss: 57.7217 Explore P: 0.7411
Episode: 160 Total reward: 14.0 Training loss: 1.6415 Explore P: 0.7400
Episode: 161 Total reward: 19.0 Training loss: 162.6029 Explore P: 0.7387
Episode: 162 Total reward: 28.0 Training loss: 92.8545 Explore P: 0.7366
Episode: 163 Total reward: 29.0 Training loss: 49.2524 Explore P: 0.7345
Episode: 164 Total reward: 14.0 Training loss: 112.7515 Explore P: 0.7335
Episode: 165 Total reward: 7.0 Training loss: 97.3308 Explore P: 0.7330
Episode: 166 Total reward: 13.0 Training loss: 1.3165 Explore P: 0.7321
Episode: 167 Total reward: 16.0 Training loss: 45.8984 Explore P: 0.7309
Episode: 168 Total reward: 14.0 Training loss: 141.1332 Explore P: 0.7299
Episode: 169 Total reward: 13.0 Training loss: 1.1587 Explore P: 0.7290
Episode: 170 Total reward: 17.0 Training loss: 45.2210 Explore P: 0.7277
Episode: 171 Total reward: 12.0 Training loss: 108.1808 Explore P: 0.7269
Episode: 172 Total reward: 10.0 Training loss: 40.7393 Explore P: 0.7262
Episode: 173 Total reward: 19.0 Training loss: 31.2582 Explore P: 0.7248
Episode: 174 Total reward: 9.0 Training loss: 219.4591 Explore P: 0.7242
Episode: 175 Total reward: 15.0 Training loss: 0.5284 Explore P: 0.7231
Episode: 176 Total reward: 16.0 Training loss: 1.0065 Explore P: 0.7219
Episode: 177 Total reward: 14.0 Training loss: 59.8071 Explore P: 0.7210
Episode: 178 Total reward: 14.0 Training loss: 86.9231 Explore P: 0.7200
Episode: 179 Total reward: 11.0 Training loss: 0.6259 Explore P: 0.7192
Episode: 180 Total reward: 12.0 Training loss: 39.4254 Explore P: 0.7183
Episode: 181 Total reward: 12.0 Training loss: 1.3001 Explore P: 0.7175
Episode: 182 Total reward: 13.0 Training loss: 102.1355 Explore P: 0.7166
Episode: 183 Total reward: 10.0 Training loss: 62.8892 Explore P: 0.7159
Episode: 184 Total reward: 25.0 Training loss: 59.3207 Explore P: 0.7141
Episode: 185 Total reward: 10.0 Training loss: 55.4833 Explore P: 0.7134
Episode: 186 Total reward: 11.0 Training loss: 0.7272 Explore P: 0.7126
Episode: 187 Total reward: 26.0 Training loss: 30.8483 Explore P: 0.7108
Episode: 188 Total reward: 21.0 Training loss: 80.9037 Explore P: 0.7093
Episode: 189 Total reward: 8.0 Training loss: 44.6503 Explore P: 0.7088
Episode: 190 Total reward: 16.0 Training loss: 46.0946 Explore P: 0.7076
Episode: 191 Total reward: 12.0 Training loss: 0.9730 Explore P: 0.7068
Episode: 192 Total reward: 16.0 Training loss: 25.8601 Explore P: 0.7057
Episode: 193 Total reward: 13.0 Training loss: 43.8052 Explore P: 0.7048
Episode: 194 Total reward: 8.0 Training loss: 0.7413 Explore P: 0.7042
Episode: 195 Total reward: 25.0 Training loss: 79.3610 Explore P: 0.7025
Episode: 196 Total reward: 12.0 Training loss: 1.1658 Explore P: 0.7017
Episode: 197 Total reward: 27.0 Training loss: 0.9340 Explore P: 0.6998
Episode: 198 Total reward: 28.0 Training loss: 84.0041 Explore P: 0.6979
Episode: 199 Total reward: 25.0 Training loss: 30.7070 Explore P: 0.6962
Episode: 200 Total reward: 8.0 Training loss: 1.4569 Explore P: 0.6956
Episode: 201 Total reward: 17.0 Training loss: 0.8827 Explore P: 0.6944
Episode: 202 Total reward: 34.0 Training loss: 22.2634 Explore P: 0.6921
Episode: 203 Total reward: 19.0 Training loss: 28.8815 Explore P: 0.6908
Episode: 204 Total reward: 18.0 Training loss: 60.4076 Explore P: 0.6896
Episode: 205 Total reward: 16.0 Training loss: 1.3333 Explore P: 0.6885
Episode: 206 Total reward: 18.0 Training loss: 1.1283 Explore P: 0.6873
Episode: 207 Total reward: 10.0 Training loss: 42.9816 Explore P: 0.6866
Episode: 208 Total reward: 12.0 Training loss: 39.5467 Explore P: 0.6858
Episode: 209 Total reward: 11.0 Training loss: 23.5611 Explore P: 0.6851
Episode: 210 Total reward: 17.0 Training loss: 1.2279 Explore P: 0.6839
Episode: 211 Total reward: 13.0 Training loss: 37.4812 Explore P: 0.6830
Episode: 212 Total reward: 16.0 Training loss: 22.2818 Explore P: 0.6820
Episode: 213 Total reward: 31.0 Training loss: 21.1080 Explore P: 0.6799
Episode: 214 Total reward: 11.0 Training loss: 58.5944 Explore P: 0.6791
Episode: 215 Total reward: 17.0 Training loss: 43.4265 Explore P: 0.6780
Episode: 216 Total reward: 15.0 Training loss: 57.3983 Explore P: 0.6770
Episode: 217 Total reward: 23.0 Training loss: 77.8382 Explore P: 0.6755
Episode: 218 Total reward: 9.0 Training loss: 48.0022 Explore P: 0.6749
Episode: 219 Total reward: 17.0 Training loss: 39.7138 Explore P: 0.6737
Episode: 220 Total reward: 9.0 Training loss: 88.9876 Explore P: 0.6732
Episode: 221 Total reward: 15.0 Training loss: 19.9210 Explore P: 0.6722
Episode: 222 Total reward: 32.0 Training loss: 19.5613 Explore P: 0.6700
Episode: 223 Total reward: 17.0 Training loss: 38.2312 Explore P: 0.6689
Episode: 224 Total reward: 35.0 Training loss: 19.0546 Explore P: 0.6666
Episode: 225 Total reward: 14.0 Training loss: 1.3924 Explore P: 0.6657
Episode: 226 Total reward: 17.0 Training loss: 1.3233 Explore P: 0.6646
Episode: 227 Total reward: 14.0 Training loss: 95.5612 Explore P: 0.6637
Episode: 228 Total reward: 19.0 Training loss: 18.0053 Explore P: 0.6624
Episode: 229 Total reward: 12.0 Training loss: 1.5512 Explore P: 0.6616
Episode: 230 Total reward: 17.0 Training loss: 39.7448 Explore P: 0.6605
Episode: 231 Total reward: 10.0 Training loss: 35.0024 Explore P: 0.6599
Episode: 232 Total reward: 17.0 Training loss: 35.4123 Explore P: 0.6588
Episode: 233 Total reward: 37.0 Training loss: 17.5831 Explore P: 0.6564
Episode: 234 Total reward: 10.0 Training loss: 37.9826 Explore P: 0.6557
Episode: 235 Total reward: 8.0 Training loss: 70.8949 Explore P: 0.6552
Episode: 236 Total reward: 20.0 Training loss: 37.1099 Explore P: 0.6539
Episode: 237 Total reward: 13.0 Training loss: 1.0752 Explore P: 0.6531
Episode: 238 Total reward: 14.0 Training loss: 17.6874 Explore P: 0.6522
Episode: 239 Total reward: 16.0 Training loss: 17.5504 Explore P: 0.6512
Episode: 240 Total reward: 15.0 Training loss: 19.4427 Explore P: 0.6502
Episode: 241 Total reward: 27.0 Training loss: 36.9410 Explore P: 0.6485
Episode: 242 Total reward: 11.0 Training loss: 0.7954 Explore P: 0.6478
Episode: 243 Total reward: 18.0 Training loss: 18.5555 Explore P: 0.6466
Episode: 244 Total reward: 49.0 Training loss: 0.8970 Explore P: 0.6435
Episode: 245 Total reward: 25.0 Training loss: 0.6955 Explore P: 0.6419
Episode: 246 Total reward: 32.0 Training loss: 20.9732 Explore P: 0.6399
Episode: 247 Total reward: 36.0 Training loss: 61.0841 Explore P: 0.6377
Episode: 248 Total reward: 9.0 Training loss: 30.3346 Explore P: 0.6371
Episode: 249 Total reward: 28.0 Training loss: 0.8790 Explore P: 0.6353
Episode: 250 Total reward: 23.0 Training loss: 49.1591 Explore P: 0.6339
Episode: 251 Total reward: 11.0 Training loss: 0.7632 Explore P: 0.6332
Episode: 252 Total reward: 14.0 Training loss: 29.6198 Explore P: 0.6324
Episode: 253 Total reward: 19.0 Training loss: 14.6813 Explore P: 0.6312
Episode: 254 Total reward: 35.0 Training loss: 0.8390 Explore P: 0.6290
Episode: 255 Total reward: 110.0 Training loss: 1.1391 Explore P: 0.6222
Episode: 256 Total reward: 29.0 Training loss: 32.2034 Explore P: 0.6205
Episode: 257 Total reward: 19.0 Training loss: 28.0171 Explore P: 0.6193
Episode: 258 Total reward: 14.0 Training loss: 41.1098 Explore P: 0.6184
Episode: 259 Total reward: 29.0 Training loss: 46.2892 Explore P: 0.6167
Episode: 260 Total reward: 76.0 Training loss: 1.0684 Explore P: 0.6121
Episode: 261 Total reward: 12.0 Training loss: 27.3396 Explore P: 0.6114
Episode: 262 Total reward: 24.0 Training loss: 1.5615 Explore P: 0.6099
Episode: 263 Total reward: 19.0 Training loss: 1.1943 Explore P: 0.6088
Episode: 264 Total reward: 89.0 Training loss: 12.8981 Explore P: 0.6035
Episode: 265 Total reward: 87.0 Training loss: 1.1223 Explore P: 0.5983
Episode: 266 Total reward: 24.0 Training loss: 24.9046 Explore P: 0.5969
Episode: 267 Total reward: 38.0 Training loss: 0.9700 Explore P: 0.5947
Episode: 268 Total reward: 56.0 Training loss: 23.7414 Explore P: 0.5914
Episode: 269 Total reward: 107.0 Training loss: 22.6318 Explore P: 0.5853
Episode: 270 Total reward: 50.0 Training loss: 51.3227 Explore P: 0.5824
Episode: 271 Total reward: 37.0 Training loss: 1.2111 Explore P: 0.5803
Episode: 272 Total reward: 34.0 Training loss: 1.7236 Explore P: 0.5783
Episode: 273 Total reward: 58.0 Training loss: 16.4533 Explore P: 0.5750
Episode: 274 Total reward: 17.0 Training loss: 22.5920 Explore P: 0.5741
Episode: 275 Total reward: 63.0 Training loss: 33.7327 Explore P: 0.5705
Episode: 276 Total reward: 49.0 Training loss: 52.6798 Explore P: 0.5678
Episode: 277 Total reward: 72.0 Training loss: 13.3638 Explore P: 0.5638
Episode: 278 Total reward: 70.0 Training loss: 1.5991 Explore P: 0.5599
Episode: 279 Total reward: 50.0 Training loss: 26.5202 Explore P: 0.5572
Episode: 280 Total reward: 20.0 Training loss: 0.7884 Explore P: 0.5561
Episode: 281 Total reward: 12.0 Training loss: 29.9646 Explore P: 0.5554
Episode: 282 Total reward: 15.0 Training loss: 33.9996 Explore P: 0.5546
Episode: 283 Total reward: 27.0 Training loss: 39.3957 Explore P: 0.5532
Episode: 284 Total reward: 109.0 Training loss: 1.7761 Explore P: 0.5473
Episode: 285 Total reward: 60.0 Training loss: 34.6245 Explore P: 0.5441
Episode: 286 Total reward: 45.0 Training loss: 1.3284 Explore P: 0.5417
Episode: 287 Total reward: 43.0 Training loss: 23.7634 Explore P: 0.5394
Episode: 288 Total reward: 62.0 Training loss: 1.8122 Explore P: 0.5361
Episode: 289 Total reward: 33.0 Training loss: 55.9169 Explore P: 0.5344
Episode: 290 Total reward: 88.0 Training loss: 0.9494 Explore P: 0.5298
Episode: 291 Total reward: 48.0 Training loss: 44.0226 Explore P: 0.5273
Episode: 292 Total reward: 58.0 Training loss: 1.3399 Explore P: 0.5243
Episode: 293 Total reward: 20.0 Training loss: 1.1614 Explore P: 0.5233
Episode: 294 Total reward: 36.0 Training loss: 66.0255 Explore P: 0.5214
Episode: 295 Total reward: 75.0 Training loss: 1.6018 Explore P: 0.5176
Episode: 296 Total reward: 12.0 Training loss: 26.4333 Explore P: 0.5170
Episode: 297 Total reward: 27.0 Training loss: 23.8161 Explore P: 0.5156
Episode: 298 Total reward: 10.0 Training loss: 1.2828 Explore P: 0.5151
Episode: 299 Total reward: 69.0 Training loss: 33.2671 Explore P: 0.5117
Episode: 300 Total reward: 42.0 Training loss: 22.8889 Explore P: 0.5095
Episode: 301 Total reward: 89.0 Training loss: 63.7810 Explore P: 0.5051
Episode: 302 Total reward: 37.0 Training loss: 33.2429 Explore P: 0.5033
Episode: 303 Total reward: 63.0 Training loss: 31.9538 Explore P: 0.5002
Episode: 304 Total reward: 32.0 Training loss: 2.0001 Explore P: 0.4986
Episode: 305 Total reward: 14.0 Training loss: 17.8252 Explore P: 0.4979
Episode: 306 Total reward: 56.0 Training loss: 56.1209 Explore P: 0.4952
Episode: 307 Total reward: 90.0 Training loss: 51.0099 Explore P: 0.4909
Episode: 308 Total reward: 118.0 Training loss: 1.3655 Explore P: 0.4852
Episode: 309 Total reward: 69.0 Training loss: 57.7486 Explore P: 0.4820
Episode: 310 Total reward: 50.0 Training loss: 28.5875 Explore P: 0.4796
Episode: 311 Total reward: 97.0 Training loss: 1.4452 Explore P: 0.4751
Episode: 312 Total reward: 90.0 Training loss: 54.7468 Explore P: 0.4709
Episode: 313 Total reward: 36.0 Training loss: 0.8538 Explore P: 0.4693
Episode: 314 Total reward: 31.0 Training loss: 51.4191 Explore P: 0.4678
Episode: 315 Total reward: 26.0 Training loss: 1.6518 Explore P: 0.4666
Episode: 316 Total reward: 93.0 Training loss: 25.2718 Explore P: 0.4624
Episode: 317 Total reward: 33.0 Training loss: 97.2752 Explore P: 0.4609
Episode: 318 Total reward: 35.0 Training loss: 85.9511 Explore P: 0.4594
Episode: 319 Total reward: 35.0 Training loss: 78.9097 Explore P: 0.4578
Episode: 320 Total reward: 75.0 Training loss: 0.9472 Explore P: 0.4544
Episode: 321 Total reward: 131.0 Training loss: 20.2984 Explore P: 0.4487
Episode: 322 Total reward: 33.0 Training loss: 28.7299 Explore P: 0.4472
Episode: 323 Total reward: 91.0 Training loss: 92.2091 Explore P: 0.4432
Episode: 324 Total reward: 27.0 Training loss: 1.8147 Explore P: 0.4421
Episode: 325 Total reward: 77.0 Training loss: 1.1093 Explore P: 0.4388
Episode: 326 Total reward: 29.0 Training loss: 1.3512 Explore P: 0.4375
Episode: 327 Total reward: 61.0 Training loss: 27.6791 Explore P: 0.4349
Episode: 328 Total reward: 74.0 Training loss: 23.3941 Explore P: 0.4318
Episode: 329 Total reward: 32.0 Training loss: 2.7234 Explore P: 0.4304
Episode: 330 Total reward: 109.0 Training loss: 33.1890 Explore P: 0.4259
Episode: 331 Total reward: 106.0 Training loss: 0.9338 Explore P: 0.4215
Episode: 332 Total reward: 70.0 Training loss: 2.8599 Explore P: 0.4186
Episode: 333 Total reward: 36.0 Training loss: 58.1221 Explore P: 0.4172
Episode: 334 Total reward: 97.0 Training loss: 1.0784 Explore P: 0.4132
Episode: 335 Total reward: 56.0 Training loss: 1.1543 Explore P: 0.4110
Episode: 336 Total reward: 72.0 Training loss: 86.9642 Explore P: 0.4081
Episode: 337 Total reward: 35.0 Training loss: 1.4597 Explore P: 0.4067
Episode: 338 Total reward: 59.0 Training loss: 1.8579 Explore P: 0.4044
Episode: 339 Total reward: 113.0 Training loss: 117.4940 Explore P: 0.3999
Episode: 340 Total reward: 134.0 Training loss: 2.2671 Explore P: 0.3948
Episode: 341 Total reward: 101.0 Training loss: 2.5628 Explore P: 0.3909
Episode: 342 Total reward: 17.0 Training loss: 1.3807 Explore P: 0.3902
Episode: 343 Total reward: 68.0 Training loss: 59.0019 Explore P: 0.3877
Episode: 344 Total reward: 31.0 Training loss: 47.7511 Explore P: 0.3865
Episode: 345 Total reward: 107.0 Training loss: 42.8100 Explore P: 0.3825
Episode: 346 Total reward: 83.0 Training loss: 2.5734 Explore P: 0.3794
Episode: 347 Total reward: 41.0 Training loss: 54.9078 Explore P: 0.3779
Episode: 348 Total reward: 57.0 Training loss: 1.5358 Explore P: 0.3758
Episode: 349 Total reward: 120.0 Training loss: 41.1684 Explore P: 0.3714
Episode: 350 Total reward: 101.0 Training loss: 37.5682 Explore P: 0.3678
Episode: 351 Total reward: 59.0 Training loss: 1.7312 Explore P: 0.3657
Episode: 352 Total reward: 110.0 Training loss: 48.6127 Explore P: 0.3618
Episode: 353 Total reward: 82.0 Training loss: 2.0367 Explore P: 0.3589
Episode: 354 Total reward: 164.0 Training loss: 1.2948 Explore P: 0.3533
Episode: 355 Total reward: 25.0 Training loss: 1.8881 Explore P: 0.3524
Episode: 356 Total reward: 90.0 Training loss: 1.0646 Explore P: 0.3493
Episode: 357 Total reward: 44.0 Training loss: 3.3027 Explore P: 0.3479
Episode: 358 Total reward: 43.0 Training loss: 46.8632 Explore P: 0.3464
Episode: 359 Total reward: 39.0 Training loss: 43.1165 Explore P: 0.3451
Episode: 360 Total reward: 65.0 Training loss: 0.6801 Explore P: 0.3429
Episode: 361 Total reward: 34.0 Training loss: 1.4274 Explore P: 0.3418
Episode: 362 Total reward: 34.0 Training loss: 1.1599 Explore P: 0.3407
Episode: 363 Total reward: 42.0 Training loss: 0.9212 Explore P: 0.3393
Episode: 364 Total reward: 39.0 Training loss: 45.8163 Explore P: 0.3380
Episode: 365 Total reward: 85.0 Training loss: 46.1875 Explore P: 0.3352
Episode: 366 Total reward: 69.0 Training loss: 79.6241 Explore P: 0.3330
Episode: 367 Total reward: 122.0 Training loss: 0.8105 Explore P: 0.3291
Episode: 368 Total reward: 105.0 Training loss: 73.0289 Explore P: 0.3257
Episode: 369 Total reward: 32.0 Training loss: 1.9180 Explore P: 0.3247
Episode: 370 Total reward: 52.0 Training loss: 1.4990 Explore P: 0.3231
Episode: 371 Total reward: 42.0 Training loss: 60.5334 Explore P: 0.3218
Episode: 372 Total reward: 47.0 Training loss: 4.3245 Explore P: 0.3203
Episode: 373 Total reward: 108.0 Training loss: 50.0728 Explore P: 0.3170
Episode: 374 Total reward: 77.0 Training loss: 1.2489 Explore P: 0.3146
Episode: 375 Total reward: 61.0 Training loss: 49.4810 Explore P: 0.3128
Episode: 376 Total reward: 103.0 Training loss: 66.4733 Explore P: 0.3097
Episode: 377 Total reward: 64.0 Training loss: 91.2555 Explore P: 0.3078
Episode: 378 Total reward: 121.0 Training loss: 0.9061 Explore P: 0.3042
Episode: 379 Total reward: 98.0 Training loss: 71.8727 Explore P: 0.3013
Episode: 380 Total reward: 39.0 Training loss: 1.9601 Explore P: 0.3002
Episode: 381 Total reward: 199.0 Training loss: 90.7047 Explore P: 0.2945
Episode: 382 Total reward: 47.0 Training loss: 60.7241 Explore P: 0.2931
Episode: 383 Total reward: 38.0 Training loss: 110.3466 Explore P: 0.2921
Episode: 384 Total reward: 108.0 Training loss: 129.0864 Explore P: 0.2890
Episode: 385 Total reward: 60.0 Training loss: 3.4142 Explore P: 0.2874
Episode: 386 Total reward: 94.0 Training loss: 1.4825 Explore P: 0.2848
Episode: 387 Total reward: 74.0 Training loss: 1.9551 Explore P: 0.2827
Episode: 388 Total reward: 47.0 Training loss: 70.1560 Explore P: 0.2815
Episode: 389 Total reward: 102.0 Training loss: 68.7411 Explore P: 0.2787
Episode: 390 Total reward: 63.0 Training loss: 2.5535 Explore P: 0.2770
Episode: 391 Total reward: 105.0 Training loss: 102.4495 Explore P: 0.2742
Episode: 392 Total reward: 164.0 Training loss: 0.8512 Explore P: 0.2699
Episode: 393 Total reward: 100.0 Training loss: 75.0383 Explore P: 0.2673
Episode: 394 Total reward: 28.0 Training loss: 0.9444 Explore P: 0.2666
Episode: 395 Total reward: 39.0 Training loss: 1.1208 Explore P: 0.2656
Episode: 396 Total reward: 33.0 Training loss: 3.2740 Explore P: 0.2648
Episode: 397 Total reward: 102.0 Training loss: 1.4851 Explore P: 0.2622
Episode: 398 Total reward: 56.0 Training loss: 1.4963 Explore P: 0.2608
Episode: 399 Total reward: 39.0 Training loss: 2.4116 Explore P: 0.2598
Episode: 400 Total reward: 49.0 Training loss: 1.8952 Explore P: 0.2586
Episode: 401 Total reward: 45.0 Training loss: 1.2377 Explore P: 0.2575
Episode: 402 Total reward: 33.0 Training loss: 1.7577 Explore P: 0.2567
Episode: 403 Total reward: 47.0 Training loss: 1.1039 Explore P: 0.2555
Episode: 404 Total reward: 76.0 Training loss: 72.9597 Explore P: 0.2536
Episode: 405 Total reward: 76.0 Training loss: 1.1776 Explore P: 0.2518
Episode: 406 Total reward: 59.0 Training loss: 1.0968 Explore P: 0.2504
Episode: 407 Total reward: 41.0 Training loss: 79.5475 Explore P: 0.2494
Episode: 408 Total reward: 122.0 Training loss: 112.0755 Explore P: 0.2465
Episode: 409 Total reward: 41.0 Training loss: 1.1850 Explore P: 0.2455
Episode: 410 Total reward: 36.0 Training loss: 0.8836 Explore P: 0.2447
Episode: 411 Total reward: 32.0 Training loss: 221.1886 Explore P: 0.2439
Episode: 412 Total reward: 62.0 Training loss: 1.3029 Explore P: 0.2425
Episode: 413 Total reward: 36.0 Training loss: 239.2108 Explore P: 0.2416
Episode: 414 Total reward: 95.0 Training loss: 0.9051 Explore P: 0.2395
Episode: 415 Total reward: 128.0 Training loss: 0.8808 Explore P: 0.2365
Episode: 416 Total reward: 38.0 Training loss: 2.2746 Explore P: 0.2357
Episode: 417 Total reward: 24.0 Training loss: 1.0782 Explore P: 0.2351
Episode: 418 Total reward: 64.0 Training loss: 1.2206 Explore P: 0.2337
Episode: 419 Total reward: 56.0 Training loss: 0.8283 Explore P: 0.2325
Episode: 420 Total reward: 55.0 Training loss: 1.5365 Explore P: 0.2312
Episode: 421 Total reward: 45.0 Training loss: 0.8321 Explore P: 0.2302
Episode: 422 Total reward: 137.0 Training loss: 88.7076 Explore P: 0.2272
Episode: 423 Total reward: 68.0 Training loss: 1.0393 Explore P: 0.2258
Episode: 424 Total reward: 47.0 Training loss: 1.7034 Explore P: 0.2248
Episode: 425 Total reward: 41.0 Training loss: 1.5581 Explore P: 0.2239
Episode: 426 Total reward: 48.0 Training loss: 1.6666 Explore P: 0.2229
Episode: 427 Total reward: 29.0 Training loss: 1.7479 Explore P: 0.2222
Episode: 428 Total reward: 37.0 Training loss: 1.2240 Explore P: 0.2215
Episode: 429 Total reward: 49.0 Training loss: 0.9399 Explore P: 0.2204
Episode: 430 Total reward: 65.0 Training loss: 1.4485 Explore P: 0.2191
Episode: 431 Total reward: 88.0 Training loss: 0.6906 Explore P: 0.2172
Episode: 432 Total reward: 38.0 Training loss: 121.4873 Explore P: 0.2164
Episode: 433 Total reward: 84.0 Training loss: 1.7044 Explore P: 0.2147
Episode: 434 Total reward: 94.0 Training loss: 2.0280 Explore P: 0.2128
Episode: 435 Total reward: 25.0 Training loss: 1.7676 Explore P: 0.2123
Episode: 436 Total reward: 36.0 Training loss: 96.5116 Explore P: 0.2116
Episode: 437 Total reward: 24.0 Training loss: 1.1070 Explore P: 0.2111
Episode: 438 Total reward: 81.0 Training loss: 84.7248 Explore P: 0.2095
Episode: 439 Total reward: 34.0 Training loss: 86.9960 Explore P: 0.2088
Episode: 440 Total reward: 38.0 Training loss: 1.2504 Explore P: 0.2080
Episode: 441 Total reward: 39.0 Training loss: 114.2940 Explore P: 0.2073
Episode: 442 Total reward: 36.0 Training loss: 93.4846 Explore P: 0.2065
Episode: 443 Total reward: 37.0 Training loss: 0.8231 Explore P: 0.2058
Episode: 444 Total reward: 41.0 Training loss: 96.0007 Explore P: 0.2050
Episode: 445 Total reward: 29.0 Training loss: 1.2125 Explore P: 0.2045
Episode: 446 Total reward: 27.0 Training loss: 1.4664 Explore P: 0.2039
Episode: 447 Total reward: 64.0 Training loss: 0.3769 Explore P: 0.2027
Episode: 448 Total reward: 82.0 Training loss: 118.4478 Explore P: 0.2011
Episode: 449 Total reward: 91.0 Training loss: 90.6581 Explore P: 0.1994
Episode: 450 Total reward: 26.0 Training loss: 1.4209 Explore P: 0.1989
Episode: 451 Total reward: 20.0 Training loss: 84.2197 Explore P: 0.1985
Episode: 452 Total reward: 51.0 Training loss: 1.6248 Explore P: 0.1976
Episode: 453 Total reward: 66.0 Training loss: 1.1580 Explore P: 0.1963
Episode: 454 Total reward: 46.0 Training loss: 1.1933 Explore P: 0.1955
Episode: 455 Total reward: 53.0 Training loss: 217.3170 Explore P: 0.1945
Episode: 456 Total reward: 42.0 Training loss: 0.6875 Explore P: 0.1937
Episode: 457 Total reward: 33.0 Training loss: 1.1355 Explore P: 0.1931
Episode: 458 Total reward: 30.0 Training loss: 0.9700 Explore P: 0.1926
Episode: 459 Total reward: 72.0 Training loss: 1.2627 Explore P: 0.1913
Episode: 460 Total reward: 93.0 Training loss: 81.8973 Explore P: 0.1896
Episode: 461 Total reward: 93.0 Training loss: 1.0352 Explore P: 0.1879
Episode: 462 Total reward: 53.0 Training loss: 83.2643 Explore P: 0.1870
Episode: 463 Total reward: 65.0 Training loss: 1.3399 Explore P: 0.1858
Episode: 464 Total reward: 27.0 Training loss: 116.5890 Explore P: 0.1854
Episode: 465 Total reward: 39.0 Training loss: 1.0993 Explore P: 0.1847
Episode: 466 Total reward: 102.0 Training loss: 0.7972 Explore P: 0.1829
Episode: 467 Total reward: 78.0 Training loss: 1.1545 Explore P: 0.1816
Episode: 468 Total reward: 99.0 Training loss: 1.6212 Explore P: 0.1799
Episode: 469 Total reward: 71.0 Training loss: 204.9589 Explore P: 0.1787
Episode: 470 Total reward: 44.0 Training loss: 1.1193 Explore P: 0.1779
Episode: 471 Total reward: 69.0 Training loss: 1.5359 Explore P: 0.1768
Episode: 472 Total reward: 49.0 Training loss: 75.9020 Explore P: 0.1760
Episode: 473 Total reward: 78.0 Training loss: 1.4995 Explore P: 0.1747
Episode: 474 Total reward: 65.0 Training loss: 1.5727 Explore P: 0.1736
Episode: 475 Total reward: 118.0 Training loss: 1.0938 Explore P: 0.1717
Episode: 476 Total reward: 49.0 Training loss: 1.0811 Explore P: 0.1709
Episode: 477 Total reward: 91.0 Training loss: 101.7168 Explore P: 0.1694
Episode: 478 Total reward: 130.0 Training loss: 1.0240 Explore P: 0.1674
Episode: 479 Total reward: 140.0 Training loss: 1.4061 Explore P: 0.1652
Episode: 480 Total reward: 63.0 Training loss: 0.7899 Explore P: 0.1642
Episode: 481 Total reward: 43.0 Training loss: 0.9541 Explore P: 0.1635
Episode: 482 Total reward: 64.0 Training loss: 0.7658 Explore P: 0.1626
Episode: 483 Total reward: 53.0 Training loss: 1.3396 Explore P: 0.1618
Episode: 484 Total reward: 100.0 Training loss: 1.1020 Explore P: 0.1603
Episode: 485 Total reward: 65.0 Training loss: 0.5010 Explore P: 0.1593
Episode: 486 Total reward: 41.0 Training loss: 1.4329 Explore P: 0.1587
Episode: 487 Total reward: 59.0 Training loss: 1.0672 Explore P: 0.1578
Episode: 488 Total reward: 67.0 Training loss: 1.5193 Explore P: 0.1568
Episode: 489 Total reward: 36.0 Training loss: 90.3867 Explore P: 0.1563
Episode: 490 Total reward: 55.0 Training loss: 78.2686 Explore P: 0.1555
Episode: 491 Total reward: 115.0 Training loss: 1.2358 Explore P: 0.1538
Episode: 492 Total reward: 47.0 Training loss: 1.2595 Explore P: 0.1531
Episode: 493 Total reward: 78.0 Training loss: 1.4214 Explore P: 0.1520
Episode: 494 Total reward: 138.0 Training loss: 103.7686 Explore P: 0.1501
Episode: 495 Total reward: 34.0 Training loss: 1.5926 Explore P: 0.1496
Episode: 496 Total reward: 37.0 Training loss: 1.1359 Explore P: 0.1491
Episode: 497 Total reward: 62.0 Training loss: 0.7420 Explore P: 0.1482
Episode: 498 Total reward: 51.0 Training loss: 89.8885 Explore P: 0.1475
Episode: 499 Total reward: 56.0 Training loss: 1.2036 Explore P: 0.1468
Episode: 500 Total reward: 42.0 Training loss: 1.2775 Explore P: 0.1462
Episode: 501 Total reward: 48.0 Training loss: 1.2581 Explore P: 0.1455
Episode: 502 Total reward: 90.0 Training loss: 88.1740 Explore P: 0.1443
Episode: 503 Total reward: 46.0 Training loss: 1.3415 Explore P: 0.1437
Episode: 504 Total reward: 65.0 Training loss: 1.0906 Explore P: 0.1428
Episode: 505 Total reward: 109.0 Training loss: 0.8742 Explore P: 0.1414
Episode: 506 Total reward: 167.0 Training loss: 1.4142 Explore P: 0.1392
Episode: 507 Total reward: 63.0 Training loss: 1.1321 Explore P: 0.1384
Episode: 508 Total reward: 82.0 Training loss: 94.4103 Explore P: 0.1374
Episode: 509 Total reward: 59.0 Training loss: 1.1149 Explore P: 0.1366
Episode: 510 Total reward: 94.0 Training loss: 1.2872 Explore P: 0.1354
Episode: 511 Total reward: 108.0 Training loss: 88.8268 Explore P: 0.1341
Episode: 512 Total reward: 127.0 Training loss: 0.7153 Explore P: 0.1325
Episode: 513 Total reward: 48.0 Training loss: 76.1004 Explore P: 0.1319
Episode: 514 Total reward: 82.0 Training loss: 175.5750 Explore P: 0.1309
Episode: 515 Total reward: 124.0 Training loss: 1.3187 Explore P: 0.1294
Episode: 516 Total reward: 199.0 Training loss: 1.3515 Explore P: 0.1271
Episode: 517 Total reward: 87.0 Training loss: 80.6286 Explore P: 0.1261
Episode: 518 Total reward: 79.0 Training loss: 71.9390 Explore P: 0.1252
Episode: 519 Total reward: 59.0 Training loss: 1.1095 Explore P: 0.1245
Episode: 520 Total reward: 130.0 Training loss: 88.5748 Explore P: 0.1230
Episode: 521 Total reward: 68.0 Training loss: 0.9570 Explore P: 0.1222
Episode: 522 Total reward: 71.0 Training loss: 76.0623 Explore P: 0.1214
Episode: 523 Total reward: 66.0 Training loss: 0.9743 Explore P: 0.1207
Episode: 524 Total reward: 76.0 Training loss: 0.8132 Explore P: 0.1199
Episode: 525 Total reward: 58.0 Training loss: 158.9388 Explore P: 0.1192
Episode: 526 Total reward: 60.0 Training loss: 1.6034 Explore P: 0.1186
Episode: 527 Total reward: 107.0 Training loss: 0.8943 Explore P: 0.1174
Episode: 528 Total reward: 105.0 Training loss: 1.1433 Explore P: 0.1163
Episode: 529 Total reward: 67.0 Training loss: 75.0756 Explore P: 0.1156
Episode: 530 Total reward: 91.0 Training loss: 1.0304 Explore P: 0.1146
Episode: 531 Total reward: 72.0 Training loss: 1.4094 Explore P: 0.1139
Episode: 532 Total reward: 57.0 Training loss: 0.7798 Explore P: 0.1133
Episode: 533 Total reward: 58.0 Training loss: 0.8673 Explore P: 0.1127
Episode: 534 Total reward: 85.0 Training loss: 83.4735 Explore P: 0.1118
Episode: 535 Total reward: 79.0 Training loss: 0.7569 Explore P: 0.1110
Episode: 536 Total reward: 56.0 Training loss: 1.1706 Explore P: 0.1105
Episode: 537 Total reward: 74.0 Training loss: 0.5680 Explore P: 0.1097
Episode: 538 Total reward: 46.0 Training loss: 0.9129 Explore P: 0.1093
Episode: 539 Total reward: 74.0 Training loss: 0.9525 Explore P: 0.1085
Episode: 540 Total reward: 68.0 Training loss: 0.7270 Explore P: 0.1079
Episode: 541 Total reward: 50.0 Training loss: 1.4756 Explore P: 0.1074
Episode: 542 Total reward: 87.0 Training loss: 1.2475 Explore P: 0.1065
Episode: 543 Total reward: 62.0 Training loss: 1.0688 Explore P: 0.1059
Episode: 544 Total reward: 108.0 Training loss: 79.0646 Explore P: 0.1049
Episode: 545 Total reward: 79.0 Training loss: 160.5605 Explore P: 0.1042
Episode: 546 Total reward: 68.0 Training loss: 0.7907 Explore P: 0.1035
Episode: 547 Total reward: 81.0 Training loss: 0.7596 Explore P: 0.1028
Episode: 548 Total reward: 68.0 Training loss: 1.2010 Explore P: 0.1021
Episode: 549 Total reward: 54.0 Training loss: 84.2177 Explore P: 0.1016
Episode: 550 Total reward: 58.0 Training loss: 1.0261 Explore P: 0.1011
Episode: 551 Total reward: 58.0 Training loss: 0.9086 Explore P: 0.1006
Episode: 552 Total reward: 62.0 Training loss: 1.1990 Explore P: 0.1000
Episode: 553 Total reward: 50.0 Training loss: 0.7939 Explore P: 0.0996
Episode: 554 Total reward: 118.0 Training loss: 1.0654 Explore P: 0.0985
Episode: 555 Total reward: 86.0 Training loss: 87.3125 Explore P: 0.0978
Episode: 556 Total reward: 58.0 Training loss: 0.7458 Explore P: 0.0973
Episode: 557 Total reward: 90.0 Training loss: 0.6483 Explore P: 0.0965
Episode: 558 Total reward: 61.0 Training loss: 137.0485 Explore P: 0.0960
Episode: 559 Total reward: 61.0 Training loss: 42.6581 Explore P: 0.0954
Episode: 560 Total reward: 53.0 Training loss: 0.7530 Explore P: 0.0950
Episode: 561 Total reward: 51.0 Training loss: 44.6679 Explore P: 0.0945
Episode: 562 Total reward: 125.0 Training loss: 0.6483 Explore P: 0.0935
Episode: 563 Total reward: 54.0 Training loss: 0.9030 Explore P: 0.0930
Episode: 564 Total reward: 50.0 Training loss: 61.8452 Explore P: 0.0926
Episode: 565 Total reward: 46.0 Training loss: 1.4357 Explore P: 0.0923
Episode: 566 Total reward: 65.0 Training loss: 0.6996 Explore P: 0.0917
Episode: 567 Total reward: 91.0 Training loss: 0.6653 Explore P: 0.0910
Episode: 568 Total reward: 75.0 Training loss: 42.6931 Explore P: 0.0904
Episode: 569 Total reward: 82.0 Training loss: 0.7096 Explore P: 0.0897
Episode: 570 Total reward: 92.0 Training loss: 0.4017 Explore P: 0.0890
Episode: 571 Total reward: 74.0 Training loss: 35.3268 Explore P: 0.0884
Episode: 572 Total reward: 52.0 Training loss: 30.5768 Explore P: 0.0880
Episode: 573 Total reward: 70.0 Training loss: 0.9155 Explore P: 0.0875
Episode: 574 Total reward: 59.0 Training loss: 0.6464 Explore P: 0.0870
Episode: 575 Total reward: 66.0 Training loss: 1.2132 Explore P: 0.0865
Episode: 576 Total reward: 54.0 Training loss: 0.7574 Explore P: 0.0861
Episode: 577 Total reward: 72.0 Training loss: 0.8174 Explore P: 0.0855
Episode: 578 Total reward: 63.0 Training loss: 0.5448 Explore P: 0.0851
Episode: 579 Total reward: 85.0 Training loss: 0.7021 Explore P: 0.0844
Episode: 580 Total reward: 42.0 Training loss: 56.2694 Explore P: 0.0841
Episode: 581 Total reward: 49.0 Training loss: 0.9346 Explore P: 0.0838
Episode: 582 Total reward: 47.0 Training loss: 1.2287 Explore P: 0.0834
Episode: 583 Total reward: 54.0 Training loss: 0.7222 Explore P: 0.0830
Episode: 584 Total reward: 46.0 Training loss: 44.6829 Explore P: 0.0827
Episode: 585 Total reward: 42.0 Training loss: 1.2839 Explore P: 0.0824
Episode: 586 Total reward: 62.0 Training loss: 1.4986 Explore P: 0.0819
Episode: 587 Total reward: 56.0 Training loss: 0.9280 Explore P: 0.0815
Episode: 588 Total reward: 52.0 Training loss: 0.7570 Explore P: 0.0812
Episode: 589 Total reward: 43.0 Training loss: 24.8082 Explore P: 0.0808
Episode: 590 Total reward: 49.0 Training loss: 0.5940 Explore P: 0.0805
Episode: 591 Total reward: 36.0 Training loss: 0.3953 Explore P: 0.0802
Episode: 592 Total reward: 63.0 Training loss: 0.5097 Explore P: 0.0798
Episode: 593 Total reward: 57.0 Training loss: 16.4260 Explore P: 0.0794
Episode: 594 Total reward: 88.0 Training loss: 0.5289 Explore P: 0.0788
Episode: 595 Total reward: 46.0 Training loss: 0.5093 Explore P: 0.0785
Episode: 596 Total reward: 69.0 Training loss: 0.2282 Explore P: 0.0780
Episode: 597 Total reward: 70.0 Training loss: 0.6690 Explore P: 0.0775
Episode: 598 Total reward: 58.0 Training loss: 29.5678 Explore P: 0.0771
Episode: 599 Total reward: 60.0 Training loss: 0.6366 Explore P: 0.0767
Episode: 600 Total reward: 58.0 Training loss: 0.2295 Explore P: 0.0764
Episode: 601 Total reward: 53.0 Training loss: 0.6223 Explore P: 0.0760
Episode: 602 Total reward: 56.0 Training loss: 0.7677 Explore P: 0.0756
Episode: 603 Total reward: 61.0 Training loss: 0.4547 Explore P: 0.0752
Episode: 604 Total reward: 84.0 Training loss: 0.3241 Explore P: 0.0747
Episode: 605 Total reward: 75.0 Training loss: 0.2989 Explore P: 0.0742
Episode: 606 Total reward: 77.0 Training loss: 0.8075 Explore P: 0.0737
Episode: 607 Total reward: 65.0 Training loss: 0.3684 Explore P: 0.0733
Episode: 608 Total reward: 66.0 Training loss: 0.7200 Explore P: 0.0729
Episode: 609 Total reward: 67.0 Training loss: 35.1783 Explore P: 0.0725
Episode: 610 Total reward: 86.0 Training loss: 0.3041 Explore P: 0.0719
Episode: 611 Total reward: 83.0 Training loss: 16.9401 Explore P: 0.0714
Episode: 612 Total reward: 64.0 Training loss: 0.7052 Explore P: 0.0710
Episode: 613 Total reward: 59.0 Training loss: 0.6551 Explore P: 0.0707
Episode: 614 Total reward: 77.0 Training loss: 0.2359 Explore P: 0.0702
Episode: 615 Total reward: 64.0 Training loss: 0.2788 Explore P: 0.0698
Episode: 616 Total reward: 57.0 Training loss: 0.4572 Explore P: 0.0695
Episode: 617 Total reward: 64.0 Training loss: 0.5470 Explore P: 0.0691
Episode: 618 Total reward: 70.0 Training loss: 0.2852 Explore P: 0.0687
Episode: 619 Total reward: 68.0 Training loss: 0.1418 Explore P: 0.0683
Episode: 620 Total reward: 45.0 Training loss: 0.4121 Explore P: 0.0680
Episode: 621 Total reward: 70.0 Training loss: 0.3163 Explore P: 0.0676
Episode: 622 Total reward: 95.0 Training loss: 0.4587 Explore P: 0.0671
Episode: 623 Total reward: 60.0 Training loss: 0.4669 Explore P: 0.0667
Episode: 624 Total reward: 72.0 Training loss: 0.2171 Explore P: 0.0663
Episode: 625 Total reward: 73.0 Training loss: 0.1108 Explore P: 0.0659
Episode: 626 Total reward: 98.0 Training loss: 0.2140 Explore P: 0.0654
Episode: 627 Total reward: 101.0 Training loss: 0.5723 Explore P: 0.0648
Episode: 628 Total reward: 91.0 Training loss: 0.5172 Explore P: 0.0643
Episode: 629 Total reward: 58.0 Training loss: 0.3005 Explore P: 0.0640
Episode: 630 Total reward: 71.0 Training loss: 0.2350 Explore P: 0.0636
Episode: 631 Total reward: 70.0 Training loss: 2.3665 Explore P: 0.0633
Episode: 632 Total reward: 75.0 Training loss: 0.6415 Explore P: 0.0629
Episode: 633 Total reward: 72.0 Training loss: 1.3732 Explore P: 0.0625
Episode: 634 Total reward: 77.0 Training loss: 0.1868 Explore P: 0.0621
Episode: 635 Total reward: 53.0 Training loss: 0.9704 Explore P: 0.0618
Episode: 636 Total reward: 91.0 Training loss: 0.1241 Explore P: 0.0613
Episode: 637 Total reward: 72.0 Training loss: 0.4037 Explore P: 0.0610
Episode: 638 Total reward: 63.0 Training loss: 0.4666 Explore P: 0.0606
Episode: 639 Total reward: 66.0 Training loss: 0.2158 Explore P: 0.0603
Episode: 640 Total reward: 60.0 Training loss: 2.2535 Explore P: 0.0600
Episode: 641 Total reward: 73.0 Training loss: 0.3936 Explore P: 0.0596
Episode: 642 Total reward: 93.0 Training loss: 0.2128 Explore P: 0.0592
Episode: 643 Total reward: 75.0 Training loss: 0.4078 Explore P: 0.0588
Episode: 644 Total reward: 95.0 Training loss: 0.7933 Explore P: 0.0584
Episode: 645 Total reward: 71.0 Training loss: 0.8789 Explore P: 0.0580
Episode: 646 Total reward: 100.0 Training loss: 0.3585 Explore P: 0.0575
Episode: 647 Total reward: 92.0 Training loss: 0.3074 Explore P: 0.0571
Episode: 648 Total reward: 101.0 Training loss: 1.2649 Explore P: 0.0566
Episode: 649 Total reward: 83.0 Training loss: 1.1499 Explore P: 0.0562
Episode: 650 Total reward: 99.0 Training loss: 0.7524 Explore P: 0.0558
Episode: 651 Total reward: 76.0 Training loss: 0.4539 Explore P: 0.0554
Episode: 652 Total reward: 96.0 Training loss: 0.3449 Explore P: 0.0550
Episode: 653 Total reward: 141.0 Training loss: 0.5358 Explore P: 0.0544
Episode: 654 Total reward: 103.0 Training loss: 0.4092 Explore P: 0.0539
Episode: 655 Total reward: 96.0 Training loss: 0.3867 Explore P: 0.0535
Episode: 656 Total reward: 123.0 Training loss: 0.5350 Explore P: 0.0530
Episode: 657 Total reward: 121.0 Training loss: 0.2361 Explore P: 0.0525
Episode: 658 Total reward: 120.0 Training loss: 3.2580 Explore P: 0.0519
Episode: 659 Total reward: 199.0 Training loss: 5.1439 Explore P: 0.0511
Episode: 660 Total reward: 143.0 Training loss: 0.5763 Explore P: 0.0505
Episode: 661 Total reward: 121.0 Training loss: 0.3904 Explore P: 0.0500
Episode: 662 Total reward: 199.0 Training loss: 0.2347 Explore P: 0.0493
Episode: 663 Total reward: 199.0 Training loss: 3.8113 Explore P: 0.0485
Episode: 664 Total reward: 199.0 Training loss: 0.4026 Explore P: 0.0477
Episode: 665 Total reward: 199.0 Training loss: 0.2070 Explore P: 0.0470
Episode: 666 Total reward: 199.0 Training loss: 0.1700 Explore P: 0.0463
Episode: 667 Total reward: 199.0 Training loss: 0.1960 Explore P: 0.0455
Episode: 668 Total reward: 199.0 Training loss: 0.3476 Explore P: 0.0448
Episode: 669 Total reward: 199.0 Training loss: 0.3101 Explore P: 0.0442
Episode: 670 Total reward: 199.0 Training loss: 0.3654 Explore P: 0.0435
Episode: 671 Total reward: 199.0 Training loss: 0.7060 Explore P: 0.0428
Episode: 672 Total reward: 199.0 Training loss: 0.4129 Explore P: 0.0422
Episode: 673 Total reward: 153.0 Training loss: 0.8130 Explore P: 0.0417
Episode: 674 Total reward: 153.0 Training loss: 0.3761 Explore P: 0.0412
Episode: 675 Total reward: 42.0 Training loss: 19.8355 Explore P: 0.0411
Episode: 676 Total reward: 36.0 Training loss: 0.4310 Explore P: 0.0410
Episode: 677 Total reward: 38.0 Training loss: 1.1418 Explore P: 0.0408
Episode: 678 Total reward: 45.0 Training loss: 14.2228 Explore P: 0.0407
Episode: 679 Total reward: 49.0 Training loss: 0.8090 Explore P: 0.0406
Episode: 680 Total reward: 107.0 Training loss: 19.9597 Explore P: 0.0402
Episode: 681 Total reward: 64.0 Training loss: 0.7199 Explore P: 0.0400
Episode: 682 Total reward: 37.0 Training loss: 1.3272 Explore P: 0.0399
Episode: 683 Total reward: 49.0 Training loss: 0.6192 Explore P: 0.0398
Episode: 684 Total reward: 79.0 Training loss: 0.5696 Explore P: 0.0395
Episode: 685 Total reward: 58.0 Training loss: 0.9370 Explore P: 0.0394
Episode: 686 Total reward: 47.0 Training loss: 24.3478 Explore P: 0.0392
Episode: 687 Total reward: 198.0 Training loss: 0.4061 Explore P: 0.0387
Episode: 688 Total reward: 199.0 Training loss: 35.2389 Explore P: 0.0381
Episode: 689 Total reward: 199.0 Training loss: 0.7129 Explore P: 0.0375
Episode: 690 Total reward: 199.0 Training loss: 0.8652 Explore P: 0.0370
Episode: 691 Total reward: 199.0 Training loss: 0.5725 Explore P: 0.0365
Episode: 692 Total reward: 124.0 Training loss: 0.5754 Explore P: 0.0361
Episode: 693 Total reward: 199.0 Training loss: 0.6673 Explore P: 0.0356
Episode: 694 Total reward: 199.0 Training loss: 19.0373 Explore P: 0.0351
Episode: 695 Total reward: 199.0 Training loss: 0.3965 Explore P: 0.0346
Episode: 696 Total reward: 199.0 Training loss: 0.7319 Explore P: 0.0341
Episode: 697 Total reward: 199.0 Training loss: 0.5650 Explore P: 0.0337
Episode: 698 Total reward: 199.0 Training loss: 0.1735 Explore P: 0.0332
Episode: 699 Total reward: 199.0 Training loss: 0.3710 Explore P: 0.0327
Episode: 700 Total reward: 199.0 Training loss: 0.5317 Explore P: 0.0323
Episode: 701 Total reward: 199.0 Training loss: 0.4157 Explore P: 0.0319
Episode: 702 Total reward: 137.0 Training loss: 0.5292 Explore P: 0.0316
Episode: 703 Total reward: 199.0 Training loss: 0.2910 Explore P: 0.0311
Episode: 704 Total reward: 199.0 Training loss: 0.2955 Explore P: 0.0307
Episode: 705 Total reward: 117.0 Training loss: 0.9389 Explore P: 0.0305
Episode: 706 Total reward: 199.0 Training loss: 0.4015 Explore P: 0.0301
Episode: 707 Total reward: 199.0 Training loss: 0.6368 Explore P: 0.0297
Episode: 708 Total reward: 199.0 Training loss: 0.5902 Explore P: 0.0293
Episode: 709 Total reward: 81.0 Training loss: 0.3974 Explore P: 0.0291
Episode: 710 Total reward: 199.0 Training loss: 0.6268 Explore P: 0.0288
Episode: 711 Total reward: 64.0 Training loss: 0.4892 Explore P: 0.0286
Episode: 712 Total reward: 199.0 Training loss: 0.4754 Explore P: 0.0283
Episode: 713 Total reward: 199.0 Training loss: 0.2861 Explore P: 0.0279
Episode: 714 Total reward: 80.0 Training loss: 0.5548 Explore P: 0.0278
Episode: 715 Total reward: 199.0 Training loss: 68.3757 Explore P: 0.0274
Episode: 716 Total reward: 199.0 Training loss: 0.5314 Explore P: 0.0271
Episode: 717 Total reward: 64.0 Training loss: 0.6069 Explore P: 0.0270
Episode: 718 Total reward: 43.0 Training loss: 215.1456 Explore P: 0.0269
Episode: 719 Total reward: 87.0 Training loss: 0.3376 Explore P: 0.0267
Episode: 720 Total reward: 69.0 Training loss: 0.3028 Explore P: 0.0266
Episode: 721 Total reward: 104.0 Training loss: 0.3608 Explore P: 0.0265
Episode: 722 Total reward: 199.0 Training loss: 0.5214 Explore P: 0.0261
Episode: 723 Total reward: 79.0 Training loss: 0.3346 Explore P: 0.0260
Episode: 724 Total reward: 125.0 Training loss: 0.4497 Explore P: 0.0258
Episode: 725 Total reward: 199.0 Training loss: 189.9845 Explore P: 0.0255
Episode: 726 Total reward: 73.0 Training loss: 0.2415 Explore P: 0.0254
Episode: 727 Total reward: 199.0 Training loss: 0.5178 Explore P: 0.0251
Episode: 728 Total reward: 199.0 Training loss: 0.2903 Explore P: 0.0248
Episode: 729 Total reward: 199.0 Training loss: 0.1599 Explore P: 0.0245
Episode: 730 Total reward: 72.0 Training loss: 205.8992 Explore P: 0.0244
Episode: 731 Total reward: 199.0 Training loss: 0.2001 Explore P: 0.0241
Episode: 732 Total reward: 199.0 Training loss: 0.4936 Explore P: 0.0238
Episode: 733 Total reward: 199.0 Training loss: 0.2772 Explore P: 0.0236
Episode: 734 Total reward: 160.0 Training loss: 0.1392 Explore P: 0.0233
Episode: 735 Total reward: 199.0 Training loss: 0.1926 Explore P: 0.0231
Episode: 736 Total reward: 199.0 Training loss: 0.5966 Explore P: 0.0228
Episode: 737 Total reward: 199.0 Training loss: 0.3929 Explore P: 0.0226
Episode: 738 Total reward: 93.0 Training loss: 0.2152 Explore P: 0.0225
Episode: 739 Total reward: 199.0 Training loss: 0.3395 Explore P: 0.0222
Episode: 740 Total reward: 199.0 Training loss: 0.3098 Explore P: 0.0220
Episode: 741 Total reward: 199.0 Training loss: 231.0202 Explore P: 0.0217
Episode: 742 Total reward: 133.0 Training loss: 0.2346 Explore P: 0.0216
Episode: 743 Total reward: 199.0 Training loss: 0.2720 Explore P: 0.0213
Episode: 744 Total reward: 199.0 Training loss: 87.9550 Explore P: 0.0211
Episode: 745 Total reward: 199.0 Training loss: 0.3560 Explore P: 0.0209
Episode: 746 Total reward: 199.0 Training loss: 0.2627 Explore P: 0.0207
Episode: 747 Total reward: 199.0 Training loss: 0.1493 Explore P: 0.0205
Episode: 748 Total reward: 199.0 Training loss: 0.4832 Explore P: 0.0203
Episode: 749 Total reward: 90.0 Training loss: 0.1771 Explore P: 0.0202
Episode: 750 Total reward: 199.0 Training loss: 0.1688 Explore P: 0.0200
Episode: 751 Total reward: 112.0 Training loss: 0.2044 Explore P: 0.0199
Episode: 752 Total reward: 199.0 Training loss: 0.4386 Explore P: 0.0197
Episode: 753 Total reward: 199.0 Training loss: 0.2507 Explore P: 0.0195
Episode: 754 Total reward: 199.0 Training loss: 0.1701 Explore P: 0.0193
Episode: 755 Total reward: 199.0 Training loss: 0.3013 Explore P: 0.0191
Episode: 756 Total reward: 199.0 Training loss: 0.3039 Explore P: 0.0189
Episode: 757 Total reward: 199.0 Training loss: 0.1834 Explore P: 0.0188
Episode: 758 Total reward: 199.0 Training loss: 0.3964 Explore P: 0.0186
Episode: 759 Total reward: 199.0 Training loss: 0.2102 Explore P: 0.0184
Episode: 760 Total reward: 199.0 Training loss: 0.1633 Explore P: 0.0183
Episode: 761 Total reward: 199.0 Training loss: 0.1714 Explore P: 0.0181
Episode: 762 Total reward: 199.0 Training loss: 0.1735 Explore P: 0.0179
Episode: 763 Total reward: 199.0 Training loss: 0.1805 Explore P: 0.0178
Episode: 764 Total reward: 199.0 Training loss: 0.2075 Explore P: 0.0176
Episode: 765 Total reward: 199.0 Training loss: 0.2268 Explore P: 0.0175
Episode: 766 Total reward: 199.0 Training loss: 131.0836 Explore P: 0.0173
Episode: 767 Total reward: 199.0 Training loss: 0.1692 Explore P: 0.0172
Episode: 768 Total reward: 199.0 Training loss: 0.2795 Explore P: 0.0170
Episode: 769 Total reward: 188.0 Training loss: 0.2184 Explore P: 0.0169
Episode: 770 Total reward: 199.0 Training loss: 0.1894 Explore P: 0.0168
Episode: 771 Total reward: 199.0 Training loss: 0.1110 Explore P: 0.0166
Episode: 772 Total reward: 199.0 Training loss: 0.3050 Explore P: 0.0165
Episode: 773 Total reward: 199.0 Training loss: 0.1338 Explore P: 0.0164
Episode: 774 Total reward: 105.0 Training loss: 0.3522 Explore P: 0.0163
Episode: 775 Total reward: 111.0 Training loss: 202.1144 Explore P: 0.0162
Episode: 776 Total reward: 112.0 Training loss: 0.1451 Explore P: 0.0162
Episode: 777 Total reward: 197.0 Training loss: 0.1247 Explore P: 0.0161
Episode: 778 Total reward: 116.0 Training loss: 0.1327 Explore P: 0.0160
Episode: 779 Total reward: 108.0 Training loss: 0.2527 Explore P: 0.0159
Episode: 780 Total reward: 167.0 Training loss: 0.1195 Explore P: 0.0158
Episode: 781 Total reward: 141.0 Training loss: 0.1807 Explore P: 0.0157
Episode: 782 Total reward: 65.0 Training loss: 0.2555 Explore P: 0.0157
Episode: 783 Total reward: 89.0 Training loss: 136.0200 Explore P: 0.0156
Episode: 784 Total reward: 139.0 Training loss: 0.3433 Explore P: 0.0156
Episode: 785 Total reward: 84.0 Training loss: 0.2229 Explore P: 0.0155
Episode: 786 Total reward: 125.0 Training loss: 0.2907 Explore P: 0.0155
Episode: 787 Total reward: 88.0 Training loss: 0.3131 Explore P: 0.0154
Episode: 788 Total reward: 76.0 Training loss: 0.5696 Explore P: 0.0154
Episode: 789 Total reward: 76.0 Training loss: 0.4303 Explore P: 0.0153
Episode: 790 Total reward: 97.0 Training loss: 0.2007 Explore P: 0.0153
Episode: 791 Total reward: 107.0 Training loss: 78.6756 Explore P: 0.0152
Episode: 792 Total reward: 91.0 Training loss: 0.2155 Explore P: 0.0152
Episode: 793 Total reward: 102.0 Training loss: 102.5483 Explore P: 0.0151
Episode: 794 Total reward: 105.0 Training loss: 0.5189 Explore P: 0.0151
Episode: 795 Total reward: 126.0 Training loss: 0.5756 Explore P: 0.0150
Episode: 796 Total reward: 75.0 Training loss: 0.2766 Explore P: 0.0150
Episode: 797 Total reward: 118.0 Training loss: 0.2336 Explore P: 0.0149
Episode: 798 Total reward: 125.0 Training loss: 0.3821 Explore P: 0.0148
Episode: 799 Total reward: 104.0 Training loss: 0.4825 Explore P: 0.0148
Episode: 800 Total reward: 77.0 Training loss: 0.2607 Explore P: 0.0148
Episode: 801 Total reward: 155.0 Training loss: 0.2819 Explore P: 0.0147
Episode: 802 Total reward: 131.0 Training loss: 0.1256 Explore P: 0.0146
Episode: 803 Total reward: 137.0 Training loss: 0.2086 Explore P: 0.0146
Episode: 804 Total reward: 91.0 Training loss: 0.4467 Explore P: 0.0145
Episode: 805 Total reward: 138.0 Training loss: 0.2180 Explore P: 0.0145
Episode: 806 Total reward: 131.0 Training loss: 0.5168 Explore P: 0.0144
Episode: 807 Total reward: 130.0 Training loss: 0.3313 Explore P: 0.0143
Episode: 808 Total reward: 105.0 Training loss: 51.2319 Explore P: 0.0143
Episode: 809 Total reward: 105.0 Training loss: 0.5838 Explore P: 0.0143
Episode: 810 Total reward: 110.0 Training loss: 0.4643 Explore P: 0.0142
Episode: 811 Total reward: 119.0 Training loss: 0.5008 Explore P: 0.0142
Episode: 812 Total reward: 122.0 Training loss: 0.5426 Explore P: 0.0141
Episode: 813 Total reward: 102.0 Training loss: 0.4339 Explore P: 0.0141
Episode: 814 Total reward: 92.0 Training loss: 0.3923 Explore P: 0.0140
Episode: 815 Total reward: 90.0 Training loss: 0.5223 Explore P: 0.0140
Episode: 816 Total reward: 103.0 Training loss: 0.5197 Explore P: 0.0140
Episode: 817 Total reward: 101.0 Training loss: 0.8635 Explore P: 0.0139
Episode: 818 Total reward: 114.0 Training loss: 0.1625 Explore P: 0.0139
Episode: 819 Total reward: 102.0 Training loss: 0.5566 Explore P: 0.0138
Episode: 820 Total reward: 98.0 Training loss: 0.2952 Explore P: 0.0138
Episode: 821 Total reward: 105.0 Training loss: 0.3689 Explore P: 0.0138
Episode: 822 Total reward: 105.0 Training loss: 0.4402 Explore P: 0.0137
Episode: 823 Total reward: 110.0 Training loss: 0.3192 Explore P: 0.0137
Episode: 824 Total reward: 107.0 Training loss: 0.5463 Explore P: 0.0136
Episode: 825 Total reward: 98.0 Training loss: 0.3543 Explore P: 0.0136
Episode: 826 Total reward: 113.0 Training loss: 0.4104 Explore P: 0.0136
Episode: 827 Total reward: 92.0 Training loss: 0.3460 Explore P: 0.0135
Episode: 828 Total reward: 96.0 Training loss: 90.2448 Explore P: 0.0135
Episode: 829 Total reward: 104.0 Training loss: 6.6780 Explore P: 0.0135
Episode: 830 Total reward: 107.0 Training loss: 0.3635 Explore P: 0.0134
Episode: 831 Total reward: 107.0 Training loss: 2.5383 Explore P: 0.0134
Episode: 832 Total reward: 110.0 Training loss: 0.4260 Explore P: 0.0133
Episode: 833 Total reward: 113.0 Training loss: 0.6029 Explore P: 0.0133
Episode: 834 Total reward: 129.0 Training loss: 0.4506 Explore P: 0.0133
Episode: 835 Total reward: 133.0 Training loss: 0.6022 Explore P: 0.0132
Episode: 836 Total reward: 132.0 Training loss: 0.6529 Explore P: 0.0132
Episode: 837 Total reward: 160.0 Training loss: 0.2720 Explore P: 0.0131
Episode: 838 Total reward: 164.0 Training loss: 0.3409 Explore P: 0.0131
Episode: 839 Total reward: 169.0 Training loss: 0.3628 Explore P: 0.0130
Episode: 840 Total reward: 199.0 Training loss: 1.6019 Explore P: 0.0130
Episode: 841 Total reward: 199.0 Training loss: 0.1756 Explore P: 0.0129
Episode: 842 Total reward: 199.0 Training loss: 0.2186 Explore P: 0.0129
Episode: 843 Total reward: 42.0 Training loss: 0.4010 Explore P: 0.0128
Episode: 844 Total reward: 35.0 Training loss: 0.3344 Explore P: 0.0128
Episode: 845 Total reward: 132.0 Training loss: 0.2562 Explore P: 0.0128
Episode: 846 Total reward: 76.0 Training loss: 16.7825 Explore P: 0.0128
Episode: 847 Total reward: 38.0 Training loss: 0.4747 Explore P: 0.0128
Episode: 848 Total reward: 18.0 Training loss: 155.7800 Explore P: 0.0128
Episode: 849 Total reward: 22.0 Training loss: 0.2501 Explore P: 0.0127
Episode: 850 Total reward: 38.0 Training loss: 8.3790 Explore P: 0.0127
Episode: 851 Total reward: 101.0 Training loss: 0.2129 Explore P: 0.0127
Episode: 852 Total reward: 79.0 Training loss: 0.2042 Explore P: 0.0127
Episode: 853 Total reward: 35.0 Training loss: 0.2409 Explore P: 0.0127
Episode: 854 Total reward: 199.0 Training loss: 206.1857 Explore P: 0.0126
Episode: 855 Total reward: 194.0 Training loss: 0.3375 Explore P: 0.0126
Episode: 856 Total reward: 156.0 Training loss: 88.4715 Explore P: 0.0125
Episode: 857 Total reward: 155.0 Training loss: 0.3444 Explore P: 0.0125
Episode: 858 Total reward: 184.0 Training loss: 0.2276 Explore P: 0.0125
Episode: 859 Total reward: 172.0 Training loss: 0.3368 Explore P: 0.0124
Episode: 860 Total reward: 194.0 Training loss: 0.3317 Explore P: 0.0124
Episode: 861 Total reward: 199.0 Training loss: 0.2839 Explore P: 0.0123
Episode: 862 Total reward: 182.0 Training loss: 2.0830 Explore P: 0.0123
Episode: 863 Total reward: 199.0 Training loss: 0.1803 Explore P: 0.0122
Episode: 864 Total reward: 199.0 Training loss: 0.1621 Explore P: 0.0122
Episode: 865 Total reward: 199.0 Training loss: 0.3630 Explore P: 0.0121
Episode: 866 Total reward: 194.0 Training loss: 0.1158 Explore P: 0.0121
Episode: 867 Total reward: 199.0 Training loss: 0.2157 Explore P: 0.0121
Episode: 868 Total reward: 199.0 Training loss: 0.4147 Explore P: 0.0120
Episode: 869 Total reward: 199.0 Training loss: 0.4199 Explore P: 0.0120
Episode: 870 Total reward: 199.0 Training loss: 0.2931 Explore P: 0.0119
Episode: 871 Total reward: 199.0 Training loss: 0.1724 Explore P: 0.0119
Episode: 872 Total reward: 199.0 Training loss: 0.1926 Explore P: 0.0119
Episode: 873 Total reward: 199.0 Training loss: 4.3907 Explore P: 0.0118
Episode: 874 Total reward: 199.0 Training loss: 0.1642 Explore P: 0.0118
Episode: 875 Total reward: 199.0 Training loss: 9.0667 Explore P: 0.0118
Episode: 876 Total reward: 199.0 Training loss: 0.2036 Explore P: 0.0117
Episode: 877 Total reward: 199.0 Training loss: 0.4250 Explore P: 0.0117
Episode: 878 Total reward: 198.0 Training loss: 0.1191 Explore P: 0.0117
Episode: 879 Total reward: 198.0 Training loss: 0.3975 Explore P: 0.0116
Episode: 880 Total reward: 173.0 Training loss: 1.1988 Explore P: 0.0116
Episode: 881 Total reward: 184.0 Training loss: 0.2373 Explore P: 0.0116
Episode: 882 Total reward: 173.0 Training loss: 181.5215 Explore P: 0.0115
Episode: 883 Total reward: 199.0 Training loss: 0.1049 Explore P: 0.0115
Episode: 884 Total reward: 199.0 Training loss: 8.7918 Explore P: 0.0115
Episode: 885 Total reward: 199.0 Training loss: 5.9542 Explore P: 0.0115
Episode: 886 Total reward: 199.0 Training loss: 0.7627 Explore P: 0.0114
Episode: 887 Total reward: 199.0 Training loss: 0.4717 Explore P: 0.0114
Episode: 888 Total reward: 193.0 Training loss: 0.3410 Explore P: 0.0114
Episode: 889 Total reward: 199.0 Training loss: 0.2572 Explore P: 0.0113
Episode: 890 Total reward: 199.0 Training loss: 0.3323 Explore P: 0.0113
Episode: 891 Total reward: 199.0 Training loss: 0.1406 Explore P: 0.0113
Episode: 892 Total reward: 199.0 Training loss: 0.1827 Explore P: 0.0113
Episode: 893 Total reward: 199.0 Training loss: 0.8714 Explore P: 0.0112
Episode: 894 Total reward: 199.0 Training loss: 0.6008 Explore P: 0.0112
Episode: 895 Total reward: 199.0 Training loss: 0.0701 Explore P: 0.0112
Episode: 896 Total reward: 199.0 Training loss: 38.2416 Explore P: 0.0112
Episode: 897 Total reward: 199.0 Training loss: 0.1079 Explore P: 0.0111
Episode: 898 Total reward: 199.0 Training loss: 0.0950 Explore P: 0.0111
Episode: 899 Total reward: 199.0 Training loss: 0.1335 Explore P: 0.0111
Episode: 900 Total reward: 199.0 Training loss: 0.3076 Explore P: 0.0111
Episode: 901 Total reward: 199.0 Training loss: 0.6858 Explore P: 0.0111
Episode: 902 Total reward: 199.0 Training loss: 0.0935 Explore P: 0.0110
Episode: 903 Total reward: 199.0 Training loss: 0.1523 Explore P: 0.0110
Episode: 904 Total reward: 199.0 Training loss: 0.0510 Explore P: 0.0110
Episode: 905 Total reward: 199.0 Training loss: 0.0507 Explore P: 0.0110
Episode: 906 Total reward: 199.0 Training loss: 0.0367 Explore P: 0.0110
Episode: 907 Total reward: 199.0 Training loss: 0.0799 Explore P: 0.0109
Episode: 908 Total reward: 199.0 Training loss: 0.0521 Explore P: 0.0109
Episode: 909 Total reward: 199.0 Training loss: 0.0809 Explore P: 0.0109
Episode: 910 Total reward: 199.0 Training loss: 0.2478 Explore P: 0.0109
Episode: 911 Total reward: 199.0 Training loss: 0.0899 Explore P: 0.0109
Episode: 912 Total reward: 199.0 Training loss: 0.0708 Explore P: 0.0108
Episode: 913 Total reward: 199.0 Training loss: 0.0740 Explore P: 0.0108
Episode: 914 Total reward: 199.0 Training loss: 0.0533 Explore P: 0.0108
Episode: 915 Total reward: 199.0 Training loss: 0.0490 Explore P: 0.0108
Episode: 916 Total reward: 199.0 Training loss: 0.0518 Explore P: 0.0108
Episode: 917 Total reward: 199.0 Training loss: 0.1420 Explore P: 0.0108
Episode: 918 Total reward: 199.0 Training loss: 108.8447 Explore P: 0.0108
Episode: 919 Total reward: 199.0 Training loss: 0.1476 Explore P: 0.0107
Episode: 920 Total reward: 199.0 Training loss: 0.0655 Explore P: 0.0107
Episode: 921 Total reward: 199.0 Training loss: 0.0734 Explore P: 0.0107
Episode: 922 Total reward: 199.0 Training loss: 0.0774 Explore P: 0.0107
Episode: 923 Total reward: 199.0 Training loss: 0.0793 Explore P: 0.0107
Episode: 924 Total reward: 199.0 Training loss: 1.7294 Explore P: 0.0107
Episode: 925 Total reward: 199.0 Training loss: 0.0652 Explore P: 0.0107
Episode: 926 Total reward: 199.0 Training loss: 0.0705 Explore P: 0.0106
Episode: 927 Total reward: 199.0 Training loss: 0.1265 Explore P: 0.0106
Episode: 928 Total reward: 199.0 Training loss: 0.2074 Explore P: 0.0106
Episode: 929 Total reward: 199.0 Training loss: 0.0649 Explore P: 0.0106
Episode: 930 Total reward: 199.0 Training loss: 191.4010 Explore P: 0.0106
Episode: 931 Total reward: 199.0 Training loss: 0.0730 Explore P: 0.0106
Episode: 932 Total reward: 199.0 Training loss: 0.0863 Explore P: 0.0106
Episode: 933 Total reward: 199.0 Training loss: 0.0911 Explore P: 0.0106
Episode: 934 Total reward: 199.0 Training loss: 0.2740 Explore P: 0.0105
Episode: 935 Total reward: 199.0 Training loss: 0.1973 Explore P: 0.0105
Episode: 936 Total reward: 199.0 Training loss: 0.2931 Explore P: 0.0105
Episode: 937 Total reward: 199.0 Training loss: 0.0728 Explore P: 0.0105
Episode: 938 Total reward: 199.0 Training loss: 0.0716 Explore P: 0.0105
Episode: 939 Total reward: 199.0 Training loss: 0.0639 Explore P: 0.0105
Episode: 940 Total reward: 199.0 Training loss: 0.0787 Explore P: 0.0105
Episode: 941 Total reward: 199.0 Training loss: 31.0910 Explore P: 0.0105
Episode: 942 Total reward: 199.0 Training loss: 0.1827 Explore P: 0.0105
Episode: 943 Total reward: 199.0 Training loss: 0.0689 Explore P: 0.0105
Episode: 944 Total reward: 199.0 Training loss: 0.1145 Explore P: 0.0104
Episode: 945 Total reward: 199.0 Training loss: 0.0676 Explore P: 0.0104
Episode: 946 Total reward: 199.0 Training loss: 0.0546 Explore P: 0.0104
Episode: 947 Total reward: 199.0 Training loss: 1.6765 Explore P: 0.0104
Episode: 948 Total reward: 199.0 Training loss: 0.0511 Explore P: 0.0104
Episode: 949 Total reward: 199.0 Training loss: 0.1258 Explore P: 0.0104
Episode: 950 Total reward: 199.0 Training loss: 0.1271 Explore P: 0.0104
Episode: 951 Total reward: 199.0 Training loss: 0.0674 Explore P: 0.0104
Episode: 952 Total reward: 199.0 Training loss: 0.0541 Explore P: 0.0104
Episode: 953 Total reward: 199.0 Training loss: 0.0647 Explore P: 0.0104
Episode: 954 Total reward: 199.0 Training loss: 0.0769 Explore P: 0.0104
Episode: 955 Total reward: 199.0 Training loss: 102.1594 Explore P: 0.0104
Episode: 956 Total reward: 199.0 Training loss: 0.0610 Explore P: 0.0104
Episode: 957 Total reward: 199.0 Training loss: 0.0760 Explore P: 0.0103
Episode: 958 Total reward: 199.0 Training loss: 0.0749 Explore P: 0.0103
Episode: 959 Total reward: 199.0 Training loss: 0.0946 Explore P: 0.0103
Episode: 960 Total reward: 199.0 Training loss: 8.9746 Explore P: 0.0103
Episode: 961 Total reward: 199.0 Training loss: 0.0838 Explore P: 0.0103
Episode: 962 Total reward: 199.0 Training loss: 0.0750 Explore P: 0.0103
Episode: 963 Total reward: 199.0 Training loss: 0.0624 Explore P: 0.0103
Episode: 964 Total reward: 199.0 Training loss: 0.0709 Explore P: 0.0103
Episode: 965 Total reward: 199.0 Training loss: 0.1239 Explore P: 0.0103
Episode: 966 Total reward: 199.0 Training loss: 0.0729 Explore P: 0.0103
Episode: 967 Total reward: 199.0 Training loss: 0.0470 Explore P: 0.0103
Episode: 968 Total reward: 199.0 Training loss: 0.0390 Explore P: 0.0103
Episode: 969 Total reward: 199.0 Training loss: 0.1078 Explore P: 0.0103
Episode: 970 Total reward: 199.0 Training loss: 0.1483 Explore P: 0.0103
Episode: 971 Total reward: 199.0 Training loss: 88.5404 Explore P: 0.0103
Episode: 972 Total reward: 199.0 Training loss: 0.0720 Explore P: 0.0103
Episode: 973 Total reward: 199.0 Training loss: 0.1423 Explore P: 0.0103
Episode: 974 Total reward: 199.0 Training loss: 0.1247 Explore P: 0.0102
Episode: 975 Total reward: 199.0 Training loss: 0.1189 Explore P: 0.0102
Episode: 976 Total reward: 199.0 Training loss: 0.1195 Explore P: 0.0102
Episode: 977 Total reward: 199.0 Training loss: 0.1250 Explore P: 0.0102
Episode: 978 Total reward: 199.0 Training loss: 0.2500 Explore P: 0.0102
Episode: 979 Total reward: 199.0 Training loss: 0.0736 Explore P: 0.0102
Episode: 980 Total reward: 199.0 Training loss: 0.0911 Explore P: 0.0102
Episode: 981 Total reward: 199.0 Training loss: 112.5318 Explore P: 0.0102
Episode: 982 Total reward: 199.0 Training loss: 0.1025 Explore P: 0.0102
Episode: 983 Total reward: 199.0 Training loss: 0.0680 Explore P: 0.0102
Episode: 984 Total reward: 199.0 Training loss: 0.1107 Explore P: 0.0102
Episode: 985 Total reward: 199.0 Training loss: 0.1005 Explore P: 0.0102
Episode: 986 Total reward: 199.0 Training loss: 0.2876 Explore P: 0.0102
Episode: 987 Total reward: 199.0 Training loss: 0.0896 Explore P: 0.0102
Episode: 988 Total reward: 199.0 Training loss: 89.4133 Explore P: 0.0102
Episode: 989 Total reward: 199.0 Training loss: 0.1303 Explore P: 0.0102
Episode: 990 Total reward: 199.0 Training loss: 0.2132 Explore P: 0.0102
Episode: 991 Total reward: 199.0 Training loss: 0.1071 Explore P: 0.0102
Episode: 992 Total reward: 199.0 Training loss: 0.0586 Explore P: 0.0102
Episode: 993 Total reward: 199.0 Training loss: 0.0570 Explore P: 0.0102
Episode: 994 Total reward: 199.0 Training loss: 0.1388 Explore P: 0.0102
Episode: 995 Total reward: 199.0 Training loss: 0.2545 Explore P: 0.0102
Episode: 996 Total reward: 199.0 Training loss: 0.0978 Explore P: 0.0102
Episode: 997 Total reward: 199.0 Training loss: 0.0925 Explore P: 0.0102
Episode: 998 Total reward: 199.0 Training loss: 0.0962 Explore P: 0.0102
Episode: 999 Total reward: 199.0 Training loss: 0.0572 Explore P: 0.0102

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [17]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Rolling average over a window of N values, computed from cumulative sums
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
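
To see what running_mean does, here is a small standalone sketch on made-up numbers (the `rewards` values are invented for illustration, not taken from the training run). Each output is the average of N consecutive inputs, so the result has len(x) - N + 1 entries. That is why the plotting cell uses only the last len(smoothed_rews) episode numbers on the x-axis.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a zero so that cumsum[i] holds the sum of the first i values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Difference of cumulative sums N apart = sum over a window of N values
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up rewards, purely for illustration
rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
smoothed = running_mean(rewards, 3)

print(smoothed)                     # [20. 30. 40.]
print(len(rewards), len(smoothed))  # 5 3
```

The first smoothed value covers inputs 0 to 2, so it lines up with the third episode rather than the first.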

In [18]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[18]:
<matplotlib.text.Text at 0x11a445518>

Testing

Let's check out how our trained agent plays the game.


In [24]:
test_episodes = 10
test_max_steps = 400
# Start from a fresh observation rather than a leftover training state
state = env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes + 1):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


INFO:tensorflow:Restoring parameters from checkpoints/cartpole.ckpt
[2017-06-17 20:03:26,263] Restoring parameters from checkpoints/cartpole.ckpt
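
Notice that the test loop always takes the action with the largest Q-value, while the training log reports an exploration probability (Explore P). Here is a minimal sketch of the difference between the two ways of choosing an action. The helper names `greedy_action` and `epsilon_greedy_action` and the Q-values are made up for illustration; they are not the notebook's own functions.

```python
import numpy as np

def greedy_action(q_values):
    """Pick the action with the largest Q-value from a (1, n_actions) array."""
    return int(np.argmax(q_values))

def epsilon_greedy_action(q_values, explore_p, rng):
    """With probability explore_p pick a random action, otherwise act greedily."""
    if rng.random() < explore_p:
        return int(rng.integers(q_values.shape[-1]))
    return greedy_action(q_values)

# Made-up Q-values for the two Cart-Pole actions (0 and 1)
Qs = np.array([[0.7, 1.3]])
rng = np.random.default_rng(0)

print(greedy_action(Qs))                    # 1
print(epsilon_greedy_action(Qs, 0.0, rng))  # 1, since it never explores
```

With explore_p near 0.01, as at the end of training, roughly one action in a hundred would be random. At test time we drop exploration entirely so we see what the network has actually learned.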

In [25]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. There, instead of a small state vector like the one we're using here, you'd pass the screen images through convolutional layers so the network learns the state from the raw pixels.
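
As a rough sketch of what the input to such a network could look like, the code below turns screen frames into a stacked grayscale state using only NumPy. The names `preprocess_frame` and `FrameStack` are hypothetical, the frame is random noise standing in for a 210x160 RGB Atari screen, and downsampling by skipping pixels is a simplification I've assumed for brevity. The DQN paper itself rescales frames to 84x84 and stacks the last four.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """Convert an (H, W, 3) RGB frame to a half-size grayscale image in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2)  # average the colour channels
    small = gray[::2, ::2]                        # keep every second pixel
    return small / 255.0

class FrameStack:
    """Hold the most recent `size` preprocessed frames as a single state."""
    def __init__(self, size=4):
        self.size = size
        self.frames = deque(maxlen=size)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        processed = preprocess_frame(frame)
        for _ in range(self.size):
            self.frames.append(processed)
        return self.state()

    def push(self, frame):
        # Adding a frame drops the oldest one automatically (deque maxlen)
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Random pixels standing in for a game screen (not a real Atari frame)
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(size=4)
state = stack.reset(frame)
print(state.shape)  # (105, 80, 4)
```

Stacking several frames gives the network a sense of motion, which a single screenshot can't provide. The resulting array is what you'd feed into the first convolutional layer.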

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper (Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015), which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.