Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-04-26 07:15:41,789] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing in an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [4]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state from state $s$ and action $a$.
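
To make the equation concrete, here is a minimal sketch of one Bellman update on a toy Q-table. The table values, the discount factor of 0.9, and the step size alpha are all made up for illustration; they are not taken from the Cart-Pole setup.

```python
import numpy as np

# Hypothetical toy Q-table: 3 states (rows) x 2 actions (columns)
Q = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [0.5, 0.0]])

gamma = 0.9          # discount factor (illustrative value)
alpha = 0.5          # step size for the tabular update (illustrative value)
s, a = 0, 1          # current state and the action taken
r, s_next = 1.0, 1   # observed reward and next state

# Bellman target: reward plus discounted value of the best next action,
# here 1.0 + 0.9 * 2.0
target = r + gamma * np.max(Q[s_next])

# Move Q(s, a) part of the way toward the target
Q[s, a] += alpha * (target - Q[s, a])
print(target, Q[s, a])
```

The max is taken over the actions available in the next state, which is why the row Q[s_next] is reduced rather than the whole table.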

Previously, we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so, ignoring the limits of floating point precision, there are practically infinitely many states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

Now, our Q value, $Q(s, a)$, is calculated by passing a state in to the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [5]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
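
The one-hot trick used to compute self.Q can be checked outside TensorFlow. The sketch below reproduces it in NumPy on made-up network outputs; np.eye stands in for tf.one_hot, and the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical network outputs for a batch of 3 states, 2 actions each
output = np.array([[0.2, 1.5],
                   [0.7, 0.1],
                   [2.0, 3.0]])
actions = np.array([1, 0, 0])   # action taken in each row

# One-hot encode the actions (NumPy stand-in for tf.one_hot)
one_hot_actions = np.eye(2)[actions]

# Multiply and sum along the action axis: only the chosen Q-value survives
Q_selected = np.sum(output * one_hot_actions, axis=1)
print(Q_selected)   # selects 1.5, 0.7 and 2.0

# Same selection with integer indexing, for comparison
Q_indexed = output[np.arange(len(actions)), actions]
```

Both forms pick one value per row. The multiply-and-sum form is convenient in a graph framework because it only uses differentiable element-wise operations.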

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [6]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
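
The tube behaviour of a bounded deque is easy to see with a tiny example. This sketch uses a capacity of 3 and string placeholders instead of real transitions, both chosen only for illustration.

```python
from collections import deque

# A buffer that holds at most 3 items
buffer = deque(maxlen=3)

# Append four items; the oldest one is pushed out when the fourth arrives
for experience in ["e1", "e2", "e3", "e4"]:
    buffer.append(experience)

print(list(buffer))  # ['e2', 'e3', 'e4']
```

One consequence for Memory.sample: because it draws with replace=False, np.random.choice raises a ValueError if batch_size is larger than the number of stored experiences, so the buffer has to hold at least batch_size items before sampling.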

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest value of $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
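
As a rough sketch of the $\epsilon$-greedy rule, the helper below picks an action from a vector of Q-values. The function name epsilon_greedy, the example Q-values, and the use of a seeded NumPy Generator are assumptions made for this illustration, not part of the notebook's training code.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Explore: uniform random action index
        return int(rng.integers(len(q_values)))
    # Exploit: action with the highest estimated Q-value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9])   # hypothetical Q-values for actions 0 and 1

# epsilon = 0 always exploits, epsilon = 1 always explores
greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(5)]
explored = [epsilon_greedy(q_values, 1.0, rng) for _ in range(5)]
print(greedy, explored)
```

Lowering epsilon over the course of training moves the agent smoothly from the second behaviour toward the first.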

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes, where one episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once the agent meets that goal. The game also ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
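
The target-setting step in the list above can be sketched for a whole mini-batch at once. The rewards, next-state Q-values, and the explicit done flags here are made-up values; the flags are an assumption of this sketch, used to mark which transitions ended an episode.

```python
import numpy as np

gamma = 0.99                       # future reward discount

# Hypothetical mini-batch of 3 transitions
rewards = np.array([1.0, 1.0, 1.0])
next_q = np.array([[0.5, 2.0],     # Q(s'_j, a') for each action a'
                   [1.0, 0.3],
                   [4.0, 5.0]])
done = np.array([False, False, True])   # third transition ended its episode

# Terminal transitions get only the reward; the rest add the discounted max
targets = rewards + gamma * np.max(next_q, axis=1) * (~done)
print(targets)

# Squared-error loss against the network's current estimates Q(s_j, a_j)
q_taken = np.array([2.5, 2.0, 1.5])     # hypothetical current estimates
loss = np.mean((targets - q_taken) ** 2)
```

Masking with the done flags has the same effect as zeroing the next-state Q-values for terminal transitions before taking the max.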

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [7]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory
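
To get a feel for the exploration parameters, the sketch below evaluates the exponential decay of the exploration probability at a few step counts. The helper name explore_probability and the chosen step counts are illustrative; the three parameter values are the ones listed above.

```python
import numpy as np

explore_start = 1.0    # exploration probability at start
explore_stop = 0.01    # minimum exploration probability
decay_rate = 0.0001    # exponential decay rate for exploration prob

def explore_probability(step):
    """Exploration probability after a given number of environment steps."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)

# Starts at explore_start and decays toward explore_stop
for step in (0, 1000, 10000, 100000):
    print(step, round(float(explore_probability(step)), 4))
```

With this decay rate the probability falls slowly: after 10,000 steps the agent still explores more than a third of the time, which matters when early episodes last only 10 to 40 steps each.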

In [8]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

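Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This gives the agent some initial exploration of the game, and it ensures the memory holds at least batch_size experiences before the first mini-batch is sampled.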


In [10]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This slows training down, because rendering a frame takes longer than a training step. But it's cool to watch the agent get better at the game.


In [11]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
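    step = 0
    loss = float('nan')  # placeholder so the log line works if an episode ends before the first training step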
    for ep in range(1, train_episodes + 1):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 8.0 Training loss: 1.2774 Explore P: 0.9992
Episode: 2 Total reward: 15.0 Training loss: 1.2257 Explore P: 0.9977
Episode: 3 Total reward: 12.0 Training loss: 1.2553 Explore P: 0.9965
Episode: 4 Total reward: 23.0 Training loss: 1.0768 Explore P: 0.9943
Episode: 5 Total reward: 9.0 Training loss: 1.1275 Explore P: 0.9934
Episode: 6 Total reward: 15.0 Training loss: 1.1430 Explore P: 0.9919
Episode: 7 Total reward: 17.0 Training loss: 1.1278 Explore P: 0.9902
Episode: 8 Total reward: 21.0 Training loss: 1.1081 Explore P: 0.9882
Episode: 9 Total reward: 13.0 Training loss: 1.4072 Explore P: 0.9869
Episode: 10 Total reward: 27.0 Training loss: 1.2015 Explore P: 0.9843
Episode: 11 Total reward: 32.0 Training loss: 1.1633 Explore P: 0.9812
Episode: 12 Total reward: 17.0 Training loss: 1.3382 Explore P: 0.9795
Episode: 13 Total reward: 41.0 Training loss: 1.3452 Explore P: 0.9756
Episode: 14 Total reward: 14.0 Training loss: 1.4981 Explore P: 0.9742
Episode: 15 Total reward: 20.0 Training loss: 1.3688 Explore P: 0.9723
Episode: 16 Total reward: 18.0 Training loss: 1.4376 Explore P: 0.9705
Episode: 17 Total reward: 17.0 Training loss: 1.6667 Explore P: 0.9689
Episode: 18 Total reward: 10.0 Training loss: 1.3889 Explore P: 0.9680
Episode: 19 Total reward: 12.0 Training loss: 1.5541 Explore P: 0.9668
Episode: 20 Total reward: 19.0 Training loss: 1.5675 Explore P: 0.9650
Episode: 21 Total reward: 20.0 Training loss: 1.7522 Explore P: 0.9631
Episode: 22 Total reward: 33.0 Training loss: 1.7299 Explore P: 0.9599
Episode: 23 Total reward: 31.0 Training loss: 2.4385 Explore P: 0.9570
Episode: 24 Total reward: 16.0 Training loss: 1.9277 Explore P: 0.9555
Episode: 25 Total reward: 24.0 Training loss: 2.2370 Explore P: 0.9532
Episode: 26 Total reward: 33.0 Training loss: 1.7675 Explore P: 0.9501
Episode: 27 Total reward: 13.0 Training loss: 3.1016 Explore P: 0.9489
Episode: 28 Total reward: 11.0 Training loss: 2.0692 Explore P: 0.9479
Episode: 29 Total reward: 15.0 Training loss: 2.2176 Explore P: 0.9465
Episode: 30 Total reward: 18.0 Training loss: 2.1326 Explore P: 0.9448
Episode: 31 Total reward: 16.0 Training loss: 2.4098 Explore P: 0.9433
Episode: 32 Total reward: 11.0 Training loss: 2.3873 Explore P: 0.9423
Episode: 33 Total reward: 15.0 Training loss: 3.3219 Explore P: 0.9409
Episode: 34 Total reward: 23.0 Training loss: 3.4802 Explore P: 0.9387
Episode: 35 Total reward: 16.0 Training loss: 16.5077 Explore P: 0.9372
Episode: 36 Total reward: 13.0 Training loss: 7.7539 Explore P: 0.9360
Episode: 37 Total reward: 12.0 Training loss: 4.5372 Explore P: 0.9349
Episode: 38 Total reward: 9.0 Training loss: 2.6692 Explore P: 0.9341
Episode: 39 Total reward: 11.0 Training loss: 2.6902 Explore P: 0.9331
Episode: 40 Total reward: 39.0 Training loss: 3.2809 Explore P: 0.9295
Episode: 41 Total reward: 29.0 Training loss: 8.4378 Explore P: 0.9268
Episode: 42 Total reward: 58.0 Training loss: 4.7492 Explore P: 0.9215
Episode: 43 Total reward: 13.0 Training loss: 4.5624 Explore P: 0.9203
Episode: 44 Total reward: 19.0 Training loss: 22.9375 Explore P: 0.9186
Episode: 45 Total reward: 10.0 Training loss: 11.4755 Explore P: 0.9177
Episode: 46 Total reward: 22.0 Training loss: 4.5503 Explore P: 0.9157
Episode: 47 Total reward: 15.0 Training loss: 13.1579 Explore P: 0.9143
Episode: 48 Total reward: 17.0 Training loss: 39.4206 Explore P: 0.9128
Episode: 49 Total reward: 13.0 Training loss: 32.5252 Explore P: 0.9116
Episode: 50 Total reward: 43.0 Training loss: 12.1778 Explore P: 0.9078
Episode: 51 Total reward: 22.0 Training loss: 16.6674 Explore P: 0.9058
Episode: 52 Total reward: 54.0 Training loss: 17.5926 Explore P: 0.9010
Episode: 53 Total reward: 9.0 Training loss: 12.8299 Explore P: 0.9002
Episode: 54 Total reward: 12.0 Training loss: 6.5734 Explore P: 0.8991
Episode: 55 Total reward: 36.0 Training loss: 6.0768 Explore P: 0.8959
Episode: 56 Total reward: 11.0 Training loss: 42.5450 Explore P: 0.8949
Episode: 57 Total reward: 13.0 Training loss: 4.0679 Explore P: 0.8938
Episode: 58 Total reward: 9.0 Training loss: 15.3231 Explore P: 0.8930
Episode: 59 Total reward: 11.0 Training loss: 48.6777 Explore P: 0.8920
Episode: 60 Total reward: 12.0 Training loss: 14.5620 Explore P: 0.8910
Episode: 61 Total reward: 9.0 Training loss: 34.5592 Explore P: 0.8902
Episode: 62 Total reward: 12.0 Training loss: 49.6113 Explore P: 0.8891
Episode: 63 Total reward: 35.0 Training loss: 20.1889 Explore P: 0.8860
Episode: 64 Total reward: 14.0 Training loss: 31.7905 Explore P: 0.8848
Episode: 65 Total reward: 22.0 Training loss: 5.9239 Explore P: 0.8829
Episode: 66 Total reward: 13.0 Training loss: 33.3984 Explore P: 0.8818
Episode: 67 Total reward: 18.0 Training loss: 30.2943 Explore P: 0.8802
Episode: 68 Total reward: 15.0 Training loss: 7.4056 Explore P: 0.8789
Episode: 69 Total reward: 14.0 Training loss: 6.6556 Explore P: 0.8777
Episode: 70 Total reward: 14.0 Training loss: 31.7854 Explore P: 0.8765
Episode: 71 Total reward: 15.0 Training loss: 7.3882 Explore P: 0.8752
Episode: 72 Total reward: 43.0 Training loss: 34.4647 Explore P: 0.8714
Episode: 73 Total reward: 15.0 Training loss: 39.6615 Explore P: 0.8701
Episode: 74 Total reward: 12.0 Training loss: 7.2609 Explore P: 0.8691
Episode: 75 Total reward: 17.0 Training loss: 70.3370 Explore P: 0.8677
Episode: 76 Total reward: 12.0 Training loss: 7.4690 Explore P: 0.8666
Episode: 77 Total reward: 9.0 Training loss: 90.0847 Explore P: 0.8659
Episode: 78 Total reward: 15.0 Training loss: 5.3574 Explore P: 0.8646
Episode: 79 Total reward: 9.0 Training loss: 7.4016 Explore P: 0.8638
Episode: 80 Total reward: 15.0 Training loss: 7.3950 Explore P: 0.8625
Episode: 81 Total reward: 28.0 Training loss: 40.4543 Explore P: 0.8601
Episode: 82 Total reward: 9.0 Training loss: 52.3103 Explore P: 0.8594
Episode: 83 Total reward: 15.0 Training loss: 53.7259 Explore P: 0.8581
Episode: 84 Total reward: 19.0 Training loss: 34.2459 Explore P: 0.8565
Episode: 85 Total reward: 9.0 Training loss: 28.5948 Explore P: 0.8557
Episode: 86 Total reward: 12.0 Training loss: 8.0840 Explore P: 0.8547
Episode: 87 Total reward: 22.0 Training loss: 45.0597 Explore P: 0.8529
Episode: 88 Total reward: 18.0 Training loss: 30.5534 Explore P: 0.8513
Episode: 89 Total reward: 19.0 Training loss: 6.8621 Explore P: 0.8498
Episode: 90 Total reward: 31.0 Training loss: 41.1563 Explore P: 0.8472
Episode: 91 Total reward: 11.0 Training loss: 83.7497 Explore P: 0.8462
Episode: 92 Total reward: 10.0 Training loss: 7.7658 Explore P: 0.8454
Episode: 93 Total reward: 32.0 Training loss: 121.1800 Explore P: 0.8427
Episode: 94 Total reward: 14.0 Training loss: 7.6797 Explore P: 0.8416
Episode: 95 Total reward: 19.0 Training loss: 30.9631 Explore P: 0.8400
Episode: 96 Total reward: 7.0 Training loss: 193.1124 Explore P: 0.8394
Episode: 97 Total reward: 13.0 Training loss: 73.6203 Explore P: 0.8383
Episode: 98 Total reward: 15.0 Training loss: 103.7317 Explore P: 0.8371
Episode: 99 Total reward: 9.0 Training loss: 130.8849 Explore P: 0.8363
Episode: 100 Total reward: 8.0 Training loss: 147.3368 Explore P: 0.8357
Episode: 101 Total reward: 12.0 Training loss: 51.8454 Explore P: 0.8347
Episode: 102 Total reward: 33.0 Training loss: 90.2317 Explore P: 0.8320
Episode: 103 Total reward: 19.0 Training loss: 64.9888 Explore P: 0.8304
Episode: 104 Total reward: 14.0 Training loss: 7.3125 Explore P: 0.8293
Episode: 105 Total reward: 15.0 Training loss: 102.5794 Explore P: 0.8280
Episode: 106 Total reward: 13.0 Training loss: 131.6469 Explore P: 0.8270
Episode: 107 Total reward: 23.0 Training loss: 6.0105 Explore P: 0.8251
Episode: 108 Total reward: 14.0 Training loss: 5.2139 Explore P: 0.8240
Episode: 109 Total reward: 33.0 Training loss: 6.0170 Explore P: 0.8213
Episode: 110 Total reward: 17.0 Training loss: 58.5708 Explore P: 0.8199
Episode: 111 Total reward: 16.0 Training loss: 78.3209 Explore P: 0.8186
Episode: 112 Total reward: 19.0 Training loss: 46.8259 Explore P: 0.8171
Episode: 113 Total reward: 19.0 Training loss: 6.6019 Explore P: 0.8155
Episode: 114 Total reward: 15.0 Training loss: 114.3136 Explore P: 0.8143
Episode: 115 Total reward: 10.0 Training loss: 46.9572 Explore P: 0.8135
Episode: 116 Total reward: 30.0 Training loss: 96.6317 Explore P: 0.8111
Episode: 117 Total reward: 14.0 Training loss: 45.6494 Explore P: 0.8100
Episode: 118 Total reward: 13.0 Training loss: 8.3398 Explore P: 0.8090
Episode: 119 Total reward: 8.0 Training loss: 36.2399 Explore P: 0.8083
Episode: 120 Total reward: 17.0 Training loss: 5.9506 Explore P: 0.8070
Episode: 121 Total reward: 11.0 Training loss: 89.2064 Explore P: 0.8061
Episode: 122 Total reward: 12.0 Training loss: 98.7181 Explore P: 0.8051
Episode: 123 Total reward: 11.0 Training loss: 81.6733 Explore P: 0.8043
Episode: 124 Total reward: 40.0 Training loss: 216.8083 Explore P: 0.8011
Episode: 125 Total reward: 18.0 Training loss: 5.7475 Explore P: 0.7997
Episode: 126 Total reward: 10.0 Training loss: 5.6185 Explore P: 0.7989
Episode: 127 Total reward: 8.0 Training loss: 4.0098 Explore P: 0.7982
Episode: 128 Total reward: 11.0 Training loss: 6.6236 Explore P: 0.7974
Episode: 129 Total reward: 23.0 Training loss: 4.8705 Explore P: 0.7956
Episode: 130 Total reward: 9.0 Training loss: 153.8965 Explore P: 0.7949
Episode: 131 Total reward: 16.0 Training loss: 5.6661 Explore P: 0.7936
Episode: 132 Total reward: 36.0 Training loss: 39.3000 Explore P: 0.7908
Episode: 133 Total reward: 16.0 Training loss: 83.0560 Explore P: 0.7895
Episode: 134 Total reward: 13.0 Training loss: 4.2540 Explore P: 0.7885
Episode: 135 Total reward: 10.0 Training loss: 6.0675 Explore P: 0.7877
Episode: 136 Total reward: 10.0 Training loss: 97.9332 Explore P: 0.7870
Episode: 137 Total reward: 34.0 Training loss: 60.1229 Explore P: 0.7843
Episode: 138 Total reward: 13.0 Training loss: 45.0153 Explore P: 0.7833
Episode: 139 Total reward: 8.0 Training loss: 41.5045 Explore P: 0.7827
Episode: 140 Total reward: 12.0 Training loss: 47.6364 Explore P: 0.7818
Episode: 141 Total reward: 10.0 Training loss: 105.8643 Explore P: 0.7810
Episode: 142 Total reward: 22.0 Training loss: 4.0854 Explore P: 0.7793
Episode: 143 Total reward: 14.0 Training loss: 6.1972 Explore P: 0.7782
Episode: 144 Total reward: 11.0 Training loss: 168.0713 Explore P: 0.7774
Episode: 145 Total reward: 8.0 Training loss: 4.0872 Explore P: 0.7768
Episode: 146 Total reward: 18.0 Training loss: 5.1898 Explore P: 0.7754
Episode: 147 Total reward: 8.0 Training loss: 5.5805 Explore P: 0.7748
Episode: 148 Total reward: 13.0 Training loss: 47.6831 Explore P: 0.7738
Episode: 149 Total reward: 15.0 Training loss: 55.0274 Explore P: 0.7727
Episode: 150 Total reward: 12.0 Training loss: 98.3874 Explore P: 0.7717
Episode: 151 Total reward: 15.0 Training loss: 4.9875 Explore P: 0.7706
Episode: 152 Total reward: 8.0 Training loss: 110.6396 Explore P: 0.7700
Episode: 153 Total reward: 14.0 Training loss: 122.6392 Explore P: 0.7689
Episode: 154 Total reward: 14.0 Training loss: 4.2909 Explore P: 0.7679
Episode: 155 Total reward: 12.0 Training loss: 59.9539 Explore P: 0.7670
Episode: 156 Total reward: 27.0 Training loss: 67.8556 Explore P: 0.7649
Episode: 157 Total reward: 40.0 Training loss: 86.9309 Explore P: 0.7619
Episode: 158 Total reward: 7.0 Training loss: 6.2655 Explore P: 0.7614
Episode: 159 Total reward: 29.0 Training loss: 55.3424 Explore P: 0.7592
Episode: 160 Total reward: 10.0 Training loss: 4.4379 Explore P: 0.7585
Episode: 161 Total reward: 9.0 Training loss: 47.3613 Explore P: 0.7578
Episode: 162 Total reward: 9.0 Training loss: 5.0493 Explore P: 0.7571
Episode: 163 Total reward: 19.0 Training loss: 138.7179 Explore P: 0.7557
Episode: 164 Total reward: 16.0 Training loss: 71.1862 Explore P: 0.7545
Episode: 165 Total reward: 17.0 Training loss: 5.4132 Explore P: 0.7532
Episode: 166 Total reward: 13.0 Training loss: 96.4727 Explore P: 0.7523
Episode: 167 Total reward: 10.0 Training loss: 5.2584 Explore P: 0.7515
Episode: 168 Total reward: 17.0 Training loss: 61.9150 Explore P: 0.7503
Episode: 169 Total reward: 8.0 Training loss: 4.8903 Explore P: 0.7497
Episode: 170 Total reward: 11.0 Training loss: 94.7925 Explore P: 0.7489
Episode: 171 Total reward: 15.0 Training loss: 63.8721 Explore P: 0.7477
Episode: 172 Total reward: 17.0 Training loss: 48.5656 Explore P: 0.7465
Episode: 173 Total reward: 9.0 Training loss: 3.7960 Explore P: 0.7458
Episode: 174 Total reward: 15.0 Training loss: 97.1451 Explore P: 0.7447
Episode: 175 Total reward: 13.0 Training loss: 4.7306 Explore P: 0.7438
Episode: 176 Total reward: 12.0 Training loss: 4.6059 Explore P: 0.7429
Episode: 177 Total reward: 15.0 Training loss: 4.0431 Explore P: 0.7418
Episode: 178 Total reward: 16.0 Training loss: 3.2180 Explore P: 0.7406
Episode: 179 Total reward: 11.0 Training loss: 41.7048 Explore P: 0.7398
Episode: 180 Total reward: 12.0 Training loss: 131.6561 Explore P: 0.7389
Episode: 181 Total reward: 13.0 Training loss: 45.4335 Explore P: 0.7380
Episode: 182 Total reward: 12.0 Training loss: 52.7987 Explore P: 0.7371
Episode: 183 Total reward: 11.0 Training loss: 3.5530 Explore P: 0.7363
Episode: 184 Total reward: 10.0 Training loss: 102.2853 Explore P: 0.7356
Episode: 185 Total reward: 8.0 Training loss: 111.0801 Explore P: 0.7350
Episode: 186 Total reward: 13.0 Training loss: 42.5606 Explore P: 0.7341
Episode: 187 Total reward: 12.0 Training loss: 4.3543 Explore P: 0.7332
Episode: 188 Total reward: 9.0 Training loss: 71.0869 Explore P: 0.7326
Episode: 189 Total reward: 21.0 Training loss: 42.1307 Explore P: 0.7310
Episode: 190 Total reward: 13.0 Training loss: 49.7662 Explore P: 0.7301
Episode: 191 Total reward: 10.0 Training loss: 93.4920 Explore P: 0.7294
Episode: 192 Total reward: 9.0 Training loss: 53.5627 Explore P: 0.7287
Episode: 193 Total reward: 12.0 Training loss: 3.4450 Explore P: 0.7279
Episode: 194 Total reward: 34.0 Training loss: 114.2461 Explore P: 0.7254
Episode: 195 Total reward: 11.0 Training loss: 3.8034 Explore P: 0.7247
Episode: 196 Total reward: 9.0 Training loss: 3.2879 Explore P: 0.7240
Episode: 197 Total reward: 16.0 Training loss: 115.4493 Explore P: 0.7229
Episode: 198 Total reward: 10.0 Training loss: 3.4042 Explore P: 0.7222
Episode: 199 Total reward: 11.0 Training loss: 42.8475 Explore P: 0.7214
Episode: 200 Total reward: 21.0 Training loss: 107.2492 Explore P: 0.7199
Episode: 201 Total reward: 16.0 Training loss: 152.0090 Explore P: 0.7188
Episode: 202 Total reward: 13.0 Training loss: 63.3706 Explore P: 0.7178
Episode: 203 Total reward: 26.0 Training loss: 44.7228 Explore P: 0.7160
Episode: 204 Total reward: 11.0 Training loss: 122.1985 Explore P: 0.7152
Episode: 205 Total reward: 18.0 Training loss: 3.5232 Explore P: 0.7139
Episode: 206 Total reward: 13.0 Training loss: 79.4490 Explore P: 0.7130
Episode: 207 Total reward: 17.0 Training loss: 88.8046 Explore P: 0.7118
Episode: 208 Total reward: 12.0 Training loss: 41.8630 Explore P: 0.7110
Episode: 209 Total reward: 12.0 Training loss: 97.2259 Explore P: 0.7102
Episode: 210 Total reward: 10.0 Training loss: 47.1895 Explore P: 0.7095
Episode: 211 Total reward: 11.0 Training loss: 4.3932 Explore P: 0.7087
Episode: 212 Total reward: 9.0 Training loss: 56.3666 Explore P: 0.7081
Episode: 213 Total reward: 8.0 Training loss: 174.0695 Explore P: 0.7075
Episode: 214 Total reward: 10.0 Training loss: 3.2986 Explore P: 0.7068
Episode: 215 Total reward: 9.0 Training loss: 2.0773 Explore P: 0.7062
Episode: 216 Total reward: 11.0 Training loss: 56.3685 Explore P: 0.7054
Episode: 217 Total reward: 8.0 Training loss: 37.8226 Explore P: 0.7049
Episode: 218 Total reward: 9.0 Training loss: 89.4733 Explore P: 0.7042
Episode: 219 Total reward: 10.0 Training loss: 3.0540 Explore P: 0.7035
Episode: 220 Total reward: 16.0 Training loss: 41.3898 Explore P: 0.7024
Episode: 221 Total reward: 13.0 Training loss: 2.0408 Explore P: 0.7015
Episode: 222 Total reward: 12.0 Training loss: 3.4858 Explore P: 0.7007
Episode: 223 Total reward: 10.0 Training loss: 2.5765 Explore P: 0.7000
Episode: 224 Total reward: 20.0 Training loss: 4.9922 Explore P: 0.6986
Episode: 225 Total reward: 26.0 Training loss: 51.0167 Explore P: 0.6968
Episode: 226 Total reward: 10.0 Training loss: 2.8026 Explore P: 0.6962
Episode: 227 Total reward: 10.0 Training loss: 54.3370 Explore P: 0.6955
Episode: 228 Total reward: 7.0 Training loss: 88.1494 Explore P: 0.6950
Episode: 229 Total reward: 14.0 Training loss: 40.5178 Explore P: 0.6940
Episode: 230 Total reward: 19.0 Training loss: 3.4116 Explore P: 0.6927
Episode: 231 Total reward: 13.0 Training loss: 3.8308 Explore P: 0.6918
Episode: 232 Total reward: 8.0 Training loss: 143.1312 Explore P: 0.6913
Episode: 233 Total reward: 11.0 Training loss: 79.8811 Explore P: 0.6906
Episode: 234 Total reward: 15.0 Training loss: 33.9620 Explore P: 0.6895
Episode: 235 Total reward: 21.0 Training loss: 96.3121 Explore P: 0.6881
Episode: 236 Total reward: 13.0 Training loss: 2.2959 Explore P: 0.6872
Episode: 237 Total reward: 8.0 Training loss: 42.9516 Explore P: 0.6867
Episode: 238 Total reward: 13.0 Training loss: 68.3784 Explore P: 0.6858
Episode: 239 Total reward: 10.0 Training loss: 33.3381 Explore P: 0.6851
Episode: 240 Total reward: 30.0 Training loss: 1.5056 Explore P: 0.6831
Episode: 241 Total reward: 13.0 Training loss: 36.4583 Explore P: 0.6822
Episode: 242 Total reward: 10.0 Training loss: 88.9329 Explore P: 0.6816
Episode: 243 Total reward: 10.0 Training loss: 3.3888 Explore P: 0.6809
Episode: 244 Total reward: 12.0 Training loss: 43.0344 Explore P: 0.6801
Episode: 245 Total reward: 15.0 Training loss: 39.3683 Explore P: 0.6791
Episode: 246 Total reward: 28.0 Training loss: 40.1742 Explore P: 0.6772
Episode: 247 Total reward: 13.0 Training loss: 44.2036 Explore P: 0.6763
Episode: 248 Total reward: 9.0 Training loss: 92.7866 Explore P: 0.6757
Episode: 249 Total reward: 15.0 Training loss: 57.4456 Explore P: 0.6747
Episode: 250 Total reward: 17.0 Training loss: 74.2216 Explore P: 0.6736
Episode: 251 Total reward: 9.0 Training loss: 1.9557 Explore P: 0.6730
Episode: 252 Total reward: 18.0 Training loss: 2.0819 Explore P: 0.6718
Episode: 253 Total reward: 13.0 Training loss: 48.8914 Explore P: 0.6710
Episode: 254 Total reward: 8.0 Training loss: 24.7785 Explore P: 0.6704
Episode: 255 Total reward: 13.0 Training loss: 2.0720 Explore P: 0.6696
Episode: 256 Total reward: 10.0 Training loss: 112.2454 Explore P: 0.6689
Episode: 257 Total reward: 25.0 Training loss: 47.4073 Explore P: 0.6673
Episode: 258 Total reward: 11.0 Training loss: 62.2427 Explore P: 0.6666
Episode: 259 Total reward: 10.0 Training loss: 66.8265 Explore P: 0.6659
Episode: 260 Total reward: 8.0 Training loss: 2.0777 Explore P: 0.6654
Episode: 261 Total reward: 10.0 Training loss: 41.5635 Explore P: 0.6647
Episode: 262 Total reward: 16.0 Training loss: 30.1654 Explore P: 0.6637
Episode: 263 Total reward: 11.0 Training loss: 3.0162 Explore P: 0.6630
Episode: 264 Total reward: 8.0 Training loss: 25.9962 Explore P: 0.6624
Episode: 265 Total reward: 15.0 Training loss: 52.1324 Explore P: 0.6615
Episode: 266 Total reward: 10.0 Training loss: 54.7051 Explore P: 0.6608
Episode: 267 Total reward: 12.0 Training loss: 2.0921 Explore P: 0.6600
Episode: 268 Total reward: 18.0 Training loss: 50.1997 Explore P: 0.6589
Episode: 269 Total reward: 12.0 Training loss: 77.7558 Explore P: 0.6581
Episode: 270 Total reward: 8.0 Training loss: 73.7973 Explore P: 0.6576
Episode: 271 Total reward: 16.0 Training loss: 48.6127 Explore P: 0.6565
Episode: 272 Total reward: 11.0 Training loss: 111.1605 Explore P: 0.6558
Episode: 273 Total reward: 8.0 Training loss: 88.6057 Explore P: 0.6553
Episode: 274 Total reward: 12.0 Training loss: 2.9883 Explore P: 0.6545
Episode: 275 Total reward: 10.0 Training loss: 104.8552 Explore P: 0.6539
Episode: 276 Total reward: 10.0 Training loss: 82.6828 Explore P: 0.6532
Episode: 277 Total reward: 14.0 Training loss: 48.4167 Explore P: 0.6523
Episode: 278 Total reward: 16.0 Training loss: 27.8063 Explore P: 0.6513
Episode: 279 Total reward: 9.0 Training loss: 33.8421 Explore P: 0.6507
Episode: 280 Total reward: 17.0 Training loss: 71.0932 Explore P: 0.6496
Episode: 281 Total reward: 17.0 Training loss: 66.4045 Explore P: 0.6486
Episode: 282 Total reward: 11.0 Training loss: 26.5619 Explore P: 0.6479
Episode: 283 Total reward: 11.0 Training loss: 36.5878 Explore P: 0.6471
Episode: 284 Total reward: 10.0 Training loss: 3.7939 Explore P: 0.6465
Episode: 285 Total reward: 10.0 Training loss: 2.4501 Explore P: 0.6459
Episode: 286 Total reward: 12.0 Training loss: 75.9622 Explore P: 0.6451
Episode: 287 Total reward: 9.0 Training loss: 66.5885 Explore P: 0.6445
Episode: 288 Total reward: 14.0 Training loss: 44.4752 Explore P: 0.6437
Episode: 289 Total reward: 9.0 Training loss: 2.8483 Explore P: 0.6431
Episode: 290 Total reward: 14.0 Training loss: 70.0011 Explore P: 0.6422
Episode: 291 Total reward: 11.0 Training loss: 23.2360 Explore P: 0.6415
Episode: 292 Total reward: 12.0 Training loss: 26.5980 Explore P: 0.6407
Episode: 293 Total reward: 16.0 Training loss: 59.9663 Explore P: 0.6397
Episode: 294 Total reward: 16.0 Training loss: 51.2042 Explore P: 0.6387
Episode: 295 Total reward: 16.0 Training loss: 158.0205 Explore P: 0.6377
Episode: 296 Total reward: 9.0 Training loss: 62.9076 Explore P: 0.6372
Episode: 297 Total reward: 9.0 Training loss: 46.0573 Explore P: 0.6366
Episode: 298 Total reward: 18.0 Training loss: 2.4531 Explore P: 0.6355
Episode: 299 Total reward: 13.0 Training loss: 61.1556 Explore P: 0.6347
Episode: 300 Total reward: 13.0 Training loss: 74.5984 Explore P: 0.6338
Episode: 301 Total reward: 13.0 Training loss: 33.3603 Explore P: 0.6330
Episode: 302 Total reward: 10.0 Training loss: 130.0261 Explore P: 0.6324
Episode: 303 Total reward: 22.0 Training loss: 57.4133 Explore P: 0.6310
Episode: 304 Total reward: 31.0 Training loss: 54.5242 Explore P: 0.6291
Episode: 305 Total reward: 11.0 Training loss: 39.0602 Explore P: 0.6284
Episode: 306 Total reward: 28.0 Training loss: 3.1976 Explore P: 0.6267
Episode: 307 Total reward: 25.0 Training loss: 19.2603 Explore P: 0.6252
Episode: 308 Total reward: 13.0 Training loss: 2.3615 Explore P: 0.6244
Episode: 309 Total reward: 15.0 Training loss: 77.0600 Explore P: 0.6235
Episode: 310 Total reward: 14.0 Training loss: 60.0065 Explore P: 0.6226
Episode: 311 Total reward: 7.0 Training loss: 32.8759 Explore P: 0.6222
Episode: 312 Total reward: 10.0 Training loss: 2.0155 Explore P: 0.6216
Episode: 313 Total reward: 11.0 Training loss: 38.8568 Explore P: 0.6209
Episode: 314 Total reward: 8.0 Training loss: 38.6616 Explore P: 0.6204
Episode: 315 Total reward: 12.0 Training loss: 29.0094 Explore P: 0.6197
Episode: 316 Total reward: 8.0 Training loss: 41.1915 Explore P: 0.6192
Episode: 317 Total reward: 22.0 Training loss: 2.5159 Explore P: 0.6178
Episode: 318 Total reward: 26.0 Training loss: 2.1875 Explore P: 0.6163
Episode: 319 Total reward: 10.0 Training loss: 46.8858 Explore P: 0.6157
Episode: 320 Total reward: 19.0 Training loss: 70.3979 Explore P: 0.6145
Episode: 321 Total reward: 20.0 Training loss: 1.7864 Explore P: 0.6133
Episode: 322 Total reward: 7.0 Training loss: 18.4815 Explore P: 0.6129
Episode: 323 Total reward: 9.0 Training loss: 34.2074 Explore P: 0.6123
Episode: 324 Total reward: 11.0 Training loss: 37.0610 Explore P: 0.6117
Episode: 325 Total reward: 13.0 Training loss: 2.1275 Explore P: 0.6109
Episode: 326 Total reward: 12.0 Training loss: 49.4479 Explore P: 0.6102
Episode: 327 Total reward: 25.0 Training loss: 18.1811 Explore P: 0.6087
Episode: 328 Total reward: 16.0 Training loss: 32.0249 Explore P: 0.6077
Episode: 329 Total reward: 13.0 Training loss: 76.8990 Explore P: 0.6069
Episode: 330 Total reward: 19.0 Training loss: 18.9695 Explore P: 0.6058
Episode: 331 Total reward: 11.0 Training loss: 53.2436 Explore P: 0.6051
Episode: 332 Total reward: 16.0 Training loss: 3.2431 Explore P: 0.6042
Episode: 333 Total reward: 22.0 Training loss: 31.8778 Explore P: 0.6029
Episode: 334 Total reward: 11.0 Training loss: 39.7898 Explore P: 0.6022
Episode: 335 Total reward: 10.0 Training loss: 1.6263 Explore P: 0.6016
Episode: 336 Total reward: 10.0 Training loss: 33.8393 Explore P: 0.6011
Episode: 337 Total reward: 24.0 Training loss: 84.6020 Explore P: 0.5996
Episode: 338 Total reward: 11.0 Training loss: 1.9746 Explore P: 0.5990
Episode: 339 Total reward: 10.0 Training loss: 1.3601 Explore P: 0.5984
Episode: 340 Total reward: 10.0 Training loss: 15.7789 Explore P: 0.5978
Episode: 341 Total reward: 19.0 Training loss: 56.6093 Explore P: 0.5967
Episode: 342 Total reward: 33.0 Training loss: 2.2555 Explore P: 0.5948
Episode: 343 Total reward: 10.0 Training loss: 17.4801 Explore P: 0.5942
Episode: 344 Total reward: 20.0 Training loss: 40.5913 Explore P: 0.5930
Episode: 345 Total reward: 14.0 Training loss: 20.2850 Explore P: 0.5922
Episode: 346 Total reward: 14.0 Training loss: 42.2426 Explore P: 0.5914
Episode: 347 Total reward: 11.0 Training loss: 69.0655 Explore P: 0.5907
Episode: 348 Total reward: 18.0 Training loss: 33.9767 Explore P: 0.5897
Episode: 349 Total reward: 12.0 Training loss: 30.7009 Explore P: 0.5890
Episode: 350 Total reward: 11.0 Training loss: 100.0461 Explore P: 0.5884
Episode: 351 Total reward: 24.0 Training loss: 14.8766 Explore P: 0.5870
Episode: 352 Total reward: 29.0 Training loss: 25.0430 Explore P: 0.5853
Episode: 353 Total reward: 22.0 Training loss: 14.1312 Explore P: 0.5840
Episode: 354 Total reward: 13.0 Training loss: 14.9602 Explore P: 0.5833
Episode: 355 Total reward: 9.0 Training loss: 30.8997 Explore P: 0.5828
Episode: 356 Total reward: 23.0 Training loss: 78.2159 Explore P: 0.5815
Episode: 357 Total reward: 15.0 Training loss: 14.9611 Explore P: 0.5806
Episode: 358 Total reward: 28.0 Training loss: 76.8134 Explore P: 0.5790
Episode: 359 Total reward: 30.0 Training loss: 67.9725 Explore P: 0.5773
Episode: 360 Total reward: 9.0 Training loss: 44.5401 Explore P: 0.5768
Episode: 361 Total reward: 11.0 Training loss: 25.8684 Explore P: 0.5762
Episode: 362 Total reward: 11.0 Training loss: 111.4531 Explore P: 0.5756
Episode: 363 Total reward: 11.0 Training loss: 56.5638 Explore P: 0.5749
Episode: 364 Total reward: 9.0 Training loss: 30.3612 Explore P: 0.5744
Episode: 365 Total reward: 16.0 Training loss: 1.5192 Explore P: 0.5735
Episode: 366 Total reward: 11.0 Training loss: 1.4722 Explore P: 0.5729
Episode: 367 Total reward: 10.0 Training loss: 2.1671 Explore P: 0.5723
Episode: 368 Total reward: 11.0 Training loss: 24.0140 Explore P: 0.5717
Episode: 369 Total reward: 19.0 Training loss: 1.3649 Explore P: 0.5707
Episode: 370 Total reward: 22.0 Training loss: 27.2544 Explore P: 0.5694
Episode: 371 Total reward: 19.0 Training loss: 1.3145 Explore P: 0.5684
Episode: 372 Total reward: 15.0 Training loss: 61.3118 Explore P: 0.5675
Episode: 373 Total reward: 11.0 Training loss: 15.6454 Explore P: 0.5669
Episode: 374 Total reward: 21.0 Training loss: 21.2242 Explore P: 0.5657
Episode: 375 Total reward: 32.0 Training loss: 1.8672 Explore P: 0.5640
Episode: 376 Total reward: 8.0 Training loss: 1.8816 Explore P: 0.5635
Episode: 377 Total reward: 13.0 Training loss: 1.4361 Explore P: 0.5628
Episode: 378 Total reward: 13.0 Training loss: 55.2558 Explore P: 0.5621
Episode: 379 Total reward: 15.0 Training loss: 13.0873 Explore P: 0.5613
Episode: 380 Total reward: 11.0 Training loss: 50.6510 Explore P: 0.5607
Episode: 381 Total reward: 8.0 Training loss: 18.2204 Explore P: 0.5602
Episode: 382 Total reward: 35.0 Training loss: 25.8362 Explore P: 0.5583
Episode: 383 Total reward: 40.0 Training loss: 12.3519 Explore P: 0.5561
Episode: 384 Total reward: 25.0 Training loss: 15.1218 Explore P: 0.5547
Episode: 385 Total reward: 9.0 Training loss: 1.1368 Explore P: 0.5542
Episode: 386 Total reward: 8.0 Training loss: 1.0074 Explore P: 0.5538
Episode: 387 Total reward: 10.0 Training loss: 1.6010 Explore P: 0.5533
Episode: 388 Total reward: 17.0 Training loss: 22.0825 Explore P: 0.5523
Episode: 389 Total reward: 32.0 Training loss: 38.5605 Explore P: 0.5506
Episode: 390 Total reward: 13.0 Training loss: 1.2627 Explore P: 0.5499
Episode: 391 Total reward: 16.0 Training loss: 22.0084 Explore P: 0.5490
Episode: 392 Total reward: 10.0 Training loss: 18.5658 Explore P: 0.5485
Episode: 393 Total reward: 20.0 Training loss: 15.1878 Explore P: 0.5474
Episode: 394 Total reward: 21.0 Training loss: 11.8674 Explore P: 0.5463
Episode: 395 Total reward: 8.0 Training loss: 1.3878 Explore P: 0.5459
Episode: 396 Total reward: 15.0 Training loss: 10.4267 Explore P: 0.5451
Episode: 397 Total reward: 17.0 Training loss: 1.0389 Explore P: 0.5442
Episode: 398 Total reward: 13.0 Training loss: 41.5249 Explore P: 0.5435
Episode: 399 Total reward: 11.0 Training loss: 39.4275 Explore P: 0.5429
Episode: 400 Total reward: 8.0 Training loss: 1.2372 Explore P: 0.5425
Episode: 401 Total reward: 23.0 Training loss: 14.8722 Explore P: 0.5412
Episode: 402 Total reward: 15.0 Training loss: 19.5031 Explore P: 0.5404
Episode: 403 Total reward: 8.0 Training loss: 11.4738 Explore P: 0.5400
Episode: 404 Total reward: 17.0 Training loss: 23.4754 Explore P: 0.5391
Episode: 405 Total reward: 13.0 Training loss: 11.5417 Explore P: 0.5384
Episode: 406 Total reward: 8.0 Training loss: 26.4607 Explore P: 0.5380
Episode: 407 Total reward: 13.0 Training loss: 17.3599 Explore P: 0.5373
Episode: 408 Total reward: 10.0 Training loss: 29.9418 Explore P: 0.5368
Episode: 409 Total reward: 10.0 Training loss: 43.1301 Explore P: 0.5363
Episode: 410 Total reward: 9.0 Training loss: 60.7578 Explore P: 0.5358
Episode: 411 Total reward: 19.0 Training loss: 1.0266 Explore P: 0.5348
Episode: 412 Total reward: 11.0 Training loss: 57.7678 Explore P: 0.5342
Episode: 413 Total reward: 29.0 Training loss: 20.9200 Explore P: 0.5327
Episode: 414 Total reward: 8.0 Training loss: 1.1500 Explore P: 0.5323
Episode: 415 Total reward: 9.0 Training loss: 44.8960 Explore P: 0.5318
Episode: 416 Total reward: 12.0 Training loss: 0.7597 Explore P: 0.5312
Episode: 417 Total reward: 12.0 Training loss: 15.2271 Explore P: 0.5306
Episode: 418 Total reward: 9.0 Training loss: 14.6752 Explore P: 0.5301
Episode: 419 Total reward: 10.0 Training loss: 0.8180 Explore P: 0.5296
Episode: 420 Total reward: 27.0 Training loss: 27.9598 Explore P: 0.5282
Episode: 421 Total reward: 21.0 Training loss: 11.2763 Explore P: 0.5271
Episode: 422 Total reward: 18.0 Training loss: 14.5655 Explore P: 0.5262
Episode: 423 Total reward: 43.0 Training loss: 14.3161 Explore P: 0.5239
Episode: 424 Total reward: 26.0 Training loss: 12.9701 Explore P: 0.5226
Episode: 425 Total reward: 31.0 Training loss: 13.3857 Explore P: 0.5210
Episode: 426 Total reward: 23.0 Training loss: 0.6222 Explore P: 0.5198
Episode: 427 Total reward: 62.0 Training loss: 23.2582 Explore P: 0.5167
Episode: 428 Total reward: 24.0 Training loss: 0.8246 Explore P: 0.5155
Episode: 429 Total reward: 24.0 Training loss: 20.5321 Explore P: 0.5143
Episode: 430 Total reward: 12.0 Training loss: 9.3897 Explore P: 0.5137
Episode: 431 Total reward: 10.0 Training loss: 0.9410 Explore P: 0.5132
Episode: 432 Total reward: 28.0 Training loss: 0.7609 Explore P: 0.5118
Episode: 433 Total reward: 33.0 Training loss: 11.1797 Explore P: 0.5101
Episode: 434 Total reward: 16.0 Training loss: 0.7292 Explore P: 0.5093
Episode: 435 Total reward: 16.0 Training loss: 23.7012 Explore P: 0.5085
Episode: 436 Total reward: 108.0 Training loss: 10.5793 Explore P: 0.5031
Episode: 437 Total reward: 29.0 Training loss: 1.2220 Explore P: 0.5017
Episode: 438 Total reward: 53.0 Training loss: 11.3072 Explore P: 0.4991
Episode: 439 Total reward: 31.0 Training loss: 9.0537 Explore P: 0.4976
Episode: 440 Total reward: 29.0 Training loss: 17.1514 Explore P: 0.4962
Episode: 441 Total reward: 33.0 Training loss: 16.9718 Explore P: 0.4946
Episode: 442 Total reward: 39.0 Training loss: 8.9066 Explore P: 0.4927
Episode: 443 Total reward: 53.0 Training loss: 35.8386 Explore P: 0.4902
Episode: 444 Total reward: 34.0 Training loss: 19.8417 Explore P: 0.4885
Episode: 445 Total reward: 23.0 Training loss: 17.1943 Explore P: 0.4874
Episode: 446 Total reward: 40.0 Training loss: 35.1992 Explore P: 0.4855
Episode: 447 Total reward: 89.0 Training loss: 9.9165 Explore P: 0.4813
Episode: 448 Total reward: 107.0 Training loss: 27.8066 Explore P: 0.4763
Episode: 449 Total reward: 61.0 Training loss: 1.2022 Explore P: 0.4735
Episode: 450 Total reward: 73.0 Training loss: 1.0336 Explore P: 0.4701
Episode: 451 Total reward: 44.0 Training loss: 1.1871 Explore P: 0.4681
Episode: 452 Total reward: 30.0 Training loss: 11.7373 Explore P: 0.4667
Episode: 453 Total reward: 118.0 Training loss: 10.1504 Explore P: 0.4613
Episode: 454 Total reward: 32.0 Training loss: 19.4705 Explore P: 0.4599
Episode: 455 Total reward: 30.0 Training loss: 12.3256 Explore P: 0.4585
Episode: 456 Total reward: 52.0 Training loss: 10.7644 Explore P: 0.4562
Episode: 457 Total reward: 51.0 Training loss: 13.3556 Explore P: 0.4539
Episode: 458 Total reward: 53.0 Training loss: 13.2746 Explore P: 0.4516
Episode: 459 Total reward: 40.0 Training loss: 14.5778 Explore P: 0.4498
Episode: 460 Total reward: 31.0 Training loss: 1.2552 Explore P: 0.4485
Episode: 461 Total reward: 47.0 Training loss: 7.4150 Explore P: 0.4464
Episode: 462 Total reward: 57.0 Training loss: 13.6333 Explore P: 0.4439
Episode: 463 Total reward: 105.0 Training loss: 0.9280 Explore P: 0.4394
Episode: 464 Total reward: 85.0 Training loss: 7.9413 Explore P: 0.4358
Episode: 465 Total reward: 54.0 Training loss: 8.0479 Explore P: 0.4335
Episode: 466 Total reward: 46.0 Training loss: 1.2306 Explore P: 0.4315
Episode: 467 Total reward: 27.0 Training loss: 0.9577 Explore P: 0.4304
Episode: 468 Total reward: 27.0 Training loss: 11.2492 Explore P: 0.4293
Episode: 469 Total reward: 24.0 Training loss: 12.7475 Explore P: 0.4283
Episode: 470 Total reward: 81.0 Training loss: 27.4405 Explore P: 0.4249
Episode: 471 Total reward: 74.0 Training loss: 1.6374 Explore P: 0.4218
Episode: 472 Total reward: 68.0 Training loss: 22.6931 Explore P: 0.4190
Episode: 473 Total reward: 26.0 Training loss: 1.1068 Explore P: 0.4180
Episode: 474 Total reward: 51.0 Training loss: 39.1313 Explore P: 0.4159
Episode: 475 Total reward: 32.0 Training loss: 1.1187 Explore P: 0.4146
Episode: 476 Total reward: 59.0 Training loss: 1.3609 Explore P: 0.4122
Episode: 477 Total reward: 39.0 Training loss: 11.3556 Explore P: 0.4107
Episode: 478 Total reward: 33.0 Training loss: 1.2467 Explore P: 0.4093
Episode: 479 Total reward: 32.0 Training loss: 36.5647 Explore P: 0.4081
Episode: 480 Total reward: 64.0 Training loss: 25.0684 Explore P: 0.4055
Episode: 481 Total reward: 33.0 Training loss: 1.8699 Explore P: 0.4042
Episode: 482 Total reward: 23.0 Training loss: 2.5045 Explore P: 0.4033
Episode: 483 Total reward: 56.0 Training loss: 1.4835 Explore P: 0.4011
Episode: 484 Total reward: 84.0 Training loss: 55.2942 Explore P: 0.3978
Episode: 485 Total reward: 52.0 Training loss: 19.6208 Explore P: 0.3958
Episode: 486 Total reward: 69.0 Training loss: 14.8518 Explore P: 0.3932
Episode: 487 Total reward: 70.0 Training loss: 19.6224 Explore P: 0.3905
Episode: 488 Total reward: 38.0 Training loss: 23.2359 Explore P: 0.3891
Episode: 489 Total reward: 76.0 Training loss: 14.2690 Explore P: 0.3862
Episode: 490 Total reward: 43.0 Training loss: 10.3761 Explore P: 0.3846
Episode: 491 Total reward: 48.0 Training loss: 13.2216 Explore P: 0.3828
Episode: 492 Total reward: 45.0 Training loss: 1.1882 Explore P: 0.3811
Episode: 493 Total reward: 54.0 Training loss: 1.5772 Explore P: 0.3791
Episode: 494 Total reward: 22.0 Training loss: 31.4227 Explore P: 0.3783
Episode: 495 Total reward: 48.0 Training loss: 0.7882 Explore P: 0.3765
Episode: 496 Total reward: 43.0 Training loss: 28.2352 Explore P: 0.3750
Episode: 497 Total reward: 17.0 Training loss: 59.4296 Explore P: 0.3743
Episode: 498 Total reward: 38.0 Training loss: 24.8711 Explore P: 0.3730
Episode: 499 Total reward: 37.0 Training loss: 1.5679 Explore P: 0.3716
Episode: 500 Total reward: 121.0 Training loss: 15.1550 Explore P: 0.3673
Episode: 501 Total reward: 54.0 Training loss: 13.3280 Explore P: 0.3654
Episode: 502 Total reward: 37.0 Training loss: 21.1764 Explore P: 0.3640
Episode: 503 Total reward: 93.0 Training loss: 9.9257 Explore P: 0.3608
Episode: 504 Total reward: 55.0 Training loss: 57.2589 Explore P: 0.3588
Episode: 505 Total reward: 36.0 Training loss: 2.0672 Explore P: 0.3576
Episode: 506 Total reward: 45.0 Training loss: 28.2182 Explore P: 0.3560
Episode: 507 Total reward: 49.0 Training loss: 13.6501 Explore P: 0.3543
Episode: 508 Total reward: 48.0 Training loss: 20.2166 Explore P: 0.3527
Episode: 509 Total reward: 30.0 Training loss: 57.7822 Explore P: 0.3517
Episode: 510 Total reward: 75.0 Training loss: 1.2942 Explore P: 0.3491
Episode: 511 Total reward: 89.0 Training loss: 19.8651 Explore P: 0.3461
Episode: 512 Total reward: 55.0 Training loss: 1.4324 Explore P: 0.3443
Episode: 513 Total reward: 59.0 Training loss: 16.5314 Explore P: 0.3423
Episode: 514 Total reward: 143.0 Training loss: 1.6154 Explore P: 0.3376
Episode: 515 Total reward: 61.0 Training loss: 72.3748 Explore P: 0.3356
Episode: 516 Total reward: 92.0 Training loss: 33.4711 Explore P: 0.3326
Episode: 517 Total reward: 74.0 Training loss: 48.6348 Explore P: 0.3302
Episode: 518 Total reward: 65.0 Training loss: 1.7813 Explore P: 0.3281
Episode: 519 Total reward: 81.0 Training loss: 13.7946 Explore P: 0.3256
Episode: 520 Total reward: 48.0 Training loss: 21.6257 Explore P: 0.3241
Episode: 521 Total reward: 89.0 Training loss: 2.1943 Explore P: 0.3213
Episode: 522 Total reward: 124.0 Training loss: 16.3781 Explore P: 0.3174
Episode: 523 Total reward: 38.0 Training loss: 41.1233 Explore P: 0.3163
Episode: 524 Total reward: 57.0 Training loss: 55.5117 Explore P: 0.3145
Episode: 525 Total reward: 96.0 Training loss: 40.4218 Explore P: 0.3116
Episode: 526 Total reward: 22.0 Training loss: 2.4061 Explore P: 0.3110
Episode: 527 Total reward: 111.0 Training loss: 9.7585 Explore P: 0.3076
Episode: 528 Total reward: 61.0 Training loss: 28.7139 Explore P: 0.3058
Episode: 529 Total reward: 69.0 Training loss: 1.0509 Explore P: 0.3038
Episode: 530 Total reward: 51.0 Training loss: 68.6747 Explore P: 0.3023
Episode: 531 Total reward: 46.0 Training loss: 1.7134 Explore P: 0.3010
Episode: 532 Total reward: 55.0 Training loss: 62.8443 Explore P: 0.2994
Episode: 533 Total reward: 91.0 Training loss: 70.2048 Explore P: 0.2967
Episode: 534 Total reward: 80.0 Training loss: 1.5755 Explore P: 0.2945
Episode: 535 Total reward: 62.0 Training loss: 23.1091 Explore P: 0.2927
Episode: 536 Total reward: 46.0 Training loss: 2.1805 Explore P: 0.2914
Episode: 537 Total reward: 87.0 Training loss: 20.7342 Explore P: 0.2890
Episode: 538 Total reward: 25.0 Training loss: 35.9775 Explore P: 0.2883
Episode: 539 Total reward: 33.0 Training loss: 1.6385 Explore P: 0.2874
Episode: 540 Total reward: 31.0 Training loss: 2.3007 Explore P: 0.2865
Episode: 541 Total reward: 122.0 Training loss: 15.0059 Explore P: 0.2831
Episode: 542 Total reward: 80.0 Training loss: 41.5785 Explore P: 0.2810
Episode: 543 Total reward: 92.0 Training loss: 59.3950 Explore P: 0.2785
Episode: 544 Total reward: 72.0 Training loss: 1.6160 Explore P: 0.2766
Episode: 545 Total reward: 70.0 Training loss: 1.7209 Explore P: 0.2747
Episode: 546 Total reward: 73.0 Training loss: 18.1547 Explore P: 0.2728
Episode: 547 Total reward: 53.0 Training loss: 17.9935 Explore P: 0.2714
Episode: 548 Total reward: 70.0 Training loss: 22.1907 Explore P: 0.2696
Episode: 549 Total reward: 66.0 Training loss: 54.7935 Explore P: 0.2679
Episode: 550 Total reward: 55.0 Training loss: 8.6427 Explore P: 0.2664
Episode: 551 Total reward: 56.0 Training loss: 1.4678 Explore P: 0.2650
Episode: 552 Total reward: 40.0 Training loss: 0.8467 Explore P: 0.2640
Episode: 553 Total reward: 43.0 Training loss: 2.8195 Explore P: 0.2629
Episode: 554 Total reward: 99.0 Training loss: 1.6599 Explore P: 0.2604
Episode: 555 Total reward: 56.0 Training loss: 1.4326 Explore P: 0.2590
Episode: 556 Total reward: 37.0 Training loss: 3.9783 Explore P: 0.2581
Episode: 557 Total reward: 62.0 Training loss: 2.7335 Explore P: 0.2566
Episode: 558 Total reward: 71.0 Training loss: 2.0142 Explore P: 0.2548
Episode: 559 Total reward: 46.0 Training loss: 12.6104 Explore P: 0.2537
Episode: 560 Total reward: 113.0 Training loss: 2.3870 Explore P: 0.2510
Episode: 561 Total reward: 45.0 Training loss: 66.3432 Explore P: 0.2499
Episode: 562 Total reward: 44.0 Training loss: 46.6465 Explore P: 0.2488
Episode: 563 Total reward: 49.0 Training loss: 2.1797 Explore P: 0.2477
Episode: 564 Total reward: 59.0 Training loss: 1.8394 Explore P: 0.2463
Episode: 565 Total reward: 67.0 Training loss: 3.9196 Explore P: 0.2447
Episode: 566 Total reward: 78.0 Training loss: 1.5977 Explore P: 0.2429
Episode: 567 Total reward: 75.0 Training loss: 1.5884 Explore P: 0.2411
Episode: 568 Total reward: 63.0 Training loss: 1.8183 Explore P: 0.2397
Episode: 569 Total reward: 76.0 Training loss: 48.2621 Explore P: 0.2379
Episode: 570 Total reward: 54.0 Training loss: 0.8010 Explore P: 0.2367
Episode: 571 Total reward: 132.0 Training loss: 1.9592 Explore P: 0.2337
Episode: 572 Total reward: 86.0 Training loss: 1.9739 Explore P: 0.2318
Episode: 573 Total reward: 50.0 Training loss: 88.6746 Explore P: 0.2307
Episode: 574 Total reward: 71.0 Training loss: 1.3215 Explore P: 0.2291
Episode: 575 Total reward: 69.0 Training loss: 65.1526 Explore P: 0.2276
Episode: 576 Total reward: 38.0 Training loss: 2.0658 Explore P: 0.2268
Episode: 577 Total reward: 51.0 Training loss: 3.6103 Explore P: 0.2257
Episode: 578 Total reward: 114.0 Training loss: 65.0078 Explore P: 0.2233
Episode: 579 Total reward: 56.0 Training loss: 2.3501 Explore P: 0.2221
Episode: 580 Total reward: 107.0 Training loss: 83.3094 Explore P: 0.2198
Episode: 581 Total reward: 70.0 Training loss: 2.1958 Explore P: 0.2183
Episode: 582 Total reward: 52.0 Training loss: 1.0308 Explore P: 0.2173
Episode: 583 Total reward: 50.0 Training loss: 17.6974 Explore P: 0.2162
Episode: 584 Total reward: 60.0 Training loss: 159.7466 Explore P: 0.2150
Episode: 585 Total reward: 72.0 Training loss: 1.4425 Explore P: 0.2135
Episode: 586 Total reward: 27.0 Training loss: 1.2617 Explore P: 0.2130
Episode: 587 Total reward: 114.0 Training loss: 81.6067 Explore P: 0.2107
Episode: 588 Total reward: 49.0 Training loss: 98.5122 Explore P: 0.2097
Episode: 589 Total reward: 54.0 Training loss: 1.7434 Explore P: 0.2086
Episode: 590 Total reward: 94.0 Training loss: 1.0956 Explore P: 0.2068
Episode: 591 Total reward: 60.0 Training loss: 83.4579 Explore P: 0.2056
Episode: 592 Total reward: 93.0 Training loss: 68.5305 Explore P: 0.2038
Episode: 593 Total reward: 96.0 Training loss: 1.7056 Explore P: 0.2019
Episode: 594 Total reward: 199.0 Training loss: 1.4905 Explore P: 0.1981
Episode: 595 Total reward: 56.0 Training loss: 78.7930 Explore P: 0.1971
Episode: 596 Total reward: 193.0 Training loss: 1.1943 Explore P: 0.1935
Episode: 597 Total reward: 93.0 Training loss: 1.8433 Explore P: 0.1918
Episode: 598 Total reward: 131.0 Training loss: 1.5038 Explore P: 0.1895
Episode: 599 Total reward: 82.0 Training loss: 1.1673 Explore P: 0.1880
Episode: 600 Total reward: 105.0 Training loss: 1.9089 Explore P: 0.1861
Episode: 601 Total reward: 151.0 Training loss: 87.5024 Explore P: 0.1835
Episode: 602 Total reward: 102.0 Training loss: 91.5227 Explore P: 0.1817
Episode: 603 Total reward: 199.0 Training loss: 1.4945 Explore P: 0.1783
Episode: 604 Total reward: 134.0 Training loss: 0.8114 Explore P: 0.1761
Episode: 605 Total reward: 199.0 Training loss: 1.5449 Explore P: 0.1728
Episode: 606 Total reward: 148.0 Training loss: 1.3450 Explore P: 0.1704
Episode: 607 Total reward: 137.0 Training loss: 74.4065 Explore P: 0.1683
Episode: 608 Total reward: 197.0 Training loss: 1.0414 Explore P: 0.1652
Episode: 609 Total reward: 58.0 Training loss: 1.3747 Explore P: 0.1643
Episode: 610 Total reward: 136.0 Training loss: 0.8170 Explore P: 0.1622
Episode: 611 Total reward: 72.0 Training loss: 103.4339 Explore P: 0.1611
Episode: 612 Total reward: 187.0 Training loss: 1.1692 Explore P: 0.1583
Episode: 613 Total reward: 90.0 Training loss: 1.3541 Explore P: 0.1570
Episode: 614 Total reward: 104.0 Training loss: 109.3193 Explore P: 0.1554
Episode: 615 Total reward: 77.0 Training loss: 1.1821 Explore P: 0.1543
Episode: 616 Total reward: 72.0 Training loss: 77.9574 Explore P: 0.1533
Episode: 617 Total reward: 141.0 Training loss: 1.1169 Explore P: 0.1513
Episode: 618 Total reward: 117.0 Training loss: 0.9701 Explore P: 0.1496
Episode: 619 Total reward: 98.0 Training loss: 1.2583 Explore P: 0.1483
Episode: 620 Total reward: 122.0 Training loss: 0.8507 Explore P: 0.1466
Episode: 621 Total reward: 168.0 Training loss: 0.7452 Explore P: 0.1443
Episode: 622 Total reward: 142.0 Training loss: 1.9991 Explore P: 0.1424
Episode: 623 Total reward: 109.0 Training loss: 0.6944 Explore P: 0.1410
Episode: 624 Total reward: 87.0 Training loss: 110.2878 Explore P: 0.1399
Episode: 625 Total reward: 199.0 Training loss: 0.5487 Explore P: 0.1373
Episode: 626 Total reward: 117.0 Training loss: 0.6342 Explore P: 0.1358
Episode: 627 Total reward: 92.0 Training loss: 223.5027 Explore P: 0.1347
Episode: 628 Total reward: 154.0 Training loss: 0.5577 Explore P: 0.1328
Episode: 629 Total reward: 198.0 Training loss: 71.1664 Explore P: 0.1304
Episode: 630 Total reward: 139.0 Training loss: 71.8784 Explore P: 0.1287
Episode: 631 Total reward: 199.0 Training loss: 0.8878 Explore P: 0.1264
Episode: 632 Total reward: 111.0 Training loss: 0.5766 Explore P: 0.1251
Episode: 633 Total reward: 173.0 Training loss: 99.4098 Explore P: 0.1231
Episode: 634 Total reward: 121.0 Training loss: 0.6531 Explore P: 0.1217
Episode: 635 Total reward: 199.0 Training loss: 0.6695 Explore P: 0.1195
Episode: 636 Total reward: 132.0 Training loss: 0.8015 Explore P: 0.1181
Episode: 637 Total reward: 192.0 Training loss: 0.6730 Explore P: 0.1160
Episode: 638 Total reward: 84.0 Training loss: 0.4630 Explore P: 0.1152
Episode: 639 Total reward: 90.0 Training loss: 1.1245 Explore P: 0.1142
Episode: 640 Total reward: 123.0 Training loss: 1.1733 Explore P: 0.1129
Episode: 641 Total reward: 181.0 Training loss: 0.8045 Explore P: 0.1111
Episode: 642 Total reward: 138.0 Training loss: 151.5556 Explore P: 0.1097
Episode: 643 Total reward: 117.0 Training loss: 0.6142 Explore P: 0.1086
Episode: 644 Total reward: 120.0 Training loss: 0.5982 Explore P: 0.1074
Episode: 645 Total reward: 199.0 Training loss: 0.2857 Explore P: 0.1055
Episode: 646 Total reward: 199.0 Training loss: 94.7000 Explore P: 0.1036
Episode: 647 Total reward: 152.0 Training loss: 0.4983 Explore P: 0.1022
Episode: 648 Total reward: 199.0 Training loss: 0.8298 Explore P: 0.1004
Episode: 649 Total reward: 173.0 Training loss: 0.5446 Explore P: 0.0988
Episode: 650 Total reward: 161.0 Training loss: 0.6761 Explore P: 0.0974
Episode: 651 Total reward: 190.0 Training loss: 0.5108 Explore P: 0.0957
Episode: 652 Total reward: 199.0 Training loss: 0.7969 Explore P: 0.0940
Episode: 653 Total reward: 199.0 Training loss: 1.2819 Explore P: 0.0924
Episode: 654 Total reward: 117.0 Training loss: 0.5539 Explore P: 0.0914
Episode: 655 Total reward: 199.0 Training loss: 1.1575 Explore P: 0.0898
Episode: 656 Total reward: 199.0 Training loss: 0.5711 Explore P: 0.0883
Episode: 657 Total reward: 179.0 Training loss: 0.5403 Explore P: 0.0869
Episode: 658 Total reward: 146.0 Training loss: 0.4424 Explore P: 0.0858
Episode: 659 Total reward: 125.0 Training loss: 22.2252 Explore P: 0.0848
Episode: 660 Total reward: 164.0 Training loss: 0.2150 Explore P: 0.0836
Episode: 661 Total reward: 105.0 Training loss: 0.3097 Explore P: 0.0828
Episode: 662 Total reward: 133.0 Training loss: 252.3188 Explore P: 0.0819
Episode: 663 Total reward: 199.0 Training loss: 223.8138 Explore P: 0.0805
Episode: 664 Total reward: 177.0 Training loss: 0.4464 Explore P: 0.0792
Episode: 665 Total reward: 186.0 Training loss: 0.3956 Explore P: 0.0779
Episode: 666 Total reward: 171.0 Training loss: 0.3527 Explore P: 0.0768
Episode: 667 Total reward: 199.0 Training loss: 0.2213 Explore P: 0.0755
Episode: 668 Total reward: 199.0 Training loss: 0.1904 Explore P: 0.0742
Episode: 669 Total reward: 176.0 Training loss: 0.2467 Explore P: 0.0731
Episode: 670 Total reward: 166.0 Training loss: 0.2222 Explore P: 0.0720
Episode: 671 Total reward: 198.0 Training loss: 0.2921 Explore P: 0.0708
Episode: 672 Total reward: 199.0 Training loss: 0.2876 Explore P: 0.0696
Episode: 673 Total reward: 199.0 Training loss: 10.7188 Explore P: 0.0684
Episode: 674 Total reward: 199.0 Training loss: 5.2830 Explore P: 0.0673
Episode: 675 Total reward: 199.0 Training loss: 0.1321 Explore P: 0.0662
Episode: 676 Total reward: 199.0 Training loss: 0.2447 Explore P: 0.0650
Episode: 677 Total reward: 176.0 Training loss: 0.0784 Explore P: 0.0641
Episode: 678 Total reward: 199.0 Training loss: 0.1597 Explore P: 0.0630
Episode: 679 Total reward: 199.0 Training loss: 4.9633 Explore P: 0.0620
Episode: 680 Total reward: 199.0 Training loss: 0.1505 Explore P: 0.0610
Episode: 681 Total reward: 199.0 Training loss: 0.1880 Explore P: 0.0599
Episode: 682 Total reward: 199.0 Training loss: 0.2402 Explore P: 0.0590
Episode: 683 Total reward: 199.0 Training loss: 0.3214 Explore P: 0.0580
Episode: 684 Total reward: 199.0 Training loss: 0.1304 Explore P: 0.0571
Episode: 685 Total reward: 199.0 Training loss: 0.4188 Explore P: 0.0561
Episode: 686 Total reward: 199.0 Training loss: 3.5000 Explore P: 0.0552
Episode: 687 Total reward: 199.0 Training loss: 205.5815 Explore P: 0.0543
Episode: 688 Total reward: 199.0 Training loss: 0.4595 Explore P: 0.0535
Episode: 689 Total reward: 199.0 Training loss: 0.1587 Explore P: 0.0526
Episode: 690 Total reward: 199.0 Training loss: 0.3111 Explore P: 0.0518
Episode: 691 Total reward: 199.0 Training loss: 0.7004 Explore P: 0.0509
Episode: 692 Total reward: 199.0 Training loss: 0.1819 Explore P: 0.0501
Episode: 693 Total reward: 199.0 Training loss: 82.2188 Explore P: 0.0493
Episode: 694 Total reward: 199.0 Training loss: 0.1228 Explore P: 0.0486
Episode: 695 Total reward: 199.0 Training loss: 0.0829 Explore P: 0.0478
Episode: 696 Total reward: 199.0 Training loss: 0.0730 Explore P: 0.0471
Episode: 697 Total reward: 199.0 Training loss: 0.1774 Explore P: 0.0463
Episode: 698 Total reward: 199.0 Training loss: 0.1337 Explore P: 0.0456
Episode: 699 Total reward: 199.0 Training loss: 321.8691 Explore P: 0.0449
Episode: 700 Total reward: 199.0 Training loss: 0.1123 Explore P: 0.0442
Episode: 701 Total reward: 199.0 Training loss: 0.1910 Explore P: 0.0435
Episode: 702 Total reward: 199.0 Training loss: 0.1695 Explore P: 0.0429
Episode: 703 Total reward: 199.0 Training loss: 0.2358 Explore P: 0.0422
Episode: 704 Total reward: 199.0 Training loss: 0.3269 Explore P: 0.0416
Episode: 705 Total reward: 199.0 Training loss: 0.1725 Explore P: 0.0410
Episode: 706 Total reward: 145.0 Training loss: 0.2647 Explore P: 0.0405
Episode: 707 Total reward: 199.0 Training loss: 0.2173 Explore P: 0.0399
Episode: 708 Total reward: 139.0 Training loss: 343.5912 Explore P: 0.0395
Episode: 709 Total reward: 199.0 Training loss: 0.0917 Explore P: 0.0389
Episode: 710 Total reward: 199.0 Training loss: 0.2531 Explore P: 0.0384
Episode: 711 Total reward: 199.0 Training loss: 0.1312 Explore P: 0.0378
Episode: 712 Total reward: 199.0 Training loss: 0.1930 Explore P: 0.0373
Episode: 713 Total reward: 177.0 Training loss: 0.2399 Explore P: 0.0368
Episode: 714 Total reward: 199.0 Training loss: 0.1260 Explore P: 0.0363
Episode: 715 Total reward: 199.0 Training loss: 0.1777 Explore P: 0.0357
Episode: 716 Total reward: 199.0 Training loss: 0.1515 Explore P: 0.0352
Episode: 717 Total reward: 199.0 Training loss: 227.4391 Explore P: 0.0347
Episode: 718 Total reward: 199.0 Training loss: 0.2168 Explore P: 0.0342
Episode: 719 Total reward: 199.0 Training loss: 0.3875 Explore P: 0.0338
Episode: 720 Total reward: 199.0 Training loss: 0.1713 Explore P: 0.0333
Episode: 721 Total reward: 199.0 Training loss: 0.2906 Explore P: 0.0328
Episode: 722 Total reward: 199.0 Training loss: 221.5430 Explore P: 0.0324
Episode: 723 Total reward: 199.0 Training loss: 0.1427 Explore P: 0.0320
Episode: 724 Total reward: 199.0 Training loss: 0.1329 Explore P: 0.0315
Episode: 725 Total reward: 199.0 Training loss: 0.4398 Explore P: 0.0311
Episode: 726 Total reward: 199.0 Training loss: 185.5946 Explore P: 0.0307
Episode: 727 Total reward: 199.0 Training loss: 0.2287 Explore P: 0.0303
Episode: 728 Total reward: 199.0 Training loss: 0.3917 Explore P: 0.0299
Episode: 729 Total reward: 199.0 Training loss: 0.2028 Explore P: 0.0295
Episode: 730 Total reward: 199.0 Training loss: 0.3202 Explore P: 0.0291
Episode: 731 Total reward: 199.0 Training loss: 201.3362 Explore P: 0.0287
Episode: 732 Total reward: 199.0 Training loss: 0.2367 Explore P: 0.0284
Episode: 733 Total reward: 199.0 Training loss: 172.4727 Explore P: 0.0280
Episode: 734 Total reward: 199.0 Training loss: 270.6387 Explore P: 0.0276
Episode: 735 Total reward: 199.0 Training loss: 0.1768 Explore P: 0.0273
Episode: 736 Total reward: 199.0 Training loss: 0.1204 Explore P: 0.0269
Episode: 737 Total reward: 199.0 Training loss: 0.1259 Explore P: 0.0266
Episode: 738 Total reward: 199.0 Training loss: 0.1543 Explore P: 0.0263
Episode: 739 Total reward: 199.0 Training loss: 202.3708 Explore P: 0.0260
Episode: 740 Total reward: 199.0 Training loss: 0.1214 Explore P: 0.0257
Episode: 741 Total reward: 199.0 Training loss: 0.2038 Explore P: 0.0253
Episode: 742 Total reward: 199.0 Training loss: 0.1938 Explore P: 0.0250
Episode: 743 Total reward: 199.0 Training loss: 0.2160 Explore P: 0.0247
Episode: 744 Total reward: 112.0 Training loss: 0.2567 Explore P: 0.0246
Episode: 745 Total reward: 199.0 Training loss: 144.7135 Explore P: 0.0243
Episode: 746 Total reward: 160.0 Training loss: 0.1747 Explore P: 0.0241
Episode: 747 Total reward: 160.0 Training loss: 0.1915 Explore P: 0.0238
Episode: 748 Total reward: 199.0 Training loss: 0.2502 Explore P: 0.0236
Episode: 749 Total reward: 199.0 Training loss: 0.2659 Explore P: 0.0233
Episode: 750 Total reward: 199.0 Training loss: 244.4802 Explore P: 0.0230
Episode: 751 Total reward: 199.0 Training loss: 0.1780 Explore P: 0.0228
Episode: 752 Total reward: 199.0 Training loss: 0.3397 Explore P: 0.0225
Episode: 753 Total reward: 199.0 Training loss: 0.4223 Explore P: 0.0223
Episode: 754 Total reward: 199.0 Training loss: 0.2228 Explore P: 0.0220
Episode: 755 Total reward: 199.0 Training loss: 0.1641 Explore P: 0.0218
Episode: 756 Total reward: 199.0 Training loss: 0.2947 Explore P: 0.0216
Episode: 757 Total reward: 199.0 Training loss: 0.2514 Explore P: 0.0213
Episode: 758 Total reward: 199.0 Training loss: 0.2116 Explore P: 0.0211
Episode: 759 Total reward: 199.0 Training loss: 0.4722 Explore P: 0.0209
Episode: 760 Total reward: 199.0 Training loss: 0.2409 Explore P: 0.0207
Episode: 761 Total reward: 199.0 Training loss: 0.2810 Explore P: 0.0205
Episode: 762 Total reward: 199.0 Training loss: 0.1862 Explore P: 0.0203
Episode: 763 Total reward: 199.0 Training loss: 0.1798 Explore P: 0.0201
Episode: 764 Total reward: 199.0 Training loss: 0.2500 Explore P: 0.0199
Episode: 765 Total reward: 199.0 Training loss: 0.3915 Explore P: 0.0197
Episode: 766 Total reward: 199.0 Training loss: 0.3298 Explore P: 0.0195
Episode: 767 Total reward: 199.0 Training loss: 0.1909 Explore P: 0.0193
Episode: 768 Total reward: 199.0 Training loss: 0.2712 Explore P: 0.0191
Episode: 769 Total reward: 199.0 Training loss: 0.1506 Explore P: 0.0189
Episode: 770 Total reward: 199.0 Training loss: 0.3304 Explore P: 0.0188
Episode: 771 Total reward: 199.0 Training loss: 0.1959 Explore P: 0.0186
Episode: 772 Total reward: 199.0 Training loss: 0.0718 Explore P: 0.0184
Episode: 773 Total reward: 199.0 Training loss: 0.2345 Explore P: 0.0183
Episode: 774 Total reward: 199.0 Training loss: 0.2049 Explore P: 0.0181
Episode: 775 Total reward: 199.0 Training loss: 0.2696 Explore P: 0.0179
Episode: 776 Total reward: 156.0 Training loss: 0.3306 Explore P: 0.0178
Episode: 777 Total reward: 174.0 Training loss: 254.6922 Explore P: 0.0177
Episode: 778 Total reward: 145.0 Training loss: 0.3408 Explore P: 0.0176
Episode: 779 Total reward: 130.0 Training loss: 279.2807 Explore P: 0.0175
Episode: 780 Total reward: 146.0 Training loss: 0.2511 Explore P: 0.0174
Episode: 781 Total reward: 125.0 Training loss: 0.3278 Explore P: 0.0173
Episode: 782 Total reward: 141.0 Training loss: 0.2462 Explore P: 0.0172
Episode: 783 Total reward: 120.0 Training loss: 0.2553 Explore P: 0.0171
Episode: 784 Total reward: 122.0 Training loss: 0.5149 Explore P: 0.0170
Episode: 785 Total reward: 117.0 Training loss: 0.2632 Explore P: 0.0169
Episode: 786 Total reward: 137.0 Training loss: 0.4271 Explore P: 0.0168
Episode: 787 Total reward: 151.0 Training loss: 0.2308 Explore P: 0.0167
Episode: 788 Total reward: 110.0 Training loss: 0.3124 Explore P: 0.0166
Episode: 789 Total reward: 151.0 Training loss: 0.3061 Explore P: 0.0165
Episode: 790 Total reward: 94.0 Training loss: 0.2910 Explore P: 0.0165
Episode: 791 Total reward: 117.0 Training loss: 0.3845 Explore P: 0.0164
Episode: 792 Total reward: 78.0 Training loss: 0.5412 Explore P: 0.0164
Episode: 793 Total reward: 57.0 Training loss: 0.6391 Explore P: 0.0163
Episode: 794 Total reward: 81.0 Training loss: 0.4264 Explore P: 0.0163
Episode: 795 Total reward: 61.0 Training loss: 0.3142 Explore P: 0.0162
Episode: 796 Total reward: 62.0 Training loss: 0.6075 Explore P: 0.0162
Episode: 797 Total reward: 52.0 Training loss: 0.2833 Explore P: 0.0162
Episode: 798 Total reward: 112.0 Training loss: 0.3159 Explore P: 0.0161
Episode: 799 Total reward: 114.0 Training loss: 0.3859 Explore P: 0.0160
Episode: 800 Total reward: 101.0 Training loss: 0.2874 Explore P: 0.0160
Episode: 801 Total reward: 93.0 Training loss: 0.3681 Explore P: 0.0159
Episode: 802 Total reward: 138.0 Training loss: 0.4613 Explore P: 0.0158
Episode: 803 Total reward: 76.0 Training loss: 0.4725 Explore P: 0.0158
Episode: 804 Total reward: 77.0 Training loss: 67.3140 Explore P: 0.0157
Episode: 805 Total reward: 60.0 Training loss: 0.6838 Explore P: 0.0157
Episode: 806 Total reward: 104.0 Training loss: 0.2026 Explore P: 0.0156
Episode: 807 Total reward: 91.0 Training loss: 0.3297 Explore P: 0.0156
Episode: 808 Total reward: 81.0 Training loss: 0.3580 Explore P: 0.0155
Episode: 809 Total reward: 78.0 Training loss: 62.2224 Explore P: 0.0155
Episode: 810 Total reward: 199.0 Training loss: 0.2939 Explore P: 0.0154
Episode: 811 Total reward: 63.0 Training loss: 145.8595 Explore P: 0.0154
Episode: 812 Total reward: 70.0 Training loss: 0.6058 Explore P: 0.0153
Episode: 813 Total reward: 99.0 Training loss: 0.3721 Explore P: 0.0153
Episode: 814 Total reward: 139.0 Training loss: 0.4767 Explore P: 0.0152
Episode: 815 Total reward: 66.0 Training loss: 0.4237 Explore P: 0.0152
Episode: 816 Total reward: 77.0 Training loss: 0.6665 Explore P: 0.0151
Episode: 817 Total reward: 53.0 Training loss: 0.3662 Explore P: 0.0151
Episode: 818 Total reward: 90.0 Training loss: 0.2934 Explore P: 0.0151
Episode: 819 Total reward: 71.0 Training loss: 0.3364 Explore P: 0.0150
Episode: 820 Total reward: 51.0 Training loss: 0.6478 Explore P: 0.0150
Episode: 821 Total reward: 97.0 Training loss: 0.2673 Explore P: 0.0149
Episode: 822 Total reward: 179.0 Training loss: 0.7233 Explore P: 0.0149
Episode: 823 Total reward: 109.0 Training loss: 0.2974 Explore P: 0.0148
Episode: 824 Total reward: 67.0 Training loss: 0.6851 Explore P: 0.0148
Episode: 825 Total reward: 166.0 Training loss: 0.2027 Explore P: 0.0147
Episode: 826 Total reward: 173.0 Training loss: 0.4353 Explore P: 0.0146
Episode: 827 Total reward: 199.0 Training loss: 0.4769 Explore P: 0.0145
Episode: 828 Total reward: 199.0 Training loss: 0.4480 Explore P: 0.0144
Episode: 829 Total reward: 149.0 Training loss: 0.3190 Explore P: 0.0144
Episode: 830 Total reward: 174.0 Training loss: 54.4535 Explore P: 0.0143
Episode: 831 Total reward: 147.0 Training loss: 0.4074 Explore P: 0.0142
Episode: 832 Total reward: 102.0 Training loss: 0.3419 Explore P: 0.0142
Episode: 833 Total reward: 104.0 Training loss: 34.3423 Explore P: 0.0141
Episode: 834 Total reward: 199.0 Training loss: 0.3029 Explore P: 0.0141
Episode: 835 Total reward: 127.0 Training loss: 0.2307 Explore P: 0.0140
Episode: 836 Total reward: 199.0 Training loss: 60.2864 Explore P: 0.0139
Episode: 837 Total reward: 199.0 Training loss: 0.3248 Explore P: 0.0139
Episode: 838 Total reward: 199.0 Training loss: 0.3688 Explore P: 0.0138
Episode: 839 Total reward: 199.0 Training loss: 0.3527 Explore P: 0.0137
Episode: 840 Total reward: 199.0 Training loss: 0.2080 Explore P: 0.0136
Episode: 841 Total reward: 199.0 Training loss: 0.1928 Explore P: 0.0136
Episode: 842 Total reward: 199.0 Training loss: 0.2431 Explore P: 0.0135
Episode: 843 Total reward: 199.0 Training loss: 0.4224 Explore P: 0.0134
Episode: 844 Total reward: 133.0 Training loss: 0.5674 Explore P: 0.0134
Episode: 845 Total reward: 199.0 Training loss: 0.2919 Explore P: 0.0133
Episode: 846 Total reward: 199.0 Training loss: 0.2397 Explore P: 0.0132
Episode: 847 Total reward: 128.0 Training loss: 0.2242 Explore P: 0.0132
Episode: 848 Total reward: 199.0 Training loss: 0.2090 Explore P: 0.0131
Episode: 849 Total reward: 199.0 Training loss: 0.2277 Explore P: 0.0131
Episode: 850 Total reward: 199.0 Training loss: 0.2512 Explore P: 0.0130
Episode: 851 Total reward: 199.0 Training loss: 0.0764 Explore P: 0.0130
Episode: 852 Total reward: 199.0 Training loss: 0.3395 Explore P: 0.0129
Episode: 853 Total reward: 155.0 Training loss: 0.4203 Explore P: 0.0129
Episode: 854 Total reward: 199.0 Training loss: 0.2341 Explore P: 0.0128
Episode: 855 Total reward: 199.0 Training loss: 0.1727 Explore P: 0.0127
Episode: 856 Total reward: 199.0 Training loss: 0.4718 Explore P: 0.0127
Episode: 857 Total reward: 199.0 Training loss: 0.3474 Explore P: 0.0126
Episode: 858 Total reward: 199.0 Training loss: 0.4063 Explore P: 0.0126
Episode: 859 Total reward: 199.0 Training loss: 0.1666 Explore P: 0.0125
Episode: 860 Total reward: 168.0 Training loss: 0.3023 Explore P: 0.0125
Episode: 861 Total reward: 199.0 Training loss: 0.2498 Explore P: 0.0124
Episode: 862 Total reward: 199.0 Training loss: 0.2326 Explore P: 0.0124
Episode: 863 Total reward: 199.0 Training loss: 0.1776 Explore P: 0.0123
Episode: 864 Total reward: 164.0 Training loss: 0.2993 Explore P: 0.0123
Episode: 865 Total reward: 96.0 Training loss: 0.4428 Explore P: 0.0123
Episode: 866 Total reward: 129.0 Training loss: 0.2313 Explore P: 0.0123
Episode: 867 Total reward: 177.0 Training loss: 0.2048 Explore P: 0.0122
Episode: 868 Total reward: 146.0 Training loss: 13.7905 Explore P: 0.0122
Episode: 869 Total reward: 37.0 Training loss: 279.8771 Explore P: 0.0122
Episode: 870 Total reward: 119.0 Training loss: 0.3275 Explore P: 0.0122
Episode: 871 Total reward: 199.0 Training loss: 0.2413 Explore P: 0.0121
Episode: 872 Total reward: 100.0 Training loss: 0.6281 Explore P: 0.0121
Episode: 873 Total reward: 102.0 Training loss: 0.6161 Explore P: 0.0121
Episode: 874 Total reward: 59.0 Training loss: 225.2600 Explore P: 0.0121
Episode: 875 Total reward: 74.0 Training loss: 0.1594 Explore P: 0.0120
Episode: 876 Total reward: 35.0 Training loss: 0.3518 Explore P: 0.0120
Episode: 877 Total reward: 28.0 Training loss: 0.2791 Explore P: 0.0120
Episode: 878 Total reward: 199.0 Training loss: 0.2409 Explore P: 0.0120
Episode: 879 Total reward: 48.0 Training loss: 0.3400 Explore P: 0.0120
Episode: 880 Total reward: 53.0 Training loss: 0.2953 Explore P: 0.0120
Episode: 881 Total reward: 199.0 Training loss: 355.9628 Explore P: 0.0119
Episode: 882 Total reward: 199.0 Training loss: 0.1996 Explore P: 0.0119
Episode: 883 Total reward: 140.0 Training loss: 0.3318 Explore P: 0.0119
Episode: 884 Total reward: 48.0 Training loss: 0.3780 Explore P: 0.0119
Episode: 885 Total reward: 113.0 Training loss: 0.3141 Explore P: 0.0118
Episode: 886 Total reward: 80.0 Training loss: 0.6175 Explore P: 0.0118
Episode: 887 Total reward: 38.0 Training loss: 0.5264 Explore P: 0.0118
Episode: 888 Total reward: 199.0 Training loss: 0.4841 Explore P: 0.0118
Episode: 889 Total reward: 66.0 Training loss: 274.5448 Explore P: 0.0118
Episode: 890 Total reward: 36.0 Training loss: 0.3717 Explore P: 0.0118
Episode: 891 Total reward: 44.0 Training loss: 0.4638 Explore P: 0.0118
Episode: 892 Total reward: 67.0 Training loss: 261.9496 Explore P: 0.0117
Episode: 893 Total reward: 38.0 Training loss: 0.2213 Explore P: 0.0117
Episode: 894 Total reward: 194.0 Training loss: 0.2663 Explore P: 0.0117
Episode: 895 Total reward: 41.0 Training loss: 0.3816 Explore P: 0.0117
Episode: 896 Total reward: 74.0 Training loss: 0.5183 Explore P: 0.0117
Episode: 897 Total reward: 199.0 Training loss: 0.3639 Explore P: 0.0116
Episode: 898 Total reward: 191.0 Training loss: 0.3027 Explore P: 0.0116
Episode: 899 Total reward: 199.0 Training loss: 0.5515 Explore P: 0.0116
Episode: 900 Total reward: 117.0 Training loss: 0.7317 Explore P: 0.0116
Episode: 901 Total reward: 119.0 Training loss: 0.2371 Explore P: 0.0115
Episode: 902 Total reward: 147.0 Training loss: 0.2668 Explore P: 0.0115
Episode: 903 Total reward: 186.0 Training loss: 0.3363 Explore P: 0.0115
Episode: 904 Total reward: 199.0 Training loss: 0.3985 Explore P: 0.0115
Episode: 905 Total reward: 199.0 Training loss: 0.3862 Explore P: 0.0114
Episode: 906 Total reward: 199.0 Training loss: 0.3343 Explore P: 0.0114
Episode: 907 Total reward: 199.0 Training loss: 0.3174 Explore P: 0.0114
Episode: 908 Total reward: 199.0 Training loss: 0.3662 Explore P: 0.0114
Episode: 909 Total reward: 183.0 Training loss: 58.8152 Explore P: 0.0113
Episode: 910 Total reward: 195.0 Training loss: 54.0388 Explore P: 0.0113
Episode: 911 Total reward: 199.0 Training loss: 110.9607 Explore P: 0.0113
Episode: 912 Total reward: 199.0 Training loss: 106.3211 Explore P: 0.0113
Episode: 913 Total reward: 199.0 Training loss: 0.5284 Explore P: 0.0112
Episode: 914 Total reward: 199.0 Training loss: 0.2013 Explore P: 0.0112
Episode: 915 Total reward: 199.0 Training loss: 0.3361 Explore P: 0.0112
Episode: 916 Total reward: 199.0 Training loss: 121.3566 Explore P: 0.0112
Episode: 917 Total reward: 199.0 Training loss: 0.5538 Explore P: 0.0111
Episode: 918 Total reward: 175.0 Training loss: 0.4782 Explore P: 0.0111
Episode: 919 Total reward: 171.0 Training loss: 0.3421 Explore P: 0.0111
Episode: 920 Total reward: 199.0 Training loss: 0.6881 Explore P: 0.0111
Episode: 921 Total reward: 163.0 Training loss: 0.5084 Explore P: 0.0111
Episode: 922 Total reward: 122.0 Training loss: 0.6592 Explore P: 0.0110
Episode: 923 Total reward: 107.0 Training loss: 1.4845 Explore P: 0.0110
Episode: 924 Total reward: 109.0 Training loss: 0.5529 Explore P: 0.0110
Episode: 925 Total reward: 33.0 Training loss: 1.3130 Explore P: 0.0110
Episode: 926 Total reward: 23.0 Training loss: 1.5874 Explore P: 0.0110
Episode: 927 Total reward: 27.0 Training loss: 1.0277 Explore P: 0.0110
Episode: 928 Total reward: 27.0 Training loss: 0.6526 Explore P: 0.0110
Episode: 929 Total reward: 32.0 Training loss: 1.1745 Explore P: 0.0110
Episode: 930 Total reward: 22.0 Training loss: 384.5526 Explore P: 0.0110
Episode: 931 Total reward: 25.0 Training loss: 0.9051 Explore P: 0.0110
Episode: 932 Total reward: 32.0 Training loss: 0.5896 Explore P: 0.0110
Episode: 933 Total reward: 100.0 Training loss: 1.2416 Explore P: 0.0110
Episode: 934 Total reward: 30.0 Training loss: 0.7511 Explore P: 0.0110
Episode: 935 Total reward: 34.0 Training loss: 0.6497 Explore P: 0.0110
Episode: 936 Total reward: 25.0 Training loss: 0.7747 Explore P: 0.0110
Episode: 937 Total reward: 23.0 Training loss: 1.3833 Explore P: 0.0110
Episode: 938 Total reward: 14.0 Training loss: 1.3215 Explore P: 0.0110
Episode: 939 Total reward: 21.0 Training loss: 1.2767 Explore P: 0.0110
Episode: 940 Total reward: 16.0 Training loss: 213.9854 Explore P: 0.0110
Episode: 941 Total reward: 28.0 Training loss: 1.3207 Explore P: 0.0110
Episode: 942 Total reward: 18.0 Training loss: 1.5677 Explore P: 0.0110
Episode: 943 Total reward: 26.0 Training loss: 1.1375 Explore P: 0.0110
Episode: 944 Total reward: 30.0 Training loss: 0.9486 Explore P: 0.0110
Episode: 945 Total reward: 23.0 Training loss: 0.5095 Explore P: 0.0110
Episode: 946 Total reward: 33.0 Training loss: 323.8259 Explore P: 0.0110
Episode: 947 Total reward: 42.0 Training loss: 0.8213 Explore P: 0.0110
Episode: 948 Total reward: 25.0 Training loss: 269.0544 Explore P: 0.0110
Episode: 949 Total reward: 24.0 Training loss: 0.6782 Explore P: 0.0109
Episode: 950 Total reward: 28.0 Training loss: 0.9634 Explore P: 0.0109
Episode: 951 Total reward: 28.0 Training loss: 1.1198 Explore P: 0.0109
Episode: 952 Total reward: 31.0 Training loss: 1.4523 Explore P: 0.0109
Episode: 953 Total reward: 98.0 Training loss: 0.7859 Explore P: 0.0109
Episode: 954 Total reward: 110.0 Training loss: 1.0333 Explore P: 0.0109
Episode: 955 Total reward: 44.0 Training loss: 0.8561 Explore P: 0.0109
Episode: 956 Total reward: 129.0 Training loss: 220.1947 Explore P: 0.0109
Episode: 957 Total reward: 175.0 Training loss: 0.5687 Explore P: 0.0109
Episode: 958 Total reward: 199.0 Training loss: 0.3421 Explore P: 0.0109
Episode: 959 Total reward: 199.0 Training loss: 0.5674 Explore P: 0.0109
Episode: 960 Total reward: 199.0 Training loss: 0.6841 Explore P: 0.0108
Episode: 961 Total reward: 199.0 Training loss: 0.6956 Explore P: 0.0108
Episode: 962 Total reward: 199.0 Training loss: 0.3168 Explore P: 0.0108
Episode: 963 Total reward: 199.0 Training loss: 0.3495 Explore P: 0.0108
Episode: 964 Total reward: 199.0 Training loss: 0.4725 Explore P: 0.0108
Episode: 965 Total reward: 199.0 Training loss: 0.5653 Explore P: 0.0108
Episode: 966 Total reward: 199.0 Training loss: 0.5692 Explore P: 0.0107
Episode: 967 Total reward: 199.0 Training loss: 0.4429 Explore P: 0.0107
Episode: 968 Total reward: 199.0 Training loss: 1.0512 Explore P: 0.0107
Episode: 969 Total reward: 199.0 Training loss: 0.6886 Explore P: 0.0107
Episode: 970 Total reward: 199.0 Training loss: 0.7216 Explore P: 0.0107
Episode: 971 Total reward: 199.0 Training loss: 0.5152 Explore P: 0.0107
Episode: 972 Total reward: 199.0 Training loss: 0.5312 Explore P: 0.0107
Episode: 973 Total reward: 199.0 Training loss: 156.7826 Explore P: 0.0106
Episode: 974 Total reward: 199.0 Training loss: 1.1234 Explore P: 0.0106
Episode: 975 Total reward: 199.0 Training loss: 1.2598 Explore P: 0.0106
Episode: 976 Total reward: 199.0 Training loss: 0.9674 Explore P: 0.0106
Episode: 977 Total reward: 37.0 Training loss: 1.1054 Explore P: 0.0106
Episode: 978 Total reward: 38.0 Training loss: 1.2896 Explore P: 0.0106
Episode: 979 Total reward: 20.0 Training loss: 1.5647 Explore P: 0.0106
Episode: 980 Total reward: 22.0 Training loss: 1.8234 Explore P: 0.0106
Episode: 981 Total reward: 16.0 Training loss: 1.2526 Explore P: 0.0106
Episode: 982 Total reward: 27.0 Training loss: 1.5715 Explore P: 0.0106
Episode: 983 Total reward: 28.0 Training loss: 1.1074 Explore P: 0.0106
Episode: 984 Total reward: 33.0 Training loss: 1.0480 Explore P: 0.0106
Episode: 985 Total reward: 28.0 Training loss: 1.8179 Explore P: 0.0106
Episode: 986 Total reward: 27.0 Training loss: 0.8924 Explore P: 0.0106
Episode: 987 Total reward: 21.0 Training loss: 1.4659 Explore P: 0.0106
Episode: 988 Total reward: 22.0 Training loss: 2.3099 Explore P: 0.0106
Episode: 989 Total reward: 16.0 Training loss: 2.1198 Explore P: 0.0106
Episode: 990 Total reward: 18.0 Training loss: 1.8322 Explore P: 0.0106
Episode: 991 Total reward: 15.0 Training loss: 216.7510 Explore P: 0.0106
Episode: 992 Total reward: 13.0 Training loss: 153.9884 Explore P: 0.0106
Episode: 993 Total reward: 18.0 Training loss: 1.6350 Explore P: 0.0106
Episode: 994 Total reward: 23.0 Training loss: 1.9123 Explore P: 0.0106
Episode: 995 Total reward: 199.0 Training loss: 1.9583 Explore P: 0.0106
Episode: 996 Total reward: 21.0 Training loss: 247.6024 Explore P: 0.0106
Episode: 997 Total reward: 25.0 Training loss: 2.5814 Explore P: 0.0106
Episode: 998 Total reward: 28.0 Training loss: 1.2624 Explore P: 0.0106
Episode: 999 Total reward: 29.0 Training loss: 426.7271 Explore P: 0.0106

Visualizing training

Below I plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [12]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Rolling mean over a window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
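
As a quick sanity check of the cumulative-sum trick, here is a minimal sketch on a made-up reward list (the values are illustrative only, not taken from the training run). It shows that the smoothed array is shorter than the input by N - 1 entries, which is why the plotting cell aligns it with the last len(smoothed_rews) episodes.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum rolling mean as defined in the cell above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

toy_rewards = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative values only
smoothed = running_mean(toy_rewards, 2)
print(smoothed)       # [1.5 2.5 3.5 4.5]
print(len(smoothed))  # 4, i.e. len(toy_rewards) - N + 1
```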

In [13]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[13]:
<matplotlib.text.Text at 0x21b87530f60>

Testing

Let's check out how our trained agent plays the game.


In [ ]:
test_episodes = 10
test_max_steps = 400
# Start a fresh episode so `state` is defined here rather than left over from training
state = env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes + 1):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1
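
The action-selection step in the loop above is purely greedy: at test time there is no exploration, so the agent always takes the action with the largest predicted Q-value. Here is a minimal NumPy sketch of that step, using a made-up Q-value array in place of the network's output (fake_Qs and greedy_action are hypothetical names used only for illustration).

```python
import numpy as np

def greedy_action(q_values):
    # q_values has shape (1, n_actions); pick the index of the largest value
    return int(np.argmax(q_values))

# Cart-Pole states have 4 components; the network expects a batch dimension
state = np.zeros(4)
batched = state.reshape((1, *state.shape))
print(batched.shape)  # (1, 4)

fake_Qs = np.array([[0.3, 1.2]])  # stand-in for sess.run(mainQN.output, ...)
print(greedy_action(fake_Qs))     # 1
```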

In [15]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small state vector like the one we're using here, though, you'd pass the screen images through convolutional layers so the network learns the state from the raw pixels.
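
To make the idea of getting the state from screen images a bit more concrete, here is a rough sketch of the kind of preprocessing that usually happens before the convolutional layers: convert each frame to grayscale, shrink it, and stack the last few frames so the network can infer motion. The frame size (210x160 RGB), the simple stride-2 downsampling, the stack of 4 frames, and the names preprocess and FrameStack are all assumptions for illustration; the paper uses its own resizing scheme.

```python
from collections import deque
import numpy as np

def preprocess(frame):
    # Average the RGB channels to get grayscale, then keep every second pixel
    gray = frame.mean(axis=2)
    small = gray[::2, ::2]
    return (small / 255.0).astype(np.float32)

class FrameStack:
    # Hypothetical helper that keeps the most recent n_frames processed frames
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of an episode
        processed = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return self.state()

    def step(self, frame):
        # Push the newest frame; the oldest one drops out automatically
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        # Shape (height, width, n_frames), usable as input to conv layers
        return np.stack(list(self.frames), axis=2)

# Random pixels standing in for a real Atari screen
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
stack = FrameStack()
state = stack.reset(fake_frame)
print(state.shape)  # (105, 80, 4)
```

The stacked array plays the role that the 4-number state vector plays in Cart-Pole: it is what you would feed into the Q-network.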

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.

