Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included gym as a submodule, so you can run git submodule --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-16 12:02:14,222] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing in an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [8]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().


In [13]:
env.close()

If you ran the simulation above, we can look at the rewards:


In [12]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max{Q(s', a')} $$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [14]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [15]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor

Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [16]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number experiences to pretrain the memory

In [17]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [18]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [19]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 28.0 Training loss: 1.1128 Explore P: 0.9972
Episode: 2 Total reward: 13.0 Training loss: 1.0582 Explore P: 0.9959
Episode: 3 Total reward: 12.0 Training loss: 1.0461 Explore P: 0.9948
Episode: 4 Total reward: 31.0 Training loss: 1.0763 Explore P: 0.9917
Episode: 5 Total reward: 14.0 Training loss: 1.0295 Explore P: 0.9903
Episode: 6 Total reward: 20.0 Training loss: 1.0620 Explore P: 0.9884
Episode: 7 Total reward: 12.0 Training loss: 0.9539 Explore P: 0.9872
Episode: 8 Total reward: 50.0 Training loss: 1.2083 Explore P: 0.9823
Episode: 9 Total reward: 14.0 Training loss: 1.0840 Explore P: 0.9810
Episode: 10 Total reward: 8.0 Training loss: 1.1570 Explore P: 0.9802
Episode: 11 Total reward: 16.0 Training loss: 1.2886 Explore P: 0.9787
Episode: 12 Total reward: 12.0 Training loss: 1.1632 Explore P: 0.9775
Episode: 13 Total reward: 54.0 Training loss: 1.1754 Explore P: 0.9723
Episode: 14 Total reward: 27.0 Training loss: 1.3688 Explore P: 0.9697
Episode: 15 Total reward: 15.0 Training loss: 1.2364 Explore P: 0.9682
Episode: 16 Total reward: 10.0 Training loss: 1.4808 Explore P: 0.9673
Episode: 17 Total reward: 12.0 Training loss: 2.3937 Explore P: 0.9661
Episode: 18 Total reward: 18.0 Training loss: 1.4531 Explore P: 0.9644
Episode: 19 Total reward: 13.0 Training loss: 1.7278 Explore P: 0.9632
Episode: 20 Total reward: 25.0 Training loss: 1.9328 Explore P: 0.9608
Episode: 21 Total reward: 11.0 Training loss: 1.9117 Explore P: 0.9598
Episode: 22 Total reward: 20.0 Training loss: 1.9544 Explore P: 0.9579
Episode: 23 Total reward: 18.0 Training loss: 1.8065 Explore P: 0.9562
Episode: 24 Total reward: 38.0 Training loss: 2.1687 Explore P: 0.9526
Episode: 25 Total reward: 21.0 Training loss: 2.5309 Explore P: 0.9506
Episode: 26 Total reward: 35.0 Training loss: 3.0786 Explore P: 0.9473
Episode: 27 Total reward: 17.0 Training loss: 2.7802 Explore P: 0.9457
Episode: 28 Total reward: 39.0 Training loss: 6.5382 Explore P: 0.9421
Episode: 29 Total reward: 20.0 Training loss: 2.1318 Explore P: 0.9402
Episode: 30 Total reward: 27.0 Training loss: 4.5146 Explore P: 0.9377
Episode: 31 Total reward: 15.0 Training loss: 3.0608 Explore P: 0.9363
Episode: 32 Total reward: 12.0 Training loss: 4.3354 Explore P: 0.9352
Episode: 33 Total reward: 12.0 Training loss: 10.6429 Explore P: 0.9341
Episode: 34 Total reward: 21.0 Training loss: 4.2618 Explore P: 0.9321
Episode: 35 Total reward: 14.0 Training loss: 4.2755 Explore P: 0.9309
Episode: 36 Total reward: 24.0 Training loss: 6.0969 Explore P: 0.9286
Episode: 37 Total reward: 13.0 Training loss: 7.1275 Explore P: 0.9275
Episode: 38 Total reward: 12.0 Training loss: 7.5950 Explore P: 0.9264
Episode: 39 Total reward: 23.0 Training loss: 18.3120 Explore P: 0.9243
Episode: 40 Total reward: 22.0 Training loss: 10.0711 Explore P: 0.9222
Episode: 41 Total reward: 18.0 Training loss: 9.7195 Explore P: 0.9206
Episode: 42 Total reward: 21.0 Training loss: 5.1165 Explore P: 0.9187
Episode: 43 Total reward: 29.0 Training loss: 5.7434 Explore P: 0.9161
Episode: 44 Total reward: 19.0 Training loss: 9.6390 Explore P: 0.9143
Episode: 45 Total reward: 21.0 Training loss: 22.7470 Explore P: 0.9124
Episode: 46 Total reward: 12.0 Training loss: 6.0279 Explore P: 0.9114
Episode: 47 Total reward: 31.0 Training loss: 7.2099 Explore P: 0.9086
Episode: 48 Total reward: 10.0 Training loss: 8.4758 Explore P: 0.9077
Episode: 49 Total reward: 22.0 Training loss: 39.3540 Explore P: 0.9057
Episode: 50 Total reward: 37.0 Training loss: 67.2643 Explore P: 0.9024
Episode: 51 Total reward: 16.0 Training loss: 25.8195 Explore P: 0.9010
Episode: 52 Total reward: 13.0 Training loss: 6.2283 Explore P: 0.8998
Episode: 53 Total reward: 14.0 Training loss: 22.4878 Explore P: 0.8986
Episode: 54 Total reward: 22.0 Training loss: 18.6191 Explore P: 0.8966
Episode: 55 Total reward: 33.0 Training loss: 20.9156 Explore P: 0.8937
Episode: 56 Total reward: 16.0 Training loss: 64.4260 Explore P: 0.8923
Episode: 57 Total reward: 12.0 Training loss: 26.9148 Explore P: 0.8912
Episode: 58 Total reward: 8.0 Training loss: 10.2481 Explore P: 0.8905
Episode: 59 Total reward: 28.0 Training loss: 49.1300 Explore P: 0.8881
Episode: 60 Total reward: 14.0 Training loss: 41.1967 Explore P: 0.8868
Episode: 61 Total reward: 15.0 Training loss: 11.2506 Explore P: 0.8855
Episode: 62 Total reward: 15.0 Training loss: 44.1689 Explore P: 0.8842
Episode: 63 Total reward: 16.0 Training loss: 14.6805 Explore P: 0.8828
Episode: 64 Total reward: 15.0 Training loss: 50.8586 Explore P: 0.8815
Episode: 65 Total reward: 15.0 Training loss: 106.2600 Explore P: 0.8802
Episode: 66 Total reward: 20.0 Training loss: 31.1651 Explore P: 0.8784
Episode: 67 Total reward: 10.0 Training loss: 11.4001 Explore P: 0.8776
Episode: 68 Total reward: 16.0 Training loss: 10.9336 Explore P: 0.8762
Episode: 69 Total reward: 11.0 Training loss: 18.5061 Explore P: 0.8752
Episode: 70 Total reward: 13.0 Training loss: 27.4199 Explore P: 0.8741
Episode: 71 Total reward: 10.0 Training loss: 14.2745 Explore P: 0.8733
Episode: 72 Total reward: 12.0 Training loss: 11.1916 Explore P: 0.8722
Episode: 73 Total reward: 15.0 Training loss: 14.1849 Explore P: 0.8709
Episode: 74 Total reward: 14.0 Training loss: 216.5710 Explore P: 0.8697
Episode: 75 Total reward: 12.0 Training loss: 57.3340 Explore P: 0.8687
Episode: 76 Total reward: 11.0 Training loss: 25.1038 Explore P: 0.8677
Episode: 77 Total reward: 32.0 Training loss: 16.1531 Explore P: 0.8650
Episode: 78 Total reward: 16.0 Training loss: 15.7502 Explore P: 0.8636
Episode: 79 Total reward: 14.0 Training loss: 23.7233 Explore P: 0.8624
Episode: 80 Total reward: 13.0 Training loss: 22.3965 Explore P: 0.8613
Episode: 81 Total reward: 12.0 Training loss: 289.5641 Explore P: 0.8603
Episode: 82 Total reward: 13.0 Training loss: 51.0562 Explore P: 0.8592
Episode: 83 Total reward: 17.0 Training loss: 26.9322 Explore P: 0.8578
Episode: 84 Total reward: 16.0 Training loss: 33.2146 Explore P: 0.8564
Episode: 85 Total reward: 9.0 Training loss: 25.2714 Explore P: 0.8556
Episode: 86 Total reward: 11.0 Training loss: 38.2944 Explore P: 0.8547
Episode: 87 Total reward: 17.0 Training loss: 67.3021 Explore P: 0.8533
Episode: 88 Total reward: 26.0 Training loss: 276.9796 Explore P: 0.8511
Episode: 89 Total reward: 36.0 Training loss: 476.6077 Explore P: 0.8481
Episode: 90 Total reward: 19.0 Training loss: 474.6125 Explore P: 0.8465
Episode: 91 Total reward: 12.0 Training loss: 249.3454 Explore P: 0.8455
Episode: 92 Total reward: 19.0 Training loss: 23.0799 Explore P: 0.8439
Episode: 93 Total reward: 8.0 Training loss: 405.2763 Explore P: 0.8432
Episode: 94 Total reward: 29.0 Training loss: 42.1307 Explore P: 0.8408
Episode: 95 Total reward: 16.0 Training loss: 33.4296 Explore P: 0.8395
Episode: 96 Total reward: 15.0 Training loss: 229.8984 Explore P: 0.8382
Episode: 97 Total reward: 34.0 Training loss: 67.3444 Explore P: 0.8354
Episode: 98 Total reward: 14.0 Training loss: 31.8489 Explore P: 0.8343
Episode: 99 Total reward: 37.0 Training loss: 17.0930 Explore P: 0.8312
Episode: 100 Total reward: 16.0 Training loss: 252.8731 Explore P: 0.8299
Episode: 101 Total reward: 28.0 Training loss: 29.2210 Explore P: 0.8276
Episode: 102 Total reward: 14.0 Training loss: 82.8122 Explore P: 0.8265
Episode: 103 Total reward: 30.0 Training loss: 32.1660 Explore P: 0.8240
Episode: 104 Total reward: 13.0 Training loss: 382.0585 Explore P: 0.8230
Episode: 105 Total reward: 15.0 Training loss: 196.2258 Explore P: 0.8218
Episode: 106 Total reward: 13.0 Training loss: 31.7011 Explore P: 0.8207
Episode: 107 Total reward: 26.0 Training loss: 276.4936 Explore P: 0.8186
Episode: 108 Total reward: 13.0 Training loss: 44.2406 Explore P: 0.8175
Episode: 109 Total reward: 16.0 Training loss: 229.7859 Explore P: 0.8163
Episode: 110 Total reward: 37.0 Training loss: 260.5188 Explore P: 0.8133
Episode: 111 Total reward: 17.0 Training loss: 29.8027 Explore P: 0.8119
Episode: 112 Total reward: 9.0 Training loss: 28.0819 Explore P: 0.8112
Episode: 113 Total reward: 27.0 Training loss: 32.1152 Explore P: 0.8090
Episode: 114 Total reward: 16.0 Training loss: 35.4067 Explore P: 0.8078
Episode: 115 Total reward: 14.0 Training loss: 260.8397 Explore P: 0.8066
Episode: 116 Total reward: 12.0 Training loss: 259.1508 Explore P: 0.8057
Episode: 117 Total reward: 12.0 Training loss: 44.7807 Explore P: 0.8047
Episode: 118 Total reward: 17.0 Training loss: 105.8721 Explore P: 0.8034
Episode: 119 Total reward: 11.0 Training loss: 27.2619 Explore P: 0.8025
Episode: 120 Total reward: 15.0 Training loss: 633.1356 Explore P: 0.8013
Episode: 121 Total reward: 12.0 Training loss: 33.7090 Explore P: 0.8004
Episode: 122 Total reward: 20.0 Training loss: 242.3188 Explore P: 0.7988
Episode: 123 Total reward: 12.0 Training loss: 250.3530 Explore P: 0.7978
Episode: 124 Total reward: 18.0 Training loss: 32.1946 Explore P: 0.7964
Episode: 125 Total reward: 12.0 Training loss: 356.6527 Explore P: 0.7955
Episode: 126 Total reward: 36.0 Training loss: 116.7384 Explore P: 0.7927
Episode: 127 Total reward: 35.0 Training loss: 156.5303 Explore P: 0.7899
Episode: 128 Total reward: 12.0 Training loss: 129.2966 Explore P: 0.7890
Episode: 129 Total reward: 8.0 Training loss: 38.8228 Explore P: 0.7884
Episode: 130 Total reward: 32.0 Training loss: 648.6605 Explore P: 0.7859
Episode: 131 Total reward: 24.0 Training loss: 314.4232 Explore P: 0.7840
Episode: 132 Total reward: 12.0 Training loss: 22.7135 Explore P: 0.7831
Episode: 133 Total reward: 23.0 Training loss: 360.7065 Explore P: 0.7813
Episode: 134 Total reward: 17.0 Training loss: 306.4225 Explore P: 0.7800
Episode: 135 Total reward: 18.0 Training loss: 40.7809 Explore P: 0.7786
Episode: 136 Total reward: 17.0 Training loss: 35.6588 Explore P: 0.7773
Episode: 137 Total reward: 11.0 Training loss: 29.4453 Explore P: 0.7765
Episode: 138 Total reward: 25.0 Training loss: 28.1937 Explore P: 0.7746
Episode: 139 Total reward: 9.0 Training loss: 549.6845 Explore P: 0.7739
Episode: 140 Total reward: 9.0 Training loss: 26.7974 Explore P: 0.7732
Episode: 141 Total reward: 17.0 Training loss: 139.9556 Explore P: 0.7719
Episode: 142 Total reward: 11.0 Training loss: 505.1475 Explore P: 0.7711
Episode: 143 Total reward: 48.0 Training loss: 239.8251 Explore P: 0.7674
Episode: 144 Total reward: 16.0 Training loss: 125.1043 Explore P: 0.7662
Episode: 145 Total reward: 21.0 Training loss: 315.1842 Explore P: 0.7646
Episode: 146 Total reward: 24.0 Training loss: 16.2679 Explore P: 0.7628
Episode: 147 Total reward: 14.0 Training loss: 22.3072 Explore P: 0.7618
Episode: 148 Total reward: 25.0 Training loss: 136.3477 Explore P: 0.7599
Episode: 149 Total reward: 17.0 Training loss: 653.1281 Explore P: 0.7586
Episode: 150 Total reward: 18.0 Training loss: 21.2228 Explore P: 0.7573
Episode: 151 Total reward: 11.0 Training loss: 488.9940 Explore P: 0.7564
Episode: 152 Total reward: 40.0 Training loss: 388.4138 Explore P: 0.7535
Episode: 153 Total reward: 11.0 Training loss: 134.5174 Explore P: 0.7526
Episode: 154 Total reward: 12.0 Training loss: 273.2594 Explore P: 0.7517
Episode: 155 Total reward: 12.0 Training loss: 24.8822 Explore P: 0.7509
Episode: 156 Total reward: 11.0 Training loss: 166.8619 Explore P: 0.7500
Episode: 157 Total reward: 13.0 Training loss: 370.2747 Explore P: 0.7491
Episode: 158 Total reward: 10.0 Training loss: 234.8537 Explore P: 0.7483
Episode: 159 Total reward: 30.0 Training loss: 15.9280 Explore P: 0.7461
Episode: 160 Total reward: 12.0 Training loss: 17.8128 Explore P: 0.7452
Episode: 161 Total reward: 14.0 Training loss: 16.2393 Explore P: 0.7442
Episode: 162 Total reward: 10.0 Training loss: 325.8303 Explore P: 0.7435
Episode: 163 Total reward: 19.0 Training loss: 14.9757 Explore P: 0.7421
Episode: 164 Total reward: 25.0 Training loss: 179.5653 Explore P: 0.7403
Episode: 165 Total reward: 41.0 Training loss: 221.3130 Explore P: 0.7373
Episode: 166 Total reward: 11.0 Training loss: 12.2919 Explore P: 0.7365
Episode: 167 Total reward: 37.0 Training loss: 438.5371 Explore P: 0.7338
Episode: 168 Total reward: 9.0 Training loss: 413.1923 Explore P: 0.7331
Episode: 169 Total reward: 11.0 Training loss: 430.1524 Explore P: 0.7323
Episode: 170 Total reward: 18.0 Training loss: 370.0018 Explore P: 0.7310
Episode: 171 Total reward: 11.0 Training loss: 176.7794 Explore P: 0.7303
Episode: 172 Total reward: 20.0 Training loss: 163.4337 Explore P: 0.7288
Episode: 173 Total reward: 19.0 Training loss: 174.3003 Explore P: 0.7275
Episode: 174 Total reward: 12.0 Training loss: 262.5769 Explore P: 0.7266
Episode: 175 Total reward: 20.0 Training loss: 150.5704 Explore P: 0.7252
Episode: 176 Total reward: 20.0 Training loss: 147.5637 Explore P: 0.7237
Episode: 177 Total reward: 15.0 Training loss: 187.4657 Explore P: 0.7227
Episode: 178 Total reward: 12.0 Training loss: 402.5753 Explore P: 0.7218
Episode: 179 Total reward: 27.0 Training loss: 5.7557 Explore P: 0.7199
Episode: 180 Total reward: 11.0 Training loss: 344.9215 Explore P: 0.7191
Episode: 181 Total reward: 22.0 Training loss: 176.2465 Explore P: 0.7175
Episode: 182 Total reward: 9.0 Training loss: 214.0036 Explore P: 0.7169
Episode: 183 Total reward: 16.0 Training loss: 137.4577 Explore P: 0.7158
Episode: 184 Total reward: 16.0 Training loss: 364.3958 Explore P: 0.7147
Episode: 185 Total reward: 10.0 Training loss: 131.5004 Explore P: 0.7139
Episode: 186 Total reward: 10.0 Training loss: 3.0146 Explore P: 0.7132
Episode: 187 Total reward: 29.0 Training loss: 5.6728 Explore P: 0.7112
Episode: 188 Total reward: 20.0 Training loss: 403.7636 Explore P: 0.7098
Episode: 189 Total reward: 13.0 Training loss: 144.5053 Explore P: 0.7089
Episode: 190 Total reward: 25.0 Training loss: 165.3593 Explore P: 0.7072
Episode: 191 Total reward: 12.0 Training loss: 4.1933 Explore P: 0.7063
Episode: 192 Total reward: 13.0 Training loss: 232.0637 Explore P: 0.7054
Episode: 193 Total reward: 11.0 Training loss: 5.2360 Explore P: 0.7046
Episode: 194 Total reward: 19.0 Training loss: 137.7719 Explore P: 0.7033
Episode: 195 Total reward: 11.0 Training loss: 121.4025 Explore P: 0.7026
Episode: 196 Total reward: 12.0 Training loss: 3.0423 Explore P: 0.7017
Episode: 197 Total reward: 18.0 Training loss: 138.7101 Explore P: 0.7005
Episode: 198 Total reward: 9.0 Training loss: 358.5811 Explore P: 0.6999
Episode: 199 Total reward: 16.0 Training loss: 2.1677 Explore P: 0.6988
Episode: 200 Total reward: 14.0 Training loss: 224.3177 Explore P: 0.6978
Episode: 201 Total reward: 12.0 Training loss: 247.8046 Explore P: 0.6970
Episode: 202 Total reward: 12.0 Training loss: 316.0663 Explore P: 0.6962
Episode: 203 Total reward: 10.0 Training loss: 176.4582 Explore P: 0.6955
Episode: 204 Total reward: 12.0 Training loss: 142.9169 Explore P: 0.6946
Episode: 205 Total reward: 15.0 Training loss: 2.1651 Explore P: 0.6936
Episode: 206 Total reward: 16.0 Training loss: 335.8691 Explore P: 0.6925
Episode: 207 Total reward: 19.0 Training loss: 503.3141 Explore P: 0.6912
Episode: 208 Total reward: 23.0 Training loss: 109.3810 Explore P: 0.6897
Episode: 209 Total reward: 20.0 Training loss: 2.3969 Explore P: 0.6883
Episode: 210 Total reward: 40.0 Training loss: 90.9820 Explore P: 0.6856
Episode: 211 Total reward: 12.0 Training loss: 128.4597 Explore P: 0.6848
Episode: 212 Total reward: 15.0 Training loss: 3.2206 Explore P: 0.6838
Episode: 213 Total reward: 11.0 Training loss: 95.1852 Explore P: 0.6830
Episode: 214 Total reward: 10.0 Training loss: 199.2592 Explore P: 0.6824
Episode: 215 Total reward: 9.0 Training loss: 91.9066 Explore P: 0.6818
Episode: 216 Total reward: 16.0 Training loss: 190.4002 Explore P: 0.6807
Episode: 217 Total reward: 13.0 Training loss: 226.0988 Explore P: 0.6798
Episode: 218 Total reward: 15.0 Training loss: 219.0774 Explore P: 0.6788
Episode: 219 Total reward: 10.0 Training loss: 3.9806 Explore P: 0.6781
Episode: 220 Total reward: 17.0 Training loss: 256.8464 Explore P: 0.6770
Episode: 221 Total reward: 18.0 Training loss: 76.9521 Explore P: 0.6758
Episode: 222 Total reward: 9.0 Training loss: 3.2845 Explore P: 0.6752
Episode: 223 Total reward: 9.0 Training loss: 173.7374 Explore P: 0.6746
Episode: 224 Total reward: 12.0 Training loss: 324.4783 Explore P: 0.6738
Episode: 225 Total reward: 11.0 Training loss: 78.0929 Explore P: 0.6731
Episode: 226 Total reward: 19.0 Training loss: 1.7268 Explore P: 0.6718
Episode: 227 Total reward: 8.0 Training loss: 138.1678 Explore P: 0.6713
Episode: 228 Total reward: 9.0 Training loss: 67.0849 Explore P: 0.6707
Episode: 229 Total reward: 11.0 Training loss: 148.1500 Explore P: 0.6700
Episode: 230 Total reward: 9.0 Training loss: 112.2071 Explore P: 0.6694
Episode: 231 Total reward: 12.0 Training loss: 173.8643 Explore P: 0.6686
Episode: 232 Total reward: 12.0 Training loss: 74.1162 Explore P: 0.6678
Episode: 233 Total reward: 11.0 Training loss: 69.9150 Explore P: 0.6671
Episode: 234 Total reward: 9.0 Training loss: 266.7513 Explore P: 0.6665
Episode: 235 Total reward: 14.0 Training loss: 185.5789 Explore P: 0.6656
Episode: 236 Total reward: 18.0 Training loss: 66.0631 Explore P: 0.6644
Episode: 237 Total reward: 12.0 Training loss: 64.0533 Explore P: 0.6636
Episode: 238 Total reward: 10.0 Training loss: 157.2539 Explore P: 0.6630
Episode: 239 Total reward: 9.0 Training loss: 113.7060 Explore P: 0.6624
Episode: 240 Total reward: 22.0 Training loss: 253.0049 Explore P: 0.6609
Episode: 241 Total reward: 15.0 Training loss: 100.5261 Explore P: 0.6600
Episode: 242 Total reward: 9.0 Training loss: 105.7877 Explore P: 0.6594
Episode: 243 Total reward: 10.0 Training loss: 244.9234 Explore P: 0.6587
Episode: 244 Total reward: 12.0 Training loss: 264.0488 Explore P: 0.6579
Episode: 245 Total reward: 12.0 Training loss: 285.6095 Explore P: 0.6572
Episode: 246 Total reward: 11.0 Training loss: 213.4570 Explore P: 0.6565
Episode: 247 Total reward: 12.0 Training loss: 152.0330 Explore P: 0.6557
Episode: 248 Total reward: 11.0 Training loss: 58.2141 Explore P: 0.6550
Episode: 249 Total reward: 21.0 Training loss: 103.6935 Explore P: 0.6536
Episode: 250 Total reward: 13.0 Training loss: 121.3285 Explore P: 0.6528
Episode: 251 Total reward: 10.0 Training loss: 100.1648 Explore P: 0.6521
Episode: 252 Total reward: 15.0 Training loss: 363.1227 Explore P: 0.6512
Episode: 253 Total reward: 10.0 Training loss: 93.5325 Explore P: 0.6505
Episode: 254 Total reward: 8.0 Training loss: 56.7259 Explore P: 0.6500
Episode: 255 Total reward: 11.0 Training loss: 3.3377 Explore P: 0.6493
Episode: 256 Total reward: 12.0 Training loss: 54.9018 Explore P: 0.6486
Episode: 257 Total reward: 9.0 Training loss: 110.0745 Explore P: 0.6480
Episode: 258 Total reward: 12.0 Training loss: 89.9007 Explore P: 0.6472
Episode: 259 Total reward: 12.0 Training loss: 47.5368 Explore P: 0.6464
Episode: 260 Total reward: 9.0 Training loss: 136.3115 Explore P: 0.6459
Episode: 261 Total reward: 15.0 Training loss: 4.3004 Explore P: 0.6449
Episode: 262 Total reward: 14.0 Training loss: 200.5336 Explore P: 0.6440
Episode: 263 Total reward: 14.0 Training loss: 213.7814 Explore P: 0.6431
Episode: 264 Total reward: 13.0 Training loss: 84.7364 Explore P: 0.6423
Episode: 265 Total reward: 14.0 Training loss: 2.6457 Explore P: 0.6414
Episode: 266 Total reward: 8.0 Training loss: 168.7287 Explore P: 0.6409
Episode: 267 Total reward: 13.0 Training loss: 218.6096 Explore P: 0.6401
Episode: 268 Total reward: 13.0 Training loss: 194.1976 Explore P: 0.6393
Episode: 269 Total reward: 11.0 Training loss: 90.0727 Explore P: 0.6386
Episode: 270 Total reward: 8.0 Training loss: 131.9349 Explore P: 0.6381
Episode: 271 Total reward: 10.0 Training loss: 48.8470 Explore P: 0.6375
Episode: 272 Total reward: 14.0 Training loss: 39.2557 Explore P: 0.6366
Episode: 273 Total reward: 14.0 Training loss: 68.6701 Explore P: 0.6357
Episode: 274 Total reward: 20.0 Training loss: 121.0310 Explore P: 0.6345
Episode: 275 Total reward: 24.0 Training loss: 35.9053 Explore P: 0.6330
Episode: 276 Total reward: 27.0 Training loss: 113.6335 Explore P: 0.6313
Episode: 277 Total reward: 21.0 Training loss: 2.1632 Explore P: 0.6300
Episode: 278 Total reward: 12.0 Training loss: 34.1961 Explore P: 0.6292
Episode: 279 Total reward: 17.0 Training loss: 2.9529 Explore P: 0.6282
Episode: 280 Total reward: 10.0 Training loss: 31.9052 Explore P: 0.6276
Episode: 281 Total reward: 14.0 Training loss: 213.8201 Explore P: 0.6267
Episode: 282 Total reward: 20.0 Training loss: 32.3995 Explore P: 0.6255
Episode: 283 Total reward: 37.0 Training loss: 71.9545 Explore P: 0.6232
Episode: 284 Total reward: 11.0 Training loss: 7.2107 Explore P: 0.6225
Episode: 285 Total reward: 15.0 Training loss: 33.2825 Explore P: 0.6216
Episode: 286 Total reward: 23.0 Training loss: 175.4714 Explore P: 0.6202
Episode: 287 Total reward: 16.0 Training loss: 121.5343 Explore P: 0.6192
Episode: 288 Total reward: 18.0 Training loss: 4.4293 Explore P: 0.6181
Episode: 289 Total reward: 15.0 Training loss: 29.3995 Explore P: 0.6172
Episode: 290 Total reward: 18.0 Training loss: 55.9283 Explore P: 0.6161
Episode: 291 Total reward: 41.0 Training loss: 149.8688 Explore P: 0.6137
Episode: 292 Total reward: 25.0 Training loss: 84.4039 Explore P: 0.6121
Episode: 293 Total reward: 18.0 Training loss: 5.7669 Explore P: 0.6111
Episode: 294 Total reward: 14.0 Training loss: 39.7809 Explore P: 0.6102
Episode: 295 Total reward: 20.0 Training loss: 4.6684 Explore P: 0.6090
Episode: 296 Total reward: 19.0 Training loss: 31.6807 Explore P: 0.6079
Episode: 297 Total reward: 9.0 Training loss: 61.1396 Explore P: 0.6074
Episode: 298 Total reward: 9.0 Training loss: 34.1221 Explore P: 0.6068
Episode: 299 Total reward: 16.0 Training loss: 138.9426 Explore P: 0.6059
Episode: 300 Total reward: 14.0 Training loss: 5.3620 Explore P: 0.6050
Episode: 301 Total reward: 10.0 Training loss: 79.3021 Explore P: 0.6044
Episode: 302 Total reward: 14.0 Training loss: 49.0916 Explore P: 0.6036
Episode: 303 Total reward: 8.0 Training loss: 104.0440 Explore P: 0.6031
Episode: 304 Total reward: 20.0 Training loss: 101.0491 Explore P: 0.6019
Episode: 305 Total reward: 15.0 Training loss: 5.4332 Explore P: 0.6011
Episode: 306 Total reward: 18.0 Training loss: 5.6578 Explore P: 0.6000
Episode: 307 Total reward: 18.0 Training loss: 25.6384 Explore P: 0.5989
Episode: 308 Total reward: 20.0 Training loss: 6.4029 Explore P: 0.5978
Episode: 309 Total reward: 9.0 Training loss: 176.4225 Explore P: 0.5972
Episode: 310 Total reward: 10.0 Training loss: 27.3194 Explore P: 0.5966
Episode: 311 Total reward: 14.0 Training loss: 23.3147 Explore P: 0.5958
Episode: 312 Total reward: 9.0 Training loss: 26.2146 Explore P: 0.5953
Episode: 313 Total reward: 9.0 Training loss: 191.5470 Explore P: 0.5948
Episode: 314 Total reward: 9.0 Training loss: 88.4002 Explore P: 0.5942
Episode: 315 Total reward: 19.0 Training loss: 161.2394 Explore P: 0.5931
Episode: 316 Total reward: 25.0 Training loss: 93.1639 Explore P: 0.5917
Episode: 317 Total reward: 14.0 Training loss: 68.2557 Explore P: 0.5909
Episode: 318 Total reward: 15.0 Training loss: 26.7399 Explore P: 0.5900
Episode: 319 Total reward: 10.0 Training loss: 170.6469 Explore P: 0.5894
Episode: 320 Total reward: 11.0 Training loss: 3.3687 Explore P: 0.5888
Episode: 321 Total reward: 15.0 Training loss: 88.8905 Explore P: 0.5879
Episode: 322 Total reward: 10.0 Training loss: 21.0162 Explore P: 0.5873
Episode: 323 Total reward: 29.0 Training loss: 37.1514 Explore P: 0.5857
Episode: 324 Total reward: 21.0 Training loss: 3.5919 Explore P: 0.5844
Episode: 325 Total reward: 41.0 Training loss: 66.5671 Explore P: 0.5821
Episode: 326 Total reward: 9.0 Training loss: 23.8566 Explore P: 0.5816
Episode: 327 Total reward: 12.0 Training loss: 306.8965 Explore P: 0.5809
Episode: 328 Total reward: 8.0 Training loss: 3.0472 Explore P: 0.5804
Episode: 329 Total reward: 9.0 Training loss: 104.6340 Explore P: 0.5799
Episode: 330 Total reward: 10.0 Training loss: 4.6642 Explore P: 0.5794
Episode: 331 Total reward: 18.0 Training loss: 37.2253 Explore P: 0.5783
Episode: 332 Total reward: 10.0 Training loss: 26.0824 Explore P: 0.5778
Episode: 333 Total reward: 11.0 Training loss: 90.1184 Explore P: 0.5771
Episode: 334 Total reward: 15.0 Training loss: 114.5057 Explore P: 0.5763
Episode: 335 Total reward: 26.0 Training loss: 4.7339 Explore P: 0.5748
Episode: 336 Total reward: 19.0 Training loss: 176.7065 Explore P: 0.5737
Episode: 337 Total reward: 15.0 Training loss: 92.2431 Explore P: 0.5729
Episode: 338 Total reward: 14.0 Training loss: 101.3784 Explore P: 0.5721
Episode: 339 Total reward: 30.0 Training loss: 54.9085 Explore P: 0.5704
Episode: 340 Total reward: 13.0 Training loss: 70.8780 Explore P: 0.5697
Episode: 341 Total reward: 18.0 Training loss: 2.9412 Explore P: 0.5687
Episode: 342 Total reward: 11.0 Training loss: 128.1035 Explore P: 0.5681
Episode: 343 Total reward: 9.0 Training loss: 3.1334 Explore P: 0.5676
Episode: 344 Total reward: 16.0 Training loss: 38.5450 Explore P: 0.5667
Episode: 345 Total reward: 11.0 Training loss: 71.8534 Explore P: 0.5661
Episode: 346 Total reward: 15.0 Training loss: 1.6052 Explore P: 0.5652
Episode: 347 Total reward: 11.0 Training loss: 109.0632 Explore P: 0.5646
Episode: 348 Total reward: 12.0 Training loss: 4.9467 Explore P: 0.5640
Episode: 349 Total reward: 10.0 Training loss: 40.3499 Explore P: 0.5634
Episode: 350 Total reward: 15.0 Training loss: 3.4527 Explore P: 0.5626
Episode: 351 Total reward: 11.0 Training loss: 106.9212 Explore P: 0.5620
Episode: 352 Total reward: 15.0 Training loss: 34.1927 Explore P: 0.5611
Episode: 353 Total reward: 24.0 Training loss: 84.3414 Explore P: 0.5598
Episode: 354 Total reward: 8.0 Training loss: 43.3037 Explore P: 0.5594
Episode: 355 Total reward: 15.0 Training loss: 0.9405 Explore P: 0.5586
Episode: 356 Total reward: 15.0 Training loss: 17.9579 Explore P: 0.5577
Episode: 357 Total reward: 15.0 Training loss: 74.7600 Explore P: 0.5569
Episode: 358 Total reward: 11.0 Training loss: 43.9022 Explore P: 0.5563
Episode: 359 Total reward: 13.0 Training loss: 16.7958 Explore P: 0.5556
Episode: 360 Total reward: 13.0 Training loss: 78.4240 Explore P: 0.5549
Episode: 361 Total reward: 10.0 Training loss: 87.9796 Explore P: 0.5544
Episode: 362 Total reward: 13.0 Training loss: 105.0640 Explore P: 0.5536
Episode: 363 Total reward: 28.0 Training loss: 16.8636 Explore P: 0.5521
Episode: 364 Total reward: 13.0 Training loss: 3.9164 Explore P: 0.5514
Episode: 365 Total reward: 9.0 Training loss: 24.0048 Explore P: 0.5509
Episode: 366 Total reward: 11.0 Training loss: 22.7599 Explore P: 0.5503
Episode: 367 Total reward: 15.0 Training loss: 27.6213 Explore P: 0.5495
Episode: 368 Total reward: 13.0 Training loss: 3.3522 Explore P: 0.5488
Episode: 369 Total reward: 25.0 Training loss: 12.4298 Explore P: 0.5475
Episode: 370 Total reward: 18.0 Training loss: 90.3494 Explore P: 0.5465
Episode: 371 Total reward: 17.0 Training loss: 66.5405 Explore P: 0.5456
Episode: 372 Total reward: 9.0 Training loss: 55.6065 Explore P: 0.5451
Episode: 373 Total reward: 19.0 Training loss: 62.2752 Explore P: 0.5441
Episode: 374 Total reward: 11.0 Training loss: 3.4799 Explore P: 0.5435
Episode: 375 Total reward: 16.0 Training loss: 2.5282 Explore P: 0.5427
Episode: 376 Total reward: 8.0 Training loss: 1.2182 Explore P: 0.5422
Episode: 377 Total reward: 11.0 Training loss: 51.4415 Explore P: 0.5417
Episode: 378 Total reward: 23.0 Training loss: 1.4088 Explore P: 0.5404
Episode: 379 Total reward: 13.0 Training loss: 1.6463 Explore P: 0.5397
Episode: 380 Total reward: 10.0 Training loss: 55.9892 Explore P: 0.5392
Episode: 381 Total reward: 8.0 Training loss: 22.4308 Explore P: 0.5388
Episode: 382 Total reward: 11.0 Training loss: 1.7156 Explore P: 0.5382
Episode: 383 Total reward: 13.0 Training loss: 65.4508 Explore P: 0.5375
Episode: 384 Total reward: 18.0 Training loss: 73.4197 Explore P: 0.5366
Episode: 385 Total reward: 20.0 Training loss: 67.8101 Explore P: 0.5355
Episode: 386 Total reward: 10.0 Training loss: 45.8936 Explore P: 0.5350
Episode: 387 Total reward: 9.0 Training loss: 20.2614 Explore P: 0.5345
Episode: 388 Total reward: 16.0 Training loss: 1.9170 Explore P: 0.5337
Episode: 389 Total reward: 11.0 Training loss: 37.7061 Explore P: 0.5331
Episode: 390 Total reward: 23.0 Training loss: 103.4796 Explore P: 0.5319
Episode: 391 Total reward: 8.0 Training loss: 86.7478 Explore P: 0.5315
Episode: 392 Total reward: 12.0 Training loss: 43.7247 Explore P: 0.5309
Episode: 393 Total reward: 13.0 Training loss: 53.2227 Explore P: 0.5302
Episode: 394 Total reward: 9.0 Training loss: 34.1443 Explore P: 0.5297
Episode: 395 Total reward: 12.0 Training loss: 2.8607 Explore P: 0.5291
Episode: 396 Total reward: 17.0 Training loss: 2.8609 Explore P: 0.5282
Episode: 397 Total reward: 22.0 Training loss: 42.4999 Explore P: 0.5271
Episode: 398 Total reward: 9.0 Training loss: 1.7179 Explore P: 0.5266
Episode: 399 Total reward: 14.0 Training loss: 80.1840 Explore P: 0.5259
Episode: 400 Total reward: 15.0 Training loss: 1.6627 Explore P: 0.5251
Episode: 401 Total reward: 17.0 Training loss: 2.2837 Explore P: 0.5242
Episode: 402 Total reward: 95.0 Training loss: 27.8602 Explore P: 0.5194
Episode: 403 Total reward: 13.0 Training loss: 1.0315 Explore P: 0.5187
Episode: 404 Total reward: 38.0 Training loss: 1.6910 Explore P: 0.5168
Episode: 405 Total reward: 20.0 Training loss: 0.9454 Explore P: 0.5158
Episode: 406 Total reward: 47.0 Training loss: 64.7375 Explore P: 0.5134
Episode: 407 Total reward: 20.0 Training loss: 16.4529 Explore P: 0.5124
Episode: 408 Total reward: 29.0 Training loss: 11.9333 Explore P: 0.5109
Episode: 409 Total reward: 49.0 Training loss: 1.8494 Explore P: 0.5085
Episode: 410 Total reward: 17.0 Training loss: 9.3960 Explore P: 0.5077
Episode: 411 Total reward: 36.0 Training loss: 73.3632 Explore P: 0.5059
Episode: 412 Total reward: 23.0 Training loss: 44.4469 Explore P: 0.5047
Episode: 413 Total reward: 14.0 Training loss: 1.6513 Explore P: 0.5040
Episode: 414 Total reward: 44.0 Training loss: 1.6874 Explore P: 0.5019
Episode: 415 Total reward: 51.0 Training loss: 18.5683 Explore P: 0.4994
Episode: 416 Total reward: 33.0 Training loss: 26.4009 Explore P: 0.4978
Episode: 417 Total reward: 32.0 Training loss: 34.0742 Explore P: 0.4962
Episode: 418 Total reward: 21.0 Training loss: 17.7689 Explore P: 0.4952
Episode: 419 Total reward: 38.0 Training loss: 22.3536 Explore P: 0.4933
Episode: 420 Total reward: 35.0 Training loss: 26.5546 Explore P: 0.4916
Episode: 421 Total reward: 28.0 Training loss: 0.8973 Explore P: 0.4903
Episode: 422 Total reward: 26.0 Training loss: 1.3706 Explore P: 0.4890
Episode: 423 Total reward: 34.0 Training loss: 10.5570 Explore P: 0.4874
Episode: 424 Total reward: 52.0 Training loss: 19.8774 Explore P: 0.4849
Episode: 425 Total reward: 19.0 Training loss: 1.0844 Explore P: 0.4840
Episode: 426 Total reward: 38.0 Training loss: 15.2065 Explore P: 0.4822
Episode: 427 Total reward: 29.0 Training loss: 1.2285 Explore P: 0.4809
Episode: 428 Total reward: 26.0 Training loss: 28.1584 Explore P: 0.4797
Episode: 429 Total reward: 29.0 Training loss: 26.4352 Explore P: 0.4783
Episode: 430 Total reward: 59.0 Training loss: 21.0858 Explore P: 0.4755
Episode: 431 Total reward: 29.0 Training loss: 41.5450 Explore P: 0.4742
Episode: 432 Total reward: 45.0 Training loss: 1.5844 Explore P: 0.4721
Episode: 433 Total reward: 19.0 Training loss: 2.3385 Explore P: 0.4712
Episode: 434 Total reward: 28.0 Training loss: 1.4449 Explore P: 0.4699
Episode: 435 Total reward: 35.0 Training loss: 31.9446 Explore P: 0.4683
Episode: 436 Total reward: 96.0 Training loss: 1.3756 Explore P: 0.4640
Episode: 437 Total reward: 42.0 Training loss: 46.5619 Explore P: 0.4621
Episode: 438 Total reward: 31.0 Training loss: 0.8114 Explore P: 0.4607
Episode: 439 Total reward: 33.0 Training loss: 2.0197 Explore P: 0.4592
Episode: 440 Total reward: 32.0 Training loss: 31.4804 Explore P: 0.4577
Episode: 441 Total reward: 23.0 Training loss: 11.5387 Explore P: 0.4567
Episode: 442 Total reward: 15.0 Training loss: 12.5823 Explore P: 0.4560
Episode: 443 Total reward: 20.0 Training loss: 1.9851 Explore P: 0.4551
Episode: 444 Total reward: 74.0 Training loss: 8.8443 Explore P: 0.4519
Episode: 445 Total reward: 38.0 Training loss: 1.8225 Explore P: 0.4502
Episode: 446 Total reward: 31.0 Training loss: 29.9877 Explore P: 0.4488
Episode: 447 Total reward: 18.0 Training loss: 14.9294 Explore P: 0.4480
Episode: 448 Total reward: 23.0 Training loss: 11.9026 Explore P: 0.4470
Episode: 449 Total reward: 41.0 Training loss: 14.1103 Explore P: 0.4452
Episode: 450 Total reward: 33.0 Training loss: 13.2039 Explore P: 0.4438
Episode: 451 Total reward: 17.0 Training loss: 20.0676 Explore P: 0.4431
Episode: 452 Total reward: 47.0 Training loss: 2.0080 Explore P: 0.4410
Episode: 453 Total reward: 36.0 Training loss: 1.2166 Explore P: 0.4395
Episode: 454 Total reward: 60.0 Training loss: 1.6247 Explore P: 0.4369
Episode: 455 Total reward: 42.0 Training loss: 2.2934 Explore P: 0.4351
Episode: 456 Total reward: 76.0 Training loss: 10.7036 Explore P: 0.4319
Episode: 457 Total reward: 34.0 Training loss: 52.9187 Explore P: 0.4305
Episode: 458 Total reward: 35.0 Training loss: 36.4772 Explore P: 0.4290
Episode: 459 Total reward: 31.0 Training loss: 1.5944 Explore P: 0.4277
Episode: 460 Total reward: 40.0 Training loss: 17.5018 Explore P: 0.4260
Episode: 461 Total reward: 80.0 Training loss: 35.8462 Explore P: 0.4227
Episode: 462 Total reward: 33.0 Training loss: 8.9367 Explore P: 0.4214
Episode: 463 Total reward: 61.0 Training loss: 11.6347 Explore P: 0.4189
Episode: 464 Total reward: 41.0 Training loss: 13.9660 Explore P: 0.4172
Episode: 465 Total reward: 38.0 Training loss: 41.7076 Explore P: 0.4157
Episode: 466 Total reward: 54.0 Training loss: 12.5269 Explore P: 0.4135
Episode: 467 Total reward: 86.0 Training loss: 64.4105 Explore P: 0.4100
Episode: 468 Total reward: 49.0 Training loss: 16.6175 Explore P: 0.4081
Episode: 469 Total reward: 33.0 Training loss: 12.7063 Explore P: 0.4067
Episode: 470 Total reward: 44.0 Training loss: 1.6122 Explore P: 0.4050
Episode: 471 Total reward: 33.0 Training loss: 61.8237 Explore P: 0.4037
Episode: 472 Total reward: 40.0 Training loss: 50.6398 Explore P: 0.4021
Episode: 473 Total reward: 141.0 Training loss: 31.6937 Explore P: 0.3966
Episode: 474 Total reward: 92.0 Training loss: 1.2583 Explore P: 0.3931
Episode: 475 Total reward: 56.0 Training loss: 12.2733 Explore P: 0.3910
Episode: 476 Total reward: 61.0 Training loss: 2.1773 Explore P: 0.3886
Episode: 477 Total reward: 19.0 Training loss: 24.1700 Explore P: 0.3879
Episode: 478 Total reward: 47.0 Training loss: 1.8333 Explore P: 0.3862
Episode: 479 Total reward: 24.0 Training loss: 1.3321 Explore P: 0.3853
Episode: 480 Total reward: 21.0 Training loss: 1.6996 Explore P: 0.3845
Episode: 481 Total reward: 56.0 Training loss: 4.0072 Explore P: 0.3824
Episode: 482 Total reward: 82.0 Training loss: 23.6217 Explore P: 0.3793
Episode: 483 Total reward: 41.0 Training loss: 18.2766 Explore P: 0.3778
Episode: 484 Total reward: 98.0 Training loss: 37.0301 Explore P: 0.3742
Episode: 485 Total reward: 82.0 Training loss: 26.2214 Explore P: 0.3713
Episode: 486 Total reward: 90.0 Training loss: 2.7071 Explore P: 0.3680
Episode: 487 Total reward: 121.0 Training loss: 1.8812 Explore P: 0.3637
Episode: 488 Total reward: 61.0 Training loss: 53.5616 Explore P: 0.3616
Episode: 489 Total reward: 88.0 Training loss: 11.3588 Explore P: 0.3585
Episode: 490 Total reward: 41.0 Training loss: 11.7550 Explore P: 0.3571
Episode: 491 Total reward: 49.0 Training loss: 1.7459 Explore P: 0.3554
Episode: 492 Total reward: 55.0 Training loss: 1.8910 Explore P: 0.3535
Episode: 493 Total reward: 20.0 Training loss: 2.2762 Explore P: 0.3528
Episode: 494 Total reward: 66.0 Training loss: 3.2206 Explore P: 0.3505
Episode: 495 Total reward: 51.0 Training loss: 46.9035 Explore P: 0.3488
Episode: 496 Total reward: 60.0 Training loss: 25.1009 Explore P: 0.3468
Episode: 497 Total reward: 93.0 Training loss: 53.7032 Explore P: 0.3437
Episode: 498 Total reward: 41.0 Training loss: 24.6958 Explore P: 0.3423
Episode: 499 Total reward: 53.0 Training loss: 1.6852 Explore P: 0.3405
Episode: 500 Total reward: 32.0 Training loss: 3.6309 Explore P: 0.3395
Episode: 501 Total reward: 71.0 Training loss: 2.1863 Explore P: 0.3371
Episode: 502 Total reward: 38.0 Training loss: 2.1533 Explore P: 0.3359
Episode: 503 Total reward: 50.0 Training loss: 26.8079 Explore P: 0.3343
Episode: 504 Total reward: 30.0 Training loss: 3.6566 Explore P: 0.3333
Episode: 505 Total reward: 76.0 Training loss: 11.5282 Explore P: 0.3309
Episode: 506 Total reward: 59.0 Training loss: 30.5798 Explore P: 0.3290
Episode: 507 Total reward: 98.0 Training loss: 2.3722 Explore P: 0.3259
Episode: 508 Total reward: 50.0 Training loss: 1.4548 Explore P: 0.3243
Episode: 509 Total reward: 37.0 Training loss: 122.0184 Explore P: 0.3231
Episode: 510 Total reward: 34.0 Training loss: 1.6186 Explore P: 0.3221
Episode: 511 Total reward: 24.0 Training loss: 1.9000 Explore P: 0.3213
Episode: 512 Total reward: 58.0 Training loss: 37.7919 Explore P: 0.3195
Episode: 513 Total reward: 50.0 Training loss: 24.4112 Explore P: 0.3180
Episode: 514 Total reward: 41.0 Training loss: 47.2970 Explore P: 0.3167
Episode: 515 Total reward: 36.0 Training loss: 13.6730 Explore P: 0.3156
Episode: 516 Total reward: 79.0 Training loss: 3.5917 Explore P: 0.3132
Episode: 517 Total reward: 91.0 Training loss: 1.4076 Explore P: 0.3105
Episode: 518 Total reward: 65.0 Training loss: 44.2837 Explore P: 0.3085
Episode: 519 Total reward: 50.0 Training loss: 1.7154 Explore P: 0.3070
Episode: 520 Total reward: 84.0 Training loss: 39.4561 Explore P: 0.3045
Episode: 521 Total reward: 57.0 Training loss: 44.5094 Explore P: 0.3029
Episode: 522 Total reward: 55.0 Training loss: 1.2951 Explore P: 0.3013
Episode: 523 Total reward: 55.0 Training loss: 9.8508 Explore P: 0.2997
Episode: 524 Total reward: 53.0 Training loss: 31.8086 Explore P: 0.2981
Episode: 525 Total reward: 57.0 Training loss: 2.4399 Explore P: 0.2965
Episode: 526 Total reward: 68.0 Training loss: 3.3035 Explore P: 0.2945
Episode: 527 Total reward: 68.0 Training loss: 87.7510 Explore P: 0.2926
Episode: 528 Total reward: 50.0 Training loss: 50.1942 Explore P: 0.2912
Episode: 529 Total reward: 117.0 Training loss: 127.5474 Explore P: 0.2879
Episode: 530 Total reward: 73.0 Training loss: 1.6922 Explore P: 0.2859
Episode: 531 Total reward: 58.0 Training loss: 64.7100 Explore P: 0.2843
Episode: 532 Total reward: 69.0 Training loss: 2.0013 Explore P: 0.2824
Episode: 533 Total reward: 55.0 Training loss: 2.1964 Explore P: 0.2809
Episode: 534 Total reward: 71.0 Training loss: 133.3977 Explore P: 0.2790
Episode: 535 Total reward: 78.0 Training loss: 14.4462 Explore P: 0.2769
Episode: 536 Total reward: 56.0 Training loss: 31.2765 Explore P: 0.2754
Episode: 537 Total reward: 66.0 Training loss: 29.2365 Explore P: 0.2737
Episode: 538 Total reward: 49.0 Training loss: 2.5379 Explore P: 0.2724
Episode: 539 Total reward: 28.0 Training loss: 99.9252 Explore P: 0.2717
Episode: 540 Total reward: 86.0 Training loss: 68.0335 Explore P: 0.2694
Episode: 541 Total reward: 61.0 Training loss: 10.5402 Explore P: 0.2679
Episode: 542 Total reward: 42.0 Training loss: 107.3018 Explore P: 0.2668
Episode: 543 Total reward: 112.0 Training loss: 1.5370 Explore P: 0.2639
Episode: 544 Total reward: 114.0 Training loss: 1.2073 Explore P: 0.2610
Episode: 545 Total reward: 89.0 Training loss: 29.1991 Explore P: 0.2588
Episode: 546 Total reward: 51.0 Training loss: 6.6627 Explore P: 0.2575
Episode: 547 Total reward: 58.0 Training loss: 10.5727 Explore P: 0.2561
Episode: 548 Total reward: 88.0 Training loss: 54.2432 Explore P: 0.2540
Episode: 549 Total reward: 71.0 Training loss: 5.5843 Explore P: 0.2522
Episode: 550 Total reward: 127.0 Training loss: 2.2539 Explore P: 0.2492
Episode: 551 Total reward: 73.0 Training loss: 17.2013 Explore P: 0.2474
Episode: 552 Total reward: 52.0 Training loss: 55.0881 Explore P: 0.2462
Episode: 553 Total reward: 76.0 Training loss: 2.4960 Explore P: 0.2444
Episode: 554 Total reward: 137.0 Training loss: 2.1417 Explore P: 0.2412
Episode: 555 Total reward: 199.0 Training loss: 86.4608 Explore P: 0.2367
Episode: 556 Total reward: 71.0 Training loss: 1.9917 Explore P: 0.2351
Episode: 557 Total reward: 96.0 Training loss: 2.0353 Explore P: 0.2329
Episode: 558 Total reward: 78.0 Training loss: 2.5715 Explore P: 0.2312
Episode: 559 Total reward: 63.0 Training loss: 1.7375 Explore P: 0.2298
Episode: 560 Total reward: 47.0 Training loss: 77.9717 Explore P: 0.2288
Episode: 561 Total reward: 105.0 Training loss: 114.0148 Explore P: 0.2265
Episode: 562 Total reward: 113.0 Training loss: 2.5154 Explore P: 0.2240
Episode: 563 Total reward: 81.0 Training loss: 1.0471 Explore P: 0.2223
Episode: 564 Total reward: 109.0 Training loss: 91.9630 Explore P: 0.2200
Episode: 565 Total reward: 128.0 Training loss: 83.5766 Explore P: 0.2173
Episode: 566 Total reward: 74.0 Training loss: 2.2764 Explore P: 0.2158
Episode: 567 Total reward: 88.0 Training loss: 291.9935 Explore P: 0.2140
Episode: 568 Total reward: 76.0 Training loss: 2.2273 Explore P: 0.2125
Episode: 569 Total reward: 164.0 Training loss: 17.2441 Explore P: 0.2092
Episode: 570 Total reward: 79.0 Training loss: 2.8565 Explore P: 0.2076
Episode: 571 Total reward: 101.0 Training loss: 2.3543 Explore P: 0.2056
Episode: 572 Total reward: 199.0 Training loss: 67.0989 Explore P: 0.2018
Episode: 573 Total reward: 86.0 Training loss: 0.8617 Explore P: 0.2001
Episode: 574 Total reward: 122.0 Training loss: 1.1065 Explore P: 0.1978
Episode: 575 Total reward: 152.0 Training loss: 1.6178 Explore P: 0.1950
Episode: 576 Total reward: 107.0 Training loss: 1.9360 Explore P: 0.1930
Episode: 577 Total reward: 133.0 Training loss: 73.2953 Explore P: 0.1906
Episode: 578 Total reward: 126.0 Training loss: 2.5193 Explore P: 0.1883
Episode: 579 Total reward: 199.0 Training loss: 2.2680 Explore P: 0.1848
Episode: 580 Total reward: 96.0 Training loss: 1.8597 Explore P: 0.1832
Episode: 581 Total reward: 100.0 Training loss: 95.2668 Explore P: 0.1814
Episode: 582 Total reward: 114.0 Training loss: 1.2563 Explore P: 0.1795
Episode: 583 Total reward: 114.0 Training loss: 96.0110 Explore P: 0.1776
Episode: 584 Total reward: 88.0 Training loss: 1.2993 Explore P: 0.1761
Episode: 585 Total reward: 199.0 Training loss: 2.8955 Explore P: 0.1728
Episode: 586 Total reward: 123.0 Training loss: 2.0642 Explore P: 0.1708
Episode: 587 Total reward: 122.0 Training loss: 2.1891 Explore P: 0.1689
Episode: 588 Total reward: 117.0 Training loss: 1.6498 Explore P: 0.1670
Episode: 589 Total reward: 81.0 Training loss: 1.6924 Explore P: 0.1658
Episode: 590 Total reward: 112.0 Training loss: 2.1865 Explore P: 0.1640
Episode: 591 Total reward: 165.0 Training loss: 1.9885 Explore P: 0.1615
Episode: 592 Total reward: 150.0 Training loss: 1.3104 Explore P: 0.1593
Episode: 593 Total reward: 130.0 Training loss: 2.0007 Explore P: 0.1573
Episode: 594 Total reward: 93.0 Training loss: 1.2449 Explore P: 0.1560
Episode: 595 Total reward: 157.0 Training loss: 0.9129 Explore P: 0.1537
Episode: 596 Total reward: 157.0 Training loss: 93.7647 Explore P: 0.1515
Episode: 597 Total reward: 169.0 Training loss: 1.1271 Explore P: 0.1491
Episode: 598 Total reward: 111.0 Training loss: 110.8345 Explore P: 0.1476
Episode: 599 Total reward: 123.0 Training loss: 0.9287 Explore P: 0.1459
Episode: 600 Total reward: 199.0 Training loss: 0.9120 Explore P: 0.1432
Episode: 601 Total reward: 199.0 Training loss: 98.9746 Explore P: 0.1406
Episode: 602 Total reward: 160.0 Training loss: 25.3277 Explore P: 0.1385
Episode: 603 Total reward: 199.0 Training loss: 0.6807 Explore P: 0.1360
Episode: 604 Total reward: 199.0 Training loss: 1.1514 Explore P: 0.1335
Episode: 605 Total reward: 199.0 Training loss: 0.7622 Explore P: 0.1311
Episode: 606 Total reward: 154.0 Training loss: 0.5075 Explore P: 0.1292
Episode: 607 Total reward: 199.0 Training loss: 1.2417 Explore P: 0.1269
Episode: 608 Total reward: 199.0 Training loss: 0.5355 Explore P: 0.1245
Episode: 609 Total reward: 102.0 Training loss: 0.5109 Explore P: 0.1234
Episode: 610 Total reward: 159.0 Training loss: 0.6573 Explore P: 0.1216
Episode: 611 Total reward: 199.0 Training loss: 1.2335 Explore P: 0.1194
Episode: 612 Total reward: 148.0 Training loss: 223.9051 Explore P: 0.1178
Episode: 613 Total reward: 150.0 Training loss: 1.0760 Explore P: 0.1162
Episode: 614 Total reward: 136.0 Training loss: 0.9331 Explore P: 0.1148
Episode: 615 Total reward: 115.0 Training loss: 0.6788 Explore P: 0.1136
Episode: 616 Total reward: 167.0 Training loss: 1.2960 Explore P: 0.1118
Episode: 617 Total reward: 137.0 Training loss: 0.6747 Explore P: 0.1105
Episode: 618 Total reward: 143.0 Training loss: 0.8890 Explore P: 0.1090
Episode: 619 Total reward: 95.0 Training loss: 0.2380 Explore P: 0.1081
Episode: 620 Total reward: 143.0 Training loss: 229.9512 Explore P: 0.1067
Episode: 621 Total reward: 113.0 Training loss: 1.0778 Explore P: 0.1056
Episode: 622 Total reward: 119.0 Training loss: 0.3510 Explore P: 0.1045
Episode: 623 Total reward: 86.0 Training loss: 0.4805 Explore P: 0.1037
Episode: 624 Total reward: 97.0 Training loss: 0.4162 Explore P: 0.1028
Episode: 625 Total reward: 88.0 Training loss: 0.7017 Explore P: 0.1020
Episode: 626 Total reward: 72.0 Training loss: 3.3437 Explore P: 0.1013
Episode: 627 Total reward: 70.0 Training loss: 10.4924 Explore P: 0.1007
Episode: 628 Total reward: 70.0 Training loss: 0.4317 Explore P: 0.1000
Episode: 629 Total reward: 191.0 Training loss: 0.5803 Explore P: 0.0983
Episode: 630 Total reward: 93.0 Training loss: 0.5804 Explore P: 0.0975
Episode: 631 Total reward: 120.0 Training loss: 0.3727 Explore P: 0.0965
Episode: 632 Total reward: 85.0 Training loss: 0.3373 Explore P: 0.0957
Episode: 633 Total reward: 97.0 Training loss: 7.2489 Explore P: 0.0949
Episode: 634 Total reward: 78.0 Training loss: 1.1258 Explore P: 0.0942
Episode: 635 Total reward: 69.0 Training loss: 0.3373 Explore P: 0.0937
Episode: 636 Total reward: 65.0 Training loss: 0.3217 Explore P: 0.0931
Episode: 637 Total reward: 75.0 Training loss: 0.4940 Explore P: 0.0925
Episode: 638 Total reward: 66.0 Training loss: 0.3979 Explore P: 0.0920
Episode: 639 Total reward: 97.0 Training loss: 0.7328 Explore P: 0.0912
Episode: 640 Total reward: 75.0 Training loss: 0.3742 Explore P: 0.0906
Episode: 641 Total reward: 56.0 Training loss: 0.3174 Explore P: 0.0901
Episode: 642 Total reward: 86.0 Training loss: 0.5170 Explore P: 0.0894
Episode: 643 Total reward: 78.0 Training loss: 0.2671 Explore P: 0.0888
Episode: 644 Total reward: 98.0 Training loss: 0.4528 Explore P: 0.0880
Episode: 645 Total reward: 104.0 Training loss: 0.3315 Explore P: 0.0872
Episode: 646 Total reward: 102.0 Training loss: 0.3372 Explore P: 0.0864
Episode: 647 Total reward: 57.0 Training loss: 0.4064 Explore P: 0.0860
Episode: 648 Total reward: 90.0 Training loss: 0.2723 Explore P: 0.0853
Episode: 649 Total reward: 75.0 Training loss: 7.8044 Explore P: 0.0848
Episode: 650 Total reward: 71.0 Training loss: 0.3210 Explore P: 0.0842
Episode: 651 Total reward: 175.0 Training loss: 0.4287 Explore P: 0.0830
Episode: 652 Total reward: 199.0 Training loss: 0.2485 Explore P: 0.0815
Episode: 653 Total reward: 75.0 Training loss: 0.5554 Explore P: 0.0810
Episode: 654 Total reward: 199.0 Training loss: 0.5227 Explore P: 0.0796
Episode: 655 Total reward: 79.0 Training loss: 0.2264 Explore P: 0.0790
Episode: 656 Total reward: 76.0 Training loss: 0.4976 Explore P: 0.0785
Episode: 657 Total reward: 102.0 Training loss: 0.3671 Explore P: 0.0778
Episode: 658 Total reward: 199.0 Training loss: 0.4928 Explore P: 0.0765
Episode: 659 Total reward: 139.0 Training loss: 7.2617 Explore P: 0.0756
Episode: 660 Total reward: 199.0 Training loss: 0.2685 Explore P: 0.0743
Episode: 661 Total reward: 120.0 Training loss: 0.3150 Explore P: 0.0735
Episode: 662 Total reward: 138.0 Training loss: 0.3391 Explore P: 0.0726
Episode: 663 Total reward: 199.0 Training loss: 0.5577 Explore P: 0.0714
Episode: 664 Total reward: 199.0 Training loss: 0.9096 Explore P: 0.0702
Episode: 665 Total reward: 199.0 Training loss: 0.4001 Explore P: 0.0690
Episode: 666 Total reward: 199.0 Training loss: 0.3411 Explore P: 0.0678
Episode: 667 Total reward: 137.0 Training loss: 0.3932 Explore P: 0.0671
Episode: 668 Total reward: 144.0 Training loss: 0.4008 Explore P: 0.0662
Episode: 669 Total reward: 199.0 Training loss: 0.1730 Explore P: 0.0651
Episode: 670 Total reward: 199.0 Training loss: 0.1644 Explore P: 0.0640
Episode: 671 Total reward: 199.0 Training loss: 0.2910 Explore P: 0.0630
Episode: 672 Total reward: 199.0 Training loss: 0.3400 Explore P: 0.0619
Episode: 673 Total reward: 199.0 Training loss: 0.1789 Explore P: 0.0609
Episode: 674 Total reward: 199.0 Training loss: 0.4203 Explore P: 0.0599
Episode: 675 Total reward: 130.0 Training loss: 0.3057 Explore P: 0.0593
Episode: 676 Total reward: 199.0 Training loss: 200.4075 Explore P: 0.0583
Episode: 677 Total reward: 199.0 Training loss: 0.3811 Explore P: 0.0573
Episode: 678 Total reward: 199.0 Training loss: 0.3158 Explore P: 0.0564
Episode: 679 Total reward: 199.0 Training loss: 0.2306 Explore P: 0.0555
Episode: 680 Total reward: 199.0 Training loss: 0.1929 Explore P: 0.0546
Episode: 681 Total reward: 199.0 Training loss: 0.3687 Explore P: 0.0537
Episode: 682 Total reward: 199.0 Training loss: 0.3568 Explore P: 0.0529
Episode: 683 Total reward: 199.0 Training loss: 0.2222 Explore P: 0.0520
Episode: 684 Total reward: 199.0 Training loss: 0.3070 Explore P: 0.0512
Episode: 685 Total reward: 119.0 Training loss: 297.8262 Explore P: 0.0507
Episode: 686 Total reward: 199.0 Training loss: 0.2420 Explore P: 0.0499
Episode: 687 Total reward: 199.0 Training loss: 0.3855 Explore P: 0.0491
Episode: 688 Total reward: 199.0 Training loss: 0.3071 Explore P: 0.0483
Episode: 689 Total reward: 139.0 Training loss: 0.1554 Explore P: 0.0478
Episode: 690 Total reward: 185.0 Training loss: 0.2430 Explore P: 0.0471
Episode: 691 Total reward: 199.0 Training loss: 0.1634 Explore P: 0.0464
Episode: 692 Total reward: 199.0 Training loss: 0.1846 Explore P: 0.0457
Episode: 693 Total reward: 199.0 Training loss: 0.2466 Explore P: 0.0450
Episode: 694 Total reward: 199.0 Training loss: 0.2816 Explore P: 0.0443
Episode: 695 Total reward: 199.0 Training loss: 0.1981 Explore P: 0.0436
Episode: 696 Total reward: 199.0 Training loss: 0.2390 Explore P: 0.0429
Episode: 697 Total reward: 199.0 Training loss: 0.3916 Explore P: 0.0423
Episode: 698 Total reward: 199.0 Training loss: 0.5299 Explore P: 0.0417
Episode: 699 Total reward: 199.0 Training loss: 0.3822 Explore P: 0.0410
Episode: 700 Total reward: 199.0 Training loss: 0.2328 Explore P: 0.0404
Episode: 701 Total reward: 199.0 Training loss: 0.3602 Explore P: 0.0398
Episode: 702 Total reward: 199.0 Training loss: 0.1655 Explore P: 0.0392
Episode: 703 Total reward: 199.0 Training loss: 0.2771 Explore P: 0.0387
Episode: 704 Total reward: 199.0 Training loss: 0.2728 Explore P: 0.0381
Episode: 705 Total reward: 199.0 Training loss: 0.2874 Explore P: 0.0375
Episode: 706 Total reward: 199.0 Training loss: 0.1034 Explore P: 0.0370
Episode: 707 Total reward: 199.0 Training loss: 1.5015 Explore P: 0.0365
Episode: 708 Total reward: 199.0 Training loss: 0.1246 Explore P: 0.0359
Episode: 709 Total reward: 199.0 Training loss: 0.2316 Explore P: 0.0354
Episode: 710 Total reward: 199.0 Training loss: 0.2502 Explore P: 0.0349
Episode: 711 Total reward: 199.0 Training loss: 0.3318 Explore P: 0.0344
Episode: 712 Total reward: 199.0 Training loss: 0.2211 Explore P: 0.0340
Episode: 713 Total reward: 199.0 Training loss: 0.2554 Explore P: 0.0335
Episode: 714 Total reward: 199.0 Training loss: 0.2048 Explore P: 0.0330
Episode: 715 Total reward: 199.0 Training loss: 0.1812 Explore P: 0.0326
Episode: 716 Total reward: 199.0 Training loss: 0.2596 Explore P: 0.0321
Episode: 717 Total reward: 199.0 Training loss: 0.1445 Explore P: 0.0317
Episode: 718 Total reward: 199.0 Training loss: 0.2258 Explore P: 0.0313
Episode: 719 Total reward: 199.0 Training loss: 0.1705 Explore P: 0.0308
Episode: 720 Total reward: 199.0 Training loss: 0.2344 Explore P: 0.0304
Episode: 721 Total reward: 199.0 Training loss: 0.4017 Explore P: 0.0300
Episode: 722 Total reward: 199.0 Training loss: 0.1975 Explore P: 0.0296
Episode: 723 Total reward: 199.0 Training loss: 0.3392 Explore P: 0.0292
Episode: 724 Total reward: 199.0 Training loss: 0.3410 Explore P: 0.0289
Episode: 725 Total reward: 199.0 Training loss: 0.2610 Explore P: 0.0285
Episode: 726 Total reward: 199.0 Training loss: 0.2333 Explore P: 0.0281
Episode: 727 Total reward: 199.0 Training loss: 266.1306 Explore P: 0.0278
Episode: 728 Total reward: 198.0 Training loss: 0.2158 Explore P: 0.0274
Episode: 729 Total reward: 199.0 Training loss: 0.3088 Explore P: 0.0271
Episode: 730 Total reward: 199.0 Training loss: 0.2287 Explore P: 0.0267
Episode: 731 Total reward: 199.0 Training loss: 0.3852 Explore P: 0.0264
Episode: 732 Total reward: 153.0 Training loss: 0.1339 Explore P: 0.0262
Episode: 733 Total reward: 154.0 Training loss: 0.1771 Explore P: 0.0259
Episode: 734 Total reward: 171.0 Training loss: 0.3575 Explore P: 0.0257
Episode: 735 Total reward: 184.0 Training loss: 0.2875 Explore P: 0.0254
Episode: 736 Total reward: 190.0 Training loss: 0.2311 Explore P: 0.0251
Episode: 737 Total reward: 193.0 Training loss: 0.1881 Explore P: 0.0248
Episode: 738 Total reward: 157.0 Training loss: 0.2075 Explore P: 0.0246
Episode: 739 Total reward: 166.0 Training loss: 0.2304 Explore P: 0.0243
Episode: 740 Total reward: 177.0 Training loss: 0.1226 Explore P: 0.0241
Episode: 741 Total reward: 143.0 Training loss: 0.2463 Explore P: 0.0239
Episode: 742 Total reward: 142.0 Training loss: 0.3762 Explore P: 0.0237
Episode: 743 Total reward: 167.0 Training loss: 0.1944 Explore P: 0.0234
Episode: 744 Total reward: 156.0 Training loss: 0.2343 Explore P: 0.0232
Episode: 745 Total reward: 147.0 Training loss: 0.1962 Explore P: 0.0230
Episode: 746 Total reward: 179.0 Training loss: 0.3862 Explore P: 0.0228
Episode: 747 Total reward: 187.0 Training loss: 267.1616 Explore P: 0.0226
Episode: 748 Total reward: 181.0 Training loss: 0.2500 Explore P: 0.0223
Episode: 749 Total reward: 142.0 Training loss: 0.1864 Explore P: 0.0222
Episode: 750 Total reward: 171.0 Training loss: 0.5231 Explore P: 0.0220
Episode: 751 Total reward: 156.0 Training loss: 0.1821 Explore P: 0.0218
Episode: 752 Total reward: 139.0 Training loss: 0.1281 Explore P: 0.0216
Episode: 753 Total reward: 148.0 Training loss: 298.4337 Explore P: 0.0215
Episode: 754 Total reward: 175.0 Training loss: 0.1989 Explore P: 0.0213
Episode: 755 Total reward: 199.0 Training loss: 0.1435 Explore P: 0.0210
Episode: 756 Total reward: 155.0 Training loss: 0.2094 Explore P: 0.0209
Episode: 757 Total reward: 199.0 Training loss: 0.1735 Explore P: 0.0206
Episode: 758 Total reward: 185.0 Training loss: 0.2798 Explore P: 0.0205
Episode: 759 Total reward: 187.0 Training loss: 0.1303 Explore P: 0.0203
Episode: 760 Total reward: 163.0 Training loss: 228.6178 Explore P: 0.0201
Episode: 761 Total reward: 199.0 Training loss: 0.2218 Explore P: 0.0199
Episode: 762 Total reward: 170.0 Training loss: 0.2289 Explore P: 0.0197
Episode: 763 Total reward: 170.0 Training loss: 0.2012 Explore P: 0.0196
Episode: 764 Total reward: 183.0 Training loss: 0.1781 Explore P: 0.0194
Episode: 765 Total reward: 153.0 Training loss: 0.3758 Explore P: 0.0192
Episode: 766 Total reward: 198.0 Training loss: 0.1215 Explore P: 0.0191
Episode: 767 Total reward: 175.0 Training loss: 0.2441 Explore P: 0.0189
Episode: 768 Total reward: 148.0 Training loss: 0.2731 Explore P: 0.0188
Episode: 769 Total reward: 160.0 Training loss: 0.1855 Explore P: 0.0186
Episode: 770 Total reward: 149.0 Training loss: 0.1154 Explore P: 0.0185
Episode: 771 Total reward: 160.0 Training loss: 0.1853 Explore P: 0.0184
Episode: 772 Total reward: 170.0 Training loss: 0.0915 Explore P: 0.0182
Episode: 773 Total reward: 149.0 Training loss: 0.1628 Explore P: 0.0181
Episode: 774 Total reward: 164.0 Training loss: 50.8589 Explore P: 0.0180
Episode: 775 Total reward: 144.0 Training loss: 0.1028 Explore P: 0.0179
Episode: 776 Total reward: 167.0 Training loss: 0.0903 Explore P: 0.0177
Episode: 777 Total reward: 137.0 Training loss: 0.1937 Explore P: 0.0176
Episode: 778 Total reward: 135.0 Training loss: 0.2489 Explore P: 0.0175
Episode: 779 Total reward: 155.0 Training loss: 0.1040 Explore P: 0.0174
Episode: 780 Total reward: 143.0 Training loss: 0.1215 Explore P: 0.0173
Episode: 781 Total reward: 133.0 Training loss: 0.0834 Explore P: 0.0172
Episode: 782 Total reward: 140.0 Training loss: 0.3940 Explore P: 0.0171
Episode: 783 Total reward: 132.0 Training loss: 0.7990 Explore P: 0.0170
Episode: 784 Total reward: 144.0 Training loss: 0.0867 Explore P: 0.0169
Episode: 785 Total reward: 141.0 Training loss: 0.1321 Explore P: 0.0168
Episode: 786 Total reward: 133.0 Training loss: 0.0728 Explore P: 0.0167
Episode: 787 Total reward: 141.0 Training loss: 0.0930 Explore P: 0.0166
Episode: 788 Total reward: 118.0 Training loss: 0.2661 Explore P: 0.0166
Episode: 789 Total reward: 130.0 Training loss: 0.1570 Explore P: 0.0165
Episode: 790 Total reward: 132.0 Training loss: 0.1088 Explore P: 0.0164
Episode: 791 Total reward: 129.0 Training loss: 0.1909 Explore P: 0.0163
Episode: 792 Total reward: 125.0 Training loss: 0.0986 Explore P: 0.0162
Episode: 793 Total reward: 129.0 Training loss: 0.0806 Explore P: 0.0161
Episode: 794 Total reward: 131.0 Training loss: 0.1564 Explore P: 0.0161
Episode: 795 Total reward: 122.0 Training loss: 0.0562 Explore P: 0.0160
Episode: 796 Total reward: 120.0 Training loss: 0.1500 Explore P: 0.0159
Episode: 797 Total reward: 136.0 Training loss: 0.0315 Explore P: 0.0158
Episode: 798 Total reward: 134.0 Training loss: 1.9677 Explore P: 0.0158
Episode: 799 Total reward: 126.0 Training loss: 0.0460 Explore P: 0.0157
Episode: 800 Total reward: 132.0 Training loss: 0.1083 Explore P: 0.0156
Episode: 801 Total reward: 136.0 Training loss: 0.1581 Explore P: 0.0155
Episode: 802 Total reward: 136.0 Training loss: 0.0893 Explore P: 0.0155
Episode: 803 Total reward: 132.0 Training loss: 0.2038 Explore P: 0.0154
Episode: 804 Total reward: 134.0 Training loss: 0.0940 Explore P: 0.0153
Episode: 805 Total reward: 128.0 Training loss: 0.0761 Explore P: 0.0153
Episode: 806 Total reward: 133.0 Training loss: 0.0750 Explore P: 0.0152
Episode: 807 Total reward: 143.0 Training loss: 0.0830 Explore P: 0.0151
Episode: 808 Total reward: 150.0 Training loss: 1.2718 Explore P: 0.0150
Episode: 809 Total reward: 127.0 Training loss: 0.1077 Explore P: 0.0150
Episode: 810 Total reward: 133.0 Training loss: 0.1087 Explore P: 0.0149
Episode: 811 Total reward: 121.0 Training loss: 0.0926 Explore P: 0.0148
Episode: 812 Total reward: 135.0 Training loss: 0.0712 Explore P: 0.0148
Episode: 813 Total reward: 156.0 Training loss: 1.5117 Explore P: 0.0147
Episode: 814 Total reward: 161.0 Training loss: 0.4266 Explore P: 0.0146
Episode: 815 Total reward: 130.0 Training loss: 0.0929 Explore P: 0.0146
Episode: 816 Total reward: 172.0 Training loss: 0.0885 Explore P: 0.0145
Episode: 817 Total reward: 172.0 Training loss: 0.0799 Explore P: 0.0144
Episode: 818 Total reward: 158.0 Training loss: 0.0942 Explore P: 0.0144
Episode: 819 Total reward: 161.0 Training loss: 0.1168 Explore P: 0.0143
Episode: 820 Total reward: 163.0 Training loss: 0.0874 Explore P: 0.0142
Episode: 821 Total reward: 181.0 Training loss: 0.0715 Explore P: 0.0141
Episode: 822 Total reward: 191.0 Training loss: 0.1016 Explore P: 0.0141
Episode: 823 Total reward: 199.0 Training loss: 0.0467 Explore P: 0.0140
Episode: 824 Total reward: 199.0 Training loss: 1.1051 Explore P: 0.0139
Episode: 825 Total reward: 188.0 Training loss: 0.1337 Explore P: 0.0138
Episode: 826 Total reward: 199.0 Training loss: 0.0304 Explore P: 0.0138
Episode: 827 Total reward: 199.0 Training loss: 0.1181 Explore P: 0.0137
Episode: 828 Total reward: 199.0 Training loss: 0.0659 Explore P: 0.0136
Episode: 829 Total reward: 199.0 Training loss: 0.0607 Explore P: 0.0135
Episode: 830 Total reward: 199.0 Training loss: 0.0431 Explore P: 0.0135
Episode: 831 Total reward: 199.0 Training loss: 0.0442 Explore P: 0.0134
Episode: 832 Total reward: 199.0 Training loss: 0.1920 Explore P: 0.0133
Episode: 833 Total reward: 199.0 Training loss: 0.0586 Explore P: 0.0133
Episode: 834 Total reward: 199.0 Training loss: 0.2380 Explore P: 0.0132
Episode: 835 Total reward: 199.0 Training loss: 0.2688 Explore P: 0.0131
Episode: 836 Total reward: 199.0 Training loss: 0.0789 Explore P: 0.0131
Episode: 837 Total reward: 199.0 Training loss: 0.0928 Explore P: 0.0130
Episode: 838 Total reward: 199.0 Training loss: 0.0552 Explore P: 0.0130
Episode: 839 Total reward: 199.0 Training loss: 0.1232 Explore P: 0.0129
Episode: 840 Total reward: 199.0 Training loss: 0.2811 Explore P: 0.0128
Episode: 841 Total reward: 199.0 Training loss: 0.0979 Explore P: 0.0128
Episode: 842 Total reward: 168.0 Training loss: 0.1676 Explore P: 0.0127
Episode: 843 Total reward: 172.0 Training loss: 0.2037 Explore P: 0.0127
Episode: 844 Total reward: 199.0 Training loss: 0.1704 Explore P: 0.0126
Episode: 845 Total reward: 189.0 Training loss: 0.1247 Explore P: 0.0126
Episode: 846 Total reward: 199.0 Training loss: 0.1417 Explore P: 0.0125
Episode: 847 Total reward: 199.0 Training loss: 0.1420 Explore P: 0.0125
Episode: 848 Total reward: 199.0 Training loss: 0.1696 Explore P: 0.0124
Episode: 849 Total reward: 199.0 Training loss: 0.1494 Explore P: 0.0124
Episode: 850 Total reward: 199.0 Training loss: 0.2532 Explore P: 0.0123
Episode: 851 Total reward: 199.0 Training loss: 0.1033 Explore P: 0.0123
Episode: 852 Total reward: 186.0 Training loss: 0.1188 Explore P: 0.0123
Episode: 853 Total reward: 199.0 Training loss: 0.3205 Explore P: 0.0122
Episode: 854 Total reward: 199.0 Training loss: 0.2081 Explore P: 0.0122
Episode: 855 Total reward: 199.0 Training loss: 155.1477 Explore P: 0.0121
Episode: 856 Total reward: 199.0 Training loss: 0.1097 Explore P: 0.0121
Episode: 857 Total reward: 199.0 Training loss: 0.2333 Explore P: 0.0120
Episode: 858 Total reward: 199.0 Training loss: 0.2088 Explore P: 0.0120
Episode: 859 Total reward: 199.0 Training loss: 0.2321 Explore P: 0.0120
Episode: 860 Total reward: 199.0 Training loss: 0.3426 Explore P: 0.0119
Episode: 861 Total reward: 199.0 Training loss: 0.2561 Explore P: 0.0119
Episode: 862 Total reward: 199.0 Training loss: 0.5107 Explore P: 0.0118
Episode: 863 Total reward: 199.0 Training loss: 0.2288 Explore P: 0.0118
Episode: 864 Total reward: 199.0 Training loss: 0.1480 Explore P: 0.0118
Episode: 865 Total reward: 199.0 Training loss: 0.0780 Explore P: 0.0117
Episode: 866 Total reward: 199.0 Training loss: 0.1057 Explore P: 0.0117
Episode: 867 Total reward: 199.0 Training loss: 0.4608 Explore P: 0.0117
Episode: 868 Total reward: 199.0 Training loss: 0.1974 Explore P: 0.0116
Episode: 869 Total reward: 199.0 Training loss: 0.2630 Explore P: 0.0116
Episode: 870 Total reward: 199.0 Training loss: 0.3215 Explore P: 0.0116
Episode: 871 Total reward: 199.0 Training loss: 0.2262 Explore P: 0.0115
Episode: 872 Total reward: 199.0 Training loss: 0.1668 Explore P: 0.0115
Episode: 873 Total reward: 199.0 Training loss: 0.1813 Explore P: 0.0115
Episode: 874 Total reward: 199.0 Training loss: 289.3306 Explore P: 0.0115
Episode: 875 Total reward: 199.0 Training loss: 0.3145 Explore P: 0.0114
Episode: 876 Total reward: 199.0 Training loss: 0.3400 Explore P: 0.0114
Episode: 877 Total reward: 199.0 Training loss: 0.2022 Explore P: 0.0114
Episode: 878 Total reward: 199.0 Training loss: 0.3677 Explore P: 0.0113
Episode: 879 Total reward: 199.0 Training loss: 0.1930 Explore P: 0.0113
Episode: 880 Total reward: 199.0 Training loss: 0.3084 Explore P: 0.0113
Episode: 881 Total reward: 199.0 Training loss: 0.3082 Explore P: 0.0113
Episode: 882 Total reward: 199.0 Training loss: 0.1577 Explore P: 0.0112
Episode: 883 Total reward: 199.0 Training loss: 0.2435 Explore P: 0.0112
Episode: 884 Total reward: 199.0 Training loss: 0.3377 Explore P: 0.0112
Episode: 885 Total reward: 199.0 Training loss: 0.2148 Explore P: 0.0112
Episode: 886 Total reward: 199.0 Training loss: 0.1416 Explore P: 0.0111
Episode: 887 Total reward: 199.0 Training loss: 0.2042 Explore P: 0.0111
Episode: 888 Total reward: 199.0 Training loss: 0.3368 Explore P: 0.0111
Episode: 889 Total reward: 199.0 Training loss: 0.2360 Explore P: 0.0111
Episode: 890 Total reward: 199.0 Training loss: 0.2271 Explore P: 0.0111
Episode: 891 Total reward: 199.0 Training loss: 0.2756 Explore P: 0.0110
Episode: 892 Total reward: 199.0 Training loss: 0.3972 Explore P: 0.0110
Episode: 893 Total reward: 199.0 Training loss: 156.1931 Explore P: 0.0110
Episode: 894 Total reward: 199.0 Training loss: 0.2199 Explore P: 0.0110
Episode: 895 Total reward: 199.0 Training loss: 0.3039 Explore P: 0.0110
Episode: 896 Total reward: 199.0 Training loss: 0.2759 Explore P: 0.0109
Episode: 897 Total reward: 199.0 Training loss: 0.2773 Explore P: 0.0109
Episode: 898 Total reward: 199.0 Training loss: 0.4188 Explore P: 0.0109
Episode: 899 Total reward: 199.0 Training loss: 0.1329 Explore P: 0.0109
Episode: 900 Total reward: 199.0 Training loss: 0.2957 Explore P: 0.0109
Episode: 901 Total reward: 199.0 Training loss: 261.5996 Explore P: 0.0109
Episode: 902 Total reward: 199.0 Training loss: 0.1624 Explore P: 0.0108
Episode: 903 Total reward: 199.0 Training loss: 0.2032 Explore P: 0.0108
Episode: 904 Total reward: 199.0 Training loss: 269.8798 Explore P: 0.0108
Episode: 905 Total reward: 199.0 Training loss: 0.3439 Explore P: 0.0108
Episode: 906 Total reward: 199.0 Training loss: 0.1889 Explore P: 0.0108
Episode: 907 Total reward: 199.0 Training loss: 0.2253 Explore P: 0.0108
Episode: 908 Total reward: 199.0 Training loss: 0.2174 Explore P: 0.0107
Episode: 909 Total reward: 199.0 Training loss: 0.3052 Explore P: 0.0107
Episode: 910 Total reward: 199.0 Training loss: 0.1948 Explore P: 0.0107
Episode: 911 Total reward: 199.0 Training loss: 0.0998 Explore P: 0.0107
Episode: 912 Total reward: 199.0 Training loss: 0.3194 Explore P: 0.0107
Episode: 913 Total reward: 199.0 Training loss: 0.2392 Explore P: 0.0107
Episode: 914 Total reward: 199.0 Training loss: 0.3429 Explore P: 0.0107
Episode: 915 Total reward: 199.0 Training loss: 0.3291 Explore P: 0.0106
Episode: 916 Total reward: 199.0 Training loss: 0.1999 Explore P: 0.0106
Episode: 917 Total reward: 199.0 Training loss: 0.2488 Explore P: 0.0106
Episode: 918 Total reward: 199.0 Training loss: 0.2977 Explore P: 0.0106
Episode: 919 Total reward: 199.0 Training loss: 0.2658 Explore P: 0.0106
Episode: 920 Total reward: 199.0 Training loss: 0.5628 Explore P: 0.0106
Episode: 921 Total reward: 199.0 Training loss: 415.7462 Explore P: 0.0106
Episode: 922 Total reward: 199.0 Training loss: 0.5580 Explore P: 0.0106
Episode: 923 Total reward: 199.0 Training loss: 0.4268 Explore P: 0.0105
Episode: 924 Total reward: 199.0 Training loss: 0.4177 Explore P: 0.0105
Episode: 925 Total reward: 199.0 Training loss: 0.3238 Explore P: 0.0105
Episode: 926 Total reward: 199.0 Training loss: 0.9700 Explore P: 0.0105
Episode: 927 Total reward: 199.0 Training loss: 0.3719 Explore P: 0.0105
Episode: 928 Total reward: 199.0 Training loss: 0.4357 Explore P: 0.0105
Episode: 929 Total reward: 199.0 Training loss: 0.1644 Explore P: 0.0105
Episode: 930 Total reward: 199.0 Training loss: 0.5693 Explore P: 0.0105
Episode: 931 Total reward: 199.0 Training loss: 0.3190 Explore P: 0.0105
Episode: 932 Total reward: 199.0 Training loss: 0.3041 Explore P: 0.0105
Episode: 933 Total reward: 199.0 Training loss: 0.1038 Explore P: 0.0104
Episode: 934 Total reward: 199.0 Training loss: 0.3783 Explore P: 0.0104
Episode: 935 Total reward: 199.0 Training loss: 0.5780 Explore P: 0.0104
Episode: 936 Total reward: 199.0 Training loss: 0.0886 Explore P: 0.0104
Episode: 937 Total reward: 199.0 Training loss: 0.2582 Explore P: 0.0104
Episode: 938 Total reward: 199.0 Training loss: 0.1841 Explore P: 0.0104
Episode: 939 Total reward: 199.0 Training loss: 0.1994 Explore P: 0.0104
Episode: 940 Total reward: 199.0 Training loss: 0.2028 Explore P: 0.0104
Episode: 941 Total reward: 199.0 Training loss: 0.0838 Explore P: 0.0104
Episode: 942 Total reward: 199.0 Training loss: 0.3073 Explore P: 0.0104
Episode: 943 Total reward: 199.0 Training loss: 0.1833 Explore P: 0.0104
Episode: 944 Total reward: 199.0 Training loss: 0.2742 Explore P: 0.0104
Episode: 945 Total reward: 199.0 Training loss: 214.9985 Explore P: 0.0104
Episode: 946 Total reward: 199.0 Training loss: 0.1254 Explore P: 0.0103
Episode: 947 Total reward: 199.0 Training loss: 0.0825 Explore P: 0.0103
Episode: 948 Total reward: 199.0 Training loss: 0.1951 Explore P: 0.0103
Episode: 949 Total reward: 199.0 Training loss: 0.1711 Explore P: 0.0103
Episode: 950 Total reward: 199.0 Training loss: 0.1096 Explore P: 0.0103
Episode: 951 Total reward: 199.0 Training loss: 0.1067 Explore P: 0.0103
Episode: 952 Total reward: 199.0 Training loss: 0.1285 Explore P: 0.0103
Episode: 953 Total reward: 199.0 Training loss: 0.1774 Explore P: 0.0103
Episode: 954 Total reward: 199.0 Training loss: 0.2564 Explore P: 0.0103
Episode: 955 Total reward: 199.0 Training loss: 0.1309 Explore P: 0.0103
Episode: 956 Total reward: 199.0 Training loss: 0.1319 Explore P: 0.0103
Episode: 957 Total reward: 199.0 Training loss: 176.8958 Explore P: 0.0103
Episode: 958 Total reward: 199.0 Training loss: 0.5370 Explore P: 0.0103
Episode: 959 Total reward: 199.0 Training loss: 0.3356 Explore P: 0.0103
Episode: 960 Total reward: 199.0 Training loss: 207.8371 Explore P: 0.0103
Episode: 961 Total reward: 199.0 Training loss: 0.2098 Explore P: 0.0103
Episode: 962 Total reward: 199.0 Training loss: 0.4552 Explore P: 0.0103
Episode: 963 Total reward: 199.0 Training loss: 0.6278 Explore P: 0.0102
Episode: 964 Total reward: 199.0 Training loss: 0.1420 Explore P: 0.0102
Episode: 965 Total reward: 199.0 Training loss: 241.4790 Explore P: 0.0102
Episode: 966 Total reward: 199.0 Training loss: 0.4496 Explore P: 0.0102
Episode: 967 Total reward: 199.0 Training loss: 0.2990 Explore P: 0.0102
Episode: 968 Total reward: 199.0 Training loss: 0.2264 Explore P: 0.0102
Episode: 969 Total reward: 199.0 Training loss: 0.5039 Explore P: 0.0102
Episode: 970 Total reward: 199.0 Training loss: 0.7879 Explore P: 0.0102
Episode: 971 Total reward: 199.0 Training loss: 0.3346 Explore P: 0.0102
Episode: 972 Total reward: 168.0 Training loss: 1.9756 Explore P: 0.0102
Episode: 973 Total reward: 19.0 Training loss: 305.7015 Explore P: 0.0102
Episode: 974 Total reward: 107.0 Training loss: 1.5985 Explore P: 0.0102
Episode: 975 Total reward: 17.0 Training loss: 389.1099 Explore P: 0.0102
Episode: 976 Total reward: 13.0 Training loss: 1.8880 Explore P: 0.0102
Episode: 977 Total reward: 11.0 Training loss: 2.0341 Explore P: 0.0102
Episode: 978 Total reward: 12.0 Training loss: 2.6920 Explore P: 0.0102
Episode: 979 Total reward: 11.0 Training loss: 2.3802 Explore P: 0.0102
Episode: 980 Total reward: 9.0 Training loss: 3.1442 Explore P: 0.0102
Episode: 981 Total reward: 10.0 Training loss: 414.1959 Explore P: 0.0102
Episode: 982 Total reward: 10.0 Training loss: 4.0472 Explore P: 0.0102
Episode: 983 Total reward: 11.0 Training loss: 1159.8199 Explore P: 0.0102
Episode: 984 Total reward: 12.0 Training loss: 3.4771 Explore P: 0.0102
Episode: 985 Total reward: 9.0 Training loss: 3.7970 Explore P: 0.0102
Episode: 986 Total reward: 8.0 Training loss: 2.9343 Explore P: 0.0102
Episode: 987 Total reward: 9.0 Training loss: 3.4255 Explore P: 0.0102
Episode: 988 Total reward: 9.0 Training loss: 647.5013 Explore P: 0.0102
Episode: 989 Total reward: 10.0 Training loss: 3.0775 Explore P: 0.0102
Episode: 990 Total reward: 11.0 Training loss: 5.2867 Explore P: 0.0102
Episode: 991 Total reward: 9.0 Training loss: 4.3347 Explore P: 0.0102
Episode: 992 Total reward: 12.0 Training loss: 3.4616 Explore P: 0.0102
Episode: 993 Total reward: 11.0 Training loss: 3.4846 Explore P: 0.0102
Episode: 994 Total reward: 8.0 Training loss: 2.6049 Explore P: 0.0102
Episode: 995 Total reward: 8.0 Training loss: 526.9852 Explore P: 0.0102
Episode: 996 Total reward: 10.0 Training loss: 680.9720 Explore P: 0.0102
Episode: 997 Total reward: 10.0 Training loss: 709.1796 Explore P: 0.0102
Episode: 998 Total reward: 11.0 Training loss: 576.6755 Explore P: 0.0102
Episode: 999 Total reward: 7.0 Training loss: 3.0512 Explore P: 0.0102

Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.


In [167]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [181]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[181]:
<matplotlib.text.Text at 0x125c136d8>

Testing

Let's checkout how our trained agent plays the game.


In [183]:
test_episodes = 10
test_max_steps = 400
env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1

In [184]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.