Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-04-14 14:17:18,778] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing an action as an integer to env.step generates the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This interface is common to all Gym games. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [4]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount applied to future rewards, and $s'$ is the next state reached from state $s$ by taking action $a$.
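
As a small worked example of the right-hand side, the sketch below evaluates $r + \gamma \max_{a'} Q(s', a')$ for a single transition. The Q-values, the state names, and the helper name `bellman_target` are all made up for illustration; they are not part of the notebook's code.

```python
# Hypothetical Q-values for two states and two actions (0 = left, 1 = right).
q_table = {
    "s": [0.5, 0.2],
    "s_next": [1.0, 3.0],
}


def bellman_target(reward, next_state, gamma, q):
    """Return r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(q[next_state])


target = bellman_target(reward=1.0, next_state="s_next", gamma=0.99, q=q_table)
print(target)  # 1.0 + 0.99 * 3.0, i.e. about 3.97
```

Only the largest Q-value in the next state matters for the target, which is why the action taken in $s'$ does not appear on the left-hand side.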

Previously, we used this equation to learn values for a Q-table. However, this game has a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q-value $Q(s, a)$ is calculated by passing a state into the network, which is built from fully connected hidden layers. The output is one Q-value for each available action.
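
To make the shape of that computation concrete, here is a rough pure-Python sketch of a forward pass: four state values in, one ReLU hidden layer, two Q-values out. It is only an illustration. The helper names, the random weights, the single hidden layer, and the hidden size of 8 are all assumptions, not the network used in this notebook.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
state_size, hidden_size, action_size = 4, 8, 2  # hidden_size is arbitrary here
w1, b1 = init_layer(state_size, hidden_size, rng)
w2, b2 = init_layer(hidden_size, action_size, rng)

# Made-up state: cart position, cart velocity, pole angle, pole angular velocity.
state = [0.01, -0.02, 0.03, 0.04]
hidden = relu(dense(state, w1, b1))
q_values = dense(hidden, w2, b2)  # linear output, one Q-value per action
best_action = max(range(action_size), key=lambda a: q_values[a])
print(q_values, best_action)
```

One forward pass gives the Q-values for every action at once, so picking the greedy action is a single argmax over the output.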

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
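
As a quick piece of worked arithmetic, the sketch below treats $Q(s,a)$ as a single adjustable number and takes one gradient descent step on the squared error. All the numbers (including the learning rate of 0.1) are assumed for illustration; in the real network the same step is applied to the weights rather than to the Q-value directly.

```python
gamma = 0.99
reward = 1.0
max_next_q = 2.0     # assumed value of max over a' of Q(s', a')
q_sa = 0.5           # assumed current estimate of Q(s, a)
learning_rate = 0.1  # assumed step size

target = reward + gamma * max_next_q  # Q-hat(s, a), held fixed during the step
loss_before = (target - q_sa) ** 2

# d/dq (target - q)^2 = -2 * (target - q); move against the gradient.
grad = -2.0 * (target - q_sa)
q_sa = q_sa - learning_rate * grad
loss_after = (target - q_sa) ** 2

print(loss_before, loss_after)  # the loss shrinks after the step
```

The estimate moves from 0.5 toward the target of 2.98, which is the behaviour the optimizer produces on each mini-batch.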

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [5]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
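
The one-hot trick in the loss above can be hard to picture, so here is a plain-Python sketch of the same idea: multiply each row of network outputs by a one-hot mask and sum, which keeps only the Q-value of the action that was actually taken. The helper names and the numbers are made up for illustration.

```python
def one_hot(index, depth):
    return [1.0 if i == index else 0.0 for i in range(depth)]


def select_q(outputs, actions):
    """For each row, keep only the Q-value of the action that was taken."""
    selected = []
    for row, action in zip(outputs, actions):
        mask = one_hot(action, len(row))
        selected.append(sum(q * m for q, m in zip(row, mask)))
    return selected


outputs = [[0.2, 0.7], [1.5, -0.3], [0.0, 0.4]]  # made-up outputs, one row per state
actions = [1, 0, 1]                               # action taken in each state
print(select_q(outputs, actions))  # [0.7, 1.5, 0.4]
```

Because the mask zeroes out the other action, the loss only pushes on the Q-value for the action that was taken.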

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.
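
A tiny demonstration of that eviction behaviour, using a capacity of 3 purely for illustration:

```python
from collections import deque

buffer = deque(maxlen=3)
for experience in range(5):
    buffer.append(experience)  # once full, each append drops the oldest item

print(list(buffer))  # [2, 3, 4]
```

The two oldest items (0 and 1) were pushed out automatically, so no extra bookkeeping is needed to cap the memory size.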


In [6]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
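
One thing to keep in mind when sampling without replacement is that the buffer must hold at least batch_size experiences, otherwise the draw fails. The sketch below shows the same constraint with the standard library's `random.sample`, used here as a stand-in for the `np.random.choice` call above; the placeholder experiences are made up.

```python
import random
from collections import deque

buffer = deque(maxlen=1000)
for i in range(5):
    buffer.append(("state", i))  # placeholder experiences

batch = random.sample(list(buffer), 3)  # three distinct experiences
print(len(batch), len(set(batch)))      # 3 3

oversample_failed = False
try:
    random.sample(list(buffer), 10)     # more than the 5 stored
except ValueError:
    oversample_failed = True
print(oversample_failed)                # True
```

This is why the memory gets filled with some experiences before training starts.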

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ it will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.
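
A minimal sketch of $\epsilon$-greedy action selection, assuming the Q-values for the current state are already available as a list. The function name `epsilon_greedy` and the Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9]  # made-up Q-values for actions 0 and 1
rng = random.Random(0)
greedy_picks = [epsilon_greedy(q_values, 0.0, rng) for _ in range(5)]
mixed_picks = [epsilon_greedy(q_values, 0.5, rng) for _ in range(1000)]
print(greedy_picks)                 # [1, 1, 1, 1, 1]
print(mixed_picks.count(0) / 1000)  # roughly 0.25
```

With $\epsilon = 0.5$, action 0 is only ever picked during exploration, and then only half the time, so it shows up in about a quarter of the draws.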

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
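
One way to get this behaviour is to decay the exploration probability exponentially with the number of training steps, which is the schedule the training loop later in this notebook uses. The sketch below wraps that formula in a hypothetical helper, `explore_probability`, with the notebook's start, stop, and decay values as defaults.

```python
import math


def explore_probability(step, start=1.0, stop=0.01, decay_rate=0.0001):
    """Exponentially decay from `start` toward `stop` as `step` grows."""
    return stop + (start - stop) * math.exp(-decay_rate * step)


for step in (0, 1000, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

The probability starts at 1.0, so early actions are almost all random, and it flattens out near 0.01, so the agent never stops exploring completely.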

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
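
The target-setting step in the list above can be sketched as follows. This assumes, as the code in this notebook does, that a transition ending an episode is stored with an all-zero next state. The names `q_targets` and `toy_q` are hypothetical, and `toy_q` is a made-up stand-in for the Q-network.

```python
def q_targets(batch, q_fn, gamma):
    """Compute Q-hat for each (state, action, reward, next_state) transition.

    An all-zero next_state marks the end of an episode, in which case the
    target is just the reward.
    """
    targets = []
    for _state, _action, reward, next_state in batch:
        if all(v == 0.0 for v in next_state):
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(q_fn(next_state)))
    return targets


def toy_q(state):
    # Hypothetical stand-in for the Q-network: two made-up action values.
    return [sum(state), 2.0 * sum(state)]


batch = [
    ([0.1, 0.0, 0.0, 0.0], 1, 1.0, [0.5, 0.0, 0.0, 0.0]),  # episode continues
    ([0.5, 0.0, 0.0, 0.0], 0, 1.0, [0.0, 0.0, 0.0, 0.0]),  # episode ended
]
print(q_targets(batch, toy_q, gamma=0.5))  # [1.5, 1.0]
```

Dropping the bootstrap term for terminal transitions matters: there is no future reward after the pole falls, so the target should not include one.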

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [7]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory

In [8]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [9]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [11]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    loss = float('nan')  # placeholder in case an episode ends before the first training step
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 17.0 Training loss: 1.1153 Explore P: 0.9983
Episode: 2 Total reward: 37.0 Training loss: 0.9784 Explore P: 0.9947
Episode: 3 Total reward: 15.0 Training loss: 1.0622 Explore P: 0.9932
Episode: 4 Total reward: 8.0 Training loss: 1.0278 Explore P: 0.9924
Episode: 5 Total reward: 21.0 Training loss: 0.9641 Explore P: 0.9903
Episode: 6 Total reward: 22.0 Training loss: 1.1733 Explore P: 0.9882
Episode: 7 Total reward: 19.0 Training loss: 1.1709 Explore P: 0.9863
Episode: 8 Total reward: 21.0 Training loss: 1.0342 Explore P: 0.9843
Episode: 9 Total reward: 16.0 Training loss: 1.0193 Explore P: 0.9827
Episode: 10 Total reward: 46.0 Training loss: 1.1574 Explore P: 0.9783
Episode: 11 Total reward: 14.0 Training loss: 1.4772 Explore P: 0.9769
Episode: 12 Total reward: 11.0 Training loss: 1.2652 Explore P: 0.9758
Episode: 13 Total reward: 39.0 Training loss: 0.9179 Explore P: 0.9721
Episode: 14 Total reward: 28.0 Training loss: 1.1872 Explore P: 0.9694
Episode: 15 Total reward: 16.0 Training loss: 1.0078 Explore P: 0.9679
Episode: 16 Total reward: 12.0 Training loss: 1.2956 Explore P: 0.9667
Episode: 17 Total reward: 16.0 Training loss: 0.9956 Explore P: 0.9652
Episode: 18 Total reward: 17.0 Training loss: 1.4238 Explore P: 0.9636
Episode: 19 Total reward: 15.0 Training loss: 0.9262 Explore P: 0.9621
Episode: 20 Total reward: 21.0 Training loss: 1.2144 Explore P: 0.9601
Episode: 21 Total reward: 9.0 Training loss: 1.1086 Explore P: 0.9593
Episode: 22 Total reward: 22.0 Training loss: 0.8835 Explore P: 0.9572
Episode: 23 Total reward: 14.0 Training loss: 1.1917 Explore P: 0.9559
Episode: 24 Total reward: 13.0 Training loss: 0.9507 Explore P: 0.9546
Episode: 25 Total reward: 19.0 Training loss: 1.3503 Explore P: 0.9528
Episode: 26 Total reward: 38.0 Training loss: 1.4630 Explore P: 0.9493
Episode: 27 Total reward: 11.0 Training loss: 1.6650 Explore P: 0.9482
Episode: 28 Total reward: 52.0 Training loss: 1.8349 Explore P: 0.9434
Episode: 29 Total reward: 19.0 Training loss: 1.2243 Explore P: 0.9416
Episode: 30 Total reward: 41.0 Training loss: 1.6829 Explore P: 0.9378
Episode: 31 Total reward: 25.0 Training loss: 1.3682 Explore P: 0.9355
Episode: 32 Total reward: 12.0 Training loss: 1.4296 Explore P: 0.9344
Episode: 33 Total reward: 23.0 Training loss: 2.1344 Explore P: 0.9322
Episode: 34 Total reward: 29.0 Training loss: 2.5191 Explore P: 0.9296
Episode: 35 Total reward: 14.0 Training loss: 1.7183 Explore P: 0.9283
Episode: 36 Total reward: 8.0 Training loss: 2.3577 Explore P: 0.9275
Episode: 37 Total reward: 17.0 Training loss: 1.5941 Explore P: 0.9260
Episode: 38 Total reward: 15.0 Training loss: 1.9791 Explore P: 0.9246
Episode: 39 Total reward: 16.0 Training loss: 4.1643 Explore P: 0.9232
Episode: 40 Total reward: 16.0 Training loss: 5.3214 Explore P: 0.9217
Episode: 41 Total reward: 37.0 Training loss: 1.4230 Explore P: 0.9183
Episode: 42 Total reward: 16.0 Training loss: 3.0442 Explore P: 0.9169
Episode: 43 Total reward: 16.0 Training loss: 8.0309 Explore P: 0.9154
Episode: 44 Total reward: 26.0 Training loss: 4.7849 Explore P: 0.9131
Episode: 45 Total reward: 31.0 Training loss: 2.3925 Explore P: 0.9103
Episode: 46 Total reward: 38.0 Training loss: 2.6666 Explore P: 0.9069
Episode: 47 Total reward: 17.0 Training loss: 17.2985 Explore P: 0.9053
Episode: 48 Total reward: 16.0 Training loss: 16.3878 Explore P: 0.9039
Episode: 49 Total reward: 22.0 Training loss: 64.4369 Explore P: 0.9019
Episode: 50 Total reward: 14.0 Training loss: 3.7114 Explore P: 0.9007
Episode: 51 Total reward: 10.0 Training loss: 4.5585 Explore P: 0.8998
Episode: 52 Total reward: 15.0 Training loss: 3.5621 Explore P: 0.8985
Episode: 53 Total reward: 18.0 Training loss: 2.6143 Explore P: 0.8969
Episode: 54 Total reward: 26.0 Training loss: 24.8376 Explore P: 0.8946
Episode: 55 Total reward: 10.0 Training loss: 20.9677 Explore P: 0.8937
Episode: 56 Total reward: 9.0 Training loss: 6.5235 Explore P: 0.8929
Episode: 57 Total reward: 21.0 Training loss: 4.9882 Explore P: 0.8910
Episode: 58 Total reward: 9.0 Training loss: 4.5114 Explore P: 0.8902
Episode: 59 Total reward: 17.0 Training loss: 3.3088 Explore P: 0.8888
Episode: 60 Total reward: 77.0 Training loss: 49.9368 Explore P: 0.8820
Episode: 61 Total reward: 18.0 Training loss: 5.2048 Explore P: 0.8804
Episode: 62 Total reward: 17.0 Training loss: 7.8434 Explore P: 0.8790
Episode: 63 Total reward: 23.0 Training loss: 6.6517 Explore P: 0.8770
Episode: 64 Total reward: 22.0 Training loss: 8.7147 Explore P: 0.8751
Episode: 65 Total reward: 21.0 Training loss: 108.4805 Explore P: 0.8733
Episode: 66 Total reward: 13.0 Training loss: 5.9709 Explore P: 0.8721
Episode: 67 Total reward: 30.0 Training loss: 10.4376 Explore P: 0.8695
Episode: 68 Total reward: 15.0 Training loss: 9.4567 Explore P: 0.8683
Episode: 69 Total reward: 21.0 Training loss: 6.6202 Explore P: 0.8665
Episode: 70 Total reward: 14.0 Training loss: 6.1669 Explore P: 0.8653
Episode: 71 Total reward: 13.0 Training loss: 9.0623 Explore P: 0.8641
Episode: 72 Total reward: 12.0 Training loss: 10.3997 Explore P: 0.8631
Episode: 73 Total reward: 9.0 Training loss: 9.7964 Explore P: 0.8624
Episode: 74 Total reward: 12.0 Training loss: 5.6110 Explore P: 0.8613
Episode: 75 Total reward: 37.0 Training loss: 8.6656 Explore P: 0.8582
Episode: 76 Total reward: 15.0 Training loss: 7.5292 Explore P: 0.8569
Episode: 77 Total reward: 10.0 Training loss: 11.9470 Explore P: 0.8561
Episode: 78 Total reward: 13.0 Training loss: 249.3708 Explore P: 0.8550
Episode: 79 Total reward: 32.0 Training loss: 8.4261 Explore P: 0.8523
Episode: 80 Total reward: 19.0 Training loss: 8.9096 Explore P: 0.8507
Episode: 81 Total reward: 28.0 Training loss: 11.4269 Explore P: 0.8483
Episode: 82 Total reward: 18.0 Training loss: 355.0399 Explore P: 0.8468
Episode: 83 Total reward: 11.0 Training loss: 7.4039 Explore P: 0.8459
Episode: 84 Total reward: 20.0 Training loss: 157.3670 Explore P: 0.8442
Episode: 85 Total reward: 17.0 Training loss: 15.5268 Explore P: 0.8428
Episode: 86 Total reward: 11.0 Training loss: 10.0476 Explore P: 0.8419
Episode: 87 Total reward: 10.0 Training loss: 4.5766 Explore P: 0.8411
Episode: 88 Total reward: 10.0 Training loss: 4.0252 Explore P: 0.8402
Episode: 89 Total reward: 13.0 Training loss: 6.2767 Explore P: 0.8392
Episode: 90 Total reward: 15.0 Training loss: 200.1621 Explore P: 0.8379
Episode: 91 Total reward: 14.0 Training loss: 11.4019 Explore P: 0.8368
Episode: 92 Total reward: 13.0 Training loss: 5.3701 Explore P: 0.8357
Episode: 93 Total reward: 19.0 Training loss: 475.2175 Explore P: 0.8341
Episode: 94 Total reward: 26.0 Training loss: 8.8604 Explore P: 0.8320
Episode: 95 Total reward: 17.0 Training loss: 8.5144 Explore P: 0.8306
Episode: 96 Total reward: 22.0 Training loss: 10.3373 Explore P: 0.8288
Episode: 97 Total reward: 25.0 Training loss: 334.2906 Explore P: 0.8267
Episode: 98 Total reward: 17.0 Training loss: 10.0969 Explore P: 0.8253
Episode: 99 Total reward: 11.0 Training loss: 8.2079 Explore P: 0.8244
Episode: 100 Total reward: 27.0 Training loss: 7.0060 Explore P: 0.8222
Episode: 101 Total reward: 13.0 Training loss: 10.3138 Explore P: 0.8212
Episode: 102 Total reward: 10.0 Training loss: 9.6756 Explore P: 0.8204
Episode: 103 Total reward: 28.0 Training loss: 7.6019 Explore P: 0.8181
Episode: 104 Total reward: 25.0 Training loss: 9.7619 Explore P: 0.8161
Episode: 105 Total reward: 21.0 Training loss: 301.5739 Explore P: 0.8144
Episode: 106 Total reward: 28.0 Training loss: 564.3414 Explore P: 0.8122
Episode: 107 Total reward: 23.0 Training loss: 7.0282 Explore P: 0.8103
Episode: 108 Total reward: 24.0 Training loss: 8.7630 Explore P: 0.8084
Episode: 109 Total reward: 28.0 Training loss: 303.1896 Explore P: 0.8062
Episode: 110 Total reward: 17.0 Training loss: 7.1595 Explore P: 0.8048
Episode: 111 Total reward: 31.0 Training loss: 7.6050 Explore P: 0.8024
Episode: 112 Total reward: 12.0 Training loss: 8.5109 Explore P: 0.8014
Episode: 113 Total reward: 11.0 Training loss: 6.7949 Explore P: 0.8005
Episode: 114 Total reward: 12.0 Training loss: 6.7784 Explore P: 0.7996
Episode: 115 Total reward: 17.0 Training loss: 794.4244 Explore P: 0.7982
Episode: 116 Total reward: 14.0 Training loss: 9.4618 Explore P: 0.7971
Episode: 117 Total reward: 34.0 Training loss: 7.2674 Explore P: 0.7945
Episode: 118 Total reward: 16.0 Training loss: 7.7536 Explore P: 0.7932
Episode: 119 Total reward: 15.0 Training loss: 7.9483 Explore P: 0.7920
Episode: 120 Total reward: 21.0 Training loss: 8.4418 Explore P: 0.7904
Episode: 121 Total reward: 39.0 Training loss: 714.3967 Explore P: 0.7874
Episode: 122 Total reward: 18.0 Training loss: 6.0396 Explore P: 0.7860
Episode: 123 Total reward: 42.0 Training loss: 5.8918 Explore P: 0.7827
Episode: 124 Total reward: 21.0 Training loss: 441.4326 Explore P: 0.7811
Episode: 125 Total reward: 19.0 Training loss: 1.5162 Explore P: 0.7796
Episode: 126 Total reward: 27.0 Training loss: 3.1248 Explore P: 0.7776
Episode: 127 Total reward: 11.0 Training loss: 2.1723 Explore P: 0.7767
Episode: 128 Total reward: 28.0 Training loss: 2.4717 Explore P: 0.7746
Episode: 129 Total reward: 10.0 Training loss: 4.4096 Explore P: 0.7738
Episode: 130 Total reward: 25.0 Training loss: 357.2564 Explore P: 0.7719
Episode: 131 Total reward: 34.0 Training loss: 3.3282 Explore P: 0.7693
Episode: 132 Total reward: 43.0 Training loss: 841.0543 Explore P: 0.7660
Episode: 133 Total reward: 10.0 Training loss: 340.2225 Explore P: 0.7653
Episode: 134 Total reward: 16.0 Training loss: 712.6940 Explore P: 0.7641
Episode: 135 Total reward: 17.0 Training loss: 0.9827 Explore P: 0.7628
Episode: 136 Total reward: 13.0 Training loss: 2.4497 Explore P: 0.7618
Episode: 137 Total reward: 44.0 Training loss: 463.6692 Explore P: 0.7585
Episode: 138 Total reward: 15.0 Training loss: 2.2495 Explore P: 0.7574
Episode: 139 Total reward: 8.0 Training loss: 1.4404 Explore P: 0.7568
Episode: 140 Total reward: 22.0 Training loss: 772.9932 Explore P: 0.7552
Episode: 141 Total reward: 25.0 Training loss: 3.1275 Explore P: 0.7533
Episode: 142 Total reward: 17.0 Training loss: 3.8691 Explore P: 0.7520
Episode: 143 Total reward: 22.0 Training loss: 342.1142 Explore P: 0.7504
Episode: 144 Total reward: 16.0 Training loss: 325.5000 Explore P: 0.7492
Episode: 145 Total reward: 18.0 Training loss: 224.2690 Explore P: 0.7479
Episode: 146 Total reward: 8.0 Training loss: 309.0187 Explore P: 0.7473
Episode: 147 Total reward: 12.0 Training loss: 2.0449 Explore P: 0.7464
Episode: 148 Total reward: 18.0 Training loss: 1.4730 Explore P: 0.7451
Episode: 149 Total reward: 61.0 Training loss: 1.3440 Explore P: 0.7406
Episode: 150 Total reward: 23.0 Training loss: 562.5506 Explore P: 0.7389
Episode: 151 Total reward: 14.0 Training loss: 1.1233 Explore P: 0.7379
Episode: 152 Total reward: 19.0 Training loss: 1.4315 Explore P: 0.7365
Episode: 153 Total reward: 15.0 Training loss: 0.8151 Explore P: 0.7355
Episode: 154 Total reward: 17.0 Training loss: 0.7221 Explore P: 0.7342
Episode: 155 Total reward: 21.0 Training loss: 1.8820 Explore P: 0.7327
Episode: 156 Total reward: 17.0 Training loss: 2.4699 Explore P: 0.7315
Episode: 157 Total reward: 33.0 Training loss: 281.8736 Explore P: 0.7291
Episode: 158 Total reward: 9.0 Training loss: 1.2585 Explore P: 0.7285
Episode: 159 Total reward: 11.0 Training loss: 292.9897 Explore P: 0.7277
Episode: 160 Total reward: 20.0 Training loss: 1.9928 Explore P: 0.7262
Episode: 161 Total reward: 22.0 Training loss: 1.5541 Explore P: 0.7247
Episode: 162 Total reward: 8.0 Training loss: 200.4036 Explore P: 0.7241
Episode: 163 Total reward: 11.0 Training loss: 851.0067 Explore P: 0.7233
Episode: 164 Total reward: 12.0 Training loss: 1.2402 Explore P: 0.7224
Episode: 165 Total reward: 15.0 Training loss: 1.7773 Explore P: 0.7214
Episode: 166 Total reward: 12.0 Training loss: 400.3169 Explore P: 0.7205
Episode: 167 Total reward: 11.0 Training loss: 1.4236 Explore P: 0.7197
Episode: 168 Total reward: 14.0 Training loss: 2.7575 Explore P: 0.7188
Episode: 169 Total reward: 15.0 Training loss: 1.8167 Explore P: 0.7177
Episode: 170 Total reward: 23.0 Training loss: 2.1272 Explore P: 0.7161
Episode: 171 Total reward: 12.0 Training loss: 2.8455 Explore P: 0.7152
Episode: 172 Total reward: 25.0 Training loss: 4.4971 Explore P: 0.7135
Episode: 173 Total reward: 12.0 Training loss: 4.4051 Explore P: 0.7126
Episode: 174 Total reward: 29.0 Training loss: 1.6799 Explore P: 0.7106
Episode: 175 Total reward: 10.0 Training loss: 629.5552 Explore P: 0.7099
Episode: 176 Total reward: 26.0 Training loss: 3.6326 Explore P: 0.7081
Episode: 177 Total reward: 22.0 Training loss: 266.2142 Explore P: 0.7065
Episode: 178 Total reward: 11.0 Training loss: 3.5761 Explore P: 0.7058
Episode: 179 Total reward: 14.0 Training loss: 3.4056 Explore P: 0.7048
Episode: 180 Total reward: 9.0 Training loss: 4.9221 Explore P: 0.7042
Episode: 181 Total reward: 21.0 Training loss: 1.8504 Explore P: 0.7027
Episode: 182 Total reward: 15.0 Training loss: 5.0086 Explore P: 0.7017
Episode: 183 Total reward: 17.0 Training loss: 2.5088 Explore P: 0.7005
Episode: 184 Total reward: 14.0 Training loss: 2.7079 Explore P: 0.6995
Episode: 185 Total reward: 10.0 Training loss: 2.5779 Explore P: 0.6988
Episode: 186 Total reward: 14.0 Training loss: 2.4201 Explore P: 0.6979
Episode: 187 Total reward: 10.0 Training loss: 2.1859 Explore P: 0.6972
Episode: 188 Total reward: 17.0 Training loss: 2.4125 Explore P: 0.6960
Episode: 189 Total reward: 18.0 Training loss: 2.3854 Explore P: 0.6948
Episode: 190 Total reward: 25.0 Training loss: 187.4262 Explore P: 0.6931
Episode: 191 Total reward: 11.0 Training loss: 495.9777 Explore P: 0.6923
Episode: 192 Total reward: 18.0 Training loss: 1.0880 Explore P: 0.6911
Episode: 193 Total reward: 19.0 Training loss: 2.2425 Explore P: 0.6898
Episode: 194 Total reward: 9.0 Training loss: 225.2420 Explore P: 0.6892
Episode: 195 Total reward: 23.0 Training loss: 1.1425 Explore P: 0.6876
Episode: 196 Total reward: 18.0 Training loss: 3.2818 Explore P: 0.6864
Episode: 197 Total reward: 29.0 Training loss: 1.6394 Explore P: 0.6845
Episode: 198 Total reward: 13.0 Training loss: 1.1648 Explore P: 0.6836
Episode: 199 Total reward: 12.0 Training loss: 3.4078 Explore P: 0.6828
Episode: 200 Total reward: 17.0 Training loss: 1.7970 Explore P: 0.6816
Episode: 201 Total reward: 19.0 Training loss: 2.3566 Explore P: 0.6804
Episode: 202 Total reward: 10.0 Training loss: 161.9451 Explore P: 0.6797
Episode: 203 Total reward: 13.0 Training loss: 1.1885 Explore P: 0.6788
Episode: 204 Total reward: 10.0 Training loss: 3.1069 Explore P: 0.6781
Episode: 205 Total reward: 18.0 Training loss: 263.5405 Explore P: 0.6769
Episode: 206 Total reward: 12.0 Training loss: 150.3121 Explore P: 0.6761
Episode: 207 Total reward: 19.0 Training loss: 241.3645 Explore P: 0.6749
Episode: 208 Total reward: 14.0 Training loss: 238.9797 Explore P: 0.6739
Episode: 209 Total reward: 10.0 Training loss: 5.2170 Explore P: 0.6733
Episode: 210 Total reward: 12.0 Training loss: 2.4129 Explore P: 0.6725
Episode: 211 Total reward: 10.0 Training loss: 1.6731 Explore P: 0.6718
Episode: 212 Total reward: 17.0 Training loss: 0.7651 Explore P: 0.6707
Episode: 213 Total reward: 19.0 Training loss: 0.7520 Explore P: 0.6694
Episode: 214 Total reward: 18.0 Training loss: 2.0599 Explore P: 0.6683
Episode: 215 Total reward: 16.0 Training loss: 244.1855 Explore P: 0.6672
Episode: 216 Total reward: 9.0 Training loss: 594.5690 Explore P: 0.6666
Episode: 217 Total reward: 17.0 Training loss: 1.5901 Explore P: 0.6655
Episode: 218 Total reward: 14.0 Training loss: 2.7579 Explore P: 0.6646
Episode: 219 Total reward: 34.0 Training loss: 1.7471 Explore P: 0.6624
Episode: 220 Total reward: 39.0 Training loss: 1.5974 Explore P: 0.6598
Episode: 221 Total reward: 15.0 Training loss: 1.0307 Explore P: 0.6589
Episode: 222 Total reward: 10.0 Training loss: 2.1904 Explore P: 0.6582
Episode: 223 Total reward: 15.0 Training loss: 2.8873 Explore P: 0.6572
Episode: 224 Total reward: 8.0 Training loss: 1.3599 Explore P: 0.6567
Episode: 225 Total reward: 15.0 Training loss: 1.9494 Explore P: 0.6557
Episode: 226 Total reward: 9.0 Training loss: 0.9970 Explore P: 0.6552
Episode: 227 Total reward: 26.0 Training loss: 1.2766 Explore P: 0.6535
Episode: 228 Total reward: 9.0 Training loss: 1.6536 Explore P: 0.6529
Episode: 229 Total reward: 17.0 Training loss: 0.7249 Explore P: 0.6518
Episode: 230 Total reward: 23.0 Training loss: 385.3306 Explore P: 0.6503
Episode: 231 Total reward: 35.0 Training loss: 183.5179 Explore P: 0.6481
Episode: 232 Total reward: 15.0 Training loss: 636.4583 Explore P: 0.6471
Episode: 233 Total reward: 9.0 Training loss: 1.1418 Explore P: 0.6466
Episode: 234 Total reward: 26.0 Training loss: 1.9151 Explore P: 0.6449
Episode: 235 Total reward: 17.0 Training loss: 454.7029 Explore P: 0.6438
Episode: 236 Total reward: 12.0 Training loss: 157.4639 Explore P: 0.6431
Episode: 237 Total reward: 12.0 Training loss: 0.5925 Explore P: 0.6423
Episode: 238 Total reward: 13.0 Training loss: 151.3665 Explore P: 0.6415
Episode: 239 Total reward: 13.0 Training loss: 171.7578 Explore P: 0.6407
Episode: 240 Total reward: 10.0 Training loss: 2.0987 Explore P: 0.6401
Episode: 241 Total reward: 28.0 Training loss: 2.3262 Explore P: 0.6383
Episode: 242 Total reward: 10.0 Training loss: 1.7188 Explore P: 0.6377
Episode: 243 Total reward: 15.0 Training loss: 172.9464 Explore P: 0.6367
Episode: 244 Total reward: 15.0 Training loss: 1.5114 Explore P: 0.6358
Episode: 245 Total reward: 12.0 Training loss: 1.9034 Explore P: 0.6350
Episode: 246 Total reward: 10.0 Training loss: 162.4112 Explore P: 0.6344
Episode: 247 Total reward: 7.0 Training loss: 204.8197 Explore P: 0.6340
Episode: 248 Total reward: 21.0 Training loss: 827.5753 Explore P: 0.6327
Episode: 249 Total reward: 9.0 Training loss: 1.6809 Explore P: 0.6321
Episode: 250 Total reward: 10.0 Training loss: 287.0580 Explore P: 0.6315
Episode: 251 Total reward: 20.0 Training loss: 470.1392 Explore P: 0.6302
Episode: 252 Total reward: 13.0 Training loss: 1.7842 Explore P: 0.6294
Episode: 253 Total reward: 15.0 Training loss: 128.3854 Explore P: 0.6285
Episode: 254 Total reward: 18.0 Training loss: 1.8093 Explore P: 0.6274
Episode: 255 Total reward: 22.0 Training loss: 1.1276 Explore P: 0.6260
Episode: 256 Total reward: 12.0 Training loss: 1.5765 Explore P: 0.6253
Episode: 257 Total reward: 8.0 Training loss: 1.6216 Explore P: 0.6248
Episode: 258 Total reward: 14.0 Training loss: 0.9159 Explore P: 0.6239
Episode: 259 Total reward: 15.0 Training loss: 246.8497 Explore P: 0.6230
Episode: 260 Total reward: 13.0 Training loss: 1.7824 Explore P: 0.6222
Episode: 261 Total reward: 9.0 Training loss: 155.8929 Explore P: 0.6217
Episode: 262 Total reward: 8.0 Training loss: 1.3674 Explore P: 0.6212
Episode: 263 Total reward: 17.0 Training loss: 146.3181 Explore P: 0.6202
Episode: 264 Total reward: 10.0 Training loss: 124.2479 Explore P: 0.6195
Episode: 265 Total reward: 8.0 Training loss: 118.2961 Explore P: 0.6191
Episode: 266 Total reward: 9.0 Training loss: 134.8288 Explore P: 0.6185
Episode: 267 Total reward: 10.0 Training loss: 1.9112 Explore P: 0.6179
Episode: 268 Total reward: 17.0 Training loss: 137.6139 Explore P: 0.6169
Episode: 269 Total reward: 9.0 Training loss: 1.5787 Explore P: 0.6163
Episode: 270 Total reward: 14.0 Training loss: 0.9221 Explore P: 0.6155
Episode: 271 Total reward: 14.0 Training loss: 1.3204 Explore P: 0.6146
Episode: 272 Total reward: 10.0 Training loss: 1.0847 Explore P: 0.6140
Episode: 273 Total reward: 13.0 Training loss: 243.5681 Explore P: 0.6132
Episode: 274 Total reward: 11.0 Training loss: 1.3681 Explore P: 0.6126
Episode: 275 Total reward: 11.0 Training loss: 1.0481 Explore P: 0.6119
Episode: 276 Total reward: 15.0 Training loss: 141.1072 Explore P: 0.6110
Episode: 277 Total reward: 11.0 Training loss: 233.4689 Explore P: 0.6103
Episode: 278 Total reward: 14.0 Training loss: 1.6510 Explore P: 0.6095
Episode: 279 Total reward: 10.0 Training loss: 143.2629 Explore P: 0.6089
Episode: 280 Total reward: 26.0 Training loss: 1.1154 Explore P: 0.6074
Episode: 281 Total reward: 16.0 Training loss: 0.9093 Explore P: 0.6064
Episode: 282 Total reward: 17.0 Training loss: 121.3505 Explore P: 0.6054
Episode: 283 Total reward: 12.0 Training loss: 126.9923 Explore P: 0.6047
Episode: 284 Total reward: 9.0 Training loss: 123.4567 Explore P: 0.6041
Episode: 285 Total reward: 14.0 Training loss: 1.5216 Explore P: 0.6033
Episode: 286 Total reward: 14.0 Training loss: 240.0504 Explore P: 0.6025
Episode: 287 Total reward: 13.0 Training loss: 115.0631 Explore P: 0.6017
Episode: 288 Total reward: 9.0 Training loss: 193.4437 Explore P: 0.6012
Episode: 289 Total reward: 8.0 Training loss: 237.9859 Explore P: 0.6007
Episode: 290 Total reward: 15.0 Training loss: 1.2959 Explore P: 0.5998
Episode: 291 Total reward: 12.0 Training loss: 106.9232 Explore P: 0.5991
Episode: 292 Total reward: 11.0 Training loss: 1.4474 Explore P: 0.5985
Episode: 293 Total reward: 19.0 Training loss: 0.9375 Explore P: 0.5973
Episode: 294 Total reward: 13.0 Training loss: 0.9031 Explore P: 0.5966
Episode: 295 Total reward: 10.0 Training loss: 121.6237 Explore P: 0.5960
Episode: 296 Total reward: 17.0 Training loss: 222.0397 Explore P: 0.5950
Episode: 297 Total reward: 13.0 Training loss: 112.7025 Explore P: 0.5942
Episode: 298 Total reward: 10.0 Training loss: 131.7434 Explore P: 0.5937
Episode: 299 Total reward: 18.0 Training loss: 1.0120 Explore P: 0.5926
Episode: 300 Total reward: 9.0 Training loss: 1.5773 Explore P: 0.5921
Episode: 301 Total reward: 10.0 Training loss: 161.4930 Explore P: 0.5915
Episode: 302 Total reward: 40.0 Training loss: 130.6983 Explore P: 0.5892
Episode: 303 Total reward: 22.0 Training loss: 1.5656 Explore P: 0.5879
Episode: 304 Total reward: 10.0 Training loss: 214.5751 Explore P: 0.5873
Episode: 305 Total reward: 9.0 Training loss: 237.3297 Explore P: 0.5868
Episode: 306 Total reward: 13.0 Training loss: 1.5965 Explore P: 0.5861
Episode: 307 Total reward: 15.0 Training loss: 1.1046 Explore P: 0.5852
Episode: 308 Total reward: 24.0 Training loss: 93.3901 Explore P: 0.5838
Episode: 309 Total reward: 10.0 Training loss: 127.6801 Explore P: 0.5832
Episode: 310 Total reward: 9.0 Training loss: 94.3086 Explore P: 0.5827
Episode: 311 Total reward: 16.0 Training loss: 1.4624 Explore P: 0.5818
Episode: 312 Total reward: 18.0 Training loss: 102.9245 Explore P: 0.5808
Episode: 313 Total reward: 14.0 Training loss: 2.1395 Explore P: 0.5800
Episode: 314 Total reward: 25.0 Training loss: 1.5585 Explore P: 0.5786
Episode: 315 Total reward: 11.0 Training loss: 1.6088 Explore P: 0.5779
Episode: 316 Total reward: 11.0 Training loss: 108.2826 Explore P: 0.5773
Episode: 317 Total reward: 14.0 Training loss: 1.7683 Explore P: 0.5765
Episode: 318 Total reward: 18.0 Training loss: 227.5329 Explore P: 0.5755
Episode: 319 Total reward: 27.0 Training loss: 0.9363 Explore P: 0.5740
Episode: 320 Total reward: 15.0 Training loss: 79.2613 Explore P: 0.5731
Episode: 321 Total reward: 10.0 Training loss: 1.8215 Explore P: 0.5726
Episode: 322 Total reward: 30.0 Training loss: 1.4458 Explore P: 0.5709
Episode: 323 Total reward: 30.0 Training loss: 1.5633 Explore P: 0.5692
Episode: 324 Total reward: 10.0 Training loss: 219.7365 Explore P: 0.5686
Episode: 325 Total reward: 11.0 Training loss: 1.9300 Explore P: 0.5680
Episode: 326 Total reward: 8.0 Training loss: 256.6587 Explore P: 0.5676
Episode: 327 Total reward: 12.0 Training loss: 104.3052 Explore P: 0.5669
Episode: 328 Total reward: 24.0 Training loss: 192.5328 Explore P: 0.5656
Episode: 329 Total reward: 19.0 Training loss: 2.2345 Explore P: 0.5645
Episode: 330 Total reward: 14.0 Training loss: 1.1749 Explore P: 0.5637
Episode: 331 Total reward: 13.0 Training loss: 1.5120 Explore P: 0.5630
Episode: 332 Total reward: 27.0 Training loss: 98.3868 Explore P: 0.5615
Episode: 333 Total reward: 13.0 Training loss: 85.3490 Explore P: 0.5608
Episode: 334 Total reward: 15.0 Training loss: 110.0502 Explore P: 0.5600
Episode: 335 Total reward: 12.0 Training loss: 106.9206 Explore P: 0.5593
Episode: 336 Total reward: 9.0 Training loss: 1.5027 Explore P: 0.5588
Episode: 337 Total reward: 16.0 Training loss: 1.5288 Explore P: 0.5580
Episode: 338 Total reward: 9.0 Training loss: 1.6312 Explore P: 0.5575
Episode: 339 Total reward: 16.0 Training loss: 2.2141 Explore P: 0.5566
Episode: 340 Total reward: 9.0 Training loss: 183.0384 Explore P: 0.5561
Episode: 341 Total reward: 11.0 Training loss: 1.2235 Explore P: 0.5555
Episode: 342 Total reward: 13.0 Training loss: 2.1100 Explore P: 0.5548
Episode: 343 Total reward: 11.0 Training loss: 1.6805 Explore P: 0.5542
Episode: 344 Total reward: 12.0 Training loss: 102.6421 Explore P: 0.5535
Episode: 345 Total reward: 12.0 Training loss: 152.3424 Explore P: 0.5529
Episode: 346 Total reward: 16.0 Training loss: 75.7826 Explore P: 0.5520
Episode: 347 Total reward: 10.0 Training loss: 198.0218 Explore P: 0.5515
Episode: 348 Total reward: 11.0 Training loss: 153.3377 Explore P: 0.5509
Episode: 349 Total reward: 9.0 Training loss: 82.1711 Explore P: 0.5504
Episode: 350 Total reward: 20.0 Training loss: 0.9280 Explore P: 0.5493
Episode: 351 Total reward: 12.0 Training loss: 1.2368 Explore P: 0.5487
Episode: 352 Total reward: 20.0 Training loss: 1.0838 Explore P: 0.5476
Episode: 353 Total reward: 22.0 Training loss: 138.4555 Explore P: 0.5464
Episode: 354 Total reward: 14.0 Training loss: 1.3669 Explore P: 0.5457
Episode: 355 Total reward: 18.0 Training loss: 1.7649 Explore P: 0.5447
Episode: 356 Total reward: 11.0 Training loss: 158.9398 Explore P: 0.5441
Episode: 357 Total reward: 8.0 Training loss: 2.4403 Explore P: 0.5437
Episode: 358 Total reward: 9.0 Training loss: 1.4125 Explore P: 0.5432
Episode: 359 Total reward: 17.0 Training loss: 76.1829 Explore P: 0.5423
Episode: 360 Total reward: 10.0 Training loss: 1.9855 Explore P: 0.5418
Episode: 361 Total reward: 11.0 Training loss: 223.9702 Explore P: 0.5412
Episode: 362 Total reward: 18.0 Training loss: 97.5259 Explore P: 0.5402
Episode: 363 Total reward: 14.0 Training loss: 67.8938 Explore P: 0.5395
Episode: 364 Total reward: 15.0 Training loss: 1.1631 Explore P: 0.5387
Episode: 365 Total reward: 9.0 Training loss: 2.6659 Explore P: 0.5382
Episode: 366 Total reward: 9.0 Training loss: 1.6425 Explore P: 0.5377
Episode: 367 Total reward: 10.0 Training loss: 201.9760 Explore P: 0.5372
Episode: 368 Total reward: 14.0 Training loss: 1.2152 Explore P: 0.5365
Episode: 369 Total reward: 11.0 Training loss: 62.6026 Explore P: 0.5359
Episode: 370 Total reward: 18.0 Training loss: 69.7657 Explore P: 0.5350
Episode: 371 Total reward: 12.0 Training loss: 1.4476 Explore P: 0.5343
Episode: 372 Total reward: 15.0 Training loss: 1.0908 Explore P: 0.5335
Episode: 373 Total reward: 9.0 Training loss: 180.9991 Explore P: 0.5331
Episode: 374 Total reward: 10.0 Training loss: 96.4900 Explore P: 0.5325
Episode: 375 Total reward: 17.0 Training loss: 1.1324 Explore P: 0.5317
Episode: 376 Total reward: 13.0 Training loss: 211.7384 Explore P: 0.5310
Episode: 377 Total reward: 28.0 Training loss: 1.1342 Explore P: 0.5295
Episode: 378 Total reward: 10.0 Training loss: 1.0147 Explore P: 0.5290
Episode: 379 Total reward: 16.0 Training loss: 50.3771 Explore P: 0.5282
Episode: 380 Total reward: 21.0 Training loss: 84.8622 Explore P: 0.5271
Episode: 381 Total reward: 11.0 Training loss: 54.0538 Explore P: 0.5265
Episode: 382 Total reward: 8.0 Training loss: 73.0696 Explore P: 0.5261
Episode: 383 Total reward: 9.0 Training loss: 169.9652 Explore P: 0.5256
Episode: 384 Total reward: 13.0 Training loss: 170.7523 Explore P: 0.5250
Episode: 385 Total reward: 22.0 Training loss: 1.4647 Explore P: 0.5238
Episode: 386 Total reward: 50.0 Training loss: 0.8107 Explore P: 0.5213
Episode: 387 Total reward: 19.0 Training loss: 1.2154 Explore P: 0.5203
Episode: 388 Total reward: 19.0 Training loss: 1.4225 Explore P: 0.5193
Episode: 389 Total reward: 22.0 Training loss: 1.7545 Explore P: 0.5182
Episode: 390 Total reward: 16.0 Training loss: 194.5963 Explore P: 0.5174
Episode: 391 Total reward: 13.0 Training loss: 112.4296 Explore P: 0.5167
Episode: 392 Total reward: 15.0 Training loss: 1.6448 Explore P: 0.5160
Episode: 393 Total reward: 22.0 Training loss: 101.0273 Explore P: 0.5149
Episode: 394 Total reward: 17.0 Training loss: 96.1998 Explore P: 0.5140
Episode: 395 Total reward: 9.0 Training loss: 1.4755 Explore P: 0.5136
Episode: 396 Total reward: 22.0 Training loss: 1.2812 Explore P: 0.5125
Episode: 397 Total reward: 24.0 Training loss: 145.9429 Explore P: 0.5112
Episode: 398 Total reward: 17.0 Training loss: 1.5521 Explore P: 0.5104
Episode: 399 Total reward: 15.0 Training loss: 2.0109 Explore P: 0.5096
Episode: 400 Total reward: 12.0 Training loss: 59.6041 Explore P: 0.5090
Episode: 401 Total reward: 15.0 Training loss: 53.5356 Explore P: 0.5083
Episode: 402 Total reward: 12.0 Training loss: 49.8986 Explore P: 0.5077
Episode: 403 Total reward: 18.0 Training loss: 56.8188 Explore P: 0.5068
Episode: 404 Total reward: 22.0 Training loss: 43.6881 Explore P: 0.5057
Episode: 405 Total reward: 23.0 Training loss: 2.6064 Explore P: 0.5046
Episode: 406 Total reward: 16.0 Training loss: 52.9182 Explore P: 0.5038
Episode: 407 Total reward: 12.0 Training loss: 2.5918 Explore P: 0.5032
Episode: 408 Total reward: 19.0 Training loss: 60.5717 Explore P: 0.5023
Episode: 409 Total reward: 25.0 Training loss: 84.3906 Explore P: 0.5010
Episode: 410 Total reward: 15.0 Training loss: 62.2684 Explore P: 0.5003
Episode: 411 Total reward: 26.0 Training loss: 168.0710 Explore P: 0.4990
Episode: 412 Total reward: 16.0 Training loss: 108.5568 Explore P: 0.4982
Episode: 413 Total reward: 16.0 Training loss: 0.9701 Explore P: 0.4975
Episode: 414 Total reward: 20.0 Training loss: 1.0720 Explore P: 0.4965
Episode: 415 Total reward: 17.0 Training loss: 133.5067 Explore P: 0.4957
Episode: 416 Total reward: 12.0 Training loss: 51.6858 Explore P: 0.4951
Episode: 417 Total reward: 18.0 Training loss: 44.8258 Explore P: 0.4942
Episode: 418 Total reward: 35.0 Training loss: 0.8769 Explore P: 0.4925
Episode: 419 Total reward: 19.0 Training loss: 56.9961 Explore P: 0.4916
Episode: 420 Total reward: 10.0 Training loss: 1.9494 Explore P: 0.4911
Episode: 421 Total reward: 20.0 Training loss: 138.7181 Explore P: 0.4902
Episode: 422 Total reward: 18.0 Training loss: 1.1995 Explore P: 0.4893
Episode: 423 Total reward: 15.0 Training loss: 75.5782 Explore P: 0.4886
Episode: 424 Total reward: 20.0 Training loss: 51.9178 Explore P: 0.4876
Episode: 425 Total reward: 11.0 Training loss: 157.3504 Explore P: 0.4871
Episode: 426 Total reward: 14.0 Training loss: 1.4510 Explore P: 0.4864
Episode: 427 Total reward: 14.0 Training loss: 0.7865 Explore P: 0.4858
Episode: 428 Total reward: 17.0 Training loss: 121.6948 Explore P: 0.4849
Episode: 429 Total reward: 18.0 Training loss: 40.8981 Explore P: 0.4841
Episode: 430 Total reward: 37.0 Training loss: 52.3021 Explore P: 0.4823
Episode: 431 Total reward: 16.0 Training loss: 34.0647 Explore P: 0.4816
Episode: 432 Total reward: 17.0 Training loss: 195.4201 Explore P: 0.4808
Episode: 433 Total reward: 17.0 Training loss: 1.2199 Explore P: 0.4800
Episode: 434 Total reward: 21.0 Training loss: 38.9784 Explore P: 0.4790
Episode: 435 Total reward: 21.0 Training loss: 36.6157 Explore P: 0.4780
Episode: 436 Total reward: 15.0 Training loss: 1.6644 Explore P: 0.4773
Episode: 437 Total reward: 16.0 Training loss: 77.7379 Explore P: 0.4766
Episode: 438 Total reward: 14.0 Training loss: 1.2091 Explore P: 0.4759
Episode: 439 Total reward: 21.0 Training loss: 114.8384 Explore P: 0.4749
Episode: 440 Total reward: 25.0 Training loss: 45.0903 Explore P: 0.4738
Episode: 441 Total reward: 23.0 Training loss: 44.6842 Explore P: 0.4727
Episode: 442 Total reward: 22.0 Training loss: 1.6352 Explore P: 0.4717
Episode: 443 Total reward: 27.0 Training loss: 35.5370 Explore P: 0.4705
Episode: 444 Total reward: 13.0 Training loss: 29.9636 Explore P: 0.4699
Episode: 445 Total reward: 17.0 Training loss: 1.2988 Explore P: 0.4691
Episode: 446 Total reward: 18.0 Training loss: 1.1914 Explore P: 0.4682
Episode: 447 Total reward: 26.0 Training loss: 37.6496 Explore P: 0.4671
Episode: 448 Total reward: 19.0 Training loss: 3.0427 Explore P: 0.4662
Episode: 449 Total reward: 27.0 Training loss: 1.4955 Explore P: 0.4650
Episode: 450 Total reward: 24.0 Training loss: 32.5773 Explore P: 0.4639
Episode: 451 Total reward: 22.0 Training loss: 1.9739 Explore P: 0.4629
Episode: 452 Total reward: 22.0 Training loss: 29.6150 Explore P: 0.4619
Episode: 453 Total reward: 20.0 Training loss: 1.2509 Explore P: 0.4610
Episode: 454 Total reward: 35.0 Training loss: 0.9928 Explore P: 0.4594
Episode: 455 Total reward: 13.0 Training loss: 64.7428 Explore P: 0.4588
Episode: 456 Total reward: 28.0 Training loss: 1.4260 Explore P: 0.4576
Episode: 457 Total reward: 23.0 Training loss: 94.2489 Explore P: 0.4565
Episode: 458 Total reward: 38.0 Training loss: 1.5185 Explore P: 0.4548
Episode: 459 Total reward: 17.0 Training loss: 0.6019 Explore P: 0.4541
Episode: 460 Total reward: 37.0 Training loss: 35.6320 Explore P: 0.4524
Episode: 461 Total reward: 26.0 Training loss: 30.4420 Explore P: 0.4513
Episode: 462 Total reward: 18.0 Training loss: 1.0272 Explore P: 0.4505
Episode: 463 Total reward: 17.0 Training loss: 1.0322 Explore P: 0.4497
Episode: 464 Total reward: 25.0 Training loss: 1.6699 Explore P: 0.4487
Episode: 465 Total reward: 27.0 Training loss: 2.2930 Explore P: 0.4475
Episode: 466 Total reward: 24.0 Training loss: 1.1420 Explore P: 0.4464
Episode: 467 Total reward: 22.0 Training loss: 1.4347 Explore P: 0.4455
Episode: 468 Total reward: 15.0 Training loss: 0.6878 Explore P: 0.4448
Episode: 469 Total reward: 14.0 Training loss: 24.6553 Explore P: 0.4442
Episode: 470 Total reward: 19.0 Training loss: 1.2007 Explore P: 0.4434
Episode: 471 Total reward: 17.0 Training loss: 22.8812 Explore P: 0.4426
Episode: 472 Total reward: 18.0 Training loss: 0.8945 Explore P: 0.4419
Episode: 473 Total reward: 21.0 Training loss: 0.7460 Explore P: 0.4410
Episode: 474 Total reward: 33.0 Training loss: 16.6252 Explore P: 0.4395
Episode: 475 Total reward: 17.0 Training loss: 0.7670 Explore P: 0.4388
Episode: 476 Total reward: 25.0 Training loss: 21.2598 Explore P: 0.4377
Episode: 477 Total reward: 20.0 Training loss: 21.3762 Explore P: 0.4369
Episode: 478 Total reward: 20.0 Training loss: 25.2074 Explore P: 0.4360
Episode: 479 Total reward: 16.0 Training loss: 1.1176 Explore P: 0.4353
Episode: 480 Total reward: 16.0 Training loss: 42.4603 Explore P: 0.4347
Episode: 481 Total reward: 16.0 Training loss: 16.2612 Explore P: 0.4340
Episode: 482 Total reward: 12.0 Training loss: 1.3198 Explore P: 0.4335
Episode: 483 Total reward: 18.0 Training loss: 1.2862 Explore P: 0.4327
Episode: 484 Total reward: 19.0 Training loss: 1.2332 Explore P: 0.4319
Episode: 485 Total reward: 20.0 Training loss: 1.0086 Explore P: 0.4311
Episode: 486 Total reward: 23.0 Training loss: 17.8078 Explore P: 0.4301
Episode: 487 Total reward: 22.0 Training loss: 2.0276 Explore P: 0.4292
Episode: 488 Total reward: 12.0 Training loss: 26.3124 Explore P: 0.4287
Episode: 489 Total reward: 20.0 Training loss: 19.1247 Explore P: 0.4278
Episode: 490 Total reward: 16.0 Training loss: 0.8764 Explore P: 0.4272
Episode: 491 Total reward: 15.0 Training loss: 1.3384 Explore P: 0.4265
Episode: 492 Total reward: 39.0 Training loss: 1.9348 Explore P: 0.4249
Episode: 493 Total reward: 24.0 Training loss: 43.4037 Explore P: 0.4239
Episode: 494 Total reward: 16.0 Training loss: 15.6696 Explore P: 0.4233
Episode: 495 Total reward: 25.0 Training loss: 26.4258 Explore P: 0.4222
Episode: 496 Total reward: 20.0 Training loss: 2.1140 Explore P: 0.4214
Episode: 497 Total reward: 24.0 Training loss: 43.3665 Explore P: 0.4204
Episode: 498 Total reward: 14.0 Training loss: 1.4079 Explore P: 0.4199
Episode: 499 Total reward: 22.0 Training loss: 24.2611 Explore P: 0.4190
Episode: 500 Total reward: 24.0 Training loss: 24.1133 Explore P: 0.4180
Episode: 501 Total reward: 23.0 Training loss: 24.6645 Explore P: 0.4170
Episode: 502 Total reward: 17.0 Training loss: 1.2726 Explore P: 0.4163
Episode: 503 Total reward: 23.0 Training loss: 16.9491 Explore P: 0.4154
Episode: 504 Total reward: 19.0 Training loss: 21.7735 Explore P: 0.4146
Episode: 505 Total reward: 20.0 Training loss: 0.8132 Explore P: 0.4138
Episode: 506 Total reward: 26.0 Training loss: 18.4331 Explore P: 0.4128
Episode: 507 Total reward: 24.0 Training loss: 15.9981 Explore P: 0.4118
Episode: 508 Total reward: 14.0 Training loss: 1.8912 Explore P: 0.4113
Episode: 509 Total reward: 28.0 Training loss: 30.9907 Explore P: 0.4101
Episode: 510 Total reward: 19.0 Training loss: 22.4857 Explore P: 0.4094
Episode: 511 Total reward: 24.0 Training loss: 18.7677 Explore P: 0.4084
Episode: 512 Total reward: 33.0 Training loss: 44.2532 Explore P: 0.4071
Episode: 513 Total reward: 15.0 Training loss: 0.9566 Explore P: 0.4065
Episode: 514 Total reward: 17.0 Training loss: 20.1820 Explore P: 0.4058
Episode: 515 Total reward: 23.0 Training loss: 36.1656 Explore P: 0.4049
Episode: 516 Total reward: 30.0 Training loss: 19.2627 Explore P: 0.4037
Episode: 517 Total reward: 35.0 Training loss: 13.0267 Explore P: 0.4024
Episode: 518 Total reward: 26.0 Training loss: 18.0649 Explore P: 0.4014
Episode: 519 Total reward: 20.0 Training loss: 61.5668 Explore P: 0.4006
Episode: 520 Total reward: 28.0 Training loss: 17.4043 Explore P: 0.3995
Episode: 521 Total reward: 36.0 Training loss: 82.0313 Explore P: 0.3981
Episode: 522 Total reward: 39.0 Training loss: 1.2164 Explore P: 0.3966
Episode: 523 Total reward: 41.0 Training loss: 16.8849 Explore P: 0.3950
Episode: 524 Total reward: 55.0 Training loss: 35.3104 Explore P: 0.3929
Episode: 525 Total reward: 24.0 Training loss: 0.7335 Explore P: 0.3920
Episode: 526 Total reward: 31.0 Training loss: 13.9533 Explore P: 0.3908
Episode: 527 Total reward: 29.0 Training loss: 1.0266 Explore P: 0.3897
Episode: 528 Total reward: 11.0 Training loss: 19.5104 Explore P: 0.3893
Episode: 529 Total reward: 27.0 Training loss: 16.4914 Explore P: 0.3882
Episode: 530 Total reward: 22.0 Training loss: 1.6124 Explore P: 0.3874
Episode: 531 Total reward: 30.0 Training loss: 42.1976 Explore P: 0.3863
Episode: 532 Total reward: 31.0 Training loss: 16.7578 Explore P: 0.3851
Episode: 533 Total reward: 23.0 Training loss: 12.0734 Explore P: 0.3842
Episode: 534 Total reward: 26.0 Training loss: 47.9944 Explore P: 0.3833
Episode: 535 Total reward: 55.0 Training loss: 1.7746 Explore P: 0.3812
Episode: 536 Total reward: 17.0 Training loss: 1.1487 Explore P: 0.3806
Episode: 537 Total reward: 19.0 Training loss: 23.4680 Explore P: 0.3799
Episode: 538 Total reward: 13.0 Training loss: 13.7521 Explore P: 0.3794
Episode: 539 Total reward: 28.0 Training loss: 29.7318 Explore P: 0.3784
Episode: 540 Total reward: 18.0 Training loss: 1.2391 Explore P: 0.3777
Episode: 541 Total reward: 25.0 Training loss: 0.8921 Explore P: 0.3768
Episode: 542 Total reward: 35.0 Training loss: 20.1212 Explore P: 0.3755
Episode: 543 Total reward: 14.0 Training loss: 41.3973 Explore P: 0.3750
Episode: 544 Total reward: 41.0 Training loss: 13.3826 Explore P: 0.3735
Episode: 545 Total reward: 21.0 Training loss: 19.0067 Explore P: 0.3727
Episode: 546 Total reward: 47.0 Training loss: 49.0593 Explore P: 0.3710
Episode: 547 Total reward: 37.0 Training loss: 25.5197 Explore P: 0.3697
Episode: 548 Total reward: 38.0 Training loss: 1.2904 Explore P: 0.3683
Episode: 549 Total reward: 22.0 Training loss: 25.9084 Explore P: 0.3676
Episode: 550 Total reward: 99.0 Training loss: 0.8448 Explore P: 0.3640
Episode: 551 Total reward: 26.0 Training loss: 7.5961 Explore P: 0.3631
Episode: 552 Total reward: 33.0 Training loss: 1.3423 Explore P: 0.3620
Episode: 553 Total reward: 32.0 Training loss: 0.9812 Explore P: 0.3608
Episode: 554 Total reward: 21.0 Training loss: 12.6153 Explore P: 0.3601
Episode: 555 Total reward: 30.0 Training loss: 12.9124 Explore P: 0.3590
Episode: 556 Total reward: 25.0 Training loss: 7.8825 Explore P: 0.3582
Episode: 557 Total reward: 22.0 Training loss: 12.0815 Explore P: 0.3574
Episode: 558 Total reward: 24.0 Training loss: 1.5022 Explore P: 0.3566
Episode: 559 Total reward: 27.0 Training loss: 1.1523 Explore P: 0.3556
Episode: 560 Total reward: 31.0 Training loss: 10.9908 Explore P: 0.3546
Episode: 561 Total reward: 21.0 Training loss: 32.3258 Explore P: 0.3538
Episode: 562 Total reward: 65.0 Training loss: 0.9735 Explore P: 0.3516
Episode: 563 Total reward: 62.0 Training loss: 39.9010 Explore P: 0.3495
Episode: 564 Total reward: 41.0 Training loss: 9.9390 Explore P: 0.3481
Episode: 565 Total reward: 32.0 Training loss: 1.6124 Explore P: 0.3470
Episode: 566 Total reward: 36.0 Training loss: 1.3082 Explore P: 0.3458
Episode: 567 Total reward: 19.0 Training loss: 25.6646 Explore P: 0.3452
Episode: 568 Total reward: 35.0 Training loss: 19.9450 Explore P: 0.3440
Episode: 569 Total reward: 37.0 Training loss: 1.7748 Explore P: 0.3428
Episode: 570 Total reward: 35.0 Training loss: 17.4384 Explore P: 0.3416
Episode: 571 Total reward: 24.0 Training loss: 13.8927 Explore P: 0.3408
Episode: 572 Total reward: 38.0 Training loss: 1.4994 Explore P: 0.3396
Episode: 573 Total reward: 44.0 Training loss: 1.1722 Explore P: 0.3381
Episode: 574 Total reward: 37.0 Training loss: 26.1575 Explore P: 0.3369
Episode: 575 Total reward: 59.0 Training loss: 8.2502 Explore P: 0.3350
Episode: 576 Total reward: 77.0 Training loss: 27.5511 Explore P: 0.3325
Episode: 577 Total reward: 40.0 Training loss: 36.0910 Explore P: 0.3312
Episode: 578 Total reward: 36.0 Training loss: 8.5912 Explore P: 0.3301
Episode: 579 Total reward: 52.0 Training loss: 6.4369 Explore P: 0.3284
Episode: 580 Total reward: 51.0 Training loss: 1.3635 Explore P: 0.3268
Episode: 581 Total reward: 41.0 Training loss: 0.8874 Explore P: 0.3255
Episode: 582 Total reward: 31.0 Training loss: 1.4737 Explore P: 0.3245
Episode: 583 Total reward: 43.0 Training loss: 21.8411 Explore P: 0.3232
Episode: 584 Total reward: 24.0 Training loss: 16.0251 Explore P: 0.3224
Episode: 585 Total reward: 33.0 Training loss: 21.7639 Explore P: 0.3214
Episode: 586 Total reward: 22.0 Training loss: 35.8769 Explore P: 0.3207
Episode: 587 Total reward: 32.0 Training loss: 11.7337 Explore P: 0.3197
Episode: 588 Total reward: 22.0 Training loss: 21.1851 Explore P: 0.3190
Episode: 589 Total reward: 37.0 Training loss: 0.9472 Explore P: 0.3179
Episode: 590 Total reward: 27.0 Training loss: 1.5469 Explore P: 0.3170
Episode: 591 Total reward: 34.0 Training loss: 21.2708 Explore P: 0.3160
Episode: 592 Total reward: 56.0 Training loss: 0.7433 Explore P: 0.3143
Episode: 593 Total reward: 37.0 Training loss: 12.2719 Explore P: 0.3132
Episode: 594 Total reward: 54.0 Training loss: 52.7516 Explore P: 0.3115
Episode: 595 Total reward: 32.0 Training loss: 1.4478 Explore P: 0.3106
Episode: 596 Total reward: 45.0 Training loss: 17.4924 Explore P: 0.3092
Episode: 597 Total reward: 36.0 Training loss: 1.3261 Explore P: 0.3082
Episode: 598 Total reward: 73.0 Training loss: 1.3937 Explore P: 0.3060
Episode: 599 Total reward: 67.0 Training loss: 0.9392 Explore P: 0.3040
Episode: 600 Total reward: 61.0 Training loss: 21.6015 Explore P: 0.3022
Episode: 601 Total reward: 38.0 Training loss: 0.8986 Explore P: 0.3011
Episode: 602 Total reward: 63.0 Training loss: 1.1565 Explore P: 0.2993
Episode: 603 Total reward: 75.0 Training loss: 31.8062 Explore P: 0.2971
Episode: 604 Total reward: 48.0 Training loss: 1.1652 Explore P: 0.2957
Episode: 605 Total reward: 71.0 Training loss: 0.8857 Explore P: 0.2937
Episode: 606 Total reward: 72.0 Training loss: 1.5290 Explore P: 0.2917
Episode: 607 Total reward: 53.0 Training loss: 1.0437 Explore P: 0.2902
Episode: 608 Total reward: 78.0 Training loss: 2.3206 Explore P: 0.2880
Episode: 609 Total reward: 94.0 Training loss: 0.9422 Explore P: 0.2854
Episode: 610 Total reward: 70.0 Training loss: 1.0058 Explore P: 0.2835
Episode: 611 Total reward: 103.0 Training loss: 27.3048 Explore P: 0.2807
Episode: 612 Total reward: 109.0 Training loss: 15.7407 Explore P: 0.2778
Episode: 613 Total reward: 58.0 Training loss: 9.9827 Explore P: 0.2762
Episode: 614 Total reward: 89.0 Training loss: 7.8321 Explore P: 0.2739
Episode: 615 Total reward: 185.0 Training loss: 28.9684 Explore P: 0.2690
Episode: 616 Total reward: 62.0 Training loss: 25.7871 Explore P: 0.2674
Episode: 617 Total reward: 70.0 Training loss: 11.3774 Explore P: 0.2656
Episode: 618 Total reward: 78.0 Training loss: 1.3389 Explore P: 0.2636
Episode: 619 Total reward: 150.0 Training loss: 22.2618 Explore P: 0.2599
Episode: 620 Total reward: 130.0 Training loss: 32.5684 Explore P: 0.2566
Episode: 621 Total reward: 73.0 Training loss: 16.7239 Explore P: 0.2548
Episode: 622 Total reward: 28.0 Training loss: 27.2476 Explore P: 0.2542
Episode: 623 Total reward: 59.0 Training loss: 1.5582 Explore P: 0.2527
Episode: 624 Total reward: 46.0 Training loss: 10.5571 Explore P: 0.2516
Episode: 625 Total reward: 41.0 Training loss: 11.2981 Explore P: 0.2506
Episode: 626 Total reward: 37.0 Training loss: 12.9361 Explore P: 0.2497
Episode: 627 Total reward: 35.0 Training loss: 46.2805 Explore P: 0.2489
Episode: 628 Total reward: 78.0 Training loss: 15.7000 Explore P: 0.2470
Episode: 629 Total reward: 66.0 Training loss: 0.9340 Explore P: 0.2455
Episode: 630 Total reward: 72.0 Training loss: 1.0908 Explore P: 0.2438
Episode: 631 Total reward: 126.0 Training loss: 1.3157 Explore P: 0.2409
Episode: 632 Total reward: 45.0 Training loss: 1.1506 Explore P: 0.2398
Episode: 633 Total reward: 69.0 Training loss: 0.9053 Explore P: 0.2382
Episode: 634 Total reward: 41.0 Training loss: 1.6150 Explore P: 0.2373
Episode: 635 Total reward: 48.0 Training loss: 1.2345 Explore P: 0.2362
Episode: 636 Total reward: 19.0 Training loss: 17.4979 Explore P: 0.2358
Episode: 637 Total reward: 59.0 Training loss: 12.5582 Explore P: 0.2345
Episode: 638 Total reward: 65.0 Training loss: 21.0342 Explore P: 0.2330
Episode: 639 Total reward: 57.0 Training loss: 230.6809 Explore P: 0.2317
Episode: 640 Total reward: 37.0 Training loss: 25.4149 Explore P: 0.2309
Episode: 641 Total reward: 39.0 Training loss: 1.8141 Explore P: 0.2301
Episode: 642 Total reward: 19.0 Training loss: 14.1114 Explore P: 0.2296
Episode: 643 Total reward: 16.0 Training loss: 2.0282 Explore P: 0.2293
Episode: 644 Total reward: 18.0 Training loss: 24.6417 Explore P: 0.2289
Episode: 645 Total reward: 19.0 Training loss: 1.6414 Explore P: 0.2285
Episode: 646 Total reward: 28.0 Training loss: 18.2957 Explore P: 0.2279
Episode: 647 Total reward: 36.0 Training loss: 1.2192 Explore P: 0.2271
Episode: 648 Total reward: 38.0 Training loss: 0.6736 Explore P: 0.2263
Episode: 649 Total reward: 28.0 Training loss: 1.5629 Explore P: 0.2257
Episode: 650 Total reward: 21.0 Training loss: 24.5263 Explore P: 0.2252
Episode: 651 Total reward: 46.0 Training loss: 150.1758 Explore P: 0.2242
Episode: 652 Total reward: 37.0 Training loss: 22.9620 Explore P: 0.2234
Episode: 653 Total reward: 29.0 Training loss: 37.2038 Explore P: 0.2228
Episode: 654 Total reward: 34.0 Training loss: 37.4262 Explore P: 0.2221
Episode: 655 Total reward: 35.0 Training loss: 1.5823 Explore P: 0.2213
Episode: 656 Total reward: 35.0 Training loss: 13.6948 Explore P: 0.2206
Episode: 657 Total reward: 15.0 Training loss: 2.4901 Explore P: 0.2203
Episode: 658 Total reward: 19.0 Training loss: 10.2794 Explore P: 0.2199
Episode: 659 Total reward: 21.0 Training loss: 59.1853 Explore P: 0.2195
Episode: 660 Total reward: 21.0 Training loss: 2.8686 Explore P: 0.2190
Episode: 661 Total reward: 16.0 Training loss: 4.5770 Explore P: 0.2187
Episode: 662 Total reward: 14.0 Training loss: 2.8120 Explore P: 0.2184
Episode: 663 Total reward: 13.0 Training loss: 36.8954 Explore P: 0.2181
Episode: 664 Total reward: 22.0 Training loss: 1.3332 Explore P: 0.2177
Episode: 665 Total reward: 14.0 Training loss: 17.0443 Explore P: 0.2174
Episode: 666 Total reward: 15.0 Training loss: 112.2994 Explore P: 0.2171
Episode: 667 Total reward: 13.0 Training loss: 2.9089 Explore P: 0.2168
Episode: 668 Total reward: 25.0 Training loss: 108.8355 Explore P: 0.2163
Episode: 669 Total reward: 17.0 Training loss: 13.8663 Explore P: 0.2159
Episode: 670 Total reward: 27.0 Training loss: 171.8986 Explore P: 0.2154
Episode: 671 Total reward: 21.0 Training loss: 1.5800 Explore P: 0.2149
Episode: 672 Total reward: 41.0 Training loss: 2.5559 Explore P: 0.2141
Episode: 673 Total reward: 69.0 Training loss: 1.1998 Explore P: 0.2127
Episode: 674 Total reward: 54.0 Training loss: 50.0960 Explore P: 0.2116
Episode: 675 Total reward: 23.0 Training loss: 27.3362 Explore P: 0.2111
Episode: 676 Total reward: 25.0 Training loss: 1.8316 Explore P: 0.2106
Episode: 677 Total reward: 26.0 Training loss: 1.5312 Explore P: 0.2101
Episode: 678 Total reward: 47.0 Training loss: 14.8057 Explore P: 0.2092
Episode: 679 Total reward: 21.0 Training loss: 3.1941 Explore P: 0.2088
Episode: 680 Total reward: 37.0 Training loss: 24.5800 Explore P: 0.2080
Episode: 681 Total reward: 14.0 Training loss: 15.6116 Explore P: 0.2078
Episode: 682 Total reward: 30.0 Training loss: 1.7930 Explore P: 0.2072
Episode: 683 Total reward: 29.0 Training loss: 1.5996 Explore P: 0.2066
Episode: 684 Total reward: 40.0 Training loss: 1.4876 Explore P: 0.2058
Episode: 685 Total reward: 41.0 Training loss: 2.1263 Explore P: 0.2050
Episode: 686 Total reward: 91.0 Training loss: 1.7474 Explore P: 0.2032
Episode: 687 Total reward: 66.0 Training loss: 156.1963 Explore P: 0.2020
Episode: 688 Total reward: 51.0 Training loss: 1.9669 Explore P: 0.2010
Episode: 689 Total reward: 38.0 Training loss: 1.9293 Explore P: 0.2003
Episode: 690 Total reward: 39.0 Training loss: 32.9188 Explore P: 0.1995
Episode: 691 Total reward: 36.0 Training loss: 15.2054 Explore P: 0.1988
Episode: 692 Total reward: 48.0 Training loss: 1.9895 Explore P: 0.1979
Episode: 693 Total reward: 33.0 Training loss: 2.1530 Explore P: 0.1973
Episode: 694 Total reward: 29.0 Training loss: 2.4897 Explore P: 0.1968
Episode: 695 Total reward: 97.0 Training loss: 1.4609 Explore P: 0.1950
Episode: 696 Total reward: 67.0 Training loss: 123.3220 Explore P: 0.1937
Episode: 697 Total reward: 71.0 Training loss: 1.5160 Explore P: 0.1924
Episode: 698 Total reward: 47.0 Training loss: 1.6016 Explore P: 0.1916
Episode: 699 Total reward: 66.0 Training loss: 1.6061 Explore P: 0.1904
Episode: 700 Total reward: 32.0 Training loss: 19.3820 Explore P: 0.1898
Episode: 701 Total reward: 56.0 Training loss: 16.7381 Explore P: 0.1888
Episode: 702 Total reward: 49.0 Training loss: 1.6339 Explore P: 0.1879
Episode: 703 Total reward: 113.0 Training loss: 2.0695 Explore P: 0.1859
Episode: 704 Total reward: 56.0 Training loss: 127.7140 Explore P: 0.1850
Episode: 705 Total reward: 51.0 Training loss: 2.1957 Explore P: 0.1841
Episode: 706 Total reward: 45.0 Training loss: 12.8770 Explore P: 0.1833
Episode: 707 Total reward: 52.0 Training loss: 2.3676 Explore P: 0.1824
Episode: 708 Total reward: 37.0 Training loss: 0.5801 Explore P: 0.1817
Episode: 709 Total reward: 40.0 Training loss: 1.7517 Explore P: 0.1811
Episode: 710 Total reward: 54.0 Training loss: 2.0028 Explore P: 0.1801
Episode: 711 Total reward: 85.0 Training loss: 106.5039 Explore P: 0.1787
Episode: 712 Total reward: 77.0 Training loss: 197.6158 Explore P: 0.1774
Episode: 713 Total reward: 76.0 Training loss: 1.2633 Explore P: 0.1761
Episode: 714 Total reward: 120.0 Training loss: 22.5078 Explore P: 0.1742
Episode: 715 Total reward: 85.0 Training loss: 200.0853 Explore P: 0.1728
Episode: 716 Total reward: 91.0 Training loss: 3.8336 Explore P: 0.1713
Episode: 717 Total reward: 111.0 Training loss: 1.6937 Explore P: 0.1695
Episode: 718 Total reward: 78.0 Training loss: 50.3090 Explore P: 0.1683
Episode: 719 Total reward: 91.0 Training loss: 11.0574 Explore P: 0.1668
Episode: 720 Total reward: 147.0 Training loss: 1.6774 Explore P: 0.1645
Episode: 721 Total reward: 57.0 Training loss: 2.8214 Explore P: 0.1637
Episode: 722 Total reward: 154.0 Training loss: 177.2232 Explore P: 0.1613
Episode: 723 Total reward: 91.0 Training loss: 14.8496 Explore P: 0.1600
Episode: 724 Total reward: 106.0 Training loss: 81.5056 Explore P: 0.1584
Episode: 725 Total reward: 87.0 Training loss: 2.7203 Explore P: 0.1571
Episode: 726 Total reward: 152.0 Training loss: 1.3438 Explore P: 0.1549
Episode: 727 Total reward: 79.0 Training loss: 2.5333 Explore P: 0.1537
Episode: 728 Total reward: 199.0 Training loss: 1.9010 Explore P: 0.1509
Episode: 729 Total reward: 80.0 Training loss: 1.7633 Explore P: 0.1498
Episode: 730 Total reward: 186.0 Training loss: 17.4749 Explore P: 0.1472
Episode: 731 Total reward: 89.0 Training loss: 2.4334 Explore P: 0.1460
Episode: 732 Total reward: 91.0 Training loss: 22.0238 Explore P: 0.1447
Episode: 733 Total reward: 82.0 Training loss: 1.1835 Explore P: 0.1436
Episode: 734 Total reward: 117.0 Training loss: 2.0784 Explore P: 0.1421
Episode: 735 Total reward: 97.0 Training loss: 108.5955 Explore P: 0.1408
Episode: 736 Total reward: 125.0 Training loss: 106.4633 Explore P: 0.1392
Episode: 737 Total reward: 106.0 Training loss: 127.7308 Explore P: 0.1378
Episode: 738 Total reward: 199.0 Training loss: 2.3933 Explore P: 0.1353
Episode: 739 Total reward: 80.0 Training loss: 0.7202 Explore P: 0.1343
Episode: 740 Total reward: 144.0 Training loss: 1.9576 Explore P: 0.1325
Episode: 741 Total reward: 90.0 Training loss: 2.1742 Explore P: 0.1314
Episode: 742 Total reward: 159.0 Training loss: 141.5353 Explore P: 0.1295
Episode: 743 Total reward: 139.0 Training loss: 2.6320 Explore P: 0.1279
Episode: 744 Total reward: 132.0 Training loss: 1.9919 Explore P: 0.1263
Episode: 745 Total reward: 108.0 Training loss: 2.3361 Explore P: 0.1251
Episode: 746 Total reward: 122.0 Training loss: 1.4727 Explore P: 0.1237
Episode: 747 Total reward: 90.0 Training loss: 2.3662 Explore P: 0.1227
Episode: 748 Total reward: 79.0 Training loss: 2.0465 Explore P: 0.1218
Episode: 749 Total reward: 89.0 Training loss: 1.4634 Explore P: 0.1208
Episode: 750 Total reward: 74.0 Training loss: 1.4726 Explore P: 0.1200
Episode: 751 Total reward: 87.0 Training loss: 218.1596 Explore P: 0.1190
Episode: 752 Total reward: 146.0 Training loss: 1.4079 Explore P: 0.1174
Episode: 753 Total reward: 91.0 Training loss: 2.1462 Explore P: 0.1165
Episode: 754 Total reward: 84.0 Training loss: 165.5214 Explore P: 0.1156
Episode: 755 Total reward: 97.0 Training loss: 1.7036 Explore P: 0.1146
Episode: 756 Total reward: 83.0 Training loss: 1.5108 Explore P: 0.1137
Episode: 757 Total reward: 85.0 Training loss: 50.0501 Explore P: 0.1128
Episode: 758 Total reward: 79.0 Training loss: 1.3583 Explore P: 0.1120
Episode: 759 Total reward: 99.0 Training loss: 1.4119 Explore P: 0.1110
Episode: 760 Total reward: 83.0 Training loss: 1.8450 Explore P: 0.1102
Episode: 761 Total reward: 89.0 Training loss: 1.2679 Explore P: 0.1093
Episode: 762 Total reward: 93.0 Training loss: 1.2635 Explore P: 0.1084
Episode: 763 Total reward: 89.0 Training loss: 2.0387 Explore P: 0.1075
Episode: 764 Total reward: 82.0 Training loss: 2.7671 Explore P: 0.1067
Episode: 765 Total reward: 90.0 Training loss: 293.2490 Explore P: 0.1058
Episode: 766 Total reward: 70.0 Training loss: 2.5846 Explore P: 0.1052
Episode: 767 Total reward: 65.0 Training loss: 52.9311 Explore P: 0.1045
Episode: 768 Total reward: 87.0 Training loss: 1.6671 Explore P: 0.1037
Episode: 769 Total reward: 89.0 Training loss: 1.9991 Explore P: 0.1029
Episode: 770 Total reward: 69.0 Training loss: 1.6411 Explore P: 0.1023
Episode: 771 Total reward: 90.0 Training loss: 3.3310 Explore P: 0.1014
Episode: 772 Total reward: 121.0 Training loss: 2.9526 Explore P: 0.1003
Episode: 773 Total reward: 79.0 Training loss: 2.3238 Explore P: 0.0996
Episode: 774 Total reward: 86.0 Training loss: 1.5569 Explore P: 0.0988
Episode: 775 Total reward: 75.0 Training loss: 3.5022 Explore P: 0.0982
Episode: 776 Total reward: 85.0 Training loss: 0.6073 Explore P: 0.0974
Episode: 777 Total reward: 62.0 Training loss: 190.8418 Explore P: 0.0969
Episode: 778 Total reward: 88.0 Training loss: 2.0841 Explore P: 0.0961
Episode: 779 Total reward: 85.0 Training loss: 2.8744 Explore P: 0.0954
Episode: 780 Total reward: 65.0 Training loss: 1.5316 Explore P: 0.0949
Episode: 781 Total reward: 42.0 Training loss: 234.2364 Explore P: 0.0945
Episode: 782 Total reward: 36.0 Training loss: 3.5302 Explore P: 0.0942
Episode: 783 Total reward: 77.0 Training loss: 1.4823 Explore P: 0.0935
Episode: 784 Total reward: 65.0 Training loss: 156.3634 Explore P: 0.0930
Episode: 785 Total reward: 57.0 Training loss: 2.0659 Explore P: 0.0925
Episode: 786 Total reward: 61.0 Training loss: 2.0169 Explore P: 0.0920
Episode: 787 Total reward: 61.0 Training loss: 1.4686 Explore P: 0.0915
Episode: 788 Total reward: 37.0 Training loss: 3.2058 Explore P: 0.0912
Episode: 789 Total reward: 34.0 Training loss: 3.0565 Explore P: 0.0910
Episode: 790 Total reward: 21.0 Training loss: 4.2403 Explore P: 0.0908
Episode: 791 Total reward: 23.0 Training loss: 2.7653 Explore P: 0.0906
Episode: 792 Total reward: 49.0 Training loss: 138.0399 Explore P: 0.0902
Episode: 793 Total reward: 69.0 Training loss: 1.9351 Explore P: 0.0897
Episode: 794 Total reward: 22.0 Training loss: 2.2865 Explore P: 0.0895
Episode: 795 Total reward: 22.0 Training loss: 3.4115 Explore P: 0.0893
Episode: 796 Total reward: 27.0 Training loss: 2.2285 Explore P: 0.0891
Episode: 797 Total reward: 56.0 Training loss: 3.7597 Explore P: 0.0886
Episode: 798 Total reward: 27.0 Training loss: 3.9695 Explore P: 0.0884
Episode: 799 Total reward: 20.0 Training loss: 2.9041 Explore P: 0.0883
Episode: 800 Total reward: 29.0 Training loss: 3.3151 Explore P: 0.0881
Episode: 801 Total reward: 52.0 Training loss: 1.6887 Explore P: 0.0876
Episode: 802 Total reward: 20.0 Training loss: 185.0690 Explore P: 0.0875
Episode: 803 Total reward: 41.0 Training loss: 51.9090 Explore P: 0.0872
Episode: 804 Total reward: 55.0 Training loss: 2.4148 Explore P: 0.0868
Episode: 805 Total reward: 21.0 Training loss: 3.9950 Explore P: 0.0866
Episode: 806 Total reward: 22.0 Training loss: 2.0308 Explore P: 0.0864
Episode: 807 Total reward: 16.0 Training loss: 3.2144 Explore P: 0.0863
Episode: 808 Total reward: 24.0 Training loss: 1.4179 Explore P: 0.0861
Episode: 809 Total reward: 20.0 Training loss: 1.7889 Explore P: 0.0860
Episode: 810 Total reward: 20.0 Training loss: 1.8777 Explore P: 0.0858
Episode: 811 Total reward: 21.0 Training loss: 1.2879 Explore P: 0.0857
Episode: 812 Total reward: 23.0 Training loss: 3.1679 Explore P: 0.0855
Episode: 813 Total reward: 21.0 Training loss: 473.2147 Explore P: 0.0853
Episode: 814 Total reward: 23.0 Training loss: 3.0620 Explore P: 0.0852
Episode: 815 Total reward: 18.0 Training loss: 2.5540 Explore P: 0.0850
Episode: 816 Total reward: 25.0 Training loss: 1.8706 Explore P: 0.0848
Episode: 817 Total reward: 22.0 Training loss: 239.6549 Explore P: 0.0847
Episode: 818 Total reward: 20.0 Training loss: 114.6417 Explore P: 0.0845
Episode: 819 Total reward: 16.0 Training loss: 2.1006 Explore P: 0.0844
Episode: 820 Total reward: 18.0 Training loss: 2.7195 Explore P: 0.0843
Episode: 821 Total reward: 17.0 Training loss: 182.5574 Explore P: 0.0841
Episode: 822 Total reward: 18.0 Training loss: 243.9188 Explore P: 0.0840
Episode: 823 Total reward: 17.0 Training loss: 2.2855 Explore P: 0.0839
Episode: 824 Total reward: 15.0 Training loss: 2.9489 Explore P: 0.0838
Episode: 825 Total reward: 16.0 Training loss: 3.1590 Explore P: 0.0836
Episode: 826 Total reward: 22.0 Training loss: 1.7466 Explore P: 0.0835
Episode: 827 Total reward: 29.0 Training loss: 40.3972 Explore P: 0.0833
Episode: 828 Total reward: 19.0 Training loss: 2.5322 Explore P: 0.0831
Episode: 829 Total reward: 21.0 Training loss: 2.2525 Explore P: 0.0830
Episode: 830 Total reward: 23.0 Training loss: 1.7207 Explore P: 0.0828
Episode: 831 Total reward: 60.0 Training loss: 2.6919 Explore P: 0.0824
Episode: 832 Total reward: 28.0 Training loss: 3.3040 Explore P: 0.0822
Episode: 833 Total reward: 23.0 Training loss: 2.1028 Explore P: 0.0820
Episode: 834 Total reward: 60.0 Training loss: 64.6868 Explore P: 0.0816
Episode: 835 Total reward: 86.0 Training loss: 3.9118 Explore P: 0.0810
Episode: 836 Total reward: 61.0 Training loss: 0.4133 Explore P: 0.0805
Episode: 837 Total reward: 66.0 Training loss: 21.6146 Explore P: 0.0801
Episode: 838 Total reward: 75.0 Training loss: 184.7214 Explore P: 0.0795
Episode: 839 Total reward: 95.0 Training loss: 2.2474 Explore P: 0.0789
Episode: 840 Total reward: 70.0 Training loss: 17.3131 Explore P: 0.0784
Episode: 841 Total reward: 86.0 Training loss: 224.6975 Explore P: 0.0778
Episode: 842 Total reward: 102.0 Training loss: 1.1486 Explore P: 0.0771
Episode: 843 Total reward: 86.0 Training loss: 1.4441 Explore P: 0.0766
Episode: 844 Total reward: 99.0 Training loss: 0.6524 Explore P: 0.0759
Episode: 845 Total reward: 89.0 Training loss: 80.3342 Explore P: 0.0753
Episode: 846 Total reward: 114.0 Training loss: 85.0464 Explore P: 0.0746
Episode: 847 Total reward: 98.0 Training loss: 0.7770 Explore P: 0.0740
Episode: 848 Total reward: 83.0 Training loss: 228.3755 Explore P: 0.0734
Episode: 849 Total reward: 199.0 Training loss: 0.5740 Explore P: 0.0722
Episode: 850 Total reward: 199.0 Training loss: 0.7810 Explore P: 0.0709
Episode: 851 Total reward: 111.0 Training loss: 0.9638 Explore P: 0.0703
Episode: 852 Total reward: 199.0 Training loss: 0.6757 Explore P: 0.0691
Episode: 853 Total reward: 199.0 Training loss: 0.6365 Explore P: 0.0679
Episode: 854 Total reward: 199.0 Training loss: 1.0158 Explore P: 0.0668
Episode: 855 Total reward: 199.0 Training loss: 0.8033 Explore P: 0.0657
Episode: 856 Total reward: 199.0 Training loss: 0.5704 Explore P: 0.0646
Episode: 857 Total reward: 180.0 Training loss: 0.6877 Explore P: 0.0636
Episode: 858 Total reward: 199.0 Training loss: 1.1831 Explore P: 0.0625
Episode: 859 Total reward: 199.0 Training loss: 0.3136 Explore P: 0.0615
Episode: 860 Total reward: 177.0 Training loss: 187.0638 Explore P: 0.0606
Episode: 861 Total reward: 199.0 Training loss: 62.7757 Explore P: 0.0596
Episode: 862 Total reward: 185.0 Training loss: 0.6566 Explore P: 0.0587
Episode: 863 Total reward: 199.0 Training loss: 0.4847 Explore P: 0.0577
Episode: 864 Total reward: 199.0 Training loss: 0.3621 Explore P: 0.0568
Episode: 865 Total reward: 199.0 Training loss: 65.2897 Explore P: 0.0559
Episode: 866 Total reward: 199.0 Training loss: 0.9414 Explore P: 0.0550
Episode: 867 Total reward: 199.0 Training loss: 0.4089 Explore P: 0.0541
Episode: 868 Total reward: 199.0 Training loss: 0.2342 Explore P: 0.0532
Episode: 869 Total reward: 199.0 Training loss: 0.4035 Explore P: 0.0524
Episode: 870 Total reward: 199.0 Training loss: 0.4738 Explore P: 0.0515
Episode: 871 Total reward: 199.0 Training loss: 17.1555 Explore P: 0.0507
Episode: 872 Total reward: 199.0 Training loss: 0.6795 Explore P: 0.0499
Episode: 873 Total reward: 199.0 Training loss: 0.6237 Explore P: 0.0491
Episode: 874 Total reward: 157.0 Training loss: 0.4996 Explore P: 0.0485
Episode: 875 Total reward: 199.0 Training loss: 329.6562 Explore P: 0.0478
Episode: 876 Total reward: 199.0 Training loss: 0.5434 Explore P: 0.0470
Episode: 877 Total reward: 167.0 Training loss: 0.5756 Explore P: 0.0464
Episode: 878 Total reward: 147.0 Training loss: 0.7865 Explore P: 0.0459
Episode: 879 Total reward: 128.0 Training loss: 281.1920 Explore P: 0.0454
Episode: 880 Total reward: 111.0 Training loss: 0.8864 Explore P: 0.0450
Episode: 881 Total reward: 118.0 Training loss: 0.9177 Explore P: 0.0446
Episode: 882 Total reward: 113.0 Training loss: 248.8173 Explore P: 0.0442
Episode: 883 Total reward: 107.0 Training loss: 0.7127 Explore P: 0.0439
Episode: 884 Total reward: 143.0 Training loss: 0.6288 Explore P: 0.0434
Episode: 885 Total reward: 146.0 Training loss: 0.4666 Explore P: 0.0429
Episode: 886 Total reward: 130.0 Training loss: 0.9533 Explore P: 0.0425
Episode: 887 Total reward: 119.0 Training loss: 0.5351 Explore P: 0.0421
Episode: 888 Total reward: 145.0 Training loss: 143.5915 Explore P: 0.0416
Episode: 889 Total reward: 145.0 Training loss: 282.8658 Explore P: 0.0412
Episode: 890 Total reward: 102.0 Training loss: 1.2552 Explore P: 0.0408
Episode: 891 Total reward: 108.0 Training loss: 0.6353 Explore P: 0.0405
Episode: 892 Total reward: 128.0 Training loss: 0.7319 Explore P: 0.0401
Episode: 893 Total reward: 83.0 Training loss: 1.4368 Explore P: 0.0399
Episode: 894 Total reward: 97.0 Training loss: 0.6440 Explore P: 0.0396
Episode: 895 Total reward: 96.0 Training loss: 0.6838 Explore P: 0.0393
Episode: 896 Total reward: 98.0 Training loss: 0.6835 Explore P: 0.0390
Episode: 897 Total reward: 102.0 Training loss: 1.3364 Explore P: 0.0387
Episode: 898 Total reward: 81.0 Training loss: 0.8280 Explore P: 0.0385
Episode: 899 Total reward: 73.0 Training loss: 1.0209 Explore P: 0.0383
Episode: 900 Total reward: 31.0 Training loss: 0.8534 Explore P: 0.0382
Episode: 901 Total reward: 36.0 Training loss: 1.4950 Explore P: 0.0381
Episode: 902 Total reward: 30.0 Training loss: 1.5512 Explore P: 0.0380
Episode: 903 Total reward: 30.0 Training loss: 1.5472 Explore P: 0.0379
Episode: 904 Total reward: 24.0 Training loss: 1.3556 Explore P: 0.0379
Episode: 905 Total reward: 56.0 Training loss: 189.6601 Explore P: 0.0377
Episode: 906 Total reward: 25.0 Training loss: 1.9611 Explore P: 0.0376
Episode: 907 Total reward: 26.0 Training loss: 1.4923 Explore P: 0.0376
Episode: 908 Total reward: 35.0 Training loss: 1.7582 Explore P: 0.0375
Episode: 909 Total reward: 24.0 Training loss: 0.9799 Explore P: 0.0374
Episode: 910 Total reward: 24.0 Training loss: 214.7785 Explore P: 0.0373
Episode: 911 Total reward: 17.0 Training loss: 1.7002 Explore P: 0.0373
Episode: 912 Total reward: 18.0 Training loss: 1.6883 Explore P: 0.0372
Episode: 913 Total reward: 34.0 Training loss: 1.0146 Explore P: 0.0372
Episode: 914 Total reward: 25.0 Training loss: 1.2011 Explore P: 0.0371
Episode: 915 Total reward: 55.0 Training loss: 432.4491 Explore P: 0.0369
Episode: 916 Total reward: 23.0 Training loss: 1.5323 Explore P: 0.0369
Episode: 917 Total reward: 16.0 Training loss: 1.5438 Explore P: 0.0368
Episode: 918 Total reward: 24.0 Training loss: 435.6873 Explore P: 0.0368
Episode: 919 Total reward: 27.0 Training loss: 1.8114 Explore P: 0.0367
Episode: 920 Total reward: 23.0 Training loss: 1.3498 Explore P: 0.0366
Episode: 921 Total reward: 23.0 Training loss: 1.7129 Explore P: 0.0366
Episode: 922 Total reward: 23.0 Training loss: 1.4151 Explore P: 0.0365
Episode: 923 Total reward: 17.0 Training loss: 214.9255 Explore P: 0.0365
Episode: 924 Total reward: 32.0 Training loss: 212.8562 Explore P: 0.0364
Episode: 925 Total reward: 33.0 Training loss: 443.2886 Explore P: 0.0363
Episode: 926 Total reward: 64.0 Training loss: 291.0511 Explore P: 0.0361
Episode: 927 Total reward: 23.0 Training loss: 1.2303 Explore P: 0.0361
Episode: 928 Total reward: 26.0 Training loss: 1.2662 Explore P: 0.0360
Episode: 929 Total reward: 27.0 Training loss: 1.5454 Explore P: 0.0359
Episode: 930 Total reward: 19.0 Training loss: 116.6151 Explore P: 0.0359
Episode: 931 Total reward: 75.0 Training loss: 410.3089 Explore P: 0.0357
Episode: 932 Total reward: 97.0 Training loss: 0.8546 Explore P: 0.0354
Episode: 933 Total reward: 80.0 Training loss: 1.5663 Explore P: 0.0352
Episode: 934 Total reward: 92.0 Training loss: 1.1995 Explore P: 0.0350
Episode: 935 Total reward: 115.0 Training loss: 295.1371 Explore P: 0.0347
Episode: 936 Total reward: 145.0 Training loss: 242.0642 Explore P: 0.0344
Episode: 937 Total reward: 150.0 Training loss: 0.3883 Explore P: 0.0340
Episode: 938 Total reward: 160.0 Training loss: 0.6104 Explore P: 0.0336
Episode: 939 Total reward: 138.0 Training loss: 0.3980 Explore P: 0.0333
Episode: 940 Total reward: 134.0 Training loss: 0.3130 Explore P: 0.0330
Episode: 941 Total reward: 126.0 Training loss: 81.5733 Explore P: 0.0327
Episode: 942 Total reward: 137.0 Training loss: 46.6440 Explore P: 0.0324
Episode: 943 Total reward: 149.0 Training loss: 42.2457 Explore P: 0.0321
Episode: 944 Total reward: 140.0 Training loss: 0.1861 Explore P: 0.0317
Episode: 945 Total reward: 199.0 Training loss: 0.1570 Explore P: 0.0313
Episode: 946 Total reward: 199.0 Training loss: 0.2469 Explore P: 0.0309
Episode: 947 Total reward: 190.0 Training loss: 127.1649 Explore P: 0.0305
Episode: 948 Total reward: 133.0 Training loss: 0.1956 Explore P: 0.0302
Episode: 949 Total reward: 199.0 Training loss: 0.2203 Explore P: 0.0298
Episode: 950 Total reward: 129.0 Training loss: 0.2290 Explore P: 0.0296
Episode: 951 Total reward: 160.0 Training loss: 143.3621 Explore P: 0.0293
Episode: 952 Total reward: 199.0 Training loss: 0.3562 Explore P: 0.0289
Episode: 953 Total reward: 161.0 Training loss: 0.3226 Explore P: 0.0286
Episode: 954 Total reward: 171.0 Training loss: 0.2562 Explore P: 0.0283
Episode: 955 Total reward: 176.0 Training loss: 0.2779 Explore P: 0.0280
Episode: 956 Total reward: 199.0 Training loss: 0.3358 Explore P: 0.0276
Episode: 957 Total reward: 199.0 Training loss: 11.8226 Explore P: 0.0273
Episode: 958 Total reward: 199.0 Training loss: 0.3869 Explore P: 0.0269
Episode: 959 Total reward: 170.0 Training loss: 0.2069 Explore P: 0.0266
Episode: 960 Total reward: 199.0 Training loss: 8.0836 Explore P: 0.0263
Episode: 961 Total reward: 199.0 Training loss: 0.4290 Explore P: 0.0260
Episode: 962 Total reward: 199.0 Training loss: 0.1437 Explore P: 0.0257
Episode: 963 Total reward: 199.0 Training loss: 0.2070 Explore P: 0.0254
Episode: 964 Total reward: 199.0 Training loss: 0.2060 Explore P: 0.0251
Episode: 965 Total reward: 199.0 Training loss: 0.2116 Explore P: 0.0248
Episode: 966 Total reward: 199.0 Training loss: 0.2460 Explore P: 0.0245
Episode: 967 Total reward: 199.0 Training loss: 0.1771 Explore P: 0.0242
Episode: 968 Total reward: 199.0 Training loss: 0.1252 Explore P: 0.0239
Episode: 969 Total reward: 199.0 Training loss: 0.2353 Explore P: 0.0236
Episode: 970 Total reward: 199.0 Training loss: 0.2519 Explore P: 0.0234
Episode: 971 Total reward: 199.0 Training loss: 0.3741 Explore P: 0.0231
Episode: 972 Total reward: 199.0 Training loss: 0.4210 Explore P: 0.0228
Episode: 973 Total reward: 199.0 Training loss: 9.0117 Explore P: 0.0226
Episode: 974 Total reward: 199.0 Training loss: 0.3990 Explore P: 0.0223
Episode: 975 Total reward: 199.0 Training loss: 0.4499 Explore P: 0.0221
Episode: 976 Total reward: 169.0 Training loss: 0.5975 Explore P: 0.0219
Episode: 977 Total reward: 159.0 Training loss: 0.4188 Explore P: 0.0217
Episode: 978 Total reward: 162.0 Training loss: 0.6024 Explore P: 0.0215
Episode: 979 Total reward: 113.0 Training loss: 0.6621 Explore P: 0.0214
Episode: 980 Total reward: 129.0 Training loss: 0.9390 Explore P: 0.0212
Episode: 981 Total reward: 138.0 Training loss: 0.6157 Explore P: 0.0211
Episode: 982 Total reward: 136.0 Training loss: 0.8195 Explore P: 0.0209
Episode: 983 Total reward: 156.0 Training loss: 0.9651 Explore P: 0.0208
Episode: 984 Total reward: 163.0 Training loss: 0.3950 Explore P: 0.0206
Episode: 985 Total reward: 120.0 Training loss: 0.6088 Explore P: 0.0205
Episode: 986 Total reward: 145.0 Training loss: 1.1163 Explore P: 0.0203
Episode: 987 Total reward: 122.0 Training loss: 1.0331 Explore P: 0.0202
Episode: 988 Total reward: 121.0 Training loss: 0.7297 Explore P: 0.0201
Episode: 989 Total reward: 118.0 Training loss: 0.7908 Explore P: 0.0200
Episode: 990 Total reward: 143.0 Training loss: 0.2997 Explore P: 0.0198
Episode: 991 Total reward: 140.0 Training loss: 1.0748 Explore P: 0.0197
Episode: 992 Total reward: 107.0 Training loss: 0.5442 Explore P: 0.0196
Episode: 993 Total reward: 119.0 Training loss: 46.9258 Explore P: 0.0195
Episode: 994 Total reward: 98.0 Training loss: 0.5671 Explore P: 0.0194
Episode: 995 Total reward: 119.0 Training loss: 0.6406 Explore P: 0.0193
Episode: 996 Total reward: 103.0 Training loss: 0.5042 Explore P: 0.0192
Episode: 997 Total reward: 109.0 Training loss: 0.7135 Explore P: 0.0191
Episode: 998 Total reward: 91.0 Training loss: 0.4138 Explore P: 0.0190
Episode: 999 Total reward: 119.0 Training loss: 0.6830 Explore P: 0.0189

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [12]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [13]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[13]:
<matplotlib.text.Text at 0x7fefbbb78a90>
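
As a quick sanity check, here is a minimal sketch of what running_mean computes, run on a small made-up reward array (the values and the np.convolve comparison are illustrative assumptions, not part of the training run). It also shows why the smoothed curve is shorter than the raw one, which is why the plot uses eps[-len(smoothed_rews):].

```python
import numpy as np

def running_mean(x, N):
    # Prepend 0 so that cumsum[i] holds the sum of the first i values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of cumulative sums N apart is the sum of each length-N window
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up rewards, only for illustration
rews = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
smoothed = running_mean(rews, 3)

# Cross-check against a plain moving average computed by convolution
expected = np.convolve(rews, np.ones(3) / 3, mode='valid')

print(smoothed)                   # one value per full window
print(len(rews), len(smoothed))   # the smoothed array has N - 1 fewer points
```

With a window of N, only len(x) - N + 1 full windows exist, so the first N - 1 episodes have no smoothed value.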

Testing

Let's check out how our trained agent plays the game.


In [15]:
test_episodes = 10
test_max_steps = 400
env.reset()
# Take one random step to get an initial state and get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())
with tf.Session() as sess:
    # Load the weights saved during training
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Run test_episodes episodes, numbered from 1
    for ep in range(1, test_episodes + 1):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network: act greedily on the predicted Q-values
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                # End this episode and start a new one
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1

In [16]:
env.close()
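
The test loop above always takes the action with the largest predicted Q-value, whereas the training log shows an exploration probability ("Explore P") that shrinks over time. The sketch below contrasts the two using only NumPy. The function name choose_action, the example Q-values, and the use of np.random.default_rng are assumptions for illustration, not code from this notebook.

```python
import numpy as np

def choose_action(q_values, explore_p, rng):
    """Epsilon-greedy: random action with probability explore_p, else the greedy one."""
    if rng.random() < explore_p:
        # Explore: pick any action uniformly at random
        return int(rng.integers(len(q_values)))
    # Exploit: pick the action with the highest predicted Q-value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)

# Hypothetical Q-values for the two Cart-Pole actions (0 and 1)
q_values = np.array([0.3, 1.7])

# explore_p = 0 reproduces the purely greedy behaviour used at test time
greedy = choose_action(q_values, explore_p=0.0, rng=rng)

# explore_p = 1 ignores the Q-values entirely, like the random agent at the start
random_actions = [choose_action(q_values, explore_p=1.0, rng=rng) for _ in range(20)]
```

Setting the exploration probability to zero at test time means the agent's score reflects what the network has learned rather than luck from random moves.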

Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the four-number state we're using here, though, you'd pass the screen images through convolutional layers so the network learns the state from raw pixels.
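
To make the idea of building a state from screen images more concrete, here is a rough sketch of one common way to prepare frames before they reach the convolutional layers: convert each frame to grayscale, downsample it, and stack the last few frames so the network can see motion. Everything here is an assumption for illustration: the preprocess function, the FrameStack class, the 210x160 frame size, and the simple stride-2 downsampling are not taken from this notebook.

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Convert an RGB frame (H, W, 3) to a half-resolution grayscale image in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # luminance weights
    gray = frame.astype(np.float32) @ weights
    # Keep every second pixel in each direction, then scale to [0, 1]
    return gray[::2, ::2] / 255.0

class FrameStack:
    """Keeps the most recent n_frames processed frames as a single state array."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        processed = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return self.state()

    def step(self, frame):
        # Appending drops the oldest frame automatically
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        # Shape (H/2, W/2, n_frames), suitable as input to a convolutional layer
        return np.stack(list(self.frames), axis=-1)

# Hypothetical random screen standing in for a real game frame
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
stack = FrameStack(n_frames=4)
state = stack.reset(frame)
print(state.shape)
```

A single frame cannot show which way a ball is moving, so stacking several recent frames gives the network the same kind of velocity information that Cart-Pole provides directly in its state.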

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper (Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015), which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.