Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [4]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym repo.


In [5]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-29 10:29:27,477] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [6]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().


In [8]:
env.close()

If you ran the simulation above, we can look at the rewards:


In [9]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

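$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$. The maximum is taken over the actions $a'$ available in $s'$.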

Previously, we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so, floating-point precision aside, there are effectively infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
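
As a worked example of the target and loss described above, here is a minimal sketch in plain Python. The reward, discount, and Q-values are made up for illustration, and td_target is a hypothetical helper, not part of the network code in this notebook. The terminal flag is an assumption that anticipates the case where the episode ends and there is no next state to bootstrap from.

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Return r + gamma * max_a' Q(s', a'), or just r if the episode ended."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Made-up values: Q(s', left) = 2.0, Q(s', right) = 3.0
target = td_target(reward=1.0, next_q_values=[2.0, 3.0])

q_sa = 3.5                    # made-up current estimate Q(s, a)
loss = (target - q_sa) ** 2   # squared error that the optimizer minimizes
print(target, loss)
```

In the real network the same arithmetic is applied to a whole mini-batch at once, and the optimizer nudges the weights so that $Q(s,a)$ moves toward the target.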

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [10]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
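
The one-hot selection used for self.Q can be illustrated without TensorFlow. This is a small sketch that uses plain Python lists in place of tensors; the Q-values and actions are invented, and one_hot and select_q are hypothetical helpers written only for this illustration.

```python
def one_hot(index, size):
    """Return a list of zeros with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def select_q(outputs, actions):
    """For each row of `outputs`, pick the Q-value of the action taken.

    Mirrors reduce_sum(output * one_hot(actions), axis=1).
    """
    selected = []
    for row, action in zip(outputs, actions):
        mask = one_hot(action, len(row))
        selected.append(sum(q * m for q, m in zip(row, mask)))
    return selected


# Made-up network outputs for a batch of three states, two actions each
outputs = [[0.5, 1.5], [2.0, -1.0], [0.0, 0.25]]
actions = [1, 0, 1]
print(select_q(outputs, actions))
```

Multiplying by the mask zeroes out the Q-value of the action that was not taken, so only the chosen action contributes to the loss for that row.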

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [11]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
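
To see the "tube" behaviour of a bounded deque in isolation, here is a short sketch using only the standard library. The capacity of 3 and the string placeholders are chosen purely for illustration, and random.sample stands in for the np.random.choice call used in the Memory class.

```python
import random
from collections import deque

buffer = deque(maxlen=3)          # capacity of 3 for illustration
for experience in ["e1", "e2", "e3", "e4", "e5"]:
    buffer.append(experience)     # once full, the oldest item is dropped

print(list(buffer))               # only the three most recent experiences remain

# Draw a mini-batch of two distinct experiences, uniformly at random
batch = random.sample(list(buffer), 2)
print(batch)
```

Sampling without replacement means a mini-batch never contains the same stored experience twice, which is also why the memory must hold at least batch_size items before training starts.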

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
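
A minimal sketch of $\epsilon$-greedy action selection, assuming the Q-values for the current state are already available as a list. The function name epsilon_greedy and the example Q-values are hypothetical and only serve the illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_values = [0.2, 0.7]                          # made-up Q(s, left), Q(s, right)
print(epsilon_greedy(q_values, epsilon=0.0))   # epsilon = 0: always greedy
print(epsilon_greedy(q_values, epsilon=1.0))   # epsilon = 1: always random
```

Lowering epsilon over the course of training shifts the agent from mostly exploring to mostly exploiting.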

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode $= 1, \dots, M$ do
    • For $t = 1, \dots, T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
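
The steps above can be sketched end to end on a toy problem. This is not the Cart-Pole training code: it assumes a made-up five-state chain environment, uses a table instead of a neural network, and stores an explicit done flag in each transition. All names and constants here are hypothetical. For a table, the gradient step on the squared error reduces to moving the entry a fraction ALPHA of the way toward the target.

```python
import random
from collections import deque

N_STATES, N_ACTIONS = 5, 2        # toy chain: states 0..4, actions 0=left, 1=right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.5
rng = random.Random(0)


def step(state, action):
    """Toy simulator: reaching the right end gives reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # table standing in for the network
memory = deque(maxlen=200)                          # memory D

for episode in range(200):                          # for episode = 1..M
    state = 0
    for t in range(50):                             # for t = 1..T
        # Epsilon-greedy action selection
        if rng.random() < EPSILON:
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])

        # Execute the action and store the transition
        next_state, reward, done = step(state, action)
        memory.append((state, action, reward, next_state, done))

        # Sample a mini-batch and move Q(s_j, a_j) toward the target
        batch = rng.sample(list(memory), min(8, len(memory)))
        for s, a, r, s2, terminal in batch:
            target = r if terminal else r + GAMMA * max(Q[s2])
            Q[s][a] += ALPHA * (target - Q[s][a])

        if done:
            break
        state = next_state

# Greedy action in each non-terminal state after training
greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

The structure is the same as in the list: act, store, sample, compute targets, update. Only the function approximator and the environment differ.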

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [12]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory
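
To get a feel for the exploration parameters, the sketch below evaluates the exponential decay that the training loop in this notebook applies to the exploration probability. The function name explore_probability and the chosen step counts are only for illustration.

```python
import math

explore_start, explore_stop, decay_rate = 1.0, 0.01, 0.0001


def explore_probability(step):
    """Exploration probability after `step` total training steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)


for step in (0, 1000, 10000, 50000):
    print(step, round(explore_probability(step), 4))
```

With decay_rate = 0.0001 the probability falls off over tens of thousands of steps, so the agent keeps exploring for many episodes before it settles near explore_stop.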

In [13]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [14]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [15]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    loss = 0.0  # defined up front so the progress print below never hits an unset name
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 10.0 Training loss: 1.2843 Explore P: 0.9990
Episode: 2 Total reward: 15.0 Training loss: 1.4689 Explore P: 0.9975
Episode: 3 Total reward: 58.0 Training loss: 1.2572 Explore P: 0.9918
Episode: 4 Total reward: 11.0 Training loss: 1.2005 Explore P: 0.9907
Episode: 5 Total reward: 38.0 Training loss: 1.2464 Explore P: 0.9870
Episode: 6 Total reward: 49.0 Training loss: 1.1615 Explore P: 0.9822
Episode: 7 Total reward: 10.0 Training loss: 1.1145 Explore P: 0.9813
Episode: 8 Total reward: 13.0 Training loss: 1.2759 Explore P: 0.9800
Episode: 9 Total reward: 16.0 Training loss: 1.0810 Explore P: 0.9785
Episode: 10 Total reward: 21.0 Training loss: 1.2707 Explore P: 0.9764
Episode: 11 Total reward: 27.0 Training loss: 1.3068 Explore P: 0.9738
Episode: 12 Total reward: 20.0 Training loss: 1.3586 Explore P: 0.9719
Episode: 13 Total reward: 20.0 Training loss: 1.4039 Explore P: 0.9700
Episode: 14 Total reward: 34.0 Training loss: 1.1509 Explore P: 0.9667
Episode: 15 Total reward: 26.0 Training loss: 1.2061 Explore P: 0.9642
Episode: 16 Total reward: 23.0 Training loss: 1.3429 Explore P: 0.9620
Episode: 17 Total reward: 16.0 Training loss: 1.6928 Explore P: 0.9605
Episode: 18 Total reward: 19.0 Training loss: 1.4309 Explore P: 0.9587
Episode: 19 Total reward: 13.0 Training loss: 1.4211 Explore P: 0.9575
Episode: 20 Total reward: 11.0 Training loss: 1.3575 Explore P: 0.9564
Episode: 21 Total reward: 19.0 Training loss: 1.6751 Explore P: 0.9546
Episode: 22 Total reward: 16.0 Training loss: 1.3683 Explore P: 0.9531
Episode: 23 Total reward: 20.0 Training loss: 1.8245 Explore P: 0.9512
Episode: 24 Total reward: 24.0 Training loss: 2.0528 Explore P: 0.9490
Episode: 25 Total reward: 10.0 Training loss: 1.5797 Explore P: 0.9481
Episode: 26 Total reward: 12.0 Training loss: 2.0528 Explore P: 0.9469
Episode: 27 Total reward: 11.0 Training loss: 2.6709 Explore P: 0.9459
Episode: 28 Total reward: 29.0 Training loss: 2.1323 Explore P: 0.9432
Episode: 29 Total reward: 25.0 Training loss: 2.7723 Explore P: 0.9409
Episode: 30 Total reward: 15.0 Training loss: 5.8472 Explore P: 0.9395
Episode: 31 Total reward: 24.0 Training loss: 2.7294 Explore P: 0.9372
Episode: 32 Total reward: 17.0 Training loss: 3.2803 Explore P: 0.9357
Episode: 33 Total reward: 25.0 Training loss: 3.1626 Explore P: 0.9333
Episode: 34 Total reward: 11.0 Training loss: 2.4331 Explore P: 0.9323
Episode: 35 Total reward: 29.0 Training loss: 4.3340 Explore P: 0.9297
Episode: 36 Total reward: 11.0 Training loss: 7.2724 Explore P: 0.9286
Episode: 37 Total reward: 16.0 Training loss: 11.0559 Explore P: 0.9272
Episode: 38 Total reward: 15.0 Training loss: 6.1231 Explore P: 0.9258
Episode: 39 Total reward: 25.0 Training loss: 9.1333 Explore P: 0.9235
Episode: 40 Total reward: 37.0 Training loss: 7.8493 Explore P: 0.9201
Episode: 41 Total reward: 16.0 Training loss: 6.5931 Explore P: 0.9187
Episode: 42 Total reward: 19.0 Training loss: 5.4300 Explore P: 0.9170
Episode: 43 Total reward: 11.0 Training loss: 13.4986 Explore P: 0.9160
Episode: 44 Total reward: 19.0 Training loss: 10.7926 Explore P: 0.9142
Episode: 45 Total reward: 23.0 Training loss: 16.3392 Explore P: 0.9122
Episode: 46 Total reward: 49.0 Training loss: 8.2268 Explore P: 0.9078
Episode: 47 Total reward: 9.0 Training loss: 14.7577 Explore P: 0.9070
Episode: 48 Total reward: 52.0 Training loss: 11.9795 Explore P: 0.9023
Episode: 49 Total reward: 18.0 Training loss: 11.3619 Explore P: 0.9007
Episode: 50 Total reward: 15.0 Training loss: 93.3731 Explore P: 0.8994
Episode: 51 Total reward: 14.0 Training loss: 28.0523 Explore P: 0.8981
Episode: 52 Total reward: 18.0 Training loss: 11.1213 Explore P: 0.8965
Episode: 53 Total reward: 11.0 Training loss: 31.3388 Explore P: 0.8955
Episode: 54 Total reward: 21.0 Training loss: 14.4414 Explore P: 0.8937
Episode: 55 Total reward: 18.0 Training loss: 14.6165 Explore P: 0.8921
Episode: 56 Total reward: 13.0 Training loss: 12.8782 Explore P: 0.8910
Episode: 57 Total reward: 33.0 Training loss: 13.3546 Explore P: 0.8881
Episode: 58 Total reward: 20.0 Training loss: 14.7332 Explore P: 0.8863
Episode: 59 Total reward: 21.0 Training loss: 16.8108 Explore P: 0.8845
Episode: 60 Total reward: 28.0 Training loss: 52.0881 Explore P: 0.8820
Episode: 61 Total reward: 23.0 Training loss: 19.9877 Explore P: 0.8800
Episode: 62 Total reward: 38.0 Training loss: 14.6511 Explore P: 0.8767
Episode: 63 Total reward: 21.0 Training loss: 35.3178 Explore P: 0.8749
Episode: 64 Total reward: 19.0 Training loss: 101.0044 Explore P: 0.8733
Episode: 65 Total reward: 10.0 Training loss: 18.0987 Explore P: 0.8724
Episode: 66 Total reward: 31.0 Training loss: 314.2233 Explore P: 0.8697
Episode: 67 Total reward: 15.0 Training loss: 84.0895 Explore P: 0.8684
Episode: 68 Total reward: 11.0 Training loss: 74.7100 Explore P: 0.8675
Episode: 69 Total reward: 14.0 Training loss: 52.0110 Explore P: 0.8663
Episode: 70 Total reward: 24.0 Training loss: 26.6753 Explore P: 0.8642
Episode: 71 Total reward: 11.0 Training loss: 29.1713 Explore P: 0.8633
Episode: 72 Total reward: 23.0 Training loss: 162.4790 Explore P: 0.8613
Episode: 73 Total reward: 23.0 Training loss: 81.1424 Explore P: 0.8594
Episode: 74 Total reward: 38.0 Training loss: 272.3532 Explore P: 0.8562
Episode: 75 Total reward: 24.0 Training loss: 168.4038 Explore P: 0.8541
Episode: 76 Total reward: 13.0 Training loss: 407.6812 Explore P: 0.8530
Episode: 77 Total reward: 12.0 Training loss: 255.9348 Explore P: 0.8520
Episode: 78 Total reward: 26.0 Training loss: 32.0957 Explore P: 0.8498
Episode: 79 Total reward: 20.0 Training loss: 123.0513 Explore P: 0.8482
Episode: 80 Total reward: 19.0 Training loss: 38.4925 Explore P: 0.8466
Episode: 81 Total reward: 21.0 Training loss: 46.2019 Explore P: 0.8448
Episode: 82 Total reward: 13.0 Training loss: 36.6269 Explore P: 0.8437
Episode: 83 Total reward: 19.0 Training loss: 30.1236 Explore P: 0.8421
Episode: 84 Total reward: 17.0 Training loss: 25.4964 Explore P: 0.8407
Episode: 85 Total reward: 33.0 Training loss: 26.6375 Explore P: 0.8380
Episode: 86 Total reward: 10.0 Training loss: 443.3683 Explore P: 0.8372
Episode: 87 Total reward: 13.0 Training loss: 194.6815 Explore P: 0.8361
Episode: 88 Total reward: 57.0 Training loss: 33.3435 Explore P: 0.8314
Episode: 89 Total reward: 14.0 Training loss: 229.8674 Explore P: 0.8302
Episode: 90 Total reward: 15.0 Training loss: 681.2084 Explore P: 0.8290
Episode: 91 Total reward: 17.0 Training loss: 228.5903 Explore P: 0.8276
Episode: 92 Total reward: 33.0 Training loss: 171.7032 Explore P: 0.8249
Episode: 93 Total reward: 10.0 Training loss: 318.0533 Explore P: 0.8241
Episode: 94 Total reward: 21.0 Training loss: 114.3702 Explore P: 0.8224
Episode: 95 Total reward: 15.0 Training loss: 449.0784 Explore P: 0.8212
Episode: 96 Total reward: 24.0 Training loss: 114.9906 Explore P: 0.8192
Episode: 97 Total reward: 11.0 Training loss: 213.4525 Explore P: 0.8184
Episode: 98 Total reward: 20.0 Training loss: 116.2962 Explore P: 0.8167
Episode: 99 Total reward: 12.0 Training loss: 122.7398 Explore P: 0.8158
Episode: 100 Total reward: 19.0 Training loss: 348.5748 Explore P: 0.8142
Episode: 101 Total reward: 35.0 Training loss: 132.7888 Explore P: 0.8114
Episode: 102 Total reward: 13.0 Training loss: 20.2328 Explore P: 0.8104
Episode: 103 Total reward: 20.0 Training loss: 462.3654 Explore P: 0.8088
Episode: 104 Total reward: 23.0 Training loss: 184.0719 Explore P: 0.8070
Episode: 105 Total reward: 15.0 Training loss: 359.8098 Explore P: 0.8058
Episode: 106 Total reward: 11.0 Training loss: 264.4187 Explore P: 0.8049
Episode: 107 Total reward: 11.0 Training loss: 802.2219 Explore P: 0.8040
Episode: 108 Total reward: 15.0 Training loss: 28.9613 Explore P: 0.8028
Episode: 109 Total reward: 17.0 Training loss: 27.0009 Explore P: 0.8015
Episode: 110 Total reward: 12.0 Training loss: 874.9164 Explore P: 0.8005
Episode: 111 Total reward: 28.0 Training loss: 26.1487 Explore P: 0.7983
Episode: 112 Total reward: 18.0 Training loss: 24.8512 Explore P: 0.7969
Episode: 113 Total reward: 15.0 Training loss: 181.8474 Explore P: 0.7957
Episode: 114 Total reward: 12.0 Training loss: 17.7066 Explore P: 0.7948
Episode: 115 Total reward: 40.0 Training loss: 309.7830 Explore P: 0.7916
Episode: 116 Total reward: 42.0 Training loss: 175.4858 Explore P: 0.7884
Episode: 117 Total reward: 19.0 Training loss: 204.3070 Explore P: 0.7869
Episode: 118 Total reward: 17.0 Training loss: 194.4637 Explore P: 0.7856
Episode: 119 Total reward: 11.0 Training loss: 16.0807 Explore P: 0.7847
Episode: 120 Total reward: 28.0 Training loss: 13.7005 Explore P: 0.7826
Episode: 121 Total reward: 8.0 Training loss: 222.3985 Explore P: 0.7819
Episode: 122 Total reward: 17.0 Training loss: 210.0571 Explore P: 0.7806
Episode: 123 Total reward: 21.0 Training loss: 191.8717 Explore P: 0.7790
Episode: 124 Total reward: 14.0 Training loss: 225.6518 Explore P: 0.7779
Episode: 125 Total reward: 10.0 Training loss: 240.6312 Explore P: 0.7772
Episode: 126 Total reward: 18.0 Training loss: 197.5665 Explore P: 0.7758
Episode: 127 Total reward: 16.0 Training loss: 11.5052 Explore P: 0.7746
Episode: 128 Total reward: 16.0 Training loss: 13.9771 Explore P: 0.7733
Episode: 129 Total reward: 17.0 Training loss: 12.0067 Explore P: 0.7720
Episode: 130 Total reward: 12.0 Training loss: 13.3876 Explore P: 0.7711
Episode: 131 Total reward: 16.0 Training loss: 11.5924 Explore P: 0.7699
Episode: 132 Total reward: 16.0 Training loss: 11.7082 Explore P: 0.7687
Episode: 133 Total reward: 17.0 Training loss: 10.0922 Explore P: 0.7674
Episode: 134 Total reward: 31.0 Training loss: 10.0631 Explore P: 0.7651
Episode: 135 Total reward: 21.0 Training loss: 14.9653 Explore P: 0.7635
Episode: 136 Total reward: 18.0 Training loss: 10.9656 Explore P: 0.7621
Episode: 137 Total reward: 16.0 Training loss: 8.6198 Explore P: 0.7609
Episode: 138 Total reward: 24.0 Training loss: 429.7949 Explore P: 0.7591
Episode: 139 Total reward: 16.0 Training loss: 12.8796 Explore P: 0.7579
Episode: 140 Total reward: 17.0 Training loss: 571.7976 Explore P: 0.7567
Episode: 141 Total reward: 10.0 Training loss: 426.6067 Explore P: 0.7559
Episode: 142 Total reward: 16.0 Training loss: 9.0282 Explore P: 0.7547
Episode: 143 Total reward: 17.0 Training loss: 422.3457 Explore P: 0.7535
Episode: 144 Total reward: 47.0 Training loss: 399.1782 Explore P: 0.7500
Episode: 145 Total reward: 32.0 Training loss: 205.5674 Explore P: 0.7476
Episode: 146 Total reward: 9.0 Training loss: 6.2407 Explore P: 0.7469
Episode: 147 Total reward: 17.0 Training loss: 8.3172 Explore P: 0.7457
Episode: 148 Total reward: 11.0 Training loss: 6.1241 Explore P: 0.7449
Episode: 149 Total reward: 12.0 Training loss: 4.8214 Explore P: 0.7440
Episode: 150 Total reward: 18.0 Training loss: 6.1457 Explore P: 0.7427
Episode: 151 Total reward: 16.0 Training loss: 155.2390 Explore P: 0.7415
Episode: 152 Total reward: 9.0 Training loss: 9.2007 Explore P: 0.7408
Episode: 153 Total reward: 17.0 Training loss: 321.8909 Explore P: 0.7396
Episode: 154 Total reward: 80.0 Training loss: 265.7353 Explore P: 0.7338
Episode: 155 Total reward: 9.0 Training loss: 151.8861 Explore P: 0.7331
Episode: 156 Total reward: 24.0 Training loss: 147.1886 Explore P: 0.7314
Episode: 157 Total reward: 15.0 Training loss: 339.0747 Explore P: 0.7303
Episode: 158 Total reward: 21.0 Training loss: 4.2267 Explore P: 0.7288
Episode: 159 Total reward: 10.0 Training loss: 126.6963 Explore P: 0.7281
Episode: 160 Total reward: 9.0 Training loss: 197.8804 Explore P: 0.7275
Episode: 161 Total reward: 12.0 Training loss: 248.3090 Explore P: 0.7266
Episode: 162 Total reward: 13.0 Training loss: 432.4426 Explore P: 0.7257
Episode: 163 Total reward: 19.0 Training loss: 149.6630 Explore P: 0.7243
Episode: 164 Total reward: 9.0 Training loss: 1.7188 Explore P: 0.7237
Episode: 165 Total reward: 17.0 Training loss: 468.0979 Explore P: 0.7224
Episode: 166 Total reward: 8.0 Training loss: 283.1140 Explore P: 0.7219
Episode: 167 Total reward: 12.0 Training loss: 356.0063 Explore P: 0.7210
Episode: 168 Total reward: 19.0 Training loss: 492.8287 Explore P: 0.7197
Episode: 169 Total reward: 11.0 Training loss: 134.2272 Explore P: 0.7189
Episode: 170 Total reward: 12.0 Training loss: 215.7161 Explore P: 0.7180
Episode: 171 Total reward: 22.0 Training loss: 1.1626 Explore P: 0.7165
Episode: 172 Total reward: 27.0 Training loss: 95.4166 Explore P: 0.7146
Episode: 173 Total reward: 24.0 Training loss: 112.3357 Explore P: 0.7129
Episode: 174 Total reward: 10.0 Training loss: 316.0848 Explore P: 0.7122
Episode: 175 Total reward: 15.0 Training loss: 1.8570 Explore P: 0.7111
Episode: 176 Total reward: 57.0 Training loss: 2.1777 Explore P: 0.7072
Episode: 177 Total reward: 22.0 Training loss: 1.3997 Explore P: 0.7056
Episode: 178 Total reward: 18.0 Training loss: 172.5161 Explore P: 0.7044
Episode: 179 Total reward: 12.0 Training loss: 1.4659 Explore P: 0.7035
Episode: 180 Total reward: 18.0 Training loss: 74.8100 Explore P: 0.7023
Episode: 181 Total reward: 21.0 Training loss: 223.8991 Explore P: 0.7008
Episode: 182 Total reward: 17.0 Training loss: 145.1559 Explore P: 0.6997
Episode: 183 Total reward: 9.0 Training loss: 1.2114 Explore P: 0.6990
Episode: 184 Total reward: 9.0 Training loss: 150.0736 Explore P: 0.6984
Episode: 185 Total reward: 13.0 Training loss: 240.5938 Explore P: 0.6975
Episode: 186 Total reward: 13.0 Training loss: 1.2766 Explore P: 0.6966
Episode: 187 Total reward: 10.0 Training loss: 74.9464 Explore P: 0.6960
Episode: 188 Total reward: 16.0 Training loss: 1.6428 Explore P: 0.6949
Episode: 189 Total reward: 14.0 Training loss: 93.2868 Explore P: 0.6939
Episode: 190 Total reward: 8.0 Training loss: 2.4246 Explore P: 0.6933
Episode: 191 Total reward: 12.0 Training loss: 66.0714 Explore P: 0.6925
Episode: 192 Total reward: 17.0 Training loss: 242.4605 Explore P: 0.6914
Episode: 193 Total reward: 10.0 Training loss: 62.3584 Explore P: 0.6907
Episode: 194 Total reward: 15.0 Training loss: 1.4356 Explore P: 0.6897
Episode: 195 Total reward: 21.0 Training loss: 275.8983 Explore P: 0.6882
Episode: 196 Total reward: 8.0 Training loss: 2.2438 Explore P: 0.6877
Episode: 197 Total reward: 11.0 Training loss: 1.9060 Explore P: 0.6870
Episode: 198 Total reward: 9.0 Training loss: 145.3390 Explore P: 0.6863
Episode: 199 Total reward: 22.0 Training loss: 249.2537 Explore P: 0.6849
Episode: 200 Total reward: 11.0 Training loss: 586.1780 Explore P: 0.6841
Episode: 201 Total reward: 10.0 Training loss: 1.7131 Explore P: 0.6834
Episode: 202 Total reward: 15.0 Training loss: 0.9717 Explore P: 0.6824
Episode: 203 Total reward: 14.0 Training loss: 1.3264 Explore P: 0.6815
Episode: 204 Total reward: 15.0 Training loss: 124.4185 Explore P: 0.6805
Episode: 205 Total reward: 12.0 Training loss: 1.0721 Explore P: 0.6797
Episode: 206 Total reward: 23.0 Training loss: 283.7094 Explore P: 0.6781
Episode: 207 Total reward: 24.0 Training loss: 2.1872 Explore P: 0.6765
Episode: 208 Total reward: 23.0 Training loss: 2.1923 Explore P: 0.6750
Episode: 209 Total reward: 9.0 Training loss: 50.8856 Explore P: 0.6744
Episode: 210 Total reward: 24.0 Training loss: 176.1981 Explore P: 0.6728
Episode: 211 Total reward: 16.0 Training loss: 165.4270 Explore P: 0.6718
Episode: 212 Total reward: 10.0 Training loss: 169.0332 Explore P: 0.6711
Episode: 213 Total reward: 12.0 Training loss: 124.2096 Explore P: 0.6703
Episode: 214 Total reward: 8.0 Training loss: 1.7523 Explore P: 0.6698
Episode: 215 Total reward: 27.0 Training loss: 242.9177 Explore P: 0.6680
Episode: 216 Total reward: 13.0 Training loss: 50.4216 Explore P: 0.6671
Episode: 217 Total reward: 17.0 Training loss: 47.9242 Explore P: 0.6660
Episode: 218 Total reward: 12.0 Training loss: 51.3225 Explore P: 0.6652
Episode: 219 Total reward: 15.0 Training loss: 153.7268 Explore P: 0.6643
Episode: 220 Total reward: 13.0 Training loss: 2.5651 Explore P: 0.6634
Episode: 221 Total reward: 22.0 Training loss: 48.7015 Explore P: 0.6620
Episode: 222 Total reward: 16.0 Training loss: 43.4941 Explore P: 0.6609
Episode: 223 Total reward: 12.0 Training loss: 2.9003 Explore P: 0.6602
Episode: 224 Total reward: 9.0 Training loss: 2.6266 Explore P: 0.6596
Episode: 225 Total reward: 16.0 Training loss: 162.1342 Explore P: 0.6585
Episode: 226 Total reward: 14.0 Training loss: 190.0661 Explore P: 0.6576
Episode: 227 Total reward: 11.0 Training loss: 40.1261 Explore P: 0.6569
Episode: 228 Total reward: 14.0 Training loss: 2.1330 Explore P: 0.6560
Episode: 229 Total reward: 10.0 Training loss: 3.4607 Explore P: 0.6554
Episode: 230 Total reward: 8.0 Training loss: 40.8713 Explore P: 0.6548
Episode: 231 Total reward: 17.0 Training loss: 2.1753 Explore P: 0.6537
Episode: 232 Total reward: 51.0 Training loss: 2.8111 Explore P: 0.6505
Episode: 233 Total reward: 20.0 Training loss: 262.3038 Explore P: 0.6492
Episode: 234 Total reward: 10.0 Training loss: 2.2364 Explore P: 0.6486
Episode: 235 Total reward: 18.0 Training loss: 47.8800 Explore P: 0.6474
Episode: 236 Total reward: 10.0 Training loss: 32.9212 Explore P: 0.6468
Episode: 237 Total reward: 20.0 Training loss: 36.7558 Explore P: 0.6455
Episode: 238 Total reward: 28.0 Training loss: 107.8041 Explore P: 0.6437
Episode: 239 Total reward: 16.0 Training loss: 2.5514 Explore P: 0.6427
Episode: 240 Total reward: 22.0 Training loss: 1.9298 Explore P: 0.6413
Episode: 241 Total reward: 12.0 Training loss: 2.6396 Explore P: 0.6406
Episode: 242 Total reward: 18.0 Training loss: 35.8534 Explore P: 0.6394
Episode: 243 Total reward: 22.0 Training loss: 32.9783 Explore P: 0.6380
Episode: 244 Total reward: 9.0 Training loss: 129.3889 Explore P: 0.6375
Episode: 245 Total reward: 12.0 Training loss: 116.2100 Explore P: 0.6367
Episode: 246 Total reward: 14.0 Training loss: 4.1729 Explore P: 0.6358
Episode: 247 Total reward: 15.0 Training loss: 38.3052 Explore P: 0.6349
Episode: 248 Total reward: 10.0 Training loss: 256.0958 Explore P: 0.6343
Episode: 249 Total reward: 19.0 Training loss: 121.3389 Explore P: 0.6331
Episode: 250 Total reward: 11.0 Training loss: 71.0181 Explore P: 0.6324
Episode: 251 Total reward: 11.0 Training loss: 93.9889 Explore P: 0.6317
Episode: 252 Total reward: 21.0 Training loss: 2.0961 Explore P: 0.6304
Episode: 253 Total reward: 12.0 Training loss: 326.6285 Explore P: 0.6297
Episode: 254 Total reward: 12.0 Training loss: 128.3906 Explore P: 0.6289
Episode: 255 Total reward: 45.0 Training loss: 107.7227 Explore P: 0.6262
Episode: 256 Total reward: 26.0 Training loss: 2.3994 Explore P: 0.6246
Episode: 257 Total reward: 10.0 Training loss: 159.7010 Explore P: 0.6239
Episode: 258 Total reward: 8.0 Training loss: 2.4366 Explore P: 0.6235
Episode: 259 Total reward: 15.0 Training loss: 31.3479 Explore P: 0.6225
Episode: 260 Total reward: 12.0 Training loss: 2.5527 Explore P: 0.6218
Episode: 261 Total reward: 14.0 Training loss: 88.7401 Explore P: 0.6209
Episode: 262 Total reward: 12.0 Training loss: 2.7836 Explore P: 0.6202
Episode: 263 Total reward: 13.0 Training loss: 26.0408 Explore P: 0.6194
Episode: 264 Total reward: 16.0 Training loss: 120.0539 Explore P: 0.6184
Episode: 265 Total reward: 9.0 Training loss: 32.1169 Explore P: 0.6179
Episode: 266 Total reward: 12.0 Training loss: 1.8107 Explore P: 0.6172
Episode: 267 Total reward: 11.0 Training loss: 30.6531 Explore P: 0.6165
Episode: 268 Total reward: 9.0 Training loss: 2.1186 Explore P: 0.6160
Episode: 269 Total reward: 14.0 Training loss: 29.3330 Explore P: 0.6151
Episode: 270 Total reward: 14.0 Training loss: 32.1520 Explore P: 0.6143
Episode: 271 Total reward: 20.0 Training loss: 51.4702 Explore P: 0.6131
Episode: 272 Total reward: 11.0 Training loss: 2.0823 Explore P: 0.6124
Episode: 273 Total reward: 12.0 Training loss: 153.4642 Explore P: 0.6117
Episode: 274 Total reward: 19.0 Training loss: 169.9438 Explore P: 0.6105
Episode: 275 Total reward: 9.0 Training loss: 51.2018 Explore P: 0.6100
Episode: 276 Total reward: 10.0 Training loss: 1.8589 Explore P: 0.6094
Episode: 277 Total reward: 21.0 Training loss: 91.7078 Explore P: 0.6081
Episode: 278 Total reward: 11.0 Training loss: 96.3697 Explore P: 0.6075
Episode: 279 Total reward: 14.0 Training loss: 1.9831 Explore P: 0.6066
Episode: 280 Total reward: 18.0 Training loss: 74.9915 Explore P: 0.6056
Episode: 281 Total reward: 15.0 Training loss: 30.9880 Explore P: 0.6047
Episode: 282 Total reward: 14.0 Training loss: 113.6423 Explore P: 0.6038
Episode: 283 Total reward: 14.0 Training loss: 70.6148 Explore P: 0.6030
Episode: 284 Total reward: 17.0 Training loss: 24.3675 Explore P: 0.6020
Episode: 285 Total reward: 12.0 Training loss: 68.7180 Explore P: 0.6013
Episode: 286 Total reward: 11.0 Training loss: 32.6982 Explore P: 0.6006
Episode: 287 Total reward: 9.0 Training loss: 48.1468 Explore P: 0.6001
Episode: 288 Total reward: 9.0 Training loss: 1.3115 Explore P: 0.5996
Episode: 289 Total reward: 13.0 Training loss: 1.0043 Explore P: 0.5988
Episode: 290 Total reward: 16.0 Training loss: 1.8838 Explore P: 0.5979
Episode: 291 Total reward: 13.0 Training loss: 0.8698 Explore P: 0.5971
Episode: 292 Total reward: 12.0 Training loss: 107.1492 Explore P: 0.5964
Episode: 293 Total reward: 12.0 Training loss: 118.8721 Explore P: 0.5957
Episode: 294 Total reward: 16.0 Training loss: 43.0937 Explore P: 0.5948
Episode: 295 Total reward: 32.0 Training loss: 144.8963 Explore P: 0.5929
Episode: 296 Total reward: 11.0 Training loss: 59.3461 Explore P: 0.5923
Episode: 297 Total reward: 23.0 Training loss: 0.9946 Explore P: 0.5909
Episode: 298 Total reward: 32.0 Training loss: 119.9140 Explore P: 0.5891
Episode: 299 Total reward: 11.0 Training loss: 1.2042 Explore P: 0.5884
Episode: 300 Total reward: 23.0 Training loss: 0.8598 Explore P: 0.5871
Episode: 301 Total reward: 11.0 Training loss: 21.0916 Explore P: 0.5865
Episode: 302 Total reward: 44.0 Training loss: 1.0106 Explore P: 0.5839
Episode: 303 Total reward: 14.0 Training loss: 76.6160 Explore P: 0.5831
Episode: 304 Total reward: 28.0 Training loss: 42.5719 Explore P: 0.5815
Episode: 305 Total reward: 16.0 Training loss: 57.4107 Explore P: 0.5806
Episode: 306 Total reward: 15.0 Training loss: 0.7700 Explore P: 0.5798
Episode: 307 Total reward: 17.0 Training loss: 54.1773 Explore P: 0.5788
Episode: 308 Total reward: 20.0 Training loss: 68.1773 Explore P: 0.5776
Episode: 309 Total reward: 18.0 Training loss: 95.9949 Explore P: 0.5766
Episode: 310 Total reward: 11.0 Training loss: 56.3985 Explore P: 0.5760
Episode: 311 Total reward: 18.0 Training loss: 53.2979 Explore P: 0.5750
Episode: 312 Total reward: 48.0 Training loss: 57.1327 Explore P: 0.5723
Episode: 313 Total reward: 15.0 Training loss: 17.1149 Explore P: 0.5714
Episode: 314 Total reward: 43.0 Training loss: 18.2595 Explore P: 0.5690
Episode: 315 Total reward: 122.0 Training loss: 18.5296 Explore P: 0.5623
Episode: 316 Total reward: 141.0 Training loss: 67.3824 Explore P: 0.5545
Episode: 317 Total reward: 49.0 Training loss: 29.6251 Explore P: 0.5519
Episode: 318 Total reward: 17.0 Training loss: 0.9543 Explore P: 0.5509
Episode: 319 Total reward: 56.0 Training loss: 1.0102 Explore P: 0.5479
Episode: 320 Total reward: 95.0 Training loss: 15.5083 Explore P: 0.5428
Episode: 321 Total reward: 97.0 Training loss: 1.2843 Explore P: 0.5377
Episode: 322 Total reward: 73.0 Training loss: 1.2503 Explore P: 0.5338
Episode: 323 Total reward: 121.0 Training loss: 1.2148 Explore P: 0.5275
Episode: 324 Total reward: 71.0 Training loss: 1.2691 Explore P: 0.5239
Episode: 325 Total reward: 9.0 Training loss: 1.1095 Explore P: 0.5234
Episode: 326 Total reward: 25.0 Training loss: 112.7007 Explore P: 0.5221
Episode: 327 Total reward: 123.0 Training loss: 1.0161 Explore P: 0.5159
Episode: 328 Total reward: 36.0 Training loss: 14.8811 Explore P: 0.5141
Episode: 329 Total reward: 31.0 Training loss: 13.4769 Explore P: 0.5125
Episode: 330 Total reward: 82.0 Training loss: 14.8449 Explore P: 0.5084
Episode: 331 Total reward: 182.0 Training loss: 1.8369 Explore P: 0.4994
Episode: 332 Total reward: 199.0 Training loss: 65.5059 Explore P: 0.4898
Episode: 333 Total reward: 149.0 Training loss: 58.8159 Explore P: 0.4827
Episode: 334 Total reward: 107.0 Training loss: 29.7433 Explore P: 0.4776
Episode: 335 Total reward: 199.0 Training loss: 17.5077 Explore P: 0.4684
Episode: 336 Total reward: 30.0 Training loss: 1.7410 Explore P: 0.4671
Episode: 337 Total reward: 132.0 Training loss: 1.2660 Explore P: 0.4611
Episode: 338 Total reward: 96.0 Training loss: 62.5731 Explore P: 0.4568
Episode: 339 Total reward: 199.0 Training loss: 38.6817 Explore P: 0.4480
Episode: 340 Total reward: 163.0 Training loss: 19.0086 Explore P: 0.4409
Episode: 341 Total reward: 199.0 Training loss: 36.0030 Explore P: 0.4324
Episode: 342 Total reward: 199.0 Training loss: 1.6032 Explore P: 0.4241
Episode: 343 Total reward: 36.0 Training loss: 1.7708 Explore P: 0.4226
Episode: 344 Total reward: 78.0 Training loss: 21.0087 Explore P: 0.4194
Episode: 345 Total reward: 37.0 Training loss: 37.3133 Explore P: 0.4179
Episode: 346 Total reward: 139.0 Training loss: 112.4740 Explore P: 0.4122
Episode: 347 Total reward: 12.0 Training loss: 30.3707 Explore P: 0.4117
Episode: 348 Total reward: 105.0 Training loss: 24.9713 Explore P: 0.4075
Episode: 349 Total reward: 46.0 Training loss: 1.0661 Explore P: 0.4057
Episode: 350 Total reward: 191.0 Training loss: 107.5973 Explore P: 0.3982
Episode: 351 Total reward: 199.0 Training loss: 72.3744 Explore P: 0.3906
Episode: 352 Total reward: 113.0 Training loss: 13.5777 Explore P: 0.3863
Episode: 353 Total reward: 60.0 Training loss: 31.5029 Explore P: 0.3841
Episode: 354 Total reward: 42.0 Training loss: 2.0288 Explore P: 0.3825
Episode: 355 Total reward: 107.0 Training loss: 3.1992 Explore P: 0.3785
Episode: 356 Total reward: 121.0 Training loss: 38.2556 Explore P: 0.3741
Episode: 357 Total reward: 33.0 Training loss: 26.7569 Explore P: 0.3729
Episode: 358 Total reward: 44.0 Training loss: 37.0838 Explore P: 0.3713
Episode: 359 Total reward: 64.0 Training loss: 1.0983 Explore P: 0.3690
Episode: 360 Total reward: 9.0 Training loss: 1.6487 Explore P: 0.3687
Episode: 361 Total reward: 25.0 Training loss: 1.6984 Explore P: 0.3678
Episode: 362 Total reward: 87.0 Training loss: 3.6755 Explore P: 0.3647
Episode: 363 Total reward: 61.0 Training loss: 1.3237 Explore P: 0.3625
Episode: 364 Total reward: 65.0 Training loss: 1.7698 Explore P: 0.3602
Episode: 365 Total reward: 199.0 Training loss: 0.7434 Explore P: 0.3533
Episode: 366 Total reward: 61.0 Training loss: 2.3319 Explore P: 0.3512
Episode: 367 Total reward: 57.0 Training loss: 41.5896 Explore P: 0.3493
Episode: 368 Total reward: 27.0 Training loss: 108.2389 Explore P: 0.3484
Episode: 369 Total reward: 40.0 Training loss: 37.4753 Explore P: 0.3470
Episode: 370 Total reward: 41.0 Training loss: 1.7548 Explore P: 0.3457
Episode: 371 Total reward: 122.0 Training loss: 2.2598 Explore P: 0.3416
Episode: 372 Total reward: 31.0 Training loss: 42.2909 Explore P: 0.3406
Episode: 373 Total reward: 99.0 Training loss: 24.8938 Explore P: 0.3373
Episode: 374 Total reward: 59.0 Training loss: 1.4133 Explore P: 0.3354
Episode: 375 Total reward: 117.0 Training loss: 49.8470 Explore P: 0.3316
Episode: 376 Total reward: 103.0 Training loss: 40.4387 Explore P: 0.3283
Episode: 377 Total reward: 114.0 Training loss: 1.7627 Explore P: 0.3247
Episode: 378 Total reward: 190.0 Training loss: 3.2786 Explore P: 0.3188
Episode: 379 Total reward: 42.0 Training loss: 27.0590 Explore P: 0.3175
Episode: 380 Total reward: 123.0 Training loss: 3.3331 Explore P: 0.3137
Episode: 381 Total reward: 68.0 Training loss: 48.7098 Explore P: 0.3117
Episode: 382 Total reward: 100.0 Training loss: 3.8683 Explore P: 0.3087
Episode: 383 Total reward: 40.0 Training loss: 1.3067 Explore P: 0.3075
Episode: 384 Total reward: 39.0 Training loss: 139.5377 Explore P: 0.3063
Episode: 385 Total reward: 54.0 Training loss: 27.2350 Explore P: 0.3047
Episode: 386 Total reward: 90.0 Training loss: 1.5541 Explore P: 0.3021
Episode: 387 Total reward: 73.0 Training loss: 3.2074 Explore P: 0.2999
Episode: 388 Total reward: 53.0 Training loss: 1.7449 Explore P: 0.2984
Episode: 389 Total reward: 89.0 Training loss: 1.4821 Explore P: 0.2959
Episode: 390 Total reward: 44.0 Training loss: 1.7802 Explore P: 0.2946
Episode: 391 Total reward: 42.0 Training loss: 51.9896 Explore P: 0.2934
Episode: 392 Total reward: 107.0 Training loss: 75.0258 Explore P: 0.2904
Episode: 393 Total reward: 92.0 Training loss: 112.2832 Explore P: 0.2878
Episode: 394 Total reward: 64.0 Training loss: 25.1925 Explore P: 0.2861
Episode: 395 Total reward: 40.0 Training loss: 49.4040 Explore P: 0.2850
Episode: 396 Total reward: 55.0 Training loss: 29.1842 Explore P: 0.2834
Episode: 397 Total reward: 74.0 Training loss: 4.3687 Explore P: 0.2814
Episode: 398 Total reward: 105.0 Training loss: 2.2134 Explore P: 0.2786
Episode: 399 Total reward: 62.0 Training loss: 56.4134 Explore P: 0.2769
Episode: 400 Total reward: 61.0 Training loss: 2.1531 Explore P: 0.2753
Episode: 401 Total reward: 86.0 Training loss: 2.5176 Explore P: 0.2730
Episode: 402 Total reward: 55.0 Training loss: 4.1600 Explore P: 0.2716
Episode: 403 Total reward: 72.0 Training loss: 2.2448 Explore P: 0.2697
Episode: 404 Total reward: 53.0 Training loss: 74.0260 Explore P: 0.2683
Episode: 405 Total reward: 76.0 Training loss: 37.5111 Explore P: 0.2664
Episode: 406 Total reward: 93.0 Training loss: 2.9126 Explore P: 0.2640
Episode: 407 Total reward: 84.0 Training loss: 3.6939 Explore P: 0.2619
Episode: 408 Total reward: 134.0 Training loss: 205.0600 Explore P: 0.2585
Episode: 409 Total reward: 66.0 Training loss: 2.9375 Explore P: 0.2569
Episode: 410 Total reward: 115.0 Training loss: 69.1977 Explore P: 0.2541
Episode: 411 Total reward: 85.0 Training loss: 3.4190 Explore P: 0.2520
Episode: 412 Total reward: 108.0 Training loss: 0.6622 Explore P: 0.2494
Episode: 413 Total reward: 55.0 Training loss: 1.6978 Explore P: 0.2481
Episode: 414 Total reward: 57.0 Training loss: 2.9469 Explore P: 0.2468
Episode: 415 Total reward: 58.0 Training loss: 3.5884 Explore P: 0.2454
Episode: 416 Total reward: 29.0 Training loss: 1.9151 Explore P: 0.2447
Episode: 417 Total reward: 40.0 Training loss: 0.8549 Explore P: 0.2438
Episode: 418 Total reward: 83.0 Training loss: 2.7390 Explore P: 0.2418
Episode: 419 Total reward: 58.0 Training loss: 3.0493 Explore P: 0.2405
Episode: 420 Total reward: 52.0 Training loss: 123.7813 Explore P: 0.2393
Episode: 421 Total reward: 49.0 Training loss: 161.0166 Explore P: 0.2382
Episode: 422 Total reward: 86.0 Training loss: 1.7409 Explore P: 0.2362
Episode: 423 Total reward: 78.0 Training loss: 128.4492 Explore P: 0.2345
Episode: 424 Total reward: 69.0 Training loss: 1.4359 Explore P: 0.2329
Episode: 425 Total reward: 65.0 Training loss: 2.1383 Explore P: 0.2315
Episode: 426 Total reward: 61.0 Training loss: 1.1239 Explore P: 0.2301
Episode: 427 Total reward: 57.0 Training loss: 69.1385 Explore P: 0.2289
Episode: 428 Total reward: 65.0 Training loss: 1.3844 Explore P: 0.2275
Episode: 429 Total reward: 105.0 Training loss: 1.5269 Explore P: 0.2252
Episode: 430 Total reward: 37.0 Training loss: 99.4288 Explore P: 0.2244
Episode: 431 Total reward: 72.0 Training loss: 1.6894 Explore P: 0.2229
Episode: 432 Total reward: 40.0 Training loss: 2.5790 Explore P: 0.2220
Episode: 433 Total reward: 65.0 Training loss: 1.8680 Explore P: 0.2206
Episode: 434 Total reward: 45.0 Training loss: 1.8988 Explore P: 0.2197
Episode: 435 Total reward: 102.0 Training loss: 0.7269 Explore P: 0.2176
Episode: 436 Total reward: 63.0 Training loss: 272.9193 Explore P: 0.2163
Episode: 437 Total reward: 121.0 Training loss: 1.2661 Explore P: 0.2138
Episode: 438 Total reward: 48.0 Training loss: 0.9351 Explore P: 0.2128
Episode: 439 Total reward: 57.0 Training loss: 183.1317 Explore P: 0.2116
Episode: 440 Total reward: 96.0 Training loss: 137.9915 Explore P: 0.2097
Episode: 441 Total reward: 80.0 Training loss: 1.0748 Explore P: 0.2081
Episode: 442 Total reward: 182.0 Training loss: 186.0903 Explore P: 0.2046
Episode: 443 Total reward: 72.0 Training loss: 2.0881 Explore P: 0.2032
Episode: 444 Total reward: 44.0 Training loss: 1.0547 Explore P: 0.2023
Episode: 445 Total reward: 72.0 Training loss: 0.7605 Explore P: 0.2009
Episode: 446 Total reward: 85.0 Training loss: 0.7721 Explore P: 0.1993
Episode: 447 Total reward: 81.0 Training loss: 0.8666 Explore P: 0.1978
Episode: 448 Total reward: 100.0 Training loss: 106.6899 Explore P: 0.1959
Episode: 449 Total reward: 76.0 Training loss: 199.9545 Explore P: 0.1945
Episode: 450 Total reward: 54.0 Training loss: 225.6138 Explore P: 0.1935
Episode: 451 Total reward: 62.0 Training loss: 1.0217 Explore P: 0.1924
Episode: 452 Total reward: 70.0 Training loss: 116.3916 Explore P: 0.1911
Episode: 453 Total reward: 81.0 Training loss: 2.2272 Explore P: 0.1896
Episode: 454 Total reward: 72.0 Training loss: 1.3550 Explore P: 0.1884
Episode: 455 Total reward: 62.0 Training loss: 0.7852 Explore P: 0.1873
Episode: 456 Total reward: 44.0 Training loss: 0.8978 Explore P: 0.1865
Episode: 457 Total reward: 62.0 Training loss: 0.6018 Explore P: 0.1854
Episode: 458 Total reward: 101.0 Training loss: 1.0949 Explore P: 0.1836
Episode: 459 Total reward: 77.0 Training loss: 109.9297 Explore P: 0.1823
Episode: 460 Total reward: 71.0 Training loss: 219.4469 Explore P: 0.1811
Episode: 461 Total reward: 145.0 Training loss: 1.9773 Explore P: 0.1786
Episode: 462 Total reward: 82.0 Training loss: 0.7958 Explore P: 0.1772
Episode: 463 Total reward: 150.0 Training loss: 71.7133 Explore P: 0.1747
Episode: 464 Total reward: 42.0 Training loss: 1.1387 Explore P: 0.1741
Episode: 465 Total reward: 73.0 Training loss: 73.8538 Explore P: 0.1729
Episode: 466 Total reward: 101.0 Training loss: 191.4988 Explore P: 0.1712
Episode: 467 Total reward: 119.0 Training loss: 1.2066 Explore P: 0.1693
Episode: 468 Total reward: 89.0 Training loss: 3.1530 Explore P: 0.1679
Episode: 469 Total reward: 161.0 Training loss: 0.6501 Explore P: 0.1654
Episode: 470 Total reward: 199.0 Training loss: 142.9809 Explore P: 0.1623
Episode: 471 Total reward: 109.0 Training loss: 0.9730 Explore P: 0.1607
Episode: 472 Total reward: 63.0 Training loss: 196.8127 Explore P: 0.1597
Episode: 473 Total reward: 56.0 Training loss: 236.3772 Explore P: 0.1589
Episode: 474 Total reward: 66.0 Training loss: 34.7561 Explore P: 0.1579
Episode: 475 Total reward: 59.0 Training loss: 1.0522 Explore P: 0.1570
Episode: 476 Total reward: 72.0 Training loss: 0.8902 Explore P: 0.1560
Episode: 477 Total reward: 46.0 Training loss: 1.5189 Explore P: 0.1553
Episode: 478 Total reward: 55.0 Training loss: 1.2496 Explore P: 0.1545
Episode: 479 Total reward: 120.0 Training loss: 0.8151 Explore P: 0.1528
Episode: 480 Total reward: 135.0 Training loss: 200.1712 Explore P: 0.1509
Episode: 481 Total reward: 48.0 Training loss: 44.9221 Explore P: 0.1502
Episode: 482 Total reward: 134.0 Training loss: 0.4768 Explore P: 0.1483
Episode: 483 Total reward: 139.0 Training loss: 26.8129 Explore P: 0.1464
Episode: 484 Total reward: 65.0 Training loss: 0.8811 Explore P: 0.1455
Episode: 485 Total reward: 56.0 Training loss: 1.1043 Explore P: 0.1448
Episode: 486 Total reward: 50.0 Training loss: 1.5724 Explore P: 0.1441
Episode: 487 Total reward: 199.0 Training loss: 2.0666 Explore P: 0.1415
Episode: 488 Total reward: 65.0 Training loss: 1.7804 Explore P: 0.1406
Episode: 489 Total reward: 78.0 Training loss: 0.7719 Explore P: 0.1396
Episode: 490 Total reward: 73.0 Training loss: 0.8541 Explore P: 0.1387
Episode: 491 Total reward: 57.0 Training loss: 0.8599 Explore P: 0.1379
Episode: 492 Total reward: 111.0 Training loss: 1.0490 Explore P: 0.1365
Episode: 493 Total reward: 103.0 Training loss: 1.0412 Explore P: 0.1352
Episode: 494 Total reward: 199.0 Training loss: 0.5361 Explore P: 0.1328
Episode: 495 Total reward: 85.0 Training loss: 0.7318 Explore P: 0.1317
Episode: 496 Total reward: 87.0 Training loss: 0.5037 Explore P: 0.1307
Episode: 497 Total reward: 91.0 Training loss: 1.2023 Explore P: 0.1296
Episode: 498 Total reward: 118.0 Training loss: 0.7156 Explore P: 0.1282
Episode: 499 Total reward: 81.0 Training loss: 0.6521 Explore P: 0.1272
Episode: 500 Total reward: 57.0 Training loss: 0.8486 Explore P: 0.1265
Episode: 501 Total reward: 84.0 Training loss: 0.9722 Explore P: 0.1256
Episode: 502 Total reward: 59.0 Training loss: 0.4192 Explore P: 0.1249
Episode: 503 Total reward: 93.0 Training loss: 0.6116 Explore P: 0.1238
Episode: 504 Total reward: 116.0 Training loss: 0.2333 Explore P: 0.1225
Episode: 505 Total reward: 199.0 Training loss: 0.4408 Explore P: 0.1203
Episode: 506 Total reward: 77.0 Training loss: 0.8084 Explore P: 0.1195
Episode: 507 Total reward: 61.0 Training loss: 1.2765 Explore P: 0.1188
Episode: 508 Total reward: 68.0 Training loss: 1.1434 Explore P: 0.1181
Episode: 509 Total reward: 86.0 Training loss: 0.3154 Explore P: 0.1171
Episode: 510 Total reward: 87.0 Training loss: 1.4883 Explore P: 0.1162
Episode: 511 Total reward: 60.0 Training loss: 0.6837 Explore P: 0.1156
Episode: 512 Total reward: 74.0 Training loss: 6.0119 Explore P: 0.1148
Episode: 513 Total reward: 67.0 Training loss: 0.4942 Explore P: 0.1141
Episode: 514 Total reward: 76.0 Training loss: 0.4245 Explore P: 0.1133
Episode: 515 Total reward: 56.0 Training loss: 0.4479 Explore P: 0.1127
Episode: 516 Total reward: 65.0 Training loss: 0.9331 Explore P: 0.1121
Episode: 517 Total reward: 62.0 Training loss: 0.4182 Explore P: 0.1114
Episode: 518 Total reward: 42.0 Training loss: 0.9712 Explore P: 0.1110
Episode: 519 Total reward: 92.0 Training loss: 0.9955 Explore P: 0.1101
Episode: 520 Total reward: 63.0 Training loss: 194.2884 Explore P: 0.1094
Episode: 521 Total reward: 95.0 Training loss: 0.3666 Explore P: 0.1085
Episode: 522 Total reward: 61.0 Training loss: 0.6420 Explore P: 0.1079
Episode: 523 Total reward: 79.0 Training loss: 0.8301 Explore P: 0.1071
Episode: 524 Total reward: 49.0 Training loss: 0.7596 Explore P: 0.1067
Episode: 525 Total reward: 72.0 Training loss: 1.1621 Explore P: 0.1060
Episode: 526 Total reward: 34.0 Training loss: 1.5710 Explore P: 0.1056
Episode: 527 Total reward: 40.0 Training loss: 1.3609 Explore P: 0.1053
Episode: 528 Total reward: 54.0 Training loss: 1.2838 Explore P: 0.1047
Episode: 529 Total reward: 49.0 Training loss: 0.9708 Explore P: 0.1043
Episode: 530 Total reward: 73.0 Training loss: 0.4502 Explore P: 0.1036
Episode: 531 Total reward: 48.0 Training loss: 1.6185 Explore P: 0.1031
Episode: 532 Total reward: 65.0 Training loss: 0.5712 Explore P: 0.1025
Episode: 533 Total reward: 65.0 Training loss: 0.9639 Explore P: 0.1019
Episode: 534 Total reward: 72.0 Training loss: 0.5055 Explore P: 0.1013
Episode: 535 Total reward: 66.0 Training loss: 1.0176 Explore P: 0.1007
Episode: 536 Total reward: 46.0 Training loss: 0.9275 Explore P: 0.1003
Episode: 537 Total reward: 60.0 Training loss: 1.4305 Explore P: 0.0997
Episode: 538 Total reward: 45.0 Training loss: 0.6672 Explore P: 0.0993
Episode: 539 Total reward: 57.0 Training loss: 0.8446 Explore P: 0.0988
Episode: 540 Total reward: 42.0 Training loss: 0.9120 Explore P: 0.0984
Episode: 541 Total reward: 91.0 Training loss: 0.3521 Explore P: 0.0976
Episode: 542 Total reward: 49.0 Training loss: 1.0163 Explore P: 0.0972
Episode: 543 Total reward: 103.0 Training loss: 0.3380 Explore P: 0.0963
Episode: 544 Total reward: 64.0 Training loss: 1.3203 Explore P: 0.0958
Episode: 545 Total reward: 70.0 Training loss: 0.5646 Explore P: 0.0952
Episode: 546 Total reward: 90.0 Training loss: 0.3144 Explore P: 0.0944
Episode: 547 Total reward: 79.0 Training loss: 1.0799 Explore P: 0.0937
Episode: 548 Total reward: 80.0 Training loss: 0.4764 Explore P: 0.0931
Episode: 549 Total reward: 63.0 Training loss: 0.3776 Explore P: 0.0926
Episode: 550 Total reward: 86.0 Training loss: 0.9617 Explore P: 0.0919
Episode: 551 Total reward: 48.0 Training loss: 1.1784 Explore P: 0.0915
Episode: 552 Total reward: 78.0 Training loss: 0.6117 Explore P: 0.0908
Episode: 553 Total reward: 105.0 Training loss: 0.5143 Explore P: 0.0900
Episode: 554 Total reward: 107.0 Training loss: 1.2334 Explore P: 0.0891
Episode: 555 Total reward: 123.0 Training loss: 0.3203 Explore P: 0.0882
Episode: 556 Total reward: 165.0 Training loss: 0.4542 Explore P: 0.0869
Episode: 557 Total reward: 90.0 Training loss: 0.5789 Explore P: 0.0862
Episode: 558 Total reward: 159.0 Training loss: 0.4710 Explore P: 0.0850
Episode: 559 Total reward: 124.0 Training loss: 0.5845 Explore P: 0.0841
Episode: 560 Total reward: 122.0 Training loss: 0.1682 Explore P: 0.0832
Episode: 561 Total reward: 198.0 Training loss: 0.3287 Explore P: 0.0817
Episode: 562 Total reward: 155.0 Training loss: 0.2132 Explore P: 0.0806
Episode: 563 Total reward: 199.0 Training loss: 0.3374 Explore P: 0.0792
Episode: 564 Total reward: 199.0 Training loss: 3.0482 Explore P: 0.0779
Episode: 565 Total reward: 131.0 Training loss: 0.5510 Explore P: 0.0770
Episode: 566 Total reward: 168.0 Training loss: 0.1640 Explore P: 0.0759
Episode: 567 Total reward: 199.0 Training loss: 0.7326 Explore P: 0.0746
Episode: 568 Total reward: 199.0 Training loss: 0.2233 Explore P: 0.0733
Episode: 569 Total reward: 199.0 Training loss: 0.5559 Explore P: 0.0721
Episode: 570 Total reward: 199.0 Training loss: 0.6013 Explore P: 0.0708
Episode: 571 Total reward: 199.0 Training loss: 0.4790 Explore P: 0.0696
Episode: 572 Total reward: 199.0 Training loss: 12.4632 Explore P: 0.0685
Episode: 573 Total reward: 199.0 Training loss: 0.4047 Explore P: 0.0673
Episode: 574 Total reward: 199.0 Training loss: 0.6532 Explore P: 0.0662
Episode: 575 Total reward: 103.0 Training loss: 0.7766 Explore P: 0.0656
Episode: 576 Total reward: 131.0 Training loss: 15.9979 Explore P: 0.0649
Episode: 577 Total reward: 199.0 Training loss: 0.7880 Explore P: 0.0638
Episode: 578 Total reward: 199.0 Training loss: 0.4170 Explore P: 0.0627
Episode: 579 Total reward: 199.0 Training loss: 0.5428 Explore P: 0.0617
Episode: 580 Total reward: 199.0 Training loss: 22.3391 Explore P: 0.0607
Episode: 581 Total reward: 199.0 Training loss: 0.6310 Explore P: 0.0597
Episode: 582 Total reward: 199.0 Training loss: 0.3997 Explore P: 0.0587
Episode: 583 Total reward: 199.0 Training loss: 0.3365 Explore P: 0.0577
Episode: 584 Total reward: 196.0 Training loss: 0.8490 Explore P: 0.0568
Episode: 585 Total reward: 199.0 Training loss: 21.2014 Explore P: 0.0559
Episode: 586 Total reward: 199.0 Training loss: 23.6305 Explore P: 0.0550
Episode: 587 Total reward: 199.0 Training loss: 0.5159 Explore P: 0.0541
Episode: 588 Total reward: 199.0 Training loss: 0.3961 Explore P: 0.0532
Episode: 589 Total reward: 199.0 Training loss: 0.2599 Explore P: 0.0524
Episode: 590 Total reward: 199.0 Training loss: 0.4241 Explore P: 0.0516
Episode: 591 Total reward: 199.0 Training loss: 0.3279 Explore P: 0.0507
Episode: 592 Total reward: 199.0 Training loss: 0.4428 Explore P: 0.0499
Episode: 593 Total reward: 199.0 Training loss: 0.6602 Explore P: 0.0491
Episode: 594 Total reward: 199.0 Training loss: 0.7594 Explore P: 0.0484
Episode: 595 Total reward: 199.0 Training loss: 0.3457 Explore P: 0.0476
Episode: 596 Total reward: 42.0 Training loss: 0.7475 Explore P: 0.0475
Episode: 597 Total reward: 199.0 Training loss: 0.3660 Explore P: 0.0467
Episode: 598 Total reward: 199.0 Training loss: 0.3787 Explore P: 0.0460
Episode: 599 Total reward: 199.0 Training loss: 0.5338 Explore P: 0.0453
Episode: 600 Total reward: 199.0 Training loss: 48.1424 Explore P: 0.0446
Episode: 601 Total reward: 199.0 Training loss: 48.1925 Explore P: 0.0439
Episode: 602 Total reward: 199.0 Training loss: 0.4150 Explore P: 0.0432
Episode: 603 Total reward: 56.0 Training loss: 0.3729 Explore P: 0.0431
Episode: 604 Total reward: 199.0 Training loss: 0.3364 Explore P: 0.0424
Episode: 605 Total reward: 199.0 Training loss: 0.4200 Explore P: 0.0418
Episode: 606 Total reward: 199.0 Training loss: 0.3059 Explore P: 0.0411
Episode: 607 Total reward: 199.0 Training loss: 0.2135 Explore P: 0.0405
Episode: 608 Total reward: 199.0 Training loss: 0.5670 Explore P: 0.0399
Episode: 609 Total reward: 199.0 Training loss: 0.5009 Explore P: 0.0393
Episode: 610 Total reward: 199.0 Training loss: 0.4031 Explore P: 0.0388
Episode: 611 Total reward: 199.0 Training loss: 0.3505 Explore P: 0.0382
Episode: 612 Total reward: 199.0 Training loss: 0.4032 Explore P: 0.0376
Episode: 613 Total reward: 199.0 Training loss: 0.5973 Explore P: 0.0371
Episode: 614 Total reward: 199.0 Training loss: 0.5915 Explore P: 0.0366
Episode: 615 Total reward: 199.0 Training loss: 0.2631 Explore P: 0.0360
Episode: 616 Total reward: 199.0 Training loss: 0.2537 Explore P: 0.0355
Episode: 617 Total reward: 199.0 Training loss: 0.5456 Explore P: 0.0350
Episode: 618 Total reward: 199.0 Training loss: 0.4192 Explore P: 0.0345
Episode: 619 Total reward: 199.0 Training loss: 0.3988 Explore P: 0.0340
Episode: 620 Total reward: 199.0 Training loss: 0.4277 Explore P: 0.0336
Episode: 621 Total reward: 199.0 Training loss: 0.2828 Explore P: 0.0331
Episode: 622 Total reward: 199.0 Training loss: 0.5735 Explore P: 0.0326
Episode: 623 Total reward: 199.0 Training loss: 232.4794 Explore P: 0.0322
Episode: 624 Total reward: 199.0 Training loss: 0.2616 Explore P: 0.0318
Episode: 625 Total reward: 199.0 Training loss: 0.5225 Explore P: 0.0313
Episode: 626 Total reward: 199.0 Training loss: 0.0675 Explore P: 0.0309
Episode: 627 Total reward: 199.0 Training loss: 0.4485 Explore P: 0.0305
Episode: 628 Total reward: 199.0 Training loss: 0.1797 Explore P: 0.0301
Episode: 629 Total reward: 199.0 Training loss: 0.2270 Explore P: 0.0297
Episode: 630 Total reward: 199.0 Training loss: 0.2099 Explore P: 0.0293
Episode: 631 Total reward: 199.0 Training loss: 0.3225 Explore P: 0.0289
Episode: 632 Total reward: 199.0 Training loss: 263.3187 Explore P: 0.0286
Episode: 633 Total reward: 199.0 Training loss: 213.5303 Explore P: 0.0282
Episode: 634 Total reward: 199.0 Training loss: 0.2112 Explore P: 0.0278
Episode: 635 Total reward: 199.0 Training loss: 0.1637 Explore P: 0.0275
Episode: 636 Total reward: 199.0 Training loss: 0.1121 Explore P: 0.0271
Episode: 637 Total reward: 199.0 Training loss: 0.1989 Explore P: 0.0268
Episode: 638 Total reward: 199.0 Training loss: 0.1704 Explore P: 0.0265
Episode: 639 Total reward: 199.0 Training loss: 0.5088 Explore P: 0.0261
Episode: 640 Total reward: 199.0 Training loss: 0.2532 Explore P: 0.0258
Episode: 641 Total reward: 199.0 Training loss: 0.3570 Explore P: 0.0255
Episode: 642 Total reward: 172.0 Training loss: 0.2609 Explore P: 0.0253
Episode: 643 Total reward: 170.0 Training loss: 0.3606 Explore P: 0.0250
Episode: 644 Total reward: 166.0 Training loss: 0.2295 Explore P: 0.0247
Episode: 645 Total reward: 199.0 Training loss: 0.3311 Explore P: 0.0245
Episode: 646 Total reward: 152.0 Training loss: 0.2308 Explore P: 0.0242
Episode: 647 Total reward: 142.0 Training loss: 0.1818 Explore P: 0.0240
Episode: 648 Total reward: 150.0 Training loss: 0.5432 Explore P: 0.0238
Episode: 649 Total reward: 194.0 Training loss: 0.3920 Explore P: 0.0236
Episode: 650 Total reward: 199.0 Training loss: 0.0656 Explore P: 0.0233
Episode: 651 Total reward: 169.0 Training loss: 0.1207 Explore P: 0.0231
Episode: 652 Total reward: 155.0 Training loss: 0.3628 Explore P: 0.0229
Episode: 653 Total reward: 167.0 Training loss: 0.2650 Explore P: 0.0227
Episode: 654 Total reward: 155.0 Training loss: 0.2757 Explore P: 0.0225
Episode: 655 Total reward: 119.0 Training loss: 0.1934 Explore P: 0.0223
Episode: 656 Total reward: 168.0 Training loss: 0.2588 Explore P: 0.0221
Episode: 657 Total reward: 162.0 Training loss: 0.1115 Explore P: 0.0219
Episode: 658 Total reward: 153.0 Training loss: 0.2686 Explore P: 0.0217
Episode: 659 Total reward: 199.0 Training loss: 0.2105 Explore P: 0.0215
Episode: 660 Total reward: 199.0 Training loss: 0.2371 Explore P: 0.0213
Episode: 661 Total reward: 199.0 Training loss: 0.2902 Explore P: 0.0211
Episode: 662 Total reward: 170.0 Training loss: 0.3500 Explore P: 0.0209
Episode: 663 Total reward: 199.0 Training loss: 0.1182 Explore P: 0.0207
Episode: 664 Total reward: 199.0 Training loss: 0.1814 Explore P: 0.0204
Episode: 665 Total reward: 199.0 Training loss: 0.1668 Explore P: 0.0202
Episode: 666 Total reward: 196.0 Training loss: 0.1744 Explore P: 0.0200
Episode: 667 Total reward: 199.0 Training loss: 0.1465 Explore P: 0.0198
Episode: 668 Total reward: 199.0 Training loss: 0.2491 Explore P: 0.0197
Episode: 669 Total reward: 199.0 Training loss: 0.2155 Explore P: 0.0195
Episode: 670 Total reward: 147.0 Training loss: 0.1059 Explore P: 0.0193
Episode: 671 Total reward: 175.0 Training loss: 0.1345 Explore P: 0.0192
Episode: 672 Total reward: 199.0 Training loss: 0.1351 Explore P: 0.0190
Episode: 673 Total reward: 199.0 Training loss: 0.1715 Explore P: 0.0188
Episode: 674 Total reward: 199.0 Training loss: 0.1810 Explore P: 0.0186
Episode: 675 Total reward: 199.0 Training loss: 0.1823 Explore P: 0.0185
Episode: 676 Total reward: 199.0 Training loss: 0.2036 Explore P: 0.0183
Episode: 677 Total reward: 199.0 Training loss: 4.3736 Explore P: 0.0181
Episode: 678 Total reward: 199.0 Training loss: 0.1824 Explore P: 0.0180
Episode: 679 Total reward: 199.0 Training loss: 0.1132 Explore P: 0.0178
Episode: 680 Total reward: 199.0 Training loss: 0.1501 Explore P: 0.0177
Episode: 681 Total reward: 199.0 Training loss: 0.1626 Explore P: 0.0175
Episode: 682 Total reward: 199.0 Training loss: 0.1600 Explore P: 0.0174
Episode: 683 Total reward: 199.0 Training loss: 0.2360 Explore P: 0.0172
Episode: 684 Total reward: 199.0 Training loss: 0.0988 Explore P: 0.0171
Episode: 685 Total reward: 199.0 Training loss: 0.1418 Explore P: 0.0169
Episode: 686 Total reward: 199.0 Training loss: 0.1363 Explore P: 0.0168
Episode: 687 Total reward: 199.0 Training loss: 0.1157 Explore P: 0.0167
Episode: 688 Total reward: 199.0 Training loss: 0.1811 Explore P: 0.0165
Episode: 689 Total reward: 199.0 Training loss: 0.1145 Explore P: 0.0164
Episode: 690 Total reward: 199.0 Training loss: 0.1018 Explore P: 0.0163
Episode: 691 Total reward: 199.0 Training loss: 0.2102 Explore P: 0.0162
Episode: 692 Total reward: 199.0 Training loss: 0.0769 Explore P: 0.0160
Episode: 693 Total reward: 199.0 Training loss: 0.1966 Explore P: 0.0159
Episode: 694 Total reward: 199.0 Training loss: 0.2047 Explore P: 0.0158
Episode: 695 Total reward: 199.0 Training loss: 0.1864 Explore P: 0.0157
Episode: 696 Total reward: 199.0 Training loss: 9.5800 Explore P: 0.0156
Episode: 697 Total reward: 199.0 Training loss: 0.3879 Explore P: 0.0155
Episode: 698 Total reward: 199.0 Training loss: 0.1336 Explore P: 0.0154
Episode: 699 Total reward: 199.0 Training loss: 305.6340 Explore P: 0.0152
Episode: 700 Total reward: 199.0 Training loss: 0.1705 Explore P: 0.0151
Episode: 701 Total reward: 199.0 Training loss: 0.0719 Explore P: 0.0150
Episode: 702 Total reward: 199.0 Training loss: 0.1750 Explore P: 0.0149
Episode: 703 Total reward: 199.0 Training loss: 0.2131 Explore P: 0.0148
Episode: 704 Total reward: 199.0 Training loss: 326.5654 Explore P: 0.0148
Episode: 705 Total reward: 199.0 Training loss: 190.7222 Explore P: 0.0147
Episode: 706 Total reward: 199.0 Training loss: 0.1423 Explore P: 0.0146
Episode: 707 Total reward: 199.0 Training loss: 0.0853 Explore P: 0.0145
Episode: 708 Total reward: 199.0 Training loss: 0.2765 Explore P: 0.0144
Episode: 709 Total reward: 84.0 Training loss: 0.2578 Explore P: 0.0144
Episode: 710 Total reward: 70.0 Training loss: 0.4312 Explore P: 0.0143
Episode: 711 Total reward: 199.0 Training loss: 0.1406 Explore P: 0.0142
Episode: 712 Total reward: 199.0 Training loss: 0.1679 Explore P: 0.0142
Episode: 713 Total reward: 199.0 Training loss: 0.3057 Explore P: 0.0141
Episode: 714 Total reward: 199.0 Training loss: 0.1476 Explore P: 0.0140
Episode: 715 Total reward: 199.0 Training loss: 0.2763 Explore P: 0.0139
Episode: 716 Total reward: 70.0 Training loss: 295.2354 Explore P: 0.0139
Episode: 717 Total reward: 102.0 Training loss: 0.1910 Explore P: 0.0138
Episode: 718 Total reward: 112.0 Training loss: 37.7239 Explore P: 0.0138
Episode: 719 Total reward: 128.0 Training loss: 0.4470 Explore P: 0.0138
Episode: 720 Total reward: 199.0 Training loss: 0.1135 Explore P: 0.0137
Episode: 721 Total reward: 199.0 Training loss: 0.3781 Explore P: 0.0136
Episode: 722 Total reward: 199.0 Training loss: 0.7136 Explore P: 0.0135
Episode: 723 Total reward: 199.0 Training loss: 0.5198 Explore P: 0.0135
Episode: 724 Total reward: 199.0 Training loss: 0.2549 Explore P: 0.0134
Episode: 725 Total reward: 199.0 Training loss: 0.2789 Explore P: 0.0133
Episode: 726 Total reward: 199.0 Training loss: 0.2627 Explore P: 0.0133
Episode: 727 Total reward: 199.0 Training loss: 0.2379 Explore P: 0.0132
Episode: 728 Total reward: 199.0 Training loss: 0.4553 Explore P: 0.0131
Episode: 729 Total reward: 199.0 Training loss: 0.4249 Explore P: 0.0131
Episode: 730 Total reward: 199.0 Training loss: 0.4437 Explore P: 0.0130
Episode: 731 Total reward: 199.0 Training loss: 0.1596 Explore P: 0.0130
Episode: 732 Total reward: 199.0 Training loss: 0.7226 Explore P: 0.0129
Episode: 733 Total reward: 199.0 Training loss: 0.4409 Explore P: 0.0128
Episode: 734 Total reward: 199.0 Training loss: 0.2349 Explore P: 0.0128
Episode: 735 Total reward: 199.0 Training loss: 0.3563 Explore P: 0.0127
Episode: 736 Total reward: 199.0 Training loss: 0.3517 Explore P: 0.0127
Episode: 737 Total reward: 199.0 Training loss: 0.1180 Explore P: 0.0126
Episode: 738 Total reward: 199.0 Training loss: 230.5002 Explore P: 0.0126
Episode: 739 Total reward: 199.0 Training loss: 0.3390 Explore P: 0.0125
Episode: 740 Total reward: 199.0 Training loss: 0.3147 Explore P: 0.0125
Episode: 741 Total reward: 199.0 Training loss: 0.2606 Explore P: 0.0124
Episode: 742 Total reward: 199.0 Training loss: 0.3671 Explore P: 0.0124
Episode: 743 Total reward: 199.0 Training loss: 0.3268 Explore P: 0.0123
Episode: 744 Total reward: 199.0 Training loss: 0.1541 Explore P: 0.0123
Episode: 745 Total reward: 199.0 Training loss: 201.2239 Explore P: 0.0122
Episode: 746 Total reward: 199.0 Training loss: 0.7027 Explore P: 0.0122
Episode: 747 Total reward: 199.0 Training loss: 0.1469 Explore P: 0.0121
Episode: 748 Total reward: 199.0 Training loss: 0.3901 Explore P: 0.0121
Episode: 749 Total reward: 199.0 Training loss: 0.1384 Explore P: 0.0121
Episode: 750 Total reward: 199.0 Training loss: 0.1104 Explore P: 0.0120
Episode: 751 Total reward: 199.0 Training loss: 0.7034 Explore P: 0.0120
Episode: 752 Total reward: 199.0 Training loss: 0.2108 Explore P: 0.0119
Episode: 753 Total reward: 199.0 Training loss: 0.3160 Explore P: 0.0119
Episode: 754 Total reward: 199.0 Training loss: 0.4048 Explore P: 0.0119
Episode: 755 Total reward: 199.0 Training loss: 0.1974 Explore P: 0.0118
Episode: 756 Total reward: 199.0 Training loss: 0.2176 Explore P: 0.0118
Episode: 757 Total reward: 199.0 Training loss: 0.2797 Explore P: 0.0118
Episode: 758 Total reward: 199.0 Training loss: 274.8846 Explore P: 0.0117
Episode: 759 Total reward: 199.0 Training loss: 0.1784 Explore P: 0.0117
Episode: 760 Total reward: 199.0 Training loss: 0.4048 Explore P: 0.0117
Episode: 761 Total reward: 199.0 Training loss: 0.2433 Explore P: 0.0116
Episode: 762 Total reward: 199.0 Training loss: 0.8564 Explore P: 0.0116
Episode: 763 Total reward: 199.0 Training loss: 0.5487 Explore P: 0.0116
Episode: 764 Total reward: 199.0 Training loss: 0.3069 Explore P: 0.0115
Episode: 765 Total reward: 199.0 Training loss: 0.1695 Explore P: 0.0115
Episode: 766 Total reward: 199.0 Training loss: 0.2068 Explore P: 0.0115
Episode: 767 Total reward: 199.0 Training loss: 0.2166 Explore P: 0.0114
Episode: 768 Total reward: 199.0 Training loss: 0.1380 Explore P: 0.0114
Episode: 769 Total reward: 199.0 Training loss: 0.2243 Explore P: 0.0114
Episode: 770 Total reward: 199.0 Training loss: 0.2158 Explore P: 0.0114
Episode: 771 Total reward: 199.0 Training loss: 0.2092 Explore P: 0.0113
Episode: 772 Total reward: 199.0 Training loss: 0.2646 Explore P: 0.0113
Episode: 773 Total reward: 199.0 Training loss: 0.0999 Explore P: 0.0113
Episode: 774 Total reward: 199.0 Training loss: 0.2356 Explore P: 0.0113
Episode: 775 Total reward: 199.0 Training loss: 0.3058 Explore P: 0.0112
Episode: 776 Total reward: 199.0 Training loss: 0.1739 Explore P: 0.0112
Episode: 777 Total reward: 199.0 Training loss: 0.2312 Explore P: 0.0112
Episode: 778 Total reward: 199.0 Training loss: 0.1562 Explore P: 0.0112
Episode: 779 Total reward: 199.0 Training loss: 387.0215 Explore P: 0.0111
Episode: 780 Total reward: 199.0 Training loss: 0.1911 Explore P: 0.0111
Episode: 781 Total reward: 199.0 Training loss: 0.1396 Explore P: 0.0111
Episode: 782 Total reward: 199.0 Training loss: 0.1866 Explore P: 0.0111
Episode: 783 Total reward: 199.0 Training loss: 0.2732 Explore P: 0.0111
Episode: 784 Total reward: 199.0 Training loss: 0.2317 Explore P: 0.0110
Episode: 785 Total reward: 199.0 Training loss: 0.1255 Explore P: 0.0110
Episode: 786 Total reward: 199.0 Training loss: 0.1328 Explore P: 0.0110
Episode: 787 Total reward: 199.0 Training loss: 0.1804 Explore P: 0.0110
Episode: 788 Total reward: 199.0 Training loss: 0.2548 Explore P: 0.0110
Episode: 789 Total reward: 199.0 Training loss: 0.1354 Explore P: 0.0109
Episode: 790 Total reward: 199.0 Training loss: 0.1467 Explore P: 0.0109
Episode: 791 Total reward: 199.0 Training loss: 0.3876 Explore P: 0.0109
Episode: 792 Total reward: 199.0 Training loss: 0.1706 Explore P: 0.0109
Episode: 793 Total reward: 199.0 Training loss: 166.9771 Explore P: 0.0109
Episode: 794 Total reward: 199.0 Training loss: 0.2654 Explore P: 0.0108
Episode: 795 Total reward: 199.0 Training loss: 0.3231 Explore P: 0.0108
Episode: 796 Total reward: 199.0 Training loss: 0.2890 Explore P: 0.0108
Episode: 797 Total reward: 199.0 Training loss: 0.2035 Explore P: 0.0108
Episode: 798 Total reward: 199.0 Training loss: 0.1736 Explore P: 0.0108
Episode: 799 Total reward: 199.0 Training loss: 0.4242 Explore P: 0.0108
Episode: 800 Total reward: 199.0 Training loss: 0.1846 Explore P: 0.0107
Episode: 801 Total reward: 199.0 Training loss: 0.2378 Explore P: 0.0107
Episode: 802 Total reward: 199.0 Training loss: 0.2112 Explore P: 0.0107
Episode: 803 Total reward: 199.0 Training loss: 0.1685 Explore P: 0.0107
Episode: 804 Total reward: 199.0 Training loss: 0.1844 Explore P: 0.0107
Episode: 805 Total reward: 199.0 Training loss: 0.2034 Explore P: 0.0107
Episode: 806 Total reward: 199.0 Training loss: 0.1204 Explore P: 0.0107
Episode: 807 Total reward: 199.0 Training loss: 0.1329 Explore P: 0.0107
Episode: 808 Total reward: 199.0 Training loss: 0.1432 Explore P: 0.0106
Episode: 809 Total reward: 199.0 Training loss: 0.1137 Explore P: 0.0106
Episode: 810 Total reward: 199.0 Training loss: 0.1252 Explore P: 0.0106
Episode: 811 Total reward: 199.0 Training loss: 0.4770 Explore P: 0.0106
Episode: 812 Total reward: 199.0 Training loss: 0.2150 Explore P: 0.0106
Episode: 813 Total reward: 199.0 Training loss: 0.1759 Explore P: 0.0106
Episode: 814 Total reward: 199.0 Training loss: 0.0722 Explore P: 0.0106
Episode: 815 Total reward: 199.0 Training loss: 0.2243 Explore P: 0.0106
Episode: 816 Total reward: 199.0 Training loss: 0.2882 Explore P: 0.0105
Episode: 817 Total reward: 199.0 Training loss: 0.4234 Explore P: 0.0105
Episode: 818 Total reward: 199.0 Training loss: 0.1703 Explore P: 0.0105
Episode: 819 Total reward: 199.0 Training loss: 0.1840 Explore P: 0.0105
Episode: 820 Total reward: 199.0 Training loss: 0.1995 Explore P: 0.0105
Episode: 821 Total reward: 199.0 Training loss: 0.1511 Explore P: 0.0105
Episode: 822 Total reward: 199.0 Training loss: 0.2346 Explore P: 0.0105
Episode: 823 Total reward: 145.0 Training loss: 0.1623 Explore P: 0.0105
Episode: 824 Total reward: 177.0 Training loss: 0.2099 Explore P: 0.0105
Episode: 825 Total reward: 154.0 Training loss: 0.2159 Explore P: 0.0105
Episode: 826 Total reward: 178.0 Training loss: 0.3513 Explore P: 0.0105
Episode: 827 Total reward: 170.0 Training loss: 0.1833 Explore P: 0.0104
Episode: 828 Total reward: 140.0 Training loss: 0.1525 Explore P: 0.0104
Episode: 829 Total reward: 104.0 Training loss: 0.1769 Explore P: 0.0104
Episode: 830 Total reward: 147.0 Training loss: 0.1144 Explore P: 0.0104
Episode: 831 Total reward: 130.0 Training loss: 0.1196 Explore P: 0.0104
Episode: 832 Total reward: 120.0 Training loss: 0.1573 Explore P: 0.0104
Episode: 833 Total reward: 116.0 Training loss: 0.1710 Explore P: 0.0104
Episode: 834 Total reward: 132.0 Training loss: 0.1394 Explore P: 0.0104
Episode: 835 Total reward: 129.0 Training loss: 0.1450 Explore P: 0.0104
Episode: 836 Total reward: 140.0 Training loss: 0.2251 Explore P: 0.0104
Episode: 837 Total reward: 143.0 Training loss: 0.2056 Explore P: 0.0104
Episode: 838 Total reward: 130.0 Training loss: 0.3095 Explore P: 0.0104
Episode: 839 Total reward: 153.0 Training loss: 0.2115 Explore P: 0.0104
Episode: 840 Total reward: 136.0 Training loss: 0.2406 Explore P: 0.0104
Episode: 841 Total reward: 196.0 Training loss: 0.3540 Explore P: 0.0104
Episode: 842 Total reward: 116.0 Training loss: 0.0845 Explore P: 0.0104
Episode: 843 Total reward: 183.0 Training loss: 0.1491 Explore P: 0.0104
Episode: 844 Total reward: 180.0 Training loss: 199.7549 Explore P: 0.0104
Episode: 845 Total reward: 96.0 Training loss: 0.1809 Explore P: 0.0103
Episode: 846 Total reward: 141.0 Training loss: 0.2999 Explore P: 0.0103
Episode: 847 Total reward: 112.0 Training loss: 0.0997 Explore P: 0.0103
Episode: 848 Total reward: 139.0 Training loss: 0.2845 Explore P: 0.0103
Episode: 849 Total reward: 143.0 Training loss: 0.1419 Explore P: 0.0103
Episode: 850 Total reward: 103.0 Training loss: 0.1262 Explore P: 0.0103
Episode: 851 Total reward: 117.0 Training loss: 0.2503 Explore P: 0.0103
Episode: 852 Total reward: 106.0 Training loss: 0.4424 Explore P: 0.0103
Episode: 853 Total reward: 100.0 Training loss: 0.1611 Explore P: 0.0103
Episode: 854 Total reward: 121.0 Training loss: 61.4682 Explore P: 0.0103
Episode: 855 Total reward: 130.0 Training loss: 0.1141 Explore P: 0.0103
Episode: 856 Total reward: 146.0 Training loss: 0.1216 Explore P: 0.0103
Episode: 857 Total reward: 174.0 Training loss: 0.1976 Explore P: 0.0103
Episode: 858 Total reward: 199.0 Training loss: 0.1513 Explore P: 0.0103
Episode: 859 Total reward: 199.0 Training loss: 0.8243 Explore P: 0.0103
Episode: 860 Total reward: 199.0 Training loss: 0.0980 Explore P: 0.0103
Episode: 861 Total reward: 199.0 Training loss: 0.2036 Explore P: 0.0103
Episode: 862 Total reward: 199.0 Training loss: 0.3392 Explore P: 0.0103
Episode: 863 Total reward: 153.0 Training loss: 0.1490 Explore P: 0.0103
Episode: 864 Total reward: 199.0 Training loss: 0.1160 Explore P: 0.0103
Episode: 865 Total reward: 189.0 Training loss: 0.1401 Explore P: 0.0103
Episode: 866 Total reward: 130.0 Training loss: 0.0672 Explore P: 0.0103
Episode: 867 Total reward: 199.0 Training loss: 0.0765 Explore P: 0.0102
Episode: 868 Total reward: 199.0 Training loss: 0.0908 Explore P: 0.0102
Episode: 869 Total reward: 199.0 Training loss: 0.0835 Explore P: 0.0102
Episode: 870 Total reward: 174.0 Training loss: 0.0625 Explore P: 0.0102
Episode: 871 Total reward: 199.0 Training loss: 0.0694 Explore P: 0.0102
Episode: 872 Total reward: 199.0 Training loss: 0.0884 Explore P: 0.0102
Episode: 873 Total reward: 199.0 Training loss: 0.2694 Explore P: 0.0102
Episode: 874 Total reward: 50.0 Training loss: 0.1767 Explore P: 0.0102
Episode: 875 Total reward: 199.0 Training loss: 0.1132 Explore P: 0.0102
Episode: 876 Total reward: 178.0 Training loss: 0.1502 Explore P: 0.0102
Episode: 877 Total reward: 199.0 Training loss: 0.1024 Explore P: 0.0102
Episode: 878 Total reward: 199.0 Training loss: 0.9845 Explore P: 0.0102
Episode: 879 Total reward: 199.0 Training loss: 0.0632 Explore P: 0.0102
Episode: 880 Total reward: 199.0 Training loss: 0.0219 Explore P: 0.0102
Episode: 881 Total reward: 199.0 Training loss: 0.0504 Explore P: 0.0102
Episode: 882 Total reward: 199.0 Training loss: 0.0573 Explore P: 0.0102
Episode: 883 Total reward: 199.0 Training loss: 0.0922 Explore P: 0.0102
Episode: 884 Total reward: 199.0 Training loss: 0.0403 Explore P: 0.0102
Episode: 885 Total reward: 199.0 Training loss: 0.3060 Explore P: 0.0102
Episode: 886 Total reward: 199.0 Training loss: 0.1358 Explore P: 0.0102
Episode: 887 Total reward: 199.0 Training loss: 0.0688 Explore P: 0.0102
Episode: 888 Total reward: 199.0 Training loss: 0.1460 Explore P: 0.0102
Episode: 889 Total reward: 199.0 Training loss: 0.1031 Explore P: 0.0102
Episode: 890 Total reward: 199.0 Training loss: 0.7170 Explore P: 0.0102
Episode: 891 Total reward: 199.0 Training loss: 0.1201 Explore P: 0.0102
Episode: 892 Total reward: 199.0 Training loss: 0.1615 Explore P: 0.0102
Episode: 893 Total reward: 199.0 Training loss: 0.3999 Explore P: 0.0102
Episode: 894 Total reward: 199.0 Training loss: 0.1319 Explore P: 0.0101
Episode: 895 Total reward: 199.0 Training loss: 320.1675 Explore P: 0.0101
Episode: 896 Total reward: 199.0 Training loss: 0.0939 Explore P: 0.0101
Episode: 897 Total reward: 199.0 Training loss: 0.1575 Explore P: 0.0101
Episode: 898 Total reward: 199.0 Training loss: 0.1686 Explore P: 0.0101
Episode: 899 Total reward: 199.0 Training loss: 0.1753 Explore P: 0.0101
Episode: 900 Total reward: 199.0 Training loss: 0.2628 Explore P: 0.0101
Episode: 901 Total reward: 199.0 Training loss: 0.1299 Explore P: 0.0101
Episode: 902 Total reward: 123.0 Training loss: 0.4813 Explore P: 0.0101
Episode: 903 Total reward: 199.0 Training loss: 0.2146 Explore P: 0.0101
Episode: 904 Total reward: 199.0 Training loss: 220.2995 Explore P: 0.0101
Episode: 905 Total reward: 199.0 Training loss: 0.1307 Explore P: 0.0101
Episode: 906 Total reward: 199.0 Training loss: 0.1090 Explore P: 0.0101
Episode: 907 Total reward: 199.0 Training loss: 0.0880 Explore P: 0.0101
Episode: 908 Total reward: 199.0 Training loss: 0.1433 Explore P: 0.0101
Episode: 909 Total reward: 199.0 Training loss: 0.1759 Explore P: 0.0101
Episode: 910 Total reward: 199.0 Training loss: 1.7695 Explore P: 0.0101
Episode: 911 Total reward: 199.0 Training loss: 0.3513 Explore P: 0.0101
Episode: 912 Total reward: 77.0 Training loss: 0.2121 Explore P: 0.0101
Episode: 913 Total reward: 199.0 Training loss: 0.1858 Explore P: 0.0101
Episode: 914 Total reward: 199.0 Training loss: 0.2052 Explore P: 0.0101
Episode: 915 Total reward: 199.0 Training loss: 264.3771 Explore P: 0.0101
Episode: 916 Total reward: 199.0 Training loss: 0.1744 Explore P: 0.0101
Episode: 917 Total reward: 199.0 Training loss: 0.1549 Explore P: 0.0101
Episode: 918 Total reward: 199.0 Training loss: 0.2765 Explore P: 0.0101
Episode: 919 Total reward: 199.0 Training loss: 0.1725 Explore P: 0.0101
Episode: 920 Total reward: 199.0 Training loss: 0.1275 Explore P: 0.0101
Episode: 921 Total reward: 199.0 Training loss: 0.1873 Explore P: 0.0101
Episode: 922 Total reward: 199.0 Training loss: 0.2577 Explore P: 0.0101
Episode: 923 Total reward: 199.0 Training loss: 0.1812 Explore P: 0.0101
Episode: 924 Total reward: 199.0 Training loss: 0.1742 Explore P: 0.0101
Episode: 925 Total reward: 199.0 Training loss: 0.1471 Explore P: 0.0101
Episode: 926 Total reward: 199.0 Training loss: 0.2389 Explore P: 0.0101
Episode: 927 Total reward: 199.0 Training loss: 0.1744 Explore P: 0.0101
Episode: 928 Total reward: 199.0 Training loss: 0.1252 Explore P: 0.0101
Episode: 929 Total reward: 199.0 Training loss: 159.3456 Explore P: 0.0101
Episode: 930 Total reward: 199.0 Training loss: 0.1261 Explore P: 0.0101
Episode: 931 Total reward: 199.0 Training loss: 0.2978 Explore P: 0.0101
Episode: 932 Total reward: 199.0 Training loss: 0.0945 Explore P: 0.0101
Episode: 933 Total reward: 199.0 Training loss: 0.2633 Explore P: 0.0101
Episode: 934 Total reward: 199.0 Training loss: 228.3886 Explore P: 0.0101
Episode: 935 Total reward: 199.0 Training loss: 0.2343 Explore P: 0.0101
Episode: 936 Total reward: 199.0 Training loss: 0.3117 Explore P: 0.0101
Episode: 937 Total reward: 199.0 Training loss: 0.1981 Explore P: 0.0101
Episode: 938 Total reward: 199.0 Training loss: 0.1619 Explore P: 0.0101
Episode: 939 Total reward: 199.0 Training loss: 0.2989 Explore P: 0.0101
Episode: 940 Total reward: 199.0 Training loss: 0.8366 Explore P: 0.0101
Episode: 941 Total reward: 199.0 Training loss: 0.2074 Explore P: 0.0101
Episode: 942 Total reward: 199.0 Training loss: 0.5268 Explore P: 0.0101
Episode: 943 Total reward: 199.0 Training loss: 0.3237 Explore P: 0.0101
Episode: 944 Total reward: 199.0 Training loss: 0.6197 Explore P: 0.0101
Episode: 945 Total reward: 199.0 Training loss: 0.3687 Explore P: 0.0101
Episode: 946 Total reward: 199.0 Training loss: 132.4252 Explore P: 0.0101
Episode: 947 Total reward: 169.0 Training loss: 0.9055 Explore P: 0.0101
Episode: 948 Total reward: 99.0 Training loss: 232.8694 Explore P: 0.0101
Episode: 949 Total reward: 116.0 Training loss: 0.6155 Explore P: 0.0101
Episode: 950 Total reward: 110.0 Training loss: 0.5117 Explore P: 0.0101
Episode: 951 Total reward: 99.0 Training loss: 0.3181 Explore P: 0.0101
Episode: 952 Total reward: 126.0 Training loss: 0.3050 Explore P: 0.0100
Episode: 953 Total reward: 92.0 Training loss: 0.1401 Explore P: 0.0100
Episode: 954 Total reward: 127.0 Training loss: 0.2556 Explore P: 0.0100
Episode: 955 Total reward: 140.0 Training loss: 0.2203 Explore P: 0.0100
Episode: 956 Total reward: 129.0 Training loss: 0.2118 Explore P: 0.0100
Episode: 957 Total reward: 105.0 Training loss: 0.1751 Explore P: 0.0100
Episode: 958 Total reward: 194.0 Training loss: 0.1752 Explore P: 0.0100
Episode: 959 Total reward: 199.0 Training loss: 0.2248 Explore P: 0.0100
Episode: 960 Total reward: 199.0 Training loss: 0.1390 Explore P: 0.0100
Episode: 961 Total reward: 199.0 Training loss: 0.5489 Explore P: 0.0100
Episode: 962 Total reward: 199.0 Training loss: 0.1220 Explore P: 0.0100
Episode: 963 Total reward: 199.0 Training loss: 0.1469 Explore P: 0.0100
Episode: 964 Total reward: 199.0 Training loss: 0.1967 Explore P: 0.0100
Episode: 965 Total reward: 199.0 Training loss: 0.1773 Explore P: 0.0100
Episode: 966 Total reward: 199.0 Training loss: 0.2758 Explore P: 0.0100
Episode: 967 Total reward: 199.0 Training loss: 0.2808 Explore P: 0.0100
Episode: 968 Total reward: 199.0 Training loss: 0.2220 Explore P: 0.0100
Episode: 969 Total reward: 199.0 Training loss: 0.0776 Explore P: 0.0100
Episode: 970 Total reward: 199.0 Training loss: 0.1653 Explore P: 0.0100
Episode: 971 Total reward: 199.0 Training loss: 1.9056 Explore P: 0.0100
Episode: 972 Total reward: 199.0 Training loss: 0.2045 Explore P: 0.0100
Episode: 973 Total reward: 199.0 Training loss: 0.1753 Explore P: 0.0100
Episode: 974 Total reward: 199.0 Training loss: 0.2188 Explore P: 0.0100
Episode: 975 Total reward: 199.0 Training loss: 0.1574 Explore P: 0.0100
Episode: 976 Total reward: 199.0 Training loss: 234.5040 Explore P: 0.0100
Episode: 977 Total reward: 199.0 Training loss: 0.1335 Explore P: 0.0100
Episode: 978 Total reward: 199.0 Training loss: 0.1457 Explore P: 0.0100
Episode: 979 Total reward: 199.0 Training loss: 0.1951 Explore P: 0.0100
Episode: 980 Total reward: 199.0 Training loss: 0.2660 Explore P: 0.0100
Episode: 981 Total reward: 199.0 Training loss: 0.1587 Explore P: 0.0100
Episode: 982 Total reward: 199.0 Training loss: 0.1800 Explore P: 0.0100
Episode: 983 Total reward: 199.0 Training loss: 0.1967 Explore P: 0.0100
Episode: 984 Total reward: 199.0 Training loss: 0.1127 Explore P: 0.0100
Episode: 985 Total reward: 199.0 Training loss: 3.9954 Explore P: 0.0100
Episode: 986 Total reward: 199.0 Training loss: 0.1702 Explore P: 0.0100
Episode: 987 Total reward: 199.0 Training loss: 0.2697 Explore P: 0.0100
Episode: 988 Total reward: 199.0 Training loss: 0.1316 Explore P: 0.0100
Episode: 989 Total reward: 199.0 Training loss: 5.8555 Explore P: 0.0100
Episode: 990 Total reward: 199.0 Training loss: 0.1846 Explore P: 0.0100
Episode: 991 Total reward: 199.0 Training loss: 262.4060 Explore P: 0.0100
Episode: 992 Total reward: 199.0 Training loss: 0.2670 Explore P: 0.0100
Episode: 993 Total reward: 199.0 Training loss: 216.3278 Explore P: 0.0100
Episode: 994 Total reward: 199.0 Training loss: 0.3997 Explore P: 0.0100
Episode: 995 Total reward: 199.0 Training loss: 0.3893 Explore P: 0.0100
Episode: 996 Total reward: 199.0 Training loss: 255.0965 Explore P: 0.0100
Episode: 997 Total reward: 199.0 Training loss: 0.1491 Explore P: 0.0100
Episode: 998 Total reward: 199.0 Training loss: 0.1938 Explore P: 0.0100
Episode: 999 Total reward: 199.0 Training loss: 0.2241 Explore P: 0.0100

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [17]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Average over a sliding window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
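
To see what the cumulative-sum trick in running_mean does, here's a small sketch on made-up numbers. The array demo_rewards is a hypothetical stand-in for the real reward list, not data from the training run.

```python
import numpy as np

def running_mean(x, N):
    # Same helper as in the notebook: sliding-window average via a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical rewards, chosen only to make the arithmetic easy to follow
demo_rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
demo_smoothed = running_mean(demo_rewards, 3)

print(demo_smoothed)       # [20. 30. 40.]
print(len(demo_smoothed))  # 3, i.e. len(demo_rewards) - N + 1
```

The smoothed array is N - 1 entries shorter than the input, which is why the plotting cell slices the episode numbers with eps[-len(smoothed_rews):] so both arrays line up.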

In [18]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[18]:
<matplotlib.text.Text at 0x28e31dd8>

Testing

Let's check out how our trained agent plays the game.


In [19]:
test_episodes = 10
test_max_steps = 400
# Reset the environment and keep the initial observation as the starting state
state = env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(test_episodes):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


INFO:tensorflow:Restoring parameters from checkpoints\cartpole.ckpt
[2017-06-29 11:25:11,700] Restoring parameters from checkpoints\cartpole.ckpt

In [20]:
env.close()
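
During testing the agent doesn't explore at all: it always takes the action with the largest predicted Q-value. Here's a minimal sketch of that step without TensorFlow. The function fake_q_network and its weights are made up for illustration and stand in for sess.run(mainQN.output, ...); the demo state assumes the usual Cart-Pole ordering of cart position, cart velocity, pole angle, and pole angular velocity.

```python
import numpy as np

def fake_q_network(batch_of_states):
    # Hypothetical stand-in for the trained network: a fixed linear map
    # from 4 state values to 2 Q-values (one per action)
    weights = np.array([[ 0.1, -0.1],
                        [ 0.0,  0.0],
                        [-1.0,  1.0],
                        [-0.5,  0.5]])
    return batch_of_states @ weights

# Made-up state: pole tilted slightly to one side and still rotating that way
demo_state = np.array([0.0, 0.0, 0.05, 0.1])

# The network expects a batch dimension, so (4,) becomes (1, 4)
batch = demo_state.reshape((1, *demo_state.shape))
demo_Qs = fake_q_network(batch)        # shape (1, 2)

# Greedy choice: index of the largest Q-value
demo_action = int(np.argmax(demo_Qs))
print(batch.shape, demo_Qs.shape, demo_action)
```

Since there is only one row in the batch, np.argmax over the whole array gives the same index as taking the argmax along the action axis.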

Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small vector of state values like we're using here, though, you'd feed in the screen images and use convolutional layers to extract the state from them.

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.
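
If you take on the challenge, it helps to work out the layer shapes before writing any TensorFlow. Here's a rough sketch of that arithmetic, assuming 84x84 preprocessed frames and the three convolutional layers I recall from the paper (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, with no padding). Treat these settings as assumptions and check them against the paper; conv_output_size is just a helper I made up for this example.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial size after a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1

# Assumed (kernel, stride, filters) for each convolutional layer
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size = 84  # assumed width and height of a preprocessed frame
sizes = []
for kernel, stride, filters in layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)
    print("{0}x{0}x{1}".format(size, filters))

# Number of features handed to the fully connected layers
flat_features = size * size * layers[-1][2]
print(flat_features)  # 3136
```

The flattened features would then feed fully connected layers that output one Q-value per action, just like the final layer of the network in this notebook.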