Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-08-08 20:20:41,583] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This works the same way for all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [4]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor for future rewards, and $s'$ is the next state from state $s$ and action $a$.

Previously, we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating-point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that will approximate the Q-table lookup function.
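
To get a feel for why a table is impractical, here is a rough count of how many entries a Q-table would need if each of the four state values were split into a fixed number of bins. The bin counts and the helper q_table_entries are assumptions chosen for illustration; nothing in this notebook actually discretizes the state.

```python
# Rough size of a Q-table for Cart-Pole if the state were discretized.
# The bin counts are illustrative assumptions, not values used in this notebook.
state_size = 4    # cart position, cart velocity, pole position, pole velocity
action_size = 2   # move the cart left or right


def q_table_entries(bins_per_value):
    """Number of Q-values in a table with `bins_per_value` bins per state value."""
    return (bins_per_value ** state_size) * action_size


for bins in (10, 100, 1000):
    print(bins, q_table_entries(bins))
```

The table grows with the fourth power of the resolution, so even a coarse grid quickly becomes too large to fill in by experience.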

Now, our Q value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
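
As a quick numeric sketch of the target and loss above, here is the calculation for a single transition. All the numbers are made up, and td_target is a hypothetical helper rather than part of this notebook's code. When the episode has ended there is no next state, so the target is just the reward, as in the training algorithm listed later in this notebook.

```python
# Worked example of the training target and squared-error loss for one transition.
# All numbers below are made up for illustration.
gamma = 0.99


def td_target(reward, next_q_values, gamma, episode_ended):
    """Return r + gamma * max_a' Q(s', a'), or just r if the episode ended."""
    if episode_ended:
        return reward
    return reward + gamma * max(next_q_values)


q_sa = 1.5                    # current estimate Q(s, a) for the action taken
next_q_values = [2.0, 3.0]    # Q(s', a') for the two actions in the next state

target = td_target(1.0, next_q_values, gamma, episode_ended=False)
loss = (target - q_sa) ** 2
print(target, loss)
```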

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seem to be good enough, though three might be better. Feel free to try it out.


In [5]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
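
The line that computes self.Q multiplies each row of outputs by a one-hot mask and sums across the row, which keeps only the Q-value of the action that was taken. Here is the same idea in plain Python, without TensorFlow. The helpers one_hot and select_q and the example numbers are assumptions for illustration only.

```python
# Pure-Python illustration of selecting one Q-value per row with a one-hot mask.
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def select_q(outputs, actions, action_size=2):
    """For each row of Q-values, keep only the entry for the action taken."""
    selected = []
    for row, action in zip(outputs, actions):
        mask = one_hot(action, action_size)
        # Multiply elementwise, then sum: every entry except the chosen one is zeroed.
        selected.append(sum(q * m for q, m in zip(row, mask)))
    return selected


outputs = [[0.2, 0.7], [1.5, -0.3], [0.0, 4.0]]  # made-up network outputs
actions = [1, 0, 1]                                # action taken in each row
print(select_q(outputs, actions))
```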

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [6]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
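
To see the eviction behavior of a bounded deque on its own, here is a small sketch using only the standard library. The string experiences are placeholders, and random.sample stands in for np.random.choice so the example has no dependencies; both are assumptions for illustration.

```python
import random
from collections import deque

# A tiny buffer with room for three experiences.
buffer = deque(maxlen=3)
for experience in ["e1", "e2", "e3", "e4", "e5"]:
    buffer.append(experience)  # once full, the oldest item is dropped from the left

print(list(buffer))  # only the three most recent experiences remain

# Draw a mini-batch without replacement (stdlib stand-in for np.random.choice).
batch = random.sample(list(buffer), 2)
print(batch)
```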

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
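
Here is a minimal sketch of an $\epsilon$-greedy choice together with an exponentially decaying exploration probability. The decay formula and the three parameter values are the ones used in the training code later in this notebook; the function names explore_probability and epsilon_greedy are hypothetical, and plain Python lists stand in for the network's output.

```python
import math
import random

# Values match the hyperparameters used later in this notebook.
explore_start = 1.0
explore_stop = 0.01
decay_rate = 0.0001


def explore_probability(step):
    """Exponentially decay epsilon from explore_start toward explore_stop."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


for step in (0, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

With these settings the agent starts out acting almost entirely at random, and the exploration probability approaches explore_stop as the step count grows.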

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
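
The steps above can be sketched end to end on a toy problem. This is only an illustration under several assumptions: the corridor environment step_env is made up, a dictionary of Q-values stands in for the neural network, the "gradient step" is a plain move of size alpha toward the target, and each stored transition carries an explicit done flag. The real training code in this notebook uses Cart-Pole and the QNetwork class instead.

```python
import random
from collections import deque


# Toy stand-in environment (hypothetical): a corridor with cells 0..4.
# Action 0 moves left, action 1 moves right. Reaching cell 4 gives reward 1
# and ends the episode; reaching cell 0 ends it with reward 0.
def step_env(state, action):
    next_state = state + (1 if action == 1 else -1)
    done = next_state in (0, 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, done


def train(episodes=500, max_steps=20, gamma=0.9, alpha=0.1, epsilon=0.3,
          batch_size=16, memory_size=200, seed=0):
    rng = random.Random(seed)
    memory = deque(maxlen=memory_size)                    # initialize the memory D
    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}  # table in place of a network
    for _ in range(episodes):
        state = 2  # every episode starts in the middle of the corridor
        for _ in range(max_steps):
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max((0, 1), key=lambda a: Q[(state, a)])
            next_state, reward, done = step_env(state, action)
            memory.append((state, action, reward, next_state, done))
            # Sample a mini-batch and move each Q-value toward its target
            batch = rng.sample(list(memory), min(batch_size, len(memory)))
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            state = next_state
    return Q


Q = train()
for key in sorted(Q):
    print(key, round(Q[key], 3))
```

Because the updates only use stored transitions, the table can keep learning from a rewarding experience long after it happened, which is the point of the replay memory.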

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [7]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory

In [8]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [9]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [10]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 9.0 Training loss: 1.0213 Explore P: 0.9991
Episode: 2 Total reward: 42.0 Training loss: 1.0462 Explore P: 0.9950
Episode: 3 Total reward: 18.0 Training loss: 1.0663 Explore P: 0.9932
Episode: 4 Total reward: 18.0 Training loss: 1.0639 Explore P: 0.9914
Episode: 5 Total reward: 11.0 Training loss: 1.0908 Explore P: 0.9903
Episode: 6 Total reward: 13.0 Training loss: 1.0772 Explore P: 0.9891
Episode: 7 Total reward: 22.0 Training loss: 1.0283 Explore P: 0.9869
Episode: 8 Total reward: 23.0 Training loss: 1.1242 Explore P: 0.9847
Episode: 9 Total reward: 24.0 Training loss: 1.0954 Explore P: 0.9823
Episode: 10 Total reward: 14.0 Training loss: 1.1143 Explore P: 0.9810
Episode: 11 Total reward: 28.0 Training loss: 1.0389 Explore P: 0.9783
Episode: 12 Total reward: 20.0 Training loss: 1.0970 Explore P: 0.9763
Episode: 13 Total reward: 26.0 Training loss: 1.4019 Explore P: 0.9738
Episode: 14 Total reward: 18.0 Training loss: 1.1973 Explore P: 0.9721
Episode: 15 Total reward: 15.0 Training loss: 1.3387 Explore P: 0.9706
Episode: 16 Total reward: 22.0 Training loss: 1.2754 Explore P: 0.9685
Episode: 17 Total reward: 12.0 Training loss: 1.2746 Explore P: 0.9674
Episode: 18 Total reward: 17.0 Training loss: 1.3871 Explore P: 0.9658
Episode: 19 Total reward: 22.0 Training loss: 1.4419 Explore P: 0.9637
Episode: 20 Total reward: 12.0 Training loss: 1.1752 Explore P: 0.9625
Episode: 21 Total reward: 31.0 Training loss: 1.2682 Explore P: 0.9596
Episode: 22 Total reward: 11.0 Training loss: 1.7031 Explore P: 0.9585
Episode: 23 Total reward: 10.0 Training loss: 1.6748 Explore P: 0.9576
Episode: 24 Total reward: 25.0 Training loss: 2.4156 Explore P: 0.9552
Episode: 25 Total reward: 22.0 Training loss: 1.6265 Explore P: 0.9531
Episode: 26 Total reward: 23.0 Training loss: 2.4661 Explore P: 0.9510
Episode: 27 Total reward: 21.0 Training loss: 1.9465 Explore P: 0.9490
Episode: 28 Total reward: 40.0 Training loss: 2.8570 Explore P: 0.9452
Episode: 29 Total reward: 15.0 Training loss: 3.8832 Explore P: 0.9438
Episode: 30 Total reward: 17.0 Training loss: 2.8097 Explore P: 0.9423
Episode: 31 Total reward: 9.0 Training loss: 2.5001 Explore P: 0.9414
Episode: 32 Total reward: 29.0 Training loss: 2.7355 Explore P: 0.9387
Episode: 33 Total reward: 16.0 Training loss: 3.1535 Explore P: 0.9372
Episode: 34 Total reward: 10.0 Training loss: 5.3600 Explore P: 0.9363
Episode: 35 Total reward: 17.0 Training loss: 3.6126 Explore P: 0.9347
Episode: 36 Total reward: 12.0 Training loss: 5.5356 Explore P: 0.9336
Episode: 37 Total reward: 17.0 Training loss: 6.3887 Explore P: 0.9321
Episode: 38 Total reward: 14.0 Training loss: 8.8805 Explore P: 0.9308
Episode: 39 Total reward: 17.0 Training loss: 11.0485 Explore P: 0.9292
Episode: 40 Total reward: 14.0 Training loss: 7.2107 Explore P: 0.9279
Episode: 41 Total reward: 23.0 Training loss: 7.6767 Explore P: 0.9258
Episode: 42 Total reward: 10.0 Training loss: 17.1651 Explore P: 0.9249
Episode: 43 Total reward: 15.0 Training loss: 4.5252 Explore P: 0.9235
Episode: 44 Total reward: 20.0 Training loss: 8.9291 Explore P: 0.9217
Episode: 45 Total reward: 10.0 Training loss: 17.7205 Explore P: 0.9208
Episode: 46 Total reward: 14.0 Training loss: 17.6370 Explore P: 0.9195
Episode: 47 Total reward: 13.0 Training loss: 6.9129 Explore P: 0.9183
Episode: 48 Total reward: 10.0 Training loss: 10.1103 Explore P: 0.9174
Episode: 49 Total reward: 23.0 Training loss: 9.7933 Explore P: 0.9153
Episode: 50 Total reward: 20.0 Training loss: 13.6573 Explore P: 0.9135
Episode: 51 Total reward: 13.0 Training loss: 6.1240 Explore P: 0.9124
Episode: 52 Total reward: 17.0 Training loss: 10.8801 Explore P: 0.9108
Episode: 53 Total reward: 13.0 Training loss: 10.2000 Explore P: 0.9096
Episode: 54 Total reward: 9.0 Training loss: 9.1823 Explore P: 0.9088
Episode: 55 Total reward: 12.0 Training loss: 31.6100 Explore P: 0.9078
Episode: 56 Total reward: 17.0 Training loss: 15.0891 Explore P: 0.9062
Episode: 57 Total reward: 29.0 Training loss: 12.1134 Explore P: 0.9036
Episode: 58 Total reward: 16.0 Training loss: 9.7615 Explore P: 0.9022
Episode: 59 Total reward: 17.0 Training loss: 16.4138 Explore P: 0.9007
Episode: 60 Total reward: 10.0 Training loss: 21.2183 Explore P: 0.8998
Episode: 61 Total reward: 17.0 Training loss: 7.1111 Explore P: 0.8983
Episode: 62 Total reward: 14.0 Training loss: 13.1446 Explore P: 0.8971
Episode: 63 Total reward: 29.0 Training loss: 7.1689 Explore P: 0.8945
Episode: 64 Total reward: 10.0 Training loss: 8.7993 Explore P: 0.8936
Episode: 65 Total reward: 15.0 Training loss: 17.3229 Explore P: 0.8923
Episode: 66 Total reward: 21.0 Training loss: 11.4104 Explore P: 0.8904
Episode: 67 Total reward: 10.0 Training loss: 9.1334 Explore P: 0.8895
Episode: 68 Total reward: 16.0 Training loss: 8.7066 Explore P: 0.8881
Episode: 69 Total reward: 17.0 Training loss: 30.4588 Explore P: 0.8866
Episode: 70 Total reward: 30.0 Training loss: 7.2074 Explore P: 0.8840
Episode: 71 Total reward: 17.0 Training loss: 9.3952 Explore P: 0.8825
Episode: 72 Total reward: 9.0 Training loss: 8.6917 Explore P: 0.8818
Episode: 73 Total reward: 13.0 Training loss: 9.8182 Explore P: 0.8806
Episode: 74 Total reward: 21.0 Training loss: 56.4186 Explore P: 0.8788
Episode: 75 Total reward: 25.0 Training loss: 29.0452 Explore P: 0.8766
Episode: 76 Total reward: 16.0 Training loss: 48.0451 Explore P: 0.8752
Episode: 77 Total reward: 17.0 Training loss: 50.6114 Explore P: 0.8738
Episode: 78 Total reward: 19.0 Training loss: 61.4626 Explore P: 0.8721
Episode: 79 Total reward: 22.0 Training loss: 87.8472 Explore P: 0.8702
Episode: 80 Total reward: 10.0 Training loss: 55.9940 Explore P: 0.8694
Episode: 81 Total reward: 12.0 Training loss: 6.5313 Explore P: 0.8683
Episode: 82 Total reward: 10.0 Training loss: 74.5093 Explore P: 0.8675
Episode: 83 Total reward: 12.0 Training loss: 133.3922 Explore P: 0.8665
Episode: 84 Total reward: 11.0 Training loss: 22.5061 Explore P: 0.8655
Episode: 85 Total reward: 32.0 Training loss: 13.4804 Explore P: 0.8628
Episode: 86 Total reward: 40.0 Training loss: 28.7645 Explore P: 0.8594
Episode: 87 Total reward: 35.0 Training loss: 92.1092 Explore P: 0.8564
Episode: 88 Total reward: 31.0 Training loss: 18.5484 Explore P: 0.8538
Episode: 89 Total reward: 14.0 Training loss: 10.7450 Explore P: 0.8526
Episode: 90 Total reward: 12.0 Training loss: 73.0632 Explore P: 0.8516
Episode: 91 Total reward: 21.0 Training loss: 116.7614 Explore P: 0.8498
Episode: 92 Total reward: 24.0 Training loss: 108.9655 Explore P: 0.8478
Episode: 93 Total reward: 12.0 Training loss: 8.7936 Explore P: 0.8468
Episode: 94 Total reward: 13.0 Training loss: 10.7604 Explore P: 0.8457
Episode: 95 Total reward: 15.0 Training loss: 115.7704 Explore P: 0.8445
Episode: 96 Total reward: 20.0 Training loss: 22.0523 Explore P: 0.8428
Episode: 97 Total reward: 13.0 Training loss: 7.8736 Explore P: 0.8417
Episode: 98 Total reward: 19.0 Training loss: 8.0163 Explore P: 0.8401
Episode: 99 Total reward: 12.0 Training loss: 56.4234 Explore P: 0.8392
Episode: 100 Total reward: 8.0 Training loss: 9.9508 Explore P: 0.8385
Episode: 101 Total reward: 8.0 Training loss: 54.9704 Explore P: 0.8378
Episode: 102 Total reward: 11.0 Training loss: 8.2067 Explore P: 0.8369
Episode: 103 Total reward: 14.0 Training loss: 81.8997 Explore P: 0.8358
Episode: 104 Total reward: 27.0 Training loss: 24.0916 Explore P: 0.8335
Episode: 105 Total reward: 11.0 Training loss: 93.5943 Explore P: 0.8326
Episode: 106 Total reward: 10.0 Training loss: 9.1904 Explore P: 0.8318
Episode: 107 Total reward: 60.0 Training loss: 100.1058 Explore P: 0.8269
Episode: 108 Total reward: 23.0 Training loss: 31.6099 Explore P: 0.8250
Episode: 109 Total reward: 16.0 Training loss: 98.9828 Explore P: 0.8237
Episode: 110 Total reward: 30.0 Training loss: 165.9038 Explore P: 0.8213
Episode: 111 Total reward: 18.0 Training loss: 85.7681 Explore P: 0.8198
Episode: 112 Total reward: 25.0 Training loss: 81.5898 Explore P: 0.8178
Episode: 113 Total reward: 25.0 Training loss: 9.7303 Explore P: 0.8158
Episode: 114 Total reward: 24.0 Training loss: 179.6773 Explore P: 0.8138
Episode: 115 Total reward: 22.0 Training loss: 92.9169 Explore P: 0.8121
Episode: 116 Total reward: 11.0 Training loss: 6.9589 Explore P: 0.8112
Episode: 117 Total reward: 14.0 Training loss: 35.0662 Explore P: 0.8101
Episode: 118 Total reward: 22.0 Training loss: 53.0495 Explore P: 0.8083
Episode: 119 Total reward: 9.0 Training loss: 8.2248 Explore P: 0.8076
Episode: 120 Total reward: 22.0 Training loss: 245.4856 Explore P: 0.8058
Episode: 121 Total reward: 13.0 Training loss: 95.9616 Explore P: 0.8048
Episode: 122 Total reward: 15.0 Training loss: 6.0875 Explore P: 0.8036
Episode: 123 Total reward: 27.0 Training loss: 7.3751 Explore P: 0.8015
Episode: 124 Total reward: 12.0 Training loss: 6.3912 Explore P: 0.8005
Episode: 125 Total reward: 18.0 Training loss: 8.9212 Explore P: 0.7991
Episode: 126 Total reward: 15.0 Training loss: 104.8424 Explore P: 0.7979
Episode: 127 Total reward: 18.0 Training loss: 6.8026 Explore P: 0.7965
Episode: 128 Total reward: 9.0 Training loss: 210.4043 Explore P: 0.7958
Episode: 129 Total reward: 9.0 Training loss: 78.9080 Explore P: 0.7951
Episode: 130 Total reward: 29.0 Training loss: 100.0153 Explore P: 0.7928
Episode: 131 Total reward: 34.0 Training loss: 93.2208 Explore P: 0.7902
Episode: 132 Total reward: 16.0 Training loss: 4.9632 Explore P: 0.7889
Episode: 133 Total reward: 15.0 Training loss: 140.7440 Explore P: 0.7877
Episode: 134 Total reward: 13.0 Training loss: 208.9753 Explore P: 0.7867
Episode: 135 Total reward: 22.0 Training loss: 4.6131 Explore P: 0.7850
Episode: 136 Total reward: 13.0 Training loss: 5.4704 Explore P: 0.7840
Episode: 137 Total reward: 13.0 Training loss: 4.6778 Explore P: 0.7830
Episode: 138 Total reward: 26.0 Training loss: 111.4746 Explore P: 0.7810
Episode: 139 Total reward: 13.0 Training loss: 4.1087 Explore P: 0.7800
Episode: 140 Total reward: 24.0 Training loss: 4.3271 Explore P: 0.7782
Episode: 141 Total reward: 16.0 Training loss: 3.6857 Explore P: 0.7769
Episode: 142 Total reward: 12.0 Training loss: 57.4644 Explore P: 0.7760
Episode: 143 Total reward: 11.0 Training loss: 276.2133 Explore P: 0.7752
Episode: 144 Total reward: 8.0 Training loss: 158.1288 Explore P: 0.7746
Episode: 145 Total reward: 18.0 Training loss: 78.9270 Explore P: 0.7732
Episode: 146 Total reward: 15.0 Training loss: 45.6791 Explore P: 0.7720
Episode: 147 Total reward: 15.0 Training loss: 47.3358 Explore P: 0.7709
Episode: 148 Total reward: 18.0 Training loss: 4.5098 Explore P: 0.7695
Episode: 149 Total reward: 12.0 Training loss: 111.1924 Explore P: 0.7686
Episode: 150 Total reward: 12.0 Training loss: 3.5502 Explore P: 0.7677
Episode: 151 Total reward: 25.0 Training loss: 42.2088 Explore P: 0.7658
Episode: 152 Total reward: 24.0 Training loss: 3.1924 Explore P: 0.7640
Episode: 153 Total reward: 14.0 Training loss: 47.4263 Explore P: 0.7630
Episode: 154 Total reward: 20.0 Training loss: 4.5129 Explore P: 0.7615
Episode: 155 Total reward: 21.0 Training loss: 43.9559 Explore P: 0.7599
Episode: 156 Total reward: 11.0 Training loss: 151.6761 Explore P: 0.7590
Episode: 157 Total reward: 20.0 Training loss: 151.0217 Explore P: 0.7576
Episode: 158 Total reward: 17.0 Training loss: 72.0212 Explore P: 0.7563
Episode: 159 Total reward: 14.0 Training loss: 169.2041 Explore P: 0.7552
Episode: 160 Total reward: 13.0 Training loss: 3.3487 Explore P: 0.7543
Episode: 161 Total reward: 17.0 Training loss: 95.5968 Explore P: 0.7530
Episode: 162 Total reward: 43.0 Training loss: 40.2444 Explore P: 0.7498
Episode: 163 Total reward: 22.0 Training loss: 2.4202 Explore P: 0.7482
Episode: 164 Total reward: 16.0 Training loss: 39.3733 Explore P: 0.7470
Episode: 165 Total reward: 12.0 Training loss: 2.7904 Explore P: 0.7461
Episode: 166 Total reward: 12.0 Training loss: 43.7495 Explore P: 0.7452
Episode: 167 Total reward: 11.0 Training loss: 120.0308 Explore P: 0.7444
Episode: 168 Total reward: 39.0 Training loss: 92.9598 Explore P: 0.7416
Episode: 169 Total reward: 20.0 Training loss: 76.9398 Explore P: 0.7401
Episode: 170 Total reward: 11.0 Training loss: 86.9598 Explore P: 0.7393
Episode: 171 Total reward: 12.0 Training loss: 117.8410 Explore P: 0.7384
Episode: 172 Total reward: 25.0 Training loss: 47.4724 Explore P: 0.7366
Episode: 173 Total reward: 10.0 Training loss: 38.0369 Explore P: 0.7359
Episode: 174 Total reward: 12.0 Training loss: 45.5828 Explore P: 0.7350
Episode: 175 Total reward: 13.0 Training loss: 37.4016 Explore P: 0.7341
Episode: 176 Total reward: 15.0 Training loss: 4.1665 Explore P: 0.7330
Episode: 177 Total reward: 8.0 Training loss: 2.7195 Explore P: 0.7324
Episode: 178 Total reward: 9.0 Training loss: 73.8616 Explore P: 0.7318
Episode: 179 Total reward: 17.0 Training loss: 3.3893 Explore P: 0.7305
Episode: 180 Total reward: 14.0 Training loss: 46.9425 Explore P: 0.7295
Episode: 181 Total reward: 12.0 Training loss: 84.2420 Explore P: 0.7287
Episode: 182 Total reward: 19.0 Training loss: 1.7944 Explore P: 0.7273
Episode: 183 Total reward: 15.0 Training loss: 3.0457 Explore P: 0.7262
Episode: 184 Total reward: 22.0 Training loss: 2.5001 Explore P: 0.7247
Episode: 185 Total reward: 11.0 Training loss: 39.4205 Explore P: 0.7239
Episode: 186 Total reward: 34.0 Training loss: 69.3192 Explore P: 0.7214
Episode: 187 Total reward: 10.0 Training loss: 94.1649 Explore P: 0.7207
Episode: 188 Total reward: 14.0 Training loss: 35.3777 Explore P: 0.7197
Episode: 189 Total reward: 20.0 Training loss: 2.5004 Explore P: 0.7183
Episode: 190 Total reward: 11.0 Training loss: 38.9853 Explore P: 0.7175
Episode: 191 Total reward: 22.0 Training loss: 32.7458 Explore P: 0.7160
Episode: 192 Total reward: 11.0 Training loss: 35.4300 Explore P: 0.7152
Episode: 193 Total reward: 15.0 Training loss: 2.4246 Explore P: 0.7142
Episode: 194 Total reward: 8.0 Training loss: 2.9059 Explore P: 0.7136
Episode: 195 Total reward: 13.0 Training loss: 2.5166 Explore P: 0.7127
Episode: 196 Total reward: 24.0 Training loss: 75.0618 Explore P: 0.7110
Episode: 197 Total reward: 33.0 Training loss: 1.9749 Explore P: 0.7087
Episode: 198 Total reward: 8.0 Training loss: 37.2755 Explore P: 0.7081
Episode: 199 Total reward: 17.0 Training loss: 79.9445 Explore P: 0.7069
Episode: 200 Total reward: 11.0 Training loss: 65.1042 Explore P: 0.7062
Episode: 201 Total reward: 8.0 Training loss: 36.3158 Explore P: 0.7056
Episode: 202 Total reward: 12.0 Training loss: 1.8264 Explore P: 0.7048
Episode: 203 Total reward: 9.0 Training loss: 77.0711 Explore P: 0.7042
Episode: 204 Total reward: 8.0 Training loss: 77.1708 Explore P: 0.7036
Episode: 205 Total reward: 8.0 Training loss: 1.6654 Explore P: 0.7031
Episode: 206 Total reward: 20.0 Training loss: 110.8669 Explore P: 0.7017
Episode: 207 Total reward: 13.0 Training loss: 73.2363 Explore P: 0.7008
Episode: 208 Total reward: 14.0 Training loss: 85.5946 Explore P: 0.6998
Episode: 209 Total reward: 10.0 Training loss: 30.0577 Explore P: 0.6991
Episode: 210 Total reward: 8.0 Training loss: 1.5690 Explore P: 0.6986
Episode: 211 Total reward: 11.0 Training loss: 33.1479 Explore P: 0.6978
Episode: 212 Total reward: 14.0 Training loss: 34.9453 Explore P: 0.6968
Episode: 213 Total reward: 27.0 Training loss: 1.2141 Explore P: 0.6950
Episode: 214 Total reward: 18.0 Training loss: 2.2855 Explore P: 0.6938
Episode: 215 Total reward: 26.0 Training loss: 35.4355 Explore P: 0.6920
Episode: 216 Total reward: 17.0 Training loss: 1.7180 Explore P: 0.6908
Episode: 217 Total reward: 11.0 Training loss: 51.8860 Explore P: 0.6901
Episode: 218 Total reward: 16.0 Training loss: 96.4230 Explore P: 0.6890
Episode: 219 Total reward: 9.0 Training loss: 34.0729 Explore P: 0.6884
Episode: 220 Total reward: 16.0 Training loss: 38.3987 Explore P: 0.6873
Episode: 221 Total reward: 8.0 Training loss: 24.5045 Explore P: 0.6868
Episode: 222 Total reward: 14.0 Training loss: 27.8259 Explore P: 0.6858
Episode: 223 Total reward: 9.0 Training loss: 53.1513 Explore P: 0.6852
Episode: 224 Total reward: 15.0 Training loss: 1.7310 Explore P: 0.6842
Episode: 225 Total reward: 13.0 Training loss: 1.0203 Explore P: 0.6833
Episode: 226 Total reward: 14.0 Training loss: 27.1834 Explore P: 0.6824
Episode: 227 Total reward: 13.0 Training loss: 1.1810 Explore P: 0.6815
Episode: 228 Total reward: 21.0 Training loss: 50.0466 Explore P: 0.6801
Episode: 229 Total reward: 9.0 Training loss: 73.1316 Explore P: 0.6795
Episode: 230 Total reward: 10.0 Training loss: 1.1150 Explore P: 0.6788
Episode: 231 Total reward: 8.0 Training loss: 25.5417 Explore P: 0.6783
Episode: 232 Total reward: 10.0 Training loss: 25.3364 Explore P: 0.6776
Episode: 233 Total reward: 16.0 Training loss: 1.1479 Explore P: 0.6765
Episode: 234 Total reward: 11.0 Training loss: 34.5086 Explore P: 0.6758
Episode: 235 Total reward: 8.0 Training loss: 0.9323 Explore P: 0.6753
Episode: 236 Total reward: 11.0 Training loss: 23.9834 Explore P: 0.6745
Episode: 237 Total reward: 16.0 Training loss: 29.7012 Explore P: 0.6735
Episode: 238 Total reward: 14.0 Training loss: 1.5700 Explore P: 0.6726
Episode: 239 Total reward: 11.0 Training loss: 41.9967 Explore P: 0.6718
Episode: 240 Total reward: 8.0 Training loss: 63.4866 Explore P: 0.6713
Episode: 241 Total reward: 9.0 Training loss: 23.9622 Explore P: 0.6707
Episode: 242 Total reward: 13.0 Training loss: 51.9601 Explore P: 0.6698
Episode: 243 Total reward: 15.0 Training loss: 51.1484 Explore P: 0.6689
Episode: 244 Total reward: 24.0 Training loss: 32.9782 Explore P: 0.6673
Episode: 245 Total reward: 12.0 Training loss: 33.5716 Explore P: 0.6665
Episode: 246 Total reward: 9.0 Training loss: 50.0363 Explore P: 0.6659
Episode: 247 Total reward: 49.0 Training loss: 18.1900 Explore P: 0.6627
Episode: 248 Total reward: 13.0 Training loss: 43.7224 Explore P: 0.6618
Episode: 249 Total reward: 24.0 Training loss: 18.6455 Explore P: 0.6603
Episode: 250 Total reward: 10.0 Training loss: 0.9494 Explore P: 0.6596
Episode: 251 Total reward: 14.0 Training loss: 49.6483 Explore P: 0.6587
Episode: 252 Total reward: 17.0 Training loss: 15.6961 Explore P: 0.6576
Episode: 253 Total reward: 11.0 Training loss: 1.1713 Explore P: 0.6569
Episode: 254 Total reward: 20.0 Training loss: 62.6994 Explore P: 0.6556
Episode: 255 Total reward: 21.0 Training loss: 67.7432 Explore P: 0.6543
Episode: 256 Total reward: 9.0 Training loss: 0.9462 Explore P: 0.6537
Episode: 257 Total reward: 14.0 Training loss: 46.4868 Explore P: 0.6528
Episode: 258 Total reward: 11.0 Training loss: 13.9388 Explore P: 0.6521
Episode: 259 Total reward: 17.0 Training loss: 69.8468 Explore P: 0.6510
Episode: 260 Total reward: 8.0 Training loss: 33.0711 Explore P: 0.6505
Episode: 261 Total reward: 11.0 Training loss: 0.9038 Explore P: 0.6498
Episode: 262 Total reward: 11.0 Training loss: 24.0823 Explore P: 0.6491
Episode: 263 Total reward: 8.0 Training loss: 27.8313 Explore P: 0.6486
Episode: 264 Total reward: 15.0 Training loss: 1.3762 Explore P: 0.6476
Episode: 265 Total reward: 12.0 Training loss: 0.6693 Explore P: 0.6468
Episode: 266 Total reward: 14.0 Training loss: 13.1565 Explore P: 0.6459
Episode: 267 Total reward: 18.0 Training loss: 0.9400 Explore P: 0.6448
Episode: 268 Total reward: 10.0 Training loss: 15.9995 Explore P: 0.6442
Episode: 269 Total reward: 12.0 Training loss: 0.5891 Explore P: 0.6434
Episode: 270 Total reward: 20.0 Training loss: 1.2183 Explore P: 0.6421
Episode: 271 Total reward: 12.0 Training loss: 35.3922 Explore P: 0.6414
Episode: 272 Total reward: 15.0 Training loss: 0.7438 Explore P: 0.6404
Episode: 273 Total reward: 17.0 Training loss: 69.8946 Explore P: 0.6394
Episode: 274 Total reward: 17.0 Training loss: 36.0503 Explore P: 0.6383
Episode: 275 Total reward: 20.0 Training loss: 1.0755 Explore P: 0.6370
Episode: 276 Total reward: 16.0 Training loss: 0.6874 Explore P: 0.6360
Episode: 277 Total reward: 21.0 Training loss: 15.4289 Explore P: 0.6347
Episode: 278 Total reward: 20.0 Training loss: 22.6728 Explore P: 0.6335
Episode: 279 Total reward: 11.0 Training loss: 1.1646 Explore P: 0.6328
Episode: 280 Total reward: 18.0 Training loss: 11.1088 Explore P: 0.6317
Episode: 281 Total reward: 13.0 Training loss: 43.7884 Explore P: 0.6309
Episode: 282 Total reward: 13.0 Training loss: 32.7572 Explore P: 0.6301
Episode: 283 Total reward: 20.0 Training loss: 0.7229 Explore P: 0.6288
Episode: 284 Total reward: 11.0 Training loss: 1.0037 Explore P: 0.6281
Episode: 285 Total reward: 14.0 Training loss: 11.3611 Explore P: 0.6273
Episode: 286 Total reward: 9.0 Training loss: 11.0738 Explore P: 0.6267
Episode: 287 Total reward: 11.0 Training loss: 48.9181 Explore P: 0.6260
Episode: 288 Total reward: 15.0 Training loss: 23.2071 Explore P: 0.6251
Episode: 289 Total reward: 11.0 Training loss: 0.7308 Explore P: 0.6244
Episode: 290 Total reward: 10.0 Training loss: 32.3520 Explore P: 0.6238
Episode: 291 Total reward: 15.0 Training loss: 12.3822 Explore P: 0.6229
Episode: 292 Total reward: 13.0 Training loss: 9.9153 Explore P: 0.6221
Episode: 293 Total reward: 16.0 Training loss: 11.9597 Explore P: 0.6211
Episode: 294 Total reward: 13.0 Training loss: 11.5571 Explore P: 0.6203
Episode: 295 Total reward: 13.0 Training loss: 9.1850 Explore P: 0.6195
Episode: 296 Total reward: 9.0 Training loss: 0.7150 Explore P: 0.6190
Episode: 297 Total reward: 13.0 Training loss: 17.2703 Explore P: 0.6182
Episode: 298 Total reward: 20.0 Training loss: 10.3111 Explore P: 0.6170
Episode: 299 Total reward: 12.0 Training loss: 27.9011 Explore P: 0.6163
Episode: 300 Total reward: 15.0 Training loss: 0.7382 Explore P: 0.6153
Episode: 301 Total reward: 25.0 Training loss: 67.6087 Explore P: 0.6138
Episode: 302 Total reward: 14.0 Training loss: 1.3227 Explore P: 0.6130
Episode: 303 Total reward: 23.0 Training loss: 1.2684 Explore P: 0.6116
Episode: 304 Total reward: 33.0 Training loss: 16.6697 Explore P: 0.6096
Episode: 305 Total reward: 14.0 Training loss: 40.2493 Explore P: 0.6088
Episode: 306 Total reward: 11.0 Training loss: 0.9841 Explore P: 0.6081
Episode: 307 Total reward: 12.0 Training loss: 18.1845 Explore P: 0.6074
Episode: 308 Total reward: 20.0 Training loss: 1.1323 Explore P: 0.6062
Episode: 309 Total reward: 36.0 Training loss: 1.4183 Explore P: 0.6041
Episode: 310 Total reward: 80.0 Training loss: 11.8311 Explore P: 0.5993
Episode: 311 Total reward: 52.0 Training loss: 9.4334 Explore P: 0.5963
Episode: 312 Total reward: 44.0 Training loss: 8.7087 Explore P: 0.5937
Episode: 313 Total reward: 12.0 Training loss: 6.8004 Explore P: 0.5930
Episode: 314 Total reward: 38.0 Training loss: 12.9973 Explore P: 0.5908
Episode: 315 Total reward: 60.0 Training loss: 22.4506 Explore P: 0.5873
Episode: 316 Total reward: 12.0 Training loss: 0.8059 Explore P: 0.5866
Episode: 317 Total reward: 21.0 Training loss: 1.7212 Explore P: 0.5854
Episode: 318 Total reward: 22.0 Training loss: 22.9411 Explore P: 0.5842
Episode: 319 Total reward: 40.0 Training loss: 1.1953 Explore P: 0.5819
Episode: 320 Total reward: 82.0 Training loss: 1.0735 Explore P: 0.5772
Episode: 321 Total reward: 23.0 Training loss: 30.0910 Explore P: 0.5759
Episode: 322 Total reward: 39.0 Training loss: 31.6045 Explore P: 0.5737
Episode: 323 Total reward: 37.0 Training loss: 10.6858 Explore P: 0.5716
Episode: 324 Total reward: 28.0 Training loss: 8.8664 Explore P: 0.5700
Episode: 325 Total reward: 16.0 Training loss: 1.4076 Explore P: 0.5691
Episode: 326 Total reward: 24.0 Training loss: 12.4601 Explore P: 0.5678
Episode: 327 Total reward: 25.0 Training loss: 8.6508 Explore P: 0.5664
Episode: 328 Total reward: 22.0 Training loss: 18.5561 Explore P: 0.5652
Episode: 329 Total reward: 50.0 Training loss: 15.0318 Explore P: 0.5624
Episode: 330 Total reward: 53.0 Training loss: 1.3968 Explore P: 0.5595
Episode: 331 Total reward: 31.0 Training loss: 11.8808 Explore P: 0.5578
Episode: 332 Total reward: 25.0 Training loss: 1.2976 Explore P: 0.5564
Episode: 333 Total reward: 47.0 Training loss: 27.6518 Explore P: 0.5539
Episode: 334 Total reward: 51.0 Training loss: 19.4728 Explore P: 0.5511
Episode: 335 Total reward: 23.0 Training loss: 13.9730 Explore P: 0.5499
Episode: 336 Total reward: 35.0 Training loss: 1.0742 Explore P: 0.5480
Episode: 337 Total reward: 80.0 Training loss: 15.0054 Explore P: 0.5437
Episode: 338 Total reward: 37.0 Training loss: 1.2920 Explore P: 0.5417
Episode: 339 Total reward: 56.0 Training loss: 1.8945 Explore P: 0.5387
Episode: 340 Total reward: 17.0 Training loss: 9.0324 Explore P: 0.5378
Episode: 341 Total reward: 27.0 Training loss: 32.6598 Explore P: 0.5364
Episode: 342 Total reward: 42.0 Training loss: 23.6165 Explore P: 0.5342
Episode: 343 Total reward: 24.0 Training loss: 0.9925 Explore P: 0.5330
Episode: 344 Total reward: 42.0 Training loss: 23.7051 Explore P: 0.5308
Episode: 345 Total reward: 21.0 Training loss: 21.3244 Explore P: 0.5297
Episode: 346 Total reward: 29.0 Training loss: 31.1155 Explore P: 0.5282
Episode: 347 Total reward: 117.0 Training loss: 1.9962 Explore P: 0.5221
Episode: 348 Total reward: 53.0 Training loss: 1.1959 Explore P: 0.5194
Episode: 349 Total reward: 38.0 Training loss: 12.9364 Explore P: 0.5175
Episode: 350 Total reward: 35.0 Training loss: 1.0909 Explore P: 0.5157
Episode: 351 Total reward: 13.0 Training loss: 0.8845 Explore P: 0.5151
Episode: 352 Total reward: 68.0 Training loss: 14.3952 Explore P: 0.5117
Episode: 353 Total reward: 44.0 Training loss: 1.0444 Explore P: 0.5094
Episode: 354 Total reward: 55.0 Training loss: 1.6768 Explore P: 0.5067
Episode: 355 Total reward: 39.0 Training loss: 1.4582 Explore P: 0.5048
Episode: 356 Total reward: 47.0 Training loss: 0.7661 Explore P: 0.5025
Episode: 357 Total reward: 19.0 Training loss: 1.2337 Explore P: 0.5015
Episode: 358 Total reward: 27.0 Training loss: 39.9290 Explore P: 0.5002
Episode: 359 Total reward: 33.0 Training loss: 33.8783 Explore P: 0.4986
Episode: 360 Total reward: 62.0 Training loss: 1.1639 Explore P: 0.4956
Episode: 361 Total reward: 77.0 Training loss: 19.5279 Explore P: 0.4918
Episode: 362 Total reward: 56.0 Training loss: 20.4384 Explore P: 0.4891
Episode: 363 Total reward: 25.0 Training loss: 29.9844 Explore P: 0.4879
Episode: 364 Total reward: 31.0 Training loss: 16.9553 Explore P: 0.4865
Episode: 365 Total reward: 76.0 Training loss: 1.9067 Explore P: 0.4829
Episode: 366 Total reward: 38.0 Training loss: 25.7248 Explore P: 0.4811
Episode: 367 Total reward: 84.0 Training loss: 47.1481 Explore P: 0.4771
Episode: 368 Total reward: 54.0 Training loss: 9.6251 Explore P: 0.4746
Episode: 369 Total reward: 29.0 Training loss: 16.7412 Explore P: 0.4733
Episode: 370 Total reward: 43.0 Training loss: 0.9452 Explore P: 0.4713
Episode: 371 Total reward: 88.0 Training loss: 1.8389 Explore P: 0.4672
Episode: 372 Total reward: 38.0 Training loss: 26.2892 Explore P: 0.4655
Episode: 373 Total reward: 60.0 Training loss: 14.3553 Explore P: 0.4628
Episode: 374 Total reward: 68.0 Training loss: 26.8217 Explore P: 0.4597
Episode: 375 Total reward: 65.0 Training loss: 1.3358 Explore P: 0.4568
Episode: 376 Total reward: 104.0 Training loss: 27.3020 Explore P: 0.4522
Episode: 377 Total reward: 39.0 Training loss: 93.2384 Explore P: 0.4505
Episode: 378 Total reward: 27.0 Training loss: 1.0783 Explore P: 0.4493
Episode: 379 Total reward: 43.0 Training loss: 1.3334 Explore P: 0.4474
Episode: 380 Total reward: 46.0 Training loss: 15.4738 Explore P: 0.4454
Episode: 381 Total reward: 58.0 Training loss: 19.0740 Explore P: 0.4429
Episode: 382 Total reward: 38.0 Training loss: 22.2549 Explore P: 0.4412
Episode: 383 Total reward: 56.0 Training loss: 1.4486 Explore P: 0.4388
Episode: 384 Total reward: 48.0 Training loss: 1.6903 Explore P: 0.4368
Episode: 385 Total reward: 60.0 Training loss: 1.9439 Explore P: 0.4342
Episode: 386 Total reward: 52.0 Training loss: 46.2211 Explore P: 0.4320
Episode: 387 Total reward: 38.0 Training loss: 36.6265 Explore P: 0.4304
Episode: 388 Total reward: 56.0 Training loss: 6.7650 Explore P: 0.4281
Episode: 389 Total reward: 47.0 Training loss: 1.6399 Explore P: 0.4261
Episode: 390 Total reward: 59.0 Training loss: 1.5716 Explore P: 0.4236
Episode: 391 Total reward: 45.0 Training loss: 1.1171 Explore P: 0.4218
Episode: 392 Total reward: 78.0 Training loss: 37.0077 Explore P: 0.4186
Episode: 393 Total reward: 39.0 Training loss: 23.9374 Explore P: 0.4170
Episode: 394 Total reward: 61.0 Training loss: 57.3340 Explore P: 0.4145
Episode: 395 Total reward: 53.0 Training loss: 1.5453 Explore P: 0.4124
Episode: 396 Total reward: 46.0 Training loss: 2.3882 Explore P: 0.4105
Episode: 397 Total reward: 32.0 Training loss: 34.9818 Explore P: 0.4093
Episode: 398 Total reward: 81.0 Training loss: 43.6030 Explore P: 0.4060
Episode: 399 Total reward: 53.0 Training loss: 21.5611 Explore P: 0.4039
Episode: 400 Total reward: 54.0 Training loss: 22.9423 Explore P: 0.4018
Episode: 401 Total reward: 40.0 Training loss: 1.9722 Explore P: 0.4003
Episode: 402 Total reward: 23.0 Training loss: 2.7244 Explore P: 0.3994
Episode: 403 Total reward: 71.0 Training loss: 2.2247 Explore P: 0.3966
Episode: 404 Total reward: 80.0 Training loss: 22.6882 Explore P: 0.3935
Episode: 405 Total reward: 59.0 Training loss: 55.5845 Explore P: 0.3913
Episode: 406 Total reward: 64.0 Training loss: 39.0189 Explore P: 0.3888
Episode: 407 Total reward: 63.0 Training loss: 1.6841 Explore P: 0.3865
Episode: 408 Total reward: 75.0 Training loss: 2.0983 Explore P: 0.3836
Episode: 409 Total reward: 25.0 Training loss: 48.2231 Explore P: 0.3827
Episode: 410 Total reward: 128.0 Training loss: 2.1046 Explore P: 0.3780
Episode: 411 Total reward: 117.0 Training loss: 35.0014 Explore P: 0.3737
Episode: 412 Total reward: 77.0 Training loss: 2.0764 Explore P: 0.3709
Episode: 413 Total reward: 100.0 Training loss: 22.9558 Explore P: 0.3673
Episode: 414 Total reward: 40.0 Training loss: 1.9228 Explore P: 0.3659
Episode: 415 Total reward: 35.0 Training loss: 2.0319 Explore P: 0.3646
Episode: 416 Total reward: 50.0 Training loss: 37.1650 Explore P: 0.3629
Episode: 417 Total reward: 62.0 Training loss: 2.5362 Explore P: 0.3607
Episode: 418 Total reward: 41.0 Training loss: 46.3379 Explore P: 0.3593
Episode: 419 Total reward: 44.0 Training loss: 1.2886 Explore P: 0.3577
Episode: 420 Total reward: 52.0 Training loss: 114.4273 Explore P: 0.3559
Episode: 421 Total reward: 40.0 Training loss: 46.3396 Explore P: 0.3545
Episode: 422 Total reward: 58.0 Training loss: 28.3462 Explore P: 0.3525
Episode: 423 Total reward: 64.0 Training loss: 94.8189 Explore P: 0.3504
Episode: 424 Total reward: 92.0 Training loss: 48.1418 Explore P: 0.3472
Episode: 425 Total reward: 36.0 Training loss: 171.3478 Explore P: 0.3460
Episode: 426 Total reward: 65.0 Training loss: 15.9345 Explore P: 0.3439
Episode: 427 Total reward: 119.0 Training loss: 2.0701 Explore P: 0.3399
Episode: 428 Total reward: 102.0 Training loss: 65.1495 Explore P: 0.3366
Episode: 429 Total reward: 61.0 Training loss: 23.2305 Explore P: 0.3346
Episode: 430 Total reward: 33.0 Training loss: 3.2950 Explore P: 0.3335
Episode: 431 Total reward: 183.0 Training loss: 55.5762 Explore P: 0.3276
Episode: 432 Total reward: 59.0 Training loss: 6.7213 Explore P: 0.3258
Episode: 433 Total reward: 40.0 Training loss: 1.9584 Explore P: 0.3245
Episode: 434 Total reward: 68.0 Training loss: 167.6053 Explore P: 0.3224
Episode: 435 Total reward: 48.0 Training loss: 46.0662 Explore P: 0.3209
Episode: 436 Total reward: 138.0 Training loss: 51.3997 Explore P: 0.3166
Episode: 437 Total reward: 87.0 Training loss: 86.5762 Explore P: 0.3140
Episode: 438 Total reward: 62.0 Training loss: 2.4162 Explore P: 0.3121
Episode: 439 Total reward: 45.0 Training loss: 30.9865 Explore P: 0.3107
Episode: 440 Total reward: 133.0 Training loss: 3.8534 Explore P: 0.3068
Episode: 441 Total reward: 75.0 Training loss: 4.2830 Explore P: 0.3045
Episode: 442 Total reward: 68.0 Training loss: 3.7200 Explore P: 0.3025
Episode: 443 Total reward: 59.0 Training loss: 2.1358 Explore P: 0.3008
Episode: 444 Total reward: 39.0 Training loss: 30.7633 Explore P: 0.2997
Episode: 445 Total reward: 108.0 Training loss: 1.6713 Explore P: 0.2966
Episode: 446 Total reward: 50.0 Training loss: 2.8759 Explore P: 0.2951
Episode: 447 Total reward: 47.0 Training loss: 2.1756 Explore P: 0.2938
Episode: 448 Total reward: 135.0 Training loss: 2.0221 Explore P: 0.2900
Episode: 449 Total reward: 98.0 Training loss: 2.6192 Explore P: 0.2873
Episode: 450 Total reward: 62.0 Training loss: 2.1282 Explore P: 0.2856
Episode: 451 Total reward: 59.0 Training loss: 16.5220 Explore P: 0.2839
Episode: 452 Total reward: 35.0 Training loss: 2.0605 Explore P: 0.2830
Episode: 453 Total reward: 36.0 Training loss: 40.2559 Explore P: 0.2820
Episode: 454 Total reward: 46.0 Training loss: 2.4274 Explore P: 0.2808
Episode: 455 Total reward: 47.0 Training loss: 40.1053 Explore P: 0.2795
Episode: 456 Total reward: 58.0 Training loss: 3.8568 Explore P: 0.2779
Episode: 457 Total reward: 50.0 Training loss: 23.2428 Explore P: 0.2766
Episode: 458 Total reward: 59.0 Training loss: 2.5279 Explore P: 0.2750
Episode: 459 Total reward: 80.0 Training loss: 40.5097 Explore P: 0.2729
Episode: 460 Total reward: 76.0 Training loss: 51.3538 Explore P: 0.2709
Episode: 461 Total reward: 21.0 Training loss: 1.0989 Explore P: 0.2704
Episode: 462 Total reward: 68.0 Training loss: 1.8269 Explore P: 0.2686
Episode: 463 Total reward: 62.0 Training loss: 1.9921 Explore P: 0.2670
Episode: 464 Total reward: 22.0 Training loss: 57.3726 Explore P: 0.2664
Episode: 465 Total reward: 53.0 Training loss: 1.7934 Explore P: 0.2651
Episode: 466 Total reward: 105.0 Training loss: 0.9063 Explore P: 0.2624
Episode: 467 Total reward: 123.0 Training loss: 2.4340 Explore P: 0.2593
Episode: 468 Total reward: 79.0 Training loss: 13.1567 Explore P: 0.2574
Episode: 469 Total reward: 71.0 Training loss: 109.4815 Explore P: 0.2556
Episode: 470 Total reward: 123.0 Training loss: 2.1982 Explore P: 0.2526
Episode: 471 Total reward: 113.0 Training loss: 2.8220 Explore P: 0.2499
Episode: 472 Total reward: 199.0 Training loss: 2.9033 Explore P: 0.2452
Episode: 473 Total reward: 85.0 Training loss: 90.1276 Explore P: 0.2432
Episode: 474 Total reward: 97.0 Training loss: 2.2535 Explore P: 0.2409
Episode: 475 Total reward: 121.0 Training loss: 2.5608 Explore P: 0.2382
Episode: 476 Total reward: 98.0 Training loss: 2.0214 Explore P: 0.2359
Episode: 477 Total reward: 82.0 Training loss: 1.5772 Explore P: 0.2341
Episode: 478 Total reward: 94.0 Training loss: 2.3580 Explore P: 0.2320
Episode: 479 Total reward: 83.0 Training loss: 2.3566 Explore P: 0.2301
Episode: 480 Total reward: 146.0 Training loss: 2.3508 Explore P: 0.2270
Episode: 481 Total reward: 88.0 Training loss: 2.1889 Explore P: 0.2251
Episode: 482 Total reward: 72.0 Training loss: 113.0744 Explore P: 0.2235
Episode: 483 Total reward: 70.0 Training loss: 1.1434 Explore P: 0.2220
Episode: 484 Total reward: 50.0 Training loss: 3.3442 Explore P: 0.2210
Episode: 485 Total reward: 61.0 Training loss: 183.5129 Explore P: 0.2197
Episode: 486 Total reward: 102.0 Training loss: 84.6593 Explore P: 0.2176
Episode: 487 Total reward: 92.0 Training loss: 31.0458 Explore P: 0.2157
Episode: 488 Total reward: 127.0 Training loss: 0.9220 Explore P: 0.2131
Episode: 489 Total reward: 184.0 Training loss: 275.6520 Explore P: 0.2094
Episode: 490 Total reward: 199.0 Training loss: 100.9197 Explore P: 0.2054
Episode: 491 Total reward: 131.0 Training loss: 2.5274 Explore P: 0.2029
Episode: 492 Total reward: 109.0 Training loss: 90.3494 Explore P: 0.2008
Episode: 493 Total reward: 47.0 Training loss: 1.7948 Explore P: 0.1999
Episode: 494 Total reward: 185.0 Training loss: 2.0516 Explore P: 0.1964
Episode: 495 Total reward: 199.0 Training loss: 0.6773 Explore P: 0.1927
Episode: 496 Total reward: 130.0 Training loss: 1.1481 Explore P: 0.1904
Episode: 497 Total reward: 100.0 Training loss: 114.8490 Explore P: 0.1886
Episode: 498 Total reward: 147.0 Training loss: 1.4394 Explore P: 0.1860
Episode: 499 Total reward: 57.0 Training loss: 1.0673 Explore P: 0.1850
Episode: 500 Total reward: 121.0 Training loss: 184.0763 Explore P: 0.1829
Episode: 501 Total reward: 152.0 Training loss: 1.9583 Explore P: 0.1803
Episode: 502 Total reward: 199.0 Training loss: 2.4872 Explore P: 0.1769
Episode: 503 Total reward: 199.0 Training loss: 1.9973 Explore P: 0.1736
Episode: 504 Total reward: 199.0 Training loss: 75.1750 Explore P: 0.1704
Episode: 505 Total reward: 199.0 Training loss: 1.2780 Explore P: 0.1672
Episode: 506 Total reward: 196.0 Training loss: 1.4606 Explore P: 0.1642
Episode: 507 Total reward: 105.0 Training loss: 1.9638 Explore P: 0.1626
Episode: 508 Total reward: 102.0 Training loss: 1.3333 Explore P: 0.1610
Episode: 509 Total reward: 158.0 Training loss: 1.2531 Explore P: 0.1587
Episode: 510 Total reward: 161.0 Training loss: 1.1254 Explore P: 0.1563
Episode: 511 Total reward: 199.0 Training loss: 0.6941 Explore P: 0.1534
Episode: 512 Total reward: 123.0 Training loss: 2.2725 Explore P: 0.1517
Episode: 513 Total reward: 199.0 Training loss: 0.6436 Explore P: 0.1489
Episode: 514 Total reward: 199.0 Training loss: 63.3925 Explore P: 0.1461
Episode: 515 Total reward: 199.0 Training loss: 1.1841 Explore P: 0.1434
Episode: 516 Total reward: 199.0 Training loss: 53.8437 Explore P: 0.1408
Episode: 517 Total reward: 199.0 Training loss: 52.3584 Explore P: 0.1382
Episode: 518 Total reward: 199.0 Training loss: 1.2767 Explore P: 0.1357
Episode: 519 Total reward: 199.0 Training loss: 1.4295 Explore P: 0.1332
Episode: 520 Total reward: 199.0 Training loss: 0.4995 Explore P: 0.1308
Episode: 521 Total reward: 199.0 Training loss: 0.5171 Explore P: 0.1284
Episode: 522 Total reward: 199.0 Training loss: 0.9564 Explore P: 0.1261
Episode: 523 Total reward: 199.0 Training loss: 0.7336 Explore P: 0.1238
Episode: 524 Total reward: 199.0 Training loss: 67.9433 Explore P: 0.1216
Episode: 525 Total reward: 199.0 Training loss: 59.7647 Explore P: 0.1194
Episode: 526 Total reward: 199.0 Training loss: 0.6634 Explore P: 0.1172
Episode: 527 Total reward: 199.0 Training loss: 0.8003 Explore P: 0.1151
Episode: 528 Total reward: 199.0 Training loss: 0.2039 Explore P: 0.1130
Episode: 529 Total reward: 199.0 Training loss: 0.8066 Explore P: 0.1110
Episode: 530 Total reward: 199.0 Training loss: 0.9553 Explore P: 0.1090
Episode: 531 Total reward: 199.0 Training loss: 0.5524 Explore P: 0.1071
Episode: 532 Total reward: 199.0 Training loss: 0.7345 Explore P: 0.1051
Episode: 533 Total reward: 199.0 Training loss: 0.6264 Explore P: 0.1033
Episode: 534 Total reward: 199.0 Training loss: 0.4202 Explore P: 0.1014
Episode: 535 Total reward: 199.0 Training loss: 0.5051 Explore P: 0.0996
Episode: 536 Total reward: 199.0 Training loss: 0.9479 Explore P: 0.0979
Episode: 537 Total reward: 199.0 Training loss: 0.3051 Explore P: 0.0961
Episode: 538 Total reward: 199.0 Training loss: 0.5037 Explore P: 0.0944
Episode: 539 Total reward: 199.0 Training loss: 0.3769 Explore P: 0.0928
Episode: 540 Total reward: 199.0 Training loss: 199.7347 Explore P: 0.0911
Episode: 541 Total reward: 199.0 Training loss: 0.3615 Explore P: 0.0895
Episode: 542 Total reward: 199.0 Training loss: 0.4107 Explore P: 0.0880
Episode: 543 Total reward: 199.0 Training loss: 0.3365 Explore P: 0.0864
Episode: 544 Total reward: 199.0 Training loss: 0.3009 Explore P: 0.0849
Episode: 545 Total reward: 199.0 Training loss: 0.3755 Explore P: 0.0835
Episode: 546 Total reward: 199.0 Training loss: 0.3647 Explore P: 0.0820
Episode: 547 Total reward: 199.0 Training loss: 0.3827 Explore P: 0.0806
Episode: 548 Total reward: 174.0 Training loss: 0.3419 Explore P: 0.0794
Episode: 549 Total reward: 196.0 Training loss: 0.3857 Explore P: 0.0780
Episode: 550 Total reward: 199.0 Training loss: 0.6161 Explore P: 0.0767
Episode: 551 Total reward: 199.0 Training loss: 0.4530 Explore P: 0.0754
Episode: 552 Total reward: 199.0 Training loss: 0.5506 Explore P: 0.0741
Episode: 553 Total reward: 199.0 Training loss: 0.2042 Explore P: 0.0728
Episode: 554 Total reward: 199.0 Training loss: 0.2356 Explore P: 0.0716
Episode: 555 Total reward: 199.0 Training loss: 0.0957 Explore P: 0.0704
Episode: 556 Total reward: 199.0 Training loss: 0.2205 Explore P: 0.0692
Episode: 557 Total reward: 199.0 Training loss: 0.3456 Explore P: 0.0680
Episode: 558 Total reward: 199.0 Training loss: 0.5790 Explore P: 0.0669
Episode: 559 Total reward: 199.0 Training loss: 0.2437 Explore P: 0.0658
Episode: 560 Total reward: 199.0 Training loss: 0.4569 Explore P: 0.0647
Episode: 561 Total reward: 199.0 Training loss: 229.8303 Explore P: 0.0636
Episode: 562 Total reward: 199.0 Training loss: 0.2092 Explore P: 0.0625
Episode: 563 Total reward: 199.0 Training loss: 0.3597 Explore P: 0.0615
Episode: 564 Total reward: 199.0 Training loss: 0.1547 Explore P: 0.0605
Episode: 565 Total reward: 199.0 Training loss: 0.3904 Explore P: 0.0595
Episode: 566 Total reward: 199.0 Training loss: 0.1466 Explore P: 0.0585
Episode: 567 Total reward: 199.0 Training loss: 0.1290 Explore P: 0.0575
Episode: 568 Total reward: 199.0 Training loss: 119.4984 Explore P: 0.0566
Episode: 569 Total reward: 199.0 Training loss: 0.2368 Explore P: 0.0557
Episode: 570 Total reward: 199.0 Training loss: 0.0927 Explore P: 0.0548
Episode: 571 Total reward: 199.0 Training loss: 0.1764 Explore P: 0.0539
Episode: 572 Total reward: 199.0 Training loss: 0.3013 Explore P: 0.0530
Episode: 573 Total reward: 199.0 Training loss: 0.3800 Explore P: 0.0522
Episode: 574 Total reward: 199.0 Training loss: 0.1979 Explore P: 0.0514
Episode: 575 Total reward: 199.0 Training loss: 0.4308 Explore P: 0.0505
Episode: 576 Total reward: 199.0 Training loss: 255.5942 Explore P: 0.0497
Episode: 577 Total reward: 199.0 Training loss: 0.5020 Explore P: 0.0490
Episode: 578 Total reward: 199.0 Training loss: 0.1511 Explore P: 0.0482
Episode: 579 Total reward: 199.0 Training loss: 0.4426 Explore P: 0.0474
Episode: 580 Total reward: 199.0 Training loss: 0.1710 Explore P: 0.0467
Episode: 581 Total reward: 199.0 Training loss: 0.1908 Explore P: 0.0460
Episode: 582 Total reward: 199.0 Training loss: 0.2110 Explore P: 0.0453
Episode: 583 Total reward: 199.0 Training loss: 0.2537 Explore P: 0.0446
Episode: 584 Total reward: 199.0 Training loss: 0.2009 Explore P: 0.0439
Episode: 585 Total reward: 199.0 Training loss: 0.2380 Explore P: 0.0432
Episode: 586 Total reward: 199.0 Training loss: 0.1440 Explore P: 0.0426
Episode: 587 Total reward: 181.0 Training loss: 0.1400 Explore P: 0.0420
Episode: 588 Total reward: 199.0 Training loss: 0.0869 Explore P: 0.0414
Episode: 589 Total reward: 199.0 Training loss: 0.2203 Explore P: 0.0407
Episode: 590 Total reward: 199.0 Training loss: 0.1272 Explore P: 0.0401
Episode: 591 Total reward: 199.0 Training loss: 0.0962 Explore P: 0.0395
Episode: 592 Total reward: 199.0 Training loss: 0.1599 Explore P: 0.0390
Episode: 593 Total reward: 199.0 Training loss: 0.2473 Explore P: 0.0384
Episode: 594 Total reward: 168.0 Training loss: 257.0509 Explore P: 0.0379
Episode: 595 Total reward: 199.0 Training loss: 0.1368 Explore P: 0.0374
Episode: 596 Total reward: 166.0 Training loss: 0.2421 Explore P: 0.0369
Episode: 597 Total reward: 182.0 Training loss: 0.1367 Explore P: 0.0364
Episode: 598 Total reward: 199.0 Training loss: 0.4478 Explore P: 0.0359
Episode: 599 Total reward: 199.0 Training loss: 0.2084 Explore P: 0.0354
Episode: 600 Total reward: 199.0 Training loss: 0.2368 Explore P: 0.0349
Episode: 601 Total reward: 199.0 Training loss: 0.2590 Explore P: 0.0344
Episode: 602 Total reward: 167.0 Training loss: 0.2565 Explore P: 0.0340
Episode: 603 Total reward: 199.0 Training loss: 0.1750 Explore P: 0.0335
Episode: 604 Total reward: 199.0 Training loss: 248.2791 Explore P: 0.0331
Episode: 605 Total reward: 199.0 Training loss: 0.2861 Explore P: 0.0326
Episode: 606 Total reward: 199.0 Training loss: 0.4944 Explore P: 0.0322
Episode: 607 Total reward: 155.0 Training loss: 209.2963 Explore P: 0.0318
Episode: 608 Total reward: 160.0 Training loss: 0.5650 Explore P: 0.0315
Episode: 609 Total reward: 164.0 Training loss: 0.2818 Explore P: 0.0311
Episode: 610 Total reward: 148.0 Training loss: 0.3614 Explore P: 0.0308
Episode: 611 Total reward: 101.0 Training loss: 0.1695 Explore P: 0.0306
Episode: 612 Total reward: 148.0 Training loss: 0.3957 Explore P: 0.0303
Episode: 613 Total reward: 111.0 Training loss: 0.2228 Explore P: 0.0301
Episode: 614 Total reward: 126.0 Training loss: 0.3712 Explore P: 0.0298
Episode: 615 Total reward: 137.0 Training loss: 0.2723 Explore P: 0.0296
Episode: 616 Total reward: 138.0 Training loss: 43.6485 Explore P: 0.0293
Episode: 617 Total reward: 199.0 Training loss: 0.4324 Explore P: 0.0289
Episode: 618 Total reward: 106.0 Training loss: 0.4067 Explore P: 0.0287
Episode: 619 Total reward: 199.0 Training loss: 0.1360 Explore P: 0.0283
Episode: 620 Total reward: 199.0 Training loss: 0.4432 Explore P: 0.0280
Episode: 621 Total reward: 199.0 Training loss: 0.3583 Explore P: 0.0276
Episode: 622 Total reward: 199.0 Training loss: 0.2751 Explore P: 0.0273
Episode: 623 Total reward: 199.0 Training loss: 0.1648 Explore P: 0.0269
Episode: 624 Total reward: 199.0 Training loss: 0.1023 Explore P: 0.0266
Episode: 625 Total reward: 199.0 Training loss: 0.4092 Explore P: 0.0263
Episode: 626 Total reward: 199.0 Training loss: 0.2883 Explore P: 0.0260
Episode: 627 Total reward: 199.0 Training loss: 299.6772 Explore P: 0.0256
Episode: 628 Total reward: 171.0 Training loss: 0.3497 Explore P: 0.0254
Episode: 629 Total reward: 199.0 Training loss: 0.2401 Explore P: 0.0251
Episode: 630 Total reward: 97.0 Training loss: 0.1338 Explore P: 0.0249
Episode: 631 Total reward: 109.0 Training loss: 0.3235 Explore P: 0.0248
Episode: 632 Total reward: 199.0 Training loss: 32.3240 Explore P: 0.0245
Episode: 633 Total reward: 199.0 Training loss: 0.1209 Explore P: 0.0242
Episode: 634 Total reward: 199.0 Training loss: 0.2309 Explore P: 0.0239
Episode: 635 Total reward: 199.0 Training loss: 0.2342 Explore P: 0.0236
Episode: 636 Total reward: 91.0 Training loss: 0.1839 Explore P: 0.0235
Episode: 637 Total reward: 199.0 Training loss: 62.7483 Explore P: 0.0233
Episode: 638 Total reward: 117.0 Training loss: 0.2041 Explore P: 0.0231
Episode: 639 Total reward: 199.0 Training loss: 0.2522 Explore P: 0.0228
Episode: 640 Total reward: 199.0 Training loss: 0.1980 Explore P: 0.0226
Episode: 641 Total reward: 136.0 Training loss: 0.5006 Explore P: 0.0224
Episode: 642 Total reward: 116.0 Training loss: 38.4921 Explore P: 0.0223
Episode: 643 Total reward: 199.0 Training loss: 0.1759 Explore P: 0.0220
Episode: 644 Total reward: 115.0 Training loss: 0.1720 Explore P: 0.0219
Episode: 645 Total reward: 59.0 Training loss: 0.1104 Explore P: 0.0218
Episode: 646 Total reward: 199.0 Training loss: 0.1355 Explore P: 0.0216
Episode: 647 Total reward: 199.0 Training loss: 0.3229 Explore P: 0.0214
Episode: 648 Total reward: 101.0 Training loss: 0.2894 Explore P: 0.0212
Episode: 649 Total reward: 199.0 Training loss: 0.2551 Explore P: 0.0210
Episode: 650 Total reward: 199.0 Training loss: 0.1715 Explore P: 0.0208
Episode: 651 Total reward: 199.0 Training loss: 251.8810 Explore P: 0.0206
Episode: 652 Total reward: 199.0 Training loss: 0.3345 Explore P: 0.0204
Episode: 653 Total reward: 199.0 Training loss: 98.8741 Explore P: 0.0202
Episode: 654 Total reward: 125.0 Training loss: 0.2232 Explore P: 0.0201
Episode: 655 Total reward: 199.0 Training loss: 0.2278 Explore P: 0.0199
Episode: 656 Total reward: 199.0 Training loss: 0.5348 Explore P: 0.0197
Episode: 657 Total reward: 199.0 Training loss: 0.5006 Explore P: 0.0195
Episode: 658 Total reward: 199.0 Training loss: 251.6745 Explore P: 0.0193
Episode: 659 Total reward: 199.0 Training loss: 0.1592 Explore P: 0.0191
Episode: 660 Total reward: 199.0 Training loss: 0.3497 Explore P: 0.0189
Episode: 661 Total reward: 199.0 Training loss: 0.1438 Explore P: 0.0187
Episode: 662 Total reward: 199.0 Training loss: 0.1362 Explore P: 0.0186
Episode: 663 Total reward: 134.0 Training loss: 0.1616 Explore P: 0.0185
Episode: 664 Total reward: 199.0 Training loss: 0.2875 Explore P: 0.0183
Episode: 665 Total reward: 199.0 Training loss: 0.2252 Explore P: 0.0181
Episode: 666 Total reward: 199.0 Training loss: 0.1366 Explore P: 0.0180
Episode: 667 Total reward: 81.0 Training loss: 239.8862 Explore P: 0.0179
Episode: 668 Total reward: 93.0 Training loss: 0.1514 Explore P: 0.0178
Episode: 669 Total reward: 186.0 Training loss: 0.2447 Explore P: 0.0177
Episode: 670 Total reward: 199.0 Training loss: 0.3616 Explore P: 0.0175
Episode: 671 Total reward: 141.0 Training loss: 0.1596 Explore P: 0.0174
Episode: 672 Total reward: 199.0 Training loss: 0.4101 Explore P: 0.0173
Episode: 673 Total reward: 146.0 Training loss: 0.2894 Explore P: 0.0172
Episode: 674 Total reward: 77.0 Training loss: 0.2899 Explore P: 0.0171
Episode: 675 Total reward: 199.0 Training loss: 0.2680 Explore P: 0.0170
Episode: 676 Total reward: 199.0 Training loss: 202.7782 Explore P: 0.0168
Episode: 677 Total reward: 199.0 Training loss: 0.1914 Explore P: 0.0167
Episode: 678 Total reward: 199.0 Training loss: 230.2399 Explore P: 0.0166
Episode: 679 Total reward: 80.0 Training loss: 0.2152 Explore P: 0.0165
Episode: 680 Total reward: 199.0 Training loss: 0.2605 Explore P: 0.0164
Episode: 681 Total reward: 63.0 Training loss: 0.5216 Explore P: 0.0164
Episode: 682 Total reward: 198.0 Training loss: 0.2750 Explore P: 0.0162
Episode: 683 Total reward: 199.0 Training loss: 236.6802 Explore P: 0.0161
Episode: 684 Total reward: 148.0 Training loss: 0.3172 Explore P: 0.0160
Episode: 685 Total reward: 154.0 Training loss: 0.1700 Explore P: 0.0159
Episode: 686 Total reward: 189.0 Training loss: 0.3168 Explore P: 0.0158
Episode: 687 Total reward: 134.0 Training loss: 0.2621 Explore P: 0.0157
Episode: 688 Total reward: 144.0 Training loss: 0.2617 Explore P: 0.0157
Episode: 689 Total reward: 199.0 Training loss: 0.2068 Explore P: 0.0155
Episode: 690 Total reward: 152.0 Training loss: 0.1229 Explore P: 0.0155
Episode: 691 Total reward: 199.0 Training loss: 0.2482 Explore P: 0.0154
Episode: 692 Total reward: 192.0 Training loss: 0.1613 Explore P: 0.0153
Episode: 693 Total reward: 199.0 Training loss: 0.7497 Explore P: 0.0152
Episode: 694 Total reward: 193.0 Training loss: 0.2647 Explore P: 0.0151
Episode: 695 Total reward: 199.0 Training loss: 0.3174 Explore P: 0.0150
Episode: 696 Total reward: 110.0 Training loss: 0.1454 Explore P: 0.0149
Episode: 697 Total reward: 199.0 Training loss: 0.1737 Explore P: 0.0148
Episode: 698 Total reward: 199.0 Training loss: 0.3111 Explore P: 0.0147
Episode: 699 Total reward: 199.0 Training loss: 0.2322 Explore P: 0.0146
Episode: 700 Total reward: 158.0 Training loss: 0.2341 Explore P: 0.0145
Episode: 701 Total reward: 156.0 Training loss: 247.2154 Explore P: 0.0145
Episode: 702 Total reward: 199.0 Training loss: 0.1880 Explore P: 0.0144
Episode: 703 Total reward: 93.0 Training loss: 0.1042 Explore P: 0.0143
Episode: 704 Total reward: 199.0 Training loss: 0.2740 Explore P: 0.0143
Episode: 705 Total reward: 199.0 Training loss: 244.3862 Explore P: 0.0142
Episode: 706 Total reward: 80.0 Training loss: 97.7738 Explore P: 0.0141
Episode: 707 Total reward: 199.0 Training loss: 0.1997 Explore P: 0.0141
Episode: 708 Total reward: 199.0 Training loss: 0.1640 Explore P: 0.0140
Episode: 709 Total reward: 199.0 Training loss: 0.2524 Explore P: 0.0139
Episode: 710 Total reward: 199.0 Training loss: 0.3767 Explore P: 0.0138
Episode: 711 Total reward: 76.0 Training loss: 0.2044 Explore P: 0.0138
Episode: 712 Total reward: 181.0 Training loss: 0.2759 Explore P: 0.0137
Episode: 713 Total reward: 199.0 Training loss: 0.4199 Explore P: 0.0137
Episode: 714 Total reward: 199.0 Training loss: 0.1919 Explore P: 0.0136
Episode: 715 Total reward: 101.0 Training loss: 0.0978 Explore P: 0.0135
Episode: 716 Total reward: 199.0 Training loss: 0.1715 Explore P: 0.0135
Episode: 717 Total reward: 182.0 Training loss: 0.2336 Explore P: 0.0134
Episode: 718 Total reward: 199.0 Training loss: 0.1345 Explore P: 0.0133
Episode: 719 Total reward: 199.0 Training loss: 0.1986 Explore P: 0.0133
Episode: 720 Total reward: 199.0 Training loss: 0.1710 Explore P: 0.0132
Episode: 721 Total reward: 125.0 Training loss: 0.2285 Explore P: 0.0132
Episode: 722 Total reward: 133.0 Training loss: 245.9072 Explore P: 0.0131
Episode: 723 Total reward: 199.0 Training loss: 0.1498 Explore P: 0.0131
Episode: 724 Total reward: 199.0 Training loss: 0.2999 Explore P: 0.0130
Episode: 725 Total reward: 133.0 Training loss: 0.1388 Explore P: 0.0130
Episode: 726 Total reward: 154.0 Training loss: 175.6357 Explore P: 0.0129
Episode: 727 Total reward: 199.0 Training loss: 0.2827 Explore P: 0.0129
Episode: 728 Total reward: 148.0 Training loss: 0.1386 Explore P: 0.0128
Episode: 729 Total reward: 199.0 Training loss: 0.1457 Explore P: 0.0128
Episode: 730 Total reward: 93.0 Training loss: 0.0947 Explore P: 0.0127
Episode: 731 Total reward: 162.0 Training loss: 0.1094 Explore P: 0.0127
Episode: 732 Total reward: 199.0 Training loss: 0.1803 Explore P: 0.0126
Episode: 733 Total reward: 199.0 Training loss: 255.4840 Explore P: 0.0126
Episode: 734 Total reward: 81.0 Training loss: 0.1363 Explore P: 0.0126
Episode: 735 Total reward: 151.0 Training loss: 0.1378 Explore P: 0.0125
Episode: 736 Total reward: 85.0 Training loss: 0.2596 Explore P: 0.0125
Episode: 737 Total reward: 199.0 Training loss: 0.1813 Explore P: 0.0125
Episode: 738 Total reward: 199.0 Training loss: 0.1419 Explore P: 0.0124
Episode: 739 Total reward: 79.0 Training loss: 0.2137 Explore P: 0.0124
Episode: 740 Total reward: 199.0 Training loss: 0.1485 Explore P: 0.0123
Episode: 741 Total reward: 199.0 Training loss: 0.1294 Explore P: 0.0123
Episode: 742 Total reward: 62.0 Training loss: 0.1351 Explore P: 0.0123
Episode: 743 Total reward: 63.0 Training loss: 0.1553 Explore P: 0.0123
Episode: 744 Total reward: 69.0 Training loss: 0.1902 Explore P: 0.0123
Episode: 745 Total reward: 66.0 Training loss: 0.2672 Explore P: 0.0122
Episode: 746 Total reward: 199.0 Training loss: 0.1867 Explore P: 0.0122
Episode: 747 Total reward: 199.0 Training loss: 0.2160 Explore P: 0.0122
Episode: 748 Total reward: 62.0 Training loss: 0.2131 Explore P: 0.0121
Episode: 749 Total reward: 66.0 Training loss: 0.1632 Explore P: 0.0121
Episode: 750 Total reward: 199.0 Training loss: 0.1360 Explore P: 0.0121
Episode: 751 Total reward: 199.0 Training loss: 0.2604 Explore P: 0.0120
Episode: 752 Total reward: 199.0 Training loss: 0.1652 Explore P: 0.0120
Episode: 753 Total reward: 161.0 Training loss: 0.1395 Explore P: 0.0120
Episode: 754 Total reward: 199.0 Training loss: 0.3645 Explore P: 0.0119
Episode: 755 Total reward: 85.0 Training loss: 0.1747 Explore P: 0.0119
Episode: 756 Total reward: 166.0 Training loss: 0.3191 Explore P: 0.0119
Episode: 757 Total reward: 148.0 Training loss: 0.1735 Explore P: 0.0119
Episode: 758 Total reward: 199.0 Training loss: 0.2075 Explore P: 0.0118
Episode: 759 Total reward: 199.0 Training loss: 0.1951 Explore P: 0.0118
Episode: 760 Total reward: 199.0 Training loss: 0.1833 Explore P: 0.0118
Episode: 761 Total reward: 199.0 Training loss: 0.1749 Explore P: 0.0117
Episode: 762 Total reward: 61.0 Training loss: 0.1586 Explore P: 0.0117
Episode: 763 Total reward: 199.0 Training loss: 246.7634 Explore P: 0.0117
Episode: 764 Total reward: 199.0 Training loss: 0.3331 Explore P: 0.0116
Episode: 765 Total reward: 100.0 Training loss: 0.1998 Explore P: 0.0116
Episode: 766 Total reward: 199.0 Training loss: 0.1450 Explore P: 0.0116
Episode: 767 Total reward: 81.0 Training loss: 2.7857 Explore P: 0.0116
Episode: 768 Total reward: 199.0 Training loss: 0.3700 Explore P: 0.0115
Episode: 769 Total reward: 98.0 Training loss: 1.6903 Explore P: 0.0115
Episode: 770 Total reward: 199.0 Training loss: 0.3605 Explore P: 0.0115
Episode: 771 Total reward: 183.0 Training loss: 0.2600 Explore P: 0.0115
Episode: 772 Total reward: 199.0 Training loss: 0.2701 Explore P: 0.0114
Episode: 773 Total reward: 159.0 Training loss: 0.1871 Explore P: 0.0114
Episode: 774 Total reward: 76.0 Training loss: 0.3419 Explore P: 0.0114
Episode: 775 Total reward: 199.0 Training loss: 0.2368 Explore P: 0.0114
Episode: 776 Total reward: 199.0 Training loss: 0.2385 Explore P: 0.0114
Episode: 777 Total reward: 81.0 Training loss: 0.2387 Explore P: 0.0113
Episode: 778 Total reward: 199.0 Training loss: 0.3235 Explore P: 0.0113
Episode: 779 Total reward: 199.0 Training loss: 0.2923 Explore P: 0.0113
Episode: 780 Total reward: 150.0 Training loss: 0.1019 Explore P: 0.0113
Episode: 781 Total reward: 199.0 Training loss: 155.6238 Explore P: 0.0112
Episode: 782 Total reward: 199.0 Training loss: 7.0715 Explore P: 0.0112
Episode: 783 Total reward: 154.0 Training loss: 0.2204 Explore P: 0.0112
Episode: 784 Total reward: 199.0 Training loss: 0.2468 Explore P: 0.0112
Episode: 785 Total reward: 199.0 Training loss: 0.2796 Explore P: 0.0112
Episode: 786 Total reward: 112.0 Training loss: 0.2338 Explore P: 0.0111
Episode: 787 Total reward: 83.0 Training loss: 0.2903 Explore P: 0.0111
Episode: 788 Total reward: 199.0 Training loss: 0.3212 Explore P: 0.0111
Episode: 789 Total reward: 199.0 Training loss: 0.2989 Explore P: 0.0111
Episode: 790 Total reward: 199.0 Training loss: 474.2114 Explore P: 0.0111
Episode: 791 Total reward: 199.0 Training loss: 0.1800 Explore P: 0.0110
Episode: 792 Total reward: 108.0 Training loss: 0.2848 Explore P: 0.0110
Episode: 793 Total reward: 124.0 Training loss: 0.2417 Explore P: 0.0110
Episode: 794 Total reward: 71.0 Training loss: 0.4738 Explore P: 0.0110
Episode: 795 Total reward: 199.0 Training loss: 0.1798 Explore P: 0.0110
Episode: 796 Total reward: 85.0 Training loss: 252.4544 Explore P: 0.0110
Episode: 797 Total reward: 199.0 Training loss: 0.3242 Explore P: 0.0110
Episode: 798 Total reward: 199.0 Training loss: 0.4080 Explore P: 0.0110
Episode: 799 Total reward: 199.0 Training loss: 0.2116 Explore P: 0.0109
Episode: 800 Total reward: 199.0 Training loss: 0.1778 Explore P: 0.0109
Episode: 801 Total reward: 199.0 Training loss: 1.7575 Explore P: 0.0109
Episode: 802 Total reward: 112.0 Training loss: 0.2666 Explore P: 0.0109
Episode: 803 Total reward: 120.0 Training loss: 0.1397 Explore P: 0.0109
Episode: 804 Total reward: 199.0 Training loss: 0.3245 Explore P: 0.0109
Episode: 805 Total reward: 199.0 Training loss: 0.3949 Explore P: 0.0108
Episode: 806 Total reward: 199.0 Training loss: 0.2952 Explore P: 0.0108
Episode: 807 Total reward: 199.0 Training loss: 0.3417 Explore P: 0.0108
Episode: 808 Total reward: 199.0 Training loss: 0.1763 Explore P: 0.0108
Episode: 809 Total reward: 199.0 Training loss: 0.1658 Explore P: 0.0108
Episode: 810 Total reward: 199.0 Training loss: 0.2005 Explore P: 0.0108
Episode: 811 Total reward: 199.0 Training loss: 2.1528 Explore P: 0.0107
Episode: 812 Total reward: 199.0 Training loss: 0.4954 Explore P: 0.0107
Episode: 813 Total reward: 199.0 Training loss: 0.1893 Explore P: 0.0107
Episode: 814 Total reward: 199.0 Training loss: 0.2310 Explore P: 0.0107
Episode: 815 Total reward: 118.0 Training loss: 258.1599 Explore P: 0.0107
Episode: 816 Total reward: 199.0 Training loss: 0.1393 Explore P: 0.0107
Episode: 817 Total reward: 199.0 Training loss: 0.2050 Explore P: 0.0107
Episode: 818 Total reward: 199.0 Training loss: 0.4117 Explore P: 0.0107
Episode: 819 Total reward: 86.0 Training loss: 267.9863 Explore P: 0.0106
Episode: 820 Total reward: 199.0 Training loss: 0.2043 Explore P: 0.0106
Episode: 821 Total reward: 185.0 Training loss: 0.4347 Explore P: 0.0106
Episode: 822 Total reward: 199.0 Training loss: 0.4697 Explore P: 0.0106
Episode: 823 Total reward: 80.0 Training loss: 0.4544 Explore P: 0.0106
Episode: 824 Total reward: 190.0 Training loss: 0.3538 Explore P: 0.0106
Episode: 825 Total reward: 87.0 Training loss: 0.4244 Explore P: 0.0106
Episode: 826 Total reward: 74.0 Training loss: 0.3475 Explore P: 0.0106
Episode: 827 Total reward: 102.0 Training loss: 1.4424 Explore P: 0.0106
Episode: 828 Total reward: 81.0 Training loss: 0.3188 Explore P: 0.0106
Episode: 829 Total reward: 199.0 Training loss: 0.4307 Explore P: 0.0106
Episode: 830 Total reward: 199.0 Training loss: 0.4398 Explore P: 0.0106
Episode: 831 Total reward: 199.0 Training loss: 0.1355 Explore P: 0.0105
Episode: 832 Total reward: 199.0 Training loss: 0.4279 Explore P: 0.0105
Episode: 833 Total reward: 84.0 Training loss: 0.2684 Explore P: 0.0105
Episode: 834 Total reward: 78.0 Training loss: 0.4444 Explore P: 0.0105
Episode: 835 Total reward: 164.0 Training loss: 0.2564 Explore P: 0.0105
Episode: 836 Total reward: 199.0 Training loss: 0.6094 Explore P: 0.0105
Episode: 837 Total reward: 127.0 Training loss: 0.3166 Explore P: 0.0105
Episode: 838 Total reward: 155.0 Training loss: 0.2480 Explore P: 0.0105
Episode: 839 Total reward: 199.0 Training loss: 0.2211 Explore P: 0.0105
Episode: 840 Total reward: 199.0 Training loss: 0.2846 Explore P: 0.0105
Episode: 841 Total reward: 199.0 Training loss: 0.1712 Explore P: 0.0105
Episode: 842 Total reward: 199.0 Training loss: 0.4236 Explore P: 0.0105
Episode: 843 Total reward: 199.0 Training loss: 0.2850 Explore P: 0.0104
Episode: 844 Total reward: 199.0 Training loss: 0.0822 Explore P: 0.0104
Episode: 845 Total reward: 199.0 Training loss: 0.1796 Explore P: 0.0104
Episode: 846 Total reward: 199.0 Training loss: 0.1607 Explore P: 0.0104
Episode: 847 Total reward: 199.0 Training loss: 0.1953 Explore P: 0.0104
Episode: 848 Total reward: 199.0 Training loss: 0.1087 Explore P: 0.0104
Episode: 849 Total reward: 199.0 Training loss: 0.1875 Explore P: 0.0104
Episode: 850 Total reward: 199.0 Training loss: 0.1620 Explore P: 0.0104
Episode: 851 Total reward: 199.0 Training loss: 270.5756 Explore P: 0.0104
Episode: 852 Total reward: 199.0 Training loss: 0.2723 Explore P: 0.0104
Episode: 853 Total reward: 199.0 Training loss: 266.9437 Explore P: 0.0104
Episode: 854 Total reward: 199.0 Training loss: 0.2700 Explore P: 0.0104
Episode: 855 Total reward: 199.0 Training loss: 0.3117 Explore P: 0.0103
Episode: 856 Total reward: 199.0 Training loss: 0.2011 Explore P: 0.0103
Episode: 857 Total reward: 199.0 Training loss: 0.2281 Explore P: 0.0103
Episode: 858 Total reward: 199.0 Training loss: 0.3107 Explore P: 0.0103
Episode: 859 Total reward: 199.0 Training loss: 0.2320 Explore P: 0.0103
Episode: 860 Total reward: 199.0 Training loss: 0.2264 Explore P: 0.0103
Episode: 861 Total reward: 199.0 Training loss: 0.2486 Explore P: 0.0103
Episode: 862 Total reward: 199.0 Training loss: 0.2039 Explore P: 0.0103
Episode: 863 Total reward: 199.0 Training loss: 0.3088 Explore P: 0.0103
Episode: 864 Total reward: 199.0 Training loss: 0.2381 Explore P: 0.0103
Episode: 865 Total reward: 199.0 Training loss: 3.5636 Explore P: 0.0103
Episode: 866 Total reward: 199.0 Training loss: 0.2306 Explore P: 0.0103
Episode: 867 Total reward: 199.0 Training loss: 282.7467 Explore P: 0.0103
Episode: 868 Total reward: 199.0 Training loss: 0.4679 Explore P: 0.0103
Episode: 869 Total reward: 199.0 Training loss: 0.1703 Explore P: 0.0103
Episode: 870 Total reward: 199.0 Training loss: 0.3282 Explore P: 0.0103
Episode: 871 Total reward: 199.0 Training loss: 0.3089 Explore P: 0.0103
Episode: 872 Total reward: 199.0 Training loss: 0.2299 Explore P: 0.0102
Episode: 873 Total reward: 199.0 Training loss: 0.2783 Explore P: 0.0102
Episode: 874 Total reward: 199.0 Training loss: 0.3803 Explore P: 0.0102
Episode: 875 Total reward: 199.0 Training loss: 0.3629 Explore P: 0.0102
Episode: 876 Total reward: 199.0 Training loss: 0.3799 Explore P: 0.0102
Episode: 877 Total reward: 199.0 Training loss: 5.2099 Explore P: 0.0102
Episode: 878 Total reward: 199.0 Training loss: 0.2571 Explore P: 0.0102
Episode: 879 Total reward: 199.0 Training loss: 0.3299 Explore P: 0.0102
Episode: 880 Total reward: 199.0 Training loss: 0.2526 Explore P: 0.0102
Episode: 881 Total reward: 199.0 Training loss: 0.2254 Explore P: 0.0102
Episode: 882 Total reward: 199.0 Training loss: 0.1619 Explore P: 0.0102
Episode: 883 Total reward: 199.0 Training loss: 0.3425 Explore P: 0.0102
Episode: 884 Total reward: 199.0 Training loss: 0.2534 Explore P: 0.0102
Episode: 885 Total reward: 199.0 Training loss: 0.2593 Explore P: 0.0102
Episode: 886 Total reward: 199.0 Training loss: 0.2814 Explore P: 0.0102
Episode: 887 Total reward: 199.0 Training loss: 0.2478 Explore P: 0.0102
Episode: 888 Total reward: 199.0 Training loss: 0.3338 Explore P: 0.0102
Episode: 889 Total reward: 199.0 Training loss: 0.3070 Explore P: 0.0102
Episode: 890 Total reward: 199.0 Training loss: 0.1872 Explore P: 0.0102
Episode: 891 Total reward: 199.0 Training loss: 0.2650 Explore P: 0.0102
Episode: 892 Total reward: 199.0 Training loss: 0.2429 Explore P: 0.0102
Episode: 893 Total reward: 199.0 Training loss: 0.2607 Explore P: 0.0102
Episode: 894 Total reward: 199.0 Training loss: 0.2227 Explore P: 0.0102
Episode: 895 Total reward: 199.0 Training loss: 0.1276 Explore P: 0.0102
Episode: 896 Total reward: 199.0 Training loss: 294.8071 Explore P: 0.0102
Episode: 897 Total reward: 199.0 Training loss: 0.2316 Explore P: 0.0102
Episode: 898 Total reward: 199.0 Training loss: 0.2390 Explore P: 0.0101
Episode: 899 Total reward: 199.0 Training loss: 322.5065 Explore P: 0.0101
Episode: 900 Total reward: 199.0 Training loss: 0.1451 Explore P: 0.0101
Episode: 901 Total reward: 199.0 Training loss: 0.2544 Explore P: 0.0101
Episode: 902 Total reward: 199.0 Training loss: 73.8215 Explore P: 0.0101
Episode: 903 Total reward: 199.0 Training loss: 0.1156 Explore P: 0.0101
Episode: 904 Total reward: 199.0 Training loss: 0.1319 Explore P: 0.0101
Episode: 905 Total reward: 199.0 Training loss: 0.0995 Explore P: 0.0101
Episode: 906 Total reward: 199.0 Training loss: 0.1164 Explore P: 0.0101
Episode: 907 Total reward: 199.0 Training loss: 0.1961 Explore P: 0.0101
Episode: 908 Total reward: 199.0 Training loss: 0.1933 Explore P: 0.0101
Episode: 909 Total reward: 199.0 Training loss: 0.1926 Explore P: 0.0101
Episode: 910 Total reward: 199.0 Training loss: 43.3737 Explore P: 0.0101
Episode: 911 Total reward: 199.0 Training loss: 0.2126 Explore P: 0.0101
Episode: 912 Total reward: 199.0 Training loss: 0.2718 Explore P: 0.0101
Episode: 913 Total reward: 199.0 Training loss: 0.1623 Explore P: 0.0101
Episode: 914 Total reward: 199.0 Training loss: 0.2579 Explore P: 0.0101
Episode: 915 Total reward: 199.0 Training loss: 0.1821 Explore P: 0.0101
Episode: 916 Total reward: 199.0 Training loss: 0.2138 Explore P: 0.0101
Episode: 917 Total reward: 199.0 Training loss: 0.4135 Explore P: 0.0101
Episode: 918 Total reward: 199.0 Training loss: 0.1564 Explore P: 0.0101
Episode: 919 Total reward: 199.0 Training loss: 0.2347 Explore P: 0.0101
Episode: 920 Total reward: 199.0 Training loss: 0.1607 Explore P: 0.0101
Episode: 921 Total reward: 199.0 Training loss: 0.2647 Explore P: 0.0101
Episode: 922 Total reward: 199.0 Training loss: 118.4845 Explore P: 0.0101
Episode: 923 Total reward: 199.0 Training loss: 0.1132 Explore P: 0.0101
Episode: 924 Total reward: 199.0 Training loss: 91.3929 Explore P: 0.0101
Episode: 925 Total reward: 199.0 Training loss: 0.1738 Explore P: 0.0101
Episode: 926 Total reward: 199.0 Training loss: 0.1332 Explore P: 0.0101
Episode: 927 Total reward: 199.0 Training loss: 38.8202 Explore P: 0.0101
Episode: 928 Total reward: 199.0 Training loss: 0.1266 Explore P: 0.0101
Episode: 929 Total reward: 199.0 Training loss: 0.2206 Explore P: 0.0101
Episode: 930 Total reward: 199.0 Training loss: 41.9857 Explore P: 0.0101
Episode: 931 Total reward: 199.0 Training loss: 72.9795 Explore P: 0.0101
Episode: 932 Total reward: 199.0 Training loss: 0.2195 Explore P: 0.0101
Episode: 933 Total reward: 199.0 Training loss: 0.1764 Explore P: 0.0101
Episode: 934 Total reward: 199.0 Training loss: 0.1160 Explore P: 0.0101
Episode: 935 Total reward: 199.0 Training loss: 0.2254 Explore P: 0.0101
Episode: 936 Total reward: 199.0 Training loss: 64.4262 Explore P: 0.0101
Episode: 937 Total reward: 199.0 Training loss: 0.2274 Explore P: 0.0101
Episode: 938 Total reward: 199.0 Training loss: 0.3135 Explore P: 0.0101
Episode: 939 Total reward: 199.0 Training loss: 0.3706 Explore P: 0.0101
Episode: 940 Total reward: 199.0 Training loss: 0.2173 Explore P: 0.0101
Episode: 941 Total reward: 199.0 Training loss: 0.0737 Explore P: 0.0101
Episode: 942 Total reward: 199.0 Training loss: 0.1045 Explore P: 0.0101
Episode: 943 Total reward: 199.0 Training loss: 0.3840 Explore P: 0.0101
Episode: 944 Total reward: 199.0 Training loss: 0.1716 Explore P: 0.0101
Episode: 945 Total reward: 199.0 Training loss: 0.0488 Explore P: 0.0101
Episode: 946 Total reward: 199.0 Training loss: 0.1102 Explore P: 0.0101
Episode: 947 Total reward: 199.0 Training loss: 0.1997 Explore P: 0.0101
Episode: 948 Total reward: 199.0 Training loss: 0.2284 Explore P: 0.0101
Episode: 949 Total reward: 199.0 Training loss: 0.1998 Explore P: 0.0101
Episode: 950 Total reward: 199.0 Training loss: 0.2903 Explore P: 0.0101
Episode: 951 Total reward: 199.0 Training loss: 0.1421 Explore P: 0.0101
Episode: 952 Total reward: 199.0 Training loss: 0.1047 Explore P: 0.0101
Episode: 953 Total reward: 199.0 Training loss: 0.1981 Explore P: 0.0100
Episode: 954 Total reward: 199.0 Training loss: 0.1399 Explore P: 0.0100
Episode: 955 Total reward: 199.0 Training loss: 124.6903 Explore P: 0.0100
Episode: 956 Total reward: 199.0 Training loss: 0.1622 Explore P: 0.0100
Episode: 957 Total reward: 199.0 Training loss: 0.1176 Explore P: 0.0100
Episode: 958 Total reward: 199.0 Training loss: 0.1054 Explore P: 0.0100
Episode: 959 Total reward: 199.0 Training loss: 0.1561 Explore P: 0.0100
Episode: 960 Total reward: 199.0 Training loss: 0.0836 Explore P: 0.0100
Episode: 961 Total reward: 199.0 Training loss: 0.3101 Explore P: 0.0100
Episode: 962 Total reward: 199.0 Training loss: 0.1052 Explore P: 0.0100
Episode: 963 Total reward: 199.0 Training loss: 31.3988 Explore P: 0.0100
Episode: 964 Total reward: 199.0 Training loss: 0.1292 Explore P: 0.0100
Episode: 965 Total reward: 199.0 Training loss: 0.1768 Explore P: 0.0100
Episode: 966 Total reward: 199.0 Training loss: 0.3004 Explore P: 0.0100
Episode: 967 Total reward: 199.0 Training loss: 0.3521 Explore P: 0.0100
Episode: 968 Total reward: 159.0 Training loss: 0.7399 Explore P: 0.0100
Episode: 969 Total reward: 83.0 Training loss: 1.4745 Explore P: 0.0100
Episode: 970 Total reward: 55.0 Training loss: 1.9396 Explore P: 0.0100
Episode: 971 Total reward: 56.0 Training loss: 1.8412 Explore P: 0.0100
Episode: 972 Total reward: 38.0 Training loss: 2.0362 Explore P: 0.0100
Episode: 973 Total reward: 44.0 Training loss: 596.9589 Explore P: 0.0100
Episode: 974 Total reward: 50.0 Training loss: 1.6000 Explore P: 0.0100
Episode: 975 Total reward: 45.0 Training loss: 1.4797 Explore P: 0.0100
Episode: 976 Total reward: 38.0 Training loss: 3.8015 Explore P: 0.0100
Episode: 977 Total reward: 31.0 Training loss: 3.0209 Explore P: 0.0100
Episode: 978 Total reward: 14.0 Training loss: 3.3322 Explore P: 0.0100
Episode: 979 Total reward: 13.0 Training loss: 3.9317 Explore P: 0.0100
Episode: 980 Total reward: 21.0 Training loss: 3.7661 Explore P: 0.0100
Episode: 981 Total reward: 12.0 Training loss: 2.8366 Explore P: 0.0100
Episode: 982 Total reward: 14.0 Training loss: 2.9231 Explore P: 0.0100
Episode: 983 Total reward: 12.0 Training loss: 4.3352 Explore P: 0.0100
Episode: 984 Total reward: 9.0 Training loss: 758.2007 Explore P: 0.0100
Episode: 985 Total reward: 11.0 Training loss: 3.7181 Explore P: 0.0100
Episode: 986 Total reward: 9.0 Training loss: 3.9504 Explore P: 0.0100
Episode: 987 Total reward: 14.0 Training loss: 3.7969 Explore P: 0.0100
Episode: 988 Total reward: 12.0 Training loss: 5.3795 Explore P: 0.0100
Episode: 989 Total reward: 12.0 Training loss: 3.9052 Explore P: 0.0100
Episode: 990 Total reward: 11.0 Training loss: 4.1285 Explore P: 0.0100
Episode: 991 Total reward: 13.0 Training loss: 3.9857 Explore P: 0.0100
Episode: 992 Total reward: 11.0 Training loss: 4.9310 Explore P: 0.0100
Episode: 993 Total reward: 11.0 Training loss: 3.3460 Explore P: 0.0100
Episode: 994 Total reward: 13.0 Training loss: 3.2020 Explore P: 0.0100
Episode: 995 Total reward: 11.0 Training loss: 5.2429 Explore P: 0.0100
Episode: 996 Total reward: 11.0 Training loss: 5.3359 Explore P: 0.0100
Episode: 997 Total reward: 8.0 Training loss: 3.9014 Explore P: 0.0100
Episode: 998 Total reward: 10.0 Training loss: 4.0779 Explore P: 0.0100
Episode: 999 Total reward: 10.0 Training loss: 4.5089 Explore P: 0.0100

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a rolling average over 10 episodes in blue.


In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

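def running_mean(x, N):
    # Moving average with window N, computed from differences of the cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The result has len(x) - N + 1 entries, one per full window
    return (cumsum[N:] - cumsum[:-N]) / N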
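
To see what the cumulative-sum trick computes, here is a small self-contained check on made-up numbers (the array below is illustrative, not taken from the training run). It compares the result with `np.convolve`, which is another common way to get the same moving average.

```python
import numpy as np

def running_mean(x, N):
    # Same function as in the cell above, repeated so this example runs on its own
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up rewards, only for illustration
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

smoothed = running_mean(rewards, 2)
print(smoothed)  # [1.5 2.5 3.5 4.5]

# Cross-check against a uniform-kernel convolution over full windows only
reference = np.convolve(rewards, np.ones(2) / 2, mode='valid')
print(np.allclose(smoothed, reference))  # True
```

The output is shorter than the input by N - 1 entries, which is why the plotting cell uses only the last `len(smoothed_rews)` episode numbers on the x-axis.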

In [12]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[12]:
<matplotlib.text.Text at 0x7f36f68887f0>

Testing

Let's check out how our trained agent plays the game.


In [13]:
test_episodes = 10
test_max_steps = 400
# Start a fresh episode and keep its initial observation as the first state
state = env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # range(test_episodes) runs all 10 test episodes
    for ep in range(test_episodes):
        t = 0
        while t < test_max_steps:
            env.render()
            
            # Get action from Q-network: pick the action with the largest Q-value
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                # End this test episode by leaving the while loop
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1
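
The test loop always takes the action with the largest Q-value, with no random exploration. As a minimal sketch of that choice, the snippet below uses made-up Q-values and a hypothetical helper `choose_action` (not part of the notebook) to show how greedy selection is the special case of an explore-or-exploit rule with exploration probability 0.

```python
import numpy as np

def choose_action(q_values, explore_p, rng):
    """Hypothetical helper: random action with probability explore_p, else greedy."""
    if rng.random() < explore_p:
        # Explore: pick any action index uniformly at random
        return int(rng.integers(q_values.shape[-1]))
    # Exploit: index of the largest Q-value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)

# Made-up Q-values for a single state, shaped (1, 2): one row, two actions
Qs = np.array([[0.7, 1.3]])

# With explore_p = 0 the choice is purely greedy, as in the test loop
greedy_action = choose_action(Qs, 0.0, rng)
print(greedy_action)  # 1
```

In Cart-Pole the two action indices are 0 (push the cart left) and 1 (push it right), so the greedy action here would push right.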

In [14]:
env.close()

Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small state vector like the one we're using here, though, you'd feed in the screen images and use convolutional layers to extract the state from them.
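
Before screen images reach the convolutional layers, they are usually shrunk and stacked so that the network can see motion. Here is a rough NumPy sketch of that idea. The `preprocess` function, the `FrameStack` class, the factor-of-two downsampling, and the random stand-in frame are all assumptions made for illustration; they are not code from this notebook or the exact preprocessing from the paper.

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Turn an RGB frame of shape (H, W, 3) into a smaller grayscale image."""
    # Average the color channels and scale pixel values to [0, 1]
    gray = frame.astype(np.float32).mean(axis=2) / 255.0
    # Crude downsampling: keep every second pixel in each direction
    return gray[::2, ::2]

class FrameStack:
    """Hypothetical helper that keeps the last n preprocessed frames as one state."""
    def __init__(self, n=4):
        self.frames = deque(maxlen=n)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of an episode
        first = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)
        return self.state()

    def push(self, frame):
        # Add the newest frame; the oldest one drops out automatically
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        # Stack along the last axis so the frames act like image channels
        return np.stack(list(self.frames), axis=-1)

# Random stand-in for a 210 x 160 RGB game screen; a real one would come from the environment
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(n=4)
state = stack.reset(fake_frame)
print(state.shape)  # (105, 80, 4)
```

A stacked array like this would take the place of the four-number Cart-Pole state as the input to the Q-network.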

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper to get you started: http://www.davidqiu.com:8888/research/nature14236.pdf