Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [2]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included gym as a submodule, so you can run git submodule --init --recursive to pull the contents into the gym repo.


In [3]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-19 14:59:10,636] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing in an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [4]:
env.reset()     # why here?  clears old envrionment?  
rewards = []    # why here? we will be printing rewards later

for _ in range(100):
    
    # env.render()  # we want a human friendly rendering
    
    # why here? we want to actually step through our enviornment 
    
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    
    # for tracking later
    rewards.append(reward)
    
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().


In [5]:
env.close()

If you ran the simulation above, we can look at the rewards:


In [6]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max{Q(s', a')} $$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [7]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        
        # state inputs to the Q-network
        with tf.variable_scope(name):
            
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [8]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor

Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [9]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number experiences to pretrain the memory

In [10]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [11]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [13]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []

with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        
        total_reward = 0
        t = 0
        
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 8.0 Training loss: 1.1944 Explore P: 0.9992
Episode: 2 Total reward: 23.0 Training loss: 1.0848 Explore P: 0.9969
Episode: 3 Total reward: 55.0 Training loss: 1.0756 Explore P: 0.9915
Episode: 4 Total reward: 25.0 Training loss: 1.0770 Explore P: 0.9891
Episode: 5 Total reward: 19.0 Training loss: 1.0433 Explore P: 0.9872
Episode: 6 Total reward: 26.0 Training loss: 1.0936 Explore P: 0.9847
Episode: 7 Total reward: 38.0 Training loss: 1.0669 Explore P: 0.9810
Episode: 8 Total reward: 26.0 Training loss: 1.1056 Explore P: 0.9785
Episode: 9 Total reward: 20.0 Training loss: 1.2313 Explore P: 0.9765
Episode: 10 Total reward: 13.0 Training loss: 1.1771 Explore P: 0.9753
Episode: 11 Total reward: 27.0 Training loss: 0.9951 Explore P: 0.9727
Episode: 12 Total reward: 14.0 Training loss: 1.2548 Explore P: 0.9713
Episode: 13 Total reward: 10.0 Training loss: 1.1458 Explore P: 0.9704
Episode: 14 Total reward: 21.0 Training loss: 1.1287 Explore P: 0.9683
Episode: 15 Total reward: 11.0 Training loss: 1.1773 Explore P: 0.9673
Episode: 16 Total reward: 21.0 Training loss: 1.0899 Explore P: 0.9653
Episode: 17 Total reward: 12.0 Training loss: 1.1232 Explore P: 0.9641
Episode: 18 Total reward: 64.0 Training loss: 1.4647 Explore P: 0.9580
Episode: 19 Total reward: 9.0 Training loss: 1.2611 Explore P: 0.9572
Episode: 20 Total reward: 36.0 Training loss: 2.0733 Explore P: 0.9538
Episode: 21 Total reward: 14.0 Training loss: 1.5664 Explore P: 0.9525
Episode: 22 Total reward: 17.0 Training loss: 1.4667 Explore P: 0.9509
Episode: 23 Total reward: 18.0 Training loss: 1.7136 Explore P: 0.9492
Episode: 24 Total reward: 11.0 Training loss: 1.3807 Explore P: 0.9481
Episode: 25 Total reward: 9.0 Training loss: 1.6497 Explore P: 0.9473
Episode: 26 Total reward: 21.0 Training loss: 2.3149 Explore P: 0.9453
Episode: 27 Total reward: 13.0 Training loss: 2.4973 Explore P: 0.9441
Episode: 28 Total reward: 20.0 Training loss: 2.2012 Explore P: 0.9423
Episode: 29 Total reward: 11.0 Training loss: 2.3433 Explore P: 0.9412
Episode: 30 Total reward: 35.0 Training loss: 4.9798 Explore P: 0.9380
Episode: 31 Total reward: 15.0 Training loss: 2.5154 Explore P: 0.9366
Episode: 32 Total reward: 19.0 Training loss: 2.9396 Explore P: 0.9348
Episode: 33 Total reward: 38.0 Training loss: 2.7251 Explore P: 0.9313
Episode: 34 Total reward: 18.0 Training loss: 3.3707 Explore P: 0.9297
Episode: 35 Total reward: 14.0 Training loss: 5.8757 Explore P: 0.9284
Episode: 36 Total reward: 14.0 Training loss: 9.5507 Explore P: 0.9271
Episode: 37 Total reward: 13.0 Training loss: 2.9642 Explore P: 0.9259
Episode: 38 Total reward: 20.0 Training loss: 4.6957 Explore P: 0.9241
Episode: 39 Total reward: 23.0 Training loss: 4.4398 Explore P: 0.9220
Episode: 40 Total reward: 11.0 Training loss: 4.6925 Explore P: 0.9210
Episode: 41 Total reward: 18.0 Training loss: 4.9714 Explore P: 0.9193
Episode: 42 Total reward: 9.0 Training loss: 5.2239 Explore P: 0.9185
Episode: 43 Total reward: 31.0 Training loss: 8.1592 Explore P: 0.9157
Episode: 44 Total reward: 15.0 Training loss: 17.9874 Explore P: 0.9143
Episode: 45 Total reward: 12.0 Training loss: 23.5647 Explore P: 0.9133
Episode: 46 Total reward: 12.0 Training loss: 8.7046 Explore P: 0.9122
Episode: 47 Total reward: 14.0 Training loss: 9.4538 Explore P: 0.9109
Episode: 48 Total reward: 11.0 Training loss: 6.2654 Explore P: 0.9099
Episode: 49 Total reward: 8.0 Training loss: 6.7644 Explore P: 0.9092
Episode: 50 Total reward: 10.0 Training loss: 6.9257 Explore P: 0.9083
Episode: 51 Total reward: 24.0 Training loss: 4.9465 Explore P: 0.9061
Episode: 52 Total reward: 13.0 Training loss: 8.6644 Explore P: 0.9050
Episode: 53 Total reward: 15.0 Training loss: 13.2232 Explore P: 0.9036
Episode: 54 Total reward: 16.0 Training loss: 72.8645 Explore P: 0.9022
Episode: 55 Total reward: 24.0 Training loss: 5.6434 Explore P: 0.9001
Episode: 56 Total reward: 28.0 Training loss: 12.5809 Explore P: 0.8976
Episode: 57 Total reward: 25.0 Training loss: 6.3197 Explore P: 0.8954
Episode: 58 Total reward: 10.0 Training loss: 14.4766 Explore P: 0.8945
Episode: 59 Total reward: 15.0 Training loss: 19.2250 Explore P: 0.8932
Episode: 60 Total reward: 9.0 Training loss: 7.7733 Explore P: 0.8924
Episode: 61 Total reward: 9.0 Training loss: 12.7284 Explore P: 0.8916
Episode: 62 Total reward: 14.0 Training loss: 20.9044 Explore P: 0.8903
Episode: 63 Total reward: 21.0 Training loss: 63.9284 Explore P: 0.8885
Episode: 64 Total reward: 25.0 Training loss: 8.4411 Explore P: 0.8863
Episode: 65 Total reward: 20.0 Training loss: 31.3512 Explore P: 0.8845
Episode: 66 Total reward: 15.0 Training loss: 37.5500 Explore P: 0.8832
Episode: 67 Total reward: 15.0 Training loss: 8.4937 Explore P: 0.8819
Episode: 68 Total reward: 21.0 Training loss: 10.1790 Explore P: 0.8801
Episode: 69 Total reward: 9.0 Training loss: 10.0213 Explore P: 0.8793
Episode: 70 Total reward: 19.0 Training loss: 12.0676 Explore P: 0.8777
Episode: 71 Total reward: 11.0 Training loss: 10.0365 Explore P: 0.8767
Episode: 72 Total reward: 20.0 Training loss: 50.6870 Explore P: 0.8750
Episode: 73 Total reward: 14.0 Training loss: 41.1670 Explore P: 0.8738
Episode: 74 Total reward: 12.0 Training loss: 48.9869 Explore P: 0.8727
Episode: 75 Total reward: 13.0 Training loss: 12.4277 Explore P: 0.8716
Episode: 76 Total reward: 56.0 Training loss: 47.1972 Explore P: 0.8668
Episode: 77 Total reward: 9.0 Training loss: 58.3822 Explore P: 0.8660
Episode: 78 Total reward: 35.0 Training loss: 73.4534 Explore P: 0.8630
Episode: 79 Total reward: 17.0 Training loss: 11.8595 Explore P: 0.8616
Episode: 80 Total reward: 9.0 Training loss: 10.7710 Explore P: 0.8608
Episode: 81 Total reward: 10.0 Training loss: 68.3091 Explore P: 0.8600
Episode: 82 Total reward: 10.0 Training loss: 83.9197 Explore P: 0.8591
Episode: 83 Total reward: 17.0 Training loss: 11.1402 Explore P: 0.8577
Episode: 84 Total reward: 22.0 Training loss: 151.6364 Explore P: 0.8558
Episode: 85 Total reward: 16.0 Training loss: 45.8016 Explore P: 0.8545
Episode: 86 Total reward: 21.0 Training loss: 126.8311 Explore P: 0.8527
Episode: 87 Total reward: 24.0 Training loss: 101.5969 Explore P: 0.8507
Episode: 88 Total reward: 12.0 Training loss: 9.7473 Explore P: 0.8497
Episode: 89 Total reward: 8.0 Training loss: 101.7838 Explore P: 0.8490
Episode: 90 Total reward: 44.0 Training loss: 53.6324 Explore P: 0.8453
Episode: 91 Total reward: 15.0 Training loss: 55.1948 Explore P: 0.8441
Episode: 92 Total reward: 25.0 Training loss: 73.3721 Explore P: 0.8420
Episode: 93 Total reward: 12.0 Training loss: 31.8694 Explore P: 0.8410
Episode: 94 Total reward: 10.0 Training loss: 9.5745 Explore P: 0.8401
Episode: 95 Total reward: 13.0 Training loss: 123.8642 Explore P: 0.8391
Episode: 96 Total reward: 12.0 Training loss: 239.9898 Explore P: 0.8381
Episode: 97 Total reward: 10.0 Training loss: 108.6084 Explore P: 0.8372
Episode: 98 Total reward: 8.0 Training loss: 66.1717 Explore P: 0.8366
Episode: 99 Total reward: 17.0 Training loss: 9.1921 Explore P: 0.8352
Episode: 100 Total reward: 13.0 Training loss: 10.0604 Explore P: 0.8341
Episode: 101 Total reward: 12.0 Training loss: 9.1835 Explore P: 0.8331
Episode: 102 Total reward: 10.0 Training loss: 60.8152 Explore P: 0.8323
Episode: 103 Total reward: 12.0 Training loss: 13.9947 Explore P: 0.8313
Episode: 104 Total reward: 21.0 Training loss: 13.9290 Explore P: 0.8296
Episode: 105 Total reward: 23.0 Training loss: 9.8988 Explore P: 0.8277
Episode: 106 Total reward: 18.0 Training loss: 13.1172 Explore P: 0.8262
Episode: 107 Total reward: 13.0 Training loss: 80.5732 Explore P: 0.8252
Episode: 108 Total reward: 18.0 Training loss: 86.3419 Explore P: 0.8237
Episode: 109 Total reward: 13.0 Training loss: 14.3605 Explore P: 0.8227
Episode: 110 Total reward: 10.0 Training loss: 12.6889 Explore P: 0.8218
Episode: 111 Total reward: 17.0 Training loss: 169.5447 Explore P: 0.8205
Episode: 112 Total reward: 16.0 Training loss: 191.1227 Explore P: 0.8192
Episode: 113 Total reward: 17.0 Training loss: 13.0325 Explore P: 0.8178
Episode: 114 Total reward: 16.0 Training loss: 198.1335 Explore P: 0.8165
Episode: 115 Total reward: 15.0 Training loss: 13.9331 Explore P: 0.8153
Episode: 116 Total reward: 9.0 Training loss: 16.5919 Explore P: 0.8146
Episode: 117 Total reward: 13.0 Training loss: 87.1153 Explore P: 0.8135
Episode: 118 Total reward: 25.0 Training loss: 12.5661 Explore P: 0.8115
Episode: 119 Total reward: 29.0 Training loss: 140.8273 Explore P: 0.8092
Episode: 120 Total reward: 28.0 Training loss: 95.2029 Explore P: 0.8070
Episode: 121 Total reward: 9.0 Training loss: 96.1758 Explore P: 0.8062
Episode: 122 Total reward: 13.0 Training loss: 89.6949 Explore P: 0.8052
Episode: 123 Total reward: 13.0 Training loss: 337.5701 Explore P: 0.8042
Episode: 124 Total reward: 16.0 Training loss: 14.2868 Explore P: 0.8029
Episode: 125 Total reward: 15.0 Training loss: 166.0446 Explore P: 0.8017
Episode: 126 Total reward: 33.0 Training loss: 8.9850 Explore P: 0.7991
Episode: 127 Total reward: 13.0 Training loss: 12.2823 Explore P: 0.7981
Episode: 128 Total reward: 8.0 Training loss: 202.2726 Explore P: 0.7975
Episode: 129 Total reward: 10.0 Training loss: 15.4579 Explore P: 0.7967
Episode: 130 Total reward: 9.0 Training loss: 16.1194 Explore P: 0.7960
Episode: 131 Total reward: 9.0 Training loss: 8.9409 Explore P: 0.7953
Episode: 132 Total reward: 10.0 Training loss: 13.5691 Explore P: 0.7945
Episode: 133 Total reward: 12.0 Training loss: 193.9561 Explore P: 0.7935
Episode: 134 Total reward: 12.0 Training loss: 8.3608 Explore P: 0.7926
Episode: 135 Total reward: 9.0 Training loss: 83.9630 Explore P: 0.7919
Episode: 136 Total reward: 10.0 Training loss: 11.0029 Explore P: 0.7911
Episode: 137 Total reward: 11.0 Training loss: 18.8782 Explore P: 0.7902
Episode: 138 Total reward: 11.0 Training loss: 138.6567 Explore P: 0.7894
Episode: 139 Total reward: 17.0 Training loss: 137.3711 Explore P: 0.7881
Episode: 140 Total reward: 23.0 Training loss: 152.7785 Explore P: 0.7863
Episode: 141 Total reward: 8.0 Training loss: 57.8495 Explore P: 0.7857
Episode: 142 Total reward: 11.0 Training loss: 205.0596 Explore P: 0.7848
Episode: 143 Total reward: 9.0 Training loss: 12.3889 Explore P: 0.7841
Episode: 144 Total reward: 22.0 Training loss: 74.3531 Explore P: 0.7824
Episode: 145 Total reward: 16.0 Training loss: 148.4404 Explore P: 0.7812
Episode: 146 Total reward: 12.0 Training loss: 324.1254 Explore P: 0.7802
Episode: 147 Total reward: 8.0 Training loss: 146.9432 Explore P: 0.7796
Episode: 148 Total reward: 12.0 Training loss: 299.3575 Explore P: 0.7787
Episode: 149 Total reward: 11.0 Training loss: 382.0833 Explore P: 0.7779
Episode: 150 Total reward: 10.0 Training loss: 12.1958 Explore P: 0.7771
Episode: 151 Total reward: 14.0 Training loss: 250.8840 Explore P: 0.7760
Episode: 152 Total reward: 18.0 Training loss: 460.4238 Explore P: 0.7746
Episode: 153 Total reward: 16.0 Training loss: 12.0860 Explore P: 0.7734
Episode: 154 Total reward: 10.0 Training loss: 11.7262 Explore P: 0.7727
Episode: 155 Total reward: 18.0 Training loss: 250.2154 Explore P: 0.7713
Episode: 156 Total reward: 8.0 Training loss: 18.4601 Explore P: 0.7707
Episode: 157 Total reward: 8.0 Training loss: 13.8884 Explore P: 0.7701
Episode: 158 Total reward: 21.0 Training loss: 182.4766 Explore P: 0.7685
Episode: 159 Total reward: 12.0 Training loss: 73.0686 Explore P: 0.7676
Episode: 160 Total reward: 11.0 Training loss: 104.3964 Explore P: 0.7667
Episode: 161 Total reward: 9.0 Training loss: 206.7625 Explore P: 0.7660
Episode: 162 Total reward: 17.0 Training loss: 12.2240 Explore P: 0.7648
Episode: 163 Total reward: 11.0 Training loss: 10.4401 Explore P: 0.7639
Episode: 164 Total reward: 12.0 Training loss: 316.0230 Explore P: 0.7630
Episode: 165 Total reward: 9.0 Training loss: 6.6244 Explore P: 0.7624
Episode: 166 Total reward: 70.0 Training loss: 10.7122 Explore P: 0.7571
Episode: 167 Total reward: 30.0 Training loss: 69.2707 Explore P: 0.7549
Episode: 168 Total reward: 14.0 Training loss: 113.4271 Explore P: 0.7538
Episode: 169 Total reward: 28.0 Training loss: 348.4683 Explore P: 0.7517
Episode: 170 Total reward: 35.0 Training loss: 139.6163 Explore P: 0.7492
Episode: 171 Total reward: 11.0 Training loss: 66.1172 Explore P: 0.7483
Episode: 172 Total reward: 10.0 Training loss: 162.5798 Explore P: 0.7476
Episode: 173 Total reward: 14.0 Training loss: 10.1697 Explore P: 0.7466
Episode: 174 Total reward: 12.0 Training loss: 12.1899 Explore P: 0.7457
Episode: 175 Total reward: 19.0 Training loss: 8.4877 Explore P: 0.7443
Episode: 176 Total reward: 31.0 Training loss: 7.4924 Explore P: 0.7420
Episode: 177 Total reward: 14.0 Training loss: 76.1942 Explore P: 0.7410
Episode: 178 Total reward: 10.0 Training loss: 71.0788 Explore P: 0.7403
Episode: 179 Total reward: 11.0 Training loss: 9.6621 Explore P: 0.7395
Episode: 180 Total reward: 12.0 Training loss: 289.7865 Explore P: 0.7386
Episode: 181 Total reward: 17.0 Training loss: 68.5686 Explore P: 0.7373
Episode: 182 Total reward: 14.0 Training loss: 185.4832 Explore P: 0.7363
Episode: 183 Total reward: 9.0 Training loss: 7.6132 Explore P: 0.7357
Episode: 184 Total reward: 14.0 Training loss: 73.6357 Explore P: 0.7347
Episode: 185 Total reward: 15.0 Training loss: 207.3587 Explore P: 0.7336
Episode: 186 Total reward: 24.0 Training loss: 157.0099 Explore P: 0.7318
Episode: 187 Total reward: 11.0 Training loss: 189.2686 Explore P: 0.7310
Episode: 188 Total reward: 39.0 Training loss: 139.8520 Explore P: 0.7282
Episode: 189 Total reward: 13.0 Training loss: 72.5886 Explore P: 0.7273
Episode: 190 Total reward: 11.0 Training loss: 63.7528 Explore P: 0.7265
Episode: 191 Total reward: 13.0 Training loss: 80.8556 Explore P: 0.7256
Episode: 192 Total reward: 17.0 Training loss: 7.6656 Explore P: 0.7244
Episode: 193 Total reward: 16.0 Training loss: 177.8219 Explore P: 0.7232
Episode: 194 Total reward: 9.0 Training loss: 63.4370 Explore P: 0.7226
Episode: 195 Total reward: 13.0 Training loss: 88.1313 Explore P: 0.7217
Episode: 196 Total reward: 13.0 Training loss: 6.7245 Explore P: 0.7207
Episode: 197 Total reward: 13.0 Training loss: 154.4765 Explore P: 0.7198
Episode: 198 Total reward: 8.0 Training loss: 89.3238 Explore P: 0.7192
Episode: 199 Total reward: 14.0 Training loss: 94.3158 Explore P: 0.7183
Episode: 200 Total reward: 14.0 Training loss: 4.3082 Explore P: 0.7173
Episode: 201 Total reward: 10.0 Training loss: 5.4659 Explore P: 0.7166
Episode: 202 Total reward: 15.0 Training loss: 60.1253 Explore P: 0.7155
Episode: 203 Total reward: 14.0 Training loss: 66.5905 Explore P: 0.7145
Episode: 204 Total reward: 20.0 Training loss: 63.9367 Explore P: 0.7131
Episode: 205 Total reward: 14.0 Training loss: 4.2448 Explore P: 0.7121
Episode: 206 Total reward: 9.0 Training loss: 63.1859 Explore P: 0.7115
Episode: 207 Total reward: 13.0 Training loss: 191.8483 Explore P: 0.7106
Episode: 208 Total reward: 11.0 Training loss: 69.7428 Explore P: 0.7098
Episode: 209 Total reward: 13.0 Training loss: 4.5114 Explore P: 0.7089
Episode: 210 Total reward: 12.0 Training loss: 4.4401 Explore P: 0.7081
Episode: 211 Total reward: 9.0 Training loss: 5.3239 Explore P: 0.7074
Episode: 212 Total reward: 15.0 Training loss: 131.5311 Explore P: 0.7064
Episode: 213 Total reward: 12.0 Training loss: 169.0290 Explore P: 0.7056
Episode: 214 Total reward: 8.0 Training loss: 4.9046 Explore P: 0.7050
Episode: 215 Total reward: 8.0 Training loss: 83.0435 Explore P: 0.7044
Episode: 216 Total reward: 9.0 Training loss: 116.4127 Explore P: 0.7038
Episode: 217 Total reward: 9.0 Training loss: 167.3936 Explore P: 0.7032
Episode: 218 Total reward: 11.0 Training loss: 128.6324 Explore P: 0.7024
Episode: 219 Total reward: 12.0 Training loss: 3.7247 Explore P: 0.7016
Episode: 220 Total reward: 10.0 Training loss: 3.2211 Explore P: 0.7009
Episode: 221 Total reward: 12.0 Training loss: 5.2426 Explore P: 0.7001
Episode: 222 Total reward: 11.0 Training loss: 5.2191 Explore P: 0.6993
Episode: 223 Total reward: 10.0 Training loss: 63.9852 Explore P: 0.6986
Episode: 224 Total reward: 7.0 Training loss: 53.7649 Explore P: 0.6981
Episode: 225 Total reward: 13.0 Training loss: 119.3155 Explore P: 0.6973
Episode: 226 Total reward: 9.0 Training loss: 4.1296 Explore P: 0.6966
Episode: 227 Total reward: 12.0 Training loss: 94.8914 Explore P: 0.6958
Episode: 228 Total reward: 12.0 Training loss: 75.0506 Explore P: 0.6950
Episode: 229 Total reward: 18.0 Training loss: 101.5248 Explore P: 0.6938
Episode: 230 Total reward: 11.0 Training loss: 72.9518 Explore P: 0.6930
Episode: 231 Total reward: 9.0 Training loss: 5.2160 Explore P: 0.6924
Episode: 232 Total reward: 13.0 Training loss: 3.9238 Explore P: 0.6915
Episode: 233 Total reward: 9.0 Training loss: 150.4140 Explore P: 0.6909
Episode: 234 Total reward: 8.0 Training loss: 50.2084 Explore P: 0.6903
Episode: 235 Total reward: 16.0 Training loss: 264.9699 Explore P: 0.6893
Episode: 236 Total reward: 12.0 Training loss: 66.0265 Explore P: 0.6884
Episode: 237 Total reward: 10.0 Training loss: 127.7283 Explore P: 0.6878
Episode: 238 Total reward: 20.0 Training loss: 66.5342 Explore P: 0.6864
Episode: 239 Total reward: 9.0 Training loss: 4.6135 Explore P: 0.6858
Episode: 240 Total reward: 26.0 Training loss: 3.4406 Explore P: 0.6841
Episode: 241 Total reward: 14.0 Training loss: 3.7424 Explore P: 0.6831
Episode: 242 Total reward: 9.0 Training loss: 160.3315 Explore P: 0.6825
Episode: 243 Total reward: 15.0 Training loss: 206.6206 Explore P: 0.6815
Episode: 244 Total reward: 12.0 Training loss: 2.8463 Explore P: 0.6807
Episode: 245 Total reward: 13.0 Training loss: 43.5978 Explore P: 0.6798
Episode: 246 Total reward: 8.0 Training loss: 4.3071 Explore P: 0.6793
Episode: 247 Total reward: 10.0 Training loss: 2.3447 Explore P: 0.6786
Episode: 248 Total reward: 16.0 Training loss: 131.1886 Explore P: 0.6775
Episode: 249 Total reward: 16.0 Training loss: 119.4371 Explore P: 0.6765
Episode: 250 Total reward: 9.0 Training loss: 57.1069 Explore P: 0.6759
Episode: 251 Total reward: 14.0 Training loss: 75.5360 Explore P: 0.6749
Episode: 252 Total reward: 16.0 Training loss: 228.5315 Explore P: 0.6739
Episode: 253 Total reward: 14.0 Training loss: 4.5446 Explore P: 0.6730
Episode: 254 Total reward: 14.0 Training loss: 3.4842 Explore P: 0.6720
Episode: 255 Total reward: 9.0 Training loss: 132.7238 Explore P: 0.6714
Episode: 256 Total reward: 27.0 Training loss: 56.0535 Explore P: 0.6696
Episode: 257 Total reward: 29.0 Training loss: 50.3521 Explore P: 0.6677
Episode: 258 Total reward: 13.0 Training loss: 50.5196 Explore P: 0.6669
Episode: 259 Total reward: 12.0 Training loss: 101.6573 Explore P: 0.6661
Episode: 260 Total reward: 19.0 Training loss: 152.5954 Explore P: 0.6648
Episode: 261 Total reward: 10.0 Training loss: 2.1648 Explore P: 0.6642
Episode: 262 Total reward: 11.0 Training loss: 48.3430 Explore P: 0.6635
Episode: 263 Total reward: 12.0 Training loss: 293.4637 Explore P: 0.6627
Episode: 264 Total reward: 10.0 Training loss: 165.3028 Explore P: 0.6620
Episode: 265 Total reward: 15.0 Training loss: 3.1151 Explore P: 0.6611
Episode: 266 Total reward: 9.0 Training loss: 64.9664 Explore P: 0.6605
Episode: 267 Total reward: 7.0 Training loss: 2.3567 Explore P: 0.6600
Episode: 268 Total reward: 8.0 Training loss: 102.9303 Explore P: 0.6595
Episode: 269 Total reward: 11.0 Training loss: 3.2086 Explore P: 0.6588
Episode: 270 Total reward: 11.0 Training loss: 87.7418 Explore P: 0.6581
Episode: 271 Total reward: 11.0 Training loss: 125.2021 Explore P: 0.6574
Episode: 272 Total reward: 10.0 Training loss: 42.4564 Explore P: 0.6567
Episode: 273 Total reward: 19.0 Training loss: 48.5113 Explore P: 0.6555
Episode: 274 Total reward: 16.0 Training loss: 123.7374 Explore P: 0.6545
Episode: 275 Total reward: 12.0 Training loss: 124.9563 Explore P: 0.6537
Episode: 276 Total reward: 8.0 Training loss: 101.6177 Explore P: 0.6532
Episode: 277 Total reward: 11.0 Training loss: 47.1862 Explore P: 0.6525
Episode: 278 Total reward: 20.0 Training loss: 158.2734 Explore P: 0.6512
Episode: 279 Total reward: 8.0 Training loss: 2.7498 Explore P: 0.6507
Episode: 280 Total reward: 10.0 Training loss: 264.3510 Explore P: 0.6500
Episode: 281 Total reward: 8.0 Training loss: 81.4538 Explore P: 0.6495
Episode: 282 Total reward: 9.0 Training loss: 142.7436 Explore P: 0.6489
Episode: 283 Total reward: 10.0 Training loss: 3.3470 Explore P: 0.6483
Episode: 284 Total reward: 13.0 Training loss: 95.4103 Explore P: 0.6475
Episode: 285 Total reward: 9.0 Training loss: 60.1102 Explore P: 0.6469
Episode: 286 Total reward: 15.0 Training loss: 36.3782 Explore P: 0.6459
Episode: 287 Total reward: 8.0 Training loss: 108.8368 Explore P: 0.6454
Episode: 288 Total reward: 11.0 Training loss: 83.0066 Explore P: 0.6447
Episode: 289 Total reward: 20.0 Training loss: 83.7182 Explore P: 0.6435
Episode: 290 Total reward: 10.0 Training loss: 2.3274 Explore P: 0.6428
Episode: 291 Total reward: 13.0 Training loss: 40.1583 Explore P: 0.6420
Episode: 292 Total reward: 15.0 Training loss: 168.4606 Explore P: 0.6411
Episode: 293 Total reward: 17.0 Training loss: 31.9194 Explore P: 0.6400
Episode: 294 Total reward: 10.0 Training loss: 94.3480 Explore P: 0.6394
Episode: 295 Total reward: 16.0 Training loss: 41.2433 Explore P: 0.6384
Episode: 296 Total reward: 10.0 Training loss: 147.8728 Explore P: 0.6377
Episode: 297 Total reward: 10.0 Training loss: 150.3116 Explore P: 0.6371
Episode: 298 Total reward: 17.0 Training loss: 31.2327 Explore P: 0.6360
Episode: 299 Total reward: 15.0 Training loss: 105.7043 Explore P: 0.6351
Episode: 300 Total reward: 24.0 Training loss: 38.2784 Explore P: 0.6336
Episode: 301 Total reward: 17.0 Training loss: 77.6704 Explore P: 0.6325
Episode: 302 Total reward: 10.0 Training loss: 39.8874 Explore P: 0.6319
Episode: 303 Total reward: 17.0 Training loss: 40.7719 Explore P: 0.6309
Episode: 304 Total reward: 7.0 Training loss: 36.8769 Explore P: 0.6304
Episode: 305 Total reward: 28.0 Training loss: 2.1886 Explore P: 0.6287
Episode: 306 Total reward: 8.0 Training loss: 77.5688 Explore P: 0.6282
Episode: 307 Total reward: 9.0 Training loss: 36.8568 Explore P: 0.6276
Episode: 308 Total reward: 15.0 Training loss: 78.9401 Explore P: 0.6267
Episode: 309 Total reward: 11.0 Training loss: 67.5072 Explore P: 0.6260
Episode: 310 Total reward: 10.0 Training loss: 2.5631 Explore P: 0.6254
Episode: 311 Total reward: 11.0 Training loss: 105.5229 Explore P: 0.6247
Episode: 312 Total reward: 13.0 Training loss: 77.2822 Explore P: 0.6239
Episode: 313 Total reward: 16.0 Training loss: 65.0204 Explore P: 0.6230
Episode: 314 Total reward: 10.0 Training loss: 96.2724 Explore P: 0.6224
Episode: 315 Total reward: 18.0 Training loss: 55.3033 Explore P: 0.6212
Episode: 316 Total reward: 13.0 Training loss: 72.1128 Explore P: 0.6205
Episode: 317 Total reward: 13.0 Training loss: 35.4472 Explore P: 0.6197
Episode: 318 Total reward: 11.0 Training loss: 31.0867 Explore P: 0.6190
Episode: 319 Total reward: 15.0 Training loss: 2.9056 Explore P: 0.6181
Episode: 320 Total reward: 12.0 Training loss: 29.7499 Explore P: 0.6173
Episode: 321 Total reward: 11.0 Training loss: 132.5691 Explore P: 0.6167
Episode: 322 Total reward: 13.0 Training loss: 29.4326 Explore P: 0.6159
Episode: 323 Total reward: 7.0 Training loss: 27.7335 Explore P: 0.6155
Episode: 324 Total reward: 14.0 Training loss: 50.5800 Explore P: 0.6146
Episode: 325 Total reward: 13.0 Training loss: 35.7876 Explore P: 0.6138
Episode: 326 Total reward: 10.0 Training loss: 99.3647 Explore P: 0.6132
Episode: 327 Total reward: 13.0 Training loss: 38.3180 Explore P: 0.6125
Episode: 328 Total reward: 9.0 Training loss: 32.5392 Explore P: 0.6119
Episode: 329 Total reward: 14.0 Training loss: 36.5185 Explore P: 0.6111
Episode: 330 Total reward: 8.0 Training loss: 57.1661 Explore P: 0.6106
Episode: 331 Total reward: 9.0 Training loss: 60.8050 Explore P: 0.6100
Episode: 332 Total reward: 11.0 Training loss: 35.0055 Explore P: 0.6094
Episode: 333 Total reward: 18.0 Training loss: 1.5684 Explore P: 0.6083
Episode: 334 Total reward: 8.0 Training loss: 117.4136 Explore P: 0.6078
Episode: 335 Total reward: 8.0 Training loss: 53.9731 Explore P: 0.6074
Episode: 336 Total reward: 10.0 Training loss: 58.5547 Explore P: 0.6068
Episode: 337 Total reward: 11.0 Training loss: 1.1665 Explore P: 0.6061
Episode: 338 Total reward: 9.0 Training loss: 85.6769 Explore P: 0.6056
Episode: 339 Total reward: 11.0 Training loss: 26.8164 Explore P: 0.6049
Episode: 340 Total reward: 11.0 Training loss: 26.5665 Explore P: 0.6043
Episode: 341 Total reward: 12.0 Training loss: 57.1119 Explore P: 0.6035
Episode: 342 Total reward: 9.0 Training loss: 54.6814 Explore P: 0.6030
Episode: 343 Total reward: 12.0 Training loss: 66.5267 Explore P: 0.6023
Episode: 344 Total reward: 21.0 Training loss: 1.1479 Explore P: 0.6011
Episode: 345 Total reward: 12.0 Training loss: 55.8310 Explore P: 0.6003
Episode: 346 Total reward: 18.0 Training loss: 105.5683 Explore P: 0.5993
Episode: 347 Total reward: 20.0 Training loss: 25.5906 Explore P: 0.5981
Episode: 348 Total reward: 8.0 Training loss: 54.8591 Explore P: 0.5976
Episode: 349 Total reward: 34.0 Training loss: 49.6050 Explore P: 0.5956
Episode: 350 Total reward: 18.0 Training loss: 96.8619 Explore P: 0.5946
Episode: 351 Total reward: 14.0 Training loss: 1.0686 Explore P: 0.5938
Episode: 352 Total reward: 11.0 Training loss: 1.1536 Explore P: 0.5931
Episode: 353 Total reward: 10.0 Training loss: 21.0506 Explore P: 0.5925
Episode: 354 Total reward: 16.0 Training loss: 43.7810 Explore P: 0.5916
Episode: 355 Total reward: 15.0 Training loss: 23.8541 Explore P: 0.5907
Episode: 356 Total reward: 18.0 Training loss: 23.6335 Explore P: 0.5897
Episode: 357 Total reward: 10.0 Training loss: 23.4843 Explore P: 0.5891
Episode: 358 Total reward: 25.0 Training loss: 57.7016 Explore P: 0.5877
Episode: 359 Total reward: 11.0 Training loss: 19.6227 Explore P: 0.5870
Episode: 360 Total reward: 31.0 Training loss: 41.7166 Explore P: 0.5853
Episode: 361 Total reward: 12.0 Training loss: 43.3209 Explore P: 0.5846
Episode: 362 Total reward: 8.0 Training loss: 18.8923 Explore P: 0.5841
Episode: 363 Total reward: 35.0 Training loss: 22.8490 Explore P: 0.5821
Episode: 364 Total reward: 16.0 Training loss: 19.8686 Explore P: 0.5812
Episode: 365 Total reward: 12.0 Training loss: 0.9398 Explore P: 0.5805
Episode: 366 Total reward: 12.0 Training loss: 19.8140 Explore P: 0.5798
Episode: 367 Total reward: 12.0 Training loss: 20.0603 Explore P: 0.5791
Episode: 368 Total reward: 19.0 Training loss: 18.0039 Explore P: 0.5780
Episode: 369 Total reward: 14.0 Training loss: 20.1391 Explore P: 0.5773
Episode: 370 Total reward: 15.0 Training loss: 76.0555 Explore P: 0.5764
Episode: 371 Total reward: 15.0 Training loss: 69.5606 Explore P: 0.5756
Episode: 372 Total reward: 10.0 Training loss: 19.5507 Explore P: 0.5750
Episode: 373 Total reward: 14.0 Training loss: 20.2640 Explore P: 0.5742
Episode: 374 Total reward: 11.0 Training loss: 17.8514 Explore P: 0.5736
Episode: 375 Total reward: 9.0 Training loss: 22.3711 Explore P: 0.5731
Episode: 376 Total reward: 13.0 Training loss: 36.8225 Explore P: 0.5723
Episode: 377 Total reward: 13.0 Training loss: 0.7556 Explore P: 0.5716
Episode: 378 Total reward: 10.0 Training loss: 0.7701 Explore P: 0.5710
Episode: 379 Total reward: 13.0 Training loss: 18.5341 Explore P: 0.5703
Episode: 380 Total reward: 14.0 Training loss: 18.5508 Explore P: 0.5695
Episode: 381 Total reward: 12.0 Training loss: 34.2366 Explore P: 0.5689
Episode: 382 Total reward: 11.0 Training loss: 0.6571 Explore P: 0.5682
Episode: 383 Total reward: 13.0 Training loss: 50.1199 Explore P: 0.5675
Episode: 384 Total reward: 9.0 Training loss: 1.2480 Explore P: 0.5670
Episode: 385 Total reward: 13.0 Training loss: 35.8268 Explore P: 0.5663
Episode: 386 Total reward: 13.0 Training loss: 84.6874 Explore P: 0.5656
Episode: 387 Total reward: 18.0 Training loss: 45.4454 Explore P: 0.5646
Episode: 388 Total reward: 41.0 Training loss: 30.9683 Explore P: 0.5623
Episode: 389 Total reward: 50.0 Training loss: 31.2278 Explore P: 0.5596
Episode: 390 Total reward: 18.0 Training loss: 17.5455 Explore P: 0.5586
Episode: 391 Total reward: 15.0 Training loss: 16.3971 Explore P: 0.5577
Episode: 392 Total reward: 14.0 Training loss: 31.4173 Explore P: 0.5570
Episode: 393 Total reward: 8.0 Training loss: 32.4163 Explore P: 0.5565
Episode: 394 Total reward: 17.0 Training loss: 16.7075 Explore P: 0.5556
Episode: 395 Total reward: 19.0 Training loss: 0.7701 Explore P: 0.5546
Episode: 396 Total reward: 15.0 Training loss: 45.9134 Explore P: 0.5538
Episode: 397 Total reward: 23.0 Training loss: 0.7090 Explore P: 0.5525
Episode: 398 Total reward: 13.0 Training loss: 0.8875 Explore P: 0.5518
Episode: 399 Total reward: 13.0 Training loss: 13.5716 Explore P: 0.5511
Episode: 400 Total reward: 28.0 Training loss: 52.3293 Explore P: 0.5496
Episode: 401 Total reward: 31.0 Training loss: 1.2809 Explore P: 0.5479
Episode: 402 Total reward: 19.0 Training loss: 12.9664 Explore P: 0.5469
Episode: 403 Total reward: 20.0 Training loss: 1.1375 Explore P: 0.5458
Episode: 404 Total reward: 18.0 Training loss: 12.6035 Explore P: 0.5449
Episode: 405 Total reward: 12.0 Training loss: 30.6805 Explore P: 0.5442
Episode: 406 Total reward: 28.0 Training loss: 16.4706 Explore P: 0.5427
Episode: 407 Total reward: 42.0 Training loss: 15.3171 Explore P: 0.5405
Episode: 408 Total reward: 18.0 Training loss: 12.6223 Explore P: 0.5395
Episode: 409 Total reward: 39.0 Training loss: 0.6685 Explore P: 0.5375
Episode: 410 Total reward: 15.0 Training loss: 12.3230 Explore P: 0.5367
Episode: 411 Total reward: 21.0 Training loss: 24.0005 Explore P: 0.5356
Episode: 412 Total reward: 15.0 Training loss: 26.5921 Explore P: 0.5348
Episode: 413 Total reward: 19.0 Training loss: 14.3719 Explore P: 0.5338
Episode: 414 Total reward: 16.0 Training loss: 57.0601 Explore P: 0.5330
Episode: 415 Total reward: 14.0 Training loss: 0.7923 Explore P: 0.5322
Episode: 416 Total reward: 18.0 Training loss: 23.3469 Explore P: 0.5313
Episode: 417 Total reward: 14.0 Training loss: 34.7179 Explore P: 0.5306
Episode: 418 Total reward: 13.0 Training loss: 11.6060 Explore P: 0.5299
Episode: 419 Total reward: 12.0 Training loss: 13.9192 Explore P: 0.5293
Episode: 420 Total reward: 14.0 Training loss: 35.8707 Explore P: 0.5285
Episode: 421 Total reward: 23.0 Training loss: 19.3004 Explore P: 0.5273
Episode: 422 Total reward: 41.0 Training loss: 14.7306 Explore P: 0.5252
Episode: 423 Total reward: 22.0 Training loss: 11.1686 Explore P: 0.5241
Episode: 424 Total reward: 16.0 Training loss: 23.7015 Explore P: 0.5233
Episode: 425 Total reward: 33.0 Training loss: 0.8111 Explore P: 0.5216
Episode: 426 Total reward: 19.0 Training loss: 39.9215 Explore P: 0.5206
Episode: 427 Total reward: 20.0 Training loss: 28.6523 Explore P: 0.5196
Episode: 428 Total reward: 30.0 Training loss: 13.7265 Explore P: 0.5181
Episode: 429 Total reward: 29.0 Training loss: 21.5007 Explore P: 0.5166
Episode: 430 Total reward: 18.0 Training loss: 12.8044 Explore P: 0.5157
Episode: 431 Total reward: 16.0 Training loss: 17.3753 Explore P: 0.5149
Episode: 432 Total reward: 29.0 Training loss: 19.9148 Explore P: 0.5134
Episode: 433 Total reward: 19.0 Training loss: 17.9787 Explore P: 0.5125
Episode: 434 Total reward: 29.0 Training loss: 0.9247 Explore P: 0.5110
Episode: 435 Total reward: 21.0 Training loss: 19.7667 Explore P: 0.5099
Episode: 436 Total reward: 16.0 Training loss: 8.8582 Explore P: 0.5091
Episode: 437 Total reward: 15.0 Training loss: 0.9968 Explore P: 0.5084
Episode: 438 Total reward: 25.0 Training loss: 24.3384 Explore P: 0.5072
Episode: 439 Total reward: 19.0 Training loss: 28.7640 Explore P: 0.5062
Episode: 440 Total reward: 26.0 Training loss: 1.5219 Explore P: 0.5049
Episode: 441 Total reward: 27.0 Training loss: 1.1060 Explore P: 0.5036
Episode: 442 Total reward: 12.0 Training loss: 8.1378 Explore P: 0.5030
Episode: 443 Total reward: 18.0 Training loss: 25.7266 Explore P: 0.5021
Episode: 444 Total reward: 21.0 Training loss: 7.8698 Explore P: 0.5011
Episode: 445 Total reward: 17.0 Training loss: 12.3055 Explore P: 0.5002
Episode: 446 Total reward: 16.0 Training loss: 1.1245 Explore P: 0.4995
Episode: 447 Total reward: 17.0 Training loss: 41.3151 Explore P: 0.4986
Episode: 448 Total reward: 31.0 Training loss: 8.5044 Explore P: 0.4971
Episode: 449 Total reward: 11.0 Training loss: 7.7499 Explore P: 0.4966
Episode: 450 Total reward: 37.0 Training loss: 38.9043 Explore P: 0.4948
Episode: 451 Total reward: 10.0 Training loss: 28.4057 Explore P: 0.4943
Episode: 452 Total reward: 13.0 Training loss: 26.7687 Explore P: 0.4937
Episode: 453 Total reward: 9.0 Training loss: 14.0287 Explore P: 0.4932
Episode: 454 Total reward: 13.0 Training loss: 24.3197 Explore P: 0.4926
Episode: 455 Total reward: 21.0 Training loss: 1.0293 Explore P: 0.4916
Episode: 456 Total reward: 19.0 Training loss: 37.2859 Explore P: 0.4907
Episode: 457 Total reward: 11.0 Training loss: 0.8538 Explore P: 0.4902
Episode: 458 Total reward: 19.0 Training loss: 1.2404 Explore P: 0.4892
Episode: 459 Total reward: 26.0 Training loss: 1.5802 Explore P: 0.4880
Episode: 460 Total reward: 11.0 Training loss: 1.1780 Explore P: 0.4875
Episode: 461 Total reward: 45.0 Training loss: 16.5364 Explore P: 0.4853
Episode: 462 Total reward: 21.0 Training loss: 1.3121 Explore P: 0.4843
Episode: 463 Total reward: 31.0 Training loss: 0.8922 Explore P: 0.4829
Episode: 464 Total reward: 18.0 Training loss: 17.9571 Explore P: 0.4820
Episode: 465 Total reward: 15.0 Training loss: 38.8063 Explore P: 0.4813
Episode: 466 Total reward: 15.0 Training loss: 8.2020 Explore P: 0.4806
Episode: 467 Total reward: 32.0 Training loss: 11.1194 Explore P: 0.4791
Episode: 468 Total reward: 18.0 Training loss: 37.7545 Explore P: 0.4783
Episode: 469 Total reward: 26.0 Training loss: 29.5739 Explore P: 0.4770
Episode: 470 Total reward: 26.0 Training loss: 1.2167 Explore P: 0.4758
Episode: 471 Total reward: 24.0 Training loss: 1.1193 Explore P: 0.4747
Episode: 472 Total reward: 32.0 Training loss: 1.6059 Explore P: 0.4732
Episode: 473 Total reward: 19.0 Training loss: 1.1342 Explore P: 0.4723
Episode: 474 Total reward: 20.0 Training loss: 9.5274 Explore P: 0.4714
Episode: 475 Total reward: 35.0 Training loss: 20.3094 Explore P: 0.4698
Episode: 476 Total reward: 24.0 Training loss: 1.1124 Explore P: 0.4687
Episode: 477 Total reward: 24.0 Training loss: 12.6487 Explore P: 0.4676
Episode: 478 Total reward: 26.0 Training loss: 16.9093 Explore P: 0.4664
Episode: 479 Total reward: 25.0 Training loss: 44.9873 Explore P: 0.4653
Episode: 480 Total reward: 20.0 Training loss: 13.7062 Explore P: 0.4644
Episode: 481 Total reward: 15.0 Training loss: 32.0027 Explore P: 0.4637
Episode: 482 Total reward: 16.0 Training loss: 19.1585 Explore P: 0.4630
Episode: 483 Total reward: 22.0 Training loss: 15.0526 Explore P: 0.4620
Episode: 484 Total reward: 14.0 Training loss: 1.1242 Explore P: 0.4613
Episode: 485 Total reward: 19.0 Training loss: 24.6019 Explore P: 0.4605
Episode: 486 Total reward: 60.0 Training loss: 19.7885 Explore P: 0.4578
Episode: 487 Total reward: 24.0 Training loss: 29.2378 Explore P: 0.4567
Episode: 488 Total reward: 51.0 Training loss: 1.0816 Explore P: 0.4544
Episode: 489 Total reward: 51.0 Training loss: 29.9663 Explore P: 0.4522
Episode: 490 Total reward: 39.0 Training loss: 1.1689 Explore P: 0.4505
Episode: 491 Total reward: 22.0 Training loss: 1.0570 Explore P: 0.4495
Episode: 492 Total reward: 21.0 Training loss: 25.8935 Explore P: 0.4486
Episode: 493 Total reward: 12.0 Training loss: 22.2251 Explore P: 0.4480
Episode: 494 Total reward: 20.0 Training loss: 36.7863 Explore P: 0.4472
Episode: 495 Total reward: 15.0 Training loss: 9.6657 Explore P: 0.4465
Episode: 496 Total reward: 20.0 Training loss: 14.4071 Explore P: 0.4456
Episode: 497 Total reward: 33.0 Training loss: 11.0091 Explore P: 0.4442
Episode: 498 Total reward: 17.0 Training loss: 14.0260 Explore P: 0.4435
Episode: 499 Total reward: 30.0 Training loss: 12.0654 Explore P: 0.4422
Episode: 500 Total reward: 18.0 Training loss: 8.9898 Explore P: 0.4414
Episode: 501 Total reward: 36.0 Training loss: 27.1493 Explore P: 0.4398
Episode: 502 Total reward: 32.0 Training loss: 1.1146 Explore P: 0.4385
Episode: 503 Total reward: 62.0 Training loss: 33.6167 Explore P: 0.4358
Episode: 504 Total reward: 42.0 Training loss: 21.4707 Explore P: 0.4340
Episode: 505 Total reward: 29.0 Training loss: 1.7413 Explore P: 0.4328
Episode: 506 Total reward: 21.0 Training loss: 1.2741 Explore P: 0.4319
Episode: 507 Total reward: 28.0 Training loss: 1.3390 Explore P: 0.4307
Episode: 508 Total reward: 25.0 Training loss: 1.3198 Explore P: 0.4297
Episode: 509 Total reward: 57.0 Training loss: 19.4917 Explore P: 0.4273
Episode: 510 Total reward: 40.0 Training loss: 12.4425 Explore P: 0.4256
Episode: 511 Total reward: 68.0 Training loss: 11.7733 Explore P: 0.4228
Episode: 512 Total reward: 32.0 Training loss: 10.3297 Explore P: 0.4215
Episode: 513 Total reward: 34.0 Training loss: 0.8455 Explore P: 0.4201
Episode: 514 Total reward: 39.0 Training loss: 36.3518 Explore P: 0.4185
Episode: 515 Total reward: 16.0 Training loss: 14.1500 Explore P: 0.4179
Episode: 516 Total reward: 28.0 Training loss: 11.3859 Explore P: 0.4167
Episode: 517 Total reward: 21.0 Training loss: 1.6112 Explore P: 0.4159
Episode: 518 Total reward: 32.0 Training loss: 21.7841 Explore P: 0.4146
Episode: 519 Total reward: 57.0 Training loss: 0.9055 Explore P: 0.4123
Episode: 520 Total reward: 38.0 Training loss: 31.0363 Explore P: 0.4107
Episode: 521 Total reward: 37.0 Training loss: 11.7153 Explore P: 0.4093
Episode: 522 Total reward: 58.0 Training loss: 1.2793 Explore P: 0.4069
Episode: 523 Total reward: 24.0 Training loss: 26.1383 Explore P: 0.4060
Episode: 524 Total reward: 83.0 Training loss: 1.6220 Explore P: 0.4027
Episode: 525 Total reward: 26.0 Training loss: 14.7757 Explore P: 0.4017
Episode: 526 Total reward: 25.0 Training loss: 31.2609 Explore P: 0.4007
Episode: 527 Total reward: 41.0 Training loss: 1.0933 Explore P: 0.3991
Episode: 528 Total reward: 67.0 Training loss: 50.9949 Explore P: 0.3965
Episode: 529 Total reward: 43.0 Training loss: 16.5941 Explore P: 0.3949
Episode: 530 Total reward: 44.0 Training loss: 51.6077 Explore P: 0.3932
Episode: 531 Total reward: 61.0 Training loss: 16.7493 Explore P: 0.3908
Episode: 532 Total reward: 46.0 Training loss: 36.6571 Explore P: 0.3891
Episode: 533 Total reward: 73.0 Training loss: 1.0749 Explore P: 0.3863
Episode: 534 Total reward: 54.0 Training loss: 22.8286 Explore P: 0.3843
Episode: 535 Total reward: 66.0 Training loss: 9.0616 Explore P: 0.3819
Episode: 536 Total reward: 34.0 Training loss: 38.0536 Explore P: 0.3806
Episode: 537 Total reward: 52.0 Training loss: 1.7811 Explore P: 0.3787
Episode: 538 Total reward: 62.0 Training loss: 2.2663 Explore P: 0.3764
Episode: 539 Total reward: 41.0 Training loss: 14.3416 Explore P: 0.3749
Episode: 540 Total reward: 93.0 Training loss: 15.2113 Explore P: 0.3715
Episode: 541 Total reward: 46.0 Training loss: 11.1850 Explore P: 0.3699
Episode: 542 Total reward: 40.0 Training loss: 22.0518 Explore P: 0.3684
Episode: 543 Total reward: 25.0 Training loss: 22.2773 Explore P: 0.3675
Episode: 544 Total reward: 23.0 Training loss: 14.9652 Explore P: 0.3667
Episode: 545 Total reward: 75.0 Training loss: 23.0482 Explore P: 0.3640
Episode: 546 Total reward: 22.0 Training loss: 14.1616 Explore P: 0.3633
Episode: 547 Total reward: 27.0 Training loss: 13.0065 Explore P: 0.3623
Episode: 548 Total reward: 33.0 Training loss: 1.8635 Explore P: 0.3611
Episode: 549 Total reward: 48.0 Training loss: 1.1462 Explore P: 0.3595
Episode: 550 Total reward: 45.0 Training loss: 10.6186 Explore P: 0.3579
Episode: 551 Total reward: 33.0 Training loss: 1.2578 Explore P: 0.3568
Episode: 552 Total reward: 88.0 Training loss: 17.4746 Explore P: 0.3537
Episode: 553 Total reward: 60.0 Training loss: 13.5685 Explore P: 0.3517
Episode: 554 Total reward: 47.0 Training loss: 25.8068 Explore P: 0.3501
Episode: 555 Total reward: 85.0 Training loss: 2.2876 Explore P: 0.3472
Episode: 556 Total reward: 64.0 Training loss: 1.5023 Explore P: 0.3450
Episode: 557 Total reward: 36.0 Training loss: 15.5122 Explore P: 0.3438
Episode: 558 Total reward: 89.0 Training loss: 47.9844 Explore P: 0.3409
Episode: 559 Total reward: 77.0 Training loss: 0.8502 Explore P: 0.3383
Episode: 560 Total reward: 22.0 Training loss: 68.0906 Explore P: 0.3376
Episode: 561 Total reward: 29.0 Training loss: 29.9364 Explore P: 0.3367
Episode: 562 Total reward: 58.0 Training loss: 21.3550 Explore P: 0.3348
Episode: 563 Total reward: 91.0 Training loss: 20.2344 Explore P: 0.3318
Episode: 564 Total reward: 81.0 Training loss: 27.1911 Explore P: 0.3292
Episode: 565 Total reward: 74.0 Training loss: 25.0571 Explore P: 0.3269
Episode: 566 Total reward: 82.0 Training loss: 68.6066 Explore P: 0.3243
Episode: 567 Total reward: 64.0 Training loss: 17.2128 Explore P: 0.3223
Episode: 568 Total reward: 50.0 Training loss: 55.4805 Explore P: 0.3207
Episode: 569 Total reward: 42.0 Training loss: 47.8850 Explore P: 0.3194
Episode: 570 Total reward: 46.0 Training loss: 1.5115 Explore P: 0.3180
Episode: 571 Total reward: 46.0 Training loss: 57.0466 Explore P: 0.3166
Episode: 572 Total reward: 47.0 Training loss: 13.6016 Explore P: 0.3152
Episode: 573 Total reward: 36.0 Training loss: 2.0227 Explore P: 0.3141
Episode: 574 Total reward: 30.0 Training loss: 2.5625 Explore P: 0.3131
Episode: 575 Total reward: 43.0 Training loss: 26.9228 Explore P: 0.3118
Episode: 576 Total reward: 123.0 Training loss: 14.5797 Explore P: 0.3082
Episode: 577 Total reward: 121.0 Training loss: 68.3477 Explore P: 0.3046
Episode: 578 Total reward: 56.0 Training loss: 14.7431 Explore P: 0.3029
Episode: 579 Total reward: 71.0 Training loss: 29.8230 Explore P: 0.3008
Episode: 580 Total reward: 47.0 Training loss: 85.8422 Explore P: 0.2995
Episode: 581 Total reward: 108.0 Training loss: 27.4927 Explore P: 0.2964
Episode: 582 Total reward: 100.0 Training loss: 2.2847 Explore P: 0.2935
Episode: 583 Total reward: 50.0 Training loss: 2.7229 Explore P: 0.2921
Episode: 584 Total reward: 65.0 Training loss: 15.2744 Explore P: 0.2903
Episode: 585 Total reward: 95.0 Training loss: 2.2639 Explore P: 0.2876
Episode: 586 Total reward: 58.0 Training loss: 1.6174 Explore P: 0.2860
Episode: 587 Total reward: 120.0 Training loss: 92.1166 Explore P: 0.2827
Episode: 588 Total reward: 142.0 Training loss: 16.9074 Explore P: 0.2789
Episode: 589 Total reward: 72.0 Training loss: 26.3904 Explore P: 0.2770
Episode: 590 Total reward: 27.0 Training loss: 2.7347 Explore P: 0.2762
Episode: 591 Total reward: 31.0 Training loss: 37.7428 Explore P: 0.2754
Episode: 592 Total reward: 105.0 Training loss: 62.7599 Explore P: 0.2726
Episode: 593 Total reward: 109.0 Training loss: 14.7185 Explore P: 0.2698
Episode: 594 Total reward: 52.0 Training loss: 46.3727 Explore P: 0.2685
Episode: 595 Total reward: 154.0 Training loss: 1.4994 Explore P: 0.2645
Episode: 596 Total reward: 56.0 Training loss: 26.3880 Explore P: 0.2631
Episode: 597 Total reward: 80.0 Training loss: 2.1035 Explore P: 0.2611
Episode: 598 Total reward: 80.0 Training loss: 70.7687 Explore P: 0.2591
Episode: 599 Total reward: 57.0 Training loss: 23.5564 Explore P: 0.2576
Episode: 600 Total reward: 108.0 Training loss: 1.6232 Explore P: 0.2550
Episode: 601 Total reward: 60.0 Training loss: 42.1861 Explore P: 0.2535
Episode: 602 Total reward: 127.0 Training loss: 82.0350 Explore P: 0.2504
Episode: 603 Total reward: 102.0 Training loss: 1.5874 Explore P: 0.2480
Episode: 604 Total reward: 26.0 Training loss: 1.0760 Explore P: 0.2474
Episode: 605 Total reward: 30.0 Training loss: 1.7130 Explore P: 0.2467
Episode: 606 Total reward: 53.0 Training loss: 0.9351 Explore P: 0.2454
Episode: 607 Total reward: 49.0 Training loss: 2.1143 Explore P: 0.2443
Episode: 608 Total reward: 29.0 Training loss: 2.9050 Explore P: 0.2436
Episode: 609 Total reward: 75.0 Training loss: 41.4020 Explore P: 0.2419
Episode: 610 Total reward: 91.0 Training loss: 63.1298 Explore P: 0.2398
Episode: 611 Total reward: 74.0 Training loss: 1.0130 Explore P: 0.2381
Episode: 612 Total reward: 60.0 Training loss: 1.5163 Explore P: 0.2367
Episode: 613 Total reward: 46.0 Training loss: 118.2596 Explore P: 0.2357
Episode: 614 Total reward: 77.0 Training loss: 2.1866 Explore P: 0.2339
Episode: 615 Total reward: 90.0 Training loss: 1.2926 Explore P: 0.2319
Episode: 616 Total reward: 33.0 Training loss: 41.9855 Explore P: 0.2312
Episode: 617 Total reward: 85.0 Training loss: 44.3746 Explore P: 0.2293
Episode: 618 Total reward: 36.0 Training loss: 2.3417 Explore P: 0.2285
Episode: 619 Total reward: 32.0 Training loss: 1.1100 Explore P: 0.2278
Episode: 620 Total reward: 32.0 Training loss: 1.5703 Explore P: 0.2271
Episode: 621 Total reward: 40.0 Training loss: 35.0594 Explore P: 0.2263
Episode: 622 Total reward: 42.0 Training loss: 1.5935 Explore P: 0.2254
Episode: 623 Total reward: 25.0 Training loss: 49.3768 Explore P: 0.2248
Episode: 624 Total reward: 29.0 Training loss: 140.4090 Explore P: 0.2242
Episode: 625 Total reward: 31.0 Training loss: 41.3316 Explore P: 0.2235
Episode: 626 Total reward: 28.0 Training loss: 1.9650 Explore P: 0.2229
Episode: 627 Total reward: 29.0 Training loss: 93.4359 Explore P: 0.2223
Episode: 628 Total reward: 33.0 Training loss: 40.7825 Explore P: 0.2216
Episode: 629 Total reward: 64.0 Training loss: 1.0268 Explore P: 0.2203
Episode: 630 Total reward: 46.0 Training loss: 1.3688 Explore P: 0.2193
Episode: 631 Total reward: 39.0 Training loss: 77.3664 Explore P: 0.2185
Episode: 632 Total reward: 44.0 Training loss: 1.7155 Explore P: 0.2176
Episode: 633 Total reward: 26.0 Training loss: 1.4265 Explore P: 0.2170
Episode: 634 Total reward: 32.0 Training loss: 1.5379 Explore P: 0.2164
Episode: 635 Total reward: 34.0 Training loss: 65.3090 Explore P: 0.2157
Episode: 636 Total reward: 38.0 Training loss: 2.1576 Explore P: 0.2149
Episode: 637 Total reward: 42.0 Training loss: 3.4473 Explore P: 0.2140
Episode: 638 Total reward: 54.0 Training loss: 42.6358 Explore P: 0.2129
Episode: 639 Total reward: 39.0 Training loss: 1.2786 Explore P: 0.2121
Episode: 640 Total reward: 45.0 Training loss: 1.4674 Explore P: 0.2112
Episode: 641 Total reward: 79.0 Training loss: 36.9086 Explore P: 0.2097
Episode: 642 Total reward: 46.0 Training loss: 91.3298 Explore P: 0.2087
Episode: 643 Total reward: 41.0 Training loss: 86.6613 Explore P: 0.2079
Episode: 644 Total reward: 44.0 Training loss: 1.6951 Explore P: 0.2071
Episode: 645 Total reward: 26.0 Training loss: 105.0162 Explore P: 0.2065
Episode: 646 Total reward: 31.0 Training loss: 105.1632 Explore P: 0.2059
Episode: 647 Total reward: 18.0 Training loss: 47.1846 Explore P: 0.2056
Episode: 648 Total reward: 36.0 Training loss: 53.6124 Explore P: 0.2049
Episode: 649 Total reward: 35.0 Training loss: 54.1006 Explore P: 0.2042
Episode: 650 Total reward: 27.0 Training loss: 1.5034 Explore P: 0.2037
Episode: 651 Total reward: 33.0 Training loss: 60.9929 Explore P: 0.2030
Episode: 652 Total reward: 35.0 Training loss: 1.5271 Explore P: 0.2024
Episode: 653 Total reward: 42.0 Training loss: 2.4615 Explore P: 0.2016
Episode: 654 Total reward: 21.0 Training loss: 2.1535 Explore P: 0.2012
Episode: 655 Total reward: 43.0 Training loss: 1.6790 Explore P: 0.2003
Episode: 656 Total reward: 40.0 Training loss: 2.1984 Explore P: 0.1996
Episode: 657 Total reward: 39.0 Training loss: 47.6617 Explore P: 0.1988
Episode: 658 Total reward: 38.0 Training loss: 88.4204 Explore P: 0.1981
Episode: 659 Total reward: 23.0 Training loss: 42.9641 Explore P: 0.1977
Episode: 660 Total reward: 23.0 Training loss: 2.8514 Explore P: 0.1973
Episode: 661 Total reward: 27.0 Training loss: 1.8046 Explore P: 0.1968
Episode: 662 Total reward: 23.0 Training loss: 1.8868 Explore P: 0.1963
Episode: 663 Total reward: 49.0 Training loss: 1.7521 Explore P: 0.1954
Episode: 664 Total reward: 66.0 Training loss: 1.4212 Explore P: 0.1942
Episode: 665 Total reward: 69.0 Training loss: 1.2554 Explore P: 0.1929
Episode: 666 Total reward: 39.0 Training loss: 1.6264 Explore P: 0.1922
Episode: 667 Total reward: 45.0 Training loss: 1.2313 Explore P: 0.1914
Episode: 668 Total reward: 68.0 Training loss: 1.7184 Explore P: 0.1902
Episode: 669 Total reward: 44.0 Training loss: 1.6229 Explore P: 0.1894
Episode: 670 Total reward: 37.0 Training loss: 1.6273 Explore P: 0.1887
Episode: 671 Total reward: 32.0 Training loss: 59.6959 Explore P: 0.1881
Episode: 672 Total reward: 46.0 Training loss: 1.5708 Explore P: 0.1873
Episode: 673 Total reward: 30.0 Training loss: 1.8669 Explore P: 0.1868
Episode: 674 Total reward: 47.0 Training loss: 1.6476 Explore P: 0.1860
Episode: 675 Total reward: 87.0 Training loss: 123.3414 Explore P: 0.1844
Episode: 676 Total reward: 35.0 Training loss: 65.3349 Explore P: 0.1838
Episode: 677 Total reward: 79.0 Training loss: 119.5272 Explore P: 0.1825
Episode: 678 Total reward: 84.0 Training loss: 177.2962 Explore P: 0.1810
Episode: 679 Total reward: 73.0 Training loss: 112.4075 Explore P: 0.1798
Episode: 680 Total reward: 92.0 Training loss: 1.6012 Explore P: 0.1782
Episode: 681 Total reward: 120.0 Training loss: 113.5554 Explore P: 0.1762
Episode: 682 Total reward: 96.0 Training loss: 72.1337 Explore P: 0.1746
Episode: 683 Total reward: 112.0 Training loss: 128.2959 Explore P: 0.1728
Episode: 684 Total reward: 75.0 Training loss: 104.3114 Explore P: 0.1716
Episode: 685 Total reward: 100.0 Training loss: 221.0627 Explore P: 0.1700
Episode: 686 Total reward: 149.0 Training loss: 109.7734 Explore P: 0.1676
Episode: 687 Total reward: 133.0 Training loss: 78.2540 Explore P: 0.1655
Episode: 688 Total reward: 199.0 Training loss: 79.7581 Explore P: 0.1625
Episode: 689 Total reward: 199.0 Training loss: 1.3091 Explore P: 0.1595
Episode: 690 Total reward: 199.0 Training loss: 1.4806 Explore P: 0.1565
Episode: 691 Total reward: 199.0 Training loss: 1.3279 Explore P: 0.1536
Episode: 692 Total reward: 186.0 Training loss: 0.8716 Explore P: 0.1510
Episode: 693 Total reward: 199.0 Training loss: 1.2193 Explore P: 0.1482
Episode: 694 Total reward: 199.0 Training loss: 74.2097 Explore P: 0.1455
Episode: 695 Total reward: 198.0 Training loss: 94.8758 Explore P: 0.1428
Episode: 696 Total reward: 136.0 Training loss: 1.2806 Explore P: 0.1410
Episode: 697 Total reward: 199.0 Training loss: 1.2231 Explore P: 0.1384
Episode: 698 Total reward: 199.0 Training loss: 1.2476 Explore P: 0.1359
Episode: 699 Total reward: 199.0 Training loss: 202.5727 Explore P: 0.1334
Episode: 700 Total reward: 199.0 Training loss: 108.2603 Explore P: 0.1310
Episode: 701 Total reward: 199.0 Training loss: 1.1473 Explore P: 0.1286
Episode: 702 Total reward: 199.0 Training loss: 110.7266 Explore P: 0.1263
Episode: 703 Total reward: 199.0 Training loss: 124.7103 Explore P: 0.1240
Episode: 704 Total reward: 199.0 Training loss: 1.2836 Explore P: 0.1217
Episode: 705 Total reward: 199.0 Training loss: 1.1427 Explore P: 0.1195
Episode: 706 Total reward: 155.0 Training loss: 1.1563 Explore P: 0.1179
Episode: 707 Total reward: 172.0 Training loss: 108.1588 Explore P: 0.1160
Episode: 708 Total reward: 138.0 Training loss: 237.0340 Explore P: 0.1146
Episode: 709 Total reward: 138.0 Training loss: 111.3272 Explore P: 0.1131
Episode: 710 Total reward: 174.0 Training loss: 0.4392 Explore P: 0.1114
Episode: 711 Total reward: 135.0 Training loss: 232.9319 Explore P: 0.1100
Episode: 712 Total reward: 199.0 Training loss: 236.1008 Explore P: 0.1080
Episode: 713 Total reward: 175.0 Training loss: 1.0319 Explore P: 0.1063
Episode: 714 Total reward: 178.0 Training loss: 0.9038 Explore P: 0.1046
Episode: 715 Total reward: 151.0 Training loss: 1.2105 Explore P: 0.1032
Episode: 716 Total reward: 187.0 Training loss: 0.9768 Explore P: 0.1015
Episode: 717 Total reward: 171.0 Training loss: 124.2704 Explore P: 0.0999
Episode: 718 Total reward: 124.0 Training loss: 1.6056 Explore P: 0.0988
Episode: 719 Total reward: 166.0 Training loss: 0.7521 Explore P: 0.0974
Episode: 720 Total reward: 153.0 Training loss: 0.7269 Explore P: 0.0960
Episode: 721 Total reward: 193.0 Training loss: 0.6319 Explore P: 0.0944
Episode: 722 Total reward: 165.0 Training loss: 0.8057 Explore P: 0.0930
Episode: 723 Total reward: 151.0 Training loss: 0.7462 Explore P: 0.0918
Episode: 724 Total reward: 188.0 Training loss: 119.9837 Explore P: 0.0902
Episode: 725 Total reward: 135.0 Training loss: 0.6537 Explore P: 0.0892
Episode: 726 Total reward: 177.0 Training loss: 113.3904 Explore P: 0.0878
Episode: 727 Total reward: 150.0 Training loss: 0.8852 Explore P: 0.0866
Episode: 728 Total reward: 141.0 Training loss: 126.8301 Explore P: 0.0855
Episode: 729 Total reward: 112.0 Training loss: 1.1026 Explore P: 0.0847
Episode: 730 Total reward: 149.0 Training loss: 0.7034 Explore P: 0.0836
Episode: 731 Total reward: 108.0 Training loss: 0.9362 Explore P: 0.0828
Episode: 732 Total reward: 140.0 Training loss: 172.0040 Explore P: 0.0818
Episode: 733 Total reward: 127.0 Training loss: 0.5153 Explore P: 0.0809
Episode: 734 Total reward: 118.0 Training loss: 0.6723 Explore P: 0.0801
Episode: 735 Total reward: 32.0 Training loss: 144.9614 Explore P: 0.0798
Episode: 736 Total reward: 106.0 Training loss: 0.6049 Explore P: 0.0791
Episode: 737 Total reward: 42.0 Training loss: 0.8933 Explore P: 0.0788
Episode: 738 Total reward: 109.0 Training loss: 0.5384 Explore P: 0.0781
Episode: 739 Total reward: 50.0 Training loss: 1.8706 Explore P: 0.0777
Episode: 740 Total reward: 109.0 Training loss: 1.5084 Explore P: 0.0770
Episode: 741 Total reward: 124.0 Training loss: 1.0289 Explore P: 0.0762
Episode: 742 Total reward: 128.0 Training loss: 0.5889 Explore P: 0.0753
Episode: 743 Total reward: 109.0 Training loss: 187.4742 Explore P: 0.0746
Episode: 744 Total reward: 106.0 Training loss: 1.5179 Explore P: 0.0739
Episode: 745 Total reward: 111.0 Training loss: 0.7251 Explore P: 0.0732
Episode: 746 Total reward: 35.0 Training loss: 1.2727 Explore P: 0.0730
Episode: 747 Total reward: 25.0 Training loss: 0.9435 Explore P: 0.0728
Episode: 748 Total reward: 18.0 Training loss: 1.9505 Explore P: 0.0727
Episode: 749 Total reward: 17.0 Training loss: 1.1229 Explore P: 0.0726
Episode: 750 Total reward: 24.0 Training loss: 0.5672 Explore P: 0.0725
Episode: 751 Total reward: 28.0 Training loss: 1.1481 Explore P: 0.0723
Episode: 752 Total reward: 121.0 Training loss: 0.8369 Explore P: 0.0716
Episode: 753 Total reward: 117.0 Training loss: 0.2058 Explore P: 0.0708
Episode: 754 Total reward: 131.0 Training loss: 76.4557 Explore P: 0.0700
Episode: 755 Total reward: 123.0 Training loss: 0.6674 Explore P: 0.0693
Episode: 756 Total reward: 128.0 Training loss: 0.5369 Explore P: 0.0686
Episode: 757 Total reward: 143.0 Training loss: 0.3261 Explore P: 0.0677
Episode: 758 Total reward: 129.0 Training loss: 0.2360 Explore P: 0.0670
Episode: 759 Total reward: 170.0 Training loss: 0.3962 Explore P: 0.0660
Episode: 760 Total reward: 199.0 Training loss: 0.2674 Explore P: 0.0649
Episode: 761 Total reward: 163.0 Training loss: 0.4931 Explore P: 0.0640
Episode: 762 Total reward: 199.0 Training loss: 0.1268 Explore P: 0.0630
Episode: 763 Total reward: 199.0 Training loss: 0.2428 Explore P: 0.0619
Episode: 764 Total reward: 199.0 Training loss: 0.3573 Explore P: 0.0609
Episode: 765 Total reward: 199.0 Training loss: 0.1259 Explore P: 0.0599
Episode: 766 Total reward: 199.0 Training loss: 0.0963 Explore P: 0.0589
Episode: 767 Total reward: 199.0 Training loss: 0.2258 Explore P: 0.0580
Episode: 768 Total reward: 199.0 Training loss: 0.1393 Explore P: 0.0570
Episode: 769 Total reward: 199.0 Training loss: 0.1704 Explore P: 0.0561
Episode: 770 Total reward: 199.0 Training loss: 0.1900 Explore P: 0.0552
Episode: 771 Total reward: 199.0 Training loss: 0.1808 Explore P: 0.0543
Episode: 772 Total reward: 199.0 Training loss: 0.2285 Explore P: 0.0534
Episode: 773 Total reward: 199.0 Training loss: 0.0678 Explore P: 0.0526
Episode: 774 Total reward: 199.0 Training loss: 0.1285 Explore P: 0.0517
Episode: 775 Total reward: 193.0 Training loss: 0.0716 Explore P: 0.0509
Episode: 776 Total reward: 199.0 Training loss: 0.2023 Explore P: 0.0501
Episode: 777 Total reward: 199.0 Training loss: 0.0593 Explore P: 0.0493
Episode: 778 Total reward: 199.0 Training loss: 0.2750 Explore P: 0.0485
Episode: 779 Total reward: 199.0 Training loss: 0.5040 Explore P: 0.0478
Episode: 780 Total reward: 199.0 Training loss: 0.1197 Explore P: 0.0470
Episode: 781 Total reward: 199.0 Training loss: 0.1895 Explore P: 0.0463
Episode: 782 Total reward: 199.0 Training loss: 0.1258 Explore P: 0.0456
Episode: 783 Total reward: 199.0 Training loss: 0.1687 Explore P: 0.0449
Episode: 784 Total reward: 199.0 Training loss: 0.3578 Explore P: 0.0442
Episode: 785 Total reward: 199.0 Training loss: 0.1195 Explore P: 0.0435
Episode: 786 Total reward: 199.0 Training loss: 11.2823 Explore P: 0.0429
Episode: 787 Total reward: 199.0 Training loss: 12.9821 Explore P: 0.0422
Episode: 788 Total reward: 199.0 Training loss: 0.1394 Explore P: 0.0416
Episode: 789 Total reward: 199.0 Training loss: 0.1852 Explore P: 0.0410
Episode: 790 Total reward: 199.0 Training loss: 0.3048 Explore P: 0.0404
Episode: 791 Total reward: 199.0 Training loss: 0.4613 Explore P: 0.0398
Episode: 792 Total reward: 199.0 Training loss: 0.1357 Explore P: 0.0392
Episode: 793 Total reward: 199.0 Training loss: 0.3004 Explore P: 0.0386
Episode: 794 Total reward: 199.0 Training loss: 129.1000 Explore P: 0.0380
Episode: 795 Total reward: 199.0 Training loss: 0.2677 Explore P: 0.0375
Episode: 796 Total reward: 199.0 Training loss: 0.3263 Explore P: 0.0369
Episode: 797 Total reward: 199.0 Training loss: 0.2184 Explore P: 0.0364
Episode: 798 Total reward: 199.0 Training loss: 0.2089 Explore P: 0.0359
Episode: 799 Total reward: 199.0 Training loss: 0.2131 Explore P: 0.0354
Episode: 800 Total reward: 199.0 Training loss: 0.4343 Explore P: 0.0349
Episode: 801 Total reward: 199.0 Training loss: 0.5475 Explore P: 0.0344
Episode: 802 Total reward: 199.0 Training loss: 0.1432 Explore P: 0.0339
Episode: 803 Total reward: 199.0 Training loss: 0.1570 Explore P: 0.0334
Episode: 804 Total reward: 199.0 Training loss: 0.3428 Explore P: 0.0330
Episode: 805 Total reward: 199.0 Training loss: 0.1494 Explore P: 0.0325
Episode: 806 Total reward: 199.0 Training loss: 0.1568 Explore P: 0.0321
Episode: 807 Total reward: 196.0 Training loss: 2.8326 Explore P: 0.0317
Episode: 808 Total reward: 199.0 Training loss: 0.5419 Explore P: 0.0312
Episode: 809 Total reward: 198.0 Training loss: 0.1298 Explore P: 0.0308
Episode: 810 Total reward: 189.0 Training loss: 0.1105 Explore P: 0.0304
Episode: 811 Total reward: 189.0 Training loss: 0.2036 Explore P: 0.0300
Episode: 812 Total reward: 194.0 Training loss: 0.1918 Explore P: 0.0297
Episode: 813 Total reward: 184.0 Training loss: 0.3842 Explore P: 0.0293
Episode: 814 Total reward: 197.0 Training loss: 0.0876 Explore P: 0.0289
Episode: 815 Total reward: 193.0 Training loss: 0.1377 Explore P: 0.0286
Episode: 816 Total reward: 167.0 Training loss: 0.2303 Explore P: 0.0282
Episode: 817 Total reward: 196.0 Training loss: 0.3743 Explore P: 0.0279
Episode: 818 Total reward: 195.0 Training loss: 0.1572 Explore P: 0.0275
Episode: 819 Total reward: 199.0 Training loss: 0.0655 Explore P: 0.0272
Episode: 820 Total reward: 165.0 Training loss: 0.1467 Explore P: 0.0269
Episode: 821 Total reward: 198.0 Training loss: 0.0730 Explore P: 0.0266
Episode: 822 Total reward: 186.0 Training loss: 0.0975 Explore P: 0.0263
Episode: 823 Total reward: 199.0 Training loss: 0.1363 Explore P: 0.0260
Episode: 824 Total reward: 173.0 Training loss: 0.1534 Explore P: 0.0257
Episode: 825 Total reward: 186.0 Training loss: 0.0679 Explore P: 0.0254
Episode: 826 Total reward: 194.0 Training loss: 0.2158 Explore P: 0.0251
Episode: 827 Total reward: 199.0 Training loss: 0.1993 Explore P: 0.0248
Episode: 828 Total reward: 199.0 Training loss: 0.1181 Explore P: 0.0245
Episode: 829 Total reward: 192.0 Training loss: 0.1156 Explore P: 0.0242
Episode: 830 Total reward: 199.0 Training loss: 42.3522 Explore P: 0.0240
Episode: 831 Total reward: 199.0 Training loss: 0.2039 Explore P: 0.0237
Episode: 832 Total reward: 192.0 Training loss: 0.2535 Explore P: 0.0234
Episode: 833 Total reward: 184.0 Training loss: 0.2979 Explore P: 0.0232
Episode: 834 Total reward: 199.0 Training loss: 0.1348 Explore P: 0.0229
Episode: 835 Total reward: 199.0 Training loss: 0.1467 Explore P: 0.0227
Episode: 836 Total reward: 199.0 Training loss: 0.0345 Explore P: 0.0224
Episode: 837 Total reward: 193.0 Training loss: 0.0640 Explore P: 0.0222
Episode: 838 Total reward: 199.0 Training loss: 0.0717 Explore P: 0.0219
Episode: 839 Total reward: 199.0 Training loss: 0.0873 Explore P: 0.0217
Episode: 840 Total reward: 187.0 Training loss: 1.8697 Explore P: 0.0215
Episode: 841 Total reward: 199.0 Training loss: 0.1886 Explore P: 0.0213
Episode: 842 Total reward: 196.0 Training loss: 0.1109 Explore P: 0.0210
Episode: 843 Total reward: 199.0 Training loss: 0.2017 Explore P: 0.0208
Episode: 844 Total reward: 199.0 Training loss: 0.0915 Explore P: 0.0206
Episode: 845 Total reward: 199.0 Training loss: 0.1089 Explore P: 0.0204
Episode: 846 Total reward: 199.0 Training loss: 0.0699 Explore P: 0.0202
Episode: 847 Total reward: 199.0 Training loss: 0.0322 Explore P: 0.0200
Episode: 848 Total reward: 199.0 Training loss: 0.0553 Explore P: 0.0198
Episode: 849 Total reward: 199.0 Training loss: 0.0607 Explore P: 0.0196
Episode: 850 Total reward: 199.0 Training loss: 0.0391 Explore P: 0.0194
Episode: 851 Total reward: 199.0 Training loss: 0.1549 Explore P: 0.0192
Episode: 852 Total reward: 199.0 Training loss: 1.0611 Explore P: 0.0190
Episode: 853 Total reward: 199.0 Training loss: 0.0329 Explore P: 0.0189
Episode: 854 Total reward: 199.0 Training loss: 0.0164 Explore P: 0.0187
Episode: 855 Total reward: 199.0 Training loss: 0.0511 Explore P: 0.0185
Episode: 856 Total reward: 199.0 Training loss: 0.0240 Explore P: 0.0184
Episode: 857 Total reward: 199.0 Training loss: 0.0197 Explore P: 0.0182
Episode: 858 Total reward: 199.0 Training loss: 0.0404 Explore P: 0.0180
Episode: 859 Total reward: 199.0 Training loss: 0.0445 Explore P: 0.0179
Episode: 860 Total reward: 199.0 Training loss: 1.1894 Explore P: 0.0177
Episode: 861 Total reward: 199.0 Training loss: 5.1718 Explore P: 0.0176
Episode: 862 Total reward: 199.0 Training loss: 0.0434 Explore P: 0.0174
Episode: 863 Total reward: 199.0 Training loss: 0.0364 Explore P: 0.0173
Episode: 864 Total reward: 199.0 Training loss: 0.0407 Explore P: 0.0171
Episode: 865 Total reward: 199.0 Training loss: 0.1272 Explore P: 0.0170
Episode: 866 Total reward: 199.0 Training loss: 0.0670 Explore P: 0.0168
Episode: 867 Total reward: 199.0 Training loss: 0.0593 Explore P: 0.0167
Episode: 868 Total reward: 199.0 Training loss: 0.0576 Explore P: 0.0166
Episode: 869 Total reward: 199.0 Training loss: 0.0667 Explore P: 0.0165
Episode: 870 Total reward: 199.0 Training loss: 0.0404 Explore P: 0.0163
Episode: 871 Total reward: 199.0 Training loss: 0.0405 Explore P: 0.0162
Episode: 872 Total reward: 199.0 Training loss: 0.0216 Explore P: 0.0161
Episode: 873 Total reward: 199.0 Training loss: 0.1075 Explore P: 0.0160
Episode: 874 Total reward: 199.0 Training loss: 122.3122 Explore P: 0.0158
Episode: 875 Total reward: 199.0 Training loss: 0.0838 Explore P: 0.0157
Episode: 876 Total reward: 199.0 Training loss: 0.0522 Explore P: 0.0156
Episode: 877 Total reward: 199.0 Training loss: 0.0974 Explore P: 0.0155
Episode: 878 Total reward: 199.0 Training loss: 0.0367 Explore P: 0.0154
Episode: 879 Total reward: 199.0 Training loss: 0.0469 Explore P: 0.0153
Episode: 880 Total reward: 199.0 Training loss: 0.0352 Explore P: 0.0152
Episode: 881 Total reward: 199.0 Training loss: 0.0750 Explore P: 0.0151
Episode: 882 Total reward: 199.0 Training loss: 0.0870 Explore P: 0.0150
Episode: 883 Total reward: 199.0 Training loss: 0.0498 Explore P: 0.0149
Episode: 884 Total reward: 199.0 Training loss: 0.1313 Explore P: 0.0148
Episode: 885 Total reward: 199.0 Training loss: 0.0895 Explore P: 0.0147
Episode: 886 Total reward: 199.0 Training loss: 0.1233 Explore P: 0.0146
Episode: 887 Total reward: 199.0 Training loss: 0.0890 Explore P: 0.0145
Episode: 888 Total reward: 199.0 Training loss: 0.0627 Explore P: 0.0144
Episode: 889 Total reward: 199.0 Training loss: 0.0469 Explore P: 0.0143
Episode: 890 Total reward: 199.0 Training loss: 0.0432 Explore P: 0.0142
Episode: 891 Total reward: 199.0 Training loss: 215.9361 Explore P: 0.0142
Episode: 892 Total reward: 199.0 Training loss: 0.0764 Explore P: 0.0141
Episode: 893 Total reward: 199.0 Training loss: 9.3446 Explore P: 0.0140
Episode: 894 Total reward: 199.0 Training loss: 0.1309 Explore P: 0.0139
Episode: 895 Total reward: 199.0 Training loss: 0.0522 Explore P: 0.0138
Episode: 896 Total reward: 199.0 Training loss: 0.0285 Explore P: 0.0138
Episode: 897 Total reward: 199.0 Training loss: 0.0424 Explore P: 0.0137
Episode: 898 Total reward: 199.0 Training loss: 0.1380 Explore P: 0.0136
Episode: 899 Total reward: 199.0 Training loss: 0.0300 Explore P: 0.0136
Episode: 900 Total reward: 199.0 Training loss: 0.0724 Explore P: 0.0135
Episode: 901 Total reward: 199.0 Training loss: 0.0825 Explore P: 0.0134
Episode: 902 Total reward: 199.0 Training loss: 0.1527 Explore P: 0.0133
Episode: 903 Total reward: 199.0 Training loss: 0.1715 Explore P: 0.0133
Episode: 904 Total reward: 192.0 Training loss: 0.3314 Explore P: 0.0132
Episode: 905 Total reward: 18.0 Training loss: 0.2351 Explore P: 0.0132
Episode: 906 Total reward: 163.0 Training loss: 0.1547 Explore P: 0.0132
Episode: 907 Total reward: 178.0 Training loss: 0.0973 Explore P: 0.0131
Episode: 908 Total reward: 199.0 Training loss: 0.0765 Explore P: 0.0130
Episode: 909 Total reward: 199.0 Training loss: 0.0693 Explore P: 0.0130
Episode: 910 Total reward: 199.0 Training loss: 0.1490 Explore P: 0.0129
Episode: 911 Total reward: 199.0 Training loss: 0.1552 Explore P: 0.0129
Episode: 912 Total reward: 199.0 Training loss: 0.0532 Explore P: 0.0128
Episode: 913 Total reward: 199.0 Training loss: 0.1208 Explore P: 0.0128
Episode: 914 Total reward: 199.0 Training loss: 0.1147 Explore P: 0.0127
Episode: 915 Total reward: 199.0 Training loss: 170.9050 Explore P: 0.0126
Episode: 916 Total reward: 199.0 Training loss: 0.1050 Explore P: 0.0126
Episode: 917 Total reward: 199.0 Training loss: 0.1113 Explore P: 0.0125
Episode: 918 Total reward: 199.0 Training loss: 121.4161 Explore P: 0.0125
Episode: 919 Total reward: 195.0 Training loss: 0.1119 Explore P: 0.0124
Episode: 920 Total reward: 199.0 Training loss: 0.0800 Explore P: 0.0124
Episode: 921 Total reward: 199.0 Training loss: 0.1213 Explore P: 0.0123
Episode: 922 Total reward: 158.0 Training loss: 0.1197 Explore P: 0.0123
Episode: 923 Total reward: 17.0 Training loss: 0.1818 Explore P: 0.0123
Episode: 924 Total reward: 20.0 Training loss: 0.2295 Explore P: 0.0123
Episode: 925 Total reward: 187.0 Training loss: 0.2543 Explore P: 0.0123
Episode: 926 Total reward: 199.0 Training loss: 0.9388 Explore P: 0.0122
Episode: 927 Total reward: 199.0 Training loss: 0.0859 Explore P: 0.0122
Episode: 928 Total reward: 199.0 Training loss: 0.2419 Explore P: 0.0121
Episode: 929 Total reward: 199.0 Training loss: 0.1908 Explore P: 0.0121
Episode: 930 Total reward: 199.0 Training loss: 427.3108 Explore P: 0.0120
Episode: 931 Total reward: 199.0 Training loss: 39.5287 Explore P: 0.0120
Episode: 932 Total reward: 168.0 Training loss: 0.5749 Explore P: 0.0120
Episode: 933 Total reward: 141.0 Training loss: 0.2109 Explore P: 0.0119
Episode: 934 Total reward: 114.0 Training loss: 0.7370 Explore P: 0.0119
Episode: 935 Total reward: 14.0 Training loss: 0.8284 Explore P: 0.0119
Episode: 936 Total reward: 118.0 Training loss: 0.2348 Explore P: 0.0119
Episode: 937 Total reward: 199.0 Training loss: 234.0520 Explore P: 0.0119
Episode: 938 Total reward: 199.0 Training loss: 0.6597 Explore P: 0.0118
Episode: 939 Total reward: 113.0 Training loss: 0.7521 Explore P: 0.0118
Episode: 940 Total reward: 121.0 Training loss: 0.3623 Explore P: 0.0118
Episode: 941 Total reward: 199.0 Training loss: 0.5087 Explore P: 0.0117
Episode: 942 Total reward: 199.0 Training loss: 0.1479 Explore P: 0.0117
Episode: 943 Total reward: 199.0 Training loss: 0.6023 Explore P: 0.0117
Episode: 944 Total reward: 199.0 Training loss: 0.3447 Explore P: 0.0116
Episode: 945 Total reward: 148.0 Training loss: 1377.5704 Explore P: 0.0116
Episode: 946 Total reward: 185.0 Training loss: 0.6984 Explore P: 0.0116
Episode: 947 Total reward: 125.0 Training loss: 0.8385 Explore P: 0.0116
Episode: 948 Total reward: 164.0 Training loss: 0.1964 Explore P: 0.0115
Episode: 949 Total reward: 151.0 Training loss: 0.2799 Explore P: 0.0115
Episode: 950 Total reward: 119.0 Training loss: 0.6805 Explore P: 0.0115
Episode: 951 Total reward: 108.0 Training loss: 0.6028 Explore P: 0.0115
Episode: 952 Total reward: 133.0 Training loss: 0.3216 Explore P: 0.0115
Episode: 953 Total reward: 143.0 Training loss: 0.6201 Explore P: 0.0114
Episode: 954 Total reward: 107.0 Training loss: 0.5929 Explore P: 0.0114
Episode: 955 Total reward: 118.0 Training loss: 753.3880 Explore P: 0.0114
Episode: 956 Total reward: 165.0 Training loss: 0.2063 Explore P: 0.0114
Episode: 957 Total reward: 132.0 Training loss: 699.3945 Explore P: 0.0114
Episode: 958 Total reward: 174.0 Training loss: 0.5975 Explore P: 0.0114
Episode: 959 Total reward: 131.0 Training loss: 0.5598 Explore P: 0.0113
Episode: 960 Total reward: 126.0 Training loss: 0.3417 Explore P: 0.0113
Episode: 961 Total reward: 168.0 Training loss: 0.4599 Explore P: 0.0113
Episode: 962 Total reward: 199.0 Training loss: 0.2783 Explore P: 0.0113
Episode: 963 Total reward: 199.0 Training loss: 0.7030 Explore P: 0.0112
Episode: 964 Total reward: 199.0 Training loss: 1.3158 Explore P: 0.0112
Episode: 965 Total reward: 199.0 Training loss: 0.5606 Explore P: 0.0112
Episode: 966 Total reward: 199.0 Training loss: 0.3262 Explore P: 0.0112
Episode: 967 Total reward: 199.0 Training loss: 0.3605 Explore P: 0.0111
Episode: 968 Total reward: 199.0 Training loss: 0.4147 Explore P: 0.0111
Episode: 969 Total reward: 199.0 Training loss: 0.3779 Explore P: 0.0111
Episode: 970 Total reward: 199.0 Training loss: 0.4805 Explore P: 0.0111
Episode: 971 Total reward: 199.0 Training loss: 341.7383 Explore P: 0.0111
Episode: 972 Total reward: 199.0 Training loss: 0.3265 Explore P: 0.0110
Episode: 973 Total reward: 199.0 Training loss: 0.4512 Explore P: 0.0110
Episode: 974 Total reward: 199.0 Training loss: 0.3421 Explore P: 0.0110
Episode: 975 Total reward: 199.0 Training loss: 0.4589 Explore P: 0.0110
Episode: 976 Total reward: 199.0 Training loss: 0.4558 Explore P: 0.0110
Episode: 977 Total reward: 199.0 Training loss: 0.3744 Explore P: 0.0109
Episode: 978 Total reward: 199.0 Training loss: 0.4354 Explore P: 0.0109
Episode: 979 Total reward: 199.0 Training loss: 237.2906 Explore P: 0.0109
Episode: 980 Total reward: 199.0 Training loss: 0.6117 Explore P: 0.0109
Episode: 981 Total reward: 199.0 Training loss: 0.3637 Explore P: 0.0109
Episode: 982 Total reward: 199.0 Training loss: 0.3231 Explore P: 0.0109
Episode: 983 Total reward: 199.0 Training loss: 0.3834 Explore P: 0.0108
Episode: 984 Total reward: 199.0 Training loss: 0.1899 Explore P: 0.0108
Episode: 985 Total reward: 199.0 Training loss: 0.4181 Explore P: 0.0108
Episode: 986 Total reward: 199.0 Training loss: 0.2582 Explore P: 0.0108
Episode: 987 Total reward: 199.0 Training loss: 0.2873 Explore P: 0.0108
Episode: 988 Total reward: 199.0 Training loss: 0.3227 Explore P: 0.0108
Episode: 989 Total reward: 199.0 Training loss: 0.4319 Explore P: 0.0107
Episode: 990 Total reward: 199.0 Training loss: 0.5648 Explore P: 0.0107
Episode: 991 Total reward: 199.0 Training loss: 0.2358 Explore P: 0.0107
Episode: 992 Total reward: 199.0 Training loss: 0.2825 Explore P: 0.0107
Episode: 993 Total reward: 199.0 Training loss: 272.4292 Explore P: 0.0107
Episode: 994 Total reward: 199.0 Training loss: 0.3563 Explore P: 0.0107
Episode: 995 Total reward: 199.0 Training loss: 0.2031 Explore P: 0.0107
Episode: 996 Total reward: 199.0 Training loss: 0.3343 Explore P: 0.0106
Episode: 997 Total reward: 199.0 Training loss: 0.2895 Explore P: 0.0106
Episode: 998 Total reward: 199.0 Training loss: 0.4667 Explore P: 0.0106
Episode: 999 Total reward: 199.0 Training loss: 0.2965 Explore P: 0.0106

Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.


In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [15]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[15]:
<matplotlib.text.Text at 0x261501fa320>

In [181]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[181]:
<matplotlib.text.Text at 0x125c136d8>

Testing

Let's checkout how our trained agent plays the game.


In [16]:
test_episodes = 10
test_max_steps = 400
env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


INFO:tensorflow:Restoring parameters from checkpoints\cartpole.ckpt
[2017-06-19 14:57:03,184] Restoring parameters from checkpoints\cartpole.ckpt

In [18]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.