Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym directory.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-04-30 12:34:16,131] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [4]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$.

Previously, we used this equation to learn values for a Q-table. However, for this game there is a huge number of possible states. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring the limits of floating point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q-value $Q(s, a)$ is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
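As a quick numeric check of the target and loss above, here is a minimal sketch with made-up numbers. The reward, discount, and Q-values below are illustrative assumptions, not outputs of the network.

```python
# Illustrative values only; none of these come from a trained network.
gamma = 0.99                  # discount factor
reward = 1.0                  # reward observed after taking action a in state s
next_q_values = [0.5, 2.0]    # hypothetical Q(s', a') for actions 0 and 1
current_q = 2.5               # hypothetical Q(s, a) predicted by the network

# Target: r + gamma * max over a' of Q(s', a')
target_q = reward + gamma * max(next_q_values)

# Squared error that the optimizer would minimize
loss = (target_q - current_q) ** 2

print(target_q, loss)
```

Here the target is roughly 2.98, so the prediction of 2.5 is too low and a gradient step would push $Q(s, a)$ upward.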

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seem to be good enough, though three might be better. Feel free to try it out.


In [5]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
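
The one-hot trick used to compute self.Q can be hard to picture, so here is a minimal pure-Python sketch of the same idea. The helper name select_q_values and the sample numbers are hypothetical, chosen only for illustration; the network itself does this with tf.one_hot, tf.multiply, and tf.reduce_sum.

```python
def select_q_values(outputs, actions, action_size=2):
    """Pick Q(s, a) for the action taken in each row via a one-hot mask."""
    selected = []
    for row, action in zip(outputs, actions):
        # One-hot encode the action, e.g. action 1 -> [0.0, 1.0]
        one_hot = [1.0 if i == action else 0.0 for i in range(action_size)]
        # Multiply elementwise and sum: only the chosen action's Q-value survives
        selected.append(sum(q * m for q, m in zip(row, one_hot)))
    return selected

# Hypothetical network outputs for a batch of three states
outputs = [[0.2, 1.5], [3.0, -1.0], [0.7, 0.4]]
actions = [1, 0, 0]
print(select_q_values(outputs, actions))  # [1.5, 3.0, 0.7]
```

Only the Q-value of the action actually taken contributes to the loss, so the other output receives no gradient for that sample.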

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [6]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
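
To see the eviction behavior of a bounded deque in isolation, here is a small sketch. The capacity of 3 and the integer "experiences" are arbitrary choices for illustration.

```python
from collections import deque

# A buffer that holds at most three items
buffer = deque(maxlen=3)

# Append five items; the two oldest are pushed out automatically
for experience in range(5):
    buffer.append(experience)

print(buffer)  # deque([2, 3, 4], maxlen=3)
```

The Memory class relies on this behavior: once the buffer is full, each new experience silently replaces the oldest one.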

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
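
The action-selection rule can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name epsilon_greedy and the sample Q-values are assumptions for the example, not part of the notebook's code.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.9]  # hypothetical Q-values for actions 0 and 1
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy: 1
```

With epsilon set to 1.0 the same function picks uniformly at random, which is roughly how the agent behaves at the start of training.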

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
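
The target-setting step in the loop above can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function name compute_targets and the explicit dones flags are hypothetical, whereas the training cell in this notebook marks terminal transitions with an all-zero next state instead.

```python
def compute_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute Q-learning targets for a mini-batch of transitions."""
    targets = []
    for r, next_qs, done in zip(rewards, next_q_values, dones):
        if done:
            # Episode ended: there is no future reward to bootstrap from
            targets.append(r)
        else:
            # Otherwise bootstrap from the best next-state Q-value
            targets.append(r + gamma * max(next_qs))
    return targets

# Hypothetical mini-batch of two transitions; the second one is terminal
rewards = [1.0, 1.0]
next_q_values = [[0.5, 2.0], [4.0, 3.0]]
dones = [False, True]
print(compute_targets(rewards, next_q_values, dones))
```

For the terminal transition the next-state Q-values are ignored entirely, which matches setting $\hat{Q}_j = r_j$ when the episode ends.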

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [7]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory
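
To get a feel for the exploration parameters, this sketch evaluates the exponential decay used in the training loop at a few step counts. The helper name explore_probability and the chosen step counts are assumptions for illustration; the formula and parameter values are the ones from this notebook.

```python
import math

explore_start = 1.0    # exploration probability at start
explore_stop = 0.01    # minimum exploration probability
decay_rate = 0.0001    # exponential decay rate for exploration prob

def explore_probability(step):
    """Exploration probability after a given number of training steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

for step in (0, 1000, 10000, 50000):
    print(step, round(explore_probability(step), 4))
```

The probability starts at explore_start and approaches explore_stop as the total step count grows, so a smaller decay_rate keeps the agent exploring for longer.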

In [8]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [9]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [10]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    loss = float('nan')  # defined up front in case an episode ends before the first training step
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 9.0 Training loss: 1.0316 Explore P: 0.9991
Episode: 2 Total reward: 65.0 Training loss: 1.0437 Explore P: 0.9927
Episode: 3 Total reward: 15.0 Training loss: 1.0643 Explore P: 0.9912
Episode: 4 Total reward: 9.0 Training loss: 1.0814 Explore P: 0.9903
Episode: 5 Total reward: 26.0 Training loss: 1.0616 Explore P: 0.9878
Episode: 6 Total reward: 22.0 Training loss: 1.0118 Explore P: 0.9857
Episode: 7 Total reward: 24.0 Training loss: 1.1686 Explore P: 0.9833
Episode: 8 Total reward: 30.0 Training loss: 1.1355 Explore P: 0.9804
Episode: 9 Total reward: 13.0 Training loss: 1.1702 Explore P: 0.9791
Episode: 10 Total reward: 17.0 Training loss: 1.2292 Explore P: 0.9775
Episode: 11 Total reward: 30.0 Training loss: 1.3595 Explore P: 0.9746
Episode: 12 Total reward: 9.0 Training loss: 1.2964 Explore P: 0.9737
Episode: 13 Total reward: 19.0 Training loss: 1.4369 Explore P: 0.9719
Episode: 14 Total reward: 28.0 Training loss: 1.6916 Explore P: 0.9692
Episode: 15 Total reward: 16.0 Training loss: 1.6912 Explore P: 0.9677
Episode: 16 Total reward: 38.0 Training loss: 1.8446 Explore P: 0.9640
Episode: 17 Total reward: 14.0 Training loss: 1.8835 Explore P: 0.9627
Episode: 18 Total reward: 40.0 Training loss: 2.2388 Explore P: 0.9589
Episode: 19 Total reward: 15.0 Training loss: 2.4478 Explore P: 0.9575
Episode: 20 Total reward: 21.0 Training loss: 5.3852 Explore P: 0.9555
Episode: 21 Total reward: 12.0 Training loss: 3.4507 Explore P: 0.9544
Episode: 22 Total reward: 18.0 Training loss: 4.0014 Explore P: 0.9527
Episode: 23 Total reward: 17.0 Training loss: 2.1087 Explore P: 0.9511
Episode: 24 Total reward: 36.0 Training loss: 4.9565 Explore P: 0.9477
Episode: 25 Total reward: 29.0 Training loss: 2.6575 Explore P: 0.9450
Episode: 26 Total reward: 10.0 Training loss: 4.6021 Explore P: 0.9440
Episode: 27 Total reward: 12.0 Training loss: 4.7069 Explore P: 0.9429
Episode: 28 Total reward: 45.0 Training loss: 6.1784 Explore P: 0.9387
Episode: 29 Total reward: 20.0 Training loss: 3.5322 Explore P: 0.9369
Episode: 30 Total reward: 17.0 Training loss: 6.0754 Explore P: 0.9353
Episode: 31 Total reward: 37.0 Training loss: 10.2596 Explore P: 0.9319
Episode: 32 Total reward: 12.0 Training loss: 4.3070 Explore P: 0.9308
Episode: 33 Total reward: 9.0 Training loss: 16.1298 Explore P: 0.9299
Episode: 34 Total reward: 16.0 Training loss: 8.0035 Explore P: 0.9285
Episode: 35 Total reward: 49.0 Training loss: 33.5364 Explore P: 0.9240
Episode: 36 Total reward: 15.0 Training loss: 6.0862 Explore P: 0.9226
Episode: 37 Total reward: 13.0 Training loss: 4.6693 Explore P: 0.9214
Episode: 38 Total reward: 12.0 Training loss: 4.8281 Explore P: 0.9203
Episode: 39 Total reward: 11.0 Training loss: 10.1451 Explore P: 0.9193
Episode: 40 Total reward: 16.0 Training loss: 31.4167 Explore P: 0.9179
Episode: 41 Total reward: 12.0 Training loss: 36.0144 Explore P: 0.9168
Episode: 42 Total reward: 14.0 Training loss: 13.4123 Explore P: 0.9155
Episode: 43 Total reward: 22.0 Training loss: 17.7679 Explore P: 0.9135
Episode: 44 Total reward: 19.0 Training loss: 4.9361 Explore P: 0.9118
Episode: 45 Total reward: 41.0 Training loss: 65.9558 Explore P: 0.9081
Episode: 46 Total reward: 29.0 Training loss: 24.1344 Explore P: 0.9055
Episode: 47 Total reward: 14.0 Training loss: 19.7040 Explore P: 0.9043
Episode: 48 Total reward: 17.0 Training loss: 15.5691 Explore P: 0.9027
Episode: 49 Total reward: 21.0 Training loss: 31.6732 Explore P: 0.9009
Episode: 50 Total reward: 22.0 Training loss: 10.6938 Explore P: 0.8989
Episode: 51 Total reward: 10.0 Training loss: 20.7102 Explore P: 0.8980
Episode: 52 Total reward: 16.0 Training loss: 8.4036 Explore P: 0.8966
Episode: 53 Total reward: 15.0 Training loss: 5.8156 Explore P: 0.8953
Episode: 54 Total reward: 17.0 Training loss: 9.9205 Explore P: 0.8938
Episode: 55 Total reward: 16.0 Training loss: 6.3906 Explore P: 0.8924
Episode: 56 Total reward: 24.0 Training loss: 9.1586 Explore P: 0.8902
Episode: 57 Total reward: 14.0 Training loss: 45.0235 Explore P: 0.8890
Episode: 58 Total reward: 18.0 Training loss: 88.5340 Explore P: 0.8874
Episode: 59 Total reward: 18.0 Training loss: 29.1445 Explore P: 0.8859
Episode: 60 Total reward: 23.0 Training loss: 31.5939 Explore P: 0.8838
Episode: 61 Total reward: 13.0 Training loss: 33.3697 Explore P: 0.8827
Episode: 62 Total reward: 20.0 Training loss: 140.1866 Explore P: 0.8810
Episode: 63 Total reward: 10.0 Training loss: 59.9095 Explore P: 0.8801
Episode: 64 Total reward: 16.0 Training loss: 27.7374 Explore P: 0.8787
Episode: 65 Total reward: 32.0 Training loss: 110.4384 Explore P: 0.8759
Episode: 66 Total reward: 13.0 Training loss: 23.4993 Explore P: 0.8748
Episode: 67 Total reward: 21.0 Training loss: 8.4738 Explore P: 0.8730
Episode: 68 Total reward: 18.0 Training loss: 18.0758 Explore P: 0.8714
Episode: 69 Total reward: 33.0 Training loss: 15.3772 Explore P: 0.8686
Episode: 70 Total reward: 40.0 Training loss: 80.6466 Explore P: 0.8652
Episode: 71 Total reward: 19.0 Training loss: 27.8352 Explore P: 0.8636
Episode: 72 Total reward: 11.0 Training loss: 36.6526 Explore P: 0.8626
Episode: 73 Total reward: 10.0 Training loss: 9.5321 Explore P: 0.8618
Episode: 74 Total reward: 19.0 Training loss: 37.4352 Explore P: 0.8601
Episode: 75 Total reward: 20.0 Training loss: 13.4242 Explore P: 0.8584
Episode: 76 Total reward: 12.0 Training loss: 17.5282 Explore P: 0.8574
Episode: 77 Total reward: 27.0 Training loss: 57.5456 Explore P: 0.8551
Episode: 78 Total reward: 22.0 Training loss: 139.7404 Explore P: 0.8533
Episode: 79 Total reward: 38.0 Training loss: 13.9993 Explore P: 0.8501
Episode: 80 Total reward: 51.0 Training loss: 133.2694 Explore P: 0.8458
Episode: 81 Total reward: 19.0 Training loss: 38.1509 Explore P: 0.8442
Episode: 82 Total reward: 8.0 Training loss: 19.8279 Explore P: 0.8436
Episode: 83 Total reward: 11.0 Training loss: 18.7225 Explore P: 0.8426
Episode: 84 Total reward: 27.0 Training loss: 80.3270 Explore P: 0.8404
Episode: 85 Total reward: 22.0 Training loss: 17.2920 Explore P: 0.8386
Episode: 86 Total reward: 20.0 Training loss: 47.8318 Explore P: 0.8369
Episode: 87 Total reward: 7.0 Training loss: 13.8299 Explore P: 0.8363
Episode: 88 Total reward: 14.0 Training loss: 21.5745 Explore P: 0.8352
Episode: 89 Total reward: 26.0 Training loss: 98.5660 Explore P: 0.8330
Episode: 90 Total reward: 12.0 Training loss: 19.5638 Explore P: 0.8321
Episode: 91 Total reward: 19.0 Training loss: 163.7694 Explore P: 0.8305
Episode: 92 Total reward: 11.0 Training loss: 147.5251 Explore P: 0.8296
Episode: 93 Total reward: 18.0 Training loss: 20.9317 Explore P: 0.8281
Episode: 94 Total reward: 17.0 Training loss: 54.6825 Explore P: 0.8267
Episode: 95 Total reward: 15.0 Training loss: 483.4086 Explore P: 0.8255
Episode: 96 Total reward: 14.0 Training loss: 29.7095 Explore P: 0.8244
Episode: 97 Total reward: 10.0 Training loss: 659.3837 Explore P: 0.8235
Episode: 98 Total reward: 12.0 Training loss: 155.0952 Explore P: 0.8226
Episode: 99 Total reward: 41.0 Training loss: 27.4600 Explore P: 0.8192
Episode: 100 Total reward: 24.0 Training loss: 186.8034 Explore P: 0.8173
Episode: 101 Total reward: 21.0 Training loss: 25.5268 Explore P: 0.8156
Episode: 102 Total reward: 20.0 Training loss: 19.2176 Explore P: 0.8140
Episode: 103 Total reward: 32.0 Training loss: 245.4964 Explore P: 0.8114
Episode: 104 Total reward: 8.0 Training loss: 21.6634 Explore P: 0.8108
Episode: 105 Total reward: 14.0 Training loss: 30.7650 Explore P: 0.8097
Episode: 106 Total reward: 27.0 Training loss: 92.3039 Explore P: 0.8075
Episode: 107 Total reward: 24.0 Training loss: 254.7927 Explore P: 0.8056
Episode: 108 Total reward: 15.0 Training loss: 78.5510 Explore P: 0.8044
Episode: 109 Total reward: 10.0 Training loss: 291.2218 Explore P: 0.8036
Episode: 110 Total reward: 14.0 Training loss: 374.8905 Explore P: 0.8025
Episode: 111 Total reward: 45.0 Training loss: 209.2974 Explore P: 0.7990
Episode: 112 Total reward: 14.0 Training loss: 19.6543 Explore P: 0.7978
Episode: 113 Total reward: 20.0 Training loss: 314.2139 Explore P: 0.7963
Episode: 114 Total reward: 41.0 Training loss: 23.7455 Explore P: 0.7931
Episode: 115 Total reward: 31.0 Training loss: 222.9855 Explore P: 0.7906
Episode: 116 Total reward: 34.0 Training loss: 119.0798 Explore P: 0.7880
Episode: 117 Total reward: 12.0 Training loss: 186.4804 Explore P: 0.7871
Episode: 118 Total reward: 7.0 Training loss: 400.1087 Explore P: 0.7865
Episode: 119 Total reward: 12.0 Training loss: 954.2346 Explore P: 0.7856
Episode: 120 Total reward: 20.0 Training loss: 444.2370 Explore P: 0.7840
Episode: 121 Total reward: 12.0 Training loss: 17.9080 Explore P: 0.7831
Episode: 122 Total reward: 15.0 Training loss: 444.0672 Explore P: 0.7819
Episode: 123 Total reward: 28.0 Training loss: 230.3163 Explore P: 0.7798
Episode: 124 Total reward: 13.0 Training loss: 466.7574 Explore P: 0.7788
Episode: 125 Total reward: 13.0 Training loss: 221.2368 Explore P: 0.7778
Episode: 126 Total reward: 48.0 Training loss: 231.2450 Explore P: 0.7741
Episode: 127 Total reward: 21.0 Training loss: 22.7591 Explore P: 0.7725
Episode: 128 Total reward: 10.0 Training loss: 14.0011 Explore P: 0.7717
Episode: 129 Total reward: 15.0 Training loss: 377.9506 Explore P: 0.7706
Episode: 130 Total reward: 11.0 Training loss: 19.8494 Explore P: 0.7698
Episode: 131 Total reward: 11.0 Training loss: 217.9535 Explore P: 0.7689
Episode: 132 Total reward: 13.0 Training loss: 315.4376 Explore P: 0.7679
Episode: 133 Total reward: 49.0 Training loss: 154.0728 Explore P: 0.7642
Episode: 134 Total reward: 19.0 Training loss: 15.7249 Explore P: 0.7628
Episode: 135 Total reward: 62.0 Training loss: 896.5980 Explore P: 0.7582
Episode: 136 Total reward: 12.0 Training loss: 186.3454 Explore P: 0.7573
Episode: 137 Total reward: 21.0 Training loss: 140.0974 Explore P: 0.7557
Episode: 138 Total reward: 16.0 Training loss: 572.4315 Explore P: 0.7545
Episode: 139 Total reward: 34.0 Training loss: 301.6552 Explore P: 0.7520
Episode: 140 Total reward: 12.0 Training loss: 203.2100 Explore P: 0.7511
Episode: 141 Total reward: 30.0 Training loss: 241.1852 Explore P: 0.7489
Episode: 142 Total reward: 29.0 Training loss: 9.7095 Explore P: 0.7467
Episode: 143 Total reward: 11.0 Training loss: 10.7972 Explore P: 0.7459
Episode: 144 Total reward: 19.0 Training loss: 13.0147 Explore P: 0.7445
Episode: 145 Total reward: 15.0 Training loss: 1052.9778 Explore P: 0.7434
Episode: 146 Total reward: 21.0 Training loss: 362.5939 Explore P: 0.7419
Episode: 147 Total reward: 8.0 Training loss: 390.6187 Explore P: 0.7413
Episode: 148 Total reward: 28.0 Training loss: 209.5749 Explore P: 0.7392
Episode: 149 Total reward: 60.0 Training loss: 143.4173 Explore P: 0.7349
Episode: 150 Total reward: 33.0 Training loss: 9.2153 Explore P: 0.7325
Episode: 151 Total reward: 13.0 Training loss: 464.5854 Explore P: 0.7316
Episode: 152 Total reward: 30.0 Training loss: 132.4353 Explore P: 0.7294
Episode: 153 Total reward: 12.0 Training loss: 264.1490 Explore P: 0.7285
Episode: 154 Total reward: 29.0 Training loss: 295.4597 Explore P: 0.7264
Episode: 155 Total reward: 16.0 Training loss: 11.2521 Explore P: 0.7253
Episode: 156 Total reward: 10.0 Training loss: 155.1048 Explore P: 0.7246
Episode: 157 Total reward: 8.0 Training loss: 304.4602 Explore P: 0.7240
Episode: 158 Total reward: 23.0 Training loss: 205.3865 Explore P: 0.7224
Episode: 159 Total reward: 27.0 Training loss: 15.8791 Explore P: 0.7205
Episode: 160 Total reward: 14.0 Training loss: 11.8918 Explore P: 0.7195
Episode: 161 Total reward: 11.0 Training loss: 302.2726 Explore P: 0.7187
Episode: 162 Total reward: 50.0 Training loss: 128.7195 Explore P: 0.7151
Episode: 163 Total reward: 14.0 Training loss: 9.6536 Explore P: 0.7142
Episode: 164 Total reward: 17.0 Training loss: 10.0922 Explore P: 0.7130
Episode: 165 Total reward: 19.0 Training loss: 187.0501 Explore P: 0.7116
Episode: 166 Total reward: 15.0 Training loss: 332.9527 Explore P: 0.7106
Episode: 167 Total reward: 12.0 Training loss: 8.3040 Explore P: 0.7097
Episode: 168 Total reward: 16.0 Training loss: 333.9890 Explore P: 0.7086
Episode: 169 Total reward: 18.0 Training loss: 340.7800 Explore P: 0.7074
Episode: 170 Total reward: 9.0 Training loss: 655.5995 Explore P: 0.7067
Episode: 171 Total reward: 12.0 Training loss: 311.3066 Explore P: 0.7059
Episode: 172 Total reward: 26.0 Training loss: 391.5112 Explore P: 0.7041
Episode: 173 Total reward: 11.0 Training loss: 143.7592 Explore P: 0.7033
Episode: 174 Total reward: 13.0 Training loss: 3.3609 Explore P: 0.7024
Episode: 175 Total reward: 13.0 Training loss: 130.4530 Explore P: 0.7015
Episode: 176 Total reward: 18.0 Training loss: 4.6092 Explore P: 0.7003
Episode: 177 Total reward: 24.0 Training loss: 3.7394 Explore P: 0.6986
Episode: 178 Total reward: 11.0 Training loss: 159.1212 Explore P: 0.6979
Episode: 179 Total reward: 14.0 Training loss: 169.5935 Explore P: 0.6969
Episode: 180 Total reward: 23.0 Training loss: 273.5978 Explore P: 0.6953
Episode: 181 Total reward: 9.0 Training loss: 7.2091 Explore P: 0.6947
Episode: 182 Total reward: 21.0 Training loss: 240.0246 Explore P: 0.6933
Episode: 183 Total reward: 20.0 Training loss: 6.2061 Explore P: 0.6919
Episode: 184 Total reward: 21.0 Training loss: 6.9988 Explore P: 0.6905
Episode: 185 Total reward: 12.0 Training loss: 4.0727 Explore P: 0.6897
Episode: 186 Total reward: 28.0 Training loss: 126.6915 Explore P: 0.6878
Episode: 187 Total reward: 18.0 Training loss: 105.3237 Explore P: 0.6865
Episode: 188 Total reward: 8.0 Training loss: 112.5963 Explore P: 0.6860
Episode: 189 Total reward: 11.0 Training loss: 3.1447 Explore P: 0.6853
Episode: 190 Total reward: 11.0 Training loss: 124.6680 Explore P: 0.6845
Episode: 191 Total reward: 9.0 Training loss: 346.4305 Explore P: 0.6839
Episode: 192 Total reward: 15.0 Training loss: 102.8071 Explore P: 0.6829
Episode: 193 Total reward: 17.0 Training loss: 227.7729 Explore P: 0.6818
Episode: 194 Total reward: 11.0 Training loss: 226.1765 Explore P: 0.6810
Episode: 195 Total reward: 17.0 Training loss: 231.0486 Explore P: 0.6799
Episode: 196 Total reward: 13.0 Training loss: 125.8711 Explore P: 0.6790
Episode: 197 Total reward: 15.0 Training loss: 212.7079 Explore P: 0.6780
Episode: 198 Total reward: 11.0 Training loss: 160.4938 Explore P: 0.6773
Episode: 199 Total reward: 11.0 Training loss: 121.8745 Explore P: 0.6765
Episode: 200 Total reward: 8.0 Training loss: 3.1332 Explore P: 0.6760
Episode: 201 Total reward: 24.0 Training loss: 114.8624 Explore P: 0.6744
Episode: 202 Total reward: 19.0 Training loss: 230.5975 Explore P: 0.6732
Episode: 203 Total reward: 33.0 Training loss: 101.5496 Explore P: 0.6710
Episode: 204 Total reward: 13.0 Training loss: 3.2942 Explore P: 0.6701
Episode: 205 Total reward: 18.0 Training loss: 234.2389 Explore P: 0.6689
Episode: 206 Total reward: 19.0 Training loss: 2.4466 Explore P: 0.6677
Episode: 207 Total reward: 14.0 Training loss: 85.8989 Explore P: 0.6668
Episode: 208 Total reward: 16.0 Training loss: 2.7451 Explore P: 0.6657
Episode: 209 Total reward: 16.0 Training loss: 2.2855 Explore P: 0.6647
Episode: 210 Total reward: 19.0 Training loss: 94.4855 Explore P: 0.6634
Episode: 211 Total reward: 10.0 Training loss: 97.9476 Explore P: 0.6628
Episode: 212 Total reward: 13.0 Training loss: 1.5983 Explore P: 0.6619
Episode: 213 Total reward: 21.0 Training loss: 2.4792 Explore P: 0.6605
Episode: 214 Total reward: 15.0 Training loss: 96.0726 Explore P: 0.6596
Episode: 215 Total reward: 15.0 Training loss: 3.8756 Explore P: 0.6586
Episode: 216 Total reward: 17.0 Training loss: 85.6463 Explore P: 0.6575
Episode: 217 Total reward: 16.0 Training loss: 160.4061 Explore P: 0.6565
Episode: 218 Total reward: 16.0 Training loss: 1.9090 Explore P: 0.6554
Episode: 219 Total reward: 9.0 Training loss: 77.8756 Explore P: 0.6548
Episode: 220 Total reward: 10.0 Training loss: 3.8508 Explore P: 0.6542
Episode: 221 Total reward: 10.0 Training loss: 68.0315 Explore P: 0.6536
Episode: 222 Total reward: 17.0 Training loss: 1.8441 Explore P: 0.6525
Episode: 223 Total reward: 11.0 Training loss: 169.6958 Explore P: 0.6518
Episode: 224 Total reward: 10.0 Training loss: 227.1624 Explore P: 0.6511
Episode: 225 Total reward: 17.0 Training loss: 59.7928 Explore P: 0.6500
Episode: 226 Total reward: 15.0 Training loss: 137.8314 Explore P: 0.6491
Episode: 227 Total reward: 11.0 Training loss: 1.4389 Explore P: 0.6484
Episode: 228 Total reward: 10.0 Training loss: 102.4782 Explore P: 0.6477
Episode: 229 Total reward: 8.0 Training loss: 144.4343 Explore P: 0.6472
Episode: 230 Total reward: 22.0 Training loss: 88.6154 Explore P: 0.6458
Episode: 231 Total reward: 12.0 Training loss: 68.9734 Explore P: 0.6451
Episode: 232 Total reward: 11.0 Training loss: 70.0051 Explore P: 0.6444
Episode: 233 Total reward: 10.0 Training loss: 147.1410 Explore P: 0.6437
Episode: 234 Total reward: 22.0 Training loss: 122.8122 Explore P: 0.6423
Episode: 235 Total reward: 15.0 Training loss: 135.3889 Explore P: 0.6414
Episode: 236 Total reward: 12.0 Training loss: 96.0290 Explore P: 0.6406
Episode: 237 Total reward: 12.0 Training loss: 89.6091 Explore P: 0.6399
Episode: 238 Total reward: 11.0 Training loss: 2.1586 Explore P: 0.6392
Episode: 239 Total reward: 12.0 Training loss: 131.0944 Explore P: 0.6384
Episode: 240 Total reward: 14.0 Training loss: 116.3372 Explore P: 0.6375
Episode: 241 Total reward: 11.0 Training loss: 58.4096 Explore P: 0.6368
Episode: 242 Total reward: 14.0 Training loss: 2.9096 Explore P: 0.6360
Episode: 243 Total reward: 15.0 Training loss: 59.3250 Explore P: 0.6350
Episode: 244 Total reward: 16.0 Training loss: 108.4973 Explore P: 0.6340
Episode: 245 Total reward: 11.0 Training loss: 98.9500 Explore P: 0.6333
Episode: 246 Total reward: 14.0 Training loss: 107.4866 Explore P: 0.6325
Episode: 247 Total reward: 9.0 Training loss: 164.1766 Explore P: 0.6319
Episode: 248 Total reward: 18.0 Training loss: 2.6407 Explore P: 0.6308
Episode: 249 Total reward: 10.0 Training loss: 88.3398 Explore P: 0.6302
Episode: 250 Total reward: 14.0 Training loss: 47.4740 Explore P: 0.6293
Episode: 251 Total reward: 12.0 Training loss: 99.3275 Explore P: 0.6286
Episode: 252 Total reward: 10.0 Training loss: 129.4503 Explore P: 0.6279
Episode: 253 Total reward: 16.0 Training loss: 3.1385 Explore P: 0.6270
Episode: 254 Total reward: 13.0 Training loss: 98.2244 Explore P: 0.6262
Episode: 255 Total reward: 32.0 Training loss: 1.9318 Explore P: 0.6242
Episode: 256 Total reward: 22.0 Training loss: 2.4082 Explore P: 0.6228
Episode: 257 Total reward: 12.0 Training loss: 38.6662 Explore P: 0.6221
Episode: 258 Total reward: 9.0 Training loss: 38.1313 Explore P: 0.6216
Episode: 259 Total reward: 11.0 Training loss: 40.9791 Explore P: 0.6209
Episode: 260 Total reward: 13.0 Training loss: 105.4043 Explore P: 0.6201
Episode: 261 Total reward: 13.0 Training loss: 38.8883 Explore P: 0.6193
Episode: 262 Total reward: 12.0 Training loss: 36.5531 Explore P: 0.6186
Episode: 263 Total reward: 18.0 Training loss: 3.2633 Explore P: 0.6175
Episode: 264 Total reward: 13.0 Training loss: 81.3934 Explore P: 0.6167
Episode: 265 Total reward: 12.0 Training loss: 94.3932 Explore P: 0.6160
Episode: 266 Total reward: 13.0 Training loss: 93.8653 Explore P: 0.6152
Episode: 267 Total reward: 12.0 Training loss: 230.4945 Explore P: 0.6144
Episode: 268 Total reward: 23.0 Training loss: 2.5279 Explore P: 0.6131
Episode: 269 Total reward: 8.0 Training loss: 111.3163 Explore P: 0.6126
Episode: 270 Total reward: 14.0 Training loss: 51.2175 Explore P: 0.6117
Episode: 271 Total reward: 9.0 Training loss: 2.4284 Explore P: 0.6112
Episode: 272 Total reward: 10.0 Training loss: 110.1151 Explore P: 0.6106
Episode: 273 Total reward: 10.0 Training loss: 35.2446 Explore P: 0.6100
Episode: 274 Total reward: 18.0 Training loss: 213.0107 Explore P: 0.6089
Episode: 275 Total reward: 10.0 Training loss: 31.7421 Explore P: 0.6083
Episode: 276 Total reward: 29.0 Training loss: 65.2927 Explore P: 0.6066
Episode: 277 Total reward: 9.0 Training loss: 63.7455 Explore P: 0.6060
Episode: 278 Total reward: 14.0 Training loss: 4.5150 Explore P: 0.6052
Episode: 279 Total reward: 11.0 Training loss: 80.1442 Explore P: 0.6046
Episode: 280 Total reward: 14.0 Training loss: 2.8931 Explore P: 0.6037
Episode: 281 Total reward: 8.0 Training loss: 68.8890 Explore P: 0.6032
Episode: 282 Total reward: 27.0 Training loss: 2.4141 Explore P: 0.6016
Episode: 283 Total reward: 15.0 Training loss: 2.6969 Explore P: 0.6008
Episode: 284 Total reward: 8.0 Training loss: 3.2832 Explore P: 0.6003
Episode: 285 Total reward: 14.0 Training loss: 31.8748 Explore P: 0.5995
Episode: 286 Total reward: 8.0 Training loss: 68.0602 Explore P: 0.5990
Episode: 287 Total reward: 15.0 Training loss: 1.9747 Explore P: 0.5981
Episode: 288 Total reward: 13.0 Training loss: 28.7219 Explore P: 0.5973
Episode: 289 Total reward: 16.0 Training loss: 29.6678 Explore P: 0.5964
Episode: 290 Total reward: 12.0 Training loss: 2.8785 Explore P: 0.5957
Episode: 291 Total reward: 11.0 Training loss: 82.7971 Explore P: 0.5951
Episode: 292 Total reward: 8.0 Training loss: 217.9563 Explore P: 0.5946
Episode: 293 Total reward: 13.0 Training loss: 155.9607 Explore P: 0.5938
Episode: 294 Total reward: 11.0 Training loss: 4.5849 Explore P: 0.5932
Episode: 295 Total reward: 33.0 Training loss: 2.2439 Explore P: 0.5913
Episode: 296 Total reward: 19.0 Training loss: 2.7919 Explore P: 0.5902
Episode: 297 Total reward: 34.0 Training loss: 23.9875 Explore P: 0.5882
Episode: 298 Total reward: 20.0 Training loss: 23.5447 Explore P: 0.5870
Episode: 299 Total reward: 42.0 Training loss: 23.4290 Explore P: 0.5846
Episode: 300 Total reward: 9.0 Training loss: 125.8735 Explore P: 0.5841
Episode: 301 Total reward: 15.0 Training loss: 53.6912 Explore P: 0.5832
Episode: 302 Total reward: 16.0 Training loss: 33.3097 Explore P: 0.5823
Episode: 303 Total reward: 33.0 Training loss: 27.9796 Explore P: 0.5804
Episode: 304 Total reward: 16.0 Training loss: 3.4405 Explore P: 0.5795
Episode: 305 Total reward: 22.0 Training loss: 43.1908 Explore P: 0.5783
Episode: 306 Total reward: 17.0 Training loss: 25.4651 Explore P: 0.5773
Episode: 307 Total reward: 11.0 Training loss: 27.5216 Explore P: 0.5767
Episode: 308 Total reward: 12.0 Training loss: 3.0493 Explore P: 0.5760
Episode: 309 Total reward: 21.0 Training loss: 50.5382 Explore P: 0.5748
Episode: 310 Total reward: 20.0 Training loss: 23.0057 Explore P: 0.5737
Episode: 311 Total reward: 15.0 Training loss: 1.5183 Explore P: 0.5728
Episode: 312 Total reward: 14.0 Training loss: 84.9803 Explore P: 0.5721
Episode: 313 Total reward: 23.0 Training loss: 219.1038 Explore P: 0.5708
Episode: 314 Total reward: 12.0 Training loss: 2.6256 Explore P: 0.5701
Episode: 315 Total reward: 19.0 Training loss: 20.8525 Explore P: 0.5690
Episode: 316 Total reward: 21.0 Training loss: 101.3318 Explore P: 0.5679
Episode: 317 Total reward: 9.0 Training loss: 83.0613 Explore P: 0.5674
Episode: 318 Total reward: 13.0 Training loss: 20.3903 Explore P: 0.5666
Episode: 319 Total reward: 11.0 Training loss: 1.8561 Explore P: 0.5660
Episode: 320 Total reward: 35.0 Training loss: 20.3350 Explore P: 0.5641
Episode: 321 Total reward: 23.0 Training loss: 34.3540 Explore P: 0.5628
Episode: 322 Total reward: 16.0 Training loss: 72.2067 Explore P: 0.5619
Episode: 323 Total reward: 19.0 Training loss: 112.7043 Explore P: 0.5609
Episode: 324 Total reward: 13.0 Training loss: 19.8262 Explore P: 0.5602
Episode: 325 Total reward: 10.0 Training loss: 2.2786 Explore P: 0.5596
Episode: 326 Total reward: 28.0 Training loss: 29.9690 Explore P: 0.5581
Episode: 327 Total reward: 19.0 Training loss: 51.2024 Explore P: 0.5570
Episode: 328 Total reward: 12.0 Training loss: 26.4922 Explore P: 0.5564
Episode: 329 Total reward: 10.0 Training loss: 1.9192 Explore P: 0.5558
Episode: 330 Total reward: 11.0 Training loss: 1.7948 Explore P: 0.5552
Episode: 331 Total reward: 14.0 Training loss: 53.0540 Explore P: 0.5545
Episode: 332 Total reward: 22.0 Training loss: 23.2155 Explore P: 0.5533
Episode: 333 Total reward: 18.0 Training loss: 46.5071 Explore P: 0.5523
Episode: 334 Total reward: 11.0 Training loss: 46.0469 Explore P: 0.5517
Episode: 335 Total reward: 15.0 Training loss: 18.5261 Explore P: 0.5509
Episode: 336 Total reward: 9.0 Training loss: 88.2077 Explore P: 0.5504
Episode: 337 Total reward: 54.0 Training loss: 63.6812 Explore P: 0.5475
Episode: 338 Total reward: 14.0 Training loss: 1.4475 Explore P: 0.5467
Episode: 339 Total reward: 12.0 Training loss: 32.2331 Explore P: 0.5461
Episode: 340 Total reward: 24.0 Training loss: 42.5720 Explore P: 0.5448
Episode: 341 Total reward: 13.0 Training loss: 39.2273 Explore P: 0.5441
Episode: 342 Total reward: 10.0 Training loss: 15.3748 Explore P: 0.5436
Episode: 343 Total reward: 8.0 Training loss: 1.2057 Explore P: 0.5432
Episode: 344 Total reward: 30.0 Training loss: 76.9522 Explore P: 0.5416
Episode: 345 Total reward: 11.0 Training loss: 34.5259 Explore P: 0.5410
Episode: 346 Total reward: 8.0 Training loss: 38.8550 Explore P: 0.5405
Episode: 347 Total reward: 20.0 Training loss: 35.9295 Explore P: 0.5395
Episode: 348 Total reward: 18.0 Training loss: 17.2139 Explore P: 0.5385
Episode: 349 Total reward: 13.0 Training loss: 35.0104 Explore P: 0.5378
Episode: 350 Total reward: 13.0 Training loss: 2.1545 Explore P: 0.5372
Episode: 351 Total reward: 14.0 Training loss: 64.0716 Explore P: 0.5364
Episode: 352 Total reward: 21.0 Training loss: 30.8095 Explore P: 0.5353
Episode: 353 Total reward: 19.0 Training loss: 48.8474 Explore P: 0.5343
Episode: 354 Total reward: 20.0 Training loss: 1.4521 Explore P: 0.5333
Episode: 355 Total reward: 16.0 Training loss: 1.5444 Explore P: 0.5324
Episode: 356 Total reward: 13.0 Training loss: 28.8941 Explore P: 0.5318
Episode: 357 Total reward: 26.0 Training loss: 17.3970 Explore P: 0.5304
Episode: 358 Total reward: 16.0 Training loss: 1.2457 Explore P: 0.5296
Episode: 359 Total reward: 19.0 Training loss: 12.4850 Explore P: 0.5286
Episode: 360 Total reward: 18.0 Training loss: 1.3440 Explore P: 0.5277
Episode: 361 Total reward: 18.0 Training loss: 11.9930 Explore P: 0.5267
Episode: 362 Total reward: 23.0 Training loss: 1.5829 Explore P: 0.5255
Episode: 363 Total reward: 32.0 Training loss: 65.6805 Explore P: 0.5239
Episode: 364 Total reward: 29.0 Training loss: 61.2528 Explore P: 0.5224
Episode: 365 Total reward: 18.0 Training loss: 1.1227 Explore P: 0.5215
Episode: 366 Total reward: 21.0 Training loss: 15.4702 Explore P: 0.5204
Episode: 367 Total reward: 42.0 Training loss: 84.3785 Explore P: 0.5183
Episode: 368 Total reward: 100.0 Training loss: 1.1006 Explore P: 0.5132
Episode: 369 Total reward: 22.0 Training loss: 56.4555 Explore P: 0.5121
Episode: 370 Total reward: 17.0 Training loss: 58.1507 Explore P: 0.5112
Episode: 371 Total reward: 60.0 Training loss: 49.0872 Explore P: 0.5083
Episode: 372 Total reward: 26.0 Training loss: 1.4883 Explore P: 0.5070
Episode: 373 Total reward: 40.0 Training loss: 1.4294 Explore P: 0.5050
Episode: 374 Total reward: 50.0 Training loss: 41.2302 Explore P: 0.5025
Episode: 375 Total reward: 23.0 Training loss: 51.2151 Explore P: 0.5014
Episode: 376 Total reward: 68.0 Training loss: 18.5606 Explore P: 0.4980
Episode: 377 Total reward: 40.0 Training loss: 41.7770 Explore P: 0.4961
Episode: 378 Total reward: 24.0 Training loss: 10.3601 Explore P: 0.4949
Episode: 379 Total reward: 33.0 Training loss: 48.3726 Explore P: 0.4933
Episode: 380 Total reward: 17.0 Training loss: 13.7195 Explore P: 0.4925
Episode: 381 Total reward: 17.0 Training loss: 40.2058 Explore P: 0.4917
Episode: 382 Total reward: 19.0 Training loss: 1.7039 Explore P: 0.4908
Episode: 383 Total reward: 58.0 Training loss: 28.2528 Explore P: 0.4880
Episode: 384 Total reward: 54.0 Training loss: 56.1332 Explore P: 0.4854
Episode: 385 Total reward: 106.0 Training loss: 57.5523 Explore P: 0.4804
Episode: 386 Total reward: 41.0 Training loss: 14.4320 Explore P: 0.4785
Episode: 387 Total reward: 45.0 Training loss: 37.2596 Explore P: 0.4764
Episode: 388 Total reward: 26.0 Training loss: 1.8763 Explore P: 0.4752
Episode: 389 Total reward: 50.0 Training loss: 34.9691 Explore P: 0.4729
Episode: 390 Total reward: 39.0 Training loss: 44.5293 Explore P: 0.4710
Episode: 391 Total reward: 64.0 Training loss: 13.1209 Explore P: 0.4681
Episode: 392 Total reward: 39.0 Training loss: 28.3887 Explore P: 0.4663
Episode: 393 Total reward: 58.0 Training loss: 1.3766 Explore P: 0.4637
Episode: 394 Total reward: 30.0 Training loss: 15.2365 Explore P: 0.4623
Episode: 395 Total reward: 32.0 Training loss: 24.2089 Explore P: 0.4609
Episode: 396 Total reward: 56.0 Training loss: 50.5049 Explore P: 0.4584
Episode: 397 Total reward: 20.0 Training loss: 41.4278 Explore P: 0.4575
Episode: 398 Total reward: 42.0 Training loss: 2.0724 Explore P: 0.4556
Episode: 399 Total reward: 36.0 Training loss: 2.2259 Explore P: 0.4540
Episode: 400 Total reward: 47.0 Training loss: 14.0365 Explore P: 0.4519
Episode: 401 Total reward: 31.0 Training loss: 23.8796 Explore P: 0.4505
Episode: 402 Total reward: 51.0 Training loss: 19.5411 Explore P: 0.4483
Episode: 403 Total reward: 62.0 Training loss: 1.8264 Explore P: 0.4456
Episode: 404 Total reward: 124.0 Training loss: 1.9596 Explore P: 0.4402
Episode: 405 Total reward: 40.0 Training loss: 1.7293 Explore P: 0.4385
Episode: 406 Total reward: 40.0 Training loss: 41.8090 Explore P: 0.4368
Episode: 407 Total reward: 46.0 Training loss: 1.8905 Explore P: 0.4348
Episode: 408 Total reward: 64.0 Training loss: 2.9157 Explore P: 0.4321
Episode: 409 Total reward: 109.0 Training loss: 13.3135 Explore P: 0.4276
Episode: 410 Total reward: 29.0 Training loss: 13.6009 Explore P: 0.4263
Episode: 411 Total reward: 56.0 Training loss: 43.8940 Explore P: 0.4240
Episode: 412 Total reward: 64.0 Training loss: 20.9163 Explore P: 0.4214
Episode: 413 Total reward: 87.0 Training loss: 43.6251 Explore P: 0.4178
Episode: 414 Total reward: 26.0 Training loss: 29.6223 Explore P: 0.4168
Episode: 415 Total reward: 35.0 Training loss: 25.9222 Explore P: 0.4153
Episode: 416 Total reward: 38.0 Training loss: 45.0317 Explore P: 0.4138
Episode: 417 Total reward: 73.0 Training loss: 68.6141 Explore P: 0.4109
Episode: 418 Total reward: 22.0 Training loss: 35.8007 Explore P: 0.4100
Episode: 419 Total reward: 105.0 Training loss: 12.8012 Explore P: 0.4058
Episode: 420 Total reward: 48.0 Training loss: 2.6607 Explore P: 0.4039
Episode: 421 Total reward: 51.0 Training loss: 32.8887 Explore P: 0.4019
Episode: 422 Total reward: 77.0 Training loss: 1.0748 Explore P: 0.3989
Episode: 423 Total reward: 53.0 Training loss: 17.7237 Explore P: 0.3968
Episode: 424 Total reward: 71.0 Training loss: 65.2779 Explore P: 0.3941
Episode: 425 Total reward: 63.0 Training loss: 26.5912 Explore P: 0.3917
Episode: 426 Total reward: 72.0 Training loss: 1.7720 Explore P: 0.3890
Episode: 427 Total reward: 83.0 Training loss: 60.4461 Explore P: 0.3858
Episode: 428 Total reward: 68.0 Training loss: 84.0442 Explore P: 0.3833
Episode: 429 Total reward: 30.0 Training loss: 33.1954 Explore P: 0.3822
Episode: 430 Total reward: 47.0 Training loss: 32.0215 Explore P: 0.3804
Episode: 431 Total reward: 70.0 Training loss: 1.2685 Explore P: 0.3778
Episode: 432 Total reward: 66.0 Training loss: 77.3056 Explore P: 0.3754
Episode: 433 Total reward: 26.0 Training loss: 3.6440 Explore P: 0.3745
Episode: 434 Total reward: 42.0 Training loss: 34.6239 Explore P: 0.3729
Episode: 435 Total reward: 78.0 Training loss: 2.3379 Explore P: 0.3701
Episode: 436 Total reward: 57.0 Training loss: 28.4296 Explore P: 0.3681
Episode: 437 Total reward: 55.0 Training loss: 32.4871 Explore P: 0.3661
Episode: 438 Total reward: 107.0 Training loss: 85.2635 Explore P: 0.3623
Episode: 439 Total reward: 77.0 Training loss: 65.9737 Explore P: 0.3596
Episode: 440 Total reward: 199.0 Training loss: 2.4413 Explore P: 0.3527
Episode: 441 Total reward: 59.0 Training loss: 1.5963 Explore P: 0.3507
Episode: 442 Total reward: 54.0 Training loss: 14.7177 Explore P: 0.3489
Episode: 443 Total reward: 41.0 Training loss: 16.0788 Explore P: 0.3475
Episode: 444 Total reward: 30.0 Training loss: 1.5555 Explore P: 0.3465
Episode: 445 Total reward: 71.0 Training loss: 1.4776 Explore P: 0.3441
Episode: 446 Total reward: 84.0 Training loss: 84.3678 Explore P: 0.3413
Episode: 447 Total reward: 83.0 Training loss: 2.4792 Explore P: 0.3386
Episode: 448 Total reward: 64.0 Training loss: 56.3732 Explore P: 0.3365
Episode: 449 Total reward: 125.0 Training loss: 30.1938 Explore P: 0.3324
Episode: 450 Total reward: 60.0 Training loss: 60.1449 Explore P: 0.3305
Episode: 451 Total reward: 64.0 Training loss: 1.6694 Explore P: 0.3284
Episode: 452 Total reward: 59.0 Training loss: 70.1036 Explore P: 0.3266
Episode: 453 Total reward: 68.0 Training loss: 15.9789 Explore P: 0.3244
Episode: 454 Total reward: 76.0 Training loss: 62.2279 Explore P: 0.3220
Episode: 455 Total reward: 40.0 Training loss: 1.7484 Explore P: 0.3208
Episode: 456 Total reward: 53.0 Training loss: 64.4348 Explore P: 0.3191
Episode: 457 Total reward: 100.0 Training loss: 58.6056 Explore P: 0.3161
Episode: 458 Total reward: 66.0 Training loss: 2.3890 Explore P: 0.3141
Episode: 459 Total reward: 58.0 Training loss: 14.8070 Explore P: 0.3123
Episode: 460 Total reward: 67.0 Training loss: 2.3347 Explore P: 0.3103
Episode: 461 Total reward: 40.0 Training loss: 22.9656 Explore P: 0.3091
Episode: 462 Total reward: 41.0 Training loss: 81.8295 Explore P: 0.3079
Episode: 463 Total reward: 138.0 Training loss: 35.0065 Explore P: 0.3038
Episode: 464 Total reward: 77.0 Training loss: 5.3839 Explore P: 0.3015
Episode: 465 Total reward: 111.0 Training loss: 48.2561 Explore P: 0.2983
Episode: 466 Total reward: 52.0 Training loss: 1.6213 Explore P: 0.2968
Episode: 467 Total reward: 40.0 Training loss: 70.0751 Explore P: 0.2957
Episode: 468 Total reward: 38.0 Training loss: 2.7262 Explore P: 0.2946
Episode: 469 Total reward: 47.0 Training loss: 31.7164 Explore P: 0.2932
Episode: 470 Total reward: 45.0 Training loss: 2.5777 Explore P: 0.2920
Episode: 471 Total reward: 80.0 Training loss: 1.6929 Explore P: 0.2897
Episode: 472 Total reward: 109.0 Training loss: 3.9390 Explore P: 0.2867
Episode: 473 Total reward: 73.0 Training loss: 2.9017 Explore P: 0.2847
Episode: 474 Total reward: 72.0 Training loss: 1.9725 Explore P: 0.2827
Episode: 475 Total reward: 73.0 Training loss: 1.8435 Explore P: 0.2807
Episode: 476 Total reward: 82.0 Training loss: 1.7257 Explore P: 0.2785
Episode: 477 Total reward: 100.0 Training loss: 28.0059 Explore P: 0.2758
Episode: 478 Total reward: 103.0 Training loss: 2.7952 Explore P: 0.2731
Episode: 479 Total reward: 90.0 Training loss: 4.3592 Explore P: 0.2708
Episode: 480 Total reward: 43.0 Training loss: 2.2704 Explore P: 0.2696
Episode: 481 Total reward: 107.0 Training loss: 81.3913 Explore P: 0.2669
Episode: 482 Total reward: 56.0 Training loss: 165.0796 Explore P: 0.2654
Episode: 483 Total reward: 143.0 Training loss: 67.1248 Explore P: 0.2618
Episode: 484 Total reward: 37.0 Training loss: 2.1704 Explore P: 0.2609
Episode: 485 Total reward: 35.0 Training loss: 155.1805 Explore P: 0.2600
Episode: 486 Total reward: 52.0 Training loss: 2.4493 Explore P: 0.2587
Episode: 487 Total reward: 59.0 Training loss: 50.7427 Explore P: 0.2573
Episode: 488 Total reward: 43.0 Training loss: 2.7069 Explore P: 0.2562
Episode: 489 Total reward: 48.0 Training loss: 91.4809 Explore P: 0.2550
Episode: 490 Total reward: 72.0 Training loss: 3.5390 Explore P: 0.2533
Episode: 491 Total reward: 180.0 Training loss: 2.5817 Explore P: 0.2489
Episode: 492 Total reward: 75.0 Training loss: 3.9075 Explore P: 0.2471
Episode: 493 Total reward: 59.0 Training loss: 1.8007 Explore P: 0.2457
Episode: 494 Total reward: 40.0 Training loss: 88.5675 Explore P: 0.2448
Episode: 495 Total reward: 44.0 Training loss: 3.7091 Explore P: 0.2438
Episode: 496 Total reward: 146.0 Training loss: 1.4005 Explore P: 0.2404
Episode: 497 Total reward: 77.0 Training loss: 14.3797 Explore P: 0.2386
Episode: 498 Total reward: 56.0 Training loss: 73.9674 Explore P: 0.2373
Episode: 499 Total reward: 32.0 Training loss: 168.7153 Explore P: 0.2366
Episode: 500 Total reward: 54.0 Training loss: 2.3457 Explore P: 0.2354
Episode: 501 Total reward: 178.0 Training loss: 3.7347 Explore P: 0.2314
Episode: 502 Total reward: 73.0 Training loss: 2.7305 Explore P: 0.2298
Episode: 503 Total reward: 146.0 Training loss: 129.6973 Explore P: 0.2266
Episode: 504 Total reward: 43.0 Training loss: 15.4852 Explore P: 0.2257
Episode: 505 Total reward: 39.0 Training loss: 1.3388 Explore P: 0.2248
Episode: 506 Total reward: 96.0 Training loss: 77.3651 Explore P: 0.2228
Episode: 507 Total reward: 93.0 Training loss: 125.4431 Explore P: 0.2208
Episode: 508 Total reward: 123.0 Training loss: 1.8852 Explore P: 0.2182
Episode: 509 Total reward: 61.0 Training loss: 86.6396 Explore P: 0.2170
Episode: 510 Total reward: 68.0 Training loss: 129.3726 Explore P: 0.2156
Episode: 511 Total reward: 45.0 Training loss: 211.3280 Explore P: 0.2147
Episode: 512 Total reward: 105.0 Training loss: 2.4421 Explore P: 0.2125
Episode: 513 Total reward: 38.0 Training loss: 35.9709 Explore P: 0.2117
Episode: 514 Total reward: 40.0 Training loss: 1.6157 Explore P: 0.2109
Episode: 515 Total reward: 88.0 Training loss: 118.4556 Explore P: 0.2092
Episode: 516 Total reward: 61.0 Training loss: 117.1901 Explore P: 0.2080
Episode: 517 Total reward: 48.0 Training loss: 2.1377 Explore P: 0.2070
Episode: 518 Total reward: 48.0 Training loss: 3.4063 Explore P: 0.2061
Episode: 519 Total reward: 49.0 Training loss: 1.8767 Explore P: 0.2051
Episode: 520 Total reward: 40.0 Training loss: 130.1490 Explore P: 0.2043
Episode: 521 Total reward: 49.0 Training loss: 2.2472 Explore P: 0.2034
Episode: 522 Total reward: 52.0 Training loss: 2.2127 Explore P: 0.2024
Episode: 523 Total reward: 66.0 Training loss: 114.1194 Explore P: 0.2011
Episode: 524 Total reward: 62.0 Training loss: 2.4375 Explore P: 0.1999
Episode: 525 Total reward: 96.0 Training loss: 0.7455 Explore P: 0.1981
Episode: 526 Total reward: 97.0 Training loss: 132.8744 Explore P: 0.1963
Episode: 527 Total reward: 137.0 Training loss: 1.9934 Explore P: 0.1938
Episode: 528 Total reward: 76.0 Training loss: 127.4379 Explore P: 0.1924
Episode: 529 Total reward: 91.0 Training loss: 2.3110 Explore P: 0.1907
Episode: 530 Total reward: 75.0 Training loss: 2.7551 Explore P: 0.1894
Episode: 531 Total reward: 62.0 Training loss: 2.5829 Explore P: 0.1883
Episode: 532 Total reward: 113.0 Training loss: 129.3456 Explore P: 0.1863
Episode: 533 Total reward: 99.0 Training loss: 1.0486 Explore P: 0.1845
Episode: 534 Total reward: 71.0 Training loss: 1.0024 Explore P: 0.1833
Episode: 535 Total reward: 83.0 Training loss: 2.0612 Explore P: 0.1819
Episode: 536 Total reward: 102.0 Training loss: 0.7929 Explore P: 0.1801
Episode: 537 Total reward: 183.0 Training loss: 130.8101 Explore P: 0.1770
Episode: 538 Total reward: 101.0 Training loss: 2.1450 Explore P: 0.1754
Episode: 539 Total reward: 101.0 Training loss: 272.0150 Explore P: 0.1737
Episode: 540 Total reward: 123.0 Training loss: 1.1409 Explore P: 0.1717
Episode: 541 Total reward: 199.0 Training loss: 1.1927 Explore P: 0.1685
Episode: 542 Total reward: 73.0 Training loss: 0.6153 Explore P: 0.1674
Episode: 543 Total reward: 83.0 Training loss: 1.4012 Explore P: 0.1661
Episode: 544 Total reward: 166.0 Training loss: 102.0027 Explore P: 0.1635
Episode: 545 Total reward: 117.0 Training loss: 144.0836 Explore P: 0.1617
Episode: 546 Total reward: 153.0 Training loss: 0.7345 Explore P: 0.1594
Episode: 547 Total reward: 148.0 Training loss: 1.4364 Explore P: 0.1572
Episode: 548 Total reward: 135.0 Training loss: 105.6134 Explore P: 0.1552
Episode: 549 Total reward: 199.0 Training loss: 0.8919 Explore P: 0.1524
Episode: 550 Total reward: 175.0 Training loss: 2.1354 Explore P: 0.1499
Episode: 551 Total reward: 199.0 Training loss: 1.1243 Explore P: 0.1471
Episode: 552 Total reward: 86.0 Training loss: 75.4190 Explore P: 0.1460
Episode: 553 Total reward: 199.0 Training loss: 1.3087 Explore P: 0.1433
Episode: 554 Total reward: 95.0 Training loss: 1.1753 Explore P: 0.1420
Episode: 555 Total reward: 182.0 Training loss: 1.3265 Explore P: 0.1396
Episode: 556 Total reward: 99.0 Training loss: 0.7203 Explore P: 0.1384
Episode: 557 Total reward: 181.0 Training loss: 0.5718 Explore P: 0.1361
Episode: 558 Total reward: 177.0 Training loss: 107.8775 Explore P: 0.1339
Episode: 559 Total reward: 98.0 Training loss: 0.5865 Explore P: 0.1326
Episode: 560 Total reward: 83.0 Training loss: 1.0574 Explore P: 0.1316
Episode: 561 Total reward: 199.0 Training loss: 0.3285 Explore P: 0.1292
Episode: 562 Total reward: 143.0 Training loss: 1.4100 Explore P: 0.1275
Episode: 563 Total reward: 143.0 Training loss: 1.2314 Explore P: 0.1259
Episode: 564 Total reward: 104.0 Training loss: 1.6950 Explore P: 0.1247
Episode: 565 Total reward: 111.0 Training loss: 0.4605 Explore P: 0.1234
Episode: 566 Total reward: 155.0 Training loss: 1.3134 Explore P: 0.1217
Episode: 567 Total reward: 124.0 Training loss: 567.5305 Explore P: 0.1203
Episode: 568 Total reward: 113.0 Training loss: 1.1193 Explore P: 0.1190
Episode: 569 Total reward: 199.0 Training loss: 77.6871 Explore P: 0.1169
Episode: 570 Total reward: 100.0 Training loss: 0.7184 Explore P: 0.1158
Episode: 571 Total reward: 113.0 Training loss: 0.8221 Explore P: 0.1146
Episode: 572 Total reward: 142.0 Training loss: 0.5819 Explore P: 0.1132
Episode: 573 Total reward: 199.0 Training loss: 0.5009 Explore P: 0.1111
Episode: 574 Total reward: 173.0 Training loss: 1.7013 Explore P: 0.1094
Episode: 575 Total reward: 69.0 Training loss: 0.8873 Explore P: 0.1087
Episode: 576 Total reward: 64.0 Training loss: 0.5821 Explore P: 0.1081
Episode: 577 Total reward: 76.0 Training loss: 0.8811 Explore P: 0.1073
Episode: 578 Total reward: 68.0 Training loss: 79.1682 Explore P: 0.1067
Episode: 579 Total reward: 106.0 Training loss: 1.5649 Explore P: 0.1057
Episode: 580 Total reward: 101.0 Training loss: 0.6381 Explore P: 0.1047
Episode: 581 Total reward: 82.0 Training loss: 1.9399 Explore P: 0.1039
Episode: 582 Total reward: 76.0 Training loss: 0.6881 Explore P: 0.1032
Episode: 583 Total reward: 52.0 Training loss: 1.0974 Explore P: 0.1027
Episode: 584 Total reward: 94.0 Training loss: 1.3403 Explore P: 0.1019
Episode: 585 Total reward: 56.0 Training loss: 0.7827 Explore P: 0.1014
Episode: 586 Total reward: 101.0 Training loss: 0.7218 Explore P: 0.1004
Episode: 587 Total reward: 75.0 Training loss: 0.6784 Explore P: 0.0998
Episode: 588 Total reward: 91.0 Training loss: 1.4941 Explore P: 0.0990
Episode: 589 Total reward: 125.0 Training loss: 329.9718 Explore P: 0.0978
Episode: 590 Total reward: 52.0 Training loss: 58.0237 Explore P: 0.0974
Episode: 591 Total reward: 161.0 Training loss: 1.0194 Explore P: 0.0960
Episode: 592 Total reward: 105.0 Training loss: 1.8304 Explore P: 0.0951
Episode: 593 Total reward: 65.0 Training loss: 268.1266 Explore P: 0.0945
Episode: 594 Total reward: 176.0 Training loss: 0.8112 Explore P: 0.0931
Episode: 595 Total reward: 163.0 Training loss: 0.6303 Explore P: 0.0917
Episode: 596 Total reward: 199.0 Training loss: 542.6828 Explore P: 0.0901
Episode: 597 Total reward: 63.0 Training loss: 0.8489 Explore P: 0.0896
Episode: 598 Total reward: 112.0 Training loss: 0.5686 Explore P: 0.0887
Episode: 599 Total reward: 47.0 Training loss: 0.5796 Explore P: 0.0884
Episode: 600 Total reward: 199.0 Training loss: 0.6016 Explore P: 0.0868
Episode: 601 Total reward: 69.0 Training loss: 1.1752 Explore P: 0.0863
Episode: 602 Total reward: 155.0 Training loss: 0.4538 Explore P: 0.0851
Episode: 603 Total reward: 94.0 Training loss: 0.8032 Explore P: 0.0844
Episode: 604 Total reward: 74.0 Training loss: 186.9839 Explore P: 0.0839
Episode: 605 Total reward: 199.0 Training loss: 0.2737 Explore P: 0.0824
Episode: 606 Total reward: 199.0 Training loss: 0.7621 Explore P: 0.0810
Episode: 607 Total reward: 72.0 Training loss: 1.1350 Explore P: 0.0805
Episode: 608 Total reward: 199.0 Training loss: 210.3353 Explore P: 0.0791
Episode: 609 Total reward: 80.0 Training loss: 0.5612 Explore P: 0.0785
Episode: 610 Total reward: 146.0 Training loss: 0.9520 Explore P: 0.0775
Episode: 611 Total reward: 106.0 Training loss: 1.1253 Explore P: 0.0768
Episode: 612 Total reward: 114.0 Training loss: 0.3057 Explore P: 0.0761
Episode: 613 Total reward: 124.0 Training loss: 1.9365 Explore P: 0.0753
Episode: 614 Total reward: 86.0 Training loss: 0.7459 Explore P: 0.0747
Episode: 615 Total reward: 106.0 Training loss: 0.4090 Explore P: 0.0740
Episode: 616 Total reward: 135.0 Training loss: 0.6420 Explore P: 0.0732
Episode: 617 Total reward: 169.0 Training loss: 0.4619 Explore P: 0.0721
Episode: 618 Total reward: 133.0 Training loss: 0.5252 Explore P: 0.0713
Episode: 619 Total reward: 104.0 Training loss: 0.3697 Explore P: 0.0706
Episode: 620 Total reward: 121.0 Training loss: 0.4536 Explore P: 0.0699
Episode: 621 Total reward: 199.0 Training loss: 0.3601 Explore P: 0.0687
Episode: 622 Total reward: 107.0 Training loss: 0.5509 Explore P: 0.0681
Episode: 623 Total reward: 199.0 Training loss: 1.1541 Explore P: 0.0670
Episode: 624 Total reward: 171.0 Training loss: 187.9090 Explore P: 0.0660
Episode: 625 Total reward: 175.0 Training loss: 0.3189 Explore P: 0.0650
Episode: 626 Total reward: 176.0 Training loss: 0.9469 Explore P: 0.0641
Episode: 627 Total reward: 66.0 Training loss: 0.7395 Explore P: 0.0637
Episode: 628 Total reward: 85.0 Training loss: 0.9024 Explore P: 0.0633
Episode: 629 Total reward: 175.0 Training loss: 0.3239 Explore P: 0.0623
Episode: 630 Total reward: 191.0 Training loss: 0.4732 Explore P: 0.0613
Episode: 631 Total reward: 194.0 Training loss: 0.7703 Explore P: 0.0604
Episode: 632 Total reward: 89.0 Training loss: 0.2900 Explore P: 0.0599
Episode: 633 Total reward: 91.0 Training loss: 0.2999 Explore P: 0.0595
Episode: 634 Total reward: 199.0 Training loss: 0.5246 Explore P: 0.0585
Episode: 635 Total reward: 107.0 Training loss: 0.6902 Explore P: 0.0580
Episode: 636 Total reward: 199.0 Training loss: 168.2552 Explore P: 0.0570
Episode: 637 Total reward: 199.0 Training loss: 0.6027 Explore P: 0.0561
Episode: 638 Total reward: 199.0 Training loss: 130.9775 Explore P: 0.0552
Episode: 639 Total reward: 114.0 Training loss: 165.6724 Explore P: 0.0547
Episode: 640 Total reward: 199.0 Training loss: 0.5795 Explore P: 0.0538
Episode: 641 Total reward: 86.0 Training loss: 0.4788 Explore P: 0.0534
Episode: 642 Total reward: 87.0 Training loss: 1.0882 Explore P: 0.0530
Episode: 643 Total reward: 104.0 Training loss: 0.3647 Explore P: 0.0526
Episode: 644 Total reward: 117.0 Training loss: 0.3060 Explore P: 0.0521
Episode: 645 Total reward: 152.0 Training loss: 0.3297 Explore P: 0.0515
Episode: 646 Total reward: 150.0 Training loss: 0.1694 Explore P: 0.0508
Episode: 647 Total reward: 89.0 Training loss: 79.1799 Explore P: 0.0505
Episode: 648 Total reward: 62.0 Training loss: 0.4003 Explore P: 0.0502
Episode: 649 Total reward: 113.0 Training loss: 0.3352 Explore P: 0.0498
Episode: 650 Total reward: 105.0 Training loss: 0.2808 Explore P: 0.0494
Episode: 651 Total reward: 124.0 Training loss: 0.3115 Explore P: 0.0489
Episode: 652 Total reward: 182.0 Training loss: 0.1938 Explore P: 0.0482
Episode: 653 Total reward: 199.0 Training loss: 0.2943 Explore P: 0.0474
Episode: 654 Total reward: 93.0 Training loss: 65.6658 Explore P: 0.0471
Episode: 655 Total reward: 199.0 Training loss: 0.2237 Explore P: 0.0464
Episode: 656 Total reward: 126.0 Training loss: 0.3648 Explore P: 0.0459
Episode: 657 Total reward: 192.0 Training loss: 0.2263 Explore P: 0.0452
Episode: 658 Total reward: 75.0 Training loss: 0.5124 Explore P: 0.0450
Episode: 659 Total reward: 70.0 Training loss: 0.3069 Explore P: 0.0447
Episode: 660 Total reward: 92.0 Training loss: 0.3050 Explore P: 0.0444
Episode: 661 Total reward: 149.0 Training loss: 0.3381 Explore P: 0.0439
Episode: 662 Total reward: 199.0 Training loss: 0.1601 Explore P: 0.0432
Episode: 663 Total reward: 60.0 Training loss: 0.3925 Explore P: 0.0430
Episode: 664 Total reward: 82.0 Training loss: 0.2632 Explore P: 0.0427
Episode: 665 Total reward: 56.0 Training loss: 0.3393 Explore P: 0.0426
Episode: 666 Total reward: 61.0 Training loss: 0.2465 Explore P: 0.0424
Episode: 667 Total reward: 67.0 Training loss: 0.1183 Explore P: 0.0421
Episode: 668 Total reward: 95.0 Training loss: 0.1824 Explore P: 0.0418
Episode: 669 Total reward: 199.0 Training loss: 0.3711 Explore P: 0.0412
Episode: 670 Total reward: 173.0 Training loss: 0.4176 Explore P: 0.0407
Episode: 671 Total reward: 64.0 Training loss: 0.3676 Explore P: 0.0405
Episode: 672 Total reward: 199.0 Training loss: 0.4016 Explore P: 0.0399
Episode: 673 Total reward: 132.0 Training loss: 0.1999 Explore P: 0.0395
Episode: 674 Total reward: 64.0 Training loss: 0.2428 Explore P: 0.0393
Episode: 675 Total reward: 64.0 Training loss: 0.2532 Explore P: 0.0391
Episode: 676 Total reward: 199.0 Training loss: 67.1298 Explore P: 0.0385
Episode: 677 Total reward: 199.0 Training loss: 0.1760 Explore P: 0.0380
Episode: 678 Total reward: 116.0 Training loss: 0.1827 Explore P: 0.0377
Episode: 679 Total reward: 199.0 Training loss: 0.2489 Explore P: 0.0371
Episode: 680 Total reward: 58.0 Training loss: 0.2434 Explore P: 0.0370
Episode: 681 Total reward: 78.0 Training loss: 0.3322 Explore P: 0.0367
Episode: 682 Total reward: 132.0 Training loss: 0.1969 Explore P: 0.0364
Episode: 683 Total reward: 69.0 Training loss: 0.2498 Explore P: 0.0362
Episode: 684 Total reward: 70.0 Training loss: 0.2996 Explore P: 0.0360
Episode: 685 Total reward: 197.0 Training loss: 0.2926 Explore P: 0.0355
Episode: 686 Total reward: 130.0 Training loss: 0.1078 Explore P: 0.0352
Episode: 687 Total reward: 62.0 Training loss: 0.3853 Explore P: 0.0350
Episode: 688 Total reward: 199.0 Training loss: 0.2478 Explore P: 0.0345
Episode: 689 Total reward: 85.0 Training loss: 0.1575 Explore P: 0.0343
Episode: 690 Total reward: 84.0 Training loss: 0.2367 Explore P: 0.0341
Episode: 691 Total reward: 94.0 Training loss: 0.1410 Explore P: 0.0339
Episode: 692 Total reward: 68.0 Training loss: 0.3698 Explore P: 0.0337
Episode: 693 Total reward: 199.0 Training loss: 0.3262 Explore P: 0.0333
Episode: 694 Total reward: 43.0 Training loss: 0.3187 Explore P: 0.0332
Episode: 695 Total reward: 139.0 Training loss: 0.3533 Explore P: 0.0329
Episode: 696 Total reward: 76.0 Training loss: 0.1500 Explore P: 0.0327
Episode: 697 Total reward: 70.0 Training loss: 0.2880 Explore P: 0.0325
Episode: 698 Total reward: 105.0 Training loss: 0.2902 Explore P: 0.0323
Episode: 699 Total reward: 197.0 Training loss: 0.1821 Explore P: 0.0319
Episode: 700 Total reward: 121.0 Training loss: 0.2179 Explore P: 0.0316
Episode: 701 Total reward: 63.0 Training loss: 0.2074 Explore P: 0.0315
Episode: 702 Total reward: 199.0 Training loss: 0.3785 Explore P: 0.0310
Episode: 703 Total reward: 174.0 Training loss: 0.7877 Explore P: 0.0307
Episode: 704 Total reward: 199.0 Training loss: 0.3699 Explore P: 0.0303
Episode: 705 Total reward: 76.0 Training loss: 0.1402 Explore P: 0.0301
Episode: 706 Total reward: 68.0 Training loss: 0.2965 Explore P: 0.0300
Episode: 707 Total reward: 140.0 Training loss: 102.8346 Explore P: 0.0297
Episode: 708 Total reward: 124.0 Training loss: 0.2341 Explore P: 0.0295
Episode: 709 Total reward: 182.0 Training loss: 0.4065 Explore P: 0.0291
Episode: 710 Total reward: 199.0 Training loss: 0.3428 Explore P: 0.0287
Episode: 711 Total reward: 159.0 Training loss: 0.5695 Explore P: 0.0284
Episode: 712 Total reward: 189.0 Training loss: 0.2003 Explore P: 0.0281
Episode: 713 Total reward: 73.0 Training loss: 0.2251 Explore P: 0.0280
Episode: 714 Total reward: 83.0 Training loss: 0.2673 Explore P: 0.0278
Episode: 715 Total reward: 99.0 Training loss: 0.2251 Explore P: 0.0276
Episode: 716 Total reward: 171.0 Training loss: 0.2621 Explore P: 0.0273
Episode: 717 Total reward: 59.0 Training loss: 0.2134 Explore P: 0.0272
Episode: 718 Total reward: 159.0 Training loss: 0.2198 Explore P: 0.0270
Episode: 719 Total reward: 82.0 Training loss: 0.2461 Explore P: 0.0268
Episode: 720 Total reward: 169.0 Training loss: 0.5517 Explore P: 0.0265
Episode: 721 Total reward: 82.0 Training loss: 0.2038 Explore P: 0.0264
Episode: 722 Total reward: 81.0 Training loss: 0.3043 Explore P: 0.0263
Episode: 723 Total reward: 105.0 Training loss: 0.2250 Explore P: 0.0261
Episode: 724 Total reward: 89.0 Training loss: 0.3984 Explore P: 0.0260
Episode: 725 Total reward: 175.0 Training loss: 0.2212 Explore P: 0.0257
Episode: 726 Total reward: 110.0 Training loss: 0.5543 Explore P: 0.0255
Episode: 727 Total reward: 71.0 Training loss: 0.2933 Explore P: 0.0254
Episode: 728 Total reward: 125.0 Training loss: 0.7547 Explore P: 0.0252
Episode: 729 Total reward: 181.0 Training loss: 1.2321 Explore P: 0.0249
Episode: 730 Total reward: 80.0 Training loss: 0.2658 Explore P: 0.0248
Episode: 731 Total reward: 173.0 Training loss: 0.3006 Explore P: 0.0246
Episode: 732 Total reward: 145.0 Training loss: 0.2950 Explore P: 0.0244
Episode: 733 Total reward: 153.0 Training loss: 0.1834 Explore P: 0.0241
Episode: 734 Total reward: 199.0 Training loss: 0.2556 Explore P: 0.0239
Episode: 735 Total reward: 199.0 Training loss: 0.3583 Explore P: 0.0236
Episode: 736 Total reward: 199.0 Training loss: 0.2578 Explore P: 0.0233
Episode: 737 Total reward: 199.0 Training loss: 0.2357 Explore P: 0.0231
Episode: 738 Total reward: 113.0 Training loss: 0.2604 Explore P: 0.0229
Episode: 739 Total reward: 188.0 Training loss: 0.3705 Explore P: 0.0227
Episode: 740 Total reward: 145.0 Training loss: 0.2302 Explore P: 0.0225
Episode: 741 Total reward: 166.0 Training loss: 0.1750 Explore P: 0.0223
Episode: 742 Total reward: 120.0 Training loss: 0.1431 Explore P: 0.0221
Episode: 743 Total reward: 199.0 Training loss: 0.2393 Explore P: 0.0219
Episode: 744 Total reward: 199.0 Training loss: 0.3177 Explore P: 0.0217
Episode: 745 Total reward: 199.0 Training loss: 0.2345 Explore P: 0.0214
Episode: 746 Total reward: 199.0 Training loss: 0.3308 Explore P: 0.0212
Episode: 747 Total reward: 161.0 Training loss: 0.2980 Explore P: 0.0210
Episode: 748 Total reward: 147.0 Training loss: 0.2213 Explore P: 0.0209
Episode: 749 Total reward: 173.0 Training loss: 0.1695 Explore P: 0.0207
Episode: 750 Total reward: 114.0 Training loss: 0.4201 Explore P: 0.0206
Episode: 751 Total reward: 102.0 Training loss: 0.2359 Explore P: 0.0205
Episode: 752 Total reward: 139.0 Training loss: 0.3271 Explore P: 0.0203
Episode: 753 Total reward: 123.0 Training loss: 0.2457 Explore P: 0.0202
Episode: 754 Total reward: 155.0 Training loss: 0.1480 Explore P: 0.0200
Episode: 755 Total reward: 101.0 Training loss: 0.3551 Explore P: 0.0199
Episode: 756 Total reward: 117.0 Training loss: 279.5504 Explore P: 0.0198
Episode: 757 Total reward: 119.0 Training loss: 0.2707 Explore P: 0.0197
Episode: 758 Total reward: 199.0 Training loss: 0.7245 Explore P: 0.0195
Episode: 759 Total reward: 137.0 Training loss: 0.2893 Explore P: 0.0194
Episode: 760 Total reward: 119.0 Training loss: 0.1791 Explore P: 0.0193
Episode: 761 Total reward: 191.0 Training loss: 0.4941 Explore P: 0.0191
Episode: 762 Total reward: 105.0 Training loss: 0.1503 Explore P: 0.0190
Episode: 763 Total reward: 163.0 Training loss: 0.4452 Explore P: 0.0188
Episode: 764 Total reward: 199.0 Training loss: 0.1741 Explore P: 0.0187
Episode: 765 Total reward: 111.0 Training loss: 0.3658 Explore P: 0.0186
Episode: 766 Total reward: 101.0 Training loss: 0.9062 Explore P: 0.0185
Episode: 767 Total reward: 152.0 Training loss: 0.2845 Explore P: 0.0184
Episode: 768 Total reward: 172.0 Training loss: 77.9814 Explore P: 0.0182
Episode: 769 Total reward: 127.0 Training loss: 0.2486 Explore P: 0.0181
Episode: 770 Total reward: 159.0 Training loss: 0.2673 Explore P: 0.0180
Episode: 771 Total reward: 156.0 Training loss: 0.4785 Explore P: 0.0179
Episode: 772 Total reward: 144.0 Training loss: 0.1814 Explore P: 0.0178
Episode: 773 Total reward: 199.0 Training loss: 0.1759 Explore P: 0.0176
Episode: 774 Total reward: 199.0 Training loss: 0.2922 Explore P: 0.0174
Episode: 775 Total reward: 147.0 Training loss: 0.4066 Explore P: 0.0173
Episode: 776 Total reward: 117.0 Training loss: 0.3948 Explore P: 0.0173
Episode: 777 Total reward: 116.0 Training loss: 0.1751 Explore P: 0.0172
Episode: 778 Total reward: 199.0 Training loss: 0.2518 Explore P: 0.0170
Episode: 779 Total reward: 139.0 Training loss: 2.0319 Explore P: 0.0169
Episode: 780 Total reward: 185.0 Training loss: 11.8979 Explore P: 0.0168
Episode: 781 Total reward: 154.0 Training loss: 0.1893 Explore P: 0.0167
Episode: 782 Total reward: 154.0 Training loss: 0.3121 Explore P: 0.0166
Episode: 783 Total reward: 174.0 Training loss: 0.2112 Explore P: 0.0165
Episode: 784 Total reward: 195.0 Training loss: 0.3060 Explore P: 0.0164
Episode: 785 Total reward: 150.0 Training loss: 0.2845 Explore P: 0.0163
Episode: 786 Total reward: 199.0 Training loss: 0.2007 Explore P: 0.0161
Episode: 787 Total reward: 199.0 Training loss: 0.6228 Explore P: 0.0160
Episode: 788 Total reward: 143.0 Training loss: 0.2138 Explore P: 0.0159
Episode: 789 Total reward: 141.0 Training loss: 0.1266 Explore P: 0.0159
Episode: 790 Total reward: 196.0 Training loss: 0.1899 Explore P: 0.0157
Episode: 791 Total reward: 199.0 Training loss: 0.2419 Explore P: 0.0156
Episode: 792 Total reward: 172.0 Training loss: 0.2525 Explore P: 0.0155
Episode: 793 Total reward: 199.0 Training loss: 7.1777 Explore P: 0.0154
Episode: 794 Total reward: 169.0 Training loss: 0.2687 Explore P: 0.0153
Episode: 795 Total reward: 199.0 Training loss: 0.2542 Explore P: 0.0152
Episode: 796 Total reward: 199.0 Training loss: 0.3908 Explore P: 0.0151
Episode: 797 Total reward: 199.0 Training loss: 0.3144 Explore P: 0.0150
Episode: 798 Total reward: 199.0 Training loss: 0.2751 Explore P: 0.0149
Episode: 799 Total reward: 194.0 Training loss: 0.2019 Explore P: 0.0148
Episode: 800 Total reward: 199.0 Training loss: 0.2032 Explore P: 0.0147
Episode: 801 Total reward: 199.0 Training loss: 100.0472 Explore P: 0.0146
Episode: 802 Total reward: 199.0 Training loss: 0.2525 Explore P: 0.0145
Episode: 803 Total reward: 199.0 Training loss: 7.6527 Explore P: 0.0145
Episode: 804 Total reward: 199.0 Training loss: 0.2426 Explore P: 0.0144
Episode: 805 Total reward: 186.0 Training loss: 0.2137 Explore P: 0.0143
Episode: 806 Total reward: 199.0 Training loss: 0.3313 Explore P: 0.0142
Episode: 807 Total reward: 199.0 Training loss: 0.2434 Explore P: 0.0141
Episode: 808 Total reward: 199.0 Training loss: 0.1153 Explore P: 0.0140
Episode: 809 Total reward: 199.0 Training loss: 0.3603 Explore P: 0.0140
Episode: 810 Total reward: 199.0 Training loss: 0.3672 Explore P: 0.0139
Episode: 811 Total reward: 199.0 Training loss: 0.2338 Explore P: 0.0138
Episode: 812 Total reward: 199.0 Training loss: 0.2382 Explore P: 0.0137
Episode: 813 Total reward: 199.0 Training loss: 18.8894 Explore P: 0.0137
Episode: 814 Total reward: 199.0 Training loss: 0.2179 Explore P: 0.0136
Episode: 815 Total reward: 199.0 Training loss: 0.4050 Explore P: 0.0135
Episode: 816 Total reward: 199.0 Training loss: 0.3095 Explore P: 0.0134
Episode: 817 Total reward: 199.0 Training loss: 0.3132 Explore P: 0.0134
Episode: 818 Total reward: 199.0 Training loss: 110.3477 Explore P: 0.0133
Episode: 819 Total reward: 199.0 Training loss: 0.2361 Explore P: 0.0132
Episode: 820 Total reward: 199.0 Training loss: 0.2991 Explore P: 0.0132
Episode: 821 Total reward: 199.0 Training loss: 0.2507 Explore P: 0.0131
Episode: 822 Total reward: 199.0 Training loss: 0.2335 Explore P: 0.0131
Episode: 823 Total reward: 199.0 Training loss: 0.2698 Explore P: 0.0130
Episode: 824 Total reward: 199.0 Training loss: 0.3011 Explore P: 0.0129
Episode: 825 Total reward: 199.0 Training loss: 0.3894 Explore P: 0.0129
Episode: 826 Total reward: 199.0 Training loss: 0.4772 Explore P: 0.0128
Episode: 827 Total reward: 199.0 Training loss: 0.1923 Explore P: 0.0128
Episode: 828 Total reward: 199.0 Training loss: 0.1685 Explore P: 0.0127
Episode: 829 Total reward: 199.0 Training loss: 0.1718 Explore P: 0.0127
Episode: 830 Total reward: 199.0 Training loss: 0.1572 Explore P: 0.0126
Episode: 831 Total reward: 199.0 Training loss: 0.1631 Explore P: 0.0126
Episode: 832 Total reward: 199.0 Training loss: 0.3152 Explore P: 0.0125
Episode: 833 Total reward: 199.0 Training loss: 0.3020 Explore P: 0.0125
Episode: 834 Total reward: 199.0 Training loss: 0.2627 Explore P: 0.0124
Episode: 835 Total reward: 199.0 Training loss: 4.0035 Explore P: 0.0124
Episode: 836 Total reward: 199.0 Training loss: 0.2350 Explore P: 0.0123
Episode: 837 Total reward: 199.0 Training loss: 0.3399 Explore P: 0.0123
Episode: 838 Total reward: 199.0 Training loss: 0.2095 Explore P: 0.0122
Episode: 839 Total reward: 199.0 Training loss: 0.3104 Explore P: 0.0122
Episode: 840 Total reward: 199.0 Training loss: 0.3630 Explore P: 0.0121
Episode: 841 Total reward: 199.0 Training loss: 0.1773 Explore P: 0.0121
Episode: 842 Total reward: 199.0 Training loss: 0.1869 Explore P: 0.0121
Episode: 843 Total reward: 199.0 Training loss: 0.3622 Explore P: 0.0120
Episode: 844 Total reward: 199.0 Training loss: 0.1255 Explore P: 0.0120
Episode: 845 Total reward: 199.0 Training loss: 0.3154 Explore P: 0.0119
Episode: 846 Total reward: 199.0 Training loss: 0.2598 Explore P: 0.0119
Episode: 847 Total reward: 199.0 Training loss: 0.1410 Explore P: 0.0119
Episode: 848 Total reward: 199.0 Training loss: 0.2009 Explore P: 0.0118
Episode: 849 Total reward: 199.0 Training loss: 0.1530 Explore P: 0.0118
Episode: 850 Total reward: 199.0 Training loss: 0.1509 Explore P: 0.0118
Episode: 851 Total reward: 199.0 Training loss: 0.1342 Explore P: 0.0117
Episode: 852 Total reward: 199.0 Training loss: 0.1809 Explore P: 0.0117
Episode: 853 Total reward: 199.0 Training loss: 0.1759 Explore P: 0.0117
Episode: 854 Total reward: 199.0 Training loss: 0.3564 Explore P: 0.0116
Episode: 855 Total reward: 199.0 Training loss: 0.1949 Explore P: 0.0116
Episode: 856 Total reward: 199.0 Training loss: 0.3450 Explore P: 0.0116
Episode: 857 Total reward: 199.0 Training loss: 0.3232 Explore P: 0.0115
Episode: 858 Total reward: 199.0 Training loss: 0.2191 Explore P: 0.0115
Episode: 859 Total reward: 199.0 Training loss: 3.0081 Explore P: 0.0115
Episode: 860 Total reward: 199.0 Training loss: 0.3033 Explore P: 0.0114
Episode: 861 Total reward: 199.0 Training loss: 0.0860 Explore P: 0.0114
Episode: 862 Total reward: 199.0 Training loss: 0.2458 Explore P: 0.0114
Episode: 863 Total reward: 199.0 Training loss: 290.2263 Explore P: 0.0114
Episode: 864 Total reward: 199.0 Training loss: 0.2580 Explore P: 0.0113
Episode: 865 Total reward: 199.0 Training loss: 276.0802 Explore P: 0.0113
Episode: 866 Total reward: 199.0 Training loss: 0.3202 Explore P: 0.0113
Episode: 867 Total reward: 199.0 Training loss: 0.2571 Explore P: 0.0112
Episode: 868 Total reward: 199.0 Training loss: 0.2095 Explore P: 0.0112
Episode: 869 Total reward: 199.0 Training loss: 0.5226 Explore P: 0.0112
Episode: 870 Total reward: 199.0 Training loss: 0.2134 Explore P: 0.0112
Episode: 871 Total reward: 199.0 Training loss: 0.2718 Explore P: 0.0112
Episode: 872 Total reward: 199.0 Training loss: 0.2734 Explore P: 0.0111
Episode: 873 Total reward: 199.0 Training loss: 0.1393 Explore P: 0.0111
Episode: 874 Total reward: 199.0 Training loss: 0.1547 Explore P: 0.0111
Episode: 875 Total reward: 199.0 Training loss: 0.1740 Explore P: 0.0111
Episode: 876 Total reward: 199.0 Training loss: 0.4671 Explore P: 0.0110
Episode: 877 Total reward: 199.0 Training loss: 0.3658 Explore P: 0.0110
Episode: 878 Total reward: 199.0 Training loss: 0.2220 Explore P: 0.0110
Episode: 879 Total reward: 199.0 Training loss: 0.2748 Explore P: 0.0110
Episode: 880 Total reward: 199.0 Training loss: 0.2568 Explore P: 0.0110
Episode: 881 Total reward: 199.0 Training loss: 0.1071 Explore P: 0.0109
Episode: 882 Total reward: 199.0 Training loss: 0.2471 Explore P: 0.0109
Episode: 883 Total reward: 199.0 Training loss: 0.2433 Explore P: 0.0109
Episode: 884 Total reward: 199.0 Training loss: 0.1850 Explore P: 0.0109
Episode: 885 Total reward: 199.0 Training loss: 0.1677 Explore P: 0.0109
Episode: 886 Total reward: 199.0 Training loss: 0.1960 Explore P: 0.0109
Episode: 887 Total reward: 199.0 Training loss: 0.2903 Explore P: 0.0108
Episode: 888 Total reward: 199.0 Training loss: 0.3443 Explore P: 0.0108
Episode: 889 Total reward: 199.0 Training loss: 0.3891 Explore P: 0.0108
Episode: 890 Total reward: 199.0 Training loss: 0.2460 Explore P: 0.0108
Episode: 891 Total reward: 199.0 Training loss: 0.2130 Explore P: 0.0108
Episode: 892 Total reward: 199.0 Training loss: 0.1836 Explore P: 0.0108
Episode: 893 Total reward: 199.0 Training loss: 0.3667 Explore P: 0.0107
Episode: 894 Total reward: 199.0 Training loss: 0.4371 Explore P: 0.0107
Episode: 895 Total reward: 199.0 Training loss: 250.9987 Explore P: 0.0107
Episode: 896 Total reward: 199.0 Training loss: 0.2214 Explore P: 0.0107
Episode: 897 Total reward: 199.0 Training loss: 0.2637 Explore P: 0.0107
Episode: 898 Total reward: 183.0 Training loss: 0.3286 Explore P: 0.0107
Episode: 899 Total reward: 199.0 Training loss: 0.2967 Explore P: 0.0107
Episode: 900 Total reward: 199.0 Training loss: 0.3683 Explore P: 0.0106
Episode: 901 Total reward: 199.0 Training loss: 0.2094 Explore P: 0.0106
Episode: 902 Total reward: 199.0 Training loss: 0.3744 Explore P: 0.0106
Episode: 903 Total reward: 199.0 Training loss: 0.4685 Explore P: 0.0106
Episode: 904 Total reward: 199.0 Training loss: 0.2145 Explore P: 0.0106
Episode: 905 Total reward: 199.0 Training loss: 0.4713 Explore P: 0.0106
Episode: 906 Total reward: 199.0 Training loss: 0.2949 Explore P: 0.0106
Episode: 907 Total reward: 199.0 Training loss: 0.1621 Explore P: 0.0106
Episode: 908 Total reward: 199.0 Training loss: 0.4886 Explore P: 0.0106
Episode: 909 Total reward: 199.0 Training loss: 0.3220 Explore P: 0.0105
Episode: 910 Total reward: 199.0 Training loss: 0.2247 Explore P: 0.0105
Episode: 911 Total reward: 199.0 Training loss: 0.3102 Explore P: 0.0105
Episode: 912 Total reward: 199.0 Training loss: 0.2243 Explore P: 0.0105
Episode: 913 Total reward: 199.0 Training loss: 0.2206 Explore P: 0.0105
Episode: 914 Total reward: 199.0 Training loss: 0.1971 Explore P: 0.0105
Episode: 915 Total reward: 199.0 Training loss: 0.2751 Explore P: 0.0105
Episode: 916 Total reward: 199.0 Training loss: 0.1857 Explore P: 0.0105
Episode: 917 Total reward: 199.0 Training loss: 0.2634 Explore P: 0.0105
Episode: 918 Total reward: 199.0 Training loss: 10.0834 Explore P: 0.0105
Episode: 919 Total reward: 199.0 Training loss: 541.2875 Explore P: 0.0104
Episode: 920 Total reward: 199.0 Training loss: 0.2874 Explore P: 0.0104
Episode: 921 Total reward: 199.0 Training loss: 0.1243 Explore P: 0.0104
Episode: 922 Total reward: 199.0 Training loss: 0.2686 Explore P: 0.0104
Episode: 923 Total reward: 199.0 Training loss: 0.2739 Explore P: 0.0104
Episode: 924 Total reward: 194.0 Training loss: 0.2138 Explore P: 0.0104
Episode: 925 Total reward: 199.0 Training loss: 0.2470 Explore P: 0.0104
Episode: 926 Total reward: 199.0 Training loss: 0.1415 Explore P: 0.0104
Episode: 927 Total reward: 199.0 Training loss: 0.2714 Explore P: 0.0104
Episode: 928 Total reward: 199.0 Training loss: 0.2652 Explore P: 0.0104
Episode: 929 Total reward: 199.0 Training loss: 0.3548 Explore P: 0.0104
Episode: 930 Total reward: 199.0 Training loss: 0.1817 Explore P: 0.0104
Episode: 931 Total reward: 199.0 Training loss: 0.2514 Explore P: 0.0104
Episode: 932 Total reward: 199.0 Training loss: 0.2048 Explore P: 0.0103
Episode: 933 Total reward: 199.0 Training loss: 0.1660 Explore P: 0.0103
Episode: 934 Total reward: 199.0 Training loss: 0.4289 Explore P: 0.0103
Episode: 935 Total reward: 199.0 Training loss: 0.1950 Explore P: 0.0103
Episode: 936 Total reward: 196.0 Training loss: 0.1878 Explore P: 0.0103
Episode: 937 Total reward: 171.0 Training loss: 0.4388 Explore P: 0.0103
Episode: 938 Total reward: 174.0 Training loss: 0.1825 Explore P: 0.0103
Episode: 939 Total reward: 172.0 Training loss: 0.1398 Explore P: 0.0103
Episode: 940 Total reward: 199.0 Training loss: 0.3148 Explore P: 0.0103
Episode: 941 Total reward: 199.0 Training loss: 228.0651 Explore P: 0.0103
Episode: 942 Total reward: 196.0 Training loss: 0.1503 Explore P: 0.0103
Episode: 943 Total reward: 147.0 Training loss: 0.1049 Explore P: 0.0103
Episode: 944 Total reward: 183.0 Training loss: 0.2008 Explore P: 0.0103
Episode: 945 Total reward: 191.0 Training loss: 287.5863 Explore P: 0.0103
Episode: 946 Total reward: 191.0 Training loss: 0.1418 Explore P: 0.0103
Episode: 947 Total reward: 184.0 Training loss: 0.1560 Explore P: 0.0103
Episode: 948 Total reward: 199.0 Training loss: 21.5267 Explore P: 0.0103
Episode: 949 Total reward: 199.0 Training loss: 0.1161 Explore P: 0.0102
Episode: 950 Total reward: 178.0 Training loss: 0.3394 Explore P: 0.0102
Episode: 951 Total reward: 199.0 Training loss: 0.1406 Explore P: 0.0102
Episode: 952 Total reward: 199.0 Training loss: 0.1144 Explore P: 0.0102
Episode: 953 Total reward: 199.0 Training loss: 0.1889 Explore P: 0.0102
Episode: 954 Total reward: 186.0 Training loss: 0.1097 Explore P: 0.0102
Episode: 955 Total reward: 157.0 Training loss: 0.1885 Explore P: 0.0102
Episode: 956 Total reward: 199.0 Training loss: 0.0771 Explore P: 0.0102
Episode: 957 Total reward: 185.0 Training loss: 0.1098 Explore P: 0.0102
Episode: 958 Total reward: 199.0 Training loss: 0.1227 Explore P: 0.0102
Episode: 959 Total reward: 196.0 Training loss: 0.0521 Explore P: 0.0102
Episode: 960 Total reward: 180.0 Training loss: 6.1210 Explore P: 0.0102
Episode: 961 Total reward: 199.0 Training loss: 0.1533 Explore P: 0.0102
Episode: 962 Total reward: 199.0 Training loss: 0.1464 Explore P: 0.0102
Episode: 963 Total reward: 199.0 Training loss: 0.0694 Explore P: 0.0102
Episode: 964 Total reward: 197.0 Training loss: 0.0608 Explore P: 0.0102
Episode: 965 Total reward: 199.0 Training loss: 0.0395 Explore P: 0.0102
Episode: 966 Total reward: 162.0 Training loss: 0.0799 Explore P: 0.0102
Episode: 967 Total reward: 199.0 Training loss: 0.0858 Explore P: 0.0102
Episode: 968 Total reward: 199.0 Training loss: 0.0948 Explore P: 0.0102
Episode: 969 Total reward: 190.0 Training loss: 0.0775 Explore P: 0.0102
Episode: 970 Total reward: 199.0 Training loss: 0.1370 Explore P: 0.0102
Episode: 971 Total reward: 199.0 Training loss: 0.1062 Explore P: 0.0102
Episode: 972 Total reward: 166.0 Training loss: 0.0426 Explore P: 0.0102
Episode: 973 Total reward: 199.0 Training loss: 0.1395 Explore P: 0.0102
Episode: 974 Total reward: 199.0 Training loss: 0.1162 Explore P: 0.0102
Episode: 975 Total reward: 141.0 Training loss: 0.1690 Explore P: 0.0102
Episode: 976 Total reward: 199.0 Training loss: 0.0863 Explore P: 0.0101
Episode: 977 Total reward: 199.0 Training loss: 0.0869 Explore P: 0.0101
Episode: 978 Total reward: 199.0 Training loss: 0.0790 Explore P: 0.0101
Episode: 979 Total reward: 199.0 Training loss: 0.0583 Explore P: 0.0101
Episode: 980 Total reward: 199.0 Training loss: 0.1412 Explore P: 0.0101
Episode: 981 Total reward: 199.0 Training loss: 0.0726 Explore P: 0.0101
Episode: 982 Total reward: 199.0 Training loss: 0.1145 Explore P: 0.0101
Episode: 983 Total reward: 199.0 Training loss: 0.1332 Explore P: 0.0101
Episode: 984 Total reward: 199.0 Training loss: 0.0698 Explore P: 0.0101
Episode: 985 Total reward: 199.0 Training loss: 0.0398 Explore P: 0.0101
Episode: 986 Total reward: 199.0 Training loss: 0.1106 Explore P: 0.0101
Episode: 987 Total reward: 199.0 Training loss: 0.0641 Explore P: 0.0101
Episode: 988 Total reward: 160.0 Training loss: 0.0856 Explore P: 0.0101
Episode: 989 Total reward: 199.0 Training loss: 27.9004 Explore P: 0.0101
Episode: 990 Total reward: 199.0 Training loss: 0.0997 Explore P: 0.0101
Episode: 991 Total reward: 199.0 Training loss: 57.3867 Explore P: 0.0101
Episode: 992 Total reward: 199.0 Training loss: 0.1191 Explore P: 0.0101
Episode: 993 Total reward: 199.0 Training loss: 0.0962 Explore P: 0.0101
Episode: 994 Total reward: 199.0 Training loss: 0.0317 Explore P: 0.0101
Episode: 995 Total reward: 199.0 Training loss: 0.2032 Explore P: 0.0101
Episode: 996 Total reward: 199.0 Training loss: 0.0411 Explore P: 0.0101
Episode: 997 Total reward: 199.0 Training loss: 0.3833 Explore P: 0.0101
Episode: 998 Total reward: 199.0 Training loss: 0.1586 Explore P: 0.0101
Episode: 999 Total reward: 199.0 Training loss: 0.0685 Explore P: 0.0101

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

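def running_mean(x, N):
    # Prepend a zero so that cumsum[i] is the sum of the first i values of x
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Each window sum is the difference of two cumulative sums N apart
    return (cumsum[N:] - cumsum[:-N]) / N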

In [12]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[12]:
<matplotlib.text.Text at 0x11ff6cbe0>
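To see what running_mean is doing, here is a small worked example. The reward values are made up for illustration; the function is the same one defined above. It returns len(x) - N + 1 values, which is why the smoothed curve is plotted against only the last len(smoothed_rews) episodes.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a zero so that cumsum[i] is the sum of the first i values of x
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Each window sum is the difference of two cumulative sums N apart
    return (cumsum[N:] - cumsum[:-N]) / N

rews = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up episode rewards
smoothed = running_mean(rews, 3)
print(smoothed)       # [20. 30. 40.]
print(len(smoothed))  # 3, which is len(rews) - N + 1

# The cumulative-sum trick gives the same result as a plain moving average
assert np.allclose(smoothed, np.convolve(rews, np.ones(3) / 3, mode='valid'))
```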

Testing

Let's check out how our trained agent plays the game.


In [14]:
test_episodes = 10
test_max_steps = 400
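env.reset()
# Take one random step so `state` is defined before the first prediction
state, reward, done, _ = env.step(env.action_space.sample())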
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(test_episodes):  # run exactly test_episodes episodes
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


INFO:tensorflow:Restoring parameters from checkpoints/cartpole.ckpt
[2017-04-30 13:53:43,977] Restoring parameters from checkpoints/cartpole.ckpt

In [15]:
env.close()
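During testing the agent no longer explores; it always takes the action with the largest predicted Q-value. The sketch below isolates that step using NumPy only. The Q-values are hypothetical numbers standing in for the network output, and greedy_action is a helper name introduced here for illustration. It also shows why the state is reshaped before being fed in: the network expects a batch, so a single state of shape (4,) becomes a batch of one with shape (1, 4).

```python
import numpy as np

def greedy_action(q_values):
    """Return the index of the largest Q-value for a single state."""
    return int(np.argmax(q_values))

# A Cart-Pole state has four values; zeros are used here as a stand-in
state = np.zeros(4)
batch = state.reshape((1, *state.shape))  # add a leading batch dimension
print(batch.shape)  # (1, 4)

# Hypothetical network output for that batch: one row per state,
# one column per action (0 = left, 1 = right)
Qs = np.array([[0.8, 1.3]])
action = greedy_action(Qs)
print(action)  # 1
```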

Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd feed in the screen images and use convolutional layers to extract the state from them.
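As a rough idea of what feeding in screen images could look like, here is a minimal sketch of turning raw frames into an input for convolutional layers. Everything in it is an assumption made for illustration: the preprocess helper, the stack size of 4, and the random arrays standing in for real screen images. It is a simplified pipeline, not the preprocessing used in the paper.

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Convert an RGB frame of shape (H, W, 3) to a small grayscale image."""
    gray = frame.mean(axis=2)   # average the colour channels
    small = gray[::2, ::2]      # keep every second pixel in each direction
    return (small / 255.0).astype(np.float32)  # scale to [0, 1]

stack_size = 4                     # recent frames per state (assumed value)
frames = deque(maxlen=stack_size)  # old frames drop out automatically

# Random arrays stand in for screen images of size 210x160 (assumed size)
rng = np.random.default_rng(0)
for _ in range(stack_size):
    frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
    frames.append(preprocess(frame))

# Stack the frames along the last axis so they act like image channels
state = np.stack(frames, axis=2)
print(state.shape)  # (105, 80, 4)
```

Stacking several consecutive frames lets the network see motion, such as which way a ball is travelling, which a single still image cannot show.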

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.