Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-23 09:38:13,817] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, use env.render() to render one frame. Passing an action as an integer to env.step generates the next step in the simulation. You can see how many actions are possible from env.action_space, and env.action_space.sample() returns a random action. This is general to all Gym games. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().


In [5]:
env.close()

If you ran the simulation above, we can look at the rewards:


In [4]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ after taking action $a$.

Before, we used this equation to learn values for a Q-table. For this game, however, there is a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating point precision, there are practically infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.
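
To get a feel for why a table is impractical, here is a rough back-of-the-envelope sketch. It assumes a hypothetical discretization where each of the four state values is split into the same number of bins; the function name and bin counts are illustrative and not part of this notebook's code.

```python
# Hypothetical discretization: split each of the 4 state values into `bins` buckets.
def q_table_entries(bins, state_dims=4, n_actions=2):
    """Number of (state, action) cells in a Q-table over the discretized state."""
    return (bins ** state_dims) * n_actions

for bins in (10, 100, 1000):
    print(bins, "bins per value ->", q_table_entries(bins), "table entries")
```

Even a coarse grid of 100 bins per value already needs 200 million entries, and most of them would never be visited during training.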

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
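
As a small worked example of the target and the loss for a single transition, here is a sketch in plain Python. The numbers and the helper names td_target and squared_error are made up for illustration; in the notebook these values come from the network and the simulator.

```python
gamma = 0.99  # discount factor (illustrative value)

def td_target(reward, next_q_values, gamma):
    """Target Q-hat: reward plus the discounted best Q-value of the next state."""
    return reward + gamma * max(next_q_values)

def squared_error(target, q_value):
    """Loss for one transition: (Q-hat - Q)^2."""
    return (target - q_value) ** 2

reward = 1.0                # reward observed after taking the action
next_q_values = [2.0, 3.0]  # assumed network outputs for the next state s'
current_q = 3.5             # assumed network output Q(s, a) for the action taken

target = td_target(reward, next_q_values, gamma)  # 1.0 + 0.99 * 3.0
loss = squared_error(target, current_q)
print(round(target, 4), round(loss, 4))
```

Only the larger of the two next-state values enters the target, which is the max over $a'$ in the equation.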

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [8]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
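
The one-hot trick used for self.Q can be illustrated without TensorFlow. The sketch below mimics, in plain Python, what multiplying the output by the one-hot actions and summing over each row does; the helper names and the sample numbers are illustrative only.

```python
def one_hot(index, size):
    """Return a list of zeros with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def select_q(outputs, actions):
    """For each row of Q-values, keep only the value of the action taken."""
    selected = []
    for row, action in zip(outputs, actions):
        mask = one_hot(action, len(row))
        # Multiply element-wise by the mask, then sum: only one term survives.
        selected.append(sum(q * m for q, m in zip(row, mask)))
    return selected

outputs = [[0.5, 1.5], [2.0, -1.0], [0.0, 3.0]]  # one row of Q-values per state
actions = [1, 0, 1]                              # action taken in each state
print(select_q(outputs, actions))  # [1.5, 2.0, 3.0]
```

Because the mask zeroes out the other action, the loss only pushes on the Q-value of the action that was actually taken.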

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [11]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
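
To see the tube behavior of deque on its own, here is a tiny sketch with a capacity of three. The string labels stand in for real transitions, and random.sample is used here in place of np.random.choice only to keep the example free of dependencies.

```python
from collections import deque
import random

buffer = deque(maxlen=3)  # holds at most 3 experiences
for experience in ["e1", "e2", "e3", "e4", "e5"]:
    buffer.append(experience)  # once full, the oldest item drops out the other side

print(list(buffer))  # ['e3', 'e4', 'e5']

# Draw a mini-batch of 2 distinct experiences (sampling without replacement)
batch = random.sample(list(buffer), 2)
print(batch)
```

Note that sampling without replacement needs at least batch_size items in the buffer, which is why the memory gets pre-populated before training.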

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ it will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
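
Here is a minimal sketch of $\epsilon$-greedy action selection in plain Python. The function name epsilon_greedy and the sample Q-values are hypothetical; in the notebook the Q-values would come from the network.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)   # seeded generator so runs are repeatable
q_values = [0.2, 0.8]    # assumed Q-values for actions 0 and 1

always_greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(5)]
always_random = [epsilon_greedy(q_values, 1.0, rng) for _ in range(1000)]
print(always_greedy)                 # [1, 1, 1, 1, 1]
print(sorted(set(always_random)))    # both actions show up when exploring
```

With $\epsilon = 0$ the agent only exploits, and with $\epsilon = 1$ it only explores. Training moves gradually from the second case toward the first.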

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t, a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
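
The target step in the list above can be sketched for a whole mini-batch in plain Python. The function name batch_targets and the explicit terminal flags are assumptions made for this illustration, and gamma is set to 0.5 only so the arithmetic is easy to check by hand.

```python
def batch_targets(rewards, next_q_values, terminal, gamma):
    """Compute Q-hat for each transition in a mini-batch.

    terminal[j] is True when the episode ended after transition j,
    in which case there is no next state and the target is just the reward.
    """
    targets = []
    for r, next_qs, done in zip(rewards, next_q_values, terminal):
        if done:
            targets.append(r)
        else:
            targets.append(r + gamma * max(next_qs))
    return targets

rewards = [1.0, 1.0]
next_q_values = [[0.5, 2.0], [4.0, 1.0]]  # assumed network outputs for each s'
terminal = [False, True]                  # second transition ended its episode

targets = batch_targets(rewards, next_q_values, terminal, gamma=0.5)
print(targets)  # [2.0, 1.0]
```

The second target ignores the next-state values entirely, even though they are large, because the episode ended there.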

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [6]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory
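
To get a sense of how quickly exploration falls off with these settings, here is a small sketch of an exponential decay from explore_start toward explore_stop. It repeats the three exploration values from the cell above so it runs on its own; the helper name explore_probability is mine, and the step counts are arbitrary.

```python
import math

explore_start = 1.0    # exploration probability at start
explore_stop = 0.01    # minimum exploration probability
decay_rate = 0.0001    # exponential decay rate for exploration prob

def explore_probability(step):
    """Exploration probability after `step` total training steps."""
    decay = math.exp(-decay_rate * step)
    return explore_stop + (explore_start - explore_stop) * decay

for step in (0, 11, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

The decay is counted in total steps across all episodes, not in episodes, so short early episodes barely reduce the exploration probability.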

In [9]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [12]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [13]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes + 1):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 11.0 Training loss: 1.0702 Explore P: 0.9989
Episode: 2 Total reward: 13.0 Training loss: 1.0546 Explore P: 0.9976
Episode: 3 Total reward: 59.0 Training loss: 1.0099 Explore P: 0.9918
Episode: 4 Total reward: 11.0 Training loss: 1.0761 Explore P: 0.9907
Episode: 5 Total reward: 38.0 Training loss: 1.0031 Explore P: 0.9870
Episode: 6 Total reward: 28.0 Training loss: 1.1759 Explore P: 0.9843
Episode: 7 Total reward: 30.0 Training loss: 1.1807 Explore P: 0.9814
Episode: 8 Total reward: 8.0 Training loss: 1.1642 Explore P: 0.9806
Episode: 9 Total reward: 11.0 Training loss: 1.1702 Explore P: 0.9795
Episode: 10 Total reward: 15.0 Training loss: 1.1523 Explore P: 0.9781
Episode: 11 Total reward: 15.0 Training loss: 1.2546 Explore P: 0.9766
Episode: 12 Total reward: 17.0 Training loss: 1.1789 Explore P: 0.9750
Episode: 13 Total reward: 11.0 Training loss: 1.3136 Explore P: 0.9739
Episode: 14 Total reward: 22.0 Training loss: 1.4229 Explore P: 0.9718
Episode: 15 Total reward: 13.0 Training loss: 1.3232 Explore P: 0.9705
Episode: 16 Total reward: 43.0 Training loss: 1.4466 Explore P: 0.9664
Episode: 17 Total reward: 22.0 Training loss: 1.8864 Explore P: 0.9643
Episode: 18 Total reward: 20.0 Training loss: 1.6949 Explore P: 0.9624
Episode: 19 Total reward: 21.0 Training loss: 1.9030 Explore P: 0.9604
Episode: 20 Total reward: 14.0 Training loss: 2.2571 Explore P: 0.9591
Episode: 21 Total reward: 14.0 Training loss: 1.9879 Explore P: 0.9578
Episode: 22 Total reward: 18.0 Training loss: 2.0768 Explore P: 0.9561
Episode: 23 Total reward: 13.0 Training loss: 3.3718 Explore P: 0.9548
Episode: 24 Total reward: 18.0 Training loss: 4.6491 Explore P: 0.9531
Episode: 25 Total reward: 11.0 Training loss: 1.4735 Explore P: 0.9521
Episode: 26 Total reward: 29.0 Training loss: 2.5927 Explore P: 0.9494
Episode: 27 Total reward: 24.0 Training loss: 7.0648 Explore P: 0.9471
Episode: 28 Total reward: 25.0 Training loss: 8.1734 Explore P: 0.9448
Episode: 29 Total reward: 9.0 Training loss: 7.3383 Explore P: 0.9439
Episode: 30 Total reward: 14.0 Training loss: 2.3575 Explore P: 0.9426
Episode: 31 Total reward: 11.0 Training loss: 3.4791 Explore P: 0.9416
Episode: 32 Total reward: 13.0 Training loss: 6.2328 Explore P: 0.9404
Episode: 33 Total reward: 10.0 Training loss: 5.8519 Explore P: 0.9395
Episode: 34 Total reward: 16.0 Training loss: 6.1796 Explore P: 0.9380
Episode: 35 Total reward: 20.0 Training loss: 4.2690 Explore P: 0.9361
Episode: 36 Total reward: 18.0 Training loss: 18.1500 Explore P: 0.9345
Episode: 37 Total reward: 17.0 Training loss: 12.8133 Explore P: 0.9329
Episode: 38 Total reward: 22.0 Training loss: 9.8876 Explore P: 0.9309
Episode: 39 Total reward: 22.0 Training loss: 19.2334 Explore P: 0.9288
Episode: 40 Total reward: 18.0 Training loss: 12.8982 Explore P: 0.9272
Episode: 41 Total reward: 24.0 Training loss: 5.4828 Explore P: 0.9250
Episode: 42 Total reward: 11.0 Training loss: 13.6149 Explore P: 0.9240
Episode: 43 Total reward: 36.0 Training loss: 16.5912 Explore P: 0.9207
Episode: 44 Total reward: 14.0 Training loss: 4.1816 Explore P: 0.9194
Episode: 45 Total reward: 15.0 Training loss: 19.3460 Explore P: 0.9181
Episode: 46 Total reward: 14.0 Training loss: 18.9602 Explore P: 0.9168
Episode: 47 Total reward: 28.0 Training loss: 6.7750 Explore P: 0.9142
Episode: 48 Total reward: 14.0 Training loss: 9.1849 Explore P: 0.9130
Episode: 49 Total reward: 16.0 Training loss: 7.7338 Explore P: 0.9115
Episode: 50 Total reward: 19.0 Training loss: 45.4914 Explore P: 0.9098
Episode: 51 Total reward: 39.0 Training loss: 37.7880 Explore P: 0.9063
Episode: 52 Total reward: 31.0 Training loss: 13.6614 Explore P: 0.9036
Episode: 53 Total reward: 19.0 Training loss: 14.0985 Explore P: 0.9019
Episode: 54 Total reward: 11.0 Training loss: 63.8309 Explore P: 0.9009
Episode: 55 Total reward: 13.0 Training loss: 23.1143 Explore P: 0.8997
Episode: 56 Total reward: 21.0 Training loss: 20.9336 Explore P: 0.8979
Episode: 57 Total reward: 16.0 Training loss: 27.4410 Explore P: 0.8964
Episode: 58 Total reward: 11.0 Training loss: 33.1639 Explore P: 0.8955
Episode: 59 Total reward: 18.0 Training loss: 13.7933 Explore P: 0.8939
Episode: 60 Total reward: 27.0 Training loss: 41.2850 Explore P: 0.8915
Episode: 61 Total reward: 19.0 Training loss: 137.5784 Explore P: 0.8898
Episode: 62 Total reward: 52.0 Training loss: 26.3545 Explore P: 0.8852
Episode: 63 Total reward: 25.0 Training loss: 38.3190 Explore P: 0.8831
Episode: 64 Total reward: 26.0 Training loss: 30.0066 Explore P: 0.8808
Episode: 65 Total reward: 34.0 Training loss: 37.6305 Explore P: 0.8778
Episode: 66 Total reward: 18.0 Training loss: 36.6638 Explore P: 0.8763
Episode: 67 Total reward: 18.0 Training loss: 290.0507 Explore P: 0.8747
Episode: 68 Total reward: 13.0 Training loss: 133.1707 Explore P: 0.8736
Episode: 69 Total reward: 19.0 Training loss: 23.8781 Explore P: 0.8720
Episode: 70 Total reward: 22.0 Training loss: 88.4590 Explore P: 0.8701
Episode: 71 Total reward: 15.0 Training loss: 68.4850 Explore P: 0.8688
Episode: 72 Total reward: 41.0 Training loss: 54.9184 Explore P: 0.8653
Episode: 73 Total reward: 11.0 Training loss: 31.8942 Explore P: 0.8643
Episode: 74 Total reward: 23.0 Training loss: 240.0835 Explore P: 0.8624
Episode: 75 Total reward: 43.0 Training loss: 53.6898 Explore P: 0.8587
Episode: 76 Total reward: 23.0 Training loss: 209.4031 Explore P: 0.8567
Episode: 77 Total reward: 31.0 Training loss: 328.2399 Explore P: 0.8541
Episode: 78 Total reward: 12.0 Training loss: 231.5963 Explore P: 0.8531
Episode: 79 Total reward: 9.0 Training loss: 97.5360 Explore P: 0.8524
Episode: 80 Total reward: 36.0 Training loss: 188.5837 Explore P: 0.8493
Episode: 81 Total reward: 19.0 Training loss: 472.6060 Explore P: 0.8477
Episode: 82 Total reward: 12.0 Training loss: 31.7347 Explore P: 0.8467
Episode: 83 Total reward: 11.0 Training loss: 644.5078 Explore P: 0.8458
Episode: 84 Total reward: 18.0 Training loss: 232.8938 Explore P: 0.8443
Episode: 85 Total reward: 23.0 Training loss: 95.4963 Explore P: 0.8424
Episode: 86 Total reward: 15.0 Training loss: 260.2586 Explore P: 0.8411
Episode: 87 Total reward: 10.0 Training loss: 135.8450 Explore P: 0.8403
Episode: 88 Total reward: 11.0 Training loss: 143.3975 Explore P: 0.8394
Episode: 89 Total reward: 12.0 Training loss: 39.0057 Explore P: 0.8384
Episode: 90 Total reward: 24.0 Training loss: 94.4341 Explore P: 0.8364
Episode: 91 Total reward: 12.0 Training loss: 164.1809 Explore P: 0.8354
Episode: 92 Total reward: 14.0 Training loss: 41.4338 Explore P: 0.8343
Episode: 93 Total reward: 28.0 Training loss: 333.4632 Explore P: 0.8320
Episode: 94 Total reward: 67.0 Training loss: 725.5220 Explore P: 0.8265
Episode: 95 Total reward: 16.0 Training loss: 545.7455 Explore P: 0.8252
Episode: 96 Total reward: 35.0 Training loss: 49.9303 Explore P: 0.8223
Episode: 97 Total reward: 13.0 Training loss: 242.2847 Explore P: 0.8213
Episode: 98 Total reward: 11.0 Training loss: 168.9944 Explore P: 0.8204
Episode: 99 Total reward: 15.0 Training loss: 597.1711 Explore P: 0.8192
Episode: 100 Total reward: 15.0 Training loss: 510.1923 Explore P: 0.8180
Episode: 101 Total reward: 25.0 Training loss: 941.4907 Explore P: 0.8159
Episode: 102 Total reward: 13.0 Training loss: 62.9190 Explore P: 0.8149
Episode: 103 Total reward: 18.0 Training loss: 61.4892 Explore P: 0.8134
Episode: 104 Total reward: 23.0 Training loss: 261.5439 Explore P: 0.8116
Episode: 105 Total reward: 13.0 Training loss: 39.6728 Explore P: 0.8106
Episode: 106 Total reward: 14.0 Training loss: 58.1922 Explore P: 0.8094
Episode: 107 Total reward: 16.0 Training loss: 177.5204 Explore P: 0.8082
Episode: 108 Total reward: 45.0 Training loss: 1114.0930 Explore P: 0.8046
Episode: 109 Total reward: 20.0 Training loss: 521.9534 Explore P: 0.8030
Episode: 110 Total reward: 15.0 Training loss: 533.6101 Explore P: 0.8018
Episode: 111 Total reward: 19.0 Training loss: 317.5259 Explore P: 0.8003
Episode: 112 Total reward: 15.0 Training loss: 560.5369 Explore P: 0.7991
Episode: 113 Total reward: 31.0 Training loss: 674.2643 Explore P: 0.7967
Episode: 114 Total reward: 11.0 Training loss: 229.0001 Explore P: 0.7958
Episode: 115 Total reward: 85.0 Training loss: 492.6760 Explore P: 0.7892
Episode: 116 Total reward: 8.0 Training loss: 50.4321 Explore P: 0.7885
Episode: 117 Total reward: 18.0 Training loss: 560.9222 Explore P: 0.7871
Episode: 118 Total reward: 21.0 Training loss: 54.4683 Explore P: 0.7855
Episode: 119 Total reward: 17.0 Training loss: 509.4133 Explore P: 0.7842
Episode: 120 Total reward: 18.0 Training loss: 614.4151 Explore P: 0.7828
Episode: 121 Total reward: 13.0 Training loss: 950.5312 Explore P: 0.7818
Episode: 122 Total reward: 18.0 Training loss: 50.1445 Explore P: 0.7804
Episode: 123 Total reward: 15.0 Training loss: 41.4617 Explore P: 0.7792
Episode: 124 Total reward: 27.0 Training loss: 1136.3434 Explore P: 0.7772
Episode: 125 Total reward: 24.0 Training loss: 983.6782 Explore P: 0.7753
Episode: 126 Total reward: 13.0 Training loss: 869.9653 Explore P: 0.7743
Episode: 127 Total reward: 10.0 Training loss: 48.4492 Explore P: 0.7736
Episode: 128 Total reward: 19.0 Training loss: 236.6705 Explore P: 0.7721
Episode: 129 Total reward: 17.0 Training loss: 48.2274 Explore P: 0.7708
Episode: 130 Total reward: 9.0 Training loss: 734.0258 Explore P: 0.7701
Episode: 131 Total reward: 41.0 Training loss: 485.9235 Explore P: 0.7670
Episode: 132 Total reward: 38.0 Training loss: 41.7906 Explore P: 0.7642
Episode: 133 Total reward: 12.0 Training loss: 319.2959 Explore P: 0.7633
Episode: 134 Total reward: 16.0 Training loss: 280.0275 Explore P: 0.7621
Episode: 135 Total reward: 8.0 Training loss: 38.6251 Explore P: 0.7615
Episode: 136 Total reward: 18.0 Training loss: 1169.9880 Explore P: 0.7601
Episode: 137 Total reward: 15.0 Training loss: 1073.1616 Explore P: 0.7590
Episode: 138 Total reward: 9.0 Training loss: 1774.9777 Explore P: 0.7583
Episode: 139 Total reward: 10.0 Training loss: 46.0169 Explore P: 0.7576
Episode: 140 Total reward: 45.0 Training loss: 32.0984 Explore P: 0.7542
Episode: 141 Total reward: 42.0 Training loss: 1157.9805 Explore P: 0.7511
Episode: 142 Total reward: 17.0 Training loss: 866.0440 Explore P: 0.7498
Episode: 143 Total reward: 21.0 Training loss: 866.2982 Explore P: 0.7483
Episode: 144 Total reward: 21.0 Training loss: 31.6911 Explore P: 0.7467
Episode: 145 Total reward: 24.0 Training loss: 933.3372 Explore P: 0.7450
Episode: 146 Total reward: 9.0 Training loss: 38.0045 Explore P: 0.7443
Episode: 147 Total reward: 8.0 Training loss: 492.3253 Explore P: 0.7437
Episode: 148 Total reward: 39.0 Training loss: 332.2234 Explore P: 0.7408
Episode: 149 Total reward: 13.0 Training loss: 951.0751 Explore P: 0.7399
Episode: 150 Total reward: 38.0 Training loss: 411.0845 Explore P: 0.7371
Episode: 151 Total reward: 10.0 Training loss: 1459.5696 Explore P: 0.7364
Episode: 152 Total reward: 30.0 Training loss: 19.9214 Explore P: 0.7342
Episode: 153 Total reward: 18.0 Training loss: 22.4886 Explore P: 0.7329
Episode: 154 Total reward: 11.0 Training loss: 17.4526 Explore P: 0.7321
Episode: 155 Total reward: 15.0 Training loss: 476.1651 Explore P: 0.7310
Episode: 156 Total reward: 11.0 Training loss: 377.3716 Explore P: 0.7303
Episode: 157 Total reward: 25.0 Training loss: 465.6510 Explore P: 0.7285
Episode: 158 Total reward: 24.0 Training loss: 13.5268 Explore P: 0.7267
Episode: 159 Total reward: 24.0 Training loss: 379.1453 Explore P: 0.7250
Episode: 160 Total reward: 12.0 Training loss: 247.8783 Explore P: 0.7242
Episode: 161 Total reward: 29.0 Training loss: 276.0966 Explore P: 0.7221
Episode: 162 Total reward: 13.0 Training loss: 664.7184 Explore P: 0.7212
Episode: 163 Total reward: 41.0 Training loss: 12.0897 Explore P: 0.7183
Episode: 164 Total reward: 12.0 Training loss: 1022.6981 Explore P: 0.7174
Episode: 165 Total reward: 14.0 Training loss: 9.6146 Explore P: 0.7164
Episode: 166 Total reward: 58.0 Training loss: 299.9297 Explore P: 0.7123
Episode: 167 Total reward: 11.0 Training loss: 440.6118 Explore P: 0.7116
Episode: 168 Total reward: 14.0 Training loss: 5.2796 Explore P: 0.7106
Episode: 169 Total reward: 13.0 Training loss: 8.1856 Explore P: 0.7097
Episode: 170 Total reward: 11.0 Training loss: 6.4611 Explore P: 0.7089
Episode: 171 Total reward: 22.0 Training loss: 372.6794 Explore P: 0.7074
Episode: 172 Total reward: 9.0 Training loss: 997.0746 Explore P: 0.7067
Episode: 173 Total reward: 12.0 Training loss: 287.0989 Explore P: 0.7059
Episode: 174 Total reward: 13.0 Training loss: 629.1707 Explore P: 0.7050
Episode: 175 Total reward: 23.0 Training loss: 614.5966 Explore P: 0.7034
Episode: 176 Total reward: 19.0 Training loss: 407.9127 Explore P: 0.7021
Episode: 177 Total reward: 13.0 Training loss: 622.1385 Explore P: 0.7012
Episode: 178 Total reward: 20.0 Training loss: 289.9062 Explore P: 0.6998
Episode: 179 Total reward: 15.0 Training loss: 280.7796 Explore P: 0.6988
Episode: 180 Total reward: 11.0 Training loss: 271.2862 Explore P: 0.6980
Episode: 181 Total reward: 18.0 Training loss: 299.8486 Explore P: 0.6968
Episode: 182 Total reward: 22.0 Training loss: 5.2444 Explore P: 0.6953
Episode: 183 Total reward: 14.0 Training loss: 4.1126 Explore P: 0.6943
Episode: 184 Total reward: 15.0 Training loss: 5.1376 Explore P: 0.6933
Episode: 185 Total reward: 14.0 Training loss: 4.9794 Explore P: 0.6923
Episode: 186 Total reward: 35.0 Training loss: 3.9744 Explore P: 0.6899
Episode: 187 Total reward: 18.0 Training loss: 260.5503 Explore P: 0.6887
Episode: 188 Total reward: 25.0 Training loss: 3.2934 Explore P: 0.6870
Episode: 189 Total reward: 9.0 Training loss: 237.4524 Explore P: 0.6864
Episode: 190 Total reward: 22.0 Training loss: 443.7376 Explore P: 0.6849
Episode: 191 Total reward: 18.0 Training loss: 703.3723 Explore P: 0.6837
Episode: 192 Total reward: 50.0 Training loss: 2.7426 Explore P: 0.6804
Episode: 193 Total reward: 13.0 Training loss: 1.2696 Explore P: 0.6795
Episode: 194 Total reward: 13.0 Training loss: 465.2874 Explore P: 0.6786
Episode: 195 Total reward: 24.0 Training loss: 1.4549 Explore P: 0.6770
Episode: 196 Total reward: 8.0 Training loss: 185.2390 Explore P: 0.6765
Episode: 197 Total reward: 10.0 Training loss: 363.1832 Explore P: 0.6758
Episode: 198 Total reward: 13.0 Training loss: 177.9924 Explore P: 0.6749
Episode: 199 Total reward: 18.0 Training loss: 159.3948 Explore P: 0.6737
Episode: 200 Total reward: 16.0 Training loss: 514.3766 Explore P: 0.6727
Episode: 201 Total reward: 11.0 Training loss: 180.5333 Explore P: 0.6720
Episode: 202 Total reward: 12.0 Training loss: 313.2085 Explore P: 0.6712
Episode: 203 Total reward: 9.0 Training loss: 160.1842 Explore P: 0.6706
Episode: 204 Total reward: 11.0 Training loss: 252.2746 Explore P: 0.6698
Episode: 205 Total reward: 14.0 Training loss: 150.4825 Explore P: 0.6689
Episode: 206 Total reward: 16.0 Training loss: 140.1145 Explore P: 0.6679
Episode: 207 Total reward: 10.0 Training loss: 3.8841 Explore P: 0.6672
Episode: 208 Total reward: 17.0 Training loss: 1.3424 Explore P: 0.6661
Episode: 209 Total reward: 12.0 Training loss: 2.3257 Explore P: 0.6653
Episode: 210 Total reward: 33.0 Training loss: 2.6815 Explore P: 0.6631
Episode: 211 Total reward: 14.0 Training loss: 1.7855 Explore P: 0.6622
Episode: 212 Total reward: 18.0 Training loss: 358.5397 Explore P: 0.6611
Episode: 213 Total reward: 30.0 Training loss: 286.8926 Explore P: 0.6591
Episode: 214 Total reward: 8.0 Training loss: 123.9222 Explore P: 0.6586
Episode: 215 Total reward: 15.0 Training loss: 116.8780 Explore P: 0.6576
Episode: 216 Total reward: 10.0 Training loss: 122.2108 Explore P: 0.6570
Episode: 217 Total reward: 8.0 Training loss: 1.6763 Explore P: 0.6565
Episode: 218 Total reward: 18.0 Training loss: 130.1793 Explore P: 0.6553
Episode: 219 Total reward: 13.0 Training loss: 215.0953 Explore P: 0.6545
Episode: 220 Total reward: 13.0 Training loss: 315.5068 Explore P: 0.6536
Episode: 221 Total reward: 13.0 Training loss: 1.8016 Explore P: 0.6528
Episode: 222 Total reward: 11.0 Training loss: 198.0766 Explore P: 0.6521
Episode: 223 Total reward: 11.0 Training loss: 2.3029 Explore P: 0.6514
Episode: 224 Total reward: 9.0 Training loss: 217.0137 Explore P: 0.6508
Episode: 225 Total reward: 12.0 Training loss: 449.7814 Explore P: 0.6500
Episode: 226 Total reward: 15.0 Training loss: 400.1376 Explore P: 0.6491
Episode: 227 Total reward: 21.0 Training loss: 188.5065 Explore P: 0.6477
Episode: 228 Total reward: 17.0 Training loss: 92.8690 Explore P: 0.6466
Episode: 229 Total reward: 16.0 Training loss: 94.9069 Explore P: 0.6456
Episode: 230 Total reward: 11.0 Training loss: 92.5145 Explore P: 0.6449
Episode: 231 Total reward: 13.0 Training loss: 199.7605 Explore P: 0.6441
Episode: 232 Total reward: 13.0 Training loss: 2.6730 Explore P: 0.6433
Episode: 233 Total reward: 31.0 Training loss: 122.6906 Explore P: 0.6413
Episode: 234 Total reward: 14.0 Training loss: 85.8739 Explore P: 0.6404
Episode: 235 Total reward: 34.0 Training loss: 92.8208 Explore P: 0.6383
Episode: 236 Total reward: 12.0 Training loss: 80.9422 Explore P: 0.6375
Episode: 237 Total reward: 9.0 Training loss: 439.5160 Explore P: 0.6370
Episode: 238 Total reward: 25.0 Training loss: 198.1663 Explore P: 0.6354
Episode: 239 Total reward: 27.0 Training loss: 174.8127 Explore P: 0.6337
Episode: 240 Total reward: 10.0 Training loss: 3.5637 Explore P: 0.6331
Episode: 241 Total reward: 26.0 Training loss: 301.7789 Explore P: 0.6315
Episode: 242 Total reward: 17.0 Training loss: 3.2024 Explore P: 0.6304
Episode: 243 Total reward: 13.0 Training loss: 192.6112 Explore P: 0.6296
Episode: 244 Total reward: 24.0 Training loss: 71.0778 Explore P: 0.6281
Episode: 245 Total reward: 9.0 Training loss: 158.6546 Explore P: 0.6276
Episode: 246 Total reward: 13.0 Training loss: 3.5067 Explore P: 0.6268
Episode: 247 Total reward: 21.0 Training loss: 3.1003 Explore P: 0.6255
Episode: 248 Total reward: 9.0 Training loss: 203.7366 Explore P: 0.6249
Episode: 249 Total reward: 19.0 Training loss: 3.3091 Explore P: 0.6238
Episode: 250 Total reward: 13.0 Training loss: 4.0309 Explore P: 0.6230
Episode: 251 Total reward: 11.0 Training loss: 248.9669 Explore P: 0.6223
Episode: 252 Total reward: 9.0 Training loss: 63.1236 Explore P: 0.6217
Episode: 253 Total reward: 13.0 Training loss: 387.8817 Explore P: 0.6209
Episode: 254 Total reward: 12.0 Training loss: 341.1160 Explore P: 0.6202
Episode: 255 Total reward: 11.0 Training loss: 69.5852 Explore P: 0.6195
Episode: 256 Total reward: 14.0 Training loss: 268.9966 Explore P: 0.6187
Episode: 257 Total reward: 13.0 Training loss: 62.5633 Explore P: 0.6179
Episode: 258 Total reward: 10.0 Training loss: 3.1301 Explore P: 0.6173
Episode: 259 Total reward: 11.0 Training loss: 4.8874 Explore P: 0.6166
Episode: 260 Total reward: 11.0 Training loss: 597.3100 Explore P: 0.6160
Episode: 261 Total reward: 10.0 Training loss: 4.1529 Explore P: 0.6153
Episode: 262 Total reward: 8.0 Training loss: 268.2944 Explore P: 0.6149
Episode: 263 Total reward: 20.0 Training loss: 154.7785 Explore P: 0.6137
Episode: 264 Total reward: 9.0 Training loss: 5.9689 Explore P: 0.6131
Episode: 265 Total reward: 30.0 Training loss: 6.3433 Explore P: 0.6113
Episode: 266 Total reward: 11.0 Training loss: 4.5393 Explore P: 0.6106
Episode: 267 Total reward: 19.0 Training loss: 135.8045 Explore P: 0.6095
Episode: 268 Total reward: 8.0 Training loss: 182.2120 Explore P: 0.6090
Episode: 269 Total reward: 8.0 Training loss: 129.8255 Explore P: 0.6085
Episode: 270 Total reward: 30.0 Training loss: 334.2833 Explore P: 0.6068
Episode: 271 Total reward: 21.0 Training loss: 2.9805 Explore P: 0.6055
Episode: 272 Total reward: 12.0 Training loss: 3.3922 Explore P: 0.6048
Episode: 273 Total reward: 12.0 Training loss: 49.5633 Explore P: 0.6041
Episode: 274 Total reward: 12.0 Training loss: 59.3200 Explore P: 0.6034
Episode: 275 Total reward: 8.0 Training loss: 3.4041 Explore P: 0.6029
Episode: 276 Total reward: 10.0 Training loss: 60.7013 Explore P: 0.6023
Episode: 277 Total reward: 26.0 Training loss: 228.6618 Explore P: 0.6008
Episode: 278 Total reward: 21.0 Training loss: 54.8097 Explore P: 0.5995
Episode: 279 Total reward: 31.0 Training loss: 456.8548 Explore P: 0.5977
Episode: 280 Total reward: 14.0 Training loss: 341.8407 Explore P: 0.5969
Episode: 281 Total reward: 28.0 Training loss: 133.7956 Explore P: 0.5952
Episode: 282 Total reward: 19.0 Training loss: 44.2651 Explore P: 0.5941
Episode: 283 Total reward: 16.0 Training loss: 4.3299 Explore P: 0.5932
Episode: 284 Total reward: 29.0 Training loss: 54.8188 Explore P: 0.5915
Episode: 285 Total reward: 22.0 Training loss: 45.8942 Explore P: 0.5902
Episode: 286 Total reward: 12.0 Training loss: 113.5731 Explore P: 0.5895
Episode: 287 Total reward: 11.0 Training loss: 3.8511 Explore P: 0.5889
Episode: 288 Total reward: 12.0 Training loss: 4.9939 Explore P: 0.5882
Episode: 289 Total reward: 10.0 Training loss: 49.5346 Explore P: 0.5876
Episode: 290 Total reward: 13.0 Training loss: 5.2671 Explore P: 0.5869
Episode: 291 Total reward: 9.0 Training loss: 3.8029 Explore P: 0.5863
Episode: 292 Total reward: 30.0 Training loss: 4.5856 Explore P: 0.5846
Episode: 293 Total reward: 14.0 Training loss: 92.6129 Explore P: 0.5838
Episode: 294 Total reward: 12.0 Training loss: 177.0572 Explore P: 0.5831
Episode: 295 Total reward: 12.0 Training loss: 5.3932 Explore P: 0.5824
Episode: 296 Total reward: 9.0 Training loss: 190.5712 Explore P: 0.5819
Episode: 297 Total reward: 24.0 Training loss: 219.6460 Explore P: 0.5806
Episode: 298 Total reward: 11.0 Training loss: 2.4377 Explore P: 0.5799
Episode: 299 Total reward: 14.0 Training loss: 178.3045 Explore P: 0.5791
Episode: 300 Total reward: 23.0 Training loss: 122.8639 Explore P: 0.5778
Episode: 301 Total reward: 9.0 Training loss: 5.5042 Explore P: 0.5773
Episode: 302 Total reward: 36.0 Training loss: 125.1057 Explore P: 0.5753
Episode: 303 Total reward: 13.0 Training loss: 39.0270 Explore P: 0.5745
Episode: 304 Total reward: 10.0 Training loss: 3.2884 Explore P: 0.5740
Episode: 305 Total reward: 14.0 Training loss: 4.8808 Explore P: 0.5732
Episode: 306 Total reward: 9.0 Training loss: 4.8213 Explore P: 0.5727
Episode: 307 Total reward: 19.0 Training loss: 37.3860 Explore P: 0.5716
Episode: 308 Total reward: 19.0 Training loss: 159.2723 Explore P: 0.5705
Episode: 309 Total reward: 23.0 Training loss: 3.5051 Explore P: 0.5693
Episode: 310 Total reward: 45.0 Training loss: 86.7077 Explore P: 0.5667
Episode: 311 Total reward: 11.0 Training loss: 4.7349 Explore P: 0.5661
Episode: 312 Total reward: 10.0 Training loss: 3.0856 Explore P: 0.5656
Episode: 313 Total reward: 11.0 Training loss: 3.7047 Explore P: 0.5650
Episode: 314 Total reward: 12.0 Training loss: 1.8447 Explore P: 0.5643
Episode: 315 Total reward: 16.0 Training loss: 341.7200 Explore P: 0.5634
Episode: 316 Total reward: 12.0 Training loss: 44.6074 Explore P: 0.5627
Episode: 317 Total reward: 14.0 Training loss: 167.3280 Explore P: 0.5620
Episode: 318 Total reward: 9.0 Training loss: 79.9068 Explore P: 0.5615
Episode: 319 Total reward: 9.0 Training loss: 214.1400 Explore P: 0.5610
Episode: 320 Total reward: 10.0 Training loss: 2.9635 Explore P: 0.5604
Episode: 321 Total reward: 13.0 Training loss: 3.6031 Explore P: 0.5597
Episode: 322 Total reward: 30.0 Training loss: 157.9185 Explore P: 0.5581
Episode: 323 Total reward: 22.0 Training loss: 155.5372 Explore P: 0.5569
Episode: 324 Total reward: 14.0 Training loss: 487.2683 Explore P: 0.5561
Episode: 325 Total reward: 40.0 Training loss: 186.5160 Explore P: 0.5539
Episode: 326 Total reward: 13.0 Training loss: 99.8830 Explore P: 0.5532
Episode: 327 Total reward: 11.0 Training loss: 104.7993 Explore P: 0.5526
Episode: 328 Total reward: 11.0 Training loss: 109.5139 Explore P: 0.5520
Episode: 329 Total reward: 11.0 Training loss: 3.7972 Explore P: 0.5514
Episode: 330 Total reward: 31.0 Training loss: 4.0265 Explore P: 0.5497
Episode: 331 Total reward: 8.0 Training loss: 3.0437 Explore P: 0.5493
Episode: 332 Total reward: 7.0 Training loss: 245.0715 Explore P: 0.5489
Episode: 333 Total reward: 14.0 Training loss: 1.4707 Explore P: 0.5482
Episode: 334 Total reward: 32.0 Training loss: 128.4124 Explore P: 0.5465
Episode: 335 Total reward: 8.0 Training loss: 41.2521 Explore P: 0.5460
Episode: 336 Total reward: 22.0 Training loss: 1.8229 Explore P: 0.5449
Episode: 337 Total reward: 10.0 Training loss: 34.0484 Explore P: 0.5443
Episode: 338 Total reward: 16.0 Training loss: 1.7975 Explore P: 0.5435
Episode: 339 Total reward: 13.0 Training loss: 36.6460 Explore P: 0.5428
Episode: 340 Total reward: 12.0 Training loss: 100.9386 Explore P: 0.5421
Episode: 341 Total reward: 16.0 Training loss: 1.9694 Explore P: 0.5413
Episode: 342 Total reward: 12.0 Training loss: 82.0446 Explore P: 0.5407
Episode: 343 Total reward: 10.0 Training loss: 33.4700 Explore P: 0.5401
Episode: 344 Total reward: 18.0 Training loss: 67.1813 Explore P: 0.5392
Episode: 345 Total reward: 11.0 Training loss: 91.5149 Explore P: 0.5386
Episode: 346 Total reward: 9.0 Training loss: 28.4404 Explore P: 0.5381
Episode: 347 Total reward: 8.0 Training loss: 98.2277 Explore P: 0.5377
Episode: 348 Total reward: 13.0 Training loss: 117.5515 Explore P: 0.5370
Episode: 349 Total reward: 22.0 Training loss: 88.0151 Explore P: 0.5358
Episode: 350 Total reward: 14.0 Training loss: 159.5718 Explore P: 0.5351
Episode: 351 Total reward: 14.0 Training loss: 37.4011 Explore P: 0.5344
Episode: 352 Total reward: 13.0 Training loss: 1.6003 Explore P: 0.5337
Episode: 353 Total reward: 33.0 Training loss: 69.2224 Explore P: 0.5320
Episode: 354 Total reward: 22.0 Training loss: 142.6923 Explore P: 0.5308
Episode: 355 Total reward: 27.0 Training loss: 1.1734 Explore P: 0.5294
Episode: 356 Total reward: 11.0 Training loss: 63.7455 Explore P: 0.5288
Episode: 357 Total reward: 13.0 Training loss: 26.8081 Explore P: 0.5282
Episode: 358 Total reward: 14.0 Training loss: 2.9129 Explore P: 0.5274
Episode: 359 Total reward: 10.0 Training loss: 103.0929 Explore P: 0.5269
Episode: 360 Total reward: 14.0 Training loss: 1.7583 Explore P: 0.5262
Episode: 361 Total reward: 18.0 Training loss: 169.7322 Explore P: 0.5253
Episode: 362 Total reward: 11.0 Training loss: 127.7490 Explore P: 0.5247
Episode: 363 Total reward: 112.0 Training loss: 70.0603 Explore P: 0.5190
Episode: 364 Total reward: 8.0 Training loss: 147.0307 Explore P: 0.5186
Episode: 365 Total reward: 114.0 Training loss: 33.2709 Explore P: 0.5128
Episode: 366 Total reward: 71.0 Training loss: 93.9318 Explore P: 0.5092
Episode: 367 Total reward: 32.0 Training loss: 66.5241 Explore P: 0.5077
Episode: 368 Total reward: 61.0 Training loss: 0.8708 Explore P: 0.5046
Episode: 369 Total reward: 27.0 Training loss: 26.7812 Explore P: 0.5033
Episode: 370 Total reward: 42.0 Training loss: 24.0086 Explore P: 0.5012
Episode: 371 Total reward: 30.0 Training loss: 2.0355 Explore P: 0.4998
Episode: 372 Total reward: 43.0 Training loss: 27.1411 Explore P: 0.4977
Episode: 373 Total reward: 68.0 Training loss: 38.9128 Explore P: 0.4943
Episode: 374 Total reward: 63.0 Training loss: 57.8300 Explore P: 0.4913
Episode: 375 Total reward: 23.0 Training loss: 79.7640 Explore P: 0.4902
Episode: 376 Total reward: 46.0 Training loss: 30.1037 Explore P: 0.4880
Episode: 377 Total reward: 33.0 Training loss: 66.9780 Explore P: 0.4864
Episode: 378 Total reward: 43.0 Training loss: 1.0226 Explore P: 0.4844
Episode: 379 Total reward: 39.0 Training loss: 36.5923 Explore P: 0.4825
Episode: 380 Total reward: 64.0 Training loss: 0.9134 Explore P: 0.4795
Episode: 381 Total reward: 20.0 Training loss: 1.7178 Explore P: 0.4786
Episode: 382 Total reward: 35.0 Training loss: 29.1739 Explore P: 0.4769
Episode: 383 Total reward: 28.0 Training loss: 17.1136 Explore P: 0.4756
Episode: 384 Total reward: 17.0 Training loss: 1.9130 Explore P: 0.4748
Episode: 385 Total reward: 33.0 Training loss: 0.6030 Explore P: 0.4733
Episode: 386 Total reward: 27.0 Training loss: 27.7431 Explore P: 0.4721
Episode: 387 Total reward: 23.0 Training loss: 51.7194 Explore P: 0.4710
Episode: 388 Total reward: 31.0 Training loss: 1.7018 Explore P: 0.4696
Episode: 389 Total reward: 39.0 Training loss: 42.7956 Explore P: 0.4678
Episode: 390 Total reward: 36.0 Training loss: 75.5560 Explore P: 0.4661
Episode: 391 Total reward: 62.0 Training loss: 2.3397 Explore P: 0.4633
Episode: 392 Total reward: 30.0 Training loss: 24.4110 Explore P: 0.4620
Episode: 393 Total reward: 29.0 Training loss: 116.1537 Explore P: 0.4607
Episode: 394 Total reward: 32.0 Training loss: 69.3861 Explore P: 0.4592
Episode: 395 Total reward: 21.0 Training loss: 27.0918 Explore P: 0.4583
Episode: 396 Total reward: 29.0 Training loss: 1.2462 Explore P: 0.4570
Episode: 397 Total reward: 21.0 Training loss: 36.2749 Explore P: 0.4560
Episode: 398 Total reward: 40.0 Training loss: 28.8677 Explore P: 0.4543
Episode: 399 Total reward: 23.0 Training loss: 20.8124 Explore P: 0.4532
Episode: 400 Total reward: 22.0 Training loss: 13.4448 Explore P: 0.4523
Episode: 401 Total reward: 27.0 Training loss: 21.9546 Explore P: 0.4511
Episode: 402 Total reward: 24.0 Training loss: 1.9268 Explore P: 0.4500
Episode: 403 Total reward: 32.0 Training loss: 2.6113 Explore P: 0.4486
Episode: 404 Total reward: 27.0 Training loss: 34.5709 Explore P: 0.4474
Episode: 405 Total reward: 27.0 Training loss: 60.6495 Explore P: 0.4462
Episode: 406 Total reward: 22.0 Training loss: 1.4319 Explore P: 0.4453
Episode: 407 Total reward: 24.0 Training loss: 57.7762 Explore P: 0.4442
Episode: 408 Total reward: 30.0 Training loss: 56.6550 Explore P: 0.4429
Episode: 409 Total reward: 22.0 Training loss: 2.4071 Explore P: 0.4420
Episode: 410 Total reward: 25.0 Training loss: 16.8325 Explore P: 0.4409
Episode: 411 Total reward: 20.0 Training loss: 36.0482 Explore P: 0.4401
Episode: 412 Total reward: 20.0 Training loss: 19.9007 Explore P: 0.4392
Episode: 413 Total reward: 21.0 Training loss: 39.7219 Explore P: 0.4383
Episode: 414 Total reward: 36.0 Training loss: 69.7303 Explore P: 0.4368
Episode: 415 Total reward: 30.0 Training loss: 18.3904 Explore P: 0.4355
Episode: 416 Total reward: 24.0 Training loss: 15.8679 Explore P: 0.4345
Episode: 417 Total reward: 54.0 Training loss: 19.2936 Explore P: 0.4322
Episode: 418 Total reward: 25.0 Training loss: 22.7205 Explore P: 0.4311
Episode: 419 Total reward: 26.0 Training loss: 27.5835 Explore P: 0.4300
Episode: 420 Total reward: 12.0 Training loss: 13.9295 Explore P: 0.4295
Episode: 421 Total reward: 25.0 Training loss: 18.3369 Explore P: 0.4285
Episode: 422 Total reward: 23.0 Training loss: 1.6989 Explore P: 0.4275
Episode: 423 Total reward: 30.0 Training loss: 4.8548 Explore P: 0.4263
Episode: 424 Total reward: 19.0 Training loss: 29.7306 Explore P: 0.4255
Episode: 425 Total reward: 44.0 Training loss: 3.4717 Explore P: 0.4236
Episode: 426 Total reward: 31.0 Training loss: 14.7088 Explore P: 0.4224
Episode: 427 Total reward: 20.0 Training loss: 4.8951 Explore P: 0.4215
Episode: 428 Total reward: 26.0 Training loss: 46.4745 Explore P: 0.4205
Episode: 429 Total reward: 86.0 Training loss: 1.7426 Explore P: 0.4170
Episode: 430 Total reward: 33.0 Training loss: 45.4538 Explore P: 0.4156
Episode: 431 Total reward: 43.0 Training loss: 43.3775 Explore P: 0.4139
Episode: 432 Total reward: 21.0 Training loss: 28.7594 Explore P: 0.4130
Episode: 433 Total reward: 22.0 Training loss: 2.0819 Explore P: 0.4121
Episode: 434 Total reward: 33.0 Training loss: 57.1537 Explore P: 0.4108
Episode: 435 Total reward: 26.0 Training loss: 23.3806 Explore P: 0.4098
Episode: 436 Total reward: 20.0 Training loss: 61.1093 Explore P: 0.4090
Episode: 437 Total reward: 15.0 Training loss: 38.1411 Explore P: 0.4084
Episode: 438 Total reward: 36.0 Training loss: 19.9311 Explore P: 0.4069
Episode: 439 Total reward: 29.0 Training loss: 33.2679 Explore P: 0.4058
Episode: 440 Total reward: 31.0 Training loss: 2.8531 Explore P: 0.4046
Episode: 441 Total reward: 32.0 Training loss: 43.2542 Explore P: 0.4033
Episode: 442 Total reward: 39.0 Training loss: 3.0657 Explore P: 0.4018
Episode: 443 Total reward: 36.0 Training loss: 97.0637 Explore P: 0.4004
Episode: 444 Total reward: 26.0 Training loss: 2.6573 Explore P: 0.3994
Episode: 445 Total reward: 34.0 Training loss: 19.4044 Explore P: 0.3980
Episode: 446 Total reward: 41.0 Training loss: 61.9951 Explore P: 0.3965
Episode: 447 Total reward: 43.0 Training loss: 18.3881 Explore P: 0.3948
Episode: 448 Total reward: 58.0 Training loss: 42.1395 Explore P: 0.3926
Episode: 449 Total reward: 35.0 Training loss: 42.4998 Explore P: 0.3912
Episode: 450 Total reward: 86.0 Training loss: 1.3416 Explore P: 0.3880
Episode: 451 Total reward: 64.0 Training loss: 29.2598 Explore P: 0.3856
Episode: 452 Total reward: 36.0 Training loss: 21.2943 Explore P: 0.3842
Episode: 453 Total reward: 48.0 Training loss: 38.3715 Explore P: 0.3824
Episode: 454 Total reward: 38.0 Training loss: 3.3965 Explore P: 0.3810
Episode: 455 Total reward: 34.0 Training loss: 74.3382 Explore P: 0.3797
Episode: 456 Total reward: 20.0 Training loss: 40.8807 Explore P: 0.3790
Episode: 457 Total reward: 56.0 Training loss: 63.1233 Explore P: 0.3769
Episode: 458 Total reward: 64.0 Training loss: 50.3657 Explore P: 0.3746
Episode: 459 Total reward: 50.0 Training loss: 92.9130 Explore P: 0.3728
Episode: 460 Total reward: 31.0 Training loss: 1.6923 Explore P: 0.3717
Episode: 461 Total reward: 38.0 Training loss: 92.0663 Explore P: 0.3703
Episode: 462 Total reward: 41.0 Training loss: 2.2930 Explore P: 0.3688
Episode: 463 Total reward: 35.0 Training loss: 38.9179 Explore P: 0.3676
Episode: 464 Total reward: 45.0 Training loss: 3.6351 Explore P: 0.3660
Episode: 465 Total reward: 37.0 Training loss: 33.1129 Explore P: 0.3646
Episode: 466 Total reward: 32.0 Training loss: 2.7677 Explore P: 0.3635
Episode: 467 Total reward: 28.0 Training loss: 17.4357 Explore P: 0.3625
Episode: 468 Total reward: 41.0 Training loss: 2.2315 Explore P: 0.3611
Episode: 469 Total reward: 43.0 Training loss: 29.2224 Explore P: 0.3596
Episode: 470 Total reward: 38.0 Training loss: 1.4889 Explore P: 0.3582
Episode: 471 Total reward: 26.0 Training loss: 31.2503 Explore P: 0.3573
Episode: 472 Total reward: 72.0 Training loss: 56.9775 Explore P: 0.3548
Episode: 473 Total reward: 44.0 Training loss: 23.2681 Explore P: 0.3533
Episode: 474 Total reward: 18.0 Training loss: 34.4613 Explore P: 0.3527
Episode: 475 Total reward: 46.0 Training loss: 15.3475 Explore P: 0.3511
Episode: 476 Total reward: 23.0 Training loss: 22.1580 Explore P: 0.3504
Episode: 477 Total reward: 34.0 Training loss: 35.0221 Explore P: 0.3492
Episode: 478 Total reward: 54.0 Training loss: 3.1126 Explore P: 0.3474
Episode: 479 Total reward: 53.0 Training loss: 2.5103 Explore P: 0.3456
Episode: 480 Total reward: 24.0 Training loss: 23.3640 Explore P: 0.3448
Episode: 481 Total reward: 58.0 Training loss: 15.9460 Explore P: 0.3429
Episode: 482 Total reward: 36.0 Training loss: 22.2996 Explore P: 0.3417
Episode: 483 Total reward: 28.0 Training loss: 37.6430 Explore P: 0.3407
Episode: 484 Total reward: 27.0 Training loss: 21.3598 Explore P: 0.3398
Episode: 485 Total reward: 20.0 Training loss: 2.6466 Explore P: 0.3392
Episode: 486 Total reward: 60.0 Training loss: 36.1977 Explore P: 0.3372
Episode: 487 Total reward: 35.0 Training loss: 18.8781 Explore P: 0.3361
Episode: 488 Total reward: 47.0 Training loss: 30.8423 Explore P: 0.3345
Episode: 489 Total reward: 33.0 Training loss: 25.8925 Explore P: 0.3335
Episode: 490 Total reward: 86.0 Training loss: 1.7971 Explore P: 0.3307
Episode: 491 Total reward: 56.0 Training loss: 78.8316 Explore P: 0.3289
Episode: 492 Total reward: 76.0 Training loss: 1.7356 Explore P: 0.3265
Episode: 493 Total reward: 63.0 Training loss: 91.6616 Explore P: 0.3245
Episode: 494 Total reward: 90.0 Training loss: 191.3766 Explore P: 0.3217
Episode: 495 Total reward: 69.0 Training loss: 3.4325 Explore P: 0.3195
Episode: 496 Total reward: 68.0 Training loss: 19.3488 Explore P: 0.3174
Episode: 497 Total reward: 48.0 Training loss: 2.7212 Explore P: 0.3160
Episode: 498 Total reward: 36.0 Training loss: 30.0143 Explore P: 0.3149
Episode: 499 Total reward: 52.0 Training loss: 49.3107 Explore P: 0.3133
Episode: 500 Total reward: 64.0 Training loss: 64.2149 Explore P: 0.3114
Episode: 501 Total reward: 65.0 Training loss: 46.0436 Explore P: 0.3094
Episode: 502 Total reward: 58.0 Training loss: 2.5315 Explore P: 0.3077
Episode: 503 Total reward: 43.0 Training loss: 56.0470 Explore P: 0.3064
Episode: 504 Total reward: 74.0 Training loss: 0.9125 Explore P: 0.3042
Episode: 505 Total reward: 74.0 Training loss: 23.0458 Explore P: 0.3020
Episode: 506 Total reward: 48.0 Training loss: 4.3093 Explore P: 0.3006
Episode: 507 Total reward: 61.0 Training loss: 40.1750 Explore P: 0.2989
Episode: 508 Total reward: 99.0 Training loss: 2.3839 Explore P: 0.2960
Episode: 509 Total reward: 68.0 Training loss: 57.6807 Explore P: 0.2941
Episode: 510 Total reward: 89.0 Training loss: 54.3944 Explore P: 0.2916
Episode: 511 Total reward: 49.0 Training loss: 41.8608 Explore P: 0.2902
Episode: 512 Total reward: 38.0 Training loss: 31.4714 Explore P: 0.2891
Episode: 513 Total reward: 102.0 Training loss: 37.2989 Explore P: 0.2863
Episode: 514 Total reward: 122.0 Training loss: 1.3765 Explore P: 0.2830
Episode: 515 Total reward: 26.0 Training loss: 20.2788 Explore P: 0.2822
Episode: 516 Total reward: 60.0 Training loss: 1.1492 Explore P: 0.2806
Episode: 517 Total reward: 41.0 Training loss: 43.0605 Explore P: 0.2795
Episode: 518 Total reward: 39.0 Training loss: 3.4340 Explore P: 0.2785
Episode: 519 Total reward: 48.0 Training loss: 2.3449 Explore P: 0.2772
Episode: 520 Total reward: 40.0 Training loss: 3.9184 Explore P: 0.2761
Episode: 521 Total reward: 68.0 Training loss: 1.1918 Explore P: 0.2743
Episode: 522 Total reward: 64.0 Training loss: 1.5492 Explore P: 0.2726
Episode: 523 Total reward: 40.0 Training loss: 13.3989 Explore P: 0.2716
Episode: 524 Total reward: 34.0 Training loss: 2.6941 Explore P: 0.2707
Episode: 525 Total reward: 45.0 Training loss: 43.2174 Explore P: 0.2695
Episode: 526 Total reward: 102.0 Training loss: 71.7983 Explore P: 0.2669
Episode: 527 Total reward: 58.0 Training loss: 33.4454 Explore P: 0.2654
Episode: 528 Total reward: 65.0 Training loss: 3.1510 Explore P: 0.2637
Episode: 529 Total reward: 63.0 Training loss: 0.7422 Explore P: 0.2621
Episode: 530 Total reward: 26.0 Training loss: 2.9254 Explore P: 0.2615
Episode: 531 Total reward: 60.0 Training loss: 56.1413 Explore P: 0.2600
Episode: 532 Total reward: 43.0 Training loss: 53.8148 Explore P: 0.2589
Episode: 533 Total reward: 64.0 Training loss: 3.5273 Explore P: 0.2573
Episode: 534 Total reward: 51.0 Training loss: 1.6315 Explore P: 0.2561
Episode: 535 Total reward: 45.0 Training loss: 3.3978 Explore P: 0.2550
Episode: 536 Total reward: 64.0 Training loss: 75.0764 Explore P: 0.2534
Episode: 537 Total reward: 56.0 Training loss: 16.9919 Explore P: 0.2520
Episode: 538 Total reward: 32.0 Training loss: 35.7897 Explore P: 0.2513
Episode: 539 Total reward: 56.0 Training loss: 27.1139 Explore P: 0.2499
Episode: 540 Total reward: 43.0 Training loss: 1.6118 Explore P: 0.2489
Episode: 541 Total reward: 48.0 Training loss: 45.9701 Explore P: 0.2477
Episode: 542 Total reward: 80.0 Training loss: 0.3362 Explore P: 0.2459
Episode: 543 Total reward: 49.0 Training loss: 2.4888 Explore P: 0.2447
Episode: 544 Total reward: 50.0 Training loss: 0.9251 Explore P: 0.2435
Episode: 545 Total reward: 49.0 Training loss: 2.8538 Explore P: 0.2424
Episode: 546 Total reward: 43.0 Training loss: 62.5512 Explore P: 0.2414
Episode: 547 Total reward: 71.0 Training loss: 3.4179 Explore P: 0.2398
Episode: 548 Total reward: 42.0 Training loss: 48.9850 Explore P: 0.2388
Episode: 549 Total reward: 47.0 Training loss: 79.9256 Explore P: 0.2377
Episode: 550 Total reward: 66.0 Training loss: 23.8538 Explore P: 0.2362
Episode: 551 Total reward: 65.0 Training loss: 0.6350 Explore P: 0.2348
Episode: 552 Total reward: 67.0 Training loss: 73.6929 Explore P: 0.2333
Episode: 553 Total reward: 78.0 Training loss: 31.9136 Explore P: 0.2315
Episode: 554 Total reward: 106.0 Training loss: 16.5538 Explore P: 0.2292
Episode: 555 Total reward: 56.0 Training loss: 0.6812 Explore P: 0.2280
Episode: 556 Total reward: 53.0 Training loss: 42.1195 Explore P: 0.2268
Episode: 557 Total reward: 60.0 Training loss: 0.8609 Explore P: 0.2255
Episode: 558 Total reward: 42.0 Training loss: 7.9556 Explore P: 0.2246
Episode: 559 Total reward: 87.0 Training loss: 20.5864 Explore P: 0.2227
Episode: 560 Total reward: 81.0 Training loss: 51.4350 Explore P: 0.2210
Episode: 561 Total reward: 59.0 Training loss: 87.8629 Explore P: 0.2198
Episode: 562 Total reward: 64.0 Training loss: 0.5721 Explore P: 0.2185
Episode: 563 Total reward: 49.0 Training loss: 133.9322 Explore P: 0.2174
Episode: 564 Total reward: 66.0 Training loss: 29.3177 Explore P: 0.2161
Episode: 565 Total reward: 109.0 Training loss: 2.4380 Explore P: 0.2138
Episode: 566 Total reward: 72.0 Training loss: 0.5918 Explore P: 0.2124
Episode: 567 Total reward: 47.0 Training loss: 0.4432 Explore P: 0.2114
Episode: 568 Total reward: 199.0 Training loss: 2.1091 Explore P: 0.2075
Episode: 569 Total reward: 135.0 Training loss: 1.2175 Explore P: 0.2048
Episode: 570 Total reward: 175.0 Training loss: 58.5439 Explore P: 0.2014
Episode: 571 Total reward: 199.0 Training loss: 0.3722 Explore P: 0.1977
Episode: 572 Total reward: 199.0 Training loss: 32.4787 Explore P: 0.1940
Episode: 573 Total reward: 68.0 Training loss: 7.5271 Explore P: 0.1927
Episode: 574 Total reward: 109.0 Training loss: 0.7161 Explore P: 0.1907
Episode: 575 Total reward: 57.0 Training loss: 26.8954 Explore P: 0.1897
Episode: 576 Total reward: 106.0 Training loss: 0.3605 Explore P: 0.1878
Episode: 577 Total reward: 49.0 Training loss: 8.5949 Explore P: 0.1869
Episode: 578 Total reward: 56.0 Training loss: 18.0504 Explore P: 0.1860
Episode: 579 Total reward: 65.0 Training loss: 14.0009 Explore P: 0.1848
Episode: 580 Total reward: 39.0 Training loss: 36.7687 Explore P: 0.1841
Episode: 581 Total reward: 73.0 Training loss: 0.9613 Explore P: 0.1829
Episode: 582 Total reward: 64.0 Training loss: 0.5215 Explore P: 0.1818
Episode: 583 Total reward: 47.0 Training loss: 17.7864 Explore P: 0.1810
Episode: 584 Total reward: 42.0 Training loss: 1.1644 Explore P: 0.1802
Episode: 585 Total reward: 31.0 Training loss: 0.4534 Explore P: 0.1797
Episode: 586 Total reward: 39.0 Training loss: 0.4723 Explore P: 0.1791
Episode: 587 Total reward: 118.0 Training loss: 112.7352 Explore P: 0.1771
Episode: 588 Total reward: 31.0 Training loss: 0.8958 Explore P: 0.1766
Episode: 589 Total reward: 54.0 Training loss: 0.4389 Explore P: 0.1757
Episode: 590 Total reward: 140.0 Training loss: 0.8320 Explore P: 0.1734
Episode: 591 Total reward: 37.0 Training loss: 0.7675 Explore P: 0.1727
Episode: 592 Total reward: 98.0 Training loss: 1.1262 Explore P: 0.1712
Episode: 593 Total reward: 92.0 Training loss: 0.4758 Explore P: 0.1697
Episode: 594 Total reward: 61.0 Training loss: 0.5769 Explore P: 0.1687
Episode: 595 Total reward: 47.0 Training loss: 0.3593 Explore P: 0.1680
Episode: 596 Total reward: 56.0 Training loss: 1.5094 Explore P: 0.1671
Episode: 597 Total reward: 49.0 Training loss: 1.1392 Explore P: 0.1663
Episode: 598 Total reward: 50.0 Training loss: 1.6513 Explore P: 0.1655
Episode: 599 Total reward: 49.0 Training loss: 0.8602 Explore P: 0.1648
Episode: 600 Total reward: 31.0 Training loss: 13.3898 Explore P: 0.1643
Episode: 601 Total reward: 72.0 Training loss: 23.5132 Explore P: 0.1632
Episode: 602 Total reward: 54.0 Training loss: 0.8979 Explore P: 0.1624
Episode: 603 Total reward: 69.0 Training loss: 1.6568 Explore P: 0.1613
Episode: 604 Total reward: 44.0 Training loss: 1.1419 Explore P: 0.1607
Episode: 605 Total reward: 47.0 Training loss: 36.5606 Explore P: 0.1600
Episode: 606 Total reward: 58.0 Training loss: 0.9404 Explore P: 0.1591
Episode: 607 Total reward: 60.0 Training loss: 133.3223 Explore P: 0.1582
Episode: 608 Total reward: 75.0 Training loss: 2.1895 Explore P: 0.1571
Episode: 609 Total reward: 61.0 Training loss: 345.2719 Explore P: 0.1562
Episode: 610 Total reward: 51.0 Training loss: 0.8304 Explore P: 0.1554
Episode: 611 Total reward: 41.0 Training loss: 1.1147 Explore P: 0.1549
Episode: 612 Total reward: 33.0 Training loss: 1.7112 Explore P: 0.1544
Episode: 613 Total reward: 44.0 Training loss: 1.1443 Explore P: 0.1537
Episode: 614 Total reward: 53.0 Training loss: 1.5630 Explore P: 0.1530
Episode: 615 Total reward: 50.0 Training loss: 657.5900 Explore P: 0.1523
Episode: 616 Total reward: 39.0 Training loss: 0.9280 Explore P: 0.1517
Episode: 617 Total reward: 39.0 Training loss: 1.8954 Explore P: 0.1512
Episode: 618 Total reward: 39.0 Training loss: 2.0444 Explore P: 0.1506
Episode: 619 Total reward: 92.0 Training loss: 8.2760 Explore P: 0.1493
Episode: 620 Total reward: 61.0 Training loss: 1.5538 Explore P: 0.1485
Episode: 621 Total reward: 40.0 Training loss: 12.2010 Explore P: 0.1479
Episode: 622 Total reward: 41.0 Training loss: 2.6636 Explore P: 0.1474
Episode: 623 Total reward: 99.0 Training loss: 0.8488 Explore P: 0.1460
Episode: 624 Total reward: 112.0 Training loss: 1.0042 Explore P: 0.1445
Episode: 625 Total reward: 51.0 Training loss: 1.1238 Explore P: 0.1438
Episode: 626 Total reward: 89.0 Training loss: 1.3382 Explore P: 0.1426
Episode: 627 Total reward: 199.0 Training loss: 247.1506 Explore P: 0.1400
Episode: 628 Total reward: 183.0 Training loss: 0.8664 Explore P: 0.1377
Episode: 629 Total reward: 107.0 Training loss: 25.7876 Explore P: 0.1363
Episode: 630 Total reward: 87.0 Training loss: 251.5249 Explore P: 0.1352
Episode: 631 Total reward: 199.0 Training loss: 1.1114 Explore P: 0.1327
Episode: 632 Total reward: 145.0 Training loss: 70.4354 Explore P: 0.1310
Episode: 633 Total reward: 199.0 Training loss: 0.4428 Explore P: 0.1286
Episode: 634 Total reward: 94.0 Training loss: 0.6174 Explore P: 0.1275
Episode: 635 Total reward: 133.0 Training loss: 1.4307 Explore P: 0.1259
Episode: 636 Total reward: 134.0 Training loss: 1.6741 Explore P: 0.1244
Episode: 637 Total reward: 63.0 Training loss: 1.4304 Explore P: 0.1237
Episode: 638 Total reward: 70.0 Training loss: 0.4892 Explore P: 0.1229
Episode: 639 Total reward: 179.0 Training loss: 0.4918 Explore P: 0.1209
Episode: 640 Total reward: 136.0 Training loss: 0.6853 Explore P: 0.1194
Episode: 641 Total reward: 199.0 Training loss: 0.9279 Explore P: 0.1172
Episode: 642 Total reward: 199.0 Training loss: 1.0028 Explore P: 0.1151
Episode: 643 Total reward: 108.0 Training loss: 0.9264 Explore P: 0.1140
Episode: 644 Total reward: 184.0 Training loss: 1.2231 Explore P: 0.1121
Episode: 645 Total reward: 139.0 Training loss: 0.7680 Explore P: 0.1107
Episode: 646 Total reward: 199.0 Training loss: 1.2022 Explore P: 0.1087
Episode: 647 Total reward: 72.0 Training loss: 0.5475 Explore P: 0.1080
Episode: 648 Total reward: 91.0 Training loss: 42.2788 Explore P: 0.1071
Episode: 649 Total reward: 110.0 Training loss: 0.9695 Explore P: 0.1060
Episode: 650 Total reward: 133.0 Training loss: 0.4481 Explore P: 0.1048
Episode: 651 Total reward: 106.0 Training loss: 1.0640 Explore P: 0.1038
Episode: 652 Total reward: 121.0 Training loss: 0.8846 Explore P: 0.1026
Episode: 653 Total reward: 114.0 Training loss: 0.6645 Explore P: 0.1016
Episode: 654 Total reward: 127.0 Training loss: 1.4173 Explore P: 0.1004
Episode: 655 Total reward: 178.0 Training loss: 162.0648 Explore P: 0.0988
Episode: 656 Total reward: 144.0 Training loss: 0.5522 Explore P: 0.0976
Episode: 657 Total reward: 139.0 Training loss: 0.3521 Explore P: 0.0963
Episode: 658 Total reward: 67.0 Training loss: 0.7848 Explore P: 0.0958
Episode: 659 Total reward: 100.0 Training loss: 0.6718 Explore P: 0.0949
Episode: 660 Total reward: 125.0 Training loss: 0.5009 Explore P: 0.0939
Episode: 661 Total reward: 97.0 Training loss: 0.3991 Explore P: 0.0931
Episode: 662 Total reward: 102.0 Training loss: 1.0262 Explore P: 0.0922
Episode: 663 Total reward: 148.0 Training loss: 0.7316 Explore P: 0.0910
Episode: 664 Total reward: 164.0 Training loss: 147.4817 Explore P: 0.0897
Episode: 665 Total reward: 53.0 Training loss: 0.5795 Explore P: 0.0893
Episode: 666 Total reward: 186.0 Training loss: 106.6670 Explore P: 0.0878
Episode: 667 Total reward: 96.0 Training loss: 0.6438 Explore P: 0.0871
Episode: 668 Total reward: 199.0 Training loss: 142.6675 Explore P: 0.0855
Episode: 669 Total reward: 112.0 Training loss: 0.6866 Explore P: 0.0847
Episode: 670 Total reward: 128.0 Training loss: 0.7910 Explore P: 0.0838
Episode: 671 Total reward: 187.0 Training loss: 0.5533 Explore P: 0.0824
Episode: 672 Total reward: 88.0 Training loss: 95.1446 Explore P: 0.0818
Episode: 673 Total reward: 194.0 Training loss: 0.4403 Explore P: 0.0804
Episode: 674 Total reward: 199.0 Training loss: 68.7166 Explore P: 0.0790
Episode: 675 Total reward: 72.0 Training loss: 65.9125 Explore P: 0.0785
Episode: 676 Total reward: 81.0 Training loss: 0.8700 Explore P: 0.0779
Episode: 677 Total reward: 199.0 Training loss: 0.4874 Explore P: 0.0766
Episode: 678 Total reward: 127.0 Training loss: 0.5302 Explore P: 0.0758
Episode: 679 Total reward: 80.0 Training loss: 107.2612 Explore P: 0.0752
Episode: 680 Total reward: 34.0 Training loss: 0.5179 Explore P: 0.0750
Episode: 681 Total reward: 68.0 Training loss: 0.7248 Explore P: 0.0746
Episode: 682 Total reward: 81.0 Training loss: 0.7584 Explore P: 0.0741
Episode: 683 Total reward: 134.0 Training loss: 0.5630 Explore P: 0.0732
Episode: 684 Total reward: 111.0 Training loss: 0.5661 Explore P: 0.0725
Episode: 685 Total reward: 68.0 Training loss: 0.3353 Explore P: 0.0721
Episode: 686 Total reward: 56.0 Training loss: 56.4734 Explore P: 0.0717
Episode: 687 Total reward: 57.0 Training loss: 0.4453 Explore P: 0.0714
Episode: 688 Total reward: 89.0 Training loss: 0.5400 Explore P: 0.0708
Episode: 689 Total reward: 73.0 Training loss: 0.2979 Explore P: 0.0704
Episode: 690 Total reward: 82.0 Training loss: 0.5211 Explore P: 0.0699
Episode: 691 Total reward: 74.0 Training loss: 83.4561 Explore P: 0.0695
Episode: 692 Total reward: 69.0 Training loss: 0.3768 Explore P: 0.0691
Episode: 693 Total reward: 75.0 Training loss: 0.6282 Explore P: 0.0686
Episode: 694 Total reward: 45.0 Training loss: 0.5785 Explore P: 0.0683
Episode: 695 Total reward: 72.0 Training loss: 13.0914 Explore P: 0.0679
Episode: 696 Total reward: 106.0 Training loss: 0.5300 Explore P: 0.0673
Episode: 697 Total reward: 149.0 Training loss: 0.7986 Explore P: 0.0665
Episode: 698 Total reward: 77.0 Training loss: 0.3858 Explore P: 0.0660
Episode: 699 Total reward: 71.0 Training loss: 0.2354 Explore P: 0.0656
Episode: 700 Total reward: 71.0 Training loss: 0.3226 Explore P: 0.0652
Episode: 701 Total reward: 121.0 Training loss: 0.5275 Explore P: 0.0646
Episode: 702 Total reward: 57.0 Training loss: 0.5978 Explore P: 0.0643
Episode: 703 Total reward: 44.0 Training loss: 0.3886 Explore P: 0.0640
Episode: 704 Total reward: 80.0 Training loss: 0.4800 Explore P: 0.0636
Episode: 705 Total reward: 43.0 Training loss: 0.4597 Explore P: 0.0634
Episode: 706 Total reward: 48.0 Training loss: 0.4426 Explore P: 0.0631
Episode: 707 Total reward: 49.0 Training loss: 0.1673 Explore P: 0.0629
Episode: 708 Total reward: 86.0 Training loss: 0.2978 Explore P: 0.0624
Episode: 709 Total reward: 63.0 Training loss: 0.3926 Explore P: 0.0621
Episode: 710 Total reward: 61.0 Training loss: 0.7375 Explore P: 0.0618
Episode: 711 Total reward: 59.0 Training loss: 36.8273 Explore P: 0.0615
Episode: 712 Total reward: 97.0 Training loss: 0.7469 Explore P: 0.0610
Episode: 713 Total reward: 67.0 Training loss: 0.3428 Explore P: 0.0606
Episode: 714 Total reward: 80.0 Training loss: 0.3460 Explore P: 0.0602
Episode: 715 Total reward: 75.0 Training loss: 83.8001 Explore P: 0.0598
Episode: 716 Total reward: 63.0 Training loss: 45.6893 Explore P: 0.0595
Episode: 717 Total reward: 67.0 Training loss: 0.4490 Explore P: 0.0592
Episode: 718 Total reward: 115.0 Training loss: 197.0739 Explore P: 0.0586
Episode: 719 Total reward: 87.0 Training loss: 0.2726 Explore P: 0.0582
Episode: 720 Total reward: 80.0 Training loss: 0.4176 Explore P: 0.0578
Episode: 721 Total reward: 71.0 Training loss: 26.0799 Explore P: 0.0575
Episode: 722 Total reward: 66.0 Training loss: 0.2614 Explore P: 0.0572
Episode: 723 Total reward: 85.0 Training loss: 0.3028 Explore P: 0.0568
Episode: 724 Total reward: 57.0 Training loss: 0.4916 Explore P: 0.0565
Episode: 725 Total reward: 90.0 Training loss: 0.3479 Explore P: 0.0561
Episode: 726 Total reward: 91.0 Training loss: 0.3189 Explore P: 0.0557
Episode: 727 Total reward: 78.0 Training loss: 0.3488 Explore P: 0.0553
Episode: 728 Total reward: 132.0 Training loss: 0.4400 Explore P: 0.0547
Episode: 729 Total reward: 199.0 Training loss: 0.1979 Explore P: 0.0538
Episode: 730 Total reward: 51.0 Training loss: 0.3208 Explore P: 0.0536
Episode: 731 Total reward: 109.0 Training loss: 0.1530 Explore P: 0.0532
Episode: 732 Total reward: 86.0 Training loss: 0.2421 Explore P: 0.0528
Episode: 733 Total reward: 66.0 Training loss: 0.4831 Explore P: 0.0525
Episode: 734 Total reward: 63.0 Training loss: 0.3212 Explore P: 0.0522
Episode: 735 Total reward: 104.0 Training loss: 0.2245 Explore P: 0.0518
Episode: 736 Total reward: 104.0 Training loss: 0.1300 Explore P: 0.0514
Episode: 737 Total reward: 60.0 Training loss: 41.5905 Explore P: 0.0511
Episode: 738 Total reward: 93.0 Training loss: 0.2039 Explore P: 0.0507
Episode: 739 Total reward: 95.0 Training loss: 0.4218 Explore P: 0.0504
Episode: 740 Total reward: 76.0 Training loss: 0.3255 Explore P: 0.0500
Episode: 741 Total reward: 100.0 Training loss: 0.2329 Explore P: 0.0496
Episode: 742 Total reward: 76.0 Training loss: 3.3518 Explore P: 0.0493
Episode: 743 Total reward: 49.0 Training loss: 0.2100 Explore P: 0.0492
Episode: 744 Total reward: 154.0 Training loss: 10.4889 Explore P: 0.0486
Episode: 745 Total reward: 89.0 Training loss: 0.1793 Explore P: 0.0482
Episode: 746 Total reward: 60.0 Training loss: 0.2677 Explore P: 0.0480
Episode: 747 Total reward: 93.0 Training loss: 0.3981 Explore P: 0.0476
Episode: 748 Total reward: 94.0 Training loss: 0.2122 Explore P: 0.0473
Episode: 749 Total reward: 51.0 Training loss: 0.5602 Explore P: 0.0471
Episode: 750 Total reward: 76.0 Training loss: 0.4835 Explore P: 0.0468
Episode: 751 Total reward: 59.0 Training loss: 0.1265 Explore P: 0.0466
Episode: 752 Total reward: 61.0 Training loss: 0.1975 Explore P: 0.0464
Episode: 753 Total reward: 54.0 Training loss: 0.2647 Explore P: 0.0462
Episode: 754 Total reward: 96.0 Training loss: 0.2711 Explore P: 0.0458
Episode: 755 Total reward: 90.0 Training loss: 0.4315 Explore P: 0.0455
Episode: 756 Total reward: 47.0 Training loss: 0.4877 Explore P: 0.0453
Episode: 757 Total reward: 53.0 Training loss: 18.2436 Explore P: 0.0452
Episode: 758 Total reward: 66.0 Training loss: 0.3785 Explore P: 0.0449
Episode: 759 Total reward: 72.0 Training loss: 0.2054 Explore P: 0.0447
Episode: 760 Total reward: 84.0 Training loss: 0.3870 Explore P: 0.0444
Episode: 761 Total reward: 123.0 Training loss: 0.3554 Explore P: 0.0440
Episode: 762 Total reward: 62.0 Training loss: 0.1697 Explore P: 0.0438
Episode: 763 Total reward: 46.0 Training loss: 0.5327 Explore P: 0.0436
Episode: 764 Total reward: 67.0 Training loss: 0.2479 Explore P: 0.0434
Episode: 765 Total reward: 51.0 Training loss: 0.3719 Explore P: 0.0432
Episode: 766 Total reward: 58.0 Training loss: 0.4898 Explore P: 0.0430
Episode: 767 Total reward: 71.0 Training loss: 0.3324 Explore P: 0.0428
Episode: 768 Total reward: 57.0 Training loss: 0.3126 Explore P: 0.0426
Episode: 769 Total reward: 38.0 Training loss: 0.3154 Explore P: 0.0425
Episode: 770 Total reward: 47.0 Training loss: 0.4383 Explore P: 0.0423
Episode: 771 Total reward: 95.0 Training loss: 0.3256 Explore P: 0.0420
Episode: 772 Total reward: 61.0 Training loss: 27.2130 Explore P: 0.0418
Episode: 773 Total reward: 51.0 Training loss: 0.3267 Explore P: 0.0417
Episode: 774 Total reward: 85.0 Training loss: 0.2962 Explore P: 0.0414
Episode: 775 Total reward: 51.0 Training loss: 0.5890 Explore P: 0.0412
Episode: 776 Total reward: 81.0 Training loss: 0.2983 Explore P: 0.0410
Episode: 777 Total reward: 49.0 Training loss: 1.5009 Explore P: 0.0408
Episode: 778 Total reward: 84.0 Training loss: 10.6920 Explore P: 0.0406
Episode: 779 Total reward: 143.0 Training loss: 0.3071 Explore P: 0.0401
Episode: 780 Total reward: 65.0 Training loss: 0.3531 Explore P: 0.0399
Episode: 781 Total reward: 79.0 Training loss: 0.6750 Explore P: 0.0397
Episode: 782 Total reward: 79.0 Training loss: 0.2878 Explore P: 0.0395
Episode: 783 Total reward: 98.0 Training loss: 0.1972 Explore P: 0.0392
Episode: 784 Total reward: 161.0 Training loss: 0.2790 Explore P: 0.0387
Episode: 785 Total reward: 56.0 Training loss: 0.2440 Explore P: 0.0386
Episode: 786 Total reward: 64.0 Training loss: 0.2495 Explore P: 0.0384
Episode: 787 Total reward: 57.0 Training loss: 1.2177 Explore P: 0.0382
Episode: 788 Total reward: 161.0 Training loss: 0.2855 Explore P: 0.0378
Episode: 789 Total reward: 53.0 Training loss: 0.2080 Explore P: 0.0376
Episode: 790 Total reward: 86.0 Training loss: 0.5541 Explore P: 0.0374
Episode: 791 Total reward: 50.0 Training loss: 0.2805 Explore P: 0.0372
Episode: 792 Total reward: 106.0 Training loss: 0.2333 Explore P: 0.0370
Episode: 793 Total reward: 45.0 Training loss: 0.3101 Explore P: 0.0368
Episode: 794 Total reward: 66.0 Training loss: 0.5732 Explore P: 0.0367
Episode: 795 Total reward: 67.0 Training loss: 0.3625 Explore P: 0.0365
Episode: 796 Total reward: 199.0 Training loss: 0.4423 Explore P: 0.0360
Episode: 797 Total reward: 102.0 Training loss: 0.5916 Explore P: 0.0357
Episode: 798 Total reward: 199.0 Training loss: 0.2127 Explore P: 0.0352
Episode: 799 Total reward: 199.0 Training loss: 0.3868 Explore P: 0.0347
Episode: 800 Total reward: 77.0 Training loss: 0.2499 Explore P: 0.0345
Episode: 801 Total reward: 199.0 Training loss: 0.4572 Explore P: 0.0340
Episode: 802 Total reward: 45.0 Training loss: 0.1322 Explore P: 0.0339
Episode: 803 Total reward: 48.0 Training loss: 0.3761 Explore P: 0.0338
Episode: 804 Total reward: 199.0 Training loss: 0.3436 Explore P: 0.0333
Episode: 805 Total reward: 199.0 Training loss: 0.3575 Explore P: 0.0329
Episode: 806 Total reward: 112.0 Training loss: 0.3604 Explore P: 0.0326
Episode: 807 Total reward: 163.0 Training loss: 0.4017 Explore P: 0.0322
Episode: 808 Total reward: 165.0 Training loss: 0.9773 Explore P: 0.0319
Episode: 809 Total reward: 199.0 Training loss: 0.2743 Explore P: 0.0315
Episode: 810 Total reward: 83.0 Training loss: 0.3642 Explore P: 0.0313
Episode: 811 Total reward: 132.0 Training loss: 0.6422 Explore P: 0.0310
Episode: 812 Total reward: 133.0 Training loss: 0.3714 Explore P: 0.0307
Episode: 813 Total reward: 65.0 Training loss: 0.2665 Explore P: 0.0306
Episode: 814 Total reward: 98.0 Training loss: 0.3563 Explore P: 0.0304
Episode: 815 Total reward: 109.0 Training loss: 0.4603 Explore P: 0.0302
Episode: 816 Total reward: 89.0 Training loss: 0.6098 Explore P: 0.0300
Episode: 817 Total reward: 77.0 Training loss: 0.2445 Explore P: 0.0298
Episode: 818 Total reward: 199.0 Training loss: 0.3865 Explore P: 0.0294
Episode: 819 Total reward: 103.0 Training loss: 1.6725 Explore P: 0.0292
Episode: 820 Total reward: 58.0 Training loss: 1.4831 Explore P: 0.0291
Episode: 821 Total reward: 79.0 Training loss: 0.2541 Explore P: 0.0290
Episode: 822 Total reward: 93.0 Training loss: 1.2242 Explore P: 0.0288
Episode: 823 Total reward: 81.0 Training loss: 0.4578 Explore P: 0.0287
Episode: 824 Total reward: 93.0 Training loss: 0.4876 Explore P: 0.0285
Episode: 825 Total reward: 43.0 Training loss: 0.3245 Explore P: 0.0284
Episode: 826 Total reward: 73.0 Training loss: 0.3783 Explore P: 0.0283
Episode: 827 Total reward: 107.0 Training loss: 0.2956 Explore P: 0.0281
Episode: 828 Total reward: 88.0 Training loss: 0.2944 Explore P: 0.0279
Episode: 829 Total reward: 108.0 Training loss: 0.2732 Explore P: 0.0277
Episode: 830 Total reward: 63.0 Training loss: 0.3612 Explore P: 0.0276
Episode: 831 Total reward: 88.0 Training loss: 0.4050 Explore P: 0.0275
Episode: 832 Total reward: 58.0 Training loss: 0.3041 Explore P: 0.0274
Episode: 833 Total reward: 199.0 Training loss: 0.3569 Explore P: 0.0270
Episode: 834 Total reward: 48.0 Training loss: 0.4632 Explore P: 0.0269
Episode: 835 Total reward: 88.0 Training loss: 0.4170 Explore P: 0.0268
Episode: 836 Total reward: 118.0 Training loss: 0.2411 Explore P: 0.0266
Episode: 837 Total reward: 119.0 Training loss: 0.7547 Explore P: 0.0264
Episode: 838 Total reward: 163.0 Training loss: 0.4001 Explore P: 0.0261
Episode: 839 Total reward: 87.0 Training loss: 0.4133 Explore P: 0.0260
Episode: 840 Total reward: 96.0 Training loss: 0.3351 Explore P: 0.0258
Episode: 841 Total reward: 199.0 Training loss: 2.2328 Explore P: 0.0255
Episode: 842 Total reward: 65.0 Training loss: 0.4290 Explore P: 0.0254
Episode: 843 Total reward: 58.0 Training loss: 0.5928 Explore P: 0.0253
Episode: 844 Total reward: 199.0 Training loss: 2.0073 Explore P: 0.0250
Episode: 845 Total reward: 99.0 Training loss: 0.3791 Explore P: 0.0249
Episode: 846 Total reward: 56.0 Training loss: 0.4768 Explore P: 0.0248
Episode: 847 Total reward: 76.0 Training loss: 0.4172 Explore P: 0.0247
Episode: 848 Total reward: 66.0 Training loss: 0.2981 Explore P: 0.0246
Episode: 849 Total reward: 70.0 Training loss: 0.2604 Explore P: 0.0245
Episode: 850 Total reward: 63.0 Training loss: 0.4035 Explore P: 0.0244
Episode: 851 Total reward: 73.0 Training loss: 0.3082 Explore P: 0.0243
Episode: 852 Total reward: 95.0 Training loss: 1.3742 Explore P: 0.0242
Episode: 853 Total reward: 199.0 Training loss: 0.4013 Explore P: 0.0239
Episode: 854 Total reward: 194.0 Training loss: 0.3692 Explore P: 0.0236
Episode: 855 Total reward: 50.0 Training loss: 0.6058 Explore P: 0.0235
Episode: 856 Total reward: 56.0 Training loss: 0.4865 Explore P: 0.0235
Episode: 857 Total reward: 199.0 Training loss: 0.4461 Explore P: 0.0232
Episode: 858 Total reward: 199.0 Training loss: 0.3396 Explore P: 0.0229
Episode: 859 Total reward: 199.0 Training loss: 0.3778 Explore P: 0.0227
Episode: 860 Total reward: 199.0 Training loss: 0.2751 Explore P: 0.0224
Episode: 861 Total reward: 199.0 Training loss: 0.3755 Explore P: 0.0222
Episode: 862 Total reward: 199.0 Training loss: 0.3984 Explore P: 0.0220
Episode: 863 Total reward: 199.0 Training loss: 9.5788 Explore P: 0.0217
Episode: 864 Total reward: 199.0 Training loss: 0.4584 Explore P: 0.0215
Episode: 865 Total reward: 199.0 Training loss: 0.3480 Explore P: 0.0213
Episode: 866 Total reward: 113.0 Training loss: 0.3975 Explore P: 0.0211
Episode: 867 Total reward: 199.0 Training loss: 0.3003 Explore P: 0.0209
Episode: 868 Total reward: 199.0 Training loss: 1.0108 Explore P: 0.0207
Episode: 869 Total reward: 199.0 Training loss: 0.4225 Explore P: 0.0205
Episode: 870 Total reward: 199.0 Training loss: 0.2464 Explore P: 0.0203
Episode: 871 Total reward: 199.0 Training loss: 0.3639 Explore P: 0.0201
Episode: 872 Total reward: 199.0 Training loss: 1.1189 Explore P: 0.0199
Episode: 873 Total reward: 199.0 Training loss: 0.6716 Explore P: 0.0197
Episode: 874 Total reward: 199.0 Training loss: 309.7855 Explore P: 0.0195
Episode: 875 Total reward: 199.0 Training loss: 0.6157 Explore P: 0.0193
Episode: 876 Total reward: 199.0 Training loss: 0.5410 Explore P: 0.0191
Episode: 877 Total reward: 199.0 Training loss: 0.6441 Explore P: 0.0189
Episode: 878 Total reward: 199.0 Training loss: 0.9092 Explore P: 0.0188
Episode: 879 Total reward: 199.0 Training loss: 0.6144 Explore P: 0.0186
Episode: 880 Total reward: 199.0 Training loss: 0.5254 Explore P: 0.0184
Episode: 881 Total reward: 199.0 Training loss: 0.8042 Explore P: 0.0183
Episode: 882 Total reward: 199.0 Training loss: 0.9421 Explore P: 0.0181
Episode: 883 Total reward: 199.0 Training loss: 1.6293 Explore P: 0.0179
Episode: 884 Total reward: 199.0 Training loss: 0.5465 Explore P: 0.0178
Episode: 885 Total reward: 199.0 Training loss: 350.4179 Explore P: 0.0176
Episode: 886 Total reward: 199.0 Training loss: 0.5618 Explore P: 0.0175
Episode: 887 Total reward: 199.0 Training loss: 0.4571 Explore P: 0.0173
Episode: 888 Total reward: 199.0 Training loss: 0.5744 Explore P: 0.0172
Episode: 889 Total reward: 199.0 Training loss: 0.7001 Explore P: 0.0170
Episode: 890 Total reward: 199.0 Training loss: 11.1319 Explore P: 0.0169
Episode: 891 Total reward: 199.0 Training loss: 1.0589 Explore P: 0.0168
Episode: 892 Total reward: 199.0 Training loss: 397.5830 Explore P: 0.0166
Episode: 893 Total reward: 199.0 Training loss: 0.7817 Explore P: 0.0165
Episode: 894 Total reward: 199.0 Training loss: 0.6709 Explore P: 0.0164
Episode: 895 Total reward: 199.0 Training loss: 0.5378 Explore P: 0.0163
Episode: 896 Total reward: 199.0 Training loss: 0.7287 Explore P: 0.0161
Episode: 897 Total reward: 199.0 Training loss: 0.9285 Explore P: 0.0160
Episode: 898 Total reward: 199.0 Training loss: 0.2705 Explore P: 0.0159
Episode: 899 Total reward: 199.0 Training loss: 0.6813 Explore P: 0.0158
Episode: 900 Total reward: 199.0 Training loss: 0.6448 Explore P: 0.0157
Episode: 901 Total reward: 199.0 Training loss: 0.7208 Explore P: 0.0155
Episode: 902 Total reward: 199.0 Training loss: 0.9760 Explore P: 0.0154
Episode: 903 Total reward: 199.0 Training loss: 1.0684 Explore P: 0.0153
Episode: 904 Total reward: 199.0 Training loss: 0.6626 Explore P: 0.0152
Episode: 905 Total reward: 199.0 Training loss: 1.1155 Explore P: 0.0151
Episode: 906 Total reward: 199.0 Training loss: 0.3658 Explore P: 0.0150
Episode: 907 Total reward: 199.0 Training loss: 1.0424 Explore P: 0.0149
Episode: 908 Total reward: 199.0 Training loss: 0.4847 Explore P: 0.0148
Episode: 909 Total reward: 199.0 Training loss: 0.7599 Explore P: 0.0147
Episode: 910 Total reward: 199.0 Training loss: 0.4867 Explore P: 0.0146
Episode: 911 Total reward: 199.0 Training loss: 483.6280 Explore P: 0.0145
Episode: 912 Total reward: 199.0 Training loss: 0.5416 Explore P: 0.0145
Episode: 913 Total reward: 199.0 Training loss: 0.6273 Explore P: 0.0144
Episode: 914 Total reward: 199.0 Training loss: 0.6025 Explore P: 0.0143
Episode: 915 Total reward: 199.0 Training loss: 439.6444 Explore P: 0.0142
Episode: 916 Total reward: 199.0 Training loss: 392.9993 Explore P: 0.0141
Episode: 917 Total reward: 199.0 Training loss: 0.4488 Explore P: 0.0140
Episode: 918 Total reward: 199.0 Training loss: 418.7277 Explore P: 0.0140
Episode: 919 Total reward: 199.0 Training loss: 1.0769 Explore P: 0.0139
Episode: 920 Total reward: 199.0 Training loss: 0.3030 Explore P: 0.0138
Episode: 921 Total reward: 199.0 Training loss: 0.4310 Explore P: 0.0137
Episode: 922 Total reward: 199.0 Training loss: 415.3716 Explore P: 0.0137
Episode: 923 Total reward: 199.0 Training loss: 0.6398 Explore P: 0.0136
Episode: 924 Total reward: 199.0 Training loss: 0.5096 Explore P: 0.0135
Episode: 925 Total reward: 199.0 Training loss: 0.3178 Explore P: 0.0134
Episode: 926 Total reward: 199.0 Training loss: 0.7239 Explore P: 0.0134
Episode: 927 Total reward: 199.0 Training loss: 0.3795 Explore P: 0.0133
Episode: 928 Total reward: 199.0 Training loss: 0.2193 Explore P: 0.0132
Episode: 929 Total reward: 199.0 Training loss: 0.4761 Explore P: 0.0132
Episode: 930 Total reward: 199.0 Training loss: 0.1486 Explore P: 0.0131
Episode: 931 Total reward: 199.0 Training loss: 0.1775 Explore P: 0.0131
Episode: 932 Total reward: 199.0 Training loss: 0.1814 Explore P: 0.0130
Episode: 933 Total reward: 199.0 Training loss: 0.2148 Explore P: 0.0129
Episode: 934 Total reward: 199.0 Training loss: 1.4207 Explore P: 0.0129
Episode: 935 Total reward: 199.0 Training loss: 0.2897 Explore P: 0.0128
Episode: 936 Total reward: 199.0 Training loss: 382.1853 Explore P: 0.0128
Episode: 937 Total reward: 199.0 Training loss: 0.6572 Explore P: 0.0127
Episode: 938 Total reward: 199.0 Training loss: 0.8292 Explore P: 0.0127
Episode: 939 Total reward: 199.0 Training loss: 0.5913 Explore P: 0.0126
Episode: 940 Total reward: 199.0 Training loss: 0.8381 Explore P: 0.0126
Episode: 941 Total reward: 199.0 Training loss: 0.4168 Explore P: 0.0125
Episode: 942 Total reward: 199.0 Training loss: 472.5017 Explore P: 0.0125
Episode: 943 Total reward: 199.0 Training loss: 540.8539 Explore P: 0.0124
Episode: 944 Total reward: 199.0 Training loss: 0.5896 Explore P: 0.0124
Episode: 945 Total reward: 199.0 Training loss: 0.3702 Explore P: 0.0123
Episode: 946 Total reward: 199.0 Training loss: 0.4805 Explore P: 0.0123
Episode: 947 Total reward: 199.0 Training loss: 0.2354 Explore P: 0.0122
Episode: 948 Total reward: 199.0 Training loss: 0.9596 Explore P: 0.0122
Episode: 949 Total reward: 199.0 Training loss: 0.4506 Explore P: 0.0121
Episode: 950 Total reward: 199.0 Training loss: 0.4314 Explore P: 0.0121
Episode: 951 Total reward: 199.0 Training loss: 0.5232 Explore P: 0.0121
Episode: 952 Total reward: 199.0 Training loss: 0.4639 Explore P: 0.0120
Episode: 953 Total reward: 199.0 Training loss: 0.3144 Explore P: 0.0120
Episode: 954 Total reward: 199.0 Training loss: 0.7455 Explore P: 0.0119
Episode: 955 Total reward: 199.0 Training loss: 0.2269 Explore P: 0.0119
Episode: 956 Total reward: 199.0 Training loss: 0.6675 Explore P: 0.0119
Episode: 957 Total reward: 199.0 Training loss: 0.3278 Explore P: 0.0118
Episode: 958 Total reward: 199.0 Training loss: 0.2460 Explore P: 0.0118
Episode: 959 Total reward: 199.0 Training loss: 480.2446 Explore P: 0.0117
Episode: 960 Total reward: 199.0 Training loss: 0.1878 Explore P: 0.0117
Episode: 961 Total reward: 199.0 Training loss: 0.3663 Explore P: 0.0117
Episode: 962 Total reward: 199.0 Training loss: 0.5281 Explore P: 0.0116
Episode: 963 Total reward: 199.0 Training loss: 0.3998 Explore P: 0.0116
Episode: 964 Total reward: 199.0 Training loss: 0.2663 Explore P: 0.0116
Episode: 965 Total reward: 199.0 Training loss: 0.3623 Explore P: 0.0116
Episode: 966 Total reward: 199.0 Training loss: 0.5050 Explore P: 0.0115
Episode: 967 Total reward: 14.0 Training loss: 0.4887 Explore P: 0.0115
Episode: 968 Total reward: 199.0 Training loss: 0.4440 Explore P: 0.0115
Episode: 969 Total reward: 199.0 Training loss: 0.3112 Explore P: 0.0115
Episode: 970 Total reward: 199.0 Training loss: 0.3032 Explore P: 0.0114
Episode: 971 Total reward: 199.0 Training loss: 0.2401 Explore P: 0.0114
Episode: 972 Total reward: 199.0 Training loss: 0.4172 Explore P: 0.0114
Episode: 973 Total reward: 199.0 Training loss: 0.8319 Explore P: 0.0113
Episode: 974 Total reward: 199.0 Training loss: 258.0836 Explore P: 0.0113
Episode: 975 Total reward: 199.0 Training loss: 0.7909 Explore P: 0.0113
Episode: 976 Total reward: 199.0 Training loss: 284.2740 Explore P: 0.0113
Episode: 977 Total reward: 199.0 Training loss: 1.9224 Explore P: 0.0112
Episode: 978 Total reward: 199.0 Training loss: 2.6881 Explore P: 0.0112
Episode: 979 Total reward: 199.0 Training loss: 7.3053 Explore P: 0.0112
Episode: 980 Total reward: 12.0 Training loss: 7.3745 Explore P: 0.0112
Episode: 981 Total reward: 12.0 Training loss: 6.8314 Explore P: 0.0112
Episode: 982 Total reward: 9.0 Training loss: 8.3532 Explore P: 0.0112
Episode: 983 Total reward: 11.0 Training loss: 9.9501 Explore P: 0.0112
Episode: 984 Total reward: 12.0 Training loss: 9.9673 Explore P: 0.0112
Episode: 985 Total reward: 12.0 Training loss: 895.1349 Explore P: 0.0112
Episode: 986 Total reward: 10.0 Training loss: 7.5073 Explore P: 0.0112
Episode: 987 Total reward: 199.0 Training loss: 7.4607 Explore P: 0.0112
Episode: 988 Total reward: 12.0 Training loss: 7.0840 Explore P: 0.0112
Episode: 989 Total reward: 9.0 Training loss: 8.1604 Explore P: 0.0112
Episode: 990 Total reward: 10.0 Training loss: 11.1427 Explore P: 0.0112
Episode: 991 Total reward: 12.0 Training loss: 11.5531 Explore P: 0.0112
Episode: 992 Total reward: 8.0 Training loss: 11.6846 Explore P: 0.0112
Episode: 993 Total reward: 10.0 Training loss: 8.5808 Explore P: 0.0112
Episode: 994 Total reward: 8.0 Training loss: 8.2653 Explore P: 0.0112
Episode: 995 Total reward: 8.0 Training loss: 7.6127 Explore P: 0.0112
Episode: 996 Total reward: 12.0 Training loss: 18.0211 Explore P: 0.0112
Episode: 997 Total reward: 11.0 Training loss: 9.9939 Explore P: 0.0112
Episode: 998 Total reward: 9.0 Training loss: 8.1948 Explore P: 0.0112
Episode: 999 Total reward: 8.0 Training loss: 11.1894 Explore P: 0.0112
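
The Explore P column in the log above is the probability of taking a random action instead of the network's preferred one, and it shrinks as training proceeds. One schedule that produces this kind of curve is an exponential decay toward a small floor. The sketch below illustrates that idea; the function name `explore_probability` and the parameter values (`explore_start`, `explore_stop`, `decay_rate`) are assumptions chosen for illustration, not values read from the log.

```python
import numpy as np

def explore_probability(step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exponentially decay the exploration probability toward explore_stop.

    All default values here are illustrative assumptions.
    """
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)

# Print the exploration probability at a few (hypothetical) training step counts
for step in (0, 10000, 50000, 100000):
    print(step, round(float(explore_probability(step)), 4))
```

With a schedule like this, the agent explores almost entirely at random early on and acts almost entirely greedily late in training, which matches the flat Explore P values at the end of the log.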

Visualizing training

Below I plot the total reward for each episode in grey, along with a 10-episode rolling average in blue.


In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    # Rolling mean over a window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

In [15]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[15]:
<matplotlib.text.Text at 0x7f00e4109550>
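
To see what the rolling average does, it helps to run `running_mean` on a tiny array. The sketch below repeats the function so it is self-contained; the input values are made up for illustration. Note that the result is shorter than the input by N - 1 entries, which is why the plot slices the episode numbers with `eps[-len(smoothed_rews):]`.

```python
import numpy as np

def running_mean(x, N):
    # Rolling mean over a window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up rewards
smoothed = running_mean(x, 3)

print(smoothed)        # [2. 3. 4.]
print(len(smoothed))   # 3, i.e. len(x) - N + 1
```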

Testing

Let's check out how our trained agent plays the game.


In [16]:
test_episodes = 10
test_max_steps = 400
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(test_episodes):
        # Reset, then take one random step to get the pole and cart moving
        env.reset()
        state, reward, done, _ = env.step(env.action_space.sample())
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                # Episode over; the outer loop starts the next one
                break
            
            state = next_state
            t += 1


INFO:tensorflow:Restoring parameters from checkpoints/cartpole.ckpt
[2017-06-24 09:55:08,618] Restoring parameters from checkpoints/cartpole.ckpt
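
During testing there is no exploration: the agent always takes the action with the largest predicted Q-value. Here is a small sketch of that selection step on its own. The numbers in `Qs` are hypothetical and simply stand in for the network output for a batch of one state.

```python
import numpy as np

# Hypothetical Q-values for one state: one row, one column per action (left, right)
Qs = np.array([[0.8, 1.3]])

# np.argmax works on the flattened array, which is fine for a batch of one
action = int(np.argmax(Qs))
print(action)  # 1, the action with the larger Q-value
```

For a batch with more than one state you would pass `axis=1` to `np.argmax` to get one action per row.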

In [17]:
env.close()

Extending this

Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. There, instead of a small state vector like the one we're using here, the network takes screen images as input, so you'd want to use convolutional layers to extract the state from them.
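
As a starting point, here is a minimal sketch of how screen images might be turned into a state before they reach any convolutional layers. It converts each frame to grayscale, downsamples it, and stacks the last few frames so the network can see motion. Everything here is an assumption for illustration: the names `preprocess_frame` and `FrameStack`, the 210x160 RGB frame size, the crude stride-2 downsampling, and the choice of four frames. The frame itself is random data rather than a real game screen.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """Turn an RGB frame (H, W, 3) of uint8 into a smaller grayscale float image."""
    gray = frame.astype(np.float32).mean(axis=2)   # average the color channels
    small = gray[::2, ::2]                         # keep every second pixel in each direction
    return small / 255.0                           # scale to [0, 1]

class FrameStack:
    """Keep the most recent n_frames processed frames as a single state array."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of an episode
        processed = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return self.state()

    def push(self, frame):
        # Add the newest frame; the oldest one is dropped automatically
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Random stand-in for a game screen (assumed size, not taken from a real environment)
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
stack = FrameStack(n_frames=4)
state = stack.reset(fake_frame)
print(state.shape)  # (105, 80, 4)
```

The stacked array would then play the role that the four-number Cart-Pole state plays in this notebook, with convolutional layers in front of the fully connected ones.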

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.