Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-23 21:30:59,429] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing in an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(100):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

In [4]:
env.close()

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [5]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

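$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount applied to future rewards, $s'$ is the next state reached from state $s$ by taking action $a$, and $a'$ ranges over the actions available in $s'$.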

Previously, we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating-point precision, you practically have infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q-value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
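
As a worked example with made-up numbers (these are illustrative assumptions, not values from this notebook's training run): suppose $r = 1$, $\gamma = 0.99$, the network outputs $Q(s', \cdot) = (2.0, 3.0)$ for the next state, and its current estimate for the action taken is $Q(s, a) = 3.5$. Then the target and the squared error would be

$$ \hat{Q}(s,a) = 1 + 0.99 \times \max(2.0, 3.0) = 3.97, \qquad (\hat{Q}(s,a) - Q(s,a))^2 = (3.97 - 3.5)^2 = 0.2209 $$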

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seem to be good enough, though three might be better. Feel free to try it out.


In [6]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
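
To make the one-hot selection step in the network above more concrete, here is a minimal pure-Python sketch of the same idea. The helper names one_hot and select_q and the example numbers are hypothetical and only for illustration; the network itself does this with tf.one_hot, tf.multiply and tf.reduce_sum.

```python
def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def select_q(outputs, actions):
    """For each row of Q-values, keep only the entry for the action taken."""
    selected = []
    for q_row, action in zip(outputs, actions):
        mask = one_hot(action, len(q_row))
        # Multiply element-wise, then sum: only the chosen action's Q-value survives
        selected.append(sum(q * m for q, m in zip(q_row, mask)))
    return selected


# Hypothetical network outputs for a batch of three states (two actions each)
outputs = [[0.5, 1.5], [2.0, -1.0], [0.0, 3.0]]
actions = [1, 0, 1]
print(select_q(outputs, actions))
```

Each row contributes exactly one Q-value, which is why self.Q has shape [None] and can be compared directly against the targets.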

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [7]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
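
As a small illustration of the eviction behaviour described above, the sketch below fills a deque past its capacity and then draws a random batch without replacement. It uses the standard library's random.sample instead of np.random.choice, and the string "experiences" are placeholders rather than real transitions.

```python
from collections import deque
import random

# A tiny buffer that can hold at most three experiences
buffer = deque(maxlen=3)
for experience in ["e1", "e2", "e3", "e4"]:
    buffer.append(experience)

# The oldest entry, "e1", was pushed out when "e4" was appended
print(list(buffer))

# Draw two distinct experiences at random (sampling without replacement)
batch = random.sample(list(buffer), 2)
print(batch)
```

Sampling without replacement fails if the batch is larger than the number of stored experiences, which is one reason the memory is pre-populated before training starts.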

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
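
One way to sketch this is an $\epsilon$ that decays exponentially with the number of training steps, combined with an $\epsilon$-greedy choice. The function names explore_probability and epsilon_greedy are hypothetical; the default values mirror the exploration hyperparameters used later in this notebook, and the Q-values passed in are made up.

```python
import math
import random


def explore_probability(step, start=1.0, stop=0.01, decay_rate=0.0001):
    """Decay epsilon exponentially from `start` toward `stop` as steps accumulate."""
    return stop + (start - stop) * math.exp(-decay_rate * step)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon starts near 1 (mostly exploring) and approaches the floor of 0.01
for step in (0, 10000, 100000):
    print(step, round(explore_probability(step), 4))

# With epsilon = 0 the agent always exploits its current estimates
print(epsilon_greedy([0.2, 0.8], epsilon=0.0))
```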

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
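
The target-setting step in the loop above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name q_targets and the explicit dones flags are hypothetical (the training code in this notebook marks terminal transitions with an all-zero next state instead), and the numbers are made up.

```python
def q_targets(rewards, next_qs, dones, gamma=0.99):
    """Return r_j for terminal transitions, else r_j + gamma * max_a' Q(s'_j, a')."""
    targets = []
    for reward, q_next, done in zip(rewards, next_qs, dones):
        if done:
            # The episode ended, so there is no future reward to bootstrap from
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(q_next))
    return targets


rewards = [1.0, 1.0]
next_qs = [[2.0, 3.0], [5.0, 4.0]]  # hypothetical Q(s', .) for each transition
dones = [False, True]
targets = q_targets(rewards, next_qs, dones)
print(targets)
```

The first target bootstraps from the best next-state Q-value (about 3.97), while the second is just the reward because that transition ended the episode.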

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [8]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pre-populate the memory with

In [9]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [10]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [11]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 9.0 Training loss: 1.1734 Explore P: 0.9991
Episode: 2 Total reward: 51.0 Training loss: 1.0592 Explore P: 0.9941
Episode: 3 Total reward: 14.0 Training loss: 1.0578 Explore P: 0.9927
Episode: 4 Total reward: 20.0 Training loss: 1.0563 Explore P: 0.9907
Episode: 5 Total reward: 26.0 Training loss: 1.0032 Explore P: 0.9882
Episode: 6 Total reward: 18.0 Training loss: 1.0927 Explore P: 0.9864
Episode: 7 Total reward: 12.0 Training loss: 1.0351 Explore P: 0.9853
Episode: 8 Total reward: 50.0 Training loss: 1.1726 Explore P: 0.9804
Episode: 9 Total reward: 11.0 Training loss: 1.2564 Explore P: 0.9793
Episode: 10 Total reward: 17.0 Training loss: 1.0972 Explore P: 0.9777
Episode: 11 Total reward: 28.0 Training loss: 1.0107 Explore P: 0.9750
Episode: 12 Total reward: 15.0 Training loss: 1.1308 Explore P: 0.9735
Episode: 13 Total reward: 15.0 Training loss: 1.1596 Explore P: 0.9721
Episode: 14 Total reward: 38.0 Training loss: 1.3867 Explore P: 0.9684
Episode: 15 Total reward: 10.0 Training loss: 1.2286 Explore P: 0.9675
Episode: 16 Total reward: 14.0 Training loss: 1.6422 Explore P: 0.9661
Episode: 17 Total reward: 35.0 Training loss: 1.5726 Explore P: 0.9628
Episode: 18 Total reward: 31.0 Training loss: 1.6241 Explore P: 0.9599
Episode: 19 Total reward: 10.0 Training loss: 1.7331 Explore P: 0.9589
Episode: 20 Total reward: 17.0 Training loss: 1.8468 Explore P: 0.9573
Episode: 21 Total reward: 34.0 Training loss: 1.9040 Explore P: 0.9541
Episode: 22 Total reward: 19.0 Training loss: 1.9981 Explore P: 0.9523
Episode: 23 Total reward: 17.0 Training loss: 1.7703 Explore P: 0.9507
Episode: 24 Total reward: 9.0 Training loss: 2.7957 Explore P: 0.9498
Episode: 25 Total reward: 10.0 Training loss: 1.8864 Explore P: 0.9489
Episode: 26 Total reward: 11.0 Training loss: 1.8372 Explore P: 0.9479
Episode: 27 Total reward: 14.0 Training loss: 2.9040 Explore P: 0.9466
Episode: 28 Total reward: 13.0 Training loss: 2.3455 Explore P: 0.9453
Episode: 29 Total reward: 12.0 Training loss: 2.4562 Explore P: 0.9442
Episode: 30 Total reward: 14.0 Training loss: 4.2062 Explore P: 0.9429
Episode: 31 Total reward: 17.0 Training loss: 4.9581 Explore P: 0.9413
Episode: 32 Total reward: 12.0 Training loss: 2.6849 Explore P: 0.9402
Episode: 33 Total reward: 28.0 Training loss: 4.3416 Explore P: 0.9376
Episode: 34 Total reward: 13.0 Training loss: 8.0613 Explore P: 0.9364
Episode: 35 Total reward: 21.0 Training loss: 6.2001 Explore P: 0.9345
Episode: 36 Total reward: 14.0 Training loss: 4.9522 Explore P: 0.9332
Episode: 37 Total reward: 22.0 Training loss: 5.3654 Explore P: 0.9311
Episode: 38 Total reward: 25.0 Training loss: 5.1981 Explore P: 0.9288
Episode: 39 Total reward: 11.0 Training loss: 10.5382 Explore P: 0.9278
Episode: 40 Total reward: 18.0 Training loss: 3.8849 Explore P: 0.9262
Episode: 41 Total reward: 11.0 Training loss: 5.2805 Explore P: 0.9252
Episode: 42 Total reward: 15.0 Training loss: 11.4519 Explore P: 0.9238
Episode: 43 Total reward: 41.0 Training loss: 7.3143 Explore P: 0.9201
Episode: 44 Total reward: 19.0 Training loss: 7.2134 Explore P: 0.9183
Episode: 45 Total reward: 17.0 Training loss: 11.2199 Explore P: 0.9168
Episode: 46 Total reward: 12.0 Training loss: 18.1868 Explore P: 0.9157
Episode: 47 Total reward: 19.0 Training loss: 4.5484 Explore P: 0.9140
Episode: 48 Total reward: 11.0 Training loss: 4.1239 Explore P: 0.9130
Episode: 49 Total reward: 22.0 Training loss: 5.9616 Explore P: 0.9110
Episode: 50 Total reward: 15.0 Training loss: 10.6029 Explore P: 0.9096
Episode: 51 Total reward: 9.0 Training loss: 28.1874 Explore P: 0.9088
Episode: 52 Total reward: 10.0 Training loss: 19.2188 Explore P: 0.9079
Episode: 53 Total reward: 26.0 Training loss: 7.5080 Explore P: 0.9056
Episode: 54 Total reward: 31.0 Training loss: 4.6090 Explore P: 0.9028
Episode: 55 Total reward: 47.0 Training loss: 33.1176 Explore P: 0.8987
Episode: 56 Total reward: 13.0 Training loss: 12.5365 Explore P: 0.8975
Episode: 57 Total reward: 22.0 Training loss: 24.2255 Explore P: 0.8955
Episode: 58 Total reward: 20.0 Training loss: 20.1164 Explore P: 0.8938
Episode: 59 Total reward: 10.0 Training loss: 13.3163 Explore P: 0.8929
Episode: 60 Total reward: 10.0 Training loss: 5.3879 Explore P: 0.8920
Episode: 61 Total reward: 20.0 Training loss: 24.0164 Explore P: 0.8902
Episode: 62 Total reward: 13.0 Training loss: 66.3743 Explore P: 0.8891
Episode: 63 Total reward: 40.0 Training loss: 17.2584 Explore P: 0.8856
Episode: 64 Total reward: 12.0 Training loss: 26.3604 Explore P: 0.8845
Episode: 65 Total reward: 18.0 Training loss: 6.4475 Explore P: 0.8830
Episode: 66 Total reward: 18.0 Training loss: 65.8170 Explore P: 0.8814
Episode: 67 Total reward: 16.0 Training loss: 18.2347 Explore P: 0.8800
Episode: 68 Total reward: 17.0 Training loss: 29.4055 Explore P: 0.8785
Episode: 69 Total reward: 19.0 Training loss: 23.1960 Explore P: 0.8769
Episode: 70 Total reward: 18.0 Training loss: 5.5965 Explore P: 0.8753
Episode: 71 Total reward: 13.0 Training loss: 12.9937 Explore P: 0.8742
Episode: 72 Total reward: 26.0 Training loss: 55.6505 Explore P: 0.8720
Episode: 73 Total reward: 20.0 Training loss: 6.8191 Explore P: 0.8702
Episode: 74 Total reward: 12.0 Training loss: 19.8798 Explore P: 0.8692
Episode: 75 Total reward: 11.0 Training loss: 5.6707 Explore P: 0.8683
Episode: 76 Total reward: 16.0 Training loss: 5.1949 Explore P: 0.8669
Episode: 77 Total reward: 10.0 Training loss: 20.3363 Explore P: 0.8660
Episode: 78 Total reward: 9.0 Training loss: 6.6215 Explore P: 0.8653
Episode: 79 Total reward: 15.0 Training loss: 25.2439 Explore P: 0.8640
Episode: 80 Total reward: 27.0 Training loss: 60.0898 Explore P: 0.8617
Episode: 81 Total reward: 23.0 Training loss: 28.0386 Explore P: 0.8597
Episode: 82 Total reward: 9.0 Training loss: 18.7644 Explore P: 0.8590
Episode: 83 Total reward: 14.0 Training loss: 5.7770 Explore P: 0.8578
Episode: 84 Total reward: 14.0 Training loss: 109.9339 Explore P: 0.8566
Episode: 85 Total reward: 22.0 Training loss: 29.0135 Explore P: 0.8547
Episode: 86 Total reward: 78.0 Training loss: 30.8655 Explore P: 0.8482
Episode: 87 Total reward: 15.0 Training loss: 44.9569 Explore P: 0.8469
Episode: 88 Total reward: 11.0 Training loss: 66.6375 Explore P: 0.8460
Episode: 89 Total reward: 9.0 Training loss: 5.0335 Explore P: 0.8452
Episode: 90 Total reward: 10.0 Training loss: 27.0569 Explore P: 0.8444
Episode: 91 Total reward: 20.0 Training loss: 46.8733 Explore P: 0.8427
Episode: 92 Total reward: 12.0 Training loss: 21.6696 Explore P: 0.8417
Episode: 93 Total reward: 16.0 Training loss: 42.4684 Explore P: 0.8404
Episode: 94 Total reward: 31.0 Training loss: 7.2382 Explore P: 0.8378
Episode: 95 Total reward: 20.0 Training loss: 27.5996 Explore P: 0.8362
Episode: 96 Total reward: 17.0 Training loss: 150.4063 Explore P: 0.8348
Episode: 97 Total reward: 11.0 Training loss: 33.7567 Explore P: 0.8339
Episode: 98 Total reward: 15.0 Training loss: 5.4249 Explore P: 0.8326
Episode: 99 Total reward: 16.0 Training loss: 168.1815 Explore P: 0.8313
Episode: 100 Total reward: 18.0 Training loss: 4.7430 Explore P: 0.8298
Episode: 101 Total reward: 16.0 Training loss: 5.2291 Explore P: 0.8285
Episode: 102 Total reward: 27.0 Training loss: 47.3798 Explore P: 0.8263
Episode: 103 Total reward: 20.0 Training loss: 30.0296 Explore P: 0.8247
Episode: 104 Total reward: 13.0 Training loss: 4.0504 Explore P: 0.8236
Episode: 105 Total reward: 21.0 Training loss: 139.2907 Explore P: 0.8219
Episode: 106 Total reward: 16.0 Training loss: 4.9091 Explore P: 0.8206
Episode: 107 Total reward: 20.0 Training loss: 58.2814 Explore P: 0.8190
Episode: 108 Total reward: 9.0 Training loss: 26.6589 Explore P: 0.8183
Episode: 109 Total reward: 12.0 Training loss: 5.6232 Explore P: 0.8173
Episode: 110 Total reward: 12.0 Training loss: 60.3421 Explore P: 0.8163
Episode: 111 Total reward: 27.0 Training loss: 93.7178 Explore P: 0.8142
Episode: 112 Total reward: 11.0 Training loss: 3.5770 Explore P: 0.8133
Episode: 113 Total reward: 18.0 Training loss: 4.5678 Explore P: 0.8118
Episode: 114 Total reward: 11.0 Training loss: 4.3502 Explore P: 0.8110
Episode: 115 Total reward: 14.0 Training loss: 109.3309 Explore P: 0.8098
Episode: 116 Total reward: 16.0 Training loss: 42.1244 Explore P: 0.8086
Episode: 117 Total reward: 11.0 Training loss: 27.8894 Explore P: 0.8077
Episode: 118 Total reward: 16.0 Training loss: 3.9920 Explore P: 0.8064
Episode: 119 Total reward: 10.0 Training loss: 70.6800 Explore P: 0.8056
Episode: 120 Total reward: 18.0 Training loss: 4.3304 Explore P: 0.8042
Episode: 121 Total reward: 17.0 Training loss: 4.4563 Explore P: 0.8028
Episode: 122 Total reward: 11.0 Training loss: 3.3715 Explore P: 0.8020
Episode: 123 Total reward: 21.0 Training loss: 126.5611 Explore P: 0.8003
Episode: 124 Total reward: 31.0 Training loss: 32.9193 Explore P: 0.7978
Episode: 125 Total reward: 10.0 Training loss: 3.7869 Explore P: 0.7971
Episode: 126 Total reward: 10.0 Training loss: 36.1923 Explore P: 0.7963
Episode: 127 Total reward: 9.0 Training loss: 44.5709 Explore P: 0.7956
Episode: 128 Total reward: 11.0 Training loss: 3.9107 Explore P: 0.7947
Episode: 129 Total reward: 20.0 Training loss: 34.2582 Explore P: 0.7931
Episode: 130 Total reward: 14.0 Training loss: 28.0991 Explore P: 0.7920
Episode: 131 Total reward: 12.0 Training loss: 90.2312 Explore P: 0.7911
Episode: 132 Total reward: 8.0 Training loss: 31.6345 Explore P: 0.7905
Episode: 133 Total reward: 9.0 Training loss: 3.9458 Explore P: 0.7898
Episode: 134 Total reward: 10.0 Training loss: 58.4884 Explore P: 0.7890
Episode: 135 Total reward: 14.0 Training loss: 4.3950 Explore P: 0.7879
Episode: 136 Total reward: 13.0 Training loss: 3.9879 Explore P: 0.7869
Episode: 137 Total reward: 13.0 Training loss: 96.5355 Explore P: 0.7859
Episode: 138 Total reward: 10.0 Training loss: 31.2554 Explore P: 0.7851
Episode: 139 Total reward: 12.0 Training loss: 3.9601 Explore P: 0.7842
Episode: 140 Total reward: 12.0 Training loss: 4.0886 Explore P: 0.7833
Episode: 141 Total reward: 9.0 Training loss: 4.0002 Explore P: 0.7826
Episode: 142 Total reward: 17.0 Training loss: 38.7215 Explore P: 0.7812
Episode: 143 Total reward: 9.0 Training loss: 3.9804 Explore P: 0.7806
Episode: 144 Total reward: 34.0 Training loss: 47.6627 Explore P: 0.7779
Episode: 145 Total reward: 14.0 Training loss: 37.4654 Explore P: 0.7769
Episode: 146 Total reward: 11.0 Training loss: 3.9206 Explore P: 0.7760
Episode: 147 Total reward: 25.0 Training loss: 33.7631 Explore P: 0.7741
Episode: 148 Total reward: 9.0 Training loss: 45.0559 Explore P: 0.7734
Episode: 149 Total reward: 12.0 Training loss: 57.7914 Explore P: 0.7725
Episode: 150 Total reward: 9.0 Training loss: 4.6467 Explore P: 0.7718
Episode: 151 Total reward: 28.0 Training loss: 85.7479 Explore P: 0.7697
Episode: 152 Total reward: 20.0 Training loss: 49.8638 Explore P: 0.7682
Episode: 153 Total reward: 12.0 Training loss: 3.9231 Explore P: 0.7673
Episode: 154 Total reward: 12.0 Training loss: 104.2413 Explore P: 0.7664
Episode: 155 Total reward: 10.0 Training loss: 44.7920 Explore P: 0.7656
Episode: 156 Total reward: 10.0 Training loss: 58.5903 Explore P: 0.7648
Episode: 157 Total reward: 22.0 Training loss: 48.5548 Explore P: 0.7632
Episode: 158 Total reward: 13.0 Training loss: 26.5541 Explore P: 0.7622
Episode: 159 Total reward: 9.0 Training loss: 40.9189 Explore P: 0.7615
Episode: 160 Total reward: 13.0 Training loss: 4.4045 Explore P: 0.7605
Episode: 161 Total reward: 15.0 Training loss: 29.7979 Explore P: 0.7594
Episode: 162 Total reward: 10.0 Training loss: 3.5614 Explore P: 0.7587
Episode: 163 Total reward: 13.0 Training loss: 42.5754 Explore P: 0.7577
Episode: 164 Total reward: 14.0 Training loss: 70.5025 Explore P: 0.7567
Episode: 165 Total reward: 14.0 Training loss: 74.2411 Explore P: 0.7556
Episode: 166 Total reward: 16.0 Training loss: 130.4018 Explore P: 0.7544
Episode: 167 Total reward: 21.0 Training loss: 32.1834 Explore P: 0.7529
Episode: 168 Total reward: 15.0 Training loss: 72.9596 Explore P: 0.7517
Episode: 169 Total reward: 21.0 Training loss: 50.6072 Explore P: 0.7502
Episode: 170 Total reward: 15.0 Training loss: 125.8748 Explore P: 0.7491
Episode: 171 Total reward: 15.0 Training loss: 67.7437 Explore P: 0.7480
Episode: 172 Total reward: 14.0 Training loss: 66.2948 Explore P: 0.7469
Episode: 173 Total reward: 15.0 Training loss: 32.5662 Explore P: 0.7458
Episode: 174 Total reward: 14.0 Training loss: 1.9909 Explore P: 0.7448
Episode: 175 Total reward: 13.0 Training loss: 1.7370 Explore P: 0.7439
Episode: 176 Total reward: 15.0 Training loss: 127.6469 Explore P: 0.7428
Episode: 177 Total reward: 13.0 Training loss: 3.3363 Explore P: 0.7418
Episode: 178 Total reward: 9.0 Training loss: 32.8998 Explore P: 0.7411
Episode: 179 Total reward: 13.0 Training loss: 37.7603 Explore P: 0.7402
Episode: 180 Total reward: 12.0 Training loss: 24.3922 Explore P: 0.7393
Episode: 181 Total reward: 13.0 Training loss: 60.3408 Explore P: 0.7384
Episode: 182 Total reward: 18.0 Training loss: 94.6805 Explore P: 0.7371
Episode: 183 Total reward: 14.0 Training loss: 2.1419 Explore P: 0.7360
Episode: 184 Total reward: 13.0 Training loss: 60.4708 Explore P: 0.7351
Episode: 185 Total reward: 10.0 Training loss: 3.3564 Explore P: 0.7344
Episode: 186 Total reward: 10.0 Training loss: 65.0589 Explore P: 0.7336
Episode: 187 Total reward: 11.0 Training loss: 2.7089 Explore P: 0.7329
Episode: 188 Total reward: 18.0 Training loss: 2.9132 Explore P: 0.7316
Episode: 189 Total reward: 16.0 Training loss: 42.6770 Explore P: 0.7304
Episode: 190 Total reward: 13.0 Training loss: 97.2628 Explore P: 0.7295
Episode: 191 Total reward: 19.0 Training loss: 44.3649 Explore P: 0.7281
Episode: 192 Total reward: 13.0 Training loss: 92.7372 Explore P: 0.7272
Episode: 193 Total reward: 11.0 Training loss: 28.7553 Explore P: 0.7264
Episode: 194 Total reward: 14.0 Training loss: 64.7853 Explore P: 0.7254
Episode: 195 Total reward: 13.0 Training loss: 130.6135 Explore P: 0.7244
Episode: 196 Total reward: 12.0 Training loss: 66.0312 Explore P: 0.7236
Episode: 197 Total reward: 12.0 Training loss: 2.4613 Explore P: 0.7227
Episode: 198 Total reward: 16.0 Training loss: 30.4602 Explore P: 0.7216
Episode: 199 Total reward: 10.0 Training loss: 2.2940 Explore P: 0.7209
Episode: 200 Total reward: 8.0 Training loss: 52.6595 Explore P: 0.7203
Episode: 201 Total reward: 16.0 Training loss: 23.8881 Explore P: 0.7192
Episode: 202 Total reward: 10.0 Training loss: 32.9319 Explore P: 0.7185
Episode: 203 Total reward: 18.0 Training loss: 25.2421 Explore P: 0.7172
Episode: 204 Total reward: 11.0 Training loss: 62.3220 Explore P: 0.7164
Episode: 205 Total reward: 14.0 Training loss: 34.7880 Explore P: 0.7154
Episode: 206 Total reward: 12.0 Training loss: 29.2222 Explore P: 0.7146
Episode: 207 Total reward: 16.0 Training loss: 37.8108 Explore P: 0.7135
Episode: 208 Total reward: 11.0 Training loss: 52.4928 Explore P: 0.7127
Episode: 209 Total reward: 24.0 Training loss: 24.5290 Explore P: 0.7110
Episode: 210 Total reward: 8.0 Training loss: 2.2314 Explore P: 0.7104
Episode: 211 Total reward: 17.0 Training loss: 66.0451 Explore P: 0.7092
Episode: 212 Total reward: 8.0 Training loss: 30.9840 Explore P: 0.7087
Episode: 213 Total reward: 15.0 Training loss: 116.9637 Explore P: 0.7076
Episode: 214 Total reward: 12.0 Training loss: 28.3078 Explore P: 0.7068
Episode: 215 Total reward: 10.0 Training loss: 123.3424 Explore P: 0.7061
Episode: 216 Total reward: 9.0 Training loss: 54.2178 Explore P: 0.7055
Episode: 217 Total reward: 28.0 Training loss: 2.3253 Explore P: 0.7035
Episode: 218 Total reward: 9.0 Training loss: 21.4098 Explore P: 0.7029
Episode: 219 Total reward: 8.0 Training loss: 121.1392 Explore P: 0.7024
Episode: 220 Total reward: 20.0 Training loss: 89.5349 Explore P: 0.7010
Episode: 221 Total reward: 7.0 Training loss: 83.9316 Explore P: 0.7005
Episode: 222 Total reward: 17.0 Training loss: 31.7372 Explore P: 0.6993
Episode: 223 Total reward: 40.0 Training loss: 19.8705 Explore P: 0.6966
Episode: 224 Total reward: 12.0 Training loss: 26.5281 Explore P: 0.6957
Episode: 225 Total reward: 15.0 Training loss: 20.0747 Explore P: 0.6947
Episode: 226 Total reward: 14.0 Training loss: 19.2381 Explore P: 0.6938
Episode: 227 Total reward: 17.0 Training loss: 43.4665 Explore P: 0.6926
Episode: 228 Total reward: 13.0 Training loss: 52.0621 Explore P: 0.6917
Episode: 229 Total reward: 9.0 Training loss: 1.6795 Explore P: 0.6911
Episode: 230 Total reward: 11.0 Training loss: 68.6490 Explore P: 0.6903
Episode: 231 Total reward: 13.0 Training loss: 50.8442 Explore P: 0.6895
Episode: 232 Total reward: 9.0 Training loss: 2.0973 Explore P: 0.6889
Episode: 233 Total reward: 11.0 Training loss: 20.3822 Explore P: 0.6881
Episode: 234 Total reward: 9.0 Training loss: 2.3246 Explore P: 0.6875
Episode: 235 Total reward: 10.0 Training loss: 29.5682 Explore P: 0.6868
Episode: 236 Total reward: 14.0 Training loss: 27.2669 Explore P: 0.6859
Episode: 237 Total reward: 19.0 Training loss: 1.7330 Explore P: 0.6846
Episode: 238 Total reward: 10.0 Training loss: 1.2347 Explore P: 0.6839
Episode: 239 Total reward: 11.0 Training loss: 47.6162 Explore P: 0.6832
Episode: 240 Total reward: 8.0 Training loss: 20.2703 Explore P: 0.6826
Episode: 241 Total reward: 12.0 Training loss: 23.6341 Explore P: 0.6818
Episode: 242 Total reward: 15.0 Training loss: 79.6630 Explore P: 0.6808
Episode: 243 Total reward: 24.0 Training loss: 1.6366 Explore P: 0.6792
Episode: 244 Total reward: 8.0 Training loss: 26.5523 Explore P: 0.6787
Episode: 245 Total reward: 16.0 Training loss: 28.0024 Explore P: 0.6776
Episode: 246 Total reward: 13.0 Training loss: 19.7826 Explore P: 0.6767
Episode: 247 Total reward: 12.0 Training loss: 1.3019 Explore P: 0.6759
Episode: 248 Total reward: 12.0 Training loss: 52.8874 Explore P: 0.6751
Episode: 249 Total reward: 9.0 Training loss: 45.4601 Explore P: 0.6745
Episode: 250 Total reward: 29.0 Training loss: 1.8474 Explore P: 0.6726
Episode: 251 Total reward: 15.0 Training loss: 50.8396 Explore P: 0.6716
Episode: 252 Total reward: 38.0 Training loss: 1.6858 Explore P: 0.6691
Episode: 253 Total reward: 13.0 Training loss: 26.5899 Explore P: 0.6683
Episode: 254 Total reward: 11.0 Training loss: 1.5854 Explore P: 0.6675
Episode: 255 Total reward: 11.0 Training loss: 22.2780 Explore P: 0.6668
Episode: 256 Total reward: 16.0 Training loss: 23.0382 Explore P: 0.6658
Episode: 257 Total reward: 8.0 Training loss: 17.1855 Explore P: 0.6652
Episode: 258 Total reward: 12.0 Training loss: 20.3462 Explore P: 0.6645
Episode: 259 Total reward: 10.0 Training loss: 82.7481 Explore P: 0.6638
Episode: 260 Total reward: 16.0 Training loss: 1.4613 Explore P: 0.6628
Episode: 261 Total reward: 11.0 Training loss: 38.6184 Explore P: 0.6620
Episode: 262 Total reward: 12.0 Training loss: 1.8104 Explore P: 0.6613
Episode: 263 Total reward: 15.0 Training loss: 14.2384 Explore P: 0.6603
Episode: 264 Total reward: 19.0 Training loss: 67.2271 Explore P: 0.6590
Episode: 265 Total reward: 11.0 Training loss: 37.3571 Explore P: 0.6583
Episode: 266 Total reward: 10.0 Training loss: 34.9056 Explore P: 0.6577
Episode: 267 Total reward: 11.0 Training loss: 14.6140 Explore P: 0.6570
Episode: 268 Total reward: 9.0 Training loss: 17.5218 Explore P: 0.6564
Episode: 269 Total reward: 12.0 Training loss: 45.6760 Explore P: 0.6556
Episode: 270 Total reward: 16.0 Training loss: 0.6390 Explore P: 0.6546
Episode: 271 Total reward: 10.0 Training loss: 13.4031 Explore P: 0.6539
Episode: 272 Total reward: 13.0 Training loss: 20.7238 Explore P: 0.6531
Episode: 273 Total reward: 11.0 Training loss: 1.2625 Explore P: 0.6524
Episode: 274 Total reward: 9.0 Training loss: 1.1871 Explore P: 0.6518
Episode: 275 Total reward: 23.0 Training loss: 40.7175 Explore P: 0.6503
Episode: 276 Total reward: 11.0 Training loss: 19.0874 Explore P: 0.6496
Episode: 277 Total reward: 10.0 Training loss: 1.1589 Explore P: 0.6490
Episode: 278 Total reward: 13.0 Training loss: 35.1680 Explore P: 0.6482
Episode: 279 Total reward: 16.0 Training loss: 25.1705 Explore P: 0.6471
Episode: 280 Total reward: 9.0 Training loss: 83.7663 Explore P: 0.6466
Episode: 281 Total reward: 12.0 Training loss: 30.3167 Explore P: 0.6458
Episode: 282 Total reward: 12.0 Training loss: 0.6835 Explore P: 0.6451
Episode: 283 Total reward: 15.0 Training loss: 29.2696 Explore P: 0.6441
Episode: 284 Total reward: 11.0 Training loss: 12.8804 Explore P: 0.6434
Episode: 285 Total reward: 12.0 Training loss: 12.6643 Explore P: 0.6426
Episode: 286 Total reward: 11.0 Training loss: 12.2087 Explore P: 0.6419
Episode: 287 Total reward: 10.0 Training loss: 47.5391 Explore P: 0.6413
Episode: 288 Total reward: 9.0 Training loss: 33.3150 Explore P: 0.6407
Episode: 289 Total reward: 11.0 Training loss: 16.2476 Explore P: 0.6401
Episode: 290 Total reward: 11.0 Training loss: 18.4898 Explore P: 0.6394
Episode: 291 Total reward: 13.0 Training loss: 20.1008 Explore P: 0.6385
Episode: 292 Total reward: 8.0 Training loss: 34.6700 Explore P: 0.6380
Episode: 293 Total reward: 23.0 Training loss: 27.3784 Explore P: 0.6366
Episode: 294 Total reward: 14.0 Training loss: 10.5378 Explore P: 0.6357
Episode: 295 Total reward: 9.0 Training loss: 1.1677 Explore P: 0.6352
Episode: 296 Total reward: 9.0 Training loss: 23.6470 Explore P: 0.6346
Episode: 297 Total reward: 9.0 Training loss: 11.4498 Explore P: 0.6340
Episode: 298 Total reward: 11.0 Training loss: 1.0700 Explore P: 0.6333
Episode: 299 Total reward: 15.0 Training loss: 1.0802 Explore P: 0.6324
Episode: 300 Total reward: 8.0 Training loss: 0.9330 Explore P: 0.6319
Episode: 301 Total reward: 12.0 Training loss: 31.2412 Explore P: 0.6312
Episode: 302 Total reward: 16.0 Training loss: 25.0492 Explore P: 0.6302
Episode: 303 Total reward: 14.0 Training loss: 18.1640 Explore P: 0.6293
Episode: 304 Total reward: 14.0 Training loss: 0.6485 Explore P: 0.6284
Episode: 305 Total reward: 9.0 Training loss: 10.2286 Explore P: 0.6279
Episode: 306 Total reward: 8.0 Training loss: 1.2302 Explore P: 0.6274
Episode: 307 Total reward: 20.0 Training loss: 8.9620 Explore P: 0.6262
Episode: 308 Total reward: 11.0 Training loss: 17.5165 Explore P: 0.6255
Episode: 309 Total reward: 10.0 Training loss: 9.2114 Explore P: 0.6249
Episode: 310 Total reward: 11.0 Training loss: 17.1623 Explore P: 0.6242
Episode: 311 Total reward: 40.0 Training loss: 14.6039 Explore P: 0.6217
Episode: 312 Total reward: 17.0 Training loss: 0.7247 Explore P: 0.6207
Episode: 313 Total reward: 13.0 Training loss: 8.6356 Explore P: 0.6199
Episode: 314 Total reward: 27.0 Training loss: 1.3774 Explore P: 0.6183
Episode: 315 Total reward: 15.0 Training loss: 0.9341 Explore P: 0.6173
Episode: 316 Total reward: 12.0 Training loss: 0.9970 Explore P: 0.6166
Episode: 317 Total reward: 13.0 Training loss: 25.1243 Explore P: 0.6158
Episode: 318 Total reward: 19.0 Training loss: 12.1107 Explore P: 0.6147
Episode: 319 Total reward: 14.0 Training loss: 8.5362 Explore P: 0.6138
Episode: 320 Total reward: 40.0 Training loss: 0.7512 Explore P: 0.6114
Episode: 321 Total reward: 52.0 Training loss: 0.5222 Explore P: 0.6083
Episode: 322 Total reward: 23.0 Training loss: 20.7682 Explore P: 0.6069
Episode: 323 Total reward: 28.0 Training loss: 13.0634 Explore P: 0.6053
Episode: 324 Total reward: 23.0 Training loss: 0.6316 Explore P: 0.6039
Episode: 325 Total reward: 23.0 Training loss: 20.5128 Explore P: 0.6025
Episode: 326 Total reward: 18.0 Training loss: 17.1877 Explore P: 0.6015
Episode: 327 Total reward: 16.0 Training loss: 16.5005 Explore P: 0.6005
Episode: 328 Total reward: 9.0 Training loss: 16.5921 Explore P: 0.6000
Episode: 329 Total reward: 25.0 Training loss: 5.9589 Explore P: 0.5985
Episode: 330 Total reward: 40.0 Training loss: 10.8775 Explore P: 0.5962
Episode: 331 Total reward: 21.0 Training loss: 10.2540 Explore P: 0.5949
Episode: 332 Total reward: 10.0 Training loss: 36.8551 Explore P: 0.5944
Episode: 333 Total reward: 16.0 Training loss: 0.6602 Explore P: 0.5934
Episode: 334 Total reward: 12.0 Training loss: 0.8509 Explore P: 0.5927
Episode: 335 Total reward: 8.0 Training loss: 7.4407 Explore P: 0.5923
Episode: 336 Total reward: 36.0 Training loss: 7.6975 Explore P: 0.5902
Episode: 337 Total reward: 25.0 Training loss: 11.2920 Explore P: 0.5887
Episode: 338 Total reward: 76.0 Training loss: 8.9974 Explore P: 0.5843
Episode: 339 Total reward: 77.0 Training loss: 17.6869 Explore P: 0.5799
Episode: 340 Total reward: 62.0 Training loss: 7.7591 Explore P: 0.5764
Episode: 341 Total reward: 33.0 Training loss: 6.1072 Explore P: 0.5745
Episode: 342 Total reward: 15.0 Training loss: 1.0515 Explore P: 0.5737
Episode: 343 Total reward: 29.0 Training loss: 4.4728 Explore P: 0.5721
Episode: 344 Total reward: 57.0 Training loss: 1.2968 Explore P: 0.5689
Episode: 345 Total reward: 32.0 Training loss: 8.0129 Explore P: 0.5671
Episode: 346 Total reward: 33.0 Training loss: 8.4689 Explore P: 0.5652
Episode: 347 Total reward: 53.0 Training loss: 0.8370 Explore P: 0.5623
Episode: 348 Total reward: 7.0 Training loss: 0.9666 Explore P: 0.5619
Episode: 349 Total reward: 38.0 Training loss: 20.3583 Explore P: 0.5598
Episode: 350 Total reward: 17.0 Training loss: 1.1766 Explore P: 0.5589
Episode: 351 Total reward: 18.0 Training loss: 5.4417 Explore P: 0.5579
Episode: 352 Total reward: 21.0 Training loss: 6.2139 Explore P: 0.5568
Episode: 353 Total reward: 25.0 Training loss: 0.8161 Explore P: 0.5554
Episode: 354 Total reward: 55.0 Training loss: 1.2104 Explore P: 0.5524
Episode: 355 Total reward: 33.0 Training loss: 1.0397 Explore P: 0.5506
Episode: 356 Total reward: 26.0 Training loss: 0.8184 Explore P: 0.5492
Episode: 357 Total reward: 26.0 Training loss: 8.1448 Explore P: 0.5478
Episode: 358 Total reward: 29.0 Training loss: 22.5439 Explore P: 0.5463
Episode: 359 Total reward: 33.0 Training loss: 7.9045 Explore P: 0.5445
Episode: 360 Total reward: 21.0 Training loss: 7.4942 Explore P: 0.5434
Episode: 361 Total reward: 27.0 Training loss: 0.9980 Explore P: 0.5419
Episode: 362 Total reward: 57.0 Training loss: 11.2116 Explore P: 0.5389
Episode: 363 Total reward: 29.0 Training loss: 1.1804 Explore P: 0.5374
Episode: 364 Total reward: 49.0 Training loss: 1.0220 Explore P: 0.5348
Episode: 365 Total reward: 39.0 Training loss: 16.1914 Explore P: 0.5328
Episode: 366 Total reward: 49.0 Training loss: 14.2460 Explore P: 0.5302
Episode: 367 Total reward: 29.0 Training loss: 6.4644 Explore P: 0.5287
Episode: 368 Total reward: 44.0 Training loss: 23.8029 Explore P: 0.5264
Episode: 369 Total reward: 33.0 Training loss: 0.7933 Explore P: 0.5247
Episode: 370 Total reward: 31.0 Training loss: 25.4640 Explore P: 0.5231
Episode: 371 Total reward: 31.0 Training loss: 8.9808 Explore P: 0.5215
Episode: 372 Total reward: 56.0 Training loss: 10.9213 Explore P: 0.5187
Episode: 373 Total reward: 35.0 Training loss: 1.3860 Explore P: 0.5169
Episode: 374 Total reward: 121.0 Training loss: 21.6013 Explore P: 0.5108
Episode: 375 Total reward: 18.0 Training loss: 12.4776 Explore P: 0.5099
Episode: 376 Total reward: 35.0 Training loss: 12.6905 Explore P: 0.5082
Episode: 377 Total reward: 36.0 Training loss: 1.2291 Explore P: 0.5064
Episode: 378 Total reward: 53.0 Training loss: 1.3794 Explore P: 0.5037
Episode: 379 Total reward: 26.0 Training loss: 19.6542 Explore P: 0.5025
Episode: 380 Total reward: 35.0 Training loss: 25.9405 Explore P: 0.5007
Episode: 381 Total reward: 57.0 Training loss: 12.2397 Explore P: 0.4979
Episode: 382 Total reward: 35.0 Training loss: 19.0432 Explore P: 0.4962
Episode: 383 Total reward: 76.0 Training loss: 0.9581 Explore P: 0.4926
Episode: 384 Total reward: 69.0 Training loss: 1.5857 Explore P: 0.4892
Episode: 385 Total reward: 60.0 Training loss: 10.7437 Explore P: 0.4864
Episode: 386 Total reward: 90.0 Training loss: 56.7179 Explore P: 0.4821
Episode: 387 Total reward: 35.0 Training loss: 1.7437 Explore P: 0.4805
Episode: 388 Total reward: 99.0 Training loss: 22.9842 Explore P: 0.4758
Episode: 389 Total reward: 19.0 Training loss: 17.7609 Explore P: 0.4749
Episode: 390 Total reward: 20.0 Training loss: 41.0259 Explore P: 0.4740
Episode: 391 Total reward: 24.0 Training loss: 12.6476 Explore P: 0.4729
Episode: 392 Total reward: 29.0 Training loss: 14.3975 Explore P: 0.4716
Episode: 393 Total reward: 26.0 Training loss: 16.4846 Explore P: 0.4704
Episode: 394 Total reward: 37.0 Training loss: 1.8044 Explore P: 0.4687
Episode: 395 Total reward: 26.0 Training loss: 1.6458 Explore P: 0.4675
Episode: 396 Total reward: 33.0 Training loss: 1.4761 Explore P: 0.4660
Episode: 397 Total reward: 19.0 Training loss: 1.3690 Explore P: 0.4651
Episode: 398 Total reward: 60.0 Training loss: 1.2028 Explore P: 0.4624
Episode: 399 Total reward: 39.0 Training loss: 8.7046 Explore P: 0.4606
Episode: 400 Total reward: 26.0 Training loss: 1.8107 Explore P: 0.4594
Episode: 401 Total reward: 91.0 Training loss: 39.6179 Explore P: 0.4554
Episode: 402 Total reward: 44.0 Training loss: 2.2853 Explore P: 0.4534
Episode: 403 Total reward: 42.0 Training loss: 20.3933 Explore P: 0.4516
Episode: 404 Total reward: 21.0 Training loss: 32.4025 Explore P: 0.4506
Episode: 405 Total reward: 33.0 Training loss: 22.4992 Explore P: 0.4492
Episode: 406 Total reward: 65.0 Training loss: 9.8202 Explore P: 0.4463
Episode: 407 Total reward: 49.0 Training loss: 1.8845 Explore P: 0.4442
Episode: 408 Total reward: 58.0 Training loss: 45.8225 Explore P: 0.4417
Episode: 409 Total reward: 43.0 Training loss: 25.0961 Explore P: 0.4398
Episode: 410 Total reward: 39.0 Training loss: 16.2600 Explore P: 0.4382
Episode: 411 Total reward: 51.0 Training loss: 1.5276 Explore P: 0.4360
Episode: 412 Total reward: 31.0 Training loss: 19.5503 Explore P: 0.4347
Episode: 413 Total reward: 33.0 Training loss: 34.0670 Explore P: 0.4333
Episode: 414 Total reward: 68.0 Training loss: 33.5569 Explore P: 0.4304
Episode: 415 Total reward: 93.0 Training loss: 2.2119 Explore P: 0.4265
Episode: 416 Total reward: 52.0 Training loss: 42.1603 Explore P: 0.4243
Episode: 417 Total reward: 46.0 Training loss: 16.2192 Explore P: 0.4224
Episode: 418 Total reward: 32.0 Training loss: 7.8009 Explore P: 0.4211
Episode: 419 Total reward: 30.0 Training loss: 28.0576 Explore P: 0.4199
Episode: 420 Total reward: 87.0 Training loss: 1.8137 Explore P: 0.4163
Episode: 421 Total reward: 36.0 Training loss: 2.4785 Explore P: 0.4149
Episode: 422 Total reward: 18.0 Training loss: 14.6426 Explore P: 0.4142
Episode: 423 Total reward: 19.0 Training loss: 24.3824 Explore P: 0.4134
Episode: 424 Total reward: 86.0 Training loss: 41.3054 Explore P: 0.4099
Episode: 425 Total reward: 90.0 Training loss: 2.3530 Explore P: 0.4064
Episode: 426 Total reward: 56.0 Training loss: 30.6700 Explore P: 0.4041
Episode: 427 Total reward: 50.0 Training loss: 25.7421 Explore P: 0.4022
Episode: 428 Total reward: 63.0 Training loss: 1.7728 Explore P: 0.3997
Episode: 429 Total reward: 59.0 Training loss: 69.7807 Explore P: 0.3974
Episode: 430 Total reward: 40.0 Training loss: 76.6628 Explore P: 0.3959
Episode: 431 Total reward: 55.0 Training loss: 63.3675 Explore P: 0.3938
Episode: 432 Total reward: 46.0 Training loss: 25.3404 Explore P: 0.3920
Episode: 433 Total reward: 73.0 Training loss: 37.6834 Explore P: 0.3892
Episode: 434 Total reward: 53.0 Training loss: 20.8985 Explore P: 0.3872
Episode: 435 Total reward: 54.0 Training loss: 2.6522 Explore P: 0.3852
Episode: 436 Total reward: 46.0 Training loss: 2.0310 Explore P: 0.3835
Episode: 437 Total reward: 122.0 Training loss: 2.6028 Explore P: 0.3789
Episode: 438 Total reward: 139.0 Training loss: 43.3749 Explore P: 0.3738
Episode: 439 Total reward: 70.0 Training loss: 2.8704 Explore P: 0.3713
Episode: 440 Total reward: 47.0 Training loss: 56.0399 Explore P: 0.3696
Episode: 441 Total reward: 47.0 Training loss: 2.4266 Explore P: 0.3679
Episode: 442 Total reward: 37.0 Training loss: 50.0417 Explore P: 0.3666
Episode: 443 Total reward: 83.0 Training loss: 31.8906 Explore P: 0.3636
Episode: 444 Total reward: 34.0 Training loss: 1.8229 Explore P: 0.3624
Episode: 445 Total reward: 38.0 Training loss: 3.1193 Explore P: 0.3611
Episode: 446 Total reward: 59.0 Training loss: 28.6936 Explore P: 0.3590
Episode: 447 Total reward: 39.0 Training loss: 80.9429 Explore P: 0.3577
Episode: 448 Total reward: 89.0 Training loss: 29.2328 Explore P: 0.3546
Episode: 449 Total reward: 70.0 Training loss: 21.6963 Explore P: 0.3522
Episode: 450 Total reward: 77.0 Training loss: 1.5467 Explore P: 0.3496
Episode: 451 Total reward: 44.0 Training loss: 82.2291 Explore P: 0.3481
Episode: 452 Total reward: 113.0 Training loss: 75.7802 Explore P: 0.3443
Episode: 453 Total reward: 57.0 Training loss: 1.8620 Explore P: 0.3424
Episode: 454 Total reward: 104.0 Training loss: 87.0765 Explore P: 0.3389
Episode: 455 Total reward: 23.0 Training loss: 4.2945 Explore P: 0.3382
Episode: 456 Total reward: 185.0 Training loss: 1.6064 Explore P: 0.3322
Episode: 457 Total reward: 61.0 Training loss: 14.2363 Explore P: 0.3302
Episode: 458 Total reward: 126.0 Training loss: 107.1495 Explore P: 0.3262
Episode: 459 Total reward: 52.0 Training loss: 26.8060 Explore P: 0.3246
Episode: 460 Total reward: 40.0 Training loss: 102.0959 Explore P: 0.3233
Episode: 461 Total reward: 95.0 Training loss: 2.3209 Explore P: 0.3204
Episode: 462 Total reward: 55.0 Training loss: 1.8721 Explore P: 0.3186
Episode: 463 Total reward: 91.0 Training loss: 17.7888 Explore P: 0.3159
Episode: 464 Total reward: 57.0 Training loss: 31.5446 Explore P: 0.3141
Episode: 465 Total reward: 158.0 Training loss: 102.2503 Explore P: 0.3093
Episode: 466 Total reward: 38.0 Training loss: 2.5449 Explore P: 0.3082
Episode: 467 Total reward: 54.0 Training loss: 58.7833 Explore P: 0.3066
Episode: 468 Total reward: 87.0 Training loss: 2.6992 Explore P: 0.3040
Episode: 469 Total reward: 44.0 Training loss: 15.1441 Explore P: 0.3027
Episode: 470 Total reward: 57.0 Training loss: 2.4950 Explore P: 0.3011
Episode: 471 Total reward: 67.0 Training loss: 2.1255 Explore P: 0.2991
Episode: 472 Total reward: 102.0 Training loss: 2.7994 Explore P: 0.2962
Episode: 473 Total reward: 100.0 Training loss: 69.8230 Explore P: 0.2934
Episode: 474 Total reward: 55.0 Training loss: 19.2729 Explore P: 0.2918
Episode: 475 Total reward: 65.0 Training loss: 18.6986 Explore P: 0.2900
Episode: 476 Total reward: 72.0 Training loss: 3.3323 Explore P: 0.2880
Episode: 477 Total reward: 107.0 Training loss: 81.4199 Explore P: 0.2850
Episode: 478 Total reward: 124.0 Training loss: 1.0179 Explore P: 0.2816
Episode: 479 Total reward: 143.0 Training loss: 54.3237 Explore P: 0.2778
Episode: 480 Total reward: 56.0 Training loss: 2.4847 Explore P: 0.2763
Episode: 481 Total reward: 57.0 Training loss: 3.3121 Explore P: 0.2748
Episode: 482 Total reward: 96.0 Training loss: 88.9667 Explore P: 0.2722
Episode: 483 Total reward: 44.0 Training loss: 2.4677 Explore P: 0.2711
Episode: 484 Total reward: 56.0 Training loss: 66.7677 Explore P: 0.2696
Episode: 485 Total reward: 124.0 Training loss: 3.0553 Explore P: 0.2664
Episode: 486 Total reward: 91.0 Training loss: 2.4832 Explore P: 0.2641
Episode: 487 Total reward: 143.0 Training loss: 3.2114 Explore P: 0.2605
Episode: 488 Total reward: 103.0 Training loss: 64.6792 Explore P: 0.2579
Episode: 489 Total reward: 82.0 Training loss: 2.3143 Explore P: 0.2559
Episode: 490 Total reward: 75.0 Training loss: 2.7347 Explore P: 0.2541
Episode: 491 Total reward: 81.0 Training loss: 53.1232 Explore P: 0.2521
Episode: 492 Total reward: 75.0 Training loss: 3.7306 Explore P: 0.2503
Episode: 493 Total reward: 108.0 Training loss: 1.5997 Explore P: 0.2477
Episode: 494 Total reward: 64.0 Training loss: 2.6106 Explore P: 0.2462
Episode: 495 Total reward: 128.0 Training loss: 1.6042 Explore P: 0.2432
Episode: 496 Total reward: 83.0 Training loss: 2.2759 Explore P: 0.2413
Episode: 497 Total reward: 114.0 Training loss: 2.4251 Explore P: 0.2386
Episode: 498 Total reward: 38.0 Training loss: 105.9504 Explore P: 0.2378
Episode: 499 Total reward: 60.0 Training loss: 274.9236 Explore P: 0.2364
Episode: 500 Total reward: 93.0 Training loss: 1.1972 Explore P: 0.2343
Episode: 501 Total reward: 91.0 Training loss: 2.2601 Explore P: 0.2323
Episode: 502 Total reward: 66.0 Training loss: 3.4551 Explore P: 0.2308
Episode: 503 Total reward: 120.0 Training loss: 2.3063 Explore P: 0.2282
Episode: 504 Total reward: 199.0 Training loss: 95.2752 Explore P: 0.2239
Episode: 505 Total reward: 59.0 Training loss: 2.6890 Explore P: 0.2226
Episode: 506 Total reward: 80.0 Training loss: 2.7219 Explore P: 0.2209
Episode: 507 Total reward: 96.0 Training loss: 2.4807 Explore P: 0.2189
Episode: 508 Total reward: 189.0 Training loss: 0.9607 Explore P: 0.2150
Episode: 509 Total reward: 64.0 Training loss: 2.1257 Explore P: 0.2137
Episode: 510 Total reward: 65.0 Training loss: 1.0973 Explore P: 0.2124
Episode: 511 Total reward: 72.0 Training loss: 1.9733 Explore P: 0.2109
Episode: 512 Total reward: 62.0 Training loss: 95.2381 Explore P: 0.2097
Episode: 513 Total reward: 69.0 Training loss: 2.1100 Explore P: 0.2083
Episode: 514 Total reward: 75.0 Training loss: 1.5215 Explore P: 0.2068
Episode: 515 Total reward: 132.0 Training loss: 1.9484 Explore P: 0.2042
Episode: 516 Total reward: 57.0 Training loss: 1.3551 Explore P: 0.2031
Episode: 517 Total reward: 97.0 Training loss: 1.7242 Explore P: 0.2013
Episode: 518 Total reward: 91.0 Training loss: 2.2343 Explore P: 0.1995
Episode: 519 Total reward: 69.0 Training loss: 0.4368 Explore P: 0.1982
Episode: 520 Total reward: 182.0 Training loss: 0.9244 Explore P: 0.1948
Episode: 521 Total reward: 101.0 Training loss: 104.1669 Explore P: 0.1930
Episode: 522 Total reward: 88.0 Training loss: 1.2125 Explore P: 0.1914
Episode: 523 Total reward: 91.0 Training loss: 1.8720 Explore P: 0.1897
Episode: 524 Total reward: 145.0 Training loss: 1.5282 Explore P: 0.1872
Episode: 525 Total reward: 69.0 Training loss: 1.3609 Explore P: 0.1859
Episode: 526 Total reward: 68.0 Training loss: 1.4212 Explore P: 0.1847
Episode: 527 Total reward: 86.0 Training loss: 94.9680 Explore P: 0.1832
Episode: 528 Total reward: 156.0 Training loss: 118.8774 Explore P: 0.1806
Episode: 529 Total reward: 199.0 Training loss: 1.6789 Explore P: 0.1772
Episode: 530 Total reward: 102.0 Training loss: 1.2913 Explore P: 0.1755
Episode: 531 Total reward: 190.0 Training loss: 1.3677 Explore P: 0.1724
Episode: 532 Total reward: 114.0 Training loss: 1.0599 Explore P: 0.1705
Episode: 533 Total reward: 62.0 Training loss: 90.6413 Explore P: 0.1696
Episode: 534 Total reward: 109.0 Training loss: 0.7661 Explore P: 0.1678
Episode: 535 Total reward: 181.0 Training loss: 1.4680 Explore P: 0.1650
Episode: 536 Total reward: 82.0 Training loss: 1.3047 Explore P: 0.1637
Episode: 537 Total reward: 91.0 Training loss: 1.4142 Explore P: 0.1623
Episode: 538 Total reward: 199.0 Training loss: 1.3862 Explore P: 0.1593
Episode: 539 Total reward: 98.0 Training loss: 0.9338 Explore P: 0.1579
Episode: 540 Total reward: 150.0 Training loss: 1.1670 Explore P: 0.1557
Episode: 541 Total reward: 136.0 Training loss: 1.7955 Explore P: 0.1537
Episode: 542 Total reward: 97.0 Training loss: 1.4642 Explore P: 0.1523
Episode: 543 Total reward: 75.0 Training loss: 1.5994 Explore P: 0.1513
Episode: 544 Total reward: 104.0 Training loss: 116.0041 Explore P: 0.1498
Episode: 545 Total reward: 156.0 Training loss: 0.9402 Explore P: 0.1476
Episode: 546 Total reward: 167.0 Training loss: 124.5766 Explore P: 0.1454
Episode: 547 Total reward: 113.0 Training loss: 0.7751 Explore P: 0.1438
Episode: 548 Total reward: 156.0 Training loss: 0.9998 Explore P: 0.1418
Episode: 549 Total reward: 115.0 Training loss: 0.7806 Explore P: 0.1403
Episode: 550 Total reward: 104.0 Training loss: 1.5826 Explore P: 0.1389
Episode: 551 Total reward: 173.0 Training loss: 1.4932 Explore P: 0.1367
Episode: 552 Total reward: 178.0 Training loss: 1.2793 Explore P: 0.1345
Episode: 553 Total reward: 109.0 Training loss: 0.8552 Explore P: 0.1331
Episode: 554 Total reward: 149.0 Training loss: 0.9083 Explore P: 0.1313
Episode: 555 Total reward: 198.0 Training loss: 0.7489 Explore P: 0.1289
Episode: 556 Total reward: 86.0 Training loss: 0.8611 Explore P: 0.1279
Episode: 557 Total reward: 145.0 Training loss: 0.5802 Explore P: 0.1262
Episode: 558 Total reward: 144.0 Training loss: 1.0240 Explore P: 0.1245
Episode: 559 Total reward: 61.0 Training loss: 0.8499 Explore P: 0.1238
Episode: 560 Total reward: 79.0 Training loss: 1.2696 Explore P: 0.1229
Episode: 561 Total reward: 116.0 Training loss: 147.8353 Explore P: 0.1216
Episode: 562 Total reward: 172.0 Training loss: 1.4752 Explore P: 0.1197
Episode: 563 Total reward: 88.0 Training loss: 0.5509 Explore P: 0.1188
Episode: 564 Total reward: 199.0 Training loss: 0.5416 Explore P: 0.1166
Episode: 565 Total reward: 49.0 Training loss: 0.9011 Explore P: 0.1161
Episode: 566 Total reward: 108.0 Training loss: 156.4883 Explore P: 0.1150
Episode: 567 Total reward: 117.0 Training loss: 0.8817 Explore P: 0.1138
Episode: 568 Total reward: 112.0 Training loss: 0.6156 Explore P: 0.1126
Episode: 569 Total reward: 60.0 Training loss: 0.6489 Explore P: 0.1120
Episode: 570 Total reward: 86.0 Training loss: 0.8815 Explore P: 0.1111
Episode: 571 Total reward: 176.0 Training loss: 0.6689 Explore P: 0.1093
Episode: 572 Total reward: 62.0 Training loss: 0.4775 Explore P: 0.1087
Episode: 573 Total reward: 108.0 Training loss: 0.7203 Explore P: 0.1077
Episode: 574 Total reward: 101.0 Training loss: 0.6456 Explore P: 0.1067
Episode: 575 Total reward: 170.0 Training loss: 0.4908 Explore P: 0.1051
Episode: 576 Total reward: 103.0 Training loss: 0.9958 Explore P: 0.1041
Episode: 577 Total reward: 192.0 Training loss: 0.4211 Explore P: 0.1023
Episode: 578 Total reward: 136.0 Training loss: 1.5500 Explore P: 0.1010
Episode: 579 Total reward: 144.0 Training loss: 0.4299 Explore P: 0.0997
Episode: 580 Total reward: 193.0 Training loss: 0.4543 Explore P: 0.0980
Episode: 581 Total reward: 86.0 Training loss: 116.5913 Explore P: 0.0973
Episode: 582 Total reward: 199.0 Training loss: 0.6961 Explore P: 0.0956
Episode: 583 Total reward: 199.0 Training loss: 0.5176 Explore P: 0.0939
Episode: 584 Total reward: 142.0 Training loss: 110.9818 Explore P: 0.0927
Episode: 585 Total reward: 92.0 Training loss: 0.7136 Explore P: 0.0919
Episode: 586 Total reward: 116.0 Training loss: 0.9312 Explore P: 0.0910
Episode: 587 Total reward: 106.0 Training loss: 1.0860 Explore P: 0.0901
Episode: 588 Total reward: 82.0 Training loss: 0.4168 Explore P: 0.0895
Episode: 589 Total reward: 199.0 Training loss: 0.5509 Explore P: 0.0879
Episode: 590 Total reward: 199.0 Training loss: 0.4521 Explore P: 0.0864
Episode: 591 Total reward: 87.0 Training loss: 0.3359 Explore P: 0.0857
Episode: 592 Total reward: 199.0 Training loss: 107.1309 Explore P: 0.0842
Episode: 593 Total reward: 111.0 Training loss: 0.7100 Explore P: 0.0834
Episode: 594 Total reward: 177.0 Training loss: 0.4050 Explore P: 0.0821
Episode: 595 Total reward: 122.0 Training loss: 41.3279 Explore P: 0.0812
Episode: 596 Total reward: 80.0 Training loss: 0.2042 Explore P: 0.0807
Episode: 597 Total reward: 159.0 Training loss: 0.4434 Explore P: 0.0796
Episode: 598 Total reward: 199.0 Training loss: 0.5739 Explore P: 0.0782
Episode: 599 Total reward: 115.0 Training loss: 0.3282 Explore P: 0.0774
Episode: 600 Total reward: 74.0 Training loss: 0.2484 Explore P: 0.0769
Episode: 601 Total reward: 66.0 Training loss: 0.3707 Explore P: 0.0765
Episode: 602 Total reward: 71.0 Training loss: 0.3402 Explore P: 0.0760
Episode: 603 Total reward: 185.0 Training loss: 0.1679 Explore P: 0.0748
Episode: 604 Total reward: 181.0 Training loss: 0.5374 Explore P: 0.0736
Episode: 605 Total reward: 92.0 Training loss: 0.6127 Explore P: 0.0730
Episode: 606 Total reward: 113.0 Training loss: 0.1371 Explore P: 0.0723
Episode: 607 Total reward: 158.0 Training loss: 0.4423 Explore P: 0.0714
Episode: 608 Total reward: 172.0 Training loss: 0.5114 Explore P: 0.0703
Episode: 609 Total reward: 126.0 Training loss: 0.3332 Explore P: 0.0696
Episode: 610 Total reward: 131.0 Training loss: 0.3262 Explore P: 0.0688
Episode: 611 Total reward: 65.0 Training loss: 0.4161 Explore P: 0.0684
Episode: 612 Total reward: 146.0 Training loss: 0.6209 Explore P: 0.0676
Episode: 613 Total reward: 81.0 Training loss: 0.3292 Explore P: 0.0671
Episode: 614 Total reward: 199.0 Training loss: 0.3675 Explore P: 0.0660
Episode: 615 Total reward: 85.0 Training loss: 0.2731 Explore P: 0.0655
Episode: 616 Total reward: 199.0 Training loss: 16.5018 Explore P: 0.0644
Episode: 617 Total reward: 183.0 Training loss: 0.2654 Explore P: 0.0634
Episode: 618 Total reward: 89.0 Training loss: 12.0111 Explore P: 0.0629
Episode: 619 Total reward: 102.0 Training loss: 0.4079 Explore P: 0.0624
Episode: 620 Total reward: 189.0 Training loss: 4.7593 Explore P: 0.0614
Episode: 621 Total reward: 145.0 Training loss: 8.4757 Explore P: 0.0607
Episode: 622 Total reward: 133.0 Training loss: 3.7986 Explore P: 0.0600
Episode: 623 Total reward: 93.0 Training loss: 0.5428 Explore P: 0.0596
Episode: 624 Total reward: 76.0 Training loss: 0.4065 Explore P: 0.0592
Episode: 625 Total reward: 77.0 Training loss: 0.3943 Explore P: 0.0588
Episode: 626 Total reward: 65.0 Training loss: 0.2499 Explore P: 0.0585
Episode: 627 Total reward: 81.0 Training loss: 0.2274 Explore P: 0.0581
Episode: 628 Total reward: 126.0 Training loss: 0.2762 Explore P: 0.0575
Episode: 629 Total reward: 68.0 Training loss: 0.1855 Explore P: 0.0572
Episode: 630 Total reward: 106.0 Training loss: 0.1825 Explore P: 0.0567
Episode: 631 Total reward: 101.0 Training loss: 0.2789 Explore P: 0.0562
Episode: 632 Total reward: 74.0 Training loss: 0.3269 Explore P: 0.0559
Episode: 633 Total reward: 145.0 Training loss: 0.1586 Explore P: 0.0552
Episode: 634 Total reward: 85.0 Training loss: 0.9461 Explore P: 0.0548
Episode: 635 Total reward: 125.0 Training loss: 0.6363 Explore P: 0.0543
Episode: 636 Total reward: 132.0 Training loss: 0.1393 Explore P: 0.0537
Episode: 637 Total reward: 87.0 Training loss: 258.3306 Explore P: 0.0533
Episode: 638 Total reward: 74.0 Training loss: 0.6746 Explore P: 0.0530
Episode: 639 Total reward: 93.0 Training loss: 0.2863 Explore P: 0.0526
Episode: 640 Total reward: 111.0 Training loss: 0.3133 Explore P: 0.0521
Episode: 641 Total reward: 128.0 Training loss: 0.2306 Explore P: 0.0516
Episode: 642 Total reward: 91.0 Training loss: 0.2443 Explore P: 0.0512
Episode: 643 Total reward: 81.0 Training loss: 0.2900 Explore P: 0.0509
Episode: 644 Total reward: 117.0 Training loss: 0.2778 Explore P: 0.0504
Episode: 645 Total reward: 124.0 Training loss: 0.2306 Explore P: 0.0499
Episode: 646 Total reward: 98.0 Training loss: 0.1495 Explore P: 0.0495
Episode: 647 Total reward: 111.0 Training loss: 0.2948 Explore P: 0.0491
Episode: 648 Total reward: 116.0 Training loss: 0.7138 Explore P: 0.0486
Episode: 649 Total reward: 116.0 Training loss: 0.2161 Explore P: 0.0482
Episode: 650 Total reward: 149.0 Training loss: 2.9544 Explore P: 0.0476
Episode: 651 Total reward: 184.0 Training loss: 0.1931 Explore P: 0.0469
Episode: 652 Total reward: 94.0 Training loss: 0.4101 Explore P: 0.0466
Episode: 653 Total reward: 102.0 Training loss: 0.2697 Explore P: 0.0462
Episode: 654 Total reward: 156.0 Training loss: 0.1148 Explore P: 0.0456
Episode: 655 Total reward: 95.0 Training loss: 0.3945 Explore P: 0.0453
Episode: 656 Total reward: 68.0 Training loss: 0.3274 Explore P: 0.0451
Episode: 657 Total reward: 80.0 Training loss: 0.1622 Explore P: 0.0448
Episode: 658 Total reward: 98.0 Training loss: 0.8951 Explore P: 0.0445
Episode: 659 Total reward: 106.0 Training loss: 2.4415 Explore P: 0.0441
Episode: 660 Total reward: 199.0 Training loss: 0.2616 Explore P: 0.0434
Episode: 661 Total reward: 95.0 Training loss: 1.7253 Explore P: 0.0431
Episode: 662 Total reward: 92.0 Training loss: 0.3784 Explore P: 0.0428
Episode: 663 Total reward: 157.0 Training loss: 81.3278 Explore P: 0.0423
Episode: 664 Total reward: 199.0 Training loss: 0.1662 Explore P: 0.0417
Episode: 665 Total reward: 77.0 Training loss: 0.1460 Explore P: 0.0414
Episode: 666 Total reward: 143.0 Training loss: 0.2333 Explore P: 0.0410
Episode: 667 Total reward: 73.0 Training loss: 0.2444 Explore P: 0.0407
Episode: 668 Total reward: 75.0 Training loss: 0.1838 Explore P: 0.0405
Episode: 669 Total reward: 127.0 Training loss: 0.5463 Explore P: 0.0401
Episode: 670 Total reward: 71.0 Training loss: 0.3875 Explore P: 0.0399
Episode: 671 Total reward: 133.0 Training loss: 0.5177 Explore P: 0.0395
Episode: 672 Total reward: 81.0 Training loss: 0.2405 Explore P: 0.0393
Episode: 673 Total reward: 199.0 Training loss: 0.1988 Explore P: 0.0387
Episode: 674 Total reward: 89.0 Training loss: 0.2093 Explore P: 0.0384
Episode: 675 Total reward: 80.0 Training loss: 0.2449 Explore P: 0.0382
Episode: 676 Total reward: 96.0 Training loss: 0.2633 Explore P: 0.0379
Episode: 677 Total reward: 87.0 Training loss: 0.1690 Explore P: 0.0377
Episode: 678 Total reward: 86.0 Training loss: 0.3014 Explore P: 0.0375
Episode: 679 Total reward: 190.0 Training loss: 0.4005 Explore P: 0.0370
Episode: 680 Total reward: 125.0 Training loss: 0.1019 Explore P: 0.0366
Episode: 681 Total reward: 143.0 Training loss: 0.2485 Explore P: 0.0362
Episode: 682 Total reward: 199.0 Training loss: 0.2255 Explore P: 0.0357
Episode: 683 Total reward: 112.0 Training loss: 0.2358 Explore P: 0.0354
Episode: 684 Total reward: 199.0 Training loss: 0.1148 Explore P: 0.0349
Episode: 685 Total reward: 199.0 Training loss: 0.2827 Explore P: 0.0344
Episode: 686 Total reward: 199.0 Training loss: 0.1662 Explore P: 0.0340
Episode: 687 Total reward: 199.0 Training loss: 0.1411 Explore P: 0.0335
Episode: 688 Total reward: 193.0 Training loss: 0.2410 Explore P: 0.0330
Episode: 689 Total reward: 199.0 Training loss: 0.1683 Explore P: 0.0326
Episode: 690 Total reward: 199.0 Training loss: 0.1708 Explore P: 0.0321
Episode: 691 Total reward: 161.0 Training loss: 0.2766 Explore P: 0.0318
Episode: 692 Total reward: 159.0 Training loss: 0.3060 Explore P: 0.0314
Episode: 693 Total reward: 199.0 Training loss: 0.1405 Explore P: 0.0310
Episode: 694 Total reward: 101.0 Training loss: 0.2435 Explore P: 0.0308
Episode: 695 Total reward: 199.0 Training loss: 0.2267 Explore P: 0.0304
Episode: 696 Total reward: 141.0 Training loss: 0.2143 Explore P: 0.0301
Episode: 697 Total reward: 199.0 Training loss: 0.3564 Explore P: 0.0297
Episode: 698 Total reward: 122.0 Training loss: 0.2040 Explore P: 0.0295
Episode: 699 Total reward: 103.0 Training loss: 0.3089 Explore P: 0.0293
Episode: 700 Total reward: 199.0 Training loss: 0.0925 Explore P: 0.0289
Episode: 701 Total reward: 172.0 Training loss: 0.0731 Explore P: 0.0286
Episode: 702 Total reward: 91.0 Training loss: 0.1730 Explore P: 0.0284
Episode: 703 Total reward: 178.0 Training loss: 0.2145 Explore P: 0.0281
Episode: 704 Total reward: 189.0 Training loss: 0.1254 Explore P: 0.0277
Episode: 705 Total reward: 199.0 Training loss: 0.2782 Explore P: 0.0274
Episode: 706 Total reward: 199.0 Training loss: 0.1621 Explore P: 0.0271
Episode: 707 Total reward: 143.0 Training loss: 0.1151 Explore P: 0.0268
Episode: 708 Total reward: 199.0 Training loss: 0.0826 Explore P: 0.0265
Episode: 709 Total reward: 181.0 Training loss: 0.1430 Explore P: 0.0262
Episode: 710 Total reward: 107.0 Training loss: 0.3589 Explore P: 0.0260
Episode: 711 Total reward: 137.0 Training loss: 0.1933 Explore P: 0.0258
Episode: 712 Total reward: 199.0 Training loss: 0.0975 Explore P: 0.0255
Episode: 713 Total reward: 199.0 Training loss: 0.1312 Explore P: 0.0252
Episode: 714 Total reward: 100.0 Training loss: 0.2433 Explore P: 0.0250
Episode: 715 Total reward: 199.0 Training loss: 0.1167 Explore P: 0.0247
Episode: 716 Total reward: 199.0 Training loss: 0.1858 Explore P: 0.0244
Episode: 717 Total reward: 199.0 Training loss: 0.2361 Explore P: 0.0242
Episode: 718 Total reward: 143.0 Training loss: 0.1794 Explore P: 0.0240
Episode: 719 Total reward: 199.0 Training loss: 0.1552 Explore P: 0.0237
Episode: 720 Total reward: 199.0 Training loss: 0.1464 Explore P: 0.0234
Episode: 721 Total reward: 199.0 Training loss: 0.0695 Explore P: 0.0231
Episode: 722 Total reward: 199.0 Training loss: 0.1371 Explore P: 0.0229
Episode: 723 Total reward: 161.0 Training loss: 0.2152 Explore P: 0.0227
Episode: 724 Total reward: 182.0 Training loss: 0.0996 Explore P: 0.0225
Episode: 725 Total reward: 199.0 Training loss: 124.8286 Explore P: 0.0222
Episode: 726 Total reward: 199.0 Training loss: 11.5113 Explore P: 0.0220
Episode: 727 Total reward: 126.0 Training loss: 0.1283 Explore P: 0.0218
Episode: 728 Total reward: 199.0 Training loss: 0.2756 Explore P: 0.0216
Episode: 729 Total reward: 199.0 Training loss: 0.1198 Explore P: 0.0214
Episode: 730 Total reward: 131.0 Training loss: 0.1296 Explore P: 0.0212
Episode: 731 Total reward: 199.0 Training loss: 0.1736 Explore P: 0.0210
Episode: 732 Total reward: 125.0 Training loss: 0.1822 Explore P: 0.0209
Episode: 733 Total reward: 199.0 Training loss: 0.1056 Explore P: 0.0206
Episode: 734 Total reward: 180.0 Training loss: 0.1724 Explore P: 0.0204
Episode: 735 Total reward: 94.0 Training loss: 125.6414 Explore P: 0.0204
Episode: 736 Total reward: 117.0 Training loss: 0.1531 Explore P: 0.0202
Episode: 737 Total reward: 64.0 Training loss: 0.1251 Explore P: 0.0202
Episode: 738 Total reward: 82.0 Training loss: 0.0717 Explore P: 0.0201
Episode: 739 Total reward: 189.0 Training loss: 0.1201 Explore P: 0.0199
Episode: 740 Total reward: 199.0 Training loss: 0.2520 Explore P: 0.0197
Episode: 741 Total reward: 199.0 Training loss: 0.1731 Explore P: 0.0195
Episode: 742 Total reward: 106.0 Training loss: 0.1320 Explore P: 0.0194
Episode: 743 Total reward: 130.0 Training loss: 0.1822 Explore P: 0.0193
Episode: 744 Total reward: 199.0 Training loss: 0.3446 Explore P: 0.0191
Episode: 745 Total reward: 173.0 Training loss: 0.1678 Explore P: 0.0189
Episode: 746 Total reward: 199.0 Training loss: 3.4272 Explore P: 0.0188
Episode: 747 Total reward: 199.0 Training loss: 168.4196 Explore P: 0.0186
Episode: 748 Total reward: 199.0 Training loss: 0.3732 Explore P: 0.0184
Episode: 749 Total reward: 188.0 Training loss: 0.1487 Explore P: 0.0183
Episode: 750 Total reward: 78.0 Training loss: 0.1667 Explore P: 0.0182
Episode: 751 Total reward: 121.0 Training loss: 0.2892 Explore P: 0.0181
Episode: 752 Total reward: 199.0 Training loss: 0.1473 Explore P: 0.0179
Episode: 753 Total reward: 199.0 Training loss: 2.3527 Explore P: 0.0178
Episode: 754 Total reward: 199.0 Training loss: 0.2302 Explore P: 0.0176
Episode: 755 Total reward: 199.0 Training loss: 0.1379 Explore P: 0.0175
Episode: 756 Total reward: 199.0 Training loss: 0.1614 Explore P: 0.0173
Episode: 757 Total reward: 199.0 Training loss: 0.1575 Explore P: 0.0172
Episode: 758 Total reward: 199.0 Training loss: 0.0914 Explore P: 0.0171
Episode: 759 Total reward: 101.0 Training loss: 0.1980 Explore P: 0.0170
Episode: 760 Total reward: 111.0 Training loss: 0.1839 Explore P: 0.0169
Episode: 761 Total reward: 199.0 Training loss: 0.1321 Explore P: 0.0168
Episode: 762 Total reward: 199.0 Training loss: 0.2709 Explore P: 0.0166
Episode: 763 Total reward: 96.0 Training loss: 226.3603 Explore P: 0.0166
Episode: 764 Total reward: 90.0 Training loss: 173.3170 Explore P: 0.0165
Episode: 765 Total reward: 75.0 Training loss: 0.9880 Explore P: 0.0165
Episode: 766 Total reward: 59.0 Training loss: 0.1867 Explore P: 0.0164
Episode: 767 Total reward: 199.0 Training loss: 0.3157 Explore P: 0.0163
Episode: 768 Total reward: 199.0 Training loss: 0.1075 Explore P: 0.0162
Episode: 769 Total reward: 199.0 Training loss: 0.3077 Explore P: 0.0161
Episode: 770 Total reward: 56.0 Training loss: 0.2880 Explore P: 0.0160
Episode: 771 Total reward: 199.0 Training loss: 0.1500 Explore P: 0.0159
Episode: 772 Total reward: 199.0 Training loss: 0.0758 Explore P: 0.0158
Episode: 773 Total reward: 87.0 Training loss: 280.3314 Explore P: 0.0157
Episode: 774 Total reward: 199.0 Training loss: 0.1791 Explore P: 0.0156
Episode: 775 Total reward: 199.0 Training loss: 0.2036 Explore P: 0.0155
Episode: 776 Total reward: 199.0 Training loss: 0.1364 Explore P: 0.0154
Episode: 777 Total reward: 112.0 Training loss: 0.4649 Explore P: 0.0153
Episode: 778 Total reward: 88.0 Training loss: 0.3645 Explore P: 0.0153
Episode: 779 Total reward: 107.0 Training loss: 0.1850 Explore P: 0.0152
Episode: 780 Total reward: 163.0 Training loss: 0.1861 Explore P: 0.0152
Episode: 781 Total reward: 197.0 Training loss: 0.1381 Explore P: 0.0151
Episode: 782 Total reward: 199.0 Training loss: 0.3593 Explore P: 0.0150
Episode: 783 Total reward: 199.0 Training loss: 0.1210 Explore P: 0.0149
Episode: 784 Total reward: 199.0 Training loss: 0.1437 Explore P: 0.0148
Episode: 785 Total reward: 199.0 Training loss: 0.4227 Explore P: 0.0147
Episode: 786 Total reward: 199.0 Training loss: 0.2755 Explore P: 0.0146
Episode: 787 Total reward: 199.0 Training loss: 0.2699 Explore P: 0.0145
Episode: 788 Total reward: 199.0 Training loss: 0.1186 Explore P: 0.0144
Episode: 789 Total reward: 199.0 Training loss: 0.2755 Explore P: 0.0143
Episode: 790 Total reward: 199.0 Training loss: 0.3075 Explore P: 0.0142
Episode: 791 Total reward: 199.0 Training loss: 263.5042 Explore P: 0.0141
Episode: 792 Total reward: 199.0 Training loss: 0.3876 Explore P: 0.0141
Episode: 793 Total reward: 199.0 Training loss: 0.1900 Explore P: 0.0140
Episode: 794 Total reward: 199.0 Training loss: 0.3646 Explore P: 0.0139
Episode: 795 Total reward: 199.0 Training loss: 0.2639 Explore P: 0.0138
Episode: 796 Total reward: 199.0 Training loss: 0.5092 Explore P: 0.0138
Episode: 797 Total reward: 199.0 Training loss: 0.2132 Explore P: 0.0137
Episode: 798 Total reward: 199.0 Training loss: 0.3409 Explore P: 0.0136
Episode: 799 Total reward: 199.0 Training loss: 0.2444 Explore P: 0.0135
Episode: 800 Total reward: 199.0 Training loss: 0.5575 Explore P: 0.0135
Episode: 801 Total reward: 196.0 Training loss: 0.3971 Explore P: 0.0134
Episode: 802 Total reward: 199.0 Training loss: 0.2933 Explore P: 0.0133
Episode: 803 Total reward: 199.0 Training loss: 23.8636 Explore P: 0.0133
Episode: 804 Total reward: 197.0 Training loss: 0.2552 Explore P: 0.0132
Episode: 805 Total reward: 176.0 Training loss: 0.5483 Explore P: 0.0131
Episode: 806 Total reward: 177.0 Training loss: 0.5617 Explore P: 0.0131
Episode: 807 Total reward: 199.0 Training loss: 312.4422 Explore P: 0.0130
Episode: 808 Total reward: 199.0 Training loss: 0.2880 Explore P: 0.0130
Episode: 809 Total reward: 199.0 Training loss: 0.1767 Explore P: 0.0129
Episode: 810 Total reward: 199.0 Training loss: 0.5568 Explore P: 0.0129
Episode: 811 Total reward: 199.0 Training loss: 2.3921 Explore P: 0.0128
Episode: 812 Total reward: 199.0 Training loss: 0.9594 Explore P: 0.0127
Episode: 813 Total reward: 199.0 Training loss: 274.6874 Explore P: 0.0127
Episode: 814 Total reward: 199.0 Training loss: 0.6964 Explore P: 0.0126
Episode: 815 Total reward: 199.0 Training loss: 275.2589 Explore P: 0.0126
Episode: 816 Total reward: 199.0 Training loss: 0.3222 Explore P: 0.0125
Episode: 817 Total reward: 199.0 Training loss: 0.3900 Explore P: 0.0125
Episode: 818 Total reward: 199.0 Training loss: 0.3056 Explore P: 0.0124
Episode: 819 Total reward: 160.0 Training loss: 0.2325 Explore P: 0.0124
Episode: 820 Total reward: 199.0 Training loss: 0.3711 Explore P: 0.0123
Episode: 821 Total reward: 199.0 Training loss: 0.6251 Explore P: 0.0123
Episode: 822 Total reward: 179.0 Training loss: 0.6629 Explore P: 0.0123
Episode: 823 Total reward: 199.0 Training loss: 0.3771 Explore P: 0.0122
Episode: 824 Total reward: 199.0 Training loss: 0.6810 Explore P: 0.0122
Episode: 825 Total reward: 195.0 Training loss: 1.3746 Explore P: 0.0121
Episode: 826 Total reward: 199.0 Training loss: 0.3300 Explore P: 0.0121
Episode: 827 Total reward: 189.0 Training loss: 726.4813 Explore P: 0.0120
Episode: 828 Total reward: 161.0 Training loss: 2.7680 Explore P: 0.0120
Episode: 829 Total reward: 133.0 Training loss: 1.1988 Explore P: 0.0120
Episode: 830 Total reward: 138.0 Training loss: 1.3922 Explore P: 0.0120
Episode: 831 Total reward: 134.0 Training loss: 1.0544 Explore P: 0.0119
Episode: 832 Total reward: 136.0 Training loss: 0.9330 Explore P: 0.0119
Episode: 833 Total reward: 146.0 Training loss: 1.1612 Explore P: 0.0119
Episode: 834 Total reward: 113.0 Training loss: 0.2552 Explore P: 0.0119
Episode: 835 Total reward: 120.0 Training loss: 1.3491 Explore P: 0.0118
Episode: 836 Total reward: 155.0 Training loss: 0.8642 Explore P: 0.0118
Episode: 837 Total reward: 159.0 Training loss: 1.1769 Explore P: 0.0118
Episode: 838 Total reward: 149.0 Training loss: 2.8076 Explore P: 0.0118
Episode: 839 Total reward: 199.0 Training loss: 0.1640 Explore P: 0.0117
Episode: 840 Total reward: 145.0 Training loss: 0.5410 Explore P: 0.0117
Episode: 841 Total reward: 199.0 Training loss: 0.1612 Explore P: 0.0117
Episode: 842 Total reward: 199.0 Training loss: 461.1873 Explore P: 0.0116
Episode: 843 Total reward: 199.0 Training loss: 0.2399 Explore P: 0.0116
Episode: 844 Total reward: 199.0 Training loss: 426.7859 Explore P: 0.0116
Episode: 845 Total reward: 199.0 Training loss: 0.1744 Explore P: 0.0115
Episode: 846 Total reward: 199.0 Training loss: 0.3776 Explore P: 0.0115
Episode: 847 Total reward: 199.0 Training loss: 0.5035 Explore P: 0.0115
Episode: 848 Total reward: 199.0 Training loss: 0.2363 Explore P: 0.0114
Episode: 849 Total reward: 199.0 Training loss: 0.1716 Explore P: 0.0114
Episode: 850 Total reward: 199.0 Training loss: 0.1903 Explore P: 0.0114
Episode: 851 Total reward: 199.0 Training loss: 3.2128 Explore P: 0.0114
Episode: 852 Total reward: 199.0 Training loss: 31.3783 Explore P: 0.0113
Episode: 853 Total reward: 199.0 Training loss: 0.1754 Explore P: 0.0113
Episode: 854 Total reward: 199.0 Training loss: 0.1954 Explore P: 0.0113
Episode: 855 Total reward: 199.0 Training loss: 0.3511 Explore P: 0.0113
Episode: 856 Total reward: 199.0 Training loss: 0.2271 Explore P: 0.0112
Episode: 857 Total reward: 199.0 Training loss: 0.2725 Explore P: 0.0112
Episode: 858 Total reward: 199.0 Training loss: 0.2250 Explore P: 0.0112
Episode: 859 Total reward: 199.0 Training loss: 0.2984 Explore P: 0.0112
Episode: 860 Total reward: 199.0 Training loss: 0.1503 Explore P: 0.0111
Episode: 861 Total reward: 199.0 Training loss: 0.2531 Explore P: 0.0111
Episode: 862 Total reward: 199.0 Training loss: 0.4599 Explore P: 0.0111
Episode: 863 Total reward: 199.0 Training loss: 0.3988 Explore P: 0.0111
Episode: 864 Total reward: 199.0 Training loss: 0.5754 Explore P: 0.0111
Episode: 865 Total reward: 199.0 Training loss: 0.3773 Explore P: 0.0110
Episode: 866 Total reward: 199.0 Training loss: 0.4301 Explore P: 0.0110
Episode: 867 Total reward: 199.0 Training loss: 0.2165 Explore P: 0.0110
Episode: 868 Total reward: 199.0 Training loss: 0.4200 Explore P: 0.0110
Episode: 869 Total reward: 199.0 Training loss: 440.7998 Explore P: 0.0110
Episode: 870 Total reward: 199.0 Training loss: 338.2294 Explore P: 0.0109
Episode: 871 Total reward: 199.0 Training loss: 0.5909 Explore P: 0.0109
Episode: 872 Total reward: 199.0 Training loss: 0.5540 Explore P: 0.0109
Episode: 873 Total reward: 199.0 Training loss: 0.3188 Explore P: 0.0109
Episode: 874 Total reward: 199.0 Training loss: 0.3665 Explore P: 0.0109
Episode: 875 Total reward: 199.0 Training loss: 0.3802 Explore P: 0.0108
Episode: 876 Total reward: 199.0 Training loss: 0.1847 Explore P: 0.0108
Episode: 877 Total reward: 199.0 Training loss: 234.6684 Explore P: 0.0108
Episode: 878 Total reward: 199.0 Training loss: 0.4437 Explore P: 0.0108
Episode: 879 Total reward: 199.0 Training loss: 0.1620 Explore P: 0.0108
Episode: 880 Total reward: 199.0 Training loss: 282.9654 Explore P: 0.0108
Episode: 881 Total reward: 199.0 Training loss: 0.3393 Explore P: 0.0108
Episode: 882 Total reward: 199.0 Training loss: 0.1815 Explore P: 0.0107
Episode: 883 Total reward: 199.0 Training loss: 0.2667 Explore P: 0.0107
Episode: 884 Total reward: 199.0 Training loss: 79.5465 Explore P: 0.0107
Episode: 885 Total reward: 199.0 Training loss: 0.1336 Explore P: 0.0107
Episode: 886 Total reward: 199.0 Training loss: 0.1565 Explore P: 0.0107
Episode: 887 Total reward: 199.0 Training loss: 0.2984 Explore P: 0.0107
Episode: 888 Total reward: 199.0 Training loss: 0.3737 Explore P: 0.0107
Episode: 889 Total reward: 199.0 Training loss: 0.2301 Explore P: 0.0106
Episode: 890 Total reward: 199.0 Training loss: 0.4295 Explore P: 0.0106
Episode: 891 Total reward: 199.0 Training loss: 0.3427 Explore P: 0.0106
Episode: 892 Total reward: 199.0 Training loss: 0.2069 Explore P: 0.0106
Episode: 893 Total reward: 199.0 Training loss: 0.2332 Explore P: 0.0106
Episode: 894 Total reward: 199.0 Training loss: 0.2032 Explore P: 0.0106
Episode: 895 Total reward: 199.0 Training loss: 0.2560 Explore P: 0.0106
Episode: 896 Total reward: 199.0 Training loss: 0.1294 Explore P: 0.0106
Episode: 897 Total reward: 199.0 Training loss: 0.2867 Explore P: 0.0105
Episode: 898 Total reward: 199.0 Training loss: 0.2221 Explore P: 0.0105
Episode: 899 Total reward: 199.0 Training loss: 0.3525 Explore P: 0.0105
Episode: 900 Total reward: 199.0 Training loss: 0.1707 Explore P: 0.0105
Episode: 901 Total reward: 199.0 Training loss: 0.1714 Explore P: 0.0105
Episode: 902 Total reward: 199.0 Training loss: 0.3224 Explore P: 0.0105
Episode: 903 Total reward: 199.0 Training loss: 227.6463 Explore P: 0.0105
Episode: 904 Total reward: 199.0 Training loss: 113.8036 Explore P: 0.0105
Episode: 905 Total reward: 199.0 Training loss: 331.4269 Explore P: 0.0105
Episode: 906 Total reward: 199.0 Training loss: 0.4368 Explore P: 0.0105
Episode: 907 Total reward: 199.0 Training loss: 0.1310 Explore P: 0.0104
Episode: 908 Total reward: 199.0 Training loss: 0.1131 Explore P: 0.0104
Episode: 909 Total reward: 199.0 Training loss: 0.1390 Explore P: 0.0104
Episode: 910 Total reward: 199.0 Training loss: 0.2447 Explore P: 0.0104
Episode: 911 Total reward: 199.0 Training loss: 0.0916 Explore P: 0.0104
Episode: 912 Total reward: 199.0 Training loss: 0.1136 Explore P: 0.0104
Episode: 913 Total reward: 199.0 Training loss: 0.0677 Explore P: 0.0104
Episode: 914 Total reward: 199.0 Training loss: 0.2081 Explore P: 0.0104
Episode: 915 Total reward: 199.0 Training loss: 0.4335 Explore P: 0.0104
Episode: 916 Total reward: 199.0 Training loss: 0.1258 Explore P: 0.0104
Episode: 917 Total reward: 199.0 Training loss: 0.0931 Explore P: 0.0104
Episode: 918 Total reward: 199.0 Training loss: 110.0546 Explore P: 0.0104
Episode: 919 Total reward: 199.0 Training loss: 0.1038 Explore P: 0.0104
Episode: 920 Total reward: 199.0 Training loss: 0.1247 Explore P: 0.0103
Episode: 921 Total reward: 199.0 Training loss: 0.2390 Explore P: 0.0103
Episode: 922 Total reward: 199.0 Training loss: 0.1640 Explore P: 0.0103
Episode: 923 Total reward: 199.0 Training loss: 0.3429 Explore P: 0.0103
Episode: 924 Total reward: 199.0 Training loss: 0.1769 Explore P: 0.0103
Episode: 925 Total reward: 199.0 Training loss: 0.2355 Explore P: 0.0103
Episode: 926 Total reward: 199.0 Training loss: 0.2293 Explore P: 0.0103
Episode: 927 Total reward: 199.0 Training loss: 0.1439 Explore P: 0.0103
Episode: 928 Total reward: 199.0 Training loss: 0.2100 Explore P: 0.0103
Episode: 929 Total reward: 199.0 Training loss: 0.1814 Explore P: 0.0103
Episode: 930 Total reward: 199.0 Training loss: 0.3097 Explore P: 0.0103
Episode: 931 Total reward: 199.0 Training loss: 0.2410 Explore P: 0.0103
Episode: 932 Total reward: 199.0 Training loss: 0.1630 Explore P: 0.0103
Episode: 933 Total reward: 199.0 Training loss: 0.2933 Explore P: 0.0103
Episode: 934 Total reward: 199.0 Training loss: 0.2392 Explore P: 0.0103
Episode: 935 Total reward: 199.0 Training loss: 0.3694 Explore P: 0.0103
Episode: 936 Total reward: 199.0 Training loss: 0.1583 Explore P: 0.0103
Episode: 937 Total reward: 199.0 Training loss: 262.1597 Explore P: 0.0102
Episode: 938 Total reward: 199.0 Training loss: 0.2021 Explore P: 0.0102
Episode: 939 Total reward: 199.0 Training loss: 0.3181 Explore P: 0.0102
Episode: 940 Total reward: 199.0 Training loss: 0.2084 Explore P: 0.0102
Episode: 941 Total reward: 199.0 Training loss: 0.2194 Explore P: 0.0102
Episode: 942 Total reward: 199.0 Training loss: 0.1105 Explore P: 0.0102
Episode: 943 Total reward: 199.0 Training loss: 0.1821 Explore P: 0.0102
Episode: 944 Total reward: 199.0 Training loss: 0.2062 Explore P: 0.0102
Episode: 945 Total reward: 199.0 Training loss: 0.2370 Explore P: 0.0102
Episode: 946 Total reward: 199.0 Training loss: 0.2308 Explore P: 0.0102
Episode: 947 Total reward: 199.0 Training loss: 0.1992 Explore P: 0.0102
Episode: 948 Total reward: 199.0 Training loss: 0.3247 Explore P: 0.0102
Episode: 949 Total reward: 199.0 Training loss: 0.3461 Explore P: 0.0102
Episode: 950 Total reward: 199.0 Training loss: 0.4173 Explore P: 0.0102
Episode: 951 Total reward: 199.0 Training loss: 0.2726 Explore P: 0.0102
Episode: 952 Total reward: 199.0 Training loss: 317.9990 Explore P: 0.0102
Episode: 953 Total reward: 199.0 Training loss: 0.5643 Explore P: 0.0102
Episode: 954 Total reward: 199.0 Training loss: 0.2528 Explore P: 0.0102
Episode: 955 Total reward: 199.0 Training loss: 0.3437 Explore P: 0.0102
Episode: 956 Total reward: 199.0 Training loss: 0.2868 Explore P: 0.0102
Episode: 957 Total reward: 199.0 Training loss: 0.3260 Explore P: 0.0102
Episode: 958 Total reward: 199.0 Training loss: 0.3043 Explore P: 0.0102
Episode: 959 Total reward: 199.0 Training loss: 0.2473 Explore P: 0.0102
Episode: 960 Total reward: 199.0 Training loss: 0.1918 Explore P: 0.0102
Episode: 961 Total reward: 199.0 Training loss: 0.2141 Explore P: 0.0102
Episode: 962 Total reward: 199.0 Training loss: 0.3718 Explore P: 0.0101
Episode: 963 Total reward: 199.0 Training loss: 0.2361 Explore P: 0.0101
Episode: 964 Total reward: 199.0 Training loss: 0.4131 Explore P: 0.0101
Episode: 965 Total reward: 199.0 Training loss: 0.2715 Explore P: 0.0101
Episode: 966 Total reward: 199.0 Training loss: 284.2658 Explore P: 0.0101
Episode: 967 Total reward: 199.0 Training loss: 289.4756 Explore P: 0.0101
Episode: 968 Total reward: 199.0 Training loss: 0.2891 Explore P: 0.0101
Episode: 969 Total reward: 199.0 Training loss: 0.1900 Explore P: 0.0101
Episode: 970 Total reward: 199.0 Training loss: 0.1687 Explore P: 0.0101
Episode: 971 Total reward: 199.0 Training loss: 0.2945 Explore P: 0.0101
Episode: 972 Total reward: 199.0 Training loss: 0.3203 Explore P: 0.0101
Episode: 973 Total reward: 199.0 Training loss: 0.4285 Explore P: 0.0101
Episode: 974 Total reward: 199.0 Training loss: 0.2689 Explore P: 0.0101
Episode: 975 Total reward: 199.0 Training loss: 0.4009 Explore P: 0.0101
Episode: 976 Total reward: 199.0 Training loss: 0.3709 Explore P: 0.0101
Episode: 977 Total reward: 199.0 Training loss: 0.2660 Explore P: 0.0101
Episode: 978 Total reward: 199.0 Training loss: 0.2870 Explore P: 0.0101
Episode: 979 Total reward: 199.0 Training loss: 287.2943 Explore P: 0.0101
Episode: 980 Total reward: 199.0 Training loss: 0.3534 Explore P: 0.0101
Episode: 981 Total reward: 199.0 Training loss: 329.7061 Explore P: 0.0101
Episode: 982 Total reward: 199.0 Training loss: 0.2740 Explore P: 0.0101
Episode: 983 Total reward: 199.0 Training loss: 0.2424 Explore P: 0.0101
Episode: 984 Total reward: 199.0 Training loss: 0.2969 Explore P: 0.0101
Episode: 985 Total reward: 199.0 Training loss: 0.2392 Explore P: 0.0101
Episode: 986 Total reward: 199.0 Training loss: 0.2885 Explore P: 0.0101
Episode: 987 Total reward: 199.0 Training loss: 272.6725 Explore P: 0.0101
Episode: 988 Total reward: 199.0 Training loss: 0.2334 Explore P: 0.0101
Episode: 989 Total reward: 199.0 Training loss: 0.2358 Explore P: 0.0101
Episode: 990 Total reward: 199.0 Training loss: 0.4193 Explore P: 0.0101
Episode: 991 Total reward: 199.0 Training loss: 0.2654 Explore P: 0.0101
Episode: 992 Total reward: 199.0 Training loss: 0.5570 Explore P: 0.0101
Episode: 993 Total reward: 199.0 Training loss: 0.3501 Explore P: 0.0101
Episode: 994 Total reward: 199.0 Training loss: 0.3568 Explore P: 0.0101
Episode: 995 Total reward: 199.0 Training loss: 244.7101 Explore P: 0.0101
Episode: 996 Total reward: 199.0 Training loss: 0.3210 Explore P: 0.0101
Episode: 997 Total reward: 199.0 Training loss: 0.2618 Explore P: 0.0101
Episode: 998 Total reward: 199.0 Training loss: 0.3606 Explore P: 0.0101
Episode: 999 Total reward: 199.0 Training loss: 0.1888 Explore P: 0.0101

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a rolling average over 10 episodes in blue.


In [12]:
%matplotlib inline
import matplotlib.pyplot as plt

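def running_mean(x, N):
    # Rolling mean over a window of N values, computed from a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Cumulative sums taken N apart differ by exactly one window's total
    return (cumsum[N:] - cumsum[:-N]) / N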
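
To see what the cumulative-sum trick in running_mean is doing, here's a small sketch on a toy array. The reward values are made up for illustration; they are not from the training run. The result has len(x) - N + 1 entries, which is why the plotting cell slices the episode numbers with eps[-len(smoothed_rews):].

```python
import numpy as np

def running_mean(x, N):
    # Prepend a zero so cumsum[i] holds the sum of the first i values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Subtracting sums N apart leaves the total of each N-wide window
    return (cumsum[N:] - cumsum[:-N]) / N

# Toy rewards, chosen only to make the arithmetic easy to follow
rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
smoothed = running_mean(rewards, 3)

print(smoothed)       # one mean per full window of 3 values
print(len(smoothed))  # len(rewards) - 3 + 1
```

The same numbers come out of np.convolve(rewards, np.ones(3) / 3, mode='valid'), which is a handy cross-check.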

In [13]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[13]:
<matplotlib.text.Text at 0x196888ef160>

Testing

Let's check out how our trained agent plays the game.


In [14]:
test_episodes = 10
test_max_steps = 400
# Keep the initial observation so `state` is defined before the first prediction
state = env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(test_episodes):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


INFO:tensorflow:Restoring parameters from checkpoints\cartpole.ckpt
[2017-06-23 21:43:35,968] Restoring parameters from checkpoints\cartpole.ckpt
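
The action selection in the test loop is worth a closer look. Here's a minimal NumPy-only sketch of the two steps: adding a batch dimension to the state, then picking the action with the largest Q-value. The state and the Q-values below are made up for illustration; in the real loop the Q-values come from sess.run(mainQN.output, ...).

```python
import numpy as np

# A made-up Cart-Pole observation: cart position, cart velocity,
# pole angle, pole angular velocity
state = np.array([0.02, -0.01, 0.03, 0.04])

# The network expects a batch of states, so add a leading dimension of 1
batch = state.reshape((1, *state.shape))

# Hypothetical network output: one row of Q-values, one column per action
Qs = np.array([[0.8, 1.3]])

# Greedy choice: the index of the largest Q-value is the action to take
action = int(np.argmax(Qs))

print(batch.shape)
print(action)
```

Note that there's no random exploration here. During testing the agent always takes the action it currently believes is best.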

In [16]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd feed the screen images through convolutional layers so the network learns the state from the raw pixels.
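
As a rough starting point, here's a sketch of how screen frames could be turned into an input for convolutional layers. Everything in it is an assumption for illustration: the preprocess_frame and FrameStack names are hypothetical, the simple every-other-pixel downsampling is a stand-in for a proper resize, and the random array stands in for a real 210x160 RGB Atari frame. Stacking the last few frames is a common way to let the network see motion, since a single image doesn't show which way things are moving.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """Turn an RGB frame (H, W, 3) into a smaller grayscale image in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2)  # average the color channels
    small = gray[::2, ::2]                        # keep every other pixel
    return small / 255.0

class FrameStack:
    """Hypothetical helper: keep the last `size` frames as one (H, W, size) state."""
    def __init__(self, size=4):
        self.size = size
        self.frames = deque(maxlen=size)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        processed = preprocess_frame(frame)
        for _ in range(self.size):
            self.frames.append(processed)
        return self.state()

    def push(self, frame):
        # Adding a new frame drops the oldest one automatically
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Stand-in for a screen image; a real one would come from the environment
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(size=4)
state = stack.reset(fake_frame)
state = stack.push(fake_frame)
print(state.shape)
```

The stacked array would take the place of the Cart-Pole state vector as the network input, with convolutional layers in front of the fully connected ones.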

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which should get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.