Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included gym as a submodule, so you can run git submodule update --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-05-02 11:39:54,605] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space, and to get a random action you can use env.action_space.sample(). This interface is common to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()
rewards = []
for _ in range(1000):
    env.render()
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    rewards.append(reward)
    if done:
        rewards = []
        env.reset()

In [4]:
env.action_space


Out[4]:
Discrete(2)

To shut the window showing the simulation, use env.close().

If you ran the simulation above, we can look at the rewards:


In [5]:
print(rewards[-20:])


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

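$$ Q(s, a) = r + \gamma \max_{a'}{Q(s', a')} $$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor for future rewards, and $s'$ is the next state reached from state $s$ by taking action $a$. The maximum is taken over the actions $a'$ available in $s'$.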

Previously, we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating point precision, you practically have infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

Now, our Q value, $Q(s, a)$, is calculated by passing a state into the network. The network has fully connected hidden layers, and its output is one Q-value for each available action.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
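
To make the target calculation concrete, here is a minimal NumPy sketch for a mini-batch of three transitions. The rewards, the next-state Q-values, and the value of gamma are made-up numbers for illustration, not outputs of the network in this notebook. Transitions that end an episode have no next state, so their bootstrap term is set to zero.

```python
import numpy as np

gamma = 0.99  # discount factor (illustrative value)

# Hypothetical mini-batch of three transitions
rewards = np.array([1.0, 1.0, 1.0])

# Q(s', a') for both actions, as a network might output for each next state
next_Qs = np.array([[0.5, 2.0],
                    [3.0, 1.0],
                    [4.0, 7.0]])

# True where the episode ended, so there is nothing to bootstrap from
episode_ends = np.array([False, False, True])

# Zero out the bootstrap term for terminal transitions
next_Qs[episode_ends] = 0.0

# Q_hat = r + gamma * max over a' of Q(s', a')
targets = rewards + gamma * next_Qs.max(axis=1)
print(targets)
```

The first two targets come out to roughly 2.98 and 3.97, while the terminal transition's target is just its reward, 1.0.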

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [6]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
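
If the one-hot trick used to compute self.Q is unfamiliar, the same idea can be sketched in plain NumPy. The output values below are made up, and np.eye(2)[actions] is used here as a stand-in for tf.one_hot. Multiplying by the one-hot rows zeroes out the Q-value of the action that wasn't taken, so the row sum leaves only the Q-value of the chosen action.

```python
import numpy as np

# Hypothetical network outputs for a batch of three states (two actions each)
output = np.array([[0.1, 0.9],
                   [0.4, 0.2],
                   [0.7, 0.3]])
actions = np.array([1, 0, 0])   # action taken in each transition

# NumPy stand-in for tf.one_hot(actions, 2)
one_hot_actions = np.eye(2)[actions]

# Multiply and sum across the action axis: only the chosen action's Q survives
Q = np.sum(output * one_hot_actions, axis=1)
print(Q)

# Same result with integer indexing, shown only for comparison
Q_indexed = output[np.arange(len(actions)), actions]
```

Both Q and Q_indexed hold 0.9, 0.4 and 0.7, one value per row.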

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [7]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
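
Here is a quick illustration of the "tube" behavior, using plain integers in place of real experiences. Once the deque reaches maxlen, each append on the right drops the oldest item from the left.

```python
from collections import deque

buffer = deque(maxlen=3)
for experience in range(5):
    buffer.append(experience)

print(list(buffer))  # [2, 3, 4]
```

One thing to keep in mind with the sample method above: np.random.choice with replace=False raises a ValueError when batch_size is larger than the number of stored experiences, so the memory needs at least batch_size entries before the first call.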

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
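
As a rough sketch of how this could look in code, the snippet below pairs an exponentially decaying exploration probability with an $\epsilon$-greedy choice. The helper names explore_probability and epsilon_greedy, the default start, stop and decay values, and the Q-values are all assumptions made for illustration.

```python
import numpy as np

def explore_probability(step, start=1.0, stop=0.01, decay_rate=0.0001):
    """Exponentially decay epsilon from `start` toward `stop` as steps accumulate."""
    return stop + (start - stop) * np.exp(-decay_rate * step)

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.2, 1.5])   # hypothetical Q-values for actions 0 and 1

for step in (0, 10000, 100000):
    eps = explore_probability(step)
    action = epsilon_greedy(q_values, eps, rng)
    print(step, round(float(eps), 4), action)
```

With these illustrative defaults, epsilon starts at 1.0 (always explore), falls to about 0.37 after 10,000 steps, and is close to the 0.01 floor after 100,000 steps, at which point the agent almost always picks the greedy action.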

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t = 1$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
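
The steps above can be sketched end to end on a toy problem. To keep the example small and self-contained, this sketch makes several assumptions that are not part of this notebook: a hypothetical five-cell corridor stands in for Cart-Pole, a NumPy table stands in for the Q-network, and a simple update with step size alpha stands in for the gradient descent step. The loop structure (act, store, sample, compute targets, update) is the part that carries over.

```python
import random
from collections import deque
import numpy as np

# Hypothetical toy environment: a corridor of 5 cells. The agent starts in
# cell 0; action 1 moves right, action 0 moves left. Reaching cell 4 ends
# the episode with reward 1; every other step gives reward 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

random.seed(0)

gamma, alpha = 0.9, 0.5                  # illustrative discount and step size
memory = deque(maxlen=500)               # initialize the memory D
Q = np.zeros((N_STATES, N_ACTIONS))      # a table stands in for the network

for episode in range(200):
    epsilon = max(0.1, 0.95 ** episode)  # explore a lot early, less later
    state = 0
    for t in range(50):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = int(np.argmax(Q[state]))

        # execute the action and store the transition
        next_state, reward, done = step(state, action)
        memory.append((state, action, reward, next_state, done))

        # sample a mini-batch and move Q toward the targets
        batch = random.sample(list(memory), min(len(memory), 8))
        for s, a, r, s2, d in batch:
            target = r if d else r + gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])

        state = next_state
        if done:
            break

greedy_policy = np.argmax(Q, axis=1)[:GOAL]
print(greedy_policy)
```

After training, the greedy policy should be "move right" (action 1) in every non-terminal cell, with the learned values shrinking by a factor of gamma for each extra step away from the goal.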

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [8]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pre-populate the memory with

In [9]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [10]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [11]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
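    step = 0
    # Define loss up front in case an episode ends before the first training step
    loss = float('nan')
    for ep in range(1, train_episodes + 1):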
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render()
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 7.0 Training loss: 1.1028 Explore P: 0.9993
Episode: 2 Total reward: 34.0 Training loss: 1.0481 Explore P: 0.9959
Episode: 3 Total reward: 11.0 Training loss: 1.1072 Explore P: 0.9949
Episode: 4 Total reward: 32.0 Training loss: 0.9968 Explore P: 0.9917
Episode: 5 Total reward: 27.0 Training loss: 1.0651 Explore P: 0.9891
Episode: 6 Total reward: 11.0 Training loss: 1.0336 Explore P: 0.9880
Episode: 7 Total reward: 25.0 Training loss: 1.1523 Explore P: 0.9856
Episode: 8 Total reward: 10.0 Training loss: 1.0735 Explore P: 0.9846
Episode: 9 Total reward: 14.0 Training loss: 1.1519 Explore P: 0.9832
Episode: 10 Total reward: 24.0 Training loss: 1.1361 Explore P: 0.9809
Episode: 11 Total reward: 27.0 Training loss: 1.0428 Explore P: 0.9783
Episode: 12 Total reward: 12.0 Training loss: 1.1592 Explore P: 0.9771
Episode: 13 Total reward: 19.0 Training loss: 1.2369 Explore P: 0.9753
Episode: 14 Total reward: 13.0 Training loss: 0.8544 Explore P: 0.9740
Episode: 15 Total reward: 16.0 Training loss: 1.2395 Explore P: 0.9725
Episode: 16 Total reward: 35.0 Training loss: 1.0533 Explore P: 0.9691
Episode: 17 Total reward: 9.0 Training loss: 1.0708 Explore P: 0.9682
Episode: 18 Total reward: 35.0 Training loss: 1.3408 Explore P: 0.9649
Episode: 19 Total reward: 28.0 Training loss: 1.7138 Explore P: 0.9622
Episode: 20 Total reward: 12.0 Training loss: 1.7174 Explore P: 0.9611
Episode: 21 Total reward: 16.0 Training loss: 1.6269 Explore P: 0.9596
Episode: 22 Total reward: 15.0 Training loss: 1.9143 Explore P: 0.9581
Episode: 23 Total reward: 16.0 Training loss: 2.1082 Explore P: 0.9566
Episode: 24 Total reward: 13.0 Training loss: 2.7797 Explore P: 0.9554
Episode: 25 Total reward: 59.0 Training loss: 4.0444 Explore P: 0.9498
Episode: 26 Total reward: 8.0 Training loss: 2.5163 Explore P: 0.9491
Episode: 27 Total reward: 22.0 Training loss: 2.9726 Explore P: 0.9470
Episode: 28 Total reward: 21.0 Training loss: 6.2690 Explore P: 0.9451
Episode: 29 Total reward: 31.0 Training loss: 3.1001 Explore P: 0.9422
Episode: 30 Total reward: 15.0 Training loss: 3.9732 Explore P: 0.9408
Episode: 31 Total reward: 20.0 Training loss: 5.2200 Explore P: 0.9389
Episode: 32 Total reward: 13.0 Training loss: 5.5802 Explore P: 0.9377
Episode: 33 Total reward: 14.0 Training loss: 4.8454 Explore P: 0.9364
Episode: 34 Total reward: 15.0 Training loss: 3.7594 Explore P: 0.9350
Episode: 35 Total reward: 19.0 Training loss: 2.8156 Explore P: 0.9333
Episode: 36 Total reward: 22.0 Training loss: 4.7447 Explore P: 0.9312
Episode: 37 Total reward: 25.0 Training loss: 7.3484 Explore P: 0.9289
Episode: 38 Total reward: 9.0 Training loss: 6.2035 Explore P: 0.9281
Episode: 39 Total reward: 12.0 Training loss: 6.5042 Explore P: 0.9270
Episode: 40 Total reward: 14.0 Training loss: 12.2538 Explore P: 0.9257
Episode: 41 Total reward: 13.0 Training loss: 9.7314 Explore P: 0.9245
Episode: 42 Total reward: 19.0 Training loss: 19.6574 Explore P: 0.9228
Episode: 43 Total reward: 15.0 Training loss: 15.4277 Explore P: 0.9214
Episode: 44 Total reward: 28.0 Training loss: 38.2870 Explore P: 0.9189
Episode: 45 Total reward: 26.0 Training loss: 20.9992 Explore P: 0.9165
Episode: 46 Total reward: 65.0 Training loss: 26.3639 Explore P: 0.9106
Episode: 47 Total reward: 34.0 Training loss: 26.3463 Explore P: 0.9076
Episode: 48 Total reward: 14.0 Training loss: 15.3544 Explore P: 0.9063
Episode: 49 Total reward: 55.0 Training loss: 51.6470 Explore P: 0.9014
Episode: 50 Total reward: 40.0 Training loss: 14.1727 Explore P: 0.8979
Episode: 51 Total reward: 20.0 Training loss: 22.6161 Explore P: 0.8961
Episode: 52 Total reward: 19.0 Training loss: 45.0556 Explore P: 0.8944
Episode: 53 Total reward: 11.0 Training loss: 63.0143 Explore P: 0.8934
Episode: 54 Total reward: 70.0 Training loss: 162.8467 Explore P: 0.8873
Episode: 55 Total reward: 26.0 Training loss: 88.3107 Explore P: 0.8850
Episode: 56 Total reward: 21.0 Training loss: 104.5097 Explore P: 0.8831
Episode: 57 Total reward: 27.0 Training loss: 61.2001 Explore P: 0.8808
Episode: 58 Total reward: 17.0 Training loss: 68.7336 Explore P: 0.8793
Episode: 59 Total reward: 18.0 Training loss: 27.0309 Explore P: 0.8778
Episode: 60 Total reward: 11.0 Training loss: 206.8747 Explore P: 0.8768
Episode: 61 Total reward: 23.0 Training loss: 32.5811 Explore P: 0.8748
Episode: 62 Total reward: 13.0 Training loss: 45.2181 Explore P: 0.8737
Episode: 63 Total reward: 13.0 Training loss: 37.2761 Explore P: 0.8726
Episode: 64 Total reward: 22.0 Training loss: 60.9476 Explore P: 0.8707
Episode: 65 Total reward: 27.0 Training loss: 76.0577 Explore P: 0.8683
Episode: 66 Total reward: 16.0 Training loss: 47.5081 Explore P: 0.8670
Episode: 67 Total reward: 14.0 Training loss: 365.4915 Explore P: 0.8658
Episode: 68 Total reward: 28.0 Training loss: 54.4040 Explore P: 0.8634
Episode: 69 Total reward: 18.0 Training loss: 199.2112 Explore P: 0.8618
Episode: 70 Total reward: 10.0 Training loss: 40.1610 Explore P: 0.8610
Episode: 71 Total reward: 10.0 Training loss: 190.4617 Explore P: 0.8601
Episode: 72 Total reward: 8.0 Training loss: 287.9117 Explore P: 0.8595
Episode: 73 Total reward: 35.0 Training loss: 473.8464 Explore P: 0.8565
Episode: 74 Total reward: 31.0 Training loss: 210.2957 Explore P: 0.8539
Episode: 75 Total reward: 28.0 Training loss: 266.0302 Explore P: 0.8515
Episode: 76 Total reward: 28.0 Training loss: 358.3625 Explore P: 0.8492
Episode: 77 Total reward: 12.0 Training loss: 42.9481 Explore P: 0.8482
Episode: 78 Total reward: 9.0 Training loss: 42.6347 Explore P: 0.8474
Episode: 79 Total reward: 19.0 Training loss: 200.9799 Explore P: 0.8458
Episode: 80 Total reward: 14.0 Training loss: 296.6665 Explore P: 0.8446
Episode: 81 Total reward: 14.0 Training loss: 81.0296 Explore P: 0.8435
Episode: 82 Total reward: 15.0 Training loss: 58.1700 Explore P: 0.8422
Episode: 83 Total reward: 18.0 Training loss: 707.2612 Explore P: 0.8407
Episode: 84 Total reward: 19.0 Training loss: 142.9160 Explore P: 0.8392
Episode: 85 Total reward: 16.0 Training loss: 62.0031 Explore P: 0.8378
Episode: 86 Total reward: 9.0 Training loss: 114.0117 Explore P: 0.8371
Episode: 87 Total reward: 19.0 Training loss: 144.3148 Explore P: 0.8355
Episode: 88 Total reward: 15.0 Training loss: 60.4588 Explore P: 0.8343
Episode: 89 Total reward: 12.0 Training loss: 1064.2581 Explore P: 0.8333
Episode: 90 Total reward: 22.0 Training loss: 135.7678 Explore P: 0.8315
Episode: 91 Total reward: 14.0 Training loss: 82.4249 Explore P: 0.8303
Episode: 92 Total reward: 18.0 Training loss: 59.9501 Explore P: 0.8289
Episode: 93 Total reward: 17.0 Training loss: 612.0664 Explore P: 0.8275
Episode: 94 Total reward: 20.0 Training loss: 49.3571 Explore P: 0.8258
Episode: 95 Total reward: 13.0 Training loss: 768.2960 Explore P: 0.8248
Episode: 96 Total reward: 13.0 Training loss: 892.0749 Explore P: 0.8237
Episode: 97 Total reward: 17.0 Training loss: 44.4016 Explore P: 0.8223
Episode: 98 Total reward: 10.0 Training loss: 68.0192 Explore P: 0.8215
Episode: 99 Total reward: 16.0 Training loss: 371.8613 Explore P: 0.8202
Episode: 100 Total reward: 43.0 Training loss: 182.8324 Explore P: 0.8167
Episode: 101 Total reward: 19.0 Training loss: 59.7391 Explore P: 0.8152
Episode: 102 Total reward: 10.0 Training loss: 1491.1654 Explore P: 0.8144
Episode: 103 Total reward: 9.0 Training loss: 241.9762 Explore P: 0.8137
Episode: 104 Total reward: 23.0 Training loss: 62.5773 Explore P: 0.8118
Episode: 105 Total reward: 11.0 Training loss: 60.5156 Explore P: 0.8110
Episode: 106 Total reward: 9.0 Training loss: 80.4232 Explore P: 0.8102
Episode: 107 Total reward: 19.0 Training loss: 63.8508 Explore P: 0.8087
Episode: 108 Total reward: 10.0 Training loss: 162.7889 Explore P: 0.8079
Episode: 109 Total reward: 18.0 Training loss: 274.0419 Explore P: 0.8065
Episode: 110 Total reward: 12.0 Training loss: 51.2454 Explore P: 0.8055
Episode: 111 Total reward: 11.0 Training loss: 1138.6116 Explore P: 0.8047
Episode: 112 Total reward: 9.0 Training loss: 903.6786 Explore P: 0.8039
Episode: 113 Total reward: 13.0 Training loss: 61.5233 Explore P: 0.8029
Episode: 114 Total reward: 17.0 Training loss: 1154.1298 Explore P: 0.8016
Episode: 115 Total reward: 45.0 Training loss: 869.0670 Explore P: 0.7980
Episode: 116 Total reward: 12.0 Training loss: 246.2868 Explore P: 0.7971
Episode: 117 Total reward: 22.0 Training loss: 1003.0303 Explore P: 0.7953
Episode: 118 Total reward: 17.0 Training loss: 494.2971 Explore P: 0.7940
Episode: 119 Total reward: 23.0 Training loss: 53.3145 Explore P: 0.7922
Episode: 120 Total reward: 9.0 Training loss: 57.5362 Explore P: 0.7915
Episode: 121 Total reward: 24.0 Training loss: 264.8782 Explore P: 0.7896
Episode: 122 Total reward: 11.0 Training loss: 44.5046 Explore P: 0.7888
Episode: 123 Total reward: 15.0 Training loss: 55.5986 Explore P: 0.7876
Episode: 124 Total reward: 20.0 Training loss: 776.5046 Explore P: 0.7860
Episode: 125 Total reward: 13.0 Training loss: 55.8475 Explore P: 0.7850
Episode: 126 Total reward: 18.0 Training loss: 51.7594 Explore P: 0.7836
Episode: 127 Total reward: 15.0 Training loss: 518.2798 Explore P: 0.7825
Episode: 128 Total reward: 12.0 Training loss: 507.2245 Explore P: 0.7816
Episode: 129 Total reward: 12.0 Training loss: 702.2658 Explore P: 0.7806
Episode: 130 Total reward: 18.0 Training loss: 49.0397 Explore P: 0.7792
Episode: 131 Total reward: 14.0 Training loss: 493.5799 Explore P: 0.7782
Episode: 132 Total reward: 25.0 Training loss: 248.9860 Explore P: 0.7762
Episode: 133 Total reward: 12.0 Training loss: 376.7777 Explore P: 0.7753
Episode: 134 Total reward: 17.0 Training loss: 336.8404 Explore P: 0.7740
Episode: 135 Total reward: 15.0 Training loss: 241.4646 Explore P: 0.7729
Episode: 136 Total reward: 10.0 Training loss: 654.8826 Explore P: 0.7721
Episode: 137 Total reward: 35.0 Training loss: 1027.5281 Explore P: 0.7695
Episode: 138 Total reward: 20.0 Training loss: 37.3980 Explore P: 0.7679
Episode: 139 Total reward: 28.0 Training loss: 708.7606 Explore P: 0.7658
Episode: 140 Total reward: 26.0 Training loss: 1258.2268 Explore P: 0.7639
Episode: 141 Total reward: 26.0 Training loss: 1213.4128 Explore P: 0.7619
Episode: 142 Total reward: 20.0 Training loss: 18.9030 Explore P: 0.7604
Episode: 143 Total reward: 13.0 Training loss: 1426.3754 Explore P: 0.7594
Episode: 144 Total reward: 46.0 Training loss: 16.1868 Explore P: 0.7560
Episode: 145 Total reward: 11.0 Training loss: 408.5796 Explore P: 0.7552
Episode: 146 Total reward: 20.0 Training loss: 18.3627 Explore P: 0.7537
Episode: 147 Total reward: 12.0 Training loss: 16.2877 Explore P: 0.7528
Episode: 148 Total reward: 35.0 Training loss: 16.8376 Explore P: 0.7502
Episode: 149 Total reward: 25.0 Training loss: 669.7839 Explore P: 0.7483
Episode: 150 Total reward: 20.0 Training loss: 11.0631 Explore P: 0.7469
Episode: 151 Total reward: 22.0 Training loss: 309.9548 Explore P: 0.7452
Episode: 152 Total reward: 21.0 Training loss: 593.1777 Explore P: 0.7437
Episode: 153 Total reward: 19.0 Training loss: 267.4352 Explore P: 0.7423
Episode: 154 Total reward: 17.0 Training loss: 907.7632 Explore P: 0.7411
Episode: 155 Total reward: 10.0 Training loss: 382.3899 Explore P: 0.7403
Episode: 156 Total reward: 14.0 Training loss: 12.1659 Explore P: 0.7393
Episode: 157 Total reward: 9.0 Training loss: 12.2832 Explore P: 0.7387
Episode: 158 Total reward: 52.0 Training loss: 303.6866 Explore P: 0.7349
Episode: 159 Total reward: 15.0 Training loss: 508.4670 Explore P: 0.7338
Episode: 160 Total reward: 15.0 Training loss: 11.6129 Explore P: 0.7327
Episode: 161 Total reward: 12.0 Training loss: 10.2953 Explore P: 0.7318
Episode: 162 Total reward: 11.0 Training loss: 10.8632 Explore P: 0.7310
Episode: 163 Total reward: 9.0 Training loss: 11.1105 Explore P: 0.7304
Episode: 164 Total reward: 19.0 Training loss: 833.0119 Explore P: 0.7290
Episode: 165 Total reward: 13.0 Training loss: 640.4985 Explore P: 0.7281
Episode: 166 Total reward: 28.0 Training loss: 304.0070 Explore P: 0.7261
Episode: 167 Total reward: 11.0 Training loss: 261.3618 Explore P: 0.7253
Episode: 168 Total reward: 15.0 Training loss: 4.8770 Explore P: 0.7242
Episode: 169 Total reward: 21.0 Training loss: 565.4949 Explore P: 0.7227
Episode: 170 Total reward: 35.0 Training loss: 380.5205 Explore P: 0.7202
Episode: 171 Total reward: 9.0 Training loss: 217.9955 Explore P: 0.7196
Episode: 172 Total reward: 10.0 Training loss: 220.0763 Explore P: 0.7189
Episode: 173 Total reward: 19.0 Training loss: 3.7376 Explore P: 0.7175
Episode: 174 Total reward: 28.0 Training loss: 5.6056 Explore P: 0.7156
Episode: 175 Total reward: 14.0 Training loss: 911.5320 Explore P: 0.7146
Episode: 176 Total reward: 11.0 Training loss: 464.2128 Explore P: 0.7138
Episode: 177 Total reward: 27.0 Training loss: 204.1582 Explore P: 0.7119
Episode: 178 Total reward: 15.0 Training loss: 4.3980 Explore P: 0.7109
Episode: 179 Total reward: 14.0 Training loss: 2.7889 Explore P: 0.7099
Episode: 180 Total reward: 42.0 Training loss: 446.1270 Explore P: 0.7069
Episode: 181 Total reward: 20.0 Training loss: 191.9564 Explore P: 0.7056
Episode: 182 Total reward: 10.0 Training loss: 3.4906 Explore P: 0.7049
Episode: 183 Total reward: 20.0 Training loss: 441.7799 Explore P: 0.7035
Episode: 184 Total reward: 8.0 Training loss: 234.7765 Explore P: 0.7029
Episode: 185 Total reward: 28.0 Training loss: 405.4051 Explore P: 0.7010
Episode: 186 Total reward: 14.0 Training loss: 173.0782 Explore P: 0.7000
Episode: 187 Total reward: 13.0 Training loss: 4.8969 Explore P: 0.6991
Episode: 188 Total reward: 10.0 Training loss: 1.9227 Explore P: 0.6984
Episode: 189 Total reward: 13.0 Training loss: 2.6204 Explore P: 0.6975
Episode: 190 Total reward: 13.0 Training loss: 2.2222 Explore P: 0.6966
Episode: 191 Total reward: 22.0 Training loss: 2.7308 Explore P: 0.6951
Episode: 192 Total reward: 10.0 Training loss: 144.8496 Explore P: 0.6944
Episode: 193 Total reward: 17.0 Training loss: 1.2887 Explore P: 0.6933
Episode: 194 Total reward: 11.0 Training loss: 3.0330 Explore P: 0.6925
Episode: 195 Total reward: 9.0 Training loss: 3.7834 Explore P: 0.6919
Episode: 196 Total reward: 13.0 Training loss: 448.1702 Explore P: 0.6910
Episode: 197 Total reward: 18.0 Training loss: 291.3376 Explore P: 0.6898
Episode: 198 Total reward: 19.0 Training loss: 2.4611 Explore P: 0.6885
Episode: 199 Total reward: 9.0 Training loss: 1.4512 Explore P: 0.6879
Episode: 200 Total reward: 9.0 Training loss: 173.2256 Explore P: 0.6873
Episode: 201 Total reward: 9.0 Training loss: 4.7449 Explore P: 0.6867
Episode: 202 Total reward: 25.0 Training loss: 249.7471 Explore P: 0.6850
Episode: 203 Total reward: 12.0 Training loss: 2.1391 Explore P: 0.6842
Episode: 204 Total reward: 14.0 Training loss: 2.8829 Explore P: 0.6832
Episode: 205 Total reward: 10.0 Training loss: 232.1500 Explore P: 0.6826
Episode: 206 Total reward: 18.0 Training loss: 346.1035 Explore P: 0.6814
Episode: 207 Total reward: 13.0 Training loss: 123.0428 Explore P: 0.6805
Episode: 208 Total reward: 13.0 Training loss: 265.9186 Explore P: 0.6796
Episode: 209 Total reward: 9.0 Training loss: 622.0021 Explore P: 0.6790
Episode: 210 Total reward: 13.0 Training loss: 137.3789 Explore P: 0.6781
Episode: 211 Total reward: 13.0 Training loss: 106.4429 Explore P: 0.6773
Episode: 212 Total reward: 11.0 Training loss: 3.7285 Explore P: 0.6765
Episode: 213 Total reward: 9.0 Training loss: 107.3521 Explore P: 0.6759
Episode: 214 Total reward: 20.0 Training loss: 3.5297 Explore P: 0.6746
Episode: 215 Total reward: 12.0 Training loss: 288.9487 Explore P: 0.6738
Episode: 216 Total reward: 7.0 Training loss: 113.8173 Explore P: 0.6734
Episode: 217 Total reward: 16.0 Training loss: 107.3877 Explore P: 0.6723
Episode: 218 Total reward: 9.0 Training loss: 102.4685 Explore P: 0.6717
Episode: 219 Total reward: 13.0 Training loss: 126.4419 Explore P: 0.6708
Episode: 220 Total reward: 10.0 Training loss: 121.2955 Explore P: 0.6702
Episode: 221 Total reward: 16.0 Training loss: 323.7456 Explore P: 0.6691
Episode: 222 Total reward: 34.0 Training loss: 613.0165 Explore P: 0.6669
Episode: 223 Total reward: 28.0 Training loss: 3.7502 Explore P: 0.6650
Episode: 224 Total reward: 10.0 Training loss: 100.8204 Explore P: 0.6644
Episode: 225 Total reward: 8.0 Training loss: 208.4295 Explore P: 0.6639
Episode: 226 Total reward: 16.0 Training loss: 93.7536 Explore P: 0.6628
Episode: 227 Total reward: 15.0 Training loss: 2.7371 Explore P: 0.6618
Episode: 228 Total reward: 19.0 Training loss: 4.0931 Explore P: 0.6606
Episode: 229 Total reward: 17.0 Training loss: 486.2758 Explore P: 0.6595
Episode: 230 Total reward: 11.0 Training loss: 4.4817 Explore P: 0.6588
Episode: 231 Total reward: 16.0 Training loss: 514.7384 Explore P: 0.6578
Episode: 232 Total reward: 11.0 Training loss: 280.5522 Explore P: 0.6570
Episode: 233 Total reward: 28.0 Training loss: 86.5102 Explore P: 0.6552
Episode: 234 Total reward: 42.0 Training loss: 6.2949 Explore P: 0.6525
Episode: 235 Total reward: 17.0 Training loss: 7.6028 Explore P: 0.6514
Episode: 236 Total reward: 7.0 Training loss: 7.0275 Explore P: 0.6510
Episode: 237 Total reward: 11.0 Training loss: 82.2076 Explore P: 0.6503
Episode: 238 Total reward: 10.0 Training loss: 5.6118 Explore P: 0.6496
Episode: 239 Total reward: 16.0 Training loss: 6.8660 Explore P: 0.6486
Episode: 240 Total reward: 12.0 Training loss: 232.1922 Explore P: 0.6479
Episode: 241 Total reward: 11.0 Training loss: 94.0015 Explore P: 0.6471
Episode: 242 Total reward: 9.0 Training loss: 125.3228 Explore P: 0.6466
Episode: 243 Total reward: 15.0 Training loss: 5.4796 Explore P: 0.6456
Episode: 244 Total reward: 22.0 Training loss: 93.3356 Explore P: 0.6442
Episode: 245 Total reward: 21.0 Training loss: 162.6654 Explore P: 0.6429
Episode: 246 Total reward: 16.0 Training loss: 5.9535 Explore P: 0.6419
Episode: 247 Total reward: 15.0 Training loss: 523.1361 Explore P: 0.6409
Episode: 248 Total reward: 21.0 Training loss: 88.9836 Explore P: 0.6396
Episode: 249 Total reward: 12.0 Training loss: 540.1017 Explore P: 0.6389
Episode: 250 Total reward: 10.0 Training loss: 98.3027 Explore P: 0.6382
Episode: 251 Total reward: 21.0 Training loss: 88.1397 Explore P: 0.6369
Episode: 252 Total reward: 11.0 Training loss: 7.3326 Explore P: 0.6362
Episode: 253 Total reward: 15.0 Training loss: 6.0894 Explore P: 0.6353
Episode: 254 Total reward: 11.0 Training loss: 148.3004 Explore P: 0.6346
Episode: 255 Total reward: 38.0 Training loss: 120.1587 Explore P: 0.6322
Episode: 256 Total reward: 12.0 Training loss: 310.9326 Explore P: 0.6315
Episode: 257 Total reward: 21.0 Training loss: 282.1461 Explore P: 0.6302
Episode: 258 Total reward: 28.0 Training loss: 269.0856 Explore P: 0.6284
Episode: 259 Total reward: 14.0 Training loss: 415.0089 Explore P: 0.6276
Episode: 260 Total reward: 8.0 Training loss: 153.2048 Explore P: 0.6271
Episode: 261 Total reward: 16.0 Training loss: 244.6352 Explore P: 0.6261
Episode: 262 Total reward: 21.0 Training loss: 147.8963 Explore P: 0.6248
Episode: 263 Total reward: 12.0 Training loss: 6.2616 Explore P: 0.6241
Episode: 264 Total reward: 14.0 Training loss: 6.7813 Explore P: 0.6232
Episode: 265 Total reward: 15.0 Training loss: 8.1725 Explore P: 0.6223
Episode: 266 Total reward: 22.0 Training loss: 141.4295 Explore P: 0.6209
Episode: 267 Total reward: 42.0 Training loss: 70.5861 Explore P: 0.6184
Episode: 268 Total reward: 19.0 Training loss: 77.3906 Explore P: 0.6172
Episode: 269 Total reward: 12.0 Training loss: 68.8258 Explore P: 0.6165
Episode: 270 Total reward: 20.0 Training loss: 317.0912 Explore P: 0.6153
Episode: 271 Total reward: 11.0 Training loss: 93.6535 Explore P: 0.6146
Episode: 272 Total reward: 28.0 Training loss: 132.5125 Explore P: 0.6129
Episode: 273 Total reward: 19.0 Training loss: 70.2609 Explore P: 0.6118
Episode: 274 Total reward: 11.0 Training loss: 494.8968 Explore P: 0.6111
Episode: 275 Total reward: 9.0 Training loss: 5.0281 Explore P: 0.6106
Episode: 276 Total reward: 9.0 Training loss: 145.5705 Explore P: 0.6100
Episode: 277 Total reward: 15.0 Training loss: 276.4129 Explore P: 0.6091
Episode: 278 Total reward: 13.0 Training loss: 5.3846 Explore P: 0.6084
Episode: 279 Total reward: 14.0 Training loss: 4.1833 Explore P: 0.6075
Episode: 280 Total reward: 19.0 Training loss: 6.6750 Explore P: 0.6064
Episode: 281 Total reward: 15.0 Training loss: 4.5525 Explore P: 0.6055
Episode: 282 Total reward: 11.0 Training loss: 6.2832 Explore P: 0.6048
Episode: 283 Total reward: 7.0 Training loss: 129.6888 Explore P: 0.6044
Episode: 284 Total reward: 29.0 Training loss: 135.0215 Explore P: 0.6027
Episode: 285 Total reward: 17.0 Training loss: 92.8226 Explore P: 0.6017
Episode: 286 Total reward: 11.0 Training loss: 6.6714 Explore P: 0.6011
Episode: 287 Total reward: 20.0 Training loss: 183.0730 Explore P: 0.5999
Episode: 288 Total reward: 10.0 Training loss: 136.6376 Explore P: 0.5993
Episode: 289 Total reward: 27.0 Training loss: 304.2607 Explore P: 0.5977
Episode: 290 Total reward: 17.0 Training loss: 153.1395 Explore P: 0.5967
Episode: 291 Total reward: 12.0 Training loss: 130.2511 Explore P: 0.5960
Episode: 292 Total reward: 31.0 Training loss: 79.1052 Explore P: 0.5942
Episode: 293 Total reward: 20.0 Training loss: 64.9246 Explore P: 0.5930
Episode: 294 Total reward: 10.0 Training loss: 116.9816 Explore P: 0.5924
Episode: 295 Total reward: 10.0 Training loss: 87.6540 Explore P: 0.5918
Episode: 296 Total reward: 23.0 Training loss: 71.4822 Explore P: 0.5905
Episode: 297 Total reward: 26.0 Training loss: 255.4749 Explore P: 0.5890
Episode: 298 Total reward: 16.0 Training loss: 214.0237 Explore P: 0.5881
Episode: 299 Total reward: 14.0 Training loss: 122.8956 Explore P: 0.5873
Episode: 300 Total reward: 21.0 Training loss: 195.9192 Explore P: 0.5861
Episode: 301 Total reward: 11.0 Training loss: 74.4021 Explore P: 0.5854
Episode: 302 Total reward: 9.0 Training loss: 125.0591 Explore P: 0.5849
Episode: 303 Total reward: 8.0 Training loss: 334.6465 Explore P: 0.5844
Episode: 304 Total reward: 12.0 Training loss: 117.2775 Explore P: 0.5838
Episode: 305 Total reward: 25.0 Training loss: 61.6778 Explore P: 0.5823
Episode: 306 Total reward: 9.0 Training loss: 165.1031 Explore P: 0.5818
Episode: 307 Total reward: 13.0 Training loss: 7.4214 Explore P: 0.5811
Episode: 308 Total reward: 11.0 Training loss: 208.7959 Explore P: 0.5804
Episode: 309 Total reward: 11.0 Training loss: 136.4322 Explore P: 0.5798
Episode: 310 Total reward: 11.0 Training loss: 8.8668 Explore P: 0.5792
Episode: 311 Total reward: 9.0 Training loss: 5.2950 Explore P: 0.5787
Episode: 312 Total reward: 10.0 Training loss: 281.4304 Explore P: 0.5781
Episode: 313 Total reward: 12.0 Training loss: 311.8693 Explore P: 0.5774
Episode: 314 Total reward: 26.0 Training loss: 200.1442 Explore P: 0.5759
Episode: 315 Total reward: 10.0 Training loss: 62.4854 Explore P: 0.5754
Episode: 316 Total reward: 20.0 Training loss: 4.5163 Explore P: 0.5743
Episode: 317 Total reward: 7.0 Training loss: 140.5792 Explore P: 0.5739
Episode: 318 Total reward: 13.0 Training loss: 186.7457 Explore P: 0.5731
Episode: 319 Total reward: 10.0 Training loss: 3.9069 Explore P: 0.5726
Episode: 320 Total reward: 11.0 Training loss: 70.5841 Explore P: 0.5719
Episode: 321 Total reward: 36.0 Training loss: 555.8609 Explore P: 0.5699
Episode: 322 Total reward: 12.0 Training loss: 178.3234 Explore P: 0.5693
Episode: 323 Total reward: 14.0 Training loss: 50.1819 Explore P: 0.5685
Episode: 324 Total reward: 8.0 Training loss: 53.7244 Explore P: 0.5680
Episode: 325 Total reward: 25.0 Training loss: 123.5263 Explore P: 0.5666
Episode: 326 Total reward: 17.0 Training loss: 3.7603 Explore P: 0.5657
Episode: 327 Total reward: 9.0 Training loss: 184.0860 Explore P: 0.5652
Episode: 328 Total reward: 15.0 Training loss: 185.2499 Explore P: 0.5644
Episode: 329 Total reward: 16.0 Training loss: 3.3958 Explore P: 0.5635
Episode: 330 Total reward: 14.0 Training loss: 49.7909 Explore P: 0.5627
Episode: 331 Total reward: 9.0 Training loss: 3.9313 Explore P: 0.5622
Episode: 332 Total reward: 19.0 Training loss: 4.9093 Explore P: 0.5611
Episode: 333 Total reward: 16.0 Training loss: 52.1825 Explore P: 0.5603
Episode: 334 Total reward: 16.0 Training loss: 3.9714 Explore P: 0.5594
Episode: 335 Total reward: 15.0 Training loss: 65.0894 Explore P: 0.5586
Episode: 336 Total reward: 25.0 Training loss: 3.2404 Explore P: 0.5572
Episode: 337 Total reward: 12.0 Training loss: 3.2331 Explore P: 0.5565
Episode: 338 Total reward: 9.0 Training loss: 117.9789 Explore P: 0.5560
Episode: 339 Total reward: 15.0 Training loss: 296.5986 Explore P: 0.5552
Episode: 340 Total reward: 11.0 Training loss: 2.9685 Explore P: 0.5546
Episode: 341 Total reward: 31.0 Training loss: 3.4804 Explore P: 0.5529
Episode: 342 Total reward: 17.0 Training loss: 46.0701 Explore P: 0.5520
Episode: 343 Total reward: 37.0 Training loss: 302.6524 Explore P: 0.5500
Episode: 344 Total reward: 14.0 Training loss: 144.3679 Explore P: 0.5493
Episode: 345 Total reward: 25.0 Training loss: 44.4854 Explore P: 0.5479
Episode: 346 Total reward: 19.0 Training loss: 94.1055 Explore P: 0.5469
Episode: 347 Total reward: 16.0 Training loss: 119.0262 Explore P: 0.5460
Episode: 348 Total reward: 38.0 Training loss: 235.4983 Explore P: 0.5440
Episode: 349 Total reward: 15.0 Training loss: 66.0674 Explore P: 0.5432
Episode: 350 Total reward: 11.0 Training loss: 2.8907 Explore P: 0.5426
Episode: 351 Total reward: 31.0 Training loss: 212.6675 Explore P: 0.5410
Episode: 352 Total reward: 12.0 Training loss: 3.1694 Explore P: 0.5403
Episode: 353 Total reward: 8.0 Training loss: 146.3766 Explore P: 0.5399
Episode: 354 Total reward: 10.0 Training loss: 83.7692 Explore P: 0.5394
Episode: 355 Total reward: 14.0 Training loss: 37.4759 Explore P: 0.5386
Episode: 356 Total reward: 11.0 Training loss: 45.2767 Explore P: 0.5381
Episode: 357 Total reward: 15.0 Training loss: 176.3361 Explore P: 0.5373
Episode: 358 Total reward: 12.0 Training loss: 1.9362 Explore P: 0.5366
Episode: 359 Total reward: 8.0 Training loss: 1.4272 Explore P: 0.5362
Episode: 360 Total reward: 10.0 Training loss: 1.1400 Explore P: 0.5357
Episode: 361 Total reward: 21.0 Training loss: 1.6289 Explore P: 0.5346
Episode: 362 Total reward: 10.0 Training loss: 181.3690 Explore P: 0.5341
Episode: 363 Total reward: 15.0 Training loss: 96.5622 Explore P: 0.5333
Episode: 364 Total reward: 15.0 Training loss: 185.9018 Explore P: 0.5325
Episode: 365 Total reward: 12.0 Training loss: 187.2325 Explore P: 0.5319
Episode: 366 Total reward: 19.0 Training loss: 97.5076 Explore P: 0.5309
Episode: 367 Total reward: 13.0 Training loss: 163.0827 Explore P: 0.5302
Episode: 368 Total reward: 16.0 Training loss: 45.6587 Explore P: 0.5294
Episode: 369 Total reward: 8.0 Training loss: 84.6313 Explore P: 0.5289
Episode: 370 Total reward: 15.0 Training loss: 86.5594 Explore P: 0.5282
Episode: 371 Total reward: 8.0 Training loss: 179.9664 Explore P: 0.5278
Episode: 372 Total reward: 16.0 Training loss: 33.5059 Explore P: 0.5269
Episode: 373 Total reward: 15.0 Training loss: 81.3355 Explore P: 0.5262
Episode: 374 Total reward: 9.0 Training loss: 2.2593 Explore P: 0.5257
Episode: 375 Total reward: 9.0 Training loss: 243.3529 Explore P: 0.5252
Episode: 376 Total reward: 10.0 Training loss: 1.6967 Explore P: 0.5247
Episode: 377 Total reward: 13.0 Training loss: 183.2543 Explore P: 0.5240
Episode: 378 Total reward: 12.0 Training loss: 108.9945 Explore P: 0.5234
Episode: 379 Total reward: 11.0 Training loss: 140.4077 Explore P: 0.5229
Episode: 380 Total reward: 19.0 Training loss: 1.2000 Explore P: 0.5219
Episode: 381 Total reward: 15.0 Training loss: 1.0349 Explore P: 0.5211
Episode: 382 Total reward: 13.0 Training loss: 85.0662 Explore P: 0.5205
Episode: 383 Total reward: 18.0 Training loss: 1.0288 Explore P: 0.5195
Episode: 384 Total reward: 13.0 Training loss: 40.0769 Explore P: 0.5189
Episode: 385 Total reward: 26.0 Training loss: 199.0320 Explore P: 0.5176
Episode: 386 Total reward: 19.0 Training loss: 1.3385 Explore P: 0.5166
Episode: 387 Total reward: 12.0 Training loss: 0.6675 Explore P: 0.5160
Episode: 388 Total reward: 12.0 Training loss: 2.0075 Explore P: 0.5154
Episode: 389 Total reward: 22.0 Training loss: 110.8661 Explore P: 0.5143
Episode: 390 Total reward: 11.0 Training loss: 96.2184 Explore P: 0.5137
Episode: 391 Total reward: 17.0 Training loss: 160.7639 Explore P: 0.5129
Episode: 392 Total reward: 13.0 Training loss: 1.5676 Explore P: 0.5122
Episode: 393 Total reward: 11.0 Training loss: 126.3773 Explore P: 0.5117
Episode: 394 Total reward: 10.0 Training loss: 1.0882 Explore P: 0.5111
Episode: 395 Total reward: 12.0 Training loss: 1.2891 Explore P: 0.5105
Episode: 396 Total reward: 12.0 Training loss: 61.8121 Explore P: 0.5099
Episode: 397 Total reward: 40.0 Training loss: 0.7814 Explore P: 0.5080
Episode: 398 Total reward: 12.0 Training loss: 107.6754 Explore P: 0.5074
Episode: 399 Total reward: 11.0 Training loss: 163.6418 Explore P: 0.5068
Episode: 400 Total reward: 14.0 Training loss: 51.0409 Explore P: 0.5061
Episode: 401 Total reward: 8.0 Training loss: 1.4717 Explore P: 0.5057
Episode: 402 Total reward: 8.0 Training loss: 38.0097 Explore P: 0.5053
Episode: 403 Total reward: 22.0 Training loss: 57.2595 Explore P: 0.5042
Episode: 404 Total reward: 18.0 Training loss: 113.6038 Explore P: 0.5033
Episode: 405 Total reward: 35.0 Training loss: 83.0104 Explore P: 0.5016
Episode: 406 Total reward: 62.0 Training loss: 1.1692 Explore P: 0.4986
Episode: 407 Total reward: 16.0 Training loss: 0.6611 Explore P: 0.4978
Episode: 408 Total reward: 9.0 Training loss: 69.3393 Explore P: 0.4974
Episode: 409 Total reward: 37.0 Training loss: 48.3993 Explore P: 0.4956
Episode: 410 Total reward: 32.0 Training loss: 1.2571 Explore P: 0.4940
Episode: 411 Total reward: 67.0 Training loss: 41.0918 Explore P: 0.4908
Episode: 412 Total reward: 56.0 Training loss: 37.1908 Explore P: 0.4881
Episode: 413 Total reward: 35.0 Training loss: 34.4148 Explore P: 0.4864
Episode: 414 Total reward: 84.0 Training loss: 63.8711 Explore P: 0.4824
Episode: 415 Total reward: 47.0 Training loss: 1.5220 Explore P: 0.4802
Episode: 416 Total reward: 65.0 Training loss: 1.1767 Explore P: 0.4772
Episode: 417 Total reward: 51.0 Training loss: 28.5139 Explore P: 0.4748
Episode: 418 Total reward: 29.0 Training loss: 106.4193 Explore P: 0.4735
Episode: 419 Total reward: 83.0 Training loss: 73.9467 Explore P: 0.4696
Episode: 420 Total reward: 42.0 Training loss: 0.9415 Explore P: 0.4677
Episode: 421 Total reward: 43.0 Training loss: 27.2000 Explore P: 0.4657
Episode: 422 Total reward: 23.0 Training loss: 1.3469 Explore P: 0.4647
Episode: 423 Total reward: 53.0 Training loss: 74.2806 Explore P: 0.4623
Episode: 424 Total reward: 39.0 Training loss: 22.2182 Explore P: 0.4605
Episode: 425 Total reward: 16.0 Training loss: 24.2960 Explore P: 0.4598
Episode: 426 Total reward: 107.0 Training loss: 23.6230 Explore P: 0.4550
Episode: 427 Total reward: 50.0 Training loss: 1.7970 Explore P: 0.4528
Episode: 428 Total reward: 41.0 Training loss: 40.6148 Explore P: 0.4510
Episode: 429 Total reward: 55.0 Training loss: 37.8106 Explore P: 0.4486
Episode: 430 Total reward: 88.0 Training loss: 20.4169 Explore P: 0.4447
Episode: 431 Total reward: 42.0 Training loss: 37.8367 Explore P: 0.4429
Episode: 432 Total reward: 98.0 Training loss: 17.1773 Explore P: 0.4387
Episode: 433 Total reward: 53.0 Training loss: 40.6816 Explore P: 0.4364
Episode: 434 Total reward: 25.0 Training loss: 41.7637 Explore P: 0.4353
Episode: 435 Total reward: 37.0 Training loss: 36.6834 Explore P: 0.4338
Episode: 436 Total reward: 40.0 Training loss: 35.3412 Explore P: 0.4321
Episode: 437 Total reward: 47.0 Training loss: 1.0329 Explore P: 0.4301
Episode: 438 Total reward: 103.0 Training loss: 22.5777 Explore P: 0.4258
Episode: 439 Total reward: 36.0 Training loss: 34.1192 Explore P: 0.4243
Episode: 440 Total reward: 46.0 Training loss: 48.3637 Explore P: 0.4224
Episode: 441 Total reward: 88.0 Training loss: 16.2738 Explore P: 0.4188
Episode: 442 Total reward: 56.0 Training loss: 39.3706 Explore P: 0.4165
Episode: 443 Total reward: 50.0 Training loss: 20.3748 Explore P: 0.4145
Episode: 444 Total reward: 79.0 Training loss: 46.3949 Explore P: 0.4113
Episode: 445 Total reward: 50.0 Training loss: 1.6196 Explore P: 0.4093
Episode: 446 Total reward: 59.0 Training loss: 1.1975 Explore P: 0.4069
Episode: 447 Total reward: 125.0 Training loss: 1.7659 Explore P: 0.4020
Episode: 448 Total reward: 74.0 Training loss: 15.1231 Explore P: 0.3991
Episode: 449 Total reward: 60.0 Training loss: 1.2261 Explore P: 0.3968
Episode: 450 Total reward: 45.0 Training loss: 15.3832 Explore P: 0.3951
Episode: 451 Total reward: 24.0 Training loss: 35.8607 Explore P: 0.3941
Episode: 452 Total reward: 60.0 Training loss: 17.6805 Explore P: 0.3918
Episode: 453 Total reward: 51.0 Training loss: 37.3684 Explore P: 0.3899
Episode: 454 Total reward: 119.0 Training loss: 20.6233 Explore P: 0.3854
Episode: 455 Total reward: 31.0 Training loss: 19.5496 Explore P: 0.3842
Episode: 456 Total reward: 25.0 Training loss: 51.1472 Explore P: 0.3833
Episode: 457 Total reward: 51.0 Training loss: 28.7119 Explore P: 0.3814
Episode: 458 Total reward: 28.0 Training loss: 41.3453 Explore P: 0.3804
Episode: 459 Total reward: 44.0 Training loss: 47.9821 Explore P: 0.3787
Episode: 460 Total reward: 76.0 Training loss: 30.3434 Explore P: 0.3760
Episode: 461 Total reward: 76.0 Training loss: 23.7055 Explore P: 0.3732
Episode: 462 Total reward: 42.0 Training loss: 1.3552 Explore P: 0.3717
Episode: 463 Total reward: 87.0 Training loss: 1.8898 Explore P: 0.3685
Episode: 464 Total reward: 123.0 Training loss: 23.4135 Explore P: 0.3641
Episode: 465 Total reward: 71.0 Training loss: 17.7585 Explore P: 0.3616
Episode: 466 Total reward: 46.0 Training loss: 1.2437 Explore P: 0.3600
Episode: 467 Total reward: 99.0 Training loss: 1.2920 Explore P: 0.3566
Episode: 468 Total reward: 108.0 Training loss: 23.8676 Explore P: 0.3529
Episode: 469 Total reward: 64.0 Training loss: 47.6842 Explore P: 0.3507
Episode: 470 Total reward: 93.0 Training loss: 1.6747 Explore P: 0.3475
Episode: 471 Total reward: 109.0 Training loss: 1.5700 Explore P: 0.3439
Episode: 472 Total reward: 51.0 Training loss: 60.7083 Explore P: 0.3422
Episode: 473 Total reward: 60.0 Training loss: 1.3867 Explore P: 0.3402
Episode: 474 Total reward: 85.0 Training loss: 1.9941 Explore P: 0.3374
Episode: 475 Total reward: 62.0 Training loss: 2.6144 Explore P: 0.3354
Episode: 476 Total reward: 46.0 Training loss: 34.8609 Explore P: 0.3339
Episode: 477 Total reward: 35.0 Training loss: 65.6658 Explore P: 0.3327
Episode: 478 Total reward: 39.0 Training loss: 1.8050 Explore P: 0.3315
Episode: 479 Total reward: 81.0 Training loss: 1.7285 Explore P: 0.3289
Episode: 480 Total reward: 119.0 Training loss: 14.3030 Explore P: 0.3251
Episode: 481 Total reward: 86.0 Training loss: 23.1852 Explore P: 0.3224
Episode: 482 Total reward: 48.0 Training loss: 2.3227 Explore P: 0.3209
Episode: 483 Total reward: 44.0 Training loss: 1.5187 Explore P: 0.3195
Episode: 484 Total reward: 63.0 Training loss: 33.5616 Explore P: 0.3176
Episode: 485 Total reward: 86.0 Training loss: 1.8927 Explore P: 0.3150
Episode: 486 Total reward: 52.0 Training loss: 0.5092 Explore P: 0.3134
Episode: 487 Total reward: 96.0 Training loss: 1.6766 Explore P: 0.3105
Episode: 488 Total reward: 57.0 Training loss: 28.1612 Explore P: 0.3088
Episode: 489 Total reward: 94.0 Training loss: 2.3830 Explore P: 0.3060
Episode: 490 Total reward: 112.0 Training loss: 20.7773 Explore P: 0.3027
Episode: 491 Total reward: 78.0 Training loss: 1.7547 Explore P: 0.3004
Episode: 492 Total reward: 77.0 Training loss: 8.7572 Explore P: 0.2982
Episode: 493 Total reward: 84.0 Training loss: 19.6533 Explore P: 0.2958
Episode: 494 Total reward: 118.0 Training loss: 16.7880 Explore P: 0.2924
Episode: 495 Total reward: 164.0 Training loss: 2.1693 Explore P: 0.2878
Episode: 496 Total reward: 120.0 Training loss: 38.1818 Explore P: 0.2845
Episode: 497 Total reward: 159.0 Training loss: 47.1302 Explore P: 0.2802
Episode: 498 Total reward: 75.0 Training loss: 1.6757 Explore P: 0.2782
Episode: 499 Total reward: 124.0 Training loss: 0.8744 Explore P: 0.2749
Episode: 500 Total reward: 199.0 Training loss: 22.0924 Explore P: 0.2696
Episode: 501 Total reward: 128.0 Training loss: 2.0431 Explore P: 0.2663
Episode: 502 Total reward: 53.0 Training loss: 1.0917 Explore P: 0.2650
Episode: 503 Total reward: 40.0 Training loss: 2.6783 Explore P: 0.2640
Episode: 504 Total reward: 44.0 Training loss: 77.9367 Explore P: 0.2629
Episode: 505 Total reward: 161.0 Training loss: 15.8680 Explore P: 0.2588
Episode: 506 Total reward: 82.0 Training loss: 1.2678 Explore P: 0.2568
Episode: 507 Total reward: 67.0 Training loss: 1.2075 Explore P: 0.2551
Episode: 508 Total reward: 49.0 Training loss: 0.8739 Explore P: 0.2539
Episode: 509 Total reward: 30.0 Training loss: 2.0156 Explore P: 0.2532
Episode: 510 Total reward: 61.0 Training loss: 60.4510 Explore P: 0.2517
Episode: 511 Total reward: 36.0 Training loss: 2.5564 Explore P: 0.2509
Episode: 512 Total reward: 38.0 Training loss: 68.4571 Explore P: 0.2499
Episode: 513 Total reward: 42.0 Training loss: 1.5156 Explore P: 0.2489
Episode: 514 Total reward: 46.0 Training loss: 0.5713 Explore P: 0.2478
Episode: 515 Total reward: 33.0 Training loss: 1.0574 Explore P: 0.2471
Episode: 516 Total reward: 40.0 Training loss: 2.0687 Explore P: 0.2461
Episode: 517 Total reward: 50.0 Training loss: 124.2789 Explore P: 0.2449
Episode: 518 Total reward: 18.0 Training loss: 27.3571 Explore P: 0.2445
Episode: 519 Total reward: 29.0 Training loss: 18.5670 Explore P: 0.2438
Episode: 520 Total reward: 37.0 Training loss: 4.8286 Explore P: 0.2430
Episode: 521 Total reward: 29.0 Training loss: 1.3071 Explore P: 0.2423
Episode: 522 Total reward: 25.0 Training loss: 70.3652 Explore P: 0.2417
Episode: 523 Total reward: 32.0 Training loss: 1.2432 Explore P: 0.2410
Episode: 524 Total reward: 40.0 Training loss: 3.0815 Explore P: 0.2401
Episode: 525 Total reward: 35.0 Training loss: 15.6089 Explore P: 0.2392
Episode: 526 Total reward: 39.0 Training loss: 1.3723 Explore P: 0.2384
Episode: 527 Total reward: 47.0 Training loss: 5.0507 Explore P: 0.2373
Episode: 528 Total reward: 29.0 Training loss: 1.2102 Explore P: 0.2366
Episode: 529 Total reward: 46.0 Training loss: 2.2029 Explore P: 0.2356
Episode: 530 Total reward: 59.0 Training loss: 1.2768 Explore P: 0.2343
Episode: 531 Total reward: 87.0 Training loss: 51.8690 Explore P: 0.2323
Episode: 532 Total reward: 38.0 Training loss: 27.6453 Explore P: 0.2315
Episode: 533 Total reward: 32.0 Training loss: 1.8832 Explore P: 0.2308
Episode: 534 Total reward: 21.0 Training loss: 24.4535 Explore P: 0.2303
Episode: 535 Total reward: 38.0 Training loss: 72.4144 Explore P: 0.2295
Episode: 536 Total reward: 33.0 Training loss: 0.8322 Explore P: 0.2287
Episode: 537 Total reward: 46.0 Training loss: 1.8656 Explore P: 0.2277
Episode: 538 Total reward: 44.0 Training loss: 0.9156 Explore P: 0.2268
Episode: 539 Total reward: 31.0 Training loss: 118.9293 Explore P: 0.2261
Episode: 540 Total reward: 49.0 Training loss: 0.5882 Explore P: 0.2251
Episode: 541 Total reward: 31.0 Training loss: 3.1728 Explore P: 0.2244
Episode: 542 Total reward: 36.0 Training loss: 2.8409 Explore P: 0.2236
Episode: 543 Total reward: 33.0 Training loss: 66.8048 Explore P: 0.2229
Episode: 544 Total reward: 39.0 Training loss: 5.7656 Explore P: 0.2221
Episode: 545 Total reward: 37.0 Training loss: 0.7715 Explore P: 0.2213
Episode: 546 Total reward: 60.0 Training loss: 55.5691 Explore P: 0.2200
Episode: 547 Total reward: 92.0 Training loss: 25.7082 Explore P: 0.2181
Episode: 548 Total reward: 45.0 Training loss: 54.5084 Explore P: 0.2172
Episode: 549 Total reward: 53.0 Training loss: 0.7605 Explore P: 0.2161
Episode: 550 Total reward: 74.0 Training loss: 0.8070 Explore P: 0.2146
Episode: 551 Total reward: 50.0 Training loss: 17.3592 Explore P: 0.2135
Episode: 552 Total reward: 82.0 Training loss: 0.7605 Explore P: 0.2119
Episode: 553 Total reward: 53.0 Training loss: 4.8139 Explore P: 0.2108
Episode: 554 Total reward: 74.0 Training loss: 1.9479 Explore P: 0.2093
Episode: 555 Total reward: 50.0 Training loss: 0.6359 Explore P: 0.2083
Episode: 556 Total reward: 73.0 Training loss: 295.0652 Explore P: 0.2069
Episode: 557 Total reward: 106.0 Training loss: 162.4351 Explore P: 0.2048
Episode: 558 Total reward: 135.0 Training loss: 158.3529 Explore P: 0.2022
Episode: 559 Total reward: 199.0 Training loss: 133.5674 Explore P: 0.1984
Episode: 560 Total reward: 127.0 Training loss: 0.5656 Explore P: 0.1960
Episode: 561 Total reward: 82.0 Training loss: 1.4502 Explore P: 0.1945
Episode: 562 Total reward: 174.0 Training loss: 52.9702 Explore P: 0.1913
Episode: 563 Total reward: 100.0 Training loss: 2.4952 Explore P: 0.1895
Episode: 564 Total reward: 199.0 Training loss: 0.8011 Explore P: 0.1860
Episode: 565 Total reward: 167.0 Training loss: 0.6249 Explore P: 0.1831
Episode: 566 Total reward: 169.0 Training loss: 0.5491 Explore P: 0.1802
Episode: 567 Total reward: 132.0 Training loss: 68.9932 Explore P: 0.1780
Episode: 568 Total reward: 139.0 Training loss: 0.8620 Explore P: 0.1756
Episode: 569 Total reward: 199.0 Training loss: 79.1448 Explore P: 0.1724
Episode: 570 Total reward: 135.0 Training loss: 82.4651 Explore P: 0.1702
Episode: 571 Total reward: 110.0 Training loss: 1.3283 Explore P: 0.1684
Episode: 572 Total reward: 117.0 Training loss: 68.0698 Explore P: 0.1666
Episode: 573 Total reward: 172.0 Training loss: 162.1710 Explore P: 0.1639
Episode: 574 Total reward: 199.0 Training loss: 0.7362 Explore P: 0.1609
Episode: 575 Total reward: 164.0 Training loss: 0.7631 Explore P: 0.1584
Episode: 576 Total reward: 136.0 Training loss: 91.5435 Explore P: 0.1564
Episode: 577 Total reward: 130.0 Training loss: 1.3203 Explore P: 0.1545
Episode: 578 Total reward: 125.0 Training loss: 0.8767 Explore P: 0.1528
Episode: 579 Total reward: 161.0 Training loss: 0.5272 Explore P: 0.1505
Episode: 580 Total reward: 102.0 Training loss: 0.6723 Explore P: 0.1490
Episode: 581 Total reward: 152.0 Training loss: 0.9938 Explore P: 0.1469
Episode: 582 Total reward: 199.0 Training loss: 0.7741 Explore P: 0.1443
Episode: 583 Total reward: 172.0 Training loss: 0.8088 Explore P: 0.1420
Episode: 584 Total reward: 113.0 Training loss: 0.8316 Explore P: 0.1405
Episode: 585 Total reward: 199.0 Training loss: 0.6484 Explore P: 0.1379
Episode: 586 Total reward: 52.0 Training loss: 0.9579 Explore P: 0.1372
Episode: 587 Total reward: 195.0 Training loss: 480.3067 Explore P: 0.1348
Episode: 588 Total reward: 199.0 Training loss: 86.1358 Explore P: 0.1323
Episode: 589 Total reward: 134.0 Training loss: 113.1977 Explore P: 0.1307
Episode: 590 Total reward: 158.0 Training loss: 0.8695 Explore P: 0.1288
Episode: 591 Total reward: 112.0 Training loss: 1.1143 Explore P: 0.1275
Episode: 592 Total reward: 109.0 Training loss: 0.9792 Explore P: 0.1262
Episode: 593 Total reward: 105.0 Training loss: 1.2436 Explore P: 0.1250
Episode: 594 Total reward: 140.0 Training loss: 1.6343 Explore P: 0.1234
Episode: 595 Total reward: 124.0 Training loss: 105.6938 Explore P: 0.1220
Episode: 596 Total reward: 31.0 Training loss: 72.9788 Explore P: 0.1217
Episode: 597 Total reward: 31.0 Training loss: 0.9419 Explore P: 0.1213
Episode: 598 Total reward: 112.0 Training loss: 1.4023 Explore P: 0.1201
Episode: 599 Total reward: 28.0 Training loss: 1.0737 Explore P: 0.1198
Episode: 600 Total reward: 96.0 Training loss: 1.7347 Explore P: 0.1187
Episode: 601 Total reward: 33.0 Training loss: 1.3616 Explore P: 0.1184
Episode: 602 Total reward: 18.0 Training loss: 1.1815 Explore P: 0.1182
Episode: 603 Total reward: 35.0 Training loss: 2.4777 Explore P: 0.1178
Episode: 604 Total reward: 42.0 Training loss: 1.2088 Explore P: 0.1173
Episode: 605 Total reward: 19.0 Training loss: 105.3809 Explore P: 0.1171
Episode: 606 Total reward: 106.0 Training loss: 130.1140 Explore P: 0.1160
Episode: 607 Total reward: 37.0 Training loss: 0.7538 Explore P: 0.1156
Episode: 608 Total reward: 30.0 Training loss: 201.7872 Explore P: 0.1153
Episode: 609 Total reward: 42.0 Training loss: 2.0656 Explore P: 0.1148
Episode: 610 Total reward: 15.0 Training loss: 2.9324 Explore P: 0.1147
Episode: 611 Total reward: 18.0 Training loss: 149.3647 Explore P: 0.1145
Episode: 612 Total reward: 18.0 Training loss: 1.8242 Explore P: 0.1143
Episode: 613 Total reward: 27.0 Training loss: 1.4887 Explore P: 0.1140
Episode: 614 Total reward: 21.0 Training loss: 63.6069 Explore P: 0.1138
Episode: 615 Total reward: 21.0 Training loss: 1.2031 Explore P: 0.1136
Episode: 616 Total reward: 25.0 Training loss: 2.1642 Explore P: 0.1133
Episode: 617 Total reward: 22.0 Training loss: 0.8315 Explore P: 0.1131
Episode: 618 Total reward: 23.0 Training loss: 1.4031 Explore P: 0.1129
Episode: 619 Total reward: 17.0 Training loss: 1.8359 Explore P: 0.1127
Episode: 620 Total reward: 21.0 Training loss: 1.7944 Explore P: 0.1125
Episode: 621 Total reward: 18.0 Training loss: 133.5191 Explore P: 0.1123
Episode: 622 Total reward: 23.0 Training loss: 0.9887 Explore P: 0.1121
Episode: 623 Total reward: 28.0 Training loss: 1.3465 Explore P: 0.1118
Episode: 624 Total reward: 25.0 Training loss: 153.6468 Explore P: 0.1115
Episode: 625 Total reward: 29.0 Training loss: 1.9316 Explore P: 0.1112
Episode: 626 Total reward: 24.0 Training loss: 211.3271 Explore P: 0.1110
Episode: 627 Total reward: 27.0 Training loss: 2.3351 Explore P: 0.1107
Episode: 628 Total reward: 21.0 Training loss: 1.6178 Explore P: 0.1105
Episode: 629 Total reward: 24.0 Training loss: 1.8913 Explore P: 0.1103
Episode: 630 Total reward: 23.0 Training loss: 2.5794 Explore P: 0.1100
Episode: 631 Total reward: 20.0 Training loss: 2.0436 Explore P: 0.1098
Episode: 632 Total reward: 19.0 Training loss: 2.9644 Explore P: 0.1096
Episode: 633 Total reward: 19.0 Training loss: 2.9307 Explore P: 0.1095
Episode: 634 Total reward: 17.0 Training loss: 2.4719 Explore P: 0.1093
Episode: 635 Total reward: 19.0 Training loss: 1.6584 Explore P: 0.1091
Episode: 636 Total reward: 26.0 Training loss: 441.4546 Explore P: 0.1088
Episode: 637 Total reward: 28.0 Training loss: 434.3775 Explore P: 0.1086
Episode: 638 Total reward: 14.0 Training loss: 1.9336 Explore P: 0.1084
Episode: 639 Total reward: 17.0 Training loss: 2.7447 Explore P: 0.1083
Episode: 640 Total reward: 12.0 Training loss: 163.6323 Explore P: 0.1081
Episode: 641 Total reward: 18.0 Training loss: 2.9218 Explore P: 0.1080
Episode: 642 Total reward: 25.0 Training loss: 1.8434 Explore P: 0.1077
Episode: 643 Total reward: 31.0 Training loss: 0.9551 Explore P: 0.1074
Episode: 644 Total reward: 35.0 Training loss: 1.9035 Explore P: 0.1071
Episode: 645 Total reward: 48.0 Training loss: 1.6190 Explore P: 0.1066
Episode: 646 Total reward: 25.0 Training loss: 1.8218 Explore P: 0.1064
Episode: 647 Total reward: 28.0 Training loss: 1.2903 Explore P: 0.1061
Episode: 648 Total reward: 36.0 Training loss: 235.2383 Explore P: 0.1058
Episode: 649 Total reward: 35.0 Training loss: 122.2854 Explore P: 0.1054
Episode: 650 Total reward: 26.0 Training loss: 1.9973 Explore P: 0.1052
Episode: 651 Total reward: 22.0 Training loss: 87.7819 Explore P: 0.1050
Episode: 652 Total reward: 109.0 Training loss: 151.8963 Explore P: 0.1039
Episode: 653 Total reward: 29.0 Training loss: 120.3667 Explore P: 0.1037
Episode: 654 Total reward: 31.0 Training loss: 2.7498 Explore P: 0.1034
Episode: 655 Total reward: 25.0 Training loss: 1.4449 Explore P: 0.1031
Episode: 656 Total reward: 34.0 Training loss: 4.0478 Explore P: 0.1028
Episode: 657 Total reward: 27.0 Training loss: 2.1714 Explore P: 0.1026
Episode: 658 Total reward: 28.0 Training loss: 2.4137 Explore P: 0.1023
Episode: 659 Total reward: 21.0 Training loss: 183.4996 Explore P: 0.1021
Episode: 660 Total reward: 24.0 Training loss: 2.5375 Explore P: 0.1019
Episode: 661 Total reward: 25.0 Training loss: 2.3830 Explore P: 0.1017
Episode: 662 Total reward: 23.0 Training loss: 1.6508 Explore P: 0.1015
Episode: 663 Total reward: 25.0 Training loss: 2.3075 Explore P: 0.1012
Episode: 664 Total reward: 21.0 Training loss: 2.8356 Explore P: 0.1010
Episode: 665 Total reward: 32.0 Training loss: 2.1291 Explore P: 0.1007
Episode: 666 Total reward: 33.0 Training loss: 1.1294 Explore P: 0.1005
Episode: 667 Total reward: 25.0 Training loss: 1.5317 Explore P: 0.1002
Episode: 668 Total reward: 24.0 Training loss: 1.1561 Explore P: 0.1000
Episode: 669 Total reward: 116.0 Training loss: 292.1622 Explore P: 0.0990
Episode: 670 Total reward: 41.0 Training loss: 2.7889 Explore P: 0.0986
Episode: 671 Total reward: 36.0 Training loss: 1.9478 Explore P: 0.0983
Episode: 672 Total reward: 19.0 Training loss: 2.0167 Explore P: 0.0981
Episode: 673 Total reward: 178.0 Training loss: 1.7607 Explore P: 0.0966
Episode: 674 Total reward: 181.0 Training loss: 1.3524 Explore P: 0.0950
Episode: 675 Total reward: 198.0 Training loss: 1.6645 Explore P: 0.0933
Episode: 676 Total reward: 189.0 Training loss: 1.4380 Explore P: 0.0918
Episode: 677 Total reward: 173.0 Training loss: 1.3219 Explore P: 0.0904
Episode: 678 Total reward: 199.0 Training loss: 1.3161 Explore P: 0.0888
Episode: 679 Total reward: 199.0 Training loss: 0.6761 Explore P: 0.0872
Episode: 680 Total reward: 199.0 Training loss: 0.5177 Explore P: 0.0857
Episode: 681 Total reward: 199.0 Training loss: 1.0142 Explore P: 0.0842
Episode: 682 Total reward: 199.0 Training loss: 0.8778 Explore P: 0.0828
Episode: 683 Total reward: 199.0 Training loss: 0.8451 Explore P: 0.0813
Episode: 684 Total reward: 199.0 Training loss: 1.7079 Explore P: 0.0799
Episode: 685 Total reward: 199.0 Training loss: 0.8731 Explore P: 0.0786
Episode: 686 Total reward: 199.0 Training loss: 0.6528 Explore P: 0.0772
Episode: 687 Total reward: 199.0 Training loss: 0.5230 Explore P: 0.0759
Episode: 688 Total reward: 199.0 Training loss: 0.6420 Explore P: 0.0746
Episode: 689 Total reward: 199.0 Training loss: 0.9689 Explore P: 0.0733
Episode: 690 Total reward: 199.0 Training loss: 0.9918 Explore P: 0.0721
Episode: 691 Total reward: 197.0 Training loss: 0.6216 Explore P: 0.0708
Episode: 692 Total reward: 187.0 Training loss: 0.7283 Explore P: 0.0697
Episode: 693 Total reward: 166.0 Training loss: 0.9746 Explore P: 0.0687
Episode: 694 Total reward: 178.0 Training loss: 0.7703 Explore P: 0.0677
Episode: 695 Total reward: 199.0 Training loss: 1.0293 Explore P: 0.0666
Episode: 696 Total reward: 189.0 Training loss: 0.6020 Explore P: 0.0655
Episode: 697 Total reward: 199.0 Training loss: 1.1508 Explore P: 0.0644
Episode: 698 Total reward: 153.0 Training loss: 0.6275 Explore P: 0.0636
Episode: 699 Total reward: 189.0 Training loss: 0.7392 Explore P: 0.0626
Episode: 700 Total reward: 199.0 Training loss: 0.5200 Explore P: 0.0615
Episode: 701 Total reward: 199.0 Training loss: 1.1482 Explore P: 0.0605
Episode: 702 Total reward: 199.0 Training loss: 0.5708 Explore P: 0.0595
Episode: 703 Total reward: 199.0 Training loss: 0.8893 Explore P: 0.0586
Episode: 704 Total reward: 199.0 Training loss: 0.8068 Explore P: 0.0576
Episode: 705 Total reward: 199.0 Training loss: 0.7275 Explore P: 0.0567
Episode: 706 Total reward: 199.0 Training loss: 1.0198 Explore P: 0.0557
Episode: 707 Total reward: 193.0 Training loss: 0.7775 Explore P: 0.0549
Episode: 708 Total reward: 140.0 Training loss: 109.8337 Explore P: 0.0542
Episode: 709 Total reward: 138.0 Training loss: 0.9104 Explore P: 0.0536
Episode: 710 Total reward: 120.0 Training loss: 1.1798 Explore P: 0.0531
Episode: 711 Total reward: 147.0 Training loss: 1.1485 Explore P: 0.0525
Episode: 712 Total reward: 133.0 Training loss: 1.4257 Explore P: 0.0519
Episode: 713 Total reward: 117.0 Training loss: 0.6429 Explore P: 0.0514
Episode: 714 Total reward: 109.0 Training loss: 0.8409 Explore P: 0.0510
Episode: 715 Total reward: 145.0 Training loss: 1.0226 Explore P: 0.0504
Episode: 716 Total reward: 152.0 Training loss: 0.6645 Explore P: 0.0498
Episode: 717 Total reward: 136.0 Training loss: 0.9701 Explore P: 0.0493
Episode: 718 Total reward: 60.0 Training loss: 1.4582 Explore P: 0.0490
Episode: 719 Total reward: 64.0 Training loss: 1.3600 Explore P: 0.0488
Episode: 720 Total reward: 97.0 Training loss: 1.1242 Explore P: 0.0484
Episode: 721 Total reward: 88.0 Training loss: 0.9599 Explore P: 0.0481
Episode: 722 Total reward: 78.0 Training loss: 1.0156 Explore P: 0.0478
Episode: 723 Total reward: 95.0 Training loss: 1.2338 Explore P: 0.0474
Episode: 724 Total reward: 71.0 Training loss: 147.7017 Explore P: 0.0471
Episode: 725 Total reward: 35.0 Training loss: 1.8975 Explore P: 0.0470
Episode: 726 Total reward: 42.0 Training loss: 1.4883 Explore P: 0.0469
Episode: 727 Total reward: 58.0 Training loss: 0.9861 Explore P: 0.0466
Episode: 728 Total reward: 80.0 Training loss: 1.4095 Explore P: 0.0464
Episode: 729 Total reward: 57.0 Training loss: 1.4593 Explore P: 0.0461
Episode: 730 Total reward: 34.0 Training loss: 1.2430 Explore P: 0.0460
Episode: 731 Total reward: 55.0 Training loss: 0.6044 Explore P: 0.0458
Episode: 732 Total reward: 63.0 Training loss: 0.9076 Explore P: 0.0456
Episode: 733 Total reward: 46.0 Training loss: 1.2273 Explore P: 0.0454
Episode: 734 Total reward: 21.0 Training loss: 0.9614 Explore P: 0.0454
Episode: 735 Total reward: 70.0 Training loss: 1.3172 Explore P: 0.0451
Episode: 736 Total reward: 44.0 Training loss: 1.9256 Explore P: 0.0450
Episode: 737 Total reward: 55.0 Training loss: 1.3767 Explore P: 0.0448
Episode: 738 Total reward: 77.0 Training loss: 102.4658 Explore P: 0.0445
Episode: 739 Total reward: 70.0 Training loss: 1.0928 Explore P: 0.0443
Episode: 740 Total reward: 99.0 Training loss: 0.5320 Explore P: 0.0439
Episode: 741 Total reward: 69.0 Training loss: 1.1896 Explore P: 0.0437
Episode: 742 Total reward: 81.0 Training loss: 1.7280 Explore P: 0.0434
Episode: 743 Total reward: 81.0 Training loss: 1.1424 Explore P: 0.0432
Episode: 744 Total reward: 59.0 Training loss: 0.4791 Explore P: 0.0430
Episode: 745 Total reward: 68.0 Training loss: 0.5685 Explore P: 0.0427
Episode: 746 Total reward: 81.0 Training loss: 258.8846 Explore P: 0.0425
Episode: 747 Total reward: 107.0 Training loss: 0.4509 Explore P: 0.0421
Episode: 748 Total reward: 106.0 Training loss: 383.0386 Explore P: 0.0418
Episode: 749 Total reward: 85.0 Training loss: 0.5815 Explore P: 0.0415
Episode: 750 Total reward: 89.0 Training loss: 0.3022 Explore P: 0.0412
Episode: 751 Total reward: 57.0 Training loss: 39.7138 Explore P: 0.0411
Episode: 752 Total reward: 70.0 Training loss: 0.7027 Explore P: 0.0408
Episode: 753 Total reward: 102.0 Training loss: 0.7205 Explore P: 0.0405
Episode: 754 Total reward: 73.0 Training loss: 0.4330 Explore P: 0.0403
Episode: 755 Total reward: 63.0 Training loss: 0.8107 Explore P: 0.0401
Episode: 756 Total reward: 66.0 Training loss: 0.4072 Explore P: 0.0399
Episode: 757 Total reward: 111.0 Training loss: 0.7222 Explore P: 0.0396
Episode: 758 Total reward: 130.0 Training loss: 0.2502 Explore P: 0.0392
Episode: 759 Total reward: 114.0 Training loss: 0.3601 Explore P: 0.0389
Episode: 760 Total reward: 137.0 Training loss: 0.4205 Explore P: 0.0385
Episode: 761 Total reward: 142.0 Training loss: 20.6449 Explore P: 0.0381
Episode: 762 Total reward: 180.0 Training loss: 176.4587 Explore P: 0.0376
Episode: 763 Total reward: 199.0 Training loss: 0.3659 Explore P: 0.0370
Episode: 764 Total reward: 199.0 Training loss: 0.3374 Explore P: 0.0365
Episode: 765 Total reward: 199.0 Training loss: 0.3320 Explore P: 0.0360
Episode: 766 Total reward: 199.0 Training loss: 0.2526 Explore P: 0.0355
Episode: 767 Total reward: 199.0 Training loss: 0.2383 Explore P: 0.0350
Episode: 768 Total reward: 199.0 Training loss: 0.3882 Explore P: 0.0345
Episode: 769 Total reward: 199.0 Training loss: 0.3962 Explore P: 0.0340
Episode: 770 Total reward: 199.0 Training loss: 0.3605 Explore P: 0.0335
Episode: 771 Total reward: 199.0 Training loss: 41.5499 Explore P: 0.0331
Episode: 772 Total reward: 199.0 Training loss: 0.4439 Explore P: 0.0326
Episode: 773 Total reward: 199.0 Training loss: 0.2861 Explore P: 0.0322
Episode: 774 Total reward: 199.0 Training loss: 0.2648 Explore P: 0.0317
Episode: 775 Total reward: 199.0 Training loss: 0.5515 Explore P: 0.0313
Episode: 776 Total reward: 199.0 Training loss: 14.6074 Explore P: 0.0309
Episode: 777 Total reward: 199.0 Training loss: 0.3168 Explore P: 0.0305
Episode: 778 Total reward: 199.0 Training loss: 0.6032 Explore P: 0.0301
Episode: 779 Total reward: 199.0 Training loss: 4.4121 Explore P: 0.0297
Episode: 780 Total reward: 199.0 Training loss: 0.1405 Explore P: 0.0293
Episode: 781 Total reward: 199.0 Training loss: 0.2705 Explore P: 0.0289
Episode: 782 Total reward: 199.0 Training loss: 3.8941 Explore P: 0.0285
Episode: 783 Total reward: 199.0 Training loss: 0.2755 Explore P: 0.0282
Episode: 784 Total reward: 199.0 Training loss: 0.2191 Explore P: 0.0278
Episode: 785 Total reward: 199.0 Training loss: 1.3915 Explore P: 0.0275
Episode: 786 Total reward: 199.0 Training loss: 0.2638 Explore P: 0.0271
Episode: 787 Total reward: 199.0 Training loss: 0.3692 Explore P: 0.0268
Episode: 788 Total reward: 199.0 Training loss: 0.2340 Explore P: 0.0264
Episode: 789 Total reward: 199.0 Training loss: 0.2897 Explore P: 0.0261
Episode: 790 Total reward: 199.0 Training loss: 0.2794 Explore P: 0.0258
Episode: 791 Total reward: 199.0 Training loss: 0.2809 Explore P: 0.0255
Episode: 792 Total reward: 199.0 Training loss: 0.2959 Explore P: 0.0252
Episode: 793 Total reward: 199.0 Training loss: 0.3499 Explore P: 0.0249
Episode: 794 Total reward: 199.0 Training loss: 0.2905 Explore P: 0.0246
Episode: 795 Total reward: 199.0 Training loss: 0.4256 Explore P: 0.0243
Episode: 796 Total reward: 199.0 Training loss: 290.4339 Explore P: 0.0240
Episode: 797 Total reward: 199.0 Training loss: 0.3881 Explore P: 0.0237
Episode: 798 Total reward: 179.0 Training loss: 0.6153 Explore P: 0.0235
Episode: 799 Total reward: 199.0 Training loss: 0.3860 Explore P: 0.0232
Episode: 800 Total reward: 190.0 Training loss: 0.4578 Explore P: 0.0230
Episode: 801 Total reward: 199.0 Training loss: 0.2876 Explore P: 0.0227
Episode: 802 Total reward: 121.0 Training loss: 0.7888 Explore P: 0.0226
Episode: 803 Total reward: 24.0 Training loss: 1.1576 Explore P: 0.0225
Episode: 804 Total reward: 133.0 Training loss: 0.4671 Explore P: 0.0224
Episode: 805 Total reward: 94.0 Training loss: 0.6385 Explore P: 0.0223
Episode: 806 Total reward: 106.0 Training loss: 0.8049 Explore P: 0.0221
Episode: 807 Total reward: 25.0 Training loss: 0.8111 Explore P: 0.0221
Episode: 808 Total reward: 157.0 Training loss: 0.5517 Explore P: 0.0219
Episode: 809 Total reward: 122.0 Training loss: 1.1275 Explore P: 0.0218
Episode: 810 Total reward: 140.0 Training loss: 1.1703 Explore P: 0.0216
Episode: 811 Total reward: 26.0 Training loss: 0.4746 Explore P: 0.0216
Episode: 812 Total reward: 95.0 Training loss: 0.7020 Explore P: 0.0215
Episode: 813 Total reward: 126.0 Training loss: 0.5570 Explore P: 0.0213
Episode: 814 Total reward: 50.0 Training loss: 1.5018 Explore P: 0.0213
Episode: 815 Total reward: 14.0 Training loss: 1.3193 Explore P: 0.0213
Episode: 816 Total reward: 18.0 Training loss: 1.4702 Explore P: 0.0212
Episode: 817 Total reward: 20.0 Training loss: 431.1934 Explore P: 0.0212
Episode: 818 Total reward: 35.0 Training loss: 0.4309 Explore P: 0.0212
Episode: 819 Total reward: 18.0 Training loss: 1.5233 Explore P: 0.0212
Episode: 820 Total reward: 17.0 Training loss: 1.1868 Explore P: 0.0211
Episode: 821 Total reward: 15.0 Training loss: 1.2938 Explore P: 0.0211
Episode: 822 Total reward: 14.0 Training loss: 1.9645 Explore P: 0.0211
Episode: 823 Total reward: 16.0 Training loss: 1.2145 Explore P: 0.0211
Episode: 824 Total reward: 16.0 Training loss: 2.0030 Explore P: 0.0211
Episode: 825 Total reward: 14.0 Training loss: 1.6458 Explore P: 0.0210
Episode: 826 Total reward: 21.0 Training loss: 1.3639 Explore P: 0.0210
Episode: 827 Total reward: 25.0 Training loss: 0.6974 Explore P: 0.0210
Episode: 828 Total reward: 148.0 Training loss: 0.6686 Explore P: 0.0208
Episode: 829 Total reward: 117.0 Training loss: 448.9328 Explore P: 0.0207
Episode: 830 Total reward: 94.0 Training loss: 0.4382 Explore P: 0.0206
Episode: 831 Total reward: 32.0 Training loss: 0.7317 Explore P: 0.0206
Episode: 832 Total reward: 23.0 Training loss: 1.0210 Explore P: 0.0206
Episode: 833 Total reward: 23.0 Training loss: 0.6809 Explore P: 0.0205
Episode: 834 Total reward: 149.0 Training loss: 0.5022 Explore P: 0.0204
Episode: 835 Total reward: 146.0 Training loss: 0.2358 Explore P: 0.0202
Episode: 836 Total reward: 196.0 Training loss: 0.4517 Explore P: 0.0200
Episode: 837 Total reward: 199.0 Training loss: 0.2715 Explore P: 0.0198
Episode: 838 Total reward: 199.0 Training loss: 0.2304 Explore P: 0.0196
Episode: 839 Total reward: 199.0 Training loss: 0.3882 Explore P: 0.0194
Episode: 840 Total reward: 199.0 Training loss: 0.2697 Explore P: 0.0193
Episode: 841 Total reward: 199.0 Training loss: 0.3722 Explore P: 0.0191
Episode: 842 Total reward: 199.0 Training loss: 0.0968 Explore P: 0.0189
Episode: 843 Total reward: 199.0 Training loss: 0.4109 Explore P: 0.0187
Episode: 844 Total reward: 199.0 Training loss: 0.3548 Explore P: 0.0185
Episode: 845 Total reward: 199.0 Training loss: 0.3306 Explore P: 0.0184
Episode: 846 Total reward: 199.0 Training loss: 0.4660 Explore P: 0.0182
Episode: 847 Total reward: 199.0 Training loss: 0.4882 Explore P: 0.0181
Episode: 848 Total reward: 199.0 Training loss: 0.3397 Explore P: 0.0179
Episode: 849 Total reward: 199.0 Training loss: 0.2794 Explore P: 0.0177
Episode: 850 Total reward: 199.0 Training loss: 0.3881 Explore P: 0.0176
Episode: 851 Total reward: 199.0 Training loss: 0.3515 Explore P: 0.0174
Episode: 852 Total reward: 199.0 Training loss: 0.4425 Explore P: 0.0173
Episode: 853 Total reward: 199.0 Training loss: 0.4218 Explore P: 0.0171
Episode: 854 Total reward: 199.0 Training loss: 0.3471 Explore P: 0.0170
Episode: 855 Total reward: 199.0 Training loss: 0.3561 Explore P: 0.0169
Episode: 856 Total reward: 199.0 Training loss: 0.4576 Explore P: 0.0167
Episode: 857 Total reward: 199.0 Training loss: 0.4657 Explore P: 0.0166
Episode: 858 Total reward: 199.0 Training loss: 0.4361 Explore P: 0.0165
Episode: 859 Total reward: 199.0 Training loss: 0.3570 Explore P: 0.0163
Episode: 860 Total reward: 199.0 Training loss: 0.7188 Explore P: 0.0162
Episode: 861 Total reward: 199.0 Training loss: 0.3764 Explore P: 0.0161
Episode: 862 Total reward: 199.0 Training loss: 0.4386 Explore P: 0.0160
Episode: 863 Total reward: 199.0 Training loss: 0.5878 Explore P: 0.0159
Episode: 864 Total reward: 199.0 Training loss: 0.4091 Explore P: 0.0157
Episode: 865 Total reward: 199.0 Training loss: 0.3637 Explore P: 0.0156
Episode: 866 Total reward: 199.0 Training loss: 0.4915 Explore P: 0.0155
Episode: 867 Total reward: 199.0 Training loss: 0.7184 Explore P: 0.0154
Episode: 868 Total reward: 199.0 Training loss: 0.9227 Explore P: 0.0153
Episode: 869 Total reward: 199.0 Training loss: 0.7724 Explore P: 0.0152
Episode: 870 Total reward: 199.0 Training loss: 0.4782 Explore P: 0.0151
Episode: 871 Total reward: 199.0 Training loss: 0.2635 Explore P: 0.0150
Episode: 872 Total reward: 199.0 Training loss: 0.2713 Explore P: 0.0149
Episode: 873 Total reward: 199.0 Training loss: 0.5999 Explore P: 0.0148
Episode: 874 Total reward: 199.0 Training loss: 1.3098 Explore P: 0.0147
Episode: 875 Total reward: 199.0 Training loss: 0.7955 Explore P: 0.0146
Episode: 876 Total reward: 199.0 Training loss: 0.3035 Explore P: 0.0145
Episode: 877 Total reward: 199.0 Training loss: 1.4562 Explore P: 0.0144
Episode: 878 Total reward: 199.0 Training loss: 287.1692 Explore P: 0.0143
Episode: 879 Total reward: 199.0 Training loss: 1.3028 Explore P: 0.0143
Episode: 880 Total reward: 199.0 Training loss: 1.1377 Explore P: 0.0142
Episode: 881 Total reward: 199.0 Training loss: 0.9128 Explore P: 0.0141
Episode: 882 Total reward: 199.0 Training loss: 1.0762 Explore P: 0.0140
Episode: 883 Total reward: 199.0 Training loss: 0.5222 Explore P: 0.0139
Episode: 884 Total reward: 199.0 Training loss: 287.5762 Explore P: 0.0139
Episode: 885 Total reward: 199.0 Training loss: 249.6824 Explore P: 0.0138
Episode: 886 Total reward: 199.0 Training loss: 271.0705 Explore P: 0.0137
Episode: 887 Total reward: 199.0 Training loss: 0.2354 Explore P: 0.0136
Episode: 888 Total reward: 199.0 Training loss: 188.4514 Explore P: 0.0136
Episode: 889 Total reward: 199.0 Training loss: 0.2274 Explore P: 0.0135
Episode: 890 Total reward: 199.0 Training loss: 0.1826 Explore P: 0.0134
Episode: 891 Total reward: 199.0 Training loss: 482.0505 Explore P: 0.0134
Episode: 892 Total reward: 199.0 Training loss: 0.2135 Explore P: 0.0133
Episode: 893 Total reward: 199.0 Training loss: 0.2586 Explore P: 0.0132
Episode: 894 Total reward: 199.0 Training loss: 0.2488 Explore P: 0.0132
Episode: 895 Total reward: 199.0 Training loss: 0.2624 Explore P: 0.0131
Episode: 896 Total reward: 199.0 Training loss: 274.9410 Explore P: 0.0130
Episode: 897 Total reward: 199.0 Training loss: 0.4159 Explore P: 0.0130
Episode: 898 Total reward: 199.0 Training loss: 0.2305 Explore P: 0.0129
Episode: 899 Total reward: 199.0 Training loss: 0.3049 Explore P: 0.0129
Episode: 900 Total reward: 199.0 Training loss: 0.3127 Explore P: 0.0128
Episode: 901 Total reward: 199.0 Training loss: 0.2223 Explore P: 0.0127
Episode: 902 Total reward: 199.0 Training loss: 0.3256 Explore P: 0.0127
Episode: 903 Total reward: 199.0 Training loss: 0.3967 Explore P: 0.0126
Episode: 904 Total reward: 199.0 Training loss: 303.2059 Explore P: 0.0126
Episode: 905 Total reward: 199.0 Training loss: 0.3369 Explore P: 0.0125
Episode: 906 Total reward: 199.0 Training loss: 0.7641 Explore P: 0.0125
Episode: 907 Total reward: 199.0 Training loss: 0.4782 Explore P: 0.0124
Episode: 908 Total reward: 199.0 Training loss: 0.3968 Explore P: 0.0124
Episode: 909 Total reward: 199.0 Training loss: 0.3719 Explore P: 0.0123
Episode: 910 Total reward: 199.0 Training loss: 0.3119 Explore P: 0.0123
Episode: 911 Total reward: 199.0 Training loss: 0.2976 Explore P: 0.0123
Episode: 912 Total reward: 199.0 Training loss: 0.5337 Explore P: 0.0122
Episode: 913 Total reward: 199.0 Training loss: 0.3052 Explore P: 0.0122
Episode: 914 Total reward: 199.0 Training loss: 0.2954 Explore P: 0.0121
Episode: 915 Total reward: 199.0 Training loss: 258.1636 Explore P: 0.0121
Episode: 916 Total reward: 199.0 Training loss: 0.4448 Explore P: 0.0120
Episode: 917 Total reward: 199.0 Training loss: 0.2289 Explore P: 0.0120
Episode: 918 Total reward: 199.0 Training loss: 0.4139 Explore P: 0.0120
Episode: 919 Total reward: 199.0 Training loss: 0.3618 Explore P: 0.0119
Episode: 920 Total reward: 199.0 Training loss: 0.3201 Explore P: 0.0119
Episode: 921 Total reward: 199.0 Training loss: 0.5607 Explore P: 0.0118
Episode: 922 Total reward: 199.0 Training loss: 0.3750 Explore P: 0.0118
Episode: 923 Total reward: 199.0 Training loss: 0.7215 Explore P: 0.0118
Episode: 924 Total reward: 199.0 Training loss: 0.3094 Explore P: 0.0117
Episode: 925 Total reward: 199.0 Training loss: 0.4714 Explore P: 0.0117
Episode: 926 Total reward: 199.0 Training loss: 0.5222 Explore P: 0.0117
Episode: 927 Total reward: 199.0 Training loss: 0.4815 Explore P: 0.0116
Episode: 928 Total reward: 199.0 Training loss: 0.4152 Explore P: 0.0116
Episode: 929 Total reward: 199.0 Training loss: 0.3663 Explore P: 0.0116
Episode: 930 Total reward: 199.0 Training loss: 0.2195 Explore P: 0.0115
Episode: 931 Total reward: 199.0 Training loss: 0.3942 Explore P: 0.0115
Episode: 932 Total reward: 199.0 Training loss: 0.1619 Explore P: 0.0115
Episode: 933 Total reward: 199.0 Training loss: 0.2259 Explore P: 0.0115
Episode: 934 Total reward: 199.0 Training loss: 0.3272 Explore P: 0.0114
Episode: 935 Total reward: 199.0 Training loss: 0.1965 Explore P: 0.0114
Episode: 936 Total reward: 199.0 Training loss: 0.2634 Explore P: 0.0114
Episode: 937 Total reward: 199.0 Training loss: 0.2144 Explore P: 0.0113
Episode: 938 Total reward: 199.0 Training loss: 0.1325 Explore P: 0.0113
Episode: 939 Total reward: 199.0 Training loss: 0.1549 Explore P: 0.0113
Episode: 940 Total reward: 199.0 Training loss: 169.0099 Explore P: 0.0113
Episode: 941 Total reward: 199.0 Training loss: 0.1853 Explore P: 0.0112
Episode: 942 Total reward: 199.0 Training loss: 0.1085 Explore P: 0.0112
Episode: 943 Total reward: 199.0 Training loss: 0.1719 Explore P: 0.0112
Episode: 944 Total reward: 199.0 Training loss: 0.3436 Explore P: 0.0112
Episode: 945 Total reward: 199.0 Training loss: 0.2881 Explore P: 0.0111
Episode: 946 Total reward: 199.0 Training loss: 0.1439 Explore P: 0.0111
Episode: 947 Total reward: 199.0 Training loss: 0.1735 Explore P: 0.0111
Episode: 948 Total reward: 199.0 Training loss: 0.1271 Explore P: 0.0111
Episode: 949 Total reward: 199.0 Training loss: 0.1729 Explore P: 0.0111
Episode: 950 Total reward: 199.0 Training loss: 0.1054 Explore P: 0.0110
Episode: 951 Total reward: 199.0 Training loss: 0.1075 Explore P: 0.0110
Episode: 952 Total reward: 199.0 Training loss: 0.1388 Explore P: 0.0110
Episode: 953 Total reward: 199.0 Training loss: 0.2406 Explore P: 0.0110
Episode: 954 Total reward: 199.0 Training loss: 0.1928 Explore P: 0.0110
Episode: 955 Total reward: 199.0 Training loss: 0.2186 Explore P: 0.0109
Episode: 956 Total reward: 199.0 Training loss: 0.1930 Explore P: 0.0109
Episode: 957 Total reward: 199.0 Training loss: 0.1598 Explore P: 0.0109
Episode: 958 Total reward: 199.0 Training loss: 0.1594 Explore P: 0.0109
Episode: 959 Total reward: 199.0 Training loss: 0.1132 Explore P: 0.0109
Episode: 960 Total reward: 199.0 Training loss: 0.2737 Explore P: 0.0108
Episode: 961 Total reward: 199.0 Training loss: 0.1226 Explore P: 0.0108
Episode: 962 Total reward: 127.0 Training loss: 0.1656 Explore P: 0.0108
Episode: 963 Total reward: 152.0 Training loss: 0.1861 Explore P: 0.0108
Episode: 964 Total reward: 149.0 Training loss: 0.1575 Explore P: 0.0108
Episode: 965 Total reward: 170.0 Training loss: 0.3160 Explore P: 0.0108
Episode: 966 Total reward: 177.0 Training loss: 0.1949 Explore P: 0.0108
Episode: 967 Total reward: 199.0 Training loss: 0.0951 Explore P: 0.0108
Episode: 968 Total reward: 149.0 Training loss: 0.5711 Explore P: 0.0107
Episode: 969 Total reward: 114.0 Training loss: 0.1477 Explore P: 0.0107
Episode: 970 Total reward: 126.0 Training loss: 0.1378 Explore P: 0.0107
Episode: 971 Total reward: 102.0 Training loss: 0.1625 Explore P: 0.0107
Episode: 972 Total reward: 108.0 Training loss: 0.3807 Explore P: 0.0107
Episode: 973 Total reward: 44.0 Training loss: 151.5800 Explore P: 0.0107
Episode: 974 Total reward: 63.0 Training loss: 0.2211 Explore P: 0.0107
Episode: 975 Total reward: 33.0 Training loss: 0.4820 Explore P: 0.0107
Episode: 976 Total reward: 32.0 Training loss: 0.2311 Explore P: 0.0107
Episode: 977 Total reward: 30.0 Training loss: 0.3379 Explore P: 0.0107
Episode: 978 Total reward: 23.0 Training loss: 0.5098 Explore P: 0.0107
Episode: 979 Total reward: 29.0 Training loss: 0.6654 Explore P: 0.0107
Episode: 980 Total reward: 43.0 Training loss: 0.1683 Explore P: 0.0107
Episode: 981 Total reward: 44.0 Training loss: 0.1758 Explore P: 0.0107
Episode: 982 Total reward: 84.0 Training loss: 0.5272 Explore P: 0.0107
Episode: 983 Total reward: 34.0 Training loss: 0.1453 Explore P: 0.0107
Episode: 984 Total reward: 125.0 Training loss: 0.2690 Explore P: 0.0107
Episode: 985 Total reward: 157.0 Training loss: 0.2334 Explore P: 0.0107
Episode: 986 Total reward: 58.0 Training loss: 0.1949 Explore P: 0.0107
Episode: 987 Total reward: 57.0 Training loss: 0.1700 Explore P: 0.0107
Episode: 988 Total reward: 68.0 Training loss: 0.3195 Explore P: 0.0106
Episode: 989 Total reward: 134.0 Training loss: 0.1551 Explore P: 0.0106
Episode: 990 Total reward: 199.0 Training loss: 0.2924 Explore P: 0.0106
Episode: 991 Total reward: 189.0 Training loss: 0.1769 Explore P: 0.0106
Episode: 992 Total reward: 199.0 Training loss: 0.2363 Explore P: 0.0106
Episode: 993 Total reward: 199.0 Training loss: 0.2669 Explore P: 0.0106
Episode: 994 Total reward: 199.0 Training loss: 0.3318 Explore P: 0.0106
Episode: 995 Total reward: 199.0 Training loss: 0.1840 Explore P: 0.0106
Episode: 996 Total reward: 199.0 Training loss: 0.2243 Explore P: 0.0106
Episode: 997 Total reward: 199.0 Training loss: 0.2323 Explore P: 0.0105
Episode: 998 Total reward: 199.0 Training loss: 0.1370 Explore P: 0.0105
Episode: 999 Total reward: 199.0 Training loss: 0.2314 Explore P: 0.0105

Visualizing training

Below I'll plot the total reward for each episode in grey, along with a rolling average over 10 episodes in blue.


In [12]:
%matplotlib inline
import matplotlib.pyplot as plt

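def running_mean(x, N):
    """Return the moving average of x over a window of N values."""
    # Prepend a zero so the difference of two cumulative sums gives each window's total
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N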
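
To see what the cumulative-sum trick in running_mean does, here is a small worked example on a made-up reward list (the numbers are illustrative, not taken from the training run). The output is N - 1 elements shorter than the input, which is why the plotting cell uses eps[-len(smoothed_rews):] for the x-axis.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum moving average as defined above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical rewards, chosen only to make the arithmetic easy to follow
toy_rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = running_mean(toy_rewards, 2)

print(smoothed)                          # [1.5 2.5 3.5 4.5]
print(len(toy_rewards) - len(smoothed))  # 1, i.e. N - 1 values are dropped
```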

In [14]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[14]:
<matplotlib.text.Text at 0x11fb4f5c0>

Testing

Let's check out how our trained agent plays the game.


In [16]:
test_episodes = 10
test_max_steps = 400
state = env.reset()  # start a fresh episode and keep its initial observation
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes + 1):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1
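
During testing the agent always takes the action with the highest predicted Q-value, with no random exploration. Here is a minimal sketch of that selection step, using a made-up Q-value array in place of the result of sess.run(mainQN.output, feed_dict=feed). The values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical network output for one state: shape (1, 2),
# one Q-value per action (0 = push left, 1 = push right)
Qs = np.array([[0.37, 1.25]])

# np.argmax works on the flattened array, so with a batch of one state
# the result is directly the index of the best action
action = int(np.argmax(Qs))
print(action)  # 1
```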

In [17]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small state vector like the one we're using here, though, you'd feed in the screen images and use convolutional layers to extract the state from them.
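
As a rough sketch of what working from screen images involves, the snippet below converts a frame to grayscale, shrinks it, and stacks the most recent frames into a single array that convolutional layers could take as input. The frame size, the stride-based downsampling, and the stack depth of 4 are assumptions chosen for illustration, and the random frames only stand in for real game screens.

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Convert an RGB frame (H, W, 3) to a downsampled grayscale image."""
    gray = frame.mean(axis=2)   # average the colour channels
    small = gray[::2, ::2]      # keep every second pixel in each direction
    return (small / 255.0).astype(np.float32)  # scale to [0, 1]

# Hypothetical 210x160 RGB frames standing in for real game screens
rng = np.random.RandomState(0)
stack_size = 4
frames = deque(maxlen=stack_size)  # only the newest stack_size frames are kept
for _ in range(6):
    frame = rng.randint(0, 256, size=(210, 160, 3)).astype(np.uint8)
    frames.append(preprocess(frame))

# Stack the most recent frames along the last axis so the network can see motion
state = np.stack(list(frames), axis=2)
print(state.shape)  # (105, 80, 4)
```

Stacking several frames matters because a single image shows where things are but not how they are moving.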

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.