Deep $Q$-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use $Q$-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [5]:
import gym
import numpy as np

# Create the Cart-Pole game environment
env = gym.make('CartPole-v1')

# Number of possible actions
print('Number of possible actions:', env.action_space.n)


WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Number of possible actions: 2

In [ ]:
[2018-01-22 23:10:02,350] Making new env: CartPole-v1


Number of possible actions: 2

We interact with the simulation through env. You can see how many actions are possible from env.action_space.n, and to get a random action you can use env.action_space.sample(). Passing in an action as an integer to env.step will generate the next step in the simulation. This is general to all Gym games.

In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to interact with the environment.


In [6]:
actions = [] # actions that the agent selects
rewards = [] # obtained rewards
state = env.reset()

while True:
    action = env.action_space.sample()  # choose a random action
    state, reward, done, _ = env.step(action) 
    rewards.append(reward)
    actions.append(action)
    if done:
        break

We can look at the actions and rewards:


In [7]:
print('Actions:', actions)
print('Rewards:', rewards)


Actions: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
Rewards: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

In [8]:
Actions: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0]
Rewards: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The game resets after the pole has fallen past a certain angle. For each step while the game is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

$Q$-Network

To keep track of the action values, we'll use a neural network that accepts a state $s$ as input. The output will be $Q$-values for each available action $a$ (i.e., the output is all action values $Q(s,a)$ corresponding to the input state $s$).

For this Cart-Pole game, the state has four values: the position and velocity of the cart, and the position and velocity of the pole. Thus, the neural network has four inputs, one for each value in the state, and two outputs, one for each possible action.

As explored in the lesson, to get the training target, we'll first use the context provided by the state $s$ to choose an action $a$, then simulate the game using that action. This will get us the next state, $s'$, and the reward $r$. With that, we can calculate $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

Below is one implementation of the $Q$-network. It uses two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [9]:
import tensorflow as tf

class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [10]:
from collections import deque

class Memory():
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

$Q$-Learning training algorithm

We will use the below algorithm to train the network. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode $\leftarrow 1$ to $M$ do
    • Observe $s_0$
    • For $t \leftarrow 0$ to $T-1$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor

You are welcome (and encouraged!) to take the time to extend this code to implement some of the improvements that we discussed in the lesson, to include fixed $Q$ targets, double DQNs, prioritized replay, and/or dueling networks.

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [11]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number experiences to pretrain the memory

In [12]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here we re-initialize the simulation and pre-populate the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [13]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent.


In [14]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 8.0 Training loss: 1.0684 Explore P: 0.9992
Episode: 2 Total reward: 8.0 Training loss: 1.0467 Explore P: 0.9984
Episode: 3 Total reward: 12.0 Training loss: 1.0113 Explore P: 0.9972
Episode: 4 Total reward: 12.0 Training loss: 1.0239 Explore P: 0.9960
Episode: 5 Total reward: 18.0 Training loss: 1.1286 Explore P: 0.9943
Episode: 6 Total reward: 50.0 Training loss: 1.1135 Explore P: 0.9894
Episode: 7 Total reward: 14.0 Training loss: 1.1539 Explore P: 0.9880
Episode: 8 Total reward: 38.0 Training loss: 1.1720 Explore P: 0.9843
Episode: 9 Total reward: 20.0 Training loss: 1.0946 Explore P: 0.9823
Episode: 10 Total reward: 32.0 Training loss: 1.2933 Explore P: 0.9792
Episode: 11 Total reward: 14.0 Training loss: 1.0522 Explore P: 0.9779
Episode: 12 Total reward: 12.0 Training loss: 1.2921 Explore P: 0.9767
Episode: 13 Total reward: 16.0 Training loss: 1.2344 Explore P: 0.9752
Episode: 14 Total reward: 14.0 Training loss: 1.1537 Explore P: 0.9738
Episode: 15 Total reward: 7.0 Training loss: 1.4951 Explore P: 0.9731
Episode: 16 Total reward: 27.0 Training loss: 1.3684 Explore P: 0.9705
Episode: 17 Total reward: 34.0 Training loss: 1.3186 Explore P: 0.9673
Episode: 18 Total reward: 12.0 Training loss: 1.1679 Explore P: 0.9661
Episode: 19 Total reward: 18.0 Training loss: 1.7617 Explore P: 0.9644
Episode: 20 Total reward: 33.0 Training loss: 1.2475 Explore P: 0.9613
Episode: 21 Total reward: 15.0 Training loss: 1.3567 Explore P: 0.9599
Episode: 22 Total reward: 34.0 Training loss: 1.8591 Explore P: 0.9566
Episode: 23 Total reward: 15.0 Training loss: 1.7568 Explore P: 0.9552
Episode: 24 Total reward: 12.0 Training loss: 2.2871 Explore P: 0.9541
Episode: 25 Total reward: 44.0 Training loss: 1.9570 Explore P: 0.9499
Episode: 26 Total reward: 7.0 Training loss: 1.8154 Explore P: 0.9493
Episode: 27 Total reward: 48.0 Training loss: 2.5947 Explore P: 0.9448
Episode: 28 Total reward: 25.0 Training loss: 2.2409 Explore P: 0.9424
Episode: 29 Total reward: 9.0 Training loss: 7.0578 Explore P: 0.9416
Episode: 30 Total reward: 22.0 Training loss: 8.5203 Explore P: 0.9396
Episode: 31 Total reward: 9.0 Training loss: 3.3513 Explore P: 0.9387
Episode: 32 Total reward: 21.0 Training loss: 2.8973 Explore P: 0.9368
Episode: 33 Total reward: 11.0 Training loss: 4.4884 Explore P: 0.9358
Episode: 34 Total reward: 22.0 Training loss: 3.8208 Explore P: 0.9337
Episode: 35 Total reward: 12.0 Training loss: 6.6709 Explore P: 0.9326
Episode: 36 Total reward: 18.0 Training loss: 6.1406 Explore P: 0.9309
Episode: 37 Total reward: 15.0 Training loss: 10.0487 Explore P: 0.9296
Episode: 38 Total reward: 16.0 Training loss: 13.2075 Explore P: 0.9281
Episode: 39 Total reward: 26.0 Training loss: 7.6971 Explore P: 0.9257
Episode: 40 Total reward: 13.0 Training loss: 6.0606 Explore P: 0.9245
Episode: 41 Total reward: 36.0 Training loss: 10.1472 Explore P: 0.9212
Episode: 42 Total reward: 13.0 Training loss: 9.0084 Explore P: 0.9201
Episode: 43 Total reward: 22.0 Training loss: 23.2525 Explore P: 0.9181
Episode: 44 Total reward: 22.0 Training loss: 10.2871 Explore P: 0.9161
Episode: 45 Total reward: 12.0 Training loss: 8.4670 Explore P: 0.9150
Episode: 46 Total reward: 13.0 Training loss: 4.3752 Explore P: 0.9138
Episode: 47 Total reward: 18.0 Training loss: 11.3345 Explore P: 0.9122
Episode: 48 Total reward: 12.0 Training loss: 27.3281 Explore P: 0.9111
Episode: 49 Total reward: 16.0 Training loss: 10.0772 Explore P: 0.9096
Episode: 50 Total reward: 8.0 Training loss: 4.1941 Explore P: 0.9089
Episode: 51 Total reward: 14.0 Training loss: 3.2387 Explore P: 0.9077
Episode: 52 Total reward: 58.0 Training loss: 9.8572 Explore P: 0.9025
Episode: 53 Total reward: 16.0 Training loss: 12.9038 Explore P: 0.9011
Episode: 54 Total reward: 31.0 Training loss: 11.8131 Explore P: 0.8983
Episode: 55 Total reward: 13.0 Training loss: 18.3078 Explore P: 0.8971
Episode: 56 Total reward: 34.0 Training loss: 19.9301 Explore P: 0.8941
Episode: 57 Total reward: 18.0 Training loss: 5.9858 Explore P: 0.8925
Episode: 58 Total reward: 17.0 Training loss: 5.9270 Explore P: 0.8910
Episode: 59 Total reward: 18.0 Training loss: 4.4149 Explore P: 0.8895
Episode: 60 Total reward: 32.0 Training loss: 6.9850 Explore P: 0.8866
Episode: 61 Total reward: 11.0 Training loss: 4.8957 Explore P: 0.8857
Episode: 62 Total reward: 8.0 Training loss: 19.9535 Explore P: 0.8850
Episode: 63 Total reward: 13.0 Training loss: 23.7312 Explore P: 0.8838
Episode: 64 Total reward: 40.0 Training loss: 39.5898 Explore P: 0.8804
Episode: 65 Total reward: 27.0 Training loss: 28.5896 Explore P: 0.8780
Episode: 66 Total reward: 15.0 Training loss: 30.4548 Explore P: 0.8767
Episode: 67 Total reward: 10.0 Training loss: 8.6907 Explore P: 0.8758
Episode: 68 Total reward: 25.0 Training loss: 28.2341 Explore P: 0.8737
Episode: 69 Total reward: 8.0 Training loss: 111.0703 Explore P: 0.8730
Episode: 70 Total reward: 29.0 Training loss: 72.3358 Explore P: 0.8705
Episode: 71 Total reward: 9.0 Training loss: 10.4353 Explore P: 0.8697
Episode: 72 Total reward: 19.0 Training loss: 9.6673 Explore P: 0.8681
Episode: 73 Total reward: 16.0 Training loss: 12.1161 Explore P: 0.8667
Episode: 74 Total reward: 49.0 Training loss: 40.8315 Explore P: 0.8625
Episode: 75 Total reward: 23.0 Training loss: 8.8591 Explore P: 0.8606
Episode: 76 Total reward: 9.0 Training loss: 7.8767 Explore P: 0.8598
Episode: 77 Total reward: 48.0 Training loss: 9.2514 Explore P: 0.8557
Episode: 78 Total reward: 11.0 Training loss: 37.4159 Explore P: 0.8548
Episode: 79 Total reward: 22.0 Training loss: 30.0442 Explore P: 0.8529
Episode: 80 Total reward: 20.0 Training loss: 127.2028 Explore P: 0.8513
Episode: 81 Total reward: 47.0 Training loss: 58.7176 Explore P: 0.8473
Episode: 82 Total reward: 28.0 Training loss: 36.4487 Explore P: 0.8450
Episode: 83 Total reward: 14.0 Training loss: 9.6760 Explore P: 0.8438
Episode: 84 Total reward: 29.0 Training loss: 9.4721 Explore P: 0.8414
Episode: 85 Total reward: 27.0 Training loss: 10.2267 Explore P: 0.8392
Episode: 86 Total reward: 15.0 Training loss: 66.4717 Explore P: 0.8379
Episode: 87 Total reward: 31.0 Training loss: 38.2111 Explore P: 0.8353
Episode: 88 Total reward: 12.0 Training loss: 78.7699 Explore P: 0.8344
Episode: 89 Total reward: 17.0 Training loss: 12.0752 Explore P: 0.8330
Episode: 90 Total reward: 40.0 Training loss: 9.3118 Explore P: 0.8297
Episode: 91 Total reward: 9.0 Training loss: 49.6869 Explore P: 0.8289
Episode: 92 Total reward: 17.0 Training loss: 79.2813 Explore P: 0.8275
Episode: 93 Total reward: 11.0 Training loss: 8.5081 Explore P: 0.8266
Episode: 94 Total reward: 14.0 Training loss: 143.4716 Explore P: 0.8255
Episode: 95 Total reward: 8.0 Training loss: 58.7187 Explore P: 0.8249
Episode: 96 Total reward: 42.0 Training loss: 6.8510 Explore P: 0.8214
Episode: 97 Total reward: 30.0 Training loss: 79.3785 Explore P: 0.8190
Episode: 98 Total reward: 23.0 Training loss: 43.2265 Explore P: 0.8171
Episode: 99 Total reward: 14.0 Training loss: 8.0354 Explore P: 0.8160
Episode: 100 Total reward: 12.0 Training loss: 46.6256 Explore P: 0.8151
Episode: 101 Total reward: 23.0 Training loss: 135.1617 Explore P: 0.8132
Episode: 102 Total reward: 14.0 Training loss: 73.3703 Explore P: 0.8121
Episode: 103 Total reward: 18.0 Training loss: 97.7066 Explore P: 0.8106
Episode: 104 Total reward: 21.0 Training loss: 6.4787 Explore P: 0.8090
Episode: 105 Total reward: 11.0 Training loss: 46.5145 Explore P: 0.8081
Episode: 106 Total reward: 10.0 Training loss: 75.2267 Explore P: 0.8073
Episode: 107 Total reward: 29.0 Training loss: 145.1040 Explore P: 0.8050
Episode: 108 Total reward: 10.0 Training loss: 6.8401 Explore P: 0.8042
Episode: 109 Total reward: 11.0 Training loss: 174.1059 Explore P: 0.8033
Episode: 110 Total reward: 35.0 Training loss: 7.6953 Explore P: 0.8005
Episode: 111 Total reward: 12.0 Training loss: 181.3796 Explore P: 0.7996
Episode: 112 Total reward: 13.0 Training loss: 179.9426 Explore P: 0.7986
Episode: 113 Total reward: 8.0 Training loss: 5.4505 Explore P: 0.7979
Episode: 114 Total reward: 15.0 Training loss: 5.1279 Explore P: 0.7967
Episode: 115 Total reward: 9.0 Training loss: 301.5938 Explore P: 0.7960
Episode: 116 Total reward: 19.0 Training loss: 7.3884 Explore P: 0.7945
Episode: 117 Total reward: 22.0 Training loss: 7.3373 Explore P: 0.7928
Episode: 118 Total reward: 18.0 Training loss: 107.7919 Explore P: 0.7914
Episode: 119 Total reward: 15.0 Training loss: 179.7941 Explore P: 0.7902
Episode: 120 Total reward: 24.0 Training loss: 155.4587 Explore P: 0.7884
Episode: 121 Total reward: 20.0 Training loss: 4.8426 Explore P: 0.7868
Episode: 122 Total reward: 10.0 Training loss: 224.7815 Explore P: 0.7860
Episode: 123 Total reward: 14.0 Training loss: 77.8618 Explore P: 0.7850
Episode: 124 Total reward: 21.0 Training loss: 4.0350 Explore P: 0.7833
Episode: 125 Total reward: 16.0 Training loss: 46.8008 Explore P: 0.7821
Episode: 126 Total reward: 10.0 Training loss: 123.5068 Explore P: 0.7813
Episode: 127 Total reward: 7.0 Training loss: 4.0798 Explore P: 0.7808
Episode: 128 Total reward: 10.0 Training loss: 119.8019 Explore P: 0.7800
Episode: 129 Total reward: 33.0 Training loss: 150.0170 Explore P: 0.7775
Episode: 130 Total reward: 16.0 Training loss: 118.9459 Explore P: 0.7762
Episode: 131 Total reward: 17.0 Training loss: 3.7892 Explore P: 0.7749
Episode: 132 Total reward: 23.0 Training loss: 66.4730 Explore P: 0.7732
Episode: 133 Total reward: 14.0 Training loss: 3.3666 Explore P: 0.7721
Episode: 134 Total reward: 27.0 Training loss: 82.4510 Explore P: 0.7701
Episode: 135 Total reward: 18.0 Training loss: 98.1500 Explore P: 0.7687
Episode: 136 Total reward: 10.0 Training loss: 4.2998 Explore P: 0.7679
Episode: 137 Total reward: 21.0 Training loss: 142.5999 Explore P: 0.7664
Episode: 138 Total reward: 21.0 Training loss: 142.9930 Explore P: 0.7648
Episode: 139 Total reward: 35.0 Training loss: 77.8420 Explore P: 0.7621
Episode: 140 Total reward: 17.0 Training loss: 116.4496 Explore P: 0.7608
Episode: 141 Total reward: 9.0 Training loss: 81.1015 Explore P: 0.7602
Episode: 142 Total reward: 17.0 Training loss: 65.2262 Explore P: 0.7589
Episode: 143 Total reward: 27.0 Training loss: 2.3060 Explore P: 0.7569
Episode: 144 Total reward: 10.0 Training loss: 130.6761 Explore P: 0.7561
Episode: 145 Total reward: 15.0 Training loss: 110.7910 Explore P: 0.7550
Episode: 146 Total reward: 16.0 Training loss: 54.1929 Explore P: 0.7538
Episode: 147 Total reward: 15.0 Training loss: 2.4648 Explore P: 0.7527
Episode: 148 Total reward: 13.0 Training loss: 48.2231 Explore P: 0.7517
Episode: 149 Total reward: 14.0 Training loss: 59.7532 Explore P: 0.7507
Episode: 150 Total reward: 9.0 Training loss: 176.1534 Explore P: 0.7500
Episode: 151 Total reward: 25.0 Training loss: 58.1746 Explore P: 0.7482
Episode: 152 Total reward: 15.0 Training loss: 48.0449 Explore P: 0.7471
Episode: 153 Total reward: 14.0 Training loss: 2.7333 Explore P: 0.7461
Episode: 154 Total reward: 22.0 Training loss: 57.3848 Explore P: 0.7444
Episode: 155 Total reward: 20.0 Training loss: 49.9609 Explore P: 0.7430
Episode: 156 Total reward: 19.0 Training loss: 1.0095 Explore P: 0.7416
Episode: 157 Total reward: 8.0 Training loss: 50.4277 Explore P: 0.7410
Episode: 158 Total reward: 16.0 Training loss: 145.9109 Explore P: 0.7398
Episode: 159 Total reward: 11.0 Training loss: 138.0690 Explore P: 0.7390
Episode: 160 Total reward: 24.0 Training loss: 44.7967 Explore P: 0.7373
Episode: 161 Total reward: 14.0 Training loss: 1.8535 Explore P: 0.7363
Episode: 162 Total reward: 12.0 Training loss: 0.9235 Explore P: 0.7354
Episode: 163 Total reward: 14.0 Training loss: 0.9175 Explore P: 0.7344
Episode: 164 Total reward: 10.0 Training loss: 1.5927 Explore P: 0.7336
Episode: 165 Total reward: 13.0 Training loss: 41.0304 Explore P: 0.7327
Episode: 166 Total reward: 18.0 Training loss: 0.9600 Explore P: 0.7314
Episode: 167 Total reward: 8.0 Training loss: 89.9705 Explore P: 0.7308
Episode: 168 Total reward: 10.0 Training loss: 1.1634 Explore P: 0.7301
Episode: 169 Total reward: 20.0 Training loss: 77.5591 Explore P: 0.7287
Episode: 170 Total reward: 35.0 Training loss: 37.6513 Explore P: 0.7262
Episode: 171 Total reward: 12.0 Training loss: 44.6431 Explore P: 0.7253
Episode: 172 Total reward: 11.0 Training loss: 71.6115 Explore P: 0.7245
Episode: 173 Total reward: 16.0 Training loss: 102.3692 Explore P: 0.7234
Episode: 174 Total reward: 22.0 Training loss: 1.1056 Explore P: 0.7218
Episode: 175 Total reward: 13.0 Training loss: 85.0131 Explore P: 0.7209
Episode: 176 Total reward: 19.0 Training loss: 32.9587 Explore P: 0.7195
Episode: 177 Total reward: 8.0 Training loss: 0.9099 Explore P: 0.7190
Episode: 178 Total reward: 11.0 Training loss: 159.8979 Explore P: 0.7182
Episode: 179 Total reward: 31.0 Training loss: 95.9710 Explore P: 0.7160
Episode: 180 Total reward: 38.0 Training loss: 64.6527 Explore P: 0.7133
Episode: 181 Total reward: 20.0 Training loss: 1.7750 Explore P: 0.7119
Episode: 182 Total reward: 11.0 Training loss: 83.5146 Explore P: 0.7111
Episode: 183 Total reward: 15.0 Training loss: 1.7104 Explore P: 0.7101
Episode: 184 Total reward: 14.0 Training loss: 1.7226 Explore P: 0.7091
Episode: 185 Total reward: 9.0 Training loss: 1.6573 Explore P: 0.7085
Episode: 186 Total reward: 10.0 Training loss: 1.8966 Explore P: 0.7078
Episode: 187 Total reward: 16.0 Training loss: 54.2308 Explore P: 0.7067
Episode: 188 Total reward: 16.0 Training loss: 28.2465 Explore P: 0.7056
Episode: 189 Total reward: 28.0 Training loss: 1.2947 Explore P: 0.7036
Episode: 190 Total reward: 16.0 Training loss: 1.1800 Explore P: 0.7025
Episode: 191 Total reward: 32.0 Training loss: 1.4398 Explore P: 0.7003
Episode: 192 Total reward: 17.0 Training loss: 26.3074 Explore P: 0.6991
Episode: 193 Total reward: 18.0 Training loss: 48.8937 Explore P: 0.6979
Episode: 194 Total reward: 27.0 Training loss: 176.7958 Explore P: 0.6960
Episode: 195 Total reward: 35.0 Training loss: 29.8695 Explore P: 0.6936
Episode: 196 Total reward: 9.0 Training loss: 2.6726 Explore P: 0.6930
Episode: 197 Total reward: 12.0 Training loss: 81.6009 Explore P: 0.6922
Episode: 198 Total reward: 9.0 Training loss: 53.6251 Explore P: 0.6916
Episode: 199 Total reward: 38.0 Training loss: 94.0010 Explore P: 0.6890
Episode: 200 Total reward: 15.0 Training loss: 24.6086 Explore P: 0.6880
Episode: 201 Total reward: 17.0 Training loss: 66.2659 Explore P: 0.6868
Episode: 202 Total reward: 11.0 Training loss: 1.6257 Explore P: 0.6861
Episode: 203 Total reward: 18.0 Training loss: 2.4640 Explore P: 0.6849
Episode: 204 Total reward: 18.0 Training loss: 47.6094 Explore P: 0.6836
Episode: 205 Total reward: 20.0 Training loss: 70.4583 Explore P: 0.6823
Episode: 206 Total reward: 18.0 Training loss: 22.8140 Explore P: 0.6811
Episode: 207 Total reward: 15.0 Training loss: 1.9974 Explore P: 0.6801
Episode: 208 Total reward: 24.0 Training loss: 49.5533 Explore P: 0.6785
Episode: 209 Total reward: 8.0 Training loss: 60.7092 Explore P: 0.6779
Episode: 210 Total reward: 8.0 Training loss: 28.4767 Explore P: 0.6774
Episode: 211 Total reward: 18.0 Training loss: 52.1672 Explore P: 0.6762
Episode: 212 Total reward: 17.0 Training loss: 47.1092 Explore P: 0.6751
Episode: 213 Total reward: 28.0 Training loss: 25.0076 Explore P: 0.6732
Episode: 214 Total reward: 15.0 Training loss: 2.5759 Explore P: 0.6722
Episode: 215 Total reward: 12.0 Training loss: 57.1213 Explore P: 0.6714
Episode: 216 Total reward: 12.0 Training loss: 2.6585 Explore P: 0.6706
Episode: 217 Total reward: 16.0 Training loss: 4.5227 Explore P: 0.6696
Episode: 218 Total reward: 23.0 Training loss: 84.3906 Explore P: 0.6681
Episode: 219 Total reward: 22.0 Training loss: 64.6151 Explore P: 0.6666
Episode: 220 Total reward: 37.0 Training loss: 74.3045 Explore P: 0.6642
Episode: 221 Total reward: 11.0 Training loss: 47.2340 Explore P: 0.6635
Episode: 222 Total reward: 15.0 Training loss: 84.5512 Explore P: 0.6625
Episode: 223 Total reward: 21.0 Training loss: 22.3363 Explore P: 0.6611
Episode: 224 Total reward: 10.0 Training loss: 1.6090 Explore P: 0.6605
Episode: 225 Total reward: 9.0 Training loss: 48.1282 Explore P: 0.6599
Episode: 226 Total reward: 27.0 Training loss: 1.2242 Explore P: 0.6581
Episode: 227 Total reward: 15.0 Training loss: 41.4364 Explore P: 0.6572
Episode: 228 Total reward: 11.0 Training loss: 22.6689 Explore P: 0.6565
Episode: 229 Total reward: 13.0 Training loss: 80.5141 Explore P: 0.6556
Episode: 230 Total reward: 13.0 Training loss: 103.6133 Explore P: 0.6548
Episode: 231 Total reward: 13.0 Training loss: 37.5009 Explore P: 0.6539
Episode: 232 Total reward: 14.0 Training loss: 1.9739 Explore P: 0.6530
Episode: 233 Total reward: 12.0 Training loss: 2.4178 Explore P: 0.6523
Episode: 234 Total reward: 10.0 Training loss: 1.7295 Explore P: 0.6516
Episode: 235 Total reward: 7.0 Training loss: 1.8049 Explore P: 0.6512
Episode: 236 Total reward: 12.0 Training loss: 56.8069 Explore P: 0.6504
Episode: 237 Total reward: 35.0 Training loss: 1.5325 Explore P: 0.6482
Episode: 238 Total reward: 16.0 Training loss: 21.2095 Explore P: 0.6471
Episode: 239 Total reward: 18.0 Training loss: 1.7449 Explore P: 0.6460
Episode: 240 Total reward: 15.0 Training loss: 1.6335 Explore P: 0.6451
Episode: 241 Total reward: 11.0 Training loss: 1.5960 Explore P: 0.6444
Episode: 242 Total reward: 9.0 Training loss: 1.4374 Explore P: 0.6438
Episode: 243 Total reward: 11.0 Training loss: 19.9579 Explore P: 0.6431
Episode: 244 Total reward: 15.0 Training loss: 26.8245 Explore P: 0.6421
Episode: 245 Total reward: 25.0 Training loss: 1.0702 Explore P: 0.6406
Episode: 246 Total reward: 10.0 Training loss: 55.8975 Explore P: 0.6399
Episode: 247 Total reward: 10.0 Training loss: 19.5992 Explore P: 0.6393
Episode: 248 Total reward: 37.0 Training loss: 1.8334 Explore P: 0.6370
Episode: 249 Total reward: 19.0 Training loss: 25.3745 Explore P: 0.6358
Episode: 250 Total reward: 9.0 Training loss: 51.6768 Explore P: 0.6352
Episode: 251 Total reward: 21.0 Training loss: 1.0245 Explore P: 0.6339
Episode: 252 Total reward: 46.0 Training loss: 115.1664 Explore P: 0.6310
Episode: 253 Total reward: 26.0 Training loss: 29.3754 Explore P: 0.6294
Episode: 254 Total reward: 36.0 Training loss: 47.1423 Explore P: 0.6272
Episode: 255 Total reward: 64.0 Training loss: 62.6025 Explore P: 0.6233
Episode: 256 Total reward: 31.0 Training loss: 19.9118 Explore P: 0.6214
Episode: 257 Total reward: 74.0 Training loss: 22.6690 Explore P: 0.6169
Episode: 258 Total reward: 17.0 Training loss: 0.9454 Explore P: 0.6158
Episode: 259 Total reward: 17.0 Training loss: 67.6983 Explore P: 0.6148
Episode: 260 Total reward: 19.0 Training loss: 60.9874 Explore P: 0.6137
Episode: 261 Total reward: 49.0 Training loss: 66.1369 Explore P: 0.6107
Episode: 262 Total reward: 21.0 Training loss: 25.0284 Explore P: 0.6094
Episode: 263 Total reward: 25.0 Training loss: 21.4047 Explore P: 0.6079
Episode: 264 Total reward: 11.0 Training loss: 0.9335 Explore P: 0.6073
Episode: 265 Total reward: 28.0 Training loss: 22.8475 Explore P: 0.6056
Episode: 266 Total reward: 54.0 Training loss: 68.6874 Explore P: 0.6024
Episode: 267 Total reward: 29.0 Training loss: 35.9384 Explore P: 0.6007
Episode: 268 Total reward: 29.0 Training loss: 20.6376 Explore P: 0.5990
Episode: 269 Total reward: 62.0 Training loss: 1.1189 Explore P: 0.5953
Episode: 270 Total reward: 13.0 Training loss: 1.2712 Explore P: 0.5946
Episode: 271 Total reward: 16.0 Training loss: 1.0293 Explore P: 0.5937
Episode: 272 Total reward: 37.0 Training loss: 1.6923 Explore P: 0.5915
Episode: 273 Total reward: 34.0 Training loss: 42.1580 Explore P: 0.5895
Episode: 274 Total reward: 53.0 Training loss: 1.1831 Explore P: 0.5865
Episode: 275 Total reward: 78.0 Training loss: 1.3501 Explore P: 0.5820
Episode: 276 Total reward: 26.0 Training loss: 1.7617 Explore P: 0.5805
Episode: 277 Total reward: 43.0 Training loss: 32.0862 Explore P: 0.5780
Episode: 278 Total reward: 52.0 Training loss: 1.4704 Explore P: 0.5751
Episode: 279 Total reward: 64.0 Training loss: 1.3747 Explore P: 0.5715
Episode: 280 Total reward: 111.0 Training loss: 1.4334 Explore P: 0.5653
Episode: 281 Total reward: 64.0 Training loss: 0.8159 Explore P: 0.5618
Episode: 282 Total reward: 59.0 Training loss: 19.2850 Explore P: 0.5585
Episode: 283 Total reward: 93.0 Training loss: 18.6194 Explore P: 0.5534
Episode: 284 Total reward: 20.0 Training loss: 24.3620 Explore P: 0.5523
Episode: 285 Total reward: 46.0 Training loss: 24.1323 Explore P: 0.5499
Episode: 286 Total reward: 26.0 Training loss: 35.5154 Explore P: 0.5485
Episode: 287 Total reward: 69.0 Training loss: 1.9093 Explore P: 0.5448
Episode: 288 Total reward: 18.0 Training loss: 46.4706 Explore P: 0.5438
Episode: 289 Total reward: 21.0 Training loss: 20.6982 Explore P: 0.5427
Episode: 290 Total reward: 27.0 Training loss: 1.4062 Explore P: 0.5412
Episode: 291 Total reward: 27.0 Training loss: 42.1117 Explore P: 0.5398
Episode: 292 Total reward: 24.0 Training loss: 24.1907 Explore P: 0.5385
Episode: 293 Total reward: 17.0 Training loss: 46.8375 Explore P: 0.5376
Episode: 294 Total reward: 84.0 Training loss: 22.6175 Explore P: 0.5332
Episode: 295 Total reward: 58.0 Training loss: 49.1563 Explore P: 0.5302
Episode: 296 Total reward: 45.0 Training loss: 1.9040 Explore P: 0.5279
Episode: 297 Total reward: 36.0 Training loss: 1.1557 Explore P: 0.5260
Episode: 299 Total reward: 5.0 Training loss: 28.5249 Explore P: 0.5155
Episode: 300 Total reward: 42.0 Training loss: 1.5807 Explore P: 0.5134
Episode: 301 Total reward: 26.0 Training loss: 24.0474 Explore P: 0.5121
Episode: 302 Total reward: 43.0 Training loss: 16.2580 Explore P: 0.5099
Episode: 303 Total reward: 37.0 Training loss: 30.7284 Explore P: 0.5081
Episode: 304 Total reward: 22.0 Training loss: 34.5078 Explore P: 0.5070
Episode: 305 Total reward: 30.0 Training loss: 1.6059 Explore P: 0.5055
Episode: 306 Total reward: 71.0 Training loss: 1.0789 Explore P: 0.5020
Episode: 307 Total reward: 42.0 Training loss: 1.5078 Explore P: 0.5000
Episode: 308 Total reward: 20.0 Training loss: 25.4348 Explore P: 0.4990
Episode: 309 Total reward: 20.0 Training loss: 2.1902 Explore P: 0.4980
Episode: 310 Total reward: 63.0 Training loss: 1.3076 Explore P: 0.4949
Episode: 311 Total reward: 52.0 Training loss: 41.6627 Explore P: 0.4924
Episode: 312 Total reward: 34.0 Training loss: 36.6230 Explore P: 0.4908
Episode: 313 Total reward: 27.0 Training loss: 23.4803 Explore P: 0.4895
Episode: 314 Total reward: 52.0 Training loss: 23.9200 Explore P: 0.4870
Episode: 315 Total reward: 50.0 Training loss: 0.8207 Explore P: 0.4846
Episode: 316 Total reward: 43.0 Training loss: 41.0525 Explore P: 0.4826
Episode: 317 Total reward: 100.0 Training loss: 73.2615 Explore P: 0.4779
Episode: 318 Total reward: 22.0 Training loss: 76.3367 Explore P: 0.4768
Episode: 319 Total reward: 72.0 Training loss: 2.1571 Explore P: 0.4735
Episode: 320 Total reward: 33.0 Training loss: 1.1027 Explore P: 0.4720
Episode: 321 Total reward: 89.0 Training loss: 23.0003 Explore P: 0.4679
Episode: 322 Total reward: 23.0 Training loss: 23.6860 Explore P: 0.4668
Episode: 323 Total reward: 70.0 Training loss: 34.3254 Explore P: 0.4636
Episode: 324 Total reward: 108.0 Training loss: 1.0877 Explore P: 0.4588
Episode: 325 Total reward: 78.0 Training loss: 1.6854 Explore P: 0.4553
Episode: 326 Total reward: 36.0 Training loss: 2.4143 Explore P: 0.4537
Episode: 327 Total reward: 45.0 Training loss: 43.8323 Explore P: 0.4517
Episode: 328 Total reward: 44.0 Training loss: 35.0889 Explore P: 0.4497
Episode: 329 Total reward: 49.0 Training loss: 24.8049 Explore P: 0.4476
Episode: 330 Total reward: 93.0 Training loss: 1.6569 Explore P: 0.4435
Episode: 331 Total reward: 52.0 Training loss: 1.6173 Explore P: 0.4413
Episode: 332 Total reward: 45.0 Training loss: 31.1161 Explore P: 0.4394
Episode: 333 Total reward: 25.0 Training loss: 53.7734 Explore P: 0.4383
Episode: 334 Total reward: 69.0 Training loss: 18.6304 Explore P: 0.4353
Episode: 335 Total reward: 38.0 Training loss: 24.1941 Explore P: 0.4337
Episode: 336 Total reward: 66.0 Training loss: 1.9306 Explore P: 0.4309
Episode: 337 Total reward: 28.0 Training loss: 29.4689 Explore P: 0.4298
Episode: 338 Total reward: 99.0 Training loss: 1.5281 Explore P: 0.4256
Episode: 339 Total reward: 63.0 Training loss: 37.2585 Explore P: 0.4230
Episode: 340 Total reward: 101.0 Training loss: 107.2673 Explore P: 0.4189
Episode: 341 Total reward: 53.0 Training loss: 2.2748 Explore P: 0.4167
Episode: 342 Total reward: 77.0 Training loss: 114.2580 Explore P: 0.4136
Episode: 343 Total reward: 43.0 Training loss: 1.0225 Explore P: 0.4119
Episode: 344 Total reward: 47.0 Training loss: 1.0049 Explore P: 0.4100
Episode: 345 Total reward: 22.0 Training loss: 2.1757 Explore P: 0.4091
Episode: 346 Total reward: 50.0 Training loss: 2.7607 Explore P: 0.4071
Episode: 347 Total reward: 46.0 Training loss: 17.2882 Explore P: 0.4053
Episode: 348 Total reward: 42.0 Training loss: 30.6112 Explore P: 0.4036
Episode: 349 Total reward: 38.0 Training loss: 1.8756 Explore P: 0.4021
Episode: 350 Total reward: 55.0 Training loss: 32.7170 Explore P: 0.4000
Episode: 351 Total reward: 43.0 Training loss: 1.7166 Explore P: 0.3983
Episode: 352 Total reward: 23.0 Training loss: 2.3636 Explore P: 0.3974
Episode: 353 Total reward: 76.0 Training loss: 34.1695 Explore P: 0.3945
Episode: 354 Total reward: 65.0 Training loss: 27.2247 Explore P: 0.3920
Episode: 355 Total reward: 44.0 Training loss: 3.5014 Explore P: 0.3903
Episode: 356 Total reward: 67.0 Training loss: 16.5402 Explore P: 0.3878
Episode: 357 Total reward: 87.0 Training loss: 1.9138 Explore P: 0.3845
Episode: 358 Total reward: 91.0 Training loss: 19.0741 Explore P: 0.3811
Episode: 359 Total reward: 105.0 Training loss: 114.4311 Explore P: 0.3772
Episode: 360 Total reward: 87.0 Training loss: 60.3672 Explore P: 0.3741
Episode: 361 Total reward: 62.0 Training loss: 61.9824 Explore P: 0.3718
Episode: 362 Total reward: 86.0 Training loss: 25.6718 Explore P: 0.3687
Episode: 363 Total reward: 100.0 Training loss: 2.1785 Explore P: 0.3651
Episode: 364 Total reward: 75.0 Training loss: 4.0499 Explore P: 0.3625
Episode: 365 Total reward: 106.0 Training loss: 43.0345 Explore P: 0.3588
Episode: 366 Total reward: 73.0 Training loss: 3.3644 Explore P: 0.3562
Episode: 367 Total reward: 196.0 Training loss: 48.7128 Explore P: 0.3495
Episode: 368 Total reward: 51.0 Training loss: 1.2178 Explore P: 0.3478
Episode: 369 Total reward: 34.0 Training loss: 2.0560 Explore P: 0.3466
Episode: 370 Total reward: 52.0 Training loss: 84.2848 Explore P: 0.3449
Episode: 371 Total reward: 10.0 Training loss: 3.5461 Explore P: 0.3446
Episode: 372 Total reward: 25.0 Training loss: 41.8421 Explore P: 0.3437
Episode: 373 Total reward: 72.0 Training loss: 99.5705 Explore P: 0.3413
Episode: 374 Total reward: 24.0 Training loss: 2.5322 Explore P: 0.3405
Episode: 375 Total reward: 97.0 Training loss: 51.7311 Explore P: 0.3373
Episode: 376 Total reward: 45.0 Training loss: 35.8759 Explore P: 0.3359
Episode: 377 Total reward: 48.0 Training loss: 0.7287 Explore P: 0.3343
Episode: 378 Total reward: 181.0 Training loss: 1.4798 Explore P: 0.3285
Episode: 379 Total reward: 65.0 Training loss: 1.6336 Explore P: 0.3264
Episode: 380 Total reward: 55.0 Training loss: 47.8548 Explore P: 0.3247
Episode: 381 Total reward: 95.0 Training loss: 1.8704 Explore P: 0.3217
Episode: 382 Total reward: 45.0 Training loss: 3.0382 Explore P: 0.3203
Episode: 383 Total reward: 55.0 Training loss: 60.2497 Explore P: 0.3186
Episode: 384 Total reward: 192.0 Training loss: 55.0010 Explore P: 0.3127
Episode: 385 Total reward: 12.0 Training loss: 1.5841 Explore P: 0.3124
Episode: 386 Total reward: 113.0 Training loss: 1.7672 Explore P: 0.3090
Episode: 387 Total reward: 63.0 Training loss: 105.1064 Explore P: 0.3071
Episode: 388 Total reward: 180.0 Training loss: 2.8393 Explore P: 0.3018
Episode: 389 Total reward: 146.0 Training loss: 2.6621 Explore P: 0.2976
Episode: 390 Total reward: 103.0 Training loss: 1.9821 Explore P: 0.2946
Episode: 391 Total reward: 110.0 Training loss: 5.0456 Explore P: 0.2915
Episode: 392 Total reward: 81.0 Training loss: 1.5163 Explore P: 0.2892
Episode: 393 Total reward: 55.0 Training loss: 1.2628 Explore P: 0.2877
Episode: 394 Total reward: 179.0 Training loss: 1.7250 Explore P: 0.2828
Episode: 395 Total reward: 80.0 Training loss: 2.5032 Explore P: 0.2806
Episode: 396 Total reward: 89.0 Training loss: 70.5662 Explore P: 0.2782
Episode: 397 Total reward: 69.0 Training loss: 2.4792 Explore P: 0.2764
Episode: 398 Total reward: 122.0 Training loss: 1.9947 Explore P: 0.2731
Episode: 399 Total reward: 176.0 Training loss: 2.3309 Explore P: 0.2686
Episode: 400 Total reward: 89.0 Training loss: 1.9054 Explore P: 0.2663
Episode: 401 Total reward: 136.0 Training loss: 1.6404 Explore P: 0.2628
Episode: 402 Total reward: 81.0 Training loss: 1.3334 Explore P: 0.2608
Episode: 403 Total reward: 64.0 Training loss: 1.9677 Explore P: 0.2592
Episode: 404 Total reward: 97.0 Training loss: 1.8618 Explore P: 0.2568
Episode: 405 Total reward: 102.0 Training loss: 7.0038 Explore P: 0.2543
Episode: 406 Total reward: 128.0 Training loss: 2.3105 Explore P: 0.2511
Episode: 407 Total reward: 31.0 Training loss: 1.5078 Explore P: 0.2504
Episode: 408 Total reward: 173.0 Training loss: 1.3118 Explore P: 0.2463
Episode: 409 Total reward: 135.0 Training loss: 2.2806 Explore P: 0.2431
Episode: 410 Total reward: 111.0 Training loss: 1.4722 Explore P: 0.2405
Episode: 411 Total reward: 143.0 Training loss: 0.9669 Explore P: 0.2373
Episode: 412 Total reward: 172.0 Training loss: 0.7770 Explore P: 0.2334
Episode: 413 Total reward: 110.0 Training loss: 1.5044 Explore P: 0.2309
Episode: 414 Total reward: 147.0 Training loss: 1.4639 Explore P: 0.2277
Episode: 415 Total reward: 175.0 Training loss: 1.2603 Explore P: 0.2239
Episode: 417 Total reward: 92.0 Training loss: 1.2443 Explore P: 0.2178
Episode: 418 Total reward: 186.0 Training loss: 1.6042 Explore P: 0.2140
Episode: 420 Total reward: 80.0 Training loss: 1.7343 Explore P: 0.2083
Episode: 422 Total reward: 120.0 Training loss: 61.5674 Explore P: 0.2021
Episode: 424 Total reward: 64.0 Training loss: 1.6225 Explore P: 0.1971
Episode: 426 Total reward: 134.0 Training loss: 74.3335 Explore P: 0.1909
Episode: 427 Total reward: 181.0 Training loss: 1.0723 Explore P: 0.1877
Episode: 429 Total reward: 121.0 Training loss: 1.1525 Explore P: 0.1821
Episode: 430 Total reward: 187.0 Training loss: 1.2460 Explore P: 0.1789
Episode: 432 Total reward: 2.0 Training loss: 1.4572 Explore P: 0.1755
Episode: 434 Total reward: 11.0 Training loss: 0.8983 Explore P: 0.1721
Episode: 436 Total reward: 25.0 Training loss: 1.1857 Explore P: 0.1684
Episode: 437 Total reward: 161.0 Training loss: 1.0507 Explore P: 0.1659
Episode: 439 Total reward: 94.0 Training loss: 0.5170 Explore P: 0.1614
Episode: 441 Total reward: 131.0 Training loss: 1.3187 Explore P: 0.1565
Episode: 443 Total reward: 17.0 Training loss: 0.7894 Explore P: 0.1533
Episode: 445 Total reward: 103.0 Training loss: 0.5338 Explore P: 0.1490
Episode: 446 Total reward: 193.0 Training loss: 0.7300 Explore P: 0.1464
Episode: 448 Total reward: 114.0 Training loss: 0.7186 Explore P: 0.1422
Episode: 451 Total reward: 15.0 Training loss: 1.3797 Explore P: 0.1368
Episode: 454 Total reward: 80.0 Training loss: 0.7073 Explore P: 0.1309
Episode: 457 Total reward: 99.0 Training loss: 0.8535 Explore P: 0.1250
Episode: 458 Total reward: 102.0 Training loss: 1.1373 Explore P: 0.1238
Episode: 461 Total reward: 99.0 Training loss: 0.8609 Explore P: 0.1183
Episode: 462 Total reward: 139.0 Training loss: 0.5657 Explore P: 0.1168
Episode: 465 Total reward: 99.0 Training loss: 0.9245 Explore P: 0.1116
Episode: 468 Total reward: 99.0 Training loss: 0.6314 Explore P: 0.1066
Episode: 469 Total reward: 190.0 Training loss: 0.2709 Explore P: 0.1048
Episode: 472 Total reward: 99.0 Training loss: 0.6318 Explore P: 0.1002
Episode: 475 Total reward: 99.0 Training loss: 0.4896 Explore P: 0.0958
Episode: 478 Total reward: 99.0 Training loss: 0.6266 Explore P: 0.0916
Episode: 479 Total reward: 155.0 Training loss: 0.4071 Explore P: 0.0904
Episode: 481 Total reward: 51.0 Training loss: 0.2328 Explore P: 0.0884
Episode: 484 Total reward: 99.0 Training loss: 0.2864 Explore P: 0.0846
Episode: 485 Total reward: 127.0 Training loss: 0.1564 Explore P: 0.0836
Episode: 488 Total reward: 99.0 Training loss: 0.2497 Explore P: 0.0800
Episode: 491 Total reward: 99.0 Training loss: 0.2319 Explore P: 0.0766
Episode: 492 Total reward: 110.0 Training loss: 0.2774 Explore P: 0.0759
Episode: 495 Total reward: 99.0 Training loss: 0.1846 Explore P: 0.0727
Episode: 498 Total reward: 47.0 Training loss: 0.2920 Explore P: 0.0700
Episode: 501 Total reward: 99.0 Training loss: 0.1878 Explore P: 0.0670
Episode: 502 Total reward: 131.0 Training loss: 0.2091 Explore P: 0.0663
Episode: 505 Total reward: 99.0 Training loss: 0.0868 Explore P: 0.0636
Episode: 508 Total reward: 99.0 Training loss: 0.1871 Explore P: 0.0609
Episode: 511 Total reward: 99.0 Training loss: 0.0926 Explore P: 0.0585
Episode: 514 Total reward: 99.0 Training loss: 0.1091 Explore P: 0.0561
Episode: 517 Total reward: 99.0 Training loss: 0.1722 Explore P: 0.0539
Episode: 520 Total reward: 99.0 Training loss: 0.0848 Explore P: 0.0517
Episode: 522 Total reward: 199.0 Training loss: 0.1521 Explore P: 0.0501
Episode: 525 Total reward: 99.0 Training loss: 0.1151 Explore P: 0.0481
Episode: 527 Total reward: 186.0 Training loss: 0.1318 Explore P: 0.0467
Episode: 529 Total reward: 126.0 Training loss: 0.1660 Explore P: 0.0455
Episode: 531 Total reward: 129.0 Training loss: 0.1594 Explore P: 0.0444
Episode: 534 Total reward: 29.0 Training loss: 297.6937 Explore P: 0.0429
Episode: 536 Total reward: 22.0 Training loss: 267.5245 Explore P: 0.0422
Episode: 537 Total reward: 199.0 Training loss: 0.1670 Explore P: 0.0416
Episode: 539 Total reward: 20.0 Training loss: 0.1190 Explore P: 0.0409
Episode: 540 Total reward: 159.0 Training loss: 0.1217 Explore P: 0.0404
Episode: 541 Total reward: 128.0 Training loss: 0.1569 Explore P: 0.0400
Episode: 542 Total reward: 123.0 Training loss: 0.3145 Explore P: 0.0396
Episode: 543 Total reward: 134.0 Training loss: 0.2410 Explore P: 0.0393
Episode: 544 Total reward: 127.0 Training loss: 0.1767 Explore P: 0.0389
Episode: 546 Total reward: 58.0 Training loss: 0.1343 Explore P: 0.0381
Episode: 548 Total reward: 105.0 Training loss: 0.3156 Explore P: 0.0373
Episode: 549 Total reward: 169.0 Training loss: 0.3570 Explore P: 0.0368
Episode: 550 Total reward: 117.0 Training loss: 0.2074 Explore P: 0.0365
Episode: 551 Total reward: 108.0 Training loss: 0.1424 Explore P: 0.0362
Episode: 552 Total reward: 147.0 Training loss: 0.3873 Explore P: 0.0359
Episode: 553 Total reward: 69.0 Training loss: 0.4846 Explore P: 0.0357
Episode: 554 Total reward: 36.0 Training loss: 0.3570 Explore P: 0.0356
Episode: 555 Total reward: 64.0 Training loss: 0.3122 Explore P: 0.0354
Episode: 556 Total reward: 26.0 Training loss: 0.4919 Explore P: 0.0354
Episode: 557 Total reward: 33.0 Training loss: 0.4580 Explore P: 0.0353
Episode: 558 Total reward: 28.0 Training loss: 0.4183 Explore P: 0.0352
Episode: 559 Total reward: 30.0 Training loss: 0.8270 Explore P: 0.0351
Episode: 560 Total reward: 33.0 Training loss: 0.3123 Explore P: 0.0351
Episode: 561 Total reward: 58.0 Training loss: 0.3609 Explore P: 0.0349
Episode: 562 Total reward: 64.0 Training loss: 253.7803 Explore P: 0.0347
Episode: 563 Total reward: 80.0 Training loss: 0.1995 Explore P: 0.0345
Episode: 564 Total reward: 41.0 Training loss: 0.8614 Explore P: 0.0344
Episode: 565 Total reward: 80.0 Training loss: 0.3223 Explore P: 0.0343
Episode: 566 Total reward: 140.0 Training loss: 0.2373 Explore P: 0.0339
Episode: 567 Total reward: 143.0 Training loss: 0.1908 Explore P: 0.0336
Episode: 568 Total reward: 135.0 Training loss: 0.3473 Explore P: 0.0333
Episode: 569 Total reward: 73.0 Training loss: 0.1193 Explore P: 0.0331
Episode: 570 Total reward: 114.0 Training loss: 0.2054 Explore P: 0.0328
Episode: 571 Total reward: 118.0 Training loss: 0.1762 Explore P: 0.0326
Episode: 572 Total reward: 96.0 Training loss: 0.2564 Explore P: 0.0323
Episode: 573 Total reward: 106.0 Training loss: 0.2002 Explore P: 0.0321
Episode: 574 Total reward: 159.0 Training loss: 0.1819 Explore P: 0.0318
Episode: 575 Total reward: 144.0 Training loss: 0.1926 Explore P: 0.0315
Episode: 576 Total reward: 121.0 Training loss: 253.7211 Explore P: 0.0312
Episode: 577 Total reward: 74.0 Training loss: 0.6083 Explore P: 0.0310
Episode: 578 Total reward: 93.0 Training loss: 62.5660 Explore P: 0.0308
Episode: 579 Total reward: 111.0 Training loss: 0.3424 Explore P: 0.0306
Episode: 580 Total reward: 128.0 Training loss: 0.1944 Explore P: 0.0304
Episode: 581 Total reward: 169.0 Training loss: 0.4572 Explore P: 0.0300
Episode: 584 Total reward: 99.0 Training loss: 0.3901 Explore P: 0.0290
Episode: 587 Total reward: 99.0 Training loss: 209.9718 Explore P: 0.0281
Episode: 588 Total reward: 95.0 Training loss: 0.6240 Explore P: 0.0279
Episode: 589 Total reward: 94.0 Training loss: 0.5677 Explore P: 0.0278
Episode: 590 Total reward: 72.0 Training loss: 0.3398 Explore P: 0.0276
Episode: 591 Total reward: 128.0 Training loss: 0.3034 Explore P: 0.0274
Episode: 592 Total reward: 76.0 Training loss: 0.4887 Explore P: 0.0273
Episode: 593 Total reward: 65.0 Training loss: 0.4182 Explore P: 0.0272
Episode: 594 Total reward: 99.0 Training loss: 0.6983 Explore P: 0.0270
Episode: 595 Total reward: 60.0 Training loss: 0.4444 Explore P: 0.0269
Episode: 596 Total reward: 90.0 Training loss: 29.1490 Explore P: 0.0268
Episode: 597 Total reward: 48.0 Training loss: 0.5830 Explore P: 0.0267
Episode: 598 Total reward: 66.0 Training loss: 0.3654 Explore P: 0.0266
Episode: 599 Total reward: 62.0 Training loss: 0.4368 Explore P: 0.0265
Episode: 600 Total reward: 61.0 Training loss: 0.7951 Explore P: 0.0264
Episode: 603 Total reward: 99.0 Training loss: 0.4938 Explore P: 0.0256
Episode: 604 Total reward: 84.0 Training loss: 0.6343 Explore P: 0.0254
Episode: 605 Total reward: 69.0 Training loss: 0.5440 Explore P: 0.0253
Episode: 606 Total reward: 104.0 Training loss: 0.5385 Explore P: 0.0252
Episode: 607 Total reward: 175.0 Training loss: 0.3062 Explore P: 0.0249
Episode: 610 Total reward: 99.0 Training loss: 11.6219 Explore P: 0.0242
Episode: 612 Total reward: 156.0 Training loss: 0.2882 Explore P: 0.0237
Episode: 614 Total reward: 91.0 Training loss: 57.6886 Explore P: 0.0233
Episode: 615 Total reward: 67.0 Training loss: 0.3467 Explore P: 0.0232
Episode: 617 Total reward: 146.0 Training loss: 0.2802 Explore P: 0.0228
Episode: 619 Total reward: 36.0 Training loss: 7.5367 Explore P: 0.0225
Episode: 620 Total reward: 163.0 Training loss: 0.3213 Explore P: 0.0223
Episode: 622 Total reward: 7.0 Training loss: 0.2815 Explore P: 0.0220
Episode: 625 Total reward: 41.0 Training loss: 0.3216 Explore P: 0.0215
Episode: 627 Total reward: 30.0 Training loss: 0.2926 Explore P: 0.0212
Episode: 629 Total reward: 68.0 Training loss: 4.8251 Explore P: 0.0209
Episode: 632 Total reward: 99.0 Training loss: 0.4919 Explore P: 0.0204
Episode: 634 Total reward: 138.0 Training loss: 0.1592 Explore P: 0.0201
Episode: 635 Total reward: 184.0 Training loss: 0.3368 Explore P: 0.0199
Episode: 637 Total reward: 105.0 Training loss: 3.4983 Explore P: 0.0196
Episode: 640 Total reward: 99.0 Training loss: 0.2184 Explore P: 0.0191
Episode: 643 Total reward: 99.0 Training loss: 0.3386 Explore P: 0.0187
Episode: 644 Total reward: 53.0 Training loss: 0.2012 Explore P: 0.0186
Episode: 645 Total reward: 65.0 Training loss: 0.4382 Explore P: 0.0186
Episode: 646 Total reward: 40.0 Training loss: 0.5603 Explore P: 0.0185
Episode: 647 Total reward: 86.0 Training loss: 4.5583 Explore P: 0.0185
Episode: 648 Total reward: 40.0 Training loss: 3.9455 Explore P: 0.0184
Episode: 649 Total reward: 91.0 Training loss: 0.4957 Explore P: 0.0183
Episode: 650 Total reward: 122.0 Training loss: 0.2987 Explore P: 0.0182
Episode: 651 Total reward: 52.0 Training loss: 0.1995 Explore P: 0.0182
Episode: 653 Total reward: 198.0 Training loss: 0.1979 Explore P: 0.0179
Episode: 655 Total reward: 165.0 Training loss: 0.3310 Explore P: 0.0176
Episode: 658 Total reward: 99.0 Training loss: 0.1009 Explore P: 0.0172
Episode: 661 Total reward: 99.0 Training loss: 0.2303 Explore P: 0.0169
Episode: 664 Total reward: 99.0 Training loss: 0.2776 Explore P: 0.0165
Episode: 665 Total reward: 51.0 Training loss: 0.1094 Explore P: 0.0165
Episode: 666 Total reward: 55.0 Training loss: 0.2422 Explore P: 0.0165
Episode: 667 Total reward: 86.0 Training loss: 0.3608 Explore P: 0.0164
Episode: 668 Total reward: 70.0 Training loss: 0.3500 Explore P: 0.0164
Episode: 669 Total reward: 56.0 Training loss: 0.3903 Explore P: 0.0163
Episode: 670 Total reward: 53.0 Training loss: 0.1875 Explore P: 0.0163
Episode: 671 Total reward: 72.0 Training loss: 0.1592 Explore P: 0.0163
Episode: 672 Total reward: 52.0 Training loss: 0.4205 Explore P: 0.0162
Episode: 673 Total reward: 116.0 Training loss: 0.2598 Explore P: 0.0162
Episode: 674 Total reward: 90.0 Training loss: 0.2989 Explore P: 0.0161
Episode: 676 Total reward: 51.0 Training loss: 0.3488 Explore P: 0.0159
Episode: 677 Total reward: 88.0 Training loss: 0.4691 Explore P: 0.0159
Episode: 679 Total reward: 77.0 Training loss: 0.4334 Explore P: 0.0157
Episode: 680 Total reward: 50.0 Training loss: 0.3762 Explore P: 0.0157
Episode: 681 Total reward: 116.0 Training loss: 0.3971 Explore P: 0.0156
Episode: 682 Total reward: 121.0 Training loss: 0.4468 Explore P: 0.0156
Episode: 683 Total reward: 157.0 Training loss: 0.1768 Explore P: 0.0155
Episode: 684 Total reward: 91.0 Training loss: 0.1214 Explore P: 0.0154
Episode: 687 Total reward: 99.0 Training loss: 0.2310 Explore P: 0.0152
Episode: 690 Total reward: 99.0 Training loss: 0.5335 Explore P: 0.0149
Episode: 691 Total reward: 127.0 Training loss: 0.3229 Explore P: 0.0149
Episode: 694 Total reward: 99.0 Training loss: 0.1043 Explore P: 0.0146
Episode: 697 Total reward: 99.0 Training loss: 0.3556 Explore P: 0.0144
Episode: 700 Total reward: 99.0 Training loss: 0.2332 Explore P: 0.0142
Episode: 703 Total reward: 99.0 Training loss: 0.3943 Explore P: 0.0140
Episode: 706 Total reward: 99.0 Training loss: 0.4251 Explore P: 0.0138
Episode: 709 Total reward: 99.0 Training loss: 0.4632 Explore P: 0.0136
Episode: 712 Total reward: 99.0 Training loss: 0.1218 Explore P: 0.0134
Episode: 715 Total reward: 99.0 Training loss: 0.2188 Explore P: 0.0133
Episode: 718 Total reward: 99.0 Training loss: 17.7169 Explore P: 0.0131
Episode: 721 Total reward: 99.0 Training loss: 0.1671 Explore P: 0.0129
Episode: 724 Total reward: 99.0 Training loss: 0.2293 Explore P: 0.0128
Episode: 727 Total reward: 99.0 Training loss: 0.5953 Explore P: 0.0127
Episode: 730 Total reward: 99.0 Training loss: 30.3095 Explore P: 0.0125
Episode: 733 Total reward: 99.0 Training loss: 0.3614 Explore P: 0.0124
Episode: 736 Total reward: 99.0 Training loss: 0.4902 Explore P: 0.0123
Episode: 739 Total reward: 99.0 Training loss: 0.1953 Explore P: 0.0122
Episode: 742 Total reward: 99.0 Training loss: 0.3298 Explore P: 0.0121
Episode: 745 Total reward: 99.0 Training loss: 0.1874 Explore P: 0.0120
Episode: 748 Total reward: 99.0 Training loss: 0.1455 Explore P: 0.0119
Episode: 751 Total reward: 99.0 Training loss: 0.1397 Explore P: 0.0118
Episode: 754 Total reward: 99.0 Training loss: 0.1278 Explore P: 0.0117
Episode: 757 Total reward: 99.0 Training loss: 0.0661 Explore P: 0.0116
Episode: 760 Total reward: 99.0 Training loss: 0.0965 Explore P: 0.0115
Episode: 763 Total reward: 99.0 Training loss: 0.0922 Explore P: 0.0115
Episode: 766 Total reward: 99.0 Training loss: 0.2216 Explore P: 0.0114
Episode: 769 Total reward: 99.0 Training loss: 0.1424 Explore P: 0.0113
Episode: 772 Total reward: 99.0 Training loss: 0.0538 Explore P: 0.0113
Episode: 775 Total reward: 99.0 Training loss: 0.1804 Explore P: 0.0112
Episode: 778 Total reward: 99.0 Training loss: 0.1105 Explore P: 0.0111
Episode: 781 Total reward: 99.0 Training loss: 0.0710 Explore P: 0.0111
Episode: 784 Total reward: 99.0 Training loss: 0.1072 Explore P: 0.0110
Episode: 787 Total reward: 99.0 Training loss: 0.0804 Explore P: 0.0110
Episode: 790 Total reward: 99.0 Training loss: 363.6635 Explore P: 0.0109
Episode: 793 Total reward: 99.0 Training loss: 0.0775 Explore P: 0.0109
Episode: 796 Total reward: 99.0 Training loss: 0.0833 Explore P: 0.0108
Episode: 799 Total reward: 99.0 Training loss: 0.0749 Explore P: 0.0108
Episode: 802 Total reward: 99.0 Training loss: 0.1185 Explore P: 0.0108
Episode: 805 Total reward: 99.0 Training loss: 0.3461 Explore P: 0.0107
Episode: 808 Total reward: 99.0 Training loss: 0.0385 Explore P: 0.0107
Episode: 811 Total reward: 99.0 Training loss: 0.0517 Explore P: 0.0107
Episode: 814 Total reward: 99.0 Training loss: 0.0329 Explore P: 0.0106
Episode: 817 Total reward: 99.0 Training loss: 0.0826 Explore P: 0.0106
Episode: 820 Total reward: 99.0 Training loss: 0.0822 Explore P: 0.0106
Episode: 823 Total reward: 99.0 Training loss: 0.3597 Explore P: 0.0105
Episode: 826 Total reward: 99.0 Training loss: 0.0352 Explore P: 0.0105
Episode: 829 Total reward: 99.0 Training loss: 0.0580 Explore P: 0.0105
Episode: 832 Total reward: 99.0 Training loss: 0.1388 Explore P: 0.0105
Episode: 835 Total reward: 99.0 Training loss: 0.0962 Explore P: 0.0104
Episode: 838 Total reward: 99.0 Training loss: 0.3218 Explore P: 0.0104
Episode: 841 Total reward: 99.0 Training loss: 0.0186 Explore P: 0.0104
Episode: 844 Total reward: 99.0 Training loss: 0.0537 Explore P: 0.0104
Episode: 847 Total reward: 99.0 Training loss: 0.0375 Explore P: 0.0104
Episode: 850 Total reward: 99.0 Training loss: 0.0397 Explore P: 0.0103
Episode: 853 Total reward: 99.0 Training loss: 0.0681 Explore P: 0.0103
Episode: 856 Total reward: 99.0 Training loss: 0.1136 Explore P: 0.0103
Episode: 859 Total reward: 99.0 Training loss: 0.0613 Explore P: 0.0103
Episode: 862 Total reward: 99.0 Training loss: 0.0629 Explore P: 0.0103
Episode: 865 Total reward: 99.0 Training loss: 0.0316 Explore P: 0.0103
Episode: 868 Total reward: 99.0 Training loss: 0.0636 Explore P: 0.0103
Episode: 871 Total reward: 99.0 Training loss: 0.0325 Explore P: 0.0102
Episode: 874 Total reward: 99.0 Training loss: 0.0276 Explore P: 0.0102
Episode: 877 Total reward: 99.0 Training loss: 0.0515 Explore P: 0.0102
Episode: 880 Total reward: 99.0 Training loss: 0.2413 Explore P: 0.0102
Episode: 883 Total reward: 99.0 Training loss: 0.0431 Explore P: 0.0102
Episode: 886 Total reward: 99.0 Training loss: 0.1407 Explore P: 0.0102
Episode: 889 Total reward: 99.0 Training loss: 0.0382 Explore P: 0.0102
Episode: 892 Total reward: 99.0 Training loss: 0.0284 Explore P: 0.0102
Episode: 895 Total reward: 99.0 Training loss: 0.1003 Explore P: 0.0102
Episode: 898 Total reward: 99.0 Training loss: 0.0682 Explore P: 0.0102
Episode: 901 Total reward: 99.0 Training loss: 0.1097 Explore P: 0.0101
Episode: 904 Total reward: 99.0 Training loss: 0.0658 Explore P: 0.0101
Episode: 907 Total reward: 99.0 Training loss: 0.0467 Explore P: 0.0101
Episode: 910 Total reward: 99.0 Training loss: 0.0773 Explore P: 0.0101
Episode: 911 Total reward: 11.0 Training loss: 0.2055 Explore P: 0.0101
Episode: 912 Total reward: 46.0 Training loss: 0.1907 Explore P: 0.0101
Episode: 915 Total reward: 99.0 Training loss: 0.2096 Explore P: 0.0101
Episode: 916 Total reward: 127.0 Training loss: 0.2208 Explore P: 0.0101
Episode: 917 Total reward: 72.0 Training loss: 431.5775 Explore P: 0.0101
Episode: 918 Total reward: 12.0 Training loss: 0.3874 Explore P: 0.0101
Episode: 919 Total reward: 38.0 Training loss: 0.3996 Explore P: 0.0101
Episode: 920 Total reward: 11.0 Training loss: 0.5135 Explore P: 0.0101
Episode: 921 Total reward: 95.0 Training loss: 0.1752 Explore P: 0.0101
Episode: 922 Total reward: 24.0 Training loss: 474.9095 Explore P: 0.0101
Episode: 923 Total reward: 52.0 Training loss: 0.1380 Explore P: 0.0101
Episode: 925 Total reward: 64.0 Training loss: 355.1980 Explore P: 0.0101
Episode: 928 Total reward: 99.0 Training loss: 1.0440 Explore P: 0.0101
Episode: 931 Total reward: 99.0 Training loss: 1.7072 Explore P: 0.0101
Episode: 934 Total reward: 99.0 Training loss: 1.1784 Explore P: 0.0101
Episode: 937 Total reward: 99.0 Training loss: 0.4230 Explore P: 0.0101
Episode: 940 Total reward: 99.0 Training loss: 0.7165 Explore P: 0.0101
Episode: 943 Total reward: 99.0 Training loss: 1.8063 Explore P: 0.0101
Episode: 946 Total reward: 99.0 Training loss: 0.4202 Explore P: 0.0101
Episode: 949 Total reward: 99.0 Training loss: 0.9748 Explore P: 0.0101
Episode: 952 Total reward: 99.0 Training loss: 1.7591 Explore P: 0.0101
Episode: 955 Total reward: 99.0 Training loss: 1.9767 Explore P: 0.0101
Episode: 958 Total reward: 99.0 Training loss: 0.2483 Explore P: 0.0101
Episode: 961 Total reward: 99.0 Training loss: 0.6684 Explore P: 0.0101
Episode: 964 Total reward: 99.0 Training loss: 1.8376 Explore P: 0.0101
Episode: 967 Total reward: 99.0 Training loss: 1.3464 Explore P: 0.0101
Episode: 970 Total reward: 99.0 Training loss: 0.7170 Explore P: 0.0101
Episode: 973 Total reward: 99.0 Training loss: 1.5986 Explore P: 0.0101
Episode: 976 Total reward: 99.0 Training loss: 0.4037 Explore P: 0.0100
Episode: 979 Total reward: 99.0 Training loss: 0.1980 Explore P: 0.0100
Episode: 982 Total reward: 99.0 Training loss: 1.3156 Explore P: 0.0100
Episode: 985 Total reward: 99.0 Training loss: 0.0749 Explore P: 0.0100
Episode: 988 Total reward: 99.0 Training loss: 0.0620 Explore P: 0.0100
Episode: 991 Total reward: 99.0 Training loss: 0.0536 Explore P: 0.0100
Episode: 994 Total reward: 99.0 Training loss: 0.1907 Explore P: 0.0100
Episode: 997 Total reward: 99.0 Training loss: 0.0962 Explore P: 0.0100

In [15]:
Episode: 1 Total reward: 13.0 Training loss: 1.0202 Explore P: 0.9987
Episode: 2 Total reward: 13.0 Training loss: 1.0752 Explore P: 0.9974
Episode: 3 Total reward: 9.0 Training loss: 1.0600 Explore P: 0.9965
Episode: 4 Total reward: 17.0 Training loss: 1.0429 Explore P: 0.9949
Episode: 5 Total reward: 16.0 Training loss: 1.0519 Explore P: 0.9933
Episode: 6 Total reward: 15.0 Training loss: 1.0574 Explore P: 0.9918
Episode: 7 Total reward: 12.0 Training loss: 1.0889 Explore P: 0.9906
Episode: 8 Total reward: 27.0 Training loss: 1.0859 Explore P: 0.9880
Episode: 9 Total reward: 24.0 Training loss: 1.2007 Explore P: 0.9857
Episode: 10 Total reward: 17.0 Training loss: 1.1116 Explore P: 0.9840
Episode: 11 Total reward: 12.0 Training loss: 1.0739 Explore P: 0.9828
Episode: 12 Total reward: 25.0 Training loss: 1.0805 Explore P: 0.9804
Episode: 13 Total reward: 23.0 Training loss: 1.0628 Explore P: 0.9782
Episode: 14 Total reward: 31.0 Training loss: 1.0248 Explore P: 0.9752
Episode: 15 Total reward: 15.0 Training loss: 0.9859 Explore P: 0.9737
Episode: 16 Total reward: 12.0 Training loss: 1.0983 Explore P: 0.9726
Episode: 17 Total reward: 16.0 Training loss: 1.4343 Explore P: 0.9710
Episode: 18 Total reward: 21.0 Training loss: 1.2696 Explore P: 0.9690
Episode: 19 Total reward: 15.0 Training loss: 1.3542 Explore P: 0.9676
Episode: 20 Total reward: 15.0 Training loss: 1.2635 Explore P: 0.9661
Episode: 21 Total reward: 16.0 Training loss: 1.3648 Explore P: 0.9646
Episode: 22 Total reward: 43.0 Training loss: 1.6088 Explore P: 0.9605
Episode: 23 Total reward: 7.0 Training loss: 1.5027 Explore P: 0.9599
Episode: 24 Total reward: 13.0 Training loss: 1.7275 Explore P: 0.9586
Episode: 25 Total reward: 18.0 Training loss: 1.3902 Explore P: 0.9569
Episode: 26 Total reward: 27.0 Training loss: 2.5874 Explore P: 0.9544
Episode: 27 Total reward: 32.0 Training loss: 1.5907 Explore P: 0.9513
Episode: 28 Total reward: 17.0 Training loss: 2.1144 Explore P: 0.9497
Episode: 29 Total reward: 34.0 Training loss: 1.7340 Explore P: 0.9466
Episode: 30 Total reward: 18.0 Training loss: 2.5100 Explore P: 0.9449
Episode: 31 Total reward: 15.0 Training loss: 2.0166 Explore P: 0.9435
Episode: 32 Total reward: 11.0 Training loss: 1.8675 Explore P: 0.9424
Episode: 33 Total reward: 18.0 Training loss: 4.0481 Explore P: 0.9408
Episode: 34 Total reward: 10.0 Training loss: 4.0895 Explore P: 0.9398
Episode: 35 Total reward: 15.0 Training loss: 2.1252 Explore P: 0.9384
Episode: 36 Total reward: 14.0 Training loss: 4.7765 Explore P: 0.9371
Episode: 37 Total reward: 16.0 Training loss: 3.3848 Explore P: 0.9357
Episode: 38 Total reward: 21.0 Training loss: 3.9125 Explore P: 0.9337
Episode: 39 Total reward: 16.0 Training loss: 2.6183 Explore P: 0.9322
Episode: 40 Total reward: 20.0 Training loss: 5.4929 Explore P: 0.9304
Episode: 41 Total reward: 18.0 Training loss: 3.6606 Explore P: 0.9287
Episode: 42 Total reward: 17.0 Training loss: 4.5812 Explore P: 0.9272
Episode: 43 Total reward: 10.0 Training loss: 3.7633 Explore P: 0.9263
Episode: 44 Total reward: 8.0 Training loss: 4.6176 Explore P: 0.9255
Episode: 45 Total reward: 39.0 Training loss: 4.2732 Explore P: 0.9220
Episode: 46 Total reward: 18.0 Training loss: 4.0041 Explore P: 0.9203
Episode: 47 Total reward: 11.0 Training loss: 4.4035 Explore P: 0.9193
Episode: 48 Total reward: 25.0 Training loss: 5.4287 Explore P: 0.9171
Episode: 49 Total reward: 19.0 Training loss: 9.6972 Explore P: 0.9153
Episode: 50 Total reward: 11.0 Training loss: 16.3460 Explore P: 0.9143
Episode: 51 Total reward: 11.0 Training loss: 13.4854 Explore P: 0.9133
Episode: 52 Total reward: 12.0 Training loss: 12.8016 Explore P: 0.9123
Episode: 53 Total reward: 13.0 Training loss: 5.8589 Explore P: 0.9111
Episode: 54 Total reward: 12.0 Training loss: 8.5924 Explore P: 0.9100
Episode: 55 Total reward: 19.0 Training loss: 8.6204 Explore P: 0.9083
Episode: 56 Total reward: 36.0 Training loss: 14.2701 Explore P: 0.9051
Episode: 57 Total reward: 9.0 Training loss: 4.5481 Explore P: 0.9043
Episode: 58 Total reward: 22.0 Training loss: 12.9695 Explore P: 0.9023
Episode: 59 Total reward: 36.0 Training loss: 11.2639 Explore P: 0.8991
Episode: 60 Total reward: 16.0 Training loss: 7.7648 Explore P: 0.8977
Episode: 61 Total reward: 31.0 Training loss: 4.6997 Explore P: 0.8949
Episode: 62 Total reward: 13.0 Training loss: 5.9755 Explore P: 0.8938
Episode: 63 Total reward: 10.0 Training loss: 39.1040 Explore P: 0.8929
Episode: 64 Total reward: 14.0 Training loss: 23.2767 Explore P: 0.8917
Episode: 65 Total reward: 12.0 Training loss: 9.3477 Explore P: 0.8906
Episode: 66 Total reward: 20.0 Training loss: 6.4336 Explore P: 0.8888
Episode: 67 Total reward: 29.0 Training loss: 17.1522 Explore P: 0.8863
Episode: 68 Total reward: 13.0 Training loss: 39.3250 Explore P: 0.8852
Episode: 69 Total reward: 20.0 Training loss: 6.2099 Explore P: 0.8834
Episode: 70 Total reward: 15.0 Training loss: 20.9229 Explore P: 0.8821
Episode: 71 Total reward: 27.0 Training loss: 24.7817 Explore P: 0.8797
Episode: 72 Total reward: 12.0 Training loss: 20.7842 Explore P: 0.8787
Episode: 73 Total reward: 15.0 Training loss: 12.3202 Explore P: 0.8774
Episode: 74 Total reward: 31.0 Training loss: 9.2270 Explore P: 0.8747
Episode: 75 Total reward: 13.0 Training loss: 19.8264 Explore P: 0.8736
Episode: 76 Total reward: 20.0 Training loss: 72.9411 Explore P: 0.8719
Episode: 77 Total reward: 27.0 Training loss: 5.2214 Explore P: 0.8695
Episode: 78 Total reward: 14.0 Training loss: 39.3913 Explore P: 0.8683
Episode: 79 Total reward: 16.0 Training loss: 7.9491 Explore P: 0.8670
Episode: 80 Total reward: 18.0 Training loss: 10.8364 Explore P: 0.8654
Episode: 81 Total reward: 16.0 Training loss: 22.2031 Explore P: 0.8641
Episode: 82 Total reward: 21.0 Training loss: 23.6590 Explore P: 0.8623
Episode: 83 Total reward: 13.0 Training loss: 8.4819 Explore P: 0.8612
Episode: 84 Total reward: 10.0 Training loss: 13.3548 Explore P: 0.8603
Episode: 85 Total reward: 13.0 Training loss: 18.0272 Explore P: 0.8592
Episode: 86 Total reward: 24.0 Training loss: 42.1243 Explore P: 0.8572
Episode: 87 Total reward: 9.0 Training loss: 30.8526 Explore P: 0.8564
Episode: 88 Total reward: 22.0 Training loss: 36.6084 Explore P: 0.8546
Episode: 89 Total reward: 7.0 Training loss: 10.5430 Explore P: 0.8540
Episode: 90 Total reward: 12.0 Training loss: 25.5808 Explore P: 0.8529
Episode: 91 Total reward: 17.0 Training loss: 47.3073 Explore P: 0.8515
Episode: 92 Total reward: 21.0 Training loss: 7.9998 Explore P: 0.8498
Episode: 93 Total reward: 15.0 Training loss: 66.6464 Explore P: 0.8485
Episode: 94 Total reward: 17.0 Training loss: 95.6354 Explore P: 0.8471
Episode: 95 Total reward: 23.0 Training loss: 57.4714 Explore P: 0.8451
Episode: 96 Total reward: 11.0 Training loss: 40.7717 Explore P: 0.8442
Episode: 97 Total reward: 13.0 Training loss: 43.3380 Explore P: 0.8431
Episode: 98 Total reward: 9.0 Training loss: 10.8368 Explore P: 0.8424
Episode: 99 Total reward: 21.0 Training loss: 57.7325 Explore P: 0.8406
Episode: 100 Total reward: 11.0 Training loss: 9.7291 Explore P: 0.8397
Episode: 101 Total reward: 10.0 Training loss: 10.4052 Explore P: 0.8389
Episode: 102 Total reward: 26.0 Training loss: 60.4829 Explore P: 0.8368
Episode: 103 Total reward: 34.0 Training loss: 9.0924 Explore P: 0.8339
Episode: 104 Total reward: 30.0 Training loss: 178.0664 Explore P: 0.8315
Episode: 105 Total reward: 14.0 Training loss: 9.0423 Explore P: 0.8303
Episode: 106 Total reward: 18.0 Training loss: 126.9380 Explore P: 0.8289
Episode: 107 Total reward: 11.0 Training loss: 58.9921 Explore P: 0.8280
Episode: 108 Total reward: 20.0 Training loss: 9.1945 Explore P: 0.8263
Episode: 109 Total reward: 9.0 Training loss: 9.2887 Explore P: 0.8256
Episode: 110 Total reward: 29.0 Training loss: 20.7970 Explore P: 0.8232
Episode: 111 Total reward: 17.0 Training loss: 144.6258 Explore P: 0.8218
Episode: 112 Total reward: 15.0 Training loss: 82.4089 Explore P: 0.8206
Episode: 113 Total reward: 15.0 Training loss: 39.9963 Explore P: 0.8194
Episode: 114 Total reward: 8.0 Training loss: 9.8394 Explore P: 0.8188
Episode: 115 Total reward: 29.0 Training loss: 76.9930 Explore P: 0.8164
Episode: 116 Total reward: 21.0 Training loss: 25.0172 Explore P: 0.8147
Episode: 117 Total reward: 24.0 Training loss: 143.5481 Explore P: 0.8128
Episode: 118 Total reward: 35.0 Training loss: 86.5429 Explore P: 0.8100
Episode: 119 Total reward: 28.0 Training loss: 8.4315 Explore P: 0.8078
Episode: 120 Total reward: 13.0 Training loss: 25.7062 Explore P: 0.8067
Episode: 121 Total reward: 9.0 Training loss: 6.5005 Explore P: 0.8060
Episode: 122 Total reward: 32.0 Training loss: 90.7984 Explore P: 0.8035
Episode: 123 Total reward: 21.0 Training loss: 130.2779 Explore P: 0.8018
Episode: 124 Total reward: 15.0 Training loss: 167.6294 Explore P: 0.8006
Episode: 125 Total reward: 15.0 Training loss: 74.7611 Explore P: 0.7994
Episode: 126 Total reward: 20.0 Training loss: 119.3178 Explore P: 0.7978
Episode: 127 Total reward: 18.0 Training loss: 196.5175 Explore P: 0.7964
Episode: 128 Total reward: 8.0 Training loss: 45.2131 Explore P: 0.7958
Episode: 129 Total reward: 24.0 Training loss: 86.0374 Explore P: 0.7939
Episode: 130 Total reward: 11.0 Training loss: 7.8129 Explore P: 0.7931
Episode: 131 Total reward: 11.0 Training loss: 76.8442 Explore P: 0.7922
Episode: 132 Total reward: 28.0 Training loss: 196.6863 Explore P: 0.7900
Episode: 133 Total reward: 9.0 Training loss: 45.7586 Explore P: 0.7893
Episode: 134 Total reward: 21.0 Training loss: 5.8484 Explore P: 0.7877
Episode: 135 Total reward: 10.0 Training loss: 7.3919 Explore P: 0.7869
Episode: 136 Total reward: 17.0 Training loss: 12.3142 Explore P: 0.7856
Episode: 137 Total reward: 16.0 Training loss: 75.7170 Explore P: 0.7843
Episode: 138 Total reward: 12.0 Training loss: 145.3568 Explore P: 0.7834
Episode: 139 Total reward: 27.0 Training loss: 121.1114 Explore P: 0.7813
Episode: 140 Total reward: 26.0 Training loss: 7.3243 Explore P: 0.7793
Episode: 141 Total reward: 40.0 Training loss: 10.6523 Explore P: 0.7762
Episode: 142 Total reward: 14.0 Training loss: 6.9482 Explore P: 0.7752
Episode: 143 Total reward: 24.0 Training loss: 137.7784 Explore P: 0.7733
Episode: 144 Total reward: 12.0 Training loss: 98.8381 Explore P: 0.7724
Episode: 145 Total reward: 8.0 Training loss: 14.1739 Explore P: 0.7718
Episode: 146 Total reward: 51.0 Training loss: 69.1545 Explore P: 0.7679
Episode: 147 Total reward: 29.0 Training loss: 249.9989 Explore P: 0.7657
Episode: 148 Total reward: 9.0 Training loss: 140.8663 Explore P: 0.7651
Episode: 149 Total reward: 13.0 Training loss: 141.5930 Explore P: 0.7641
Episode: 150 Total reward: 19.0 Training loss: 12.6228 Explore P: 0.7627
Episode: 151 Total reward: 19.0 Training loss: 136.3315 Explore P: 0.7612
Episode: 152 Total reward: 10.0 Training loss: 110.3699 Explore P: 0.7605
Episode: 153 Total reward: 18.0 Training loss: 8.3900 Explore P: 0.7591
Episode: 154 Total reward: 18.0 Training loss: 96.3717 Explore P: 0.7578
Episode: 155 Total reward: 7.0 Training loss: 6.0889 Explore P: 0.7573
Episode: 156 Total reward: 15.0 Training loss: 126.7419 Explore P: 0.7561
Episode: 157 Total reward: 15.0 Training loss: 67.2544 Explore P: 0.7550
Episode: 158 Total reward: 26.0 Training loss: 12.2839 Explore P: 0.7531
Episode: 159 Total reward: 20.0 Training loss: 5.8118 Explore P: 0.7516
Episode: 160 Total reward: 10.0 Training loss: 96.4570 Explore P: 0.7509
Episode: 161 Total reward: 23.0 Training loss: 7.6207 Explore P: 0.7492
Episode: 162 Total reward: 18.0 Training loss: 66.5249 Explore P: 0.7478
Episode: 163 Total reward: 18.0 Training loss: 111.3273 Explore P: 0.7465
Episode: 164 Total reward: 21.0 Training loss: 11.5292 Explore P: 0.7450
Episode: 165 Total reward: 17.0 Training loss: 6.3130 Explore P: 0.7437
Episode: 166 Total reward: 22.0 Training loss: 153.8167 Explore P: 0.7421
Episode: 167 Total reward: 17.0 Training loss: 7.0915 Explore P: 0.7408
Episode: 168 Total reward: 34.0 Training loss: 228.3831 Explore P: 0.7384
Episode: 169 Total reward: 13.0 Training loss: 8.5996 Explore P: 0.7374
Episode: 170 Total reward: 13.0 Training loss: 90.8898 Explore P: 0.7365
Episode: 171 Total reward: 20.0 Training loss: 4.8179 Explore P: 0.7350
Episode: 172 Total reward: 9.0 Training loss: 6.2508 Explore P: 0.7344
Episode: 173 Total reward: 14.0 Training loss: 5.2401 Explore P: 0.7334
Episode: 174 Total reward: 12.0 Training loss: 3.9268 Explore P: 0.7325
Episode: 175 Total reward: 12.0 Training loss: 5.6376 Explore P: 0.7316
Episode: 176 Total reward: 11.0 Training loss: 44.5308 Explore P: 0.7308
Episode: 177 Total reward: 12.0 Training loss: 4.9717 Explore P: 0.7300
Episode: 178 Total reward: 9.0 Training loss: 181.0085 Explore P: 0.7293
Episode: 179 Total reward: 11.0 Training loss: 73.1134 Explore P: 0.7285
Episode: 180 Total reward: 13.0 Training loss: 87.3085 Explore P: 0.7276
Episode: 181 Total reward: 12.0 Training loss: 121.6627 Explore P: 0.7267
Episode: 182 Total reward: 12.0 Training loss: 58.1967 Explore P: 0.7259
Episode: 183 Total reward: 13.0 Training loss: 85.1540 Explore P: 0.7249
Episode: 184 Total reward: 16.0 Training loss: 5.1214 Explore P: 0.7238
Episode: 185 Total reward: 19.0 Training loss: 69.1839 Explore P: 0.7224
Episode: 186 Total reward: 7.0 Training loss: 63.2256 Explore P: 0.7219
Episode: 187 Total reward: 17.0 Training loss: 73.2788 Explore P: 0.7207
Episode: 188 Total reward: 15.0 Training loss: 78.6213 Explore P: 0.7197
Episode: 189 Total reward: 11.0 Training loss: 88.5211 Explore P: 0.7189
Episode: 190 Total reward: 14.0 Training loss: 60.1332 Explore P: 0.7179
Episode: 191 Total reward: 15.0 Training loss: 135.7724 Explore P: 0.7168
Episode: 192 Total reward: 15.0 Training loss: 156.9691 Explore P: 0.7158
Episode: 193 Total reward: 17.0 Training loss: 93.3756 Explore P: 0.7146
Episode: 194 Total reward: 12.0 Training loss: 3.0462 Explore P: 0.7137
Episode: 195 Total reward: 9.0 Training loss: 119.2650 Explore P: 0.7131
Episode: 196 Total reward: 13.0 Training loss: 66.6383 Explore P: 0.7122
Episode: 197 Total reward: 9.0 Training loss: 113.7849 Explore P: 0.7116
Episode: 198 Total reward: 13.0 Training loss: 54.6072 Explore P: 0.7106
Episode: 199 Total reward: 19.0 Training loss: 54.8980 Explore P: 0.7093
Episode: 200 Total reward: 20.0 Training loss: 155.5480 Explore P: 0.7079
Episode: 201 Total reward: 10.0 Training loss: 45.8685 Explore P: 0.7072
Episode: 202 Total reward: 14.0 Training loss: 53.5145 Explore P: 0.7062
Episode: 203 Total reward: 8.0 Training loss: 107.9623 Explore P: 0.7057
Episode: 204 Total reward: 21.0 Training loss: 40.2749 Explore P: 0.7042
Episode: 205 Total reward: 32.0 Training loss: 43.2627 Explore P: 0.7020
Episode: 206 Total reward: 9.0 Training loss: 55.5398 Explore P: 0.7014
Episode: 207 Total reward: 15.0 Training loss: 1.9959 Explore P: 0.7004
Episode: 208 Total reward: 12.0 Training loss: 105.3751 Explore P: 0.6995
Episode: 209 Total reward: 11.0 Training loss: 40.8319 Explore P: 0.6988
Episode: 210 Total reward: 10.0 Training loss: 89.7147 Explore P: 0.6981
Episode: 211 Total reward: 10.0 Training loss: 1.1946 Explore P: 0.6974
Episode: 212 Total reward: 10.0 Training loss: 80.6916 Explore P: 0.6967
Episode: 213 Total reward: 24.0 Training loss: 88.0977 Explore P: 0.6951
Episode: 214 Total reward: 14.0 Training loss: 46.4105 Explore P: 0.6941
Episode: 215 Total reward: 11.0 Training loss: 40.3726 Explore P: 0.6933
Episode: 216 Total reward: 11.0 Training loss: 3.0770 Explore P: 0.6926
Episode: 217 Total reward: 8.0 Training loss: 1.7495 Explore P: 0.6921
Episode: 218 Total reward: 16.0 Training loss: 1.5615 Explore P: 0.6910
Episode: 219 Total reward: 17.0 Training loss: 2.0250 Explore P: 0.6898
Episode: 220 Total reward: 18.0 Training loss: 37.8432 Explore P: 0.6886
Episode: 221 Total reward: 17.0 Training loss: 1.9049 Explore P: 0.6874
Episode: 222 Total reward: 16.0 Training loss: 1.9652 Explore P: 0.6863
Episode: 223 Total reward: 16.0 Training loss: 1.4384 Explore P: 0.6853
Episode: 224 Total reward: 27.0 Training loss: 66.0615 Explore P: 0.6834
Episode: 225 Total reward: 9.0 Training loss: 0.8478 Explore P: 0.6828
Episode: 226 Total reward: 14.0 Training loss: 1.0319 Explore P: 0.6819
Episode: 227 Total reward: 17.0 Training loss: 97.6957 Explore P: 0.6808
Episode: 228 Total reward: 8.0 Training loss: 68.0521 Explore P: 0.6802
Episode: 229 Total reward: 9.0 Training loss: 110.8437 Explore P: 0.6796
Episode: 230 Total reward: 19.0 Training loss: 1.6856 Explore P: 0.6783
Episode: 231 Total reward: 9.0 Training loss: 2.0634 Explore P: 0.6777
Episode: 232 Total reward: 11.0 Training loss: 32.0714 Explore P: 0.6770
Episode: 233 Total reward: 9.0 Training loss: 2.0387 Explore P: 0.6764
Episode: 234 Total reward: 13.0 Training loss: 66.9349 Explore P: 0.6755
Episode: 235 Total reward: 14.0 Training loss: 110.6725 Explore P: 0.6746
Episode: 236 Total reward: 18.0 Training loss: 1.0585 Explore P: 0.6734
Episode: 237 Total reward: 11.0 Training loss: 117.0101 Explore P: 0.6727
Episode: 238 Total reward: 7.0 Training loss: 2.6115 Explore P: 0.6722
Episode: 239 Total reward: 10.0 Training loss: 124.7320 Explore P: 0.6716
Episode: 240 Total reward: 18.0 Training loss: 2.5475 Explore P: 0.6704
Episode: 241 Total reward: 37.0 Training loss: 2.1454 Explore P: 0.6679
Episode: 242 Total reward: 11.0 Training loss: 23.6042 Explore P: 0.6672
Episode: 243 Total reward: 32.0 Training loss: 1.4344 Explore P: 0.6651
Episode: 244 Total reward: 9.0 Training loss: 1.5328 Explore P: 0.6645
Episode: 245 Total reward: 14.0 Training loss: 84.7870 Explore P: 0.6636
Episode: 246 Total reward: 12.0 Training loss: 2.7292 Explore P: 0.6628
Episode: 247 Total reward: 26.0 Training loss: 40.6692 Explore P: 0.6611
Episode: 248 Total reward: 12.0 Training loss: 22.0901 Explore P: 0.6603
Episode: 249 Total reward: 15.0 Training loss: 37.9304 Explore P: 0.6594
Episode: 250 Total reward: 20.0 Training loss: 1.4137 Explore P: 0.6581
Episode: 251 Total reward: 16.0 Training loss: 1.7831 Explore P: 0.6570
Episode: 252 Total reward: 9.0 Training loss: 38.0640 Explore P: 0.6565
Episode: 253 Total reward: 17.0 Training loss: 21.7703 Explore P: 0.6554
Episode: 254 Total reward: 24.0 Training loss: 40.3204 Explore P: 0.6538
Episode: 255 Total reward: 30.0 Training loss: 43.4179 Explore P: 0.6519
Episode: 256 Total reward: 11.0 Training loss: 60.9330 Explore P: 0.6512
Episode: 257 Total reward: 14.0 Training loss: 66.6886 Explore P: 0.6503
Episode: 258 Total reward: 15.0 Training loss: 2.5639 Explore P: 0.6493
Episode: 259 Total reward: 19.0 Training loss: 2.6969 Explore P: 0.6481
Episode: 260 Total reward: 10.0 Training loss: 2.6837 Explore P: 0.6475
Episode: 261 Total reward: 30.0 Training loss: 20.6603 Explore P: 0.6456
Episode: 262 Total reward: 17.0 Training loss: 32.1585 Explore P: 0.6445
Episode: 263 Total reward: 15.0 Training loss: 1.0833 Explore P: 0.6435
Episode: 264 Total reward: 13.0 Training loss: 81.4551 Explore P: 0.6427
Episode: 265 Total reward: 17.0 Training loss: 3.3823 Explore P: 0.6416
Episode: 266 Total reward: 11.0 Training loss: 36.3942 Explore P: 0.6409
Episode: 267 Total reward: 11.0 Training loss: 1.6628 Explore P: 0.6402
Episode: 268 Total reward: 33.0 Training loss: 26.9925 Explore P: 0.6382
Episode: 269 Total reward: 18.0 Training loss: 45.8608 Explore P: 0.6370
Episode: 270 Total reward: 20.0 Training loss: 2.7911 Explore P: 0.6358
Episode: 271 Total reward: 10.0 Training loss: 35.9215 Explore P: 0.6352
Episode: 272 Total reward: 14.0 Training loss: 2.5923 Explore P: 0.6343
Episode: 273 Total reward: 16.0 Training loss: 41.2339 Explore P: 0.6333
Episode: 274 Total reward: 18.0 Training loss: 46.7318 Explore P: 0.6322
Episode: 275 Total reward: 14.0 Training loss: 2.7245 Explore P: 0.6313
Episode: 276 Total reward: 8.0 Training loss: 16.2681 Explore P: 0.6308
Episode: 277 Total reward: 10.0 Training loss: 21.6856 Explore P: 0.6302
Episode: 278 Total reward: 12.0 Training loss: 1.7879 Explore P: 0.6294
Episode: 279 Total reward: 10.0 Training loss: 97.1567 Explore P: 0.6288
Episode: 280 Total reward: 16.0 Training loss: 3.4710 Explore P: 0.6278
Episode: 281 Total reward: 14.0 Training loss: 65.8457 Explore P: 0.6270
Episode: 282 Total reward: 21.0 Training loss: 32.4442 Explore P: 0.6257
Episode: 283 Total reward: 17.0 Training loss: 48.0136 Explore P: 0.6246
Episode: 284 Total reward: 11.0 Training loss: 2.8833 Explore P: 0.6239
Episode: 285 Total reward: 16.0 Training loss: 92.6062 Explore P: 0.6230
Episode: 286 Total reward: 16.0 Training loss: 19.1051 Explore P: 0.6220
Episode: 287 Total reward: 7.0 Training loss: 1.8220 Explore P: 0.6216
Episode: 288 Total reward: 16.0 Training loss: 41.3844 Explore P: 0.6206
Episode: 289 Total reward: 18.0 Training loss: 50.0580 Explore P: 0.6195
Episode: 290 Total reward: 13.0 Training loss: 83.2142 Explore P: 0.6187
Episode: 291 Total reward: 14.0 Training loss: 70.2605 Explore P: 0.6178
Episode: 292 Total reward: 16.0 Training loss: 53.9664 Explore P: 0.6169
Episode: 293 Total reward: 17.0 Training loss: 3.2764 Explore P: 0.6158
Episode: 294 Total reward: 18.0 Training loss: 17.7963 Explore P: 0.6147
Episode: 295 Total reward: 17.0 Training loss: 32.3772 Explore P: 0.6137
Episode: 296 Total reward: 32.0 Training loss: 18.3755 Explore P: 0.6118
Episode: 297 Total reward: 29.0 Training loss: 17.1377 Explore P: 0.6100
Episode: 298 Total reward: 12.0 Training loss: 14.2922 Explore P: 0.6093
Episode: 299 Total reward: 14.0 Training loss: 29.2226 Explore P: 0.6085
Episode: 300 Total reward: 17.0 Training loss: 38.9089 Explore P: 0.6075
Episode: 301 Total reward: 9.0 Training loss: 62.2483 Explore P: 0.6069
Episode: 302 Total reward: 22.0 Training loss: 2.3240 Explore P: 0.6056
Episode: 303 Total reward: 16.0 Training loss: 0.9979 Explore P: 0.6047
Episode: 304 Total reward: 8.0 Training loss: 67.9231 Explore P: 0.6042
Episode: 305 Total reward: 13.0 Training loss: 33.0928 Explore P: 0.6034
Episode: 306 Total reward: 20.0 Training loss: 1.3173 Explore P: 0.6022
Episode: 307 Total reward: 23.0 Training loss: 50.2106 Explore P: 0.6009
Episode: 308 Total reward: 17.0 Training loss: 52.5245 Explore P: 0.5999
Episode: 309 Total reward: 20.0 Training loss: 32.5832 Explore P: 0.5987
Episode: 310 Total reward: 19.0 Training loss: 29.0224 Explore P: 0.5976
Episode: 311 Total reward: 19.0 Training loss: 29.8863 Explore P: 0.5965
Episode: 312 Total reward: 27.0 Training loss: 34.4016 Explore P: 0.5949
Episode: 313 Total reward: 9.0 Training loss: 1.1433 Explore P: 0.5944
Episode: 314 Total reward: 20.0 Training loss: 28.8137 Explore P: 0.5932
Episode: 315 Total reward: 24.0 Training loss: 48.5379 Explore P: 0.5918
Episode: 316 Total reward: 28.0 Training loss: 45.2671 Explore P: 0.5902
Episode: 317 Total reward: 13.0 Training loss: 45.9822 Explore P: 0.5894
Episode: 318 Total reward: 12.0 Training loss: 86.3972 Explore P: 0.5887
Episode: 319 Total reward: 10.0 Training loss: 11.2909 Explore P: 0.5881
Episode: 320 Total reward: 11.0 Training loss: 36.5474 Explore P: 0.5875
Episode: 321 Total reward: 13.0 Training loss: 1.1439 Explore P: 0.5867
Episode: 322 Total reward: 8.0 Training loss: 12.6978 Explore P: 0.5863
Episode: 323 Total reward: 20.0 Training loss: 31.7664 Explore P: 0.5851
Episode: 324 Total reward: 8.0 Training loss: 29.4243 Explore P: 0.5847
Episode: 325 Total reward: 13.0 Training loss: 12.2373 Explore P: 0.5839
Episode: 326 Total reward: 19.0 Training loss: 24.2228 Explore P: 0.5828
Episode: 327 Total reward: 68.0 Training loss: 0.7256 Explore P: 0.5790
Episode: 328 Total reward: 11.0 Training loss: 1.2313 Explore P: 0.5783
Episode: 329 Total reward: 15.0 Training loss: 1.3319 Explore P: 0.5775
Episode: 330 Total reward: 53.0 Training loss: 9.9350 Explore P: 0.5745
Episode: 331 Total reward: 74.0 Training loss: 1.4366 Explore P: 0.5703
Episode: 332 Total reward: 16.0 Training loss: 11.2724 Explore P: 0.5694
Episode: 333 Total reward: 34.0 Training loss: 10.6128 Explore P: 0.5675
Episode: 334 Total reward: 27.0 Training loss: 14.9559 Explore P: 0.5660
Episode: 335 Total reward: 31.0 Training loss: 16.6541 Explore P: 0.5643
Episode: 336 Total reward: 49.0 Training loss: 23.3966 Explore P: 0.5616
Episode: 337 Total reward: 40.0 Training loss: 45.3419 Explore P: 0.5594
Episode: 338 Total reward: 71.0 Training loss: 0.8244 Explore P: 0.5555
Episode: 339 Total reward: 56.0 Training loss: 41.4562 Explore P: 0.5525
Episode: 340 Total reward: 18.0 Training loss: 1.2548 Explore P: 0.5515
Episode: 341 Total reward: 56.0 Training loss: 1.5400 Explore P: 0.5485
Episode: 342 Total reward: 34.0 Training loss: 12.0206 Explore P: 0.5466
Episode: 343 Total reward: 67.0 Training loss: 1.4189 Explore P: 0.5430
Episode: 344 Total reward: 27.0 Training loss: 1.3138 Explore P: 0.5416
Episode: 345 Total reward: 42.0 Training loss: 1.1650 Explore P: 0.5394
Episode: 346 Total reward: 23.0 Training loss: 23.1743 Explore P: 0.5382
Episode: 347 Total reward: 54.0 Training loss: 0.6971 Explore P: 0.5353
Episode: 348 Total reward: 34.0 Training loss: 27.2789 Explore P: 0.5335
Episode: 349 Total reward: 25.0 Training loss: 37.4133 Explore P: 0.5322
Episode: 350 Total reward: 20.0 Training loss: 1.6443 Explore P: 0.5312
Episode: 351 Total reward: 26.0 Training loss: 12.6839 Explore P: 0.5298
Episode: 352 Total reward: 40.0 Training loss: 13.3593 Explore P: 0.5278
Episode: 353 Total reward: 18.0 Training loss: 1.7079 Explore P: 0.5268
Episode: 354 Total reward: 47.0 Training loss: 32.5788 Explore P: 0.5244
Episode: 355 Total reward: 20.0 Training loss: 1.6101 Explore P: 0.5234
Episode: 356 Total reward: 53.0 Training loss: 2.5321 Explore P: 0.5207
Episode: 357 Total reward: 15.0 Training loss: 1.6396 Explore P: 0.5199
Episode: 358 Total reward: 76.0 Training loss: 20.8058 Explore P: 0.5160
Episode: 359 Total reward: 12.0 Training loss: 13.0315 Explore P: 0.5154
Episode: 360 Total reward: 42.0 Training loss: 10.1313 Explore P: 0.5133
Episode: 361 Total reward: 53.0 Training loss: 25.4319 Explore P: 0.5106
Episode: 362 Total reward: 33.0 Training loss: 26.4256 Explore P: 0.5090
Episode: 363 Total reward: 85.0 Training loss: 20.2429 Explore P: 0.5048
Episode: 364 Total reward: 23.0 Training loss: 16.1083 Explore P: 0.5036
Episode: 365 Total reward: 30.0 Training loss: 1.6888 Explore P: 0.5022
Episode: 366 Total reward: 66.0 Training loss: 2.0408 Explore P: 0.4989
Episode: 367 Total reward: 37.0 Training loss: 18.6438 Explore P: 0.4971
Episode: 368 Total reward: 50.0 Training loss: 20.1544 Explore P: 0.4947
Episode: 369 Total reward: 78.0 Training loss: 23.8497 Explore P: 0.4909
Episode: 370 Total reward: 83.0 Training loss: 20.6897 Explore P: 0.4869
Episode: 371 Total reward: 44.0 Training loss: 25.4317 Explore P: 0.4849
Episode: 372 Total reward: 44.0 Training loss: 1.5212 Explore P: 0.4828
Episode: 373 Total reward: 14.0 Training loss: 1.5019 Explore P: 0.4821
Episode: 374 Total reward: 31.0 Training loss: 1.8348 Explore P: 0.4806
Episode: 375 Total reward: 25.0 Training loss: 19.7533 Explore P: 0.4795
Episode: 376 Total reward: 51.0 Training loss: 1.5433 Explore P: 0.4771
Episode: 377 Total reward: 23.0 Training loss: 12.9174 Explore P: 0.4760
Episode: 378 Total reward: 67.0 Training loss: 27.2318 Explore P: 0.4729
Episode: 379 Total reward: 26.0 Training loss: 1.9319 Explore P: 0.4717
Episode: 380 Total reward: 35.0 Training loss: 43.2445 Explore P: 0.4701
Episode: 381 Total reward: 33.0 Training loss: 1.5195 Explore P: 0.4686
Episode: 382 Total reward: 30.0 Training loss: 15.4622 Explore P: 0.4672
Episode: 383 Total reward: 12.0 Training loss: 1.8349 Explore P: 0.4666
Episode: 384 Total reward: 25.0 Training loss: 47.7600 Explore P: 0.4655
Episode: 385 Total reward: 36.0 Training loss: 29.6753 Explore P: 0.4639
Episode: 386 Total reward: 50.0 Training loss: 1.1244 Explore P: 0.4616
Episode: 387 Total reward: 35.0 Training loss: 1.0955 Explore P: 0.4600
Episode: 388 Total reward: 52.0 Training loss: 24.9624 Explore P: 0.4577
Episode: 389 Total reward: 52.0 Training loss: 28.2028 Explore P: 0.4554
Episode: 390 Total reward: 132.0 Training loss: 30.5190 Explore P: 0.4495
Episode: 391 Total reward: 25.0 Training loss: 10.3908 Explore P: 0.4484
Episode: 392 Total reward: 56.0 Training loss: 14.1483 Explore P: 0.4460
Episode: 393 Total reward: 110.0 Training loss: 2.0169 Explore P: 0.4412
Episode: 394 Total reward: 78.0 Training loss: 1.2122 Explore P: 0.4379
Episode: 395 Total reward: 44.0 Training loss: 56.4728 Explore P: 0.4360
Episode: 396 Total reward: 90.0 Training loss: 65.6667 Explore P: 0.4322
Episode: 397 Total reward: 36.0 Training loss: 0.9032 Explore P: 0.4307
Episode: 398 Total reward: 40.0 Training loss: 0.8414 Explore P: 0.4290
Episode: 399 Total reward: 109.0 Training loss: 42.5467 Explore P: 0.4244
Episode: 400 Total reward: 37.0 Training loss: 2.6053 Explore P: 0.4229
Episode: 401 Total reward: 62.0 Training loss: 1.2301 Explore P: 0.4203
Episode: 402 Total reward: 42.0 Training loss: 1.1384 Explore P: 0.4186
Episode: 403 Total reward: 71.0 Training loss: 1.6765 Explore P: 0.4157
Episode: 404 Total reward: 88.0 Training loss: 2.4000 Explore P: 0.4122
Episode: 405 Total reward: 55.0 Training loss: 24.7748 Explore P: 0.4100
Episode: 406 Total reward: 33.0 Training loss: 13.5934 Explore P: 0.4087
Episode: 407 Total reward: 44.0 Training loss: 14.6865 Explore P: 0.4069
Episode: 408 Total reward: 40.0 Training loss: 2.0898 Explore P: 0.4053
Episode: 409 Total reward: 98.0 Training loss: 2.1043 Explore P: 0.4015
Episode: 410 Total reward: 63.0 Training loss: 11.3562 Explore P: 0.3990
Episode: 411 Total reward: 50.0 Training loss: 14.1151 Explore P: 0.3971
Episode: 412 Total reward: 44.0 Training loss: 1.2370 Explore P: 0.3954
Episode: 413 Total reward: 56.0 Training loss: 2.1136 Explore P: 0.3932
Episode: 414 Total reward: 61.0 Training loss: 2.2578 Explore P: 0.3909
Episode: 415 Total reward: 49.0 Training loss: 1.3966 Explore P: 0.3890
Episode: 416 Total reward: 55.0 Training loss: 10.2836 Explore P: 0.3869
Episode: 417 Total reward: 121.0 Training loss: 2.2477 Explore P: 0.3824
Episode: 418 Total reward: 46.0 Training loss: 2.3118 Explore P: 0.3807
Episode: 419 Total reward: 70.0 Training loss: 27.3952 Explore P: 0.3781
Episode: 420 Total reward: 72.0 Training loss: 45.7570 Explore P: 0.3755
Episode: 421 Total reward: 41.0 Training loss: 59.1887 Explore P: 0.3740
Episode: 422 Total reward: 67.0 Training loss: 27.0998 Explore P: 0.3716
Episode: 423 Total reward: 46.0 Training loss: 43.1971 Explore P: 0.3699
Episode: 424 Total reward: 52.0 Training loss: 2.0718 Explore P: 0.3680
Episode: 425 Total reward: 92.0 Training loss: 96.7074 Explore P: 0.3647
Episode: 426 Total reward: 60.0 Training loss: 2.0684 Explore P: 0.3626
Episode: 427 Total reward: 106.0 Training loss: 54.1831 Explore P: 0.3589
Episode: 428 Total reward: 76.0 Training loss: 1.9612 Explore P: 0.3563
Episode: 429 Total reward: 42.0 Training loss: 1.6153 Explore P: 0.3548
Episode: 430 Total reward: 77.0 Training loss: 3.9801 Explore P: 0.3522
Episode: 431 Total reward: 123.0 Training loss: 2.0505 Explore P: 0.3480
Episode: 432 Total reward: 150.0 Training loss: 19.4217 Explore P: 0.3430
Episode: 433 Total reward: 57.0 Training loss: 1.5850 Explore P: 0.3411
Episode: 434 Total reward: 74.0 Training loss: 2.4292 Explore P: 0.3386
Episode: 435 Total reward: 97.0 Training loss: 23.5709 Explore P: 0.3354
Episode: 436 Total reward: 99.0 Training loss: 2.0727 Explore P: 0.3322
Episode: 437 Total reward: 101.0 Training loss: 22.3250 Explore P: 0.3290
Episode: 438 Total reward: 46.0 Training loss: 2.0320 Explore P: 0.3275
Episode: 439 Total reward: 51.0 Training loss: 4.8099 Explore P: 0.3259
Episode: 440 Total reward: 111.0 Training loss: 68.3524 Explore P: 0.3224
Episode: 441 Total reward: 167.0 Training loss: 2.3045 Explore P: 0.3173
Episode: 442 Total reward: 80.0 Training loss: 0.8798 Explore P: 0.3148
Episode: 443 Total reward: 170.0 Training loss: 48.6270 Explore P: 0.3097
Episode: 444 Total reward: 77.0 Training loss: 2.2555 Explore P: 0.3074
Episode: 445 Total reward: 84.0 Training loss: 3.0428 Explore P: 0.3049
Episode: 447 Total reward: 12.0 Training loss: 2.8022 Explore P: 0.2987
Episode: 448 Total reward: 66.0 Training loss: 120.3442 Explore P: 0.2968
Episode: 449 Total reward: 152.0 Training loss: 6.2880 Explore P: 0.2925
Episode: 450 Total reward: 141.0 Training loss: 2.8015 Explore P: 0.2885
Episode: 451 Total reward: 99.0 Training loss: 64.0921 Explore P: 0.2858
Episode: 452 Total reward: 79.0 Training loss: 1.5581 Explore P: 0.2836
Episode: 453 Total reward: 49.0 Training loss: 113.2557 Explore P: 0.2823
Episode: 455 Total reward: 106.0 Training loss: 210.4934 Explore P: 0.2741
Episode: 456 Total reward: 109.0 Training loss: 81.6662 Explore P: 0.2712
Episode: 457 Total reward: 56.0 Training loss: 3.2287 Explore P: 0.2697
Episode: 458 Total reward: 138.0 Training loss: 2.5795 Explore P: 0.2662
Episode: 459 Total reward: 93.0 Training loss: 3.4260 Explore P: 0.2638
Episode: 460 Total reward: 71.0 Training loss: 139.3341 Explore P: 0.2620
Episode: 461 Total reward: 106.0 Training loss: 2.6074 Explore P: 0.2594
Episode: 462 Total reward: 63.0 Training loss: 2.8252 Explore P: 0.2578
Episode: 463 Total reward: 71.0 Training loss: 25.8917 Explore P: 0.2560
Episode: 464 Total reward: 79.0 Training loss: 3.8067 Explore P: 0.2541
Episode: 465 Total reward: 86.0 Training loss: 1.6050 Explore P: 0.2520
Episode: 466 Total reward: 88.0 Training loss: 44.2827 Explore P: 0.2499
Episode: 467 Total reward: 72.0 Training loss: 0.7160 Explore P: 0.2482
Episode: 468 Total reward: 152.0 Training loss: 75.7239 Explore P: 0.2446
Episode: 469 Total reward: 122.0 Training loss: 7.4345 Explore P: 0.2417
Episode: 470 Total reward: 81.0 Training loss: 101.0922 Explore P: 0.2399
Episode: 471 Total reward: 38.0 Training loss: 1.6301 Explore P: 0.2390
Episode: 472 Total reward: 79.0 Training loss: 72.4920 Explore P: 0.2372
Episode: 473 Total reward: 190.0 Training loss: 1.3869 Explore P: 0.2329
Episode: 474 Total reward: 197.0 Training loss: 1.5386 Explore P: 0.2286
Episode: 476 Total reward: 42.0 Training loss: 0.8364 Explore P: 0.2233
Episode: 477 Total reward: 134.0 Training loss: 88.3979 Explore P: 0.2205
Episode: 478 Total reward: 128.0 Training loss: 94.1007 Explore P: 0.2178
Episode: 479 Total reward: 79.0 Training loss: 103.1366 Explore P: 0.2162
Episode: 480 Total reward: 169.0 Training loss: 58.8788 Explore P: 0.2127
Episode: 481 Total reward: 160.0 Training loss: 1.1934 Explore P: 0.2095
Episode: 482 Total reward: 81.0 Training loss: 2.9244 Explore P: 0.2079
Episode: 484 Total reward: 68.0 Training loss: 77.9688 Explore P: 0.2027
Episode: 486 Total reward: 17.0 Training loss: 0.6864 Explore P: 0.1985
Episode: 488 Total reward: 62.0 Training loss: 1.9978 Explore P: 0.1937
Episode: 489 Total reward: 178.0 Training loss: 228.6335 Explore P: 0.1904
Episode: 490 Total reward: 114.0 Training loss: 0.4453 Explore P: 0.1884
Episode: 491 Total reward: 127.0 Training loss: 1.6523 Explore P: 0.1861
Episode: 492 Total reward: 124.0 Training loss: 120.2207 Explore P: 0.1840
Episode: 493 Total reward: 184.0 Training loss: 0.5913 Explore P: 0.1808
Episode: 494 Total reward: 129.0 Training loss: 56.3829 Explore P: 0.1786
Episode: 495 Total reward: 95.0 Training loss: 1.9883 Explore P: 0.1770
Episode: 496 Total reward: 129.0 Training loss: 1.2513 Explore P: 0.1749
Episode: 497 Total reward: 176.0 Training loss: 1.0322 Explore P: 0.1720
Episode: 498 Total reward: 132.0 Training loss: 0.9320 Explore P: 0.1699
Episode: 499 Total reward: 146.0 Training loss: 289.4379 Explore P: 0.1675
Episode: 500 Total reward: 147.0 Training loss: 0.5124 Explore P: 0.1652
Episode: 501 Total reward: 166.0 Training loss: 0.9444 Explore P: 0.1627
Episode: 503 Total reward: 31.0 Training loss: 1.4756 Explore P: 0.1592
Episode: 504 Total reward: 129.0 Training loss: 0.6077 Explore P: 0.1573
Episode: 505 Total reward: 127.0 Training loss: 1.2508 Explore P: 0.1554
Episode: 506 Total reward: 123.0 Training loss: 0.8265 Explore P: 0.1537
Episode: 507 Total reward: 159.0 Training loss: 260.4604 Explore P: 0.1514
Episode: 508 Total reward: 136.0 Training loss: 0.9311 Explore P: 0.1495
Episode: 509 Total reward: 198.0 Training loss: 0.9262 Explore P: 0.1467
Episode: 511 Total reward: 40.0 Training loss: 2.3126 Explore P: 0.1435
Episode: 512 Total reward: 130.0 Training loss: 1.2985 Explore P: 0.1418
Episode: 514 Total reward: 25.0 Training loss: 1.1655 Explore P: 0.1388
Episode: 515 Total reward: 149.0 Training loss: 214.9246 Explore P: 0.1369
Episode: 516 Total reward: 200.0 Training loss: 0.8085 Explore P: 0.1344
Episode: 517 Total reward: 172.0 Training loss: 24.9451 Explore P: 0.1323
Episode: 519 Total reward: 52.0 Training loss: 61.7503 Explore P: 0.1293
Episode: 520 Total reward: 176.0 Training loss: 0.4361 Explore P: 0.1272
Episode: 521 Total reward: 160.0 Training loss: 1.8377 Explore P: 0.1253
Episode: 522 Total reward: 180.0 Training loss: 1.5684 Explore P: 0.1233
Episode: 523 Total reward: 174.0 Training loss: 58.6258 Explore P: 0.1213
Episode: 525 Total reward: 10.0 Training loss: 222.1836 Explore P: 0.1190
Episode: 527 Total reward: 32.0 Training loss: 0.4007 Explore P: 0.1165
Episode: 529 Total reward: 33.0 Training loss: 0.3054 Explore P: 0.1140
Episode: 530 Total reward: 185.0 Training loss: 0.7425 Explore P: 0.1121
Episode: 531 Total reward: 171.0 Training loss: 0.3441 Explore P: 0.1104
Episode: 532 Total reward: 199.0 Training loss: 0.2333 Explore P: 0.1084
Episode: 534 Total reward: 95.0 Training loss: 0.4929 Explore P: 0.1056
Episode: 536 Total reward: 8.0 Training loss: 0.5416 Explore P: 0.1036
Episode: 538 Total reward: 42.0 Training loss: 163.3946 Explore P: 0.1014
Episode: 539 Total reward: 180.0 Training loss: 0.2803 Explore P: 0.0997
Episode: 540 Total reward: 193.0 Training loss: 0.4929 Explore P: 0.0980
Episode: 542 Total reward: 36.0 Training loss: 0.7983 Explore P: 0.0960
Episode: 544 Total reward: 152.0 Training loss: 41.4165 Explore P: 0.0930
Episode: 546 Total reward: 30.0 Training loss: 0.7570 Explore P: 0.0911
Episode: 548 Total reward: 50.0 Training loss: 0.3215 Explore P: 0.0891
Episode: 550 Total reward: 79.0 Training loss: 0.5349 Explore P: 0.0869
Episode: 552 Total reward: 38.0 Training loss: 0.3635 Explore P: 0.0851
Episode: 553 Total reward: 131.0 Training loss: 0.3965 Explore P: 0.0841
Episode: 554 Total reward: 135.0 Training loss: 0.2453 Explore P: 0.0831
Episode: 555 Total reward: 111.0 Training loss: 0.9434 Explore P: 0.0823
Episode: 556 Total reward: 136.0 Training loss: 0.7058 Explore P: 0.0814
Episode: 557 Total reward: 106.0 Training loss: 0.4755 Explore P: 0.0806
Episode: 558 Total reward: 98.0 Training loss: 0.4107 Explore P: 0.0799
Episode: 559 Total reward: 102.0 Training loss: 62.3874 Explore P: 0.0792
Episode: 560 Total reward: 92.0 Training loss: 0.4026 Explore P: 0.0786
Episode: 561 Total reward: 86.0 Training loss: 0.3649 Explore P: 0.0780
Episode: 562 Total reward: 83.0 Training loss: 0.5843 Explore P: 0.0774
Episode: 563 Total reward: 100.0 Training loss: 0.1493 Explore P: 0.0768
Episode: 564 Total reward: 102.0 Training loss: 0.4021 Explore P: 0.0761
Episode: 565 Total reward: 60.0 Training loss: 0.3445 Explore P: 0.0757
Episode: 566 Total reward: 63.0 Training loss: 0.2461 Explore P: 0.0753
Episode: 567 Total reward: 59.0 Training loss: 0.2115 Explore P: 0.0749
Episode: 568 Total reward: 73.0 Training loss: 3.2738 Explore P: 0.0744
Episode: 569 Total reward: 70.0 Training loss: 0.4267 Explore P: 0.0740
Episode: 570 Total reward: 53.0 Training loss: 0.8779 Explore P: 0.0736
Episode: 571 Total reward: 88.0 Training loss: 187.5536 Explore P: 0.0731
Episode: 572 Total reward: 54.0 Training loss: 0.3208 Explore P: 0.0727
Episode: 573 Total reward: 87.0 Training loss: 0.2894 Explore P: 0.0722
Episode: 574 Total reward: 58.0 Training loss: 0.2578 Explore P: 0.0718
Episode: 575 Total reward: 85.0 Training loss: 0.3401 Explore P: 0.0713
Episode: 576 Total reward: 73.0 Training loss: 0.2245 Explore P: 0.0709
Episode: 577 Total reward: 114.0 Training loss: 0.3640 Explore P: 0.0702
Episode: 578 Total reward: 94.0 Training loss: 0.7954 Explore P: 0.0696
Episode: 579 Total reward: 114.0 Training loss: 0.2615 Explore P: 0.0689
Episode: 580 Total reward: 80.0 Training loss: 0.4812 Explore P: 0.0685
Episode: 581 Total reward: 125.0 Training loss: 0.8818 Explore P: 0.0677
Episode: 582 Total reward: 109.0 Training loss: 0.2953 Explore P: 0.0671
Episode: 583 Total reward: 98.0 Training loss: 0.4371 Explore P: 0.0665
Episode: 584 Total reward: 119.0 Training loss: 0.4685 Explore P: 0.0659
Episode: 585 Total reward: 96.0 Training loss: 0.3440 Explore P: 0.0653
Episode: 586 Total reward: 172.0 Training loss: 0.1414 Explore P: 0.0644
Episode: 587 Total reward: 104.0 Training loss: 0.3309 Explore P: 0.0638
Episode: 588 Total reward: 85.0 Training loss: 0.2262 Explore P: 0.0634
Episode: 590 Total reward: 104.0 Training loss: 0.3231 Explore P: 0.0618
Episode: 591 Total reward: 148.0 Training loss: 0.4431 Explore P: 0.0610
Episode: 592 Total reward: 135.0 Training loss: 0.1894 Explore P: 0.0603
Episode: 595 Total reward: 99.0 Training loss: 0.2376 Explore P: 0.0579
Episode: 597 Total reward: 172.0 Training loss: 0.4083 Explore P: 0.0561
Episode: 600 Total reward: 99.0 Training loss: 0.1152 Explore P: 0.0539
Episode: 603 Total reward: 99.0 Training loss: 0.3594 Explore P: 0.0518
Episode: 604 Total reward: 149.0 Training loss: 0.1398 Explore P: 0.0511
Episode: 607 Total reward: 99.0 Training loss: 0.3337 Explore P: 0.0491
Episode: 608 Total reward: 180.0 Training loss: 7.4786 Explore P: 0.0484
Episode: 611 Total reward: 28.0 Training loss: 0.1953 Explore P: 0.0468
Episode: 614 Total reward: 38.0 Training loss: 0.3152 Explore P: 0.0453
Episode: 617 Total reward: 50.0 Training loss: 0.4420 Explore P: 0.0437
Episode: 620 Total reward: 9.0 Training loss: 0.3347 Explore P: 0.0423
Episode: 623 Total reward: 99.0 Training loss: 0.1405 Explore P: 0.0408
Episode: 626 Total reward: 78.0 Training loss: 0.1488 Explore P: 0.0393
Episode: 628 Total reward: 198.0 Training loss: 0.9185 Explore P: 0.0382
Episode: 631 Total reward: 99.0 Training loss: 0.3505 Explore P: 0.0368
Episode: 633 Total reward: 142.0 Training loss: 0.1654 Explore P: 0.0359
Episode: 635 Total reward: 134.0 Training loss: 0.3178 Explore P: 0.0351
Episode: 637 Total reward: 104.0 Training loss: 0.3331 Explore P: 0.0343
Episode: 639 Total reward: 73.0 Training loss: 66.8497 Explore P: 0.0337
Episode: 641 Total reward: 35.0 Training loss: 0.1411 Explore P: 0.0331
Episode: 643 Total reward: 83.0 Training loss: 0.2136 Explore P: 0.0325
Episode: 645 Total reward: 62.0 Training loss: 0.2303 Explore P: 0.0319
Episode: 647 Total reward: 28.0 Training loss: 0.2552 Explore P: 0.0314
Episode: 649 Total reward: 4.0 Training loss: 0.1967 Explore P: 0.0310
Episode: 650 Total reward: 194.0 Training loss: 0.1424 Explore P: 0.0306
Episode: 651 Total reward: 170.0 Training loss: 0.1509 Explore P: 0.0302
Episode: 652 Total reward: 150.0 Training loss: 0.2699 Explore P: 0.0299
Episode: 653 Total reward: 161.0 Training loss: 0.2821 Explore P: 0.0296
Episode: 654 Total reward: 148.0 Training loss: 0.4859 Explore P: 0.0293
Episode: 655 Total reward: 142.0 Training loss: 0.1541 Explore P: 0.0290
Episode: 656 Total reward: 140.0 Training loss: 0.0963 Explore P: 0.0288
Episode: 657 Total reward: 144.0 Training loss: 0.3165 Explore P: 0.0285
Episode: 658 Total reward: 160.0 Training loss: 0.2059 Explore P: 0.0282
Episode: 659 Total reward: 127.0 Training loss: 0.0918 Explore P: 0.0280
Episode: 660 Total reward: 124.0 Training loss: 431.4700 Explore P: 0.0278
Episode: 661 Total reward: 127.0 Training loss: 0.2660 Explore P: 0.0275
Episode: 662 Total reward: 133.0 Training loss: 0.4122 Explore P: 0.0273
Episode: 663 Total reward: 119.0 Training loss: 0.2070 Explore P: 0.0271
Episode: 664 Total reward: 114.0 Training loss: 0.3453 Explore P: 0.0269
Episode: 665 Total reward: 130.0 Training loss: 0.3865 Explore P: 0.0267
Episode: 666 Total reward: 125.0 Training loss: 0.2518 Explore P: 0.0265
Episode: 667 Total reward: 138.0 Training loss: 0.1668 Explore P: 0.0263
Episode: 669 Total reward: 42.0 Training loss: 0.3241 Explore P: 0.0259
Episode: 671 Total reward: 105.0 Training loss: 0.1787 Explore P: 0.0254
Episode: 674 Total reward: 99.0 Training loss: 0.2393 Explore P: 0.0246
Episode: 677 Total reward: 99.0 Training loss: 0.2190 Explore P: 0.0239
Episode: 680 Total reward: 99.0 Training loss: 2.5996 Explore P: 0.0232
Episode: 683 Total reward: 99.0 Training loss: 0.3376 Explore P: 0.0226
Episode: 686 Total reward: 99.0 Training loss: 0.5884 Explore P: 0.0220
Episode: 689 Total reward: 99.0 Training loss: 0.1356 Explore P: 0.0214
Episode: 692 Total reward: 99.0 Training loss: 0.0920 Explore P: 0.0208
Episode: 695 Total reward: 99.0 Training loss: 0.1880 Explore P: 0.0203
Episode: 698 Total reward: 99.0 Training loss: 0.0951 Explore P: 0.0198
Episode: 701 Total reward: 99.0 Training loss: 0.1050 Explore P: 0.0193
Episode: 704 Total reward: 99.0 Training loss: 0.1234 Explore P: 0.0189
Episode: 707 Total reward: 99.0 Training loss: 0.1407 Explore P: 0.0185
Episode: 709 Total reward: 160.0 Training loss: 0.0913 Explore P: 0.0182
Episode: 712 Total reward: 99.0 Training loss: 0.1815 Explore P: 0.0178
Episode: 715 Total reward: 99.0 Training loss: 0.1191 Explore P: 0.0174
Episode: 718 Total reward: 99.0 Training loss: 0.1073 Explore P: 0.0170
Episode: 721 Total reward: 99.0 Training loss: 0.1133 Explore P: 0.0167
Episode: 724 Total reward: 99.0 Training loss: 0.0898 Explore P: 0.0164
Episode: 727 Total reward: 99.0 Training loss: 0.1217 Explore P: 0.0160
Episode: 730 Total reward: 99.0 Training loss: 0.2150 Explore P: 0.0158
Episode: 733 Total reward: 99.0 Training loss: 0.0678 Explore P: 0.0155
Episode: 736 Total reward: 99.0 Training loss: 0.0872 Explore P: 0.0152
Episode: 739 Total reward: 99.0 Training loss: 0.1330 Explore P: 0.0150
Episode: 742 Total reward: 99.0 Training loss: 0.1116 Explore P: 0.0147
Episode: 745 Total reward: 99.0 Training loss: 0.1611 Explore P: 0.0145
Episode: 748 Total reward: 99.0 Training loss: 0.1307 Explore P: 0.0143
Episode: 751 Total reward: 99.0 Training loss: 0.0875 Explore P: 0.0141
Episode: 754 Total reward: 99.0 Training loss: 0.1344 Explore P: 0.0139
Episode: 757 Total reward: 99.0 Training loss: 0.0911 Explore P: 0.0137
Episode: 760 Total reward: 99.0 Training loss: 0.1224 Explore P: 0.0135
Episode: 763 Total reward: 99.0 Training loss: 0.0572 Explore P: 0.0133
Episode: 766 Total reward: 99.0 Training loss: 0.0757 Explore P: 0.0132
Episode: 769 Total reward: 99.0 Training loss: 0.0381 Explore P: 0.0130
Episode: 772 Total reward: 99.0 Training loss: 0.1698 Explore P: 0.0129
Episode: 775 Total reward: 99.0 Training loss: 0.0365 Explore P: 0.0127
Episode: 778 Total reward: 99.0 Training loss: 0.1805 Explore P: 0.0126
Episode: 781 Total reward: 99.0 Training loss: 0.1017 Explore P: 0.0125
Episode: 784 Total reward: 99.0 Training loss: 0.1112 Explore P: 0.0123
Episode: 787 Total reward: 99.0 Training loss: 0.0930 Explore P: 0.0122
Episode: 790 Total reward: 99.0 Training loss: 0.0693 Explore P: 0.0121
Episode: 793 Total reward: 99.0 Training loss: 0.0431 Explore P: 0.0120
Episode: 796 Total reward: 99.0 Training loss: 0.1168 Explore P: 0.0119
Episode: 799 Total reward: 99.0 Training loss: 0.1071 Explore P: 0.0118
Episode: 802 Total reward: 99.0 Training loss: 0.1360 Explore P: 0.0117
Episode: 805 Total reward: 99.0 Training loss: 0.0872 Explore P: 0.0117
Episode: 808 Total reward: 99.0 Training loss: 0.1197 Explore P: 0.0116
Episode: 811 Total reward: 99.0 Training loss: 0.0848 Explore P: 0.0115
Episode: 814 Total reward: 99.0 Training loss: 0.0515 Explore P: 0.0114
Episode: 817 Total reward: 99.0 Training loss: 0.1590 Explore P: 0.0114
Episode: 820 Total reward: 99.0 Training loss: 0.2080 Explore P: 0.0113
Episode: 823 Total reward: 99.0 Training loss: 0.1532 Explore P: 0.0112
Episode: 826 Total reward: 99.0 Training loss: 0.0622 Explore P: 0.0112
Episode: 829 Total reward: 99.0 Training loss: 0.0553 Explore P: 0.0111
Episode: 832 Total reward: 99.0 Training loss: 0.0746 Explore P: 0.0111
Episode: 835 Total reward: 99.0 Training loss: 0.1045 Explore P: 0.0110
Episode: 838 Total reward: 99.0 Training loss: 0.0929 Explore P: 0.0110
Episode: 841 Total reward: 99.0 Training loss: 0.1053 Explore P: 0.0109
Episode: 844 Total reward: 99.0 Training loss: 0.0877 Explore P: 0.0109
Episode: 847 Total reward: 99.0 Training loss: 0.0783 Explore P: 0.0108
Episode: 850 Total reward: 99.0 Training loss: 0.0724 Explore P: 0.0108
Episode: 853 Total reward: 99.0 Training loss: 0.1745 Explore P: 0.0107
Episode: 856 Total reward: 99.0 Training loss: 0.0334 Explore P: 0.0107
Episode: 859 Total reward: 99.0 Training loss: 0.0205 Explore P: 0.0107
Episode: 862 Total reward: 99.0 Training loss: 0.0674 Explore P: 0.0106
Episode: 865 Total reward: 99.0 Training loss: 0.1149 Explore P: 0.0106
Episode: 868 Total reward: 99.0 Training loss: 191.0773 Explore P: 0.0106
Episode: 871 Total reward: 99.0 Training loss: 0.1013 Explore P: 0.0106
Episode: 873 Total reward: 156.0 Training loss: 0.2140 Explore P: 0.0105
Episode: 874 Total reward: 185.0 Training loss: 0.1837 Explore P: 0.0105
Episode: 876 Total reward: 45.0 Training loss: 0.1855 Explore P: 0.0105
Episode: 877 Total reward: 55.0 Training loss: 0.0789 Explore P: 0.0105
Episode: 880 Total reward: 99.0 Training loss: 0.0890 Explore P: 0.0105
Episode: 882 Total reward: 185.0 Training loss: 0.0788 Explore P: 0.0105
Episode: 885 Total reward: 99.0 Training loss: 0.1397 Explore P: 0.0104
Episode: 888 Total reward: 99.0 Training loss: 0.0400 Explore P: 0.0104
Episode: 891 Total reward: 99.0 Training loss: 0.0962 Explore P: 0.0104
Episode: 894 Total reward: 99.0 Training loss: 0.1356 Explore P: 0.0104
Episode: 897 Total reward: 99.0 Training loss: 0.2037 Explore P: 0.0104
Episode: 900 Total reward: 99.0 Training loss: 0.0486 Explore P: 0.0103
Episode: 903 Total reward: 99.0 Training loss: 0.2492 Explore P: 0.0103
Episode: 906 Total reward: 99.0 Training loss: 0.1467 Explore P: 0.0103
Episode: 909 Total reward: 99.0 Training loss: 0.2217 Explore P: 0.0103
Episode: 912 Total reward: 99.0 Training loss: 0.1772 Explore P: 0.0103
Episode: 915 Total reward: 99.0 Training loss: 0.0898 Explore P: 0.0103
Episode: 918 Total reward: 99.0 Training loss: 0.0552 Explore P: 0.0103
Episode: 921 Total reward: 99.0 Training loss: 0.1267 Explore P: 0.0102
Episode: 924 Total reward: 99.0 Training loss: 0.3037 Explore P: 0.0102
Episode: 927 Total reward: 99.0 Training loss: 0.1654 Explore P: 0.0102
Episode: 930 Total reward: 99.0 Training loss: 0.1975 Explore P: 0.0102
Episode: 933 Total reward: 99.0 Training loss: 0.2122 Explore P: 0.0102
Episode: 936 Total reward: 99.0 Training loss: 0.0754 Explore P: 0.0102
Episode: 939 Total reward: 99.0 Training loss: 0.1481 Explore P: 0.0102
Episode: 942 Total reward: 99.0 Training loss: 0.0895 Explore P: 0.0102
Episode: 945 Total reward: 99.0 Training loss: 0.0690 Explore P: 0.0102
Episode: 948 Total reward: 99.0 Training loss: 0.0942 Explore P: 0.0102
Episode: 951 Total reward: 99.0 Training loss: 0.0567 Explore P: 0.0101
Episode: 954 Total reward: 99.0 Training loss: 0.0665 Explore P: 0.0101
Episode: 957 Total reward: 99.0 Training loss: 0.0645 Explore P: 0.0101
Episode: 960 Total reward: 99.0 Training loss: 224.4461 Explore P: 0.0101
Episode: 963 Total reward: 99.0 Training loss: 0.0508 Explore P: 0.0101
Episode: 966 Total reward: 99.0 Training loss: 0.0792 Explore P: 0.0101
Episode: 969 Total reward: 99.0 Training loss: 0.0754 Explore P: 0.0101
Episode: 972 Total reward: 99.0 Training loss: 0.0655 Explore P: 0.0101
Episode: 975 Total reward: 99.0 Training loss: 0.0686 Explore P: 0.0101
Episode: 978 Total reward: 99.0 Training loss: 0.0361 Explore P: 0.0101
Episode: 981 Total reward: 99.0 Training loss: 0.1777 Explore P: 0.0101
Episode: 984 Total reward: 99.0 Training loss: 0.0633 Explore P: 0.0101
Episode: 987 Total reward: 99.0 Training loss: 0.0559 Explore P: 0.0101
Episode: 990 Total reward: 99.0 Training loss: 0.0543 Explore P: 0.0101
Episode: 993 Total reward: 99.0 Training loss: 0.0833 Explore P: 0.0101
Episode: 996 Total reward: 99.0 Training loss: 0.1037 Explore P: 0.0101
Episode: 997 Total reward: 45.0 Training loss: 0.0619 Explore P: 0.0101


  File "<ipython-input-15-56557d1e4a59>", line 1
    Episode: 1 Total reward: 13.0 Training loss: 1.0202 Explore P: 0.9987
                   ^
SyntaxError: invalid syntax

Visualizing training

Below we plot the total rewards for each episode. The rolling average is plotted in blue.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [ ]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')

In [ ]:
Text(0,0.5,'Total Reward')

Playing Atari Games

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.


In [ ]: