Deep Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.


In [1]:
import gym
import tensorflow as tf
import numpy as np

Note: Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included gym as a submodule, so you can run git submodule --init --recursive to pull the contents into the gym repo.


In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


[2017-06-19 14:59:55,263] Making new env: CartPole-v0

We interact with the simulation through env. To show the simulation running, you can use env.render() to render one frame. Passing in an action as an integer to env.step will generate the next step in the simulation. You can see how many actions are possible from env.action_space and to get a random action you can use env.action_space.sample(). This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.


In [3]:
env.reset()     # why here?  clears old envrionment?  
rewards = []    # why here? we will be printing rewards later

for _ in range(100):
    
    # env.render()  # we want a human friendly rendering
    
    # why here? we want to actually step through our enviornment 
    
    state, reward, done, info = env.step(env.action_space.sample()) # take a random action
    
    # for tracking later
    rewards.append(reward)
    
    if done:
        rewards = []
        env.reset()

To shut the window showing the simulation, use env.close().


In [4]:
env.close()

If you ran the simulation above, we can look at the rewards:


In [5]:
print(rewards[-20:])


[]

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

Q-Network

We train our Q-learning agent using the Bellman Equation:

$$ Q(s, a) = r + \gamma \max{Q(s', a')} $$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-table. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.


In [6]:
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        
        # state inputs to the Q-network
        with tf.variable_scope(name):
            
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.


In [7]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an $\epsilon$-greedy policy.

At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called exploitation. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in episodes. One episode is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode = 1, $M$ do
    • For $t$, $T$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor

Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.


In [8]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number experiences to pretrain the memory

In [9]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.


In [10]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent. If you want to watch it train, uncomment the env.render() line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.


In [11]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []

with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    for ep in range(1, train_episodes):
        
        total_reward = 0
        t = 0
        
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
    saver.save(sess, "checkpoints/cartpole.ckpt")


Episode: 1 Total reward: 10.0 Training loss: 1.1024 Explore P: 0.9990
Episode: 2 Total reward: 21.0 Training loss: 1.0602 Explore P: 0.9969
Episode: 3 Total reward: 41.0 Training loss: 1.0829 Explore P: 0.9929
Episode: 4 Total reward: 20.0 Training loss: 1.1369 Explore P: 0.9909
Episode: 5 Total reward: 22.0 Training loss: 1.1455 Explore P: 0.9888
Episode: 6 Total reward: 25.0 Training loss: 1.1369 Explore P: 0.9863
Episode: 7 Total reward: 10.0 Training loss: 1.1096 Explore P: 0.9854
Episode: 8 Total reward: 44.0 Training loss: 1.1473 Explore P: 0.9811
Episode: 9 Total reward: 16.0 Training loss: 1.1786 Explore P: 0.9795
Episode: 10 Total reward: 24.0 Training loss: 1.2535 Explore P: 0.9772
Episode: 11 Total reward: 24.0 Training loss: 1.2942 Explore P: 0.9749
Episode: 12 Total reward: 15.0 Training loss: 1.1838 Explore P: 0.9734
Episode: 13 Total reward: 17.0 Training loss: 1.6232 Explore P: 0.9718
Episode: 14 Total reward: 28.0 Training loss: 1.2788 Explore P: 0.9691
Episode: 15 Total reward: 20.0 Training loss: 1.8001 Explore P: 0.9672
Episode: 16 Total reward: 15.0 Training loss: 2.7392 Explore P: 0.9658
Episode: 17 Total reward: 35.0 Training loss: 2.0025 Explore P: 0.9624
Episode: 18 Total reward: 20.0 Training loss: 1.6929 Explore P: 0.9605
Episode: 19 Total reward: 21.0 Training loss: 1.8851 Explore P: 0.9585
Episode: 20 Total reward: 11.0 Training loss: 3.1705 Explore P: 0.9575
Episode: 21 Total reward: 39.0 Training loss: 3.5999 Explore P: 0.9538
Episode: 22 Total reward: 14.0 Training loss: 4.7904 Explore P: 0.9525
Episode: 23 Total reward: 44.0 Training loss: 2.8546 Explore P: 0.9483
Episode: 24 Total reward: 11.0 Training loss: 2.6049 Explore P: 0.9473
Episode: 25 Total reward: 53.0 Training loss: 11.8686 Explore P: 0.9423
Episode: 26 Total reward: 14.0 Training loss: 3.3772 Explore P: 0.9410
Episode: 27 Total reward: 12.0 Training loss: 2.8050 Explore P: 0.9399
Episode: 28 Total reward: 18.0 Training loss: 2.8800 Explore P: 0.9383
Episode: 29 Total reward: 11.0 Training loss: 13.2966 Explore P: 0.9372
Episode: 30 Total reward: 18.0 Training loss: 3.8167 Explore P: 0.9356
Episode: 31 Total reward: 14.0 Training loss: 5.1149 Explore P: 0.9343
Episode: 32 Total reward: 11.0 Training loss: 3.2147 Explore P: 0.9333
Episode: 33 Total reward: 13.0 Training loss: 3.0740 Explore P: 0.9321
Episode: 34 Total reward: 16.0 Training loss: 6.1068 Explore P: 0.9306
Episode: 35 Total reward: 17.0 Training loss: 8.2085 Explore P: 0.9290
Episode: 36 Total reward: 13.0 Training loss: 4.7447 Explore P: 0.9278
Episode: 37 Total reward: 20.0 Training loss: 17.1304 Explore P: 0.9260
Episode: 38 Total reward: 13.0 Training loss: 3.0266 Explore P: 0.9248
Episode: 39 Total reward: 16.0 Training loss: 3.5157 Explore P: 0.9233
Episode: 40 Total reward: 30.0 Training loss: 3.6974 Explore P: 0.9206
Episode: 41 Total reward: 24.0 Training loss: 13.7533 Explore P: 0.9184
Episode: 42 Total reward: 9.0 Training loss: 22.1939 Explore P: 0.9176
Episode: 43 Total reward: 18.0 Training loss: 2.5760 Explore P: 0.9160
Episode: 44 Total reward: 12.0 Training loss: 6.7871 Explore P: 0.9149
Episode: 45 Total reward: 13.0 Training loss: 20.9429 Explore P: 0.9137
Episode: 46 Total reward: 10.0 Training loss: 9.4136 Explore P: 0.9128
Episode: 47 Total reward: 51.0 Training loss: 3.8522 Explore P: 0.9082
Episode: 48 Total reward: 28.0 Training loss: 6.4586 Explore P: 0.9057
Episode: 49 Total reward: 19.0 Training loss: 6.1901 Explore P: 0.9040
Episode: 50 Total reward: 11.0 Training loss: 4.7896 Explore P: 0.9030
Episode: 51 Total reward: 14.0 Training loss: 10.8223 Explore P: 0.9018
Episode: 52 Total reward: 24.0 Training loss: 7.8964 Explore P: 0.8996
Episode: 53 Total reward: 14.0 Training loss: 9.1210 Explore P: 0.8984
Episode: 54 Total reward: 9.0 Training loss: 3.8915 Explore P: 0.8976
Episode: 55 Total reward: 20.0 Training loss: 14.3006 Explore P: 0.8958
Episode: 56 Total reward: 33.0 Training loss: 7.1268 Explore P: 0.8929
Episode: 57 Total reward: 13.0 Training loss: 22.3194 Explore P: 0.8917
Episode: 58 Total reward: 19.0 Training loss: 3.8429 Explore P: 0.8901
Episode: 59 Total reward: 9.0 Training loss: 19.7990 Explore P: 0.8893
Episode: 60 Total reward: 21.0 Training loss: 2.6535 Explore P: 0.8874
Episode: 61 Total reward: 24.0 Training loss: 15.5054 Explore P: 0.8853
Episode: 62 Total reward: 8.0 Training loss: 23.4269 Explore P: 0.8846
Episode: 63 Total reward: 30.0 Training loss: 24.1529 Explore P: 0.8820
Episode: 64 Total reward: 10.0 Training loss: 4.5448 Explore P: 0.8811
Episode: 65 Total reward: 13.0 Training loss: 4.6956 Explore P: 0.8800
Episode: 66 Total reward: 13.0 Training loss: 14.5483 Explore P: 0.8789
Episode: 67 Total reward: 25.0 Training loss: 43.1420 Explore P: 0.8767
Episode: 68 Total reward: 13.0 Training loss: 6.0239 Explore P: 0.8756
Episode: 69 Total reward: 16.0 Training loss: 24.6684 Explore P: 0.8742
Episode: 70 Total reward: 13.0 Training loss: 16.0454 Explore P: 0.8731
Episode: 71 Total reward: 18.0 Training loss: 22.3778 Explore P: 0.8715
Episode: 72 Total reward: 25.0 Training loss: 3.5069 Explore P: 0.8694
Episode: 73 Total reward: 8.0 Training loss: 4.2724 Explore P: 0.8687
Episode: 74 Total reward: 22.0 Training loss: 37.4296 Explore P: 0.8668
Episode: 75 Total reward: 15.0 Training loss: 26.3430 Explore P: 0.8655
Episode: 76 Total reward: 27.0 Training loss: 15.3141 Explore P: 0.8632
Episode: 77 Total reward: 37.0 Training loss: 4.5207 Explore P: 0.8601
Episode: 78 Total reward: 17.0 Training loss: 4.2251 Explore P: 0.8586
Episode: 79 Total reward: 21.0 Training loss: 29.6209 Explore P: 0.8568
Episode: 80 Total reward: 14.0 Training loss: 38.7331 Explore P: 0.8556
Episode: 81 Total reward: 19.0 Training loss: 53.1910 Explore P: 0.8540
Episode: 82 Total reward: 21.0 Training loss: 28.8065 Explore P: 0.8523
Episode: 83 Total reward: 15.0 Training loss: 5.5038 Explore P: 0.8510
Episode: 84 Total reward: 9.0 Training loss: 4.0584 Explore P: 0.8503
Episode: 85 Total reward: 13.0 Training loss: 20.1332 Explore P: 0.8492
Episode: 86 Total reward: 11.0 Training loss: 3.6718 Explore P: 0.8482
Episode: 87 Total reward: 9.0 Training loss: 5.4187 Explore P: 0.8475
Episode: 88 Total reward: 28.0 Training loss: 83.4481 Explore P: 0.8451
Episode: 89 Total reward: 15.0 Training loss: 4.1268 Explore P: 0.8439
Episode: 90 Total reward: 18.0 Training loss: 5.3806 Explore P: 0.8424
Episode: 91 Total reward: 16.0 Training loss: 17.9577 Explore P: 0.8411
Episode: 92 Total reward: 14.0 Training loss: 4.5066 Explore P: 0.8399
Episode: 93 Total reward: 12.0 Training loss: 4.9682 Explore P: 0.8389
Episode: 94 Total reward: 9.0 Training loss: 40.0417 Explore P: 0.8382
Episode: 95 Total reward: 9.0 Training loss: 4.4668 Explore P: 0.8374
Episode: 96 Total reward: 12.0 Training loss: 4.8892 Explore P: 0.8364
Episode: 97 Total reward: 11.0 Training loss: 17.1827 Explore P: 0.8355
Episode: 98 Total reward: 12.0 Training loss: 33.3216 Explore P: 0.8345
Episode: 99 Total reward: 23.0 Training loss: 34.0184 Explore P: 0.8326
Episode: 100 Total reward: 8.0 Training loss: 6.3348 Explore P: 0.8320
Episode: 101 Total reward: 14.0 Training loss: 93.3386 Explore P: 0.8308
Episode: 102 Total reward: 22.0 Training loss: 52.7266 Explore P: 0.8290
Episode: 103 Total reward: 32.0 Training loss: 56.7847 Explore P: 0.8264
Episode: 104 Total reward: 23.0 Training loss: 85.0113 Explore P: 0.8245
Episode: 105 Total reward: 11.0 Training loss: 21.6790 Explore P: 0.8236
Episode: 106 Total reward: 15.0 Training loss: 29.9756 Explore P: 0.8224
Episode: 107 Total reward: 12.0 Training loss: 4.7847 Explore P: 0.8214
Episode: 108 Total reward: 14.0 Training loss: 5.3462 Explore P: 0.8203
Episode: 109 Total reward: 20.0 Training loss: 4.7484 Explore P: 0.8187
Episode: 110 Total reward: 21.0 Training loss: 61.9813 Explore P: 0.8170
Episode: 111 Total reward: 18.0 Training loss: 3.9933 Explore P: 0.8155
Episode: 112 Total reward: 13.0 Training loss: 55.5804 Explore P: 0.8145
Episode: 113 Total reward: 12.0 Training loss: 6.5545 Explore P: 0.8135
Episode: 114 Total reward: 15.0 Training loss: 76.5954 Explore P: 0.8123
Episode: 115 Total reward: 14.0 Training loss: 4.7526 Explore P: 0.8112
Episode: 116 Total reward: 22.0 Training loss: 6.0905 Explore P: 0.8094
Episode: 117 Total reward: 9.0 Training loss: 25.7822 Explore P: 0.8087
Episode: 118 Total reward: 11.0 Training loss: 44.4646 Explore P: 0.8078
Episode: 119 Total reward: 38.0 Training loss: 5.3757 Explore P: 0.8048
Episode: 120 Total reward: 10.0 Training loss: 34.0901 Explore P: 0.8040
Episode: 121 Total reward: 22.0 Training loss: 37.9052 Explore P: 0.8023
Episode: 122 Total reward: 15.0 Training loss: 6.5145 Explore P: 0.8011
Episode: 123 Total reward: 16.0 Training loss: 26.7042 Explore P: 0.7998
Episode: 124 Total reward: 18.0 Training loss: 28.1626 Explore P: 0.7984
Episode: 125 Total reward: 14.0 Training loss: 128.4942 Explore P: 0.7973
Episode: 126 Total reward: 14.0 Training loss: 31.4793 Explore P: 0.7962
Episode: 127 Total reward: 13.0 Training loss: 69.1994 Explore P: 0.7952
Episode: 128 Total reward: 24.0 Training loss: 5.0524 Explore P: 0.7933
Episode: 129 Total reward: 16.0 Training loss: 36.1259 Explore P: 0.7920
Episode: 130 Total reward: 16.0 Training loss: 57.8649 Explore P: 0.7908
Episode: 131 Total reward: 22.0 Training loss: 77.7857 Explore P: 0.7891
Episode: 132 Total reward: 9.0 Training loss: 3.7643 Explore P: 0.7884
Episode: 133 Total reward: 11.0 Training loss: 109.9018 Explore P: 0.7875
Episode: 134 Total reward: 16.0 Training loss: 3.1673 Explore P: 0.7863
Episode: 135 Total reward: 8.0 Training loss: 37.5932 Explore P: 0.7857
Episode: 136 Total reward: 13.0 Training loss: 41.4697 Explore P: 0.7846
Episode: 137 Total reward: 30.0 Training loss: 24.8307 Explore P: 0.7823
Episode: 138 Total reward: 13.0 Training loss: 3.9446 Explore P: 0.7813
Episode: 139 Total reward: 10.0 Training loss: 58.6161 Explore P: 0.7806
Episode: 140 Total reward: 24.0 Training loss: 4.0324 Explore P: 0.7787
Episode: 141 Total reward: 8.0 Training loss: 4.3913 Explore P: 0.7781
Episode: 142 Total reward: 42.0 Training loss: 3.3131 Explore P: 0.7749
Episode: 143 Total reward: 11.0 Training loss: 32.5489 Explore P: 0.7740
Episode: 144 Total reward: 8.0 Training loss: 28.0829 Explore P: 0.7734
Episode: 145 Total reward: 16.0 Training loss: 3.9895 Explore P: 0.7722
Episode: 146 Total reward: 8.0 Training loss: 4.2295 Explore P: 0.7716
Episode: 147 Total reward: 45.0 Training loss: 70.5500 Explore P: 0.7682
Episode: 148 Total reward: 15.0 Training loss: 5.3503 Explore P: 0.7670
Episode: 149 Total reward: 31.0 Training loss: 92.1883 Explore P: 0.7647
Episode: 150 Total reward: 26.0 Training loss: 3.9927 Explore P: 0.7627
Episode: 151 Total reward: 33.0 Training loss: 39.2118 Explore P: 0.7602
Episode: 152 Total reward: 12.0 Training loss: 3.9938 Explore P: 0.7593
Episode: 153 Total reward: 13.0 Training loss: 4.4191 Explore P: 0.7584
Episode: 154 Total reward: 20.0 Training loss: 4.1355 Explore P: 0.7569
Episode: 155 Total reward: 16.0 Training loss: 30.4475 Explore P: 0.7557
Episode: 156 Total reward: 31.0 Training loss: 32.8752 Explore P: 0.7534
Episode: 157 Total reward: 24.0 Training loss: 64.6020 Explore P: 0.7516
Episode: 158 Total reward: 19.0 Training loss: 27.0855 Explore P: 0.7502
Episode: 159 Total reward: 10.0 Training loss: 41.0473 Explore P: 0.7494
Episode: 160 Total reward: 54.0 Training loss: 100.3150 Explore P: 0.7455
Episode: 161 Total reward: 9.0 Training loss: 51.8533 Explore P: 0.7448
Episode: 162 Total reward: 23.0 Training loss: 4.4842 Explore P: 0.7431
Episode: 163 Total reward: 33.0 Training loss: 109.5781 Explore P: 0.7407
Episode: 164 Total reward: 10.0 Training loss: 48.6811 Explore P: 0.7400
Episode: 165 Total reward: 24.0 Training loss: 76.7075 Explore P: 0.7382
Episode: 166 Total reward: 14.0 Training loss: 40.9150 Explore P: 0.7372
Episode: 167 Total reward: 41.0 Training loss: 3.7042 Explore P: 0.7342
Episode: 168 Total reward: 12.0 Training loss: 105.5079 Explore P: 0.7334
Episode: 169 Total reward: 10.0 Training loss: 33.2934 Explore P: 0.7326
Episode: 170 Total reward: 18.0 Training loss: 70.5803 Explore P: 0.7313
Episode: 171 Total reward: 24.0 Training loss: 2.5819 Explore P: 0.7296
Episode: 172 Total reward: 13.0 Training loss: 53.4601 Explore P: 0.7287
Episode: 173 Total reward: 15.0 Training loss: 53.9033 Explore P: 0.7276
Episode: 174 Total reward: 25.0 Training loss: 31.1533 Explore P: 0.7258
Episode: 175 Total reward: 21.0 Training loss: 2.1272 Explore P: 0.7243
Episode: 176 Total reward: 15.0 Training loss: 33.0007 Explore P: 0.7232
Episode: 177 Total reward: 18.0 Training loss: 33.8199 Explore P: 0.7219
Episode: 178 Total reward: 15.0 Training loss: 2.2021 Explore P: 0.7209
Episode: 179 Total reward: 33.0 Training loss: 38.3929 Explore P: 0.7185
Episode: 180 Total reward: 12.0 Training loss: 31.5653 Explore P: 0.7177
Episode: 181 Total reward: 11.0 Training loss: 35.5702 Explore P: 0.7169
Episode: 182 Total reward: 27.0 Training loss: 2.4636 Explore P: 0.7150
Episode: 183 Total reward: 18.0 Training loss: 31.8028 Explore P: 0.7137
Episode: 184 Total reward: 18.0 Training loss: 28.6630 Explore P: 0.7125
Episode: 185 Total reward: 30.0 Training loss: 1.2184 Explore P: 0.7104
Episode: 186 Total reward: 10.0 Training loss: 31.5997 Explore P: 0.7097
Episode: 187 Total reward: 16.0 Training loss: 1.8943 Explore P: 0.7085
Episode: 188 Total reward: 14.0 Training loss: 1.2977 Explore P: 0.7076
Episode: 189 Total reward: 10.0 Training loss: 25.7568 Explore P: 0.7069
Episode: 190 Total reward: 14.0 Training loss: 126.8777 Explore P: 0.7059
Episode: 191 Total reward: 9.0 Training loss: 30.7716 Explore P: 0.7053
Episode: 192 Total reward: 33.0 Training loss: 1.1063 Explore P: 0.7030
Episode: 193 Total reward: 8.0 Training loss: 30.1312 Explore P: 0.7024
Episode: 194 Total reward: 13.0 Training loss: 1.2481 Explore P: 0.7015
Episode: 195 Total reward: 9.0 Training loss: 104.2520 Explore P: 0.7009
Episode: 196 Total reward: 13.0 Training loss: 59.1085 Explore P: 0.7000
Episode: 197 Total reward: 44.0 Training loss: 24.8856 Explore P: 0.6970
Episode: 198 Total reward: 11.0 Training loss: 26.5014 Explore P: 0.6962
Episode: 199 Total reward: 15.0 Training loss: 32.7850 Explore P: 0.6952
Episode: 200 Total reward: 36.0 Training loss: 0.7367 Explore P: 0.6927
Episode: 201 Total reward: 12.0 Training loss: 22.5053 Explore P: 0.6919
Episode: 202 Total reward: 33.0 Training loss: 25.9778 Explore P: 0.6897
Episode: 203 Total reward: 17.0 Training loss: 23.8412 Explore P: 0.6885
Episode: 204 Total reward: 25.0 Training loss: 46.6264 Explore P: 0.6868
Episode: 205 Total reward: 10.0 Training loss: 48.4243 Explore P: 0.6861
Episode: 206 Total reward: 10.0 Training loss: 1.3329 Explore P: 0.6855
Episode: 207 Total reward: 20.0 Training loss: 46.2805 Explore P: 0.6841
Episode: 208 Total reward: 9.0 Training loss: 0.9534 Explore P: 0.6835
Episode: 209 Total reward: 23.0 Training loss: 22.7690 Explore P: 0.6820
Episode: 210 Total reward: 26.0 Training loss: 23.1922 Explore P: 0.6802
Episode: 211 Total reward: 14.0 Training loss: 0.8308 Explore P: 0.6793
Episode: 212 Total reward: 9.0 Training loss: 22.3824 Explore P: 0.6787
Episode: 213 Total reward: 9.0 Training loss: 1.0817 Explore P: 0.6781
Episode: 214 Total reward: 36.0 Training loss: 0.8707 Explore P: 0.6757
Episode: 215 Total reward: 22.0 Training loss: 21.5373 Explore P: 0.6742
Episode: 216 Total reward: 16.0 Training loss: 39.0810 Explore P: 0.6732
Episode: 217 Total reward: 15.0 Training loss: 1.0769 Explore P: 0.6722
Episode: 218 Total reward: 14.0 Training loss: 18.8047 Explore P: 0.6712
Episode: 219 Total reward: 12.0 Training loss: 19.6569 Explore P: 0.6704
Episode: 220 Total reward: 16.0 Training loss: 20.0725 Explore P: 0.6694
Episode: 221 Total reward: 12.0 Training loss: 24.3643 Explore P: 0.6686
Episode: 222 Total reward: 13.0 Training loss: 18.8074 Explore P: 0.6677
Episode: 223 Total reward: 15.0 Training loss: 20.1431 Explore P: 0.6668
Episode: 224 Total reward: 21.0 Training loss: 19.6551 Explore P: 0.6654
Episode: 225 Total reward: 12.0 Training loss: 36.4565 Explore P: 0.6646
Episode: 226 Total reward: 16.0 Training loss: 17.6369 Explore P: 0.6635
Episode: 227 Total reward: 22.0 Training loss: 38.1096 Explore P: 0.6621
Episode: 228 Total reward: 15.0 Training loss: 17.3859 Explore P: 0.6611
Episode: 229 Total reward: 18.0 Training loss: 0.8959 Explore P: 0.6600
Episode: 230 Total reward: 18.0 Training loss: 34.9386 Explore P: 0.6588
Episode: 231 Total reward: 19.0 Training loss: 36.1295 Explore P: 0.6576
Episode: 232 Total reward: 10.0 Training loss: 54.8070 Explore P: 0.6569
Episode: 233 Total reward: 19.0 Training loss: 17.3346 Explore P: 0.6557
Episode: 234 Total reward: 25.0 Training loss: 20.5150 Explore P: 0.6541
Episode: 235 Total reward: 14.0 Training loss: 52.1697 Explore P: 0.6532
Episode: 236 Total reward: 37.0 Training loss: 58.2140 Explore P: 0.6508
Episode: 237 Total reward: 9.0 Training loss: 15.8112 Explore P: 0.6502
Episode: 238 Total reward: 15.0 Training loss: 0.8926 Explore P: 0.6493
Episode: 239 Total reward: 13.0 Training loss: 14.0103 Explore P: 0.6484
Episode: 240 Total reward: 73.0 Training loss: 34.1404 Explore P: 0.6438
Episode: 241 Total reward: 17.0 Training loss: 34.6027 Explore P: 0.6427
Episode: 242 Total reward: 9.0 Training loss: 18.3989 Explore P: 0.6421
Episode: 243 Total reward: 13.0 Training loss: 0.8023 Explore P: 0.6413
Episode: 244 Total reward: 59.0 Training loss: 14.5257 Explore P: 0.6376
Episode: 245 Total reward: 61.0 Training loss: 0.8964 Explore P: 0.6338
Episode: 246 Total reward: 21.0 Training loss: 1.1927 Explore P: 0.6325
Episode: 247 Total reward: 22.0 Training loss: 17.4665 Explore P: 0.6311
Episode: 248 Total reward: 41.0 Training loss: 30.8306 Explore P: 0.6286
Episode: 249 Total reward: 22.0 Training loss: 0.8766 Explore P: 0.6272
Episode: 250 Total reward: 22.0 Training loss: 45.9915 Explore P: 0.6259
Episode: 251 Total reward: 24.0 Training loss: 15.3412 Explore P: 0.6244
Episode: 252 Total reward: 52.0 Training loss: 0.9826 Explore P: 0.6212
Episode: 253 Total reward: 65.0 Training loss: 0.9810 Explore P: 0.6172
Episode: 254 Total reward: 13.0 Training loss: 14.1935 Explore P: 0.6164
Episode: 255 Total reward: 10.0 Training loss: 0.7041 Explore P: 0.6158
Episode: 256 Total reward: 18.0 Training loss: 21.6431 Explore P: 0.6147
Episode: 257 Total reward: 11.0 Training loss: 0.9316 Explore P: 0.6141
Episode: 258 Total reward: 10.0 Training loss: 1.1237 Explore P: 0.6135
Episode: 259 Total reward: 17.0 Training loss: 28.3753 Explore P: 0.6125
Episode: 260 Total reward: 18.0 Training loss: 11.4056 Explore P: 0.6114
Episode: 261 Total reward: 16.0 Training loss: 1.0382 Explore P: 0.6104
Episode: 262 Total reward: 8.0 Training loss: 24.7943 Explore P: 0.6099
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-26e7f18540c0> in <module>()
     17             step += 1
     18             # Uncomment this next line to watch the training
---> 19             env.render()
     20 
     21             # Explore or Exploit

C:\anthony-ide\Anaconda3\lib\site-packages\gym\core.py in render(self, mode, close)
    151             elif mode not in modes:
    152                 raise error.UnsupportedMode('Unsupported rendering mode: {}. (Supported modes for {}: {})'.format(mode, self, modes))
--> 153         return self._render(mode=mode, close=close)
    154 
    155     def close(self):

C:\anthony-ide\Anaconda3\lib\site-packages\gym\core.py in _render(self, mode, close)
    283 
    284     def _render(self, mode='human', close=False):
--> 285         return self.env.render(mode, close)
    286 
    287     def _close(self):

C:\anthony-ide\Anaconda3\lib\site-packages\gym\core.py in render(self, mode, close)
    151             elif mode not in modes:
    152                 raise error.UnsupportedMode('Unsupported rendering mode: {}. (Supported modes for {}: {})'.format(mode, self, modes))
--> 153         return self._render(mode=mode, close=close)
    154 
    155     def close(self):

C:\anthony-ide\Anaconda3\lib\site-packages\gym\envs\classic_control\cartpole.py in _render(self, mode, close)
    143         self.poletrans.set_rotation(-x[2])
    144 
--> 145         return self.viewer.render(return_rgb_array = mode=='rgb_array')

C:\anthony-ide\Anaconda3\lib\site-packages\gym\envs\classic_control\rendering.py in render(self, return_rgb_array)
    102             arr = arr.reshape(buffer.height, buffer.width, 4)
    103             arr = arr[::-1,:,0:3]
--> 104         self.window.flip()
    105         self.onetime_geoms = []
    106         return arr

C:\anthony-ide\Anaconda3\lib\site-packages\pyglet\window\win32\__init__.py in flip(self)
    309     def flip(self):
    310         self.draw_mouse_cursor()
--> 311         self.context.flip()
    312 
    313     def set_location(self, x, y):

AttributeError: 'NoneType' object has no attribute 'flip'

Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.


In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [15]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[15]:
<matplotlib.text.Text at 0x261501fa320>

In [181]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')


Out[181]:
<matplotlib.text.Text at 0x125c136d8>

Testing

Let's checkout how our trained agent plays the game.


In [16]:
test_episodes = 10
test_max_steps = 400
env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    for ep in range(1, test_episodes):
        t = 0
        while t < test_max_steps:
            env.render() 
            
            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                state = next_state
                t += 1


INFO:tensorflow:Restoring parameters from checkpoints\cartpole.ckpt
[2017-06-19 14:57:03,184] Restoring parameters from checkpoints\cartpole.ckpt

In [18]:
env.close()

Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.