Tutorial Part 18: Using Reinforcement Learning to Play Pong

This notebook demonstrates using reinforcement learning to train an agent to play Pong.

The first step is to create an Environment that implements this task. Fortunately, OpenAI Gym already provides an implementation of Pong (and many other tasks appropriate for reinforcement learning). DeepChem's GymEnvironment class provides an easy way to use environments from OpenAI Gym. We could just use it directly, but in this case we subclass it and preprocess the screen image a little bit to make learning easier.


This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.


To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment. To install gym you should also use pip install 'gym[atari]' (We need the extra modifier since we'll be using an atari game). We'll add this command onto our usual Colab installation commands for you

In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')

TensorFlow 1.x selected.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2814  100  2814    0     0  35620      0 --:--:-- --:--:-- --:--:-- 35175
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
installing miniconda to /root/miniconda
installing deepchem
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

deepchem-2.3.0 installation finished!
CPU times: user 2.69 s, sys: 598 ms, total: 3.28 s
Wall time: 3min 48s

In [2]:
!pip install 'gym[atari]'

Requirement already satisfied: gym[atari] in /usr/local/lib/python3.6/dist-packages (0.17.2)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (1.18.4)
Requirement already satisfied: cloudpickle<1.4.0,>=1.2.0 in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (1.3.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (1.4.1)
Requirement already satisfied: pyglet<=1.5.0,>=1.4.0 in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (1.5.0)
Requirement already satisfied: Pillow; extra == "atari" in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (7.0.0)
Requirement already satisfied: atari-py~=0.2.0; extra == "atari" in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (0.2.6)
Requirement already satisfied: opencv-python; extra == "atari" in /usr/local/lib/python3.6/dist-packages (from gym[atari]) (
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from pyglet<=1.5.0,>=1.4.0->gym[atari]) (0.16.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from atari-py~=0.2.0; extra == "atari"->gym[atari]) (1.12.0)

In [0]:
import deepchem as dc
import numpy as np

class PongEnv(dc.rl.GymEnvironment):
  def __init__(self):
    super(PongEnv, self).__init__('Pong-v0')
    self._state_shape = (80, 80)
  def state(self):
    # Crop everything outside the play area, reduce the image size,
    # and convert it to black and white.
    cropped = np.array(self._state)[34:194, :, :]
    reduced = cropped[0:-1:2, 0:-1:2]
    grayscale = np.sum(reduced, axis=2)
    bw = np.zeros(grayscale.shape)
    bw[grayscale != 233] = 1
    return bw

  def __deepcopy__(self, memo):
    return PongEnv()

env = PongEnv()

Next we create a network to implement the policy. We begin with two convolutional layers to process the image. That is followed by a dense (fully connected) layer to provide plenty of capacity for game logic. We also add a small Gated Recurrent Unit. That gives the network a little bit of memory, so it can keep track of which way the ball is moving.

We concatenate the dense and GRU outputs together, and use them as inputs to two final layers that serve as the network's outputs. One computes the action probabilities, and the other computes an estimate of the state value function.

We also provide an input for the initial state of the GRU, and returned its final state at the end. This is required by the learning algorithm

In [0]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Concatenate, Conv2D, Dense, Flatten, GRU, Reshape

class PongPolicy(dc.rl.Policy):
    def __init__(self):
        super(PongPolicy, self).__init__(['action_prob', 'value', 'rnn_state'], [np.zeros(16)])

    def create_model(self, **kwargs):
        state = Input(shape=(80, 80))
        rnn_state = Input(shape=(16,))
        conv1 = Conv2D(16, kernel_size=8, strides=4, activation=tf.nn.relu)(Reshape((80, 80, 1))(state))
        conv2 = Conv2D(32, kernel_size=4, strides=2, activation=tf.nn.relu)(conv1)
        dense = Dense(256, activation=tf.nn.relu)(Flatten()(conv2))
        gru, rnn_final_state = GRU(16, return_state=True, return_sequences=True)(
            Reshape((-1, 256))(dense), initial_state=rnn_state)
        concat = Concatenate()([dense, Reshape((16,))(gru)])
        action_prob = Dense(env.n_actions, activation=tf.nn.softmax)(concat)
        value = Dense(1)(concat)
        return tf.keras.Model(inputs=[state, rnn_state], outputs=[action_prob, value, rnn_final_state])

policy = PongPolicy()

We will optimize the policy using the Asynchronous Advantage Actor Critic (A3C) algorithm. There are lots of hyperparameters we could specify at this point, but the default values for most of them work well on this problem. The only one we need to customize is the learning rate.

In [5]:
from deepchem.models.optimizers import Adam
a3c = dc.rl.A3C(env, policy, model_dir='model', optimizer=Adam(learning_rate=0.0002))

WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:169: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/optimizers.py:76: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:258: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:260: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:237: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/rl/a3c.py:32: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Optimize for as long as you have patience to. By 1 million steps you should see clear signs of learning. Around 3 million steps it should start to occasionally beat the game's built in AI. By 7 million steps it should be winning almost every time. Running on my laptop, training takes about 20 minutes for every million steps.

In [6]:
# Change this to train as many steps as you have patience for.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/rl/a3c.py:412: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/rl/a3c.py:253: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

Let's watch it play and see how it does!

In [0]:
# This code doesn't work well on Colab
while not env.terminated:

Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Gitter

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!