This notebook demonstrates using reinforcement learning to train an agent to play Pong.
The first step is to create an
Environment that implements this task. Fortunately,
OpenAI Gym already provides an implementation of Pong (and many other tasks appropriate
for reinforcement learning). DeepChem's
GymEnvironment class provides an easy way to
use environments from OpenAI Gym. We could just use it directly, but in this case we
subclass it and preprocess the screen image a little bit to make learning easier.
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment. To install gym, you should also run pip install 'gym[atari]' (we need the extra modifier since we'll be using an Atari game). We'll add this command onto our usual Colab installation commands for you.
In :
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')
In :
!pip install 'gym[atari]'
In :
import deepchem as dc
import numpy as np

class PongEnv(dc.rl.GymEnvironment):
    def __init__(self):
        super(PongEnv, self).__init__('Pong-v0')
        self._state_shape = (80, 80)

    @property
    def state(self):
        # Crop everything outside the play area, reduce the image size,
        # and convert it to black and white.
        cropped = np.array(self._state)[34:194, :, :]
        reduced = cropped[0:-1:2, 0:-1:2]
        grayscale = np.sum(reduced, axis=2)
        bw = np.zeros(grayscale.shape)
        bw[grayscale != 233] = 1
        return bw

    def __deepcopy__(self, memo):
        return PongEnv()

env = PongEnv()
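To see what this preprocessing does, we can trace the shapes step by step on a dummy frame. This is just an illustration, not part of the tutorial's pipeline: 210x160x3 is the standard Atari screen size, and 233 is the sum of Pong's background color across the three channels.

In :
import numpy as np

# A dummy Atari frame: 210 x 160 pixels with 3 color channels.
frame = np.random.randint(0, 256, size=(210, 160, 3))

cropped = frame[34:194, :, :]        # keep only the play area: (160, 160, 3)
reduced = cropped[0:-1:2, 0:-1:2]    # take every other row and column: (80, 80, 3)
grayscale = np.sum(reduced, axis=2)  # sum the color channels: (80, 80)
bw = np.zeros(grayscale.shape)
bw[grayscale != 233] = 1             # anything that isn't background becomes 1

print(cropped.shape, reduced.shape, bw.shape)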
Next we create a network to implement the policy. We begin with two convolutional layers to process the image. That is followed by a dense (fully connected) layer to provide plenty of capacity for game logic. We also add a small Gated Recurrent Unit. That gives the network a little bit of memory, so it can keep track of which way the ball is moving.
We concatenate the dense and GRU outputs together, and use them as inputs to two final layers that serve as the network's outputs. One computes the action probabilities, and the other computes an estimate of the state value function.
We also provide an input for the initial state of the GRU, and return its final state at the end. This is required by the learning algorithm.
In :
import tensorflow as tf
from tensorflow.keras.layers import Input, Concatenate, Conv2D, Dense, Flatten, GRU, Reshape

class PongPolicy(dc.rl.Policy):
    def __init__(self):
        super(PongPolicy, self).__init__(['action_prob', 'value', 'rnn_state'], [np.zeros(16)])

    def create_model(self, **kwargs):
        state = Input(shape=(80, 80))
        rnn_state = Input(shape=(16,))
        conv1 = Conv2D(16, kernel_size=8, strides=4, activation=tf.nn.relu)(Reshape((80, 80, 1))(state))
        conv2 = Conv2D(32, kernel_size=4, strides=2, activation=tf.nn.relu)(conv1)
        dense = Dense(256, activation=tf.nn.relu)(Flatten()(conv2))
        gru, rnn_final_state = GRU(16, return_state=True, return_sequences=True)(
            Reshape((-1, 256))(dense), initial_state=rnn_state)
        concat = Concatenate()([dense, Reshape((16,))(gru)])
        action_prob = Dense(env.n_actions, activation=tf.nn.softmax)(concat)
        value = Dense(1)(concat)
        return tf.keras.Model(inputs=[state, rnn_state], outputs=[action_prob, value, rnn_final_state])

policy = PongPolicy()
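As a quick sanity check (our own addition, not from the original tutorial), we can build the model and push a single dummy state through it to confirm the output shapes. The batch size of 1 here is arbitrary.

In :
model = policy.create_model()

dummy_state = np.zeros((1, 80, 80))  # one preprocessed frame
dummy_rnn = np.zeros((1, 16))        # initial GRU state
action_prob, value, rnn_final_state = model.predict([dummy_state, dummy_rnn])

# Expect (1, env.n_actions), (1, 1), and (1, 16) respectively.
print(action_prob.shape, value.shape, rnn_final_state.shape)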
We will optimize the policy using the Asynchronous Advantage Actor Critic (A3C) algorithm. There are lots of hyperparameters we could specify at this point, but the default values for most of them work well on this problem. The only one we need to customize is the learning rate.
In :
from deepchem.models.optimizers import Adam

a3c = dc.rl.A3C(env, policy, model_dir='model', optimizer=Adam(learning_rate=0.0002))
Optimize for as long as you have patience to. By 1 million steps you should see clear signs of learning. Around 3 million steps it should start to occasionally beat the game's built-in AI. By 7 million steps it should be winning almost every time. Running on my laptop, training takes about 20 minutes for every million steps.
In :
# Change this to train as many steps as you have patience for.
a3c.fit(1000)
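You don't have to do all of that training in one sitting. A3C writes checkpoints to model_dir, so you can train in chunks. The sketch below assumes the restore argument of fit() in DeepChem 2.3, which reloads the most recent checkpoint before continuing.

In :
# Train in chunks of 100,000 steps, resuming from the latest
# checkpoint in model_dir on every call after the first.
for i in range(10):
    a3c.fit(100000, restore=(i > 0))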
Let's watch it play and see how it does!
In :
# This code doesn't work well on Colab
env.reset()
while not env.terminated:
    env.env.render()
    env.step(a3c.select_action(env.state))
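On Colab there is no display for render() to draw to. An alternative sketch (assuming, as in the DeepChem Environment API, that step() returns the reward for each action) is to play a full game headlessly and report the final score. In Pong the reward is +1 for each point we win and -1 for each point the built-in AI wins.

In :
# Play one complete game without rendering and track the score.
env.reset()
total_reward = 0.0
while not env.terminated:
    total_reward += env.step(a3c.select_action(env.state))
print('Total reward:', total_reward)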
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub! This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!