Remote functions in Ray should be thought of as functional and side-effect free. Restricting ourselves to remote functions gives us distributed functional programming, which is great for many use cases but is somewhat limited in practice.

Ray extends the dataflow model with actors. An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker.

Suppose we've already started Ray.


In [1]:
import ray
ray.init(num_gpus=2)


Waiting for redis server at 127.0.0.1:13780 to respond...
Waiting for redis server at 127.0.0.1:53685 to respond...
Starting local scheduler with 8 CPUs, 2 GPUs

======================================================================
View the web UI at http://localhost:8899/notebooks/ray_ui94901.ipynb?token=43ce6bf00ee9e01cff70531c6f9913ae83278889a3bca875
======================================================================

Out[1]:
{'local_scheduler_socket_names': ['/tmp/scheduler97556861'],
 'node_ip_address': '127.0.0.1',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store59937875', manager_name='/tmp/plasma_manager40408261', manager_port=51499)],
 'redis_address': '127.0.0.1:13780',
 'webui_url': 'http://localhost:8899/notebooks/ray_ui94901.ipynb?token=43ce6bf00ee9e01cff70531c6f9913ae83278889a3bca875'}

Defining and creating an actor

Consider the following simple example. The ray.remote decorator indicates that instances of the Counter class will be actors.


In [2]:
@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
        
    def increment(self):
        self.value += 1
        return self.value

To actually create an actor, we can instantiate this class by calling Counter.remote().


In [3]:
a1 = Counter.remote()
a2 = Counter.remote()

When an actor is instantiated, the following events happen.

  1. A node in the cluster is chosen and a worker process is created on that node (by the local scheduler on that node) for the purpose of running methods called on the actor.

  2. A Counter object is created on that worker and the Counter constructor is run.
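
In particular, each actor is backed by its own dedicated worker process. As a quick sanity check, here is a minimal sketch (the WorkerID class and use of os.getpid are hypothetical, for illustration only):

import os

@ray.remote
class WorkerID(object):
    def pid(self):
        return os.getpid()

# Each actor runs in its own worker process, so the PIDs differ.
actors = [WorkerID.remote() for _ in range(3)]
print(ray.get([a.pid.remote() for a in actors]))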

Using an actor

We can schedule tasks on the actor by calling its methods.


In [4]:
a1.increment.remote()  # ray.get returns 1
a2.increment.remote()  # ray.get returns 1


Out[4]:
ObjectID(772c834522eca9c2deb148db93e5e1a5112f338b)

When a1.increment.remote() is called, the following events happen.

  1. A task is created.

  2. The driver's local scheduler assigns the task directly to the local scheduler responsible for the actor, bypassing the global scheduler.

  3. An object ID is returned.

We can then call ray.get on the object ID to retrieve the actual value.
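
For example, using a fresh Counter (a minimal sketch):

c = Counter.remote()
ray.get(c.increment.remote())  # Blocks until the task has run; returns 1.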

Similarly, the call to a2.increment.remote() generates a task that is scheduled on the second Counter actor. Since these two tasks run on different actors, they can be executed in parallel (note that only actor methods will be scheduled on actor workers; regular remote functions will not be).

On the other hand, methods called on the same Counter actor are executed serially in the order that they are called. They can thus share state with one another, as shown below.


In [5]:
# Create ten Counter actors.
counters = [Counter.remote() for _ in range(10)]

# Increment each Counter once and get the results. These tasks all happen
# in PARALLEL.
results = ray.get([c.increment.remote() for c in counters])
print(results)  # prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Increment the first Counter five times. These tasks are executed SERIALLY
# and share state. 
results = ray.get([counters[0].increment.remote() for _ in range(5)])
print(results)  # prints [2, 3, 4, 5, 6]


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 3, 4, 5, 6]

A More Interesting Actor Example

A common pattern is to use actors to encapsulate the mutable state managed by an external library or service.

Gym provides an interface to a number of simulated environments for testing and training reinforcement learning agents. These simulators are stateful, and tasks that use these simulators must mutate their state. We can use actors to encapsulate the state of these simulators.


In [6]:
import gym

@ray.remote
class GymEnvironment(object):
    def __init__(self, name):
        self.env = gym.make(name)
        self.env.reset()
        
    def step(self, action):
        return self.env.step(action)
    
    def reset(self):
        self.env.reset()

We can then instantiate an actor and schedule a task on that actor as follows.


In [7]:
pong = GymEnvironment.remote("Pong-v0")
pong.step.remote(0)  # Take action 0 in the simulator.


Out[7]:
ObjectID(0351eea3b2ccae196709cf18a7593ad663d08206)
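
As before, the call returns an object ID, and ray.get retrieves the actual result. Assuming the classic Gym API, step returns a 4-tuple:

# Fetch the simulator's response to the action.
observation, reward, done, info = ray.get(pong.step.remote(0))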

Using GPUs on actors

A common use case is for an actor to contain a neural network. For example, suppose we have imported TensorFlow and have defined a function for constructing a neural net.


In [8]:
import tensorflow as tf

def construct_network():
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y),
                                                  reduction_indices=[1]))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    return x, y_, train_step, accuracy

We can then define an actor for this network as follows.


In [9]:
import os

# Define an actor that runs on GPUs. If there are no GPUs, then simply use
# ray.remote without any arguments and no parentheses.
@ray.remote(num_gpus=1)
class NeuralNetOnGPU(object):
    def __init__(self):
        # Set an environment variable to tell TensorFlow which GPUs to use.
        # Note that this must be done before the call to tf.Session.
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(i) for i in
                                                       ray.get_gpu_ids()])
        with tf.Graph().as_default():
            with tf.device("/gpu:0"):
                self.x, self.y_, self.train_step, self.accuracy = construct_network()
                # Allow this to run on CPUs if there aren't any GPUs.
                config = tf.ConfigProto(allow_soft_placement=True)
                self.sess = tf.Session(config=config)
                # Initialize the network.
                init = tf.global_variables_initializer()
                self.sess.run(init)

To indicate that an actor requires one GPU, we pass in num_gpus=1 to ray.remote. Note that in order for this to work, Ray must have been started with some GPUs, e.g., via ray.init(num_gpus=2). Otherwise, when you try to instantiate the GPU version with NeuralNetOnGPU.remote(), an exception will be thrown saying that there aren’t enough GPUs in the system.
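
For reference, the CPU-only variant mentioned in the code comment above simply drops the resource arguments (NeuralNetOnCPU is a hypothetical name for this sketch):

# Hypothetical CPU-only variant: use ray.remote with no arguments.
@ray.remote
class NeuralNetOnCPU(object):
    def __init__(self):
        with tf.Graph().as_default():
            self.x, self.y_, self.train_step, self.accuracy = construct_network()
            self.sess = tf.Session()
            self.sess.run(tf.global_variables_initializer())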

When the actor is created, it will have access to a list of the IDs of the GPUs that it is allowed to use via ray.get_gpu_ids(). This is a list of integers, like [], or [1], or [2, 5, 6]. Since we passed in ray.remote(num_gpus=1), this list will have length one.
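
For example, a minimal sketch (the GPUActor class here is hypothetical, for illustration only):

@ray.remote(num_gpus=1)
class GPUActor(object):
    def get_ids(self):
        # The IDs of the GPUs reserved for this actor, e.g., [0].
        return ray.get_gpu_ids()

gpu_actor = GPUActor.remote()
print(ray.get(gpu_actor.get_ids.remote()))  # A list of length one.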

We can put this all together as follows.


In [1]:
# Restart the kernel before running this to initialize Ray with the
# specified number of GPUs.
import os
import ray
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

ray.init(num_gpus=8)

def construct_network():
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    return x, y_, train_step, accuracy

@ray.remote(num_gpus=1)
class NeuralNetOnGPU(object):
    def __init__(self, mnist_data):
        self.mnist = mnist_data
        # Set an environment variable to tell TensorFlow which GPUs to use.
        # Note that this must be done before the call to tf.Session.
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(i) for i in
                                                       ray.get_gpu_ids()])
        with tf.Graph().as_default():
            with tf.device("/gpu:0"):
                self.x, self.y_, self.train_step, self.accuracy = construct_network()
                # Allow this to run on CPUs if there aren't any GPUs.
                config = tf.ConfigProto(allow_soft_placement=True)
                self.sess = tf.Session(config=config)
                # Initialize the network.
                init = tf.global_variables_initializer()
                self.sess.run(init)

    def train(self, num_steps):
        for _ in range(num_steps):
            batch_xs, batch_ys = self.mnist.train.next_batch(100)
            self.sess.run(self.train_step, feed_dict={self.x: batch_xs, self.y_: batch_ys})

    def get_accuracy(self):
        return self.sess.run(self.accuracy, feed_dict={self.x: self.mnist.test.images,
                                                       self.y_: self.mnist.test.labels})

# Load the MNIST dataset. Ray serializes its custom classes by expanding
# them as dictionaries of their fields (see the warnings below).
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

# Create the actor.
nn = NeuralNetOnGPU.remote(mnist)

# Run a few steps of training and print the accuracy.
nn.train.remote(100)
accuracy = ray.get(nn.get_accuracy.remote())
print("Accuracy is {}.".format(accuracy))


Waiting for redis server at 127.0.0.1:64838 to respond...
Waiting for redis server at 127.0.0.1:57864 to respond...
Starting local scheduler with 8 CPUs, 8 GPUs

======================================================================
View the web UI at http://localhost:8899/notebooks/ray_ui94728.ipynb?token=cd8f3a92fa1a63e4f515d83df6e8ea3bb55d3acd130cc55d
======================================================================

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
WARNING: Serializing objects of type <class 'tensorflow.contrib.learn.python.learn.datasets.base.Datasets'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
WARNING: Serializing objects of type <class 'tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet'> by expanding them as dictionaries of their fields. This behavior may be incorrect in some cases.
Accuracy is 0.8928999900817871.
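
Since Ray was started with eight GPUs and each NeuralNetOnGPU actor reserves one, up to eight such actors can train concurrently. A minimal sketch, continuing from the cell above:

# Hypothetical follow-up: one actor per GPU, trained in parallel.
nets = [NeuralNetOnGPU.remote(mnist) for _ in range(8)]
ray.get([n.train.remote(100) for n in nets])  # Wait for all training tasks.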