Train Model on Distributed Cluster

IMPORTANT: You Must Stop All Kernels and Terminal Sessions

The GPU is still held by the kernels and terminal sessions from the earlier sections. Stop them all now to set the GPU free for this notebook.

Define ClusterSpec


In [ ]:
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
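
For reference, tf.train.ClusterSpec also accepts multiple named jobs. A hypothetical parameter-server layout (the hosts and ports here are illustrative, not part of this exercise) would look like:

multi_job_cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"]
})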

Start Server "Task 0" (localhost:2222)

Note: If you see UnknownError: Could not start gRPC server, the server is already running from a previous cell execution; you can safely ignore the error.


In [ ]:
server0 = tf.train.Server(cluster, job_name="local", task_index=0)

print(server0)
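
Note: the server object also exposes the address it listens on via server0.target; the session cell later in this notebook could use it instead of a hard-coded URL:

print(server0.target)  # the "grpc://..." address for this task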

Start Server "Task 1" (localhost:2223)

Note: If you see UnknownError: Could not start gRPC server, the server is already running from a previous cell execution; you can safely ignore the error.


In [ ]:
server1 = tf.train.Server(cluster, job_name="local", task_index=1)

print(server1)
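
Both tasks run inside this single notebook process only for demonstration. In a real cluster, each task typically runs in its own process and blocks serving requests; a minimal sketch of such a task script (TASK_INDEX is a placeholder for whatever your launcher passes in):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=TASK_INDEX)
server.join()  # block forever, serving graph execution requests for this task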

Define Compute-Heavy TensorFlow Graph


In [ ]:
import tensorflow as tf

n = 2

def matpow(M, n):
    # Multiply M by itself repeatedly: returns M^n for n >= 1
    if n <= 1:
        return M
    else:
        return tf.matmul(M, matpow(M, n - 1))
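
The recursive definition above performs n - 1 multiplications. For larger exponents, exponentiation by squaring needs only O(log n) matmuls; a sketch for comparison (not used in the cells below):

def matpow_by_squaring(M, n):
    # Compute M^n with O(log n) matmuls (n >= 1)
    if n <= 1:
        return M
    half = matpow_by_squaring(M, n // 2)
    squared = tf.matmul(half, half)
    return squared if n % 2 == 0 else tf.matmul(M, squared)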

Define Shape


In [ ]:
shape = [2500, 2500]

Assign Devices Manually

All CPU Devices

Note the execution time so you can compare it with other device placements.


In [ ]:
import datetime

with tf.device("/job:local/task:0/cpu:0"):
    A = tf.random_normal(shape=shape)
    c1 = matpow(A,n)

with tf.device("/job:local/task:1/cpu:0"):
    B = tf.random_normal(shape=shape)
    c2 = matpow(B,n)

with tf.Session("grpc://127.0.0.1:2222") as sess:
    sum = c1 + c2
    start_time = datetime.datetime.now()
    print(sess.run(sum))
    print("Execution time: " 
          + str(datetime.datetime.now() - start_time))
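
To quantify the benefit of splitting the work across tasks, you can time the same computation pinned to a single task and compare; a sketch, assuming the cells above have already been run:

with tf.device("/job:local/task:0/cpu:0"):
    A = tf.random_normal(shape=shape)
    B = tf.random_normal(shape=shape)
    total_single = matpow(A, n) + matpow(B, n)

with tf.Session("grpc://127.0.0.1:2222") as sess:
    start_time = datetime.datetime.now()
    sess.run(total_single)
    print("Single-task execution time: "
          + str(datetime.datetime.now() - start_time))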