In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
TensorFlow code and tf.keras models will transparently run on a single GPU with no code changes required.
Note: Use tf.config.experimental.list_physical_devices('GPU') to confirm that TensorFlow is using the GPU.
The simplest way to run on multiple GPUs, on one or many machines, is using Distribution Strategies.
This guide is for users who have tried these approaches and found that they need fine-grained control of how TensorFlow uses the GPU.
In [0]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
TensorFlow supports running computations on a variety of types of devices, including CPU and GPU. They are represented with string identifiers, for example:

- "/device:CPU:0": The CPU of your machine.
- "/GPU:0": Short-hand notation for the first GPU of your machine that is visible to TensorFlow.
- "/job:localhost/replica:0/task:0/device:GPU:1": Fully qualified name of the second GPU of your machine that is visible to TensorFlow.

If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels. On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device.
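To see these identifiers in practice, you can read the .device attribute of an eager tensor. This is a minimal sketch; the exact string printed depends on the devices visible on your machine.
In [0]:
# .device holds the fully qualified name of the device the tensor was
# placed on, for example /job:localhost/replica:0/task:0/device:GPU:0.
x = tf.constant([1.0, 2.0, 3.0])
print(x.device)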
In [0]:
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
The above code will print an indication that the MatMul op was executed on GPU:0.
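In the placement log this shows up as a line similar to Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0, though the exact device string depends on your machine.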
In [0]:
tf.debugging.set_log_device_placement(True)
# Place tensors on the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

c = tf.matmul(a, b)
print(c)
You will see that now a and b are assigned to CPU:0. Since a device was not explicitly specified for the MatMul operation, the TensorFlow runtime will choose one based on the operation and available devices (GPU:0 in this example) and automatically copy tensors between devices if required.
By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation. To limit TensorFlow to a specific set of GPUs, use the tf.config.experimental.set_visible_devices method.
In [0]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as it is needed by the process. TensorFlow provides two methods to control this.

The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program runs and more GPU memory is needed, the GPU memory region allocated to the TensorFlow process is extended. Note that memory is not released, since releasing it can lead to memory fragmentation. To turn on memory growth for a specific GPU, use the following code prior to allocating any tensors or executing any ops.
In [0]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
Another way to enable this option is to set the environment variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.
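As a sketch of the environment-variable route: the timing requirement mirrors the Python APIs above, in that the value must be in place before the GPUs are first initialized.
In [0]:
import os

# Must be set before TensorFlow initializes the GPUs; setting it at the
# very top of the program, before importing TensorFlow, is the safe choice.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'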
The second method is to configure a virtual GPU device with tf.config.experimental.set_virtual_device_configuration and set a hard limit on the total memory to allocate on the GPU.
In [0]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. This is common practice for local development when the GPU is shared with other applications such as a workstation GUI.
In [0]:
tf.debugging.set_log_device_placement(True)
try:
  # Specify an invalid GPU device
  with tf.device('/device:GPU:2'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
except RuntimeError as e:
  print(e)
If the device you have specified does not exist, you will get a RuntimeError: .../device:GPU:2 unknown device.

If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one doesn't exist, you can call tf.config.set_soft_device_placement(True).
In [0]:
tf.config.set_soft_device_placement(True)
tf.debugging.set_log_device_placement(True)
# Creates some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
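On a machine with a single physical GPU you can still develop for a multi-GPU setup: the next cell emulates multiple GPUs by splitting the first physical GPU into two virtual devices.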
In [0]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Create 2 virtual GPUs with 1GB memory each
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
Once we have multiple logical GPUs available to the runtime, we can utilize them with tf.distribute.Strategy or with manual placement.
In [0]:
tf.debugging.set_log_device_placement(True)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
This program will run a copy of your model on each GPU, splitting the input data between them, also known as "data parallelism".
For more information about distribution strategies, check out the distributed training guide.
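As a quick usage sketch (the toy data here is made up for illustration), calling model.fit on the model built above then splits each batch across the replicas that MirroredStrategy created:
In [0]:
import numpy as np

# Hypothetical toy data: 100 scalar inputs and linear targets.
x = np.random.random((100, 1)).astype(np.float32)
y = 3.0 * x + 2.0

# With MirroredStrategy, each batch of 10 is split across the logical GPUs.
model.fit(x, y, epochs=1, batch_size=10)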
In [0]:
tf.debugging.set_log_device_placement(True)
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
  # Replicate your computation on multiple GPUs
  c = []
  for gpu in gpus:
    with tf.device(gpu.name):
      a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
      c.append(tf.matmul(a, b))

  with tf.device('/CPU:0'):
    matmul_sum = tf.add_n(c)

  print(matmul_sum)