This notebook walks you through how to use accelerators for Kubeflow Pipelines steps.

Preparation

If you installed Kubeflow via kfctl, these steps will have already been done, and you can skip this section.

If you installed Kubeflow Pipelines via the Google Cloud AI Platform Pipelines UI or the Standalone manifest, you will need to follow these steps to set up your GPU environment.

Add GPU nodes to your cluster

To see which accelerators are available in each zone, run the following command or check the documentation.

gcloud compute accelerator-types list

You may also want to check or edit your GCP GPU quota to make sure you still have quota available in the region.

To reduce costs, you may want to create a zero-sized node pool for GPU and enable autoscaling.

Here is an example of creating a P100 GPU node pool for an existing cluster.

# You may customize these parameters.
export GPU_POOL_NAME=p100pool
export CLUSTER_NAME=existingClusterName
export CLUSTER_ZONE=us-west1-a
export GPU_TYPE=nvidia-tesla-p100
export GPU_COUNT=1
export MACHINE_TYPE=n1-highmem-16


# Node pool creation may take several minutes.
gcloud container node-pools create ${GPU_POOL_NAME} \
  --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \
  --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \
  --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling \
  --scopes=cloud-platform

In this example, we specified --scopes=cloud-platform. More info is here. This scope allows jobs on the node pool to use the GCE default service account to access GCP APIs (such as GCS). Alternatively, you can use Workload Identity or Application Default Credentials instead of --scopes=cloud-platform.

Install NVIDIA device drivers on the cluster

After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers on those nodes. Google provides a GKE DaemonSet that automatically installs the drivers for you.

To deploy the installation DaemonSet, run the following command. You can run this command any time (even before you create your node pool), and you only need to do this once per cluster.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Consume GPU via Kubeflow Pipelines SDK

Once your cluster is set up to support GPUs, the next step is to indicate which steps in your pipelines should use accelerators, and what type they should use. Here is a document that describes the options.

The following is an example 'smoke test' pipeline, to see if your cluster setup is working properly.


In [3]:
import kfp
from kfp import dsl

def gpu_smoking_check_op():
    return dsl.ContainerOp(
        name='check',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1)

@dsl.pipeline(
    name='GPU smoke check',
    description='Smoke check as to whether the GPU env is ready.'
)
def gpu_pipeline():
    gpu_smoking_check = gpu_smoking_check_op()

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(gpu_pipeline, 'gpu_smoking_check.yaml')
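
If you have access to a Kubeflow Pipelines endpoint, you can also submit the run directly from the notebook instead of compiling to YAML first. A minimal sketch, where the host URL is a placeholder for your own KFP endpoint:

import kfp

# Connect to your KFP endpoint (placeholder URL) and submit the pipeline directly.
client = kfp.Client(host='https://<your-kfp-endpoint>')
client.create_run_from_pipeline_func(gpu_pipeline, arguments={})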

You may see a warning in the Kubeflow Pipelines logs saying "Insufficient nvidia.com/gpu". If so, this probably means that your GPU-enabled node is still spinning up; please wait a few minutes. You can check the current nodes in your cluster like this:

kubectl get nodes -o wide

If everything runs as expected, the nvidia-smi command should list the CUDA version, GPU type, usage, etc. (See the logs panel in the pipeline UI to view output).

You may also notice that after the pipeline step's GKE pod has finished, the new GPU node is still there. The GKE autoscaler will remove that node once it has seen no usage for a certain period of time. More info is here.

Multiple GPU pools in one cluster

You may want more than one type of GPU to be supported in one cluster:

  • There are several types of GPUs.
  • A given region often supports only a subset of the GPU types (document).

Since we can set --num-nodes=0 for a GPU node pool to save costs when there is no workload, we can create multiple node pools, one for each type of GPU we want to use.

Add additional GPU nodes to your cluster

In a previous section, we added a node pool for P100s. Here we add another pool for V100s.

# You may customize these parameters.
export GPU_POOL_NAME=v100pool
export CLUSTER_NAME=existingClusterName
export CLUSTER_ZONE=us-west1-a
export GPU_TYPE=nvidia-tesla-v100
export GPU_COUNT=1
export MACHINE_TYPE=n1-highmem-8


# Node pool creation may take several minutes.
gcloud container node-pools create ${GPU_POOL_NAME} \
  --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \
  --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \
  --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling

Consume a specific GPU type via Kubeflow Pipelines SDK

If your cluster has multiple GPU node pools, you can explicitly specify that a given pipeline step should use a particular type of accelerator by calling add_node_selector_constraint, which adds a Kubernetes node selector on the cloud.google.com/gke-accelerator node label. This example shows how to use P100s for one pipeline step, and V100s for another.


In [1]:
import kfp
from kfp import dsl

def gpu_p100_op():
    return dsl.ContainerOp(
        name='check_p100',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')

def gpu_v100_op():
    return dsl.ContainerOp(
        name='check_v100',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')

@dsl.pipeline(
    name='GPU smoke check',
    description='Smoke check as to whether GPU env is ready.'
)
def gpu_pipeline():
    gpu_p100 = gpu_p100_op()
    gpu_v100 = gpu_v100_op()

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(gpu_pipeline, 'gpu_smoking_check.yaml')

You should see different "nvidia-smi" logs from the two pipeline steps.

Using Preemptible GPUs

A preemptible GPU resource is cheaper, but using these instances means that a pipeline step may be aborted and then retried. Pipeline steps run on preemptible instances must therefore either be idempotent (the step gives the same results if run again) or create some kind of checkpoint so that they can pick up where they left off. To use preemptible GPUs, create a node pool as follows. Then, when specifying a pipeline, you can indicate that a given step should use the preemptible node pool.

The only difference in the following node-pool creation example is that the --preemptible and --node-taints=preemptible=true:NoSchedule parameters have been added.

export GPU_POOL_NAME=v100pool-preemptible
export CLUSTER_NAME=existingClusterName
export CLUSTER_ZONE=us-west1-a
export GPU_TYPE=nvidia-tesla-v100
export GPU_COUNT=1
export MACHINE_TYPE=n1-highmem-8

gcloud container node-pools create ${GPU_POOL_NAME} \
  --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \
  --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \
  --preemptible \
  --node-taints=preemptible=true:NoSchedule \
  --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling

Then, you can define a pipeline as follows (note the use of use_preemptible_nodepool(), which lets the step tolerate the preemptible=true:NoSchedule taint set on the node pool above).


In [5]:
import kfp
import kfp.gcp as gcp
from kfp import dsl

def gpu_p100_op():
    return dsl.ContainerOp(
        name='check_p100',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')

def gpu_v100_op():
    return dsl.ContainerOp(
        name='check_v100',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')

def gpu_v100_preemptible_op():
    v100_op = dsl.ContainerOp(
        name='check_v100_preemptible',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi'])
    v100_op.set_gpu_limit(1)
    v100_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')
    v100_op.apply(gcp.use_preemptible_nodepool(hard_constraint=True))
    return v100_op

@dsl.pipeline(
    name='GPU smoke check',
    description='Smoke check as to whether the GPU env is ready.'
)
def gpu_pipeline():
    gpu_p100 = gpu_p100_op()
    gpu_v100 = gpu_v100_op()
    gpu_v100_preemptible = gpu_v100_preemptible_op()

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(gpu_pipeline, 'gpu_smoking_check.yaml')
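
Because a preemptible step can be reclaimed at any time, you may also want to attach a retry policy to it. Here is a minimal sketch that builds on gpu_v100_preemptible_op above, assuming your KFP SDK version supports set_retry() on pipeline ops:

def gpu_v100_preemptible_op_with_retry():
    v100_op = gpu_v100_preemptible_op()
    # Retry the step a few times in case the preemptible node is reclaimed mid-run.
    v100_op.set_retry(3)
    return v100_op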

TPU

Google's TPUs can be faster and have a lower total cost of ownership (TCO) for many workloads. To consume TPUs, there is no need to create a node pool; you can simply request them through the KFP SDK. Here is a doc. Note that not all regions have TPUs available yet.
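
The cell below is a sketch of what this could look like, assuming your KFP SDK version includes the kfp.gcp.use_tpu helper; the container image, TPU core count, TPU version, and TensorFlow version shown are illustrative placeholders, not recommendations.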


In [ ]:
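import kfp
import kfp.gcp as gcp
from kfp import dsl

def tpu_check_op():
    # Placeholder step; swap in your own TPU-enabled training image and command.
    op = dsl.ContainerOp(
        name='check_tpu',
        image='tensorflow/tensorflow:1.14.0',
        command=['sh', '-c'],
        arguments=['echo "TPU step placeholder"'])
    # use_tpu annotates the pod so GKE attaches Cloud TPU resources to this step.
    # The core count, TPU version, and TF version here are illustrative only.
    op.apply(gcp.use_tpu(tpu_cores=8, tpu_resource='v2', tf_version='1.14'))
    return op

@dsl.pipeline(
    name='TPU smoke check',
    description='Smoke check as to whether the TPU env is ready.'
)
def tpu_pipeline():
    tpu_check = tpu_check_op()

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(tpu_pipeline, 'tpu_smoke_check.yaml')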