MNIST end to end on Kubeflow on GKE

This example guides you through:

  1. Taking an example TensorFlow model and modifying it to support distributed training.
  2. Serving the resulting model using TFServing.
  3. Deploying and using a web app that sends prediction requests to the model.

Requirements

Prepare model

Existing distributed MNIST examples need a few changes before they run well as a TFJob.

Specifically, you must:

  • Add options in order to make the model configurable.
  • Use tf.estimator.train_and_evaluate to enable model exporting and serving.
  • Define serving signatures for model serving.

This tutorial provides a Python program that's already prepared for you: model.py.
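
model.py is not reproduced in this notebook. As a rough illustration of the first bullet only, the sketch below shows how a training script might expose its settings as command-line flags. The flag names match the ones that the TFJob spec later passes to /opt/model.py; the parser, the default values, and the helper names build_parser and parse_args are assumptions made for illustration, not the contents of model.py.

```python
import argparse


def build_parser():
    """Return a parser for the flags that the TFJob spec passes to the script."""
    parser = argparse.ArgumentParser(description="Hypothetical MNIST trainer flags")
    # Defaults below are placeholders chosen for this sketch.
    parser.add_argument("--tf-model-dir", type=str, default="/tmp/mnist",
                        help="Directory for checkpoints and summaries.")
    parser.add_argument("--tf-export-dir", type=str, default="/tmp/mnist/export",
                        help="Directory for the exported model.")
    parser.add_argument("--tf-train-steps", type=int, default=200,
                        help="Number of training steps.")
    parser.add_argument("--tf-batch-size", type=int, default=100,
                        help="Training batch size.")
    parser.add_argument("--tf-learning-rate", type=float, default=0.01,
                        help="Optimizer learning rate.")
    return parser


def parse_args(argv=None):
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


# argparse exposes "--tf-train-steps" as the attribute "tf_train_steps".
args = parse_args(["--tf-train-steps=200", "--tf-batch-size=100"])
print(args.tf_train_steps, args.tf_batch_size)
```

Keeping every tunable value behind a flag is what lets the same container image be reused with different arguments for each replica in the job spec.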

Verify that you have a Google Cloud Platform (GCP) account

The cell below checks that this notebook was spawned with credentials to access GCP.


In [1]:
import logging
import os
import uuid
from importlib import reload
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()

Install the required libraries

Run the next cell to import the libraries required to train this model.


In [2]:
import notebook_setup
reload(notebook_setup)
notebook_setup.notebook_setup()


pip installing requirements.txt
Cloning the tf-operator repo
Checkout kubeflow/tf-operator @9238906
Adding /home/jovyan/.local/lib/python3.6/site-packages to python path
Adding /home/jovyan/git_tf-operator/sdk/python to python path
Configure docker credentials

Wait for the message Configure docker credentials before moving on to the next cell.


In [3]:
import k8s_util
# Force a reload of Kubeflow. Since Kubeflow is a multi namespace module,
# doing the reload in notebook_setup may not be sufficient.
import kubeflow
reload(kubeflow)
from kubernetes import client as k8s_client
from kubernetes import config as k8s_config
from kubeflow.tfjob.api import tf_job_client as tf_job_client_module
from IPython.core.display import display, HTML
import yaml

Configure a Docker registry for Kubeflow Fairing

  • In order to build Docker images from your notebook, you need a Docker registry to store the images.
  • Below you set some variables specifying a Container Registry.
  • Kubeflow Fairing provides a utility function to guess the name of your GCP project.

In [4]:
from kubernetes import client as k8s_client
from kubernetes.client import rest as k8s_rest
from kubeflow import fairing   
from kubeflow.fairing import utils as fairing_utils
from kubeflow.fairing.builders import append
from kubeflow.fairing.deployers import job
from kubeflow.fairing.preprocessors import base as base_preprocessor

# Setting up Google Container Registry (GCR) for storing output containers.
# You can use any Docker container registry instead of GCR.
GCP_PROJECT = fairing.cloud.gcp.guess_project_name()
DOCKER_REGISTRY = 'gcr.io/{}/fairing-job'.format(GCP_PROJECT)
namespace = fairing_utils.get_current_k8s_namespace()

logging.info(f"Running in project {GCP_PROJECT}")
logging.info(f"Running in namespace {namespace}")
logging.info(f"Using Docker registry {DOCKER_REGISTRY}")


Running in project kubeflow-writers
Running in namespace kubeflow-sarahmaddox
Using Docker registry gcr.io/kubeflow-writers/fairing-job

Use Kubeflow Fairing to build the Docker image

This notebook uses Kubeflow Fairing's kaniko builder to build a Docker image that includes all your dependencies.

  • You use kaniko because you want to be able to run pip to install dependencies.
  • Kaniko gives you the flexibility to build images from Dockerfiles.

In [5]:
# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default 
# Kaniko image is updated to a newer image than 0.7.0.
from kubeflow.fairing import constants
constants.constants.KANIKO_IMAGE = "gcr.io/kaniko-project/executor:v0.14.0"

In [6]:
from kubeflow.fairing.builders import cluster

# output_map is a map of extra files to add to the Docker build context.
# It is a map from source location to the location inside the context.
output_map =  {
    "Dockerfile.model": "Dockerfile",
    "model.py": "model.py"
}


preprocessor = base_preprocessor.BasePreProcessor(
    command=["python"], # The base class will set this.
    input_files=[],
    path_prefix="/app", # irrelevant since we aren't preprocessing any files
    output_map=output_map)

preprocessor.preprocess()


Out[6]:
set()

Run the next cell and wait until you see a message like Built image gcr.io/<your-project>/fairing-job/mnist:<1234567>.


In [7]:
# Use a TensorFlow image as the base image.
# The base image is set in the custom Dockerfile.
cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,
                                                 base_image="", # base_image is set in the Dockerfile
                                                 preprocessor=preprocessor,
                                                 image_name="mnist",
                                                 dockerfile_path="Dockerfile",
                                                 pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],
                                                 context_source=cluster.gcs_context.GCSContextSource())
cluster_builder.build()
logging.info(f"Built image {cluster_builder.image_tag}")


Building image using cluster builder.
Creating docker context: /tmp/fairing_context_ohm2nlbv
Dockerfile already exists in Fairing context, skipping...
Waiting for fairing-builder-9vw9w-ndbhd to start...
Waiting for fairing-builder-9vw9w-ndbhd to start...
Waiting for fairing-builder-9vw9w-ndbhd to start...
Pod started running True
ERROR: logging before flag.Parse: E0226 02:34:42.750776       1 metadata.go:241] Failed to unmarshal scopes: invalid character 'h' looking for beginning of value
INFO[0004] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3
INFO[0004] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3
INFO[0004] Downloading base image tensorflow/tensorflow:1.15.2-py3
ERROR: logging before flag.Parse: E0226 02:34:44.230593       1 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg
ERROR: logging before flag.Parse: E0226 02:34:44.233477       1 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url
INFO[0004] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory
INFO[0004] Downloading base image tensorflow/tensorflow:1.15.2-py3
INFO[0005] Built cross stage deps: map[]
INFO[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3
INFO[0005] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory
INFO[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3
INFO[0005] Using files from context: [/kaniko/buildcontext/model.py]
INFO[0005] Checking for cached layer gcr.io/kubeflow-writers/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf...
INFO[0006] No cached layer found for cmd RUN chmod +x /opt/model.py
INFO[0006] Unpacking rootfs as cmd RUN chmod +x /opt/model.py requires it.
INFO[0029] Taking snapshot of full filesystem...
INFO[0042] Using files from context: [/kaniko/buildcontext/model.py]
INFO[0042] ADD model.py /opt/model.py
INFO[0042] Taking snapshot of files...
INFO[0042] RUN chmod +x /opt/model.py
INFO[0042] cmd: /bin/sh
INFO[0042] args: [-c chmod +x /opt/model.py]
INFO[0042] Taking snapshot of full filesystem...
INFO[0045] ENTRYPOINT ["/usr/bin/python"]
INFO[0045] Pushing layer gcr.io/kubeflow-writers/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf to cache now
INFO[0045] No files changed in this command, skipping snapshotting.
INFO[0045] CMD ["/opt/model.py"]
INFO[0045] No files changed in this command, skipping snapshotting.
Built image gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B

Create a Cloud Storage bucket

Run the next cell to create a Google Cloud Storage (GCS) bucket to store your models and other results.

Since this notebook is running in Python, the cell uses the GCS Python client libraries, but you can use the gsutil command line instead.


In [8]:
from google.cloud import storage
bucket = f"{GCP_PROJECT}-mnist"

client = storage.Client()
b = storage.Bucket(client=client, name=bucket)

if not b.exists():
    logging.info(f"Creating bucket {bucket}")
    b.create()
else:
    logging.info(f"Bucket {bucket} already exists")


Creating bucket kubeflow-writers-mnist

Distributed training

To train the model, this example uses TFJob to run a distributed training job. Run the next cell to set up the YAML specification for the job:


In [9]:
train_name = f"mnist-train-{uuid.uuid4().hex[:4]}"
num_ps = 1
num_workers = 2  # Not used: the Worker replica count is hard-coded to 1 in the spec below.
model_dir = f"gs://{bucket}/mnist"
export_path = f"gs://{bucket}/mnist/export" 
train_steps = 200
batch_size = 100
learning_rate = .01
image = cluster_builder.image_tag

train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: {train_name}  
spec:
  tfReplicaSpecs:
    Ps:
      replicas: {num_ps}
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command:
            - python
            - /opt/model.py
            - --tf-model-dir={model_dir}
            - --tf-export-dir={export_path}
            - --tf-train-steps={train_steps}
            - --tf-batch-size={batch_size}
            - --tf-learning-rate={learning_rate}
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command:
            - python
            - /opt/model.py
            - --tf-model-dir={model_dir}
            - --tf-export-dir={export_path}
            - --tf-train-steps={train_steps}
            - --tf-batch-size={batch_size}
            - --tf-learning-rate={learning_rate}
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command:
            - python
            - /opt/model.py
            - --tf-model-dir={model_dir}
            - --tf-export-dir={export_path}
            - --tf-train-steps={train_steps}
            - --tf-batch-size={batch_size}
            - --tf-learning-rate={learning_rate}
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
"""

Create the training job

To submit the training job, you could write the spec to a YAML file and then run kubectl apply -f {FILE}.

However, because you are running in a Jupyter notebook, you use the TFJob client.

  • You run the TFJob in a namespace created by a Kubeflow profile.
  • The namespace is the same as the namespace where you are running the notebook.
  • Creating a profile ensures that the namespace is provisioned with service accounts and other resources needed for Kubeflow.

In [10]:
tf_job_client = tf_job_client_module.TFJobClient()

In [11]:
tf_job_body = yaml.safe_load(train_spec)
tf_job = tf_job_client.create(tf_job_body, namespace=namespace)  

logging.info(f"Created job {namespace}.{train_name}")


Created job kubeflow-sarahmaddox.mnist-train-289e
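
The client receives the dictionary that yaml.safe_load produces from train_spec, and the three replica sections in that spec differ only in their replica count. As a possible alternative to repeating the YAML template, the sketch below builds the same structure directly as a Python dictionary. The helper names replica_spec and build_tf_job and the placeholder values in the usage lines are hypothetical; the field names are copied from the YAML spec.

```python
def replica_spec(replicas, image, args):
    """Build one entry of tfReplicaSpecs with the same fields as the YAML spec."""
    return {
        "replicas": replicas,
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "serviceAccount": "default-editor",
                "containers": [{
                    "name": "tensorflow",
                    "command": ["python", "/opt/model.py"] + list(args),
                    "image": image,
                    "workingDir": "/opt",
                }],
                "restartPolicy": "OnFailure",
            },
        },
    }


def build_tf_job(name, image, args, num_ps=1, num_workers=1):
    """Assemble a TFJob body with Ps, Chief and Worker replicas."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Ps": replica_spec(num_ps, image, args),
                "Chief": replica_spec(1, image, args),
                "Worker": replica_spec(num_workers, image, args),
            }
        },
    }


# Placeholder values for illustration only.
example_job = build_tf_job(
    name="mnist-train-example",
    image="gcr.io/<your-project>/fairing-job/mnist:<tag>",
    args=["--tf-train-steps=200", "--tf-batch-size=100"],
    num_workers=2,
)
print(sorted(example_job["spec"]["tfReplicaSpecs"]))
```

Building the body in code keeps the replica counts in one place, so a variable such as num_workers cannot drift from the value in the spec.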

Check the job using kubectl

Above you used the Python SDK for TFJob to create the job. You can also use kubectl to get the status of your job. The job conditions tell you whether the job is running, has succeeded, or has failed.


In [12]:
!kubectl get tfjobs -o yaml {train_name}


apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  creationTimestamp: "2020-02-26T02:58:32Z"
  generation: 1
  name: mnist-train-289e
  namespace: kubeflow-sarahmaddox
  resourceVersion: "770252"
  selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow-sarahmaddox/tfjobs/mnist-train-289e
  uid: dfa23ecf-5843-11ea-9ddf-42010a80013f
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python
            - /opt/model.py
            - --tf-model-dir=gs://kubeflow-writers-mnist/mnist
            - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export
            - --tf-train-steps=200
            - --tf-batch-size=100
            - --tf-learning-rate=0.01
            image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B
            name: tensorflow
            workingDir: /opt
          restartPolicy: OnFailure
          serviceAccount: default-editor
    Ps:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python
            - /opt/model.py
            - --tf-model-dir=gs://kubeflow-writers-mnist/mnist
            - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export
            - --tf-train-steps=200
            - --tf-batch-size=100
            - --tf-learning-rate=0.01
            image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B
            name: tensorflow
            workingDir: /opt
          restartPolicy: OnFailure
          serviceAccount: default-editor
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python
            - /opt/model.py
            - --tf-model-dir=gs://kubeflow-writers-mnist/mnist
            - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export
            - --tf-train-steps=200
            - --tf-batch-size=100
            - --tf-learning-rate=0.01
            image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B
            name: tensorflow
            workingDir: /opt
          restartPolicy: OnFailure
          serviceAccount: default-editor
status:
  completionTime: "2020-02-26T02:59:58Z"
  conditions:
  - lastTransitionTime: "2020-02-26T02:58:32Z"
    lastUpdateTime: "2020-02-26T02:58:32Z"
    message: TFJob mnist-train-289e is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2020-02-26T02:58:35Z"
    lastUpdateTime: "2020-02-26T02:58:35Z"
    message: TFJob mnist-train-289e is running.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2020-02-26T02:59:58Z"
    lastUpdateTime: "2020-02-26T02:59:58Z"
    message: TFJob mnist-train-289e successfully completed.
    reason: TFJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Chief:
      succeeded: 1
    PS:
      succeeded: 1
    Worker:
      succeeded: 1
  startTime: "2020-02-26T02:58:32Z"
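
The conditions list can also be read programmatically. The sketch below assumes that the job is available as a dictionary shaped like the YAML output above, and that the conditions appear in chronological order as they do there. The helper name current_condition is hypothetical. It treats a condition as active when its status is the string "True" and returns the type of the last active one.

```python
def current_condition(tf_job):
    """Return the type of the last condition whose status is "True", or None."""
    conditions = tf_job.get("status", {}).get("conditions") or []
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


# Conditions copied from the kubectl output above, reduced to type and status.
example_job = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]
    }
}
print(current_condition(example_job))
```

Note that the status values are strings, not booleans, so the comparison is against "True" rather than True.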

Get the training logs

  • There are two ways to get the logs for the training job:

    • Using kubectl to fetch the pod logs. These logs are ephemeral; they will be unavailable when the pod is garbage collected to free up resources.
    • Using Stackdriver.

      • Kubernetes logs are automatically available in Stackdriver.
      • You can use labels to locate the logs for a specific pod.
      • In the cell below, you use labels for the training job name and process type to locate the logs for a specific pod.
  • Run the cell below to get a link to Stackdriver for your logs:


In [13]:
from urllib.parse import urlencode

for replica in ["chief", "worker", "ps"]:    
    logs_filter = f"""resource.type="k8s_container"    
    labels."k8s-pod/tf-job-name" = "{train_name}"
    labels."k8s-pod/tf-replica-type" = "{replica}"    
    resource.labels.container_name="tensorflow" """

    new_params = {'project': GCP_PROJECT,
                  # Logs for last 7 days
                  'interval': 'P7D',
                  'advancedFilter': logs_filter}

    query = urlencode(new_params)

    url = "https://console.cloud.google.com/logs/viewer?" + query

    display(HTML(f"Link to: <a href='{url}'>{replica} logs</a>"))


Link to: chief logs
Link to: worker logs
Link to: ps logs

Deploy TensorBoard

The next step is to create a Kubernetes deployment to run TensorBoard.

TensorBoard will be accessible behind the Kubeflow IAP endpoint.


In [14]:
tb_name = "mnist-tensorboard"
tb_deploy = f"""apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: mnist-tensorboard
  name: {tb_name}
  namespace: {namespace}
spec:
  selector:
    matchLabels:
      app: mnist-tensorboard
  template:
    metadata:
      labels:
        app: mnist-tensorboard
        version: v1
    spec:
      serviceAccount: default-editor
      containers:
      - command:
        - /usr/local/bin/tensorboard
        - --logdir={model_dir}
        - --port=80
        image: tensorflow/tensorflow:1.15.2-py3
        name: tensorboard
        ports:
        - containerPort: 80
"""
tb_service = f"""apiVersion: v1
kind: Service
metadata:
  labels:
    app: mnist-tensorboard
  name: {tb_name}
  namespace: {namespace}
spec:
  ports:
  - name: http-tb
    port: 80
    targetPort: 80
  selector:
    app: mnist-tensorboard
  type: ClusterIP
"""

tb_virtual_service = f"""apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: {tb_name}
  namespace: {namespace}
spec:
  gateways:
  - kubeflow/kubeflow-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /mnist/{namespace}/tensorboard/
    rewrite:
      uri: /
    route:
    - destination:
        host: {tb_name}.{namespace}.svc.cluster.local
        port:
          number: 80
    timeout: 300s
"""

tb_specs = [tb_deploy, tb_service, tb_virtual_service]

In [15]:
k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)


/home/jovyan/examples/mnist/k8s_util.py:55: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  spec = yaml.load(spec)
Created Deployment kubeflow-sarahmaddox.mnist-tensorboard
Created Service kubeflow-sarahmaddox.mnist-tensorboard
Created VirtualService mnist-tensorboard.mnist-tensorboard
Out[15]:
[{'api_version': 'apps/v1',
  'kind': 'Deployment',
  'metadata': {'annotations': None,
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 20, 4, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': 1,
               'initializers': None,
               'labels': {'app': 'mnist-tensorboard'},
               'managed_fields': None,
               'name': 'mnist-tensorboard',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '782392',
               'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-tensorboard',
               'uid': 'e1d50153-5846-11ea-9ddf-42010a80013f'},
  'spec': {'min_ready_seconds': None,
           'paused': None,
           'progress_deadline_seconds': 600,
           'replicas': 1,
           'revision_history_limit': 10,
           'selector': {'match_expressions': None,
                        'match_labels': {'app': 'mnist-tensorboard'}},
           'strategy': {'rolling_update': {'max_surge': '25%',
                                           'max_unavailable': '25%'},
                        'type': 'RollingUpdate'},
           'template': {'metadata': {'annotations': None,
                                     'cluster_name': None,
                                     'creation_timestamp': None,
                                     'deletion_grace_period_seconds': None,
                                     'deletion_timestamp': None,
                                     'finalizers': None,
                                     'generate_name': None,
                                     'generation': None,
                                     'initializers': None,
                                     'labels': {'app': 'mnist-tensorboard',
                                                'version': 'v1'},
                                     'managed_fields': None,
                                     'name': None,
                                     'namespace': None,
                                     'owner_references': None,
                                     'resource_version': None,
                                     'self_link': None,
                                     'uid': None},
                        'spec': {'active_deadline_seconds': None,
                                 'affinity': None,
                                 'automount_service_account_token': None,
                                 'containers': [{'args': None,
                                                 'command': ['/usr/local/bin/tensorboard',
                                                             '--logdir=gs://kubeflow-writers-mnist/mnist',
                                                             '--port=80'],
                                                 'env': None,
                                                 'env_from': None,
                                                 'image': 'tensorflow/tensorflow:1.15.2-py3',
                                                 'image_pull_policy': 'IfNotPresent',
                                                 'lifecycle': None,
                                                 'liveness_probe': None,
                                                 'name': 'tensorboard',
                                                 'ports': [{'container_port': 80,
                                                            'host_ip': None,
                                                            'host_port': None,
                                                            'name': None,
                                                            'protocol': 'TCP'}],
                                                 'readiness_probe': None,
                                                 'resources': {'limits': None,
                                                               'requests': None},
                                                 'security_context': None,
                                                 'stdin': None,
                                                 'stdin_once': None,
                                                 'termination_message_path': '/dev/termination-log',
                                                 'termination_message_policy': 'File',
                                                 'tty': None,
                                                 'volume_devices': None,
                                                 'volume_mounts': None,
                                                 'working_dir': None}],
                                 'dns_config': None,
                                 'dns_policy': 'ClusterFirst',
                                 'enable_service_links': None,
                                 'host_aliases': None,
                                 'host_ipc': None,
                                 'host_network': None,
                                 'host_pid': None,
                                 'hostname': None,
                                 'image_pull_secrets': None,
                                 'init_containers': None,
                                 'node_name': None,
                                 'node_selector': None,
                                 'priority': None,
                                 'priority_class_name': None,
                                 'readiness_gates': None,
                                 'restart_policy': 'Always',
                                 'runtime_class_name': None,
                                 'scheduler_name': 'default-scheduler',
                                 'security_context': {'fs_group': None,
                                                      'run_as_group': None,
                                                      'run_as_non_root': None,
                                                      'run_as_user': None,
                                                      'se_linux_options': None,
                                                      'supplemental_groups': None,
                                                      'sysctls': None},
                                 'service_account': 'default-editor',
                                 'service_account_name': 'default-editor',
                                 'share_process_namespace': None,
                                 'subdomain': None,
                                 'termination_grace_period_seconds': 30,
                                 'tolerations': None,
                                 'volumes': None}}},
  'status': {'available_replicas': None,
             'collision_count': None,
             'conditions': None,
             'observed_generation': None,
             'ready_replicas': None,
             'replicas': None,
             'unavailable_replicas': None,
             'updated_replicas': None}}, {'api_version': 'v1',
  'kind': 'Service',
  'metadata': {'annotations': None,
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 20, 4, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': None,
               'initializers': None,
               'labels': {'app': 'mnist-tensorboard'},
               'managed_fields': None,
               'name': 'mnist-tensorboard',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '782395',
               'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-tensorboard',
               'uid': 'e1d7b041-5846-11ea-9ddf-42010a80013f'},
  'spec': {'cluster_ip': '10.35.253.170',
           'external_i_ps': None,
           'external_name': None,
           'external_traffic_policy': None,
           'health_check_node_port': None,
           'load_balancer_ip': None,
           'load_balancer_source_ranges': None,
           'ports': [{'name': 'http-tb',
                      'node_port': None,
                      'port': 80,
                      'protocol': 'TCP',
                      'target_port': 80}],
           'publish_not_ready_addresses': None,
           'selector': {'app': 'mnist-tensorboard'},
           'session_affinity': 'None',
           'session_affinity_config': None,
           'type': 'ClusterIP'},
  'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',
  'kind': 'VirtualService',
  'metadata': {'creationTimestamp': '2020-02-26T03:20:04Z',
   'generation': 1,
   'name': 'mnist-tensorboard',
   'namespace': 'kubeflow-sarahmaddox',
   'resourceVersion': '782396',
   'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-sarahmaddox/virtualservices/mnist-tensorboard',
   'uid': 'e1daadfe-5846-11ea-9ddf-42010a80013f'},
  'spec': {'gateways': ['kubeflow/kubeflow-gateway'],
   'hosts': ['*'],
   'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-sarahmaddox/tensorboard/'}}],
     'rewrite': {'uri': '/'},
     'route': [{'destination': {'host': 'mnist-tensorboard.kubeflow-sarahmaddox.svc.cluster.local',
        'port': {'number': 80}}}],
     'timeout': '300s'}]}}]

Set a variable defining your endpoint

Set endpoint to https://your-domain (with no slash at the end). Your domain typically has the following pattern: <your-kubeflow-deployment-name>.endpoints.<your-gcp-project>.cloud.goog. You can see your domain in the URL that you're using to access this notebook.


In [36]:
# Replace None with your endpoint, for example "https://<your-domain>" (no trailing slash).
endpoint = None

if endpoint:
    logging.info(f"endpoint set to {endpoint}")
else:
    logging.warning("You must set the endpoint variable in order to print out the URLs where you can access your web apps.")


endpoint set to https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog
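
If your deployment follows the typical domain pattern described above, you can assemble the endpoint from its parts instead of copying it from the browser. The sketch below is only a convenience under that assumption; the helper names kubeflow_endpoint and strip_trailing_slash are hypothetical, and a deployment with a custom domain will not match this pattern.

```python
def kubeflow_endpoint(deployment_name, gcp_project):
    """Build https://<deployment>.endpoints.<project>.cloud.goog (no trailing slash)."""
    return f"https://{deployment_name}.endpoints.{gcp_project}.cloud.goog"


def strip_trailing_slash(url):
    """Remove any trailing slashes so that endpoint + path does not contain '//'."""
    return url.rstrip("/")


# Placeholder names for illustration only.
print(kubeflow_endpoint("<your-kubeflow-deployment-name>", "<your-gcp-project>"))
```

The trailing slash matters because later cells concatenate the endpoint with a path that already begins with a slash.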

Access the TensorBoard UI

Run the cell below to find the endpoint for the TensorBoard UI.


In [37]:
if endpoint:    
    vs = yaml.safe_load(tb_virtual_service)
    path = vs["spec"]["http"][0]["match"][0]["uri"]["prefix"]
    tb_endpoint = endpoint + path
    display(HTML(f"TensorBoard UI is at <a href='{tb_endpoint}'>{tb_endpoint}</a>"))




Wait for the training job to finish

You can use the TFJob client to wait for the job to finish:


In [18]:
tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=["Succeeded", "Failed"], namespace=namespace)

if tf_job_client.is_job_succeeded(train_name, namespace):
    logging.info(f"TFJob {namespace}.{train_name} succeeded")
else:
    raise ValueError(f"TFJob {namespace}.{train_name} failed")


TFJob kubeflow-sarahmaddox.mnist-train-289e succeeded
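
If you need the same wait-until-finished pattern for a resource that has no client helper, a generic polling loop is one option. The sketch below is not the TFJob client's implementation; the function name wait_for and all of its parameters are hypothetical. The sleep and clock arguments are injectable so that the loop can be exercised without real delays.

```python
import time


def wait_for(get_state, done_states, timeout_seconds=600, poll_seconds=10,
             sleep=time.sleep, clock=time.monotonic):
    """Call get_state() until it returns a value in done_states or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in done_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Simulated job that reports Running twice and then Succeeded.
simulated_states = iter(["Running", "Running", "Succeeded"])
final_state = wait_for(lambda: next(simulated_states),
                       done_states={"Succeeded", "Failed"},
                       sleep=lambda seconds: None)  # skip real sleeping in the demo
print(final_state)
```

Waiting on both the success and the failure state, as the cell above does, avoids blocking until the timeout when a job fails early.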

Serve the model

Now you can deploy the model using TensorFlow Serving.

You need to create the following:

  • A Kubernetes deployment.
  • A Kubernetes service.
  • (Optional) A configmap containing the Prometheus monitoring configuration.

In [19]:
deploy_name = "mnist-model"
model_base_path = export_path

# The web UI defaults to mnist-service so if you change the name, you must
# change it in the UI as well.
model_service = "mnist-service"

deploy_spec = f"""apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: mnist
  name: {deploy_name}
  namespace: {namespace}
spec:
  selector:
    matchLabels:
      app: mnist-model
  template:
    metadata:
      # TODO(jlewi): Right now we disable the Istio sidecar because otherwise Istio RBAC will prevent the
      # UI from sending RPCs to the server. We should create an appropriate Istio RBAC authorization
      # policy to allow traffic from the UI to the model server.
      # https://istio.io/docs/concepts/security/#target-selectors
      annotations:        
        sidecar.istio.io/inject: "false"
      labels:
        app: mnist-model
        version: v1
    spec:
      serviceAccount: default-editor
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=mnist
        - --model_base_path={model_base_path}
        - --monitoring_config_file=/var/config/monitoring_config.txt
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: modelBasePath
          value: {model_base_path}
        image: tensorflow/serving:1.15.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          tcpSocket:
            port: 9000
        name: mnist
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /var/config/
          name: model-config
      volumes:
      - configMap:
          name: {deploy_name}
        name: model-config
"""

service_spec = f"""apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /monitoring/prometheus/metrics
    prometheus.io/port: "8500"
    prometheus.io/scrape: "true"
  labels:
    app: mnist-model
  name: {model_service}
  namespace: {namespace}
spec:
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  selector:
    app: mnist-model
  type: ClusterIP
"""

monitoring_config = f"""kind: ConfigMap
apiVersion: v1
metadata:
  name: {deploy_name}
  namespace: {namespace}
data:
  monitoring_config.txt: |-
    prometheus_config: {{
      enable: true,
      path: "/monitoring/prometheus/metrics"
    }}
"""

model_specs = [deploy_spec, service_spec, monitoring_config]
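
The specs above only work together if the label selectors line up: the Deployment's matchLabels and the Service's selector must both match the labels on the pod template. The following is a minimal sketch of that equality-based matching rule, using label values copied from the specs above. The helper name selector_matches is hypothetical and is not part of k8s_util or the Kubernetes client.

```python
def selector_matches(selector, labels):
    """Return True if every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Label values copied from the Deployment and Service specs above.
pod_labels = {"app": "mnist-model", "version": "v1"}
deployment_selector = {"app": "mnist-model"}
service_selector = {"app": "mnist-model"}
deployment_labels = {"app": "mnist"}  # labels on the Deployment object itself

# Both selectors must match the pod template labels.
assert selector_matches(deployment_selector, pod_labels)
assert selector_matches(service_selector, pod_labels)

# The Deployment's own metadata labels are not what the selectors match against,
# so it is fine that they differ from the pod labels.
print(selector_matches(service_selector, deployment_labels))  # False
```

If you rename a label in one spec, a check like this makes it easier to spot a Service that would end up with no endpoints.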

In [20]:
k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE)


Created Deployment kubeflow-sarahmaddox.mnist-model
Created Service kubeflow-sarahmaddox.mnist-service
Created ConfigMap kubeflow-sarahmaddox.mnist-model
Out[20]:
[{'api_version': 'apps/v1',
  'kind': 'Deployment',
  'metadata': {'annotations': None,
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': 1,
               'initializers': None,
               'labels': {'app': 'mnist'},
               'managed_fields': None,
               'name': 'mnist-model',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '788910',
               'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-model',
               'uid': '5555d458-5848-11ea-9ddf-42010a80013f'},
  'spec': {'min_ready_seconds': None,
           'paused': None,
           'progress_deadline_seconds': 600,
           'replicas': 1,
           'revision_history_limit': 10,
           'selector': {'match_expressions': None,
                        'match_labels': {'app': 'mnist-model'}},
           'strategy': {'rolling_update': {'max_surge': '25%',
                                           'max_unavailable': '25%'},
                        'type': 'RollingUpdate'},
           'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},
                                     'cluster_name': None,
                                     'creation_timestamp': None,
                                     'deletion_grace_period_seconds': None,
                                     'deletion_timestamp': None,
                                     'finalizers': None,
                                     'generate_name': None,
                                     'generation': None,
                                     'initializers': None,
                                     'labels': {'app': 'mnist-model',
                                                'version': 'v1'},
                                     'managed_fields': None,
                                     'name': None,
                                     'namespace': None,
                                     'owner_references': None,
                                     'resource_version': None,
                                     'self_link': None,
                                     'uid': None},
                        'spec': {'active_deadline_seconds': None,
                                 'affinity': None,
                                 'automount_service_account_token': None,
                                 'containers': [{'args': ['--port=9000',
                                                          '--rest_api_port=8500',
                                                          '--model_name=mnist',
                                                          '--model_base_path=gs://kubeflow-writers-mnist/mnist/export',
                                                          '--monitoring_config_file=/var/config/monitoring_config.txt'],
                                                 'command': ['/usr/bin/tensorflow_model_server'],
                                                 'env': [{'name': 'modelBasePath',
                                                          'value': 'gs://kubeflow-writers-mnist/mnist/export',
                                                          'value_from': None}],
                                                 'env_from': None,
                                                 'image': 'tensorflow/serving:1.15.0',
                                                 'image_pull_policy': 'IfNotPresent',
                                                 'lifecycle': None,
                                                 'liveness_probe': {'_exec': None,
                                                                    'failure_threshold': 3,
                                                                    'http_get': None,
                                                                    'initial_delay_seconds': 30,
                                                                    'period_seconds': 30,
                                                                    'success_threshold': 1,
                                                                    'tcp_socket': {'host': None,
                                                                                   'port': 9000},
                                                                    'timeout_seconds': 1},
                                                 'name': 'mnist',
                                                 'ports': [{'container_port': 9000,
                                                            'host_ip': None,
                                                            'host_port': None,
                                                            'name': None,
                                                            'protocol': 'TCP'},
                                                           {'container_port': 8500,
                                                            'host_ip': None,
                                                            'host_port': None,
                                                            'name': None,
                                                            'protocol': 'TCP'}],
                                                 'readiness_probe': None,
                                                 'resources': {'limits': {'cpu': '4',
                                                                          'memory': '4Gi'},
                                                               'requests': {'cpu': '1',
                                                                            'memory': '1Gi'}},
                                                 'security_context': None,
                                                 'stdin': None,
                                                 'stdin_once': None,
                                                 'termination_message_path': '/dev/termination-log',
                                                 'termination_message_policy': 'File',
                                                 'tty': None,
                                                 'volume_devices': None,
                                                 'volume_mounts': [{'mount_path': '/var/config/',
                                                                    'mount_propagation': None,
                                                                    'name': 'model-config',
                                                                    'read_only': None,
                                                                    'sub_path': None,
                                                                    'sub_path_expr': None}],
                                                 'working_dir': None}],
                                 'dns_config': None,
                                 'dns_policy': 'ClusterFirst',
                                 'enable_service_links': None,
                                 'host_aliases': None,
                                 'host_ipc': None,
                                 'host_network': None,
                                 'host_pid': None,
                                 'hostname': None,
                                 'image_pull_secrets': None,
                                 'init_containers': None,
                                 'node_name': None,
                                 'node_selector': None,
                                 'priority': None,
                                 'priority_class_name': None,
                                 'readiness_gates': None,
                                 'restart_policy': 'Always',
                                 'runtime_class_name': None,
                                 'scheduler_name': 'default-scheduler',
                                 'security_context': {'fs_group': None,
                                                      'run_as_group': None,
                                                      'run_as_non_root': None,
                                                      'run_as_user': None,
                                                      'se_linux_options': None,
                                                      'supplemental_groups': None,
                                                      'sysctls': None},
                                 'service_account': 'default-editor',
                                 'service_account_name': 'default-editor',
                                 'share_process_namespace': None,
                                 'subdomain': None,
                                 'termination_grace_period_seconds': 30,
                                 'tolerations': None,
                                 'volumes': [{'aws_elastic_block_store': None,
                                              'azure_disk': None,
                                              'azure_file': None,
                                              'cephfs': None,
                                              'cinder': None,
                                              'config_map': {'default_mode': 420,
                                                             'items': None,
                                                             'name': 'mnist-model',
                                                             'optional': None},
                                              'csi': None,
                                              'downward_api': None,
                                              'empty_dir': None,
                                              'fc': None,
                                              'flex_volume': None,
                                              'flocker': None,
                                              'gce_persistent_disk': None,
                                              'git_repo': None,
                                              'glusterfs': None,
                                              'host_path': None,
                                              'iscsi': None,
                                              'name': 'model-config',
                                              'nfs': None,
                                              'persistent_volume_claim': None,
                                              'photon_persistent_disk': None,
                                              'portworx_volume': None,
                                              'projected': None,
                                              'quobyte': None,
                                              'rbd': None,
                                              'scale_io': None,
                                              'secret': None,
                                              'storageos': None,
                                              'vsphere_volume': None}]}}},
  'status': {'available_replicas': None,
             'collision_count': None,
             'conditions': None,
             'observed_generation': None,
             'ready_replicas': None,
             'replicas': None,
             'unavailable_replicas': None,
             'updated_replicas': None}}, {'api_version': 'v1',
  'kind': 'Service',
  'metadata': {'annotations': {'prometheus.io/path': '/monitoring/prometheus/metrics',
                               'prometheus.io/port': '8500',
                               'prometheus.io/scrape': 'true'},
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': None,
               'initializers': None,
               'labels': {'app': 'mnist-model'},
               'managed_fields': None,
               'name': 'mnist-service',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '788913',
               'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-service',
               'uid': '555d8fc0-5848-11ea-9ddf-42010a80013f'},
  'spec': {'cluster_ip': '10.35.254.103',
           'external_i_ps': None,
           'external_name': None,
           'external_traffic_policy': None,
           'health_check_node_port': None,
           'load_balancer_ip': None,
           'load_balancer_source_ranges': None,
           'ports': [{'name': 'grpc-tf-serving',
                      'node_port': None,
                      'port': 9000,
                      'protocol': 'TCP',
                      'target_port': 9000},
                     {'name': 'http-tf-serving',
                      'node_port': None,
                      'port': 8500,
                      'protocol': 'TCP',
                      'target_port': 8500}],
           'publish_not_ready_addresses': None,
           'selector': {'app': 'mnist-model'},
           'session_affinity': 'None',
           'session_affinity_config': None,
           'type': 'ClusterIP'},
  'status': {'load_balancer': {'ingress': None}}}, {'api_version': 'v1',
  'binary_data': None,
  'data': {'monitoring_config.txt': 'prometheus_config: {\n'
                                    '  enable: true,\n'
                                    '  path: "/monitoring/prometheus/metrics"\n'
                                    '}'},
  'kind': 'ConfigMap',
  'metadata': {'annotations': None,
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': None,
               'initializers': None,
               'labels': None,
               'managed_fields': None,
               'name': 'mnist-model',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '788914',
               'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/configmaps/mnist-model',
               'uid': '5560bb37-5848-11ea-9ddf-42010a80013f'}}]
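
Once the Deployment is running, you may want to confirm that the model server has loaded the model before you deploy the UI. The sketch below builds the URL of TensorFlow Serving's REST model status endpoint (/v1/models/<model name>) from the service name, REST port, and model name used in the specs above. The helper names and the my-namespace placeholder are assumptions for illustration, and the request itself only succeeds from inside the cluster, so it is left commented out.

```python
import json
import urllib.request


def model_status_url(service, namespace, model_name, port=8500):
    """Build the in-cluster URL of TensorFlow Serving's model status endpoint."""
    host = f"{service}.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}/v1/models/{model_name}"


def fetch_model_status(url, timeout=10):
    """GET the status endpoint and return the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# "my-namespace" is a placeholder; in the notebook you would pass `namespace`.
status_url = model_status_url("mnist-service", "my-namespace", "mnist")
print(status_url)

# Only works from inside the cluster, once the model server pod is running:
# print(fetch_model_status(status_url))
```

Port 8500 is used here because the Deployment passes --rest_api_port=8500; port 9000 serves gRPC and does not answer REST requests.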

Deploy the UI for the MNIST web app

Deploy the UI to visualize the MNIST prediction results.

This example uses a prebuilt and public Docker image for the UI.


In [21]:
ui_name = "mnist-ui"
ui_deploy = f"""apiVersion: apps/v1
kind: Deployment
metadata:
  name: {ui_name}
  namespace: {namespace}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mnist-web-ui
  template:
    metadata:
      labels:
        app: mnist-web-ui
    spec:
      containers:
      - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225
        name: web-ui
        ports:
        - containerPort: 5000
      serviceAccount: default-editor
"""

ui_service = f"""apiVersion: v1
kind: Service
metadata:
  annotations:
  name: {ui_name}
  namespace: {namespace}
spec:
  ports:
  - name: http-mnist-ui
    port: 80
    targetPort: 5000
  selector:
    app: mnist-web-ui
  type: ClusterIP
"""

ui_virtual_service = f"""apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: {ui_name}
  namespace: {namespace}
spec:
  gateways:
  - kubeflow/kubeflow-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /mnist/{namespace}/ui/
    rewrite:
      uri: /
    route:
    - destination:
        host: {ui_name}.{namespace}.svc.cluster.local
        port:
          number: 80
    timeout: 300s
"""

ui_specs = [ui_deploy, ui_service, ui_virtual_service]

In [22]:
k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE)


Created Deployment kubeflow-sarahmaddox.mnist-ui
Created Service kubeflow-sarahmaddox.mnist-ui
Created VirtualService mnist-ui.mnist-ui
Out[22]:
[{'api_version': 'apps/v1',
  'kind': 'Deployment',
  'metadata': {'annotations': None,
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 32, 29, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': 1,
               'initializers': None,
               'labels': None,
               'managed_fields': None,
               'name': 'mnist-ui',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '790203',
               'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-ui',
               'uid': '9d846bf6-5848-11ea-9ddf-42010a80013f'},
  'spec': {'min_ready_seconds': None,
           'paused': None,
           'progress_deadline_seconds': 600,
           'replicas': 1,
           'revision_history_limit': 10,
           'selector': {'match_expressions': None,
                        'match_labels': {'app': 'mnist-web-ui'}},
           'strategy': {'rolling_update': {'max_surge': '25%',
                                           'max_unavailable': '25%'},
                        'type': 'RollingUpdate'},
           'template': {'metadata': {'annotations': None,
                                     'cluster_name': None,
                                     'creation_timestamp': None,
                                     'deletion_grace_period_seconds': None,
                                     'deletion_timestamp': None,
                                     'finalizers': None,
                                     'generate_name': None,
                                     'generation': None,
                                     'initializers': None,
                                     'labels': {'app': 'mnist-web-ui'},
                                     'managed_fields': None,
                                     'name': None,
                                     'namespace': None,
                                     'owner_references': None,
                                     'resource_version': None,
                                     'self_link': None,
                                     'uid': None},
                        'spec': {'active_deadline_seconds': None,
                                 'affinity': None,
                                 'automount_service_account_token': None,
                                 'containers': [{'args': None,
                                                 'command': None,
                                                 'env': None,
                                                 'env_from': None,
                                                 'image': 'gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225',
                                                 'image_pull_policy': 'IfNotPresent',
                                                 'lifecycle': None,
                                                 'liveness_probe': None,
                                                 'name': 'web-ui',
                                                 'ports': [{'container_port': 5000,
                                                            'host_ip': None,
                                                            'host_port': None,
                                                            'name': None,
                                                            'protocol': 'TCP'}],
                                                 'readiness_probe': None,
                                                 'resources': {'limits': None,
                                                               'requests': None},
                                                 'security_context': None,
                                                 'stdin': None,
                                                 'stdin_once': None,
                                                 'termination_message_path': '/dev/termination-log',
                                                 'termination_message_policy': 'File',
                                                 'tty': None,
                                                 'volume_devices': None,
                                                 'volume_mounts': None,
                                                 'working_dir': None}],
                                 'dns_config': None,
                                 'dns_policy': 'ClusterFirst',
                                 'enable_service_links': None,
                                 'host_aliases': None,
                                 'host_ipc': None,
                                 'host_network': None,
                                 'host_pid': None,
                                 'hostname': None,
                                 'image_pull_secrets': None,
                                 'init_containers': None,
                                 'node_name': None,
                                 'node_selector': None,
                                 'priority': None,
                                 'priority_class_name': None,
                                 'readiness_gates': None,
                                 'restart_policy': 'Always',
                                 'runtime_class_name': None,
                                 'scheduler_name': 'default-scheduler',
                                 'security_context': {'fs_group': None,
                                                      'run_as_group': None,
                                                      'run_as_non_root': None,
                                                      'run_as_user': None,
                                                      'se_linux_options': None,
                                                      'supplemental_groups': None,
                                                      'sysctls': None},
                                 'service_account': 'default-editor',
                                 'service_account_name': 'default-editor',
                                 'share_process_namespace': None,
                                 'subdomain': None,
                                 'termination_grace_period_seconds': 30,
                                 'tolerations': None,
                                 'volumes': None}}},
  'status': {'available_replicas': None,
             'collision_count': None,
             'conditions': None,
             'observed_generation': None,
             'ready_replicas': None,
             'replicas': None,
             'unavailable_replicas': None,
             'updated_replicas': None}}, {'api_version': 'v1',
  'kind': 'Service',
  'metadata': {'annotations': None,
               'cluster_name': None,
               'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 32, 29, tzinfo=tzlocal()),
               'deletion_grace_period_seconds': None,
               'deletion_timestamp': None,
               'finalizers': None,
               'generate_name': None,
               'generation': None,
               'initializers': None,
               'labels': None,
               'managed_fields': None,
               'name': 'mnist-ui',
               'namespace': 'kubeflow-sarahmaddox',
               'owner_references': None,
               'resource_version': '790209',
               'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-ui',
               'uid': '9d8a67e4-5848-11ea-9ddf-42010a80013f'},
  'spec': {'cluster_ip': '10.35.244.4',
           'external_i_ps': None,
           'external_name': None,
           'external_traffic_policy': None,
           'health_check_node_port': None,
           'load_balancer_ip': None,
           'load_balancer_source_ranges': None,
           'ports': [{'name': 'http-mnist-ui',
                      'node_port': None,
                      'port': 80,
                      'protocol': 'TCP',
                      'target_port': 5000}],
           'publish_not_ready_addresses': None,
           'selector': {'app': 'mnist-web-ui'},
           'session_affinity': 'None',
           'session_affinity_config': None,
           'type': 'ClusterIP'},
  'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',
  'kind': 'VirtualService',
  'metadata': {'creationTimestamp': '2020-02-26T03:32:29Z',
   'generation': 1,
   'name': 'mnist-ui',
   'namespace': 'kubeflow-sarahmaddox',
   'resourceVersion': '790211',
   'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-sarahmaddox/virtualservices/mnist-ui',
   'uid': '9d921512-5848-11ea-9ddf-42010a80013f'},
  'spec': {'gateways': ['kubeflow/kubeflow-gateway'],
   'hosts': ['*'],
   'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-sarahmaddox/ui/'}}],
     'rewrite': {'uri': '/'},
     'route': [{'destination': {'host': 'mnist-ui.kubeflow-sarahmaddox.svc.cluster.local',
        'port': {'number': 80}}}],
     'timeout': '300s'}]}}]

Access the MNIST web UI

The VirtualService created above adds a reverse proxy route for the UI to the Kubeflow IAP endpoint. The MNIST endpoint is:

  https://${KUBEFLOW_ENDPOINT}/mnist/${NAMESPACE}/ui/

where NAMESPACE is the namespace in which you're running the Jupyter notebook.


In [38]:
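if endpoint:
    # Read the URL prefix back from the VirtualService spec so that the link
    # always matches the route that was deployed.
    vs = yaml.safe_load(ui_virtual_service)
    path = vs["spec"]["http"][0]["match"][0]["uri"]["prefix"]
    ui_endpoint = endpoint + path
    display(HTML(f"mnist UI is at <a href='{ui_endpoint}'>{ui_endpoint}</a>"))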
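
The cell above concatenates the endpoint and the path directly, which produces a double slash if the endpoint already ends with one. As a minimal sketch, assuming the endpoint is a full URL such as the hypothetical https://kubeflow.example.com/, a small helper can normalize the join. The function name mnist_ui_url is an assumption and does not exist in the notebook's helper modules.

```python
def mnist_ui_url(endpoint, namespace):
    """Join the Kubeflow endpoint and the UI route prefix with exactly one slash."""
    prefix = f"/mnist/{namespace}/ui/"
    return endpoint.rstrip("/") + prefix


# Hypothetical endpoint and namespace, for illustration only.
print(mnist_ui_url("https://kubeflow.example.com/", "my-namespace"))
# prints https://kubeflow.example.com/mnist/my-namespace/ui/
```

The trailing slash on the result matters, because the VirtualService matches on the prefix /mnist/<namespace>/ui/.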




Open the MNIST UI in your browser. You should see an image of a handwritten digit from 0 to 9. This is a random image sent to the model for classification. Below the image is a set of bar graphs, one for each classification label from 0 to 9, as output by the model. Each bar represents the probability that the image matches the respective label.

Click the test random image button to send the model a new image.

Next steps

Visit the Kubeflow docs for more information about running Kubeflow on GCP.

