Continuous training with TFX and Cloud AI Platform

Learning Objectives

  1. Use the TFX CLI to build a TFX pipeline.
  2. Deploy a TFX pipeline on the managed AI Platform service.
  3. Create and monitor TFX pipeline runs using the TFX CLI and KFP UI.

In this lab, you use the TFX CLI utility to build and deploy a TFX pipeline that uses Kubeflow pipelines for orchestration, AI Platform for model training, and a managed AI Platform Pipeline instance (Kubeflow Pipelines) that runs on a Kubernetes cluster for compute. You will then create and monitor pipeline runs using the TFX CLI as well as the KFP UI.

Setup


In [ ]:
import yaml

# Set `PATH` to include the directory containing TFX CLI and skaffold.
PATH=%env PATH
%env PATH=/home/jupyter/.local/bin:{PATH}

In [ ]:
!python -c "import tfx; print('TFX version: {}'.format(tfx.__version__))"
!python -c "import kfp; print('KFP version: {}'.format(kfp.__version__))"

Note: this lab was built and tested with the following package versions:

TFX version: 0.21.4
KFP version: 0.5.1

If running the above command results in different package versions or you receive an import error, upgrade to the correct versions by running the cell below:


In [ ]:
%pip install --upgrade --user tfx==0.21.4
%pip install --upgrade --user kfp==0.5.1

Note: you may need to restart the kernel to pick up the correct package versions.

Understanding the pipeline design

The pipeline source code can be found in the pipeline folder.


In [ ]:
%cd pipeline

In [ ]:
!ls -la

The config.py module configures the default values for environment-specific settings and for pipeline runtime parameters. These defaults can be overridden at compile time by providing updated values in a set of environment variables.
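As a rough sketch of this override mechanism (the helper and setting names below are illustrative, not the exact ones in config.py), a compile-time default can be read like this:

```python
import os

# Illustrative sketch of a config.py-style override: each setting falls back
# to a hard-coded default unless a same-named environment variable is set.
def get_setting(name, default):
    return os.environ.get(name, default)

# Simulate the compile-time environment set via `%env` in this notebook.
os.environ["GCP_REGION"] = "us-central1"

gcp_region = get_setting("GCP_REGION", "us-west1")           # env override wins
pipeline_name = get_setting("PIPELINE_NAME", "my_pipeline")  # default is used
```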

The pipeline.py module contains the TFX DSL defining the workflow implemented by the pipeline.

The preprocessing.py module implements the data preprocessing logic used by the Transform component.

The model.py module implements the training logic for the Train component.

The runner.py module configures and executes KubeflowDagRunner. At compile time, the KubeflowDagRunner.run() method converts the TFX DSL into a pipeline package in the Argo format.

The features.py module contains feature definitions common across preprocessing.py and model.py.

Building and deploying the pipeline

You will use the TFX CLI to compile and deploy the pipeline. As explained in the previous section, environment-specific settings can be provided through a set of environment variables and are embedded into the pipeline package at compile time.

Exercise: Create AI Platform Pipelines cluster

Navigate to AI Platform Pipelines page in the Google Cloud Console.

1. Create or select an existing Kubernetes cluster (GKE) and deploy AI Platform. Make sure to select "Allow access to the following Cloud APIs https://www.googleapis.com/auth/cloud-platform" so that the Kubeflow SDK can access your pipeline programmatically for the rest of the lab. Also, provide an App instance name such as "tfx" or "mlops". Note: you may have already deployed an AI Pipelines instance during the Setup for the lab series; if so, you can proceed with that instance in the next step.

Validate the deployment of your AI Platform Pipelines instance in the console before proceeding.

2. Configure your environment settings.

Update the below constants with the settings reflecting your lab environment.

  • GCP_REGION - the compute region for AI Platform Training and Prediction
  • ARTIFACT_STORE - the GCS bucket created during installation of AI Platform Pipelines. The bucket name will contain the kubeflowpipelines- prefix.

In [ ]:
# Use the following command to identify the GCS bucket for metadata and pipeline storage.
!gsutil ls
  • ENDPOINT - set the ENDPOINT constant to the endpoint of your AI Platform Pipelines instance. The endpoint can be found on the AI Platform Pipelines page in the Google Cloud Console: open the SETTINGS for your instance and use the value of the host variable in the Connect to this Kubeflow Pipelines instance from a Python client via Kubeflow Pipelines SDK section of the SETTINGS window. The format is '....[region].pipelines.googleusercontent.com'.
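As a quick sanity check before using the value you copied, a hypothetical helper (not part of the lab code) can verify it has the expected shape, i.e. a bare host with the Kubeflow Pipelines domain suffix and no `https://` scheme:

```python
# Hypothetical helper: check that an endpoint string looks like a Kubeflow
# Pipelines host (correct domain suffix, no URL scheme) before passing it
# to the TFX CLI. The example host below is made up for illustration.
def looks_like_pipelines_endpoint(endpoint):
    return (
        endpoint.endswith(".pipelines.googleusercontent.com")
        and not endpoint.startswith("http")
    )

print(looks_like_pipelines_endpoint("example.us-central1.pipelines.googleusercontent.com"))  # True
print(looks_like_pipelines_endpoint("https://example.com"))  # False
```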

In [ ]:
#TODO: Set your environment settings here for GCP_REGION, ENDPOINT, and ARTIFACT_STORE_URI.
GCP_REGION = ''
ENDPOINT = ''
ARTIFACT_STORE_URI = ''

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]

Compile the pipeline

You can build and upload the pipeline to the AI Platform Pipelines instance in one step, using the tfx pipeline create command. The tfx pipeline create command goes through the following steps:

  • (Optional) Builds the custom image that provides a runtime environment for TFX components,
  • Compiles the pipeline DSL into a pipeline package,
  • Uploads the pipeline package to the instance.

As you debug the pipeline DSL, you may prefer to first use the tfx pipeline compile command, which only executes the compilation step. After the DSL compiles successfully you can use tfx pipeline create to go through all steps.

Set the pipeline's compile time settings

The pipeline can run using the security context of the GKE default node pool's service account or the service account defined in the user-gcp-sa secret of the Kubernetes namespace hosting Kubeflow Pipelines. If you want to use the user-gcp-sa service account, change the value of USE_KFP_SA to True.

Note that the default AI Platform Pipelines configuration does not define the user-gcp-sa secret.
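Also note that USE_KFP_SA reaches config.py as a string (that is what `%env` exports), so the module must coerce it to a boolean; a naive bool() would treat the string "False" as truthy. A minimal sketch of a safe coercion (the actual logic in config.py may differ):

```python
import os

# Simulate what `%env USE_KFP_SA={USE_KFP_SA}` produces: the *string* "False".
os.environ["USE_KFP_SA"] = "False"

raw = os.environ.get("USE_KFP_SA", "False")
use_kfp_sa = raw.lower() == "true"  # string compare, not bool(raw)

print(bool(raw))    # True  -- the pitfall: any non-empty string is truthy
print(use_kfp_sa)   # False -- the intended value
```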


In [ ]:
PIPELINE_NAME = 'tfx_covertype_continuous_training'
MODEL_NAME = 'tfx_covertype_classifier'

USE_KFP_SA=False
DATA_ROOT_URI = 'gs://workshop-datasets/covertype/small'
CUSTOM_TFX_IMAGE = 'gcr.io/{}/{}'.format(PROJECT_ID, PIPELINE_NAME)
RUNTIME_VERSION = '2.1'
PYTHON_VERSION = '3.7'

In [ ]:
%env PROJECT_ID={PROJECT_ID}
%env KUBEFLOW_TFX_IMAGE={CUSTOM_TFX_IMAGE}
%env ARTIFACT_STORE_URI={ARTIFACT_STORE_URI}
%env DATA_ROOT_URI={DATA_ROOT_URI}
%env GCP_REGION={GCP_REGION}
%env MODEL_NAME={MODEL_NAME}
%env PIPELINE_NAME={PIPELINE_NAME}
%env RUNTIME_VERSION={RUNTIME_VERSION}
%env PYTHON_VERSION={PYTHON_VERSION}
%env USE_KFP_SA={USE_KFP_SA}

In [ ]:
!tfx pipeline compile --engine kubeflow --pipeline_path runner.py

Exercise: Deploy the pipeline package to AI Platform Pipelines

In this exercise, you will deploy your compiled pipeline code (e.g. gcr.io/[PROJECT_ID]/tfx_covertype_continuous_training) to run on AI Platform Pipelines with the TFX CLI.

Hint: review the TFX CLI documentation on the "pipeline group" to create your pipeline. You will need to specify the --pipeline_path to point at the pipeline DSL defined locally in runner.py, --endpoint, and --build_target_image arguments using the environment variables specified above.


In [ ]:
# TODO: Your code here to use the TFX CLI to deploy your pipeline image to AI Platform Pipelines.

If you need to redeploy the pipeline you can first delete the previous version using tfx pipeline delete or you can update the pipeline in-place using tfx pipeline update.

To delete the pipeline:

tfx pipeline delete --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

To update the pipeline:

tfx pipeline update --pipeline_path runner.py --endpoint {ENDPOINT}

Exercise: Create and monitor a pipeline run

In this exercise, you will trigger pipeline runs using the TFX CLI from this notebook as well as from the KFP UI.

1. Trigger a pipeline run using the TFX CLI.

Hint: review the TFX CLI documentation on the "run group".


In [ ]:
# TODO: Your code here to trigger a pipeline run with the TFX CLI

2. Trigger a pipeline run from the KFP UI.

On the AI Platform Pipelines page, click OPEN PIPELINES DASHBOARD. A new tab will open. Select the Pipelines tab on the left; you will see the tfx_covertype_continuous_training pipeline you deployed previously. Click the pipeline name to open a window with a graphical display of your TFX pipeline. Next, click the Create a run button. Verify that the Pipeline name and Pipeline version are pre-populated, optionally provide a Run name and an Experiment to logically group the run metadata under, and then hit Start.

Note: each full pipeline run takes about 45 minutes to 1 hour. While the pipeline executes, take the time to review the metadata artifacts created in the GCS storage bucket for each component, including data splits, your TensorFlow SavedModel, model evaluation results, etc.

Additionally, to list all active runs of the pipeline, you can run:


In [ ]:
!tfx run list --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

To retrieve the status of a given run:


In [ ]:
RUN_ID='[YOUR RUN ID]'

!tfx run status --pipeline_name {PIPELINE_NAME} --run_id {RUN_ID} --endpoint {ENDPOINT}

Next Steps

In this lab, you learned how to manually build and deploy a TFX pipeline to AI Platform Pipelines and trigger pipeline runs from a notebook. In the next lab, you will construct a Cloud Build CI/CD workflow that automatically builds and deploys this same TFX pipeline.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.