Continuous training with TFX and Cloud AI Platform

Learning Objectives

  1. Use the TFX CLI to build a TFX pipeline.
  2. Deploy a TFX pipeline on the managed AI Platform service.
  3. Create and monitor TFX pipeline runs using the TFX CLI and KFP UI.

In this lab, you use the TFX CLI utility to build and deploy a TFX pipeline that uses Kubeflow Pipelines for orchestration, AI Platform for model training, and a managed AI Platform Pipelines instance (Kubeflow Pipelines) running on a Kubernetes cluster for compute. You then create and monitor pipeline runs using both the TFX CLI and the KFP UI.

Setup


In [ ]:
import yaml

# Set `PATH` to include the directory containing TFX CLI and skaffold.
PATH=%env PATH
%env PATH=/home/jupyter/.local/bin:{PATH}

In [ ]:
!python -c "import tfx; print('TFX version: {}'.format(tfx.__version__))"
!python -c "import kfp; print('KFP version: {}'.format(kfp.__version__))"

Note: this lab was built and tested with the following package versions:

TFX version: 0.21.4
KFP version: 0.5.1

If running the above command results in different package versions or you receive an import error, upgrade to the correct versions by running the cell below:


In [ ]:
%pip install --upgrade --user tfx==0.21.4
%pip install --upgrade --user kfp==0.5.1

Note: you may need to restart the kernel to pick up the correct package versions.

Understanding the pipeline design

The pipeline source code can be found in the pipeline folder.


In [ ]:
%cd pipeline

In [ ]:
!ls -la

The config.py module configures the default values for the environment-specific settings and for the pipeline runtime parameters. These defaults can be overridden at compile time by providing updated values in a set of environment variables.
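As an illustration of this pattern (a sketch, not the actual contents of config.py), each setting falls back to a hard-coded default unless an environment variable of the same name overrides it at compile time:

```python
import os

# Sketch of the config.py pattern: each setting reads its environment
# variable if set, otherwise falls back to a default value.
PIPELINE_NAME = os.environ.get('PIPELINE_NAME', 'tfx_covertype_continuous_training')
DATA_ROOT_URI = os.environ.get('DATA_ROOT_URI', 'gs://workshop-datasets/covertype/small')
GCP_REGION = os.environ.get('GCP_REGION', 'us-central1')
```

Because the lookup happens when the module is imported, exporting the variables (as done later in this lab with %env) before compiling the pipeline is what makes the overrides take effect.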

The pipeline.py module contains the TFX DSL defining the workflow implemented by the pipeline.

The preprocessing.py module implements the data preprocessing logic for the Transform component.

The model.py module implements the training logic for the Train component.

The runner.py module configures and executes KubeflowDagRunner. At compile time, the KubeflowDagRunner.run() method converts the TFX DSL into the pipeline package in the Argo format.

The features.py module contains feature definitions common across preprocessing.py and model.py.

Building and deploying the pipeline

You will use the TFX CLI to compile and deploy the pipeline. As explained in the previous section, environment-specific settings can be provided through a set of environment variables and embedded into the pipeline package at compile time.

Exercise: Create AI Platform Pipelines cluster

Navigate to AI Platform Pipelines page in the Google Cloud Console.

1. Create or select an existing Kubernetes cluster (GKE) and deploy AI Platform Pipelines. Make sure to select "Allow access to the following Cloud APIs https://www.googleapis.com/auth/cloud-platform" so that the Kubeflow Pipelines SDK can programmatically access your pipeline for the rest of the lab. Also, provide an App instance name such as "tfx" or "mlops". Note: you may have already deployed an AI Platform Pipelines instance during the setup for the lab series. If so, you can proceed using that instance in the next step.

Validate the deployment of your AI Platform Pipelines instance in the console before proceeding.

2. Configure your environment settings.

Update the below constants with the settings reflecting your lab environment.

  • GCP_REGION - the compute region for AI Platform Training and Prediction
  • ARTIFACT_STORE - the GCS bucket created during installation of AI Platform Pipelines. The bucket name will contain the kubeflowpipelines- prefix.

In [ ]:
# Use the following command to identify the GCS bucket for metadata and pipeline storage.
!gsutil ls
  • ENDPOINT - set the ENDPOINT constant to the endpoint of your AI Platform Pipelines instance. The endpoint can be found on the AI Platform Pipelines page in the Google Cloud Console. Open the SETTINGS for your instance and use the value of the host variable in the Connect to this Kubeflow Pipelines instance from a Python client via Kubeflow Pipelines SDK section of the SETTINGS window. The format is '....[region].pipelines.googleusercontent.com'.

In [ ]:
GCP_REGION = 'us-central1'
ENDPOINT = '490ab949a23d5f6d-dot-us-central2.pipelines.googleusercontent.com'
ARTIFACT_STORE_URI = 'gs://hostedkfp-default-36un4wco1q'

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]
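Optionally, a quick sanity check on these settings can catch copy-paste mistakes before you compile the pipeline. The helper below is a hypothetical addition, not part of the original lab; the expected formats follow the descriptions above:

```python
# Hypothetical helper: basic format checks for the lab settings.
def validate_settings(gcp_region, endpoint, artifact_store_uri):
    assert artifact_store_uri.startswith('gs://'), 'ARTIFACT_STORE_URI must be a GCS URI'
    assert endpoint.endswith('pipelines.googleusercontent.com'), 'unexpected ENDPOINT format'
    assert gcp_region, 'GCP_REGION must not be empty'

# Example with the sample values shown in this lab:
validate_settings('us-central1',
                  '490ab949a23d5f6d-dot-us-central2.pipelines.googleusercontent.com',
                  'gs://hostedkfp-default-36un4wco1q')
```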

Compile the pipeline

You can build and upload the pipeline to the AI Platform Pipelines instance in one step, using the tfx pipeline create command. The tfx pipeline create command goes through the following steps:

  • (Optional) Builds the custom image that provides a runtime environment for TFX components,
  • Compiles the pipeline DSL into a pipeline package,
  • Uploads the pipeline package to the instance.

As you debug the pipeline DSL, you may prefer to first use the tfx pipeline compile command, which only executes the compilation step. After the DSL compiles successfully you can use tfx pipeline create to go through all steps.

Set the pipeline's compile time settings

The pipeline can run using the security context of the GKE default node pool's service account or the service account defined in the user-gcp-sa secret of the Kubernetes namespace hosting Kubeflow Pipelines. To use the user-gcp-sa service account, change the value of USE_KFP_SA to True.

Note that the default AI Platform Pipelines configuration does not define the user-gcp-sa secret.
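One subtlety worth noting: environment variables exported with %env are always strings, so a setting like USE_KFP_SA arrives in config.py as the text 'False', which is truthy in Python. The sketch below illustrates how such a value can be parsed back into a boolean (an illustration of the pattern, not necessarily how config.py implements it):

```python
# Environment variables are strings, so the text 'False' would be truthy.
# Sketch of parsing a boolean-like setting back into a real bool:
def parse_bool(value, default=False):
    if value is None:
        return default
    return value.strip().lower() in ('true', '1', 'yes')

# e.g. USE_KFP_SA exported via %env arrives as the string 'False'
use_kfp_sa = parse_bool('False')
```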


In [ ]:
PIPELINE_NAME = 'tfx_covertype_continuous_training'
MODEL_NAME = 'tfx_covertype_classifier'

USE_KFP_SA=False
DATA_ROOT_URI = 'gs://workshop-datasets/covertype/small'
CUSTOM_TFX_IMAGE = 'gcr.io/{}/{}'.format(PROJECT_ID, PIPELINE_NAME)
RUNTIME_VERSION = '2.1'
PYTHON_VERSION = '3.7'

In [ ]:
%env PROJECT_ID={PROJECT_ID}
%env KUBEFLOW_TFX_IMAGE={CUSTOM_TFX_IMAGE}
%env ARTIFACT_STORE_URI={ARTIFACT_STORE_URI}
%env DATA_ROOT_URI={DATA_ROOT_URI}
%env GCP_REGION={GCP_REGION}
%env MODEL_NAME={MODEL_NAME}
%env PIPELINE_NAME={PIPELINE_NAME}
%env RUNTIME_VERSION={RUNTIME_VERSION}
%env PYTHON_VERSION={PYTHON_VERSION}
%env USE_KFP_SA={USE_KFP_SA}

In [ ]:
!tfx pipeline compile --engine kubeflow --pipeline_path runner.py

Deploy the pipeline package to AI Platform Pipelines

After the pipeline code compiles without errors, you can use the tfx pipeline create command to perform the full build and deploy the pipeline. You will deploy your compiled pipeline container image (e.g. gcr.io/[PROJECT_ID]/tfx_covertype_continuous_training) to run on AI Platform Pipelines with the TFX CLI.


In [ ]:
!tfx pipeline create  \
--pipeline_path=runner.py \
--endpoint={ENDPOINT} \
--build_target_image={CUSTOM_TFX_IMAGE}

If you need to redeploy the pipeline you can first delete the previous version using tfx pipeline delete or you can update the pipeline in-place using tfx pipeline update.

To delete the pipeline:

tfx pipeline delete --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

To update the pipeline:

tfx pipeline update --pipeline_path runner.py --endpoint {ENDPOINT}

Create and monitor a pipeline run

After the pipeline has been deployed, you can trigger and monitor pipeline runs using TFX CLI or KFP UI.

1. Trigger a pipeline run using the TFX CLI.


In [ ]:
!tfx run create --pipeline_name={PIPELINE_NAME} --endpoint={ENDPOINT}

2. Trigger a pipeline run from the KFP UI.

On the AI Platform Pipelines page, click OPEN PIPELINES DASHBOARD. A new tab will open. Select the Pipelines tab on the left; you will see the tfx_covertype_continuous_training pipeline you deployed previously. Click on the pipeline name, which will open a window with a graphical display of your TFX pipeline. Next, click the Create a run button. Verify that the Pipeline name and Pipeline version are pre-populated, and optionally provide a Run name and an Experiment to logically group the run metadata under, before clicking Start.

Note: each full pipeline run takes about 45 minutes to 1 hour. While the pipeline executes, take the time to review the pipeline metadata artifacts created in the GCS storage bucket for each component, including the data splits, your TensorFlow SavedModel, model evaluation results, etc.

To list all active runs of the pipeline:


In [ ]:
!tfx run list --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

To retrieve the status of a given run:


In [ ]:
RUN_ID='[YOUR RUN ID]'

!tfx run status --pipeline_name {PIPELINE_NAME} --run_id {RUN_ID} --endpoint {ENDPOINT}

Next Steps

In this lab, you learned how to manually build and deploy a TFX pipeline to AI Platform Pipelines and trigger pipeline runs from a notebook. In the next lab, you will construct a Cloud Build CI/CD workflow that automatically builds and deploys this same TFX pipeline.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.