Guided Project 3

Learning Objective:

  • Learn how to customize the tfx template to your own dataset
  • Learn how to modify the Keras model scaffold provided by the tfx template

In this guided project, we will use the tfx template tool to create a TFX pipeline for the covertype project. This time, instead of re-using an already implemented model as we did in guided project 2, we will adapt the model scaffold generated by tfx template so that it can train on the covertype dataset.

Note: The covertype dataset is located at

gs://workshop-datasets/covertype/small/dataset.csv
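
If you want to take a quick look at the raw data before wiring it into the pipeline, you can preview the first few rows directly from GCS (a simple sanity check using gsutil; it assumes your notebook has read access to this public bucket):


In [ ]:
!gsutil cat gs://workshop-datasets/covertype/small/dataset.csv | head -n 5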

In [ ]:
import os

Step 1. Environment setup

Environment Variables

Set up your Kubeflow Pipelines endpoint below, the same way you did in guided projects 1 & 2.
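
As a reminder, the endpoint is the hostname of your Kubeflow Pipelines dashboard; on AI Platform Pipelines it typically looks like <id>-dot-<region>.pipelines.googleusercontent.com (this format is only an assumption here; copy the exact value you used in guided projects 1 & 2).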


In [ ]:
ENDPOINT = # Enter your Kubeflow ENDPOINT here.

In [ ]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

In [ ]:
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GOOGLE_CLOUD_PROJECT=shell_output[0]

%env GOOGLE_CLOUD_PROJECT={GOOGLE_CLOUD_PROJECT}

In [ ]:
# Docker image name for the pipeline image.
CUSTOM_TFX_IMAGE = 'gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'
CUSTOM_TFX_IMAGE

tfx and kfp tools setup


In [ ]:
%%bash

TFX_PKG="tfx==0.22.0"
KFP_PKG="kfp==0.5.1"

pip freeze | grep $TFX_PKG || pip install -Uq $TFX_PKG
pip freeze | grep $KFP_PKG || pip install -Uq $KFP_PKG

You may need to restart the kernel at this point.
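
After restarting (if needed), you can double-check that the expected versions are installed (a quick sanity check re-using pip freeze):


In [ ]:
!pip freeze | grep -E 'tfx==|kfp=='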

skaffold tool setup


In [ ]:
%%bash

LOCAL_BIN="/home/jupyter/.local/bin"
SKAFFOLD_URI="https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64"

test -d $LOCAL_BIN || mkdir -p $LOCAL_BIN

which skaffold || (
    curl -Lo skaffold $SKAFFOLD_URI &&
    chmod +x skaffold               &&
    mv skaffold $LOCAL_BIN
)

The PATH environment variable was already modified above so that skaffold is available.

At this point, you should see the skaffold tool with the which command:


In [ ]:
!which skaffold

Step 2. Copy the predefined template to your project directory.

In this step, we will create a working pipeline project directory and files by copying additional files from a predefined template.

You may give your pipeline a different name by changing the PIPELINE_NAME below.

This will also become the name of the project directory where your files will be put.


In [ ]:
PIPELINE_NAME = # Your pipeline name
PROJECT_DIR = os.path.join(os.path.expanduser("."), PIPELINE_NAME)
PROJECT_DIR

TFX includes the taxi template with the TFX python package.

If you are planning to solve a point-wise prediction problem, including classification and regression, this template could be used as a starting point.

The tfx template copy CLI command copies predefined template files into your project directory.


In [ ]:
!tfx template copy \
  --pipeline-name={PIPELINE_NAME} \
  --destination-path={PROJECT_DIR} \
  --model=taxi

In [ ]:
%cd {PROJECT_DIR}

Step 3. Browse your copied source files

The TFX template provides basic scaffold files to build a pipeline, including Python source code, sample data, and Jupyter Notebooks to analyse the output of the pipeline.

The taxi template uses the same Chicago Taxi dataset and ML model as the Airflow Tutorial.

Here is a brief introduction to each of the Python files:

pipeline - This directory contains the definition of the pipeline.

  • configs.py — defines common constants for pipeline runners
  • pipeline.py — defines TFX components and a pipeline

models - This directory contains ML model definitions.

  • features.py, features_test.py — define features for the model
  • preprocessing.py, preprocessing_test.py — define preprocessing jobs using tf.Transform

models/estimator - This directory contains an Estimator based model.

  • constants.py — defines constants of the model
  • model.py, model_test.py — define a DNN model using TF Estimator

models/keras - This directory contains a Keras based model.

  • constants.py — defines constants of the model
  • model.py, model_test.py — define a DNN model using Keras

beam_dag_runner.py, kubeflow_dag_runner.py — define runners for each orchestration engine
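
You can verify that these files were copied into your project directory, for instance by listing the Python sources (the exact file list may vary slightly with your TFX version):


In [ ]:
!find . -name '*.py' | sort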

Running the tests: You might notice that there are some files with _test.py in their name. These are unit tests of the pipeline, and it is recommended to add more unit tests as you implement your own pipelines. You can run unit tests by supplying the module name of the test files with the -m flag. You can usually get a module name by deleting the .py extension and replacing / with ..

For example:


In [ ]:
!python -m models.features_test
!python -m models.keras.model_test

Step 4. Create the artifact store bucket

Note: You have probably already completed this step in guided project 1, so you may skip it if this is the case.

Components in the TFX pipeline will generate outputs for each run as ML Metadata Artifacts, and they need to be stored somewhere. You can use any storage which the KFP cluster can access, and for this example we will use Google Cloud Storage (GCS).

Let us create this bucket if you haven't created it in guided project 1. Its name will be <YOUR_PROJECT>-kubeflowpipelines-default.


In [ ]:
GCS_BUCKET_NAME = GOOGLE_CLOUD_PROJECT + '-kubeflowpipelines-default'
GCS_BUCKET_NAME

In [ ]:
!gsutil ls gs://{GCS_BUCKET_NAME} | grep {GCS_BUCKET_NAME} || gsutil mb gs://{GCS_BUCKET_NAME}

Step 5. Ingest the data into the pipeline

So far we have a TFX pipeline scaffold built around the Chicago Taxi dataset. Now it's time to put the covertype data into the pipeline.

Your data can be stored anywhere your pipeline can access, including GCS or BigQuery. You will need to modify the pipeline definition to access your data.

Review the steps in guided project 1 and guided project 2 to remember what needs to be customized in full detail. You'll find below a short summary of these steps:

  1. If your data is stored in files, modify the DATA_PATH in kubeflow_dag_runner.py and set it to the location of your files. If your data is stored in BigQuery, modify BIG_QUERY_QUERY in pipeline/configs.py to correctly query for your data.
  2. Add your dataset's features in models/features.py (see the sketch after this list).
  3. Modify models/preprocessing.py to transform input data for training.
  4. Modify models/keras/model.py and models/keras/constants.py to describe your ML model.
    • You can use an Estimator-based model, too. Change the RUN_FN constant to models.estimator.model.run_fn in pipeline/configs.py.
  5. Modify pipeline.py and configs.py so that you can train and deploy on Cloud AI Platform (CAIP).
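
As an illustration of step 2, here is a minimal sketch of what the feature definitions in models/features.py might look like for the covertype dataset. The column names are assumptions based on the covertype CSV header, and the constant names (NUMERIC_FEATURE_KEYS, CATEGORICAL_FEATURE_KEYS, LABEL_KEY) are illustrative rather than the template's exact naming; adapt both to your copy of the scaffold:

# Illustrative sketch only -- the column names below are assumed from the
# covertype CSV header; check gs://workshop-datasets/covertype/small/dataset.csv.
NUMERIC_FEATURE_KEYS = [
    'Elevation', 'Aspect', 'Slope',
    'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
    'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
]

CATEGORICAL_FEATURE_KEYS = ['Wilderness_Area', 'Soil_Type']

# The label column: the forest cover type to predict.
LABEL_KEY = 'Cover_Type'


def transformed_name(key):
    """Appends a suffix so transformed features don't clash with raw ones."""
    return key + '_xf'

Similarly, step 1 mostly amounts to pointing the runner at the covertype CSV, for example setting DATA_PATH to gs://workshop-datasets/covertype/small/ in kubeflow_dag_runner.py (the exact variable name and expected path format may differ slightly in your copy of the template).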

Bonus Exercise (Optional)

Create a TFX pipeline as we did in this guided project but this time with your own dataset instead of the covertype dataset.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.