TFX on KubeFlow Pipelines Example

This notebook should be run inside a Kubeflow Pipelines cluster.

Install TFX and KFP packages


In [ ]:
!pip3 install 'tfx==0.15.0' --upgrade
!python3 -m pip install 'kfp>=0.1.35' --quiet

Get the TFX repo with sample pipeline


In [ ]:
# Directory and data locations (uses Google Cloud Storage).
import os
_input_bucket = '<your gcs bucket>'
_output_bucket = '<your gcs bucket>'
_pipeline_root = os.path.join(_output_bucket, 'tfx')

# Google Cloud Platform project id to use when deploying this pipeline.
_project_id = '<your project id>'

In [ ]:
# Copy the trainer code to a storage bucket; the TFX pipeline needs this code file in GCS.
from tensorflow.compat.v1 import gfile
gfile.Copy('utils/taxi_utils.py', _input_bucket + '/taxi_utils.py')

Configure the TFX pipeline example

Run the %load command below to pull the pipeline configuration file into the notebook:

%load https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_gcp.py

Configure:

  • Set _input_bucket to the GCS directory where you've copied taxi_utils.py, e.g. gs://<your-bucket>/<path>.
  • Set _output_bucket to the GCS directory where you want the results to be written.
  • Set the GCP project ID (replace my-gcp-project). Note that it should be the project ID, not the project name.

The dataset in BigQuery has roughly 100M rows; you can adjust the query parameters in the WHERE clause to limit the number of rows used.


In [ ]:
%load https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_gcp.py

Submit pipeline for execution on the Kubeflow cluster


In [ ]:
import kfp

run_result = kfp.Client(
    host=None  # Replace with the Kubeflow Pipelines endpoint if this notebook runs outside the Kubeflow cluster.
).create_run_from_pipeline_package('chicago_taxi_pipeline_kubeflow.tar.gz', arguments={})