In [ ]:
!pip3 install 'tfx==0.15.0' --upgrade
!python3 -m pip install 'kfp>=0.1.35' --quiet
Enable the Dataflow API for your project before running the pipeline:
https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview
In [ ]:
# Directory and data locations (uses Google Cloud Storage).
import os
_input_bucket = '<your gcs bucket>'   # e.g. 'gs://my-bucket'
_output_bucket = '<your gcs bucket>'  # e.g. 'gs://my-bucket'
_pipeline_root = os.path.join(_output_bucket, 'tfx')
# Google Cloud Platform project id to use when deploying this pipeline.
_project_id = '<your project id>'
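Before moving on, it can help to sanity-check the values above (an optional check, not part of the original example):
In [ ]:
# Optional sanity check: both buckets must be gs:// paths and the project id must be set.
for name, value in [('_input_bucket', _input_bucket), ('_output_bucket', _output_bucket)]:
    assert value.startswith('gs://'), '%s must be a gs:// path, got %r' % (name, value)
assert not _project_id.startswith('<'), 'Set _project_id to your GCP project id.'
print('Pipeline root:', _pipeline_root)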
In [ ]:
# Copy the trainer code to a storage bucket; the TFX pipeline needs this code file to be available in GCS.
from tensorflow.compat.v1 import gfile
gfile.Copy('utils/taxi_utils.py', _input_bucket + '/taxi_utils.py')
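To confirm the copy succeeded, you can check that the file is now visible at its GCS destination (an optional check using the same gfile module):
In [ ]:
# Optional: verify the trainer module landed in the bucket.
assert gfile.Exists(_input_bucket + '/taxi_utils.py'), 'taxi_utils.py was not copied to GCS.'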
Run the %load command in the cell below to fetch the pipeline configuration file; %load replaces the cell contents with the file, which you can then edit and run.
Configure:
- _input_bucket: the GCS directory where you've copied taxi_utils.py, i.e. gs://<your bucket>/<path>.
- _output_bucket: the GCS directory where you want the results to be written.
The dataset in BigQuery has 100M rows; you can change the query parameters in the WHERE clause to limit the number of rows used.
In [ ]:
%load https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_gcp.py
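Once %load has replaced the cell contents with the pipeline definition, the edits you'd make near its top look roughly like this (a sketch only; the bucket and project values are hypothetical, and the bucket variable names follow the Configure notes above):
In [ ]:
# Hypothetical edits near the top of the loaded pipeline file.
_input_bucket = 'gs://my-taxi-bucket'    # GCS directory holding taxi_utils.py
_output_bucket = 'gs://my-taxi-bucket'   # GCS directory for pipeline outputs
_project_id = 'my-gcp-project'           # your GCP project id
# The BigQuery dataset has ~100M rows: tighten the WHERE clause of the query
# defined in this file to limit how many rows the pipeline reads.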
In [ ]:
import kfp
run_result = kfp.Client(
    host=None  # Replace with the Kubeflow Pipelines endpoint if this notebook is run outside of the Kubeflow cluster.
).create_run_from_pipeline_package('chicago_taxi_pipeline_kubeflow.tar.gz', arguments={})
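Optionally, block until the run finishes; wait_for_run_completion is part of the kfp client API (the one-hour timeout below is an arbitrary choice):
In [ ]:
# Optional: poll until the run completes (timeout in seconds) and print its final status.
run_detail = kfp.Client(host=None).wait_for_run_completion(run_result.run_id, timeout=3600)
print(run_detail.run.status)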