Learning Objective:
In this guided project, we will use the tfx template tool to create a TFX pipeline for the covertype project. This time, instead of re-using an already implemented model as we did in guided project 2, we will adapt the model scaffold generated by tfx template so that it can train on the covertype dataset.
Note: The covertype dataset is located at
gs://workshop-datasets/covertype/small/dataset.csv
In [ ]:
import os
Set up your Kubeflow Pipelines endpoint below, the same way you did in guided projects 1 and 2.
In [ ]:
ENDPOINT = # Enter your Kubeflow ENDPOINT here.
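For reference, with AI Platform Pipelines the endpoint is typically the hostname of your Pipelines dashboard URL. The value below is a made-up placeholder illustrating the expected format, not a real deployment:
# Hypothetical placeholder -- substitute the hostname of your own KFP deployment.
# ENDPOINT = '1a2b3c4d5e6f7890-dot-us-central1.pipelines.googleusercontent.com'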
In [ ]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin
In [ ]:
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GOOGLE_CLOUD_PROJECT=shell_output[0]
%env GOOGLE_CLOUD_PROJECT={GOOGLE_CLOUD_PROJECT}
In [ ]:
# Docker image name for the pipeline image.
CUSTOM_TFX_IMAGE = 'gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'
CUSTOM_TFX_IMAGE
In [ ]:
%%bash
TFX_PKG="tfx==0.22.0"
KFP_PKG="kfp==0.5.1"
pip freeze | grep $TFX_PKG || pip install -Uq $TFX_PKG
pip freeze | grep $KFP_PKG || pip install -Uq $KFP_PKG
You may need to restart the kernel at this point.
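If you prefer to restart the kernel from code rather than from the notebook menu, one optional snippet (not part of the original lab) is:
# Optional: shut down and restart the current kernel so the freshly
# installed packages are picked up. Rerun the cells above after the restart.
import IPython
IPython.Application.instance().kernel.do_shutdown(restart=True)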
In [ ]:
%%bash
LOCAL_BIN="/home/jupyter/.local/bin"
SKAFFOLD_URI="https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64"
test -d $LOCAL_BIN || mkdir -p $LOCAL_BIN
which skaffold || (
curl -Lo skaffold $SKAFFOLD_URI &&
chmod +x skaffold &&
mv skaffold $LOCAL_BIN
)
The PATH environment variable was already modified above so that skaffold, installed into /home/jupyter/.local/bin, is available from the command line. At this point, you should see the skaffold tool with the which command:
In [ ]:
!which skaffold
In this step, we will create a working pipeline project directory and files by copying additional files from a predefined template.
You may give your pipeline a different name by changing the PIPELINE_NAME below.
This will also become the name of the project directory where your files will be put.
In [ ]:
PIPELINE_NAME = # Your pipeline name
PROJECT_DIR = os.path.join(os.path.expanduser("."), PIPELINE_NAME)
PROJECT_DIR
TFX includes the taxi template with the TFX python package.
If you are planning to solve a point-wise prediction problem, including classification and regression, this template could be used as a starting point.
The tfx template copy
CLI command copies predefined template files into your project directory.
In [ ]:
!tfx template copy \
--pipeline-name={PIPELINE_NAME} \
--destination-path={PROJECT_DIR} \
--model=taxi
In [ ]:
%cd {PROJECT_DIR}
The TFX template provides basic scaffold files to build a pipeline, including Python source code, sample data, and Jupyter Notebooks to analyse the output of the pipeline.
The taxi
template uses the same Chicago Taxi dataset and ML model as
the Airflow Tutorial.
Here is a brief introduction to each of the Python files:

pipeline - This directory contains the definition of the pipeline.
    configs.py — defines common constants for pipeline runners
    pipeline.py — defines TFX components and a pipeline
models - This directory contains ML model definitions.
    features.py, features_test.py — defines features for the model
    preprocessing.py, preprocessing_test.py — defines preprocessing jobs using tf.Transform
models/estimator - This directory contains an Estimator based model.
    constants.py — defines constants of the model
    model.py, model_test.py — defines a DNN model using TF Estimator
models/keras - This directory contains a Keras based model.
    constants.py — defines constants of the model
    model.py, model_test.py — defines a DNN model using Keras
beam_dag_runner.py, kubeflow_dag_runner.py — define runners for each orchestration engine
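To double-check what was copied, you can list the scaffold's Python files (a quick sanity check, assuming the %cd into the project directory above succeeded):
# List the Python files of the scaffold relative to the project directory.
!find . -name '*.py' | sort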
Running the tests:
You might notice that there are some files with _test.py
in their name.
These are unit tests of the pipeline, and it is recommended to add more unit tests as you implement your own pipelines. You can run unit tests by supplying the module name of a test file with the -m flag. You can usually get the module name by deleting the .py extension and replacing / with a dot (.).
For example:
In [ ]:
!python -m models.features_test
!python -m models.keras.model_test
Note: You probably already completed this step in guided project 1, so you may skip it if that is the case.
Components in the TFX pipeline will generate outputs for each run as ML Metadata Artifacts, and they need to be stored somewhere. You can use any storage which the KFP cluster can access, and for this example we will use Google Cloud Storage (GCS).
Let us create this bucket if you haven't created it in guided project 1.
Its name will be <YOUR_PROJECT>-kubeflowpipelines-default.
In [ ]:
GCS_BUCKET_NAME = GOOGLE_CLOUD_PROJECT + '-kubeflowpipelines-default'
GCS_BUCKET_NAME
In [ ]:
!gsutil ls gs://{GCS_BUCKET_NAME} | grep {GCS_BUCKET_NAME} || gsutil mb gs://{GCS_BUCKET_NAME}
So far, the template has given us a TFX pipeline for a model using the Chicago Taxi dataset. Now it's time to put the covertype data into the pipeline.
Your data can be stored anywhere your pipeline can access, including GCS, or BigQuery. You will need to modify the pipeline definition to access your data.
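For example, one way to make the covertype CSV readable by the pipeline is to stage a copy in the GCS bucket created above. The destination prefix covertype/small/ below is just a suggested layout, not something the template requires:
# Copy the public covertype CSV into your own bucket so the pipeline's
# example generator (CsvExampleGen in the taxi template) can read it.
!gsutil cp gs://workshop-datasets/covertype/small/dataset.csv \
    gs://{GCS_BUCKET_NAME}/covertype/small/dataset.csv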
Review the steps in guided project 1 and guided project 2 to remember in full detail what needs to be customized. You'll find below a short summary of these steps:

Modify DATA_PATH in kubeflow_dag_runner.py and set it to the location of your files. If your data is stored in BigQuery, modify BIG_QUERY_QUERY in pipeline/configs.py to correctly query for your data.
Modify models/features.py, models/keras/model.py, and models/keras/constants.py to describe your ML model (a covertype-flavored sketch follows below).
If you want to use an Estimator based model instead of Keras, change the RUN_FN constant to models.estimator.model.run_fn in pipeline/configs.py.
Modify pipeline.py and configs.py so that you can train and deploy on CAIP (Cloud AI Platform).

Going further: create a TFX pipeline as we did in this guided project, but this time with your own dataset instead of the covertype dataset.
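As a rough illustration of the first two points above (not a drop-in replacement for the template code), the covertype edits might look like the sketch below. The column names are assumptions about the covertype CSV header and should be checked against the actual file:
# Illustrative sketch only -- verify against models/features.py, pipeline/configs.py,
# and the actual dataset.csv header before copying anything.
DATA_PATH = 'gs://{}/covertype/small'.format(GCS_BUCKET_NAME)  # directory holding dataset.csv

# Candidate feature/label columns for models/features.py (hypothetical subset):
NUMERIC_FEATURE_KEYS = ['Elevation', 'Aspect', 'Slope', 'Hillshade_9am']
CATEGORICAL_FEATURE_KEYS = ['Wilderness_Area', 'Soil_Type']
LABEL_KEY = 'Cover_Type'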
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.