ML with TensorFlow Extended (TFX) -- Part 1

The purpose of this tutorial is to show how to do end-to-end ML with the TFX libraries on Google Cloud Platform. This tutorial covers:

  1. Data analysis and schema generation with TF Data Validation.
  2. Data preprocessing with TF Transform.
  3. Model training with TF Estimator.
  4. Model evaluation with TF Model Analysis.

This notebook has been tested in Jupyter on the Deep Learning VM.

Set up Cloud environment

import tensorflow as tf
import tensorflow_data_validation as tfdv

print('TF version: {}'.format(tf.__version__))
print('TFDV version: {}'.format(tfdv.__version__))

PROJECT = 'cloud-training-demos'    # Replace with your PROJECT
BUCKET = 'cloud-training-demos-ml'  # Replace with your BUCKET
REGION = 'us-central1'              # Choose an available region for Cloud MLE

import os

os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
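
Exporting the settings through os.environ is what makes them visible to commands launched from the notebook, since child processes inherit the parent's environment. A minimal sketch of that behaviour, using only the standard library: a child Python process stands in for a shell command, and the project name is the same placeholder used above.

```python
import os
import subprocess
import sys

PROJECT = 'cloud-training-demos'  # placeholder value, as above
os.environ['PROJECT'] = PROJECT

# Child processes inherit the parent's environment, so the variable is visible there.
child_code = "import os; print(os.environ['PROJECT'])"
result = subprocess.run(
    [sys.executable, '-c', child_code],
    capture_output=True, text=True, check=True)

child_value = result.stdout.strip()
print(child_value)
```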

!gcloud config set project $PROJECT
!gcloud config set compute/region $REGION

## ensure we predict locally with our current Python environment
!gcloud config set ml_engine/local_python `which python`

UCI Adult Dataset:

The task is to predict whether a person's income exceeds $50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

import os

# DATA_DIR must point to the folder that holds the dataset CSV files.
TRAIN_DATA_FILE = os.path.join(DATA_DIR, '')
EVAL_DATA_FILE = os.path.join(DATA_DIR, 'adult.test.csv')
!gsutil ls -l $TRAIN_DATA_FILE
!gsutil ls -l $EVAL_DATA_FILE

1. Data Analysis

For data analysis, visualization, and schema generation, we use TensorFlow Data Validation to perform the following:

  1. Analyze the training data and produce statistics.
  2. Generate data schema from the produced statistics.
  3. Configure the schema.
  4. Validate the evaluation data against the schema.
  5. Save the schema for later use.

1.1 Compute and visualise statistics

HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'gender',
          'capital_gain', 'capital_loss', 'hours_per_week',
          'native_country', 'income_bracket']

TARGET_FEATURE_NAME = 'income_bracket'
TARGET_LABELS = [' <=50K', ' >50K']
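
The leading space in each entry of TARGET_LABELS suggests that the raw file separates fields with a comma followed by a space, so string values keep that space unless it is stripped. A small sketch using only the standard library shows the effect. The sample row is illustrative and merely written in that format (it is not read from the real file), and label_index is a hypothetical helper name.

```python
import csv
import io

HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'gender',
          'capital_gain', 'capital_loss', 'hours_per_week',
          'native_country', 'income_bracket']
TARGET_FEATURE_NAME = 'income_bracket'
TARGET_LABELS = [' <=50K', ' >50K']

# Illustrative row in a "comma + space" format (an assumption about the file layout).
sample_line = ('39, State-gov, 77516, Bachelors, 13, Never-married, '
               'Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, '
               'United-States, <=50K')

# csv.reader keeps the space after each comma by default (skipinitialspace=False).
row = next(csv.reader(io.StringIO(sample_line)))
record = dict(zip(HEADER, row))


def label_index(record):
    """Map the raw label string to its position in TARGET_LABELS."""
    return TARGET_LABELS.index(record[TARGET_FEATURE_NAME])


print(repr(record[TARGET_FEATURE_NAME]))
print(label_index(record))
```

Stripping whitespace while parsing would also work, but the labels in TARGET_LABELS would then need to be written without the leading space.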

# This is a convenience function for CSV. We can write a Beam pipeline for other formats.
train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA_FILE, column_names=HEADER)
tfdv.visualize_statistics(train_stats)

1.2 Infer Schema

schema = tfdv.infer_schema(statistics=train_stats)

print(tfdv.get_feature(schema, 'age'))
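
To give a feel for what schema inference produces, here is a deliberately simplified, pure-Python stand-in. It is not TFDV's implementation: the function name infer_toy_schema, the dictionary layout, and the sample rows are all hypothetical, and the rule it applies (integer columns get a type, other columns also get a domain of observed values) is only a rough analogue.

```python
def infer_toy_schema(rows):
    """Infer a per-column type, plus a value domain for non-integer columns."""
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, int) for v in values):
            schema[name] = {'type': 'INT'}
        else:
            # Treat everything else as categorical and record the observed values.
            schema[name] = {'type': 'BYTES', 'domain': sorted(set(values))}
    return schema


# Made-up rows, only for illustration.
rows = [
    {'age': 39, 'workclass': 'State-gov'},
    {'age': 50, 'workclass': 'Private'},
    {'age': 38, 'workclass': 'Private'},
]

toy_schema = infer_toy_schema(rows)
print(toy_schema['age'])
print(toy_schema['workclass'])
```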

1.3 Configure Schema

# Relax the minimum fraction of values that must come from the domain for feature occupation.
occupation = tfdv.get_feature(schema, 'occupation')
occupation.distribution_constraints.min_domain_mass = 0.9

# Add a new value to the domain of feature native_country, assuming that we start receiving it
# in new data. Predictions for it will not be great, of course, because this country is not part
# of our training data.
native_country_domain = tfdv.get_domain(schema, 'native_country')

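# By default, all features are in the TRAINING, EVALUATION, and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('EVALUATION')
schema.default_environment.append('SERVING')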

# Specify that the class feature is not in SERVING environment.
tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')
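
min_domain_mass is a fraction between 0 and 1: with 0.9, up to 10% of the occupation values may fall outside the schema's domain without failing the constraint. The arithmetic behind that kind of check can be sketched in plain Python. This is an illustration only, not TFDV code; domain_mass, the domain, and the sample values are made up for the example.

```python
def domain_mass(values, domain):
    """Fraction of values that belong to the domain."""
    in_domain = sum(1 for v in values if v in domain)
    return in_domain / len(values)


domain = {'Sales', 'Tech-support', 'Adm-clerical'}
# 19 in-domain values and 1 value that the domain does not contain.
values = ['Sales'] * 10 + ['Tech-support'] * 9 + ['Astronaut']

mass = domain_mass(values, domain)
min_domain_mass = 0.9
passes = mass >= min_domain_mass
print(mass, passes)
```

With a threshold of 1.0 the same data would fail the check, which is why relaxing the threshold helps for features where a few unseen values are acceptable.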

1.4 Validate evaluation data

eval_stats = tfdv.generate_statistics_from_csv(
    data_location=EVAL_DATA_FILE, column_names=HEADER)

eval_anomalies = tfdv.validate_statistics(eval_stats, schema, environment='EVALUATION')
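
The effect of the environment argument can be pictured with a small stand-in: a feature is only expected in environments it has not been excluded from, so a missing label is a problem at EVALUATION time but not at SERVING time. The sketch below is plain Python under that assumption; expected_features, missing_features, and the dictionary layout are hypothetical and are not part of TFDV.

```python
DEFAULT_ENVIRONMENTS = ['TRAINING', 'EVALUATION', 'SERVING']

# For each feature, the environments in which it is NOT expected.
not_in_environment = {
    'age': [],
    'occupation': [],
    'income_bracket': ['SERVING'],
}


def expected_features(environment):
    """Features that should be present in the given environment."""
    if environment not in DEFAULT_ENVIRONMENTS:
        raise ValueError('unknown environment: {}'.format(environment))
    return {name for name, excluded in not_in_environment.items()
            if environment not in excluded}


def missing_features(columns, environment):
    """Expected features that are absent from the given columns."""
    return sorted(expected_features(environment) - set(columns))


# Data without the label column, as it would arrive at serving time.
serving_columns = ['age', 'occupation']
print(missing_features(serving_columns, 'EVALUATION'))
print(missing_features(serving_columns, 'SERVING'))
```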

1.5 Freeze the schema

RAW_SCHEMA_LOCATION = 'raw_schema.pbtxt'

from tensorflow.python.lib.io import file_io
from google.protobuf import text_format

tfdv.write_schema_text(schema, RAW_SCHEMA_LOCATION)

