Copyright © 2019 The TensorFlow Authors.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TensorFlow Data Validation

An Example of a Key TFX Library

This example colab notebook illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in your dataset. It's important to understand your dataset's characteristics, including how it might change over time in your production pipeline. It's also important to look for anomalies in your data, and to compare your training, evaluation, and serving datasets to make sure that they're consistent.

Setup

First, we install the necessary packages, download data, import modules and set up paths.

Install TFX and TensorFlow

Note

Because some of the installed packages are updated, you must use the button at the bottom of this cell's output to restart the runtime. After the runtime restarts, rerun this cell.


In [0]:
!pip install -q -U \
  tensorflow==2.0.0 \
  tensorflow_data_validation

Import packages

We import necessary packages, including standard TFX component classes.


In [0]:
import os
import tempfile
import urllib

import tensorflow as tf

import tensorflow_data_validation as tfdv

Check the versions


In [0]:
print('TensorFlow version: {}'.format(tf.__version__))
print('TensorFlow Data Validation version: {}'.format(tfdv.__version__))

Download example data

We download the sample dataset for use in our TFX pipeline. We're working with a variant of the Online News Popularity dataset, which summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict how popular an article will be on social networks. Specifically, in the original dataset the objective was to predict the number of times each article would be shared; in this variant, the goal is to predict the article's popularity percentile. For example, if the model predicts a score of 0.7, it expects the article to get more shares than 70% of all articles.


In [0]:
# Download the example data.
DATA_PATH = 'https://raw.githubusercontent.com/ageron/open-datasets/master/' \
   'online_news_popularity_for_course/online_news_popularity_for_course.csv'
_data_root = tempfile.mkdtemp(prefix='tfx-data')
_data_filepath = os.path.join(_data_root, "data.csv")
urllib.request.urlretrieve(DATA_PATH, _data_filepath)

Split the dataset into train, eval and serving

Let's take a peek at the data.


In [0]:
!head {_data_filepath}

Now let's split the data into a training set, an eval set and a serving set:

  • The training set will be used to train ML models.
  • The eval set (also called the validation set or dev set) will be used to evaluate the models we train and choose the best one.
  • The serving set should look exactly like production data so we can test our production validation rules. For this, we remove the labels.

We also modify one line in the eval set, replacing 'World' with 'Fun' in the data_channel feature, and we replace many floats with integers in the serving set: this will allow us to show how TFDV can detect anomalies.


In [0]:
_train_data_filepath = os.path.join(_data_root, "train.csv")
_eval_data_filepath = os.path.join(_data_root, "eval.csv")
_serving_data_filepath = os.path.join(_data_root, "serving.csv")

with open(_data_filepath) as data_file, \
     open(_train_data_filepath, "w") as train_file, \
     open(_eval_data_filepath, "w") as eval_file, \
     open(_serving_data_filepath, "w") as serving_file:
  lines = data_file.readlines()
  # Copy the header row; for the serving set, drop the last column (the label).
  train_file.write(lines[0])
  eval_file.write(lines[0])
  serving_file.write(lines[0].rsplit(",", 1)[0] + "\n")
  for line in lines[1:]:
    # Each line starts with the date, so comparing the whole line against a
    # date string splits the data chronologically.
    if line < "2014-11-01":
      train_file.write(line)
    elif line < "2014-12-01":
      # Introduce an anomaly in the eval set: replace 'World' with 'Fun'
      # in the data_channel feature.
      line = line.replace("2014-11-01,0,World,awkward-teen-dance",
                          "2014-11-01,0,Fun,awkward-teen-dance")
      eval_file.write(line)
    else:
      # Serving set: drop the label column and turn many floats (e.g. "3.0")
      # into integers ("3") to create a type anomaly.
      serving_file.write(line.rsplit(",", 1)[0].replace(".0,", ",") + "\n")

Now let's take a peek at the training set, the eval set and the serving set:


In [0]:
!head {_train_data_filepath}

In [0]:
!head {_eval_data_filepath}

In [0]:
!head {_serving_data_filepath}

Compute and visualize statistics

First we'll use tfdv.generate_statistics_from_csv to compute statistics for our training data. (You can ignore any snappy warnings.)

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.
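
For illustration, here is roughly what statistics generation looks like as a standalone Beam pipeline over our training CSV. This is a minimal sketch rather than part of the notebook's flow: it assumes the tfdv.DecodeCSV decoder and tfdv.GenerateStatistics PTransform exposed by this TFDV release, and writes the resulting DatasetFeatureStatisticsList proto to a TFRecord file.

In [0]:
import os

import apache_beam as beam
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import statistics_pb2

# Hypothetical output location for the statistics proto.
_stats_output_path = os.path.join(_data_root, 'train_stats.tfrecord')

# The decoder needs the column names, which we take from the CSV header.
with open(_train_data_filepath) as f:
  _column_names = f.readline().strip().split(',')

with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      # Read the raw CSV lines, skipping the header row.
      | 'ReadData' >> beam.io.ReadFromText(_train_data_filepath,
                                           skip_header_lines=1)
      # Decode each line into the in-memory format TFDV expects.
      | 'DecodeData' >> tfdv.DecodeCSV(column_names=_column_names)
      # Compute the same statistics as generate_statistics_from_csv.
      | 'GenerateStatistics' >> tfdv.GenerateStatistics()
      # Serialize the DatasetFeatureStatisticsList proto to a TFRecord file.
      | 'WriteStatsOutput' >> beam.io.WriteToTFRecord(
          _stats_output_path,
          shard_name_template='',
          coder=beam.coders.ProtoCoder(
              statistics_pb2.DatasetFeatureStatisticsList)))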


In [0]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=_train_data_filepath)

Now let's use tfdv.visualize_statistics, which uses Facets to create a succinct visualization of our training data:

  • Notice that numeric features and categorical features are visualized separately, and that charts are displayed showing the distributions for each feature.
  • Notice that features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features. The percentage is the fraction of examples that have missing or zero values for that feature.
  • Notice whether any feature has no values at all; such features carry no information and are an opportunity for dimensionality reduction!
  • Try clicking "expand" above the charts to change the display
  • Try hovering over bars in the charts to display bucket ranges and counts
  • Try switching between the log and linear scales, and notice how the log scale can reveal much more detail about categorical features such as data_channel
  • Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages

In [0]:
tfdv.visualize_statistics(train_stats)

Infer a schema

Now let's use tfdv.infer_schema to create a schema for our data. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline will be relying on the schema that TFDV generates to be correct. The schema also provides documentation for the data, and so is useful when different developers work on the same data. Let's use tfdv.display_schema to display the inferred schema so that we can review it.


In [0]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
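
The inferred schema is a best-effort starting point rather than ground truth. After reviewing the display above, you would typically tighten or relax individual constraints by hand before relying on it. The cell below is a minimal sketch of that kind of manual curation; the specific constraints are illustrative, not requirements of this dataset.

In [0]:
# Require the 'weekday' feature to be present in every example.
weekday_feature = tfdv.get_feature(schema, 'weekday')
weekday_feature.presence.min_fraction = 1.0

# Require each example to have exactly one value for 'data_channel'.
data_channel_feature = tfdv.get_feature(schema, 'data_channel')
data_channel_feature.value_count.min = 1
data_channel_feature.value_count.max = 1

tfdv.display_schema(schema=schema)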

Check evaluation data for errors

So far we've only been looking at the training data. It's important that our evaluation data is consistent with our training data, including that it uses the same schema. It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.

  • Notice that each feature now includes statistics for both the training and evaluation datasets.
  • Notice that the charts now have both the training and evaluation datasets overlaid, making it easy to compare them.
  • Notice that the charts now include a percentages view, which can be combined with log or the default linear scales.
  • Notice that some features are significantly different for the training versus the evaluation datasets, in particular check the mean and median. Will that cause problems?
  • Click expand on the Numeric Features chart, and select the log scale. Review the n_hrefs feature, and notice the difference in the max. Will evaluation miss parts of the loss surface?

In [0]:
# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_csv(
    data_location=_eval_data_filepath)

# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

Check for evaluation anomalies

Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical features, where we want to identify the range of acceptable values.

Key Point: What would happen if we tried to evaluate using data with categorical feature values that were not in our training dataset? What about numeric features that are outside the ranges in our training dataset?


In [0]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

Fix evaluation anomalies in the schema

Oops! It looks like we have some new values for data_channel in our evaluation data that we didn't have in our training data (what a surprise!). This should be considered an anomaly, but what we decide to do about it depends on our domain knowledge of the data. If an anomaly truly indicates a data error, then the underlying data should be fixed. Otherwise, we can simply update the schema to include the values in the eval dataset.

Key Point: How would our evaluation results be affected if we did not fix this problem?

We can't fix everything unless we change our evaluation dataset, but we can fix things in the schema that we're comfortable accepting. That includes relaxing our view of what is and is not an anomaly for particular features, as well as updating the schema to include values for categorical features that were missing from the domain. TFDV has enabled us to discover what we need to fix.

Let's make the fix now, and then review one more time.


In [0]:
# min_domain_mass is the minimum fraction of values that must come from the
# domain for feature data_channel; 1.0 is the strictest setting. Lower it
# (e.g. to 0.9) if you want to tolerate a small fraction of out-of-domain
# values instead.
data_channel = tfdv.get_feature(schema, 'data_channel')
data_channel.distribution_constraints.min_domain_mass = 1.0

# Add new value to the domain of feature data_channel.
data_channel_domain = tfdv.get_domain(schema, 'data_channel')
data_channel_domain.value.append('Fun')

# Validate eval stats after updating the schema 
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(updated_anomalies)

Hey, look at that! We verified that the training and evaluation data are now consistent! Thanks TFDV ;)

Schema Environments

We also split off a 'serving' dataset for this example, so we should check that too. By default all datasets in a pipeline should use the same schema, but there are often exceptions. For example, in supervised learning we need to include labels in our dataset, but when we serve the model for inference the labels will not be included. In some cases introducing slight schema variations is necessary.

Environments can be used to express such requirements. In particular, features in the schema can be associated with a set of environments using default_environment, in_environment, and not_in_environment.

For example, in this dataset the n_shares_percentile feature is included as the label for training, but it's missing in the serving data. Without an environment specified, it will show up as an anomaly.


In [0]:
serving_stats = tfdv.generate_statistics_from_csv(_serving_data_filepath)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

TFDV noticed that the n_shares_percentile column is missing in the serving set (as expected), and it also noticed that some features which should be floats are actually integers. It's very easy to be unaware of problems like that until model performance suffers, sometimes catastrophically. It may or may not be a significant issue, but in any case this should be cause for further investigation.

In this case, we can safely convert integers to floats, so we want to tell TFDV to use our schema to infer the type. Let's do that now.


In [0]:
options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)
serving_stats = tfdv.generate_statistics_from_csv(_serving_data_filepath,
                                                  stats_options=options)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

Now we just have the n_shares_percentile feature (which is our label) showing up as an anomaly ('Column dropped'). Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.


In [0]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Specify that 'n_shares_percentile' feature is not in SERVING environment.
tfdv.get_feature(schema, 'n_shares_percentile').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')

tfdv.display_anomalies(serving_anomalies_with_env)

Check for drift and skew

In addition to checking whether a dataset conforms to the expectations set in the schema, TFDV also provides functionality for detecting drift and skew. TFDV performs these checks by comparing the statistics of different datasets based on the drift/skew comparators specified in the schema.

Drift

Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.
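
As a concrete illustration of the metric itself (plain Python, not a TFDV API): the L-infinity distance between two categorical distributions is simply the largest absolute difference between their per-category frequencies, and the comparator raises an anomaly when that value exceeds the threshold. The frequencies below are made up for illustration.

In [0]:
# Toy frequency distributions for a categorical feature in two spans of data.
baseline_freqs = {'World': 0.30, 'Tech': 0.25, 'Business': 0.25, 'Fun': 0.20}
current_freqs = {'World': 0.27, 'Tech': 0.28, 'Business': 0.25, 'Fun': 0.20}

# L-infinity distance: the largest per-category frequency difference.
linf_distance = max(abs(baseline_freqs[k] - current_freqs[k])
                    for k in baseline_freqs)
print(linf_distance)  # 0.03 -- compared against the comparator's threshold.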

Skew

TFDV can detect three different kinds of skew in your data: schema skew, feature skew, and distribution skew.

Schema Skew

Schema skew occurs when the training and serving data do not conform to the same schema. Both training and serving data are expected to adhere to the same schema. Any expected deviations between the two (such as the label feature being only present in the training data but not in serving) should be specified through the environments field in the schema.

Feature Skew

Feature skew occurs when the feature values that a model trains on are different from the feature values that it sees at serving time. For example, this can happen when:

  • A data source that provides some feature values is modified between training and serving time
  • There is different logic for generating features between training and serving, for example when a transformation is applied in only one of the two code paths (see the toy sketch below)
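
As a toy illustration of that second cause (the feature name and transform here are hypothetical, not part of this dataset's pipeline):

In [0]:
import math

# Training-time feature engineering applies a log transform...
def training_feature(num_links):
  return {'log_num_links': math.log1p(num_links)}

# ...but the serving path forgets to apply it, so the model sees raw values.
def serving_feature(num_links):
  return {'log_num_links': float(num_links)}  # feature skew: missing log1p

print(training_feature(20))  # {'log_num_links': 3.04...}
print(serving_feature(20))   # {'log_num_links': 20.0}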

Distribution Skew

Distribution skew occurs when the distribution of the training dataset is significantly different from the distribution of the serving dataset. One of the key causes for distribution skew is using different code or different data sources to generate the training dataset. Another reason is a faulty sampling mechanism that chooses a non-representative subsample of the serving data to train on.


In [0]:
# Add skew comparator for 'weekday' feature.
weekday = tfdv.get_feature(schema, 'weekday')
weekday.skew_comparator.infinity_norm.threshold = 0.01

# Add drift comparator for 'weekday' feature.
weekday.drift_comparator.infinity_norm.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

tfdv.display_anomalies(skew_anomalies)

No drift and no skew!

Freeze the schema

Now that the schema has been reviewed and curated, we will store it in a file to reflect its "frozen" state.


In [0]:
_output_dir = os.path.join(tempfile.mkdtemp(),
                           'serving_model/online_news_simple')

In [0]:
tf.io.gfile.makedirs(_output_dir)
schema_file = os.path.join(_output_dir, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)

!cat {schema_file}
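
Downstream pipeline steps (or a later run of this notebook) can then load the frozen schema instead of re-inferring it. A minimal sketch, assuming the load_schema_text helper that pairs with write_schema_text in this TFDV release:

In [0]:
# Reload the frozen schema and confirm that it round-trips.
reloaded_schema = tfdv.load_schema_text(schema_file)
tfdv.display_schema(schema=reloaded_schema)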

When to use TFDV

It's easy to think of TFDV as only applying to the start of your training pipeline, as we did here, but in fact it has many uses. Here are a few more (the first two are sketched in code after this list):

  • Validating new data for inference to make sure that we haven't suddenly started receiving bad features
  • Validating new data for inference to make sure that our model has trained on that part of the decision surface
  • Validating our data after we've transformed it and done feature engineering (probably using TensorFlow Transform) to make sure we haven't done something wrong
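
For instance, the first two uses boil down to re-running the same validation against each new batch of inference data. A minimal sketch reusing the objects defined in this notebook, where the new-batch path is a hypothetical stand-in (we simply point it at the serving CSV):

In [0]:
# Hypothetical path to a newly logged batch of inference requests.
_new_batch_filepath = _serving_data_filepath  # stand-in for real production data

new_batch_stats = tfdv.generate_statistics_from_csv(
    _new_batch_filepath, stats_options=options)
new_batch_anomalies = tfdv.validate_statistics(
    new_batch_stats, schema, environment='SERVING')
tfdv.display_anomalies(new_batch_anomalies)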