In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This example colab notebook illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in your dataset. It's important to understand your dataset's characteristics, including how it might change over time in your production pipeline. It's also important to look for anomalies in your data, and to compare your training, evaluation, and serving datasets to make sure that they're consistent.
First, we install the necessary packages, download data, import modules and set up paths.
Note
Because some of the packages are updated, you must use the button at the bottom of this cell's output to restart the runtime. After the restart, rerun this cell.
In [0]:
!pip install -q -U \
    tensorflow==2.0.0 \
    tensorflow_data_validation
In [0]:
import os
import tempfile
import urllib
import tensorflow as tf
import tensorflow_data_validation as tfdv
Check the versions
In [0]:
print('TensorFlow version: {}'.format(tf.__version__))
print('TensorFlow Data Validation version: {}'.format(tfdv.__version__))
We download the sample dataset for use in our TFX pipeline. We're working with a variant of the Online News Popularity dataset, which summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict how popular an article will be on social networks. Specifically, in the original dataset the objective was to predict the number of times each article would be shared on social networks. In this variant, the goal is to predict the article's popularity percentile: for example, if the model predicts a score of 0.7, it expects the article to be shared more than 70% of all articles.
In [0]:
# Download the example data.
DATA_PATH = 'https://raw.githubusercontent.com/ageron/open-datasets/master/' \
            'online_news_popularity_for_course/online_news_popularity_for_course.csv'
_data_root = tempfile.mkdtemp(prefix='tfx-data')
_data_filepath = os.path.join(_data_root, "data.csv")
urllib.request.urlretrieve(DATA_PATH, _data_filepath)
In [0]:
!head {_data_filepath}
Now let's split the data into a training set, an eval set and a serving set. We also modify one line in the eval set, replacing 'World' with 'Fun' in the data_channel feature, and we replace many floats with integers in the serving set (we also drop the label column from the serving set, since labels aren't available at serving time). This will allow us to show how TFDV can detect anomalies.
In [0]:
_train_data_filepath = os.path.join(_data_root, "train.csv")
_eval_data_filepath = os.path.join(_data_root, "eval.csv")
_serving_data_filepath = os.path.join(_data_root, "serving.csv")
with open(_data_filepath) as data_file, \
     open(_train_data_filepath, "w") as train_file, \
     open(_eval_data_filepath, "w") as eval_file, \
     open(_serving_data_filepath, "w") as serving_file:
    lines = data_file.readlines()
    # Copy the header line to each split (drop the label column for serving).
    train_file.write(lines[0])
    eval_file.write(lines[0])
    serving_file.write(lines[0].rsplit(",", 1)[0] + "\n")
    # Each line starts with the date, so a lexicographic comparison splits by date.
    for line in lines[1:]:
        if line < "2014-11-01":
            train_file.write(line)
        elif line < "2014-12-01":
            # Introduce an anomaly: replace 'World' with 'Fun' in one line.
            line = line.replace("2014-11-01,0,World,awkward-teen-dance",
                                "2014-11-01,0,Fun,awkward-teen-dance")
            eval_file.write(line)
        else:
            # Serving data: drop the label column and turn many floats into integers.
            serving_file.write(line.rsplit(",", 1)[0].replace(".0,", ",") + "\n")
Now let's take a peek at the training set, the eval set and the serving set:
In [0]:
!head {_train_data_filepath}
In [0]:
!head {_eval_data_filepath}
In [0]:
!head {_serving_data_filepath}
First we'll use tfdv.generate_statistics_from_csv to compute statistics for our training data (you can ignore the snappy warnings).
TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.
Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.
In [0]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=_train_data_filepath)
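For reference, here is a minimal sketch of the Beam integration mentioned above, shown for illustration rather than as a cell to run in this notebook. It assumes TFRecord files of serialized tf.Example protos at placeholder paths, and it uses tfdv.GenerateStatistics and tfdv.DecodeTFExample; the exact decoder API has varied across TFDV versions.

import apache_beam as beam
from tensorflow_metadata.proto.v0 import statistics_pb2

# Placeholder paths -- not files created in this notebook.
INPUT_PATTERN = '/path/to/examples-*.tfrecord'
STATS_OUTPUT = '/path/to/train_stats.tfrecord'

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        # Read serialized tf.Example protos from TFRecord files.
        | 'ReadData' >> beam.io.ReadFromTFRecord(file_pattern=INPUT_PATTERN)
        # Decode the serialized protos into the form TFDV expects.
        | 'DecodeData' >> tfdv.DecodeTFExample()
        # Compute statistics as a Beam PTransform.
        | 'GenerateStatistics' >> tfdv.GenerateStatistics()
        # Write the DatasetFeatureStatisticsList proto to disk.
        | 'WriteStatsOutput' >> beam.io.WriteToTFRecord(
            STATS_OUTPUT, shard_name_template='',
            coder=beam.coders.ProtoCoder(
                statistics_pb2.DatasetFeatureStatisticsList)))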
Now let's use tfdv.visualize_statistics, which uses Facets to create a succinct visualization of our training data.
In [0]:
tfdv.visualize_statistics(train_stats)
Now let's use tfdv.infer_schema to create a schema for our data. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline will be relying on the schema that TFDV generates to be correct. The schema also provides documentation for the data, and so is useful when different developers work on the same data. Let's use tfdv.display_schema to display the inferred schema so that we can review it.
In [0]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
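The schema is just a protocol buffer, so besides the table above you can inspect individual features programmatically. Here is a small sketch using tfdv.get_feature and tfdv.get_domain (both are used again later in this notebook); it assumes data_channel was inferred as a categorical (string) feature with a domain.
In [0]:
# Inspect one feature of the inferred schema programmatically.
data_channel_feature = tfdv.get_feature(schema, 'data_channel')
print(data_channel_feature.name, data_channel_feature.type)

# For a categorical feature, the domain lists the acceptable values.
data_channel_domain = tfdv.get_domain(schema, 'data_channel')
print(list(data_channel_domain.value))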
So far we've only been looking at the training data. It's important that our evaluation data is consistent with our training data, including that it uses the same schema. It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.
For example, once you run the comparison below, look at the n_hrefs feature and notice the difference in the max. Will evaluation miss parts of the loss surface?
In [0]:
# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_csv(
    data_location=_eval_data_filepath)

# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical features, where we want to identify the range of acceptable values.
Key Point: What would happen if we tried to evaluate using data with categorical feature values that were not in our training dataset? What about numeric features that are outside the ranges in our training dataset?
In [0]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
Oops! It looks like we have some new values for data_channel in our evaluation data that we didn't have in our training data (what a surprise!). This should be considered an anomaly, but what we decide to do about it depends on our domain knowledge of the data. If an anomaly truly indicates a data error, then the underlying data should be fixed. Otherwise, we can simply update the schema to include the values in the eval dataset.
Key Point: How would our evaluation results be affected if we did not fix this problem?
Unless we change our evaluation dataset we can't fix everything, but we can fix things in the schema that we're comfortable accepting. That includes relaxing our view of what is and what is not an anomaly for particular features, as well as updating our schema to include missing values for categorical features. TFDV has enabled us to discover what we need to fix.
Let's make the fix now, and then review one more time.
In [0]:
# Relax the minimum fraction of values that must come
# from the domain for feature data_channel.
data_channel = tfdv.get_feature(schema, 'data_channel')
data_channel.distribution_constraints.min_domain_mass = 0.9
# Add new value to the domain of feature data_channel.
data_channel_domain = tfdv.get_domain(schema, 'data_channel')
data_channel_domain.value.append('Fun')
# Validate eval stats after updating the schema
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(updated_anomalies)
Hey, look at that! We verified that the training and evaluation data are now consistent! Thanks TFDV ;)
We also split off a 'serving' dataset for this example, so we should check that too. By default all datasets in a pipeline should use the same schema, but there are often exceptions. For example, in supervised learning we need to include labels in our dataset, but when we serve the model for inference the labels will not be included. In some cases introducing slight schema variations is necessary.
Environments can be used to express such requirements. In particular, features in the schema can be associated with a set of environments using default_environment, in_environment and not_in_environment.

For example, in this dataset the n_shares_percentile feature is included as the label for training, but it's missing in the serving data. Without an environment specified, it will show up as an anomaly.
In [0]:
serving_stats = tfdv.generate_statistics_from_csv(_serving_data_filepath)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)
tfdv.display_anomalies(serving_anomalies)
TFDV noticed that the n_shares_percentile column is missing in the serving set (as expected), and it also noticed that some features that should be floats are actually integers.
It's very easy to be unaware of problems like that until model performance suffers, sometimes catastrophically. It may or may not be a significant issue, but in any case this should be cause for further investigation.
In this case, we can safely convert integers to floats, so we want to tell TFDV to use our schema to infer the type. Let's do that now.
In [0]:
options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)
serving_stats = tfdv.generate_statistics_from_csv(_serving_data_filepath,
                                                  stats_options=options)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)
tfdv.display_anomalies(serving_anomalies)
Now we just have the n_shares_percentile feature (which is our label) showing up as an anomaly ('Column dropped'). Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.
In [0]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')
# Specify that 'n_shares_percentile' feature is not in SERVING environment.
tfdv.get_feature(schema, 'n_shares_percentile').not_in_environment.append('SERVING')
serving_anomalies_with_env = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')
tfdv.display_anomalies(serving_anomalies_with_env)
In addition to checking whether a dataset conforms to the expectations set in the schema, TFDV also provides functionalities to detect drift and skew. TFDV performs this check by comparing the statistics of the different datasets based on the drift/skew comparators specified in the schema.
Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.
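As a quick illustration of the L-infinity distance (this helper is not part of TFDV, just a hypothetical sketch): it is the largest absolute difference between the two distributions' frequencies for any single value.
In [0]:
# Hypothetical helper, not a TFDV API: L-infinity distance between two
# categorical value distributions expressed as {value: frequency} dicts.
def l_infinity_distance(p, q):
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Example: distributions of a categorical feature over two consecutive spans.
span_n = {'World': 0.30, 'Tech': 0.45, 'Fun': 0.25}
span_n_plus_1 = {'World': 0.28, 'Tech': 0.47, 'Fun': 0.25}
print(l_infinity_distance(span_n, span_n_plus_1))  # 0.02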
TFDV can detect three different kinds of skew in your data - schema skew, feature skew, and distribution skew.
Schema skew occurs when the training and serving data do not conform to the same schema. Both training and serving data are expected to adhere to the same schema. Any expected deviations between the two (such as the label feature being only present in the training data but not in serving) should be specified through environments field in the schema.
Feature skew occurs when the feature values that a model trains on are different from the feature values that it sees at serving time. For example, this can happen when a data source that provides some feature values is modified between training and serving time, or when different logic is used to generate features for training and for serving.
Distribution skew occurs when the distribution of the training dataset is significantly different from the distribution of the serving dataset. One of the key causes for distribution skew is using different code or different data sources to generate the training dataset. Another reason is a faulty sampling mechanism that chooses a non-representative subsample of the serving data to train on.
In [0]:
# Add skew comparator for 'weekday' feature.
weekday = tfdv.get_feature(schema, 'weekday')
weekday.skew_comparator.infinity_norm.threshold = 0.01
# Add drift comparator for 'weekday' feature.
weekday.drift_comparator.infinity_norm.threshold = 0.001
skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)
No drift and no skew!
In [0]:
_output_dir = os.path.join(tempfile.mkdtemp(),
                           'serving_model/online_news_simple')
In [0]:
from google.protobuf import text_format
tf.io.gfile.makedirs(_output_dir)
schema_file = os.path.join(_output_dir, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)
!cat {schema_file}
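A later pipeline stage (or another notebook) can load this schema back from the text file. Here is a small sketch using tfdv.load_schema_text, the counterpart of write_schema_text:
In [0]:
# Reload the schema from the pbtxt file written above.
reloaded_schema = tfdv.load_schema_text(schema_file)

# It can now be reused, e.g. displayed or passed to validate_statistics.
tfdv.display_schema(schema=reloaded_schema)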
It's easy to think of TFDV as only applying to the start of your training pipeline, as we did here, but in fact it has many uses. Here are a few more: