Watch a short tutorial video or read the written tutorial
We'd love it if you reach out for help on the Great Expectations Slack Channel
In [ ]:
import json
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs
import datetime
In [ ]:
context = ge.data_context.DataContext()
In [ ]:
context.list_expectation_suite_names()
In [ ]:
expectation_suite_name = ""  # TODO: set to a name from the list above
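If you're unsure which suite to pick, here is a minimal sketch of selecting one programmatically. The suite names below are a hypothetical stand-in for the real output of `context.list_expectation_suite_names()`:

```python
# Stand-in for context.list_expectation_suite_names(); these names are hypothetical.
suite_names = ["my_table.warning", "my_table.failure"]

# Pick one explicitly; in real pipelines, prefer naming the suite outright
# rather than relying on list ordering.
expectation_suite_name = suite_names[0]
```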
To learn more about get_batch, see this tutorial
In [ ]:
# list datasources of the type SparkDFDatasource in your project
[datasource['name'] for datasource in context.list_datasources() if datasource['class_name'] == 'SparkDFDatasource']
In [ ]:
datasource_name = ""  # TODO: set to a datasource name from above
In [ ]:
# Choose ONE of the two options below; as written, the second assignment
# would overwrite the first.

# Option 1: validate a file on a filesystem:
batch_kwargs = {'path': "YOUR_FILE_PATH", 'datasource': datasource_name}
# To customize how Spark reads the file, add options under the 'reader_options'
# key in batch_kwargs (e.g., reader_options={'header': 'true'})

# Option 2: validate data already loaded into a PySpark DataFrame:
# batch_kwargs = {'dataset': YOUR_DATAFRAME, 'datasource': datasource_name}
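As a sketch of the `reader_options` customization mentioned above, here is what batch_kwargs might look like for a CSV file with a header row. The path and datasource name are hypothetical placeholders, not real project values:

```python
# Hypothetical sketch: batch_kwargs for a CSV file with a header row.
datasource_name = "my_spark_datasource"  # hypothetical datasource name
batch_kwargs = {
    "path": "/data/example.csv",  # hypothetical file path
    "datasource": datasource_name,
    # These options are passed through to Spark's file reader,
    # e.g. header handling and schema inference for CSV files.
    "reader_options": {"header": "true", "inferSchema": "true"},
}
```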
batch = context.get_batch(batch_kwargs, expectation_suite_name)
batch.head()
Validation Operators provide a convenient way to bundle the validation of multiple expectation suites and the actions that should be taken after validation.

When deploying Great Expectations in a real data pipeline, you will typically discover these needs:

- validating a group of batches that are logically related
- validating a batch against several expectation suites, such as a warning suite and a failure suite
- doing something with the validation results (e.g., saving them for later review, or sending a notification in case of failures)
In [ ]:
# This is an example of invoking a validation operator that is configured by default in the great_expectations.yml file
"""
Create a run_id. The run_id must be of type RunIdentifier, with optional run_name and run_time instantiation
arguments (or a dictionary with these keys). The run_name can be any string (this could come from your pipeline
runner, e.g. Airflow run id). The run_time can be either a dateutil parsable string or a datetime object.
Note - any provided datetime will be assumed to be a UTC time. If no instantiation arguments are given, run_name will
be None and run_time will default to the current UTC datetime.
"""
run_id = {
    "run_name": "some_string_that_uniquely_identifies_this_run",  # insert your own run_name here
    "run_time": datetime.datetime.now(datetime.timezone.utc)
}
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
    run_id=run_id)
In [ ]:
context.open_data_docs()
You are now among the elite data professionals who know how to build robust descriptions of your data and protections for pipelines and machine learning models. Join the Great Expectations Slack Channel to see how others are wielding these superpowers.
In [ ]: