Watch a short tutorial video or read the written tutorial
We'd love it if you reach out for help on the Great Expectations Slack Channel
In [ ]:
import json
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs
import datetime
In [ ]:
context = ge.data_context.DataContext()
In [ ]:
context.list_expectation_suite_names()
In [ ]:
expectation_suite_name = ""  # TODO: set to a name from the list above
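If you're unsure which suite to pick, here is a minimal sketch of selecting one programmatically. The suite names below are a hypothetical stand-in for the real output of `context.list_expectation_suite_names()`:

```python
# Stand-in for context.list_expectation_suite_names(); these names are hypothetical.
suite_names = ["my_table.warning", "my_table.failure"]

# Pick one explicitly; in real pipelines, prefer naming the suite outright
# rather than relying on list ordering.
expectation_suite_name = suite_names[0]
```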
To learn more about get_batch, see this tutorial
In [ ]:
# list datasources of the type SparkDFDatasource in your project
[datasource['name'] for datasource in context.list_datasources() if datasource['class_name'] == 'SparkDFDatasource']
In [ ]:
datasource_name = ""  # TODO: set to a datasource name from above
In [ ]:
# Choose ONE of the two options below; as written, the second assignment
# would overwrite the first.

# Option 1: validate a file on a filesystem:
batch_kwargs = {'path': "YOUR_FILE_PATH", 'datasource': datasource_name}
# To customize how Spark reads the file, add options under the 'reader_options'
# key in batch_kwargs (e.g., reader_options={'header': 'true'})

# Option 2: validate data already loaded into a PySpark DataFrame:
# batch_kwargs = {'dataset': YOUR_DATAFRAME, 'datasource': datasource_name}
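As a sketch of the `reader_options` customization mentioned above, here is what batch_kwargs might look like for a CSV file with a header row. The path and datasource name are hypothetical placeholders, not real project values:

```python
# Hypothetical sketch: batch_kwargs for a CSV file with a header row.
datasource_name = "my_spark_datasource"  # hypothetical datasource name
batch_kwargs = {
    "path": "/data/example.csv",  # hypothetical file path
    "datasource": datasource_name,
    # These options are passed through to Spark's file reader,
    # e.g. header handling and schema inference for CSV files.
    "reader_options": {"header": "true", "inferSchema": "true"},
}
```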
batch = context.get_batch(batch_kwargs, expectation_suite_name)
batch.head()
Validation Operators provide a convenient way to bundle the validation of multiple expectation suites and the actions that should be taken after validation.

When deploying Great Expectations in a real data pipeline, you will typically discover these needs:

- validating a group of batches that are logically related
- validating a batch against several expectation suites, such as a warning suite and a failure suite
- doing something with the validation results (e.g., saving them for later review, or sending a notification in case of failures)
In [ ]:
# This is an example of invoking a validation operator that is configured by default in the great_expectations.yml file
"""
Create a run_id. The run_id must be of type RunIdentifier, with optional run_name and run_time instantiation
arguments (or a dictionary with these keys). The run_name can be any string (this could come from your pipeline
runner, e.g. Airflow run id). The run_time can be either a dateutil parsable string or a datetime object.
Note - any provided datetime will be assumed to be a UTC time. If no instantiation arguments are given, run_name will
be None and run_time will default to the current UTC datetime.
"""
run_id = {
    "run_name": "some_string_that_uniquely_identifies_this_run",  # insert your own run_name here
    "run_time": datetime.datetime.now(datetime.timezone.utc)
}
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
    run_id=run_id)
In [ ]:
context.open_data_docs()
You are now among the elite data professionals who know how to build robust descriptions of your data and protections for pipelines and machine learning models. Join the Great Expectations Slack Channel to see how others are wielding these superpowers.
In [ ]: