The submission_forms package provides a collection of components to support the management of information about data ingest activities (data transport, data checking, data publication and data archival).
The information is stored in structured JSON files, which are mapped 1-to-1 to Form objects to simplify information handling. The following assumes that an initial structured JSON file has already been generated. For the different ways to generate such a file, see the Workflow_Form_Generation.ipynb notebook.
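To illustrate the idea of a 1-to-1 mapping between a structured JSON file and an object with attribute access, here is a minimal sketch using only the Python standard library. It is not the dkrz_forms implementation: the helper names `to_namespace` and `to_dict`, and the field names in the example JSON, are assumptions chosen for illustration.

```python
import json
from types import SimpleNamespace

# Illustrative JSON content; the real form files contain more fields.
EXAMPLE_JSON = """
{
  "project": "CORDEX",
  "sub": {"agent": {"first_name": "", "last_name": "", "email": ""}}
}
"""

def to_namespace(text):
    """Parse JSON text so that every JSON object becomes an attribute container."""
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))

def to_dict(obj):
    """Convert the nested namespaces back into plain dictionaries."""
    if isinstance(obj, SimpleNamespace):
        return {key: to_dict(value) for key, value in vars(obj).items()}
    if isinstance(obj, list):
        return [to_dict(value) for value in obj]
    return obj

form = to_namespace(EXAMPLE_JSON)
form.sub.agent.email = "franz_mustermann@hzg.de"   # update via attribute access
round_trip = json.dumps(to_dict(form), indent=2)   # serialize back to JSON text
print(round_trip)
```

Because the object mirrors the JSON structure exactly, an update made through attribute access can be written back to the file without any additional bookkeeping.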
Approach:
In [ ]:
## the following libraries are needed to interact with
## json based form submissions
from dkrz_forms import form_handler, utils, checks, wflow_handler
from datetime import datetime
In [ ]:
## info_file = "path to json file"
info_file = "../Forms/../xxx.json"
# load json file and convert to Form object for simple updating
my_form = utils.load_workflow_form(info_file)
In [ ]:
# use "tab" completion to view the attributes
# every form has a project and has the above workflow steps associated
my_form.
In [ ]:
# evaluate to see doc string of submission part
?my_form
(i.e. the submission workflow JSON file)
The workflow is structured according to the following workflow steps:
Information on the form objects can be retrieved interactively in IPython or Jupyter notebooks: again, use "tab" for completion and ? to retrieve the docstring documentation.
Examples:
In [ ]:
# evaluate to view associated documentation string
?my_form.sub
In [ ]:
# use "tab" completion
my_form.sub.
These parts have to be filled in for each workflow step to characterize who (agent) did what (activity), using which input information (entity_in), to produce which output information (entity_out). They align with the W3C PROV model, which allows all collected information to be translated according to the W3C PROV standard (see the provenance.ipynb notebook for an example).
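The correspondence between the four parts of a workflow step and W3C PROV relations can be sketched as plain (subject, relation, object) triples. The relation names `wasAssociatedWith`, `used` and `wasGeneratedBy` come from the PROV standard; the function `prov_triples` and the identifiers passed to it are hypothetical and only serve as illustration. The provenance.ipynb notebook shows how the package itself handles the translation.

```python
def prov_triples(agent, activity, entity_in, entity_out):
    """Return PROV-style triples describing one workflow step.

    Each argument is an identifier string for the corresponding part
    of the step (illustrative; not the dkrz_forms data model).
    """
    return [
        (activity, "wasAssociatedWith", agent),    # who did it
        (activity, "used", entity_in),             # which input was used
        (entity_out, "wasGeneratedBy", activity),  # which output was produced
    ]

# Hypothetical identifiers for the submission (sub) step
triples = prov_triples(
    agent="sub.agent",
    activity="sub.activity",
    entity_in="sub.entity_in",
    entity_out="sub.entity_out",
)
for subject, relation, obj in triples:
    print(subject, relation, obj)
```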
In [ ]:
# example: "tab" completion to view attributes of agent
# thus - agent has an email, first_name and last_name
my_form.sub.agent.
This is generally defined in the dkrz_forms.config.workflow_steps.py templates (see the source code on GitHub: https://github.com/IS-ENES/submission_forms/dkrz_forms/config/workflow_steps.py).
For example, the agent responsible for data submission is SUBMISSION_AGENT, which is defined as:
SUBMISSION_AGENT = { 'doc': """Attributes characterizing the person responsible for form completion and submission:
- last_name: Last name of the person responsible for the submission form content
- first_name: Corresponding first name
- email: Valid user email address: all follow up activities will use this email to contact end user
- keyword : user provided key word to remember and separate submission
""",
'i_name': 'submission_agent',
'last_name' : 'mandatory',
'first_name' : 'mandatory',
'keyword': 'mandatory',
'email': 'mandatory',
'responsible_person':'mandatory'
}
All entries characterized as 'mandatory' have to be filled in.
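A check for unfilled mandatory entries could look like the following sketch. It assumes that a field counts as unfilled when it is absent, empty, or still holds the placeholder string 'mandatory'. The function `missing_mandatory` is hypothetical and is not part of the dkrz_forms `checks` module, whose actual interface may differ.

```python
# Template copied from the SUBMISSION_AGENT definition above (doc string omitted)
SUBMISSION_AGENT_TEMPLATE = {
    "i_name": "submission_agent",
    "last_name": "mandatory",
    "first_name": "mandatory",
    "keyword": "mandatory",
    "email": "mandatory",
    "responsible_person": "mandatory",
}

def missing_mandatory(template, values):
    """Return the names of mandatory template fields not yet filled in `values`."""
    missing = []
    for key, requirement in template.items():
        if requirement != "mandatory":
            continue  # skip i_name and any optional entries
        value = values.get(key)
        # Assumption: an untouched field is absent, empty, or still the placeholder
        if value is None or str(value).strip() in ("", "mandatory"):
            missing.append(key)
    return missing

filled = {
    "last_name": "Mustermann",
    "first_name": "Franz",
    "email": "franz_mustermann@hzg.de",
}
result = missing_mandatory(SUBMISSION_AGENT_TEMPLATE, filled)
print(result)  # -> ['keyword', 'responsible_person']
```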
In [ ]:
# e.g. set email of person responsible for data submission:
my_form.sub.agent.email = 'franz_mustermann@hzg.de'
Again, the generic definition is given in the dkrz_forms.config.workflow_steps.py templates.
For example, the quality assurance (qua) related activity information is defined as:
QUA_ACTIVITY = { 'doc': """ Attributes characterizing the data quality assurance activity:
- status: status information
- start_time, end_time: data ingest timing information
- comment : free text comment
- ticket_id: related RT ticket number
- follow_up_ticket: in case new data has to be provided
- quality_report: dictionary with quality related information (tbd.)
""",
'i_name':'qua_activity',
'status':ACTIVITY_STATUS,
'error_status':ERROR_STATUS,
'qua_tool_version':"mandatory",
"start_time":"mandatory",
"end_time":"optional",
"comment":"optional",
"ticket_id": "mandatory",
"follow_up_ticket": 'optional', # qa feedback to users, follow up actions
}
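As an illustration of how the timing entries of such an activity might be filled, the sketch below records ISO 8601 timestamps in a plain dictionary. The timestamp format, the helper `timestamp` and the ticket number are assumptions; the package may expect a different representation.

```python
from datetime import datetime, timezone

def timestamp():
    """Return the current UTC time as an ISO 8601 string (assumed format)."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")

# Reduced, illustrative copy of the activity template
qua_activity = {
    "start_time": "mandatory",
    "end_time": "optional",
    "ticket_id": "mandatory",
    "comment": "optional",
}

qua_activity["ticket_id"] = "12345"          # hypothetical RT ticket number
qua_activity["start_time"] = timestamp()     # set when the quality check starts
# ... the quality assurance tool would run here ...
qua_activity["end_time"] = timestamp()       # set when the quality check ends
qua_activity["comment"] = "example run"

print(qua_activity)
```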
In [ ]:
## back to example: submission related activity information
import pprint
pprint.pprint(my_form.sub.activity.__doc__)
Each workflow step produces an output, which is associated with the entity_out keyword.
A user-defined dictionary can be attached to each output as a report.
For example:
my_form.sub.entity_out.report contains all user input, provided e.g. by mail, in an Excel sheet or via a (Jupyter notebook) form
my_form.qua.entity_out.report contains the JSON output of the quality assurance tool as a dictionary
etc.
In [ ]:
# view the submission related information provided by the end user:
pprint.pprint(my_form.sub.entity_out.report.__dict__)
In [ ]:
## Example for the quality assurance workflow step (qua):
my_form.qua.entity_out.report = {
"QA_conclusion": "PASS",
"project": "CORDEX",
"institute": "CLMcom",
"model": "CLMcom-CCLM4-8-17-CLM3-5",
"domain": "AUS-44",
"driving_experiment": [ "ICHEC-EC-EARTH"],
"experiment": [ "history", "rcp45", "rcp85"],
"ensemble_member": [ "r12i1p1" ],
"frequency": [ "day", "mon", "sem" ],
"annotation":
[
{
"scope": ["mon", "sem"],
"variable": [ "tasmax", "tasmin", "sfcWindmax" ],
"caption": "attribute <variable>:cell_methods for climatologies requires <time>:climatology instead of time_bnds",
"comment": "due to the format of the data, climatology is equivalent to time_bnds",
"severity": "note"
}
]
}
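Once a report dictionary like the one above is attached, it can be inspected with ordinary dictionary operations. The following sketch summarizes the conclusion and the annotation severities; the function `summarize_report` is a hypothetical helper, and the report is a shortened copy of the example above.

```python
from collections import Counter

def summarize_report(report):
    """Condense a QA report dictionary into a few key figures."""
    annotations = report.get("annotation", [])
    severities = Counter(a.get("severity", "unknown") for a in annotations)
    return {
        "conclusion": report.get("QA_conclusion"),
        "n_annotations": len(annotations),
        "by_severity": dict(severities),
    }

# Shortened copy of the example report
example_report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "annotation": [
        {
            "scope": ["mon", "sem"],
            "variable": ["tasmax", "tasmin", "sfcWindmax"],
            "severity": "note",
        }
    ],
}

summary = summarize_report(example_report)
print(summary)
```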
Relatively few approaches are known that support a well-documented, standards-conforming and tool-supported way to document the workflow around data ingest at larger data centers. Links to approaches taken at other data centers will be collected below.
Example workflows in other data centers: