DKRZ data ingest information handling

The submission_forms package provides a collection of components for managing information about data ingest activities (data transport, data checking, data publication and data archival):

data submission related information management
- who, when, what, for which project, data characteristics
data management related information collection
- ingest, quality assurance, publication, archiving

The information is stored in structured json files that map 1-to-1 to Form objects to simplify information handling. The following assumes that an initial structured json file has already been generated. For the different ways to generate initial structured json files, see the Workflow_Form_Generation.ipynb notebook.

DKRZ ingest workflow system

Approach:

Data management related information is managed in structured json files
To simplify interactive information updates, json files are converted to Form objects
There are multiple possibilities to populate the json files (and associated Form objects):
- DKRZ served jupyter notebooks (e.g. in DKRZ jupyterhub http://data-forms.dkrz.de:8080)
- Client side jupyter notebooks (submission via email, rt ticket, git commit)
- Client side excel sheets (submission via email, rt ticket)
- Unstructured email exchange (json population done by data managers)
A toolset to manage Form objects (specially structured json files) along a well defined workflow
A toolset to search and intercorrelate data submission information
Support for W3C prov standard exposure of the structured json files
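
The 1-to-1 mapping between structured json files and Form objects can be pictured with a minimal sketch. This is not the dkrz_forms implementation: the helper names to_form and to_dict, the use of types.SimpleNamespace and the tiny sample document are assumptions made purely for illustration.

```python
import json
from types import SimpleNamespace

# Hypothetical miniature of a structured workflow json file (not a real form).
raw = '{"project": "CORDEX", "sub": {"agent": {"email": "", "last_name": ""}}}'


def to_form(text):
    """Parse json text so that every json object becomes an attribute container."""
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


def to_dict(obj):
    """Convert the attribute containers back into plain dictionaries."""
    if isinstance(obj, SimpleNamespace):
        return {key: to_dict(value) for key, value in vars(obj).items()}
    if isinstance(obj, list):
        return [to_dict(item) for item in obj]
    return obj


form = to_form(raw)
form.sub.agent.email = "someone@example.org"  # update via attribute access
round_trip = json.dumps(to_dict(form), sort_keys=True)
print(round_trip)
```

Attribute access is what makes "tab" completion in a notebook useful, while the conversion back to dictionaries keeps the json file as the stored representation.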

1) Get a Form object for information stored in a json file



In [ ]:

    
## the following libraries are needed to interact with 
## json based form submissions

from dkrz_forms import form_handler, utils, checks, wflow_handler
from datetime import datetime



In [ ]:

    
## info_file = "path to json file"
info_file = "../Forms/../xxx.json"

# load json file and convert to Form object for simple updating
my_form = utils.load_workflow_form(info_file)



In [ ]:

    
# use "tab" completion to view the attributes
# every form has a project and has the above workflow steps associated
my_form.



In [ ]:

    
# evaluate to see doc string of submission part
?my_form

2) Explore the structure of a workflow Form object

  (i.e. submission workflow json file)

The workflow is structured according to the following workflow steps:

'sub': data submission related information (client side: who, what, how, .., manager side: who, status,.. )
'rev': data submission review information
'ing': data ingest related information
'qua': data quality assurance related information
'pub': data publication related information
'lta': data long term archival and data citation related information

Information on the Form objects can be retrieved interactively in IPython or Jupyter notebooks: again, use "tab" for completion and ? to retrieve docstring documentation.

Examples:



In [ ]:

    
# evaluate to view associated documentation string
?my_form.sub



In [ ]:

    
# use "tab" completion
my_form.sub.

Each workflow step is structured according to:

agent: step related person or software tool information
activity: step execution related information
entity_in: input information for this workflow step
entity_out: output information for this workflow step

These parts have to be filled in for each workflow step to characterize who (agent) did what (activity), using which input information (entity_in) to produce which output information (entity_out). The parts align with the W3C PROV model, so all collected information can be translated according to the W3C PROV standard (see the provenance.ipynb notebook for an example).
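
The correspondence between the four parts of a workflow step and the PROV model can be sketched as a set of statements. The step keys and part names come from the text above, and used, wasGeneratedBy and wasAssociatedWith are core W3C PROV relations. The function prov_statements and the dotted string identifiers are hypothetical, and this sketch is not how the provenance.ipynb notebook performs the translation.

```python
# Workflow step keys as listed above.
STEPS = ["sub", "rev", "ing", "qua", "pub", "lta"]


def prov_statements(step):
    """Return (subject, relation, object) triples describing one workflow step."""
    agent = f"{step}.agent"
    activity = f"{step}.activity"
    entity_in = f"{step}.entity_in"
    entity_out = f"{step}.entity_out"
    return [
        (activity, "wasAssociatedWith", agent),    # who did it
        (activity, "used", entity_in),             # which input was consumed
        (entity_out, "wasGeneratedBy", activity),  # which output was produced
    ]


for step in STEPS:
    for subject, relation, obj in prov_statements(step):
        print(f"{subject} --{relation}--> {obj}")
```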



In [ ]:

    
# example: "tab" completion to view attributes of agent 
# thus - agent has an email, first_name and last_name

my_form.sub.agent.

This structure is defined in the dkrz_forms.config.workflow_steps.py templates (see source code on github: https://github.com/IS-ENES/submission_forms/dkrz_forms/config/workflow_steps.py)

For example, the agent responsible for data submission is SUBMISSION_AGENT, which is defined as:

SUBMISSION_AGENT = { 'doc': """Attributes characterizing the person responsible for form completion and submission:

   - last_name: Last name of the person responsible for the submission form content
   - first_name: Corresponding first name
   - email: Valid user email address: all follow up activities will use this email to contact end user
   - keyword : user provided key word to remember and separate submission
          """,
'i_name': 'submission_agent',
'last_name' : 'mandatory',
'first_name' : 'mandatory',
'keyword': 'mandatory',
'email': 'mandatory',
'responsible_person':'mandatory'

}

All entries characterized as 'mandatory' have to be filled.
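
A check for unfilled mandatory entries could look roughly like the sketch below. It is an assumption-laden illustration, not the checks module of dkrz_forms: the function unfilled_mandatory is hypothetical, and it assumes that an untouched field is either empty or still holds the placeholder string 'mandatory'.

```python
def unfilled_mandatory(template, values):
    """List keys marked 'mandatory' in the template that have no usable value."""
    missing = []
    for key, marker in template.items():
        if marker != "mandatory":
            continue
        value = values.get(key)
        # Assumption: an untouched field is empty or still holds the placeholder.
        if value in (None, "", "mandatory"):
            missing.append(key)
    return missing


# Reduced copy of the SUBMISSION_AGENT template shown above (doc string omitted).
template = {
    "i_name": "submission_agent",
    "last_name": "mandatory",
    "first_name": "mandatory",
    "keyword": "mandatory",
    "email": "mandatory",
    "responsible_person": "mandatory",
}

# Hypothetical, partially filled agent information.
filled = dict(
    template,
    last_name="Mustermann",
    first_name="Franz",
    email="franz_mustermann@hzg.de",
)

print(unfilled_mandatory(template, filled))
```

Here the keyword and responsible_person entries are reported, since they were never replaced by real values.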



In [ ]:

    
# e.g. set email of person responsible for data submission:
my_form.sub.agent.email = 'franz_mustermann@hzg.de'

Again, the generic definition is given in the dkrz_forms.config.workflow_steps.py templates.

For example, the quality assurance (qua) related activity information is defined as:

QUA_ACTIVITY= { 'doc': """ Attributes characterizing the data quality assurance activity:

    - status: status information
    - start_time, end_time: data ingest timing information
    - comment : free text comment
    - ticket_id: related RT ticket number
    - follow_up_ticket: in case new data has to be provided
    - quality_report: dictionary with quality related information (tbd.)
    """,
  'i_name':'qua_activity',
  'status':ACTIVITY_STATUS,
  'error_status':ERROR_STATUS,
  'qua_tool_version':"mandatory",
  "start_time":"mandatory",
  "end_time":"optional",
  "comment":"optional",
  "ticket_id": "mandatory",
  "follow_up_ticket": 'optional', # qa feedback to users, follow up actions
  }



In [ ]:

    
## back to example: submission related activity information
import pprint
pprint.pprint(my_form.sub.activity.__doc__)

Workflow step report documents

Each workflow step produces an output associated with the entity_out keyword.

A user-defined dictionary can be attached to each output as a report.

For example:

my_form.sub.entity_out.report contains all the user input provided, e.g. by mail, in an Excel sheet or via a (jupyter notebook) form

my_form.qua.entity_out.report contains the quality assurance tool json output as a dictionary

etc.



In [ ]:

    
# view the submission related information provided by the end user:

pprint.pprint(my_form.sub.entity_out.report.__dict__)



In [ ]:

    
## Example for the quality assurance workflow step (qua):
my_form.qua.entity_out.report = {
    "QA_conclusion": "PASS",
    "project": "CORDEX",
    "institute": "CLMcom",
    "model": "CLMcom-CCLM4-8-17-CLM3-5",
    "domain": "AUS-44",
    "driving_experiment":  [ "ICHEC-EC-EARTH"],
    "experiment": [ "history", "rcp45", "rcp85"],
    "ensemble_member": [ "r12i1p1" ],
    "frequency": [ "day", "mon", "sem" ],
    "annotation":
    [
        {
            "scope": ["mon", "sem"],
            "variable": [ "tasmax", "tasmin", "sfcWindmax" ],
            "caption": "attribute <variable>:cell_methods for climatologies requires <time>:climatology instead of time_bnds",
            "comment": "due to the format of the data, climatology is equivalent to time_bnds",
            "severity": "note"
        }
    ]
}
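
Because a report attached this way is an ordinary dictionary, it can be inspected with plain Python. The sketch below works on a reduced, standalone copy of the example report; the helper summarize is hypothetical and not part of dkrz_forms.

```python
# Reduced copy of the quality assurance report from the example above.
report = {
    "QA_conclusion": "PASS",
    "annotation": [
        {
            "scope": ["mon", "sem"],
            "variable": ["tasmax", "tasmin", "sfcWindmax"],
            "severity": "note",
        },
    ],
}


def summarize(report):
    """Return the conclusion plus one line per annotation."""
    lines = [f"conclusion: {report.get('QA_conclusion', 'unknown')}"]
    for note in report.get("annotation", []):
        variables = ", ".join(note.get("variable", []))
        lines.append(f"{note.get('severity', 'n/a')}: {variables}")
    return lines


print("\n".join(summarize(report)))
```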

Links:

github repo: https://github.com/IS-ENES-Data/submission_forms
...

Relatively few approaches are known that support a well documented, standards-conforming and tool-supported way to document the workflow around data ingest at larger data centers. Links to approaches taken at other data centers will be collected below:

example workflows in other data centers: