QC Configuration

Objective:

Show different ways to configure a quality control (QC) procedure, either explicitly inline or by calling a pre-set configuration.

For CoTeDe, the most important component is the human operator, so it should be easy to control which tests to apply and the specific parameters of each test. Since 2011, CoTeDe has been built on the principle of a single engine for multiple applications, using a dictionary to describe the QC procedure to apply.


In [1]:
# A different version of CoTeDe might give slightly different outputs.
# Please let me know if you see something that I should update.

import cotede
print("CoTeDe version: {}".format(cotede.__version__))


CoTeDe version: 0.21.0

load_cfg(), just for demonstration

Here we will import the load_cfg() function to illustrate different procedures. This is typically not necessary, since ProfileQC does it for us. The cfgname argument of load_cfg is the same one used by ProfileQC, thus when we call

ProfileQC(dataset, cfgname='argo')

the procedure applied to dataset is the same shown by

load_cfg(cfgname='argo')

We will take advantage of that to simplify this notebook, inspecting only the configuration without actually applying it.


In [2]:
from cotede.utils import load_cfg

Built-in procedures

The easiest way to configure a QC procedure is to use one of the built-in procedures, for example the GTSPP procedure for realtime data, here named 'gtspp_realtime'.


In [3]:
cfg = load_cfg('gtspp_realtime')

print(list(cfg.keys()))


['revision', 'common', 'variables']

The output cfg is a dictionary-like object; more specifically, it is an ordered dictionary. The configuration contains:

  • A revision to help determine how to handle this configuration.

  • A common item with the common tests for the whole dataset, i.e. the tests that are valid for all variables. For instance, a valid date and time is the same if we are evaluating temperature, salinity, or chlorophyll fluorescence.

  • A variables item, listing the variables to evaluate and the tests to apply to each one.

Let's check each item:


In [4]:
cfg['revision']


Out[4]:
'0.21'

In [5]:
print(list(cfg['common'].keys()))


['valid_datetime', 'valid_position', 'location_at_sea']

So, for the GTSPP realtime assessment, all variables must be associated with a valid time and a valid location at sea.


In [6]:
print(list(cfg['variables'].keys()))


['sea_water_temperature', 'sea_water_salinity']

GTSPP evaluates temperature and salinity. Here we use CF standard names, so temperature is sea_water_temperature. But which tests are applied to temperature measurements?


In [7]:
print(list(cfg['variables']['sea_water_temperature'].keys()))


['global_range', 'gradient', 'spike', 'profile_envelop']

Let's inspect the spike test.


In [8]:
print(cfg['variables']['sea_water_temperature']['spike'])


OrderedDict([('threshold', 2.0)])

There is a single item, the threshold, here defined as 2, so any measured temperature producing a spike greater than this threshold fails the spike test.
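To make the threshold concrete, here is an illustrative sketch of a GTSPP-style spike value. This formulation is common in oceanographic QC but is an assumption for illustration, not necessarily CoTeDe's exact implementation; check the manual for the definition CoTeDe uses.

```python
def spike(x_prev, x, x_next):
    """Illustrative GTSPP-style spike value: how far a point departs from
    the mean of its neighbors, discounted by half the neighbor-to-neighbor
    difference (so smooth gradients are not penalized)."""
    return abs(x - (x_prev + x_next) / 2.0) - abs((x_next - x_prev) / 2.0)

threshold = 2.0
# A sudden jump surrounded by stable values exceeds the threshold...
print(spike(10.0, 15.0, 10.1) > threshold)  # True: flagged as a spike
# ...while a smooth gradient does not.
print(spike(10.0, 10.5, 11.0) > threshold)  # False: passes
```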

Let's check the global range test.


In [9]:
print(list(cfg['variables']['sea_water_temperature']['global_range']))


['minval', 'maxval']

Here there are two limit values, the minimum acceptable value and the maximum one. Anything beyond these limits will fail this test.
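A minimal sketch of what a global range check does. The limits below are hypothetical, for demonstration only; the actual GTSPP limits live under cfg['variables']['sea_water_temperature']['global_range'].

```python
def global_range(value, minval, maxval):
    # A measurement passes if it falls within the acceptable limits.
    return minval <= value <= maxval

# Hypothetical limits, for illustration only:
print(global_range(12.3, minval=-2.5, maxval=40.0))  # True: within limits
print(global_range(99.9, minval=-2.5, maxval=40.0))  # False: beyond maxval
```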

Check CoTeDe's manual to see what each test does and the possible parameters for each one.

Explicit inline

A QC procedure can also be explicitly defined with a dictionary. For instance, let's consider that we want to evaluate the temperature of a dataset with a single test, the spike test, using a threshold equal to one,


In [10]:
my_config = {"sea_water_temperature":
               {"spike": {
                   "threshold": 1
                   }
               }
           }

cfg = load_cfg(my_config)
print(cfg)


OrderedDict([('revision', '0.21'), ('variables', OrderedDict([('sea_water_temperature', {'spike': {'threshold': 1}})]))])

Note that load_cfg took care of formatting it to the 0.21 standard for us, adding the revision and variables items. If a revision is not defined, a pre-0.21 configuration is assumed.

Compound procedure

Many of the recommended QC procedures share several tests in common. One way to simplify a QC procedure definition is by using inheritance to define a QC procedure to be used as a template. For example, let's create a new QC procedure that is based on GTSPP realtime and add a new test to that, the World Ocean Atlas Climatology comparison for temperature, with a threshold of 3 standard deviations.


In [11]:
my_config = {"inherit": "gtspp_realtime",
             "sea_water_temperature":
               {"woa_normbias": {
                   "threshold": 3
                   }
               }
           }

cfg = load_cfg(my_config)
print(cfg.keys())


odict_keys(['revision', 'common', 'variables', 'inherit'])

There is a new item, inherit:


In [12]:
print(cfg['inherit'])


['gtspp_realtime']

And now sea_water_temperature has all the GTSPP realtime tests plus the WOA comparison,


In [13]:
print(cfg['variables']['sea_water_temperature'].keys())


odict_keys(['global_range', 'gradient', 'spike', 'profile_envelop', 'woa_normbias'])

This new definition is actually the GTSPP recommended procedure for non-realtime data, i.e. the delayed mode. Indeed, the built-in GTSPP procedure is written by inheriting from the GTSPP realtime one.


In [14]:
cfg = load_cfg('gtspp')
print(cfg['inherit'])


['gtspp_realtime']

Inheritance can also be used to modify any parameter of the parent template procedure. For example, let's use the GTSPP realtime procedure but with a stricter spike threshold, equal to 1,


In [15]:
my_config = {"inherit": "gtspp_realtime",
             "sea_water_temperature":
               {"spike": {
                   "threshold": 1
                   }
               }
           }

cfg = load_cfg(my_config)
print(cfg['variables']['sea_water_temperature']['spike'])


OrderedDict([('threshold', 1)])

Custom collection of QC procedures

Different datasets might require different procedures to best identify bad measurements, so it is convenient to be able to create a personal toolbox of procedures. For instance, I work with Spray underwater gliders at Scripps Institution of Oceanography and have a generic QC procedure for Spray. I also have a modified version of it specifically for the California Underwater Glider Network operations, as well as another procedure for the deployments in the Mediterranean Sea. I saved those as spray.json, spray_CUGN.json, and spray_med.json.

Any CoTeDe user can create their own collection of QC procedures. When CoTeDe doesn't find a built-in standard with the given name, it searches your home directory at:

~/.config/cotederc/cfg/

So for me, I placed those three JSON files at

/home/guilherme/.config/cotederc/cfg/{spray.json,spray_CUGN.json,spray_med.json}

Now I can call load_cfg('spray_CUGN').
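The steps above can be sketched in code. The procedure name my_spray.json and its contents below are hypothetical examples, but the directory is the one CoTeDe searches, as described above.

```python
# Sketch: saving a custom QC procedure as JSON so load_cfg can find it by name.
import json
import os

# Hypothetical procedure: GTSPP realtime plus a WOA comparison for temperature.
my_procedure = {
    "inherit": "gtspp_realtime",
    "sea_water_temperature": {"woa_normbias": {"threshold": 3}},
}

# The default directory where CoTeDe looks for user configurations.
cfg_dir = os.path.expanduser("~/.config/cotederc/cfg")
os.makedirs(cfg_dir, exist_ok=True)
with open(os.path.join(cfg_dir, "my_spray.json"), "w") as f:
    json.dump(my_procedure, f, indent=2)
```

After saving it, load_cfg('my_spray'), or ProfileQC(dataset, cfgname='my_spray'), would find the procedure by name.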

You can change where CoTeDe searches for the configuration files by defining the environment variable COTEDE_DIR. If you use bash, you could do:

export COTEDE_DIR='/my/much/better/place/to/save/these/'

but keep in mind that CoTeDe will look for the JSON files in the cfg directory inside $COTEDE_DIR. I use this approach on my servers to keep everything tidy.

Actual QC

Remember that most users have no reason to use load_cfg() directly; I used it here to better illustrate how a QC procedure can be defined.

The cfgname used in load_cfg is the same cfgname of ProfileQC, so you can apply everything you learned here with ProfileQC. For instance, to evaluate the temperature measurements of a profile by comparing them with the World Ocean Atlas using 10 standard deviations as the tolerance, you could:

pqced = ProfileQC(my_data, {"sea_water_temperature": {"woa_normbias": {"threshold": 10}}})

Or, to evaluate a profile (my_data) using the EuroGOOS recommended QC procedure (another built-in standard), you could:

pqced = ProfileQC(my_data, "eurogoos")

Note that my_data must satisfy CoTeDe's data model. Check the manual if you are not familiar with it.
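As an illustration only, a minimal profile-like object in the spirit of that data model might carry metadata in an attrs dictionary plus dictionary-like access to the variables. This sketch is an assumption for demonstration; the authoritative requirements are in CoTeDe's manual.

```python
# Hypothetical sketch of a profile-like object: metadata in `attrs`
# (used by common tests such as valid position/date) and dict-like
# access to the measured variables. Not CoTeDe's actual API.
from datetime import datetime

class MinimalProfile(object):
    def __init__(self):
        # Metadata that position/date tests would typically need.
        self.attrs = {"datetime": datetime(2016, 6, 4),
                      "latitude": 15.0, "longitude": -38.0}
        # Measured variables, keyed by name.
        self.data = {"PRES": [2.0, 6.0, 10.0],
                     "TEMP": [25.3, 25.2, 25.1]}

    def keys(self):
        return self.data.keys()

    def __getitem__(self, key):
        return self.data[key]

p = MinimalProfile()
print(list(p.keys()))  # ['PRES', 'TEMP']
```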

Exercise

CoTeDe has a built-in QC procedure based on the Argo recommendations, named 'argo'. Which variables, tests, and thresholds are applied in that setup? Hint: cell In [3] loads the config for 'gtspp_realtime'.