Quality Controlling a generic data object

Objective:

This notebook shows how to use CoTeDe with a generic dataset. If you want to use CoTeDe on your own data, or want to plug CoTeDe into your application, this notebook is for you.

How can CoTeDe be used to quality control any type of measurement?

CoTeDe operates with a minimalist common data model so that it can connect with other applications. To use it from another application, all you need to do is provide your data in that standard. For this example, let's call the dataset object 'ds'.

CoTeDe expects dataset-wide information, such as the latitude of the profile, to be accessed as:

ds.attrs['latitude']

or

ds.attrs['datetime']

while the measurements and auxiliary variables are accessed directly, like:

ds['temperature']

or

ds['depth']

With that structure, each test implemented in CoTeDe knows where to find the relevant information. For instance, the 'at sea' test for a profile only needs to know latitude and longitude, which, as described above, are available in ds.attrs.
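Any object that offers this interface should work; CoTeDe does not require a particular class. As an illustration (a hypothetical helper, not part of CoTeDe's API), the duck typing involved boils down to something like:

def looks_like_cotede_input(ds):
    """Illustrative check of the minimal interface described above.

    Metadata must live in a dict-like 'attrs', while measurements must be
    reachable through item access, e.g. ds['temperature'].
    """
    has_metadata = hasattr(ds, 'attrs') and hasattr(ds.attrs, '__getitem__')
    has_data = hasattr(ds, '__getitem__') and hasattr(ds, 'keys')
    return has_metadata and has_data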

Let's see a real example.


In [1]:
# Different versions of CoTeDe might give slightly different outputs.
# Please let me know if you see something that I should update.

import cotede
print("CoTeDe version: {}".format(cotede.__version__))


CoTeDe version: 0.21

In [2]:
# Importing some requirements

from datetime import datetime

import numpy as np
from numpy import ma
from cotede.qc import ProfileQC

Let's create a minimalist class that behaves the way CoTeDe expects. It is essentially a dictionary of relevant variables plus a property attrs holding some metadata.


In [3]:
class DummyDataset(object):
    """Minimalist data object that contains data and attributes
    """
    def __init__(self):
        """Two dictionaries to store the data and attributes
        """
        self.attrs = {}
        self.data = {}
            
    def __getitem__(self, key):
        """Return the requested item from the data
        """
        return self.data[key]
    
    def keys(self):
        """Show the available variables in data
        """
        return self.data.keys()
        
    @property
    def attributes(self):
        """Temporary requirement while Gui is refactoring CoTeDe. This will be soon unecessary
        """
        return self.attrs

Let's create an empty data object.


In [4]:
mydata = DummyDataset()

Let's define some metadata, such as the position and the time at which the profile was measured.


In [5]:
mydata.attrs['datetime'] = datetime(2016, 6, 4)
mydata.attrs['latitude'] = 15
mydata.attrs['longitude'] = -38

print(mydata.attrs)


{'datetime': datetime.datetime(2016, 6, 4, 0, 0), 'latitude': 15, 'longitude': -38}

Now let's create some data. Here I'll create pressure, temperature, and salinity. I'm using masked arrays, but simple arrays would also work.

Here I'm making these values up, but in a real-world case we would be reading them from a netCDF file, an ASCII file, an SQL query, or whatever your data source is.


In [6]:
mydata.data['PRES'] = ma.fix_invalid([2, 6, 10, 21, 44, 79, 100, 150, 200, 400, 410, 650, 1000, 2000, 5000])
mydata.data['TEMP'] = ma.fix_invalid([25.32, 25.34, 25.34, 25.31, 24.99, 23.46, 21.85, 17.95, 15.39, 11.08, 6.93, 7.93, 5.71, 3.58, np.nan])
mydata.data['PSAL'] = ma.fix_invalid([36.49, 36.51, 36.52, 36.53, 36.59, 36.76, 36.81, 36.39, 35.98, 35.30, 35.28, 34.93, 34.86, np.nan, np.nan])
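For reference, filling the same object from a real source could look roughly like the sketch below. The file name 'profile.nc' and its variable names are hypothetical; adapt them to your own data.

from netCDF4 import Dataset

nc = Dataset('profile.nc')  # hypothetical file and variable names
realdata = DummyDataset()
realdata.attrs['latitude'] = float(nc.variables['lat'][0])
realdata.attrs['longitude'] = float(nc.variables['lon'][0])
for v in ('PRES', 'TEMP', 'PSAL'):
    # fix_invalid masks NaNs and infs, as in the cell above
    realdata.data[v] = ma.fix_invalid(nc.variables[v][:])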

Let's check the available variables:


In [7]:
mydata.keys()


Out[7]:
dict_keys(['PRES', 'TEMP', 'PSAL'])

Let's check one of the variables, temperature:


In [8]:
mydata['TEMP']


Out[8]:
masked_array(data=[25.32, 25.34, 25.34, 25.31, 24.99, 23.46, 21.85, 17.95,
                   15.39, 11.08, 6.93, 7.93, 5.71, 3.58, --],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False,  True],
       fill_value=1e+20)

Now that we have our data and metadata in this object, CoTeDe can do its job. In this example let's evaluate this fictitious profile using the EuroGOOS recommended QC tests. For that we can use ProfileQC() like:


In [9]:
pqced = ProfileQC(mydata, cfg='eurogoos')
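The cfg argument selects which suite of tests to apply. CoTeDe ships other presets as well; for instance, if your CoTeDe version includes the GTSPP configuration, the equivalent call would be:

pqced_gtspp = ProfileQC(mydata, cfg='gtspp')

A custom configuration can also be passed as a dictionary; see CoTeDe's documentation for the expected structure.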

The returned object (pqced) has the same content as the original mydata. Let's check the variables again,


In [10]:
pqced.keys()


Out[10]:
dict_keys(['PRES', 'TEMP', 'PSAL'])

But now there is a new property named 'flags', which is a dictionary with all the tests applied and the resulting flags. Those flags are grouped by variable.


In [11]:
pqced.flags.keys()


Out[11]:
dict_keys(['common', 'TEMP', 'PSAL'])
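Since flags is a plain dictionary of dictionaries, the full set of results can be inspected with ordinary Python, for example:

for varname in pqced.flags:
    for testname, flagvalues in pqced.flags[varname].items():
        print(varname, testname, flagvalues)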

Let's see which flags are available for temperature,


In [12]:
pqced.flags['TEMP'].keys()


Out[12]:
dict_keys(['valid_datetime', 'location_at_sea', 'global_range', 'gradient_depthconditional', 'spike_depthconditional', 'digit_roll_over', 'woa_normbias', 'overall'])

Let's check the flags for the gradient test conditional on depth, as defined by EuroGOOS:


In [13]:
pqced.flags['TEMP']['gradient_depthconditional']


Out[13]:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 9], dtype=int8)

A flag of 1 means the measurement was approved by this test. A flag of 9 means the data was missing or invalid at that level. And a flag of 0 means no QC was applied. For the gradient test it is not possible to evaluate the first or the last value (check the manual), so those measurements exist but their flag is 0.

The overall flag is a special one that combines all the other flags by taking the most critical assessment. If a single test identifies a problem and flags a measurement as 4 (bad data), the overall flag for that measurement will be 4 even if it passed all the other tests. Therefore, a measurement with an overall flag of 1 (good value) was approved by every test.
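As a rough illustration only (CoTeDe computes this internally, and its actual handling of the special values 0 and 9 may differ), the "most critical assessment" idea is close to an element-wise maximum across the individual tests:

roughly_overall = np.zeros(len(mydata['TEMP']), dtype='i1')
for test, flagvalues in pqced.flags['TEMP'].items():
    if test != 'overall':
        # np.maximum broadcasts, so per-profile (scalar) flags also work
        roughly_overall = np.maximum(roughly_overall, flagvalues)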


In [14]:
pqced.flags['PSAL']['overall']


Out[14]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9], dtype=int8)
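A common next step (plain numpy and masked-array operations, not a CoTeDe call) is to keep only the values whose overall flag is 1 (good) or 2 (probably good):

overall = pqced.flags['PSAL']['overall']
good = np.isin(overall, (1, 2))
clean_psal = ma.masked_where(~good, mydata['PSAL'])

Everything flagged bad, missing, or not evaluated is masked out, while the original array is left untouched.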