In [1]:
from utilities import css_styles
css_styles()


Out[1]:

IOOS System Test - Theme 1 - Scenario B - Description

Core Variable Strings

This notebook takes the IOOS core variables and uses the Marine Metadata Interoperability (MMI) SPARQL endpoint to convert them to CF Standard Names. Each IOOS CSW server is then queried for the CF standard names associated with an IOOS Core Variable.

Questions

  1. Using a list of Core IOOS Variables and the MMI SPARQL service, can we search and quantify records from CSW endpoints that relate to core variables?

Get a list of the IOOS Core Variables from MMI


In [2]:
# Using RDF
from rdflib import Graph, Literal, BNode, Namespace, RDF, URIRef
g = Graph()
g.load("http://mmisw.org/ont/ioos/core_variable")
core_var_uri      = URIRef('http://mmisw.org/ont/ioos/core_variable/Core_Variable')
core_var_name_uri = URIRef('http://mmisw.org/ont/ioos/core_variable/name')
core_var_def_uri  = URIRef('http://mmisw.org/ont/ioos/core_variable/definition')

core_variables = []
for cv in g.subjects(predicate=RDF.type, object=core_var_uri):
    name = g.value(subject=cv, predicate=core_var_name_uri).value
    definition = g.value(subject=cv, predicate=core_var_def_uri).value
    core_variables.append((name, definition))

In [3]:
import pandas as pd
core_variables_names = [x for x,y in core_variables]
pd.DataFrame.from_records(core_variables, columns=("Name", "Definition",))


Out[3]:
    Name                    Definition
0   phytoplankton_species   Phytoplankton species
1   pathogens               Pathogens
2   surface_currents        Surface currents
3   optical_properties      Optical properties
4   total_suspended_matter  Total suspended matter
5   acidity                 Acidity
6   wind                    Wind Speed and Direction
7   fish_abundance          Fish abundance
8   pco2                    Partial pressure of carbon dioxide
9   sea_level               Sea level
10  bottom_character        Bottom character
11  fish_species            Fish species
12  contaminants            Contaminants
13  zooplankton_abundance   Zooplankton abundance
14  zooplankton_species     Zooplankton species
15  ocean_color             Ocean color
16  cdom                    Color dissolved organic matter
17  ice_distribution        Ice distribution
18  heat_flux               Heat flux
19  bathymetry              Bathymetry
20  stream_flow             Stream flow
21  temperature             Temperature
22  salinity                Salinity
23  dissolved_nutrients     Dissolved Nutrients
24  surface_waves           Surface waves
25  dissolved_oxygen        Dissolved oxygen
Programmatic access to Core Variables - This isn't straightforward and should be abstracted into a library. See: https://github.com/ioos/system-test/issues/128

In [4]:
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://mmisw.org/sparql")

query = """
PREFIX ioos: <http://mmisw.org/ont/ioos/parameter/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?cat ?parameter ?property ?value 
WHERE {?parameter a ioos:Parameter .
       ?parameter ?property ?value .
       ?cat skos:narrowMatch ?parameter .
       FILTER  (regex(str(?property), "Match", "i") && regex(str(?value), "cf", "i") )
      } 
ORDER BY ?cat ?parameter
"""
    
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
j = sparql.query().convert()

cf_standard_uris  = list(set([ x["value"]["value"] for x in j.get("results").get("bindings") ]))
cf_standard_names = map(lambda x: x.split("/")[-1], cf_standard_uris)
pd.DataFrame.from_records(zip(cf_standard_names, cf_standard_uris), columns=("CF Name", "CF URI",))


/home/will/anaconda/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py:88: RuntimeWarning: JSON-LD disabled because no suitable support has been found
  warnings.warn("JSON-LD disabled because no suitable support has been found", RuntimeWarning)
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 170 entries, 0 to 169
Data columns (total 2 columns):
CF Name    170  non-null values
CF URI     170  non-null values
dtypes: object(2)
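
The extraction step above relies on the layout of a SPARQL JSON response, where each row is a "binding" that maps a variable name to a dict with a "value" key. As a minimal sketch of that step, the snippet below runs the same logic on a hand-written response. The URIs under example.org are made up for illustration and are not real MMI output, and the helper name cf_names_from_bindings is hypothetical.

```python
# Hypothetical response in the SPARQL JSON results layout. The bindings
# are invented for illustration; they are not real MMI output.
sample = {
    "head": {"vars": ["cat", "parameter", "property", "value"]},
    "results": {
        "bindings": [
            {
                "parameter": {"type": "uri", "value": "http://example.org/parameter/a"},
                "value": {"type": "uri", "value": "http://example.org/cf/sea_water_temperature"},
            },
            {
                "parameter": {"type": "uri", "value": "http://example.org/parameter/b"},
                "value": {"type": "uri", "value": "http://example.org/cf/sea_water_temperature"},
            },
            {
                "parameter": {"type": "uri", "value": "http://example.org/parameter/c"},
                "value": {"type": "uri", "value": "http://example.org/cf/sea_water_salinity"},
            },
        ]
    },
}


def cf_names_from_bindings(response):
    """Return sorted (name, uri) pairs for the distinct ?value URIs."""
    # De-duplicate the URIs bound to ?value across all rows.
    uris = set(b["value"]["value"] for b in response["results"]["bindings"])
    # The CF name is taken to be the last path segment of the URI.
    return sorted((uri.rsplit("/", 1)[-1], uri) for uri in uris)


pairs = cf_names_from_bindings(sample)
```

Building the (name, URI) pairs together and sorting them keeps each name next to its URI and gives a repeatable order, which a bare list(set(...)) does not guarantee.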

Searching CSW servers on variable names


In [5]:
# https://github.com/ioos/system-test/wiki/Service-Registries-and-Data-Catalogs
known_csw_servers = ['http://data.nodc.noaa.gov/geoportal/csw',
                     'http://www.nodc.noaa.gov/geoportal/csw',
                     'http://www.ngdc.noaa.gov/geoportal/csw',
                     'http://cwic.csiss.gmu.edu/cwicv1/discovery',
                     'http://geoport.whoi.edu/geoportal/csw',
                     'https://edg.epa.gov/metadata/csw',
                     'http://cmgds.marine.usgs.gov/geonetwork/srv/en/csw',
                     'http://cida.usgs.gov/gdp/geonetwork/srv/en/csw',
                     'http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw',
                     'http://geoport.whoi.edu/gi-cat/services/cswiso',
                     'https://data.noaa.gov/csw',
                     ]

Subset which variables we should query by


In [6]:
# Query on sea surface height
variables_to_query = [ x for x in cf_standard_names if "sea_surface_height" in x ]
custom_variables   = [u"sea_surface_height", u"sea_surface_elevation"]

variables_to_query += custom_variables
variables_to_query


Out[6]:
[u'sea_surface_height_correction_due_to_air_pressure_at_low_frequency',
 u'sea_surface_height_correction_due_to_air_pressure_and_wind_at_high_frequency',
 u'sea_surface_height_amplitude_due_to_earth_tide',
 u'sea_surface_height_amplitude_due_to_equilibrium_ocean_tide',
 u'sea_surface_height_above_sea_level',
 u'sea_surface_height_above_reference_ellipsoid',
 u'sea_surface_height_bias_due_to_sea_surface_roughness',
 u'sea_surface_height_amplitude_due_to_pole_tide',
 u'sea_surface_height_above_geoid',
 u'sea_surface_height_amplitude_due_to_geocentric_ocean_tide',
 u'sea_surface_height_amplitude_due_to_non_equilibrium_ocean_tide',
 u'sea_surface_height',
 u'sea_surface_elevation']
Missing CF Standard Names - "sea_surface_height" and "sea_surface_elevation" are valid CF Aliases but are not returned by MMI when running the SPARQL query. We added them here manually. See: https://github.com/ioos/system-test/issues/129
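
If MMI later starts returning these aliases, appending them by hand would produce duplicate queries. One way to guard against that, sketched here with a hypothetical helper named select_variables and a made-up subset of CF names, is to append only the extras that are not already selected:

```python
def select_variables(names, substring, extras=()):
    """Keep names containing `substring`, then append extras not already present."""
    selected = [n for n in names if substring in n]
    seen = set(selected)
    for extra in extras:
        if extra not in seen:
            selected.append(extra)
            seen.add(extra)
    return selected


# Hypothetical subset of the CF names returned by the SPARQL query.
names = [
    "sea_surface_height_above_geoid",
    "sea_water_temperature",
    "sea_surface_height_above_sea_level",
]

result = select_variables(
    names,
    "sea_surface_height",
    extras=["sea_surface_height", "sea_surface_elevation", "sea_surface_height_above_geoid"],
)
```

The third extra is already in the filtered list, so it is skipped rather than queried twice.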

Construct CSW Filters


In [7]:
from owslib import fes

cf_name_filters = []
for cf_name in variables_to_query:
    text_filter   = fes.PropertyIsLike(propertyname='apiso:AnyText', literal="*%s*" % cf_name, wildCard='*')
    cf_name_filters.append(text_filter)
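
For readers unfamiliar with OGC filters, it may help to see roughly what kind of XML element a PropertyIsLike constraint stands for. The sketch below builds one with the standard library only. It is not the owslib implementation: the function name property_is_like is hypothetical, and the singleChar and escapeChar values are illustrative assumptions rather than values taken from the notebook.

```python
import xml.etree.ElementTree as ET

OGC_NS = "http://www.opengis.net/ogc"
ET.register_namespace("ogc", OGC_NS)


def property_is_like(propertyname, literal, wildcard="*", single_char="?", escape_char="\\"):
    """Build an ogc:PropertyIsLike element (illustrative, not owslib's code)."""
    node = ET.Element(
        "{%s}PropertyIsLike" % OGC_NS,
        # single_char and escape_char defaults are assumptions for this sketch.
        {"wildCard": wildcard, "singleChar": single_char, "escapeChar": escape_char},
    )
    ET.SubElement(node, "{%s}PropertyName" % OGC_NS).text = propertyname
    ET.SubElement(node, "{%s}Literal" % OGC_NS).text = literal
    return node


node = property_is_like("apiso:AnyText", "*sea_surface_height*")
# tostring returns bytes on Python 3, so decode for display.
xml_text = ET.tostring(node).decode("ascii")
```

Wrapping the CF name in the wildcard character on both sides is what turns the constraint into a "contains" search over the record's full text.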

Query each CSW catalog for the cf_name_filters constructed above


In [8]:
from owslib.csw import CatalogueServiceWeb
from utilities import normalize_service_urn

var_results = []

for var_name, single_var_filter in zip(variables_to_query, cf_name_filters):
    for url in known_csw_servers:
        try:
            csw = CatalogueServiceWeb(url, timeout=20)
            csw.getrecords2(constraints=[single_var_filter], maxrecords=1000, esn='full')
            for record, item in csw.records.items():
                for d in item.references:
                    result = dict(variable=var_name,
                                  scheme=normalize_service_urn(d['scheme']),
                                  url=d['url'],
                                  server=url,
                                  # record is the identifier string; the title lives on item
                                  title=item.title)
                    var_results.append(result)
        # Catch Exception rather than BaseException so Ctrl-C still interrupts the loop
        except Exception as e:
            print "- FAILED: %s - %s" % (url, e)


- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://www.ngdc.noaa.gov/geoportal/csw - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://www.ngdc.noaa.gov/geoportal/csw - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - 'REQUEST_LIMITATION: TOO_MANY_RECORDS - The request asked for more records than can be handled. The maximum number designated in GetRecords request should be less than 200, please decrease the returned recorder number in request.'
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 'ORA-00907: missing right parenthesis'
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - timed out
Paginating CSW Records - Some servers cap the number of records you can retrieve in a single request. See: https://github.com/ioos/system-test/issues/126
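
The cwic failure above says a single request should ask for fewer than 200 records, so one option is to fetch results page by page. The sketch below shows only the paging loop and uses a fake in-memory catalogue so it runs without a network. The names paginate, fetch_page and fake_fetch are hypothetical, and the (records, total_matches) return shape is an assumption of this sketch.

```python
def paginate(fetch_page, page_size=100):
    """Yield records from successive pages until the source is exhausted.

    `fetch_page(start, size)` is a hypothetical callable that returns
    (records, total_matches) for the 1-based position `start`.
    """
    start = 1
    while True:
        records, total = fetch_page(start, page_size)
        for record in records:
            yield record
        start += len(records)
        # Stop on an empty page or once every match has been seen.
        if not records or start > total:
            break


# Stand-in for a catalogue holding 450 matching records.
FAKE_CATALOGUE = ["record-%d" % i for i in range(1, 451)]


def fake_fetch(start, size):
    """Return one page of the fake catalogue plus the total match count."""
    page = FAKE_CATALOGUE[start - 1:start - 1 + size]
    return page, len(FAKE_CATALOGUE)


all_records = list(paginate(fake_fetch, page_size=199))
```

Against a real server, fetch_page could plausibly wrap csw.getrecords2 with its startposition and maxrecords arguments and read the total from csw.results['matches']; that wiring is untested here and should be checked against the owslib version in use.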

Load results into a Pandas DataFrame


In [10]:
%matplotlib inline
import pandas as pd
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 500)

from IPython.display import HTML

df = pd.DataFrame(var_results)
df = df.drop_duplicates()

Results by variable


In [12]:
by_variable = pd.DataFrame(df.groupby("variable").size(), columns=("Number of services",))
by_variable.sort('Number of services', ascending=False).plot(kind="barh", figsize=(10,8,))


Out[12]:
<matplotlib.axes.AxesSubplot at 0x7f6d8ddb12d0>

The number of service types for each variable


In [13]:
import math

var_service_summary = pd.DataFrame(df.groupby(["variable", "scheme"], sort=True).size(), columns=("Number of services",))
#HTML(var_service_summary.to_html())
var_service_plotter = var_service_summary.unstack("variable")
var_service_plot = var_service_plotter.plot(kind='barh', subplots=True, figsize=(12,120), sharey=True)
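
To make the groupby and unstack steps above less opaque, here is a standard-library sketch of the same counting on a few made-up rows. The rows, scheme names and example.org URLs are hypothetical and only mimic the shape of the dicts collected in var_results.

```python
from collections import Counter, defaultdict

# Hypothetical result rows shaped like the dicts collected in var_results.
rows = [
    {"variable": "sea_surface_height", "scheme": "opendap", "url": "http://example.org/a"},
    {"variable": "sea_surface_height", "scheme": "wms", "url": "http://example.org/b"},
    {"variable": "sea_surface_height", "scheme": "opendap", "url": "http://example.org/c"},
    {"variable": "sea_surface_elevation", "scheme": "sos", "url": "http://example.org/d"},
]

# Equivalent of groupby(["variable", "scheme"]).size(): count each pair.
pair_counts = Counter((r["variable"], r["scheme"]) for r in rows)

# Equivalent of unstack("variable"): one entry per scheme, one column per variable.
by_scheme = defaultdict(dict)
for (variable, scheme), n in pair_counts.items():
    by_scheme[scheme][variable] = n
```

Each key of by_scheme corresponds to one bar position in the horizontal bar charts, and each inner key corresponds to one subplot.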