In [1]:
title = "Fetching data from a CSW catalog"
name = '2015-10-12-fetching_data'

In [2]:
%matplotlib inline
import seaborn
seaborn.set(style='ticks')

import os
from datetime import datetime
from IPython.core.display import HTML

import warnings
warnings.simplefilter("ignore")

# Metadata and markdown generation.
hour = datetime.utcnow().strftime('%H:%M')
comments = "true"

date = '-'.join(name.split('-')[:3])
slug = '-'.join(name.split('-')[3:])

metadata = dict(title=title,
                date=date,
                hour=hour,
                comments=comments,
                slug=slug,
                name=name)

markdown = """Title: {title}
date:  {date} {hour}
comments: {comments}
slug: {slug}

{{% notebook {name}.ipynb cells[2:] %}}
""".format(**metadata)

content = os.path.abspath(os.path.join(os.getcwd(), os.pardir,
                                       os.pardir, '{}.md'.format(name)))

with open('{}'.format(content), 'w') as f:
    f.writelines(markdown)


html = """
<small>
<p> This post was written as an IPython notebook.  It is available for
<a href="http://ioos.github.com/system-test/downloads/
notebooks/%s.ipynb">download</a>.  You can also try an interactive version on
<a href="http://mybinder.org/repo/ioos/system-test/">binder</a>.</p>
<p></p>
""" % (name)



This notebook shows a typical workflow to query a Catalog Service for the Web (CSW) and create requests for data endpoints that are suitable for download.

The catalog of choice is the NGDC geoportal (http://www.ngdc.noaa.gov/geoportal/csw), and we want to query it using a geographical bounding box, a time range, and a variable of interest.

The example below will fetch Sea Surface Temperature (SST) data from all available observations and models in the Boston Harbor region. The goal is to assess the water temperature for the Boston Light Swim event.

We will search for data in a window of $\pm$ 4 days centered on the event date.


In [3]:
from datetime import datetime, timedelta

event_date = datetime(2015, 8, 15)

start = event_date - timedelta(days=4)
stop = event_date + timedelta(days=4)

The bounding box is slightly larger than Boston Harbor to ensure we get some data.


In [4]:
spacing = 0.25

bbox = [-71.05-spacing, 42.28-spacing,
        -70.82+spacing, 42.38+spacing]

The CF_names object is just a Python dictionary whose keys are SOS names and whose values contain all possible combinations of temperature variable names in the CF conventions. Note that we also define a units object; we will use it to coerce all the data to Celsius.
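
For reference, CF_names looks roughly like the sketch below. This is a simplified, hypothetical version; the real dictionary shipped with the utilities module covers more variables and more standard names.

# Simplified sketch of the CF_names mapping: each SOS variable name points to
# the CF standard names that may describe the same quantity in the catalog.
CF_names = {
    'sea_water_temperature': [
        'sea_water_temperature',
        'sea_surface_temperature',
        'sea_water_potential_temperature',
    ],
}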


In [5]:
import iris
from utilities import CF_names

sos_name = 'sea_water_temperature'
name_list = CF_names[sos_name]

units = iris.unit.Unit('celsius')

Now it is time to stitch all that together. For that we will use OWSLib*.

Constructing the filter is probably the most complex part. We start with a list comprehension that uses fes.Or to create the variable-name filter. The next step is to exclude some unwanted results (ROMS Averages files) using fes.Not. To select the desired dates we wrote a wrapper function, fes_date_filter, that takes the start and end dates of the event. Finally, we apply fes.And to join all the conditions above into one filter list.

* OWSLib is a Python package for client programming with Open Geospatial Consortium (OGC) web service (hence OWS) interface standards, and their related content models.


In [6]:
from owslib import fes
from utilities import fes_date_filter

kw = dict(wildCard='*',
          escapeChar='\\',
          singleChar='?',
          propertyname='apiso:AnyText')

or_filt = fes.Or([fes.PropertyIsLike(literal=('*%s*' % val), **kw)
                  for val in name_list])

# Exclude ROMS Averages and History files.
not_filt = fes.Not([fes.PropertyIsLike(literal='*Averages*', **kw)])

begin, end = fes_date_filter(start, stop)
filter_list = [fes.And([fes.BBox(bbox), begin, end, or_filt, not_filt])]
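
The fes_date_filter wrapper hides the details of the temporal constraint. A minimal, hypothetical sketch of what such a helper might do, assuming the catalog exposes the apiso:TempExtent_begin and apiso:TempExtent_end queryables (this is an illustration, not the exact utilities implementation):

from owslib import fes

def fes_date_filter(start, stop):
    # Hypothetical sketch: select records whose temporal extent overlaps
    # [start, stop].  The record must begin before our stop date...
    begin = fes.PropertyIsLessThanOrEqualTo(
        propertyname='apiso:TempExtent_begin',
        literal=stop.strftime('%Y-%m-%d %H:%M'))
    # ... and end after our start date.
    end = fes.PropertyIsGreaterThanOrEqualTo(
        propertyname='apiso:TempExtent_end',
        literal=start.strftime('%Y-%m-%d %H:%M'))
    return begin, end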

Now we are ready to instantiate a CSW client and feed it the filter list we created.


In [7]:
from owslib.csw import CatalogueServiceWeb

csw = CatalogueServiceWeb('http://www.ngdc.noaa.gov/geoportal/csw',
                          timeout=60)

csw.getrecords2(constraints=filter_list, maxrecords=1000, esn='full')

fmt = '{:*^64}'.format
print(fmt(' Catalog information '))
print("CSW version: {}".format(csw.version))
print("Number of datasets available: {}".format(len(csw.records.keys())))


********************* Catalog information **********************
CSW version: 2.0.2
Number of datasets available: 13

We found 13 datasets! Not bad for such a narrow search area and time-span.

What do we have there? Let's use the custom service_urls function to split the datasets into OPeNDAP and SOS endpoints.
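
The service_urls helper itself is nothing magical: it walks over the CSW records and collects the references whose scheme matches the requested service. A rough, hypothetical sketch, assuming the OWSLib record.references structure (a list of dicts with 'scheme' and 'url' keys):

def service_urls(records, service='odp:url'):
    # Hypothetical sketch: collect the endpoints of a given service scheme.
    urls = []
    for record in records.values():
        urls.extend(ref['url'] for ref in record.references
                    if service in (ref.get('scheme') or ''))
    return sorted(set(urls))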


In [8]:
from utilities import service_urls

dap_urls = service_urls(csw.records, service='odp:url')
sos_urls = service_urls(csw.records, service='sos:url')

print(fmt(' SOS '))
for url in sos_urls:
    print('{}'.format(url))

print(fmt(' DAP '))
for url in dap_urls:
    print('{}.html'.format(url))


***************************** SOS ******************************
http://sdf.ndbc.noaa.gov/sos/server.php?service=SOS&request=GetCapabilities&acceptVersions=1.0.0
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/036p1/036p1_d41.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/067p1/067p1_historic.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/094p1/094p1_historic.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/100p1/100p1_historic.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/162p1/162p1_d06.nc?service=SOS&version=1.0.0&request=GetCapabilities
***************************** DAP ******************************
http://geoport.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd.html
http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw.html
http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global.html
http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc.html
http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best.html
http://thredds.axiomdatascience.com/thredds/dodsC/G1_SST_GLOBAL.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/036p1/036p1_d41.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/067p1/067p1_historic.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/094p1/094p1_historic.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/100p1/100p1_historic.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/162p1/162p1_d06.nc.html

We will ignore the SOS endpoints for now and use only the DAP endpoints. But note that some of those SOS and DAP endpoints look suspicious. The Scripps Institution of Oceanography (SIO/UCSD) data should not appear in a search for Boston Harbor. That is a known issue and we are working to sort it out. Meanwhile, we have to filter out all the observations from the DAP URLs with the is_station function.

However, that filter still leaves behind URLs like http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc.html. That is probably satellite data, not a model.

In an ideal world all datasets would have the coverage_content_type metadata attribute defined. With coverage_content_type we could tell models apart automatically. Until then we will have to make do with the heuristic function is_model from the utilities module. The is_model function works by comparing the metadata (and sometimes the data itself) against a series of criteria, like grid conventions, to figure out whether a dataset is model data or not. Because the function operates on the data, we will call it later, when we start downloading the data.
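
To give a flavor of these heuristics, here is a rough, hypothetical sketch of an is_station-style check; is_model applies the same idea with more elaborate criteria (grid conventions, coordinate metadata), so it is not reproduced here. The real functions in the utilities module are more careful than this.

from netCDF4 import Dataset

def is_station(url):
    # Hypothetical sketch: station-like datasets usually advertise themselves
    # through the cdm_data_type or featureType global attributes.
    nc = Dataset(url)
    try:
        cdm = getattr(nc, 'cdm_data_type', '').lower()
        feature = getattr(nc, 'featureType', '').lower()
    finally:
        nc.close()
    return 'station' in cdm or 'timeseries' in feature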


In [9]:
from utilities import is_station

non_stations = []
for url in dap_urls:
    try:
        if not is_station(url):
            non_stations.append(url)
    except RuntimeError as e:
        print("Could not access URL {}. {!r}".format(url, e))

dap_urls = non_stations

print(fmt(' Filtered DAP '))
for url in dap_urls:
    print('{}.html'.format(url))


************************* Filtered DAP *************************
http://geoport.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd.html
http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw.html
http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global.html
http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc.html
http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best.html

We still need to find endpoints for the observations. For that we'll use pyoos' NdbcSos and CoopsSos collectors. The pyoos API is different from OWSLib's, but note that we are re-using the same query variables we created for the catalog search (bbox, start, stop, and sos_name).


In [10]:
from pyoos.collectors.ndbc.ndbc_sos import NdbcSos

collector_ndbc = NdbcSos()

collector_ndbc.set_bbox(bbox)
collector_ndbc.end_time = stop
collector_ndbc.start_time = start
collector_ndbc.variables = [sos_name]

ofrs = collector_ndbc.server.offerings
title = collector_ndbc.server.identification.title
print(fmt(' NDBC Collector offerings '))
print('{}: {} offerings'.format(title, len(ofrs)))


******************* NDBC Collector offerings *******************
National Data Buoy Center SOS: 955 offerings

That number is misleading! Do we have 955 buoys available there? What exactly are the offerings? There is only one way to find out. Let's get the data!


In [11]:
from utilities import collector2table, get_ndbc_longname

ndbc = collector2table(collector=collector_ndbc)

names = []
for s in ndbc['station']:
    try:
        name = get_ndbc_longname(s)
    except ValueError:
        name = s
    names.append(name)

ndbc['name'] = names

ndbc.set_index('name', inplace=True)
ndbc.head()


Out[11]:
name                          station  sensor      lon     lat
Boston 16 Nm East Of Boston   44013    watertemp1  -70.69  42.35
Buoy A01                      44029    ct1         -70.57  42.52

That makes more sense. Two buoys were found inside the bounding box, and the name of at least one of them ("Boston 16 Nm East Of Boston") clearly places it in the right area.

Now the same thing for CoopsSos.


In [12]:
from pyoos.collectors.coops.coops_sos import CoopsSos

collector_coops = CoopsSos()

collector_coops.set_bbox(bbox)
collector_coops.end_time = stop
collector_coops.start_time = start
collector_coops.variables = [sos_name]

ofrs = collector_coops.server.offerings
title = collector_coops.server.identification.title
print(fmt(' Collector offerings '))
print('{}: {} offerings'.format(title, len(ofrs)))


********************* Collector offerings **********************
NOAA.NOS.CO-OPS SOS: 1054 offerings

In [13]:
from utilities import get_coops_metadata

coops = collector2table(collector=collector_coops)

names = []
for s in coops['station']:
    try:
        name = get_coops_metadata(s)[0]
    except ValueError:
        name = s
    names.append(name)

coops['name'] = names

coops.set_index('name', inplace=True)
coops.head()


Out[13]:
name        station  sensor  lon       lat
Boston, MA  8443970  E1      -71.0534  42.3548

We found one more. Now we can merge both into one table and start downloading the data.


In [14]:
from pandas import concat

all_obs = concat([coops, ndbc])

all_obs.head()


Out[14]:
name                          station  sensor      lon       lat
Boston, MA                    8443970  E1          -71.0534  42.3548
Boston 16 Nm East Of Boston   44013    watertemp1  -70.69    42.35
Buoy A01                      44029    ct1         -70.57    42.52

In [15]:
from pandas import DataFrame
from owslib.ows import ExceptionReport
from utilities import pyoos2df, save_timeseries

iris.FUTURE.netcdf_promote = True

data = dict()
col = 'sea_water_temperature (C)'
for station in all_obs.index:
    try:
        idx = all_obs['station'][station]
        df = pyoos2df(collector_ndbc, idx, df_name=station)
        if df.empty:
            df = pyoos2df(collector_coops, idx, df_name=station)
        data.update({idx: df[col]})
    except ExceptionReport as e:
        print("[{}] {}:\n{}".format(idx, station, e))

The cell below reduces or interpolates each series, depending on its original frequency, to an hourly time series.


In [16]:
from pandas import date_range

index = date_range(start=start, end=stop, freq='1H')
for k, v in data.iteritems():
    data[k] = v.reindex(index=index, limit=1, method='nearest')

obs_data = DataFrame.from_dict(data)

obs_data.head()


Out[16]:
                     44013  44029      8443970
2015-08-11 00:00:00  19     17.700001  17.5
2015-08-11 01:00:00  19     17.600000  17.7
2015-08-11 02:00:00  19     17.500000  17.8
2015-08-11 03:00:00  19     17.299999  17.8
2015-08-11 04:00:00  19     17.100000  17.9

And now the same for the models. Note that this is where we use the is_model function to filter out non-model endpoints.


In [17]:
import warnings
from iris.exceptions import (CoordinateNotFoundError, ConstraintMismatchError,
                             MergeError)
from utilities import (quick_load_cubes, proc_cube, is_model,
                       get_model_name, get_surface)

cubes = dict()
for k, url in enumerate(dap_urls):
    print('\n[Reading url {}/{}]: {}'.format(k+1, len(dap_urls), url))
    try:
        cube = quick_load_cubes(url, name_list,
                                callback=None, strict=True)
        if is_model(cube):
            cube = proc_cube(cube, bbox=bbox,
                             time=(start, stop), units=units)
        else:
            print("[Not model data]: {}".format(url))
            continue
        cube = get_surface(cube)
        mod_name, model_full_name = get_model_name(cube, url)
        cubes.update({mod_name: cube})
    except (RuntimeError, ValueError,
            ConstraintMismatchError, CoordinateNotFoundError,
            IndexError) as e:
        print('Cannot get cube for: {}\n{}'.format(url, e))


[Reading url 1/5]: http://geoport.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd

[Reading url 2/5]: http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw
[Not model data]: http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw

[Reading url 3/5]: http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global

[Reading url 4/5]: http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc
[Not model data]: http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc

[Reading url 5/5]: http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best

And now we can use the iris cube objects we collected to download model data near the buoys we found above. We will use get_nearest_water to search the 10 nearest model points that lie within 0.08 degrees of each buoy.

(This step is still a little bit clunky and needs some improvement!)
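
Conceptually, make_tree builds a KDTree from the model's horizontal grid and get_nearest_water walks the closest candidates around each buoy until it finds a usable "wet" time series. A stripped-down, hypothetical sketch of that idea (not the actual utilities implementation):

import numpy as np
from scipy.spatial import cKDTree

def make_tree(lon, lat):
    # Hypothetical sketch: KDTree over the model's horizontal grid points.
    points = np.column_stack((np.ravel(lon), np.ravel(lat)))
    return cKDTree(points)

def nearest_water_candidates(tree, obs_lon, obs_lat, k=10, max_dist=0.08):
    # Hypothetical sketch: flat indices of the k nearest grid points that lie
    # within max_dist degrees of the observation.  The real get_nearest_water
    # then skips masked (land) points and series whose variance is below
    # min_var, returning the first usable time series.
    dist, idx = tree.query([obs_lon, obs_lat], k=k)
    keep = dist <= max_dist
    return idx[keep], dist[keep]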


In [18]:
from iris.pandas import as_series
from utilities import (make_tree, get_nearest_water,
                       add_station, ensure_timeseries, remove_ssh)

model_data = dict()
for mod_name, cube in cubes.items():
    print(fmt(mod_name))
    try:
        tree, lon, lat = make_tree(cube)
    except CoordinateNotFoundError as e:
        print('Cannot make KDTree for: {}'.format(mod_name))
        continue
    # Get model series at observed locations.
    raw_series = dict()
    for station, obs in all_obs.iterrows():
        try:
            kw = dict(k=10, max_dist=0.08, min_var=0.01)
            args = cube, tree, obs.lon, obs.lat
            series, dist, idx = get_nearest_water(*args, **kw)
        except ValueError as e:
            status = "No Data"
            print('[{}] {}'.format(status, obs.name))
            continue
        if not series:
            status = "Land   "
        else:
            series = as_series(series)
            raw_series.update({obs['station']: series})
            status = "Water  "
        print('[{}] {}'.format(status, obs.name))
    if raw_series:  # Save that model series.
        model_data.update({mod_name: raw_series})
        del cube


****************************COAWST_4****************************
[Water  ] Boston, MA
[Water  ] Boston 16 Nm East Of Boston
[Water  ] Buoy A01
*****************************HYCOM******************************
[Land   ] Boston, MA
[Water  ] Boston 16 Nm East Of Boston
[Water  ] Buoy A01
*************************ROMS_ESPRESSO**************************
[Land   ] Boston, MA
[Land   ] Boston 16 Nm East Of Boston
[No Data] Buoy A01

To wrap up, let's plot the data from the 3 buoys we found together with the nearest model grid points.


In [19]:
import matplotlib.pyplot as plt

buoy = '44013'

fig, ax = plt.subplots(figsize=(11, 2.75))

obs_data[buoy].plot(ax=ax, label='Buoy')

for model in model_data.keys():
    try:
        model_data[model][buoy].plot(ax=ax, label=model)
    except KeyError:
        pass  # Could not find a model at this location.

leg = ax.legend()



In [20]:
buoy = '44029'

fig, ax = plt.subplots(figsize=(11, 2.75))

obs_data[buoy].plot(ax=ax, label='Buoy')

for model in model_data.keys():
    try:
        model_data[model][buoy].plot(ax=ax, label=model)
    except KeyError:
        pass  # Could not find a model at this location.

leg = ax.legend()



In [21]:
buoy = '8443970'

fig, ax = plt.subplots(figsize=(11, 2.75))

obs_data[buoy].plot(ax=ax, label='Buoy')

for model in model_data.keys():
    try:
        model_data[model][buoy].plot(ax=ax, label=model)
    except KeyError:
        pass  # Could not find a model at this location.

leg = ax.legend()


That is it! We fetched data based only on a bounding box, a time range, and a variable name. The workflow is not as smooth as we would like, though. We had to mix OWSLib catalog searches with two different pyoos collectors to download the observed and modeled data. Another hiccup is all the workarounds needed to go from iris cubes to pandas series/DataFrames. There is a clear need for a better way to represent CF feature types in a single Python object.

To end this post, check out the full version of the Boston Light Swim notebook (especially the interactive map at the end).


In [22]:
HTML(html)


Out[22]:

This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.