We will use the OWSLib library to construct queries and parse responses from a CSW (Catalogue Service for the Web) endpoint.
First, specify a CSW endpoint. You can check that it is working with a GetCapabilities request:
<endpoint>?request=GetCapabilities&service=CSW
For example:
http://catalog.data.gov/csw-all?service=CSW&version=2.0.2&request=GetCapabilities
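As a small convenience, the test URL can be assembled from the bare endpoint. This is only a sketch using the standard library; `capabilities_url` is a hypothetical helper name, not part of OWSLib, and the default version string is simply the one used in the example above.

```python
from urllib.parse import urlencode


def capabilities_url(endpoint, version="2.0.2"):
    """Build a GetCapabilities URL for a CSW endpoint (hypothetical helper)."""
    query = urlencode(
        {"service": "CSW", "version": version, "request": "GetCapabilities"}
    )
    # Append with '&' if the endpoint already carries a query string.
    separator = "&" if "?" in endpoint else "?"
    return "{}{}{}".format(endpoint, separator, query)


url = capabilities_url("http://catalog.data.gov/csw-all")
print(url)
```

Pasting the printed URL into a browser should return an XML capabilities document if the endpoint is alive.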
In [1]:
from owslib.csw import CatalogueServiceWeb
endpoints = dict(
    csw_all='http://catalog.data.gov/csw-all',  # Granule level production catalog.
    whoi='http://geoport.whoi.edu/csw',
    geoportal='http://www.ngdc.noaa.gov/geoportal/csw',
    ioos='https://data.ioos.us/csw',
    ioos_dev='https://dev-catalog.ioos.us/csw'
)
csw = CatalogueServiceWeb(endpoints['ioos'], timeout=60)
print(csw.version)
In [2]:
from owslib import fes
filter1 = fes.PropertyIsLike(
    propertyname='apiso:AnyText',
    literal='*sea_water_salinity*',
    escapeChar='\\',
    wildCard='*',
    singleChar='?'
)
csw.getrecords2(constraints=[filter1], maxrecords=100, esn='full')
print('Found {} records.\n'.format(len(csw.records.keys())))
for key, value in list(csw.records.items()):
    print('[{}]: {}'.format(value.title, key))
Hmm. The query above returns only 10 records, even though we specified maxrecords=100.
What's up with that?
It turns out the CSW service specifies a MaxRecordDefault that cannot be exceeded. For example, checking https://data.ioos.us/csw?request=GetCapabilities&service=CSW we find:
<ows:Constraint name="MaxRecordDefault">
<ows:Value>10</ows:Value>
</ows:Constraint>
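The same check can be scripted rather than read by eye. The sketch below parses a trimmed, hand-written stand-in for a capabilities document (not a real server response) with the standard library; the `ows` namespace URI and the `max_record_default` helper name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a GetCapabilities response, trimmed to the
# one constraint of interest. A real document is far longer.
SAMPLE = """<csw:Capabilities xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                  xmlns:ows="http://www.opengis.net/ows">
  <ows:OperationsMetadata>
    <ows:Constraint name="MaxRecordDefault">
      <ows:Value>10</ows:Value>
    </ows:Constraint>
  </ows:OperationsMetadata>
</csw:Capabilities>"""

NS = {"ows": "http://www.opengis.net/ows"}


def max_record_default(capabilities_xml):
    """Return the MaxRecordDefault value as an int, or None if absent."""
    root = ET.fromstring(capabilities_xml)
    for constraint in root.iterfind(".//ows:Constraint", NS):
        if constraint.get("name") == "MaxRecordDefault":
            value = constraint.find("ows:Value", NS)
            if value is not None and value.text:
                return int(value.text.strip())
    return None


page_limit = max_record_default(SAMPLE)
print(page_limit)
```

Knowing this limit up front gives a sensible page size for the requests that follow.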
So we need to loop over the getrecords2 requests, incrementing startposition each time:
In [3]:
from owslib.fes import SortBy, SortProperty
pagesize = 10
maxrecords = 50
sort_order = 'ASC' # Should be 'ASC' or 'DESC' (ascending or descending).
sort_property = 'dc:title' # A supported queryable of the CSW.
sortby = SortBy([SortProperty(sort_property, sort_order)])
In [4]:
startposition = 0
while True:
    print('getting records %d to %d' % (startposition, startposition+pagesize))
    csw.getrecords2(constraints=[filter1],
                    startposition=startposition,
                    maxrecords=pagesize,
                    sortby=sortby)
    for rec, item in csw.records.items():
        print(item.title)
    print()
    if csw.results['nextrecord'] == 0:
        break
    startposition += pagesize
    if startposition >= maxrecords:
        break
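Since this loop is repeated for every query in the notebook, it could be wrapped in a function. The sketch below is one possible variation, not the notebook's own method: it advances to the `nextrecord` position reported by the server instead of adding `pagesize`, assuming the server sets `nextrecord` to 0 on the last page. `fetch_titles` and `FakeCSW` are hypothetical names; `FakeCSW` is an offline stand-in that mimics only the attributes used here so the example runs without a network connection.

```python
from collections import OrderedDict
from types import SimpleNamespace


def fetch_titles(csw, constraints, pagesize=10, maxrecords=50, **kwargs):
    """Collect up to `maxrecords` titles, one page at a time (hypothetical helper)."""
    titles = []
    startposition = 0
    while len(titles) < maxrecords:
        csw.getrecords2(constraints=constraints, startposition=startposition,
                        maxrecords=pagesize, **kwargs)
        titles.extend(item.title for item in csw.records.values())
        nextrecord = csw.results['nextrecord']
        # Stop on the last page (0) or if the server does not advance.
        if nextrecord == 0 or nextrecord <= startposition:
            break
        startposition = nextrecord
    return titles[:maxrecords]


class FakeCSW:
    """Offline stand-in for a CSW client; not part of OWSLib."""

    def __init__(self, titles):
        self._titles = titles
        self.records = OrderedDict()
        self.results = {}

    def getrecords2(self, constraints=None, startposition=0, maxrecords=10, **kwargs):
        first = max(startposition, 1)  # treat positions as 1-based
        page = self._titles[first - 1:first - 1 + maxrecords]
        self.records = OrderedDict(
            (str(first + i), SimpleNamespace(title=t)) for i, t in enumerate(page)
        )
        nxt = first + len(page)
        self.results = {'matches': len(self._titles), 'returned': len(page),
                        'nextrecord': nxt if nxt <= len(self._titles) else 0}


fake = FakeCSW(['record %d' % n for n in range(1, 24)])
titles = fetch_titles(fake, constraints=[], pagesize=10, maxrecords=50)
print(len(titles), titles[0], titles[-1])
```

With a real catalog, the `csw` object created earlier would be passed in place of `FakeCSW`, with `sortby=sortby` forwarded as a keyword argument.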
Okay, now let's define a second query filter and combine it with the first one:
In [5]:
filter2 = fes.PropertyIsLike(
    propertyname='apiso:AnyText',
    literal='*ROMS*',
    escapeChar='\\',
    wildCard='*',
    singleChar='?'
)
filter_list = [fes.And([filter1, filter2])]
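To get a feel for what the combined filter asks for, the wildcard patterns can be emulated locally. This is only a rough sketch: the actual matching happens on the server and may differ (for example in case sensitivity or tokenization). The helper names and the sample strings are made up for illustration.

```python
import re


def like_to_regex(literal, wildcard='*', single_char='?', escape_char='\\'):
    """Translate a PropertyIsLike-style pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(literal):
        ch = literal[i]
        if ch == escape_char and i + 1 < len(literal):
            # The escaped character is taken literally.
            parts.append(re.escape(literal[i + 1]))
            i += 2
            continue
        if ch == wildcard:
            parts.append('.*')
        elif ch == single_char:
            parts.append('.')
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile(''.join(parts), re.DOTALL)


def matches_all(text, literals):
    """Emulate And([...]) over several PropertyIsLike patterns."""
    return all(like_to_regex(lit).fullmatch(text) is not None for lit in literals)


# Made-up sample strings, not real catalog records.
samples = [
    'ROMS forecast with sea_water_salinity',
    'HYCOM sea_water_salinity analysis',
    'ROMS sea_surface_height',
]
patterns = ['*sea_water_salinity*', '*ROMS*']
hits = [s for s in samples if matches_all(s, patterns)]
print(hits)
```

Only the string containing both terms survives, which is the behavior expected from wrapping the two filters in `fes.And`.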
In [6]:
startposition = 0
maxrecords = 50
while True:
    print('getting records %d to %d' % (startposition, startposition+pagesize))
    csw.getrecords2(constraints=filter_list,
                    startposition=startposition, maxrecords=pagesize, sortby=sortby)
    for rec, item in csw.records.items():
        print(item.title)
    print()
    if csw.results['nextrecord'] == 0:
        break
    startposition += pagesize
    if startposition >= maxrecords:
        break
In [7]:
import random
choice = random.choice(list(csw.records.keys()))
print(csw.records[choice].title)
csw.records[choice].references
Let's see what the full XML record looks like:
In [8]:
from xml.dom import minidom
# Use a separate name so the `xml` package is not shadowed when the cell is re-run.
dom = minidom.parseString(csw.records[choice].xml)
print(dom.toprettyxml())
Yuk! That's why we use OWSLib! :-)
Now add a constraint to return only records that have either the OPeNDAP or SOS service.
Let's first see what services are advertised:
In [9]:
try:
    csw.get_operation_by_name('GetDomain')
    csw.getdomain('apiso:ServiceType', 'property')
    print(csw.results['values'])
except Exception:
    print('GetDomain not supported')
In [10]:
services = ['OPeNDAP', 'SOS']
service_filt = fes.Or(
    [fes.PropertyIsLike(propertyname='apiso:ServiceType',
                        literal=('*%s*' % val),
                        escapeChar='\\',
                        wildCard='*',
                        singleChar='?')
     for val in services])
filter_list = [fes.And([filter1, filter2, service_filt])]
In [11]:
startposition = 0
while True:
    print('getting records %d to %d' % (startposition, startposition+pagesize))
    csw.getrecords2(constraints=filter_list,
                    startposition=startposition,
                    maxrecords=pagesize,
                    sortby=sortby)
    for rec, item in csw.records.items():
        print(item.title)
    print()
    if csw.results['nextrecord'] == 0:
        break
    startposition += pagesize
    if startposition >= maxrecords:
        break
Let's try adding a search for a non-existent service, which should return no records:
In [12]:
val = 'not_a_real_service'
filter3 = fes.PropertyIsLike(
    propertyname='apiso:ServiceType',
    literal=('*%s*' % val),
    escapeChar='\\',
    wildCard='*',
    singleChar='?'
)
filter_list = [fes.And([filter1, filter2, filter3])]
csw.getrecords2(constraints=filter_list, maxrecords=100, esn='full')
print('Found {} records.\n'.format(len(csw.records.keys())))
for key, value in list(csw.records.items()):
    print('[{}]: {}'.format(value.title, key))
Good!
Now add a bounding box constraint. To specify lon,lat order for the bbox (which we want so that the same bbox works with either geoportal server or pycsw requests), we need to request the bounding box using the CRS84 coordinate reference system. The CRS84 option is available in pycsw 1.1.10+. The ability to specify the crs in the bounding box request is available in owslib 0.8.12+. For more info on the bounding box problem and how it was solved, see this pycsw issue, this geoportal server issue, and this owslib issue.
In [13]:
# [lon_min, lat_min, lon_max, lat_max]
bbox = [-158.4, 21.24, -157.5, 21.77]
bbox_filter = fes.BBox(bbox, crs='urn:ogc:def:crs:OGC:1.3:CRS84')
filter_list = [fes.And([filter1, filter2, service_filt, bbox_filter])]
startposition = 0
while True:
    print('getting records %d to %d' % (startposition, startposition+pagesize))
    csw.getrecords2(constraints=filter_list,
                    startposition=startposition, maxrecords=pagesize, sortby=sortby)
    for rec, item in csw.records.items():
        print(item.title)
    print()
    if csw.results['nextrecord'] == 0:
        break
    startposition += pagesize
    if startposition >= maxrecords:
        break
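Because axis order is the crux of the bounding box problem, a quick sanity check on the bbox before sending it can catch a swapped lon/lat pair. The sketch below is illustrative only: both helper names are hypothetical, and the lat,lon variant is shown on the assumption that a server expecting EPSG:4326 axis order would want the values that way round.

```python
def check_lonlat_bbox(bbox):
    """Validate a [lon_min, lat_min, lon_max, lat_max] box (CRS84 order)."""
    lon_min, lat_min, lon_max, lat_max = bbox
    if not (-180 <= lon_min <= 180 and -180 <= lon_max <= 180):
        raise ValueError('longitude out of range: {}'.format(bbox))
    if not (-90 <= lat_min <= lat_max <= 90):
        raise ValueError('latitude out of range or inverted: {}'.format(bbox))
    return list(bbox)


def swap_axis_order(bbox):
    """Convert lon,lat ordering to lat,lon: [lat_min, lon_min, lat_max, lon_max]."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return [lat_min, lon_min, lat_max, lon_max]


bbox = check_lonlat_bbox([-158.4, 21.24, -157.5, 21.77])
swapped = swap_axis_order(bbox)
print(swapped)
```

Passing the swapped list back into `check_lonlat_bbox` raises a `ValueError`, since -158.4 is not a valid latitude. That is the kind of mix-up the check is meant to flag.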
Now add time constraints. Here we first define a function that builds a pair of filters matching records whose data overlaps the specified time period:
In [14]:
def date_range(start, stop, constraint='overlaps'):
    """
    Take start and stop datetime objects and return a pair of `fes.PropertyIs<>` filters.
    """
    start = start.strftime('%Y-%m-%d %H:%M')
    stop = stop.strftime('%Y-%m-%d %H:%M')
    if constraint == 'overlaps':
        begin = fes.PropertyIsLessThanOrEqualTo(
            propertyname='apiso:TempExtent_begin', literal=stop
        )
        end = fes.PropertyIsGreaterThanOrEqualTo(
            propertyname='apiso:TempExtent_end', literal=start
        )
    elif constraint == 'within':
        begin = fes.PropertyIsGreaterThanOrEqualTo(
            propertyname='apiso:TempExtent_begin', literal=start
        )
        end = fes.PropertyIsLessThanOrEqualTo(
            propertyname='apiso:TempExtent_end', literal=stop
        )
    else:
        # Avoid returning undefined names for an unknown constraint.
        raise ValueError("constraint must be 'overlaps' or 'within'")
    return begin, end
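The difference between the two constraint modes is easier to see with plain datetime comparisons. The sketch below mirrors the inequalities used in `date_range` on made-up record extents; `overlaps` and `within` are hypothetical helper names, and the real comparison is of course done by the catalog server.

```python
from datetime import datetime


def overlaps(rec_begin, rec_end, start, stop):
    """Record begins before the window closes and ends after it opens."""
    return rec_begin <= stop and rec_end >= start


def within(rec_begin, rec_end, start, stop):
    """Record lies entirely inside the window."""
    return rec_begin >= start and rec_end <= stop


start, stop = datetime(2020, 1, 10), datetime(2020, 1, 20)

# Made-up record extents: (begin, end).
records = {
    'straddles_start': (datetime(2020, 1, 1), datetime(2020, 1, 12)),
    'inside': (datetime(2020, 1, 12), datetime(2020, 1, 15)),
    'after': (datetime(2020, 2, 1), datetime(2020, 2, 5)),
}

summary = {
    name: (overlaps(b, e, start, stop), within(b, e, start, stop))
    for name, (b, e) in records.items()
}
print(summary)
```

A record that only partly covers the window passes 'overlaps' but not 'within', so 'overlaps' is the more permissive choice for discovery.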
In [15]:
from datetime import datetime, timedelta
now = datetime.utcnow()
start = now - timedelta(days=3)
stop = now + timedelta(days=3)
print('{} to {}'.format(start, stop))
start, stop = date_range(start, stop)
In [16]:
filter_list = [fes.And([filter1, filter2, service_filt, bbox_filter, start, stop])]
startposition = 0
while True:
    print('getting records %d to %d' % (startposition, startposition+pagesize))
    csw.getrecords2(constraints=filter_list,
                    startposition=startposition,
                    maxrecords=pagesize,
                    sortby=sortby)
    for rec, item in csw.records.items():
        print(item.title)
    print()
    if csw.results['nextrecord'] == 0:
        break
    startposition += pagesize
    if startposition >= maxrecords:
        break
Now add a NOT filter to eliminate some entries
In [17]:
kw = dict(
    wildCard='*',
    escapeChar='\\',
    singleChar='?',
    propertyname='apiso:AnyText')
not_filt = fes.Not([fes.PropertyIsLike(literal='*Waikiki*', **kw)])
In [18]:
filter_list = [fes.And([filter1, filter2, service_filt, bbox_filter, start, stop, not_filt])]
startposition = 0
while True:
    print('getting records %d to %d' % (startposition, startposition+pagesize))
    csw.getrecords2(constraints=filter_list,
                    startposition=startposition, maxrecords=pagesize, sortby=sortby)
    for rec, item in csw.records.items():
        print(item.title)
    print()
    if csw.results['nextrecord'] == 0:
        break
    startposition += pagesize
    if startposition >= maxrecords:
        break
Hopefully this notebook demonstrated some of the power (and complexity) of CSW! ;-)