<img src="http://nci.org.au/wp-content/themes/nci/img/img-logo-large.png", width=400>


Programmatically accessing data through THREDDS and the VDI

...using Python 3

In this notebook:

The following material uses Geoscience Australia's Landsat 8 Data Collection which is available under the Creative Commons License 4.0. For more information on the collection and licensing, please click here.


Prerequisites:

  • A python 3 virtual environment with the following Python modules loaded:
    • matplotlib
    • netcdf4
    • siphon
    • shapely
    • requests

Setup instructions for python 3 virtual environments can be found here.

Browse for data

Begin by going to NCI's Geonetwork page: http://geonetwork.nci.org.au/

This page contains the metadata records for NCI Data Collections as well as information on where to find the data.

In this example, we will search for Landsat data:

If we click on the first result, we see a brief overview of the metadata record. Note: For the full record, navigate to the upper-right corner of your browser to change to the "Full view" (eyeball icon).

One of the options under Download and links is the NCI THREDDS Data Server Catalog page:

By navigating to this link, the available (public) data subcollections and datasets will be visible:

In this example, let's navigate to the LANDSAT data: scenes and tiles/ tiles/ EPSG3577/ LS8_OLI_TIRS_NBAR/ dataset:

Import python packages


In [1]:
from netCDF4 import Dataset
import matplotlib.pyplot as plt 
from siphon import catalog, ncss
%matplotlib inline

Using Siphon

Siphon is a collection of Python utilities for downloading data from Unidata data technologies. More information on installing and using Unidata's Siphon can be found: https://github.com/Unidata/siphon

Once selecting a parent dataset directory, Siphon can be used to search and use the data access methods and services provided by THREDDS. For example, Siphon will return a list of data endpoints for the OPeNDAP data URL, NetCDF Subset Service (NCSS), Web Map Service (WMS), Web Coverage Service (WCS), and the HTTP link for direct download.

Provide the top-level URL from the THREDDS page above

Note: You can leave the '.html' but you will receive a message saying it was changed to '.xml'.


In [2]:
catalog_url = 'http://dapds00.nci.org.au/thredds/catalog/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/catalog.xml'

Now we use Siphon to list all the available datasets under this catalog


In [3]:
tds = catalog.TDSCatalog(catalog_url)
datasets = list(tds.datasets)
endpts = tds.datasets.values()


---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-3-7b42d3ee98d2> in <module>()
----> 1 tds = catalog.TDSCatalog(catalog_url)
      2 datasets = list(tds.datasets)
      3 endpts = tds.datasets.values()

/Users/adam/anaconda3/envs/py35/lib/python3.5/site-packages/siphon/catalog.py in __init__(self, catalog_url)
     63         # get catalog.xml file
     64         resp = session.get(self.catalog_url)
---> 65         resp.raise_for_status()
     66 
     67         # If we were given an HTML link, warn about it and try to fix to xml

/Users/adam/anaconda3/envs/py35/lib/python3.5/site-packages/requests/models.py in raise_for_status(self)
    891 
    892         if http_error_msg:
--> 893             raise HTTPError(http_error_msg, response=self)
    894 
    895     def close(self):

HTTPError: 404 Client Error: Not Found for url: http://dapds00.nci.org.au/thredds/catalog/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/catalog.xml

Some of the datasets...


In [4]:
datasets[:10]


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-1468a8284bd2> in <module>()
----> 1 datasets[:10]

NameError: name 'datasets' is not defined

And their associated endpoints for data services:


In [5]:
endpts[0].access_urls


Out[5]:
{'HTTPServer': 'http://dapds00.nci.org.au/thredds/fileServer/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'ISO': 'http://dapds00.nci.org.au/thredds/iso/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'NCML': 'http://dapds00.nci.org.au/thredds/ncml/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'NetcdfSubset': 'http://dapds00.nci.org.au/thredds/ncss/grid/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'OPENDAP': 'http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'UDDC': 'http://dapds00.nci.org.au/thredds/uddc/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'WCS': 'http://dapds00.nci.org.au/thredds/wcs/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc',
 'WMS': 'http://dapds00.nci.org.au/thredds/wms/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_-10_-10_2013.nc'}

Now we can use Siphon along with some form of a query to find some data

This example will use the Python library Shapely to find intersecting Polygon shapes


In [6]:
from shapely.geometry import Polygon
from shapely.wkt import loads

query = (136, 138, -29.3, -27.8)
query = Polygon([[136, -29.3], [136, -27.8], [138, -27.8], [138, -29.3]])

# What this query looks like in WKT
print query.wkt


POLYGON ((136 -29.3, 136 -27.8, 138 -27.8, 138 -29.3, 136 -29.3))

Loop through the datasets and check if the Landsat's geospatial bounds (which is in a WKT polygon format) intersects with the query


In [7]:
%%time 

matches = []
for dataset in endpts:
    dap = dataset.access_urls['OPENDAP']
    with Dataset(dap) as f:
        bounds = loads(f.geospatial_bounds.encode())
        if bounds.intersects(query) == True:
            print dap
            matches.append(dap)


http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_3_-31_2013.nc
http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_3_-32_2013.nc
http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_4_-31_2013.nc
http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_4_-32_2013.nc
http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_5_-31_2013.nc
http://dapds00.nci.org.au/thredds/dodsC/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_5_-32_2013.nc
CPU times: user 3.88 s, sys: 2.29 s, total: 6.17 s
Wall time: 35.3 s

Let's take a quick look at what was found

(Because we are accessing data remotely through OPeNDAP, let's look at a lower resolution so it doesn't exceed memory limits.)


In [8]:
%%time

plt.figure(figsize=(12,12))

for match in matches:
    with Dataset(match) as f:
        x = f.variables['x'][::50]
        y = f.variables['y'][::50]
        t = f.variables['time'][::5]
        
        for i in range(0, len(t)):
            b2 = f.variables['band_2'][i,::50,::50]
            plt.pcolormesh(x, y, b2)


CPU times: user 434 ms, sys: 88 ms, total: 522 ms
Wall time: 19.4 s
/home/900/kad900/mypython0/mypy0/lib/python2.7/site-packages/numpy/ma/core.py:4089: UserWarning: Warning: converting a masked element to nan.
  warnings.warn("Warning: converting a masked element to nan.")

Equivalent programmatic access from the VDI

We will need to import an additional Python library


In [9]:
import os

This library can then be used similarly to Siphon to list all the available files within a directory.

This example with use os.listdir as a simple means to search the files under one parent directory. For searching a directory tree, os.walk is useful. For information on all the avaiable os library functions, please see: https://docs.python.org/2/library/os.html.


In [10]:
top_directory = '/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/'

files = os.listdir(top_directory)

print files[:10]


['LS8_OLI_TIRS_NBAR_3577_9_-14_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_-18_-23_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_-17_-27_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_10_-22_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_12_-36_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_-8_-10_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_20_-39_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_-4_-16_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_13_-38_2013.nc', 'LS8_OLI_TIRS_NBAR_3577_15_-48_2013.nc']

Let's perform the same polygon search again

As we are working direct on the filesystem, this step will be much faster.


In [11]:
%%time 

matches = []
for file in files:
    if file.endswith('.nc'):
        file_path = os.path.join(top_directory, file)
        with Dataset(file_path) as f:
            bounds = loads(f.geospatial_bounds.encode())
            if bounds.intersects(query) == True:
                print file_path
                matches.append(file_path)


/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_4_-31_2013.nc
/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_3_-32_2013.nc
/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_3_-31_2013.nc
/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_5_-31_2013.nc
/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_5_-32_2013.nc
/g/data2/rs0/tiles/EPSG3577/LS8_OLI_TIRS_NBAR/LS8_OLI_TIRS_NBAR_3577_4_-32_2013.nc
CPU times: user 4.21 s, sys: 1.06 s, total: 5.28 s
Wall time: 6.59 s

And again plot what we have found


In [12]:
%%time

plt.figure(figsize=(12,12))

for match in matches:
    with Dataset(match) as f:
        x = f.variables['x'][::10]
        y = f.variables['y'][::10]
        t = f.variables['time'][::10]
        
        b2 = f.variables['band_2'][14,::10,::10]
        plt.pcolormesh(x, y, b2)


CPU times: user 7.18 s, sys: 16 ms, total: 7.19 s
Wall time: 7.21 s