Presented by Alexis McIntyre (Geoscience Australia).
Would you like to know more about how to access and use the Australian Geoscience Data Cube? The Australian Geoscience Data Cube hosts the Australian Landsat Archive and other national Earth Observation collections, alongside access to gridded datasets such as rainfall and elevation.
This webinar will demonstrate how to access the data cube using the Virtual Desktop Infrastructure (VDI) at the National Computational Infrastructure (NCI).
By the end of the webinar you will be able to access national collections of Earth Observation data on the NCI, and use the Australian Geoscience Data Cube in a virtual desktop environment on the NCI.
This webinar is focused on access to the Australian Geoscience Data Cube via the NCI Virtual Desktop Infrastructure (VDI).
The Australian Geoscience Data Cube (AGDC) provides an integrated gridded data analysis environment for Earth observation satellite data and related data from multiple satellite and other acquisition systems. The project facilitates the operational processing and interactive analysis of national EO data sets and is used to produce the Landsat and MODIS collections on the NCI.
For specific information on how to use the data cube there are several resources available:
The National Computational Infrastructure (NCI) is Australia’s national research computing facility including the Southern Hemisphere’s most highly-integrated supercomputer and filesystems, Australia’s highest performance research cloud, and one of the nation’s largest data catalogues.
If you would like to follow along you will need to set up an account on the NCI and access to the Virtual Desktop Infrastructure (VDI). Use project wd8 to sign up via https://nci.org.au/access/user-registration/register-new-user. Instructions to set up access to the VDI are available at http://vdi.nci.org.au/help.
Joining project wd8 allows visualisation of all data on the NCI via the VDI, but provides only limited storage and compute resources. If extra storage and computing resources are required in the future, this should be discussed with the AGDC or NCI teams.
The AGDC can also be accessed within the High Performance Computing (HPC) environment (i.e. Raijin), and all users with computing quotas on Raijin are able to access the AGDC through the HPC system.
Connecting to the VDI environment to use the AGDC requires a request to join project wd8.
NCI has a multi-element system for metadata catalogues and data services.
AusCover also uses THREDDS, Geoserver, FTP and HTTP services to deliver data in a variety of formats, and is working to refine online discovery, subsetting and reformatting tools for both raster and vector data. Keep an eye out for the new website launching soon.
This all looks a little more tricky than firing up your desktop remote sensing package, but in return you get a highly flexible open source analysis environment that gives you the ability to perform reproducible research and to operationalise your algorithms nationally with ease. See the links below for training information, more Jupyter notebooks, or NCI help:
For these examples I'll be using a Jupyter Notebook with code in Python.
Below we will go through:
In [1]:
%matplotlib inline
import datacube
dc = datacube.Datacube()
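If you want the connection to identify itself (which can be useful when asking the AGDC or NCI teams about a query), the Datacube constructor also accepts an application name. This is a small optional sketch; the name used here is just an example.
# Optional: name the application when connecting (the name is illustrative)
dc = datacube.Datacube(app='agdc-webinar-example')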
Loading data from the datacube uses the load function.
The input parameters are all the filters you would like to put into your request for data, e.g. sensor, bands, output projection and so on. The function will handle projection and resampling on the fly.
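For example, a request that also asks the datacube to reproject and resample the result might look like the following sketch. The output_crs and resolution values shown are illustrative (the 25 m Australian Albers grid), and the worked example below builds its query step by step instead.
# Sketch of a load call that also specifies the output grid;
# output_crs and resolution trigger on-the-fly reprojection and resampling
example = dc.load(product='ls5_nbar_albers',
                  x=(142.1, 142.6), y=(-32.55, -32.25),
                  time=('2008-01-01', '2008-04-01'),
                  output_crs='EPSG:3577', resolution=(-25, 25))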
A fairly standard request would include:
The query is a special dictionary object which contains the parameters of your request. We will have a look at building a query object as this is the best way of setting up your request. First let's determine what we would like to request.
The product is the type of data you would like to load. A few examples are:
For a full list of the available products use the dc.list_products() command.
For this exercise we will have a look at some surface reflectance data from Landsat 5, as per the first example listed.
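As a quick check before loading anything, you can inspect the product list directly. This is a small sketch: dc.list_products() returns a pandas DataFrame, although the exact column names can vary between datacube versions.
# Inspect the full product list (a pandas DataFrame), then filter it with pandas;
# the 'name' column is assumed here and may differ between datacube versions
products = dc.list_products()
products[products['name'].str.contains('ls5_nbar')]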
You can also specify which measurements you are interested in retrieving. For imagery these are the bands; for other types of data they are the variables. This is only required if you do not want to return all of the bands/variables for the product. You can use any of the listed aliases to refer to a measurement, e.g. for Landsat 5 products you can use either the band number (e.g. band 1) or the band name (e.g. blue).
For a full list of the available measurements use the dc.list_measurements() command.
For the first example we will ask for the blue, green and red bands.
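If you only wanted those three bands returned, a load call could name them explicitly via the measurements argument, using either the aliases or the band numbers. This is a sketch only; the worked example below loads all of the bands for the product.
# Sketch: restrict the returned variables to red, green and blue (aliases shown;
# Landsat 5 band numbers should also work)
rgb_only = dc.load(product='ls5_nbar_albers',
                   measurements=['red', 'green', 'blue'],
                   x=(142.1, 142.6), y=(-32.55, -32.25),
                   time=('2008-01-01', '2008-04-01'))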
The next thing we need to do is define the spatial and temporal extents of our request for data. In this case we will look at an area covering the Menindee lakes in NSW over 2008.
So set up the parameters:
In [2]:
start_date = '2008-01-01'
end_date = '2008-04-01'
latitude_min = -32.25
latitude_max = -32.55
longitude_min = 142.1
longitude_max = 142.6
crs = 'EPSG:4326'
Here I am setting up a simple dictionary to hold the search terms. This is not strictly necessary, and for a simple case like this it is probably overkill, but it is a good habit to get into: you can pass the dictionary straight into the functions, which makes the rest of the commands much cleaner. If you later decide to restrict your request to just a few bands, or to change the sensor, this dictionary is the only part that needs to be changed.
In [3]:
def create_search_parameters(start_date, end_date, latitude_min,
                             latitude_max, longitude_min, longitude_max,
                             crs, required_bands=None):
    # Build a dictionary of search terms that can be passed straight to dc.load
    search_parameters = {}
    search_parameters['time'] = (start_date, end_date)
    search_parameters['x'] = (longitude_min, longitude_max)
    search_parameters['y'] = (latitude_max, latitude_min)
    search_parameters['crs'] = crs
    search_parameters['group_by'] = 'solar_day'
    return search_parameters
In [4]:
search_terms = create_search_parameters(start_date, end_date,
                                         latitude_min, latitude_max,
                                         longitude_min, longitude_max,
                                         crs)
print(search_terms)
In [5]:
from shapely.geometry import mapping
from shapely.geometry import MultiPolygon
import shapely.geometry
import shapely.ops
import rasterio
import rasterio.warp
import folium
from IPython.display import display
In [6]:
def datasets_union(dataset_list):
    # Union the dataset footprints and reproject from Australian Albers to WGS84
    shapes = shapely.ops.unary_union(
        [shapely.geometry.Polygon(dataset.extent.points) for dataset in dataset_list])
    return shapely.geometry.shape(
        rasterio.warp.transform_geom('EPSG:3577', 'EPSG:4326',
                                     shapely.geometry.mapping(shapes)))

def plot_folium(shapes):
    # Draw the footprints on an interactive folium map centred on Australia
    style_function = lambda x: {'fillColor': '#000000' if x['type'] == 'Polygon' else '#00ff00'}
    mapa = folium.Map(location=[-30, 150], zoom_start=4)
    for shape in shapes:
        poly = folium.features.GeoJson(mapping(shape), style_function=style_function)
        mapa.add_children(poly)
    display(mapa)
In [7]:
plot_folium([datasets_union(dc.index.datasets.search_eager(
    **datacube.api.query.Query(product='ls5_nbar_albers', **search_terms).search_terms))])
We are now ready to load our data:
In [8]:
data = dc.load(product = 'ls5_nbar_albers',**search_terms)
data
Out[8]:
In [9]:
from datacube.storage.masking import mask_valid_data as mask_invalid_data
from datacube.storage.masking import make_mask
from datacube.helpers import ga_pq_fuser
import numpy as np
We can now create a function which masks out the invalid and poor quality data. Note that when loading the pixel quality data we are using the like parameter: it takes an existing datacube dataset and returns the requested data with the same query parameters (apart from product and measurements).
In [10]:
def mask_pq(data, platform):
    # Derive the matching pixel quality product name, e.g. 'ls5_pq_albers'
    product = '_'.join([platform, 'pq_albers'])
    crs = data.crs
    # Mask out nodata values
    data = mask_invalid_data(data)
    # Load pixel quality on the same grid and dates as the data, then keep good pixels only
    pq = dc.load(product=product, group_by='solar_day', fuse_func=ga_pq_fuser, like=data)
    pq_mask = make_mask(pq.pixelquality, ga_good_pixel=True)
    data = data.where(pq_mask)
    data.attrs['crs'] = crs
    return data
In [11]:
data = mask_pq(data, 'ls5')
data
Out[11]:
The object returned by the load function is an xarray.Dataset, and statistical properties of the data can be calculated using the inbuilt functions of the xarray module.
In [12]:
data.red.max()
Out[12]:
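Other xarray reductions work the same way; for example, a per-pixel mean over time can be computed directly on a band. This is a sketch using the dataset loaded above; NaN values introduced by the masking are skipped by default.
# Per-pixel mean of the red band across all dates
mean_red = data.red.mean(dim='time')
mean_red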
Plotting a single variable is also very easy, and in this example we are plotting the red band for each date data was captured.
In [13]:
data.red.plot(col='time', col_wrap = 3)
Out[13]:
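If you only want a single date rather than the whole panel of plots, you can select it by position on the time dimension first. A sketch using xarray's isel:
# Plot the red band for the first acquisition only
data.red.isel(time=0).plot()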
Displaying an RGB image requires a little more effort, and the function below demonstrates how to do this.
In [14]:
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
from skimage import img_as_float, exposure
In [15]:
def plot_rgb(data, colour_vars=('red', 'green', 'blue'),
             title=None,
             time_index=0, scale_min=None, scale_max=None, ax=None):
    # Set title from the acquisition date if none is supplied
    if title is None:
        date = data['time'][time_index].values
        ts = pd.to_datetime(str(date))
        date_str = ts.strftime('%d/%m/%Y')
        title = date_str
    # Set up figure
    if not ax:
        plt.figure(num=None, figsize=(7, 5), dpi=80,
                   facecolor='w', edgecolor='k')
        ax = plt
    plt.axis('off')
    plt.title(title)
    # Scaling: stretch each band to the 0-1 range
    scaled_arrays = {}
    for colour in colour_vars:
        array = data.data_vars[colour][time_index].values / 10000
        if scale_min is None:
            scale_min = np.nanmin(array)
        if scale_max is None:
            scale_max = np.nanmax(array)
        array = array.clip(min=scale_min, max=scale_max)
        array = (array - scale_min) / (scale_max - scale_min)
        indices = np.where(array < 0)
        array[indices] = 0.0
        indices = np.where(array > 1)
        array[indices] = 1.0
        scaled_arrays[colour] = array
    # Stacking the three bands into an RGB image
    rgb_stack = np.dstack((scaled_arrays[colour_vars[0]],
                           scaled_arrays[colour_vars[1]],
                           scaled_arrays[colour_vars[2]]))
    # Plot RGB
    return ax.imshow(rgb_stack)
We can investigate the dates we have data for by looking at the time coordinate of the dataset.
In [16]:
data['time']
Out[16]:
We can also look at which variables/bands we have.
In [17]:
data.data_vars.keys()
Out[17]:
In [18]:
plot_rgb(data)
Out[18]:
In [19]:
plot_rgb(data, ('red', 'green', 'blue'),
time_index=4, scale_min=0, scale_max=0.4)
Out[19]: