Chunking Demo (WOFS)


Notebook Summary

This notebook demonstrates how to load and process data in chunks to avoid running out of memory (which raises a MemoryError). The Australian Water Observations from Space (WOFS) algorithm is used to detect water. This water classifier is significantly more accurate than the Landsat QA water flag or the NDWI index for identifying water.

For more information, visit this website: http://www.ga.gov.au/scientific-topics/hazards/flood/wofs
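
For reference, the NDWI index mentioned above is a simple band ratio that can be computed directly from the green and near-infrared measurements. The sketch below is a minimal illustration (assuming an xarray Dataset with green and nir variables, like the chunks loaded later in this notebook), not part of the WOFS algorithm itself:

def ndwi(dataset):
    # NDWI = (green - NIR) / (green + NIR); values above ~0 suggest open water.
    return (dataset.green - dataset.nir) / (dataset.green + dataset.nir)

# Hypothetical usage: water_mask = ndwi(dataset_chunk) > 0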


Index

- Import Dependencies and Connect to the Data Cube
- Choose Platform and Product
- Get the Extents of the Cube
- Define the Extents of the Analysis
- Load Data from the Data Cube
- Time Series Water Detection Analysis
- Create GeoTIFF Output Products

Import Dependencies and Connect to the Data Cube [▴](#chunking_demo_top)


In [1]:
import matplotlib.pyplot as plt
import numpy as np  
import xarray as xr
import warnings
import datacube
import utils.data_cube_utilities.data_access_api as dc_api  

# Suppress warnings
warnings.filterwarnings('ignore')

api = dc_api.DataAccessApi()
dc = api.dc

Choose Platform and Product [▴](#chunking_demo_top)


In [2]:
# Select a Product and Platform
# Example regions: ghana, kenya, tanzania, sierra_leone, senegal

#product = "ls7_usgs_sr_scene"
#platform = "LANDSAT_7"

product = "ls8_usgs_sr_scene"
platform = "LANDSAT_8"

Get the Extents of the Cube [▴](#chunking_demo_top)


In [3]:
from utils.data_cube_utilities.dc_load import get_product_extents
from utils.data_cube_utilities.dc_time import dt_to_str

full_lat, full_lon, min_max_dates = get_product_extents(api, platform, product)

# Print the extents of the data.
print("Latitude Extents:", full_lat)
print("Longitude Extents:", full_lon)
print("Time Extents:", list(map(dt_to_str, (min_max_dates[0], min_max_dates[1]))))


Latitude Extents: (-12.63361111121218, 18.40166666681388)
Longitude Extents: (-25.47250000020378, 44.01000000035208)
Time Extents: ['2013-03-21', '2020-01-27']

Visualize the available area


In [4]:
from utils.data_cube_utilities.dc_display_map import display_map
display_map(full_lat, full_lon)


Out[4]:

Define the Extents of the Analysis [▴](#chunking_demo_top)

**CHANGE INPUTS BELOW**


In [5]:
# Select an analysis region (Lat-Lon) within the extents listed above. 
# Be sure to check whether you are using L7 or L8, as their time extents are very different.
# Select a time period (Min-Max) within the extents listed above (Year-Month-Day)
# This region and time period will be used for the water assessment

# Southern Lake Guiers - Senegal
# lat = (15.9785, 16.1316 ) 
# lon = (-15.9789, -15.8519) 

# Northern Lake Guiers - Senegal
# lat =  (16.1555, 16.397 ) 
# lon = (-15.9047, -15.7184) 

# Lake Guiers - Senegal
# lat = (15.925, 16.40)
# lon = (-15.975, -15.75)

# Random Lake - Senegal
# lat =  (14.8695, 14.8954 ) 
# lon = (-17.0633, -17.0431) 

# Lake Retba - Senegal
lat =  (14.7543, 14.7721) 
lon = (-17.4143, -17.3948)

# East of Niokolo bar, Senegal
#lat = (13.1710, 13.2304)
#lon  = (-12.1627, -12.0930)
#time_extents = ('01/02/2000', '01/24/2018')

# Mako Forest, Senegal
# lat = (12.7774, 12.9963)
# lon = (-12.5275, -12.2780)

# Niokolo koba, Senegal
# lat = (12.5431, 13.4464)
# lon = (-13.7666, -12.2484)

time_extents = ('01/01/2016', '12/31/2016') # MM/DD/YYYY

Visualize the available area


In [6]:
display_map(lat, lon)


Out[6]:

Load Data from the Data Cube [▴](#chunking_demo_top)

Large data sets must be broken up into smaller geographic "chunks" and processed one at a time. If you see a MemoryError, chunking may be the appropriate solution.

The basic chunking process is:

  1. Perform a load of the full analysis region without measurements to get metadata needed for chunking.
  2. Create a result DataArray that is large enough to hold the output (WOFS percentages in this case) for the entire analysis region.
  3. Split the analysis region into smaller geographic chunks (see the sketch after this list).
  4. Mask out the clouds from each chunk.
  5. Run WOFS on the chunk.
  6. Add the results to the result DataArray.
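
The following is a minimal illustration of the splitting in step 3. It is a simplified stand-in for the create_geographic_chunks utility used below, not that function's actual implementation:

import numpy as np

def split_extent(extent, chunk_size):
    # Split a (min, max) extent in degrees into consecutive (start, stop) pairs.
    edges = np.append(np.arange(extent[0], extent[1], chunk_size), extent[1])
    return list(zip(edges[:-1], edges[1:]))

# For example, splitting the Lake Guiers latitude extent into 0.2-degree chunks:
# split_extent((15.925, 16.40), 0.2) -> [(15.925, 16.125), (16.125, 16.325), (16.325, 16.40)]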

In [7]:
# loading with no measurements loads coordinates and dimensions only. 
# We can use this to properly size our result dataset.
metadata = dc.load(latitude = lat, longitude = lon,
                   platform = platform,
                   time = time_extents,
                   product = product,
                   measurements = [])

After loading, view the xarray Dataset. Notice that the dimensions give the number of pixels along the latitude and longitude dimensions as well as the number of time slices (time) in the time series.


In [8]:
metadata


Out[8]:
<xarray.Dataset>
Dimensions:    (latitude: 65, longitude: 71, time: 38)
Coordinates:
  * time       (time) datetime64[ns] 2016-01-11T11:27:32.609017 ... 2016-12-28T11:27:42.189137
  * latitude   (latitude) float64 14.77 14.77 14.77 14.77 ... 14.75 14.75 14.75
  * longitude  (longitude) float64 -17.41 -17.41 -17.41 ... -17.4 -17.4 -17.39
Data variables:
    *empty*
Attributes:
    crs:      EPSG:4326

Here we create a DataArray that is large enough to hold the WOFS percentages for the entire analysis region. Only the latitude and longitude dimensions are of interest in the final result, so we drop the time dimension to save memory.


In [9]:
num_lats, num_lons = [metadata.dims[key] for key in ['latitude', 'longitude']]
metadata_coords = dict(metadata.coords)
del metadata_coords['time']
water_percents = xr.DataArray(np.full((num_lats, num_lons), np.nan), 
                              dims=('latitude', 'longitude'), 
                              coords=metadata_coords, attrs=metadata.attrs)

The analysis region is split into smaller chunks according to the given geographic_chunk_size (in degrees). For larger time ranges, a smaller geographic_chunk_size should be used. However, the smaller the chunk size, the longer the notebook will take to run, since more time is spent loading data from the Data Cube, which is an expensive operation.
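
As a rough guide to run time, the number of chunks can be estimated from the analysis extents and the chunk size. The helper below is hypothetical (it assumes the region is tiled along both axes, which may differ from how create_geographic_chunks actually splits the region):

import math

def estimate_num_chunks(lat, lon, chunk_size):
    # Chunks needed along each axis, rounded up; the total is their product.
    return (math.ceil((lat[1] - lat[0]) / chunk_size) *
            math.ceil((lon[1] - lon[0]) / chunk_size))

# The Lake Retba region spans roughly 0.018 x 0.019 degrees, so a 0.05-degree
# chunk size yields a single chunk, matching the progress output further below.
print(estimate_num_chunks(lat, lon, 0.05))  # 1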


In [10]:
from utils.data_cube_utilities.dc_chunker import create_geographic_chunks

geographic_chunks = create_geographic_chunks(latitude=lat, longitude=lon, 
                                             geographic_chunk_size=0.05)

Each geographic chunk is loaded from the datacube, the clouds are masked out, and the WOFS percentages are computed for that chunk. The results are copied into the corresponding chunk of the water_percents result DataArray.


In [11]:
from utils.data_cube_utilities.clean_mask import landsat_qa_clean_mask
from utils.data_cube_utilities.dc_water_classifier import wofs_classify

measurements = ['red', 'green', 'blue', 'nir', 'swir1', 'swir2', 'pixel_qa']
for chunk_ind, geographic_chunk in enumerate(geographic_chunks):
    lon_chunk, lat_chunk = geographic_chunk.values()
    dataset_chunk = dc.load(latitude = lat_chunk, longitude = lon_chunk,
                            platform = platform,
                            time = time_extents,
                            product = product,
                            measurements = measurements)
    if len(dataset_chunk) > 0:  # Process only chunks that contain data.
        # `clean_mask_chunk` keeps clear views of land and water (no clouds).
        clean_mask_chunk = landsat_qa_clean_mask(dataset_chunk, platform=platform)
        dataset_chunk = dataset_chunk.drop('pixel_qa').where(clean_mask_chunk)
        wofs_chunk = wofs_classify(dataset_chunk, clean_mask=clean_mask_chunk.values)
        wofs_chunk = wofs_chunk.where(wofs_chunk != -9999)
        water_percents_chunk = (wofs_chunk.mean(dim = ['time']) * 100).wofs.rename('water_percents_chunk')
        chunk_selector = {'latitude':slice(*lat_chunk[::-1]), 
                          'longitude':slice(*lon_chunk)}
        water_percents.loc[chunk_selector].values[:] = \
            water_percents_chunk.loc[chunk_selector].values
        
    print("\rProcessed {} out of {} chunks ({:.2%})."
          .format(chunk_ind+1, len(geographic_chunks), 
                  (chunk_ind+1)/len(geographic_chunks)), end='')


Processed 1 out of 1 chunks (100.00%).

Time Series Water Detection Analysis [▴](#chunking_demo_top)

Time series output of the Water Observations from Space (WOFS) results. The results show the percent of time that a pixel is classified as water over the entire time series. BLUE = frequent water, RED = infrequent water.


In [12]:
# Import the color scheme and render values below vmin (set in the plot call below) as black.
from matplotlib.cm import jet_r
jet_r.set_under('black', 1)

In [13]:
# This is where the WOFS time series product is generated. 
# Areas of RED have experienced little or no water over the time series
# Areas of BLUE have experienced significant or constant water over the time series

# The "figsize" may need adjustment to get the proper scaling and avoid distortion. 
# See the XARRAY dimensions above to get an idea of the Lat-Lon ratio for your image, figsize=(x,y)
# The y-axis scale and legend is part of the image area, so that needs to be considered
# It is suggested to keep the x-dimension at x=12. 

water_percents.plot(cmap = jet_r, vmin=0.001, figsize=(14,12))
plt.show()


Create GeoTIFF Output Products [▴](#chunking_demo_top)


In [14]:
# Save the water percentage data to a GeoTIFF
from utils.data_cube_utilities.import_export import export_slice_to_geotiff

# Build an export Dataset holding the WOFS percentage as a fraction.
dataset_to_export = xr.Dataset(coords=water_percents.coords, attrs=water_percents.attrs)
dataset_to_export['wofs_pct'] = (water_percents / 100).astype(np.float32)

In [15]:
import os

output_dir = 'output/geotiffs'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# Check the output file path before running this export to ensure
# no existing files are accidentally overwritten.
export_slice_to_geotiff(dataset_to_export, output_dir + '/Chunking_Demo_WOFS.tif')

In [16]:
!ls -lah output/geotiffs/Chunking_Demo_WOFS.tif


-rw-r--r-- 1 jovyan users 19K May 21 16:21 output/geotiffs/Chunking_Demo_WOFS.tif

In [17]:
!gdalinfo output/geotiffs/Chunking_Demo_WOFS.tif


Driver: GTiff/GeoTIFF
Files: output/geotiffs/Chunking_Demo_WOFS.tif
Size is 71, 65
Coordinate System is:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0],
    UNIT["degree",0.0174532925199433],
    AUTHORITY["EPSG","4326"]]
Origin = (-17.414305555694870,14.772083333451512)
Pixel Size = (0.000273865414713,-0.000273504273506)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  ( -17.4143056,  14.7720833) ( 17d24'51.50"W, 14d46'19.50"N)
Lower Left  ( -17.4143056,  14.7543056) ( 17d24'51.50"W, 14d45'15.50"N)
Upper Right ( -17.3948611,  14.7720833) ( 17d23'41.50"W, 14d46'19.50"N)
Lower Right ( -17.3948611,  14.7543056) ( 17d23'41.50"W, 14d45'15.50"N)
Center      ( -17.4045833,  14.7631944) ( 17d24'16.50"W, 14d45'47.50"N)
Band 1 Block=71x28 Type=Float32, ColorInterp=Gray
  NoData Value=-9999
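
Beyond gdalinfo, the exported raster can be read back programmatically to check its contents. This is a minimal sketch, assuming the rasterio package is available in the environment:

import rasterio

with rasterio.open('output/geotiffs/Chunking_Demo_WOFS.tif') as src:
    # Read band 1 with the -9999 NoData value masked out.
    wofs_pct = src.read(1, masked=True)
    print(src.crs, src.shape)                 # Expect EPSG:4326 and (65, 71), per the gdalinfo output above.
    print(wofs_pct.count(), 'valid pixels')   # Valid water fractions should fall in [0, 1].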