In [2]:
# Enable importing of utilities
import sys
sys.path.append('..')
%matplotlib inline
The previous tutorial addressed the identifying the extent of the rainy season near Lake Chad. This tutorial will focus on cleaning up optical imagery to make it suitable for water-detection algorithms.
pixel_qa
The algorithmic process is fairly simple. It is a composable chain of operations on landsat 7 imagery. The goal is to create a scanline free and cloud-free representation of the data for pre and post rainy season segments of 2015. The process is outlined as follows:
What scanline-free or cloud-free means will be addressed later in the tutorial. To understand everything, just follow the steps in sequence.
In [3]:
import datacube
dc = datacube.Datacube(app = "cloud_removal_in_chad", config = '/home/localuser/.datacube.conf')
Like in the previous tutorial. The datacube object will be used to load data that has previously been ingested by the datacube.
In [14]:
## Define Geographic boundaries using a (min,max) tuple.
latitude = (12.75, 13.0)
longitude = (14.25, 14.5)
## Specify a date range using a (min,max) tuple
from datetime import datetime
time = (datetime(2015,1,1), datetime(2016,1,1))
## define the name you gave your data while it was being "ingested", as well as the platform it was captured on.
platform = 'LANDSAT_7'
product = 'ls7_ledaps_lake_chad_full'
measurements = ['red', 'green', 'blue', 'nir', 'swir1', 'swir2','pixel_qa']
As a reminder and in-notebook reference, the following line of code displays the extents of the study area. Re-orient yourself with it.
In [17]:
from utils.data_cube_utilities.dc_display_map import display_map
display_map(latitude = (12.75, 13.0),longitude = (14.25, 14.5))
Out[17]:
In [18]:
#Load Landsat 7 data using parameters,
landsat_data = dc.load(latitude = latitude,
longitude = longitude,
time = time,
product = product,
platform = platform,
measurements = measurements)
The previous tutorial barely grazed the concept of xarray variables.
To understand how we use landsat7 imagery it will be necessary to make a brief detour and explain it in greater detail.
When you output the contents of your loaded -dataset...
In [19]:
print(landsat_data)
.. you should notice a list of values called data-variables.
These 'variables' are really 3 dimensional data-arrays housing values like 'red', 'green', 'blue', and 'nir', values for each lat,lon,time coordinate pair in your dataset. Think of an xarray.Dataset as an object that houses many different types of data under a shared coordinate system.
If you wish to fetch certain data from your dataset you can call it by its variable name. So, if for example, you wanted to get the near-infrared data-array from the dataset, you would just index it like so:
In [20]:
landsat_data.nir
Out[20]:
The object printed above is a data-array. Unlike a data-set, data-arrays only house one type of data and has its own set of attributes and functions.
Let's explore landsat datasets in greater detail by starting with some background about what sort of data landsat satellites collect...
In layman terms, Landsat satellites observe light that is reflected or emitted from the surface of the earth.
<img src = "diagrams/rainy_demo/visual_spectrum.jpg", style="width: 600px;"/>
In landsat The spectrum for observable light is split up into smaller sections like 'red', 'green', 'blue', 'thermal','infrared' so-on and so forth...
Each of these sections will have some value denoting how much of that light was reflected from each pixel. The dataset we've loaded in contains values for each of these sections in separate data-arrays under a shared dataset.
The ones used in this series of notebooks are:
red
green
blue
nir
- near infraredswir1
- band for short wave infrared swir2
- band for short wave infraredThere is an alternative band attached to the Landsat7 xarray dataset called pixel qa.
pixel_qa
- land cover classifications
In [22]:
## The only take-away from this code should be that a .png is produced.
## Any details about how this function is used is out of scope for this tutorial
from utils.data_cube_utilities.dc_utilities import write_png_from_xr
write_png_from_xr('../demo/landsat_rgb.png', landsat_data.isel(time = 11), ["red", "green", "blue"], scale = [(0,2000),(0,2000),(0,2000)])
Considering the imagery above. It is hard to build a comprehensive profile on landcover. There are several objects that occlude the surface of the earth. Namely errors introduced by an SLC malfunction, as well as cloud cover.
In May of 2003, Landsat7's scan line correction system failed (SLC). This essentially renders several horizontal rows of imagery unusable for analysis. Luckily, these scanline gaps don't always appear in the same spots. With enough imagery, a "gap-less" representation can be constructed that we can use to analyze pre and post rainy season.
Clouds get in the way of analyzing/observing the surface reflectance values of Lake Chad. Fortunately, like SLC gaps, clouds don't always appear in the same spot. With enough imagery, taken at close enough intervals, a "cloudless" representation of the area can be built for pre and post rainy season acquisitions of the region.
Strong Assumptions
In this analysis, strong assumptions are made regarding the variability of lake size in the span of a few acquisitions.(Namely, that the size in a pre-rainy season won't vary as much as it will after the rainy season contributes to the surface area of the lake)
The first step to cleaning up pre and post rainy season imagery is to split our year's worth of acquisitions into two separate datasets. In the previous notebooks, We've discovered that an appropriate boundary for the rainy season is sometime between June and October. For the sake of this notebook, we'll choose the first day in both months.
In [23]:
start_of_rainy_season = '2015-06-01'
end_of_rainy_season = '2015-10-01'
The next step after defining this would be to define the time ranges for both post and pre, then use them to select subsections from the original dataset to act as two separate datasets. (Like in the diagram below)
In [24]:
start_of_year = '2015-01-01'
end_of_year = '2015-12-31'
pre = landsat_data.sel(time = slice(start_of_year, start_of_rainy_season))
post = landsat_data.sel(time = slice(end_of_rainy_season, end_of_year))
This section of the process works s by masking out clouds and gaps from the imagery and then selecting a median valued cloud/scanline-gap free pixels of an image.
pixel_qa
We're going to be using one of our loaded values called pixel_qa
for the masking step.
pixel_qa
doesn't convey surface reflectance intensities. Instead, it is a band that contains more abstract information for each pixel. It places a pixel under one or more of the following categories:
clear
- pixel is likely normal landcover water
- pixel is likely water cloud_shadow
- pixel is likely in the shadow of a cloud snow
- the pixel is likely snowy cloud
- the pixel is likely cloudy fill
- the pixel is classified as not fit for analysis (SRC-Gaps fall in this classification) We will use these classifications to mask out values unsuitable for analysis.
The masking step will have to make use of a very peculiar encoding for each category.
The following function was constructed to mask out anything that isn't clear or water.
In [25]:
import numpy as np
def cloud_and_slc_removal_mask(dataset):
#Create boolean Masks for clear and water pixels
clear_pixels = dataset.pixel_qa.values == 2 + 64
water_pixels = dataset.pixel_qa.values == 4 + 64
a_clean_mask = np.logical_or(clear_pixels, water_pixels)
return a_clean_mask
The following code creates a boolean mask, for slc code.
In [26]:
mask_for_pre = cloud_and_slc_removal_mask(pre)
mask_for_post = cloud_and_slc_removal_mask(post)
A boolean mask is essentially what it sounds like. Let's look at its print-out
In [27]:
print(mask_for_post)
This boolean mask contains a true value for pixels that are clear and un-occluded by clouds or scanline gaps and false values where the opposite is true.
There are many ways to apply a mask. The following example is the xarray way. It will apply nan values to areas with clouds or scanline issues:
In [28]:
pre.where(mask_for_pre)
Out[28]:
Notice that a lot of the values in the array above have nan values. Compositing algorithms like the median-pixel mosaic below, make use of this where function as well as 'nans' as the marker for no-data values.
Here is a function we can use to build our mosaic. Its exact mechanics are abstracted away from this tutorial and can be explored in further detail by visiting our github.
In [29]:
from utils.data_cube_utilities.dc_mosaic import create_median_mosaic
def mosaic(dataset, mask):
return create_median_mosaic(dataset, clean_mask = mask)
Lets use it to generate our cloud free representations of the area:
In [30]:
clean_pre = mosaic(pre, mask_for_pre)
clean_post = mosaic(post, mask_for_post)
In [31]:
print(clean_pre)
In [33]:
write_png_from_xr('../demo/pre_rain_mosaic.png', clean_pre, ["red", "green", "blue"], scale = [(0,2000),(0,2000),(0,2000)])
Your png should look something like this:
In [34]:
print(clean_post)
In [36]:
write_png_from_xr('../demo/post_rain_mosaic.png', clean_post, ["red", "green", "blue"], scale = [(0,2000),(0,2000),(0,2000)])
The next notebook in the series deals with water classification on these cloud free composites! We'll need to save our work so that it can be loaded in the next notebook. The good news is that xarrays closely resemble the structure of net NETCDF files. It would make sense to save it off in that format. The code below saves these files as NETCDFS using built-in export features of xarray.
In [38]:
## Lets drop pixel qa since it doesn't make sense to house it in a composite.
final_post = clean_post.drop('pixel_qa')
final_pre = clean_pre.drop('pixel_qa')
final_post.to_netcdf('../demo/post_rain.nc')
final_pre.to_netcdf('../demo/pre_rain.nc')
The entire notebook has been condensed down to a about 2 dozen lines of code below.
In [41]:
import datacube
from datetime import datetime
from utils.data_cube_utilities.dc_mosaic import create_median_mosaic
def mosaic(dataset, mask):
return create_median_mosaic(dataset, clean_mask = mask)
def cloud_and_slc_removal_mask(dataset):
clear_pixels = dataset.pixel_qa.values == 2 + 64
water_pixels = dataset.pixel_qa.values == 4 + 64
a_clean_mask = np.logical_or(clear_pixels, water_pixels)
return a_clean_mask
#datacube obj
dc = datacube.Datacube(app = "cloud_removal_in_chad", config = '/home/localuser/.datacube.conf')
#load params
latitude = (12.75, 13.0)
longitude = (14.25, 14.5)
time = (datetime(2015,1,1), datetime(2016,1,1))
platform = 'LANDSAT_7'
product = 'ls7_ledaps_lake_chad_full'
measurements = ['red', 'green', 'blue', 'nir', 'swir1', 'swir2','pixel_qa']
#load
landsat_data = dc.load(latitude = latitude, longitude = longitude, time = time, product = product, platform = platform, measurements = measurements)
#time split params
start_of_rainy_season = '2015-06-01'
end_of_rainy_season = '2015-10-01'
start_of_year = '2015-01-01'
end_of_year = '2015-12-31'
#time split
pre = landsat_data.sel(time = slice(start_of_year, start_of_rainy_season))
post = landsat_data.sel(time = slice(end_of_rainy_season, end_of_year))
#mask for mosaic processs
mask_for_pre = cloud_and_slc_removal_mask(pre)
mask_for_post = cloud_and_slc_removal_mask(post)
#mosaic process
clean_pre = mosaic(pre, mask_for_pre)
clean_post = mosaic(post, mask_for_post)
#save file
clean_pre.drop('pixel_qa').to_netcdf('../demo/pre_01.cd')
clean_post.drop('pixel_qa').to_netcdf('../demo/post_01.cd')
In [ ]: