In [2]:
# Enable importing of utilities
import sys
sys.path.append('..')
%matplotlib inline

Cleaning up imagery for pre and post rainy season

The previous tutorial addressed the identifying the extent of the rainy season near Lake Chad. This tutorial will focus on cleaning up optical imagery to make it suitable for water-detection algorithms.


What to expect from this notebook

  • Introduction to landsat 7 data.
  • basic xarray manipulations
  • removing clouds and scanline error using pixel_qa
  • building a composite image
  • saving products


Algorithmic process




The algorithmic process is fairly simple. It is a composable chain of operations on landsat 7 imagery. The goal is to create a scanline free and cloud-free representation of the data for pre and post rainy season segments of 2015. The process is outlined as follows:

  1. load landsat imagery data for 2015
  2. isolate pre and post rainy season data
  3. remove clouds and scan-line errors from pre and post rainy sesaon data.
  4. build a cloud free composite for pre and post rainy sesaon data.
  5. export the data for future use

What scanline-free or cloud-free means will be addressed later in the tutorial. To understand everything, just follow the steps in sequence.

Creating a Datacube Object


The following code connects to the datacube and accepts cloud_removal_in_chad as an app-name. The app name is typically only used in logging and debugging.


In [3]:
import datacube
dc = datacube.Datacube(app = "cloud_removal_in_chad", config = '/home/localuser/.datacube.conf')


Like in the previous tutorial. The datacube object will be used to load data that has previously been ingested by the datacube.


Defining the boundaries of the area and restricting measurements


In [14]:
## Define Geographic boundaries using a (min,max) tuple.
latitude = (12.75, 13.0)
longitude = (14.25, 14.5)

## Specify a date range using a (min,max) tuple  
from datetime import datetime
time = (datetime(2015,1,1), datetime(2016,1,1))

## define the name you gave your data while it was being "ingested", as well as the platform it was captured on. 
platform = 'LANDSAT_7'  
product = 'ls7_ledaps_lake_chad_full' 

measurements = ['red', 'green', 'blue', 'nir', 'swir1', 'swir2','pixel_qa']

As a reminder and in-notebook reference, the following line of code displays the extents of the study area. Re-orient yourself with it.


In [17]:
from utils.data_cube_utilities.dc_display_map import display_map
display_map(latitude = (12.75, 13.0),longitude = (14.25, 14.5))


Out[17]:


Loading in Landsat 7 imagery

The following code loads in landsat 7 imagery using the constraints defined above


In [18]:
#Load Landsat 7 data using parameters,
landsat_data = dc.load(latitude = latitude,
                       longitude = longitude,
                       time = time,
                       product = product,
                       platform = platform,
                       measurements = measurements)

Explore the Landsat 7 dataset

The previous tutorial barely grazed the concept of xarray variables.
To understand how we use landsat7 imagery it will be necessary to make a brief detour and explain it in greater detail.

xarray - Variables & Data-arrays

When you output the contents of your loaded -dataset...


In [19]:
print(landsat_data)


<xarray.Dataset>
Dimensions:    (latitude: 923, longitude: 901, time: 23)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-04T09:22:48 ... 2015-12-31T09:19:28
  * latitude   (latitude) float64 13.0 13.0 13.0 13.0 ... 12.75 12.75 12.75
  * longitude  (longitude) float64 14.25 14.25 14.25 14.25 ... 14.5 14.5 14.5
Data variables:
    red        (time, latitude, longitude) int16 1123 1185 1154 ... -9999 -9999
    green      (time, latitude, longitude) int16 1131 1097 1131 ... -9999 -9999
    blue       (time, latitude, longitude) int16 851 885 885 ... -9999 -9999
    nir        (time, latitude, longitude) int16 2509 2729 2729 ... -9999 -9999
    swir1      (time, latitude, longitude) int16 1641 1921 2002 ... -9999 -9999
    swir2      (time, latitude, longitude) int16 956 1121 1121 ... -9999 -9999
    pixel_qa   (time, latitude, longitude) int32 224 224 224 224 224 ... 1 1 1 1
Attributes:
    crs:      EPSG:4326



.. you should notice a list of values called data-variables.


These 'variables' are really 3 dimensional data-arrays housing values like 'red', 'green', 'blue', and 'nir', values for each lat,lon,time coordinate pair in your dataset. Think of an xarray.Dataset as an object that houses many different types of data under a shared coordinate system.


If you wish to fetch certain data from your dataset you can call it by its variable name. So, if for example, you wanted to get the near-infrared data-array from the dataset, you would just index it like so:


In [20]:
landsat_data.nir


Out[20]:
<xarray.DataArray 'nir' (time: 23, latitude: 923, longitude: 901)>
array([[[ 2509,  2729, ...,  1222,  1222],
        [ 2509,  2509, ...,  1222,  1266],
        ...,
        [ 3024,  3068, ...,  3103,  3190],
        [ 2981,  2981, ...,  3146,  3276]],

       [[ 1947,  2303, ..., -9999, -9999],
        [ 1986,  1827, ..., -9999, -9999],
        ...,
        [ 2809,  2888, ..., -9999, -9999],
        [ 2809,  2888, ..., -9999, -9999]],

       ...,

       [[ 3171,  3371, ...,   599,   559],
        [ 2771,  3371, ...,   599,   559],
        ...,
        [ 3080,  2801, ...,  2676,  2835],
        [ 3080,  2841, ...,  2756,  2835]],

       [[-9999, -9999, ..., -9999, -9999],
        [-9999, -9999, ..., -9999, -9999],
        ...,
        [-9999, -9999, ..., -9999, -9999],
        [-9999, -9999, ..., -9999, -9999]]], dtype=int16)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-04T09:22:48 ... 2015-12-31T09:19:28
  * latitude   (latitude) float64 13.0 13.0 13.0 13.0 ... 12.75 12.75 12.75
  * longitude  (longitude) float64 14.25 14.25 14.25 14.25 ... 14.5 14.5 14.5
Attributes:
    units:    reflectance
    nodata:   -9999
    crs:      EPSG:4326


The object printed above is a data-array. Unlike a data-set, data-arrays only house one type of data and has its own set of attributes and functions.


xarray - Landsat 7 Values

Let's explore landsat datasets in greater detail by starting with some background about what sort of data landsat satellites collect...

In layman terms, Landsat satellites observe light that is reflected or emitted from the surface of the earth.

<img src = "diagrams/rainy_demo/visual_spectrum.jpg", style="width: 600px;"/>


In landsat The spectrum for observable light is split up into smaller sections like 'red', 'green', 'blue', 'thermal','infrared' so-on and so forth...

Each of these sections will have some value denoting how much of that light was reflected from each pixel. The dataset we've loaded in contains values for each of these sections in separate data-arrays under a shared dataset.

The ones used in this series of notebooks are:

  • red
  • green
  • blue
  • nir - near infrared
  • swir1 - band for short wave infrared
  • swir2 - band for short wave infrared

There is an alternative band attached to the Landsat7 xarray dataset called pixel qa.

  • pixel_qa - land cover classifications

Taking a look at landsat data taken on July 31st, 2015

The data listed above can be used in conjunction to display a visual readout of the area. The code below will use our red green, and blue values to produce a png of one of our time slices.


In [22]:
## The only take-away from this code should be that a .png is produced. 
## Any details about how this function is used is out of scope for this tutorial  

from utils.data_cube_utilities.dc_utilities import write_png_from_xr
write_png_from_xr('../demo/landsat_rgb.png', landsat_data.isel(time = 11), ["red", "green", "blue"], scale = [(0,2000),(0,2000),(0,2000)])

The need to clean up imagery

Considering the imagery above. It is hard to build a comprehensive profile on landcover. There are several objects that occlude the surface of the earth. Namely errors introduced by an SLC malfunction, as well as cloud cover.

Scanline Gaps

In May of 2003, Landsat7's scan line correction system failed (SLC). This essentially renders several horizontal rows of imagery unusable for analysis. Luckily, these scanline gaps don't always appear in the same spots. With enough imagery, a "gap-less" representation can be constructed that we can use to analyze pre and post rainy season.



Cloud occlusion

Clouds get in the way of analyzing/observing the surface reflectance values of Lake Chad. Fortunately, like SLC gaps, clouds don't always appear in the same spot. With enough imagery, taken at close enough intervals, a "cloudless" representation of the area can be built for pre and post rainy season acquisitions of the region.



Strong Assumptions
In this analysis, strong assumptions are made regarding the variability of lake size in the span of a few acquisitions.(Namely, that the size in a pre-rainy season won't vary as much as it will after the rainy season contributes to the surface area of the lake)

Cleaning up Pre and Post rainy season Imagery

Splitting the dataset in two

The first step to cleaning up pre and post rainy season imagery is to split our year's worth of acquisitions into two separate datasets. In the previous notebooks, We've discovered that an appropriate boundary for the rainy season is sometime between June and October. For the sake of this notebook, we'll choose the first day in both months.


In [23]:
start_of_rainy_season = '2015-06-01'
end_of_rainy_season   = '2015-10-01'



The next step after defining this would be to define the time ranges for both post and pre, then use them to select subsections from the original dataset to act as two separate datasets. (Like in the diagram below)




In [24]:
start_of_year = '2015-01-01'
end_of_year   = '2015-12-31'

pre  = landsat_data.sel(time = slice(start_of_year, start_of_rainy_season))
post = landsat_data.sel(time = slice(end_of_rainy_season, end_of_year))


Building a cloud-free and gap-free representation

This section of the process works s by masking out clouds and gaps from the imagery and then selecting a median valued cloud/scanline-gap free pixels of an image.


  • Masking is done using the pixel_qa variable.
  • The gap/cloud-free compositing is done using a technique called median-pixel-mosaicing


Masking out clouds and SLC gaps using pixel_qa

We're going to be using one of our loaded values called pixel_qa for the masking step.

pixel_qa doesn't convey surface reflectance intensities. Instead, it is a band that contains more abstract information for each pixel. It places a pixel under one or more of the following categories:

  • clear - pixel is likely normal landcover
  • water - pixel is likely water
  • cloud_shadow - pixel is likely in the shadow of a cloud
  • snow - the pixel is likely snowy
  • cloud - the pixel is likely cloudy
  • fill - the pixel is classified as not fit for analysis (SRC-Gaps fall in this classification)

We will use these classifications to mask out values unsuitable for analysis.

A Masking Function

The masking step will have to make use of a very peculiar encoding for each category.

\begin{array}{|c|c|} \hline bit & value & sum & interpretation \\\hline 0 & 1 & 1 & Fill \\\hline 1 & 2 & 3 & Clear \\\hline 2 & 4 & 7 & Water \\\hline 3 & 8 & 15 & Cloud Shadow \\\hline 4 & 16 & 31 & Snow \\\hline 5 & 32 & 63 & Cloud \\\hline 6 & 64 & 127 & Cloud Confidence \\ &&& 00 = None \\ 7& 128& 255 & 01 = Low \\ &&& 10 = Med \\ &&& 11 = High \\\hline \end{array}


The following function was constructed to mask out anything that isn't clear or water.


In [25]:
import numpy as np  

def cloud_and_slc_removal_mask(dataset):
    #Create boolean Masks for clear and water pixels
    clear_pixels = dataset.pixel_qa.values == 2 + 64
    water_pixels = dataset.pixel_qa.values == 4 + 64
    
    a_clean_mask = np.logical_or(clear_pixels, water_pixels)
    return a_clean_mask



The following code creates a boolean mask, for slc code.


In [26]:
mask_for_pre  = cloud_and_slc_removal_mask(pre)
mask_for_post = cloud_and_slc_removal_mask(post)



A boolean mask is essentially what it sounds like. Let's look at its print-out



In [27]:
print(mask_for_post)


[[[ True  True  True ... False False False]
  [ True  True  True ... False False False]
  [ True  True  True ...  True  True  True]
  ...
  [False False False ... False False False]
  [False False False ... False False False]
  [False False False ...  True  True False]]

 [[ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  ...
  [ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]]

 [[ True  True  True ... False False False]
  [ True  True  True ... False False False]
  [ True  True  True ... False False False]
  ...
  [ True  True  True ... False False False]
  [ True  True  True ... False False False]
  [ True  True  True ... False False False]]

 ...

 [[False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]
  ...
  [False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]]

 [[ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  ...
  [ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]]

 [[False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]
  ...
  [False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]]]


This boolean mask contains a true value for pixels that are clear and un-occluded by clouds or scanline gaps and false values where the opposite is true.

Example of mask use

There are many ways to apply a mask. The following example is the xarray way. It will apply nan values to areas with clouds or scanline issues:


In [28]:
pre.where(mask_for_pre)


Out[28]:
<xarray.Dataset>
Dimensions:    (latitude: 923, longitude: 901, time: 9)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-04T09:22:48 ... 2015-05-28T09:23:44
  * latitude   (latitude) float64 13.0 13.0 13.0 13.0 ... 12.75 12.75 12.75
  * longitude  (longitude) float64 14.25 14.25 14.25 14.25 ... 14.5 14.5 14.5
Data variables:
    red        (time, latitude, longitude) float64 nan nan nan ... nan nan nan
    green      (time, latitude, longitude) float64 nan nan nan ... nan nan nan
    blue       (time, latitude, longitude) float64 nan nan nan ... nan nan nan
    nir        (time, latitude, longitude) float64 nan nan nan ... nan nan nan
    swir1      (time, latitude, longitude) float64 nan nan nan ... nan nan nan
    swir2      (time, latitude, longitude) float64 nan nan nan ... nan nan nan
    pixel_qa   (time, latitude, longitude) float64 nan nan nan ... nan nan nan
Attributes:
    crs:      EPSG:4326

Notice that a lot of the values in the array above have nan values. Compositing algorithms like the median-pixel mosaic below, make use of this where function as well as 'nans' as the marker for no-data values.


Median Pixel Mosaic

A median pixel mosaic is used for our cloud/slc-gap free representation of satellite imagery. It Works by masking out clouds/slc-gaps from imagery, and using the median valued cloud-free pixels in the time series of each lat-lon coordinate pair



Here is a function we can use to build our mosaic. Its exact mechanics are abstracted away from this tutorial and can be explored in further detail by visiting our github.


In [29]:
from utils.data_cube_utilities.dc_mosaic import create_median_mosaic

def mosaic(dataset, mask):
    return create_median_mosaic(dataset, clean_mask = mask)


Lets use it to generate our cloud free representations of the area:


In [30]:
clean_pre  = mosaic(pre, mask_for_pre)  
clean_post = mosaic(post, mask_for_post)


Taking a peek at our cloud-free composites


Pre Rainy Season


In [31]:
print(clean_pre)


<xarray.Dataset>
Dimensions:    (latitude: 923, longitude: 901)
Coordinates:
  * latitude   (latitude) float64 13.0 13.0 13.0 13.0 ... 12.75 12.75 12.75
  * longitude  (longitude) float64 14.25 14.25 14.25 14.25 ... 14.5 14.5 14.5
Data variables:
    red        (latitude, longitude) int16 869 905 923 877 ... 1539 1539 1515
    green      (latitude, longitude) int16 791 814 786 759 ... 1275 1240 1215
    blue       (latitude, longitude) int16 590 614 591 593 ... 843 866 867 891
    nir        (latitude, longitude) int16 2001 2303 2261 ... 2610 3001 3001
    swir1      (latitude, longitude) int16 1576 2056 2023 ... 2024 2932 3011
    swir2      (latitude, longitude) int16 913 1178 1178 1212 ... 1265 2126 2130
    pixel_qa   (latitude, longitude) int32 66 66 66 66 66 66 ... 66 66 66 66 66

In [33]:
write_png_from_xr('../demo/pre_rain_mosaic.png', clean_pre, ["red", "green", "blue"], scale = [(0,2000),(0,2000),(0,2000)])

Your png should look something like this:

Post Rainy Season


In [34]:
print(clean_post)


<xarray.Dataset>
Dimensions:    (latitude: 923, longitude: 901)
Coordinates:
  * latitude   (latitude) float64 13.0 13.0 13.0 13.0 ... 12.75 12.75 12.75
  * longitude  (longitude) float64 14.25 14.25 14.25 14.25 ... 14.5 14.5 14.5
Data variables:
    red        (latitude, longitude) int16 638 632 693 693 ... 1030 1165 1338
    green      (latitude, longitude) int16 794 752 753 697 ... 945 949 1038 1092
    blue       (latitude, longitude) int16 437 467 438 440 ... 595 661 704 738
    nir        (latitude, longitude) int16 3440 3486 3266 ... 2552 3035 2956
    swir1      (latitude, longitude) int16 1845 2120 2120 ... 1824 2539 2557
    swir2      (latitude, longitude) int16 920 997 1017 1072 ... 1045 1546 1634
    pixel_qa   (latitude, longitude) int32 66 66 66 66 66 66 ... 66 66 66 66 66

In [36]:
write_png_from_xr('../demo/post_rain_mosaic.png', clean_post, ["red", "green", "blue"], scale = [(0,2000),(0,2000),(0,2000)])

Next Steps

The next notebook in the series deals with water classification on these cloud free composites! We'll need to save our work so that it can be loaded in the next notebook. The good news is that xarrays closely resemble the structure of net NETCDF files. It would make sense to save it off in that format. The code below saves these files as NETCDFS using built-in export features of xarray.


In [38]:
## Lets drop pixel qa since it doesn't make sense to house it in a composite.  
final_post = clean_post.drop('pixel_qa')
final_pre  = clean_pre.drop('pixel_qa')

final_post.to_netcdf('../demo/post_rain.nc')
final_pre.to_netcdf('../demo/pre_rain.nc')

The entire notebook has been condensed down to a about 2 dozen lines of code below.


In [41]:
import datacube
from datetime import datetime
from utils.data_cube_utilities.dc_mosaic import create_median_mosaic

def mosaic(dataset, mask):
    return create_median_mosaic(dataset, clean_mask = mask)

def cloud_and_slc_removal_mask(dataset):
    clear_pixels = dataset.pixel_qa.values == 2 + 64
    water_pixels = dataset.pixel_qa.values == 4 + 64
    a_clean_mask = np.logical_or(clear_pixels, water_pixels)
    return a_clean_mask

#datacube obj
dc = datacube.Datacube(app = "cloud_removal_in_chad", config = '/home/localuser/.datacube.conf') 

#load params
latitude = (12.75, 13.0)
longitude = (14.25, 14.5)
time = (datetime(2015,1,1), datetime(2016,1,1))

platform = 'LANDSAT_7'  
product = 'ls7_ledaps_lake_chad_full' 
measurements = ['red', 'green', 'blue', 'nir', 'swir1', 'swir2','pixel_qa']

#load
landsat_data = dc.load(latitude = latitude, longitude = longitude, time = time, product = product, platform = platform, measurements = measurements)

#time split params
start_of_rainy_season = '2015-06-01'
end_of_rainy_season   = '2015-10-01'
start_of_year = '2015-01-01'
end_of_year   = '2015-12-31'

#time split
pre  = landsat_data.sel(time = slice(start_of_year, start_of_rainy_season))
post = landsat_data.sel(time = slice(end_of_rainy_season, end_of_year))

#mask for mosaic processs
mask_for_pre  = cloud_and_slc_removal_mask(pre)
mask_for_post = cloud_and_slc_removal_mask(post)  

#mosaic process
clean_pre  = mosaic(pre, mask_for_pre)  
clean_post = mosaic(post, mask_for_post)

#save file
clean_pre.drop('pixel_qa').to_netcdf('../demo/pre_01.cd')
clean_post.drop('pixel_qa').to_netcdf('../demo/post_01.cd')

In [ ]: