In [4]:
# Enable importing of utilities
import sys
sys.path.append('..')
%matplotlib inline
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
This tutorial is focused on solving the problem of determining when the rainy season starts and ends over Lake Chad. Future notebooks in this series deal with analyzing the Lake Chad region before and after the rainy season to determine how much the rainy season contributes to the surface area of Lake Chad.
The following code connects to the datacube and accepts chad_rainfall as an app-name.
In [5]:
import datacube
dc = datacube.Datacube(app = "chad_rainfall", config = '/home/localuser/.datacube.conf')
This object is the main interface to your stored and ingested data. It can handle complicated things like reprojecting data with varying resolutions and orientations. It can also be used to explore existing datasets. In this tutorial, it is only used for loading data from the datacube.
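For example, the datacube object can list what has been indexed before you load anything (a quick sketch; list_products and list_measurements are standard Datacube methods, though the format of their output varies between versions):

# Explore what the datacube contains before loading anything.
print(dc.list_products())      # one row per indexed product
print(dc.list_measurements())  # per-product measurement details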
A small dataset is easier to work with than the entirety of Lake Chad. The region you're about to load contains GPM measurements for a small area of Lake Chad near the mouth of its largest contributing river. The code below displays the bounds of the region but doesn't load it.
In [6]:
from utils.data_cube_utilities.dc_display_map import display_map
display_map(latitude = (12.75, 13.0),longitude = (14.25, 14.5))
Out[6]:
In [7]:
## Define Geographic boundaries using a (min,max) tuple.
latitude = (12.75, 13.0)
longitude = (14.25, 14.5)
## Specify a date range using a (min,max) tuple
from datetime import datetime
time = (datetime(2015,1,1), datetime(2016,1,2))
## define the name you gave your data while it was being "ingested", as well as the platform it was captured on.
product = 'gpm_imerg_gis_daily_global'
platform = 'GPM'
It's simple to intuit what the latitude, longitude, and time bounds will get you: a bounded, gridded dataset containing our rainy season. Each square in the diagram below represents the smallest spatial unit in our imagery. This smallest unit is often referred to as a pixel.
While defining space and time bounds is simple to understand, it may be less obvious what product and platform are. Platform tells you how/where the data was produced, while product is a key used to choose which representation of that platform's data you wish to index.
For the sake of this tutorial, think of product and platform as shorthand names we use to look up data that is:
- produced on a GPM platform.
- represented using gpm_imerg_gis_daily_global settings or types.
The representation reflects personal preferences that define, for example, the resolution of each pixel, how pixels are sampled, and the geometric projections this grid of pixels undergoes to assume its current shape. Scripts for adding more general product types/representations are available here, but aren't necessary for understanding this stage of the tutorial.
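If you're unsure which product names your installation knows about, the datacube object can be queried directly; for example (a sketch; the exact columns of the returned DataFrame vary between datacube versions):

# Confirm the product is indexed before loading (column names vary by datacube version).
products = dc.list_products()
print(products[products['name'] == product])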
In [8]:
# Load precipitation data using the parameters defined above.
gpm_data = dc.load(latitude = latitude, longitude = longitude, time = time, product = product, platform = platform)
The code above should have loaded an xarray containing your GPM data. An xarray data-structure is essentially a wrapper for high-dimensional data-arrays. One of its main uses is the coupling of different data with a shared set of coordinates.
Conceptually, you can imagine GPM's xarray looking like this:
Each latitude-longitude coordinate pair will have total_precipitation, liquid_precipitation, ice_precipitation, and percent_liquid measurements taken for it.
An Xarray Dataset will store each of these measurements in separate grids and share a single set of coordinates among all measurements.
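To make the shared-coordinate idea concrete, here is a toy Dataset built from made-up values (an illustrative sketch, not the GPM data):

import numpy as np
import xarray as xr

# Two measurement grids sharing one set of time/latitude/longitude coordinates.
toy = xr.Dataset(
    {
        'total_precipitation':  (('time', 'latitude', 'longitude'), np.random.rand(2, 3, 3)),
        'liquid_precipitation': (('time', 'latitude', 'longitude'), np.random.rand(2, 3, 3)),
    },
    coords={
        'time': np.array(['2015-01-01', '2015-01-02'], dtype='datetime64[ns]'),
        'latitude': np.linspace(12.75, 13.0, 3),
        'longitude': np.linspace(14.25, 14.5, 3),
    })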
To get detailed information about an xarray, a print() statement will give you a readout of its structure.
In [9]:
print(gpm_data)
Using the readout above we can quickly gain a summary of the dataset we just loaded by examining:
Coordinates
the latitude, longitude, and time values.
Data Variables
a readout of what sort of data is stored. In this case, each latitude, longitude, and time point will store four types of data, one for each of the total_precipitation, liquid_precipitation, ice_precipitation, and percent_liquid variables.
Data Size
Each entry has a size/type associated with it, e.g. int32, int8, float64. You can use this to manage the memory footprint of your object.
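For instance, you can inspect a measurement's dtype and the Dataset's total in-memory size directly (a sketch using standard xarray attributes):

# Check the dtype of one measurement and the memory footprint of the whole Dataset.
print(gpm_data.total_precipitation.dtype)      # e.g. int32
print(gpm_data.nbytes / 1e6, 'MB in memory')   # total size of all data variables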
The xarray that you're using has several built-in functions to efficiently run large-scale operations on all points within the xarray. The following code makes use of one of those built-in functions to determine the average precipitation in the area for each time slice.
In [10]:
mean_precipitation = gpm_data.mean(dim = ['latitude', 'longitude'])
The code above will average together all latitude and longitude rows and store them in a time coordinate. The diagram below should detail the operations of averaging these areas.
Our new xarray dataset will retain a single dimension called time. Each point in time will store a value representing the average of all pixel values at that time.
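The same built-in reductions work along any dimension; for example (a sketch using standard xarray methods):

# Other reductions follow the same pattern as mean().
max_precipitation = gpm_data.max(dim=['latitude', 'longitude'])  # spatial maximum per time slice
accumulated = gpm_data.sum(dim='time')                           # accumulated values per pixel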
Take a look at the print statement below.
In [11]:
print(mean_precipitation)
Things you should notice:
We're still representing mean_precipitation using an xarray.
Only time is displayed under the coordinates section; latitude and longitude have essentially been dropped.
The averaging operation was performed on all data variables: total_precipitation, liquid_precipitation, ice_precipitation, and percent_liquid.
Xarray Datasets store several data-arrays.
The code below neatly extracts a total_precipitation data-array from our mean_precipitation Dataset.
In [12]:
mean_total_precipitation = mean_precipitation.total_precipitation
The new representation is also an xarray data-structure. When printed, it looks something like this.
In [13]:
print(mean_total_precipitation)
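Note that dictionary-style indexing returns the same DataArray as the attribute access used above (a quick sketch):

# Dictionary-style access is equivalent to the attribute access used earlier.
same_array = mean_precipitation['total_precipitation']
print(same_array.equals(mean_total_precipitation))  # True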
For time-series plotting, we care about extracting the time coordinates and the data-array values.
In [14]:
times = mean_total_precipitation.time.values
values = mean_total_precipitation.values
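Both .values accessors hand back plain NumPy arrays, which is exactly what matplotlib expects; a quick check (sketch):

# Both are plain NumPy arrays now, ready for matplotlib.
print(type(times), times.dtype)    # numpy.ndarray, datetime64[ns]
print(type(values), values.dtype)  # numpy.ndarray, typically float64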
The code below plots a time-series graph of our values.
In [15]:
import matplotlib.pyplot as plt
plt.plot(times, values)
Out[15]:
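For a more readable figure, a few standard matplotlib calls add labels (a minimal sketch; the precipitation units are assumed from the GPM IMERG product):

# Optional: label the plot for readability.
plt.plot(times, values)
plt.xlabel('Date')
plt.ylabel('Mean precipitation (mm/day, assumed)')
plt.title('Mean precipitation over the region')
plt.show()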
In [16]:
# Fit a Gaussian curve to the rainfall time series to model the rainy season.
# The curve-fitting code itself is out of scope for this datacube tutorial and is abstracted away for demonstration purposes.
import demo.curve_fit_gaussian as curve_fit
curve_fit.plot_fit(times, values, standard_deviations = 2)
We pick two points that are equidistant from the center/peak of this curve to act as our bounding points for the rainy season.
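The demo module hides the fitting details, but conceptually it looks something like the sketch below: fit a Gaussian to the time series with scipy and take the points standard_deviations away from the fitted peak on either side (an illustrative reimplementation, not the actual demo.curve_fit_gaussian code; gaussian and rainy_season_bounds are hypothetical names):

import numpy as np
from scipy.optimize import curve_fit as scipy_curve_fit  # aliased to avoid shadowing the demo module

def gaussian(x, a, mu, sigma):
    # Classic bell curve: amplitude a, center mu, spread sigma.
    return a * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def rainy_season_bounds(times, values, standard_deviations=2):
    # scipy can't fit datetime64 directly, so work in float days since the first sample.
    days = (times - times[0]) / np.timedelta64(1, 'D')
    # Initial guesses: peak height, location of the maximum, and a rough spread.
    p0 = [values.max(), days[np.argmax(values)], np.ptp(days) / 4]
    (a, mu, sigma), _ = scipy_curve_fit(gaussian, days, values, p0=p0)
    # Bounds sit standard_deviations away from the fitted peak on either side.
    lower = times[0] + np.timedelta64(int(mu - standard_deviations * abs(sigma)), 'D')
    upper = times[0] + np.timedelta64(int(mu + standard_deviations * abs(sigma)), 'D')
    return lower, upper

Calling rainy_season_bounds(times, values) would return two timestamps bounding the season, analogous to the get_bounds call below.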
In [17]:
curve_fit.get_bounds(times, values, standard_deviations = 2)
Out[17]:
It appears that June and October should be adequate bounds for the rainy season.
This notebook served as an introduction to the datacube and xarray datasets. Now that we have the extent of our rainy season, you can proceed to the next notebook, in which Landsat 7 data is broken into pre- and post-rainy-season datasets and cleaned up in preparation for water detection. The entire notebook has been condensed down to about a dozen lines of code below.
In [18]:
import datacube
from datetime import datetime
import demo.curve_fit_gaussian as curve_fit
dc = datacube.Datacube(app = "chad_rainfall", config = '/home/localuser/.datacube.conf')
latitude = (12.75, 13.0)
longitude = (14.25, 14.5)
time = (datetime(2015,1,1), datetime(2016,1,2))
product = 'gpm_imerg_gis_daily_global'
platform = 'GPM'
gpm_data = dc.load(latitude = latitude, longitude = longitude, time = time, product = product, platform = platform)
mean_precipitation = gpm_data.mean(dim = ['latitude', 'longitude'])
times = mean_precipitation.time.values
values = mean_precipitation.total_precipitation.values
curve_fit.get_bounds(times, values, standard_deviations = 3)
Out[18]: