Raster data processing with Python and GDAL

In this lesson we will learn basic techniques for automating raster data processing with python and the Geospatial Data Abstraction Library GDAL.

Raster files are commonly used to store terrain models and remote sensing data and their derivate products such as vegetation indices and other environmental datasets. Raster files tend to be huge (imagine for example a raster dataset covering the globe in 30m x 30m resolution) and are often delivered and processed in smaller pieces (tiles). Efficient processing and analysis of such large datasets consequently requires automatization. During this lesson you will learn how to read and write common raster formats, and conduct basic raster data processes for a batch of files using the GDAL/OGR API in Python and GDAL command line utilities.

If you want to dive deeper into raster data processing in Python you might want to check out Rasterio which is a relatively new Python library for efficient raster data analysis. Also remember that you can write Python scripts for using ArcGIS tools as introduced in the demo.

For more information and examples with GDAL, please see these online resources:

1. Extract files

Download data for this lesson from this link.

In this lesson, we will use a Landsat 8 satellite scene for practicing how to read raster data with Python and Gdal. Landsat images are distributed as gzipped TAR archives *.tar.gz when downloading from the USGS Earth Explorer. For the purposes of this exercise, we have pre-downloaded a scene covering South-West Finland which you can download from the above link. In order to get started, you need to extract the contents of the TAR archive either in the Terminal window or with a Python script.

Option 1: Extract files in the Terminal window

Navigate to the directory where you downloaded the data, and extract files into a new folder:


In [ ]:
mkdir LandsatData
tar -zxvf LC81910182016153-SC20161208043748.tar.gz -C LandsatData/

Option 2: Extract files using python

It is also possible to extract the files in python using the tarfile module. First, open the file with tarfile.open(). The parameter r:gz specifies that we want to open the gzipped file in reading mode. Then, extract the files using tarifile.extractall() method:


In [ ]:
import os
import tarfile

#Create output folder
newFolder = "LandsatData"
os.makedirs(newFolder)

#Extract files
tar = tarfile.open("LC81910182016153-SC20161208043748.tar.gz", "r:gz")
tar.extractall(newFolder)
tar.close()

1. Open a raster file in Python with GDAL

With GDAL, you can read and write several different raster formats in Python. Python automatically registers all known GDAL drivers for reading supported formats when the importing the GDAL module. Most common file formats include for example TIFF and GeoTIFF, ASCII Grid and Erdas Imagine .img -files.

Landsat 8 bands are stored as separate GeoTIFF -files in the original package. Each band contains information of surface reflectance from different ranges of the electromagnetic spectrum. You can read more information about Landsat 8 here.

Let's start with inspecting one of the files we downloaded:


In [ ]:
from osgeo import gdal

filepath = r"LandsatData/LC81910182016153LGN00_sr_band4.tif"

# Open the file:
raster = gdal.Open(filepath)

# Check type of the variable 'raster'
type(raster)

Read raster file properties

The satellite image is now stored as a GDAL Dataset object in the variable raster. Let's have a closer look at the properties of the file:


In [ ]:
# Projection
raster.GetProjection()

# Dimensions
raster.RasterXSize
raster.RasterYSize

# Number of bands
raster.RasterCount

# Metadata for the raster dataset
raster.GetMetadata()

Get raster bands

In our case, all bands of the Landsat 8 scene are stored as separate files. rasterCount is 1 as we have only opened one GeoTiff containing Landsat 8 band 4. However, different bands of a satellite images are often stacket together in one raster dataset in which case rasterCount would be greater than one.

In order to have a closer look at the values stoden in the band, we will take advantage of the GDAL Band API.


In [ ]:
# Read the raster band as separate variable
band = raster.GetRasterBand(1)

# Check type of the variable 'band'
type(band)

# Data type of the values
gdal.GetDataTypeName(band.DataType)

Now we have a GDAL Raster Band object stored in the variable band.

Data type of the band can be interpreted with the help of GDAL documentation on Pixel data types. Unsigned integer is always equal or greater than zero and signed integer can store also negative values. For example, an unsigned 16-bit integer can store 2^16 (=65,536) values ranging from 0 to 65,535.

Band statistics

Next, let's have a look at the values that are stored in the band. You might need to calculate statistics for the raster before being able to print out any information.


In [ ]:
# Compute statistics if needed
if band.GetMinimum() is None or band.GetMaximum()is None:
    band.ComputeStatistics(0)
    print("Statistics computed.")
    
# Fetch metadata for the band
band.GetMetadata()
    
# Print only selected metadata:
print ("[ NO DATA VALUE ] = ", band.GetNoDataValue()) # none
print ("[ MIN ] = ", band.GetMinimum())
print ("[ MAX ] = ", band.GetMaximum())

2. Open raster file as a numerical array

GDAL is a powerful library when it comes to accessing geospatial raster data, but it does not provide many functionalities for doing calculations. For more advanced computing, we will read in the raster data as a numerical array in order to use the capabilities in the NumPy-library.

In case you want to convert and existing Gdal Dataset or a Band into a numpy array you can convert it with ReadAsArray method:


In [ ]:
# read raster data as numeric array from Gdal Dataset
rasterArray = raster.ReadAsArray()

so what's the difference?


In [ ]:
#Check the datatype of variables
type(rasterArray)
type(raster)

As you can see, our we have now stored the same raster data into two different types of variables:

-raster is a Gdal Dataset

-rasterArray is a numpy array

The GDAL library also comes with a module gdal_array that works as an interface between NumPy and GDAL. Gdal_array reads and writes raster files to and from NumPy arrays directly from file:


In [4]:
from osgeo import gdal_array
#from osgeo import gdalnumeric

# Read raster data as numeric array from file
rasterArray = gdal_array.LoadFile(filepath)

Take care of NoData values


In [ ]:
# What is the minimum value of the array?
rasterArray.min() #-9999

As you can see, the numpy array still contains the original nodata values. Any calculations will be wrong if these are not taken care of.


In [ ]:
import numpy as np

# Get nodata value from the GDAL band object
nodata = band.GetNoDataValue()

#Create a masked array for making calculations without nodata values
rasterArray = np.ma.masked_equal(rasterArray, nodata)
type(rasterArray)

# Check again array statistics
rasterArray.min()

Now you have a two-dimensional array ready for further calculations. However, for completing our exercise for this week we will use a very simple command line tool gdal_calc.py.

Close raster dataset

It might be useful to close an exitsing GDAL object in the middle of the code if you want to free resources and remove unnecessary varables from memory.


In [ ]:
raster = None
band = None

It is however not necessary to close raster datasets or bands at the end of the python script as Pyhton will automatically take care of this.

3. GDAL command line utilities

We have now tested some of the basic functions from the Python GDAL/OGR API for reading and inspecting raster files. However, GDAL also includes other powerful functions for data translation and processing which are not directly implemented in the library. We will have a closer look on a couple of such functions:

These tools need to be run from the Terminal/Command Prompt or as a subprocess in Python. We will now quickly test out these tools in the Terminal window.

Clipping the image with gdalwarp

Among other tricks, gdalwarp is a very handy tool for quickly clipping your image. We will now practice how to clip the satellite image band based on a bounding box. Desired extent for the output file is specified using the option -te :


In [ ]:
gdalwarp -te xmin ymin xmax ymax inputfile.tif outputfile.tif

---------------------------------------------

TASK: Clip band 4 of the satellite image so that it covers cloud-free areas in the Turku archipelago (in the North-East corner of the scene).

You can open the band 4 manually in QGIS for defining the corner coordinates.

---------------------------------------------

Next, let's repeat the clipping for all the rest of the bands all at once. For doing this, we will use Python for generating the command for each spectral band in our scene.


In [ ]:
import glob
import os

# List filepaths for all bands in the scence
FileList = glob.glob(os.path.join(r'/home/geo/LandsatData','*band*.tif'))

# Define clipping extent
xmin, ymin, xmax,ymax = (0, 0, 0, 0) # INSERT HERE THE CORRECT COORDINATES

# Generate gdalwarp command for each band
command = ""

for fp in FileList:
    inputfile = fp
    outputfile = inputfile[:-4] + "_clip.tif"
    
    command += "gdalwarp -te %s %s %s %s %s %s \n" % (xmin, ymin, xmax, ymax, inputfile, outputfile)
    
# Write the commands to an .sh file
cmd_file = "ClipTurkufromLandsat.sh"
f = open(os.path.join(cmd_file), 'w')

f.write(command)
f.close()

NOTE: If you are working in an windows environment, change the .sh extension to .bat.

After running the above script, you should have a file "ClipTurkufromLandsat.sh" in your working directory. Open the file (with a text editor) and check that the commands have been written correctly to the file.

Next, run the file in the Terminal window:


In [ ]:
bash ClipTurkufromLandsat.sh

Now you should have a bunch of clipped .tif files ready and you might want to open a few of them in QGIS to check that the process was succesful.

Generating a layer stack with gdal_merge.py

After clipping the image you can for example stack bands 3 (green), 4 (red), and 5 (nir) for visualizing a false-color composite. Merge the layers with gdal_merge.py and use the -separate option for indicating that you wish to save the inputs as separate bands in the output file.

Let's try running the command as a subprocess in python:


In [ ]:
import os

# Define input and output files
inputfiles =  "band3_clip.tif band4_clip.tif band5_clip.tif"
outputfile =  "Landsat8_GreenRedNir.tif"

# Generate the command
command = "gdal_merge.py -separate %s -o %s" % (inputfiles, outputfile)

# Run the command. os.system() returns value zero if the command was executed succesfully
os.system(command)

As a result, you have three bands stacked together in the file Landsat8_GreenRedNir.tif.

Gdal_calc.py

Gdal_calc.py is a command line raster calculator which can be useful for competing simple repetitive calculations for raster data. Open the terminal window and type in:


In [ ]:
Gdal_calc.py

And press enter. You should see instructions on the usage and options for the tool.The basic syntax for gdal_calc.py is the following:

gdal_calc.py -A input1.tif - B input2.tif [other_options] --outfile=outputfile.tif

From other options, it is useful to notice at least the parameters --calc for specifying the calculation syntax and --creation-option (or --co) for controlling the output file size:

  • In the case of two input files --calc="A+B" would add files A and B together.
  • By default output files tend to be huge which will quickly result in problems with disk size and memory. With gdal_calc.py you can add parameter --co="COMPRESS=LZW" in order to reduce output file size.

Exercise 7: Delineate forest area from canopy cover data

In exercise 7 we will work with a piece of the Global Forest Change Dataset by Hansen et al (2013). The goal of the exercise is to delineate forest areas from an area of interest by processing raster layers containing information about canopy cover in 2000.

You should be able to complete the exercise by altering the simple tools presented during this lesson.