Landsat Metadata Analysis with Dask

Understanding the global distribution of Landsat observations over the satellite's 40+ year record can help answer many questions including:

  • How viable is a particular analytical method given the observation frequency and quality in my study site?
  • What is the distribution of cloud cover across the planet as observed by Landsat?
  • How does solar and sensor geometry change in the Landsat record across time and the planet?

In order to mine this information, we can use the Landsat Bulk Metadata dataset from the USGS which provides rich metadata about every observation in the Landsat record across the history of the program. Unfortunately, these files are gigantic and would be very difficult to process using an average computer.

Luckily for those using Python, the Dask library can provide both multiprocessing and out-of-core computation capabilities while keeping to the same function calls that you might be familiar with from Numpy and Pandas. Of interest to us is the dask.dataframe collection which allows us to easily process the Landsat metadata CSV files.

To begin, first download the Landsat metadata files and unzip them. While Pandas can read from compressed CSV files, we will want to break these CSV files apart into many pieces for processing and they need to be uncompressed to split them.

For this tutorial, we will only be using the Landsat 8 and Landsat 7 metadata:

wget http://landsat.usgs.gov/metadata_service/bulk_metadata_files/LANDSAT_8.csv.gz
gunzip LANDSAT_8.csv.gz
wget http://landsat.usgs.gov/metadata_service/bulk_metadata_files/LANDSAT_ETM_SLC_OFF.csv.gz
gunzip LANDSAT_ETM_SLC_OFF.csv.gz
wget http://landsat.usgs.gov/metadata_service/bulk_metadata_files/LANDSAT_ETM.csv.gz
gunzip LANDSAT_ETM.csv.gz

With the metadata files in place, we can load them into a Dask DataFrame. We select only the columns of interest, declare their data types, parse the acquisition date, and set a blocksize so that each CSV file is split into roughly 20 MB partitions:


In [1]:
import dask.dataframe as ddf

columns = {
    'sceneID': str,
    'sensor': str,
    'path': int,
    'row': int,
    'acquisitionDate': str,
    'cloudCover': float,
    'cloudCoverFull': float,
    'sunElevation': float,
    'sunAzimuth': float,
    'DATA_TYPE_L1': str,
    'GEOMETRIC_RMSE_MODEL': float,
    'GEOMETRIC_RMSE_MODEL_X': float,
    'GEOMETRIC_RMSE_MODEL_Y': float,
    'satelliteNumber': float
}

df = ddf.read_csv('LANDSAT*.csv',
                  usecols=columns.keys(),
                  dtype=columns,
                  parse_dates=['acquisitionDate'],
                  blocksize=int(20e6))
df = df.assign(year=df.acquisitionDate.dt.year)
df.columns


Out[1]:
Index([u'sceneID', u'sensor', u'acquisitionDate', u'path', u'row',
       u'cloudCover', u'cloudCoverFull', u'sunElevation', u'sunAzimuth',
       u'satelliteNumber', u'DATA_TYPE_L1', u'GEOMETRIC_RMSE_MODEL',
       u'GEOMETRIC_RMSE_MODEL_X', u'GEOMETRIC_RMSE_MODEL_Y', u'year'],
      dtype='object')

Question: How many observations are there?


In [4]:
df.groupby('sensor').sensor.count().compute()


Out[4]:
sensor
LANDSAT_ETM             493555
LANDSAT_ETM_SLC_OFF    1691692
OLI                       3736
OLI_TIRS                808644
TIRS                      2068
Name: sensor, dtype: int64

Question: What is the mean cloud cover for every path/row across the years?


In [5]:
result = df.groupby(['path', 'row', 'year']).cloudCoverFull.mean().compute()

In [6]:
result.loc[12, 31, :]


Out[6]:
path  row  year
12    31   1999    33.347500
           2000    45.819091
           2001    36.484348
           2002    52.677391
           2003    41.502500
           2004    51.485652
           2005    47.230435
           2006    46.412727
           2007    37.043913
           2008    35.539565
           2009    46.669091
           2010    56.190870
           2011    61.337391
           2012    58.252857
           2013    43.463947
           2014    43.966304
           2015    37.012667
           2016    38.726875
Name: cloudCoverFull, dtype: float64

Question: What is the cloud cover difference between a L1T and L1GT product?


In [7]:
result = df.groupby(['DATA_TYPE_L1']).cloudCoverFull.mean().compute()
result


Out[7]:
DATA_TYPE_L1
L1G     71.539130
L1GT    46.895781
L1Gt    56.343568
L1T     28.222079
PR      39.912185
Name: cloudCoverFull, dtype: float64

Looks like there is a labeling issue due to a capitalization difference in L1GT versus L1Gt. We can correct this as well:


In [53]:
df = df.assign(DATA_TYPE_L1=df.DATA_TYPE_L1.apply(lambda x: x if x != 'L1Gt' else 'L1GT'))
result = df.groupby(['DATA_TYPE_L1']).cloudCoverFull.mean().compute()
result


Out[53]:
DATA_TYPE_L1
L1G     71.539130
L1GT    50.574350
L1T     28.222079
PR      39.912185
Name: cloudCoverFull, dtype: float64

Question: How do the various levels of preprocessing affect the estimated geometric accuracy? Are more recent Landsat sensors more accurate?


In [54]:
df.groupby(['DATA_TYPE_L1', 'sensor'])[['GEOMETRIC_RMSE_MODEL_X', 'GEOMETRIC_RMSE_MODEL_Y']].mean().compute()


Out[54]:
                                  GEOMETRIC_RMSE_MODEL_X  GEOMETRIC_RMSE_MODEL_Y
DATA_TYPE_L1 sensor
L1G          LANDSAT_ETM                        7.333954                9.749414
             LANDSAT_ETM_SLC_OFF                6.804560                8.802822
L1GT         LANDSAT_ETM                        7.719764                9.941053
             LANDSAT_ETM_SLC_OFF                6.533796                8.131632
             OLI                                0.000000                0.000000
             OLI_TIRS                           0.000000                0.000000
             TIRS                               0.000000                0.000000
L1T          LANDSAT_ETM                        3.446039                3.848213
             LANDSAT_ETM_SLC_OFF                3.747309                4.175295
             OLI                                6.238722                6.375502
             OLI_TIRS                           5.941521                5.985040
PR           LANDSAT_ETM                        4.395927                5.396203
             LANDSAT_ETM_SLC_OFF                4.334864                5.178753
             OLI                                2.003667                1.697333
             OLI_TIRS                           2.229302                2.254767
             TIRS                               0.000000                0.000000

Unfortunately, it looks like the Landsat 8 observations record an estimated geometric accuracy of zero unless the precision terrain correction using Ground Control Points is successful (the L1T product).
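If we treat those zeros as "not recorded" rather than as true accuracy values, we can mask them out before averaging so they do not drag the sensor means toward zero. A minimal sketch on a toy table follows; the same `replace`/`groupby`/`mean` calls work on the Dask DataFrame above, since they are part of the Pandas API subset that Dask supports. The assumption that a zero RMSE always means a missing estimate is ours, not documented in the metadata:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata table: zero RMSE values here represent
# scenes where no geometric accuracy estimate was recorded.
toy = pd.DataFrame({
    'sensor': ['OLI_TIRS', 'OLI_TIRS', 'OLI_TIRS'],
    'GEOMETRIC_RMSE_MODEL_X': [0.0, 5.0, 7.0],
    'GEOMETRIC_RMSE_MODEL_Y': [0.0, 6.0, 6.0],
})

rmse_cols = ['GEOMETRIC_RMSE_MODEL_X', 'GEOMETRIC_RMSE_MODEL_Y']

# Replace zeros with NaN so .mean() skips them instead of averaging them in
masked = toy[rmse_cols].replace(0.0, np.nan)
means = masked.groupby(toy['sensor']).mean()
```

With the zero rows masked, the OLI_TIRS means come out as the average of the recorded values only (6.0 for both axes here), rather than being pulled down by the unrecorded scenes.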

Behind the scenes


In [77]:
from dask.dot import dot_graph

dot_graph(result.dask)


Out[77]: (rendered task graph of the computation)

In [41]:
import dask

Getting started with Dask

Because Dask DataFrame implements much of the Pandas API, we can avoid the added complications of parallel and out-of-core computing by first working out our analysis on a small subset of the data using Pandas.

If we stick to the subset of the Pandas API that Dask supports, leveraging the power of Dask is as simple as changing the data ingest or creation call and adding a .compute() to the computation.


In [15]:
import pandas as pd
_df = pd.read_csv('LANDSAT_8.csv', parse_dates=['acquisitionDate'], nrows=100)

In [16]:
s = _df['DATA_TYPE_L1']