Understanding the global distribution of Landsat observations over the program's 40+ year record can help answer many questions about when, where, and how often imagery was acquired.
To mine this information, we can use the Landsat Bulk Metadata dataset from the USGS, which provides rich metadata about every observation acquired across the history of the program. Unfortunately, these files are very large and would be difficult to process on an average computer.
Luckily for those using Python, the Dask library provides both multiprocessing and out-of-core computation while keeping to the same function calls you may already know from NumPy and Pandas. Of particular interest to us is the dask.dataframe collection, which lets us easily process the Landsat metadata CSV files.
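As a toy illustration (a sketch unrelated to the Landsat files themselves), a Dask DataFrame exposes largely the same methods as a Pandas DataFrame but evaluates lazily, only doing work when .compute() is called:

import pandas as pd
import dask.dataframe as ddf

# A small Pandas DataFrame split into two partitions that Dask can process in parallel
pdf = pd.DataFrame({'sensor': ['OLI_TIRS', 'ETM', 'OLI_TIRS'],
                    'cloudCoverFull': [10.0, 55.0, 3.0]})
dask_df = ddf.from_pandas(pdf, npartitions=2)

# The familiar Pandas groupby; nothing runs until .compute() is called
print(dask_df.groupby('sensor').cloudCoverFull.mean().compute())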
To begin, first download the Landsat metadata files and unzip them. While Pandas can read compressed CSV files directly, we will want to break these files into many pieces for parallel processing, and they need to be uncompressed before Dask can split them, as we will confirm below.
For this tutorial, we will only be using the Landsat 8 and Landsat 7 metadata:
wget http://landsat.usgs.gov/metadata_service/bulk_metadata_files/LANDSAT_8.csv.gz
gunzip LANDSAT_8.csv.gz
wget http://landsat.usgs.gov/metadata_service/bulk_metadata_files/LANDSAT_ETM_SLC_OFF.csv.gz
gunzip LANDSAT_ETM_SLC_OFF.csv.gz
wget http://landsat.usgs.gov/metadata_service/bulk_metadata_files/LANDSAT_ETM.csv.gz
gunzip LANDSAT_ETM.csv.gz
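As a quick sanity check (a minimal sketch, assuming the files above sit in the working directory), we can confirm that Dask will split an uncompressed file into multiple blocks, which is exactly what the gzipped versions would prevent:

import dask.dataframe as ddf

# Read one uncompressed file lazily in ~20 MB blocks and count the resulting partitions
check = ddf.read_csv('LANDSAT_8.csv', blocksize=int(20e6), dtype=str)
print(check.npartitions)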
With the files downloaded and uncompressed, we can read them all into a single Dask DataFrame. We tell read_csv which columns to keep, what data types to expect, which column holds the acquisition date, and how large each block should be:
In [1]:
import dask.dataframe as ddf

# Columns of interest and their expected data types
columns = {
    'sceneID': str,
    'sensor': str,
    'path': int,
    'row': int,
    'acquisitionDate': str,
    'cloudCover': float,
    'cloudCoverFull': float,
    'sunElevation': float,
    'sunAzimuth': float,
    'DATA_TYPE_L1': str,
    'GEOMETRIC_RMSE_MODEL': float,
    'GEOMETRIC_RMSE_MODEL_X': float,
    'GEOMETRIC_RMSE_MODEL_Y': float,
    'satelliteNumber': float
}

# Read every metadata file lazily, splitting each into ~20 MB blocks
df = ddf.read_csv('LANDSAT*.csv',
                  usecols=list(columns.keys()),
                  dtype=columns,
                  parse_dates=['acquisitionDate'],
                  blocksize=int(20e6))

# Add a 'year' column derived from the acquisition date
df = df.assign(year=df.acquisitionDate.dt.year)
df.columns
In [4]:
# Count the number of observations recorded by each sensor
df.groupby('sensor').sensor.count().compute()
In [5]:
# Mean annual cloud cover for every WRS-2 path/row combination
result = df.groupby(['path', 'row', 'year']).cloudCoverFull.mean().compute()
In [6]:
# Cloud cover history for path 12, row 31 across all years
result.loc[12, 31, :]
In [7]:
# Mean cloud cover by Level-1 product type
result = df.groupby(['DATA_TYPE_L1']).cloudCoverFull.mean().compute()
result
It looks like there is a labeling issue due to a capitalization difference: L1GT versus L1Gt. We can correct this as well:
In [53]:
# Normalize the inconsistent capitalization of 'L1Gt' to 'L1GT' and re-run the aggregation
df = df.assign(DATA_TYPE_L1=df.DATA_TYPE_L1.apply(lambda x: x if x != 'L1Gt' else 'L1GT'))
result = df.groupby(['DATA_TYPE_L1']).cloudCoverFull.mean().compute()
result
In [54]:
# Mean geometric RMSE (x and y) by Level-1 product type and sensor
df.groupby(['DATA_TYPE_L1', 'sensor'])[['GEOMETRIC_RMSE_MODEL_X', 'GEOMETRIC_RMSE_MODEL_Y']].mean().compute()
Unfortunately, it looks like the Landsat 8 observations do not record the estimated geometric accuracy unless the terrain correction using ground control points is successful.
In [77]:
from dask.dot import dot_graph
# Rebuild the aggregation lazily (no .compute()) so we can inspect the task graph Dask builds
result = df.groupby(['DATA_TYPE_L1']).cloudCoverFull.mean()
dot_graph(result.dask)
Because Dask DataFrame implements much of the Pandas API, we can sidestep the added complications of parallel computing and large datasets by first working out our analysis on a small subset with plain Pandas. If we stick to the subset of the Pandas API that Dask supports, leveraging the power of Dask is then as simple as changing the data ingest call and adding a .compute() to the end of the computation. For example, we can prototype against the first 100 rows of a single file:
In [15]:
import pandas as pd
_df = pd.read_csv('LANDSAT_8.csv', parse_dates=['acquisitionDate'], nrows=100)
In [16]:
s = _df['DATA_TYPE_L1']
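From here the pattern is the same whatever the analysis. As a minimal sketch of the workflow described above (the value_counts aggregation is just an illustration, not part of the original analysis), we work the expression out on the Pandas subset, then swap the ingest call for dask.dataframe and append .compute() to run it across every metadata file:

import pandas as pd
import dask.dataframe as ddf

# 1. Work out the analysis on a small Pandas subset (fast, easy to debug)
subset = pd.read_csv('LANDSAT_8.csv', parse_dates=['acquisitionDate'], nrows=100)
print(subset['DATA_TYPE_L1'].value_counts())

# 2. Swap the ingest call for dask.dataframe and add .compute() to run the
#    identical expression across all of the metadata files in parallel
full = ddf.read_csv('LANDSAT*.csv',
                    usecols=['DATA_TYPE_L1', 'acquisitionDate'],
                    dtype={'DATA_TYPE_L1': str},
                    parse_dates=['acquisitionDate'],
                    blocksize=int(20e6))
print(full['DATA_TYPE_L1'].value_counts().compute())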