Throughout the growth cycle of a crop, the plants sprout, grow, leaf out, and mature. Accordingly, the reflectance measured over the crop, its spectral signature, changes as well. In this tutorial, we investigate this change in spectral signature through temporal analysis of satellite imagery of the crops. This type of analysis is made possible by the daily coverage of PlanetScope imagery.
In this tutorial, we perform the analysis in steps of increasing complexity. First, we calculate statistics over a single field in a single Surface Reflectance image. We take advantage of the Cloud-Optimized GeoTIFFs (COGs) Planet provides to download only the pixels within the field, then calculate summary statistics on each band. Next, we calculate statistics over all images covering the field, keeping track of collection date, thus introducing temporal analysis over one field. Finally, we perform temporal analysis over multiple fields, keeping track of their field type classification, to compare temporal signatures across field types. We analyze the results and look into how they can help differentiate crop types.
In this tutorial, we focus on imagery collected between June and September 2017 (the growing season). We use field boundaries and crop type definitions collected in 2015. This dataset is described in the Identify Datasets notebook and prepared in the Prepare Datasets notebook. Because the survey data was collected in 2015 and the imagery used in this tutorial was collected in 2017, we expect some inaccuracies in the field type classification, due to crop rotation during the two-year gap.
In the interest of calculation time, we limit the calculation of temporal statistics to sample subsets of fields and imagery. Definitions for 946 fields are provided in this dataset. Within the tutorial, we find that approximately 120 images are collected of each field within those dates. We also determine that it takes approximately 3 minutes to activate, download, and calculate statistics from 10 images, and 10 minutes for 50 images (because of activation overhead, the cost does not scale down linearly to 12-18 seconds per image).
We first perform an initial calculation from 10 scenes per field for 15 fields. It takes approximately 30 minutes to download and process the 150 scenes. In working with this dataset, we create an algorithm for filtering out outliers (likely due to cloudy scenes). The resulting data reveals some differentiation in the statistics across fields, but the data is sparse. Next, we perform another calculation of temporal statistics, this time over 50 images per field and 15 fields, a total of 750 data points and approximately 2.5 hours of calculation time. To speed up running this notebook, we cache the statistics data from both calculations and pull from the cached data unless run_process is set to True.
In analyzing the results, we find that temporal analysis can be helpful in identifying the field types. The mean reflectance across the three field types studied in this notebook (corn, safflower, and sunflower) shows variation between the visual and NIR bands, variation in behavior over time, and variation in steady-state reflectance.
In [1]:
import datetime
import json
import os
import shutil
import subprocess
import geojson
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from planet import api
from planet.api import filters, downloader
import rasterio
from shapely.geometry import shape
We start with ground-truth.geojson
, which was prepared in the datasets-identify tutorial. This dataset came from a 2015 survey of a region in Sacramento.
We filter to features that correspond with Field Crops, then sample those features to create our study features by randomly choosing a fixed number of features from each subclass within the class.
In [2]:
# this file was prepared in the datasets-identify tutorial
ground_truth_file = os.path.join('src', 'ground-truth.geojson')
In [3]:
def load_geojson(filename):
with open(filename, 'r') as f:
return json.load(f)
ground_truth = load_geojson(ground_truth_file)
The survey data has attributes that provide the crop type. These attributes are described in a PDF distributed with the shapefile. It was unzipped along with the shapefile and is located at data/dwr_survey/09legend.pdf
.
We are interested in the Field Crop class. Class is specified by the CLASS1
property. The Field Crop class is identified by CLASS1
value of F
.
In [4]:
crop_ground_truth = [f for f in ground_truth
if f['properties']['CLASS1'] == 'F']
print('{} out of {} features are field crops.'
.format(len(crop_ground_truth),len(ground_truth)))
The field subclasses are listed in the field_type_names dictionary below.
In this section, we filter out features that are uncategorized (subclass is **
).
In [5]:
field_type_names = {
1: 'cotton',
2: 'safflower',
3: 'flax',
4: 'hops',
5: 'sugar beets',
6: 'corn',
7: 'grain sorghum',
8: 'sudan',
9: 'castor beans',
10: 'beans',
11: 'misc field',
12: 'sunflowers',
13: 'hybrid sorghum/sudan',
14: 'millet',
15: 'sugar cane'
}
In [6]:
cat_crop_ground_truth = [f for f in crop_ground_truth
if f['properties']['SUBCLASS1'] != '**']
print('{} out of {} crop field features are categorized.'
.format(len(cat_crop_ground_truth),len(crop_ground_truth)))
In [7]:
# define and create test directory
# delete it if it already exists to ensure we start from a clean slate
test_dir = os.path.join('data', 'test')
if os.path.isdir(test_dir):
shutil.rmtree(test_dir)
os.mkdir(test_dir)
In [8]:
field_geojson = cat_crop_ground_truth[0]
In [9]:
shape(field_geojson['geometry'])
Out[9]:
In [10]:
def get_id(field_geojson):
return field_geojson['id']
print(get_id(field_geojson))
In [11]:
def create_save_dir(aoi_geojson, root_dir='data'):
save_dir = os.path.join(root_dir, get_id(aoi_geojson))
if not os.path.isdir(save_dir):
os.makedirs(save_dir)
return save_dir
save_dir = create_save_dir(field_geojson)
print(save_dir)
In [12]:
def save_geojson_file(aoi_geojson, save_dir):
filename = os.path.join(save_dir, 'aoi.geojson')
with open(filename, "w") as f:
f.write(json.dumps(aoi_geojson))
return filename
geojson_filename = save_geojson_file(field_geojson, save_dir)
print('wrote to {}'.format(geojson_filename))
In [13]:
# command-line search
# !planet data search --item-type PSScene4Band --geom $geojson_filename
In [14]:
planet_api_key = os.environ['PL_API_KEY']
# quick check that key is defined
assert planet_api_key, "PL_API_KEY not defined."
client = api.ClientV1(api_key=planet_api_key)
In [15]:
item_type = 'PSScene4Band'
asset_type = 'analytic_sr'
In [16]:
# utility functions for searching for scenes that cover the field aoi
# from June through September 2017
def build_request(aoi, item_type):
old = datetime.datetime(year=2017,month=6,day=1)
new = datetime.datetime(year=2017,month=10,day=1)
search_aoi = aoi['geometry']
query = filters.and_filter(
filters.geom_filter(search_aoi),
filters.range_filter('cloud_cover', lt=75),
filters.date_range('acquired', gte=old),
filters.date_range('acquired', lt=new)
)
# build a request for only PlanetScope imagery
request = filters.build_search_request(
query, item_types=[item_type]
)
return request
def get_monthly_stats(client, request):
stats_request = request.copy()
stats_request['interval'] = 'month'
return client.stats(stats_request).get()
request = build_request(field_geojson, item_type)
monthly_stats = get_monthly_stats(client, request)
total = sum([b['count'] for b in monthly_stats['buckets']])
print('monthly stats for intersecting scenes, {} total'.format(total))
print(monthly_stats)
In [17]:
# utilities for retrieving scene information and filtering to only scenes
# that totally overlap field
def get_items(client, request, limit=500):
# run search
# if you don't have an API key configured, this will raise an exception
result = client.quick_search(request)
return result.items_iter(limit=limit)
def filter_by_overlaps(items, aoi):
aoi_shape = shape(aoi['geometry'])
def get_overlap(item):
item_shape = shape(item['geometry'])
overlap = 100.0*(aoi_shape.intersection(item_shape).area / aoi_shape.area)
return overlap
return (i for i in items if get_overlap(i) > 99)
def search_items(aoi, item_type, client, limit=500):
request = build_request(aoi, item_type)
items = get_items(client, request, limit=limit)
return filter_by_overlaps(items, aoi)
item = list(search_items(field_geojson, item_type, client, limit=5))[0]
print(item['id'])
print(item['properties'])
We use the Planet API client downloader to handle activation of the scenes. The downloader handles activation, polling activation status, and (if desired) downloading. Because we are using remote COGs, we do not want to download the scene. However, the downloader can still help us out: it has a handy feature where you can provide it with a function to call when a scene is activated.
In this section, we provide it with a function that records the scene id and download url. The function is actually just a method of a class (Tracker) that maintains a dictionary of ids and download urls. The method simply updates that dictionary when it is called by the downloader.
Using this Tracker class to keep track of the download urls is more complicated than strictly necessary here, as is using the bulk downloader to activate only one scene. However, this serves as a simple example of the workflow. We will use something very similar to activate multiple scenes and calculate stats on only the pixels within the field aoi in the next section.
In [18]:
# create a downloader that will handle scene activation
dl = downloader.create(client)
In [19]:
# This class keeps track of activated scene download urls
# It also creates the on_complete callback function, which is called
# by the downloader to update the dict of scene download urls
class Tracker(object):
def __init__(self):
self.urls = dict()
def get_on_complete(self):
def on_complete(item, asset):
self.urls[item['id']] = asset['location']
print('{}:{}'.format(item['id'], asset['location']))
return on_complete
# create the function that keeps track of the download urls
tracker = Tracker()
dl.on_complete = tracker.get_on_complete()
# use the downloader to activate the scene and get the download url
dl.shutdown()
%time dl.activate((i for i in [item]), [asset_type])
item_id, download_url = list(tracker.urls.items())[0]
print(item_id)
In [20]:
# we need to use the vsicurl gdal driver to work with COGs.
vsicurl_url = '/vsicurl/' + download_url
In [21]:
# uncomment to check that gdal can access the COG via the vsicurl driver
# !gdalinfo $vsicurl_url
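As an alternative quick check from Python, rasterio can also read the remote COG's metadata through the same /vsicurl/ path. This is a minimal sketch, assuming download_url was obtained from the activation step above; it only fetches header metadata, not pixel data.
# sketch: inspect the remote COG with rasterio (assumes download_url from above)
with rasterio.open('/vsicurl/' + download_url) as src:
    print(src.count, src.width, src.height)  # band count and raster dimensions
    print(src.crs)                           # coordinate reference system
    print(src.bounds)                        # scene footprint in the scene CRS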
In [22]:
def create_output_filename(item_id, save_dir):
filename = os.path.join(save_dir, item_id + '.tif')
return filename
output_file = create_output_filename(item_id, save_dir)
In [23]:
# we use gdalwarp and the crop_to_cutline argument to only download the aoi portion of the COG
def _gdalwarp(input_filename, output_filename, options, verbose=False):
commands = ['gdalwarp'] + options + \
['-overwrite',
input_filename,
output_filename]
if verbose: print(' '.join(commands))
subprocess.check_call(commands)
def download_scene_aoi(download_url, output_filename, geojson_filename, verbose=False):
vsicurl_url = '/vsicurl/' + download_url
options = [
'-cutline', geojson_filename,
'-crop_to_cutline',
]
_gdalwarp(vsicurl_url, output_filename, options, verbose=verbose)
%time download_scene_aoi(download_url, output_file, geojson_filename, verbose=True)
In [24]:
# load local visual module
# autoreload because visual is in development
%load_ext autoreload
%autoreload 2
import visual
In [25]:
def load_sr(filename):
with rasterio.open(filename, 'r') as src:
# PSScene4Band analytic_sr band ordering: blue, green, red, NIR
b, g, r, n = src.read()
# NoData value is 0
mask = b == 0
return [np.ma.array(band, mask=mask) for band in [b, g, r, n]]
def visualize_sr(filename, title='Cropped Scene'):
bgrn_bands = load_sr(filename)
rgb_bands = [bgrn_bands[i] for i in [2, 1, 0]]
visual.plot_image(rgb_bands, title=title, figsize=(5, 5))
print(output_file)
visualize_sr(output_file)
In [26]:
def get_band_stats(band):
"""Calculate simple statistics for a band"""
# Consider adding stats from here:
# https://docs.scipy.org/doc/scipy/reference/stats.mstats.html
stats = {
'mean': band.mean(),
'std': band.std(),
'max': band.max(),
'min': band.min(),
'count': band.count()
}
return stats
def get_stats(filename):
bands = load_sr(filename)
stats = [get_band_stats(band)
for band in bands]
return stats
print(get_stats(output_file))
In this section, we download the aoi image and calculate the band statistics all together in a function that is called directly by the downloader.
In the interest of time, we limit the number of images to 10 or fewer (the search may return 10 scenes, but some of them may not completely overlap the field, so fewer may be processed).
In [27]:
def get_subclass(crop_geojson):
return crop_geojson['properties']['SUBCLASS1']
def get_date(scene_id):
date_str = scene_id[:8]
return datetime.datetime.strptime(date_str, "%Y%m%d").date()
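As a quick illustration of the naming convention, PlanetScope scene ids begin with the acquisition date in YYYYMMDD form, so get_date extracts the date directly from the id. The id below is hypothetical, used only to show the format.
# quick check with a hypothetical scene id; the first 8 characters encode the date
print(get_date('20170601_181755_1030'))  # -> 2017-06-01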
In [28]:
class StatsCalculator(object):
def __init__(self, aoi_geojson, root_dir='data'):
self.save_dir = create_save_dir(aoi_geojson, root_dir=root_dir)
self.geojson_file = save_geojson_file(aoi_geojson, self.save_dir)
self.aoi_geojson = aoi_geojson
self.info = []
def get_on_complete(self, overwrite=False, verbose=False):
def on_complete(item, asset):
download_url = asset['location']
scene_id = item['id']
if verbose: print('{}'.format(scene_id))
output_file = create_output_filename(scene_id, self.save_dir)
if overwrite or not os.path.isfile(output_file):
if verbose: print('downloading...')
download_scene_aoi(download_url, output_file, self.geojson_file, verbose=False)
band_names = ['blue', 'green', 'red', 'nir']
band_stats = get_stats(output_file)
for stats, name in zip(band_stats, band_names):
info = {
'field_id': get_id(self.aoi_geojson),
'field_type': get_subclass(self.aoi_geojson),
'scene_id': scene_id,
'scene_filename': output_file,
'band': name,
'date': get_date(scene_id)
}
info.update(stats)
self.info.append(info)
return on_complete
def get_field_stats_info(field_geojson, max_scenes, client, item_type, overwrite=False, verbose=False):
if max_scenes > 0:
items = search_items(field_geojson, item_type, client, limit=max_scenes)
stats_calculator = StatsCalculator(field_geojson)
dl.on_complete = stats_calculator.get_on_complete(overwrite=overwrite, verbose=verbose)
dl.shutdown()
dl.activate(items, [asset_type])
info = stats_calculator.info
else:
# skip activation if max_scenes is not > 0
info = []
return pd.DataFrame(data=info)
# it takes about 3-4 minutes to download and calculate stats on 10 scenes
max_scenes=10
# we set overwrite to False (the default value of that variable) so we do not re-download
# the image if it is already cached
field_info = get_field_stats_info(field_geojson, max_scenes, client, item_type, overwrite=False)
In [29]:
print('calculated statistics across {} scenes'.format(int(len(field_info) / 4)))
In [30]:
# show first few scenes of field
# we filter to just red data points because there are 4 data points for each scene
# (red, green, blue, and NIR)
collects = field_info[field_info.band == 'red'][:3]
for collect in collects.itertuples():
title='{}\n'.format(collect.scene_id)
title += 'mean % : {}'.format(collect.mean)
visualize_sr(collect.scene_filename, title=title)
In [80]:
def plot_band_statistic(df, stat, title):
for field_type, group in df.groupby('field_type'):
field_name = field_type_names[int(field_type)]
x = group['date'].tolist()
y = group[stat].tolist()
plt.scatter(x, y, marker='^', label=field_name)
all_x = df['date'].tolist()
one_day = datetime.timedelta(days=1)
plt.xlim(min(all_x) - one_day, max(all_x) + one_day)
plt.margins(x=0)
plt.xticks(rotation=90)
plt.title(title)
def plot_statistic(df, statistic, title=None):
fig = plt.figure(figsize=(20,5))
bands = ['blue', 'green', 'red', 'nir']
title = title if title is not None else statistic
fig.suptitle(title)
for i, band in enumerate(bands):
ax = fig.add_subplot(1,4,i+1)
band_group = df[df['band'] == band]
plot_band_statistic(band_group, statistic, band)
ax.legend(bbox_to_anchor=(1.05, 0), loc='lower left', borderaxespad=0.)
plt.show()
def plot_all_statistics(df):
plot_statistic(df, 'mean')
plot_statistic(df, 'std')
plot_all_statistics(field_info)
Up to this point, we have focused on stats of a single field. Now we get into the meat of things, calculating stats across multiple fields and grouping by crop type (specified by the subclass property).
There are just too many categorized crop fields to calculate statistics on them all in a reasonable time using just one CPU. Therefore, we create a list of sample features that equally represent the crop types.
In [81]:
# determine the subclasses in this set and counts
subclasses_list = [field_geojson['properties']['SUBCLASS1']
for field_geojson in cat_crop_ground_truth]
subclasses = dict([x, subclasses_list.count(x)]
for x in set(subclasses_list))
print('subclasses and counts')
print(json.dumps(subclasses, indent=4))
In [82]:
# number of samples for each subclass
num_samples = 5
In [83]:
# filter the subclasses to those with adequate number of features
filt_subclasses = [subclass
for (subclass, count) in subclasses.items()
if count > num_samples]
print('filtered subclasses: {}'.format(filt_subclasses))
In [84]:
# let's focus on only 3 subclasses for now; comment out this line to use all subclasses
filt_subclasses = filt_subclasses[:3]
print('filtered subclasses: {}'.format(filt_subclasses))
In [85]:
# create a list of sample features
# first filter to features within a subclass, then randomly pick a sample of those features
np.random.seed(0) # make random sampling repeatable
sample_features = []
for subclass in filt_subclasses:
subclass_features = [f for f in crop_ground_truth if get_subclass(f) == subclass]
sample_features.extend(np.random.choice(subclass_features, num_samples, replace=False))
print('{} sample field features'.format(len(sample_features)))
In this section, we calculate statistics for all of the sample features. We loop through each of the fields, activating and calculating stats for each scene that overlaps the field.
The calculation of stats for one field from 50 scenes takes about 10 minutes. Therefore, we expect the calculation of stats for 15 fields from 50 scenes each to take 150 minutes, or 2.5 hours. That's quite a while! For now, we will limit the number of scenes per field to 10, which should take about 30 minutes to process. Still a while, but not unreasonable.
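As a rough sanity check on those estimates (the per-field timings below are assumptions based on the single-field runs above), the arithmetic works out as follows:
# rough runtime estimates, assuming ~10 minutes per field at 50 scenes/field
# and ~2 minutes per field at 10 scenes/field (timings observed above)
num_fields = len(sample_features)
print('{} fields x ~10 min/field = ~{} min at 50 scenes/field'.format(num_fields, num_fields * 10))
print('{} fields x ~2 min/field = ~{} min at 10 scenes/field'.format(num_fields, num_fields * 2))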
In [86]:
max_scenes = 10
In [87]:
def get_all_field_stats_info(sample_features, client, item_type, max_scenes=10):
    # get_field_stats_info returns a DataFrame per field; concatenate them all
    all_info = []
    for field_geojson in sample_features:
        info = get_field_stats_info(field_geojson, max_scenes, client, item_type)
        all_info.append(info)
    return pd.concat(all_info, ignore_index=True)
In [88]:
run_process = False
scenes_info_filename = 'src/scene_info_df.pkl'
if run_process:
# calculate stats for all sample fields, max_scenes scenes per field
info_df = get_all_field_stats_info(sample_features, client, item_type, max_scenes=max_scenes)
# uncommenting overwrites cached data
# info_df.to_pickle(scenes_info_filename)
else:
info_df = pd.read_pickle(scenes_info_filename)
In [89]:
plot_all_statistics(info_df)
It is difficult to see the results due to the outliers across all of the bands. These outliers could potentially be cloudy scenes. To clean up the results, we will filter out the scenes that have mean values that are outliers across all bands.
In [90]:
def filter_bad_scenes(df):
bad_scene_sets = []
for band, group in df.groupby('band'):
m = group['mean']
mmax = m.mean() + 2 * m.std()
outliers = group[m > mmax]
bad_scene_sets.append(set(outliers['scene_id']))
# get list of scenes that have outliers in all 4 bands
bad_scenes = set.intersection(*bad_scene_sets)
print('{} bad scenes:\n{}'.format(len(bad_scenes), bad_scenes))
bad_scene_entry = df.scene_id.isin(bad_scenes)
print('{} entries from the bad scenes filtered out'.format(bad_scene_entry.sum()))
return df[~bad_scene_entry]
filtered_info_df = filter_bad_scenes(info_df)
In [91]:
plot_all_statistics(filtered_info_df)
Awesome! The filter removed the outliers. From these results, it looks like corn mean reflectance is lower in the blue, green, and red bands than that of sunflowers and safflower. Also, the mean reflectance of corn in the NIR band decreases over time. Sunflowers and safflower are more difficult to distinguish. Ultimately, more data points would help here.
The section below will run calculations on 50 scenes per field. It can take 2.5 hours so is not for the faint of heart! The data is also cached, so you can load and explore it without having to go through all of that processing.
In [92]:
def analyze_more_scenes(sample_features, client, item_type):
    dl.shutdown()
    # get_all_field_stats_info returns a DataFrame of stats for all sample fields
    info_df = get_all_field_stats_info(sample_features, client, item_type, max_scenes=50)
    filtered_info_df = filter_bad_scenes(info_df)
    return filtered_info_df
run_process = False
more_scenes_filename = 'src/more_scenes_df.pkl'
if run_process:
more_scenes_df = analyze_more_scenes(sample_features, client, item_type)
# uncommenting overwrites cached data
# more_scenes_df.to_pickle(more_scenes_filename)
else:
more_scenes_df = pd.read_pickle(more_scenes_filename)
In [93]:
plot_all_statistics(more_scenes_df)
These plots give a little more insight into the field temporal statistics. Sunflower and corn NIR reflectance means decrease over time, while safflower NIR reflectance mean stays pretty steady. Corn red, green, and blue reflectance mean stays pretty steady, with a slight increase over time. Sunflower red, green, and blue reflectance mean is all over the place, which is harder to interpret. Safflower red, green, and blue reflectance mean is, similar to corn, pretty steady with a slow increase over time.
In general, there isn't much of a pattern in the standard deviation of the reflectance bands. Corn does stand out as having the lowest reflectance standard deviation across the blue, green, and red bands. Additionally, the standard deviation in the NIR band is higher than in the visual bands. NIR reflectance itself is also much higher than red, green, and blue reflectance (which makes sense, as we are looking at actively growing fields), and the higher standard deviation could either be due to increased sensitivity to crop health relative to the visual bands, or to higher noise associated with the stronger NIR signal.
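For a compact numeric view of these differences, the cached dataframe can be summarized into a band-by-field-type table of mean reflectance. This is a minimal sketch, assuming more_scenes_df has been loaded from the cache above:
# sketch: average the per-scene band means, grouped by band and field type
summary = (more_scenes_df
           .groupby(['band', 'field_type'])['mean']
           .mean()
           .unstack('field_type'))
# map subclass codes to crop names for readability
summary.columns = [field_type_names[int(c)] for c in summary.columns]
print(summary)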