Our aim in this notebook is to classify crop type using PlanetScope 4-band Orthotiles. The crop types of particular interest are corn and soybeans.
CART is a decision tree algorithm that has shown great promise for classification of remotely sensed imagery. We will use this algorithm to classify pixels as corn, soybean, or neither.
In this notebook, we will focus on using the PlanetScope imagery 4 bands as well as NDVI calculation as the features that are fed into the CART algorithm. We will train on one PS Orthotile and validate on another PS Orthotile.
This notebook demonstrates the following:
The USDA 2016 Crop Data Layer (CDL) provides a national crop type dataset. This dataset was build using Landsat 8, DMC Deimos-1, and UK 2 satellite imagery (src), using supervised classification (decision trees) based on ground truth from the Farm Service Agency (FSA) Common Land Unit (CLU) Program. Since this is a derived dataset, it isn't a ground truth dataset but it is known to be quite accurate so can be used as our gold standard dataset. This dataset is provided as a georegistered raster (geoTIFF).
Since corn and soybeans are the primary crops grown in Iowa, we will focus our analysis in that state. The metadata provided for the Iowa 2016 CDL indicates that it's accuracy is 96.4% and that corn (categorization code 1) and soybeans (categorization code 5) are indeed the primary crop types in the state.
Iowa 1997 statistics: corn planting ended June 3, harvesting began Sept 17 and soybean planting ended June 17 and harvesting began Sept 21. Therefore, imagery should be limited to between June 3 and Sept 17, 2016.
In [1]:
from collections import namedtuple, OrderedDict
import os
import pathlib
import pickle
from subprocess import check_output, STDOUT, CalledProcessError
import tempfile
from xml.dom import minidom
import matplotlib
import matplotlib.colors as colors
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import rasterio
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
Get a PS Orthotile that has majority coverage over the grid, is from a SSO satellite, is free from clouds, and was taken between June 3 and Sept 17, 2016.
Using planet explorer, 210879_1558814_2016-07-25_0e16 was identified as a good candidate.
In [2]:
train_scene_id = '210879_1558814_2016-07-25_0e16'
In [3]:
# uncomment below to learn more about PSOrthotile 210879_1558814_2016-07-25_0e16
# !planet data search --item-type PSOrthoTile --string-in id $train_scene_id
In [4]:
# define and, if necessary, create cart data directory
train_folder = os.path.join('data', 'cart', '210879_1558814_2016-07-25_0e16')
pathlib.Path(train_folder).mkdir(parents=True, exist_ok=True)
# define data file filenames and ensure they exist
train_files = {
'scene': os.path.join(train_folder, '210879_1558814_2016-07-25_0e16_BGRN_Analytic.tif'),
'metadata': os.path.join(train_folder, '210879_1558814_2016-07-25_0e16_BGRN_Analytic_metadata.xml'),
'udm': os.path.join(train_folder, '210879_1558814_2016-07-25_0e16_BGRN_DN_udm.tif'),
}
In [5]:
# First test if scene file exists, if not, use the Planet commandline tool to download the image, metadata, and udm.
# This command assumes a bash shell, available in Unix-based operating systems.
train_scene = train_files['scene']
!test -f $train_scene || \
planet data download \
--item-type PSOrthoTile \
--dest $train_folder \
--asset-type analytic,analytic_xml,udm \
--string-in id $train_scene_id
In [6]:
for filename in train_files.values():
print(filename)
assert os.path.isfile(filename)
In [7]:
predata_folder = 'pre-data'
In [8]:
train_CDL_filename = os.path.join(predata_folder, 'CDL_2016_19_train.tif')
assert os.path.isfile(train_CDL_filename)
train_files['gold'] = train_CDL_filename
Get another PS Orthotile that has majority coverage over the grid, is from a SSO satellite, is free from clouds, and was taken between June 3 and Sept 17, 2016.
Using planet explorer, 210863_1559015_2016-07-25_0e0f was identified as a good candidate.
In [9]:
test_scene_id = '210863_1559015_2016-07-25_0e0f'
In [10]:
# uncomment below to learn more about the test PSOrthotile
# !planet data search --item-type PSOrthoTile --string-in id $test_scene_id
In [11]:
# define and, if necessary, create cart data directory
test_folder = os.path.join('data', 'cart', '210863_1559015_2016-07-25_0e0f')
pathlib.Path(test_folder).mkdir(parents=True, exist_ok=True)
# define data file filenames and ensure they exist
test_files = {
'scene': os.path.join(test_folder, '210863_1559015_2016-07-25_0e0f_BGRN_Analytic.tif'),
'metadata': os.path.join(test_folder, '210863_1559015_2016-07-25_0e0f_BGRN_Analytic_metadata.xml'),
'udm': os.path.join(test_folder, '210863_1559015_2016-07-25_0e0f_BGRN_DN_udm.tif'),
}
In [12]:
# First test if scene file exists, if not, use the Planet commandline tool to download the image, metadata, and udm.
# This command assumes a bash shell, available in Unix-based operating systems.
test_scene = test_files['scene']
!test -f $test_scene || \
planet data download \
--item-type PSOrthoTile \
--dest $test_folder \
--asset-type analytic,analytic_xml,udm \
--string-in id $test_scene_id
In [13]:
for filename in test_files.values():
print(filename)
assert os.path.isfile(filename)
In [56]:
test_CDL_filename = os.path.join(predata_folder, 'CDL_2016_19_test.tif')
assert os.path.isfile(test_CDL_filename)
test_files['gold'] = test_CDL_filename
In [15]:
# Utility function for working with large images and memory limitations
def _read_window(filename, window):
with rasterio.open(filename, 'r') as src:
return src.read(window=window)
def decimated(arry, num=8):
return arry[::num, ::num].copy()
In [16]:
# Utility functions: visualizing a classified band
def plot_classified_band(class_band, class_labels=None, cmap='rainbow',
title='Class Labels', figdim=10):
fig = plt.figure(figsize=(figdim, figdim))
ax = fig.add_subplot(1, 1, 1)
imshow_class_band(ax, class_band, class_labels, cmap=cmap)
ax.set_title(title)
ax.set_axis_off()
def imshow_class_band(ax, class_band, class_labels=None, cmap='rainbow'):
"""Show classified band with colormap normalization and color legend. Alters ax in place.
possible cmaps ref: https://matplotlib.org/examples/color/colormaps_reference.html
"""
class_norm = _ClassNormalize(class_band)
im = ax.imshow(class_band, cmap=cmap, norm=class_norm)
try:
# add class label legend
# https://stackoverflow.com/questions/25482876
# /how-to-add-legend-to-imshow-in-matplotlib
color_mapping = class_norm.mapping
colors = [im.cmap(color_mapping[k]) for k in class_labels.keys()]
labels = class_labels.values()
# https://matplotlib.org/users/legend_guide.html
# tag: #creating-artists-specifically-for-adding-to-the-legend-aka-proxy-artists
patches = [mpatches.Patch(color=c, label=l) for c,l in zip(colors, labels)]
ax.legend(handles=patches, bbox_to_anchor=(1, 1), loc='upper right', borderaxespad=0.)
except AttributeError:
# class_labels not specified
pass
# https://matplotlib.org/users/colormapnorms.html#custom-normalization-two-linear-ranges
class _ClassNormalize(colors.Normalize):
"""Matplotlib colormap normalizer for a classified band."""
def __init__(self, arry):
# get unique unmasked values
values = [v for v in np.unique(arry)
if not isinstance(v, np.ma.core.MaskedConstant)]
color_ticks = np.array(range(len(values)), dtype=np.float) / (len(values) - 1)
self._mapping = dict((v, ct)
for v, ct in zip(values, color_ticks))
# Initialize base Normalize instance
vmin = 0
vmax = 1
clip = False
colors.Normalize.__init__(self, vmin, vmax, clip)
def __call__(self, arry, clip=None):
# round array back to ints for logical comparison
arry = np.around(arry)
new_arry = arry.copy()
for k, v in self._mapping.items():
new_arry[arry==k] = v
return new_arry
@property
def mapping(self):
return self._mapping
In [17]:
def load_udm(udm_filename, window=None):
'''Load bit-encoded UDM as an array.
The description for the information encoded in the bits in the UDM is from:
https://www.planet.com/docs/spec-sheets/sat-imagery/
section 7.2
Bit 0: blackfill
Bit 1: cloud covered
Bit 2: missing or suspect data in Blue band
Bit 3: missing or suspect data in Green band
Bit 4: missing or suspect data in Red band
Bit 6: missing or suspect data in NIR band
Pixels with no issues have all bits set to 0, therefore their values are zero.
'''
return _read_window(udm_filename, window)[0,...]
def get_udm_labels(udm):
return OrderedDict((v, '{0:b}'.format(v)) for v in np.unique(udm))
def _test():
udm = load_udm(train_files['udm'])
udm_labels = get_udm_labels(udm)
plot_classified_band(decimated(udm), class_labels=udm_labels, title='UDM Mask')
_test()
It looks like there are quite a few regions (they look like crops) where there is an issue in the NIR band. Likely this is due to saturation in the band.
The next step in loading the UDM is to convert it to a Boolean mask for application to the image bands.
In [18]:
def udm_to_mask(udm_array):
'''Mask all pixels with any issues.'''
return udm_array != 0
def _test():
mask = udm_to_mask(load_udm(train_files['udm']))
print('{}/{} ({:0.0f}%) masked'.format(mask.sum(), mask.size,
(100.0 * mask.sum())/mask.size))
_test()
In [19]:
NamedBands = namedtuple('NamedBands', 'b, g, r, nir')
def load_masked_bands(scene_filename, udm_filename, window=None):
"""Loads a 4-band BGRNir Planet Image file as a list of masked bands.
The masked bands share the same mask, so editing one band mask will
edit them all.
"""
bands = load_bands(scene_filename, window=window)
mask = udm_to_mask(load_udm(udm_filename, window=window))
return NamedBands(*[np.ma.array(b, mask=mask)
for b in bands])
def load_bands(filename, window=None):
b, g, r, nir = _read_window(filename, window=window)
return NamedBands(b=b, g=g, r=r, nir=nir)
def get_mask(bands):
return bands[0].mask
def check_mask(bands):
band_mask = get_mask(bands)
return '{}/{} ({:0.0f}%) masked'.format(band_mask.sum(), band_mask.size,
(100.0 * band_mask.sum())/band_mask.size)
def _test():
# window = ((500,1500),(500,1500))
window = None
print(check_mask(load_masked_bands(train_files['scene'],
train_files['udm'],
window=window)))
_test()
In [20]:
# Utility functions: converting an image to reflectance
NamedCoefficients = namedtuple('NamedCoefficients', 'b, g, r, nir')
def read_refl_coefficients(metadata_filename):
# https://github.com/planetlabs/notebooks/blob/master/jupyter-notebooks/ndvi/ndvi_planetscope.ipynb
xmldoc = minidom.parse(metadata_filename)
nodes = xmldoc.getElementsByTagName("ps:bandSpecificMetadata")
# XML parser refers to bands by numbers 1-4
coeffs = {}
for node in nodes:
bn = node.getElementsByTagName("ps:bandNumber")[0].firstChild.data
if bn in ['1', '2', '3', '4']:
i = int(bn)
value = node.getElementsByTagName("ps:reflectanceCoefficient")[0].firstChild.data
coeffs[i] = float(value)
return NamedCoefficients(b=coeffs[1],
g=coeffs[2],
r=coeffs[3],
nir=coeffs[4])
def _apply_coeffs(bands, coeffs):
return NamedBands(*[b.astype(float)*c for b,c in zip(bands, coeffs)])
def _test():
print(read_refl_coefficients(train_files['metadata']))
_test()
In [21]:
def load_reflectance_bands(scene_file, udm_file, metadata_file, window=None):
return _to_reflectance(load_masked_bands(scene_file, udm_file, window=window),
metadata_file)
def _to_reflectance(bands, metadata_filename):
coeffs = read_refl_coefficients(metadata_filename)
return _apply_coeffs(bands, coeffs)
def _test():
# window = ((500,1500),(500,1500))
window = None
bands = load_reflectance_bands(train_files['scene'],
train_files['udm'],
train_files['metadata'],
window=window)
print([b.mean() for b in bands])
_test()
In [22]:
# Utility functions: displaying an image
def _linear_scale(ndarray, old_min, old_max, new_min, new_max):
"""Linear scale from old_min to new_min, old_max to new_max.
Values below min/max are allowed in input and output.
Min/Max values are two data points that are used in the linear scaling.
"""
#https://en.wikipedia.org/wiki/Normalization_(image_processing)
return (ndarray - old_min) * (new_max - new_min) / (old_max - old_min) + new_min
# print(linear_scale(np.array([1,2,10,100,256,2560, 2660]), 2, 2560, 0, 256))
def _mask_to_alpha(bands):
band = np.atleast_3d(bands[0])
alpha = np.zeros_like(band)
alpha[~band.mask] = 1
return alpha
def _add_alpha_mask(bands):
return np.dstack([bands, _mask_to_alpha(bands)])
def scale_bands(band_stack):
old_min = np.percentile(band_stack, 2)
old_max = np.percentile(band_stack, 98)
old_min = band_stack.min()
old_max = band_stack.max()
new_min = 0
new_max = 1
scaled = [np.clip(_linear_scale(b.astype(np.float),
old_min, old_max,
new_min, new_max),
new_min, new_max)
for b in bands]
return scaled
def bands_min(bands):
# old_min = min([b.min() for b in bands])
old_min = np.percentile(np.concatenate([b.compressed() for b in bands]), 2)
return old_min
def bands_max(bands):
# old_max = max([b.max() for b in bands])
old_max = np.percentile(np.concatenate([b.compressed() for b in bands]), 98)
return old_max
def bands_to_display(bands, alpha=False):
"""Converts a list of 3 bands to a 3-band rgb, normalized array for display."""
assert len(bands) in [3]
old_min = bands_min(bands)
old_max = bands_max(bands)
# print('old min: {}, old max: {}'.format(old_min, old_max))
new_min = 0
new_max = 1
scaled = [np.clip(_linear_scale(b.astype(np.float),
old_min, old_max,
new_min, new_max),
new_min, new_max)
for b in bands]
filled = [b.filled(fill_value=new_min) for b in scaled]
if alpha:
filled.append(_mask_to_alpha(scaled))
return np.dstack(filled)
In [23]:
def show_bands(bands):
fig, axes = plt.subplots(nrows=1, ncols=2,
sharex=True, sharey=True,
figsize=(20,10))
for ax in axes.ravel():
ax.set_adjustable('box-forced')
ax.set_axis_off()
ax2, ax4 = axes
ax2.imshow(decimated(bands_to_display([bands.r, bands.g, bands.b], alpha=True)))
ax2.set_title('Reflectance RGB Bands, Alpha from Mask')
ax4.imshow(decimated(bands_to_display(3*[bands.nir], alpha=True)))
ax4.set_title('Reflectance NIR Band, Alpha from Mask')
plt.tight_layout()
def _test():
window = None
refl_bands = load_reflectance_bands(train_files['scene'],
train_files['udm'],
train_files['metadata'],
window=window)
print(check_mask(refl_bands))
for b in refl_bands:
print (b.min(), b.max())
show_bands(refl_bands)
_test()
In [24]:
def calculate_ndvi(bands):
ndvi = (bands.nir.astype(np.float) - bands.r) / (bands.nir + bands.r)
return ndvi
def build_feature_bands(bands):
"""Prepare bands representing pixel features and provide feature names.
Takes as input NamedBands
Returns (1) tuple of bands representing features and (2) feature names
"""
# not copying these bands to minimize memory footprints
features = (bands.b, bands.g, bands.r, bands.nir, calculate_ndvi(bands))
feature_names = ('Blue', 'Green', 'Red', 'NIR', 'NDVI')
return features, feature_names
def display_feature_bands(bands, names):
# for this notebook, we know there are 4 features and we will use that
# knowledge to side-step some logic in organizing subplots
assert len(bands) == 5
fig, subplot_axes = plt.subplots(nrows=3, ncols=2,
sharex=True, sharey=True,
figsize=(10,10))
axes = subplot_axes.flat[:-1]
delaxis = subplot_axes.flat[-1]
fig.delaxes(delaxis)
for ax, band, name in zip(axes, bands, names):
ax.set_adjustable('box-forced')
ax.axis('off')
pcm = ax.imshow(band, alpha=True)
ax.set_title(name)
fig.colorbar(pcm, ax=ax,
pad=0.05, shrink=0.9)
plt.tight_layout()
def _test():
# window = ((500,1500),(500,1500))
window = None
refl_bands = load_reflectance_bands(train_files['scene'],
train_files['udm'],
train_files['metadata'],
window=None)
feat_bands, feat_names = build_feature_bands(refl_bands)
display_feature_bands(feat_bands, feat_names)
_test()
From the scales on the colorbars, we see that the scales of the features are not all the same. Max for the spectral band features is ~0.35 while NDVI feature max is ~0.9. This is not a problem for decision trees, which do not require that features be normalized before training.
In [25]:
def to_X(feature_bands):
"""Convert feature_bands (tuple of bands) to 2d array for working with classifier.
"""
return np.stack([f.compressed() for f in feature_bands], # exclude masked pixels
axis=1)
def _test():
refl_bands = load_reflectance_bands(train_files['scene'],
train_files['udm'],
train_files['metadata'],
window=((500,1500),(500,1500)))
feat_bands, feat_names = build_feature_bands(refl_bands)
X = to_X(feat_bands)
print(X.shape)
_test()
In [26]:
# class labels from metadata
CLASS_LABELS = {1: 'corn', 5: 'soybeans'}
In [27]:
def load_gold_class_band(gold_filename, class_labels=None, window=None, fill_value=0):
gold_class_band = _read_window(gold_filename, window)[0,...]
try:
# mask pixels with a value not in class_labels
masks = [gold_class_band == val for val in class_labels.keys()]
mask = np.any(np.dstack(masks), axis=2)
mask = ~mask
masked_band = np.ma.array(np.ma.array(gold_class_band, mask=mask).filled(fill_value=fill_value),
mask=mask)
except AttributeError:
# mask nothing
null_mask = np.zeros(gold_class_band.shape, dtype=np.bool)
masked_band = np.ma.array(gold_class_band, mask=null_mask)
return masked_band
def _test():
print(np.unique(load_gold_class_band(train_files['gold'])))
print(np.unique(load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS)))
# _test()
In [28]:
def _plot():
plot_classified_band(load_gold_class_band(train_files['gold'],
class_labels=CLASS_LABELS,
window=((500,1500),(500,1500))),
class_labels=CLASS_LABELS, title='Gold Class Labels, Windowed', figdim=5)
plot_classified_band(decimated(load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS)),
class_labels=CLASS_LABELS, title='Gold Class Labels')
# _plot()
In [29]:
def to_y(classified_band):
return classified_band.compressed()
def _test():
print(to_y(load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS)).shape)
_test()
We to train the classifier with all of the data from the Orthotile and use the dummy label data generated above. The goal is to ensure we can run through training and predicting with the classifier and end up with a classified band that looks like the original in shape/mask and shows the classes as labeled above.
To speed up processing, by default the classifier is loaded from cache.
In [30]:
# window = ((500,1500),(500,1500))
window = None
In [31]:
refl_bands = load_reflectance_bands(train_files['scene'],
train_files['udm'],
train_files['metadata'],
window=window)
feat_bands, _ = build_feature_bands(refl_bands)
X = to_X(feat_bands)
print(X.shape)
In [32]:
gold_class_band = load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS, window=window)
gold_class_band.mask = get_mask(feat_bands)
y = to_y(gold_class_band)
print(y.shape)
Unconstrained, decision trees overfit the training data. Therefore, we must constrain the decision tree by setting a hyperparameter. For now, we will just blindly set the max_depth
hyperparameter. Eventually, it would be ideal to use cross-validation to identify the best hyperparameter values.
In [44]:
clf_cache_file = os.path.join('pre-data', 'classify-cart-clf.sav')
load_clf = True
if load_clf:
assert os.path.isfile(clf_cache_file)
assert X.shape == (49474203, 5)
assert y.shape == (49474203,)
clf = pickle.load(open(clf_cache_file, 'rb'))
else:
clf = DecisionTreeClassifier(random_state=0, max_depth=5)
clf.fit(X, y)
In [45]:
# optionally save ps clf model
save_clf = False
if save_clf:
pickle.dump(clf, open(clf_cache_file, 'wb'))
Check to see what classes are predicted
In [46]:
y_pred = clf.predict(X)
print(np.unique(y_pred))
In [47]:
def classified_band_from_y(y, band_mask):
class_band = np.ma.array(np.zeros(band_mask.shape),
mask=band_mask.copy())
class_band[~class_band.mask] = y
return class_band
pred_band = classified_band_from_y(y_pred, get_mask(refl_bands))
In [48]:
def plot_predicted_vs_truth(predicted_class_band, gold_class_band, class_labels=None, figsize=(15,15)):
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
sharex=True, sharey=True,
figsize=figsize)
for ax in (ax1, ax2):
ax.set_adjustable('box-forced')
imshow_class_band(ax1, decimated(predicted_class_band), class_labels=class_labels)
ax1.set_title('Classifier Predicted Classes')
imshow_class_band(ax2, decimated(gold_class_band), class_labels=class_labels)
ax2.set_title('Gold Dataset Classes')
plt.tight_layout()
plot_predicted_vs_truth(pred_band, gold_class_band,class_labels=CLASS_LABELS)
In [49]:
# memory clean up
del refl_bands
del feat_bands
del X
del y
del y_pred
Alright! This looks promising. The predicted classes just look like a higher-resolution version of the gold dataset classes. This makes sense, the gold dataset image was upsampled from 30m to 3m (the PS Orthotile pixel size) using nearest-neighbor so would not have the detail to match a prediction performed on the PS Orthotile resolution.
In [53]:
# Uncomment the function calls to visualize the test dataset
# Due to memory limitations, all visualizations are not enabled by default
def _load_utm():
udm = load_udm(test_files['udm'])
udm_labels = get_udm_labels(udm)
plot_classified_band(decimated(udm), class_labels=udm_labels, title='UDM Mask')
# _load_utm()
def _show_bands():
window = None
files = test_files
refl_bands = load_reflectance_bands(files['scene'],
files['udm'],
files['metadata'],
window=window)
print(check_mask(refl_bands))
for b in refl_bands:
print (b.min(), b.max())
show_bands(refl_bands)
_show_bands()
def _plot_classified_band():
plot_classified_band(load_gold_class_band(test_files['gold'],
class_labels=CLASS_LABELS,
window=((500,1500),(500,1500))),
class_labels=CLASS_LABELS, title='Gold Class Labels, Windowed', figdim=5)
plot_classified_band(decimated(load_gold_class_band(test_files['gold'], class_labels=CLASS_LABELS)),
class_labels=CLASS_LABELS, title='Gold Class Labels')
# _plot_classified_band()
In [57]:
# predict
feat_bands, _ = build_feature_bands(load_reflectance_bands(test_files['scene'],
test_files['udm'],
test_files['metadata'],
window=window))
y_pred_test = clf.predict(to_X(feat_bands))
print(np.unique(y_pred_test))
In [58]:
# display prediction results
mask = get_mask(feat_bands)
gold_class_band = load_gold_class_band(test_files['gold'], class_labels=CLASS_LABELS, window=window)
gold_class_band.mask = mask
pred_band_test = classified_band_from_y(y_pred_test, mask)
plot_predicted_vs_truth(pred_band_test, gold_class_band, class_labels=CLASS_LABELS)
These results look very promising! Even though the classifier was trained on one Orthotile and tested on another Orthotile, the predicted classes on the test dataset look very similar to the classes specified in the gold dataset.
In [59]:
print(classification_report(to_y(gold_class_band),
y_pred_test,
target_names=['neither', 'corn', 'soybean']))
The classifier performance for classifying a pixel as neither, corn, or soybean, calculated on the train features, is an f1-score of 0.77.