Crop Type Classification: CART

Our aim in this notebook is to classify crop type using PlanetScope 4-band Orthotiles. The crop types of particular interest are corn and soybeans.

CART (Classification and Regression Trees) is a decision tree algorithm that has shown great promise for the classification of remotely sensed imagery. We will use this algorithm to classify pixels as corn, soybean, or neither.

In this notebook, we will focus on using the four PlanetScope bands, along with NDVI calculated from them, as the features fed into the CART algorithm. We will train on one PS Orthotile and validate on another PS Orthotile.

This notebook demonstrates the following:

  1. Loading and visualizing Orthotiles
  2. Loading, visualizing, and processing a UDM into a mask
  3. Preparing classification features from Orthotile bands
  4. Training a CART classifier
  5. Visualizing and quantifying the classification result

Outline

  1. Identify datasets (PS Orthotiles and gold standard dataset for train and test)
  2. Train classifier on one Orthotile
  3. Test classifier on another Orthotile

Introduction

Gold Standard Dataset

The USDA 2016 Crop Data Layer (CDL) provides a national crop type dataset. This dataset was built from Landsat 8, DMC Deimos-1, and UK-DMC 2 satellite imagery using supervised classification (decision trees) trained on ground truth from the Farm Service Agency (FSA) Common Land Unit (CLU) Program. Since this is a derived dataset, it isn't a ground truth dataset, but it is known to be quite accurate, so it can serve as our gold standard. The dataset is provided as a georegistered raster (GeoTIFF).

Since corn and soybeans are the primary crops grown in Iowa, we will focus our analysis on that state. The metadata provided with the Iowa 2016 CDL indicates that its accuracy is 96.4% and that corn (categorization code 1) and soybeans (categorization code 5) are indeed the primary crop types in the state.
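
As a quick orientation to the CDL raster, the categorization codes can be read and tallied directly with rasterio. A minimal sketch (not run in this notebook; the local path is hypothetical and assumes the Iowa 2016 CDL GeoTIFF has been downloaded):

In [ ]:
import numpy as np
import rasterio

# hypothetical local path to the Iowa 2016 CDL GeoTIFF
with rasterio.open('data/CDL_2016_19.tif') as src:
    cdl = src.read(1)

# corn is categorization code 1, soybeans is code 5
for code, name in [(1, 'corn'), (5, 'soybeans')]:
    print('{}: {:.1%} of pixels'.format(name, (cdl == code).mean()))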

Growth Season

Per Iowa 1997 crop statistics: corn planting ended June 3 and harvesting began Sept 17, while soybean planting ended June 17 and harvesting began Sept 21. Therefore, imagery should be limited to the period between June 3 and Sept 17, 2016.
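
For reference, this growing-season constraint maps directly onto a Planet Data API DateRangeFilter. A minimal sketch of the filter JSON (the surrounding search request and API call are omitted):

In [ ]:
# Data API filter limiting results to the 2016 growing season
date_filter = {
    'type': 'DateRangeFilter',
    'field_name': 'acquired',
    'config': {
        'gte': '2016-06-03T00:00:00Z',
        'lte': '2016-09-17T00:00:00Z'
    }
}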

Install Dependencies


In [1]:
from collections import namedtuple, OrderedDict
import os
import pathlib
import pickle
from subprocess import check_output, STDOUT, CalledProcessError
import tempfile
from xml.dom import minidom

import matplotlib
import matplotlib.colors as colors
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import rasterio
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline

Identify Datasets

Train Set

Download PS Orthotile and Supporting Files

Get a PS Orthotile that has majority coverage over the grid, is from an SSO satellite, is free from clouds, and was taken between June 3 and Sept 17, 2016.

Using Planet Explorer, 210879_1558814_2016-07-25_0e16 was identified as a good candidate.


In [2]:
train_scene_id = '210879_1558814_2016-07-25_0e16'

In [3]:
# uncomment below to learn more about PSOrthotile 210879_1558814_2016-07-25_0e16
# !planet data search --item-type PSOrthoTile --string-in id $train_scene_id

In [4]:
# define and, if necessary, create cart data directory
train_folder = os.path.join('data', 'cart', '210879_1558814_2016-07-25_0e16')
pathlib.Path(train_folder).mkdir(parents=True, exist_ok=True)

# define data file filenames and ensure they exist
train_files = {
    'scene': os.path.join(train_folder, '210879_1558814_2016-07-25_0e16_BGRN_Analytic.tif'),
    'metadata': os.path.join(train_folder, '210879_1558814_2016-07-25_0e16_BGRN_Analytic_metadata.xml'),
    'udm': os.path.join(train_folder, '210879_1558814_2016-07-25_0e16_BGRN_DN_udm.tif'),
}

In [5]:
# First test if scene file exists, if not, use the Planet commandline tool to download the image, metadata, and udm.
# This command assumes a bash shell, available in Unix-based operating systems.
train_scene = train_files['scene']
!test -f $train_scene || \
    planet data download \
        --item-type PSOrthoTile \
        --dest $train_folder \
        --asset-type analytic,analytic_xml,udm \
        --string-in id $train_scene_id

In [6]:
for filename in train_files.values():
    print(filename)
    assert os.path.isfile(filename)


data/cart/210879_1558814_2016-07-25_0e16/210879_1558814_2016-07-25_0e16_BGRN_Analytic.tif
data/cart/210879_1558814_2016-07-25_0e16/210879_1558814_2016-07-25_0e16_BGRN_Analytic_metadata.xml
data/cart/210879_1558814_2016-07-25_0e16/210879_1558814_2016-07-25_0e16_BGRN_DN_udm.tif

Identify Gold Dataset

The CDL image is used as the gold dataset. The train CDL image is prepared in the datasets-prepare-cdl notebook.


In [7]:
predata_folder = 'pre-data'

In [8]:
train_CDL_filename = os.path.join(predata_folder, 'CDL_2016_19_train.tif')
assert os.path.isfile(train_CDL_filename)
train_files['gold'] = train_CDL_filename

Test Set

Download PS Orthotile and Supporting Files

Get another PS Orthotile that has majority coverage over the grid, is from an SSO satellite, is free from clouds, and was taken between June 3 and Sept 17, 2016.

Using Planet Explorer, 210863_1559015_2016-07-25_0e0f was identified as a good candidate.


In [9]:
test_scene_id = '210863_1559015_2016-07-25_0e0f'

In [10]:
# uncomment below to learn more about the test PSOrthotile
# !planet data search --item-type PSOrthoTile --string-in id $test_scene_id

In [11]:
# define and, if necessary, create cart data directory
test_folder = os.path.join('data', 'cart', '210863_1559015_2016-07-25_0e0f')
pathlib.Path(test_folder).mkdir(parents=True, exist_ok=True)

# define data file filenames and ensure they exist
test_files = {
    'scene': os.path.join(test_folder, '210863_1559015_2016-07-25_0e0f_BGRN_Analytic.tif'),
    'metadata': os.path.join(test_folder, '210863_1559015_2016-07-25_0e0f_BGRN_Analytic_metadata.xml'),
    'udm': os.path.join(test_folder, '210863_1559015_2016-07-25_0e0f_BGRN_DN_udm.tif'),
}

In [12]:
# First test if scene file exists, if not, use the Planet commandline tool to download the image, metadata, and udm.
# This command assumes a bash shell, available in Unix-based operating systems.
test_scene = test_files['scene']
!test -f $test_scene || \
    planet data download \
        --item-type PSOrthoTile \
        --dest $test_folder \
        --asset-type analytic,analytic_xml,udm \
        --string-in id $test_scene_id

In [13]:
for filename in test_files.values():
    print(filename)
    assert os.path.isfile(filename)


data/cart/210863_1559015_2016-07-25_0e0f/210863_1559015_2016-07-25_0e0f_BGRN_Analytic.tif
data/cart/210863_1559015_2016-07-25_0e0f/210863_1559015_2016-07-25_0e0f_BGRN_Analytic_metadata.xml
data/cart/210863_1559015_2016-07-25_0e0f/210863_1559015_2016-07-25_0e0f_BGRN_DN_udm.tif

Identify Gold Dataset

The CDL image is used as the gold dataset. The test CDL image is prepared in the datasets-prepare-cdl notebook.


In [56]:
test_CDL_filename = os.path.join(predata_folder, 'CDL_2016_19_test.tif')
assert os.path.isfile(test_CDL_filename)
test_files['gold'] = test_CDL_filename

Train Classifier

PS Orthotile to features

In this section we generate classification features from the 4 bands of the PS Orthotile and NDVI calculated from the Orthotile.


In [15]:
# Utility function for working with large images and memory limitations

def _read_window(filename, window):
    with rasterio.open(filename, 'r') as src:
        return src.read(window=window)

def decimated(arry, num=8):
    # keep every num-th pixel in each dimension to cut memory use for display
    return arry[::num, ::num].copy()
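
For example, a windowed read pulls just a sub-region of the scene into memory; rasterio accepts a ((row_start, row_stop), (col_start, col_stop)) tuple in pixel coordinates. A sketch:

In [ ]:
# read a 1000x1000 pixel window of the training scene
window = ((500, 1500), (500, 1500))
bands = _read_window(train_files['scene'], window)
print(bands.shape)  # expected: (4, 1000, 1000)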

Convert UDM to mask

We convert the UDM file to a boolean mask that can be used to generate masked band arrays.


In [16]:
# Utility functions: visualizing a classified band

def plot_classified_band(class_band, class_labels=None, cmap='rainbow',
                         title='Class Labels', figdim=10):
    fig = plt.figure(figsize=(figdim, figdim))
    ax = fig.add_subplot(1, 1, 1)
    imshow_class_band(ax, class_band, class_labels, cmap=cmap)
    ax.set_title(title)
    ax.set_axis_off()

def imshow_class_band(ax, class_band, class_labels=None, cmap='rainbow'):
    """Show classified band with colormap normalization and color legend. Alters ax in place.
    
    possible cmaps ref: https://matplotlib.org/examples/color/colormaps_reference.html
    """
    class_norm = _ClassNormalize(class_band)
    im = ax.imshow(class_band, cmap=cmap, norm=class_norm)

    try:
        # add class label legend
        # https://stackoverflow.com/questions/25482876
        # /how-to-add-legend-to-imshow-in-matplotlib
        color_mapping = class_norm.mapping
        colors = [im.cmap(color_mapping[k]) for k in class_labels.keys()]
        labels = class_labels.values()

        # https://matplotlib.org/users/legend_guide.html
        # tag: #creating-artists-specifically-for-adding-to-the-legend-aka-proxy-artists
        patches = [mpatches.Patch(color=c, label=l) for c,l in zip(colors, labels)]

        ax.legend(handles=patches, bbox_to_anchor=(1, 1), loc='upper right', borderaxespad=0.)
    except AttributeError:
        # class_labels not specified
        pass

# https://matplotlib.org/users/colormapnorms.html#custom-normalization-two-linear-ranges
class _ClassNormalize(colors.Normalize):
    """Matplotlib colormap normalizer for a classified band."""
    def __init__(self, arry):
        # get unique unmasked values
        values = [v for v in np.unique(arry)
                  if not isinstance(v, np.ma.core.MaskedConstant)]

        color_ticks = np.array(range(len(values)), dtype=float) / (len(values) - 1)
        self._mapping = dict((v, ct)
                            for v, ct in zip(values, color_ticks))
        
        # Initialize base Normalize instance
        vmin = 0
        vmax = 1
        clip = False
        colors.Normalize.__init__(self, vmin, vmax, clip)
    
    def __call__(self, arry, clip=None):
        # round array back to ints for logical comparison
        arry = np.around(arry)
        new_arry = arry.copy()
        for k, v in self._mapping.items():
            new_arry[arry==k] = v
        return new_arry
    
    @property
    def mapping(self):
        return self._mapping

In [17]:
def load_udm(udm_filename, window=None):
    '''Load bit-encoded UDM as an array.
    
    The description for the information encoded in the bits in the UDM is from:
    https://www.planet.com/docs/spec-sheets/sat-imagery/
    section 7.2
    
    Bit 0: blackfill
    Bit 1: cloud covered
    Bit 2: missing or suspect data in Blue band
    Bit 3: missing or suspect data in Green band
    Bit 4: missing or suspect data in Red band
    Bit 6: missing or suspect data in NIR band

    Pixels with no issues have all bits set to 0, therefore their values are zero.    
    '''    
    return _read_window(udm_filename, window)[0,...]

def get_udm_labels(udm):
    return OrderedDict((v, '{0:b}'.format(v)) for v in np.unique(udm))

        
def _test():
    udm = load_udm(train_files['udm'])
    udm_labels = get_udm_labels(udm)
    plot_classified_band(decimated(udm), class_labels=udm_labels, title='UDM Mask')
_test()


It looks like there are quite a few regions (they look like crop fields) where there is an issue in the NIR band; this is likely due to saturation in that band.
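
To quantify this, the NIR bit can be isolated with a shift and mask; a sketch using the bit assignments documented in load_udm above:

In [ ]:
# fraction of pixels flagging missing or suspect NIR data (bit 6)
udm = load_udm(train_files['udm'])
nir_suspect = (udm >> 6) & 1
print('{:.1%} of pixels flag NIR issues'.format(nir_suspect.mean()))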

The next step in loading the UDM is to convert it to a Boolean mask for application to the image bands.


In [18]:
def udm_to_mask(udm_array):
    '''Mask all pixels with any issues.'''
    return udm_array != 0

def _test():
    mask = udm_to_mask(load_udm(train_files['udm']))
    print('{}/{} ({:0.0f}%) masked'.format(mask.sum(), mask.size,
                                           (100.0 * mask.sum())/mask.size))
_test()


14525797/64000000 (23%) masked
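
udm_to_mask is deliberately conservative: any set bit masks the pixel. If only blackfill and cloud cover should be masked (tolerating the per-band quality flags), a variant could test just bits 0 and 1. A sketch:

In [ ]:
# mask only blackfill (bit 0) and cloud-covered (bit 1) pixels
def udm_to_cloud_mask(udm_array):
    return (udm_array & 0b11) != 0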

Load and Mask Bands

We load the orthotile bands (optionally applying a window) and mask them using the udm mask.


In [19]:
NamedBands = namedtuple('NamedBands', 'b, g, r, nir')

def load_masked_bands(scene_filename, udm_filename, window=None):
    """Loads a 4-band BGRNir Planet Image file as a list of masked bands.
    
    The masked bands share the same mask, so editing one band mask will
    edit them all.
    """
    bands = load_bands(scene_filename, window=window)
    mask = udm_to_mask(load_udm(udm_filename, window=window))
    return NamedBands(*[np.ma.array(b, mask=mask)
                        for b in bands])

def load_bands(filename, window=None):      
    b, g, r, nir = _read_window(filename, window=window)
    return NamedBands(b=b, g=g, r=r, nir=nir)

def get_mask(bands):
    return bands[0].mask


def check_mask(bands):
    band_mask = get_mask(bands)
    return '{}/{} ({:0.0f}%) masked'.format(band_mask.sum(), band_mask.size,
                                            (100.0 * band_mask.sum())/band_mask.size)


def _test():
#     window = ((500,1500),(500,1500))
    window = None
    print(check_mask(load_masked_bands(train_files['scene'],
                                       train_files['udm'],
                                       window=window)))
_test()


14525797/64000000 (23%) masked

Convert to Reflectance

PS Orthotiles provide top-of-atmosphere radiance values for each band. We prefer to work with top-of-atmosphere reflectance, which is more consistent across time, so we convert each band to reflectance using the coefficients provided in the Orthotile metadata.


In [20]:
# Utility functions: converting an image to reflectance

NamedCoefficients = namedtuple('NamedCoefficients', 'b, g, r, nir')

def read_refl_coefficients(metadata_filename):
    # https://github.com/planetlabs/notebooks/blob/master/jupyter-notebooks/ndvi/ndvi_planetscope.ipynb
    xmldoc = minidom.parse(metadata_filename)
    nodes = xmldoc.getElementsByTagName("ps:bandSpecificMetadata")

    # XML parser refers to bands by numbers 1-4
    coeffs = {}
    for node in nodes:
        bn = node.getElementsByTagName("ps:bandNumber")[0].firstChild.data
        if bn in ['1', '2', '3', '4']:
            i = int(bn)
            value = node.getElementsByTagName("ps:reflectanceCoefficient")[0].firstChild.data
            coeffs[i] = float(value)
    return NamedCoefficients(b=coeffs[1],
                             g=coeffs[2],
                             r=coeffs[3],
                             nir=coeffs[4])

def _apply_coeffs(bands, coeffs):
    return NamedBands(*[b.astype(float)*c for b,c in zip(bands, coeffs)])


def _test():
    print(read_refl_coefficients(train_files['metadata']))
_test()


NamedCoefficients(b=2.04606999312e-05, g=2.16563842624e-05, r=2.41577463407e-05, nir=3.6441364923e-05)

In [21]:
def load_reflectance_bands(scene_file, udm_file, metadata_file, window=None):
    return _to_reflectance(load_masked_bands(scene_file, udm_file, window=window),
                                metadata_file)

def _to_reflectance(bands, metadata_filename):
    coeffs = read_refl_coefficients(metadata_filename)
    return _apply_coeffs(bands, coeffs)

def _test():
#     window = ((500,1500),(500,1500))
    window = None
    bands = load_reflectance_bands(train_files['scene'],
                                   train_files['udm'],
                                   train_files['metadata'],
                                   window=window)
    print([b.mean() for b in bands])
_test()


[0.08636558050310784, 0.08184232668859674, 0.06041713864135735, 0.40178423607165453]

Display Image

To ensure everything loaded correctly, we display the RGB and NIR bands of the image with the mask converted to an alpha channel.


In [22]:
# Utility functions: displaying an image

def _linear_scale(ndarray, old_min, old_max, new_min, new_max):
    """Linear scale from old_min to new_min, old_max to new_max.
    
    Values below min/max are allowed in input and output.
    Min/Max values are two data points that are used in the linear scaling.
    """
    #https://en.wikipedia.org/wiki/Normalization_(image_processing)
    return (ndarray - old_min) * (new_max - new_min) / (old_max - old_min) + new_min
# print(linear_scale(np.array([1,2,10,100,256,2560, 2660]), 2, 2560, 0, 256))

def _mask_to_alpha(bands):
    band = np.atleast_3d(bands[0])
    alpha = np.zeros_like(band)
    alpha[~band.mask] = 1
    return alpha

def _add_alpha_mask(bands):
    return np.dstack([bands, _mask_to_alpha(bands)])

def scale_bands(band_stack):
    # scale using the full value range of the stack
    old_min = band_stack.min()
    old_max = band_stack.max()

    new_min = 0
    new_max = 1
    scaled = [np.clip(_linear_scale(b.astype(float),
                                    old_min, old_max,
                                    new_min, new_max),
                      new_min, new_max)
              for b in band_stack]
    return scaled

def bands_min(bands):
#     old_min = min([b.min() for b in bands])
    old_min = np.percentile(np.concatenate([b.compressed() for b in bands]), 2)
    return old_min

def bands_max(bands):
#     old_max = max([b.max() for b in bands])
    old_max = np.percentile(np.concatenate([b.compressed() for b in bands]), 98)
    return old_max


def bands_to_display(bands, alpha=False):
    """Converts a list of 3 bands to a 3-band rgb, normalized array for display."""  
    assert len(bands) == 3

    old_min = bands_min(bands)
    old_max = bands_max(bands)
#     print('old min: {}, old max: {}'.format(old_min, old_max))

    new_min = 0
    new_max = 1
    scaled = [np.clip(_linear_scale(b.astype(float),
                                    old_min, old_max,
                                    new_min, new_max),
                      new_min, new_max)
              for b in bands]

    filled = [b.filled(fill_value=new_min) for b in scaled]
    if alpha:
        filled.append(_mask_to_alpha(scaled))

    return np.dstack(filled)

In [23]:
def show_bands(bands):
    fig, axes = plt.subplots(nrows=1, ncols=2,
                                   sharex=True, sharey=True,
                                   figsize=(20,10))
    for ax in axes.ravel():
        ax.set_adjustable('box')
        ax.set_axis_off()

    ax2, ax4 = axes
    ax2.imshow(decimated(bands_to_display([bands.r, bands.g, bands.b], alpha=True)))
    ax2.set_title('Reflectance RGB Bands, Alpha from Mask')

    ax4.imshow(decimated(bands_to_display(3*[bands.nir], alpha=True)))
    ax4.set_title('Reflectance NIR Band, Alpha from Mask')
    plt.tight_layout()

def _test():
    window = None
    refl_bands = load_reflectance_bands(train_files['scene'],
                                        train_files['udm'],
                                        train_files['metadata'],
                                        window=window)
    print(check_mask(refl_bands))
    for b in refl_bands:
        print(b.min(), b.max())
    show_bands(refl_bands)
_test()


14525797/64000000 (23%) masked
0.0674793883730976 0.43000206975409916
0.0578442023648704 0.4212383302879424
0.0349079434623115 0.4346461721618744
0.066104635970322 0.586778857990146

Features from Image

We calculate the classification features from the scene and convert them to an array for feeding into the CART classifier.


In [24]:
def calculate_ndvi(bands):
    # NDVI = (NIR - Red) / (NIR + Red)
    ndvi = (bands.nir.astype(float) - bands.r) / (bands.nir + bands.r)
    return ndvi


def build_feature_bands(bands):
    """Prepare bands representing pixel features and provide feature names.
    
    Takes as input NamedBands
    Returns (1) tuple of bands representing features and (2) feature names
    """
    # not copying these bands to minimize memory footprints
    features = (bands.b, bands.g, bands.r, bands.nir, calculate_ndvi(bands))
    feature_names = ('Blue', 'Green', 'Red', 'NIR', 'NDVI')
    return features, feature_names


def display_feature_bands(bands, names):
    # for this notebook, we know there are 5 features and we will use that
    # knowledge to side-step some logic in organizing subplots
    assert len(bands) == 5 
    
    fig, subplot_axes = plt.subplots(nrows=3, ncols=2,
                                     sharex=True, sharey=True,
                                     figsize=(10,10))
    axes = subplot_axes.flat[:-1]
    delaxis = subplot_axes.flat[-1]
    fig.delaxes(delaxis)
    for ax, band, name in zip(axes, bands, names):
        ax.set_adjustable('box')
        ax.axis('off')

        pcm = ax.imshow(band)
        ax.set_title(name)
        fig.colorbar(pcm, ax=ax,
                     pad=0.05, shrink=0.9)

    plt.tight_layout()

def _test():
#     window = ((500,1500),(500,1500))
    window = None
    refl_bands = load_reflectance_bands(train_files['scene'],
                                        train_files['udm'],
                                        train_files['metadata'],
                                        window=None)
    feat_bands, feat_names = build_feature_bands(refl_bands)
    display_feature_bands(feat_bands, feat_names)
_test()



From the scales on the colorbars, we see that the feature scales are not all the same: the spectral band features max out around 0.35, while the NDVI feature maxes out around 0.9. This is not a problem for CART; decision trees do not require that features be normalized before training.
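
The ranges can also be checked numerically rather than read off the colorbars. A sketch, assuming feat_bands and feat_names from build_feature_bands are in scope (as inside _test above):

In [ ]:
# print the value range of each feature band
for name, band in zip(feat_names, feat_bands):
    print('{}: min={:.3f}, max={:.3f}'.format(name, band.min(), band.max()))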


In [25]:
def to_X(feature_bands):
    """Convert feature_bands (tuple of bands) to 2d array for working with classifier.
    """
    return np.stack([f.compressed() for f in feature_bands], # exclude masked pixels
                     axis=1)


def _test():
    refl_bands = load_reflectance_bands(train_files['scene'],
                                        train_files['udm'],
                                        train_files['metadata'],
                                        window=((500,1500),(500,1500)))
    feat_bands, feat_names = build_feature_bands(refl_bands)
    
    X = to_X(feat_bands)
    print(X.shape)
_test()


(754234, 5)

Gold Standard Dataset to Labels

Load and Visualize Gold Classified Band

Filter to corn and soybean classes and visualize the classified band.


In [26]:
# class labels from metadata
CLASS_LABELS = {1: 'corn', 5: 'soybeans'}

In [27]:
def load_gold_class_band(gold_filename, class_labels=None, window=None, fill_value=0):
    gold_class_band = _read_window(gold_filename, window)[0,...]
    
    try:
        # mask pixels with a value not in class_labels
        masks = [gold_class_band == val for val in class_labels.keys()]
        mask = np.any(np.dstack(masks), axis=2)
        mask = ~mask
        
        masked_band = np.ma.array(np.ma.array(gold_class_band, mask=mask).filled(fill_value=fill_value),
                                  mask=mask)
    except AttributeError:
        # mask nothing
        null_mask = np.zeros(gold_class_band.shape, dtype=bool)
        masked_band = np.ma.array(gold_class_band, mask=null_mask)

    return masked_band

def _test():
    print(np.unique(load_gold_class_band(train_files['gold'])))
    print(np.unique(load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS)))
# _test()

In [28]:
def _plot():
    plot_classified_band(load_gold_class_band(train_files['gold'],
                                              class_labels=CLASS_LABELS,
                                              window=((500,1500),(500,1500))),
                         class_labels=CLASS_LABELS, title='Gold Class Labels, Windowed', figdim=5)

    plot_classified_band(decimated(load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS)),
                         class_labels=CLASS_LABELS, title='Gold Class Labels')
# _plot()

In [29]:
def to_y(classified_band):
    return classified_band.compressed()

def _test():
    print(to_y(load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS)).shape)
_test()


(49792890,)

Train the Classifier

We train the classifier on all of the unmasked pixels from the Orthotile, using the gold label data generated above. The goal is to run through training and prediction with the classifier and end up with a classified band that matches the original in shape/mask and shows the classes labeled above.

To speed up processing, by default the classifier is loaded from cache.


In [30]:
# window = ((500,1500),(500,1500))
window = None

In [31]:
refl_bands = load_reflectance_bands(train_files['scene'],
                                    train_files['udm'],
                                    train_files['metadata'],
                                    window=window)
feat_bands, _ = build_feature_bands(refl_bands)
X = to_X(feat_bands)
print(X.shape)


(49474203, 5)

In [32]:
gold_class_band = load_gold_class_band(train_files['gold'], class_labels=CLASS_LABELS, window=window)
gold_class_band.mask = get_mask(feat_bands)
y = to_y(gold_class_band)
print(y.shape)


(49474203,)
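
Since the gold band mask was replaced with the feature mask, the features and labels should now correspond pixel-for-pixel; a cheap sanity check:

In [ ]:
# features and labels must be aligned one-to-one
assert X.shape[0] == y.shape[0]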

Unconstrained decision trees overfit the training data, so we must constrain the tree by setting a hyperparameter. For now, we blindly set the max_depth hyperparameter; ideally, cross-validation would be used to identify the best hyperparameter values, as in the sketch below.
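
A minimal sketch of that cross-validation (not run here, to keep memory and runtime down): grid-search max_depth on a random subsample of the training pixels.

In [ ]:
# pick max_depth by 3-fold cross-validation on a 100k-pixel subsample
from sklearn.model_selection import GridSearchCV

sample = np.random.choice(X.shape[0], size=100000, replace=False)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [3, 5, 10, 20]},
                      cv=3)
search.fit(X[sample], y[sample])
print(search.best_params_)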


In [44]:
clf_cache_file = os.path.join('pre-data', 'classify-cart-clf.sav')

load_clf = True
if load_clf:
    assert os.path.isfile(clf_cache_file)
    assert X.shape == (49474203, 5)
    assert y.shape == (49474203,)
    clf = pickle.load(open(clf_cache_file, 'rb'))
else:
    clf = DecisionTreeClassifier(random_state=0, max_depth=5)
    clf.fit(X, y)

In [45]:
# optionally save ps clf model
save_clf = False
if save_clf:
    pickle.dump(clf, open(clf_cache_file, 'wb'))
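
With a fitted tree in hand, it is also worth peeking at which features drive the splits; scikit-learn exposes this as feature_importances_. A sketch:

In [ ]:
# relative importance of each feature in the fitted tree
for name, importance in zip(('Blue', 'Green', 'Red', 'NIR', 'NDVI'),
                            clf.feature_importances_):
    print('{}: {:.3f}'.format(name, importance))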

Check which classes are predicted on the training features.


In [46]:
y_pred = clf.predict(X)
print(np.unique(y_pred))


[0 1 5]

In [47]:
def classified_band_from_y(y, band_mask):
    class_band = np.ma.array(np.zeros(band_mask.shape),
                             mask=band_mask.copy())
    class_band[~class_band.mask] = y
    return class_band

pred_band = classified_band_from_y(y_pred, get_mask(refl_bands))

In [48]:
def plot_predicted_vs_truth(predicted_class_band, gold_class_band, class_labels=None, figsize=(15,15)):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
                                   sharex=True, sharey=True,
                                   figsize=figsize)
    for ax in (ax1, ax2):
        ax.set_adjustable('box')

    imshow_class_band(ax1, decimated(predicted_class_band), class_labels=class_labels)
    ax1.set_title('Classifier Predicted Classes')

    imshow_class_band(ax2, decimated(gold_class_band), class_labels=class_labels)
    ax2.set_title('Gold Dataset Classes')
    plt.tight_layout()
    
plot_predicted_vs_truth(pred_band, gold_class_band, class_labels=CLASS_LABELS)



In [49]:
# memory clean up
del refl_bands
del feat_bands
del X
del y
del y_pred

Alright! This looks promising. The predicted classes look like a higher-resolution version of the gold dataset classes. This makes sense: the gold dataset image was upsampled from 30m to 3m (the PS Orthotile pixel size) using nearest-neighbor resampling, so it does not have the detail to match a prediction performed at the PS Orthotile resolution.

Test Classifier

Visualize Datasets


In [53]:
# Uncomment the function calls to visualize the test dataset
# Due to memory limitations, all visualizations are not enabled by default

def _load_udm():
    udm = load_udm(test_files['udm'])
    udm_labels = get_udm_labels(udm)
    plot_classified_band(decimated(udm), class_labels=udm_labels, title='UDM Mask')
# _load_udm()

def _show_bands():
    window = None
    files = test_files
    refl_bands = load_reflectance_bands(files['scene'],
                                        files['udm'],
                                        files['metadata'],
                                        window=window)
    print(check_mask(refl_bands))
    for b in refl_bands:
        print(b.min(), b.max())
    show_bands(refl_bands)
_show_bands()

def _plot_classified_band():
    plot_classified_band(load_gold_class_band(test_files['gold'],
                                              class_labels=CLASS_LABELS,
                                              window=((500,1500),(500,1500))),
                         class_labels=CLASS_LABELS, title='Gold Class Labels, Windowed', figdim=5)

    plot_classified_band(decimated(load_gold_class_band(test_files['gold'], class_labels=CLASS_LABELS)),
                         class_labels=CLASS_LABELS, title='Gold Class Labels')
# _plot_classified_band()


16860574/64000000 (26%) masked
0.062762828474394 0.43557991100184346
0.0591686391194315 0.3949336383853625
0.036405760420247804 0.40804995069135663
0.10115575917608359 0.590195513595124

Test Classifier


In [57]:
# predict
feat_bands, _ = build_feature_bands(load_reflectance_bands(test_files['scene'],
                                                        test_files['udm'],
                                                        test_files['metadata'],
                                                        window=window))
y_pred_test = clf.predict(to_X(feat_bands))
print(np.unique(y_pred_test))


[0 1 5]

In [58]:
# display prediction results
mask = get_mask(feat_bands)

gold_class_band = load_gold_class_band(test_files['gold'], class_labels=CLASS_LABELS, window=window)
gold_class_band.mask = mask

pred_band_test = classified_band_from_y(y_pred_test, mask)
plot_predicted_vs_truth(pred_band_test, gold_class_band, class_labels=CLASS_LABELS)



These results look very promising! Even though the classifier was trained on one Orthotile and tested on another, the predicted classes on the test dataset look very similar to the classes specified in the gold dataset.

Classification Report

Let's calculate the classification accuracy metrics (precision, recall, f1-score).


In [59]:
print(classification_report(to_y(gold_class_band),
                            y_pred_test,
                            target_names=['neither', 'corn', 'soybean']))


              precision    recall  f1-score   support

     neither       0.68      0.78      0.73  11958614
        corn       0.82      0.91      0.86  26776731
     soybean       0.92      0.40      0.56   8404081

   micro avg       0.79      0.79      0.79  47139426
   macro avg       0.81      0.70      0.72  47139426
weighted avg       0.80      0.79      0.77  47139426

The classifier performance for classifying a pixel as neither, corn, or soybean, calculated on the test features, is a weighted-average f1-score of 0.77.
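
For a per-class breakdown of the errors (e.g. how many soybean pixels were predicted as corn), a confusion matrix is a natural follow-up. A minimal sketch:

In [ ]:
# rows are gold classes, columns are predicted classes, ordered (0, 1, 5)
from sklearn.metrics import confusion_matrix

print(confusion_matrix(to_y(gold_class_band), y_pred_test, labels=[0, 1, 5]))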