Unsupervised Spectral Classification - Endmember Extraction

with PySpTools

This notebook runs through an example of spectral unmixing to carry out unsupervised classification of a SERC hyperspectral data file using the PySpTools package to carry out endmember extraction, plot abundance maps of the spectral endmembers, and use Spectral Angle Mapping and Spectral Information Divergence to classify the SERC tile.

Since spectral data is so large in size, it is often useful to remove any unncessary or redundant data in order to save computational time.

In this tutorial, we use the following user-defined functions:

  • read_neon_reflh5: function to read in NEON AOP Hyperspectral Data file (in hdf5 format)
  • clean_neon_refl_data: function to clean NEON hyperspectral data, including applying the data ignore value and reflectance scale factor, and removing water vapor bands
  • plot_aop_refl: function to plot a band of NEON hyperspectral data for reference

https://pysptools.sourceforge.io/index.html

Dependencies & Installation:

To run this notebook, you the following python packages need to be installed. You can install from the notebook following the commands below, or you can install required packages from command line after activating the p35 environment.

PySpTools: Download pysptools-0.14.2.tar.gz from https://pypi.python.org/pypi/pysptools

import sys
!{sys.executable} -m pip install spectral
!{sys.executable} -m pip install scikit-image
!{sys.executable} -m pip install "C:\Users\bhass\Downloads\pysptools-0.14.2.tar.gz
!conda install --yes --prefix {sys.prefix} h5py
!conda install --yes --prefix {sys.prefix} scikit-learn
!conda install --yes --prefix {sys.prefix} cvxopt

See Also: https://stackoverflow.com/questions/45156080/installing-modules-to-anaconda-from-tar-gz


In [9]:
import h5py, os, osr, copy, time
import matplotlib.pyplot as plt
import numpy as np
import pysptools.util as util
import pysptools.eea as eea #endmembers extraction algorithms
import pysptools.abundance_maps as amap
import pysptools.classification as cls
import pysptools.material_count as cnt

from skimage import exposure
from ipywidgets import *

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
def read_neon_reflh5(refl_filename):
    """h5refl2array reads in a NEON AOP reflectance hdf5 file and returns 
    reflectance array, and header containing metadata in envi format.
    --------
    Parameters
        refl_filename -- full or relative path and name of reflectance hdf5 file
    --------
    Returns 
    --------
    reflArray:
        array of reflectance values
    header:
        dictionary containing the following metadata (all strings):
            bad_band_window1: min and max wavelenths of first water vapor window (tuple)
            bad_band_window2: min and max wavelenths of second water vapor window (tuple)
            bands: # of bands (float)
            coordinate system string: coordinate system information (string)
            data ignore value: value corresponding to no data (float)
            interleave: 'BSQ' (string)
            reflectance scale factor: factor by which reflectance is scaled (float)
            wavelength: wavelength values (float)
            wavelength unit: 'm' (string)
                
            spatial extent meters: 
    --------
    This function applies to the NEON hdf5 format implemented in 2016, and should be used for
    data acquired during and after 2016 as of July 2018. Data in earlier NEON hdf5 format 
    (collected prior to 2016) is expected to be re-processed after the 2018 flight season. 
    --------
    Example Execution:
    --------
    sercRefl, sercRefl_header = h5refl2array('NEON_D02_SERC_DP1_20160807_160559_reflectance.h5') """
    
    #Read in reflectance hdf5 file (include full or relative path if data is located in a different directory)
    hdf5_file = h5py.File(refl_filename,'r')

    #Get the site name
    file_attrs_string = str(list(hdf5_file.items()))
    file_attrs_string_split = file_attrs_string.split("'")
    sitename = file_attrs_string_split[1]
    
    #Extract the reflectance & wavelength datasets
    refl = hdf5_file[sitename]['Reflectance']
    reflData = refl['Reflectance_Data']
    reflArray = refl['Reflectance_Data'].value
    
    #Create dictionary containing relevant metadata information
    header = {}
    header['map info'] = refl['Metadata']['Coordinate_System']['Map_Info'].value
    header['wavelength'] = refl['Metadata']['Spectral_Data']['Wavelength'].value

    #Extract no data value & set no data value to NaN
    header['data ignore value'] = float(reflData.attrs['Data_Ignore_Value'])
    header['reflectance scale factor'] = float(reflData.attrs['Scale_Factor'])
    header['interleave'] = reflData.attrs['Interleave']
    
    #Extract spatial extent from attributes
    header['spatial extent'] = reflData.attrs['Spatial_Extent_meters']
    
    #Extract bad band windows
    header['bad_band_window1'] = (refl.attrs['Band_Window_1_Nanometers'])
    header['bad_band_window2'] = (refl.attrs['Band_Window_2_Nanometers'])
    
    #Extract projection information
    header['projection'] = refl['Metadata']['Coordinate_System']['Proj4'].value
    header['epsg'] = int(refl['Metadata']['Coordinate_System']['EPSG Code'].value)
    
    #Extract map information: spatial extent & resolution (pixel size)
    mapInfo = refl['Metadata']['Coordinate_System']['Map_Info'].value
    
    hdf5_file.close        
    
    return reflArray, header

In [3]:
help(read_neon_reflh5)


Help on function read_neon_reflh5 in module __main__:

read_neon_reflh5(refl_filename)
    h5refl2array reads in a NEON AOP reflectance hdf5 file and returns 
    reflectance array, and header containing metadata in envi format.
    --------
    Parameters
        refl_filename -- full or relative path and name of reflectance hdf5 file
    --------
    Returns 
    --------
    reflArray:
        array of reflectance values
    header:
        dictionary containing the following metadata (all strings):
            bad_band_window1: min and max wavelenths of first water vapor window (tuple)
            bad_band_window2: min and max wavelenths of second water vapor window (tuple)
            bands: # of bands (float)
            coordinate system string: coordinate system information (string)
            data ignore value: value corresponding to no data (float)
            interleave: 'BSQ' (string)
            reflectance scale factor: factor by which reflectance is scaled (float)
            wavelength: wavelength values (float)
            wavelength unit: 'm' (string)
                
            spatial extent meters: 
    --------
    This function applies to the NEON hdf5 format implemented in 2016, and should be used for
    data acquired during and after 2016 as of July 2018. Data in earlier NEON hdf5 format 
    (collected prior to 2016) is expected to be re-processed after the 2018 flight season. 
    --------
    Example Execution:
    --------
    sercRefl, sercRefl_header = h5refl2array('NEON_D02_SERC_DP1_20160807_160559_reflectance.h5')


In [4]:
load_start_time = time.time()
data_path = os.getcwd()
h5refl_filename = 'NEON_D02_SERC_DP3_368000_4306000_reflectance.h5'
print('h5 file name:',h5refl_filename)
data,header = read_neon_reflh5(h5refl_filename)
print('Data took ', round(time.time() - load_start_time,1), ' seconds to load.')


h5 file name: NEON_D02_SERC_DP3_368000_4306000_reflectance.h5
Data took  7.7  seconds to load.

In [5]:
#Display information stored in header
for key in sorted(header.keys()):
  print(key)


bad_band_window1
bad_band_window2
data ignore value
epsg
interleave
map info
projection
reflectance scale factor
spatial extent
wavelength

In [6]:
def clean_neon_refl_data(data,header):
    """Clean h5 reflectance data and header
    1. set data ignore value (-9999) to NaN
    2. apply reflectance scale factor (10000)
    3. remove bad bands (water vapor bands + last 10 bands): 
        Band_Window_1_Nanometers = 1340,1445
        Band_Window_2_Nanometers = 1790,1955
    4. mask reflectance data > 1.0 ?? ##TO DO ?
    """
    
    #reflSubCleaned = reflArray[clipIndex['yMin']:clipIndex['yMax'],clipIndex['xMin']:clipIndex['xMax'],:].astype(np.float)
    #reflSubCleaned[reflSubCleaned==int(reflArray_metadata['noDataVal'])]=np.nan
    #reflSubCleaned = reflSubCleaned/reflArray_metadata['scaleFactor']
    
    data_clean = copy.copy(data)
    header_clean = copy.copy(header)
    #set data ignore value (-9999) to NaN:
    ## data_clean[data_clean==header['data ignore value']]=np.nan ??? get message ValueError: cannot convert float NaN to integer
    if header['data ignore value'] in data:
        print('data ignore values exist in data')
    #apply reflectance scale factor (divide by 10000)
    data_clean = data_clean/header['reflectance scale factor']
    #remove bad bands 
    #1. determine indices corresponding to min/max center wavelength for each bad band window:
    bb1_ind0 = np.max(np.where((np.asarray(header['wavelength'])<float(header['bad_band_window1'][0]))))
    bb1_ind1 = np.min(np.where((np.asarray(header['wavelength'])>float(header['bad_band_window1'][1]))))

    bb2_ind0 = np.max(np.where((np.asarray(header['wavelength'])<float(header['bad_band_window2'][0]))))
    bb2_ind1 = np.min(np.where((np.asarray(header['wavelength'])>float(header['bad_band_window2'][1]))))

    bb3_ind0 = len(header['wavelength'])-10
    
    #define valid band ranges from indices:
    vb1 = list(range(0,bb1_ind0)); #TO DO - don't hard code bands in, fi
    vb2 = list(range(bb1_ind1,bb2_ind0))
    vb3 = list(range(bb2_ind1,bb3_ind0))
    data_clean = data_clean[:,:,vb1+vb2+vb3]
    
    valid_band_range = [i for j in (range(0,bb1_ind0), range(bb1_ind1,bb2_ind0), range(bb2_ind1,bb3_ind0)) for i in j]
    header_clean['wavelength'] = [header['wavelength'][i] for i in valid_band_range]
    
    #Display wavelengths corresponding to bands removed:
    #print('Wavelengths Removed:')
    #print(header['wavelength'][191],'-',header['wavelength'][212],'nm')
    #print(header['wavelength'][281],'-',header['wavelength'][314],'nm')
    #print(header['wavelength'][416],'-',header['wavelength'][425],'nm')

    print('Bad Band Windows (nm):')
    print(header['bad_band_window1'])
    print(header['bad_band_window2'])
    
    #Display wavelengths corresponding to bands kept:
    print('\nCenter Wavelengths of Bands Kept:')
    print(header['wavelength'][0],'-',header['wavelength'][bb1_ind0],'nm')
    print(header['wavelength'][bb1_ind1],'-',header['wavelength'][bb2_ind0],'nm')
    print(header['wavelength'][bb2_ind1],'-',header['wavelength'][bb3_ind0],'nm')
    
    #header_clean['wavelength'] = header
    
    return data_clean, header_clean

In [7]:
print('Raw Data Dimensions:\n',data.shape)
#remove bad bands
clean_start_time = time.time()
data_clean,header_clean = clean_neon_refl_data(data,header)
print('\nData took', round(time.time() - clean_start_time,1), 'seconds to clean.')
print('Cleaned Data Dimensions:',data_clean.shape)


Raw Data Dimensions:
 (1000, 1000, 426)
Bad Band Windows (nm):
[1340 1445]
[1790 1955]

Center Wavelengths of Bands Kept:
383.534 - 1335.04 nm
1445.21 - 1785.75 nm
1956.02 - 2466.82 nm

Data took 49.3 seconds to clean.
Cleaned Data Dimensions: (1000, 1000, 360)

In [10]:
def plot_aop_refl(band_array,refl_extent,colorlimit=(0,1),ax=plt.gca(),title='',cbar ='on',cmap_title='',colormap='Greys'):
    '''read in and plot a single band or 3 stacked bands of a reflectance array
    --------
    Parameters
    --------
        band_array: ndarray
            Array of reflectance values, created from aop_h5refl2array
            If 'band_array' is a 2-D array, plots intensity of values
            If 'band_array' is a 3-D array (3 bands), plots RGB image, set cbar to 'off' and don't need to specify colormap 
        refl_extent: tuple
            Extent of reflectance data to be plotted (xMin, xMax, yMin, yMax) 
            Stored in metadata['spatial extent'] from aop_h5refl2array function
        colorlimit: tuple, optional
            Range of values to plot (min,max). 
            Look at the histogram of reflectance values before plotting to determine colorlimit.
        ax: axis handle, optional
            Axis to plot on; specify if making figure with subplots. Default is current axis, plt.gca()
        title: string, optional
            plot title 
        cbar: string, optional
            Use cbar = 'on' (default), shows colorbar; use if plotting 1-band array
            If cbar = 'off' (or not 'on'), does no
        cmap_title: string, optional
            colorbar title (eg. 'reflectance', etc.)
        colormap: string, optional
            Matplotlib colormap to plot 
            see https://matplotlib.org/examples/color/colormaps_reference.html

    Returns 
    --------
        plots flightline array of single band of reflectance data
    --------

    Examples:
    --------
    >>> plot_aop_refl(sercb56,
    sercMetadata['spatial extent'],
    colorlimit=(0,0.3),
    title='SERC Band 56 Reflectance',
    cmap_title='Reflectance',
    colormap='Greys_r')'''

    import matplotlib.pyplot as plt
    
    plot = plt.imshow(band_array,extent=refl_extent,clim=colorlimit); 
    if cbar == 'on':
        cbar = plt.colorbar(plot,aspect=40); plt.set_cmap(colormap); 
        cbar.set_label(cmap_title,rotation=90,labelpad=20)
    plt.title(title); ax = plt.gca(); 
    ax.ticklabel_format(useOffset=False, style='plain'); #do not use scientific notation for ticklabels
    rotatexlabels = plt.setp(ax.get_xticklabels(),rotation=90); #rotate x tick labels 90 degrees



In [12]:
plot_aop_refl(data_clean[:,:,0],header_clean['spatial extent'],(0,0.3))


Unsupervised Classification with Spectral Unmixing:

Endmember Extraction and Abundance Mapping

Spectral Unmixing allows pixels to be composed of fractions or abundances of each class.

Endmembers can be thought of as the basis spectra of an image. Once these endmember spectra are determined, the image cube can be 'unmixed' into the fractional abundance of each material in each pixel (Winter, 1999).

Spectral Angle Mapper (SAM): is a physically-based spectral classification that uses an n-D angle to match pixels to reference spectra. The algorithm determines the spectral similarity between two spectra by calculating the angle between the spectra and treating them as vectors in a space with dimensionality equal to the number of bands. This technique, when used on calibrated reflectance data, is relatively insensitive to illumination and albedo effects. Endmember spectra used by SAM are extracted from the NFINDR algorithm. SAM compares the angle between the endmember spectrum vector and each pixel vector in n-D space. Smaller angles represent closer matches to the reference spectrum. Pixels further away than the specified maximum angle threshold in radians are not classified.

http://www.harrisgeospatial.com/docs/SpectralAngleMapper.html

Spectral Information Divergence (SID): is a spectral classification method that uses a divergence measure to match pixels to reference spectra. The smaller the divergence, the more likely the pixels are similar. Pixels with a measurement greater than the specified maximum divergence threshold are not classified. Endmember spectra used by SID in this example are extracted from the NFINDR endmembor extraction algorithm.

http://www.harrisgeospatial.com/docs/SpectralInformationDivergence.html


In [13]:
#ee.display?
wavelength_float = [float(i) for i in header_clean['wavelength']]
ee_axes = {}
ee_axes['wavelength'] = wavelength_float
ee_axes['x']='Wavelength, nm'
ee_axes['y']='Reflectance'

In [14]:
#Endmember Extraction (Unmixing) - NFINDR Algorithm (Winter, 1999)
ee_start_time = time.time()
ee = eea.NFINDR()
U = ee.extract(data_clean,4,maxit=5,normalize=False,ATGP_init=True)
ee.display(axes=ee_axes,suffix='SERC')
print('Endmember extraction took ', round(time.time() - ee_start_time,1), ' seconds to run.')


Endmember extraction took  119.8  seconds to run.

In [15]:
#Abundance Maps
amap_start_time = time.time()
am = amap.FCLS()
amaps = am.map(data_clean,U,normalize=False)
am.display(colorMap='jet',columns=4,suffix='SERC')
print('Abundance maps took ', round(time.time() - amap_start_time,1), ' seconds to generate.')


Abundance maps took  1520.4  seconds to generate.
<matplotlib.figure.Figure at 0x126e0bdfb38>

In [16]:
#Print mean values of each abundance map to better estimate thresholds
print('Abundance Map Mean Values:')
print('EM1:',np.mean(amaps[:,:,0]))
print('EM2:',np.mean(amaps[:,:,1]))
print('EM3:',np.mean(amaps[:,:,2]))
print('EM4:',np.mean(amaps[:,:,3]))


Abundance Map Mean Values:
EM1: 0.591774
EM2: 0.00089542
EM3: 0.380964
EM4: 0.0263671

In [18]:
#Look at histogram of each abundance map to determine ballpark for thresholds to use
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(18,8))

ax1 = fig.add_subplot(2,4,1); plt.title('EM1')
amap1_hist = plt.hist(np.ndarray.flatten(amaps[:,:,0]),bins=50,range=[0,1.0]) 

ax2 = fig.add_subplot(2,4,2); plt.title('EM2')
amap1_hist = plt.hist(np.ndarray.flatten(amaps[:,:,1]),bins=50,range=[0,0.001]) 

ax3 = fig.add_subplot(2,4,3); plt.title('EM3')
amap1_hist = plt.hist(np.ndarray.flatten(amaps[:,:,2]),bins=50,range=[0,0.5]) 

ax4 = fig.add_subplot(2,4,4); plt.title('EM4')
amap1_hist = plt.hist(np.ndarray.flatten(amaps[:,:,3]),bins=50,range=[0,0.05])


Below we define a functions to compute and display SID


In [19]:
#Spectral Information Divergence
def SID(data,E,thrs=None):
    sid = cls.SID()
    cmap = sid.classify(data,E,threshold=thrs)
    sid.display(colorMap='tab20b',suffix='SERC')

Now we can call this function using the three endmembers (classes) that contain the most information:


In [20]:
U2 = U[[0,2,3],:]
SID(data_clean, U2, [0.8,0.3,0.03])


From this map we can see that SID did a pretty good job of identifying the water (dark blue), roads/buildings (orange), and vegetation (blue). We can compare it to the USA Topo Base map to see how well it did:


In [21]:
#Display USA Topo Base Map
#Reference: https://viewer.nationalmap.gov/
from IPython.core.display import Image
UStopo_filename = 'SERC_368000_4307000_UStopo.PNG'
UStopo_image = os.path.join(data_path,UStopo_filename)
print('Base Map')
Image(filename=UStopo_image,width=225,height=225)


Base Map
Out[21]:

Exercises

On your own, try the Spectral Angle Mapper. You will use similar syntax to the SID. and refer to SAM documentation on pysptools and the Pine Creek example.

https://pysptools.sourceforge.io/classification.html#spectral-angle-mapper-sam https://pysptools.sourceforge.io/examples_front.html#examples-using-the-ipython-notebook

Which algorithm does a better job with classification? On your own, play around with different settings (eg. use more / less endmembers, etc.)


In [ ]:
#Spectral Angle Mapper
def SAM(data,E,thrs=None):
    sam = cls.SAM()
    cmap = sam.classify(data,E,threshold=thrs)
    sam.display(colorMap='Paired',suffix='SERC')

What Next?

PySpTools has an alpha interface with the Python machine learning package skikit-learn. To apply more advanced machine learning techniques, you may wish to explore some of these algorithms.

https://pysptools.sourceforge.io/skl.html http://scikit-learn.org/stable/


In [ ]:


In [ ]:
def subset_clean_neon_refl_data(refl_data,refl_header,subset_band_inds):
    """Subset and clean h5 reflectance data and header
    1. subset bands 
    2. set data ignore value (-9999) to NaN
    3. apply reflectance scale factor (10000)
    """
    
    refl_subset = refl_data[:,:,subset_band_inds]
    refl_header_subset = copy.copy(refl_header)
    
    #set data ignore value (-9999) to NaN:
    
    if refl_header['data ignore value'] in refl_data:
        print('data ignore values exist in data')
        refl_data_subset[refl_data_subset==refl_header['data ignore value']]=np.nan
        
    #apply reflectance scale factor (divide by 10000)
    refl_subset_clean = refl_subset/refl_header['reflectance scale factor']
    
    #valid_band_range = [i for j in (range(0,bb1_ind0), range(bb1_ind1,bb2_ind0), range(bb2_ind1,bb3_ind0)) for i in j]
    refl_header_subset['wavelength'] = [refl_header['wavelength'][i] for i in subset_band_inds]
    refl_header_subset['fwhm'] = [refl_header['fwhm'][i] for i in subset_band_inds]
    
    return refl_subset_clean, refl_header_subset

In [ ]:
LS8_OLI_bandcenters = [442.96,482.04,561.41,654.59,864.67,1608.86,2200.73]
LS8_OLI_bandwidths = [15.98,60.04,57.33,37.47,28.25,84.72,186.66]