Library of Congress Cataloging-in-Publication Data
Names: Jia, Xiuping, author.
Title: Field guide to hyperspectral/multispectral image processing / Xiuping Jia.
Description: Bellingham, Washington : SPIE Press, [2022] | Includes bibliographical references and index.
Identifiers: LCCN 2022007877 | ISBN 9781510652149 (spiral bound) | ISBN 9781510652156 (pdf)
Subjects: LCSH: Hyperspectral imaging. | Multispectral imaging. | Image processing–Digital techniques. | Remote sensing.
Classification: LCC TR267.73 J53 2022 | DDC 771–dc23/eng/20220408
LC record available at https://lccn.loc.gov/2022007877
Published by
SPIE
P.O. Box 10
Bellingham, Washington 98227-0010 USA
Phone: +1 360.676.3290
Fax: +1 360.647.1445
Email: [email protected]
Web: http://spie.org

Copyright © 2022 Society of Photo-Optical Instrumentation Engineers (SPIE)

All rights reserved. No part of this publication may be reproduced or distributed in any form or by any means without written permission of the publisher.

The content of this book reflects the work and thought of the author. Every effort has been made to publish reliable and accurate information herein, but the publisher is not responsible for the validity of the information or for any outcomes resulting from reliance thereon.

Printed in the United States of America.
First printing 2022.
For updates to this book, visit http://spie.org and type "FG52" in the search field.
Introduction to the Series

2022 is a landmark year for the SPIE Field Guide series. It is our 19th year of publishing Field Guides, which now include more than 50 volumes, and we expect another four titles to publish this year. The series was conceived, created, and shaped by Professor John Greivenkamp from the University of Arizona. John came up with the idea of a 100-page handy reference guide for scientists and engineers. He wanted these books to be the type of reference that professionals would keep in their briefcases, on their lab bench, or even on their bedside table. The format of the series is unique: spiral-bound in a 5″ by 8″ format, the book lies flat on any page while you refer to it.

John was the author of the first volume, the seminal Field Guide to Geometrical Optics. This book has been an astounding success, with nearly 8000 copies sold and more than 72,000 downloads from the SPIE Digital Library. It continues to be one of the strongest-selling titles in the SPIE catalog, and it is the all-time best-selling book from SPIE Press. The subsequent several Field Guides were in key optical areas such as atmospheric optics, adaptive optics, lithography, and spectroscopy. As time went on, the series explored more specialized areas such as optomechanics, interferometry, and colorimetry. In 2019, John created a sub-series, the Field Guide to Physics, with volumes on solid state physics, quantum mechanics, and optoelectronics and photonics, and a fourth volume on electromagnetics to be published this year. All told, the series has generated more than $1.5 million in print sales and nearly 1 million downloads since eBooks were made available on the SPIE Digital Library in 2011.

John's impact on the profession through the Field Guide series is immense. Rival publishers speak to SPIE Press with envy over this golden nugget that we have, and this is all thanks to him. John was taken from us all too early, and to honor his contribution to the profession through this series, he is commemorated in the 2022 Field Guides.
We will miss John very much, but his legacy will go on for decades to come. Vale John Greivenkamp!

J. Scott Tyo
Series Editor, SPIE Field Guides
Melbourne, Australia, March 2022
Related Titles from SPIE Press

Keep information at your fingertips with these other SPIE Field Guides:
Image Processing, Khan M. Iftekharuddin and Abdul Awwal, Vol. FG25

Other related titles:
Fourier Transforms Using Mathematica®, Joseph W. Goodman, Vol. PM322
Multispectral Image Fusion and Colorization, Yufeng Zheng, Erik Blasch, and Zheng Liu, Vol. PM285
Optical Interference Filters Using MATLAB®, Scott W. Teare, Vol. PM299
Optimal Signal Processing Under Uncertainty, Edward R. Dougherty, Vol. PM287
Regularization in Hyperspectral Unmixing, Jignesh S. Bhatt and Manjunath V. Joshi, Vol. SL25
Table of Contents

Preface
Glossary of Terms and Acronyms

Optical Remote Sensing
  Spectral Coverage of Optical Remote Sensing
  Spectral Characteristics of Earth Features
  Spectral Resolution
  Spatial Resolution
  Pixel, Subpixel, and Superpixel
  Radiometric Resolution
  From Raw Data to Information Retrieval
  Image Processing Techniques vs Image Types

Image Data Correction
  Radiometric Errors Due to Atmosphere
  Cloud Removal
  Geometric Errors
  Mapping Functions and Ground Control Points
  Mapping Function Validation
  Resampling
  Image Registration Example

Image Radiometric Enhancement and Display
  Image Histogram
  Linear Histogram Modification
  Linear Histogram Modification Example
  Uniform Histogram and Cumulative Histogram
  Histogram Equalization for Contrast Enhancement
  Histogram Equalization Example
  Color Composite Image Display
  Principal Component Transformation for Image Display

Image Geometric Enhancement
  Spatial Filtering
  Image Smoothing
  Speckle Removal
  Edge and Discontinuity Detection
  Spatial Gradient Detection
  Morphological Operations

Hyperspectral Image Data Representation
  Image Data Cube Files
  Image Space and Spectral Space
  Features and Feature Space
  Pixel Vector, Image Matrix, and Data Set Tensor
  Cluster Space

Image Clustering and Segmentation
  Otsu's Method
  Clustering Using the Single-Pass Algorithm
  Clustering Using the k-Means Algorithm
  Clustering Using the k-Means Algorithm Example
  Superpixel Generation Using SLIC

Pixel-Level Supervised Classification
  Supervised Classification Procedure
  Prototype Sample Selection
  Training Samples and Testing Samples
  Minimum Euclidean Distance Classifier
  Spectral Angle Mapper
  Spectral Information Divergence
  Class Data Modeling with a Gaussian Distribution
  Mean Vector and Covariance Matrix Estimation
  Gaussian Maximum-Likelihood Classification
  Other Distribution Models
  Mahalanobis Distance and Classifier
  k-Nearest Neighbor Classification
  Support Vector Machines
  Nonlinear Support Vector Machines

Handling Limited Numbers of Training Samples
  Semi-Supervised Classification
  Active Learning
  Transfer Learning

Feature Reduction
  The Need for Feature Reduction
  Basic Band Selection
  Mutual Information
  Band Selection Based on Mutual Information
  Band Selection Based on Class Separability
  Knowledge-based Feature Extraction
  Data-Driven Approach for Feature Extraction
  Linear Discriminant Analysis
  Orthogonal Subspace Projection
  Adaptive Matched Filter
  Band Grouping for Feature Extraction
  Principal Component Transformation

Incorporation of Spatial Information in Pixel Classification
  Spatial Texture Features using GLCM
  Examples of Texture Features
  Markov Random Field for Contextual Classification
  Options for Spectral-Spatial-based Mapping

Subpixel Analysis
  Spectral Unmixing
  Endmember Extraction
  Endmember Extraction with N-FINDR
  Limitation of Linear Unmixing
  Subpixel Mapping
  Subpixel Mapping Example
  Super-resolution Reconstruction

Artificial Neural Networks and Deep Learning with CNNs
  Artificial Neural Networks: Structure
  Artificial Neural Networks: Neurons
  Limitation of Artificial Neural Networks
  CNN Input Layer and Convolution Layer
  CNN Padding and Stride
  CNN Pooling Layer
  CNN Multilayer and Output Layer
  CNN Training
  CNN for Multiple-Image Input
  CNN for Hyperspectral Pixel Classification
  CNN Training for Hyperspectral Pixel Classification

Multitemporal Earth Observation
  Satellite Orbit Period
  Coverage and Revisit Time
  Change Detection

Classification Accuracy Assessment
  Error Matrix for One-Class Mapping
  Error Matrix for Multiple-Class Mapping
  Kappa Coefficient Using the Error Matrix
  Model Validation

Bibliography
Index
Preface

Hyper/multispectral imagery in optical remote sensing is an extension of color RGB pictures. The wavelength range used extends beyond the visible, up to the reflective shortwave infrared. Hyperspectral imaging offers higher spectral resolution, leading to many wavebands. The spectral profiles recorded reveal reflected solar radiation from Earth-surface materials when the sensor is mounted on an airborne or spaceborne platform. An inverse process using machine-learning approaches is conducted for target detection, material identification, and associated environmental applications, which is the main purpose of remote sensing.

This Field Guide covers three areas: the fundamentals of remote sensing imaging for image understanding; image processing for correction and quality improvement; and image analysis for information extraction at subpixel, pixel, superpixel, and image levels, including feature mining and reduction. Basic concepts and fundamental understanding are emphasized to prepare the reader for exploring advanced methods.

I owe thanks to Professor John Richards, who introduced me to the remote sensing field and supervised my Ph.D. study on hyperspectral image classification. Over the years I learned a lot from John about critical thinking and thorough presentation.

I dedicate this Field Guide to my husband, Xichuan, my son, James, and my daughter, Jessica, who have shared and accompanied me on my learning journey in this subject for so many years and made me a proud career wife and mom.

Xiuping Jia
The University of New South Wales, Canberra, Australia
May 2022
Glossary of Terms and Acronyms

AMF          adaptive matched filter
ANN          artificial neural network
AVIRIS       Airborne Visible/Infrared Imaging Spectrometer
B            number of bits
Bij          Bhattacharyya distance
BIL          band-interleaved-by-line
BIP          band-interleaved-by-pixel
BSQ          band sequential
CNN          convolutional neural network
d(x1, x2)    spectral Euclidean distance between two vectors
ETM          Enhanced Thematic Mapper
fm           fraction of endmember class m contained in a mixed pixel
FN           false negative
FP           false positive
gi(x)        likelihood of x belonging to class i
GCP          ground control point
GEO          geosynchronous equatorial orbit
GLCM         gray-level co-occurrence matrix
h            histogram
H            entropy
HPF          high-pass filter
HR           high resolution
HYDICE       Hyperspectral Digital Imagery Collection Experiment
I            identity matrix
I(Ba, Bb)    mutual information between two data sets
ICA          independent component analysis
Jij          J–M distance
KNN          k-nearest neighbor
KSVM         kernel support vector machine
LDA          linear discriminant analysis
LEO          low Earth orbit
LPF          low-pass filter
LR           low resolution
m            mean vector
M            number of classes
MRF          Markov random field
MSI          multispectral instrument
N            number of bands
NAPCT        noise-adjusted principal component transformation
NDVI         normalized difference vegetation index
NDWI         normalized difference water index
OLI          Operational Land Imager
OSP          orthogonal subspace projection
p(ωi)        probability of class i
p(ωi|x)      probability of x belonging to class i
p(x|ωi)      probability of finding x from class i
PCT          principal component transformation
re           Earth radius
ReLU         rectified linear unit
SAM          Spectral Angle Mapper
SID          spectral information divergence
SLIC         simple linear iterative clustering
SPCT         segmented principal component transformation
SRR          super-resolution reconstruction
SVM          support vector machine
T            temperature, period, or threshold
T (bold)     tensor
TN           true negative
TP           true positive
UAV          unmanned aerial vehicle
x            intensity value of a pixel
x (bold)     intensity vector of a pixel
X            image data set/cube
y            new value
y (bold)     transformed intensity vector after LDA or PCT
z            transformed intensity vector after OSP
λ            wavelength; eigenvalues
ωi           class i
Σa           among-class covariance matrix
Σw           within-class covariance matrix
Σ            covariance matrix
(u, v)       coordinates with error
(x, y)       spatial coordinates
Optical Remote Sensing
Spectral Coverage of Optical Remote Sensing

[Figure: imaging geometry. Solar energy is reflected from the Earth surface, passes through the atmosphere to the sensor, and the recorded data are sent down to a receiver station.]
The wavelength range used for optical remote sensing is 0.4 to 2.5 μm. In this range, the radiation of the Sun (at a temperature of about 6000 K) is strong, according to Planck's blackbody radiation law:

$$M(T, \lambda) = \frac{C_1}{\lambda^5\left[\exp\!\left(\frac{C_2}{\lambda T}\right) - 1\right]} \quad (\mathrm{W\ m^{-2}\ \mu m^{-1}})$$

where
C1: first radiation constant = 3.7413 × 10⁸ W μm⁴/m²,
C2: second radiation constant = 1.4388 × 10⁴ μm K,
T: absolute radiant temperature (K), and
λ: radiation wavelength (μm).

T and λ are the two variables. Even after accounting for the attenuation due to the large distance between the Sun and the Earth, the solar radiation received by the Earth in the optical reflective range is much stronger than the Earth's own emission (at about 300 K). The signals collected by an optical sensor on board a satellite or aircraft are therefore the reflected solar radiation from the Earth surface, thanks also to the generally high atmospheric transmittance in the optical range. When λ > 3 μm, emission from hot bodies dominates the upwelling radiation from the Earth. This phenomenon forms thermal infrared remote sensing (3 μm < λ < 14 μm).
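The dominance of reflected sunlight over the Earth's own emission in this range can be checked numerically. The following minimal NumPy sketch (an illustration, not from the book; the function name is ours) evaluates the formula above:

```python
import numpy as np

C1 = 3.7413e8   # first radiation constant, W um^4 / m^2
C2 = 1.4388e4   # second radiation constant, um K

def planck_exitance(T, wavelength_um):
    """Spectral exitance M(T, lambda) in W m^-2 um^-1 (Planck's law)."""
    lam = np.asarray(wavelength_um, dtype=float)
    return C1 / (lam**5 * (np.exp(C2 / (lam * T)) - 1.0))

lam = 1.0  # um, inside the 0.4-2.5 um optical reflective range
print(planck_exitance(6000.0, lam))  # Sun, ~6000 K: about 4e7 W m^-2 um^-1
print(planck_exitance(300.0, lam))   # Earth, ~300 K: vanishingly small here
```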
Spectral Characteristics of Earth Features

The Earth landcover types, such as water bodies, bare soils, and vegetation, interact differently with solar radiation at various wavelengths. In other words, each material has its own spectral reflectance characteristics. [Figure: typical spectral reflectance curves over the 191 Hyperspectral Digital Imagery Collection Experiment (HYDICE) bands, covering wavelengths from 0.4 to 2.5 μm.]

The overall spectral profile is affected by the solar curve, i.e., high in the visible range and low in the rest of the wavelength range.

Correction of the data to the percentage of reflected solar radiation is not always necessary when machine learning is performed for relative-data-based analysis, where spectral variations within a given class are also addressed.
Spectral Resolution

Different materials have different spectral reflectance characteristics, with different absorption bands of solar radiation, which make them discriminable from each other. Optical sensors are designed to capture images in multiple wavebands with good spectral resolution so that the image data can then be used for material identification and recognition. The table below shows the band designation of the Landsat-8 OLI (Operational Land Imager) as an example. Landsat-9 OLI-2's bands are similar.
Spectral Band              Wavelength (nm)   Spectral Resolution (nm)
Band 1 Coastal/Aerosol     435–451           16
Band 2 Blue                452–512           60
Band 3 Green               533–590           57
Band 4 Red                 636–673           37
Band 5 Near Infrared       851–879           28
Band 6 SWIR                1566–1651         85
Band 7 SWIR                2107–2294         187
Band 8 Panchromatic        503–676           173
Band 9 Cirrus              1363–1384         21
Note that Band 8 provides a panchromatic image with higher spatial resolution (15 m) than other multispectral bands (30 m) with a lower spectral resolution. There is a tradeoff between the two to meet a required signal-to-noise ratio. Hyperspectral sensors such as Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) and Hyperion provide higher spectral resolution on the order of 10 nm and collect data for the complete optical spectral range from 400 to 2450 nm, leading to more than 200 bands. The rich spectral information is important for fine material and condition monitoring. HyMap is another hyperspectral sensor that provides a spectral resolution around 20 nm, with a better spatial resolution of 3 m, compared with the 20 m offered by AVIRIS.
Spatial Resolution

A pixel in an image corresponds to a physical area of the imaged scene. This is the smallest area that a sensor can resolve, and the size of the area is called the spatial resolution. The brightness value of a pixel is the combined reflectance from the corresponding area: a large area for a low-resolution (LR) case or a small area for a high-resolution (HR) case, as illustrated below.
[Figure: the same scene sampled at low resolution (LR), where one brightness value combines a large area, and at high resolution (HR), where each small area is resolved separately.]
As an example, the Sentinel-2 Multispectral Instrument (MSI) has 13 bands with spatial resolutions ranging from 10 m to 60 m. These bands are designed for basic landcover mapping, geophysical-parameter retrieval, and atmospheric correction. If the resolution differs between the two directions, or if data fusion is performed on images with different resolutions, then geometric correction and image registration are required.
Spectral Band                   Spatial Resolution (m)
Band 1  Coastal/Aerosol         60
Band 2  Blue                    10
Band 3  Green                   10
Band 4  Red                     10
Band 5  Vegetation Red Edge     20
Band 6  Vegetation Red Edge     20
Band 7  Vegetation Red Edge     20
Band 8  NIR                     10
Band 8a Vegetation Red Edge     20
Band 9  Water Vapor             60
Band 10 SWIR-Cirrus             60
Band 11 SWIR                    20
Band 12 SWIR                    20
Pixel, Subpixel, and Superpixel

Low resolution leads to highly mixed pixels; i.e., a pixel contains more than one type of material. Zooming is achieved by interpolation techniques, or by simply replicating pixels; it does not add new information. To improve spatial resolution, super-resolution reconstruction (SRR) can be applied, either by combining multiple observations or with the help of prior knowledge. For example, an original spatial resolution of 30 m can be improved to 10 m. Subpixel analysis can also be conducted using spectral unmixing techniques, whereby the fraction of each primary landcover class contained in each pixel is estimated. Class labels at the subpixel level can then be achieved by super-resolution mapping.

High spatial resolution is desirable for small-target detection and recognition. However, high-resolution data are sensitive to noise and details, and within-class variation becomes high. Superpixel-based image analysis can then be adopted. A superpixel is a group of spatially connected similar pixels; its size and shape often vary from location to location. Superpixels can be generated via oversegmentation.

[Figure: a pixel, a subpixel within a pixel, and a superpixel formed from a group of connected similar pixels.]
Radiometric Resolution

Radiometric resolution is determined by the number of quantization levels applied. Each level is the minimum difference of radiance that the sensor can distinguish. This is often expressed as the number of bits used in the quantization. If the radiometric resolution is B bits, and the range of the intensity value is from 0 to L, then

$$L = 2^B - 1$$

Radiometric resolution (bits)   Range of intensity values
7                               0–127
8                               0–255
10                              0–1023
12                              0–4095
16                              0–65,535
Landsat 9 OLI-2 increases quantization to 14 bits from the 12 bits provided by Landsat 8 OLI. Together with high spatial and spectral resolution, high radiometric resolution is important for providing high capacity in class separability, especially for coping with cases where class separability is low. The side effect of high radiometric resolution is an increase in data volume, leading to a high degree of redundancy. Data compression can be achieved by applying coding techniques, either lossy or lossless. Binary coding may be applied after a threshold is introduced to suppress noise and foster fast spectral matching and classification based on a Hamming distance (exclusive OR). Another option uses two thresholds that can be determined adaptively in multiple subregions of the spectrum.
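As a rough sketch of the binary-coding idea (not from the book; the spectra and function names are hypothetical), each spectrum is thresholded at its own mean and codes are matched with an exclusive OR:

```python
import numpy as np

def binary_code(spectrum, threshold=None):
    """Binary-code a spectrum: 1 where the band value exceeds a threshold.
    If no threshold is given, the spectrum's own mean is used."""
    if threshold is None:
        threshold = spectrum.mean()
    return (spectrum > threshold).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Number of differing bits (exclusive OR, then count)."""
    return int(np.count_nonzero(np.bitwise_xor(code_a, code_b)))

# Hypothetical 8-band spectra of a query pixel and two references
pixel = np.array([35, 40, 32, 90, 110, 105, 60, 55])
ref_veg = np.array([30, 45, 30, 95, 115, 100, 58, 50])
ref_soil = np.array([60, 70, 80, 85, 90, 95, 100, 105])

codes = [binary_code(s) for s in (pixel, ref_veg, ref_soil)]
print(hamming_distance(codes[0], codes[1]))  # 0: similar spectral shape
print(hamming_distance(codes[0], codes[2]))  # 3: different spectral shape
```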
From Raw Data to Information Retrieval

The pixel-level hyperspectral image-processing chain is shown below. Subpixel or superpixel analysis is also possible. This section is about raw data collection. To convert raw data to information, image processing and image analysis play key roles. The following sections cover the related topics leading to image display and image interpretation for information retrieval.

Hyperspectral raw data
  → Display, correction, and enhancement
  → Training data selection
  → Feature mining: feature generation, feature selection, feature extraction, feature learning
  → Classification (e.g., species identification and mapping)
  → Post-processing: change detection, area estimates, impact on ...

(Supported throughout by domain knowledge, historical data, and other information.)
Image Processing Techniques vs Image Types

[Diagram: image processing techniques matched to image types. Techniques: image correction; histogram enhancement; spatial filtering; spatial texture; superpixel approach; subpixel analysis; feature selection; feature extraction; feature learning; clustering; classification; change detection. Image types: panchromatic (single band); multispectral (a few bands); hyperspectral (many bands); multitemporal; the types are related by conversion or subsetting.]
Image Data Correction
Radiometric Errors Due to Atmosphere

The presence of the atmosphere can significantly modify the signal received from the Earth's surface as a result of scattering and absorption by the particles in the atmosphere.

Absorption: O2, CO2, O3, H2O, etc. These molecules are wavelength selective.

Scattering: by air molecules (∝ λ⁻⁴): Rayleigh scattering. By large particles (0.1λ to 10λ), such as those associated with smoke, haze, and fumes: Mie scattering.

Fine detail in image data will be obscured due to the atmospheric effect.

Bulk correction of atmospheric effects:
Assumption: some pixels in the image would have a brightness of about zero if there were no atmospheric effect. Find the (average) lowest brightness value in each band (C0) and subtract it:

$$y = x - C_0$$

where x is the original value and y is the new value. [Figure: the original histogram, offset from zero by C0, shifts to start at zero in the new histogram.]

Detailed corrections can be achieved by modeling the absorption and scattering with defined parameters.
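A minimal sketch of the bulk correction (per-band dark-object subtraction) under the stated assumption; the cube values and function name here are hypothetical, not from the book:

```python
import numpy as np

def dark_object_subtraction(cube):
    """Bulk atmospheric (haze) correction: subtract each band's minimum,
    assuming some pixels would be near zero without the atmosphere.
    cube: float array of shape (rows, cols, bands)."""
    c0 = cube.reshape(-1, cube.shape[-1]).min(axis=0)  # per-band offset C0
    return np.clip(cube - c0, 0.0, None), c0

# Hypothetical small two-band image with additive path radiance
cube = np.array([[[12.0, 30.0], [15.0, 34.0]],
                 [[52.0, 80.0], [13.0, 31.0]]])
corrected, c0 = dark_object_subtraction(cube)
print(c0)         # estimated per-band haze offsets: [12. 30.]
print(corrected)  # the darkest pixels are now near zero in each band
```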
Cloud Removal

Optical images suffer from cloud contamination, and without correction, the information on the affected areas cannot be retrieved. Cloud detection aims to generate a cloud mask to separate clear pixels from cloudy pixels. It gives an indication of the proportion of cloud cover in an image, and it also guides the follow-up correction to the correct locations. Band 9 (1360 to 1390 nm) of Landsat OLI and OLI-2 is available to detect cirrus cloud distributions.

Thin cloud correction: When a cloud is thin, the signal recorded is a mixture of the reflectance from the cloud and the Earth surface area beneath it. Unmixing techniques can be applied to determine the cloud component to be removed.

Thick cloud data replacement: When clouds are very thick, they block signals from the Earth's surface. The cloud data need to be replaced by a good guess of the signal from the corresponding location/pixel in two steps:
Step 1: Determine the ground cover class of the location. Other data sources such as radar images and accessory information may be utilized for this task.
Step 2: Find the reflectance from that class to use for replacement. This step often uses the clear pixels of the corresponding class in the same image to make the created data consistent with the rest of the image.

Ground truth for cloud removal validation may not exist, so a clear image collected around the same time is often used as a reference image for verification.

[Diagram: raw data → cloud removal → cloud-removed image → classification → map/product.]
Geometric Errors

The images recorded from a satellite platform or by an airborne sensor, including unmanned aerial vehicles (UAVs), often exhibit geometric distortion; i.e., they do not match the true location relationships among the features on the Earth's surface. Error sources include Earth rotation, platform movement and variations, and sensor sampling variation. True locations are often provided by a map or an image that has been previously corrected.

[Figure: a recorded image with geometric errors and the corresponding reference image (previously corrected).]
Mapping Functions and Ground Control Points

Mapping functions describe the location relationship between the image with error and the reference image (or map). Their coordinates can be denoted as (u, v) and (x, y), respectively. For example,

$$u = 0.5x, \qquad v = y + 5$$

gives the relationship of a compression by half in the horizontal direction and an offset by 5 in the vertical direction. Often, the relationship is unknown due to several error sources, and a polynomial form is assumed. The second-degree polynomial is given by

$$u = a_0 + a_1 x + a_2 y + a_3 xy + a_4 x^2 + a_5 y^2$$
$$v = b_0 + b_1 x + b_2 y + b_3 xy + b_4 x^2 + b_5 y^2$$

where the coefficients are estimated from known samples, which are called ground control points (GCPs). GCPs are small, distinguishable spatial features, such as a road intersection, the corner of a building, or a tower, that can be identified manually or automatically in both images. Their locations (ui, vi) and (xi, yi) are used as training data to find the model parameters a and b in the mapping functions.

[Figure: corresponding GCPs marked on the image with error (ui, vi) and on the reference image (xi, yi).]
Mapping Function Validation

If a first-degree polynomial is used, there are six unknowns:

$$u = a_0 + a_1 x + a_2 y$$
$$v = b_0 + b_1 x + b_2 y$$

Mathematically, three GCPs are adequate. In practice, many more than three GCPs, drawn from different parts of the image, are required to find a least-squares solution, so that the established model works well for the whole image and not only at the control points. The least-squares solution is the one that best fits the relationship provided by the training data as a whole (the best compromise), although it cannot fit every training data point (GCP) perfectly. It is important to check how closely each GCP is fitted, a quantity called the model residual. While the model residual cannot reach zero, a large residual is unacceptable; in that case, the model is not powerful enough and must be replaced.

A model must also work for the whole image. Some GCPs are therefore withheld from the training stage and used instead to test the model's ability to generalize, based on the accuracy obtained. Model complexity and model generalization need a good balance. A first-order polynomial is often adequate, as it can model shifting, rotation, compression/stretching, and shearing. Higher-order polynomials often give a smaller model residual on the training data but do not work for the whole image; this type of modeling is called overtraining.

Word of caution: do not use a complex model unless it is necessary.
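A possible least-squares GCP fit is sketched below (an illustration, not the book's implementation; the simulated GCP coordinates and noise are hypothetical):

```python
import numpy as np

def fit_affine_mapping(xy_ref, uv_err):
    """Least-squares fit of the first-degree polynomial mapping
    u = a0 + a1*x + a2*y, v = b0 + b1*x + b2*y from GCP pairs.
    xy_ref, uv_err: arrays of shape (n_gcps, 2)."""
    x, y = xy_ref[:, 0], xy_ref[:, 1]
    A = np.column_stack([np.ones_like(x), x, y])      # design matrix
    a, *_ = np.linalg.lstsq(A, uv_err[:, 0], rcond=None)
    b, *_ = np.linalg.lstsq(A, uv_err[:, 1], rcond=None)
    return a, b

def mean_residual(xy, uv, a, b):
    """Average Euclidean residual of the fitted mapping at given GCPs."""
    A = np.column_stack([np.ones(len(xy)), xy[:, 0], xy[:, 1]])
    du, dv = A @ a - uv[:, 0], A @ b - uv[:, 1]
    return float(np.mean(np.hypot(du, dv)))

# Hypothetical GCPs: reference coordinates and error-image coordinates
rng = np.random.default_rng(0)
xy = rng.uniform(0, 500, size=(10, 2))
uv = np.column_stack([30 + 0.8 * xy[:, 0] + 0.1 * xy[:, 1],
                      7 - 0.1 * xy[:, 0] + 0.8 * xy[:, 1]])
uv += rng.normal(0, 0.5, uv.shape)     # GCP location noise

a, b = fit_affine_mapping(xy, uv)
print(a, b)                            # close to the simulated coefficients
print(mean_residual(xy, uv, a, b))     # model residual on the training GCPs
```

Withheld GCPs would be passed to the same `mean_residual` helper to obtain the testing error.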
Resampling

After the mapping functions are established, the brightness value at each true location g(x, y) can be found from the brightness value at the corresponding location on the recorded image, g(u, v). If, for a given (x, y), the corresponding (u, v) are integers, then g(x, y) = g(u, v). If (u, v) are not integers, resampling techniques must be applied.

Nearest-neighbor resampling rounds (u, v) to the closest integers:

$$g(x, y) = g[\mathrm{round}(u), \mathrm{round}(v)]$$

Bilinear interpolation predicts g(u, v) from the surrounding four pixels, with the assumption of a spatially linear change. This type of resampling is often used for low-resolution images.

For a given (x, y) of (22, 35), corresponding to a (u, v) of (25.5, 38.8), the surrounding pixels' brightness values are

          v = 38   v = 39
u = 25      10       20
u = 26      30       40

Interpolating along u at v = 38 gives 10 + 0.5(30 − 10) = 20; along u at v = 39 gives 20 + 0.5(40 − 20) = 30; interpolating these two along v gives 20 + 0.8(30 − 20) = 28. After the three interpolations, the resampling result is g(22, 35) = 28.
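The worked example can be reproduced with a short bilinear-interpolation sketch (illustrative code, not from the book):

```python
import numpy as np

def bilinear(img, u, v):
    """Bilinear resampling of img (indexed img[u, v]) at fractional (u, v)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    top = img[u0, v0] + du * (img[u0 + 1, v0] - img[u0, v0])          # along u at v0
    bot = img[u0, v0 + 1] + du * (img[u0 + 1, v0 + 1] - img[u0, v0 + 1])
    return top + dv * (bot - top)                                     # along v

# Reproduce the worked example: neighbors of (u, v) = (25.5, 38.8)
img = np.zeros((40, 50))
img[25, 38], img[26, 38] = 10, 30
img[25, 39], img[26, 39] = 20, 40
print(bilinear(img, 25.5, 38.8))  # 28.0 -> g(22, 35) = 28
```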
Image Registration Example

[Figure: an image with error, the corrected image, the reference image, and the corrected image overlaid on the reference image.]

In this example, 17 GCPs were selected: 10 were used for training, and 7 were used for testing. A first-degree polynomial was adopted, and the following mapping functions were derived:

$$u = 29.88 + 0.82x + 0.12y$$
$$v = 7.05 - 0.12x + 0.83y$$

The training data average error (model residual) was 0.89 pixel (26.66 m); the testing data average error was 1.06 pixel (31.76 m). Nearest-neighbor resampling was conducted.
Image Radiometric Enhancement and Display
Image Histogram

Let X denote an image data set, X ∈ R^(M×N), where M and N are the number of rows and columns of the image. The intensity of each pixel is denoted as x(m, n), which is in the range of 0 to 2^B − 1, where B is the number of bits used. For an 8-bit image, 0 ≤ x(m, n) ≤ 255. An image histogram is a graph showing the number of pixels in an image whose intensity value is x, for x = 0, 1, 2, ..., 2^B − 1.

[Figure: an example image and its histogram, plotting the number of pixels against brightness from 0 to 255.]

It is important to note that an image has a corresponding histogram, but a given histogram is not associated with a unique image.
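A histogram can be computed in a few lines of NumPy (an illustrative sketch; the random stand-in image is hypothetical):

```python
import numpy as np

def image_histogram(img, bits=8):
    """Histogram h(x): number of pixels at each intensity x = 0..2**bits - 1."""
    levels = 2 ** bits
    return np.bincount(img.ravel(), minlength=levels)[:levels]

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)  # stand-in image
h = image_histogram(img)
print(h.sum() == img.size)   # True: the histogram counts every pixel
print(int(h.argmax()))       # the most frequent brightness level
```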
Linear Histogram Modification

The histogram on the previous page shows the distribution of the brightness values utilized in the corresponding image. If most of the pixels have low values, the image looks dark, and the details of the dark region are difficult to visualize. Histogram modification can be applied to improve image contrast or to enhance a certain data range according to the user's interest. The linear histogram transformation function is

$$y = ax + b$$

where x is the old intensity value, and y is the new value after transformation.

a > 1: the data range is stretched.
a = 1: the brightness is offset by b.
a < 1: the data range is compressed.
a = 0: the data are saturated to b.

Piecewise linear functions can also be applied to a given image, with a different transformation function for each piece, depending on the need. For example, if both ends of the data range are believed to be noise, the low values up to c1 are saturated to zero, and the values higher than c2 are saturated to the maximum value. [Figure: a piecewise linear transformation that clips below c1 and above c2 and stretches the range in between to 0 to 2^B − 1.]
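A minimal sketch of such a piecewise linear stretch (illustrative, not the book's code; c1 and c2 are user-chosen clipping points):

```python
import numpy as np

def piecewise_linear_stretch(img, c1, c2, bits=8):
    """Clip below c1 and above c2, then stretch [c1, c2] to the full range."""
    top = 2 ** bits - 1
    y = (img.astype(float) - c1) * top / (c2 - c1)   # stretch the kept range
    return np.clip(y, 0, top).astype(np.uint8)       # saturate both ends

img = np.array([[10, 40, 90], [120, 200, 250]], dtype=np.uint8)
print(piecewise_linear_stretch(img, c1=40, c2=200))
# 10 and 40 -> 0 (saturated low), 120 -> 127, 200 and 250 -> 255
```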
Linear Histogram Modification Example

[Figure: the original image and the result after the overall linear transform y = 2.83x − 105, each shown with its histogram.]

The enhancement achieved is limited, as the whole range of data is stretched linearly to the maximum range available. A piecewise linear process can be more effective.
Uniform Histogram and Cumulative Histogram

An image cumulative histogram is a graph showing the number of pixels in the image whose intensity value is at or below x, for x = 0, 1, 2, ..., 2^B − 1, where B is the number of bits used. A uniform histogram and its cumulative histogram are shown below.

[Figure: a uniform histogram (equal pixel counts at every brightness level) and its cumulative histogram (a straight line rising from zero to the total number of pixels).]
Histogram Equalization for Contrast Enhancement

Generally speaking, image contrast can be best improved if the histogram is modified into a uniform histogram so that the available brightness levels are equally utilized, and every detail can be displayed well. Unfortunately, it is impossible to force the image histogram to become an exactly uniform histogram. However, it is possible to make it uniform on average, i.e., to make its new cumulative histogram close to the cumulative uniform histogram.

Based on this idea, the new value y can be found from x using

$$y = \frac{L}{MN}\sum_{j=0}^{x} h_i(j)$$

where L = 2^B − 1, B is the number of bits used, MN is the total number of pixels in the image, and h_i(j) is the histogram of the original image. This modification is referred to as histogram equalization.
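A direct implementation of this formula as a lookup table might look as follows (an illustrative sketch, not from the book):

```python
import numpy as np

def histogram_equalization(img, bits=8):
    """Map each level x to y = L/(M*N) * sum_{j<=x} h(j), rounded."""
    L = 2 ** bits - 1
    h = np.bincount(img.ravel(), minlength=L + 1)
    cdf = np.cumsum(h)                                    # cumulative histogram
    lut = np.round(L * cdf / img.size).astype(np.uint8)   # lookup table y(x)
    return lut[img]

rng = np.random.default_rng(2)
img = rng.integers(0, 64, size=(256, 256), dtype=np.uint8)  # dark image
eq = histogram_equalization(img)
print(img.max(), eq.max())  # the used brightness range expands toward 255
```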
Histogram Equalization Example

[Figure: the original image and the image after histogram equalization.]

Histogram modification to other required shapes can be performed in a similar way, i.e., by making both cumulative histograms about the same.
Color Composite Image Display

Three measurements of a pixel can be associated with the three primary components of a display device to form a composite color.

Primary Color   Blue     Green    Red
Wavelength      445 nm   535 nm   575 nm

[Figure: the RGB color cube, with black and white at opposite corners of the main diagonal and cyan, magenta, and yellow opposite red, green, and blue.]

If the measurements are the reflectance at blue, green, and red wavelengths, the display shows true color. Other bands of data can be used as input, and a false color is generated. The standard false color displays green, red, and near-infrared data as blue, green, and red. Data that are not related to wavelengths can also be displayed for visual inspection. For example, a large correlation matrix of a hyperspectral data set can be displayed as an image where the brightness is proportional to the degree of correlation.
Principal Component Transformation for Image Display

Principal component transformation (PCT) provides a new feature space whose first few components are the best choice of linear functions for reconstructing the original data, in the sense of minimizing the mean-square error of the residual. The variance in the original data set is rearranged so that the first few principal components contain almost all of the variance in the original data. The principal components are uncorrelated in the new vector space.

Example: an original Landsat Enhanced Thematic Mapper (ETM) image data set with 7 bands. After PCT, the total variance remains about the same, but it is rearranged and compacted into the first few components. The colors of the image generated using the first three principal components (PCs) are richer and brighter.

[Figure: three color composites of the scene — Bands 3, 2, 1 (true color); Bands 4, 3, 2 (standard false color); PCs 3, 2, 1 (false color).]
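A PCT can be sketched as an eigen-decomposition of the band covariance matrix (illustrative code, not from the book; the synthetic 7-band cube is a stand-in for real data):

```python
import numpy as np

def pct(cube):
    """Principal component transformation of a cube (rows, cols, bands):
    eigen-decompose the band covariance matrix and project the pixels."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # largest variance first
    pcs = Xc @ eigvecs[:, order]
    return pcs.reshape(rows, cols, bands), eigvals[order]

rng = np.random.default_rng(3)
cube = rng.normal(size=(64, 64, 7)) @ np.diag([5, 4, 3, 2, 1, 1, 1])  # 7 "bands"
pcs, variances = pct(cube)
print(variances)  # the variance compacts into the first few components
```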
Image Geometric Enhancement
Spatial Filtering An image’s geometric content can be enhanced via spatial filtering. For example, noise and details may be ignored to generate a smoothed image, or edges may be detected to highlight object boundaries. These goals can be achieved by designing a kernel for the purpose and performing a running window operation, which is also called a convolution. An M by N kernel is M 5 2a þ 1 and N 5 2b þ 1. For example, if a 5 1 and b 5 1, the kernel has 9 entries. This kernel is applied to each pixel f(x, y) and its neighbors on an image to generate a new value of g(x, y): gðx; yÞ 5
a X b X
wðs; tÞf ðx þ s; y þ tÞ
s5a t5b
w(1, 1) w(0, 1) w(þ1, 1)
w(1, 0) w(0, 0) w(þ1, 0)
w(1, þ1) w(0, þ1) w(þ1, þ1)
f(x–1, y1) f(x, y1) f(xþ1, y1)
f(x1, y) f(x, y) f(xþ1, y)
f(x1, yþ1) f(x, yþ1) f(xþ1, yþ1)
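The running-window operation can be sketched directly from the formula (illustrative code, not from the book; border pixels are handled here by replication, one of several possible choices):

```python
import numpy as np

def spatial_filter(f, w):
    """Running-window filter: g(x,y) = sum_{s,t} w(s,t) f(x+s, y+t).
    Edges are handled by replicating the nearest border pixel."""
    a, b = w.shape[0] // 2, w.shape[1] // 2
    fp = np.pad(f.astype(float), ((a, a), (b, b)), mode="edge")
    g = np.zeros_like(f, dtype=float)
    for s in range(-a, a + 1):
        for t in range(-b, b + 1):
            g += w[s + a, t + b] * fp[a + s : a + s + f.shape[0],
                                      b + t : b + t + f.shape[1]]
    return g

f = np.arange(25, dtype=float).reshape(5, 5)
w = np.full((3, 3), 1 / 9)     # 3-by-3 smoothing (low-pass) kernel
print(spatial_filter(f, w))    # interior values equal their local means
```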
Image Smoothing

If every entry in a kernel is

$$w(s, t) = 1/(M \cdot N)$$

the result of the kernel operation is image smoothing, i.e., a low-pass filter (LPF). For a LPF, the sum of the entries is 1. For example, the kernel

1/5   1/5   1/5   1/5   1/5

will smooth an image in the horizontal direction. To smooth the image in both directions, a kernel of 3 by 3 or 5 by 5 is often applied. The larger the kernel, the stronger the smoothing effect that can be obtained. For high-spatial-resolution images, a larger kernel size may be used. While noise is removed to a large extent, object edges, i.e., the high-frequency information, are blurred. A simple and effective solution is to introduce a threshold Te to limit the modification of f(x, y) to r(x, y):

if |g(x, y) − f(x, y)| < Te, then r(x, y) = g(x, y); otherwise, r(x, y) = f(x, y).

Alternatively, superpixels can be generated first, and then smoothing can be conducted within each superpixel.
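A sketch of the thresholded smoothing (illustrative, assuming a box low-pass filter; Te and the kernel size are user-chosen):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def threshold_smoothing(f, te, size=3):
    """Low-pass filter, but keep the original value wherever smoothing
    would change it by Te or more, so edges are not blurred."""
    g = uniform_filter(f.astype(float), size=size)   # box-kernel smoothing
    return np.where(np.abs(g - f) < te, g, f)

f = np.zeros((6, 6)); f[:, 3:] = 100.0               # sharp vertical edge
r = threshold_smoothing(f, te=5.0)
print(np.array_equal(r, f))  # True: the edge pixels are left unchanged
```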
Speckle Removal

If the noise is not white noise, as in the 'salt and pepper' effect, averaging the data is not a good solution. Median filtering is better; i.e., the pixel at the center of the template is replaced by the median of the brightness values of all the pixels covered by the kernel, which is the central number of this region.

Example: below are the original data around the current pixel when a 3 by 3 kernel is applied; 89 is an outlier, representing the speckle data.

20   19   20
21   89   22
19   18   15

The median of this group of data is 20, found after sorting the values: 15, 18, 19, 19, 20, 20, 21, 22, 89. After median filtering, the central pixel becomes 20.
Edge and Discontinuity Detection

Kernels for various edge detections:

Horizontal:
−1   0   1
−1   0   1
−1   0   1

Vertical:
−1  −1  −1
 0   0   0
 1   1   1

Diagonal:
−1  −1   0
−1   0   1
 0   1   1

All directions (original − smoothed):

−1/9  −1/9  −1/9      0  0  0      1/9  1/9  1/9
−1/9   8/9  −1/9  =   0  1  0  −   1/9  1/9  1/9
−1/9  −1/9  −1/9      0  0  0      1/9  1/9  1/9

Discontinuity detection and highlighting: Laplacian filters

 0  −1   0        −1  −1  −1
−1   4  −1        −1   8  −1
 0  −1   0        −1  −1  −1

These are high-pass filters, where the sum of the kernel entries is zero. The resulting image can be added back to the original for image sharpening.
Spatial Gradient Detection

The magnitude and direction of a spatial gradient can be found from two orthogonal components using the two kernels below:

kx:
−1   0   1
−2   0   2
−1   0   1

ky:
−1  −2  −1
 0   0   0
 1   2   1

Using the two kernels for the two directions, kx and ky, two image components can be obtained, named Gx and Gy. The magnitude of the gradient is

$$|G| = \sqrt{G_x^2 + G_y^2}$$

The direction of the gradient is

$$\angle G = \arctan\!\left(\frac{G_y}{G_x}\right)$$

This edge detection is called the Sobel operator.
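A sketch of the Sobel operator using SciPy's correlation (illustrative, not the book's code; the step-edge test image is hypothetical):

```python
import numpy as np
from scipy.ndimage import correlate

# Sobel kernels for the two orthogonal gradient components
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def sobel_gradient(img):
    """Gradient magnitude and direction from the two Sobel components."""
    gx = correlate(img.astype(float), KX, mode="nearest")
    gy = correlate(img.astype(float), KY, mode="nearest")
    return np.hypot(gx, gy), np.arctan2(gy, gx)   # |G| and angle of G

img = np.zeros((8, 8)); img[:, 4:] = 100.0        # a vertical step edge
mag, ang = sobel_gradient(img)
print(mag.max())  # 400.0: strong response along the edge columns
```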
Morphological Operations

Dilation and erosion are two basic morphological operations. A spatial structure element is selected first, which defines the neighboring region, e.g., a 3 by 3 square element or a disk of radius 1, as shown below.

Square:        Disk:
1 1 1          . 1 .
1 1 1          1 1 1
1 1 1          . 1 .

Dilation: the new value of the central pixel is the maximum value of all pixels in the neighborhood defined by the spatial structure element. Morphological dilation makes objects more visible and fills in small holes in objects.

Erosion: the new value of the central pixel is the minimum value of all pixels in the neighborhood defined by the spatial structure element. Morphological erosion removes isolated and small objects so that only substantive objects remain.

Morphological opening: erosion followed by dilation. This can remove small objects while preserving the shape and size of larger objects in the image.

Morphological closing: dilation followed by erosion. This is useful for filling small holes in an image while preserving the shape and size of the objects in the image.
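These operations are available in SciPy; a minimal sketch (illustrative, with a hypothetical test image containing one object with a small hole):

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

# Structure elements: a 3-by-3 square and a disk (cross) of radius 1
square = np.ones((3, 3), dtype=bool)
disk = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)

img = np.zeros((7, 7)); img[2:5, 2:5] = 100.0; img[3, 3] = 0.0  # hole at center

dilated = grey_dilation(img, footprint=square)   # neighborhood maximum
eroded = grey_erosion(img, footprint=square)     # neighborhood minimum
opened = grey_dilation(grey_erosion(img, footprint=disk), footprint=disk)
closed = grey_erosion(grey_dilation(img, footprint=disk), footprint=disk)
print(closed[3, 3])  # 100.0: closing fills the small hole
```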
Hyperspectral Image Data Representation
Image Data Cube Files

Number of rows: I
Number of columns: J
Number of bands: N
Number of bits per datum: M

Three data-file formats:

Band sequential (BSQ): the complete Band 1 image (I × J × M bits), followed by the complete Band 2 image, then Band 3, and so on.

Band-interleaved-by-line (BIL): Line 1 of Band 1 (J × M bits), Line 1 of Band 2, ..., Line 1 of Band N; then Line 2 of Band 1, Line 2 of Band 2, ..., Line 2 of Band N; and so on.

Band-interleaved-by-pixel (BIP): all N band values of Pixel 1 (N × M bits), then all N band values of Pixel 2, Pixel 3, and so on.
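Assuming raw samples with known dimensions (an illustrative sketch; real products carry a header describing I, J, N, and the data type), each format is a different reshape of the same byte stream:

```python
import numpy as np

I, J, N = 100, 120, 224                       # rows, columns, bands (assumed)
raw = np.zeros(I * J * N, dtype=np.uint16)    # stand-in for the file contents

cube_bsq = raw.reshape(N, I, J).transpose(1, 2, 0)  # band, row, col -> row, col, band
cube_bil = raw.reshape(I, N, J).transpose(0, 2, 1)  # row, band, col -> row, col, band
cube_bip = raw.reshape(I, J, N)                     # already row, col, band

print(cube_bsq.shape == cube_bil.shape == cube_bip.shape == (I, J, N))  # True
```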
Image Space and Spectral Space

An image data set is a data cube with three variables: two spatial variables (x, y) and a spectral variable λ.

Spatial space displays brightness as a function of (x, y) for a given λ (band). Neighboring pixels in spatial space are likely from the same ground cover type, so spatial correlation is high. However, data of the same class may not connect to each other; they can be distributed in different places, with some within-class variations.

Spectral space displays brightness as a function of λ for a given (x, y) (pixel). Generally speaking, pixels from different ground cover types do not have similar curves in spectral space. Pixels from the same class generally have their curves next to each other. Shadow can be an issue, causing the spectra from the same class to be separated from each other.
Features and Feature Space

A feature is used to describe an object (or a material), such as texture, shape, size, and color. For example, color can be defined quantitatively by the measurements of the reflected electromagnetic wave at the blue, green, and red wavelengths. The measurements of the blue band, green band, and red band can then be called features of an object. Measurements beyond the visible spectrum are made with multispectral and hyperspectral remote sensing. They often reveal spectral characteristics of various materials that can be used to describe those materials. Therefore, a band in multispectral or hyperspectral imagery is often called a feature.

Feature space is formed by the wavebands, where two wavebands produce a 2D space. (Cases with more than three bands are hard to visualize.) The concept is often illustrated with the two-band case, as shown below.

[Figure: pixels of several ground cover types plotted in a two-band feature space, forming separate groups.]

The pixels from the same ground cover type have similar reflectance in each band. Thus, they are close to each other in feature space. An illustration in feature space can be used to explain machine-learning algorithms for pixel recognition.
Pixel Vector, Image Matrix, and Data Set Tensor

Feature-space plots are often used to display two bands of data as an illustration. For high-dimensional cases, the full measurements of N bands cannot be plotted and visualized. They are denoted as a vector x for each pixel:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$

A matrix can be used to denote an image of a given band n:

$$X_n = \begin{bmatrix} x_{1,1,n} & x_{1,2,n} & \cdots & x_{1,J,n} \\ x_{2,1,n} & x_{2,2,n} & \cdots & x_{2,J,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{I,1,n} & x_{I,2,n} & \cdots & x_{I,J,n} \end{bmatrix}$$

where I and J are the number of rows and columns of the image, respectively. A tensor is used to denote the data set:

$$\mathcal{T} = \{x_{i,j,n}\} \in \mathbb{R}^{I \times J \times N}$$

The operations of vector, matrix, and tensor are used to implement pattern recognition and deep-learning algorithms.
Cluster Space

Clusters, i.e., groups of similar pixel vectors, can be generated by a clustering algorithm. They can be associated with defined information classes. A plot of the cluster index versus the number of labeled data (or data of the scene) is called a cluster-space representation.

[Figure: an example of 2D feature space with three classes and nine clusters, and the corresponding 1D cluster-space representation plotting the number of pixels against cluster indices 1 to 9.]

Advantages of cluster-space representation:
- This representation is a 2D plot, independent of the number of bands.
- The class data can be modeled by multiple clusters, with a discrete, quantitative description of their distribution.
- Hybrid supervised and unsupervised classification can be performed in cluster space.
- It can also be used to speed up k-nearest neighbor (KNN) classification.
Image Clustering and Segmentation
Otsu's Method

Otsu's method is a simple and automatic process for two-class gray-image clustering. The method assumes that the image contains two classes, and it performs well if the histogram is close to a bimodal distribution. Otsu's method gives an intensity threshold that separates the pixels of a grayscale image into two classes. The threshold is optimal in the sense that it maximizes the between-class variance, a well-known measure used in statistical discriminant analysis.

Steps of Otsu's method
Step 1: Give an initial threshold value of k = 0; label the data into Class 1 (≤ k) or Class 2.
Step 2: Find the numbers of pixels in Class 1 and Class 2, n1 and n2.
Step 3: Calculate the mean values of Class 1 and Class 2, m1 and m2.
Step 4: Calculate the between-class variance:

$$\sigma_b(k) = \frac{n_1 n_2 (m_1 - m_2)^2}{(n_1 + n_2)/2}$$

Step 5: Set k = k + 1 and repeat Steps 1 to 4, until k = L − 1 (the maximum intensity value).
Step 6: Obtain the Otsu threshold k*, for which σb(k*) > σb(k) ∀ k ≠ k*.

Otsu's method can be extended to more than two classes. This method is too time consuming for an image with multiple bands.
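A direct implementation of these steps (illustrative sketch, not the book's code; the bimodal test data are synthetic):

```python
import numpy as np

def otsu_threshold(img, bits=8):
    """Otsu's method: scan all thresholds k and keep the one that
    maximizes the between-class variance of the two groups."""
    levels = 2 ** bits
    h = np.bincount(img.ravel(), minlength=levels).astype(float)
    best_k, best_var = 0, -1.0
    for k in range(levels - 1):
        n1, n2 = h[: k + 1].sum(), h[k + 1 :].sum()
        if n1 == 0 or n2 == 0:
            continue
        m1 = (np.arange(k + 1) * h[: k + 1]).sum() / n1
        m2 = (np.arange(k + 1, levels) * h[k + 1 :]).sum() / n2
        var_b = n1 * n2 * (m1 - m2) ** 2 / ((n1 + n2) / 2)  # between-class variance
        if var_b > best_var:
            best_k, best_var = k, var_b
    return best_k

# Bimodal stand-in data: a dark background plus brighter objects
rng = np.random.default_rng(4)
img = np.clip(np.concatenate([rng.normal(60, 10, 3000),
                              rng.normal(180, 12, 2000)]), 0, 255).astype(np.uint8)
print(otsu_threshold(img))  # a threshold between the two modes (~120)
```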
Clustering Using the Single-Pass Algorithm

Within an image, some pixels (areas) are similar and belong to the same category. These are called data clusters in unsupervised processing. The identification of clusters is based on the similarity measure used and the similarity threshold selected. The Euclidean distance between the pixels' spectral vectors xi and xj is often used to measure their similarity:

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^t (\mathbf{x}_i - \mathbf{x}_j)}$$

Single-pass algorithm
Initialization: Set a threshold T and select a representative subset of pixels from the image to make the algorithm efficient.
Step 1: Set the first pixel's vector x1 as the Cluster 1 mean vector m1: m1 = x1.
Step 2: Examine the second pixel vector x2. If x2 is close to x1, i.e., d(x1, x2) < T, then this pixel is labeled Cluster 1, and the mean vector of Cluster 1 is updated: m1 = (x1 + x2)/2. Otherwise, a new cluster, Cluster 2, is generated with m2 = x2.
Step 3: Examine each subsequent pixel's Euclidean distance to every generated cluster mean vector and repeat Step 2. When all represented pixels have been examined, a set of cluster mean vectors has been formed that can be used to label every pixel in the image using a minimum-distance classifier.
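A sketch of the single-pass algorithm (illustrative; the pixel vectors and threshold are hypothetical):

```python
import numpy as np

def single_pass_cluster(pixels, T):
    """One sweep through the pixel vectors: assign each to the nearest
    existing cluster if within distance T, else start a new cluster."""
    means, counts, labels = [pixels[0].astype(float)], [1], [0]
    for x in pixels[1:]:
        d = [np.linalg.norm(x - m) for m in means]
        j = int(np.argmin(d))
        if d[j] < T:
            counts[j] += 1
            means[j] += (x - means[j]) / counts[j]   # running mean update
            labels.append(j)
        else:
            means.append(x.astype(float)); counts.append(1)
            labels.append(len(means) - 1)
    return np.array(labels), np.array(means)

pixels = np.array([[10, 12], [11, 13], [80, 82], [79, 80], [12, 11]], dtype=float)
labels, means = single_pass_cluster(pixels, T=20.0)
print(labels)  # [0 0 1 1 0]: two clusters found in a single pass
```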
Clustering Using the k-Means Algorithm

The objective of clustering is to minimize the sum of the distances from each pixel to its cluster center over all C clusters generated:

$$\min(D_{\mathrm{sum}}) = \min\left(\sum_{i=1}^{C}\sum_{\mathbf{x} \in c_i} (\mathbf{x} - \mathbf{m}_i)^t (\mathbf{x} - \mathbf{m}_i)\right)$$

The single-pass algorithm is simple and fast. However, without iteration, it is unlikely to achieve the defined objective. An improved heuristic algorithm is called k-means clustering, or ISODATA.

k-means clustering algorithm
Step 1: Specify the number of clusters C to generate, and randomly select C points (vectors) in feature space as candidate cluster mean vectors: m̂i, i = 1, 2, ..., C.
Step 2: Examine each pixel vector in the image data (or the represented pixels) and assign it to the nearest candidate cluster based on the Euclidean distance.
Step 3: Compute new cluster means using the current members of each cluster, denoted as mi, i = 1, 2, ..., C.
Step 4: If mi ≈ m̂i, i = 1, 2, ..., C, go to Step 5. Otherwise, set m̂i = mi, i = 1, 2, ..., C, and go to Step 2.
Step 5: Label every pixel in the image.

The initial C points are better spread out in the feature space. A good guess of the cluster centers can speed up the procedure to reach the convergence condition stated in Step 4 more quickly.
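The steps can be sketched as follows (illustrative code, not from the book; the two synthetic clusters are stand-ins):

```python
import numpy as np

def kmeans(pixels, C, n_iter=100, seed=0):
    """k-means on pixel vectors of shape (n_pixels, n_bands)."""
    rng = np.random.default_rng(seed)
    means = pixels[rng.choice(len(pixels), C, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each pixel to the nearest candidate mean
        d = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute the cluster means from current members
        new_means = np.array([pixels[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(C)])
        if np.allclose(new_means, means):   # Step 4: convergence test
            break
        means = new_means
    return labels, means

rng = np.random.default_rng(5)
pixels = np.vstack([rng.normal(30, 3, (50, 2)), rng.normal(90, 3, (50, 2))])
labels, means = kmeans(pixels, C=2)
print(np.round(means))  # near [30, 30] and [90, 90]
```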
Clustering Using the k-Means Algorithm Example

[Figure: an original RGB image and the six clusters generated using the k-means algorithm, each displayed in a given color.]

A cluster is a group of similar pixels, displayed in a given color; the pixels are not necessarily spatially connected. An information class may be associated with multiple clusters, or a cluster may be a combination of more than one class. This association is affected by the number of clusters generated and the class data distributions. The number of clusters to apply is a trade-off decision that depends on the needs of the individual application. Post-processing for cluster merging or splitting may be required.
Superpixel Generation Using SLIC

A superpixel is a group of spatially connected, similar pixels that can be generated via over-segmentation. Superpixels are formed to save computational cost and to reduce noise/detail. The shape of superpixels is often irregular, and the size can vary greatly if, say, the watershed segmentation algorithm is applied. Simple linear iterative clustering (SLIC) can be used to generate superpixels with a similar shape and size.

SLIC algorithm
Step 1: Specify the number of superpixels nsp to be generated. Select nsp points uniformly across the image on a regular grid spaced s units apart, where

$$s = \mathrm{round}\!\left(\sqrt{n_{\mathrm{total}}/n_{\mathrm{sp}}}\right)$$

and ntotal is the total number of pixels.
Step 2: The data of these points, including both the intensity vector g and the spatial coordinates (x, y), serve as the initial superpixel mean vectors:

$$\hat{\mathbf{m}}_i = [\mathbf{g}_i, x_i, y_i], \quad i = 1, 2, \ldots, n_{\mathrm{sp}}$$

Step 3: For each superpixel center, compute the combined spectral (dc) and spatial (ds) Euclidean distance D between the center and each pixel in a 2s × 2s neighborhood, and assign each pixel a superpixel label based on the minimum distance:

$$D = \left[\left(\frac{d_c}{c}\right)^2 + \left(\frac{d_s}{s}\right)^2\right]^{1/2}$$

c is selected to normalize dc and can also be used to weigh the relative importance of the two terms.

SLIC considers both spatial distance and spectral distance when forming superpixels. Post-processing is performed to ensure each segment's connectivity by relabeling disjoint pixels.
Pixel-Level Supervised Classification
Supervised Classification Procedure

The advantage of a digital remote sensing image is that it can help us understand the landcover and perform quantitative Earth observation at the pixel level. To do so, pattern-recognition algorithms via machine learning are applied to label each pixel with an information class, for example, 'w' for water and 's' for soil. The classes are often color coded. This process is referred to as image classification or pixel classification. A thematic map is then generated from the raw data to provide the user-required information.

[Diagram: raw data → machine learning → thematic map of pixel labels.]

Supervised classification relies on three inputs from an analyzer:
1. Identified and defined information classes in the image scene to classify, such as grassland, bare soil, and water. An exhaustive list of classes is expected, or some pixels in the image may be left undetermined.
2. Prototype pixels in the image, based on partial ground truth learned from site visits, photointerpretation, or auxiliary information sources, to represent each class. These pixels are used as training samples to train a classifier.
3. An appropriate model to describe the difference between the classes, which is associated with the choice of classifier, for example, support vector machines.

Then, the corresponding class signatures, often in the form of statistics or discriminant functions, are estimated using training data. The computer then labels every pixel in the image based on the classification rule of the selected classifier.
Prototype Sample Selection

When a supervised classification technique is applied, information classes are defined based on the interest of the individual application. The classes can be named according to their materials, for example: bare soil, water, green vegetation. They can also be named based on their functions, for example: lake, ocean, crop field, buildings.

To achieve thematic mapping, i.e., a classification map of the defined classes, prototype samples of each class must be identified manually to guide the machine learning. These prototype samples, identified by the analyst, are also called labeled data or annotated data. The samples can be identified based on local knowledge, site visits, information from other data sources, and/or remote sensing data characteristics. They also form regions of known classes, as shown below.

[Figure: an image with outlined prototype regions for soil, grass, and water, and each pixel's measurements across Band 1 to Band N.]
Training Samples and Testing Samples

Each prototype sample is a vector of the reflectance in all collected bands. In feature space, we can see that the samples from the same class, regardless of their spatial locations, often fall in similar positions and form a group, as they have similar brightness values in each band.

Classification is performed to find discriminant rules/boundaries that separate the classes. To do so, each class data set needs to be described, i.e., modeled, to specify its key properties. For example, a class data set can be basically described by the maximum and minimum values of each band, or by the mean value of the class data in each band. Some of the labeled samples are used for model training and are referred to as the training data set. Some of them are used for model testing and are referred to as the testing data set. Testing data are not involved in model development. A validation data set may be required as well, for model parameter tuning. More training samples often represent a class data set better, including its within-class variation, which is critical for classification.
Minimum Euclidean Distance Classifier

A given landcover class has spectral reflectance characteristics, which are measured as brightness at each waveband and denoted as a pixel vector x, x ∈ R^N, where N is the number of bands. Generally speaking, pixels belonging to the same class have about the same reflectance, as it comes from the same materials. The Euclidean distance between pixel vectors can therefore be applied to measure their similarity.

To apply minimum-distance classification, each class is modeled, i.e., represented, by the mean value of each spectral band; together, these form a mean vector. Let mi, i = 1, ..., M, where M is the number of candidate classes, denote the mean vector of class i in the image. It is estimated from the class's training samples using

$$\mathbf{m}_i = \frac{1}{K_i}\sum_{k=1}^{K_i} \mathbf{x}_k$$

where Ki is the number of training samples for class i. A pixel vector x is then labeled by finding the shortest distance to one of the class means, defined as follows:

$$\mathbf{x} \in \omega_i, \ \text{if}\ d(\mathbf{x}, \mathbf{m}_i) < d(\mathbf{x}, \mathbf{m}_j)\ \text{for all}\ j \neq i$$

where the distance is often measured by the popular Euclidean distance:

$$d(\mathbf{x}, \mathbf{m}_i) = \sqrt{(\mathbf{x} - \mathbf{m}_i)^t(\mathbf{x} - \mathbf{m}_i)}$$

The square-root operation can be removed for a relative comparison when labeling pixels. The minimum-distance classifier is a basic classification method that assumes all defined classes have the same within-class variance.
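A minimal sketch of the classifier (illustrative; the two-band training samples are hypothetical):

```python
import numpy as np

def class_means(samples, labels, M):
    """Mean vector of each class from its training samples."""
    return np.array([samples[labels == i].mean(axis=0) for i in range(M)])

def min_distance_classify(pixels, means):
    """Label each pixel with the class of the nearest mean vector
    (squared Euclidean distance; the square root is not needed)."""
    d2 = ((pixels[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical two-band training data for M = 2 classes
train = np.array([[10, 50], [12, 52], [60, 20], [62, 18]], dtype=float)
labels = np.array([0, 0, 1, 1])
means = class_means(train, labels, M=2)
print(min_distance_classify(np.array([[11.0, 49.0], [58.0, 22.0]]), means))  # [0 1]
```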
Spectral Angle Mapper Pixels belonging to the same class are sometimes not illuminated in the same condition. For example, some appear in the shaded area. In this case, it is not reasonable to expect their brightness vectors to be about the same; instead, scale variations are often observed. The Spectral Angle Mapper was developed to classify pixels in an image, using the cosine similarity measure to make it scale invariant: x t mi ; i 5 1; 2; : : : ; M sðx; mi Þ 5 pffiffiffiffiffiffiffiffiqffiffiffiffiffiffiffiffiffiffiffiffi xt x mti mi where x is the pixel vector to classify, mi is the mean vector of class i, and M is the number of classes defined. The maximum value is 1, corresponding to zero spectral angle, which is treated as the highest similarity: x P vi ; if sðx; mi Þ . sðx; mj Þ for all j ≠ i A 2D example with two pixels and two classes: Band 2 Brightness
[Figure: two class mean vectors $m_1$ and $m_2$ and two pixel vectors $x_1$ and $x_2$ plotted against Band 1 brightness and Band 2 brightness.]
$$s(x_1, m_1) = s(x_2, m_1) = 0.998$$
$$s(x_1, m_2) = s(x_2, m_2) = 0.984$$
Both pixels will be labeled as belonging to Class 1. If the Pearson correlation (the cosine similarity between the centered $x$ and $m_i$) is used, the similarity measure is invariant to both shift and scale changes.
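The cosine-similarity labeling can be sketched in NumPy as follows (illustrative names; assumes nonzero pixel and mean vectors):

import numpy as np

def sam_classify(X, means):
    """Label each pixel by the largest cosine similarity (smallest
    spectral angle) to the class means. X: (P, N); means: (M, N)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Mn = means / np.linalg.norm(means, axis=1, keepdims=True)
    s = Xn @ Mn.T            # s[p, i] = cosine between x_p and m_i
    return np.argmax(s, axis=1)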
Spectral Information Divergence
Similarity between two pixel vectors, or between a pixel vector and a class mean vector, can be measured using different criteria, including Euclidean distance, spectral angle, and Pearson correlation, as well as spectral information divergence (SID), which is based on information theory. Consider two spectral vectors:
$$x = [x_1, x_2, x_3, \ldots, x_N]^t, \quad y = [y_1, y_2, y_3, \ldots, y_N]^t$$
where $N$ is the number of bands. SID is defined based on cross-entropy as
$$\mathrm{SID}(x, y) = \sum_{n=1}^{N} p(x_n) \log \frac{p(x_n)}{p(y_n)} + \sum_{n=1}^{N} p(y_n) \log \frac{p(y_n)}{p(x_n)}$$
where
$$p(x_n) = x_n \Big/ \sum_{n=1}^{N} x_n, \quad p(y_n) = y_n \Big/ \sum_{n=1}^{N} y_n$$
This similarity measure can be used for classification and clustering. Highly similar pixels are recognized as belonging to the same group or class. For example, the minimum distance classification and the k-nearest neighbor algorithm can be applied with this distance measure.
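A small NumPy sketch of SID between two spectra (the eps guard against log(0) is an added implementation choice, not from the text):

import numpy as np

def sid(x, y, eps=1e-12):
    """Spectral information divergence between two spectral vectors."""
    p = x / x.sum()          # normalize each spectrum so it sums to 1
    q = y / y.sum()
    p, q = p + eps, q + eps  # guard against log(0)
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))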
Class Data Modeling with a Gaussian Distribution
A landcover type can be relatively pure, for example, a body of water, or a typical mix of materials or conditions, such as soil and forest. Generally speaking, different landcover classes may have different degrees of variation, and the reflectance measured by an optical sensor can vary more for some classes than for others. In this case, a minimum-distance classifier will not suffice, as it does not consider the data distribution. A Gaussian distribution (also called normal distribution) model is often used to describe a class. The figure below shows an example of the class-$i$ training data's histogram in the 1D case (black line). Its mean value and standard deviation can be estimated. These values are used as the parameters of a corresponding Gaussian distribution (purple line) to model the class data, assuming that the class data distribution is normal, which is valid in many cases.
[Figure: histogram $h(x)$ of the class-$i$ training data, with the fitted Gaussian curve centered at $m_i$ and with standard deviation $\sigma_i$.]
For multispectral or hyperspectral data, the multivariate normal-distribution function is given by
$$p(x|\omega_i) = (2\pi)^{-N/2} |\Sigma_i|^{-1/2} \exp\left[-\frac{1}{2}(x - m_i)^t \Sigma_i^{-1} (x - m_i)\right]$$
where $N$ is the number of spectral bands, and $m_i$ and $\Sigma_i$ are the mean vector and covariance matrix estimated using the training samples, respectively.
Mean Vector and Covariance Matrix Estimation
The mean vector and covariance matrix are estimated by
$$m_i = \frac{1}{K_i} \sum_{k=1}^{K_i} x_k, \quad \Sigma_i = \frac{1}{K_i - 1} \sum_{k=1}^{K_i} (x_k - m_i)(x_k - m_i)^t$$
where $x_k$ is a training sample, and $K_i$ is the number of training samples of class $i$. The mean vector has a length of $N$. The covariance matrix is an $N \times N$ symmetric matrix. To reliably estimate the covariance matrix, $K_i \gg N$ (say, at least $10N$) is required. In theory, this matrix is not invertible if $K_i < N + 1$.
The covariance matrix describes the shape and size of the normal distribution. The mean vector describes its position (offset from the origin). The 2D examples below show that Class B has small variances in both measurements, Class A has low correlation, and Class C has high correlation between the two variables:
$$m_A = \begin{bmatrix} 80 \\ 190 \end{bmatrix}, \quad m_B = \begin{bmatrix} 30 \\ 20 \end{bmatrix}, \quad m_C = \begin{bmatrix} 200 \\ 190 \end{bmatrix}$$
$$\Sigma_A = \begin{bmatrix} 120 & 1 \\ 1 & 100 \end{bmatrix}, \quad \Sigma_B = \begin{bmatrix} 30 & 20 \\ 20 & 25 \end{bmatrix}, \quad \Sigma_C = \begin{bmatrix} 169 & 120 \\ 120 & 100 \end{bmatrix}$$
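As a NumPy sketch (illustrative names), the per-class statistics can be estimated as:

import numpy as np

def class_statistics(samples):
    """Estimate the mean vector and covariance matrix of one class.
    samples: (K_i, N) training pixel vectors of class i."""
    m = samples.mean(axis=0)
    # rowvar=False treats columns as variables (bands); np.cov uses
    # the unbiased 1/(K_i - 1) normalization, as in the text.
    S = np.cov(samples, rowvar=False)
    return m, S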
Gaussian Maximum-Likelihood Classification
To label a pixel vector $x$ using the maximum-likelihood classification technique, we need the probability that class $i$ is the correct class, $p(\omega_i|x)$. Based on Bayes' theorem,
$$p(\omega_i|x) = p(x|\omega_i)\,p(\omega_i) / p(x)$$
where $p(\omega_i)$ is the probability (percentage) of class $i$ in the image; $p(x)$ is the probability of pixel vector $x$ in the image, which is not a function of the classes; and $p(x|\omega_i)$ is the probability of finding $x$ in class $i$, which is estimated from training samples: a Gaussian distribution is often assumed, and the mean vector $m_i$ and covariance matrix $\Sigma_i$ are estimated using the training data. After taking the natural logarithm for convenience and removing the terms common to all classes, the relative probability that class $i$ is the correct class for a given pixel $x$ is obtained:
$$g_i(x) = \ln p(\omega_i) - \frac{1}{2} \ln |\Sigma_i| - \frac{1}{2}(x - m_i)^t \Sigma_i^{-1} (x - m_i)$$
This is called the discriminant function for maximum-likelihood classification. The decision rule is
$$x \in \omega_i \quad \text{if } g_i(x) > g_j(x) \text{ for all } j \neq i$$
A Gaussian distribution is adopted to model each class's distribution. If this assumption is invalid, the classification performance can be compromised. The model parameters may not be reliable if the quality and quantity of the training data set are low. A threshold may be applied to the discriminant function to prevent pixels that belong to none of the defined classes from being labeled incorrectly.
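A minimal NumPy sketch of the discriminant (slogdet and solve are used for numerical stability; names are illustrative):

import numpy as np

def ml_discriminant(x, mean, cov, prior):
    """g_i(x) for one class under the Gaussian assumption."""
    sign, logdet = np.linalg.slogdet(cov)      # stable ln|Sigma_i|
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)   # (x-m)^t Sigma^{-1} (x-m)
    return np.log(prior) - 0.5 * logdet - 0.5 * quad

# A pixel is assigned to the class with the largest g_i(x).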
Other Distribution Models
Sometimes the original data's distribution has a long tail. In this case, it is possible to use a lognormal distribution to model the data approximately. After introducing $z = \log(x)$, the class data can be classified using the Gaussian maximum-likelihood technique.
[Flow: original data $x$ with a lognormal distribution → $z = \log(x)$ → new data $z$ with a normal distribution $N(m_z, \Sigma_z)$ → apply the discriminant function for Gaussian maximum-likelihood classification → results.]
Here, $x$ is the original brightness value of a band, which is always positive; therefore, applying the logarithm operation is not a problem. Training data selection and normal-distribution statistics estimation are then performed on the new data vectors $z$, where the original spatial locations are attached and mapped. If the original class data distribution has multiple peaks, a Gaussian mixture model may be adopted:
$$p(x|\omega_i) = \sum_{k=1}^{K} a_k N(m_k, \Sigma_k), \quad \text{with } \sum_{k=1}^{K} a_k = 1$$
The parameters are estimated iteratively using training samples with the expectation-maximization (EM) algorithm. Bayes’ theorem can then be applied.
Mahalanobis Distance and Classifier
Unlike the Euclidean distance, which measures the similarity between two pixel vectors, the Mahalanobis distance measures the similarity of a pixel vector $x$ to a group of data (a distribution) $C$, defined by the mean vector $m$ and covariance matrix $\Sigma$, as
$$d(x, C)^2 = (x - m)^t \Sigma^{-1} (x - m)$$
This measure can be used for classification and is often seen as an extended Euclidean distance classifier or a degenerate case of Gaussian maximum-likelihood classification. When the $M$ defined classes are assumed to have equal prior probability and the same class covariance:
$$p(\omega_i) = p(\omega), \quad \Sigma_i = \Sigma, \quad i = 1, \ldots, M$$
the maximum-likelihood discriminant function
$$g_i(x) = \ln p(\omega_i) - \frac{1}{2} \ln |\Sigma_i| - \frac{1}{2}(x - m_i)^t \Sigma_i^{-1} (x - m_i)$$
becomes
$$g_i(x) = -(x - m_i)^t \Sigma^{-1} (x - m_i)$$
after the constant values that are not class dependent are ignored. Furthermore, if the distribution of every class is assumed to be isotropic, i.e., $\Sigma = \sigma^2 I$, where $I$ is an identity matrix, then
$$g_i(x) = -(x - m_i)^t (x - m_i)$$
This is the Euclidean distance classifier.
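A minimal NumPy sketch of the squared Mahalanobis distance (illustrative names):

import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance from pixel x to a class (m, Sigma)."""
    diff = x - mean
    return diff @ np.linalg.solve(cov, diff)

# Labeling rule: assign x to the class with the smallest distance,
# using a shared covariance matrix for all classes.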
k-Nearest Neighbor Classification
The k-nearest neighbor (KNN) classification is another supervised classification method. Its advantage is that there is no need to model the class data distribution. To label a pixel vector $x$, its surrounding training pixels in feature space are examined to determine the class to which the majority of the neighbors belong. The pixel is then believed to belong to that class.
Step 1: Find the nearest $k$ neighbors in the training data set by applying the Euclidean distance measure in feature space. Or, find all neighbors within a specified distance.
Step 2: Examine the labels of these neighbors.
Step 3: Label the pixel based on a majority vote.
Each neighbor can be treated equally or weighted (discounted) by its Euclidean distance to the pixel to be labeled.
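These steps can be sketched in NumPy as follows (illustrative names; integer class labels are assumed):

import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    """Label pixel x by a majority vote of its k nearest training samples.
    train_X: (K, N) training vectors; train_y: (K,) integer labels."""
    d2 = ((train_X - x) ** 2).sum(axis=1)      # squared distances
    nearest = np.argsort(d2)[:k]               # indices of the k nearest
    votes = np.bincount(train_y[nearest])      # count labels among them
    return np.argmax(votes)                    # majority class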
Below is an example where $k = 5$. The unknown pixel vector marked as a black triangle will be labeled as the light purple class as a result of a majority vote.
[Figure: training pixels of several classes plotted against Band 1 (Feature 1) brightness and Band 2 (Feature 2) brightness, with the unknown pixel shown as a black triangle among its five nearest neighbors.]
Finding the nearest neighbors is computationally expensive, especially for hyperspectral data, where the number of bands is high. However, many pixels are surrounded by training pixels from the same class, and their labeling is straightforward. The critical pixels are those located around class boundaries in feature space. A better classifier, the support vector machine, focuses on the class boundaries and is more efficient and effective.
Support Vector Machines
The support vector machine (SVM) is another supervised classification method, one that focuses on the training samples around the class boundaries. Those samples are called support vectors and are used to define the class boundaries (the two thin lines in the figure below). The optimal separating hyperplane (a line in the 2D case) lies in the middle of the two class boundaries (the thicker line) and is used as the decision boundary for classification.
[Figure: two linearly separable classes plotted against Band 1 (Feature 1) brightness and Band 2 (Feature 2) brightness; two thin lines through the support vectors mark the class boundaries, and the thicker line between them is the optimal hyperplane, i.e., the decision boundary.]
This decision hyperplane is defined as
$$w^t x + b = 0$$
where $w$ is a weight vector, and $b$ is a bias; these parameters are obtained by solving a quadratic programming problem using the training samples as input. For an input vector $x$:
$$x \in \omega_r \quad \text{if } w^t x + b < 0$$
$$x \in \omega_l \quad \text{if } w^t x + b > 0$$
The SVM is a binary classifier for the class on the left $\omega_l$ and the class on the right $\omega_r$ in the feature space. For three or more classes, an SVM is trained on each class pair (one-vs-one) or on each class against the rest of the classes (one-vs-rest). A consensus is used to label each pixel vector.
Nonlinear Support Vector Machines
When the data of two classes are not linearly separable, the data can be mapped to a higher-dimensional space by applying a kernel transform so that the problem can be converted into a linear one. Denote a mapping function as $\phi$ that maps the linearly inseparable data $X$ into a higher-dimensional feature space $F$, in which the data are possibly linearly separable:
$$\phi: X \mapsto F$$
Several kernel functions are available to use:
• Linear: $x_i^t x_j + c$
• Polynomial: $(x_i^t x_j + c)^d$
• Gaussian radial basis function: $\exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
• Laplacian: $\exp\left(-\dfrac{\|x_i - x_j\|}{\sigma}\right)$
• Sigmoid: $\tanh(x_i^t x_j + c)$
where $x_i$, $x_j$ are the $i$th and $j$th training samples, respectively. Kernel selection and kernel parameter selection require an understanding of the data's nonlinearity, as well as experience. Trial-and-error and cross-validation are often applied during this process.
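The kernel values can be sketched in NumPy as follows (the parameters c, d, and sigma are illustrative; the Laplacian is written here with the Euclidean norm, though some definitions use the L1 norm):

import numpy as np

# Kernel values k(x_i, x_j) for two sample vectors xi and xj.
def linear(xi, xj, c=0.0):
    return xi @ xj + c

def polynomial(xi, xj, c=1.0, d=2):
    return (xi @ xj + c) ** d

def rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplacian(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def sigmoid(xi, xj, c=0.0):
    return np.tanh(xi @ xj + c)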
Handling Limited Numbers of Training Samples
Semi-Supervised Classification
In remote sensing, ground-truth collection can be expensive and difficult. Often, only a limited number of labeled samples are available for model development. Limited training samples often do not provide complete information and reliable statistics for each class. To overcome this problem, semi-supervised classification can be employed.
Step 1: Use the original training data set to build an initial classification model. If the Gaussian maximum-likelihood method is used, the mean vector $m_i$ and covariance matrix $\Sigma_i$ of each class are estimated, $i = 1, 2, \ldots, M$, where $M$ is the number of classes defined.
Step 2: Use the model to find the discriminant values of unknown data $x$:
$$g_i(x) = \ln p(\omega_i) - \frac{1}{2} \ln |\Sigma_i| - \frac{1}{2}(x - m_i)^t \Sigma_i^{-1} (x - m_i)$$
Step 3: Normalize:
$$p_i(x) = g_i(x) \Big/ \sum_{j=1}^{M} g_j(x)$$
Step 4: Set a threshold $T_H$ close to 1. If
$$\max_i [p_i(x)] > T_H$$
the probability that $x$ belongs to the corresponding class is very high, and the label has high certainty. The sample is then selected as a training sample for that class.
Step 5: Run for more unknown data.
Step 6: Update the class statistics and the model.
In this way, more training data are selected automatically to improve the class model. In addition, the spatial neighbors of the original training samples very likely belong to the same class and can be considered, together with spectral similarity, when deciding whether they can be used as training samples.
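A sketch of the selection step (Steps 2–4) in Python, following the normalization recipe of Step 3 (illustrative names; g_funcs stands for the per-class discriminant functions built in Step 1):

import numpy as np

def select_confident(unlabeled_X, g_funcs, thresh=0.95):
    """Pick unlabeled pixels whose normalized discriminant exceeds thresh."""
    new_X, new_y = [], []
    for x in unlabeled_X:
        g = np.array([g_i(x) for g_i in g_funcs])
        p = g / g.sum()                 # Step 3 normalization from the text
        if p.max() > thresh:            # Step 4: high-certainty label
            new_X.append(x)
            new_y.append(int(p.argmax()))
    return np.array(new_X), np.array(new_y)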
Active Learning
Active learning aims to make training-sample collection efficient. The class model is established initially with the original limited training data set. More training data are collected based on the needs of model refinement.
Step 1: Use the original training data set to build an initial classification model. If the Gaussian maximum-likelihood method is used, the mean vector $m_i$ and covariance matrix $\Sigma_i$ of each class are estimated, $i = 1, 2, \ldots, M$, where $M$ is the number of classes defined.
Step 2: Use the model to find the discriminant values of unknown data $x$:
$$g_i(x) = \ln p(\omega_i) - \frac{1}{2} \ln |\Sigma_i| - \frac{1}{2}(x - m_i)^t \Sigma_i^{-1} (x - m_i)$$
Step 3: Normalize:
$$p_i(x) = g_i(x) \Big/ \sum_{j=1}^{M} g_j(x)$$
Step 4: Set a threshold $T_L$ to a small value. If
$$\max_i [p_i(x)] < T_L$$
the probability that $x$ belongs to the corresponding class is very low, and the label has high uncertainty. A request (query) for its true label is then raised. After the label is found from the ground truth, the sample is added to the current training data set for that class.
Step 5: Run for more unknown data.
Step 6: Update the class statistics and the model.
In this way, more training data are collected at a cost. It is, however, a valuable expense, as those newly labeled data are critical for class-model refinement.
Transfer Learning
In machine learning, a classification model is often trained using the current training samples for a given site and data set. The model works poorly for a new task using data collected at a different location, on a different date, or with a different sensor. Performing the training from scratch each time requires a large amount of training data, which is expensive and time consuming. To overcome this problem, transfer learning has been developed to leverage a related pretrained model in the source domain. Adjustment is conducted to accommodate the variation in the target domain so that the model is updated for the new task, through domain adaptation.
For a time-series data set, transfer learning can be conducted by leveraging the models that were pretrained on the data from previous dates. The model parameters for the current-date data can be predicted with regression techniques and fine-tuned using a small number of current training samples. Compared to its use in traditional machine learning, transfer learning is more important in deep learning, where the number of model parameters to train is massive. Often, the low-level layers of an existing architecture are adopted, and the high-level layers are fine-tuned using the current training samples for the new task.
Feature Reduction
The Need for Feature Reduction
Features that describe an object or class, for example, reflectance values at RGB bands (color), near- or mid-infrared bands, or textures, are used as input to a classifier for pattern recognition and classification. When the dimensionality of the input is high, class modeling can be biased with limited training samples. This situation leads to the Hughes phenomenon, i.e., overtraining. The trained model has low generalization capacity; it works for the training data but not for new data. This overtraining is also called the curse of dimensionality. It is important to provide good input features.
[Diagram: poor input (noisy, not useful, not relevant, not independent, not sensitive, not effective, …) fed to a very good classifier still produces poor output: garbage in, garbage out.]
Feature mining is an important task to find useful and effective features for class discrimination. It includes
• Feature selection
• Feature generation
• Feature extraction
• Feature learning
The approaches can be knowledge based or data driven. They can be supervised to address class separability or unsupervised in terms of statistics on data quality. Feature learning by a deep neural network is an alternative approach to find discriminant features for a given classification or target detection task.
Basic Band Selection
Knowledge based: Bands can be selected based on the spectral characteristics of the classes to be recognized. For example, to separate water from other classes, wavebands from around 1000 nm to 1700 nm are effective, as shown in the plot below.
[Plot: spectral reflectance curves illustrating the separability of water from other cover types around 1000–1700 nm.]
Wavebands around 1400 nm and 1900 nm are water absorption bands, providing little information about the Earth's surface. They are often removed.
Statistics based: Independent measurements are more informative. Highly correlated bands present redundancy and can be removed. Pearson's correlation coefficient shows the correlation between the measurements (data) of band $B_a$ and band $B_b$:
$$r(B_a, B_b) = \frac{\sum_{i=1}^{K} (B_{a,i} - \bar{B}_a)(B_{b,i} - \bar{B}_b)}{\sqrt{\sum_{i=1}^{K} (B_{a,i} - \bar{B}_a)^2 \sum_{i=1}^{K} (B_{b,i} - \bar{B}_b)^2}}$$
where $K$ is the number of pixels in each band. If the absolute value of the correlation is close to 1, the two bands' data are highly correlated and can be combined, or only one band may be kept. Adjacent bands are often highly correlated. Every second band may be removed to reduce redundancy while keeping high spectral resolution. If adjacent bands are combined (averaged), the signal-to-noise ratio is improved, but spectral resolution is compromised.
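A NumPy sketch for computing the band-to-band correlation matrix of an image cube (illustrative names):

import numpy as np

def band_correlation(cube):
    """Correlation between all band pairs.
    cube: (rows, cols, N) image; returns an (N, N) correlation matrix."""
    flat = cube.reshape(-1, cube.shape[-1])   # pixels as rows, bands as columns
    return np.corrcoef(flat, rowvar=False)

# Bands whose |r| with an already-kept band is close to 1 are
# candidates for removal.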
Mutual Information
Mutual information (MI) between the measurements of band $B_a$ and band $B_b$ is defined with the marginal probability distributions and the joint probability distribution as
$$I(B_a; B_b) = \sum_{x \in B_a} \sum_{y \in B_b} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
This equation measures general dependency, including nonlinear relationships. The value is bounded by the smaller entropy of the two variables. Therefore, it can be normalized to [0, 1] by
$$\hat{I}(B_a; B_b) = \frac{I(B_a; B_b)}{\min\{H(B_a), H(B_b)\}}$$
where the entropy $H(B)$ is given by
$$H(B) = -\sum_{x \in B} p(x) \log p(x)$$
Below are two examples with 100% mutual information but different degrees of correlation.
[Two scatter plots of Variable 2 versus Variable 1: left, $r = 1$, $\hat{I} = 1$; right, $r = 0.48$, $\hat{I} = 1$.]
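A histogram-based NumPy sketch of the normalized mutual information between two bands (the bin count and the zero-cell masking are implementation choices, not from the text):

import numpy as np

def normalized_mi(a, b, bins=32):
    """Normalized mutual information between two bands (any shape)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                  # joint probabilities p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals p(x), p(y)
    outer = px[:, None] * py[None, :]
    nz = pxy > 0                               # skip empty cells: 0 log 0 = 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / outer[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / min(entropy(px), entropy(py))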
Band Selection Based on Mutual Information
Mutual information can be applied to both numerical and nonnumerical variables. Mutual information between the class label $C$ (where the values of $C$ indicate the class categories) and the corresponding data of band $B$ can be used to measure the relevance between the band and the class labels:
$$I(B; C) = \sum_{x \in B} \sum_{y \in C} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
The best single band out of the total number of bands $N$ is selected by using
$$\hat{I}(B_i; C) > \hat{I}(B_j; C) \quad \text{for all } j \neq i, \; (i = 1, 2, \ldots, N)$$
A sequential search is then conducted to find the best subset. The best $(k+1)$th band is selected from the remaining $(N - k)$ bands so that it is highly relevant to the classes to be identified and weakly related to the $k$ bands already selected:
$$G_{k+1}(B_i) > G_{k+1}(B_j) \quad \text{for all } j \neq i$$
where
$$G_{k+1}(B_i) = \hat{I}(B_i; C) - \frac{1}{k} \sum_{s=1}^{k} \hat{I}(B_i; B_s)$$
A threshold may be introduced for the first term to ensure that the relevancy is adequate. A positive value for the combined measure may be required to avoid selecting a very noisy band. In this way, the search space is reduced.
[Diagram: candidate bands plotted by redundancy (horizontal axis, 0 to 1) versus relevancy (vertical axis, 0 to 1), with a relevancy threshold $T$ marked.]
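A sketch of the sequential search in Python, assuming the relevance values Î(B_i; C) and redundancy values Î(B_i; B_j) have been precomputed (names are illustrative; the stopping rules follow the thresholds described above):

import numpy as np

def greedy_band_selection(relevance, redundancy, n_select, thresh=0.0):
    """relevance: (N,) values of I_hat(B_i; C);
    redundancy: (N, N) values of I_hat(B_i; B_j)."""
    selected = [int(np.argmax(relevance))]          # best single band
    while len(selected) < n_select:
        best, best_g = None, -np.inf
        for i in range(len(relevance)):
            if i in selected or relevance[i] < thresh:
                continue                            # relevancy threshold T
            g = relevance[i] - redundancy[i, selected].mean()
            if g > best_g:
                best, best_g = i, g
        if best is None or best_g <= 0:             # avoid very noisy bands
            break
        selected.append(best)
    return selected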
Band Selection Based on Class Separability
For a given subset of bands, the class separability it provides can be evaluated quantitatively with a selected separability measure. An effective separability measure is the J–M distance $J_{ij}$, provided the class-$i$ and class-$j$ data can be assumed to have normal distributions, with their mean vectors and covariances estimated from the training samples:
$$J_{ij} = 2\left(1 - \exp(-B_{ij})\right)$$
where
$$B_{ij} = \frac{1}{8}(m_i - m_j)^t \left[\frac{\Sigma_i + \Sigma_j}{2}\right]^{-1} (m_i - m_j) + \frac{1}{2} \ln \frac{\left|(\Sigma_i + \Sigma_j)/2\right|}{|\Sigma_i|^{1/2}\,|\Sigma_j|^{1/2}}$$
$B_{ij}$ is called the Bhattacharyya distance, which always increases with increasing separability between the two class distributions. $J_{ij}$ saturates to the maximum value of 2 when $B_{ij}$ is large enough. This is an important property, especially when the averaged distance is used as an overall separability indicator over more than one class pair. For $M$ classes, the number of class pairs is
$${}^{M}C_2 = \frac{M!}{2!\,(M - 2)!}$$
For a subset of $n$ bands out of the total number of bands $N$, the number of subsets is
$${}^{N}C_n = \frac{N!}{n!\,(N - n)!}$$
The subset that gives the highest overall separability (or adequate separability) is selected for performing the classification task effectively.
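A NumPy sketch of the J–M distance between two Gaussian class models (slogdet is used for the log-determinants; names are illustrative):

import numpy as np

def jm_distance(m1, S1, m2, S2):
    """Jeffries-Matusita distance between two Gaussian class models."""
    S = (S1 + S2) / 2.0
    diff = m1 - m2
    term1 = diff @ np.linalg.solve(S, diff) / 8.0
    # 0.5 * ln( |S| / (|S1|^0.5 |S2|^0.5) ), via log-determinants
    term2 = 0.5 * (np.linalg.slogdet(S)[1]
                   - 0.5 * np.linalg.slogdet(S1)[1]
                   - 0.5 * np.linalg.slogdet(S2)[1])
    B = term1 + term2                     # Bhattacharyya distance
    return 2.0 * (1.0 - np.exp(-B))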
Knowledge-based Feature Extraction
Feature extraction is often achieved by band transformation to generate new features. A small subset of new features is expected to be highly informative and/or provide strong class-discrimination power. A simple feature-extraction example is the Normalized Difference Vegetation Index (NDVI), defined as
$$\mathrm{NDVI} = \frac{x_{nir} - x_{red}}{x_{nir} + x_{red}}$$
This simple transform (a ratio) is more effective for recognizing vegetation and measuring greenness than using the two separate measurements at the red and NIR bands. It is based on the fact that chlorophyll strongly absorbs light at red wavelengths, while the cellular structure of vegetation reflects light much more strongly at near-infrared wavelengths. Healthy vegetation will have an NDVI value close to 1. Water has the opposite spectral characteristics, which leads to a negative NDVI value. This relative measure is immune to daily changes in sensing conditions. Similarly, based on water's spectral characteristics, a water index, the NDWI, has been developed as
$$\mathrm{NDWI} = \frac{x_{green} - x_{nir}}{x_{green} + x_{nir}}$$
The example below shows a water-index image that highlights the flood areas.
[Images: the Landsat ETM green band, the Landsat ETM NIR band, and the resulting water-index image.]
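Both indices are simple per-pixel ratios; a NumPy sketch (the small eps guard against division by zero is an added assumption):

import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Per-pixel NDVI from NIR and red band arrays of the same shape."""
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-9):
    """Per-pixel NDWI; positive values indicate water."""
    return (green - nir) / (green + nir + eps)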
Data-Driven Approach for Feature Extraction
Countless data variations occur within an information class when the data are recorded in a noisy imaging environment. Unfortunately, knowledge-based methods are limited. A data-driven approach for feature extraction has been developed to make the new features more informative.
In a supervised approach, training data for the classes of interest are provided, and class statistics can be estimated. The extracted features are expected to have higher discriminant capacity and enhance class separability. The results are class sensitive. Examples of linear supervised feature extraction methods are
• Linear discriminant analysis
• Orthogonal subspace projection
• Adaptive matched filter
An unsupervised approach considers global data statistics without the need for training samples. The extracted features (e.g., with high signal-to-noise ratio) are of high quality and provide the best representation. Examples of unsupervised feature extraction methods are
• Band grouping
• Principal components transformation (PCT)
• Independent component analysis (ICA)
Feature (band) selection: the selected bands are still associated with their wavelengths. Feature extraction: the new features are no longer related to any wavelengths.
Linear Discriminant Analysis
Linear discriminant analysis (LDA) is based on Fisher's ratio and aims to generate a few new features that maximize the separability between the classes of interest. Specifically, LDA maximizes the ratio of among-class variance to within-class variance. Let $m_i$ and $\Sigma_i$ be the mean vector and covariance matrix of class $i$, $i = 1, 2, \ldots, M$, where $M$ is the number of classes. The average within-class covariance matrix can be estimated as
$$S_w = \frac{1}{M} \sum_{i=1}^{M} \Sigma_i$$
The among-class covariance matrix can be estimated as
$$S_a = \frac{1}{M - 1} \sum_{i=1}^{M} (m_i - m_0)(m_i - m_0)^t$$
where $m_0$ is the mean vector of all of the training data of every class:
$$m_0 = \frac{1}{M} \sum_{i=1}^{M} m_i$$
Let $\lambda_i$ be the eigenvalues of $S_w^{-1} S_a$ in descending order, $i = 1, 2, \ldots, M - 1$. Let $A$ be a matrix whose rows are formed from the eigenvectors of $S_w^{-1} S_a$, ordered so that the first row of $A$ is the eigenvector of the largest eigenvalue. The LDA transform is performed on the original data $x$ by
$$y = Ax$$
LDA generates $M - 1$ new features for an $M$-class classification task (as $S_a$ has a rank of $M - 1$), which is often much smaller than $N$, the original number of bands.
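A NumPy sketch of building the LDA transform from per-class statistics (illustrative names; real parts are taken because S_w^{-1} S_a is not symmetric):

import numpy as np

def lda_transform(means, covs):
    """means: (M, N) class mean vectors; covs: (M, N, N) class covariances.
    Returns the transform matrix A with M-1 rows; new features: y = A @ x."""
    M = means.shape[0]
    Sw = covs.mean(axis=0)                       # within-class covariance
    m0 = means.mean(axis=0)
    d = means - m0
    Sa = d.T @ d / (M - 1)                       # among-class covariance
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sa))
    order = np.argsort(evals.real)[::-1]         # descending eigenvalues
    return evecs.real[:, order[:M - 1]].T        # top M-1 eigenvectors as rows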
Orthogonal Subspace Projection
The orthogonal subspace projection (OSP) operator is a supervised feature-extraction technique that generates $M$ new features for $M$ classes of interest, where each new feature reveals the presence of one class. For each desired class of interest with spectral signature $d_m$, every pixel vector $x$ is projected onto a subspace that is orthogonal to the rest of the (undesired) signatures, followed by maximizing the signal-to-noise ratio to obtain the new detection feature for this class, $z_m$:
$$z_m = q_m^t x = d_m^t (I - U_{\neq m} U_{\neq m}^{+})\, x$$
where $U_{\neq m}$ is an $N \times (M - 1)$ class-signature matrix without the $m$th class's signature, $N$ is the number of bands, and $U_{\neq m}^{+}$ is the pseudo-inverse of $U_{\neq m}$. This process can be conducted repeatedly to find $z_m$ for $m = 1, 2, \ldots, M$. Alternatively, a transformation matrix $Q$ can be formed first to obtain the new vector $z$ directly:
$$Q = [q_1, q_2, \ldots, q_M]^t, \quad z = Qx$$
$z$ is the new pixel vector with a reduced dimension, from $N$ to $M$. OSP assumes that the $M$ class signatures are linearly independent and that the noise present in the data is white noise.
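A minimal NumPy sketch of the OSP feature for one class (illustrative names):

import numpy as np

def osp_feature(x, d_m, U_not_m):
    """OSP detection feature z_m for one class.
    d_m: (N,) desired signature; U_not_m: (N, M-1) undesired signatures."""
    # Projector onto the subspace orthogonal to the undesired signatures
    P = np.eye(len(d_m)) - U_not_m @ np.linalg.pinv(U_not_m)
    return d_m @ P @ x        # z_m = d_m^t (I - U U^+) x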
Adaptive Matched Filter
Target detection from hyperspectral data can be achieved by applying an adaptive matched filter, a simplified version of the matched filter. A projection vector that maximizes the target-background separability for a mean-removed hyperspectral data set is given by
$$r = \frac{\Sigma_b^{-1} d}{d^t \Sigma_b^{-1} d}$$
where $d$ denotes the target spectral signature, which may be estimated using the mean vector of the training target samples, and $\Sigma_b$ is the covariance matrix of the background data. For each pixel vector $x$,
$$f = r^t x$$
The value of $f$ is then compared with a threshold to decide whether the target is detected or not. If more than one class is a target class, this filter can be applied to each target class in turn. Alternatively, orthogonal subspace projection may be applied, which is based on the same concept. This is a supervised approach, where training samples for the target class are required. The performance is sensitive to the accuracy of the statistics estimated from the training data provided. If training data are not available, an unsupervised approach can be applied, such as independent component analysis (ICA), which can achieve blind separation of $M$ mutually independent non-Gaussian sources, where $M < N$, the number of spectral bands.
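A NumPy sketch of the filter (illustrative names; assumes the pixels have already been mean removed):

import numpy as np

def matched_filter(pixels, target, Sb):
    """Adaptive matched filter scores.
    pixels: (P, N) mean-removed pixel vectors; target: (N,) signature d;
    Sb: (N, N) background covariance."""
    w = np.linalg.solve(Sb, target)   # Sb^{-1} d
    r = w / (target @ w)              # r = Sb^{-1} d / (d^t Sb^{-1} d)
    return pixels @ r                 # f = r^t x; compare with a threshold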
Band Grouping for Feature Extraction
Option 1: Neighboring bands are often similar and highly correlated. They may be grouped based on a similarity measure, such as correlation or mutual information. Each group is then combined by averaging or a weighted sum to generate a new feature.
[Diagram: bands 1 … N partitioned into contiguous groups G1, G2, G3, …, GK, each combined into a feature F1, F2, F3, …, FK.]
Option 2: K clusters can be formed from the complete set of bands based on a similarity measure. The bands in each cluster are then combined by averaging or a weighted sum to generate a new feature.
[Diagram: bands 1 … N clustered into groups G1, G2, …, GK, each combined into a feature F1, F2, …, FK.]
In Option 2, the bands in each cluster may not be neighboring bands. Therefore, the extracted features are not associated with any original wavelength.
Principal Component Transformation
Principal component transformation (PCT) provides a new feature space whose first few components are the best choice for reconstructing the original data in the sense of minimizing the mean square error of the residual. The first few components are often selected to serve the purpose of feature reduction or lossy data compression. Let $x$ be the pixel vectors with a length of $N$, $m$ be the mean vector of the image data, and $\Sigma_x$ be the covariance matrix of $x$. Let $\lambda_i$ be the eigenvalues of $\Sigma_x$ in descending order:
$$\lambda_1 > \lambda_2 > \lambda_3 > \cdots > \lambda_N$$
Let $A$ be a matrix whose rows are formed from the eigenvectors of $\Sigma_x$, ordered so that the first row of $A$ is the eigenvector of the largest eigenvalue. The principal component transformation is defined as
$$y = Ax$$
The covariance matrix of the transformed data is
$$\Sigma_y = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_N)$$
The transformed data are decorrelated, and the variance is highest in PC1 and decreases in the subsequent PCs. The total variance is the same as in the original data. PCT can be applied separately to highly correlated subgroups of bands in the hyperspectral data set to improve efficiency, i.e., segmented PCT (SPCT). To ensure that the first few PCs have a low noise level, a noise-adjusted PCT can be applied, where the noise covariance matrix needs to be estimated.
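A NumPy sketch of the transformation (illustrative names; eigh is used because the covariance matrix is symmetric, and the mean is removed before projection):

import numpy as np

def pct(cube, n_components):
    """Principal component transformation of an image cube.
    cube: (rows, cols, N); returns (rows, cols, n_components)."""
    flat = cube.reshape(-1, cube.shape[-1])
    m = flat.mean(axis=0)
    S = np.cov(flat, rowvar=False)
    evals, evecs = np.linalg.eigh(S)          # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]           # descending variance
    A = evecs[:, order[:n_components]].T      # rows = top eigenvectors
    y = (flat - m) @ A.T                      # project each pixel
    return y.reshape(cube.shape[0], cube.shape[1], n_components)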
Incorporation of Spatial Information
Spatial Texture Features Using a GLCM
Spatial texture features, such as contrast and correlation, can be generated from the original intensity measure of each pixel by considering its relationship to its neighbors. These features provide extra information and another description of an object. A gray-level co-occurrence matrix (GLCM) can be generated to calculate the spatial measures. A GLCM records the occurrence counts of two neighboring pixels (spaced $h$ pixels apart, in the direction $\theta$) with gray levels $l_1$ and $l_2$. The counts are often denoted as $g_{ij}$.
Example (GLCM with $h = 1$, $\theta = 0$ deg):
A 5 × 5 patch of 3-bit data:

0 1 2 0 3
1 4 7 1 2
0 1 0 0 0
0 0 0 0 2
0 0 0 0 5

GLCM (rows: first gray level $l_1$ = 0–7; columns: second gray level $l_2$ = 0–7):

8 2 1 1 0 1 0 0
1 0 2 0 1 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
Three examples of texture measures for the patch can be calculated and used as the spatial data of the central pixel:
Contrast:
$$\sum_{i=1}^{L} \sum_{j=1}^{L} (i - j)^2 p_{ij}$$
Uniformity:
$$\sum_{i=1}^{L} \sum_{j=1}^{L} p_{ij}^2$$
Entropy:
$$-\sum_{i=1}^{L} \sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$$
where
$$p_{ij} = g_{ij} \Big/ \sum_{i=1}^{L} \sum_{j=1}^{L} g_{ij}$$
$L$ is the maximum intensity value for the number of bits $B$ used: $L = 2^B - 1$. $B$ is often reduced from the original dynamic range to 5 or 6 to ignore small differences and avoid a GLCM that is too sparse.
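A NumPy sketch of the GLCM and the three measures for a small patch (illustrative names; ordered pairs are counted for a single offset, as in the example above; patch values must be integers below `levels`):

import numpy as np

def glcm(patch, levels, dx=1, dy=0):
    """Co-occurrence counts for neighbor offset (dx, dy)."""
    g = np.zeros((levels, levels), dtype=int)
    rows, cols = patch.shape
    for r in range(rows - dy):
        for c in range(cols - dx):
            g[patch[r, c], patch[r + dy, c + dx]] += 1
    return g

def texture_measures(g):
    p = g / g.sum()                            # normalize counts to p_ij
    i, j = np.indices(p.shape)
    contrast = np.sum((i - j) ** 2 * p)
    uniformity = np.sum(p ** 2)
    nz = p > 0                                 # treat 0 log 0 as 0
    entropy = -np.sum(p[nz] * np.log2(p[nz]))
    return contrast, uniformity, entropy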
Examples of Texture Features
[Images: an original image and its contrast, energy, and homogeneity texture features.]
Parameters used: window size (w): 5 × 5; direction ($\theta$): horizontal; distance (h): 2.
Texture bands can be concatenated with spectral bands as input to a classifier for classification. The spatial properties are incorporated in this way.
Markov Random Field for Contextual Classification
The Markov random field (MRF) process can incorporate neighborhood support into the labeling of each pixel, based on the assumption of spatial consistency over the image scene. There are two parts in the MRF-based discriminant function for image pixel classification. One is the spectral term, which uses the multi-dimensional intensity information of the given (central) pixel. The other is the spatial term, which measures the label consistency of the neighborhood:
$$g_c(x_m) = e(c) - \beta\, a(c)$$
When a maximum-likelihood estimation is used for the spectral analysis, and the isotropic Potts model is used for the spatial term, it becomes
$$g_c(x_m) = -\frac{1}{2} \ln |\Sigma_c| - \frac{1}{2}(x_m - m_c)^t \Sigma_c^{-1} (x_m - m_c) - \beta \sum_{\partial m} \left[1 - \delta(\omega_c, \omega_{\partial m})\right]$$
If a minimum-distance classifier is used, the spectral term is replaced, and it becomes
$$g_c(x_m) = -(x_m - m_c)^t (x_m - m_c) - \beta \sum_{\partial m} \left[1 - \delta(\omega_c, \omega_{\partial m})\right]$$
The $\delta$ function gives an output of 1 if the neighbor's label $\omega_{\partial m}$ is the same as the central pixel's label $\omega_c$, and an output of 0 if the labels are not the same. When all neighbors have the same label as the central pixel, we have the strongest support. The value is capped at the size of the defined neighborhood. To control the contributions from both terms, normalization is recommended for each, with $\beta$ used as a weighting factor. The initial labels come from the first term only. After the first iteration, the labels are modified. More iterations are often run to consider a larger neighborhood region. Oversmoothing at class boundaries should be avoided.
Options for Spectral-Spatial-based Mapping
Using spectral information alone in image pixel classification ignores the spatial correlation of the scene. The resulting classification map may have a "salt-and-pepper" appearance. To overcome this problem, spatial information can be incorporated in a few ways.
Image smoothing may be applied as a pre-processing step to remove noise and improve spatial consistency. It is important to keep class boundaries and object edges unsmoothed.
Generating texture features and stacking them together with spectral features can be effective. Various texture features can be derived using a GLCM or directly. The parameters for the texture measures used in a GLCM, such as direction and distance, reveal spatial information at different scales and orientations.
Superpixels can be formed in advance, and classification can be performed at the superpixel level. The variation within a superpixel is treated as noise and ignored.
A Markov random field (MRF) or probabilistic relaxation (combining the central pixel's and its neighbors' probabilities of belonging to each class for labeling) can be used on the raw data. These are called spectral-spatial-based classifiers.
Post-processing is another option, where a majority vote over a neighborhood may be used to revise the label of the central pixel.
A convolutional neural network (CNN) is a recent development that takes neighbors into account. The spatial features at several levels can be revealed using a deep CNN.
Subpixel Analysis
Spectral Unmixing
Due to limited spatial resolution, there are many mixed pixels in remote sensing images. A flooded area is a good example.
[Figure: a 30 × 30 m² pixel containing mixed classes; the measured signal $s$ is the combined reflectance.]
A linear spectral mixture model is often adopted:
$$s_n = \sum_{m=1}^{M} f_m a_{m,n} + e_n, \quad n = 1, 2, \ldots, N$$
$s_n$: The measured reflectance of a mixed pixel at band $n$.
$a_{m,n}$: Endmember class $m$'s reflectance at band $n$, obtained from pure pixels.
$f_m$: The fraction of endmember class $m$ contained in the mixed pixel, independent of the band number.
$e_n$: The error term.
$M$: The total number of endmember classes.
$N$: The total number of spectral bands.
This model is applied to the measurement at each band, generating a set of $N$ simultaneous equations to solve for the unknown fractions, where $N \gg M$. A least-squares solution is then applied to minimize the residual error.
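An unconstrained least-squares sketch for one pixel in NumPy (illustrative names; the constraints discussed later are not applied here):

import numpy as np

def unmix(s, A):
    """Least-squares unmixing of one pixel.
    s: (N,) measured spectrum; A: (N, M) endmember signatures as columns."""
    f, residual, rank, _ = np.linalg.lstsq(A, s, rcond=None)
    return f        # estimated fractions (may violate the sum/positivity constraints)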
Endmember Extraction
Endmember classes are pure substances; each one consists of a single material. Endmember spectra, often called endmember signatures, must be provided before mixed-pixel unmixing can be performed. The process of finding endmember signatures is called endmember extraction.
With ground truth: When some ground truth of pure classes is available, training areas can be identified from which endmember signatures can be extracted. In this case, an endmember class may correspond to multiple spectra, which presents endmember variability. If an unmixing model is limited to using a single endmember signature for each endmember class, the averaged result may be adopted.
With a spectral library: Endmember signatures can be obtained from a spectral library. As they are measured under certain standard conditions, they should be applied to calibrated image data. Overselection of endmembers may occur, in which case a collaborative sparse unmixing algorithm can be employed.
Data-driven endmember extraction: Several algorithms are available to find endmember signatures automatically from the image data. Consider that, in the color space, RGB are the primary colors, and other colors are combinations of these three. Endmembers can be interpreted as 'primary colors' in the linear unmixing analysis. Therefore, in feature space, the vertices of the simplex of a data cloud are the endmembers. This is the basic rationale for the popular convex-geometry-based endmember-extraction methods, such as N-FINDR. The pixel purity index (PPI) has also been introduced, where the extreme pixel vectors in most directions are regarded as pure pixels to assist with endmember extraction.
Endmember Extraction with N-FINDR
Considering an image data set in a 2D feature space, the three vertices of the data cloud are believed to be the endmembers if pure pixels exist, and the mixed pixels are linear combinations of the endmembers.
[Figure: a data cloud in the ($B_1$, $B_2$) feature space forming a triangle (simplex) whose vertices are the endmembers $e_1$, $e_2$, and $e_3$.]
To find the vertices, three points $e_1'$, $e_2'$, and $e_3'$ are randomly selected initially, and the volume is calculated:
$$V' \propto |E| = \left|\det \begin{bmatrix} 1 & 1 & 1 \\ e_1' & e_2' & e_3' \end{bmatrix}\right|$$
Then, a new point is used to take the place of each selected one in turn, and the volume is recalculated. This procedure is repeated until no further replacements can be performed. The three points that give the maximum volume are the vertices, i.e., the endmembers $e_1$, $e_2$, and $e_3$. This is the N-FINDR process. To use N-FINDR, we need to know the number of endmembers $M$ present in the image from local knowledge or other information sources. Dimensionality reduction is required to reduce the data to $M - 1$ features. The size of the matrix for calculating the simplex volume is $M \times M$. If pure pixels do not exist, endmembers can be generated by working out the vertices that give a minimum simplex volume while still unmixing the image data well [iterative constrained endmembers (ICE)].
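A simplified NumPy sketch of the N-FINDR search (illustrative names; a fixed number of sweeps stands in for the repeat-until-stable loop):

import numpy as np

def simplex_volume(E):
    """|det([1; e_1 ... e_M])| for endmember columns E of shape (M-1, M)."""
    return abs(np.linalg.det(np.vstack([np.ones(E.shape[1]), E])))

def n_findr(Y, M, sweeps=3, seed=0):
    """Y: (M-1, P) pixel data after reduction to M-1 features."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(Y.shape[1], M, replace=False)  # random initial vertices
    E = Y[:, idx].copy()
    for _ in range(sweeps):
        for p in range(Y.shape[1]):                 # try every pixel...
            for m in range(M):                      # ...in every vertex slot
                trial = E.copy()
                trial[:, m] = Y[:, p]
                if simplex_volume(trial) > simplex_volume(E):
                    E = trial                       # keep the larger simplex
    return E                                        # columns are the endmembers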
Limitation of Linear Unmixing
The linear unmixing model assumes that incident light interacts with the pixel on the ground only once. When multiple bounces occur, linear models are not powerful enough for the task. The limited mathematical nonlinear models developed so far are unable to effectively reflect the irregular and complex mixtures in real situations. Therefore, linear models are widely used despite their limited performance.
[Figure: incident light undergoing a single bounce versus multiple bounces.]
The endmember spectra are not unique, due to within-class variation and system noise. An average spectrum is often used to perform unmixing. The specific endmember classes contained in each mixed pixel are often unknown. If extra endmember classes are included in the unmixing, the result is contaminated. It is critical to predict which endmember classes are contained. Due to the above problems, the fractions estimated from linear unmixing can be negative, and their sum may not be 1. To make the mixture results meaningful and useful, two constraints are often imposed on the fractions of the mixture:
$$\sum_{m=1}^{M} f_m = 1, \quad 0 \leq f_m \leq 1$$
Subpixel Mapping
Spectral unmixing improves information retrieval in terms of quantity but without detailed spatial distribution. To address this shortage, subpixel mapping is required. After subpixel analysis, each pixel is decomposed into a weighted sum of the defined endmember classes. The percentages can be quantized for subpixel mapping. For example, if a scale of 2 is used in both spatial directions, 25% is used as a basic unit, corresponding to one subpixel, as shown below.
[Figure: one pixel containing 50% water, 25% soil, and 25% grass is mapped to four subpixels (two water, one soil, one grass), placed by accounting for spatial correlation or structures.]
The location of the endmember classes within the pixel is a new issue to address. Prior knowledge of the structures can be applied if it is available. Continuity of the class distribution in the spatial domain is often used to optimize the resulting map through a number of iterations. Subpixel mapping based on unmixing results can improve spatial resolution by a factor of $N \times N$; at the same time, the unmixing resolution is reduced from 1% to $1/N^2$. Subpixel mapping is not suitable for cases in which the mixture is a composite in nature, where the percentage of each endmember class is distributed over the whole pixel. For example, a mud pixel contains 70% water and 30% soil, which are spatially inseparable.
Subpixel Mapping Example
The following example illustrates super-resolution mapping at a scale of 3 in both directions. Assume that the three endmember classes are represented as R, G, and B. The fraction of each endmember class contained in each pixel is quantized to 1/(3 × 3). Considering a continuous class distribution, the subpixel mapping results are obtained.
[Figure: a block of pixels with fractions such as 100% of a single class, 7/9 R + 2/9 B, 3/9 B + 6/9 G, and 5/9 R + 4/9 G, and the corresponding subpixel map at a scale of 3.]
Super-resolution Reconstruction
Super-resolution reconstruction (SRR) improves image spatial resolution via post-processing to overcome the limitations of imaging systems. SRR can be achieved by combining a number of low-resolution images of a given scene. This process uses the mismatch among the multiple captured images and extracts the extra information between the pixels.
[Flow: several low-resolution images of a scene → super-resolution reconstruction, making use of the mismatch among the input images → a high-resolution image of the scene.]
While hyperspectral images offer high spectral resolution, their spatial resolution is low. Multispectral images are the opposite. When such a pair of data sets is available for a given scene, they can be combined to generate a hyperspectral data set with improved spatial resolution. This process is called image fusion. A similar fusion approach can be applied between a multispectral/hyperspectral image data set and a corresponding panchromatic image band with higher spatial resolution. This is called multispectral image sharpening. Nowadays, many high-resolution and low-resolution image pairs are available; they are used as training data to learn the relationship between the two resolutions with deep learning algorithms, and the trained model is then applied to new image data.
Artificial Neural Networks and Deep Learning with CNNs
Artificial Neural Networks: Structure
Artificial neural networks (ANNs) mimic the structure of the human brain. An example ANN is shown below.
[Diagram: a fully connected network with an input layer of four neurons ($x_1$–$x_4$), one hidden layer, and an output layer of three neurons ($y_1$–$y_3$).]
Input layer: The number of neurons equals the dimension of the input vector, which is 4 in the example given above.
Hidden layers: One or more hidden layers are often used, each with several neurons, to achieve nonlinear decision regions. Networks with more than two hidden layers came to be called deep networks. If there are no hidden layers, a linear classifier is implemented, and a hyperplane is established as the decision surface.
Output layer: This layer has the same number of neurons as the number of classes to classify. The input vector $x$ is recognized as belonging to the $i$th class if neuron $y_i$ has the highest output.
Arrows: Each arrow indicates a data path with a given weight obtained after training.
Neurons: Each neuron (except those at the input layer) is a classifier, as explained on the next page. A neuron is also called a node.
Artificial Neural Networks: Neurons
Each neuron has several inputs from the previous layer, and each input has a given weight obtained from the training. The operations of a neuron comprise a weighted summation followed by thresholding and activation, where a sigmoid function is often used, i.e.,
$$o(z) = \frac{1}{1 + e^{-z/\theta_o}}, \quad \text{where } z = w^t x + b$$
where $b$ is the threshold that offsets the weighted sum, and $\theta_o$ is a constant that controls the sharpness of the sigmoid function.
[Diagram: a neuron with inputs $x_1, x_2, x_3, \ldots, x_N$, weights $w_1, w_2, w_3, \ldots, w_N$, and output $o$: a weighted sum followed by thresholding and activation.]
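A minimal NumPy sketch of a neuron's forward operation (illustrative names):

import numpy as np

def neuron(x, w, b, theta=1.0):
    """Weighted sum, threshold offset, then sigmoid activation."""
    z = w @ x + b
    return 1.0 / (1.0 + np.exp(-z / theta))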
To train the network, the weights are set randomly at the start. The error between the ground truth values of the training samples and the current outputs is used in the backpropagation to update the weights, until the error is reduced to a specified value. The number of layers and the number of nodes in each layer are model hyperparameters that are selected empirically or via validation. It is critical to avoid overfitting.
Limitation of Artificial Neural Networks
Two limitations of ANNs can be noted:
• For an input pattern with low dimensionality, for example, multispectral data, the size of the network is reasonable. For hyperspectral data, the number of input neurons becomes high, and the number of weights associated with each input to every neuron will be impractically large.
• An ANN is suitable for 1D input only. For hyperspectral data, this means only spectral information. If an image or a patch needs to be recognized and labeled, the image pixel data must be reshaped into a 1D vector as the input. In this case, the number of input neurons is very high. Another drawback is that little of the spatial contextual information present in the patch/image is used to perform the classification.
[Figure: a 5 × 6 image patch and its pixel values after reshaping to 1D.]
Convolutional neural networks (CNNs) have been developed to address these issues. Local connections are introduced to each neuron, instead of applying the fully connected architecture in ANNs. Local convolutions are performed to learn the spatial features of various levels using multiple layers. The operation at each neuron is unchanged.
CNN Input Layer and Convolution Layer
Convolutional neural networks can accept an image (or a patch) as the input of the multilayer network and introduce local connections (instead of full connections) and local feature learning before classification, via a local convolution operation. An example of an input layer and convolution layer is illustrated below.
[Diagram: a 3 × 3 input window ($x_1$–$x_9$) locally connected to neurons with weights and biases $w_1, b_1$ through $w_5, b_5$, producing Feature Map 1 through Feature Map 5.]
A 3 × 3 kernel is defined with entries $w$. These represent the neighborhood region to consider (the receptive field) and the weights to use in the convolution operation with the pixels around a central pixel. The $i$th neuron uses $w_i$ for a weighted-sum function via convolution, followed by thresholding and activation, as in an ANN. The result is the output for the central pixel of the input. In the figure above, the black (purple) cell is the output from the black (purple) neighborhood. After each pixel's neighbors are fed and processed, new output images are obtained. These are called feature maps, as they show certain spatial features after filtering, as determined by $w_i$.
CNN Padding and Stride
We can see that the feature map on the previous page is smaller than the original input image, as a 3 × 3 kernel cannot be applied to the pixel at row 1 and column 1 (there are not enough neighbors). A similar lack of neighbors exists at the bottom and right edges. The reduction is greater for larger kernels. To overcome this problem, padding can be applied by adding extra border rows and columns with selected values. Zero padding sets those pixels to zero. Other types of padding include replication padding, where the extra borders repeat the data values at the original edges.
[Figure: an original image and the result after replication padding, where the added border rows and columns repeat the values at the original edges.]
Stride is another parameter in feature map generation. The stride defines the number of pixels the kernel moves when the next convolution is performed.
[Figure: two consecutive moving windows, shown for stride 1 and stride 2.]
CNN Pooling Layer
The feature maps generated will be sent to the next layer in the deep CNN. The data volume is much higher, since the input is now an image instead of a vector. A pooling layer is often introduced to reduce the data size as well as the number of model parameters in the next layer. It also increases the receptive field of a given window size. A pooling layer is a downsampling process. It reduces the feature map resolution to a selected scale. If the scale is 2 in both directions, every 4 pixels are combined into a new pixel. There are a few options for how to determine the new value:
• Average pooling: Use the average value of the pixels in the region.
• Max pooling: Use the maximum value of the pixels in the region.
• L2 pooling: Use the square root of the sum of the squared values in the region.
Original feature map:

6 3 4 2
3 1 5 5

After max pooling (scale: 2):

6 5
After pooling, the feature map resolution is reduced, and the pixels provide spatial features at the coarse level, focusing on the big picture.
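A NumPy sketch of max pooling at scale 2 (illustrative names; assumes the map's sides are divisible by the scale):

import numpy as np

def max_pool(fmap, scale=2):
    """Max pooling of a 2D feature map."""
    r, c = fmap.shape
    blocks = fmap.reshape(r // scale, scale, c // scale, scale)
    return blocks.max(axis=(1, 3))   # maximum of each scale x scale block

# max_pool(np.array([[6, 3, 4, 2],
#                    [3, 1, 5, 5]]))  ->  [[6, 5]]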
CNN Multilayer and Output Layer Convolution and pooling are often treated as one hidden layer. A CNN may have more than one hidden layer, where the feature maps generated after the first layer will be the input for next-level feature learning. The resulting feature maps can be reshaped as a column (flattened) and concatenated to form a vector. Recognition of the image at the original input becomes a vector classification problem. The vector is often connected to a fully connected neural network for labeling.
[Diagram: input image → feature maps → after pooling → feature maps → after pooling → flattened vector → outputs.]
The diagram for this process is often simplified without showing the detailed operations. The example below is a two-layer CNN followed by a fully connected output layer. (FCNN is a fully connected neural network.)
[Diagram: a two-layer CNN (Layer 1: 3 feature maps; Layer 2: 5 feature maps) followed by an FCNN.]
The colors do not correlate with values. They indicate the input for the corresponding output. The kernel (filter) size, weights, and threshold (bias) are unchanged when each new value is generated for a given feature map. This process is called parameter reuse or parameter sharing. In this way, the number of weights and the bias are significantly reduced. More importantly, each feature map presents the same type of spatial property over the whole image.
CNN Training
The weights and thresholds (biases) in the designed CNN architecture need to be trained for a given task. A large number of training images is required, as the number of model parameters is often very high, especially for a deep CNN. The example given on the previous page has three feature maps in Layer 1 and five feature maps in Layer 2. If the receptive field is 3 × 3, the numbers of parameters to train for the first layer, $p_1$, and the second layer, $p_2$, are
$$p_1 = (3 \times 3) \times 3 + 3 = 30$$
$$p_2 = (3 \times 3) \times 3 \times 5 + 5 = 140$$
The number of parameters is independent of the image size due to parameter reuse. Initially, the parameters are set randomly. When the training samples are entered, the output difference from the ground truth is evaluated, and the feedback is used to revise the parameters. This backpropagation is implemented via optimization algorithms until the output error is adequately small. A rectified linear unit (ReLU) is often used in place of the sigmoid function to activate a neuron; this mitigates the problem of vanishing gradients and reduces the computational cost of implementing backpropagation optimization:
$$o(z) = \mathrm{ReLU}(z) = \begin{cases} z, & \text{if } z > 0 \\ 0, & \text{if } z \leq 0 \end{cases} \quad \text{where } z = w^t x + b$$
[Plot: the ReLU(z) activation function.]
CNN for Multiple-Image Input
If the input image is a color image with three bands of RGB, three kernels are introduced for the three bands, respectively, to generate the feature maps. The treatment is equivalent to that of the second layer when the input is a single band.
[Diagram: an RGB input image → Layer 1 (5 feature maps) → more layers → flattened vector → FCNN → outputs.]
For multispectral imagery, the bands can be input directly, similar to the three-band case. The image (patch) is recognized by considering the spatial information and the intensities of each waveband. For hyperspectral images, band selection or feature extraction may be introduced to achieve dimensionality reduction. A principal components transform (PCT) is often applied to compress the spectral bands to a few top principal components. In this way, both spatial and spectral information are utilized in image labeling, although spectral information is not fully considered. In theory, the input image can have a large number of bands, similar to the large number of feature maps in a middle layer. The number of weights will be increased in this case, and the model-overfitting problem can be more difficult to handle.
CNN for Hyperspectral Pixel Classification A hyperspectral data cube provides detailed spectral information of the ground cover materials at each pixel, and reasonable spatial information, such as shape and texture, from the acceptable spatial resolution. Each pixel is often required to be recognized and labeled. The distribution and mass of each ground cover type can then be evaluated quantitatively.
[Diagram: a hyperspectral data cube; the central pixel A and its neighbors are extracted, flattened, and classified to give the outputs for pixel A.]
To incorporate spatial information in pixel labeling, a feature fusion approach can be applied by generating a spectral vector and a spatial vector using two CNNs and combining them during the classification stage.
[Diagram: a 1D CNN applied to the spectral vector and a 2D CNN applied to the spatial patch, with their features fused by an FCNN.]
For a spectral-domain CNN, the spectrum can be seen as a special image patch with one row; the convolution kernel has one row, too. This can be called a 1D CNN. A single band is often used for the spatial-domain CNN, as feature learning in the spectral domain is performed separately. This band can be the first principal component after the transform is conducted on the whole data cube.
CNN Training for Hyperspectral Pixel Classification
For hyperspectral pixel classification using the feature fusion method, a training sample is composed of a central pixel vector and the data of its defined neighborhood, i.e., a small local data cube, plus the ground truth of the central pixel. As there are many model parameters to train, overfitting is a serious concern, given the limited number of training samples. To overcome this problem, several solutions have been developed:
• Data augmentation: Virtual training samples are generated from the real training samples by introducing variations in orientation, cropping, adding noise, changing scales, etc.
• Dropout: The output of some hidden neurons is set to zero, to make the weights sparse and reduce the number of model parameters.
• ReLU: ReLU is used as the activation function to make all negative neurons inactive (zero output).
• Transfer learning: Transfer learning leverages existing knowledge for new problems. Pretrained model parameters from a previous study are used as the initial parameters, or a subset of the trained network is adopted as the new architecture for a less complex application. Parameter refining and updating are conducted with the current training samples.
Overfitting can be identified by evaluating the testing data accuracy and examining the model's generalization capacity.
Multitemporal Earth Observation
Satellite Orbit Period
The period T of a satellite orbiting the Earth is determined by the altitude h of the satellite above the Earth's surface, and is independent of the size and weight of the satellite.
[Figure: A satellite at altitude h above the Earth, whose radius is re.]
$T = 2\pi\sqrt{(r_e + h)^3/\mu}$  (s)

where μ = 3.986 × 10¹⁴ m³ s⁻² (Earth gravitational constant) and r_e ≈ 6.378 × 10⁶ m (Earth radius).

When h = 35.8 Mm, T = 24 hr, which is the same period as the Earth's rotation. The geosynchronous equatorial orbit (GEO) is located at this height, where satellites rotate with the Earth; these are called geostationary satellites.

For low Earth orbit (LEO) observation, the altitude is normally below 1000 km so that the spatial resolution is reasonable and the revisit time is good. However, it is impractical to set the altitude below 160 km due to strong atmospheric drag (the atmospheric force against the motion of an object). At these altitudes, T is much less than 24 hr. The NASA Aqua satellite is located 705 km above the Earth's surface, where T = 99 min. The orbit altitude of DigitalGlobe's QuickBird-2 is 450 km, where T = 93 min. The period is only weakly affected by the orbit altitude.
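The relation is easy to verify numerically; a small sketch follows (the function name and the sample altitudes are illustrative):

```python
import math

MU = 3.986e14   # Earth gravitational constant, m^3 s^-2
R_E = 6.378e6   # Earth radius, m

def orbit_period(h_m):
    """Orbital period in seconds for an altitude h_m in meters."""
    return 2 * math.pi * math.sqrt((R_E + h_m) ** 3 / MU)

for h_km in (35_800, 705, 450):
    T_min = orbit_period(h_km * 1e3) / 60
    print(f"h = {h_km:6d} km  ->  T = {T_min:7.1f} min")
# prints roughly 1436.5, 98.9, and 93.6 min, respectively
```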
Coverage and Revisit Time
Coverage is achieved through the combination of the satellite's motion along its orbit and the Earth's simultaneous rotation. It is determined by the inclination of the orbit, which is the angle between the orbital plane and the equator. The figure shows three passes (first in red, second in green, and third in blue) for an inclination of 60 deg at an altitude of 800 km; almost all countries are covered. To cover the whole globe, the inclination angle needs to be close to 90 deg (near polar).
Revisit time is the interval between two observations of a given place on Earth. Multitemporal images are obtained with an interval determined by the revisit time; as the interval becomes smaller, remote sensing time series become available at a higher sampling rate. Revisit time is affected not only by the orbit period but also by the swath width of the satellite and the latitude of the location. Swath width is the cross-track image width acquired by a satellite as it flies over the Earth's surface.
Change Detection
An important remote sensing application is quantitative change detection, which uses the multitemporal images collected. One of the challenges is to separate the changes caused by different imaging conditions from the real changes that occurred on the Earth's surface. This task requires various image processing and data analysis skills.
Image registration: For a bi-temporal image pair, image registration is required to achieve pixel-to-pixel spatial alignment.
Radiometric correction: This correction is required if a difference image is generated for unsupervised change detection.
Classification: Classification can be applied to each individual image data set; comparison of the posterior probabilities (change vector analysis) or of the classification results can then detect the changes.
Segmentation: Object-based change detection is based on the segmentation results. The segmentation parameters need to be selected properly to make the segmentation consistent in both images.
Spectral index or band ratio: Comparisons of vegetation, soil, and water indices are often effective at reducing the noise effect.
Clustering and thresholding: These processes generate change and no-change maps.
Principal component transformation (PCT): The bi-temporal image data sets can be stacked for a PCT (a sketch is given below). The first principal component is dominated by the correlated data, representing the unchanged scene content together with overall system-level differences; the subsequent components are often associated with real changes.
Deep learning has been developed for change detection in recent years. Time series analysis is feasible for identifying trends of change, as multitemporal images with short intervals are becoming widely available with the fast advancement of satellite missions. Remote sensing is a powerful tool to assist with sustainable development.
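A minimal sketch of the stacked-PCT approach, assuming two co-registered single-band images held as numpy arrays; the simulated change and the threshold are illustrative assumptions.

```python
import numpy as np

def stacked_pct_change(img1, img2):
    """Stack a co-registered bi-temporal pair, apply a PCT, and return
    the minor principal component as a change indicator."""
    X = np.stack([img1.ravel(), img2.ravel()], axis=1).astype(np.float64)
    X -= X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # 2 x 2 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    return (X @ eigvecs[:, 0]).reshape(img1.shape)  # minor component

img1 = np.random.rand(100, 100)
img2 = img1.copy()
img2[40:60, 40:60] += 0.8                    # simulated real change
pc2 = stacked_pct_change(img1, img2)
mask = np.abs(pc2) > 3 * pc2.std()           # simple thresholding
print(mask.sum(), "pixels flagged as changed")  # ~400 (the 20 x 20 block)
```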
Classification Accuracy Assessment
Error Matrix for One-Class Mapping
An error matrix lists how reference data are recognized as the defined classes. For the one-class case, this matrix shows how accurately the target is detected against the background:

                          Reference Data
                        Target    Background
  Results  Target         TP          FP
           Background     FN          TN

TP: true positive; FP: false positive; FN: false negative; TN: true negative.

The accuracy of the target detection can be assessed by

Recall = TP/(TP + FN)
Precision = TP/(TP + FP)
Overall accuracy = (TP + TN)/(TP + TN + FN + FP)

Recall gives the percentage of the reference data that are correctly recognized. Precision gives the percentage of the labeled data that are correct. High recall implies a low miss rate; high precision implies a low false-alarm rate.

Example:

                          Reference Data
                        Target    Background    Total
  Results  Target         98          90         188
           Background      2         110         112
  Total                  100         200         300

Recall: 98/100 = 98% for target; 110/200 = 55% for background
Precision: 98/188 = 52% for target; 110/112 = 98% for background
Overall accuracy: (98 + 110)/(100 + 200) ≈ 69%
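These measures are straightforward to compute; a small sketch using the example numbers above (the function name is illustrative):

```python
def one_class_metrics(tp, fp, fn, tn):
    """Recall, precision, and overall accuracy from a 2 x 2 error matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    overall = (tp + tn) / (tp + tn + fn + fp)
    return recall, precision, overall

r, p, oa = one_class_metrics(tp=98, fp=90, fn=2, tn=110)
print(f"recall={r:.2f}, precision={p:.2f}, overall={oa:.2f}")
# recall=0.98, precision=0.52, overall=0.69
```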
Error Matrix for Multiple-Class Mapping
For M-class classification, the error matrix has a size of M × M to evaluate all the classes. The following table illustrates a case with three classes:

                     Reference Data
                   A      B      C     Total
  Results  A      45     15     10      70
           B       5     30     10      45
           C       0     15     80      95
  Total           50     60    100     210

In this case, recall and precision are also called producer accuracy and user accuracy, respectively.

Producer accuracy: 45/50 = 90% for A; 30/60 = 50% for B; 80/100 = 80% for C
User accuracy: 45/70 = 64% for A; 30/45 = 67% for B; 80/95 = 84% for C
Overall accuracy: (45 + 30 + 80)/(50 + 60 + 100) = 74%

High producer accuracy but low user accuracy means that this class dominates the labeling and its false-alarm rate is high. High user accuracy but low producer accuracy means that the pixels labeled as this class on the mapping result are highly accurate, but there is a high miss rate.
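A small sketch computing these quantities from the matrix above (numpy-based; variable names are illustrative):

```python
import numpy as np

Q = np.array([[45, 15, 10],   # rows = results (A, B, C)
              [ 5, 30, 10],   # columns = reference (A, B, C)
              [ 0, 15, 80]])

producer = np.diag(Q) / Q.sum(axis=0)  # per-class recall
user = np.diag(Q) / Q.sum(axis=1)      # per-class precision
overall = np.trace(Q) / Q.sum()
print(producer.round(2))  # [0.9  0.5  0.8 ]
print(user.round(2))      # [0.64 0.67 0.84]
print(round(overall, 2))  # 0.74
```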
Kappa Coefficient Using the Error Matrix
The kappa coefficient can be calculated from the error matrix to measure the agreement between the mapping results and the reference data (ground truth). The error matrix cells can be named as follows:

                     Reference Data
                    A        B        C
  Results  A     Q(1,1)   Q(1,2)   Q(1,3)
           B     Q(2,1)   Q(2,2)   Q(2,3)
           C     Q(3,1)   Q(3,2)   Q(3,3)

Kappa coefficient:

$\kappa = \dfrac{\sum_i q(i,i) - \sum_c \left[\sum_j q(c,j)\right]\left[\sum_i q(i,c)\right]}{1 - \sum_c \left[\sum_j q(c,j)\right]\left[\sum_i q(i,c)\right]}$

where q(i,j) = Q(i,j)/N_s and N_s is the total number of reference samples.

−1 ≤ κ ≤ 1
κ = −1: no agreement
κ = 0: random agreement
κ = 1: perfect agreement

The kappa result for the preceding example: κ = 0.59.
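A minimal sketch that reproduces this value from the three-class example above (numpy-based; the function name is illustrative):

```python
import numpy as np

def kappa(Q):
    """Kappa coefficient from an M x M error matrix Q
    (rows = results, columns = reference)."""
    q = Q / Q.sum()                       # q(i, j) = Q(i, j) / N_s
    p_o = np.trace(q)                     # observed agreement
    p_e = q.sum(axis=1) @ q.sum(axis=0)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

Q = np.array([[45, 15, 10],
              [ 5, 30, 10],
              [ 0, 15, 80]])
print(f"kappa = {kappa(Q):.2f}")  # kappa = 0.59
```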
Model Validation
After a classification model is selected, for example, the Gaussian distribution model, the model parameters are estimated using the training samples. Then, model validation is required before the model can be accepted, including a check of the model representation capacity and the model generalization capacity.

Model representation: Check the classification accuracy of the training data using the error matrix. If it is low, the model is not powerful enough to represent the class data and is not suitable for the current classification problem. The model needs to be replaced.

Model generalization: Check the classification accuracy of the testing data using the error matrix. If it is low, the model is biased and cannot be generalized to unseen data. In other words, the model is overtrained, often due to the use of limited training samples. More training data are required, or the model dimensionality and complexity should be reduced.

K-fold cross-validation: When the number of labeled samples is limited, K-fold validation can be performed. Randomly split the labeled data into K groups G(1), G(2), . . . , G(K), then:

For k = 1 to K:
    Use G(k) for testing and the rest for training.
    Obtain model k.
    Evaluate the model accuracy a(k).

Combine the evaluations:

$\bar{a} = \frac{1}{K}\sum_{k=1}^{K} a(k)$

When K = N, the total number of samples, this process becomes leave-one-out cross-validation.
Xiuping Jia received a B.Eng. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in January 1982. She later received a Ph.D. degree in Electrical Engineering and a Graduate Certificate in Higher Education from The University of New South Wales, Australia, in 1996 and 2005, respectively, via part-time study. She has had a lifelong academic career in higher education, for which she maintains a continued passion. She is currently an Associate Professor with the School of Engineering and Information Technology, The University of New South Wales at Canberra, Australia. Her research interests include remote sensing, hyperspectral image processing, and spatial data analysis. Dr. Jia has authored or co-authored more than 280 refereed papers, including over 170 journal papers, addressing topics such as data correction, feature reduction, and image classification using machine learning techniques. She co-authored the remote sensing textbook Remote Sensing Digital Image Analysis [Springer-Verlag, 3rd ed. (1999) and 4th ed. (2006)]. These publications are highly cited in the remote sensing and image processing communities. She is an IEEE Fellow.