English Pages 277 Year 2015
Spectral-Spatial Classification of Hyperspectral Remote Sensing Images
For a complete listing of titles in the Artech House Remote Sensing Library, turn to the back of this book.
Spectral-Spatial Classification of Hyperspectral Remote Sensing Images Jón Atli Benediktsson Pedram Ghamisi
Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the U.S. Library of Congress. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library. Cover design by John Gomes
ISBN 13: 978-1-60807-812-7
© 2015 Jón Atli Benediktsson and Pedram Ghamisi
All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
I dedicate the book to my previous and current students, postdocs, and colleagues who have collaborated on this research with me. —Jón Atli Benediktsson
This book is gratefully dedicated to my parents, Parviz and Mahasti. —Pedram Ghamisi
Contents

Foreword
Acknowledgments

Chapter 1 Introduction
1.1 Introduction to Hyperspectral Imaging Systems
1.2 High-Dimensional Data
1.2.1 Geometrical and Statistical Properties of High Dimensional Data and the Need for Feature Reduction
1.2.2 Conventional Spectral Classifiers and the Importance of Considering Spatial Information
1.3 Summary

Chapter 2 Classification Approaches
2.1 Classification
2.2 Statistical Classification
2.2.1 Support Vector Machines
2.2.2 Neural Network Classifiers
2.2.3 Decision Tree Classifiers
2.3 Multiple Classifiers
2.3.1 Boosting
2.3.2 Bagging
2.3.3 Random Forest
2.4 The ECHO Classifier
2.5 Estimation of Classification Error
2.5.1 Confusion Matrix
2.5.2 Average Accuracy (AA)
2.6 Summary

Chapter 3 Feature Reduction
3.1 Feature Extraction (FE)
3.1.1 Principal Component Analysis (PCA)
3.1.2 Independent Component Analysis
3.1.3 Discriminant Analysis Feature Extraction (DAFE)
3.1.4 Decision Boundary Feature Extraction (DBFE)
3.1.5 Nonparametric Weighted Feature Extraction (NWFE)
3.2 Feature Selection
3.2.1 Supervised and Unsupervised Feature Selection Techniques
3.2.2 Evolutionary-Based Feature Selection Techniques
3.2.3 Genetic Algorithm (GA)-Based Feature Selection
3.2.4 Particle Swarm Optimization (PSO)-Based Feature Selection
3.2.5 Hybrid Genetic Algorithm Particle Swarm Optimization (HGAPSO)-Based Feature Selection
3.2.6 FODPSO-Based Feature Selection
3.3 Summary

Chapter 4 Spatial Information Extraction Using Segmentation
4.1 Some Approaches for the Integration of Spectral and Spatial Information
4.1.1 Feature Fusion into a Stacked Vector
4.1.2 Composite Kernel
4.1.3 Spectral-Spatial Classification Using Majority Voting
4.2 Clustering Approaches
4.2.1 K-Means
4.2.2 Fuzzy C-Means Clustering (FCM)
4.2.3 Particle Swarm Optimization (PSO)-Based FCM (PSO-FCM)
4.3 Expectation Maximization (EM)
4.4 Mean-shift Segmentation (MSS)
4.5 Watershed Segmentation (WS)
4.6 Hierarchical Segmentation (HSeg)
4.7 Segmentation and Classification Using Automatically Selected Markers
4.7.1 Marker Selection Using Probabilistic SVM
4.7.2 Multiple Classifier Approach for Marker Selection
4.7.3 Construction of a Minimum Spanning Forest (MSF)
4.8 Thresholding-Based Segmentation Techniques
4.8.1 Image Thresholding
4.8.2 Classification Based on Thresholding-Based Image Segmentation
4.8.3 Experimental Evaluation of Different Spectral-Spatial Classification Approaches Based on Different Segmentation Methods
4.9 Summary

Chapter 5 Morphological Profile
5.1 Mathematical Morphology (MM)
5.1.1 Morphological Operators
5.1.2 Morphological Profile (MP)
5.1.3 Morphological Neighborhood
5.1.4 Spectral-Spatial Classification
5.2 Summary

Chapter 6 Attribute Profiles
6.1 Fundamental Properties
6.2 Morphological Attribute Filter (AF)
6.2.1 Attribute Profile and Its Extension to Hyperspectral Images
6.3 Spectral-Spatial Classification Based on AP
6.3.1 Strategy 1
6.3.2 Strategy 2
6.4 Summary

Chapter 7 Conclusion and Future Works
7.1 Conclusions
7.2 Perspectives

Appendix A: CEM Clustering
Appendix B: Spectral Angle Mapper (SAM)
Appendix C: Prim's Algorithm
Appendix D: Data Sets Description
Abbreviations and Acronyms
Bibliography
About the Authors
Index
Foreword

Hyperspectral image classification has been a very active area of research in recent years. Given a set of observations (i.e., pixel vectors in a hyperspectral image), the goal of classification is to assign a unique class label to each pixel (which, in a hyperspectral image, is a high-dimensional vector of reflectance values). There are several important challenges when performing hyperspectral image classification. For instance, supervised classification faces challenges related to the imbalance between high dimensionality and limited availability of training samples, or the presence of mixed pixels in the data (which may compromise classification results for coarse spatial resolutions). Another important challenge is the need to integrate the spatial and spectral information to take advantage of the complementarities that both sources of information can provide. Specifically, the classification of hyperspectral data using both spatial and spectral information has been a fast-developing topic of research in recent years. The availability of hyperspectral data with high spatial resolution has been quite important for classification techniques, as their main assumption is that the spatial resolution of the data is high enough to assume that the data mostly contains pure pixels (i.e., pixels represented by a single predominant spectral signature). However, in order to better exploit the complementary nature of spatial and spectral information in the analysis of the data, the inclusion of spatial information in traditional approaches for hyperspectral classification has been one of the most active and relevant innovative lines of research in remote sensing during recent years. In this regard, this book represents the first effort in the literature that is specifically focused on the integration of spatial and spectral information in the classification of hyperspectral data.
The contributions of the book are very timely, as the field has achieved maturity due to many contributions in recent years and a monographic volume comprehensively summarizing the advances on this specific topic is now due. To address this need, the book intends to accurately summarize some of the most relevant recent advances performed in the literature in order to integrate spatial-contextual information in hyperspectral data classification. One of the first classifiers with spatial post-processing developed in the hyperspectral imaging literature was the well-known ECHO (extraction and classification of homogeneous objects). Since then, many strategies have been developed to include spatial information at the preprocessing and postprocessing level for advanced hyperspectral data interpretation. Today, it is commonly accepted that using the spatial and the spectral information simultaneously provides significant advantages in terms of improving the performance of classification techniques. The ways in which spatial and spectral information are combined for this purpose have varied significantly, though. For instance, some existing approaches include spatial information prior to the classification, during the feature extraction stage. Mathematical morphology has been particularly successful for this purpose. Morphology is a widely used approach for modeling the spatial characteristics of the objects in remotely sensed images and represents a topic that is extensively covered in this book. Another strategy commonly used in the literature consists of incorporating the spatial context at the classification stage, for instance, using kernel methods. In this regard, a pixel entity can be redefined simultaneously both in the spectral domain (using its spectral content) and also in the spatial domain, by applying some spatial transformations to its surrounding area that yield spatial-spectral features that can be effectively exploited for classification purposes. The spatial information can also be included at the postprocessing level, in order to refine the classification result obtained using a purely spectral method.
These aspects are extensively covered in the book and summarized in a way that will be appealing to both researchers with previous experience in the field and to nonspecialists. In this regard, one of the main contributions of the book is the provision of a detailed snapshot of the state of the art in spatial-spectral classification of hyperspectral data. Last but not least, and despite the fact that the book intends to provide a comprehensive summary of techniques and strategies, these inevitably represent a sample of the large number of techniques presented in recent years for spatial-spectral classification of hyperspectral data. However, the methods and techniques covered by the book are indeed among the most representative ones in their field, and their detailed presentation and evaluation in a monographic volume represents an important contribution to the state of the art in hyperspectral image classification.

Antonio Plaza
University of Extremadura, Spain
Acknowledgments

Research on spectral-spatial classification has been active in the Signal Processing Lab. of the University of Iceland for more than a decade. The research has been performed in collaboration with many international research groups, in particular with the GIPSA Lab at the Grenoble Institute of Technology, France, the Remote Sensing Lab. at the University of Trento, Italy, the HyperComb Lab. at the University of Extremadura, Spain and the Vision and Image Processing Lab., Hunan University, China. Several PhDs have graduated with a strong focus on this area. We would like to mention three important theses that have made a significant impact on the area:

1) Mathieu Fauvel, "Spectral and spatial methods for the classification of urban remote sensing data," joint Ph.D. between Grenoble Institute of Technology and the University of Iceland, 2007, supervised by Jocelyn Chanussot and Jón Atli Benediktsson.
2) Yuliya Tarabalka, "Classification of Hyperspectral Data Using Spectral-Spatial Approaches," joint Ph.D., University of Iceland and Grenoble Institute of Technology, 2010, supervised by Jón Atli Benediktsson and Jocelyn Chanussot.
3) Mauro Dalla Mura, "Advanced Techniques Based on Mathematical Morphology for the Analysis of Remote Sensing Images," joint Ph.D., University of Trento and University of Iceland, 2011, supervised by Lorenzo Bruzzone and Jón Atli Benediktsson.

The work in these theses and related research by several other researchers forms the foundation for the approaches presented here. We would like to use the opportunity to thank many colleagues for their collaboration and contributions to this research through the years: Mathieu Fauvel, Yuliya Tarabalka, Mauro Dalla Mura, Björn Waske, Jocelyn Chanussot, Lorenzo Bruzzone, Antonio J. Plaza, Sebastiano B. Serpico, Micael S. Couceiro, Alberto Villa, Giulia Troglio, Sergio Bernabé, Nicola Falco, Prashanth Reddy Marpu, Mattia Pedergnana, Jun Li, Shutao Li, Xudong Kang, Jóhannes R. Sveinsson, Magnús Örn Úlfarsson, Jón Ævar Pálmason, Páll Gíslason, Gunnar J. Briem, Sveinn R. Jóelsson, Gabriele Cavallaro, Paolo Gamba, James C. Tilton, José M. Bioucas-Dias, and Kolbeinn Árnason. This is just a partial list. It could go on and on but we greatly appreciate the contributions of our colleagues.

The authors gratefully thank Antonio J. Plaza for writing the foreword for the book. We also greatly thank Mauro Dalla Mura for allowing us to use figures that he had prepared. The support and encouragement of Fawwaz Ulaby during the writing of this book is greatly appreciated. We thank the University of Iceland for its support. Funding for spectral-spatial remote sensing research at the University of Iceland has been in part provided by the Icelandic Fund for Research Students, the University of Iceland Research Fund, the Icelandic Research Fund through project grants and the EMMIRS center of excellence and the European Commission through the FP6 Hyper-I-Net Marie Curie Research Training Network and the FP7 North State space theme project. The research support is gratefully appreciated.

The AVIRIS Indian Pines Data were provided by Prof. David A. Landgrebe, Purdue University, W. Lafayette, Indiana, USA and the ROSIS-03 Pavia Data were provided by Prof. Paolo Gamba, University of Pavia, Italy. Access to both data sets is greatly appreciated. Last but not least, we appreciate the support, patience and good wishes of our families and friends throughout the writing of the book.
Chapter 1 Introduction

1.1 INTRODUCTION TO HYPERSPECTRAL IMAGING SYSTEMS
In the past decade, hyperspectral imaging systems have gained great attention from researchers. Hyperspectral imaging systems use sensors that mostly operate from the visible through the middle infrared wavelength ranges and can simultaneously capture hundreds of (narrow) spectral channels from the same area on the surface of the Earth. Hyperspectral sensors collect data with pixels that are represented by vectors in which each element is a measurement corresponding to a specific wavelength. The size of each vector is equal to the number of spectral data channels collected by the sensor. For hyperspectral images, several hundred spectral data channels of the same scene are usually available, while for multispectral images up to ten data channels are typically available.

The detailed spectral information provided by hyperspectral sensors increases the possibility of accurately discriminating materials of interest with an increased classification accuracy. In addition, thanks to advances in hyperspectral technology, the fine spatial resolution of recently operated sensors provides the opportunity of analyzing small spatial structures in images. Several operational imaging systems are currently available, providing a large volume of images for various thematic applications, such as:

• Ecological science: Hyperspectral images can be used for a wide range of applications such as biomass, carbon, and biodiversity estimation in dense forest zones, and to study land cover changes.
• Geological science: Hyperspectral images enable us to retrieve and recover physico-chemical mineral properties such as composition and abundance.
• Mineralogy: By using hyperspectral data, not only can a wide range of minerals be identified, but also their relation to the presence of valuable minerals can be understood. Currently, researchers are investigating the effect of oil and gas leaks from pipelines and natural wells on the spectral signatures of vegetation.
• Hydrological science: Hyperspectral images have been used to detect changes in wetland characteristics. Moreover, water quality, estuarine environments, and coastal zones can be investigated by using hyperspectral images as well.
• Precision agriculture: Hyperspectral images are a powerful tool for discriminating agricultural classes in rural areas and for extracting nitrogen content for the purpose of precision agriculture.
• Military applications: The detailed spectral-spatial information of hyperspectral data can be used for target detection.

Hyperspectral images can be considered as a stack of images with different wavelength intervals (spectral channels) from the same scene on the immediate surface of the Earth. Based on this interpretation, hyperspectral images can be referred to as hyperspectral data cubes. In other words, each spectral channel represents a gray scale image and all images together make a three-dimensional hyperspectral cube. Figure 1.1 shows an example of a hyperspectral data cube. A three-dimensional hyperspectral data cube consists of n1 × n2 × d values, in which n1 × n2 is the number of pixels in each spectral channel and d represents the number of spectral channels.

Figure 1.1 An example of a hyperspectral data cube. (Labels: two spatial dimensions and the band number; a single band; reflectance plotted against band number for one pixel.)

In greater detail, a hyperspectral image can be introduced from one of the following perspectives:

1. Spectral perspective (or spectral dimension): In this case, a hyperspectral data cube consists of several pixels and each pixel is a vector of d values. Each pixel corresponds to the reflected radiation of a specific region of the Earth and has multiple values in spectral bands. This detailed spectral information can be used to analyze different materials precisely. The right image of Figure 1.1 shows a histogram of one pixel, with a value for each band along the spectral dimension. In this domain, the following points are of importance:
• In general, vectors of different pixels belonging to a similar material have almost the same values. Different supervised and unsupervised classification techniques are used in order to group the vectors with almost the same characteristics [1]. This part will be discussed in detail in Chapter 2.
• In general, within each vector, the values in neighboring spectral channels are strongly correlated. Different supervised and unsupervised feature reduction techniques are used in order to reduce the dimensionality of the hyperspectral data cube [1]. This part will be discussed in detail in Chapter 3.

2. Spatial perspective (or spatial dimension): In this context, a hyperspectral data cube consists of d gray scale images with a size of n1 × n2. The values of all pixels in one spectral band make a gray scale image whose two dimensions are spatial, as shown in Figure 1.1.
• In the spatial dimension, adjacent pixels quite commonly belong to the same object (in particular for very high resolution data). This dimension provides valuable information regarding the size and shape of different structures and objects on the Earth. There are several ways to extract spatial information (e.g., segmentation) that will be elaborated on later [1].

The first attempts to analyze hyperspectral images were based on existing techniques that were developed for multispectral images, which only have a few spectral channels, usually fewer than seven. However, most of the commonly used methods designed for the analysis of gray scale, color, or multispectral images are inappropriate and even useless for hyperspectral images. As a matter of fact, with a limited number of available samples, the performance of supervised classification degrades dramatically when the number of data channels increases. In addition, the Hughes phenomenon/curse of dimensionality [2] poses a problem for obtaining robust statistical estimates. As a result, given the unusual characteristics of hyperspectral images described above, new algorithms must be developed in order to make the most of the rich information provided by the hyperspectral data. In the following section, we discuss the specific characteristics of hyperspectral data in detail. This section is very important for the discussion that follows in the rest of the book.

1.2 HIGH-DIMENSIONAL DATA
Figure 1.2 shows the basic idea of the pixel-wise pattern recognition approach, which consists of feature extraction/selection and classification. In pattern recognition, each image pixel is considered as a pattern, and its spectrum (a vector of different values of a pixel in different spectral channels) is considered as the initial set of features. Since this set of features is often redundant, a feature reduction (feature extraction and/or selection) step is performed, aiming at reducing the dimensionality of the feature set (from d1 dimensions in the original data to d2 dimensions in a new feature space, with d2 < d1) and maximizing separability between classes. The reason why we need to consider this step in hyperspectral data processing will be discussed in Section 1.2.1. The next step (called classification) refers to partitioning the entire spectral domain into K exhaustive, nonoverlapping regions, so that every point in this domain is uniquely associated with one of the K classes. Once this step is accomplished, each pixel is classified according to its feature set. The output of this step is a one-dimensional image (i.e., a single-band map of class labels). The reason why we need to consider this step in hyperspectral data processing will be discussed in Section 1.2.2.

Figure 1.2 An example of a hyperspectral data cube.

In this section, the geometrical and statistical characteristics of hyperspectral data are investigated along with the shortcomings of conventional techniques for analyzing this sort of data, and possible solutions are described for each shortcoming.

1.2.1 Geometrical and Statistical Properties of High Dimensional Data and the Need for Feature Reduction
At this point, we are in the era of massive automatic data collection, systematically obtaining many measurements without knowing which data are appropriate for the problem at hand. The trend of hyperspectral imaging research is to record hundreds of spectral channels from the same scene, which can characterize the chemical composition of different materials and is potentially helpful in analyzing different objects of interest. In the spectral domain, each spectral channel is considered as one dimension, and each pixel is represented as a point in this domain. By increasing the number of spectral channels in the spectral domain, theoretical and practical problems may arise, and conventional techniques that are applied to multispectral data are no longer appropriate for the processing of high-dimensional data. The increased dimensionality of such data is able to improve the information content significantly, but provides a challenge to the conventional techniques for accurate analysis of hyperspectral data.

Human experience in three-dimensional (3-D) space misleads one's intuition of geometrical and statistical properties in high-dimensional space [3]. In other words, it is difficult for humans to get used to visualizing spaces with more than three dimensions. Sometimes, this misunderstanding of high-dimensional spaces and conventional spaces leads to the wrong choices in terms of data processing. As a result, the main objective of the following subsections is to give a brief description of the properties of high-dimensional spaces.

A.1 As dimensionality increases, the volume of a hypercube concentrates in corners [4].

The volume of the hypersphere with radius r and dimension d is computed by (see footnote 1)

$$V_s(r) = \frac{2 r^d \pi^{d/2}}{d \, \Gamma\left(\frac{d}{2}\right)} \qquad (1.1)$$

and the volume of a hypercube in $[-r, r]^d$ is calculated by

$$V_c(r) = (2r)^d. \qquad (1.2)$$

The fraction of the volume of a hypersphere inscribed in a hypercube is

$$f_{d1} = \frac{V_s(r)}{V_c(r)} = \frac{\pi^{d/2}}{d \, 2^{d-1} \, \Gamma\left(\frac{d}{2}\right)}. \qquad (1.3)$$

The above means that $\lim_{d \to \infty} f_{d1} = 0$; i.e., as d increases, the volume of the hypercube is increasingly concentrated in the corners.

Footnote 1. Reminder: Γ is the gamma function, which is an extension of the factorial function, with its argument shifted down by 1, to real and complex numbers. The main property of the gamma function is $x\Gamma(x) = \Gamma(x + 1)$. As an example for the gamma function, $\Gamma\left(\frac{5}{2}\right) = \frac{3}{2}\Gamma\left(\frac{3}{2}\right) = \frac{3}{2} \cdot \frac{1}{2}\Gamma\left(\frac{1}{2}\right) = \frac{3}{4}\sqrt{\pi}$.

A.2 As dimensionality increases, the volume of a hypersphere concentrates in an outside shell [4, 5].

The fraction of the volume in a shell defined by a sphere of radius $r - \epsilon$ inscribed inside a sphere with radius r is

$$f_{d2} = \frac{V_s(r) - V_s(r - \epsilon)}{V_s(r)} = \frac{r^d - (r - \epsilon)^d}{r^d} = 1 - \left(1 - \frac{\epsilon}{r}\right)^d. \qquad (1.4)$$

The above means that $\lim_{d \to \infty} f_{d2} = 1$; here the volume of a hypersphere is mostly concentrated in an outside shell. In the same way, it can be proven that the volume of a hyperellipsoid concentrates in an outside shell [6]. Based on the above-mentioned properties, two important specifications for high-dimensional data can be concluded:

• A high-dimensional space is almost empty, which implies that multivariate data in $\mathbb{R}^d$ usually lie in a lower dimensional structure. As a result, high-dimensional data can be projected into a lower dimensional subspace without losing considerable information in the sense of separability among the different statistical classes.
• Gaussian distributed data have a tendency to concentrate in the tails. In the same way, uniformly distributed data have a tendency to be concentrated in the corners, which makes density estimation more difficult. In this space, local neighborhoods are almost surely empty, which demands a larger estimation bandwidth and causes detail to be lost in the density estimate. For more information and the proof of the claim, please see [6].

A.3 As dimensionality increases, the diagonals are nearly orthogonal to all coordinate axes [4].

The cosine of the angle between any diagonal vector and a Euclidean coordinate axis is given by:

$$\cos(\theta_d) = \pm \frac{1}{\sqrt{d}}. \qquad (1.5)$$

Here $\lim_{d \to \infty} \cos(\theta_d) = 0$, which implies that the diagonal is more likely to become orthogonal to the Euclidean coordinates in high-dimensional space.

A.4 The required number of labeled samples for supervised classification increases as the dimensionality increases.

As will be discussed later, supervised classification methods classify input data by using a set of representative samples for each class, referred to as training samples. Training samples are usually obtained by the manual labeling of a small number of pixels in an image or based on some field measurements. Fukunaga [7] showed that there is a relation between the required number of training samples and the number of dimensions for different types of classifiers. The required number of training samples is linearly related to the dimensionality for linear classifiers and to the square of the dimensionality for quadratic classifiers. For nonparametric classifiers, it has been shown that the number of required samples increases exponentially as the dimensionality increases.

It is expected that increasing the dimensionality of the data brings in more information, making it possible to detect more classes with higher accuracy. At the same time, the aforementioned characteristics show that conventional techniques, which are based on computation in the full dimensional space, may not provide accurate classification results when the number of training samples is not substantial [1]. For instance, while keeping the number of samples constant, after a few features, the classification accuracy actually decreases as the number of features increases [3]. For the purpose of classification, these problems are related to the curse of dimensionality. In [3], Landgrebe shows that too many spectral bands are undesirable from the standpoint of expected classification accuracy. When the number of spectral channels (dimensionality) increases, with a constant number of samples, a higher dimensional set of statistics must be estimated. In other words, although higher spectral dimensions increase the separability of the classes, the accuracy of the statistical estimation decreases.

A.5 For most high-dimensional data sets, low-dimensional linear projections have the tendency to be normal (Gaussian), or a combination of normal distributions, as the dimensionality increases.

It has been shown in [8, 9] that, as the dimensionality tends to infinity, lower dimensional linear projections will approach a normality model with probability approaching one. In this case, normality is regarded as a normal distribution or a combination of normal distributions [6].
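Properties A.1 to A.3 can be checked numerically. The following Python sketch evaluates (1.3), (1.4), and (1.5) for a few dimensionalities. The function names and the choice ε/r = 0.05 are illustrative assumptions and are not taken from the text; the log-gamma function is used only to avoid overflow for large d.

```python
import math


def sphere_in_cube_fraction(d):
    """Fraction f_d1 of a hypercube's volume occupied by its inscribed hypersphere, (1.3)."""
    # Work in log space: Gamma(d/2) overflows a float for large d.
    log_f = ((d / 2) * math.log(math.pi) - math.log(d)
             - (d - 1) * math.log(2) - math.lgamma(d / 2))
    return math.exp(log_f)


def shell_fraction(d, eps_over_r):
    """Fraction f_d2 of a hypersphere's volume lying in its outer shell, (1.4)."""
    return 1.0 - (1.0 - eps_over_r) ** d


def diagonal_cosine(d):
    """Magnitude of the cosine between a diagonal and a coordinate axis, (1.5)."""
    return 1.0 / math.sqrt(d)


# Dimensionalities chosen for illustration only (3-D up to a few hundred bands).
for d in (2, 3, 10, 50, 200):
    print(f"d={d:3d}  f_d1={sphere_in_cube_fraction(d):.3e}  "
          f"f_d2={shell_fraction(d, 0.05):.4f}  cos(theta_d)={diagonal_cosine(d):.4f}")
```

For d = 2 and d = 3 the first quantity reduces to the familiar values π/4 and π/6, which is a quick way to confirm that the implementation agrees with (1.3).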
Due to the above-mentioned characteristics of high-dimensional spaces, it is clear that a high-dimensional space behaves completely differently from 3-D space. These strange behaviors of high-dimensional data have a significant effect in the context of supervised classification techniques. In order to estimate class parameters precisely, a large number of training samples is needed (which is almost impossible to obtain). This problem becomes more severe as the dimensionality increases. In a nonparametric approach, the number of training samples required to obtain a satisfactory estimate of a class density is even greater.
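To give a sense of scale, assume that each class is modeled by a multivariate Gaussian (an assumption made here for illustration; the paragraph above does not fix the class model). Such a model has d mean values and d(d + 1)/2 distinct covariance entries per class, so the number of parameters to estimate grows with the square of the dimensionality. The helper name below is hypothetical.

```python
def gaussian_parameter_count(d):
    """Free parameters of one d-dimensional Gaussian class model (assumed model)."""
    mean_params = d                      # one mean value per feature
    covariance_params = d * (d + 1) // 2  # symmetric d x d covariance matrix
    return mean_params + covariance_params


# Dimensionalities chosen for illustration: a few bands up to a few hundred bands.
for d in (4, 10, 100, 200):
    print(f"d={d:3d}  parameters per class={gaussian_parameter_count(d)}")
```

With only a handful of labeled pixels per class, the parameter count for a few hundred bands far exceeds the number of training samples, which is the situation the paragraph above describes.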
It is obvious that a high-dimensional space is almost empty and multivariate data can be represented in a lower dimensional space. Consequently, it is possible to reduce the dimensionality of high-dimensional data without sacrificing significant information and class separability. Given the difficulties of density estimation in nonparametric approaches, parametric data-analysis techniques may lead to a better performance when only a limited number of training samples is available to provide the required a priori information. As a result, it is desirable to project the high-dimensional data into a lower dimensional subspace, where the undesirable effects of high-dimensional geometric characteristics and the so-called curse of dimensionality are decreased.

In the spectral domain, each spectral channel is considered as one dimension. By increasing the number of features in the spectral domain, theoretical and practical problems may arise. For instance, while keeping the number of training samples constant, the classification accuracy actually decreases when the number of features becomes large [2]. For the purpose of classification, these problems are related to the curse of dimensionality. Figure 1.3 demonstrates that with a limited number of training samples, as the number of features increases, the class separability increases but the accuracy of the statistical estimation decreases. Therefore, by keeping the number of samples constant, after a few features, the classification accuracy actually decreases as the number of features increases.

In general, feature reduction techniques can be divided into feature selection and feature extraction techniques. Although different types of feature selection and extraction techniques will be discussed in detail in Chapter 3, in order to provide readers with some clues about feature reduction techniques, we give here a brief description of each of them.
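The behavior summarized for Figure 1.3 can be reproduced on synthetic data. The following NumPy sketch is an illustration under stated assumptions and is not an experiment from this book: two Gaussian classes with identity covariance are simulated, only the first three features carry class information, and a Gaussian maximum-likelihood classifier with per-class mean and covariance is trained on 25 samples per class. All names, sizes, and the random seed are hypothetical choices.

```python
import numpy as np


def gaussian_ml_accuracy(d, n_train, n_test, rng, n_informative=3, shift=1.0):
    """Accuracy of a two-class Gaussian maximum-likelihood classifier in d dimensions."""
    # Class 1 differs from class 0 only in the first few features (assumed setup).
    mean_shift = np.zeros(d)
    mean_shift[:n_informative] = shift

    def sample(n, label):
        return rng.standard_normal((n, d)) + label * mean_shift

    x_test = np.vstack([sample(n_test, 0), sample(n_test, 1)])
    y_test = np.repeat([0, 1], n_test)

    log_likelihoods = []
    for label in (0, 1):
        # Estimate the class statistics from a small training set.
        x_train = sample(n_train, label)
        mean = x_train.mean(axis=0)
        cov = np.atleast_2d(np.cov(x_train, rowvar=False))
        # Gaussian log-likelihood of every test pixel, up to a shared constant.
        diff = x_test - mean
        _, logdet = np.linalg.slogdet(cov)
        mahalanobis = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
        log_likelihoods.append(-0.5 * (logdet + mahalanobis))

    y_pred = np.argmax(np.stack(log_likelihoods, axis=1), axis=1)
    return float(np.mean(y_pred == y_test))


def hughes_curve(dims=(1, 2, 3, 5, 10, 20), n_train=25, n_test=1000,
                 n_trials=20, seed=0):
    """Mean test accuracy per dimensionality with a fixed training-set size."""
    rng = np.random.default_rng(seed)
    return {d: float(np.mean([gaussian_ml_accuracy(d, n_train, n_test, rng)
                              for _ in range(n_trials)]))
            for d in dims}


curve = hughes_curve()
for d, acc in curve.items():
    print(f"d={d:2d}  mean accuracy={acc:.3f}")
```

In this setup the accuracy is expected to rise while informative features are being added and to fall once further features only enlarge the set of statistics that must be estimated from the same 25 samples per class. Because the later features are uninformative by construction, the sketch isolates the estimation effect and does not model a separability that keeps growing with dimensionality.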
Feature selection perhaps is the most straightforward way to reduce the dimensionality of a data set by simply selecting a subset of features from the set of available features based on a criterion. As an example, imagine one wishes to select the best five bands out of the ten available bands for the classification of a data set with six classes by using the Bhattacharyya distance [7] feature selection technique. To do so, one needs to compute the Bhattacharyya distance between each pair of classes for each subset of size five out of the 10-band data [3]. As output of this procedure, five features that provide the highest Bhattacharyya distance in the feature domain will be selected. Please note, feature selection techniques do not make any changes on the specification of the input data and easily select the most informative bands out of the available ones by considering a criterion. In contrast, feature
Figure 1.3 The y-axis shows the amount of (a) separability, (b) statistical estimation, and (c) classification accuracy. The x-axis shows dimensionality. With a limited number of training samples, as the number of features increases, the class separability increases but the accuracy of the statistical estimation decreases. In this case, while keeping the number of samples constant, after a few features, the classification accuracy actually decreases as the number of features increases.
extraction performs a linear or nonlinear transformation to the input data, concentrating the information into a smaller number of features. Therefore, feature extraction makes changes on input data via a transformation. From one point of view, feature selection techniques can be categorized into two categories: unsupervised and supervised. Supervised feature selection techniques aim at finding the most informative features with respect to prior knowledge and lead to better identification and classification of different classes of interest. In contrast, unsupervised methods are used in order to find distinctive bands when a prior knowledge of the classes of interest is not available. Information entropy [10], first spectral derivative [11], and uniform spectral spacing [12] can be considered as unsupervised feature selection techniques, while supervised feature selection techniques usually try to find a group of bands achieving the largest class separability. Class separability can be calculated by considering a few approaches such as divergence [13], transformed divergence [13], Bhattacharyya distance [7], and Jeffries-Matusita distance [13]. However, these metrics demand many samples in order to estimate statistics accurately to construct a set of optimal features. Feature extraction is the process of producing a small number of features by combining existing bands. In this line of thought, feature extraction techniques transform the input data linearly or nonlinearly to another domain and extract informative features in the new domain. In a similar fashion to feature selection techniques, feature extraction can be split into two categories: unsupervised and supervised feature extraction where the former is
used for the purpose of data presentation and the latter is considered for solving the Hughes phenomenon [2] and reducing the redundancy of data in order to improve classification accuracies. In pattern recognition, it is desirable to extract features that are focused on the discrimination between classes of interest. Although a reduction in dimensionality is of importance, it has to be achieved without sacrificing the discriminative power of classifiers [14]. A comprehensive description of the different types of feature selection and extraction techniques, in particular the ones that have been extensively used in conjunction with spectral-spatial classification approaches, will be given in Chapter 3.
1.2.2 Conventional Spectral Classifiers and the Importance of Considering Spatial Information
As discussed before, a hyperspectral data cube consists of several pixels, and each pixel is a vector of d values (the number of spectral channels). Each pixel corresponds to the reflected radiation of the specific region of the Earth and has multiple values in spectral bands. Vectors of different pixels belonging to the similar material with high probability may have almost the same values. Different supervised and unsupervised classification techniques are used in order to group vectors with almost the same spectral characteristic. The procedure of grouping different materials with almost the same spectral characteristics can be considered as the fundamental meaning of image classification. Remote sensing image classifiers try to discriminate different classes of ground cover, for example, from categories such as soil, vegetation, and surface water in a general description of a rural area, to different types of soil, vegetation, and water depth or clarity for a more detailed description. Hyperspectral imaging instruments are now able to capture hundreds of spectral channels from the same area on the surface of the Earth. By providing very fine spectral resolution with hundreds of (narrow) bands, accurate discrimination of different materials is possible. As a result, hyperspectral data are a valuable source of information for the classifiers. The output of the classification step is a classification map. Figure 1.4 (the right image) shows an example of a classification map consisting of nine classes: trees, asphalt, bitumen, gravel, metal sheet, shadow, bricks, meadow, and soil.
Although different types of classification techniques will be discussed in detail in Chapter 2, we provide readers with a brief description of a few well-known classifiers in order to illustrate the pros and cons of these methods and the reason why spectral and spatial classifiers have been gaining a great deal of attention from researchers. Broadly speaking, classification techniques can be divided into two categories, supervised and unsupervised classifiers, which can be briefly described as follows:
• Supervised classifiers: These methods classify input data into a classification map in order to determine the classes of interest, by using a set of representative samples for each class, referred to as training samples. The training samples are used to partition the feature space into decision regions. They are usually obtained by manually labeling a small number of pixels in an image or from field measurements. In other words, for a hyperspectral data cube with $d$ bands, which can be represented as a set of $n$ pixel vectors $X = \{X_j \in \mathbb{R}^d,\ j = 1, 2, \ldots, n\}$, supervised classifiers try to classify the data into a set of classes $\Omega = \{w_1, w_2, \ldots, w_K\}$ [1].
• Unsupervised classifiers: Another type of classifier is based on unsupervised classification or clustering. It is referred to as unsupervised because it does not use training samples and classifies the input data only based on an arbitrary number of initial “cluster centers” that may be user-specified or may be quite arbitrarily selected. During the processing, each pixel is associated with one of the cluster centers based on a similarity criterion.
– K-means: This approach [15] is one of the best-known clustering methods, introduced by MacQueen. It starts with a random initial partition of the pixel vectors into candidate clusters and then reassigns these vectors to clusters by reducing the squared error in each iteration, until a convergence criterion is met.
– ISODATA: This method was first introduced in [16]. It follows the same procedure as the K-means clustering algorithm, with the distinct difference that K-means assumes that the number of clusters is known a priori, whereas ISODATA allows the number of clusters to vary.
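The K-means procedure described above can be illustrated with a minimal sketch. It is not the implementation of [15]: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to initialize from randomly picked pixels (rather than from a random partition) are assumptions made here for brevity.

```python
import random

def kmeans(pixels, k, max_iter=100, seed=0):
    """Cluster pixel vectors (lists of floats) into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct pixels drawn at random.
    centers = [list(p) for p in rng.sample(pixels, k)]
    labels = [-1] * len(pixels)
    for _ in range(max_iter):
        # Assignment step: each pixel joins the center with the smallest squared error.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in pixels
        ]
        if new_labels == labels:  # convergence criterion: no pixel changed cluster
            break
        labels = new_labels
        # Update step: move each center to the mean of its member pixels.
        for c in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centers

# Six two-band "pixels" forming two well-separated spectral groups.
pixels = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = kmeans(pixels, k=2)
```

Note that the number of clusters `k` must be supplied by the user, which is the main practical difference from ISODATA.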
Supervised classification techniques play a key role in the analysis of hyperspectral images, and a wide variety of applications can be handled by a good classifier, including land-use and land-cover mapping, crop monitoring, forest applications, urban development, mapping, tracking, and risk management. In the 1990s, neural network approaches attracted many researchers for classifying hyperspectral images [17, 18]. The advantage of using neural network models over the statistical parametric methods is that they are distribution free and thus no prior knowledge about the statistical distribution of classes is needed. A set of weights and nonlinearities describe the neural network, and these weights are computed via an iterative training procedure. The main interest in using such approaches increased considerably in the 1990s because of recently proposed feasible training techniques for nonlinearly separable data [19]. At this point, the use of neural networks for hyperspectral image classification is limited, primarily due to their algorithmic and training complexity [20] as well as the number of tuning parameters that need to be selected. Random forest (RF) was first introduced in [21], and it is an ensemble method for classification and regression. Ensemble classifiers get their name from the fact that several classifiers (i.e., an ensemble of classifiers) are trained and their individual results are then combined through a voting process. In order to classify an input vector by RF, the input vector is run down each decision tree (a set of binary decisions) in the forest (the set of all trees). Each tree provides a unit vote for a particular class and the forest chooses the class that has the most votes. For example, if 100 trees are grown and 80 of them predict that a particular pixel is forest and 20 of the trees predict it is grass, the final output for that pixel will be forest. 
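The voting step in the 100-tree example above can be written as a very small sketch. Only the combination of votes is shown; how the individual trees are grown is not covered, and the function name `forest_vote` is a hypothetical one chosen for illustration.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class that receives the most unit votes from the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# 100 trees: 80 vote "forest" and 20 vote "grass" for a given pixel.
votes = ["forest"] * 80 + ["grass"] * 20
decision = forest_vote(votes)
```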
Based on studies in [22], the computational complexity of the RF algorithm is $cT\sqrt{M}\,N\log(N)$, where $c$ is a constant, $T$ denotes the number of trees in the forest, $M$ is the number of variables, and $N$ is the number of samples in the data set. RF is therefore not computationally intensive, but it demands a considerable amount of memory since it needs to store an $N$ by $T$ matrix while running. RF does not assume any underlying probability distribution for input data, can provide good classification accuracies, and can handle many variables and a lot of missing data. Another advantage of an RF classifier is that it is insensitive to noise in the training labels. In addition, RF provides an unbiased estimate
of the test set error as trees are added to the ensemble, and, finally, it does not overfit. Support Vector Machines (SVMs) are another example of supervised classification approaches. The general idea behind an SVM is to separate training samples belonging to different classes by tracing maximum margin hyperplanes in the space where the samples are mapped [23]. SVMs were originally introduced for solving linear classification problems. However, they can be generalized to nonlinear decision functions by considering the so-called kernel trick [24]. A kernel-based SVM projects the pixel vectors into a higher dimensional space and estimates maximum margin hyperplanes in this new space in order to improve the linear separability of the data [24]. The sensitivity to the choice of the kernel and regularization parameters can be considered the most important disadvantage of SVM. The latter is classically overcome by considering cross-validation techniques using training data [25]. The Gaussian radial basis function (RBF) kernel is widely used in remote sensing [24]. Both SVM and RF classification methods are comparable in terms of classification accuracies and have been widely used for the purpose of hyperspectral image classification since they can handle a high dimensionality of data with a limited number of training samples, which is a common issue in remote sensing. However, while both methods are shown to be effective classifiers for nonlinear classification problems, SVM requires computationally demanding parameter tuning in order to achieve optimal results, whereas RF does not require such a tuning process and is found to be more robust. In this sense, RF is much faster than SVM, and for volumetric data using RF instead of SVM is favorable. Conventional spectral classifiers consider the hyperspectral image as a list of spectral measurements with no spatial organization [26].
However, in remote sensing images, neighboring pixels are highly related or correlated since remote sensors acquire significant amounts of energy from adjacent pixels, and homogeneous structures in the image scene are generally larger than the size of a pixel. This is especially evident for the image of high spatial resolution. As an example, if a given pixel in an image represents the class “Sea,” its adjacent pixels belong to the same class with a high probability. As a result, spatial and contextual information of adjacent pixels can provide valuable information from the scene. Considering the spatial information can reduce the labeling uncertainty that exists when only spectral information is taken into account, and helps to overcome the salt and pepper appearance of
the classification map. Furthermore, other relevant contextual information can be extracted when the spatial domain is considered. As an example, for a given pixel, it is possible to extract the size and the shape of the structure to which it belongs. Therefore, a joint spectral and spatial classifier is required in order to increase classification accuracies and the quality of the final classification map. Figure 1.4 (the right image) shows an example of the quality improvement of the conventional spectral classification map by considering spatial information in a classification framework. As a result, spectral-spatial classification methods (or context classifiers) must be developed, which assign each image pixel to one class based on: (1) its own spectral values (the spectral information) and (2) information extracted from its neighborhood (the spatial information) [27]. The use of spectral-spatial classification techniques is vitally important for processing of high resolution images with large spatial regions in the image scene. Broadly speaking, a spectral-spatial classification technique consists of three main stages:
1. Extracting spectral information (which will be discussed in Chapter 2);
2. Extracting spatial information (which will be discussed in detail in Chapters 4, 5 and 6);
3. Combining the spectral information extracted from (1) and spatial information extracted from (2) (which will be discussed mainly in Chapter 4).
In the following, we describe the existing methods for spectral-spatial classification of hyperspectral data. In order to characterize the spatial information, two common strategies are available: the crisp neighborhood system and the adaptive neighborhood system. While the first one mostly considers spatial and contextual dependencies in a predefined neighborhood system, the latter is more flexible and is not confined to a given neighborhood system. In the following, each neighborhood system will be briefly discussed.
It should be noted that these methods will be explained in detail later in this book.
i. Crisp neighborhood system
One well-known way of extracting spatial information by using a crisp neighborhood system is Markov random field (MRF) modeling. MRF is a family of probabilistic models that can be explained as a 2-D stochastic process over discrete pixel lattices [28] and is widely used
Figure 1.4 An example of classification maps, with (the right image) and without (the left image) considering spatial information. As can be seen, the classification map obtained by considering both spectral and spatial information is much smoother than the classification map obtained by considering only spectral information. Considering the spatial information can reduce the labeling uncertainty that exists when only spectral information is taken into account and helps to overcome the salt and pepper appearance of the classification map.
to integrate spatial context into image classification problems. In MRFs, it is assumed that for a predefined pixel neighborhood of a given pixel, its closest neighbors belong with a high probability to the same object. Four- and eight-neighborhoods are the most frequently used in image analysis. By using this approach, the pixel in the center can be classified by taking into account the information from its neighbors according to one of these systems. MRFs are considered a powerful tool for incorporating spatial and contextual information into the classification framework [29]. There is an extensive literature on the use of MRFs for increasing the accuracy of classification. In [30], Jackson and Landgrebe introduced spectral-spatial iterative statistical classifiers for hyperspectral data based on an MRF. Pixel-wise maximum likelihood classification is first performed and the classification map is regularized using the maximum a posteriori (MAP)-MRF framework. The spectral information is extracted by the Maximum Likelihood classification, while the spatial information is derived over the pixel neighborhood. In [31], the result of the Probabilistic SVM was regularized by an MRF. In [32], the authors further explored the MAP-MRF classification. They considered class-conditional PDFs, which are estimated by the Mean Field-based SVM regression algorithm. Also, in [29, 33–36], MRFs were taken into consideration for modeling spatial and contextual information for improving the accuracy of the classification. Furthermore, a generalization of MRF, called conditional MRF, was investigated in [37] for the spectral and spatial classification of remote sensing images. In [38], the concept of hidden Markov model (HMM) was used for incorporating spectral and contextual information into a framework for performing unsupervised classification of remote sensing multispectral images. In [39], Ghamisi et al.
proposed to use a generalization of MRF named hidden MRF (HMRF) for the spectral and spatial classification of hyperspectral data. In that work, spectral information was extracted by using an SVM, spatial information was extracted by using the HMRF, and, finally, the spectral and spatial information were combined by using majority voting within each object (see footnote 2). In addition, for the purpose of segmentation and anomaly detection, in [40] Gaussian MRF was employed.
Footnote 2: For performing majority voting within each object on the output of the segmentation and classification steps, first, the number of pixels with different class labels in each object is counted. Then, the set of pixels in each object is assigned to the most frequent class label.
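The majority voting within each object described in this footnote can be sketched in a few lines. The sketch assumes that the segmentation map and the pixel-wise classification map are given as two flattened lists of equal length; the function name `majority_vote_per_object` and the toy labels are illustrative only.

```python
from collections import Counter, defaultdict

def majority_vote_per_object(segment_ids, class_labels):
    """Assign every pixel the most frequent class label of its segment (object)."""
    counts = defaultdict(Counter)
    # Count how often each class label occurs inside each object.
    for seg, lab in zip(segment_ids, class_labels):
        counts[seg][lab] += 1
    # The winning label of an object is its most frequent class.
    winner = {seg: c.most_common(1)[0][0] for seg, c in counts.items()}
    return [winner[seg] for seg in segment_ids]

segments = [0, 0, 0, 0, 1, 1, 1]   # object index of each pixel (from segmentation)
pixelwise = ["meadow", "meadow", "soil", "meadow", "asphalt", "bitumen", "asphalt"]
regularized = majority_vote_per_object(segments, pixelwise)
```

In this toy case the isolated "soil" and "bitumen" labels are replaced by the dominant label of their objects, which is how the salt and pepper appearance of a pixel-wise map is reduced.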
Another common way to include spatial information in a classification technique is to consider texture measures. In [41] and [42], the authors used texture measures derived from the gray level co-occurrence matrix (GLCM) to include spatial information in the classification of hyperspectral data. In [41], texture images are produced using four measurements to describe the GLCM: angular second moment, contrast, entropy, and homogeneity. Then, PCA is applied on the obtained texture images, and the PCs are selected as features for Maximum Likelihood classification. In [42], the authors proposed performing nonnegative matrix factorization feature extraction first, then extracting spatial information using four measurements for the GLCM (angular second moment, entropy, homogeneity, and dissimilarity), and finally applying an SVM classification to a stack of spatial and spectral features. The experimental results showed that, in most cases, this method did not improve on the pixel-wise approaches. This may be explained by the fact that hyperspectral remote sensing images contain only limited textural information [27]. However, the main disadvantages of considering a set of crisp neighbors are as follows:
• The crisp neighborhood system may not contain enough samples, which downgrades the effectiveness of the classifier (in particular, when the input data set is of high resolution and the neighboring pixels are highly correlated).
• A larger neighborhood system may lead to intractable computational problems. Unfortunately, the closest fixed neighborhoods do not always accurately reflect information about spatial structures. For instance, they provoke assimilation of regions containing only a few pixels with their larger neighboring structures and do not provide accurate spatial information at the border of regions.
• In general, the use of a crisp neighborhood system leads to acceptable results for big regions in the scene. Otherwise, it can make small structures in the scene disappear by merging them with the bigger objects that surround them.
ii. Adaptive neighborhood system
In order to address the shortcomings of using a set of crisp neighborhoods, an adaptive neighborhood system can be taken into account. One possible way of considering an adaptive neighborhood system is to take advantage of different types of segmentation methods. Image segmentation is
regarded as the process of partitioning a digital image into multiple regions or objects. In other words, in image segmentation a label is assigned to each pixel in the image such that pixels with the same label share certain visual characteristics [43]. These objects provide more information than individual pixels since the interpretation of images based on objects is more meaningful than that based on individual pixels. Segmentation techniques extract large neighborhoods for large homogeneous regions while not missing small regions consisting of one or a few pixels. Image segmentation is considered as an important task in the analysis, interpretation and understanding of images and is also widely used for image processing purposes such as classification and object recognition [43] [44]. Image segmentation is a procedure that may lead to modify the accuracy of classification maps [45]. To make such an approach more effective, an accurate segmentation of the image is required [43]. There is extensive literature on the use of segmentation techniques in order to extract the spatial information from remote sensing data (e.g., [46– 48]). In order to improve classification results, the integration of classification and segmentation steps has recently been taken into account [49]. In such cases, the decision to assign a pixel to a specific class is simultaneously based on the feature vector of this pixel and some additional information derived from the segmentation step. A few methods for segmentation of multispectral and hyperspectral images have been introduced in the literature. Some of these methods are based on region merging techniques, in which neighboring image segments are merged with each other based on their homogeneity. For example, the multiresolution segmentation method in eCognition software uses this type of approach [50]. Tilton proposed a hierarchical segmentation algorithm [51], which alternately performs region growing and spectral clustering. 
As mentioned in [47], image segmentation can be classified into four specific types: histogram thresholding-based methods, texture analysis-based methods, clustering-based methods, and region-based split and merging methods. Thresholding is one of the most commonly used methods for the segmentation of images into two or more clusters. Many algorithms have been proposed in the literature to address the issue of optimal thresholding (e.g., [52] and [53]). While several research papers address bilevel thresholding, others have considered the multilevel problem. Bilevel thresholding is reduced to an optimization problem to determine the threshold $t$ that maximizes $\sigma_B^2$ (between-class variance) and minimizes $\sigma_W^2$ (within-class variance). For two-level thresholding, the problem is solved by finding the value $T^*$ that results in $\max(\sigma_B^2(T^*))$, where $0 \le T^* < L$ and $L$ is the maximum intensity value. This problem can be extended to $n$-level thresholding by satisfying $\max \sigma_B^2(T_1^*, T_2^*, \ldots, T_{n-1}^*)$, where $0 \le T_1^* < T_2^* < \ldots < T_{n-1}^* < L$. One way of finding the optimal set of thresholds is by using exhaustive search. A commonly used exhaustive search is based on the Otsu criterion [54]. That approach is easy to implement, but it has the disadvantage that it is computationally expensive. An exhaustive search for $n - 1$ optimal thresholds involves evaluating the fitness of $n(L - n + 1)^{n-1}$ combinations of thresholds [55]. Therefore, that method is not suitable from a computational cost point of view. The task of determining $n - 1$ optimal thresholds for $n$-level image thresholding can be formulated as a multidimensional optimization problem. To solve such a task, several biologically inspired algorithms have been explored in image segmentation (e.g., [55–57]). Bio-inspired algorithms have been used in situations where conventional optimization techniques cannot find a satisfactory solution or take too much time to find it (e.g., when the function to be optimized is discontinuous, nondifferentiable, and/or presents too many nonlinearly related parameters) [57]. One of the best known bio-inspired algorithms is particle swarm optimization (PSO) [58]. The PSO consists of a number of particles that collectively move in the search space (e.g., pixels of the image) in search of the global optimum (e.g., maximizing the between-class variance of the distribution of intensity levels in the given image). However, a general problem with the PSO and similar optimization algorithms is that they may be trapped in local optimum points, and the algorithm may work for some problems but fail in others [43]. To overcome such a problem, Tillett et al. [59] presented the Darwinian PSO (DPSO).
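For the two-level case, the exhaustive search for the threshold that maximizes the between-class variance can be sketched as follows. This is a minimal illustration rather than the procedure of [54] or [55]: the function name `otsu_threshold`, the convention that intensities less than or equal to the threshold form the first class, and the toy histogram are assumptions made here.

```python
def otsu_threshold(histogram):
    """Exhaustively search the threshold maximizing the between-class variance.

    histogram[i] is the number of pixels with intensity level i.
    Assumption: pixels with intensity <= t form class 0, the rest class 1.
    """
    total = sum(histogram)
    total_sum = sum(i * h for i, h in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of intensities in class 0
    for t in range(len(histogram) - 1):
        w0 += histogram[t]
        sum0 += t * histogram[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the variance is undefined
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        # Between-class variance for two classes: p0 * p1 * (mu0 - mu1)^2.
        var_between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t, best_var

# Toy 8-level histogram with a dark mode (levels 0-1) and a bright mode (levels 6-7).
hist = [5, 5, 0, 0, 0, 0, 5, 5]
t_star, sigma_b2 = otsu_threshold(hist)
```

With a single threshold the search is linear in the number of intensity levels. The cost becomes prohibitive only in the multilevel case, where every combination of thresholds has to be evaluated, which is the motivation for the optimization-based alternatives discussed in the text.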
In the DPSO, multiple swarms of test solutions performing just like an ordinary PSO may exist at any time with rules governing the collection of swarms that are designed to simulate natural selection. In [45], DPSO was taken into account for the segmentation of multispectral remote sensing images. Results confirmed that DPSO outperforms the conventional PSO in terms of finding higher between class variance in less CPU processing time. More recently, in [60] and [47], for the purpose of image segmentation, Ghamisi et al. introduced further extension of the DPSO using fractional calculus to control the convergence rate of the algorithm [43] and evaluate the capability of that in order to segment hyperspectral images in [47], and a classification framework was proposed based on FODPSO-based segmentation technique. The result of classification
was promising, and FODPSO-based segmentation improved on DPSO- and PSO-based segmentation techniques in terms of finding higher between-class variance in less CPU processing time. In Chapter 4, comprehensive information regarding the thresholding-based segmentation techniques is given. In [46, 49, 61, 62], watershed, partitional clustering, and hierarchical segmentation (HSeg) have been considered in order to extract spatial information, and SVM has been considered in order to extract spectral information. Then, the spectral and spatial information have been integrated by using majority voting [46]. The described approach leads to an improvement in terms of classification accuracies compared to spectral and spatial techniques using local neighborhoods for analyzing spatial information. Chapter 4 provides detailed information on spectral-spatial classification approaches based on different segmentation techniques. Another possible set of approaches that are able to extract spatial information by using an adaptive neighborhood system relies on morphological filters. Erosion and dilation are considered the alphabet of mathematical morphology. These operators are carried out on an image with a set of known shape, called a structuring element (SE). Opening and closing are combinations of erosion and dilation. These operators simplify input data by removing structures with a size less than the SE. However, these operators influence the shape of the structures and can introduce fake objects in the image [14]. One possible way to handle this issue is to consider opening and closing by reconstruction [63]. Opening and closing by reconstruction are a family of connected operators that satisfy the following criterion: if the SE cannot fit a structure of the image, then the structure is totally removed; otherwise it is totally preserved.
Reconstruction operators remove objects smaller than SE without altering the shape of those objects and reconstruct connected components from the preserved objects. For gray scale images, opening by reconstruction removes unconnected light objects and in dual, closing by reconstruction removes unconnected dark objects. Figure 1.5 illustrates an original very high resolution (VHR) image along with its corresponding opening, opening by reconstruction, closing and closing by reconstruction.
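The difference between a standard opening and an opening by reconstruction can be illustrated on a small binary image. The sketch below is an assumption-laden illustration, not the algorithm of [63]: it uses a flat square SE, clips the SE at the image border, and grows the marker with a 3 x 3 neighborhood (8-connectivity) until it is stable. All function names are hypothetical.

```python
def _filter(img, r, op):
    """Apply min (erosion) or max (dilation) over a (2r+1) x (2r+1) square SE."""
    h, w = len(img), len(img[0])
    return [[op(img[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)] for i in range(h)]

def erode(img, r):
    return _filter(img, r, min)

def dilate(img, r):
    return _filter(img, r, max)

def opening(img, r):
    """Standard opening: erosion followed by dilation with the same SE."""
    return dilate(erode(img, r), r)

def opening_by_reconstruction(img, r):
    """Erode, then repeatedly dilate the marker under the original image."""
    marker = erode(img, r)
    while True:
        grown = [[min(g, f) for g, f in zip(g_row, f_row)]
                 for g_row, f_row in zip(dilate(marker, 1), img)]
        if grown == marker:  # stable: every surviving component is rebuilt
            return marker
        marker = grown

# A 3 x 3 block with a thin one-pixel-wide arm, plus one isolated pixel.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
opened = opening(image, 1)
reconstructed = opening_by_reconstruction(image, 1)
```

With a 3 x 3 SE, the standard opening removes the isolated pixel but also cuts off the thin arm, changing the shape of the block. The opening by reconstruction removes the isolated pixel and restores the arm, because the arm is connected to a part of the object that the SE fits.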
Figure 1.5 (a) Morphological closing, (b) closing by reconstruction, (c) original VHR panchromatic image, (d) opening by reconstruction, and (e) morphological opening. As can be seen, morphological opening and closing influence the shape of the structures and can introduce fake objects. However, opening and closing by reconstruction preserve the shape of different objects bigger than the SE. The illustration is taken from [64] © IEEE. Permission for use granted by IEEE 2015.
Figure 1.6 A simple example of an MP consisting of two sequential openings and closings by reconstruction. (c) is the input image $f$. (a) and (b) are the outputs of closing, $\phi_R^j(f)$ and $\phi_R^i(f)$. (d) and (e) are the outputs of opening, $\gamma_R^i(f)$ and $\gamma_R^j(f)$.
In order to fully exploit the spatial information, filtering technique should simultaneously attenuate the unimportant details and preserve the geometrical characteristics of the other regions. Pesaresi and Benediktsson [65] used morphological transformations to build a so-called morphological profile (MP). They carried out a multiscale analysis by computing an antigranulometry and a granulometry, (i.e., a sequence of closings and openings with SE of increasing size), appended in a common data structure named MP. Figure 1.6 illustrates a simple example of MP consisting of two sequential openings and closings by reconstruction. Another modification of using MP which was exploited for the classification of VHR panchromatic images is derivative of the MP (DMP). DMP explains the residues of two successive filtering operations for two adjacent levels existing in the profile. The obtained map is generated by associating each pixel to the level where the maximum of the DMP (evaluated at the given pixel) occurs [66]. Figure 1.7 shows examples of MP and DMP. This figure was prepared by Mauro Dalla Mura and used by his permission.
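The construction of an MP and its DMP can be sketched on a 1-D signal. To keep the example short, the sketch uses standard openings and closings with a flat SE rather than operators by reconstruction. The ordering of the levels (closings with the largest SE first, then the input, then openings with the smallest SE first) and all function names are assumptions made for illustration.

```python
def erode(signal, r):
    """Flat erosion with an SE of length 2r+1 (window clipped at the borders)."""
    n = len(signal)
    return [min(signal[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

def dilate(signal, r):
    """Flat dilation with an SE of length 2r+1 (window clipped at the borders)."""
    n = len(signal)
    return [max(signal[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

def opening(signal, r):
    return dilate(erode(signal, r), r)

def closing(signal, r):
    return erode(dilate(signal, r), r)

def morphological_profile(signal, radii):
    """Stack closings (largest SE first), the input, and openings (smallest SE first)."""
    closings = [closing(signal, r) for r in reversed(radii)]
    openings = [opening(signal, r) for r in radii]
    return closings + [list(signal)] + openings

def derivative_profile(profile):
    """Absolute residue between each pair of adjacent levels of the profile."""
    return [[abs(a - b) for a, b in zip(level, nxt)]
            for level, nxt in zip(profile, profile[1:])]

# A bright peak of width 1 and a bright plateau of width 3.
f = [0, 0, 5, 0, 0, 3, 3, 3, 0, 0]
mp = morphological_profile(f, radii=[1, 2])   # SE lengths 3 and 5
dmp = derivative_profile(mp)
```

In this example the narrow peak produces a response only in the residue between the input and the first opening, while the wider plateau responds only in the residue between the first and second openings. The level at which a pixel responds most strongly therefore indicates the scale of the structure it belongs to.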
Figure 1.7 Examples of MP and DMP, computed with a square SE of sizes 7, 13, 19, and 25. The panels show the closing profile, the opening profile, the derivative of the closing profile, and the derivative of the opening profile. Morphological profiles (MPs) are composed of a sequence of openings and closings with an SE of increasing size. Differential morphological profiles (DMPs) compute the residuals between adjacent levels of the MPs. This figure was prepared by Mauro Dalla Mura and is used with his permission.
In [67], the MP generated by standard opening and closing was computed on a Quickbird panchromatic image acquired over Bam, which was hit by an earthquake in 2003. In this case, the spatial features extracted by the MP were considered for assessing the damage caused by the earthquake. The standard opening and closing, along with white and black top hat and opening and closing by reconstruction, were taken into account all together and classified by an SVM for the classification of a Quickbird panchromatic image [68]. An automatic hierarchical segmentation technique based on the analysis of the DMP was proposed in [69]. The DMP was also analyzed in [70], by extracting a fuzzy measure of the characteristic scale and contrast of each structure in the image. The computed measures were compared with the possibility distribution predefined for each thematic class, generating a value of membership degree for each class used for classification. In [71], in order to reduce the dimensionality of data and address the so-called curse of dimensionality, feature extraction techniques were applied to the DMP, which was then classified by a neural network classifier. In [72], the concept of MPs was successfully extended in order to handle hyperspectral images. To do this, first the input hyperspectral data were transformed by using principal component analysis (PCA), and MPs were computed on the principal components of the data (the result was called the extended MP (EMP)). Figure 1.8 shows a stacked vector consisting of the profiles based on the first and second PCs. Since the EMPs do not fully exploit the spectral information and PCA discards class information, in [14], instead of PCA, different supervised feature extraction techniques were performed on the input data, and the MP and extracted features were concatenated into a stacked vector and classified by an SVM.
Some studies have assessed the capability of SEs with different shapes to extract spatial information. For example, MPs computed with a compact SE (e.g., a square or a disk) can be used to model the size of the objects in the image (e.g., in [73] this information was exploited to discriminate small buildings from large ones). In [74], the computation of two MPs was introduced in order to model both the length and the width of the structures. In greater detail, one MP is built with disk-shaped SEs for extracting the smallest size of the structures, while the other employs linear SEs (which generate directional profiles [75]) for characterizing the objects' maximum size (along with the orientation of the SE). This is appropriate for defining the minimal and maximal length but, as all the possible lengths and orientations cannot be
practically investigated, such analysis is computationally intensive. In [76], to address the main shortcoming of MPs (i.e., that MPs are based on SEs of one particular shape and may therefore not be suitable for detecting different shapes in images), a new approach based on mathematical morphology was developed to extract structural information from hyperspectral images in order to fit several shapes in a given image. In [77], different strategies for producing the base images for MPs were considered. The authors suggested multilinear PCA as a more powerful approach than PCA for producing base images, because it is a tensor-based feature representation that can simultaneously exploit the spectral-spatial correlation between neighboring pixels.
Based on the above-mentioned literature, it is clear that multiscale processing (e.g., by MPs, DMPs, and EMPs) is effective in extracting informative spatial features from the analyzed images. In order to characterize the shape or size of the different structures present in an image, it is important to consider a range of SEs with different sizes. MPs use successive opening/closing operations with an SE of increasing size. The successive use of opening/closing leads to a simplification of the input image and a better understanding of the different structures in the image. Chapter 5 elaborates on the basics of mathematical morphology and provides detailed information on different spectral-spatial classification approaches based on the MP.
Although the MP is a powerful technique for the extraction of spatial information, it suffers from a few limitations, including:
• The shape of the SEs is fixed, which is considered a main limitation for the extraction of objects within a scene.
• SEs are unable to describe information related to the gray-level characteristics of the regions, such as spectral homogeneity, contrast, and so on.
• A final limitation associated with the concept of MPs is the computational complexity. The original image needs to be processed completely for each level of the profile, which requires two complete evaluations of the image: one performed by a closing transformation and the other by an opening transformation. Thus, the complexity increases linearly with the number of levels included in the profile [66, 78].
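The successive opening/closing with an SE of increasing size can be illustrated with a minimal sketch on a one-dimensional signal. It assumes flat, centered SEs of odd width, truncated windows at the borders, and standard opening and closing rather than the reconstruction-based variants; the function names and the toy signal are hypothetical and serve only as an illustration.

```python
def erode(signal, size):
    """Grey-level erosion: minimum over a flat, centered window of odd width."""
    half = size // 2
    n = len(signal)
    return [min(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def dilate(signal, size):
    """Grey-level dilation: maximum over a flat, centered window of odd width."""
    half = size // 2
    n = len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def opening(signal, size):
    """Erosion followed by dilation: removes bright structures narrower than the SE."""
    return dilate(erode(signal, size), size)


def closing(signal, size):
    """Dilation followed by erosion: fills dark structures narrower than the SE."""
    return erode(dilate(signal, size), size)


def morphological_profile(signal, sizes):
    """Stack closings (largest SE first), the original signal, and openings."""
    closings = [closing(signal, s) for s in reversed(sizes)]
    openings = [opening(signal, s) for s in sizes]
    return closings + [list(signal)] + openings


# Toy signal: a bright spike of width 1 and a bright plateau of width 3.
signal = [0, 0, 5, 0, 0, 3, 3, 3, 0, 0]
profile = morphological_profile(signal, sizes=[3, 5])
for level in profile:
    print(level)
```

In this toy case the opening with an SE of width 3 removes the spike but keeps the plateau, while the opening with width 5 removes both, which is how the profile encodes the size of the structures. Each additional SE size adds one closing and one opening over the whole signal, in line with the linear growth in cost noted in the last limitation.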
Figure 1.8 An example of EMP based on two PCs. (Panel labels: PC1, PC2, MP(PC1), MP(PC2), with the profile levels φ_R^i, φ_R^j, γ_R^i, and γ_R^j shown for each PC.) This illustration was taken from [64] © IEEE. Permission for use granted by IEEE 2015.
A morphological attribute profile (AP) is considered a generalization of the MP that provides a multilevel characterization of an image through the sequential application of morphological attribute filters (AFs) [66]. Morphological attribute opening and thinning are morphological AFs that were introduced in [79]. AFs are connected operators that process an image by considering only its connected components. For binary images, the connected components are simply the foreground and background regions present in the image. In order to deal with gray-scale images, the set of connected components can be obtained by considering the image to be composed of a stack of binary images generated by thresholding the image at all its gray-level values [80]. Thus, AFs process the image without distorting or inserting new edges, only by merging existing flat regions [63]. AFs were employed for modeling the structural information of the scene in order to increase the effectiveness of classification in [66] and of building extraction in [81], where they proved to be efficient for the modeling of structural information in VHR images. AFs include in their definition the morphological operators based on geodesic reconstruction [79]. Moreover, they are a flexible tool since they can perform processing based on many different types of attributes. In fact, the attributes can be of any type: for example, they can be purely geometric, related to the spectral values of the pixels, or based on other characteristics. Furthermore, in [81], the problem of tuning the parameters of the filter was addressed by proposing an automatic selection procedure based on a genetic algorithm. The extended AP (EAP) is a stacked vector of different APs computed on the first C features extracted from the original data set. Figure 1.9 shows an example of an EAP for the first two PCs consisting of four attributes.
Figure 1.9 A simple example of EAP consisting of four attributes on the first and second PCs. (Panel labels: PC1, PC2, AP(PC1), AP(PC2), with the profile levels φ^T and γ^T shown for each PC.) The illustration is taken from [83] © IEEE. Permission for use granted by IEEE 2015.
For the purpose of spectral-spatial classification of hyperspectral images, four attributes have been widely used in the literature: (1) the area of the region (related to the size of the regions), (2) the standard deviation (as an index of the homogeneity of the regions), (3) the diagonal of the box bounding the regions, and (4) the moment of inertia (as an index for measuring the elongation of the regions). When the EAPs computed with different attributes {a1, a2, ..., aM} are concatenated into a single stacked vector, the extended multi-AP (EMAP) is obtained [82]. The application of the profiles to large volumes of data is computationally demanding, and this is considered to be one of the main difficulties in using them. In order to solve this issue, an efficient implementation of attribute filters was proposed in [84]. Salembier et al. in [84] introduced a new
data representation named Max-tree that has received much interest since it increases the efficiency of filtering by dividing the transformation process into three steps: (1) tree creation; (2) filtering; and (3) image restitution. The main difficulties of using the EMAP are (1) knowing which attributes lead to a better discrimination of the different classes, and (2) knowing which threshold values should be used to initialize each AP. A few papers, such as [85–87], have addressed these issues by introducing automatic techniques that make the most of attribute profiles. Chapter 6 elaborates on the basics of AFs and provides detailed information on different spectral-spatial classification approaches based on the AP and its generalizations.
As mentioned in [88], the field of hyperspectral image classification is fast moving, and it is attracting many researchers from the computer vision and machine-learning communities. New approaches are developed regularly that tackle new scenarios arising from high-resolution imaging (e.g., multitemporal, multiangular) while learning the relevant features via robust classifiers. Besides the aforementioned approaches, there are many other recent works on spectral-spatial classification of hyperspectral images that demonstrate the importance of this topic. For example, in [89], a spectral-spatial classification framework based on edge-preserving filtering was proposed. The framework consists of three steps: (1) the hyperspectral image is first classified using a pixel-wise classifier such as SVM; (2) the output classification map is represented as multiple probability maps, and edge-preserving filtering is applied to each probability map; and (3) based on the filtered probability maps, the class of each pixel is selected according to the maximum probability. In [90], a classification framework was developed based on the combination of multiple features.
The main objective of the framework was to address a common situation in real applications, in which some classes may be separated using linearly derived features whereas others may demand nonlinearly derived features. In [91], a new spectral-spatial classification approach was introduced for hyperspectral images based on extended random walkers consisting of two main steps: (1) classification by using a pixel-wise classifier such as SVM in order to obtain classification probability maps for a hyperspectral image, which reflect the probabilities that each hyperspectral pixel belongs to different classes, and (2) the obtained pixel-wise probability maps are optimized with the extended random walkers algorithm that encodes the spatial information of the hyperspectral image in a weighted graph. In [92], a novel supervised
approach for the segmentation of hyperspectral data was proposed that integrates the spectral and spatial information in a Bayesian framework. In that work, a multinomial logistic regression (MLR) algorithm was first used to learn the posterior probability distributions from the spectral information, using a subspace projection method to better characterize noise and highly mixed pixels. Then, contextual information was included using a multilevel logistic Markov-Gibbs MRF prior. Finally, a maximum a posteriori segmentation was computed by the α-expansion min-cut-based integer optimization algorithm. These works are just the tip of the iceberg, and they confirm that this field is attracting steadily growing interest from researchers worldwide.
1.3 SUMMARY
The first chapter of this book introduced preliminary information on the importance of hyperspectral data for a wide variety of applications, the rich information contained in this sort of data, and the complexities of hyperspectral data analysis. Based on the information in this chapter, one can understand why conventional techniques that were developed for multispectral data sets are often not applicable to hyperspectral data analysis. In this chapter, we specifically described a few challenges that one can face in hyperspectral data classification. Based on these challenges, a few solutions have been mentioned, which will be elaborated on later in this book.
Chapter 2 Classification Approaches
2.1 CLASSIFICATION
Remote sensing imagery consists of images produced at many different wavelengths. Each of those images is composed of pixels. The number of images in remote sensing data can be counted in hundreds when the imagery is hyperspectral. In classification of remote sensing images, the task is to distinguish between several land cover classes. A classification algorithm is used to separate different types of patterns [93]. A pattern can be regarded as a unique structure or an attribute (e.g., the individual pixel values) that is capable of describing a specific phenomenon (e.g., a land cover). In classification of unknown patterns (i.e., pixels), the patterns are either assigned to a predefined class or combined into unknown clusters depending on their similarity. A d-dimensional pattern (or pixel) in remote sensing is represented by a d-dimensional random variable or feature vector X, where d denotes the number of available features (e.g., image bands) [93]:
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \tag{2.1}$$
where xi, i = 1, ..., d, corresponds to the i-th measurement (the measurement from the i-th wavelength band or data channel) on the ground.
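As a small illustration of the pattern representation in (2.1), the sketch below turns a stack of band images into one d-dimensional feature vector per pixel. The band-first nested-list layout `cube[b][r][c]` and the function name are assumptions made for this example only.

```python
def cube_to_feature_vectors(cube):
    """Return one d-dimensional feature vector per pixel, in row-major order.

    `cube[b][r][c]` is assumed to hold the value of band b at row r, column c.
    """
    d = len(cube)
    rows = len(cube[0])
    cols = len(cube[0][0])
    return [
        [cube[b][r][c] for b in range(d)]
        for r in range(rows)
        for c in range(cols)
    ]


# Toy image with d = 3 bands and 2 x 2 pixels.
cube = [
    [[10, 11], [12, 13]],  # band 1
    [[20, 21], [22, 23]],  # band 2
    [[30, 31], [32, 33]],  # band 3
]
vectors = cube_to_feature_vectors(cube)
print(vectors[0])  # feature vector X of the top-left pixel
```

Each entry of `vectors` plays the role of X in (2.1), with one component per wavelength band.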
As mentioned above, a pattern in remote sensing is represented by a d-dimensional vector X. Each image pixel is considered as a pattern, and the classification is usually done in a feature space [94]. The feature space contains all the measurements for the given pixel at all the different wavelengths, along with derived features that can, for example, be based on spatial analysis. Usually, the initial set of features for classification contains the spectral information (i.e., the wavelength information for the pixels). A classification approach is used to assign unknown pixels to one of C land cover classes Ω = {ωi}, i = 1, ..., C, based on the training data. Feature reduction methods (as discussed in Chapter 3) can be used to reduce the dimensionality of the feature space. The individual classes are discriminated based either on the similarity to a certain class or on decision boundaries that are constructed in the feature space [93].
Pattern recognition and classification are strongly linked with machine learning. Machine learning refers to methods that optimize their performance iteratively by learning from data. In land cover classification, we usually deal with descriptive models that enable us to distinguish between different classes of patterns. Several types of classifiers have been introduced. These classifiers can be split into different groups, such as supervised, unsupervised, parametric, and nonparametric methods [95]. Most of the approaches we consider here belong to the group of supervised classification (often referred to as learning with a teacher), where training data are used to classify the input data into a classification map showing the classes of interest. In order to partition the feature space into decision regions, a set of training samples for each class is used. Training samples are usually obtained by manually labeling a small number of pixels in an image or from field measurements.
Here, as described in Chapter 1, the objective is to predict y, the class membership of an unknown pattern X, using the mapping function f. In machine learning theory, f is usually estimated from the available labeled training data. In contrast, in unsupervised classification the available data carry no information on class membership. The approach is referred to as unsupervised because it does not use training samples (it can be described as learning without a teacher) and classifies the input data based only on an arbitrary number of initial “cluster centers” that may be user-specified or arbitrarily selected. Unsupervised analysis is usually iterative. During the analysis, each pixel is associated with one of the cluster centers based on a similarity criterion. Unsupervised classification algorithms
describe how the data are clustered within the feature space and enable the identification of unknown structures, such as natural groups within the feature space. Consequently, pixels that belong to different clusters are more dissimilar to each other than pixels within the same cluster [93, 96].
In addition to unsupervised and supervised approaches, semi-supervised techniques have been introduced [30, 97]. For such methods, the training is based not only on labeled training samples, but also on unlabeled samples. It has been shown in the literature that semi-supervised approaches can increase the classification accuracy compared to supervised classification that is based only on labeled training samples.
In addition to dichotomizing classification methods into supervised and unsupervised approaches, classifiers can also be grouped into parametric and nonparametric techniques. For example, the well-known supervised maximum likelihood classifier (MLC), which has been widely used in remote sensing, is often applied in the parametric context. In that case, the MLC is based on the assumption that the probability density function for each class is known. Furthermore, in remote sensing classification it is often assumed that the classes follow the Gaussian distribution [7], especially in classification of agricultural areas using multispectral or hyperspectral images. In contrast, nonparametric methods are not constrained by any assumptions on the distribution of the input data. Hence, techniques such as support vector machines, neural networks, decision trees, and ensemble approaches (including random forests) can be applied even if the class-conditional densities are not known or cannot be estimated reliably.
Thus, for hyperspectral imagery with a limited number of available training samples, such techniques can be more attractive because they are more likely to achieve high classification accuracies.
In this chapter, we will discuss some widely used supervised classification approaches. First, we will introduce statistical classification, and then describe support vector machines, neural networks, decision trees, and ensemble methods, including random forests. All the previously mentioned classifiers classify pixels on a per-pixel basis. In addition, one spatial classifier will be discussed, namely the statistical extraction and classification of homogeneous objects (ECHO). Finally, accuracy analysis will be discussed.
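The iterative unsupervised analysis described earlier in this section, in which each pixel is associated with one of the cluster centers based on a similarity criterion, can be sketched with a k-means-style loop. The text does not name a specific algorithm, so k-means, the Euclidean distance, and the user-specified initial centers are assumptions chosen for illustration.

```python
import math


def kmeans(pixels, centers, max_iter=100):
    """Assign each pixel to its nearest center and update centers until stable."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Similarity criterion (assumed): Euclidean distance in the feature space.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            for x in pixels
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for k in range(len(centers)):
            members = [x for x, lab in zip(pixels, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centers


# Four toy pixels with d = 2 features and two user-specified initial centers.
pixels = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]]
labels, centers = kmeans(pixels, centers=[[0.0, 0.0], [1.0, 1.0]])
print(labels)
```

No class labels are used at any point; the clusters only describe how the data group in the feature space and still have to be interpreted by the analyst.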
2.2 STATISTICAL CLASSIFICATION
In the statistical approach to classification, a two-class classification problem can be formulated as follows: Classify the pattern or pixel X to class ωi if
$$p(\omega_i \mid X) > p(\omega_j \mid X) \tag{2.2}$$
with p(ωi|X) being the a posteriori (or conditional) probability that the true class is ωi when the pattern X is observed. The inequality in (2.2) provides the basic concept for the Bayes classifier, which is the best possible statistical classifier that can be designed in terms of minimizing the classification error [7, 93]. However, the Bayes classifier requires the knowledge of p(ωi|X), which is generally unknown and needs to be estimated based on the labeled training samples. Using Bayes' theorem, the a posteriori probabilities can be described by prior and class-conditional probabilities, i.e.,
$$p(\omega_i \mid X) = \frac{p(X \mid \omega_i)\,P(\omega_i)}{p(X)} \tag{2.3}$$
where p(X) is the probability density function (pdf) for pixel X [7]. Since p(X) is the same for both classes, the inequality can now be rewritten as follows:
$$P(\omega_i)\,p(X \mid \omega_i) > P(\omega_j)\,p(X \mid \omega_j) \tag{2.4}$$
with p(X|ωi) as the class-conditional probability density function and P(ωi) as a prior probability. The inequality in (2.4) is the basis for the maximum likelihood classifier (MLC) for a two-class problem. The MLC can easily be extended to multiclass problems by looking for the maximum likelihoods for individual classes. Each individual prior probability describes the probability that the class ωi is present in the image that is being classified. It is assumed in the above approach that the different feature vectors have specific probability densities or likelihoods p(X|ωi) that are conditioned on the observed land cover classes. Therefore, a pixel X belonging to class ωi is assumed to be an observation that is taken randomly from the class-conditional probability density function p(X|ωi). Consequently, differences between two classes ωi and ωj result in differences between the likelihoods that correspond to the classes, i.e., p(X|ωi) and p(X|ωj) [93, 98].
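The decision rule in (2.2) to (2.4) can be made concrete with a short sketch for a single feature. It assumes univariate Gaussian class-conditional densities, in line with the Gaussian assumption mentioned in Section 2.1; the class names, priors, means, and standard deviations below are made-up values for illustration, not estimates from real data.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density, used here as the assumed p(X | w_i)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posteriors(x, classes):
    """Posterior p(w_i | X) for every class via Bayes' theorem (2.3).

    `classes` maps a class name to (prior, mean, std).
    """
    joint = {
        name: prior * gaussian_pdf(x, mean, std)
        for name, (prior, mean, std) in classes.items()
    }
    evidence = sum(joint.values())  # p(X), by the law of total probability
    return {name: value / evidence for name, value in joint.items()}


def classify(x, classes):
    """Assign X to the class with the largest posterior, as in (2.2)."""
    post = posteriors(x, classes)
    return max(post, key=post.get)


# Hypothetical two-class problem on one reflectance-like feature.
classes = {
    "water": (0.3, 0.10, 0.05),       # (prior, mean, std)
    "vegetation": (0.7, 0.40, 0.10),
}
print(classify(0.12, classes), classify(0.45, classes))
```

Because the evidence p(X) is shared by all classes, comparing the products of prior and likelihood as in (2.4) gives the same decision as comparing the normalized posteriors.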
The likelihoods then need to be estimated from the labeled training data. As stated above, it has often been assumed in classification of multispectral or hyperspectral images that the classes follow the Gaussian distribution. In such cases, the Gaussian distribution is multivariate because the data in the feature space are multidimensional. It is important to note that the Bayes classifier is general and can handle any parametric distribution. However, the Gaussian assumption simplifies the classifier, because the Gaussian is described by only two parameters per class, i.e., the mean vector and the covariance matrix. The means and the covariance matrices for the classes can be estimated from the labeled training samples and used in the class-conditional probability density function [7, 93]:
$$p(X \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(X - m_i)^T \Sigma_i^{-1} (X - m_i)\right) \tag{2.5}$$
with mi as the mean vector of class ωi, Σi as its covariance matrix, Σi^−1 as the inverse of Σi, and T as the transpose operator. By taking logarithms of both sides of (2.5) and dropping the term −(d/2) log(2π), which is the same for all classes, the (Gaussian) maximum likelihood classifier can finally be described by the discriminant function gi, where the pattern X is assigned to the class i that has the largest discriminant function value [99]:
$$g_i(X) = -\frac{1}{2}(X - m_i)^T \Sigma_i^{-1} (X - m_i) - \frac{1}{2}\log|\Sigma_i| \tag{2.6}$$
2.2.1 Support Vector Machines
Support vector machines (SVMs) [23, 24, 100] have been widely used in classification of hyperspectral remote sensing images. The SVM approach is based on an optimal linear separating hyperplane, which is fitted to the training samples of two classes within a multidimensional feature space (see Figure 2.1). The optimal hyperplane is obtained by solving an optimization problem based on structural risk minimization. The aim of the solution is to maximize the margin between the hyperplane and the closest training samples, the so-called support vectors [23]. Thus, in training the classifier, only samples that are close to the class boundary are needed. One of the advantages of SVMs is that the classifier works well when only a limited number of labeled training samples is available, even when high-dimensional data are classified [101, 102]. Furthermore, SVMs have performed well in classification of noisy patterns and multimodal feature spaces. Below, we cover the main concepts of SVM based on [100]. For further reading, a detailed introduction to SVMs is given by Burges [103] and Schölkopf and Smola [24].
Figure 2.1 Classification of a nonlinearly separable case by SVM. (Labels in the figure: optimal separating hyperplane; canonical hyperplanes w·X + b = +1 and w·X + b = −1; margin = 2/||w||; samples xi with yi = −1 and xj with yj = +1; slack variables ξi and ξj.)
For a binary classification problem (i.e., a classification problem with two classes) in a d-dimensional feature space R^d, let Xj ∈ R^d, j = 1, ..., n, be a set of n labeled training samples with the corresponding class labels yj ∈ {−1, +1}. The optimal separating hyperplane f(X) (a multidimensional plane) is described by a normal vector w ∈ R^d and the bias b, where |b|/||w|| is the distance between the hyperplane and the origin, with ||w|| as the Euclidean norm of w:
$$f(X) = w \cdot X + b \tag{2.7}$$
The support vectors lie on two canonical hyperplanes w·X + b = ±1 that are parallel to the optimal separating hyperplane. The margin maximization leads to the following optimization problem [100]:
$$\min \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{L} \upsilon_i \tag{2.8}$$
where the slack variables υi and the regularization parameter C are introduced to deal with misclassified samples in nonseparable cases, i.e., cases that cannot be separated by a linear boundary. The regularization parameter is a constant that is used as a penalty for samples that lie on the wrong side of the hyperplane. Effectively, it controls the shape of the decision boundary. Thus, it affects the generalization capability of the SVM (e.g., a large value of C may cause the approach to overfit the training data) [100].
The SVM as described above is a linear classifier, i.e., a linear decision boundary is used to discriminate the samples in the feature space. However, decision boundaries are often nonlinear for classification problems. Therefore, kernel methods are needed to extend the linear SVM approach to nonlinear cases, i.e., cases where the samples cannot be classified accurately with linear classifiers. In such cases, a nonlinear mapping is used to map the data into a high-dimensional feature space. This is the so-called kernel trick. After the transformation, the input pattern X can be described by Φ(X) in the new high-dimensional space. The transformation into the higher-dimensional space can be computationally demanding. However, the computational cost can be reduced by using a positive definite kernel k, which fulfills the so-called Mercer's conditions [24, 100], i.e.,
$$\langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j) \tag{2.9}$$
When Mercer's conditions are met, the final hyperplane can be defined by
$$f(x) = \sum_{i=1}^{L} \alpha_i y_i\, k(x_i, x) + b \tag{2.10}$$
with αi being Lagrange multipliers. A detailed step-by-step derivation of (2.10) is given in Burges [103]. In the new feature space, an explicit knowledge of Φ is not required. Only knowledge of the kernel function k is needed. Consequently, the training process involves the estimation of the parameters of the kernel function in addition to the regularization parameter C. Different concepts for automatic model selection have been introduced in the literature, and they are usually based on cross-validation [104]. Widely used kernel functions are polynomials of various orders and the Gaussian radial basis function (RBF) kernel [24]. The RBF kernel is perhaps the most widely used in remote sensing [100]. It can handle more complex, nonlinear class distributions than a simple linear kernel, which is just a special case of the Gaussian RBF kernel [105]. On the other hand, a polynomial kernel is based on more parameters than a Gaussian kernel, and the computational complexity of the model is increased. The RBF kernel is defined as:
$$k(x_i, x_j) = \exp\!\left(-\gamma \|x_i - x_j\|^2\right) \tag{2.11}$$
where the parameter γ controls the width of the kernel [100].
SVMs are designed for binary classification problems, which are uncommon in the context of remote sensing classification. Also, in contrast to classification algorithms that either give a class label at the output of the classifier or provide probabilities of class memberships, an SVM produces the distance of each pixel to the classification hyperplane at the output. These distances are used to determine the final classification based on a multiclass strategy. In the literature, several multiclass strategies have been introduced. However, two main strategies are most popular, and they are based on the separation of the multiclass problem into several binary classification problems [106]. These are the one-against-one strategy and the one-against-rest strategy, which are described below [100].
Let Ω = {ωi}, with i = 1, ..., K, be a set of K possible class labels (i.e., land cover classes). The one-against-one strategy trains K(K − 1)/2 individual binary SVMs. This means that one binary SVM is trained for each possible pair of classes ωi and ωj (ωi ≠ ωj). The sign of the distance to the hyperplane is used in the one-against-one voting scheme. For the final decision, the score function Si is computed for each class ωi by summing all positive (i.e., sgn = +1) and negative (i.e., sgn = −1) votes for the specific class. Using the K(K − 1)/2 SVM outputs, the final class for sample x is then predicted by a simple majority vote over the scores given by
$$S_i(x) = \sum_{j=1,\, j \neq i}^{K} \operatorname{sgn}\!\left(f_{ij}(x)\right) \tag{2.12}$$
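The one-against-one voting in (2.12), together with the kernel decision function (2.10) and the RBF kernel (2.11), can be sketched as follows. The binary models below are not trained: each one is a hypothetical stand-in built from one prototype per class with unit multipliers and zero bias, chosen only so that the voting logic can be run end to end. The class names, the prototypes, the value of γ, and the tie rule sgn(0) = +1 are likewise assumptions.

```python
import math
from itertools import combinations


def rbf_kernel(xi, xj, gamma):
    """Gaussian RBF kernel as in (2.11)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))


def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Kernel decision function in the form of (2.10)."""
    return sum(
        a * y * rbf_kernel(sv, x, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias


def sgn(value):
    """Sign function; a zero decision value is counted as +1 (assumption)."""
    return 1 if value >= 0 else -1


def one_against_one(x, classes, binary_models):
    """Compute the scores S_i of (2.12) and return the winning class.

    `binary_models[(i, j)]` is a callable f_ij: positive values vote for
    class i and negative values vote for class j, so f_ji = -f_ij.
    """
    scores = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        vote = sgn(binary_models[(i, j)](x))
        scores[i] += vote
        scores[j] -= vote
    return max(scores, key=scores.get), scores


# Hypothetical prototypes standing in for trained support vectors.
prototypes = {"soil": [0.8, 0.2], "grass": [0.2, 0.8], "water": [0.1, 0.1]}
classes = list(prototypes)


def make_model(i, j, gamma=2.0):
    """Untrained toy f_ij with one support vector per class."""
    svs = [prototypes[i], prototypes[j]]
    return lambda x: decision_value(x, svs, [+1, -1], [1.0, 1.0], 0.0, gamma)


binary_models = {(i, j): make_model(i, j) for i, j in combinations(classes, 2)}
label, scores = one_against_one([0.75, 0.25], classes, binary_models)
print(label, scores)
```

With K = 3 classes, K(K − 1)/2 = 3 binary models are evaluated for each sample. In practice the multipliers, the bias, and γ would come from training and model selection rather than being fixed by hand as they are here.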
In the one-against-rest approach, a set of K binary classifiers is trained to separate each class from the remaining ones, and the maximum decision value of the individual SVM classifiers is used to define the class membership. In contrast to these multiclass strategies, one-shot SVMs are an alternative approach [107–109]. Furthermore, approaches have been introduced to convert the rather abstract decision measures of SVM (i.e., the distances of data points to the hyperplane) into more intuitive class probabilities [110]. Based on the idea of logistic regression, a sigmoidal function is fitted to the distance values of each binary SVM, and the distances are converted into probability values for the individual classes. This way, class memberships can be derived from probability values (similar to the concept of MLC), and possible ambiguities after the majority vote are avoided [100].
Apart from using SVM, a composite kernel framework for the classification of hyperspectral images has been recently investigated. In [111], a linearly weighted composite kernel framework with SVMs has been used for the purpose of spectral-spatial classification. However, classification using composite kernels and SVMs demands a convex combination of kernels and a time-consuming optimization process. To overcome these limitations, a generalized composite kernel framework for spectral-spatial classification has been developed in [111]. MLR [112–114] was also investigated as a replacement for the SVM classifier, and a set of generalized composite kernels, which can be linearly combined without any constraint of convexity, was proposed.
2.2.2 Neural Network Classifiers
A neural network is an interconnection of neurons in a network, where a neuron can be described as follows [93, 115]: A neuron receives input signals xj, j = 1, 2, ..., N, which represent the activity at the input or the momentary frequency of neural impulses delivered by another neuron to this input (Kohonen [116]). In the simplest formal model of a neuron, the output value or the frequency of the neuron, o, is often represented by a function
$$o = A\,\phi\!\left(\sum_{j=1}^{N} w_j x_j - \theta\right) \tag{2.13}$$
where A is a constant and φ is a nonlinear function, e.g., the threshold function that takes the value 1 for positive arguments and 0 (or −1) for negative arguments, or a logistic activation function. The wj are called synaptic efficacies or weights, and θ is a threshold.
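The neuron model in (2.13) translates directly into a few lines of code. The sketch below is only an illustration: the function names, the default A = 1, and the example weights and threshold are assumptions.

```python
import math


def threshold(v):
    """Threshold activation: 1 for positive arguments, 0 otherwise."""
    return 1.0 if v > 0 else 0.0


def logistic(v):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-v))


def neuron_output(x, w, theta, phi=logistic, A=1.0):
    """Output o = A * phi(sum_j w_j x_j - theta), as in (2.13)."""
    activation = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return A * phi(activation)


# Hypothetical weights and threshold for a neuron with N = 2 inputs.
w, theta = [0.5, 0.5], 0.2
print(neuron_output([1.0, 0.0], w, theta, phi=threshold))
print(neuron_output([0.0, 0.0], w, theta, phi=threshold))
print(neuron_output([1.0], [0.2], 0.2))  # logistic activation at argument 0
```

A network is obtained by feeding the outputs of such neurons into the inputs of others; training then adjusts the weights wj and thresholds θ.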
In the neural network approach to pattern recognition the neural network operates as a black box that receives a set of input vectors X (observed signals) and produces responses oi from its output neurons i, i = 1, . . . , L where L depends on the number of information classes [115]. A general idea followed in neural network theory is that oi = 1 if neuron i is active for the current input vector X, or oi = 0 (or -1) if it is inactive. The weights are learned through an adaptive (iterative) training procedure in which a set of training samples is presented at the input. The network gives an output response for each sample. The actual output response is compared to the desired response for the sample, and the error between the desired output and the actual output is used to modify the weights in the neural network. The training procedure ends when the error is reduced to a prespecified threshold or cannot be minimized any further. Then all of the data are fed into the network to perform the classification, and the network provides at the output the class representation for each sample. The principal use of neural networks in remote sensing [115] has been in • Classification of satellite imagery and particularly – Land cover, land use classification – Sea ice classification Other applications of neural nets in remote sensing include • Image compression • Change detection • Image-to-image registration and geometric correction • Inverse parameter retrieval such as inversion of snow parameters from passive microwave remote sensing measurements, inversion of microwave brightness temperatures, and inversion for radar scattering from vegetation canopies • Precipitation estimation and, • Subsurface sensing. Several different types of neural network methods have been proposed and applied in classification. For classification of remote sensing images, the multi-layer perceptron trained with the backpropagation algorithm has
been the most widely used neural network approach [115]. This approach uses at least one hidden layer (i.e., a layer of neurons that is seen from neither the input nor the output). Usually the number of input neurons is selected based on the number of input features, and the number of output neurons is the same as the number of classes. Using a single hidden layer enables neural networks to solve convex classification problems. In remote sensing, usually only one hidden layer is used. The number of hidden neurons can be determined via several different approaches (e.g., cross-validation). The multi-layer perceptron has been shown to approximate posterior probabilities at the output in the mean square sense when one output neuron is selected to represent each class. Thus, the multi-layer perceptron can approximate the Bayes classifier in (2.2) in the mean square sense, which is an attractive property.
An excellent review article on the use of neural networks trained with the backpropagation algorithm for the classification of remotely sensed imagery was published by Paola and Schowengerdt [117]. The authors present an overview of multi-layer perceptrons and discuss all the elements of the classification procedure, namely, input data preprocessing and structure, neural network architecture and training, extraction of output classes, and encoding and comparisons to conventional classification techniques.
2.2.3 Decision Tree Classifiers
Decision trees (DTs) are a nonparametric approach that can be used for both classification and regression. The relative structural simplicity of DTs and the relatively short training time required (compared to methods that can be computationally demanding) are some advantages of DT classifiers [118, 119]. Furthermore, DT classifiers allow a direct interpretation of class membership decisions with respect to the impact of individual features [93]. Although a standard DT may be considered limited under some circumstances, the general concept is of great interest, and the classification accuracy can be further increased by classifier ensembles or multiple classifier systems [120, 121]. A good introduction to DTs is given by Safavian and Landgrebe [122], and a brief overview is given below. During the construction of a DT (i.e., the training phase of the classification), the training set is successively separated into an increasing number of smaller, more homogeneous groups. A DT consists of a root node, which includes all samples, internal nodes with a
split rule, and the final leaf nodes, which refer to the different classes. This hierarchical concept is different from other classification approaches, which in most cases use the entire feature space at once and make a single membership decision per class. Moreover, the training process may lead to a splitting of thematic classes (i.e., land cover classes) into several final nodes. Probably the best-known DT is the classification and regression tree (CART), which is discussed in the next sub-section.
2.2.3.1 Classification and Regression Trees
CART is a decision tree in which each split at a node is made on the variable (feature, dimension) that yields the greatest change in impurity, that is, the minimum impurity after the split, among the variables in the data set [123]. The tree is grown until the change in impurity has stopped (or is below some bound) or until the number of samples left to split is smaller than a minimum set by the user.
2.3 MULTIPLE CLASSIFIERS
Traditionally, a single classifier has been used to determine a class label for a given pixel. However, in most cases, the use of an ensemble of classifiers (multiple classifiers) can be considered in order to increase the classification accuracies. In this case, the final decision is based on the results of several different classifiers. The multiple classifiers can support each other and, consequently, can produce a more accurate classification than a single classifier. When multiple classifiers are used, the main aim is to determine an effective combination that benefits from the strengths of the individual classifiers while avoiding their weaknesses [120]. The theory of multiple classifiers can be traced back to 1965 [124]. In the following sub-sections, a somewhat detailed description of two widely used multiple classifiers, boosting and bagging [120], will be given, followed by a discussion of the random forest classifier [100, 121].
2.3.1 Boosting
Boosting [125] is a general framework that can be used with any classifier to increase the corresponding classification accuracy. Several variants of boosting have been proposed, but here we focus on AdaBoost [126], and in particular on the AdaBoost.M1 method, since it can be used for multi-class classification problems. This version of the AdaBoost algorithm is shown below [120, 125]:
Input: A training set S with m samples, basic classifier I, and number of classifiers T.
1. $S_1 = S$ and $\mathrm{weight}(x_j) = 1$ for $j = 1, \ldots, m$ ($x_j \in S_1$)
2. For i = 1 to T {
3.    $C_i = I(S_i)$
4.    $\epsilon_i = \frac{1}{m} \sum_{x_j \in S_i :\, C_i(x_j) \neq y_j} \mathrm{weight}(x_j)$
5.    If $\epsilon_i > 0.5$, set $S_i$ to a bootstrap sample from S with $\mathrm{weight}(x) = 1 \;\forall\, x \in S_i$ and go to step 3. If $\epsilon_i$ is still > 0.5 after 25 iterations, abort!
6.    $\beta_i = \epsilon_i / (1 - \epsilon_i)$
7.    For each $x_j \in S_i$ { if $C_i(x_j) = y_j$ then $\mathrm{weight}(x_j) = \mathrm{weight}(x_j) \cdot \beta_i$ }
8.    Norm weights such that the total weight of $S_1$ is m.
9. }
10. $C^*(x) = \arg\max_{y \in Y} \sum_{i :\, C_i(x) = y} \log \frac{1}{\beta_i}$
Output: The multiple classifier $C^*$.
AdaBoost classifies by a vote over the outputs of the multiple classifiers, in which the vote of classifier $C_i$ is weighted by $\log(1/\beta_i)$ (step 10). At the beginning of AdaBoost, all patterns have the same weight (1/m), thus forming a uniform distribution $D_1$, and they are used to train the classifier $C_1$. The classifier $C_1$ is called the base classifier, basic classifier, or weak classifier. Then, the samples are reweighted in such a way that the incorrectly classified samples have more weight than the correctly classified ones. Based on this new distribution, the classifier of the next
iteration, $C_2$, is trained. The classifier of iteration i, $C_i$, is therefore based on the distribution $D_{i-1}$ calculated in iteration i − 1. Iteratively, the weight of the samples is adjusted in such a way that the weight of correctly classified samples goes down and the weight of incorrectly classified samples goes up. Therefore, the algorithm starts concentrating on the difficult samples. At the end of the procedure, T weighted training sets and T base classifiers have been obtained. If the classification error is higher than 0.5, an attempt is made to decrease it by bootstrapping (see the next sub-section). If the classification error stays above 0.5, then the procedure is stopped without success. Thus, a minimum accuracy of the base classifier is required, which can be a considerable disadvantage in multiclass classification. AdaBoost has several advantages. First, virtually no overfitting is observed for AdaBoost when the data are noiseless. Second, AdaBoost has a tendency to reduce the variance of the classification in repeated experiments. On the other hand, AdaBoost has the disadvantage that it is computationally more demanding than simpler classification methods. Consequently, it depends on the classification problem at hand whether it is more valuable to improve the classification accuracy or to use a simple and fast classifier. Another problem with AdaBoost is that it usually does not perform well in terms of accuracies when the input data are noisy [120].
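The AdaBoost.M1 steps listed above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the base classifier `train_stump` (a weighted one-level decision tree) is a hypothetical choice for I, the error is clamped away from zero so that log(1/β) stays finite, and after a bootstrap retry the sketch simply continues with the bootstrap sample as the working set, a detail the algorithm listing leaves open.

```python
import math
import random
from collections import defaultdict


def train_stump(X, y, w):
    """Hypothetical base classifier I: a weighted one-level decision tree."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            sides = (defaultdict(float), defaultdict(float))  # (<= t, > t)
            for row, label, weight in zip(X, y, w):
                sides[row[f] > t][label] += weight
            # Each side predicts its weighted-majority class.
            left = max(sides[0], key=sides[0].get)
            right = max(sides[1], key=sides[1].get) if sides[1] else left
            err = sum(weight for row, label, weight in zip(X, y, w)
                      if (right if row[f] > t else left) != label)
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    _, f, t, left, right = best
    return lambda row: right if row[f] > t else left


def adaboost_m1(X, y, T, seed=0):
    rng = random.Random(seed)
    m = len(X)
    Xi, yi, w = list(X), list(y), [1.0] * m              # step 1
    ensemble = []                                        # pairs (C_i, log(1/beta_i))
    for _ in range(T):                                   # step 2
        for _attempt in range(25):
            clf = train_stump(Xi, yi, w)                 # step 3
            miss = [clf(r) != c for r, c in zip(Xi, yi)]
            eps = sum(wt for wt, bad in zip(w, miss) if bad) / m   # step 4
            if eps <= 0.5:
                break
            # Step 5: retry on a bootstrap sample of S with unit weights.
            idx = [rng.randrange(m) for _ in range(m)]
            Xi, yi, w = [X[k] for k in idx], [y[k] for k in idx], [1.0] * m
        else:
            raise RuntimeError("error stayed above 0.5 after 25 attempts")
        eps = max(eps, 1e-10)                            # assumption: avoid beta = 0
        beta = eps / (1.0 - eps)                         # step 6
        w = [wt if bad else wt * beta for wt, bad in zip(w, miss)]  # step 7
        scale = m / sum(w)                               # step 8
        w = [wt * scale for wt in w]
        ensemble.append((clf, math.log(1.0 / beta)))

    def classify(row):                                   # step 10
        votes = defaultdict(float)
        for clf, alpha in ensemble:
            votes[clf(row)] += alpha
        return max(votes, key=votes.get)

    return classify


# Toy 1-D problem that no single stump can solve: class 1 lies in the middle.
X = [[v] for v in range(9)]
y = [0, 0, 0, 1, 1, 1, 0, 0, 0]
boosted = adaboost_m1(X, y, T=3)
print([boosted(row) for row in X])
```

In this toy problem a single stump assigns every sample to class 0, whereas three boosted stumps reproduce all nine training labels, which illustrates how the reweighting shifts attention to the samples that earlier classifiers got wrong.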
2.3.2 Bagging
Bagging [127] gets its name from bootstrap aggregating. Bootstrapping generally collects m samples in a random and uniform way with replacement from a sample set of size m. Similarly to AdaBoost, the bagging algorithm is supervised. It constructs many different bags of samples by performing bootstrapping iteratively, trains a classifier on each bag, and determines the classification of each sample through a majority voting process. Bagging and boosting are comparable in the sense that both design a collection of classifiers and combine their conclusions with a vote. The main difference between the two methods is that bagging always uses resampling, whereas boosting uses reweighting. Bagging does not change the distribution of the samples (it does not weight them), so all classifiers in the bagging algorithm have equal weights during the voting.
It is important to note that bagging can be applied in parallel (i.e., it is possible to prepare all the bags at once). In contrast, boosting is always applied in series, and each sample set is based on the latest weights. The bagging algorithm can be presented as follows [120]:
1. For i = 1 to T {
2.    $S_i$ = bootstrapped bag from S
3.    $C_i = I(S_i)$
4. }
5. $C^*(x) = \arg\max_{y \in Y} \sum_{i :\, C_i(x) = y} 1$
Output: The multiple classifier $C^*$.
As seen from the above algorithm, bagging is a very simple approach. Each classifier $C_i$ is trained on a bootstrapped set of samples $S_i$ from the original sample set S. After all classifiers have been trained, a simple majority vote is used. In a case where more than one class jointly receives the maximum number of votes, the winner is chosen by using some simple mechanism (e.g., random selection). For a particular bag $S_i$, the probability that a sample from S is selected at least once in m tries is $1 - (1 - 1/m)^m$. For a large m, the probability is approximately $1 - 1/e \approx 0.632$, indicating that each bag only includes about 63.2% of the distinct samples in S. If the base classifier is unstable (i.e., when a small change in training samples results in a considerable change in classification accuracy), then bagging can improve the classification accuracy significantly. On the other hand, if the base classifier is stable, then bagging can lead to a decrease in classification accuracy, since each classifier receives less of the training data. In this manner, bagging uses the instability of its base classifier in order to improve the classification accuracy. Consequently, a careful selection of the base classifier plays a key role in the classification accuracies obtained. This is also the case for boosting, since it is sensitive to small changes in the input signal [120]. In a similar way to boosting, bagging has advantages and disadvantages. Among the advantages, bagging can significantly improve the classification accuracy if the base classifier is properly selected, it is not very sensitive to noise in the input data, and it reduces the variance in the classification.
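The bagging procedure and the 63.2% figure can be checked with a short Python sketch. This is a minimal illustration under assumptions: the base classifier `train_stump` (an unweighted one-level decision tree) is a hypothetical stand-in for I, and ties in the vote are resolved by taking the first class encountered rather than by random selection.

```python
import random
from collections import Counter


def train_stump(X, y):
    """Hypothetical base classifier I: an unweighted one-level decision tree."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = Counter(c for row, c in zip(X, y) if row[f] <= t)
            right = Counter(c for row, c in zip(X, y) if row[f] > t)
            l_cls = left.most_common(1)[0][0]
            r_cls = right.most_common(1)[0][0] if right else l_cls
            err = sum((r_cls if row[f] > t else l_cls) != c
                      for row, c in zip(X, y))
            if best is None or err < best[0]:
                best = (err, f, t, l_cls, r_cls)
    _, f, t, l_cls, r_cls = best
    return lambda row: r_cls if row[f] > t else l_cls


def bagging(X, y, T, base=train_stump, seed=0):
    rng = random.Random(seed)
    m = len(X)
    classifiers = []
    for _ in range(T):                                   # steps 1-4
        idx = [rng.randrange(m) for _ in range(m)]       # bootstrapped bag S_i
        classifiers.append(base([X[k] for k in idx], [y[k] for k in idx]))

    def classify(row):                                   # step 5: equal-weight vote
        votes = Counter(clf(row) for clf in classifiers)
        return votes.most_common(1)[0][0]

    return classify


def prob_selected(m):
    """Probability that a given sample appears at least once in a bag of size m."""
    return 1.0 - (1.0 - 1.0 / m) ** m


X = [[v] for v in (0, 1, 2, 3, 4, 10, 11, 12, 13, 14)]
y = ["a"] * 5 + ["b"] * 5
bagged = bagging(X, y, T=11)
print(bagged([0]), bagged([14]))
print(prob_selected(10), prob_selected(1000))
```

Because every bag is drawn independently, the loop over T could be run in parallel, which is the property noted above; `prob_selected` approaches 1 − 1/e as m grows.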
A disadvantage of bagging is that it is very sensitive to the choice of the base classifier [120].
2.3.3 Random Forest
Random forest (RF) was first introduced in [21], and it is an ensemble method for classification and regression. A random forest classifier is a classifier comprised of a collection of tree-like classifiers. Ideally, a random forest classifier is an independent and identically distributed (i.i.d.) randomization of weak learners [128]. The RF classifier uses a large number of single decision trees, all of which are trained or grown in order to tackle the same problem. A maximum rule is used for classification, i.e., a pattern is determined to belong to the most frequently occurring of the classes as determined by the individual trees [129, 130]. The random forest classifier has the advantage that it is statistically more robust than a single tree classifier. For example, CART trees can be overtrained easily. Therefore, a single tree is usually pruned [131] (i.e., its size is reduced) in order to increase its generality. However, a collection of unpruned trees, where each tree is trained to its fullest on a subset of the training data to diversify individual trees, can be very useful. The individuality of the trees in a random forest is maintained by three factors [129, 130]:
1. Each tree is trained using a random subset of the training samples.
2. During the growing process of a tree, the best split on each node in the tree is found by searching through m randomly selected features. For a data set with M features, m is selected by the user and kept much smaller than M.
3. Every tree is grown to its fullest to diversify the trees, so there is no pruning.
As described above, a random forest is an ensemble of tree-like classifiers, each trained on a randomly chosen subset of the input data, where the final classification is based on a majority vote by the trees in the forest. Each node of a tree in a random forest looks to a random subset of features of fixed size m when deciding a split during the training phase.
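The three factors above can be made concrete with a minimal Python sketch. Several choices here are assumptions made for illustration only: each tree is trained on a bootstrap sample, splits minimise the Gini impurity, the parameter name `m_try` stands for m, and class labels must not be tuples because tuples are used to represent internal nodes.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(X, y, m_try, rng):
    """Grow an unpruned tree; every node searches m_try randomly chosen features."""
    if len(set(y)) == 1:
        return y[0]                                    # pure node becomes a leaf
    best = None
    for f in rng.sample(range(len(X[0])), m_try):      # factor 2: random features
        for t in sorted({row[f] for row in X})[:-1]:   # both sides stay non-empty
            left = [c for row, c in zip(X, y) if row[f] <= t]
            right = [c for row, c in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:            # sampled features are constant: majority-vote leaf
        return Counter(y).most_common(1)[0][0]
    _, f, t = best
    lo = [k for k, row in enumerate(X) if row[f] <= t]
    hi = [k for k, row in enumerate(X) if row[f] > t]
    return (f, t,                                      # factor 3: no pruning
            grow_tree([X[k] for k in lo], [y[k] for k in lo], m_try, rng),
            grow_tree([X[k] for k in hi], [y[k] for k in hi], m_try, rng))


def tree_predict(node, row):
    while isinstance(node, tuple):                     # descend to a leaf label
        f, t, left, right = node
        node = right if row[f] > t else left
    return node


def random_forest(X, y, n_trees, m_try, seed=0):
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]     # factor 1: random samples
        trees.append(grow_tree([X[k] for k in idx], [y[k] for k in idx],
                               m_try, rng))

    def classify(row):                                 # majority vote of the trees
        return Counter(tree_predict(t, row) for t in trees).most_common(1)[0][0]

    return classify


X = [[i, 4 - i] for i in range(5)] + [[10 + i, 14 - i] for i in range(5)]
y = [0] * 5 + [1] * 5
forest = random_forest(X, y, n_trees=15, m_try=1)
print(forest([0, 0]), forest([14, 14]))
```

With M = 2 features and `m_try=1`, each node sees a single randomly chosen feature, so different trees can split on different features even when they are trained on similar samples.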
Therefore, the trees can be viewed as random vectors of integers (features used to determine a split at each node). Ideally, the value m should be chosen small enough for the vectors or trees to be independent, but large enough for the trees to address the same problem (identically distributed). However, there are two points to note about the parameter m:
1. Increasing the correlation between the trees in the forest by increasing m increases the error rate of the forest.
2. Increasing the classification accuracy of every individual tree by increasing m decreases the error rate of the forest.
An optimal interval for m lies between the somewhat fuzzy extremes discussed above. The parameter m is often said to be the only adjustable parameter to which the forest is sensitive, and the “optimal” range for m is usually quite wide. However, the value is often set approximately to the square root of the number of input features [129, 130, 132].
2.3.3.1 Derived Parameters for Random Forests
There are three parameters that are derived from random forests [129, 130], i.e., the out-of-bag error, the variable importance, and the proximity analysis. These three derived parameters are discussed below:
1. Out-of-Bag (OOB) Error: To estimate the accuracy in classification of test samples (i.e., samples that were not used in training), the out-of-bag (OOB) samples of each tree (i.e., the remaining labeled samples from the training set that were not selected in the bootstrap for that particular tree) can be run down through the tree (i.e., in a form of cross-validation). The OOB error estimate is derived from the classification error for the samples left out for each tree, averaged over the total number of trees. This error estimate has been shown to be unbiased in many tests [21, 130, 133].
2. Variable Importance: For a single tree, the OOB cases are run through it and the votes obtained for the correct class are counted. Then, the values of variable m are permuted in the OOB cases, and these cases are again run through the tree. The number of votes for the correct class in the variable-m-permuted OOB data is next subtracted from the number of votes for the correct class in the untouched OOB data. The average of this number over all trees in the forest is the raw importance score for variable m [127]. The variable importance can be used to select the important features from the data and, consequently, eliminate features that are not important.
3. Proximities: The proximity measure gives an indication of the “distance” to other samples. After a tree has been grown, all the data are passed through it. If two samples, k and n, end up in the same output node, their proximity is increased by one.
2.4 THE ECHO CLASSIFIER
The classifiers discussed above are mostly used in classification of spectral variations within the data. Therefore, those classifiers are sometimes referred to as pixel classifiers. Other methods are based on using spatial information or context in classification of image data. The spatial information can be grouped into two distinct domains: (1) the information of spatial correlation of gray variations, and (2) the information of spatial dependency of class labels [134]. Neither the gray variations nor the class labels of the image are deterministic. Therefore, a stochastic model is often desirable for representing the random variations in either case, and a random field model has become popular in this regard [134]. The basic idea of random field modeling is to capture the intrinsic character of images with relatively few parameters in order to understand the nature of the image and provide either a model of gray value dependencies for texture analysis or a model of context for decision making. By introducing these spatial parameters along with spectral models into the classification process, an improvement in classification accuracy can be achieved. The extraction and classification of homogeneous objects (ECHO) classifier [135] is an example of a spatial classifier that has been applied successfully in classification of remote sensing data. The ECHO classifier incorporates not only spectral variations but also spatial ones in the decision-making process. It uses a two-stage process, first segmenting the scene into statistically homogeneous regions, then classifying the data based on an MLC scheme. The ECHO classifier uses exactly the same training procedures and statistics as a conventional ML pixel classifier. However, it offers selectable parameters that allow the analyst to vary the degree and character of the spatial relationship used in the classification. The available choices include those incorporating image texture over a block area surrounding each pixel.
With a proper choice of parameter values, the ECHO classifier usually provides higher accuracies than pixel classifiers and frequently requires less computation time for a given classification task [134].
2.5 ESTIMATION OF CLASSIFICATION ERROR
Evaluation of classification error is very important in order to assess the applicability of individual classification approaches. This section gives a brief overview of the assessment metrics that have been extensively used to evaluate output classification maps. In general, almost all metrics for the assessment of the final classification map are based on the confusion matrix. This matrix makes it possible to evaluate the accuracy of a given classification map with respect to the reference map. In this section, the confusion matrix will first be described, and then several class-specific and global estimators will be derived from it.
2.5.1 Confusion Matrix
In pattern recognition, a confusion matrix is considered as a visualization tool typically used in supervised learning. In this matrix, each column represents the instances in a predicted class, while each row represents the instances in an actual class. This matrix also shows where the classification technique leads to confusion (i.e., commonly mislabeling one class as another). Table 2.1 represents an example of a confusion matrix for a 3-class classification problem. The term $C_i$ represents the class i, and the term $C_{ij}$ refers to the number of pixels referenced as class i that are assigned to class j (wrongly so when i ≠ j). $N_c$ represents the number of classes in the reference map.
2.5.1.1 Overall Accuracy (OA)
The OA is the percentage of correctly classified pixels, which can be estimated as follows:
$$\mathrm{OA} = \frac{\sum_{i}^{N_c} C_{ii}}{\sum_{i,j}^{N_c} C_{ij}} \times 100.$$
2.5.1.2 Class Accuracy (CA)
The CA (or producer’s accuracy) is the percentage of correctly classified pixels for each class. This metric indicates how well a certain area was classified. It accounts for the error of omission: increased errors of omission lead to a lower producer’s accuracy. The metric is calculated by dividing the number of correct pixels in one class by the total number of pixels of that class as derived from the reference data:
$$\mathrm{CA}_i = \frac{C_{ii}}{\sum_{j}^{N_c} C_{ij}} \times 100.$$
2.5.2 Average Accuracy (AA)
The AA is the mean of the class accuracies over all the classes, which can be estimated as follows:
$$\mathrm{AA} = \frac{1}{N_c} \sum_{i}^{N_c} \mathrm{CA}_i.$$
Table 2.1 Confusion Matrix for a 3-Class Classification Problem
(rows: reference data; columns: classification data; accuracies are percentages)

Reference Data      C1                    C2                    C3                    Row Total       Producer’s Accuracy
C1                  C11                   C12                   C13                   Σ_i C1i         C11 / Σ_i C1i
C2                  C21                   C22                   C23                   Σ_i C2i         C22 / Σ_i C2i
C3                  C31                   C32                   C33                   Σ_i C3i         C33 / Σ_i C3i
Column Total        Σ_i Ci1               Σ_i Ci2               Σ_i Ci3               N
User’s Accuracy     C11 / Σ_i Ci1         C22 / Σ_i Ci2         C33 / Σ_i Ci3
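The accuracy measures defined so far can be computed directly from a confusion matrix, as in the following minimal Python sketch. The function name `accuracy_measures` and the pixel counts in `cm` are hypothetical and chosen only for illustration; rows are reference classes and columns are assigned classes, as in Table 2.1.

```python
def accuracy_measures(cm):
    """cm[i][j]: number of pixels of reference class i assigned to class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)                  # N
    diag = [cm[i][i] for i in range(n_classes)]          # correctly classified
    oa = 100.0 * sum(diag) / total                       # overall accuracy
    producers = [100.0 * diag[i] / sum(cm[i])            # CA_i: row-wise
                 for i in range(n_classes)]
    users = [100.0 * diag[j] / sum(row[j] for row in cm)  # column-wise
             for j in range(n_classes)]
    aa = sum(producers) / n_classes                      # mean of the CA_i
    return oa, producers, users, aa


# Hypothetical, unbalanced reference set: class 3 has only 10 pixels.
cm = [[50, 0, 0],
      [5, 40, 5],
      [3, 5, 2]]
oa, producers, users, aa = accuracy_measures(cm)
print(f"OA = {oa:.2f}%, AA = {aa:.2f}%")
print("Producer's accuracies:", producers)
```

In this example the small third class is classified poorly (CA of 20%), which lowers the AA far more than the OA, since the AA weights every class equally regardless of its number of reference pixels.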
It should be noted that the closer the OA or the AA is to 100%, the more accurate the classification. A problem with the OA arises when the reference set is unbalanced. In this case, the OA may not be a good indicator of the true performance of the classifier. For example, if a class has very few reference pixels, its influence on the OA will be very low, while it will have more influence on the AA, since the mean is taken over the number of classes rather than over the total number of pixels. Strong differences between the OA and the AA may indicate that a large proportion of a specific class is wrongly classified.
2.5.2.1 Kappa Coefficient (κ)
This metric is a statistical measurement of agreement between the final classification map and the reference map. It is the percentage agreement corrected by the level of agreement that could be expected due to chance alone. It is generally thought to be a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance:
$$\kappa = \frac{P_o - P_e}{1 - P_e},$$
where
$$P_o = \mathrm{OA}, \qquad P_e = \frac{1}{N^2} \sum_{i}^{N_c} C_{i+} C_{+i}, \qquad C_{i+} = \sum_{j}^{N_c} C_{ij}, \qquad C_{+i} = \sum_{j}^{N_c} C_{ji}.$$
Here N is the total number of reference pixels, and $P_o$ is the OA expressed as a fraction rather than as a percentage.
2.6 SUMMARY
In this chapter, we have focused on supervised classification approaches, many of which are considered in the next chapters. First, we discussed statistical classification approaches. Then, we reviewed support vector machines, neural networks, decision trees, and ensemble methods, including random forests. All these methods are in essence per pixel approaches. One spatial classifier was discussed (i.e., the statistical spatial ECHO classifier). Finally, accuracy analysis was introduced.
Chapter 3
Feature Reduction
As discussed in the first chapter, the space of high-dimensional data is almost empty. Consequently, high-dimensional data can be represented in a lower dimensional subspace. Therefore, it is possible to reduce the dimensionality of high-dimensional data without sacrificing significant information and class separability. By decreasing the number of dimensions, we not only reduce the high processing time of the subsequent steps applied to hyperspectral data, but also make it easier to process this type of data with existing techniques. Bearing this in mind, it is desirable to project the high-dimensional data into a lower dimensional subspace where the undesirable effects of high-dimensional geometric characteristics are decreased. In [3], it was shown that too many spectral bands can be undesirable from the standpoint of expected classification accuracy because the accuracy of the statistical estimation decreases (Hughes phenomenon). This implies that there is an optimal number of bands for classification accuracy and that more features do not necessarily lead to better classification accuracies. Therefore, feature reduction techniques may lead to a better classification accuracy. In general, feature reduction techniques can be divided into feature selection and feature extraction techniques. In the literature, a large number of techniques have been introduced for feature selection and feature extraction in order to obtain informative features. The main aim of this chapter is to provide readers with a general overview of existing feature reduction methods that are used for hyperspectral data analysis, particularly for the purpose of spectral-spatial classification.
Below, after giving a brief description about the main categories of feature reduction techniques, different methods introduced in the literature will be discussed. Since it is difficult to describe all the different existing methods in this book and it is also out of the scope of the book, we intentionally avoid elaborating on all existing feature reduction techniques. However, references to the main works in the literature for each technique are provided here for interested readers to familiarize themselves with the main concepts. Moreover, a detailed description of the techniques that have been widely used for spectral-spatial classification of hyperspectral images is also provided.
3.1 FEATURE EXTRACTION (FE)
Feature extraction can be explained as finding a set of vectors that represents an observation while reducing the dimensionality. In pattern recognition, it is desirable to extract features that are focused on the discrimination between classes of interest. Although a reduction in dimensionality is of importance, the reduction has to occur without sacrificing the discriminative power of classifiers. Feature extraction is the process of producing a small number of features by combining existing bands. In this line of thought, feature extraction techniques transform the input data linearly or nonlinearly to another domain and extract informative features in the new domain. The following diagram shows the different types of feature extraction techniques.
Feature extraction
    Knowledge-based
    Statistical
        Unsupervised
        Supervised
            Parametric
            Nonparametric
Different objectives can be considered when determining a new set of features for use in a specific application. In this way, knowledge-based feature extraction techniques are effective and direct [136]. This type of method is often used for the purpose of image enhancement to make specific characteristics of a scene more apparent for users [3]. With respect to the specific characteristics of spectral channels, feature extraction approaches of
this type can be applied by performing arithmetic operations on relevant bands to enhance the signal [136]. A well-known example of the knowledge-based feature extraction techniques that have been widely used in the literature is the normalized difference vegetation index (NDVI) [94]. This simple feature extraction technique is based on a fairly unique characteristic of green vegetation: the reflectance in the near infrared region is several times larger than its reflectance in a visible band. The normalized difference water index (NDWI) is another example of knowledge-based feature extraction approaches. The other type of feature extraction approach is statistical, which is based on representing an observation while reducing the dimensionality by linearly or nonlinearly transforming input data to a lower subspace. Feature extraction techniques of this type can be split into two categories: unsupervised and supervised. The former is used for the purpose of data representation when prior knowledge (training samples) of the scene is not available, and the latter is considered for solving the so-called Hughes phenomenon [2] and reducing the redundancy of data in order to improve classification accuracies when prior knowledge of the scene (training samples) is available. Finding the most informative feature vectors in terms of classification accuracies should be based on representing observations with reduced dimensionality without sacrificing the discriminant power of different classes of interest. Below, some well-known feature extraction techniques are listed along with their main references (the following list is inspired by Table 1 of [136]). It should be noted that we intentionally avoid elaborating on all existing feature extraction techniques, and we only discuss in detail the ones that have been extensively used for the purpose of spectral-spatial classification.
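As a concrete instance of the knowledge-based band arithmetic mentioned above, the NDVI is commonly computed as the difference between the near-infrared and red reflectances divided by their sum. The Python sketch below assumes that standard definition; the function name `ndvi`, the reflectance values, and the convention of returning 0 for a zero denominator are illustrative choices and are not taken from the text.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from two reflectance values."""
    total = nir + red
    # Assumption: return 0.0 where both reflectances are zero (index undefined).
    return (nir - red) / total if total != 0 else 0.0


# Hypothetical reflectances: vegetation reflects far more in the near infrared
# than in the red band, so its NDVI is close to 1; bare soil stays near 0.
print(ndvi(nir=0.50, red=0.10))   # vegetated pixel
print(ndvi(nir=0.25, red=0.20))   # sparsely vegetated or bare pixel
```

Applied pixel by pixel, this turns two spectral bands into a single feature that emphasizes vegetation, which is the sense in which such indices act as feature extraction.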
The unsupervised feature extraction methods have been mostly used for the purpose of data representation. The unsupervised techniques are fast and can be used in order to speed up further processing steps (e.g., segmentation or classification) by summarizing the information of the input data into a small number of features. The best-known unsupervised feature extraction techniques are: PCA [137], noise adjusted PCA or MNF [138, 139], kernel PCA (KPCA) [140, 141], independent component analysis (ICA) [142, 143], KICA [144, 145], discrete wavelet transform [146], projection pursuit (PP) [147], multiple source feature extraction [148], fast ISOMAP [149–151], enhanced ISOMAP (L-ISOMAP and EL-ISOMAP) [152, 153], graph-based (LWDP) [154], Hurst & Lyapunov exponents FE [155], locally linear embedding (LLE) [156, 157], greedy modular eigenspaces [158], and MI-based band averaging (BndClust) [159].
Supervised feature extraction techniques are split into two categories: parametric and nonparametric. The former scheme assumes a particular form for the class-conditional density functions in order to produce new features while the latter has a dual meaning. The most well-known parametric supervised feature extraction techniques are linear discriminant analysis (LDA) [7], regularized LDA [160], kernel local Fisher discriminant analysis, generalized discriminant analysis [161], canonical analysis [99], combining subsets of adjacent correlated bands into a smaller number of features [162], weighted sum of the bands grouped based on the Bhattacharyya distance [163], and averaging contiguous groups of bands based on the Jeffries-Matusita distance [164]. In contrast, the most widely used nonparametric supervised feature extraction techniques are nonparametric discriminant analysis (NDA) [7], NWFE [165], kernel NWFE [166], cosine-based nonparametric feature extraction (CNFE) [167], double nearest proportion (DNP) [168], and linear embedding (LE) [169]. Below, the best-known feature extraction techniques that have been extensively used in conjunction with spectral-spatial classifiers are described in detail.
3.1.1 Principal Component Analysis (PCA)
PCA is an unsupervised feature extraction technique. The general aim of PCA is to transform the data into a lower-dimensional subspace via a transformation that is optimal in terms of the sum-of-squared error [137]. PCA reduces the dimensionality of a data set with interrelated variables, while retaining as much as possible of the variation in the data set. The dimensionality reduction is obtained by a linear transformation of the data into a new set of variables, the PCs. The PCs are orthogonal to each other and are ordered in such a fashion that the first PC corresponds to the greatest variance, the second component corresponds to the second greatest variance, and so on. Each pixel in a d band image can be written as:
$$X_b = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}. \tag{3.1}$$
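In order to reduce the dimensionality of the input hyperspectral data, one can estimate the eigenvalues of the covariance matrix as follows:
$$C_{d,d} = \begin{bmatrix} \sigma_{1,1} & \cdots & \sigma_{1,d} \\ \vdots & \ddots & \vdots \\ \sigma_{d,1} & \cdots & \sigma_{d,d} \end{bmatrix}, \tag{3.2}$$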
where $\sigma_{i,j}$ is the variance for band i if i = j and otherwise $\sigma_{i,j} = \rho_{ij} \sigma_i \sigma_j$ for each pair of different bands, where $\rho_{ij}$ is the correlation coefficient between the bands. The eigenvalues (λ) of the variance-covariance matrix can be calculated as the roots of the characteristic equation as follows:
$$\det(C - \lambda I) = 0, \tag{3.3}$$
where C is the covariance matrix of the data and I is the identity matrix. The eigenvalues indicate how much of the original information each PC retains. From these values, one can calculate the percentage of the original variance explained by each PC as the ratio of the corresponding eigenvalue to the sum of all the eigenvalues. In this way, the PCs that contain minimum variance can be eliminated. The principal component transformation can be expressed as follows:
$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_d \end{bmatrix} = \begin{bmatrix} w_{1,1} & \cdots & w_{1,d} \\ \vdots & \ddots & \vdots \\ w_{d,1} & \cdots & w_{d,d} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}, \tag{3.4}$$
where $Y$ is the vector in the mapped space, $W$ is the transformation matrix, and $X$ is the vector of the original data. The column vectors of $W$ are the eigenvectors of $C$, the covariance matrix of the input data, ordered in the same way as the corresponding eigenvalues. These values give information on the relation between the bands and each PC. From these values one can link a main component with a real variable. The eigenvectors can be estimated from the vector-matrix equation for each eigenvalue $\lambda_b$ as follows:

\[ (C - \lambda_b I) w_b = 0, \tag{3.5} \]
where $C$ is the covariance matrix, $\lambda_b$ is eigenvalue $b$, $I$ is the identity matrix, and $w_b$ is eigenvector $b$. Figure 3.1 shows the first three obtained PCs for the Pavia University data.

3.1.2 Independent Component Analysis
The general idea of PCA is to project data into an orthogonal domain so that the eigenvectors corresponding to the highest eigenvalues retain the maximum variance of the input data. Because PCA is based on the estimation of the covariance matrix (second-order statistics), it disregards some important information, in particular when only a few components are kept.

ICA is a data transformation technique in which independent sources of activity in recorded mixtures of sources can be found. ICA does this by assuming that the sources are non-Gaussian signals that are statistically independent from each other. From a theoretical point of view, independence is a much stronger assumption than the decorrelation used in PCA. ICA is a special case of blind source separation.

A common example application of ICA is the so-called “cocktail party effect.” Imagine a guest at a cocktail party who has to focus on one person’s voice in a room filled with competing voices and other noises. This effect helps people to “tune into” a single voice and “tune out” all others. Inspired by this capability of human beings, ICA provides an automated solution to the cocktail party effect under certain idealized conditions.

ICA is based on finding a linear decomposition of the observed data into statistically independent components. Given an observation model $X = As$, in which $X$ is the vector of the observed signals, $A$ is a matrix of scalars corresponding to mixing coefficients, and $s$ is the vector of source signals, ICA finds a separating matrix $W$ such that $Y = WX = WAs$, in which $Y$ is a vector of independent components (ICs). This implies that the value of any one component does not give any information about the values of the other components. Figure 3.2 shows the first three ICA components for the Pavia University data.

Basically, ICA makes three assumptions in order to ensure that its basic model can be estimated:

1. The components of $Y$, estimated from the observed signal $X$, are statistically independent. This is the basic principle of ICA.
Figure 3.1 The first three obtained PCs for the Pavia University data: (a) the first PC, (b) the second PC, and (c) the third PC.
2. At most one component has a normal (Gaussian) distribution. If more than one component has a Gaussian distribution, adequate information for separating mixtures of Gaussian sources is not available. For Gaussian distributions, the higher-order cumulants are equal to zero, yet this higher-order information is crucial for estimating the ICA model; without it, the algorithm cannot work.
3. The unknown mixing matrix $A$ is square and of full rank (nonsingular). In other words, this assumption says that the number of independent components and the number of observed mixtures are the same. This assumption is taken into account in order to simplify the estimation, but it can sometimes be relaxed.

Under these three assumptions (or at least the first two), the independent components and the mixing matrix can be estimated, under some indeterminacies that will necessarily hold. For $X = As$, if both $A$ and $s$ are unknown, at least two uncertainties cannot be avoided. First, the variances of the independent components cannot be estimated. This is easy to see: any scalar multiplier in one of the sources can always be canceled by dividing the corresponding column of the mixing matrix by the same scalar. For this reason, the energy of the components can first be fixed by whitening, which makes all variances equal to unity, and the mixing matrix is adapted accordingly. Second, for the same reasons, the independent components cannot be ranked, because any change in their order will not change the possibility of estimating the model.

The best-known algorithms to find ICA components are Infomax, FastICA, and JADE. These algorithms are compared in the remote sensing context in [170]. A detailed explanation of ICA is out of the scope of this book, but we refer interested readers to [142, 143] for comprehensive information about ICA.

3.1.3 Discriminant Analysis Feature Extraction (DAFE)
Figure 3.2 The first three ICA components for the Pavia University data obtained by using the joint approximate diagonalization of eigen-matrices (JADE) approach [170] for estimating independent components.

DAFE is a parametric supervised feature extraction approach. This approach has been extensively used for dimension reduction in classification problems [7]. In this approach, within-class, between-class, and mixture scatter matrices are usually considered as the criteria of class separability. The within-class scatter matrix of DAFE ($S_w^{DA}$) is estimated by:

\[ S_w^{DA} = \sum_{i=1}^{K} P_i \Sigma_i, \tag{3.6} \]
where $P_i$ denotes the prior probability of class $i$, with $i = \{1, ..., K\}$, and $\Sigma_i$ is the class covariance matrix. The between-class scatter matrix of DAFE ($S_b^{DA}$) is given by:

\[ S_b^{DA} = \sum_{i=1}^{K} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} P_i P_j (m_i - m_j)(m_i - m_j)^T, \tag{3.7} \]
where $m_i$ is the class mean for class $i$. The parameter $m_0$ is the expected vector of the mixture distribution, which is given by:

\[ m_0 = \sum_{i=1}^{K} P_i m_i. \tag{3.8} \]
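As a quick check of the second equality in (3.7), consider the two-class case $K = 2$ with $P_1 + P_2 = 1$ and $m_0$ taken from (3.8). This worked example is added for illustration only and is not part of the original derivation.

```latex
\begin{align*}
m_1 - m_0 &= m_1 - (P_1 m_1 + P_2 m_2) = P_2 (m_1 - m_2), \\
m_2 - m_0 &= m_2 - (P_1 m_1 + P_2 m_2) = -P_1 (m_1 - m_2), \\
\sum_{i=1}^{2} P_i (m_i - m_0)(m_i - m_0)^T
  &= \left(P_1 P_2^2 + P_2 P_1^2\right)(m_1 - m_2)(m_1 - m_2)^T \\
  &= P_1 P_2 (m_1 - m_2)(m_1 - m_2)^T .
\end{align*}
```

The last line is exactly the single term ($i = 1$, $j = 2$) of the double sum in (3.7).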
In DAFE, the optimal features are extracted by optimizing the Fisher criterion expressed by:

\[ J = \mathrm{tr}\left[ (S_w^{DA})^{-1} S_b^{DA} \right]. \tag{3.9} \]
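The quantities in (3.6) through (3.9) can be sketched in a few lines of NumPy. The sketch below assumes that the priors $P_i$ are estimated from class frequencies and uses the standard solution of the Fisher criterion, namely the leading eigenvectors of $(S_w^{DA})^{-1} S_b^{DA}$; the function name `dafe` and the synthetic three-class data are hypothetical.

```python
import numpy as np

def dafe(X, y, n_features):
    """Return DAFE features and the matching eigenvalues for data X (n, d), labels y."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()                 # P_i from class frequencies (assumption)
    d = X.shape[1]
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    m0 = priors @ means                            # mixture mean, (3.8)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c, p, m in zip(classes, priors, means):
        Sw += p * np.cov(X[y == c], rowvar=False)  # within-class scatter, (3.6)
        diff = (m - m0)[:, None]
        Sb += p * (diff @ diff.T)                  # between-class scatter, (3.7)
    # Eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:n_features]
    return X @ eigvecs[:, order].real, eigvals.real[order]

# Synthetic data: K = 3 Gaussian classes in d = 5 dimensions.
rng = np.random.default_rng(1)
class_means = np.array([[0, 0, 0, 0, 0], [3, 0, 0, 0, 0], [0, 3, 0, 0, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, (100, 5)) for m in class_means])
y = np.repeat([0, 1, 2], 100)

Z, eigvals = dafe(X, y, n_features=4)
print(eigvals.round(6))
```

Asking for four features from three classes shows that only the first two eigenvalues are meaningfully different from zero, since the between-class scatter matrix built from three class means has rank of at most two.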
The first row of Figure 3.3 demonstrates the first three components of DAFE extracted from the Pavia University data set. DAFE is fast and works well when the distribution of the data is normal (Gaussian), but its concept suffers from the following shortcomings:

1. When the distribution of the data is not normal (non-Gaussian), the performance of DAFE is degraded and its results are not promising.
2. When the difference in the mean vectors of the classes is small, the features extracted by DAFE will not be reliable. In the same manner, if one class-mean vector is very different from the others, its corresponding class will dominate the other classes in the computation of the between-class scatter matrix [171]. As a result, the feature extraction process will be ineffective.
3. DAFE is based on computations at full dimensionality, which demands a huge number of training samples in order to accurately estimate the statistics [165].
4. The main shortcoming of DAFE is that its between-class scatter matrix is not of full rank: its rank is at most $K - 1$, where $K$ is the number of classes. In this way, if the rank of the within-class scatter matrix is $u$, then DAFE only extracts $\min(K - 1, u)$ features. Since in real situations the data distribution is complicated, using only $K - 1$ features is usually not sufficient [3].

3.1.4 Decision Boundary Feature Extraction (DBFE)
This method was proposed in [172] and relies on extracting informative features from decision boundaries. DBFE considers training samples directly in order to determine the location of the effective decision boundary, which is the boundary where different classes overlap [173]. For two Gaussian classes, the workflow of DBFE can be summarized as follows (the steps of the DBFE procedure were extracted from [172]):
1. Let $m_i$ and $\Sigma_i$ be the class mean and class covariance matrix, respectively.
2. Classify the input data by using all of the bands and the training samples.
3. Perform a chi-square threshold test on the correctly classified training samples of each class and delete outliers. To do so, for class $i$, keep $X$ only if the following chi-square threshold test is satisfied:

\[ (X - m_i)^T \Sigma_i^{-1} (X - m_i) < R_{t1}. \]

Let $(X_1, X_2, ..., X_{L_1})$ be the correctly classified training samples of class $w_1$ that satisfy the chi-square threshold test and $(Y_1, Y_2, ..., Y_{L_2})$ be the correctly classified training samples of class $w_2$ that satisfy the chi-square threshold test.
4. Perform a chi-square threshold test of class $w_1$ on the training samples of class $w_2$, and keep $Y_j$ only if the following chi-square threshold test is satisfied:

\[ (Y_j - m_1)^T \Sigma_1^{-1} (Y_j - m_1) < R_{t2}. \]

If the number of samples of class $w_2$ that satisfy the chi-square threshold test is less than $L_{min}$, keep the $L_{min}$ samples of class $w_2$ that provide the smallest values.
5. For $X_i$ of class $w_1$, find the nearest sample of class $w_2$ kept in step 3.
6. Find the point $Po_i$ where the straight line connecting the pair of samples found in step 5 meets the decision boundary.
7. Find the unit normal vector $N_i$ to the decision boundary at the point $Po_i$ found in step 6 as follows:

\[ N = \nabla h(X) \big|_{X = X_i} = (\Sigma_1^{-1} - \Sigma_2^{-1}) X_i + (\Sigma_1^{-1} m_1 - \Sigma_2^{-1} m_2). \]

8. By repeating steps 5-7 for $X_i$, where $i = \{1, ..., L_1\}$, $L_1$ unit normal vectors will be estimated. By using the normal vectors, calculate the estimate of the effective decision boundary feature matrix ($\Sigma^1_{EDBFM}$) from class $w_1$ as follows:

\[ \Sigma^1_{EDBFM} = \frac{1}{L_1} \sum_{i}^{L_1} N_i N_i^t. \]

Repeat steps 3-8 for class $w_2$.
9. Calculate the final estimate of the effective decision boundary feature matrix by using the following equation:

\[ \Sigma_{EDBFM} = \frac{1}{2} \left( \Sigma^1_{EDBFM} + \Sigma^2_{EDBFM} \right). \]
This workflow can be easily extended to multiclass cases [172]. The second row of Figure 3.3 demonstrates the first three components of DBFE extracted from the Pavia University data set. Some of the main points of using DBFE are listed below:

1. Since DBFE directly considers the classification accuracies rather than other metrics (e.g., statistical distances), it takes both mean separation and covariance differences into account. In this manner, this approach is more efficient than some other feature selectors whose performance degrades when there is no mean separation.
2. DBFE is able to efficiently handle the problems of outliers [172].
3. Because DBFE works directly on training samples to determine the location of effective decision boundaries, it demands many training samples. In other words, when not enough training samples are available, the efficiency of DBFE is degraded.
4. For the same reason, this approach can be computationally intensive if the number of training samples is large.
5. Also for the same reason, it suffers from the Hughes phenomenon as the number of features increases [173].
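Two building blocks of the DBFE workflow above, the chi-square threshold test (step 3) and the accumulation of unit normal vectors into the effective decision boundary feature matrix (steps 8 and 9), can be sketched as follows. This is not a full DBFE implementation: locating the boundary points and computing the normals (steps 5 to 7) are left out, and the normal vectors, the function names, and the threshold value are hypothetical. The final eigen-decomposition reflects the usual way of reading features off the matrix and is not spelled out in the text above.

```python
import numpy as np

def chi_square_keep(samples, mean, cov, threshold):
    """Step 3: keep samples whose squared Mahalanobis distance is below the threshold."""
    diff = samples - mean
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return samples[d2 < threshold]

def edbfm_from_normals(normals):
    """Step 8: average outer product of unit normal vectors, input shape (L, d)."""
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    return unit.T @ unit / len(unit)

# Step 3 on three toy samples; 9.21 is roughly the 99% chi-square quantile
# for two degrees of freedom (an assumed choice for R_t1).
samples = np.array([[0.0, 0.0], [1.0, 0.5], [10.0, 10.0]])
kept = chi_square_keep(samples, mean=np.zeros(2), cov=np.eye(2), threshold=9.21)

# Hypothetical normal vectors, as if produced by steps 5-7 for each class.
normals_w1 = np.array([[1.0, 0.0], [2.0, 0.2], [1.0, -0.1]])
normals_w2 = np.array([[0.9, 0.1], [1.0, 0.3]])

# Step 9: average the per-class matrices.
sigma_edbfm = 0.5 * (edbfm_from_normals(normals_w1) + edbfm_from_normals(normals_w2))

# Directions with large eigenvalues are the ones the decision boundary depends on.
eigvals = np.linalg.eigvalsh(sigma_edbfm)[::-1]
print(len(kept), eigvals.round(3))
```

Because every normal is scaled to unit length, the trace of the resulting matrix is one, so its eigenvalues can be read as the share of decision-boundary information carried by each direction.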
3.1.5 Nonparametric Weighted Feature Extraction (NWFE)
In order to overcome the limitations of DAFE, NWFE was introduced in [165]. NWFE is a nonparametric supervised feature extraction technique. It builds on DAFE by focusing on samples near the eventual decision boundary rather than giving the same weight to all training samples, as DAFE does. The main ideas behind NWFE are to put different weights on different samples in order to compute weighted means and to define new nonparametric within-class and between-class scatter matrices. The main advantages of using NWFE are as follows:

1. NWFE is generally of full rank. This makes it possible to select the desired number of features, unlike DAFE, which can usually extract only $K - 1$ features ($K$ is the number of classes) [3]. In addition, this advantage helps to reduce the issue of singularity [3].
2. The nonparametric nature of the between- and within-class scatter matrices in NWFE makes this approach well-suited even for nonnormally distributed data.

In NWFE, the nonparametric between-class scatter matrix for $K$ classes is estimated as

\[ S_b^{NW} = \sum_{i=1}^{K} \frac{P_i}{K-1} \sum_{\substack{j=1 \\ j \neq i}}^{K} \sum_{l=1}^{n_i} \lambda_l^{(i,j)} \left(X_l^{(i)} - M_j(X_l^{(i)})\right)\left(X_l^{(i)} - M_j(X_l^{(i)})\right)^T, \tag{3.10} \]
where $P_i$ denotes the prior probability of class $i$, with $i = \{1, ..., K\}$, and $X_l^{(i)}$ is the $l$th sample from class $i$. $\lambda_l^{(i,j)}$ is the scatter matrix weight, and $n_i$ is the training sample size of class $i$. $M_j(X_l^{(i)})$ is regarded as the weighted mean of $X_l^{(i)}$ in class $j$ and is given below:

\[ M_j(X_l^{(i)}) = \sum_{k=1}^{n_j} W_{lk}^{(i,j)} X_k^{(j)}, \tag{3.11} \]
in which $W_{lk}^{(i,j)}$ is the weight for estimating weighted means, which is estimated as follows:

\[ W_{lk}^{(i,j)} = \frac{\mathrm{dist}(X_l^{(i)}, X_k^{(j)})^{-1}}{\sum_{t=1}^{n_j} \mathrm{dist}(X_l^{(i)}, X_t^{(j)})^{-1}}, \tag{3.12} \]
where $\mathrm{dist}(a, b)$ means the distance from $a$ to $b$. As can be observed, the weight $W_{lk}^{(i,j)}$ is a function of $X_l^{(i)}$ and $X_k^{(j)}$. In this case, if the distance between $X_l^{(i)}$ and $X_k^{(j)}$ is small, the weight $W_{lk}^{(i,j)}$ goes to one; otherwise, the weight $W_{lk}^{(i,j)}$ goes to zero.

The scatter matrix weight $\lambda_l^{(i,j)}$ is a function of $X_l^{(i)}$ and $M_j(X_l^{(i)})$. In this case, for class $i$, if the distance between $X_l^{(i)}$ and $M_j(X_l^{(i)})$ is small, the scatter matrix weight $\lambda_l^{(i,j)}$ goes to one; otherwise, the scatter matrix weight goes to zero. The scatter matrix weight $\lambda_l^{(i,j)}$ is defined as:

\[ \lambda_l^{(i,j)} = \frac{\mathrm{dist}(X_l^{(i)}, M_j(X_l^{(i)}))^{-1}}{\sum_{t=1}^{n_i} \mathrm{dist}(X_t^{(i)}, M_j(X_t^{(i)}))^{-1}}. \tag{3.13} \]
The nonparametric within-class scatter matrix $S_w^{NW}$ is estimated by

\[ S_w^{NW} = \sum_{i=1}^{K} P_i \sum_{l=1}^{n_i} \frac{\lambda_l^{(i,i)}}{n_i} \left(X_l^{(i)} - M_i(X_l^{(i)})\right)\left(X_l^{(i)} - M_i(X_l^{(i)})\right)^T. \tag{3.14} \]
The optimal features are extracted by optimizing $(S_w^{NW})^{-1} S_b^{NW}$. The general workflow of NWFE is as follows [172]:

1. Compute the distances between each pair of sample points and create a distance matrix.
2. Compute the weights $W_{lk}^{(i,j)}$ by using the distance matrix produced in step 1.
3. Compute the weighted means $M_j(X_l^{(i)})$ by using $W_{lk}^{(i,j)}$.
4. Compute the scatter matrix weights $\lambda_l^{(i,j)}$.
5. Compute $S_b^{NW}$ and $S_w^{NW}$.
6. Extract features by considering $(S_w^{NW})^{-1} S_b^{NW}$.
The third row of Figure 3.3 demonstrates the first three components of NWFE extracted from the Pavia University data set. For detailed information regarding NWFE, see [165].
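The six workflow steps of NWFE can be sketched in NumPy as shown below. Several choices are assumptions made for this illustration rather than details taken from the text: Euclidean distance is used for $\mathrm{dist}(\cdot,\cdot)$, priors are estimated from class frequencies, a sample is excluded from its own within-class weighted mean to avoid a zero distance, and a small `eps` guards the divisions. The function names and the synthetic two-class data are hypothetical.

```python
import numpy as np

def nwfe_scatter_matrices(X, y, eps=1e-12):
    """Return (S_b, S_w) following (3.10)-(3.14); X is (n, d), y holds class labels."""
    classes = np.unique(y)
    K, d = len(classes), X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for ci in classes:
        Xi = X[y == ci]
        n_i, P_i = len(Xi), len(Xi) / len(X)     # prior from class frequency (assumption)
        for cj in classes:
            Xj = X[y == cj]
            # Step 1: distances between every sample of class i and every sample of class j.
            dist = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2)
            if ci == cj:
                np.fill_diagonal(dist, np.inf)   # assumption: ignore a sample's distance to itself
            # Step 2: weights W_lk, (3.12); each row sums to one.
            W = 1.0 / (dist + eps)
            W /= W.sum(axis=1, keepdims=True)
            # Step 3: weighted means M_j(X_l), (3.11).
            M = W @ Xj
            # Step 4: scatter matrix weights lambda_l, (3.13).
            diff = Xi - M
            inv = 1.0 / (np.linalg.norm(diff, axis=1) + eps)
            lam = inv / inv.sum()
            # Step 5: weighted sum of outer products, added to S_w (3.14) or S_b (3.10).
            S = (diff * lam[:, None]).T @ diff
            if ci == cj:
                Sw += P_i * S / n_i
            else:
                Sb += P_i * S / (K - 1)
    return Sb, Sw

def nwfe_features(X, y, n_features):
    """Step 6: project onto the leading eigenvectors of S_w^{-1} S_b."""
    Sb, Sw = nwfe_scatter_matrices(X, y)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:n_features]
    return X @ eigvecs[:, order].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 4)), rng.normal(1.5, 1.0, (40, 4))])
y = np.repeat([0, 1], 40)
Sb, Sw = nwfe_scatter_matrices(X, y)
Z = nwfe_features(X, y, n_features=3)
print(np.linalg.matrix_rank(Sb), Z.shape)
```

With only two classes, DAFE could extract a single feature, whereas the between-class matrix computed here generally has a higher rank, so three features can be requested.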
3.2 FEATURE SELECTION
Feature selection is perhaps the most straightforward way to reduce the dimensionality of a data set: it simply selects a subset of features from the set of available features based on a criterion. In more detail, the goal of feature selection techniques is to overcome the curse of dimensionality and reduce the redundancy of the high-dimensional data by selecting the optimal $d_E$ bands for the classification, $d_E \le d$, wherein $d$ is the total number of bands in a given image. As an example, imagine one wishes to select the best five bands out of 10 available bands for the classification of a data set with six classes by using the Bhattacharyya distance [7] feature selection technique. To do so, one needs to compute the Bhattacharyya distance between each pair of classes for each subset of size five out of the 10-band data [3]. As the output of this procedure, the five features that provide the highest Bhattacharyya distance in the feature domain will be selected. Please note that feature selection techniques do not change the input data themselves; they simply select the most informative of the available bands according to a criterion.

Defining a good selection metric for evaluating the efficiency of different groups of bands is vital in order to develop a feature selection technique. In this respect, the feature selection techniques can be split into two main categories: filter techniques and wrapper techniques. The filter techniques can be considered as a preprocessing step in which the selection metric is independent of the subsequent classification of the input data. On the other hand, the wrapper techniques use a classification accuracy measure (e.g., overall accuracy or kappa coefficient) as the selection metric in order to evaluate different groups of bands [136].
For example, the classification approaches maximum likelihood, k-nearest neighbor, RF, or SVM can be used as the selection metric in the wrapper techniques. It should be noted that the feature selection step is rarely used for spectral-spatial classification. As a result, we intentionally avoid elaborating on all existing feature selection techniques and put our main emphasis on the ones that are used in conjunction with spectral-spatial classifiers. However, we provide interested readers with citations in case they want to gain comprehensive information about them.

Figure 3.3 The first three components of DAFE, DBFE, and NWFE, respectively, for the Pavia University data. (Rows from top to bottom: DAFE, DBFE, NWFE; each row has panels (a), (b), and (c).)

3.2.1 Supervised and Unsupervised Feature Selection Techniques
From one point of view, feature selection techniques can be categorized into two categories: unsupervised and supervised. Supervised feature selection techniques aim at finding the most informative features with respect to prior knowledge and lead to better identification and classification of different classes of interest. On the other hand, unsupervised methods are used in order to find distinctive bands when prior knowledge of the classes of interest is not available. The simplest way to apply unsupervised feature selection is to manually discard low signal-to-noise ratio (SNR) bands from the input data. In this way, for example, low-SNR, noisy, or water-absorbed bands can be simply discarded. However, this way of selecting features is best considered a preprocessing step for further feature selection/extraction methods: its output often still contains many redundant features, although it helps to reduce the computational burden by throwing away unnecessary features and to reduce the possibility of singularity. Some well-known unsupervised feature selection techniques are (the following list is inspired by Table 1 of [136]): SNR, geometric-based representative bands [174], a representative band from each cluster of bands [175], dissimilar bands based on linear projection [176], and minimizing dependency between selected bands using mutual information [177].

Supervised feature selection techniques can be split into two categories: parametric and nonparametric. Parametric supervised feature selection approaches model the class-conditional density functions by using training samples. Some well-known parametric supervised feature selection techniques are Bhattacharyya distance [178], Jeffries-Matusita distance [13], which was extended in [179], Jeffries-Matusita distance for spatially invariant features [180], Jeffries-Matusita distance for visual inspection and quantitative band selection [181], divergence [13], divergence on class-based PCA data [182], cluster space separability [94], abundance covariance [183], and kernel dependence [184].
The nonparametric scheme, however, does not assume a particular form for the class-conditional density functions. In other words, these methods make use of information obtained by training samples directly without modeling class data [136]. Some well-known nonparametric supervised feature selection techniques are mutual information [185–188], spectral angle mapper [189], canonical correlation measure [190], and feature weighting [182, 191]. Below we discuss a few feature selectors that are based on evolutionary optimization techniques in more detail since the following concepts have been used along with spectral-spatial classifiers in the literature.

3.2.2 Evolutionary-Based Feature Selection Techniques
Although conventional feature selection techniques have been used extensively in remote sensing for many years, they suffer from the following shortcomings:

1. Most conventional feature selection techniques are based on the estimation of the second-order statistics (e.g., covariance matrix) and because of that, they demand many training samples in order to estimate the statistics accurately (see footnote 3). Therefore, in a situation when the number of training samples is limited, the singularity of covariance matrices is possible. In addition, since the existing bands in hyperspectral data usually have some redundancy, the probability of the covariance matrix being singular increases even further.
2. In order to select informative bands by using conventional feature selectors, corrupted bands (e.g., water absorption and low SNR bands) are usually preremoved, which is a time-consuming task. In addition, conventional feature selection methods can be computationally demanding since they are based on exhaustive search techniques and they require the calculation of all possible alternatives in order to choose the most informative features from those available. In this case, in order to select $m$ features out of a total of $n$ features, these methods must evaluate $n!/((n - m)!\,m!)$ alternatives, which is a laborious task and demands a significant amount of memory. In other words, these feature selection techniques are only feasible in relatively low-dimensional cases; as the number of bands increases, the CPU processing time increases exponentially.

Footnote 3: Reminder: The mean vector is considered as a first-order statistic, since it involves only one variable. In contrast, the covariance matrix is known as a second-order statistic since it considers the relationship between two variables. In the same way, correlation indicates how two variables are related to each other. Higher-order statistics are related to the relationships between more variables. As the order of the statistic increases, the statistical estimation using a limited number of training samples becomes more problematic [3]. This is the reason why we would generally expect that the mean vector can be relatively well estimated with a smaller number of samples compared to the covariance matrix.

In order to address the above-mentioned shortcomings of conventional feature selection techniques, the new trend of feature selection methods is usually based on the use of stochastic and evolutionary optimization techniques (e.g., genetic algorithms (GAs) and particle swarm optimization (PSO)). The main reasons behind this trend are: (1) in evolutionary feature selection techniques, there is no need to calculate all possible alternatives in order to find the most informative bands, and (2) in evolutionary-based feature selection techniques, usually a metric is chosen as the fitness function that is not based on the calculation of the second-order statistics, and, in this case, the singularity of the covariance matrix is not a problem. Furthermore, unlike the conventional feature selection techniques, for which the number of required features needs to be set by the user, most evolutionary-based feature selectors are able to automatically select the most informative features in terms of classification accuracy without requiring the number of desired features to be set a priori.

In order to make the most of the evolutionary-based feature selection techniques, the use of an efficient metric (fitness function) is vitally important for estimating the capability of different potential solutions. In this case, approaches such as divergence [13], transformed divergence [13], Bhattacharyya distance [7], and Jeffries-Matusita distance [13] can be taken into account as the fitness function.
However, as mentioned previously, these metrics require many training samples in order to estimate statistics accurately. In addition, these metrics are based on the estimation of the second-order statistics. When the number of training samples is limited and the input features are highly correlated, the singularity of the covariance matrix degrades the efficiency of the feature selection step and these metrics cannot lead to a conclusion.

As discussed before, for the purpose of hyperspectral image analysis, SVM and RF play an important role since they can handle high-dimensional data even if a limited number of training samples is available. In addition, SVM and RF are nonparametric classifiers and are therefore suitable for non-Gaussian (nonnormal) data sets. Therefore, the output of these classifiers can be chosen as a fitness function. However, it should be noted that both SVM and RF have their own shortcomings when considered as the fitness function. For instance, when RF is considered as the fitness function, due to its capability to handle different types of noise, corrupted and noisy bands cannot be eliminated even after a high number of iterations. In contrast, since SVM is more sensitive than RF to noise, SVM is able to detect and discard corrupted bands after a few iterations, which can be considered an advantage for the final classification step. However, SVM needs a time-demanding step, cross-validation, in order to tune the hyperplane parameters. Since most of the evolutionary-based feature selection techniques are based on an iterative process and produce many potential solutions, SVM needs to be applied many times during the process; because of that, evolutionary-based feature selection approaches that use SVM as the fitness function demand a huge amount of CPU processing time, which is not the case for RF. Another alternative would be to use SVM without the cross-validation step and arbitrarily initialize the hyperplane parameters. In this case, the algorithm is not automatic anymore and the obtained results might not be reliable. As a result, a careful choice of the fitness function is very important, and we recommend that readers choose the most appropriate metric based on their problem and data set.

In the literature, there is a huge number of articles related to the use of evolutionary optimization-based feature selection techniques. These methods are mostly based on the use of GA and PSO. For example, in [192], the authors developed an SVM classification system that allows the detection of the most distinctive features and the estimation of the SVM parameters (e.g., regularization and kernel parameters) by using a GA.
In [193], PSO was considered in order to select the most informative features obtained by morphological profiles for classification. In [194], a method was developed that allowed solving problems of clustering, feature detection, and class number estimation simultaneously in an unsupervised way by considering PSO. Below, we discuss a few evolutionary-based feature selection techniques, including GA- and PSO-based feature selection, in detail. In addition, based on the shortcomings of GA and PSO, we will describe the hybrid GA-PSO (HGAPSO)- and FODPSO-based feature selection methods. It should be noted that since the aforementioned feature selection approaches have been used with spectral-spatial classifiers in the literature, we will discuss them in more detail than the other feature selection techniques.
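The cost of the exhaustive search discussed in this section can be made concrete with a short sketch. The code below enumerates every subset of $m$ bands out of $n$, scores each with a fitness function, and counts how many alternatives were evaluated. The function `exhaustive_selection`, the per-band quality values, and `toy_score` are hypothetical stand-ins for a real selection metric such as the Bhattacharyya distance.

```python
from itertools import combinations
from math import comb

def exhaustive_selection(n_bands, m, score):
    """Evaluate every m-band subset; return the best subset and the number of subsets tried."""
    best_subset, best_score, tried = None, float("-inf"), 0
    for subset in combinations(range(n_bands), m):
        tried += 1
        s = score(subset)
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, tried

# Hypothetical usefulness of each of 10 bands (not taken from any real data set).
band_quality = [0.2, 0.9, 0.1, 0.7, 0.4, 0.8, 0.3, 0.6, 0.5, 0.05]

def toy_score(subset):
    return sum(band_quality[b] for b in subset)

best, tried = exhaustive_selection(10, 5, toy_score)
print(best, tried)        # (1, 3, 5, 7, 8) 252
print(comb(100, 10))      # 17310309456440 subsets for only 10 out of 100 bands
```

Selecting 5 out of 10 bands requires 252 evaluations, which is manageable, but the count for 10 out of 100 bands already exceeds $10^{13}$, which illustrates why exhaustive search is only feasible in relatively low-dimensional cases.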
3.2.3 Genetic Algorithm (GA)-Based Feature Selection
GA is inspired by the genetic process of biological organisms. GA consists of several solutions called chromosomes or individuals. Each chromosome in a binary GA includes several genes with binary values, 0 and 1, which determine the attributes of each individual. A set of chromosomes forms a population. For the purpose of feature selection based on GA, the length of each chromosome should be equal to the number of input features. In this case, the value of each gene, 0 or 1, demonstrates the absence or the presence of the corresponding band, respectively. The merit of each chromosome is evaluated by using a fitness function. Fitter chromosomes are selected through the selection step as parents for the generation of new chromosomes (offspring). In that step, two fit chromosomes are selected and combined through a crossover step. Then mutation is performed on the offspring in order to increase the randomness of individuals and so decrease the possibility of getting stuck in a local optimum [195]. Below, the main steps of the GA are briefly described. Figure 3.4 shows the general idea of a simple GA.

In each generation, first, the fitness values of all the chromosomes in the same population are calculated (e.g., the overall accuracy of SVM on validation samples). Then the selection step is applied. The main idea behind the selection step is to give preference to better individuals (those that have a higher fitness value) by allowing them to pass on their specification (genes) to the next generation and preventing the worst-fit individuals from entering the next generations. As the generations go on, the chromosomes should get fitter and fitter (i.e., closer and closer to the desired solution). There are a wide variety of techniques for the selection step, but tournament selection [196] and roulette wheel selection [197] are the most common ones.
Crossover is regarded as the process of taking more than one parent chromosome and producing new offspring from them. In the crossover step, generally, fitter chromosomes are selected based on their fitness value and recombined with each other in order to produce new chromosomes for the next generation. In this way, once a pair of chromosomes has been selected as parents, crossover can take place to produce offspring with respect to a crossover probability. A probability of 1.0 indicates that all the selected chromosomes are used in reproduction (i.e., there are no survivors). However, empirical studies have indicated that better results are obtained by a crossover probability of between 0.65 and 0.85, which implies a probability of between 0.35 and 0.15 that a selected chromosome survives to the next generation unchanged (apart from any changes arising from mutation) (see footnote 4). There are a wide variety of methods for performing crossover on the chromosomes, but the most popular ones are one-point, two-point, and uniform crossover [198]. A simple scheme of different crossover methods is shown in Figure 3.5.

Figure 3.4 General idea of the conventional GA. (Flowchart: Start; initial population; evaluation of chromosomes (fitness); selection; crossover; mutation; create the new population; stop criterion fulfilled? If yes, end; if no, continue iterating.)

Mutation is used along with the crossover operation in order to increase the randomness of the population and add more diversity to the chromosomes in order to avoid getting stuck in a local optimum. Mutation is performed on the chromosomes based on the mutation probability. The mutation probability (or ratio) is considered as a measure of the likelihood that random genes of the chromosome will be flipped into something else (in binary GA, the value is switched from 0 to 1 or 1 to 0). For example, if a chromosome is encoded as a binary string of length 100 and the mutation probability is 0.01, then 1 out of the 100 bits (on average) is picked at random and flipped. A simple scheme of mutation with the probability of 0.1 is shown in Figure 3.6.

The GA is an iterative process, so it is iterated again and again until the stop criterion is met. There are different ways to define a stop criterion. For example, if the difference between the best fitness value and the average of all fitness values in one iteration is less than a predefined threshold value, the process can be terminated. In addition, the number of iterations can be predefined by the user as another way of terminating the process.

The main shortcoming of the GA is that if a chromosome is not selected, the information contained by that individual is lost. In addition, GA is slow and it demands a high CPU processing time. That problem can get even worse when the problem at hand is complicated.
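The GA steps described above (fitness evaluation, selection, crossover, mutation, and a stop criterion) can be sketched with the Python standard library as follows. This is a minimal binary GA under stated assumptions, not the implementation used in the cited works: it uses tournament selection with two contenders, one-point crossover, a fixed number of generations as the stop criterion, and it keeps track of the best chromosome seen so far purely for reporting. The fitness function `toy_fitness` is a hypothetical stand-in for a real metric such as the overall accuracy of a classifier on validation samples.

```python
import random

def tournament(population, fitnesses, rng, k=2):
    """Tournament selection: the fittest of k randomly drawn chromosomes wins."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents after a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate, rng):
    """Flip each gene (0 <-> 1) independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def ga_select(fitness, n_bands, pop_size=20, generations=40,
              crossover_prob=0.8, mutation_prob=0.01, seed=0):
    rng = random.Random(seed)
    # Gene value 1 means the band is selected, 0 means it is left out.
    population = [[rng.randint(0, 1) for _ in range(n_bands)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):                  # stop criterion: fixed iteration count
        fitnesses = [fitness(c) for c in population]
        offspring = []
        while len(offspring) < pop_size:
            p1 = tournament(population, fitnesses, rng)
            p2 = tournament(population, fitnesses, rng)
            if rng.random() < crossover_prob:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:                                  # parents survive unchanged
                c1, c2 = p1[:], p2[:]
            offspring += [mutate(c1, mutation_prob, rng), mutate(c2, mutation_prob, rng)]
        population = offspring[:pop_size]
        best = max(population + [best], key=fitness)
    return best

# Hypothetical fitness: reward three "informative" bands, penalize subset size.
INFORMATIVE = {1, 4, 7}

def toy_fitness(chromosome):
    hits = sum(chromosome[b] for b in INFORMATIVE)
    return hits - 0.1 * sum(chromosome)

best = ga_select(toy_fitness, n_bands=10)
print(best, toy_fitness(best))
```

In a wrapper-style feature selector, `toy_fitness` would be replaced by a function that trains and evaluates a classifier on the bands marked with 1, which is where most of the CPU time goes.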
As with other evolutionary techniques, there is no guarantee that the GA will find a global optimum.

Figure 3.5 A simple scheme of different crossover methods. From top to bottom: one-point, two-point, and uniform crossover techniques.

Figure 3.6 A simple scheme of mutation with the probability of 0.1.

Footnote 4: http://www.optiwater.com/optiga/ga.html.

3.2.4 Particle Swarm Optimization (PSO)-Based Feature Selection

The original PSO was introduced by Eberhart and Kennedy in 1995 [199] and was inspired by the collective behavior of bird flocks flying over a search space in order to find food. To better understand PSO, imagine a flock of birds in which each bird cries for food at its current location (position). At the same time, each bird tracks the positions of the other birds and can tell which bird emits the loudest cry [43]. The birds try to fly over better and better regions of the search space in order to find the location with the highest concentration of food, following a trajectory that combines three rules:
1. Keep flying in the same direction;
2. Return to the spot where the highest amount of food was found;
3. Fly toward the neighboring bird that cries the loudest [43].

PSO is regarded as a simple optimization technique and, compared to GA, it can be implemented in a few lines of code. PSO consists of a set of particles called a population (or swarm), in which each particle is a potential solution to the problem at hand. Each particle is characterized by its location and velocity. In addition, each particle has a memory that keeps track of its own previous best position (the personal best) and of the best position obtained in the entire swarm (the global best). The positions of the particles depend strongly on the personal best and the global best. The velocities of the particles are adjusted according to the historical behavior of each particle and of the other particles in the swarm as they fly through the search space. Each move of a particle is influenced by its current position, its memory of its previous best position, and the group knowledge of the swarm [199]. Therefore, the particles tend to fly toward improved search areas over the course of the search process. Figure 3.7 shows the general idea of the conventional PSO.

To model the swarm in iteration $t + 1$, each particle $n$ moves in a multidimensional search space according to its position ($x_n[t]$) and its velocity ($v_n[t]$), which depend strongly on its personal best ($\breve{x}_n[t]$) and the global best ($\breve{g}_n[t]$) information:

$$v_n[t+1] = w\,v_n[t] + \rho_1 r_1 \left(\breve{x}_n[t] - x_n[t]\right) + \rho_2 r_2 \left(\breve{g}_n[t] - x_n[t]\right). \tag{3.15}$$
The parameter $w$ is the inertia weight (predefined by the user), which adjusts the influence of the previous velocity. The coefficients $\rho_1$ and $\rho_2$ are weights that control the influence of the personal best and the global best, respectively, when the new velocity is computed. Typically, $\rho_1$ and $\rho_2$ are initialized with constant integer values, which represent the “cognitive” and “social” components. Different results can be obtained by assigning different values to each component. The parameters $r_1$ and $r_2$ are random vectors in which each component is generally a uniform random number between 0 and 1. The intent is to multiply each velocity dimension by a new random component, rather than multiplying every dimension of a particle’s velocity by the same component, in order to increase the diversity of the particles and avoid getting trapped in a local optimum.

To use PSO for feature selection, the following points are important:
1. The dimensionality of each particle should equal the number of features. In this case, the velocity dimensionality (i.e., $\dim v_n[t]$) and the position dimensionality (i.e., $\dim x_n[t]$) correspond to the total number of bands of the input image (i.e., $\dim v_n[t] = \dim x_n[t] = d$). In other words, each particle’s velocity is represented as a $d$-dimensional vector.
2. Each particle represents its position in binary values (i.e., 0 or 1), where 0 denotes the absence of the corresponding band (feature) and 1 denotes its presence.
3. For the purpose of feature selection, the velocity of a particle can be considered as the probability of changing its position [200]:

$$\Delta x_n[t+1] = \frac{1}{1 + e^{-v_n[t+1]}}. \tag{3.16}$$
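As a rough sketch of (3.15) and (3.16), the following Python functions update the velocity of a single particle and map each velocity component to a probability. The function names, the list-based representation, and the values used in the usage lines ($w = 0.7$, $\rho_1 = \rho_2 = 2$) are assumptions made for illustration only.

```python
import math
import random


def update_velocity(v, x, p_best, g_best, w, rho1, rho2, rng):
    """Eq. (3.15), applied per dimension with fresh random factors r1 and r2."""
    new_v = []
    for v_k, x_k, p_k, g_k in zip(v, x, p_best, g_best):
        r1, r2 = rng.random(), rng.random()
        new_v.append(w * v_k + rho1 * r1 * (p_k - x_k) + rho2 * r2 * (g_k - x_k))
    return new_v


def change_probability(v):
    """Eq. (3.16): squash each velocity component into the interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v_k)) for v_k in v]


# Usage: one particle over d = 5 bands, pulled toward a different global best.
rng = random.Random(0)
x = [1, 1, 1, 0, 1]
v = update_velocity([0.0] * 5, x, p_best=x, g_best=[0, 1, 0, 1, 1],
                    w=0.7, rho1=2.0, rho2=2.0, rng=rng)
print(change_probability(v))
```

Bands on which the particle already agrees with both bests keep a velocity of zero and therefore a probability of 0.5; bands on which it disagrees with the global best are pushed toward 0 or 1.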
Initialize swarm (initialize $x_n$, $v_n$, $\breve{x}_n$, and $\breve{g}$)
Loop:
    for all particles
        Evaluate the fitness of each particle
        Update $\breve{x}_n$ and $\breve{g}$
        Update $v_n$ and $x_n$
    end
until stopping criteria (convergence)

Figure 3.7 General idea of the conventional PSO. Term $n$ shows the number of a particle. Terms $x_n$ and $v_n$ show the corresponding location (position) and velocity of that particle, respectively. Terms $\breve{x}_n$ and $\breve{g}_n$ demonstrate the personal best and global best of that particle.
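The loop of Figure 3.7 can be sketched for band selection as follows. This is a minimal sketch under stated assumptions, not the method of the cited works: positions are binary masks over the $d$ bands, each bit is redrawn by comparing the sigmoid of its velocity with a uniform random number, and `toy_fitness` together with the `INFORMATIVE` mask is a hypothetical stand-in for the accuracy of a real classifier. The default parameter values are likewise illustrative.

```python
import math
import random

# Hypothetical "ideal" band mask, used only by the toy fitness below.
INFORMATIVE = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    """Stand-in for classifier accuracy: number of bits agreeing with INFORMATIVE."""
    return sum(int(m == t) for m, t in zip(mask, INFORMATIVE))


def binary_pso(fitness, d, n_particles=10, n_iter=30,
               w=0.7, rho1=2.0, rho2=2.0, seed=0):
    rng = random.Random(seed)
    x = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_particles)]
    v = [[0.0] * d for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]
    p_fit = [float("-inf")] * n_particles
    g_best, g_fit = x[0][:], float("-inf")

    for _ in range(n_iter):
        # Evaluate every particle, then update personal and global bests.
        for n in range(n_particles):
            f = fitness(x[n])
            if f > p_fit[n]:
                p_best[n], p_fit[n] = x[n][:], f
            if f > g_fit:
                g_best, g_fit = x[n][:], f
        # Update velocities, then redraw the binary positions.
        for n in range(n_particles):
            for k in range(d):
                r1, r2 = rng.random(), rng.random()
                v[n][k] = (w * v[n][k]
                           + rho1 * r1 * (p_best[n][k] - x[n][k])
                           + rho2 * r2 * (g_best[k] - x[n][k]))
                prob = 1.0 / (1.0 + math.exp(-v[n][k]))
                x[n][k] = 1 if prob >= rng.random() else 0
    # Positions drawn in the last pass are not evaluated in this sketch.
    return g_best, g_fit


best_mask, best_fit = binary_pso(toy_fitness, d=8)
print(best_mask, best_fit)
```

In a real setting, `fitness` would train and validate a classifier on the bands selected by the mask, which is by far the most expensive step of each iteration.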
Nevertheless, since the algorithm is used for feature selection, each particle represents its position in binary values (i.e., 0 or 1). This may be expressed as:

$$x_n[t+1] = \begin{cases} 1, & \Delta x_n[t+1] \geq r_x \\ 0, & \Delta x_n[t+1] < r_x \end{cases} \tag{3.17}$$

where $r_x$ is a random $d$-dimensional vector in which each component is generally a uniform random number between 0 and 1. Therefore, each particle moves in a multidimensional space according to its position $x_n[t]$ from the discrete time system. In other words, each particle’s position is represented as a $d$-dimensional binary vector.

To illustrate the PSO procedure, a simple example is given in Figure 3.8. In the figure, the input data consist of only five bands (i.e., $d = 5$). This means that each particle is defined by its current velocity and position in the 5-dimensional space (i.e., $\dim v_n[t] = \dim x_n[t] = 5$). To keep the example easy to follow, a swarm of only two particles is considered. At time/iteration $t = 1$, particle 1 is “positioned” in such a way that it discards the 4th band (i.e., $x_1[1] = [1\ 1\ 1\ 0\ 1]$), while particle 2 ignores the 1st and 3rd bands (i.e., $x_2[1] = [0\ 1\ 0\ 1\ 1]$). The overall accuracies computed for particles 1 and 2 are $OA_1 = 60\%$ and $OA_2 = 64\%$, respectively. Considering only those two particles, particle 2 is the best performing one in the swarm and thus attracts particle 1 toward it. This attraction affects the velocity of particle 1 for iteration 2 and, consequently, its position.

Figure 3.8 Band coding of two particles for an image with five bands. Gridded bands are ignored in the classification process. The illustration is taken from [201] © IEEE. Permission for use granted by IEEE 2015.

Although PSO has been used extensively in the literature, it suffers from the following issues:
1. Premature convergence of a swarm: Particles try to converge to a single point that is located on a line between the global best and the personal best positions. This point is not guaranteed to be even a local optimum [202]. Another reason could be the fast rate of information flow between particles, which leads to the creation of similar particles. This results in a loss of diversity and increases the possibility of being trapped in local optima [203].
2. Parameter settings dependence: Different parameter settings lead to a high variance in performance for stochastic search algorithms [203]. In general, no single set of parameters suits different problems. As an example, increasing the inertia weight ($w$) increases the speed of the particles and causes more exploration (global search) and less exploitation (local search) [203]. As a result, finding the best set of parameters is not a trivial task, and it may differ from one problem to another [204].

The main advantage of PSO is its simple concept, along with the fact that it can be implemented in a few lines of code [205]. In addition, PSO has a memory of the personal best and the global best from past iterations. In GA, on the other hand, if a chromosome is not selected for mating, the information contained in that individual is lost. However, without a selection operator such as the one in GA, PSO may waste resources on inferior individuals [203]. PSO may also enhance the search capability for finding an optimal solution, whereas GA has a problem in finding an exact solution [206]. To address the main shortcomings of GA and PSO, the hybridization of GA and PSO has been investigated [205]. As suggested in [207], the correct combination of GA and PSO has the potential to obtain better results faster than either GA or PSO alone and can work more effectively across a wide variety of problems. In other words, a good combination of GA and PSO can address each technique’s shortcomings while making the most of each other’s advantages. In both [206] and [208], it has been suggested that a hybridization of GA and PSO can make up a very efficient search strategy.

3.2.5 Hybrid Genetic Algorithm Particle Swarm Optimization (HGAPSO)-Based Feature Selection
GA and PSO can be combined in different ways. In [205], the hybridization is obtained by integrating the standard velocity and update rules of PSO with selection, crossover, and mutation from GA. Figure 3.9 depicts the general idea of that technique. In [205], the overall accuracy of an SVM classifier on validation samples is used as the fitness value in order to evaluate the informativeness of different groups of bands. First, an initial population is randomly produced. The individuals in the population can be regarded as chromosomes with respect to GA or as particles with respect to PSO. Then, a new population for the next generation is produced through enhancement, crossover, and mutation operations, as detailed below.

Enhancement: In each generation, first, the fitness values of all the individuals in the population are estimated (e.g., the overall accuracy of SVM on validation samples). Second, the best-performing half of the particles is selected; these particles are called elites. Then, the elites are enhanced by the PSO introduced in the previous section. In [205], it was stated that when these enhanced elites are used as parents, the generated offspring usually achieve better performance than when the elites are used directly. In this case, (3.15) is first applied to the elites, and the velocity is then normalized to the range between 0 and 1 with (3.16). Finally, the normalized velocities are compared with a random vector whose components lie between 0 and 1 in order to update the position in binary format by using (3.17). By performing PSO on the elites, the search ability of the algorithm may increase. Half of the population in the next generation is produced in this way, and the rest is generated by using the selection, crossover, and mutation operations. This process is iterated until the stop criterion is met.

According to the results obtained in [205], HGAPSO-based feature selection outperforms GA- and PSO-based feature selection approaches, as well as NWFE and DBFE, in terms of classification accuracies, in a situation where other well-known feature selection techniques (e.g., divergence, transformed divergence, Bhattacharyya distance, and Jeffries-Matusita distance) could not lead to a conclusion due to the singularity of their covariance matrices.

3.2.6 FODPSO-Based Feature Selection
In order to overcome the main shortcoming of PSO, which is the stagnation of particles around suboptimal solutions, a chain of modifications has been introduced in recent years. One of those modifications is Darwinian PSO (DPSO), introduced in [59]. The main idea behind DPSO is to run many PSO algorithms in parallel on the same test problem and apply a simple selection mechanism. When a search tends to a local optimum, the search in that area is simply discarded and another area is searched instead. At each iteration of DPSO, swarms that improve are rewarded (i.e., particle life is extended or a new descendant is spawned), while swarms that stagnate are punished (i.e., swarm life is reduced or particles are deleted). In order to evaluate the efficiency of each swarm, the fitness of all particles is evaluated. Then, the personal and global best positions are updated. If a new global solution is found, a new particle is spawned. A particle dies off if the swarm fails to find a better position in a defined number of steps [43]. The following rules govern deleting a swarm, deleting particles, and spawning a new swarm or particle:
1. When the number of particles in a swarm falls below a minimum bound (a predefined number), the swarm is deleted.
2. The worst-performing particle in the swarm, with respect to its fitness value, is deleted when a maximum threshold number of steps (stagnancy counter $SC_C^{max}$) is reached without improving the fitness function. After deleting the particle, the counter is reset to a value approaching the threshold number, according to:

$$SC_C(N_{kill}) = SC_C^{max}\left[1 - \frac{1}{N_{kill} + 1}\right] \tag{3.18}$$

where $N_{kill}$ is the number of particles deleted from the swarm over a period in which there was no improvement in fitness.
3. To spawn a new swarm, the following conditions must be fulfilled: (1) the swarm must not have had any particle deleted and (2) the maximum number of swarms must not be exceeded. Even then, the new swarm is only produced with a probability of $p = f/NS$, where $f$ is a random number in the range [0, 1] and $NS$ is the number of swarms. This rule avoids the creation of new swarms when there are already many swarms in the search space. The parent swarm is unaffected; half of the parent’s particles are selected at random for the child swarm, and half of the particles of a random member of the swarm collection are also selected. If the swarm’s initial population number is not obtained, the rest of the particles are randomly initialized and added to the new swarm.
4. A particle is spawned whenever a swarm reaches a new global best and the maximum defined population of the swarm has not been reached.

Figure 3.9 The workflow of HGAPSO. Labels in the figure: first population consisting of n particles; sorted in a descending order (OA of SVM over validation samples as fitness value); PSO; selection; crossover and mutation. The illustration is taken from [205] © IEEE. Permission for use granted by IEEE 2015.
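The counter reset in (3.18) and the swarm-spawning probability $p = f/NS$ from the DPSO rules are simple enough to check numerically. The sketch below is illustrative only; the function names and the explicit `rng` argument are assumptions made here.

```python
import random


def reset_stagnancy_counter(sc_max, n_kill):
    """Eq. (3.18): value the stagnancy counter is reset to after a deletion."""
    return sc_max * (1.0 - 1.0 / (n_kill + 1))


def spawn_probability(n_swarms, rng):
    """p = f / NS, with f drawn uniformly from [0, 1)."""
    return rng.random() / n_swarms


# Usage: the reset value approaches sc_max as more particles are deleted.
for n_kill in (1, 2, 3, 9):
    print(n_kill, reset_stagnancy_counter(20, n_kill))
print(spawn_probability(4, random.Random(0)))
```

For a threshold of 20 steps, the counter is reset to 10 when $N_{kill} = 1$ and to 15 when $N_{kill} = 3$. Assuming the counter counts up toward the threshold, each further deletion therefore leaves the stagnating swarm fewer steps before the next one.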
Although promising results can be obtained by DPSO, this coopetitive approach also increases the computational complexity of the optimization method. Since many swarms of cooperative test solutions (i.e., particles) run simultaneously in a competitive fashion, the computational requirements increase and, consequently, so does the convergence time on the same computer hardware [43]. To further improve the DPSO algorithm, an extended version denoted as fractional-order DPSO (FODPSO) was introduced in [209] (see footnote 5). In FODPSO, fractional calculus is used to control the convergence rate of the algorithm. An important property revealed by fractional calculus is that, while an integer-order derivative implies just a finite series, a fractional-order derivative requires an infinite number of terms [43]. In other words, integer derivatives are “local” operators, while fractional derivatives implicitly have a “memory” of all past events. Because of this inherent memory property, fractional calculus is well suited to describing phenomena such as irreversibility and chaos. The dynamics of particle trajectories are a case where fractional calculus tools fit adequately [47]. Therefore, based on the FODPSO presented in [209] and the Grünwald-Letnikov definition of fractional calculus, in each step $t$ the fitness function is used to evaluate the merit of the particles (i.e., overall accuracy). To model swarm $s$, each particle $n$ moves in a multidimensional space according to its position ($x_n^s[t]$) and velocity ($v_n^s[t]$), which depend strongly on the local best ($\breve{x}_n^s[t]$) and global best ($\breve{g}_n^s[t]$) information:

$$v_n^s[t+1] = w_n^s[t+1] + \rho_1 r_1 \left(\breve{x}_n^s[t] - x_n^s[t]\right) + \rho_2 r_2 \left(\breve{g}_n^s[t] - x_n^s[t]\right), \tag{3.19}$$

$$w_n^s[t+1] = \alpha v_n^s[t] + \frac{1}{2}\alpha(1-\alpha)v_n^s[t-1] + \frac{1}{6}\alpha(1-\alpha)(2-\alpha)v_n^s[t-2] + \frac{1}{24}\alpha(1-\alpha)(2-\alpha)(3-\alpha)v_n^s[t-3]. \tag{3.20}$$
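A small sketch of (3.19) and (3.20) may help to see how the fractional coefficient $\alpha$ weights the last four velocities. The function names and the list-based velocity history (most recent velocity first) are assumptions made for this illustration; they are not taken from the cited implementation.

```python
import random


def fractional_coefficients(alpha):
    """Weights applied in (3.20) to v[t], v[t-1], v[t-2], and v[t-3]."""
    return [
        alpha,
        alpha * (1 - alpha) / 2,
        alpha * (1 - alpha) * (2 - alpha) / 6,
        alpha * (1 - alpha) * (2 - alpha) * (3 - alpha) / 24,
    ]


def fractional_memory(v_hist, alpha):
    """Eq. (3.20); v_hist = [v[t], v[t-1], v[t-2], v[t-3]], each a list over bands."""
    coeffs = fractional_coefficients(alpha)
    d = len(v_hist[0])
    return [sum(c * v[k] for c, v in zip(coeffs, v_hist)) for k in range(d)]


def fodpso_velocity(v_hist, x, p_best, g_best, alpha, rho1, rho2, rng):
    """Eq. (3.19): fractional memory term plus cognitive and social pulls."""
    memory = fractional_memory(v_hist, alpha)
    new_v = []
    for k in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        new_v.append(memory[k]
                     + rho1 * r1 * (p_best[k] - x[k])
                     + rho2 * r2 * (g_best[k] - x[k]))
    return new_v


print(fractional_coefficients(0.6))
```

Setting `alpha = 1.0` makes the last three weights vanish, so the memory term reduces to the previous velocity alone, that is, to (3.15) with an inertia weight of 1.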
The parameter $\alpha$ is known as the fractional coefficient, which weighs the influence of past events in determining a new velocity, $0 < \alpha < 1$. When a small $\alpha$ is considered, particles ignore their previous activities, thus ignoring the system dynamics and becoming susceptible to getting stuck in local solutions (i.e., exploitation behavior). On the other hand, when a large value of $\alpha$ is used, particles have a more diversified behavior, which allows exploration of new solutions and improves the long-term performance (i.e., exploration behavior). However, if the exploration level is too high, the algorithm may take longer to find the global solution. Based on [211], a good $\alpha$ value can be selected in the range of 0.6 to 0.8.

Footnote 5: For detailed information on the FODPSO optimization approach and its different applications, please see [210].

Initialize $\alpha$, $\rho_1$, $\rho_2$ // FODPSO coefficients
Initialize $N$, $N_{min}$, $N_{max}$ // Population of particles
Initialize $N^s$, $N^s_{min}$, $N^s_{max}$ // Number of swarms
Initialize $I_T$, $I_{kill}$ // Iterations and stagnation
For each particle $n$ from swarm $s$
    Initialize $0 \leq x_n^s[0] \leq 1$ // particles’ initial position
    Initialize $OA^s_{nbest}[0]$ // best OA of each particle
    Initialize $\breve{x}_n^s$, $\breve{g}_n^s$ and $OA^s[0]$ based on $x_n^s[0]$ // initial local best, global best and best OA of the swarm
For each iteration $t$ until $I_T$ // main loop
    For each particle $n$ of swarm $s$
        Update $v_n^s[t+1]$
        Update $\Delta x_n^s[t+1]$
        Update $x_n^s[t+1]$
        Compute $OA_n^s[t+1]$ of SVM
        If $OA_n^s[t+1] > OA^s_{nbest}[t]$ // particle $n$ has improved
            $OA^s_{nbest}[t+1] = OA_n^s[t+1]$
            $\breve{x}_n^s = x_n^s[t+1]$
    For each swarm $s$
        If $\max OA_n^s[t+1] > OA^s[t]$ // swarm $s$ has improved
            $OA^s[t] = OA_n^s[t+1]$
            $\breve{g}_n^s = x_n^s[t+1]$
            $I_k = 0$ // reset stagnancy counter
            If $N_s < N_{max}$ // upper boundary of particles
                $N_s = N_s + 1$
                Spawns new particle in swarm $s$
            If $N^s < N^s_{max}$ and $rand() \cdot N^s / N^s_{max} > rand()$ // small probability of creating a new swarm
                $N^s = N^s + 1$
                Spawns new swarm
        Else // swarm $s$ has not improved
            $I_k = I_k + 1$
            If $I_k = I_{kill}$ // swarm $s$ has stagnated
                If $N_s > N_{min}$ // swarm $s$ has enough particles
                    Delete worst particle from swarm $s$
                Else
                    Delete whole swarm $s$
End

Figure 3.10 General idea of the FODPSO-based feature selection technique ($s$: the number of a swarm, $t$: the number of iterations, $\alpha$: fractional order coefficient, $\rho_1$: cognitive coefficient, $\rho_2$: social coefficient, $N^s$: current number of swarms, $N^s_{min}$: minimum number of swarms, $N^s_{max}$: maximum number of swarms, $I_T$: number of iterations, $I_{kill}$: maximum number of iterations for stagnation, $OA_n$: fitness of particle $n$, $OA_{nbest}$: best fitness of particle $n$, $OA$: best fitness value, $\breve{x}_n^s$: best cognitive position of particle $n$, and $\breve{g}_n$: best global position).

For feature selection based on FODPSO, as stated before, the velocity length (i.e., length $v_n[t]$) and the position length (i.e., length $x_n[t]$) are equal to the total number of bands of the image (i.e., length $v_n[t]$ = length $x_n[t]$ = $d$). In other words, each particle’s position and velocity are represented as $d$-dimensional vectors. In addition, each particle’s position is represented in binary format, in which 0 denotes the absence of the corresponding feature and 1 denotes its presence. As with the binary PSO described in the previous section, the position is updated in binary format by using (3.16) and (3.17).

Based on the results achieved in [201], the following points can be highlighted:
1. The binary FODPSO-based feature selection approach outperforms the PSO-based feature selection approach in terms of classification accuracies and CPU processing time on validation samples. The main reason is that FODPSO exploits many swarms, each of which individually performs just like an ordinary PSO. Furthermore, the fractional derivative in FODPSO makes it possible to control the convergence rate of the particles.
2. For the developed FODPSO-based feature selection approach, one does not need to specify the number of desired output features. The approach is able to automatically choose the most informative features in terms of classification accuracies.
3. The approaches proposed in [212] and [205] have been compared with conventional feature selection techniques such as divergence, transformed divergence, Bhattacharyya distance, and Jeffries-Matusita distance. Since the new approaches are based on an evolutionary method, they are much faster than the aforementioned conventional feature selection techniques, which demand an exhaustive process to select the most informative group of bands from the existing ones. In addition, the new approaches can work appropriately in a situation where other conventional feature selection techniques are not applicable due to the singularity of their covariance matrices and the limited number of training samples.
4. Since the feature selection approaches proposed in [201] and [205] use SVM as the fitness function, they are able to handle high-dimensional data with a limited number of training samples. In this way, the proposed approaches can work properly in an ill-posed situation where other feature selection/extraction techniques cannot proceed without a powerful technique for estimating the statistics of each class.

3.3 SUMMARY
As mentioned earlier in this book, a high-dimensional space is almost empty and multivariate data can be represented in a lower-dimensional space. Therefore, it is possible to reduce the dimensionality of high-dimensional data without sacrificing significant information and class separability. In addition, when the number of training samples is limited, the high dimensionality of the hyperspectral data may cause a decrease in classification accuracies and, as a consequence, degrade the quality of the final classification map. The main objective of this chapter was to tackle the issue of high dimensionality. To that end, different feature reduction techniques, including feature selection and feature extraction approaches, have been discussed. Some unsupervised feature extraction approaches such as PCA and ICA and some supervised feature extraction approaches such as DAFE, DBFE, and NWFE have been elaborated in detail. This was followed by a comprehensive discussion of different feature selection approaches, including some of the techniques that have been used for spectral-spatial classification in the literature.
Chapter 4 Spatial Information Extraction Using Segmentation

Image segmentation is the process of partitioning a digital image into multiple regions or objects. Segmentation techniques assign a label to each pixel in the image in such a way that pixels with the same label share certain visual characteristics [43]. Therefore, pixels are the underlying units of an image before segmentation, while objects are the underlying units afterward. The partitioned regions or objects provide more information than individual pixels, since the interpretation of images based on regions or objects is more meaningful than the interpretation based on individual pixels only. Figure 4.1 shows the outputs of a segmentation method on the first principal component of the ROSIS-03 Pavia University data. As can be seen, the output of the segmentation consists of several objects. Image segmentation is a fundamental step for a wide variety of applications such as the analysis, interpretation, and understanding of images, and it is also widely used for image processing purposes such as classification and object recognition [43, 213]. Segmentation approaches are applied to extract contextual and spatial information using adaptive neighborhood systems, especially in the spatial domain. Therefore, segmentation approaches can be used to combine the extracted spatial regions with the extracted spectral information (e.g., the output of classification) in a spectral-spatial classifier.

Figure 4.1 (a) The first principal component of the ROSIS-03 Pavia University data, (b) and (c) the outputs of a segmentation technique with and without borders around each object, respectively. In (c), it is apparent that the objects are the underlying units of images after applying segmentation.

Image segmentation can be split into different categories based on different points of view. For example, in [47], image segmentation techniques are classified into thresholding-based methods, texture analysis-based methods, clustering-based methods, and region-based split and merging methods. Fu and Mui [214], on the other hand, identified three categories of image segmentation techniques (i.e., edge-based, region-based, and characteristic feature thresholding or clustering). Segmentation methods from the edge-based and region-based categories operate in the spatial domain, while those from the thresholding or clustering category are mostly applied in the spectral domain. Furthermore, a combination of spatial-based and spectral-based segmentation approaches is possible.

In this chapter, different segmentation techniques used in the literature for spectral-spatial classification will be discussed along with relevant experimental results. The techniques include thresholding-based approaches, clustering approaches, and edge-based and region-based approaches. Since comprehensive experimental results with different spectral-spatial classification techniques will be discussed in Chapters 4, 5, and 6, we first describe a few widely used techniques for integrating the spectral and spatial information. Then, a few well-known clustering techniques such as K-means and fuzzy C-means clustering (FCM) will be discussed. These clustering techniques have a shortcoming: they are sensitive to the initial cluster configuration and can fall into suboptimal solutions. Therefore, a modification of FCM, named particle swarm optimization (PSO)-based FCM (PSO-FCM), will be introduced. Next, expectation maximization (EM) will be discussed, mainly as an alternative to K-means clustering. After that, mean-shift segmentation (MSS) will be discussed; unlike the classic K-means clustering approach, MSS does not need embedded assumptions on the shape of the distribution or the number of clusters. Furthermore, a few powerful segmentation techniques that have been used extensively for spectral-spatial classification will be discussed, including watershed segmentation (WS) and hierarchical segmentation (HSeg). Then, a well-known classification approach named segmentation and classification using automatically selected markers will be reviewed. After that, thresholding-based segmentation techniques will be discussed. Finally, in order to compare the efficiency of different segmentation techniques, some experimental results obtained by performing segmentation-based spectral-spatial classification approaches on two well-known hyperspectral data sets, Indian Pines and Pavia University, will be discussed.
4.1 SOME APPROACHES FOR THE INTEGRATION OF SPECTRAL AND SPATIAL INFORMATION
In general, a spectral-spatial classifier consists of three steps: (1) spectral information extraction (Chapter 2), (2) spatial information extraction (Chapters 4, 5, and 6), and (3) integration of the spectral and the spatial information. This section covers three main approaches to the third step: feature fusion into a stacked vector, composite kernels, and majority vote. Other approaches for integrating spectral and spatial information will be discussed later in this book.

4.1.1 Feature Fusion into a Stacked Vector
This way of integrating spectral and spatial information has frequently been used along with the extended multi-attribute profile (EMAP) and the extended morphological profile (EMP). This type of fusion concatenates different features into a stacked vector. For example, the EMP was originally developed to be fed to the classification system [215], and good results were obtained in terms of classification accuracies. However, the EMP contains only a part of the spectral information of the input data. To address this shortcoming, feature fusion into a stacked vector was considered in [14]. This strategy integrates both the EMP and the original hyperspectral image by concatenating them into a stacked vector. The same issue arises with the EMAP, which is based on performing feature extraction on the input data and producing extra features from the extracted features; this can discard a part of the spectral information. To address this issue, spectral features are concatenated with spatial features into a stacked vector [86]. Moreover, one can also apply feature extraction to both feature vectors, concatenate the extracted features into one stacked vector, and classify it with either an SVM or an RF classifier. The main goal of performing feature extraction is to overcome the curse of dimensionality, which has been considered for the EMP in [14] and the EAP in [87]. Let $X_\phi$ be the features associated with the spectral bands and $X_\omega$ be the features associated with the spatial information (e.g., the features obtained by the EMP or EAP). The corresponding extracted features resulting from a feature extraction algorithm are:

$$x_\phi = \Phi_\phi^T X_\phi \tag{4.1}$$

and

$$x_\omega = \Phi_\omega^T X_\omega \tag{4.2}$$

where $\Phi$ is the mapping matrix of the linear feature extraction algorithm [73]. The stacked vector is built by concatenating the extracted features, as $x = [x_\phi, x_\omega]^T$ [86, 87]. This method of fusing extracted features is fast and efficient, but since it concatenates different groups of features, the risk of the curse of dimensionality increases.

4.1.2 Composite Kernel
As discussed earlier, kernel methods make SVMs well suited to hyperspectral image classification, since they can handle high-dimensional data efficiently with relatively few training samples. With kernel methods, good classification performance can be obtained using only the spectral information as input features. However, this performance can be further improved by including contextual (or even textural) information along with the spectral information in the kernel [216]. This makes it possible to combine different kernel functions in order to include both the spatial and the spectral information in the SVM classification process [216–218]. Instead of building a stacked vector using feature fusion, which increases the dimensionality, it is possible to take advantage of the linearity property of kernels in order to construct a spectral-spatial kernel K, called the composite spectral-spatial kernel [73]. In [111], a linearly weighted composite kernel framework with SVMs was considered for spectral-spatial classification using APs.¹ A linearly weighted composite kernel is a weighted combination of different kernels computed using the available features [216]. Probabilistic SVMs were also employed to classify the spectral information in order to obtain different rule images. The kernels were computed from the obtained rule images and combined using a weighting factor, which can be set subjectively or estimated using cross-validation. However, classification using composite kernels and SVMs requires a convex combination of kernels and a time-consuming optimization process. To tackle these limitations, a generalized composite kernel framework for spectral-spatial classification using attribute profiles has been proposed in [111]. MLR [112–114] has been employed instead of an SVM classifier, and a set of generalized composite kernels, which can be linearly combined without any constraint of convexity, was proposed. A few methods that use kernel techniques to perform spectral-spatial classification will be discussed in Chapter 5.

¹ APs will be discussed in detail later in Chapter 6.

4.1.3 Spectral-Spatial Classification Using Majority Voting
In this section, another approach for integrating the spectral and spatial information of the input hyperspectral data is discussed, named majority vote (MV) [46].² As mentioned earlier, there are many ways to extract spatial information (e.g., MRF, segmentation approaches, morphological profiles, and so on). MV is often used to combine the output of classification with the output of segmentation or clustering. The output of a segmentation method consists of a number of objects, each of which includes several pixels with the same label. In other words, the pixels in each object share the same characteristics. To apply MV to the outputs of the segmentation and classification steps, the number of pixels with each class label is first counted in each object. Subsequently, all pixels in each object are assigned to the most frequent class label for that object. If two classes are tied as the most frequent in one object, the object is not assigned to either of those classes, and the result of the spectral classification is kept for each pixel in the object directly [47]. The workflow of MV is as follows:

1. The output of a classification step and the output of a segmentation technique are considered as the inputs for MV. The output of the segmentation technique consists of several objects (in Figure 4.2, there are three different objects: 1, 2, and 3), and the output of the classification step consists of different classes (in Figure 4.2, there are three different classes: blue, gray, and white) [39].
2. In each object (every region in the segmentation map), all of the pixels are assigned to the most frequent class within this object.

Figure 4.2 Schematic example of spectral-spatial classification using majority voting within segmentation regions. Panels: segmentation map (three spatial regions); pixelwise classification map (dark blue, white, and light gray classes); combination of unsupervised segmentation and pixelwise classification results (majority voting within the three spatial regions); result of spectral-spatial classification (classification map after majority voting). Illustration taken from [61] © IEEE. Permission for use granted by IEEE 2015.

The advantages and disadvantages of MV are listed below:

• Advantage 1: MV is an accurate, simple, and fast technique.
• Advantage 2: This approach does not increase the data dimensionality, which would cause high redundancy and the Hughes phenomenon, while it retains all the spectral information for the accurate classification of the input data [73].
• Disadvantage 1: If the output of segmentation suffers from undersegmentation, this approach misleads the classification and merges different classes into one in the final classification map.
• Disadvantage 2: Although MV considerably reduces the salt-and-pepper effect of the spectral classification step, noisy behavior can sometimes still be seen in the classification map after the MV procedure. Figure 4.3 shows a situation in which such noisy behavior occurs in the final classification map.

When using segmentation techniques, there are two main issues that one needs to take into account: oversegmentation and undersegmentation. Oversegmentation is the situation in which one region is detected as several, which is not a crucial problem. In contrast, undersegmentation is the situation in which several regions are detected as one, which is not desired. When integrating the results of classification and segmentation, oversegmentation is therefore preferable to undersegmentation. To avoid undersegmentation, which can degrade classification accuracies when MV is used, a careful selection of the segmentation approach and its parameters is crucial.

To address the second disadvantage of MV, the use of postregularization is suggested in [46]. The aim of this postprocessing step is to decrease the noise in the classification map after performing the MV procedure, as shown in Figure 4.3. To do so, the classification map is regularized using the masks shown in Figure 4.4 (the 8- and 16-neighborhoods of a pixel, called Chamfer neighborhoods [219]). The postregularization suggested in [46] is carried out as follows:

1. For every pixel in the classification map: If more than T1 neighbors in the eight-neighborhood [see Figure 4.4(a)] have a class label L that is different from that of the considered pixel, assign this label L to the considered pixel. Repeat the filtering step until stability is reached (none of the pixels changes its label).
2. For every pixel: If more than T2 neighbors in the 16-neighborhood [see Figure 4.4(b)] have a label L different from that of the considered pixel, assign the label L to the considered pixel. Repeat this step until stability is reached.
3. Repeat the regularization on the eight-neighborhood (with threshold T3).

Figure 4.3 Example of MV with postregularization. Illustration taken from [46] © IEEE. Permission for use granted by IEEE 2015.

Figure 4.4 Chamfer neighborhoods (in gray) for a black pixel: (a) eight neighbors and (b) 16 neighbors.

² In the literature, this approach is often referred to as plurality vote.
The threshold values T1 to T3 must be greater than or equal to half of the number of pixels in the considered neighborhood system in order to ensure a unique solution of this step. The postregularization step results in more homogeneous regions in the classification map than the use of MV alone. However, the filtering of the classification map does not use any pixelwise spectral information, and its effectiveness is highly dependent on the size of the structures in the image. If the image resolution is not very high, an object in the scene can be only one or a few pixels in size, and such an object is in danger of being removed from the classification map by the postregularization. The filtering conditions can be tightened or relaxed by changing the threshold values T1 to T3. If Tj (j = 1, 2, 3) decreases, the regularization has a stronger effect, which leads to more homogeneous results, but the possibility of removing small regions increases accordingly. Comprehensive experimental results on this approach of integrating segmentation and classification outputs are presented later in this chapter.
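The majority vote rule and one eight-neighborhood filtering pass described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation of [46]: the function names, the list-of-lists map layout, the synchronous update of all pixels in a pass, and the `max_passes` safety limit are assumptions made here for the example.

```python
from collections import Counter

# Offsets of the eight-neighborhood of a pixel (row, column).
EIGHT = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def majority_vote(class_map, seg_map):
    """Give every pixel of a region the region's most frequent class.

    If the two most frequent classes are tied, the pixelwise labels
    of that region are kept unchanged.
    """
    votes = {}
    for class_row, seg_row in zip(class_map, seg_map):
        for label, region in zip(class_row, seg_row):
            votes.setdefault(region, Counter())[label] += 1
    winner = {}
    for region, counter in votes.items():
        ranked = counter.most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        winner[region] = None if tie else ranked[0][0]
    return [[label if winner[region] is None else winner[region]
             for label, region in zip(class_row, seg_row)]
            for class_row, seg_row in zip(class_map, seg_map)]

def regularize(class_map, threshold, offsets=EIGHT, max_passes=100):
    """Relabel a pixel when more than `threshold` neighbors share another label.

    Passes are repeated until no pixel changes (or max_passes is hit,
    which is a safety limit assumed for this sketch).
    """
    rows, cols = len(class_map), len(class_map[0])
    current = [row[:] for row in class_map]
    for _ in range(max_passes):
        updated = [row[:] for row in current]
        changed = False
        for r in range(rows):
            for c in range(cols):
                counts = Counter(current[r + dr][c + dc] for dr, dc in offsets
                                 if 0 <= r + dr < rows and 0 <= c + dc < cols)
                if not counts:
                    continue
                label, count = counts.most_common(1)[0]
                if label != current[r][c] and count > threshold:
                    updated[r][c] = label
                    changed = True
        current = updated
        if not changed:
            break
    return current
```

A different neighborhood, such as the 16-neighborhood, could be passed through the `offsets` argument, so the three regularization steps would be three calls with thresholds T1, T2, and T3.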
4.2 CLUSTERING APPROACHES

4.2.1 K-Means
K-means clustering is the simplest and most widely used clustering algorithm for finding clusters and cluster centers in a set of unlabeled data [131, 220, 221]. Let {X} be a set of data points (e.g., pixels). K-means is based on a local search approach that partitions the data points into k clusters [220]. The required number of clusters is predefined by the user, and K-means iteratively moves the centers in order to minimize the total within-cluster variance [131]. K-means has been used in a wide variety of applications by the remote sensing and computer vision communities [222–224]. The main attraction of this method lies in its simplicity and its observed speed [220]. K-means aims at minimizing the sum of squared distances between all points and their closest cluster center, and proceeds as follows [225, 226]:

1. Choose k initial cluster centers: $z_1(1), z_2(1), z_3(1), \ldots, z_k(1)$.
2. At the t-th iterative step, distribute the samples {X} among the k clusters using

$$ X \in C_j(t) \quad \text{if} \quad \|X - z_j(t)\| < \|X - z_i(t)\|, \quad \forall\, i = 1, 2, \ldots, k;\ i \neq j, $$

where $C_j(t)$ and $z_j(t)$ denote the set of samples assigned to cluster j and its center, respectively, and $\|\cdot\|$ is the distance norm.
3. Compute the new cluster centers $z_j(t+1)$, $j = 1, 2, \ldots, k$, such that the sum of the squared distances of all points in $C_j(t)$ to the new cluster center is minimized. The new cluster center is given by

$$ z_j(t+1) = \frac{1}{N_j} \sum_{X \in C_j(t)} X, \quad j = 1, 2, \ldots, k \tag{4.3} $$

where $N_j$ is the number of samples in $C_j(t)$.
4. The algorithm has converged when the stop criterion $\|z_j(t+1) - z_j(t)\| < \varepsilon$, $j = 1, 2, \ldots, k$, is met, where $\varepsilon \in \mathbb{R}$ is a small positive tolerance. Otherwise, go to step 2.
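The four steps above can be sketched as follows. This is a minimal illustration in plain Python, assuming Euclidean distance; the function name `kmeans`, the optional `init` argument, the random choice of initial centers, and the `max_iter` limit are assumptions of this sketch rather than part of the description above.

```python
import math
import random

def kmeans(points, k, eps=1e-6, max_iter=100, init=None, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k clusters."""
    rng = random.Random(seed)
    # Step 1: choose k initial centers (random samples unless `init` is given).
    start = init if init is not None else rng.sample(points, k)
    centers = [list(p) for p in start]

    def nearest(p):
        # Index of the center closest to point p (Euclidean distance).
        return min(range(k), key=lambda i: math.dist(p, centers[i]))

    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Step 3: new center = arithmetic mean of the assigned samples (4.3).
        # An empty cluster keeps its previous center.
        new_centers = [
            [sum(v) / len(members) for v in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Step 4: stop when no center moved by eps or more.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    labels = [nearest(p) for p in points]
    return centers, labels
```

For a hyperspectral image, each point would be the spectral vector of one pixel, and reshaping `labels` back to the image grid would give the clustering map.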
Figure 4.5(b) illustrates the clustering map of K-means on a single band of the ROSIS-03 Pavia University image. K-means is popular due to its simplicity (i.e., its centroids, or cluster centers, are easy to understand and interpret). In addition, K-means is able to find a local minimum of the optimization criterion (the sum of squared errors). Due to this property, K-means can be used as a postprocessing step for other clustering methods [227]. K-means is considered fast when compared to most clustering algorithms, because only distances and means need to be calculated, and thus the computational complexity is not very high. However, since the clustering is an iterative process, the speed partially depends on the number of iterations, which in turn depends on the data distribution. Speed also depends on the dimension of the data: the higher the dimension, the slower the process. In addition, K-means assigns samples to the nearest centroids in such a way that the samples are only scanned linearly. This linear scan is a single pass over the data set in each iteration; that is, it reads the full set of data points without going back to a previous point. Consequently, in order to update the centroids, only arithmetic means are computed. K-means often converges in very few iterations, and it converges faster when the data contain well-separated clusters.

Conventional hard classification and clustering approaches do not take into account the continuous change from one land cover class to another. For example, standard K-means uses hard partitioning, in which each data point belongs to exactly one cluster. However, this assumption is easily violated in remote sensing applications. For instance, a crisp label cannot really be allocated to the boundary between two classes of interest.
To model gradual boundary changes, “soft” classifiers can be taken into consideration. Fuzzy classifiers are soft classification techniques that deal with vagueness in class definitions and model the gradual spatial transition between land cover classes [228]. To overcome the aforementioned shortcoming of K-means, fuzzy C-means (FCM) was introduced in [229] in 1981. FCM is a generalization of the standard crisp K-means, in which a data point (pixel) can belong to all clusters with different degrees of membership.
4.2.2 Fuzzy C-Means Clustering (FCM)
The FCM algorithm can be described as follows. Let $X = \{X_1, \ldots, X_j, \ldots, X_q\}$ be the set of q objects, and $Z = \{Z_1, \ldots, Z_i, \ldots, Z_k\}$ be the set of k centroids in a d-dimensional feature space. FCM partitions X into k clusters as follows:

1. FCM starts by randomly selecting k objects as the centroids (means) of the k clusters.
2. Membership values are estimated with respect to the relative distance of the object $X_j$ to the centroids using (4.4):

$$ \mu_{ij} = \frac{1}{\sum_{c=1}^{k} \left( \frac{d_{ij}}{d_{cj}} \right)^{\frac{2}{m-1}}}, \quad \text{where } d_{ij}^2 = \|X_j - Z_i\|^2, \tag{4.4} $$

where $1 < m < \infty$ is the fuzzifier, $Z_i$ is the ith centroid corresponding to cluster $\beta_i$, and $\mu_{ij} \in [0, 1]$ is the fuzzy membership of the pattern $X_j$ to cluster $\beta_i$.
3. The centroids of the clusters are estimated using (4.5):

$$ Z_i = \frac{1}{q_i} \sum_{j=1}^{q} (\mu_{ij})^m X_j, \quad \text{where } q_i = \sum_{j=1}^{q} (\mu_{ij})^m. \tag{4.5} $$

4. FCM partitions X into k clusters by minimizing the following objective function [230]:

$$ J = \sum_{j=1}^{q} \sum_{i=1}^{k} (\mu_{ij})^m \|X_j - Z_i\|^2. \tag{4.6} $$

5. The stop criterion is met when the norm of the difference between two consecutive iterations is less than a predefined threshold [231].

Figure 4.5(c) illustrates the clustering map of FCM on a single band of Pavia University.
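The FCM iteration built from (4.4) to (4.6) can be sketched as follows. This is a minimal plain-Python illustration under a few assumptions that are not part of the description above: the stop criterion is taken to be the largest centroid displacement, distances are clamped to a tiny positive value so that an object lying exactly on a centroid does not cause a division by zero, and the names `memberships`, `update_centroids`, `objective`, and `fcm` (including the optional `init` argument) are hypothetical.

```python
import math
import random

def memberships(points, centroids, m):
    """Equation (4.4): one row per object X_j, one column per cluster i."""
    power = 2.0 / (m - 1.0)
    table = []
    for x in points:
        # Clamp so an object sitting exactly on a centroid does not divide by zero.
        d = [max(math.dist(x, z), 1e-12) for z in centroids]
        table.append([1.0 / sum((d_i / d_c) ** power for d_c in d) for d_i in d])
    return table

def update_centroids(points, mu, m):
    """Equation (4.5): membership-weighted means of the objects."""
    dims = len(points[0])
    centroids = []
    for i in range(len(mu[0])):
        weights = [row[i] ** m for row in mu]
        q_i = sum(weights)
        centroids.append([sum(w * x[a] for w, x in zip(weights, points)) / q_i
                          for a in range(dims)])
    return centroids

def objective(points, centroids, mu, m):
    """Equation (4.6): the quantity that FCM minimizes."""
    return sum(mu[j][i] ** m * math.dist(x, z) ** 2
               for j, x in enumerate(points) for i, z in enumerate(centroids))

def fcm(points, k, m=2.0, eps=1e-5, max_iter=200, init=None, seed=0):
    rng = random.Random(seed)
    # Step 1: random objects as initial centroids, unless `init` is given.
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(p) for p in start]
    for _ in range(max_iter):
        mu = memberships(points, centroids, m)           # step 2
        new_centroids = update_centroids(points, mu, m)  # step 3
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < eps:                                  # step 5 (assumed criterion)
            break
    return centroids, memberships(points, centroids, m)
```

Taking the cluster with the largest membership for each object turns the fuzzy partition into a crisp clustering map.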
4.2.3 Particle Swarm Optimization (PSO)-Based FCM (PSO-FCM)
Although FCM improves on K-means by using fuzzy sets instead of crisp sets, it is still very sensitive to its initial cluster configuration and may fall into suboptimal solutions [232]. Therefore, researchers have tried to improve the robustness of FCM by optimizing it with bio-inspired optimization techniques [58]. Among these, the combination of FCM and the well-known PSO has been one of the most successful (e.g., [233, 234]). In order to optimize FCM by PSO, each particle represents a possible solution of the problem (e.g., the best cluster centers of a given hyperspectral image).³ These particles travel through the search space to find an optimal solution by interacting and sharing information with other particles, namely their individual best solutions (local best), and by computing the globally best solution [235]. In other words, PSO-based algorithms consider multiple particles, wherein each particle has its own local solution (e.g., the best cluster centers it has found), and the whole swarm has a global solution that is the best among the local solutions of all particles. In each step t of PSO-FCM, the fitness function, given by (4.6), is used to evaluate the success of the particles. To model the swarm, each particle n moves in a multidimensional space according to its position $x_n[t]$ and velocity $v_n[t]$, which are highly dependent on the locally best $\tilde{x}_n[t]$ and the globally best $\tilde{g}_n[t]$ information:

$$ v_n[t+1] = w\, v_n[t] + \rho_1 r_1 (\tilde{g}_n[t] - x_n[t]) + \rho_2 r_2 (\tilde{x}_n[t] - x_n[t]), \tag{4.7} $$

$$ x_n[t+1] = x_n[t] + v_n[t+1]. \tag{4.8} $$
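The update rules (4.7) and (4.8) can be sketched for a single particle as below, followed by a small swarm loop. This is only an illustration: the names `update_particle` and `pso_minimize`, the values chosen for the inertia and the two attraction weights, the initial positions drawn from [-1, 1], and the zero initial velocities are assumptions of this sketch. In PSO-FCM the `fitness` argument would be the FCM objective (4.6) evaluated for the cluster centers encoded by a particle.

```python
import random

def update_particle(x, v, local_best, global_best, w, rho1, rho2, rng):
    """Apply (4.7) and (4.8) to one particle, dimension by dimension."""
    new_v = []
    for d in range(len(x)):
        # A fresh pair of uniform random numbers is drawn for every dimension.
        r1, r2 = rng.random(), rng.random()
        new_v.append(w * v[d]
                     + rho1 * r1 * (global_best[d] - x[d])
                     + rho2 * r2 * (local_best[d] - x[d]))
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v

def pso_minimize(fitness, dim, n_particles=10, steps=50,
                 w=0.7, rho1=0.9, rho2=0.9, seed=0):
    """Minimize `fitness` with a small swarm (parameter values are illustrative)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    local = [x[:] for x in xs]              # best position found by each particle
    best = min(local, key=fitness)[:]       # best position found by the swarm
    for _ in range(steps):
        for n in range(n_particles):
            xs[n], vs[n] = update_particle(xs[n], vs[n], local[n], best,
                                           w, rho1, rho2, rng)
            if fitness(xs[n]) < fitness(local[n]):
                local[n] = xs[n][:]
                if fitness(local[n]) < fitness(best):
                    best = local[n][:]
    return best
```

Because the swarm best is replaced only when a strictly better position is found, its fitness never gets worse from one step to the next.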
Coefficients $w$, $\rho_1$, and $\rho_2$ weight the inertial influence, the global best, and the local best, respectively, when determining the new velocity $v_n[t+1]$, with $\rho_1 + \rho_2 < 2$ [43]. Different results can be obtained by assigning different influences to each component. The parameters $r_1$ and $r_2$ are two random vectors, each component of which is generally a uniform random number between 0 and 1. The intent is to multiply each velocity dimension by a new random component, rather than multiplying every velocity dimension of a particle by the same component, which increases the randomness of the particles and helps them avoid getting trapped in local optima. The velocity dimension, as well as the position dimension, is equal to the total predefined number of cluster centers in the input data. In other words, each particle's position is represented as a k-dimensional vector. Moreover, each particle moves in a multidimensional search space with respect to its corresponding position according to the discrete-time system (4.7) and (4.8), wherein $v_n[t], x_n[t] \in \mathbb{R}^k$. Figure 4.5(d) illustrates the clustering map of PSO-FCM on a single band of Pavia University.

Figure 4.5 Clustering results with nine clusters: (a) a single band (band 27) of the Pavia University data set, and the clustering maps obtained by (b) K-means, (c) FCM, and (d) PSO-FCM.

³ Here, PSO-FCM is regarded as an approach in which FCM is optimized by PSO.

4.3 EXPECTATION MAXIMIZATION (EM)
The K-means algorithm and its modifications were discussed in Section 4.2. K-means is simple, but it easily gets stuck in local optima. The EM algorithm is less prone to getting stuck in local optima than K-means because it assigns data points partially to different clusters instead of assigning them to only one cluster. The EM algorithm is a well-known technique for finding maximum likelihood parameter estimates in problems with hidden (i.e., missing) data [39]. For the purpose of segmentation, the observed data are the intensity values, the hidden (missing) data are the segmentation labels, and the parameters are those of the class-conditional intensity distributions. Usually, the individual components of the mixture density are assumed to be Gaussian, in which case the parameters of a Gaussian mixture model have to be estimated. The algorithm first estimates the hidden (missing) data based on the current parameter estimates. The initial parameters of the Gaussian mixture (the mean and covariance of each class) are calculated based on the histogram of the image. Then, the estimated complete data⁴ are used to estimate the parameters by maximizing the likelihood of the complete data. The EM algorithm thus consists of two steps: the E step computes the expectation of the missing data, and the M step computes the maximum likelihood estimates of the unknown parameters. This process is iterated until it converges, and the likelihood increases with each iteration.

In the EM algorithm, it is assumed that pixels belonging to the same cluster are drawn from a multivariate Gaussian (normal) probability distribution. Each image pixel can be statistically characterized by the following probability density function:

$$ p(X) = \sum_{c=1}^{C} \omega_c\, \phi_c(X; m_c, \Sigma_c) \tag{4.9} $$

where $m_c$ and $\Sigma_c$ are the mean vector and covariance matrix of cluster c, respectively, C is the number of clusters, and $\omega_c \in [0, 1]$ is the mixing proportion (weight) of cluster c, with $\sum_{c=1}^{C} \omega_c = 1$. Also, $\phi_c(X; m_c, \Sigma_c)$ is the multivariate Gaussian density with mean $m_c$ and covariance matrix $\Sigma_c$:

$$ \phi_c(X; m_c, \Sigma_c) = \frac{1}{(2\pi)^{d/2} |\Sigma_c|^{1/2}} \exp\left\{ -\frac{1}{2} (X - m_c)^T \Sigma_c^{-1} (X - m_c) \right\}. \tag{4.10} $$
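To make (4.9) and (4.10) concrete, the sketch below treats the one-dimensional case (d = 1, i.e., scalar intensity values), where the covariance matrix reduces to a variance. It alternates the E step and the M step of plain EM for a Gaussian mixture. The function names, the variance floor `min_var`, and the starting parameters used in the usage note are assumptions made for illustration only.

```python
import math

def gaussian(x, mean, var):
    """Equation (4.10) for d = 1, with variance `var` in place of the covariance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """Equation (4.9): weighted sum of the component densities."""
    return sum(w * gaussian(x, m, s) for w, m, s in zip(weights, means, variances))

def em_step(data, weights, means, variances, min_var=1e-6):
    """One EM iteration; returns the updated (weights, means, variances)."""
    # E step: posterior probability of each cluster for each sample.
    resp = []
    for x in data:
        joint = [w * gaussian(x, m, s) for w, m, s in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: maximum likelihood estimates given these soft assignments.
    new_w, new_m, new_s = [], [], []
    for c in range(len(weights)):
        n_c = sum(r[c] for r in resp)
        mean = sum(r[c] * x for r, x in zip(resp, data)) / n_c
        var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, data)) / n_c
        new_w.append(n_c / len(data))
        new_m.append(mean)
        new_s.append(max(var, min_var))  # floor keeps the variance from collapsing
    return new_w, new_m, new_s

def log_likelihood(data, weights, means, variances):
    return sum(math.log(mixture_density(x, weights, means, variances)) for x in data)
```

Calling `em_step` repeatedly, for example starting from equal weights and rough guesses for the means, and stopping when `log_likelihood` no longer changes noticeably, gives the iteration described in the text.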
The distribution parameters $\psi = \{C, \omega_c, m_c, \Sigma_c;\ c = 1, 2, \ldots, C\}$ are calculated by the iterative Classification EM (CEM) algorithm, as described in [46] (a detailed description can be found in Appendix A). As suggested in [46], an upper bound on the number of clusters can be chosen slightly larger than the number of classes in order to avoid undersegmentation. When convergence is reached, a partitioning of the pixels into C nonoverlapping clusters is obtained. Since spatial information is not taken into account during the clustering procedure, pixels with the same cluster label can either form a connected spatial region or belong to disjoint regions. In this context, a segmentation map can be obtained by applying a connected component labeling algorithm [236] to the cluster partitioning. This algorithm assigns different labels to disjoint regions within the same cluster.

The total number of parameters to be estimated by the EM algorithm is P = (d(d + 1)/2 + d + 1)C + 1, where d is the dimensionality of the feature vectors (e.g., the number of bands). For example, with d = 103 bands and C = 10 clusters (illustrative values), P = 54,601. For large d, P can therefore become very large, which may cause covariance matrix singularity or inaccurate parameter estimation. To address these issues, a feature reduction technique can be used. For example, in [73], the piecewise constant function approximations (PCFA) method [237] was used. PCFA is a simple feature reduction approach that has shown good classification accuracies for hyperspectral data [73].

As discussed in [27], partitional clustering (e.g., K-means and its alternatives and the EM algorithm) produces an exhaustive partitioning of the set of image pixels into a number of clusters. Therefore, a numerical label is given to each pixel, representing the cluster to which it belongs. However, partitional clustering techniques consider no spatial information and only take into account the spectral similarities during the clustering procedure. Therefore, pixels with the same cluster label can either be connected in the image plane (thus forming a spatial region) or belong to disjoint regions within the spatial domain. In order to produce a segmentation map from the output of a clustering technique, in which each connected spatial region has a unique label, a connected components labeling algorithm can be applied to the image partitioning obtained by the clustering algorithm [236, 238, 239]. The algorithm allocates different labels to disjoint regions in the image that were placed in the same cluster. In general, the segmentation map obtained this way can be oversegmented. However, the final goal is not to produce a perfect segmentation result, but instead to define adaptive neighborhoods, that is, groups of connected pixels belonging to the same physical object, in order to incorporate them into a spectral-spatial classifier [27]. As oversegmentation is preferred to undersegmentation, four-neighborhood connectivity is recommended when performing the labeling of connected components.

⁴ Complete data contains both observed and hidden data.

4.4 MEAN-SHIFT SEGMENTATION (MSS)
MSS is a nonparametric clustering technique that, unlike the classic K-means clustering approach, requires neither assumptions on the shape of the distribution nor the number of clusters. Mean-shift was first introduced in [240]. More recently, this approach has been developed for low-level vision problems, including adaptive smoothing and segmentation. In MSS, each pixel is associated with the most significant mode of the joint domain density located in its neighborhood, after nearby modes have been pruned as in the generic feature space analysis technique [48]. Let us assume that we have N data points $X_i$, $i = 1, \ldots, N$, in a d-dimensional space $\mathbb{R}^d$. Then, the multivariate kernel density estimator with kernel $K(X)$ and a symmetric positive definite $d \times d$ bandwidth matrix $H$ is computed at the point X as follows:

$$ \hat{f}(X) = \frac{1}{N} \sum_{i=1}^{N} K_H(X - X_i), \tag{4.11} $$

where

$$ K_H(X) = |H|^{-\frac{1}{2}} K(H^{-\frac{1}{2}} X). \tag{4.12} $$

For the sake of simplicity, the bandwidth matrix H is assumed to be proportional to the identity matrix, $H = h^2 I$. The modes of the density function are located at the zeros of the gradient function (i.e., where $\nabla f(X) = 0$). After some algebraic manipulation, the gradient of the density estimator can be written as follows:

$$ \hat{\nabla} f(X) = \frac{2 c_{k,d}}{N h^{d+2}} \underbrace{\left[ \sum_{i=1}^{N} g\left( \left\| \frac{X - X_i}{h} \right\|^2 \right) \right]}_{\text{term1}} \underbrace{\left[ \frac{\sum_{i=1}^{N} X_i\, g\left( \left\| \frac{X - X_i}{h} \right\|^2 \right)}{\sum_{i=1}^{N} g\left( \left\| \frac{X - X_i}{h} \right\|^2 \right)} - X \right]}_{\text{term2}}, \tag{4.13} $$

where $c_{k,d}$ is a normalization constant, and $g(X) = -k'(X)$ is the negative derivative, with respect to X, of the selected kernel profile $k$. The first term of (4.13), term1, is proportional to the density estimate at X computed with the kernel whose profile is g. The second term, term2, is the mean shift vector, m, which points in the direction of the maximum increase in density and is related to the density gradient estimate at point X obtained with kernel K. The most important limitation of the standard MSS is that the kernel size (bandwidth) is left unspecified and must be chosen by the user. More information regarding MSS can be found in [48].

4.5 WATERSHED SEGMENTATION (WS)
As mentioned above, segmentation techniques often work in the spatial domain, searching for groups of spatially connected pixels (i.e., regions) that are similar with respect to a defined criterion. Edge-based techniques search for discontinuities in the image, while region-based techniques search for similarities between image regions [241]. The watershed transformation is a powerful morphological approach to image segmentation that combines region growing and edge detection. It considers a two-dimensional one-band image as a topographic relief [63, 242], in which the value of a pixel stands for its elevation. The WS is inspired by a flooding paradigm: the watershed lines divide the image into catchment basins so that each basin is associated with one minimum in the image, as shown in Figure 4.6. The watershed transformation is usually applied to the gradient function of the image [73].⁵ It then divides the input image into several regions, each associated with one minimum of the gradient image. If the crest lines in the gradient image correspond to the borders between objects, the watershed transformation partitions the image into meaningful regions. In general, applying the watershed segmentation to the gradient image without any extra processing leads to strong oversegmentation (e.g., a large number of minima) [243]. One way to address this issue is to perform area filtering on either the input image or the gradient image.
⁵ The gradient has high values on the edges of image objects and low values in homogeneous regions.
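The flooding paradigm can be illustrated on a one-dimensional signal. The following is a minimal sketch, not the algorithm of any cited reference: the function name `watershed_1d` is hypothetical, and the code assumes that neighboring samples have distinct values (no plateaus). Samples are visited from the lowest to the highest elevation; a sample with no labeled neighbor starts a new catchment basin, a sample touching one basin joins it, and a sample touching two different basins becomes a watershed sample.

```python
def watershed_1d(signal):
    """Label a 1-D signal by flooding from its local minima.

    Returns one label per sample: 1, 2, ... for catchment basins and
    0 for watershed samples. Assumes no plateaus (illustrative only).
    """
    n = len(signal)
    labels = [None] * n
    next_label = 1
    # Visit samples in order of increasing elevation (rising water level).
    for i in sorted(range(n), key=lambda k: signal[k]):
        neighbour_labels = {
            labels[j]
            for j in (i - 1, i + 1)
            if 0 <= j < n and labels[j] not in (None, 0)
        }
        if not neighbour_labels:
            labels[i] = next_label          # local minimum: new basin
            next_label += 1
        elif len(neighbour_labels) == 1:
            labels[i] = neighbour_labels.pop()  # extend the only adjacent basin
        else:
            labels[i] = 0                   # two basins meet: watershed sample
    return labels


# A relief with three minima and two separating maxima.
relief = [3, 1, 4, 6, 2, 5, 7, 0, 3]
print(watershed_1d(relief))
```

In this toy relief the two local maxima separating the three basins receive the label 0, which corresponds to the watershed pixels that are not assigned to any region.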
Figure 4.6 (a) Topographic representation of a one-band image, and (b) example of a watershed transformation in one dimension, where three regions, associated with the three minima, are defined. The two maxima correspond to the borders between regions and are not assigned to any region (watershed pixels). Illustration taken from [49] © IEEE. Permission for use granted by IEEE 2015.
As with the morphological profile (elaborated in Chapter 5), the extension of a WS technique to hyperspectral images is not straightforward, because there are no natural means for total ordering of multivariate pixels [49]. In general, in order to apply WS, two main steps should be carried out: (1) a gradient step and (2) the watershed transformation (Figure 4.7). There are different ways to obtain a gradient of a multivariate function. For example, one can compute the modulus of the gradient on each band and take the sum or the supremum of the gradients [244]. Another option is to consider vectorial gradients by estimating distances between vector pixels [243]. In addition, since the large number of bands in hyperspectral images generates redundancies, it is important to apply a feature reduction technique to the input data before the gradient step, in order to reduce the number of bands/features and extract the pertinent information [243]. Figure 4.8 shows some examples of gradients for the ROSIS-03 Pavia University data: (a) gradient of a single band obtained by applying Canny on band no. 40; (b) sum of gradients for the first three PCA components; and (c) gradient of a color image (the first three PCA components) obtained by transforming the original image to the YCbCr color space and applying Canny edge detection.
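As a rough illustration of the band-wise option, the sketch below computes a gradient modulus on each band and combines the results pixel by pixel, either by summation or by taking the maximum. It is only a sketch under stated assumptions: forward differences with replicated borders stand in for whatever gradient operator is actually used (the figures in the text rely on Canny), and the names `gradient_modulus` and `combine_gradients` are hypothetical.

```python
import math


def gradient_modulus(band):
    """Gradient magnitude of a 2-D band (list of rows).

    Uses forward differences with replicated borders; this operator is an
    assumption made for the example.
    """
    rows, cols = len(band), len(band[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dy = band[min(r + 1, rows - 1)][c] - band[r][c]
            dx = band[r][min(c + 1, cols - 1)] - band[r][c]
            out[r][c] = math.hypot(dx, dy)
    return out


def combine_gradients(bands, mode="sum"):
    """Combine per-band gradient moduli into a one-band gradient image."""
    grads = [gradient_modulus(b) for b in bands]
    reduce_fn = sum if mode == "sum" else max
    rows, cols = len(grads[0]), len(grads[0][0])
    return [
        [reduce_fn(g[r][c] for g in grads) for c in range(cols)]
        for r in range(rows)
    ]


# Two small bands (e.g., two retained features after feature reduction).
bands = [
    [[0, 0, 1], [0, 0, 1]],
    [[0, 2, 4], [0, 2, 4]],
]
print(combine_gradients(bands, mode="sum"))
print(combine_gradients(bands, mode="max"))
```

The resulting one-band image can then be passed to a standard watershed algorithm.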
Several techniques have been proposed to apply watershed segmentation to hyperspectral images, such as [49, 243]. The most common such approach was introduced by Tarabalka et al. in [49]; it first computes a one-band gradient from a multidimensional image and then applies a standard watershed algorithm. Figure 4.7 shows a simple strategy for applying WS to a hyperspectral image. This approach is detailed as follows [73]:
1. First, a one-band gradient image is obtained by applying the robust color morphological gradient (RCMG) [245] to the hyperspectral image. The watershed is to be applied to a d-band hyperspectral image, in which each pixel is a vector $X \in \mathbb{R}^d$. Let $\chi = [X_1, X_2, \ldots, X_e]$ be a set of e vectors contained within a structuring element (SE) E (i.e., the pixel $x_p$ itself and e − 1 neighboring pixels). A 3×3 square SE with the origin in its center is typically used. The color morphological gradient (CMG), using the Euclidean distance, is computed as:
$$CMG_E(X) = \max_{i,j \in \chi} \{ \|X_i - X_j\|_2 \}, \tag{4.14}$$
i.e., the maximum of the distances between all pairs of vectors in the set χ. The main drawback of the CMG is that it is very sensitive to noise. To address the problem of outliers, the RCMG was introduced in [245]. The scheme to make the CMG robust consists of removing the two pixels that are the furthest apart and then finding the CMG of the remaining pixels. This process is repeated several times, depending on the size of the SE and the noise level. The RCMG, using the Euclidean distance, can be estimated as:
$$RCMG_E(X) = \max_{i,j \in [\chi - REM_r]} \{ \|X_i - X_j\|_2 \}, \tag{4.15}$$
where $REM_r$ is the set of r vector pairs removed. If a 3×3 square SE is used, r = 1 is recommended [245].
2. Then, the watershed transformation is applied to the one-band RCMG image, using a standard algorithm, for example the algorithm of Vincent and Soille [246]. Consequently, the image is segmented into several regions and one subset of watershed pixels (i.e., pixels situated on the borders between regions; see Figure 4.6).
3. Finally, watershed pixels are assigned to the neighboring region with the "closest" median [247] (i.e., with the minimal distance between the vector median of the corresponding region and the watershed pixel). Assuming that an L1 norm is used to compute distances, the vector median for the region $\mathbf{X} = \{X_j \in \mathbb{R}^d, j = 1, 2, \ldots, l\}$ is defined as $X_{VM} = \arg\min_{X \in \mathbf{X}} \{ \sum_{j=1}^{l} \|X - X_j\|_1 \}$, in which VM stands for the vector median.
Figure 4.7 Flowchart that shows strategies of applying watershed to a hyperspectral image. Illustration taken from [49] © IEEE. Permission for use granted by IEEE 2015.
Figure 4.8 Gradients of ROSIS-03 Pavia University data: (a) gradient of a single band obtained by applying Canny on band no. 40, (b) sum of gradients for the first three PCA components, and (c) gradient of a color image (the first three PCA components) obtained by transforming the original image to the YCbCr color space and applying Canny edge detection.
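The quantities in (4.14) and (4.15) and the vector median can be sketched for a single neighborhood as follows. This is an illustrative sketch, not the implementation of [245] or [49]: the function names are hypothetical, pixel vectors are plain tuples, and the neighborhood is assumed to contain more than 2r + 1 vectors so that at least one pair remains after removal.

```python
from itertools import combinations


def l2(a, b):
    """Euclidean distance between two pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cmg(vectors):
    """Color morphological gradient: largest pairwise distance, as in (4.14)."""
    return max(l2(a, b) for a, b in combinations(vectors, 2))


def rcmg(vectors, r=1):
    """Robust CMG, as in (4.15): drop the r farthest-apart pairs first."""
    remaining = list(vectors)
    for _ in range(r):
        i, j = max(
            combinations(range(len(remaining)), 2),
            key=lambda p: l2(remaining[p[0]], remaining[p[1]]),
        )
        remaining = [v for k, v in enumerate(remaining) if k not in (i, j)]
    return cmg(remaining)


def vector_median(region):
    """Vector of the region with the smallest summed L1 distance to the others."""
    def total_l1(x):
        return sum(sum(abs(a - b) for a, b in zip(x, y)) for y in region)
    return min(region, key=total_l1)


# A 3x3 neighborhood of 2-band pixels containing one outlier, (30, 40).
neighborhood = [(0, 0)] * 4 + [(3, 4)] * 4 + [(30, 40)]
print(cmg(neighborhood), rcmg(neighborhood, r=1))
print(vector_median([(0, 0), (1, 1), (2, 2), (3, 3), (10, 10)]))
```

In this example the single outlier dominates the CMG, whereas removing one farthest-apart pair (r = 1) leaves a gradient value that reflects the actual transition between the two homogeneous groups.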
4.6
HIERARCHICAL SEGMENTATION (HSEG)
HSeg is a segmentation technique that works in both the spatial and the spectral domain. This approach was developed by Tilton in [248, 249]. It combines region growing and unsupervised classification in order to produce a final segmentation map. The region growing is based on the hierarchical step-wise optimization (HSWO) method [250], which produces spatially connected regions. The unsupervised classification part groups similar spatially disjoint regions together [248, 249]. Therefore, the unsupervised classification part of this method provides a possibility of
merging spatially nonadjacent regions by spectral clustering. This approach can be described as follows [73]:
Initialization: A region label is allocated to each pixel. If a presegmentation is provided, each pixel is labeled according to the segmentation map. Otherwise, each pixel is labeled as a separate region.
1. The dissimilarity criterion value is calculated between all pairs of spatially adjacent regions.⁶ As mentioned in [249], different measures can be used to compute the dissimilarity criterion between regions, such as vector norms or the spectral angle mapper (SAM) between the region mean vectors (for more information regarding SAM, please see Appendix B).
2. The smallest dissimilarity criterion value, dissim_val, is found, and thresh_val is set equal to it. Then, all pairs of spatially adjacent regions with dissim_val = thresh_val are merged.
3. If $S_{wght}$ > 0.0, all pairs of spatially nonadjacent regions with dissim_val ≤ $S_{wght}$ · thresh_val are merged. $S_{wght}$ is an optional parameter that tunes the relative importance of spectral clustering versus region growing. If $S_{wght}$ = 0.0, only merging of spatially adjacent regions is applied. In contrast, if 0.0 < $S_{wght}$ ≤ 1.0, merging of spatially adjacent regions is favored over merging of spatially nonadjacent regions by a factor of 1.0/$S_{wght}$. The value of $S_{wght}$ can be chosen based on a priori knowledge regarding the information classes contained in the image. If some classes have very similar spectral responses, it is recommended to choose $S_{wght}$ = 0.0 (i.e., to perform segmentation only in the spatial domain). Otherwise, it is recommended to include the possibility of merging spatially nonadjacent regions, while favoring region growing (e.g., $S_{wght}$ = 0.1 can be chosen). If $S_{wght}$ > 0.0, labeling of connected components has to be applied after RHSeg in order to obtain a segmentation map where each spatially connected component has a unique label [73].
4. The algorithm stops if convergence is met. Otherwise, it returns to step 1.
⁶ As a reminder, a region is spatially adjacent to a given region if it contains pixels located in the neighborhood (e.g., four- or eight-neighborhood) of the considered region's pixels.
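One pass through steps 1 to 3 can be sketched as follows. This is a heavily simplified illustration, not Tilton's implementation: the function `hseg_iteration` and its arguments are hypothetical, the Euclidean norm between region mean vectors is assumed as the dissimilarity criterion, and all dissimilarities are computed once from the current means rather than being updated after each merge.

```python
from itertools import combinations


def dissim(a, b):
    """Euclidean distance between two region mean vectors (assumed criterion)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hseg_iteration(means, adjacent, s_wght=0.0):
    """One simplified merge pass.

    means    : dict mapping region label -> mean vector
    adjacent : set of frozenset({a, b}) for spatially adjacent regions
    Returns a dict mapping each old label to the label of its merged region.
    """
    # Step 2: smallest dissimilarity among spatially adjacent pairs.
    thresh_val = min(dissim(means[a], means[b]) for a, b in map(tuple, adjacent))
    parent = {k: k for k in means}

    def find(k):
        while parent[k] != k:
            k = parent[k]
        return k

    for a, b in combinations(sorted(means), 2):
        d = dissim(means[a], means[b])
        is_adjacent = frozenset((a, b)) in adjacent
        merge_adjacent = is_adjacent and d == thresh_val
        # Step 3: spectral clustering of nonadjacent regions, weighted by s_wght.
        merge_nonadjacent = (
            not is_adjacent and s_wght > 0.0 and d <= s_wght * thresh_val
        )
        if merge_adjacent or merge_nonadjacent:
            parent[find(b)] = find(a)
    return {k: find(k) for k in means}


means = {1: (0.0,), 2: (1.0,), 3: (5.0,), 4: (1.2,)}
adjacent = {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))}
print(hseg_iteration(means, adjacent, s_wght=0.0))
print(hseg_iteration(means, adjacent, s_wght=0.5))
```

With `s_wght=0.0` only the adjacent regions 1 and 2 are merged; with `s_wght=0.5` the nonadjacent region 4, whose mean is close to that of region 2, joins the same cluster, so a connected component labeling would be needed afterwards to recover spatially connected regions.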
The merging of spatially disjoint regions leads to heavy processing time [73]. To address this issue, a recursive divide-and-conquer approximation of HSeg (known as RHSeg) and its efficient parallel implementation have been developed.⁷ As mentioned in [248, 249], HSeg produces its output as a hierarchical sequence of image segmentations, from the initialization down to the one-region segmentation, if allowed to proceed that far. In this sequence, a particular object can be represented by several regions (or objects) at finer levels of detail, and can be merged with other objects and represented as one region at coarser levels of detail. However, in practical applications, a subset of one or several segmentations needs to be chosen from this hierarchical sequence. An appropriate level can be selected interactively with the program HSEGViewer [249], or an automated method, tailored to the application, can be developed, as explored in [251–253].
⁷ J. C. Tilton. HSeg/RHSeg, HSEGViewer and HSEGReader user's manual (version 1.40). Provided with the evaluation version of RHSeg available from: http://ipp.gsfc.nasa.gov/RHSEG, 2008.
4.7
SEGMENTATION AND CLASSIFICATION USING AUTOMATICALLY SELECTED MARKERS
Segmentation techniques aim at dividing an image into several homogeneous and nonoverlapping regions with respect to a measure of homogeneity. However, the choice of this measure is critical and image-dependent. A too relaxed or a too restrictive homogeneity criterion can lead to undersegmentation or oversegmentation, respectively [254]. If the final aim of the segmentation step is to produce a supervised classification map, the information about thematic classes can also be taken into account for producing a segmentation map. Here, we aim at reducing the effect of oversegmentation, and thus further improving segmentation and classification results. To that end, marker-controlled segmentation is elaborated here, where markers for spatial regions are automatically obtained from classification results and then used as seeds for region growing [61, 255]. Classification results are typically more accurate inside spatial regions and more erroneous closer to region borders. Based on this assumption, it was proposed to choose the most reliably classified pixels (i.e., those inside spatial regions) as region markers. In this section, two different marker selection approaches are elaborated, based either on the results of a probabilistic SVM or on a multiple classifier (MC) system. Then, a marker-controlled segmentation algorithm is presented, which consists of the construction of an MSF rooted on markers.
A marker is a set of pixels (not necessarily connected; it can be composed of several spatially disjoint subsets of connected pixels) associated with one object in the image scene [255]. The markers of regions can be selected in two ways: (1) manual selection, which is time consuming, or (2) automatic selection [27]. In general, remote sensing images contain small and complex structures. Therefore, the automatic selection of markers is a challenging task. Below, two ways for the automatic selection of markers are discussed.
4.7.1
Marker Selection Using Probabilistic SVM
In [255], Tarabalka et al. chose markers by considering probabilistic SVM classification results. The flowchart of that method, as well as an illustrative example of it, is shown in Figure 4.9. The workflow of the method is summarized as follows:
Figure 4.9 (a) Flowchart of the SVM-based marker selection procedure. (b) Illustrative example of the SVM-based marker selection. Illustration taken from [255] © IEEE. Permission for use granted by IEEE 2015.
1. Probabilistic pixel-wise classification: Perform a probabilistic pixel-wise SVM classification of the input hyperspectral data [256, 257]. The outputs of this step are (1) a classification map, containing a unique class label for each pixel, and (2) a probability map, containing for each pixel the estimated probability of belonging to the assigned class. To estimate class probabilities, pairwise coupling of binary probability estimates can be considered [256, 258]. In [255], the probabilistic SVM algorithm implemented in the LIBSVM library [256] was utilized. The objective is to estimate, for each of the N pixel vectors X, the classification probabilities
$$p(y|X) = \{p_i = p(y = i|X),\ i = 1, \ldots, C\}, \tag{4.16}$$
where C is the number of thematic classes and y is the class label of a given pixel. Pairwise class probabilities $r_{ij} \approx p(y = i \mid y = i \text{ or } j, X)$ are first estimated. After that, the probabilities in (4.16) are estimated with the algorithm discussed in [258]. A probability map can then be obtained by assigning to each pixel the maximum probability estimate $\max(p_i)$, i = 1, ..., C.
2. Marker selection: Apply a connected component labeling to the classification map by considering an eight-neighborhood connectivity [236]. Then, analyze each connected component as follows:
• If a region is large (i.e., the number of pixels in the region is more than M), it is considered to represent a spatial structure. Its marker is defined as the P% of pixels within this region with the highest probability estimates.
• If a region is small (i.e., the number of pixels in the region is less than M), it is further investigated whether its pixels were classified to a particular class with a high probability. Its potential marker is formed by the pixels with probability estimates higher than a defined threshold S. If there are no such pixels, the component is assumed to be the consequence of classification noise, and the algorithm tends to discard it.
The parameters M, P, and S are set based on a priori knowledge of the image, as described in [255]:
• The parameter M, which defines whether a region is large, is set based on the resolution of the image and the typical sizes of the objects of interest.
• The parameter P, which defines the percentage of pixels within a large region to be used as markers, is set based on M. Since the marker for a large region must include at least one pixel, the following condition must be fulfilled: P ≥ 100%/M.
• The parameter S, which is a threshold of probability estimates defining potential markers for small regions, is set based on the probability of the presence of small structures in the image (which depends on the image resolution and the classes of interest) and the importance of the potential small structures (i.e., the cost of losing the small structures in the classification map).
As the output of the marker selection step, a map of m markers is obtained, in which each marker $O_i = \{X_j \in \mathbf{X}, j = 1, \ldots, \mathrm{card}(O_i); y_{O_i}\}$ (i = 1, ..., m) consists of one or several pixels and has a class label $y_{O_i}$. It should be noted that a marker is not necessarily a spatially connected set of pixels.
4.7.2
Multiple Classifier Approach for Marker Selection
As can be seen, although the marker selection approach based on the probabilistic SVM leads to good classification accuracies, its performance depends strongly on the performance of the selected pixel-wise classifier (e.g., the SVM classifier). To address this shortcoming, a marker selection approach was introduced in [61] that considers an ensemble of classifiers (i.e., multiple classifiers instead of only a single classifier). In this manner, several individual classifiers are integrated within a single system (see Figure 4.10) in order to make the most of their complementary information while avoiding their weaknesses [120]. As mentioned in [61], this approach can in general be summarized as follows:
1. Multiple classification: Several classifiers are used independently to classify an image.
2. Marker selection: A marker map is constructed by selecting the pixels assigned by all the classifiers to the same class.
Figure 4.10 Flowchart of a multiple classifier system. Illustration is taken from [61] © IEEE. Permission for use granted by IEEE 2015.
Figure 4.11 shows a flowchart of the multiple spectral-spatial classifier (MSSC) marker selection scheme. In greater detail, this approach can be described as follows:
1. Multiple classification: Apply several individual classifiers to an image. Spectral-spatial classifiers can be used as the individual classifiers in this system, each of them combining the results of a pixel-wise classification and one of the unsupervised segmentation techniques. The procedure is as follows:
(a) Unsupervised image segmentation: As suggested in [61], segmentation techniques based on different principles must be chosen in this system. In [61], three of the techniques described before (watershed, segmentation by EM, and HSeg) were taken into consideration.
(b) Pixel-wise classification: The SVM was used to classify the input hyperspectral image. This step provides a classification map in which each pixel has a unique class label.
(c) Majority voting within segmentation regions: The output of the segmentation step and the output of the pixel-wise classification step are integrated using the majority vote approach described in Section 4.1. The output of this step provides spectral-spatial classification maps.
It should be noted that different segmentation techniques based on different characteristics provide different classification maps. In an efficient multiple classifier system, it is important to obtain different results, so that potential mistakes of any given individual classifier can be minimized by considering the complementary information of the other ones. It is also important to note that by considering spatial information along with spectral information in this step, more accurate classification maps can be obtained than with pixel-wise classification.
2. Marker selection: Another important point in designing a multiple classifier system is how the different classifiers should be combined (i.e., the combination function [259]). Based on [61], for each pixel, if all the classifiers agree, the pixel is kept as a marker with the corresponding class label. The resulting map of m markers contains the most reliably classified pixels. The remaining pixels are further classified by applying a marker-controlled region growing, which is described in the next subsection.
Figure 4.11 Flowchart of the multiple spectral-spatial classifier marker selection scheme. Illustration is taken from [61] © IEEE. Permission for use granted by IEEE 2015.
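The two building blocks of this scheme, majority voting within segmentation regions and the agreement rule for marker selection, can be sketched as follows. This is an illustrative sketch rather than the implementation of [61]: maps are flattened into plain lists, class labels are assumed to be positive integers with 0 reserved for "no marker", and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(class_map, seg_map):
    """Assign to every pixel the most frequent class inside its region."""
    votes = {}
    for cls, region in zip(class_map, seg_map):
        votes.setdefault(region, Counter())[cls] += 1
    winner = {region: cnt.most_common(1)[0][0] for region, cnt in votes.items()}
    return [winner[region] for region in seg_map]


def select_markers(class_maps):
    """Keep a pixel as a marker only if all classifiers agree on its class."""
    return [
        labels[0] if len(set(labels)) == 1 else 0
        for labels in zip(*class_maps)
    ]


# One pixel-wise classification map and three segmentation maps
# (stand-ins for, e.g., watershed, EM, and HSeg results).
pixelwise = [1, 1, 2, 1, 2, 2, 2, 1]
segmentations = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1],
]
spectral_spatial_maps = [majority_vote(pixelwise, s) for s in segmentations]
print(select_markers(spectral_spatial_maps))
```

Pixels on which the three spectral-spatial maps disagree, here those near the border between the two classes, are left unmarked and are classified later by region growing.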
4.7.3
Construction of a Minimum Spanning Forest (MSF)
Once marker selection has been carried out, a map of markers is available.⁸ This map is further used for marker-controlled region growing by considering an MSF algorithm [61, 255]. Figure 4.12 illustrates a simple flowchart of the spectral-spatial classification using an MSF grown from the classification-derived markers. Below, the main steps of this procedure are elaborated.
1. Construction of an MSF: In this step, each pixel is considered as a vertex $v \in V$ of an undirected graph G = (V, E, W), where V and E are the sets of vertices and edges, respectively, and W is a weighting function. Each edge $e_{i,j} \in E$ of this graph connects a couple of vertices i and j corresponding to neighboring pixels. An eight-connectivity neighborhood system was considered in [61, 255]. A weight $w_{i,j}$ is allocated to each edge $e_{i,j}$, which indicates the degree of dissimilarity between the two vertices connected by this edge. Different dissimilarity measures can be used for computing the weights of edges; for example, one can use either vector norms or SAM between two pixel vectors.
Let G = (V, E, W) be a graph. A spanning forest $F = (V, E_F)$ of G is a nonconnected graph without cycles such that $E_F \subset E$. The MSF rooted on a set of m distinct vertices $\{t_1, \ldots, t_m\}$ is a spanning forest $F^* = (V, E_{F^*})$ of G, such that each tree of $F^*$ is grown from one root $t_i$, and the sum of the edge weights of $F^*$ is minimal [260]:
$$F^* \in \arg\min_{F \in SF} \left\{ \sum_{e_{i,j} \in E_F} w_{i,j} \right\}, \tag{4.17}$$
where SF is the set of all spanning forests of G rooted on $\{t_1, \ldots, t_m\}$. In order to build an MSF rooted on markers, m extra vertices $t_i$, i = 1, ..., m are introduced. Each additional vertex $t_i$ is connected by a null-weight edge to the pixels belonging to the marker $O_i$. Moreover, a root vertex r is added and is connected by null-weight edges to the vertices $t_i$ (Figure 4.13 illustrates an example of the addition of extra vertices). The minimum spanning tree [260] of the built graph induces an MSF in G, in which each tree is grown on a vertex $t_i$.
⁸ The MSF discussion is based on [61, 255].
Figure 4.12 Flowchart of the spectral-spatial classification approach using a minimum spanning forest grown from automatically selected markers. Illustration taken from [27] © IEEE. Permission for use granted by IEEE 2015.
In order to compute a minimum spanning tree, Prim's algorithm can be used (see Appendix C) [255, 261]. The MSF is obtained after removing the vertex r. Each tree in the MSF forms a region in the segmentation map, obtained by mapping the output graph onto an image. Finally, a spectral-spatial classification map is produced by allocating the class of each marker to all the pixels grown from this marker.
2. Majority voting within connected components: This step is optional. Its main justification is that, although the most reliably classified pixels are selected as markers, there is a risk that a marker is wrongly allocated to a class. If this happens, all the pixels within the region grown from this marker are at risk of being wrongly classified. In order to address this issue and make the classification scheme more robust, a post-processing step can be applied to the classification map. To do so, one can perform a simple majority voting technique with four-neighborhood connectivity [46, 262], which was described in Section 4.1.
It should be noted that an eight-neighborhood connectivity was considered to produce the MSF, while a four-neighborhood connectivity can be considered for majority voting. By using the eight-neighborhood connectivity in the first case, one is able to obtain a segmentation map without rough borders. When performing the majority voting step, the use of the four-neighborhood connectivity results in the same or a larger number of connected components than the use of the eight-neighborhood connectivity. Therefore, the possibility of undersegmentation can be decreased in this step. One region from a segmentation map can be split into two connected regions when using the four-neighborhood connectivity. Furthermore, these two regions can be assigned to two different classes by the majority voting procedure.
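The growth of an MSF from markers can be sketched with a priority queue, as below. This is a minimal illustration and not the implementation of [61, 255]: the function name `msf_classify` is hypothetical, the L1 norm between pixel vectors is assumed as the edge weight, and an eight-neighborhood is used. Instead of adding the extra vertices and the root explicitly, all marker pixels are placed in the forest at the start, which has the same effect as connecting them through null-weight edges; Prim's algorithm then repeatedly attaches the unlabeled pixel reachable through the lightest edge.

```python
import heapq


def msf_classify(image, markers):
    """Grow a minimum spanning forest from marker pixels (Prim's algorithm).

    image   : 2-D list of pixel vectors (tuples)
    markers : 2-D list of class labels, 0 for non-marker pixels
    Returns a 2-D list in which each pixel carries the class of the marker
    whose tree reached it.
    """
    rows, cols = len(image), len(image[0])
    labels = [row[:] for row in markers]   # marker pixels start in the forest
    heap = []

    def push_edges(r, c):
        # Queue the edges from pixel (r, c) to its unlabeled 8-neighbors.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if (dr or dc) and inside and not labels[nr][nc]:
                    weight = sum(
                        abs(a - b) for a, b in zip(image[r][c], image[nr][nc])
                    )
                    heapq.heappush(heap, (weight, nr, nc, labels[r][c]))

    for r in range(rows):
        for c in range(cols):
            if markers[r][c]:
                push_edges(r, c)
    while heap:
        _, r, c, label = heapq.heappop(heap)
        if labels[r][c]:
            continue                        # already attached to a tree
        labels[r][c] = label
        push_edges(r, c)
    return labels


image = [[(0,), (1,), (9,)],
         [(0,), (2,), (10,)]]
markers = [[1, 0, 0],
           [0, 0, 2]]
print(msf_classify(image, markers))
```

Each unlabeled pixel ends up in the tree of the marker to which it is connected by the cheapest path of edges, so the returned map is both a segmentation (one region per tree) and a classification (the class of the marker).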
4.8
THRESHOLDING-BASED SEGMENTATION TECHNIQUES
Thresholding-based segmentation methods are considered one of the most common segmentation approaches. While many thresholding-based segmentation approaches address the issue of bi-level thresholding, the others have considered the multilevel problem [43].
Figure 4.13 Example of addition of extra vertices $t_1$, $t_2$, r to the image graph for construction of an MSF rooted on markers 1 and 2. Nonmarker pixels are denoted by 0. Illustration is taken from [73] © IEEE. Permission for use granted by IEEE 2015.
The problem of bi-level thresholding can be viewed as an optimization problem to probe for the threshold t that maximizes $\sigma_B^2$ and minimizes $\sigma_W^2$ [55]. For bi-level thresholding, the problem is solved by finding $T^*$, which satisfies $\max(\sigma_B^2(T^*))$ where $0 \le T^*$ λ
[Figure panel labels: Thickening Profile; Square SE (MP), sizes: 7, 13, 19; Area Attribute]
Figure 6.6 An example of different attribute profiles (area, moment of inertia and standard deviation) with different threshold values along with an MP obtained by a square SE. This figure was prepared by Mauro Dalla Mura and used by his permission.
[Figure panel labels: Derivative of Thickening Profile; Derivative of Thinning Profile; Square SE (DMP), sizes: 7, 13, 19; Area Attribute, λ: 45, 169, 361, criterion: Area > λ; Moment of Inertia Attribute, λ: 0.2, 0.1, 0.3, criterion: Inertia > λ; STD Attribute, λ: 10, 20, 30, criterion: STD > λ]
The original gray scale image f is also included in the profile, as it can be considered as level zero of both the thickening and thinning profiles (i.e., $\phi^{P_{\lambda_0}}(f) = \gamma^{P_{\lambda_0}}(f) = f$, where $P_{\lambda_0}$ is a predicate that is fulfilled by all the components in the image, leading to no filtering). Depending on the attribute and criterion considered, different information can be extracted from the structures in the scene, leading to different multilevel characterizations of the image [66]. Figure 6.7 illustrates the general architecture of the AP. In terms of computational burden, an MP is heavier than an equivalent AP, since the MP is always based on two complete image transformations, one performed by a closing and the other by an opening, for each level of the profile. In contrast, for producing an AP, it is only necessary to build one max-tree for the thinnings and one min-tree for the thickenings for the entire profile. Then, the set of filtered images is obtained by sequential pruning of the same trees with different values of λ. This reduces the burden of the analysis compared to MPs, since the most demanding phase of the filtering, which is the construction of a tree [285], is done only once. As seen above, morphological filters (here specifically APs) have been developed for gray scale images consisting of only one component. It is not straightforward to generalize this concept to multispectral and hyperspectral images, since there is no unique approach for extending them to multichannel images [286–290, Chapter 11]. In the same way as for MPs, one possible way of applying the concept of the profile to multichannel images is to perform a feature reduction approach such as PCA or ICA and apply APs to the most informative features. This method for the extension of APs to multichannel data was proposed in [83, 215].
In more detail, that approach is based on the reduction of the dimensionality of the image values from $T \subseteq \mathbb{Z}^n$ to $T' \subseteq \mathbb{Z}^m$ (m ≤ n) with a generic transformation $\Psi: T \to T'$ applied to an input image f (i.e., g = Ψ(f)). Then, an AP is produced for each of the most informative features $g_i$ (i = 1, ..., m) of the transformed image. This can be mathematically described as:
$$EAP(g) = \{AP(g_1), AP(g_2), \ldots, AP(g_m)\}. \tag{6.2}$$
Figure 6.7 Example of the general architecture of AP [64] © IEEE. Permission for use granted by IEEE 2015.
Figure 6.3 illustrates the general architecture of EAP [64]. It is also possible to compute multiple EAPs with different attributes in order to derive a more complete descriptor of an image. This approach is known as the extended multiattribute profile (EMAP), which was proposed in [82]:
$$EMAP(g) = \{EAP_{A_1}(g), EAP'_{A_2}(g), \ldots, EAP'_{A_k}(g)\}, \tag{6.3}$$
where $EAP_{A_i}$ is an EAP built with a set of predicates evaluating the attribute $A_i$, and $EAP' = EAP \setminus \{g_i\}_{i=1,\ldots,m}$ is used in order to avoid redundancy, since the original components $\{g_i\}$ are present in each EAP. Figure 6.4 shows the general architecture of EMAP. The following attributes have been widely used in the literature in order to produce EMAP:
1. Area of the region (related to the size of the regions).
2. Standard deviation (as an index of the homogeneity of the regions).
3. Diagonal of the box bounding the regions.
4. Moment of inertia (as an index for measuring the elongation of the regions).
Figure 6.5 illustrates an example of different APs (area, moment of inertia, and standard deviation) with different threshold values, along with an MP obtained by a square SE for further comparison. In addition, like the differential MP (DMP) discussed in the previous chapter, the differential attribute profile (DAP) is composed of the residues of two subsequent filtering operations for two adjacent levels of the attribute profile. Figure 6.6 shows an example of a DMP and three DAPs based on (1) area, (2) moment of inertia, and (3) standard deviation [66].
It should be noted that the outcome of classification using EMAP is highly dependent on the choice of two factors: (1) the type of attributes, and (2) the threshold values. To address these issues of EMAP (i.e., (1) which attributes, such as area or moment of inertia, lead to a better discrimination of the different classes, and (2) which threshold values should be considered in order to initialize each AP), automatic schemes for using EMAP have been investigated. In order to make the system automatic, usually only the area and standard deviation attributes are considered, since these attributes can be adjusted in an automatic way and are also related to the object hierarchy in the images [87].
The standard deviation is adjusted according to the mean of the individual features, since standard deviation shows dispersion from the mean [292]. Therefore, $\lambda_s$ is initialized in a way to cover a reasonable amount of deviation in the individual feature and can be mathematically given by

\[
\lambda_s(PC_i) = \frac{\mu_i}{100}\,\{\sigma_{\min},\ \sigma_{\min}+\delta_s,\ \sigma_{\min}+2\delta_s,\ \ldots,\ \sigma_{\max}\}, \tag{6.4}
\]
where $\mu_i$ is the mean of the $i$th feature and $\sigma_{\min}$, $\sigma_{\max}$, and $\delta_s$ are the lower bound, upper bound, and step size for the standard deviation attribute, respectively. In terms of adjusting $\lambda_a$ for the area attribute, the resolution of the image should be taken into account in order to construct an EAP [85]. The automatic scheme for the area attribute is given as

\[
\lambda_a(PC_i) = \frac{1000}{\upsilon}\,\{a_{\min},\ a_{\min}+\delta_a,\ a_{\min}+2\delta_a,\ \ldots,\ a_{\max}\}, \tag{6.5}
\]
where $a_{\min}$ and $a_{\max}$ are the lower and upper bounds, respectively, $\delta_a$ is the step size, and $\upsilon$ is the spatial resolution of the input data. Here, "automatic" means that the framework only needs a range of parameter values to be established in order to automatically obtain a classification result with a high accuracy for different data sets, instead of adjusting different thresholds with crisp values. More information regarding appropriate values for the lower bound, upper bound, and step sizes can be found in [85, 87, 293].

Another way of addressing the aforementioned issues of EMAP is to produce a feature bank including different types of attributes with wide ranges of threshold values. Then, a feature selection technique can be applied on the feature bank in order to find the most informative features in terms of classification accuracies and to reduce the redundancy of the features. For example, in [201] a feature selection technique was introduced, which is based on a new binary optimization method called BFODPSO (see footnote 4). In that method, SVM is used as the fitness function and its corresponding overall classification accuracy is chosen as the fitness value in order to evaluate the efficiency of different groups of features. In [212], first an AP feature bank is built consisting of different attributes with a wide range of threshold values. Then, the BFODPSO-based feature selection is performed on the feature bank. In this case, SVM is chosen as the fitness function, and the fitness of each particle is evaluated by the overall accuracy of SVM over the validation samples. After a few iterations, the BFODPSO-based feature selection approach finds the most informative features (resulting from EMAP) with respect to the overall accuracy of SVM over the validation samples. For more information, please see [201, 294, 295].

Footnote 4: For more information regarding the BFODPSO-based feature selection approach, please see Chapter 3.

In [296] a new feature selection technique was proposed that is based on the integration of GA and PSO and is called HGAPSO (see footnote 5). The feature selection technique was applied on several features produced by EMAP in order to select the most informative features for detecting road networks and discriminating between the road and background classes.

In [292], an automatic generation of standard deviation attributes was introduced. As mentioned there, features commonly follow different statistics, and the individual classes also have different statistics for different features. Therefore, different thresholds are needed to build the standard deviation profiles from different features. In this way, the thresholds for the standard deviation attributes are estimated based on the statistics of the classes of interest. The general idea behind that paper was that the standard deviation of the training samples of the different classes of interest is related to the maximum standard deviation of the pixel values within individual segments of the corresponding classes. The obtained results suggest that the automatic method with only one attribute (standard deviation), along with supervised feature reduction, can provide good results in terms of classification accuracies.

6.3 SPECTRAL-SPATIAL CLASSIFICATION BASED ON AP

6.3.1 Strategy 1
In [83], two approaches were introduced in order to use EAPs for the classification of hyperspectral images. In those approaches, first an unsupervised feature extraction technique was applied on the input data and the most informative features were kept. Then, different attributes were considered and computed on the informative features. In order to produce the final classification map, two methods were taken into account.

Footnote 5: For more information regarding the HGAPSO-based feature selection approach, please see Chapter 3.
1. Stacked vector approach (SVA): The simple workflow of this method is shown in Figure 6.8(a). This method simply combines the EAPs by concatenating them into a single vector of features (also called EMAP). It leads to great redundancy in the extracted features.
2. Fusion approach (FA): The simple workflow of this method is shown in Figure 6.8(b). This approach is based on the individual classification of each EAP. The final classification map is then obtained by integrating the individual classification maps through a decision fusion step.

As mentioned in [83], in comparison to the SVA, the FA keeps the dimensionality of the data low and increases the robustness of the results, particularly if the different EAPs generate complementary errors. In that work, ICA was found to be more efficient than PCA in terms of classification accuracy. Figure 6.9 shows the classification maps for the ROSIS-03 Pavia University data set: (a) PCA with area attribute (EAPa) with OA=90.00%, (b) PCA with FA with OA=89.21%, (c) ICA with SVA with OA=94.47%, and (d) ICA with FA with OA=91.69%.

6.3.2 Strategy 2
The conventional way of computing attribute profiles discards class-specific information, since it uses unsupervised feature extraction approaches (e.g., PCA and ICA) to produce the base images for the AP. In other words, it partially excludes valuable spectral information. In order to address this issue, a spectral-spatial classification approach was introduced in [87] (please see the general idea of the model in Figure 6.10(a)); here it is called MANUAL, since the threshold values for EMAP are adjusted manually. In the same paper, an automatic scheme of that method was developed (please see the general idea of the model in Figure 6.10(b)); it is called AUTOMATIC, since the threshold values for EMAP are adjusted automatically with respect to (6.4) and (6.5).

The results reported in [87] for the MANUAL and AUTOMATIC schemes are very close in terms of classification accuracies. This small difference suggests that only two attributes (i.e., area and standard deviation) can model the spatial information of the data sets used to a considerable extent, and that the other attributes (diagonal of the box bounding the region and moment of inertia) cannot add a significant improvement to the classification accuracies, although they carry information on the shape of regions. It is generally accepted that the use of different attributes leads to the extraction of complementary (and redundant) information from the scene, which increases the accuracies when used in classification (provided that the Hughes effect is handled efficiently by keeping only those features that are most informative).

Figure 6.8 General architecture of EMAP proposed in [83]: (a) SVA and (b) FA. The illustration is taken from [83] © IEEE. Permission for use granted by IEEE 2015.

Figure 6.9 Classification maps for ROSIS-03 Pavia University data set: (a) PCA with area attribute (EAPa) with OA=90.00%, (b) PCA with FA with OA=89.21%, (c) ICA with SVA with OA=94.47%, and (d) ICA with FA with OA=91.69%. In all of these approaches, SVM has been used for the classification of the input features. The illustration is taken from [83] © IEEE. Permission for use granted by IEEE 2015.

In summary, it can be inferred that AUTOMATIC can provide classification maps comparable with MANUAL in terms of both classification accuracies and CPU processing time when only two attributes (area and standard deviation) are used instead of four (area, standard deviation, moment of inertia, and diagonal of the box). However, the whole procedure in AUTOMATIC, as the name indicates, is automatic and there is no need for any parameters to be set [87].

The automatic method obtained a very accurate classification map in comparison with the literature in terms of classification accuracies. For example, this approach improves on the classification technique proposed in [14] for Pavia University by almost 10 percentage points. The best OA for the Pavia University data set reported in [14] is achieved by DBFE 95% (Table V in [14]) and is equal to 87.97% (same size of training and test sets). The best OA for the Pavia Center data set in [14] is achieved by NWFE and is equal to 98.87% (same size of training set, but slightly larger test set). Thus, the best improvement in OA is from 87.97% to 97.00% (almost 10 percentage points, with MANUAL) for the Pavia University data set. The highest accuracy was obtained by the MANUAL approach.
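To make the AUTOMATIC scheme more concrete, the threshold rules (6.4) and (6.5) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in [87]: the function names are hypothetical, the numeric ranges in the example are placeholders and not values recommended by the cited works, and the range is assumed to be an integer number of steps wide.

```python
from typing import List


def stepped_range(lower: float, upper: float, step: float) -> List[float]:
    """Return lower, lower + step, ..., upper (assumes upper - lower is a multiple of step)."""
    n_steps = int(round((upper - lower) / step))
    return [lower + k * step for k in range(n_steps + 1)]


def std_thresholds(mean_i: float, sigma_min: float, sigma_max: float, delta_s: float) -> List[float]:
    """Eq. (6.4): scale the range {sigma_min, ..., sigma_max} by mu_i / 100."""
    return [mean_i / 100.0 * s for s in stepped_range(sigma_min, sigma_max, delta_s)]


def area_thresholds(resolution: float, a_min: float, a_max: float, delta_a: float) -> List[float]:
    """Eq. (6.5): scale the range {a_min, ..., a_max} by 1000 / upsilon."""
    return [1000.0 / resolution * a for a in stepped_range(a_min, a_max, delta_a)]


# Placeholder inputs: a feature with mean 200 and an image with 1.3 m resolution
lambda_s = std_thresholds(mean_i=200.0, sigma_min=2.5, sigma_max=10.0, delta_s=2.5)
lambda_a = area_thresholds(resolution=1.3, a_min=1, a_max=14, delta_a=1)
print(lambda_s)
print([round(a) for a in lambda_a])
```

The only inputs are the bounds and step sizes, the mean of each base feature, and the spatial resolution, which is what allows the same range specification to be reused across data sets.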
Based on the results reported in Section 6.3.1, the performance of the proposed method can be summarized as follows:

• Compared to the ICA results in [83], the proposed approach improves the overall accuracy by 2.5 percentage points (with the MANUAL version) and 1.83 percentage points (with the AUTOMATIC version) for the Pavia University data set.
• Compared to the PCA results in [83], the proposed approach improves the overall accuracy by 19.19 percentage points (with the MANUAL version) and 18.49 percentage points (with the AUTOMATIC version) for the Pavia University data set.

It should be noted that the test samples of Pavia Center used in [83] are different from the test samples of Pavia Center used in [87]. Therefore, the results reported in [83] and in [87] for Pavia Center are not fully comparable.
Based on the results reported in [87], the CPU processing time for both schemes (MANUAL and AUTOMATIC) is almost the same. For AUTOMATIC, there is no need to adjust the initial parameters for the attribute profiles; this need is considered the main shortcoming of using APs. Figure 6.11 shows the classification maps for the ROSIS-03 Pavia University data set: (a) Spectral (footnote 6) with OA=71.64%, (b) EMAP (footnote 7) with OA=90.74%, (c) Spectral+EMAP (footnote 8) with OA=90.90%, (d) DAFE (footnote 9) with OA=97.00%, and (e) NWFE with OA=94.58%. In all of these approaches, RF has been used for the classification of the input features.

1. When an adequate number of training samples is available, DBFE is shown to provide better results in terms of overall classification accuracy.
2. Based on the experiments reported in [87], when the number of training samples is adequate, the use of DAFE may lead to better classification accuracies with the frameworks developed in [87]. In this case, the use of DAFE improves on the overall accuracy of NWFE by almost 2.5 percentage points.

Figure 6.12 shows that not only the number of training samples but also their distribution over the whole data set affects the efficiency of DAFE and NWFE. As an example, the black boxes in Figure 6.12 show two parts of the input data that do not contain training samples. In this case, although the overall accuracy for DAFE (97.00%) is significantly higher than the overall accuracy for NWFE (94.58%), some objects are missing in the classification map obtained by classifying the DAFE features, because the data do not have training samples in those regions. On the other hand, for the region where there is an adequate number of training samples (the red box), DAFE leads to a comparatively smoother classification map.
Footnote 6: Here, Spectral refers to the situation when only spectral information (the input raw hyperspectral data) is classified by RF.
Footnote 7: Refers to a situation when EMAP is classified by RF.
Footnote 8: Refers to a situation when the input raw hyperspectral data and the output of EMAP are concatenated and then classified by RF.
Footnote 9: Here, the proposed approaches in [87] using DAFE, DBFE, and NWFE are called DAFE, DBFE, and NWFE, respectively.
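Since the comparisons above rest on the overall accuracy (OA), a short sketch of how OA is obtained from a confusion matrix may be helpful. The function below also returns the average accuracy (AA) and the kappa coefficient, which commonly accompany OA in the hyperspectral literature. The function name and the toy matrix are hypothetical, rows are assumed to hold the reference classes, and every class is assumed to have at least one reference sample.

```python
from typing import List, Tuple


def accuracy_measures(cm: List[List[int]]) -> Tuple[float, float, float]:
    """Return (OA, AA, kappa) for a square confusion matrix with reference classes as rows."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[r][c] for r in range(k)) for c in range(k)]

    # OA: fraction of all test samples that lie on the diagonal
    oa = sum(cm[i][i] for i in range(k)) / total
    # AA: mean of the per-class accuracies (diagonal divided by the row sum)
    aa = sum(cm[i][i] / row_sums[i] for i in range(k)) / k
    # Kappa: agreement corrected for the agreement expected by chance
    chance = sum(row_sums[i] * col_sums[i] for i in range(k)) / total ** 2
    kappa = (oa - chance) / (1.0 - chance)
    return oa, aa, kappa


# Toy two-class example (hypothetical counts)
oa, aa, kappa = accuracy_measures([[8, 2], [1, 9]])
print(f"OA={oa:.2%}  AA={aa:.2%}  kappa={kappa:.4f}")
```

OA weights every test sample equally and is therefore dominated by the large classes, whereas AA weights every class equally; the two can differ noticeably when the class sizes are unbalanced.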
Figure 6.10 (a) The flowchart of the method introduced in [87] for the MANUAL classification of hyperspectral images using AP and feature extraction techniques. (b) The general idea of the AUTOMATIC scheme of the method introduced in [87] © IEEE. Permission for use granted by IEEE 2015.
[Flowchart blocks: Input Data; PCA (99%); Supervised FE (99%); EMAP / MAP area & STD computed on PC1, PC2, ..., PCc; AP; Spec; Stacked Vector; RF; Classification Map.]
Figure 6.11 Classification maps for ROSIS-03 Pavia University data set: (a) Spectral with OA=71.64%; (b) EMAP with OA=90.74%; (c) Spectral+AP with OA=90.90%; (d) DAFE with OA=97.00%; (e) NWFE with OA=94.58%. In all of these approaches, RF has been used for the classification of the input features. The illustration is taken from [87] © IEEE. Permission for use granted by IEEE 2015.

Table I (part of the illustration taken from [87]). Pavia University: the number of training and test samples; classification accuracies of test samples in percentage; the number of features is given in brackets.

No.  Name          Training  Test    Spectral  AP      Spectral+AP  DAFE    DBFE     NWFE
                                     (103)     (99)    (103+99)     (6+8)   (29+25)  (11+8)
1    Asphalt       548       6631    80.8      96.4    95.9         93.7    96.0     93.4
2    Meadows       540       18649   56.1      92.5    94.2         97.3    95.3     95.6
3    Gravel        392       2099    53.5      68.2    67.6         96.2    75.2     57.6
4    Trees         524       3064    98.7      98.4    99.8         97.1    96.8     99.2
5    Metal Sheets  265       1345    99.1      99.5    99.6         99.6    99.7     99.5
6    Soil          532       5029    78.1      68.4    68.4         98.3    88.6     97.7
7    Bitumen       375       1330    84.3      99.9    99.9         99.9    99.9     99.5
8    Bricks        514       3682    91.0      99.5    99.4         99.5    99.5     98.8
9    Shadows       231       947     98.3      99.7    99.7         88.4    99.3     99.6
     AA            –         –       82.25     91.43   91.66        96.72   94.55    93.51
     OA            –         –       71.64     90.74   90.90        97.00   94.55    94.58
     Kappa         –         –       0.6511    0.8773  0.8794       0.9604  0.9280   0.9287

6.3.2.1 Strategy 3

The third strategy is based on using different feature extraction approaches along with EMAP for the classification of hyperspectral images. Figure 6.13 illustrates the general workflow of this strategy. In that workflow, feature extraction (Step 2) can simply be eliminated, and the output of EMAP is directly classified in order to produce the output classification map.

As shown in Figure 6.13, a feature extraction step can be considered twice in the classification framework. To do so, PCA, KPCA, and ICA are the most commonly used unsupervised feature extraction approaches that have been used with EMAP (e.g., in [83, 87, 297]). In contrast, DAFE, DBFE, and NWFE are considered the best-known supervised feature extraction approaches taken into account along with EMAP [86]. It should be noted that the unsupervised feature extraction approaches are mostly applied in order to extract informative features as the basis for producing APs (i.e., feature extraction (Step 1)). However, the supervised feature extraction approaches can be applied on either the input data or the features obtained by APs (i.e., feature extraction (Step 1) and feature extraction (Step 2)) [64].
Figure 6.12 Comparison between classification maps obtained by (a) DAFE (OA=97.00%) and (b) NWFE (OA=94.58%) by using the RF classifier (with 200 trees) based on Figure 6.10(a) [87] © IEEE. Permission for use granted by IEEE 2015.
Figure 6.13 General workflow of Strategy 3. Feature extraction (Step 2) can be simply skipped, and the output of EMAP is directly classified in order to produce the output classification map.
[Flowchart blocks: Input Hyperspectral Image; Feature Extraction (Step 1); EMAP; Feature Extraction (Step 2); Classification; Output Classification Map.]
As can be seen from Figure 6.13, the choice of the feature extraction method plays a key role in the classification results using EMAPs. In the experiments conducted in [297], different supervised and unsupervised feature extraction methods were compared by building EMAPs on the corresponding features and classifying them with RF and SVM classifiers. It was concluded that KPCA provides more consistent performance, even if supervised feature extraction (e.g., DBFE, NWFE, etc.) produces more accurate maps when a sufficient number of training samples is available [64].

In terms of selecting informative features produced by the supervised feature extraction approaches, the first features with cumulative eigenvalues above 99% are retained. For DAFE and NWFE, the applied criterion is related to the size of the eigenvalues of the scatter matrices computed for the feature extraction. In the case of DBFE, the criterion is related to the size of the eigenvalues of the decision boundary feature matrix (DBFM). For PCA and KPCA, the first PCs with a cumulative variance of more than 99% are kept, since they contain almost all the variance in the data. However, different percentages can be used for different data.

APs produce a high-dimensional feature set with considerable redundancy. This poses a great challenge for classification, especially with regard to the Hughes phenomenon [2]. Due to the highly nonlinear characteristics of the class distributions in the APs, the classification should be performed using nonlinear classifiers. The majority of the studies on the classification of attribute profiles employed SVM [24] and RF [21] classifiers (e.g., [85, 292, 297]).

Apart from using SVM and RF classifiers, a composite kernel framework for spectral-spatial classification using APs has been recently investigated. In [111], a linearly weighted composite kernel framework with SVMs has been taken into account for spectral-spatial classification based on APs.
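As a rough illustration of a linearly weighted composite kernel, the sketch below forms a weighted sum of a spectral kernel and a spatial (AP-based) kernel, with a weight mu restricted to [0, 1] so that the combination is convex. It is a simplification under stated assumptions and not the framework of [111]: the function names, the choice of RBF kernels, the toy feature vectors, and the kernel parameters are all hypothetical, and the kernels are computed directly on feature vectors.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rbf_kernel(samples: Sequence[Sequence[float]], gamma: float) -> Matrix:
    """Gram matrix of the RBF kernel exp(-gamma * ||x - y||^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in samples]
        for x in samples
    ]


def composite_kernel(k_spectral: Matrix, k_spatial: Matrix, mu: float) -> Matrix:
    """Weighted sum mu * K_spectral + (1 - mu) * K_spatial with 0 <= mu <= 1."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1] for a convex combination")
    return [
        [mu * ks + (1.0 - mu) * kp for ks, kp in zip(row_s, row_p)]
        for row_s, row_p in zip(k_spectral, k_spatial)
    ]


# Toy data (hypothetical): three pixels with 2 spectral and 3 spatial features each
spectral = [[0.1, 0.2], [0.4, 0.1], [0.9, 0.8]]
spatial = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2]]
K = composite_kernel(rbf_kernel(spectral, 1.0), rbf_kernel(spatial, 0.5), mu=0.6)
for row in K:
    print([round(v, 3) for v in row])
```

In practice the weight mu would be set subjectively or chosen by cross-validation, and the resulting Gram matrix would be passed to a kernel classifier that accepts precomputed kernels.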
A linearly weighted composite kernel is a weighted combination of different kernels computed using the available features [216]. For classification using APs, probabilistic SVMs were used to classify the spectral information in order to obtain different rule images. The kernels are computed using the obtained rule images and are combined with a weighting factor. The weighting factor can be chosen subjectively or estimated using cross-validation. However, classification using composite kernels and SVMs requires a convex combination of kernels and a time-consuming optimization process. In order to address these limitations, a generalized composite kernel framework for spectral-spatial classification using APs has been taken into account in [111]. Additionally, multinomial logistic regression (MLR) has been used in [112–114] instead of the SVM classifier, and a set of generalized composite kernels that can be linearly combined without any constraint of convexity was proposed.

Furthermore, sparse representation classification (SRC) techniques have been considered for the classification of EMAPs [299]. SRC relies on the concept that an unknown sample can be represented as a linear combination of a set of labeled ones (i.e., the training set), where the set of labeled samples is referred to as the dictionary. The representation of the samples can be considered as an optimization problem, in which the weights of each sample of the dictionary are estimated with a constraint enforcing sparsity on the weights (i.e., limiting the contribution in the representation to only a few samples). After representation, the sample is assigned to the class that shows the minimum reconstruction error when considering only the samples of the dictionary belonging to that class. The importance of sparse-based classification methods has been further confirmed in [112], where a sparse-based MLR proved to handle effectively the very high dimensionality of the AP-based features used as input to the classifier [64].

Below, a few hints regarding the use of different feature extraction techniques are provided for readers (see footnote 10). These hints are extracted from different papers concerning the classification of hyperspectral images by using AP and its generalizations. The following hints were provided in [64].
(a) When only spectral information extracted by NWFE, DAFE, and DBFE is used, the result of the classification is almost the same in terms of classification accuracies. However, when the corresponding EMAPs based on DAFE, DBFE, and NWFE are constructed, the accuracies are quite different. This considerable difference shows that the classification with EMAPs does not necessarily follow the trend of classification with spectral information only [293, 297].
(b) When the number of training samples is limited, a supervised feature extraction may cause a lower classification accuracy compared to unsupervised techniques. In [300], it was shown that the combination of KPCA and EMAPs can be a simple yet powerful strategy to perform spectral-spatial classification of data sets with limited spectral resolution (RGB and multispectral images). With reference to [297], in general, EMAP based on KPCA can be found to be more consistent, even though it sometimes produces slightly inferior accuracies in comparison with the supervised feature extraction techniques. However, it is difficult to predict which supervised feature extraction technique will produce appropriate results in terms of classification accuracies, since this is highly dependent on the data set and the number of available training samples.
(c) In [293], a spectral-spatial classification framework was proposed based on parametric supervised feature extraction techniques (DAFE and DBFE) and EMAP. Results from [293] indicate that when different parametric supervised feature extraction techniques are used for the first and second steps (e.g., DBFE is applied on the input data and DAFE is carried out on the features extracted by EMAP, or vice versa) and the first features corresponding to the top few eigenvalues of both steps are concatenated into a stacked vector, the results are acceptable and RF can classify the stacked vector with a high accuracy.

Footnote 10: This part of the book is based on [64].

Below, the main points regarding the applicability of SVM and RF are listed (see footnote 11):

(a) As mentioned before, both the SVM and RF classification methods are shown to be effective classifiers for handling high-dimensional data with a limited number of training samples. However, SVM requires a computationally demanding parameter tuning process (cross-validation) in order to tune the hyperplane parameters and consequently achieve optimal results, whereas RF does not require such a tuning process. In this sense, RF is much faster than SVM.
Therefore, for volumetric data, using RF instead of SVM is favorable.
(b) The effect of the Hughes phenomenon is more evident on SVM when the dimensionality is high. The classification accuracy of SVM can be improved when an extra feature extraction technique is carried out in Step 2 [297].
(c) RF shows more stability when limited training samples are available. Even when a sufficient number of training samples is available, the SVM classifier requires a further feature extraction of the profile (Step 2) to achieve an acceptable classification accuracy [292].
(d) Based on the results reported in [292, 293, 297], RF provides higher classification accuracies than SVM when it is directly applied on EMAP, but SVM performs better in terms of classification accuracies when a further feature extraction is performed on EMAP. This shows the capability of RF to handle a high-dimensional space as an input to the classifier. In contrast, the second feature extraction on EMAP downgrades the classification accuracies of the RF classifier. The reason for this might be that the RF classifier is based on a collection of weak classifiers and can statistically handle a large set of redundant features. In contrast, the SVM classifier seems to be more effective in designing a discriminant function when a subset of nonredundant features defines a highly nonlinear problem.

Footnote 11: This part of the book is based on [64].

The first scenario considers a situation where the number of training samples is not sufficient. For this purpose, the frequently used Indian Pines data set is investigated (detailed information regarding this data set and its corresponding training and test samples is provided in Appendix D). The second scenario considers a situation where the number of training samples is sufficient. For this purpose, the frequently used Pavia University data set is used (detailed information regarding this data set and its corresponding training and test samples is provided in Appendix D).
Figure 6.14 Classification maps for Indian Pines data with RF classifier (with 200 trees) using EMAPs of (a) PCA (OA=92.83%), (b) KPCA (OA=94.76%), (c) DAFE (OA=84.33%), (d) DBFE (OA=87.23%), and (e) NWFE (OA=95.06%), and feature reduction applied on EMAP using (f) NW-NW (OA=91.03%) and (g) KP-NW (OA=90.36%) [297] © Taylor & Francis. Permission for use granted by Taylor & Francis 2015.
Figure 6.15 Classification maps for Indian Pines data with SVM classifier (with 5-fold cross validation and RBF kernel) using EMAPs of (a) PCA (OA=86.53 %), (b) KPCA (OA=90.20 %), (c) DAFE (OA=70.69 %), (d) DBFE (OA=78.39 %) and (e) NWFE (OA=81.85 %), and feature reduction applied on EMAP using (f) NW-NW (OA=94.17 %), and (g) KP-NW (OA=93.75 %) [297] Taylor & Francis. Permission for use granted by Taylor & Francis 2015.
Scenario (1) When the number of training samples is limited: By visually comparing the classification maps shown in Figures 6.14 and 6.15, the following conclusions can be made: (a) When the number of training samples is limited, NWFE can provide good base images to produce EMAP. In this case, NWFE may outperform other feature extraction techniques such as PCA, KPCA, DAFE, and DBFE in terms of classification accuracies when it is used for building EMAP. (b) In order to produce EMAP, when the number of training samples is too small, supervised feature extraction techniques lead to salt and pepper effects and the shape of different objects may not be properly preserved. In this case, the use of unsupervised feature reduction (in particular KPCA) can extract the shape of the object in a better way. (c) As mentioned before, RF demonstrates more stability when a limited number of training samples is available. (d) The overall accuracy of Indian Pines when it is classified by RF (with 200 trees) and SVM (with fivefold cross-validation) is 65.6% and 69.70%, respectively. Based on the classification accuracies reported in Figures 6.14 and 6.15, one can easily conclude that the use of AP can significantly improve the classification results.
Figure 6.16 Classification maps for Indian Pines data with RF classifier (with 200 trees) using EMAPs of (a) KPCA (OA=92.37 %), (b) DBFE (OA=95.83 %), (c) NWFE (OA=92.19 %), and feature reduction applied on EMAP using (d) DB-DB (OA=96.81 %) [297] Taylor & Francis. Permission for use granted by Taylor & Francis 2015.
Figure 6.17 Classification maps for Indian Pines data with SVM classifier (with five-fold cross-validation and RBF kernel) using EMAPs of (a) KPCA (OA=91.52 %), (b) DBFE (OA=91.64 %), (c) NWFE (OA=89.27 %), and feature reduction applied on EMAP using (d) DB-DB (OA=97.89 %) [297] Taylor & Francis. Permission for use granted by Taylor & Francis 2015.
Scenario (2) When an adequate number of training samples is available: By visually comparing the classification maps shown in Figures 6.16, 6.17, and 6.12, the following can be concluded:
3. The overall accuracy of Pavia University when it is classified by RF (with 200 trees) and SVM (with five-fold cross-validation) is 71.57% and 81.44%, respectively. Based on the classification accuracies reported in Figures 6.16 and 6.17, one can easily conclude that the use of AP can significantly improve the classification results.
6.4 SUMMARY
In this chapter, the concept of AP for the spectral-spatial classification of hyperspectral images has been investigated. AFs were first detailed. Then, AP and its generalization to hyperspectral images were elaborated. Furthermore, the efficiency of AP for modeling spatial information in three different strategies for the spectral-spatial classification of hyperspectral images was discussed, along with some experimental results. For each strategy, a list of hints was provided for readers, which can be taken into account for further improvement of the classification approaches. AP is able to model different characteristics (e.g., scale, shape, and contrast), which provides a multilevel decomposition of an image. With respect to the many works referred to in this chapter, one can conclude that AP and its generalizations can be used for classification in a simple but effective way by considering them as a set of features that are fed into a classifier (as a complement to the original spectral data).
Chapter 7 Conclusion and Future Works
7.1 CONCLUSIONS
This book has mainly focused on spectral-spatial classification approaches for hyperspectral data. To this end, the chapters in the book discussed this crucial concept in the following way:
1. The first chapter described the importance of using hyperspectral data. In addition, the main complexities of analyzing such data were described. By reading this chapter, readers will be able to understand the reasons why conventional techniques that have been developed for multispectral images are often not applicable to hyperspectral data sets.
2. The second chapter discussed classification approaches. The concept of pixel-wise classification was introduced and several such classifiers were discussed, including (but not limited to) the maximum likelihood approach for Gaussian data, the SVM, and the RF. Additionally, the spatial ECHO classifier was introduced. Error measurements were discussed in the chapter. By reading this chapter, readers will be able to understand how the spectral classifiers work and how the output of classification approaches can be evaluated.
3. The third chapter was devoted to different feature reduction techniques for solving the curse of dimensionality and reducing the redundancy of input data. As shown in the chapter, when the number of available training samples is limited, the high dimensionality of the hyperspectral data may cause a decrease in classification accuracies. Therefore, the main objective of the chapter was to tackle the issue of high dimensionality by describing different feature selection and feature extraction approaches.
4. Chapter 4 discussed several approaches for the integration of spectral and spatial information within spectral-spatial classification frameworks. In addition, this chapter investigated different segmentation approaches for the purpose of spectral-spatial classification of hyperspectral images. As shown in the chapter, the use of segmentation can significantly improve the results of pixel-wise classification in terms of classification accuracies.
5. Chapter 5 discussed MP from its early beginnings to its later advancements. In addition, at the end of that chapter, some results for the classification of hyperspectral data by using different MP alternatives were provided. Results indicate that the MP is a powerful approach for modeling spatial information of the input data and can lead to good results in terms of classification accuracies.
6. In Chapter 6, with respect to the shortcomings of MP, the concepts of AF and AP were introduced. In addition, some classification results dealing with APs were presented. As shown in the chapter, the sequence of filtered images composing AP can be employed for classification in an effective architecture by considering it as a set of features feeding a classifier (as a complement to the original spectral data).
7.2 PERSPECTIVES
• As mentioned in [47], due to the speed and efficiency of FODPSO-based segmentation (presented in Chapter 4), this approach can be investigated in image segmentation applications for the real-time autonomous deployment and distributed localization of sensor nodes. The objective is to deploy the nodes only in the terrains of interest, which are identified by segmenting the images captured by a camera onboard an unmanned aerial vehicle using the FODPSO algorithm. Such a deployment is important for emergency applications, such as disaster monitoring and battlefield surveillance. In addition, finding a way to estimate the number of thresholds in FODPSO-based segmentation, and performing joint multichannel segmentation instead of segmenting a data set band by band, would be of interest.
• The selection of attributes and their related thresholds (presented in Chapter 6) is another area that demands further improvement. Although a few strategies for the automatic selection of attribute thresholds have been proposed, they are limited to two attributes (i.e., area and standard deviation) and might not be applicable to others. This results in the need for developing more generic selection strategies for the filter parameters.
• It is of interest to further improve and adapt all the techniques described in this book for a wide variety of applications such as land-cover mapping, urban management and modeling, and species identification in forested areas.
• In order to manage and monitor many predictable and unpredictable natural disasters (including but not limited to earthquakes, floods, weather events, landslides, and wildfires), there is a particular need for developing fast, simple, automatic, and efficient methods for disaster management. Such disasters have a major effect on the economies of different countries. As a result, it is of interest to assess the usefulness of the fast and accurate approaches demonstrated in this book for real-time applications where a rapid and accurate response is needed.
• With respect to the increased availability of data from different satellite and airborne sensors for a particular scene, it is desirable to jointly use data from multiple data sources for improved information extraction and classification. Therefore, it is important to develop spectral and spatial classification algorithms that consider data from different sources. In this context, the integration of multisensor data exploits different phenomenology from different sensors and leads to better classification and identification performance.
• As mentioned in [64], since AFs cannot be uniquely extended to multivariate images (e.g., multi- or hyperspectral images), different strategies can be taken into consideration. One of the best-known ways relies on a reduction of the dimensionality of the data, followed by the application of an AP to each component (leading to an EAP) after the reduction. This strategy has the great advantage of dealing with only a few components, resulting in acceptable classification accuracies. However, it strongly relies on the employed transformation. Several supervised and unsupervised feature extraction techniques have been exploited in the literature, showing the importance of this step, which still leaves room for improvement.
• As described in Chapter 6, the selection of attributes and their corresponding thresholds is vitally important. A few strategies for the automatic selection of attribute threshold values have been proposed in the literature. These techniques are able to provide results that are comparable to those obtained by manual tuning. However, the proposed techniques have been specifically developed for some attributes (i.e., area and standard deviation) and might not be applicable to others. As a result, the development of robust strategies is crucial for the automatic selection of threshold values.
• MPs and APs typically produce a set of high-dimensional and redundant features. Therefore, these aspects should be properly handled in order to fully exploit the informative content of the features. In addition, since these approaches provide extra features, the risk of encountering the curse of dimensionality increases. In this context, the selection of an efficient classifier plays a key role. As frequently shown in this book, SVMs and RF are efficient approaches to deal with the high dimensionality of hyperspectral images (or of the features obtained by MPs and APs). More recently, SVM with composite kernels and sparse representation classification have been investigated, leading to accurate and robust results even if a limited number of training samples is available. As a result, the further investigation of efficient classifiers for handling high-dimensional data is important.
• In order to reduce the redundancy of the attribute profiles, especially when considered in their extended architecture (i.e., the EMAP; see Chapter 6 for more information), it has been shown that the use of dimensionality reduction techniques can further improve the classification accuracies. In this line of thought, conventional feature extraction techniques (e.g., DAFE, DBFE, and NWFE) have been investigated, proving their usefulness. Alternatively, feature selection techniques (e.g., based on evolutionary algorithms such as GAs, PSO, and FODPSO) have also been proposed to address this task. Investigation of other feature reduction approaches for reducing the redundancy of the attribute profiles is another open area for researchers.
• Another topic deserving future research is the development of parallel implementations of the approaches described in this book on high-performance computing architectures, although the processing times reported in the experiments (measured on a standard desktop CPU) are quite fast for the considered data sets.
Appendix A: CEM Clustering
Inputs:
• A set of $n$ feature vectors (patterns) $X$
• An upper bound $C_{max}$ on the number of clusters
Initialization (Iteration 0): Let $C = C_{max}$. Determine the first partition $Q_c^0$, $c = 1, 2, ..., C$, of $X$:
1. Choose randomly $C$ patterns from the set $X$ to serve as cluster centers.
2. Assign the remaining patterns to the clusters on the basis of the nearest Euclidean distance to the cluster center.
For every iteration $i > 0$ ($I$ iterations in total):
Parameter estimation step: Estimate $\mathbf{m}_c^i$, $\Sigma_c^i$, and $\omega_c^i$ for $c = 1, 2, ..., C$ by component-wise empirical means, empirical covariances, and relative frequencies, respectively [301]:

$$\mathbf{m}_c^i = \frac{1}{m_c^{i-1}} \sum_{j=1}^{m_c^{i-1}} X_{j,c}^{i-1} \tag{1}$$

$$\Sigma_c^i = \frac{1}{m_c^{i-1}} \sum_{j=1}^{m_c^{i-1}} \left(X_{j,c}^{i-1} - \mathbf{m}_c^i\right)\left(X_{j,c}^{i-1} - \mathbf{m}_c^i\right)^T \tag{2}$$

$$\omega_c^i = \frac{m_c^{i-1}}{n} \tag{3}$$

Here the scalar $m_c^{i-1}$ is the number of patterns in cluster $c$ at iteration $i-1$, $X_{j,c}^{i-1}$ is the $j$th of these patterns, and the boldface $\mathbf{m}_c^i$ is the mean vector of the cluster.
Cluster assignment step:
1. Assign each pattern in $X$ to one of the clusters according to the maximum a posteriori probability criterion:

$$X_j \in Q_c^i : \; p(c|X_j) = \max_l \, p(l|X_j) \tag{4}$$

where

$$p(c|X_j) = \frac{\omega_c^i \, \phi_c(X_j; \mathbf{m}_c^i, \Sigma_c^i)}{\sum_{l=1}^{C} \omega_l^i \, \phi_l(X_j; \mathbf{m}_l^i, \Sigma_l^i)} \tag{5}$$

2. Eliminate cluster $c$ if $m_c^i$ is less than the dimensionality of patterns, $c = 1, 2, ..., C$. The patterns that belonged to the deleted clusters will be reassigned to the other clusters in the next iteration.
3. If the convergence criterion is not achieved, return to the parameter estimation step.
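The steps of Appendix A can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions that are not stated in the appendix: the component densities are taken to be multivariate Gaussians, a small ridge term is added to each covariance matrix to keep it invertible, and convergence is declared when the labels stop changing. The names `cem_cluster` and `gaussian_log_density` are hypothetical.

```python
import numpy as np


def gaussian_log_density(X, mean, cov):
    """Log of a multivariate normal density, evaluated for each row of X."""
    d = X.shape[1]
    diff = X - mean
    # Assumption of this sketch: a small ridge keeps the covariance invertible.
    cov = cov + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def cem_cluster(X, c_max, n_iter=50, seed=0):
    """Return one cluster label per row of X (at most c_max distinct labels)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Iteration 0: random cluster centers, nearest-Euclidean-distance assignment.
    centers = X[rng.choice(n, size=c_max, replace=False)]
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    for _ in range(n_iter):
        # Elimination rule: drop clusters holding fewer patterns than dimensions.
        kept = [c for c in np.unique(labels) if np.sum(labels == c) >= d]
        if not kept:
            break
        log_post = np.empty((n, len(kept)))
        for k, c in enumerate(kept):
            members = X[labels == c]
            mean = members.mean(axis=0)            # Eq. (1)
            diff = members - mean
            cov = diff.T @ diff / len(members)     # Eq. (2)
            weight = len(members) / n              # Eq. (3)
            # Numerator of Eq. (5) in log form; the denominator is the same
            # for every cluster, so it does not change the argmax of Eq. (4).
            log_post[:, k] = np.log(weight) + gaussian_log_density(X, mean, cov)
        new_labels = np.asarray(kept)[log_post.argmax(axis=1)]
        if np.array_equal(new_labels, labels):     # assumed convergence test
            break
        labels = new_labels
    return labels
```

Working with log-probabilities avoids numerical underflow of the Gaussian densities in high-dimensional feature spaces, which is the usual situation for hyperspectral data.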
Appendix B: Spectral Angle Mapper (SAM)
SAM distance is used for computing the similarity of spectra in multidimensional space, and is considered one of the simplest supervised classification approaches for hyperspectral data. SAM determines the spectral similarity between two vectors $X_i = (x_{i1}, \ldots, x_{iB})$ and $X_j = (x_{j1}, \ldots, x_{jB})$ ($X_i, X_j \in \mathbb{R}^B$) by computing the angle between them. This measurement is defined as:

$$SAM(X_i, X_j) = \arccos\left( \frac{\sum_{b=1}^{B} x_{ib} x_{jb}}{\left[\sum_{b=1}^{B} x_{ib}^2\right]^{1/2} \left[\sum_{b=1}^{B} x_{jb}^2\right]^{1/2}} \right) \tag{6}$$

The SAM method works well only if the intraclass spectral variability is low. Otherwise, the classes cannot be accurately described using only their mean vectors, and SAM classification fails.
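As an illustration of Eq. (6), the following NumPy sketch computes the spectral angle between two spectra and assigns pixels to the class whose mean spectrum gives the smallest angle. The function names `spectral_angle` and `sam_classify` are hypothetical, and the clipping of the cosine to [-1, 1] is an added safeguard against rounding errors rather than part of the definition.

```python
import numpy as np


def spectral_angle(x_i, x_j):
    """Spectral angle (in radians) between two spectra, following Eq. (6)."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    cos = np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def sam_classify(pixels, class_means):
    """Assign each pixel (row) to the class mean with the smallest angle."""
    pixels = np.asarray(pixels, dtype=float)
    class_means = np.asarray(class_means, dtype=float)
    norms = np.outer(np.linalg.norm(pixels, axis=1),
                     np.linalg.norm(class_means, axis=1))
    angles = np.arccos(np.clip(pixels @ class_means.T / norms, -1.0, 1.0))
    return angles.argmin(axis=1)
```

Because the angle depends only on the direction of the spectra, scaling a spectrum by a positive constant (for example, a change in illumination) leaves the SAM value unchanged.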
Appendix C: Prim’s Algorithm
Require: Connected graph G = (V, E, W)
Ensure: Tree T* = (V*, E*, W*)
  V* = {v}, v is an arbitrary vertex from V
  while V* ≠ V do
    Choose edge e_{i,j} ∈ E with minimal weight such that i ∈ V* and j ∉ V*
    V* = V* ∪ {j}
    E* = E* ∪ {e_{i,j}}
  end while
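The pseudocode above can be made concrete with a binary heap that always yields the lightest edge leaving the current tree. The sketch below is one possible realization, assuming that vertices are numbered 0 to n-1 and that the graph is given as a list of undirected weighted edges; the function name `prim_mst` and this input format are illustrative choices.

```python
import heapq


def prim_mst(n_vertices, edges, start=0):
    """Return the edges (i, j, w) of a minimum spanning tree.

    edges: iterable of (i, j, w) tuples describing an undirected,
    connected graph whose vertices are numbered 0 .. n_vertices - 1.
    """
    adjacency = {v: [] for v in range(n_vertices)}
    for i, j, w in edges:
        adjacency[i].append((w, i, j))
        adjacency[j].append((w, j, i))
    visited = {start}                    # V* = {v}
    heap = list(adjacency[start])        # candidate edges leaving V*
    heapq.heapify(heap)
    tree = []                            # E*
    while heap and len(visited) < n_vertices:
        w, i, j = heapq.heappop(heap)    # minimal-weight candidate edge
        if j in visited:
            continue                     # both endpoints already in V*
        visited.add(j)                   # V* = V* ∪ {j}
        tree.append((i, j, w))           # E* = E* ∪ {e_ij}
        for edge in adjacency[j]:
            if edge[2] not in visited:
                heapq.heappush(heap, edge)
    return tree
```

For example, for the four-vertex graph with edges (0, 1, 1), (1, 2, 2), (0, 2, 4), (2, 3, 3), and (1, 3, 5), the returned tree contains three edges with a total weight of 6.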
Appendix D: Data Sets Description
ROSIS-03 PAVIA DATA SETS
This data set was captured over the city of Pavia, Italy, by the ROSIS-03 (Reflective Optics System Imaging Spectrometer) airborne instrument. The flight over the city of Pavia was operated by the Deutsches Zentrum für Luft- und Raumfahrt (DLR, the German Aerospace Center) within the context of the HySens project, managed and sponsored by the European Union. The ROSIS-03 sensor has 115 data channels with a spectral coverage ranging from 0.43 to 0.86 µm. The spatial resolution is 1.3 m per pixel. Below, a detailed description of the two best-known data sets (Pavia University and Pavia Center) extracted from that project is provided.
1. Pavia University: In this data set, 12 channels have been removed due to noise. The remaining 103 spectral channels are processed. The data have been corrected atmospherically, but not geometrically. The data set covers the Engineering School at the University of Pavia and consists of different classes including trees, asphalt, bitumen, gravel, metal sheet, shadow, bricks, meadow, and soil. This data set comprises 640×340 pixels. Figure 1 presents a false-color image of ROSIS-03 Pavia University and its corresponding reference samples. Table 1 provides detailed information regarding the number of training and test samples investigated throughout this book. These samples are usually obtained by manual labeling of a small number of pixels in an image or based on some field measurements. Thus, the collection of these samples is expensive and time-consuming. As a result, the number of available training samples is usually limited, which is a challenging issue in supervised classification.
Figure 1 Pavia University hyperspectral data: (a) three-band false-color composite, (b) reference data, and (c) color code (classes: asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows).
2. Pavia Center: The second data set was captured over the center of Pavia. The original data set is 1096 × 1096 pixels. A 381-pixel-wide black stripe in the left part of the data set was removed, leading to 1096 × 715 pixels. Thirteen channels have been removed due to noise. The remaining 102 spectral channels are processed. Nine classes of interest are considered: water, tree, meadow, brick, soil, asphalt, bitumen, tile, and shadow. Figure 2 presents a false-color image of the ROSIS-03 Pavia Center data and the corresponding reference samples. Table 2 provides detailed information regarding the number of training and test samples investigated throughout this book.
Table 1 Pavia University: Number of Training and Test Samples.
No.    Class Name     Training    Test
1      Asphalt        548         6,304
2      Meadow         540         18,146
3      Gravel         392         1,815
4      Tree           524         2,912
5      Metal Sheet    265         1,113
6      Bare Soil      532         4,572
7      Bitumen        375         981
8      Brick          514         3,364
9      Shadow         231         795
       Total          3,921       40,002
Figure 2 Pavia Center: (a) three-band false-color composite, (b) reference data, and (c) color code (classes: water, trees, meadows, bricks, bare soil, asphalt, bitumen, tile, and shadows).
Table 2 Pavia Center: Number of Training and Test Samples.
No.    Class Name    Training    Test
1      Water         824         65,971
2      Tree          820         7,598
3      Meadow        824         3,090
4      Brick         808         2,685
5      Bare soil     820         6,584
6      Asphalt       816         9,248
7      Bitumen       808         7,287
8      Tile          1,260       42,826
9      Shadow        476         2,863
       Total         7,456       148,152
AVIRIS INDIAN PINES
This data set was captured by the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor over the agricultural Indian Pines test site in northwestern Indiana. The spatial dimensions of this data set are 145 × 145 pixels, and its spatial resolution is 20 m per pixel. The data set includes 220 spectral channels, but 20 water absorption bands (104-108, 150-163, 220) have been removed, and the remaining 200 bands were taken into account for the experiments. The reference data contain 16 classes of interest, which mostly represent different types of crops and are detailed in Table 3. Figure 3 illustrates a three-band false-color image and its corresponding reference data. Fifty samples for each class have been randomly selected from the reference samples as training samples, except for the classes alfalfa, grass-pasture-mowed, and oats. These classes contain only a small number of samples in the reference data. Therefore, only 15 samples for each of these classes were chosen at random as training samples, and the rest were used as test samples. The numbers of training and test samples are displayed in Table 3.
Table 3 Indian Pines: Number of Training and Test Samples.
No.    Class Name                Training    Test
1      Corn-notill               50          1,384
2      Corn-mintill              50          784
3      Corn                      50          184
4      Grass-pasture             50          447
5      Grass-trees               50          697
6      Hay-windrowed             50          439
7      Soybean-notill            50          918
8      Soybean-mintill           50          2,418
9      Soybean-clean             50          564
10     Wheat                     50          162
11     Woods                     50          1,244
12     Bldg-grass-tree-drives    50          330
13     Stone-Steel-Towers        50          45
14     Alfalfa                   15          39
15     Grass-pasture-mowed       15          11
16     Oats                      15          5
       Total                     695         9,671
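The band removal and the per-class random selection of training samples described for Indian Pines can be sketched as follows. This is a hedged illustration only: it assumes that the listed band numbers are 1-based, that the image is stored as a (rows, columns, bands) NumPy array, and that the labeled pixels are given as a one-dimensional array of class identifiers. The names `remove_bands` and `split_per_class` are hypothetical, and the random seed will not reproduce the exact split used in this book.

```python
import numpy as np

# Water absorption bands quoted in the text (assumed to be 1-based indices).
WATER_BANDS = list(range(104, 109)) + list(range(150, 164)) + [220]


def remove_bands(cube, removed_1based=WATER_BANDS):
    """Drop the listed 1-based band indices from a (rows, cols, bands) cube."""
    keep = np.ones(cube.shape[-1], dtype=bool)
    keep[np.asarray(removed_1based) - 1] = False
    return cube[..., keep]


def split_per_class(labels, n_train=50, n_train_small=15,
                    small_classes=(), seed=0):
    """Randomly pick training samples per class; the rest become test samples.

    labels: 1-D array with the class id of every labeled pixel.
    small_classes: class ids that receive only n_train_small training samples.
    Returns (train_indices, test_indices) into the labels array.
    """
    rng = np.random.default_rng(seed)
    train_mask = np.zeros(labels.shape[0], dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        wanted = n_train_small if c in small_classes else n_train
        chosen = rng.choice(idx, size=min(wanted, idx.size), replace=False)
        train_mask[chosen] = True
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)
```

With 13 classes at 50 training samples and 3 small classes at 15, such a split yields 13 × 50 + 3 × 15 = 695 training samples, which matches the total in Table 3.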
Figure 3 Indian Pines image: (a) three-band color composite, (b) reference data, and (c) color code.
Abbreviations and Acronyms
AF attribute filter
AP attribute profile
AVIRIS airborne visible/infrared imaging spectrometer
BFODPSO binary fractional order Darwinian particle swarm optimization
CEM classification expectation maximization
DAFE discriminant analysis feature extraction
DBFE decision boundary feature extraction
DMP differential morphological profile
DPSO Darwinian particle swarm optimization
EAP extended attribute profile
EEMAP entire extended multiattribute profile
EM expectation maximization
EMP extended morphological profile
EMAP extended multiattribute profile
FCM fuzzy C-means
FODPSO fractional order Darwinian particle swarm optimization
GA genetic algorithm
GLCM gray-level co-occurrence matrix
HGAPSO hybridization of genetic algorithms and particle swarm optimization
HMM hidden Markov model
HMRF hidden Markov random field
HSeg hierarchical segmentation
LiDAR light detection and ranging
MAP maximum a posteriori
MC multiple classifier
MKL multiple kernel learning
MLC maximum likelihood classifier
MRF Markov random field
MSF minimum spanning forest
MSS mean shift segmentation
MSSC multiple spectral-spatial classifier
MP morphological profile
MV majority voting
NDVI normalized difference vegetation index
NDWI normalized difference water index
NWFE nonparametric weighted feature extraction
PC principal component
PCA principal component analysis
PCFA piecewise constant function approximations
PSO particle swarm optimization
RBF radial basis function
RCMG robust color morphological gradient
RF random forest
ROSIS reflective optics system imaging spectrometer
SAR synthetic aperture radar
SDAP self-dual attribute profile
SE structuring element
SVM support vector machine
VHR very high resolution
WS watershed segmentation
Bibliography
[1] P. Ghamisi, “Spectral and spatial classification of hyperspectral data,” Ph.D. dissertation, University of Iceland, 2015.
[2] G. Hughes, “On the mean accuracy of statistical pattern recognizers,” IEEE Trans. Inf. Theory, vol. 14, pp. 55–63, 1968.
[3] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. John Wiley & Sons, Hoboken, NJ, 2003.
[4] D. W. Scott, Multivariate Density Estimation. John Wiley & Sons, New York, NY, 1992.
[5] E. J. Wegman, “Hyperdimensional data analysis using parallel coordinates,” Jour. American Stati. Assoc., vol. 85, no. 411, pp. 664–675, 1990.
[6] L. Jimenez and D. Landgrebe, “Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data,” IEEE Trans. Sys., Man, Cyber., Part C: Applications and Reviews, vol. 28, no. 1, pp. 39–54, 1998.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, Inc., San Diego, CA, 1990.
[8] P. Diaconis and D. Freedman, “Asymptotics of graphical projection pursuit,” Annals Stat., vol. 12, no. 3, pp. 793–815, 1984.
[9] E. Magli, G. Olmo, and E. Quacchio, “Optimized onboard lossless and near-lossless compression of hyperspectral data using calic,” IEEE Geos. Remote Sens. Lett., vol. 1, no. 1, pp. 21–25, 2004.
[10] J. C. Russ, The Image Processing Handbook, 3rd ed. CRC Press Inc., Boca Raton, FL, 1999.
[11] D. J. Wiersma and D. A. Landgrebe, “Analytical design of multispectral sensors,” IEEE Trans. Geos. Remote Sens., vol. GE, no. 18, pp. 180–189, 1980.
[12] P. Bajcsy and P. Groves, “Methodology for hyperspectral band selection,” PE&RS, vol. 70, no. 7, pp. 793–802, 2004.
[13] P. H. Swain, “Fundamentals of pattern recognition in remote sensing,” in Remote Sensing–The Quantitative Approach, P. H. Swain and S. Davis, Eds. McGraw-Hill, New York, NY, 1978.
[14] M. Fauvel, J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson, “Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles,” IEEE Trans. Geos. Remote Sens., vol. 46, no. 11, pp. 3804–3814, 2008.
[15] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proc. 5th Berkeley Symp. Math. Stat. Prob., pp. 281–297, 1967.
[16] G. Ball and D. Hall, “ISODATA, a novel method of data analysis and classification,” Tech. Rep. AD-699616, Stanford Univ., Stanford, CA, 1965.
[17] J. A. Benediktsson, P. H. Swain, and O. K. Ersoy, “Conjugate gradient neural networks in classification of very high dimensional remote sensing data,” Int. Jour. Remote Sens., vol. 14, no. 15, pp. 2883–2903, 1993.
[18] H. Yang, F. V. D. Meer, W. Bakker, and Z. J. Tan, “A back-propagation neural network for mineralogical mapping from AVIRIS data,” Int. Jour. Remote Sens., vol. 20, no. 1, pp. 97–110, 1999.
[19] J. A. Benediktsson, “Statistical methods and neural network approaches for classification of data from multiple sources,” Ph.D. dissertation, Purdue Univ., School of Elect. Eng., West Lafayette, IN, 1990.
[20] J. A. Richards, “Analysis of remotely sensed data: The formative decades and the future,” IEEE Trans. Geos. Remote Sens., vol. 43, no. 3, pp. 422–432, 2005.
[21] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[22] L. Breiman, “RF tools a class of two eyed algorithms,” SIAM Workshop, 2003.
[23] V. N. Vapnik, Statistical Learning Theory. John Wiley & Sons, New York, NY, 1998.
[24] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
[25] M. Fauvel, J. Chanussot, and J. A. Benediktsson, “Kernel principal component analysis for the classification of hyperspectral remote-sensing data over urban areas,” EURASIP Jour. Adv. Signal Proc., pp. 1–14, 2009.
[26] S. Tadjudin and D. A. Landgrebe, Classification of high dimensional data with limited training samples. Tech. Rep., School Elec. Comp. Eng., Purdue Univ., 1998.
[27] Y. Tarabalka, “Classification of hyperspectral data using spectral-spatial approaches,” Ph.D. dissertation, Grenoble Institute of Technology and University of Iceland, 2010.
[28] H. Derin and P. A. Kelly, “Discrete-index Markov-type random processes,” Proceedings of the IEEE, vol. 77, no. 10, pp. 1485–1510, 1989.
[29] G. Moser, S. B. Serpico, and J. A. Benediktsson, “Land-cover mapping by Markov modeling of spatial-contextual information in very-high-resolution remote sensing images,” Proceedings of the IEEE, vol. 101, no. 3, pp. 631–651, 2013.
[30] Q. Jackson and D. Landgrebe, “Adaptive Bayesian contextual classification based on Markov random fields,” IEEE Trans. Geos. Remote Sens., vol. 40, no. 11, pp. 2454–2463, 2002.
[31] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. A. Benediktsson, “SVM- and MRF-based method for accurate classification of hyperspectral images,” IEEE Geos. Remote Sens. Lett., vol. 7, no. 4, pp. 736–740, 2010.
[32] A. Farag, R. Mohamed, and A. El-Baz, “A unified framework for map estimation in remote sensing image segmentation,” IEEE Trans. Geos. Remote Sens., vol. 43, no. 7, pp. 1617–1634, 2005.
[33] F. Bovolo and L. Bruzzone, “A context-sensitive technique based on support vector machines for image classification,” Proc. PReMI, pp. 260–265, 2005.
[34] D. Liu, M. Kelly, and P. Gong, “A spatial-temporal approach to monitoring forest disease spread using multi-temporal high spatial resolution imagery,” Remote Sens. Env., vol. 101, no. 10, pp. 167–180, 2006.
[35] G. Moser and S. B. Serpico, “Combining support vector machines and Markov random fields in an integrated framework for contextual image classification,” IEEE Trans. Geos. Remote Sens., vol. 51, no. 5, pp. 2734–2752, 2013.
[36] M. Khodadadzadeh, R. Rajabi, and H. Ghassemian, “Combination of region-based and pixel-based hyperspectral image classification using erosion technique and MRF model,” ICEE, pp. 294–299, 2010.
[37] G. Zhang and X. Jia, “Simplified conditional random fields with class boundary constraint for spectral-spatial based remote sensing image classification,” IEEE Geos. Remote Sens. Lett., vol. 9, no. 5, pp. 856–860, 2012.
[38] B. Tso and R. C. Olsen, “Combining spectral and spatial information into hidden Markov models for unsupervised image classification,” Int. Jour. Remote Sens., vol. 26, no. 10, pp. 2113–2133, 2005.
[39] P. Ghamisi, J. A. Benediktsson, and M. O. Ulfarsson, “Spectral-spatial classification of hyperspectral images based on hidden Markov random fields,” IEEE Trans. Geos. Remote Sens., vol. 52, no. 5, pp. 2565–2574, 2014.
[40] G. Hazel, “Multivariate Gaussian MRF for multispectral scene segmentation and anomaly detection,” IEEE Trans. Geos. Remote Sens., vol. 38, no. 3, pp. 1199–1211.
[41] F. Tsai, C. K. Chang, and G. R. Liu, “Texture analysis for three dimension remote sensing data by 3D GLCM,” Proc. 27th Asian Conf. Remote Sens., pp. 1–6, 2006.
[42] X. Huang and L. Zhang, “A comparative study of spatial approaches for urban mapping using hyperspectral ROSIS images over Pavia city, northern Italy,” Int. Jour. Remote Sens., vol. 30, no. 12, pp. 3205–3221, 2009.
[43] P. Ghamisi, M. S. Couceiro, J. A. Benediktsson, and N. M. F. Ferreira, “An efficient method for segmentation of images based on fractional calculus and natural selection,” Expert Syst. Appl., vol. 39, no. 16, pp. 12407–12417, 2012.
[44] M. Sezgin and B. Sankur, “Survey over image thresholding techniques and quantitative performance evaluation,” Jour. Electron. Imag., vol. 13, no. 1, pp. 146–168, 2004.
[45] P. Ghamisi, M. S. Couceiro, N. M. F. Ferreira, and L. Kumar, “Use of Darwinian particle swarm optimization technique for the segmentation of remote sensing images,” Proc. IGARSS’12, pp. 4295–4298, 2012.
[46] Y. Tarabalka, J. A. Benediktsson, and J. Chanussot, “Spectral-spatial classification of hyperspectral imagery based on partitional clustering techniques,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 5, pp. 2973–2987, 2009.
[47] P. Ghamisi, M. S. Couceiro, F. M. Martins, and J. A. Benediktsson, “Multilevel image segmentation approach for remote sensing images based on fractional-order Darwinian particle swarm optimization,” IEEE Trans. Geos. Remote Sens., vol. 52, no. 5, pp. 2382–2394, 2014.
[48] P. Ghamisi, M. Couceiro, M. Fauvel, and J. A. Benediktsson, “Integration of segmentation techniques for classification of hyperspectral images,” IEEE Geos. Remote Sens. Lett., vol. 11, no. 1, pp. 342–346, 2014.
[49] Y. Tarabalka, J. Chanussot, and J. A. Benediktsson, “Segmentation and classification of hyperspectral images using watershed transformation,” Pat. Recog., vol. 43, no. 7, pp. 2367–2379, 2010.
[50] A. Darwish, K. Leukert, and W. Reinhardt, “Image segmentation for the purpose of object-based classification,” Proc. IGARSS’03, vol. 3, pp. 2039–2041, 2003.
[51] J. Tilton, “Analysis of hierarchically related image segmentations,” Proc. IEEE Work. Adv. Tech. Analysis Remotely Sensed Data, pp. 60–69, 2003.
[52] J. N. Kapur, P. K. Sahoo, and A. K. C. Wong, “A new method for gray-level picture thresholding using the entropy of the histogram,” Comp. Vis. Graph. Image Process., vol. 2, pp. 273–285, 1985.
[53] T. Pun, “A new method for grey-level picture thresholding using the entropy of the histogram,” Comp. Vis. Graph. Image Process., vol. 2, pp. 223–237, 1980.
[54] N. Otsu, “A threshold selection method from gray-level histogram,” IEEE Trans. Syst. Man Cyber., vol. 9, pp. 62–66, 1979.
[55] R. V. Kulkarni and G. K. Venayagamoorthy, “Bio-inspired algorithms for autonomous deployment and localization of sensor,” IEEE Trans. Sys. Man and Cyb., Part C, vol. 40, no. 6, pp. 663–675, 2010.
[56] Y. Kao, E. Zahara, and I. Kao, “A hybridized approach to data clustering,” Expert Sys. App., vol. 34, pp. 1754–1762, 2008.
[57] D. Floreano and C. Mattiussi, “Bio-inspired artificial intelligence: Theories and methods and technologies,” MIT Press, Cambridge, MA, 2008.
[58] J. Kennedy and R. Eberhart, “A new optimizer using particle swarm theory,” Proc. IEEE Sixth Int. Symp. Micro Machine and Human Sci., vol. 34, no. 2008, pp. 39–43, 1995.
[59] J. Tillett, T. M. Rao, F. Sahin, R. Rao, and S. Brockport, “Darwinian particle swarm optimization,” Proc. 2nd Indian Int. Conf. Art. Intel., pp. 1474–1487, 2005. [60] P. Ghamisi, M. S. Couceiro, and J. A. Benediktsson, “Extending the fractional order Darwinian particle swarm optimization to segmentation of hyperspectral images,” Proc. SPIE 8537, Image and Signal Processing for Remote Sensing XVIII, 85370F, 2012. [61] Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton, “Multiple spectral-spatial classification approach for hyperspectral data,” IEEE Trans. Geos. Remote Sens., vol. 48, no. 11, pp. 4122–4132, 2010. [62] J. C. Tilton, Y. Tarabalka, P. M. Montesano, and E. Gofman, “Best merge region growing segmentation with integrated non–adjacent region object aggregation,” IEEE Trans. Geos. Remote Sens., vol. 50, no. 11, pp. 4454 – 4467, 2012. [63] P. Soille, Morphological Image Analysis, Principles and Applications, 2nd ed. Springer Verlag, Berlin, Germany, 2003. [64] P. Ghamisi, M. Dalla Mura, and J. A. Benediktsson, “A survey on spectral–spatial classification techniques based on attribute profiles,” IEEE Trans. Geos. Remote Sens., vol. 53, no. 5, pp. 2335–2353, 2015. [65] M. Pesaresi and J. A. Benediktsson, “A new approach for the morphological segmentation of high-resolution satellite imagery,” IEEE Trans. Geos. Remote Sens., vol. 39, no. 2, pp. 309–320, 2001. [66] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone, “Morphological attribute profiles for the analysis of very high resolution images,” IEEE Trans. Geos. Remote Sens., vol. 48, no. 10, pp. 3747 – 3762, 2010. [67] M. Chini, N. Pierdicca, and W. Emery, “Exploiting SAR and VHR optical images to quantify damage caused by the 2003 Bam earthquake,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 1, pp. 145–152, 2009. [68] D. Tuia, F. Pacifici, M. Kanevski, and W. 
Emery, “Classification of very high spatial resolution imagery using mathematical morphology and support vector machines,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 11, pp. 3866–3879, 2009.
[69] H. Akcay and S. Aksoy, “Automatic detection of geospatial objects using multiple hierarchical segmentations,” IEEE Trans. Geos. Remote Sens., vol. 46, no. 7, pp. 2097–2111, 2008. [70] J. Chanussot, J. A. Benediktsson, and M. Fauvel, “Classification of remote sensing images from urban areas using a fuzzy possibilistic model,” IEEE Geosci. Remote Sens. Lett., vol. 3, no. 1, pp. 40–44, 2006. [71] J. A. Benediktsson, M. Pesaresi, and K. Arnason, “Classification and feature extraction for remote sensing images from urban areas based on morphological transformations,” IEEE Trans. Geos. Remote Sens., vol. 11, no. 1, pp. 288–292, 2003. [72] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson, “Classification of hyperspectral data from urban areas based on extended morphological profiles,” IEEE Trans. Geos. Remote Sens., vol. 43, no. 3, pp. 480–491, 2005. [73] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton, “Advances in spectral-spatial classification of hyperspectral images,” Proceedings of the IEEE, vol. 101, no. 3, pp. 652–675, 2013. [74] R. Bellens, S. Gautama, L. M. Fonte, W. Philips, J. C. W. Chan, and F. Canters, “Improved classification of VHR images of urban areas using directional morphological profiles,” IEEE Trans. Geos. Remote Sens., vol. 46, no. 10, pp. 2803–2813, 2008. [75] P. Soille and M. Pesaresi, “Advances in mathematical morphology applied to geoscience and remote sensing,” IEEE Trans. Geosc. Remote Sens., vol. 40, no. 9, pp. 2042–2055, 2002. [76] Z. Lyu, P. L. Zhang, J. A. Benediktsson, and W. Shi, “Morphological profiles based on differently shaped structuring elements for classification of images with very high spatial resolution,” IEEE Jour. Selec. Top. Appl. Earth Obser. Remote Sens., vol. 7, no. 12, pp. 4644–4652, 2014. [77] X. Huang, X. Guan, J. A. Benediktsson, L. Zhang, J. Li, A. Plaza, and M. 
Dalla Mura, “Multiple morphological profiles from multicomponent base images for hyperspectral image classification,” IEEE Jour. Selec.
Top. Appl. Earth Obser. Remote Sens., vol. 7, no. 12, pp. 4653–4669, 2014. [78] M. Dalla Mura, “Advanced techniques based on mathematical morphology for the analysis of remote sensing images,” Ph.D. dissertation, University of Trento and University of Iceland, 2011. [79] E. J. Breen and R. Jones, “Attribute openings, thinnings and granulometries,” IEEE Trans. Geos. Remote Sens., vol. 40, no. 11, pp. 2486–2494, 2013. [80] N. Bouaynaya and D. Schonfeld, “Theoretical foundations of spatially-variant mathematical morphology part II: Gray-level images,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 30, no. 5, pp. 837–850, 2008. [81] M. Dalla Mura, J. A. Benediktsson, and L. Bruzzone, “Modeling structural information for building extraction with morphological attribute filters,” Proc. SPIE 7477, Image and Signal Processing for Remote Sensing XV, 2009. [82] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone, “Extended profiles with morphological attribute filters for the analysis of hyperspectral data,” Int. Jour. Remote Sens., vol. 31, no. 22, pp. 5975–5991, 2010. [83] M. Dalla Mura, A. Villa, J. A. Benediktsson, J. Chanussot, and L. Bruzzone, “Classification of hyperspectral images by using extended morphological attribute profiles and independent component analysis,” IEEE Geosc. Remote Sens. Lett., vol. 8, no. 3, pp. 542–546, 2011. [84] P. Salembier, A. Oliveras, and L. Garrido, “Anti-extensive connected operators for image and sequence processing,” IEEE Trans. Image Proc., vol. 7, no. 4, pp. 555–570, 1998. [85] M. Pedergnana, P. R. Marpu, M. D. Mura, J. A. Benediktsson, and L. Bruzzone, “A novel technique for optimal feature selection in attribute profiles based on genetic algorithms,” IEEE Trans. Geos. Remote Sens., vol. 51, no. 6, pp. 3514–3528, 2013. [86] P. Ghamisi, J. A. Benediktsson, G. Cavallaro, and A.
Plaza, “Automatic framework for spectral-spatial classification based on supervised feature extraction and morphological attribute profiles,” IEEE Jour.
Sel. Top. App. Earth Obser. Remote Sens., vol. 7, no. 6, pp. 2147– 2160, 2014. [87] P. Ghamisi, J. A. Benediktsson, and J. R. Sveinsson, “Automatic spectral-spatial classification framework based on attribute profiles and supervised feature extraction,” IEEE Trans. Geos. Remote Sens., vol. 52, no. 5, pp. 5771–5782, 2014. [88] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification,” IEEE Sig. Process. Mag., vol. 31, no. 1, pp. 45–54, 2014. [89] X. Kang, S. Li, and J. A. Benediktsson, “Spectral-spatial hyperspectral image classification with edge-preserving filtering,” IEEE Trans. Geos. Remote Sens., vol. 52, no. 5, pp. 2666 – 2677, 2014. [90] J. Li, X. Huang, P. Gamba, J. Bioucas-Dias, L. Zhang, J. A. Benediktsson, and A. Plaza, “Multiple feature learning for hyperspectral image classification,” IEEE Trans. Geos. Remote Sens., vol. 53, no. 3, pp. 1592 – 1606, 2015. [91] X. Kang, S. Li, L. Fang, M. Li, and J. A. Benediktsson, “Extended random walkers based classification of hyperspectral images,” IEEE Trans. Geos. Remote Sens., vol. 53, no. 3, pp. 144–153, 1 2015. [92] J. Li, J. M. Bioucas-Dias, and A. Plaza, “Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields,” IEEE Trans. Geos. Remote Sens., vol. 50, no. 3, pp. 809 – 823, 2002. [93] B. Waske and J. A. Benediktsson, Pattern Recognition and Classification, Encyclopedia of Remote Sensing, E. G. Njoku, Ed. Springer Verlag, Berlin, 2014. [94] X. Jia and J. A. Richards, “Cluster-space representation for hyperspectral data classification,” IEEE Trans. Geos. Remote Sens., vol. 40, no. 3, pp. 593–598, 2002. [95] B. Waske and J. A. Benediktsson, “Fusion of support vector machines for classification of multisensor data,” IEEE Trans. Geos. Remote Sens., vol. 45, no. 12, pp. 3858–3866, 2007.
[96] A. K. Jain, R. P. Duin, and J. Mao, “Statistical pattern recognition: A review,” IEEE Trans. Pat. Analysis Mach. Intel., vol. 22, no. 1, pp. 4–37, 2000. [97] B. M. Shahshahani and D. A. Landgrebe, “The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon,” IEEE Trans. Geos. Remote Sens., vol. 32, no. 5, pp. 4–37, 1995. [98] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, Hoboken, NJ, 2001. [99] J. Richards and X. Jia, Remote Sensing Digital Image Analysis, 4th Ed. Springer Verlag, Berlin, 2006. [100] B. Waske, J. A. Benediktsson, K. Arnason, and J. R. Sveinsson, “Mapping of hyperspectral AVIRIS data using machine-learning algorithms,” Canadian Jour. Remote Sens., vol. 35, pp. 106–116, 2009. [101] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geos. Remote Sens., vol. 42, no. 8, pp. 1778–1790, 2004. [102] M. Pal and P. Mather, “Some issues in the classification of dais hyperspectral data,” Inter. Jour. Remote Sens., vol. 27, no. 14, pp. 2895–2916, 2006. [103] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998. [104] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Mach. Learn., vol. 46, pp. 131–159, 2002. [105] S. S. Keerthi and C. Lin, “Asymptotic behaviors of support vector machines with Gaussian kernel,” Neur. Comp., vol. 15, no. 7, pp. 1667– 1689, 1998. [106] G. Foody and A. Mathur, “A relative evaluation of multiclass image classification by support vector machines,” IEEE Trans. Geos. Remote Sens., vol. 42, no. 6, pp. 1335–1343, 2002.
[107] D. J. Sebald and J. A. Bucklew, “Support vector machines and the multiple hypothesis test problem,” IEEE Trans. Sig. Proc., vol. 49, no. 11, pp. 2865–2872, 2001. [108] C. W. Hsu and C. J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Trans. Neur. Net., vol. 13, no. 2, pp. 415–425, 2002. [109] A. Mathur and G. M. Foody, “Crop classification by a support vector machine with intelligently selected training data for an operational application,” Inter. Jour. Remote Sens., vol. 29, no. 8, pp. 2227–2240, 2008. [110] T. F. Wu, C. J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” The Jour. Mach. Lear. Res., vol. 5, pp. 975–1005, 2004. [111] J. Li, P. Marpu, A. Plaza, J. Bioucas-Dias, and J. A. Benediktsson, “Generalized composite kernel framework for hyperspectral image classification,” IEEE Trans. Geos. Remote Sens., vol. 51, no. 9, pp. 4816– 4829, 2013. [112] J. Li, J. Bioucas-Dias, and A. Plaza, “Semi-supervised hyperspectral image segmentation using multinomial logistic regression with active learning,” IEEE Trans. Geos. Remote Sens., vol. 48, no. 11, pp. 4085– 4098, 2010. [113] ——, “Hyperspectral image segmentation using a new Bayesian approach with active learning,” IEEE Trans. Geos. Remote Sens., vol. 49, no. 10, pp. 3947–3960, 2011. [114] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink, “Sparse multinomial logistic regression: Fast algorithms and generalization bounds,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 27, no. 6, pp. 957–968, 2005. [115] J. A. Benediktsson and I. Kanellopoulos, Information Extraction Based on Multisensor Data Fusion and Neural Networks, Information Processing for Remote Sensing, C. H. Chen, Ed. World Scientific Press, Singapore, 1999. [116] T. Kohonen, Self-Organizing Maps.
Springer Verlag, Berlin, 2001.
[117] J. D. Paola and R. A. Schowengerdt, “A review and analysis of backpropagation neural networks for classification of remotely sensed multi-spectral imagery,” Inter. Jour. Remote Sens., vol. 16, no. 16, pp. 3033–3058, 1995. [118] M. A. Friedl and C. E. Brodley, “Decision tree classification of land cover from remotely sensed data,” Remote Sens. Env., vol. 61, no. 3, pp. 399–409, 1997. [119] M. Pal and P. Mather, “An assessment of the effectiveness of decision tree methods for land cover classification,” Remote Sens. Env., vol. 86, no. 4, pp. 554–565, 2003. [120] G. J. Briem, J. A. Benediktsson, and J. R. Sveinsson, “Multiple classifiers applied to multisource remote sensing data,” IEEE Trans. Geos. Remote Sens., vol. 40, no. 10, pp. 2291–2299, 2003. [121] P. O. Gislason, J. A. Benediktsson, and J. R. Sveinsson, “Random forests for land cover classification,” Pat. Recog. Lett., vol. 27, no. 4, pp. 294–300, 2006. [122] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE Trans. Syst., Man, Cyber., vol. 21, no. 3, pp. 294–300, 1991. [123] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman and Hall/CRC, London, England, 1984. [124] N. Nilsson, Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York, NY, 1965. [125] L. Breiman, “Arcing classifier,” Ann. Statist., vol. 26, no. 3, pp. 801–849, 1998. [126] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Jour. Comp. Sys. Sci., vol. 55, no. 1, pp. 119–139, 1997. [127] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 1, pp. 123–140, 1994.
[128] Z. Zhi-Hua, Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, New York, NY, 2012.
[129] S. R. Joelsson, J. A. Benediktsson, and J. R. Sveinsson, “Random forest classification of remote sensing data,” in Signal and Image Processing for Remote Sensing, C. H. Chen, Ed. CRC Press, Boca Raton, FL, pp. 327–344, 2007. [130] B. Waske, J. A. Benediktsson, and J. R. Sveinsson, “Random forest classification of remote sensing data,” in Signal and Image Processing for Remote Sensing, C. H. Chen, Ed. CRC Press, New York, NY, pp. 363–374, 2012. [131] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer Verlag, Berlin, 2001. [132] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, “Investigation of the random forest framework for classification of hyperspectral data,” IEEE Trans. Geos. Remote Sens., vol. 43, no. 3, pp. 492–501, 2005. [133] D. R. Cutler, T. C. Edwards, K. H. Beard, A. Cutler, and K. T. Hess, “Random forests for classification in ecology,” Ecology, vol. 88, no. 11, pp. 2783–2792, 2007. [134] J. A. Benediktsson, “Statistical and neural network pattern recognition methods for remote sensing applications,” in Handbook of Pattern Recognition and Computer Vision, C. H. Chen, L. F. Pau, and P. S. P. Wang, Eds. World Scientific Press, Singapore, 1999, pp. 507–534. [135] R. L. Kettig and D. A. Landgrebe, “Classification of multispectral image data by extraction and classification of homogeneous objects,” IEEE Trans. Geos. Elect., vol. GE14, pp. 19–26, 1976. [136] X. Jia, B.-C. Kuo, and M. M. Crawford, “Feature mining for hyperspectral image classification,” Proceedings of the IEEE, vol. 101, no. 3, pp. 676–697, 2013. [137] I. T. Jolliffe, Principal Component Analysis, ser. Springer Series in Statistics. Springer Verlag, Berlin, 2002.
[138] A. A. Green, M. Berman, P. Switzer, and M. D. Craig, “A transformation for ordering multispectral data in terms of image quality with implications for noise removal,” IEEE Trans. Geos. Remote Sens., vol. 26, no. 1, pp. 65–74, 1988. [139] J. B. Lee, A. S. Woodyatt, and M. Berman, “Enhancement of high spectral resolution remote sensing data by a noise-adjusted principal components transform,” IEEE Trans. Geos. Remote Sens., vol. 28, no. 3, pp. 295–304, 1990. [140] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, Cambridge, 2004. [141] G. Licciardi, P. R. Marpu, J. Chanussot, and J. A. Benediktsson, “Linear versus nonlinear PCA for the classification of hyperspectral data based on the extended morphological profiles,” IEEE Geos. Remote Sens. Lett., vol. 9, no. 3, pp. 447–451, 2011. [142] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, New York, NY, 2001. [143] A. Villa, J. A. Benediktsson, J. Chanussot, and C. Jutten, “Hyperspectral image classification with independent component discriminant analysis,” IEEE Trans. Geos. Remote Sens., vol. 49, no. 12, pp. 4865–4876, 2011. [144] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Jour. Mach. Learn. Res., vol. 3, pp. 1–48, 2003. [145] S. Marchesi and L. Bruzzone, “ICA and kernel ICA for change detection in multispectral remote sensing images,” in Proc. Int. Geos. Remote Sens. Symp., vol. 40, no. 10, pp. 980–983, 2009. [146] L. M. Bruce, C. H. Koger, and L. Jiang, “Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction,” IEEE Trans. Geos. Remote Sens., vol. 40, no. 10, pp. 2331–2338, 2002. [147] L. O. Jimenez-Rodriguez, E. Arzuaga-Cruz, and M. Velez-Reyes, “Unsupervised linear feature-extraction methods and their effects in the classification of high-dimensional data,” IEEE Trans. Geos. Remote Sens., vol. 45, no. 2, pp. 469–483, 2007.
[148] L. Zhang, D. Tao, and X. Huang, “On combining multiple features for hyperspectral remote sensing image classification,” IEEE Trans. Geos. Remote Sens., vol. 50, no. 3, pp. 879–893, 2012. [149] C. M. Bachmann, T. L. Ainsworth, and R. A. Fusina, “Exploiting manifold geometry in hyperspectral imagery,” IEEE Trans. Geos. Remote Sens., vol. 44, no. 10, pp. 2786–2803, 2006. [150] ——, “Improved manifold coordinate representations of large-scale hyperspectral scenes,” IEEE Trans. Geos. Remote Sens., vol. 44, no. 10, pp. 2786–2803, 2009. [151] C. M. Bachmann, T. L. Ainsworth, R. A. Fusina, M. J. Montes, J. H. Bowles, D. R. Korwan, and D. B. Gillis, “Bathymetric retrieval from hyperspectral imagery using manifold coordinate representations,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 3, pp. 884–897, 2009. [152] Y. Chen, M. M. Crawford, and J. Ghosh, “Improved nonlinear manifold learning for land cover classification via intelligent landmark selection,” Proc. IGARSS’06, pp. 545–548, 2006. [153] M. Crawford and W. Kim, “Manifold learning for multi-classifier systems via ensembles,” in MCS 2009. Springer Verlag, Berlin, pp. 519–528, 2009. [154] X. Chen, T. Fang, H. Huo, and D. Li, “Graph-based feature selection for object-oriented classification in vhr airborne imagery,” IEEE Trans. Geos. Remote Sens., vol. 49, no. 1, pp. 353–365, 2011. [155] J. Yin, C. Gao, , and X. Jia, “Using Hurst and Lyapunov exponents for hyperspectral image feature extraction,” IEEE Geos. Remote Sens. Lett., vol. 9, no. 4, pp. 705–709, 2012. [156] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by local linear embedding,” Science, vol. 290, pp. 2323–2326, 2000. [157] A. Mohan, G. Sapiro, and E. Bosch, “Spatially coherent nonlinear dimensionality reduction and segmentation of hyperspectral images,” IEEE Geos. Remote Sens. Lett., vol. 4, no. 2, pp. 206–210, 2007. [158] Y.-L. Chang, C.-C. Han, K.-C. Fan, K. S. Chen, C.-T. Chen, and J.H. 
Chang, “Greedy modular eigenspaces and positive boolean function
for supervised hyperspectral image classification,” Opt. Eng., vol. 2, pp. 2576–2587, 2003. [159] C. Cariou, K. Chehdi, and S. L. Moan, “Bandclust: An unsupervised band reduction method for hyperspectral remote sensing,” IEEE Geos. Remote Sens. Lett., vol. 8, no. 3, pp. 565–569, 2011. [160] T. V. Bandos, L. Bruzzone, and G. Camps-Valls, “Classification of hyperspectral images with regularized linear discriminant analysis,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 3, pp. 862–873, 2009. [161] G. Baudat and F. Anouar, “Generalized discriminant analysis using a kernel approach,” Neural Comput., vol. 12, no. 7, pp. 2385–2404, 2000. [162] S. Kumar, J. Ghosh, and M. M. Crawford, “Best-bases feature extraction algorithms for classification of hyperspectral data,” IEEE Trans. Geos. Remote Sens., vol. 39, no. 7, pp. 1368–1379, 2001. [163] S. D. Backer, P. Kempeneers, W. Debruyn, and P. Scheunders, “A band selection technique for spectral classification,” IEEE Geos. Remote Sens. Lett., vol. 2, no. 3, pp. 319–323, 2005. [164] S. B. Serpico and G. Moser, “Extraction of spectral channels from hyperspectral images for classification purposes,” IEEE Trans. Geos. Remote Sens., vol. 45, no. 2, pp. 484–495, 2007. [165] B. C. Kuo and D. A. Landgrebe, “A robust classification procedure based on mixture classifiers and nonparametric weighted feature extraction,” Comp. Vis. Image Unders., vol. 64, no. 3, pp. 377–389, 1996. [166] B.-C. Kuo, C.-H. Li, and J.-M. Yang, “Kernel nonparametric weighted feature extraction for hyperspectral image classification,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 4, pp. 1139–1155, 2009. [167] J.-M. Yang, P.-T. Yu, and B.-C. Kuo, “A nonparametric feature extraction and its application to nearest neighbor classification for hyperspectral image data,” IEEE Trans. Geos. Remote Sens., vol. 48, no. 3, pp. 1279–1293, 2010. [168] H.-Y. Huang and B.-C. Kuo, “Double nearest proportion feature extraction for hyperspectral-image classification,” IEEE Trans. 
Geos. Remote Sens., vol. 48, no. 11, pp. 4034–4046, 2010.
[169] L. Ma, M. M. Crawford, and J. W. Tian, “Generalized supervised local tangent space alignment for hyperspectral image classification,” Electron. Lett, vol. 46, pp. 497–498, 2010. [170] N. Falco, J. A. Benediktsson, and L. Bruzzone, “A study on the effectiveness of different independent component analysis algorithms for hyperspectral image classification,” IEEE Jour. Selec. Top. App. Earth Observ. Remote Sens., vol. 7, pp. 2183–2199, 2014. [171] L. Jimenez and D. A. Landgrebe, “Hyperspectral data analysis and supervised feature reduction via projection pursuit,” IEEE Trans. Geos. Remote Sens., vol. 37, no. 6, pp. 2653–2667, 1999. [172] C. Lee and D. A. Landgrebe, “Feature extraction based on decision boundaries,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 15, no. 4, pp. 388–400, 1993. [173] J. A. Benediktsson and I. Kanellopoulos, “Classification of multisource and hyperspectral data based on decision fusion,” IEEE Trans. Geos. Remote Sens., vol. 37, no. 3, pp. 1367–1377, 1999. [174] L. Wang, X. Jia, and Y. Zhang, “A novel geometry-based feature selection technique for hyperspectral imagery,” IEEE Geos. Remote Sens. Lett., vol. 4, no. 1, pp. 171–175, 2007. [175] P. Mitra, C. A. Murthy, and S. K. Pal, “Unsupervised feature selection using feature similarity,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 24, no. 3, pp. 301–312, 2002. [176] Q. Du and H. Yang, “Similarity-based unsupervised band selection for hyperspectral image analysis,” IEEE Trans. Sys. Man Cyber. Part C, Appl. Rev., vol. 5, no. 4, pp. 564–568, 2008. [177] J. M. Sotoca, F. Pla, and J. S. Sanchez, “Band selection in multispectral images by minimization of dependent information,” IEEE Trans. Sys. Man Cyber. Part C, Appl. Rev., vol. 37, no. 2, pp. 258–267, 2007. [178] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Trans. Commun. Technol., vol. CT-15, no. 1, pp. 52–60, 1967.
[179] L. Bruzzone, F. Roli, and S. B. Serpico, “An extension of the Jeffreys-Matusita distance to multiclass cases for feature selection,” IEEE Trans. Geos. Remote Sens., vol. 33, no. 6, pp. 1318–1321, 1995. [180] L. Bruzzone and C. Persello, “A novel approach to the selection of spatially invariant features for the classification of hyperspectral images with improved generalization capability,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 9, pp. 3180–3191, 2009. [181] A. Ifarraguerri and M. W. Prairie, “Visual method for spectral band selection,” IEEE Geos. Remote Sens. Lett., vol. 1, no. 2, pp. 101–106, 2004. [182] R. Huang and M. He, “Band selection based on feature weighting for classification of hyperspectral data,” IEEE Geos. Remote Sens. Lett., vol. 2, no. 2, pp. 156–159, 2005. [183] Y. He, Q. Du, H. Su, and Y. Sheng, “An efficient method for supervised hyperspectral band selection,” IEEE Geos. Remote Sens. Lett., vol. 8, no. 1, pp. 138–142, 2011. [184] G. Camps-Valls, J. Mooij, and B. Scholkopf, “Remote sensing feature selection by kernel dependence measures,” IEEE Geos. Remote Sens. Lett., vol. 7, no. 3, pp. 587–597, 2010. [185] R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Trans. Neural Net., vol. 5, no. 4, pp. 537–550, 1994. [186] P. A. Estevez, T. Michel, C. A. Perez, and J. M. Zurada, “Normalized mutual information feature selection,” IEEE Trans. Neural Net., vol. 20, no. 2, pp. 189–201, 2009. [187] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: Criteria of max-dependency, max-relevance and min-redundancy,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005. [188] M. A. Hossain, M. R. Pickering, and X. Jia, “Unsupervised feature extraction based on a mutual information measure for hyperspectral image classification,” Proc. IGARSS’11, vol. 42, no. 7, pp. 1720–1723, 2011.
[189] N. Keshava, “Distance metrics and band selection in hyperspectral processing with applications to material identification and spectral libraries,” IEEE Trans. Geos. Remote Sens., vol. 42, no. 7, pp. 1552–1565, 2004. [190] B. Paskaleva, M. M. Hayat, Z. P. Wang, J. S. Tyo, and S. Krishna, “Canonical correlation feature selection for sensors with overlapping bands: Theory and application,” IEEE Trans. Geos. Remote Sens., vol. 46, no. 10, pp. 3346–3358, 2008. [191] L. Zhang, Y. Zhong, B. Huang, J. Gong, and P. Li, “Dimensionality reduction based on clonal selection for hyperspectral imagery,” IEEE Trans. Geos. Remote Sens., vol. 45, no. 12, pp. 4172–4186, 2007. [192] Y. Bazi and F. Melgani, “Toward an optimal SVM classification system for hyperspectral remote sensing images,” IEEE Trans. Geos. Remote Sens., vol. 44, no. 11, pp. 3374–3385, 2006. [193] A. Daamouche, F. Melgani, N. Alajlan, and N. Conci, “Swarm optimization of structuring elements for VHR image classification,” IEEE Geos. Remote Sens. Lett., vol. 10, no. 6, pp. 1334–1338, 2013. [194] A. Paoli, F. Melgani, and E. Pasolli, “Clustering of hyperspectral images based on multiobjective particle swarm optimization,” IEEE Trans. Geos. Remote Sens., vol. 47, no. 12, pp. 4175–4188, 2009. [195] D. Beasley, D. Bull, and R. Martin, “An overview of genetic algorithms,” University Computing, vol. 15, no. 2, pp. 58–69, 1993. [196] B. L. Miller and D. E. Goldberg, “Genetic algorithms, tournament selection, and the effects of noise,” Comp. Sys., vol. 9, no. 1995, pp. 193–212, 1995. [197] J. Holland, Adaptation in Natural and Artificial Systems, 2nd ed. MIT Press, Cambridge, MA, 1992. [198] P. K. Chawdhry, R. Roy, and R. K. Pant, Soft Computing in Engineering Design and Manufacturing. Springer Verlag, Berlin, 1998. [199] J. Kennedy and R. Eberhart, Swarm Intelligence. Morgan Kaufmann, San Francisco, CA, 2001.
[200] M. A. Khanesar, M. Teshnehlab, and M. A. Shoorehdeli, “A novel binary particle swarm optimization,” in IEEE Med. Conf. Cont. Aut., pp. 1–6, 2007. [201] P. Ghamisi, M. S. Couceiro, and J. A. Benediktsson, “A novel feature selection approach based on FODPSO and SVM,” IEEE Trans. Geos. Remote Sens., vol. 53, no. 5, pp. 2935–2947, 2015. [202] F. VandenBergh and A. P. Engelbrecht, “A cooperative approach to particle swarm optimization,” IEEE Trans. Evol. Comp., vol. 8, no. 3, pp. 225–239, 2004. [203] K. Premalatha and A. Natarajan,“Hybrid PSO and GA for global maximization,” Int. Jour. Open Prob. Compt. Math., vol. 2, no. 4, pp. 597–608, 2009. [204] ——, “Hybrid PSO and GA for global maximization,” in ICSRS Publication, vol. 2, Paris, France, 2009. [205] P. Ghamisi and J. A. Benediktsson, “Feature selection based on hybridization of genetic algorithm and particle swarm optimization,” IEEE Geos. Remote Sens. Lett., vol. 12, no. 2, pp. 309–313, 2015. [206] R. C. Eberhart and Y. Shi., “Comparison between genetic algorithms and particle swarm optimization,” in Evolutionary Programming VII, V. W. Porto, Ed. Springer Verlag, Berlin, 1998. [207] M. Settles and T. Soule, “Breeding swarms: A GA/PSO hybrid,” in GECCO’05. pp. 161–168, ACM Press, 2005. [208] P. J. Angeline, Evolutionary optimization versus particle swarm optimization: Philosophy and performance differences, Evolutionary Programming VII, V. W. Porto, Ed. Springer Verlag, Berlin, 1998. [209] M. S. Couceiro, R. P. Rocha, N. M. F. Ferreira, and J. A. T. Machado, “Introducing the fractional order Darwinian PSO,” Sig., Image and Vid. Process., vol. 102, no. 1, pp. 8–16, 2007. [210] M. S. Couceiro and P. Ghamisi, Fractional Order Darwinian Particle Swarm Optimization: Applications and Evaluation of an Evolutionary Algorithm. Springer Verlag, London, 2015.
[211] M. S. Couceiro, F. M. Martins, R. P. Rocha, and N. M. Ferreira, “Mechanism and convergence analysis of a multi-robot swarm approach based on natural selection,” Jour. Intell. Rob. Sys., pp. 1–29, 2014. [212] P. Ghamisi, M. S. Couceiro, and J. A. Benediktsson, “Classification of hyperspectral images with binary fractional order Darwinian PSO and random forests,” Proc. SPIE 8892, Image and Signal Processing for Remote Sensing XIX, 2013. [213] M. Sezgin and B. Sankur, “Survey over image thresholding techniques and quantitative performance evaluation,” Jour. Electron. Imag., vol. 13, no. 1, pp. 146–168, 2004. [214] K. S. Fu and J. K. Mui, “A survey on image segmentation,” Pat. Recog., vol. 13, no. 1, pp. 3–16, 1981. [215] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson, “Classification of hyperspectral data from urban areas based on extended morphological profiles,” IEEE Trans. Geos. Remote Sens., vol. 43, no. 3, pp. 480–491, March 2005. [216] G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, and J. Calpe-Maravilla, “Composite kernels for hyperspectral image classfication,” IEEE Geos. Remote Sens. Lett., vol. 3, no. 1, pp. 93–97, 2006. [217] M. Fauvel, J. Chanussot, and J. A. Benediktsson, “A spatial-spectral kernel-based approach for the classification of remote-sensing images,” Pat. Recog., vol. 45, no. 1, pp. 381 – 392, 2012. [218] ——, “Adaptive pixel neighborhood definition for the classification of hyperspectral images with support vector machines and composite kernel,” Proc. ICIP’08, pp. 1884 – 1887, 2008. [219] G. Borgefors, “Distance transformations in digital images,” Comput. Vis. Graph. Image Process., vol. 34, no. 3, pp. 344–371, 1986. [220] D. Arthur and S. Vassilvitskii, “How slow is the k-means method?” in Proc. SCG’06, pp. 144–153, 2006. [221] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–136, 1982.
[222] P. Agarwal and N. Mustafa, “K-means projective clustering,” in Proc. PODS’04, pp. 155–165, 2004. [223] F. Gibou and R. Fedkiw, “A fast hybrid k-means level set algorithm for segmentation,” 4th Annual Hawaii Inter. Conf. Stat. Math., pp. 281–291, 2005. [224] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O’Brien, “Large-scale clustering of cDNA-fingerprinting data,” Genome Research, pp. 1093–1105, 1999. [225] S. Ray and R. H. Turi, “Determination of number of clusters in k-means clustering and application in colour image segmentation,” in Proc. 4th Inter. Conf. Adv. Pat. Recog. Dig. Tech., 1999. [226] J. Tou and R. Gonzalez, “Pattern recognition principles,” Massachusetts: Addison-Wesley, 1974. [227] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pat. Recog. Lett., vol. 31, no. 8, pp. 651–666, 2010. [228] A. Dutta, “Fuzzy c-means classification of multispectral data incorporating spatial contextual information by using Markov random field,” Master thesis, ITC Enschede, 2009. [229] J. C. Bezdek and R. Ehrlich, “FCM: The fuzzy c-means clustering algorithm,” Comp. Geos., vol. 10, no. 22, pp. 191–203, 1981. [230] J. Mielikainen and P. Toivanen, “Lossless compression of hyperspectral images using a quantized index to lookup tables,” IEEE Geos. Remote Sens. Lett., vol. 5, no. 3, pp. 471–478, 2008. [231] P. Maji and S. Pal, “Maximum class separability for rough-fuzzy c-means based brain MR image segmentation,” T. Rough Sets, vol. 5390, pp. 114–134, 2008. [232] W. Wang, Y. Zhang, Y. Li, and X. Zhang, “The global fuzzy c-means clustering algorithm,” Intel. Cont. Aut., vol. 1, pp. 3604–3607, 2006. [233] Z. Xian-cheng, “Image segmentation based on modified particle swarm optimization and fuzzy c-means clustering algorithm,” Sec. Inter. Conf. Intell. Comp. Tech. Auto., pp. 611–616, 2009.
[234] H. Izakian and A. Abraham, “Fuzzy c-means and fuzzy swarm for fuzzy clustering problem,” Expert Systems with Applications, vol. 38, no. 3, pp. 1835–1838, 2011.
[235] Y. D. Valle, S. Mohagheghi, J. Hernandez, and R. Harley, “Particle swarm optimization: Basic concepts, variants and applications in power systems,” IEEE Trans. Evolutionary Computation, pp. 171–195, 2008.
[236] L. Shapiro and G. Stockman, Computer Vision. Prentice Hall, New York, NY, 2002.
[237] A. Jensen and A. Solberg, “Fast hyperspectral feature reduction using piecewise constant function approximations,” IEEE Geos. Remote Sens. Lett., vol. 4, no. 4, pp. 547–551, 2007.
[238] J. Driesen and P. Scheunders, “A multicomponent image segmentation framework,” in Proc. ACIVS, pp. 589–600, 2008.
[239] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[240] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 24, no. 8, pp. 603–619, 2002.
[241] Y. Tarabalka, J. Chanussot, and J. A. Benediktsson, “Segmentation and classification of hyperspectral images using watershed transformation,” Pat. Recog., vol. 43, no. 7, pp. 2367–2379, 2010.
[242] S. Beucher and C. Lantuejoul, “Use of watersheds in contour detection,” Int. Workshop on Image Processing, Rennes, France, 1979.
[243] G. Noyel, J. Angulo, and D. Jeulin, “Morphological segmentation of hyperspectral images,” Image Analysis & Stereology, vol. 26, pp. 101–109, 2007.
[244] J. Serra, Image Analysis and Mathematical Morphology. Academic Press, London, 1982.
[245] A. Evans and X. Liu, “A morphological gradient approach to color edge detection,” IEEE Trans. Image Processing, vol. 15, no. 6, pp. 1454–1463, 2006.
[246] L. Vincent and P. Soille, “Watersheds in digital spaces: an efficient algorithm based on immersion simulations,” IEEE Trans. Pat. Analysis Machine Intel., vol. 13, no. 6, pp. 583–598, 1991.
[247] J. Astola, P. Haavisto, and Y. Neuvo, “Vector median filters,” Proceedings of the IEEE, vol. 78, no. 4, pp. 678–689, 1990.
[248] J. C. Tilton, “Image segmentation by region growing and spectral clustering with a natural convergence criterion,” in Proc. IGARSS’98, vol. 4, pp. 1766–1768, 1998.
[249] ——, “RHSeg user’s manual: Including HSWO, HSeg, HSegExtract, HSegReader and HSegViewer, version 1.55,” Jan. 2012.
[250] J.-M. Beaulieu and M. Goldberg, “Hierarchy in picture segmentation: a stepwise optimization approach,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 11, no. 2, pp. 150–163, 1989.
[251] A. J. Plaza and J. C. Tilton, “Automated selection of results in hierarchical segmentations of remotely sensed hyperspectral images,” in Proc. of IGARSS’05, vol. 7, 2005, pp. 4946–4949.
[252] Y. Tarabalka, J. C. Tilton, J. A. Benediktsson, and J. Chanussot, “Marker-based hierarchical segmentation and classification approach for hyperspectral imagery,” in Proc. of ICASSP’11, pp. 1089–1092, 2011.
[253] ——, “A marker-based approach for the automated selection of a single segmentation from a hierarchical set of image segmentations,” IEEE Jour. Selec. Top. App. Earth Obser. Remote Sens., vol. 5, no. 1, pp. 262–272, 2012.
[254] R. Gonzalez and R. Woods, Digital Image Processing, Second Edition. Prentice Hall, Upper Saddle River, NJ, 2002.
[255] Y. Tarabalka, J. Chanussot, and J. A. Benediktsson, “Segmentation and classification of hyperspectral images using minimum spanning forest grown from automatically selected markers,” IEEE Trans. Sys., Man, Cybernetics: Part B, vol. 40, no. 5, pp. 1267–1279, 2010.
[256] C. Chang and C. Lin, “LIBSVM: a library for support vector machines,” Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2008.
[257] V. Vapnik, The Nature of Statistical Learning Theory, Second edition, Springer Verlag, Berlin, 1999.
[258] T.-F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” Jour. Mach. Lear. Res., no. 5, pp. 975–1005, 2004.
[259] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 20, no. 3, pp. 226–239, 1998.
[260] J. Stawiaski, “Mathematical morphology and graphs: Application to interactive medical image segmentation,” Ph.D. dissertation, Paris School of Mines, 2008.
[261] R. C. Prim, “Shortest connection networks and some generalizations,” Bell. Sys. Tech. Jour., vol. 36, pp. 1389–1401, 1957.
[262] A. Widayati, B. Verbist, and A. Meijerink, “Application of combined pixel-based and spatial-based approaches for improved mixed vegetation classification using IKONOS,” in Proc. 23th Asian Conf Remote Sens., 2002.
[263] K. Bernard, Y. Tarabalka, J. Angulo, J. Chanussot, and J. Benediktsson, “Spectral-spatial classification of hyperspectral data based on a stochastic minimum spanning forest approach,” IEEE Trans. Image Proc., vol. 21, no. 4, pp. 2008–2021, 2012.
[264] J. Serra, Image Analysis and Mathematical Morphology, Volume 2: Theoretical Advances. Academic Press, London, 1988.
[265] P. Soille, “Recent developments in morphological image processing for remote sensing,” in Proc. SPIE 7477, Image and Signal Processing for Remote Sensing XV, 2009.
[266] M. Fauvel, “Spectral and spatial methods for the classification of urban remote sensing data,” Ph.D. dissertation, Grenoble Institute of Technology and University of Iceland, 2007.
[267] J. Crespo, J. Serra, and R. Schafer, “Theoretical aspects of morphological filters by reconstruction,” Signal Proc., vol. 47, no. 2, pp. 201–225, 1995.
[268] E. Aptoula and S. Lefèvre, “A comparative study on multivariate mathematical morphology,” Pat. Recog., vol. 40, no. 11, pp. 2914–2929, 2007.
[269] J. A. Palmason, J. A. Benediktsson, J. R. Sveinsson, and J. Chanussot, “Classification of hyperspectral data from urban areas using morphological preprocessing and independent component analysis,” in Proc. IGARSS’05, pp. 176–179, 2005.
[270] T. Castaings, B. Waske, J. A. Benediktsson, and J. Chanussot, “On the influence of feature reduction for the classification of hyperspectral images based on the extended morphological profile,” Int. Jour. Remote Sens., vol. 31, no. 22, pp. 5921–5939, 2010.
[271] P. Soille, “Beyond self-duality in morphological image analysis,” Image Vision Comp., vol. 23, no. 2, pp. 249–257, 2005.
[272] J. Debayle and J.-C. Pinoli, “General adaptive neighborhood image processing - part I,” Jour. Math. Imag. Vis., vol. 25, no. 2, pp. 245–266, 2006.
[273] ——, “General adaptive neighborhood image processing - part II,” Jour. Math. Imag. Vis., vol. 25, no. 2, pp. 267–284, 2006.
[274] J. Serra, “Connectivity on complete lattices,” Jour. Math. Imag. Vis., vol. 9, no. 3, pp. 231–251, 1998.
[275] E. J. Breen and R. Jones, “Attribute openings, thinnings, and granulometries,” Comput. Vis. Image Underst., vol. 64, no. 3, pp. 377–389, 1996.
[276] P. Maragos and R. Ziff, “Threshold superposition in morphological image analysis systems,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 12, no. 5, pp. 498–504, 1990.
[277] P. Salembier and J. Serra, “Flat zones filtering, connected operators, and filters by reconstruction,” IEEE Trans. Image Process., vol. 4, no. 8, pp. 1153–1160, 1995.
[278] V. Caselles and P. Monasse, Geometric Description of Images as Topographic Maps. Springer Verlag, Berlin, 2010.
[279] P. Salembier, A. Oliveras, and L. Garrido, “Antiextensive connected operators for image and sequence processing,” IEEE Trans. Image Process., vol. 7, no. 4, pp. 555–570, 1998.
[280] L. Najman and M. Couprie, “Building the component tree in quasilinear time,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3531–3539, 2006.
[281] M. Dalla Mura, J. A. Benediktsson, and L. Bruzzone, “Self-dual attribute profiles for the analysis of remote sensing images,” Mathematical Morphology and Its Applications to Image and Signal Processing, vol. 6671, pp. 320–330, Springer Verlag, Heidelberg, 2011.
[282] P. Monasse and F. Guichard, “Fast computation of a contrast-invariant image representation,” IEEE Trans. Image Process., vol. 9, no. 5, pp. 860–872, 2000.
[283] P. Salembier and M. Wilkinson, “Connected operators,” IEEE Sig. Proc. Mag., vol. 26, no. 6, pp. 136–157, 2009.
[284] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone, “Extended profiles with morphological attribute filters for the analysis of hyperspectral data,” Int. Jour. Remote Sens., vol. 31, no. 22, pp. 5975–5991, 2010.
[285] M. H. F. Wilkinson, H. Gao, W. H. Hesselink, J.-E. Jonker, and A. Meijster, “Concurrent computation of attribute filters on shared memory parallel machines,” IEEE Trans. Pat. Analysis Mach. Intell., vol. 30, no. 10, pp. 1800–1813, 2008.
[286] L. Najman and H. Talbot, Mathematical Morphology. Wiley-ISTE, New York, NY, 2010.
[287] J. Chanussot and P. Lambert, “Total ordering based on space filling curves for multivalued morphology,” in 4th Int. Symp. Math. Morph. App., pp. 51–58, 1998.
[288] L. Garrido, P. Salembier, and D. Garcia, “Extensive operators in partition lattices for image sequence analysis,” Sig. Proc., vol. 66, no. 2, pp. 157–180, 1998.
[289] P. Lambert and J. Chanussot, “Extending mathematical morphology to color image processing,” in CGIP-1st Int. Conf. Col. Graph. Image Proc., Saint-Etienne, France, pp. 158–163, 2000.
[290] E. Aptoula and S. Lefèvre, “A comparative study on multivariate mathematical morphology,” Pat. Recog., vol. 40, no. 11, pp. 2914–2929, 2007.
[291] Y. Tarabalka, J. A. Benediktsson, and J. Chanussot, “Mathematical morphology with emphasis on analysis of hyperspectral images and remote sensing applications,” Hyper-I-Net e-learning course, University of Iceland, 2009.
[292] P. Marpu, M. Pedergnana, M. Dalla Mura, J. A. Benediktsson, and L. Bruzzone, “Automatic generation of standard deviation attribute profiles for spectral-spatial classification of remote sensing data,” IEEE Geos. Remote Sens. Lett., vol. 10, no. 2, pp. 293–297, 2013.
[293] P. Ghamisi, J. A. Benediktsson, G. Cavallaro, and A. Plaza, “Automatic framework for spectral-spatial classification based on supervised feature extraction and morphological attribute profiles,” IEEE Jour. Selec. Top. App. Earth Obser. Remote Sens., vol. 7, no. 6, pp. 2147–2160, 2014.
[294] P. Ghamisi, M. S. Couceiro, and J. A. Benediktsson, “FODPSO based feature selection for hyperspectral remote sensing data,” in Proc. WHISPERS’14, In press.
[295] ——, “Classification of hyperspectral images with binary fractional order Darwinian PSO and random forests,” Proc. SPIE, vol. 8892, 2013.
[296] P. Ghamisi, F. Sepehrband, J. Choupan, and M. Mortazavi, “Binary hybrid GA-PSO based algorithm for compression of hyperspectral data,” 2011 5th Int. Conf. Sig. Process. Com. Sys., pp. 1–8, 2011.
[297] P. R. Marpu, M. Pedergnana, M. Dalla Mura, S. Peeters, J. A. Benediktsson, and L. Bruzzone, “Classification of hyperspectral data using extended attribute profiles based on supervised and unsupervised feature extraction techniques,” Int. Jour. Image Data Fus., vol. 3, no. 3, pp. 269–298, 2012.
[298] G. Celeux and G. Govaert, “A classification EM algorithm for clustering and two stochastic versions,” Comp. Stat. Data Analysis, vol. 14, no. 3, pp. 315–332, 1992.
[299] B. Song, J. Li, M. Dalla Mura, P. Li, A. Plaza, J. M. Bioucas-Dias, J. A. Benediktsson, and J. Chanussot, “Remotely sensed image classification using sparse representations of morphological attribute profiles,” IEEE Trans. Geos. Remote Sens., vol. 52, no. 8, pp. 5122–5136, 2014.
[300] S. Bernabé, P. R. Marpu, A. Plaza, M. Dalla Mura, and J. A. Benediktsson, “Spectral-spatial classification of multispectral images using kernel feature space representation,” IEEE Geos. Remote Sens. Lett., vol. 41, no. 9, pp. 1940–1949, 2014.
About the Authors
Jón Atli Benediktsson is currently the rector of the University of Iceland, Reykjavik, Iceland. He joined the University of Iceland in 1991 as assistant professor of electrical engineering. He has been professor of electrical and computer engineering at the University of Iceland since 1996 and was pro-rector of science and academic affairs at the University from 2009 to 2015. Dr. Benediktsson has been a visiting professor at the University of Trento, Italy, since 2002. He received a Cand.Sci. degree in electrical engineering from the University of Iceland, Reykjavik, in 1984, and M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, Indiana, in 1987 and 1990, respectively. His research interests are in remote sensing, biomedical analysis of signals, pattern recognition, image processing, and signal processing, and he has published extensively in those fields. Prof. Benediktsson was the 2011-2012 President of the IEEE Geoscience and Remote Sensing Society (GRSS) and has been on the GRSS AdCom since 2000. He was editor-in-chief of IEEE Transactions on Geoscience and Remote Sensing (TGRS) from 2003 to 2008, and has served as associate editor of TGRS since 1999, IEEE Geoscience and Remote Sensing Letters since 2003, and IEEE Access since 2013. He is on the editorial board of Proceedings of the IEEE and the international editorial board of the International Journal of Image and Data Fusion, and was the chairman of the Steering Committee of IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS) from 2007 to 2010. Prof. Benediktsson is a cofounder of the biomedical start-up company Oxymap (www.oxymap.com). He is a fellow of the IEEE and a fellow of SPIE. Prof. Benediktsson was a member of the 2014 IEEE Fellow Committee. He received the Stevan J. Kristof Award from Purdue University in 1991 as outstanding graduate student in remote sensing. In 1997, Dr. Benediktsson was the recipient of the Icelandic Research Council’s Outstanding Young Researcher Award; in 2000, he was granted the IEEE Third Millennium Medal; in 2004, he was a corecipient of the University of Iceland’s Technology Innovation Award; in 2006, he received the yearly research award from the Engineering Research Institute of the University of Iceland; and in 2007, he received the Outstanding Service Award from the IEEE Geoscience and Remote Sensing Society. He was corecipient of the 2012 IEEE Transactions on Geoscience and Remote Sensing Paper Award, and in 2013, he was corecipient of the IEEE GRSS Highest Impact Paper Award. In 2013, he received the IEEE/VFI Electrical Engineer of the Year Award. In 2014, he was a corecipient of the 2012-2013 Best Paper Award from the International Journal of Image and Data Fusion. He is a member of the Association of Chartered Engineers in Iceland (VFI), Societas Scientiarum Islandica, and Tau Beta Pi.
Pedram Ghamisi graduated with a B.Sc. in civil (survey) engineering from the Tehran South Campus of Azad University. He then obtained an M.Sc. degree with first class honors in remote sensing at K. N. Toosi University of Technology in 2012. He received a Ph.D. in electrical and computer engineering at the University of Iceland, Reykjavik, Iceland, in 2015. In the academic year 2010-2011, he received the Best Researcher Award for M.Sc. students at K. N. Toosi University of Technology. At the 2013 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Melbourne, July 2013, Dr. Ghamisi was awarded the IEEE Mikio Takagi Prize for winning the Student Paper Competition at the conference among almost 70 people. He is now a postdoctoral research fellow at the University of Iceland. His research interests are in remote sensing and image analysis, with a special focus on spectral and spatial techniques for hyperspectral image classification and the integration of LiDAR and hyperspectral data for land cover assessment. He serves as a reviewer for a number of journals, including IEEE Transactions on Image Processing, IEEE Transactions on Geoscience and Remote Sensing, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, Pattern Recognition Letters, Remote Sensing, Neural Computing and Applications, Signal, Image and Video Processing, and International Journal of Remote Sensing.
Index Absorption property, 162 Acronyms, this book, 213–14 AdaBoost, 45–46 Adaptive neighborhood system, 18–19 Antiextensivity, 162 Attribute filters (AFs), 28, 163–76 as connected filters, 163 example of, 164 max-tree, 165, 166 thickenings, 164, 166 thinnings, 163, 166 Attribute profiles (APs), 161–94 defined, 28 differential (DAP), 174 examples of, 170–71 extended multi- (EMAP), 28–30, 94, 174–75, 177–78, 185 extension to hyperspectral images, 169–76 feature production, 198 general architecture of, 173 as generalization of MPs, 169 overview, 161 probabilistic SVMs and, 185–86 spectral-spatial classification based on, 176–94 thickening profile, 169 See also Extended APs (EAPs) Attribute selection, 197, 198 AUTOMATIC, 177, 180–82
Average accuracy (AA), 52–54 AVIRIS Indian Pines, 210 Bagging, 46–48 BFODPSO, 175–76 Bibliography, this book, 215–43 Bi-level thresholding, 19–20, 124 Boosting, 44–46 Chamfer neighborhoods, 98, 100 Class accuracy (CA), 52 Classification approaches, 33–54 with automatically selected markers, 115–23 based on thresholding-based image segmentation, 129–30 defined, 5 pattern recognition and, 34 pixel-wise, 120 probabilistic pixel-wise classification, 117–18 required number of labeled samples for, 7–8 segmentation integration, 19–21 spectral, 11–33 spectral-spatial, 15, 94–100 statistical, 36–44 See also specific classification techniques
Classification accuracies majority voting (MV), 98 Pavia University test set, 156–57 random forest (RF), 194 ROSIS-03, 133–34 support vector machines (SVMs), 194 training samples and, 9, 10 Classification and regression trees (CART), 44 Classification EM (CEM) algorithm, 106–7 cluster assignment step, 202 clustering, 201–2 initialization, 201 parameter estimation step, 201 Classification error average accuracy (AA), 52–54 confusion matrix, 51–52, 53 estimation of, 51–52 evaluation of, 51 Classification maps comparison, 184 defined, 11–12 illustrated, 16 Indian Pines data, 189–90 Pavia University, 193 ROSIS-03, 132, 179, 183 Classifiers decision tree (DT), 43–44 ECHO, 35, 50–51, 195 maximum likelihood (MLC), 35, 36 multiple, 44–50 neural network, 41–43 number of training samples and, 8 random forest (RF), 13–14 spectral, 11–33 supervised, 12 support vector machines (SVMs), 37–41 unsupervised, 12–13 Closing by reconstruction, 21, 22 defined, 144 MP composition of, 147 Clustering approaches fuzzy C-means (FCM), 103 K-means, 101–2 PSO-FCM, 104–5
Clustering-based methods, 19 Composite kernels, 95–96 Conclusions, this book, 195–96 Confusion matrix, 51–52, 53 Connected filters, 163 Crisp neighborhood system, 15–18 Crossover methods, 77, 78 Darwinian PSO (DPSO), 20–21, 84 cooperative approach, 85–87 fractional order (FODPSO), 87, 88–90, 128–30, 196–97 multiple smaller swarms, 130 thresholding-based segmentation and, 124–25 Data Gaussian distributed, 7 increasing dimensionality of, 8 See also High-dimensional data Data sets description Indian Pines, 210–12 ROSIS-03, 207–10 Decision boundary feature extraction (DBFE) defined, 64 first three components, 70 points for using, 66 workflow, 64–66 Decision boundary feature matrix (DBFM), 185 Decision trees (DTs) CART, 44 classifiers, 43–44 defined, 43 Differential attribute profiles (DAP), 174 Differential MP (DMP), 149, 150, 174 analysis of, 25 defined, 22 example of, 24 Dilation defined, 138 example of, 139 geodesic, 142 reconstruction by, 144 result, 139–40 schematic comparison, 143 See also Erosion
Dimensionality, 6–8 to infinity, 8 PSO-based feature selection, 80 Discriminant analysis feature extraction (DAFE) classification map comparison, 184 within-class scatter matrix, 62–63 defined, 62 first three components, 70 Fisher criterion and, 64 shortcomings, 64 Ecological science, 1–2 EMP-KPCA, 153, 154 EMP-PCA, 153 Ensemble of classifiers. See Multiple classifiers Erosion defined, 138 example of, 140 geodesic, 142 outputs of, 140, 141 schematic comparison, 143 See also Dilation Evolutionary-based feature selection, 72–75 Expectation maximization (EM) Classification (CEM) algorithm, 106–7 defined, 105–6 number of parameters needed, 107 steps, 106 Extended APs (EAPs) defined, 28 example of, 29 general architecture of, 167 multiple, computing, 172–74 See also Attribute profiles (APs) Extended MP (EMP), 25, 27 composition, 148 example of, 148 KPCA, 153, 154 PCA, 153 stacked MPs, 146, 148 See also Morphological profiles (MPs) Extended multi-APs (EMAPs), 28–30, 94 attributes, 174 defined, 174 feature extraction and, 185
general architecture of, 168, 178 issues addressing, 175 threshold values, 177 using, 174 Extensivity, 162 Extraction and classification of homogeneous objects (ECHO), 35, 50–51, 195 Feature extraction, 10–11, 56–69 decision boundary feature extraction (DBFE), 64–67 defined, 10–11, 56 discriminant analysis feature extraction (DAFE), 62–64 EMAPs and, 185 independent component analysis (ICA), 60–62 knowledge-based, 56–57 nonparametric weighted feature extraction (NWFE), 67–69 principal component analysis (PCA), 58–60 supervised, 57, 58 techniques, 56 unsupervised, 57–58 Feature fusion, 153–54 Feature reduction, 55–90 need for, 5–11 overview, 55–56 summary, 90 techniques, 9 See also Feature extraction; Feature selection Feature selection, 9–10, 69–90 defined, 9 evolutionary-based, 72–75 FODPSO-based, 84–90 GA-based, 75–77 HGAPSO-based, 83–84, 86 input data specification and, 10 overview, 69–71 PSO-based, 77–83 selection metric, 69 supervised, 10, 71–72 unsupervised, 10, 71–72 use of, 69 Fisher criterion, 64
FODPSO-based feature selection, 84–90 Fractional coefficient, 87 Fractional order DPSO (FODPSO) approaches, 89–90 binary approach, 89 concept, 88, 128, 129 defined, 87 feature selection based on, 89 multiple smaller swarms, 130 segmentation, 20–21 speed and efficiency of, 196–97 thresholding-based segmentation and, 125, 129–30 Fusion approach (FA), 177 Fuzzy C-means (FCM) clustering defined, 103 introduction of, 102 k cluster partitioning, 103 PSO-based, 104–5 Gamma function, 6 Gaussian MRF, 17 Gaussian radial basis function (RBF), 14 Genetic algorithm-based feature selection, 75–77 Genetic algorithms (GAs), 73, 74 concept of, 76 as iterative process, 77 selection, 75 shortcomings, 77 Geodesic dilation, 142 Geodesic erosion, 142 Geological science, 2 Gray level co-occurrence matrix (GLCM), 18 Grunwald Letnikov, 87 Hidden Markov models (HMMs), 17 Hidden MRF (HMRF), 17 Hierarchical segmentation (HSeg) defined, 113 experimental evaluation, 131 initialization, 114 recursive (RHSeg), 115 Hierarchical step-wise optimization (HSWO) method, 113 High-dimensional data, 4–31
geometrical properties of, 5–11 low linear projections and, 8–9 need for feature reduction and, 5–11 overview of, 4–5 projecting into lower dimensional subspace, 9 statistical properties of, 5–11 summary, 31 High-dimensional spaces, 7, 8–9 Histogram thresholding-based methods, 19 Hughes phenomenon, 11, 188 Hybrid genetic algorithm particle swarm optimization (HGAPSO), 83–84, 86 Hydrological science, 2 Hypercubes, 6 Hyperellipsoid, 7 Hyperspectral images analysis of, 4, 13 AP and extensions to, 169–76 attributes, 28 defined, 2 example of, 5 as hyperspectral data cubes, 2 illustrated, 3 as list of spectral measurements, 14 spatial perspective, 3–4 spectral perspective, 2–3 WS technique for, 110–13 Hyperspectral imaging instruments, 11 Hyperspectral imaging systems functioning of, 1 introduction to, 1–4 Hypersphere, volume of, 6 Idempotence, 162 Image segmentation, 18–19 Image thresholding, 125 Increasingness, 162 Independent component analysis (ICA) algorithms, 62 assumptions, 60–62 defined, 60 first three components, 63 independent components (ICs), 60 Indian Pines data classification maps, 189–90 set description, 210–12
ISODATA, 12–13 Kappa coefficient (k), 54 K means, 12 K-means clustering, 101–2 Knowledge-based feature extraction, 56–57 Likelihoods, 36–37 Linear discriminant analysis (LDA), 58 Machine learning, 34 Majority voting (MV) Chamfer neighborhoods, 98, 100 classification accuracies, 98 defined, 96 post-regulation, 98–100 within segmentation regions, 120 spectral-spatial classification using, 96–100 workflow, 96–98 MANUAL, 177, 180–82 MAP-MRF classification, 17 Marker selection, 115–23 defined, 115–16 flowchart, 116 minimum spanning forest and, 121–23 multiple classifier approach, 118–21 with probabilistic SVM, 117–18 Markov random field (MRF) Gaussian, 17 hidden (HMRF), 17 MAP, 17 modeling, 15–17 Mathematical morphology (MM), 138–58 absorption property, 162 dilation and, 138–41 erosion and, 138–41 exploitation, 138 extensivity and antiextensivity, 162 fundamental properties, 162 idempotence, 162 increasingness, 162 morphological operators, 138–44 opening and closing, 142–44 Maximum likelihood classifier (MLC), 35, 36 Max-tree, 165, 166 Mean-shift segmentation (MSS), 108–9
Mercer’s conditions, 39 Military applications, 2 Mineralogy, 2 Minimum spanning forest (MSF), 121–23 construction of, 121–22 eight-neighborhood connectivity, 123 extra vertices example, 123 majority voting (MV), 122–23 minimum spanning computation, 122 Min-tree, 165 Morphological attribute filters (AFs), 28, 163–76 as connected filters, 163 example of, 164 max-tree, 165, 166 thickenings, 164, 166 thinnings, 163, 166 Morphological closing, 21, 22 Morphological neighborhood, 149–58 defined, 149 EMP-KPCA, 154 illustrated, 152 spectral-spatial classification, 152–58 Morphological opening, 21, 22 Morphological profiles (MPs), 137–60 computed with compact SE, 25 concept, 137 defined, 22 differential (DMP), 22–25, 149, 150, 174 DMP, 150 examples of, 23, 24, 146 extended (EMP), 25, 27, 146, 148 feature production, 198 limitations of, 26, 151 main shortcoming of, 26 mathematical morphology, 138–58 multiscale decomposition, 169 opening/closing by reconstruction, 147 opening/closing operations, 145 overview, 137 stacking, 146 successive opening/closing operations, 26 summary, 158–60 two, computation of, 25 Multilevel logistic Markov-Gibbs MRF, 31
Multinomial logistic regression (MLR) algorithm, 31, 186 Multiple classifiers bagging, 46–48 boosting, 44–46 defined, 44 random forest (RF), 48–50 theory of, 44 Multiple classifiers for marker selection, 118–21 classifier combinations, 120 defined, 118 flowchart, 119 MSSC scheme, 119–20 summary, 119 Multiple spectral-spatial classifier (MSSC) marker selection scheme, 119–20 Mutation probability, 77, 79 Natural disaster monitoring/managing, 197 Neural networks classifiers, 41–43 defined, 41 methods, 42–43 pattern recognition, 42 in remote sensing, 42 synaptic efficacies, 41 N-level thresholding, 127 Nonparametric weighted feature extraction (NWFE) advantages of, 67 classification map comparison, 184 within-class scatter matrix, 68 defined, 67 first three components, 70 nonparametric estimation, 67–68 workflow, 68 Normalized difference vegetative index (NDVI), 57 Normalized differential water index (NDWI), 57 Opening and closing, 142–44 Opening by reconstruction, 21, 22 defined, 144 MP composition of, 147 Out-of-Bag (OOB) error, 49
Overall accuracy (OA), 51 Oversegmentation, 108 Parallel implementations, 199 Particle swarm optimization (PSO), 73, 74 advantage of, 83 concept, 81 defined, 20 DPSO, 20–21, 84, 87, 128–30, 196–97 drawback, 128 FODPSO, 20–21, 87, 89, 128–30, 196–97 as optimization technique, 79 parameter settings dependence, 82–83 particle efficiency, 128 premature convergence of a swarm, 82 thresholding-based segmentation and, 124 Particle swarm optimization (PSO)-based FCM (PSO-FCM), 104–5 Particle swarm optimization (PSO)-based feature selection, 77–83 band coding, 82 dimensionality, 80 particle velocity, 80 procedure, 81–83 Pattern recognition machine learning and, 34 neural networks, 42 Pavia Center, 208–10 Pavia University, 209 classification accuracies, 156–57 classification maps, 193 classification results, 192 classified by RF, 194 classified by SVM, 194 data set, 207–8 Perspectives, this book, 196–99 Piecewise constant function approximations (PCFA) method, 107 Pixel-wise classification, 120 Pixel-wise maximum likelihood classification, 17 Precision agriculture, 2 Prim’s algorithm, 122, 205 Principal component analysis (PCA), 58–61
data transformation, 25 defined, 58 eigenvalues, 59–60 PCs, 58–60, 61 See also Feature extraction Probabilistic SVM APs and, 185–86 implemented in LIBSVM library, 117 marker selection with, 117–18 regularized by MRF, 17 Profiles. See Attribute profiles (APs); Morphological profiles (MPs) PSO. See Particle swarm optimization (PSO) Radial basis function (RBF), 40 Random forest (RF), 13–14, 48–50 classification accuracies, 194 defined, 48 derived parameters for, 49 individuality of trees, 48 Out-of-Bag (OOB) error, 49 proximities, 50 stability, 188 variable importance, 49–50 Reconstruction closing by, 21, 22, 144 dilation by, 144 MP composition of, 147 opening by, 21, 22, 144 Recursive HSeg (RHSeg), 115 Region-based split and merging methods, 19 Remote sensing imagery, 33 neural networks in, 42 patterns, 34 ROSIS-03, 91, 92 classification accuracies, 133–34 classification maps, 132, 179, 183 gradients, 113 Pavia Center, 208–10 Pavia data sets, 207–10 Pavia University, 207–8 Roulette wheel selection, 75 Samples. See Training samples Segmentation with automatically selected markers,
115–23 classification integration, 19–21 FODPSO, 20–21 hierarchical (HSeg), 113–15 image, 18–19 maximum a posteriori, 31 mean-shift (MSS), 108–9 oversegmentation, 108 spatial information extraction with, 91–136 thresholding-based, 124–35 undersegmentation, 108 watershed (WS), 109–13 Self-complementary area filter, 151 Semi-supervised techniques, 35 Spatial features, 152 Spatial information extraction automatically selected markers and, 115–23 clustering approaches, 101–5 expectation maximization (EM), 105–8 hierarchical segmentation (HSeg), 113–15 mean-shift segmentation (MSS), 108–9 overview, 91–93 ROSIS-03, 91, 92 with segmentation, 91–136 summary, 135–36 thresholding-based segmentation, 124–35 watershed segmentation (WS), 109–13 Spatial perspective, 3–4 Spectral angle mapper (SAM), 114, 203 Spectral-EMP, 153 Spectral features, 152 Spectral perspective, 2–3 Spectral-spatial classification, 15, 94–100 composite kernels, 95–96 EMP-KPCA, 153 EMP-PCA, 153 experimental assessment, 152–58 experimental evaluation of approaches, 130–35 feature fusion, 153–54 feature fusion into stacked vector, 94–95 with majority voting, 96–100
Spectral-spatial classification (continued) morphological neighborhood, 152–58 performance evaluation, 136, 159 schematic example, 97 spectral-EMP, 153 Spectral-spatial classification based on AP, 176–94 strategy 1, 176–77 strategy 2, 177–83 strategy 3, 183–94 Stacked vector approach (SVA), 177 Statistical classification, 36–44 decision tree (DT) classifiers, 43–44 likelihoods, 36–37 neural network classifiers, 41–43 SVMs, 37–41 Structuring elements (SEs), 21, 25 defined, 138 dilation by, 139 erosion by, 140 examples of, 139 Supervised classifiers, 12 Supervised feature extraction, 57, 58 Supervised feature selection, 10, 71–72 Support vector machines (SVMs), 14, 37–41 advantages of, 37 for binary classification problems, 40 classification accuracies, 194 classification by, 38 composite kernels and, 41, 95 generalization capability, 39 in hyperspectral classification, 37 linearly weighted composite kernel framework with, 185 multiclass strategy, 40 one-shot, 41 as pixel-wise classifier, 30 probabilistic, 17, 117–18, 185–86 support vectors, 39 Synaptic efficacies, 41
Texture analysis-based methods, 19 Texture measures, 18 Thematic maps, 158, 159 Thickenings, 164, 166 Thinnings, 163, 166 Threshold decomposition principle, 163 Thresholding-based segmentation, 124–35 classification based on, 129–30 DPSO and, 124–25 FODPSO and, 125, 129–30 image thresholding, 125–29 PSO and, 124 See also Segmentation Tournament selection, 75 Training samples classification accuracies and, 9, 10 defined, 7 dimensions and, 8 keeping number constant, 8 limited number of, 191 Undersegmentation, 108 Unsupervised classification algorithms, 34–35 Unsupervised classifiers defined, 12 similarity criterion, 12–13 Unsupervised feature extraction, 57–58 Unsupervised feature selection, 10, 71–72 Very high resolution image (VHR), 21, 22 Volume of hyperellipsoid, 7 of hypersphere, 6 Watershed segmentation (WS), 109–13 application flowchart, 112 defined, 109 for hyperspectral images, 110–13 illustrated, 110
The Artech House Remote Sensing Series Fawwaz T. Ulaby, Series Editor
Backscattering from Multiscale Rough Surfaces with Application to Wind Scatterometry, Adrian K. Fung
Digital Processing of Synthetic Aperture Radar Data: Algorithms and Implementation, Ian G. Cumming and Frank H. Wong
Digital Terrain Modeling: Acquisitions, Manipulation, and Applications, Naser El-Sheimy, Caterina Valeo, and Ayman Habib
Handbook of Radar Scattering Statistics for Terrain, F. T. Ulaby and M. C. Dobson
Handbook of Radar Scattering Statistics for Terrain: Software and User’s Manual, F. T. Ulaby and M. C. Dobson
Magnetic Sensors and Magnetometers, Pavel Ripka, editor
Microwave Radiometer Systems Design and Analysis, Second Edition, Niels Skou and David Le Vine
Microwave Remote Sensing: Fundamentals and Radiometry, Volume I, F. T. Ulaby, R. K. Moore, and A. K. Fung
Microwave Remote Sensing: Radar Remote Sensing and Surface Scattering and Emission Theory, Volume II, F. T. Ulaby, R. K. Moore, and A. K. Fung
Microwave Remote Sensing: From Theory to Applications, Volume III, F. T. Ulaby, R. K. Moore, and A. K. Fung
Microwave Scattering and Emission Models for Users, Adrian K. Fung and K. S. Chen
Neural Networks in Atmospheric Remote Sensing, William J. Blackwell and Frederick W. Chen
Radargrammetric Image Processing, F. W. Leberl
Radar Polarimetry for Geoscience Applications, C. Elachi and F. T. Ulaby
Spectral-Spatial Classification of Hyperspectral Remote Sensing Images, Jón Atli Benediktsson and Pedram Ghamisi
Understanding Synthetic Aperture Radar Images, Chris Oliver and Shaun Quegan
Wavelets for Sensing Technologies, Andrew K. Chan and Cheng Peng
For further information on these and other Artech House titles, including previously considered out-of-print books now available through our In-Print-Forever® (IPF®) program, contact:

Artech House
685 Canton Street
Norwood, MA 02062
Phone: 781-769-9750
Fax: 781-769-6334
e-mail: [email protected]

Artech House
16 Sussex Street
London SW1V 4RW UK
Phone: +44 (0)20-7596-8750
Fax: +44 (0)20-7630-0166
e-mail: [email protected]

Find us on the World Wide Web at: www.artechhouse.com