Chih-Cheng Hung · Enmin Song · Yihua Lan
Image Texture Analysis: Foundations, Models and Algorithms
Chih-Cheng Hung, Kennesaw State University, Marietta, GA, USA
Enmin Song, Huazhong University of Science and Technology, Wuhan, Hubei, China
Yihua Lan, Nanyang Normal University, Nanyang, Henan, China
ISBN 978-3-030-13772-4    ISBN 978-3-030-13773-1 (eBook)
https://doi.org/10.1007/978-3-030-13773-1
Library of Congress Control Number: 2019931824

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Hello darkness, my old friend
I've come to talk with you again
because a vision softly creeping
left its seeds while I was sleeping
and the vision that was planted in my brain
still remains
within the sound of silence
—Simon and Garfunkel
Research on image texture analysis has made significant progress in the past few decades. However, image texture classification (and segmentation) is still an elusive goal despite the tremendous effort devoted to work in this area. This is perhaps analogous to what Alan Turing proved, that "there are many things that computers cannot do," although he also showed us that there are many things that computers can do [1]. Traditional image texture classification usually consists of texture feature extraction and texture classification. This scheme is flexible, as many image and pattern classification, segmentation, and clustering algorithms in the literature can be used for texture classification once texture features are available. The separation of features and classification has a limitation: texture feature extractors must be well designed to provide a representative set of texture features for the classification algorithm. In the past few years, convolutional neural networks (CNNs) have emerged as a popular approach for the advancement of image texture classification. CNNs are being used not only in image classification but also in other applications such as machine vision and language translation. It is quite astounding that the CNN algorithm can extract features automatically without human interaction. Unlike traditional artificial neural networks, CNNs can extract features from the spatial image domain. The concept of the receptive field used in CNNs is very similar to the spatial filter in digital image processing. Our goal in writing this book is to introduce the basics of image texture analysis to a beginner who is interested in pursuing research in this field. At the same time, we wish to present the basic K-views models and algorithms for extracting and classifying image textures, which were developed in our research laboratory and in collaboration with other research scientists. We also include the fundamentals of popular textural feature extraction methods and classification algorithms to serve as
a solid foundation in this book. The algorithms selected as the basis were carefully chosen through our research findings and refined over many years of teaching image texture analysis. Specifically, we divide the book into three parts: Part I consists of Chaps. 1–4, describing the existing models and algorithms. Part II includes Chaps. 5–8, introducing the K-views models and algorithms; it is based on the foundation of the papers on the K-views models published in conferences and journals. Part III introduces deep machine learning models for image texture analysis in Chaps. 9 and 10. In order to have complete coverage of image texture analysis in deep machine learning, we present those popular models and algorithms already well developed by many eminent researchers. If we have learned something from this research area, it is that we are, as the metaphor goes, "dwarfs standing on the shoulders of giants". We are very grateful to have learned so much from the many pioneers who have devoted their time and lives to research in this fascinating area. This book is intended for building a foundation and learning the basics of image texture analysis. The recommended audience is senior undergraduate students and first-year graduate students. Therefore, we have provided examples for the clarification of concepts in the book. We have tried our best to sift through the manuscript; if there are still any errors or typos in this book, it is our responsibility. Please kindly inform us so that they can be corrected.

Marietta, USA    Chih-Cheng Hung
Wuhan, China    Enmin Song
Nanyang, China    Yihua Lan
Reference
1. Harel D (2012) Standing on the shoulders of a giant: one person's experience of Turing's impact (summary of the Alan M. Turing lecture). The Weizmann Institute of Science, Rehovot, Israel
Acknowledgements
The skies proclaim the work of his hands; Day after day they pour forth speech; Night after night they display knowledge. (Psalm 19: 1, 2)
We are indebted to many students, colleagues, visiting scholars, and collaborators. They have contributed significantly to our research in developing models and algorithms for image texture analysis. The digital age has established a vast “global knowledge library” through the Internet, which brings in many scholars’ pioneering works together for us to explore and learn. We have benefited from this global library in writing this book. We are also very grateful to our graduate students who have taught us a great deal about image texture analysis in our research laboratories and classrooms. C.-C.: My special thanks go to Profs. Jian Zhou, Guangzhi Ma, and Wendy Liu for their research contributions during their visit to the Center of Machine Vision and Security Research (CMVSR) at Kennesaw State University (KSU). My students who have contributed significantly to image processing and analysis research in the CMVSR at KSU (formerly Southern Polytechnic State University): Shisong Yang, Sarah Arasteh, Sarah Sattchi, David Bradford, Jr., Ellis Casper, Jr., Dilek Karabudak, Srivatsa Mallapragada, Eric Tran, Michael Wong, Wajira Priyadarshani, and Mahsa Shokri Varniab. In particular, Shisong Yang has made a great contribution to the K-views model. I also appreciate the Director, Prof. Y. Liu, of Henan Key Laboratory of Oracle Bone Inscription Information Processing in Anyang Normal University for his enthusiastic support for this project. My thanks go to the photographer, Cino Trinh, who provides many beautiful textural images for the book. I give my appreciation to the Center for Teaching Excellence and Learning (CTEL) at KSU for providing me one-semester sabbatical leave to work on this book. It is not possible to list all of the names here. If I miss those who have helped me in this contribution, please forgive my ignorance. The love and support I have received from my late parents and my family are immeasurable. My gratitude goes to Avery, David, and Ming-Huei for their love and encouragement. E. S.: I give my thanks to my colleagues in the Research Center of Biomedical Imaging and Bioinformatics (CBIB), Huazhong University of Science and Technology (HUST); Professors Renchao Jin, Hong Liu, Lianghai Jin, Xiangyang Xu,
and Guangzhi Ma. I have learned a lot from them in the area of medical image processing and analysis through more than 10 years of teaching and research. The brainstorming and discussions with this research group have triggered many valuable ideas. I would also like to give my special thanks to the graduate students in the research center as many works in this book are the results of their research. Y.-H.: I would like to thank Dr. Xiangyang Xu, Huazhong University of Science and Technology (HUST) and Dr. Qian Wang, School of Information Security, Central South University of Finance and Economics and Law, and Siguang Dai, my senior classmate in HUST, for their helpful discussions in image processing. I appreciate Jiao Long, my junior classmate at HUST who helped to draw a significant portion of figures for the book. I also appreciate the financial support from the National Natural Science Foundation of China (Grant No. 61401242) and Scholars of the Wolong project of Nanyang Normal University for writing the book.
Contents

Part I  Existing Models and Algorithms for Image Texture

1  Image Texture, Texture Features, and Image Texture Classification and Segmentation  3
   1.1  Introduction  3
   1.2  Image Texture  6
   1.3  Texture Features  8
   1.4  Image Texture Classification and Segmentation  9
   1.5  Summary  12
   1.6  Exercises  12
   References  13

2  Texture Features and Image Texture Models  15
   2.1  Introduction  15
   2.2  Gray-Level Co-Occurrence Matrix (GLCM)  17
   2.3  Gabor Filters  21
   2.4  Wavelet Transform (WT) Model and Its Extension  23
   2.5  Autocorrelation Function  32
   2.6  Markov Random Field (MRF) Model  32
   2.7  Fractal Dimensions  35
   2.8  Variogram  38
   2.9  Texture Spectrum (TS) and Local Binary Pattern (LBP)  40
   2.10  Local Binary Pattern (LBP) and Color Features  44
   2.11  Experimental Results on GLCM, TS, and LBP  45
   2.12  Summary  45
   2.13  Exercises  47
   References  47

3  Algorithms for Image Texture Classification  51
   3.1  The K-means Clustering Algorithm (K-means)  52
   3.2  The K-Nearest-Neighbor Classifier (K-NN)  54
   3.3  The Fuzzy C-means Clustering Algorithm (FCM)  55
   3.4  A Fuzzy K-Nearest-Neighbor Algorithm (Fuzzy K-NN)  60
   3.5  The Fuzzy Weighted C-means Algorithm (FWCM)  61
   3.6  The New Weighted Fuzzy C-means Algorithm (NW-FCM)  64
   3.7  Possibilistic Clustering Algorithm (PCA)  66
   3.8  A Generalized Approach to Possibilistic Clustering Algorithms (GPCA)  70
   3.9  Credibility Clustering Algorithms (CCA)  72
   3.10  The Support Vector Machine (SVM)  76
   3.11  The K-means Clustering Algorithm with the Ant Colony Optimization (K-means-ACO)  79
   3.12  The K-means Algorithm and Genetic Algorithms (K-means-GA)  83
   3.13  The K-means Algorithm and Simulated Annealing (K-means-SA)  86
   3.14  The Quantum-Modeled K-means Clustering Algorithm (Quantum K-means)  90
   3.15  The Pollen-Based Bee Algorithm for Clustering (PBA)  93
   3.16  Summary  97
   3.17  Exercises  99
   References  100

4  Dimensionality Reduction and Sparse Representation  103
   4.1  The Hughes Effect and Dimensionality Reduction (DR)  104
   4.2  The Basis and Dimension  105
   4.3  The Basis and Image  107
   4.4  Vector Quantization (VQ)  110
   4.5  Principal Component Analysis (PCA)  112
   4.6  Singular Value Decomposition (SVD)  115
   4.7  Non-negative Matrix Factorization (NMF)  118
   4.8  Sparse Representation—Sparse Coding  119
        4.8.1  Dictionary Learning  121
   4.9  Experimental Results on Dimensionality Reduction of Hyperspectral Image  123
   4.10  Summary  124
   4.11  Exercises  125
   References  126

Part II  The K-views Models and Algorithms

5  Basic Concept and Models of the K-views  131
   5.1  View Concept and a Set of Views  132
   5.2  Set of Characteristic Views and the K-views Template Algorithm (K-views-T)  135
   5.3  Experimental Results Using the K-views-T Algorithm  139
   5.4  Empirical Comparison with GLCM and Gaussian MRF (GMRF)  142
   5.5  Simplification of the K-views  145
   5.6  Summary  147
   5.7  Exercises  147
   References  148

6  Using Datagram in the K-views Model  149
   6.1  Why Do We Use Datagrams?  150
   6.2  The K-views Datagram Algorithm (K-views-D)  153
   6.3  Boundary Refined Texture Segmentation Based on the K-views-D Algorithm  155
   6.4  Summary  159
   6.5  Exercises  160
   References  160

7  Features-Based K-views Model  163
   7.1  Rotation-Invariant Features  163
   7.2  The K-views Algorithm Using Rotation-Invariant Features (K-views-R)  169
   7.3  Experiments on the K-views-R Algorithm  172
   7.4  The K-views-R Algorithm on Rotated Images  173
   7.5  The K-views-R Algorithm Using a View Selection Method to Choose a Set of Characteristic Views  176
   7.6  Summary  180
   7.7  Exercises  181
   References  182

8  Advanced K-views Algorithms  183
   8.1  The Weighted K-views Voting Algorithm (Weighted K-views-V)  183
   8.2  Summed Square Image (SSI) and Fast Fourier Transform (FFT) Method in the Fast Weighted K-views Voting Algorithm (Fast Weighted K-views-V)  187
   8.3  A Comparison of K-views-T, K-views-D, K-views-R, Weighted K-views-V, and K-views-G Algorithms  192
   8.4  Impact of Different Parameters on Classification Accuracy and Computation Time  194
   8.5  Summary  196
   8.6  Exercises  197
   References  197

Part III  Deep Machine Learning Models for Image Texture Analysis

9  Foundation of Deep Machine Learning in Neural Networks  201
   9.1  Neuron and Perceptron  202
   9.2  Traditional Feed-Forward Multi-layer Neural Networks (FMNN)  207
   9.3  The Hopfield Neural Network  211
   9.4  Boltzmann Machines (BM)  215
   9.5  Deep Belief Networks (DBN) and Restricted Boltzmann Machines (RBM)  218
   9.6  The Self-Organizing Map (SOM)  221
   9.7  Simple Competitive Learning Algorithm Using Genetic Algorithms (SCL-GA)  223
   9.8  The Self-Organizing Neocognitron  227
   9.9  Summary  230
   9.10  Exercises  231
   References  231

10  Convolutional Neural Networks and Texture Classification  233
   10.1  Convolutional Neural Networks (CNN)  233
   10.2  Architectures of Some Large Convolutional Neural Networks (CNNs)  241
   10.3  Transfer Learning in CNNs  244
   10.4  Image Texture Classification Using Convolutional Neural Networks (CNN)  246
   10.5  Summary  249
   10.6  Exercises  249
   References  250

Index  253
Part I
Existing Models and Algorithms for Image Texture
1 Image Texture, Texture Features, and Image Texture Classification and Segmentation
The journey of a thousand miles begins with a single step. —Lao Tzu
In this chapter, we will discuss the basic concepts of image texture, texture features, and image texture classification and segmentation. These concepts will be the foundation for understanding the image texture models and algorithms used for image texture analysis. Once texture features are available, many classification and segmentation algorithms from traditional pattern recognition can be utilized for labeling textural classes. Image texture analysis strongly depends on the spatial relationships among the gray levels of pixels. Therefore, methods for texture feature extraction are developed by looking at this spatial relationship. For example, the gray-level co-occurrence matrix (GLCM) and local binary patterns (LBP) were derived based on this spatial concept. Traditional techniques for image texture analysis, including classification and segmentation, fall into one of four categories: statistical, structural, model-based, and transform-based methods. With the rapid advancement of deep machine learning in artificial intelligence, convolutional neural networks (CNNs) have been widely used in image texture analysis. It would be essential for us to further explore image texture analysis with deep CNNs.
1.1 Introduction
Image texture analysis is an important branch of digital image processing and computer vision. Image texture refers to the characterization of the surface of a given object or phenomenon present in the image. Texture occurs in many different types of images such as natural, remote sensing, and medical images. For machine interpretation and understanding, image texture analysis is used to investigate the contents of image textures, characterize each texture, and categorize different types of textures. In general, image texture analysis consists of four types of problems: (1) texture segmentation, (2) texture classification, (3) texture
synthesis, and (4) shape from texture [33, 34, 37, 41]. Broadly speaking, texture segmentation is similar to image segmentation, in which a priori information is unknown. Image segmentation is defined as the meaningful partitioning of an image into homogeneous regions. Due to the repetition of pixel elements in each texture, some algorithms used in image segmentation may not be suitable for texture segmentation. Texture classification assumes that a priori knowledge about the image texture is known, for example, the number of different textural classes in an image. Texture synthesis refers to the generation of textures using mathematical models by providing some parameters for the texture models. Shape from texture is the construction of three-dimensional shapes of texture surfaces. Techniques for texture analysis can be classified into four categories: (1) statistical, (2) structural, (3) model-based, and (4) transform-based methods (also called signal processing methods) [40]. In other words, the objective of texture analysis is to characterize and discriminate image textures using statistical, structural, model-based, and transform-based methods. This book focuses on the fundamentals of feature extraction, classification, and segmentation of image texture. Hence, we will explore models and algorithms for these fundamentals. Some related algorithms, clustering algorithms in particular, will be discussed in detail.

An image texture classification platform is divided into two components, as shown in Fig. 1.1a: texture feature extraction and texture classification and segmentation. This model is the same as that used in a traditional pattern recognition system. It provides flexibility for each component to use algorithms available in the pattern recognition literature. Texture feature extraction is a process of extracting the intrinsic features from the original dimensional space [13, 25]. At the same time, the feature extraction step also reduces the original high-dimensional image space to a lower-dimensional subspace, as shown in Fig. 1.1b. Texture classification and segmentation partition an image such that each texture can be separated from the other types of textures for efficient machine interpretation and understanding. As stated in the books by Haralick and Shapiro [14, 15], gray level and texture are not independent concepts. Both bear an inextricable relationship similar to a particle and a wave. Hence, their conclusion is that "to characterize texture we must characterize the gray-level primitive properties as well as the spatial relationships between them." Based on this concept, we developed a texture feature called the "characteristic view", which is directly extracted from an image patch corresponding to each texture class. Depending on the number of classes, the K-views template method (called the K-views model, where K is an integer specifying how many templates will be used) is then established to classify texture pixels from those classes based on the characteristic views. Here, the template is identical to the spatial smoothing filter used in digital image processing, such as a 3 × 3 template. The characteristic view is therefore used to characterize image texture features. Figure 1.2 shows one example of K-views templates. The characteristic view concept is based on the assumption that an image taken from nature will frequently reveal repetitions of certain patterns of features. The characteristic views are translation invariant. Different "views" can be derived to form feature vectors from different spatial locations. Several image texture classification algorithms have been developed based on the concept of the characteristic view [1, 2, 19, 20, 23, 24, 30, 31, 38, 43, 44]. We will discuss the strengths and weaknesses of the K-views model and its advanced models in comparison with other models and algorithms in this book. In the following sections, we will explain the concept of image texture in Sect. 1.2, texture features in Sect. 1.3, and image texture classification and segmentation in Sect. 1.4.

Fig. 1.1 a An image texture classification platform that consists of texture feature extraction and texture classification and segmentation is very similar to a traditional pattern recognition system and b texture feature extraction not only extracts intrinsic features but also reduces the three-dimensional image space to a two-dimensional subspace
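To make the two-component platform of Fig. 1.1a concrete, the following minimal Python sketch (an illustration only, not the authors' implementation; the function names and the nearest-centroid rule are our own assumptions) extracts simple window-based features for each pixel and then assigns a class label by comparing the features with representative class features:

import numpy as np

def window_features(image, row, col, size=3):
    # Feature extraction: mean and variance of the gray levels in a size x size window around a pixel
    half = size // 2
    patch = image[max(row - half, 0):row + half + 1, max(col - half, 0):col + half + 1]
    return np.array([patch.mean(), patch.var()])

def classify_pixel(features, centroids):
    # Classification: assign the label of the closest representative feature vector
    labels = list(centroids.keys())
    distances = [np.linalg.norm(features - centroids[label]) for label in labels]
    return labels[int(np.argmin(distances))]

def classify_image(image, centroids, size=3):
    # Apply feature extraction followed by classification to every pixel of the image
    label_map = np.empty(image.shape, dtype=object)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            label_map[r, c] = classify_pixel(window_features(image, r, c, size), centroids)
    return label_map

Here, centroids would be a dictionary that maps each texture class name to a representative feature vector, for example estimated from training patches of each class.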
Fig. 1.2 a A numerical image and b an example of K-views templates, where K is an integer to specify how many templates will be used
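The characteristic-view idea can also be sketched in a few lines of Python. The sketch below is only a rough illustration of the concept (small patches treated as "views", K cluster centers per class kept as characteristic views, and labeling by the nearest characteristic view); it is not one of the K-views algorithms developed later in the book, and all names and parameter choices are hypothetical:

import numpy as np
from scipy.cluster.vq import kmeans2

def extract_views(image, size=3):
    # Collect all size x size views (flattened patches) inside an image or sample patch
    rows, cols = image.shape
    return np.array([image[r:r + size, c:c + size].ravel()
                     for r in range(rows - size + 1)
                     for c in range(cols - size + 1)], dtype=float)

def characteristic_views(sample_patch, k=4, size=3):
    # Summarize the views of one texture class by K representative views (cluster centers)
    centers, _ = kmeans2(extract_views(sample_patch, size), k, minit='++')
    return centers

def label_views(image, class_views, size=3):
    # Give each view the label of the class that owns the closest characteristic view
    labels = []
    for view in extract_views(image, size):
        distances = {name: np.min(np.linalg.norm(centers - view, axis=1))
                     for name, centers in class_views.items()}
        labels.append(min(distances, key=distances.get))
    return labels

Here, class_views would map each class name to the characteristic views computed from a sample patch of that class.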
1.2 Image Texture
Image texture has been an active area of research in pattern recognition and image analysis. Although a formal definition of image texture is not available in the literature, image texture is a natural characteristic in many images that we perceive in our environment. In general, a texture can be informally defined as a set of texture elements (called texels) which occur in some regular or repeated pattern, such as the image in Fig. 1.3a. Some patterns in textures may be random, as shown in Fig. 1.3b. Texton is a commonly used term for the fundamental microstructure in image texture [22]. Many different image textures have been collected in several databases [4, 18]. These image textures include bio-medical image textures such as magnetic resonance imaging (MRI) brain images, natural images such as the Brodatz gallery [5], and material image textures such as the OUTex database [35]. Some of these image textures are shown in Fig. 1.4. There are also many other databases for image textures such as KTH-TIPS [7, 16], the describable texture dataset (DTD) [8], CUReT [9], UIUC [26], the Drexel texture database [36], and UMD [42]. Many approaches for characterizing and classifying image texture have been proposed by researchers [14, 15, 34, 40, 41]. As stated above, these approaches are classified into the categories of (1) statistical methods, (2) structural methods, (3) model-based methods, and (4) transform-based methods [40]. Some of these methods are more or less adopted from statistical pattern recognition [11]. In other words, if the statistical method is used for image texture characterization, an image texture is represented as feature vectors, which can be used by algorithms from statistical pattern recognition. The GLCM is one of the pioneering methods for statistical feature extraction [14, 15]. It describes a texture with spatial relationships by using second-order statistics in a user-defined neighborhood. Features are then extracted from the co-occurrence matrices and used for image texture classification. In the statistical method, a texture is described by a collection of statistics of selected features.
Fig. 1.3 Examples of image textures: a bricks arranged in a regular pattern and b tree leaves grown in a random pattern
Fig. 1.4 Different types of image textures: a tree trunk, b wood ring, c piled tree leaves, and d skin of human face
These features may include first-order statistics, second-order statistics, and higher-order statistics. Then, a histogram is used to represent the distribution of the statistics of the features. Hence, GLCMs and histograms are commonly used in the statistical method [14, 15, 40]. The basic statistical method can be expressed by Eq. 1.1 as shown below:

L = d(F(p))    (1.1)
where p is a pixel (a variable) in the image, F(p) denotes the features of the variable p and its neighboring pixels in a user-specified window, and d(F(p)) is the decision function which maps the features to a class label (L). Hence, d(.) in Eq. 1.1 is an image texture classification algorithm which gives the variable p a class label (L). This is a general model of image texture analysis for the platform shown in Fig. 1.1a. In contrast, for the structural method, a grammar is associated with the spatial relationships that characterize the placement rules of a set of primitives (equivalent to texels or textons) [29]. This will create symbol strings for each texture. Texture categorization and recognition are then performed on the parsing of
symbol strings which describe the spatial relationship. The model-based approach characterizes texture based on the probability distributions in random fields such as Markov random fields (MRF) [34, 41]. The model-based methods for texture analysis include the autoregressive model, Gibbs random fields, the Wold model, and others [34, 41]. As the coefficients of the model-based methods are used to characterize the texture, it becomes critical to choose an appropriate model for the texture and to estimate the coefficients correctly [44]. The transform-based method decomposes an image texture into a set of basis images (also called feature images) by using a bank of filters such as Gabor filters [12, 21]. The transform-based method is a multichannel filtering approach based on studies in psychophysiology [6, 10]. The wavelet transform is another frequently used method in this category. Among the four categories of statistical, structural, model-based, and transform-based methods, each has its own advantages and drawbacks. A combination of different types of texture features extracted using different methods has also been successfully applied to image texture classification.
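As one concrete illustration of the transform-based category, the short Python sketch below builds a small bank of Gabor filters and convolves an image with them to obtain feature images. It is only a sketch of the general idea, not code from this book, and the parameter values (kernel size, wavelength, and so on) are arbitrary choices:

import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=6.0, gamma=0.5, psi=0.0):
    # Real part of a Gabor filter: a Gaussian envelope modulated by a cosine carrier
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * x_t / lambd + psi)

def gabor_feature_images(image, orientations=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # One feature image per filter orientation
    return [convolve(image.astype(float), gabor_kernel(theta=t)) for t in orientations]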
1.3 Texture Features
Texture features describe the characteristics of image textures. Image texture can be characterized by a set of features based on operators specifically defined for extracting textural properties. Another term, texture representation, is frequently used for the extraction of texture features that describe texture information [32]. Popular textural operators include the GLCM [14, 15], the LBP [33, 38, 39], and the texture spectrum (TS) [17]. As texture is considered a set of texture elements (texels) arranged in a particular fashion, these elements are also called texture units [17]. In turn, these texture units are characterized by a specific feature description which is the result of applying the textural operators. Statistical texture features are the most commonly used texture measures. Figure 1.5a shows a simple numerical image which consists of four different textures with 2 × 2 pixels in each texture. The first-order statistics of mean and variance for each texture are calculated and shown in Fig. 1.5b. Second-order statistics refer to the joint probability distribution of pixels in a spatial relationship, such as the gray-level co-occurrence matrices (GLCM) [14, 15]. Features are then used for texture recognition and classification. Similar to any non-textural feature extraction method, defining features from the image textures is the most common task for image classification [14, 15, 29, 34, 40, 41, 44]. However, as most textural operators work on a spatial neighborhood of a pixel (we may refer to it as a patch or kernel) in an image texture, it is critical to define the patches for computing distinguishing features which are suitable for describing the local patterns. In other words, for texture classification, texture features are not a property of a single pixel; they are a characteristic of the spatial neighborhood surrounding the pixel. Liu et al. give a comprehensive survey on image texture representation for texture classification [32]. Texture representation in their survey focuses on the bag-of-words (BoW) and convolutional neural networks (CNN), which have been extensively studied with impressive performance [32].

Fig. 1.5 a A simple numerical image which consists of four different textures, each occupying a 2 × 2 block, and b each texture represented by its mean and variance as features

(a)
1 1 4 4
1 1 4 4
8 8 6 6
8 8 6 6

(b)
Texture Class    Feature #1 (mean)    Feature #2 (variance)
Texture 1        1                    0
Texture 2        4                    0
Texture 3        8                    0
Texture 4        6                    0
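Because most textural operators work on a small spatial neighborhood around each pixel, the following minimal Python sketch of the basic 3 × 3 local binary pattern idea shows how such an operator turns a neighborhood into a single feature code and how a histogram of those codes can serve as a texture feature vector. It is an illustrative implementation of the basic operator only, not the specific LBP variants discussed later in this book:

import numpy as np

def lbp_code(image, row, col):
    # Compare the 8 neighbors with the center pixel and pack the comparisons into an 8-bit code
    center = image[row, col]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr, col + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    # Histogram of the LBP codes over all interior pixels
    rows, cols = image.shape
    codes = [lbp_code(image, r, c) for r in range(1, rows - 1) for c in range(1, cols - 1)]
    return np.bincount(codes, minlength=256)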
1.4 Image Texture Classification and Segmentation
Image texture classification is the process of assigning each pixel of an image to one of many predefined classes (also called clusters). We usually call them spectral classes, as the features are derived from the spectral signatures of pixels. In general, each pixel is treated as an individual unit composed of values in one band (a gray-scale image) or several spectral bands, as in color, multispectral, and hyperspectral images. In other words, the feature space ranges from one-dimensional to high-dimensional. A pixel observed in multiple images can be treated as a vector formed by linking the corresponding pixel values from all of the images. In remote sensing, each image is called a band. Hence, the feature vector of a pixel is formed by taking the corresponding pixel values in the multiple bands. By comparing the feature vector of each pixel with the representative features of the classes, each pixel is assigned a class label based on a similarity measure. Pixels belonging to the same class should contain similar information. Such a class is usually called a spectral class. Some of these spectral classes may be combined to form an informational class in an image due to the variety of spectral signatures for an object or region. We may use different colors to denote different spectral classes in a classified image. Figure 1.6 shows an original remote sensing image, the corresponding classified image (sometimes called a classified map), and the corresponding ground truth map. The ground truth map can then be used for the classification accuracy assessment, which determines the quality of a classification algorithm.
Fig. 1.6 a An original remotely sensed image, b the corresponding classified image, and c the corresponding ground truth map. Please note that the ground truth map may not be available in many applications
Compared with image texture classification, image texture segmentation divides an image into several nonoverlapping and meaningful regions. Similar to general image segmentation, image texture segmentation has been very challenging due to the varying statistical characteristics present in an image, such as sensitivity to noise, the reflectance properties of object surfaces, lighting environments, and so on [3, 14, 15, 34, 41, 44]. Since the goal of image texture segmentation is to interpret the contents of an image, it is necessary to label each pixel after the segmentation. With the labeled pixels, both texture classification and segmentation are used to interpret an image. An image classification or segmentation algorithm for texture takes either a pixel-based or a region-based approach. In a pixel-based classification, spectral and textural information (called features) is commonly used to classify each pixel in an image. In a region-based approach, an image has to be segmented into homogeneous regions, and a set of meaningful features can be defined for each region. Once features are well defined, image regions (blocks/patches) can be categorized using pattern recognition techniques. One of the main drawbacks of the per-pixel classifier is that each pixel is treated independently, without consideration of its neighboring information. Image texture segmentation methods are generally based on two basic characteristics of image textures: discontinuity and similarity. The discontinuity-based approach uses discontinuities of texture features in an image to obtain the boundaries of textures. An example of this approach is the edge detection method. In contrast, the similarity-based approach looks for the homogeneity of texture features associated with the pixels in an image. In this approach, an image is partitioned into regions in which all pixels are similar according to a homogeneity measure of the features. The characteristics of spectral, spatial, and textural information are used as three fundamental feature elements [14, 15]. Even though color is an important feature in image segmentation, there are situations where color measurements may not provide enough information, as color is sensitive to local variations. However, by combining color and texture features, an
image texture segmentation can be improved and a better segmentation result can be obtained. Many image segmentation methods have been proposed and well developed in the past few decades [15, 29, 40]. Due to recent developments in deep machine learning, the man-made feature extraction module (as shown in Fig. 1.1a) is being replaced by deep neural networks such as the CNN shown in Fig. 1.7 [27, 28]. A basic CNN consists of a convolutional layer, which extracts features to form feature maps, and a subsampling layer, which produces shift- and distortion-invariant feature maps. These two layers can be duplicated to form a deeper neural network. A fully connected layer, which is identical to the layers used in traditional artificial neural networks, is then used for image texture classification following the convolutional and sampling layers in many CNNs. There are some hyperparameters associated with the architecture of a CNN that need to be specified for defining and training the CNN. As deep machine learning is advancing rapidly, image texture analysis using deep machine learning will also be included in this book. Image texture classification methods are also categorized into supervised and unsupervised modes. Unsupervised classification algorithms aggregate pixels into natural groupings in an image [3]. Many unsupervised classification methods, such as the K-means clustering algorithm, have been widely used. Supervised classification procedures require considerable analyst interaction. The analyst must guide the classification by identifying areas on the image that belong to each class and calculating statistics from the samples in those areas. A supervised classification algorithm requires the analyst's expertise. The classification result of supervised classification is usually better than that of unsupervised methods.
Fig. 1.7 A typical deep machine learning platform for image texture classification consists of an input layer, convolutional layers for generating texture feature maps, subsampling layers, and the output layer. Both convolutional and sampling layers are duplicated to make a deeper convolutional neural network
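The basic structure described above (convolutional layers producing feature maps, subsampling layers, and a fully connected classifier) can be written down compactly. The following PyTorch sketch is only a generic illustration with arbitrarily chosen layer sizes and input dimensions, not an architecture taken from this book:

import torch
import torch.nn as nn

class SimpleTextureCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer: feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # subsampling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # duplicated layers make the network deeper
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully connected output layer

    def forward(self, x):
        # x is a batch of 64 x 64 gray-scale texture patches
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = SimpleTextureCNN()(torch.randn(8, 1, 64, 64))  # logits has shape (8, num_classes)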
1.5 Summary
We frequently encounter image texture in many applications of pattern recognition and computer vision. Hence, several models and algorithms for image texture analysis have been developed in the literature. Early research in image texture analysis was mainly focused on statistical methods and structural methods from the pattern recognition discipline. Later, model-based and transform-based methods were also introduced to characterize the texture. In the model-based methods, a texture image is described by a probability model. The coefficients of these models are then used to characterize each texture. The key issues are how to choose the correct model that is suitable for the selected texture and how to estimate its coefficients. The transform-based methods use different transformation functions to decompose an image texture into a set of basis (feature) images. An image texture is then projected onto these basis images to derive the coefficients for characterizing textures. The accuracy of image texture classification (and segmentation) depends on two factors: the first is the features that represent the texture, and the second is the algorithm using these features to classify the texture. Both the texture features and the classification algorithm have a significant impact on the classification accuracy. It is only possible for an algorithm to distinguish two texture classes if the two textures are not similar. Sometimes, even when the texture features are well defined, one may obtain different results using different algorithms. For example, if the K-means algorithm is used, the clustering result may not be the same because of its heavy dependency on the initialization of the cluster centers. Due to the rapid advancement of deep machine learning in artificial intelligence, deep CNNs are making a breakthrough in image classification and segmentation. It would be essential for us to further explore image texture analysis with deep CNNs. In the following chapters, we will discuss different methods for measuring and extracting texture features and classifying image textures in detail.
1.6 Exercises
1. Informally define an image texture using your own language.
2. What are image texture features?
3. In terms of image texture feature extraction, contrast and compare the traditional feature extraction method and the CNN approach.
4. What is image texture classification?
5. What is image texture segmentation?
References
References 1. Arasteh S, Hung C-C (2006) Color and texture image segmentation using uniform local binary pattern. Mach Vis Graph 15(3/4):265–274 2. Arasteh S, Hung C-C, Kuo B-C (2006) Image texture segmentation using local binary pattern and color information. In: The proceedings of the international computer symposium (ICS 2006), Taipei, Taiwan, 4–6 Dec 2006 3. Beck J, Sutter A, Ivry R (1987) Spatial frequency channels and perceptual grouping in texture segregation. Comput Vis Graph Image Process 37:299–325 4. Bianconi F, Fernández A (2014) An appendix to texture databases – a comprehensive survey. Pattern Recognit Lett 45:33–38 5. Brodatz P (1966) Textures: a photographic album for artists and designers. Dover Publications, New York 6. Campbell FW, Robson JG (1968) Application of Fourier analysis to the visibility of gratings. J Physiol 197:551–566 7. Caputo B, Hayman E, Mallikarjuna P (2005) Class-specific material categorization. In: ICCV 8. Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A (2014) Describing textures in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) 9. Dana KJ, van Ginneken B, Nayar SK, Koenderink JJ (1999) Reflectance and texture of real world surfaces. ACM Trans Graph 18(1):1–34 10. Devalois RL, Albrecht DG, Thorell LG (1982) Spatial -frequency selectivity of cells in macaque visual cortex. Vis Res 22:545–559 11. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Morgan Kaufmann 12. Garber D (1981) Computational models for texture analysis and texture synthesis, University of Southern California, USCIPI Report 1000, Ph.D. thesis 13. Guyon I, Gunn S, Nikravesh M, Zadeh L (2006) Feature extraction: foundations and applications. Springer 14. Haralick RM (1979) Statistical and structural approaches to texture. In: Proceedings of IEEE, vol 67, issue 5. pp 786–804 15. Haralick RM, Sharpio L (1992) Computer and Robot vision, vol I, II. Addison-Wesley 16. Hayman E, Caputo B, Fritz M, Eklundh J-O (2004) On the significance of real-world conditions for material classification. In: ECCV 17. He D-C, Wang L (1990) Texture unit, texture spectrum, and texture analysis. IEEE Trans Geosci Remote Sens 28(4):509–512 18. Hossain S, Serikawa S (2013) Texture databases – a comprehensive survey. Pattern Recognit Lett 34(15):2007–2022 19. Hung C-C, Pham M, Arasteh S, Kuo B-C, Coleman T (2006) Image texture classification using texture spectrum and local binary pattern. In: The 2006 IEEE international geoscience and remote sensing symposium (IGARSS), Denver, Colorado, USA, 31 July−4 Aug 2006 20. Hung C-C, Yang S, Laymon C (2002) Use of characteristic views in image classification. In: Proceedings of 16th international conference on pattern recognition, pp 949–952 21. Ji Y, Chang K-H, Hung C-C (2004) Efficient edge detection and object segmentation using gabor filters. In: ACMSE, Huntsville, Alabama, USA, 2–3 April 2004 22. Julesz B, Bergen JR (1983) Textons, the fundamental elements in preattentive vision and perception of textures. Bell Syst Tech 62:1619–1645 23. Lan Y, Liu H, Song E, Hung C-C (2010) An improved K-view algorithm for image texture classification using new characteristic views selection methods. In: Proceedings of the 25th association of computing machinery (ACM) symposium on applied computing (SAC 2010) – computational intelligence and image analysis (CIIA) track, Sierre, Swizerland, 21–26 March 2010, pp 960−964 24. 
Lan Y, Liu H, Song E, Hung C-C (2011) A comparative study and analysis on K-view based algorithms for image texture classification. In: Proceedings of the 26th association of computing machinery (ACM) symposium on applied computing (SAC 2011) – computational intelligence, signal and image analysis (CISIA) track, Taichung, Taiwan, 21–24 March 2011
25. Landgrebe D (2003) Signal theory methods in multispectral remote sensing. Wiley-Interscience 26. Lazebnik S, Schmid C, Ponce J (2005) A sparse texture representation using local affine regions. IEEE Trans PAMI 28(8):2169–2178 27. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10. 1038/nature14539 28. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp 1–44 29. Levine MD (1985) Vision in man and machine. McGraw-Hill 30. Liu H, Dai S, Song E, Yang C, Hung C-C (2009) A new K-view algorithm for texture image classification using rotation-invariant feature. In: Proceedings of the 24th association of computing machinery (ACM) symposium on applied computing (SAC 2009) – computational intelligence and image analysis (CIIA) track, Honolulu, Hawaii, 8–12 March 2009, pp 914−921 31. Liu H, Lan Y, Wang Q, Jin R, Song E, Hung C-C (2012) A fast weighted K-view-voting algorithm for image texture classification. Opt Eng 51(02), 1 Feb 2012. https://doi.org/10. 1117/1.oe.51.2.027004 32. Liu L, Chen J, Fieguth P, Zhao G, Chellappa R, Pietikainen M (2018) BoW meets CNN: two decades of texture representation. Int J Comput Vis 1–26. https://doi.org/10.1007/s11263018-1125-z 33. Maeanpaa T (2003) The local binary pattern approach to texture analysis – extensions and applications, Oulu Yliopisto, Oulu 34. Materka A, Strzelecki M (1998) Texture analysis methods – a review, Technical University of Lodz, Institute of Electronics, COST B11 report, Brussels 35. Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 24 (7):941–987 36. Oxholm G, Bariya P, Nishino K (2012) The scale of geometric texture. In: European conference on computer vision. Springer, Berlin/Heidelberg, pp 58–71 37. Pietikainen MK (2000) Texture analysis in machine vision (ed). Series in machine perception and artificial intelligence, vol 40. World Scientific 38. Song EM, Jin R, Lu Y, Xu X, Hung C-C (2006) Boundary refined texture segmentation on liver biopsy images for quantitative assessment of fibrosis severity. In: Proceedings of the SPIE, San Diego, CA, USA, 11–15 Feb 2006 39. Song EM, Jin R, Hung C-C, Lu Y, Xu X (2007) Boundary refined texture segmentation based on K-views and datagram method. In: Proceedings of the 2007 IEEE international symposium on computational intelligence in image and signal processing (CIISP 2007), Honolulu, HI, USA, 1–6 April 2007, pp 19–23 40. Sonka M, Hlavac V, Boyle R (1999) Image processing, analysis, and machine vision, 2nd edn. PWS Publishing 41. Tuceryan M, Jain AK (1998) Texture analysis. In: Chen CH, Pau LF, Wang PSP (eds) The handbook of pattern recognition and computer vision, 2nd edn. World Scientific Publishing Company, pp 207–248 42. Xu Y, Ji H, Fermuller C (2009) Viewpoint invariant texture description using fractal analysis. IJCV 83(1):85–100 43. Yang S, Hung C-C (2003) Image texture classification using datagrams and characteristic views. In: Proceedings of the 18th ACM symposium on applied computing (SAC), Melbourne, FL, 9–12 March 2003, pp 22–26 44. Zhang J, Tan T (2002) Brief review of invariant texture analysis methods. Pattern Recognit 35:735–747
2 Texture Features and Image Texture Models
A smooth sea never made a skillful sailor. —American proverb
Image texture is an important phenomenon in many applications of pattern recognition and computer vision. Hence, several models for deriving texture properties have been proposed and developed. Although there is no formal definition of image texture in the literature, image texture is usually considered the spatial arrangement of grayscale pixels in a neighborhood on the image. In this chapter, some widely used image texture methods for measuring and extracting texture features will be introduced. These textural features can then be used for image texture classification and segmentation. Specifically, the following methods will be described: (1) the gray-level co-occurrence matrices (GLCM) which is one of the earliest methods for image texture extraction, (2) Gabor filters, (3) wavelet transform (WT) model and its extension, (4) autocorrelation function, (5) Markov random fields (MRF), (6) fractal features, (7) variogram, (8) local binary pattern (LBP), and (9) texture spectrum (TS). LBP has been frequently used for image texture measure. MRF is a statistical model which has been well studied in image texture analysis and other applications. There is one common property associated with these methods and models which use the spatial relationship for texture measurement and classification.
2.1 Introduction
Texture is used to describe a region in which textural elements are characterized in a spatial relationship. An image may consist of one or more textures. If multiple textures exist in an image, the boundary between two textures can be detected and discriminated using texture measure. Texture measure can provide important information for image segmentation, feature extraction, and image classification. Texture measure is very useful in the interpretation of images taken from satellite remote sensing, medical magnetic resonance imaging, materials science, and aerial
terrain photographs. For example, the study of urban and rural land development in satellite images can benefit from using the image texture analysis. There exist many texture measures for characterizing a texture. The characterization is called texture feature. Regional properties such as coarseness, homogeneity, density, fineness, smoothness, linearity, directionality, granularity, and frequency have been frequently used as texture features. Many approaches which consist of autocorrelation functions, grayscale co-occurrence matrices, and LBP are used to describe and extract texture features in an image. All of these approaches fall in the four categories: statistical methods, structural methods, model-based methods, and transform-based methods [23–25, 63]. In most cases, texture features are represented numerically by feature vectors, which are composed of feature components derived from a neighborhood of the corresponding texture class. Each of these approaches has its own advantages and drawbacks. For example, statistical approach is suitable for micro-texture (i.e., random texture) while structural approach is good for macro-texture or well-defined texture patterns such as periodic texture [53, 54]. Many methods for describing texture features depend on the parameters used such as the neighborhood size for a texture region, the quantization of gray levels, and the orientation to measure the relationship among pixels such as the distance and angle [25]. Once the texture features are extracted from a texture, the next phase is to perform the image texture classification (or segmentation). Image texture classification methods fall in one of the two major categories; the first category is based on features with a high degree of spatial localization. In this category, most edge detection methods can use texture features for spatial localization. The major problem with this type of approach is that it is difficult to distinguish the texture boundaries and the micro-edge found in the same texture. The second category is based on discrimination functions with texture features. The classification accuracy in this approach depends on the discriminative power of texture features. The most important step for the second approach is to extract texture features, which can discriminate different types of textures in the image [16, 18, 34, 36, 55, 61, 64]. Haralick et al. have defined a primitive as a set of connected pixels characterized by a list of attributes [23–25]. The primitive is a texture element which can be called “texel” or “texton”. A texture pattern can be described by a primitive or a set of primitives. The smallest primitive is a pixel itself. The properties of a primitive are distributed in the neighborhood. This concept has been used in many approaches, and among which are texture unit spectrum [26, 27] and local binary pattern [42, 50]. In the texture unit spectrum approach, a textural element is a texture unit while in the local binary pattern, the texture element, the neighborhood, is called the texture primitive. Both the texture unit spectrum and the local binary pattern approaches are considering only the spatial relationship not the color features. Color images are seen everywhere nowadays. Color features are potentially useful for texture classification. Hence, many texture classifiers have combined both color and texture features to improve the discriminative capability [1]. Color features were used with texture features to improve the performance on color texture
classification [2]. The LBP features and a histogram-based method were combined for texture classification on color images [1]. Each texture class has a specific pattern that distinguishes the class from others. To determine whether a pixel belongs to a texture class, the neighborhood of the pixel (a small image patch) is taken, and the image patch is examined to measure its similarity to one of the texture and color spectrum classes. Fractal features (frequently called fractal dimensions) are useful in characterizing certain types of objects. Mandelbrot observed that "clouds are not spheres, mountains are not cones, coastlines are not circles, and bark is not smooth, nor does lightning travel in a straight line" [45]. Apparently, fractal features are suitable for measuring those objects and regions that have irregular geometry. Frequency features are also very useful in describing certain types of textures. A combination of different types of features will improve image texture classification and segmentation if we properly select some of them for the application.
2.2 Gray-Level Co-Occurrence Matrix (GLCM)
The gray-level co-occurrence matrix (GLCM) is one of the earliest methods using spatial relationships for describing image texture features [23–25]. The GLCM is a statistical method which calculates properties of the spatial relationships among pixels. The spatial relationships between pairs of pixels in a neighborhood are recorded in the co-occurrence matrices, which are then used to calculate textural features [23–25]. The spatial relationships are measured using distances and angles between two pixels in a textured region. This relationship is the joint probability density of two pixels for the transition (or relation) of gray levels at the two locations. The GLCM matrices are not used directly; instead, certain texture features, such as correlation, homogeneity, and others, can be derived from the GLCM [23–25].

The GLCM is a two-dimensional matrix in which each element p_ij represents the frequency of occurrence of a pair of pixels (where i and j are the gray levels) in a spatial relation separated by distance d and angle α. Let G be an image texture of size M × N. An element p_ij can be calculated by counting the number of relationships with the following equation:

$$p_{ij} = \left|\{\,G(m, n) = i,\; G(m + d, n + d) = j\,\}\right| \quad \text{for each } \alpha \tag{2.1}$$

where m = 0, 1, ..., M−1, n = 0, 1, ..., N−1, and α = 0°, ..., 360°. Suppose Q is the number of pairs of pixels with gray levels i and j separated by d and α. The probability for p_ij is the frequency count in p_ij divided by Q. A hypothetical image and its GLCMs are shown in Example 2.1.

Example 2.1 An example of GLCMs with angles at 0°, 45°, 90°, and 135° and a distance of one unit. Please note that we use the symmetric GLCMs for the frequency count. (a) a hypothetical textured region with a size of 4 × 4 and a value assigned to each pixel, (b) an example of frequency counting in a GLCM recording
the spatial relationships for pairs of pixels, and (c), (d), (e), and (f) are the resulting GLCMs with α = 0°, 45°, 90°, and 135°, respectively.

(a) A numerical image:

0 1 2 2
1 0 2 2
1 0 1 0
0 1 1 0

(b) Frequency count for each pair of pixels for the spatial relationships:

Gray-Level   0        1        2
0            #(0,0)   #(0,1)   #(0,2)
1            #(1,0)   #(1,1)   #(1,2)
2            #(2,0)   #(2,1)   #(2,2)

(c) α = 0°:

4 4 1
4 4 1
1 1 4

(d) α = 45°:

2 2 2
2 4 1
2 1 2

(e) α = 90°:

0 7 1
7 2 1
1 1 4

(f) α = 135°:

1 4 1
4 2 1
1 1 2

Haralick et al. proposed several textural features based on the GLCM method [23–25]. Some texture features, including cluster tendency (Clu), contrast (Con), correlation (Cor), dissimilarity (D), entropy (E), homogeneity (H), inverse difference moment (I), maximum probability (M), and uniformity of energy (U), can be computed from the GLCMs as shown in Eqs. 2.2−2.10. Here p_ij, similarly defined as in Eq. 2.1, represents the probability of occurrence of pairing pixels as defined above, μ is the overall mean, σ is the standard deviation, i and j are the row and column indexes of gray levels in the co-occurrence matrices, and |·| denotes the cardinality.
$$\mathrm{Clu} = \text{Cluster Tendency} = \sum_{i,j} (i + j - 2\mu)^2\, p_{ij} \tag{2.2}$$

$$\mathrm{Con} = \text{Contrast} = \sum_{i,j} |i - j|^2\, p_{ij} \tag{2.3}$$

$$\mathrm{Cor} = \text{Correlation} = \frac{\sum_{i=0}^{M-1}\sum_{j=0}^{N-1} i\, j\, p_{ij} - \mu_1 \mu_2}{\sigma_1^2\, \sigma_2^2} \tag{2.4}$$

where

$$\mu_1 = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} i\, p_{ij}, \qquad \mu_2 = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} j\, p_{ij},$$

$$\sigma_1^2 = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} (i - \mu_1)^2\, p_{ij}, \qquad \sigma_2^2 = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} (j - \mu_2)^2\, p_{ij}$$

$$D = \text{Dissimilarity} = \sum_{i,j} \frac{p_{ij}}{1 + |i - j|} \tag{2.5}$$

$$E = \text{Entropy} = -\sum_{i,j} p_{ij} \log p_{ij} \tag{2.6}$$

$$H = \text{Homogeneity} = \sum_{i,j} \frac{p_{ij}}{1 + |i - j|} \tag{2.7}$$

$$I = \text{Inverse Difference Moment} = \sum_{\substack{i,j \\ i \ne j}} \frac{p_{ij}}{|i - j|^k} \tag{2.8}$$

$$M = \text{Maximum Probability} = \max_{i,j}\, p_{ij} \tag{2.9}$$

$$U = \text{Uniformity of Energy} = \sum_{i,j} p_{ij}^2 \tag{2.10}$$
Haralick and Bosley [24] performed classification experiments on multi-images using features calculated from the GLCM. These features include uniformity of energy, entropy, maximum probability, contrast, inverse difference moment, correlation, probability of a run of length, homogeneity, and cluster tendency [23–25]. The feature statistics calculated in Eqs. 2.2−2.10 are then mapped into the corresponding feature vectors as in Eq. 2.11, and the classification can be performed using any clustering technique such as the K-means algorithm:

$$\text{feature vector} = (\mathrm{Clu}, \mathrm{Con}, \mathrm{Cor}, D, E, H, I, M, U)^T, \quad \text{where } T \text{ is the transpose} \tag{2.11}$$

Please note that there is no specific order in the list of features for a feature vector. Figure 2.1 shows a diagram for extracting textural features to form a set of feature images.

Fig. 2.1 A diagram showing the flow of extracting textural features using the co-occurrence matrices to form a set of feature images for image texture classification

The GLCM is widely used for land cover classification because of its effectiveness. The main drawbacks of the GLCM are its sensitivity to background noise, the difficulty of determining proper spatial parameters, and its extensive computation [23–25]. Example 2.2 demonstrates the calculation results based on the GLCM in Example 2.1.

Example 2.2 Each element in the GLCM is converted to a probability, and then the features are calculated based on Eqs. 2.2−2.10.
(a) α = 0°:

0.17 0.17 0.04
0.17 0.17 0.04
0.04 0.04 0.17

(b) α = 45°:

0.11 0.11 0.11
0.11 0.22 0.06
0.11 0.06 0.11

(c) α = 90°:

0.0  0.29 0.04
0.29 0.08 0.04
0.04 0.04 0.17

(d) α = 135°:

0.06 0.24 0.06
0.24 0.12 0.06
0.06 0.06 0.12

Hence, the feature vector is feature vector = (Clu, Con, Cor, D, E, H, I, M, U)^T. For the four different directions, four feature vectors are calculated as follows:

α = 0°:   (4.04, 0.74, −15.84, 0.75, 2.92, 0.75, 0.44, 0.17, 0.15)^T
α = 45°:  (4.02, 1.22, −19.54, 0.68, 2.82, 0.68, 0.40, 0.22, 0.13)^T
α = 90°:  (3.91, 0.98, −19.85, 0.61, 2.51, 0.61, 0.68, 0.29, 0.21)^T
α = 135°: (3.68, 1.08, −18.77, 0.64, 2.94, 0.64, 0.63, 0.24, 0.16)^T

Then, the average of the above four feature vectors is used to represent a feature for a pixel (or neighborhood) for the classification. Alternatively, we may combine all four feature vectors with corresponding weights, which can be calculated as the ratio of the norm of each feature vector over the sum of the norms of the four feature vectors.
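As an illustration, the following Python sketch builds a symmetric GLCM for the hypothetical 4 × 4 image of Example 2.1 and derives a few of the features in Eqs. 2.2−2.10. It is a minimal sketch, not the book's implementation: the displacement, the number of gray levels, the selected features, and the logarithm base for the entropy are assumptions for illustration.

```python
import numpy as np

def symmetric_glcm(image, offset, levels):
    """Count co-occurrences of gray levels for one displacement (dr, dc),
    accumulating both (i, j) and (j, i) to obtain a symmetric GLCM, then
    normalize the counts to probabilities p_ij."""
    glcm = np.zeros((levels, levels), dtype=float)
    rows, cols = image.shape
    dr, dc = offset
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r, c], image[r2, c2]
                glcm[i, j] += 1
                glcm[j, i] += 1
    return glcm / glcm.sum()

def glcm_features(p):
    """A few of the GLCM features of Eqs. 2.2-2.10."""
    levels = p.shape[0]
    i, j = np.indices((levels, levels))
    contrast = np.sum(np.abs(i - j) ** 2 * p)                 # Eq. 2.3
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))            # Eq. 2.6 (log base 2 assumed)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))            # Eq. 2.7
    energy = np.sum(p ** 2)                                    # Eq. 2.10
    return contrast, entropy, homogeneity, energy

image = np.array([[0, 1, 2, 2],
                  [1, 0, 2, 2],
                  [1, 0, 1, 0],
                  [0, 1, 1, 0]])
p = symmetric_glcm(image, offset=(0, 1), levels=3)   # one-pixel displacement
print(np.round(p, 2))
print(np.round(glcm_features(p), 3))
```

In practice, one such probability matrix and feature vector would be computed for each of the four directions and then averaged or weighted, as described above.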
2.3 Gabor Filters
Gabor filtering is a time–frequency analysis method, introduced in 1946 by Dennis Gabor [20]. Gabor filters are the product of a Gaussian with sine or cosine functions at different frequencies and different orientations. Gabor filtering provides
a method of textural feature extraction which is widely used for textural classification and analysis [3, 19, 31, 33, 35, 46, 58, 62, 68, 70]. As a matter of fact, Gabor filters can be viewed as detectors of edges and lines (stripes) with both directions and scales. The statistics of these microstructure features in a given textured region can be used to represent textural features. Gabor filters, also called Gabor wavelet transforms, were developed based on the multichannel filtering theory for visual information processing, which emerged from studies of the human visual system. This visual system theory was proposed by Campbell and Robson [6]. Their psychophysical experiments suggested that the mammalian vision system decomposes the image received by the retina into a number of filtered images, each of which contains intensity variations over a narrow range of frequency (size) and orientation. Hence, spatial domain features and frequency domain features should be used simultaneously for characterizing textural features. Gabor filters have been shown to give the best description of the signal in the space domain and the frequency domain in the sense of the two-dimensional uncertainty [20].

There are several different approaches for performing the Gabor transform, and some methods are computationally intensive. We introduce the multichannel filtering technique, which uses specified filters to select the information at particular space/frequency points. Gabor filters implemented as multichannel and wavelet-like filters mimic the characteristics of the human visual system (HVS) [10, 17, 60]. Although the wavelet transform is an excellent technique used in image processing, Nestares et al. point out that the Gabor expansion is the only biologically plausible filter with orientation selectivity that can be exactly expressed as a sum of only two separable filters [49]. This unique property has made the Gabor filter an important transformation in image processing and computer vision.

There are several forms of the 2-D Gabor filter; a version similar to Daugman's model, which is used to describe the spatial summation properties (of the receptive fields) of simple cells in the visual cortex, is defined in Eqs. 2.12 and 2.13 [12, 13]:

$$G(x, y) = \frac{\alpha\beta}{\pi}\, \exp\!\left(-\left(\alpha^2 x_g^2 + \beta^2 y_g^2\right)\right) \exp\!\left(j 2\pi f_0 (x_g, y_g)\right) \tag{2.12}$$

$$\text{with } x_g = x\cos\theta + y\sin\theta \quad \text{and} \quad y_g = -x\sin\theta + y\cos\theta \tag{2.13}$$

where the arguments x and y specify the position in the image, f_0 is the central frequency of a sinusoidal plane wave, θ is the counterclockwise rotation of the Gaussian and the plane wave, and α and β are the sharpness values of the major and minor axes of the elliptic Gaussian. When an image is processed with a Gabor filter, the output is the convolution of the image I(x, y) with the Gabor function G_k(x, y), i.e.,

$$R_k(x, y) = G_k(x, y) * I(x, y) \tag{2.14}$$

where * denotes two-dimensional convolution. This process can be repeated at different frequencies and different orientations, and the result is a multichannel filter bank. Figure 2.2 illustrates a multichannel filtering system.
Fig. 2.2 A Gabor multichannel filtering system. The operator |·| is the magnitude operator, and G_k(x, y) is the Gabor function in the kth channel, which denotes a specific frequency and orientation [33]
In Fig. 2.2, the symbol |·| is the magnitude operator, and G_k(x, y) is the Gabor function in the kth channel, which denotes a specific frequency and orientation. With the multichannel filtering system, an image is processed by all the channels simultaneously. The result is a stack of filtered feature images defined at the various frequencies and orientations corresponding to all the channels in the system. Hence, the characteristic (textural) features at each particular frequency and individual orientation can be obtained. Although Clausi and Jernigan [10] concluded that using orientations every 30° provides a robust and universal feature set, it is very common to use the four values 0°, 45°, 90°, and 135° to save computation time. According to the literature [3, 10, 17, 19, 20, 31, 33, 35, 46, 49, 58, 60, 62, 67, 68, 70], the central frequency is selected according to the image dimension. Low frequencies correspond to smooth variations and constitute the base of an image, while high frequencies carry the edge information, which gives the detailed information in the image. Using different frequencies and orientations, the Gabor multichannel filters represent an image as multiple feature images.
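The Python sketch below builds a complex Gabor kernel in the spirit of Eqs. 2.12−2.13 and filters an image with one channel per (frequency, orientation) pair, keeping the magnitude response as in Fig. 2.2. It is a minimal sketch under assumptions: the complex carrier modulates the rotated x-coordinate (a common form), and the filter size, α, β, and the frequency set are illustrative values, not the book's.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, f0, theta, alpha, beta):
    """Elliptic Gaussian envelope modulated by a complex sinusoid of central
    frequency f0, rotated by theta (cf. Eqs. 2.12-2.13)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_g = x * np.cos(theta) + y * np.sin(theta)
    y_g = -x * np.sin(theta) + y * np.cos(theta)
    envelope = (alpha * beta / np.pi) * np.exp(-(alpha**2 * x_g**2 + beta**2 * y_g**2))
    carrier = np.exp(1j * 2.0 * np.pi * f0 * x_g)
    return envelope * carrier

def gabor_features(image, frequencies, orientations, size=31, alpha=0.1, beta=0.1):
    """One filtered feature image per channel; |.| is the magnitude operator."""
    responses = []
    for f0 in frequencies:
        for theta in orientations:
            kernel = gabor_kernel(size, f0, theta, alpha, beta)
            r_real = convolve2d(image, kernel.real, mode='same', boundary='symm')
            r_imag = convolve2d(image, kernel.imag, mode='same', boundary='symm')
            responses.append(np.sqrt(r_real**2 + r_imag**2))
    return np.stack(responses)

image = np.random.rand(64, 64)                      # placeholder texture patch
feats = gabor_features(image, frequencies=[0.1, 0.2],
                       orientations=np.deg2rad([0, 45, 90, 135]))
print(feats.shape)                                  # (8, 64, 64): 2 frequencies x 4 orientations
```

Each channel's magnitude image can then be sampled per pixel (or per neighborhood) to build the textural feature vector used by a classifier.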
2.4 Wavelet Transform (WT) Model and Its Extension
The Fourier transform (FT) is a useful technique which can reveal the frequency information of a signal in the Fourier domain [9, 22]. The FT decomposes an image into a summation of sine and cosine functions with different phases and frequencies; these sine and cosine functions constitute a set of basis functions for the FT. Due to its global nature, it is difficult for the FT to capture local transient signals. The short-time Fourier transform (STFT) was therefore proposed to obtain the frequency and phase content of local transient signals [1]. The window size used in the STFT is fixed, and this limitation has some drawbacks [1]: a wide window is good for frequency localization but poor for time localization, while a narrow window is poor for frequency localization but good for time localization.
It would be more appropriate to have a window of varying length, which can capture both low-frequency and high-frequency signals. The wavelet transform (WT) was therefore proposed to overcome these problems [9, 22]. The WT has the properties of varying frequency, limited duration, and zero average value. This is quite different from the FT, which uses sinusoidal functions with infinite duration and constant frequency. The contrast between the WT and the FT is shown in Fig. 2.3.

Fig. 2.3 Two different wave functions: a infinite duration (i.e., sinusoid) used in the Fourier transform and b limited duration (i.e., wavelet) [11]

The WT has been widely used in many applications such as medical image diagnosis, coding theory, and image compression. There is a key difference between the FT and the WT [9, 22, 43]: Fourier basis functions are localized in frequency but not in time/position, so a small change in frequency in the FT produces changes everywhere in the time domain. The WT, however, is localized both in frequency (also called scale) via dilations and in time via translations. In addition, many functions can be represented by the WT in a more compact way than with the sine and cosine functions used in the FT.
Similar to other basis functions (to be discussed in Chap. 4), wavelet basis functions are used to decompose a complex function. The wavelet transform, such as the Haar transform, is built from a set of wavelet basis functions, and the transformation usually consists of several such basis functions for extracting certain properties. In the wavelet transform, a basic function, called the mother wavelet, is scaled and translated to generate several wavelet basis functions; these generated functions are called the child wavelets [11, 57, 64]. These wavelet basis functions span the spatial frequency domain of an image. There are three types of wavelet transforms: the continuous wavelet transform, the discrete wavelet transform, and the wavelet series expansion [9, 22]. Every type of wavelet transform has the scaling and translation properties; hence, a Haar function is specified by a dual indexing scheme [9]. A wavelet function fluctuates above and below a horizontal axis with the properties of varying frequency, limited duration, and zero average value. Many different wavelet functions satisfy these three properties; Fig. 2.4 shows some examples, namely the Haar, Morlet, and Daubechies wavelets [9, 11, 22].

Among the many discrete wavelets developed in the literature, we introduce the Haar wavelets [9, 22]. The Haar wavelet transform is one of the earliest orthonormal wavelet transforms. Similar to the sine and cosine functions in the FT, a function f(t) can be expressed in terms of a set of Haar basis functions w_k(t) as shown below:

$$f(t) = \sum_{k} a_k\, w_k(t) \tag{2.15}$$
where k indexes the basis functions and a_k is the coefficient for the kth basis function. The basis functions can be constructed by applying translations and scalings (stretching/compressing) to the mother wavelet w(t), as represented in Eq. 2.16 with translation parameter τ and scaling parameter s. For example, if we take a mother wavelet as shown in Fig. 2.5a, we can construct many wavelet basis functions, as shown in Fig. 2.5b, by applying translation and scaling. It is convenient to constrain the values of the translation parameter τ and the scaling parameter s with s = 2^j and τ = k·2^j, where k and j are integers. An example with different integers k and j is shown in Fig. 2.5c.

$$w(s, \tau, t) = \frac{1}{\sqrt{s}}\, w\!\left(\frac{t - \tau}{s}\right) \tag{2.16}$$

We assume that a family of N Haar basis functions, h_k(t) for k = 0, ..., N−1 with N = 2^n, is defined on the interval 0 ≤ t ≤ 1. The shape of each function h_k(t) with index k is determined by Eq. 2.17:

$$k = 2^p + (q - 1) \tag{2.17}$$
Fig. 2.4 Examples of wavelet functions: a Haar wavelet, b Morlet wavelet, and c Daubechies wavelet [11]
Fig. 2.5 a A mother wavelet w(t) and b the wavelet basis functions obtained by applying the translations and scalings to w(t) with values of s = 2^j and τ = k·2^j. All wavelets are generated using the equation listed at the top of each wavelet, where the parameter b in each equation is the value of the x-coordinate in each coordinate system
For an index k, the parameters p and q are uniquely determined so that 2^p is the largest power of two satisfying 2^p ≤ k, and q − 1 is the remainder k − 2^p. Now, the Haar functions can be defined recursively. If k = 0, the Haar function is defined as a constant (Eq. 2.18):

$$h_0(t) = \frac{1}{\sqrt{N}} \tag{2.18}$$
If k > 0, the Haar function is defined as in Eq. 2.19 [9, 22]:

$$h_k(t) = \frac{1}{\sqrt{N}} \begin{cases} 2^{p/2}, & (q-1)/2^p \le t < (q-0.5)/2^p \\ -2^{p/2}, & (q-0.5)/2^p \le t < q/2^p \\ 0, & \text{otherwise} \end{cases} \tag{2.19}$$
From the definition, it can be seen that p determines the amplitude and width of the nonzero part of the function (i.e., the scaling parameter), while q decides the position of the nonzero part of the function (i.e., the translation parameter). To use the Haar transform, Example 2.3 shows the Haar wavelet transform with a set of eight wavelet basis functions (i.e., N = 8) by specifying the dual indexing scheme of p and q and their relationship with k.

Example 2.3 The following example illustrates the steps listed above. If we take the Haar wavelet transform with N = 8 (t is between 0 and 1), the index k and its corresponding p and q values are calculated based on Eq. 2.17 and are shown in the following table:

k: 0 1 2 3 4 5 6 7
p: 0 0 1 1 2 2 2 2
q: 0 1 1 2 1 2 3 4
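To make the dual indexing concrete, here is a minimal Python sketch (ours, for illustration) that computes (p, q) from k via Eq. 2.17 and evaluates h_k(t) following Eqs. 2.18−2.19. Sampling h_k at t = m/N, as in Eq. 2.20 below, reproduces the Haar matrices used in the following examples.

```python
import numpy as np

def haar_index(k):
    """Decompose an index k > 0 into (p, q) so that k = 2**p + (q - 1), Eq. 2.17."""
    p = int(np.floor(np.log2(k)))
    q = k - 2**p + 1
    return p, q

def haar_function(k, t, N):
    """Evaluate h_k(t) on [0, 1) following Eqs. 2.18-2.19."""
    if k == 0:
        return 1.0 / np.sqrt(N)
    p, q = haar_index(k)
    lo, mid, hi = (q - 1) / 2**p, (q - 0.5) / 2**p, q / 2**p
    if lo <= t < mid:
        return 2**(p / 2) / np.sqrt(N)
    if mid <= t < hi:
        return -2**(p / 2) / np.sqrt(N)
    return 0.0

N = 8
print([haar_index(k) for k in range(1, N)])     # (p, q) pairs for k = 1..7
# Sampling each h_k at t = m/N gives the rows of the discrete Haar matrix.
H = np.array([[haar_function(k, m / N, N) for m in range(N)] for k in range(N)])
print(np.round(H, 3))
```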
Similar to the rotation, scaling, and translation operations used in image geometry and computer graphics, the Haar transformation can be implemented with a kernel transformation matrix [9, 22]. The Haar transform, like the Fourier transform, can be represented in matrix form. For example, assume that the N Haar functions are sampled at the points t defined in Eq. 2.20:

$$t = \frac{m}{N}, \quad m = 0, 1, \ldots, N-1 \tag{2.20}$$

to form an N × N matrix for the discrete Haar transform. If N = 2, we have

$$H_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{2.21}$$
A simple Haar transformation using a Haar transformation matrix is shown in Example 2.4. In addition, a normalization of the Haar transformation will make the reconstructed images smoother.

Example 2.4 Assume a feature vector with four components f_v = [1, 2, 3, 1]^T. The 4 × 4 Haar transformation matrix is

$$H_4 = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ \sqrt{2} & -\sqrt{2} & 0 & 0 \\ 0 & 0 & \sqrt{2} & -\sqrt{2} \end{bmatrix} \tag{2.22}$$
The Haar transformation coefficients are obtained by multiplying H_4 by f_v:

$$H_{\text{Coefficients}} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ \sqrt{2} & -\sqrt{2} & 0 & 0 \\ 0 & 0 & \sqrt{2} & -\sqrt{2} \end{bmatrix} \times \begin{bmatrix} 1 \\ 2 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 7/2 \\ -1/2 \\ -1/\sqrt{2} \\ \sqrt{2} \end{bmatrix} \tag{2.23}$$
The inverse transform will express the feature vector as the linear combination of the basis functions:

$$H_{\text{Inverse}} = \frac{1}{2} \begin{bmatrix} 1 & 1 & \sqrt{2} & 0 \\ 1 & 1 & -\sqrt{2} & 0 \\ 1 & -1 & 0 & \sqrt{2} \\ 1 & -1 & 0 & -\sqrt{2} \end{bmatrix} \times \begin{bmatrix} 7/2 \\ -1/2 \\ -1/\sqrt{2} \\ \sqrt{2} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 1 \end{bmatrix} \tag{2.24}$$
Note that the functions h_k(t) of the Haar transform (defined in Eq. 2.19) can represent not only the details in signals at different scales (corresponding to different frequencies) but also their locations in time. A set of Haar wavelet basis functions is shown in Fig. 2.6 [9]. The Haar transform matrix is real and orthogonal:

$$H = H^{*}, \quad H^{-1} = H^{T}, \quad \text{i.e.,} \quad H^{T} H = I \tag{2.25}$$

where I is an identity matrix. For example, when N = 4, we get

$$H_4^T H_4 = H_4 H_4^T = \frac{1}{4} \begin{bmatrix} 1 & 1 & \sqrt{2} & 0 \\ 1 & 1 & -\sqrt{2} & 0 \\ 1 & -1 & 0 & \sqrt{2} \\ 1 & -1 & 0 & -\sqrt{2} \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ \sqrt{2} & -\sqrt{2} & 0 & 0 \\ 0 & 0 & \sqrt{2} & -\sqrt{2} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{2.26}$$
Fig. 2.6 A set of Haar wavelet basis functions
The Haar transform was frequently used for signal and image compression, and noise removal before the more advanced wavelet transforms such as Daubechies and other wavelets were developed. The Haar transform conserves the energy of signals and possesses the compaction of energy [65]. A general discrete wavelet transform (DWT) can be implemented by two basic operations using the averaging and differencing operators on a feature vector [32, 47]. The averaging operator is similar to the low-pass filter such as the averaging filter while the differencing operator is similar to the high-pass filter such as the Laplacian filter commonly used in digital image processing. To calculate the transform of an array of n samples, the following DWT algorithm is used.
A General Discrete Wavelet Transform (DWT) Algorithm:
Step 1: Divide the input array into n/2 pairs (L, R) of adjacent samples.
Step 2: Calculate (L + R)/2 for each pair; these values, called approximation coefficients, form the first half of the output array.
Step 3: Calculate (L − R)/2 for each pair; these values, called detail coefficients, form the second half.
Step 4: Repeat the process on the first half of the array (the array length should be a power of two) until no pair is left for the calculation.

Example 2.5 illustrates the steps listed above.

Example 2.5 Assume that we have a feature vector with eight components f_v = [1, 2, 3, 1, 4, 6, 7, 5]^T. We show the step-by-step calculation using the DWT algorithm.

The one-level DWT of f_v = [1, 2, 3, 1, 4, 6, 7, 5]^T:

DWT_1(f_v) = [Approximation Coefficients | Detail Coefficients]
= [ (1+2)/2, (3+1)/2, (4+6)/2, (7+5)/2 | (1−2)/2, (3−1)/2, (4−6)/2, (7−5)/2 ]
= [ 3/2, 4/2, 10/2, 12/2 | −1/2, 2/2, −2/2, 2/2 ]

The two-level DWT of f_v = [3/2, 4/2, 10/2, 12/2]^T:

DWT_2(f_v) = [Approximation Coefficients | Detail Coefficients]
= [ (3/2 + 4/2)/2, (10/2 + 12/2)/2 | (3/2 − 4/2)/2, (10/2 − 12/2)/2 ]
= [ 7/4, 22/4 | −1/4, −2/4 ]

The three-level DWT of f_v = [7/4, 22/4]^T:

DWT_3(f_v) = [Approximation Coefficients | Detail Coefficients]
= [ (7/4 + 22/4)/2 | (7/4 − 22/4)/2 ]
= [ 29/8 | −15/8 ]

Table 2.1 is a summary of the computations above.
Table 2.1 A summary of the average and difference computation

Input:   1     2     3     1     4     6     7     5
Level 1: 3/2   4/2   10/2  12/2  −1/2  2/2   −2/2  2/2
Level 2: 7/4   22/4  −1/4  −2/4  −1/2  2/2   −2/2  2/2
Level 3: 29/8  −15/8 −1/4  −2/4  −1/2  2/2   −2/2  2/2
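A minimal Python sketch of the averaging/differencing algorithm above is given below; applied to the feature vector of Example 2.5, it reproduces the rows of Table 2.1.

```python
import numpy as np

def haar_dwt(signal, levels):
    """Averaging/differencing DWT: at each level, the first half of the working
    array is replaced by pairwise averages (L+R)/2 and the second half by
    differences (L-R)/2; the next level recurses on the first half only."""
    out = np.asarray(signal, dtype=float).copy()
    n = len(out)
    for _ in range(levels):
        if n < 2:
            break
        pairs = out[:n].reshape(-1, 2)
        approx = (pairs[:, 0] + pairs[:, 1]) / 2.0   # approximation coefficients
        detail = (pairs[:, 0] - pairs[:, 1]) / 2.0   # detail coefficients
        out[:n // 2] = approx
        out[n // 2:n] = detail
        n //= 2
    return out

fv = [1, 2, 3, 1, 4, 6, 7, 5]
print(haar_dwt(fv, levels=1))   # [ 1.5  2.   5.   6.  -0.5  1.  -1.   1. ]
print(haar_dwt(fv, levels=3))   # [ 3.625 -1.875 -0.25 -0.5 -0.5  1.  -1.   1. ]
```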
2.5 Autocorrelation Function
The autocorrelation function is a two-dimensional function which can detect repetitive patterns of texels [22, 25]. It works by comparing the dot product (energy) of a non-shifted image with a shifted image, as shown in Eq. 2.27. The coarseness and fineness of a texture can be determined by how quickly the function drops off: for a coarse texture the function drops off slowly, while for a fine texture it drops off rapidly. This indicates that the size of coarse texels is larger than that of fine texels. The function can drop off differently for different image sizes. For regular textures, the function has peaks and valleys, with peaks repeating far away from the origin; for random textures, peaks occur at the origin. The breadth of the peaks determines the size of the texels. The autocorrelation function ρ(dr, dc) of an (N+1) × (N+1) image I, for displacement d = (dr, dc), is defined by Eq. 2.27:

$$\rho(dr, dc) = \frac{\sum_{r=0}^{N}\sum_{c=0}^{N} I[r, c]\, I[r + dr, c + dc]}{\sum_{r=0}^{N}\sum_{c=0}^{N} I[r, c]^2} = \frac{I[r, c] \circ I_d[r, c]}{I[r, c] \circ I[r, c]} \tag{2.27}$$

where r and c are the row and column indices and the symbol ∘ represents the correlation operator.
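The following Python sketch evaluates Eq. 2.27 for a single displacement. The boundary handling (pairs falling outside the image are simply skipped) and the synthetic coarse/fine test patterns are assumptions for illustration.

```python
import numpy as np

def autocorrelation(image, dr, dc):
    """Normalized autocorrelation of Eq. 2.27 for one displacement (dr, dc):
    dot product of the image with its shifted copy, divided by the image energy."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    shifted_product = 0.0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                shifted_product += image[r, c] * image[r2, c2]
    return shifted_product / np.sum(image ** 2)

# A coarse texture (large texels) drops off slowly with displacement;
# a fine texture (small texels) drops off quickly.
coarse = np.kron(np.random.rand(8, 8), np.ones((8, 8)))
fine = np.random.rand(64, 64)
for d in (1, 2, 4):
    print(d, round(autocorrelation(coarse, 0, d), 3), round(autocorrelation(fine, 0, d), 3))
```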
2.6 Markov Random Field (MRF) Model
Image classification algorithms that incorporate contextual (i.e., spatial) information are called spatial image classifiers. There are two approaches for utilizing spatial information for image classification. The first method is to extract (spatial) features from the neighborhood and use any pattern classification algorithm to assign a pixel to a predefined class. The second method is to segment an image into small homogeneous regions and then extract features from each region for assigning the entire region into a predefined class [22, 25]. The spatial classifier attempts to capture the spatial relationships encoded in the remote sensing images for improving the classification accuracy.
The Markov Random Field (MRF) model is a stochastic process that can describe the spatial relationships of pixels in a local neighborhood of an image [14]. The model utilizes both spectral and spatial information for the characterization and computation of features for each pixel and its neighboring pixels. The temporal information can also be used in the model if it is available; in such a case, a three-dimensional (3-D) MRF model is utilized. The MRF characterizes the statistical relationships between a pixel and its neighbors in a user-defined window. It is assumed that the intensity of each pixel depends only on the intensities of the neighboring pixels [14]. This is called the Markovian property, which is similar to that used in a Markov chain.

The development of the MRF can be traced back to Bayes theory [22, 25]. A popular classification algorithm based on Bayes theory is the naïve Bayes classifier. The algorithm is based on two criteria, the prior probabilities and the conditional probability density functions (PDF) (also called the likelihood function), as shown in Eqs. 2.28 and 2.29:

$$P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, \ldots, X_n)} \tag{2.28}$$

$$\text{Class}_Y = \underset{Y}{\operatorname{argmax}}\; P(Y \mid X_1, \ldots, X_n), \quad Y = 1, \ldots, C \tag{2.29}$$
We assume that there are C classes. The notation P(Y | X1, ..., Xn) represents the probability of classifying a pixel to a class label Y based on the observed features X1, ..., Xn. The notations P(X1, ..., Xn | Y), P(Y), and P(X1, ..., Xn) denote the likelihood function, the prior probability, and a normalization constant, respectively. The naïve Bayes classifier employs a decision rule to select the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. In applications to remote sensing images, a Gaussian distribution is usually assumed for image classification. Hence, the maximum likelihood criterion is used in the naïve Bayes classifier; in such a case, it is called the maximum likelihood (ML) classifier. With the Gaussian distribution modeling the class-conditional PDF, we can calculate the mean and variance of each class for the classification. Although the prior probability is used in the ML classifier, it is an estimate that does not consider any contextual information from the neighboring pixels. If a prior model can be used in the ML classifier, it should improve the classification result. The MRF, which uses contextual information for modeling the prior conditional probability density functions (PDF), has been well developed [14, 16, 34, 63]. The MRF uses this contextual information as a prior for improving the classification accuracy. However, it is difficult to establish an MAP estimate in the MRF model using the conditional probability density functions (PDF). The Hammersley−Clifford theorem provides an efficient method for solving this problem [14]. The theorem states that a random field is an MRF if and only if it follows a Gibbs distribution [14]. This theorem provides us with a means to define an MRF model
through clique potentials. Cliques are used to capture the dependence of pixels in a neighborhood. We assume that the MRF model used is noncausal (i.e., stationary), in which the intensity y(s) at site s is a function of the neighbors of s in all directions [38, 69]. The conditional probability density function (PDF) defining the MRF models is given in Eq. 2.30:

$$p\left( y(s) \mid \text{all } y(s + r),\; r \in N \right) \tag{2.30}$$

where N is a neighborhood system and r is one of the neighboring pixels in the defined neighborhood. The neighborhood structures specifying the dependencies among pixels in a region are given in Fig. 2.7, in which different orders are defined with respect to the central pixel x [14, 16, 28, 55, 69]. In many texture classification applications, a higher order neighborhood is used in the MRF model since it contains most of the statistical characteristics of the original textures [14, 16, 22, 25, 55, 69].

If we assume that P(x | f) measures the probability of a labeling (i.e., classification) given the observed feature f, our goal is to find an optimal labeling x̂ that maximizes P(x | f). This is called the maximum a posteriori (MAP) estimate, as shown in Eq. 2.31:

$$\hat{x}_{\text{MAP}} = \underset{x \in X}{\operatorname{argmax}}\; P(x \mid f) \tag{2.31}$$

where X is the set of class labels and f is a feature vector. To find the labeling x that maximizes P(x | f), it is sufficient to minimize the system energy function U(w) in Eq. 2.32:

$$U(w) = \sum_{c \in C} V_c(w) \tag{2.32}$$
where V_c(w) is the clique potential for clique c. A clique is defined as a subset of pixels in which every pair of pixels are neighbors; a set of pixels here is a patch or neighborhood as defined in Fig. 2.7. A clique containing n pixels is called an nth-order clique [14].

5 4 3 4 5
4 2 1 2 4
3 1 x 1 3
4 2 1 2 4
5 4 3 4 5

Fig. 2.7 Neighborhood systems representing different orders with respect to the center pixel (x): a the first order (all pixels labeled 1), b the second order (all pixels labeled 1 and 2), c the third order (all pixels labeled 1, 2, and 3), d the fourth order (all pixels labeled 1, 2, 3, and 4), and e the fifth order (all pixels labeled 1, 2, 3, 4, and 5)

We can rewrite P(x | f) in terms of the energy function as in Eq. 2.33:

$$p(w \mid f) = \frac{1}{Z}\, e^{-\frac{U(w)}{T}} \tag{2.33}$$
where e is the exponential function, T is the temperature, and Z is a normalization constant defined in Eq. 2.34:

$$Z = \sum_{w} e^{-\frac{U(w)}{T}} \tag{2.34}$$
where the notations used are defined above. Since the Gibbs random field model characterizes the system energy, our goal is to determine the lowest energy of the system, which gives the optimal solution. Our problem thus becomes choosing the most probable labeling x to minimize the energy. However, this is an NP-hard problem [14]. There exist several optimization methods, such as simulated annealing and the Gibbs sampler, which are frequently used in MRF modeling for applications [14].
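As an illustration of Eqs. 2.32−2.34, the sketch below evaluates the energy U(w) of a labeling using a simple Potts-style pairwise clique potential on a first-order neighborhood, together with the corresponding unnormalized Gibbs factor. The particular choice of V_c (agreeing neighbors contribute −β, disagreeing neighbors +β) is an illustrative assumption, not the book's definition.

```python
import numpy as np

def potts_energy(labels, beta=1.0):
    """System energy U(w) = sum of clique potentials (Eq. 2.32) over all
    4-connected pairwise cliques, with an illustrative Potts-style potential."""
    energy = 0.0
    rows, cols = labels.shape
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):          # right and down neighbors
                r2, c2 = r + dr, c + dc
                if r2 < rows and c2 < cols:
                    energy += -beta if labels[r, c] == labels[r2, c2] else beta
    return energy

def gibbs_factor(labels, temperature=1.0):
    """Unnormalized Gibbs probability exp(-U/T) of Eq. 2.33; the partition
    function Z of Eq. 2.34 is intractable for real images and is omitted."""
    return np.exp(-potts_energy(labels) / temperature)

smooth = np.zeros((8, 8), dtype=int); smooth[:, 4:] = 1      # two clean regions
noisy = (np.random.rand(8, 8) > 0.5).astype(int)             # random labeling
print(potts_energy(smooth), potts_energy(noisy))   # the smooth labeling has lower energy
```

In a full MRF classifier, this prior energy would be combined with the data likelihood, and an optimizer such as simulated annealing or the Gibbs sampler would search for the labeling of minimum total energy.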
2.7 Fractal Dimensions
Fractal geometry is a branch of mathematics for the study of complex patterns in irregular geometric objects. Fractals refer to irregular geometric objects that may exhibit a degree of self-similarity at different scales [37, 44, 45, 51, 63]. When the spatial distribution of local image textures exhibits irregular shapes, fractals can be used for characterizing image textures. The fractal dimension, which is different from the dimension defined in Euclidean space, is commonly used for measuring fractal geometry [44, 45, 51, 63]. The Euclidean dimension is defined as the number of coordinates: a line, which has one coordinate, is one dimensional; a plane, which has two coordinates, is two dimensional; and a space, which has three coordinates, is three dimensional. A fractal is a set in which the Hausdorff–Besicovitch dimension strictly exceeds the topological dimension, according to the definition given by Mandelbrot [44, 45]. This generalized dimension, called the Hausdorff dimension (often called the fractional dimension) and invented by the German mathematician Felix Hausdorff, is a clear concept for explaining the fractal dimension. If we take an object located in Euclidean dimension D and linearly reduce its size by 1/r in each spatial direction, its measure (i.e., the number of self-similar objects) increases to N = r^D times the original. This is shown in Fig. 2.8. Considering the relationship N = r^D, if we take the logarithm of both sides and rearrange the formula, the relationship is rewritten as in Eq. 2.35.
Fig. 2.8 The number of dimensions in Euclidean space (D) is increased from one to three; the measure (number) of lines, squares, and cubes increases accordingly. a In this column, since r = 1, N is 1 for all dimensions; b when r is 2, N is 2, 4, and 8, respectively. (https://www.vanderbilt.edu/AnS/psychology/cogsci/chaos/workshop/Fractals.html)
$$D = \log(N) / \log(r) \tag{2.35}$$
Now, the value of D may become a fraction as used in the fractal dimension. A set with a non-integer dimension is a fractal. However, the formula in Eq. 2.35 may result in an integer dimension which is still considered as a fractal. This generalized dimension is very useful for describing natural objects and chaotic trajectories which are fractals. There exist several methods for estimating the fractal dimension of complex patterns [63]. Mallat proposed a theory using the wavelet transform for calculating the fractal dimension [43]. Pentland proposed a 3-D fractal model which characterizes the 3-D surfaces and their images for measuring the fractal dimension [52]. Another approach is the box-counting method based on the self-similarity concept developed by Mandelbrot [63]. Fractal features are mainly used in texture description and image recognition for characterizing different textures of the objects [63]. Lacunarity is one of the fractal features used for characterizing image textures [38, 40, 48, 56, 59]. Ling et al. have shown that the fractal dimension is a poor descriptor of surface complexity while the lacunarity analysis can successfully quantify the spatial texture of an SEM image [39]. Lacunarity is a scale-dependent measure of heterogeneity of texture in an image. The index of lacunarity is large when the texture is coarse [39, 59]. Texture feature can be constructed by concatenating the lacunarity-related parameters, which are estimated from the multi-scale local binary patterns of an image [59]. By using the lacunarity analysis and its ability to distinguish spatial patterns,
the method is able to characterize the spatial distribution of local image structures at multiple scales. The fractal dimension may not give different values for different image textures; lacunarity, however, may provide unique characteristics for each texture even when the textures have the same fractal dimension [48]. If an object is homogeneous because all gap sizes are the same, its lacunarity will be low, while a high lacunarity value indicates that an object is heterogeneous [48]. Myint and Lam used moving windows of different sizes, such as 13 × 13, 21 × 21, and 29 × 29, for calculating the lacunarity as features from urban spectral images and for observing the effect of the moving window size in characterizing urban texture features [48]. They discovered that the lacunarity approach can significantly improve the accuracy of urban classification using spectral images.

There exist different methods to calculate the lacunarity. The blanket method is used to calculate the lacunarity in [37, 40, 51]. An image can be viewed as a mountain surface whose height above the ground is proportional to the image gray levels. We can then create a blanket of thickness 2ɛ covering the surface; the estimated surface area is the volume of the blanket divided by 2ɛ, and for different values of ɛ the blanket area can be estimated iteratively. Gilmore et al. implement the lacunarity as a measure of the distribution of color intensity (i.e., RGB) for an image [21]. The first and second moments of the gray-level distribution of each band are calculated first, and the lacunarity is defined as the ratio of the second and first moments of the mass distribution. Their algorithm is listed below.

The Lacunarity Algorithm
Step 1: Calculate the first moment, which is the mean of the gray levels of all pixels s_i inside the box, where N is the number of pixels:

$$M_1 = \frac{1}{N}\sum_{i=1}^{N} s_i \tag{2.36}$$

Step 2: Calculate the second moment, which is the variance of the gray levels of all pixels s_i inside the box:

$$M_2 = \frac{1}{N-1}\sum_{i=1}^{N} (s_i - M_1)^2 \tag{2.37}$$

Step 3: Calculate the lacunarity, L(s²), for a box of size s² using Eq. 2.38:

$$L(s^2) = \frac{M_2 + M_1^2}{M_1^2} \tag{2.38}$$
Please note that the algorithm calculates the lacunarity for the red, green, and blue components of the image separately. As reported in the literature, the maximal box size used is one-half of the length of the smallest axis of the image. Three maximal box size diameters of one-eighth, one-quarter, and one-half of the length of the smallest axis of the image were used for the lacunarity calculations for each image in [21]. They also used random sampling to choose the center of each box. The algorithm generates a feature vector of lacunarities for each component of the RGB color image; depending on the maximal box size, these feature vectors contain different numbers of elements. Images are thus assigned nine lacunarity values according to the mean of their red, green, and blue vector values for each of the three maximal box sizes in their experiments [21].
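A Python sketch of Eqs. 2.36−2.38 for a single channel is shown below. The gliding (overlapping) box placement and the synthetic test images are illustrative assumptions; the box centers could equally be sampled randomly, as in [21].

```python
import numpy as np

def lacunarity(channel, box_size):
    """Lacunarity of one image channel following Eqs. 2.36-2.38: for every box,
    L = (M2 + M1^2) / M1^2, where M1 and M2 are the mean and variance of the
    gray levels inside the box; the per-box values are then averaged."""
    rows, cols = channel.shape
    values = []
    for r in range(rows - box_size + 1):
        for c in range(cols - box_size + 1):
            box = channel[r:r + box_size, c:c + box_size].astype(float)
            m1 = box.mean()                        # Eq. 2.36
            m2 = box.var(ddof=1)                   # Eq. 2.37 (N - 1 denominator)
            if m1 > 0:
                values.append((m2 + m1 ** 2) / m1 ** 2)   # Eq. 2.38
    return float(np.mean(values))

rng = np.random.default_rng(1)
homogeneous = np.full((32, 32), 128) + rng.integers(-3, 4, (32, 32))
heterogeneous = rng.integers(0, 256, (32, 32))
print(lacunarity(homogeneous, 8), lacunarity(heterogeneous, 8))
# The heterogeneous texture yields the larger lacunarity.
```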
2.8 Variogram
Geostatistics is the study of phenomena that vary in space and time, applied in geology and to remote sensing images [5, 15, 30]. It is used to measure spatial correlation and its properties. The basic components of geostatistics that are frequently used include spatial characterization, variogram analysis, Kriging, and stochastic simulation [5]. The variogram (also called the semivariogram), which is a second-order statistic, is used to extract texture features from an image. Unlike the GLCM, the variogram captures the covariance of pixel values as a function of the distance between pixels [7, 8, 41]. It measures the difference between pixel values relative to a given distance separating them in a specific orientation [7, 8, 41]. A pair of pixels representing measurements of the same variable is used for measuring the spatial similarity at the distance between the pair. This distance is usually referred to as the "lag", as in time series analysis [5]. A variogram shows a peak of the variance, which is called the sill. The lag interval to the sill is known as the range; that range indicates the limit of spatial dependence and the distance over which values are similar.

Similar to the GLCM, the variogram has been used in the classification and analysis of image textures [7, 8, 41]. However, the two are quite different from the classification point of view [8]. In the GLCM method, different features are extracted from the co-occurrence matrices to form a feature vector for the classification. In the variogram approach, a set of lags is used to calculate the variogram for each pixel located in a neighborhood. Similarly, a variogram is calculated for each textural class using the training sites within a neighborhood in an image. Based on the minimum distance principle, a pixel is classified based on the
minimum distance obtained between the variogram of each pixel and that of the textural classes. The variogram model is described in detail below [7].

Let G(x, y) denote a grayscale image. The variogram describing this image is formulated as in Eq. 2.39:

$$2\gamma(h) = \frac{1}{2}\int_x \int_y \left[ G(x, y) - G(x', y') \right]^2 dy\, dx \tag{2.39}$$

where h is the Euclidean distance (i.e., lag distance) between a pixel at location (x, y) and another pixel value G at location (x', y'). For a digital image, this integral is approximated as Eq. 2.40:

$$\gamma(h) = \frac{1}{2N}\sum_{i=1}^{N} \left[ G(x, y) - G(x', y') \right]^2 \tag{2.40}$$

where N is the total number of pairs of pixels and G(x, y) and G(x', y') represent a pair of pixels separated by a distance h. The variogram γ(h) can be computed along a particular spatial direction. Four directions, East−West (E−W), North−South (N−S), Northeast−Southwest (NE−SW), and Northwest−Southeast (NW−SE), are usually used. These are defined in Eqs. 2.41−2.44:

E−W:
$$\gamma(h) = \frac{1}{2N}\sum_{i=1}^{N} \left[ G(x, y) - G(x + h, y) \right]^2 \tag{2.41}$$

N−S:
$$\gamma(h) = \frac{1}{2N}\sum_{i=1}^{N} \left[ G(x, y) - G(x, y + h) \right]^2 \tag{2.42}$$

NE−SW:
$$\gamma(h) = \frac{1}{2N}\sum_{i=1}^{N} \left[ G(x + h, y) - G(x, y + h) \right]^2 \tag{2.43}$$

NW−SE:
$$\gamma(h) = \frac{1}{2N}\sum_{i=1}^{N} \left[ G(x, y) - G(x + h, y + h) \right]^2 \tag{2.44}$$
The variogram is often calculated using the absolute value of the pixel difference (Eq. 2.45), rather than its square [7]:

$$\gamma(h) = \frac{1}{2N}\sum_{i=1}^{N} \left| G(x, y) - G(x', y') \right| \tag{2.45}$$
The training sites of size M × M pixels are extracted and computed for each class. A variogram is also computed for a region of size M × M around each pixel to be classified. The minimum distance (Eq. 2.46) is used to determine the similarity of variograms:

$$\text{Distance} = \sum_{i=1}^{K} \left| \gamma_t(c) - \gamma_p(c) \right| \tag{2.46}$$

where K is the number of increments of h allowed given the constraint of the window size M, the subscripts t and p denote the variograms of a training site and of the neighborhood of a pixel, respectively, and c is a particular class. A pixel is then classified into the class for which the distance is minimum. The variogram is one example of extracting features using a spatial autocorrelation function. Compared with the GLCM, it is very useful in microwave image classification [7].
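A Python sketch of the directional variograms of Eqs. 2.41−2.44 is given below. The diagonal directions are implemented as displacement vectors applied to one pixel of each pair, and the set of lags used to build the feature vector is an illustrative assumption.

```python
import numpy as np

def directional_variogram(image, lag, direction):
    """Empirical variogram gamma(h) for one lag and one direction;
    'direction' is the (dx, dy) step, e.g. (1, 0) for E-W or (0, 1) for N-S."""
    image = np.asarray(image, dtype=float)
    dx, dy = direction
    shift_x, shift_y = dx * lag, dy * lag
    rows, cols = image.shape
    diffs = []
    for y in range(rows):
        for x in range(cols):
            x2, y2 = x + shift_x, y + shift_y
            if 0 <= x2 < cols and 0 <= y2 < rows:
                diffs.append((image[y, x] - image[y2, x2]) ** 2)
    return np.sum(diffs) / (2.0 * len(diffs))

def variogram_signature(image, lags=(1, 2, 3, 4)):
    """Concatenate gamma(h) over several lags and the four usual directions
    into one texture feature vector for a pixel neighborhood or training site."""
    directions = [(1, 0), (0, 1), (1, -1), (1, 1)]   # E-W, N-S, and the two diagonals
    return np.array([directional_variogram(image, h, d)
                     for d in directions for h in lags])

patch = np.random.rand(32, 32)
print(variogram_signature(patch).round(4))
```

Classification then proceeds by computing such a signature for each training site and for the neighborhood of each pixel, and assigning the pixel to the class with the minimum distance as in Eq. 2.46.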
2.9 Texture Spectrum (TS) and Local Binary Pattern (LBP)
He and Wang stated that a texture image can be decomposed into a set of essential small units called texture units (TU) [26, 27, 66]. A texture unit is represented by a 3 × 3 window. The central pixel X0 in the window is the pixel being processed, and the given neighborhood of X0 is denoted as X = {X1, X2, ..., X8}, as shown in Fig. 2.9a.

Fig. 2.9 a A 3 × 3 window for a texture unit, b the corresponding texture unit, c a numerical example of a texture unit, and d its corresponding texture unit

(a)
X4 X3 X2
X5 X0 X1
X6 X7 X8

(b)
E4 E3 E2
E5    E1
E6 E7 E8
The corresponding texture unit set is TU = {E1, E2, E3, E4, E5, E6, E7, E8}, where each element E_i is obtained by comparing the neighboring pixel X_i with the central pixel X0 using Eq. 2.47:

$$E_i = \begin{cases} 0, & \text{if } X_i < X_0 \\ 1, & \text{if } X_i = X_0 \\ 2, & \text{if } X_i > X_0 \end{cases} \quad i = 1, 2, \ldots, 8 \tag{2.47}$$

Table 3.1 Notations
C: the number of clusters
Θ = {θ1, θ2, ..., θC}: a set of clusters in which θ_i represents the ith cluster, i ∈ I
A = {a1, a2, ..., aC}: a matrix of the centers of all clusters, where a_i is the ith cluster center, i ∈ I
I = {1, 2, ..., C}: a set of indexes from 1 to C
u_ij(x): a fuzzy membership of pixel j for cluster i
U_X: U_X = {u | 0 ≤ u_ij ≤ 1 for all (i, j) ∈ (I, J)}
Fig. 3.3 a An example of a fuzzy membership function for the fuzzy set defined on the set of old residents in a residential area and b some functions that can be used as fuzzy membership functions
Let us give an example (Example 3.2) to illustrate the concept of a fuzzy set and a measure of the membership degree. A monotonically decreasing function f of d(x, x0) with f(0) = 1 and f(+∞) = 0 can be a measurement for the membership μ of A, as shown in Fig. 3.3a [64]. Figure 3.3b gives some functions that can be used as a fuzzy measure of the membership degree.

Example 3.2 Suppose U is the set of ages of residents in a residential area. The set "old people" in this residential area is a fuzzy set on U, denoted as A. Let the standard of A be x0 = 80 years old, and the dissimilarity between the standard and an element x ∈ U be defined as

$$d(x, x_0) = \begin{cases} 0, & \text{if } x \ge x_0 \\ x_0 - x, & \text{if } 0 < x < x_0 \end{cases} \tag{3.6}$$
As a basic assumption in fuzzy clustering, each cluster θ_i is supposed to be a fuzzy set with membership function u_i in each iteration, where u_i is defined on the set X by

$$u_i(x_j) = u_{ij} \quad \text{for each } x_j \in X \tag{3.7}$$
with u_ij representing the degree of compatibility or membership of feature point x_j belonging to fuzzy cluster θ_i. According to fuzzy set theory, the assigned memberships u_ij should satisfy

$$0 \le u_{ij} \le 1 \quad \text{for all } (i, j) \in (I, J) \tag{3.8}$$
For convenience, we denote the membership matrix by u with

$$u = \begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{c1} & \cdots & u_{cn} \end{pmatrix} \tag{3.9}$$

and denote the set of all the membership matrices satisfying Eq. 3.8 by U_X with

$$U_X = \{ u \mid 0 \le u_{ij} \le 1 \text{ for all } (i, j) \in (I, J) \} \tag{3.10}$$
where U_X is called the eligible membership matrix set of X. In general, the objective in fuzzy clustering is to find the optimal membership matrix u ∈ U_X according to the given data information and some decision criteria from experts' opinions.

The fuzzy C-means (FCM) clustering algorithm is a widely used fuzzy clustering technique. It has been shown that FCM is more stable than the K-means algorithm in avoiding the local minima problem [3, 64]. The objective function of FCM is formulated as shown below:

$$J_{\text{FCM}}(u, A) = \sum_{i=1}^{c}\sum_{j=1}^{n} (u_{ij})^m \left\| x_j - a_i \right\|^2 \tag{3.11}$$

subject to the constraints u ∈ U_X and

$$\sum_{i=1}^{c} u_{ij} = 1 \quad \text{for all } j \in J$$
The update equations for both u and A are obtained from the necessary conditions for the minimization of Eq. 3.11:

$$u_{ij} = \left( \sum_{k=1}^{c} \frac{\left\| x_j - a_i \right\|^{2/(m-1)}}{\left\| x_j - a_k \right\|^{2/(m-1)}} \right)^{-1} \quad \text{for } (i, j) \in (I, J) \tag{3.12}$$

$$a_i = \frac{\sum_{j=1}^{n} (u_{ij})^m\, x_j}{\sum_{j=1}^{n} (u_{ij})^m} \quad \text{for } i \in I \tag{3.13}$$
where the weighting exponent m > 1 is called the fuzzifier, which has a significant influence on the performance of the FCM [3, 64]. All notations used in Eqs. 3.11−3.13 are similarly defined. In the limit m → 1, the FCM reduces to the K-means algorithm; that is, the K-means algorithm is a special case of the fuzzy C-means algorithm. FCM starts with an arbitrary assignment of the initial cluster centers and randomly generated initial membership values for all the pixels. It then distributes pixels among all the clusters based on the minimum distance metric. In this sense, the FCM is similar to the K-means as an iterative algorithm. However, instead of the winner-takes-all (i.e., deterministic) assignment used in the K-means, a fuzzy technique is used in assigning a membership value for each pixel to each class. The fuzzy membership of each pixel for each class is a real number between zero and one. Hence, an initial fuzzy membership table must be given for all the training pixels in order to run the FCM algorithm. The following illustrates the steps used in the FCM algorithm:

The Fuzzy C-means Algorithm (FCM):
Step 1: Choose C initial cluster centers, a fuzzifier m, an initial U_X which is a collection of u_ij, a small value of δ for the convergence criterion, and a maximum number of iterations (Max-It).
Step 2: Calculate the cluster centers using Eq. 3.13.
Step 3: Update the membership matrix U_X using Eq. 3.12.
Step 4: Increase the iteration count (t) by one and check whether it reaches Max-It. If not, repeat Steps 2−3 until the clustering converges, i.e., |u_ij(t + 1) − u_ij(t)| ≤ δ for all (i, j) ∈ (I, J).
Step 5: (Clustering) Classify all the pixels using matrix U_X (based on the largest membership to a cluster).

The FCM runs until the convergence criterion is met or the maximum number of iterations is reached. An example of FCM is shown in Example 3.3.

Example 3.3 The same dataset of four pixels used in Example 3.1 (as shown below) is classified into one of two clusters based on distance similarity using the FCM algorithm: (a) a set of four pixels in two-dimensional space, and (b) the degree of membership calculated for each pixel belonging to each cluster and the assignment of each pixel to one of the two clusters, C1 and C2, based on the membership value.
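A minimal Python sketch of the FCM algorithm above, alternating Eqs. 3.12 and 3.13, is shown below. The four two-dimensional sample points are illustrative assumptions, since the data of Example 3.1 are not reproduced here.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, delta=1e-5, seed=0):
    """Fuzzy C-means: alternate the center update (Eq. 3.13) and the membership
    update (Eq. 3.12) until memberships change by less than delta or Max-It."""
    rng = np.random.default_rng(seed)
    n = len(X)
    u = rng.random((c, n))
    u /= u.sum(axis=0)                                   # memberships sum to 1 per pixel
    for _ in range(max_iter):
        centers = (u ** m @ X) / (u ** m).sum(axis=1, keepdims=True)        # Eq. 3.13
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        u_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)  # Eq. 3.12
        if np.max(np.abs(u_new - u)) < delta:
            u = u_new
            break
        u = u_new
    return u, centers

# Four pixels in a two-dimensional feature space, clustered into two groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
u, centers = fcm(X, c=2)
print(np.round(u, 2))            # membership of each pixel in each cluster
print(u.argmax(axis=0))          # crisp assignment by the largest membership (Step 5)
```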
3.4 A Fuzzy K-Nearest-Neighbor Algorithm (Fuzzy K-NN)
In the traditional K-NN algorithm, each member of the labeled pixels in the dataset is considered equally important in counting the majority of the class labels for the assignment of a class label to an input pixel. As shown in Fig. 3.1, many samples in a dataset may significantly overlap in their distributions, which can make it difficult for the traditional K-NN algorithm to correctly classify the overlapping samples. Keller et al. [26] pointed out two potential problems associated with the K-NN algorithm: each labeled pixel is given the same weight as those that are truly representative of the clusters, and there is no degree of membership for a pixel belonging to a class. This is similar to the K-means problem, in that no weighting is used for each sample. Hence, fuzzy set theory is used in the K-NN algorithm to address these two problems. Keller et al. proposed three different fuzzy nearest-neighbor algorithms for assigning the fuzzy memberships of the training sets [26]. Similar to the fuzzy membership used in the FCM, the membership in the fuzzy K-NN also provides a level of confidence in the assignment of a pixel to a class: the higher the membership value, the more certain the assignment to the class.

We introduce the fuzzy version of the K-NN algorithm below, called the fuzzy K-NN algorithm. Instead of assigning a pixel to a class as in the traditional K-NN algorithm, the fuzzy K-NN assigns class memberships to a sample pixel. The membership is a function of the distances of the pixel from its K-nearest neighbors and of the memberships of those neighbors in the possible classes, as shown in Eq. 3.14 [26]. Similar to the crisp K-NN algorithm, the fuzzy version still needs to search for the K-nearest neighbors in the sample dataset. Please note that u_ij ∈ U_X is an initial fuzzy membership used for calculating u_i(p) for the classification of pixel p:

$$u_i(p) = \frac{\sum_{j=1}^{K} u_{ij} \left( 1 / \left\| p - x_j \right\|^{2/(m-1)} \right)}{\sum_{j=1}^{K} \left( 1 / \left\| p - x_j \right\|^{2/(m-1)} \right)} \tag{3.14}$$
where u_ij is the membership of the jth pixel with respect to the ith class. This is very similar to what is done for the FCM initial membership matrix. Unlike FCM, however, this initial fuzzy membership is used for the entire procedure without any update. Keller et al. proposed two methods for generating an initial fuzzy membership matrix U_X [26]: in the first method, each pixel is given complete membership in the class it belongs to and nonmembership in the other classes; the second method initializes the memberships based on the distance from the cluster mean. Jozwik proposed a scheme for learning the memberships in the fuzzy K-NN rule [25]. The fuzzifier m is similar to that used in the FCM. As stated in [26], the fuzzifier determines how heavily the distance is weighted when calculating the contribution of each neighbor to the membership degree. If the fuzzifier is gradually increased, the distances to the neighboring pixels are treated more and more equally; in other words, the relative distance has less effect on the membership degree. Assume that we use notations similar to those defined in Table 3.1, let X = {x1, x2, ..., xn} be the set of n labeled samples, let u_i(p) be the assigned membership of a new pixel p, and let u_ij be the membership in the ith class of the jth pixel in the labeled sample set. The fuzzy K-NN algorithm is defined below:

The Fuzzy K-NN Algorithm: To determine the class of a new pixel p,
Step 1: Initialize K, which is usually between 1 and n (the number of samples in the training dataset), a fuzzifier m, and the membership values u_ij for all (i, j) ∈ (I, J).
Step 2: Calculate the distances between pixel p and all n sample pixels in the training dataset.
Step 3: Select the K sample pixels nearest to p in the training dataset.
Step 4: Assign the new pixel p the memberships for all classes by computing u_i(p) using Eq. 3.14, where i is a class.
Step 5: Assign p to the class with the largest membership among those obtained in Step 4.
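A minimal Python sketch of the fuzzy K-NN assignment of Eq. 3.14 is given below, using the first initialization scheme (crisp memberships for the labeled samples). The sample data, K, and m are illustrative assumptions.

```python
import numpy as np

def fuzzy_knn(p, X, memberships, K=3, m=2.0):
    """Fuzzy K-NN membership of Eq. 3.14: the membership of the new pixel p in
    class i is a distance-weighted combination of the memberships u_ij of its
    K nearest labeled samples."""
    d = np.linalg.norm(X - p, axis=1) + 1e-12
    nearest = np.argsort(d)[:K]                       # K nearest labeled pixels (Steps 2-3)
    w = 1.0 / d[nearest] ** (2.0 / (m - 1.0))         # inverse-distance weights
    u_p = (memberships[:, nearest] * w).sum(axis=1) / w.sum()
    return u_p                                        # one membership per class (Step 4)

# Labeled samples with crisp initial memberships (first initialization scheme).
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]])
labels = np.array([0, 0, 1, 1, 1])
memberships = np.eye(2)[:, labels]                    # shape (classes, samples)
u = fuzzy_knn(np.array([0.7, 0.75]), X, memberships, K=3)
print(np.round(u, 3), "-> class", int(u.argmax()))    # Step 5: largest membership
```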
3.5 The Fuzzy Weighted C-means Algorithm (FWCM)
The classical fuzzy C-means clustering algorithm (FCM) is an efficient method for partitioning pixels into different categories. The objective function of FCM is defined by the distances from the pixels to the cluster centers weighted by their fuzzy memberships. If two distinct clusters have similar means, however, the performance of FCM will not be efficient. To overcome this problem, Li et al. proposed a new fuzzy clustering algorithm, namely, the fuzzy weighted C-means algorithm (FWCM) [33]. In the FWCM, the concept of weighted means used in the nonparametric weighted feature extraction (NWFE) method is employed for calculating the cluster centers. Kuo and Landgrebe proposed the nonparametric weighted feature extraction (NWFE) model [31, 32], which is a powerful feature extraction
method for dimensionality reduction. The idea of weighted means is an essential part of the NWFE concept used for hyperspectral image classification. The main idea of NWFE is to give different weights to each pixel for computing the weighted means and to define new nonparametric between-class and within-class scatter matrices in order to obtain more features. Hence, NWFE calculates two scatter matrices: the nonparametric between-class scatter matrix defined in Eq. 3.15 and the nonparametric within-class scatter matrix defined in Eq. 3.16:

$$S_b^{NW} = \sum_{i=1}^{c} P_i \sum_{\substack{j=1 \\ j \ne i}}^{c} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i} \left( x_k^{(i)} - M_j(x_k^{(i)}) \right)\left( x_k^{(i)} - M_j(x_k^{(i)}) \right)^T \tag{3.15}$$

$$S_w^{NW} = \sum_{i=1}^{c} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i} \left( x_k^{(i)} - M_i(x_k^{(i)}) \right)\left( x_k^{(i)} - M_i(x_k^{(i)}) \right)^T \tag{3.16}$$
where T denotes the transpose, c is the number of clusters, P_i is the a priori probability of class i, n_i is the number of pixels in class i, and x_k^{(i)} is the kth pixel from class i. The scatter matrix weight λ_k^{(i,j)} is defined as in Eq. 3.17:

$$\lambda_k^{(i,j)} = \frac{\operatorname{dist}\left( x_k^{(i)}, M_j(x_k^{(i)}) \right)^{-1}}{\sum_{l=1}^{n_i} \operatorname{dist}\left( x_l^{(i)}, M_j(x_l^{(i)}) \right)^{-1}} \tag{3.17}$$
where dist(x, y) denotes the distance from x to y, and M_j(x_k^{(i)}) is the weighted mean of x_k^{(i)} in class j, defined as

$$M_j(x_k^{(i)}) = \sum_{l=1}^{N_j} w_l^{(i,j)}\, x_l^{(j)} \tag{3.18}$$

where

$$w_l^{(i,j)} = \frac{\operatorname{dist}\left( x_k^{(i)}, x_l^{(j)} \right)^{-1}}{\sum_{l=1}^{N_j} \operatorname{dist}\left( x_k^{(i)}, x_l^{(j)} \right)^{-1}} \tag{3.19}$$

The scatter matrix weight λ_k^{(i,j)} will be close to 1 if the distance between x_k^{(i)} and M_j(x_k^{(i)}) is small; otherwise, it will be close to 0. Similarly, the weight w_l^{(i,j)} for computing the weighted means will be close to 1 if the distance between x_k^{(i)} and x_l^{(j)} is small; otherwise, it will be close to 0.
Although NWFE is intended for supervised learning problems, the concept of weighted means can be extended to unsupervised learning. To extend the weighted means of NWFE to an unsupervised version, it is necessary to develop a method for an unsupervised weighted mean calculation. For any pixel x_j in a class i, the distances from x_j to the other pixels are calculated as

$$\left\| x_j - x_k \right\|, \quad k = 1, 2, \ldots, n, \; k \ne j \tag{3.20}$$
In general, if a pixel is close to x_j, the pixel most likely belongs to the same class as x_j, and the corresponding weight should be large. Hence, the reciprocal of the above distance is used for the weighting. If a pixel is close to x_j but does not belong to the same class, then the influence of this pixel on the weighting must be small; this is handled by multiplying by the membership grade. Therefore, the unsupervised weighted mean of x_j in class i is defined in Eq. 3.21:

$$M_{ij} = \sum_{\substack{k=1 \\ k \ne j}}^{n} \frac{\left\| x_j - x_k \right\|^{-1} u_{ik}}{\sum_{\substack{t=1 \\ t \ne j}}^{n} \left\| x_j - x_t \right\|^{-1} u_{it}}\; x_k \tag{3.21}$$
From the derivation of M_ij, we can expect that the unsupervised weighted mean M_ij is closer to x_j than the fuzzy cluster center C_i. Now, the objective function of FWCM is defined as in Eq. 3.22:

$$J_{\text{FWCM}} = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^m \left\| x_j - M_{ij} \right\|^2 \tag{3.22}$$
Using the method of Lagrange multipliers, a new objective function of FWCM is formulated as follows in Eq. 3.23:

$$J_{\text{FWCM}}(u_{11}, \ldots, u_{1c}, \ldots, u_{n1}, \ldots, u_{nc}, \xi_1, \ldots, \xi_n) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^m \left\| x_j - M_{ij} \right\|^2 + \sum_{j=1}^{n} \xi_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right) \tag{3.23}$$
where the ξ_j are the Lagrange multipliers for the n constraints that the membership grades of each pixel sum to one over the clusters. By differentiating J_FWCM in Eq. 3.23 with respect to all the arguments, we obtain the following equations, Eqs. 3.24 and 3.25:

\xi_j = \left( \sum_{i=1}^{c} \left( m \, u_{ij}^{m-1} \, \| x_j - M_{ij} \|^2 \right)^{1/(1-m)} \right)^{1-m}    (3.24)
u_{ij} = \xi_j^{1/(m-1)} \left( m \, u_{ij}^{m-1} \, \| x_j - M_{ij} \|^2 \right)^{1/(1-m)}    (3.25)
The procedure of FWCM is given below. Similar to FCM, FWCM is an iterative algorithm.

The FWCM Clustering Algorithm:
Step 1: Choose the number of clusters C, a fuzzifier m, a small value δ for the convergence criterion, and a maximum number of iterations (Max-It). Initialize the cluster centers and the membership matrix U_X, which is the collection of the u_{ij}.
Step 2: Calculate the unsupervised weighted means M_{ij} (Eq. 3.21).
Step 3: Update the Lagrange multipliers ξ_j (Eq. 3.24).
Step 4: Update the membership grades u_{ij} (Eq. 3.25).
Step 5: Increase the iteration count t by one and check whether it has reached Max-It. If not, repeat Steps 2–4 until the clustering converges, i.e., |u_{ij}(t + 1) − u_{ij}(t)| ≤ δ for all (i, j) ∈ (I, J).
Step 6: (Clustering) Classify all the pixels using the matrix U_X, assigning each pixel to the cluster with the largest membership. Once the training is complete, each pixel is assigned to a cluster for calculating the cluster centers.

Experimental results on both synthetic and real data show that FWCM can generate better clustering results than FCM [33]. However, FWCM is computationally expensive, in particular for large image datasets.
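As a concrete illustration of the listing above, here is a compact NumPy sketch of the FWCM iteration. It is our reading of Eqs. 3.21, 3.24, and 3.25, not the authors' reference code: normalizing the membership columns plays the role of the Lagrange multipliers ξ_j, and the toy two-blob dataset is only for demonstration.

```python
import numpy as np

def fwcm(X, c, m=2.0, max_it=50, delta=1e-4, seed=0):
    """A minimal FWCM sketch: memberships U (c x n) driven by the
    unsupervised weighted means M_ij of Eq. 3.21."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                  # each column sums to one
    # Pairwise reciprocal distances between pixels, diagonal excluded (k != j).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    R = 1.0 / (D + 1e-12)
    np.fill_diagonal(R, 0.0)
    for _ in range(max_it):
        U_new = np.empty_like(U)
        for i in range(c):
            W = R * U[i][None, :]                       # W[j, k] ~ ||x_j - x_k||^-1 * u_ik
            W /= W.sum(axis=1, keepdims=True)
            M = W @ X                                   # row j is M_ij (Eq. 3.21)
            d2 = np.sum((X - M) ** 2, axis=1) + 1e-12
            U_new[i] = (m * U[i] ** (m - 1) * d2) ** (1.0 / (1.0 - m))
        U_new /= U_new.sum(axis=0)                      # applies Eqs. 3.24-3.25 jointly
        if np.max(np.abs(U_new - U)) < delta:
            return U_new
        U = U_new
    return U

# Toy example: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
labels = fwcm(X, c=2).argmax(axis=0)
print(np.bincount(labels))
```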
3.6 The New Weighted Fuzzy C-means Algorithm (NW-FCM)
The fuzzy clustering model is a convenient tool for finding the proper cluster structure of a given dataset in an unsupervised way. A new weighted fuzzy C-means (NW-FCM) algorithm was developed to improve the performance of the FWCM model for high-dimensional multiclass pattern recognition problems [16, 21]. The methodology used in NW-FCM combines the concept of the weighted mean from nonparametric weighted feature extraction (NWFE) with the cluster mean from discriminant analysis feature extraction (DAFE) [44]. These two concepts are combined in NW-FCM for unsupervised clustering. The main feature of NW-FCM, compared with FCM, is the inclusion of the weighted mean to increase accuracy; compared with FWCM, the centroid of each cluster is included to increase stability. The algorithm therefore gives higher classification accuracy and better stability than FCM and FWCM. Comparing the DAFE, NAFE, and NWFE methods: DAFE uses the centroid concept in the calculation of both the within-class and between-class scatter matrices without the weighted mean; NAFE uses a combination of the centroid for the
within-class scatter matrices and the weighted mean for the between-class scatter matrices; and NWFE uses the weighted mean for both the within-class and between-class scatter matrices without the centroid. The weighted mean of NWFE was used in FWCM, but not the centroid concept. The weighted mean used in FWCM is very similar to the gradient-weighted inverse concept used in the smoothing filter [54]. In NW-FCM, one can expect this new unsupervised weighted mean to be even closer to the real cluster center. The NW-FCM method combines the centroid of each cluster and the weighted mean in deriving the algorithm. NW-FCM is more stable than FWCM and obtains higher data classification accuracy than FCM. In FWCM, the weighted means are calculated from the point under consideration and all the sample pixels, whereas in NW-FCM they are calculated from the cluster centers and the remaining sample pixels. This makes NW-FCM more precise than FWCM in assigning a sample pixel to a particular cluster. Because the weighted mean is calculated from the cluster centers, the NW-FCM algorithm also takes less computation time than FWCM. The NW-FCM algorithm is formulated below.

The NW-FCM Clustering Algorithm:
Step 1: Choose the number of clusters C, a fuzzifier m, a small value δ for the convergence criterion, and a maximum number of iterations (Max-It). Initialize the cluster centers and the membership matrix U_X, which is the collection of the u_{ij} satisfying Eq. 3.26:

\sum_{i=1}^{c} u_{ij} = 1, \quad j = 1, 2, \ldots, n    (3.26)
Step 2: Calculate the fuzzy cluster centers c_i using Eq. 3.27:

c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad i = 1, 2, \ldots, C    (3.27)
Step 3: Calculate the weighted means M_{ij} using Eq. 3.28:

M_{ij} = \sum_{k=1, k \ne j}^{n} \frac{\| c_i - x_k \|^{-1} u_{ik}}{\sum_{t=1, t \ne j}^{n} \| c_i - x_t \|^{-1} u_{it}} \, x_k    (3.28)
Step 4: Update the Lagrange multipliers ξ_j using Eq. 3.29:

\xi_j = \left( \sum_{i=1}^{c} \left( m \, u_{ij}^{m-1} \, \| x_j - M_{ij} \|^2 \right)^{1/(1-m)} \right)^{1-m}    (3.29)
Step 5: Update the membership grades u_{ij} using Eq. 3.30:

u_{ij} = \xi_j^{1/(m-1)} \left( m \, u_{ij}^{m-1} \, \| x_j - M_{ij} \|^2 \right)^{1/(1-m)}    (3.30)
Step 6: Increase the iteration count t by one and check whether it has reached Max-It. If not, repeat Steps 2–5 until the clustering converges, i.e., |u_{ij}(t + 1) − u_{ij}(t)| ≤ δ for all (i, j) ∈ (I, J).
Step 7: (Clustering) Classify the pixels using the matrix U_X, assigning each pixel to the cluster with the largest membership.

Experimental results on both synthetic and real data demonstrate that the NW-FCM clustering algorithm generates better clustering results than the FCM and FWCM algorithms, in particular for hyperspectral images [16, 21]. Some results are shown in Fig. 3.4 and Table 3.2. In the classification of the Indian Pine images in Table 3.2, four classes (roofs, roads, trails, and grasses) with 100 pixels each were used [33]. Table 3.2 shows a statistical comparison over 1000 experiments on the Indian Pine hyperspectral images.
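To make the difference from FWCM concrete, the short sketch below (our own illustration, reusing the conventions of the FWCM sketch above) computes the fuzzy cluster centers of Eq. 3.27 and then the weighted means of Eq. 3.28, which are anchored at the cluster centers c_i rather than at the individual pixel x_j as in Eq. 3.21.

```python
import numpy as np

def nwfcm_weighted_means(X, U, m=2.0, eps=1e-12):
    """Eqs. 3.27-3.28: fuzzy centers c_i, then weighted means M_ij anchored at c_i."""
    c, n = U.shape
    Um = U ** m
    centers = (Um @ X) / Um.sum(axis=1, keepdims=True)             # Eq. 3.27
    M = np.empty((c, n, X.shape[1]))
    for i in range(c):
        r = 1.0 / (np.linalg.norm(X - centers[i], axis=1) + eps)   # ||c_i - x_k||^-1
        w = r * U[i]                                               # numerator weights of Eq. 3.28
        for j in range(n):
            wj = w.copy()
            wj[j] = 0.0                                            # exclude k = j (and t = j)
            M[i, j] = (wj / wj.sum()) @ X
    return centers, M

# Tiny usage example with random memberships over five 2-D points.
rng = np.random.default_rng(2)
X = rng.random((5, 2))
U = rng.random((2, 5))
U /= U.sum(axis=0)
centers, M = nwfcm_weighted_means(X, U)
print(centers.shape, M.shape)        # (2, 2) and (2, 5, 2)
```

Because the anchor c_i is shared by all pixels, the reciprocal distances need only be computed once per cluster, which is one way to see why NW-FCM is computationally cheaper than FWCM, as noted above.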
3.7 Possibilistic Clustering Algorithm (PCA)
The memberships generated by the FCM do not always correspond to the intuitive concept of degrees of belongingness or compatibility [28]. This is due to the probabilistic constraint that the memberships of each data point must sum to one across the classes. Hence, the clustering results may be inaccurate for a noisy dataset [28, 29]. To improve this weakness of FCM, Krishnapuram and Keller relaxed the probabilistic constraint and proposed possibilistic clustering algorithms (PCAs), in which the memberships provide a better explanation of the degrees of belongingness of the data. Since the clustering performance of the PCAs in [28, 29] heavily depends on the parameters used, Yang and Wu [57] suggested a new possibilistic clustering algorithm in which the parameters
Fig. 3.4 (a) An original image and (b) clustering results with FCM, FWCM (m = 2.0, 2.5, 3.0), and NW-FCM. The sky is not properly segmented by the FCM with m = 2.5 and 3.0 and by the FWCM with m = 2, 2.5, and 3. Five clusters are assumed in this image
Table 3.2 A comparison of 1000 experiments for the Indian Pine images. The first column lists the clustering algorithms (FCM, FWCM, and NW-FCM with fuzzy indexes m = 2.0, 2.5, and 3.0), the second column shows the overall accuracy (highest, mean, and variance), the third column gives the accuracy distribution over the intervals [0.5, 1], [0.4, 0.5), and [0, 0.4), and the fourth column indicates the number of clusters (4 or 3) obtained in each of the 1000 runs [16]
where m > 1 is a fuzzifier, the same as that defined in the original FCM. The necessary conditions for a minimizer A of J_CCA are the following update equations for the memberships and the cluster centers in Eqs. 3.51 and 3.52:

u_{ij} = f_i(\| x_j - a_i \|) \quad \text{for } (i, j) \in (I, J)    (3.51)

a_i = \frac{\sum_{j=1}^{N} (Cr_{ij})^{m} x_j}{\sum_{j=1}^{N} (Cr_{ij})^{m}} \quad \text{for } i \in I    (3.52)

In the credibilistic clustering approach, after random initialization of the cluster centers and memberships, the membership matrix μ, the credibility matrix Cr, and the cluster center matrix A are updated until the cluster centers converge, where Cr is given in Eq. 3.53:

Cr = \begin{bmatrix} Cr_{11} & \cdots & Cr_{1n} \\ \vdots & \ddots & \vdots \\ Cr_{k1} & \cdots & Cr_{kn} \end{bmatrix}    (3.53)
The credibilistic clustering algorithm is described as follows.

The Credibilistic Clustering Algorithm (CCA):
Step 1: Choose the number of clusters C, a fuzzifier m, a small value δ for the convergence criterion, and a maximum number of iterations (Max-It). Initialize the cluster centers.
Step 2: Calculate the membership matrix U_X (the collection of the u_{ij}) using a membership function f_i that satisfies the three constraints in Eq. 3.35.
Step 3: Calculate the Cr_{ij} using Eq. 3.49 to build a new credibility matrix Cr (the collection of the Cr_{ij}).
Step 4: Update the centroids of the clusters using Eq. 3.52.
Step 5: Increase the iteration count t by one and check whether it has reached Max-It. If not, repeat Steps 2–4 until the clustering converges, i.e., |Cr_{ij}(t + 1) − Cr_{ij}(t)| ≤ δ for all (i, j) ∈ (I, J).
Step 6: (Clustering) Classify the pixels using the matrix Cr, assigning each pixel to the cluster with the largest credibility.

Similar to the GPCAs, the only difference among the CCAs is the membership function f_i used in the evaluation equation (Eq. 3.51). By using the four specific membership functions f^{(k)} (k = 1, 2, 3, 4) recommended in [64], we obtain four corresponding credibilistic clustering algorithms, denoted CCA(k) (k = 1, 2, 3, 4).
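A small sketch of the credibility-weighted center update of Eq. 3.52 is given below; the credibility matrix Cr is taken as given (its computation via Eq. 3.49 lies on pages outside this excerpt), and the array names and random example are ours.

```python
import numpy as np

def cca_center_update(X, Cr, m=2.0):
    """Eq. 3.52: each cluster center is a credibility-weighted mean of the samples.
    Cr has shape (k clusters, N samples); X has shape (N samples, d features)."""
    Crm = Cr ** m
    return (Crm @ X) / Crm.sum(axis=1, keepdims=True)

# Usage with a made-up credibility matrix, for illustration only.
rng = np.random.default_rng(3)
X = rng.random((10, 2))
Cr = rng.random((3, 10))
A = cca_center_update(X, Cr)
print(A)     # one center per row
```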
3.10 The Support Vector Machine (SVM)
The support vector machine (SVM) is a pattern classification technique proposed by Vapnik et al. [1, 5, 53]. SVM is a nonparametric, distribution-free classification method. Unlike many traditional methods that minimize the empirical training error, the SVM attempts to find an optimal hyperplane for linearly separable patterns [1, 5, 24, 39, 53]. If the patterns are not linearly separable, a kernel function can be used to transform them from the current space into a new higher dimensional space in which they become linearly separable [1]. The SVM also achieves greater empirical accuracy and better generalization capability than other standard supervised classifiers [24, 39]. In particular, SVM has shown good performance on high-dimensional data classification with a small set of training samples [7]. However, Tuia et al. point out that the SVM works as a black-box model and does not directly provide the feature importance from the solution of the model in nonlinear cases [52]. The performance of SVM can also be degraded when some features are uninformative or highly correlated with other features [30]. The SVM is frequently used in image texture classification [27].

The SVM is designed to maximize the margin around the separating hyperplane between two classes, as shown in Fig. 3.5. The optimal decision function is completely specified by a subset of training samples called the support vectors. In a two-dimensional (2-D) pattern space, the optimal hyperplane is a linear function. This linear programming problem is similar to the perceptron [55]. The 2-D separating function (i.e., a hyperplane in 2-D) can be expressed as in Eq. 3.54:

a x_1 + b x_2 = c    (3.54)

where x_1 and x_2 are the elements of a pattern vector and a, b, and c are parameters that determine the position of the function. There is an infinite number of solutions for a, b, and c in Eq. 3.54. Support vectors are the critical elements of the training set: removing them from the training set would change the position of the function. The problem of finding the optimal hyperplane is an optimization problem, which can be solved with optimization techniques using Lagrange multipliers [1, 5, 53]. Assume that a training dataset D = {(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)} in a d-dimensional real space R^d is given, where x_i (i = 1, …, n) is a feature vector and y_i is the corresponding label. The solution of the SVM is a separating hyperplane in the feature (Hilbert) space for a binary classification problem. If we assume that all data points are at a distance of at least one from the hyperplane, then the following two constraints (Eq. 3.55) are satisfied for the training dataset D:

w^T x_i + b \ge 1 \quad \text{if } y_i = 1
w^T x_i + b \le -1 \quad \text{if } y_i = -1    (3.55)
Fig. 3.5 (a) A diagram showing the maximal margin (the sum of d1 and d2 in (b)) and the support vectors between two classes, and (b) two alternative functions L1 and L2 (i.e., 2-D hyperplanes) lying on the support vectors, with L the best dividing function and d1 = d2
Then, since the distance of each support vector from the hyperplane is calculated as in Eq. 3.56,

r = y \, \frac{w^T x + b}{\| w \|}    (3.56)
both the margin and the hyperplane are expressed as in Eqs. 3.57 and 3.58.
\rho = \frac{2}{\| w \|}    (3.57)

w^T x + b = 0    (3.58)

This implies that the following two formulae hold (Eqs. 3.59 and 3.60):

w^T (x_a - x_b) = 2    (3.59)

and

\rho = \| x_a - x_b \|_2 = \frac{2}{\| w \|_2}    (3.60)
where \| w \|_2 is the Euclidean norm. This is shown in Fig. 3.5b. We can now formulate the quadratic optimization for the SVM. The problem of solving the support vector machine is to find w and b such that the margin in Eq. 3.57 is maximized while all training samples {(x_i, y_i)} satisfy the constraints in Eq. 3.55. The solution using quadratic optimization involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every constraint in the primal problem: find α_1, …, α_N such that Q(α) in Eq. 3.61 is maximized with the following two conditions satisfied: (1) \sum_i \alpha_i y_i = 0 and (2) \alpha_i \ge 0 for all \alpha_i:

Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j x_i^T x_j    (3.61)
The solution will have the following form:

w = \sum_i \alpha_i y_i x_i \quad \text{and} \quad b = y_k - w^T x_k \ \text{for any } x_k \text{ such that } \alpha_k \ne 0    (3.62)
Each nonzero α_i indicates that the corresponding x_i is a support vector. The function for the classification will have the following form:

f(x) = \sum_i \alpha_i y_i x_i^T x + b    (3.63)
where x_i^T x is an inner product of the test point x and the support vectors x_i. If slack variables are used in the SVM to prevent overfitting or the situation where the
dataset is not linearly separable [1, 5, 53], a new formulation should be used [1]. A summary of the soft-margin SVM is given below.

Algorithm of the Support Vector Machine (SVM):
Step 1: Obtain the α_i and b using Eqs. 3.61 and 3.62 for the classification function f(x) defined in Eq. 3.63.
Step 2: (Classification) Given a new point x, calculate its score from the projection onto the hyperplane normal using w^T x + b = \sum_i \alpha_i y_i x_i^T x + b, and classify according to the sign of the score.

Since the procedure above is for binary classification, the one-against-one (OAO) strategy can be used for multi-class classification [1, 5, 53].
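For readers who want to try the soft-margin SVM on texture features without implementing the dual problem by hand, the sketch below uses scikit-learn's SVC. The synthetic feature vectors merely stand in for texture features (e.g., co-occurrence or filter-bank statistics), and the parameter choices are illustrative rather than prescriptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in for texture features: two Gaussian "texture classes" in 4-D.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A soft-margin SVM: C penalizes the slack variables, and the RBF kernel plays
# the role of the implicit mapping to a higher dimensional space; the
# one-against-one strategy mentioned above is used for multi-class problems.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
clf.fit(X_tr, y_tr)

print("number of support vectors:", len(clf.support_vectors_))
print("test accuracy:", (clf.predict(X_te) == y_te).mean())
```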
3.11 The K-means Clustering Algorithm with the Ant Colony Optimization (K-means-ACO)
The K-means algorithm is a very useful clustering method; however, it may be trapped in a local optimum while exploring for the global optimal solution. Several methods for improving the K-means have been proposed and developed in the literature [22, 23, 47]. This section describes the K-means algorithm combined with ant colony optimization (ACO) to improve the stability of the clustering results. This algorithm, called the K-means-ACO, can be used for the same clustering problems as the K-means. The ACO algorithm was first introduced to solve the traveling salesman problem (TSP) [10], which is to find the shortest closed path through a given set of nodes that passes through each node once. The ACO algorithm is one of the two main types of swarm intelligence techniques; the other is the particle swarm optimization (PSO) algorithm. Swarm intelligence is inspired by the collaborative behavior of social animals such as birds, fish, and ants and their remarkable formation of flocks, swarms, and colonies [4]. By simulating the interactions of these social animals, a variety of optimization problems can be solved. The ACO algorithm consists of a sequence of local moves with probabilistic decisions based on a parameter, called the pheromone, that guides the search toward the final solution; learning in the ACO takes place through the pheromone. The K-means-ACO is less dependent on the initial parameters, such as randomly chosen initial cluster centers [47], and hence generates more stable results than the K-means. In the clustering of pixels for partitioning an image, the K-means-ACO heavily depends on the probability used [10]. The probability is formulated from the distance (i.e., similarity) between the pixel and the cluster centers and a variable, τ, representing the pheromone level. The pheromone is defined to depend on the minimum distance between each pair of cluster centers and to depend inversely on the distances between each pixel and its cluster center. Hence, the pheromone becomes larger when cluster centers are far apart, and the clusters therefore tend to be
more compact. This gives a high probability of assigning a pixel to such a cluster. Pheromone evaporation is used to weaken the influence of previously chosen solutions. In a sense, the K-means-ACO is similar to an ensemble decision: m ants compete with one another and each ant comes up with a solution. A criterion is defined to find the best solution, and the pheromone level is updated accordingly for the set of m ants as a guide for the next iteration. A termination criterion stops the iteration of the algorithm, and an optimal solution is obtained. The algorithm starts by assigning a pheromone level τ and a heuristic information η to each pixel. Then, each ant assigns each pixel to a cluster based on the probability obtained from Eq. 3.64 [47]:

P_i(x_p) = \frac{\left( \tau_i(x_p) \right)^{\alpha} \left( \eta_i(x_p) \right)^{\beta}}{\sum_{j=1}^{K} \left( \tau_j(x_p) \right)^{\alpha} \left( \eta_j(x_p) \right)^{\beta}}    (3.64)
where P_i(x_p) is the probability of assigning pixel x_p to cluster i, τ_i(x_p) and η_i(x_p) are the pheromone and heuristic information assigned to pixel x_p for cluster i, respectively, α and β are constant parameters that determine the relative influence of the pheromone and heuristic information, and K is the number of clusters. The heuristic information η_i(x_p) is obtained using Eq. 3.65:

\eta_i(x_p) = \frac{CONST}{IDist(x_p, ICC_i) \cdot EDist(x_p, ECC_i)}    (3.65)

where x_p is the pth pixel, ICC_i is the ith intensity cluster center, ECC_i is the ith Euclidean cluster center, IDist(x_p, ICC_i) is the distance in terms of pixel intensity between x_p and ICC_i, EDist(x_p, ECC_i) is the Euclidean distance between x_p and ECC_i, and CONST is used to balance the values of τ_i and η_i. The pheromone level τ_i assigned to each pixel is initialized to one so that it has no effect on the probability at the beginning; the pheromone then grows larger for better solutions. Each ant explores its own solution. After the m ants have completed their exploration in an iteration, the current best solution is chosen and the pheromone assigned to this solution is incremented. In addition, the cluster centers are updated with the cluster centers of the current best solution. In other words, in each iteration, each of the m ants finds its solution based on the best solution found by the previous m ants. This procedure is repeated until a maximum number of iterations is reached or the overall best solution is achieved.
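The following fragment is a minimal sketch of Eqs. 3.64–3.65 for a single pixel. The function and argument names are ours, and the intensity distance is taken here as the absolute gray-level difference, which is an assumption rather than a detail given in [47].

```python
import numpy as np

def aco_assignment_probabilities(pixel_int, pixel_xy, icc, ecc, tau,
                                 alpha=1.0, beta=1.0, const=1.0, eps=1e-12):
    """Eqs. 3.64-3.65: heuristic information from the intensity and spatial
    distances to each cluster center, combined with the pheromone levels tau."""
    idist = np.abs(pixel_int - icc) + eps                    # IDist(x_p, ICC_i)
    edist = np.linalg.norm(pixel_xy - ecc, axis=1) + eps     # EDist(x_p, ECC_i)
    eta = const / (idist * edist)                            # Eq. 3.65
    score = (tau ** alpha) * (eta ** beta)
    return score / score.sum()                               # Eq. 3.64

# Example: one pixel, three clusters, all pheromone levels initialized to one.
p = aco_assignment_probabilities(
    pixel_int=120.0,
    pixel_xy=np.array([10.0, 12.0]),
    icc=np.array([100.0, 128.0, 200.0]),
    ecc=np.array([[8.0, 10.0], [30.0, 40.0], [5.0, 60.0]]),
    tau=np.ones(3))
print(p, p.sum())
```

An ant would then draw the pixel's cluster label from this distribution, for example with np.random.default_rng().choice(len(p), p=p).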
The best solution in each iteration is chosen according to two factors: the distance between cluster centers, and the sum of the gray-scale and spatial distances between each pixel and its cluster center (i.e., the similarity and compactness of the clusters). The following three conditions should be satisfied: (1) the distance between cluster centers should be large, so that the clusters are well separated; (2) the sum of the gray-scale distances between each pixel and its cluster center should be small, so that all members of each cluster are similar in the gray-scale space (it becomes a color space if three image bands are used); and (3) the sum of the spatial distances between each pixel and its cluster center should be small, so that each cluster is compact. To achieve the first condition, for each ant k (k = 1, …, m) the distances between every pair of cluster centers are calculated and sorted to select the minimum distance Min(k). Among all of these minimum distances given by the ants, we choose the maximum of them, Min(k'). To achieve the second and third conditions, for the clustering performed by each ant we compute the sum of the distances between each pixel and its cluster center and sort these distances. Then we pick the maximum for each ant, compare these maxima, and select the overall minimum. The second and third largest values of the solutions are compared in the same way and the minimum is picked. The solutions are voted on their leading features, and the solution with the larger vote is selected as the best solution. After the best solution is found, the pheromone value is updated according to Eq. 3.66 [47]:

\tau_i(x_p) = (1 - \rho) \, \tau_i(x_p) + \sum_i \Delta\tau_i(x_p)    (3.66)
where ρ is the evaporation factor (0 ≤ ρ ≤ 1), which causes the earlier pheromones to vanish over the iterations. Therefore, as the solution becomes better, the corresponding pheromones have more effect on the next solution than the earlier pheromones from the previous iterations. The quantity Δτ_i(x_p) in Eq. 3.66 is the amount of pheromone added to the previous pheromone by the successful ant, which is obtained from Eq. 3.67:

\Delta\tau_i(x_p) = \begin{cases} \dfrac{Q \cdot Min(k')}{AvgIDist(k', i) \cdot AvgEDist(k', i)} & \text{if } x_p \text{ is a member of cluster } i, \\ 0 & \text{otherwise.} \end{cases}    (3.67)
In Eq. 3.67, Q is a positive constant related to the quantity of pheromone added by the ants, Min(k') is the maximum of all the minimum distances between every two cluster centers, obtained by ant k', AvgIDist(k', i) is the average of the intensity distances within cluster i, and AvgEDist(k', i) is the average of the Euclidean distances between all pixels in cluster i and their cluster center obtained by ant k'. When the clusters fall further apart, Min(k') makes the pheromone larger and hence increases the probability. When the members of a cluster are similar, AvgIDist(k', i) increases the pheromone level; similarly, when the cluster is more compact, AvgEDist(k', i) increases the pheromone level. In other words, a larger value of Min(k') corresponds to clusters that are further apart. To achieve a larger pheromone level, Min(k') should be large and both AvgIDist(k', i) and AvgEDist(k', i) should be small. Next, the cluster centers are updated with the cluster centers of the best solution. The procedure is repeated for a certain number of iterations until the best solution is obtained. The algorithm is described below.

The K-means-ACO Algorithm:
Step 1: Initialize the pheromone level to one, the number of clusters to K, and the number of ants to m. Choose a small value δ for the convergence criterion, an iteration counter t, and a maximum number of iterations (Max-It).
Step 2: Initialize each of the m ants with K random cluster centers.
Step 3: For each ant, assign each pixel x_i, i ∈ n, to one of the clusters based on the probability given in Eq. 3.64. This step is repeated for each of the m ants.
Step 4: Update the Euclidean cluster centers and the intensity cluster centers and recalculate the cluster centers. If the difference between each of the current cluster centers and the previous cluster centers is less than δ, i.e., |c_j(t + 1) − c_j(t)| ≤ δ for j = 1, 2, …, K, go to the next step; otherwise, repeat Step 3.
Step 5: Save the best solution among the m solutions found.
Step 6: Update the pheromone level on all pixels according to the best solution using Eq. 3.66.
Step 7: Update the cluster centers with the cluster center values of the best solution.
Step 8: If the number of iterations reaches Max-It, the procedure stops; otherwise, go to Step 3.
Step 9: (Clustering) Classify every pixel to one of the clusters based on the minimum distance between the pixel and the cluster centers of the optimal solution obtained.

The stability of the K-means-ACO is illustrated in Figs. 3.6 and 3.7, where different numbers of iterations of both the K-means and the K-means-ACO are used. The figures show that, for a proper selection of the initial cluster centers, both algorithms work well. However, if the initial cluster centers are not suitably selected, the K-means may give unsatisfactory results. The experiment shows that the K-means is not a stable algorithm.
Fig. 3.6 (a) An original image, (b) the result of the K-means with K = 3, and (c) and (d) results of the K-means-ACO with K = 3 and different numbers of iterations (in these experiments, the initial seed values are properly chosen for both algorithms)
3.12 The K-means Algorithm and Genetic Algorithms (K-means-GA)
The genetic algorithm (GA) can improve the performance of the K-means algorithm for unsupervised clustering. This has been illustrated in several different applications, including color image quantization and image classification [19, 20, 48, 56]. The evolution begins with an initial population. The selected chromosomes (candidate solutions) compete with one another to reproduce based on the Darwinian principle of "survival of the fittest" in each generation of the evolution. After a number of generations, the chromosomes that survive in the population are the optimal solutions. Similar to any algorithm that uses a GA for solution evolution, the K-means algorithm needs to generate a set of solutions, P, to form a population for the solution search. This population is shown in Fig. 3.8. A chromosome can be formulated using one of several different representations [12].
Fig. 3.7 (a) and (b) results of the K-means with K = 3, and (c) and (d) results of the K-means-ACO with K = 3 and different numbers of iterations (in these experiments, the initial seed values are not properly chosen for both algorithms)
Fig. 3.8 A population of P strings; each string concatenates the K cluster centers (C_11…C_1L, C_21…C_2L, …, C_K1…C_KL), so the length of each string is L × K, where L is the dimension of a cluster center and K is the number of clusters
The separability (i.e., distance/similarity) is one of the measures frequently used in the evaluation of classification accuracy. The measure is based on how much overlap there is among the class distributions. Among such measures, the Bhattacharyya distance, the
transformed divergence, and the Jeffries–Matusita (JM) distance are widely used. The Bhattacharyya distance has the disadvantage that it continues to grow even after the classes have become well separated. The transformed divergence has a problem in distinguishing two different clusters. To evaluate the effectiveness of the K-means-GA, the JM distance was used as the fitness function [49]. The statistics for each class in the JM-distance computation were obtained by distributing the training samples among the classes (a pixel is assigned to the class whose mean is closest to the pixel in terms of the Euclidean distance). The average JM distance is then computed by dividing the sum of the pairwise JM distances by the number of pairs. The JM value lies in the range of 0 to 2; a higher JM value indicates a better result. The formula for calculating the JM distance between classes i and j is given in Eq. 3.68. Please note that this formula assumes that the classes are normally distributed.

(JM)_{ij} = 2 \left( 1 - e^{-B} \right)    (3.68)

where

B = \frac{1}{8} (m_i - m_j)^T \left( \frac{Cov_i + Cov_j}{2} \right)^{-1} (m_i - m_j) + \frac{1}{2} \ln\left( \frac{\left| \frac{Cov_i + Cov_j}{2} \right|}{|Cov_i|^{1/2} \, |Cov_j|^{1/2}} \right)    (3.69)

Equation 3.69 is the Bhattacharyya distance [44, 49]; m_i and Cov_i are the mean and covariance of class i, respectively, and ln is the natural logarithm. The K-means with GA algorithm is described below.

The K-means-GA Algorithm:
Step 1: Determine the number of clusters, K, and their initial cluster centers for a number of chromosomes (strings), say P, to form a population. Initialize the crossover probability p_c, the mutation probability p_m, and a maximum number of iterations (Max-It). Max-It is the number of generations over which the algorithm evolves.
Step 2: Distribute the pixels in the training set among the K clusters by the minimum-distance criterion using the Euclidean distance measure for each string separately, and update the centroid of each cluster by averaging the pixel values assigned to it. This step is repeated P times, once for each string.
Step 3: (Crossover) For each string, a one-point crossover is applied with probability p_c. A partner string is randomly chosen for mating. Both strings are cut into two portions at a randomly selected position between j and j + 1, and the portions are mutually interchanged, where j = 1 to N2.
Step 4: (Mutation) Mutation with probability p_m is applied to one element of each string. In our experiments, either −1 or +1 is selected randomly by comparison with the probability p_m (if the random probability is less than p_m,
−1 is selected; otherwise, +1 is selected) and added to the chosen element. The mutation operation is used to prevent fixation at a local minimum in the search space.
Step 5: (Reproduction) The JM distance in Eq. 3.68 is used as the fitness function for each string (i.e., the average of the JM values over all pairs of clusters). All strings are evaluated with the fitness function and compared pairwise. In each comparison, the string with the higher JM value is retained and the other is discarded and replaced by a new, randomly generated string. In other words, only half of the strings in a population survive and the other half are regenerated as new random strings representing new class means.
Step 6: Repeat Steps 2–5 for the number of generations defined by the user. When the procedure terminates (i.e., Max-It is exceeded), the string with the maximum JM distance is selected as the best solution.
Step 7: (Clustering) Classify every pixel to one of the clusters based on the minimum distance between the pixel and the cluster centers.
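The JM fitness of Eqs. 3.68–3.69 is easy to compute directly; the sketch below is our own NumPy version, assuming each class is summarized by the sample mean and covariance of the pixels currently assigned to it. For the GA fitness, the values would be averaged over all pairs of clusters as described above.

```python
import numpy as np

def jm_distance(Xi, Xj):
    """Jeffries-Matusita distance between two classes of samples (Eqs. 3.68-3.69),
    assuming each class is normally distributed."""
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    Ci, Cj = np.cov(Xi, rowvar=False), np.cov(Xj, rowvar=False)
    C = (Ci + Cj) / 2.0
    diff = mi - mj
    B = diff @ np.linalg.inv(C) @ diff / 8.0 \
        + 0.5 * np.log(np.linalg.det(C)
                       / np.sqrt(np.linalg.det(Ci) * np.linalg.det(Cj)))   # Eq. 3.69
    return 2.0 * (1.0 - np.exp(-B))                                        # Eq. 3.68

# Two synthetic classes: the JM value approaches 2 as the classes separate.
rng = np.random.default_rng(5)
A = rng.normal(0, 1, (200, 3))
B_cls = rng.normal(3, 1, (200, 3))
print(jm_distance(A, B_cls))
```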
3.13 The K-means Algorithm and Simulated Annealing (K-means-SA)
Simulated annealing (SA) is an optimization technique based on the process of annealing metals [55]. When a metal is heated to a high enough temperature and then cooled down slowly, it ultimately reaches an optimal state that corresponds, in a sense, to a global minimum of energy, indicating that an optimal solution has been obtained. The process of annealing can be simulated with the Metropolis algorithm [40]. The laws of thermodynamics state that at temperature T, the probability of an increase in energy of magnitude ΔE is given by

p(\Delta E) = \exp\left( -\frac{\Delta E}{kT} \right)    (3.70)
where k is a constant known as Boltzmann's constant [55]. This equation is used directly in simulated annealing, although it is usual to drop Boltzmann's constant, since it was only introduced into the equation for the simulation of different materials. Therefore, a worse state is accepted with the probability given by Eq. 3.71:

p = \exp\left( -\frac{\Delta E}{T} \right) > r    (3.71)
where ΔE = the change in the cost function between the next state and the current state, T = the current temperature, and
r = a random number between 0 and 1.

When the temperature T is high, the probability is close to 1, which means that the probability of accepting any move is very high. When the temperature is low, the probability in Eq. 3.71 is close to 0, indicating that the algorithm will avoid a bad move if the next state is not better. This guarantees that only the most promising solutions are explored when the temperature is low. This approach allows SA to explore solutions that the K-means cannot. Simulated annealing introduces randomness into the selection of solutions, which keeps the algorithm from being trapped in a local optimum. A general SA algorithm is listed below.

A Simulated Annealing (SA) Algorithm:
Step 1: Start with an initial temperature T and a random candidate solution.
Step 2: Evaluate this candidate and label it as the current state.
Step 3: Generate a new candidate and evaluate this solution.
Step 4: If this new candidate is better than the current state, accept it as the current state and discard the previous one.
Step 5: If this new candidate is not better than the current state, calculate the probability using Eq. 3.71. If the probability is greater than a random probability, accept the new candidate as the current state.
Step 6: Reduce the temperature and repeat Steps 3–5 until the temperature has cooled down to zero.

An integrated algorithm combining simulated annealing with the K-means is sketched below.

The K-means with Simulated Annealing (K-means-SA):
Step 1: Choose the number of clusters, K, the initial cluster centers S = {C_1, …, C_K}, and an initial temperature T.
Step 2: Apply the K-means algorithm to assign pixels to the clusters and evaluate the performance index for the K clusters. Set this as the current solution.
Step 3: Produce a new solution (which can be based on the current solution).
Step 4: Apply the K-means algorithm to assign pixels and evaluate the performance index for this new solution. For example, the JM distance can be used as the performance index.
Table 3.5 A comparison of the JM distance for the algorithms applied to the images shown in Fig. 3.9

Algorithm       JM value (I1)   JM value (I2)   JM value (I3)
K-means         1.8294          1.8065          1.8955
K-means-SA      1.8550          1.8154          1.9059
K-means-GA      1.8550          1.8375          1.9062
Fig. 3.9 (I1) a satellite image with 3 classes, (I2) a satellite image shown with band 4 only (3 classes), and (I3) a satellite image shown with band 4 only (9 classes)
Step 5: If this new solution is better than the current solution, accept it as the current solution. If not, check whether the probability in Eq. 3.71 is greater than the random probability; if it is, accept the new solution as the current solution, otherwise reject it.
Step 6: Reduce the temperature T and repeat Steps 3–5 until the temperature equals 0 or a stopping criterion is met.

To evaluate the efficiency of the K-means-SA, some experimental results are presented in Table 3.5. The test images are shown in Fig. 3.9. The JM distance is used as the cost (energy) function for the K-means-SA. The final average JM distance is computed as shown in Table 3.5; the higher the average JM distance, the better the result. The results shown in this table were obtained by repeating the algorithm four times with an initial temperature of 20 for each image.
Fig. 3.10 A graph showing the relationship between the JM distance and the number of iterations of the K-means for images I1, I2, and I3 shown in Fig. 3.9
Fig. 3.11 A graph showing the relationship between the JM distance and the number of iterations of the K-means-GA (number of clusters 3, population 10, crossover probability 0.7, and mutation probability 0.3) for images I1, I2, and I3 shown in Fig. 3.9
Fig. 3.12 A graph showing the relationship between the JM distance and the temperature of the K-means-SA for images I1, I2, and I3 shown in Fig. 3.9
From the experiments, we observed that a high temperature should be used when the image is large and the number of clusters is small; conversely, a low temperature can be used when the image is small and the number of clusters is large. The JM distance indicates that the K-means-GA is one of the efficient algorithms. The graphs in Figs. 3.10 and 3.11 show the relationship between the JM distance and the number of iterations for the K-means and the K-means-GA, respectively, and the graph in Fig. 3.12 shows the relationship between the JM distance and the change in temperature for the K-means-SA. Figure 3.10 indicates that the JM distance for the K-means does not change significantly, and neither does that of the K-means-GA. On the other hand, the JM distance for the K-means-SA increases significantly for some images as the number of iterations is increased. The K-means-SA does not depend much on the diversity of the initialization, because the high temperature gives the algorithm more opportunities to explore promising solutions. Among the three algorithms tested, the K-means-GA is the most stable. In the K-means-SA, the algorithm sequentially generates only one solution depending on the temperature; a higher temperature gives a higher probability of accepting the new solution even though it may temporarily not be a good one. Thus, the K-means-SA allows the search to move in different directions and avoid being trapped in local optima.
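The acceptance rule of Eq. 3.71 is the heart of both the SA and the K-means-SA listings above. The sketch below is a minimal illustration (with the JM-based cost expressed as an "energy" to be minimized, an assumption on our part); it shows how the acceptance rate for a worse move shrinks as the temperature cools.

```python
import math
import random

def accept(delta_e, temperature):
    """Eq. 3.71: always keep an improvement; accept a worse solution with
    probability exp(-dE/T), compared against a uniform random number r."""
    if delta_e <= 0:                     # the new state is no worse (lower energy)
        return True
    return math.exp(-delta_e / temperature) > random.random()

# At a high temperature almost any uphill move is accepted; near zero, almost none.
random.seed(0)
for T in (20.0, 5.0, 0.5):
    rate = sum(accept(1.0, T) for _ in range(10000)) / 10000
    print(f"T = {T:4.1f}, acceptance rate for dE = 1: {rate:.2f}")
```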
3.14 The Quantum-Modeled K-means Clustering Algorithm (Quantum K-means)
The ability to group data naturally and accurately is essential to applications such as image segmentation, so techniques that enhance accuracy are of keen interest. One such technique applies a quantum mechanical system model, such as that of the quantum bit, to generate probabilistic numerical output to be used as variable input for clustering algorithms [38, 41]. It has been demonstrated that applying a quantum bit model to data clustering algorithms can increase clustering accuracy, as a result of simulating superposition as well as possessing both guaranteed and controllable convergence properties [8]. The quantum model, in the form of a string of qubits, has been used for the diversity obtained from representing the quantum mechanical superposition of states, and previous works have applied this model to various algorithms [8, 17, 18, 38]. To understand how a qubit works, one needs a fundamental understanding of quantum state systems. The basis of quantum physics is derived from the Schrödinger equation for matter waves [13]. In this section, we briefly introduce how the quantum concept is combined with the K-means to improve its clustering results [8]. Quantum properties can be simulated via quantum system modeling algorithms and used in applications that can benefit from these inherent properties. Quantum bits can be manipulated via quantum operators, such as a rotation gate, which consists of a unitary matrix and can be used to bring about algorithmic convergence. In quantum mechanics, a superposition or linear combination of states can be represented in Dirac notation [42] such that

|\psi\rangle = \alpha |0\rangle + \beta |1\rangle    (3.72)

where ψ represents an arbitrary state vector in Hilbert space, α and β represent probability amplitude coefficients, and |0⟩ and |1⟩ represent basis states. These basis
states correspond to spin up and spin down, respectively. The state vector in normalized form can be represented as

\langle \psi | \psi \rangle = 1    (3.73)

or equivalently

|\alpha|^2 + |\beta|^2 = 1    (3.74)
where |α|^2 and |β|^2 are the probabilities of a quantum particle measurement yielding a particular state. Moreover, due to the superposition of quantum states, the particle may be in either a single state or multiple states simultaneously. These probability amplitudes are complex numbers, and in matrix form a qubit can be represented as

\begin{bmatrix} \alpha \\ \beta \end{bmatrix}    (3.75)

Moreover, a series of qubits can form a string such that

\begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m \\ \beta_1 & \beta_2 & \cdots & \beta_m \end{bmatrix}    (3.76)
In addition, the states |0⟩ and |1⟩ can be represented in bit form as

|0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix}    (3.77)

|1\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix}    (3.78)
Likewise, a qubit string can be expressed in the same notation [59]. For instance, the following eight-bit string 11010010 can be represented as

\begin{bmatrix} 0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix}    (3.79)

or in tensor product form

|1\rangle \otimes |1\rangle \otimes |0\rangle \otimes |1\rangle \otimes |0\rangle \otimes |0\rangle \otimes |1\rangle \otimes |0\rangle    (3.80)
In order to simulate the quantum mechanical property of superposition of states, a probabilistic Turing machine must be utilized. The quantum-modeled algorithm uses the output of this state machine as input to the clustering algorithms. Moreover, the probability of obtaining a particular outcome can be controlled by operating on the associated qubits directly with a quantum rotation gate. To implement the quantum state machine, a series of black-box quantum "oracle" objects is utilized in each algorithmic iteration, each of which in this context proposes a solution. The qubit string representation previously described represents a superposition of states, and each oracle possesses n qubit strings with m qubits per string. Since m represents the string length in qubits, the value of m for each string is chosen according to the estimated optimal initial centroid value. The total length of each qubit string is the product of m, the number of bands of the image being segmented, and the number of clusters chosen beforehand. Once the desired centroid value is determined, the number of qubits is specified accordingly, along with the appropriate number of clusters for the dataset of interest. Moreover, qubit strings may be applied to any random variable, not just cluster centers. We simply apply qubit string values to all desired random variables, obtain a measure of fitness via the inverse of an appropriate cluster validity index, determine the fittest solution so far, and use the rotation gate and the state information of the best solution to converge the output of the state machine toward that of the best solution. The quantum rotation gate is a unitary matrix operator and can be formulated as follows:

G = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}    (3.81)
The angle of rotation θ is determined via a lookup table based upon the variable angle distance method [38]. The rotation gate is only one of a variety of gates that can be applied to the probability amplitude coefficients of each qubit [41]. The quantum K-means (QKM) algorithm [8] begins by initializing a population of quantum oracles, and then a call is made to Make P(t). Following Make P(t), the decoded decimal values are provided as initial centroid input to the K-means, and hard partitioning ensues until a stopping criterion is met. Afterward, the cluster fitness is evaluated via the Davies–Bouldin (DB) index method [9], and the fitness value is stored in the oracle. The fitness function used in the QKM is given in Eq. 3.82:

f = \frac{1}{1 + DB}    (3.82)
The fitness of the oracle is compared with the best solution stored so far, and the oracle that possesses the superior fitness value is stored as the new best solution. In order to guide the current solution toward the best stored solution, and hence bring about convergence, the quantum rotation gate in Eq. 3.81 is applied, after the fitness evaluation of the oracle population, to the probability amplitudes α and β,
Fig. 3.13 a An original image, b a segmented result using the K-means, and c a segmented result using the quantum K-means [8]. The number of clusters was set to five for both algorithms
respectively, for each qubit in each string. Following the quantum gate rotation, quantum crossover and mutation operators are applied to randomly chosen individuals. The process described above continues for the specified number of iterations. A summary of the algorithm is listed below.

The Quantum K-means Algorithm (QKM):
Step 1: Initialize Q(t) (population initialization).
Step 2: Make P(t) (observe, encode, decode).
Step 3: Perform the K-means clustering.
Step 4: Evaluate P(t) (XB index).
Step 5: Store B(t) (store the best solution).
Step 6: Repeat Steps 3–5 for the entire population.
Step 7: Apply the quantum rotation gate.
Step 8: Apply the quantum crossover.
Step 9: Apply the quantum mutation inversion.
Step 10: Make P(t) (observe, encode, decode).
Step 11: Repeat Steps 3–10 for every iteration.
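A minimal sketch of the two quantum-modeled ingredients used above, observation of a qubit string and the rotation gate of Eq. 3.81, is given below. The uniform initial superposition and the fixed rotation angle are our simplifications; the QKM of [8] selects the angle from a lookup table.

```python
import numpy as np

def observe(qubits, rng):
    """Collapse each qubit column [alpha, beta] to a classical bit; the chance of
    reading 1 is |beta|^2 (Eq. 3.74 guarantees the probabilities sum to one)."""
    prob_one = np.abs(qubits[1]) ** 2
    return (rng.random(qubits.shape[1]) < prob_one).astype(int)

def rotate(qubits, theta):
    """Apply the rotation gate G of Eq. 3.81 to every qubit in the string."""
    G = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return G @ qubits

# A string of 8 qubits initialized to the uniform superposition (alpha = beta = 1/sqrt(2)).
rng = np.random.default_rng(6)
q = np.full((2, 8), 1.0 / np.sqrt(2.0))
print("observed bits:", observe(q, rng))

# Rotating toward |1> raises the probability of measuring 1 on the next observation.
q = rotate(q, theta=0.2)
print("P(1) per qubit after rotation:", np.round(np.abs(q[1]) ** 2, 3))
```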
The experimental results shown in Fig. 3.13 give a visual comparison between the K-means and the quantum K-means clustering algorithms. They show that the quantum K-means gives a smoother and more accurate result than the K-means.
3.15 The Pollen-Based Bee Algorithm for Clustering (PBA)
Similar to the ant colony, the collective intelligence of social insect groups, including honey bees, presents an appealing model for problem solving [6]. Many bee algorithms have been developed for data clustering, among them the bee algorithm (BA), the virtual bee algorithm (VBA), the artificial bee colony algorithm (ABC), the beehive algorithm (BeeHive), the bee swarm optimization algorithm (BSO), and the bee colony optimization algorithm (BCO) [43, 58]. The BA developed by Pham et al. has been used to solve problems in pattern recognition and data clustering
[43]. The bee algorithm is a solution-space search method that can be formulated as an objective function with constraints; its goal is to find an optimal solution for the defined objective function. The VBA uses bees and the intensity of bee interactions to evaluate solutions (i.e., food sources) [58]. The bees in ABC must know the number of food sources around the hive [6]; each food source is assigned to an employed bee (forager) that determines whether a better solution can be found in its neighborhood. The beehive algorithm uses foraging behavior and waggle dances in its method. The BSO is based on the foraging behaviors of honey bee swarms: three different types of bees (foragers, onlookers, and scouts) employ different flying patterns for adjusting their trajectories in the search space [6]. Each bee in BCO incrementally builds a solution to a problem over a number of iterations [6]. A traditional BA is listed below to serve as a basic model for clustering [43].

The BA Algorithm:
Step 1: Initialize the population of the solution space.
Step 2: Evaluate the fitness of each solution in the population.
Step 3: While (the stopping criterion is not true)
Step 3(a): Select sites for the neighborhood search.
Step 3(b): Recruit bees for the selected sites (more bees are recruited for the best sites, often called elite sites) and evaluate their fitness.
Step 3(c): Select the fittest bee from each site and update the solution space.
Step 3(d): Assign the remaining bees to search randomly and evaluate their fitness.
End While.
Step 4: Output the best results.
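The listing above translates directly into a generic search skeleton. The sketch below is our own minimal version; for clustering, the fitness would be, for example, the inverse of a cluster validity index and a "solution" a set of K cluster centers, as in [43]. The parameter names and the toy one-dimensional objective are illustrative only.

```python
import random

def bee_algorithm(fitness, random_solution, neighbour,
                  n_scouts=20, n_sites=5, n_elite=2,
                  bees_elite=7, bees_other=3, iterations=50):
    """A generic BA skeleton following the steps above: scouts sample the search
    space, the best sites are exploited by recruited bees, and the remaining
    scouts keep exploring at random."""
    population = [random_solution() for _ in range(n_scouts)]
    best = max(population, key=fitness)
    for _ in range(iterations):
        population.sort(key=fitness, reverse=True)
        new_population = []
        for rank, site in enumerate(population[:n_sites]):
            recruits = bees_elite if rank < n_elite else bees_other
            # Neighbourhood search: keep the fittest bee found around this site.
            candidates = [neighbour(site) for _ in range(recruits)] + [site]
            new_population.append(max(candidates, key=fitness))
        # Remaining bees scout new random solutions.
        new_population += [random_solution() for _ in range(n_scouts - n_sites)]
        population = new_population
        best = max(population + [best], key=fitness)
    return best

# Toy usage: maximise a 1-D objective with a single optimum at x = 3.
f = lambda x: -(x - 3.0) ** 2
rand_sol = lambda: random.uniform(-10, 10)
neigh = lambda x: x + random.uniform(-0.5, 0.5)
random.seed(0)
print(round(bee_algorithm(f, rand_sol, neigh), 3))
```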
Based on the BA, a pollen-based bee algorithm (PBA) was developed for data clustering and image segmentation [6]. The PBA can be considered a variation of the BA. The PBA embeds bee, hive, and environmental interactions and the change of seasons in its design, and presents a more precise swarm analogy that allows the PBA model to converge autonomously upon high-quality solutions. The PBA utilizes the natural concept of pollen depletion which, along with unique methods of field and honeycomb management, increases the exploitation of optimal solutions. At the same time, the PBA tries to reduce the complexity of the user input parameters found in some bee algorithms. There are three core conceptualizations behind the PBA [6]. The first is the construction of a bee such that a single bee is a single element of the solution set rather than comprising the entire solution set (as opposed to "Each bee represents a potential clustering solution as a set of K cluster centers" [43]). Second, the storage of pollen solutions within the hive, as honey, helps define the landscape of active fields, independent of any one bee's ability, so that "source fields" can be rated for the suitability of their solutions. The third conceptualization is the introduction of pollen depletion to model the advancing seasons and mimic the natural changes in the landscape of pollen availability and bee industry, leading to convergence.
In BA, the bees are paramount in their carrying of the solution set and communicating their solution set to other bees. This is not the case in PBA, where the bees are individual elements without any understanding of the solution set required as a whole. This division of the solution as a whole from an individual of the swarm led to the development of the “field” and of “field solution storage” at the hive, in the form of honey which then allows the solution sets to improve by the individual work of the bees and not by the efforts of any single bee alone. The PBA model also takes advantage of particle swarm optimization (PSO) (as does VBA) and thereby considers one bee as one particle of the swarm. PBA does not, however, communicate a global best solution among its bees and instead allows the quality of the pollen gathered from the fields to be rated at the hive in the form of honey in the honeycomb. PBA does have two types of bees, but, unlike ABC, these bees do not change roles nor are they treated like “generations” where one set of bees is replaced by another later generation. Another similarity between PBA and ABC is the abandonment of food sources (or fields when speaking of PBA) so that our model will abandon a field only if another better field is discovered by the scouts. What beehive algorithm shares in common with PBA is the treatment of the bees as two types, long and short distance agents, in which the agents can be interpreted as individual particles like PSO, and thus like PBA’s bees. The long-distance agents would be more comparable to the scout bees in PBA (who explore the landscape to return undiscovered pollen in a random fashion) and the short distance agents as comparable to the forager bees (who return to a given field and pollen location in order to exploit the area around an already given and rated pollen source). Contrasting PBA with BSO algorithm, by using the pollen depletion method, PBA allows the scouts to search the entire landscape without constriction to a certain radius. Also, the foragers are encouraged to begin their search for better solutions in reverse of that described by BSO, starting in a small area and given enough time (if it is early enough in the harvesting season and there are multiple foragers employed) they are dispersed further and further afield from that solution which allows easy escape from local optima. BCO is most similar to ant colony optimization [4] in its handling of its members. PBA, unlike BCO, does not use a reset strategy as the bees are not able to influence each other in their tasks—the scouts will explore and the foragers will exploit without regard to any particular member finding a better or worse pollen location. The individuals in PBA do their individual best to perform up to the task assigned and so do not change their performance level nor actions based on other individuals of the hive—this allows the swarm as a whole to survive the “missteps” that single individuals can make as they perform their duties.
The PBA algorithm is presented below. The details of pollen depletion, scouts, fields, landscape, foragers, environment, and interactions are given in [6].

The PBA Algorithm:
Step 1: Initialize variables.
Step 2: While pollen is not depleted and scoutable fields exist, repeat Steps 3–5.
Step 3: Start the scout bees loop: (1) explore the landscape (find pollen), (2) create fields (ranked) as found by the scout bees, and (3) evaluate/replace stored fields with any better fields found.
Step 4: Apply pollen depletion (modifying the numbers of bees and viable fields).
Step 5: Start the forager bees loop: (1) forager bees are assigned a stored field to exploit, (2) explore/exploit within the recruited field, and (3) evaluate/replace the stored pollen source with better returned pollen sources.
Step 6: The value of the best field (having the best honey in the honeycomb) is the solution.
For an empirical comparison, the PBA is compared on classification accuracy with some of the clustering algorithms previously discussed using the Iris, Glass, and Wine datasets [6, 64]. These algorithms include FCM, FWCM, NW-FCM, PCAs, and GPCAs. Tables 3.6, 3.7 and 3.8 show the accuracy results. Different percentages of pollen depletion were used in the PBA simulation. Except for the results of the PBA, the results are from [64]. Experimental results on two color images for a visual comparison between the BA and the PBA are shown in Fig. 3.14 [6]. With different parameters used in both algorithms, the segmentation results will be different.

Table 3.6 A comparison of the classification accuracy (%) on the Iris dataset [6, 63]

Algorithm              Highest   Mean    Variance
PBA with 28% pollen    98.00     89.20   0.74
FCM                    89.33     89.26   1.53
FWCM                   92.66     92.42   5.24
NW-FCM                 92.66     90.97   19.17
PCA93                  92.67     80.00   281.67
PCA96                  95.33     77.23   179.36
PCA06                  92.00     79.64   263.24
GPCA {f1}              92.66     80.01   274.04
GPCA {f2}              92.00     79.27   266.47
Table 3.7 A comparison of the classification accuracy (%) on the Glass dataset [6, 64]

Algorithm              Highest   Mean    Variance
PBA with 40% pollen    61.68     52.63   2.79
FCM                    55.14     49.12   2.66
FWCM                   54.21     44.42   15.22
NW-FCM                 47.66     41.11   5.77
PCA93                  46.26     38.93   0.97
PCA96                  45.79     35.75   6.28
PCA06                  62.62     45.62   16.71
GPCA {f3}              55.61     48.82   11.41
GPCA {f4}              56.08     46.20   15.51

Table 3.8 A comparison of the classification accuracy (%) on the Vowel dataset [6, 64]

Algorithm              Highest   Mean    Variance
PBA with 10% pollen    41.21     35.29   1.06
FCM                    32.02     28.00   1.15
FWCM                   N/A       N/A     N/A
NW-FCM                 N/A       N/A     N/A
PCA93                  32.56     23.43   8.17
PCA96                  33.64     24.41   5.54
PCA06                  40.10     30.48   6.70
GPCA {f5}              39.90     30.54   6.67
GPCA {f6}              40.40     30.78   6.80

3.16 Summary
Techniques for image texture classification and segmentation are very useful for image texture analysis, and clustering methods are frequently used for both tasks. We introduced supervised and unsupervised algorithms that have been widely used in traditional pattern recognition and machine learning, including both crisp and fuzzy clustering approaches. The K-means clustering algorithm is a simple and efficient method for the natural grouping of data. Because of its local-minima drawback, several optimization methods have been used to improve the algorithm, among them ant colony optimization (ACO), particle swarm optimization (PSO), genetic algorithms (GA), and simulated annealing (SA). It has been demonstrated that these methods not only improve the results of the K-means algorithm but also make it more stable and prevent it from falling into local minima. A quantum K-means (QKM) algorithm was also described to compare the performance of the K-means and the QKM and to show how the quantum model improves the K-means. A nature-inspired
Fig. 3.14 a and b two original images, c and d results of the BA, and e and f results of the PBA [6]
bee algorithm based on the pollen depletion concept was also given to confirm its merits in image segmentation. Some experimental results were reported for the datasets tested. Although not all of the test data are image textures, these algorithms can be used in image texture analysis. Artificial neural networks and their deep models have been used primarily in pattern and image classification [34]. These neural computation platforms are parallel and are frequently used to construct more complex computing systems.
3.17 Exercises
A hypothetical color image (Red, Green, and Blue bands) is given below, and two initial cluster centers are (0, 1, 2)^T and (1, 2, 3)^T.

Red:
0 1 2 3
1 2 3 0
2 3 0 1
3 0 1 2

Green:
1 3 1 1
2 2 3 2
3 1 2 3
3 1 1 2

Blue:
3 0 1 1
2 1 3 3
2 3 1 2
3 2 2 1
1. Apply the K-means clustering algorithm to classify the image into two clusters with the two initial cluster centers given.
2. Use the fuzzy C-means algorithm (FCM) to classify the image into two clusters with the two initial cluster centers given.
3. Apply the fuzzy weighted C-means algorithm (FWCM) to classify the image into two clusters with the two initial cluster centers given.
4. Perform the new weighted fuzzy C-means algorithm (NW-FCM) to classify the image into two clusters with the two initial cluster centers given.
5. Apply the PCA algorithm to classify the image into two clusters with the two initial cluster centers given.
100
3 Algorithms for Image Texture Classification
References 1. Abe S (2010) Support vector machines for pattern recognition, 2nd edn. Springer 2. Ahalt SC, Krishnamurthy AK, Chen P, Melton DE (1990) Competitive learning algorithms for vector quantization. Neural Netw 3:277–290 3. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York 4. Bonabeau E, Dorigo M, Theraulaz G (1999) Swarm intelligence: from natural to artificial systems. Santa Fe institute studies on the sciences of complexity. Oxford University Press 5. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory, pp 144–152 6. Bradford D, Hung C-C (2012) Pollen-based bee algorithm for data clustering—a computational model. Prog Intell Comput Appl (PICA) 1(1):16–36 7. Camps-Valls G, Bruzzone L (2005) Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens 43(6):1351–1362 8. Casper E, Hung C-C (2013) Quantum modeled clustering algorithms for image segmentation. Prog Intell Comput Appl (PICA) 2(1):1–21 9. Davies DL, Bouldin DW (1976) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell PAMI-1(2):224–227 10. Dorigo M, Maniezzo V, Colorni A (1996) Ant system: optimization by a colony of cooperating agents. IEEE Trans Syst Man Cyber Part B 26:29–41 11. Fukunaga K (1990) Statistical pattern recognition, 2nd edn. Morgan Kaufmann 12. Goldberg E (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA 13. Griffiths D (2005) Introduction to quantum mechanics. Pearson, Upper Saddle River 14. Hart P (1968) The condensed nearest neighbor rule. IEEE Trans Inf Theory IT-14:515–516 15. Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City 16. Hung C-C, Kulcarni S, Kuo B-C (2011) A new weighted fuzzy C-means clustering algorithm for remotely sensed image classification. IEEE J Sel Top Signal Process 5(3):543–553 17. Hung C-C, Casper E, Kuo B-C, Liu W, Jung E, Yang M (2013) A quantum-modeled fuzzy C-means clustering algorithm for remotely sensed multi-band image segmentation. In: Proceedings of the IEEE international geoscience and remote sensing symposium (IGARSS ’13) (In press) 18. Hung C-C, Casper E, Kuo B-C, Liu W, Jung E, Yang M (2013) A quantum-modeled artificial bee colony clustering algorithm for remotely sensed multi-band image segmentation. In: Proceedings of the IEEE international geoscience and remote sensing symposium (IGARSS ’13) (In press) 19. Hung C-C, Coleman T, Scheunders P (1998) The genetic algorithm approach and K-means clustering: their role in unsupervised training in image classification. In: Proceedings of the international conference on computer graphics and imaging, Halifax, Canada, June 1998 20. Hung C-C, Fahsi A, Coleman T (1999) Image classification. In: Encyclopedia of electrical and electronics engineering. Wiley, pp 506–521 21. Hung C-C, Jung E, Kuo B-C, Zhang Y (2011) A new weighted fuzzy C-means algorithm for hyperspectral image classification. In: The 2011 IEEE international geoscience & remote sensing symposium (IGARSS), Vancouver, Canada, 25–29 July 2011 22. Hung C-C, Saatchi S, Pham M, Xiang M, Coleman T (2005) A comparison of ant colony optimization and simulated annealing in the K-means algorithm for clustering. In: Proceedings of the 6th international conference on intelligent technologies (InTech’05), Phuket, Thailand, 14–16 Dec 2005 23. 
Hung C-C, Xu L, Kuo B-C, Liu W (2014) Ant colony optimization and K-means algorithm with spectral information divergence. In: The 2014 IEEE international geoscience & remote sensing symposium (IGARSS), Quebec, Canada, 13–18 July 2014
References
101
24. John ST, Nello C (2004) Kernel methods for pattern analysis. Cambridge University Press 25. Jozwik A (1983) A learning scheme for a fuzzy k-NN rule. Pattern Recognit Lett 1:287–289 26. Keller JM, Gray MR, Givens JA Jr (1985) A fuzzy K-nearest neighbor algorithm. IEEE Trans Syst Man Cybern SMC-15(4):580–585 27. Kim KI, Jung K, Park SH, Kim HJ (2002) Support vector machines for texture classification. IEEE Trans Pattern Anal Mach Intell 24(11):1542–1550 28. Krishnapuram R, Keller JM (1993) A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1(2):98–110 29. Krishnapuram R, Keller JM (1996) The possibilistic c-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 4(3):385–393 30. Kuo B-C, Landgrebe DA (2004) Nonparametric weighted feature extraction for classification. IEEE Trans Geosci Remote Sens 42(5):1096–1105 31. Kuo B-C, Ho H-H, Li C-H, Hung C-C, Taur J-S (2013) A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE J Sel Top Appl Earth Obs Remote Sens 317–326. https://doi.org/10.1109/JSTARS.2013.2262926 32. Landgrebe DA (2003) Signal theory methods in multispectral remote sensing. Wiley 33. Li CH, Huang WC, Kuo BC, Hung C-C (2008) A novel fuzzy weighted C-means method for image classification. Int J Fuzzy Syst 10(3):168–173 34. Lippmann RP (1987) An introduction to computing with neural nets. I.E.E.E. A.S.S.P. Mag 27(11):4–22 35. Liu B (2004) Uncertainty theory: an introduction to its axiomatic foundations. Springer, Berlin 36. Liu B (2006) A survey of credibility theory. Fuzzy Optim Decis Mak 5(4):387–408 37. Liu B, Liu YK (2002) Expected value of fuzzy variable and fuzzy expected value models. IEEE Trans Fuzzy Syst 10(4):445–450 38. Liu W, Chen H, Yan Q, Liu Z, Xu J, Yu Z (2010) A novel quantum-inspired evolutionary algorithm based on variable angle-distance rotation. In: IEEE congress on evolutionary computation (CEC), pp 1–7 39. Melgani F, Bruzzone L (2004) Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans Geosci Remote Sens 42(8):1778–1790 40. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equations of state calculations by fast computing machines. J Chem Phys 21:1087–1091 41. Mohammed AM, Elhefhawy NA, El-Sherbiny MM, Hadoud MM (2012) Quantum crossover based quantum genetic algorithm for solving non-linear programming. In: The 8th international conference on informatics and systems (INFOS), pp BIO-145–BIO-153, 14– 16 May 2012 42. Nielsen M, Chung I (2010) Quantum computation and quantum information. Cambridge University Press, New York 43. Pham DT, Otri S, Afify A, Mahmuddin M, Al-Jabbouli H (2007) Data clustering using the bees algorithm. In: Proceedings of the 40th CIRP international manufacturing systems seminar, Liverpool 44. Richards JA, Jia X (2006) Remote sensing digital image analysis: an introduction, 4th edn. Springer 45. Runkler TA, Bezdek JC (1999) Function approximation with polynomial membership functions and alternating cluster estimation. Fuzzy Sets Syst 101:207–218 46. Runkler TA, Bezdek JC (1999) Alternating cluster estimation: a new tool for clustering and function approximation. IEEE Trans Fuzzy Syst 7:377–393 47. Saatchi S, Hung C-C (2005) Hybridization of the ant colony optimization with the K-means algorithm for clustering. Lecture notes in computer science (LNCS 3540, pp 511–520) SCIA 2005: image analysis. Springer 48. 
Scheunders P (1997) A genetic C-means clustering algorithm applied to color image quantization. Pattern Recognit 30(6):859–866 49. Swain PH, Davis SM (eds) (1978) Remote sensing: the quantitative approach. McGraw-Hill
102
3 Algorithms for Image Texture Classification
50. Tao JT, Gonzalez RC (1974) Pattern recognition principles. Addison-Wesley 51. Tomek I (1976) A generalization of the K-NN rule. IEEE Trans Syst Man Cybern SMC-6 (2):121–126 52. Tuia D, Camps-Valls G, Matasci G, Kanevski M (2010) Learning relevant image features with multiple-kernel classification. IEEE Trans Geosci Remote Sens 48(10):3780–3791 53. Vapnik VN (2001) The nature of statistical learning theory, 2nd edn. Springer, New York 54. Wang X (1982) On the gradient inverse weighted filter. IEEE Trans Signal Process 40 (2):482–484 55. Wasserman PD (1989) Neural computing: theory and practice. Van Nostrand Reinhold, New York 56. Xiang M, Hung C-C, Kuo B-C, Coleman T (2005) A parallelepiped multispectral image classifier using genetic algorithms. In: Proceedings of the 2005 IEEE international geoscience & remote sensing symposium (IGARSS), Seoul, Korea, 25–29 July 2005 57. Yang M-S, Wu K-L (2006) Unsupervised possibilistic clustering. Pattern Recognit 39 (1):5–21 58. Yang X-S (2010) Nature-inspired metaheuristic algorithms, 2nd edn. Luniver Press 59. Yanofsky NS, Mannucci MA (2008) Quantum computing for computer scientists. Cambridge University Press, New York 60. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353 61. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28 62. Zhou J, Wang Q, Hung C-C, Yang F (2017) Credibilistic clustering algorithms via alternating cluster estimation. J Intell Manuf 28(3):727–738 63. Zhou J, Wang Q, Hung C-C, Yi X (2015) Credibilistic clustering: the model and algorithms. Int J Uncertain Fuzziness Knowl Based Syst 23(4):545–564 64. Zhou J, Hung C-C (2007) A generalized approach to possibilistic clustering algorithms. Int J Uncertain Fuzziness Knowl Based Syst 15(Supp. 2):117–138 65. Zhou J, Hung C-C, Wang X, Chen S (2007) Fuzzy clustering based on credibility measure. In: Proceedings of the sixth international conference on information and management sciences, Lhasa, China, pp 404–411, 1–6 July 2007
4
Dimensionality Reduction and Sparse Representation
Nature does not hurry, yet everything is accomplished. —Lao Tzu
Image representation is a fundamental issue in signal processing, pattern recognition, and computer vision. An efficient image representation can lead to the development of effective algorithms for the interpretation of images. Since Marr proposed the fundamental principle of the primary sketch concept of a scene [28], many image representations have been developed based on this concept. The primary sketch refers to the edges, lines, regions, and others in an image. These are also called parts of objects. These are characteristic features which can be extracted from an image by transforming an image from the pixel-level to a higher level representation for image understanding. This step is usually considered as a low-level transformation. Many different techniques of transformation have been developed in the literature for signal and image representation. Those techniques can be used for image transformations which will result in efficient representation of characteristic features (or called intrinsic dimension). Dimensionality reduction (DR) and sparse representation (SR) are two representative schemes frequently used in the transform for reducing the dimension of a dataset. These transformations include principle component analysis (PCA), singular value decomposition (SVD), non-negative matrix factorization (NMF), and sparse coding (SC). The study of the mammal brain suggests that only a small number of active neurons encode sensory information at any given point [26, 30]. This finding has led to the rapid development of sparse coding which refers to a small number of nonzero entries in a feature vector. Due to many zeros or small nonzeros in a feature vector, it is called the sparsity of the feature vector. Hence, it is important to represent the sparsity for the dataset by eliminating the data redundancy for applications [6, 34]. The ultimate goal is to have a compact, efficient, and compressed representation of the input data.
© Springer Nature Switzerland AG 2019 C.-C. Hung et al., Image Texture Analysis, https://doi.org/10.1007/978-3-030-13773-1_4
103
104
4
Dimensionality Reduction and Sparse Representation
In the following sections, we will introduce the Hughes effect in the classification of images. Due to this effect, dimensionality reduction is frequently used to solve this problem. We will then present the basis vector concept from linear algebra. Based on this concept, the principle component analysis (PCA), singular value decomposition (SVD), non-negative matrix factorization (NMF), and sparse coding (SC) will be introduced. The PCA is one of the dimensionality reduction methods developed earlier and is widely used in pattern recognition and remote sensing image interpretation. Please note that since a basis image can be represented as a basis vector, we use both terms interchangeably in the following discussions.
4.1
The Hughes Effect and Dimensionality Reduction (DR)
In a high-dimensional space, we usually encounter the problem of shortage of training samples for a model. For example, assume that the number of dimensions (i.e., features) for the multispectral images is d and the number of training samples is s. If s is much smaller than d (i.e., s d), s may be too small for accurate parameter estimation for a classification model. Most of the classification algorithms for hyperspectral images which usually have hundreds of dimensions run into the Hughes phenomenon [12]. When the training samples are fixed and the spatial dimensions are increased, the classification accuracy reaches a maximum value for a given size and then decreases, even the number of dimensions are continuously increased. The Hughes phenomenon is shown in Fig. 4.1a where mean recognition accuracy represented in the vertical axis is plotted versus measurement complexity in the horizontal axis. Because of this Hughes effect, dimensionality reduction (DR) is an important technique in reducing the number of dimensions. If we consider that a dimension is a feature, DR becomes the problem of feature selection (FS) or feature extraction (FE). Feature selection method is to determine a subset of the original features. It is a critical issue for a feature selection algorithm to find a subset of features that is
Fig. 4.1 a The Hughes phenomenon shows that even the number of dimensions represented in the horizontal axis increased, the recognition accuracy (denoted by the vertical axis) is still decreased; b the classification error will be the minimum if an optimal number of features can be found
4.1 The Hughes Effect and Dimensionality Reduction (DR)
105
still discriminative for classification after a reduction on the number of features. Feature extraction is to transform the original feature space to a low-dimensional subspace [18]. In addition, some new features can be reconstructed based on the original set of features, and then it is used for the classification. The objective of feature selection and feature extraction is to find an optimal number of features so that the classification error is minimized as shown in Fig. 4.1b. The optimality refers to a minimum set of features that are discriminative when used in the classification. In the pattern recognition and remote sensing image interpretation, principal component analysis (PCA) has been a widely used dimensionality reduction technique [3, 4, 33]. Nonparametric weighted feature extraction (NWFE) is a widely used supervised dimensionality reduction method for hyperspectral image data [19]. A number of techniques have been developed for both feature selection and feature extraction to identify the most informative features [33]. Mathematically, dimensionality reduction such as the PCA can be regarded as solving a matrix which is replaced by a low-rank matrix. If we denote the original matrix as A, a low-rank approximation for matrix A is another matrix L with a rank lower than that of matrix A. Here, rank is defined as the maximum number of linearly independent vectors in a matrix, which is equal to the number of nonzero rows if we use elementary row operations on the matrix [17]. The objective is to minimize the error of difference between these two matrices using Eq. 4.1 in order to have a good approximation. MinimizekA Lkm ; m is a matrix norm
ð4:1Þ
If the rank of A is n, we are looking for the rank of L which will be satisfied with the relationship m n. This means that if we need n basis vectors to span A, only m basis vectors needed to span L. In other words, we try to find a set of the best basis vectors with size m so that we can produce a good approximation of matrix A. Therefore, we use a linear combination of the basis vectors with a coefficient vector of size m to replace the original vector of size n. In doing so, we reduce the original dimensions from n to m with m is much smaller than n. However, the problem with such a minimization formulated in Eq. 4.1 is that there is no guarantee that the technique will preserve and generate the discriminative features for the optimal classification. In a sense, the minimization should be optimal for image classification.
4.2
The Basis and Dimension
The concept of dimensionality reduction (and sparse coding), perhaps, can be traced back to the study of the structure of a vector space in linear algebra in which we can determine a set of basis vectors that completely describes (spans) a vector space. In doing so, any vector on that space can be represented as a linear combination of a set of basis vectors. PCA is an example of this type of transformation. However, a basis vector obtained in the linear algebra is not sparse. A square matrix can be used
106
4
Dimensionality Reduction and Sparse Representation
to denote a set of basis vectors. Similarly, in the computation of a set of basis vectors for sparse coding, we model the data vectors as the sparse linear combination of basis vectors, called dictionary. Each of basis vectors, which form the dictionary, is called atom. Such a dictionary is also represented as a matrix. This matrix is usually derived through the dictionary learning [6, 34]. Unlike the matrix used for the basis vectors in linear algebra, the matrix for the dictionary is overcomplete (this means that the matrix is not square). The difference between the basis vectors in linear algebra and the sparse coding is that the atoms in a dictionary may have a higher dimensional length in the basis vector than the input signal vector and most of the components in each atom are zero. To lay out a foundation for representing an image with a set of basis images, we will review the basis and dimension with some examples in the vector space. Linear algebra provides us a rich groundwork for the relationship of the basis and dimension [17]. In a vector space V ðV Rn Þ, a set of vectors in V can be determined to completely describe V. The notation Rn represents an n-dimensional real number space. We give the definition of a set of basis vectors from [17]. Definition 4.1 A set of vectors S ¼ fX1 ; X2 ; . . .; Xn g in a vector space V is called a basis for V if S spans V and S is linearly independent. Based on Definition 4.1, any vector X in a vector space V V Rd can be represented as in Eq. 4.2. X ¼ a 1 X1 þ a 2 X2 þ þ a n Xn
ð4:2Þ
Therefore, some properties associated with the definition of the basis vectors can be summarized below: (1) If S ¼ fX1 ; X2 ; . . .; Xn g is a basis for a vector space V, then every vector in V can be written in one and only one way as a linear combination of the basis vectors in S [17]. (2) A set of basis vectors in S is linearly independent (i.e., orthogonal). Hence, it is also called a set of orthogonal basis vectors. (3) If the magnitude of each basis vector is unity, it is called a set of orthonormal basis vectors. (4) A vector space has many different bases and all bases have the same number of vectors. The dimension of a vector space V is defined in the following [17]: Definition 4.2 Let V be a subspace of R d for some d. The number of vectors in a basis for a vector space V R d is called the dimension of V. We often write dim V for the dimension of V.
4.2 The Basis and Dimension
107
Examples 4.1, 4.2, and 4.3 given below show that a vector is represented by a set of basis vectors. Example 4.1 The set S = fX1 ; X2 ; X3 g with X1 ¼ ð1; 0; 0Þ, X2 ¼ ð0; 1; 0Þ, and X3 ¼ ð0; 0; 1Þ is a basis for R3 where 3 is a dimension. Example 4.2 A general vector X ¼ ð6; 7; 8Þ can be represented as X ¼ 6X1 þ 7X2 þ 8X3 using the basis given in Example 4.1. 1 0 , Example 4.3 If a set of the basis S = fX1 ; X2 ; X3 ; X4 g with X1 ¼ 0 0 0 1 0 0 0 0 2 1 X2 ¼ , X3 ¼ , and X4 ¼ . Then, a matrix A ¼ 0 0 1 0 0 1 6 1 is a linear combination of all of 2 2 matrices. It is represented as A ¼ 2 X1 þ 1 X2 þ 6 X 3 þ 1 X4
4.3
ð4:3Þ
The Basis and Image
As an analogy to the basis vectors for coordinate systems in which a vector is expressed as a linear combination of the orthogonal basis vectors (Eq. 4.2), the standard basis for any image is a set of basis (images) as shown in Eq. 4.4 in the following. A simple numerical example is given in Example 4.4. This example is exactly the same as Example 4.3 since an image can be represented in a matrix format. 2 1 Example 4.4 A general image of is described by the following linear 6 1 1 0 0 1 0 0 0 0 combination of four basis images of ; ; ; . 0 0 0 0 1 0 0 1
2 I¼ 6
1 1 ¼2 1 0
0 0 þ1 0 0
1 0 0 0 0 þ6 þ1 0 1 0 0 1
ð4:4Þ
Similar to the basis in linear algebra, the standard basis (images) is not the only one which we can use to describe a general image. Example 4.5 gives a set of basis 2 1 images to a 2 2 image used in Example 4.4. This set of basis images is 6 1 called the Hadamard basis.
108
4
Dimensionality Reduction and Sparse Representation
Fig. 4.2 Hadamard basis images are used in Example 4.5
Example 4.5 The Hadamard basis (basis images shown here for a 2 2 image, where white represents a pixel of +1 and black represents a pixel of −1. The basis 1 1 1 1 1 1 1 1 images consist of ; ; ; . Their image 1 1 1 1 1 1 1 1 representation is shown in Fig. 4.2. Similar to Example 4.4, we can express a general image of
2 6
1 1
using the
following linear combination of this new set of basis images as I¼
2 6
1 1 1 5 ¼ 1 1 2
1 1 3 1 1
1 1 þ2 1 1
1 1 2 1 1
1 1
ð4:5Þ
2 1 The coefficient in each basis image is the projection of onto the cor6 1 responding basis image using the dot product operation. In fact, these are the coordinates of the image in the Hadamard space [4]. In other words, the image of 2 1 has been mapped to a new space using the Hadamard transform. 6 1 This example shows that an image can be expressed as a linear combination by multiplying each basis image by a coefficient. Many discrete image transform methods can achieve this purpose by using the forward transform. Based on the examples given above, we can see that the forward transform is a process of breaking an image into its elemental components in terms of a set of the basis images. This set of basis images forms the so-called transformation matrix (also called the forward transformation matrix). In other words, it is a projection of an image onto the corresponding basis image using the dot product operation. There exist some image transforms which are frequently used including the Fourier
4.3 The Basis and Image
109
transform, wavelet transform, and sinusoidal transform [8]. To obtain the original image from the transformed domain, the so-called inverse transformation matrix is used. In other words, the inverse transformation is a process of reconstructing the original image in terms of its elemental components (i.e., a set of basis images used in the forward transform) through the linear combination of multiplications and summations. Example 4.6 shows the forward and inverse discrete Fourier transform for an image. Example 4.6 The forward and inverse discrete Fourier transform for a two-dimensional image, f (k, l) with a size of N N where variables u and v denote the Fourier domain and variables k and l represent the spatial image domain. The function F (u, v) is the transformed image (is usually called frequency) in the Fourier domain. The forward transform is F ðu; vÞ ¼
N 1 X N 1 k l 1X f ðk; lÞej2pðuN þ vN Þ N k¼0 l¼0
ð4:6Þ
N 1 X N 1 u v 1X F ðu; vÞej2pðkN þ lN Þ N u¼0 v¼0
ð4:7Þ
The inverse transform is f ðk; lÞ ¼
In the literature, the matrix factorization techniques are often used to obtain the basis functions (i.e., vectors or images). For example, the singular value decomposition (SVD) and non-negative matrix factorization (NMF) are widely used techniques [10]. Mathematically, if the image is represented by a matrix X ¼ fx1 ; x2 ; . . .; xn g 2 Rmxn , the matrix factorization can be defined as finding two matrices U 2 Rmxk and A 2 Rkxn such that Eq. 4.8 holds. X ffi UA
ð4:8Þ
The formula shows that the product of two decomposed matrices U and A should be equal or appropriate to X [3]. In general, the basis functions of U should capture the intrinsic features which are hidden in the image (i.e., X) and each column of A should be sparse which give weights in the linear combination of the basis functions of U [3]. The factorization concept can be represented as shown in Fig. 4.3:
110
4
Dimensionality Reduction and Sparse Representation
Fig. 4.3 An illustration of matrix factorization; X = UA. We assume that X is a nonsingular square matrix with m x m. Matrix X can be decomposed into two matrices U and A where U is a lower triangular matrix and A is an upper triangular matrix
4.4
Vector Quantization (VQ)
Vector quantization (VQ) is an encoding technique that has been used in speech and image encoding and pattern recognition [21]. VQ is one of the compressed representations developed in the early days for data encoding. Conceptually, an image is divided into a number of distinct blocks and a representative vector is learned for each block. A set of possible representative vectors is called the codebook of the quantizer, and each member is called codeword. Hence, the vector quantization can be regarded as one of the sparse representations. The Linde–Buzo–Gray (LBG) algorithm is a popular method for the generation of the codebook, which results in a low distortion rate based on the peak signal-noise-ratio (PSNR) measure [1]. As a codebook determines the quality of encoding, it is important to choose an appropriate codebook. The optimization technique can be used for generating an optimal codebook [9]. Figure 4.4 represents an example of the vector quantization to an image input. The LBG algorithm is given below in a step-by-step procedure [21]. The algorithm is a type of competitive learning method. It is quite similar to the k-means algorithm and the Kohon’s self-organizing map neural networks as we discussed in Chap. 3. The LBG algorithm for Vector Quantization: Step 1: Initialize the number of codewords from codebook, K, their initial codewords (same as the cluster centers), the convergence criterion, e, and the maximum number of iterations (Max-It). In addition, set an initial distance D0 = 0 and the iteration number k = 0. Repeat the following steps for each vector in the training dataset. Assume that the training vectors of S ¼ fxi 2 Rd ji ¼ 1; 2; . . .; ng and an initialization of a codebook C ¼ fcj 2 Rd jj ¼ 1; 2; . . .; ng. Step 2: Each vector is assigned to a cluster based on the minimum distance to the cluster centers. Mathematically, it is expressed as in Eq. 4.9. xi 2 cq if xi cq p xi cj p for j 6¼ q ð4:9Þ where the notation||.||p represents the Minkowski distance.
4.4 Vector Quantization (VQ)
111
Fig. 4.4 An example of vector quantization by dividing an image into several pieces of blocks. Each block is represented by a codeword from the codebook
Step 3: Each of the cluster centers is updated by calculating the average of the vectors assigned to each cluster. Mathematically, it is updated as in Eq. 4.10. 1 X cj ¼ xi jj ¼ 1; 2; . . .; K Sj
ð4:10Þ
xi 2Sj
Step 4: Increase the number of iterations k by one (i.e., k the distortion (difference) as below: Dk ¼
K X X xi cj j p
k þ 1) and calculate
ð4:11Þ
j¼1 xi 2Sj
Step 5: If Dk1DDk [ , or the number of iterations, k is less than Max-It, repeat k Steps 2–5. Step 6: Output the codebook C ¼ fcj 2 Rd jj ¼ 1; 2; . . .; ng.
112
4.5
4
Dimensionality Reduction and Sparse Representation
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a linear transformation that removes the correlation embedded in a dataset. This is accomplished by transforming the dataset to a new coordinate system [3, 4]. This new set of coordinates represents the orthogonal principal components. The PCA was developed by Pearson and Hotelling [11, 32]. Therefore, it is also called the Hotelling transform. The PCA is frequently used as a dimensionality reduction method in pattern recognition to reduce the complexity of a dataset. If we stack several (say k) multispectral images with a size of n x n (assuming they are registered and properly aligned), each pixel will then be a vector with k components. Hence, a matrix of n2 k can be formed where n2 is the number of rows and k is the number of columns. Those elements in each vector are called random variables and each variable is a dimension in this vector space. In the multispectral and hyperspectral image analysis, the PCA will create a compact representation of multiple images based on the principal components. Those components are orthogonal and hence uncorrelated [15]. Figure 4.5a illustrates a stack of multispectral images are cascaded for the transformation and Fig. 4.5b shows a two-dimensional dataset in the x-y space (represented as Band 1 and Band 2) is transformed into the x’-y’ space (represented as PCA Band 1 and PCA Band 2). The PCA transforms a correlated dataset into an uncorrelated dataset in a different coordinate system. The transformation results of the PCA on the most discriminative information will be concentrated on the first few principal components. In other words, the first principal component creates maximum possible variation in the images, and each succeeding component obtains as much of the remaining variability as possible. In general, most of the information will be kept in a few principal components. If we discard the principal components with low variances, we will have the minimal information loss.
Fig. 4.5 An illustration of the principal component analysis (PCA); a a stack of multispectral images (i.e., bands) to be formed for the PCA transformation and b demonstrates a set of pixel vectors are transformed into uncorrelated pixel vectors in the space of multispectral images. Please note that a pixel vector with only two bands are shown here
4.5 Principal Component Analysis (PCA)
113
The PCA reduces feature vector space from a large number of dimensions (variables) to a smaller number of dimensions of each feature vector. It is an unsupervised learning algorithm to discover the intrinsic features embedded in a dataset. The PCA assumes that the relationships among variables in the dataset are linear. If the structure of the dataset is nonlinear, the principal components obtained will not be an effective representation of the dataset. The PCA has been proved as an effective transformation method for an image classification algorithm to improve the classification accuracy [13]. The PCA is widely used in the remote sensing image interpretation [33]. Similar to the linear transformation in linear algebra, the PCA is formulated as in the following: if the K original variables are represented as X ¼ fx1 ; x2 ; . . .; xk g and the output of the transformation is Y ¼ fy1 ; y2 ; . . .; yk g. We will have the following expressions (Eq. 4.12): y1 ¼ a11 x1 þ a12 x2 þ þ a1k xk y2 ¼ a21 x1 þ a22 x2 þ þ a2k xk ...
ð4:12Þ
yk ¼ ak1 x1 þ ak2 x2 þ þ akk xk Equation 4.12 can be expressed in the matrix format below: Y ¼ AX
ð4:13Þ
where A is the transformation matrix which is the collection of coefficients aij in Eq. 4.12. The PCA transformation is to make the output fy1 ; y2 ; . . .; yk1 ; yk g uncorrelated (i.e., orthogonal). Each of the variables is called principal component. The procedure for the PCA transformation is listed below. Principal Component Analysis (PCA) Algorithm: Step 1: Let m be the mean vector (taking the mean of all feature vectors from the original dataset X). m¼
K 1X Xk K k¼1
ð4:14Þ
Step 2: Adjust the original data X by subtracting the mean (i.e., X 0 ¼ X m) for each feature vector. Step 3: Compute the covariance matrix C of adjusted X 0 .
C¼
K 1X X 0 X 0T K k¼1 k k
ð4:15Þ
114
4
Dimensionality Reduction and Sparse Representation
Step 4: Find the eigenvectors and eigenvalues of C. Step 5: Select the first p eigenvectors ei where p is the number of eigenvalues that are ranked from the largest eigenvalues to form a new set of the feature vectors (principal components). A numerical example to illustrate the PCA transformation is given in Example 4.7. Example 4.7 Assume that we have the following dataset. 2 3 1 6 7 425
2 3 4 6 7 455
2 3 7 6 7 485
3
6
9
Step 1: Calculate the mean for all the feature vectors. 2 3 4 6 7 455 6 Step 2: Subtract the mean vector from each feature vector. 2
3
3
6 7 4 3 5
2 3 2 3 3 0 6 7 6 7 405 435
3
3
0
Step 3: Calculate the covariance matrix. 2
9
6 49 9
9
9
3
9
7 95
9
9
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix: each eigenvector is arranged as a row in the following matrix and eigenvalues are also shown in a matrix.
4.5 Principal Component Analysis (PCA)
2
5:77350269e01
115
8:04908467e01
6 Eigenvectors ¼ 4 5:77350269e01
5:21162995e01
5:77350269e01
2:83745472e01
2
2:70000000e þ 01 6 Eigenvalues ¼ 4 0 0
0 2:05116020e15 0
1:09446620e16
3
7 7:07106781e01 5 7:07106781e01 3 0 7 0 5 1:76986475e32
Step 5: Once eigenvectors are arranged based on the corresponding eigenvalues from the largest to the smallest, we can discard the eigenvectors corresponding to the smaller eigenvalues which represent the least information. If the eigenvector corresponding to the smallest eigenvalue is discarded, the first and second principal components will be retained for the PCA transformation matrix as shown below. PCA ¼
5:77350269e01 5:77350269e01
8:04908467e01 1:09446620e16 5:21162995e01 7:07106781e16
Step 6: Derive the new dataset by multiplying the transformation matrix PCA from Step 5 with each original vector. The transformed vectors are given below:
1:03 1:62
1:72 4:92
2:40 8:21
The PCA provides us a reduced dimensional representation while the sparse coding produces a high-dimensional representation in which only a few elements are nonzero. In PCA, the basis vectors derived from its square matrix are orthonormal. On the other hand, in sparse coding, an overcomplete set of basis vectors are used. Hence, its matrix is usually not square where an L1 regularized optimization is needed for solving the sparse coding.
4.6
Singular Value Decomposition (SVD)
In image analysis, the singular value decomposition (SVD) is a technique which uses a rank-reduced approximation of intrinsic features embedded in a dataset for the generalization [10]. Similar to the PCA, the SVD is also used for the dimensionality reduction in machine learning. The SVD not only eliminates the collinearity in the original dataset but also reduces the number of features for better generalization in the machine learning.
116
4
Dimensionality Reduction and Sparse Representation
Given an m n matrix A of rank r, there exists a factorization (SVD) as follows: A ¼ UDVt
ð4:16Þ
where the columns of Um m and Vn n are the orthonormal eigenvectors of AAt and At A, respectively. Dm n is a diagonal matrix with the singular values of A. This means UT U ¼ I
ð4:17Þ
and V T V ¼ VV T ¼ I
ð4:18Þ
also if m = n, then U T U ¼ UU T ¼ I:
ð4:19Þ
SVD matrices are illustrated in Eq. 4.20 below. In general, rank(A) equals to the number of nonzero ri . 2
r1 ½ A ¼ ½ U 4 0 0
0 .. . 0
3 0 0 5½V T rn
ð4:20Þ
where ri for i = 1, …, n are singular values. If ki is an eigenvalue of ATA (or AAT), then ki = r2i . Two important properties of the SVD are listed below: (1) A square (n n) matrix A is singular if at least one of its singular values r1, …, rn is zero. (2) The rank of matrix A is equal to the number of nonzero singular values ri. Example 4.8 gives a numerical matrix and its SVD results. Example 4.8 An example of a numerical matrix, A and its SVD results. 2
1 A ¼ 44 7 Result of SVD: A = UDVT
2 5 8
3 3 65 9
4.6 Singular Value Decomposition (SVD)
117
2
3 0:21483724 0:88723069 0:40824829 6 7 U ¼ 4 0:52058739 0:24964395 0:81649658 5 0:82633754 0:38794278 0:40824829 2 1:68481034e þ 01 0 6 D¼4 0 1:06836951e þ 00 2
0
0:47967118 6 V ¼ 4 0:77669099 0:40824829
0
0 0
3 7 5
3:33475287e16 3 0:66506441 7 0:62531805 5
0:57236779 0:07568647
0:40824829
0:81649658
with the singular values arranged in the decreasing order. One common definition for the norm of a matrix is the Frobenius norm shown in Eq. 4.21. k AkF ¼
XX i
a2ij
ð4:21Þ
j
where aij are elements in the matrix A. Frobenius norm can be computed from SVD by adding all ri as shown in Eq. 4.22 from the diagonal matrix D. k AkF ¼
X
r2i
ð4:22Þ
i
If we want to find a best rank of k approximation to A, we may set all elements but the largest k singular values to zero. This will form compact representations (i.e., sparse representations) by eliminating columns of U and V corresponding to ri which is zero. SVD can be used to compute optimal low-rank approximations that can be obtained by minimizing a matrix A of rank k using Eq. 4.23. Ak ¼ minkA X kF ; the rank of X is k
ð4:23Þ
Both Ak and X are m n matrices with the expected k r (assuming the number of diagonal elements is r). The solution of Eq. 4.23 is Ak ¼ U diagðr1 ; . . .; rk ; 0; . . .; 0ÞV T where (r – k) smallest singular values are set to zero in Eq. 4.24.
ð4:24Þ
118
4
Dimensionality Reduction and Sparse Representation
Summary of the SVD Method Step 1: Any real m n matrix A can be decomposed uniquely as shown in Eq. 4.16: A ¼ UDVt Step 2 The columns of U are eigenvectors of AAt and the columns of V are eigenvectors of At A. Step 3: Calculate eigenvalues of AAt (or At A) to form the diagonal matrix D.
4.7
Non-negative Matrix Factorization (NMF)
Nonnegative matrix analysis is a low-rank approximation method of the matrix [22, 31]. The non-negative matrix factorization (NMF) technique decomposes an m x n matrix A to two matrices as shown in Eq. 4.25: Amxn ¼ Wmxr Hrxn
ð4:25Þ
where m denotes nonnegative scalar variables, n denotes measurements forming the columns of matrix A, W a matrix of basis vectors, r\minðm; nÞ, and H represents a matrix of coefficients. The matrix H is used for describing how strongly each building block (i.e., basis images from W) is present in the measurement vectors. Each column of H is also called encoding. Each basis image is a representation of localized features. Hence, a representation using the NMF is called parts-based structure [22, 31]. The W and H can be obtained by solving the optimization problem below (Eq. 4.26) [16]: 1 min fr ðW; HÞ kA WHk2F where W; H 0 W;H 2
ð4:26Þ
where W and H are nonnegative matrices and r is the reduced dimension. Similar to some of the clustering algorithms, the NMF does not have a unique solution. Since the objective of the NMF is to find the parts-based structure (i.e., localized features), the NMF has been used as a clustering algorithm [16, 37]. It has been shown that the NMF produces a better characteristic structure (i.e., basis images) for a dataset than other matrix factorizations such as PCA [22, 31]. There are many NMF algorithms which can be used for obtaining the reduced matrices. These include: (1) basic NMF, (2) constrained NMF, (3) structured NMF, and (4) generalized NMF [38]. To optimize Eq. 4.26, an algorithm using the alternating nonnegative least squares (ANLS) for obtaining W and H was proposed by Paatero and Tapper [31]. The algorithm is listed below.
4.7 Non-Negative Matrix Factorization (NMF)
119
The NMF Algorithm Using Alternating Nonnegative Least Squares (ANLS): Step 1: Initialize W 2 Rmxk or H 2 Rnxk with nonnegative values and scale the columns of W to unit L2-norm. Step 2: Iterate Eqs. 4.27 and 4.28 until both converge. ða) Solve minkWH Ak2F for H ðH 0Þ and W is fixed ð4:27Þ 2 ðb) Solve minH T W T AT F for W ðW 0Þ and H is fixed Step 3:
ð4:28Þ
The columns of W are normalized to L2-norm and the rows of H are scaled accordingly.
A fast NMF algorithm using the ANLS method has been developed by Kim and Park based on the nonnegative least squares [31]. Although the NMF is suitable for the square matrices, Eggert and Korner combined the concepts of NMF and sparse conditions to improve the NMF [5]. Their method is efficient compared to the standard NMF and is applicable to the overcomplete cases which will be discussed in the next section.
4.8
Sparse Representation—Sparse Coding
Sparse representation (SR) is an efficient model for representing and compressing high-dimensional images. Sparse representation is to mimic the topology of the underlying manifolds in the image [20, 25, 27, 35, 39]. Sparse coding has been used in image texture classification for the local sparse description of contents [7]. Similar to the principal component analysis (PCA), the SR is a compact representation for an image. However, unlike the PCA which is a representation by obtaining the orthonormal basis vectors, the SR uses an overcomplete set of basis vectors for sparse coding [20, 27, 35, 39]. A sparse coding algorithm learns a new representation of the input data and in that, it only has a few components which are significantly nonzeros. Hence, it is called sparse representation. Traditionally, the analytic basis functions derived from a dataset based on the mathematical formulation have been used in the pattern recognition and image analysis. These analytic basis functions can be generated using the Fourier, Wavelet, and Discrete Cosine transforms. Unlike the analytic basis functions, sparse coding attempts to find those basis functions directly from the dataset. This is data-driven generated basis functions similar to the PCA. Strictly speaking, the PCA is usually not a sparse coding although some sparse PCA has been developed. It has been demonstrated that an overcomplete basis functions learned from the
120
4
Dimensionality Reduction and Sparse Representation
dataset for sparse representation mimics the human vision system [29]. For natural image analysis, predefined dictionaries (consisting of analytic basis functions) based on various types of transforms such as wavelets [24] have been used. However, learning the dictionary is illustrated to dramatically improve signal reconstruction [6, 34]. The foundation of sparse representation is to construct a dictionary which can represent a vector as a sparse linear combination of the training vectors [2, 6, 21, 30, 34]. If we denote a feature vector by x in the n-dimensional real number space, Rn, the vector x is a sparse approximation over an overcomplete dictionary D in Rn m (n m) which composes of m columns. Each column is called atom. In other words, we can find a linear combination of a few atoms from D that is close to the original vector X [6, 34]. In the matrix format, given a vector X Rn and a matrix D Rn m, we try to find a vector a Rm such that Eq. 4.29 is satisfied. X
n X m X
d i aj ¼ Da where di 2 D:
ð4:29Þ
i¼1 j¼1
In this formulation, the matrix D is called the dictionary and a is the sparse vector which contains the representation coefficients aj of the vector. Figure 4.6 gives a diagram showing the decomposition of the SR scheme. Let us take an example to clarify the concept of sparse representation and sparse approximation using a randomly generated vector and matrix [2] as shown below. Example 4.9: Let x be an original feature vector, D dictionary, and a a sparse vector as shown below. Fig. 4.6 A diagram shows the decomposition of the SR scheme
4.8 Sparse Representation––Sparse Coding
121
2
3 0:3133 6 0:9635 7 6 7 x¼6 7 4 0:4964 5 0:8721 0:6579 0:7948 6 0:9376 0:1298 6 D¼6 4 0:1425 0:8475 0:1749 0:1938 3 2 0 7 6 6 0 7 7 6 7 a¼6 6 0:9283 7 7 6 4 0 5 0:7485 2
0:2346 0:3627
0:8273 0:1432
0:2982 0:1827
0:1983 0:9238
3 0:1276 0:8374 7 7 7 0:2934 5 0:9384
We will have the following equation: x = D a. Hence, a is a sparse vector. 2
3
2
0:3133 0:6579 6 0:9635 7 6 0:9376 6 7 6 4 0:4964 5 ¼ 4 0:1425 0:8721 0:1749
0:7948 0:1298 0:8475 0:1938
0:2346 0:3627 0:2982 0:1827
0:8273 0:1432 0:1983 0:9238
3 0 0:1276 6 0 7 6 7 0:8374 7 7 6 0:9283 7 5 6 7 0:2934 4 0 5 0:9384 0:7485 3
2
4.8.1 Dictionary Learning Considering a finite training set of vectors X ¼ fx1 ; x2 ; . . .; xn g 2 Rmxn which can be represented by a dictionary D and a set of sparse coefficients a using Eq. 4.30 [7] min D;a
m
X 1 i¼1
2
kXi Dai k22 þ kuðai Þ
ð4:30Þ
where k is a regularization parameter and uð:Þ is a sparsity function. The most common used sparsity function is l1 -norm. Hence, Eq. 4.30 is rewritten as Eq. 4.31 by using the l1 -norm. This equation can be solved using the Lasso method [8]. min D;a
m
X 1 i¼1
2
kXi Dai k22 þ kkai k1
ð4:31Þ
122
4
Dimensionality Reduction and Sparse Representation
To prevent obtaining very large values of D which may lead to very small values of ai , a constraint is imposed on the columns of D such that they have unit l2 -norm [25]. The dictionary D and sparse coefficients ai will be obtained by solving Eq. 4.31 using the K-SVD [1, 36] among several methods developed for the solutions. If the dictionary D is fixed by using all the training dataset X ¼ fx1 ; x2 ; . . .; xn g 2 Rmxn , we just need to solve the sparse coefficients ai by Eq. 4.31. K-SVD is an algorithm to learn a dictionary for sparse signal representations. K-SVD is a generalization of the k-means clustering method using the SVD, hence, it is called K-SVD. The SVD refers to the singular value decomposition method. The K-SVD works by iteratively alternating between sparse coding the dataset based on the current dictionary, and updating the atoms in the dictionary to better fit the dataset [6]. Given a training dataset, we look for the dictionary that is potentially the best representation for each member in this dataset, under strict sparsity constraints as formulated in Eq. 4.31. The dictionary is updated by solving Eq. 4.32. min D
n X kxi Dai k22
ð4:32Þ
i¼1
Equation 4.32 is rewritten as Eq. 4.33 as below:
min D
n X i¼1
kx i
Dai k22 ¼
2 n X 2 minX Dj a j dk ak ¼ Ek dk ak F D j¼1;j6¼k
where dk is obtained by solving the SVD of Ek ¼ U kth row of a 2 Rkxn .
ð4:33Þ
2
P
V, dk ¼ U ð:; 1Þ and ak is the
K-SVD is widely used in applications such as pattern recognition and image classification. The K-SVD dictionary learning algorithm is outlined below. K-SVD Dictionary Learning Algorithm [36]: Step 1: Initialize the dictionary, D, the convergence criterion, d, and the maximum number of iterations (Max-It). In addition, set an initial distance D0 = 0 and the iteration number k = 0. Repeat the following steps for each vector in the training dataset. Step 2: Sparse code update: while the dictionary is kept fixed, we update the sparse vector ai using Eq. (4.30).
4.8 Sparse Representation––Sparse Coding
123
Step 3: Dictionary update: while the sparse code is kept fixed, we update each column of the dictionary D using Eq. 4.32. Specifically, Steps 2 and 3 are to solve the following equations formulated in 4.32 and 4.33: min D
N X
kxi Dai k22
i¼1
2 minEk dk ak F 8k dk
4.9
Experimental Results on Dimensionality Reduction of Hyperspectral Image
Hyperspectral image data is a progression of spectral bands collected over visible and infrared of the electromagnetic spectrum. These datasets hold relevant information as well as accommodate noise and redundancy leading to sparseness. Correlation between the bands is inversely proportional to sparseness. It is imperative to preprocess hyperspectral data to efficiently extract meaningful information. Hence, dimensionality reduction techniques including PCA, NMF, Independent component analysis (ICA), and SVD were used in this experiment for reducing the number of hyperspectral bands [23]. ICA is an extension of PCA [14]. The experiment explores the dependency of the standard PCA, NMF, ICA, and SVD algorithms on the selected number of dimensions (L). Unsupervised clustering algorithms Fuzzy C-means (FCM) was utilized to identify the influence of L on the dimensionality reduction techniques through classification accuracy. As L value increases, each algorithm yields different accuracy. Indian Pines hyperspectral images with a size of 145 145 and 200 bands were used for dimensionality reduction and then for classification. Experimental results are shown in Table 4.1 and Fig. 4.7. Table 4.1 is a summary of classification accuracy using the FCM algorithm with dimensionality reduction methods of PCA, NMF, ICA, and SVD and FCM. Their variances and deviations for each chosen dimension L are presented for each dimensionality reduction method. Figure 4.7 shows the classified images for the visualization.
124
4
Dimensionality Reduction and Sparse Representation
Table 4.1 FCM clustering results of Indian Pines dataset; the first column shows different dimensionality reduction methods, second, third, and fourth columns give the minimum, maximum, and average classification accuracy over 10 iterations, respectively. Here, j represents the overall accuracy. The fifth, sixth, and seventh columns give the variance, standard deviation, and the number of bands obtained with dimensionality reduction methods, respectively Algorithm
j Min over 10 iters (%)
j Max over 10 iters (%)
j Average over 10 iters (%)
j Variance x 10-5 over 10 iters
Deviation
L
PCA NMF ICA SVD PCA NMF ICA SVD PCA NMF ICA SVD
85.15 84.47 85.08 84.12 81.66 84.75 83.13 85.01 61.07 73.20 61.86 55.75
85.65 85.11 85.71 84.93 84.33 85.26 84.48 85.40 77.77 79.94 76.01 73.47
85.37 84.76 85.41 84.62 83.63 85.02 83.74 85.21 71.26 78.32 70.62 62.83
0.209806 0.510506 0.475800 0.451861 5.90619 0.267256 3.01680 0.170667 290.638 34.0453 152.749 286.098
0.00144 0.00225 0.00218 0.00212 0.00768 0.00163 0.00549 0.00130 0.05391 0.01845 0.03908 0.05348
3
5
15
Fig. 4.7 Classified images of Indian Pines dataset using the FCM algorithm with dimensionality reduction techniques. These classified images corresponding to the maximum accuracy, which was taken from Table 4.1 for each reduced dimension, L, of 3, 5, and 15. a Ground Truth; b ICA (j = 85.71%, L = 3); c SVD (j = 85.40%, L = 5); d NMF (j = 79.94%, L = 15)
4.10
Summary
Although both dimensionality reduction (DR) and sparse representation (SR) techniques give a compressed representation of images, there exist similarity and difference between them. DR techniques such as PCA provide us methods to select a low-dimensional space from the orthogonal dimensional representations obtained through matrix manipulation. The SR gives us the sparse basis vectors to select the nonzero elements obtained from an overcomplete set of basis vectors by solving an L1-regularized optimization problem such as Lasso method. For a given matrix,
4.10
Summary
125
which represents an original dataset, the optimization of a low-rank approximation for the given matrix is the goal for many dimensional reduction techniques. However, the problem in this optimization is that there is no guarantee that the technique will preserve and generate the discriminative features for the optimal classification. In other words, the optimization may be good just for the image and signal reconstruction. Sparse representation (SR) is to construct a dictionary which can represent a vector as a sparse linear combination of the original training vectors. It is an efficient method to reduce the complexity of the dataset. Each feature vector using SR can be represented with a few atoms. The sparse coding is a more general approach compared with other methods such as the PCA. However, the foundation of principal component analysis (PCA), NMF, and singular value decomposition (SVD) is still widely used in many applications. For example, the SVD is used in the K-SVD algorithm for the learning of dictionary in the sparse coding. The NMF is a family of dimensionality reduction techniques for matrix factorization. The NMF assumes that all hidden variables are nonnegative.
4.11
Exercises
A hypothetical color image is given below (Red, Green, and Blue):
0 1 2 3
1 2 3 0
2 3 0 1
3 0 1 2
(Red)
1 3 1 1
2 2 3 2
3 1 2 3
3 1 1 2
(Green)
3 0 1 1
2 1 3 3
2 3 1 2
3 2 2 1
(Blue)
1. Apply the principal component analysis (PCA) to the color image into a new set of images. 2. Use the singular value decomposition (SVD) technique to one of three bands. 3. Perform the non-negative matrix factorization (NMF) technique to the images. 4. Apply the sparse coding to the images.
126
4
Dimensionality Reduction and Sparse Representation
References 1. Aharon M, Elad M, Bruckstein A (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322. https:// doi.org/10.1109/TSP.2006.881199 2. Breen P (2009) Algorithms for sparse approximation. University of Edinburgh, School of Mathematics 3. Cai D, Bao H, He X (2011) Sparse concept coding for visual analysis. CVPR 2011:20–25 4. Castleman KR (1996) Digital image processing, Prentice Hall, New Jersey 5. Eggert J, Korner E (2004) Sparse coding and NMF. In: Proceedings of IEEE international joint conference on Neural Networks 2004, vol 4, pp 2529–2533 6. Elad M (2006) Sparse and redundant representations: from theory to applications in signal and image processing. Springer, Berlin 7. Gangeh MJ, Ghodsi A, Kamel MS (2011) Dictionary learning in texture classification. In: Kamel M, Campilho A (eds) Image analysis and recognition. ICIAR 2011. Lecture notes in computer science, vol 6753. Springer, Berlin 8. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, Berlin 9. Horng MH (2009) Honey bee mating optimization vector quantization scheme in image compression. In: Deng H, Wang L, Wang FL, Lei J (eds) Artificial intelligence and computational intelligence (AICI 2009). Lecture notes in computer science, vol 5855. Springer, Berlin 10. Horn RA, Johnson CR (1985) Matrix Analysis, Cambridge. Cambridge University Press, England 11. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol, 24: 417–441 and 498–520 12. Hughes GF (1968) On the mean accuracy of statistical pattern recognitions. IEEE Trans Inf Theory, IT-14(1) 13. Hung CC, Fahsi A, Tadesse W, Coleman T (1997) A comparative study of remotely sensed data classification using principal components analysis and divergence. In: Proceedings of IEEE international conference on systems, man, and cybernetics, Orlando, FL, 12–15 Oct 1997, pp 2444–2449 14. Hyvärinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13(4–5):411–430 15. Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York 16. Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23 (12):1495–1502 17. Kolman B (1980) Introductory linear algebra with applications, 2nd edn. Macmillan Publishing Company Incorporated, New York 18. Kuo B-C, Ho H-H, Li C-H, Hung C-C, Taur J-S (2013) A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE J Sel Top Appl Earth Obs Remote Sens, 7(1):317–326 19. Kuo B-C, Landgrebe DA (2004) Nonparametric weighted feature extraction for classification. IEEE Trans Geosci Remote Sens 42(5):1096–1105 20. Lee H, BattleA, Raina R, Ng AY (2006) Efficient sparse coding algorithms. Advances in neural information processing systems, pp 801–808 21. Linde Y, Buzo A, Gray RM (1980) An algorithm for vector quantizer design. IEEE Trans Commun, COM-28(1):84–95 22. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791
References
127
23. Mallapragada S, Wong M, Hung C-C (2018) Dimensionality reduction of hyperspectral images for classification. In: Proceedings of the ninth international conference on information, pp 153–160 24. Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 2(7):674–693 25. Marial J, Bach F, Ponce J, Sapiro G (2010) Online learning for matrix factorization and sparse coding. J Mach Learn Res 11:19–60 26. Mairal J, Bach F, Ponce J, Sapiro G (2009) Online dictionary learning for sparse coding. In: Proceedings of the 26th international conference on machine learning, montreal, Canada 27. Mairal J, Bach F, Sapiro G, Zisserman A (2008) Supervised dictionary learning. INRIA 28. Marr D (1982) A computational investigation into the human representation and processing of visual information. MIT Press, Cambridge 29. Olshausen BA, Field DJ (1996) Emergence of simple cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–609 30. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14:481–487 31. Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmentrics 5(2):111–126 32. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2 (11):559–572 (series 6) 33. Richards JA, Jia X (2006) Remote sensing digital image analysis, 4th edn. Springer, Berlin 34. Rish I, Grabarnik GY (2015) Sparse modeling: theory, algorithms, and applications. Chapman & Hall/CRC 35. Rubinstein R, Bruckstein AM, Elad M (2010) Dictionaries for sparse representation modeling. Proc IEEE 98(6):1045–1057. https://doi.org/10.1109/JPROC.2010.2040551 36. Sarkar R (2017) Dictionary learning and sparse representation for image analysis with application to segmentation, classification and event detection. Ph.D. Dissertation, University of Virginia 37. Turkmen AC (2015) A review of nonnegative matrix factorization methods for clustering. Allen Institute for Artificial Intelligence, 31 Aug 2015 38. Wang Y-X, Zhang Y-J (2013) Nonnegative matrix factorization: a comprehensive review. IEEE Trans Knowl Data Eng 25(6):1336–1353 39. Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In IEEE conference on computer vision and pattern recognition, pp 1794–1801
Part II
The K-views Models and Algorithms
5
Basic Concept and Models of the K-views
Heaven signifies night and day, cold and heat, times and seasons. —Sun Tzu
In this chapter, we introduce the concepts of the "view" and the "characteristic view". This view concept is quite different from those of the gray-level co-occurrence matrix (GLCM) and the local binary pattern (LBP). We emphasize how to precisely describe the features of a texture, how to extract texture features directly from a sample patch (i.e., sub-image), and how to use these features to classify an image texture. The view concepts and related methods work with a group of pixels instead of a single pixel. Three principles are used in this work for developing the model: (1) texture features from a view should carry as much information as possible for image classification; (2) the algorithm should be kept as simple as possible; and (3) the computational time to distinguish different texture classes should be minimal. The view-related concepts are suitable for textures that are generated by one or more basic local patterns repeated in a periodic manner over some image region. The set of characteristic views is a powerful feature extraction and representation method for describing an image texture. As different textures show different patterns, the patterns of a texture also show different views. If a set of characteristic views is properly defined, it is possible to use this set of characteristic views for texture classification. The K-views template is an algorithm that uses a number of characteristic views, denoted by K, for the classification of images. The K-views algorithm is suitable for classifying image textures that have basic local patterns repeated periodically. Several variations of the basic K-views model are given in Chaps. 6, 7 and 8.
5.1 View Concept and a Set of Views
Human beings are capable of using texture in the interpretation of a photograph to identify targets of interest, so it is natural for a machine to use the same texture features for recognition. Researchers in image texture analysis have proposed many innovative methods for image texture classification. These methods can be divided into two major categories: the first category is based on features with a high degree of spatial localization, which can use edge detection operations for recognition. The major problem with this approach is that it is challenging to distinguish between texture boundaries and the micro-edges located within the same texture. The second category is based on a discrimination function using several texture features. The classification accuracy in this category depends upon the discriminative power of the extracted texture features. Many methods have been developed to extract features using either statistical or other algorithms, as discussed in Chap. 2. In most cases, the feature is represented numerically by a vector, which is composed of real numbers (vector components) derived from a neighborhood of the corresponding class. In this chapter, we present a different framework for representing texture features for the classification. We describe how to extract texture features directly from a sample sub-image of a texture and how to use them to classify an image. As each texture class has a characteristic feature that distinguishes this class from others, this characteristic feature may show different "views". When we determine whether a pixel belongs to a specific texture class during the classification, if the spatial neighborhood of the pixel (which is a small image patch) is considered, we may be able to look up a set of the "views" of the texture classes to determine the categorization of this image patch [7, 12]. In other words, one can observe that any local patch of a texture consists of only a few patterns, which are called views in this context. In the following, we illustrate the concept of the view by using a simple image, shown in Fig. 5.1, that contains two different texture classes. One texture class consists of parallel vertical lines. The other consists of two intersecting sets of diagonal lines. These two arrangements show two different textures. This type of pattern structure can be taken as the basic element of measure for image textures. Figure 5.1 also indicates that an image texture depends not only on the values of its composed pixels but also on the spatial arrangement of those pixels.
Fig. 5.1 a An image showing two texture classes and b the corresponding pixel values
Please note that it is impossible for a per-pixel-based classification method to classify this texture image correctly. To classify this image into two different texture classes, we can use a correlation matching method, which is illustrated by the following steps.

The Correlation Matching Method:
Step 1: Randomly select, in the area of each texture class, a sample sub-image from the original image. An example of selected sample sub-images, sample 1 and sample 2, is shown in Fig. 5.2a and b. Note that the sizes of these two sample sub-images do not have to be the same.
Step 2: Take a small image template (we call this template a patch) from the original image being classified, as shown in Fig. 5.2c and d. The small patch should be much smaller than any of the sample sub-images. Then, find the best match between this small image patch and the sample sub-images.
Step 3: If the best match occurs in sample sub-image k, classify all the pixels (in the original image) corresponding to this small image patch into class k (or, if the small image patch is regarded as a neighborhood of one pixel, classify only that pixel to class k).
Step 4: Repeat Steps 2 and 3 until the entire original image is classified.
In Steps 2 and 3, when we perform the classification of an image, a small patch is taken and looked up against the set of sub-images extracted from the sample image to determine which texture class this small patch belongs to. These steps are repeated for each pixel using a neighborhood of the patch size in the original image. In a sense, this operation is very similar to how we apply a spatial filter to an image. The correlation matching, which measures the similarity between a small image patch and a sample sub-image, can be defined in several ways; a straightforward method is to use the Euclidean distance.
Fig. 5.2 a and b show sample sub-images with a size of 3 × 4 taken from the two different textures in Fig. 5.1, respectively; c and d show a small patch (size of 2 × 2) taken from a and b, respectively
Fig. 5.3 A sample sub-image with a size of M × N which contains (M − m + 1)*(N − n + 1) small patches of size m × n with overlapping
Suppose that a small image patch taken from an original image has a size of m × n. A sample sub-image with a size of M × N will contain (M − m + 1)*(N − n + 1) small patches of size m × n with overlapping, as illustrated in Fig. 5.3. Let Vo denote a small patch taken from the original image, let Vi (i = 1 to (M − m + 1)*(N − n + 1)) denote a small image patch taken from a sample sub-image representing a textural class, and let di denote the Euclidean distance between Vo and Vi. We assume that both Vo and Vi have the same size. The similarity measured by the Euclidean distance, say SVoVi, between Vo and a patch Vi can be expressed by the following equation:

$$S_{V_o V_i} = \min\{d_1, d_2, d_3, \ldots, d_k, \ldots, d_{(M-m+1)(N-n+1)}\} \tag{5.1}$$
Please note that Eq. 5.1 will be repeated for each sample sub-image taken from a textural class. The textural class i (from 1 to C) to which Vo belongs can be determined by Eq. 5.2:

$$S_{V_i} = \min\{S_{V_1}, S_{V_2}, \ldots, S_{V_i}, \ldots, S_{V_C}\} \tag{5.2}$$
where C is the total number of textural classes. This method can accurately classify an image such as the one shown in Fig. 5.1. If the small image patch has a size of 2 × 2, this method can classify this image well into two texture classes. Lee and Philpot described a similar approach in their work [8]. Here, we call each small image patch a "view", V. There are two different views in Fig. 5.2a if the view size is 2 × 2: $\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$ and $\begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}$. In total, there are six views in this sample sub-image (i.e., (3 − 2 + 1) * (4 − 2 + 1)). Among the total of six views, there are four identical views of $\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$ and two identical views of $\begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}$. A view is of a size of m × n, where m > 0 and n > 0. When m = 1 and n = 1, a view is just a single pixel. For a simple representation, a view with a size of m × n can be denoted by a vector:

$$(x_{11}, x_{12}, \ldots, x_{1n}; x_{21}, x_{22}, \ldots, x_{2n}; \ldots; x_{m1}, x_{m2}, \ldots, x_{mn})^T \tag{5.3}$$

where T is the transpose. In Eq. 5.3, a group of values separated from the others by semicolons corresponds to a row in the view, and each value in a group corresponds to a pixel value in the view. Hence, $\begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}$ can be expressed as (1, 0; 1, 0)^T. All of these views from a sample sub-image form a set called a view set. A view set is an exemplary set for an image texture class. Now, it is clear that the correlation matching method is, in fact, a method to compare a view with those in the view sets formed from different sample sub-images (i.e., textures).
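To make the view extraction and matching concrete, the following Python sketch illustrates Eqs. 5.1 and 5.2 with the Euclidean distance. It is our own illustration under assumed names (extract_views, classify_patch, and the toy samples), not code from the original work.

```python
import numpy as np

def extract_views(sub_image, m, n):
    """All (M - m + 1)*(N - n + 1) overlapping m x n views of a sample
    sub-image, each flattened into a vector as in Eq. 5.3."""
    M, N = sub_image.shape
    views = [sub_image[r:r + m, c:c + n].ravel()
             for r in range(M - m + 1)
             for c in range(N - n + 1)]
    return np.array(views, dtype=float)

def similarity_to_class(patch, class_views):
    """Eq. 5.1: minimum Euclidean distance between the query view Vo and
    every view Vi extracted from one sample sub-image."""
    v_o = patch.ravel().astype(float)
    return np.linalg.norm(class_views - v_o, axis=1).min()

def classify_patch(patch, view_sets):
    """Eq. 5.2: assign the patch to the class whose view set yields the
    smallest similarity value."""
    return int(np.argmin([similarity_to_class(patch, vs) for vs in view_sets]))

# Toy samples in the spirit of Fig. 5.1: vertical lines vs. crossed diagonals
sample1 = np.tile([[0, 1], [0, 1]], (2, 2))
sample2 = np.tile([[0, 1], [1, 0]], (2, 2))
view_sets = [extract_views(s, 2, 2) for s in (sample1, sample2)]
print(classify_patch(np.array([[1, 0], [0, 1]]), view_sets))  # prints 1 (second texture)
```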
5.2 Set of Characteristic Views and the K-views Template Algorithm (K-views-T)
Although the correlation matching method can accurately classify a textural image, it is computationally intensive. To have a representative set of views, a large sample sub-image has to be chosen, and a large sample sub-image increases the computation time. In fact, it is not necessary to compare a view (from an image being classified) with the entire set of views extracted from a sample sub-image in the matching process, because this original view set may contain several identical views, and some views are very similar. It is more efficient to choose a representative set of views from an original set of views that has similar or identical views; doing so does not affect the matching result. If a small representative set of views can be derived from the entire large view set of a sample sub-image, the computation time for the matching method will be dramatically reduced. This representative set of views derived from an original set of views is called the set of characteristic views. A view in the set of characteristic views is called a characteristic view and denoted by Vcs. An original set of views (abbreviated view set and denoted by Vs) can be formulated as

$$V_s = D @ V_{cs} \tag{5.4}$$
where Vs denotes an original set of views, Vcs a set of all different views from a sample sub-image, and D the relative frequency (or distribution) of the views (i.e., elements) in the set of different views (Vcs). The notation @ is the operator that we have chosen to relate the datagram to the frequency of each characteristic view in Vcs. The datagram will be explained later. For example, with the sample sub-image in Fig. 5.2a and a view size of 2 × 2, the view set, Vs, of the sample sub-image is

$$\left\{ \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix} \right\} \tag{5.5}$$
or

$$\{(0, 1; 0, 1), (1, 0; 1, 0), (0, 1; 0, 1), (0, 1; 0, 1), (1, 0; 1, 0), (0, 1; 0, 1)\} \tag{5.6}$$
Please note that the transpose T for each vector in Eq. 5.6 is omitted for simplicity; similarly, the notation T will be omitted in the following vector representations. In this view set, there are four views of the pattern (0, 1; 0, 1) and two views of the pattern (1, 0; 1, 0). The distribution can be represented as

$$V_s = (4, 2) @ \{(0, 1; 0, 1), (1, 0; 1, 0)\} \tag{5.7}$$
where (4, 2) is the datagram and {(0, 1; 0, 1), (1, 0; 1, 0)} is the set of characteristic views. If we choose to use only one characteristic view, Vs can be further simplified as

$$V_s = (6) @ \{(0.3, 0.7; 0.3, 0.7)\} \tag{5.8}$$
where the value 6 is the total number of views in the view set Vs, 0.3 is the average value of the distribution in the first component of all vectors in Eq. 5.6, 0.7 is the average value of the distribution in the second component, and so on. If the size of the sample sub-image gets larger, the average of the distribution gets closer to 0.5 in this case, and then Vs = (6)@{(0.5, 0.5; 0.5, 0.5)}. Similarly,

$$\begin{aligned} \text{the view set of the sample sub-image in Fig. 5.2b} &= \{(0, 1; 1, 0), (1, 0; 0, 1), (0, 1; 1, 0), (1, 0; 0, 1), (0, 1; 1, 0), (1, 0; 0, 1)\} \\ &= \{3 @ (0, 1; 1, 0),\; 3 @ (1, 0; 0, 1)\} \\ &= (3, 3) @ \{(0, 1; 1, 0), (1, 0; 0, 1)\} \\ &= (6) @ \{(0.5, 0.5; 0.5, 0.5)\} \end{aligned} \tag{5.9}$$

The correlation matching method will now use the set of characteristic views to classify an image texture. We call this method the K-views Template Algorithm (K-views-T), as a characteristic view is similar to a "template". The procedure of the K-views template algorithm, based on a modification of the correlation matching method, is described in the following.
The K-views Template Algorithm (K-views-T)
Step 1: Randomly select a sample sub-image in the area of the texture for each textural class from the original image. In other words, N sample sub-images will be selected for N textural classes. The size of each sub-image can be different.
Step 2: Extract a view set, Vs, from each sample sub-image.
Step 3: Determine the value of K for each view set, and derive the K-views for each set of characteristic views from each sample sub-image using the K-means algorithm or the fuzzy C-means algorithm. The number of views, K, may vary for each texture class (i.e., sample sub-image).
Step 4: In the matching process, each view (a small image patch), say V, of the original image being classified is compared with each characteristic view in each set of characteristic views to find the best match (the highest correlation). Please note that the size of each view is the same.
Step 5: If the best matching characteristic view belongs to characteristic view set j, classify all pixels in the view V to class j, where j = 1, …, N. (If the view is regarded as a neighborhood of one pixel, classify only that pixel to class j.)
Step 6: Repeat Steps 4 and 5 for each view in the original image being classified.

The procedure sketched above leaves two parameters undefined: the size of the view and the number of characteristic views, K. If a specific pattern is repeated frequently in the texture class, the size of the view can be small; otherwise, the size of the view should be large. The larger the size of the view, the more information about the texture the view can carry. The number of characteristic views should depend on both the texture structure in the image and the similarity between texture classes. The smaller the number K of characteristic views, the less accurately the set of characteristic views can describe the texture. For the image shown in Fig. 5.1, if K is set to 1, the classification result will not be satisfactory. Deriving the representative characteristic views from a view set of a sample sub-image is a simple and straightforward process. As long as the number of characteristic views (K) is determined, most clustering methods described in Chap. 3 can be used to select a representative set of characteristic views and obtain K cluster centers [9]. It can be proved that the cluster centers derived from these methods, by minimizing the objective function, are representative characteristic views of the view set. If the size of the view is selected appropriately, the K-views template method can have optimal performance in texture classification. However, if the size of the view is very small, different view sets may have some views in common. Hence, different sets of characteristic views may also have some identical or similar characteristic views. For example, if the size of the view is 1 × 2 for the sample sub-images shown in Fig. 5.2a and b, we will have the following sets of views:
$$\begin{aligned} \text{the set of views for sample sub-image-1} &= \{(0, 1), (1, 0), (0, 1), (0, 1), (1, 0), (0, 1), (0, 1), (1, 0), (0, 1)\} \\ &= (6, 3) @ \{(0, 1), (1, 0)\}, \end{aligned} \tag{5.10}$$

and

$$\begin{aligned} \text{the set of views for sample sub-image-2} &= \{(0, 1), (1, 0), (0, 1), (1, 0), (0, 1), (1, 0), (0, 1), (1, 0), (0, 1)\} \\ &= (5, 4) @ \{(0, 1), (1, 0)\} \end{aligned} \tag{5.11}$$
These two sets of characteristic views for two different textural samples are almost the same. In this situation, we cannot rely solely on these sets of characteristic views to obtain a correct classification. One solution to this problem is to increase the size of the view. However, before we do that, we should ask one question: are the sample sub-images really good representatives of the textures? Is it likely that a sample sub-image contains a part that is similar to another texture class? This leads us to a second solution: the information from the distribution in the datagram can be used. Suppose that two characteristic views in two different sets are identical (or similar). If 40% of sample sub-image-1 can be classified by this characteristic view and only 5% of sample sub-image-2 can be classified by the same characteristic view, we can simply remove this characteristic view from the second set of characteristic views. In our experiments, the size of the view is set to between 10 × 10 and 15 × 15. If a view size of 10 × 10 is chosen, a sample sub-image of size 50 × 50 has at most 41 * 41 different views, and it is unlikely that the sets of characteristic views for different texture classes will have an identical characteristic view. The empirical study shows that some of the views coming from the sample sub-image of class i may be classified into class j. In such a situation, the third solution is to increase the number K of characteristic views in a set so that more characteristic views can be used to accurately describe the characteristics of a texture class. In summary, three solutions to this problem can be used: (1) increase the size of the view, (2) use both the information from the distribution in the datagram and the set of characteristic views, and (3) increase the number of characteristic views in each set for a texture class. Besides the above three solutions, there exists another solution from the histogram point of view. Assume that there are two sets of characteristic views for two textural classes, arranged as histograms (we will call these datagrams [12]) as shown in Fig. 5.4: class 1 is (1, 1, 2, 7, 2)@(V4, V5, V6, V7, V8) and class 2 is (1, 3, 6, 3, 1)@(V1, V2, V3, V4, V5). Characteristic view 7 frequently appears in texture class 1 and occurs only in class 1. Similarly, characteristic views 2 and 3 frequently appear in texture class 2 and occur only in class 2. If we take an image patch containing characteristic view 7, we can say that this image patch belongs to texture class 1, since characteristic view 7 is the most distinctive view in this set.
Fig. 5.4 Two textural classes have some characteristic views in common. As shown in Eq. 5.4, i.e., Vs = D@Vcs, the y-coordinate represents D and the x-coordinate represents Vcs, which has five views (V4, V5, V6, V7, V8) for class 1 and (V1, V2, V3, V4, V5) for class 2 in the histogram
If the patch contains characteristic view 2 or 3, the entire patch will be classified into class 2, because these two characteristic views are the prominent features in that set. If the image patch consists of characteristic views 3 and 7, or 2 and 7, this indicates that the patch is on the boundary of the two textural classes. In a grayscale image, if the view size is set to 1 × 1, there will be at most 256 different characteristic views; the possible values are 0, 1, 2, 3, …, and 255. If two texture classes have the same set of characteristic views, the classification should depend entirely on the view distribution in the datagram instead of the set of characteristic views. As defined in Eq. 5.4, our texture classification model has been focusing on Vcs instead of D. This is in contrast with most other texture classification algorithms. The K-views template algorithm is suitable for textures that have periodically repeating patterns. For image textures that possess some random structure, the K-views template may not be effective in distinguishing them. In this situation, both the datagrams and the sets of characteristic views can be used in the image texture classification algorithms. Characteristic views are powerful and simple descriptors of the characteristics of textures. A datagram is also a useful feature for image classification [12]. The algorithm which uses the datagram will be discussed in Chap. 6.
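Before turning to the experimental results, the following Python sketch outlines the K-views-T procedure described above. It uses scikit-learn's KMeans for Step 3 and Euclidean matching for Steps 4-6; all names, default parameters, and the center-pixel labeling rule are our illustrative choices, not the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_views(img, m, n):
    """All overlapping m x n views of an image region, flattened (Eq. 5.3)."""
    M, N = img.shape
    return np.array([img[r:r + m, c:c + n].ravel()
                     for r in range(M - m + 1)
                     for c in range(N - n + 1)], dtype=float)

def characteristic_views(sample, m, n, k):
    """Step 3: cluster the view set of one sample sub-image into K groups;
    the cluster centers act as the characteristic views (K must not exceed
    the number of views in the sample)."""
    views = extract_views(sample, m, n)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(views)
    return km.cluster_centers_

def k_views_t(image, samples, m=3, n=3, k=5):
    """Steps 4-6: label the center pixel of every view in the image by the
    class whose set of characteristic views contains the nearest match."""
    cv_sets = [characteristic_views(s, m, n, k) for s in samples]
    rows, cols = image.shape
    labels = np.zeros((rows, cols), dtype=int)
    for r in range(rows - m + 1):
        for c in range(cols - n + 1):
            v = image[r:r + m, c:c + n].ravel().astype(float)
            dists = [np.linalg.norm(cv - v, axis=1).min() for cv in cv_sets]
            labels[r + m // 2, c + n // 2] = int(np.argmin(dists))
    return labels
```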
5.3 Experimental Results Using the K-views-T Algorithm
The K-views template algorithm was tested on satellite images in the experiments [7]. Figure 5.5a shows an original image that contains three texture classes: smooth land, ocean, and mountain. Figure 5.5b shows the classified result using one characteristic view for each texture class. From Fig. 5.5c to f, the same K value was used for all three texture classes; the value of K was set to 2, 5, 10, and 20 for Fig. 5.5c, d, e, and f, respectively. The larger the K used for a set of characteristic views, the better the classification result that can be achieved. Each texture class does not necessarily have the same number of characteristic views.
Fig. 5.5 Classification results using the K-views template algorithm. We assume that there are three texture classes in the image. a an original image, and the results of b, c, d, e, and f are obtained using the same K value (i.e., same number of characteristic views for each class) for all the three texture classes. The value K was set to 1, 2, 5, 10, and 20 for b, c, d, e, and f, respectively
This means that K may vary for each set of characteristic views of a texture class. For example, the ocean class and the smooth land class can have many fewer characteristic views than the mountain class. Figure 5.6 shows five sets with a varying number of characteristic views (i.e., K-views) for the mountain class.
Fig. 5.6 Five sets with a varying number of characteristic views (K-views) for the mountain texture class; from the first row to the last row, K was set to 1, 2, 5, 10, and 20
We can see that the more characteristic views are used in a set, the closer the set of characteristic views is to the real image. Figure 5.7 shows the classification results on another remotely sensed image. The original image in Fig. 5.7a is a region of the city of Atlanta in Georgia. The image is classified into the residential area (red), the lawns (green), the commercial area (blue), and the undeveloped area (black) in Fig. 5.7b. Figure 5.8 illustrates some classified results of animal images taken from the Berkeley website (http://sunsite.berkeley.edu/ImageFinder/) and compares them with the results of the color region method. Figure 5.9 shows an image consisting of four different texture classes taken from the Brodatz texture album and the classification results using the K-views-T algorithm. When the number of K-views increases, the classification accuracy is improved. Each texture class has a size of 100 × 100.
Fig. 5.7 a A sub-image of Atlanta city in Georgia, USA and b the classified image using the K-views template algorithm: residential area (red), lawns (green), commercial areas (blue), and undeveloped area (black)
Fig. 5.8 a Three images of animals, b classified results using the K-views template algorithm, and c classified results using the color region method (images in column a and segmented results in column c are taken from the Berkeley website, http://sunsite.berkeley.edu/ImageFinder/)
5.4 Empirical Comparison with GLCM and Gaussian MRF (GMRF)
Spatial models are frequently used for image texture classification [1–6, 11]. These models exploit contextual information by utilizing spatial features for the classification; these spatial features capture the spatial relationships encoded in the image. As described in Chap. 2, the gray-level co-occurrence matrix (GLCM) is a statistical method that calculates properties of the relationship between pairs of pixels [5, 6]. The spatial relationships between a pixel and its neighbors are recorded in the GLCM and then used for calculating the statistics for features [5]. Statistics that produce independent features are preferred, such as dissimilarity (D), entropy (E), and correlation (C) [5, 6]. These features are mapped into corresponding feature vectors, and the k-means and other clustering algorithms can be used to cluster these vectors. Geostatistics has also been used to measure spatial properties. Carr and Miranda [2] proposed a method based on geostatistics (called the variogram), which is also a second-order statistic, to extract texture features from the image. Unlike the GLCM, the variogram captures the average gray-level spatial dependence.
Fig. 5.9 Classified results using the K-views template algorithm on a grayscale texture image with four different textures. a an original image with the size of 200 × 200; b, c, d, e, and f are classified results obtained with the number of K-views set to 1, 2, 8, 15, and 25, respectively
The variogram can be computed along particular spatial directions. Four directions, E-W, N-S, NE-SW, and NW-SE, are usually used for the spatial directions and statistics calculation [1]. Markov random field (MRF) models are stochastic processes that define the local spatial characteristics of an image. They are used to model the textural content of the observed image. The models characterize the statistical relationships between a pixel and its neighbors [3, 10].
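For reference, a minimal NumPy sketch of the kind of GLCM features used in this comparison is shown below; it accumulates the co-occurrence counts directly for one displacement and derives the dissimilarity and entropy statistics. The quantization scheme and the function names are our own simplifications, not the exact setup of the cited experiments.

```python
import numpy as np

def glcm(img, levels=16, dx=1, dy=0):
    """Gray-level co-occurrence matrix for one displacement (dx, dy), after
    quantizing the image to `levels` gray levels."""
    q = np.clip((img.astype(float) * levels / (img.max() + 1)).astype(int),
                0, levels - 1)
    P = np.zeros((levels, levels))
    rows, cols = q.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[q[r, c], q[r2, c2]] += 1
    return P / P.sum()          # normalize to joint probabilities

def glcm_features(P):
    """Dissimilarity (D) and entropy (E) computed from a normalized GLCM."""
    i, j = np.indices(P.shape)
    dissimilarity = (P * np.abs(i - j)).sum()
    entropy = -(P[P > 0] * np.log(P[P > 0])).sum()
    return dissimilarity, entropy
```

Averaging over several directions, as in the figures below, can be done by computing one GLCM per displacement and averaging either the matrices or the derived feature values.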
Experimental results on some of the Brodatz texture images show the effectiveness of the models [11]. There are four different texture patterns in the images shown in Figs. 5.10 and 5.12 and five different texture patterns in the image in Fig. 5.11. In the experiments for the K-views-T model, the size of the K-views (i.e., the characteristic view), the number of characteristic views, and the statistics size (kernel size) are randomly selected for testing three different texture images. Although a variety of GLCM techniques are used in the literature, a simple GLCM is developed for the comparison. The classified results using the GLCM depend on parameters such as the distance, the angle, and the number of gray levels. In this experiment, the distance d = 1 and the angles a = 0, 45, 90, 135, 180, and 225 degrees were used. Since a gray level of 256 causes some overhead, the image is quantized to 16 gray levels. The experimental results are shown in Figs. 5.10, 5.11 and 5.12. The experimental results demonstrate that the K-views-T model is effective in the classification of image texture. Increasing the size of the K-views template and the number of K-views improves the classification result of the K-views-T model. The K-views-T model has shown a significant improvement in texture classification.
Fig. 5.10 Classified results of an image texture using different spatial models. a an original image, b the GLCM with the gray level of 16, a window size of 13, distance 1, and the average of all directions, c the Variogram with the window size of 9 and the average of all directions, d the GMRF with the window size of 16 and the fourth-order neighborhood structure, e K-views-T results (the size of K-views is 10, the number of characteristic views K is 20, and the sample sub-image size M is 20), and f K-views-T results (the size of K-views is 10, the number of characteristic views K is 30, and the sample sub-image size M is 30) [11]
Fig. 5.11 Classified results of an image texture using different spatial models. a an original image, b the GLCM result with the gray level of 16, a window size of 32, distance 1 and the average of all directions, c the Variogram with the window size of 11 and the average of all directions, d the GMRF with window size of 8 and fourth-order neighborhood structure, e K-views-T results (the size of K-views is 6, the number of characteristic views K is 20, and the sample sub-image size M is 30), and f K-views-T results (the size of K-views is 6, the number of characteristic views K is 40, and the sample sub-image size M is 40)
5.5 Simplification of the K-views
The concept of K-views has illustrated the capability to distinguish different texture classes, and the view set consisting of characteristic views is used to describe this relationship. The K-views template algorithm was developed for texture image classification based on the set of characteristic views. If we simplify and reduce the size of the K-views, some interesting results can be obtained. A K-views template with a size of 1 × 1 is equivalent to a pixel. Hence, a view of any size can be simplified as a line, a surface, or represented by using the normal. If the view size is 1 × 2, and assuming that the set of characteristic views has a size of 256 * 256 for 8-bit pixels in an image, a view distribution in the datagram is equivalent to a horizontal gray-level co-occurrence matrix (GLCM) with a distance of 1. Similarly, if the view size is 2 × 1, a view distribution in the datagram is equivalent to a vertical co-occurrence matrix with a distance of 1. However, the K-views are derived based on a sample sub-image, while the GLCM is derived based on a matrix with the arrangement of gray levels.
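The equivalence noted above is easy to verify in code: counting all 1 × 2 views of a gray-level image yields exactly the entries of a horizontal co-occurrence matrix with a distance of 1. The short sketch below is our own illustration.

```python
import numpy as np

def horizontal_view_counts(img):
    """Counts of every 1 x 2 view (pairs of horizontally adjacent gray
    levels); these counts are identical to an (unnormalized) horizontal
    co-occurrence matrix with distance 1."""
    g = img.astype(int)
    counts = np.zeros((g.max() + 1, g.max() + 1), dtype=int)
    left, right = g[:, :-1].ravel(), g[:, 1:].ravel()
    np.add.at(counts, (left, right), 1)   # accumulate pair frequencies
    return counts
```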
Fig. 5.12 Classified results of an image texture using different spatial models. a An original image, b the GLCM result with the gray level of 16, a window size of 32, distance 1, and the average of all directions, c the Variogram with the window size of 11 and the average of all directions, d the GMRF with the window size of 8 and the fourth-order neighborhood structure, e K-views-T results (the size of K-views is 4, the number of characteristic views K is 20, and the sample sub-image size M is 20), and f K-views-T results (the size of K-views is 4, the number of characteristic views K is 25, and the sample sub-image size M is 25)
If we are only interested in the two end pixels of views with a size of 1 × n, the view can be denoted as a vector, $(x_1, x_2, \ldots, x_n)^T$. The distribution in a datagram would then be equivalent to the horizontal co-occurrence matrix with a distance of n − 1. This similarity can also be found between the distribution in a datagram and other co-occurrence matrices. In general, the view distribution in a datagram contains more information because the information of all pixels is preserved. In the image texture classification literature, the features of image textures can be described by the surface, the normal, and the basic textural unit; some of these are discussed in Chap. 2. Compared with those features, characteristic views keep the original messages and information for the classification.
5.6 Summary
The set of characteristic views is a powerful feature extraction method to describe an image texture. Different textures show different patterns, and the patterns of a texture also show different views. If the size of the K-views and the number of K-views in a set of characteristic views are properly defined, it is feasible to use a set of characteristic views for texture classification. The K-views template method (K-views-T) is an algorithm that uses a set of characteristic views for image texture classification. The K-views-T algorithm is suitable for classifying image textures that have basic local patterns repeated in a periodic manner. The performance of the K-views algorithm is related to the size of the K-views template and the number of characteristic views (i.e., K-views) in each set. Increasing the view size and the number of characteristic views will generally improve the classification result at the expense of processing time. However, an issue remains to be explored in determining the relationship between the view size and the number of characteristic views in a set. This means that it is necessary to develop an algorithm which can automatically determine the view size and the number of characteristic views so that the K-views-T algorithm will have optimal performance in terms of computation time and classification accuracy. The proposed classification method needs to select the sample sub-images for each class interactively. It would be useful to develop an unsupervised learning approach without human interaction by using view-related features.
5.7 Exercises
For the numerical image shown below, assume that there are two different textures; one texture occupies the first four columns and the other the remainder of the image.
0 1 2 3 3 2 1 3
1 2 3 0 2 3 2 0
2 3 0 1 1 2 3 2
3 0 1 2 0 3 0 1
4 5 5 4 4 6 4 7
5 6 4 6 5 5 5 6
6 7 7 5 6 5 6 4
3 6 7 6 3 4 7 5
1. Develop a set of views with a template size of 2 × 2 and 3 × 3.
2. Develop a set of characteristic K-views from Exercise #1 using the K-views-T algorithm.
3. Compare the performance of the K-views-T algorithm with different K values.
4. Implement the K-views-T algorithm using a high-level programming language and apply the algorithm to an image with different textures.
References
1. Carr JR (1999) Classification of digital image texture using variograms. In: Atkinson PM, Tate NJ (eds) Advances in remote sensing and GIS analysis. Wiley, New York, pp 135–146
2. Carr JR, Miranda FP (1998) The semivariogram in comparison to the co-occurrence matrix for classification of image texture. IEEE Trans Geosci Remote Sens 36(6):1945–1952
3. Chellappa R, Chatterjee S (1985) Classification of textures using Gaussian Markov random fields. IEEE Trans Acoust Speech Signal Process 33(4):959–963
4. Gurney CM, Townshend JRG (1983) The use of contextual information in the classification of remotely sensed data. Photogramm Eng Remote Sens 49(1):55–64
5. Haralick RM, Shanmugam K, Dinstein I (1973) Textural features for image classification. IEEE Trans Syst Man Cybern 3(6):610–621
6. Hung CC, Yang S, Laymon C (2002) Use of characteristic views in image classification. In: Proceedings of the 16th international conference on pattern recognition, pp 949–952
7. Hung CC, Karabudak D, Pham M, Coleman T (2004) Experiments on image texture classification with K-views classifier, Markov random fields and co-occurrence probabilities. In: Proceedings of the 2004 IEEE international geoscience & remote sensing symposium (IGARSS), Anchorage, Alaska, 20–24 Sept 2004
8. Lee JH, Philpot WD (1991) Spectral texture pattern matching: a classifier for digital imagery. IEEE Trans Geosci Remote Sens 29(4):545–554
9. Tou JT, Gonzalez RC (1974) Pattern recognition principles. Addison-Wesley, Reading
10. Woods JW (1972) Two-dimensional discrete Markovian fields. IEEE Trans Inf Theory 18:232–240
11. Xiang M, Hung CC, Pham M, Coleman T (2005) Spatial models for image texture classification: an experiment. In: Proceedings of the 4th international conference on information and management sciences, Kunming, China, 1–10 July 2005, pp 341–343, ISSN 1539-2023
12. Yang S, Hung CC (2003) Image texture classification using datagrams and characteristic views. In: Proceedings of the 18th ACM symposium on applied computing (SAC), Melbourne, FL, 9–12 March 2003, pp 22–26. https://doi.org/10.1145/952532.952538
6
Using Datagram in the K-views Model
Truthful words are not beautiful; beautiful words are not truthful. Good words are not persuasive; persuasive words are not good. — Lao Tzu
It is feasible for us to use only characteristic views (i.e., the basic K-views template algorithm) to classify different image textures. The performance of the K-views template (K-views-T) algorithm is related to the size of a view template and the number of characteristic views in the set of characteristic views. If the size of a view template and the number of characteristic views are increased, the classification accuracy will be improved at the expense of time complexity. To reduce the time complexity of the K-views-T algorithm while maintaining high classification accuracy, the algorithm can utilize the datagram, in which the frequencies of the characteristic views are accumulated and presented as a histogram. Due to the use of frequencies, a smaller view size can be used while maintaining similar classification accuracy. In a sense, this is very similar to the approach used in the LBP and the Textural Unit, in which a histogram depicting the frequency distribution for a texture patch is used for the classification. In the basic K-views-T algorithm, the decision is made by a single characteristic view whose center is located at the current pixel being classified. By using the datagram in the K-views model, the decision is made by the distribution of all the views contained in a large patch (i.e., block) in which the current pixel is the center of the block. Hence, a new K-views datagram algorithm (K-views-D) is developed based on the datagram concept. Due to the spatial template used for the view, the pixels located on the boundary between texture classes need special care for a complete classification. This problem is similar to applying spatial smoothing filters to the boundary areas among different textures in an image. Therefore, a boundary-refined method is described to improve the classification of boundary pixels.
6.1 Why Do We Use Datagrams?
The K-views template is based on the assumption that an image texture has a specific pattern that distinguishes it from other textures, and this particular pattern reveals different characteristic views. If an image has random structures, we must choose both a large K and a reasonable view size in order to have high classification accuracy. A larger K value means that more computation time is needed to derive the sets of characteristic views for the classification. To reduce the computation time, the datagram can be used [7, 11, 13]. The datagram concept is very similar to the histograms used for the distribution of local binary patterns (LBP) and the textural spectrum (TS) [4, 5, 10, 12]. As discussed in Chap. 5, a full set of views (abbreviated view set) can be formulated as

$$V_s = D @ V_{cs} \tag{6.1}$$
where Vs denotes an original set of views, Vcs a set of all different views from a sample sub-image, and D the relative frequency (or distribution) of the views (i.e., elements) in the set of different views (i.e., Vcs). The notation @ is the operator that we have chosen to relate the datagram to the frequency of each characteristic view in Vcs. Although an image texture may show a random structure at the macroscopic level, it may consist of small nonrandom "micro structures" (small views), and randomly structured textures also have stable statistical values. For example, if the size of the view template is reduced to 1 × 1, any 8-bit gray-level image has at most 256 characteristic views; these are the gray levels 0, 1, 2, …, and 255. In such a case, different textures may have different datagrams (in this case, the datagram is equivalent to the histogram), and the important features of different image textures are shown in their datagrams. Hence, we can develop an algorithm to classify image textures based on the datagrams; this is the idea behind histogram-based methods. An algorithm which uses the datagram has an advantage: a smaller view size can be used. Because the frequencies of views are used, it maintains a classification accuracy similar to that of a large view size used in the K-views-T algorithm. This datagram approach is very similar to those used in the LBP and the Textural Unit, in which a histogram depicting the distribution of all the frequencies (i.e., a feature index extracted from the image) for a texture patch is used for the classification [1, 2, 4–6, 10, 12]. Let us take the same textures used for the K-views template in Chap. 5 and create datagrams for the K-views extracted from the sub-image, as shown in Fig. 6.1. Datagrams of Fig. 6.1 with two texture classes are illustrated in Example 6.1: texture-1 with parallel vertical lines and texture-2 with diagonal lines crossing each other.
Fig. 6.1 a An image showing two texture classes and b the corresponding pixel values
Example 6.1 Datagrams of Fig. 6.1 with two texture classes (texture-1 and texture-2). Assume that the following two view templates in a set of Vcs are used to extract the frequencies from these two texture classes. These two view templates are denoted as Vcs1 and Vcs2:

$$V_{cs1} = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad V_{cs2} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

By using Eq. 6.1, for texture-1, D = 12 @ Vcs1 and D = 0 @ Vcs2; similarly, for texture-2, D = 0 @ Vcs1 and D = 8 @ Vcs2. The corresponding datagrams are therefore (12, 0)@(Vcs1, Vcs2) for texture-1 and (0, 8)@(Vcs1, Vcs2) for texture-2.
Fig. 6.2 a An image with four different textures and b a datagram of the four different textures
The datagrams are obtained by using Eq. 5.1 in Chap. 5 (for convenience, it is listed below as Eq. 6.2):

$$S_{V_o V_i} = \min\{d_1, d_2, d_3, \ldots, d_k, \ldots, d_{(M-m+1)(N-n+1)}\} \tag{6.2}$$
Please note that the similarity, SVoVi, between Vo and a patch Vi is measured by the Euclidean distance. Based on the set of characteristic views, Vcs, we can calculate a datagram (D) for each of the N sample sub-images. Each datagram D can then be normalized to become a normalized datagram DN using Eq. 6.3. We call each of these N normalized datagrams coming from a sample sub-image a sample datagram DS, which is identical to DN; hence, DS = DN in Eq. 6.3.
$$D = (d_1, d_2, d_3, \ldots, d_K), \qquad T = \sum_{i=1}^{K} d_i, \qquad D_N = (d_{n1}, d_{n2}, d_{n3}, \ldots, d_{nK}) = (d_1/T, d_2/T, d_3/T, \ldots, d_K/T) \tag{6.3}$$

where di is the number of views in the corresponding bin of the datagram distribution. A datagram for each texture in the image of Fig. 6.2a is shown in Fig. 6.2b. The characteristic views of the texture image in Fig. 6.2a are grouped into 40 clusters using the K-means algorithm; these 40 clusters are equivalent to the set of characteristic views. Then, for each texture, a normalized datagram is obtained from the statistics of the characteristic views. The datagrams are shown as four series (Series1, Series2, Series3, and Series4) in Fig. 6.2b. An index number is assigned to each characteristic view; a different index number indicates a different characteristic view. These numbers are shown on the horizontal axis in Fig. 6.2b, while the vertical axis shows the appearance probability of each characteristic view in the datagram. For example, the first yellow column has a probability of 0.09, which means that characteristic view 1 has an appearance probability of 9% in the third texture class. This datagram shows that, if we take any small image patch from the first texture class, its datagram should look like Series 1 in Fig. 6.2b. If different image textures have the same characteristic views, we can still use the datagrams for classification, as we are using the distribution instead of the view template. Since a small view size is used, the time to calculate the set(s) of characteristic views is reduced compared with the K-views template; however, the datagram-based algorithm spends extra time on constructing the datagram.
6.2 The K-views Datagram Algorithm (K-views-D)
Similar to the LBP and the Textural Unit, a histogram depicting the frequency distribution of all characteristic views in a set for a texture sub-image can be used for the classification. Such a histogram is called a datagram. The algorithm which uses datagrams for classification is described in the following steps:

Step 1: Select a sample sub-image for each texture class from the original image. In other words, N sample sub-images will be selected for N texture classes. Steps 2–4 will be repeated for each texture class.
Step 2: Determine the size of the view template (m × n), extract views from each sample sub-image, and form a view set VS.
Step 3: Determine a value K for the K-views and use the K-means (or fuzzy C-means) algorithm to derive a set of characteristic views (CVS) with K groups of characteristic views from the view set VS.
Step 4: Based on the set of characteristic views CVS, calculate a datagram (D) for each of the N sample sub-images. According to Eq. 6.3, each datagram D is normalized to obtain a normalized datagram (DN). We call each of the N normalized datagrams (one for each sample sub-image) a sample datagram (DS).
Step 5: Scan the image from left to right and top to bottom using a window of M × M pixels (M × M should be much larger than the view size; we have tested view sizes of 3 × 3, 4 × 4, and 5 × 5, and M values from 20 to 30 in this datagram algorithm), and derive the corresponding normalized datagram for each window.
Step 6: Calculate the difference between the normalized datagram and each of the N sample datagrams (DS), and classify the central pixel of the window to the class for which the difference between the sample datagram of that class and the normalized datagram of the window is the minimum.

The difference (Diff) in Eq. 6.4 between a normalized datagram of a window W, denoted DNW (in Eq. 6.5), and a sample datagram DS (from Eq. 6.3) can be derived as

$$\text{Diff} = \sum_{i=1}^{K} |d_{Si} - d_{Ni}| \tag{6.4}$$
$$D_{NW} = (d_{N1}, d_{N2}, d_{N3}, \ldots, d_{NK}) \tag{6.5}$$
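A compact Python sketch of the K-views-D classification loop (Steps 4-6 and Eqs. 6.2-6.5) is given below. It assumes a single pooled set of characteristic views shared by all classes, as in the 40-cluster example of Fig. 6.2, so that the datagrams of different classes are directly comparable; the function names and the nearest-view assignment are our reading of the algorithm rather than the authors' code.

```python
import numpy as np

def normalized_datagram(region, char_views, m, n):
    """Assign every overlapping m x n view of `region` to its nearest
    characteristic view (Eq. 6.2) and return the normalized counts
    D_N = (d_1/T, ..., d_K/T) of Eq. 6.3."""
    counts = np.zeros(len(char_views))
    rows, cols = region.shape
    for r in range(rows - m + 1):
        for c in range(cols - n + 1):
            v = region[r:r + m, c:c + n].ravel().astype(float)
            counts[np.argmin(np.linalg.norm(char_views - v, axis=1))] += 1
    return counts / counts.sum()

def k_views_d(image, char_views, sample_datagrams, m=3, n=3, M=30):
    """Steps 5-6: slide an M x M window over the image, build its normalized
    datagram D_NW (Eq. 6.5), and label the window's center pixel with the
    class whose sample datagram gives the smallest difference Diff (Eq. 6.4).
    `char_views` are the K cluster centers from Step 3; `sample_datagrams`
    are the normalized datagrams of the N sample sub-images from Step 4."""
    rows, cols = image.shape
    labels = np.zeros((rows, cols), dtype=int)
    for r in range(rows - M + 1):
        for c in range(cols - M + 1):
            d_nw = normalized_datagram(image[r:r + M, c:c + M], char_views, m, n)
            diffs = [np.abs(ds - d_nw).sum() for ds in sample_datagrams]
            labels[r + M // 2, c + M // 2] = int(np.argmin(diffs))
    return labels
```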
Figure 6.3 shows the classification result obtained by using the K-views-D algorithm described above. Table 6.1 shows the experimental results conducted on 55 pairs of images randomly taken from the Brodatz gallery. The average classification accuracy is 96.6% (when we calculate the classification accuracy, the pixels on the image boundaries are not taken into consideration). The K-views datagram algorithm was used in this experiment. The size of the sample sub-image is 40 × 40, the view size is 3 × 3, the number of K-views is K = 60, and M is set to 30 for the K-views datagram algorithm. Sample sub-images are selected randomly; if sample sub-images are carefully chosen, the classification accuracy can be improved. From these experimental results, we observed that the K-views-D algorithm takes less time than the K-views-T algorithm to achieve the same classification accuracy. The K-views-D algorithm does not show a good performance on image D94 versus the other images, as shown in Table 6.1. The reason is that D94 is not uniform in the distribution of pixels; the left bottom part of the image is much darker than the right top portion of D94. Note that both the K-views-T and K-views-D algorithms are not invariant to texture rotation. This is why the experimental results on D15 versus other images in Table 6.1 do not show high classification accuracy.

Fig. 6.3 Classified result obtained by applying the K-views-D algorithm to the image shown in Fig. 6.2a

Table 6.1 Classification accuracy of the experimental results using the K-views-D algorithm. Please note that (1) the images tested are randomly selected from the Brodatz Gallery, and (2) the sample sub-image size is 40 × 40, the view size is 3 × 3, K = 60, and M = 30. The notation v1\v2 denotes the classification accuracy of the two texture classes; for example, 96\99 (row D9, column D12) means that 96% of the pixels of D9 and 99% of the pixels of D12 are correctly classified

      D12      D15      D24      D29      D38      D68      D84      D92      D94      D112
D9    96\99    93\100   100\100  99\100   100\100  100\99   100\100  100\100  100\100  100\100
D12            99\96    100\96   90\100   100\100  91\94    100\100  90\100   97\90    99\95
D15                     100\100  94\100   100\100  99\89    96\100   87\100   86\99    100\97
D24                              94\97    97\100   100\96   95\99    95\98    95\91    100\87
D29                                       100\100  100\95   100\100  84\100   96\78    100\96
D38                                                100\87   100\100  100\100  100\80   100\100
D68                                                         86\100   86\100   85\96    99\100
D84                                                                  88\98    100\98   100\99
D92                                                                           98\77    100\97
D94                                                                                    100\99
6.3 Boundary Refined Texture Segmentation Based on the K-views-D Algorithm
If we do not consider the pixels on the boundary among different texture classes, the K-views-D algorithm can achieve good segmentation results on most natural images and remotely sensed images [8, 9, 11]. But in some applications, such as medical image segmentation, an exact segmentation of the boundary areas is needed. Existing K-views algorithms cannot provide satisfactory results for the boundary discrimination among different texture classes. Figure 6.4 shows an example of the initial segmentation of a pair of image textures taken from the Brodatz Gallery.

Fig. 6.4 a An original image and b an initial segmentation result of a using the K-views-D algorithm

We can observe from Fig. 6.4 that the misclassifications are near the boundary of the two texture classes. This is because the K-views-D method requires a large scanning window, and this large window will cover multiple texture classes. We introduce a new texture segmentation method to improve the segmentation of boundary pixels [6]. We define a boundary set for an image and then apply a modified K-views-T method with a small scanning window to the boundary set to improve the accuracy of the segmentation. The boundary set (B) is defined as the set of pixels which includes all pixels P that have more than half of their neighboring pixels classified into classes other than that of P itself by the initial segmentation result of the K-views-D algorithm. Here, a neighboring pixel means a pixel within the n × n square window around the center pixel, where n is an odd integer. Figure 6.5 shows the misclassified pixels from Fig. 6.4 and the boundary set, B, of the image.
Fig. 6.5 a Misclassified pixels from Fig. 6.4 and b the boundary set B of the image
It can be observed in Fig. 6.5 that almost all the misclassified pixels are included in the boundary set. The boundary-refined algorithm consists of three steps:

Step 1: Apply the K-views-D algorithm to the image to obtain an initial segmentation.
Step 2: Find a boundary set which includes the pixels with a high probability of having been misclassified by the initial K-views-D algorithm.
Step 3: Apply the K-views-T method with a small scanning window to the boundary set to refine the segmentation.

The algorithm was tested on benchmark images randomly selected from the Brodatz Gallery [3] to evaluate its effectiveness. For the initial K-views-D segmentation, a large K value and reasonable kernel size and view size must be chosen in order to have high segmentation accuracy in the non-boundary areas. We chose K = 60, which is the number of views in each characteristic view set. The size of the kernel was chosen as 40 × 40 (L = 40), the size of the view was chosen as 10 × 10 (m = 10), and the size of the scanning window was chosen as 30 × 30 (M = 30), which is smaller than the size of the kernel but is still large enough. At the boundary set searching stage, n = 11 is chosen; this means that we consider the pixels in the square block of size 11 × 11 around a pixel as its neighboring pixels. At the refinement stage, a different K value and a smaller view size are chosen for the K-views-T algorithm to achieve high segmentation accuracy in the boundary set; we chose K = 30, and the size of the view was set to 7 × 7 (m = 7) for this stage. Figure 6.6 shows the boundary-refined segmentation result for a combined image consisting of five benchmark texture images. As shown in Fig. 6.6, the initial result shown in (b) is improved in the refined result shown in (e). The most common technique for the diagnosis of prostate cancer is core biopsy using ultrasound images [11]. Many researchers have investigated methods for improving the accuracy of biopsy protocols. Automatic or interactive segmentation of the prostate in ultrasonic prostate images is an important step in these techniques [10]. This boundary-refined algorithm provides an automatic method for improving the segmentation of medical images, and the experimental results show that it gives satisfactory segmentation accuracy. Figure 6.7 shows an initial experimental result for an axial ultrasonic prostate image.
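As a sketch of Step 2, the boundary set B can be derived from the initial label map as follows; the border handling and the function name are our own choices. The pixels flagged in B are then re-labeled by a K-views-T pass with the smaller view size (m = 7) and K = 30, as in Step 3.

```python
import numpy as np

def boundary_set(labels, n=11):
    """Boolean mask of the boundary set B: pixels whose n x n neighborhood
    (n odd) contains more than half pixels labeled differently from the
    center pixel in the initial K-views-D segmentation."""
    half = n // 2
    rows, cols = labels.shape
    B = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            window = labels[max(0, r - half):r + half + 1,
                            max(0, c - half):c + half + 1]
            B[r, c] = np.count_nonzero(window != labels[r, c]) > window.size / 2
    return B
```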
Fig. 6.6 a An original image with five different textures, b the initial segmentation result, c the boundary set, d pixels misclassified by the initial segmentation, e the refined segmentation result, and f pixels still misclassified after the refinement of the segmentation
Fig. 6.7 a An original ultrasonic prostate image and b final segmentation result: white: prostate and black: other tissues [11]
6.4 Summary
A set of characteristic views is a useful feature set to describe image textures, and it is possible to use only a set of characteristic views to classify different image textures. However, the performance of the K-views template method is related to the view size and the number of characteristic views in each set. Increasing the view size and the number of characteristic views will generally improve the classification result at the expense of processing time. The algorithm using datagrams allows us to use a smaller view size and still achieve high classification accuracy. When the K-views datagram (K-views-D) algorithm is used for classifying image textures which have random structures, it generally takes much less time than the K-views template (K-views-T) algorithm to achieve the same classification accuracy in the empirical study. For random image structures, it is sometimes difficult to derive a robust set of characteristic views. Hence, the normalized datagrams of textures are acquired to replace the K-views feature set, and a K-views datagram algorithm is then developed for texture classification by using the datagrams. In tests on textural images, the K-views datagram algorithm achieves promising classification results. Both the K-views-T and K-views-D algorithms described here are supervised models, in which sample sub-images are manually selected. Automatic image classification models can be developed based on the set of characteristic views and datagrams. Such models should include a semi-automatic model, which can be used to create a database of image textures for unsupervised learning, an automatic model based on the datagram, and an automatic model based on the set of characteristic views.
6.5 Exercises
For a numerical image shown below, assume that there are four different textures in the image; each texture occupies one quadrant.
0 1 2 3 200 210 190 205
1 2 3 0 200 190 205 209
2 3 0 1 200 200 210 200
3 0 1 2 200 200 200 210
100 100 100 100 4 6 4 7
90 90 90 90 5 5 5 6
100 100 100 100 6 5 6 4
90 90 90 90 3 4 7 5
1. Develop a set of views with a template size of 2 × 2 and 3 × 3.
2. Develop a set of characteristic K-views from Exercise #1 using the K-views-T algorithm.
3. Compare the performance of the K-views-T algorithm with different K values.
4. Develop the datagram of K-views with different K values.
5. Implement the K-views-D algorithm using a high-level programming language and apply the algorithm to an image consisting of different textures.
6. Compare the performance of the K-views-D algorithm, Local Binary Patterns (LBP), and Texture Spectrum (TS) on image textures.
References
1. Arasteh S, Hung C-C (2006) Color and texture image segmentation using uniform local binary pattern. Mach Vis Graph 15(3/4):265–274
2. Arasteh S, Hung C-C, Kuo B-C (2006) Image texture segmentation using local binary pattern and color information. In: Proceedings of the international computer symposium (ICS 2006), Taipei, Taiwan, 4–6 Dec 2006
3. Brodatz P (1966) Textures: a photographic album for artists and designers. Dover Publications, New York
4. He D-C, Wang L (1989) Texture unit, texture spectrum, and texture analysis. In: Proceedings of IGARSS'89/12th Canadian symposium on remote sensing, vol 5, pp 2769–2772
5. He D-C, Wang L (1990) Texture unit, texture spectrum, and texture analysis. IEEE Trans Geosci Remote Sens 28(4)
6. Hung C-C, Yang S, Laymon C (2002) Use of characteristic views in image classification. In: Proceedings of the 16th international conference on pattern recognition, pp 949–952
7. Hung C-C, Pham M, Arasteh S, Kuo B-C, Coleman T (2006) Image texture classification using texture spectrum and local binary pattern. In: Proceedings of the 2006 IEEE international geoscience & remote sensing symposium (IGARSS), Denver, Colorado, USA, 31 July–4 Aug 2006
8. Lan Y, Liu H, Song E, Hung C-C (2010) An improved K-view algorithm for image texture classification using new characteristic views selection methods. In: Proceedings of the 25th association of computing machinery (ACM) symposium on applied computing (SAC 2010), computational intelligence and image analysis (CIIA) track, Sierre, Switzerland, 21–26 March 2010, pp 960–964. https://doi.org/10.1145/1774088.1774288
9. Lan Y, Liu H, Song E, Hung C-C (2011) A comparative study and analysis on K-view based algorithms for image texture classification. In: Proceedings of the 26th association of computing machinery (ACM) symposium on applied computing (SAC 2011), computational intelligence, signal and image analysis (CISIA) track, Taichung, Taiwan, 21–24 March 2011. https://doi.org/10.1145/1982185.1982372
10. Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 24(7)
11. Song E, Jin MR, Hung C-C, Lu Y, Xu X (2007) Boundary refined texture segmentation based on K-views and datagram method. In: Proceedings of the 2007 IEEE international symposium on computational intelligence in image and signal processing (CIISP 2007), Honolulu, HI, USA, 1–6 April 2007, pp 19–23
12. Wang L, He D-C (1990) A new statistical approach for texture analysis. Photogramm Eng Remote Sens 56(1):61–66
13. Yang S, Hung C-C (2003) Image texture classification using datagrams and characteristic views. In: Proceedings of the 18th ACM symposium on applied computing (SAC), Melbourne, FL, 9–12 March 2003, pp 22–26. https://doi.org/10.1145/952532.952538
7 Features-Based K-views Model
Now the general who wins a battle makes many calculations in his temple ere the battle is fought. The general who loses a battle makes but few calculations beforehand. Thus do many calculations lead to victory and few calculations to defeat: how much more no calculation at all! It is by attention to this point that I can foresee who is likely to win or lose. —Sun Tzu
This chapter describes a new K-views algorithm, the K-views rotation-invariant features (K-views-R) algorithm, for texture image classification using rotation-invariant features. These features are statistically derived from a set of characteristic views for each texture. Unlike the basic K-views model (e.g., the K-views-T method), all the views used are transformed into rotation-invariant features, and the characteristic views (i.e., K-views) are selected randomly. This is in contrast to the basic K-views model, which uses the K-means algorithm to choose a set of characteristic views. In this new algorithm, the decision of assigning a pixel to a texture class is made by considering all the views that contain the pixel being classified. To preserve the primitive information of a texture class as much as possible, the new algorithm randomly selects K views from the view set of each sample sub-image as the set of characteristic views.
7.1 Rotation-Invariant Features
Although the K-views datagram (K-views-D) algorithm performs better than the K-views template (K-views-T) algorithm, the classification accuracy in the boundary areas is still a challenging problem for the K-views model. In addition, the "characteristic views" extracted for the K-views-T algorithm are not rotation-invariant, which means that texture classes cannot be correctly recognized when the image is rotated. A new K-views algorithm is developed in order to extract the rotation-invariant features of texture images and
use them for improving the classification [5]. In this new algorithm, the K-views are randomly selected from the view set of each texture class as the "characteristic views", which is different from the existing K-views-T and K-views-D algorithms. Then, we extract the rotation-invariant features from the "characteristic views". In the process of classification, the decision of which texture class a pixel belongs to is made by considering all the views which contain the pixel being classified. The rotation-invariant features of each of these corresponding views in the image being classified are also calculated. To develop the new K-views algorithm, a set of rotation-invariant features will be extracted from all the characteristic views, which are obtained using the K-views-T algorithm as described in Chap. 5. The rotation-invariant features are then used in the new K-views algorithm, which will be coined the K-views rotation-invariant features (K-views-R) algorithm. Let us define the feature vector of a view as in Eq. 7.1:

$$f_K = [x_1, x_2, \ldots, x_C]^T \quad (7.1)$$
where C is the dimension of the feature vector, which is obtained by stacking all the rows in a (characteristic) view to form a vector, and the subscript K is the index representing one of the characteristic views. If a view size is m × n, then C = m × n. Without loss of generality, a normalization process can be applied to the feature vector. The normalized feature vector of the corresponding Kth view is described as in Eq. 7.2:

$$x_{Ki} = \begin{cases} 0, & \text{if } x_{Ki} \le (x_{Oi})_{\min} \\ 1, & \text{if } x_{Ki} \ge (x_{Oi})_{\max} \\ \dfrac{x_{Ki} - (x_{Oi})_{\min}}{(x_{Oi})_{\max} - (x_{Oi})_{\min}}, & \text{otherwise} \end{cases} \quad (7.2)$$
where $i = 1, \ldots, C$, K denotes the Kth view, and $(x_{Oi})_{\min}$ and $(x_{Oi})_{\max}$ are defined in Eqs. 7.3 and 7.4. Let S be the total number of characteristic views obtained from the sample sub-images of all texture classes. For each sample sub-image with a size of M × N, a set of (M − m + 1) × (N − n + 1) characteristic views can be extracted if the view size is m × n. Hence, the total number of characteristic views is S = (M − m + 1) × (N − n + 1) × Ntc if we assume that there are Ntc texture classes in an image. Now, $(x_{Oi})_{\min}$ and $(x_{Oi})_{\max}$ can be defined as in Eqs. 7.3 and 7.4, respectively:

$$(x_{Oi})_{\min} = \left[\min_{l=1}^{S} x_{l1},\; \min_{l=1}^{S} x_{l2},\; \ldots,\; \min_{l=1}^{S} x_{lC}\right] \quad (7.3)$$

$$(x_{Oi})_{\max} = \left[\max_{l=1}^{S} x_{l1},\; \max_{l=1}^{S} x_{l2},\; \ldots,\; \max_{l=1}^{S} x_{lC}\right] \quad (7.4)$$
In other words, the minimum and maximum values are chosen for each component across all the feature vectors for the normalization. The subscript Oi in Eqs. 7.3 and 7.4 refers to the overall set of views (i.e., feature vectors), and i refers to a component in a feature vector. Example 7.1 illustrates the concept described above.

Example 7.1 Assume that three characteristic views, Vcs1, Vcs2, and Vcs3, are obtained from a sample sub-image:
$$V_{cs1} = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 8 & 3 \\ 2 & 3 & 7 \end{bmatrix}, \quad V_{cs2} = \begin{bmatrix} 4 & 5 & 6 \\ 6 & 5 & 5 \\ 4 & 5 & 6 \end{bmatrix}, \quad V_{cs3} = \begin{bmatrix} 10 & 9 & 10 \\ 10 & 9 & 10 \\ 10 & 9 & 10 \end{bmatrix}$$

Their corresponding (row-stacked) feature vectors are

$$f_{vcs1} = [0, 1, 2, 1, 8, 3, 2, 3, 7]^T$$
$$f_{vcs2} = [4, 5, 6, 6, 5, 5, 4, 5, 6]^T$$
$$f_{vcs3} = [10, 9, 10, 10, 9, 10, 10, 9, 10]^T$$

The minimum (min) and maximum (max) are obtained among all the feature vectors for each component as

$$(x_{nc})_{\min} = \{0, 1, 2, 1, 5, 3, 2, 3, 6\}$$
$$(x_{nc})_{\max} = \{10, 9, 10, 10, 9, 10, 10, 9, 10\}$$
The normalized feature vectors for three characteristic views are
$$f_{vcs1}(nc) = \left[0, 0, 0, 0, \tfrac{3}{4}, 0, 0, 0, \tfrac{1}{4}\right]^T$$
$$f_{vcs2}(nc) = \left[\tfrac{2}{5}, \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{5}{9}, 0, \tfrac{2}{7}, \tfrac{1}{4}, \tfrac{1}{3}, 0\right]^T$$
$$f_{vcs3}(nc) = [1, 1, 1, 1, 1, 1, 1, 1, 1]^T$$

After the feature vectors are normalized, six significant rotation-invariant features will be extracted from all the feature vectors: (1) mean, (2) standard deviation, (3) entropy, (4) skewness, (5) kurtosis, and (6) histogram. These features provide good discrimination of textures, and we can define the rotation-invariant features of a view which has m × n pixels in the following [2, 7].

Feature 1: Mean

$$x_1 = \frac{1}{C}\sum_{i=1}^{C} V_i \quad (7.5)$$
Here, $V_i$ is the value of the ith component in the feature vector with dimension C.

Feature 2: Standard Deviation

$$x_2 = \left[\frac{1}{C-1}\sum_{i=1}^{C}\left(V_i - x_1\right)^2\right]^{\tfrac{1}{2}} \quad (7.6)$$
where $x_1$ is the mean of the feature vector from Eq. 7.5.

Feature 3: Entropy

$$x_3 = -\sum_{i=1}^{C} p_i \ln p_i \quad (7.7)$$

where $p_i$ is the probability (relative frequency) of feature vector component i over the total number of components.

Feature 4: Skewness

$$x_4 = \frac{M_{3,3}}{d^3} \quad (7.8)$$

where $M_{3,3} = \frac{1}{C}\sum_{i=1}^{C}(V_i - x_1)^3$. Notations $V_i$ and $x_1$ are defined as in Eq. 7.5, $M_{3,3}$ is the third-order moment of the feature vector, and d is the standard deviation of the feature vector.
Feature 5: Kurtosis

$$x_5 = \frac{M_{4,4}}{d^4} \quad (7.9)$$

where $M_{4,4} = \frac{1}{C}\sum_{i=1}^{C}(V_i - x_1)^4$. Notations $V_i$ and $x_1$ are defined as in Eq. 7.5, $M_{4,4}$ is the fourth-order moment of the feature vector, and d is the standard deviation of the feature vector.

Feature 6: Histogram (a distribution of components in the feature vector)

The histogram is a graphical representation of the gray-scale distribution in a digital image. It plots the number of pixels in the image (vertical axis) against a particular brightness or gray-level range (horizontal axis). For our feature vectors, it is the number of components versus the range of values between 0 and 1 in the normalized feature vector; an increment of 0.1 is used from 0 to 1 along the horizontal axis of the histogram.

Next, we define the "correlative view". A correlative view is a view which contains the pixel being classified, as shown in Fig. 7.1 [4, 5]. A pixel may have several correlative views (i.e., feature vectors) covering it; hence, we may call each of them a correlative feature vector. The probability of each correlative feature vector will be calculated to determine to which texture class the pixel belongs. Since a pixel being classified may have many correlative views containing it, as shown in Fig. 7.1, each correlative view should compete (or vote) to decide which texture class the pixel belongs to. If the view size is m × n, for each pixel there are m × n possible correlative views of size m × n containing the pixel. This concept is very similar to the templates used in the Tomita filter, where five masks are used to determine the least variance and the mean of that mask then replaces the noisy pixel being considered for edge-preserving image smoothing, as shown in Fig. 7.2 [3, 8]. A similar concept is also used in the Nagao filter, which uses nine masks for image smoothing, as shown in Fig. 7.3 [6].

Fig. 7.1 An example of the correlative views [5] in an image

Fig. 7.2 Five masks are used in the Tomita filter; a mask of 3 × 3 in the center and four masks with a size of 3 × 3 in the four corners

Fig. 7.3 Nine masks are used in the Nagao filter; a square mask of 3 × 3 in the center, four masks each covering 7 pixels on each side, and four masks each covering 5 pixels in the four corners

Thus, the decision of which texture class the pixel belongs to can be determined by the similarity between the correlative views and the characteristic views of a texture class. In other words, we can calculate the average similarity between all the correlative views containing the pixel being classified and all the characteristic views of each texture class. The calculation is based on their normalized feature vectors. Then, we classify the pixel to the texture class with the maximum similarity measure. The similarity of each correlative view and a characteristic view in a texture class can be calculated using the Euclidean distance and taking its inverse, as shown in Eq. 7.10. We assume that there are M correlative views and V characteristic views for each of N textures.
$$S_{ij}^{k} = \frac{1}{|d_{ij}^{k}|} \quad (7.10)$$
where $i \in \{1, \ldots, M\}$, $j \in \{1, \ldots, V\}$, $k \in \{1, \ldots, N\}$, and $|d_{ij}^{k}|$ is the absolute value of the Euclidean distance between the normalized feature vector of the ith correlative view and the normalized feature vector of the jth characteristic view in the kth texture class. Then, we calculate the average of all the similarities obtained in Eq. 7.10 for each texture class, as shown in Eq. 7.11:

$$\mathrm{Avg}\,S^{k} = \frac{1}{M \times V}\sum_{i,j} S_{ij}^{k} \quad (7.11)$$

where i, j, k, M, and V are defined as in Eq. 7.10. The pixel is then classified to the kth texture class with the maximum average similarity obtained in Eq. 7.11.
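To make the definitions above concrete, the following is a minimal NumPy sketch, written by us rather than taken from the text, of the min-max normalization of Eqs. 7.2-7.4 and of the six rotation-invariant features of Eqs. 7.5-7.9 (with the 0.1-wide histogram bins of Feature 6). The function names and the use of the sample standard deviation are our assumptions.

```python
import numpy as np

def normalize_views(feature_vectors):
    """Eq. 7.2: component-wise min-max normalization over all feature vectors."""
    fv = np.asarray(feature_vectors, dtype=float)        # shape (S, C)
    lo, hi = fv.min(axis=0), fv.max(axis=0)               # Eqs. 7.3 and 7.4
    span = np.where(hi > lo, hi - lo, 1.0)                 # avoid division by zero
    return np.clip((fv - lo) / span, 0.0, 1.0), lo, hi

def rotation_invariant_features(v, bins=10):
    """Features 1-6 of Sect. 7.1 for one normalized feature vector v."""
    mean = v.mean()                                        # Eq. 7.5
    std = v.std(ddof=1)                                    # Eq. 7.6 (sample std)
    p, _ = np.histogram(v, bins=bins, range=(0.0, 1.0))    # Feature 6 (0.1-wide bins)
    p = p / p.sum()
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))         # Eq. 7.7
    m3 = np.mean((v - mean) ** 3)
    m4 = np.mean((v - mean) ** 4)
    skew = m3 / std ** 3 if std > 0 else 0.0               # Eq. 7.8
    kurt = m4 / std ** 4 if std > 0 else 0.0               # Eq. 7.9
    return np.concatenate(([mean, std, entropy, skew, kurt], p))
```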
7.2 The K-views Algorithm Using Rotation-Invariant Features (K-views-R)
We present the new K-views algorithm for texture image classification using rotation-invariant features (K-views-R) in this section [5]. These features include the mean, standard deviation, entropy, skewness, kurtosis, and histogram as defined above. The proposed algorithm consists of a training process and a classification process. The training process pre-calculates all rotation-invariant features and stores them in a database, which is used later in the classification process.

The K-views-R Algorithm: Training
Step 1: Select a sample sub-image randomly for each texture class from the original image. In other words, N sample sub-images will be selected for N texture classes.
Step 2: Extract a set of views from every sample sub-image using a suitable view size defined by the user.
Step 3: Determine the value of K for each set of views, and randomly select K views of the view set for each sample sub-image as a set of characteristic views. Alternatively, we can select all the views in the view set as the set of characteristic views.
Step 4: Compute the rotation-invariant features (features 1 through 6) using Eqs. 7.5–7.9 for every feature vector corresponding to a view in the set of characteristic views for each texture class to obtain the normalized feature vectors.
Step 5: Store all the normalized feature vectors in a database.
Fig. 7.4 The training scheme for the K-views-R algorithm (flowchart: for each texture class, select a sample sub-image, extract a view set, randomly generate a set of characteristic views, extract the rotation-invariant features, and save them in a database)
The training scheme is shown as a flowchart in Fig. 7.4.

The K-views-R Algorithm: Classification
Step 1: Retrieve all the normalized feature vectors for all texture classes from the database.
Step 2: In the classification scheme, obtain all the correlative views for each pixel being classified and compute the normalized feature vector of each view in the correlative views (note that the view size should be the same as the one used in training).
Step 3: Use Eq. 7.10 to calculate the similarity between all the correlative views containing the pixel being classified and all the characteristic views of each texture class. Please note that we use the normalized feature vector for each view (from Step 2).
Step 4: Calculate the average similarity using Eq. 7.11 and select the maximum similarity, which classifies the pixel to the texture class it belongs to.
Step 5: Repeat Steps 2, 3, and 4 for each pixel in the original image being classified.

The classification scheme is shown as a flowchart in Fig. 7.5, and a small sketch of this per-pixel decision is given after the figure. Some experimental results using the K-views-R algorithm for image texture classification are presented in the next section.
Fig. 7.5 The classification scheme for the K-views-R algorithm (flowchart: for each pixel, calculate the rotation-invariant features of all the correlative views, compare them with the pre-calculated features of each texture class retrieved from the database, and output the classified result)
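The per-pixel decision of the classification process (Eqs. 7.10 and 7.11) can be sketched as follows; this is our own minimal illustration, and the function name, array shapes, and the small epsilon used to avoid division by zero are assumptions rather than part of the algorithm as published.

```python
import numpy as np

def classify_pixel(correlative_feats, class_feats, eps=1e-12):
    """
    correlative_feats: (M, F) features of the views containing the pixel.
    class_feats: list of N arrays, each (V, F), the characteristic-view
                 features of one texture class (retrieved from the database).
    Returns the index of the class with the highest average similarity
    (Eqs. 7.10 and 7.11).
    """
    avg_sim = []
    for feats_k in class_feats:
        # Euclidean distances between every (correlative, characteristic) pair.
        d = np.linalg.norm(correlative_feats[:, None, :] - feats_k[None, :, :], axis=2)
        s = 1.0 / (np.abs(d) + eps)           # Eq. 7.10
        avg_sim.append(s.mean())               # Eq. 7.11
    return int(np.argmax(avg_sim))
```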
7.3 Experiments on the K-views-R Algorithm
To evaluate the effectiveness of the K-views-R algorithm for image texture classification, experiments on the Brodatz image textures were carried out and compared to the K-views-T and K-views-D algorithms. Images with different textures were tested. All the test images are 130 × 130 pixels, the sample sub-image size is chosen as 40 × 40, and K is chosen as 100; that is, a set of characteristic views contains 100 views. Different view sizes (w = 3, 4, …, 30) were chosen for testing. The rotation-invariant features extracted for a view are the same as those defined in Sect. 7.1. Experimental results show that the K-views-R algorithm is more robust and accurate compared with the results of the K-views-T and K-views-D algorithms. Experimental results on some image textures [1] are shown in Figs. 7.6, 7.7, and 7.8, and the corresponding classification errors for the different K-views algorithms are shown in Table 7.1. Among the three textured images tested, the textures of the image in Fig. 7.7 are more similar in both structure and brightness than those of the other two textured images. This textured image is more challenging for the classification. Based on the experimental results, we can clearly see that the K-views-R algorithm performs better than the K-views-T and K-views-D methods.
Fig. 7.6 a An original image, b an ideal classified result, c classified result with the K-views-D algorithm, d classified result with the K-views-T algorithm, and e classified result with the K-views-R algorithm. The red lines are drawn on the top of classified results to show the actual boundary
Fig. 7.7 a An original image, b an ideal classified result, c classified result with the K-views-D algorithm, d classified result with the K-views-T algorithm, and e classified result with the K-views-R algorithm. The red lines are drawn on the top of classified results to show the actual boundary
It can also partition well the image texture in Fig. 7.8, which is difficult to discriminate. Figure 7.9 shows another textured image and the classified results. This textured image was used in the comparison given in [6]. The image is composed of five textures which are quite similar, and the textural boundaries are not linear. It is more complex than the images tested in Figs. 7.6, 7.7, and 7.8. The classified results illustrate that the K-views-R algorithm performs better than the other two algorithms.
7.4 The K-views-R Algorithm on Rotated Images
To demonstrate that the K-views-R algorithm is rotation-invariant, two rotated images were constructed for testing [3]. The experiment follows the steps listed below; Steps 1 and 3 are the same as the training and classification schemes of the K-views-R algorithm, respectively, while Step 2 rotates the image texture.
Fig. 7.8 a An original image, b an ideal classified result, c classified result with the K-views-D algorithm, d classified result with the K-views-T algorithm, and e classified result with the K-views-R algorithm. The red lines are drawn on the top of classified results to show the actual boundary
Table 7.1 Classified errors for different K-views algorithms

Images              K-views-D   K-views-T   K-views-R
Image in Fig. 7.6   0.011       0.141       0.009
Image in Fig. 7.7   0.039       0.089       0.018
Image in Fig. 7.8   0.065       0.233       0.029
Fig. 7.9 a An original image, b classified result with the K-views-D algorithm, c classified result with the K-views-T algorithm, and d classified result with the K-views-R algorithm
Fig. 7.10 a An original image, b a rotated image (to be classified) which is derived by rotating the original image clockwise 45 degrees, c classified result with the K-views-D algorithm, d classified result with the K-views-T algorithm, and e classified result with the K-views-R algorithm
Step 1: Select a sample sub-image randomly for each texture class from the original image and construct the normalized feature vectors (i.e., training).
Step 2: Rotate the original image to obtain a new image (a rotated image) to be classified and construct the normalized feature vectors for this rotated image.
Step 3: Use the feature vectors of the sample sub-images obtained in Step 1, which are not rotated, to classify the rotated images derived in Step 2 (i.e., classification).

The K-views-D, K-views-T, and K-views-R algorithms are used for the classification, and the classified results are compared. Figures 7.10 and 7.11 show two test results. The size of both original images is 130 × 130; the sample sub-image size is chosen as 40 × 40, and K is set to 100. The original images were rotated clockwise 45 degrees to obtain the rotated images: the size of the rotated image shown in Fig. 7.10b is 90 × 90, while the size of Fig. 7.11b is 96 × 96. The K-views-R algorithm achieves more satisfactory results. The statistical features used to represent a texture class in the K-views-R algorithm are rotation-invariant, whereas the characteristic views extracted to represent a texture class in the K-views-D and K-views-T algorithms are not.
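A rotated test image of the kind used in this experiment can be generated, for example, with scipy; the sketch below is ours, and the random stand-in texture, the crop size, and the interpolation order are arbitrary choices for illustration only.

```python
import numpy as np
from scipy.ndimage import rotate

# Rotate a textured image clockwise by 45 degrees to build a rotated test
# image.  A clockwise rotation corresponds to a negative angle for scipy's
# counter-clockwise convention.
image = np.random.rand(130, 130)             # stand-in for a Brodatz texture
rotated = rotate(image, angle=-45, reshape=True, order=1)

# Crop the central region so that only valid texture (no padded corners)
# remains, similar to the 90 x 90 rotated images used in Fig. 7.10.
h, w = rotated.shape
crop = 90
r0, c0 = (h - crop) // 2, (w - crop) // 2
rotated_test = rotated[r0:r0 + crop, c0:c0 + crop]
```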
Fig. 7.11 a An original image, b a rotated image (to be classified), which is derived by rotating the original image clockwise 45 degrees, c classified result with the K-views-D algorithm, d classified result with the K-views-T algorithm, and e classified result with the K-views-R algorithm
7.5 The K-views-R Algorithm Using a View Selection Method to Choose a Set of Characteristic Views
We describe a method for selecting characteristic views to improve the K-views-R algorithm [4]. To distinguish it from the K-views-R algorithm, we will call this algorithm the K-views-R with grayness algorithm. In the K-views-R algorithm, the set of characteristic views is selected randomly for each sample sub-image. This random selection method cannot effectively extract a representative set of characteristic views for a texture class. For example, a set of views chosen randomly may not be distributed evenly with respect to the grayness of a view, where the grayness of a view is defined as the mean of the gray levels of all pixels in that view. Some views may be concentrated in a zone in which all views have high grayness values, or in a zone of low grayness values. However, one texture class may have many types of characteristic views with different grayness values. Hence, this random method is not suitable for practical applications, and it may not select the most representative set of characteristic views.
The selection method is based on the distribution of the grayness of the views to select a set of characteristic views for classification. The method considers the interval between the minimum grayness and the maximum grayness of all the views in the original (primitive) set of views and then randomly selects the same number of views in each subinterval of the grayness. Similar to the K-views-R algorithm, we select a sample sub-image randomly for each texture class and extract a set of views from the sub-image to form a primitive view set (VS) which contains P views. Then, a set of K characteristic views will be chosen using the following selection method, where K is less than or equal to P. The view selection method is listed below.

The View Selection Method:
Step 1: Select a sample sub-image randomly for each texture class from the original image and then extract a set of views (VS).
Step 2: Compute the grayness of each view of the VS and find the minimum and maximum grayness over all views, represented by VSmin and VSmax, respectively.
Step 3: Calculate the zone interval (VSzone) between VSmin and VSmax: VSzone = VSmax − VSmin.
Step 4: Divide VSzone into m sub_zones, with Length of sub_zones = VSzone/m. The lower bound of the ith sub_zone can be calculated as VSmin + (i − 1) × Length of sub_zones, where i ∈ (1, m).
Step 5: Select a subset of the K characteristic views (i.e., K/m views) in each sub_zone from 1 to m to form the set of characteristic views.

The K-views-R with Grayness Algorithm: Training Process
Step 1: Select a sample sub-image randomly for each texture class from the original image.
Step 2: Extract a set of views from every sample sub-image using a suitable view size.
Step 3: Determine the value of K for each view set, and select a set of K views for each sample sub-image as a set of characteristic views using the view selection method defined above.
Step 4: Compute the rotation-invariant features for each view in the set of characteristic views for each sample sub-image and obtain the normalized feature vectors of these rotation-invariant features.
Step 5: Store all the normalized feature vectors in a database.

The K-views-R with Grayness Algorithm: Classification Process
Step 1: Retrieve all the normalized feature vectors for all texture classes from the database.
Step 2: In the process of classification, derive all the correlative views for each pixel and compute the normalized feature vector of each view of all the correlative views. Note that the view size is the same as the one used in the training process.
Step 3: Calculate the average similarity between all the correlative views containing the pixel being classified and all the characteristic views of each texture class. The calculation is based on their normalized feature vectors. Then, we classify this pixel to the texture class with the maximum similarity measure. For example, assume that there are four texture classes and nine correlative views containing a pixel being classified. We also assume that there are 20 characteristic views for each texture class from Step 2. The similarity of each correlative view and a characteristic view in a texture class can be calculated using the Euclidean distance and taking its inverse, as shown in Eq. 7.12:

$$S_{ij}^{k} = \frac{1}{|d_{ij}^{k}|} \quad (7.12)$$

where $i \in \{1, \ldots, 9\}$, $j \in \{1, \ldots, 20\}$, $k \in \{1, \ldots, 4\}$, and $|d_{ij}^{k}|$ is the absolute value of the Euclidean distance between the normalized feature vector of the ith correlative view and the normalized feature vector of the jth characteristic view in the kth texture class.
Fig. 7.12 a An original image, b an ideal classified result (ground truth), c classified result with the K-views-T algorithm, d classified result with the K-views-D algorithm, e classified result with the K-views-R algorithm, and f classified result with the K-views-R with grayness algorithm. The white lines are drawn on the top of classified results to show the actual boundary
Then, we calculate the average of all the similarities obtained in Eq. 7.12 for each texture class, as shown in Eq. 7.13:

$$\mathrm{Avg}\,S^{k} = \frac{1}{9 \times 20}\sum_{i,j} S_{ij}^{k} \quad (7.13)$$
where i, j, and k are defined as in Eq. 7.12. The pixel will be classified to the kth texture class based on the maximum similarity obtained in Eq. 7.13.
Step 4: Repeat Steps 2 and 3 for each pixel in the original image being classified.

To test the effectiveness of the view selection method, several images were used in the experiments. Figures 7.12 and 7.13 show some of the experimental results on textured images. The image size is 130 × 130 pixels. The sample sub-image size is chosen as 40 × 40 and K is set to 100 in our experiments. Since the number of characteristic views used is 100, the 100 views are divided into 20 intervals in the view selection method of the K-views-R with grayness algorithm. A rotated image is also constructed for testing.
Fig. 7.13 a An original image, b an ideal classified result (ground truth), c classified result with the K-views-T algorithm, d classified result with the K-views-D algorithm, e classified result with the K-views-R algorithm, and f classified result with the K-views-R with grayness algorithm. The white lines are drawn on the top of classified results to show the actual boundary
Fig. 7.14 a An original image, b a rotated image obtained by rotating the original image clockwise 45 degrees, c classified result with the K-views-T algorithm, d classified result with the K-views-D algorithm, e classified result with the K-views-R algorithm, and f classified result with the K-views-R with grayness algorithm. The white lines are drawn on the top of classified results to show the actual boundary
Figure 7.14 shows one of the experimental results on the rotated image. The image size is 130 × 130 pixels; the sample sub-image size is chosen as 40 × 40 with K equal to 100. The image is rotated clockwise 45 degrees to obtain a rotated image with a size of 90 × 90. In summary, the feature-based K-views-R algorithms give better classification results than the K-views-T and K-views-D algorithms. For some textured images, such as Fig. 7.13a, the view selection method has the advantage of achieving better classification results. A small sketch of the grayness-based view selection is given below.
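The grayness-based view selection of Sect. 7.5 can be sketched as follows; this is our own minimal illustration, and the handling of empty sub-zones and the rounding of K/m are assumptions not specified in the text.

```python
import numpy as np

def select_views_by_grayness(views, k, m_zones=20, rng=None):
    """Choose about k characteristic views spread over m_zones grayness
    sub-zones (the view selection method of Sect. 7.5).
    views has shape (P, h, w)."""
    rng = np.random.default_rng() if rng is None else rng
    grayness = views.reshape(len(views), -1).mean(axis=1)  # mean gray level per view
    vs_min, vs_max = grayness.min(), grayness.max()
    edges = np.linspace(vs_min, vs_max, m_zones + 1)        # sub-zone boundaries
    per_zone = max(1, k // m_zones)
    chosen = []
    for i in range(m_zones):
        lo, hi = edges[i], edges[i + 1]
        if i < m_zones - 1:
            in_zone = np.where((grayness >= lo) & (grayness < hi))[0]
        else:
            in_zone = np.where((grayness >= lo) & (grayness <= hi))[0]
        if len(in_zone) > 0:
            take = min(per_zone, len(in_zone))
            chosen.extend(rng.choice(in_zone, size=take, replace=False))
    idx = np.array(chosen[:k], dtype=int)
    return views[idx]
```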
7.6 Summary
This chapter presents two feature-based K-views algorithms, K-views-R and K-views-R with grayness, for image texture classification. These two algorithms use rotation-invariant features which are statistically derived from a set of characteristic views for each texture in the image. As can be seen from the experimental results, the feature-based algorithm is superior to both the K-views-T and K-views-D algorithms. In particular, it obtains better results in the boundary areas between different textures. Generally speaking, the feature-based algorithm has some significant and meaningful characteristics: the decision that a pixel belongs to a texture class is made by all the correlative "views" containing the pixel, based on the highest probability, and statistical features are used to represent a texture class. All the features used can be easily extracted from a view, and the computational complexity is low. Unlike the K-views-T algorithm, the feature-based algorithm does not need to obtain a set of characteristic views through several iterations of the K-views clustering; it directly selects K views at random from the view set of each sample sub-image as the characteristic view set in one pass. Experimental results show that both the K-views-R and the K-views-R with grayness algorithms are stable. Several improvements can be made to the K-views algorithms. For example, we can extract image features with affine invariants, which correspond to the scaled orthographic projection of planar objects, and projective invariants, which correspond to the perspective projection of planar objects, and develop a method for the automatic determination of the best view size to achieve the optimum classification result.
7.7 Exercises
For a numerical image shown below, assume that there are four different textures in the image; each texture occupies one quadrant.
0 1 2 3 200 210 190 205
1 2 3 0 200 190 205 209
2 3 0 1 200 200 210 200
3 0 1 2 200 200 200 210
100 100 100 100 4 6 4 7
90 90 90 90 5 5 5 6
100 100 100 100 6 5 6 4
90 90 90 90 3 4 7 5
1. Develop a set of characteristic K-views using the K-views-T algorithm for each sub-image texture for all four textured classes.
2. Use the K-views-R algorithm to classify the image with the set of characteristic views obtained in Exercise #1.
3. Use the K-views-R with grayness algorithm to classify the image with the set of characteristic views obtained in Exercise #1.
4. Write a program to implement the K-views-R algorithm and test it on some rotated image textures.
5. Write a program to implement the K-views-R with grayness algorithm and test it on some rotated image textures.
References
1. Brodatz P (1966) Textures: a photographic album for artists and designers. Dover Publications, New York
2. Haralick RM, Shapiro LG (1993) Computer and robot vision (Volumes I and II). Addison-Wesley, Reading
3. Hung C-C, Shin S, Jong J-Y (1996) Use of the sigma probability in Tomita's filter. In: Proceedings of the IEEE Southeastcon'96, Tampa, FL, USA, 11–14 April 1996
4. Lan Y, Liu H, Song E, Hung C-C (2010) An improved K-view algorithm for image texture classification using new characteristic views selection methods. In: Proceedings of the 25th Association for Computing Machinery (ACM) symposium on applied computing (SAC 2010) – computational intelligence and image analysis (CIIA) track, Sierre, Switzerland, 21–26 March 2010. https://doi.org/10.1145/1774088.1774288
5. Liu H, Dai S, Song E, Yang C, Hung C-C (2009) A new k-view algorithm for texture image classification using rotation-invariant feature. In: Proceedings of the 24th Association for Computing Machinery (ACM) symposium on applied computing (SAC 2009) – computational intelligence and image analysis (CIIA) track, Honolulu, Hawaii, 8–12 March 2009. https://doi.org/10.1145/1529282.1529481
6. Nagao M, Matsuyama T (1979) Edge preserving smoothing. Comput Graph Image Process 9:394–407
7. Palm C (2004) Color texture classification by integrative co-occurrence matrices. Pattern Recogn 37:965–976
8. Tomita F, Tsuji S (1977) Extraction of multiple regions by smoothing in selected neighborhoods. IEEE Trans Syst Man Cybern SMC-7:107–109
9. Song EM, Jin R, Hung C-C, Lu Y, Xu X (2007) Boundary refined texture segmentation based on K-views and datagram method. In: Proceedings of the 2007 IEEE international symposium on computational intelligence in image and signal processing (CIISP 2007), Honolulu, HI, USA, 1–6 April 2007
8 Advanced K-views Algorithms
Kindness in words creates confidence. Kindness in thinking creates profoundness. Kindness in giving creates love. —Lao Tzu
This chapter introduces the weighted K-views voting algorithm (K-views-V) and its fast version, called the fast K-views-V algorithm. These methods are developed to improve the K-views template (K-views-T) and K-views datagram (K-views-D) algorithms for image texture classification. The fast K-views-V algorithm uses a voting method for texture classification and an accelerating method based on the efficient summed square image (SSI) scheme as well as the fast Fourier transform (FFT) to enable faster overall processing, while the K-views-V only uses the voting method. In the K-views-V algorithm, the classification of a pixel to a texture class is based on weighted voting among the "promising" members in the neighborhood of the pixel being classified. In other words, this neighborhood consists of all the views that have this pixel in their territory. Experimental results on some textural images show that this K-views-V algorithm gives higher classification accuracy than the K-views-T and K-views-D algorithms, and improves the classification accuracy of pixels near the boundary between textures. In addition, the acceleration method improves the processing speed of the K-views-V algorithm. Compared with the results from earlier K-views algorithms and those of the gray-level co-occurrence matrix (GLCM), the K-views-V algorithm is more robust, fast, and accurate. A comparison of the classified results under different choices of the view size, sub-image size, and number of characteristic views is also provided in this chapter.
8.1 The Weighted K-views Voting Algorithm (Weighted K-views-V)
The K-views-V [8] algorithm is an efficient approach to improve the K-views-T and K-views-D algorithms [4, 11] for image texture classification. The K-views-V algorithm applies a voting method for texture classification and an accelerating method based on the efficient summed square image (SSI) [10] scheme and fast Fourier transform (FFT) for fast processing. The voting method has been proven to play an important role in group decisions [3, 6]. Majority voting is the most natural voting strategy: each voter casts a full vote for a candidate it supports, and the candidate with the most votes is the winner. In the application of image texture classification, one voter determines which texture class a pixel belongs to. We may consider the texture class having the majority of views in the promising neighborhood to be the fittest one for the pixel, as shown in Fig. 8.1. Assume that a pixel is located inside the domain of many correlative views, such as the views shown in Fig. 8.1a. Therefore, these views should be given an opportunity to participate in determining which texture class this pixel belongs to. If the view size is m by n, for each pixel there will be m × n possible correlative views containing the corresponding pixel. However, the simple majority voting strategy neglects the weighting factor of each voter. For example, V1 and V2 in Fig. 8.1b are two correlative views of the white pixel being classified: V1 should be given more weight than V2 in the voting, since the white pixel is from texture class T1 and V1 is more similar to T1 than to T2. Similar to the rotation-invariant feature extraction method discussed in Chap. 7, $S_{ij}$ is the similarity measure between the ith correlative view and the jth texture class. The similarity between the corresponding feature vector of the ith correlative view and the corresponding feature vectors of the jth texture class can be defined as in Eq. 8.1:

$$S_{ij} = \frac{1}{d_{ij}} \quad (8.1)$$
where $d_{ij}$ is the Euclidean distance between the feature vector of the ith view and the feature vectors of the jth texture class. The more similar a view is to a certain texture class, the more powerful its vote for that class should be. Hence, the weighting factor $W_{ij}$ of the ith view for the jth texture class is defined as in Eq. 8.2:

$$W_{ij} = \frac{S_{ij}}{\sum_{j=1}^{N} S_{ij}} \quad (8.2)$$
where N is the number of texture classes. Therefore, the best-matched texture class (denoted by $T_p$) is the one with the maximum sum of weights over all the correlative views, which is calculated as in Eq. 8.3:

$$T_p = \max\left(\sum_{i=1}^{V} W_{i,1},\; \sum_{i=1}^{V} W_{i,2},\; \ldots,\; \sum_{i=1}^{V} W_{i,k}\right) \quad (8.3)$$
Fig. 8.1 a The correlative views of a pixel being classified: the dot in a small square represents the pixel being considered, and b a voting example with two voters
where V is the number of correlative views in the neighborhood of the pixel being classified and k is the number of texture classes. In this approach, an image texture is classified by using the weighted votes taken from all the correlative views for each pixel, based on the features of the characteristic views. We call this approach the weighted K-views voting algorithm (weighted K-views-V). The procedure of this algorithm is listed below:
The Weighted K-views Voting Algorithm (Weighted K-views-V) for the Classification:
Step 1: Obtain all the correlative views in the neighborhood of each pixel being classified (assume that the rotation-invariant features of each correlative view are obtained as described in Chap. 7).
Step 2: Calculate the weighting factors using Eq. 8.2.
Step 3: Determine the maximum of the weighted votes using Eq. 8.3.
Step 4: Classify the pixel to the texture class with the maximum of the weighted votes from Step 3.
Step 5: Repeat Steps 1–4 for each pixel being classified.

Please note that the view size, the number of views, the number of characteristic views, and the size of a sample sub-image have been discussed in the previous chapters. They are selected similarly to those used in the basic K-views-T algorithm; a small sketch of the weighted vote for one pixel is given below. One of the experimental results using the K-views-V algorithm is shown in Fig. 8.2 and compared with other K-views algorithms. Figures 8.3 and 8.4 give two more classified results using the weighted K-views-V algorithm [8].
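The weighted vote for a single pixel (Eqs. 8.1–8.3) can be sketched as follows. This is our own illustration; in particular, the text does not state how $d_{ij}$ aggregates the distances to all characteristic views of a class, so taking the distance to the nearest one is an assumption.

```python
import numpy as np

def weighted_vote(correlative_feats, class_feats, eps=1e-12):
    """
    Weighted K-views voting for one pixel (Eqs. 8.1-8.3).
    correlative_feats: (V, F) features of the correlative views.
    class_feats: list of N arrays, each (K, F), one per texture class.
    """
    n_views = len(correlative_feats)
    n_classes = len(class_feats)
    S = np.zeros((n_views, n_classes))
    for j, feats_j in enumerate(class_feats):
        # distance of each correlative view to its nearest characteristic view
        d = np.linalg.norm(correlative_feats[:, None, :] - feats_j[None, :, :], axis=2)
        S[:, j] = 1.0 / (d.min(axis=1) + eps)        # Eq. 8.1
    W = S / S.sum(axis=1, keepdims=True)              # Eq. 8.2
    return int(np.argmax(W.sum(axis=0)))              # Eq. 8.3
```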
Fig. 8.2 a(1) An original image, a(2) classified result with K-views-T, a(3) classified result with K-views-D, a(4) classified result with GLCM, and a(5) classified result with the weighted K-views-V. Classified results of images in b − d are similarly interpreted as in a. White lines are drawn on the top of classified results to show the actual boundary [8]
Fig. 8.3 a A leopard image and b classified result using the weighted K-views-V algorithm
Fig. 8.4 a A medical image of liver organ and b classified result using the weighted K-views-V algorithm
In the weighted K-views-V algorithm, the Euclidean distance (i.e., the similarity measure) is calculated between each view in each texture class and all the correlative views of each pixel being classified. The calculation of the Euclidean distances is the most time-consuming part of the algorithm. In order to reduce the computation time, the SSI and FFT methods are employed in the fast weighted K-views-V algorithm [9, 10]. This approach transforms the distance calculation into simple convolution and summation operations based on the SSI and FFT methods, which are described in the next section.
8.2 Summed Square Image (SSI) and Fast Fourier Transform (FFT) Method in the Fast Weighted K-views Voting Algorithm (Fast Weighted K-views-V)
We assume that N sample sub-images are available for N texture classes. We extract a set of primitive views (S) from each sample sub-image and then derive a set of K-views of the characteristic views, denoted by Vcs, from each primitive view set using the K-means algorithm [4, 11]. Hence, the total number of calculations for obtaining the distance between a pair of characteristic views (i.e., feature vectors) is T = Z × N × K, where Z is the number of pixels being classified. We can see that if an image is large, the number of views extracted from the image is overwhelming compared with the number of K-views already defined for a prototype texture. In other words, if we take a view, say V1, from an image being classified, and a view, say V2, from a set of K-views, then the number of V1 (i.e., Vector1 in Eqs. 8.6, 8.7, and 8.9) is much larger than that of V2 (i.e., Vector2 in Eqs. 8.6, 8.8, and 8.9). To reduce this tremendous amount of calculation, we introduce the summed square image (SSI) and the fast Fourier transform method to expedite the computation in the following. Without loss of generality, we take two views (i.e., V1 and V2) with the same view size of m × m; the Euclidean distance between the two views, d, is calculated with Eq. 8.4:

$$d = \sqrt{\|V_1 - V_2\|^2} \quad (8.4)$$
The calculation will be repeated T times to select the minimum distance between two views. This is a very time-consuming process. Equation 8.4 can be further rearranged as Eq. 8.5:

$$d = \sqrt{\|V_1 - V_2\|^2} = \sqrt{V_1^2 + V_2^2 - V_1 V_2} \quad (8.5)$$

where the cross term $V_1 V_2$ absorbs the factor of two and is defined in Eq. 8.9.
We can calculate the first term (the square of $V_1$) in Eq. 8.5 by using the summed square image (SSI) technique and the third term ($V_1 V_2$) by using the fast Fourier transform (FFT). The second term (the square of $V_2$) will be calculated directly. Please note that the number of $V_1$ is much larger than that of $V_2$, as discussed above. If we perform the calculation using the feature vector corresponding to each view, the expansion is shown in Eqs. 8.6−8.9:

$$d = \sqrt{\|\mathrm{Vector}_1 - \mathrm{Vector}_2\|^2} = \sqrt{\sum_{l=0}^{m-1}\sum_{n=0}^{m-1}\left[V_1(l,n) - V_2(l,n)\right]^2} = \sqrt{\mathrm{Vector}_1^2 + \mathrm{Vector}_2^2 - \mathrm{Vector}_1\mathrm{Vector}_2} \quad (8.6)$$

where

$$\mathrm{Vector}_1^2 = \sum_{l=0}^{m-1}\sum_{n=0}^{m-1}\left[V_1(l,n)\right]^2, \quad (8.7)$$

$$\mathrm{Vector}_2^2 = \sum_{l=0}^{m-1}\sum_{n=0}^{m-1}\left[V_2(l,n)\right]^2, \ \text{and} \quad (8.8)$$

$$\mathrm{Vector}_1\mathrm{Vector}_2 = 2\sum_{l=0}^{m-1}\sum_{n=0}^{m-1}V_1(l,n)\,V_2(l,n) \quad (8.9)$$
The multiplication in Eq. 8.9 is an inner product operation. Liu et al. proposed an SSI method, based on the integral image concept, for an image de-noising algorithm [9, 10]. If a patch has a rectangular shape (including square), its features can be computed very rapidly using an intermediate representation of the image called an integral image [9]. Since a view has a rectangular shape, the SSI can be used in our K-views computation. The SSI extends the concept of the integral image: the pixel value at location $(x_0, y_0)$ contains the sum of the squared values of all pixels in the original image above and to the left of $(x_0, y_0)$, inclusive. The SSI is calculated as in Eq. 8.10:

$$SSI(x_0, y_0) = \sum_{x \le x_0,\, y \le y_0} I(x, y)^2, \quad x, y \in (l, m) \quad (8.10)$$
where I(x, y) is the pixel value in the image, l is an index for location x, and m is an index for location y. For example, if we need to calculate the sum of squares in region D, as shown in Fig. 8.5, it can be obtained as follows (Eq. 8.11):

$$S_D = S_{A \cup B \cup C \cup D} + S_A - S_{A \cup C} - S_{A \cup B} \quad (8.11)$$

where $\cup$ denotes the union. Based on the SSI shown in Fig. 8.5, we can see that

$$S_{A \cup B \cup C \cup D} = SSI(x_2, y_2),\; S_A = SSI(x_1, y_1),\; S_{A \cup C} = SSI(x_2, y_1),\; S_{A \cup B} = SSI(x_1, y_2) \quad (8.12)$$

Fig. 8.5 A summed square image (SSI) illustration
Fig. 8.6 An original view V2 and its flipped view V3
Therefore, we obtain the sum of squares for region D from the SSI with Eq. 8.13:

$$S_D = SSI(x_2, y_2) + SSI(x_1, y_1) - SSI(x_2, y_1) - SSI(x_1, y_2) \quad (8.13)$$
Each pixel in the SSI can be calculated in only one pass over the entire image. The computational complexity of computing the SSI is O(P²), where P² is the image size; that is, the SSI can be obtained in a time linear in the image size. The third term, V1V2, can be calculated quickly using the FFT [2]. Assuming that the view size is m × m and m is an odd number, if we flip the characteristic view V2 horizontally and then vertically, we obtain a flipped view labeled V3, as shown in Fig. 8.6. We then compute the two-dimensional convolution of views V1 and V3 and derive a (2m − 1) × (2m − 1) matrix denoted by MAT. According to the convolution theorem, MAT can be calculated using Eq. 8.14:

$$V_1 V_2 = 2\,\mathrm{MAT}(m, m), \quad \text{hence} \quad \mathrm{MAT}(m, m) = V_1 V_2 / 2 \quad (8.14)$$
We need to compare each of the correlative views (the views containing the pixel being classified) with each characteristic view of each texture class. Therefore, we can calculate the two-dimensional convolution of V3 and the padded correlative view called PadI (PadI is the image padded with mirrored reflections before the first element and after the last element along each dimension). The convolution is formulated as in Eq. 8.15:

$$V_3(x, y) * PadI(x, y) \Leftrightarrow \mathrm{IFFT}\big(\mathrm{FFT}(V_3(x, y)) \cdot \mathrm{FFT}(PadI(x, y))\big) \quad (8.15)$$

where "*" is the convolution operation, and IFFT(FFT(V3(x, y)) FFT(PadI(x, y))) means that we calculate the FFT of V3 and of PadI individually, multiply them together, and then take the inverse FFT (IFFT).
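The following is a minimal NumPy sketch, ours rather than the book's implementation, of how the SSI (Eqs. 8.10–8.13) and an FFT-based convolution (Eq. 8.15) can be combined to obtain the squared distances of every m × m window of a square P × P image to one characteristic view. Zero padding is used instead of the mirrored padding of PadI for simplicity.

```python
import numpy as np

def summed_square_image(image):
    """SSI of Eq. 8.10: cumulative sum of squared pixel values."""
    return np.cumsum(np.cumsum(image.astype(float) ** 2, axis=0), axis=1)

def window_sum_of_squares(ssi, top, left, m):
    """Sum of squares over an m x m window via Eq. 8.13 (zero-padded SSI)."""
    padded = np.pad(ssi, ((1, 0), (1, 0)))
    r0, c0, r1, c1 = top, left, top + m, left + m
    return padded[r1, c1] + padded[r0, c0] - padded[r1, c0] - padded[r0, c1]

def squared_distances_to_view(image, view):
    """||V1 - V2||^2 for every m x m window V1 of a square image against one
    characteristic view V2: SSI for the V1^2 term, FFT for the cross term."""
    m = view.shape[0]
    P = image.shape[0]
    ssi = summed_square_image(image)
    v1_sq = np.array([[window_sum_of_squares(ssi, r, c, m)
                       for c in range(P - m + 1)] for r in range(P - m + 1)])
    v2_sq = np.sum(view.astype(float) ** 2)
    # Cross-correlation via FFT: convolve with the flipped view (V3 of Fig. 8.6).
    flipped = view[::-1, ::-1].astype(float)
    size = P + m - 1
    cross_full = np.real(np.fft.ifft2(np.fft.fft2(image, (size, size)) *
                                      np.fft.fft2(flipped, (size, size))))
    cross = cross_full[m - 1:P, m - 1:P]     # sum(V1 * V2) for each window
    return v1_sq + v2_sq - 2.0 * cross
```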
In the calculation of the Euclidean distance between two views, a direct calculation takes m² × P × P × K × N subtraction operations, (m² − 1) × P × P × K × N addition operations, and m² × P × P × K × N multiplication operations (Eq. 8.4). However, by using the SSI and FFT methods, we can transform the calculation of the Euclidean distance into summations of squares and convolutions. Computing the SSI of the image requires P × P multiplications and 2 × P × P addition operations; we can then calculate the sum for each view in the SSI image. This process needs approximately P × P additions and 2 × P × P subtractions to complete the sum of squares of Vector1 for each characteristic view. It takes m² × K × N multiplications and (m² − 1) × K × N addition operations for Vector2. To calculate Vector1Vector2, the first step is to transform it into convolution operations and then calculate the FFT of V3 and the FFT of PadI. Views V3 and PadI are extended to a size of (P + m − 1)². Thus, we need $\frac{(P+m-1)^2}{2}\log(P+m-1)$ multiplications and $(P+m-1)^2\log(P+m-1)$ additions to calculate the FFTs of the extended V3 and PadI. The IFFT can be calculated similarly, and the multiplication of FFT(V3) and FFT(PadI) takes (P + m − 1) × (P + m − 1) multiplications. Therefore, the cross term requires the numbers of multiplications and additions shown in Eqs. 8.16 and 8.17:

$$\left[\frac{3(P+m-1)^2}{2}\log(P+m-1) + (P+m-1)^2\right] \times K \times N \quad (8.16)$$

$$\left[3(P+m-1)^2\log(P+m-1)\right] \times K \times N \quad (8.17)$$
# 3 ð P þ m 1Þ 2 2 logðP þ m 1Þ þ ðp þ m 1Þ K N P þm K N þ 2 2
2
ð8:18Þ h i 2P2 þ m2 1 K N þ 3ðp þ m 1Þ2 logðp þ m 1Þ K N ð8:19Þ Table 8.1 gives a comparison on the number of multiplications, subtractions, and additions between the direct calculation, SSI and FFT computation giving that P = 150, m = 7, K = 30, and N = 4.
192
8 Advanced K-views Algorithms
Table 8.1 A comparison between the direct and the SSI and FFT calculations

Calculation methods        Number of multiplications   Number of subtractions   Number of additions
(1) Direct                 132,300,000                 132,300,000              129,600,000
(2) SSI and FFT            34,862,249                  45,000                   63,900,358
Ratio of (1) versus (2)    3.795                       2940                     2.028
8.3 A Comparison of K-views-T, K-views-D, K-views-R, Weighted K-views-V, and K-views-G Algorithms
We developed the K-views model to characterize the gray-level primitive properties of an image texture as well as the relationships among them. In this section, we compare all the K-views related algorithms, namely the K-views template algorithm (K-views-T), the K-views datagram algorithm (K-views-D), the K-views rotation-invariant feature algorithm (K-views-R) [7], and the weighted K-views voting algorithm (weighted K-views-V). In addition, we establish a new K-views algorithm using the gray-level co-occurrence matrix (GLCM), abbreviated as the K-views-G algorithm, which is also compared with the other K-views algorithms to demonstrate its effectiveness in image texture classification. The K-views-G algorithm used in our experiments is briefly described in the following steps.

The K-views-G Algorithm (please note that Steps 1–3 are the same as those in the K-views-T algorithm in Chap. 5):
Step 1: Select a sample sub-image randomly in the area of each texture class from the original image. In other words, N sample sub-images will be selected for N texture classes. The size of each sub-image can be different.
Step 2: Extract a view set from each sample sub-image.
Step 3: Determine the value of K for each view set, and derive the K-views of a set of characteristic views from each sample sub-image using the K-means algorithm or the fuzzy C-means algorithm. The number of views, K, may vary for each texture class (i.e., each sample sub-image).
Step 4: For each view, V, of the image being classified: (a) compute the GLCM feature vectors (composed of contrast, correlation, energy, homogeneity, and mean) of the K-views of each sub-image; (b) compute the GLCM feature vector of view V. If the best-matched characteristic view belongs to characteristic view set M, classify all pixels in the view, V, from the original image to class M.
(If the view is regarded as a neighborhood of one pixel, classify that pixel only to class M.) The Euclidean distance is used for the similarity match in the comparison.
Step 5: Repeat Step 4 for each pixel in the original image being classified. (A small sketch of the GLCM feature computation in Step 4 is given below.)
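A minimal sketch of the GLCM feature vector used in Step 4 is given below for a single direction; it is our own illustration, and the per-view quantization to 16 gray levels and the function name are assumptions.

```python
import numpy as np

def glcm_features(view, levels=16, offset=(0, 1)):
    """Contrast, correlation, energy, homogeneity, and mean from a normalized
    gray-level co-occurrence matrix of one view (one offset direction)."""
    # Per-view quantization to `levels` gray levels (a simplification).
    q = ((view.astype(float) / view.max()) * (levels - 1)).astype(int) \
        if view.max() > 0 else np.zeros_like(view, dtype=int)
    dr, dc = offset
    glcm = np.zeros((levels, levels), dtype=float)
    rows, cols = q.shape
    for r in range(rows - dr):
        for c in range(cols - dc):
            glcm[q[r, c], q[r + dr, c + dc]] += 1.0
    glcm /= glcm.sum()
    i, j = np.indices((levels, levels))
    mu_i, mu_j = (i * glcm).sum(), (j * glcm).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * glcm).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * glcm).sum())
    contrast = ((i - j) ** 2 * glcm).sum()
    correlation = (((i - mu_i) * (j - mu_j) * glcm).sum() / (sd_i * sd_j)
                   if sd_i > 0 and sd_j > 0 else 0.0)
    energy = (glcm ** 2).sum()
    homogeneity = (glcm / (1.0 + np.abs(i - j))).sum()
    mean = view.mean()
    return np.array([contrast, correlation, energy, homogeneity, mean])
```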
All the K-views-based algorithms are tested on a set of representative texture images, including coarse, irregular, and regular textures, which are randomly taken from the Brodatz gallery [1]. The size of these images is 150 × 150 pixels (the first original textured image is an exception; its size is 130 × 130 pixels [11]). In our experiments [5], all K-views-based algorithms were implemented with the same number of characteristic views (i.e., K) and the same view size. We chose K = 30, which means that there are 30 characteristic views for each texture class. The view size was set to 7 × 7. The features used in the GLCM include contrast, correlation, energy, homogeneity, and mean. Other parameters were set as follows: distance d = 1, a = {0°, 45°, 90°, 135°}, and gray levels = 16. Although the GLCM model was calculated for the four directions 0°, 45°, 90°, and 135°, only the best-matched direction was selected as the final result in our experiments. By comparing the experimental results in Fig. 8.7, we can see that the weighted K-views-V performs better than K-views-T, K-views-D, K-views-R, and K-views-G. Overall, it achieves the best classification accuracy. From Fig. 8.7, we can also verify that the weighted K-views-V is more robust than the other four algorithms; the reason is that the weighted K-views-V makes its decision by weighted voting among all the characteristic views involved. Regarding the computation time, the weighted K-views-V is much faster because it uses the SSI and FFT methods.
Fig. 8.7 Classification accuracy of image textures with different K-views algorithms [5] (line chart; x-axis: image sequence number, where the 11th entry is the average; y-axis: classification accuracy; curves: K-views-T, K-views-D, K-views-G, K-views-V, and K-views-R)
The K-views-D algorithm takes much more computation time, from 10 to 100 times that of the other K-views-based algorithms, because it needs to calculate the datagram (DN) of all the characteristic views. The K-views-G algorithm needs to calculate the GLCM features of each view in the original image and of the characteristic views, so it is also much slower than the other algorithms. Therefore, only five GLCM features were used instead of all GLCM features; if all the GLCM features were used, the accuracy would increase.
8.4 Impact of Different Parameters on Classification Accuracy and Computation Time
In the K-views-based algorithms, there are three parameters which need to be determined a priori: the view size (VS), the sub-image size (SIS), and the number of characteristic views (K). In this section, we discuss the influence of a variety of settings of these three parameters on the classification accuracy of the different K-views algorithms. The notations used for the parameters are listed in Table 8.2. The image texture shown in Fig. 8.8 is used in our testing, and the experimental results are shown in Table 8.3.
Fig. 8.8 An original image

Table 8.2 Notations used to represent the parameters

Notations   Description
VS          View size
SIS         Sub-image size
K           Number of characteristic views
CA          Classification accuracy
CT          Computation time
Table 8.3 Classified results of different K-views algorithms with a variety of parameters. Computation time (CT) is measured in seconds. Notations used are listed in Table 8.2

Algorithm   VS   SIS   K    CA      CT (seconds)
K-views-T   5    40    30   0.893   33.375
K-views-T   7    40    30   0.920   35.474
K-views-T   9    40    30   0.928   41.826
K-views-T   15   40    30   0.919   45.240
K-views-T   7    20    30   0.876   32.450
K-views-T   7    30    30   0.918   34.172
K-views-T   7    44    30   0.917   42.219
K-views-T   7    40    20   0.912   24.906
K-views-T   7    40    40   0.923   50.141
K-views-T   7    40    50   0.925   59.422
K-views-D   5    40    30   0.960   871.566
K-views-D   7    40    30   0.835   1178.362
K-views-D   9    40    30   0.888   1642.305
K-views-D   15   40    30   0.698   1983.990
K-views-D   7    20    30   0.864   1171.238
K-views-D   7    30    30   0.887   1172.103
K-views-D   7    44    30   0.828   1164.211
K-views-D   7    40    20   0.788   801.278
K-views-D   7    40    40   0.806   1567.230
K-views-D   7    40    50   0.842   1955.213
K-views-R   5    40    30   0.906   11.132
K-views-R   7    40    30   0.948   17.954
K-views-R   9    40    30   0.957   23.170
K-views-R   15   40    30   0.953   53.023
K-views-R   7    20    30   0.923   18.248
K-views-R   7    30    30   0.943   18.005
K-views-R   7    44    30   0.954   18.009
K-views-R   7    40    20   0.935   16.436
K-views-R   7    40    40   0.949   19.142
K-views-R   7    40    50   0.951   26.093
K-views-V   5    40    30   0.970   11.563
K-views-V   7    40    30   0.978   13.547
K-views-V   9    40    30   0.977   16.188
K-views-V   15   40    30   0.961   16.344
K-views-V   7    20    30   0.959   11.188
K-views-V   7    30    30   0.977   12.476
K-views-V   7    44    30   0.975   14.047
K-views-V   7    40    20   0.975   10.000
K-views-V   7    40    40   0.978   16.375
K-views-V   7    40    50   0.978   22.562
Table 8.4 Pros and cons of different K-views algorithms

K-views-T
  Pros: Easy for implementation
  Cons: (1) The behavior of K-views-T is influenced by the three parameters shown in Table 8.3, and the characteristic views extracted are not rotation-invariant. (2) Supervised mode.

K-views-D
  Pros: Easy for implementation
  Cons: (1) Same as the K-views-T. (2) The computation is heavy. (3) Supervised mode.

K-views-R
  Pros: Fast, high efficiency, accurate, and rotation-invariant
  Cons: (1) The characteristic views are selected randomly for each sample sub-image; this method cannot effectively extract a representative set of characteristic views for a texture class. (2) Supervised mode.

K-views-V
  Pros: Fast, high efficiency, accurate
  Cons: Same as K-views-T.
of characteristic views. However, this does not mean that larger values of these parameters always yield more accurate results. For example, in the case of K-views-T, when VS is set to 15, the classification accuracy is not the highest. Therefore, in order to achieve higher accuracy with less computation time, an intelligent method that can determine a reasonable view size and number of characteristic views should be explored. The pros and cons of the different K-views-based algorithms are summarized in Table 8.4.
8.5 Summary
We have given a comparison of the five K-views-based algorithms, namely K-views-T, K-views-D, K-views-R, K-views-V, and K-views-G. All of these algorithms can achieve reasonable classification accuracy in texture image classification. In particular, the K-views-V and K-views-R algorithms perform better than K-views-T and K-views-D. In addition, we also introduced a new K-views algorithm based on GLCM feature extraction (K-views-G). Although feature extraction is used in both K-views-R and K-views-G, the K-views-R performs better than the K-views-G. In the K-views-D, the decision is made by a group of views composed of all the views contained in a big patch centered on the current pixel (the candidate for classification), and its simple majority voting strategy neglects the weighting factor of each view; this makes the algorithm less efficient than the K-views-V. Therefore, the K-views-V, which utilizes a group decision made through weighted voting among a set of correlative views, is efficient and
accurate in the classification. Each view in the neighborhood of the pixel being classified casts a vote weighted by the corresponding value in the voting weight matrix of each texture class. In terms of time complexity, the K-views-D is very slow as it needs to calculate the datagram (DN) for all the characteristic views exploited. To reduce the computation time, the SSI and FFT are employed in the K-views-V algorithm, which transforms the calculation of the Euclidean distance into simple convolution and summation operations. Therefore, it requires less computation time than the other K-views-based algorithms. All of these K-views-based algorithms use supervised learning. With the emergence of deep machine learning, some of these concepts may be integrated with deep learning models to improve the algorithms.
8.6 Exercises
Implement the following algorithms using a high-level computer language (a sketch for Exercise 1 is given after this list).
1. Develop a set of characteristic K-views using the K-views-T algorithm for each sub-image texture of all textured classes and test on a textured image.
2. Use the K-views-R algorithm to classify a textured image with the set of characteristic views obtained in Exercise 1.
3. Implement the K-views-R with grayness algorithm to classify a textured image with the set of characteristic views obtained in Exercise 1.
4. Use the K-views-R algorithm and test on some rotated image textures.
5. Perform the K-views-R with grayness algorithm and test on some rotated image textures.
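As a starting point for Exercise 1, the sketch below shows one possible way to derive K characteristic views for a single texture class by clustering all overlapping views of a sample sub-image with plain K-means. The function names, sizes, and the use of K-means here are our own assumptions for illustration rather than the exact K-views-T procedure.

```python
import numpy as np

def extract_views(sub_image, view_size):
    """Collect all overlapping view_size x view_size views of a sample sub-image, flattened to vectors."""
    h, w = sub_image.shape
    views = [sub_image[r:r + view_size, c:c + view_size].ravel()
             for r in range(h - view_size + 1)
             for c in range(w - view_size + 1)]
    return np.array(views, dtype=float)

def characteristic_views(views, k, iters=20, seed=0):
    """A plain K-means over the views; the k centroids play the role of characteristic views."""
    rng = np.random.default_rng(seed)
    centers = views[rng.choice(len(views), k, replace=False)]
    for _ in range(iters):
        # assign each view to its nearest centroid (Euclidean distance)
        labels = np.argmin(((views[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = views[labels == j].mean(axis=0)
    return centers

sub_image = np.random.randint(0, 256, (30, 30))    # stand-in for one sampled sub-image of a class
char_views = characteristic_views(extract_views(sub_image, view_size=7), k=30)
print(char_views.shape)                            # (30, 49): 30 characteristic views of size 7 x 7
```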
References
1. Brodatz P (1966) Textures: a photographic album for artists and designers. Dover Publications, New York
2. Castleman KR (1996) Digital image processing. Prentice Hall, Upper Saddle River
3. Coughlin PJ (1992) Probabilistic voting theory. Cambridge University Press, Cambridge
4. Hung C-C, Yang S, Laymon C (2002) Use of characteristic views in image classification. In: Proceedings of the 16th international conference on pattern recognition, pp 949-952
5. Lan Y, Liu H, Song E, Hung C-C (2011) A comparative study and analysis on k-view based algorithms for image texture classification. In: Proceedings of the 2011 ACM symposium on applied computing, Taichung, Taiwan, 21-25 Mar 2011. https://doi.org/10.1145/1982185.1982372
6. Li R (1999) Fuzzy method in group decision making. Comput Math Appl 38(1):91-101
7. Liu H, Dai S, Song E, Yang C, Hung C-C (2009) A new k-view algorithm for texture image classification using rotation-invariant feature. In: Proceedings of the 24th ACM symposium on applied computing (SAC 2009), pp 914-921. https://doi.org/10.1145/1529282.1529481
8. Liu H, Lan Y, Jin R, Song E, Wang Q, Hung C-C (2012) Fast weighted K-view-voting algorithm for image texture classification. Opt Eng 51(2), 2 Mar 2012. https://doi.org/10.1117/1.oe.51.2.027004
9. Liu YL, Wang J, Chen X, Guo YW, Peng QS (2008) A robust and fast non-local means algorithm for image denoising. J Comput Sci Technol 23(2):270-279
10. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol 1, pp I511-I518
11. Yang S, Hung C-C (2003) Image texture classification using datagrams and characteristic views. In: Proceedings of the 18th ACM symposium on applied computing (SAC), Melbourne, FL, 9-12 Mar 2003, pp 22-26. https://doi.org/10.1145/952532.952538
Part III
Deep Machine Learning Models for Image Texture Analysis
9
Foundation of Deep Machine Learning in Neural Networks
Our greatest glory is not in never falling, but in rising every time we fall. —Confucius
This chapter introduces several basic neural network models, which serve as the foundation for the further development of deep machine learning in neural networks. Deep machine learning takes a very different approach to feature extraction compared with the traditional feature extraction methods that have been widely used in the pattern recognition approach. Instead of using human knowledge to design and build feature extractors, deep machine learning in neural networks automatically "learns" the feature extractors. We will describe some typical neural network models that have been successfully used in image and video analysis. One type of neural network introduced here uses supervised learning, such as the feed-forward multi-layer neural networks, and the other type uses unsupervised learning, such as the Kohonen model (also called the self-organizing map (SOM)). Both types were widely used in visual recognition before the emergence of deep machine learning in convolutional neural networks (CNN). Specifically, the following models will be introduced: (1) the basic neuron model and perceptron, (2) the traditional feed-forward multi-layer neural networks using backpropagation, (3) the Hopfield neural networks, (4) Boltzmann machines, (5) restricted Boltzmann machines and deep belief networks, (6) self-organizing maps, and (7) the Cognitron and Neocognitron. Both the Cognitron and Neocognitron are deep neural networks that can perform self-organization without any supervision. These models are the foundation for discussing texture classification using deep neural network models.
9.1 Neuron and Perceptron
The traditional artificial neural networks (ANNs) have become an essential part of machine learning in artificial intelligence. An ANN is characterized by three components, namely the architecture, the transfer function (also called the squashing or activation function), and the learning algorithm. Many different types of ANNs have been proposed and developed in the literature. Two types of ANNs are widely used in applications: the supervised ANN, such as the feed-forward multi-layer neural network (FMNN), and the unsupervised ANN, such as the self-organizing map (SOM). Hence, the SOM is very often used as an unsupervised classifier and the FMNN is employed as a supervised classification algorithm. In analogy, this corresponds to unsupervised and supervised learning in pattern recognition and machine learning. In general, training an ANN is time consuming; however, once an ANN is well trained, it acts very fast during the testing phase [8, 10, 11, 37, 39].
Figure 9.1 illustrates an analogy between a biological neuron and an artificial neuron. The artificial neuron is a simulation of a biological neuron and is called the McCulloch-Pitts model [27]. This neuron model consists of a summation function (Σ) and a squashing function f(sum) as shown in Eqs. 9.1 and 9.2, respectively.

sum = x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n = \sum_{i=1}^{n} x_i w_i    (9.1)

Out = f(sum)    (9.2)
where x_1(t), x_2(t), ..., x_n(t) are the input signals and w_1, w_2, ..., w_n are the corresponding weights. The summation function calculates the total of the multiplications of each input signal to the neuron with its corresponding weight. The squashing function then converts the sum to fall within a controlled range, usually between zero and one. We can think of each neuron as a basic processing element that includes input and output functions. In addition, the contribution of a signal traveling along the network depends on the strength of the synaptic connection. Figure 9.1c shows the terminology correspondence between soma, synapse, dendrite, and axon in a biological neuron and neuron, weight, input, and output, respectively, in an artificial neuron. A typical supervised ANN architecture is shown in Fig. 9.2. This architecture consists of one input layer with four neurons, one hidden layer with five neurons, and one output layer with four neurons. In a network architecture, one has to determine the number of layers and the number of neurons in each layer. Artificial neural networks (ANNs) are simplified, simulated models of the brain and nervous system, i.e., of the biological neural networks in mammals and human beings. An ANN is a highly parallel information processing system much more
Fig. 9.1 A comparison between (a) a biological neuron and (b) an artificial neuron (summation and squashing functions); the McCulloch-Pitts neuron model is used as the artificial neuron. (c) An analogy between the terms used in the artificial neuron and the biological neuron: neuron (node)-soma, weight-synapse, input-dendrite, and output-axon
Fig. 9.2 The architecture of a traditional feed-forward multi-layer neural network (FMNN): four neurons in the input layer, five neurons in the hidden layer, and four neurons in the output layer, with weights on the connections between successive layers
like the brain. The ANN has shown powerful capability as a complex problem solver. Watanabe et al. experimented on the discrimination of color paintings by pigeons [38]. The pigeons were able to learn and successfully recognize paintings by Monet and Picasso. The experiments also illustrated the generalization capability of the pigeons on unseen paintings, demonstrating that the pigeons can extract and recognize patterns (i.e., features) and generalize from what they have seen to make a prediction. Similar to this biological neural network (BNN), an ANN can be built to have similar functions for problem-solving. An ANN is a nonsymbolic representation: the network's behavior depends on its weights, which are small numeric values. These weights can be trained (i.e., learned) through learning algorithms so that the ANN can solve the problem for an application after training. In a sense, the ANN is more like a functional approximation, which maps from the input to the output. As the transfer function is one of the characterizations of an ANN, several mathematical functions have been used as transfer functions. Figure 9.3 shows two mathematical functions which can be used for this purpose. The sigmoid function is frequently used in the FMNN due to its simple derivative.
Fig. 9.3 Two mathematical functions that can be used as transfer functions
Fig. 9.4 A perceptron: (a) uses a threshold function to derive the output, and (b) the threshold function is treated as a special neuron, which is a constant (numerical value one), and its corresponding weight will be learned like other weights in the network
The simplest ANN is the perceptron, which consists of two layers, one for the input and the other for the output, as shown in Fig. 9.4 [28]. As in any ANN, each input signal to a perceptron is multiplied by the corresponding weight in the connection between the input node and the output node, and all the weighted inputs are added together; this is represented by the summation symbol Σ. If the sum is larger than a predetermined threshold, the output is one; otherwise, it is zero. Instead of using a threshold function as shown in Fig. 9.4a, the threshold function can be replaced by a special neuron whose input is a constant one and whose corresponding weight is learned like the other weights in the network (Fig. 9.4b). To obtain a set of proper weights for a perceptron, the network needs to be trained using a learning algorithm. The learning algorithm usually requires many epochs to complete the training. An epoch is defined as one iteration of feeding the set of
training samples to the network during the training. The following gives a general training algorithm in steps based on the architecture presented in Fig. 9.4b.

Training Algorithm for the Perceptron:
Step 1: Initialization: Set the initial weights w_0, w_1, w_2, ..., w_n to small random numbers in a range such as [-0.5, 0.5].
Step 2: Activation: Activate the perceptron by applying the inputs x_1(t), x_2(t), ..., x_n(t) and the desired (i.e., target) output O_T(t). Calculate the actual output at iteration t, O_A(t), using Eq. 9.3:

O_A(t) = sign\left[ \sum_{i=1}^{n} x_i(t) w_i(t) \right]    (9.3)

where n is the number of perceptron inputs and sign is the sign function shown in Fig. 9.3, used as the transfer function.
Step 3: Weight training: Update the weights of the perceptron using Eqs. 9.4 and 9.5:

w_i(t+1) = w_i(t) + \Delta w_i(t)    (9.4)

\Delta w_i(t) = \eta \, (O_T(t) - O_A(t)) \, x_i    (9.5)

where \Delta w_i(t) is the weight correction at iteration t and \eta is a learning rate between 0.0 and 1.0. The weight correction is computed by the delta rule shown in Eq. 9.5. The learning rate is gradually decreased during the training of the network for the stabilization of the network.
Step 4: Iteration: Increase the iteration t by one, go to Step 2, and repeat the process until the network converges.
Example 9.1 illustrates the steps in training a perceptron to perform the logic OR function with two input variables (x_1, x_2). In this example, there are four samples for the training, as shown in Table 9.1.

Table 9.1 A detailed trace of the perceptron training. The perceptron completes the training in two epochs. The threshold (θ) is set to 0.2 and the learning rate (η) to 0.1

Epoch  Inputs (x1, x2)  Desired O_T(t)  Weights (w1, w2)  Actual O_A(t)  Error  Adjustment (Δw1, Δw2)  Adjusted weights (w1, w2)
1      (0, 0)           0               (1, 0)            0              0      (0, 0)                 (1, 0)
1      (0, 1)           1               (1, 0)            0              1      (0, 0.1)               (1, 0.1)
1      (1, 0)           1               (1, 0.1)          1              0      (0, 0)                 (1, 0.1)
1      (1, 1)           1               (1, 0.1)          1              0      (0, 0)                 (1, 0.1)
2      (0, 0)           0               (1, 0.1)          0              0      (0, 0)                 (1, 0.1)
2      (0, 1)           1               (1, 0.1)          1              0      (0, 0)                 (1, 0.1)
2      (1, 0)           1               (1, 0.1)          1              0      (0, 0)                 (1, 0.1)
2      (1, 1)           1               (1, 0.1)          1              0      (0, 0)                 (1, 0.1)
Example 9.1 A perceptron which can perform the OR logic function. Please note that we ignore the biased neuron shown in Fig. 9.4 and use a threshold value instead. (a) shows an architecture for implementing the OR function, and (b) Table 9.1 gives the process of the perceptron training. The threshold (θ) is set to 0.2 and the learning rate (η) to 0.1. The perceptron converges after two epochs of training.
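The following is a minimal sketch of the perceptron training loop (Eqs. 9.3-9.5) applied to the OR example, using the threshold 0.2 and the learning rate 0.1 of Example 9.1. The zero initial weights are our own assumption (the algorithm suggests small random values), so the number of epochs needed to converge may differ from Table 9.1.

```python
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]   # the OR truth table
w = [0.0, 0.0]          # assumed initial weights
theta, eta = 0.2, 0.1   # threshold and learning rate from Example 9.1

for epoch in range(10):
    errors = 0
    for (x1, x2), target in samples:
        actual = 1 if x1 * w[0] + x2 * w[1] > theta else 0   # Eq. 9.3 with a step threshold
        error = target - actual
        w[0] += eta * error * x1                             # Eqs. 9.4 and 9.5 (delta rule)
        w[1] += eta * error * x2
        errors += abs(error)
    if errors == 0:                                          # a full epoch with no mistakes
        print(f"converged after epoch {epoch + 1}, weights = {w}")
        break
```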
As shown in Fig. 9.4, the perceptron is nothing but a linear function and can therefore only solve linearly separable problems, which limits its applicability [28]. However, it is the foundation for the development of more complicated nonlinear ANNs.
9.2 Traditional Feed-Forward Multi-layer Neural Networks (FMNN)
Based on the foundation of the perceptron and the invention of the backpropagation algorithm by Werbos [39], Parker [30], and Rumelhart et al. [32], the feed-forward multi-layer neural network (FMNN) has become an important model in pattern recognition and machine learning. Figure 9.5 shows a typical FMNN model which consists of three layers. In the literature, it is often simply called an artificial neural network (ANN). Please note that the neurons in the input layer are used only to take the input signals and have no functionality of their own. The backpropagation algorithm is commonly used for training a typical FMNN. Similar to the perceptron, the forward processing in the FMNN is the same: apply an input to the network and calculate the output of each neuron in the network. This step is called the forward pass. If an input is represented as a vector X and the weights are represented as a matrix W, then the output vector O_A can be represented as in Eq. 9.6, where the notation f denotes the function of the FMNN:

O_A = f(XW)    (9.6)
Fig. 9.5 A feed-forward multi-layer neural network with three layers: an input layer, a hidden layer, and an output layer
If the actual output O_A is the same as the target output O_T, the weights remain unchanged and this processing is repeated for the remaining training samples. Otherwise, the weights need to be adjusted using a learning algorithm such as the backpropagation algorithm. The backpropagation algorithm adjusts the weights based on the gradient descent method; hence, the activation function used with the backpropagation algorithm must be differentiable. The sigmoid function and its derivative, shown in Eqs. 9.7 and 9.8, are widely used for the backpropagation algorithm. Like other activation functions, the sigmoid function transforms the output of the summation into the range between zero and one.

O(I) = \frac{1}{1 + e^{-S_I}}    (9.7)

\frac{dO(I)}{dI} = O(I)(1 - O(I))    (9.8)

where S_I is the sum of a neuron and O(I) is the actual output of that neuron for an input I. The adjustment of the weights connecting the neurons between the input layer and the hidden layer is different from that of the weights between the hidden layer and the output layer. This is due to the lack of target outputs for the neurons in the hidden layers.
Therefore, those target outputs have to be estimated to calculate the error for adjusting those weights. No matter how many hidden layers are used in an FMNN, the weight adjustment for each neuron in a hidden layer is very similar. The learning rate for the network is similar to that used in the perceptron; it is gradually decreased during the training for the stabilization of the network. Similar to the training of the perceptron, the training of the FMNN adjusts the weights by providing a set of training samples as input so that the network can function correctly for an application. A training sample is a pair of input and target output. The input is usually a vector representing a number of features, and the number of features in an input vector determines the number of neurons in the input layer of the network. The target output can be in a vector format if multiple output neurons are used. However, only one output will be active (one) and the others inactive (zeros) for the categorization. The following steps give a summary for training an FMNN using the backpropagation algorithm.

Training of the FMNN Using the Backpropagation Algorithm:
Step 1: Initialize all the weights to small random real numbers between zero and one. Set up a learning rate η. Repeat Steps 2-6 for each training pair.
Step 2: Apply an input to the network and calculate the output of each neuron in the network. This step is called the forward pass.
Step 3: Calculate the error for a neuron k in the output layer between the actual output O_k(I) and the desired output T_k(I) using Eq. 9.9 for an input sample I:

\delta O_k = O_k(I)(1 - O_k(I))(T_k(I) - O_k(I))    (9.9)

Step 4: Adjust each weight w_{i,k}(t) in the connections between a neuron i in the hidden layer and a neuron k in the output layer by adding Δw_{i,k} to w_{i,k}(t) using Eqs. 9.10 and 9.11:

w_{i,k}(t+1) = w_{i,k}(t) + \Delta w_{i,k}    (9.10)

\Delta w_{i,k} = \eta \, \delta O_k \, h_i(I)    (9.11)

where η is the learning rate, t is the number of iterations, and h_i(I) is the output of neuron i for an input sample I.
Step 5: Calculate the error for a neuron i in the hidden layer between the input layer and the hidden layer using Eq. 9.12:

\delta H_i = H_i(I)(1 - H_i(I)) \sum_{j \in \text{output layer}} \delta O_j \, w_{ij}(t)    (9.12)

where H_i(I) is the output of neuron i for an input sample I, δO_j is the same quantity calculated in Eq. 9.9 for each neuron in the output layer, and w_{ij}(t) is the weight between neuron i in the hidden layer and neuron j in the output
layer. Due to the lack of a target vector for a hidden layer, the summation in Eq. 9.12 is used to estimate the error for a hidden neuron.
Step 6: Adjust each weight w_{l,i}(t) in the connections between a neuron l in the input layer and a neuron i in the hidden layer by adding Δw_{l,i} to w_{l,i}(t) using Eqs. 9.13 and 9.14:

w_{l,i}(t+1) = w_{l,i}(t) + \Delta w_{l,i}    (9.13)

\Delta w_{l,i} = \eta \, \delta H_i \, I_l    (9.14)

where I_l is the input to neuron l in the input layer for an input sample I.
Step 7: Stop the training when the network converges.
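The steps above can be condensed into a compact sketch for a network with one hidden layer and sigmoid units (Eqs. 9.7-9.14). The toy OR data, the layer sizes, and the constant-one bias columns (the biased neuron discussed later in this section) are illustrative assumptions, not part of the algorithm statement above.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))                   # Eq. 9.7

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [1]], dtype=float)               # OR targets
Xb = np.hstack([X, np.ones((4, 1))])       # constant-one column plays the role of a biased neuron
W1 = rng.uniform(-0.5, 0.5, (3, 3))        # (inputs + bias) -> 3 hidden neurons (Step 1)
W2 = rng.uniform(-0.5, 0.5, (4, 1))        # (hidden + bias) -> 1 output neuron
eta = 0.5

for _ in range(5000):
    H = sigmoid(Xb @ W1)                               # Step 2: forward pass
    Hb = np.hstack([H, np.ones((4, 1))])
    O = sigmoid(Hb @ W2)
    dO = O * (1 - O) * (T - O)                         # Step 3: output error, Eq. 9.9
    dH = H * (1 - H) * (dO @ W2[:3].T)                 # Step 5: hidden error, Eq. 9.12
    W2 += eta * Hb.T @ dO                              # Step 4: Eqs. 9.10-9.11
    W1 += eta * Xb.T @ dH                              # Step 6: Eqs. 9.13-9.14

print(np.round(O, 2))                                  # trained outputs for the four OR inputs
```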
Example 9.2 A feed-forward multi-layer neural network which is trained to perform the OR logic function: (a) the FMNN architecture and (b) Table 9.2 shows the forward pass and the backward propagation. To simplify the operation, the input (x_1, x_2) = (1, 1) is used for the illustration. We assume that the target output is (O_1, O_2) = (1, 0); in other words, O_1 represents the true output and O_2 the false output. The learning rate is set to 0.1. The initial weights are given as shown in the architecture.
There are several issues related to the backpropagation training algorithm: network training paralysis, local minima, and long training time. The local minima problem is due to the gradient descent technique used in the backpropagation algorithm and can be alleviated by using an additional term, called momentum, in the weight adjustment [8, 10, 11, 37, 39]. In many applications, a biased neuron is added to each neuron in an FMNN. This is very similar to the special neuron used in the perceptron shown in Fig. 9.4. In other words, the input for this biased neuron is always a constant one, and the corresponding weight will be learned precisely the
Table 9.2 A detailed training trace of the neural network for the single input (x_1, x_2) = (1, 1). The first part shows the calculated results of the forward pass. The second and third parts give the weight adjustments between the neurons during the backward propagation. The learning rate (η) is set to 0.1

Forward pass:
Inputs     Hidden neuron (H)                 Output neurons                                  Target output
x_1 = 1    weighted-sum input 0.3,           O_1: weighted-sum input 0.29, output 0.29       T_1 = 1
x_2 = 1    output 0.29                       O_2: weighted-sum input 0.14, output 0.28       T_2 = 0

Backward propagation: adjust the weights between the input and hidden layers
Input      Hidden neuron   η     δH         Δ = η δH x_i   Old weight    New weight
x_1 = 1    H               0.1   0.024289   0.002429       w_1 = 0.1     w_1 = 0.102429
x_2 = 1    H               0.1   0.024289   0.002429       w_2 = 0.2     w_2 = 0.202429

Backward propagation: adjust the weights between the hidden and output layers
Hidden neuron   Output neuron   η     δO          Δ = η δO H    Old weight    New weight
H = 0.29        O_1             0.1   0.146189    0.004239      w_3 = 1       w_3 = 1.004239
H = 0.29        O_2             0.1   -0.056448   -0.001637     w_4 = 0.5     w_4 = 0.498363
same as the other weights. It works similarly to adjusting the threshold used in the perceptron and brings about faster convergence of the training. Several advanced training algorithms, such as genetic algorithms, have also been proposed in the literature as alternatives to the backpropagation algorithm.
9.3 The Hopfield Neural Network
The Hopfield neural network (HNN) is a recurrent artificial neural network (ANN). It is a very simple but useful ANN which can store patterns [8, 10, 17]. The HNN differs from the FMNN in that there is feedback (i.e., recurrence) from the network outputs to their inputs. The HNN is an auto-associative network which will output an entire pattern when the network recognizes an input that may be an incomplete or noise-corrupted pattern. Figure 9.6 shows an HNN with three neurons in a single layer. Each neuron is connected to every other neuron in the network, and each neuron is both an input and an output. The weights in the connections between each pair of neurons are symmetric; in other words, w_12 = w_21, w_13 = w_31, and w_23 = w_32. There are four important properties associated with the HNN: (1) each neuron is a nonlinear unit, (2) the synaptic connections (weights) are symmetric, (3) there is recurrent feedback to each neuron, and (4) the weight for the self-feedback of each neuron is zero (this is not shown in Fig. 9.6).
Fig. 9.6 An HNN with three neurons: neurons are denoted as N1, N2, and N3. The weights in the connection between each pair of neurons are symmetric
Each neuron in an HNN is always in one of two states (+1 or -1) at any time. Equation 9.15 is used to calculate the state S_i of a neuron i. This equation indicates that the state of a neuron i depends on the states of the other neurons j:

S_i = +1 \quad \text{if} \quad \sum_{j=1}^{N} w_{ij} S_j > \theta_i, \quad \text{else} \quad S_i = -1    (9.15)

where w_ij is the weight in the connection between neurons i and j, N is the number of neurons, and θ_i is a threshold associated with neuron i. Please note that the threshold θ_i in Eq. 9.15 is usually set to zero. A neuron to update can be randomly selected in the asynchronous mode; a simultaneous update of all neurons in the synchronous mode is also used. The energy concept of a system, similar to the Ising model [2], is used in the HNN. Hence, to obtain a solution of the HNN, the network has to converge to a stable state, which indicates that the system has reached a minimum of the energy in the network. The energy function (also called the Liapunov function) used in the HNN is defined as in Eq. 9.16 [39].
E = -\frac{1}{2} \sum_i \sum_j w_{ij} S_i S_j - \sum_j I_j S_j + \sum_j \theta_j S_j    (9.16)
where the notations are defined as in Eq. 9.15, the mutual interactions between neurons i and j are characterized by the first term, I_j is an external input to neuron j, and θ_j is the threshold of neuron j. Since there are only two states (+1 and -1) for each neuron, the state of the HNN can be represented by a binary vector with each component either +1 or -1. Therefore, the state space of an HNN with n neurons can be shown as an n-dimensional hypercube with 2^n states. Example 9.3 shows the three-dimensional hypercube for the HNN in Fig. 9.6. Example 9.3 All 2^3 = 8 states of the HNN with three neurons as shown in Fig. 9.6.
Since the HNN is an auto-associative network, we should be able to store patterns and then retrieve them later. Hence, there are two modes in the operation of the HNN: the storage mode and the retrieval mode. Similar to a traditional ANN, the storage mode of the HNN is the training of the network. The stochastic dynamics of the network allows the HNN to learn a set of binary state vectors that represent good solutions. Once the training is completed, pattern retrieval can be achieved using the retrieval mode (i.e., testing). Some researchers refer to the training and the testing as encoding and decoding of the patterns, respectively. The HNN can be trained with either Hebbian or Storkey learning [17, 35]. The following algorithm presents the steps used for HNN training with Hebbian learning.
Training of the Hopfield Neural Network (HNN):
Step 1: Present a set of training patterns X = x_1, x_2, ..., x_p one at a time to all the neurons. We assume that each pattern vector is p-dimensional and that there are p neurons in the HNN; in other words, the dimensionality of a pattern vector is the same as the number of neurons.
Step 2: Use Hebbian learning to train the weights with Eq. 9.17:

w_{ij} = \frac{1}{p} \sum_{C=1}^{p} S_i^C S_j^C \quad \text{if } i \neq j, \qquad w_{ij} = 0 \quad \text{if } i = j    (9.17)

where p is defined as in Step 1, S_i and S_j are the states of neurons i and j, and C is the index of a training pattern. The above training can be compactly represented in matrix format as described in [8].
Once the training is completed, we can think of the HNN as an auto-associative memory. To retrieve a pattern from the HNN, an input vector is presented to the network as a probe. The state of the network is changed dynamically until it stabilizes, and the stabilized state is the retrieved output pattern. Pattern retrieval in the HNN is presented in the following procedure.

Pattern Retrieval in the HNN:
Step 1: Present an input vector as a probe to the HNN. We assume that the dimensionality of the input vector is the same as the number of neurons in the network. A neuron is randomly selected for the update.
Step 2: The energy function in Eq. 9.16 is used for testing the stability of the network. If the energy function decreases, the network will change its state; the network converges to a stable state, which is a minimum of the energy function. The update of the energy function due to a change in neuron j is computed using Eq. 9.18, and this change is then added to Eq. 9.16 to obtain the updated energy function:

\delta E = -\left( \sum_{i \neq j} w_{ij} S_i + I_j - \theta_j \right) \delta S_j    (9.18)
where the notations are defined as in the other equations and δS_j is the change in neuron j. Please note that the threshold θ_j is zero here. It has been shown that the network energy must either decrease or remain constant [39].
Example 9.4 Assume that the following training set was used to train a Hopfield network with three neurons as shown in Fig. 9.6:
(1 1 1)^T, (1 1 1)^T, (1 1 1)^T. If this HNN is well trained and a new pattern new = (1 1 1)^T is presented to it, the state of the HNN should return the pattern output = (1 1 1)^T.
The HNN discussed so far is a deterministic machine which can be trapped in a locally optimal solution. A so-called statistical Hopfield machine has been established by using the Boltzmann distribution to make probabilistic transitions among the state changes in the HNN [39]. The Boltzmann distribution function was discussed with simulated annealing in Chap. 3. A general learning procedure of the statistical Hopfield machine is described below.

Learning of the Statistical Hopfield Machine:
Step 1:
For each neuron i, set its state to one or zero using Eq. 9.19:

\text{If } P_i = e^{-\delta E_i / (T+1)} > \theta_i, \text{ set the state to 1; otherwise, set the state to 0}    (9.19)

where δE_i is the change in the energy function between the next state and the current state of neuron i, T is the current temperature, and θ_i is a random number between 0 and 1. When the temperature T is high, the probability is close to 1, which means that the probability of setting the state to one is very high. If the temperature is low, the probability is close to 0.
Step 2: Reduce the temperature T and repeat Step 1 until the stable state of the network is reached or the temperature has been reduced to zero.
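Before moving on to Boltzmann machines, the following is a minimal sketch of the deterministic HNN: Hebbian storage (Eq. 9.17) followed by asynchronous recall (Eq. 9.15) with zero thresholds. The two ±1 patterns used here are our own illustrative choices, not those of Example 9.4, and dividing by the number of stored patterns is one common convention.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian storage (Eq. 9.17): sum of outer products with a zero diagonal."""
    p, n = patterns.shape
    W = (patterns.T @ patterns) / float(p)
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, probe, steps=100, seed=0):
    """Asynchronous recall (Eq. 9.15): repeatedly update one randomly chosen neuron."""
    rng = np.random.default_rng(seed)
    s = probe.copy().astype(float)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1.0 if W[i] @ s > 0 else -1.0   # zero threshold, as noted in the text
    return s

# illustrative +1/-1 patterns (our own, not those of Example 9.4)
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]], dtype=float)
W = train_hopfield(patterns)
noisy = np.array([1, -1, 1, -1, 1, 1], dtype=float)   # first pattern with one flipped component
print(recall(W, noisy))                               # settles back to the stored pattern
```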
9.4 Boltzmann Machines (BM)
Similar to many clustering algorithms, the Hopfield neural network also tends to stabilize on a locally optimal state in the energy space. Hinton, Ackley, and Sejnowski [1, 13, 14] introduced the Boltzmann machine (BM), which can be considered an extension of the Hopfield network that uses simulated annealing. Besides Hebbian learning, the Boltzmann distribution is used in the BM. The BM is a type of neural network in which the neurons change their states using the Boltzmann distribution function. Hence, there is a similarity between the simulated annealing (SA) introduced in Chap. 3 and the BM, as both use the Boltzmann
Fig. 9.7 An example of the Boltzmann machine with four neurons in the visible layer and three neurons in the hidden layer
distribution function. The optimization method used in the SA is employed in the BM to avoid the local optima problem. The architecture of the BM consists of a visible layer and a hidden layer; the neurons in the visible layer are called visible neurons, and those in the hidden layer are labeled as hidden neurons. Figure 9.7 shows an example of the BM with visible and hidden layers. The BM is a fully connected neural network. The stochastic dynamics of a Boltzmann machine allows this machine to sample binary state vectors that represent good solutions to the optimization problem. The Boltzmann machine has two modes: training (i.e., learning) and testing (i.e., searching or recall). Due to the hidden neurons used in the Boltzmann machine, its training is more complicated than that of the HNN; the testing, or recall, is straightforward. The training procedure has been given by Hinton and Sejnowski [14] and other researchers [8, 10, 33, 39]. A general training procedure is summarized in the following five steps; the training reduces the difference between what the network settles to in the positive phase and what it settles to in the negative phase.
Step 1: Select an example randomly from a training set.
Step 2(a): Train the network in a positive phase with the selected example: let the network settle using simulated annealing.
Step 2(b): Calculate the statistics on pairs of neurons which are both on.
Step 3(a): Train the network in a negative phase without a training example: let the network settle using simulated annealing.
Step 3(b): Calculate the statistics on pairs of neurons which are both on.
Step 4: Update the weights based on the statistics from Steps 2(b) and 3(b).
Step 5: Go to Step 1 and repeat.
Similar to [33], a more detailed procedure is given below.

Training of the Boltzmann Machine:
Step 1: Calculate the clamped probabilities:
Step 1(a): Clamp each of a set of training patterns X = x_1, x_2, ..., x_p one at a time to the input and output neurons. We assume that each pattern vector is p-dimensional and that there are p neurons in the input and output layers of the BM; in other words, the dimensionality of a pattern vector is the same as the number of neurons in the input and output layers.
Step 1(b): Allow the network to settle into equilibrium and record the output values of all neurons.
Step 1(c): Repeat Steps 1(a) and 1(b) for all training patterns.
Step 1(d): Calculate the probability P_ij^+, over all training patterns, that neurons i and j are both on, using Eq. 9.20:

P_{ij} = e^{-\delta E_{ij} / (T+1)}    (9.20)

where δE_ij is the change in the energy function between neurons i and j, and T is the current temperature.
Step 2: Calculate the unclamped probabilities:
Step 2(a): Start from a random state with a "free run" of the network; "free run" means that no pattern is clamped to the input and output neurons.
Step 2(b): Repeat Step 2(a) (it should run long enough) and record the output values of all neurons.
Step 2(c): Calculate the probability P_ij^-, over all training patterns, that neurons i and j are both on, using Eq. 9.20.
Step 3: Adjust the weights w_ij of the network by the amount δW_ij using Eq. 9.21:

\delta W_{ij} = \eta \, (P_{ij}^{+} - P_{ij}^{-})    (9.21)

where η is the learning rate of the network and the probabilities P_ij^+ and P_ij^- are from Steps 1 and 2.
Recall in the Boltzmann Machine:
Step 1: The output of a neuron i in the hidden layer is obtained using Eq. 9.22:

H_i = f\left( \sum_{I=1}^{n} S_I W_{Ii} \right)    (9.22)
where f is a step function, S_I is the state of neuron I in the input layer, and W_Ii is the weight of the connection from neuron I in the input layer to neuron i in the hidden layer.
The visible neurons in some Boltzmann machines are divided into an input layer and an output layer [8]. In such a situation, the formula for the output of a hidden neuron can be revised to give the output of a visible neuron in the output layer.
9.5 Deep Belief Networks (DBN) and Restricted Boltzmann Machines (RBM)
The traditional multi-layer ANN which uses a gradient descent algorithm, such as the backpropagation, can be trapped in a local minimum. Deep belief networks (DBN) were proposed to address this local optimum problem [12, 15, 25, 36]. The DBN is trained using a mixture of unsupervised and supervised training. The basic module (a layer) of the DBN is the restricted Boltzmann machine (RBM), and each RBM is trained using so-called unsupervised pretraining without labeled data [12, 15]. Once each module is well trained, the backpropagation algorithm is then used for fine-tuning with a set of labeled training data to reduce the overall error of the DBN. Since the RBM is the foundation of the DBN, the RBM is discussed first in the following.
The restricted Boltzmann machine (RBM) is a "restricted" version of the Boltzmann machine (BM). Similar to other ANNs, the neurons in the RBM are densely connected between layers. The restriction means that there are no connections among neurons within the same layer; in other words, there are no intra-layer connections. The architecture is very similar to a symmetrical, bipartite, and bidirectional graph in graph theory [36]. Figure 9.8 illustrates a 4-3 architecture of the RBM. The neurons in the input layer are called visible neurons (denoted by v) while the neurons in the hidden layer are called hidden neurons (denoted by h). The weights between the visible neurons and the hidden neurons are symmetrically connected [12, 15, 34]. The visible neurons take an input and pass it through the weighted connections to the neurons in the hidden layer; therefore, the hidden neurons are usually called feature detectors. Hinton and Salakhutdinov [15] presented an unsupervised pretraining for the RBM. An energy function E(v, h) of the visible and hidden neurons, similar to the function used in the HNN, is defined in Eq. 9.23:

E(v, h) = -\sum_{i \in \text{pixels}} b_i v_i - \sum_{j \in \text{features}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (9.23)
where v_i and h_j represent the states of visible neuron i and hidden neuron j, respectively, b_i and b_j are the biases of neurons i and j, and w_ij denotes the weight between these two neurons.
Fig. 9.8 The restricted Boltzmann machine (RBM) with a 4-3 architecture. Neurons in the visible layer are denoted by v_i with i = 1, 2, 3, and 4, whereas neurons in the hidden layer are labeled h_j with j = 1, 2, and 3. The RBM is a symmetrical, bipartite, and bidirectional graph with shared weights
The network assigns a probability to every input via this energy function [12, 15]. The probability of an input is adjusted by changing the weights and biases for each neuron. The pretraining of the RBM, based on that presented in [15], is given below.

The Pretraining of Restricted Boltzmann Machines (RBM):
Step 1: Given an input, the binary state h_j of each hidden neuron j is set to 1 with probability P(h_j),

P(h_j) = \frac{1}{1 + \exp\left(-\left(b_j + \sum_i v_i w_{ij}\right)\right)}    (9.24)

where b_j is the bias of neuron j, v_i is the state of neuron i, and w_ij is the weight between neurons i and j. Once binary states have been chosen for the hidden units h_j, each visible neuron v_i is then set to 1 with probability P(v_i),

P(v_i) = \frac{1}{1 + \exp\left(-\left(b_i + \sum_j h_j w_{ij}\right)\right)}    (9.25)
where the symbols are defined as in Eq. 9.24.
Step 2: The states of the hidden units are then updated so that they represent features of the confabulation (repeat Step 1). The change in a weight is given by Eq. 9.26:
Fig. 9.9 A DBN consisting of three RBMs stacked together (top); each RBM can be trained separately, and fine-tuning is then done using the backpropagation (bottom) [15]
\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{Data}} - \langle v_i h_j \rangle_{\text{Recon}} \right)    (9.26)

where η is a learning rate, the first term ⟨v_i h_j⟩_Data is the fraction of times that neurons i and j are on together when the hidden neurons j are being driven by the data on the visible neurons i, and the second term ⟨v_i h_j⟩_Recon is the corresponding fraction for the confabulations [12, 15].
As demonstrated by Hinton and Salakhutdinov [15], a stack of RBMs, called a DBN, is used for the application as shown in Fig. 9.9. Each RBM is pretrained separately, and then the backpropagation algorithm is used to fine-tune the entire stack of RBMs. For this fine-tuning, a labeled dataset must be used for learning across the entire network. Hence, the training of the DBN can be carried out with the following steps.

The Training of Deep Belief Networks (DBN):
Step 1: Train each RBM module using the unsupervised pretraining of RBMs described above. Please note that the output of one module becomes the input to the next module.
Step 2: Train the entire DBN using the backpropagation algorithm.
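A rough sketch of one way to realize the RBM pretraining of Eqs. 9.24-9.26 for a single, very small RBM is given below. The sizes, the single reconstruction step, the sampling details, and the bias updates are our own assumptions for illustration rather than the exact procedure of [15].

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

n_visible, n_hidden, eta = 4, 3, 0.1
W = rng.normal(0, 0.1, (n_visible, n_hidden))    # weights between visible and hidden neurons
b_v = np.zeros(n_visible)                        # visible biases
b_h = np.zeros(n_hidden)                         # hidden biases
v_data = np.array([1.0, 0.0, 1.0, 0.0])          # one made-up binary training vector

for _ in range(100):
    # Eq. 9.24: drive the hidden units from the data
    p_h = sigmoid(b_h + v_data @ W)
    h = (rng.random(n_hidden) < p_h).astype(float)
    # Eq. 9.25: one reconstruction ("confabulation") of the visible units
    p_v = sigmoid(b_v + W @ h)
    v_recon = (rng.random(n_visible) < p_v).astype(float)
    p_h_recon = sigmoid(b_h + v_recon @ W)
    # Eq. 9.26: data-driven statistics minus reconstruction-driven statistics
    W += eta * (np.outer(v_data, p_h) - np.outer(v_recon, p_h_recon))
    b_v += eta * (v_data - v_recon)               # bias updates, an assumed companion to Eq. 9.26
    b_h += eta * (p_h - p_h_recon)

print(np.round(sigmoid(b_v + W @ sigmoid(b_h + v_data @ W)), 2))  # reconstruction of the training vector
```

Stacking several such modules, with each module's hidden activities feeding the next, gives the DBN structure of Fig. 9.9 before the supervised fine-tuning.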
9.6 The Self-Organizing Map (SOM)
Researchers in the area of artificial neural networks have proposed and developed many interesting self-organizing neural networks to mimic the functions of human brains. These networks are capable of detecting various features present in input signals. They have been widely used in applications such as graph bipartitioning and vector quantization [11]. Among the self-organizing networks developed, Kohonen's self-organizing feature map (SOM) is perhaps one of the most popular models used in remote sensing image analysis. The artificial neural network has been an important platform for many years in many different application areas such as speech recognition and pattern recognition [24]. In general, these models are composed of many nonlinear computational elements (neural nodes) operating in parallel and arranged in patterns reminiscent of biological neural nets [11, 21, 24]. One type of these networks, which possesses the self-organizing property, is called a competitive learning network [11, 21, 24]. The simple competitive learning (SCL) network has been used as an unsupervised training method in a hybrid image classification system [21].
An artificial neural network model is characterized by its topology, activation function, and learning rules. The topology of the SOM is represented as a two-dimensional, one-layered output neural net as shown in Fig. 9.10. Each input node is connected to each output node. The dimension of the training patterns determines the number of input nodes; therefore, for a color image, the number of input neurons is three. During training, the input vectors representing the signals are fed into the network sequentially, one vector at a time. The classes trained by the network are represented by the output nodes, with the centroid of each class stored as the weights in the connections between the input and output nodes. The SOM algorithm that forms the feature map (i.e., the output layer) requires a neighborhood to be defined around each winning node; the size of this neighborhood is gradually decreased [26]. The following algorithm outlines the operation of the SOM algorithm as applied to unsupervised learning [21]. Let L denote the dimension of the input vectors, which for us is the number of spectral bands (or spectral features). We assume that a 2-D (N × N) output layer is defined for the algorithm, where N is chosen so that the expected number of classes is less than or equal to N².

The Competitive Learning in SOM:
Step 1: Initialize the weights w_ij(t) (i = 1, ..., L and j = 1, ..., N × N) to small random values and the iteration count t to 1. Choose a maximum number of iterations (Max-It). Determine a learning rate η(t).
Step 2: Present an input pixel X(t) = (x_1, x_2, x_3, ..., x_i, ..., x_L) at time t.
Step 3: Compute the distance d_j between X(t) and each output node using Eq. 9.27:
Fig. 9.10 X_1, X_2, X_3, ..., X_L are the inputs, one for each component of the feature vectors (i.e., in the three-dimensional case, L is 3). Each node in the SOM (shown as a connected network in two-dimensional space) corresponds to one output. Each output defines a spectral class whose center values are stored in the connections between the inputs and the output nodes
d_j = \sum_{i=1}^{L} \left( x_i - w_{ij}(t) \right)^2    (9.27)
where i, j, L, x_i, and w_ij are defined as in Steps 1 and 2.
Step 4: Select the output node j* which has the minimum distance as the winner and update its weights using Eq. 9.28:

w_{ij*}(t+1) = w_{ij*}(t) + \eta(t) \left( x_i - w_{ij*}(t) \right)    (9.28)
where i, j, L, x_i, and w_ij are defined as before, 1 ≤ j* ≤ N × N, and η(t) is a monotonically, slowly decreasing function of t whose value is between 0 and 1.
Step 5: Update the weights of the nodes in the neighborhood of the winning node j* using a formula similar to Eq. 9.28. The size of the neighborhood is defined as M × M (M ≤ N), and M decreases as the learning progresses [26].
Step 6: Increase the iteration count (t) by 1 and check if it meets Max-It. If not, repeat Steps 2 to 5.
Step 7: Select a subset of these N² output nodes as spectral classes; the classification will be the assignment of each pixel in the image to one of the N² classes based on the minimum distance between each pixel and the N² classes.
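A minimal sketch of Steps 1-4, 6, and 7 above is given below; it assumes squared Euclidean distances, a simple 1/t learning rate, and omits the shrinking neighborhood update of Step 5, so it is an illustration rather than a full SOM implementation.

```python
import numpy as np

def train_som(pixels, N=4, max_it=20, seed=0):
    """Competitive learning on an N x N output grid (the Step 5 neighborhood update is omitted)."""
    rng = np.random.default_rng(seed)
    L = pixels.shape[1]
    W = rng.random((N * N, L)) * 0.1                 # Step 1: small random weights
    for t in range(1, max_it + 1):
        eta = 1.0 / t                                # an assumed monotonically decreasing learning rate
        for x in pixels:                             # Step 2: present one pixel at a time
            d = ((x - W) ** 2).sum(axis=1)           # Step 3: squared Euclidean distances, Eq. 9.27
            j = int(np.argmin(d))                    # Step 4: winning node
            W[j] += eta * (x - W[j])                 # Eq. 9.28
    return W

pixels = np.random.default_rng(1).random((500, 3))   # stand-in for 3-band pixel vectors
centers = train_som(pixels)                          # each row is a spectral class center (Step 7 selects a subset)
print(centers.shape)                                 # (16, 3)
```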
9.7 Simple Competitive Learning Algorithm Using Genetic Algorithms (SCL-GA)
Competitive learning provides a way to discover the salient general features that can be used to classify a set of patterns [34]. However, there are some potential problems with the application of competitive learning neural networks: (1) the underutilization of some neurons [11, 21], (2) the learning algorithm is very sensitive to the learning rate η(t), and (3) the number of output nodes in the network must be greater than the number of classes in the training set [21]. Ideally, the number of output nodes should be dynamically determined in the training (learning) environment instead of being specified a priori. Genetic algorithms (GA) have been used to prevent fixation at local minima. A GA is a randomized heuristic search strategy [16, 29]. It is an evolutionary algorithm that simulates natural selection, in which the population is composed of candidate solutions. Diverse candidates can emerge via the mating process (mainly the mutation and crossover operations) through the evolution of the population. The purpose of the crossover and mutation operators is to move a population around on the landscape defined by the fitness function [16, 29]. Evolution begins with a population of randomly selected chromosomes (candidate solutions). Chromosomes compete with one another for reproduction in each generation, based on the Darwinian principle of survival of the fittest. After a number of generations of the evolutionary process, the chromosomes that survive in the population are the optimal solutions. A simple genetic algorithm consists of four basic elements, namely (1) generation of populations of chromosomes, (2) reproduction, (3) crossover, and (4) mutation.
The simple competitive learning algorithm using genetic algorithms (SCL-GA) consists of the following steps [22]. We assume that the string with the lowest mean squared error (MSE) is the optimal solution, and the algorithm searches for such a string. The cluster centers are encoded into the strings, and the crossover combines two subsequences of two strings. The chromosomes in the SCL-GA form a population of a number of strings (say P), which are generated randomly at the beginning; each string represents a set of K class centers. In the reproduction, the MSE of each string in the population is calculated using Eq. 9.29, and the inverse of the MSE is used as the fitness function. Half of the strings survive and the other half are regenerated randomly. The MSE is defined as follows:
Fig. 9.11 A population of P strings, each of length L × K (K is the same as N²), where L is the dimension of each cluster center and K is the number of clusters; each string #1, #2, ..., #P holds the concatenated cluster centers C_11...C_1L, C_21...C_2L, ..., C_K1...C_KL
\sigma^2 = \frac{1}{m} \sum_{i=1}^{K} \sum_{x \in \theta_i} (x - v_i)^2    (9.29)
where m represents the number of pixels in the training set, K is the number of classes, x is a pixel, which is an L-dimensional vector, in the training set, v_i is the mean of class i, and θ_i is the collection of data points belonging to class i. Strings of decimal numbers are implemented as arrays in the simulation. Assume that a 2-D (N × N) output layer is defined for the algorithm, where N is chosen so that the expected number of classes is less than or equal to N². Hence, the number of classes (clusters) K is less than or equal to N²; for convenience, K is set equal to N², and L denotes the number of feature vector dimensions used in training. The topology of the competitive learning networks is similar to that shown in Fig. 9.10. A population of P strings is shown in Fig. 9.11. The SCL-GA algorithm can be described in the following steps. Please note that we assume the training dataset has been normalized to the range between 0.0 and 1.0.

The SCL-GA algorithm:
Step 1: Define a number of neural networks, say P, each similar to that shown in Fig. 9.10. Initialize the center weights w_ij(t) at time t = 1 (i = 1, ..., L and j = 1, ..., N × N) to small random values for each neural network. These P neural networks are identical except that their weights are different.
Step 2: Apply the simple competitive learning (SCL) algorithm to each neural network defined in Step 1. Note that only the weights of the winning neuron and its neighboring neurons (for example, in a neighborhood of 3 × 3 or 5 × 5) are updated in each neural network.
Step 3: Generate P strings of cluster centers, that is, S_1 = {C_1, C_2, ..., C_{N²}}, S_2 = {C_1, C_2, ..., C_{N²}}, S_3 = {C_1, C_2, ..., C_{N²}}, ..., S_P = {C_1, C_2, ..., C_{N²}}, where C_i is the mean of class i in each set (i.e., the weights in the connections between the input and the output neurons as shown in Fig. 9.10) and i runs from 1 to N². The length of each string is L × N².
Step 4: Distribute the pixels in the training set among the N² clusters by the minimum-distance criterion, using the Euclidean distance measure, for each string separately, and calculate the centroid of each cluster. This step is repeated P times, once for each string.
Step 5: (Crossover) For each string, a one-point crossover is applied with probability p_c. A partner string is randomly chosen for the mating. Both strings are cut into two portions at a randomly selected position between j and j + 1, and the portions are mutually interchanged, where j = 1 to N².
Step 6: (Mutation) Mutation with probability p_m is applied to an element of each string. Either -1 or +1 is selected randomly by comparing against a probability of 0.5 (if the random probability is less than 0.5, -1 is selected; otherwise, +1 is selected) and added to the chosen element. The mutation operation is used to prevent fixation at a local minimum in the search space.
Step 7: (Reproduction) The inverse of the mean squared error (MSE) is used as the fitness function for each string. All strings are evaluated with the fitness function and compared pairwise. In each comparison, the string with the lower MSE is retained and the other is discarded and replaced by a new randomly generated string. In other words, only half of the strings in a population survive and the other half are regenerated as new random strings representing new class means.
Step 8: Replace all the connection weights of each neural network with each new string obtained in Step 7, for all P neural networks.
Step 9: Repeat Steps 2-8 for a number of generations defined by the user.
Step 10: Select a subset of the N² output nodes as spectral classes (chosen by the user).
Step 11: (Clustering) Classify each pixel into one of the clusters based on the minimum distance between the pixel and the clusters (based on the selected output node solution obtained in Step 10).
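The sketch below illustrates only the GA operators of Steps 5-7 (one-point crossover, ±1 mutation, and pairwise reproduction with fitness 1/MSE from Eq. 9.29) on strings of cluster centers; the SCL update of Steps 2 and 8 is omitted, and all names, sizes, and probabilities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(string, pixels):
    """Eq. 9.29: mean squared distance of pixels to their nearest center in the string."""
    d = ((pixels[:, None, :] - string[None, :, :]) ** 2).sum(-1)   # (m, K) squared distances
    return d.min(axis=1).mean()

def crossover(a, b, pc=0.9):
    """Step 5: one-point crossover between two strings of centers."""
    if rng.random() < pc:
        cut = rng.integers(1, len(a))
        a, b = np.vstack([a[:cut], b[cut:]]), np.vstack([b[:cut], a[cut:]])
    return a, b

def mutate(string, pm=0.3):
    """Step 6: add -1 or +1 to one randomly chosen element with probability pm."""
    s = string.copy()
    if rng.random() < pm:
        k, d = rng.integers(s.shape[0]), rng.integers(s.shape[1])
        s[k, d] += -1.0 if rng.random() < 0.5 else 1.0
    return s

def reproduce(population, pixels):
    """Step 7: pairwise comparison; the lower-MSE string survives, the other is replaced randomly."""
    survivors = []
    for a, b in zip(population[0::2], population[1::2]):
        winner = a if mse(a, pixels) < mse(b, pixels) else b
        survivors += [winner, rng.random(winner.shape)]            # new random string of class means
    return survivors

pixels = rng.random((200, 3))                         # normalized training pixels (L = 3 bands)
population = [rng.random((6, 3)) for _ in range(10)]  # P = 10 strings, K = 6 cluster centers each
for _ in range(5):                                    # a few generations (Steps 5-7 and 9)
    shuffled = [population[i] for i in rng.permutation(10)]
    crossed = [c for i in range(0, 10, 2) for c in crossover(shuffled[i], shuffled[i + 1])]
    population = reproduce([mutate(s) for s in crossed], pixels)
print(min(mse(s, pixels) for s in population))
```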
As the training of the SCL-GA involves many iterations over the training samples, it may be time consuming. A systematic sampling from the whole dataset (or an image) can be used as the training sample set. Figure 9.12 shows an original Thematic Mapper (TM) multispectral image and the classified results. The number of iterations used for all algorithms was 10. The crossover probability and the mutation probability were 0.9 and 0.3, respectively, and the population size was set to 10 for the SCL-GA. For the SCL and SCL-GA, the learning rate was generated randomly between 0.0 and 1.0 and divided by the number of iterations. Since three bands were used in the experiments, the dimension of the input vector for the SCL and SCL-GA was three. From these experiments, the SCL-GA yields better classification results. The SCL is a stable algorithm, and the GA makes a significant contribution to the SCL in enhancing its performance.
Fig. 9.12 (a) An original image, (b) classified result using the K-means algorithm, (c) classified result using the K-means-GA algorithm (described in Chap. 3), (d) classified result using the SCL algorithm, and (e) classified result using the SCL-GA algorithm. The number of clusters used in the classification was 6
9.8 The Self-Organizing Neocognitron
The neocognitron is a deep neural network model for visual recognition [4-6]. It might be one of the pioneering works in the research on deep neural network models, although some models existed earlier than the neocognitron [3, 7, 23, 31]. The neocognitron is an improved model of the cognitron proposed by Fukushima, based on the hypothesis that "the synapse from neuron x to neuron y is reinforced when x fires provided that no neuron in the vicinity of y is firing stronger than y" [3]. This type of neural network is termed a hierarchical neural network, in which sparse and localized synapses are used in the connections between layers [10, 39]. The cognitron is a self-organized model in which the receptive fields of the cells (i.e., neurons) become relatively larger in the deeper layers [3]. In a deeper layer, a neuron can integrate low-level features from previous layers to form high-level features in the hierarchy of the cognitron. The neocognitron is closer to the biological visual system model proposed by Hubel and Wiesel [18-20], who proposed a neural model called cortical plasticity, also known as neuroplasticity; this term refers to the changing of the structure, function, and organization of neurons in response to new environmental experiences. The network works similarly to the traditional ANNs: it takes stimulus input signals and processes them in successive layers. Similar to the cognitron, the neocognitron is also a self-organized model which can extract higher level features (information) if the network is stimulated with a pattern repeatedly [4-6]. The receptive fields of the cells (neurons) are important mechanisms in the biological visual cortex [18-20]. Unlike most traditional neural networks, which take a feature vector as an input, the input of the neocognitron is usually an image, since the network was developed for visual recognition. The neocognitron recognizes patterns in an image based on the geometrical similarity (Gestalt) of the shapes of the objects. Unlike in the cognitron, the recognition in this model is affected neither by the objects' positions nor by small distortions of their shapes [4-6]. Figure 9.13 shows the architecture of the neocognitron. The architecture has an input layer along with several module structures, and each module structure consists of a pair of an S-layer and a C-layer. The S-layer contains simple neurons and the C-layer complex neurons. The synaptic weight connections between the S-layer and the C-layer are fixed, while the other synaptic weight connections are trainable, i.e., the synaptic weight connections from the C-layer in module n − 1 to the S-layer in module n. The input layer consists of a two-dimensional array of cells (neurons), which correspond to the photoreceptors of the retina in the biological visual cortex. There is a mapping of the visual input from the retina to the neurons in the first S-layer; this so-called retinal mapping consists of the synaptic weight connections between the cells of the input layer and the S-layer. Each cell in an S-layer receives signals through the connections that lead from the cells in a neighborhood, i.e., the receptive field, on the preceding layer. Each module, consisting of a pair of an S-layer and a C-layer, can be duplicated many times to form a deep neural network.
Fig. 9.13 A schematic diagram showing the interconnections between a pair of S-layer and C-layer in a module [4]
Similar to the functions in the visual cortex, neurons in the early layers respond to low-level features such as lines and edges in a specific orientation. Neurons in the higher layers of the neocognitron respond to more complex and abstract objects such as object parts. This is because the deeper the layer, the larger the receptive field of each neuron becomes [4]. Each neuron in the final layer integrates the information from the previous layers and selectively responds to a specific stimulus pattern (i.e., a feature). Figure 9.14 shows the receptive fields in a layer. Recall that one of the hypotheses used in the neocognitron is that all the S-cells in the same S-plane have input synapses of the same spatial distribution, and only the positions of the presynaptic cells shift in parallel according to the shift in the position of the receptive field of each individual S-cell [4].

Figure 9.15 shows an example of how the neurons in the different layers of the neocognitron respond to patterns through the synaptic weight connections between cells (neurons) [4]. In the early stage, the neurons identify low-level features such as partial patterns of an object. Gradually, the neurons in the later stages of the network recognize higher level features such as the complete patterns of an object. In other words, these higher level features are composed of the low-level features learned during the early stage in the front layers. For example, in an automatic target recognition task such as the detection of camouflaged enemy tanks, low-level features such as corners and partial edges and lines may be detected in the early stage.
Fig. 9.14 A diagram illustrating the interconnections to the neurons in the receptive fields [4]
Fig. 9.15 An example showing how the neocognitron recognizes partial and complete patterns through the different layers in the process of self-organization [4]
Higher level features such as the wheels of the tank may be detected in a later stage. As we go deeper into the layers of the neocognitron, complex high-level features such as the shape of the tank may appear in the recognition of the final layer. This process is a completely self-organized procedure for performing pattern recognition.

To make the neocognitron a self-organizing network, the trainable synapses are reinforced (learned) using the winner-take-all method, which is very similar to that used in the SOM [10, 39]. The details of selecting the winner (i.e., the representative cell) are given in [4]. The procedure for self-organizing learning is summarized below, and a minimal sketch of the winner-take-all reinforcement follows the steps.

The Self-Organizing Learning of the Neocognitron:

Step 1: A set of images is applied to the neocognitron one at a time, and the synaptic weights are adjusted layer by layer from the input layer to the last layer. Note that only the synaptic weights from the C-layer in module n − 1 to the S-layer in module n participate in the training.

Step 2: A synaptic weight is increased if (a) the complex neuron is responding, or (b) the simple neuron is responding more strongly than any of its immediate neighboring neurons within the competition area.

When its self-organizing learning is completed, the neocognitron obtains a neural structure similar to the model of the visual nervous system proposed by Hubel and Wiesel [18–20].
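A minimal sketch of the winner-take-all reinforcement in Step 2 is given below (our own simplified illustration, not the exact reinforcement rule of [4]; the response measure and the learning rate are hypothetical): within each competition area, only the most strongly responding cell has its incoming weights reinforced, in proportion to the input that excited it.

import numpy as np

def winner_take_all_update(responses, inputs, weights, rate=0.1):
    """Reinforce only the winner within one competition area.

    responses : 1-D array of S-cell responses in the competition area
    inputs    : 2-D array, row i holds the receptive-field input seen by cell i
    weights   : 2-D array, row i holds the trainable weights of cell i
    """
    winner = int(np.argmax(responses))          # representative cell
    if responses[winner] > 0:                   # reinforce only if it actually fired
        weights[winner] += rate * inputs[winner]
    return weights

# Toy example: three competing S-cells with random inputs and zero weights.
rng = np.random.default_rng(0)
inputs = rng.random((3, 9))
weights = np.zeros((3, 9))
responses = (inputs * (weights + 1.0)).sum(axis=1)   # crude response measure
weights = winner_take_all_update(responses, inputs, weights)
print(weights)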
9.9 Summary
This chapter introduces the basics of artificial neural networks as the foundation of deep neural networks. The perceptron is a simple neural network which can only solve linearly separable problems. Traditional neural networks such as the feed-forward multi-layer neural network (FMNN) are the foundation for many advanced neural networks in deep machine learning. The traditional multi-layer neural network requires a representative dataset for supervised training. Kohonen's self-organizing map (SOM) is an unsupervised clustering method which is useful for mapping a high-dimensional space into a two-dimensional map. To improve the performance of the SOM, its population can be diversified through an evolutionary scheme using the genetic algorithm; such an algorithm (SOM-GA) is briefly described and tested on image classification. Both the FMNN and the SOM are widely used for categorization. The Hopfield neural network is an associative-memory network for storing and retrieving patterns. This pioneering work led to the development of the Boltzmann machine (BM) and the restricted Boltzmann machine (RBM). A stack of RBMs with multiple layers becomes a generative model for deep machine learning; this generative model is called the deep belief network, which is popular in many applications. A self-organizing deep neural network, the neocognitron, is briefly described for its characteristic arrangement of cells (neurons), which is much like that of the human
visual cortex. We may be able to find some of the roots of popular deep machine learning models, such as convolutional neural networks (CNN), in the concepts used in the neocognitron and related models.
9.10 Exercises
Implement the following algorithms using a high-level computer language.
1. Develop a perceptron which can perform as a logical OR function.
2. Implement a feed-forward three-layer neural network which can classify the Iris dataset into three classes (https://archive.ics.uci.edu/ml/datasets/iris).
3. Develop a Kohonen self-organizing map (SOM) and classify a color image into a two-dimensional class map.
4. Implement the SCL-GA algorithm. Classify a color image using the SCL-GA and compare the result with that obtained from the SOM.
References

1. Ackley DH, Hinton GE, Sejnowski TJ (1985) A learning algorithm for Boltzmann machines. Cognit Sci 9:147–169
2. Binder K (1994) Ising model. In: Hazewinkel M (ed) Encyclopedia of mathematics. Springer Science + Business Media B.V./Kluwer Academic Publishers, Berlin. ISBN 978-1-55608-010-4
3. Fukushima K (1975) Cognitron: a self-organizing multilayered neural network. Biol Cybern 20:121–136
4. Fukushima K (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36:193–202
5. Fukushima K, Miyake S (1982) Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recogn 15(6):455–469
6. Fukushima K, Miyake S, Takayuki I (1983) Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans Syst Man Cybern SMC-13(5):826–834
7. Giebel H (1971) Feature extraction and recognition of handwritten characters by homogeneous layers. In: Griisser O-J, Klinke R (eds) Pattern recognition in biological and technical systems. Springer, Berlin, pp 16–169
8. Haykin S (1994) Neural networks: a comprehensive foundation. IEEE Press
9. Heaton J (2015) Artificial intelligence for humans, volume 3: deep learning and neural networks. Heaton Research, Inc.
10. Hecht-Nielsen R (1990) Neurocomputing. Addison-Wesley, Boston
11. Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Boston
12. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800
13. Hinton GE, Sejnowski TJ (1983) Optimal perceptual inference. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Washington, pp 448–453
14. Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. In: Rumelhart DE et al (eds) Parallel distributed processing, vol 1
15. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313, 28 July 2006
16. Holland JH (1975) Adaptation in natural and artificial systems. The MIT Press, Cambridge
17. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. In: Proceedings of the national academy of sciences of the USA, vol 79, pp 2554–2558
18. Hubel DH, Wiesel TN (1959) Receptive fields of single neurones in the cat's visual cortex. J Physiol 148:574–591
19. Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J Physiol 160:106–154
20. Hubel DH, Wiesel TN (1965) Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J Neurophysiol 28:229–289
21. Hung C-C (1993) Competitive learning networks for unsupervised training. Int J Remote Sens 14(12):2411–2415
22. Hung C-C, Fahsi A, Coleman T (1999) Image classification. In: Encyclopedia of electrical and electronics engineering. Wiley, pp 506–521
23. Kabrisky M (1967) A proposed model for visual information processing in the human brain. Psyccritiques 12(1):34–36
24. Kohonen T (1989) Self-organization and associative memory. Springer, New York
25. Lee H (2010) Unsupervised feature learning via sparse hierarchical representations. Ph.D. dissertation, Stanford University
26. Lippmann RP (1987) Introduction to computing with neural nets. IEEE ASSP Mag
27. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
28. Minsky ML, Papert S (1969) Perceptrons. MIT Press, Cambridge
29. Mitchell M (1999) An introduction to genetic algorithms. The MIT Press, Cambridge
30. Parker DB (1982) Learning logic. Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, Stanford, CA
31. Rosenblatt F (1962) Principles of neurodynamics, perceptrons and the theory of brain mechanisms. Spartan Books, Washington
32. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Parallel distributed processing, vol 1. MIT Press, Cambridge, pp 318–362
33. Simpson PK (1990) Artificial neural systems: foundations, paradigms, applications, and implementation. Pergamon Press, Oxford
34. Smolensky P (1986) In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing, volume 1: foundations. MIT Press, Cambridge, pp 194–281
35. Storkey AJ (1999) Efficient covariance matrix methods for Bayesian Gaussian processes and Hopfield neural networks. Ph.D. thesis, University of London
36. Trudeau RJ (2013) Introduction to graph theory. Courier Corporation, 15 Apr 2013
37. Wasserman PD (1989) Neural computing: theory and practice. Van Nostrand Reinhold, New York
38. Watanabe S, Sakamoto J, Wakita M (1995) Pigeons' discrimination of paintings by Monet and Picasso. J Exp Anal Behav 63(2):165–174. https://doi.org/10.1901/jeab.1995.63-165
39. Werbos PJ (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University
10 Convolutional Neural Networks and Texture Classification
I have just three things to teach: simplicity, patience, compassion. These three are your greatest treasures. — Lao Tzu
The convolutional neural network (CNN) is an instrumental computational model not only in computer vision but also in many image and video applications. Similar to the cognitron and the neocognitron, the CNN can automatically learn the features of data with the multiple layers of neurons in the network. Several different versions of the CNN have been reported in the literature. If an original texture image is fed into the CNN as a whole, the network is called an image-based CNN. A major problem with image-based CNNs is that a very large number of training images is needed for good generalization of the network because of rotation and scaling changes in images. An alternative method is to divide an image into many small patches for CNN training, which is very similar to the patches used in the K-views model. In this chapter, we briefly explain the image-based CNN and the patch-based CNN for image texture classification. The LeNet-5 neural network architecture is used as a basic CNN model. The CNN is useful not only in image recognition but also in textural feature representation. Texture features that are automatically learned and extracted from a massive amount of images using the CNN have become the focus of developing feature extraction methods.
10.1 Convolutional Neural Networks (CNN)
Deep machine learning has been a trend in recent years for general pattern recognition and machine vision. Most deep machine learning is based on the neural network approach, which uses many more layers of neurons than traditional artificial neural networks; hence, such models are called "deep" neural networks. Due to the advance of high-speed computing devices, several high-performance deep neural network models based on the foundation of the
traditional neural networks have been proposed and widely used in the broad domain of artificial intelligence. One of these deep neural networks, the convolutional neural network (CNN), might be the most popular in applications [3, 6, 7, 16, 24]. Unlike most traditional pattern recognition systems, which require engineer-designed feature extractors, the CNN can automatically learn feature maps through the deep layers of the network. This characteristic makes the CNN suitable for big data analysis in machine learning. Moreover, advanced CNNs have been developed for understanding convolutional networks and for generating image descriptions [12, 31].

The CNN has recently become the de facto neural network used in visual recognition. The architecture of the CNN is more or less similar to that of the neocognitron, one of the earliest deep neural network models, although several features used in the CNN are quite new [4]. If we consider the traditional multi-layer neural network (MLNN) as a general architecture, the CNN is a special kind of MLNN. However, the CNN architecture is still somewhat different, although both models use the backpropagation training algorithm to learn the adaptive weights of the connections between layers. A general architecture of the CNN is shown in Fig. 10.1 [16]. The architecture consists of a convolutional layer followed by a pooling layer (we may call this pair of layers a component). This pair of convolutional and pooling layers can be duplicated many times and cascaded to form a "deep" neural network. Fully connected layers are then added in the final stage of the network to perform the classification for the output of the CNN. The fully connected layers take as input the high-level feature vectors extracted by the preceding components. Although the backpropagation algorithm can be used to train both the ANN and the CNN, some parameters of the CNN are shared by the filters, which operate on so-called receptive fields. This sharing significantly reduces the number of parameters used in the CNN. The pooling layer is the same as the spatial subsampling technique used to reduce the dimension of the activation maps generated from each convolutional layer.
Fig. 10.1 A general architecture of CNN [16]
Fig. 10.2 A spatial filter used in digital image smoothing
Hence, it makes the recognition and detection robust to a limited amount of distortion of objects. The convolutional layer is very similar to the spatial filter design in digital image processing [5, 21]. Figure 10.2 shows a spatial filter (commonly used in digital image smoothing) that is very similar to the filter used in a CNN. The filter convolves with an image patch defined by a neighborhood function, which is called the receptive field in the CNN. Each filter produces an activation map. In general, a kernel operator is used to denote the operation of a spatial filter [5, 21]. The filter scans the entire image over the defined neighborhood from left to right and then from top to bottom. The difference between the filters used in a CNN and in digital image smoothing is that the former are learned with the backpropagation algorithm, while the latter are engineer-designed and fixed. That is why the filter used in a CNN is called a learnable filter. The similarity is that both are used to detect low-level features in the image. Due to the multiple (i.e., deep) convolutional layers used in the CNN, the learnable filters can gradually derive high-level features in the deep layers. Feature extraction is an essential component of learning in a CNN [18].

Please note that there are two related operations called convolution and correlation [5, 21]. Snyder and Qi pointed out the difference between these two operations, in which correlation refers to a kernel operator [21]. The mathematical expressions for correlation and convolution are listed in Eqs. 10.1 and 10.2, respectively:

g(x, y) = f(x, y) \circ h(x, y) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n) h(x + m, y + n)    (10.1)

g(x, y) = f(x, y) * h(x, y) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n) h(x - m, y - n)    (10.2)
where M and N denote the numbers of rows and columns of the input image, respectively, and the function h(x, y) represents a spatial filter. The difference between the two equations is that in Eq. 10.1 the corresponding pixels of the sub-image and the spatial filter are multiplied in order from left to right and top to bottom, while in Eq. 10.2 the spatial filter is flipped (rotated by 180°) before the multiplications with the sub-image. Example 10.1 illustrates the difference between correlation and convolution. Based on the clarification of Snyder and Qi, perhaps we should call this type of neural network a correlational neural network, rather than a convolutional neural network (CNN).

Example 10.1 Correlation and convolution operations using a spatial filter. The first two-dimensional (2-D) array is a hypothetical image and the second 2-D array is a spatial filter.

A hypothetical image:

1 2 3 5 4
4 5 6 7 8
7 8 9 2 3
3 2 1 4 5
6 7 8 7 9

A spatial filter:

1 2 3
4 5 6
7 8 9

For the correlation at the top-left position, we simply multiply the corresponding elements of the 3 × 3 sub-image and the spatial filter and sum them together. The result is 1 × 1 + 2 × 2 + 3 × 3 + 4 × 4 + 5 × 5 + 6 × 6 + 7 × 7 + 8 × 8 + 9 × 9 = 285.

For the convolution, based on Eq. 10.2 we first flip the spatial filtering mask (rotate it by 180°) and then perform the same multiplications and summation as in the correlation. The flipped spatial filter is illustrated below:

9 8 7
6 5 4
3 2 1

The result for the convolution is 1 × 9 + 2 × 8 + 3 × 7 + 4 × 6 + 5 × 5 + 6 × 4 + 7 × 3 + 8 × 2 + 9 × 1 = 165.
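The following short NumPy check (ours, not part of the original text) reproduces Example 10.1 for the top-left 3 × 3 sub-image; np.flip rotates the kernel by 180° for the convolution case.

import numpy as np

image = np.array([[1, 2, 3, 5, 4],
                  [4, 5, 6, 7, 8],
                  [7, 8, 9, 2, 3],
                  [3, 2, 1, 4, 5],
                  [6, 7, 8, 7, 9]])
kernel = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

sub = image[0:3, 0:3]                        # top-left receptive field
correlation = np.sum(sub * kernel)           # Eq. 10.1 without the 1/MN factor
convolution = np.sum(sub * np.flip(kernel))  # Eq. 10.2: 180-degree flipped kernel
print(correlation, convolution)              # 285 165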
However, if multiple pairs of convolutional and pooling layers are used, the learnable spatial filters in the later stages of the CNN are able to recognize higher level features instead of just low-level features. These higher level features are composed of the low-level features learned in the early pairs of convolutional and pooling layers. This property is very similar to that demonstrated in the neocognitron [4]. For example, in alphabet character recognition [4], the low-level features may be partial lines or curves of characters, and the higher level features may be partial or entire alphabet characters. The output of each component in the CNN is called an activation map or feature map. As we go deeper into the multiple layers of the CNN, complex high-level features such as an entire alphabet character may appear in the activation map.

In a sense, the CNN borrows the concept of spatial filtering from digital image processing for its multiple convolutional layers along the network. With this interpretation, the notions of the number of spatial filters, filter size, stride, and padding used in the CNN are similar to those used in digital image processing. The number of layers in a CNN, the number of convolutional layers, the number of spatial filters, the filter size, the stride, and the padding are called the hyperparameters of a CNN. This means that we need to specify the hyperparameters when training a CNN for the task to be solved. There is a constraint relating the stride (S), the filter width (W_F), the width (i.e., number of columns) of the image (W_I), and the padding (P) [9]. The relationship, which simplifies the design of a CNN, gives the output width W_O (the number of positions the filter takes across the image) and is expressed in Eq. 10.3:

W_O = \frac{W_I - W_F + 2P}{S} + 1    (10.3)
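As a quick numerical check of Eq. 10.3, the helper below (our own sketch with hypothetical values) computes the number of filter positions across the image width and flags stride/padding combinations that do not tile the image exactly.

def output_width(w_image, w_filter, padding, stride):
    """Number of filter positions across the image width (Eq. 10.3)."""
    span = w_image - w_filter + 2 * padding
    if span % stride != 0:
        raise ValueError("filter does not tile the image exactly; adjust the padding")
    return span // stride + 1

# Example: a 227-wide input, an 11-wide filter, no padding, stride 4 -> 55 positions.
print(output_width(227, 11, 0, 4))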
The result of Eq. 10.3 must be an integer; otherwise, the spatial filter cannot be moved across the input image in whole steps. We can adjust the padding, which adds extra rows and columns to the image, so that the result of Eq. 10.3 becomes an integer. Please note that the stride here is for moving the filter from left to right in the image, i.e., horizontally. It is implicitly assumed that the move from top to bottom (i.e., vertically) is one row at a time. If we need to move the filter vertically in the same way as horizontally, we may use Eq. 10.3 with the image height to determine the vertical movement.

The CNN has shown its robustness to a certain degree of pattern distortion and simple geometric transformations in image recognition. The popular LeNet-5, one of the foundational CNN models introduced by LeCun et al., is used mainly for handwritten and machine-printed character recognition and possesses the property of feature invariance [16]. In other words, LeNet-5 is robust in image recognition regardless of shift, scale, and distortion to a certain degree. This is due to the use of the pooling layer. LeNet-5 and its variations are widely used for image classification and recognition. The LeNet-5 architecture consists of local receptive fields, shared weights, subsampling (i.e., pooling), and fully connected layers connected to the output layer.
Fig. 10.3 LeNet-5 architecture. The subsampling is also called pooling [16]
The local receptive field is very similar to a spatial filter, such as a 3 × 3 or 5 × 5 neighborhood window, in digital image smoothing. Each neuron is connected to only a local subregion of the input image or of the activation map generated by a layer. The shared weights capture translational and other invariances in the input image and activation maps. Similar to the general CNN architecture shown in Fig. 10.1, LeNet-5, shown in Fig. 10.3, consists of convolutional layers, pooling layers, fully connected layers, and an output layer. In addition, the transfer function called ReLU is used for squashing the output of the convolutional layer. Similar to traditional neural network training, the weights are randomly generated at the beginning with a proper variance. The activation function ReLU uses a small positive bias before the mapping [16].

The convolutional layer in the CNN is somewhat similar to that in the neocognitron, as shown in Fig. 10.4. A neuron of a spatial filter in layer L is connected to a small neighborhood in the previous layer L − 1. These receptive fields form a set of trainable filters which generate the activation (i.e., feature) maps. This is very similar to the spatial filter used in digital image processing to obtain the most appropriate response from an input image. However, the weights of each filter are trainable and shared across the neurons; they are adjusted by learning the features and patterns from the image or feature map. The spatial kernel operator of the trainable filters is similar to Eq. 10.1; however, the weights are now trainable and adaptive. Since the CNN is trained with supervised learning, we need to provide a set of labeled images for optimizing the weights of the filters. The initial weights are chosen randomly, just as in most traditional neural networks. A mathematical formula for each spatial filter in the convolutional layer is given in Eq. 10.4:

f^l_{rc} = \varphi\left( \sum_{i=1}^{S_r} \sum_{j=1}^{S_c} f^{l-1}_{(r+i-1)(c+j-1)} w^l_{ij} + b^l \right)    (10.4)
Fig. 10.4 A diagram illustrating the relation between an input layer (layer L − 1) and an output layer (layer L) through the receptive fields
Fig. 10.5 A rectified linear function (ReLU)
where f^l_{rc} denotes the output neuron at location (r, c) of layer l, S_r and S_c are the numbers of rows and columns of the spatial filter, w^l_{ij} is the weight of the connection between the neuron in the receptive field of layer l and the unit at location (i, j) of layer l − 1, and b^l is the corresponding bias of the filter. The notation \varphi is the rectified linear function (ReLU) used to model the output of a neuron. The ReLU is defined in Eq. 10.5 and plotted in Fig. 10.5:

\varphi(x) = \max(0, x)    (10.5)
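To make Eqs. 10.4 and 10.5 concrete, the following NumPy sketch (ours; the variable names and sizes are hypothetical) computes one activation map of a convolutional layer from a single-channel input using a shared weight matrix, a bias, and the ReLU.

import numpy as np

def relu(x):
    """Eq. 10.5: the rectified linear function."""
    return np.maximum(0.0, x)

def conv_layer(prev_map, weights, bias):
    """Eq. 10.4: shared weights and a single bias produce one activation map."""
    s_r, s_c = weights.shape
    rows = prev_map.shape[0] - s_r + 1
    cols = prev_map.shape[1] - s_c + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            receptive_field = prev_map[r:r + s_r, c:c + s_c]
            out[r, c] = relu(np.sum(receptive_field * weights) + bias)
    return out

rng = np.random.default_rng(1)
prev_map = rng.random((8, 8))          # output of layer l-1
weights = rng.standard_normal((3, 3))  # trainable filter weights w^l
activation_map = conv_layer(prev_map, weights, bias=0.1)
print(activation_map.shape)            # (6, 6)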
To visualize the operations of a convolutional layer, a simple diagram is given in Fig. 10.6. A pooling layer performs a downsampling function which is used to reduce the dimension of a feature map. This function also makes the feature extraction robust to minor deviations. It has been demonstrated that the pooling function can improve classification accuracy [15]. For example, the maximum pooling function selects the maximum value in a neighborhood, using a stride of 2, as shown in Fig. 10.7. Here, the stride refers to how many columns and rows the sampling window is moved from left to right and from top to bottom in a feature map. Among the spatial filters used in CNNs, a 1 × 1 filter is often used for computation reduction and for increasing the nonlinearity of the feature representation [25].
Fig. 10.6 a An example of a convolutional layer and a pooling layer and b the operation of three receptive fields (i.e., spatial filters) generating three activation (feature) maps
Fig. 10.7 An example of the maximum pooling layer in LeNet-5 [16]
A filter of size 1 × 1 reduces multiple feature maps by pooling the features across the various feature maps and constructing a single feature map. Unlike a spatial filter of size 3 × 3, the 1 × 1 filter does not consider the correlation of features within a single feature map. The fully connected layers used in LeNet-5 are exactly the same as those of the traditional multi-layer feed-forward neural networks discussed in Chap. 9. Similar to traditional neural networks, CNNs can overfit. Several methods, such as the dropout technique, have been proposed to avoid the overfitting problem [23].
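As an illustration of how these pieces (convolution, ReLU, pooling, dropout, and fully connected layers) fit together, the following Keras sketch defines a small LeNet-5-style network; it is our own minimal configuration rather than the exact LeNet-5 of [16], and the layer sizes and number of classes are hypothetical.

import tensorflow as tf

# A small LeNet-5-style CNN: two convolution/pooling pairs followed by
# fully connected layers, with dropout to reduce overfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, (5, 5), activation="relu", input_shape=(32, 32, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (5, 5), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(84, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 texture or character classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()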
10.2 Architectures of Some Large Convolutional Neural Networks (CNNs)
Similar to the topology of a traditional neural network, researchers in the area of deep machine learning have proposed several different architectures of convolutional neural networks (CNN) for solving challenging visual recognition problems. Based on the general architecture of the CNN in Fig. 10.1, almost all of the proposed CNNs have a basic structure similar to the layout shown in Fig. 10.8. We will introduce three popular CNNs, namely AlexNet, ZFNet, and VGGNet [13, 22, 31].

AlexNet is an extraordinary CNN that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012 on a large dataset called ImageNet [13]. AlexNet has a relatively simple layout consisting of five convolutional layers, max-pooling layers, dropout layers, and three fully connected layers. ImageNet, a challenging benchmark for machine learning algorithms, has more than 15 million labeled high-resolution images in about 22,000 categories. The ILSVRC uses only a subset of ImageNet with approximately 1000 images per category and 1000 categories. The AlexNet architecture is shown in Fig. 10.9. This architecture has about 650,000 neurons and 60 million trainable parameters. Due to such a large number of parameters, AlexNet uses the data augmentation and dropout techniques to reduce overfitting. Experimental results show that AlexNet achieved record-breaking results in the ILSVRC competition. The performance of AlexNet degrades if a single convolutional layer is removed.

Zeiler and Fergus developed a CNN called ZFNet whose architecture is very similar to that of AlexNet [31]. ZFNet was the winner of the 2013 ILSVRC competition. The objective of ZFNet was to explore two important issues: why a large CNN performs so well on the ImageNet benchmark and how it might be improved [31].
Fig. 10.8 A basic topology of a CNN
(a)
(b)

Layer   Depth  Dimension  Filter size  Stride
C1      96     55 × 55    11 × 11      4
S1      96     27 × 27    3 × 3        2
C2      256    27 × 27    5 × 5        1
S2      256    13 × 13    3 × 3        2
C3      384    13 × 13    3 × 3        1
C4      384    13 × 13    3 × 3        1
C5      256    13 × 13    3 × 3        1
S5      256    6 × 6      3 × 3        2
F6      4096   6 × 6      1 × 1        1
F7      4096   1          1            1
Output  1000   1          1            1

C: convolutional layer, S: sampling (pooling) layer, F: fully connected layer
Fig. 10.9 a An illustration of AlexNet architecture which contains five convolutional layers, three maximum pooling layers, and three fully connected layers; b Numeric values specify hyperparameters used in AlexNet [13]
Hence, ZFNet was re-architected based on modifications of AlexNet. The DeconvNet was introduced as a visualization technique that provides insight into the function of intermediate activation layers and illustrates how a CNN can be improved. With the DeconvNet, the authors were able to find CNN model architectures that outperform AlexNet on the ImageNet benchmark. As a matter of fact, ZFNet also works very well on the Caltech-101 and Caltech-256 datasets. Figure 10.10 shows the first eight layers of ZFNet. It can be seen that a smaller spatial filter size is used in the first layer of ZFNet compared with the 11 × 11 filter size used in AlexNet. The effect of the filter size is discussed in many digital image processing books.

As most CNNs are constructed by trial and error, the DeconvNet gives us a valuable tool for analyzing the excitation of the input stimuli on the activation maps at any layer and for observing the evolution of features during training. The latter provides a means for diagnosing potential problems in a CNN model. Figure 10.11 shows how a DeconvNet layer is attached to a convolutional layer. The DeconvNet can reverse the process of any convolutional layer so that an activation map is "mapped" back to the input image.
Fig. 10.10 The ZFNet architecture [31]
Fig. 10.11 The DeconvNet used in ZFNet architecture [31]: the right column shows a DeconvNet is attached to a convolutional layer on the left column
In other words, it traces back from the features to the input pixels. A DeconvNet uses the same filters as the original ZFNet. The reverse process starts from the current layer and goes through a series of unpooling, rectification, and filtering operations for each preceding layer until the input layer is reached [31]. The diagnosis provided by the DeconvNet reveals the strengths and weaknesses of a CNN.
Fig. 10.12 An example of the VGGNet architecture with 8 to 16 convolutional layers and three fully connected layers [22]
Simonyan and Zisserman developed a CNN called VGGNet, which won first place and second place in the localization and classification tracks, respectively, of the 2014 ILSVRC competition [22]. The goal of VGGNet was to investigate the effect of the depth of a CNN on the accuracy of a large-scale image recognition task. Experimental results show that VGGNet, with a depth of 16–19 layers and small 3 × 3 spatial filters, achieves a significant improvement over earlier state-of-the-art CNNs. An example of VGGNet is shown in Fig. 10.12. It can be observed that there is not much difference in architecture between AlexNet and VGGNet except for the depth of the layers and the size of the spatial filters used. In addition, 1 × 1 filters are used to increase the degree of nonlinearity without affecting the receptive field. An important finding of VGGNet is that deeper layers work better for recognition in a CNN. It also indicates that multiple small spatial filters can detect features better than a single large spatial filter.

There exist several software tools for constructing, training, and testing CNNs. These tools include ConvNet for MATLAB [33], Theano for CPU and GPU [34], Pylearn2 for Python [35], TensorFlow developed by Google [36], Caffe [37], WaveNet (which generates human speech, by DeepMind) [26], and Project Catapult (Microsoft, using special hardware) [38].
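To illustrate the idea of stacking small 3 × 3 filters instead of a single large filter, the following Keras fragment (our own sketch, not the published VGGNet configuration) defines VGG-style blocks of two 3 × 3 convolutions followed by 2 × 2 max pooling; two stacked 3 × 3 layers cover a 5 × 5 receptive field with fewer parameters and an extra nonlinearity.

import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=(224, 224, 3)))
for filters in (64, 128):
    # One VGG-style block: two stacked 3x3 convolutions and a 2x2 max pooling.
    model.add(tf.keras.layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(tf.keras.layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(1000, activation="softmax"))  # e.g., 1000 classes
model.summary()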
10.3 Transfer Learning in CNNs
We can think of the CNN as a heterogeneous parallel computing paradigm in which features are extracted and distributed from a low level to a higher level across the multiple layers of the entire network. To extract representative features in such a deep network, CNN training requires a significant number of sample images to achieve high accuracy for the task at hand. Therefore, a tremendous amount of time is required for training. Graphics processing units (GPUs) and special hardware are commonly used to expedite the computation. It is also quite common to normalize the dataset, in which each sample has the mean of the
dataset subtracted and is divided by the standard deviation, similar to what is done in traditional artificial neural networks. Since a large dataset is required to train a CNN for an application, we cannot make full use of the CNN if a sufficient amount of data is not available. To solve this problem, transfer learning is widely used to overcome the shortage of data for CNN training [7, 10, 19, 20, 24, 30]. The idea is to let a CNN learn knowledge from a sufficiently large dataset and store that knowledge in the weights (synaptic connections) distributed over the entire network; this is usually called the source domain. We can then copy those weights to another, similar CNN architecture for fine-tuning and for solving a different problem; this is called the target domain. In other words, we do not have to train a CNN from scratch in the target domain, where we may not have a large enough amount of data. However, we still need to retrain the CNN with the "cloned" weights and a limited amount of data in the target domain.

There are several successful applications of transfer learning in CNNs [10, 19]. Oquab et al. designed a scheme in which the knowledge learned with CNNs on the source domain dataset was transferred to CNNs in a target domain that has a limited amount of training data. They trained a CNN using the ImageNet dataset, a large-scale annotated dataset, as a generic feature extractor and then reused this generic feature extractor in the target domain [19]. In the target domain, all pre-trained parameters from layer C1 to layer FC7 are transferred to the new CNN, as depicted in Fig. 10.13. A new adaptation layer (the fully connected layers FCa and FCb) is then created and trained with a limited amount of labeled data, the PASCAL VOC dataset, for the target domain.
Fig. 10.13 A transfer learning scheme from the source domain to the target domain [19]
With this transfer learning, they were able to achieve state-of-the-art visual recognition results on the PASCAL VOC dataset. However, Yosinski et al. studied transfer learning in CNNs and, based on their experiments, discovered several issues that affect transferability [30].
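The following Keras sketch (ours; it uses VGG16 weights pre-trained on ImageNet purely as an illustration, not the exact scheme of [19]) shows the typical transfer learning recipe: freeze the convolutional layers trained in the source domain and train only a new classification head on the limited target-domain data.

import tensorflow as tf

# Source-domain knowledge: convolutional layers pre-trained on ImageNet.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # keep the transferred weights fixed during initial training

# Target-domain adaptation: a small, newly trained classification head.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(20, activation="softmax"),  # e.g., 20 target-domain classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Later, base.trainable can be set to True with a small learning rate for fine-tuning.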
10.4 Image Texture Classification Using Convolutional Neural Networks (CNN)
Convolutional neural networks (CNN) have almost become the de facto neural network in the area of visual and pattern recognition. Image texture classification is an application of visual and pattern recognition; therefore, the way the CNN is used for image texture classification is similar to the way it is applied to other visual and pattern recognition tasks. Some excellent work has been reported in the literature on image texture classification using convolutional neural networks (CNN) [7, 11, 13, 24]. As stated earlier, CNNs extract hierarchical features automatically from the input as learning moves from one layer to the next in a deep neural network. The CNNs used for image texture classification are either image-based or patch-based networks [24]. Both methods require many training samples. However, in the patch-based method, we can obtain a sufficient number of different patches from an input image and apply the data augmentation technique to generate enough samples for training. The data augmentation technique is mainly used to generate new image samples with variations, such as rotation, illumination, and scale changes, that differ from the original samples [29].

The motivation for applying the patch-based CNN is the periodic property of image texture [24]. Figure 10.14 shows an example obtained by taking some textural images, randomly selecting some patches, and swapping them. Although a patch is quite small compared with the original image, it still provides adequate information for discriminating different textures. Hence, the patch-based CNN can recognize a randomly selected patch as well as an entire image. Even when some patches are swapped, the reconstructed image texture still appears very similar to the original image [24].

There are some advantages associated with patch-based CNNs [24]. It is well known that natural images, including textural images, are usually captured under different imaging conditions, such as varying illumination, rotation angles, partial occlusion, and scales due to varying distances between the sensor and the target. In such settings, a robust classifier must be able to accommodate all of these variations. In other words, a classifier should be provided with all of these variations as inputs for model learning. The data augmentation method is a good fit for the patch-based CNN: it adds to the training set many new patches obtained by transforming the original patches. Two types of transformation are widely used: the Gaussian pyramid method to create a set of multi-scale patches in the scale space, and the rotation transformation using different angles [24].
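A minimal sketch of this patch-and-augment idea is given below (our own illustration assuming a single-channel texture image; it is not Sun's implementation [24]): random patches are cropped from the image, and each patch is additionally rotated to enlarge the training set.

import numpy as np

def extract_patches(image, patch_size, n_patches, rng):
    """Randomly crop square patches from a single-channel texture image."""
    h, w = image.shape
    patches = []
    for _ in range(n_patches):
        r = rng.integers(0, h - patch_size + 1)
        c = rng.integers(0, w - patch_size + 1)
        patches.append(image[r:r + patch_size, c:c + patch_size])
    return patches

def augment_with_rotations(patches):
    """Simple data augmentation: add the 90-, 180-, and 270-degree rotations."""
    augmented = []
    for p in patches:
        augmented.extend([p, np.rot90(p, 1), np.rot90(p, 2), np.rot90(p, 3)])
    return augmented

rng = np.random.default_rng(0)
texture = rng.random((256, 256))                   # stand-in for a texture image
patches = extract_patches(texture, 53, 100, rng)   # 53 x 53 patches, as in [24]
training_set = augment_with_rotations(patches)
print(len(training_set))                           # 400 patches after augmentation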
Fig. 10.14 a and b show two image textures, leaves and pine straw needles; c and d illustrate that when some patches are randomly selected and swapped, the reconstructed image texture still appears very similar to the original image [24]
Sun designed two different CNN architectures for image texture classification and compared their performance [24]: the image-based CNN and the patch-based CNN. The former is similar to the regular CNN. The patch-based CNN, abbreviated as p-CNN, is shown in Fig. 10.15. It consists of one p-CNN for training and another p-CNN for classification. The first four layers of the two p-CNNs are identical. The difference between them is that the p-CNN for classification has an extra layer for bag-of-patches pooling, which aggregates the confidence scores of the patches for assigning a texture class. To show the effectiveness of the CNN for image texture classification, Sun conducted experiments on four benchmark image texture datasets, including Brodatz, CUReT, KTH-TIPS, and UIUC [1, 2, 8, 14]. The image-based CNN and the p-CNN were also compared with four image texture classification methods [17, 27, 28, 32], namely (1) SRP, an extension of the patch-based method; (2) VZ_Joint, a patch-based method; (3) VZ_MR8, a filter-bank-based method; and (4) Zhang's method, a bag-of-keypoints method. These four methods differ in the algorithmic aspect of how features are extracted from image textures. The experimental results on the four benchmark datasets show that the p-CNN has much higher classification accuracies than the image-based CNN on three datasets and slightly lower accuracy on the CUReT dataset. In the comparison with the four image texture classification methods, the p-CNN achieves either competitive or better results.
Fig. 10.15 An architecture of the patch-based CNN: a a patch-based CNN consisting of two convolutional layers (including pooling layers) and two fully connected layers for training, and b a patch-based CNN for classification that has an extra layer for the bag-of-patches pooling [24]
It is ideal to use the p-CNN for image texture classification. Similar to the selection of a size for a K-views template discussed earlier, it is a challenge to choose an appropriate patch size for the p-CNN to achieve high classification accuracy. Sun used a patch size of 53 × 53 in his image texture classification experiments and concluded that the p-CNN is suitable for any type of dataset, including patches generated using the data augmentation technique, while the image-based CNN is more appropriate for image textures with homogeneity and fewer variations [24].
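To illustrate the bag-of-patches pooling described above, the following sketch (ours; a schematic stand-in, not the exact pooling layer of [24]) classifies a whole texture image by averaging the per-patch class scores produced by a trained patch classifier and picking the class with the highest mean score.

import numpy as np

def classify_image_from_patches(patch_scores):
    """Bag-of-patches pooling: average the per-patch class scores and
    return the index of the winning texture class.

    patch_scores : array of shape (n_patches, n_classes) with confidence
                   scores (e.g., softmax outputs) for each extracted patch.
    """
    mean_scores = patch_scores.mean(axis=0)
    return int(np.argmax(mean_scores)), mean_scores

# Toy example: 5 patches scored over 3 texture classes by a hypothetical p-CNN.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.8, 0.1, 0.1],
                   [0.5, 0.4, 0.1]])
label, confidence = classify_image_from_patches(scores)
print(label, confidence)   # class 0 wins with the highest average confidence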
10.5 Summary
Deep machine learning techniques have emerged as the state-of-the-art methods for computer vision and pattern recognition. The convolutional neural network (CNN), one of the deep machine learning techniques, has been widely used for image texture recognition. The features generated by the CNN provide a new set of textural characteristics for image texture classification. This is a different approach from the handcrafted features used in traditional pattern recognition. The CNN can learn and transform the input image texture into a set of feature descriptors. LeNet-5 is one of the CNN models that is frequently used for image recognition and classification. There are many advantages associated with the CNN model, such as its robustness to a certain degree of shift, scaling, and rotation. This spatial invariance is achieved through the convolutional and pooling layers in the CNN. However, the CNN model requires a tremendous amount of representative data for learning, and we may not be able to obtain a sufficient amount of data for the task to be solved. Transfer learning is thus used to compensate for the lack of training data. In other words, we borrow knowledge from a CNN that has been trained on another dataset in the so-called source domain and use the well-trained weights to initialize the current CNN for fine-tuning in the target domain. In a sense, we pre-train the current CNN on a large dataset, for example, ImageNet, and then fine-tune the parameters of the current CNN with the dataset of the current task domain. Several extremely large CNNs have been proposed for image classification and have achieved significant breakthroughs in the development of CNNs; these include AlexNet, ZFNet, and VGGNet. The patch-based CNN model has shown its promise in image texture classification. As new convolutional neural networks evolve rapidly, we may expect that the accuracy of image texture classification will improve significantly in the near future.
10.6 Exercises
1. Perform experiments with different patch sizes for image texture classification using the patch-based CNN.
2. Apply the data augmentation technique to the patches extracted in Exercise 1.
3. Compare the accuracy of image texture classification using the image-based and patch-based CNN models.
4. Compare the accuracy of image texture classification of the CNN models with and without transfer learning.
References

1. Brodatz P (1999) Textures: a photographic album for artists and designers. Dover Publications. ISBN 0486406997
2. Dana KJ, van Ginneken B, Nayar SK, Koenderink JJ (1999) Reflectance and texture of real-world surfaces. ACM Trans Graph 18(1):1–34
3. Davies ER (2018) Computer vision: principles, algorithms, applications, and learning, 5th edn. Academic Press, New York
4. Fukushima K, Miyake S, Takayuki I (1983) Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans Syst Man Cybern SMC-13(5):826–834
5. Gonzalez RC, Woods RE (2002) Digital image processing, 2nd edn. Prentice Hall, Englewood Cliffs
6. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. The MIT Press, Cambridge
7. Hafemann LG (2014) An analysis of deep neural networks for texture classification. M.S. thesis, Universidade Federal do Parana
8. Hayman E, Caputo B, Fritz M, Eklundh J (2004) On the significance of real-world conditions for material classification. In: European conference on computer vision, vol 4, pp 253–266
9. Heaton J (2015) Artificial intelligence for humans, volume 3: deep learning and neural networks. Heaton Research, Inc.
10. Huang Z, Pan Z, Lei B (2017) Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data. Remote Sens 9:907. https://doi.org/10.3390/rs9090907
11. Jarrett K, Kavukcuoglu K, Ranzato M, LeCun Y (2009) What is the best multi-stage architecture for object recognition? In: IEEE 12th international conference on computer vision, Kyoto, Japan, 29 Sept–2 Oct 2009, pp 2146–2153
12. Karpathy A, Li F-F (2015) Deep visual-semantic alignments for generating image descriptions. In: CVPR
13. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
14. Lazebnik S, Schmid C, Ponce J (2005) A sparse texture representation using local affine regions. IEEE Trans Pattern Anal Mach Intell 27(8):1265–1278
15. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
16. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp 2278–2324
17. Liu L, Fieguth P, Clausi D, Kuang G (2012) Sorted random projections for robust rotation-invariant texture classification. Pattern Recogn 45(6):2405–2418
18. Liu L, Chen J, Fieguth P, Zhao G, Chellappa R, Pietikainen M (2018) BoW meets CNN: two decades of texture representation. Int J Comput Vis 1–26. https://doi.org/10.1007/s11263-018-1125-z
19. Oquab M, Bottou L, Laptev I, Sivic J (2013) Learning and transferring mid-level image representations using convolutional neural networks. INRIA Technical report HAL-00911179
20. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
21. Snyder WE, Qi H (2004) Machine vision. Cambridge University Press, Cambridge
22. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR
23. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
24. Sun X (2014) Robust texture classification based on machine learning. Ph.D. thesis, Deakin University
25. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, 7–12 June 2015
26. Van Den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. In: SSW, p 125
27. Varma M, Zisserman A (2005) A statistical approach to texture classification from single images. Int J Comput Vis 62(1):61–81
28. Varma M, Zisserman A (2009) A statistical approach to material classification using image patch exemplars. IEEE Trans Pattern Anal Mach Intell 31(11):2032–2047
29. Wong SC, Gatt A, Stamatescu V, McDonnell MD (2016) Understanding data augmentation for classification: when to warp? In: International conference on digital image computing: techniques and applications (DICTA), Gold Coast, QLD, Australia, 30 Nov–2 Dec 2016
30. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: Advances in neural information processing systems (NIPS 2014), vol 27
31. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Fleet D et al (eds) ECCV 2014, Part I, LNCS, vol 8689. Springer International Publishing, Switzerland, pp 818–833
32. Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2):213–238
33. https://www.mathworks.com/products/deep-learning.html
34. http://deeplearning.net/software/theano/
35. http://deeplearning.net/software/pylearn2/
36. https://www.tensorflow.org/
37. http://caffe.berkeleyvision.org/
38. https://www.microsoft.com/en-us/research/project/project-catapult/
Index
A Activation functions, 202, 208 Activation maps, 234, 238, 242 Advanced K-views algorithms, 183, 186, 192–196 AlexNet architecture, 241, 242 Algorithms Ant Colony Optimization (ACO), 79, 97 Artificial Bee Colony Algorithm (ABC), 93 Artificial Neural Networks (ANNs), 99, 202, 207, 221, 230, 233, 245 artificial neuron, 202, 203 backpropagation algorithm, 207–211, 218, 220, 234, 235 backpropagation training algorithm, 210, 234 Bee Algorithm (BA), 93, 94 Bee Colony Optimization Algorithm (BCO), 93 Bee Hive Algorithm (BeeHive), 93 Bee Swarm Optimization Algorithm (BSO), 93, 95 biological neural network (BNN), 202, 204 biological neuron, 202, 203 classification algorithms, 11, 32, 41, 45, 47, 139, 202 Credibility Clustering Algorithm (CCA), 51, 72, 74, 75 datagram-based algorithm, 153 Fuzzy C-means (FCM), 51, 55, 58, 59, 61, 99, 123, 137, 153, 192 Fuzzy K-Nearest-Neighbor (Fuzzy K-NN), 51, 60, 61 Fuzzy Weighted C-means (FWCM), 51, 61, 63–68, 99 Generalized Possibility Clustering Algorithm (GPCA), 51, 72 K-means-ACO, 79, 80, 82–84
K-means algorithm, 12, 20, 51–54, 58, 59, 79, 83, 86, 87, 93, 97, 110, 137, 153, 163, 187, 192, 226 K-means Algorithm and Genetic Algorithms (K-Means-GA), 83, 85, 89, 90, 226 K-means Algorithm and Simulated Annealing (K-Means-SA), 86–90 K-Nearest-Neighbor (K-NN), 51, 54, 55 K-views algorithm using gray-level co-occurrence matrix (K-views-G) algorithm, 192–194 K-views Datagram (K-views-D) algorithm, 149, 153–157, 159, 160, 163, 164, 172–176, 178–181, 183, 186, 192–197 K-views model, 5, 131, 149, 163, 192, 233 K-views rotation-invariant features (K-views-R) algorithm, 163, 164, 169–181, 192, 193, 195, 196 K-views-R with grayness algorithm, 176–182, 197 K-views Template (K-views-T) algorithm, 135–137, 139, 141, 144–150, 155–157, 159, 160, 163, 164, 172–176, 178–181, 183, 186, 192, 193, 195–197 lacunarity, 36–38 Linde–Buzo–Gray (LBG) algorithm, 110 metropolis algorithm, 86 New Weighted Fuzzy C-means (NW-FCM), 51, 64–66, 96, 97, 99 Pollen-based Bee Algorithm for Clustering (PBA), 93–98 simple competitive learning algorithm using genetic algorithms (SCL-GA), 223–226 training algorithm, 206, 211 Virtual Bee Algorithm (VBA), 93–95 Weighted K-views voting algorithm (K-views-V), 183, 185–187, 192 253
254 Ant Colony Optimization (ACO), 79, 95, 97 Approximation coefficients, 31 Architecture AlexNet, 241, 242, 244, 249 Convolutional Neural Networks (CNN), 11, 233, 234, 241 DeconvNet, 242, 243 deep belief networks (DBN), 201, 218, 220, 230 LeNet-5, 233, 237, 238 VGGNet, 241, 244, 249 ZFNet, 241–243, 249 Artificial neural networks, 11, 202, 207, 221, 230, 233, 245 Autocorrelation function, 15, 16, 32, 40 Autoregressive model, 8 Axon, 202, 203 B Backpropagation training algorithm, 210, 234 Bag-of-Words (BoW), 9 Basic K-views template algorithm, 136, 137, 139–143, 145, 149, 192 Basic neuron model, 201 Basis, 8, 12, 23, 25, 28–30, 90, 104–109, 115, 118, 119, 124 Basis functions, 23–25, 27–30, 109, 119, 120 Basis images, 8, 12, 104, 106–109, 118 Bayes classification, 33, 54 Bee Algorithm (BA), 93, 94, 99 Bee Colony Optimization Algorithm (BCO), 93–95 Bee Hive Algorithm (BeeHive), 93 Bee Swarm Optimization Algorithm (BSO), 93–95 Bhattacharyya distance, 85 Biological Neural Network (BNN), 202, 204 Biological neuron, 202, 203 Biological visual cortex, 227 Boltzmann Machines (BM), 201, 215–218, 230 Brodatz gallery dataset, 6, 154–157, 193 C Characteristic view, 5, 131, 135–141, 144–147, 149, 150, 152–154, 157, 159, 163–165, 168, 169, 171, 172, 175–181, 183, 186, 187, 190–194, 196, 197 Classification accuracy, 9, 12, 16, 32, 33, 44, 47, 64, 65, 74, 84, 96, 97, 104, 113, 123, 124, 132, 141, 147, 149, 150, 155, 159, 163, 183, 193, 194, 196, 239, 248 Classification algorithms, 11, 12, 32, 33, 41, 45, 47, 51, 104, 113, 139, 202 Clique, 34, 35
Index Clique potentials, 34 Cluster center, 12, 52–54, 56, 59, 61, 63–65, 69–71, 74, 75, 79–82, 84, 85, 87, 92, 94, 99, 110, 111, 137, 223, 224 Clustering algorithms, 4, 11, 52, 55, 56, 58, 61, 66, 68, 69, 72, 75, 79, 90, 92, 93, 96, 97, 99, 118, 142, 215 Codebook, 110, 111 Coefficients approximation, 31 detail, 31 Cognitron and Neocognitron, 201, 233 Color feature, 16, 44, 45 RGB, 37, 38 Competitive learning, 110, 221, 223, 224 Computer vision, 3, 12, 15, 22, 103, 233, 249 Convolution, 22, 187, 190, 191, 197, 234–237 Convolutional layer, 11, 234, 235, 237–244, 248 Convolutional Neural Networks (CNN), 3, 9, 11, 12, 201, 231, 233–239, 241, 242, 244–249 Correlation, 17–20, 32, 38, 44, 45, 112, 123, 137, 142, 192, 193, 235, 236, 241 Correlation matching method, 133, 135, 136 Correlative views, 167–171, 177, 178, 184–187, 190, 196 Credibility Clustering Algorithm (CCA), 51, 72–75 Credibility measure, 74 Crossover, 85, 89, 223, 225 CUReT dataset, 247 D datagram, 136, 138, 139, 145, 146, 149–155, 159, 160, 163 Datagram-based algorithm, 153, 155 Dataset Brodatz, 6, 45, 141, 144, 154–157, 172, 193, 247 CUReT, 6, 247 KTH-TIPS, 6, 247 PASCAL VOC, 245, 246 UIUC, 6, 247 Daubechies wavelets, 25, 26 DeconvNet, 242, 243 Deep Belief Networks (DBN), 201, 218, 220, 230 Deep machine learning, 3, 11, 12, 197, 201, 230, 231, 233, 241, 249 Dendrite, 202, 203 Detail coefficients, 31 Dictionary learning, 106, 121, 122
Index Digital image processing, 3, 5, 30, 235, 237, 238, 242 Dimension, 17, 23, 35–37, 47, 52, 84, 103–107, 112, 113, 118, 123, 124, 164, 166, 190, 221, 224, 225, 234, 239, 242 Dimensionality Reduction (DR), 103–105, 112, 123–125 Discrete Fourier transform, 109 Discrete wavelets, 25 Discrete Wavelet Transform (DWT), 30, 31 Discriminant Analysis Feature Extraction (DAFE), 64, 65 Discriminative features, 105, 125 Distance similarity, 52, 59 Downsampling function, 239 Dropout technique, 241 E Energy function, 34, 35, 212, 214, 215, 217–219 Entropy, 18, 19, 45, 142, 166, 169 Euclidean dimension, 35 Euclidean distance, 39, 52, 78, 80, 82, 85, 133, 134, 152, 168, 169, 178, 184, 192, 225 Euclidean space, 35, 36, 56 F Fast weighted K-views voting algorithm (fast weighted K-views-V), 187 Feature Extraction (FE), 3, 4, 6, 8, 11, 12, 15, 21, 45, 61, 64, 104, 105, 131, 147, 184, 196, 201, 233, 235, 239 autocorrelation function, 15, 16, 32, 40 color, 16, 44, 45 fractal features, 15, 17, 36 Gray-Level Co-occurrence Matrices (GLCM), 3, 6, 8, 15, 17, 45, 131, 142, 145 Local Binary Pattern (LBP), 3, 8, 15–17, 40–47, 131, 149, 150, 160 Markov Random Fields (MRF), 8, 15, 32–35, 45, 142–146 semivariogram, 38 Texture Spectrum (TS), 8, 15, 40, 41, 43–47, 160 Wavelet Transforms (WT), 15, 22–25, 28–31, 36, 109 Feature maps, 11, 221, 234, 239, 241 Feature Selection (FS), 104, 105 Feature vector, 5, 6, 9, 16, 20, 21, 28–31, 34, 38, 76, 103, 113, 114, 120, 125, 142,
Feed-forward multi-layer neural networks, 201, 202, 204, 207–209, 230
Forward pass, 207, 209–211
Fourier Transform (FT), 23–25
  discrete Fourier transform, 109
  fast Fourier transform (FFT), 183, 184, 187, 188, 190–193, 197
  inverse FFT (IFFT), 190, 191
Fourth-order moments, 167
Fractal features, 15, 17, 36
Fractal model, 36
Fractional dimension, 35
Frequency, 16–18, 21–25, 41, 109, 135, 136, 149–151, 153, 166
Frobenius norm, 117
Fully connected layer, 11, 242
Fuzzy C-means (FCM), 51, 55, 58, 59, 64–71, 74, 75, 96, 97, 99, 123, 124, 137, 153, 192
Fuzzy K-Nearest-Neighbor (Fuzzy K-NN), 51, 60, 61
Fuzzy membership function, 56, 57
Fuzzy set, 55–58, 60, 70, 74
Fuzzy variable, 74
Fuzzy Weighted C-means (FWCM), 51, 61, 63–68, 96, 97, 99

G
Gabor filters, 8, 15, 21–23
Gaussian distribution, 33
Gaussian MRF (GMRF), 142–146
Generalized Possibility Clustering Algorithm (GPCA), 51, 70–75, 96, 97
Genetic Algorithms (GA), 51, 83, 85, 97, 211, 223, 225, 230
Geometrical similarity (Gestalt), 227
Geostatistics, 38, 142
Gibbs distribution, 33
Gibbs random fields, 8, 35, 45
Gray-Level Co-occurrence Matrices (GLCM), 3, 6, 8, 15, 17, 45, 131, 142, 145

H
Haar wavelets, 25, 26, 28, 30
Hausdorff–Besicovitch, 35
High-dimensional data classification, 76
Histogram, 7, 17, 41, 43, 44, 138, 139, 149, 150, 153, 166, 167, 169
Hopfield Neural Networks (HNN), 201, 211–216, 218, 230
Hughes phenomenon, 104
Hyperparameters, 11, 237, 242
Hyperspectral images, 9, 62, 66, 104, 105, 112, 123

I
Image
  hyperspectral, 9, 37, 62, 66, 104, 105, 112, 123
  integral, 189
  natural and remote sensing, 3
  Summed Square Image (SSI), 183, 184, 187–189, 191
  ultrasonic prostate, 157, 159
Image-based CNN, 233, 247, 248
ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), 241, 244
Image representation
  Non-negative Matrix Factorization (NMF), 103, 104, 109, 118, 119, 125
  Principle Component Analysis (PCA), 103–105, 112–115, 118, 119, 123, 125
  Singular Value Decomposition (SVD), 103, 104, 109, 115, 122, 125
  Sparse Coding (SC), 103–105, 115, 119, 122, 125
Image texture, 3–12, 15–17, 20, 35–38, 41, 44, 47, 51, 76, 97, 99, 103, 119, 131, 132, 135, 139, 142, 144–147, 149, 150, 153, 159, 160, 171–173, 180–185, 192, 193, 197, 233, 246–249
Image texture analysis, 3, 7, 11, 12, 15, 16, 97, 99, 132
Image texture classification
  Credibility Clustering Algorithm (CCA), 51, 72, 74, 75
  Fuzzy C-means (FCM), 51, 55, 58–61, 64–66, 68–70, 74, 96, 99, 123, 124, 137, 153, 192
  Fuzzy K-Nearest-Neighbor (Fuzzy K-NN), 51, 60, 61
  Fuzzy Weighted C-Means (FWCM), 51, 61, 63–68, 96, 97, 99
  Generalized Possibility Clustering Algorithm (GPCA), 51, 70–73, 75, 96, 97
  K-Nearest-Neighbor (K-NN), 51, 54, 55, 60, 61
  New Weighted Fuzzy C-means (NW-FCM), 51, 64–68, 96, 97, 99
  Possibility Clustering Algorithm (PCA), 51, 66, 68–70, 72, 73, 96, 99, 105, 112
  Support Vector Machines (SVM), 51, 76, 78, 79
Image texture model, 3, 15
Image textures, 8
Independent Component Analysis (ICA), 123
Integral image, 189
Invariance
  rotation, 41
Invariant feature map, 11
Inverse FFT (IFFT), 190, 191
Inverse transformation, 109
Ising model, 212

J
Jeffries–Matusita (JM) distance, 85–90

K
K-means-ACO, 79, 80, 82–84
K-means Algorithm and Genetic Algorithms (K-Means-GA), 83, 85, 89, 90, 226
K-means Algorithm and Simulated Annealing (K-Means-SA), 86–90
K-Nearest-Neighbor (K-NN), 51, 54, 55, 60
K-SVD, 122, 125
KTH-TIPS dataset, 6, 247
Kurtosis, 166, 167, 169
K-views algorithm using gray-level co-occurrence matrix (K-views-G), 192, 194, 196
K-views datagram (K-views-D) algorithm, 149, 153–157, 159, 160, 163, 164, 172–176, 178–181, 183, 193
K-views model, 5, 131, 149, 163, 192, 233
K-views rotation-invariant features (K-views-R) algorithm, 163, 164, 169–181, 192, 196, 197
K-views-R with grayness algorithm, 176–182, 197
K-views template, 5, 131, 135–137, 139–147, 150, 153, 159, 163, 183, 248
K-views Template (K-views-T) algorithm, 135–137, 139, 141, 144–150, 155–159, 160, 163, 164, 172–176, 178–181, 183, 186, 192, 193, 195–197

L
Lacunarity algorithm, 37
Lagrange multiplier, 63, 64, 66, 76, 78
Learnable filter, 235
Likelihood function, 33
Linde–Buzo–Gray (LBG) algorithm, 110
Local Binary Pattern (LBP), 3, 8, 15–17, 36, 40–46, 131, 149, 150, 153, 160
Local receptive field, 237

M
Markovian property, 33
Markov Random Fields (MRF), 8, 15, 32–34, 45, 143
Maximum a posteriori (MAP), 33, 34
Maximum Likelihood (ML), 33
McCulloch–Pitts model, 202
Mean, 8, 9, 18, 33, 37, 38, 61–65, 68, 71, 72, 85, 96, 97, 104, 113, 114, 166, 167, 169, 176, 192, 193, 196, 223–225, 244
Mean Square Error (MSE), 223, 225
Medical diagnosis, 24
Medical MRI imaging, 15
Membership function, 52, 55–57, 69, 70, 72–75
Metropolis algorithm, 86
Model
  autoregressive, 8
  basic neuron, 201
  3-D fractal, 36
  3-D MRF, 33
  image texture, 3, 15
  Ising, 212
  K-views, 5, 131, 149, 163, 192, 233
  McCulloch–Pitts, 202
  Wold, 8
Model-based method, 6, 8, 12, 16
Momentum, 210
Morlet wavelets, 26
Mother wavelet, 25, 27
Mutation, 85, 89, 93, 223, 225

N
Nagao filter, 168
Naïve Bayes Classifier, 33
Natural and remote sensing image, 3
Neocognitron, 201, 227–231, 233, 234, 238
Neural networks, 11, 110, 201, 211, 215, 216, 223–225, 227, 230, 233, 234, 236, 238, 241, 246
Neuron, 103, 202–205, 207–219, 221, 223, 224, 227–230, 233, 238, 239, 241
New Weighted Fuzzy C-means (NW-FCM), 51, 64–68, 96, 97, 99
Non-negative Matrix Factorization (NMF), 103, 104, 109, 118, 119, 123–125
Nonparametric Weighted Feature Extraction (NWFE), 61–65, 105

O
Optimal hyperplane, 76
Output layer, 11, 202, 204, 208, 209, 211, 217, 218, 221, 224, 237–239
Over-fitting problem, 241
P
Padding, 237
Particle Swarm Optimization (PSO), 79, 95, 97
PASCAL VOC dataset, 245, 246
Patch-based CNN (p-CNN), 233, 246–249
Pattern recognition, 3, 4, 6, 10, 12, 15, 51, 55, 64, 93, 97, 103–105, 110, 112, 119, 122, 201, 202, 207, 221, 230, 233, 234, 246, 249
Peak signal-noise-ratio (PSNR) measure, 110
Perceptron, 76, 201, 205–207, 209, 210, 230, 231
Pollen-based Bee Algorithm for Clustering (PBA), 51, 93–96, 98
Pooling layer, 234, 237–242, 248, 249
Possibility Clustering Algorithm (PCA), 51, 66, 68–70, 72, 73, 96
Possibility measure, 74
Principle Component Analysis (PCA), 103, 104, 115, 125
Prior probability, 33
Probability Density Functions (PDF), 33, 34
Probability model, 12, 45

Q
Quantum-modeled clustering algorithm (Quantum K-Means), 51, 90, 92, 93, 97

R
Receptive field, 22, 227–229, 234, 235, 238–240, 244
Rectified linear function (ReLU), 238, 239
Reproduction, 86, 223, 225
Restricted Boltzmann Machines (RBM), 218–220, 230
RGB color, 38
Rotated images, 173, 175, 179, 180, 182, 197
Rotation-invariant features, 163, 164, 166, 169, 172, 177, 180, 186

S
Satellite remote sensing, 15
Scaling, 25, 27, 28, 233, 249
Self-organizing learning of neocognitron, 230
Self-organizing Map (SOM), 110, 201, 202, 221, 222, 230, 231
Self-organizing Neocognitron, 227
Semivariogram, 38
Short-Time Fourier Transforms (STFT), 23
Sigmoid function, 204, 208
Sign function, 206
Similarity measure, 9, 54, 168, 178, 184, 187
Simple Competitive Learning algorithm using Genetic Algorithms (SCL-GA), 223–226
Simple competitive learning network, the (SCL), 221
Simulated Annealing (SA), 35, 86–88, 97, 215, 216
Singular Value Decomposition (SVD), 103, 104, 109, 115–118, 122–125
Sinusoidal transform, 109
Skewness, 166, 169
Soma, 202, 203
Sparse Coding (SC), 103–106, 115, 119, 122, 125
Sparse Representation (SR), 103, 105, 106, 110, 117, 119, 120, 122, 124, 125
Spatial filter, 133, 235–242, 244
Spatial kernel operator, the, 238
Standard deviation, 18, 124, 166, 167, 169, 245
Statistical Hopfield machine, 215
Statistical methods, 6, 7, 12, 16, 17, 142
Stride, 237, 239
Structural methods, 6, 7, 12, 16
Subsampling layer, 11
Summed Square Image (SSI), 183, 184, 187–193, 197
Supervised classification, 11
Supervised learning, 63, 113, 201, 202, 238
Supervised training, 218, 230
Support Vector Machines (SVM), 51, 76, 78, 79
Synapse, 202, 203, 227, 228, 230

T
Texel, 6–8, 16, 32, 45
Texton, 6, 7, 16, 45
Texture classification, 3–6, 8–12, 15–17, 20, 34, 44, 51, 76, 97, 119, 131, 132, 137, 139, 142, 144, 146, 147, 159, 171, 172, 180, 183, 184, 192, 201, 233, 246–249
Texture models, 4
Texture segmentation, 3, 4, 10–12, 155, 156
Texture Spectrum (TS), 8, 15, 40, 41, 43–47, 160
Texture synthesis, 4
Texture units, 8, 16, 40–43, 45
Third-order moments, 166
Traditional feed-forward multi-layer neural networks using the backpropagation (FMNN), 201
Traditional multi-layer neural networks (MLNN), 234
Training algorithms, 206, 210
Transfer functions, 202, 204, 206, 238
Transfer learning, 244–246, 249
Transformation matrix, 28, 108, 113, 115
Transform-based method, 3, 4, 6, 8, 12, 16
Transformed divergence, 85
Translation, 5, 24, 25, 27, 28, 238

U
UIUC dataset, 6, 247
Ultrasonic prostate image, 157, 159
Unpooling, 243
Unsupervised method, 11
Unsupervised pretraining, 218, 220
Unsupervised training method, 221

V
Variogram, 15, 38–40, 47, 142–146
Vector Quantization (VQ), 110, 111, 221
VGGNet, 244, 249
Views
  characteristic, 5, 131, 135–141, 144–147, 149, 150, 152, 153, 157, 159, 163–165, 168, 169, 171, 172, 175–179, 181, 183, 186, 187, 190–194, 196, 197
  K-views datagram, 153–157, 159, 160, 163, 172–175, 180, 181, 183, 193, 196
  K-views template, 5, 131, 135–137, 139–150, 153, 163, 172–175, 180, 181, 183, 192, 196, 248
Virtual Bee Algorithm (VBA), 93–95

W
Wavelets
  child, 25
  Daubechies, 25, 26, 30
  Haar, 25, 26, 28
  Morlet, 25, 26
  mother, 25, 27
Wavelet Transform (WT), 8, 15, 22–25, 30, 36, 109
Weighted K-views Voting Algorithm (K-views-V), 183, 186, 187, 192
Wold model, 8

Z
ZFNet architecture, 243