HANDBOOK OF PATTERN RECOGNITION AND COMPUTER VISION 6th Edition
HANDBOOK OF PATTERN RECOGNITION AND COMPUTER VISION 6th Edition editor
C H Chen
University of Massachusetts Dartmouth, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
HANDBOOK OF PATTERN RECOGNITION AND COMPUTER VISION, Sixth Edition. Copyright © 2020 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 978-981-121-106-5 (hardcover) ISBN 978-981-121-107-2 (ebook for institutions) ISBN 978-981-121-108-9 (ebook for individuals)
For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/11573#t=suppl
Printed in Singapore
The book is dedicated to the memory of the following pioneers of pattern recognition and computer vision: Prof. K.S. Fu, Dr. Pierre A. Devijver, Prof. Azriel Rosenfeld, Prof. Thomas M. Cover, Dr. C.K. Chow, Prof. Roger Mohr, and Prof. Jack Sklansky.
PREFACE TO THE 6TH EDITION

Motivated by the re-emergence of artificial intelligence, big data, and machine learning in the last six years, which have impacted many areas of pattern recognition and computer vision, this new edition of the Handbook is intended to cover both new developments involving deep learning and more traditional approaches. The book is divided into two parts: Part 1 on theory and Part 2 on applications.

Statistical pattern recognition is of fundamental importance to the development of pattern recognition. The book starts with Chapter 1.1, Optimal Statistical Classification, by Profs. Dougherty and Dalton, which reviews the optimal Bayes classifier in a broader context: an optimal classifier designed from sample data when the feature-label distribution is unknown, an optimal classifier that possesses minimal expected error relative to the posterior, etc. Though optimality includes a degree of subjectivity, it always incorporates the aim and knowledge of the designer. The chapter also deals with the topic of optimal Bayesian transfer learning, where the training data are augmented with data from a different source. From my observation of the last half century, I must say that it is amazing that the Bayesian theory of inference has such long-lasting value.

Chapter 1.2 by Drs. Shi and Gong, Deep Discriminative Feature Learning Method for Object Recognition, presents the entropy-orthogonality loss and the Min-Max loss to improve the within-class compactness and between-class separability of the convolutional neural network classifier for better object recognition.

Chapter 1.3 by Prof. Bouwmans et al., Deep Learning Based Background Subtraction: A Systematic Survey, provides a full review of recent advances in the use of deep neural networks for background subtraction, that is, the detection of moving objects in video taken by a static camera. Readers may also be interested in a related chapter, Statistical Background Modeling for Foreground Detection: A Survey, also by Prof. Bouwmans et al., in the 4th edition of the handbook series.

Chapter 1.4 by Prof. Ozer, Similarity Domains Network for Modeling Shapes and Extracting Skeletons without Large Datasets, introduces a novel shape modeling algorithm, the Similarity Domain Network (SDN), based on radial basis networks, a particular type of neural network that uses radial basis functions as the activation function in the hidden layer. The algorithm effectively computes similarity domains for shape modeling and skeleton extraction using only one image sample as data.

As a tribute to Prof. C.C. Li, who recently retired from the University of Pittsburgh after over 50 years of dedicated research and teaching in pattern recognition and computer vision, his chapter in the 5th edition of the handbook
series is revised as Chapter 1.5, entitled On Curvelet-Based Texture Features for Pattern Classification. The chapter provides a concise introduction to the curvelet transform, which is still a relatively new method for sparse representation of images with rich edge structure. The curvelet-based texture features are very useful for the analysis of medical MRI organ tissue images, for classification of critical Gleason grading of prostate cancer histological images, and for other medical as well as non-medical images.

Chapter 1.6 by Dr. Wang is entitled An Overview of Efficient Deep Learning on Embedded Systems. It is now evident that the superior accuracy of deep learning neural networks comes at the cost of high computational complexity. Implementing deep learning on embedded systems with limited hardware resources is a critical and difficult problem. The chapter reviews some of the methods that can be used to improve energy efficiency without sacrificing accuracy on cost-effective hardware. Quantization, pruning, and network structure optimization are also considered.

Pattern recognition often must deal with complex data from different sources (as in autonomous vehicles, for example) or from different feature extractors; learning from such data is called multi-view learning, and each modality or set of features is called a view. Chapter 1.7, Random Forest for Dissimilarity-Based Multi-View Learning, by Dr. Bernard et al. employs random forest (RF) classifiers for measuring dissimilarities. RFs embed a (dis)similarity measure that takes class membership into account in such a way that instances from the same class are similar. A Dynamic View Selection method is proposed to better combine the view-specific dissimilarity representations.

Chapter 1.8, A Review of Image Colourisation, by Dr. Rosin et al. brings us to a different, theoretical but practical, problem of adding color to a given grayscale image. Three classes of colourisation, including colourisation by deep learning, are reviewed in the chapter.

Chapter 1.9 on speech recognition is presented by Drs. Li and Yu, Recent Progress of Deep Learning for Speech Recognition. The authors note that recent advances in automatic speech recognition (ASR) have been mostly due to the advent of deep learning algorithms used to build hybrid ASR systems with deep acoustic models such as feedforward deep neural networks, convolutional neural networks, and recurrent neural networks. Progress is summarized in the two areas where significant effort has been devoted to ASR, namely, E2E (end-to-end) modeling and robust modeling.

Part 2 begins with Chapter 2.1, Machine Learning in Remote Sensing by Dr. Ronny Hänsch, providing an overview of remote sensing problems and sensors. It then focuses on two machine learning approaches, one based on random forest theory and the other on convolutional neural networks, with examples based on synthetic aperture radar image data. While much progress has
been made on information processing for hyperspectral images in remote sensing, the spectral unmixing problem still presents a challenge. Chapter 2.2 by Kizel and Benediktsson is on Hyperspectral and Spatially Adaptive Unmixing for Analytical Reconstruction of Fraction Surfaces from Data with Corrupted Pixels. Analysis of the spectral mixture is important for a reliable interpretation of spectral image data. The information provided by spectral images allows for distinguishing between different land cover types. However, due to the typically low spatial resolution of remotely sensed data, many pixels in the image represent a mixture of several materials within the area of the pixel. Therefore, subpixel information is needed in different applications; it is extracted by estimating the fractional abundances that correspond to pure signatures, known as endmembers. The unmixing problem has typically been solved by using spectral information only. In this chapter, a new methodology is presented based on a modification of the spectral unmixing method called Gaussian-based spatially adaptive unmixing (GBSAU). The problem of spatially adaptive unmixing is similar to the fitting of a certain function using grid data. An advantage of the GBSAU framework is that it provides a novel solution for unmixing images with both low SNR and non-continuity due to the presence of corrupted pixels. Remote sensing readers may also be interested in the excellent chapter in the second edition of the handbook series, Statistical and Neural Network Pattern Recognition Methods for Remote Sensing Application, also by Prof. Benediktsson.

Chapter 2.3, Image Processing for Sea Ice Parameter Identification from Visual Images, by Dr. Zhang introduces novel sea ice image processing algorithms to automatically extract useful ice information, such as ice concentration, ice types, and ice floe size distribution, which are important in various fields of ice engineering. It is noted that the gradient vector flow snake algorithm is particularly useful for ice boundary-based segmentation. More detail on the chapter is available in the author's recent book, Sea Ice Image Processing with Matlab (CRC Press, 2018).

The next chapter (2.4), by Drs. Evan Fletcher and Alexander Knaack, is Applications of Deep Learning to Brain Segmentation and Labeling of MRI Brain Structures. The authors successfully demonstrate deep learning convolutional neural network (CNN) applications in two areas of brain structural image processing. One application focuses on improving production and robustness in brain segmentation. The other aims at improving edge recognition, leading to greater biological accuracy and statistical power for computing longitudinal atrophy rates. The authors have also carefully presented a detailed experimental set-up for complex brain medical image processing using deep learning and a large archive of MRIs for training and testing. While there has been much increased interest in brain research and brain image
processing, readers may also be interested in other recent work by Dr. Fletcher reported in the chapter Using Prior Information to Enhance Sensitivity of Longitudinal Brain Change Computation, in Frontiers of Medical Imaging (World Scientific Publishing, 2015).

Chapter 2.5, Automatic Segmentation of IVUS Images Based on Temporal Texture Analysis, is devoted to a more traditional approach to intravascular ultrasound (IVUS) image analysis, using both textural and spatial (or multi-image) information for the analysis and delineation of the lumen and external elastic membrane boundaries. The use of multiple images in a sequence, processed by the discrete wavelet transform, clearly provides better segmentation results than many of those reported in the literature. We take this traditional approach because the available data set for the study is limited.

Chapter 2.6 by F. Liwicki and Prof. M. Liwicki, Deep Learning for Historical Document Analysis, provides an overview of the state of the art and recent methods in the area of historical document analysis, especially those using deep learning and Long Short-Term Memory networks. Historical documents differ from ordinary documents due to the presence of different artifacts. Their idea of detecting graphical elements in historical documents and their ongoing efforts toward the creation of large databases are also presented.

Graphs allow us to simultaneously model the local features and the global structure of a handwritten signature in a natural and comprehensive way. Chapter 2.7 by Drs. Maergner, Riesen, et al. thoroughly reviews two standard graph matching algorithms that can be readily integrated into an end-to-end signature verification framework. The system presented in the chapter is able to combine the complementary strengths of the structural approach and statistical models to improve signature verification performance. The reader may also be interested in the chapter in the 5th edition of the Handbook series, also by Prof. Riesen, on Graph Edit Distance: Novel Approximation Algorithms.

Chapter 2.8 by Prof. Huang and Dr. Hsieh is on Cellular Neural Network for Seismic Pattern Recognition. The discrete-time cellular neural network (DT-CNN) is used as an associative memory, which is then used to recognize seismic patterns. The seismic patterns are the bright spot pattern and the right and left pinch-out patterns, which have the structure of gas and oil sand zones. In comparison with the Hopfield associative memory, the DT-CNN has better recovery capacity. The results of seismic image interpretation using the DT-CNN are also good.

An automatic matching algorithm is necessary for a quick and accurate search of law enforcement face databases or surveillance cameras using a forensic sketch. In Chapter 2.9, Incorporating Facial Attributes in Cross-Modal Face Verification and Synthesis, by H. Kazemi et al., two deep learning frameworks are introduced to train a Deep Coupled Convolutional Neural Network for facial attribute-guided sketch-to-photo matching and synthesis. The
experimental results show the superiority of the proposed attribute-guided frameworks compared to state-of-the-art techniques.

Finally, in Chapter 2.10, Connected and Autonomous Vehicles in the Deep Learning Era: A Case Study on Computer-Guided Steering, by Drs. Valiente, Ozer, et al., the challenging problem of machine learning in self-driving vehicles is examined in general and a specific case study is presented. The authors consider the control of the steering angle as a regression problem where the input is a stack of images and the output is the steering angle of the vehicle. Considering multiple frames in a sequence helps to deal with noise and occasionally corrupted images, such as those caused by sunlight. The new deep architecture used to predict the steering angle automatically consists of a convolutional neural network, Long Short-Term Memory (LSTM), and fully connected layers. It processes both present and future images (shared by a vehicle ahead via vehicle-to-vehicle communication) as input to control the steering angle.

With a handbook of this size, or even ten times the size, it is clearly difficult to capture the full development of the field of pattern recognition and computer vision. Unlike a journal special issue, the book covers key progress in the theory and application of pattern recognition and computer vision. I hope readers will examine all six volumes of the Handbook series, which reflect the advances of nearly three decades in the field, to gain a better understanding of this highly dynamic field. With the support of the Information Research Foundation, free access to Vols. 1–4 of the Handbook series has been available since early July 2018. For your convenience, the URL links are as follows:

Vol. 1: https://www.worldscientific.com/worldscibooks/10.1142/1802#t=toc
Vol. 2: https://www.worldscientific.com/worldscibooks/10.1142/3414#t=toc
Vol. 3: https://www.worldscientific.com/worldscibooks/10.1142/5711#t=toc
Vol. 4: https://www.worldscientific.com/worldscibooks/10.1142/7297#t=toc

I would like to take this opportunity to thank all chapter authors throughout the years for their important contributions to the Handbook series. My very special thanks go to all chapter authors of the current volume.
C.H. Chen February 3, 2020
CONTENTS

Dedication  v
Preface  vii

PART 1: THEORY, TECHNOLOGY AND SYSTEMS  1
A Brief Introduction to Part 1 (by C.H. Chen)  2
Chapter 1.1  Optimal Statistical Classification (Edward R. Dougherty, Jr. and Lori Dalton)  7
Chapter 1.2  Deep Discriminative Feature Learning Method for Object Recognition (Weiwei Shi and Yihong Gong)  31
Chapter 1.3  Deep Learning Based Background Subtraction: A Systematic Survey (Jhony H. Giraldo, Huu Ton Le, and Thierry Bouwmans)  51
Chapter 1.4  Similarity Domains Network for Modeling Shapes and Extracting Skeletons without Large Datasets (Sedat Ozer)  75
Chapter 1.5  On Curvelet-Based Texture Features for Pattern Classification (Reprinted from Chapter 1.7 of 5th HBPRCV) (Ching-Chung Li and Wen-Chyi Lin)  87
Chapter 1.6  An Overview of Efficient Deep Learning on Embedded Systems (Xianju Wang)  107
Chapter 1.7  Random Forest for Dissimilarity-Based Multi-View Learning (Simon Bernard, Hongliu Cao, Robert Sabourin and Laurent Heutte)  119
Chapter 1.8  A Review of Image Colourisation (Bo Li, Yu-Kun Lai, and Paul L. Rosin)  139
Chapter 1.9  Recent Progress of Deep Learning for Speech Recognition (Jinyu Li and Dong Yu)  159

PART 2: APPLICATIONS  183
A Brief Introduction to Part 2 (by C.H. Chen)  184
Chapter 2.1  Machine Learning in Remote Sensing (Ronny Hänsch)  187
Chapter 2.2  Hyperspectral and Spatially Adaptive Unmixing for Analytical Reconstruction of Fraction Surfaces from Data with Corrupted Pixels (Fadi Kizel and Jon Atli Benediktsson)  209
Chapter 2.3  Image Processing for Sea Ice Parameter Identification from Visual Images (Qin Zhang)  231
Chapter 2.4  Applications of Deep Learning to Brain Segmentation and Labeling of MRI Brain Structures (Evan Fletcher and Alexander Knaack)  251
Chapter 2.5  Automatic Segmentation of IVUS Images Based on Temporal Texture Analysis (A. Gangidi and C.H. Chen)  271
Chapter 2.6  Deep Learning for Historical Document Analysis (Foteini Simistira Liwicki and Marcus Liwicki)  287
Chapter 2.7  Signature Verification via Graph-Based Methods (Paul Maergner, Kaspar Riesen, Rolf Ingold, and Andreas Fischer)  305
Chapter 2.8  Cellular Neural Network for Seismic Pattern Recognition (Kou-Yuan Huang and Wen-Hsuan Hsieh)  323
Chapter 2.9  Incorporating Facial Attributes in Cross-Modal Face Verification and Synthesis (Hadi Kazemi, Seyed Mehdi Iranmanesh and Nasser M. Nasrabadi)  343
Chapter 2.10  Connected and Autonomous Vehicles in the Deep Learning Era: A Case Study on Computer-Guided Steering (Rodolfo Valiente, Mahdi Zaman, Yaser P. Fallah and Sedat Ozer)  365

Index  385
PART 1
THEORY, TECHNOLOGY AND SYSTEMS
A BRIEF INTRODUCTION

From my best recollection, the effort toward making machines as intelligent as human beings shifted in the late 1950s to the more realistic goal of automating the human recognition process, which was soon followed by computer processing of pictures. Statistical pattern classification emerged as a major approach to pattern recognition and, even now, some sixty years later, is still an active research area, with the focus now more toward classification tree methods. Feature extraction has been considered a key problem in pattern recognition and is still not a well-solved problem. It remains an important problem despite the use of neural networks and deep learning for classification. In statistical pattern recognition, both parametric and nonparametric methods were extensively investigated in the 1960s and 1970s. The nearest-neighbor decision rule for classification alone generated well over a thousand publications. Other theoretical pattern recognition approaches have been developed since the late sixties, including syntactic (grammatical) pattern recognition and structural pattern recognition. For waveforms, effective features extracted from the spectral, temporal, and statistical domains have been limited. For images, texture features and local edge detectors have been quite effective. Good features are still much needed and can make good use of human ingenuity.

Machine learning has been an essential part of pattern recognition and computer vision. In the 1960s and 1970s, machine learning in pattern recognition focused on improving parameter estimates of a distribution, or nonparametric estimates of probability densities, from supervised and unsupervised learning samples. The re-introduction of artificial neural networks in the mid-1980s has had a tremendous impact on machine learning in pattern recognition. The feature extraction problem has received less attention recently, as neural networks can work with large feature dimensions. Obviously, the major advances in computing using personal computers have greatly improved automated recognition capability with both neural network and more traditional non-neural-network approaches. It is noted that there has been no conclusive evidence that the best neural networks can perform better than the Bayes decision rules for real data. However, accurate class statistics may not be established from limited real data. The most cited textbooks in pattern recognition, in my view, are Fukunaga [1], Duda et al. [2], and Devijver et al. [3]. The most cited textbook for neural networks is by Haykin [4]. Contextual information is important for pattern recognition, and there was extensive research on the use of context in pattern recognition and computer vision (see, e.g., [5, 6]). Feature evaluation and error estimation were
other hot topics in the 1970s (see, e.g., [1, 7]). For many years, researchers have considered the so-called "Hughes phenomenon" (see, e.g., [8]), which states that for a finite training sample size there is a peak mean recognition accuracy. A large feature dimension, however, may imply better separability among pattern classes. The support vector machine is one way to increase the number of features for better classification.

Syntactic pattern recognition is a very different approach that has different feature extraction and decision-making processes. It consists of string grammar-based methods, tree grammar-based methods, and graph grammar-based methods. The most important book is by Fu [9]. More recent books include those by Bunke et al. [10] and Flasinski [11], the latter with over 1,000 entries in its bibliography. Structural pattern recognition (see, e.g., [12]) can be more related to signal/image segmentation and can be closely linked to syntactic pattern recognition. In more recent years, much research effort in pattern recognition has gone into sparse representation (see, e.g., [13]) and into tree classifiers such as random forests, as well as various forms of machine learning involving neural networks. In connection with sparse representation, compressive sensing (not data compression) has been very useful in some complex image and signal recognition problems (see, e.g., [14]).

The development of computer vision largely evolved from digital image processing, with early frontier work by Rosenfeld [15] and many of his subsequent publications. A popular textbook on digital image processing is by Gonzalez and Woods [16]. Digital image processing by itself can only be considered low-level to mid-level computer vision. While image segmentation and edge extraction can be loosely considered middle-level computer vision, high-level computer vision, which is supposed to be like human vision, has not been well defined. Among the many textbooks in computer vision is the work of Haralick et al. listed in [17, 18]. There have been many advances in computer vision, especially in the last 20 years (see, e.g., [19]).

Machine learning has been the fundamental process for both pattern recognition and computer vision. In pattern recognition, many supervised, semi-supervised, and unsupervised learning approaches have been explored. Neural network approaches are particularly suitable for machine learning in pattern recognition. The multilayer perceptron with back-propagation training, kernel methods for support vector machines, self-organizing maps, and dynamically driven recurrent networks represent much of what neural networks have contributed to machine learning [4]. The recently popular deep learning neural networks started with a complex extension of the multilayer neural network by LeCun et al. [20] and
expanded into various versions of convolutional neural networks (see, e.g., [21]). Deep learning implies a lot of learning, with many parameters (weights), on a large data set. As expected, some performance improvement over traditional neural network methods can be achieved. As an emphasis of this Handbook edition, we have included several chapters dealing with deep learning. Clearly, deep learning, as a renewed effort in neural networks since the mid-nineties, is among the important steps toward mature artificial intelligence. However, we take a balanced view in this book by placing as much importance on the past work in pattern recognition and computer vision as on new approaches like deep learning. We believe that any work built on a solid mathematical and/or physical foundation will have long-lasting value. Examples are the Bayes decision rule, the nearest-neighbor decision rule, snake-based image segmentation models, etc.

Though theoretical work on pattern recognition and computer vision has moved at a fairly slow or steady pace, software and hardware development has progressed much faster, thanks to ever-increasing computer power. MATLAB alone, for example, has served software needs so well that it has diminished the need for dedicated software systems. Rapid development of powerful sensors and scanners has made possible many real-time or near-real-time uses of pattern recognition and computer vision. Throughout this Handbook series, we have included several chapters on hardware development. Perhaps continued and increased commercial and non-commercial needs have driven the rapid progress in hardware as well as software development.

References
1. K. Fukunaga, "Introduction to Statistical Pattern Recognition", second edition, Academic Press, 1990.
2. R. Duda, P. Hart, and D. G. Stork, "Pattern Classification", second edition, Wiley, 1995.
3. P. A. Devijver and J. Kittler, "Pattern Recognition: A Statistical Approach", Prentice-Hall, 1982.
4. S. Haykin, "Neural Networks and Learning Machines", third edition, 2008.
5. K. S. Fu and T. S. Yu, "Statistical Pattern Classification Using Contextual Information", Research Studies Press, a Division of Wiley, 1976.
6. G. Toussaint, "The use of context in pattern recognition", Pattern Recognition, Vol. 10, pp. 189-204, 1978.
7. C. H. Chen, "On information and distance measures, error bounds, and feature selection", Information Sciences, Vol. 10, 1976.
8. D. Landgrebe, "Signal Theory Methods in Multispectral Remote Sensing", Wiley, 2003.
9. K. S. Fu, "Syntactic Pattern Recognition and Applications", Prentice-Hall, 1982.
10. H. Bunke and A. Sanfeliu, editors, "Syntactic and Structural Pattern Recognition: Theory and Applications", World Scientific Publishing, 1992.
11. M. Flasinski, "Syntactic Pattern Recognition", World Scientific Publishing, March 2019.
12. T. Pavlidis, "Structural Pattern Recognition", Springer, 1977.
13. Y. Chen, T. D. Tran and N. M. Nasrabadi, "Sparse representation for target detection and classification in hyperspectral imagery", Chapter 19 of "Signal and Image Processing for Remote Sensing", second edition, edited by C. H. Chen, CRC Press, 2012.
14. M. L. Mekhalfi, F. Melgani, et al., "Land use classification with sparse models", Chapter 14 of "Compressive Sensing of Earth Observations", edited by C. H. Chen, CRC Press, 2017.
15. A. Rosenfeld, "Picture Processing by Computer", Academic Press, 1969.
16. R. C. Gonzalez and R. E. Woods, "Digital Image Processing", 4th edition, Prentice-Hall, 2018.
17. R. M. Haralick and L. G. Shapiro, "Computer and Robot Vision", Vol. 1, Addison-Wesley Longman, 2002.
18. R. M. Haralick and L. G. Shapiro, "Computer and Robot Vision", Vol. 2, Addison-Wesley Longman, 2002.
19. C. H. Chen, editor, "Emerging Topics in Computer Vision", World Scientific Publishing, 2012.
20. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning", Nature, Vol. 521, No. 7553, pp. 436-444, 2015.
21. I. Goodfellow, Y. Bengio and A. Courville, "Deep Learning", MIT Press, Cambridge, MA, 2016.
CHAPTER 1.1

OPTIMAL STATISTICAL CLASSIFICATION

Edward R. Dougherty¹ and Lori Dalton²
¹Department of Electrical and Computer Engineering, Texas A&M University
²Department of Electrical and Computer Engineering, Ohio State University
¹Email: [email protected]
Typical classification rules input sample data and output a classifier. This is different from the engineering paradigm in which an optimal operator is derived based on a model and cost function. If the model is uncertain, one can incorporate prior knowledge of the model with data to produce an optimal Bayesian operator. In classification, the model is a feature-label distribution and, if this is known, then a Bayes classifier provides optimal classification relative to the classification error. This chapter reviews optimal Bayesian classification, in which there is an uncertainty class of feature-label distributions governed by a prior distribution, a posterior distribution is derived by conditioning the prior on the sample, and the optimal Bayesian classifier possesses minimal expected error relative to the posterior. The chapter covers binary and multi-class classification, prior construction from scientific knowledge, and optimal Bayesian transfer learning, where the training data are augmented with data from a different source.
1. Introduction
The basic structure of engineering is to operate on a system to achieve some objective. Engineers design operators to control, perturb, filter, compress, and classify systems. In the classical paradigm initiated by the Wiener-Kolmogorov theory for linearly filtering signals, the signals are modeled as random functions, the operators are modeled as integral operators, and the accuracy of a filter is measured by the mean-square error between the true signal and the filtered observation signal. The basic paradigm consists of four parts: (1) a scientific (mathematical) model describing the physical system, (2) a class of operators to choose from, (3) a cost function measuring how well the objective is being achieved, and (4) optimization to find an operator possessing minimum cost. Data are important in the process because system parameters must be estimated. What may appear to be a big data set might actually be very small relative to system complexity. And even if a system is not complex, there may be limited access to data. Should there be insufficient data for accurate parameter estimation, the system model will be uncertain. Suppose the scientific model is uncertain, and the true model belongs to an
uncertainty class Θ of models determined by a parameter vector θ composed of the unknown parameters. As in the classical setting there is a cost function C and a class Ψ of operators on the model whose performances are measured by the cost function. For each operator ψ ∈ Ψ there is a cost C_θ(ψ) of applying ψ on model θ ∈ Θ. An intrinsically Bayesian robust (IBR) operator minimizes the expected value of the cost with respect to a prior probability distribution π(θ) over Θ [1, 2]. An IBR operator is robust in the sense that on average it performs well over the whole uncertainty class. The prior distribution reflects our existing knowledge. If, in addition to a prior distribution coming from existing knowledge, there is a data sample S, the prior distribution conditioned on the sample yields a posterior distribution π*(θ) = π(θ|S). An IBR operator for the posterior distribution is called an optimal Bayesian operator. For the general theory applied to other operator classes, such as filters and clusterers, see Ref. 3. The Wiener-Kolmogorov theory for linear filters was introduced in the 1930s, Kalman-Bucy recursive filtering in the 1960s, and optimal control and classification in the 1950s. In all areas, it was recognized that often the scientific model would not be known. Whereas this led to the development of adaptive linear/Kalman filters and adaptive controllers, classification became dominated by rules that did not estimate the feature-label distribution. Control theorists delved into Bayesian robust control for Markov decision processes in the 1960s [4, 5], but computation was prohibitive and adaptive methods prevailed. Minimax optimal linear filtering was approached in the 1970s [6, 7]. Suboptimal design of filters and classifiers in the context of a prior distribution occurred in the early 2000s [8, 9]. IBR design for nonlinear/linear filtering [2], Kalman filtering [10], and classification [11, 12] has been achieved quite recently. This chapter focuses on optimal Bayesian classification.

2. Optimal Bayesian Classifier
Binary classification involves a feature vector X = (X_1, X_2, ..., X_d) ∈ ℝ^d composed of random variables (features), a binary random variable Y, and a classifier ψ : ℝ^d → {0, 1} to serve as a predictor of Y, meaning Y is predicted by ψ(X). The features X_1, X_2, ..., X_d can be discrete or real-valued. The values, 0 or 1, of Y are treated as class labels. Classification is characterized by the probability distribution f(x, y) of the feature-label pair (X, Y), which is called the feature-label distribution. The error ε[ψ] of ψ is the probability of erroneous classification: ε[ψ] = P(ψ(X) ≠ Y). An optimal classifier ψ_bay, called a Bayes classifier, is one having minimal error among the collection of all classifiers on ℝ^d. The error ε_bay of a Bayes classifier is called the Bayes error. The Bayes classifier and its error can be found from the feature-label distribution. In practice, the feature-label distribution is unknown and classifiers are designed from sample data. A classification rule takes sample data as input and outputs a classifier. A random sample refers to a sample whose points are independent and identically distributed according to the feature-label distribution. The stochastic
process that generates the random sample constitutes the sampling distribution. A classifier is optimal relative to a feature-label distribution and a collection C of classifiers if it is in C and its error is minimal among all classifiers in C:

\[
\psi_{\mathrm{opt}} = \arg\min_{\psi \in \mathcal{C}} \varepsilon[\psi]. \tag{1}
\]

Suppose the feature-label distribution is unknown, but we know that it is characterized by an uncertainty class Θ of parameter vectors corresponding to feature-label distributions f_θ(x, y) for θ ∈ Θ. Now suppose we have scientific knowledge regarding the features and labels, and this allows us to construct a prior distribution π(θ) governing the likelihood that θ ∈ Θ parameterizes the true feature-label distribution, where we assume the prior is uniform if we have no knowledge except that the true feature-label distribution lies in the uncertainty class. Then the optimal classifier, known as an intrinsically Bayesian robust classifier (IBRC), is defined by

\[
\psi_{\mathrm{IBR}}^{\Theta} = \arg\min_{\psi \in \mathcal{C}} E_{\pi}\big[\varepsilon_{\theta}[\psi]\big], \tag{2}
\]

where ε_θ[ψ] is the error of ψ relative to f_θ(x, y) and E_π is expectation relative to π [11, 12]. The IBRC is optimal on average over the uncertainty class, but it will not be optimal for any particular feature-label distribution unless it happens to be a Bayes classifier for that distribution. Going further, suppose we have a random sample S_n = {(X_1, Y_1), ..., (X_n, Y_n)} of vector-label pairs drawn from the actual feature-label distribution. The posterior distribution is defined by π*(θ) = π(θ|S_n) and the optimal classifier, known as an optimal Bayesian classifier (OBC), denoted ψ_OBC^Θ, is defined by Eq. 2 with π* in place of π [11]. An OBC is an IBRC relative to the posterior, and an IBRC is an OBC with a null sample. Because we are generally interested in design using samples, we focus on the OBC. For both the IBRC and the OBC, we omit the Θ in the notation if the uncertainty class is clear from the context. Given our prior knowledge and the data, the OBC is the best classifier to use. A sample-dependent minimum-mean-square-error (MMSE) estimator ε̂(S_n) of ε_θ[ψ] minimizes E_{π,S_n}[|ε_θ[ψ] − ξ(S_n)|²] over all Borel measurable functions ξ(S_n), where E_{π,S_n} denotes expectation with respect to the prior distribution and the sampling distribution. According to classical estimation theory, ε̂(S_n) is the conditional expectation given S_n. Thus,

\[
\hat{\varepsilon}(S_n) = E_{\pi}\big[\varepsilon_{\theta}[\psi] \mid S_n\big] = E_{\pi^*}\big[\varepsilon_{\theta}[\psi]\big]. \tag{3}
\]

In this light, E_{π*}[ε_θ[ψ]] is called the Bayesian MMSE error estimator (BEE) and is denoted by ε̂_Θ[ψ; S_n] [13, 14]. The OBC can be reformulated as

\[
\psi_{\mathrm{OBC}}^{\Theta}(S_n) = \arg\min_{\psi \in \mathcal{C}} \hat{\varepsilon}_{\Theta}[\psi; S_n]. \tag{4}
\]

Besides minimizing E_{π,S_n}[|ε_θ[ψ] − ξ(S_n)|²], the BEE is also an unbiased estimator of ε_θ[ψ] over the distribution of θ and S_n:

\[
E_{S_n}\big[\hat{\varepsilon}_{\Theta}[\psi; S_n]\big] = E_{S_n}\big[E_{\pi}[\varepsilon_{\theta}[\psi] \mid S_n]\big] = E_{\pi, S_n}\big[\varepsilon_{\theta}[\psi]\big]. \tag{5}
\]
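Since the BEE is simply the posterior expectation E_{π*}[ε_θ[ψ]], it can be approximated numerically whenever one can sample parameter vectors from the posterior and evaluate the true error of ψ under a given parameter vector. The following Python sketch illustrates this idea; it is not code from the chapter, and `sample_posterior` and `true_error` are hypothetical, model-specific routines supplied by the user.

```python
import numpy as np

def bee_monte_carlo(psi, sample_posterior, true_error, n_draws=10_000, seed=None):
    """Monte Carlo approximation of the BEE, E_{pi*}[eps_theta[psi]] (Eq. 3):
    draw theta from the posterior pi*(theta) and average the error of psi."""
    rng = np.random.default_rng(seed)
    draws = (sample_posterior(rng) for _ in range(n_draws))
    return float(np.mean([true_error(psi, theta) for theta in draws]))
```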
2.1. OBC Design
Two issues must be addressed in OBC design: representation of the BEE and minimization. In binary classification, θ is a random vector composed of three parts: the parameters of the class-0 and class-1 conditional distributions, θ_0 and θ_1, respectively, and the class-0 prior probability c = c_0 (with c_1 = 1 − c for class 1). Let Θ_y denote the parameter space for θ_y, y = 0, 1, and write the class-conditional distribution as f_{θ_y}(x|y). The marginal prior densities are π(θ_y), y = 0, 1, and π(c). To facilitate analytic representations, we assume that c, θ_0 and θ_1 are all independent prior to observing the data. This assumption allows us to separate the prior density π(θ) and ultimately to separate the BEE into components representing the error contributed by each class. Given the independence of c, θ_0 and θ_1 prior to sampling, they remain independent given the data: π*(θ) = π*(c)π*(θ_0)π*(θ_1), where π*(θ_0), π*(θ_1), and π*(c) are the marginal posterior densities for θ_0, θ_1, and c, respectively [13]. Focusing on c, and letting n_0 be the number of class-0 points, since n_0 ∼ Binomial(n, c) given c,

\[
\pi^*(c) = \pi(c \mid n_0) \propto \pi(c)\, f(n_0 \mid c) \propto \pi(c)\, c^{n_0} (1-c)^{n_1}. \tag{6}
\]

If π(c) is beta(α, β) distributed, then π*(c) is still a beta distribution,

\[
\pi^*(c) = \frac{c^{n_0+\alpha-1} (1-c)^{n_1+\beta-1}}{B(n_0+\alpha,\, n_1+\beta)}, \tag{7}
\]

where B is the beta function, and E_{π*}[c] = (n_0 + α)/(n + α + β). If c is known, then E_{π*}[c] = c. The posteriors for the parameters are found via Bayes' rule,

\[
\pi^*(\theta_y) = f(\theta_y \mid S_{n_y}) \propto \pi(\theta_y)\, f(S_{n_y} \mid \theta_y) = \pi(\theta_y) \prod_{i:\, y_i = y} f_{\theta_y}(x_i \mid y), \tag{8}
\]

where n_y is the number of y-labeled points (x_i, y_i) in the sample, S_{n_y} is the subset of sample points from class y, and the constant of proportionality can be found by normalizing the integral of π*(θ_y) to 1. The term f(S_{n_y}|θ_y) is called the likelihood function. Although we call π(θ_y), y = 0, 1, the "prior probabilities," they are not required to be valid density functions. A prior is called "improper" if the integral of π(θ_y) is infinite. When improper priors are used, Bayes' rule does not apply. Hence, assuming the posterior is integrable, we take Eq. 8 as the definition of the posterior distribution, normalizing it so that its integral is equal to 1. Owing to the posterior independence between c, θ_0 and θ_1, and the fact that ε_θ^y[ψ], the error on class y, is a function of θ_y only, the BEE can be expressed as

\[
\hat{\varepsilon}_\Theta[\psi; S_n] = E_{\pi^*}\big[c\,\varepsilon_\theta^0[\psi] + (1-c)\,\varepsilon_\theta^1[\psi]\big] = E_{\pi^*}[c]\, E_{\pi^*}\big[\varepsilon_\theta^0[\psi]\big] + (1 - E_{\pi^*}[c])\, E_{\pi^*}\big[\varepsilon_\theta^1[\psi]\big], \tag{9}
\]
where

\[
E_{\pi^*}\big[\varepsilon_\theta^y[\psi]\big] = \int_{\Theta_y} \varepsilon_{\theta_y}^y[\psi]\, \pi^*(\theta_y)\, d\theta_y \tag{10}
\]

is the posterior expectation for the error contributed by class y. Letting ε̂_Θ^y[ψ; S_n] = E_{π*}[ε_θ^y[ψ]], Eq. 9 takes the form

\[
\hat{\varepsilon}_\Theta[\psi; S_n] = E_{\pi^*}[c]\, \hat{\varepsilon}_\Theta^0[\psi; S_n] + (1 - E_{\pi^*}[c])\, \hat{\varepsilon}_\Theta^1[\psi; S_n]. \tag{11}
\]

We evaluate the BEE via effective class-conditional densities, which for y = 0, 1, are defined by [11]

\[
f_\Theta(x \mid y) = \int_{\Theta_y} f_{\theta_y}(x \mid y)\, \pi^*(\theta_y)\, d\theta_y. \tag{12}
\]

The following theorem provides the key representation for the BEE.

Theorem 1 [11]. If ψ(x) = 0 for x ∈ R_0 and ψ(x) = 1 for x ∈ R_1, where R_0 and R_1 are measurable sets partitioning ℝ^d, then, given random sample S_n, the BEE is given by

\[
\begin{aligned}
\hat{\varepsilon}_\Theta[\psi; S_n] &= E_{\pi^*}[c] \int_{R_1} f_\Theta(x \mid 0)\, dx + (1 - E_{\pi^*}[c]) \int_{R_0} f_\Theta(x \mid 1)\, dx \\
&= \int_{\mathbb{R}^d} \big( E_{\pi^*}[c]\, f_\Theta(x \mid 0)\, I_{x \in R_1} + (1 - E_{\pi^*}[c])\, f_\Theta(x \mid 1)\, I_{x \in R_0} \big)\, dx,
\end{aligned} \tag{13}
\]

where I denotes the indicator function, 1 or 0, depending on whether the condition is true or false. Moreover, for y = 0, 1,

\[
\hat{\varepsilon}_\Theta^y[\psi; S_n] = E_{\pi^*}\big[\varepsilon_\theta^y[\psi; S_n]\big] = \int_{\mathbb{R}^d} f_\Theta(x \mid y)\, I_{x \in R_{1-y}}\, dx. \tag{14}
\]

In the unconstrained case in which the OBC is over all possible classifiers, Theorem 1 leads to a pointwise expression of the OBC by simply minimizing Eq. 13.

Theorem 2 [11]. The optimal Bayesian classifier over the set of all classifiers is given by

\[
\psi_{\mathrm{OBC}}^{\Theta}(x) =
\begin{cases}
0 & \text{if } E_{\pi^*}[c]\, f_\Theta(x \mid 0) \geq (1 - E_{\pi^*}[c])\, f_\Theta(x \mid 1), \\
1 & \text{otherwise.}
\end{cases} \tag{15}
\]

The representation in the theorem is the representation for the Bayes classifier for the feature-label distribution defined by class-conditional densities f_Θ(x|0) and f_Θ(x|1), and class-0 prior probability E_{π*}[c]; that is, the OBC is the Bayes classifier for the effective class-conditional densities. We restrict our attention to the OBC over all possible classifiers.
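As a plain illustration of Theorem 2 (a sketch, not the authors' code), the OBC at a point x only requires the posterior mean E_{π*}[c] and the two effective class-conditional densities, passed in below as callables `f_eff0` and `f_eff1`, which are placeholders for whatever closed forms the chosen model admits.

```python
def obc_label(x, c_mean, f_eff0, f_eff1):
    """Optimal Bayesian classifier of Eq. 15: assign class 0 whenever
    E_{pi*}[c] * f_Theta(x|0) >= (1 - E_{pi*}[c]) * f_Theta(x|1)."""
    return 0 if c_mean * f_eff0(x) >= (1.0 - c_mean) * f_eff1(x) else 1

def obc_bee_integrand(x, c_mean, f_eff0, f_eff1):
    """Pointwise contribution of x to the OBC's BEE: the minimum of the two
    terms inside Eq. 13 when psi is the OBC; integrating this over the
    feature space gives the OBC's Bayesian error estimate."""
    return min(c_mean * f_eff0(x), (1.0 - c_mean) * f_eff1(x))
```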
3. OBC for the Discrete Model
If the range of X is finite, then there is no loss in generality in assuming a single feature X taking values in {1, ..., b}. This discrete classification problem is defined by the class-0 prior probability c_0 and the class-conditional probability mass functions p_i = P(X = i|Y = 0), q_i = P(X = i|Y = 1), for i = 1, ..., b. Since p_b = 1 − Σ_{i=1}^{b−1} p_i and q_b = 1 − Σ_{i=1}^{b−1} q_i, the classification problem is determined by a (2b − 1)-dimensional vector (c_0, p_1, ..., p_{b−1}, q_1, ..., q_{b−1}) ∈ ℝ^{2b−1}. We consider an arbitrary number of bins with beta class priors and define the parameters for each class to contain all but one bin probability: θ_0 = [p_1, p_2, ..., p_{b−1}] and θ_1 = [q_1, q_2, ..., q_{b−1}]. Each parameter space is defined as the set of all valid bin probabilities. For example, [p_1, p_2, ..., p_{b−1}] ∈ Θ_0 if and only if 0 ≤ p_i ≤ 1 for i = 1, ..., b − 1 and Σ_{i=1}^{b−1} p_i ≤ 1. We use the Dirichlet priors

\[
\pi(\theta_0) \propto \prod_{i=1}^{b} p_i^{\alpha_i^0 - 1} \quad \text{and} \quad \pi(\theta_1) \propto \prod_{i=1}^{b} q_i^{\alpha_i^1 - 1}, \tag{16}
\]
where α_i^y > 0. These are conjugate priors, meaning that the posteriors take the same form. Increasing a specific α_i^y has the effect of biasing the corresponding bin with α_i^y samples from the corresponding class before observing the data. The posterior distributions are again Dirichlet and are given by

\[
\pi^*(\theta_y) = \frac{\Gamma\!\big(n_y + \sum_{i=1}^{b} \alpha_i^y\big)}{\prod_{k=1}^{b} \Gamma\!\big(U_k^y + \alpha_k^y\big)} \prod_{i=1}^{b} p_i^{U_i^y + \alpha_i^y - 1} \tag{17}
\]

for y = 0, and a similar expression with p replaced by q for y = 1, where U_i^y is the number of observations in bin i for class y [13]. The effective class-conditional densities are given by [13]

\[
f_\Theta(j \mid y) = \frac{U_j^y + \alpha_j^y}{n_y + \sum_{i=1}^{b} \alpha_i^y}. \tag{18}
\]

From Eq. 13,

\[
\hat{\varepsilon}_\Theta[\psi; S_n] = \sum_{j=1}^{b} \left[ E_{\pi^*}[c]\, \frac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0}\, I_{\psi(j)=1} + (1 - E_{\pi^*}[c])\, \frac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}\, I_{\psi(j)=0} \right]. \tag{19}
\]

In particular,

\[
\hat{\varepsilon}_\Theta^y[\psi; S_n] = \sum_{j=1}^{b} \frac{U_j^y + \alpha_j^y}{n_y + \sum_{i=1}^{b} \alpha_i^y}\, I_{\psi(j)=1-y}. \tag{20}
\]

From Eq. 15, using the effective class-conditional densities in Eq. 18 [11],

\[
\psi_{\mathrm{OBC}}^{\Theta}(j) =
\begin{cases}
1 & \text{if } E_{\pi^*}[c]\, \dfrac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0} < (1 - E_{\pi^*}[c])\, \dfrac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}, \\
0 & \text{otherwise.}
\end{cases} \tag{21}
\]
From Eq. 13, the expected error of the OBC is

\[
\varepsilon_{\mathrm{OBC}} = \sum_{j=1}^{b} \min\left\{ E_{\pi^*}[c]\, \frac{U_j^0 + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0},\; (1 - E_{\pi^*}[c])\, \frac{U_j^1 + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1} \right\}. \tag{22}
\]
The OBC minimizes the BEE by minimizing each term in the sum of Eq. 19, assigning ψ(j) the class with the smaller constant scaling the indicator function. The OBC is optimal on average across the posterior distribution, but its behavior for any specific feature-label distribution is not guaranteed. Generally speaking, if the prior is concentrated in the vicinity of the true feature-label distribution, then results are good. But there is risk. If one uses a tight prior that is concentrated away from the true feature-label distribution, results can be very bad. Correct knowledge helps; incorrect knowledge hurts. Thus, prior construction is very important, and we will return to that issue in a subsequent section. Following an example in Ref. 3, suppose the true distribution is discrete with c = 0.5, p_1 = p_2 = p_3 = p_4 = 3/16, p_5 = p_6 = p_7 = p_8 = 1/16, q_1 = q_2 = q_3 = q_4 = 1/16, q_5 = q_6 = q_7 = q_8 = 3/16. Consider five Dirichlet priors π_1, π_2, ..., π_5 with c = 0.5, α_1^{j,0} = α_2^{j,0} = α_3^{j,0} = α_4^{j,0} = a_{j,0}, α_5^{j,0} = α_6^{j,0} = α_7^{j,0} = α_8^{j,0} = b_{j,0}, α_1^{j,1} = α_2^{j,1} = α_3^{j,1} = α_4^{j,1} = a_{j,1}, α_5^{j,1} = α_6^{j,1} = α_7^{j,1} = α_8^{j,1} = b_{j,1}, for j = 1, 2, ..., 5, where a_{j,0} = 1, 1, 1, 2, 4 for j = 1, 2, ..., 5, respectively, b_{j,0} = 4, 2, 1, 1, 1 for j = 1, 2, ..., 5, respectively, a_{j,1} = 4, 2, 1, 1, 1 for j = 1, 2, ..., 5, respectively, and b_{j,1} = 1, 1, 1, 2, 4 for j = 1, 2, ..., 5, respectively. For n = 5 through n = 30, 100,000 samples of size n are generated. For each of these we design a histogram classifier, which assigns to each bin the majority label in the bin, and five OBCs corresponding to the five priors. Figure 1 shows average errors, with the Bayes error for the true distribution marked by small circles. Whereas the OBC from the uniform prior (prior 3) performs slightly better than the histogram rule, putting more prior mass in the vicinity of the true distribution (priors 4 and 5) gives greatly improved performance. The risk in leaving uniformity is demonstrated by priors 1 and 2, whose masses are concentrated away from the true distribution.
[Figure: average true error versus sample size (n = 5 to 30) for the histogram classifier, the Bayes error, and the OBCs under priors 1-5, with c = 0.5.]

Fig. 1. Average true errors for the histogram classifier and OBCs based on different prior distributions. [Reprinted from Dougherty, Optimal Signal Processing Under Uncertainty, SPIE Press, 2018.]
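The discrete-model OBC reduces to a few lines of array arithmetic. The sketch below is an illustration (not the authors' code) of Eqs. 18, 21, and 22 from bin counts and Dirichlet hyperparameters with c known; the short driver mimics the flavor of the Fig. 1 comparison for a single sample and the uniform prior.

```python
import numpy as np

def discrete_obc(U0, U1, alpha0, alpha1, c_mean):
    """Discrete-model OBC. U0, U1: bin counts per class; alpha0, alpha1:
    Dirichlet hyperparameters; c_mean: E_{pi*}[c] (known c here)."""
    U0, U1 = np.asarray(U0, float), np.asarray(U1, float)
    a0, a1 = np.asarray(alpha0, float), np.asarray(alpha1, float)
    f0 = (U0 + a0) / (U0.sum() + a0.sum())        # effective density f_Theta(j|0), Eq. 18
    f1 = (U1 + a1) / (U1.sum() + a1.sum())        # effective density f_Theta(j|1)
    labels = np.where(c_mean * f0 < (1 - c_mean) * f1, 1, 0)   # Eq. 21
    bee = np.minimum(c_mean * f0, (1 - c_mean) * f1).sum()     # Eq. 22
    return labels, bee

# Driver: the true distribution of the example above, with a uniform prior (all alpha = 1)
rng = np.random.default_rng(0)
p = np.array([3, 3, 3, 3, 1, 1, 1, 1]) / 16
q = np.array([1, 1, 1, 1, 3, 3, 3, 3]) / 16
n = 20
y = rng.random(n) < 0.5                            # True marks a class-1 point
x = np.where(y, rng.choice(8, n, p=q), rng.choice(8, n, p=p))
U0 = np.bincount(x[~y], minlength=8)
U1 = np.bincount(x[y], minlength=8)
labels, bee = discrete_obc(U0, U1, np.ones(8), np.ones(8), c_mean=0.5)
true_err = 0.5 * p[labels == 1].sum() + 0.5 * q[labels == 0].sum()
print(labels, round(bee, 3), round(true_err, 3))
```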
4. OBC for the Gaussian Model
For y ∈ {0, 1}, assume an ℝ^d Gaussian distribution with parameters θ_y = [μ_y, Λ_y], where μ_y is the mean of the class-conditional distribution and Λ_y is a collection of parameters determining the covariance matrix Σ_y of the class. We distinguish between Λ_y and Σ_y to enable us to impose a structure on the covariance. In Refs. 13 and 14, three types of models are considered: a fixed covariance (Σ_y = Λ_y is known perfectly), a scaled identity covariance having uncorrelated features with equal variances (Λ_y = σ_y² is a scalar and Σ_y = σ_y² I_d, where I_d is the d × d identity matrix), and a general (unconstrained, but valid) random covariance matrix, Σ_y = Λ_y. The parameter space of μ_y is ℝ^d. The parameter space of Λ_y must permit only valid covariance matrices. We write Σ_y without explicitly showing its dependence on Λ_y. A multivariate Gaussian distribution with mean μ and covariance Σ is denoted by f_{μ,Σ}(x), so that the parameterized class-conditional distributions are f_{θ_y}(x|y) = f_{μ_y,Σ_y}(x). In the independent covariance model, c, θ_0 = [μ_0, Λ_0] and θ_1 = [μ_1, Λ_1] are independent prior to the data, so that π(θ) = π(c)π(θ_0)π(θ_1). Assuming π(c) and π*(c) have been established, we require priors π(θ_y) and posteriors π*(θ_y) for both classes. We begin by specifying conjugate priors for θ_0 and θ_1. Let ν be a non-negative real number, m a length-d real vector, κ a real number, and S a
symmetric positive semi-definite d × d matrix. Define

\[
f_{\mathrm{m}}(\mu;\, \nu, m, \Lambda) = |\Sigma|^{-1/2} \exp\!\left( -\frac{\nu}{2} (\mu - m)^T \Sigma^{-1} (\mu - m) \right), \tag{23}
\]
\[
f_{\mathrm{c}}(\Lambda;\, \kappa, S) = |\Sigma|^{-(\kappa+d+1)/2} \exp\!\left( -\frac{1}{2}\, \mathrm{trace}\!\left( S \Sigma^{-1} \right) \right), \tag{24}
\]

where Σ is a function of Λ. If ν > 0, then f_m is a (scaled) Gaussian distribution with mean m and covariance Σ/ν. If Σ = Λ, κ > d − 1, and S is positive definite, then f_c is a (scaled) inverse-Wishart(κ, S) distribution. To allow for improper priors, we do not necessarily require f_m and f_c to be normalizable. For y = 0, 1, assume Σ_y is invertible and priors are of the form

\[
\pi(\theta_y) = \pi(\mu_y \mid \Lambda_y)\, \pi(\Lambda_y), \tag{25}
\]

where

\[
\pi(\mu_y \mid \Lambda_y) \propto f_{\mathrm{m}}(\mu_y;\, \nu_y, m_y, \Lambda_y), \tag{26}
\]
\[
\pi(\Lambda_y) \propto f_{\mathrm{c}}(\Lambda_y;\, \kappa_y, S_y). \tag{27}
\]

If ν_y > 0, then π(μ_y|Σ_y) is proper and Gaussian with mean m_y and covariance Σ_y/ν_y. The hyperparameter m_y can be viewed as a target for the mean, where the larger ν_y is the more localized the prior is about m_y. In the general covariance model where Σ_y = Λ_y, π(Σ_y) is proper if κ_y > d − 1 and S_y is positive definite. If in addition ν_y > 0, then π(θ_y) is a normal-inverse-Wishart distribution, which is the conjugate prior for the mean and covariance when sampling from normal distributions [15, 16]. Then E_π[Σ_y] = (κ_y − d − 1)^{-1} S_y, so that S_y can be viewed as a target for the shape of the covariance, where the actual expected covariance is scaled. If S_y is scaled appropriately, then the larger κ_y is the more certainty we have about Σ_y. At the same time, increasing κ_y while fixing the other hyperparameters defines a prior favoring smaller |Σ_y|. The model allows for improper priors. Some useful examples of improper priors occur when S_y = 0 and ν_y = 0. In this case, π(θ_y) ∝ |Σ_y|^{−(κ_y+d+2)/2}. If κ_y + d + 2 = 0, then we obtain flat priors. If Λ_y = Σ_y, then with κ_y = 0 we obtain Jeffreys' rule prior, which is designed to be invariant to differentiable one-to-one transformations of the parameters [17, 18], and with κ_y = −1 we obtain Jeffreys' independence prior, which uses the same principle as the Jeffreys' rule prior but also treats the mean and covariance matrix as independent parameters.

Theorem 3 [14]. In the independent covariance model, the posterior distributions possess the same form as the priors:

\[
\pi^*(\theta_y) \propto f_{\mathrm{m}}(\mu_y;\, \nu_y^*, m_y^*, \Lambda_y)\, f_{\mathrm{c}}(\Lambda_y;\, \kappa_y^*, S_y^*), \tag{28}
\]

with updated hyperparameters

\[
\nu_y^* = \nu_y + n_y, \qquad \kappa_y^* = \kappa_y + n_y, \qquad m_y^* = \frac{\nu_y m_y + n_y \hat{\mu}_y}{\nu_y + n_y}, \tag{29}
\]
\[
S_y^* = S_y + (n_y - 1)\hat{\Sigma}_y + \frac{\nu_y n_y}{\nu_y + n_y} (\hat{\mu}_y - m_y)(\hat{\mu}_y - m_y)^T, \tag{30}
\]
where μ̂_y and Σ̂_y are the sample mean and sample covariance for class y. Similar results are found in Ref. 15. The posterior can be expressed as

\[
\pi^*(\theta_y) = \pi^*(\mu_y \mid \Lambda_y)\, \pi^*(\Lambda_y), \tag{31}
\]

where

\[
\pi^*(\mu_y \mid \Lambda_y) = f_{\{m_y^*,\, \Sigma_y/\nu_y^*\}}(\mu_y), \tag{32}
\]
\[
\pi^*(\Lambda_y) \propto |\Sigma_y|^{-(\kappa_y^*+d+1)/2} \exp\!\left( -\frac{1}{2}\, \mathrm{trace}\!\left( S_y^* \Sigma_y^{-1} \right) \right). \tag{33}
\]

Assuming at least one sample point, ν_y^* > 0, so π*(μ_y|Λ_y) is always valid. The validity of π*(Λ_y) depends on the definition of Λ_y. Improper priors are acceptable but the posterior must always be a valid probability density. Since the effective class-conditional densities are found separately, a different covariance model may be used for each class. Going forward, to simplify notation we denote hyperparameters without subscripts. In the general covariance model, Σ_y = Λ_y, the parameter space contains all positive definite matrices, and π*(Σ_y) has an inverse-Wishart distribution,

\[
\pi^*(\Sigma_y) = \frac{|S^*|^{\kappa^*/2}}{2^{\kappa^* d/2}\, \Gamma_d(\kappa^*/2)}\, |\Sigma_y|^{-(\kappa^*+d+1)/2} \exp\!\left( -\frac{1}{2}\, \mathrm{trace}\!\left( S^* \Sigma_y^{-1} \right) \right), \tag{34}
\]

where Γ_d is the multivariate gamma function. For a proper posterior, we require ν^* > 0, κ^* > d − 1, and S^* positive definite.

Theorem 4 [11]. For a general covariance matrix, assuming ν^* > 0, κ^* > d − 1, and S^* positive definite, the effective class-conditional density is a multivariate Student's t-distribution,

\[
f_\Theta(x \mid y) = \frac{\Gamma\!\left(\frac{\kappa^*+1}{2}\right)}{\Gamma\!\left(\frac{\kappa^*-d+1}{2}\right)} \cdot \frac{1}{(\kappa^*-d+1)^{d/2}\, \pi^{d/2}\, \left| \frac{\nu^*+1}{(\kappa^*-d+1)\nu^*} S^* \right|^{1/2}} \left[ 1 + \frac{1}{\kappa^*-d+1} (x - m^*)^T \left( \frac{\nu^*+1}{(\kappa^*-d+1)\nu^*} S^* \right)^{-1} (x - m^*) \right]^{-\frac{\kappa^*+1}{2}}, \tag{35}
\]

with location vector m^*, scale matrix (ν^*+1)/((κ^*−d+1)ν^*) S^*, and κ^* − d + 1 degrees of freedom. It is proper if κ^* > d; the mean of this distribution is m^*, and if κ^* > d + 1 the covariance is (ν^*+1)/((κ^*−d−1)ν^*) S^*. Rewriting Eq. 35 with ν^* = ν_y^*, m^* = m_y^*, κ^* = κ_y^*, and k_y = κ_y^* − d + 1 degrees of freedom for y ∈ {0, 1}, the effective class-conditional densities are

\[
f_\Theta(x \mid y) = \frac{\Gamma\!\left(\frac{k_y+d}{2}\right)}{k_y^{d/2}\, \pi^{d/2}\, |\Psi_y|^{1/2}\, \Gamma\!\left(\frac{k_y}{2}\right)} \left[ 1 + \frac{1}{k_y} \left( x - m_y^* \right)^T \Psi_y^{-1} \left( x - m_y^* \right) \right]^{-\frac{k_y+d}{2}}, \tag{36}
\]
[Figure: decision boundaries of the plug-in (PI), OBC, and IBR classifiers in the (x_1, x_2) feature plane, with level curves of the two class-conditional distributions.]

Fig. 2. Classifiers for a Gaussian model with two features.
where Ψ_y is the scale matrix in Eq. 35. The OBC discriminant becomes

\[
D_{\mathrm{OBC}}(x) = K \left[ 1 + \frac{1}{k_0} (x - m_0^*)^T \Psi_0^{-1} (x - m_0^*) \right]^{k_0+d} - \left[ 1 + \frac{1}{k_1} (x - m_1^*)^T \Psi_1^{-1} (x - m_1^*) \right]^{k_1+d}, \tag{37}
\]

where

\[
K = \left( \frac{1 - E_{\pi^*}[c]}{E_{\pi^*}[c]} \right)^2 \left( \frac{k_0}{k_1} \right)^d \frac{|\Psi_0|}{|\Psi_1|} \left( \frac{\Gamma(k_0/2)\, \Gamma((k_1+d)/2)}{\Gamma((k_0+d)/2)\, \Gamma(k_1/2)} \right)^2. \tag{38}
\]
ψ_OBC(x) = 0 if and only if D_OBC(x) ≤ 0. This classifier has a polynomial decision boundary as long as k_0 and k_1 are integers, which is true if κ_0 and κ_1 are integers. Consider a synthetic Gaussian model with d = 2 features, independent general covariance matrices, and a proper prior defined by known c = 0.5 and hyperparameters ν_0 = κ_0 = 20d, m_0 = [0, ..., 0], ν_1 = κ_1 = 2d, m_1 = [1, ..., 1], and S_y = (κ_y − d − 1) I_d. We assume that the true model corresponds to the means of the parameters, and take a stratified sample of 10 randomly chosen points from each true class-conditional distribution. We find both the IBRC ψ_IBR and the OBC ψ_OBC relative to the family of all classifiers. We also consider a plug-in classifier ψ_PI, which is the Bayes classifier corresponding to the means of the parameters. ψ_PI is linear. Figure 2 shows ψ_OBC, ψ_IBR, and ψ_PI. Level curves for the class-conditional distributions corresponding to the expected parameters are also shown.
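To make the Gaussian-model OBC concrete, the following sketch (an illustration under the independent, general-covariance model, not the authors' code) updates the hyperparameters as in Theorem 3, evaluates the effective density of Eq. 36 in log form, and applies Theorem 2; SciPy is assumed only for the log-gamma function.

```python
import numpy as np
from scipy.special import gammaln

def posterior_hyperparams(X, nu, m, kappa, S):
    """Theorem 3 (Eqs. 29-30): posterior hyperparameters for one class,
    given its n x d sample matrix X and prior hyperparameters (nu, m, kappa, S)."""
    X, m, S = np.asarray(X, float), np.asarray(m, float), np.asarray(S, float)
    n, d = X.shape
    xbar = X.mean(axis=0)                                    # sample mean
    scatter = (n - 1) * np.cov(X, rowvar=False) if n > 1 else np.zeros((d, d))
    nu_s, kappa_s = nu + n, kappa + n
    m_s = (nu * m + n * xbar) / (nu + n)
    S_s = S + scatter + (nu * n / (nu + n)) * np.outer(xbar - m, xbar - m)
    return nu_s, m_s, kappa_s, S_s

def log_effective_density(x, nu_s, m_s, kappa_s, S_s):
    """Log of the multivariate-t effective class-conditional density (Eq. 36)."""
    d = len(m_s)
    k = kappa_s - d + 1                                      # degrees of freedom
    Psi = (nu_s + 1.0) / (k * nu_s) * S_s                    # scale matrix
    diff = np.asarray(x, float) - m_s
    quad = diff @ np.linalg.solve(Psi, diff)
    _, logdet = np.linalg.slogdet(Psi)
    return (gammaln((k + d) / 2.0) - gammaln(k / 2.0)
            - 0.5 * d * np.log(k * np.pi) - 0.5 * logdet
            - 0.5 * (k + d) * np.log1p(quad / k))

def gaussian_obc_label(x, post0, post1, c_mean=0.5):
    """OBC of Theorem 2 with the Gaussian-model effective densities."""
    log0 = np.log(c_mean) + log_effective_density(x, *post0)
    log1 = np.log(1.0 - c_mean) + log_effective_density(x, *post1)
    return 0 if log0 >= log1 else 1
```

In the synthetic example above (d = 2), `post0` would be computed from the class-0 sample with ν_0 = κ_0 = 40, m_0 = (0, 0), S_0 = (κ_0 − d − 1)I_2, and similarly for class 1 with ν_1 = κ_1 = 4 and m_1 = (1, 1).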
For the Gaussian and discrete models discussed herein, the OBC can be solved analytically; however, in many real-world situations Gaussian models are not suitable. Shortly after the introduction of the OBC, Markov-chain Monte Carlo (MCMC) methods were utilized for RNA-Seq applications [19, 20]. Other MCMC-based OBC applications include liquid chromatography-mass spectrometry data [21], selected reaction monitoring data [22], and classification based on dynamical measurements of single-gene expression [23], the latter using an IBR classifier because no sample data were included. Another practical issue pertains to missing values, which are common in many applications, such as genomic classification. The OBC has been reformulated to take missing values into account [24]. Finally, let us note that, while random sampling is a common assumption in classification theory, nonrandom sampling can be beneficial for classifier design [25]. In the case of the OBC, optimal sampling has been considered under different scenarios [3, 26].

5. Multi-class Classification
In this section, we generalize the BEE and OBC to treat multiple classes with arbitrary loss functions. We present the analogous concepts of the Bayesian risk estimator (BRE) and the optimal Bayesian risk classifier (OBRC), and show that the BRE and OBRC can be represented in the same form as the expected risk and Bayes decision rule with unknown true densities replaced by effective densities. We consider M classes, y = 0, ..., M − 1, let f(y | c) be the probability mass function of Y parameterized by a vector c, and for each y let f(x | y, θ_y) be the class-conditional density function for X parameterized by θ_y. Let θ be composed of the θ_y. Let L(i, y) be a loss function quantifying a penalty in predicting label i when the true label is y. The conditional risk in predicting label i for a given point x is defined by R(i, x, c, θ) = E[L(i, Y) | x, c, θ]. A direct calculation yields

\[
R(i, x, c, \theta) = \frac{\sum_{y=0}^{M-1} L(i, y)\, f(y \mid c)\, f(x \mid y, \theta_y)}{\sum_{y=0}^{M-1} f(y \mid c)\, f(x \mid y, \theta_y)}. \tag{39}
\]

The expected risk of an M-class classifier ψ is given by

\[
R(\psi, c, \theta) = E[R(\psi(X), X, c, \theta) \mid c, \theta] = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f(y \mid c)\, \varepsilon^{i,y}(\psi, \theta_y), \tag{40}
\]

where the classification probability

\[
\varepsilon^{i,y}(\psi, \theta_y) = \int_{R_i} f(x \mid y, \theta_y)\, dx = P(X \in R_i \mid y, \theta_y) \tag{41}
\]

is the probability that a class-y point will be assigned class i by ψ, and the R_i = {x : ψ(x) = i} partition the feature space into decision regions.
A Bayes decision rule (BDR) minimizes expected risk, or equivalently, the conditional risk at each fixed point x:
$$\psi_{\rm BDR}(x) = \arg\min_{i \in \{0,\ldots,M-1\}} R(i, x, c, \theta) = \arg\min_{i \in \{0,\ldots,M-1\}} \sum_{y=0}^{M-1} L(i, y)\, f(y \mid c)\, f(x \mid y, \theta_y). \qquad (42)$$
We break ties with the lowest index, i ∈ {0, . . . , M − 1}, minimizing R(i, x, c, θ). In the binary case with the zero-one loss function, L(i, y) = 0 if i = y and L(i, y) = 1 if i ≠ y, the expected risk reduces to the classification error so that the BDR is a Bayes classifier.

With uncertainty in the multi-class framework, we assume that c is the probability mass function of Y, that is, c = {c_0, . . . , c_{M−1}} ∈ Δ^{M−1}, where f(y | c) = c_y and Δ^{M−1} is the standard M − 1 simplex defined by c_y ∈ [0, 1] for y ∈ {0, . . . , M − 1} and $\sum_{y=0}^{M-1} c_y = 1$. Also assume θ_y ∈ Θ_y for some parameter space Θ_y, and θ ∈ Θ = Θ_0 × . . . × Θ_{M−1}. Let C and T denote random vectors for parameters c and θ. We assume that C and T are independent prior to observing data, and assign prior probabilities π(c) and π(θ). Note the change of notation: up until now, c and θ have denoted both the random variables and the parameters. The change is being made to avoid confusion regarding the expectations in this section.

Let S_n be a random sample, x_{iy} the ith sample point in class y, and n_y the number of class-y sample points. Given S_n, the priors are updated to posteriors:
$$\pi^*(c, \theta) = f(c, \theta \mid S_n) \propto \pi(c)\,\pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_{iy}, y \mid c, \theta_y), \qquad (43)$$
where the product on the right is the likelihood function. Since f(x_{iy}, y | c, θ_y) = c_y f(x_{iy} | y, θ_y), we may write π*(c, θ) = π*(c)π*(θ), where
$$\pi^*(c) = f(c \mid S_n) \propto \pi(c) \prod_{y=0}^{M-1} (c_y)^{n_y} \qquad (44)$$
and
$$\pi^*(\theta) = f(\theta \mid S_n) \propto \pi(\theta) \prod_{y=0}^{M-1} \prod_{i=1}^{n_y} f(x_{iy} \mid y, \theta_y) \qquad (45)$$
are marginal posteriors of C and T. Independence between C and T is preserved in the posterior. If the prior is proper, this all follows from Bayes’ theorem; otherwise, Eq. 44 and Eq. 45 are taken as definitions, with proper posteriors required. Given a Dirichlet prior on C with hyperparameters αy , with random sampling the posterior on C is Dirichlet, with hyperparameters αy∗ = αy + ny .
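Under a Dirichlet prior, this posterior update and the posterior mean of c_y (which reappears below as the effective density f_Θ(y) in Eq. 48) amount to a one-line computation. The sketch below uses made-up hyperparameters and counts for illustration.

```python
import numpy as np

# Dirichlet prior hyperparameters alpha_y and observed class counts n_y (hypothetical values)
alpha = np.array([1.0, 1.0, 1.0])    # prior Dir(alpha) on c = (c_0, c_1, c_2)
n = np.array([12, 7, 21])            # class counts n_y in the sample S_n

alpha_post = alpha + n               # posterior is Dirichlet with alpha_y* = alpha_y + n_y
effective_fy = alpha_post / alpha_post.sum()   # E_{pi*}[c_y], i.e., f_Theta(y) of Eq. 48
print(alpha_post, effective_fy)
```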
5.1 Optimal Bayesian Risk Classification
We define the Bayesian risk estimate (BRE) to be the MMSE estimate of the expected risk, or equivalently, the conditional expectation of the expected risk given the observations. Given a sample S_n and a classifier ψ that is not informed by θ, owing to posterior independence between C and T, the BRE is given by
$$\hat{R}(\psi, S_n) = E[R(\psi, {\bf C}, {\bf T}) \mid S_n] = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, E[f(y \mid {\bf C}) \mid S_n]\, E[\varepsilon_{i,y}(\psi, {\bf T}) \mid S_n]. \qquad (46)$$
The effective density f_Θ(x | y) is in Eq. 12. We also have an effective density
$$f_\Theta(y) = \int f(y \mid c)\, \pi^*(c)\, dc. \qquad (47)$$
The effective densities are expressed via expectation by
$$f_\Theta(y) = E_c[f(y \mid {\bf C}) \mid S_n] = E[c_y \mid S_n] = E_{\pi^*}[c_y], \qquad (48)$$
$$f_\Theta(x \mid y) = E_{\theta_y}[f(x \mid y, {\bf T}) \mid S_n]. \qquad (49)$$
We may thus write the BRE in Eq. 46 as
$$\hat{R}(\psi, S_n) = \sum_{y=0}^{M-1} \sum_{i=0}^{M-1} L(i, y)\, f_\Theta(y)\, \varepsilon^n_{i,y}(\psi, S_n), \qquad (50)$$
where
$$\varepsilon^n_{i,y}(\psi, S_n) = E[\varepsilon_{i,y}(\psi, {\bf T}) \mid S_n] = \int_{R_i} f_\Theta(x \mid y)\, dx. \qquad (51)$$
Note that f_Θ(y) and f_Θ(x | y) play roles analogous to f(y | c) and f(x | y, θ_y) in Bayes decision theory. Various densities and conditional densities are involved in the theory, generally denoted by f. For instance, we may write the prior and posterior as π(θ) = f(θ) and π*(θ) = f(θ | S_n). We also consider f(y | S_n) and f(x | y, S_n). By expressing these as integrals over Θ, we see that f(y | S_n) = f_Θ(y) and f(x | y, S_n) = f_Θ(x | y).

Whereas the BRE addresses overall classifier performance across the entire feature space, we may consider classification at a fixed point. The Bayesian conditional risk estimator (BCRE) for class i ∈ {0, ..., M − 1} at point x is the MMSE estimate of the conditional risk given the sample S_n and the test point X = x:
$$\hat{R}(i, x, S_n) = E[R(i, X, {\bf C}, {\bf T}) \mid S_n, X = x] = \sum_{y=0}^{M-1} L(i, y)\, E[P(Y = y \mid X, {\bf C}, {\bf T}) \mid S_n, X = x]. \qquad (52)$$
The expectations are over a posterior on C and T updated with both S_n and the unlabeled point x. It is proven in Ref. 27 that
$$\hat{R}(i, x, S_n) = \frac{\sum_{y=0}^{M-1} L(i, y)\, f_\Theta(y)\, f_\Theta(x \mid y)}{\sum_{y=0}^{M-1} f_\Theta(y)\, f_\Theta(x \mid y)}. \qquad (53)$$
This is analogous to Eq. 39 in Bayes decision theory. Furthermore, given a classifier ψ with decision regions R_0, . . . , R_{M−1},
$$E\big[\hat{R}(\psi(X), X, S_n) \mid S_n\big] = \sum_{i=0}^{M-1} \int_{R_i} \hat{R}(i, x, S_n)\, f(x \mid S_n)\, dx, \qquad (54)$$
where the expectation is over X (not C or T) given S_n. Calculation shows that27
$$E\big[\hat{R}(\psi(X), X, S_n) \mid S_n\big] = \hat{R}(\psi, S_n). \qquad (55)$$
Hence, the BRE of ψ is the mean of the BCRE across the feature space.

For binary classification, ε^n_{i,y}(ψ, S_n) has been solved in closed form as components of the BEE for both discrete models under arbitrary classifiers and Gaussian models under linear classifiers, so the BRE with an arbitrary loss function is available in closed form for these models. When closed-form solutions for ε^n_{i,y}(ψ, S_n) are not available, approximation may be employed.27

We define the optimal Bayesian risk classifier to minimize the BRE:
$$\psi_{\rm OBRC} = \arg\min_{\psi \in C} \hat{R}(\psi, S_n), \qquad (56)$$
where C is a family of classifiers. If C is the set of all classifiers with measurable decision regions, then ψ_OBRC exists and is given for any x by
$$\psi_{\rm OBRC}(x) = \arg\min_{i \in \{0,\ldots,M-1\}} \hat{R}(i, x, S_n) = \arg\min_{i \in \{0,\ldots,M-1\}} \sum_{y=0}^{M-1} L(i, y)\, f_\Theta(y)\, f_\Theta(x \mid y). \qquad (57)$$
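The decision rule in Eq. 57 is straightforward to apply once the effective quantities are available. The sketch below is a minimal NumPy illustration; the function name, the toy loss matrix, and the stand-in effective densities are ours, not part of the chapter.

```python
import numpy as np

def obrc_predict(x, loss, effective_prior, effective_density):
    """Eq. 57: pick the label i minimizing sum_y L(i, y) f_Theta(y) f_Theta(x | y).

    loss              : (M, M) array with loss[i, y] = L(i, y)
    effective_prior   : length-M array of f_Theta(y)
    effective_density : callable (x, y) -> f_Theta(x | y)
    """
    M = len(effective_prior)
    dens = np.array([effective_density(x, y) for y in range(M)])
    risks = loss @ (effective_prior * dens)   # weighted risk for each candidate label i
    return int(np.argmin(risks))              # ties resolve to the lowest index, as in the text

# Toy usage: M = 3 classes, zero-one loss, made-up Gaussian-shaped effective densities
M = 3
loss = 1.0 - np.eye(M)
prior = np.array([0.5, 0.3, 0.2])
means = [np.zeros(2), np.ones(2), -np.ones(2)]
dens = lambda x, y: np.exp(-0.5 * np.sum((x - means[y]) ** 2)) / (2 * np.pi)
print(obrc_predict(np.array([0.2, 0.1]), loss, prior, dens))
```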
The OBRC minimizes the average loss weighted by f_Θ(y)f_Θ(x | y). The OBRC has the same functional form as the BDR with f_Θ(y) substituted for the true class probability f(y | c), and f_Θ(x | y) substituted for the true density f(x | y, θ_y) for all y. Closed-form OBRC representation is available for any model in which f_Θ(x | y) has been found, including discrete and Gaussian models. For binary classification, the BRE reduces to the BEE and the OBRC reduces to the OBC.

6 Prior Construction
In 1968, E. T. Jaynes remarked,28 “Bayesian methods, for all their advantages, will not be entirely satisfactory until we face the problem of finding the prior probability squarely.” Twelve years later, he added,29 “There must exist a general formal theory of determination of priors by logical analysis of prior information — and that to develop it is today the top priority research problem of Bayesian theory.” The problem is to transform scientific knowledge into prior distributions. Historically, prior construction has usually been treated independently of real prior knowledge. Subsequent to Jeffreys’ non-informative prior,17 objective-based methods were proposed.30 These were followed by information-theoretic and statistical approaches.31 In all of these methods, there is a separation between prior
knowledge and observed sample data. Several specialized methods have been proposed for prior construction in the context of the OBC. In Ref. 32, data from unused features is used to construct a prior. In Refs. 19 and 20, a hierarchical Poisson prior is employed that models cellular mRNA concentrations using a log-normal distribution, with the uncertainty placed on the feature-label distribution. In the context of phenotype classification, knowledge concerning genetic signaling pathways has been integrated into prior construction.33-35 Here, we outline a general paradigm for prior formation involving an optimization constrained by incorporating existing scientific knowledge augmented by slackness variables.36 The constraints tighten the prior distribution in accordance with prior knowledge, while at the same time avoiding inadvertent over-restriction of the prior. Two definitions provide the general framework.

Given a family of proper priors π(θ, γ) indexed by γ ∈ Γ, a maximal knowledge-driven information prior (MKDIP) is a solution to the optimization
$$\arg\min_{\gamma \in \Gamma} E_{\pi(\theta,\gamma)}\big[C_\theta(\xi, \gamma, D)\big], \qquad (58)$$
where C_θ(ξ, γ, D) is a cost function depending on (1) the random vector θ parameterizing the uncertainty class, (2) the parameter γ, and (3) the state ξ of our prior knowledge and part of the sample data D. When the cost function is additively decomposed into costs on the hyperparameters and the data, it takes the form
$$C_\theta(\xi, \gamma, D) = (1 - \beta)\, g^{(1)}_\theta(\xi, \gamma) + \beta\, g^{(2)}_\theta(\xi, D), \qquad (59)$$
where β ∈ [0, 1] is a regularization parameter, and g^{(1)}_θ and g^{(2)}_θ are cost functions. Various cost functions in the literature can be adapted for the MKDIP.36

A maximal knowledge-driven information prior with constraints takes the form of the optimization in Eq. 58 subject to the constraints E_{π(θ,γ)}[g^{(3)}_{θ,i}(ξ)] = 0, i = 1, 2, ..., n_c, where g^{(3)}_{θ,i}, i = 1, 2, ..., n_c, are constraints resulting from the state ξ of our knowledge, via a mapping
$$T: \xi \rightarrow \Big(E_{\pi(\theta,\gamma)}\big[g^{(3)}_{\theta,1}(\xi)\big], \ldots, E_{\pi(\theta,\gamma)}\big[g^{(3)}_{\theta,n_c}(\xi)\big]\Big). \qquad (60)$$
A nonnegative slackness variable ε_i can be considered for each constraint for the MKDIP to make the constraint structure more flexible, thereby allowing potential error or uncertainty in prior knowledge (allowing inconsistencies in prior knowledge). Slackness variables become optimization parameters, and a linear function times a regulatory coefficient is added to the cost function of the optimization in Eq. 58, so that the optimization in Eq. 58 relative to Eq. 59 becomes
$$\arg\min_{\gamma \in \Gamma,\, \varepsilon \in E} E_{\pi(\theta,\gamma)}\Big[\lambda_1 \big[(1 - \beta)\, g^{(1)}_\theta(\xi, \gamma) + \beta\, g^{(2)}_\theta(\xi, D)\big]\Big] + \lambda_2 \sum_{i=1}^{n_c} \varepsilon_i, \qquad (61)$$
subject to −ε_i ≤ E_{π(θ,γ)}[g^{(3)}_{θ,i}(ξ)] ≤ ε_i, i = 1, 2, ..., n_c, where λ_1 and λ_2 are nonnegative regularization parameters, and ε = (ε_1, ..., ε_{n_c}) and E represent the vector of all slackness variables and the feasible region for slackness variables, respectively. Each slackness variable determines a range: the more uncertainty regarding a constraint, the greater the range for the corresponding slackness variable.

Scientific knowledge is often expressed in the form of conditional probabilities characterizing conditional relations. For instance, if a system has m binary random variables X_1, X_2, . . . , X_m, then potentially there are m2^{m−1} probabilities for which a single variable is conditioned by the other variables:
$$P(X_i = k_i \mid X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m) = a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \qquad (62)$$
Keeping in mind that constraints are of the form E_{π(θ,γ)}[g^{(3)}_{θ,i}(ξ)] = 0, in this setting,
$$g^{(3)}_{\theta,i}(\xi) = P_\theta(X_i = k_i \mid X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m) - a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \qquad (63)$$
When slackness variables are introduced, the optimization constraints take the form
$$a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m) - \varepsilon_i(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m) \leq E_{\pi(\theta,\gamma)}\big[P_\theta(X_i = k_i \mid X_1 = k_1, \ldots, X_{i-1} = k_{i-1}, X_{i+1} = k_{i+1}, \ldots, X_m = k_m)\big] \leq a_i^{k_i}(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m) + \varepsilon_i(k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_m). \qquad (64)$$
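To indicate how a constrained MKDIP optimization with slackness variables might be set up numerically, the following SciPy sketch minimizes an objective of the form of Eq. 61 subject to box constraints of the form of Eq. 64. Everything model-specific here (the expected cost, the constraint expectations, and the coefficient values) is a placeholder of our own; a real application would evaluate these expectations under π(θ, γ).

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins: in a real MKDIP these expectations are taken under pi(theta, gamma).
def expected_cost(gamma):                      # placeholder for E_{pi}[C_theta]
    return np.sum((gamma - 1.0) ** 2)

def constraint_expectations(gamma):            # placeholder for E_{pi}[g^(3)_{theta,i}]
    return np.array([gamma[0] - 0.3, gamma[1] + gamma[0] - 0.9])

lam1, lam2, n_c = 1.0, 10.0, 2

def objective(z):                              # z = (gamma, eps); cf. Eq. 61
    gamma, eps = z[:2], z[2:]
    return lam1 * expected_cost(gamma) + lam2 * np.sum(eps)

cons = [
    # -eps_i <= E[g_i] <= eps_i, expressed as two sets of ">= 0" constraints for SLSQP
    {"type": "ineq", "fun": lambda z: z[2:] - constraint_expectations(z[:2])},
    {"type": "ineq", "fun": lambda z: z[2:] + constraint_expectations(z[:2])},
    {"type": "ineq", "fun": lambda z: z[2:]},  # eps_i >= 0
]
z0 = np.concatenate([np.zeros(2), np.ones(n_c)])
res = minimize(objective, z0, method="SLSQP", constraints=cons)
print(res.x[:2], res.x[2:])                    # selected hyperparameters gamma and slack values
```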
Not all constraints will be used, depending on our prior knowledge. In fact, the general conditional probabilities conditioned on all expressions Xj = kj , j = i, will not likely be used because they will likely not be known when there are many random variables, so that conditioning will be on subsets of these expressions. Regardless of how the prior is constructed, the salient point regarding optimal Bayesian operator design (including the OBC) is that uncertainty is quantified relative to the scientific model (the feature-label distribution for classification). The prior distribution is on the physical parameters. This differs from the common method of placing prior distributions on the parameters of the operator. For instance, if we compare optimal Bayesian regression37 to standard Bayesian linear regression models,38-40 in the latter, the connection of the regression function and prior assumptions with the underlying physical system is unclear. As noted in Ref. 37, there is a scientific gap in constructing operator models and making prior assumptions on them. In fact, operator uncertainty is a consequence of uncertainty in the physical system and is related to the latter via the optimization procedure that produces an optimal operator. A key reason why the MKDIP approach works is because the prior is on the scientific model, and therefore scientific knowledge can be applied directly in the form of constraints.
7 Optimal Bayesian Transfer Learning
The standard assumption in classification theory is that training and future data come from the same feature-label distribution. In transfer learning, the training data from the actual feature-label distribution, called the target, are augmented with data from a different feature-label distribution, called the source.41 The key issue is to quantify domain relatedness. This can be achieved by extending the OBC framework so that transfer learning from source to target domain is via a joint prior probability density function for the model parameters of the feature-label distributions of the two domains.42 The posterior distribution of the target model parameters can be updated via the joint prior probability distribution function in conjunction with the source and target data. We use π to denote a joint prior distribution and p to denote a conditional distribution involving uncertainty parameters. As usual, a posterior distribution refers to a distribution of uncertainty parameters conditioned on the data.

We consider L common classes in each domain. Let S_s and S_t denote samples from the source and target domains with sizes N_s and N_t, respectively. For l = 1, 2, ..., L, let $S_s^l = \{x_{s,1}^l, x_{s,2}^l, \cdots, x_{s,n_s^l}^l\}$ and $S_t^l = \{x_{t,1}^l, x_{t,2}^l, \cdots, x_{t,n_t^l}^l\}$. Moreover, $S_s = \cup_{l=1}^L S_s^l$, $S_t = \cup_{l=1}^L S_t^l$, $N_s = \sum_{l=1}^L n_s^l$, and $N_t = \sum_{l=1}^L n_t^l$. Since the feature spaces are the same in both domains, $x_s^l$ and $x_t^l$ are d-vectors for d features of the source and target domains, respectively. Since in transfer learning there is no joint sampling of the source and target domains, we cannot use a general joint sampling model, but instead assume that there are two datasets separately sampled from the source and target domains.

Transferability (relatedness) is characterized by how we define a joint prior distribution for the source and target precision matrices, $\Lambda_s^l$ and $\Lambda_t^l$, l = 1, 2, ..., L. We employ a Gaussian model for the feature-label distribution, $x_z^l \sim N(\mu_z^l, (\Lambda_z^l)^{-1})$, for l ∈ {1, ..., L}, where z ∈ {s, t} denotes the source s or target t domain, $\mu_s^l$ and $\mu_t^l$ are mean vectors in the source and target domains for label l, respectively, $\Lambda_s^l$ and $\Lambda_t^l$ are the d × d precision matrices in the source and target domains for label l, respectively, and we employ a joint Gaussian-Wishart distribution as a prior for the mean and precision matrices of the Gaussian models. The joint prior distribution for $\mu_s^l$, $\mu_t^l$, $\Lambda_s^l$, and $\Lambda_t^l$ takes the form
$$\pi(\mu_s^l, \mu_t^l, \Lambda_s^l, \Lambda_t^l) = p(\mu_s^l, \mu_t^l \mid \Lambda_s^l, \Lambda_t^l)\, \pi(\Lambda_s^l, \Lambda_t^l). \qquad (65)$$
Assuming that, for any l, $\mu_s^l$ and $\mu_t^l$ are conditionally independent given $\Lambda_s^l$ and $\Lambda_t^l$ results in conjugate priors. Thus,
$$\pi(\mu_s^l, \mu_t^l, \Lambda_s^l, \Lambda_t^l) = p(\mu_s^l \mid \Lambda_s^l)\, p(\mu_t^l \mid \Lambda_t^l)\, \pi(\Lambda_s^l, \Lambda_t^l), \qquad (66)$$
and both $p(\mu_s^l \mid \Lambda_s^l)$ and $p(\mu_t^l \mid \Lambda_t^l)$ are Gaussian, $\mu_z^l \mid \Lambda_z^l \sim N\big(m_z^l, (\kappa_z^l \Lambda_z^l)^{-1}\big)$, where $m_z^l$ is the mean vector of $\mu_z^l$, and $\kappa_z^l$ is a positive scalar hyperparameter. A key issue is the structure of the joint prior governing the target and source precision matrices. We employ a family of joint priors that falls out naturally from a collection of partitioned Wishart random matrices.
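As a rough illustration of how such a joint prior can arise from partitioned Wishart matrices, the following NumPy/SciPy sketch draws a 2d × 2d Wishart matrix with scale matrix of the block form of Eq. 68 and reads off its diagonal blocks as a jointly distributed pair (Λ_t^l, Λ_s^l). The dimension, degrees of freedom, and blocks of M^l used here are made-up values, not the chapter's.

```python
import numpy as np
from scipy.stats import wishart

d, nu = 3, 2 * 3 + 2                      # nu^l >= 2d
M_t, M_s = np.eye(d), np.eye(d)
M_ts = 0.5 * np.eye(d)                    # off-diagonal block controls relatedness
M = np.block([[M_t, M_ts], [M_ts.T, M_s]])

W = wishart.rvs(df=nu, scale=M, random_state=0)    # one draw from W_{2d}(M, nu)
Lambda_t, Lambda_s = W[:d, :d], W[d:, d:]          # marginally W_d(M_t, nu) and W_d(M_s, nu)
print(Lambda_t.shape, Lambda_s.shape)
```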
Based on a theorem in Ref. 43, we define the joint prior distribution π(Λ_s^l, Λ_t^l) in Eq. 66 of the precision matrices of the source and target domains for class l:
$$\pi(\Lambda_t^l, \Lambda_s^l) = K^l\, {\rm etr}\Big(-\tfrac{1}{2}\big[(M_t^l)^{-1} + (F^l)^T C^l F^l\big] \Lambda_t^l\Big)\, \big|\Lambda_t^l\big|^{\frac{\nu^l - d - 1}{2}}\, {\rm etr}\Big(-\tfrac{1}{2}(C^l)^{-1} \Lambda_s^l\Big)\, \big|\Lambda_s^l\big|^{\frac{\nu^l - d - 1}{2}} \times {}_0F_1\Big(\frac{\nu^l}{2};\, \frac{1}{4}G^l\Big), \qquad (67)$$
where etr(A) = exp(tr(A)),
$$M^l = \begin{bmatrix} M_t^l & M_{ts}^l \\ (M_{ts}^l)^T & M_s^l \end{bmatrix} \qquad (68)$$
is a 2d × 2d positive definite scale matrix, ν^l ≥ 2d denotes degrees of freedom, ${}_pF_q$ is the generalized hypergeometric function,44 and
$$C^l = M_s^l - (M_{ts}^l)^T (M_t^l)^{-1} M_{ts}^l, \qquad (69)$$
$$F^l = (C^l)^{-1} (M_{ts}^l)^T (M_t^l)^{-1}, \qquad (70)$$
$$G^l = (\Lambda_s^l)^{\frac{1}{2}} F^l \Lambda_t^l (F^l)^T (\Lambda_s^l)^{\frac{1}{2}}, \qquad (71)$$
$$(K^l)^{-1} = 2^{\frac{d\nu^l}{2}}\, \Gamma_d\Big(\frac{\nu^l}{2}\Big)\, \big|M^l\big|^{\frac{\nu^l}{2}}. \qquad (72)$$
Based upon a theorem in Ref. 45, Λ_t^l and Λ_s^l possess Wishart marginal distributions: Λ_z^l ∼ W_d(M_z^l, ν^l), for l ∈ {1, ..., L} and z ∈ {s, t}.

We need the posterior distribution of the parameters of the target domain upon observing the source and target samples. The likelihoods of the samples S_t and S_s are conditionally independent given the parameters of the target and source domains. The dependence between the two domains is due to the dependence of the prior distributions of the precision matrices. Within each domain, the likelihoods of the classes are conditionally independent given the class parameters. Under these conditions, and assuming that the priors of the parameters in different classes are independent, the joint posterior can be expressed as a product of the individual class posteriors:42
$$\pi(\mu_t, \mu_s, \Lambda_t, \Lambda_s \mid S_t, S_s) = \prod_{l=1}^L \pi(\mu_t^l, \mu_s^l, \Lambda_t^l, \Lambda_s^l \mid S_t^l, S_s^l), \qquad (73)$$
where
$$\pi(\mu_t^l, \mu_s^l, \Lambda_t^l, \Lambda_s^l \mid S_t^l, S_s^l) \propto p(S_t^l \mid \mu_t^l, \Lambda_t^l)\, p(S_s^l \mid \mu_s^l, \Lambda_s^l)\, p(\mu_s^l \mid \Lambda_s^l)\, p(\mu_t^l \mid \Lambda_t^l)\, \pi(\Lambda_s^l, \Lambda_t^l). \qquad (74)$$
The next theorem gives the posterior for the target domain.
Theorem 5 [42]. Given the target S_t and source S_s samples, the posterior distribution of target mean μ_t^l and target precision matrix Λ_t^l for class l has a Gaussian-hypergeometric-function distribution
$$\pi(\mu_t^l, \Lambda_t^l \mid S_t^l, S_s^l) = A^l\, \big|\Lambda_t^l\big|^{\frac{1}{2}} \exp\Big(-\frac{\kappa_{t,n}^l}{2}\big(\mu_t^l - m_{t,n}^l\big)^T \Lambda_t^l \big(\mu_t^l - m_{t,n}^l\big)\Big) \times \big|\Lambda_t^l\big|^{\frac{\nu^l + n_t^l - d - 1}{2}}\, {\rm etr}\Big(-\frac{1}{2}\big(T_t^l\big)^{-1} \Lambda_t^l\Big) \times {}_1F_1\Big(\frac{\nu^l + n_s^l}{2};\, \frac{\nu^l}{2};\, \frac{1}{2} F^l \Lambda_t^l (F^l)^T T_s^l\Big), \qquad (75)$$
where, if F^l is full rank or null, A^l is the constant of proportionality,
$$\big(A^l\big)^{-1} = \Big(\frac{2\pi}{\kappa_{t,n}^l}\Big)^{\frac{d}{2}}\, 2^{\frac{d(\nu^l + n_t^l)}{2}}\, \Gamma_d\Big(\frac{\nu^l + n_t^l}{2}\Big)\, \big|T_t^l\big|^{\frac{\nu^l + n_t^l}{2}} \times {}_2F_1\Big(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l}{2};\, \frac{\nu^l}{2};\, T_s^l F^l T_t^l (F^l)^T\Big), \qquad (76)$$
and
$$\kappa_{t,n}^l = \kappa_t^l + n_t^l, \qquad m_{t,n}^l = \big(\kappa_t^l m_t^l + n_t^l \bar{x}_t^l\big)\big(\kappa_{t,n}^l\big)^{-1},$$
$$\big(T_t^l\big)^{-1} = \big(M_t^l\big)^{-1} + (F^l)^T C^l F^l + (n_t^l - 1)\hat{S}_t^l + \frac{\kappa_t^l n_t^l}{\kappa_t^l + n_t^l}\big(m_t^l - \bar{x}_t^l\big)\big(m_t^l - \bar{x}_t^l\big)^T,$$
$$\big(T_s^l\big)^{-1} = \big(C^l\big)^{-1} + (n_s^l - 1)\hat{S}_s^l + \frac{\kappa_s^l n_s^l}{\kappa_s^l + n_s^l}\big(m_s^l - \bar{x}_s^l\big)\big(m_s^l - \bar{x}_s^l\big)^T, \qquad (77)$$
with $\bar{x}_z^l$ and $\hat{S}_z^l$ being the sample mean and covariance for z ∈ {s, t} and class l.

The effective class-conditional density for class l is
$$f_{\rm OBTL}(x \mid l) = \int_{\mu_t^l, \Lambda_t^l} f(x \mid \mu_t^l, \Lambda_t^l)\, \pi^*(\mu_t^l, \Lambda_t^l)\, d\mu_t^l\, d\Lambda_t^l, \qquad (78)$$
where π ∗ (μlt , Λlt ) = π(μlt , Λlt |Stl , Ssl ) is the posterior of (μlt , Λlt ) upon observation of Stl and Ssl . We evaluate it. Theorem 6 [42]. If Fl is full rank or null, then the effective class-conditional density in the target domain for class l is given by
$$f_{\rm OBTL}(x \mid l) = \pi^{-\frac{d}{2}} \Big(\frac{\kappa_{t,n}^l}{\kappa_x^l}\Big)^{\frac{d}{2}}\, \Gamma_d\Big(\frac{\nu^l + n_t^l + 1}{2}\Big)\, \Gamma_d^{-1}\Big(\frac{\nu^l + n_t^l}{2}\Big)\, \big|T_x^l\big|^{\frac{\nu^l + n_t^l + 1}{2}}\, \big|T_t^l\big|^{-\frac{\nu^l + n_t^l}{2}} \times {}_2F_1\Big(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l + 1}{2};\, \frac{\nu^l}{2};\, T_s^l F^l T_x^l (F^l)^T\Big) \times {}_2F_1^{-1}\Big(\frac{\nu^l + n_s^l}{2}, \frac{\nu^l + n_t^l}{2};\, \frac{\nu^l}{2};\, T_s^l F^l T_t^l (F^l)^T\Big), \qquad (79)$$
where κ_x^l = κ_{t,n}^l + 1 = κ_t^l + n_t^l + 1, and
$$\big(T_x^l\big)^{-1} = \big(T_t^l\big)^{-1} + \frac{\kappa_{t,n}^l}{\kappa_{t,n}^l + 1}\big(m_{t,n}^l - x\big)\big(m_{t,n}^l - x\big)^T. \qquad (80)$$
A Dirichlet prior is assumed for the prior probabilities c_t^l that the target sample belongs to class l: $c_t = (c_t^1, \cdots, c_t^L) \sim {\rm Dir}(L, \xi_t)$, where $\xi_t = (\xi_t^1, \cdots, \xi_t^L)$ is the vector of concentration parameters, and ξ_t^l > 0 for l ∈ {1, ..., L}. As the Dirichlet distribution is a conjugate prior for the categorical distribution, upon observing $n = (n_t^1, ..., n_t^L)$ sample points for the classes in the target domain, the posterior π*(c_t) = π(c_t | n) has a Dirichlet distribution Dir(L, ξ_t + n). The optimal Bayesian transfer learning classifier (OBTLC) in the target domain relative to the uncertainty class $\Theta_t = \{c_t^l, \mu_t^l, \Lambda_t^l\}_{l=1}^L$ is given by
$$\psi_{\rm OBTL}(x) = \arg\max_{l \in \{1,\cdots,L\}} E_{\pi^*}[c_t^l]\, f_{\rm OBTL}(x \mid l). \qquad (81)$$
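As an illustration of the posterior update in Eq. 77, the sketch below computes the target-domain hyperparameters for one class from target and source samples. The function name and argument layout are ours; the evaluation of the matrix-argument hypergeometric functions needed for Eqs. 76 and 79 is not shown.

```python
import numpy as np

def obtl_target_hyperparams(Xt, Xs, m_t, m_s, kappa_t, kappa_s, M_t, C, F):
    """Posterior quantities of Eq. 77 for one class, from target data Xt and source data Xs.

    Xt, Xs : (n_t, d) and (n_s, d) sample matrices for this class
    m_t, m_s, kappa_t, kappa_s, M_t, C, F : prior hyperparameters (Eqs. 66 and 68-70)
    Returns kappa_tn, m_tn, Tt_inv, Ts_inv.
    """
    n_t, n_s = len(Xt), len(Xs)
    xbar_t, xbar_s = Xt.mean(axis=0), Xs.mean(axis=0)
    St_hat = np.cov(Xt, rowvar=False)            # unbiased sample covariance, target
    Ss_hat = np.cov(Xs, rowvar=False)            # unbiased sample covariance, source

    kappa_tn = kappa_t + n_t
    m_tn = (kappa_t * m_t + n_t * xbar_t) / kappa_tn

    dt = (m_t - xbar_t)[:, None]
    ds = (m_s - xbar_s)[:, None]
    Tt_inv = (np.linalg.inv(M_t) + F.T @ C @ F + (n_t - 1) * St_hat
              + (kappa_t * n_t / (kappa_t + n_t)) * (dt @ dt.T))
    Ts_inv = (np.linalg.inv(C) + (n_s - 1) * Ss_hat
              + (kappa_s * n_s / (kappa_s + n_s)) * (ds @ ds.T))
    return kappa_tn, m_tn, Tt_inv, Ts_inv
```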
If there is no interaction between the source and target domains in all the classes, then the OBTLC reduces to the OBC in the target domain. Specifically, if $M_{ts}^l = 0$ for all l ∈ {1, ..., L}, then ψ_OBTL = ψ_OBC. Figure 3 shows simulation results comparing the OBC (trained only with target data) and the OBTL classifier for two classes and ten features (see Ref. 42 for simulation details). α is a parameter measuring the relatedness between the source and target domains: α = 0 when the two domains are not related, and α close to 1 indicates greater relatedness (the curves shown correspond to α = 0.6, 0.8, and 0.925). Part (a) shows average classification error versus the number of source points, with the number of target points fixed at 10, and part
Fig. 3. Average classification error: (a) average classification error versus the number of source training data per class, (b) average classification error versus the number of target training data per class.
(b) shows average classification error versus the number of target points with the number of source points fixed at 200.
8 Conclusion
Optimal Bayesian classification provides optimality with respect to both prior knowledge and data; the greater the prior knowledge, the less data are needed to obtain a given level of performance. Its formulation lies within the classical operator-optimization framework, adapted to take into account both the operational objective and the state of our uncertain knowledge.3 Perhaps the salient issue for OBC applications is the principled transformation of scientific knowledge into the prior distribution. Although a general paradigm has been proposed in Ref. 36, it depends on certain cost assumptions. Others could be used. Indeed, all optimizations depend upon the assumption of an objective and a cost function. Thus, optimality always includes a degree of subjectivity. Nonetheless, an optimization paradigm encapsulates the aims and knowledge of the engineer, and it is natural to optimize relative to these.
References [1] Yoon, B-J., Qian, X., and E. R. Dougherty, Quantifying the objective cost of uncertainty in complex dynamical systems, IEEE Trans Signal Processing, 61, 2256-2266, (2013). [2] Dalton, L. A., and E. R. Dougherty, Intrinsically optimal Bayesian robust filtering, IEEE Trans Signal Processing, 62, 657-670, (2014). [3] Dougherty, E. R., Optimal Signal Processing Under Uncertainty, SPIE Press, Bellingham, (2018). [4] Silver, E. A., Markovian decision processes with uncertain transition probabilities or rewards, Technical report, Defense Technical Information Center, (1963). [5] Martin, J. J., Bayesian Decision Problems and Markov Chains, Wiley, New York, (1967). [6] Kuznetsov, V. P., Stable detection when the signal and spectrum of normal noise are inaccurately known, Telecommunications and Radio Engineering, 30-31, 58-64, (1976). [7] Poor, H. V., On robust Wiener filtering, IEEE Trans Automatic Control, 25, 531-536, (1980). [8] Grigoryan, A. M. and E. R. Dougherty, Bayesian robust optimal linear filters, Signal Processing, 81, 2503-2521, (2001). [9] Dougherty, E. R., Hua, J., Z. Xiong, and Y. Chen, Optimal robust classifiers, Pattern Recognition, 38, 1520-1532, (2005). [10] Dehghannasiri, R., Esfahani, M. S., and E. R. Dougherty, Intrinsically Bayesian ro-
bust Kalman filter: an innovation process approach, IEEE Trans Signal Processing, 65, 2531-2546, (2017).
[11] Dalton, L. A., and E. R. Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework–part I: discrete and Gaussian models, Pattern Recognition, 46, 1288-1300, (2013).
[12] Dalton, L. A., and E. R. Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework–part II: properties and performance analysis, Pattern Recognition, 46, 1301-1314, (2013).
[13] Dalton, L. A., and E. R. Dougherty, Bayesian minimum mean-square error estimation for classification error–part I: definition and the Bayesian MMSE error estimator for discrete classification, IEEE Trans Signal Processing, 59, 115-129, (2011).
[14] Dalton, L. A., and E. R. Dougherty, Bayesian minimum mean-square error estimation for classification error–part II: linear classification of Gaussian models, IEEE Trans Signal Processing, 59, 130-144, (2011).
[15] DeGroot, M. H., Optimal Statistical Decisions, McGraw-Hill, New York, (1970).
[16] Raiffa, H., and R. Schlaifer, Applied Statistical Decision Theory, MIT Press, Cambridge, (1961).
[17] Jeffreys, H., An invariant form for the prior probability in estimation problems, Proc Royal Society of London, Series A, Mathematical and Physical Sciences, 186, 453-461, (1946).
[18] Jeffreys, H., Theory of Probability, Oxford University Press, London, (1961).
[19] Knight, J., Ivanov, I., and E. R. Dougherty, MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: model-based RNA-seq classification, BMC Bioinformatics, 15, (2014).
[20] Knight, J., Ivanov, I., Chapkin, R., and E. R. Dougherty, Detecting multivariate gene interactions in RNA-seq data using optimal Bayesian classification, IEEE/ACM Trans Computational Biology and Bioinformatics, 15, 484-493, (2018).
[21] Nagaraja, K., and U. Braga-Neto, Bayesian classification of proteomics biomarkers from selected reaction monitoring data using an approximate Bayesian computation–Markov chain Monte Carlo approach, Cancer Informatics, 17, (2018).
[22] Banerjee, U., and U. Braga-Neto, Bayesian ABC-MCMC classification of liquid chromatography–mass spectrometry data, Cancer Informatics, 14, (2015).
[23] Karbalayghareh, A., Braga-Neto, U. M., and E. R. Dougherty, Intrinsically Bayesian robust classifier for single-cell gene expression time series in gene regulatory networks, BMC Systems Biology, 12, (2018).
[24] Dadaneh, S. Z., Dougherty, E. R., and X. Qian, Optimal Bayesian classification with missing values, IEEE Trans Signal Processing, 66, 4182-4192, (2018).
[25] Zollanvari, A., Hua, J., and E. R. Dougherty, Analytic study of performance of linear discriminant analysis in stochastic settings, Pattern Recognition, 46, 3017-3029, (2013).
[26] Broumand, A., Yoon, B-J., Esfahani, M. S., and E. R. Dougherty, Discrete optimal Bayesian classification with error-conditioned sequential sampling, Pattern Recognition, 48, 3766-3782, (2015).
[27] Dalton, L. A., and M. R. Yousefi, On optimal Bayesian classification and risk estimation under multiple classes, EURASIP J. Bioinformatics and Systems Biology, (2015).
[28] Jaynes, E. T., Prior probabilities, IEEE Trans Systems Science and Cybernetics, 4, 227-241, (1968).
[29] Jaynes, E., What is the question? in Bayesian Statistics, J. M. Bernardo et al., Eds., Valencia University Press, Valencia, (1980).
[30] Kashyap, R., Prior probability and uncertainty, IEEE Trans Information Theory, IT-17, 641-650, (1971).
[31] Rissanen, J., A universal prior for integers and estimation by minimum description length, Annals of Statistics, 11, 416-431, (1983).
[32] Dalton, L. A., and E. R. Dougherty, Application of the Bayesian MMSE error estimator for classification error to gene-expression microarray data, Bioinformatics, 27, 1822-1831, (2011).
[33] Esfahani, M. S., Knight, J., Zollanvari, A., Yoon, B-J., and E. R. Dougherty, Classifier design given an uncertainty class of feature distributions via regularized maximum likelihood and the incorporation of biological pathway knowledge in steady-state phenotype classification, Pattern Recognition, 46, 2783-2797, (2013).
[34] Esfahani, M. S., and E. R. Dougherty, Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification, IEEE/ACM Trans Computational Biology and Bioinformatics, 11, 202-218, (2014).
[35] Esfahani, M. S., and E. R. Dougherty, An optimization-based framework for the transformation of incomplete biological knowledge into a probabilistic structure and its application to the utilization of gene/protein signaling pathways in discrete phenotype classification, IEEE/ACM Trans Computational Biology and Bioinformatics, 12, 1304-1321, (2015).
[36] Boluki, S., Esfahani, M. S., Qian, X., and E. R. Dougherty, Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors, BMC Bioinformatics, 18, (2017).
[37] Qian, X., and E. R. Dougherty, Bayesian regression with network prior: optimal Bayesian filtering perspective, IEEE Trans Signal Processing, 64, 6243-6253, (2016).
[38] Bernardo, J., and A. Smith, Bayesian Theory, Wiley, Chichester, U.K., (2000).
[39] Bishop, C., Pattern Recognition and Machine Learning, Springer-Verlag, New York, (2006).
[40] Murphy, K., Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, (2012).
[41] Pan, S. J., and Q. Yang, A survey on transfer learning, IEEE Trans Knowledge and Data Engineering, 22, 1345-1359, (2010).
[42] Karbalayghareh, A., Qian, X., and E. R. Dougherty, Optimal Bayesian transfer learning, IEEE Trans Signal Processing, 66, 3724-3739, (2018).
[43] Halvorsen, K., Ayala, V., and E. Fierro, On the marginal distribution of the diagonal blocks in a blocked Wishart random matrix, International Journal of Analysis, 2016, 1-5, (2016).
[44] Nagar, D. K., and J. C. Mosquera-Benítez, Properties of matrix variate hypergeometric function distribution, Applied Mathematical Sciences, 11, 677-692, (2017).
[45] Muirhead, R. J., Aspects of Multivariate Statistical Theory, Wiley, Hoboken, (2009).
CHAPTER 1.2 DEEP DISCRIMINATIVE FEATURE LEARNING METHOD FOR OBJECT RECOGNITION
Weiwei Shi1 and Yihong Gong2 1
School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710049, China. 2 Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, China. 1 [email protected], 2 [email protected]
This chapter introduces two deep discriminative feature learning methods for object recognition that do not require increasing the network complexity: one based on the entropy-orthogonality loss, and the other based on the Min-Max loss. These two losses enforce better within-class compactness and between-class separability on the learned feature vectors. The discriminative ability of the learned feature vectors is therefore greatly improved, which is essential to object recognition.
1. Introduction

Recent years have witnessed the bloom of convolutional neural networks (CNNs) in many pattern recognition and computer vision applications, including object recognition,1–4 object detection,5–8 face verification,9,10 semantic segmentation,6 object tracking,11 image retrieval,12 image enhancement,13 image quality assessment,14 etc. These impressive accomplishments mainly benefit from the three factors below: (1) the rapid progress of modern computing technologies represented by GPGPUs and CPU clusters has allowed researchers to dramatically increase the scale and complexity of neural networks, and to train and run them within a reasonable time frame, (2) the availability of large-scale datasets with millions of labeled training samples has made it possible to train deep CNNs without severe overfitting, and (3) the introduction of many training strategies, such as ReLU,1 Dropout,1 DropConnect,15 and batch normalization,16 helps produce better deep models via the back-propagation (BP) algorithm.

Recently, a common and popular way to improve the object recognition performance of CNNs has been to develop deeper network structures with higher complexity and then train them with large-scale datasets. However, this strategy is unsustainable and will inevitably reach its limit, because making very deep CNNs converge is
becoming more and more difficult, and training them also requires GPGPU/CPU clusters and complex distributed computing platforms. These requirements go beyond the limited budgets of many research groups and many real applications.

Learned features with good discriminative ability are essential to object recognition.17–21 Discriminative features are features with better within-class compactness and between-class separability. Many discriminative feature learning methods22–27 that are not based on deep learning have been proposed. However, constructing a highly efficient discriminative feature learning method for a CNN is non-trivial, because the CNN is trained with the mini-batch BP algorithm and a mini-batch cannot reflect the global distribution of the training set well. Owing to the large scale of the training set, it is unrealistic to input the whole training set in each iteration. In recent years, the contrastive loss10 and the triplet loss28 have been proposed to strengthen the discriminative ability of the features learned by a CNN. However, both of them suffer from dramatic data expansion when composing the sample pairs or triplets from the training set. Moreover, it has been reported that the way of constituting pairs or triplets of training samples can affect the accuracy of a CNN model by a few percentage points.17,28 As a result, using such losses may lead to slower model convergence, higher computational cost, and increased training complexity and uncertainty.

For almost all visual tasks, the human visual system (HVS) is still superior to current machine visual systems. Hence, developing a system that simulates some properties of the HVS is a promising research direction. Indeed, existing CNNs are well known for their local connectivity and shared weight properties that originate from discoveries in visual cortex research. Research findings in the areas of neuroscience, physiology, psychology, etc.,29–31 have shown that object recognition in the human visual cortex (HVC) is accomplished by the ventral stream, starting from the V1 area through the V2 area and V4 area, to the inferior temporal (IT) area, and then to the prefrontal cortex (PFC) area. Through this hierarchy, raw input stimuli from the retina are gradually transformed into higher-level representations that have better discriminative ability for speedy and accurate object recognition.

In this chapter, we introduce two deep discriminative feature learning methods for object recognition that draw lessons from HVC object recognition mechanisms, one inspired by the class-selectivity of the neurons in the IT area, and the other inspired by the "untangling" mechanism of the HVC. In the following, we first introduce the class-selectivity of the neurons in the IT area and the "untangling" mechanism of the HVC, respectively.

Class-selectivity of the neurons in the IT area. Research findings30 have revealed the class-selectivity of the neurons in the IT area. Specifically, the response of an IT neuron to visual stimuli is sparse with respect to classes, i.e., it only responds to very few classes. The class-selectivity implies that the feature vectors from different classes can be easily separated.
“Untangling” mechanism of human visual cortex. Works in the fields of psychology, neuroscience, physiology, etc.29,30,32 have revealed that object recognition in human brains is accomplished by the ventral stream that includes four layers, i.e., V1, V2, V4 and IT. If an object is transformed by any identity-preserving transformations (such as a shift in position, changes in pose, viewing angle, overall shape), it leads to different neuron population activities which can be viewed as the corresponding feature vectors describing the object (see Fig. 1). In feature space, a low-dimension manifold is formed by these feature vectors which correspond to all possible identity-preserving transformations of the object. At V1 layer, manifolds from different object categories are highly curved, and “tangled” with each other. From V1 layer to IT layer, neurons gradually gain the recognition ability for different object classes, implying that different manifolds will be gradually untangled. At IT layer, each manifold corresponding to an object category is very compact, while the distances among different manifolds are very large, and hence the discriminative features are learned (see Fig. 1).
Fig. 1. (color online) In the beginning, manifolds corresponding to different object classes are highly curved and "tangled"; for instance, a chair manifold (the blue manifold) is tangled with all other non-chair manifolds (the black manifold is just one example). After a series of transformations, each manifold corresponding to an object category becomes very compact, the distances between different manifolds become large, and the discriminative features are thus learned.30,33
Inspired by the class-selectivity of the neurons in the IT area,30 the entropy-orthogonality loss based deep discriminative feature learning method is proposed.34 Inspired by the "untangling" mechanism of the human visual cortex,30 the Min-Max loss based deep discriminative feature learning method is proposed.20,33 In the following two sections, we will introduce them, respectively.
2. Entropy-Orthogonality Loss Based Deep Discriminative Feature Learning Method

Inspired by the class-selectivity of the neurons in the IT area, Shi et al.34 proposed to improve the discriminative feature learning of CNN models by enabling the learned feature vectors to have class-selectivity. To achieve this, a novel loss function, termed entropy-orthogonality loss (EOL), is proposed to modulate the neuron outputs (i.e., feature vectors) in the penultimate layer of a CNN model. The EOL explicitly enables the feature vectors learned by a CNN model to have the following properties: (1) each dimension of the feature vectors only responds strongly to as few classes as possible, and (2) the feature vectors from different classes are as orthogonal as possible. Hence this method makes an analogy between the CNN's penultimate layer neurons and the IT neurons, and the EOL measures the degree of discrimination of the learned features. The EOL and the softmax loss have the same training requirement without the need to carefully recombine the training sample pairs or triplets. Accordingly, the training of CNN models is more efficient and easier to implement. When combined with the softmax loss, the EOL not only can enlarge the differences in the between-class feature vectors, but also can reduce the variations in the within-class feature vectors. Therefore the discriminative ability of the learned feature vectors is greatly improved, which is essential to object recognition. In the following, we will introduce the framework of the EOL-based deep discriminative feature learning method.

2.1. Framework

Assume that $T = \{X_i, c_i\}_{i=1}^n$ is the training set, where X_i represents the ith training sample (i.e., input image), c_i ∈ {1, 2, · · · , C} refers to the ground-truth label of X_i, C refers to the number of classes, and n refers to the number of training samples in T. For the input image X_i, we denote the output^a of the penultimate layer of a CNN by x_i, and view x_i as the feature vector of X_i learned by the CNN. This method improves discriminative feature learning of a CNN by embedding the entropy-orthogonality loss (EOL) into the penultimate layer of the CNN during training. For an L-layer CNN model, embedding the EOL into layer L − 1 of the CNN, the overall objective function is:
$$\min_{\mathbf{W}} \mathcal{L} = \sum_{i=1}^n \ell(\mathbf{W}, X_i, c_i) + \lambda \mathcal{M}(\mathbf{F}, \mathbf{c}), \qquad (1)$$
where ℓ(W, X_i, c_i) is the softmax loss for sample X_i, W denotes the total layer parameters of the CNN model, $\mathbf{W} = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^L$, W^(l) represents the filter weights of the lth layer, b^(l) refers to the corresponding biases, M(F, c) denotes the EOL, $\mathbf{F} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]$, and $\mathbf{c} = \{c_i\}_{i=1}^n$. Hyperparameter λ adjusts the balance between the softmax loss and the EOL.

^a Assume that the output has been reshaped into a column vector.
F directly depends on {W(l) , b(l) }L−1 l=1 . Hence M(F, c) can directly modulate all the layer parameters from 1th to (L − 1)th layers by BP algorithm during the training process. It is noteworthy that the EOL is independent of, and able to be applied to different CNN structures. Next, we will introduce the details of the EOL. 2.2. Entropy-Orthogonality Loss (EOL) In this subsection, we introduce an entropy and orthogonality based loss function, termed entropy-orthogonality loss (EOL), which measures the degree of discriminative ability of the learned feature vectors. For simplicity, assuming that the feature vector xi is a d-dimensional column vector (xi ∈ Rd×1 ). We call the k th (k = 1, 2, · · · , d) dimension of feature vector “class-sharing” if it is nonzero on many samples belonging to many classes (we call these classes “supported classes” of this dimension). Similarly, the k th dimension of feature vector is called “class-selective” if it is nonzero on samples only belonging to a few classes. The class-selectivity of the k th dimension increases as the number of its supported classes decreases. Naturally, we can define the entropyb of the k th dimension to measure the degree of its class-selectivity as: E(k) = −
C
Pkc logC (Pkc ) ,
(2)
c=1
Pkc =
|xj (k)| c j∈π n |x i (k)| i=1
j∈π n c
|xkj |
i=1 |xki |
,
(3)
where, xki (i.e., xi (k)) refers to the k th dimension of xi , πc represents the index set of the samples belonging to the cth class. The maximum possible value for E(k) is 1 when ∀c, Pkc = C1 , which means that the set of supported classes of dimension k includes all the classes and, therefore, dimension k is not class-selective at all (it is extremely “class-sharing”). Similarly, the minimum possible value of E(k) is 0 when ∃c, Pkc = 1 and ∀c = c, Pkc = 0, which means that the set of supported classes of dimension k includes just one class c and, therefore, dimension k is extremely class-selective. For dimension k, the degree of its class-selectivity is determined by the value of E(k) (between 0 and 1). As the value of E(k) decreases, the class-selectivity of dimension k increases. According to the discussions above, the entropy loss E(F, c) can be defined as: E(F, c) =
d
E(k) ,
(4)
k=1
where, F = [x1 , · · · , xn ], c = {ci }ni=1 . Minimizing the entropy loss is equivalent to enforcing that each dimension of the feature vectors should only respond strongly to as few classes as possible. However, b In
the definition of entropy, 0 logC (0) = 0.
March 12, 2020 10:0
ws-rv961x669
HBPRCV-6th Edn.–11573
36
chapter˙Shi˙Gong
page 36
W. Shi and Y. Gong
the entropy loss does not consider the connection between different dimensions, which is problematic. Take 3-dimensional feature vector as an example. If we have six feature vectors from 3 different classes, x1 and x2 come from class 1, x3 and x4 come from class 2, x5 and x6 come from class 3. For the feature vector = [x1 , x2 , x3 , x4 , x5 , x6 ], when it takes the following value of A and B, matrix F respectively, E(A, c) = E(B, c), where c = {1, 1, 2, 2, 3, 3}. However, the latter one can not be classified at all, this is because x2 , x4 and x6 have the same value. Although the situation can be partially avoided by the softmax loss, it can still cause contradiction to the softmax loss and therefore affect the discriminative ability of the learned features. ⎤ ⎡ ⎤ ⎡ 101000 11 00 11 (5) A = ⎣ 0 0 1 1 1 1 ⎦, B = ⎣ 0 1 0 1 0 1 ⎦ 1 1 1 1 1 1 0 0 1 0 1 0 2 2 2 To address this problem, we need to promote orthogonality (i.e., minimize dot products) between the feature vectors of different classes. Specifically, we need to introduce the following orthogonality loss O(F, c): n 2 2 (x (6) O(F, c) = i xj − φij ) = F F − ΦF , i,j=1
where,
1 , if ci = cj , (7) 0 , else , Φ = (φij )n×n , · F denotes the Frobenius norm of a matrix, and the superscript denotes the transpose of a matrix. Minimizing the orthogonality loss is equivalent to enforcing that (1) the feature vectors from different classes are as orthogonal as possible, (2) the L2 -norm of each feature vector is as close as possible to 1, and (3) the distance between any two feature vectors belonging to the same class is as small as possible. Based on the above discussions and definitions, the entropy-orthogonality loss (EOL) M(F, c) can be obtained by integrating Eq. (4) and Eq. (6): φij =
M(F, c) = αE(F, c) + (1 − α)O(F, c) =α
d
E(k) + (1 − α)F F − Φ2F ,
(8)
k=1
where α is the hyperparameter to adjust the balance between the two terms. Combining Eq. (8) with Eq. (1), the overall objective function becomes: n (W, Xi , ci ) + λαE(F, c) + λ(1 − α)O(F, c) min L(W, T ) = i=1
=
n
(W, Xi , ci ) + λ1 E(F, c) + λ2 O(F, c) ,
(9)
i=1
where, λ1 = λα, λ2 = λ(1 − α). Next, we will introduce the optimization algorithm for Eq. (9).
March 12, 2020 10:0
ws-rv961x669
HBPRCV-6th Edn.–11573
chapter˙Shi˙Gong
Deep Discriminative Feature Learning Method for Object Recognition
page 37
37
Forward Propagation
W(1) , b(1)
W(2) ,b(2)
W(3) , b(3)
W(4) ,b(4)
W(5) , b(5)
C -dimensional output
Input Image conv1
conv2
conv3
fc2
fc1
Entropy-Orthogonality Loss (EOL) Softmax Loss Error Flows Back-Propagation Process EOL Fig. 2. The flowchart of training process in an iteration for the EOL-based deep discriminative feature learning method.34 CNN shown in this figure consists of 3 convolutional (conv) layers and 2 fully connected (fc) layers, i.e., it is a 5-layer CNN model. The last layer fc2 outputs a C-dimensional prediction vector, C is the number of classes. The penultimate layer in this model is fc1, so the entropy-orthogonality loss (EOL) is applied to layer fc1. The EOL is independent of the CNN structure.
Algorithm 1 Training algorithm for the EOL-based deep discriminative feature learning method with an L-layer CNN model. Input: Training set T , hyperparameters λ1 , λ2 , maximum number of iterations Imax , and counter iter = 0. Output: W = {W(l) , b(l) }L l=1 . Select a training mini-batch from T . Perform the forward propagation, for each sample, computing the activations of all layers. 3: Perform the back-propagation from layer L to L − 1, sequentially computing the error flows of layer L and L − 1 from softmax loss by BP algorithm. ∂E(F,c) 4: Compute by Eq. (10), then scale them by λ1 . ∂xi 1: 2:
5: 6: 7: 8: 9: 10:
Compute ∂O(F,c) by Eq. (14), then scale them by λ2 . ∂xi Compute the total error flows of layer L − 1, which is the summation of the above different items. Perform the back-propagation from layer L − 1 layer to layer 1, sequentially compute the error flows of layer L − 1, · · · , 1, by BP algorithm. ∂L According to the activations and error flows of all layers, compute ∂W by BP algorithm. Update W by gradient descent algorithm. iter ← iter + 1. If iter < Imax , perform step 1.
March 12, 2020 10:0
ws-rv961x669
38
HBPRCV-6th Edn.–11573
chapter˙Shi˙Gong
page 38
W. Shi and Y. Gong
2.3. Optimization We employ the BP algorithm with mini-batch to train the CNN model. The overall objective function is Eq. (9). Hence, we need to compute the gradients of L with respect to (w.r.t.) the activations of all layers, which are called the error flows of the corresponding layers. The gradient calculation of the softmax loss is straightforward. In the following, we focus on obtaining the gradients of the E(F, c) and O(F, c) w.r.t. the feature vectors xi = [x1i , x2 , · · · , xdi ] , (i = 1, 2, · · · , n), respectively. The gradient of E(F, c) w.r.t. xi is ∂E(F, c) ∂E(1) ∂E(2) ∂E(d) =[ , ,··· , ] , ∂xi ∂x1i ∂x2i ∂xdi
(10)
C (1 + ln(Pkc )) ∂Pkc ∂E(k) · =− , ∂xki ln(C) ∂xki c=1
(11)
⎧ j∈π |xkj | ⎪ ⎨ (n c|x |)2 × sgn(xki ) , i ∈ πc ,
∂Pkc kj j=1 = − ⎪ j∈πc |xkj | ∂xki ⎩ n 2 × sgn(xki ) , i ∈ πc , ( j=1 |xkj |)
(12)
where sgn(·) is sign function. The O(F, c) can be written as: O(F, c) = F F − Φ2F = T r((F F − Φ) (F F − Φ)) = T r(F FF F) − 2T r(ΦF F) + T r(Φ Φ) ,
(13)
where T r(·) refers to the trace of a matrix. The gradients of O(F, c) w.r.t. xi is ∂O(F, c) = 4F(F F − Φ)(:,i) , ∂xi
(14)
where the subscript (:, i) represents the ith column of a matrix. Fig. 2 shows the flowchart of the training process in an iteration for the EOLbased deep discriminative feature learning method. Based on the above derivatives, the training algorithm for this method is listed in Algorithm 1. 3. Min-Max Loss Based Deep Discriminative Feature Learning Method Inspired by the “untangling” mechanism of human visual cortex,30 the Min-Max loss based deep discriminative feature learning method is proposed.20,33 The MinMax loss enforces the following properties for the features learned by a CNN model: (1) each manifold corresponding to an object category is as compact as possible, and (2) the margins (distances) between different manifolds are as large as possible.
March 12, 2020 10:0
ws-rv961x669
HBPRCV-6th Edn.–11573
chapter˙Shi˙Gong
Deep Discriminative Feature Learning Method for Object Recognition
page 39
39
In principle, the Min-Max loss is independent of any CNN structures, and can be applied to any layers of a CNN model. The experimental evaluations20,33 show that applying the Min-Max loss to the penultimate layer is most effective for improving the model’s object recognition accuracies. In the following, we will introduce the framework of the Min-Max loss based deep discriminative feature learning method. 3.1. Framework n
Let {Xi , ci }i=1 be the set of input training data, where Xi denotes the ith raw input data, ci ∈ {1, 2, · · · , C} denotes the corresponding ground-truth label, C is the number of classes, and n is the number of training samples. The goal of training CNN is to learn filter weights and biases that minimize the classification error from the output layer. A recursive function for an M -layer CNN model can be defined as follows: (m)
Xi
(m−1)
= f (W(m) ∗ Xi
+ b(m) ) , (0)
i = 1, 2, · · · , n; m = 1, 2, · · · , M ; Xi
= Xi ,
(15) (16)
where, W(m) denotes the filter weights of the mth layer to be learned, b(m) refers to the corresponding biases, ∗ denotes the convolution operation, f (·) is an element(m) wise nonlinear activation function such as ReLU, and Xi represents the feature maps generated at layer m for sample Xi . The total parameters of the CNN model can be denoted as W = {W(1) , · · · , W(M ) ; b(1) , · · · , b(M ) } for simplicity. This method improves discriminative feature learning of a CNN model by embedding the Min-Max loss into certain layer of the model during the training process. Embedding this loss into the k th layer is equivalent to using the following cost function to train the model: n (W, Xi , ci ) + λL(X (k) , c) , (17) min L = W
i=1
where (W, Xi , ci ) is the softmax loss for sample Xi , L(X (k) , c) denotes the Min(k) (k) Max loss. The input to it includes X (k) = {X1 , · · · , Xn } which denotes the set of produced feature maps at layer k for all the training samples, and c = {ci }ni=1 which is the set of corresponding labels. Hyper-parameter λ controls the balance between the classification error and the Min-Max loss. Note that X (k) depends on W(1) , · · · , W(k) . Hence directly constraining X (k) will modulate the filter weights from 1th to k th layers (i.e. W(1) , · · · , W(k) ) by feedback propagation during the training phase. 3.2. Min-Max Loss In the following, we will introduce two Min-Max losses, i.e., Min-Max loss on intrinsic and penalty graphs, and Min-Max loss based on within-manifold and betweenmanifold distances, respectively.
March 12, 2020 10:0
ws-rv961x669
HBPRCV-6th Edn.–11573
40
chapter˙Shi˙Gong
page 40
W. Shi and Y. Gong
Between-manifold Margin
Within-manifold Compactness
Manifold-1
xi
xj Manifold-2 Penalty Graph
Intrinsic Graph
(a)
(b)
Fig. 3. The adjacency relationships of (a) within-manifold intrinsic graph and (b) betweenmanifold penalty graph for the case of two manifolds. For clarity, the left intrinsic graph only includes the edges for one sample in each manifold.20
3.2.1. Min-Max Loss Based on Intrinsic and Penalty Graphs (k)
(k)
(k)
For X (k) = {X1 , · · · , Xn }, we denote by xi the column expansion of Xi . The goal of the Min-Max loss is to enforce both the compactness of each object manifold, and the max margin between different manifolds. The margin between two manifolds is defined as the Euclidian distance between the nearest neighbors of the two manifolds. Inspired by the Marginal Fisher Analysis research from,35 we can construct an intrinsic and a penalty graph to characterize the within-manifold compactness and the margin between the different manifolds, respectively, as shown in Fig. 3. The intrinsic graph shows the node adjacency relationships for all the object manifolds, where each node is connected to its k1 -nearest neighbors within the same manifold. Meanwhile, the penalty graph shows the between-manifold marginal node adjacency relationships, where the marginal node pairs from different manifolds are connected. The marginal node pairs of the cth (c ∈ {1, 2, · · · , C}) manifold are the k2 -nearest node pairs between manifold c and other manifolds. Then, from the intrinsic graph, the within-manifold compactness can be characterized as: n (I) Gij xi − xj 2 , (18) L1 = (I)
Gij =
i,j=1
1 , if i ∈ τk1 (j) or j ∈ τk1 (i) , 0 , else ,
(19)
(I)
where Gij refers to element (i, j) of the intrinsic graph adjacency matrix G(I) = (I)
(Gij )n×n , and τk1 (i) indicates the index set of the k1 -nearest neighbors of xi in the same manifold as xi . From the penalty graph, the between-manifold margin can be characterized as: n (P ) Gij xi − xj 2 , (20) L2 = i,j=1
March 12, 2020 10:0
ws-rv961x669
HBPRCV-6th Edn.–11573
chapter˙Shi˙Gong
Deep Discriminative Feature Learning Method for Object Recognition
(P )
Gij = (P )
where Gij
1 , if (i, j) ∈ ζk2 (ci ) or (i, j) ∈ ζk2 (cj ) , 0 , else ,
page 41
41
(21)
denotes element (i, j) of the penalty graph adjacency matrix G(P ) =
(P ) (Gij )n×n ,
ζk2 (c) is a set of index pairs that are the k2 -nearest pairs among the set / πc }, and πc denotes the index set of the samples belonging to the {(i, j)|i ∈ πc , j ∈ cth manifold. Based on the above descriptions, the Min-Max loss on intrinsic and penalty graphs can be expressed as: L = L1 − L2 .
(22)
Obviously, minimizing this Min-Max loss is equivalent to enforcing the learned features to form compact object manifolds and large margins between different manifolds simultaneously. Combining Eq. (22) with Eq. (17), the overall objective function becomes as follows: n (W, Xi , ci ) + λ(L1 − L2 ) , (23) min L = W
i=1
3.2.2. Min-Max Loss Based on Within-Manifold and Between-Manifold Distances

In this subsection, we implement the Min-Max loss by minimizing the within-manifold distance while maximizing the between-manifold distance for the learned feature maps of the layer to which the Min-Max loss is applied. Denote the column expansion of $X_i^{(k)}$ by $x_i$, and the index set of the samples belonging to class $c$ by $\pi_c$. Then the mean vector of the $k$th-layer feature maps belonging to class $c$ can be represented as

m_c = \frac{1}{n_c} \sum_{i \in \pi_c} x_i ,  (24)

where $n_c = |\pi_c|$. Similarly, the overall mean vector is

m = \frac{1}{n} \sum_{i=1}^{n} x_i ,  (25)

where $n = \sum_{c=1}^{C} |\pi_c|$.

The within-manifold distance $S_c^{(W)}$ for class $c$ can be represented as

S_c^{(W)} = \sum_{i \in \pi_c} (x_i - m_c)^\top (x_i - m_c) .  (26)

The total within-manifold distance $S^{(W)}$ can be computed as

S^{(W)} = \sum_{c=1}^{C} S_c^{(W)} .  (27)
Minimizing $S^{(W)}$ is equivalent to enforcing the within-manifold compactness. The total between-manifold distance $S^{(B)}$ can be expressed as

S^{(B)} = \sum_{c=1}^{C} n_c (m_c - m)^\top (m_c - m) .  (28)

Maximizing $S^{(B)}$ is equivalent to enlarging the between-manifold distances. Using the above math notations, the Min-Max loss based on within-manifold and between-manifold distances can be defined as follows:

L(X^{(k)}, c) = \frac{S^{(B)}}{S^{(W)}} .  (29)

Obviously, maximizing this Min-Max loss is equivalent to enforcing the learned features to form compact object manifolds and large distances between different manifolds simultaneously. Combining Eq. (29) with Eq. (17), the overall objective function becomes:

\min_W L = \sum_{i=1}^{n} \ell(W, X_i, c_i) - \lambda \frac{S^{(B)}}{S^{(W)}} .  (30)
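For concreteness, here is a small NumPy sketch (our own illustration; the array shapes and function name are assumptions) of the quantities in Eqs. (24)-(29):

import numpy as np

def scatter_ratio_loss(X, labels):
    """Ratio-form Min-Max loss of Eq. (29): S^(B) / S^(W).

    X      : (n, d) array of penultimate-layer features.
    labels : (n,) integer class labels.
    """
    m = X.mean(axis=0)                           # overall mean, Eq. (25)
    S_W, S_B = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                     # class mean, Eq. (24)
        S_W += np.sum((Xc - mc) ** 2)            # Eqs. (26)-(27)
        S_B += len(Xc) * np.sum((mc - m) ** 2)   # Eq. (28)
    return S_B / S_W                             # Eq. (29)

During training, this ratio is subtracted from the softmax loss with weight λ, as in Eq. (30).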
3.3. Optimization

We use the back-propagation method to train the CNN model, which is carried out using mini-batches. Therefore, we need to calculate the gradients of the overall objective function with respect to the features of the corresponding layers. Because the softmax loss is used as the first term of Eqs. (23) and (30), its gradient calculation is straightforward. In the following, we focus on obtaining the gradient of the Min-Max loss with respect to the feature maps $x_i$ in the corresponding layer.

3.3.1. Optimization for Min-Max Loss Based on Intrinsic and Penalty Graphs

Let $G = (G_{ij})_{n \times n} = G^{(I)} - G^{(P)}$; then the Min-Max objective can be written as:

L = \sum_{i,j=1}^{n} G_{ij} \|x_i - x_j\|^2 = 2\,\mathrm{Tr}(H \Psi H^\top) ,  (31)

where $H = [x_1, \cdots, x_n]$, $\Psi = D - G$, $D = \mathrm{diag}(d_{11}, \cdots, d_{nn})$ with $d_{ii} = \sum_{j=1, j \neq i}^{n} G_{ij}$, $i = 1, 2, \cdots, n$, i.e. $\Psi$ is the Laplacian matrix of $G$, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. The gradient of $L$ with respect to $x_i$ is

\frac{\partial L}{\partial x_i} = 2 H (\Psi + \Psi^\top)_{(:,i)} = 4 H \Psi_{(:,i)} ,  (32)

where $\Psi_{(:,i)}$ denotes the $i$th column of matrix $\Psi$.
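A compact NumPy sketch of Eqs. (31)-(32) (again our own illustration, reusing the adjacency matrices sketched above):

import numpy as np

def graph_loss_and_grad(H, G_intrinsic, G_penalty):
    """Graph-based Min-Max loss (Eq. 31) and its gradient (Eq. 32).

    H : (d, n) matrix whose columns are the features x_1, ..., x_n.
    """
    G = G_intrinsic - G_penalty
    D = np.diag(G.sum(axis=1) - np.diag(G))   # d_ii = sum_{j != i} G_ij
    Psi = D - G                               # Laplacian matrix of G
    loss = 2.0 * np.trace(H @ Psi @ H.T)      # Eq. (31)
    grad = 4.0 * H @ Psi                      # column i is dL/dx_i, Eq. (32)
    return loss, grad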
3.3.2. Optimization for Min-Max Loss Based on Within-Manifold and Between-Manifold Distances

Let $\mathbf{S}^{(W)}$ and $\mathbf{S}^{(B)}$ be the within-manifold scatter matrix and between-manifold scatter matrix, respectively; then we have

S^{(W)} = \mathrm{Tr}(\mathbf{S}^{(W)}) , \quad S^{(B)} = \mathrm{Tr}(\mathbf{S}^{(B)}) .  (33)

According to [36, 37], the scatter matrices $\mathbf{S}^{(W)}$ and $\mathbf{S}^{(B)}$ can be calculated by:

\mathbf{S}^{(W)} = \frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(W)} (x_i - x_j)(x_i - x_j)^\top ,  (34)

\mathbf{S}^{(B)} = \frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(B)} (x_i - x_j)(x_i - x_j)^\top ,  (35)

where $n$ is the number of inputs in a mini-batch, and $\Omega_{ij}^{(W)}$ and $\Omega_{ij}^{(B)}$ are respectively elements $(i, j)$ of the within-manifold adjacency matrix $\Omega^{(W)} = (\Omega_{ij}^{(W)})_{n \times n}$ and the between-manifold adjacency matrix $\Omega^{(B)} = (\Omega_{ij}^{(B)})_{n \times n}$, based on the features $X^{(k)}$ (i.e., the feature maps generated at the $k$th layer, $X^{(k)} = \{x_1, \cdots, x_n\}$) from one mini-batch of training data, which can be computed as:

\Omega_{ij}^{(W)} = \begin{cases} \frac{1}{n_c}, & \text{if } c_i = c_j = c , \\ 0, & \text{otherwise} , \end{cases} \qquad \Omega_{ij}^{(B)} = \begin{cases} \frac{1}{n} - \frac{1}{n_c}, & \text{if } c_i = c_j = c , \\ \frac{1}{n}, & \text{otherwise} .  \end{cases}  (36)

Based on the above descriptions and [38], the Min-Max loss based on within-manifold and between-manifold distances $L$ can be written as:

L = \frac{\mathrm{Tr}(\mathbf{S}^{(B)})}{\mathrm{Tr}(\mathbf{S}^{(W)})} = \frac{\frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(B)} \|x_i - x_j\|^2}{\frac{1}{2} \sum_{i,j=1}^{n} \Omega_{ij}^{(W)} \|x_i - x_j\|^2} = \frac{\mathbf{1}_n^\top (\Omega^{(B)} \circ \Phi) \mathbf{1}_n}{\mathbf{1}_n^\top (\Omega^{(W)} \circ \Phi) \mathbf{1}_n} ,  (37)

where $\Phi = (\Phi_{ij})_{n \times n}$ is an $n \times n$ matrix with $\Phi_{ij} = \|x_i - x_j\|^2$, $\circ$ denotes the element-wise product, and $\mathbf{1}_n \in \mathbb{R}^n$ is a column vector with all elements equal to one. The gradients of $\mathrm{Tr}(\mathbf{S}^{(W)})$ and $\mathrm{Tr}(\mathbf{S}^{(B)})$ with respect to $x_i$ are:

\frac{\partial \mathrm{Tr}(\mathbf{S}^{(W)})}{\partial x_i} = (x_i \mathbf{1}_n^\top - H)(\Omega^{(W)} + \Omega^{(W)\top})_{(:,i)} ,  (38)

\frac{\partial \mathrm{Tr}(\mathbf{S}^{(B)})}{\partial x_i} = (x_i \mathbf{1}_n^\top - H)(\Omega^{(B)} + \Omega^{(B)\top})_{(:,i)} ,  (39)

where $H = [x_1, \cdots, x_n]$, and the subscript $(:, i)$ denotes the $i$th column of a matrix. Then the gradient of the Min-Max loss with respect to the features $x_i$ is

\frac{\partial L}{\partial x_i} = \frac{\mathrm{Tr}(\mathbf{S}^{(W)}) \frac{\partial \mathrm{Tr}(\mathbf{S}^{(B)})}{\partial x_i} - \mathrm{Tr}(\mathbf{S}^{(B)}) \frac{\partial \mathrm{Tr}(\mathbf{S}^{(W)})}{\partial x_i}}{[\mathrm{Tr}(\mathbf{S}^{(W)})]^2} .  (40)
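The following NumPy sketch (our own illustration, not the authors' implementation) builds the adjacency matrices of Eq. (36) from a mini-batch and evaluates the loss of Eq. (37) together with the gradients of Eqs. (38)-(40):

import numpy as np

def min_max_ratio_loss_and_grad(H, labels):
    """Eqs. (36)-(40). H is (d, n): column i is feature x_i; labels is (n,)."""
    d, n = H.shape
    same = labels[:, None] == labels[None, :]                  # c_i == c_j
    counts = np.array([np.sum(labels == c) for c in labels])   # n_c for each sample
    W = np.where(same, 1.0 / counts[None, :], 0.0)             # Omega^(W), Eq. (36)
    B = np.where(same, 1.0 / n - 1.0 / counts[None, :], 1.0 / n)  # Omega^(B), Eq. (36)

    sq = np.sum(H ** 2, axis=0)
    Phi = sq[:, None] + sq[None, :] - 2.0 * H.T @ H            # Phi_ij = ||x_i - x_j||^2
    tr_SW = 0.5 * np.sum(W * Phi)                              # Tr(S^(W))
    tr_SB = 0.5 * np.sum(B * Phi)                              # Tr(S^(B))
    loss = tr_SB / tr_SW                                       # Eq. (37)

    # Columns of these matrices are the per-sample gradients of Eqs. (38)-(39).
    grad_SW = np.stack([(H[:, [i]] - H) @ (W + W.T)[:, i] for i in range(n)], axis=1)
    grad_SB = np.stack([(H[:, [i]] - H) @ (B + B.T)[:, i] for i in range(n)], axis=1)
    grad = (tr_SW * grad_SB - tr_SB * grad_SW) / (tr_SW ** 2)  # Eq. (40)
    return loss, grad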
Incremental mini-batch training procedure
In practice, when the number of classes is large relative to the mini-batch size, there is no guarantee that each mini-batch will contain training samples from all the classes, so the above gradient must be calculated in an incremental fashion. Firstly, the mean vector of the $c$th class can be updated as

m_c(t) = \frac{\sum_{i \in \pi_c(t)} x_i(t) + N_c(t-1)\, m_c(t-1)}{N_c(t)} ,  (41)

where $(t)$ indicates the $t$th iteration, $N_c(t)$ represents the cumulative total number of $c$th-class training samples, $\pi_c(t)$ denotes the index set of the samples belonging to the $c$th class in a mini-batch, and $n_c(t) = |\pi_c(t)|$. Accordingly, the overall mean vector $m(t)$ can be updated by

m(t) = \frac{1}{n} \sum_{c=1}^{C} n_c(t)\, m_c(t) ,  (42)

where $n = \sum_{c=1}^{C} |\pi_c(t)|$, i.e. $n$ is the number of training samples in a mini-batch. In this scenario, at the $t$th iteration, the within-manifold distance $S_c^{(W)}(t)$ for class $c$ can be represented as

S_c^{(W)}(t) = \sum_{i \in \pi_c(t)} (x_i(t) - m_c(t))^\top (x_i(t) - m_c(t)) ,  (43)

the total within-manifold distance $S^{(W)}(t)$ can be denoted as

S^{(W)}(t) = \sum_{c=1}^{C} S_c^{(W)}(t) ,  (44)

and the total between-manifold distance $S^{(B)}(t)$ can be expressed as

S^{(B)}(t) = \sum_{c=1}^{C} n_c(t)\,(m_c(t) - m(t))^\top (m_c(t) - m(t)) .  (45)

Then the gradients of $S^{(W)}(t)$ and $S^{(B)}(t)$ with respect to $x_i(t)$ become:

\frac{\partial S^{(W)}(t)}{\partial x_i(t)} = \sum_{c=1}^{C} I(i \in \pi_c(t)) \frac{\partial S_c^{(W)}(t)}{\partial x_i(t)} = 2 \sum_{c=1}^{C} I(i \in \pi_c(t)) \left( (x_i(t) - m_c(t)) + \frac{n_c(t)\, m_c(t) - \sum_{j \in \pi_c(t)} x_j(t)}{N_c(t)} \right) ,  (46)

and

\frac{\partial S^{(B)}(t)}{\partial x_i(t)} = \frac{\partial \sum_{c=1}^{C} n_c(t)(m_c(t) - m(t))^\top (m_c(t) - m(t))}{\partial x_i(t)} = 2 \sum_{c=1}^{C} I(i \in \pi_c(t)) \frac{n_c(t)\,(m_c(t) - m(t))}{N_c(t)} ,  (47)

where $I(\cdot)$ refers to the indicator function that equals 1 if the condition is satisfied and 0 otherwise. Accordingly, the gradient of the Min-Max loss with respect to the features $x_i(t)$ is

\frac{\partial L}{\partial x_i(t)} = \frac{S^{(W)}(t)\, \frac{\partial S^{(B)}(t)}{\partial x_i(t)} - S^{(B)}(t)\, \frac{\partial S^{(W)}(t)}{\partial x_i(t)}}{[S^{(W)}(t)]^2} .  (48)

The total gradient with respect to $x_i$ is the sum of the gradient from the softmax loss and that of the Min-Max loss.
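A minimal sketch of the running statistics of Eqs. (41)-(42) (ours; keeping the state in Python dictionaries is an assumption, not the authors' implementation):

import numpy as np

def update_class_means(X_batch, labels, means, counts):
    """Incrementally update class means (Eq. 41) and return the overall mean (Eq. 42).

    means  : dict class -> running mean vector m_c(t-1)
    counts : dict class -> cumulative sample count N_c(t-1)
    """
    n = len(labels)
    overall = np.zeros(X_batch.shape[1])
    for c in np.unique(labels):
        Xc = X_batch[labels == c]
        Nc_prev = counts.get(c, 0)
        mc_prev = means.get(c, np.zeros(X_batch.shape[1]))
        counts[c] = Nc_prev + len(Xc)                                # N_c(t)
        means[c] = (Xc.sum(axis=0) + Nc_prev * mc_prev) / counts[c]  # Eq. (41)
        overall += len(Xc) * means[c]                                # n_c(t) * m_c(t)
    return overall / n                                               # Eq. (42)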
4. Experiments with Image Classification Task

4.1. Experimental Setups

The performance evaluations are conducted using one shallow model, QCNN [39], and two well-known deep models, NIN [40] and AlexNet [1]. During training, the EOL (or the Min-Max loss) is applied to the penultimate layer of the models without changing the network structures [20, 33, 34]. For the hyperparameters, including the dropout ratio, learning rate, weight decay and momentum, we abide by the original network settings. The hardware used in the experiments is one NVIDIA K80 GPU and one Intel Xeon E5-2650v3 CPU. The software used in the experiments is the Caffe platform [39]. All models are trained from scratch without pre-training. In the following, we use Min-Max∗ and Min-Max to denote the Min-Max loss based on the intrinsic and penalty graphs, and the Min-Max loss based on within-manifold and between-manifold distances, respectively.

4.2. Datasets

The CIFAR10 [41], CIFAR100 [41], MNIST [42] and SVHN [43] datasets are chosen to conduct the performance evaluations. CIFAR10 and CIFAR100 are natural image datasets. MNIST is a dataset of hand-written digit (0-9) images. SVHN is collected from house numbers in Google Street View images. An SVHN image may contain more than one digit, but the task is to classify the digit in the image center. Table 1 lists the details of the CIFAR10, CIFAR100, MNIST and SVHN datasets. These four datasets are very popular in the image classification research community because they contain a large number of small images, which enables models to be trained in reasonable time frames on computers with moderate configurations.

Table 1. Details of the CIFAR10, CIFAR100, MNIST and SVHN datasets.

Dataset     #Classes   #Samples   Size and Format     Split
CIFAR10     10         60000      32×32 RGB           training/test: 50000/10000
CIFAR100    100        60000      32×32 RGB           training/test: 50000/10000
MNIST       10         70000      28×28 gray-scale    training/test: 60000/10000
SVHN        10         630420     32×32 RGB           training/test/extra: 73257/26032/531131
4.3. Experiments using QCNN Model

First, the "quick" CNN model from the official Caffe package [39] is selected as the baseline (termed QCNN). It consists of 3 convolutional (conv) layers and 2 fully connected (fc) layers. We evaluated the QCNN model using CIFAR10, CIFAR100 and SVHN, respectively. MNIST cannot be used to evaluate the QCNN model, because the input size of QCNN must be 32×32, but the images in MNIST are 28×28 in size. Table 2 shows the test set top-1 error rates on CIFAR10, CIFAR100 and SVHN, respectively. It can be seen that training QCNN with the EOL or the Min-Max loss effectively improves performance compared to the respective baseline. These remarkable performance improvements clearly reveal the effectiveness of the EOL and the Min-Max loss.

Table 2. Comparisons of the test error rates (%) on the CIFAR10, CIFAR100 and SVHN datasets using QCNN.

Method                 CIFAR10   CIFAR100   SVHN
QCNN (Baseline)        23.47     55.87      8.92
QCNN+EOL [34]          16.74     50.09      4.47
QCNN+Min-Max∗ [20]     18.06     51.38      5.42
QCNN+Min-Max [33]      17.54     50.80      4.80

Table 3. Comparisons of the test error rates (%) on the CIFAR10, CIFAR100, MNIST and SVHN datasets using NIN.

Method                 CIFAR10   CIFAR100   MNIST   SVHN
NIN [40]               10.41     35.68      0.47    2.35
DSN [44]               9.78      34.57      0.39    1.92
NIN (Baseline)         10.20     35.50      0.47    2.55
NIN+EOL [34]           8.41      32.54      0.30    1.70
NIN+Min-Max∗ [20]      9.25      33.58      0.32    1.92
NIN+Min-Max [33]       8.83      32.95      0.30    1.80
4.4. Experiments using NIN Model

Next, we apply the EOL or the Min-Max loss to the NIN model [40]. NIN consists of 9 conv layers without any fc layer. Four datasets, namely CIFAR10, CIFAR100, MNIST and SVHN, are used in the evaluation. For fairness, we complied with the same training/testing protocols and data preprocessing as in [40, 44]. Table 3 provides the respective comparison results of test set top-1 error rates for the four datasets. For the NIN baseline, to be fair, we report the evaluation results from both our own experiments and the original paper [40]. We also include the results of DSN [44] in this table. DSN is also based on NIN, with layer-wise supervision. These results again reveal the effectiveness of the EOL and the Min-Max loss.
Fig. 4. Feature visualization of the CIFAR10 test set, with (a) QCNN, (b) QCNN+EOL, (c) QCNN+Min-Max∗, and (d) QCNN+Min-Max. One dot denotes an image; different colors denote different classes.

Fig. 5. Feature visualization of the CIFAR10 test set, with (a) NIN, (b) NIN+EOL, (c) NIN+Min-Max∗, and (d) NIN+Min-Max.
4.5. Feature Visualization

We utilize t-SNE [45] to visualize the learned feature vectors extracted from the penultimate layer of the QCNN and NIN models on the CIFAR10 test set, respectively. Figs. 4 and 5 show the respective feature visualizations for the two models. It can be observed that the EOL and the Min-Max loss make the learned feature vectors have better between-class separability and within-class compactness compared to the respective baselines. Therefore, the discriminative ability of the learned feature vectors is greatly improved.

5. Discussions

All the experiments in Section 4 indicate the superiority of the EOL and the Min-Max loss. The reasons why better within-class compactness and between-class separability lead to better discriminative ability of the learned feature vectors are as follows:

(1) Almost all data clustering methods [46-48], discriminant analysis methods [35, 49, 50], etc., use this principle to learn discriminative features to better accomplish the task. Data clustering can be regarded as unsupervised data classification. Therefore, by analogy, learning features that possess the above property will certainly improve performance accuracies for supervised data classification.
(2) As described in the Introduction, the human visual cortex employs a similar mechanism to accomplish the goal of discriminative feature extraction. This discovery serves as an additional justification for this principle.

References

1. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, (2012).
2. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556. (2014).
3. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, (2015).
4. K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385. (2015).
5. C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, Scalable, high-quality object detection, arXiv preprint arXiv:1412.1441. (2014).
6. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, (2014).
7. R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, (2015).
8. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, (2015).
9. J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1882, (2014).
10. Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pp. 1988–1996, (2014).
11. N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, (2013).
12. J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for content-based image retrieval: A comprehensive study. In Proc. ACM Int. Conf. on Multimedia, pp. 157–166, (2014).
13. C. Dong, C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pp. 184–199, (2014).
14. L. Kang, P. Ye, Y. Li, and D. Doermann, Convolutional neural networks for no-reference image quality assessment, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1733–1740, (2014).
15. L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the International Conference on Machine Learning, pp. 1058–1066, (2013).
16. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pp. 448–456, (2015).
17. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, pp. 499–515, (2016).
18. W. Shi, Y. Gong, J. Wang, and N. Zheng. Integrating supervised laplacian objective with CNN for object recognition. In Pacific Rim Conference on Multimedia, pp. 64–73, (2016).
19. G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs, IEEE Transactions on Geoscience and Remote Sensing. (2018). doi: 10.1109/TGRS.2017.2783902.
20. W. Shi, Y. Gong, and J. Wang. Improving CNN performance with min-max objective. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2004–2010, (2016).
21. W. Shi, Y. Gong, X. Tao, and N. Zheng, Training DCNN by combining max-margin, max-correlation objectives, and correntropy loss for multilabel image classification, IEEE Transactions on Neural Networks and Learning Systems. 29(7), 2896–2908, (2018).
22. C. Li, Q. Liu, W. Dong, F. Wei, X. Zhang, and L. Yang, Max-margin-based discriminative feature learning, IEEE Transactions on Neural Networks and Learning Systems. 27(12), 2768–2775, (2016).
23. G.-S. Xie, X.-Y. Zhang, X. Shu, S. Yan, and C.-L. Liu. Task-driven feature pooling for image classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1179–1187, (2015).
24. G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, SDE: A novel selective, discriminative and equalizing feature representation for visual recognition, International Journal of Computer Vision. 124(2), 145–168, (2017).
25. G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, Hybrid CNN and dictionary-based models for scene recognition and domain adaptation, IEEE Transactions on Circuits and Systems for Video Technology. 27(6), 1263–1274, (2017).
26. J. Tang, Z. Li, H. Lai, L. Zhang, S. Yan, et al., Personalized age progression with bi-level aging dictionary learning, IEEE Transactions on Pattern Analysis and Machine Intelligence. 40(4), 905–917, (2018).
27. G.-S. Xie, X.-B. Jin, Z. Zhang, Z. Liu, X. Xue, and J. Pu, Retargeted multi-view feature learning with separate and shared subspace uncovering, IEEE Access. 5, 24895–24907, (2017).
28. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, (2015).
29. T. Serre, A. Oliva, and T. Poggio, A feedforward architecture accounts for rapid categorization, Proceedings of the National Academy of Sciences. 104(15), 6424–6429, (2007).
30. J. J. DiCarlo, D. Zoccolan, and N. C. Rust, How does the brain solve visual object recognition?, Neuron. 73(3), 415–434, (2012).
31. S. Zhang, Y. Gong, and J. Wang. Improving DCNN performance with sparse category-selective objective function. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2343–2349, (2016).
32. N. Pinto, N. Majaj, Y. Barhomi, E. Solomon, D. Cox, and J. DiCarlo. Human versus machine: comparing visual object recognition systems on a level playing field. In Computational and Systems Neuroscience, (2010).
33. W. Shi, Y. Gong, X. Tao, J. Wang, and N. Zheng, Improving CNN performance accuracies with min-max objective, IEEE Transactions on Neural Networks and Learning Systems. 29(7), 2872–2885, (2018).
34. W. Shi, Y. Gong, D. Cheng, X. Tao, and N. Zheng, Entropy and orthogonality based deep discriminative feature learning for object recognition, Pattern Recognition. 81, 71–80, (2018).
35. S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence. 29(1), 40–51, (2007).
36. M. Sugiyama. Local fisher discriminant analysis for supervised dimensionality reduction. In Proc. Int. Conf. Mach. Learn., pp. 905–912, (2006).
37. G. S. Xie, X. Y. Zhang, Y. M. Zhang, and C. L. Liu. Integrating supervised subspace criteria with restricted boltzmann machine for feature extraction. In Int. Joint Conf. on Neural Netw., (2014).
38. M. K. Wong and M. Sun, Deep learning regularized fisher mappings, IEEE Transactions on Neural Networks. 22, 1668–1675, (2011).
39. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pp. 675–678, (2014).
40. M. Lin, Q. Chen, and S. Yan, Network in network, arXiv preprint arXiv:1312.4400. (2013).
41. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, Master's thesis, University of Toronto. (2009).
42. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE. 86(11), 2278–2324, (1998).
43. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In Neural Information Processing Systems (NIPS) workshop on deep learning and unsupervised feature learning, vol. 2011, p. 5, (2011).
44. C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570, (2015).
45. L. Van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research. 9, 2579–2605, (2008).
46. A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: a review, ACM Computing Surveys (CSUR). 31(3), 264–323, (1999).
47. U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing. 17(4), 395–416, (2007).
48. S. Zhou, Z. Xu, and F. Liu, Method for determining the optimal number of clusters based on agglomerative hierarchical clustering, IEEE Transactions on Neural Networks and Learning Systems. (2016).
49. R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics. 7(2), 179–188, (1936).
50. J. L. Andrews and P. D. Mcnicholas, Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions, Statistics and Computing. 22(5), 1021–1029, (2012).
CHAPTER 1.3

DEEP LEARNING BASED BACKGROUND SUBTRACTION: A SYSTEMATIC SURVEY

Jhony H. Giraldo¹, Huu Ton Le² and Thierry Bouwmans¹,*

¹ Lab. MIA, La Rochelle Univ., Avenue M. Crépeau, 17000 La Rochelle, France
* E-mail: [email protected]
² ICTLab/USTH, Hanoi, Vietnam
Machine learning has been widely applied for detection of moving objects from static cameras. Recently, many methods using deep learning for background subtraction have been reported, with very promising performance. This chapter provides a survey of different deep-learning based background subtraction methods. First, a comparison of the architecture of each method is provided, followed by a discussion against the specific application requirements such as spatio-temporal and real-time constraints. After analyzing the strategies of each method and showing their limitations, a comparative evaluation on the large scale CDnet2014 dataset is provided. Finally, we conclude with some potential future research directions.
1. Introduction
Background subtraction is an essential process in several applications to model the background as well as to detect the moving objects in the scene, as in video surveillance [1], optical motion capture [2] and multimedia [3]. Different machine learning models have been used for background modeling and foreground detection, such as Support Vector Machine (SVM) models [4][5][6], fuzzy learning models [7][8][9], subspace learning models [10][11][12], and neural network models [13][14][15]. Deep learning methods based on Deep Neural Networks (DNNs) with Convolutional Neural Networks (CNNs, also called ConvNets) have the ability to alleviate the disadvantages of parameter setting inherent in conventional neural networks. Although CNNs have existed for a long time, their application to computer vision remained limited for a long period, due to the lack of large training datasets, the size of the considered networks, and the available computation power. One of the first breakthroughs was made in 2012 by Krizhevsky et al. [27], with the supervised training of a CNN with 8 layers and millions of parameters; the training dataset was the ImageNet database with 1 million training images [28], the largest image dataset at that time. Since this research, along with the progress of storage devices and GPU computation power, even larger and deeper networks have become trainable. DNNs have also been applied in the field of background/foreground separation in videos taken by a fixed camera. The deployment of DNNs brought a large performance improvement for background generation [17][31][36][40][41][42][43][44], background subtraction [59][60][61][62][63], ground-truth generation [64], and deep learned features [122][123][124][125][126]. The rest of this chapter is organized as follows: a review of background subtraction models based on deep neural networks, comparing different network architectures and discussing their adequacy for this task, is given in Section 2. A comparative evaluation on the large-scale ChangeDetection.Net (CDnet) 2014 dataset is given in Section 3. Finally, conclusions are given in Section 4.

2. Background Subtraction
The goal of background subtraction is to label pixels as background or foreground by comparing the background image with the current image. DNN-based methods dominate the performance on the CDnet 2014 dataset with six supervised models: 1) FgSegNet_M [59] and its variants FgSegNet_S [60] and FgSegNet_V2 [61], 2) BSGAN [62] and its variant BSPVGAN [63], and 3) Cascaded CNNs [64]. These works were inspired by three unsupervised approaches, which are multi-features/multi-cues or semantic methods (IUTIS-3 [65], IUTIS-5 [65], SemanticBGS [66]). However, background subtraction is a classification task and can be solved successfully with DNNs.

2.1. Convolutional Neural Networks

One of the first attempts to use Convolutional Neural Networks (CNNs) for background subtraction was made by Braham and Van Droogenbroeck [67]. The model, named ConvNet, has a structure borrowed from LeNet-5 [68] with a few modifications. The subsampling is performed with max-pooling instead of averaging, and the hidden sigmoid activation function is replaced with rectified linear units (ReLU) for faster training. In general, background subtraction can be divided into four stages: background image extraction via a temporal median in grey scale, scene-specific dataset generation, network training, and background subtraction. Practically, the background model is specific to each scene. For every frame in a video sequence, Braham and Van Droogenbroeck [67] extract the
image patches for each pixel and then combine them with the corresponding patches from the background model. The image patch size in this work is 27*27. These combined patches are then used as input of a neural network to predict the probability of a pixel being foreground or background. The authors use 5*5 local receptive fields, and 3*3 non-overlapping receptive fields for all pooling layers. The numbers of feature maps of the first two convolutional layers are 6 and 16, respectively. The first fully connected layer consists of 120 neurons and the output layer generates a single sigmoid unit. There are 20,243 parameters, which are trained using back-propagation with a cross-entropy loss function. For training, the algorithm needs the foreground results of a previous segmentation algorithm (IUTIS [65]) or the ground-truth information provided in CDnet 2014 [19]. The CDnet 2014 dataset was divided into two halves: one for training, and one for testing purposes. ConvNet shows a very similar performance to other state-of-the-art methods. Moreover, it outperforms all other methods significantly when the ground-truth information is used, especially on videos with hard shadows and night videos. The F-Measure score of ConvNet on the CDnet 2014 dataset is 0.9046. Similar DNN approaches have been applied in other applications such as vehicle detection [69] and pedestrian detection [127]. To be more precise, Yan et al. [127] used a similar scheme to detect pedestrians with both visible and thermal images. The inputs of the network consist of the visible frame (RGB), thermal frame (IR), visible background (RGB) and thermal background (IR), which sum up to an input size of 64*64*8. This method shows a great improvement on the OCTBVS dataset, in comparison with T2F-MOG, SuBSENSE, and DECOLOR. Remarks: ConvNet is one of the simplest approaches to model the differences between the background and the foreground using CNNs. A key contribution of the study of Braham and Van Droogenbroeck [67] is that it was the first application of deep learning to background subtraction. For this reason, it can be used as a reference for comparison in terms of the improvement in performance. However, several limitations are present. First, it is difficult to learn high-level information through patches [93]. Second, due to the overfitting caused by using highly redundant data for training, the network is scene-specific. In practice, the model can only process a specific scene, and needs to be retrained for other video scenes. For many applications where the camera is fixed and always captures a similar scene, this is not a serious problem. However, it may not be the case in certain applications, as discussed by Hu et al. [71]. Third, ConvNet processes each pixel independently, so the foreground mask may contain isolated false positives and false negatives. Fourth, this method requires the extraction of a large number of patches from each frame in the video, which is computationally very expensive, as pointed out by Lim
and Keles [59]. Fifth, the method requires pre- or post-processing of the data, and thus is not applicable to an end-to-end learning framework. The long-term dependencies of the input video sequences are not considered, since ConvNet uses only a few frames as input. ConvNet is a deep encoder-decoder network, that is, a generator network. However, one of the disadvantages of classical generator networks is that they are unable to preserve object edges, because they minimize a classical loss function (e.g., the Euclidean distance) between the predicted output and the ground truth [93]. This leads to the generation of blurry foreground regions. Since this first valuable study, subsequent methods have been introduced to alleviate these limitations.

2.2. Multi-scale and Cascaded CNNs

A brief review of multi-scale and cascaded CNNs is given in this section. Wang et al. [64] targeted the problem of ground-truth generation in the context of background modeling algorithm validation, and introduced a deep learning method for iterative generation of the ground truth. First, Wang et al. [64] extract a local patch of size 31*31 in each RGB channel of each pixel, and this image patch is fed into a basic CNN and a multi-scale CNN. The CNN is built using 4 convolutional layers and 2 fully connected layers. The first 2 convolutional layers are each followed by a 2*2 max pooling layer. The filter size of the convolutional layers is 7*7 and the authors use the Rectified Linear Unit (ReLU) as activation function. Wang et al. [64] considered the CNN output as a likelihood probability and a cross-entropy loss function is used for training. This model processes images of size 31*31; as a result, the algorithm is limited to processing patches of that size or less. This limitation is alleviated by introducing the multi-scale CNN model, which generates outputs at three different sizes that are further combined into the original size. In order to model the dependencies among adjacent pixels as well as to enforce spatial coherence, avoiding isolated false positives and false negatives in the foreground mask, Wang et al. [64] introduced a cascaded architecture called Cascaded CNN. Experiments showed that this CNN architecture has the advantage of learning its own features, which may be more discriminative than hand-designed features. The foreground objects from video frames are manually annotated and used to train the CNN to learn the foreground features. After the training step, the CNN employs generalization to segment the remaining frames of the video. A scene-specific network, trained with 200 manually selected frames, was proposed by Wang et al. [64]. The Cascaded CNN achieves an F-Measure score of 0.9209 on the CDnet 2014 dataset. The CNN model was built based on the Caffe library and
MatConvNet. The Cascaded CNN suffers from several limitations: 1) the model is more suitable for ground-truth generation than for an automated background/foreground separation application, and 2) it is computationally expensive. In another study, Lim and Keles [59] proposed a method which is based on a triplet CNN with a Transposed Convolutional Neural Network (TCNN) attached at the end of it in an encoder-decoder structure. The model, called FgSegNet_M, reuses the four blocks of the pre-trained VGG-16 [73] under a triplet framework as a multi-scale feature encoder. At the end of the network, a decoder network is integrated to map the features to a pixel-level foreground probability map. Finally, the binary segmentation labels are generated by applying a threshold to this feature map. Similar to the method proposed by Wang et al. [64], the network is trained with only a few frames (from 50 up to 200). Experimental results [59] show that TCNN outperforms both ConvNet [67] and Cascaded CNN [64]. In addition, it obtained an overall F-Measure score of 0.9770, which outperformed all the reported methods. A variant of FgSegNet_M, called FgSegNet_S, was introduced by Lim and Keles [60] by adding a feature pooling module (FPM) that operates on top of the final encoder (CNN) layer. Lim and Keles [61] further improved the model by proposing a modified FPM with feature fusion. FgSegNet_V2 achieves the highest performance on the CDnet 2014 dataset. A common drawback of these previous methods is that they require a large amount of densely labeled video training data. To solve this problem, a novel training strategy to train multi-scale cascaded scene-specific (MCSS) CNNs was proposed by Liao et al. [119]. The network is constructed by joining the ConvNets [67] and the multi-scale cascaded architecture [64], with a training that takes advantage of the balance of positive and negative training samples. Experimental results demonstrate that MCSS obtains a score of 0.904 on the CDnet 2014 dataset (excluding the PTZ category), which outperforms Deep CNN [72], TCNN [95] and SFEN [104]. A multi-scale CNN based background subtraction method was introduced by Liang et al. [128]. A specific CNN model is trained for each video to ensure accuracy, but the authors manage to avoid manual labeling. First, Liang et al. [128] use the SuBSENSE algorithm to generate an initial foreground mask. This initial foreground mask is not accurate enough to be directly used as ground truth. Instead, it is used to select reliable pixels to guide the CNN training. A simple strategy to automatically select informative frames for guided learning is also proposed. Experiments on the CDnet 2014 dataset show that the Guided Multi-scale CNN outperforms DeepBS and SuBSENSE, with an F-Measure score of 0.759.
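To make the patch-based scheme of Sections 2.1 and 2.2 concrete, the following PyTorch sketch (our own illustration, not the code of [67] or [64]; the 27*27 patch size, the 2-channel grey-scale layout and the exact layer sizes are assumptions in the spirit of ConvNet) builds a temporal-median background and a small LeNet-like classifier that scores a patch pair as foreground or background:

import numpy as np
import torch.nn as nn

def median_background(frames):
    """Background image as the pixel-wise temporal median of a stack of grey-scale frames."""
    return np.median(np.stack(frames, axis=0), axis=0)

class PatchConvNet(nn.Module):
    """LeNet-5-like classifier for a 27x27 image patch stacked with its background patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 6, kernel_size=5), nn.ReLU(),   # 2 channels: image patch + background patch
            nn.MaxPool2d(3),                             # 3x3 non-overlapping pooling
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(3),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16, 120), nn.ReLU(),
            nn.Linear(120, 1), nn.Sigmoid(),             # foreground probability of the centre pixel
        )

    def forward(self, x):                                # x: (N, 2, 27, 27)
        return self.classifier(self.features(x))

Training such a network with a binary cross-entropy loss on the sigmoid output corresponds to the pixel-wise supervision described above.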
2.3. Fully CNNs

Cinelli [74] explored the advantages of Fully Convolutional Neural Networks (FCNN) to diminish the computational requirements and proposed a method similar to that of Braham and Van Droogenbroeck [67]. The fully connected layer of traditional convolutional networks is replaced by a convolutional layer in the FCNN to remove the disadvantages caused by fully connected layers. The FCNN is tested with both LeNet5 [68] and ResNet [75] architectures. Since ResNet [75] has a higher degree of hyper-parameter setting (namely the size of the model and even the organization of layers) than LeNet5 [68], Cinelli [74] used various features of the ResNet architecture in order to optimize them for background/foreground separation. The authors used the networks designed for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which take 224*224 pixel images as input, and also those for the CIFAR-10 and CIFAR-100 datasets, which work with 32*32 pixel images. From this study, the two most accurate models on the CDnet 2014 dataset are the 32-layer CIFAR-derived dilated network and the pre-trained 34-layer ILSVRC-based dilated model adapted by direct substitution. However, only visual results without F-measures were provided by Cinelli [74]. The idea of using an FCNN was also deployed by Yang et al. [76]. The authors introduced a network with a structure of shortcut-connected blocks with multiple branches. Each block provides four different branches. In this architecture, the first three branches extract different features by using different atrous convolutions [78], while the last branch is the shortcut connection. To integrate the spatial information, atrous convolution [78] is employed instead of common convolution; this allows the network to expand the receptive fields without missing considerable details. The authors also employ the Parametric Rectified Linear Unit (PReLU) [77], which introduces a learned parameter to transform the values less than 0. Yang et al. [76] also employed Conditional Random Fields (CRF) to refine the results. The authors show that the proposed method obtains better results than traditional background subtraction methods (MOG [79] and Codebook [80]) as well as recent state-of-the-art methods (ViBe [81], PBAS [82] and P2M [83]) on the CDnet 2012 dataset. However, the experiments were evaluated on only 6 subsets of the CDnet 2012 dataset instead of all the categories of CDnet 2014, making a comparison with other DNN methods more difficult. Alikan [84] designed a Multi-View receptive field Fully CNN (MV-FCN), which borrows the architecture of fully convolutional networks, inception modules [85], and residual networking. In practice, MV-FCN [84] is based on U-Net [46] with inception modules, which apply the convolution of
multiple filters at various scales on the same input, and integrates two Complementary Feature Flows (CFF) and a Pivotal Feature Flow (PFF) architecture. The authors also exploited intra-domain transfer learning in order to improve the accuracy of foreground region prediction. In MV-FCN, the inception modules are employed at early and late stages with three different sizes of receptive fields to capture invariance at different scales. To enhance the spatial representation, the features learned in the encoding phase are fused with appropriate feature maps in the decoding phase through residual connections. These multi-view receptive fields, together with the residual feature connections, provide generalized features which are able to improve the performance of pixel-wise foreground region identification. Alikan [84] evaluated the MV-FCN model on the CDnet 2014 dataset, in comparison with classical neural networks (Stacked Multi-Layer [87], Multi-Layered SOM [26]) and two deep learning approaches (SDAE [88], Deep CNN [72]). However, only results on selected sequences are reported, which makes the comparison less complete. Zeng and Zhu [89] targeted moving object detection in infrared videos and designed a Multiscale Fully Convolutional Network (MFCN). MFCN does not require the extraction of background images. The network takes as input frames from different video sequences and generates a probability map. The authors borrow the architecture of the VGG-16 net and use an input size of 224*224. The VGG-16 network consists of five blocks; each block contains some convolution and max pooling operations. The deeper blocks have a lower spatial resolution and contain more high-level local features, whilst the lower blocks contain more low-level global features at a higher resolution. After the output feature layer, a contrast layer is added based on the average pooling operation with a kernel size of 3*3. Zeng and Zhu [89] proposed a set of deconvolution operations to upsample the features, creating an output probability map with the same size as the input, in order to exploit multiscale features from multiple layers. The cross-entropy is used to compute the loss function. The network uses the pre-trained weights for the layers from VGG-16, whilst randomly initializing the other weights with a truncated normal distribution. Those randomly initialized weights are then trained using the AdamOptimizer method. MFCN obtains the best score in the THM category of the CDnet 2014 dataset with an F-Measure score of 0.9870, whereas Cascaded CNN [64] obtains 0.8958. Over all the categories, the F-Measure score of MFCN is 0.96. In a further study, Zeng and Zhu [90] introduced a method called CNN-SFC, which fuses the results produced by different background subtraction algorithms (SuBSENSE [86], FTSG [91], and CwisarDH+ [92]) and achieves even better performance. This method outperforms its direct competitor IUTIS [65] on the CDnet 2014 dataset.
Lin et al. [93] proposed a deep Fully Convolutional Semantic Network (FCSN) for the background subtraction task. First, FCN can learn the global differences between the foreground and the background. Second, SuBSENSE algorithm [86] is able to generate robust background image with better performance. This background image is then concatenated into the input of the network together with the video frame. The weights of FCSN are initialized by partially using pre-trained weights of FCN-VGG16 [94] since these weights are applied to semantic segmentation. By doing so, FCSN can remember the semantic information of images and converge faster. Experiment results show that with the help of pre-trained weights, FCSN uses less training data and gets better results. 2.4. Deep CNNs Babaee et al. [72] designed a deep CNN for moving objects detection. The network consists of the following components: an algorithm to initialize the background via a temporal median model in RGB, a CNN model for background subtraction, and a post-processing model applied on the output of the networks using spatial median filter. The foreground pixels and background pixels are first classified with SuBSENSE algorithm [86]. Then only the background pixel values are used to obtain the background median model. Babaee et al. [72] also used Flux Tensor with Split Gaussian Models (FTSG [91]) algorithm to have adaptive memory length based on the motion of the camera and objects in the video frames. The CNNs are trained with background images obtained by the SuBSENSE algorithm [86]. The network is trained with pairs of RGB image patches (triplets of size 37*37) from video, background frames and the respective ground truth segmentation patches with around 5% of the CDnet 2014 dataset. Babaee et al. [72] trained their model by combining training frames from various video sequences including 5% of frames from each video sequence. For this reason, their model is not scene specified. In addition, the authors employ the same training procedure than ConvNet [67]. Image-patches are combined with background-patches before feeding the network. The network consists of 3 convolutional layers and a 2-layer Multi-Layer Perceptron (MLP). Babaee et al. [72] use Rectified Linear Unit (ReLU) as the activation function of each convolutional layer whilst the last fully connected layer uses the sigmoid function. Moreover, in order to reduce the effect of overfitting as well as to provide higher learning rates for training, the authors use batch normalization before each activation layer. The post-processing step is implemented with the spatial-median filtering. This network generates a more accurate foreground
mask than ConvNet [67] and is not very prone to outliers in presence of dynamic backgrounds. Experiment results show that deep CNN based background subtraction outperforms the existing algorithms when the challenge does not lie in the background modeling maintenance. The F-Measure score of Deep CNN in CDnet2014 dataset is 0.7548. However, Deep CNN suffers from the following limitations: 1) It does not handle very well the camouflage regions within foreground objects, 2) it performs poorly on PTZ videos category, and 3) due to the corruption of the background images, it provides poor performance in presence of large changes in the background. In another work, Zhao et al. [95] designed an end-to-end two-stage deep CNN (TS-CNN) framework. The network consists of two stages: a convolutional encoder-decoder followed by a Multi-Channel Fully Convolutional sub-Network (MCFVN). The target of the first stage is to reconstruct the background images and encode rich prior knowledge of background scenes whilst the latter stage aims to accurately detect the foreground. The authors decided to jointly optimize the reconstruction loss and segmentation loss. In practice, the encoder consists of a set of convolutions which can represent the input image as a latent feature vector. The feature vectors are used by the decoder to restore the background image. The l2 distance is used to compute the reconstruction loss. The encoderdecoder network learns from training data to separate the background from the input image and restores a clean background image. After training, the second network can learn the semantic knowledge of the foreground and background. Therefore, the model is able to process various challenges such as the night light, shadows and camouflaged foreground objects. Experimental results [95] show that the TS-CNN obtains the F-Measure score of 0.7870 which is more accurate than SuBSENSE [86], PAWCS [99], FTSG [91] and SharedModel [100] in the case of night videos, camera jitter, shadows, thermal imagery and bad weather. The Joint TS-CNN achieves a score of 0.8124 in CDnet2014 dataset. Li et al. [101] proposed to predict object locations in a surveillance scene with an adaptive deep CNN (ADCNN). First, the generic CNN-based classifier is transferred to the surveillance scene by selecting useful kernels. After that, a regression model is employed to learn the context information of the surveillance scene in order to have an accurate location prediction. ADCNN obtains very promising performance on several surveillance datasets for pedestrian detection and vehicle detection. However, ADCNN focus on object detection and thus it does not use the principle of background subtraction. Moreover, the performance of ADCNN was reported with the CUHK square dataset [102], the MIT traffic dataset [103] and the PETS 2007 instead of the CDnet2014 dataset.
In another study, Chen et al. [104] proposed to detect moving objects by using pixel-level semantic features with an end-to-end deep sequence learning network. The authors used a deep convolutional encoder-decoder network to extract pixel-level semantic features from video frames. For the experiments, VGG-16 [73] is used as encoder-decoder network but other frameworks, such as GoogLeNet [85], ResNet50 [75] can also be used. An attention long short-term memory model named Attention ConvLSTM is employed to model the pixelwise changes over time. After that, Chen et al. [104] combined a Spatial Transformer Network (STN) model with a Conditional Random Fields (CRF) layer to reduce the sensitivity to camera motion as well as to smooth the foreground boundaries. The proposed method achieves similar results than the Convnet [67] whilst outperformed the Convnet [67] for the category “Night videos”, “Camera jitter”, “Shadow” and “Turbulence” of CDnet 2014 dataset. Using VGG-16, the attention ConvLSTM obtained an F-Measure of 0.8292. With GoogLeNet and ResNet50, the F-Measure scores are 0.7360 and 0.8772, respectively. 2.5. Structured CNNs Lim et al. [105] designed an encoder structured CNN (Struct-CNN) for background subtraction. The proposed network includes the following components: a background image extraction with a temporal median in RGB, network training, background subtraction and foreground extraction based on super-pixel processing. The architecture is similar to the VGG-16 network [73] except the fully connected layers. The encoder takes the 3 (RGB) channel images (images of size 336*336 pixels) as inputs and generates the 12-channel feature vector through convolutional and max-pooling layers yielding a 21*21*512 feature vector. After that, the decoder uses the deconvolutional and unpooling layers to convert the feature vector into a 1-channel image of size 336*336 pixels providing the foreground mask. This encoder-decoder structured network is trained in the end-to-end manner using CDnet 2014. The network involves 6 deconvolutional layers and 4 unpooling layers. The authors used the Parametric Rectified Linear Unit (PReLU) [78] as an activation function and batchnormalization is employed for all the deconvolutional layers, except for the last one. The last deconvolutional layer can be considered as the prediction layer. This layer used the sigmoid activation function to normalize outputs and then to provide the foreground mask. Lim et al. [105] used 5*5 as feature map size of all convolutional layer, and 3*3 kernel for the prediction layer. The super pixel information obtained by an edge detector was also used to suppress the incorrect
boundaries and holes in the foreground mask. Experimental results on the CDnet 2014 show that Struct-CNN outperforms SuBSENSE [86], PAWCS [99], FTSG [91] and SharedModel [100] in the case of bad weather, camera jitter, low frame rate, intermittent object motion and thermal imagery. The F-Measure score excluding the “PTZ” category is 0.8645. The authors excluded this category arguing that they focused only on static cameras. 2.6. 3D CNNs Sakkos et al. [106] proposed an end-to-end 3D-CNN to track temporal changes in video sequences without using a background model for the training. For this reason, 3D-CNN is able to process multiple scenes without further fine-tuning. The network architecture is inspired by the C3D branch [107]. Practically, 3DCNN outperforms ConvNet [67] and deep CNN [72]. Furthermore, the evaluation on the ESI dataset [108] with extreme and sudden illumination changes, show that 3D CNN obtains higher score than the two designed illumination invariant background subtraction methods (Universal Multimode Background Subtraction (UMBS) [109] and ESI [108]). For CDnet 2014 dataset, the proposed framework achieved an average F-Measure of 0.9507. Yu et al. [117] designed a spatial-temporal attention-based 3D ConvNets to jointly learn the appearance and motion of objects-of-interest in a video with a Relevant Motion Event detection Network (ReMotENet). Similar to the work of Sakkos et al. [106], the architecture of the proposed network is borrowed from C3D branch [107]. However, instead of using max pooling both spatially and temporally, the authors divided the spatial and temporal max pooling to capture fine-grained temporal information, as well as to make the network deeper to learn better representations. Experimental results show that ReMotENet obtains comparative results than the object detection-based method, with three to four orders of magnitude faster. With model size of less than 1MB, it is able to detect relevant motion in a 15s video in 4-8 milliseconds on a GPU and a fraction of a second on a CPU. In another work, Hu et al. [71] developed a 3D Atrous CNN model which can learn deep spatial-temporal features without losing resolution information. The authors combined this model with two convolutional long short-term memory (ConvLSTM) to capture both short-term and long-term spatio-temporal information of the input frames. In addition, the 3D Atrous ConvLSTM does not require any pre- or post-processing of the data, but to process data in a completely end-to-end manner. Experimental results on CDnet 204 dataset show that 3D atrous CNN outperforms SuBSENSE, Cascaded CNN and DeepBS.
2.7. Generative Adversarial Networks (GANs) Bakkay et al. [110] designed a model named BScGAN which is a background subtraction method based on conditional Generative Adversarial Network (cGAN). The proposed network involves two successive networks: generator and discriminator. The former network models the mapping from the background and current image to the foreground mask whilst the later one learns a loss function to train this mapping by comparing ground-truth and predicted output using the input image and background. The architecture of BScGAN is similar to the encoder-decoder architecture of Unet network with skip connections [46]. In practice, the authors built the encoder using down-sampling layers that decrease the size of the feature maps followed by convolutional filters. It consists of 8 convolutional layers. The first layer uses 7*7 convolution which generates 64 feature maps. The last convolutional layer computes 512 feature maps with a 1*1 size. Before training, their weights are randomly initialized. The 6 middle convolutional layers are six ResNet blocks. Bakkay et al. [110] used LeakyReLU non-linearities as the activation function of all encoder layers. The decoder generates an output image with the same resolution of the input one. It is done by the up-sampling layers followed by deconvolutional filters. Its architecture is similar to the encoder one, but with a reverse layer ordering and with downsampling layers being replaced by upsampling layers. The architecture of the discriminator network includes 4 convolutional and down-sampling layers. The convolution layers use the feature size of 3*3 with randomly initialized weights. The first layer generates 64 feature maps whilst the last layer compute 512 feature maps of size 30*30. Leaky ReLU functions are employed as activation functions. Experimental results on CDnet 2014 datasets demonstrates that BScGAN obtains higher scores than ConvNets [67], Cascade CNN [64], and Deep CNN [76] with an average F-Measure score of 0.9763 without the category PTZ. Zheng et al. [112] proposed a Bayesian GAN (BGAN) network. The authors first used a median filter to extract the background and then they trained a network based on Bayesian generative adversarial network to classify each pixel, which makes the model robust to the challenges of sudden and slow illumination changes, non-stationary background, and ghosts. In practice, the generator and the discriminator of Bayesian generative adversarial network is constructed by adopting the deep convolutional neural networks. In a further study, Zheng et al. [113] improved the performance of BGAN with a parallel version named BPVGAN.
Bahri et al. [114] introduced Neural Unsupervised Moving Object Detection (NUMOD), which is an end-to-end framework. The network is built based on the batch method named ILISD [115]. Thanks to its parameterization with a generative neural network, NUMOD is able to work in both online and batch mode. Each video frame is decomposed into three components: background, foreground and illumination changes. The background model is generated by finding a low-dimensional manifold for the background of the image sequence, using a fully connected generative neural network. The architecture of NUMOD consists of Generative Fully Connected Networks (GFCN). The first one, named Net1, estimates the background image from the input image, whilst the second one, named Net2, generates the background image from the illumination-invariant image. Net1 and Net2 share the same architecture. First, the input to GFCN is an optimizable low-dimensional latent vector. Then, two fully connected hidden layers with ReLU nonlinearity activation functions are employed. The second hidden layer is fully connected to the output layer, which is activated by the sigmoid function. A loss term is computed to constrain the output of GFCN to be similar to the current input frame. In practice, GFCN can be considered as the decoder part of an auto-encoder with a small modification: in GFCN, the low-dimensional latent code is a free parameter that can be optimized and is the input to the network, rather than being learnt by an encoder as in the case of an auto-encoder. The performance of GFCN, evaluated on a subset of the CDnet 2014 dataset, shows that GFCN is more robust to illumination changes than GRASTA [55], COROLA [116] and DAN with Adaptive Tolerance Measure [43].

3. Experimental Results

To have a fair comparison, we present the results obtained on the well-known, publicly available CDnet 2014 dataset, which was developed as part of the Change Detection Workshop challenge (CDW 2014). CDW 2014 contains 22 additional camera-captured videos providing 5 different categories compared to CDnet 2012. These additional videos incorporate some challenges that were not addressed in the 2012 dataset. The categories are listed as follows: baseline, dynamic backgrounds, camera jitter, shadows, intermittent object motion, thermal, challenging weather, low frame-rate, night videos, PTZ and turbulence. In CDnet 2014, the ground truths of only the first half of every video in the 5 new categories are made publicly available for testing, unlike CDnet 2012, which publishes the ground truth of all video frames. However, the
evaluation is reported for all frames. All the challenges of these different categories have different spatial and temporal properties. The F-measures obtained by the different DNN algorithms are compared with the F-measures of other representative background subtraction algorithms over the complete evaluation dataset: (1) two conventional statistical models (MOG [128] and RMOG [132]), and (2) three advanced non-parametric models (SubSENSE [126], PAWCS [127] and Spectral-360 [114]). The evaluation of deep learning based background separation models is reported for the following categories:

- Pixel-wise algorithms: The algorithms in this category were directly applied by the authors to background/foreground separation without considering spatial and temporal constraints. Thus, they may introduce isolated false positives and false negatives. We compare two algorithms: FgSegNet (multiscale) [80] and BScGAN [10].

- Temporal-wise algorithms: These algorithms model the dependencies among adjacent temporal pixels and thus enforce temporal coherence. We compare one algorithm: 3D-CNN [110].
Table 1 groups the different F-measures, which come either from the corresponding papers or from the CDnet 2014 website. Similarly, Table 2 shows some visual results obtained using SuBSENSE [126], FgSegNet-V2 [61] and BPVGAN [63].
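For reference, the F-measure reported throughout Table 1 is the standard harmonic mean of precision and recall computed from pixel-level counts; the following is a minimal sketch of that definition (not the official CDnet evaluation code).

```python
# F-measure from pixel-level true positives (tp), false positives (fp) and false negatives (fn).
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example with hypothetical counts: 9500 TP, 300 FP, 450 FN -> F-measure of about 0.962.
print(f_measure(9500, 300, 450))
```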
Table 1. F-measure metric over the 11 categories of CDnet 2014, namely Baseline (BSL), Dynamic background (DBG), Camera jitter (CJT), Intermittent Motion Object (IOM), Shadows (SHD), Thermal (THM), Bad Weather (BDW), Low Frame Rate (LFR), Night Videos (NVD), PTZ, Turbulence (TBL). In bold, the best score in each algorithm's category. The top 10 methods are indicated with their rank. There are three groups of leading methods: FgSegNet's group, the 3D-CNNs group and the GANs group.

Algorithms (Authors)                          BSL     DBG     CJT     IOM     SHD     THM     BDW     LFR     NVD     PTZ     TBL     Average F-Measure

Basic statistical models:
MOG [79]                                      0.8245  0.6330  0.5969  0.5207  0.7156  0.6621  0.7380  0.5373  0.4097  0.1522  0.4663  0.5707
RMOG [132]                                    0.7848  0.7352  0.7010  0.5431  0.7212  0.4788  0.6826  0.5312  0.4265  0.2400  0.4578  0.5735

Advanced non-parametric models:
SuBSENSE [126]                                0.9503  0.8117  0.8152  0.6569  0.8986  0.8171  0.8619  0.6445  0.5599  0.3476  0.7792  0.7408
PAWCS [127]                                   0.9397  0.8938  0.8137  0.7764  0.8913  0.8324  0.8152  0.6588  0.4152  0.4615  0.6450  0.7403
Spectral-360 [114]                            0.9330  0.7872  0.7156  0.5656  0.8843  0.7764  0.7569  0.6437  0.4832  0.3653  0.5429  0.7054

Multi-scale or/and cascaded CNNs:
FgSegNet-M (Spatial-wise) [59]                0.9973  0.9958  0.9954  0.9951  0.9937  0.9921  0.9845  0.8786  0.9655  0.9843  0.9648  0.9770 (Rank 3)
FgSegNet-S (Spatial-wise) [60]                0.9977  0.9958  0.9957  0.9940  0.9927  0.9937  0.9897  0.8972  0.9713  0.9879  0.9681  0.9804 (Rank 2)
FgSegNet-V2 (Spatial-wise) [61]               0.9978  0.9951  0.9938  0.9961  0.9955  0.9938  0.9904  0.9336  0.9739  0.9862  0.9727  0.9847 (Rank 1)

3D CNNs:
3D CNN (Temporal-wise) [106]                  0.9691  0.9614  0.9396  0.9698  0.9706  0.9830  0.9509  0.8862  0.8565  0.8987  0.8823  0.9507 (Rank 7)
3D Atrous CNN (Spatial/Temporal-wise) [71]    0.9897  0.9789  0.9645  0.9637  0.9813  0.9833  0.9609  0.8994  0.9489  0.8582  0.9488  0.9615 (Rank 5)
FC3D (Spatial/Temporal-wise) [133]            0.9941  0.9775  0.9651  0.8779  0.9881  0.9902  0.9699  0.8575  0.9595  0.9240  0.9729  0.9524 (Rank 6)
MFC3D (Spatial/Temporal-wise) [133]           0.9950  0.9780  0.9744  0.8835  0.9893  0.9924  0.9703  0.9233  0.9696  0.9287  0.9773  0.9619 (Rank 4)

Generative Adversarial Networks:
BScGAN (Pixel-wise) [110]                     0.9930  0.9784  0.9770  0.9623  0.9828  0.9612  0.9796  0.9918  0.9661  -       0.9712  0.9763 (Rank 10)
BGAN (Pixel-wise) [62]                        0.9814  0.9763  0.9828  0.9366  0.9849  0.9064  0.9465  0.8472  0.8965  0.9194  0.9118  0.9339 (Rank 9)
BPVGAN (Pixel-wise) [63]                      0.9837  0.9849  0.9893  0.9366  0.9927  0.9764  0.9644  0.8508  0.9001  0.9486  0.9310  0.9501 (Rank 8)
Table 2. Visual results on the CDnet 2014 dataset. From left to right: Original images, Ground-Truth images, SuBSENSE [126], FgSegNet-V2 [61], BPVGAN [63]. (The image panels are not reproduced here; the rows correspond to the scenes Baseline Pedestrian (in000490), B-Weather Skating (in002349), C-Jitter Badminton (in001123), Dynamic-B Fall (in002416) and I-O-Motion Sofa (in001314).)
4. Conclusion
In this chapter, we have presented a full review of recent advances in the use of deep neural networks applied to background subtraction for the detection of moving objects in video taken by a static camera. The experiments reported on the large-scale CDnet 2014 dataset show the performance gap opened up by the supervised deep neural network methods in this field. Although applying deep neural networks to the background subtraction problem has received significant attention in the last two years since the paper of Braham and Van Droogenbroeck [67], many important issues remain unsolved. Researchers need to answer the question: what is the most suitable type of deep neural network, and its corresponding architecture, for background initialization, background subtraction and deep learned features in the presence of complex backgrounds? Several authors avoid experiments on the "PTZ" category, and when the F-measure is provided the score is not always very high. Thus, it seems that the deep neural networks tested so far meet problems in the case of moving cameras. In the field of background subtraction, only convolutional neural networks and generative adversarial networks have been employed. Thus, future directions may investigate the adequacy of deep belief networks, deep restricted kernel neural networks [129], probabilistic neural networks [130] and fuzzy neural networks [131] in the case of static as well as moving cameras.

References

[1] S. Cheung, C. Kamath, “Robust Background Subtraction with Foreground Validation for Urban Traffic Video”, Journal of Applied Signal Processing, 14, 2330-2340, 2005. [2] J. Carranza, C. Theobalt, M. Magnor, H. Seidel, “Free-Viewpoint Video of Human Actors”, ACM Transactions on Graphics, 22 (3), 569-577, 2003. [3] F. El Baf, T. Bouwmans, B. Vachon, “Comparison of Background Subtraction Methods for a Multimedia Learning Space”, SIGMAP 2007, Jul. 2007. [4] I. Junejo, A. Bhutta, H. Foroosh, “Single Class Support Vector Machine (SVM) for Scene Modeling”, Journal of Signal, Image and Video Processing, May 2011. [5] J. Wang, G. Bebis, M. Nicolescu, M. Nicolescu, R. Miller, “Improving target detection by coupling it with tracking”, Machine Vision and Application, pages 1-19, 2008. [6] A. Tavakkoli, M. Nicolescu, G. Bebis, “A Novelty Detection Approach for Foreground Region Detection in Videos with Quasi-stationary Backgrounds”, ISVC 2006, pages 40-49, Lake Tahoe, NV, November 2006. [7] F. El Baf, T. Bouwmans, B. Vachon, “Fuzzy integral for moving object detection”, IEEE FUZZ-IEEE 2008, pages 1729–1736, June 2008. [8] F. El Baf, T. Bouwmans, B. Vachon, “Type-2 fuzzy mixture of Gaussians model: Application to background modeling”, ISVC 2008, pages 772–781, December 2008.
[9] T. Bouwmans, “Background Subtraction for Visual Surveillance: A Fuzzy Approach” Chapter 5, Handbook on Soft Computing for Video Surveillance, Taylor and Francis Group, pages 103–139, March 2012. [10] N. Oliver, B. Rosario, A. Pentland, “A Bayesian computer vision system for modeling human interactions”, ICVS 1999, January 1999. [11] Y. Dong, G. DeSouza, “Adaptive learning of multi-subspace for foreground detection under illumination changes”, Computer Vision and Image Understanding, 2010. [12] D. Farcas, C. Marghes, T. Bouwmans, “Background subtraction via incremental maximum margin criterion: A discriminative approach”, Machine Vision and Applications, 23(6):1083–1101, October 2012. [13] M. Chacon-Muguia, S. Gonzalez-Duarte, P. Vega, “Simplified SOM-neural model for video segmentation of moving objects”, IJCNN 2009, pages 474-480, 2009. [14] M. Chacon-Murguia, G. Ramirez-Alonso, S. Gonzalez-Duarte, “Improvement of a neuralfuzzy motion detection vision model for complex scenario conditions”, International Joint Conference on Neural Networks, IJCNN 2013, August 2013. [15] M. Molina-Cabello, E. Lopez-Rubio, R. Luque-Baena, E. Domínguez, E. Palomo, "Foreground object detection for video surveillance by fuzzy logic based estimation of pixel illumination states", Logic Journal of the IGPL, September 2018. [16] E. Candès, X. Li, Y. Ma, J. Wright. Robust principal component?”, International Journal of ACM, 58(3), May 2011. [17] P. Xu, M. Ye, Q. Liu, X. Li, L. Pei, J. Ding, “Motion Detection via a Couple of AutoEncoder Networks”, IEEE ICME 2014, 2014. [18] N. Goyette, P. Jodoin, F. Porikli, J. Konrad, P. Ishwar, “Changedetection.net: A new change detection benchmark dataset”, IEEE Workshop on Change Detection, CDW 2012 in conjunction with CVPR 2012, June 2012. [19] Y. Wang, P. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, P. Ishwar, “CDnet 2014: an expanded change detection benchmark dataset”, IEEE Workshop on Change Detection, CDW 2014 in conjunction with CVPR 2014, June 2014. [20] A. Schofield, P. Mehta, T. Stonham, “A system for counting people in video images using neural networks to identify the background scene”, Pattern Recognition, 29:1421–1428, 1996. [21] P. Gil-Jimenez, S. Maldonado-Bascon, R. Gil-Pita, H. Gomez-Moreno, “Background pixel classification for motion detection in video image sequences”, IWANN 2003, 2686:718– 725, 2003. [22] L. Maddalena, A. Petrosino, “A self-organizing approach to detection of moving patterns for real-time applications”, Advances in Brain, Vision, and Artificial Intelligence, 4729:181–190, 2007. [23] L. Maddalena, A. Petrosino, “Multivalued background/foreground separation for moving object detection”, WILF 2009, pages 263–270, June 2009. [24] L. Maddalena, A. Petrosino, “The SOBS algorithm: What are the limits?”, IEEE Workshop on Change Detection, CVPR 2012, June 2012. [25] L. Maddalena, A. Petrosino, “The 3dSOBS+ algorithm for moving object detection”, CVIU 2014, 122:65–73, May 2014. [26] G. Gemignani, A. Rozza, “A novel background subtraction approach based on multilayered self organizing maps”, IEEE ICIP 2015, 2015. [27] A. Krizhevsky, I. Sutskever, G. Hinton, “ImageNet: Classification with Deep Convolutional Neural Networks”, NIPS 2012, pages 1097–1105, 2012. [28] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, “Imagenet: A large-scale hierarchical image database”, IEEE CVPR 2009, 2009.
[29] T. Bouwmans, L. Maddalena, A. Petrosino, “Scene Background Initialization: A Taxonomy”, Pattern Recognition Letters, January 2017. [30] P. Jodoin, L. Maddalena, A. Petrosino, Y. Wang, “Extensive Benchmark and Survey of Modeling Methods for Scene Background Initialization”, IEEE Transactions on Image Processing, 26(11):5244– 5256, November 2017. [31] I. Halfaoui, F. Bouzaraa, O. Urfalioglu, “CNN-Based Initial Background Estimation”, ICPR 2016, 2016. [32] S. Javed, A. Mahmood, T. Bouwmans, S. Jung, “Background- Foreground Modeling Based on Spatio-temporal Sparse Subspace Clustering”, IEEE Transactions on Image Processing, 26(12):5840– 5854, December 2017. [33] B. Laugraud, S. Pierard, M. Van Droogenbroeck, “A method based on motion detection for generating the background of a scene”, Pattern Recognition Letters, 2017. [34] B. Laugraud, S. Pierard, M. Van Droogenbroeck,"LaBGen-P-Semantic: A First Step for Leveraging Semantic Segmentation in Background Generation", MDPI Journal of Imaging Volume 4, No. 7, Art. 86, 2018. [35] T. Bouwmans, E. Zahzah, “Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance”, CVIU 2014, 122:22–34, May 2014. [36] R. Guo, H. Qi, “Partially-sparse restricted Boltzmann machine for background modeling and subtraction”, ICMLA 2013, pages 209–214, December 2013. [37] T. Haines, T. Xiang, “Background subtraction with Dirichlet processes”, European Conference on Computer Vision, ECCV 2012, October 2012. [38] A. Elgammal, L. Davis, “Non-parametric model for background subtraction”, European Conference on Computer Vision, ECCV 2000, pages 751–767, June 2000. [39] Z. Zivkovic, “Efficient adaptive density estimation per image pixel for the task of background subtraction”, Pattern Recognition Letters, 27(7):773–780, January 2006. [40] L. Xu, Y. Li, Y. Wang, E. Chen, “Temporally adaptive restricted Boltzmann machine for background modeling”, AAAI 2015, January 2015. [41] A. Sheri, M. Rafique, M. Jeon, W. Pedrycz, “Background subtraction using Gaussian Bernoulli restricted Boltzmann machine”, IET Image Processing, 2018. [42] A. Rafique, A. Sheri, M. Jeon, “Background scene modeling for PTZ cameras using RBM”, ICCAIS 2014, pages 165–169, 2014. [43] P. Xu, M. Ye, X. Li, Q. Liu, Y. Yang, J. Ding, “Dynamic Background Learning through Deep Auto-encoder Networks”, ACM International Conference on Multimedia, Orlando, FL, USA, November 2014. [44] Z. Qu, S. Yu, M. Fu, “Motion background modeling based on context-encoder”, IEEE ICAIPR 2016, September 2016. [45] Y. Tao, P. Palasek, Z. Ling, I. Patras, “Background modelling based on generative Unet”, IEEE AVSS 2017, September 2017. [46] O. Ronneberger, T. Brox. P. Fischer, “U-Net: Convolutional Networks for, biomedical image segmentation”, International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015. [47] M. Gregorio, M. Giordano, “Background modeling by weightless neural networks”, SBMI 2015 Workshop in conjunction with ICIAP 2015, September 2015. [48] G. Ramirez, J. Ramirez, M. Chacon, “Temporal weighted learning model for background estimation with an automatic re-initialization stage and adaptive parameters update”, Pattern Recognition Letters, 2017. [49] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, M. Cohen, “Interactive digital photomontage”, ACM Transactions on Graphics, 23(1):294– 302, 2004.
[50] B. Laugraud, S. Pierard, M. Van Droogenbroeck, “LaBGen-P: A pixel-level stationary background generation method based on LaBGen”, Scene Background Modeling Contest in conjunction with ICPR 2016, 2016. [51] I. Goodfellow et al., “Generative adversarial networks”, NIPS 2014, 2014. [52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen., “Improved techniques for training GANs”, NIPS 2016, 2016. [53] M. Sultana, A. Mahmood, S. Javed, S. Jung, “Unsupervised deep context prediction for background estimation and foreground segmentation”. Preprint, May 2018. [54] X. Guo, X. Wang, L. Yang, X. Cao, Y. Ma, “Robust foreground detection using smoothness and arbitrariness constraints”, European Conference on Computer Vision, ECCV 2014, September 2014. [55] J. He, L. Balzano, J. Luiz, “Online robust subspace tracking from partial information”, IT 2011, September 2011. [56] J. Xu, V. Ithapu, L. Mukherjee, J. Rehg, V. Singh, “GOSUS: Grassmannian online subspace updates with structured-sparsity”, IEEE ICCV 2013, September 2013. [57] T. Zhou, D. Tao, “GoDec: randomized low-rank and sparse matrix decomposition in noisy case”, International Conference on Machine Learning, ICML 2011, 2011. [58] X. Zhou, C. Yang, W. Yu, “Moving object detection by detecting contiguous outliers in the low-rank representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:597-610, 2013. [59] L. Lim, H. Keles, “Foreground Segmentation using a Triplet Convolutional Neural Network for Multiscale Feature Encoding”, Preprint, January 2018. [60] K. Lim, L. Ang, H. Keles, “Foreground Segmentation Using Convolutional Neural Networks for Multiscale Feature Encoding”, Pattern Recognition Letters, 2018. [61] K. Lim, L. Ang, H. Keles, “Learning Multi-scale Features for Foreground Segmentation”, arXiv preprint arXiv:1808.01477, 2018. [62] W. Zheng, K. Wang, and F. Wang. Background subtraction algorithm based on bayesian generative adversarial networks. Acta Automatica Sinica, 2018. [63] W. Zheng, K. Wang, and F. Wang. A novel background subtraction algorithm based on parallel vision and Bayesian GANs. Neurocomputing, 2018. [64] Y. Wang, Z. Luo, P. Jodoin, “Interactive deep learning method for segmenting moving objects”, Pattern Recognition Letters, 2016. [65] S. Bianco, G. Ciocca, R. Schettini, “How far can you get by combining change detection algorithms?” CoRR, abs/1505.02921, 2015. [66] M. Braham, S. Pierard, M. Van Droogenbroeck, "Semantic Background Subtraction", IEEE ICIP 2017, September 2017. [67] M. Braham, M. Van Droogenbroeck, “Deep background subtraction with scene-specific convolutional neural networks”, International Conference on Systems, Signals and Image Processing, IWSSIP2016, Bratislava, Slovakia, May 2016. [68] Y. Le Cun, L. Bottou, P. Haffner. “Gradient-based learning applied to document recognition”, Proceedings of IEEE, 86:2278–2324, November 1998. [69] C. Bautista, C. Dy, M. Manalac, R. Orbe, M. Cordel, “Convolutional neural network for vehicle detection in low resolution traffic videos”, TENCON 2016, 2016. [70] C. Lin, B. Yan, W. Tan, “Foreground detection in surveillance video with fully convolutional semantic network”, IEEE ICIP 2018, pages 4118-4122, October 2018. [71] Z. Hu, T. Turki, N. Phan, J. Wang, “3D atrous convolutional long short-term memory network for background subtraction”, IEEE Access, 2018 [72] M. Babaee, D. Dinh, G. Rigoll, “A deep convolutional neural network for background subtraction”, Pattern Recognition, September 2017. [73] K. Simonyan, A. 
Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.
[74] L. Cinelli, “Anomaly Detection in Surveillance Videos using Deep Residual Networks", Master Thesis, Universidade de Rio de Janeiro, February 2017. [75] K. He, X. Zhang, S. Ren, "Deep residual learning for image recognition", "EEE CVPR 2016, June 2016. [76] L. Yang, J. Li, Y. Luo, Y. Zhao, H. Cheng, J. Li, "Deep Background Modeling Using Fully Convolutional Network", IEEE Transactions on Intelligent Transportation Systems, 2017. [77] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs”, Tech. Rep., 2016. [78] K. He, X. Zhang, S. Ren, J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification", IEEE ICCV 2015, pages 1026–1034, 2015. [79] C. Stauffer, W. Grimson, “Adaptive background mixture models for real-time tracking”, IEEE CVPR 1999, pages 246-252, 1999. [80] K. Kim, T. H. Chalidabhongse, D. Harwood, L. Davis, “Background Modeling and Subtraction by Codebook Construction”, IEEE ICIP 2004, 2004 [81] O. Barnich, M. Van Droogenbroeck, “ViBe: a powerful random technique to estimate the background in video sequences”, ICASSP 2009, pages 945-948, April 2009. [82] M. Hofmann, P. Tiefenbacher, G. Rigoll, "Background Segmentation with Feedback: The Pixel-Based Adaptive Segmenter", IEEE Workshop on Change Detection, CVPR 2012, June 2012 [83] L. Yang, H. Cheng, J. Su, X. Li, “Pixel-to-model distance for robust background reconstruction", IEEE Transactions on Circuits and Systems for Video Technology, April 2015. [84] T. Akilan, "A Foreground Inference Network for Video Surveillance using Multi-View Receptive Field", Preprint, January 2018. [85] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, A. Rabinovich, "Going deeper with convolutions", IEEE CVPR 2015, pages 1-9, 2015. [86] P. St-Charles, G. Bilodeau, R. Bergevin, "Flexible Background Subtraction with SelfBalanced Local Sensitivity", IEEE CDW 2014, June 2014. [87] Z. Zhao, X. Zhang, Y. Fang, “Stacked multilayer self-organizing map for background modeling” IEEE Transactions on Image Processing, Vol. 24, No. 9, pages. 2841–2850, 2015. [88] Y. Zhang, X. Li, Z. Zhang, F. Wu, L. Zhao, “Deep learning driven bloc-kwise moving object detection with binary scene modeling”, Neurocomputing, Vol. 168, pages 454-463, 2015. [89] D. Zeng, M. Zhu, "Multiscale Fully Convolutional Network for Foreground Object Detection in Infrared Videos", IEEE Geoscience and Remote Sensing Letters, 2018. [90] D. Zeng, M. Zhu, “Combining Background Subtraction Algorithms with Convolutional Neural Network”, Preprint, 2018. [91] R. Wang, F. Bunyak, G. Seetharaman, K. Palaniappan, “Static and moving object detection using flux tensor with split Gaussian model”, IEEE CVPR 2014 Workshops, pages 414– 418, 2014. [92] M. De Gregorio, M. Giordano, “CwisarDH+: Background detection in RGBD videos by learning of weightless neural networks”, ICIAP 2017, pages 242–253, 2017. [93] C. Lin, B. Yan, W. Tan, "Foreground Detection in Surveillance Video with Fully Convolutional Semantic Network", IEEE ICIP 2018, pages 4118-4122, Athens, Greece, October 2018. [94] J. Long, E. Shelhamer, T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE CVPR 2015, pages 3431-3440, 2015. [95] X. Zhao, Y. Chen, M. Tang, J. Wang, "Joint Background Reconstruction and Foreground Segmentation via A Two-stage Convolutional Neural Network", Preprint, 2017. [96] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. 
Efros, "Context encoders: Feature learning by inpainting", arXiv preprint arXiv:1604.07379, 2016.
[97] A. Radford, L. Metz, S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," Computer Science, 2015. [98] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs,” arXiv preprint arXiv:1606.00915, 2016. [99] P. St-Charles, G. Bilodeau, R. Bergevin, “A Self-Adjusting Approach to Change Detection Based on Background Word Consensus", IEEE Winter Conference on Applications of Computer Vision, WACV 2015, 2015. [100] Y. Chen, J. Wang, H. Lu, “Learning sharable models for robust background subtraction”, IEEE ICME 2015, pages 1-6, 2015. [101] X. Li, M. Ye, Y. Liu, C. Zhu, “Adaptive Deep Convolutional Neural Networks for SceneSpecific Object Detection”, IEEE Transactions on Circuits and Systems for Video Technology, September 2017. [102] M. Wang and W. Li and X. Wang, "Transferring a generic pedestrian detector towards specific scenes", IEEE CVPR 2012, pgas 3274-3281, 2012. [103] X. Wang, X. Ma, W Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 3, pages 539-555, March 2009. [104] Y. Chen, J. Wang, B. Zhu, M. Tang, H. Lu, "Pixel-wise Deep Sequence Learning for Moving Object Detection", IEEE Transactions on Circuits and Systems for Video Technology, 2017. [105] K. Lim, W. Jang, C. Kim, "Background subtraction using encoder-decoder structured convolutional neural network", IEEE AVSS 2017, Lecce, Italy, 2017 [106] D. Sakkos, H. Liu, J. Han, L. Shao, “End-to-end video background subtraction with 3D convolutional neural networks”, Multimedia Tools and Applications, pages 1-19, December 2017. [107] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Palur, "C3D: generic features for video analysis", IEEE ICCV 2015, 2015. [108] L. Vosters, C. Shan, T. Gritti, “Real-time robust background subtraction under rapidly changing illumination conditions”, Image Vision and Computing, 30(12):1004-1015, 2012. [109] H. Sajid, S. Cheung. “Universal multimode background subtraction”, IEEE Transactions on Image Processing, 26(7):3249–3260, May 2017. [110] M. Bakkay, H. Rashwan, H. Salmane, L. Khoudoury D. Puig, Y. Ruichek, "BSCGAN: Deep Background Subtraction with Conditional Generative Adversarial Networks", IEEE ICIP 2018, Athens, Greece, October 2018. [111] P. Isola, J. Zhu, T. Zhou, A. Efros, “Image-to-image translation with conditional adversarial networks”, arXiv preprint, 2017. [112] W. Zheng, K. Wang, F. Wang, "Background Subtraction Algorithm based on Bayesian Generative Adversarial Networks", Acta Automatica Sinica, 2018. [113] W. Zheng, K. Wang, F. Wang, "A Novel Background Subtraction Algorithm based on Parallel Vision and Bayesian GANs", Neurocomputing, 2018. [114] F. Bahri, M. Shakeri, N. Ray, "Online Illumination Invariant Moving Object Detection by Generative Neural Network", Preprint, 2018. [115] M. Shakeri, H. Zhang, “Moving object detection in time-lapse or motion trigger image sequences using low-rank and invariant sparse decomposition”, IEEE ICCV 2017, pages 5133–5141, 2017. [116] M. Shakeri, H. Zhang, “COROLA: A sequential solution to moving object detection using low-rank approximation”, Computer Vision and Image Understanding, 146:27-39, 2016. [117] R. Yu, H. Wang, L. 
Davis, "ReMotENet: Efficient Relevant Motion Event Detection for Large-scale Home Surveillance Videos", Preprint, January 2018. [118] X. Liang, S. Liao, X. Wang, W. Liu, Y. Chen, S. Li, "Deep Background Subtraction with Guided Learning", IEEE ICME 2018 San Diego, USA, July 2018.
[119] J. Liao, G. Guo, Y. Yan, H. Wang, "Multiscale Cascaded Scene-Specific Convolutional Neural Networks for Background Subtraction", Pacific Rim Conference on Multimedia, PCM 2018, pages 524-533, 2018. [120] S. Lee, D. Kim, "Background Subtraction using the Factored 3-Way Restricted Boltzmann Machines", Preprint, 2018. [121] P. Fischer, A. Dosovitskiy, E. Ilg, P. Hausser, C. Hazirbas¸, V. Golkov, P. Smagt, D. Cremers, T. Brox, “Flownet: Learning optical flow with convolutional networks”, arXiv preprint arXiv:1504.06852, 2015. [122] Y. Zhang, X. Li, Z. Zhang, F. Wu, L. Zhao, “Deep Learning Driven Blockwise Moving Object Detection with Binary Scene Modeling”, Neurocomputing, June 2015. [123] M. Shafiee, P. Siva, P. Fieguth, A. Wong, “Embedded Motion Detection via Neural Response Mixture Background Modeling”, CVPR 2016, June 2016. [124] M. Shafiee, P. Siva, P. Fieguth, A. Wong, “Real-Time Embedded Motion Detection via Neural Response Mixture Modeling”, Journal of Signal Processing Systems, June 2017. [125] T. Nguyen, C. Pham, S. Ha, J. Jeon, "Change Detection by Training a Triplet Network for Motion Feature Extraction", IEEE Transactions on Circuits and Systems for Video Technology, January 2018. [126] S. Lee, D. Kim, "Background Subtraction using the Factored 3-Way Restricted Boltzmann Machines", Preprint, 2018. [127] Y. Yan, H. Zhao, F. Kao, V. Vargas, S. Zhao, J. Ren, "Deep Background Subtraction of Thermal and Visible Imagery for Pedestrian Detection in Videos", BICS 2018, 2018. [128] X. Liang, S. Liao, X. Wang, W. Liu, Y. Chen, S. Li, "Deep Background Subtraction with Guided Learning", IEEE ICME 2018, July 2018. [129] J. Suykens, “Deep Restricted Kernel Machines using Conjugate Feature Duality”, Neural Computation, Vol. 29, pages 2123-2163, 2017. [130] J. Gast, S. Roth, “Lightweight Probabilistic Deep Networks”, Preprint, 2018. [131] Y. Deng, Z. Ren, Y. Kong, F. Bao, Q. Dai, “A Hierarchical Fused Fuzzy Deep Neural Network for Data Classification”, IEEE Transactions on Fuzzy Systems, Vol. 25, No. 4, pages 1006-1012, 2017. [132] V. Sriram, P. Miller, and H. Zhou. “Spatial mixture of Gaussians for dynamic background modelling.” IEEE International Conference on Advanced Video and Signal Based Surveillance, 2013. [133] Y. Wang, Z. Yu, L. Zhu, "Foreground Detection with Deeply Learned Multi-scale SpatialTemporal Features", MDPI Sensors, 2018. [134] V. Mondéjar-Guerra, J. Rouco, J. Novo, M. Ortega, "An end-to-end deep learning approach for simultaneous background modeling and subtraction", British Machine Vision Conference, September 2019.
CHAPTER 1.4 SIMILARITY DOMAINS NETWORK FOR MODELING SHAPES AND EXTRACTING SKELETONS WITHOUT LARGE DATASETS
Sedat Ozer
Bilkent University, Ankara, Turkey
[email protected]

In this chapter, we present a method to model and extract the skeleton of a shape with the recently proposed similarity domains network (SDN). SDN is especially useful when there is only one image sample available and when no additional pre-trained model is available. SDN is a neural network with one hidden layer and with explainable kernel parameters. The kernel parameters have a geometric meaning within the SDN framework, which is encapsulated by similarity domains (SDs) within the feature space. We model the SDs with Gaussian kernel functions. A similarity domain is a d dimensional sphere in the d dimensional feature space; it represents the similarity domain of an important data sample, where any other data point that falls inside the similarity domain of that important sample is considered similar to that sample and shares the same class label. In this chapter, we first demonstrate how using SDN can help us model a pixel-based image in terms of SDs and then demonstrate how those learned SDs can be used to extract the skeleton from a shape.
1. Introduction

Recent advances in deep learning have moved the attention of many researchers to neural network based solutions for shape understanding, shape analysis and parametric shape modeling. While a significant amount of research has been done on skeleton extraction and modeling from shapes in the past, the improved success of deep learning in object detection and classification applications has also moved the attention of researchers towards neural network based solutions for skeleton extraction and modeling. In this chapter, we introduce a novel shape modeling algorithm based on Radial Basis Networks (RBNs), which are a particular type of neural network that utilizes radial basis functions (RBFs) as activation functions in its hidden layer. RBFs have been used in the literature for many classification tasks including the original LeNET architecture [1]. Even though RBFs are useful in modeling surfaces and various classification tasks as in [2–8], when the goal is modeling a shape and extracting a skeleton many challenges appear associated with utilizing RBFs in neural networks. Two such challenges are: (I) estimating the optimal number of used RBFs (e.g., the number of yellow circles in our 2D image examples) in the network along with their optimal locations (their centroid values), and (II)
Fig. 1. This figure demonstrates how the shape parameters of SDN can be utilized on shapes. (a) Original binary input image is shown. (b) The altered image by utilizing the SDN’s shape parameters. Each object is scaled and shifted at different and individual scales. We first used a region growing algorithm to isolate the kernel parameters for each object and then individually scaled and shifted them. (c) All the computed shape parameters of the input binary image are visualized. (d) Only the foreground parameters are visualized.
estimating the optimal parameters of the RBFs by relating them to shapes geometrically. The kernel parameter is typically known as the scale or the shape parameter (representing the radius of a circle in the figures), and the two terms are used interchangeably in the literature. The standard RBNs as defined in [9] apply the same kernel parameter value to each basis function used in the network architecture. Recent literature has focused on using multiple kernels with individual and distinct kernel parameters as in [10] and [11]. While the idea of utilizing different kernels with individual parameters has been heavily studied in the literature under the "Multiple Kernel Learning" (MKL) framework as formally modeled in [11], there are not many efficient approaches and available implementations focusing on utilizing multiple kernels with their own parameters in RBNs for shape modeling. Recently, the work in [12] combined the optimization advances achieved in the kernel machines domain with radial basis networks and introduced a novel algorithm for shape analysis. In this chapter, we refer to that algorithm as the "Similarity Domains Network" (SDN) and discuss its benefits from both the shape modeling (see Figure 1) and the skeleton extraction perspectives. As we demonstrate in this chapter, the computed similarity domains of SDN can be used not only for obtaining parametric models for shapes but also models for their skeletons.
2. Related Work

Skeleton extraction has been widely studied in the literature, as in [13–16]. However, in this chapter, we study and focus on how to utilize the SDs that are obtained by a recently introduced algorithm, the Similarity Domains Network, and demonstrate how to obtain parametric models for shapes and how to extract the skeleton of a shape. SDs save only a portion of the entire data; thus, they provide a means to reduce the complexity of computations for skeleton extraction and shape modeling. Our proposed algorithm, SDN, is related to both radial basis networks and kernel machines. However, in this chapter, we mainly discuss and present our algorithm from the neural networks perspective and relate it to radial basis networks (RBNs). In the past, RBN related research mainly focused on computing the optimal kernel parameter (i.e., the scale or shape parameter) that was used in all of the RBFs, as in [17, 18]. While the computation of parameters for multiple kernels has been heavily studied under the MKL framework in the literature (for examples, see the survey papers [19, 20]), the computation of multiple kernel parameters in RBNs has been mostly studied under two main approaches: using optimization or using heuristic methods. For example, in [21], the authors proposed using multiple scales as opposed to using a single scale value in RBNs. Their approach first computes the standard deviation of each cluster (after applying a k-means like clustering on the data) and then uses a scaled version of the standard deviation of each cluster as the shape parameter for each RBF in the network. The work in [22] also used a similar approach by using the root-mean-square deviation (RMSD) value between the RBF centers and the data value for each RBF in the network. The authors used a modified orthogonal least squares (OLS) algorithm to select the RBF centers. The work in [10] used the k-means algorithm on the training data to choose k centers and used those centers as the RBF centers. Then it used separate optimizations for computing the kernel parameters and the kernel weights (see the next section for the formal definitions). Using additional optimization steps for different sets of parameters is costly and makes it harder to interpret those parameters and to relate them to shapes geometrically and accurately. As an alternative solution, the work in [12] proposed a geometric approach by using the distance between the data samples as a geometric constraint. In [12], we did not use the well known MKL model. Instead, we defined the interpretable similarity domains concept using RBFs and developed our own optimization approach with geometric constraints similar to the original Sequential Minimal Optimization (SMO) algorithm [23]. Consequently, the SDN algorithm combines both RBN and kernel machine concepts into a novel algorithm with geometrically interpretable kernel parameters. In this chapter, we demonstrate the use of SDN for parametric shape modeling and skeleton extraction. Unlike the existing work on radial basis networks, instead of applying an initial k-means algorithm or OLS algorithm to compute the kernel centers separately or using multiple cost functions, SDN chooses the RBF centers and their number automatically via its sparse modeling and uses a single cost function to be optimized with its geometric constraint. That is where SDN differs from other similar RBN works, as they would have issues computing all those parameters within a single optimization step while automatically and sparsely adjusting the number of RBFs used in the network.
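As a reference point for the heuristics discussed above, the following minimal sketch computes RBF centers with k-means and uses a per-cluster spread as each RBF width, in the spirit of the approaches in [21, 22]; scikit-learn, the scaling factor and the exact width formula are assumptions used only for illustration.

```python
# Classical heuristic: k-means centers + scaled per-cluster spread as RBF widths (a sketch).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_rbf_parameters(X, k=20, scale=1.0):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    centers = km.cluster_centers_
    widths = np.empty(k)
    for j in range(k):
        members = X[km.labels_ == j]
        # root-mean-square distance of the cluster members to their center
        widths[j] = scale * np.sqrt(np.mean(np.sum((members - centers[j]) ** 2, axis=1)))
    return centers, widths
```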
Fig. 2. An illustration of SDN as a radial basis network. The network contains a single hidden layer.
The input layer (d dimensional input vector) is connected to n radial basis functions. The output is the weighted sum of the radial basis functions’ outputs.
3. Similarity Domains

A similarity domain [12] is a geometric concept that defines a local similarity around a particular data sample, where that sample represents the center of a similarity sphere (i.e., similarity domain) in the Euclidean space. Through similarity domains, we can define a unified optimization problem in which the kernel parameters are computed automatically and geometrically. We formalize the similarity domain of x_i ∈ R^d as the sphere in R^d whose center is the support vector (SV) x_i and whose radius is r_i. The radius r_i is defined as follows. For any (+1)-labelled support vector x_i^+, where x_i^+ ∈ R^d and the superscript (+) represents the (+1) class:

r_i = min(‖x_i^+ − x_1^−‖, ..., ‖x_i^+ − x_k^−‖)/2,    (1)

where the superscript (−) denotes the (−1) class. For any (−1)-labelled support vector x_i^−:

r_i = min(‖x_i^− − x_1^+‖, ..., ‖x_i^− − x_k^+‖)/2.    (2)

In this work, we use the Gaussian kernel function to represent similarities and similarity domains as follows:

K_{σ_i}(x, x_i) = exp(−‖x − x_i‖² / σ_i²),    (3)

where σ_i is the kernel parameter for SV x_i. The similarity (kernel) function takes its maximum value where x = x_i. The relation between r_i and σ_i is r_i² = aσ_i², where a is a domain-specific scalar (constant). In our image experiments, the value of a is found via a grid search, and we observed that setting a = 2.85 suffices for all the images used in our experiments. Note that, in contrast to [24, 25], our similarity domain definition differs from the term "minimal enclosing sphere". In our approach, we define the similarity domain as the dominant region of an SV in which the SV is the centroid and all the points within the domain are similar to the SV. The boundary of the similarity domain of an SV is defined
Fig. 3. (color online) Visualization of the SDM kernel parameters at T = 0.05 with zero pixel error learning. The blue area represents the background and the yellow area represents the foreground. The red dots are the RBF centers and yellow circles around them show the boundaries of SDs. The green lines are the radiuses (ri ) of SDs. The ri are obtained from the computed σi . (a) Original image: 141x178 pixels. (b) Visualization of all the ri from both background and foreground with total of 1393 centers. (c) Visualization of only the ri for the object with total of 629 foreground centers (i.e., by using only the 2.51% of all image pixels). All images are resized to fit into the figure.
based on its distance to the closest point from the other class. Thus, any given vector within a similarity domain (a region) will be similar to the associated SV of that similarity domain. We will use the similarity domain concept to define a kernel machine that computes its kernel parameters automatically and geometrically in the next section.

4. Similarity Domains Network

A typical radial basis network (RBN) includes a single hidden layer and uses a radial basis function (RBF) as the activation function in each neuron of that hidden layer (i.e., the hidden layer uses n RBFs). Similar to an RBN, a similarity domains network also uses a single hidden layer where the activation functions are radial basis functions. Unlike a typical RBN, which uses the same kernel parameter in all the radial basis functions in the hidden layer, SDN uses a different kernel parameter for each RBF in the hidden layer. The illustration of SDN as a radial basis network is given in Figure 2. While the number of RBFs in the hidden layer is decided by different algorithms in RBN (as discussed in the previous section), SDN assigns an RBF to each training sample. In the figure, the hidden layer uses all of the n training data as RBF centers and then, through its sparse optimization, it selects a subset of the training data (e.g., a subset of pixels for shape modeling) and reduces that number n to k, where n ≥ k. SDN models the decision boundary as a weighted combination of Similarity Domains (SDs). A similarity domain is a d dimensional sphere in the d dimensional feature space. Each similarity domain is centered at an RBF center and modeled with a Gaussian RBF in SDN. SDN estimates the label y of a given input vector x as ŷ as shown below:

ŷ = sign(f(x))  and  f(x) = Σ_{i=1}^{k} α_i y_i K_{σ_i}(x, x_i),    (4)

where the scalar α_i is a nonzero weight for the RBF center x_i, y_i ∈ {−1, +1} is the class label of the training data and k is the total number of RBF centers. K(.) is the Gaussian RBF kernel defined as:

K_{σ_i}(x, x_i) = exp(−‖x − x_i‖² / σ_i²),    (5)
where σ_i is the shape parameter for the center x_i. The centers are automatically selected among the training data during training via the following cost function:

max_α Q(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K_{σ_ij}(x_i, x_j),    (6)

subject to:

Σ_{i=1}^{n} α_i y_i = 0,  C ≥ α_i ≥ 0 for i = 1, 2, ..., n,  and  K_{σ_ij}(x_i, x_j) < T if y_i y_j = −1, ∀ i, j,

where T is a constant scalar value assuring that the RBF function yields a smaller value for any given pair of samples from different classes. The shape parameter σ_ij is defined as σ_ij = min(σ_i, σ_j). For a given closest pair of vectors x_i and x_j for which y_i y_j = −1, we can define the kernel parameters as follows:

σ_i² = σ_j² = −‖x_i − x_j‖² / ln(K(x_i, x_r)).    (7)

As a result, the decision function takes the form:

f(x) = Σ_{i=1}^{k} α_i y_i exp(−‖x − x_i‖² / σ_i²) − b,    (8)

where k is the total number of support vectors. In our algorithm, the bias value b is constant and equal to 0.

Discussion: Normally, avoiding the term b in the decision function eliminates the constraint Σ_{i=1}^{n} α_i y_i = 0 in the optimization problem. However, since y_i ∈ {−1, +1}, the sum in that constraint can be rewritten as Σ_{i=1}^{n} α_i y_i = Σ_{i=1}^{m_1} α_i − Σ_{i=1}^{m_2} α_i = 0, where m_1 + m_2 = n. This means that if the α_i values are around the value of 1 (or equal to 1) for all i, then this constraint also means that the total number of support vectors from each class should be equal or similar to each other, i.e., m_1 ≈ m_2. That is why we keep the constraint Σ_{i=1}^{n} α_i y_i = 0 in our algorithm, as it helps us compute SVs from both classes in comparable numbers. The decision function f(x) can be expressed as f(x) = Σ_{i=1}^{k_1} α_i y_i K_i(x, x_i) + Σ_{j=1}^{k_2} α_j y_j K_j(x, x_j), where k_1 is the total number of SVs near the vector x and k_2 is the total number of SVs far from x, i.e., those for which the Euclidean norm satisfies ‖x_j − x‖² − σ_j² >> 0. Notice that k_1 + k_2 = k. This property suggests that local predictions can be made by the approximated decision function f(x) ≈ Σ_{i=1}^{k_1} α_i y_i K_i(x, x_i). This approach can simplify the computations on large datasets since we do not require access to all of the available SVs. Further details on the SDs and the SDN formulation can be found in [12].
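The following is a minimal NumPy sketch of the prediction rule in Eqs. (4)-(8) and of the class-wise radii in Eqs. (1)-(2). It is an illustration under the stated equations only, not the author's released implementation; the training step that selects the centers and the weights α_i is omitted.

```python
# SDN prediction and similarity-domain radii (a sketch of Eqs. (1), (2), (4) and (8)).
import numpy as np

def sdn_predict(x, centers, alphas, labels, sigmas):
    """x: (d,) query; centers: (k, d); alphas, labels, sigmas: (k,) with sigmas = sigma_i."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    kernels = np.exp(-sq_dist / sigmas ** 2)      # Gaussian RBFs, one sigma per center
    f = np.sum(alphas * labels * kernels)         # bias b = 0, as stated in the text
    return np.sign(f)

def pairwise_radii(pos, neg):
    """Half the distance to the closest opposite-class sample, as in Eqs. (1)-(2)."""
    r_pos = np.min(np.linalg.norm(pos[:, None, :] - neg[None, :, :], axis=2), axis=1) / 2
    r_neg = np.min(np.linalg.norm(neg[:, None, :] - pos[None, :, :], axis=2), axis=1) / 2
    return r_pos, r_neg
```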
5. Parametric Shape Modeling with SDN

Sparse and parametric shape modeling is a challenge in the literature. For shape modeling, we propose using SDNs. SDN can model shapes sparsely with its computed kernel parameters. For that, we first train SDN to learn the shape as a decision boundary from the given binary image: we label the shape (e.g., the white region in Fig. 3a) as foreground and label everything else (e.g., the black region in Fig. 3a) as background, while using each pixel's 2D coordinates as features. Once the image is learned by SDN, the computed kernel parameters of SDN along with their 2D coordinates are used to model the shape with our one-class classifier without performing any re-training. As mentioned earlier, we can use Gaussian RBFs and their shape parameters (i.e., the kernel parameters) to model shapes parametrically within the SDN framework. For that purpose, we can save and use only the foreground (the shape's) RBF (or SD) centers and their shape parameters to obtain a one-class classifier. The computed RBF centers of SDN can be grouped for the foreground and the background as

C_1 = {x_i}_{i=1, y_i ∈ +1}^{s_1}  and  C_2 = {x_i}_{i=1, y_i ∈ −1}^{s_2},

where s_1 + s_2 = k, s_1 is the total number of centers from the (+1) class and s_2 is the total number of centers from the (−1) class. Since the Gaussian kernel functions (RBFs) now represent local SDs geometrically, the original decision function f(x) can now be approximated by using only C_1 (or by using only C_2). Therefore, we define the one-class approximation, using only the centers and their associated kernel parameters from C_1, for any given x as follows:

y = +1, if ‖x − x_i‖ < aσ_i² for some x_i ∈ C_1; otherwise y = −1,    (9)

where the SD radius for the ith center x_i is defined as aσ_i² and a is a domain-specific constant. One-class approximation examples are given in Figure 1b, where we used only the SDs from the foreground to reconstruct the altered image.

6. Extracting the Skeleton from SDs

The parametric and geometric properties of SDN provide new parameters to analyze a shape via its similarity domains. Furthermore, while typical neural network based applications for skeleton estimation focus on learning from multiple images, SDN can learn a shape's parameters from only the given single image without requiring any additional data set or a pre-trained model. Therefore, SDN is advantageous especially in cases where the data is very limited or only one sample shape is available. Once learned and computed by the SDN, the similarity domains (SDs) can be used to extract the skeleton of a given shape. When computed by considering only the existing SDs, the process of skeleton extraction requires only a subset of pixels (i.e., the SDs) during the computation. To extract the skeleton, we first bin the computed shape parameters (σ_i²) into m bins (in our experiments m is set to 10). Since the majority of the similarity domains typically lie around the object (or shape) boundary, most shape parameters take small values.
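Looking back at Eq. (9), the one-class labeling rule can be sketched in a few lines; the function name and array layout below are assumptions, while the rule itself (compare the distance to each foreground center against the radius aσ_i²) follows the equation.

```python
# One-class approximation of Eq. (9) using only the foreground SD centers C1 (a sketch).
import numpy as np

def one_class_label(x, fg_centers, fg_sigma2, a=2.85):
    """+1 if x falls inside any foreground similarity domain, -1 otherwise."""
    dist = np.linalg.norm(fg_centers - x, axis=1)   # distances to the C1 centers
    return 1 if np.any(dist < a * fg_sigma2) else -1
```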
Fig. 4. The results of filtering the shape parameters at different thresholds are visualized for the image shown in Figure 3a. The remaining similarity domains after thresholding and the skeletons extracted from those similarity domains are visualized: (a) all of the foreground r_i; (b) for σ_i² > 29.12; (c) for σ_i² > 48.32; (d) for σ_i² > 67.51; (e) for σ_i² > 86.71; (f) for σ_i² > 105.90.
With a simple thresholding process, we can eliminate those small SDs from the subset in which we search for the skeleton. Eliminating them first leaves a smaller number of SDs to consider for skeleton extraction. After eliminating those small SDs and their computed parameters with a simple thresholding process, we connect the centers of the remaining SDs by tracing the overlapping SDs. If non-overlapping SDs exist within the same shape after the thresholding process, we perform linear estimation and connect the closest SDs. We interpolate a line between those closest SDs to visualize the skeleton in our figures. Thresholding the kernel parameters of the SDs at different values yields different sets of SDs and, therefore, different skeletons, as shown in Figure 4.
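A minimal sketch of the naive procedure just described follows: threshold the shape parameters, keep the large SDs and link centers whose similarity domains overlap. The threshold value, the overlap test and the output format are assumptions; only the overall steps follow the text.

```python
# Naive skeleton extraction from thresholded similarity domains (a sketch).
import numpy as np

def skeleton_edges(centers, sigma2, threshold, a=2.85):
    """centers: (k, 2) SD centers; sigma2: (k,) shape parameters of the foreground SDs."""
    keep = sigma2 > threshold                          # discard the small boundary SDs
    c, radii = centers[keep], a * sigma2[keep]         # radius as used in the figures
    edges = []
    for i in range(len(c)):
        for j in range(i + 1, len(c)):
            # two SDs are considered overlapping when the center distance is below the radii sum
            if np.linalg.norm(c[i] - c[j]) < radii[i] + radii[j]:
                edges.append((i, j))
    return c, edges                                    # remaining centers + links forming the skeleton
```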
7. Experiments

In this section, we demonstrate how to use SDN for parametric shape learning from a given single input image without requiring any additional dataset. Since it is hard to model shapes with the standard RBNs, and since no good RBN implementation was available to us, we did not use any other RBN in our experiments for comparison. As discussed in the earlier sections, the standard RBNs require multiple individual steps to compute the RBN parameters, including the total number of RBF centers, the center values and the shape parameters at those centers. However, a comparison of kernel machines (SVMs) and SDN on shape modeling has already been reported in the literature (see [12]). Therefore, in this section, we focus on parametric shape modeling and skeleton extraction from SDs by using SDNs. In the figures, all the images are resized to fit into the figures.
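A minimal sketch of how a single binary shape image can be turned into the training data used in these experiments (each pixel's 2D coordinates as features and its color as a +1/-1 label, as described in Sections 5 and 7.1) is given below; the array conventions are assumptions.

```python
# Build SDN training data from one binary shape image (a sketch).
import numpy as np

def image_to_training_set(binary_image):
    """binary_image: (H, W) array with foreground > 0 and background == 0."""
    ys, xs = np.mgrid[0:binary_image.shape[0], 0:binary_image.shape[1]]
    features = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # pixel 2D coordinates
    labels = np.where(binary_image.ravel() > 0, 1, -1)                   # foreground vs. background
    return features, labels
```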
7.1. Parametric Shape Modeling with SDs

Here, we first demonstrate visualizing the computed shape parameters of SDN on a given sample image in Figure 3. Figure 3a shows the original input image. We used each pixel's 2D coordinates in the image as the features of the training data, and each pixel's color (being black or white) as the training labels. SDN is trained at T = 0.05. SDN learned and modeled the shape and reconstructed it with zero pixel error by using 1393 SDs. Pixel error is the total number of wrongly classified pixels in the image. Figure 3b visualizes all the computed shape parameters of the RBF centers of SDN as circles and Figure 3c visualizes the ones for the foreground only. The radius of a circle in all figures is computed as aσ_i², where a = 2.85. We found the value of a through a heuristic search and noticed that 2.85 suffices for all the shape experiments that we had. There are a total of 629 foreground RBF centers computed by SDN (only 2.51% of all the input image pixels).

7.2. Skeleton Extraction from the SDs

Next, we demonstrate the skeleton extraction from the computed similarity domains as a proof of concept. Extracting the skeleton from the SDs, as opposed to extracting it from the pixels, simplifies the computations as the SDs are only a small portion of the total number of pixels (reducing the search space). To extract the skeleton from the computed SDs, we first quantize the shape parameters of the object into 10 bins and then, starting from the largest bin, we select the most useful bin value to threshold the shape parameters. The remaining SD centers are connected based on their overlapping similarity domains. If multiple SDs overlap inside the same SD, we look at their centers and we ignore the SDs whose centers fall within the same SD (keeping the original SD center). That is why some points of the shape are not considered as a part of the skeleton in Figure 4. Figure 4 demonstrates the remaining (thresholded) SD centers and their radiuses at various thresholds in yellow. In the figure, the skeletons (shown as a blue line) are extracted by considering only the remaining SDs after thresholding, as explained in Section 6. Another example is shown in Figure 5. The input binary image is shown in Figure 5a. Figure 5b shows all the foreground similarity domains. The learned SDs are thresholded and the corresponding skeleton, as extracted from the remaining SDs, is visualized as a blue line in Figure 5c. One benefit of using only SDs to re-compute skeletons is that, as the SDs are a subset of the training data, the number of SDs is much smaller than the total number of pixels that would need to be considered for skeleton computation. While the skeleton extraction algorithm shown here is a naive and basic one, the goal is to show the use of SDs for (re)computation of skeletons instead of using all the pixels of the shape.

Table 1. Bin centers for the quantized foreground shape parameters (σ_i²) and the total number of shape parameters that fall in each bin for the image in Fig. 3a.

Bin Center:    9.93   29.12   48.32   67.51   86.71   105.90   125.09   144.29   163.48   182.68
Total Counts:   591      18       7       3       2        4        0        0        1        3
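A binning like the one in Table 1 can be reproduced with a simple histogram over the computed shape parameters; a minimal sketch follows (the σ_i² values themselves come from the trained SDN and are not shown here).

```python
# Quantize the foreground shape parameters into m bins, as in Table 1 (a sketch).
import numpy as np

def bin_shape_parameters(sigma2, m=10):
    counts, edges = np.histogram(sigma2, bins=m)
    bin_centers = (edges[:-1] + edges[1:]) / 2
    return bin_centers, counts
```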
Fig. 5. (color online) Visualization of the skeleton (shown as a blue line) extracted from SDs on another image. (a) Input image: 64 x 83 pixels. (b) Foreground SDs (σ_i² > 0). (c) Skeleton for σ_i² > 6.99.
8. Conclusion

In this chapter, we introduced how the computed SDs of the SDN algorithm can be used to extract the skeleton from shapes as a proof of concept. Instead of using and processing all the pixels to extract the skeleton of a given shape, we propose to use the SDs of the shape to extract the skeleton. SDs are a subset of the training data (i.e., a subset of all the pixels), thus using SDs can considerably reduce the cost of (re)computing skeletons at different parameters. SDs and their parameters are obtained by SDN after the training steps. The RBF shape parameters of SDN are used to define the size of the SDs, and they can be used to model a shape as described in Section 5 and as visualized in our experiments. While the presented skeleton extraction algorithm is a naive solution to demonstrate the use of SDs, future work will focus on presenting more elegant solutions to extract the skeleton from SDs. SDN is a novel classification algorithm and has potential in many shape analysis applications besides skeleton extraction. The SDN architecture contains a single hidden layer neural network and it uses RBFs as activation functions in the hidden layer. Each RBF has its own kernel parameter. The optimization algorithm plays an important role in obtaining meaningful SDs with SDN for skeleton extraction. We use a modified version of the Sequential Minimal Optimization (SMO) algorithm [23] to train SDN. While we have not tested its performance with other optimization techniques yet, we do not expect other standard batch or stochastic gradient based algorithms to yield the same results as we obtain with our algorithm. A future work will focus on the optimization part and will perform a more detailed analysis from the optimization perspective. A shape can be modeled parametrically by using SDNs via its similarity domains, where SDs are modeled with radial basis functions. A further reduction in parameters can be obtained with the one-class classification approximation of SDN as shown in Eq. 9. SDN can parametrically model a given single shape without requiring or using large datasets. Therefore, it can be efficiently used to learn and model a shape even if only one image is available and no additional dataset or model can be provided.
A future work may include introducing a better skeleton extraction algorithm utilizing SDs. The current naive technique relies on manual thresholding; a future technique may eliminate such a manual operation to extract the skeleton.

Acknowledgement

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research. The author would like to thank Prof. Chi Hau Chen for his valuable comments and feedback.

References

[1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE. 86(11), 2278–2324, (1998). [2] S. Ozer, D. L. Langer, X. Liu, M. A. Haider, T. H. van der Kwast, A. J. Evans, Y. Yang, M. N. Wernick, and I. S. Yetik, Supervised and unsupervised methods for prostate cancer segmentation with multispectral mri, Medical physics. 37(4), 1873–1883, (2010). [3] L. Jiang, S. Chen, and X. Jiao, Parametric shape and topology optimization: A new level set approach based on cardinal basis functions, International Journal for Numerical Methods in Engineering. 114(1), 66–87, (2018). [4] S.-H. Yoo, S.-K. Oh, and W. Pedrycz, Optimized face recognition algorithm using radial basis function neural networks and its practical applications, Neural Networks. 69, 111–125, (2015). [5] M. Botsch and L. Kobbelt. Real-time shape editing using radial basis functions. In Computer graphics forum, vol. 24, pp. 611–621. Blackwell Publishing, Inc Oxford, UK and Boston, USA, (2005). [6] S. Ozer, M. A. Haider, D. L. Langer, T. H. van der Kwast, A. J. Evans, M. N. Wernick, J. Trachtenberg, and I. S. Yetik. Prostate cancer localization with multispectral mri based on relevance vector machines. In Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE International Symposium on, pp. 73–76. IEEE, (2009). [7] S. Ozer, On the classification performance of support vector machines using chebyshev kernel functions, Master’s Thesis, University of Massachusetts, Dartmouth. (2007). [8] S. Ozer, C. H. Chen, and H. A. Cirpan, A set of new chebyshev kernel functions for support vector machine pattern classification, Pattern Recognition. 44(7), 1435–1447, (2011). [9] R. P. Lippmann, Pattern classification using neural networks, IEEE communications magazine. 27(11), 47–50, (1989). [10] L. Fu, M. Zhang, and H. Li, Sparse rbf networks with multi-kernels, Neural processing letters. 32(3), 235–247, (2010). [11] F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the smo algorithm. In Proceedings of the twenty-first international conference on Machine learning, p. 6. ACM, (2004). [12] S. Ozer, Similarity domains machine for scale-invariant and sparse shape modeling, IEEE Transactions on Image Processing. 28(2), 534–545, (2019). [13] N. D. Cornea, D. Silver, and P. Min, Curve-skeleton properties, applications, and algorithms, IEEE Transactions on Visualization & Computer Graphics. (3), 530–548, (2007). [14] H. Sundar, D. Silver, N. Gagvani, and S. Dickinson. Skeleton based shape matching and retrieval. In 2003 Shape Modeling International., pp. 130–139. IEEE, (2003). [15] P. K. Saha, G. Borgefors, and G. S. di Baja, A survey on skeletonization algorithms and their applications, Pattern Recognition Letters. 76, 3–12, (2016).
[16] I. Demir, C. Hahn, K. Leonard, G. Morin, D. Rahbani, A. Panotopoulou, A. Fondevilla, E. Balashova, B. Durix, and A. Kortylewski, SkelNetOn 2019 Dataset and Challenge on Deep Learning for Geometric Shape Understanding, arXiv e-prints. (2019).
[17] M. Mongillo, Choosing basis functions and shape parameters for radial basis function methods, SIAM Undergraduate Research Online. 4(190-209), 2–6, (2011).
[18] J. Biazar and M. Hosami, An interval for the shape parameter in radial basis function approximation, Applied Mathematics and Computation. 315, 131–149, (2017).
[19] S. S. Bucak, R. Jin, and A. K. Jain, Multiple kernel learning for visual object recognition: A review, Pattern Analysis and Machine Intelligence, IEEE Transactions on. 36(7), 1354–1369, (2014).
[20] M. Gönen and E. Alpaydin, Multiple kernel learning algorithms, The Journal of Machine Learning Research. 12, 2211–2268, (2011).
[21] N. Benoudjit, C. Archambeau, A. Lendasse, J. A. Lee, M. Verleysen, et al. Width optimization of the gaussian kernels in radial basis function networks. In ESANN, vol. 2, pp. 425–432, (2002).
[22] M. Bataineh and T. Marler, Neural network for regression problems with reduced training sets, Neural Networks. 95, 1–9, (2017).
[23] J. Platt, Fast training of support vector machines using sequential minimal optimization, Advances in Kernel Methods — Support Vector Learning. 3, (1999).
[24] C. J. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery. 2(2), 121–167, (1998).
[25] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. (Cambridge University Press, 2004).
CHAPTER 1.5 ON CURVELET-BASED TEXTURE FEATURES FOR PATTERN CLASSIFICATION
Ching-Chung Li and Wen-Chyi Lin
University of Pittsburgh, Pittsburgh, PA 15261
E-mails: [email protected], [email protected]
This chapter presents an exploration of the curvelet-based approach to image texture analysis for pattern recognition. A concise introduction to the curvelet transform is given, which is a relatively new method for sparse representation of images with rich edge structures. Its application to the multi-resolution texture feature extraction is discussed. Merits of this approach have been reported in recent years in several application areas, for example, on analysis of medical MRI organ tissue images, classification of critical Gleason grading of prostate cancer histological images and grading of mixture aggregate material, etc. A bibliography is provided at the end for further reading.
1. Introduction

Image texture may be considered as an organized pattern of some simple primitives and their spatial relationships described in a statistical sense. The development of methodologies for texture analysis and classification over the past fifty years has led to a number of successful applications in, for example, remote sensing, astrophysical and geophysical data analyses, biomedical imaging, biometric informatics, document retrieval and material inspection, etc.1-3 Classical methods began with the notion of the co-occurrence matrix on gray levels of two neighboring pixels, the intriguing Laws' masks, banks of spatial filters and fractal models, followed by multi-resolution approaches with wavelet transforms,4-5 Gabor wavelet filter banks6 in a layered pyramid, the ridgelet transform and, more recently, the curvelet transform.7-9 The curvelet transform developed by Candes and Donoho10-12 is a generalization of the wavelet transform for optimal sparse representation of a class of continuous functions with curvilinear singularity, i.e., discontinuity along a curve with bounded curvature. With regard to 2-dimensional image processing, the development
of the curvelet transform has gone through two generations. The first generation attempted to extend the ridgelet transform in small blocks of smoothly partitioned subband-filtered images to obtain piecewise line segments that successively approximate a curve segment at each scale. It suffered from the accuracy problem of using partitioned blocks of small size to compute the local ridgelet transform. The second generation adopts a new formulation through curvelet design in the frequency domain that yields the equivalent outcome, attaining the same curve characteristics in the spatial domain. A fast digital implementation of the discrete curvelet transform is also available for use in applications.11,13 The curvelet transform has been applied to image de-noising, estimation, contrast enhancement, image fusion, texture classification, inverse problems and sparse sensing.2,14-17 There is an excellent paper by J. Ma and G. Plonka18 serving as a review and tutorial on the curvelet transform for engineering and information scientists. The subject is also discussed in sections of S. Mallat's book4 on signal processing and of the book by J-L. Starck et al.19 on sparse image and signal processing. This chapter provides a concise description of the second generation curvelet transform and of feature extraction based on multi-scale curvelet coefficients for image texture pattern classification.

2. The Method of Curvelet Transform

A curvelet in the 2-dimensional space is a function φ(x1, x2) of two spatial variables x1 and x2 that is defined primarily over a narrow rectangular region of short width along the x1 axis and longer length along the x2 axis, following the parabolic scaling rule, i.e., the width scaling is equal to the square of the length scaling, as illustrated in Fig. 1. It changes rapidly along x1 and is smooth along x2, so its Fourier transform occupies a broad frequency band along ω1 and is limited to a narrow low-frequency band along ω2; that is, φ̂_j(ω1, ω2) is compactly supported over a narrow sector in the 2-d frequency domain (ω1, ω2), where ^ denotes the Fourier transform. The curvelet is orientation sensitive. If φ(x1, x2) is rotated by an angle θ and expressed in the spatial polar coordinates (ρ, θ), a similar frequency support appears along the radial frequency axis r in the frequency polar plot. With both shift (k1, k2) in (x1, x2) and rotation θ, the curvelet at scale j ≥ 0 is given by

φ_{j,θ,k}(x1, x2) = 2^{-3j/4} φ( R_θ [2^{-j}(x1 − k1), 2^{-j/2}(x2 − k2)]^T )
and its Fourier transform is

φ̂_{j,θ,k}(ω1, ω2) = 2^{3j/4} φ̂( R_θ [2^{j} ω1, 2^{j/2} ω2]^T ) e^{−i(ω1 k1 + ω2 k2)},     (1)

where the subscript k denotes (k1, k2), [ , ]^T denotes a column vector and R_θ is the rotation matrix

R_θ = [  cos θ   sin θ
        −sin θ   cos θ ].

The set {φ_{j,θ,k}(x1, x2)} is a tight frame that can be used to give an optimal representation of a function f(x1, x2) by the linear combination of the curvelets {φ_{j,θ,k}(x1, x2)} with coefficients {c_{j,θ,k}}, which are the set of inner products:

f = Σ_{j,θ,k} c_{j,θ,k} φ_{j,θ,k}(x1, x2)

and

c_{j,θ,k} = ⟨f, φ_{j,θ,k}⟩ = (1/2π)^2 ⟨f̂, φ̂_{j,θ,k}⟩.     (2)
In the discrete curvelet transform, let us consider the design of the curvelet φ_j(x1, x2) through its Fourier transform φ̂_j(ω1, ω2) in the polar frequency domain (r, θ) via a pair of windows, a radial window W(r) and an angular window V(t), where r ∈ (1/2, 2) and t ∈ [−1, 1]. Note that r is the normalized radial frequency variable with the normalization constant π, and the angular variable θ is normalized by 2π to give the parameter t, which may vary around the normalized orientation θ_l in the range [−1, 1]. Both W(r) and V(t) are smooth non-negative real-valued functions and are subject to the admissibility conditions

Σ_{j=−∞}^{∞} W^2(2^{j} r) = 1,   r ∈ (3/4, 3/2);     (3)

Σ_{l=−∞}^{∞} V^2(t − l) = 1,   t ∈ (−1/2, 1/2).     (4)

Let U_j be defined by

U_j(r, θ_l) = 2^{−3j/4} W(2^{−j} r) V( 2^{⌊j/2⌋} θ / 2π ),     (5)
where l indexes the normalized orientation θ_l at scale j (j ≥ 0). With the symmetry property of the Fourier transform, the range of θ is now (−π/2, π/2) and thus the resolution unit can be reduced to half size. Let U_j be the polar wedge defined with the support of W and V,

U_j(r, θ) = 2^{−3j/4} W(2^{−j} r) V( 2^{⌊j/2⌋} θ / 2π ),   θ ∈ (−π/2, π/2),     (6)

where ⌊j/2⌋ denotes the integer part of j/2. This is illustrated by the shaded sector in Fig. 2. In the frequency domain, the scaled curvelet at scale j without shift can be chosen with the polar wedge U_j given by

φ̂_{j,l}(ω1, ω2) = U_j(r, θ);

with the shift k, it becomes

φ̂_{j,l,k}(ω1, ω2) = U_j(r, θ − θ_l) e^{−i(ω1 k1 + ω2 k2)},     (7)

where θ_l = l · 2π · 2^{−⌊j/2⌋}, with l = 0, 1, 2, … such that 0 ≤ θ_l < π. Then, through Plancherel's theorem, the curvelet coefficients can be obtained by the inner product in the frequency domain:

c(j, l, k) := (1/(2π)^2) ∫ f̂(ω) φ̂_{j,l,k}(ω) dω
           = (1/(2π)^2) ∫ f̂(ω) U_j(r, θ − θ_l) e^{ i ⟨x_k^{(j,l)}, ω⟩ } dω
           = (1/(2π)^2) ∫ f̂(ω1, ω2) U_j(r, θ − θ_l) e^{ i (k1 ω1 + k2 ω2) } dω1 dω2.     (8)
The discrete curvelet coefficients can be computed more efficiently through the inner product in the frequency domain, as shown by Eq. (8) and in Fig. 1, where, for one scale j, the same curvelet function at different orientations is tiled in a circular shell or corona. Conceptually, it is straightforward: we can compute the inner product of the Fourier transform of the image and that of the curvelet in each wedge, put them together and then take the inverse Fourier transform to obtain the curvelet coefficients for that scale. However, the FFT of the image is in rectangular coordinates while the wedges are in polar coordinates, and the square region and the circular region do not fully overlap.
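The frequency-domain inner product can be sketched in a few lines of code. This is only a conceptual illustration of Eq. (8): building the wedge window U_j (and the wrapping discussed below) is the hard part and is omitted here; practical work should use CurveLab,13 and the function and variable names below are illustrative assumptions.

import numpy as np

def coefficients_for_wedge(image, wedge_window):
    """Conceptual sketch: curvelet coefficients for one scale/orientation.
    wedge_window is a frequency-domain window (same shape as the image FFT)
    supported on one wedge; its construction is not shown."""
    f_hat = np.fft.fft2(image)                 # image spectrum
    windowed = f_hat * np.conj(wedge_window)   # inner product per frequency bin
    return np.fft.ifft2(windowed)              # coefficients indexed by the shift k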
Fig. 1. A narrow rectangular support for a curvelet in the spatial domain is shown on the right; its width and length have two different scales according to the parabolic scaling rule, and its shift and rotation in the spatial domain are also shown. On the left, a 2-dimensional frequency plane in polar coordinates is shown with radial windows in a circular corona supporting curvelets at different orientations; the shaded sector illustrates a radial wedge with parabolic scaling.
Let the wedges be extended to concentric squares as illustrated in Fig. 2; the wedges then take different trapezoidal shapes and the incremental orientations of successive wedges are far from uniform, so special care must be taken to facilitate the computation. There are two different algorithms for the fast digital curvelet transform: one is called the unequispaced FFT approach, and the other the frequency wrapping approach. The concept of the frequency wrapping approach may be briefly explained by the sketch in Fig. 3. Let us examine a digital wedge at scale j, shown by the shaded area. Under a simple shearing process, the trapezoidal wedge can be mapped into one with a parallelepiped-shaped support enclosing the trapezoidal support, which will contain some data samples from two neighboring trapezoidal wedges. It is then mapped into a rectangular region centered at the origin of the frequency plane. It turns out that the data in the parallelepiped wedge can be properly mapped into the rectangular wedge by a wrapping process, as illustrated by the shaded parts enclosed in the rectangle. The tiling of the parallelepipeds, which is geometrically periodic in either the vertical or the horizontal direction and each of which contains the identical curvelet
information, wraps its information into the rectangular wedge so that it contains the same frequency information as the parallelepiped and, thus, the original wedge; the inner product with the given image therefore gives the same result. Although the wrapped wedge appears to contain broken pieces of the data, it is actually just a re-indexing of the components of the original data. In this way, the inner product can be computed for each wedge and immediately followed by the inverse FFT to obtain the contribution to the curvelet coefficients from the original wedge with the trapezoidal support. Pooling the contributions from all the wedges gives the final curvelet coefficients at that scale. Software for both algorithms is freely available from Candes' laboratory;13 we have used the second algorithm in our study of the curvelet-based texture pattern classification of prostate cancer tissue images.
Fig. 2. The digital coronae in the frequency plane with pseudo-polar coordinates; the trapezoidal wedges shown also satisfy the parabolic scaling rule.
Fig. 3. Schematic diagram illuminating the concept of the wrapping algorithm for computing digital curvelet coefficients at a given scale. A shaded trapezoidal wedge in a digital corona in the frequency plane is sheared into a parallelepiped wedge and then mapped into a rectangular wedge by a wrapping process, so that it has the identical frequency information content by virtue of the periodization. Although the wrapped wedge appears to contain broken pieces of the data, it is actually just a re-indexing of the components of the original data.
The computed curvelet coefficients of sample images are illustrated in the following four figures. The strength of the coefficients is indicated by the brightness at their corresponding locations (k1, k2) with reference to the original image spatial coordinates; thus the plot of a low scale appears coarse, and coefficients of different orientations are pooled together in one plot for each scale. Fig. 4 shows a section of a yeast image; the exterior contours and the contours of the inner core material are sparsely represented by the curvelet coefficients in each scale. Fig. 5 shows curvelet coefficients of a cloud image from coarse to fine in six scales; the coefficient texture patterns in scales 3 to 6 provide more manageable texture features for pattern analysis. Fig. 6 shows curvelet coefficients in four scales of an iris image, providing a comparable view with the well-known
representation by the Gabor wavelet.36 Fig. 7 shows the multiscale curvelet coefficients of prostate cancer tissue images of four Gleason scores. The reliable recognition of the tissue score is a very important problem in clinical urology; in the following we describe our current work on this classification problem based on the curvelet texture representation.
Fig. 4. Fuzzy yeast cells in the dark background. The original image27 has very smooth intensity variation. Scales 2−4 illustrate the integrated curvelets extracted from the original image.
Fig. 5. NASA satellite view of cloud pattern over the South Atlantic Ocean (NASA courtesy/Jeff Schmaltz).
Fig. 6. Iris image. The scales 2−5 curvelet coefficient patterns illustrate different texture distributions of the original image.28
Fig. 7. TMA prostate images of Gleason grades P3S3, P3S4, P4S3 and P4S4, along with their respective curvelet patterns of scales 2−5, demonstrate the transitions from benign and critical intermediate classes to the carcinoma class.
3. Curvelet-based Texture Features
The value of a curvelet coefficient c_{j,l,k} at position (k1, k2) under scale j denotes the strength of the curvelet component oriented at angle θ in the representation of an image function f(x1, x2). It contains information on edginess coordinated along a short path of connected pixels in that orientation. Intuitively, it is therefore advantageous to extract texture features in the curvelet coefficient space. One may use standard statistical measures, such as entropy, energy, mean, standard deviation, and the 3rd and 4th order moments of an estimated marginal distribution (histogram) of curvelet coefficients as texture features,20 and also of the co-occurrence of curvelet coefficients. The correlation of coefficients across orientations and across scales may also be utilized. These features may provide more discriminative power in texture classification than those extracted by classical approaches. Dettori, Semler and Zayed7-8 studied the use of curvelet-based texture features in the recognition of normality of organ sections in CT images and reported a significant improvement in recognition accuracy compared with results obtained using wavelet-based and ridgelet-based features. Arivazhagan, Ganesan and Kumar studied texture classification on a set of natural images from the VisTex dataset33-34 using curvelet-based statistical and co-occurrence features, and also obtained superior classification results. Alecu, Munteanu et al.35 conducted an information-theoretic analysis of the correlations of curvelet coefficients within a scale, between orientations and across scales; they showed that the generalized Gaussian density function gives a better fit for the marginal probability density functions of curvelet coefficients. Because of the sparse representation by curvelet coefficients, there are fewer significant coefficients, and the histogram at a given scale generally appears more peaked with a long tail. Following that notion, Gomez and Romero developed a new set of curvelet-based texture descriptors under the generalized Gaussian model for the marginal density functions and demonstrated their success in a classification experiment using a set of natural images from the KTH-TIPS dataset.32 Murtagh and Starck also considered the generalized Gaussian model for the histograms of curvelet coefficients at each scale and selected the second, third and fourth order moments as statistical texture features in classifying and grading aggregate mixtures, with superb experimental results.19-20 Rotation-invariant curvelet features were used in studies of region-based image retrieval by Zhang, Islam and Sumana, by Cavusoglu, and by Zand, Doraisamy, Halin and Mustaffa, all giving superior results in their comparative studies.29-31
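The first-order descriptors named above are simple to compute once the coefficients of one scale are available as an array. The short sketch below is illustrative only and is not tied to any particular toolbox; the histogram bin count is an arbitrary choice.

import numpy as np

def curvelet_texture_features(coeffs, bins=64):
    """Mean, standard deviation, energy and entropy of one scale's coefficients."""
    c = np.abs(np.asarray(coeffs)).ravel()
    mean, std = c.mean(), c.std()
    energy = np.sum(c ** 2) / c.size
    hist, _ = np.histogram(c, bins=bins)
    p = hist / max(hist.sum(), 1)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {"mean": mean, "std": std, "energy": energy, "entropy": entropy}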
We have conducted research on applying the discrete curvelet transform to extract texture features from prostate cancer tissue images for differentiating disease grade. This is described in the following to illustrate our experience with advances in curvelet transform applications.

4. A Sample Application Problem
This section gives a brief discussion of the development, in collaboration with the Pathology/Urology Departments of Johns Hopkins University, of applying the curvelet transform to the analysis of prostate pathological images of critical Gleason scores for computer-aided classification,17 which could serve as a potential predictive tool for urologists to anticipate prognosis and provide suggestions for adequate treatment. The Gleason grading system is a standard for interpreting prostate cancer established by expert pathologists based on microscopic tissue images from needle biopsies.21-27 Gleason grades are categorized from 1 to 5 based on the cumulative loss of regular glandular structure, which reflects the degree of malignancy and aggressive phenotype. The Gleason score (GS) is the sum of the primary and secondary grades, ranging from 5 to 10, in which a total score of 6 is considered a slower-growing cancer, 7 (3+4) a medium grade, and 4+3, 8, 9 or 10 more aggressive carcinoma. Gleason scores 6 and 7 mark the midpoint between low-grade (less aggressive) and intermediate-grade carcinoma, and also generate the greatest disagreement in second-opinion assessments. A set of Tissue MicroArray (TMA) images has been used as the database. Each TMA image of 1670 × 1670 pixels contains a core image of 0.6 mm in diameter at 20× magnification. The available 224 prostate histological images consist of 4 classes, P3S3, P3S4, P4S3 and P4S4, each class with 56 images from 16 cases. We used 32 images of each class for training and the remaining 24 images of each class for testing. The curvelet transform was applied to 25 half-overlapping subimage patches covering each image area column-wise and row-wise, and the class assignments of the patches are pooled to make a majority decision for the class assignment of the image. A two-level tree classifier consisting of three Gaussian-kernel support vector machines (SVMs) has been developed, where the first machine decides whether an input patch belongs to Grade 3 (GG3, comprising P3S3 and P3S4) or Grade 4 (GG4, comprising P4S3 and P4S4), and a majority decision is then made over the multiple patches of the image.
One SVM at the next level differentiates P3S3 from P3S4, and the other classifies P4S3 versus P4S4. A central area of 768 × 768 pixels of the tissue image was taken, which should be sufficient to cover the biometrics of the prostate cells and the glandular structure characteristics. Sampled by the aforementioned approach, each patch of 256 × 256 pixels then undergoes the fast discrete curvelet transform, using the CurveLab Toolbox software,13 to generate curvelet coefficients c_{j,l,k} in 4 scales. The prostate cellular and ductile structures contained therein are represented by the texture characteristics of the image region where the curvelet-based analysis is performed. As shown in Fig. 7, four patches were taken from the four prostate patterns, including the two critical in-between grades P3S4 and P4S3; the curvelet coefficients at each scale are displayed in the lower part to illustrate the edge information integrated over all orientations. The "scale number" used here for curvelet coefficients corresponds to the subband index in the discrete frequency domain. For a 256 × 256 image, scale 5 refers to the highest-frequency subband, that is, subband 5, and scales 4, 3 and 2 refer to the successively lower-frequency subbands. Their statistical measures, including the mean μ_j, variance σ_j², entropy e_j and energy E_j of the curvelet coefficients at each scale j for each patch, are computed as texture features. Nine features were selected to form a 9-dimensional feature vector for pattern classification: entropy in scales 3−4, energy in scales 2−4, mean in scales 2−3 and variance in scales 2−3.

Table 1. Jackknife cross-validation results

                 Sensitivity   Specificity   Accuracy
GG3 vs GG4         94.53%        96.88%       95.70%
P3S3 vs P3S4       97.81%        97.50%       97.65%
P4S3 vs P4S4       98.43%        97.80%       98.13%
Overall                                       93.68%
All three kernel SVMs were successfully trained, together with the majority decision rule, giving a trained tree classifier for the 4 classes of critical Gleason scoring with no training error. Leave-one-image-out cross-validation was applied to assess the trained classifier. The 10% Jackknife cross-validation tests were carried out for 100 realizations for all three SVMs, and the statistical results are listed in Table 1, with above 95.7% accuracy for the individual machines and an overall accuracy of 93.68% for the 4 classes. The trained classifier was tested with 96 images (24 images per class). The result given in Table 2 shows remarkable testing accuracy in classifying tissue images of the four critical
Gleason scores (GS) 3+3, 3+4, 4+3 and 4+4, as compared to the published results we are aware of. The lowest correct classification rate (87.5%) was obtained for the intermediate grade P4S3, which sits between P3S4 and P4S4, where subtle textural characteristics are difficult to differentiate.

Table 2. Test results of the 4-class tree classifier

Grade      P3S3     P3S4     P4S3     P4S4
Accuracy   95.83%   91.67%   87.50%   91.67%
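The two-level tree classifier with patch-level majority voting described in this section can be sketched as follows. This is only an illustrative arrangement assuming scikit-learn; the class name, the 0/1 label convention and the branch order are assumptions, and the actual work trained Gaussian-kernel SVMs on the 9-dimensional curvelet feature vectors described above.

import numpy as np
from sklearn.svm import SVC

class TwoLevelGleasonClassifier:
    """Sketch: SVM1 separates GG3 (P3S3/P3S4) from GG4 (P4S3/P4S4);
    one SVM per branch then separates the two patterns of that grade."""
    def __init__(self):
        self.svm_gg = SVC(kernel="rbf")   # level 1: GG3 vs GG4
        self.svm_g3 = SVC(kernel="rbf")   # level 2: P3S3 vs P3S4
        self.svm_g4 = SVC(kernel="rbf")   # level 2: P4S3 vs P4S4

    def predict_image(self, patch_features):
        """patch_features: (25, 9) array of per-patch curvelet statistics."""
        votes = []
        for x in np.atleast_2d(patch_features):
            x = x.reshape(1, -1)
            if self.svm_gg.predict(x)[0] == 0:      # assumed label 0 = GG3
                votes.append("P3S3" if self.svm_g3.predict(x)[0] == 0 else "P3S4")
            else:
                votes.append("P4S3" if self.svm_g4.predict(x)[0] == 0 else "P4S4")
        labels, counts = np.unique(votes, return_counts=True)
        return labels[np.argmax(counts)]            # majority vote over patches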
5. Summary and Discussion
We have discussed the exciting development over the past decade of applying the curvelet transform to texture feature extraction in several biomedical imaging, material grading and document retrieval problems.9,37-43 One may consider the curvelet as a sophisticated "texton"; the image texture is characterized in terms of its dynamic and geometric distributions at multiple scales. With the implication of sparse representation, this leads to efficient and effective multiresolution texture descriptors capable of providing enhanced pattern classification performance. Much more work needs to be done to explore its full potential in various applications. A closely related method, the wave atoms44 representation of oscillatory patterns, may guide new joint development of image texture characterization in different fields of practical application.

Appendix
This appendix provides a summary of recent work45-46 on applying curvelet-based texture analysis to grading the critical Gleason patterns of prostate cancer tissue histological images exemplified in Section 4. With respect to all orientations at each location of an image at a given scale, the selection of curvelet coefficients with significant magnitude yields the dominant curve segments in reconstruction, where both positive edges and the corresponding negative edges in the opposite direction always exist. This enables a sparser representation of the boundary information of nuclei and glandular structures; the histogram of these curvelet coefficients at a given scale shows a bimodal distribution. Reversing the
sign of the negative coefficients and merging them into the pool of positive curvelet coefficients of significant magnitude, we obtain a single-mode distribution of the curvelet magnitude which is nearly rotation invariant. Thus, for a given scale, at each location the maximum curvelet coefficient is defined by

c̃_j(k) = max_θ c_j(k, θ).     (9)
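Eq. (9) amounts to a per-location maximum over the orientation index; a minimal sketch, assuming the coefficients of one scale are available as a list of same-sized arrays, one per orientation:

import numpy as np

def maximum_curvelet_coefficients(coeffs_by_orientation):
    """At each location k, take the maximum magnitude over all orientations
    of one scale (negative edges are folded in via the magnitude)."""
    stack = np.stack([np.abs(c) for c in coeffs_by_orientation], axis=0)
    return stack.max(axis=0)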
Based upon histograms of the maximum curvelet coefficients at different scales, statistical texture features are computed. A two-layer tree classifier with two Gaussian-kernel SVMs has been trained to classify each tissue image into one of the four critical patterns of Gleason grading: GS 3+3, GS 3+4, GS 4+3 and GS 4+4. In addition to the variance σ_j², energy Ener_j and entropy S_j, the skewness γ_j and kurtosis kurt_j of the maximum curvelet coefficients at scale j were also considered for texture feature selection. The statistical descriptors were evaluated from all training images and rank-ordered by applying the Kullback–Leibler divergence measure. Eight features were selected for the first SVM (SVM 1) for classifying GS 3+3 and GS 4+4, as given below:
[ S_4, γ_4, Ener_4, γ_5, Ener_5, kurt_4, σ_3², kurt_3 ]^T.
Patches in images of GS 3+4 and GS 4+3 may have textures that are a mixture of grade G3 and grade G4 rather than purely G3 or G4; thus a refinement of the feature vector used in SVM 2 at the second level was obtained by adding some fine-scale feature components and re-ranking the features to enhance the differentiation between GS 3+4 and GS 4+3, which resulted in the selection of a set of 10 feature components as given below:
[ S_4, Ener_4, γ_4, σ_4², kurt_5, S_5, γ_5, Ener_5, kurt_3, σ_3² ]^T.
The performance of this curvelet-based classifier is summarized as follows: one of the 100 realizations of the validation test of SVM 1 at the first level and of SVM 2 at the second level, and the testing result of the tree classifier, are given in Tables 3, 4 and 5, respectively. Its comparison, in terms of cross-validation, with the classifier by Fehr et al.48 and with the one by Mosquera-Lopez et al.47 is given in Table 6.
Table 3. Validation test of SVM 1 (based on image patches)

Test \ Label        GS 3+3   GS 4+4   Error    Overall Accuracy
Class 1: GS 3+3       789       11    1.37%        98.63%
Class 2: GS 4+4        14      786    1.75%        98.25%
Average                                1.56%        98.44%

Level 1 classifier: SVM 1 + patch voting
(input with training samples of GS 3+3 and 4+4, and with samples of GS 3+4 and 4+3)

Test \ Label        Class 1 GS 3+3   Class 2 GS 4+4   Class 3 GS 3+4   Class 3 GS 4+3   Error
Class 1: GS 3+3            32               0                0                0           0%
Class 2: GS 4+4             0              32                0                0           0%
Class 3: GS 3+4             0               0               32                0           0%
Class 3: GS 4+3             0               0                0               32           0%
Table 4. Validation results of SVM 2

Test \ Label    GS 3+4   GS 4+3   Indecision   Accuracy
GS 3+4            31        —          1        96.88%
GS 4+3             —       31          1        96.88%
Average accuracy: 96.88%; indecision: 3.12% per class
Table 5. Testing result of the tree classifier

Test \ Label                GS 3+3   GS 4+4   GS 3+4   GS 4+3   Indecision   Overall   Accuracy
Gleason Score 6: GS 3+3        24        0        0        0         0          24       100%
Gleason Score 8: GS 4+4         0       23        1        0         0          24       95.83%
Gleason Score 7: GS 3+4         0        0       22        0         2          24       91.67%
Gleason Score 7: GS 4+3         0        0        0       23         1          24       95.83%
Total                                                                           96
Average accuracy: 95.83%
Table 6. Comparison of cross-validation of different approaches for G3 vs G4 and 4 critical Gleason grades

Method                                              Dataset                                               Grade 3 vs Grade 4   GS 7 (3+4) vs GS 7 (4+3)
Quaternion wavelet transform (QWT), quaternion     30 grade 3, 30 grade 4 and 11 grade 5 images                 98.83%                   —
ratios, and modified LBP47
Texture features from combining diffusion          34 GS 3+3 vs 159 GS 7; 114 GS 3+4 vs 26 GS 4+3               93.00%                 92.00%
coefficient and T2-weighted MRI images48            (159 GS 7 includes 114 GS 3+4, 26 GS 4+3, 19 GS 8)
Our two-level classifier using maximum curvelet    32 GS 3+3, 32 GS 3+4, 32 GS 4+3, and 32 GS 4+4               98.88%                 95.58%
coefficient-based texture features46                images (20×)
References
1. M. Tuceryan and A. K. Jain, "Texture Analysis," in Handbook of Pattern Recognition and Computer Vision, 2nd ed., Eds. C. H. Chen, L. F. Pau and P. S. P. Wang, Chap. 2.1, World Scientific, (1999).
2. C. V. Rao, J. Malleswara, A. S. Kumar, D. S. Jain and V. K. Dudhwal, "Satellite Image Fusion using Fast Discrete Curvelet Transform," Proc. IEEE Intern. Advance Computing Conf., pp. 252-257, (2014).
3. M. V. de Hoop, H. F. Smith, G. Uhlmann and R. D. van der Hilst, "Seismic Imaging with the Generalized Radon Transform, A Curvelet Transform Perspective," Inverse Problems, vol. 25, 025005, (2009).
4. S. Mallat, A Wavelet Tour of Signal Processing, the Sparse Way, 3rd ed., Chap. 5, (2009).
5. C. H. Chen and G. G. Lee, "On Multiresolution Wavelet Algorithm using Gaussian Markov Random Field Models," in Handbook of Pattern Recognition and Computer Vision, 2nd ed., Eds. C. H. Chen, L. F. Pau and P. S. P. Wang, Chap. 1.5, World Scientific, (1999).
6. A. K. Jain and F. Farrokhnia, "Unsupervised Texture Segmentation using Gabor Filters," Pattern Recognition, vol. 34, pp. 1167-1186, (1991).
7. L. Dettori and A. I. Zayed, "Texture Identification of Tissues using Directional Wavelet, Ridgelet and Curvelet Transforms," in Frames and Operator Theory in Image and Signal Processing, ed. D. R. Larson et al., Amer. Math. Soc., pp. 89-118, (2008).
8. L. Dettori and L. Semler, "A Comparison of Wavelet, Ridgelet and Curvelet Texture Classification Algorithms in Computed Tomography," Computer Biology & Medicine, vol. 37, pp. 486-498, (2007).
9. G. Castellaro, L. Bonilha, L. M. Li and F. Cendes, "Multiresolution Analysis using Wavelet, Ridgelet and Curvelet Transforms for Medical Image Segmentation," Intern. J. Biomed. Imaging, vol. 2011, Article ID 136034, (2011).
10. E. J. Candes and D. L. Donoho, "New Tight Frames of Curvelets and Optimal Representation of Objects with Piecewise Singularities," Commun. Pure Appl. Math., vol. 57, no. 2, pp. 219-266, (2004).
11. E. J. Candes, L. Demanet, D. L. Donoho and L. Ying, "Fast Discrete Curvelet Transform," Multiscale Modeling & Simulation, vol. 5, no. 3, pp. 861-899, (2006).
12. E. J. Candes and D. L. Donoho, "Continuous Curvelet Transform: II. Discretization and Frames," Appl. Comput. Harmon. Anal., vol. 19, pp. 198-222, (2005).
13. E. J. Candes, L. Demanet, D. L. Donoho and L. Ying, "CurveLab Toolbox, version 2.0," CIT, (2005).
14. J-L. Starck, E. Candes and D. L. Donoho, "The Curvelet Transform for Image Denoising," IEEE Trans. IP, vol. 11, pp. 131-141, (2002).
15. K. Nguyen, A. K. Jain and B. Sabata, "Prostate Cancer Detection: Fusion of Cytological and Textural Features," Jour. Pathology Informatics, vol. 2, (2011).
16. L. Guo, M. Dai and M. Zhu, "Multifocus Color Image Fusion based on Quaternion Curvelet Transform," Optics Express, vol. 20, pp. 18846-18860, (2012).
17. W.-C. Lin, C.-C. Li, C. S. Christudass, J. I. Epstein and R. W. Veltri, "Curvelet-based Classification of Prostate Cancer Histological Images of Critical Gleason Scores," in Biomedical Imaging (ISBI), 2015 IEEE 12th International Symposium on, pp. 1020-1023, (2015).
18. J. Ma and G. Plonka, "The Curvelet Transform," IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 118-133, (2010).
19. J-L. Starck, F. Murtagh and J. M. Fadili, Sparse Image and Signal Processing, Cambridge University Press, Chap. 5, (2010).
20. F. Murtagh and J-L. Starck, "Wavelet and Curvelet Moments for Image Classification: Application to Aggregate Mixture Grading," Pattern Recognition Letters, vol. 29, pp. 1557-1564, (2008).
21. C. Mosquera-Lopez, S. Agaian, A. Velez-Hoyos and I. Thompson, "Computer-aided Prostate Cancer Diagnosis from Digitized Histopathology: A Review on Texture-based Systems," IEEE Reviews in Biomedical Engineering, vol. 8, pp. 98-113, (2015).
22. D. F. Gleason and G. T. Mellinger, "The Veterans Administration Cooperative Urological Research Group: Prediction of Prognosis for Prostatic Adenocarcinoma by Combined Histological Grading and Clinical Staging," J. Urol., vol. 111, pp. 58-64, (1974).
23. D. J. Luthringer and M. Gross, "Gleason Grade Migration: Changes in Prostate Cancer Grade in the Contemporary Era," PCRI Insights, vol. 9, pp. 2-3, (August 2006).
24. J. I. Epstein, "An Update of the Gleason Grading System," J. Urology, vol. 183, pp. 433-440, (2010).
25. P. M. Pierorazio, P. C. Walsh, A. W. Partin and J. I. Epstein, "Prognostic Gleason Grade Grouping: Data Based on the Modified Gleason Scoring System," BJU International, (2013).
26. D. F. Gleason and G. T. Mellinger, "The Veterans Administration Cooperative Urological Research Group: Prediction of Prognosis for Prostatic Adenocarcinoma by Combined Histological Grading and Clinical Staging," J. Urol., vol. 111, pp. 58-64, (1974).
27. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed., (2007).
28. J. Daugman, University of Cambridge, Computer Laboratory. [Online] http://www.cl.cam.ac.uk/~jgd/sampleiris.jpg
29. B. Cavusoglu, "Multiscale Texture Retrieval based on Low-dimensional and Rotation-invariant Features of Curvelet Transform," EURASIP Jour. on Image and Video Processing, paper 2014:22, (2014).
30. D. Zhang, M. M. Islam, G. Lu and I. J. Sumana, "Rotation Invariant Curvelet Features for Region Based Image Retrieval," Intern. J. Computer Vision, vol. 98, pp. 187-201, (2012).
31. M. Zand, S. Doraisamy, A. A. Halin and M. R. Mustaffa, "Texture Classification and Discrimination for Region-based Image Retrieval," Journal of Visual Communication and Image Representation, vol. 26, pp. 305-316, (2015).
32. F. Gomez and E. Romero, "Rotation Invariant Texture Classification using a Curvelet based Descriptor," Pattern Recognition Letters, vol. 32, pp. 2178-2186, (2011).
33. S. Arivazhagan and T. G. S. Kumar, "Texture Classification using Curvelet Transform," Intern. J. Wavelets, Multiresolution & Inform. Processing, vol. 5, pp. 451-464, (2007).
34. S. Arivazhagan, L. Ganesan and T. G. S. Kumar, "Texture Classification using Curvelet Statistical and Co-occurrence Features," Proc. IEEE ICPR'06, pp. 938-941, (2006).
35. A. Alecu, A. Munteanu, A. Pizurica, W. P. Y. Cornelis and P. Schelkeus, "Information-Theoretic Analysis of Dependencies between Curvelet Coefficients," Proc. IEEE ICOP, pp. 1617-1620, (2006).
36. J. Daugman, "How Iris Recognition Works," IEEE Trans. on Circuits & Systems for Video Technology, vol. 14, pp. 121-130, (2004).
37. L. Shen and Q. Yin, "Texture Classification using Curvelet Transform," Proc. ISIP'09, China, pp. 319-324, (2009).
38. H. Chang and C. C. J. Kuo, "Texture Analysis and Classification with Tree-structured Wavelet Transform," IEEE Trans. Image Proc., vol. 2, pp. 429-444, (1993).
39. M. Unser and M. Eden, "Multiresolution Texture Extraction and Selection for Texture Segmentation," IEEE Trans. PAMI, vol. 11, pp. 717-728, (1989).
40. M. Unser, "Texture Classification and Segmentation using Wavelet Frames," IEEE Trans. IP, vol. 4, pp. 1549-1560, (1995).
41. A. Laine and J. Fan, "Texture Classification by Wavelet Packet Signatures," IEEE Trans. PAMI, vol. 15, pp. 1186-1191, (1993).
42. A. Laine and J. Fan, "Frame Representation for Texture Segmentation," IEEE Trans. IP, vol. 5, pp. 771-780, (1996).
43. B. Nielsen, F. Albregtsen and H. E. Danielsen, "Statistical Nuclear Texture Analysis in Cancer Research: A Review of Methods and Applications," Critical Reviews in Oncogenesis, vol. 14, pp. 89-164, (2008).
44. L. Demanet and L. Ying, "Wave Atoms and Sparsity of Oscillatory Patterns," Appl. Comput. Harmon. Anal., vol. 23, pp. 368-387, (2007).
45. W.-C. Lin, C.-C. Li, J. I. Epstein and R. W. Veltri, "Curvelet-based Texture Classification of Critical Gleason Patterns of Prostate Histological Images," in Computational Advances in Bio and Medical Sciences (ICCABS), 2016 IEEE 6th International Conference on, pp. 1-6, (2016).
46. W.-C. Lin, C.-C. Li, J. I. Epstein and R. W. Veltri, "Advance on Curvelet Application to Prostate Cancer Tissue Image Classification," in 2017 IEEE 7th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp. 1-6, (2017).
47. C. Mosquera-Lopez, S. Agaian and A. Velez-Hoyos, "The Development of a Multi-stage Learning Scheme using New Descriptors for Automatic Grading of Prostatic Carcinoma," Proc. IEEE ICASSP, pp. 3586-3590, (2014).
48. D. Fehr, H. Veeraraghavan, A. Wibmer, T. Gondo, K. Matsumoto, H. A. Vargas, E. Sala, H. Hricak and J. O. Deasy, "Automatic Classification of Prostate Cancer Gleason Scores from Multiparametric Magnetic Resonance Images," Proceedings of the National Academy of Sciences, vol. 112, no. 46, pp. 6265-6273, (2015).
CHAPTER 1.6 AN OVERVIEW OF EFFICIENT DEEP LEARNING ON EMBEDDED SYSTEMS

Xianju Wang
Bedford, MA, USA
[email protected]

Deep neural networks (DNNs) have exploded in the past few years, particularly in the area of visual recognition and natural language processing. At this point, they have exceeded human levels of accuracy and have set new benchmarks in several tasks. However, the complexity of the computations requires specific thought to the network design, especially when the applications need to run on high-latency, energy-efficient embedded devices. In this chapter, we provide a high-level overview of DNNs along with specific architectural constructs of DNNs like convolutional neural networks (CNNs), which are better suited for image recognition tasks. We detail the design choices that deep-learning practitioners can use to get DNNs running efficiently on embedded systems. We introduce chips most commonly used for this purpose, namely microprocessor, digital signal processor (DSP), embedded graphics processing unit (GPU), field-programmable gate array (FPGA) and application specific integrated circuit (ASIC), and the specific considerations to keep in mind for their usage. Additionally, we detail some computational methods to gain more efficiency, such as quantization, pruning, network structure optimization (AutoML), Winograd and Fast Fourier transform (FFT), that can further optimize ML networks after making the choice of network and hardware.
1. Introduction

Deep learning has evolved into the state-of-the-art technique for artificial intelligence (AI) tasks since early 2010. Since the breakthrough application of deep learning for image recognition and natural language processing (NLP), the number of applications that use deep learning has increased significantly. In many applications, deep neural networks (DNNs) are now able to surpass human levels of accuracy. However, the superior accuracy of DNNs comes at the cost of high computational complexity. DNNs are both computationally intensive and memory intensive, making them difficult to deploy and run on embedded devices with limited hardware resources [1,2,3]. The DNN workflow includes two tasks, training and inference, which have different computational needs. Training is the stage in
which your network tries to learn from the data, while inference is the phase in which a trained model is used to make predictions on real samples. Network training often requires a large dataset and significant computational resources. In many cases, training a DNN model still takes several hours to days to complete and thus is typically executed in the cloud. For inference, it is desirable to have the processing near the sensor and on the embedded system to reduce latency and improve privacy and security. In many applications, inference requires high speed and low power consumption. Thus, implementing deep learning on embedded systems becomes more critical and difficult. In this chapter, we focus on reviewing different methods to implement deep learning inference on embedded systems.

2. Overview of Deep Neural Networks

At the highest level, one can think of a DNN as a series of smooth geometric transformations from an input space to a desired output space. The "series" are the layers that are stacked one after the other, with the output of one layer fed as input to the next. The input space could contain mathematical representations of images, language or any other feature set, while the output is the desired "answer" that during the training phase is fed to the network and during inference is predicted. The geometric transformations can take several forms and the choice often depends on the nature of the problem that needs to be solved. In the world of image and video processing, the most common form is the convolutional neural network (CNN). CNNs in particular have contributed to the rapid growth in computer vision due to the nature of their connections and the computational efficiency they offer compared to other types of networks.

2.1. Convolutional Neural Networks

As previously mentioned, CNNs are particularly suited to analyzing images. A convolution is a "learned filter" that is used to extract specific features from images, for example, edges in the earlier layers and complex shapes in the deeper ones. The computer sees an image as a matrix of pixels arranged as width * height * depth. The resolution of the image determines the width and the height, while depth is usually expressed in 3 color channels: R, G and B. A convolution filter operation performs a dot product between the filter and the corresponding pixel values across the width and height of the input image. This filter is then moved in a sliding-window fashion to generate an output feature map, which is then transformed using an activation function and fed into the deeper layers.
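A naive single-channel version of the sliding-window dot product just described can be written in a few lines. This is only an illustration (stride 1, no padding, one filter); real frameworks fuse many filters and channels and use far faster algorithms.

import numpy as np

def conv2d_single_channel(image, kernel):
    """Naive valid convolution: slide the filter and take dot products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out  # feature map, usually passed through an activation such as ReLU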
In order to reduce the number of parameters, a form of sub-sampling called "pooling" is often applied that smoothes out neighboring pixels while reducing the dimensions as the layers get deeper. CNNs are incredibly efficient and, unlike other feed-forward networks, are spatially "aware" and can handle rotation and translation invariance remarkably well. There are several well-known architectural examples of CNNs, chief among them AlexNet, VGGNet and ResNet, which differ in the accuracy vs. speed trade-off. Some of these CNNs are compared later.

2.2. Computational Costs of Networks

The accuracy of CNN models has been increasing since their breakthrough in 2012. However, the accuracy comes at the price of a high computational cost. Popular CNN models along with their computational costs are shown in Table 1.

Table 1. Computational costs for popular CNN models [4]

Model             AlexNet   GoogleNet   VGG16    Resnet50
Conv layers          5          57        13        53
Conv MACs          666M       1.58G     15.3G     3.86G
Conv parameters   2.33M       5.97M     14.7M     23.5M
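The MAC counts in Table 1 can be approximated directly from layer shapes. The helper below is an illustrative sketch under the usual assumption of dense convolutions (square kernels, no grouping); the function name and the example layer are not from the chapter.

def conv_macs(out_h, out_w, in_channels, out_channels, kernel_size):
    """Multiply-accumulate operations of one conv layer: every output element
    needs in_channels * k * k multiplications."""
    return out_h * out_w * out_channels * in_channels * kernel_size * kernel_size

# Example: a hypothetical 3x3 convolution from 64 to 128 channels on a
# 56x56 feature map costs roughly 231M MACs.
print(conv_macs(56, 56, 64, 128, 3))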
3. Hardware for DNN Processing

Typical embedded systems are microprocessors, embedded graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and application specific integrated circuits (ASICs). There are three major metrics to measure the efficiency of a DNN's hardware: the processing throughput, the power consumption and the cost of the processor. The processing throughput is the most important metric to compare performance and is usually measured in the number of FLoating Point Operations per second, or FLOP/s [2].

3.1. Microprocessors

For many years microprocessors have been applied as the only efficient way to implement embedded systems. Advanced RISC machine (ARM) processors use reduced instruction sets and require fewer transistors than those with a complex instruction set computing (CISC) architecture (such as the x86 processors used in most personal computers), which reduces the size and complexity, while
lowering the power consumption. ARM processors have been extensively used in consumer electronic devices such as smart phones, tablets, multimedia players and other mobile devices. Microprocessors are extremely flexible in terms of programmability, and all workloads can be run reasonably well on them. While ARM is quite powerful, it is not a good choice for massive data-parallel computations, and is only used for low-speed or low-cost applications. Recently, Arm Holdings developed Arm NN, an open-source neural network machine learning (ML) software, and NXP Semiconductors released the eIQ™ machine learning software development environment. Both of these include inference engines, neural network compilers and optimized libraries, which help users develop and deploy machine learning and deep learning systems with ease.

3.2. DSPs

DSPs are well known for their high computation performance, low power consumption and relatively small size. DSPs have highly parallel architectures with multiple functional units, VLIW/SIMD features and pipeline capability, which allow complex arithmetic operations to be performed efficiently. Compared to microprocessors, one of the primary advantages of a DSP is its capability of handling multiple instructions at the same time [5] without significantly increasing the size of the hardware logic. DSPs are suitable for accelerating computationally intensive tasks on embedded devices and have been used in many real-time signal and image processing systems.

3.3. Embedded GPUs

GPUs are currently the most widely used hardware option for machine and deep learning. GPUs are designed for high data parallelism and memory bandwidth (i.e., they can transport more "stuff" from memory to the compute cores). A typical NVIDIA GPU has thousands of cores, allowing fast execution of the same operation across multiple cores. GPUs are widely used in network training. Although extremely capable, GPUs have had trouble gaining traction in the embedded space given the power, size and cost constraints often found in embedded applications.
3.4. FPGAs

FPGAs were developed for digital embedded systems, based on the idea of using arrays of reconfigurable complex logic blocks (LBs) with a network of programmable interconnects surrounded by a perimeter of I/O blocks (IOBs). FPGAs allow the design of custom circuits that implement hardware-specific, time-consuming computation. The benefit of an FPGA is the great flexibility in logic, providing extreme parallelism in data flow and in processing vision applications, especially at the low and intermediate levels where they are able to exploit the parallelism inherent in images. For example, 640 parallel accumulation buffers and ALUs can be created, summing up an entire 640x480 image in just 480 clock cycles [6, 7]. In many cases, FPGAs have the potential to exceed the performance of a single DSP or multiple DSPs. However, a big disadvantage is their power consumption efficiency. FPGAs have more recently become a target appliance for machine learning researchers, and big companies like Microsoft and Baidu have invested heavily in FPGAs. It is apparent that FPGAs offer much higher performance/watt than GPUs, because even though they cannot compete on pure performance, they use much less power. Generally, an FPGA is about an order of magnitude less efficient than an ASIC. However, modern FPGAs contain hardware resources, such as DSPs for arithmetic operations and on-chip memories located next to the DSPs, which increase the flexibility and reduce the efficiency gap between FPGA and ASIC [2].

3.5. ASICs

An application-specific integrated circuit is the least flexible, but highest performing, hardware option. ASICs are also the most efficient in terms of performance/dollar and performance/watt, but require huge investment and NRE (non-recurring engineering) costs, which make them cost-effective only in large quantities. ASICs can be designed for either training or inference, as the functionality of the ASIC is designed and hard-coded (it cannot be changed). While GPUs and FPGAs perform far better than CPUs for AI-related tasks, a factor of up to 10 in efficiency may be gained with a more specific ASIC design. Google is the best example of successful machine learning ASIC deployments: its inference-targeted edge TPU (Coral), released in 2018, can achieve 4.0 TOPs/second peak performance. Intel Movidius Neural Compute is another good example, which boasts high performance with easy use and deployment solutions.
4. The Methods for Efficient DNN Inference

As discussed in Section 3, the superior accuracy of DNNs comes at the cost of high computational complexity. DNNs are both computationally intensive and memory intensive, making them difficult to deploy and run on embedded devices with limited hardware resources. In the past few years, several methods have been proposed to implement efficient inference.

4.1. Reduce Number of Operations and Model Size

4.1.1. Quantization

Network quantization compresses the network by reducing the number of bits used to represent each weight. During training, the default size for programmable platforms such as CPUs and GPUs is often 32 or 64 bits with floating-point representation. During inference, the predominant numerical format is 32-bit floating point, or FP32. However, the desire for savings in energy and increases in throughput of deep learning models has led to the use of lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (INT8) without incurring significant loss in accuracy. The use of even lower bit-widths has shown great progress in the past few years. Such approaches usually employ one or more of the methods below to improve model accuracy [8]:
- training, re-training or iterative quantization;
- changing the activation function;
- modifying the network structure to compensate for the loss of information;
- conservative quantization of the first and last layers;
- mixed weights and activations precision.
The simplest form of quantization involves directly quantizing a model without re-training; this method is commonly referred to as post-training quantization. In order to minimize the loss of accuracy from quantization, the model can be trained in a way that considers the quantization. This means training with the quantization of weights and activations "baked" into the training procedure. The latency and accuracy results of different quantization methods can be found in Table 2 below [9].
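A minimal sketch of symmetric post-training INT8 quantization of a weight tensor follows; real frameworks add calibration data, per-channel scales and fused activation handling, so this is only an illustration of the basic idea.

import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization: float32 weights -> int8 + scale."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Used at inference time or to measure the accuracy loss of quantization.
    return q.astype(np.float32) * scale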
Table 2. Benefits of model quantization for several CNN models

                                                Mobilenet-v1-1-224   Mobilenet-v2-1-224   Inception-v3   Resnet-v2-101
Top-1 Accuracy (Original)                             0.709                0.719              0.78            0.77
Top-1 Accuracy (Post-Training Quantized)              0.657                0.637              0.772           0.768
Top-1 Accuracy (Quantization-Aware Training)          0.7                  0.709              0.775           N/A
Latency (ms) (Original)                               124                  89                 1130            3973
Latency (ms) (Post-Training Quantized)                112                  98                 845             2868
Latency (ms) (Quantization-Aware Training)            65                   54                 543             N/A
Size (MB) (Original)                                  16.9                 14                 95.7            178.3
Size (MB) (Optimized)                                 4.3                  3.6                23.9            44.9
4.1.2. Network Pruning

Network pruning has been widely used to compress CNN and recurrent neural network (RNN) models. Neural network pruning is an old concept, dating back to 1990 [10]. The main idea is that, among the many parameters in the network, most are redundant and do not contribute much to the output. Pruning has been proven to be an effective way to reduce network complexity and over-fitting [11,12,13].

Fig. 1. Pruning a deep neural network [3].

As shown in Fig. 1, before pruning, every neuron in each layer has a
connection to the following layer and there are a lot of multiplications to execute. After pruning, the network becomes sparse and connects each neuron to only a few others, which saves a lot of multiplication computations. As shown in Fig. 2, pruning usually includes a three-step process: training connectivity, pruning connections and retraining the remaining weights. It starts by learning the connectivity via normal network training. Next, it prunes the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, it retrains the network to learn the final weights for the remaining sparse connections. This is the most straightforward method of pruning and is called one-shot pruning. Song Han et al. show that this is surprisingly effective and can usually reduce the connections by 2x without losing accuracy. They also noticed that, after pruning followed by retraining, they can achieve much better results with higher sparsity at no accuracy loss. They call this iterative pruning. We can think of iterative pruning as repeatedly learning which weights are important, removing the least important weights, and then retraining the model to let it "recover" from the pruning by adjusting the remaining weights. The number of parameters was reduced by 9x and 12x for the AlexNet and VGG-16 models, respectively [2,10].

Fig. 2. Pruning pipeline: train connectivity → prune connections → retrain weights.
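One-shot magnitude pruning, as described above, can be sketched on a plain weight matrix; the threshold choice and the retraining step are left out, and the function below is illustrative rather than any framework's API.

import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero the smallest-magnitude weights so that `sparsity` of them are removed.
    Retraining the remaining weights (or iterating prune/retrain) follows."""
    threshold = np.quantile(np.abs(weights).ravel(), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask   # the mask is reused to keep pruned weights at zero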
Pruning makes network weights sparse. While it reduces model size and computation, it also reduces the regularity of the computations. This makes it more difficult to parallelize in most embedded systems. In order to avoid the need for custom hardware like FPGA, structured pruning is developed and involves pruning groups of weights, such as kernels, filters, and even entire feature-maps. The resulting weights can better align with the data-parallel
architecture (e.g., SIMD) found in existing embedded hardware, like microprocessors, GPUs and DSPs, which results in more efficient processing [15].

4.1.3. Compact Network Architectures

The network computations can also be reduced by improving the network architecture. More recently, when designing DNN architectures, filters with a smaller width and height are used more frequently because concatenating several of them can emulate a larger filter. For example, one 5x5 convolution can be replaced by two 3x3 convolutions. Alternatively, a 3-D convolution can be replaced by a set of 2-D convolutions followed by 1x1 3-D convolutions; this is also called depth-wise separable convolution [2]. Another way to reduce the computations is low-rank approximation, which maximizes the number of separable filters in CNN models. For example, a 2D separable filter (m×n) has rank 1 and can be expressed as two successive 1D filters. A separable 2D convolution requires m+n multiplications while a standard 2D convolution requires m×n multiplications. But only a small proportion of the filters in CNN models are separable. To increase the proportion, one method is to force the convolution kernels to be separable by penalizing high-rank filters when training the network. The other approach is to use a small set of low-rank filters, which can be implemented as successive separable filters, to approximate the standard convolution [2,4].

4.2. Optimize Network Structure

Several of the optimizations described above can be automated by recasting them as machine learning problems. You can imagine an ML network that learns the right optimizations (e.g., pruning, network size, quantization) given a desired accuracy and speed. This is called automated machine learning (AutoML) and has become a hot topic in the last few years. AutoML provides methods and processes to make ML accessible to non-ML experts, to improve the efficiency of ML and to accelerate research and applications of ML. Neural architecture search is one of the most important areas in AutoML; it tries to address the problem of finding a well-performing (for example, high accuracy with less computation) DNN by optimizing the number of layers, the number of neurons, the number and size of filters, the number of channels, the type of activations and many more design decisions [17].
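The separable-filter cost argument above (m + n multiplications per output instead of m × n) can be illustrated by applying a rank-1 2-D filter as two 1-D passes. The sketch below assumes SciPy for the 1-D convolutions and a Gaussian-like example kernel; it is an illustration, not a framework recipe.

import numpy as np
from scipy.ndimage import convolve1d

def separable_filter(image, col_kernel, row_kernel):
    """Apply the rank-1 filter outer(col_kernel, row_kernel) as two 1-D passes."""
    tmp = convolve1d(image, row_kernel, axis=1, mode="nearest")  # along rows
    return convolve1d(tmp, col_kernel, axis=0, mode="nearest")   # along columns

# A 5x5 blur applied this way costs 5 + 5 multiplications per pixel instead of 25.
g = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0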
4.3. Winograd Transform and Fast Fourier Transform

The Winograd minimal filter algorithm was introduced in 1980 by Shmuel Winograd [16]; it is a computational transform that can be applied to convolutions when the stride is 1. Winograd convolutions are particularly efficient when processing small kernel sizes, compared to large kernel sizes.

… f(x_1) f(x_2) ⋯ f(x_n)]^T ∈ R^{n×1}, a vector of output functions.
2.3. Design of Feedback Template with Linear Neighboring
A one-dimensional space-invariant template is used as the feedback template [6]. For example, a 4×4 cell array connected with linear neighboring, with n = 16 cells and neighboring radius r = 1, is shown in Fig. 5.
[4×4 array of cells numbered 1 to 16]
Fig. 5. The connection relation of cells for designing associative memories with r = 1 and n = 16.
If there are n cells, then the feedback coefficients of the cells can be expressed with the following matrix A.
$$A=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
(5)
a11 represents the self-feedback coefficient of the 1st cell, a12 represents the feedback coefficient of the next cell (the 2nd cell) of the 1st cell in clockwise order, and a21 represents the feedback coefficient of the next cell (the 1st cell) of the 2nd cell in counterclockwise order. Since the feedback template is a one-dimensional space-invariant template, a11 = a22 = … = ann = α1 represent the self-feedback coefficient of each cell, a12 = a23 = … = a(n−1)n = an1 = α2 represent the feedback coefficient of the next cell in clockwise order, and a1n = a21 = a32 = … = an(n−1) = αn
represent the feedback coefficient of the next cell in counterclockwise order, a13 = a24 = … = a(n−2)n = a(n−1)1 = an2 = α3 represent the feedback coefficient of the next two cells in clockwise order, and so on. So matrix A can be expressed as
$$A=\begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_n\\ \alpha_n & \alpha_1 & \cdots & \alpha_{n-1}\\ \vdots & & \ddots & \vdots\\ \alpha_2 & \alpha_3 & \cdots & \alpha_1 \end{bmatrix}$$

Therefore A is a circulant matrix. The following one-dimensional space-invariant template is considered:

[ a(−r) … a(−1) a(0) a(1) … a(r) ]

where r is the neighborhood radius, a(0) is the self-feedback, a(1) is the feedback of the next cell in the clockwise order, a(−1) is the feedback of the next cell in the counterclockwise order, and so on. According to Eq. (5), we rearrange the template elements as the following row vector.
[ a(0) a(1) … a(r) 0 … 0 a(−r) … a(−1) ]
(6)
Eq. (6) is the first row of matrix A. To form the second row, the last element of the first row is moved to the first position, and the remaining elements are cyclically right-shifted by one position, forming the second to last elements of the second row. Similarly, each subsequent row is the previous row cyclically right-shifted once; this fully defines matrix A.
$$A=\begin{bmatrix} a(0) & a(1) & \cdots & a(r) & 0 & \cdots & 0 & a(-r) & \cdots & a(-1)\\ a(-1) & a(0) & \cdots & a(r) & 0 & \cdots & 0 & a(-r) & \cdots & a(-2)\\ \vdots & & & & & & & & & \vdots\\ a(1) & a(2) & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & a(-1) & a(0) \end{bmatrix}$$
(7)
When matrix A is designed as a circulant matrix, only its first row needs to be designed. Each subsequent row is the previous row cyclically right-shifted
once. The number of 0s in Eq. (6) is determined by the radius r and n. A is n×n, i.e., each row of matrix A has n elements and there are n rows in A. If n = 9, then there are nine elements in each row. When r = 1, the one-dimensional template is arranged according to Eq. (6), as the following Eq. (8) shows:

[ a(0) a(1) 0 0 0 0 0 0 a(−1) ]
(8)
In Eq. (8) there are six 0s in the middle of the template; together with the other three elements, there are nine elements in the row.
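A minimal NumPy sketch of this circulant construction is given below; the function name and the example values are illustrative, not from the chapter.

```python
import numpy as np

def feedback_matrix(template, n):
    """Build the circulant feedback matrix A of Eq. (7) from the 1-D
    space-invariant template [a(-r), ..., a(-1), a(0), a(1), ..., a(r)]."""
    r = len(template) // 2
    first_row = np.zeros(n)                # Eq. (6): [a(0) a(1)..a(r) 0..0 a(-r)..a(-1)]
    first_row[0] = template[r]             # a(0)
    for t in range(1, r + 1):
        first_row[t] = template[r + t]     # a(t): clockwise neighbors
        first_row[n - t] = template[r - t] # a(-t): counterclockwise neighbors
    return np.stack([np.roll(first_row, i) for i in range(n)])  # cyclic right shifts

# Example: n = 9, r = 1, template [a(-1), a(0), a(1)] = [0.01, 0.01, 0.01]
A = feedback_matrix(np.array([0.01, 0.01, 0.01]), 9)
```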
2.4. Stability

If a dynamic system has a unique equilibrium point which attracts every trajectory in state space, then it is called globally asymptotically stable. A criterion for the global asymptotic stability of the equilibrium point of DT-CNNs with circulant matrices has been introduced [13]. The criterion is described in the following. DT-CNNs described by (3) and (4), with matrix A given by (7), are globally asymptotically stable if and only if

|F(2πq/n)| < 1,  q = 0, 1, 2, …, n−1
(9)
where F is the discrete Fourier transform of a(t):

$$F(2\pi q/n)=\sum_{t=-r}^{r} a(t)\,e^{-j 2\pi t q/n} \qquad (10)$$
The stability criterion (9) can be easily satisfied by choosing small values for the elements of the one-dimensional space-invariant template. In particular, the larger the network dimension n is, the smaller the values of the elements must be by (10). On the other hand, the feedback values cannot be zero, since the stability properties considered here require that (3) be a dynamical system. These observations help the designer set the values of the feedback parameters: the lower bound is zero, whereas the upper bound is related to the network dimension.
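The criterion can be checked numerically with a short sketch such as the following, assuming the absolute-value form of (9) as reconstructed above; the function name is illustrative.

```python
import numpy as np

def is_globally_stable(template, n):
    """Check criterion (9): |F(2*pi*q/n)| < 1 for q = 0, ..., n-1, where F is
    the DFT of the 1-D feedback template a(t), t = -r, ..., r, as in Eq. (10)."""
    r = len(template) // 2
    t = np.arange(-r, r + 1)
    F = np.array([np.sum(template * np.exp(-1j * 2 * np.pi * t * q / n))
                  for q in range(n)])
    return bool(np.all(np.abs(F) < 1.0))

# A small template easily satisfies the criterion, e.g. for n = 16, r = 1:
print(is_globally_stable(np.array([0.01, 0.01, 0.01]), 16))   # True
```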
2.5. Design of DT-CNN for Associative Memories

The motion equation of a CNN is designed to behave as an associative memory. Given m bipolar (+1 or −1) training patterns as input vectors u^i, i = 1, 2, …, m, for each u^i there is only one equilibrium point x^i satisfying motion equation (3):

$$\begin{cases} x^1 = Ay^1 + Bu^1 + e\\ x^2 = Ay^2 + Bu^2 + e\\ \quad\vdots\\ x^m = Ay^m + Bu^m + e \end{cases}$$
(11)
We design the CNN to behave as an associative memory mainly by setting up A and calculating B and e from the training patterns. To express (11) in matrix form, we first define the following matrices:
$$X=[\,x^1\ x^2\ \cdots\ x^m\,]=\begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^m\\ x_2^1 & x_2^2 & \cdots & x_2^m\\ \vdots & & \ddots & \vdots\\ x_n^1 & x_n^2 & \cdots & x_n^m \end{bmatrix}\in\mathbb{R}^{n\times m},\qquad Y=[\,y^1\ y^2\ \cdots\ y^m\,]=\begin{bmatrix} y_1^1 & y_1^2 & \cdots & y_1^m\\ y_2^1 & y_2^2 & \cdots & y_2^m\\ \vdots & & \ddots & \vdots\\ y_n^1 & y_n^2 & \cdots & y_n^m \end{bmatrix}\in\mathbb{R}^{n\times m}$$

$$A_y=AY=[\,Ay^1\ Ay^2\ \cdots\ Ay^m\,]=[\,d^1\ d^2\ \cdots\ d^m\,]\in\mathbb{R}^{n\times m},\qquad d^i=[\,d_1^i\ d_2^i\ \cdots\ d_n^i\,]^T\in\mathbb{R}^{n\times 1},\ i=1,\ldots,m$$

$$U=[\,u^1\ u^2\ \cdots\ u^m\,]=\begin{bmatrix} u_1^1 & u_1^2 & \cdots & u_1^m\\ u_2^1 & u_2^2 & \cdots & u_2^m\\ \vdots & & \ddots & \vdots\\ u_n^1 & u_n^2 & \cdots & u_n^m \end{bmatrix}\in\mathbb{R}^{n\times m},\qquad J=[\,e\ e\ \cdots\ e\,]=\begin{bmatrix} I_1 & I_1 & \cdots & I_1\\ I_2 & I_2 & \cdots & I_2\\ \vdots & & \ddots & \vdots\\ I_n & I_n & \cdots & I_n \end{bmatrix}\in\mathbb{R}^{n\times m}$$
Then (11) can be expressed in matrix form:

X = AY + BU + J,  i.e.,  BU + J = X − AY   (12)

BU + J = X − A_y   (13)
U contains the input training patterns and is already known. Because Y is the desired output, initially Y = U is also known. Under the global asymptotic stability condition, we choose a sequence {a(−r), …, a(−1), a(0), a(1), …, a(r)} which satisfies criterion (9) and design A as a circulant matrix, so A is known too. From the output function we know that if y is +1 then x > 1, and if y is −1 then x < −1. U is a bipolar matrix, so Y is a bipolar matrix too, i.e., all elements in Y are +1 or −1; hence the elements of the state matrix X corresponding to Y are all greater than +1 or less than −1, and we can set X = αY = αU with α > 1. With U, Y, A, and X known, we next calculate B and J. We define the following matrices:
$$R=[\,U^T\ \ h\,]=\begin{bmatrix} u_1^1 & u_2^1 & \cdots & u_n^1 & 1\\ u_1^2 & u_2^2 & \cdots & u_n^2 & 1\\ \vdots & & \ddots & \vdots & \vdots\\ u_1^m & u_2^m & \cdots & u_n^m & 1 \end{bmatrix}\in\mathbb{R}^{m\times(n+1)},\qquad h=[\,1\ 1\ \cdots\ 1\,]^T\in\mathbb{R}^{m\times 1}$$

X_j = [x_j^1 x_j^2 ⋯ x_j^m] ∈ R^{1×m} is the jth row of matrix X, and A_{y,j} = [d_j^1 d_j^2 ⋯ d_j^m] ∈ R^{1×m} is the jth row of matrix A_y.

$$[\,B\ \ e\,]=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} & I_1\\ b_{21} & b_{22} & \cdots & b_{2n} & I_2\\ \vdots & & \ddots & \vdots & \vdots\\ b_{n1} & b_{n2} & \cdots & b_{nn} & I_n \end{bmatrix}=\begin{bmatrix} w_1\\ w_2\\ \vdots\\ w_n \end{bmatrix},\qquad w_j=[\,b_{j1}\ b_{j2}\ \cdots\ b_{jn}\ I_j\,]\in\mathbb{R}^{1\times(n+1)},\ j=1,2,\ldots,n$$

From (13), BU + J = X − A_y, i.e.,
$$\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n}\\ b_{21} & b_{22} & \cdots & b_{2n}\\ \vdots & & \ddots & \vdots\\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{bmatrix}\begin{bmatrix} u_1^1 & u_1^2 & \cdots & u_1^m\\ u_2^1 & u_2^2 & \cdots & u_2^m\\ \vdots & & \ddots & \vdots\\ u_n^1 & u_n^2 & \cdots & u_n^m \end{bmatrix}+\begin{bmatrix} I_1 & I_1 & \cdots & I_1\\ I_2 & I_2 & \cdots & I_2\\ \vdots & & \ddots & \vdots\\ I_n & I_n & \cdots & I_n \end{bmatrix}=\begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^m\\ x_2^1 & x_2^2 & \cdots & x_2^m\\ \vdots & & \ddots & \vdots\\ x_n^1 & x_n^2 & \cdots & x_n^m \end{bmatrix}-\begin{bmatrix} d_1^1 & d_1^2 & \cdots & d_1^m\\ d_2^1 & d_2^2 & \cdots & d_2^m\\ \vdots & & \ddots & \vdots\\ d_n^1 & d_n^2 & \cdots & d_n^m \end{bmatrix}$$
In view of the jth row,

$$[\,b_{j1}\ b_{j2}\ \cdots\ b_{jn}\,]\begin{bmatrix} u_1^1 & u_1^2 & \cdots & u_1^m\\ u_2^1 & u_2^2 & \cdots & u_2^m\\ \vdots & & \ddots & \vdots\\ u_n^1 & u_n^2 & \cdots & u_n^m \end{bmatrix}+[\,I_j\ I_j\ \cdots\ I_j\,]=[\,x_j^1\ x_j^2\ \cdots\ x_j^m\,]-[\,d_j^1\ d_j^2\ \cdots\ d_j^m\,]$$

or, taking the transpose, R w_j^T = X_j^T − A_{y,j}^T:

$$\begin{bmatrix} u_1^1 & u_2^1 & \cdots & u_n^1 & 1\\ u_1^2 & u_2^2 & \cdots & u_n^2 & 1\\ \vdots & & \ddots & \vdots & \vdots\\ u_1^m & u_2^m & \cdots & u_n^m & 1 \end{bmatrix}\begin{bmatrix} b_{j1}\\ b_{j2}\\ \vdots\\ b_{jn}\\ I_j \end{bmatrix}=\begin{bmatrix} x_j^1\\ x_j^2\\ \vdots\\ x_j^m \end{bmatrix}-\begin{bmatrix} d_j^1\\ d_j^2\\ \vdots\\ d_j^m \end{bmatrix},\qquad j=1,2,\ldots,n \qquad (14)$$
Eq. (14) is the transpose of the jth row of (13), so we can rewrite (13) as (14). Because each cell is only influenced by its neighboring cells, matrix B is a sparse matrix, and the elements of w_j are mostly 0. We remove the 0 elements of w_j to get w̃_j, and we remove the corresponding columns of R to get R̃_j; these satisfy R̃_j w̃_j^T = R w_j^T. Then (14) becomes (15), and w̃_j^T is obtained from (16):

R̃_j w̃_j^T = X_j^T − A_{y,j}^T,  j = 1, 2, …, n   (15)

w̃_j^T = R̃_j^+ (X_j^T − A_{y,j}^T),  j = 1, 2, …, n   (16)

R̃_j is obtained from R according to the connection relation between the input of the jth cell and the inputs of the other cells. We express the connection relation of the cells' inputs by a matrix S, so R̃_j can be obtained by taking out part of the column vectors of R according to the
jth row of S. Here R̃_j ∈ R^{m×h_j}, w̃_j ∈ R^{1×h_j}, R̃_j^+ is the pseudoinverse of R̃_j, and h_j = (Σ_{i=1}^{n} s_{ji}) + 1. Matrix S ∈ R^{n×n} represents the connection relation of the cells' inputs:

s_ij = 1 if the ith cell's input and the jth cell's input have a connection relation, and s_ij = 0 if they have no connection relation.
For example, for the 4 × 4 cell array with radius r = 1 in Fig. 5, S is the following:
$$S=\begin{bmatrix}
1&1&0&0&1&1&0&0&0&0&0&0&0&0&0&0\\
1&1&1&0&1&1&1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&1&1&1&0&0&0&0&0&0&0&0\\
0&0&1&1&0&0&1&1&0&0&0&0&0&0&0&0\\
1&1&0&0&1&1&0&0&1&1&0&0&0&0&0&0\\
1&1&1&0&1&1&1&0&1&1&1&0&0&0&0&0\\
0&1&1&1&0&1&1&1&0&1&1&1&0&0&0&0\\
0&0&1&1&0&0&1&1&0&0&1&1&0&0&0&0\\
0&0&0&0&1&1&0&0&1&1&0&0&1&1&0&0\\
0&0&0&0&1&1&1&0&1&1&1&0&1&1&1&0\\
0&0&0&0&0&1&1&1&0&1&1&1&0&1&1&1\\
0&0&0&0&0&0&1&1&0&0&1&1&0&0&1&1\\
0&0&0&0&0&0&0&0&1&1&0&0&1&1&0&0\\
0&0&0&0&0&0&0&0&1&1&1&0&1&1&1&0\\
0&0&0&0&0&0&0&0&0&1&1&1&0&1&1&1\\
0&0&0&0&0&0&0&0&0&0&1&1&0&0&1&1
\end{bmatrix}$$
In (14), we can also get w_j^T directly by the following equation:

w_j^T = R^+ (X_j^T − A_{y,j}^T)

However, w_j^T obtained this way may not be unique, and B may not be unique; then B may not accord with the interconnecting structure of the network inputs. So we must use the matrix S to represent the interconnecting structure of the network inputs and use the above derivation to calculate matrix B. We summarize the steps of using a CNN to design associative memories in the following.

Algorithm 1: Design a DT-CNN to behave as an associative memory (training part)
Input: m bipolar patterns u^i, i = 1, …, m
Output: w_j = [b_{j1} b_{j2} ⋯ b_{jn} I_j], j = 1, …, n, i.e., B and e
Method:
(1) Set up matrix U from the training patterns u^i: U = [u^1 u^2 ⋯ u^m].
(2) Establish Y = U.
(3) Set up S: s_ij = 1 if the ith cell's input and the jth cell's input have a connection relation, and s_ij = 0 otherwise.
(4) Design matrix A as a circulant matrix which satisfies the globally asymptotically stable condition.
(5) Set the value of α (α > 1), and calculate X = αY.
(6) Calculate A_y = AY.
(7) for (j = 1 to n) do:
    Calculate X_j from X.
    Calculate A_{y,j} from A_y.
    Calculate R = [U^T h].
    Establish matrix R̃_j from matrix S and matrix R.
    Calculate the pseudoinverse matrix R̃_j^+ of R̃_j.
    Calculate w̃_j^T = R̃_j^+ (X_j^T − A_{y,j}^T).
    Recover w_j from w̃_j^T.
End
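A compact NumPy sketch of Algorithm 1 is given below. It reuses the feedback_matrix helper sketched earlier; the function name and argument conventions are illustrative rather than the chapter's own code.

```python
import numpy as np

def train_dtcnn_am(U, S, template, alpha=3.0):
    """Design B and e of a DT-CNN associative memory (Algorithm 1).
    U: n x m bipolar training patterns (one pattern per column).
    S: n x n binary matrix of input connection relations.
    template: 1-D feedback template [a(-r), ..., a(0), ..., a(r)] satisfying (9)."""
    n, m = U.shape
    Y = U.copy()                                   # step (2): desired outputs
    A = feedback_matrix(template, n)               # step (4): circulant A
    X = alpha * Y                                  # step (5)
    Ay = A @ Y                                     # step (6)
    R = np.hstack([U.T, np.ones((m, 1))])          # R = [U^T h], m x (n+1)
    B = np.zeros((n, n))
    e = np.zeros(n)
    for j in range(n):                             # step (7)
        cols = np.flatnonzero(S[j])                # inputs connected to cell j
        R_j = R[:, np.append(cols, n)]             # connected columns + bias column
        w_j = np.linalg.pinv(R_j) @ (X[j] - Ay[j]) # w~_j^T = R~_j^+ (X_j^T - A_y,j^T)
        B[j, cols] = w_j[:-1]                      # recover w_j into row j of B
        e[j] = w_j[-1]                             # bias term I_j
    return A, B, e
```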
3. Pattern Recognition Using DT-CNN Associative Memory

After training, we can perform the recognition process. We have A, B, e, and an initial y(t). We input the testing pattern u into the equation of motion (3). After getting the state value x(t+1) at the next time step, we use the output function (4) to calculate the output y(t+1). We keep calculating the state value and the output until the output values no longer change; the final output is then the classification of the testing pattern. The following algorithm is the recognition process.

Algorithm 2: Use the DT-CNN associative memory to recognize a testing pattern
Input: A, B, e, and testing pattern u in the equation of motion
Output: Classification of the testing pattern u
Method:
(1) Set up the initial output vector y; its element values are all in the [−1, 1] interval.
(2) Input the testing pattern u together with A, B, e, and y into the equation of motion to get x(t+1):
    x(t+1) = A y(t) + B u + e
(3) Input x(t+1) into the activation function to get the new output y(t+1). The activation function is:
    y = 1 if x > 1;  y = x if −1 ≤ x ≤ 1;  y = −1 if x < −1
(4) Compare the new output y(t+1) with y(t). If they are the same, stop; otherwise feed the new output y(t+1) into the equation of motion again. Repeat Step (2) to Step (4) until the output y no longer changes.
End
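A minimal sketch of this recognition loop, assuming NumPy and the A, B, e produced by the training sketch above:

```python
import numpy as np

def recognize(A, B, e, u, y0=None, max_iters=100):
    """Run the DT-CNN associative memory on a test pattern u (Algorithm 2)."""
    n = A.shape[0]
    y = np.zeros(n) if y0 is None else y0.copy()   # step (1): initial output in [-1, 1]
    for _ in range(max_iters):
        x = A @ y + B @ u + e                      # step (2): equation of motion
        y_new = np.clip(x, -1.0, 1.0)              # step (3): saturation output function
        if np.allclose(y_new, y):                  # step (4): stop when output is unchanged
            break
        y = y_new
    return y
```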
4. Experiments

We have two kinds of experiments. The first is on simulated seismic pattern recognition; the analyzed seismic patterns are the bright spot pattern and the right and left pinch-out patterns that have the structure of gas and oil sand zones [17]. The second is on simulated seismic images; the analyzed seismic patterns are the bright spot pattern and the horizon pattern, and we use a moving window to detect the patterns.

4.1. Preprocessing on Seismic Data
We do the experiments on the seismogram, which is first preprocessed into an image. The preprocessing steps of the seismogram are shown in Fig. 6: enveloping, thresholding, peak detection, and compression in the time direction [14]. Fig. 7 shows a simulated seismogram consisting of 64 seismic traces; each trace contains many peaks (wavelets). We extract the peak data from the seismogram through preprocessing and then transform the peak data into bipolar image data. Fig. 8 shows the result of preprocessing Fig. 7; the pixel symbol "1" marks a peak point and "0" marks the background.
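A rough SciPy-based sketch of this preprocessing chain is shown below; the threshold and compression factor are illustrative parameters, not values from the chapter.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

def seismogram_to_bipolar_image(traces, threshold, compress=2):
    """Sketch of Fig. 6: envelope -> threshold -> peak picking -> time-direction
    compression, producing a 0/1 peak image from an (n_samples x n_traces) array."""
    n_samples, n_traces = traces.shape
    image = np.zeros((n_samples, n_traces), dtype=int)
    for j in range(n_traces):
        envelope = np.abs(hilbert(traces[:, j]))      # enveloping
        envelope[envelope < threshold] = 0.0          # thresholding
        peaks, _ = find_peaks(envelope)               # peak (wavelet) picking
        image[peaks, j] = 1
    # compress in the time direction by OR-ing blocks of `compress` samples
    trimmed = image[: (n_samples // compress) * compress]
    return trimmed.reshape(-1, compress, n_traces).max(axis=1)
```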
Fig. 6. Preprocessing steps of seismogram: input seismogram → enveloping → thresholding → peaks of seismogram → compression of data in the time direction.
Fig. 7. Simulated seismogram.
Fig. 8. Preprocessing of Fig. 7.
4.2. Experiment 1: Experiment on Simulated Seismic Patterns
4.2.1. Experiment on Simulated Seismic Patterns

In this experiment, we store three simulated seismic training patterns and recognize three noisy input testing patterns. The simulated peak data of the bright spot pattern and the right and left pinch-out patterns are shown in Fig. 9(a), (b), and (c); the size is 12x48. We use these three training patterns to train the CNN. The noisy testing bright spot, right pinch-out, and left pinch-out patterns are shown in Fig. 10(a), (b), and (c). We use the Hamming distance (HD), the number of differing symbols between the training pattern and the noisy pattern, to measure the noise ratio. Fig. 10(a) has a Hamming distance of 107, i.e., 19% noise. Fig. 10(b) has a Hamming distance of 118, i.e., 21% noise. Fig. 10(c) has a Hamming distance of 118, i.e., 21% noise. We apply the DT-CNN associative memory with connection matrix S to this experiment. S is the matrix that represents the connection relation of the cells' inputs: if the ith cell's input and the jth cell's input have a connection, then sij = 1; otherwise sij = 0. We set α = 3 and neighborhood radius r = 2, 3, and 4. For r = 2, the 1-D feedback template is [a(−2), a(−1), a(0), a(1), a(2)] = [0.01, 0.01, 0.01, 0.01, 0.01], and similarly for r = 3 and r = 4.
Fig. 9. Training seismic patterns: (a) bright spot, (b) right pinch-out, (c) left pinch-out.
Fig. 10. Noisy testing seismic patterns: (a) bright spot, (b) right pinch-out, (c) left pinch-out.
For r = 2, the recovered patterns are shown in Fig. 11(a), (b), and (c); they are not correct output patterns. Fig. 11(d), (e), and (f) show the energy vs. iteration. So we set the neighborhood radius to r = 3 and test again. For r = 3, the recovered patterns are shown in Fig. 12(a), (b), and (c); Fig. 12(b) is not a correct output pattern. Fig. 12(d), (e), and (f) show the energy vs. iteration. So we set the neighborhood radius to r = 4 and test again. For r = 4, the recovered pattern, shown in Fig. 13(a), is the correct output pattern; Fig. 13(b) shows the energy vs. iteration.
Fig. 11. For r = 2, (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c), (d) energy curve of Fig. 10(a), (e) energy curve of Fig. 10(b), (f) energy curve of Fig. 10(c).
Fig. 12. For r = 3, (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c), (d) energy curve of Fig. 10(a), (e) energy curve of Fig. 10(b), (f) energy curve of Fig. 10(c).
Fig. 13. For r=4, (a) output of Fig. 10(b), (b) energy curve of Fig. 10(b).
Next, we apply the DT-CNN associative memory without matrix S to Fig. 10(a), (b), and (c). We set α = 3 and neighborhood radius r = 1. The output recovered patterns are the same as Fig. 9(a), (b), and (c), respectively.

4.2.2. Comparison with Hopfield Associative Memory

The Hopfield associative memory was proposed by Hopfield [15], [16]. In the Hopfield model, the input of one cell comes from the outputs of all other cells. We apply the Hopfield associative memory to Fig. 10(a), (b), and (c). The output recovered patterns are shown in Fig. 14(a), (b), and (c); only Fig. 14(b) is correct, so the recognition result is a failure. The results of the four DT-CNNs and the Hopfield model in this experiment are shown in Table 1.
Fig. 14. Results of Hopfield associative memory: (a) output of Fig. 10(a), (b) output of Fig. 10(b), (c) output of Fig. 10(c).

Table 1. Results of four DT-CNNs and Hopfield model.

              | DT-CNN with S              | DT-CNN without S | Hopfield model
              | r=2      r=3      r=4      | r=1              |
  Recognition | Failure  Failure  Success  | Success          | Failure
4.3. Experiment 2: Experiment on Simulated Seismic Images
We apply the DT-CNN associative memory with matrix S to recognize the simulated seismic images. In this experiment, we store two training seismic patterns and recognize the patterns in three seismic images. The two training patterns are the bright spot pattern and the horizon pattern, shown in Fig. 15(a) and (b); the size is 16x50. Most seismic data have horizons related to geologic layer boundaries. The neighborhood radius r is set to 1 and 3, and α is set to 3.
Fig. 15. Two training seismic patterns: (a) bright spot pattern, (b) horizon pattern.
We have three testing seismic images, shown in Fig. 16(a), 17(a), and 18(a). Their size is 64x64, larger than the 16x50 size of the training patterns. We use a window to extract the testing pattern from the seismic image; the size of this window is equal to the size of the training pattern. The window is shifted from left to right and top to bottom over the testing seismic image. If the output pattern of the network is equal to one of the training patterns, we record the coordinate of the upper-left corner of the window. After the window is shifted to the last position
on the testing seismic image and all testing patterns are recognized, we calculate the center coordinate of all recorded coordinates belonging to the same kind of training pattern, and then use this center coordinate to recover the detected training pattern. We set the neighborhood radius r = 1 to process Fig. 16(a) and Fig. 17(a), and r = 3 to process Fig. 18(a). For the first image in Fig. 16(a), the horizon is short, and the detected pattern in Fig. 16(c) is only the bright spot. For the second image in Fig. 17(a), the horizon is long, and the detected patterns in Fig. 17(c) are the horizon and the bright spot. For the third image in Fig. 18(a), the horizon and bright spot patterns have discontinuities, but both kinds of patterns can still be detected in Fig. 18(c).
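The window-scanning procedure described above can be sketched as follows; it reuses the recognize() function from the Algorithm 2 sketch, and all names are illustrative.

```python
import numpy as np

def detect_patterns(image, patterns, A, B, e):
    """Scan a 0/1 seismic image with a window the size of the training patterns
    and record the upper-left coordinate whenever the associative memory
    settles on one of the stored bipolar patterns."""
    H, W = image.shape
    h, w = patterns[0].shape
    hits = {k: [] for k in range(len(patterns))}
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            u = np.where(image[i:i + h, j:j + w] > 0, 1.0, -1.0).ravel()  # bipolar window
            y = recognize(A, B, e, u)                                     # Algorithm 2
            out = np.sign(y).reshape(h, w)                                # bipolar output
            for k, p in enumerate(patterns):
                if np.array_equal(out, p):
                    hits[k].append((i, j))
    # use the mean of the recorded coordinates of each pattern as its location
    return {k: tuple(np.mean(v, axis=0).astype(int)) for k, v in hits.items() if v}
```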
Fig. 16. (a) First testing seismic image, (b) coordinates of successful detection, (c) output of (a).
Fig. 17.(a) Second testing seismic image, (b) coordinates of successful detection, (c) output of (a).
Fig. 18. (a) Third testing seismic image, (b) coordinates of successful detection, (c) output of (a).
5. Conclusions
CNN is adopted for seismic pattern recognition. We design the CNN to behave as an associative memory according to the stored training seismic patterns, which completes the training of the network, and we then use this associative memory to recognize seismic testing patterns. In the experiments, the analyzed seismic patterns are the bright spot pattern and the right and left pinch-out patterns that have the structure of gas and oil sand zones. The recognition results show that noisy seismic patterns can be recovered. In the comparison of experimental results, the CNN has better recovery capacity than the Hopfield model. The difference is that the cells of a CNN are locally connected, while the cells of the Hopfield model are globally connected; the input of one cell in a CNN comes from the inputs and outputs of the neighboring neurons, whereas the input of one cell in the Hopfield model comes from all other cells. We also ran experiments on seismic images, with two kinds of seismic patterns in training: the bright spot pattern and the horizon pattern. After training, we tested on seismic images; through window moving, the patterns can be detected. The results of seismic pattern recognition using CNN are good, and they can help the analysis and interpretation of seismic data.

Acknowledgements
This work was supported in part by the National Science Council, Taiwan, under NSC92-2213-E-009-095 and NSC93-2213-E-009-067.

References

1. L. O. Chua and Lin Yang, "Cellular Neural Networks: Theory," IEEE Trans. on CAS, vol. 35, no. 10, pp. 1257-1272, 1988.
2. L. O. Chua and Lin Yang, "Cellular Neural Networks: Applications," IEEE Trans. on CAS, vol. 35, no. 10, pp. 1273-1290, 1988.
3. Leon O. Chua, CNN: A Paradigm for Complexity, World Scientific, 1998.
4. H. Harrer and J. A. Nossek, "Discrete-time Cellular Neural Networks," International Journal of Circuit Theory and Applications, vol. 20, pp. 453-468, 1992.
5. G. Grassi, "A new approach to design cellular neural networks for associative memories," IEEE Trans. Circuits Syst. I, vol. 44, pp. 835-838, Sept. 1997.
6. G. Grassi, "On discrete-time cellular neural networks for associative memories," IEEE Trans. Circuits Syst. I, vol. 48, pp. 107-111, Jan. 2001.
7. Liang Hu, Huijun Gao, and Wei Xing Zheng, "Novel stability of cellular neural networks with interval time-varying delay," Neural Networks, vol. 21, no. 10, pp. 1458-1463, Dec. 2008.
8. Lili Wang and Tianping Chen, "Complete stability of cellular neural networks with unbounded time-varying delays," Neural Networks, vol. 36, pp. 11-17, Dec. 2012.
9. Wu-Hua Chen and Wei Xing Zheng, "A new method for complete stability analysis of cellular neural networks with time delay," IEEE Trans. on Neural Networks, vol. 21, no. 7, pp. 1126-1139, Jul. 2010.
10. Zhenyuan Guo, Jun Wang, and Zheng Yan, "Attractivity analysis of memristor-based cellular neural networks with time-varying delays," IEEE Trans. on Neural Networks, vol. 25, no. 4, pp. 704-717, Apr. 2014.
11. R. Lepage, R. G. Rouhana, B. St-Onge, R. Noumeir, and R. Desjardins, "Cellular neural network for automated detection of geological lineaments on radarsat images," IEEE Trans. on Geoscience and Remote Sensing, vol. 38, no. 3, pp. 1224-1233, May 2000.
12. Kou-Yuan Huang, Chin-Hua Chang, Wen-Shiang Hsieh, Shan-Chih Hsieh, Luke K. Wang, and Fan-Ren Tsai, "Cellular neural network for seismic horizon picking," The 9th IEEE International Workshop on Cellular Neural Networks and Their Applications, CNNA 2005, May 28-30, Hsinchu, Taiwan, 2005, pp. 219-222.
13. R. Perfetti, "Frequency domain stability criteria for cellular neural networks," Int. J. Circuit Theory Appl., vol. 25, no. 1, pp. 55-68, 1997.
14. K. Y. Huang, K. S. Fu, S. W. Cheng, and T. H. Sheen, "Image processing of seismogram: (A) Hough transformation for the detection of seismic patterns (B) Thinning processing in the seismogram," Pattern Recognition, vol. 18, no. 6, pp. 429-440, 1985.
15. J. J. Hopfield and D. W. Tank, ""Neural" computation of decisions in optimization problems," Biolog. Cybern., 52, pp. 141-152, 1985.
16. J. J. Hopfield and D. W. Tank, "Computing with neural circuits: A model," Science, 233, pp. 625-633, 1986.
17. M. B. Dobrin and C. H. Savit, Introduction to Geophysical Prospecting, New York: McGraw-Hill Book Co., 1988.
CHAPTER 2.9 INCORPORATING FACIAL ATTRIBUTES IN CROSS-MODAL FACE VERIFICATION AND SYNTHESIS
Hadi Kazemi, Seyed Mehdi Iranmanesh and Nasser M. Nasrabadi Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV. USA Face sketches are able to capture the spatial topology of a face while lacking some facial attributes such as race, skin, or hair color. Existing sketch-photo recognition and synthesis approaches have mostly ignored the importance of facial attributes. This chapter introduces two deep learning frameworks to train a Deep Coupled Convolutional Neural Network (DCCNN) for facial attribute guided sketch-to-photo matching and synthesis. Specifically, for sketch-to-photo matching, an attribute-centered loss is proposed which learns several distinct centers, in a shared embedding space, for photos and sketches with different combinations of attributes. Similarly, a conditional CycleGAN framework is introduced which forces facial attributes, such as skin and hair color, on the synthesized photo and does not need a set of aligned face-sketch pairs during its training.
1. Introduction

Automatic face sketch-to-photo identification has always been an important topic in computer vision and machine learning due to its vital applications in law enforcement.1,2 In criminal and intelligence investigations, in many cases, the facial photograph of a suspect is not available, and a forensic hand-drawn or computer generated composite sketch following the description provided by the testimony of an eyewitness is the only clue to identify possible suspects. Based on the existence or absence of the suspect's photo in the law enforcement database, an automatic matching algorithm or a sketch-to-photo synthesis is needed. Automatic Face Verification: An automatic matching algorithm is necessary for a quick and accurate search of the law enforcement face databases or surveillance cameras using a forensic sketch. The forensic or composite sketches, however, encode only limited information of the suspects' appearance, such as the spatial topology of their faces, while the majority of the soft biometric traits, such as skin, race, or hair color, are left out. Traditionally, sketch recognition algorithms were of two categories, namely generative and discriminative approaches. Generative algorithms map one of the modalities into the other and perform the matching in the second modality.3,4 On the contrary, discriminative approaches learn to extract useful and discriminative common features to perform the verification, such as the Weber's local descriptor (WLD)5 and scale-invariant feature transform
(SIFT).6 Nonetheless, these features are not always optimal for a cross-modal recognition task.7 More recently, deep learning-based approaches have emerged as a general solution to the problem of cross-domain face recognition. It is enabled by their ability in learning a common latent embedding between the two modalities.8,9 Despite all the success, employing deep learning techniques for the sketch-to-photo recognition problem is still challenging compared to the other single modality domains as it requires a large number of data samples to avoid over-fitting on the training data or stopping at local minima. Furthermore, the majority of the sketch-photo datasets include a few pairs of corresponding sketches and photos. Existing state-of-the-art methods primarily focus on making the semantic representation of the two domains into a single shared subspace, whilst the lack of soft-biometric information in the sketch modality is completely ignored. Despite the impressive results of recent sketch-photo recognition algorithms, conditioning the matching process on the soft biometric traits has not been adequately investigated. Manipulating facial attributes in photos has been an active research topic for years.10 The application of soft biometric traits in person reidentification has also been studied in the literature.11,12 A direct suspect identification framework based solely on descriptive facial attributes is introduced in.13 However, they have completely neglected the sketch images. In recent work, Mittal et al.14 employed facial attributes (e.g. ethnicity, gender, and skin color) to reorder the list of ranked identities. They have fused multiple sketches of a single identity to boost the performance of their algorithm. In this chapter, we introduce a facial attribute-guided cross-modal face verification scheme conditioned on relevant facial attributes. To this end, a new loss function, namely attribute-centered loss, is proposed to help the network in capturing the similarity of identities that have the same facial attributes combination. This loss function is defined based on assigning a distinct centroid (center point), in the embedding space, to each combination of facial attributes. Then, a deep neural network can be trained using a pair of sketch-attribute. The proposed loss function encourages the DCNN to map a photo and its corresponding sketch-attribute pair into a shared latent sub-space in which they have similar representations. Simultaneously, the proposed loss forces the distance of all the photos and sketchattribute pairs to their corresponding centers to be less than a pre-specified margin. This helps the network to filter out the subjects of similar facial structures to the query but a limited number of common facial attributes. Finally, the learned centers are trained to keep a distance related to their number of contradictory attributes. The justification behind the latter is that it is more likely that a victim misclassifies a few facial attributes of the suspect than most of them. Sketch-to-Photo Synthesis: In law enforcement, the photo of the person of interest is not always available in the police database. Here, an automatic face sketch-to-photo synthesis comes handy enabling them to produce suspects’ photos from the drawn forensic sketches. 
The majority of the current research works in the literature of sketch-based photo synthesis have tackled the problem using pairs of sketches and photos that are captured under highly controlled conditions, i.e., neutral expression and frontal pose. Different tech-
niques have been studied including transductive learning of a probabilistic sketch-photo generation model,15 sparse representations,16 support vector regression,17 Bayesian tensor inference,4 embedded hidden Markov model,18 and multiscale Markov random field model.19 Despite all the success, slight variations in the conditions can dramatically degrade the performance of these photo synthesizing frameworks which are developed and trained based on the assumption of having a highly controlled training pairs. In,20 a deep convolutional neural network (DCNN) is proposed to solve the problem of face sketchphoto synthesis in an uncontrolled condition. Another six-layer convolutional neural network (CNN) is introduced in21 to translate photos into sketches. In,21 a novel optimization objective is defined in the form of joint generative discriminative minimization which forces the person’s identity to be preserved in the synthesis process. More recently, generative adversarial networks (GANs)22 resulted in a significant improvement in image generation and manipulation tasks. The main idea was defining a new loss function which can help the model to capture the high-frequency information and generate more sharp and realistic images. More specifically, the generator network is trained to fool a discriminator network whose job is to distinguish synthetic and real images. Conditional GANs (cGAN)23 are also proposed to condition the generative models and generate images on an input which could be some attributes or another image. This ability makes cGANs a good fit for many image transformation applications such as sketch-photo synthesis,24 image manipulation,25 general-purpose image-to-image translation,23 and style transfer.26 However, to train the network, the proposed GAN frameworks required a pair of corresponding images from both the source and the target modalities. In order to bypass this issue, an unpaired image-to-image translation framework was introduced in,27 namely CycleGAN. The CycleGAN can learn image translation from a source domain to a target domain without any paired examples. For the same reason, we follow the same approach as CycleGAN to train a network for sketch-photo synthesis in the absence of paired samples. Despite the profound achievements in the recent literature of face sketch-photo synthesis, a key part, i.e., conditioning the face synthesis task on the soft biometric traits is mostly neglected in these works. Especially in sketch-to-photo synthesis, independent of the quality of sketches, there are some facial attributes that are missing in the sketch modality, such as skin, hair, eye colors, gender, and ethnicity. In addition, despite the existence of other adhered facial characteristics, such as having eyeglasses or a hat, on the sketch side, conditioning the image synthesis process on such information provides extra guidance about the generation of the person of interest and can result in a more precise and higher quality synthesized output. The application of soft biometric traits in person reidentification has been studied in the literature.28 Face attributes help to construct face representations and train domain classifiers for identity prediction. 
However, few researchers have addressed this problem in sketch-photo synthesis,29 attribute-image synthesis,30 and face editing.31 Although the CycleGAN solved the problem of learning a GAN network in the absence of paired training data, the original version does not force any conditions, e.g., facial attributes, on the image synthesis process. In this chapter, we propose a new framework built on the CycleGAN to generate face photos from sketches conditioned on relevant facial at-
tributes. To this end, we developed a conditional version of the CycleGAN, which we refer to as the cCycleGAN, and trained it with an extra discriminator to force the desired facial attributes on the synthesized images.

2. Attribute-Guided Face Verification

2.1. Center loss

Minimization of cross-entropy is a common objective to train a deep neural network for a classification or verification task. However, this loss function does not encourage the network to extract discriminative features and only guarantees their separability.32 The intuition behind the center loss is that the cross-entropy loss does not force the network to learn the intra-class variations in a compact form. To bypass this problem, contrastive loss33 and triplet loss34 have emerged in the literature to capture a more compact form of the intra-class variations. Despite their recent diverse successes, their convergence rates are quite slow. Consequently, a new loss function, namely center loss, has been proposed in32 to push the neural network to distill a set of features with more discriminative power. The center loss, Lc, is formulated as

$$L_c=\frac{1}{2}\sum_{i=1}^{m}\|x_i-c_{y_i}\|_2^2$$
(1)
where m denotes the number of samples in a mini-batch, x_i ∈ R^d denotes the ith sample feature embedding, belonging to the class y_i, c_{y_i} ∈ R^d denotes the y_i-th class center of the embedded features, and d is the feature dimension. To train a deep neural network, a joint supervision of the proposed center loss and cross-entropy loss is adopted:

L = L_s + λL_c
(2)
where L_s is the softmax (cross-entropy) loss. The center loss, as defined in Eq. (1), is deficient in that it only penalizes the compactness of intra-class variations without considering the inter-class separation. Therefore, to address this issue, a contrastive-center loss has been proposed in35 as

$$L_{ct\text{-}c}=\frac{1}{2}\sum_{i=1}^{m}\frac{\|x_i-c_{y_i}\|_2^2}{\left(\sum_{j=1,\,j\neq y_i}^{k}\|x_i-c_j\|_2^2\right)+\delta}$$
(3)
where δ is a constant preventing a zero denominator, and k is the number of classes. This loss function not only penalizes the intra-class variations but also maximizes the distance between each sample and all the centers belonging to the other classes.

2.2. Proposed loss function

Inspired by the center loss, we propose a new loss function for facial-attribute guided sketch-photo recognition. Since in most of the available sketch datasets there is only a single pair of sketch-photo images per identity, there is no benefit in assigning a separate
center to each identity as in32 and.35 However, here we assign centers to different combinations of facial attributes. In other words, the number of centers is equal to the number of possible facial attribute combinations. To define our attribute-centered loss, it is important to briefly describe the overall structure of the recognition network.

2.2.1. Network structure

Due to the cross-modal nature of the sketch-photo recognition problem, we employed a coupled DNN model to learn a deep shared latent subspace between the two modalities, i.e., sketch and photo. Figure 1 shows the structure of the coupled deep neural network which is deployed to learn the common latent subspace between the two modalities. The first network, namely photo-DCNN, takes a color photo and embeds it into the shared latent subspace, p_i, while the second network, or sketch-attribute-DCNN, gets a sketch and its assigned class center and finds their representation, s_i, in the shared latent subspace. The two networks are trained to find a shared latent subspace such that the representation of each sketch with its associated facial attributes is as close as possible to its corresponding photo while still keeping its distance to other photos. To this end, we proposed and employed the Attribute-Centered Loss for our attribute-guided shared representation learning.
Fig. 1.: Coupled deep neural network structure. Photo-DCNN (upper network) and sketch-attribute-DCNN (lower network) map the photos and sketch-attribute pairs into a common latent subspace.
2.2.2. Attribute-centered loss

In the problem of facial-attribute guided sketch-photo recognition, one can consider different combinations of facial attributes as distinct classes. With this intuition in mind, the first task of the network is to learn a set of discriminative features for inter-class (between different combinations of facial attributes) separability. However, the second goal of our network differs from the two previous works32,35 which were looking for a compact representation of intra-class variations. On the contrary, here, intra-class variations
represent faces with different geometrical properties, or more specifically, different identities. Consequently, the coupled DCNN should be trained to keep the separability of the identities as well. To this end, we define the attribute-centered loss function as

$$L_{ac}=L_{attr}+L_{id}+L_{cen}$$
(4)
where L_attr is a loss to minimize the intra-class distances of photos or sketch-attribute pairs which share a similar combination of facial attributes, L_id denotes the identity loss for intra-class separability, and L_cen forces the centers to keep their distance from each other in the embedding subspace for better inter-class discrimination. The attribute loss L_attr is formulated as

$$L_{attr}=\frac{1}{2}\sum_{i=1}^{m}\max(\|p_i-c_{y_i}\|_2^2-\epsilon_c,0)+\max(\|s_i^{g}-c_{y_i}\|_2^2-\epsilon_c,0)+\max(\|s_i^{im}-c_{y_i}\|_2^2-\epsilon_c,0) \qquad (5)$$

where ε_c is a margin promoting convergence, and p_i is the feature embedding of the input photo by the photo-DCNN with the attribute combination represented by y_i. Also, s_i^g and s_i^im (see Figure 1) are the feature embeddings of two sketches with the same combination of attributes as p_i but with the same (genuine pair) or different (impostor pair) identities, respectively. In contrast to the center loss (1), the attribute loss does not try to push the samples all the way to the center, but keeps them around the center within a margin of radius ε_c (see Figure 2). This gives the network the flexibility to learn a discriminative feature space inside the margin for intra-class separability. This intra-class discriminative representation is learned by the network through the identity loss L_id, which is defined as

$$L_{id}=\frac{1}{2}\sum_{i=1}^{m}\|p_i-s_i^{g}\|_2^2+\max(\epsilon_d-\|p_i-s_i^{im}\|_2^2,0)$$
(6)
which is a contrastive loss33 with a margin of ε_d that pushes the photos and sketches of the same identity toward each other, within their center's margin ε_c, and takes the photos and sketches of different identities apart. Obviously, the contrastive margin ε_d should be less than twice the attribute margin ε_c, i.e., ε_d < 2×ε_c (see Figure 2). However, from a theoretical point of view, the minimization of the identity loss L_id and the attribute loss L_attr has a trivial solution if all the centers converge to a single point in the embedding space. This solution can be prevented by pushing the centers to keep a minimum distance. For this reason, we define another loss term formulated as

$$L_{cen}=\frac{1}{2}\sum_{j=1}^{n_c}\sum_{k=1,\,k\neq j}^{n_c}\max(\epsilon_{jk}-\|c_j-c_k\|_2^2,0) \qquad (7)$$
where n_c is the total number of centers, c_j and c_k denote the jth and kth centers, and ε_jk is the associated distance margin between c_j and c_k. In other words, this loss term enforces a minimum distance ε_jk between each pair of centers, which is related to the number of contradictory attributes between the two centers c_j and c_k. Now, two centers which only differ in a few attributes are closer to each other than those with a larger number of dissimilar attributes. The intuition behind the similarity-related margin is that the eyewitnesses may
mis-judge one or two attributes, but it is less likely to mix up more than that. Therefore, during the test, it is very probable that the top rank suspects have a few contradictory attributes when compared with the attributes provided by the victims. Figure 2 visualizes the overall concept of the attribute-centered loss.
Fig. 2.: Visualization of the shared latent space learned by the utilization of the attribute-centered loss. Centers with fewer contradictory attributes are closer to each other in this space.
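A minimal PyTorch-style sketch of Eqs. (4)-(7) is given below; the tensor names, the batching convention, and the pair_dist matrix of margins ε_jk (with a zero diagonal) are assumptions made for illustration, not code from this chapter.

```python
import torch
import torch.nn.functional as F

def attribute_centered_loss(p, s_g, s_im, centers, y, pair_dist, eps_c, eps_d):
    """p, s_g, s_im: photo, genuine-sketch and impostor-sketch embeddings (batch x d);
    centers: n_c x d learnable centers; y: center index of each sample;
    pair_dist: n_c x n_c matrix of margins eps_jk (zero diagonal)."""
    c = centers[y]                                           # centers of this mini-batch
    # L_attr, Eq. (5): keep photos and sketch-attribute pairs within eps_c of their center
    l_attr = 0.5 * (F.relu((p - c).pow(2).sum(1) - eps_c)
                    + F.relu((s_g - c).pow(2).sum(1) - eps_c)
                    + F.relu((s_im - c).pow(2).sum(1) - eps_c)).sum()
    # L_id, Eq. (6): contrastive identity loss inside the center margin
    l_id = 0.5 * ((p - s_g).pow(2).sum(1)
                  + F.relu(eps_d - (p - s_im).pow(2).sum(1))).sum()
    # L_cen, Eq. (7): keep each pair of centers at least eps_jk apart
    d_cc = torch.cdist(centers, centers).pow(2)
    l_cen = 0.5 * F.relu(pair_dist - d_cc).sum()             # diagonal terms vanish
    return l_attr + l_id + l_cen
```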
2.2.3. A special case and connection to the data fusion

For better clarification, in this section we discuss a special case in which the network maps the attributes and geometrical information into two different subspaces. Figure 2 represents the visualization of this special case. The learned common embedding space (Z) comprises two orthogonal subspaces. Therefore, the basis for Z can be written as

Span{Z} = Span{X} + Span{Y}
(8)
where X ⊥ Y and dim(Z) = dim(X) + dim(Y ). In this scenario, the network learns to put the centers in the embedding subspace X, and utilizes embedding subspace Y to model the intra-class variations. In other words, the learned embedding space is divided into two subspaces. The first embedding subspace represents the attribute center which provides the information regarding the subjects facial attributes. The second subspace denotes the geometrical properties of subjects or their identity information. Although this is a very unlikely scenario as some of the facial attributes are highly correlated with the geometrical property of the face, this scenario can be considered to describe the intuition behind our proposed framework. It is important to note, the proposed attribute-centered loss guides the network to fuse the geometrical and attribute information automatically during its shared latent representation learning. In the proposed framework, the sketch-attribute-DCNN learns to fuse an
input sketch and its corresponding attributes. This fusion process is an inevitable task for the network to learn the mapping from each sketch-attribute pair to its center vicinity. As shown in Figure 1, in this scheme the sketch and n binary attributes, ai=1,...,n , are passed to the network as a (n + 1)-channel input. Each attribute-dedicated channel is constructed by repeating the value that is assigned to that attribute. This fusion algorithm uses the information provided by the attributes to compensate the information that cannot be extracted from the sketch (such as hair color) or it is lost while drawing the sketch. 2.3. Implementation details 2.3.1. Network structure We deployed a deep coupled CNN to learn the attribute-guided shared representation between the forensic sketch and the photo modalities by employing the proposed attributecentered loss. The overall structure of the coupled network is illustrated in Figure 1. The structures of both photo and sketch DCNNs are the same and are adopted from the VGG16.36 However, for the sake of parameter reduction, we replaced the last three convolutional layers of VGG16, with two convolutional layers of depth 256 and one convolutional layer of depth 64. We also replaced the last max pooling with a global average pooling, which results in a feature vector of size 64. We also added batch-normalization to all the layers of VGG16. The photo-DCNN takes an RGB photo as its input and the sketch-attribute-DCNN gets a multi-channel input. The first input channel is a gray-scale sketch and there is a specific channel for each binary attribute filled with 0 or 1 based on the presence or absence of that attribute in the person of interest. 2.3.2. Data description We make use of hand-drawn sketch and digital image pairs from CUHK Face Sketch Dataset (CUFS)37 (containing 311 pairs), IIIT-D Sketch dataset38 (containing 238 viewed pairs, 140 semi-forensic pairs, and 190 forensic pairs), unviewed Memory Gap Database (MGDB)3 (containing 100 pairs), as well as composite sketch and digital image pairs from PRIP Viewed Software-Generated Composite database (PRIP-VSGC)39 and extendedPRIP Database (e-PRIP)14 for our experiments. We also utilized the CelebFaces Attributes Dataset (CelebA),40 which is a large-scale face attributes dataset with more than 200K celebrity images with 40 attribute annotations, to pre-train the network. To this end, we generated a synthetic sketch by applying xDOG41 filter to every image in the celebA dataset. We selected 12 facial attributes, namely black hair, brown hair, blond hair, gray hair, bald, male, Asian, Indian, White, Black, eyeglasses, sunglasses, out of the available 40 attribute annotations in this dataset. We categorized the selected attributes into four attribute categories of hair (5 states), race (4 states), glasses (2 states), and gender (2 states). For each category, except the gender category, we also considered an extra state for any case in which the provided attribute does not exist for that category. Employing this attribute setup, we ended up with 180 centers (different combinations of the attributes). Since none
of the aforementioned sketch datasets includes facial attributes, we manually labeled all of the datasets.
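The sketch-attribute fusion described in Sections 2.2.3 and 2.3.1, where a gray-scale sketch is stacked with one constant channel per binary attribute, can be built as in the sketch below. This assumes the n attributes are already encoded as a 0/1 vector; names and shapes are illustrative.

```python
import torch

def sketch_attribute_input(sketch, attributes):
    """Build the (n+1)-channel input of the sketch-attribute-DCNN: channel 0 is
    the gray-scale sketch, and each binary attribute fills a constant channel."""
    # sketch: 1 x H x W tensor in [0, 1]; attributes: length-n 0/1 tensor
    h, w = sketch.shape[-2:]
    attr_maps = attributes.float().view(-1, 1, 1).expand(-1, h, w)  # repeat each value
    return torch.cat([sketch, attr_maps], dim=0)                    # (n+1) x H x W
```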
2.3.3. Network training

We pre-trained our deep coupled neural network using synthetic sketch-photo pairs from the CelebA dataset. We followed the same approach as32 to update the centers based on mini-batches. The network pre-training process terminated when the attribute-centered loss stopped decreasing. The final weights are employed to initialize the network in all the training scenarios. Since deep neural networks with a huge number of trainable parameters are prone to overfitting on a relatively small training dataset, we employed multiple augmentation techniques (see Figure 3):
Fig. 3.: A sample of different augmentation techniques.
• Deformation: Since sketches are not geometrically matched with their photos, we employed the Thin Plate Spline transformation (TPS)42 to help the network learn more robust features and prevent overfitting on small training sets. To this end, we deformed the images, i.e., sketches and photos, by randomly translating 25 preselected points; each point is translated with random magnitude and direction. The same approach has been successfully applied for fingerprint distortion rectification.43
• Scale and crop: Sketches and photos are upscaled to a random size without keeping the original width-height ratio. Then, a 250×200 crop is sampled from the center of each image. This results in a ratio deformation, which is a common mismatch between sketches and their ground truth photos (a minimal sketch of this step and the flipping step follows this list).
• Flipping: Images are randomly flipped horizontally.
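The following PIL-based sketch implements the scale-and-crop and flipping augmentations under the assumption that the 250×200 crop is width×height; the TPS deformation is omitted and the scaling range is illustrative.

```python
import random
from PIL import Image

def augment(photo, sketch, out_size=(250, 200)):
    """Apply the same scale-and-crop and horizontal flip to a photo-sketch pair."""
    w, h = out_size
    # upscale to a random size without preserving the aspect ratio
    new_w, new_h = random.randint(w, int(1.3 * w)), random.randint(h, int(1.3 * h))
    photo, sketch = photo.resize((new_w, new_h)), sketch.resize((new_w, new_h))
    # sample a center crop of the target size
    left, top = (new_w - w) // 2, (new_h - h) // 2
    box = (left, top, left + w, top + h)
    photo, sketch = photo.crop(box), sketch.crop(box)
    # random horizontal flip, applied identically to both images
    if random.random() < 0.5:
        photo = photo.transpose(Image.FLIP_LEFT_RIGHT)
        sketch = sketch.transpose(Image.FLIP_LEFT_RIGHT)
    return photo, sketch
```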
2.4. Evaluation

The proposed algorithm works with a probe image, preferred attributes, and a gallery of mugshots to perform identification. In this section, we compare our algorithm with multiple attribute-guided techniques as well as those that do not utilize any extra information.
Table 1.: Experiment setup. The last three columns show the number of identities in each of train, test gallery, and test probe.

  Setup | Train                              | Test                                | Train # | Gallery # | Probe #
  P1    | e-PRIP                             | e-PRIP                              | 48      | 75        | 75
  P2    | e-PRIP                             | e-PRIP                              | 48      | 1500      | 75
  P3    | CUFS, IIIT-D Viewed, CUFSF, e-PRIP | IIIT-D Semi-forensic; MGDB Unviewed | 1968    | 1500      | 135; 100
2.4.1. Experiment setup We conducted three different experiments to evaluate the effectiveness of the proposed framework. For the sake of comparison, the first two experiment setups are adopted from.14 In the first setup, called P1, the e-PRIP dataset, with the total of 123 identities, is partitioned into training, 48 identities, and testing, 75 identities, sets. The original e-PRIP dataset, which is used in,14 contains 4 different composite sketch sets of the same 123 identities. However, at the time of writing of this article, there are only two of them available to the public. The accessible part of the dataset includes the composite sketches created by an Asian artist using the Identi-Kit tool, and an Indian user adopting the FACES tool. The second experiment, or P2 setup, is performed employing an extended gallery of 1500 subjects. The gallery size enlarged utilizing WVU Muti-Modal,44 IIIT-D Sketch, Multiple Encounter Dataset (MEDS),45 and CUFS datasets. This experiment is conducted to evaluate the performance of the proposed framework in confronting real-word large gallery. Finally, we assessed the robustness of the network to a new unseen dataset. This setup, P3, reveals to what extent the network is biased to the sketch styles in the training datasets. To this end, we trained the network on CUFS, IIIT-D Viewed, and e-PRIP datasets and then tested it on IIIT-D Semi-forensic pairs, and MGDB Unviewed. The performance is validated using ten fold random cross validation. The results of the proposed method are compared with the state-of-the-art techniques. 2.4.2. Experimental results For the set of sketches generated by the Indian (Faces) and Asian (IdentiKit) users14 has the rank 10 accuracy of %58.4 and %53.1, respectively. They utilized an algorithm called attribute feedback to consider facial attributes on their identification process. However, SGR-DA46 reported a better performance of %70 on the IdentiKit dataset without utilization of any facial attributes. In comparison, our proposed attribute-centered loss resulted in %73.2 and %72.6 accuracies, on Faces and IdentiKit, respectively. For the sake of evaluation, we also trained the same coupled deep neural network with the sole supervision of contrastive loss. This attribute-unaware network has %65.3 and %64.2 accuracies, on Faces and IdentiKit, respectively, which demonstrates the effectiveness of attributes contribution as part of our proposed algorithm. Figure 4 visualize the effect of attribute-centered loss on top five ranks on P1 experiment’s test results. The first row is the results of our attribute-unaware network, while the
Table 2.: Rank-10 identification accuracy (%) on the e-PRIP composite sketch database.

  Algorithm               | Faces (In)  | IdentiKit (As)
  Mittal et al.47         | 53.3 ± 1.4  | 45.3 ± 1.5
  Mittal et al.48         | 60.2 ± 2.9  | 52.0 ± 2.4
  Mittal et al.14         | 58.4 ± 1.1  | 53.1 ± 1.0
  SGR-DA46                | -           | 70
  Ours without attributes | 68.6 ± 1.6  | 67.4 ± 1.9
  Ours with attributes    | 73.2 ± 1.1  | 72.6 ± 0.9
second row shows the top ranks for the same sketch probe using our proposed network trained by the attribute-centered loss. Considering the attributes removes many of the false matches from the ranked list, and the correct subject moves to a higher rank. To evaluate the robustness of our algorithm in the presence of a relatively large gallery of mugshots, the same experiments are repeated on an extended gallery of 1500 subjects. Figure 5a shows the performance of our algorithm as well as the state-of-the-art algorithm on the Indian user (Faces) dataset. The proposed algorithm outperforms14 by almost 11% at rank 50 when exploiting facial attributes. Since the results for IdentiKit were not reported in,14 we compared our algorithm with SGR-DA46 (see Figure 5b). Even though SGR-DA outperformed our attribute-unaware network in the P1 experiment, its result was not as robust as our proposed attribute-aware deep coupled neural network. Finally, Figure 6 demonstrates the results of the proposed algorithm on the P3 experiment. The network is trained on 1968 sketch-photo pairs and then tested on two completely unseen datasets, i.e., IIIT-D Semi-forensic and MGDB Unviewed. The gallery of this experiment was also extended to 1500 mugshots.
Fig. 4.: The effect of considering facial attributes in sketch-photo matching. The first line shows the results for a network trained with attribute-centered loss, and the second line depicts the result of a network trained using contrastive loss.
Fig. 5.: CMC curves of the proposed and existing algorithms for the extended gallery experiment: (a) results on the Indian data subset compared to Mittal et al.14 and (b) results on the Identi-Kit data subset compared to SGR-DA.46
Fig. 6.: CMC curves of the proposed algorithm for experiment P3. The results confirm the robustness of the network to different sketch styles.
3. Attribute-guided sketch-to-photo synthesis

3.1. Conditional generative adversarial networks (cGANs)
GANs22 are a group of generative models which learn to map a random noise z to an output image y: G(z) : z −→ y. They can be extended to a conditional GAN (cGAN) if the generator model, G, (and usually the discriminator) is conditioned on some extra information,
x, such as an image or class labels. In other words, a cGAN learns a mapping from an input x and a random noise z to the output image y: G(x, z) : {x, z} −→ y. The generator model is trained to generate an image which is not distinguishable from "real" samples by a discriminator network, D. The discriminator is trained adversarially to discriminate between the "fake" images produced by the generator and the real samples from the training dataset. Both the generator and the discriminator are trained simultaneously following a two-player min-max game. The objective function of the cGAN is defined as:

lGAN(G, D) = Ex,y∼pdata[log D(x, y)] + Ex,z∼pz[log(1 − D(x, G(x, z)))],    (9)

where G attempts to minimize it and D tries to maximize it. Previous works in the literature have found it beneficial to add an extra L2 or L1 distance term to the objective function, which forces the network to generate images that are near the ground truth. Isola et al.23 found L1 to be a better candidate as it encourages less blurring in the generated output. In summary, the generator model is trained as follows:

G∗ = arg min_G max_D lGAN(G, D) + λ lL1(G),    (10)

where λ is a weighting factor and lL1(G) is

lL1(G) = ‖y − G(x, z)‖1.    (11)
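To make the roles of Eqs. (9)-(11) concrete, the short sketch below shows one way the generator and discriminator losses could be computed. It is only an illustrative sketch assuming a TensorFlow/Keras setup; the model names, call signatures and the λ value are assumptions, not the chapter's actual implementation.

```python
# Illustrative sketch of the cGAN losses in Eqs. (9)-(11), assuming
# TensorFlow/Keras. `discriminator` is assumed to be a tf.keras model that
# takes (condition x, image) and outputs a probability of being "real".
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(discriminator, x, y_real, y_generated):
    # Eq. (9): D is pushed towards 1 on real pairs and 0 on generated pairs.
    real_pred = discriminator([x, y_real])
    fake_pred = discriminator([x, y_generated])
    return bce(tf.ones_like(real_pred), real_pred) + \
           bce(tf.zeros_like(fake_pred), fake_pred)

def generator_loss(discriminator, x, y_generated, y_target, lam=100.0):
    # Eq. (10): adversarial term (G tries to make D predict "real") plus
    # the weighted L1 term of Eq. (11); lam is an illustrative value.
    fake_pred = discriminator([x, y_generated])
    adversarial = bce(tf.ones_like(fake_pred), fake_pred)
    l1 = tf.reduce_mean(tf.abs(y_target - y_generated))   # Eq. (11)
    return adversarial + lam * l1
```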
3.1.1. Training procedure
In each training step, an input, x, is passed to the generator to produce the corresponding output, G(x, z). The generated output and the input are concatenated and fed to the discriminator. First, the discriminator's weights are updated in a way to distinguish between the generated output and a real sample from the target domain. Then, the generator is trained to fool the discriminator by generating more realistic images.

3.2. CycleGAN
The main goal of CycleGAN27 is to train two generative models, Gx and Gy. These two models learn the mapping functions between two domains x and y. The model, as illustrated in Figure 7, includes two generators; the first one maps x to y: Gy(x) : x −→ y, and the other does the inverse mapping y to x: Gx(y) : y −→ x. There are two adversarial discriminators Dx and Dy, one for each generator. More precisely, Dx distinguishes between "real" x samples and the generated "fake" samples Gx(y), and similarly, Dy discriminates between "real" y and the "fake" Gy(x). Therefore, there is a distinct adversarial loss in CycleGAN for each of the two (Gx, Dx) and (Gy, Dy) pairs. Notice that the adversarial losses are defined as in Eq. 9. For a high-capacity network trained using only the adversarial loss, there is a possibility of mapping the same set of inputs to a random permutation of images in the target domain. In other words, the adversarial loss is not enough to guarantee that the trained network generates the desired output.
Fig. 7.: CycleGAN
This is the reason behind having an extra L1 distance term in the objective function of the cGAN, as shown in Eq. 10. As shown in Figure 7, in the case of CycleGAN there are no paired images between the source and target domains, which is the main feature of CycleGAN over cGAN. Consequently, the L1 distance loss cannot be applied to this problem. To tackle this issue, a cycle consistency loss was proposed in,27 which forces the learned mapping functions to be cycle-consistent. In particular, the following conditions should be satisfied:

x −→ Gy(x) −→ Gx(Gy(x)) ≈ x,    y −→ Gx(y) −→ Gy(Gx(y)) ≈ y.    (12)

To this end, a cycle consistency loss is defined as

lcyc(Gx, Gy) = Ex∼pdata[‖x − Gx(Gy(x))‖1] + Ey∼pdata[‖y − Gy(Gx(y))‖1].    (13)

Taken together, the full objective function is

l(Gx, Gy, Dx, Dy) = lGAN(Gx, Dx) + lGAN(Gy, Dy) + λ lcyc(Gx, Gy),    (14)

where λ is a weighting factor to control the importance of the objectives, and the whole model is trained as follows:

G∗x, G∗y = arg min_{Gx,Gy} max_{Dx,Dy} l(Gx, Gy, Dx, Dy).    (15)

From now on, we use x for our source domain, which is the sketch domain, and y for the target domain, i.e., the photo domain.

3.2.1. Architecture
The two generators, Gx and Gy, adopt the same architecture27 consisting of six convolutional layers and nine residual blocks49 (see27 for details). The output of the discriminator is of size 30×30. Each output pixel corresponds to a patch of the input image and tries to classify whether that patch is real or fake. More details are reported in.27
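As a concrete reading of Eqs. (13) and (14), the cycle-consistency term and the combined objective can be sketched as follows; this is a hedged illustration assuming TensorFlow/Keras, with G_x and G_y as placeholder generator models and an illustrative λ.

```python
# Sketch of the cycle-consistency loss (Eq. 13) and the full CycleGAN
# objective (Eq. 14), assuming TensorFlow/Keras. G_x maps photos to
# sketches and G_y maps sketches to photos, as in the text.
import tensorflow as tf

def cycle_consistency_loss(G_x, G_y, x, y):
    # ||x - G_x(G_y(x))||_1 + ||y - G_y(G_x(y))||_1
    x_reconstructed = G_x(G_y(x))   # sketch -> photo -> sketch
    y_reconstructed = G_y(G_x(y))   # photo -> sketch -> photo
    return tf.reduce_mean(tf.abs(x - x_reconstructed)) + \
           tf.reduce_mean(tf.abs(y - y_reconstructed))

def cyclegan_objective(adv_loss_x, adv_loss_y, G_x, G_y, x, y, lam=10.0):
    # Eq. (14): two adversarial losses (each defined as in Eq. 9)
    # plus the weighted cycle-consistency term; lam is illustrative.
    return adv_loss_x + adv_loss_y + lam * cycle_consistency_loss(G_x, G_y, x, y)
```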
From now on, we use x for our source domain which is the sketch domain and y for the target domain or the photo domain. 3.2.1. Architecture The two generators, Gx and Gy , adopt the same architecture27 consisting of six convolutional layers and nine residual blocks49 (see27 for details). The output of the discriminator is of size 30x30. Each output pixel corresponds to a patch of the input image and tries to classify if the patch is real or fake. More details are reported in.27
3.3. Conditional CycleGAN (cCycleGAN)
The CycleGAN architecture has solved the problem of having unpaired training data but still has a major drawback: extra conditions, such as soft biometric traits, cannot be forced on the target domain. To tackle this problem, we proposed a CycleGAN architecture with a soft-biometrics conditional setting, which we refer to as Conditional CycleGAN (cCycleGAN). Since in the sketch-photo synthesis problem attributes (e.g., skin color) are missing on the sketch side and not on the photo side, the photo-sketch generator, Gx(y), is left unchanged in the new setting. However, the sketch-photo generator, Gy(x), needs to be modified by conditioning it on the facial attributes. The new sketch-photo generator maps (x, a) to y, i.e., Gy(x, a) : (x, a) −→ y, where a stands for the desired facial attributes to be present in the synthesized photo. The corresponding discriminator, Dy(x, a), is also conditioned on both the sketch, x, and the desired facial attributes, a. The definition of the loss function remains the same as in CycleGAN, given by Eq. 14. In contrast to the previous work in face editing,31 our preliminary results showed that having only a single discriminator conditioned on the desired facial attributes was not enough to force the attributes on the generator's output of the CycleGAN. Consequently, instead of increasing the complexity of the discriminator, we trained an additional auxiliary discriminator, Da(y, a), to detect whether the desired attributes are present in the synthesized photo or not. In other words, the sketch-photo generator, Gy(x, a), tries to fool an extra attribute discriminator, Da(y, a), which checks the presence of the desired facial attributes. The objective function of the attribute discriminator is defined as follows:

lAtt(Gy, Da) = Ea,y∼pdata[log Da(a, y)] + Ey∼pdata, ā≠a[log(1 − Da(ā, y))] + Ea,y∼pdata[log(1 − Da(a, Gy(x, a)))],    (16)

where a is the set of corresponding attributes of the real image, y, and ā ≠ a is a set of random arbitrary attributes. Therefore, the total loss of the cCycleGAN is

l(Gx, Gy, Dx, Dy) = lGAN(Gx, Dx) + lGAN(Gy, Dy) + λ1 lcyc(Gx, Gy) + λ2 lAtt(Gy, Da),    (17)

where λ1 and λ2 are weighting factors to control the importance of the objectives.

3.4. Architecture
Our proposed cCycleGAN adopts the same architecture as in CycleGAN. However, to condition the generator and the discriminator on the facial attributes, we slightly modified the architecture. The generator which transforms photos into sketches, Gx(y), and its corresponding discriminator, Dx, are left unchanged as there is no attribute to force in the sketch generation phase. However, in the sketch-photo generator, Gy(x), we insert the desired attributes before the fifth residual block of the bottleneck (Figure 8). To this end, each attribute is repeated 4096 (64*64) times and then resized to a matrix of size 64×64. Then all of these attribute feature maps and the output feature maps of the fourth residual block are concatenated in depth and passed to the next block, as shown in Figure 9.
Fig. 8.: cCycleGAN architecture, including Sketch-Photo cycle (top) and Photo-Sketch cycle (bottom).
Fig. 9.: Sketch-Photo generator network, Gy (x, a), in cCycleGAN.
The same modification is applied to the corresponding attribute discriminator, Da. All the attributes are repeated, resized, concatenated in depth with the generated photo, and passed to the discriminator.
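The attribute conditioning described above (broadcasting each attribute into a 64×64 map and concatenating it in depth with the feature maps) and the three terms of the attribute-discriminator loss in Eq. (16) can be sketched as below. This is an assumption-laden illustration in TensorFlow/Keras; D_a and all tensor names are placeholders, not the authors' code.

```python
# Sketch of cCycleGAN attribute conditioning and the attribute-discriminator
# loss of Eq. (16), assuming TensorFlow/Keras; names are illustrative.
import tensorflow as tf

def broadcast_attributes(attrs, height=64, width=64):
    # attrs: (batch, n_attrs); each attribute is repeated height*width
    # times to form a (height, width) map, as described in Section 3.4.
    a = attrs[:, tf.newaxis, tf.newaxis, :]            # (batch, 1, 1, n_attrs)
    return tf.tile(a, [1, height, width, 1])           # (batch, h, w, n_attrs)

def condition_on_attributes(feature_maps, attrs):
    # Concatenate the attribute maps in depth with the feature maps of the
    # fourth residual block before passing them to the fifth block.
    attr_maps = broadcast_attributes(attrs)
    return tf.concat([feature_maps, attr_maps], axis=-1)

def attribute_discriminator_loss(D_a, y_real, a_real, a_wrong, y_synthesized):
    # Eq. (16): real photo + correct attributes -> real; real photo + wrong
    # attributes -> fake; synthesized photo + its attributes -> fake.
    bce = tf.keras.losses.BinaryCrossentropy()
    p_real  = D_a([a_real, y_real])
    p_wrong = D_a([a_wrong, y_real])
    p_synth = D_a([a_real, y_synthesized])
    return bce(tf.ones_like(p_real), p_real) + \
           bce(tf.zeros_like(p_wrong), p_wrong) + \
           bce(tf.zeros_like(p_synth), p_synth)
```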
3.5. Training procedure
We follow the same training procedure as in Section 3.1.1 for the photo-sketch generator. However, for the sketch-photo generator, we need a different training mechanism to force the desired facial attributes to be present in the generated photo. Therefore, we define a new type of negative sample for the attribute discriminator, Da, namely a real photo from the target domain but with a wrong set of attributes, ā. This training mechanism forces the sketch-photo generator to produce faces with the desired attributes. At each training step, this generator synthesizes a photo with the same attributes, a, as the real photo. Both the corresponding sketch-photo discriminator, Dy, and the attribute discriminator, Da, are supposed to detect the synthesized photo as a fake sample. The attribute discriminator, Da, is also trained with two other pairs: a real photo with correct attributes as a real sample, and a real photo with a wrong set of attributes as a fake sample. Simultaneously, the sketch-photo generator attempts to fool both of the discriminators.

3.6. Experimental results

3.6.1. Datasets
FERET Sketch: The FERET database50 includes 1,194 sketch-photo pairs. Sketches are hand-drawn by an artist while looking at the face photos. Both the face photos and sketches are grayscale images of size 250 × 200 pixels. However, since we aim to produce color photos, we did not use the grayscale face photos of this dataset to train the cCycleGAN. We randomly selected 1000 sketches to train the network, and the remaining 194 are used for testing.

WVU Multi-modal: To synthesize color images from the FERET sketches, we use the frontal-view face images from WVU Multi-modal.44 The dataset contains 3453 high-resolution color frontal images of 1200 subjects. The images are aligned, cropped and resized to the same size as FERET Sketch, i.e., 250 × 200 pixels. The dataset does not contain any facial attributes. However, for each image, the average color of a 25 × 25 pixel rectangular patch (placed on the forehead or cheek) is considered as the skin color. The images are then clustered into three classes, namely white, brown and black, based on their intensities.

CelebFaces Attributes (CelebA): We use the aligned and cropped version of the CelebA dataset51 and scale the images down to 128 × 128 pixels. We also randomly split it into two partitions, 182K images for training and 20K for testing. Of the original 40 attributes, we selected only those that have a clear visual impact on the synthesized faces and are missing in the sketch modality, which leaves a total of six attributes, namely black hair, brown hair, blond hair, gray hair, pale skin, and gender. Due to the huge differences in face views and backgrounds between the FERET and CelebA databases, the preliminary results did not show an acceptable performance when training on the FERET-CelebA pair. Consequently, we generated a synthetic sketch dataset by applying the xDoG41 filter to the CelebA dataset. However, to train the cCycleGAN, the synthetic sketch and photo images are used in an unpaired fashion.

3.6.2. Results on FERET and WVU Multi-modal
Sketches from the FERET dataset are used together with frontal face images from the WVU Multi-modal dataset to train the proposed cCycleGAN. Since there are no facial attributes associated with the color images of the WVU Multi-modal dataset, we have classified them based on their skin colors. Consequently, the skin color is the only attribute which we can control during the sketch-photo synthesis. Therefore, the input to the sketch-photo generator has two channels, including a gray-scale sketch image, x, and a single attribute channel, a, for the skin color.
Fig. 10.: Sketch-based photo synthesis of hand-drawn test sketches (FERET dataset). Our network adapts the synthesis results to satisfy different skin colors (white, brown, black).
The sketch images are normalized to lie in the [−1, 1] range. Similarly, the skin color attribute takes the values -1, 0, and 1 for the black, brown and white skin colors, respectively. Figure 10 shows the results of the cCycleGAN after 200 epochs on the test data. The three skin color classes are not represented equally in the dataset, which noticeably biased the results towards the lighter skin colors.

3.6.3. Results on CelebA and synthesized sketches
Preliminary results reveal that the CycleGAN training can become unstable when there is a significant difference, such as differences in scale and face poses, between the source and target datasets. The easy task of the discriminator in differentiating between the synthesized and real photos in these cases could account for this instability. Consequently, we generated a synthetic sketch dataset as a replacement for the FERET dataset. Among the 40 attributes provided in the CelebA dataset, we have selected the six most relevant ones in terms of their visual impact on the sketch-photo synthesis, including black hair, blond hair, brown hair, gray hair, male, and pale skin. Therefore, the input to the sketch-photo generator has seven channels, including a gray-scale sketch image, x, and six attribute channels, a. Since the attributes in the CelebA dataset are binary, we have chosen -1 for a missing attribute and 1 for an attribute which is supposed to be present in the synthesized photo. Figure 11 shows the results of the cCycleGAN after 50 epochs on the test data. The trained network can follow the desired attributes and force them onto the synthesized photo.

3.6.4. Evaluation of synthesized photos with a face verifier
For the sake of evaluation, we utilized a VGG16-based face verifier pre-trained on the CMU Multi-PIE dataset. To evaluate the proposed algorithm, we first selected the identities which had more than one photo in the testing set.
Fig. 11.: Attribute-guided sketch-based photo synthesis of synthetic test sketches from the CelebA dataset. Our network can adapt the synthesis results to satisfy the desired attributes.

Table 3.: Verification performance of the proposed cCycle-GAN vs. the cycle-GAN.

Method        Accuracy (%)
cycle-GAN     61.34 ± 1.05
cCycle-GAN    65.53 ± 0.93
Then, for each identity, one photo is randomly added to the test gallery, and a synthetic sketch corresponding to another photo of the same identity is added to the test probe set. Finally, every probe synthetic sketch is given to our attribute-guided sketch-photo synthesizer, and the resulting synthesized photos are used for face verification against the entire test gallery. This evaluation process was repeated 10 times. Table 3 depicts the face verification accuracies of the proposed attribute-guided approach and the results of the original cycle-GAN on the CelebA dataset. The results of our proposed network improve significantly on the original cycle-GAN, which uses no attribute information.

4. Discussion
In this chapter, two distinct frameworks are introduced to enable employing facial attributes in cross-modal face verification and synthesis. The experimental results show the superiority of the proposed attribute-guided frameworks compared to the state-of-the-art techniques. To incorporate facial attributes in cross-modal face verification, we introduced an attribute-centered loss to train a coupled deep neural network that learns a shared embedding space between the two modalities, in which both geometrical and facial attribute information cooperate in the similarity score calculation. To this end, a distinct center point is constructed for every combination of the facial attributes used in the sketch-attribute-DCNN, by leveraging the facial attributes of the suspect provided by the victims, and the photo-DCNN learns to map its inputs close to their corresponding attribute centers.
To incorporate facial attributes in an unpaired face sketch-photo synthesis problem, an additional auxiliary attribute discriminator was proposed with an appropriate loss to force the desired facial attributes on the output of the generator. The pair of a real face photo from the training data with a set of false attributes defined a new fake input to the attribute discriminator, in addition to the pair of the generator's output and a set of random attributes.

References
1. X. Wang and X. Tang, Face photo-sketch synthesis and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence. 31(11), 1955–1967, (2009).
2. Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma. A nonlinear approach for face sketch synthesis and recognition. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 1005–1010. IEEE, (2005).
3. S. Ouyang, T. M. Hospedales, Y.-Z. Song, and X. Li. Forgetmenot: memory-aware forensic facial sketch matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5571–5579, (2016).
4. Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras, Face relighting from a single image under arbitrary unknown lighting conditions, IEEE Transactions on Pattern Analysis and Machine Intelligence. 31(11), 1968–1984, (2009).
5. H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, Memetically optimized mcwld for matching sketches with digital face images, IEEE Transactions on Information Forensics and Security. 7(5), 1522–1535, (2012).
6. B. Klare and A. K. Jain, Sketch-to-photo matching: a feature-based approach, Proc. Society of Photo-Optical Instrumentation Engineers Conf. Series. 7667, (2010).
7. W. Zhang, X. Wang, and X. Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 513–520. IEEE, (2011).
8. C. Galea and R. A. Farrugia, Forensic face photo-sketch recognition using a deep learning-based architecture, IEEE Signal Processing Letters. 24(11), 1586–1590, (2017).
9. S. Nagpal, M. Singh, R. Singh, M. Vatsa, A. Noore, and A. Majumdar, Face sketch matching via coupled deep transform learning, arXiv preprint arXiv:1710.02914. (2017).
10. Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In Biometrics (ICB), 2016 International Conference on, pp. 1–7. IEEE, (2016).
11. A. Dantcheva, P. Elia, and A. Ross, What else does your biometric data reveal? a survey on soft biometrics, IEEE Transactions on Information Forensics and Security. 11(3), 441–467, (2016).
12. H. Kazemi, M. Iranmanesh, A. Dabouei, and N. M. Nasrabadi. Facial attributes guided deep sketch-to-photo synthesis. In Applications of Computer Vision (WACV), 2018 IEEE Workshop on. IEEE, (2018).
13. B. F. Klare, S. Klum, J. C. Klontz, E. Taborsky, T. Akgul, and A. K. Jain. Suspect identification based on descriptive facial attributes. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pp. 1–8. IEEE, (2014).
14. P. Mittal, A. Jain, G. Goswami, M. Vatsa, and R. Singh, Composite sketch recognition using saliency and attribute feedback, Information Fusion. 33, 86–99, (2017).
15. W. Liu, X. Tang, and J. Liu. Bayesian tensor inference for sketch-based facial photo hallucination. pp. 2141–2146, (2007).
16. X. Gao, N. Wang, D. Tao, and X. Li, Face sketch–photo synthesis and retrieval using sparse representation, IEEE Transactions on Circuits and Systems for Video Technology. 22(8), 1213–1226, (2012).
17. J. Zhang, N. Wang, X. Gao, D. Tao, and X. Li. Face sketch-photo synthesis based on support vector regression. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pp. 1125–1128. IEEE, (2011).
18. N. Wang, D. Tao, X. Gao, X. Li, and J. Li, Transductive face sketch-photo synthesis, IEEE Transactions on Neural Networks and Learning Systems. 24(9), 1364–1376, (2013).
19. B. Xiao, X. Gao, D. Tao, and X. Li, A new approach for face recognition by sketches in photos, Signal Processing. 89(8), 1576–1588, (2009).
20. Y. Güçlütürk, U. Güçlü, R. van Lier, and M. A. van Gerven. Convolutional sketch inversion. In European Conference on Computer Vision, pp. 810–824. Springer, (2016).
21. L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang. End-to-end photo-sketch generation via fully convolutional representation learning. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pp. 627–634. ACM, (2015).
22. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, (2014).
23. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with conditional adversarial networks, arXiv preprint arXiv:1611.07004. (2016).
24. P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays, Scribbler: Controlling deep image synthesis with sketch and color, arXiv preprint arXiv:1612.00835. (2016).
25. J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Springer, (2016).
26. D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pp. 1349–1357, (2016).
27. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, arXiv preprint arXiv:1703.10593. (2017).
28. J. Zhu, S. Liao, D. Yi, Z. Lei, and S. Z. Li. Multi-label CNN based pedestrian attribute learning for soft biometrics. In Biometrics (ICB), 2015 International Conference on, pp. 535–540. IEEE, (2015).
29. Q. Guo, C. Zhu, Z. Xia, Z. Wang, and Y. Liu, Attribute-controlled face photo synthesis from simple line drawing, arXiv preprint arXiv:1702.02805. (2017).
30. X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Springer, (2016).
31. G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez, Invertible conditional GANs for image editing, arXiv preprint arXiv:1611.06355. (2016).
32. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Springer, (2016).
33. R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, pp. 1735–1742. IEEE, (2006).
34. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, (2015).
35. C. Qi and F. Su, Contrastive-center loss for deep neural networks, arXiv preprint arXiv:1707.07391. (2017).
36. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556. (2014).
37. X. Tang and X. Wang. Face sketch synthesis and recognition. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 687–694. IEEE, (2003).
38. H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa. Memetic approach for matching sketches with digital face images. Technical report, (2012).
39. H. Han, B. F. Klare, K. Bonnen, and A. K. Jain, Matching composite sketches to face photos: A component-based approach, IEEE Transactions on Information Forensics and Security. 8(1), 191–204, (2013).
40. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), (2015).
41. H. Winnemöller, J. E. Kyprianidis, and S. C. Olsen, XDoG: an extended difference-of-Gaussians compendium including advanced image stylization, Computers & Graphics. 36(6), 740–753, (2012).
42. F. L. Bookstein, Principal warps: Thin-plate splines and the decomposition of deformations, IEEE Transactions on Pattern Analysis and Machine Intelligence. 11(6), 567–585, (1989).
43. A. Dabouei, H. Kazemi, M. Iranmanesh, and N. M. Nasrabadi. Fingerprint distortion rectification using deep convolutional neural networks. In Biometrics (ICB), 2018 International Conference on. IEEE, (2018).
44. Biometrics and Identification Innovation Center, WVU multi-modal dataset. Available at http://biic.wvu.edu/.
45. A. P. Founds, N. Orlans, W. Genevieve, and C. I. Watson, NIST special database 32 - Multiple Encounter Dataset II (MEDS-II), NIST Interagency/Internal Report (NISTIR)-7807. (2011).
46. C. Peng, X. Gao, N. Wang, and J. Li, Sparse graphical representation based discriminant analysis for heterogeneous face recognition, arXiv preprint arXiv:1607.00137. (2016).
47. P. Mittal, A. Jain, G. Goswami, R. Singh, and M. Vatsa. Recognizing composite sketches with digital face images via SSD dictionary. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pp. 1–6. IEEE, (2014).
48. P. Mittal, M. Vatsa, and R. Singh. Composite sketch recognition via deep network - a transfer learning approach. In Biometrics (ICB), 2015 International Conference on, pp. 251–256. IEEE, (2015).
49. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, (2016).
50. P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence. 22(10), 1090–1104, (2000).
51. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, (2015).
CHAPTER 2.10

CONNECTED AND AUTONOMOUS VEHICLES IN THE DEEP LEARNING ERA: A CASE STUDY ON COMPUTER-GUIDED STEERING

Rodolfo Valiente(a), Mahdi Zaman(a), Yaser P. Fallah(a), Sedat Ozer(b)
(a) Connected and Autonomous Vehicle Research Lab (CAVREL), University of Central Florida, Orlando, FL, USA
(b) Bilkent University, Ankara, Turkey
[email protected]

Connected and Autonomous Vehicles (CAVs) are typically equipped with multiple advanced on-board sensors generating a massive amount of data. Utilizing and processing such data to improve the performance of CAVs is a current research area. Machine learning techniques are effective ways of exploiting such data in many applications with many demonstrated success stories. In this chapter, first, we provide an overview of recent advances in applying machine learning in the emerging area of CAVs, including particular applications, and highlight several open issues in the area. Second, as a case study and a particular application, we present a novel deep learning approach to control the steering angle for cooperative self-driving cars capable of integrating both local and remote information. In that application, we tackle the problem of utilizing multiple sets of images shared between two autonomous vehicles to improve the accuracy of controlling the steering angle by considering the temporal dependencies between the image frames. This problem has not been studied widely in the literature. We present and study a new deep architecture to predict the steering angle automatically. Our deep architecture is an end-to-end network that utilizes Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and fully connected (FC) layers; it processes both present and future images (shared by a vehicle ahead via Vehicle-to-Vehicle (V2V) communication) as input to control the steering angle. In our simulations, we demonstrate that using a combination of perception and communication systems can improve the robustness and safety of CAVs. Our model demonstrates the lowest error when compared to the other existing approaches in the literature.
1. Introduction
It is estimated that by the end of the next decade, most vehicles will be equipped with powerful sensing capabilities and on-board units (OBUs) enabling multiple communication types, including in-vehicle communications, vehicle-to-vehicle (V2V) communications and vehicle-to-infrastructure (V2I) communications.
Fig. 1. The overview of our proposed vehicle-assisted end-to-end system. Vehicle 2 (V2) sends its information to Vehicle 1 (V1) over V2V communication. V1 combines that information with its own information to control the steering angle. The prediction is made through our CNN+LSTM+FC network (see Fig. 2 for the details of our network).
As vehicles become more aware of their environments and evolve towards full autonomy, the concept of connected and autonomous vehicles (CAVs) becomes more crucial. Recently, CAVs have gained substantial momentum to bring a new level of connectivity to vehicles. Along with novel on-board computing and sensing technologies, CAVs serve as a key enabler for Intelligent Transport Systems (ITS) and smart cities. CAVs are increasingly equipped with a wide variety of sensors, such as engine control units, radar, light detection and ranging (LiDAR), and cameras, to help a vehicle perceive the surrounding environment and monitor its own operation status in real-time. By utilizing high-performance computing and storage facilities, CAVs can keep generating, collecting, sharing and storing large volumes of data. Such data can be exploited to improve the robustness and safety of CAVs. Artificial intelligence (AI) is an effective approach to exploit, analyze and use such data. However, how to mine such data remains a challenge in many ways and an open research direction. Among the existing issues, robust control of the steering angle is one of the most difficult and important problems for autonomous vehicles [1–3]. Recent computer vision-based approaches to control the steering angle in autonomous cars mostly focus on improving the driving accuracy with the local data collected from the sensors on the same vehicle, and as such, they consider each car as an isolated unit gathering and processing information locally.
However, as the availability and utilization of V2V communication increase, real-time data sharing becomes more feasible among vehicles [4–6]. As such, new algorithms and approaches are needed that can utilize the potential of cooperative environments to improve the accuracy of controlling the steering angle automatically [7]. One objective of this chapter is to bring more attention to this emerging field, since the research on applying AI in CAVs is still a growing area. We identify and discuss major challenges and applications of AI in perception/sensing, in communications and in user experience for CAVs. In particular, we discuss in greater detail and present a deep learning-based approach that differs from other approaches. It utilizes two sets of images (data): one coming from the on-board sensors and one coming from another vehicle ahead over V2V communication, to control the steering angle in self-driving vehicles automatically (see Fig. 1). Our proposed deep architecture contains a convolutional neural network (CNN) followed by a Long Short-Term Memory (LSTM) and a fully connected (FC) network. Unlike the older approach that manually decomposes the autonomous driving problem into different components, as in [8, 9], the end-to-end model can directly steer the vehicle from the camera data and has been proven to operate effectively in previous works [1, 10]. We compare our proposed deep architecture to multiple existing algorithms in the literature on the Udacity dataset. Our experimental results demonstrate that our proposed CNN-LSTM-based model yields state-of-the-art results. Our main contributions are: (1) we provide a survey of AI applications in the emerging area of CAVs and highlight several open issues for further research; (2) we propose an end-to-end vehicle-assisted steering angle control system for cooperative vehicles using a large sequence of images; (3) we introduce a new deep architecture that yields state-of-the-art results on the Udacity dataset; (4) we demonstrate that integrating the data obtained from other vehicles via a V2V communication system improves the accuracy of predicting the steering angle for CAVs.
2. Related Work: AI Applications in CAVs
As the automotive industry transforms, data remains at the heart of CAVs' evolution [11, 12]. To take advantage of the data, efficient methods are needed to interpret and mine massive amounts of data and to improve the robustness of self-driving cars. Most of the relevant attention is given to AI-based techniques, as many recent deep learning-based techniques have demonstrated promising performance in a wide variety of applications in vision, speech recognition and natural language areas by significantly improving the state-of-the-art performance [13, 14]. Similar to many other areas, deep learning-based techniques are also providing promising and improved results in the area of CAVs [14, 15]. For instance, the problem of navigating a self-driving car with the acquired sensory data has been studied in the literature with and without using end-to-end approaches [16]. The earlier works, such as the ones from [17] and [18], use multiple components for recognizing objects of safe-driving concern, including lanes, vehicles, traffic signs, and pedestrians.
The recognition results are then combined to give a reliable world representation, which is used with an AI system to make decisions and control the car. More recent approaches focus on using deep learning-based techniques. For example, Ng et al. [19] utilized a CNN for vehicle and lane detection. Pomerleau [20] used a NN to automatically train a vehicle to drive by observing the input from a camera. Dehghan et al. [21] present a vehicle make, model and color recognition (MMCR) system that relies on a deep CNN. To automatically control the steering angle, recent works focus on using neural network-based end-to-end approaches [22]. The Autonomous Land Vehicle in a Neural Network (ALVINN) system was one of the earlier systems utilizing a multilayer perceptron [23] in 1989. Recently, CNNs were commonly used, as in the DAVE-2 Project [1]. In [3], the authors proposed an end-to-end trainable C-LSTM network that uses an LSTM network at the end of the CNN network. A similar approach was taken by the authors in [24], where the authors designed a 3D CNN model with residual connections and LSTM layers. Other researchers also implemented different variants of convolutional architectures for end-to-end models, as in [25–27]. Another widely used approach for controlling the vehicle steering angle in autonomous systems is via sensor fusion, where image data are combined with other sensor data such as LiDAR, RADAR and GPS to improve the accuracy of autonomous operations [28, 29]. For instance, in [26], the authors designed a fusion network using both image features and LiDAR features based on VGGNet. There has been significant progress by using AI on several cooperative and connected vehicle related issues, including network congestion, intersection collision warning, wrong-way driving warning, remote diagnostics of vehicles, etc. For instance, a centrally controlled approach to manage network congestion at intersections has been presented by [30] with the help of a specific unsupervised learning algorithm, k-means clustering. The approach addresses the congestion problem when vehicles stop at a red light in an intersection, where the road side infrastructures observe the wireless channels to measure and control channel congestion. CoDrive [31] proposes an AI cooperative system for an open-car ecosystem, where cars collaborate to improve the positioning of each other. CoDrive results in a precise reconstruction of a traffic scene, preserving both its shape and size. The work in [32] uses the received signal strength of the packets received by the Road Side Units (RSUs) and sent by the vehicles on the roads to predict the position of vehicles. To predict the position of vehicles, they adopted a cooperative machine-learning methodology and compared three widely recognized techniques: K Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest. In CVS systems, drivers' behaviors and online, real-time decision making on the road directly affect the performance of the system. However, the behaviors of human drivers are highly unpredictable compared to pre-programmed driving assistance systems, which makes it hard for CAVs to make a prediction based on human behavior, and that creates another important issue in the area.
Many AI applications have been proposed to tackle that issue, including [5, 6, 33]. For example, in [33], a two-stage data-driven approach has been proposed: (I) classify driving patterns of on-road surrounding vehicles using Gaussian mixture models (GMM); and (II) predict vehicles' short-term lateral motions based on real-world vehicle mobility data. Sekizawa et al. [34] developed a stochastic switched auto-regressive exogenous model to predict the collision avoidance behavior of drivers using simulated driving data in a virtual reality system. Chen et al., in 2018, designed a visibility-based collision warning system that uses neural networks to build four models for predicting vehicle rear-end collisions in a low-visibility environment [35]. With historical traffic data, Jiang and Fei, in 2016, employed neural network models to predict average traffic speeds of road segments and a forward-backward algorithm on Hidden Markov models to predict the speeds of an individual vehicle [36]. For the prediction of drivers' maneuvers, Yao et al., in 2013, developed a parametric lane change trajectory prediction approach based on real human lane change data. This method generated a similar parametric trajectory according to the k-nearest real lane change instances [37]. In [38], an online learning-based approach is proposed to predict lane change intention, which incorporates SVM and Bayesian filtering. Liebner et al. developed a prediction approach for lateral motion at urban intersections with and without the presence of preceding vehicles [39]. The study focused on the parameter of the longitudinal velocity and the appearance of preceding vehicles. In [40], a multilayer perceptron approach is proposed to predict the probability of lane changes by surrounding vehicles and their trajectories based on the history of the vehicles' positions and their current positions. Woo et al., in 2017, constructed a lane change prediction method for surrounding vehicles. The method employed SVM to classify driver intention classes based on a feature vector and used the potential field method to predict the trajectory [41]. With the purpose of improving the driver experience in CAVs, works have been proposed on predictive maintenance, automotive insurance (to speed up the process of filing claims when accidents occur), car manufacturing improved by AI, and driver behavior monitoring, identification, recognition and alerting [42]. Other works focus on eye gaze, eye openness, and distracted driving detection, alerting the driver to keep their eyes on the road [43]. Some advanced AI facial recognition algorithms are used to allow access to the vehicle and detect which driver is operating the vehicle; the system can then automatically adjust the seat, mirrors, and temperature to suit the individual. For example, [44] presents a deep face detection vehicle system for driver identification that can be used in access control policies. These systems have been devised to provide customers a greater user experience and to ensure safety on the roads. Other works focus on traffic flow prediction, traffic congestion alleviation, fuel consumption reduction, and various location-based services. For instance, in [45], a probabilistic graphical model, Poisson regression trees (PRT), has been used for two correlated tasks: LTE communication connectivity prediction and vehicular traffic prediction.
A novel deep-learning-based traffic flow prediction method based on a stacked auto-encoder model has further been proposed in [46], where auto-encoders are used as building blocks to represent traffic flow features for prediction and achieve a significant performance improvement.

3. Relevant Issues
The notion of a "self-driving vehicle" has been around for quite a long time now, yet the fact that a fully automated vehicle is not available for sale has created some confusion [47–49]. To put the concept into a measurable degree, the United States Department of Transportation's (USDOT) National Highway Traffic Safety Administration (NHTSA) defines 6 levels of automation [47]. They released this classification for smooth standardization and to measure the safety ratings for AVs. The levels span from 0 to 5, where level 0 refers to no automation at all, in which the human driver does all the control and maneuvering. In level 1 of automation, an Advanced Driver Assistance System (ADAS) helps the human driver with either control (i.e., accelerating, braking) or maneuvering (steering) in certain circumstances, albeit not both simultaneously. Adaptive Cruise Control (ACC) falls under this level of automation, as it can vary the power to maintain the user-set speed, but the automated control is limited to maintaining the speed, not the lateral movement. At the next level of automation (Level 2, Partial Automation), the ADAS is capable of controlling and maneuvering simultaneously, but only under certain circumstances. So the human driver still has to monitor the vehicle's surroundings and perform the rest of the controls when required. At level 3 (Conditional Automation), the ADAS does not require the human driver to monitor the environment all the time. In certain circumstances, the ADAS is fully capable of performing all parts of the driving task. The range of safe-automation scenarios is larger at this level than at level 2. However, the human driver should still be ready to regain control when the system asks for it in such circumstances. In all other scenarios, the control is left to the human driver. Level 4 of automation is called "High Automation", as the ADAS at level 4 can take control of the vehicle in most scenarios and a human driver is not essentially required to take control from the system. But in critical weather, where the sensor information might be noisier (e.g., in rain or snow), the system may disable the automation for safety concerns, requiring the human driver to perform all of the driving tasks [47–49]. Currently, many private-sector car companies and investors are testing and analyzing their vehicles at the level 4 standard but putting a safety driver behind the wheel, which necessarily brings the safety testing down to levels 2 and 3. All the automakers and investors are currently putting their research and development efforts towards eventually reaching level 5, which refers to full automation where the system is capable of performing all of the driving tasks without requiring any human takeover under any circumstance [47, 48]. The applications presented in Section 2 show a promising future for data-driven deep learning algorithms in CAVs.
However, the jump from level 2 to levels 3, 4 and 5 is substantial from the AI perspective, and naively applying existing deep learning methods is currently insufficient for full automation due to the complex and unpredictable nature of CAVs [48, 49]. For example, in level 3 (Conditional Driving Automation) the vehicles need to have intelligent environmental detection capabilities and be able to make informed decisions for themselves, such as accelerating past a slow-moving vehicle; in level 4 (High Driving Automation) the vehicles have full automation in specific controlled areas and can even intervene if things go wrong or there is a system failure; and finally, in level 5 (Full Driving Automation) the vehicles do not require human attention at all [42, 47]. Therefore, how to adapt existing solutions to better handle such requirements remains a challenging task. In this section, we identify some research topics for further investigation, and in particular we propose one new approach to the control of the steering angle for CAVs. There are various open research problems in the area [11, 42, 48, 49]. For instance, further work can be done on the detection of the driver's physical movements and posture, such as eye gaze, eye openness, and head position, to detect and alert a distracted driver with lower latency [11, 12, 50]. An upper-body detector can detect the driver's posture and, in case of a crash, airbags can be deployed in a manner that will reduce injury based on how the driver is sitting [51]. Similarly, detecting the driver's emotion can also help with the decision making [52]. Connected vehicles can use an Autonomous Driving Cloud (ADC) platform that will allow data to have need-based availability [11, 14, 15]. The ADC can use AI algorithms to make meaningful decisions. It can act as the control policy or the brain of the autonomous vehicle. This intelligent agent can also be connected to a database which acts as a memory where past driving experiences are stored [53]. This data, along with the real-time input coming in through the autonomous vehicle about the immediate surroundings, will help the intelligent agent make accurate driving decisions. On the vehicular network side, AI can exploit multiple sources of data generated and stored in the network (e.g., vehicle information, driver behavior patterns, etc.) to learn the dynamics in the environment [4–6] and then extract appropriate features to use for the benefit of many tasks for communication purposes, such as signal detection, resource management, and routing [11, 15]. However, it is a non-trivial task to extract semantic information from the huge amount of accessible data, which might have been contaminated by noise or redundancy, and thus information extraction needs to be performed [11, 53]. In addition, in vehicular networks, data are naturally generated and stored across different units in the network [15, 54] (e.g., RVs, RSUs, etc.). This brings challenges to the applicability of most existing machine learning algorithms, which have been developed under the assumption that data are centrally controlled and easily accessible [11, 15]. As a result, distributed learning methods are desired in CAVs that act on partially observed data and have the ability to exploit information obtained from other entities in the network [7, 55].
Furthermore, additional overheads incurred by the coordination and sharing of information among various units in vehicular networks for distributed learning shall be properly accounted for to make the system work effectively [11]. In particular, an open area for further research is how to integrate the information from local sensors (perception) and remote information (cooperative) [7]. We present a novel application in this chapter that steps into this domain. For instance, for controlling the steering angle, all the above-listed works focus on utilizing data obtained from the on-board sensors and do not consider the assisted data that comes from another car. In the following section, we demonstrate that using additional data that comes from the vehicle ahead helps us obtain better accuracy in controlling the steering angle. In our approach, we utilize the information that is available to a vehicle ahead of our car to control the steering angle.

4. A Case Study: Our Proposed Approach
We consider the control of the steering angle as a regression problem where the input is a stack of images and the output is the steering angle. Our approach can also process each image individually. Considering multiple frames in a sequence can benefit us in situations where the present image alone is affected by noise or contains less useful information, such as when the current image is largely burnt by direct sunlight. In such situations, the correlation between the current frame and the past frames can be useful to decide the next steering value. We use an LSTM to utilize multiple images as a sequence. The LSTM has a recursive structure acting as a memory, through which the network can keep some past information to predict the output based on the dependency of the consecutive frames [56, 57]. Our proposed idea in this chapter relies on the fact that the condition of the road ahead has already been seen by another vehicle recently, and we can utilize that information to control the steering angle of our car, as discussed above. Fig. 1 illustrates our approach. In the figure, Vehicle 1 receives a set of images from Vehicle 2 over V2V communication and keeps the data in the on-board buffer. It combines the received data with the data obtained from the on-board camera and processes those two sets of images on-board to control the steering angle via an end-to-end deep architecture. This method enables the vehicle to look ahead of its current position at any given time. Our deep architecture is presented in Fig. 2. The network takes the set of images coming from both vehicles as input and predicts the steering angle as the regression output. The details of our deep architecture are given in Table 1. Since we construct this problem as a regression problem with a single unit at the end, we use the Mean Squared Error (MSE) loss function in our network during training.
Fig. 2. CNN + LSTM + FC Image sharing model. Our model uses 5 convolutional layers, followed by 3 LSTM layers, followed by 4 FC layers. See Table 1 for further details of our proposed architecture.
Table 1. Details of Proposed Architecture

Layer   Type     Size               Stride   Activation
0       Input    640*480*3*2X       -        -
1       Conv2D   5*5, 24 Filters    (5,4)    ReLU
2       Conv2D   5*5, 32 Filters    (3,2)    ReLU
3       Conv2D   5*5, 48 Filters    (5,4)    ReLU
4       Conv2D   5*5, 64 Filters    (1,1)    ReLU
5       Conv2D   5*5, 128 Filters   (1,2)    ReLU
6       LSTM     64 Units           -        Tanh
7       LSTM     64 Units           -        Tanh
8       LSTM     64 Units           -        Tanh
9       FC       100                -        ReLU
10      FC       50                 -        ReLU
11      FC       10                 -        ReLU
12      FC       1                  -        Linear
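A minimal tf.keras sketch of the network in Table 1 is given below for illustration. The filter counts, strides, LSTM sizes and the single linear output follow the table; the sequence length (x = 8 frames per vehicle), the 'same' padding and the optimizer settings are assumptions, since the table does not specify them.

```python
# Illustrative tf.keras sketch of the CNN+LSTM+FC model in Table 1.
# Assumes x = 8 frames per vehicle (16 frames per sample) and 'same'
# padding; these details are not stated in the table.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN = 16            # 2x frames: 8 own frames + 8 frames shared via V2V
H, W, C = 480, 640, 3   # image height, width and channels

def build_model():
    frames = layers.Input(shape=(SEQ_LEN, H, W, C))

    # Per-frame CNN feature extractor (layers 1-5 in Table 1),
    # applied to every frame in the sequence.
    cnn = models.Sequential([
        layers.Conv2D(24, 5, strides=(5, 4), padding='same',
                      activation='relu', input_shape=(H, W, C)),
        layers.Conv2D(32, 5, strides=(3, 2), padding='same', activation='relu'),
        layers.Conv2D(48, 5, strides=(5, 4), padding='same', activation='relu'),
        layers.Conv2D(64, 5, strides=(1, 1), padding='same', activation='relu'),
        layers.Conv2D(128, 5, strides=(1, 2), padding='same', activation='relu'),
        layers.Flatten(),
    ])
    x = layers.TimeDistributed(cnn)(frames)

    # Temporal modeling over the frame sequence (layers 6-8).
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.LSTM(64)(x)

    # Fully connected regression head (layers 9-12); a single linear
    # unit outputs the steering angle, trained with the MSE loss.
    x = layers.Dense(100, activation='relu')(x)
    x = layers.Dense(50, activation='relu')(x)
    x = layers.Dense(10, activation='relu')(x)
    angle = layers.Dense(1, activation='linear')(x)

    model = models.Model(frames, angle)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
    return model
```

A call such as model.fit(samples, angles, batch_size=64) would then train this sketch with the minibatch size of 64 reported in Section 5.6; the number of epochs and data pipeline are left unspecified here.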
5. Experiment Setup
In this section, we elaborate further on the dataset as well as data pre-processing and evaluation metrics. We conclude the section with details of our implementation.

5.1. Dataset
In order to compare our results to existing work in the literature, we used the self-driving car dataset by Udacity. The dataset has a wide variation of 100K images from simultaneous center, left and right cameras on a vehicle, collected in sunny and overcast weather; 33K images belong to the center camera. The dataset contains the data of 5 different trips with a total drive time of 1694 seconds.
Fig. 3. The angle distribution within the entire Udacity dataset (angle in radians vs. total number of frames), just angles between -1 and 1 radians are shown.
The test vehicle has 3 cameras mounted as in [1], collecting images at a rate of about 20 Hz. Steering wheel angle, acceleration, brake and GPS data were also recorded. The distribution of the steering wheel angles over the entire dataset is shown in Fig. 3. As shown in Fig. 3, the dataset distribution includes a wide range of steering angles. The image size is 480*640*3 pixels, and the total dataset is 3.63 GB. Since there is currently no dataset available with V2V communication images, here we simulate the environment by creating a virtual vehicle that is moving ahead of the autonomous vehicle and sharing camera images, using the Udacity dataset. The Udacity dataset has been used widely in the recent relevant literature [24, 58], and we also use it in this chapter to compare our results to the existing techniques in the literature. Along with the steering angle, the dataset contains spatial (latitude, longitude, altitude) and dynamic (angle, torque, speed) information labelled with each image. The data format for each image is: index, timestamp, width, height, frame id, filename, angle, torque, speed, latitude, longitude, altitude. For our purpose, we are only using the sequence of center-camera images.

5.2. Data Preprocessing
The images in the dataset are recorded at a rate of around 20 frames per second. Therefore, there is usually a large overlap between consecutive frames. To avoid overfitting, we used image augmentation to get more variance in our image dataset. Our image augmentation technique randomly adds brightness and contrast to change pixel values, as sketched below. We also tested image cropping to exclude possible redundant information that is not relevant in our application. However, in our tests the models performed better without cropping.
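The random brightness and contrast perturbation described above could look like the following; this is only an illustrative sketch assuming TensorFlow, and the perturbation ranges are assumed values, not those used by the authors.

```python
# Sketch of the brightness/contrast augmentation, assuming TensorFlow;
# the delta and contrast ranges are illustrative values only.
import tensorflow as tf

def augment(image):
    # Randomly perturb brightness and contrast to vary pixel values while
    # leaving the steering label unchanged.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)   # assumes images scaled to [0, 1]
```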
For the sequential model implementation, we preprocessed the data in a different way. Since we want to keep the visual sequential relevance in the series of frames while avoiding overfitting, we shuffle the dataset while keeping track of the sequential information. We then train our model with 80% of the images, preserving the sequence order within the subsets, and validate on the remaining 20%.

5.3. Vehicle-assisted Image Sharing
Modern wireless technology allows us to share data between vehicles at high bitrates of up to Gbits/s (e.g., in peer-to-peer and line-of-sight mmWave technologies [54, 59–61]). Such communication links can be utilized to share images between vehicles for improved control. In our experiments, we simulate that situation between two vehicles as follows: we assume that both vehicles are Δt seconds apart from each other. We take the x consecutive frames (t, t−1, ..., t−x+1) from the self-driving vehicle (vehicle 1) at time step t and the set of images containing x future frames starting at (t + Δt) from the other vehicle. Thus, a single input sample contains a set of 2x frames for the model.

5.4. Evaluation Metrics
The steering angle is a continuous variable predicted for each time step over the sequential data, and the metrics mean absolute error (MAE) and root mean squared error (RMSE) are two of the most commonly used metrics in the literature to measure the effectiveness of controlling systems. For example, RMSE is used in [24, 58] and MAE in [62]. Both MAE and RMSE express the average model prediction error, and their values can range from 0 to ∞. They both are indifferent to the sign of the error. Lower values are better for both metrics.
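The sample construction of Section 5.3 and the two metrics of Section 5.4 can be illustrated with the short sketch below; the array layout, the conversion of Δt into a frame offset, and the function names are assumptions for illustration only.

```python
# Sketch of the 2x-frame sample construction (Section 5.3) and the
# MAE/RMSE metrics (Section 5.4). NumPy assumed; names are illustrative.
import numpy as np

def build_sample(frames, t, x=8, delta_frames=30):
    # frames: chronologically ordered images, shape (N, H, W, 3).
    # Ego vehicle: x consecutive frames (t-x+1, ..., t).
    ego = frames[t - x + 1 : t + 1]
    # Vehicle ahead: x "future" frames starting at t + delta_t, simulated
    # here as an offset of delta_frames into the same stream (assumed value).
    remote = frames[t + delta_frames : t + delta_frames + x]
    return np.concatenate([ego, remote], axis=0)       # shape (2x, H, W, 3)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```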
[Fig. 4 diagram: block architectures of the five baseline models (Models A–E), built from combinations of 2D convolutional layers, a pretrained ResNet, 3D convolutional layers, LSTM layers, and fully connected (FC) layers, with image inputs taken at time steps t and t + Δt.]
Fig. 4. An overview of the baseline models used in this chapter. The details of each model can be found in their respective source papers.
5.5. Baseline Networks

As baselines, we include multiple deep architectures that have been proposed in the literature for comparison with our proposed algorithm. The models from [1, 24] and [58] are, to the best of our knowledge, the best reported camera-only approaches in the literature. In total, we chose five baseline end-to-end algorithms to compare our results against; we refer to them as models A, B, C, D and E in the rest of this chapter. Model A is our implementation of the model presented in [1], Models B and C are the proposals of [24], and Models D and E are reproduced as in [58]. An overview of these models is given in Fig. 4. Model A uses a CNN-based network, while Model B combines an LSTM with a 3D-CNN and uses 25 time-steps as input. Model C is based on the ResNet [63] model, and Model D uses the difference image of two given time-steps as input to a CNN-based network. Finally, Model E uses the concatenation of two images from different time-steps as input to a CNN-based network.

5.6. Implementation and Hyperparameter Tuning

We use Keras with the TensorFlow backend in our implementations. Final training is done on two NVIDIA Tesla V100 16GB GPUs. On our final system, training took 4 hours for the model in [1] and between 9 and 12 hours for the deeper networks used in [24] and [58] and for our proposed network. We used the Adam optimizer in all experiments with the following final parameters: learning rate = 10−3, β1 = 0.900, β2 = 0.999, ε = 10−8. For the learning rate, we tested values from 10−1 to 10−6 and found 10−3 to perform best. We also studied the effect of the minibatch size on our application: minibatch sizes of 128, 64 and 32 were tested, and 64 yielded the best results, so we used 64 in all experiments reported in this chapter. Fig. 5 shows how the value of the loss function changes as the number of epochs increases for both the training and validation sets. The MSE loss decreases rapidly over the first few epochs and then stabilizes, remaining almost constant around the 14th epoch.
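To make the training setup above concrete, the Keras sketch below shows how the reported optimizer settings (Adam with learning rate 10−3, β1 = 0.9, β2 = 0.999, ε = 10−8), an MSE loss, and a minibatch size of 64 might be wired together. The model-building function and array names are hypothetical placeholders, not the authors' released code.

    from tensorflow import keras

    def compile_model(model: keras.Model) -> keras.Model:
        """Attach the optimizer and loss settings described in the text."""
        optimizer = keras.optimizers.Adam(
            learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
        # MSE loss matches the training curves in Fig. 5; MAE is tracked as a metric.
        model.compile(optimizer=optimizer, loss="mse", metrics=["mae"])
        return model

    # Hypothetical usage: build_network() would construct the CNN/LSTM/FC model,
    # and x_train / y_train hold the stacked input frames and steering angles.
    # model = compile_model(build_network())
    # model.fit(x_train, y_train, batch_size=64, epochs=15, validation_split=0.2)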
6. Analysis and Results

Table 2 lists the comparison of RMSE values for multiple end-to-end models after training them on the Udacity dataset. In addition to the five baseline models listed in Section 5.5, we also include two models of ours: Model F and Model G. Model F is our proposed approach with x = 8 for each vehicle; Model G sets x = 10 time-steps for each vehicle instead of 8. Since the RMSE values on the Udacity dataset were not reported for Model D and Model E in [58], we re-implemented those models to compute their RMSE values on the Udacity dataset and report the results of our implementation in Table 2. Table 3 lists the MAE values computed for our implementations of the models A, D, E, F, and G.
Fig. 5. Training and validation MSE loss vs. the number of epochs for our best model with x = 8.

Fig. 6. RMSE vs. x value. We trained our algorithm at various x values and computed the respective RMSE value. As shown in the figure, the minimum value is obtained at x = 8.
Models A, B, C, D, and E do not report their individual MAE values in their respective sources. While we re-implemented each of those models in Keras, our implementations of models B and C yielded higher RMSE values than their reported values even after hyperparameter tuning. Consequently, we did not include the MAE results of our implementations of those two models in Table 3. The MAE values for models A, D and E are obtained after hyperparameter tuning.
Table 2. Comparison to Related Work in terms of RMSE (a: X = 8, b: X = 10).

Model:        A [1]    B [24]   C [24]   D [58]   E [58]   F^a (Ours)   G^b (Ours)
Training      0.099    0.113    0.077    0.061    0.177    0.034        0.044
Validation    0.098    0.112    0.077    0.083    0.149    0.042        0.044
We then study the effect of changing the value of x on the performance of our model in terms of RMSE. We train our model at separate x values, with x set to 1, 2, 4, 6, 8, 10, 12, 14 and 20, and compute the RMSE value for both the training and validation data at each x value. The results are plotted in Fig. 6. As shown in the figure, we obtain the lowest RMSE value for both training and validation data at x = 8, where RMSE = 0.042 for the validation data. The figure also shows that choosing an appropriate x value is important for obtaining the best performance from the model: the number of images used in the input affects performance.

Next, we study how changing the Δt value affects the performance of our end-to-end system in terms of RMSE during testing, once the algorithm has been trained at a fixed Δt. Changing Δt corresponds to varying the distance between the two vehicles. For that purpose, we first set Δt = 30 frames (i.e., a 1.5 second gap between the vehicles) and trained the algorithm accordingly (with x = 10). Once the model had been trained and had learned the relation between the given input image stacks and the corresponding output value at Δt = 30, we studied the robustness of the trained system as the distance between the two vehicles changes during testing. Fig. 7 shows how the RMSE value changes as we vary the distance between the vehicles at test time. For that, we run the trained model over the entire validation data, where the inputs are formed from the validation data at Δt values varying between 0 and 95 in increments of 5 frames, and we compute the RMSE value at each of those Δt values.
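A minimal sketch of this robustness sweep is shown below: it assembles the 2x-frame input described in Section 5.3 at a given Δt and evaluates RMSE at each Δt value. The helper and variable names (`frames`, `angles`, `model`) are illustrative assumptions and do not correspond to released code.

    import numpy as np

    def build_samples(frames, angles, x, dt):
        """Pair x frames from vehicle 1 (t-x+1..t) with x frames from
        vehicle 2 starting at t+dt, labelled with vehicle 1's angle at t."""
        inputs, targets = [], []
        for t in range(x - 1, len(frames) - dt - x):
            own = frames[t - x + 1:t + 1]         # x on-board frames
            shared = frames[t + dt:t + dt + x]    # x frames received from the vehicle ahead
            inputs.append(np.concatenate([own, shared], axis=0))  # 2x frames per sample
            targets.append(angles[t])
        return np.array(inputs), np.array(targets)

    def rmse_sweep(model, frames, angles, x=10, dts=range(0, 100, 5)):
        """Evaluate a trained model at several inter-vehicle gaps (in frames)."""
        results = {}
        for dt in dts:
            X, y = build_samples(frames, angles, x, dt)
            pred = model.predict(X).ravel()
            results[dt] = float(np.sqrt(np.mean((pred - y) ** 2)))
        return results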
Fig. 7. RMSE value vs. the number of frames ahead (Δt) over the validation data. The model is trained at Δt = 30 and x = 10. Between the Δt values of 13 and 37 (the red area), the change in the RMSE value remains small, and the algorithm yields almost the same minimum value at Δt = 20, which differs from the training value.
Table 3. Comparison to Related Work in terms of MAE (a: X = 8, b: X = 10).

Model:        A [1]    D [58]   E [58]   F^a (Ours)   G^b (Ours)
Training      0.067    0.038    0.046    0.022        0.031
Validation    0.062    0.041    0.039    0.033        0.036
As shown in Fig. 7, the minimum RMSE value (0.0443) occurs at Δt = 30, since the model was also trained with Δt = 30. However, another (local) minimum (0.0444), almost equal to the value obtained at the training Δt, is also found at Δt = 20. Because of these two local minima, the change in error remains small inside the red area shown in the figure. However, the error does not increase evenly on both sides of the training value (Δt = 30), as most of the RMSE values within the red area lie on the left side of the training value.

Next, we examine the performance of multiple models over each frame of the entire Udacity dataset in Fig. 9. In total, there are 33808 images in the dataset. The ground-truth for the figure is shown in Fig. 8, and the difference between the prediction and the ground-truth is given in Fig. 9 for multiple algorithms. In each plot, the maximum and minimum error values made by each algorithm are highlighted with red lines. In Fig. 9, we only show the results obtained for Model A, Model D, Model E and Model F (ours), because no implementation of Model B or Model C from [24] is available, and our implementations of those models (as described in the original paper) did not yield results good enough to be reported here. Our algorithm (Model F) showed the best overall performance with the lowest RMSE value. Comparing all the red lines in the plots (i.e., comparing all the maximum and minimum error values) shows that the maximum error made over the entire dataset is smallest for our algorithm.
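Per-frame error curves with highlighted extrema, as in Fig. 9, can be produced with a short script along the following lines; the array names are placeholders under the assumption that per-frame predictions are already available.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_frame_errors(y_true, y_pred, title):
        """Plot the per-frame prediction error and mark its max/min with red lines."""
        err = np.asarray(y_pred) - np.asarray(y_true)   # error per frame, in radians
        plt.figure()
        plt.plot(err, linewidth=0.5)
        plt.axhline(err.max(), color="red")             # maximum error line
        plt.axhline(err.min(), color="red")             # minimum error line
        plt.xlabel("Number of Images")
        plt.ylabel("Angle (Radians)")
        plt.title(title)
        plt.show()

    # Hypothetical usage for one model's predictions over the 33808-frame dataset:
    # plot_frame_errors(ground_truth_angles, model_f_predictions, "Model F")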
Fig. 8. Steering angle (in radians) vs. the index of each image frame in the data sequence is shown for the Udacity Dataset. This data forms the ground-truth for our experiments. The upper and lower red lines highlight the maximum and minimum angle values respectively in the figure.
7. Concluding Remarks

In this chapter, we provided an overview of AI applications that address the challenges in the emerging area of CAVs.
Fig. 9. Individual error values (in radians) made at each time frame, plotted for four models: Model A, Model D, Model E and Model F. The dataset is the Udacity Dataset, and the ground-truth is shown in Fig. 8. The upper and lower red lines highlight the maximum and minimum errors made by each algorithm. The error for each frame (the y axis) for Models A, D and F is plotted in the range [-1.5, +1.2], and the error for Model E is plotted in the range [-4.3, +4.3].
We briefly discussed recent advances in applying machine learning in CAVs and highlighted several open issues for further research. We presented a new approach that shares images between cooperative self-driving vehicles to improve the control accuracy of the steering angle. Our end-to-end approach uses a deep model built from CNN, LSTM and FC layers, and it combines the on-board data with the data (images) received from another vehicle as input. Our proposed model using shared images yields the lowest RMSE value when compared to the other existing models in the literature. Unlike previous works that use and focus only on local information obtained from a single vehicle, we proposed a system in which the vehicles communicate with each other and share data. In our experiments, we demonstrated that our proposed end-to-end model with data sharing in cooperative environments yields better performance than previous approaches that rely only on data obtained and used on the same vehicle. Our end-to-end model was able to learn and predict accurate steering angles without manual decomposition into road or lane-marking detection. One potentially strong argument against image sharing might be to use the geo-spatial information along with the steering angle from the vehicle ahead and simply to apply the same angle value at that position. Here we argue that using GPS makes the prediction dependent on location data which, like many other sensor types, can provide faulty values for various reasons, and that can force the algorithm to use the wrong image sequence as input. More work and analysis are needed to improve the robustness of our proposed model. While this chapter relies on simulated data (where the data sharing between the vehicles is simulated from the Udacity Dataset), we are in the process of collecting real data from multiple cars communicating over V2V and will perform a more detailed analysis on that new, real data.
Acknowledgement

This work was done as a part of the CAP5415 Computer Vision class in Fall 2018 at UCF. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.

References

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, End to End Learning for Self-Driving Cars. (2016). URL https://images.nvidia.com/content/tegra/automotive/images/2016/solutions/pdf/end-to-end-dl-using-px.pdf; http://arxiv.org/abs/1604.07316.
[2] Z. Chen and X. Huang. End-to-end learning for lane keeping of self-driving cars. In IEEE Intelligent Vehicles Symposium, Proceedings, (2017). ISBN 9781509048045. doi: 10.1109/IVS.2017.7995975.
[3] H. M. Eraqi, M. N. Moustafa, and J. Honer, End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies (Oct. 2017). URL http://arxiv.org/abs/1710.03804.
[4] H. Nourkhiz Mahjoub, B. Toghi, S. M. Osman Gani, and Y. P. Fallah, V2X System Architecture Utilizing Hybrid Gaussian Process-based Model Structures, arXiv e-prints. art. arXiv:1903.01576 (Mar, 2019).
[5] H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A stochastic hybrid framework for driver behavior modeling based on hierarchical Dirichlet process. In 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), pp. 1–5 (Aug, 2018). doi: 10.1109/VTCFall.2018.8690570.
[6] H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A driver behavior modeling structure based on non-parametric Bayesian stochastic hybrid architecture. In 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), pp. 1–5 (Aug, 2018). doi: 10.1109/VTCFall.2018.8690965.
[7] R. Valiente, M. Zaman, S. Ozer, and Y. P. Fallah, Controlling steering angle for cooperative self-driving vehicles utilizing CNN and LSTM-based deep networks, arXiv preprint arXiv:1904.04375. (2019).
[8] M. Aly. Real time detection of lane markers in urban streets. In IEEE Intelligent Vehicles Symposium, Proceedings, (2008). ISBN 9781424425693. doi: 10.1109/IVS.2008.4621152.
[9] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez. Road scene segmentation from a single image. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), (2012). ISBN 9783642337857. doi: 10.1007/978-3-642-33786-4_28.
[10] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, (2017). ISBN 9781538604571. doi: 10.1109/CVPR.2017.376.
[11] J. Li, H. Cheng, H. Guo, and S. Qiu, Survey on artificial intelligence for vehicles, Automotive Innovation. 1(1), 2–14, (2018).
[12] S. R. Narla, The evolution of connected vehicle technology: From smart drivers to smart cars to self-driving cars, ITE Journal. 83(7), 22–26, (2013).
[13] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature. 521(7553), 436, (2015).
[14] A. Luckow, M. Cook, N. Ashcraft, E. Weill, E. Djerekarov, and B. Vorster. Deep learning in the automotive industry: Applications and tools. In 2016 IEEE International Conference on Big Data (Big Data), pp. 3759–3768. IEEE, (2016).
[15] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, Machine learning for vehicular networks: Recent advances and application examples, IEEE Vehicular Technology Magazine. 13(2), 94–101, (2018).
[16] W. Schwarting, J. Alonso-Mora, and D. Rus, Planning and decision-making for autonomous vehicles, Annual Review of Control, Robotics, and Autonomous Systems. 1, 187–210, (2018).
[17] N. Agarwal, A. Sharma, and J. R. Chang. Real-time traffic light signal recognition system for a self-driving car. In Advances in Intelligent Systems and Computing, (2018). ISBN 9783319679334. doi: 10.1007/978-3-319-67934-1_24.
[18] B. S. Shin, X. Mou, W. Mou, and H. Wang, Vision-based navigation of an unmanned surface vehicle with object detection and tracking abilities, Machine Vision and Applications. (2018). ISSN 14321769. doi: 10.1007/s00138-017-0878-7.
[19] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al., An empirical evaluation of deep learning on highway driving, arXiv preprint arXiv:1504.01716. (2015).
[20] D. Pomerleau. Rapidly adapting artificial neural networks for autonomous navigation. In Advances in Neural Information Processing Systems, pp. 429–435, (1991).
[21] A. Dehghan, S. Z. Masood, G. Shu, E. Ortiz, et al., View independent vehicle make, model and color recognition using convolutional neural network, arXiv preprint arXiv:1702.01721. (2017).
[22] A. Amini, G. Rosman, S. Karaman, and D. Rus, Variational end-to-end navigation and localization, arXiv preprint arXiv:1811.10119. (2018).
[23] D. A. Pomerleau, ALVINN: An Autonomous Land Vehicle in a Neural Network, Advances in Neural Information Processing Systems. (1989).
[24] S. Du, H. Guo, and A. Simpson. Self-Driving Car Steering Angle Prediction Based on Image Recognition. Technical report, (2017). URL http://cs231n.stanford.edu/reports/2017/pdfs/626.pdf.
[25] A. Gurghian, T. Koduri, S. V. Bailur, K. J. Carey, and V. N. Murali. DeepLanes: End-to-End Lane Position Estimation Using Deep Neural Networks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, (2016). ISBN 9781467388504. doi: 10.1109/CVPRW.2016.12.
[26] J. Dirdal. End-to-end learning and sensor fusion with deep convolutional networks for steering an off-road unmanned ground vehicle. PhD thesis, (2018). URL https://brage.bibsys.no/xmlui/handle/11250/2558926.
[27] H. Yu, S. Yang, W. Gu, and S. Zhang. Baidu driving dataset and end-to-end reactive control model. In IEEE Intelligent Vehicles Symposium, Proceedings, (2017). ISBN 9781509048045. doi: 10.1109/IVS.2017.7995742.
[28] H. Cho, Y. W. Seo, B. V. Kumar, and R. R. Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In Proceedings - IEEE International Conference on Robotics and Automation, (2014). ISBN 9781479936847. doi: 10.1109/ICRA.2014.6907100.
[29] D. Gohring, M. Wang, M. Schnurmacher, and T. Ganjineh. Radar/Lidar sensor fusion for car-following on highways. In ICARA 2011 - Proceedings of the 5th International Conference on Automation, Robotics and Applications, (2011). ISBN 9781457703287. doi: 10.1109/ICARA.2011.6144918.
[30] N. Taherkhani and S. Pierre, Centralized and localized data congestion control strategy for vehicular ad hoc networks using a machine learning clustering algorithm, IEEE Transactions on Intelligent Transportation Systems. 17(11), 3275–3285, (2016).
[31] S. Demetriou, P. Jain, and K.-H. Kim. Codrive: Improving automobile positioning via collaborative driving. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pp. 72–80. IEEE, (2018).
[32] M. Sangare, S. Banerjee, P. Muhlethaler, and S. Bouzefrane. Predicting vehicles' positions using roadside units: A machine-learning approach. In 2018 IEEE Conference on Standards for Communications and Networking (CSCN), pp. 1–6. IEEE, (2018).
[33] C. Wang, J. Delport, and Y. Wang, Lateral motion prediction of on-road preceding vehicles: a data-driven approach, Sensors. 19(9), 2111, (2019).
[34] S. Sekizawa, S. Inagaki, T. Suzuki, S. Hayakawa, N. Tsuchida, T. Tsuda, and H. Fujinami, Modeling and recognition of driving behavior based on stochastic switched ARX model, IEEE Transactions on Intelligent Transportation Systems. 8(4), 593–606, (2007).
[35] K.-P. Chen and P.-A. Hsiung, Vehicle collision prediction under reduced visibility conditions, Sensors. 18(9), 3026, (2018).
[36] B. Jiang and Y. Fei, Vehicle speed prediction by two-level data driven models in vehicular networks, IEEE Transactions on Intelligent Transportation Systems. 18(7), 1793–1801, (2016).
[37] W. Yao, H. Zhao, P. Bonnifait, and H. Zha. Lane change trajectory prediction by using recorded human driving data. In 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 430–436. IEEE, (2013).
[38] P. Kumar, M. Perrollaz, S. Lefevre, and C. Laugier. Learning-based approach for online lane change intention prediction. In 2013 IEEE Intelligent Vehicles Symposium (IV), pp. 797–802. IEEE, (2013).
[39] M. Liebner, F. Klanner, M. Baumann, C. Ruhhammer, and C. Stiller, Velocity-based driver intent inference at urban intersections in the presence of preceding vehicles, IEEE Intelligent Transportation Systems Magazine. 5(2), 10–21, (2013).
[40] S. Yoon and D. Kum. The multilayer perceptron approach to lateral motion prediction of surrounding vehicles for autonomous vehicles. In 2016 IEEE Intelligent Vehicles Symposium (IV), pp. 1307–1312. IEEE, (2016).
[41] H. Woo, Y. Ji, H. Kono, Y. Tamura, Y. Kuroda, T. Sugano, Y. Yamamoto, A. Yamashita, and H. Asama, Lane-change detection based on vehicle-trajectory prediction, IEEE Robotics and Automation Letters. 2(2), 1109–1116, (2017).
[42] C. Rödel, S. Stadler, A. Meschtscherjakov, and M. Tscheligi. Towards autonomous cars: the effect of autonomy levels on acceptance and user experience. In Proceedings of the 6th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, pp. 1–8. ACM, (2014).
[43] J. Palmer, M. Freitas, D. A. Deninger, D. Forney, S. Sljivar, A. Vaidya, and J. Griswold. Autonomous vehicle operator performance tracking (May 30, 2017). US Patent 9,663,118.
[44] C. Qu, D. A. Ulybyshev, B. K. Bhargava, R. Ranchal, and L. T. Lilien. Secure dissemination of video data in vehicle-to-vehicle systems. In 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW), pp. 47–51. IEEE, (2015).
[45] C. Ide, F. Hadiji, L. Habel, A. Molina, T. Zaksek, M. Schreckenberg, K. Kersting, and C. Wietfeld. LTE connectivity and vehicular traffic prediction based on machine learning approaches. In 2015 IEEE 82nd Vehicular Technology Conference (VTC2015-Fall), pp. 1–5. IEEE, (2015).
[46] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, Traffic flow prediction with big data: a deep learning approach, IEEE Transactions on Intelligent Transportation Systems. 16(2), 865–873, (2014).
[47] NHTSA. NHTSA Automated Vehicles for Safety, (2019). URL https://www.nhtsa.gov/technology-innovation/automated-vehicles-safety.
[48] J. M. Anderson, K. Nidhi, K. D. Stanley, P. Sorensen, C. Samaras, and O. A. Oluwatola, Autonomous vehicle technology: A guide for policymakers. (Rand Corporation, 2014).
[49] W. J. Kohler and A. Colbert-Taylor, Current law and potential legal issues pertaining to automated, autonomous and connected vehicles, Santa Clara Computer & High Tech. LJ. 31, 99, (2014).
[50] Y. Liang, M. L. Reyes, and J. D. Lee, Real-time detection of driver cognitive distraction using support vector machines, IEEE Transactions on Intelligent Transportation Systems. 8(2), 340–350, (2007).
[51] Y. Abouelnaga, H. M. Eraqi, and M. N. Moustafa, Real-time distracted driver posture classification, arXiv preprint arXiv:1706.09498. (2017).
[52] M. Grimm, K. Kroschel, H. Harris, C. Nass, B. Schuller, G. Rigoll, and T. Moosmayr. On the necessity and feasibility of detecting a driver's emotional state while driving. In International Conference on Affective Computing and Intelligent Interaction, pp. 126–138. Springer, (2007).
[53] M. Gerla, E.-K. Lee, G. Pau, and U. Lee. Internet of vehicles: From intelligent grid to autonomous cars and vehicular clouds. In 2014 IEEE World Forum on Internet of Things (WF-IoT), pp. 241–246. IEEE, (2014).
[54] B. Toghi, M. Saifuddin, H. N. Mahjoub, M. O. Mughal, Y. P. Fallah, J. Rao, and S. Das. Multiple access in cellular V2X: Performance analysis in highly congested vehicular networks. In 2018 IEEE Vehicular Networking Conference (VNC), pp. 1–8 (Dec, 2018). doi: 10.1109/VNC.2018.8628416.
[55] K. Passino, M. Polycarpou, D. Jacques, M. Pachter, Y. Liu, Y. Yang, M. Flint, and M. Baum. Cooperative control for autonomous air vehicles. In Cooperative Control and Optimization, pp. 233–271. Springer, (2002).
[56] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation. (2000). ISSN 08997667. doi: 10.1162/089976600300015015.
[57] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, LSTM: A Search Space Odyssey, IEEE Transactions on Neural Networks and Learning Systems. (2017). ISSN 21622388. doi: 10.1109/TNNLS.2016.2582924.
[58] D. Choudhary and G. Bansal. Convolutional Architectures for Self-Driving Cars. Technical report, (2017).
[59] B. Toghi, M. Saifuddin, Y. P. Fallah, and M. O. Mughal, Analysis of Distributed Congestion Control in Cellular Vehicle-to-everything Networks, arXiv e-prints. art. arXiv:1904.00071 (Mar, 2019).
[60] B. Toghi, M. Mughal, M. Saifuddin, and Y. P. Fallah, Spatio-temporal dynamics of cellular V2X communication in dense vehicular networks, arXiv preprint arXiv:1906.08634. (2019).
[61] G. Shah, R. Valiente, N. Gupta, S. Gani, B. Toghi, Y. P. Fallah, and S. D. Gupta, Real-time hardware-in-the-loop emulation framework for DSRC-based connected vehicle applications, arXiv preprint arXiv:1905.09267. (2019).
[62] M. Islam, M. Chowdhury, H. Li, and H. Hu, Vision-based Navigation of Autonomous Vehicle in Roadway Environments with Unexpected Hazards, arXiv preprint arXiv:1810.03967. (2018).
[63] S. Wu, S. Zhong, and Y. Liu, ResNet, CVPR. (2015). ISSN 15737721. doi: 10.1002/9780470551592.ch2.
INDEX
brain atrophy rate, 257 brain magnetic resonance images (MRIs), 251 brain structural edge labeling, 252 bright spot pattern, 335
3D-CNN, 61 adaptive cruise control (ACC), 370 advanced driver assistance system (ADAS), 370 adversarial training, 169 Alzheimer's disease, 259, 260 anisotropic 2-D gaussians, 214 application specific integrated circuit (ASIC), 107 associative memory, 323 atlas images, 251 atlas-matching, 251 atrophy, 252 attention network, 164 attention-based encoder-decoder (AED), 163 attribute-centered loss, 347 attribute-guided cross-modal, 344 attributes, 343 augmentation, 351 automated sorting of fishes, 184 automatic speech recognition (ASR), 159
canny labeling algorithm for edge, 252 cascaded CNNs, 54 catheter-based technique, 272 cellular neural network, 323 center loss, 346 ChangeDetection.Net (CDnet), 52 character recognition, 184 classification probability, 18 classifier, 8 cognitive impairment, 259 cold fusion, 166 colour bleeding, 140, 144-147 combining views, 126 composite operator, 274 computer-guided steering, 365 Conditional CycleGAN, 357 conditional risk, 18 conjugate priors, 12 connected and autonomous vehicles (CAVs), 365 connectionist temporal classification (CTC), 161, 289 contextual information, 2 contrastive loss, 348 control the steering angle, 365 convolutional neural networks (CNNs), 31, 51, 107, 251, 365 coupled DNN, 347 cross-depiction, 291 cross-entropy, 346 cross-modal face verification, 343 curvelet-based texture features, 99 curvelets, 87, 89, 94
background subtraction, 52 back-propagation (BP) algorithm, 31 back-propagation training algorithm, 3 Bayes classifier, 8 Bayes decision rule (BDR), 4, 19 Bayes error, 8 Bayes’ rule, 10 Bayesian conditional risk estimator (BCRE), 20 Bayesian GAN (BGAN) network, 62 Bayesian MMSE error estimator (BEE), 9 Bayesian risk estimate (BRE), 20 binarization, 287, 307 bipartite graph edit distance, 311 385
CycleGAN, 298, 345, 355 data fusion, 349 deep attractor network (DANet), 170 deep clustering (DPCL), 170 deep fusion, 166 deep learning methods, 139, 140, 148-150, 152, 154 deep learning software, 184 deep learning, 2, 200, 251, 252, 289, 365 deep neural networks (DNNs), 107 deep-learning based background subtraction, 51 deformation field, 258 DIA, 287 dice similarity coefficient, 256 digital humanities (DH), 290 digital image processing, 3 discrete element method (DEM), 243 discrete wavelet frame decompositions, 271 dissimilarity representations, 119 document content recognition, 289 document image analysis, 287 document image binarization competition (DIBCO), 288 dynamic combination, 129 dynamic learning neural networks, 4 dynamic view selection, 119 earth observation, 187 edit path, 310 effective class-conditional densities, 11, 26 effective density, 20 embedded graphics processing unit (GPU), 107 encoder network, 161 endmembers, 209 end-to-end (E2E), 159 end-to-end model, 380 entropy-orthogonality loss, 31 equations display, 256, 259 expected risk, 18 external elastic membrane (EEM), 271
face verification, 346 fast fourier transform (FFT), 116 feature extraction, 2 feature learning, 191 feature vector, 8 feature-label distribution, 8 field-programmable gate array (FPGA), 107 floe size distributions, 231 F-Measure score, 53 forged signatures, 305 fraction surfaces, 209 fully connected (FC) layers, 365 fully convolutional neural networks (FCNN), 56 fully convolutional semantic network (FCSN), 58 gaussian mixture models (GMM), 369 gaussian-based spatially adaptive unmixing (GBSAU), 209 general covariance model, 15 generative adversarial network (conditioned GAN), 202 generative adversarial networks (GANs), 345 genuine, 305 gleason grading system, 99, 103 gleason scores, 100, 102 global handwriting characteristics, 306 GPU, 252 gradient descent (GD), 216 gradient vector flow, 234, 235 graph edit distance (GED), 310 graph, 308 graph-based representation, 305, 306 graphology, 305 ground truth (GT), 252, 289 ground truth information, 53 GVF snake algorithm, 236 handwriting recognition, 289 handwritten signatures, 305 hausdor edit distance, 313 hopfield associative memory, 339
horizon pattern, 336 hungarian algorithm, 312 hyperspectral image, 210 ice boundary detection, 248 ice field generation, 248 ice shape enhancement, 238 image colourisation, 139-144, 147-149, 151, 154 automatic methods, 140-143, 145, 147, 149 comic colourisation, 152 manga colourisation, 152 outline colourisation, 152 image texture analysis, 87 image texture pattern classification, 88 independent covariance model, 14 Intravascular ultrasound (IVUS), 271 intrinsically Bayesian robust (IBR) operator, 8 intrinsically Bayesian robust classifier (IBRC), 9 inverse-Wishart distribution, 16 joint network, 162 keypoint graphs, 308 K-means clustering, 233 knowledge distillation, 168 labels, 8 language model, 159 LAS: listen, attend and spell, 163 light detection and ranging (LiDAR), 366 likelihood function, 19 linear sum assignment problem (LSAP), 311 local classifier accuracy (LCA) method, 131 local feature descriptors, 306 local volume changes, 258 localization and recognition, document summarizing, and captioning, 289 log-jacobians, 258 long-short-term memory (LSTM), 161, 365 longitudinal registration, 252
machine learning, 2, 187 machine vision and inspection, 185 maximal knowledge-driven information prior (MKDIP), 22 maximum curvelet coefficient, 103, 105 mean absolute error (MAE), 221 media-adventitia border, 276 medical diagnosis, 184, 185 minimum sample size, 260 Min-Max loss, 31 mixed-units, 166 monaural speech separation, 171 motion equation, 326 multi-atlas, 252 multiscale fully convolutional network (MFCN), 57 multi-view learning, 119 mutli-channel separation, 171 Navier-Stokes equation, 258 nearest-neighbor decision rule, 4 neural networks, 2, 251 neural style transfer, 298 normalization of graph edit distance, 315 normalization, 309 NVIDIA DGX, 252 object recognition, 31 online signature verification, 305 online, 306 optical character recognition (OCR), 290 optimal Bayesian classifier (OBC), 9 optimal Bayesian operator, 8 optimal Bayesian risk classifier (OBRC), 18, 21 optimal Bayesian transfer learning classifier (OBTLC), 27 otsu thresholding, 233 out-of-vocabulary, 165 parametric shape modeling, 81 pattern classification, 101 pattern recognition, 323 permutation invariant training (PIT), 170 person identifications, 184 pinch-out patterns, 335
poisson regression trees (PRT), 369 polarimetric SAR, 189 prediction network, 162 probabilities, 12 prostate cancer tissue images, 100 prostate cancer, 87, 92, 94, 100, 102 quadratic assignment problem (QAP), 311 radial basis functions, 280 radial basis networks (RBNs), 75 random forest dissimilarity, 121 random forest, 119, 190 random sample, 8 recognition of underwater objects, 184 recurrent neural network (RNN) models, 113 recurrent neural networks (RNNs), 159 reference-based methods, 139, 140, 154 remote sensing, 184, 187 RNN transducer (RNN-T), 162 road side units (RSUs), 368 root mean squared error (RMSE), 375 SAR, 188, 189 scribble-based methods, 139, 140, 145, 147, 154 sea ice characteristics, 232 sea ice parameter, 231 segmentation, 251 seismic pattern recognition, 323 seismogram, 336 semantic segmentation, 189 semi-supervised, 211 shallow fusion, 166 signature verification, 305, 306 similarity domains network, 75 similarity or dissimilarity of graphs, 307 skeleton extraction, 81 skeletonization, 307 skull-stripping, 251 soft labels, 168 spatially adaptive unmixing, 209 spectral unmixing, 209 speech recognition, 184
speech separation, 170 static combination, 127 statistical pattern recognition, 2 statistical power, 252, 260 statistical, 306 statistically defined regions of interest, 260 structural edge predictions, 257 structural MRIs, 251 structural representation, 306 structure-aware colourisation, 143, 146, 147 subspace learning models, 51 supervised unmixing, 210 support vector machine (SVM), 3, 51 syntactical (grammatical) pattern recognition, 2 synthesis, 343 synthetic aperture radar, 187 synthetic image generation, 290 teacher-student learning, 160 TensorFlow, 252 text-to-speech (TTS), 166 texture classification, 88 texture feature extraction, 87 texture features, 103 texture information, 273 texture pattern classification, 92 transcoding, 200 transposed convolutional neural network (TCNN), 55 udacity dataset, 374 uncertainty class, 8, 9 unsupervised, 211 user guidance, 150, 153 vehicle-assisted image sharing, 375 vehicle-to-vehicle (V2V) communication, 365 winograd minimal filter algorithm, 116 word error rate (WER), 289 word spotting object detection, word-piece, 166