Jinxing Li • Bob Zhang • David Zhang
Information Fusion
Machine Learning Methods
Jinxing Li
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Xili, Nanshan, Shenzhen, China

Bob Zhang
Department of Computer and Information Science, The University of Macau
Macau, China

David Zhang
School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
Shenzhen, China
ISBN 978-981-16-8975-8
ISBN 978-981-16-8976-5 (eBook)
https://doi.org/10.1007/978-981-16-8976-5
© Springer Nature Singapore Pte Ltd. & Higher Education Press, China 2022

Jointly published with Higher Education Press, Beijing, China. The print edition is not for sale in China Mainland. Customers from China Mainland please order the print book from: Higher Education Press.

This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore.
Preface
Thanks to the rapid development of multimedia techniques, large amounts of data can be collected in many different ways. In numerous real-world applications, the source data or signal is obtained from multiple domains or represented by different measurements. For example, a person can be verified via fingerprint, palmprint, or iris, and a given pair of samples can be measured by the Euclidean or the Mahalanobis distance. A single representation generally fails to extract all of the valuable information needed for such applications, so it is important to study how to achieve information fusion efficiently. Information fusion combines different types of data to obtain information that is more accurate, consistent, and useful than what any individual modality provides. Depending on the stage at which fusion is performed, information fusion processes are usually classified into three categories: low, intermediate, and high level.

This book introduces fusion algorithms based on a range of techniques, including sparse/collaborative representation, the Gaussian process latent variable model, hierarchical structures with Bayesian theory, metric learning, score/classifier fusion, and deep learning. With these techniques, not only is the correlation among multiple features or modalities of a single sample efficiently exploited, but multiple metrics are also combined to fit the complex structure of real-world data more powerfully and reasonably. Both contribute to performance improvement, and the resulting methods are widely applied to pattern recognition tasks such as image classification, face verification, disease detection, and image retrieval. The book will be useful for researchers, experts, and postgraduate students working in machine learning, pattern recognition, computer vision, and biometrics, as well as for interdisciplinary research.
Our team has been working on information fusion for a long time. We appreciate the related grant support from the National Natural Science Foundation of China (NSFC), the China Postdoctoral Science Foundation, the GRF fund of the HKSAR Government, the Shenzhen Research Institute of Big Data (SRIBD), and the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS).

Shenzhen, China    Jinxing Li
Shenzhen, China    David Zhang
Macau, China       Bob Zhang
Acknowledgments
Our team has been working on information fusion for more than 5 years. We appreciate the related grant support from the National Natural Science Foundation of China (NSFC) (61906162, 61332011, 61272292, 61271344), the Shenzhen Science and Technology Program (RCBS20200714114910193), the GRF fund of the HKSAR Government, the Shenzhen Research Institute of Big Data, and an Open Project Fund from the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AC01202005017). We also thank Prof. Yong Xu for providing material for Sect. 6.2.
Contents

1 Introduction
   1.1 Why Do Information Fusion?
   1.2 Related Works
       1.2.1 Multi-View Based Fusion Methods
       1.2.2 Multi-Technique Based Fusion Methods
   1.3 Book Overview
   References

2 Information Fusion Based on Sparse/Collaborative Representation
   2.1 Motivation and Preliminary
       2.1.1 Motivation
       2.1.2 Preliminary
   2.2 Joint Similar and Specific Learning
       2.2.1 Problem Formulation
       2.2.2 Optimization for JSSL
       2.2.3 The Classification Rule for JSSL
       2.2.4 Experimental Results
       2.2.5 Conclusion
   2.3 Relaxed Collaborative Representation
       2.3.1 Problem Formulation
       2.3.2 Optimization for RCR
       2.3.3 The Classification Rule for RCR
       2.3.4 Experimental Results
       2.3.5 Conclusion
   2.4 Joint Discriminative and Collaborative Representation
       2.4.1 Problem Formulation
       2.4.2 Optimization for JDCR
       2.4.3 The Classification Rule for JDCR
       2.4.4 Experimental Results
       2.4.5 Conclusion
   References

3 Information Fusion Based on Gaussian Process Latent Variable Model
   3.1 Motivation and Preliminary
       3.1.1 Motivation
       3.1.2 Preliminary
   3.2 Shared Auto-encoder Gaussian Process Latent Variable Model
       3.2.1 Problem Formulation
       3.2.2 Optimization for SAGP
       3.2.3 Inference
       3.2.4 Experimental Results
       3.2.5 Conclusion
   3.3 Multi-Kernel Shared Gaussian Process Latent Variable Model
       3.3.1 Problem Formulation
       3.3.2 Optimization for MKSGP
       3.3.3 Inference
       3.3.4 Experimental Results
       3.3.5 Conclusion
   3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent Variable Model
       3.4.1 Problem Formulation
       3.4.2 Optimization for SLEMKGP
       3.4.3 Inference
       3.4.4 Experimental Results
       3.4.5 Conclusion
   References

4 Information Fusion Based on Multi-View and Multi-Feature Learning
   4.1 Motivation
   4.2 Generative Multi-View and Multi-Feature Learning
       4.2.1 Problem Formulation
       4.2.2 Optimization for MVMFL
       4.2.3 Inference for MVMFL
       4.2.4 Experimental Results
       4.2.5 Conclusion
   4.3 Hierarchical Multi-View Multi-Feature Fusion
       4.3.1 Problem Formulation
       4.3.2 Optimization for HMMF
       4.3.3 Inference for HMMF
       4.3.4 Experimental Results
       4.3.5 Conclusion
   References

5 Information Fusion Based on Metric Learning
   5.1 Motivation
   5.2 Generalized Metric Swarm Learning
       5.2.1 Problem Formulation
       5.2.2 Optimization for GMSL
       5.2.3 Solving with Model Modification
       5.2.4 Representation of Pairwise Samples in Metric Swarm Space
       5.2.5 Sample Pair Verification
       5.2.6 Remarks
       5.2.7 Experimental Results
       5.2.8 Conclusion
   5.3 Combined Distance and Similarity Measure
       5.3.1 Problem Formulation
       5.3.2 Optimization for CDSM
       5.3.3 Kernelized CDSM
       5.3.4 Experimental Results
       5.3.5 Conclusion
   References

6 Information Fusion Based on Score/Weight Classifier Fusion
   6.1 Motivation
   6.2 Adaptive Weighted Fusion Approach
       6.2.1 Problem Formulation
       6.2.2 Rationale and Advantages of AWFA
       6.2.3 Experimental Results
       6.2.4 Conclusions
   6.3 Adaptive Weighted Fusion of Local Kernel Classifiers
       6.3.1 FaLK-SVM
       6.3.2 Adaptive Fusion of Local SVM Classifiers
       6.3.3 Experimental Results
       6.3.4 Conclusion
   References

7 Information Fusion Based on Deep Learning
   7.1 Motivation
   7.2 Dual Asymmetric Deep Hashing Learning
       7.2.1 Problem Formulation
       7.2.2 Optimization for DADH
       7.2.3 Inference for DADH
       7.2.4 Experimental Results
       7.2.5 Conclusion
   7.3 Relaxed Asymmetric Deep Hashing Learning
       7.3.1 Problem Formulation
       7.3.2 Optimization for RADH
       7.3.3 Inference for RADH
       7.3.4 Implementation
       7.3.5 Experimental Results
       7.3.6 Conclusion
   7.4 Joint Learning of Single-Image and Cross-Image Representations for Person Re-identification
       7.4.1 Joint SIR and CIR Learning
       7.4.2 Deep Convolutional Neural Network
       7.4.3 Experiments
       7.4.4 Conclusion
   References

8 Conclusion

Index
List of Figures

Fig. 2.1  The framework of our proposed approach JSSL. There exist three parts in JSSL: dictionary construction, sparse representation, and joint classification. First, the dictionary is formed by training instances. Second, we sparsely represent multi-view data, including sublingual, facial, and tongue features of a given test sample, with the dictionary, and we divide the representation coefficients into two parts, including individual components and similar components. Third, we determine the label of the test instance according to the total reconstruction error

Fig. 2.2  Comparison of Healthy and DM performance of the tongue, facial, sublingual features and their fusion methods based on classification. (a) Comparison of our method with the tongue image based feature. (b) Comparison of our method with the facial image based feature. (c) Comparison of our method with the sublingual image based feature

Fig. 2.3  ROC curves of different methods and different features for DM detection

Fig. 2.4  Classified images based on the tongue, facial, sublingual, and their fusion features for Healthy and DM diagnosis. For each image, the red border indicates incorrect classification, and the green border indicates correct classification

Fig. 2.5  Comparison of Healthy vs. IGR performance by using single-modal based methods and JSSL. (a) Comparison of our method with tongue image based feature. (b) Comparison of our method with facial image based feature. (c) Comparison of our method with sublingual image based feature

Fig. 2.6  ROC curves of different methods and different features for IGR detection

Fig. 2.7  (a) The testing instances with sunglasses and scarves in the AR database; (b) partitioned testing instances

Fig. 2.8  Samples of FRGC 2.0 and LFW. (a) and (b) are samples in target and query sets of FRGC 2.0; (c) and (d) are samples in training and testing sets of LFW

Fig. 2.9  Illustrations of instances from the Oxford flower datasets. (a) Some instances of daffodil in 17 classes; and (b) some instances of water lily in 102 classes

Fig. 2.10  The framework of the Joint Discriminative and Collaborative Representation model (JDCR). JDCR consists of three steps: multimodal dictionary construction, Joint Discriminative and Collaborative Representation, and reconstruction error-based classification. In the first step, training samples belonging to different classes construct the dictionary; multimodal features extracted from the given test instance are then represented by the dictionary, and the shared and discriminative representation coefficient is obtained; third, the label is output by comparing reconstruction errors of each class

Fig. 2.11  The confusion matrices in 5 independent experiments for Fatty Liver detection. The top left, top right, bottom left, and bottom right corners are sensitivity, false negative rate, false positive rate, and specificity, respectively, followed by their corresponding error bars

Fig. 2.12  The ROC curves of various strategies and different features for the Fatty Liver disease detection when the number of training data is 200, 250, and 300 in each category. T-SRC and C-SRC mean the curves are obtained by SRC (tongue) and SRC (Comb), respectively, which are similar to T-DPL and C-DPL

Fig. 3.1  An overview of the proposed approach. In this model, there is a shared component that can be mapped to different characteristics. In addition, there is another transformation from observations to the shared variable

Fig. 3.2  Confusion matrix of the category recognition results using dataset AWA. The vertical axis represents the true labels and the horizontal axis represents the predicted labels

Fig. 3.3  Confusion matrix of the category recognition results calculated by JSSL, CCA, DCCA, DCCAE, GPLVM, DS-GPLVM, and SAGP on NUS-WIDE-LITE dataset. The vertical axis represents the true labels, and the horizontal axis represents the predicted labels

Fig. 3.4  The frame of the proposed strategy (MKSGP). We extract multiple features firstly. Then there are mapping functions from the observed samples to the latent space with multiple kernel functions. Meanwhile, mapping functions from the manifold space to the raw data are also computed. Particularly, in order to apply the proposed strategy to the classification, a classifier prior is embedded on the latent component

Fig. 3.5  The confusion matrix of category recognition results with dimensionality of the latent variable set 60 using dataset AWA. The vertical axis represents the true labels and the horizontal axis represents the predicted labels

Fig. 3.6  The average accuracy of selected 9 categories calculated by various strategies

Fig. 3.7  The accuracy calculated by different methods with different values of the dimensionality of the latent variable on the Biomedical dataset

Fig. 3.8  The frame of developed strategy SLEMKGP. We first extract multiple views or features. Then, there is a projection for each view, which is learned to map the observations to a consistent space by transformation matrices {P^v}_{v=1}^V. A Gaussian process prior is subsequently utilized to obtain a back projection to the shared and latent manifold space. Later, the transformations from the shared and latent space to the observed space are also calculated. To use the developed strategy SLEMKGP for classifying, a discriminative prior is embedded on the latent component. It should be noted that Y^v refers to the input in the v-th view, P^v refers to the associated mapping matrix, and f^v refers to the projection function embedding the GP prior at the v-th view. Here the covariance matrices from both encoder and decoder are computed by utilizing multiple kernels

Fig. 3.9  The confusion matrix calculated by SLEMKGP with q set as 50 using AWA dataset, with the elements on the diagonal representing the classifying results

Fig. 3.10  The average accuracy values from selected 9 classes calculated by JSSL, CCA, DCCA, DCCAE, GPLVM, DSGPLVMI, DSGPLVMS, and SLEMKGP

Fig. 4.1  The example containing multiple views and multiple features

Fig. 4.2  The framework of the proposed method, where the green round means the observed features, the blue round means the latent variable, and the orange round means the observed label; J means the number of views and K_j means K_j types of features are obtained in the j-th view

Fig. 4.3  The probabilistic framework and graphic model of the presented approach. (a) The probabilistic framework of the presented approach. (b) The graphical model of the presented approach. x_{ijk_j} and z_{pi} are the observed features and labels, respectively. h_{ijk_j} is the latent variable associated with x_{ijk_j}. For the label z_{pi}, it follows the multinomial distribution, where z_{pi} ∈ {0, 1} and ∑_{p=1}^{P} z_{pi} = 1. According to the supervised information, the latent variable h_{ijk_j} is generated by following a Gaussian distribution N(μ_{jp}, Σ_{jp}). Then we also assume that x_{ijk_j} is generated from its associated latent variable, and x_{ijk_j} also follows a Gaussian distribution N(A_{jk_j} h_{ijk_j}, Σ_{jk_j}). (c) The graphical model of the transformed class-conditional structure of the proposed method. By integrating out the latent variable h_{ijk_j}, all parameters {z_{pi}, μ_{jp}, Σ_{jp}, A_{jk_j}, Σ_{jk_j}} are directly imposed on x_{ijk_j}, and x_{ijk_j} follows a novel Gaussian distribution N(A_{jk_j}(∑_{p=1}^{P} z_{pi} μ_{jp}), Σ_{jk_j} + A_{jk_j}(∑_{p=1}^{P} z_{pi} Σ_{jp}) A_{jk_j}^T)

Fig. 4.4  The visualization of the generated data when J = 3 and K_j = 4

Fig. 4.5  The visualization of the original and generated data. (a) The original data. (b) The generated data

Fig. 4.6  ROC curves of different approaches for DM diagnosis

Fig. 4.7  (a) The framework of the proposed method, where the number of the latent variables is equal to the number of observed views. (b) The probabilistic framework of the proposed method

Fig. 4.8  The comparison of the distributions between the original data and generated data

Fig. 4.9  The ROC curves obtained by different methods in DM detection

Fig. 5.1  Illustration of our GMSL for face verification

Fig. 5.2  A diagram of the proposed GMSL for classification. Left: the circles with different colors denote similar pairs, the squares with different colors denote dissimilar pairs; middle: the triangles denote the represented vector space of similar pairs, and the diamonds denote the represented vector space of dissimilar pairs; right: the binary classification in the learned metric space. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article)

Fig. 5.3  Duality gap vs. the iterations on LFW data

Fig. 5.4  Classification error (%) of Euclidean, DML-eig, LMNN, ITML, Sparse ML, Sub-SML, and the proposed GMSL metric learning methods

Fig. 5.5  Average rank of the classification error rate

Fig. 5.6  The weights of MetricFusion for different q-values (a) and performance curve of three features with different q-values (q = 2, ..., 10)

Fig. 5.7  Comparisons of ROC curves and AUCs between our GMSL and the state-of-the-art methods on LFW

Fig. 5.8  Comparison of ROC curves and AUCs

Fig. 5.9  Comparison with single metric and combination of four metrics. 1: LBP; 2: SIFT; 3: Attribute; 4: LBP + SIFT; 5: LBP + Attribute; 6: SIFT + Attribute; and 7: LBP + SIFT + Attribute

Fig. 5.10  Some intra and inter pairs of faces in PubFig

Fig. 5.11  ROC curves and AUCs of Sub-SML and GMSL

Fig. 6.1  Steps 3 and 4

Fig. 6.2  Some visible and infrared face images. The first row shows the visible face images. The second row shows the infrared face images

Fig. 6.3  Some 2D palmprint images from the 2D plus 3D palmprint dataset

Fig. 6.4  Four ROI images of the same palm. The first, second, third, and fourth ROI images were extracted from the red, green, blue, and near infrared images, respectively

Fig. 6.5  Some image samples from the GT face dataset

Fig. 6.6  Sample images of one individual from the LFW dataset

Fig. 7.1  The framework of the proposed method. Two streams with five convolution layers and three full-connected layers are used for feature extraction. For the real-valued outputs from these two neural networks, their similarity is preserved by using a pairwise loss. Based on the outputs, a consistent hash code is generated. Furthermore, an asymmetric loss is introduced to exploit the semantic information between the binary code and real-valued data

Fig. 7.2  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and DADH on the IAPR TC-12 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.3  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and DADH on the MIRFLICKR-25K dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.4  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and DADH on the CIFAR-10 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.5  The changing of the objective function values and MAP scores with an increase in iterations

Fig. 7.6  The changing of objective function values and MAP scores with an increase in iterations. Note that the asymmetric terms ||tanh(F)B^T − kS||_F^2 and ||tanh(G)B^T − kS||_F^2 are removed

Fig. 7.7  The motivation of the point-to-angle matching. Assume f1, f2, f3, f4, f5, f6, and b1 belong to the same Hamming space R1. (1) (f1^T f2 − k)^2 makes them similar and discrete. However, if f1 and f2 are both transformed to f1′ and f2′ in R2, the value of f1′^T f2′ remains unchanged but their hashing codes change. (2) The asymmetric inner product (f1^T b1 − k)^2 can well tackle the problem in (1), but this term strongly enforces each element in f1 (f2) to approximate to be either +1 or −1. This is too strict and unnecessary since what we need is sign(f1). (3) Thus, some works try to use the cosine distance to measure the similarity between each pair of samples. Although this strategy can address the existing problem in (2), it may be unable to process some specific cases like (f3, f4). (4) In order to solve the problems mentioned in (1), (2), and (3), we propose a relaxed asymmetric strategy that exploits the cosine distance between the real-valued outputs and the binary codes. In this figure, (f5^T b1 / (||f5|| ||b1||) − 1)^2 or (f6^T b1 / (||f6|| ||b1||) − 1)^2 not only encourages them onto the same hypercube without any length constraint but also efficiently avoids the case that occurs between f3 and f4

Fig. 7.8  The motivation of the novel triplet loss. f1 and f2 belong to the same class, while f3 belongs to another class. According to the analysis in Fig. 7.7, f1 and f2 are located in appropriate positions. However, the Euclidean distance between them is much larger than that of f2 and f3. The traditional triplet loss cannot be directly used. Thus, we propose a novel triplet loss. Firstly, each output is normalized onto a unit ball to get f̄1, f̄2, and f̄3 corresponding to f1, f2, and f3, respectively. Then [1 − (f̄1^T f̄3 − 1)^2 + (f̄1^T f̄2 − 1)^2]_+ is constructed to encourage f̄1 to be closer to f̄2 than to f̄3 under the Hamming distance

Fig. 7.9  The framework of the proposed method. The top and bottom streams with different weights are used to perform the feature extraction for the images. A shared binary code b_i is then learned corresponding to the same input in both streams. Here we assume that (b1, f1, f2) have the same semantic information, while (b2, f3) enjoy different semantic information from b1. In the learning step, the relaxed asymmetric loss is exploited to make (b1, f1, f2) locate in the same Hamming space without the length constraint. We also propose a novel triplet loss to rank the positive and negative pairs

Fig. 7.10  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the CIFAR-10 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.11  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the NUS-WIDE dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.12  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the IAPR TC-12 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.13  The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the MIRFLICKR-25K dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit

Fig. 7.14  The sketch of the network for learning the single-image and cross-image representations

Fig. 7.15  The proposed deep architecture of the pairwise comparison model (best viewed in color)

Fig. 7.16  The proposed deep architecture of the triplet comparison model (best viewed in color)

Fig. 7.17  Rank-1 accuracy versus λ in the CUHK03 dataset (best viewed in color)

Fig. 7.18  The rank-1 accuracies and CMC curves of different methods on the CUHK03 dataset [37] (best viewed in color)

Fig. 7.19  The rank-1 accuracies and CMC curves of different methods on the CUHK01 dataset [37] (best viewed in color)

Fig. 7.20  The rank-1 accuracies and CMC curves of different methods on the VIPeR dataset [37] (best viewed in color)
List of Tables

Table 2.1  The average accuracy and error bar (percentage) of 5 independent experiments for DM detection

Table 2.2  The area under curve (AUC) for the different approaches in DM detection. Bold values mean the best performances

Table 2.3  The average accuracy and error bar (percentage) in 5 independent experiments for IGR detection. Bold values mean the best performances

Table 2.4  The area under curve (AUC) for the different methods in IGR detection. Bold values mean the best performances

Table 2.5  Face recognition rates on the Extended Yale B database. Bold values mean the best performances

Table 2.6  Face recognition rates on the AR database. Bold values mean the best performances

Table 2.7  Recognition accuracy by competing approaches on the disguise AR database. Bold values mean the best performances

Table 2.8  Face recognition rates on FRGC2.0 Exp 4. Bold values mean the best performances

Table 2.9  Face recognition rates on LFW. Bold values mean the best performances

Table 2.10  The categorization rates on the 17 class Oxford Flowers dataset. The results in bracket are obtained under equal feature weights. Bold values mean the best performances

Table 2.11  The categorization accuracy on the 102 category Oxford Flowers dataset. The results in bracket are obtained under equal feature weights. Bold values mean the best performances

Table 2.12  The classification rate and their corresponding error bar in 5 independent experiments for Fatty Liver detection. Best results are highlighted in bold

Table 2.13  The AUC values on the Fatty Liver diagnosis when the number of training instances is 200, 250, and 300, respectively. Best results are highlighted in bold

Table 2.14  The time consumption using the proposed method with different training data numbers

Table 3.1  The overall and mean classifying accuracy values calculated by SAGP and other comparison methods with the dimensionality of the latent variable varying from 1 to 10 using dataset Wiki Text-Image. Bold values mean the best performances

Table 3.2  The classifying accuracy values of each category using different individual or multi-view strategies on Wiki Text-Image dataset

Table 3.3  The overall and mean classifying accuracy values calculated by SAGP and other comparison approaches with the dimensionality of latent variable varying from 40 to 130 using dataset AWA. Bold values mean the best performances

Table 3.4  The overall and mean classifying accuracy values calculated by SAGP and other comparison methods with dimensionality of the latent variable varying from 1 to 30 using the dataset NUS-WIDE-LITE. Bold values mean the best performances

Table 3.5  The overall and average classification accuracies on the AWA dataset obtained by MKSGP and other comparison methods when the dimensionality of the latent variable changes from 40 to 130. Bold values mean the best performances

Table 3.6  The overall and average classifying accuracy values calculated by various approaches with q varying from 1 to 30 using dataset NUS-WIDE-LITE. Bold values mean the best performances

Table 3.7  The accuracy, sensitivity, and specificity values calculated by different methods on Biomedical dataset with the training samples numbers set 30, 40, and 50, respectively. Best results are highlighted with bold

Table 3.8  Overall/average classifying accuracy values calculated by SLEMKGP and various comparison strategies with dimensionality q varying from 1 to 10 using dataset Wiki Text-Image. Bold values mean the best performances

Table 3.9  Overall/average classifying accuracy values calculated by SLEMKGP and various comparison approaches with dimensionality q varying from 40 to 130 using AWA dataset

Table 3.10  Overall/average classifying accuracy values calculated by SLEMKGP and various comparison approaches with dimensionality q varying from 1 to 30 using dataset NUS-WIDE-LITE

Table 3.11  Overall/average classifying accuracy values calculated by SLEMKGP with varying dimensionality d_p of {P^v}_{v=1}^V by using the datasets AWA and NUS-WIDE-LITE

Table 3.12  Overall/average classifying accuracy values calculated by SLEMKGP when d_p = q using the datasets AWA and NUS-WIDE-LITE

Table 4.1  Some necessary notations in this chapter

Table 4.2  The classification accuracy on synthetic dataset with a different number of views and features. Bold values mean the best performances

Table 4.3  The classification accuracies on the synthetic dataset obtained by MVMFL. Bold values mean the best performances

Table 4.4  The average classification accuracies and error bars on the biomedical dataset obtained by MVMFL and other comparison methods when the number of training samples changes from 30 to 90. Bold values mean the best performances

Table 4.5  The area under curve (AUC) for the different methods in DM detection. Bold values mean the best performances

Table 4.6  The classification accuracies on the synthetic dataset obtained by HMMF. Bold values mean the best performances

Table 4.7  The accuracy, sensitivity, and specificity values obtained by different methods on the Biomedical dataset when the number of training samples is 40, 50, 60, and 70, respectively. Best results are highlighted in bold

Table 4.8  The area under curve (AUC) obtained by the different methods in DM detection. Bold values mean the best performances

Table 5.1  Description of 8 UCI datasets

Table 5.2  Comparisons with baselines of single metric and combined metrics with single feature on LFW dataset

Table 5.3  Comparisons with baselines of single metric and combined metrics with multiple features on LFW dataset

Table 5.4  Comparisons with the state-of-the-art metric learning methods on LFW dataset

Table 5.5  Accuracy (%) comparison with the state-of-the-art results on LFW dataset under image restricted protocol

Table 5.6  Accuracy (%) comparisons with existing deep metric learning on LFW data in restricted protocol

Table 5.7  Performance comparisons between our GMSL and other metric learning on PubFig faces

Table 5.8  Descriptions of the 11 UCI and the 4 handwritten digit datasets

Table 5.9  Comparison of classification error rate between CDSM and CDSM with distance part (CDSM_dis), CDSM with similarity part (CDSM_sim). Here, μ is a balance parameter to control the effect of distance or similarity part in CDSM. Bold values mean the best performances

Table 5.10  Comparison of classification error rate between our CDSM and the state-of-the-art metric learning methods on the UCI datasets. Bold values mean the best performances

Table 5.11  Comparison of classification error rate between our CDSM and the state-of-the-art metric learning methods on the handwritten digit datasets. Bold values mean the best performances

Table 5.12  Comparison of classification error rate between CDSM and kernelized CDSM. Bold values mean the best performances

Table 5.13  Influence on running time and correctness by two improvement strategies. In counting the training time, we leave out the time for constructing triplets

Table 5.14  Running time of different metric learning methods on UCI datasets. Note that we compute the time for training stage without building constraints

Table 6.1  Classification error rate (%) of either of NIR and VIS images from the HFB dataset

Table 6.2  Classification error rate (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the HFB dataset

Table 6.3  Classification error rate (%) of either of the 2D and 3D palmprint images

Table 6.4  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the 2D plus 3D palmprint dataset

Table 6.5  Classification error rate (%) of blue, green, or near infrared images of the Multispectral dataset

Table 6.6  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the Multispectral dataset with the score fusion of blue and near infrared images

Table 6.7  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the Multispectral dataset with the score fusion of green and near infrared images

Table 6.8  Classification error rate (%) of the R, G, B color channels of the GT face dataset

Table 6.9  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the GT face dataset based on the score fusion of the red and green channels

Table 6.10  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the GT face dataset based on the score fusion of the blue and green channels

Table 6.11  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the GT face dataset based on the score fusion of the red and blue channels

Table 6.12  Classification error rate (%) of the R, G, B color channels of the LFW face dataset

Table 6.13  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the LFW face dataset based on the score fusion of the red and green channels

Table 6.14  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the LFW face dataset based on the score fusion of the blue and green channels

Table 6.15  Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the LFW face dataset based on the score fusion of the red and blue channels

Table 6.16  Summary of the fourteen datasets

Table 6.17  Classification accuracy (%) of Lib-SVM, FaLK-SVM, FaLK-SVMad, and FaLK-SVMar on the fourteen UCI datasets

Table 6.18  Summary of the three large-scale datasets

Table 6.19  Classification accuracy (%) of Lib-SVM, FaLK-SVM, FaLK-SVMad, and FaLK-SVMar on the three large-scale datasets

Table 7.1  The network structure of CNN-F

Table 7.2  The MAP scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

Table 7.3  The Top-500 MAP and Top-500 Precision scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

Table 7.4  The MAP scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

Table 7.5  The Top-500 MAP and Top-500 Precision scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

Table 7.6  The MAP scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

Table 7.7  The Top-500 MAP and Top-500 Precision scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

Table 7.8  The MAP scores obtained by different methods on the CIFAR-10 dataset. 1000 samples are selected for testing and the remaining samples are used for training in this dataset. Bold values mean the best performances

Table 7.9  The MAP scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

Table 7.10  The MAP@Top500 and Precision@Top500 scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

Table 7.11  The MAP scores obtained by different methods on the NUS-WIDE dataset. Bold values mean the best performances

Table 7.12  The MAP@Top500 and Precision@Top500 scores obtained by different methods on the NUS-WIDE dataset. Bold values mean the best performances

Table 7.13  The MAP scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

Table 7.14  The MAP@Top500 and Precision@Top500 scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

Table 7.15  The MAP scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

Table 7.16  The MAP@Top500 and Precision@Top500 scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

Table 7.17  The rank-1 accuracies (%) of the proposed pairwise and triplet comparison models

Table 7.18  The training times of the proposed pairwise and triplet comparison models
Chapter 1
Introduction
Information fusion plays a key role in a number of fields, e.g., machine learning, deep learning, and pattern recognition. It fuses multiple features, modalities, views, or algorithms, greatly contributing to performance improvements in many applications. This chapter introduces what information fusion is, reviews its history, and summarizes the main contributions of this book. After reading this chapter, readers will have a basic understanding of information fusion.
1.1 Why Do Information Fusion?
Pattern recognition, as a classical field, has always attracted much attention, and many works [1, 2] have been devoted to it. Popular classifiers such as the K-nearest neighbor (KNN) [3] and the support vector machine (SVM) [4] are widely used due to their effectiveness and simplicity. With the rapid development of l1-norm minimization techniques, sparse representation has also been exploited in computer vision (e.g., image restoration [5] and face recognition [6]). In [6], Wright et al. presented the so-called sparse representation based classification (SRC) for face recognition. In SRC, a sample is represented over a dictionary consisting of training samples, and SRC achieves a considerable improvement in face recognition. For samples corrupted by noise or outliers, several robust models have been presented [7, 8]. The method in [7] regresses a given sample robustly with regularized regression coefficients, while Nie et al. [8] used the l2,1-norm as the loss function to robustly select features for pattern recognition. Many other sparse representation models have been presented in [9–13] and [14]. Recently, thanks to the rapid development of deep learning, related methods [15–17] have also achieved remarkable performance in pattern recognition.
Although many methods have been presented for pattern recognition, most of them only consider a single view. However, there generally exist different views from
a common sample in practice. For instance, we can capture images of an object from multiple angles; different features, e.g., SIFT [18] and HOG [19], can also be used to represent an image. These kinds of data or signals are often referred to as multi-modal, multi-view, or multi-feature data. It has been proven that exploiting multiple views contributes to enhancing pattern recognition performance. Thus, it is important and necessary to conduct in-depth research on multi-view learning.
Additionally, with the rapid development of mathematical and computing techniques, various types of metrics and pattern recognition algorithms have been presented to represent the complex distributions of real-world data. For instance, to compute the similarity or dissimilarity of two instances, metric learning [20–23] is a widely used technique. A typical branch of metric learning approaches computes a Mahalanobis distance metric from a labeled training set (for classification problems) or from sets of positive (similar) and negative (dissimilar) pairs (for verification problems) based on a predefined objective function. Besides the Mahalanobis distance, some works also apply the Euclidean or Hamming distance to measure similarity. However, learning only a single metric is not robust, due to the complex variation of real-world data structures. Thus, combining various metrics is a necessary direction for improving pattern recognition performance.
Furthermore, especially for classification, which is a key branch of pattern recognition, the classifier plays a significant role in performance improvement. For example, as mentioned above, many classifiers have been designed for classification, including KNN, SVM, and SRC. However, different classifiers have their own advantages and limitations. Similar to metrics, a single classifier is not robust enough for some pattern recognition tasks. Therefore, fusing multiple classifiers is also a key point in pattern recognition [24–26]. Generally, a classifier-based fusion method is also called score fusion.
According to the aforementioned analysis, we refer to the fusion not only of various features or views but also of multiple techniques (metrics or classifiers) as information fusion, allowing more valuable information from different modalities to be utilized and higher accuracy to be obtained [24, 27–31]. In this book, we mainly focus on these two branches and propose various machine learning and deep learning based fusion approaches, which are then applied to pattern recognition tasks, e.g., classification, verification, and retrieval.
1.2 Related Works
Here we briefly review related works on information fusion, which can be roughly classified into two branches: multi-view (modal or feature) based fusion methods and multi-technique (metric or classifier) based fusion methods.
1.2.1 Multi-View Based Fusion Methods
A number of works related to multi-feature learning have been widely studied. Since sparse representation is effective for single-view learning, it has been widely extended to multi-view learning. Joint sparse representation was first introduced by Yuan et al. [32], who imposed the l2,1-norm across tasks (MTJSRC). Additionally, a similar work based on collaborative representation was proposed by Yang et al. [33], which encourages the representation coefficients of multiple views to be similar to their averaged value (RCR). Considering label information, a joint discriminative and collaborative representation method (JDCR) was proposed in [34] for multi-view data. In these three methods, the dictionary is fixed to the training data. By contrast, Jing et al. [35] studied a dictionary learning based method that exploits the correlation across various features by imposing a low-rank prior on the learned dictionary. A sparse representation based algorithm was analyzed in [36] (UMDL and SMDL) to compute a multimodal dictionary, extracting the relationship among different types of modalities. Guo [37] studied convex subspace representation learning (CSRL), in which a common subspace is obtained by using a group sparsity norm. To embed multi-feature learning into the semi-supervised setting, a multi-feature shared learning approach was introduced by Zhang et al. [38], which uses a novel l2-norm regularization to enforce consistency on the global label.
Apart from the aforementioned methods, Canonical Correlation Analysis (CCA) [39–41] has also attracted much attention. It aims to learn two different mappings by maximizing the correlation between two views. Besides, CCA has been extended to several methods by imposing priors on the shared subspace, such as sparse CCA [42] and robust CCA [43, 44]. Specifically, Archambeau et al. [42] added sparsity as a prior to CCA to reduce the influence of noise, while [43] and [44] imposed the l1 loss and a Student-t density on CCA to remove outliers in the data. As CCA and its extensions are only suitable for applications with two views, Rupnik et al. proposed multi-view CCA (MCCA) [45]. MCCA estimates multiple projections to map each view onto a shared subspace, in which the sum of all pairwise correlations is maximized. Apart from these algorithms, the well-known linear discriminant analysis (LDA) has also been adapted to multi-view data. In particular, [46] analyzed a multi-view discriminant analysis (MvDA), which learns a discriminant common space. Additionally, generalized multi-view LDA (GMLDA) [47] was studied by computing a group of transformations for each view.
Although a number of multi-view learning methods have been introduced and achieve satisfactory results, they follow the assumption that the data can be modeled linearly. In many applications this may not be reasonable, since the data contain non-linearity. To represent data non-linearly, some kernel-based methods have been proposed. A typical example is kernel CCA [48], which maps the source data into a high-dimensional space. Given supervised information, a supervised approach denoted as Multi-view Fisher Discriminative Analysis (MFDA) [49] was introduced, which learns independent classifiers for multiple views. Also, another multi-view learning method was presented by exploiting a manifold regularization
based on sparse feature selection [50]. Due to this regularization, the manifold structure can be preserved, and the performance on scene image classification is greatly enhanced. Unlike existing algorithms that assume a parametric or deterministic mapping, the Gaussian Process Latent Variable Model (GPLVM) [51–54] is a non-parametric technique that can represent data in a non-linear and generative way. GPLVM places a Gaussian Process prior on the transformations, so that a latent variable can be smoothly learned in a subspace, and it is more effective for data representation than several dimensionality reduction methods. Depending on the task, GPLVM has been modified into different versions, such as discriminative GPLVM (DGPLVM) [54], shared GPLVM (SGPLVM) [53], and discriminative SGPLVM (DSGPLVM) [51].
1.2.2 Multi-Technique Based Fusion Methods
A distance metric can be learned by using either a pairwise loss or a triplet loss. For the pairwise loss, related methods include information-theoretic metric learning (ITML) [55] and logistic discriminant metric learning (LDML) [20]. ITML learns the Mahalanobis distance metric by minimizing the differential relative entropy between two multivariate Gaussians, whereas LDML obtains this metric via logistic discriminant regression. Apart from these two methods, pairwise loss based metric learning has also been studied in the context of online learning, optimization algorithms, and robust learning; typical methods are [56], [57], and [58]. As for the triplet loss, related works include BoostMetric [59], FrobMetric [60], MetricBoost [61], and large margin nearest neighbor (LMNN) [62–64]. BoostMetric exploits a boosting based approach to obtain a positive semidefinite matrix; it was subsequently improved by MetricBoost [61] and FrobMetric [60], which decrease the computational complexity. LMNN [64] enforces the distance between two dissimilar samples to be larger, by a margin, than that between similar ones, and it has been extended to low-rank transformation learning and transformation-invariant classification [65, 66]. Recently, deep and non-linear metric learning methods have also adopted these two losses. For instance, local metric learning based approaches use kernelization [55, 67] to take the data manifold structure into account, representing the training data in a non-linear way. Besides, deep learning based metric learning strategies have been studied using restricted Boltzmann machines (RBMs) [68], feed-forward multi-layer neural networks [69], and convolutional neural networks (CNNs) [70].
Apart from fusing multiple metrics in metric learning, classifier-based score level fusion is another branch of multi-technique fusion. Compared with decision level fusion, score level fusion combines multiple scores with a fusion strategy and thus retains more information. As analyzed in [71], since various scores can be obtained independently and integrated efficiently, score level fusion often yields higher accuracy.
By consolidating the evidence obtained from different classifiers, a theoretical framework for score level fusion was proposed in [72]. Under this framework, score level fusion can be achieved with the product rule, sum rule, min rule, max rule, median rule, or majority voting. In contrast, Prabhakar and Jain [73] argued that assuming statistically independent feature sets is too restrictive for a multimodal biometric system; in this case, they obtained the optimal result in the Neyman–Pearson decision sense, provided enough training samples were available for joint density estimation. Also, a fusion algorithm that integrates multiple scores via Bayesian statistics was proposed in [74]; by combining the image and the speech signal, this approach achieves better recognition results than any single modality. Additionally, a sum fusion framework and a max-score/min-score fusion framework were studied in [75] and [76].
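To make these fixed combination rules concrete, the following minimal Python sketch applies the product, sum, min, max, median, and majority-voting rules discussed above to a set of per-classifier class scores. It is only an illustration (function and variable names are ours, not from the cited works), and it assumes the scores have already been normalized to a common range, which in practice is the role of score normalization.

```python
import numpy as np

def fuse_scores(score_list, rule="sum"):
    """Combine per-classifier class scores with a fixed fusion rule.

    score_list: list of 1-D arrays, one per classifier, each holding
    normalized scores for the same ordered set of classes.
    Returns the index of the predicted class.
    """
    S = np.vstack(score_list)          # shape: (n_classifiers, n_classes)
    if rule == "sum":
        fused = S.sum(axis=0)
    elif rule == "product":
        fused = S.prod(axis=0)
    elif rule == "max":
        fused = S.max(axis=0)
    elif rule == "min":
        fused = S.min(axis=0)
    elif rule == "median":
        fused = np.median(S, axis=0)
    elif rule == "vote":               # majority voting on per-classifier decisions
        fused = np.bincount(S.argmax(axis=1), minlength=S.shape[1])
    else:
        raise ValueError(rule)
    return int(fused.argmax())

# Two classifiers scoring three classes.
scores = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.3, 0.6])]
print(fuse_scores(scores, "sum"), fuse_scores(scores, "product"))
```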
1.3 Book Overview
In Chap. 2, we propose three fusion methods based on sparse or collaborative representation. (1) Considering that there exist both similar and specific components among multiple views or features, we propose the Joint Similar and Specific Learning (JSSL) method, which not only represents the inputs sparsely but also extracts both the similarity and the diversity, making it more reasonable for multi-view data representation. (2) Due to the l1-norm in JSSL, its test cost is quite large and far from the real-time requirement of practical applications. To tackle this problem, a method based on collaborative representation, named Relaxed Collaborative Representation (RCR), is studied. In RCR, the l1-norm is replaced by the l2-norm, allowing a closed-form solution and a lower test time. (3) Although RCR meets the real-time requirement, it fails to exploit the label information that is beneficial for classification. Thus, we further extend RCR by embedding a discriminative regularization, which decreases the intra-class distance and enlarges the inter-class dissimilarity.
To handle the non-linearity of datasets with complex scenes, in Chap. 3 three fusion methods based on the Gaussian Process Latent Variable Model (GPLVM) are presented. (1) Instead of assuming a specific mapping as done in many existing methods, the first approach, called the Shared Auto-encoder Gaussian Process latent variable model (SAGP), introduces the GPLVM to non-linearly represent data in a generative and non-parametric way. In SAGP, a latent variable is shared among different views. Compared with GPLVM and most of its extensions, SAGP jointly takes the projections from and to the observations into account. In other words, mapping functions are learned to project the variable from the shared subspace to the observed space, while a back constraint that maps observed data to the shared variable is also exploited. In this way, the latent variable can be obtained straightforwardly when a testing instance arrives. To apply SAGP to pattern recognition, a discriminative regularization is exploited to encourage the latent variables
from the same category to be similar and those from different categories to be dissimilar. (2) Although SAGP can non-linearly model multi-view data, it only adopts a single kernel to build the covariance matrix; the kernel function plays an important role in SAGP, and a single kernel may be incapable of modeling the complex distributions of some real-world datasets. We therefore give an extended version of SAGP, named the Multi-Kernel Shared Gaussian Process latent variable model (MKSGP), which combines multi-kernel learning and SAGP into a joint model that adaptively and automatically fits the data. Furthermore, different from SAGP, which learns the latent variable and the classifier in two separate phases, MKSGP embeds a large margin regularization into the model to compute a hyperplane for each category, distinguishing latent variables from different classes. In MKSGP, the classifier is computed in an online way and thus adapts to the input data. (3) Both SAGP and MKSGP introduce an auto-encoder structure to estimate the projection from multiple types of data to the shared latent variable. However, they simply sum the covariance matrices computed from each view. To tackle this issue, we first learn a projection matrix for each view that projects the multiple observations onto a common subspace; the Gaussian Process based mapping from this subspace to the latent variable is then easily learned. We name this method the Shared Linear Encoder based Multi-Kernel Gaussian Process latent variable model (SLEMKGP).
Although JSSL, RCR, JDCR, SAGP, MKSGP, and SLEMKGP handle multi-view data in sparse/collaborative representation, non-linear, or multi-kernel ways and achieve outstanding performance, some problems remain. One key difficulty is that, besides the multiple views collected from a single object, each view may further be represented by multiple features, and these multiple features within a view are often beneficial for classification. We refer to such data with multiple views and multiple features as multi-view and multi-feature data. A typical example is person verification: a person can be identified by the fingerprint, palmprint, iris, and face, while each view or modality can also be represented with various features, such as Gabor and wavelet features. To the best of our knowledge, most existing methods only consider multiple views and ignore the case where a view can be further described by various features. A naive way of modeling this kind of data is to concatenate the various features of each view into a single vector and then apply multi-view methods to the resulting vectors. Although this strategy is easy to implement, it has limitations. First, it may lose the correlation across different features within a view, although this correlation is valuable for classification. Second, it may suffer from over-fitting when the dimension of the concatenated vectors is large while the number of training samples is small. Therefore, it is necessary to design a novel method that models multi-view and multi-feature data and fully exploits the correlation among them. To tackle this problem, in Chap. 4 two probabilistic and generative models are proposed that represent multi-view and multi-feature data with hierarchical structures.
(1) For each kind of feature in each view, there is a projection through which a latent variable is projected to this feature. Here we assume that both the
observed feature and its latent variable follow Gaussian distributions. Specifically, to exploit the label information, the latent variables from the same class are assumed to follow the same distribution. In this way, the correlations among multiple features and multiple views are both extracted. We name this method Multi-view and Multi-feature Learning (MVMFL). (2) In contrast to MVMFL, which learns a latent variable for each feature of a view, our second approach learns a shared latent variable as the fused feature. Thanks to this assumption, the Expectation Maximization (EM) algorithm [77] can be used to optimize the presented approach, and a closed-form solution for each variable or parameter can be obtained. We name this method Hierarchical Multi-view Multi-feature Fusion (HMMF).
In Chap. 5, we focus on multi-metric strategies to obtain more efficient and robust similarity/dissimilarity measurements. Existing metric learning strategies share a common characteristic: most of them compute only a single distance metric. However, the structure of practical data is quite complex, and a single measurement cannot reasonably capture this variation. Therefore, in this chapter, two metric learning based fusion methods are proposed to combine multiple strategies for similarity/dissimilarity measurement. (1) We present a novel method, named generalized metric swarm learning (GMSL), that computes a metric swarm by learning local patch-like sub-metrics. (2) Considering that similarity and distance measures are complementary and can be integrated to further improve accuracy, this chapter also proposes a joint distance and similarity measure learning method based on minimizing a triplet loss. First, we propose to combine distance and similarity measures to improve classification accuracy. Second, we propose a max-margin model to compute the integrated distance and similarity measure by minimizing the triplet loss. Moreover, the proposed combined distance and similarity measure (CDSM) can also be analyzed from a pairwise kernel perspective, which allows the method to be kernelized in a non-linear way.
In Chap. 6, we propose two adaptive weighted classifier score fusion methods for classification. (1) Although score fusion methods have been widely studied, how to obtain optimal weights for different scores is still an open problem. To the best of our knowledge, there are few works on automatic weight selection, and most existing methods set the weights empirically. Thus, it is significant to propose a novel method for adaptive weight selection, and in this chapter we present an adaptive weighted fusion strategy that obtains the optimal weights automatically. (2) Since real-world data often follow complex distributions that a linear representation cannot model, it is reasonable to propose a non-linear weighted fusion method. Thus, a fusion method based on adaptive kernel selection is also studied.
Considering that the aforementioned fusion methods only utilize hand-crafted features, in Chap. 7 we propose three end-to-end deep learning methods that exploit two or more branches of networks and apply them to hashing based image retrieval. (1) We first propose a novel asymmetric supervised deep hashing approach, called Dual Asymmetric Deep Hashing learning (DADH), to
project the image into a binary subspace in which the semantic information is preserved. In particular, we establish two branches of networks so that the similarity between two images can be extracted. Since the learned binary codes should preserve the semantic information, an asymmetric pairwise loss is exploited to measure the similarity between discrete- and real-valued features based on the given labels. (2) However, DADH only focuses on point-to-point matching, which is too strict and unnecessary. We therefore further present a new approach that transforms the matching between two images into a point-to-angle version, named Relaxed Asymmetric Deep Hashing learning (RADH). In detail, we asymmetrically measure the similarity and dissimilarity between the discrete code and the real-valued feature through an inner product. Instead of forcing their inner product to be either +1 or −1, we only require it to fall into the semantic-related space. This strategy not only exploits the semantic information but also leaves more freedom for feature learning. To capture deeper semantic affinity, a novel Hamming distance-based triplet loss is proposed to rank different positive and negative pairs. (3) Person re-identification has usually been solved as either the matching of single-image representations (SIR) or the classification of cross-image representations (CIR). In this chapter, the connection between them is exploited, and a joint learning of SIR and CIR based on a convolutional neural network (CNN) is proposed. In detail, our method consists of one shared sub-network and two sub-networks that respectively obtain the SIRs of individual images and the CIRs of image pairs. The SIR sub-network only needs to be computed once for each image (in both the probe and gallery sets), and the depth of the CIR sub-network is kept minimal to decrease computational cost. Therefore, the two kinds of representations are trained simultaneously to achieve better matching accuracy with moderate complexity. Besides, the representations obtained with pairwise and triplet objective functions can be integrated to further enhance performance.
References 1. Sivic J, Zisserman A. Video google: a text retrieval approach to object matching in videos. In: Ninth IEEE international conference on computer vision proceedings 2003. Piscataway: IEEE; 2003. p. 1470–77. 2. Raina R, Battle A, Lee H, Packer B, Ng AY. Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the 24th international conference on machine learning. New York: ACM; 2007. p. 759–66. 3. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85. 4. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–297. 5. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. Non-local sparse models for image restoration. In: 2009 IEEE 12th international conference on computer vision. Piscataway: IEEE; 2009. p. 2272–79. 6. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y. Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell. 2009;31(2):210–27.
7. Yang M, Zhang L, Yang J, Zhang D. Regularized robust coding for face recognition. IEEE Trans Image Process. 2013;22(5):1753–66. 8. Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint 2, 1-norms minimization. In: Advances in neural information processing systems. 2010. p. 1813–21. 9. Gao S, Tsang IWH, Chia LT. Kernel sparse representation for image classification and face recognition. In: Computer vision–ECCV 2010. Berlin: Springer; 2010. p. 1–14. 10. Zuo W, Meng D, Zhang L, Feng X, Zhang D. A generalized iterated shrinkage algorithm for non-convex sparse coding. In: 2013 IEEE international conference on computer vision (ICCV). Piscataway: IEEE; 2013. p. 217–24. 11. Yang M, Van Gool L, Zhang L. Sparse variation dictionary learning for face recognition with a single training sample per person. In: 2013 IEEE international conference on computer vision (ICCV). Piscataway: IEEE; 2013. p. 689–96. 12. Rodriguez F, Sapiro G. Sparse representations for image classification: learning discriminative and reconstructive non-parametric dictionaries. Technical report, DTIC Document 2008. 13. Yang M, Zhang L, Feng X, Zhang D. Sparse representation based Fisher discrimination dictionary learning for image classification. Int J Comput Vision 2014;109(3):209–32. 14. Yang M, Dai D, Shen L, Van Gool L. Latent dictionary learning for sparse representation based classification. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway: IEEE; 2014, p. 4138–45. 15. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems 2012. p. 1097–1105. 16. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition 2016. p. 770–8. 17. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition 2014. arXiv:1409.1556. 18. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004;60(2):91–110. 19. Wang X, Han TX, Yan S. An HOG-LBP human detector with partial occlusion handling. In: 2009 IEEE 12th international conference on computer vision. Piscataway: IEEE; 2009. p. 32–9. 20. Guillaumin M, Verbeek J, Schmid C. Is that you? Metric learning approaches for face identification. In: 2009 IEEE 12th international conference on computer vision. Piscataway: IEEE; 2009. p. 498–505. 21. Taigman Y, Wolf L, Hassner T et al. Multiple one-shots for utilizing class label information. In: The British machine vision conference (BMVC), vol. 2. 2009, p. 1–12. 22. Nguyen HV, Bai L. Cosine similarity metric learning for face verification. In: Asian conference on computer vision. Berlin: Springer; 2010. p. 709–20. 23. Cao Q, Ying Y, Li P. Similarity metric learning for face recognition. In: Proceedings of the IEEE international conference on computer vision. 2013. p. 2408–15. 24. Ross A, Nandakumar K. Fusion, score-level. In: Encyclopedia of biometrics. 2009, p. 611–6. 25. Jain A, Nandakumar K, Ross A. Score normalization in multimodal biometric systems. Pattern Recognit. 2005;38(12):2270–85. 26. Xu Y, Zhu Q, Zhang D. Combine crossing matching scores with conventional matching scores for bimodal biometrics and face and palmprint recognition experiments. Neurocomputing 2011;74(18):3946–52. 27. Mc Donald K, Smeaton AF. A comparison of score, rank and probability-based fusion methods for video shot retrieval. 
In: International conference on image and video retrieval. Berlin: Springer; 2005. p. 61–70. 28. Nagar A, Nandakumar K, Jain AK. Multibiometric cryptosystems based on feature-level fusion. IEEE Trans Inf Forensics Secur. 2011;7(1):255–68. 29. Nandakumar K, Jain AK, Ross A. Fusion in multibiometric identification systems: what about the missing data? In: International conference on biometrics. Berlin: Springer; 2009. p. 743–52. 30. Nandakumar K, Chen Y, Dass SC, Jain A. Likelihood ratio-based biometric score fusion. IEEE Trans Pattern Anal Mach Intell. 2007;30(2):342–7.
31. Sim KC, Lee KA. Adaptive score fusion using weighted logistic linear regression for spoken language recognition. In: 2010 IEEE international conference on acoustics, speech and signal processing. Piscataway: IEEE; 2010. p. 5018–21. 32. Yuan XT, Liu X, Yan S. Visual classification with multitask joint sparse representation. IEEE Trans Image Process. 2012;21(10):4349–60. 33. Yang M, Zhang L, Zhang D, Wang S. Relaxed collaborative representation for pattern classification. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway: IEEE; 2012. p. 2224–31. 34. Li J, Zhang B, Zhang D. Joint discriminative and collaborative representation for fatty liver disease diagnosis. Exp Syst Appl. 2017;89:31–40. 35. Wu F, Jing XY, You X, Yue D, Hu R, Yang JY. Multi-view low-rank dictionary learning for image classification. Pattern Recognit. 2016;50:143–54. 36. Bahrampour S, Nasrabadi NM, Ray A, Jenkins WK. Multimodal task-driven dictionary learning for image classification. IEEE Trans Image Process. 2016;25(1):24–38. 37. Guo Y. Convex subspace representation learning from multi-view data. In: Proceedings of the twenty-seventh AAAI conference on artificial (AAAI’13), vol 1. 2013. p. 2. 38. Zhang L, Zhang D. Visual understanding via multi-feature shared learning with global consistency. IEEE Trans Multimedia 2016;18(2):247–59. 39. Hotelling H. Relations between two sets of variates. Biometrika 1936;28(3–4):321–77. 40. Sun J, Keates S. Canonical correlation analysis on data with censoring and error information. IEEE Trans Neural Netw Learn Syst. 2013;24(12):1909–19. 41. Yuan YH, Sun QS. Multiset canonical correlations using globality preserving projections with applications to feature extraction and recognition. IEEE Trans Neural Netw Learn Syst. 2014;25(6):1131–46. 42. Archambeau C, Bach FR. Sparse probabilistic projections. In: Advances in neural information processing systems 2009. p. 73–80. 43. Nicolaou, MA, Panagakis Y, Zafeiriou S, Pantic M. Robust canonical correlation analysis: audio-visual fusion for learning continuous interest. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). Piscataway: IEEE; 2014. p. 1522–26. 44. Bach FR, Jordan MI. A probabilistic interpretation of canonical correlation analysis. 2005. 45. Rupnik J, Shawe-Taylor J. Multi-view canonical correlation analysis. In: Conference on data mining and data warehouses (SiKDD 2010) 2010. p. 1–4. 46. Kan M, Shan S, Zhang H, Lao S, Chen X. Multi-view discriminant analysis. IEEE Trans Pattern Anal Mach Intell. 2016;38(1):188–94. 47. Sharma A, Kumar A, Daume H, Jacobs DW. Generalized multiview analysis: a discriminative latent space. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway: IEEE; 2012. p. 2160–67. 48. Akaho S. A kernel method for canonical correlation analysis 2006. arXiv preprint cs/0609071. 49. Diethe T, Hardoon DR, Shawe-Taylor J. Constructing nonlinear discriminants from multiple data views. In: Joint European conference on machine learning and knowledge discovery in databases. Berlin: Springer; 2010. p. 328–43. 50. Lu X, Li X, Mou L. Semi-supervised multitask learning for scene recognition. IEEE Trans Cybern. 2015;45(9):1967–76. 51. Eleftheriadis S, Rudovic O, Pantic M. Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition. IEEE Trans Image Process. 2015;24(1):189– 204. 52. Lawrence ND. Gaussian process latent variable models for visualisation of high dimensional data. 
Adv Neural Inf Process Syst. 2004;16(3):329–36. 53. Ek CH, Lawrence PHTND. Shared Gaussian process latent variable models. Ph.D. Thesis 2009. 54. Urtasun R, Darrell T. Discriminative Gaussian process latent variable model for classification. In Proceedings of the 24th international conference on machine learning. New York: ACM; 2007. p. 927–34.
55. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS. Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning. New York: ACM; 2007. p. 209–16. 56. Qamar AM, Gaussier E. Online and batch learning of generalized cosine similarities. In: Ninth IEEE international conference on data mining, ICDM’09. Piscataway: IEEE; 2009. p. 926–31. 57. Ying Y, Li P. Distance metric learning with eigenvalue optimization. J Mach Learn Res. 2012;13:1–26. 58. Lin T, Zha H, Lee SU. Riemannian manifold learning for nonlinear dimensionality reduction. In: Computer Vision–ECCV 2006. Berlin: Springer; 2006. p. 44–55. 59. Shen C, Kim J, Wang L, Van Den Hengel A. Positive semidefinite metric learning using boosting-like algorithms. J Mach Learn Res. 2012;13(1):1007–36. 60. Shen C, Kim J, Wang L. A scalable dual approach to semidefinite metric learning. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway: IEEE; 2011. p. 2601–08. 61. Bi J, Wu D, Lu L, Liu M, Tao Y, Wolf M. AdaBoost on low-rank PSD matrices for metric learning. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway: IEEE; 2011. p. 2617–24. 62. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res. 2009;10:207–44. 63. Weinberger KQ, Saul LK. Fast solvers and efficient implementations for distance metric learning. In: Proceedings of the 25th international conference on machine learning. New York: ACM; 2008. p. 1160–67. 64. Weinberger KQ, Blitzer J, Saul LK. Distance metric learning for large margin nearest neighbor classification. In: Advances in neural information processing systems 2005. p. 1473–80. 65. Kumar MP, Torr PHS, Zisserman A. An invariant large margin nearest neighbour classifier. In IEEE 11th international conference on computer vision, ICCV 2007 2007. p. 1–8. 66. Torresani L, Lee KC. Large margin component analysis. In: Schölkopf B, Platt JC, Hoffman T, editors. Advances in neural information processing systems, vol. 19. Cambridge: MIT Press; 2007. p. 1385–92. 67. Wang J, Do HT, Woznica A, Kalousis A. Metric learning with multiple kernels. In: ShaweTaylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, editors. Advances in neural information processing systems, vol. 24. New York: Curran Associates; 2011, p. 1170–78. 68. Hayat M, Bennamoun M, An S. Learning non-linear reconstruction models for image set classification. In: The IEEE conference on computer vision and pattern recognition (CVPR) 2014. 69. Hu J, Lu J, Tan YP. Discriminative deep metric learning for face verification in the wild. In: The IEEE conference on computer vision and pattern recognition (CVPR) 2014. 70. Ding S, Lin L, Wang G, Chao H. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognit. 2015;48(10):2993–3003. 71. Zhang D, Song F, Xu Y, Liang Z. Other tensor analysis and further direction. In: Advanced pattern recognition technologies with applications to biometrics. Pennsylvania: IGI Global; 2009. p. 226–25. 72. Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 1998;20(3):226–239. 73. Prabhakar S, Jain AK. Decision-level fusion in fingerprint verification. Pattern Recognit. 2002;35(4):861–874. 74. Bigün ES, Bigün J, Duc B, Fischer S. Expert conciliation for multi modal person authentication systems by Bayesian statistics. 
In: International conference on audio- and video-based biometric person authentication. Berlin: Springer; 1997. p. 291–300. 75. Ross A, Jain A. Information fusion in biometrics. Pattern Recognit Lett. 2003;24(13):2115–25. 76. Snelick R, Indovina M, Yen J, Mink A. Multimodal biometrics: issues in design and testing. In: Proceedings of the 5th international conference on multimodal interfaces. New York: ACM; 2003. p. 68–72. 77. Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. New York: Springer; 2006.
Chapter 2
Information Fusion Based on Sparse/Collaborative Representation
Sparse representation is inspired by the way human beings represent data, allowing data to be represented more accurately and robustly. Many works in image classification, image retrieval, image recovery, etc., have shown its effectiveness. In contrast, collaborative representation gives up some of this robustness in exchange for fast computation. This chapter proposes one information fusion method based on sparse representation and two information fusion methods based on collaborative representation. After reading this chapter, readers will have preliminary knowledge of sparse/collaborative representation based fusion algorithms.
2.1 Motivation and Preliminary
2.1.1 Motivation
As l1-norm minimization techniques have developed rapidly in the past years, some researchers have been inspired by the sparse representation and coding mechanism of the human vision system [1, 2] and have successfully used sparse coding methods in diverse image restoration applications [3]. Researchers have also made great efforts in applying sparse coding methods to pattern recognition tasks, such as image classification [4], signal classification [5], and face recognition (FR) [6]. In the classification task, one of the most popular topics in pattern recognition and computer vision [7], algorithms utilizing sparse representation [6, 8] have achieved clear improvements (i.e., robustness to lighting changes, noise, and outliers) over representative classification approaches such as SVM [9] and Nearest Subspace [10]. In sparse representation based classification (SRC) [6], by minimizing the l1-norm, the test input is encoded as a sparse linear combination of all training instances; in particular, an identity matrix can be introduced in SRC to encode abnormal pixels, so that SRC is
robust to noise or anomalies. The success of SRC has advanced research on pattern classification, resulting in many follow-up works, for example, the l1-graph for subspace learning and clustering [11], robust sparse coding for FR [8], and sparse classification [4]. To apply SRC to information fusion, in this chapter we aim to use training data from different views to represent the corresponding test inputs, so that shared or similar representation coefficients across views are obtained. Nevertheless, assuming identical coefficients is too restrictive because differences do exist among the views. Thus we propose a novel sparse representation based method, named Joint Similar and Specific Learning (JSSL) [12], which extracts both the similarity and the distinctiveness of different views and is therefore a better and more reasonable way to represent multi-view data for fusion.
Although sparse representation is widely used in classification, researchers have recently begun to question its role [7, 13]. In [7], it was shown that it is the collaborative representation (i.e., the query image is collaboratively represented by training instances from all categories) rather than the l1-norm sparsity that ensures the validity of SRC for pattern classification. Employing the non-sparse l2-norm to regularize the representation coefficients yields classification results similar to l1-norm regularization, but the algorithm can be significantly sped up. The robustness to outliers in the query input (e.g., corruption and occlusion) in fact comes from the sparsity constraint on the coding residuals rather than on the coding coefficients. Thus, in this chapter, we also propose a collaborative representation based classifier for multi-view fusion, named Relaxed Collaborative Representation (RCR) [14], which obtains a closed-form solution and achieves competitive classification performance. Compared with JSSL, RCR admits a closed-form solution, significantly speeding up testing. However, for the classification task, neither JSSL nor RCR utilizes the available label information during training, which has a significant effect on performance. To embed the supervised information into the fusion model, we further propose another collaborative representation based approach, named Joint Discriminative and Collaborative Representation (JDCR) [15], which not only fuses multiple views but also utilizes the label information.
2.1.2 Preliminary
In this section, we first introduce the sparse representation classifier (SRC) and the collaborative representation classifier (CRC) to give readers a better understanding of the methods proposed in this chapter.
2.1.2.1 Sparse Representation Classifier
Given a set of training and testing instances, the main idea of SRC is to linearly combine the training samples to represent a test sample while keeping the representation coefficients as sparse as possible. In practice, l1-norm minimization is used to ensure that the test sample is linearly represented over the training samples in the sparsest way. Suppose the dictionary, i.e., the set of training samples, is represented as the matrix $D = [D_1, D_2, \ldots, D_J]$, where $J$ is the number of classes and $D_i \in \mathbb{R}^{m \times n_i}$, with $n_i$ samples of dimension $m$, is the training set of the $i$-th class; each training sample is a column of $D$. A test sample $y \in \mathbb{R}^{m \times 1}$ can then be coded as

$\hat{\alpha} = \arg\min_{\alpha} \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_1$    (2.1)

where $\lambda$ is the penalty parameter, $\|\cdot\|_1$ is the l1-norm, and $\|\cdot\|_2^2$ is the squared l2-norm. $\hat{\alpha} = [\hat{\alpha}_1; \hat{\alpha}_2; \cdots; \hat{\alpha}_J]$ is the sparse coefficient vector, and $\hat{\alpha}_i$ is the sub-vector corresponding to $D_i$. If the test sample $y$ belongs to the $i$-th class, it should be well represented by the training samples of that class. The category label of the testing sample is then determined by

$i^* = \arg\min_i \|y - D_i \hat{\alpha}_i\|_2^2$    (2.2)
More information about the SRC can be found in [6].
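The following is a minimal, self-contained Python/NumPy sketch of SRC in the spirit of Eqs. (2.1) and (2.2). It is not the authors' implementation: the l1 problem is solved here with a plain ISTA loop (one of several possible solvers), and the function names and toy data are purely illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    # Element-wise shrinkage used by ISTA for the l1 penalty.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def src_classify(y, D, class_ids, lam=0.01, n_iter=200):
    """SRC sketch: solve min_a ||y - D a||_2^2 + lam ||a||_1 with ISTA,
    then assign y to the class with the smallest reconstruction residual.

    D: (m, n) dictionary whose columns are training samples.
    class_ids: length-n array of class labels for the columns of D.
    """
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y)
        a = soft_threshold(a - grad / L, lam / L)
    residuals = {c: np.linalg.norm(y - D[:, class_ids == c] @ a[class_ids == c])
                 for c in np.unique(class_ids)}
    return min(residuals, key=residuals.get), a

# Toy example: two classes, five training samples of dimension 4.
rng = np.random.default_rng(0)
D = rng.standard_normal((4, 5)); D /= np.linalg.norm(D, axis=0)
labels = np.array([0, 0, 0, 1, 1])
y = D[:, 1] + 0.05 * rng.standard_normal(4)       # noisy copy of a class-0 atom
print(src_classify(y, D, labels)[0])              # should typically print 0
```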
2.1.2.2 Collaborative Representation Classifier
Recently, [7] claimed that it is the collaborative representation based classification (CRC), rather than the l1-norm sparsity on $\alpha$, that really makes SRC valid and efficient for classification. Using the l2-norm to regularize $\alpha$ yields similar classification results. In SRC, the robustness to outliers really comes from using the l1-norm to characterize the coding residual rather than from the l1 regularization on the coding coefficients in $\hat{\alpha} = \arg\min_{\alpha} \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_1$. If we do not consider robustness to outliers, the coding model of CRC is

$\hat{\alpha} = \arg\min_{\alpha} \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_2^2$    (2.3)

Then its closed-form solution can be derived as

$\hat{\alpha} = \left(D^T D + \lambda \cdot I\right)^{-1} D^T y$    (2.4)

Note that the projection matrix $P = \left(D^T D + \lambda \cdot I\right)^{-1} D^T$ used to obtain the representation $\alpha$ does not depend on the test sample $y$, so it can be computed in advance. Overall, collaborative representation works very fast.
Similar to SRC, CRC performs classification by checking which class yields the minimum regularized reconstruction error. More information about collaborative representation can be found in the work of Zhang et al. [7].
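A corresponding sketch of CRC, following Eqs. (2.3) and (2.4), is given below. Again, this is an illustrative NumPy implementation rather than the authors' code; the class decision simply uses the plain class-wise reconstruction residual.

```python
import numpy as np

def crc_classify(y, D, class_ids, lam=0.5):
    """CRC sketch: closed-form coding of Eq. (2.4) plus class-wise residuals.

    The projection P = (D^T D + lam I)^{-1} D^T does not depend on y, so in a
    real system it would be precomputed once and reused for every query.
    """
    n = D.shape[1]
    P = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T)   # (D^T D + lam I)^{-1} D^T
    a = P @ y
    best, best_err = None, np.inf
    for c in np.unique(class_ids):
        mask = class_ids == c
        err = np.linalg.norm(y - D[:, mask] @ a[mask])
        if err < best_err:
            best, best_err = c, err
    return best, a

# Toy example: two classes of four training samples each, dimension 6.
rng = np.random.default_rng(1)
D = rng.standard_normal((6, 8)); D /= np.linalg.norm(D, axis=0)
labels = np.repeat([0, 1], 4)
y = D[:, 5] + 0.05 * rng.standard_normal(6)   # noisy copy of a class-1 atom
print(crc_classify(y, D, labels)[0])          # should typically print 1
```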
2.2 Joint Similar and Specific Learning
In this section, a novel fusion method for classification [12] is proposed. In particular, the proposed method jointly represents multiple views while a similar component is shared among them. Additionally, considering the differences between views, individual components are also extracted to keep their diversity. In this way, both the similarity and the distinctiveness of the various views are exploited, which is useful for classification. An optimization algorithm based on the Alternating Direction Method (ADM) [16, 17] and the Augmented Lagrangian Multiplier (ALM) method [18, 19] is applied to solve the presented model. Note that in this chapter we apply JSSL to Healthy vs. Diabetes Mellitus (DM) and Healthy vs. Impaired Glucose Regulation (IGR) classification. Specifically, we regard the tongue, face, and sublingual vein images extracted from one person as the multi-view data; details of the feature extraction for these three views can be found in [12]. The framework of JSSL is shown in Fig. 2.1.
Fig. 2.1 The framework of our proposed approach JSSL. There exist three parts in JSSL: dictionary construction, sparse representation, and joint classification. First, the dictionary is formed by training instances. Second, we sparsely represent multi-view data, including sublingual, facial, and tongue features of a given test sample, with the dictionary, and we divide the representation coefficients into two parts, including individual components and similar components. Third, we determine the label of the test instance according to the total reconstruction error
2.2.1 Problem Formulation
As aforementioned, different views or feature vectors from the same instance may be similar to each other. A reasonable assumption is therefore that the representation coefficients coded on the corresponding dictionaries of different views should share some similarity. For instance, a sample can be represented by the tongue feature, the sublingual feature, and the facial feature. Since they come from the same instance, they can all be well represented by the training samples of the corresponding category. Hence, the values and positions of the important coefficients share some similarity. To achieve this goal, the following term is used:

$\min_{\alpha_k} \sum_{k=1}^{K} \|\alpha_k - \bar{\alpha}\|_2^2$    (2.5)

where $\alpha_k$ denotes the representation coefficient related to the $k$-th view and $\bar{\alpha} = \frac{1}{K}\sum_{k=1}^{K} \alpha_k$ denotes the mean vector of all $\alpha_k$ ($K$ denotes the number of views or features). It is easy to see that Eq. (2.5) aims to decrease the variance of the representation coefficients $\alpha_k$, making them share some similarity. However, there are also differences among the views, so this assumption alone is too restrictive. Hence, two goals need to be achieved simultaneously: exploiting the similarity among all views and keeping the flexibility of each view. In this way, the original sample can be represented more accurately and stably thanks to the balance between similarity and difference. To this end, the representation coefficient $\alpha_k$ is divided into two parts, a similar one and a specific one: $\alpha_k = \alpha_k^c + \alpha_k^s$, where $\alpha_k^c$ captures the similarity and $\alpha_k^s$ captures the distinctiveness. Figure 2.1 displays the whole structure of the introduced method. The formulation of our model is

$\min \sum_{k=1}^{K} \left( \|y_k - D_k(\alpha_k^c + \alpha_k^s)\|_2^2 + \tau \|\alpha_k^c - \bar{\alpha}^c\|_2^2 \right) + \lambda \sum_{k=1}^{K} \left( \|\alpha_k^c\|_1 + \|\alpha_k^s\|_1 \right)$    (2.6)

where $y_k$ denotes the test instance of the $k$-th view, $D_k = [D_{k,1}, D_{k,2}, \ldots, D_{k,J}]$ denotes the training instances of the $k$-th view (e.g., the tongue, sublingual, or facial vector), and $D_{k,i} \in \mathbb{R}^{m_k \times n_k^i}$ denotes the training set of the $i$-th class in the $k$-th view, which has $n_k^i$ samples of dimension $m_k$; $\bar{\alpha}^c = \frac{1}{K}\sum_{k=1}^{K} \alpha_k^c$ denotes the mean of the similar sparse representation coefficients of all views, and $\tau$ and $\lambda$ are nonnegative penalty constants. From Eq. (2.6), it is easy to see that our model not only extracts the similar components of each view through $\alpha_k^c$ but also keeps the individual components of each view through $\alpha_k^s$. Additionally, since the dictionary atoms of the sample's own category are expected to linearly
represent the test sample, the l1-norm is applied to both $\alpha_k^c$ and $\alpha_k^s$ to preserve sparsity.
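As a concrete reading of Eq. (2.6), the following hypothetical NumPy helper evaluates the JSSL objective for $K$ views. The variable names are illustrative, and the function is only meant to make the three terms of the model explicit.

```python
import numpy as np

def jssl_objective(Y, Dicts, A_c, A_s, tau, lam):
    """Value of the JSSL objective in Eq. (2.6).

    Y[k]: observation of view k; Dicts[k]: its dictionary (columns = training
    samples); A_c[k]/A_s[k]: the similar and specific coefficients of view k.
    """
    K = len(Y)
    a_bar = sum(A_c) / K                         # mean of the similar coefficients
    obj = 0.0
    for k in range(K):
        obj += np.linalg.norm(Y[k] - Dicts[k] @ (A_c[k] + A_s[k])) ** 2   # data fidelity
        obj += tau * np.linalg.norm(A_c[k] - a_bar) ** 2                  # similarity term
        obj += lam * (np.abs(A_c[k]).sum() + np.abs(A_s[k]).sum())        # sparsity terms
    return obj
```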
2.2.2 Optimization for JSSL
The similar coefficients $\alpha_k^c$ and the specific coefficients $\alpha_k^s$ are updated alternately: we first fix $\alpha_k^s$ and update $\alpha_k^c$, and vice versa.

Update $\alpha_k^c$: With $\alpha_k^s$ fixed, the optimization of Eq. (2.6) with respect to $\alpha_k^c$ is equivalent to the following problem:

$\hat{\alpha}_k^c = \arg\min_{\alpha_k^c} \|y_k - D_k(\alpha_k^c + \alpha_k^s)\|_2^2 + \tau \|\alpha_k^c - \bar{\alpha}^c\|_2^2 + \lambda \|\alpha_k^c\|_1$    (2.7)

The Augmented Lagrangian Multiplier (ALM) method is applied to solve Eq. (2.7). By introducing a relaxed variable $\tilde{\alpha}_k^c$, problem (2.7) can be transformed into

$\arg\min_{\alpha_k^c, \tilde{\alpha}_k^c} \|y_k - D_k(\alpha_k^c + \alpha_k^s)\|_2^2 + \tau \|\alpha_k^c - \bar{\alpha}^c\|_2^2 + \lambda \|\tilde{\alpha}_k^c\|_1 + \frac{\mu}{2} \left\| \alpha_k^c - \tilde{\alpha}_k^c + \frac{z_k}{\mu} \right\|_2^2$    (2.8)

where $\tilde{\alpha}_k^c$ denotes the relaxed variable, $z_k$ denotes the $k$-th Lagrangian multiplier, and $\mu$ denotes the step value. Then $\alpha_k^c$ and $\tilde{\alpha}_k^c$ can be optimized alternately.

(a) First, $\tilde{\alpha}_k^c$ is fixed to obtain $\alpha_k^c$:

$\alpha_k^c = \arg\min_{\alpha_k^c} \|y_k - D_k(\alpha_k^c + \alpha_k^s)\|_2^2 + \tau \|\alpha_k^c - \bar{\alpha}^c\|_2^2 + \frac{\mu}{2} \left\| \alpha_k^c - \tilde{\alpha}_k^c + \frac{z_k}{\mu} \right\|_2^2$    (2.9)

As in Ref. [14], a closed-form solution of $\alpha_k^c$ can be obtained as

$\alpha_k^c = \alpha_{0,k}^c + \frac{\tau}{K} P_k Q \sum_{\eta=1}^{K} \alpha_{0,\eta}^c$    (2.10)

where $P_k = \left(D_k^T D_k + (\tau + \frac{\mu}{2}) I\right)^{-1}$, $\alpha_{0,k}^c = P_k \left(D_k^T (y_k - D_k \alpha_k^s) + \frac{\mu}{2} \tilde{\alpha}_k^c - \frac{z_k}{2}\right)$, and $Q = \left(I - \frac{\tau}{K} \sum_{\eta=1}^{K} P_\eta\right)^{-1}$.
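The closed-form update of Eq. (2.10) can be sketched as follows. This is an illustrative NumPy translation under the relaxed-variable convention used above (not the authors' code), and it assumes that every view dictionary has the same number of training samples so that all coefficient vectors share the same length.

```python
import numpy as np

def update_alpha_c(Y, Dicts, A_s, A_c_tilde, Z, tau, mu):
    """Closed-form update of the similar coefficients alpha_k^c (Eq. (2.10)).

    Y[k], Dicts[k]: data and dictionary of view k; A_s[k]: fixed specific
    coefficients; A_c_tilde[k]: relaxed variables; Z[k]: Lagrange multipliers.
    """
    K = len(Y)
    n = Dicts[0].shape[1]                       # number of dictionary atoms (shared)
    P, A0 = [], []
    for k in range(K):
        Pk = np.linalg.inv(Dicts[k].T @ Dicts[k] + (tau + mu / 2.0) * np.eye(n))
        a0 = Pk @ (Dicts[k].T @ (Y[k] - Dicts[k] @ A_s[k])
                   + (mu / 2.0) * A_c_tilde[k] - Z[k] / 2.0)
        P.append(Pk)
        A0.append(a0)
    Q = np.linalg.inv(np.eye(n) - (tau / K) * sum(P))
    coupling = Q @ sum(A0)                      # shared term coupling all views
    return [A0[k] + (tau / K) * (P[k] @ coupling) for k in range(K)]
```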
Algorithm 2.1 Algorithm of updating $\alpha_k^s$ in JSSL
Input: $\sigma$, $\gamma = \lambda/2$, $y_k$, $D_k$, and $\alpha_k^c$, $k = 1, \ldots, K$
Initialization: $\tilde{\alpha}_k^{s(1)} = 0$ and $h = 1$
1: for $k = 1, \ldots, K$ do
2:   while not converged do
3:     $h = h + 1$
4:     $\tilde{\alpha}_k^{s(h)} = S_{\gamma/\sigma}\left(\tilde{\alpha}_k^{s(h-1)} - \frac{1}{\sigma} F(\tilde{\alpha}_k^{s(h-1)})\right)$
5:     where $F(\tilde{\alpha}_k^{s(h-1)})$ is the derivative of the first term of Eq. (2.13), $\|y_k - D_k(\alpha_k^c + \alpha_k^s)\|_2^2$, and $S_{\gamma/\sigma}$ is the soft threshold operator defined in Eq. (2.12)
6:   end while
7: end for
Output: $\alpha_k^s = \tilde{\alpha}_k^{s(h)}$, $k = 1, \ldots, K$
(b) Second, at the stage of updating $\tilde{\alpha}_k^c$, after fixing $\alpha_k^c$ the optimization of Eq. (2.8) simplifies to

$\tilde{\alpha}_k^c = \arg\min_{\tilde{\alpha}_k^c} \lambda \|\tilde{\alpha}_k^c\|_1 + \frac{\mu}{2} \left\| \alpha_k^c - \tilde{\alpha}_k^c + \frac{z_k}{\mu} \right\|_2^2$    (2.11)

which is solved by the soft-thresholding operation $\tilde{\alpha}_k^c = S_{\lambda/\mu}\left(\alpha_k^c + \frac{z_k}{\mu}\right)$. The soft threshold operator is defined as

$\left[S_{\lambda/\mu}(\beta)\right]_i = \begin{cases} 0, & |\beta_i| \leq \lambda/\mu \\ \beta_i - \mathrm{sign}(\beta_i)\,\lambda/\mu, & \text{otherwise} \end{cases}$    (2.12)

where $\beta_i$ denotes the $i$-th component of $\beta$. After obtaining $\alpha_k^c$ and $\tilde{\alpha}_k^c$, we update $z_k$ and $\mu$ by $z_k = z_k + \mu(\alpha_k^c - \tilde{\alpha}_k^c)$ and $\mu = \min(1.2\mu, 1000)$, where the maximum of $\mu$ is pre-fixed to prevent it from becoming too large and its initial value is set to 0.01.

Update $\alpha_k^s$: After obtaining $\alpha_k^c$, Eq. (2.6) can be reformulated as Eq. (2.13) at the stage of updating $\alpha_k^s$:

$\alpha_k^s = \arg\min_{\alpha_k^s} \|y_k - D_k(\alpha_k^c + \alpha_k^s)\|_2^2 + \lambda \|\alpha_k^s\|_1$    (2.13)
Many approaches can address problem (2.13); for example, both ALM [20] and the Iterative Projection Method (IPM) can tackle it. In this chapter, we use IPM to solve Eq. (2.13), as illustrated in Algorithm 2.1. Algorithm 2.2 summarizes the complete JSSL algorithm. The values of the parameters $\lambda$ and $\gamma$ are determined by cross-validation.
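For illustration, the soft-thresholding operator of Eq. (2.12) and the IPM-style update of Algorithm 2.1 for one view can be sketched as follows. This is an assumption-laden sketch rather than the book's implementation: in particular, when $\sigma$ is not supplied it is set from the Lipschitz constant of the smooth term, a common heuristic that the text does not prescribe.

```python
import numpy as np

def soft_threshold(beta, t):
    """Element-wise soft-thresholding operator S_t(beta) of Eq. (2.12)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def update_alpha_s(y_k, D_k, a_c_k, lam, sigma=None, n_iter=100):
    """IPM/ISTA-style update of alpha_k^s for one view (Eq. (2.13), Algorithm 2.1)."""
    if sigma is None:
        # Heuristic step-control constant (assumption, not from the text).
        sigma = 2.0 * np.linalg.norm(D_k, 2) ** 2 + 1e-12
    gamma = lam / 2.0                           # as in the input of Algorithm 2.1
    a_s = np.zeros(D_k.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D_k.T @ (D_k @ (a_c_k + a_s) - y_k)   # derivative of the residual term
        a_s = soft_threshold(a_s - grad / sigma, gamma / sigma)
    return a_s
```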
Algorithm 2.2 Joint Similar and Specific Learning (JSSL)
Input: $\lambda$, $\tau$, $y_k$, $D_k$, $k = 1, \ldots, K$
Initialization: $\alpha_k^c = 0$, $\alpha_k^s = 0$, $z_k = 0$
1: while not converged do
2:   Update coefficients $\alpha_k^c$ with $\alpha_k^s$ fixed:
3:     (a) compute $\alpha_k^c$ following Eq. (2.10)
4:     (b) compute $\tilde{\alpha}_k^c$ following Eq. (2.11)
5:     (c) $z_k = z_k + \mu(\alpha_k^c - \tilde{\alpha}_k^c)$
6:   Update coefficients $\alpha_k^s$ with $\alpha_k^c$ fixed, following Algorithm 2.1
7: end while
Output: $\alpha_k^c$ and $\alpha_k^s$, $k = 1, \ldots, K$
2.2.3 The Classification Rule for JSSL
Once the representation coefficients are obtained, the decision is made according to the lowest reconstruction residual over the classes, accumulated over all $K$ views:

$j^* = \arg\min_j \sum_{k=1}^{K} w_k \left\| y_k - D_{k,j}\left(\alpha_{k,j}^c + \alpha_{k,j}^s\right) \right\|_2^2$    (2.14)

where $D_{k,j}$, $\alpha_{k,j}^c$, and $\alpha_{k,j}^s$ denote the parts of the dictionary $D_k$, the similar coefficient $\alpha_k^c$, and the specific coefficient $\alpha_k^s$ associated with the $j$-th category, respectively, and $w_k$ denotes the weight of the $k$-th view, which can be obtained with the strategy introduced in [21].
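The decision rule of Eq. (2.14) then reduces to a weighted sum of per-view, per-class residuals, as in the following illustrative sketch (names are hypothetical; the view weights are assumed to be given, e.g., computed with the strategy of [21]).

```python
import numpy as np

def jssl_decide(Y, Dicts, A_c, A_s, class_ids, w):
    """Class decision of Eq. (2.14): weighted sum of per-view class residuals.

    class_ids: length-n label vector shared by the columns of every view
    dictionary; w[k]: weight of view k.
    """
    best, best_err = None, np.inf
    for c in np.unique(class_ids):
        mask = class_ids == c
        err = sum(w[k] * np.linalg.norm(
                    Y[k] - Dicts[k][:, mask] @ (A_c[k][mask] + A_s[k][mask])) ** 2
                  for k in range(len(Y)))
        if err < best_err:
            best, best_err = c, err
    return best
```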
2.2.4 Experimental Results
In this section, two types of analyses are conducted. First, results on DM versus Healthy classification are reported; after that, the numerical results of IGR versus Healthy classification are presented. In these experiments, KNN [22], SVM [23, 24], GSRC (group SRC) [25], and SRC [26] are used for comparison. The feature extraction details for the tongue, sublingual vein, and face can be found in [12]; hence, $K$ in Eq. (2.6) is equal to 3. Because KNN, SVM, GSRC, and SRC only support single-view classification, the multiple features from these three views are also concatenated into a single feature vector, which is used as the input of the single-view methods and denoted as "combination." To further show the validity of JSSL, we also take two feature fusion methods as comparison methods: Relaxed Collaborative Representation (RCR) [14] and the Multi-task Joint Sparse Representation Classifier (MTJSRC) [21].
2.2.4.1 Image Dataset
The database includes 504 instances, divided into 192 Healthy instances, 198 DM instances, and 114 IGR instances. There are three types of images for each instance: sublingual, facial, and tongue images. All data were collected in Guangdong Provincial TCM Hospital, Guangdong, China, from the beginning of 2014 to the end of 2015. Healthy samples were validated through blood tests and other examinations; if the indicators from these tests fall into a specific range (designed by Guangdong Provincial TCM Hospital), the sample is considered healthy. To judge whether a sample suffers from IGR or DM, the FPG test was used. All samples fasted for at least 12 hours before the FPG test. The blood glucose level of DM patients was equal to or greater than 7.11 mmol/L, while the blood glucose level of IGR instances fluctuated within a certain range (between 6.1 and 7.11 mmol/L). All these standard indicators were determined by the Guangdong Provincial TCM Hospital.
2.2.4.2 Healthy Versus DM Classification
We randomly selected 30–100 instances for training, and the remaining instances were used for testing. Figure 2.2 shows the experimental results of our JSSL method; note that we only show the averaged results of five independent runs. Obviously, the joint use of the sublingual, facial, and tongue features makes the distinction between Healthy controls and DM patients more accurate. In particular, compared with the tongue feature alone, our approach JSSL obtains better accuracy, and the improvement can reach more than 15%. Compared with the results obtained using only the facial or sublingual feature, our proposed method is also much superior. In addition, the results of the "combine" methods and the other two fusion strategies (RCR and MTJSRC) are illustrated in Table 2.1, from which it is obvious that our proposed method achieves better performance in most cases. Since the correlation of the sublingual, facial, and tongue features is not taken into account, the "combine" methods based on K-NN, SVM, SRC, and GSRC are inferior to ours. In addition, JSSL also shows its advantage over the two multi-view fusion methods RCR and MTJSRC: in most cases, except when the number of training instances is 90, JSSL achieves more than a 2% improvement over RCR and MTJSRC. Furthermore, in terms of the error bar, the values obtained by our method are smaller, which demonstrates the stability of JSSL. Besides, for DM detection with 100 training instances, we further draw the ROC curves computed by different methods, as shown in Fig. 2.3, and Table 2.2 shows the corresponding areas under the curves. It is worth noting that only the ROC curves computed by SRC, GSRC, MTJSRC, and RCR are displayed for comparison with our method, due to their similar classification decision strategies. It can be seen that the ROC curve computed by JSSL covers a significantly larger area
Fig. 2.2 Comparison of Healthy versus DM classification performance using the tongue, facial, and sublingual features and their fusion: accuracy of libSVM, KNN, SRC, GSRC, and JSSL is plotted against the number of training samples (30–100). (a) Comparison of our method with the tongue image based feature. (b) Comparison of our method with the facial image based feature. (c) Comparison of our method with the sublingual image based feature
than those based on the sublingual, facial, or tongue features alone. Compared with MTJSRC, RCR, and the other methods based on simple concatenation, JSSL also performs better. Figure 2.4 shows some sample results. For Healthy samples, SVM and GSRC can recognize them using tongue images, but they often fail to detect DM patients. As for GSRC, although it achieves satisfactory performance in classifying DM samples based on the facial images, it cannot efficiently detect the Healthy instances. Different from the above methods that use only individual modalities, our proposed method obtains more accurate results for each sample by comprehensively considering the sublingual, facial, and tongue features.
Table 2.1 The average accuracy and error bar (percentage) of 5 independent experiments for DM detection

Methods           | 30         | 40         | 50         | 60         | 70         | 80         | 90         | 100
JSSL              | 79.82±1.92 | 81.45±1.50 | 82.82±1.46 | 82.88±2.15 | 83.27±1.55 | 85.06±0.87 | 83.17±2.15 | 86.07±1.07
K-NN (face)       | 71.15±1.89 | 70.55±3.35 | 70.86±2.94 | 70.74±2.18 | 71.83±1.89 | 71.99±3.26 | 72.89±2.21 | 72.98±1.8
libSVM (face)     | 75.62±2.03 | 75.43±2.24 | 75.67±2.51 | 77.53±1.47 | 77.53±2.49 | 77.92±3.05 | 79.10±1.70 | 79.48±3.14
SRC (face)        | 71.00±3.22 | 73.51±3.35 | 75.84±1.67 | 73.62±3.79 | 74.14±2.90 | 75.11±3.84 | 75.59±2.60 | 75.39±2.37
GSRC (face)       | 73.14±1.51 | 74.59±2.74 | 76.32±2.26 | 77.45±2.19 | 76.93±1.82 | 78.31±3.85 | 79.34±2.11 | 79.00±2.39
K-NN (combine)    | 68.58±2.48 | 70.35±2.36 | 68.18±1.43 | 71.96±2.24 | 72.03±2.24 | 71.86±2.69 | 72.99±1.21 | 74.56±1.84
libSVM (combine)  | 77.34±2.91 | 76.08±2.21 | 76.84±7.20 | 78.60±1.43 | 76.33±3.67 | 79.05±1.58 | 79.91±2.11 | 81.15±1.78
SRC (combine)     | 61.33±8.96 | 65.40±6.43 | 66.32±7.59 | 68.63±9.35 | 75.14±4.72 | 74.03±2.31 | 77.54±4.17 | 74.35±7.70
GSRC (combine)    | 79.03±1.72 | 80.51±2.80 | 79.11±4.05 | 77.05±6.53 | 80.39±1.83 | 79.83±2.46 | 82.09±2.36 | 83.35±2.55
MTJSRC            | 77.76±2.93 | 79.04±2.56 | 80.48±1.18 | 80.96±2.00 | 80.08±2.16 | 82.51±1.49 | 81.04±2.17 | 83.14±1.19
RCR               | 77.81±2.49 | 78.64±2.10 | 80.51±2.60 | 80.91±3.33 | 82.53±2.03 | 83.36±1.12 | 83.67±1.82 | 83.21±2.76

(Columns give the number of training samples.)
Fig. 2.3 ROC curves of different methods and different features for DM detection

Table 2.2 The area under curve (AUC) for the different approaches in DM detection. Bold values mean the best performances

Methods           | AUC    | Methods            | AUC
JSSL              | 0.8842 | GSRC (tongue)      | 0.6944
SRC (face)        | 0.7696 | SRC (combination)  | 0.8328
GSRC (face)       | 0.7686 | GSRC (combination) | 0.8512
SRC (sublingual)  | 0.7091 | MTJSRC             | 0.8670
GSRC (sublingual) | 0.7588 | RCR                | 0.8633
SRC (tongue)      | 0.6885 |                    |
2.2.4.3 Healthy Versus IGR Classification
In this section, our proposed fusion strategy is applied to IGR detection. Similar to the DM experiments, the number of training samples is randomly selected from 30 to 70, with each setting repeated five times, and the rest of the instances are used as test samples.
Fig. 2.4 Classified images based on the tongue, facial, sublingual, and their fusion features for Healthy and DM diagnosis (rows: KNN, libSVM, GSRC, and JSSL; columns: Healthy and DM samples). For each image, a red border indicates incorrect classification and a green border indicates correct classification
Figure 2.5 illustrates the average experimental results of our JSSL and the other single-modal or single-view based methods. Similarly, our proposed method is much superior to the approaches that only exploit a single modality. We list the average classification rates and error bars of five independent experiments for IGR diagnosis in Table 2.3. At the same time, our method is compared with the feature concatenation approaches, as well as with MTJSRC and RCR. It is not difficult to conclude that, compared with the "combine" methods based on K-NN, SRC, and GSRC, JSSL achieves significant improvements. In contrast to MTJSRC and RCR, the average accuracy of JSSL is competitive. When the number of training instances is 70, the ROC curves computed by different approaches for IGR diagnosis are drawn in Fig. 2.6, and the corresponding areas under the curves are listed in Table 2.4. Obviously, our presented method JSSL is still much better.
Fig. 2.5 Comparison of Healthy vs. IGR performance by using single-modal based methods (libSVM, KNN, SRC, GSRC) and JSSL: accuracy is plotted against the number of training samples (30–70). (a) Comparison of our method with tongue image based feature. (b) Comparison of our method with facial image based feature. (c) Comparison of our method with sublingual image based feature
2.2.5 Conclusion
In this chapter, we propose a joint sparse representation and fusion strategy for the task of classification. In particular, sparse representation is introduced to represent the multi-view data, through which the similarity and diversity among different views are jointly extracted in a sparse way. In addition, we conducted two types of experiments to identify DM and IGR from Healthy controls. The experimental results show that our fusion method is both effective and superior to the compared approaches.
Table 2.3 The average accuracy and error bar (percentage) in 5 independent experiments for IGR detection. Bold values mean the best performances

Methods           | 30         | 40         | 50         | 60         | 70
JSSL              | 72.63±1.64 | 74.23±2.46 | 74.88±1.32 | 75.37±1.85 | 76.68±3.45
K-NN (sub)        | 62.87±2.85 | 61.54±2.24 | 61.35±3.49 | 60.96±3.05 | 62.81±3.63
libSVM (sub)      | 67.21±0.93 | 69.95±1.50 | 72.80±1.17 | 74.49±1.52 | 75.87±1.51
SRC (sub)         | 69.68±2.96 | 71.01±4.32 | 72.08±2.08 | 72.41±4.19 | 74.85±3.79
GSRC (sub)        | 69.80±4.62 | 70.71±2.75 | 72.13±3.25 | 72.57±3.94 | 74.79±3.21
K-NN (combine)    | 66.64±3.35 | 66.96±8.78 | 65.41±3.99 | 68.02±3.58 | 66.59±4.68
libSVM (combine)  | 72.87±1.69 | 73.30±2.80 | 72.37±5.63 | 72.83±4.38 | 75.81±1.00
SRC (combine)     | 59.43±4.33 | 61.76±7.59 | 62.51±5.77 | 57.43±6.55 | 67.19±8.24
GSRC (combine)    | 69.64±0.95 | 71.89±3.52 | 72.66±5.38 | 71.76±3.90 | 73.65±4.96
MTJSRC            | 71.74±4.84 | 70.57±2.60 | 73.8±2.54  | 74.76±3.17 | 74.13±4.70
RCR               | 72.53±4.55 | 71.81±3.55 | 74.94±2.34 | 73.37±2.78 | 73.14±1.63

(Columns give the number of training samples.)
2.3 Relaxed Collaborative Representation
Although sparse representation is widely used in classification, it is difficult to obtain a closed-form solution, which increases the cost of testing; in real-time applications, this is impractical. In this chapter, a relaxed collaborative representation (RCR) model [14] is proposed.
2.3.1 Problem Formulation
To represent the multi-view or multi-feature data, the following term is designed to regularize the different features' coding vectors over their relevant dictionaries:

$$\min_{\alpha_k} \sum_{k=1}^{K} \omega_k \left\| \alpha_k - \bar{\alpha} \right\|_2^2 \qquad (2.15)$$
where α_k is the coding vector of the k-th feature vector y_k over the k-th dictionary D_k, k = 1, 2, ..., K, ω_k is the weight assigned to the k-th feature, and ᾱ is the mean vector of all α_k. Intuitively, ω_k should be larger, pushing α_k closer to the mean ᾱ, if y_k shares similarity with the other features; conversely, ω_k should be smaller if y_k is more different from the others. Based on Eq. (2.15), the relaxed collaborative representation (RCR) is

$$\min_{\alpha_k, \omega_k} \sum_{k=1}^{K} \left( \left\| y_k - D_k \alpha_k \right\|_2^2 + \lambda \left\| \alpha_k \right\|_2^2 + \tau \omega_k \left\| \alpha_k - \bar{\alpha} \right\|_2^2 \right) \quad \text{s.t. } \mathrm{prior}\{\omega_k\} \qquad (2.16)$$
Fig. 2.6 ROC curves of different methods and different features for IGR detection

Table 2.4 The area under curve (AUC) for the different methods in IGR detection. Bold values mean the best performances

Methods           | AUC    | Methods            | AUC
JSSL              | 0.8278 | GSRC (tongue)      | 0.6155
SRC (face)        | 0.7579 | SRC (combination)  | 0.7203
GSRC (face)       | 0.7596 | GSRC (combination) | 0.7243
SRC (sublingual)  | 0.6984 | MTJSRC             | 0.7870
GSRC (sublingual) | 0.7321 | RCR                | 0.7982
SRC (tongue)      | 0.5850 |                    |
where τ and λ are positive constants, and prior{ω_k} denotes the prior imposed on the weights ω_k. It is worth noting that in Eq. (2.16), α_k is regularized by the ℓ2-norm. The reason is that, as shown in [7], the ℓ2-norm is also effective for classification and leads to a closed-form solution for α_k.
Based on different strategies for ω_k, RCR has three special cases: (1) RCR with strong prior. In this case, the weights ω_k can be pre-learned on a validation dataset. Then a closed-form solution of Eq. (2.16) can be obtained because all the terms are ℓ2-norms. (2) RCR with moderate prior. In this situation, we do not know the values of the weights in advance, but we can empirically obtain some prior information about them (e.g., ω_i > ω_j for some i ≠ j). To avoid over-emphasizing some features while neglecting others, the weights are regularized based on the maximum entropy principle (here we normalize the weights ω_k to [0, 1]). Let ω = [ω_1, ω_2, ..., ω_K]^T. The prior on the weights can be written as

$$-\sum_{k=1}^{K} \omega_k \ln \omega_k > \sigma, \qquad u_l \preceq \Phi\,\omega \preceq u_c, \qquad 0 \preceq \omega \qquad (2.17)$$

where the matrix Φ encodes the relative relationship between the ω_k. For example, when there are two feature vectors and ω_2 ≥ ω_1 ≥ 0, then Φ = [1, −1], u_l = −∞, u_c = 0. (3) RCR with weak prior. No prior information about the weights is known except that we regularize their entropy:

$$-\sum_{k=1}^{K} \omega_k \ln \omega_k > \sigma \qquad (2.18)$$
2.3.2 Optimization for RCR
For RCR with the strong prior, because the weights ω are pre-learned, only the α_k need to be solved and a global optimum can be reached. For the other two cases, we can solve the objective function in Eq. (2.16) by alternately optimizing ω and α_k, i.e., updating the weights ω with α_k fixed and updating the coding vectors α_k with ω fixed. Such an alternation is iterated until the values of ω and α_k converge to some local minimum. First, if the weights ω are known, the optimization of Eq. (2.16) becomes

$$\min_{\alpha_k} \sum_{k=1}^{K} \left( \left\| y_k - D_k \alpha_k \right\|_2^2 + \lambda \left\| \alpha_k \right\|_2^2 + \tau \omega_k \left\| \alpha_k - \bar{\alpha} \right\|_2^2 \right) \qquad (2.19)$$

from which a closed-form solution for k = 1, 2, ..., K can be derived:

$$\alpha_k = \alpha_{0,k} + \frac{\tau \omega_k}{\sum_{\eta=1}^{K} \omega_\eta}\, P_k\, Q \sum_{\eta=1}^{K} \omega_\eta \alpha_{0,\eta} \qquad (2.20)$$

$$\bar{\alpha} = \sum_{k=1}^{K} \omega_k \alpha_k \Big/ \sum_{k=1}^{K} \omega_k \qquad (2.21)$$

where $P_k = \left( D_k^T D_k + I(\lambda + \tau\omega_k) \right)^{-1}$, $\alpha_{0,k} = P_k D_k^T y_k$, $Q = \left( I - \sum_{\eta=1}^{K} \Delta_\eta P_\eta \right)^{-1}$, and $\Delta_\eta = \tau\omega_\eta^2 / \sum_{k=1}^{K} \omega_k$. For the detailed derivation, please refer to Appendix A.
Once the coding vectors α_k are obtained by Eq. (2.20), the coding weights ω can be updated. For RCR with the moderate prior, the objective function in Eq. (2.16) is transformed into

$$\min_{\omega_k} \sum_{k=1}^{K} \tau \omega_k \left\| \alpha_k - \bar{\alpha} \right\|_2^2 + \gamma \sum_{k=1}^{K} \omega_k \ln \omega_k \quad \text{s.t. } u_l \preceq \Phi\,\omega \preceq u_c,\; 0 \preceq \omega \qquad (2.22)$$

which can be solved by the toolbox MOSEK (www.mosek.com). Here γ > 0 is the Lagrange multiplier. If RCR has the weak prior, the objective function in Eq. (2.16) reduces to

$$\min_{\omega_k} \sum_{k=1}^{K} \left( \tau \omega_k \left\| \alpha_k - \bar{\alpha} \right\|_2^2 + \gamma \omega_k \ln \omega_k \right) \qquad (2.23)$$

and the weights can be updated directly as

$$\omega_k = \exp\left( -1 - \tau \left\| \alpha_k - \bar{\alpha} \right\|_2^2 / \gamma \right) \qquad (2.24)$$
We summarize the optimization of RCR in Algorithm 2.3. Because the two alternating optimizations are both convex, RCR enjoys convergence.
Algorithm 2.3 Relaxed Collaborative Representation (RCR)
Require: Dictionaries D_k and feature vectors y_k of the query sample, k = 1, 2, ..., K; an initialization of the weight vector ω^(0).
1: while not converged do
2:   update the coding vectors via Eq. (2.20);
3:   update the weights via Eq. (2.22) or (2.24);
4:   check the convergence condition: ||ω^(t+1) − ω^(t)||_2 / ||ω^(t)||_2 < δ_ω, where ω^(t) is the weight vector in the t-th iteration;
5: end while
Ensure: α_k, k = 1, 2, ..., K, and ω
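To make Algorithm 2.3 concrete, the following NumPy sketch gives one possible realization of RCR with the weak prior: the coding vectors are updated with the closed form of Eqs. (2.20)–(2.21) and the weights with Eq. (2.24). The function name, the initialization ω_k = 1, and the default parameter values are our own illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def rcr_weak_prior(D_list, y_list, lam=5e-4, tau=5e-3, gamma=0.1,
                   n_iter=50, tol=1e-4):
    """Relaxed Collaborative Representation with the weak prior.

    D_list : list of K dictionaries D_k (m_k x n arrays)
    y_list : list of K query feature vectors y_k (length m_k arrays)
    Returns the coding vectors alpha_k and the feature weights omega_k.
    """
    K = len(D_list)
    n = D_list[0].shape[1]
    omega = np.ones(K)                        # simple initialization of the weights

    for _ in range(n_iter):
        # --- update coding vectors, Eqs. (2.20)-(2.21) ---
        P = [np.linalg.inv(D.T @ D + (lam + tau * w) * np.eye(n))
             for D, w in zip(D_list, omega)]
        alpha0 = [P_k @ D.T @ y for P_k, D, y in zip(P, D_list, y_list)]
        s = omega.sum()
        delta = tau * omega ** 2 / s          # Delta_eta used in Eq. (2.20)
        Q = np.linalg.inv(np.eye(n) - sum(d * P_k for d, P_k in zip(delta, P)))
        rhs = sum(w * a0 for w, a0 in zip(omega, alpha0))
        alpha = [a0 + (tau * w / s) * P_k @ Q @ rhs
                 for a0, w, P_k in zip(alpha0, omega, P)]
        alpha_bar = sum(w * a for w, a in zip(omega, alpha)) / s   # Eq. (2.21)

        # --- update weights, Eq. (2.24) (weak prior) ---
        omega_new = np.exp(-1.0 - tau * np.array(
            [np.sum((a - alpha_bar) ** 2) for a in alpha]) / gamma)

        # --- convergence check of Algorithm 2.3 ---
        if np.linalg.norm(omega_new - omega) / np.linalg.norm(omega) < tol:
            omega = omega_new
            break
        omega = omega_new
    return alpha, omega
```

For the moderate prior, the weight step would instead solve the constrained problem of Eq. (2.22) with a convex solver, as stated in the text.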
2.3.3 The Classification Rule for RCR
For each class, the RCR classification depends on the overall coding error. The overall coding error of the query sample y with respect to the i-th class is

$$e_i = \sum_{k=1}^{K} \omega_k \left\| y_k - D_k^i \alpha_k^i \right\|_2^2 \qquad (2.25)$$

where D_k^i and α_k^i are the subsets of the dictionary D_k and of the coefficient vector α_k associated with the i-th class, respectively. The classification is then done via

$$\mathrm{identity}(y) = \arg\min_i \{ e_i \} \qquad (2.26)$$
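A minimal sketch of the classification rule in Eqs. (2.25)–(2.26): the query is assigned to the class with the smallest weighted reconstruction error. The class_index array that marks which dictionary atoms belong to which class is an assumed convention for this illustration.

```python
import numpy as np

def rcr_classify(D_list, y_list, alpha, omega, class_index):
    """Assign the query to the class with the minimal overall coding error.

    class_index : 1-D integer array; class_index[j] is the class label of the
                  j-th dictionary atom (assumed labeling convention).
    """
    classes = np.unique(class_index)
    errors = []
    for c in classes:
        mask = class_index == c
        # e_i = sum_k omega_k * ||y_k - D_k^i alpha_k^i||_2^2   (Eq. 2.25)
        e = sum(w * np.sum((y - D[:, mask] @ a[mask]) ** 2)
                for w, y, D, a in zip(omega, y_list, D_list, alpha))
        errors.append(e)
    return classes[int(np.argmin(errors))]       # Eq. (2.26)
```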
2.3.4 Experimental Results
To test the effectiveness of our proposed approach RCR, it is applied to face recognition (FR) in controlled/uncontrolled environments, as well as to multi-class object recognition. For FR in the controlled environment, multiple blocks divided from a face image are utilized. In this multi-block RCR, there are three parameters: λ, τ, and γ (the Lagrange multiplier of the entropy constraint). In our experiments, we set τ and λ to 0.005 and 0.0005, respectively; for FR with occlusion, we set γ to 0.001; for FR without occlusion, we set γ to 0.1. Multi-feature RCR is utilized for the other tasks, where the parameters are estimated from the validation set.
2.3.4.1 FR in Controlled Environment
In this section, two benchmark face datasets collected in controlled environments, with and without occlusion, are used: the Extended Yale B [27, 28] and a subset of AR [29]. There are 38 individuals and about 2414 frontal images (with the size of
Table 2.5 Face recognition rates on the Extended Yale B database. Bold values mean the best performances

Ntr | SVM   | SRC   | CRC   | LRC   | MTJSRC | RCR
10  | 60.0% | 84.6% | 84.8% | 82.4% | 87.3%  | 86.8%
15  | 67.1% | 84.2% | 84.7% | 81.8% | 87.4%  | 87.2%
20  | 76.5% | 91.3% | 91.2% | 87.0% | 91.5%  | 92.3%
25  | 88.1% | 92.0% | 92.4% | 89.0% | 93.6%  | 93.6%
Table 2.6 Face recognition rates on the AR database. Bold values mean the best performances

SVM    | SRC    | CRC    | LRC    | MTJSRC | RCR
87.10% | 93.70% | 93.30% | 76.40% | 95.80% | 95.90%
54 × 48) in the Extended Yale B database; there are 50 male and 50 female subjects in the subset of AR, and each subject has 26 images with the size of 60 × 43.
(a) FR Without Occlusion  The SVM (linear kernel) is employed as the baseline, and LRC [30], SRC [6], MTJSRC [21], and CRC [7] are used for comparison with the proposed approach RCR. Here, each face image is divided into 1 × 4 blocks. For the Extended Yale B, Ntr images of each subject are randomly selected for training, and the rest of the images are used for testing. Table 2.5 illustrates the recognition results computed by different methods under various training numbers. MTJSRC [21] and RCR obtain the best recognition rates in all cases. Compared with MTJSRC, RCR is slightly inferior when the training number is 10 or 15 and competitive when the training number is 20 or 25. For AR, we only select the images with illumination and expression changes; samples in Session 1 are used for training, and samples from Session 2 are used for testing. Table 2.6 shows the recognition accuracies obtained by all the compared methods. As can be seen, except for MTJSRC, RCR outperforms all other methods by more than 2%.
(b) FR with Real Face Disguise  As in [6, 7], 800 images (about 8 samples per subject) with only expression changes selected from the subset of AR are used for training, while two separate subsets (with sunglasses or scarf) of 200 images (1 sample per subject per Session, with neutral expression, as shown in Fig. 2.7a) are used for testing. For all the compared methods (i.e., RCR, MTJSRC, and the block versions of CRC, LRC, and SRC), we resize all the images to 83 × 64 and divide each image into 4 × 2 blocks, as shown in Fig. 2.7b [6]. Table 2.7 shows the performance of these five approaches. As can be seen, RCR achieves much better results than all the other methods.
Fig. 2.7 (a) The testing instances with sunglasses and scarves in the AR database; (b) partitioned testing instances

Table 2.7 Recognition accuracy by competing approaches on the disguise AR database. Bold values mean the best performances

Method     | SRC    | CRC    | LRC    | MTJSRC | RCR
Sunglasses | 97.50% | 91.50% | 95.50% | 80.50% | 98.50%
Scarf      | 93.50% | 95%    | 94.50% | 90.50% | 96.50%
Fig. 2.8 Samples of FRGC 2.0 and LFW. (a) and (b) are samples in target and query sets of FRGC 2.0; (c) and (d) are samples in training and testing sets of LFW
2.3.4.2 FR in Uncontrolled Environment
In this section, we evaluate how multi-feature RCR performs on two large-scale and real-world face databases: FRGC 2.0 [31] and LFW-a [32]. All the compared methods, LRC, SRC, MTJSRC, CRC, and RCR, utilize four features, i.e., the Gabor magnitude [33], the low-frequency Fourier feature [34], the intensity value, and LBP [35]. We adopt a divide-and-conquer strategy for each feature: first, LDA [36] is exploited to extract the discrimination-enhanced feature in each block (each image is partitioned into 2×2 blocks), and then the final feature vector is generated by concatenating the features of all blocks.
(a) FRGC  FRGC version 2.0 [31], a large-scale face database, was constructed in uncontrolled indoor and outdoor settings. As illustrated in Fig. 2.8a, b, a subset of Experiment 4, which is the most difficult dataset in FRGC 2.0 with aging, large lighting variations, and image blur, is employed in this evaluation. The subset contains 352 subjects with at least 15 samples in the original target set. We select 5280 samples as the target set and 7606 samples as the query set. To learn the projection of LDA [36] and the weight values of RCR, half of the original validation
Table 2.8 Face recognition rates on FRGC 2.0 Exp 4. Bold values mean the best performances

Nta | SRC    | CRC    | LRC    | MTJSRC | RCR
15  | 94.90% | 94.40% | 95.10% | 94.30% | 95.30%
10  | 88.00% | 87.40% | 87.30% | 87.70% | 88.40%
5   | 83.30% | 82.90% | 82.90% | 84.70% | 85.30%

Table 2.9 Face recognition rates on LFW. Bold values mean the best performances

         | SRC    | CRC    | LRC    | MTJSRC | RCR
Subset 1 | 53.00% | 54.50% | 48.70% | 54.80% | 61.00%
Subset 2 | 72.20% | 73.00% | 60.50% | 77.40% | 80.60%
set is employed. We also learn weights for MTJSRC to weight its coding error for better classification. For each subject, three tests are performed with the first Nta (i.e., 5, 10, and 15) target instances. Table 2.8 reports the recognition rates of SRC, CRC, LRC, MTJSRC, and RCR adopting the combination of the four features. RCR outperforms all other algorithms, though the improvement is not significant because the query set contains no occlusion, misalignment, or posture changes. It can also be observed that when the number of target samples is large enough (i.e., 15 samples per subject), all of the techniques perform well with high recognition accuracy.
(b) LFW  Labeled Faces in the Wild (LFW) is a large-scale dataset of face images (illustrated in Fig. 2.8c, d) designed for unconstrained FR with variations in posture, lighting, expression, misalignment, occlusion, and so on. In the experiments, two subsets of aligned LFW [32] are employed. In subset 1, which consists of 311 subjects with at least 6 samples per subject, the first 5 instances are used for training and the remaining instances for testing. In subset 2, which contains 143 subjects with at least 11 samples per subject, the first 10 instances are used for training and the remaining instances for testing. For RCR and MTJSRC, we apply the weights learned on the FRGC dataset. Table 2.9 summarizes the results on the two subsets obtained by the various approaches. On subsets 1 and 2, RCR outperforms SRC, CRC, and LRC by at least 6% and 7%, respectively, demonstrating that forcing the distinctive features of a sample to share the same representation coefficients is not effective. On subset 1 (subset 2), RCR achieves a recognition rate more than 6% (3%) higher than MTJSRC, which allows various features to have different but similar representation coefficients and leverages the weighted coding error for classification. This indicates that the presented RCR model can use the similarity and distinctiveness of various features more effectively for coding and classification.
2.3.4.3 Object Categorization
Finally, we evaluate the effectiveness of the proposed method in multi-class object classification. The two Oxford flower datasets [37, 38] are utilized in this study,
Fig. 2.9 Illustrations of instances from the Oxford flower datasets. (a) Some instances of daffodil in 17 classes; and (b) some instances of water lily in 102 classes
with some examples shown in Fig. 2.9a, b. We adopt the default experimental settings given on the website (www.robots.ox.ac.uk/~vgg/data/flowers) for these two datasets, including the training, validation, and test splits, as well as the multiple features. Note that these features are only extracted from flower regions that are properly cropped via segmentation preprocessing. A fair comparison is implemented between MTJSRC [21] and our RCR on the two datasets, and RCR is extended to its kernel versions. For the direct kernel version, it is easy to see that the two terms in Eq. (2.20), P_k = (D_k^T D_k + I(λ + τω_k))^{-1} and α_{0,k} = P_k D_k^T y_k, can be transformed into P_k = (G_k + I(λ + τω_k))^{-1} and α_{0,k} = P_k h_k, where G_k = φ_k(D_k)^T φ_k(D_k), h_k = φ_k(D_k)^T φ_k(y_k), and φ_k is the kernel mapping function for the k-th modality of feature. Another kernel version of RCR is column generation, where the k-th modality training and testing instances are directly replaced with their associated kernel matrices: D_k = G_k and y_k = h_k. Here the kernel matrices are calculated as exp(−χ²(x, x′)/μ), where μ is set to the mean value of the pairwise χ² distances on the training set (a short numerical sketch of this kernel construction is given at the end of this subsection). The direct kernel version of RCR is denoted as RCR-DK, and the column generation version as RCR-CG.
(a) 17 Category Dataset  This dataset includes 17 different flower classes, each with 80 images. As in [21], the χ² distance matrices of seven features (i.e., HSV, SIFTint, HOG, shape, texture vocabularies, and color) are regarded as inputs, and the experiments are conducted based on the three predefined training, validation, and test splits. Table 2.10 lists the experimental results (mean and variance) of RCR and other existing advanced approaches. It can be seen that both MTJSRC and RCR achieve about a 2% improvement over the other methods. Additionally, MTJSRC is slightly superior to RCR.
(b) 102 Category Dataset  There are 102 classes (40–250 images per class) of flowers with 8198 images in total in this dataset. As in [21], the χ² distance matrices of four features (i.e., HSV, SIFTint, HOG, and SIFTbdy) are directly utilized in the experiment, along with a predefined training, validation, and test split. Table 2.11
Table 2.10 The categorization rates on the 17 class Oxford Flowers dataset. The results in brackets are obtained under equal feature weights. Bold values mean the best performances

Methods          | Accuracy (%)
SRC combination  | 85.9±2.2
MKL              | 85.2±1.5
CG-Boost         | 84.8±2.2
LPBoost          | 85.4±2.4
MTJSRC-RKHS      | 88.1±2.3 (86.8±1.8)
MTJSRC-CG        | 88.9±2.9 (88.2±2.3)
RCR-DK           | 87.6±1.8 (87.4±1.3)
RCR-CG           | 88.0±1.6 (87.9±1.8)

Table 2.11 The categorization accuracy on the 102 category Oxford Flowers dataset. The results in brackets are obtained under equal feature weights. Bold values mean the best performances

Methods          | Accuracy (%)
SRC combination  | 70.0
MKL              | 72.8
MTJSRC-RKHS      | 73.8 (71.5)
MTJSRC-CG        | 74.1 (71.2)
RCR-DK           | 74.1 (71.1)
RCR-CG           | 75.0 (72.6)
lists the comparison between RCR and other state-of-the-art methods. As can be observed, RCR outperforms all other methods, followed by MTJSRC. In particular, RCR-CG achieves a 1% improvement over MTJSRC-CG. Besides, the learned feature weights significantly enhance the final classification, with a 3% improvement for RCR-DK and a 2.4% improvement for RCR-CG. The learned weights for RCR-DK (RCR-CG) are 0.2, 1.6, 1.5, and 0.9 (0.7, 1.6, 1.3, and 0.7), indicating that the HOG and SIFTint features are more discriminative.
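The kernel matrices used by RCR-DK and RCR-CG above are of the form exp(−χ²(x, x′)/μ), with μ set to the mean pairwise χ² distance on the training set. The sketch below is one possible NumPy realization; the 1/2 factor in the χ² distance and the stabilizing constant eps are common conventions chosen here for illustration, not mandated by the text.

```python
import numpy as np

def chi2_kernel(X, Y=None, mu=None, eps=1e-12):
    """exp(-chi2/mu) kernel between histogram-like features.

    X : (n, d) array of training histograms; Y : (m, d) array (defaults to X).
    If mu is None, it is set to the mean pairwise chi-square distance of the
    given pair of sets; for test kernels, pass the mu computed on training data.
    """
    Y = X if Y is None else Y
    # chi2(x, y) = 0.5 * sum_j (x_j - y_j)^2 / (x_j + y_j)   (common convention)
    diff2 = (X[:, None, :] - Y[None, :, :]) ** 2
    denom = X[:, None, :] + Y[None, :, :] + eps
    D = 0.5 * np.sum(diff2 / denom, axis=2)
    if mu is None:
        mu = D.mean()
    return np.exp(-D / mu)
```

For RCR-CG, the training dictionary D_k would then be replaced by the training-versus-training kernel matrix and each test feature y_k by its training-versus-test kernel column, as described above.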
2.3.5 Conclusion
In this chapter, we introduced a relaxed collaborative representation (RCR) model for pattern classification. While allowing flexible coding of each feature vector over its associated dictionary, a novel regularization term was designed to ensure a small variation of the coding vectors and to differentiate the distinctiveness of various features via adaptive weighting. Algorithms for optimizing the proposed RCR were introduced, and experimental results on face recognition in controlled and uncontrolled environments, as well as on multi-class object categorization, clearly indicate that RCR is competitive with many state-of-the-art approaches.
Appendix A
The closed-form solution of Eq. (2.19): let P_k = (D_k^T D_k + I(λ + τω_k))^{-1} and α_{0,k} = P_k D_k^T y_k, where I is an identity matrix. Setting the derivative of Eq. (2.19) to zero, we get ᾱ = Σ_{k=1}^{K} ω_k α_k / Σ_{k=1}^{K} ω_k and α_k = α_{0,k} + τω_k P_k (Σ_{k=1}^{K} ω_k α_k / Σ_{k=1}^{K} ω_k). Summing ω_k α_k for k = 1, 2, ..., K − 1, we get

$$\sum_{\eta=1}^{K-1} \omega_\eta \alpha_\eta = \sum_{\eta=1}^{K-1} \omega_\eta \alpha_{0,\eta} + \omega_K \sum_{\eta=1}^{K-1} \Delta_\eta P_\eta \alpha_K + \sum_{\eta=1}^{K-1} \Delta_\eta P_\eta \sum_{k=1}^{K-1} \omega_k \alpha_k$$

where Δ_η = τω_η² / Σ_{k=1}^{K} ω_k. Then we have

$$\left( I - \sum_{\eta=1}^{K-1} \Delta_\eta P_\eta \right) \sum_{\eta=1}^{K-1} \omega_\eta \alpha_\eta = \sum_{\eta=1}^{K-1} \omega_\eta \alpha_{0,\eta} + \omega_K \sum_{\eta=1}^{K-1} \Delta_\eta P_\eta \alpha_K$$

with which we can substitute Σ_{η=1}^{K−1} ω_η α_η into

$$\alpha_K = \alpha_{0,K} + \frac{\tau \omega_K}{\sum_{k=1}^{K} \omega_k} P_K \left( \sum_{\eta=1}^{K-1} \omega_\eta \alpha_\eta + \omega_K \alpha_K \right).$$

After some derivations, we get

$$\alpha_K = \alpha_{0,K} + \frac{\tau \omega_K}{\sum_{n=1}^{K} \omega_n} P_K \left( I - \sum_{\eta=1}^{K} \Delta_\eta P_\eta \right)^{-1} \sum_{\eta=1}^{K} \omega_\eta \alpha_{0,\eta}$$

Similarly, all the representation coefficients are

$$\alpha_k = \alpha_{0,k} + \frac{\tau \omega_k}{\sum_{n=1}^{K} \omega_n} P_k \left( I - \sum_{\eta=1}^{K} \Delta_\eta P_\eta \right)^{-1} \sum_{\eta=1}^{K} \omega_\eta \alpha_{0,\eta}$$

where k = 1, 2, ..., K.
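Since the derivation above is somewhat involved, the following short sketch numerically checks that the closed form of Eq. (2.20) coincides with the solution obtained by directly solving the stacked stationarity conditions of Eq. (2.19). The problem sizes and the random data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, lam, tau = 3, 8, 5e-4, 5e-3
dims = [20, 15, 25]                                   # arbitrary feature sizes
D = [rng.standard_normal((m, n)) for m in dims]
y = [rng.standard_normal(m) for m in dims]
omega = rng.uniform(0.5, 1.5, K)
s = omega.sum()

# Closed form of Eq. (2.20)
P = [np.linalg.inv(Dk.T @ Dk + (lam + tau * w) * np.eye(n))
     for Dk, w in zip(D, omega)]
a0 = [Pk @ Dk.T @ yk for Pk, Dk, yk in zip(P, D, y)]
delta = tau * omega ** 2 / s
Q = np.linalg.inv(np.eye(n) - sum(d * Pk for d, Pk in zip(delta, P)))
rhs = sum(w * a for w, a in zip(omega, a0))
alpha_cf = [a + (tau * w / s) * Pk @ Q @ rhs for a, w, Pk in zip(a0, omega, P)]

# Directly solving the coupled stationarity conditions of Eq. (2.19):
# (D_k^T D_k + (lam + tau*w_k) I) alpha_k - (tau*w_k/s) sum_j w_j alpha_j = D_k^T y_k
A = np.zeros((K * n, K * n))
b = np.zeros(K * n)
for k in range(K):
    rows = slice(k * n, (k + 1) * n)
    b[rows] = D[k].T @ y[k]
    for j in range(K):
        cols = slice(j * n, (j + 1) * n)
        A[rows, cols] = -(tau * omega[k] * omega[j] / s) * np.eye(n)
        if j == k:
            A[rows, cols] += D[k].T @ D[k] + (lam + tau * omega[k]) * np.eye(n)
alpha_direct = np.linalg.solve(A, b).reshape(K, n)

# the maximal difference should be on the order of machine precision
print(max(np.abs(a - d).max() for a, d in zip(alpha_cf, alpha_direct)))
```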
2.4 Joint Discriminative and Collaborative Representation
Although JSSL and RCR, based on SRC and CRC, successfully represent multiple views, the supervised information, which is quite valuable for performance improvement in classification, is ignored in the training phase. To solve this problem, we propose a Joint Discriminative and Collaborative Representation (JDCR) [15] method, which enjoys two advantages: it reveals the similar representations and encourages the representation coefficients of different categories to be more discriminative [39]. Besides, an optimization algorithm is proposed to achieve the closed-form solution in an efficient way. Finally, JDCR is applied to distinguish Fatty Liver patients from Healthy controls.
2.4.1 Problem Formulation
As mentioned above, different features belonging to various modalities may share a certain correlation. Being different from RCR [14], we employ the following formulation to measure the similarity of the various modalities:

$$\min_{\alpha} \sum_{k=1}^{K} \left\| y^k - D^k \alpha \right\|_2^2 \qquad (2.27)$$

where α = [α_1; α_2; ...; α_J] denotes the shared representation coefficient and α_i denotes the partial vector associated with the i-th category; D^k ∈ R^{m_k×n} denotes the training set (dictionary) corresponding to the k-th modality and m_k denotes the dimension of an instance of the k-th modality. Since the above formulation is unsupervised and does not exploit the label information for training, it is still unsatisfactory for classification, although the correlation between different views is extracted. Therefore, we embed a discriminative regularization [39] in our model to enlarge the differences between categories. Equation (2.28) gives the formula of the discriminative regularization:

$$\left\| D_i^k \alpha_i + D_j^k \alpha_j \right\|_2^2 = \left\| D_i^k \alpha_i \right\|_2^2 + \left\| D_j^k \alpha_j \right\|_2^2 + 2 \operatorname{tr}\left( \left( D_i^k \alpha_i \right)^T D_j^k \alpha_j \right) \qquad (2.28)$$

where D_i^k denotes the training instances belonging to the i-th category in the k-th view or feature. It is easy to see that in Eq. (2.28) the i-th and j-th categories achieve the minimal correlation (maximum difference) by minimizing the term (D_i^k α_i)^T D_j^k α_j. In addition, the function (2.28) is differentiable and convex, and a closed-form solution w.r.t. the coefficient α can be obtained, greatly reducing the computational complexity. Hence, by combining the discriminative regularization Eq. (2.28) with the correlation formula Eq. (2.27), we obtain the following objective function:

$$\min_{\alpha} \sum_{k=1}^{K} \left\| y^k - D^k \alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_2^2 + \gamma \sum_{k=1}^{K} \sum_{i=1, i \neq j}^{J} \sum_{j=1, j \neq i}^{J} \left\| D_i^k \alpha_i + D_j^k \alpha_j \right\|_2^2 \qquad (2.29)$$
Fig. 2.10 The framework of the Joint Discriminative and Collaborative Representation model (JDCR). JDCR consists of three steps: multimodal dictionary construction, Joint Discriminative and Collaborative Representation, and reconstruction error-based classification. In the first step, training samples belonging to different classes construct the dictionary; second, multimodal features extracted from the given test instance are represented by the dictionary, and the shared and discriminative representation coefficient is obtained; third, the label is output by comparing the reconstruction errors of each class
where D^k = [D_1^k, D_2^k, ..., D_J^k], and λ and γ are positive parameters that trade off the influence of the three terms in Eq. (2.29). Based on the above analysis, not only is the correlation among various views or modalities extracted through the shared representation coefficient α, but the differences among categories are also enlarged by exploiting the discriminative regularization. Figure 2.10 shows the framework of JDCR. In the second stage, we extract the shared representation coefficient. At the same time, due to the usage of the discriminative information of different classes, most elements of the coefficient with large values concentrate on the first category (here we assume that the test sample comes from the first category). Hence, the reconstruction error for the first class will be lower than that of the other classes, which means that the probability of the test sample belonging to the first class is high. The cross-term identity behind the discriminative regularization can also be checked numerically, as sketched below.
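The cross-term identity in Eq. (2.28), on which the discriminative regularization rests, can be verified in a few lines (the sizes and random data are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, ni, nj = 30, 6, 7                        # feature dim, atoms per class
Di, Dj = rng.standard_normal((m, ni)), rng.standard_normal((m, nj))
ai, aj = rng.standard_normal(ni), rng.standard_normal(nj)

lhs = np.sum((Di @ ai + Dj @ aj) ** 2)
rhs = (np.sum((Di @ ai) ** 2) + np.sum((Dj @ aj) ** 2)
       + 2 * np.dot(Di @ ai, Dj @ aj))      # 2 tr((Di ai)^T (Dj aj)) for vectors
print(np.isclose(lhs, rhs))                 # True: minimizing the cross term
                                            # decorrelates the two classes
```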
2.4.2 Optimization for JDCR
The objective function can be denoted as

$$L = \sum_{k=1}^{K} \left\| y^k - D^k \alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_2^2 + \gamma \sum_{k=1}^{K} \sum_{i=1, i \neq j}^{J} \sum_{j=1, j \neq i}^{J} \left\| D_i^k \alpha_i + D_j^k \alpha_j \right\|_2^2 \qquad (2.30)$$
We then divide it into two parts:

$$L_1 = \sum_{k=1}^{K} \left\| y^k - D^k \alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_2^2, \qquad L_2 = \gamma \sum_{k=1}^{K} \sum_{i=1, i \neq j}^{J} \sum_{j=1, j \neq i}^{J} \left\| D_i^k \alpha_i + D_j^k \alpha_j \right\|_2^2 \qquad (2.31)$$
Derivative of L1 w.r.t. α  It is easy to get that the derivative of L_1 w.r.t. α is

$$\frac{\partial L_1}{\partial \alpha} = -2 \sum_{k=1}^{K} \left( D^k \right)^T \left( y^k - D^k \alpha \right) + 2\lambda \alpha \qquad (2.32)$$
Derivative of L2 w.r.t. α  Factorizing L_2, we can get

$$L_2 = \gamma \sum_{k=1}^{K} \left( \sum_{j \neq t} \left\| D_t^k \alpha_t + D_j^k \alpha_j \right\|_2^2 + \sum_{i \neq t} \left\| D_i^k \alpha_i + D_t^k \alpha_t \right\|_2^2 + F \right) = \gamma \sum_{k=1}^{K} \left( 2 \sum_{i \neq t} \left\| D_t^k \alpha_t + D_i^k \alpha_i \right\|_2^2 + F \right) \qquad (2.33)$$
where F is unrelated to α_t. Thus, the derivative of L_2 w.r.t. α_t is

$$\begin{aligned}
\frac{\partial L_2}{\partial \alpha_t} &= \gamma \sum_{k=1}^{K} 4 \left( D_t^k \right)^T \sum_{i \neq t} \left( D_t^k \alpha_t + D_i^k \alpha_i \right) \\
&= 4\gamma \sum_{k=1}^{K} \left( D_t^k \right)^T \left( (J-1) D_t^k \alpha_t + \sum_{i \neq t} D_i^k \alpha_i \right) \\
&= 4\gamma \sum_{k=1}^{K} \left( D_t^k \right)^T \left( (J-2) D_t^k \alpha_t + \sum_{i=1}^{J} D_i^k \alpha_i \right) \\
&= 4\gamma \sum_{k=1}^{K} \left( D_t^k \right)^T \left( (J-2) D_t^k \alpha_t + D^k \alpha \right)
\end{aligned} \qquad (2.34)$$
Therefore, the derivative of L_2 w.r.t. α is obtained as follows:

$$\frac{\partial L_2}{\partial \alpha} = \begin{bmatrix} 4\gamma \sum_{k=1}^{K} (D_1^k)^T \left( (J-2) D_1^k \alpha_1 + D^k \alpha \right) \\ 4\gamma \sum_{k=1}^{K} (D_2^k)^T \left( (J-2) D_2^k \alpha_2 + D^k \alpha \right) \\ \vdots \\ 4\gamma \sum_{k=1}^{K} (D_J^k)^T \left( (J-2) D_J^k \alpha_J + D^k \alpha \right) \end{bmatrix} = 4\gamma (J-2) \begin{bmatrix} \sum_{k=1}^{K} (D_1^k)^T D_1^k \alpha_1 \\ \sum_{k=1}^{K} (D_2^k)^T D_2^k \alpha_2 \\ \vdots \\ \sum_{k=1}^{K} (D_J^k)^T D_J^k \alpha_J \end{bmatrix} + 4\gamma \begin{bmatrix} \sum_{k=1}^{K} (D_1^k)^T D^k \\ \sum_{k=1}^{K} (D_2^k)^T D^k \\ \vdots \\ \sum_{k=1}^{K} (D_J^k)^T D^k \end{bmatrix} \alpha = 4\gamma \left( (J-2) M_1 + M_2 \right) \alpha \qquad (2.35)$$

where

$$M_1 = \begin{bmatrix} \sum_{k=1}^{K} (D_1^k)^T D_1^k & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sum_{k=1}^{K} (D_J^k)^T D_J^k \end{bmatrix} \qquad (2.36)$$

and

$$M_2 = \begin{bmatrix} \sum_{k=1}^{K} (D_1^k)^T D^k \\ \sum_{k=1}^{K} (D_2^k)^T D^k \\ \vdots \\ \sum_{k=1}^{K} (D_J^k)^T D^k \end{bmatrix} \qquad (2.37)$$
In conclusion, the total derivative of the objective function L w.r.t. α is

$$\frac{\partial L}{\partial \alpha} = -2 \sum_{k=1}^{K} \left( D^k \right)^T \left( y^k - D^k \alpha \right) + 2\lambda \alpha + 4\gamma \left( (J-2) M_1 + M_2 \right) \alpha \qquad (2.38)$$
Finally, the optimal solution of Eq. (2.29) is

$$\alpha = \left( 2\gamma \left( (J-2) M_1 + M_2 \right) + \lambda I + \sum_{k=1}^{K} \left( D^k \right)^T D^k \right)^{-1} \sum_{k=1}^{K} \left( D^k \right)^T y^k \qquad (2.39)$$

where I is the identity matrix. Let P = (2γ((J−2)M_1 + M_2) + λI + Σ_{k=1}^{K} (D^k)^T D^k)^{-1}. Note that P is independent of the testing sample y = [y^1, ..., y^K]; thus it can be pre-calculated as a projection matrix. Given a sample with K views, we only need to compute Σ_{k=1}^{K} (D^k)^T y^k and then simply project it onto P. Obviously, based on the above analysis, our proposed JDCR is very efficient.
2.4.3 The Classification Rule for JDCR
Similar to RCR, the label prediction in JDCR is obtained through

$$j^* = \arg\min_{j} \sum_{k=1}^{K} w_k \left\| y^k - D_j^k \alpha_j \right\|_2^2 \qquad (2.40)$$

where D_j^k and α_j are the parts of the dictionary D^k and of the shared coefficient corresponding to the j-th class, respectively, and w_k is the weight associated with the k-th modality; in this chapter, it is calculated by employing the method of Yuan et al. [21]. Algorithm 2.4 illustrates the detailed steps of our JDCR algorithm. Note that the values of the two parameters λ and γ are determined by cross validation.

Algorithm 2.4 Joint Discriminative and Collaborative Representation (JDCR)
Require: λ, γ, y = {y^k}, D = {D^k}, k = 1, ..., K
Ensure: d_j, t, and α = [α_1, ..., α_J]
1: Initialize P and Σ_{k=1}^{K} (D^k)^T y^k
2: Compute α = P Σ_{k=1}^{K} (D^k)^T y^k
3: Compute the distance between the test sample and the j-th class using the class-specific residual d_j = Σ_{k=1}^{K} w_k ||y^k − D_j^k α_j||_2^2
4: Classify the test sample to the t-th category, where t = arg min_j d_j
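A compact NumPy sketch of Algorithm 2.4 follows, assuming the columns of every modality dictionary are labeled by a class_index array; the helper names and this labeling convention are ours, for illustration only, and do not reproduce the authors' implementation.

```python
import numpy as np

def jdcr_train(D_list, class_index, lam, gamma):
    """Pre-compute the JDCR projection matrix P of Eq. (2.39).

    D_list      : list of K dictionaries, each (m_k, n); columns are training atoms.
    class_index : length-n integer array giving the class of each atom
                  (assumed labeling convention).
    """
    n = D_list[0].shape[1]
    classes = np.unique(class_index)
    J = len(classes)
    M1 = np.zeros((n, n))
    M2 = np.zeros((n, n))
    DtD = sum(D.T @ D for D in D_list)            # sum_k (D^k)^T D^k
    for c in classes:
        idx = np.where(class_index == c)[0]
        for D in D_list:
            # block-diagonal M1 of Eq. (2.36): sum_k (D_c^k)^T D_c^k
            M1[np.ix_(idx, idx)] += D[:, idx].T @ D[:, idx]
            # row block of M2 of Eq. (2.37): sum_k (D_c^k)^T D^k
            M2[idx, :] += D[:, idx].T @ D
    return np.linalg.inv(2 * gamma * ((J - 2) * M1 + M2) + lam * np.eye(n) + DtD)

def jdcr_classify(D_list, y_list, P, class_index, w=None):
    """Closed-form coding (Eq. 2.39) followed by the rule of Eq. (2.40)."""
    w = np.ones(len(D_list)) if w is None else w
    alpha = P @ sum(D.T @ y for D, y in zip(D_list, y_list))
    classes = np.unique(class_index)
    d = [sum(wk * np.sum((y - D[:, class_index == c] @ alpha[class_index == c]) ** 2)
             for wk, y, D in zip(w, y_list, D_list))
         for c in classes]
    return classes[int(np.argmin(d))], alpha
```

Because P depends only on the dictionaries, it is computed once at training time; each test sample then requires only K matrix–vector products and one projection, which is the source of the efficiency discussed above.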
2.4.4 Experimental Results
2.4.4.1 Image Dataset
Similar to JSSL, tongue and facial images are utilized for Fatty Liver vs. Healthy classification. This tongue and facial dataset contains 500 Healthy instances and 461 Fatty Liver instances; each sample includes two types of images, a facial image and a tongue image. From the beginning of 2014 to the end of 2015, we completed the collection of these images in the Guangdong Provincial TCM Hospital, Guangdong, China. During the sampling process, Healthy people are verified by a blood test and other tests; based on the standardized rule designed by the Guangdong Provincial TCM Hospital, if the test indicators fall within a certain range, a person is considered healthy. Liver function tests, blood fat tests, and the type-B ultrasonic (CT) examination are used to determine whether an individual suffers from the Fatty Liver condition. If and only if the results of these three examinations are abnormal is the individual diagnosed with Fatty Liver.
2.4.4.2 Experimental Settings
In this section, we conduct an experiment on the classification between Fatty Liver disease and Healthy. In the experiment, KNN, SRC [6], dictionary pair learning (DPL) [40], CRC [7], and JSSL [12] are applied for comparison. Generally, we concatenate the different features of the facial or tongue modality into a single vector, denoted as the facial feature or the tongue feature. Because KNN, CRC, SRC, and DPL are proposed for an individual view, all features are also concatenated into a single vector (denoted as "Comb") to compare with our proposed strategy. More details can be found in [15]. To quantitatively evaluate the presented approach, we apply the sensitivity, specificity, false positive rate (FPR), false negative rate (FNR), and accuracy to measure the performance. They are mathematically defined as follows:

$$\begin{aligned}
\text{Sensitivity} &= \frac{\text{TruePos.}}{\text{TruePos.} + \text{FalseNeg.}} \\
\text{Specificity} &= \frac{\text{TrueNeg.}}{\text{TrueNeg.} + \text{FalsePos.}} \\
\text{FPR} &= \frac{\text{FalsePos.}}{\text{FalsePos.} + \text{TrueNeg.}} \\
\text{FNR} &= \frac{\text{FalseNeg.}}{\text{TruePos.} + \text{FalseNeg.}} \\
\text{Accuracy} &= \frac{\text{TruePos.} + \text{TrueNeg.}}{\text{TruePos.} + \text{FalsePos.} + \text{TrueNeg.} + \text{FalseNeg.}}
\end{aligned} \qquad (2.41)$$
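The five measures in Eq. (2.41) follow directly from the confusion-matrix counts; a small helper of our own is shown below for illustration.

```python
def diagnosis_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, FPR, FNR and accuracy as in Eq. (2.41)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)           # = 1 - specificity
    fnr = fn / (tp + fn)           # = 1 - sensitivity
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, fpr, fnr, accuracy

# example (hypothetical counts): diagnosis_metrics(tp=90, fp=10, tn=85, fn=15)
```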
2.4.4.3 Healthy Versus Fatty Liver Classification
In this subsection, we evaluate the proposed JDCR and the other methods in terms of Fatty Liver detection. We randomly select samples for training and use the rest for testing. To evaluate the strategies scientifically and quantitatively, the evaluation is conducted five times, and we report the average accuracy and error bar. The classification rates obtained by our proposed JDCR and the other approaches over 5 independent experiments are shown in Table 2.12 and Fig. 2.11. For each class, 200, 250, and 300 samples are randomly selected from the original dataset, and the remaining samples are used for testing. Obviously, our method obtains noticeably better performance than the methods based on a single view. Particularly, when the number of training samples is 200, our method reaches 83.57% classification accuracy, whereas the best performance achieved by a face based method (DPL) is only 78.65%, which is far below ours. As for specificity and sensitivity, JDCR also performs better than the methods based on the tongue or face alone. Compared with the strategies based on concatenation, our approach is also superior. When the number of training instances is 250, although DPL (combination) outperforms our method in specificity, its sensitivity is much lower than that obtained by our method. Regarding the comparison between JDCR and JSSL, our JDCR is still competitive. Additionally, we plot the ROC curves of the various classification approaches for the Fatty Liver disease diagnosis in Fig. 2.12, for 200, 250, and 300 training samples, respectively, and Table 2.13 lists the corresponding areas under the curves. Note that, similar to the other methods (SRC, CRC, JSSL, and DPL), our classification decision is based on the reconstruction error; therefore, in terms of the area under curve (AUC) and the ROC results, we only compare our approach with DPL, SRC, CRC, and JSSL. As shown in Fig. 2.12, our method covers larger areas than the other methods. When the number of training samples is 300 in each category, our approach JDCR achieves remarkably higher performance than the other compared approaches. Moreover, as displayed in Table 2.13, JDCR outperforms all comparison methods.
2.4.4.4 Time Consumption
In addition, as shown in Table 2.14, since JSSL is incapable of obtaining a closed-form solution, its time cost is quite large. By contrast, the proposed algorithm not only utilizes the discriminative information, which is valuable for classification, but also provides a closed-form solution that greatly reduces the computational complexity.
Table 2.12 The classification rate and the corresponding error bar in 5 independent experiments for Fatty Liver detection. Best results are highlighted in bold

Methods      | 200           | 250           | 300
KNN (face)   | 66.70% ±2.38% | 66.77% ±2.29% | 68.81% ±1.01%
KNN (tongue) | 56.19% ±0.57% | 58.09% ±1.91% | 59.39% ±1.07%
SRC (face)   | 77.72% ±1.15% | 77.83% ±1.36% | 78.95% ±1.96%
SRC (tongue) | 68.09% ±1.55% | 69.59% ±2.34% | 68.81% ±1.23%
DPL (face)   | 78.65% ±1.35% | 79.22% ±1.20% | 79.06% ±1.51%
DPL (tongue) | 61.93% ±1.65% | 62.95% ±1.85% | 65.10% ±2.42%
KNN (comb)   | 69.31% ±2.15% | 70.54% ±1.95% | 73.07% ±1.31%
SRC (comb)   | 79.47% ±1.05% | 79.39% ±1.97% | 79.06% ±2.13%
CRC (comb)   | 82.00% ±1.53% | 81.61% ±1.18% | 82.27% ±1.34%
DPL (comb)   | 82.17% ±1.09% | 82.95% ±1.28% | 82.60% ±1.67%
JSSL         | 82.53% ±1.61% | 82.91% ±1.45% | 83.10% ±1.45%
JDCR         | 83.57% ±0.75% | 83.99% ±1.25% | 85.10% ±1.32%

(Columns give the number of training samples per class.)
Fig. 2.11 The confusion matrices in 5 independent experiments for Fatty Liver detection. The top left, top right, bottom left, and bottom right corners are sensitivity, false negative rate, false positive rate, and specificity, respectively, followed by their corresponding error bars
Fig. 2.12 The ROC curves of various strategies and different features for the Fatty Liver disease detection when the number of training data is 200, 250, and 300 in each category. T-SRC and C-SRC mean the curves are obtained by SRC (tongue) and SRC (Comb), respectively, which are similar to T-DPL and C-DPL
2.4.5 Conclusion
In this chapter, to exploit the correlation between multiple views, we propose a novel fusion method called Joint Discriminative and Collaborative Representation, which learns a shared representation coefficient across multiple modalities. Simultaneously, a discriminative regularization is added to the model to maximize the distinctiveness among categories. Extensive experiments on the detection of Fatty Liver disease substantiate the superiority of the proposed method.
Table 2.13 The AUC values on the Fatty Liver diagnosis when the number of training instances is 200, 250, and 300, respectively. Best results are highlighted in bold
Methods SRC (tongue) DPL (tongue) CRC (Comb) SRC (Comb) DPL (Comb) JSSL JDCR
Num The number of training samples num = 200 num = 250 num = 300 AUC AUC AUC 0.7536 0.7345 0.7440 0.7912 0.7740 0.7838 0.8825 0.8923 0.9149 0.8690 0.8740 0.8726 0.8882 0.8947 0.9197 0.8823 0.8900 0.9155 0.8940 0.9112 0.9363
Table 2.14 The time consumption of JSSL and the proposed method with different training data numbers

Training data | JSSL | JDCR
200 22.4168(s) 0.0346(s)
250 32.5025(s) 0.0407(s)
300 39.5441(s) 0.0600(s)
References 1. Olshausen BA, Field DJ. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vis Res. 1997;37(23):3311–25. 2. Vinje WE, Gallant JL. Sparse coding and decorrelation in primary visual cortex during natural vision. Science 2000;287(5456):1273–6. 3. Mairal J, Bach FR, Ponce J, Sapiro G, Zisserman A. Non-local sparse models for image restoration. In: ICCV, vol. 29, p. 54–62. 2009. Citeseer. 4. Yang J, Yu K, Gong Y, Huang T. Linear spatial pyramid matching using sparse coding for image classification. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 1794–801. 5. Huang K, Aviyente S. Sparse representation for signal classification. Adv Neural Inf Process Syst. 2007;19:609–16. 6. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y. Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell. 2008;31(2):210–27. 7. Zhang L, Yang M, Feng X. Sparse representation or collaborative representation: Which helps face recognition? In: 2011 International conference on computer vision. IEEE, 2011. p. 471–8. 8. Yang M, Zhang L, Yang J, Zhang D. Robust sparse coding for face recognition. In: CVPR 2011. IEEE, 2011. p. 625–32. 9. Heisele B, Ho P, Poggio T. Face recognition with support vector machines: global versus component-based approach. In: Proceedings eighth IEEE international conference on computer vision. ICCV 2001. IEEE, 2001. vol. 2, p. 688–94. 10. Li SZ. Face recognition based on nearest linear combinations. In: Proceedings of 1998 IEEE computer society conference on computer vision and pattern recognition (Cat. No. 98CB36231). IEEE, 1998. p. 839–44. 11. Bin C, Jianchao Y, Shuicheng Y, Yun F, Huang TS. Learning with l1-graph for image analysis. IEEE Trans Image Process. 2010;19(4):858–66. 12. Li J, Zhang D, Li Y, Wu J, Zhang B. Joint similar and specific learning for diabetes mellitus and impaired glucose regulation detection. Inf Sci. 2017;384:191–204. 13. Rigamonti R, Brown MA, Lepetit V. Are sparse representations really relevant for image classification? In: CVPR 2011. IEEE, 2011. p. 1545–52.
14. Yang M, Zhang L, Zhang D, Wang S. Relaxed collaborative representation for pattern classification. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012. p. 2224–31. 15. Li J, Zhang B, Zhang D. Joint discriminative and collaborative representation for fatty liver disease diagnosis. Expert Syst Appl. 2017;89:31–40. 16. Lin Z, Liu R, Su Z. Linearized alternating direction method with adaptive penalty for low-rank representation. In: Advances in neural information processing systems, 2011. p. 612–20. 17. Ji T.-Y., Huang T.-Z., Zhao X.-L., Ma T.-H., Liu G. Tensor completion using total variation and low-rank matrix factorization. Inf Sci. 2016;326:243–57. 18. Lin Z, Chen M, Ma Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Preprint, arXiv:1009.5055. 2010. 19. Ghasemishabankareh B, Li X, Ozlen M. Cooperative coevolutionary differential evolution with improved augmented Lagrangian to solve constrained optimisation problems. Inf Sci. 2016. 20. Rosasco L, Verri A, Santoro M, Mosci S, Villa S. Iterative projection methods for structured sparsity regularization, 2009. 21. Yuan X-T, Liu X, Yan S. Visual classification with multitask joint sparse representation. IEEE Trans Image Process. 2012;21(10):4349–60. 22. Cover TM, Hart PE. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7. 23. Hsieh C-J, Chang K-W, Lin C-J, Sathiya Keerthi S, Sundararajan S. A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th international conference on machine learning. ACM, 2008. p. 408–15. 24. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008;9:1871–4. 25. Bengio S, Pereira F, Singer Y, Strelow D. Group sparse coding. In: Advances in neural information processing systems, 2009. p. 82–9. 26. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y. Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell. 2009;31(2):210–27. 27. Georghiades AS, Belhumeur PN. Illumination cone models for face recognition under variable lighting. In: Proceedings of CVPR98, 1998. 28. Lee K-C, Ho J, Kriegman DJ. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans Pattern Anal Mach Intell. 2005;(5):684–698. 29. Martinez AM. The AR face database. CVC Technical Report24, 1998. 30. Naseem I, Togneri R, Bennamoun M. Linear regression for face recognition. IEEE Trans Pattern Anal Mach Intell. 2010;32(11):2106–12. 31. Phillips PJ, Flynn PJ, Scruggs T, Bowyer KW, Chang J, Hoffman K, Marques J, Min J, Worek W. Overview of the face recognition grand challenge. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), IEEE, 2005. vol. 1, p. 947–54. 32. Wolf L, Hassner T, Taigman Y. Similarity scores based on background samples. In: Asian conference on computer vision. Springer, 2009. p. 88–97. 33. Liu C, Wechsler H. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans Image Process. 2002;11(4):467–476. 34. Su Y, Shan S, Chen X, Gao W. Hierarchical ensemble of global and local classifiers for face recognition. IEEE Trans Image Process. 2009;18(8):1885–96. 35. Ahonen T, Hadid A, Pietikäinen M. Face recognition with local binary patterns. In: European conference on computer vision. Springer, 2004. p. 469–481. 36. Belhumeur PN, Hespanha JP, Kriegman DJ. 
Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, 1997. 37. Nilsback M-E, Zisserman A. A visual vocabulary for flower classification. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06). IEEE, 2006. vol. 2, p. 1447–54.
38. Nilsback M-E, Zisserman A. Automated flower classification over a large number of classes. In: 2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE, 2008. p. 722–9. 39. Xu Y, Zhong Z, Yang J, You J, Zhang D. A new discriminative sparse representation method for robust face recognition via l2 regularization. IEEE Trans Neural Netw Learn Syst. 2016;28(10):2233–42. 40. Gu S, Zhang L, Zuo W, Feng X. Projective dictionary pair learning for pattern classification. In: Advances in neural information processing systems, 2014. p. 793–801.
Chapter 3
Information Fusion Based on Gaussian Process Latent Variable Model
The Gaussian Process Latent Variable Model (GPLVM) is a generative, non-parametric model that can represent data without a predetermined mapping function. Unlike sparse/collaborative representation, GPLVM can describe the non-linearities that are prevalent in real-world datasets. This chapter proposes three information fusion methods based on GPLVM to improve classification performance. The content of this chapter can help readers gain a preliminary understanding of fusion algorithms based on GPLVM.
3.1 Motivation and Preliminary
3.1.1 Motivation
As mentioned above, JSSL, RCR, and JDCR have been proposed for multi-view, multi-feature, or multi-modal fusion and classification. However, the main disadvantage of these methods is that they use linear representations to approximate complex, non-linear data distributions in the real world. Therefore, their performance may not meet the needs of some applications with complex scenes. To tackle this problem, some kernel-based methods, for example kernel canonical correlation analysis (CCA) [1–3] and randomized non-linear CCA [4], were proposed. Differently, GPLVM [5–8] is a non-parametric and generative strategy that is capable of fitting the data non-linearly and effectively, rather than relying on a specific set of deterministic or parametric functions. Lawrence et al. [6] proposed GPLVM in 2004 to learn a low-dimensional latent variable through a non-linear representation. In addition, a Gaussian Process (GP) [9] prior is placed on the mapping function, so that the covariance function of GPLVM has a powerful variety, which contributes to modeling the complicated distributions and diverse content in the real world [10].
Considering the powerful capacity of GPLVM, we present a novel approach to extract the correlation among multiple views. Similar to the shared GPLVM (SGPLVM) [7], a shared component is obtained from multiple views by using view-specific mapping functions with a Gaussian Process (GP) prior. Although many research groups have developed GPLVM extensions in recent years, few of them focus on jointly modeling the mappings both from and to the observations. In other words, these works assume a projection from the low-dimensional latent space to the high-dimensional observed space, as in GPLVM. However, at the prediction stage, when a testing sample is given, what we need is the latent variable corresponding to that sample. Therefore, another projection, or a back constraint regularization, from the observation to the latent variable is needed, so that we can easily obtain the latent variable associated with a testing sample. Note that [11] also developed a back constraint regularization for GPLVM, which learns a matrix to fit the regression from the observed space to the latent space. Differently, we design a more powerful back projection from the observed space to the latent space that also follows a GP prior. Here we treat the mapping function from the observations to the latent space as the encoder [12–14] and the projection from the latent space to the observations as the decoder. This approach is called the Shared Auto-encoder Gaussian Process latent variable model (SAGP) [15].
How to effectively design the covariance using kernel functions and mine the category-based information is another issue. In SAGP, the GP prior is employed to non-linearly model the data distribution, and the covariance matrix is frequently constructed by the Radial Basis Function (RBF). In practice, the choice of kernel function is a crucial issue in covariance matrix construction: a wise selection of the kernel function boosts the discriminative power of the model for classification. Actually, the data distributions in real applications are complex, and a single kernel cannot represent such datasets well. We therefore extend SAGP to a multi-kernel version and employ multi-kernel learning with the proposed SAGP for classification. In detail, we utilize multiple kernels to adaptively and automatically construct the covariance matrix rather than adopting a single kernel function. By doing so, the developed model can adaptively accommodate the non-linear distribution of the observations by automatically updating the weights of the kernel functions. This strategy is named the Multi-Kernel Shared Gaussian Process latent variable model (MKSGP) [16].
It should be noted that, in SAGP, latent components from the same category are encouraged to be similar while those from distinct categories are enforced to be dissimilar by the discriminative regularization. The limitation is that it requires training a separate off-line classifier such as KNN or SVM, in which case the variables learned in the training stage may not fully meet the assumptions of the classifier. To address this issue, we also jointly learn a hyperplane that separates the latent variables belonging to different classes by using a large margin prior. Different from SAGP, which trains the latent variables and the classifier separately, this strategy learns the variables and the classifier jointly, making the two processes adaptive to the input samples.
It is true that the aforementioned SAGP [15] and MKSGP [16] obtain an encoder for quick testing. However, the covariance of p(X|{Y^v}_v^V) has to be employed to learn the projections from the observed samples to the latent components in the encoder, and in practice SAGP and MKSGP both take a straightforward way to calculate this covariance by simply summing up the covariances of p(X|Y^v). To tackle this problem, another novel GPLVM-based multi-view learning method, named the Shared Linear Encoder-based Multi-Kernel Gaussian Process Latent Variable Model (SLEMKGP) [17], was proposed by jointly considering the projections from and to the latent variable. Generally, all views are projected to the corresponding latent variable for each datum. Considering the difficulty of obtaining the covariance matrix of the conditional distribution of the latent variable given the observations, we develop another transformation from the multiple views to the latent space. In detail, we first project the observed components to a subspace with a linear mapping; the fused feature is then transformed to the latent components with GP priors. Compared with SAGP and MKSGP, SLEMKGP first transforms the multiple views to a common linear subspace Z, and the covariance of p(X|Z) is computed easily.
3.1.2 Preliminary
In this chapter, we first give some theory about GPLVM and SGPLVM so that readers can better understand the proposed approaches.
3.1.2.1 GPLVM
We define the observed samples as Y = [y_1, ..., y_N]^T ∈ R^{N×D}, in which N and D are the number and the dimensionality of the training data, respectively. The purpose of GPLVM is to compute the corresponding latent variable X = [x_1, ..., x_N]^T ∈ R^{N×q}, where q is the dimensionality of X and q ≪ D. The relationship between y_i and x_i can be written as

$$y_i = f(x_i) + \varepsilon \qquad (3.1)$$

in which f is the mapping function with a GP prior and ε is the residual component. Marginalizing f, we get the likelihood of the observations

$$p(Y \mid X, \theta) = \frac{1}{\sqrt{(2\pi)^{ND} |K|^{D}}} \exp\left( -\frac{1}{2} \operatorname{tr}\left( K^{-1} Y Y^T \right) \right) \qquad (3.2)$$

in which K is the covariance matrix constructed from X and θ, where θ denotes the kernel parameters. Generally, the Radial Basis Function (RBF) kernel is used to construct the covariance matrix. The posterior distribution can then be represented as follows:

$$p(X, \theta \mid Y) \propto p(Y \mid X, \theta)\, p(X). \qquad (3.3)$$

Kindly note that we optimize the parameter X using the gradient descent algorithm by minimizing the negative log-posterior.
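To make Eqs. (3.2)–(3.3) concrete, the sketch below evaluates the GPLVM negative log-posterior for a candidate latent matrix X under an RBF covariance and a unit Gaussian prior on the latent points; in practice this objective would be handed to a gradient-based optimizer, as stated in the text. The specific kernel form, the added noise term, and the simple prior are common choices assumed here for illustration.

```python
import numpy as np

def rbf_kernel(X, variance=1.0, lengthscale=1.0, noise=1e-3):
    """K_ij = variance * exp(-||x_i - x_j||^2 / (2 lengthscale^2)) + noise * I."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2) + noise * np.eye(len(X))

def gplvm_neg_log_posterior(X, Y, theta):
    """Negative log of Eq. (3.3) up to constants: the data term of Eq. (3.2)
    plus a unit Gaussian prior on the latent points."""
    N, D = Y.shape
    K = rbf_kernel(X, *theta)
    _, logdet = np.linalg.slogdet(K)
    data_term = (0.5 * D * logdet
                 + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))
                 + 0.5 * N * D * np.log(2 * np.pi))
    prior_term = 0.5 * np.sum(X ** 2)        # -log p(X) for p(X) = N(0, I)
    return data_term + prior_term

# X (and the kernel parameters) would then be refined by gradient descent
# on this objective, as described in the text.
```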
3.1.2.2 SGPLVM

To apply GPLVM to multi-view data, we assume there is a common subspace from which the latent components are mapped to the different views. Mathematically, suppose there are $V$ views and the observations are denoted as $\mathbf{Y} = \{\mathbf{Y}^v = [\mathbf{y}_1^v,\ldots,\mathbf{y}_N^v]^T \in \mathbb{R}^{N\times D^v}\}_{v=1}^V$, in which $\mathbf{Y}^v$ is the $v$-th view and $D^v$ is its dimensionality. In contrast to learning a view-specific variable as in GPLVM, a latent variable $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_N]^T \in \mathbb{R}^{N\times q}$ is shared among the different views. The relationship between $\mathbf{y}_i^v$ and $\mathbf{x}_i$ is
$$ \mathbf{y}_i^v = f^v(\mathbf{x}_i) + \varepsilon^v \tag{3.4} $$
The distribution of the observed view $\mathbf{Y}^v$ given the latent component $\mathbf{X}$ is conditionally independent of the remaining views. Thus, the likelihood of the views given $\mathbf{X}$ is
$$ p(\mathbf{Y}\mid\mathbf{X}) = \prod_{v=1}^{V} p(\mathbf{Y}^v\mid\mathbf{X},\theta^v) \tag{3.5} $$
in which $\theta^v$ denotes the kernel parameters associated with the $v$-th view. Similar to GPLVM, the gradient descent algorithm is used to optimize $\mathbf{X}$ in SGPLVM.
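To make the preliminary concrete, the following is a minimal NumPy sketch of evaluating the GPLVM negative log-likelihood of Eq. (3.2) under an RBF covariance; the function names, the toy data, and the fixed kernel hyperparameters are our own illustrative assumptions rather than the implementation used in this book.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0, jitter=1e-6):
    """RBF covariance K built from the latent points X."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    K = variance * np.exp(-0.5 * sq / lengthscale**2)
    return K + jitter * np.eye(X.shape[0])          # jitter keeps K well conditioned

def gplvm_neg_log_likelihood(X, Y, lengthscale=1.0, variance=1.0):
    """Negative log of p(Y | X, theta) in Eq. (3.2)."""
    N, D = Y.shape
    K = rbf_kernel(X, lengthscale, variance)
    _, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)                   # K^{-1} Y without an explicit inverse
    return 0.5 * (N * D * np.log(2 * np.pi) + D * logdet + np.trace(Y.T @ Kinv_Y))

# toy usage: N = 20 samples, D = 5 observed dims, q = 2 latent dims
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))
X = rng.normal(size=(20, 2))                         # in practice X is initialised e.g. by PCA/LDA
print(gplvm_neg_log_likelihood(X, Y))
```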
3.2 Shared Auto-encoder Gaussian Process Latent Variable Model

As described above, GPLVM and its various extensions have achieved remarkable performance. However, few of them jointly learn the mappings from and to the observations. In GPLVM, the observed data are assumed to be mapped instances of the latent variable; given a test sample, however, what we need is its corresponding latent variable. To tackle this problem, the works [5, 11] designed a back-constraint regularization, but it only uses a matrix to fit the regression between the observation space and the manifold space. Rather than using this type of back projection, we develop a non-linear approach to map the observed components into the manifold space. Inspired by the auto-encoder [12, 13] and its multi-modal
extension [14], we focus on processing multi-view samples and develop a novel Gaussian Process latent variable model, named the Shared Auto-encoder Gaussian Process latent variable model (SAGP) [15], in this chapter. Concretely, we learn the projections from the shared latent space to the multiple observations and take its Gaussian Process (GP) [9] based back projection into account.
3.2.1 Problem Formulation

Figure 3.1 visualizes the pipeline of the proposed approach SAGP. The proposed method aims to estimate the shared latent components of multi-view data. Specifically, SAGP regards the observed views $\mathbf{y} = \{\mathbf{y}^v\}_{v=1}^V$ as the components associated with a shared latent component $\mathbf{x}$ in a subspace, where $V$ is the number of views. To ensure that the mapping is non-linear and smooth, we employ GPLVM due to its powerful capability. Similar to SGPLVM [7], we decompose the likelihood of the observed samples $\{\mathbf{Y}^v\}_{v=1}^V$ as follows:
$$ p\big(\{\mathbf{Y}^v\}_{v=1}^V \mid \mathbf{X}, \{\theta^v\}_{v=1}^V\big) = \prod_{v=1}^{V} p(\mathbf{Y}^v\mid\mathbf{X},\theta^v) \tag{3.6} $$
where $\mathbf{Y}^v = [\mathbf{y}_1^v,\ldots,\mathbf{y}_N^v]^T \in \mathbb{R}^{N\times D^v}$ is the data set of the $v$-th view, $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_N]^T \in \mathbb{R}^{N\times q}$ is the set of $N$ latent components, and $\theta^v$ are the associated kernel parameters. In contrast to previous work, the proposed strategy adopts an additional projection that transforms the observed data $\mathbf{Y}^v$ to the latent variable $\mathbf{X}$ and then reconstructs the observations from the shared latent components. Specifically, we express the latent components and the reconstructed components through Gaussian Process mapping functions $g = \{g^v\}_{v=1}^V$ and $f = \{f^v\}_{v=1}^V$, with kernel parameters $\gamma = \{\gamma^v\}_{v=1}^V$ and $\theta = \{\theta^v\}_{v=1}^V$, respectively. The projection
Fig. 3.1 An overview of the proposed approach (the multiple features CH, EDH, and WT are mapped through $g^v$ and $f^v$). In this model, there is a shared component that can be mapped to the different characteristics. In addition, there is another transformation from the observations to the shared variable
functions follow the Gaussian distributions
$$ p(g^v) \sim \mathcal{N}(g^v\mid\mathbf{0},\mathbf{K}_Y^v), \qquad p(f^v) \sim \mathcal{N}(f^v\mid\mathbf{0},\mathbf{K}_X^v) \tag{3.7} $$
For both the encoder and the decoder, we select the RBF kernel as the kernel function. For the encoding part, we propose two frameworks. In the first, a single mapping function converts the multiple observations to the shared latent variable; the covariance matrix of $p(\mathbf{X}\mid\{\mathbf{Y}^v\}_{v=1}^V,\{\gamma^v\}_{v=1}^V)$ is defined as the sum of the kernel matrices associated with the different views $\mathbf{Y}^v$, i.e., $\sum_v \mathbf{K}_Y^v$. In the second, an independent transformation is designed from each view to the common subspace. The former is named the shared encoder (SE), and the latter the independent encoder (IE). In the IE case, $p(\mathbf{X}\mid\{\mathbf{Y}^v\}_{v=1}^V,\{\gamma^v\}_{v=1}^V)$ is replaced by $p(\mathbf{X}\mid\mathbf{Y}^v,\gamma^v)$. The second back projection is designed to cope with test data with missing views: if all views are available for a test sample, both inverse projections can be adopted, whereas if a test sample misses one or more views, the latent component can still be obtained through the second back projection. For simplicity and without loss of generality, only the SE-based strategy is presented in the following sections; for the second approach, we only need to replace $\sum_v \mathbf{K}_Y^v$ with $\mathbf{K}_Y^v$. In contrast to the work [11], which only learns the back transformation by estimating a projection matrix, SAGP achieves the transformation from the observed spaces to the shared manifold with a GP prior. Owing to the non-parametric and non-linear advantages of the GP prior, the proposed method fits the training data more precisely. Furthermore, compared with existing approaches, only a small number of variables need to be learned in this kind of back constraint. The marginal likelihood of $\mathbf{X}$ w.r.t. the multiple observations and of the observations w.r.t. $\mathbf{X}$ (SE) are then written as follows:
$$ p\big(\mathbf{X}\mid\{\mathbf{Y}^v\}_{v=1}^V,\{\gamma^v\}_{v=1}^V\big) = \frac{1}{\sqrt{(2\pi)^{Nq}\,\big|\sum_v \mathbf{K}_Y^v\big|^{q}}} \exp\!\Big(-\tfrac{1}{2}\operatorname{tr}\big(\big(\textstyle\sum_v \mathbf{K}_Y^v\big)^{-1}\mathbf{X}\mathbf{X}^{T}\big)\Big) $$
$$ p\big(\{\mathbf{Y}^v\}_{v=1}^V\mid\mathbf{X},\{\theta^v\}_{v=1}^V\big) = \prod_{v=1}^{V} p(\mathbf{Y}^v\mid\mathbf{X},\theta^v) = \prod_{v=1}^{V} \frac{1}{\sqrt{(2\pi)^{N D^v}\,|\mathbf{K}_X^v|^{D^v}}} \exp\!\Big(-\tfrac{1}{2}\operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big) \tag{3.8} $$
In order to clearly distinguish the observations used in the encoder from the latent variable represented in the decoder, we re-denote the observations in the encoder as $\{\mathbf{Y}_\gamma^v\}_{v=1}^V$.
It should be noted that $\mathbf{Y}_\gamma^v$ is used only as a symbol; its numerical value is set equal to $\mathbf{Y}^v$ in the implementation. Therefore, the joint likelihood over the shared component $\mathbf{X}$, associated with the observations and the reconstructions, can be formulated as follows:
$$ p\big(\mathbf{X}\mid\{\mathbf{Y}_\gamma^v\}_{v=1}^V,\{\mathbf{Y}^v\}_{v=1}^V,\{\gamma^v\}_{v=1}^V,\{\theta^v\}_{v=1}^V\big) \propto p\big(\mathbf{X}\mid\{\mathbf{Y}_\gamma^v\}_{v=1}^V,\{\gamma^v\}_{v=1}^V\big)\, p\big(\{\mathbf{Y}^v\}_{v=1}^V\mid\mathbf{X},\{\theta^v\}_{v=1}^V\big) \tag{3.9} $$
The proof of Eq. (3.9):
$$ \begin{aligned} p\big(\mathbf{X}\mid\{\mathbf{Y}_\gamma^v\}_v,\{\mathbf{Y}^v\}_v,\{\gamma^v\}_v,\{\theta^v\}_v\big) &= \frac{p\big(\mathbf{X},\{\mathbf{Y}_\gamma^v\}_v,\{\mathbf{Y}^v\}_v\mid\{\gamma^v\}_v,\{\theta^v\}_v\big)}{p\big(\{\mathbf{Y}_\gamma^v\}_v,\{\mathbf{Y}^v\}_v\mid\{\gamma^v\}_v,\{\theta^v\}_v\big)} \\ &= \frac{p\big(\{\mathbf{Y}_\gamma^v\}_v\big)\,p\big(\mathbf{X}\mid\{\mathbf{Y}_\gamma^v\}_v,\{\gamma^v\}_v\big)\,p\big(\{\mathbf{Y}^v\}_v\mid\mathbf{X},\{\theta^v\}_v\big)}{p\big(\{\mathbf{Y}_\gamma^v\}_v,\{\mathbf{Y}^v\}_v\mid\{\gamma^v\}_v,\{\theta^v\}_v\big)} \\ &= \frac{p\big(\mathbf{X}\mid\{\mathbf{Y}_\gamma^v\}_v,\{\gamma^v\}_v\big)\,p\big(\{\mathbf{Y}^v\}_v\mid\mathbf{X},\{\theta^v\}_v\big)}{p\big(\{\mathbf{Y}_\gamma^v\}_v,\{\mathbf{Y}^v\}_v\mid\{\gamma^v\}_v,\{\theta^v\}_v\big)\big/p\big(\{\mathbf{Y}_\gamma^v\}_v\big)} \\ &= \frac{p\big(\mathbf{X}\mid\{\mathbf{Y}_\gamma^v\}_v,\{\gamma^v\}_v\big)\,p\big(\{\mathbf{Y}^v\}_v\mid\mathbf{X},\{\theta^v\}_v\big)}{p\big(\{\mathbf{Y}^v\}_v\mid\{\mathbf{Y}_\gamma^v\}_v,\{\gamma^v\}_v,\{\theta^v\}_v\big)} \end{aligned} \tag{3.10} $$
Since $p\big(\{\mathbf{Y}^v\}_v\mid\{\mathbf{Y}_\gamma^v\}_v,\{\gamma^v\}_v,\{\theta^v\}_v\big)$ is independent of the shared latent space, the result of Eq. (3.9) follows.
As described previously, the negative log-likelihood of the presented model is
$$ L = L_\gamma + L_\theta + \text{const}, \qquad L_\theta = \sum_{v=1}^{V} L_\theta^v $$
$$ L_\gamma = \frac{1}{2}\Big[ qN\log 2\pi + q\log\Big|\sum_v \mathbf{K}_Y^v\Big| + \operatorname{tr}\Big(\big(\textstyle\sum_v \mathbf{K}_Y^v\big)^{-1}\mathbf{X}\mathbf{X}^{T}\Big)\Big] \tag{3.11} $$
$$ L_\theta^v = \frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] $$
where const denotes the constant term that is independent of the latent space. For the IE scenario, the kernel matrix $\sum_v \mathbf{K}_Y^v$ is replaced by $\mathbf{K}_Y^v$ in $L_\gamma$. Since we intend to apply SAGP to classification, a discriminative prior is employed to encourage the latent components from the same category to be similar and those from different categories to be dissimilar. Urtasun et al. [8] first used Linear Discriminant Analysis (LDA) as the prior to decrease the intra-class distances while increasing the inter-class distances by maximizing
$$ J(\mathbf{X}) = \operatorname{tr}\big(\mathbf{S}_w^{-1}\mathbf{S}_b\big) \tag{3.12} $$
where $\mathbf{S}_w$ and $\mathbf{S}_b$ are the intra-class and inter-class scatter matrices, respectively. The work [18] presents a more general discriminative prior for the Gaussian Process Latent Random Field [19] by introducing the graph Laplacian matrix. Recently, Eleftheriadis et al. [5] and Zhang et al. [20] revised this general prior to adapt it to multi-view fusion. In this chapter, we utilize the discriminative regularization in [5] to achieve supervised learning in the proposed method. Specifically, we first adopt a view-specific weight matrix $\mathbf{W}^v$, $v = 1,\ldots,V$:
$$ \mathbf{W}^v_{ij} = \begin{cases} \exp\!\Big(-\dfrac{\|\mathbf{y}_i^v-\mathbf{y}_j^v\|^2}{t^v}\Big) & \text{if } i \neq j \text{ and } c_i = c_j \\ 0 & \text{otherwise} \end{cases} \tag{3.13} $$
where $\mathbf{y}_i^v$ is the $i$-th feature of the sample set $\mathbf{Y}^v$, $c_i$ is its associated class label, and $t^v$ is a parameter determining the RBF kernel width. Following [21], we set the kernel width to the mean squared distance among the training samples. We then construct the graph Laplacian for each modality as $\mathbf{L}^v = \mathbf{D}^v - \mathbf{W}^v$, where $\mathbf{D}^v$ is a diagonal matrix with diagonal elements $\mathbf{D}^v_{ii} = \sum_{j} \mathbf{W}^v_{ij}$. Since the graph Laplacian matrices of the different views differ in scale, Eleftheriadis et al. [5] normalized them by
$$ \mathbf{L}_N^v = (\mathbf{D}^v)^{-1/2}\,\mathbf{L}^v\,(\mathbf{D}^v)^{-1/2} \tag{3.14} $$
Consequently, the discriminative prior on the shared space is defined as
$$ p(\mathbf{X}) = \prod_{v=1}^{V} p(\mathbf{X}\mid\mathbf{Y}^v)^{1/V} = \frac{1}{V\cdot Z_q}\exp\!\Big(-\tfrac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\tilde{\mathbf{L}}\mathbf{X}\big)\Big) \tag{3.15} $$
where $\tilde{\mathbf{L}} = \sum_{v=1}^{V}\mathbf{L}_N^v + \xi\mathbf{I}$, $\xi$ (e.g., $10^{-4}$) is a small regularization constant that keeps $\tilde{\mathbf{L}}$ positive-definite, and $\beta$ is a nonnegative penalty parameter that trades off Eq. (3.11) against the discriminative prior. To sum up, by combining Eq. (3.11) and the discriminative prior, we finally obtain the negative log-posterior of the proposed SAGP model as follows:
$$ L = L_\gamma + L_\theta + L_d \tag{3.16} $$
where $L_d = \frac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\tilde{\mathbf{L}}\mathbf{X}\big)$.
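As an illustration of Eqs. (3.13)–(3.16), the sketch below builds the view-specific weight matrices, the normalized graph Laplacians, and the resulting discriminative term $L_d$ in NumPy; it is a minimal example under our own naming and toy-data assumptions, not the reference implementation.

```python
import numpy as np

def discriminative_prior_term(X, views, labels, beta=1.0, xi=1e-4):
    """L_d = (beta/2) tr(X^T L_tilde X) built from Eqs. (3.13)-(3.15)."""
    N = X.shape[0]
    L_tilde = xi * np.eye(N)                              # xi*I keeps L_tilde positive-definite
    for Yv in views:
        sq = np.sum(Yv**2, 1)[:, None] + np.sum(Yv**2, 1)[None, :] - 2.0 * Yv @ Yv.T
        t_v = np.mean(sq)                                 # kernel width: mean squared distance
        W = np.exp(-sq / t_v)
        W[labels[:, None] != labels[None, :]] = 0.0       # only same-class pairs are connected
        np.fill_diagonal(W, 0.0)
        D = np.diag(W.sum(axis=1) + 1e-12)
        Dinv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
        L_tilde += Dinv_sqrt @ (D - W) @ Dinv_sqrt        # normalised Laplacian of Eq. (3.14)
    return 0.5 * beta * np.trace(X.T @ L_tilde @ X)

# toy usage with two views and binary labels
rng = np.random.default_rng(1)
views = [rng.normal(size=(30, 8)), rng.normal(size=(30, 12))]
labels = rng.integers(0, 2, size=30)
X = rng.normal(size=(30, 3))
print(discriminative_prior_term(X, views, labels, beta=0.5))
```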
3.2.2 Optimization for SAGP

We adopt the gradient descent strategy to estimate the latent component $\mathbf{X}$ and the hyperparameters $\{\gamma^v\}_{v=1}^V$ and $\{\theta^v\}_{v=1}^V$. In practice, we find that the value of $L_\theta$ is much larger than that of $L_\gamma$, which harms the performance of the model. Therefore, we update the shared space alternately in $L_\gamma$ and in $L_\theta + L_d$, and the objective function in Eq. (3.16) is transformed into
$$ \min\; L_\gamma + L_\theta + L_d \quad \text{s.t.}\quad \mathbf{X} = \mathbf{X}_\gamma \tag{3.17} $$
where $\mathbf{X}_\gamma$ denotes the shared variable learned from $L_\gamma$, and $\mathbf{X}$ denotes the shared variable learned from $L_\theta + L_d$. We adopt the Augmented Lagrangian Method (ALM) to solve this problem, which yields the following augmented Lagrangian function:
$$ \begin{aligned} \min\; L_{\text{ALM}} =\; & \frac{1}{2}\Big[ qN\log 2\pi + q\log\Big|\sum_v \mathbf{K}_Y^v\Big| + \operatorname{tr}\Big(\big(\textstyle\sum_v \mathbf{K}_Y^v\big)^{-1}\mathbf{X}_\gamma\mathbf{X}_\gamma^{T}\Big)\Big] \\ & + \sum_{v=1}^{V}\frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & + \frac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\tilde{\mathbf{L}}\mathbf{X}\big) + \big\langle\mathbf{Z},\,\mathbf{X}-\mathbf{X}_\gamma\big\rangle + \frac{\mu}{2}\big\|\mathbf{X}-\mathbf{X}_\gamma\big\|_F^2 \end{aligned} \tag{3.18} $$
where $\mathbf{Z}$ is the Lagrange multiplier, $\langle\cdot,\cdot\rangle$ denotes the inner product, and $\mu$ is the nonnegative penalty parameter. Consequently, the gradients w.r.t. $\mathbf{X}_\gamma$ and $\mathbf{X}$ are
$$ \frac{\partial L_{\text{ALM}}}{\partial \mathbf{X}_\gamma} = \Big(\sum_v\mathbf{K}_Y^v\Big)^{-1}\mathbf{X}_\gamma - \mu\Big(\mathbf{X}-\mathbf{X}_\gamma+\frac{\mathbf{Z}}{\mu}\Big) \tag{3.19} $$
$$ \frac{\partial L_{\text{ALM}}}{\partial \mathbf{X}} = \sum_{v=1}^{V}\frac{\partial L_\theta^v}{\partial \mathbf{X}} + \beta\tilde{\mathbf{L}}\mathbf{X} + \mu\Big(\mathbf{X}-\mathbf{X}_\gamma+\frac{\mathbf{Z}}{\mu}\Big) \tag{3.20} $$
We employ the scaling parameter $\beta$ to balance the significance of the discriminative prior and set its value empirically. According to the chain rule, we then obtain
$$ \frac{\partial L_\theta^v}{\partial x_{ij}} = \operatorname{tr}\!\left(\Big(\frac{\partial L_\theta^v}{\partial \mathbf{K}_X^v}\Big)^{T}\frac{\partial \mathbf{K}_X^v}{\partial x_{ij}}\right) \tag{3.21} $$
Similarly, the derivatives of the augmented Lagrangian function with respect to the kernel parameters $\{\gamma^v\}_{v=1}^V$ and $\{\theta^v\}_{v=1}^V$ are
$$ \frac{\partial L_{\text{ALM}}}{\partial \gamma_i^v} = \operatorname{tr}\!\left(\Big(\frac{\partial L_{\text{ALM}}}{\partial \sum_v\mathbf{K}_Y^v}\Big)^{T}\frac{\partial \mathbf{K}_Y^v}{\partial \gamma_i^v}\right) \tag{3.22} $$
$$ \frac{\partial L_{\text{ALM}}}{\partial \theta_i^v} = \operatorname{tr}\!\left(\Big(\frac{\partial L_\theta^v}{\partial \mathbf{K}_X^v}\Big)^{T}\frac{\partial \mathbf{K}_X^v}{\partial \theta_i^v}\right) \tag{3.23} $$
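For concreteness, the following NumPy sketch evaluates a chain-rule gradient of the decoder term with respect to an RBF lengthscale in the spirit of Eqs. (3.21)–(3.23); the closed-form expression for $\partial L_\theta^v/\partial\mathbf{K}_X^v$ and the single-kernel setting are standard GP results used here as assumptions, not a transcription of the authors' code.

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def dL_dlengthscale(X, Y, lengthscale, variance=1.0, jitter=1e-6):
    """Chain-rule gradient tr((dL/dK)^T dK/dl) for the decoder term L_theta^v."""
    N, D = Y.shape
    K = rbf_kernel(X, lengthscale, variance) + jitter * np.eye(N)
    Kinv = np.linalg.inv(K)
    A = Kinv @ Y @ Y.T @ Kinv
    dL_dK = 0.5 * (D * Kinv - A)                              # gradient of L_theta^v w.r.t. K
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    dK_dl = (K - jitter * np.eye(N)) * sq / lengthscale**3    # dK/dl for the RBF kernel
    return np.trace(dL_dK.T @ dK_dl)                          # Eq. (3.23)-style contraction

rng = np.random.default_rng(2)
X, Y = rng.normal(size=(15, 2)), rng.normal(size=(15, 4))
print(dL_dlengthscale(X, Y, lengthscale=1.0))
```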
After obtaining the latent variables $\mathbf{X}_\gamma$ and $\mathbf{X}$ and their corresponding parameters, the Lagrange multiplier $\mathbf{Z}$ and the parameter $\mu$ are updated as
$$ \mathbf{Z}_{t+1} = \mathbf{Z}_t + \mu_t(\mathbf{X}-\mathbf{X}_\gamma), \qquad \mu_{t+1} = \min(\mu_{\max},\,\rho\mu_t) \tag{3.24} $$
From the $t$-th to the $(t+1)$-th iteration, the constant parameter $\rho$ controls the step size of $\mu$ and is generally set to 1.1, while the constant $\mu_{\max}$, typically set to 1000, restrains the maximum of $\mu$.
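The alternating updates of Eqs. (3.19), (3.20), and (3.24) can be organized as the following skeleton; it is a hedged sketch in which the two gradient callbacks stand in for the GP terms (kernel hyperparameter updates are omitted) and all names and step sizes are our own illustrative choices.

```python
import numpy as np

def alm_optimize_sagp(X_init, grad_Lgamma, grad_Ltheta_Ld, n_iters=50, lr=1e-3,
                      mu=1e-2, mu_max=1e3, rho=1.1):
    """Skeleton of the alternating ALM updates in Eqs. (3.19), (3.20), and (3.24).

    grad_Lgamma(X_gamma) and grad_Ltheta_Ld(X) are user-supplied gradient
    callbacks for the encoder term and the decoder + prior terms, respectively."""
    X = X_init.copy()
    X_gamma = X_init.copy()
    Z = np.zeros_like(X_init)                                  # Lagrange multiplier
    for _ in range(n_iters):
        # encode-step: gradient step on X_gamma following Eq. (3.19)
        X_gamma -= lr * (grad_Lgamma(X_gamma) - mu * (X - X_gamma + Z / mu))
        # decode-step: gradient step on X following Eq. (3.20)
        X -= lr * (grad_Ltheta_Ld(X) + mu * (X - X_gamma + Z / mu))
        # multiplier and penalty updates, Eq. (3.24)
        Z = Z + mu * (X - X_gamma)
        mu = min(mu_max, rho * mu)
    return X, X_gamma

# toy usage with dummy quadratic objectives standing in for the real GP terms
rng = np.random.default_rng(3)
X0 = rng.normal(size=(10, 2))
X_opt, Xg_opt = alm_optimize_sagp(X0, grad_Lgamma=lambda Xg: Xg,
                                  grad_Ltheta_Ld=lambda X: 0.1 * X)
```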
3.2.3 Inference

After the optimization, we obtain the latent components and the parameters of the proposed model.
Algorithm 3.1 Shared auto-encoder Gaussian process latent variable model (SAGP)
Input: the observed training samples $\{\mathbf{Y}^v\}_{v=1}^V$, the testing sample $\{\mathbf{y}^{v*}\}_{v=1}^V$, the prior parameter $\beta$, and the dimensionality $q$ of the latent space.
Output: the latent variables $\mathbf{X}$ and $\mathbf{x}^*$.
Initialization: initialize the latent component $\mathbf{X}$ using LDA.
1: while not converged do
2:   Encode-step:
3:     optimize the latent variable $\mathbf{X}_\gamma$ and the hyperparameters $\{\gamma^v\}_{v=1}^V$ by Eqs. (3.19) and (3.22)
4:   Decode-step:
5:     estimate the latent component $\mathbf{X}$ and the hyperparameters $\{\theta^v\}_{v=1}^V$ by Eqs. (3.20), (3.21), and (3.23)
6:   Update the Lagrange multiplier and $\mu$:
7:     update $\mathbf{Z}$ and $\mu$ using Eq. (3.24)
8: end while
9: Test: obtain the latent component for the test sample by Eq. (3.25) or Eq. (3.26).
Given a test multi-view sample denoted as $\mathbf{y}^* = \{\mathbf{y}^{v*}\}_{v=1}^V$, the inference step is quite simple and straightforward. Following the result in [9], Chap. 2, we can calculate the latent variable $\mathbf{x}^*$ as follows:
$$ \mathbf{x}^* = \Big(\sum_v \mathbf{K}_{\mathbf{y}^{v*}\mathbf{Y}^v}\Big)\Big(\sum_v \mathbf{K}_Y^v\Big)^{-1}\mathbf{X} \tag{3.25} $$
in which $\mathbf{K}_{\mathbf{y}^{v*}\mathbf{Y}^v}$ denotes the kernel matrix calculated between the test point and all training points of the $v$-th view. For the independent encoder (IE) assumption, if the $v$-th view is available, we use Eq. (3.26) to calculate the latent variable $\mathbf{x}^*$:
$$ \mathbf{x}^* = \mathbf{K}_{\mathbf{y}^{v*}\mathbf{Y}^v}\big(\mathbf{K}_Y^v\big)^{-1}\mathbf{X} \tag{3.26} $$
Algorithm 3.1 summarizes the optimization and inference procedures in detail. Following Eq. (3.25) or Eq. (3.26), we obtain the latent variable $\mathbf{x}^*$ for a test sample. We then adopt the KNN classifier to predict the label, with $K$ set to 1 in the testing step.
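The test-time pipeline of Eq. (3.25) followed by 1-NN classification can be sketched as below; the RBF kernel choice, the fixed lengthscales, and the toy data are assumptions made only for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sagp_encode_test(test_views, train_views, X, lengthscales):
    """Eq. (3.25): x* = (sum_v K_{y*,Y^v}) (sum_v K_Y^v)^{-1} X."""
    N = X.shape[0]
    K_sum = np.zeros((N, N))
    k_star = np.zeros((1, N))
    for y_star, Yv, ell in zip(test_views, train_views, lengthscales):
        K_sum += rbf_kernel(Yv, Yv, ell)
        k_star += rbf_kernel(y_star[None, :], Yv, ell)
    return k_star @ np.linalg.solve(K_sum + 1e-6 * np.eye(N), X)

def predict_1nn(x_star, X_train, labels):
    """1-NN classification in the shared latent space, as used in the testing step."""
    return labels[np.argmin(np.linalg.norm(X_train - x_star, axis=1))]

rng = np.random.default_rng(4)
train_views = [rng.normal(size=(25, 6)), rng.normal(size=(25, 9))]
test_views = [rng.normal(size=6), rng.normal(size=9)]
X_train, labels = rng.normal(size=(25, 3)), rng.integers(0, 3, size=25)
x_star = sagp_encode_test(test_views, train_views, X_train, lengthscales=[1.0, 1.0])
print(predict_1nn(x_star, X_train, labels))
```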
3.2.4 Experimental Results

In this section, we apply SAGP to object classification and demonstrate its effectiveness and superiority by evaluating it on three datasets. We first briefly describe the datasets and the experimental settings, and then analyze the experimental results.
3.2.4.1 Dataset Description

The Wiki Text-Image dataset [22] is collected from Wikipedia's featured articles. Because some classes are very scarce, the 10 most frequent categories are selected. In total, the dataset consists of 2173 training instances and 693 testing instances, and each instance contains an image and more than 70 words. Here, a 128-D SIFT feature and a 10-D latent Dirichlet allocation feature are used to represent the image and the text, respectively.

The Animals with Attributes (AWA) dataset [23–25], including 30,475 images from 50 classes, is also used. Two kinds of features are extracted by Convolutional Neural Networks (CNNs): one is obtained by a 7-layer CaffeNet pretrained on ILSVRC2012, and the other by a very deep 19-layer CNN pretrained on ILSVRC2014. In this part, 50 samples randomly selected from each class are used as the training set, while the remaining images serve as the testing set. Similar to [5], the dimensionality of the CNN features is reduced to 300 by Principal Component Analysis (PCA), which greatly decreases the training time.

The NUS-WIDE-LITE dataset [26] consists of 27,807 training samples and 27,808 testing samples. Since some classes contain very few samples, we finally use 9 classes: boats, tower, birds, rocks, toy, tree, vehicle, flowers, and sun. Additionally, because the dataset contains multi-label samples, we ignore samples without a label or with multiple labels. In the end, we adopt 16,377 samples in total and randomly select 200 samples from each class as the training set; the remaining samples are used as the testing set. Similar to [5], three kinds of features, including the color histogram (CH), edge direction histogram (EDH), and wavelet texture (WT) features, are extracted and pre-processed by PCA; their dimensionalities are 33, 43, and 40, respectively.
3.2.4.2 Experimental Settings

To prove the effectiveness of SAGP, we conduct a series of comparison experiments with CCA [27, 28], JSSL [29], and the GPLVM-based approaches GPLVM [6], DGPLVM [8], and DS-GPLVM [5]. Because the works [6] and [8] were designed only for single-view data, the multiple views are concatenated into a single vector as their input. Since the number of CCA outputs depends on the number of views, we only report its best results. Similarly, DS-GPLVMI and the independent encoder-based SAGP produce different results for each view, and we only report the best numerical results in this chapter. Note that, in [5], DS-GPLVM also has two back constraints, referred to as DS-GPLVMI and DS-GPLVMS. In addition, for all methods apart from JSSL, we feed the latent components into a 1-NN classifier to identify the different categories.
As the auto-encoder framework is used in the proposed model, we also compare SAGP with multi-view deep models, including deep canonical correlation analysis (DCCA) [30] and deep canonically correlated auto-encoders (DCCAE) [31].
3.2.4.3 Performance on the Three Datasets

Wiki Text-Image Dataset. The overall and average classification accuracies on the Wiki Text-Image dataset are reported in Table 3.1, with the dimensionality of $\mathbf{X}$ ranging from 1 to 10. SAGP achieves better results than most of the comparison approaches in most cases. Compared with GPLVM and CCA, SAGP obtains an improvement of more than 7% at dimensionality 8, and it also shows a remarkable improvement over the other approaches in many settings. SAGP is also competitive with DS-GPLVMI and DS-GPLVMS. In contrast to the deep models DCCA and DCCAE, SAGP still achieves a significant increase in accuracy. Table 3.2 lists the per-class classification accuracies of the different approaches with the dimensionality $q$ fixed to 8. SAGP obtains the best performance in the history, sport, music, and biology classes. In most categories, the results obtained by SAGP are equal to or better than 50%, with the exception of art, whereas some of the compared strategies fall below 50% in at least three classes.

AWA Dataset. Table 3.3 reports the overall and average classification accuracies on the AWA dataset. Compared with CCA and GPLVM, SAGP obtains much better results in all cases. Compared with the other approaches, such as DGPLVM, DCCA, DCCAE, and DS-GPLVM, the proposed method also achieves competitive results. The confusion matrix calculated by SAGP(SE) on the AWA dataset is shown in Fig. 3.2, whose diagonal elements give the classification accuracy of each class; the dimensionality $q$ of $\mathbf{X}$ is set to 50. The figure shows that the accuracies of most classes exceed 70%, except for blue+whale, cow, hamster, humpback+whale, moose, and rat, which further substantiates the superiority of the presented strategy.

NUS-WIDE-LITE Dataset. The classification results on the NUS-WIDE-LITE dataset are tabulated in Table 3.4, with the dimensionality of $\mathbf{X}$ varying from 1 to 30. Since the CCA-based methods are designed for tasks with two modalities, the second and third modalities are concatenated into a single one. Table 3.4 shows that the results obtained by SAGP are much better than those obtained by the other methods. The confusion matrices of the various approaches are displayed in Fig. 3.3 with the dimensionality $q$ fixed to 10. The per-class classification accuracies of the proposed method are clearly superior or competitive to those of the other state-of-the-art methods.
Table 3.1 The overall and average classification accuracies obtained by SAGP and the comparison methods (JSSL, CCA, GPLVM, DGPLVM, DS-GPLVMI, DS-GPLVMS, DCCA, DCCAE, SAGP(IE), and SAGP(SE)) on the Wiki Text-Image dataset, with the dimensionality of the latent variable varying from 1 to 10. Bold values indicate the best performance
Table 3.2 The classification accuracy of each category obtained by different individual or multi-view strategies on the Wiki Text-Image dataset

Categories   CCA     JSSL    GPLVM   DGPLVM  DSGPLVMI  DSGPLVMS  DCCA    DCCAE   SAGP(IE)  SAGP(SE)
Arts         41.2%   25.7%   47.1%   35.3%   38.2%     32.4%     50.0%   44.1%   35.3%     35.3%
Biology      89.8%   91.0%   85.2%   87.5%   90.9%     92.1%     93.2%   89.8%   93.2%     93.2%
Geography    57.3%   77.0%   57.3%   64.6%   71.9%     69.8%     64.6%   58.3%   70.8%     68.8%
History      40.0%   28.2%   37.6%   43.5%   54.1%     64.7%     36.5%   44.7%   62.4%     65.9%
Literature   49.2%   71.0%   60.0%   55.4%   49.2%     60.0%     55.4%   58.5%   60.0%     58.5%
Media        39.7%   51.3%   37.9%   39.7%   37.9%     41.4%     44.8%   41.4%   48.3%     50.0%
Music        52.9%   50.3%   45.1%   41.2%   56.7%     56.9%     52.9%   51.0%   56.9%     68.6%
Royalty      51.2%   78.7%   46.3%   58.5%   58.5%     61.0%     61.0%   58.5%   53.7%     56.1%
Sport        90.1%   86.7%   91.5%   90.1%   88.7%     91.6%     90.1%   88.7%   93.0%     91.5%
Warfare      81.7%   89.3%   79.8%   79.8%   81.7%     81.7%     82.7%   88.5%   82.7%     82.7%
Table 3.3 The overall and average classification accuracies obtained by SAGP and the comparison methods on the AWA dataset, with the dimensionality of the latent variable varying from 40 to 130. Bold values indicate the best performance
Fig. 3.2 Confusion matrix of the category recognition results using dataset AWA. The vertical axis represents the true labels and the horizontal axis represents the predicted labels
3.2.5 Conclusion

In this chapter, a multi-view learning method based on GPLVM is proposed to learn a shared variable in a subspace. Different from the conventional GPLVM and its extensions, the proposed method simultaneously takes into account the back projection from the multiple observations to the shared variable. Thanks to this back constraint, the latent variable can be obtained simply and efficiently in the testing phase. Experimental results on three real-world datasets demonstrate the superiority of our approach over the state-of-the-art methods.
3.3 Multi-Kernel Shared Gaussian Process Latent Variable Model

As mentioned above, although SAGP achieves better performance than JSSL, two problems remain to be tackled.
Table 3.4 The overall and average classification accuracies obtained by SAGP and the comparison methods on the NUS-WIDE-LITE dataset, with the dimensionality of the latent variable varying from 1 to 30. Bold values indicate the best performance
Fig. 3.3 Confusion matrix of the category recognition results calculated by JSSL, CCA, DCCA, DCCAE, GPLVM, DS-GPLVM, and SAGP on NUS-WIDE-LITE dataset. The vertical axis represents the true labels, and the horizontal axis represents the predicted labels
1. How to effectively construct the covariance function with kernel functions. Conventional GPLVM-based methods adopt a single predetermined kernel function, such as the radial basis function (RBF), to build the covariance. In general, the kernel function is crucial in training and has a great influence on classification performance. However, real-world datasets are complex, and it is difficult to model their distributions with a single kernel function.
2. How to mine the category-based information so as to better classify samples from different categories. To adapt the models to classification, some priors are imposed on the latent variables. For instance, Urtasun et al. [8], Eleftheriadis et al. [5], and SAGP employ discriminative GPLVM priors, encouraging the latent components from the same class to be close and those from different classes to be far apart in the manifold space. However, these approaches require a separately selected classifier to be trained off-line, and the learned variables may not fully meet the assumption of that classifier.
To tackle these two issues, we further extend SAGP to MKSGP [32]. In this chapter, we first introduce MKSGP, including its encoder, decoder, and prior, followed by its optimization, inference, and computational complexity analysis. We then conduct experiments to demonstrate its effectiveness.
3.3.1 Problem Formulation

3.3.1.1 Decoding Part

The framework of MKSGP is shown in Fig. 3.4. Denote $\mathbf{Y} = \{\mathbf{Y}^v \in \mathbb{R}^{N\times D^v}\}$ ($v = 1,\ldots,V$) as the observed modalities, where $V$ is the number of views, $N$ is the number of samples in each modality, $D^v$ is the dimensionality of view $\mathbf{Y}^v$, and $\mathbf{Y}^v = [\mathbf{y}_1^v,\ldots,\mathbf{y}_N^v]^T$. Similar to SGPLVM [7, 33] and SAGP [15], a latent variable $\mathbf{X} \in \mathbb{R}^{N\times q}$ is shared among the different views, instead of estimating a separate one for each view as is done in GPLVM. In detail, the distribution of the observed data $\mathbf{Y}^v$ given the latent variable $\mathbf{X}$ is conditionally independent of the other modalities. Thus, the likelihood of the multiple modalities given $\mathbf{X}$ satisfies
$$ p(\mathbf{Y}\mid\mathbf{X}) = \prod_{v=1}^{V} p(\mathbf{Y}^v\mid\mathbf{X},\theta^v) = \prod_{v=1}^{V}\frac{1}{\sqrt{(2\pi)^{N D^v}\,|\mathbf{K}_X^v|^{D^v}}}\exp\!\Big(-\tfrac{1}{2}\operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big) \tag{3.27} $$
where $\theta^v$ and $\mathbf{K}_X^v$ are the kernel parameters and kernel matrix corresponding to the $v$-th view. Most existing GP-based works, as well as SAGP, exploit a single kernel function such as the RBF to establish the kernel matrix $\mathbf{K}_X^v$. However, this may be an unreasonable choice given the complexity of real-world data.
Fig. 3.4 The framework of the proposed strategy (MKSGP). Multiple features (CH, EDH, WT) are extracted first. Mapping functions ($g^v$) from the observed samples to the latent space are then learned with multiple kernel functions, and mapping functions ($f^v$) from the manifold space back to the raw data are also computed. In particular, in order to apply the proposed strategy to classification, a large-margin classifier prior is imposed on the shared latent components
Instead of assuming that $\mathbf{K}_X^v$ follows a single determinative kernel function, we employ multiple kernels to automatically and adaptively construct the kernel matrix, which is more reasonable for various real-world datasets. The kernel matrix $\mathbf{K}_X^v$ is therefore reformulated as
$$ \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \qquad \text{s.t.}\quad w_{Xk_f}^v \ge 0,\;\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1 \tag{3.28} $$
where $K_f$ is the number of selected kernel functions and $w_{Xk_f}^v$ is the weight of the $k_f$-th kernel function $\mathbf{K}_{Xk_f}^v$ in the $v$-th view. Without loss of generality, the number and types of the pre-defined kernel functions are set to be the same for all views. To estimate the latent variable $\mathbf{X}$, we optimize the proposed model by minimizing the negative log-likelihood function, as shown in Eq. (3.29):
$$ \begin{aligned} & L_f = \sum_{v=1}^{V} L_f^v, \qquad L_f^v = \frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \qquad \text{s.t.}\quad w_{Xk_f}^v \ge 0,\;\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1 \quad (v = 1,\ldots,V) \end{aligned} \tag{3.29} $$
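As a concrete illustration of the weighted-sum construction in Eq. (3.28), the sketch below combines a few base kernels with simplex-constrained weights; the particular base kernels (linear, RBF, polynomial) and all names are our own assumptions — the chapter itself later uses six kernel types.

```python
import numpy as np

def linear_kernel(X):
    return X @ X.T

def rbf_kernel(X, lengthscale=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * sq / lengthscale**2)

def poly_kernel(X, degree=2, c=1.0):
    return (X @ X.T + c) ** degree

def multi_kernel_covariance(X, weights, jitter=1e-6):
    """K_X^v = sum_kf w_kf * K_kf(X) with w_kf >= 0 and sum_kf w_kf = 1 (Eq. (3.28))."""
    base = [linear_kernel(X), rbf_kernel(X), poly_kernel(X)]   # assumed base kernel set
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)
    w = w / w.sum()                                            # enforce the simplex constraint
    K = sum(wk * Kk for wk, Kk in zip(w, base))
    return K + jitter * np.eye(X.shape[0])

rng = np.random.default_rng(5)
X = rng.normal(size=(12, 3))
K = multi_kernel_covariance(X, weights=[0.2, 0.5, 0.3])
print(K.shape, np.all(np.linalg.eigvalsh(K) > 0))
```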
3.3.1.2 Priors for Classification

Because our purpose is to apply the proposed method to classification, supervised priors are embedded into the model to mine the label information. Various discriminative priors have been widely utilized in [5, 8, 15, 20, 34–36]. For instance, a discriminative prior is imposed on SAGP to promote the data from the same category to be similar and those from different categories to be dissimilar. However, this prior requires an off-line classifier after training, so the model learning is not adaptive to that classifier. Differently, in MKSGP, a large margin prior is utilized to learn a hyperplane for each class. In this way, we can not only learn the GPLVM model but also obtain the classifier online.
Formally, the large margin prior on the latent variable $\mathbf{x}_i$ is represented as follows:
$$ \log p(\mathbf{x}_i) = \lambda \sum_{c=1}^{C} L\big(\mathbf{x}_i^{T}, \mathbf{t}_i, \mathbf{w}_c, b_c\big) \tag{3.30} $$
where $\lambda$ is a nonnegative parameter that trades off Eqs. (3.30) and (3.29), $C$ is the number of categories, $\mathbf{w}_c$ is the hyperplane associated with the $c$-th category, $b_c$ is its associated bias, and $\mathbf{t}_i = [t_{i1},\ldots,t_{ic},\ldots,t_{iC}]$ is the label vector ($t_{ic} = 1$ if the ground-truth label of the $i$-th sample equals $c$ and $t_{ic} = -1$ otherwise). Specifically, we define the large margin criterion as
$$ L\big(\mathbf{X}^{T}, \mathbf{T}, \mathbf{w}_c, b_c\big) = \frac{1}{2}\|\mathbf{w}_c\|_2^2 + \tau\sum_{i=1}^{N} l\big(\mathbf{x}_i^{T}, \mathbf{t}_i, \mathbf{w}_c, b_c\big) \tag{3.31} $$
where $\mathbf{T}$ is the ground-truth label matrix and $\tau$ is a nonnegative parameter. To obtain a simple yet effective computation with a better smoothness property, the quadratic hinge loss function is adopted in Eq. (3.31):
$$ l(\mathbf{x}_i, \mathbf{t}_i, \mathbf{w}_c, b_c) = \big[\max\big(1 - t_{ic}(\mathbf{w}_c^{T}\mathbf{x}_i + b_c),\, 0\big)\big]^2 \tag{3.32} $$
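A minimal NumPy sketch of Eqs. (3.31)–(3.32) is given below; the one-vs-rest label encoding and the toy shapes are assumptions made only to show how the quadratic hinge term is evaluated.

```python
import numpy as np

def quadratic_hinge(X, t_c, w_c, b_c):
    """Per-sample losses [max(1 - t_ic (w_c^T x_i + b_c), 0)]^2 of Eq. (3.32)."""
    margins = 1.0 - t_c * (X @ w_c + b_c)
    return np.maximum(margins, 0.0) ** 2

def large_margin_criterion(X, T, W, b, tau=1.0):
    """L(X^T, T, w_c, b_c) of Eq. (3.31), summed over the C one-vs-rest hyperplanes."""
    total = 0.0
    for c in range(T.shape[1]):
        total += 0.5 * np.dot(W[:, c], W[:, c]) \
                 + tau * np.sum(quadratic_hinge(X, T[:, c], W[:, c], b[c]))
    return total

# toy usage: N = 10 latent points, q = 3, C = 2 classes with +/-1 label encoding
rng = np.random.default_rng(6)
X = rng.normal(size=(10, 3))
labels = rng.integers(0, 2, size=10)
T = np.where(np.eye(2)[labels] == 1, 1.0, -1.0)      # t_ic = +1 for the true class, -1 otherwise
W, b = rng.normal(size=(3, 2)), np.zeros(2)
print(large_margin_criterion(X, T, W, b, tau=2.0))
```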
3.3.1.3 Encoding Part

In the proposed model, we assume that each observed sample is a transformation of its corresponding low-dimensional component in the latent space. Given a testing sample, however, what we need is its latent component for classification. Similar to SAGP, an auto-encoder-based projection, combined with multi-kernel learning, is proposed in MKSGP to provide a transformation from the observed sample to the shared component. In detail, we define $g^v$ as a transformation function from $\mathbf{Y}^v$ to $\mathbf{X}$, embedded with a $\mathrm{GP}(\mathbf{0},\mathbf{K}_Y^v)$ prior, where $\mathbf{K}_Y^v$ is the kernel matrix associated with the input $\mathbf{Y}^v$. Similar to Eq. (3.28), multiple kernel functions are adopted to establish $\mathbf{K}_Y^v$:
$$ \mathbf{K}_Y^v = \sum_{k_g=1}^{K_g} w_{Yk_g}^v\,\mathbf{K}_{Yk_g}^v, \qquad \text{s.t.}\quad w_{Yk_g}^v \ge 0,\;\; \sum_{k_g=1}^{K_g} w_{Yk_g}^v = 1 \tag{3.33} $$
where $K_g$ is the number of selected kernels and $\mathbf{K}_{Yk_g}^v$ is the $k_g$-th type of kernel matrix in the $v$-th view, with weight $w_{Yk_g}^v$. Since there are $V$ views, the conditional distribution of the latent variable given these inputs, $p(\mathbf{X}\mid\mathbf{Y} = \{\mathbf{Y}^v\}_{v=1}^V)$, needs to be calculated. In this chapter, the covariance of $p(\mathbf{X}\mid\mathbf{Y})$ is simply defined as $\mathbf{K}_Y = \sum_{v=1}^{V}\mathbf{K}_Y^v$, similar to [15]. Therefore, the distribution $p(\mathbf{X}\mid\mathbf{Y})$ is
$$ p(\mathbf{X}\mid\mathbf{Y},\gamma) = \frac{1}{\sqrt{(2\pi)^{Nq}\,|\mathbf{K}_Y|^{q}}}\exp\!\Big(-\tfrac{1}{2}\operatorname{tr}\big(\mathbf{K}_Y^{-1}\mathbf{X}\mathbf{X}^{T}\big)\Big) \tag{3.34} $$
where $\gamma = \{\gamma_{k_g}^v\}_{k_g,v}$ are the kernel parameters of the functions $g$. Consequently, the joint conditional probability of the shared component $\mathbf{X}$ given the observations is
$$ p(\mathbf{X}\mid\mathbf{Y},\gamma,\theta) \propto p(\mathbf{X}\mid\mathbf{Y},\gamma)\,p(\mathbf{Y}\mid\mathbf{X},\theta) \tag{3.35} $$
Finally, by embedding the priors on $\mathbf{X}$ into Eq. (3.35), the objective function can be written as follows:
$$ \begin{aligned} & L = L_f + L_g + L_{\text{pri}}, \qquad L_f = \sum_{v=1}^{V} L_f^v, \qquad L_{\text{pri}} = \lambda\sum_{c=1}^{C} L\big(\mathbf{X}^{T},\mathbf{T},\mathbf{w}_c,b_c\big) \\ & L_f^v = \frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & L_g = \frac{1}{2}\Big[ qN\log 2\pi + q\log|\mathbf{K}_Y| + \operatorname{tr}\big(\mathbf{K}_Y^{-1}\mathbf{X}\mathbf{X}^{T}\big)\Big] \\ & \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \quad \text{s.t.}\; w_{Xk_f}^v \ge 0,\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1 \\ & \mathbf{K}_Y^v = \sum_{k_g=1}^{K_g} w_{Yk_g}^v\,\mathbf{K}_{Yk_g}^v, \quad \text{s.t.}\; w_{Yk_g}^v \ge 0,\; \sum_{k_g=1}^{K_g} w_{Yk_g}^v = 1 \quad (v = 1,\ldots,V) \end{aligned} \tag{3.36} $$
3.3.2 Optimization for MKSGP

Similar to SAGP, the latent component $\mathbf{X}$ in $L_f$ and $L_g$ is updated alternately. Therefore, we reformulate the objective function in Eq. (3.36) as follows:
$$ \begin{aligned} & L = L_f + L_g + L_{\text{pri}}, \qquad L_f = \sum_{v=1}^{V} L_f^v, \qquad L_{\text{pri}} = \lambda\sum_{c=1}^{C} L\big(\mathbf{X}^{T},\mathbf{T},\mathbf{w}_c,b_c\big) \\ & L_f^v = \frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & L_g = \frac{1}{2}\Big[ qN\log 2\pi + q\log|\mathbf{K}_Y| + \operatorname{tr}\big(\mathbf{K}_Y^{-1}\mathbf{X}_g\mathbf{X}_g^{T}\big)\Big] \\ & \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \quad \text{s.t.}\; w_{Xk_f}^v \ge 0,\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1; \qquad \mathbf{K}_Y^v = \sum_{k_g=1}^{K_g} w_{Yk_g}^v\,\mathbf{K}_{Yk_g}^v, \quad \text{s.t.}\; w_{Yk_g}^v \ge 0,\; \sum_{k_g=1}^{K_g} w_{Yk_g}^v = 1 \\ & (v = 1,\ldots,V), \qquad \text{s.t.}\quad \mathbf{X} = \mathbf{X}_g \end{aligned} \tag{3.37} $$
The Augmented Lagrange Multiplier method (ALM) is adopted, and the objective function is then represented as
$$ \begin{aligned} L =\; & \frac{1}{2}\Big[ qN\log 2\pi + q\log|\mathbf{K}_Y| + \operatorname{tr}\big(\mathbf{K}_Y^{-1}\mathbf{X}_g\mathbf{X}_g^{T}\big)\Big] \\ & + \sum_{v=1}^{V}\frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & + \lambda\sum_{c=1}^{C} L\big(\mathbf{X}^{T},\mathbf{T},\mathbf{w}_c,b_c\big) + \big\langle\mathbf{Z},\,\mathbf{X}-\mathbf{X}_g\big\rangle + \frac{\mu}{2}\big\|\mathbf{X}-\mathbf{X}_g\big\|_F^2 \end{aligned} \tag{3.38} $$
where $\mu > 0$ (initially set to 0.01 in the experiments) is the penalty parameter, $\mathbf{Z}$ is the Lagrange multiplier, and $\langle\cdot,\cdot\rangle$ denotes the inner product.
Update the Latent Variable $\mathbf{X}$. The derivative with respect to $\mathbf{X}$ is
$$ \frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L_f}{\partial \mathbf{X}} + \frac{\partial L_{\text{pri}}}{\partial \mathbf{X}} + \mu\Big(\mathbf{X}-\mathbf{X}_g+\frac{\mathbf{Z}}{\mu}\Big) \tag{3.39} $$
where $\frac{\partial L_f}{\partial \mathbf{X}} = \sum_{v=1}^{V}\frac{\partial L_f^v}{\partial \mathbf{X}}$. To obtain the gradient of $L_f^v$ with respect to $\mathbf{X}$, the chain rule is utilized:
$$ \frac{\partial L_f^v}{\partial x_{ij}} = \operatorname{tr}\!\left(\Big(\frac{\partial L_f^v}{\partial \mathbf{K}_X^v}\Big)^{T}\frac{\partial \mathbf{K}_X^v}{\partial x_{ij}}\right) \tag{3.40} $$
Update the Latent Variable $\mathbf{X}_g$. The derivative of Eq. (3.38) with respect to $\mathbf{X}_g$ is
$$ \frac{\partial L}{\partial \mathbf{X}_g} = \mathbf{K}_Y^{-1}\mathbf{X}_g - \mu\Big(\mathbf{X}-\mathbf{X}_g+\frac{\mathbf{Z}}{\mu}\Big) \tag{3.41} $$
Update the Kernel Parameters $\gamma$ and $\theta$. Similar to Eq. (3.40), the derivatives of Eq. (3.38) with respect to the kernel parameters $\gamma$ and $\theta$ are calculated via the chain rule:
$$ \frac{\partial L}{\partial \gamma_{k_g,i}^v} = \operatorname{tr}\!\left(\Big(\frac{\partial L_g}{\partial \mathbf{K}_Y}\Big)^{T}\frac{\partial \mathbf{K}_Y^v}{\partial \gamma_{k_g,i}^v}\right) \tag{3.42} $$
$$ \frac{\partial L}{\partial \theta_{k_f,i}^v} = \operatorname{tr}\!\left(\Big(\frac{\partial L_f^v}{\partial \mathbf{K}_X^v}\Big)^{T}\frac{\partial \mathbf{K}_X^v}{\partial \theta_{k_f,i}^v}\right) \tag{3.43} $$
where $\gamma_{k_g,i}^v$ and $\theta_{k_f,i}^v$ are the $i$-th parameters in $\gamma_{k_g}^v$ and $\theta_{k_f}^v$, respectively.
Update the Weights of the Different Kernel Functions. Since there is no closed-form solution for the weights $\mathbf{w}_X^v$ and $\mathbf{w}_Y^v$, we use the chain rule to obtain the gradient of the objective function with respect to them and then apply the gradient descent algorithm. In detail,
$$ \frac{\partial L}{\partial w_{Yk_g}^v} = \operatorname{tr}\!\left(\Big(\frac{\partial L_g}{\partial \mathbf{K}_Y}\Big)^{T}\frac{\partial \mathbf{K}_Y^v}{\partial w_{Yk_g}^v}\right) \tag{3.44} $$
$$ \frac{\partial L}{\partial w_{Xk_f}^v} = \operatorname{tr}\!\left(\Big(\frac{\partial L_f^v}{\partial \mathbf{K}_X^v}\Big)^{T}\frac{\partial \mathbf{K}_X^v}{\partial w_{Xk_f}^v}\right) \tag{3.45} $$
Algorithm 3.2 Multi-kernel shared Gaussian process latent variable model (MKSGP)
Input: the observed training data $\mathbf{Y}$, the testing sample $\mathbf{y}^*$, the prior parameters $\lambda$ and $\tau$, and the dimensionality $q$ of the latent variable.
Output: the latent variables $\mathbf{X}$ and $\mathbf{x}^*$.
Initialization: initialize the latent variable $\mathbf{X}$ using LDA.
1: while not converged do
2:   Update $\mathbf{X}$ and the parameters $\theta$: optimize the latent variable $\mathbf{X}$ and the parameters $\{\theta^v\}_{v=1}^V$ using Eqs. (3.39) and (3.43)
3:   Update $\mathbf{X}_g$ and the parameters $\gamma$: compute the latent variable $\mathbf{X}_g$ and the parameters $\{\gamma^v\}_{v=1}^V$ using Eqs. (3.41) and (3.42)
4:   Update the weights $\mathbf{w}_Y$ and $\mathbf{w}_X$: estimate the weight of each type of kernel function using Eqs. (3.44) and (3.45)
5:   Update the Lagrange multiplier $\mathbf{Z}$ and $\mu$: update $\mathbf{Z}$ and $\mu$ using Eq. (3.46)
6: end while
7: Test: obtain the latent variable of the test sample by Eq. (3.47).
Due to the constraints $\sum_{k_g=1}^{K_g} w_{Yk_g}^v = 1$, $\sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1$, $w_{Yk_g}^v \ge 0$, and $w_{Xk_f}^v \ge 0$, we set $w_{Yk_g}^v = 0$ and $w_{Xk_f}^v = 0$ whenever their values become negative, and then normalize $\mathbf{w}_Y^v = \{w_{Yk_g}^v\}_{k_g}$ and $\mathbf{w}_X^v = \{w_{Xk_f}^v\}_{k_f}$ so that they sum to 1 after each iteration.
Update the Lagrange Multiplier $\mathbf{Z}$ and the Parameter $\mu$. We update $\mathbf{Z}$ and $\mu$ according to Eq. (3.46):
$$ \mathbf{Z}_{t+1} = \mathbf{Z}_t + \mu_t(\mathbf{X}-\mathbf{X}_g), \qquad \mu_{t+1} = \min(\mu_{\max},\,\rho\mu_t) \tag{3.46} $$
in which $\mathbf{Z}_t$ and $\mathbf{Z}_{t+1}$ denote the current and updated values, respectively, and likewise for $\mu_t$ and $\mu_{t+1}$; $\mu_{\max}$ is the maximum of $\mu$ (typically set to 1000), and $\rho$ is a constant controlling the step size of $\mu$ (typically set to 1.1).
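The weight handling described above (zeroing negative weights and renormalizing to the simplex after each gradient step) can be sketched as follows; the uniform fallback for an all-zero vector is an extra safeguard of our own, not part of the original procedure.

```python
import numpy as np

def project_kernel_weights(w):
    """Clip negative weights to zero and renormalise so that the weights sum to one."""
    w = np.clip(np.asarray(w, dtype=float), 0.0, None)
    s = w.sum()
    return w / s if s > 0 else np.full_like(w, 1.0 / len(w))   # fall back to uniform weights

# toy usage: a gradient step may push some weights negative
w = np.array([0.6, -0.1, 0.3, 0.2])
print(project_kernel_weights(w))        # -> nonnegative weights summing to 1
```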
3.3.3 Inference

Referring to [9], Chap. 2, the latent variable $\mathbf{x}^*$ of a testing sample can be calculated in a simple yet effective way:
$$ \mathbf{x}^* = \Big(\sum_v \mathbf{K}_{\mathbf{y}^{v*}\mathbf{Y}^v}\Big)\Big(\sum_v \mathbf{K}_Y^v\Big)^{-1}\mathbf{X} \tag{3.47} $$
in which $\mathbf{K}_{\mathbf{y}^{v*}\mathbf{Y}^v}$ is the kernel matrix evaluated between the testing point and all training points. The optimization and inference of the presented strategy are summarized in Algorithm 3.2.
3.3.4 Experimental Results

In this subsection, experiments on the AWA, NUS-WIDE-LITE, and biomedical (Health vs. DM) datasets are conducted to prove the effectiveness of MKSGP.
3.3.4.1 Experimental Settings

To precisely prove the superiority of MKSGP, we conduct comparison experiments with CCA, randomized non-linear CCA (RCCA) [4], DCCA [30], DCCAE [31], JSSL [29], GPLVM, DGPLVM [8], DSGPLVM [5], and SAGP [15]. Since MKSGP constructs its covariances from multiple kernels, six types of kernels are used, covering the Linear, Polynomial, Rational Quadratic (Ratquad), RBF, Multi-Layer Perceptron (MLP), and Matern kernels. In the experiments, the prior parameters $\lambda$ and $\tau$ were tuned through 3-fold cross validation on small subsets selected from the training data. Because the parameter $\tau$ is sensitive, it is adjusted carefully for each dataset; according to cross validation, it is set to 0.1, 2, and 3 for the three datasets mentioned above, respectively.
3.3.4.2 Experimental Results

AWA Dataset. The classification results of the different methods on the AWA dataset are reported in Table 3.5, with $q$ varying from 40 to 130. In this experiment, the parameters $\lambda$ and $\tau$ are set to 40 and 0.1, respectively. The proposed method clearly performs better than CCA, RCCA, DCCA, DCCAE, and GPLVM, obtaining a significant improvement. Compared with the GPLVM-based approaches and JSSL, MKSGP also performs better, with an increase of about 2% in both the overall and average classification accuracies. Figure 3.5 further displays the confusion matrix when the dimensionality $q$ is 60. The accuracy gained by MKSGP is larger than 70% for most classes, and there are as many as 20 classes whose accuracies exceed 90%.
Table 3.5 The overall and average classification accuracies obtained by MKSGP and the comparison methods on the AWA dataset, with the dimensionality of the latent variable varying from 40 to 130. Bold values indicate the best performance
Fig. 3.5 The confusion matrix of the category recognition results on the AWA dataset with the dimensionality of the latent variable set to 60. The vertical axis represents the true labels and the horizontal axis represents the predicted labels
NUS-WIDE-LITE Dataset. The experimental results of the various single-view and multi-view methods are reported in Table 3.6, with the dimensionality $q$ varying from 1 to 30. In this experiment, $\lambda$ and $\tau$ are set to 70 and 2, respectively. MKSGP clearly achieves the best performance among all approaches. The classification accuracies of the different classes are shown in Fig. 3.6 with $q$ set to 10; the results of the presented strategy are the best for most classes, such as rocks, birds, sun, flowers, tower, and vehicle.

Biomedical Dataset. In this experiment, accuracy, sensitivity, and specificity are adopted as evaluation metrics, and the experimental results are given in Table 3.7. Since DCCA and DCCAE require a large number of training samples, they do not perform well on this dataset and are therefore removed from the comparison. Similar to the setting on the NUS-WIDE-LITE dataset, the feature vectors of the face and sublingual views are concatenated into a single one for CCA and RCCA. Furthermore, except for JSSL, $q$ is set to 10 for all methods, and $\lambda = 30$ and $\tau = 3$ for MKSGP. Compared with the comparison methods, MKSGP achieves significant improvements in classification. Additionally, Fig. 3.7 shows the accuracy curves for different values of $q$, where the number of training samples from each class is set to 50; MKSGP clearly achieves outstanding performance in all situations.
Table 3.6 The overall and average classification accuracies obtained by the various approaches on the NUS-WIDE-LITE dataset, with $q$ varying from 1 to 30. Bold values indicate the best performance
Fig. 3.6 The average accuracy of selected 9 categories calculated by various strategies
3.3.5 Conclusion

In this chapter, we develop an extension of SAGP. Unlike SAGP, which adopts only a single kernel function, we use multi-kernel learning to construct the covariance matrices in both the encoder and the decoder, so that the strategy can model complex data distributions more precisely. Furthermore, a large margin prior is embedded on the latent variable, which is beneficial for classification.
3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent Variable Model

As mentioned above, although MKSGP extends SAGP by adopting a multi-kernel learning strategy to adaptively construct the covariance from various kernels, it simply sums the covariances of $p(\mathbf{X}\mid\mathbf{Y}^v)$ to obtain the covariance of $p(\mathbf{X}\mid\{\mathbf{Y}^v\}_{v=1}^V)$. This strategy is naive and lacks a sound probabilistic justification. To address this problem, we present a novel method named the Shared Linear Encoder-based Multi-kernel Gaussian Process Latent Variable Model (SLEMKGP) [37] in this chapter. Similarly, because the views have different dimensionalities, we jointly learn two transformations to map the multiple observed samples to the latent manifold space. Different from MKSGP, in the encoder a transformation matrix is first learned for each view to linearly map it into a consistent subspace; a second transformation function embedded with a Gaussian process prior is then learned to nonlinearly transform the fused feature into the latent subspace.
Table 3.7 The accuracy, sensitivity, and specificity values obtained by the different methods on the Biomedical dataset with the number of training samples per class set to 30, 40, and 50, respectively. Best results are highlighted in bold

Methods     num = 30                     num = 40                     num = 50
            Acc     Sens    Spec         Acc     Sens    Spec         Acc     Sens    Spec
CCA         76.1%   78.4%   73.8%        76.7%   78.6%   74.9%        79.5%   80.0%   79.1%
RCCA        76.3%   81.1%   71.5%        77.7%   80.7%   74.8%        78.6%   81.0%   76.2%
JSSL        79.8%   78.7%   81.0%        81.5%   83.1%   79.9%        82.8%   82.0%   83.6%
GPLVM       73.7%   71.7%   75.7%        75.8%   73.6%   78.0%        78.1%   78.5%   77.8%
DGPLVM      74.9%   76.8%   73.1%        78.7%   79.5%   78.0%        78.8%   79.2%   78.4%
DSGPLVMI    75.2%   77.7%   72.7%        76.8%   79.0%   74.7%        75.7%   78.3%   73.2%
DSGPLVMS    76.9%   79.6%   74.2%        80.8%   82.9%   78.9%        81.7%   84.2%   79.3%
SAGP(IE)    76.7%   77.9%   75.6%        76.7%   81.4%   72.2%        78.4%   82.1%   74.9%
SAGP(SE)    79.0%   78.9%   79.2%        79.2%   81.3%   77.2%        82.5%   83.9%   81.2%
MKSGP       80.7%   81.1%   80.2%        82.1%   82.5%   81.8%        83.6%   84.8%   82.4%
Fig. 3.7 The accuracy calculated by different methods with different values of the dimensionality of the latent variable on the Biomedical dataset
Fig. 3.8 The framework of the developed strategy SLEMKGP. Multiple views or features (CH, EDH, WT) are first extracted. A projection is then learned for each view to map the observations into a consistent space through the transformation matrices $\{\mathbf{P}^v\}_{v=1}^V$, and a Gaussian process prior is subsequently utilized to map the fused feature to the shared latent manifold space (the encoder). The transformations from the shared latent space back to the observed spaces (the decoder) are also calculated. To use SLEMKGP for classification, a discriminative prior is imposed on the latent component. Here $\mathbf{Y}^v$ denotes the input of the $v$-th view, $\mathbf{P}^v$ its associated mapping matrix, and $f^v$ the projection function with the GP prior for the $v$-th view; the covariance matrices of both the encoder and the decoder are computed using multiple kernels
Inversely, in order to map the shared latent component back to the multiple observations, several transformations embedded with Gaussian process priors are also learned. Thanks to these strategies, SLEMKGP is mathematically more sensible. The framework of SLEMKGP is illustrated in Fig. 3.8.
3.4.1 Problem Formulation

Similar to SGPLVM, SAGP, and MKSGP, we assume that $\mathbf{Y} = \{\mathbf{Y}^v = [\mathbf{y}_1^v,\ldots,\mathbf{y}_N^v]^T \in \mathbb{R}^{N\times D^v}\}_{v=1}^V$ and $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_N]^T \in \mathbb{R}^{N\times q}$ denote the multi-view data and the latent variable, respectively. Thus, the likelihood of the different views given $\mathbf{X}$ is
$$ p(\mathbf{Y}\mid\mathbf{X}) = \prod_{v=1}^{V} p(\mathbf{Y}^v\mid\mathbf{X},\theta^v) = \prod_{v=1}^{V}\frac{1}{\sqrt{(2\pi)^{N D^v}\,|\mathbf{K}_X^v|^{D^v}}}\exp\!\Big(-\tfrac{1}{2}\operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big) \tag{3.48} $$
where $\mathbf{K}_X^v$ is the covariance matrix associated with the $v$-th view and $\theta^v$ denotes the associated kernel parameters. As mentioned above, the kernel function is crucial in computing the covariance matrix, and the distribution of real-world observations is quite complex, so a single type of kernel function cannot meet our requirement. Similar to MKSGP, multi-kernel learning is adopted to adaptively combine various types of kernels to establish $\mathbf{K}_X^v$. Therefore, the covariance matrix $\mathbf{K}_X^v$ can be reformulated as
$$ \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \qquad \text{s.t.}\quad w_{Xk_f}^v \ge 0,\;\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1 \tag{3.49} $$
where $\mathbf{K}_{Xk_f}^v$ is the covariance matrix computed with the $k_f$-th type of kernel in the $v$-th view, $K_f$ is the number of kernels, and $w_{Xk_f}^v$ is the nonnegative weight of $\mathbf{K}_{Xk_f}^v$. Without loss of generality, the number of kernels is taken to be the same ($K_f$) for all views. We further adopt another transformation $p(\mathbf{X}\mid\{\mathbf{Y}^v\}_{v=1}^V)$ from the multiple observed samples to the latent component using the GP structure. However, it is difficult to directly calculate the probabilistic distribution $p(\mathbf{X}\mid\{\mathbf{Y}^v\}_{v=1}^V)$ because of the individual dimensionalities of the multiple views. If we map $\{\mathbf{Y}^v\}_{v=1}^V$ to a single variable $\mathbf{Z} = [\mathbf{z}_1,\ldots,\mathbf{z}_N]^T \in \mathbb{R}^{N\times d_p}$, where $d_p$ is the dimensionality of $\mathbf{Z}$, then $p(\mathbf{X}\mid\mathbf{Z})$ can be obtained easily. Thus, a two-step encoder is used. To be more specific, we first utilize a linear transformation to map the multiple views to a subspace and obtain the variable $\mathbf{Z}$; the projected variable is then further mapped to the latent variable $\mathbf{X}$ with a GP prior. Figure 3.8 illustrates the framework of the proposed approach. For each view $\mathbf{Y}^v$, a transformation $\mathbf{P}^v \in \mathbb{R}^{D^v\times d_p}$ is learned
to map the observed samples to a $d_p$-dimensional subspace, giving $\mathbf{Z}$ as follows:
$$ \mathbf{Z} = \sum_{v=1}^{V} \mathbf{Y}^v\mathbf{P}^v \tag{3.50} $$
We then obtain $\mathbf{X}$ by transforming $\mathbf{Z}$ under $p(\mathbf{X}\mid\mathbf{Z})$, with the relationship between $\mathbf{Z}$ and $\mathbf{X}$ given by
$$ \mathbf{x}_i = g(\mathbf{z}_i) + \varepsilon_g, \qquad g(\mathbf{z}_i) \sim \mathcal{GP} \tag{3.51} $$
We define these two procedures as the encoder. Mathematically, $p(\mathbf{X}\mid\mathbf{Z})$ is given by
$$ p(\mathbf{X}\mid\mathbf{Z},\gamma) = \frac{1}{\sqrt{(2\pi)^{Nq}\,|\mathbf{K}_Z|^{q}}}\exp\!\Big(-\tfrac{1}{2}\operatorname{tr}\big(\mathbf{K}_Z^{-1}\mathbf{X}\mathbf{X}^{T}\big)\Big) \tag{3.52} $$
$$ \mathbf{K}_Z = \sum_{k_g=1}^{K_g} w_{Zk_g}\,\mathbf{K}_{Zk_g}, \qquad \text{s.t.}\quad w_{Zk_g} \ge 0,\;\; \sum_{k_g=1}^{K_g} w_{Zk_g} = 1 \tag{3.53} $$
in which, similar to $\theta = \{\theta_{k_f}^v\}_{k_f,v}$, $\gamma = \{\gamma_{k_g}\}_{k_g}$ denotes the kernel parameters, $w_{Zk_g}$ is the weight of the $k_g$-th kernel, and $K_g$ is the number of kernels. Similar to $\mathbf{K}_{Xk_f}^v$, which is computed from $\mathbf{X}$, $\mathbf{K}_{Zk_g}$ is computed from $\mathbf{Z}$. To apply the proposed approach to classification, the label information should be utilized. In this chapter, similar to SAGP, we adopt a graph Laplacian matrix $\mathbf{L} \in \mathbb{R}^{N\times N}$ to enforce the latent variables from the same category to be close and those from different categories to be far apart. The discriminative regularization imposed on $\mathbf{X}$ is
$$ p(\mathbf{X}) = \prod_{v=1}^{V} p(\mathbf{X}\mid\mathbf{Y}^v)^{1/V} = \frac{1}{V\cdot Z_q}\exp\!\Big(-\tfrac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\mathbf{L}\mathbf{X}\big)\Big), \qquad \mathbf{L} = \sum_{v=1}^{V}\mathbf{L}^v + \xi\mathbf{I}, \quad \mathbf{L}^v = (\mathbf{D}^v)^{-1/2}(\mathbf{D}^v-\mathbf{W}^v)(\mathbf{D}^v)^{-1/2} \tag{3.54} $$
$$ \mathbf{W}^v_{ij} = \begin{cases} \exp\!\Big(-\dfrac{\|\mathbf{y}_i^v-\mathbf{y}_j^v\|^2}{t^v}\Big) & \text{if } i \neq j \text{ and } c_i = c_j \\ 0 & \text{otherwise} \end{cases} \tag{3.55} $$
in which $c_i$ and $c_j$ denote the class labels of the $i$-th and $j$-th samples, respectively; $t^v$ is the kernel width, set to the mean squared distance between the training inputs as in [21]; $\mathbf{D}^v$ is a diagonal matrix with elements $\mathbf{D}^v_{ii} = \sum_{j}\mathbf{W}^v_{ij}$; $\xi$ is a nonnegative parameter with a small value (e.g., $10^{-4}$); $\mathbf{I}$ is the identity matrix; $Z_q$ is a constant; and $\beta$ is a nonnegative parameter. By jointly utilizing the encoder, the decoder, and the discriminative regularization, the joint conditional distribution of the shared latent variable $\mathbf{X}$ given the observed data $\mathbf{Y}$ is
$$ p(\mathbf{X}\mid\mathbf{Y},\mathbf{Z},\gamma,\theta) \propto p(\mathbf{X}\mid\mathbf{Z},\gamma)\,p(\mathbf{Y}\mid\mathbf{X},\theta)\,p(\mathbf{X}) \tag{3.56} $$
Then the objective function is obtained by taking the negative logarithm of Eq. (3.56):
$$ \begin{aligned} & L = L_f + L_g + L_{\text{pri}}, \qquad L_f = \sum_{v=1}^{V} L_f^v, \qquad L_{\text{pri}} = \frac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\mathbf{L}\mathbf{X}\big) \\ & L_f^v = \frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & L_g = \frac{1}{2}\Big[ qN\log 2\pi + q\log|\mathbf{K}_Z| + \operatorname{tr}\big(\mathbf{K}_Z^{-1}\mathbf{X}\mathbf{X}^{T}\big)\Big] \\ & \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \quad \text{s.t.}\; w_{Xk_f}^v \ge 0,\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1; \qquad \mathbf{K}_Z = \sum_{k_g=1}^{K_g} w_{Zk_g}\,\mathbf{K}_{Zk_g}, \quad \text{s.t.}\; w_{Zk_g} \ge 0,\; \sum_{k_g=1}^{K_g} w_{Zk_g} = 1 \\ & \mathbf{Z} = \sum_{v=1}^{V}\mathbf{Y}^v\mathbf{P}^v \quad (v = 1,\ldots,V) \end{aligned} \tag{3.57} $$
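The two-step encoder, a linear fusion $\mathbf{Z} = \sum_v \mathbf{Y}^v\mathbf{P}^v$ (Eq. (3.50)) followed by a multi-kernel covariance on $\mathbf{Z}$ (Eq. (3.53)), can be sketched as below; the chosen base kernels, the random projections, and the function names are illustrative assumptions only.

```python
import numpy as np

def rbf_kernel(Z, lengthscale=1.0):
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * Z @ Z.T
    return np.exp(-0.5 * sq / lengthscale**2)

def slemkgp_encoder_covariance(views, projections, kernel_fns, weights, jitter=1e-6):
    """Two-step encoder: Z = sum_v Y^v P^v (Eq. (3.50)), then the multi-kernel
    covariance K_Z of Eq. (3.53) built on the fused linear subspace."""
    Z = sum(Yv @ Pv for Yv, Pv in zip(views, projections))
    w = np.clip(np.asarray(weights, float), 0.0, None)
    w = w / w.sum()
    K_Z = sum(wk * k(Z) for wk, k in zip(w, kernel_fns))
    return Z, K_Z + jitter * np.eye(Z.shape[0])

# toy usage: two views mapped to a d_p = 4 dimensional fused subspace
rng = np.random.default_rng(7)
views = [rng.normal(size=(20, 6)), rng.normal(size=(20, 10))]
projections = [rng.normal(size=(6, 4)), rng.normal(size=(10, 4))]
kernel_fns = [lambda Z: Z @ Z.T, lambda Z: rbf_kernel(Z, 2.0)]   # assumed base kernels
Z, K_Z = slemkgp_encoder_covariance(views, projections, kernel_fns, weights=[0.4, 0.6])
print(Z.shape, K_Z.shape)
```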
3.4.2 Optimization for SLEMKGP

The latent variable $\mathbf{X}$ in Eq. (3.57) can be estimated using the gradient descent algorithm. Similar to SAGP and MKSGP, we introduce another variable $\mathbf{X}_Z$ to replace $\mathbf{X}$ in the encoder part and relax the objective function to Eq. (3.58).
$$ \begin{aligned} & L = L_f + L_g + L_{\text{pri}}, \qquad L_f = \sum_{v=1}^{V} L_f^v, \qquad L_{\text{pri}} = \frac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\mathbf{L}\mathbf{X}\big) \\ & L_f^v = \frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & L_g = \frac{1}{2}\Big[ qN\log 2\pi + q\log|\mathbf{K}_Z| + \operatorname{tr}\big(\mathbf{K}_Z^{-1}\mathbf{X}_Z\mathbf{X}_Z^{T}\big)\Big] \\ & \mathbf{K}_X^v = \sum_{k_f=1}^{K_f} w_{Xk_f}^v\,\mathbf{K}_{Xk_f}^v, \quad \text{s.t.}\; w_{Xk_f}^v \ge 0,\; \sum_{k_f=1}^{K_f} w_{Xk_f}^v = 1; \qquad \mathbf{K}_Z = \sum_{k_g=1}^{K_g} w_{Zk_g}\,\mathbf{K}_{Zk_g}, \quad \text{s.t.}\; w_{Zk_g} \ge 0,\; \sum_{k_g=1}^{K_g} w_{Zk_g} = 1 \\ & \mathbf{Z} = \sum_{v=1}^{V}\mathbf{Y}^v\mathbf{P}^v \quad (v = 1,\ldots,V), \qquad \text{s.t.}\quad \mathbf{X} = \mathbf{X}_Z \end{aligned} \tag{3.58} $$
According to the Augmented Lagrange Multiplier strategy (ALM), the relaxed objective function in Eq. (3.58) can be modified to
$$ \begin{aligned} L =\; & \frac{1}{2}\Big[ qN\log 2\pi + q\log|\mathbf{K}_Z| + \operatorname{tr}\big(\mathbf{K}_Z^{-1}\mathbf{X}_Z\mathbf{X}_Z^{T}\big)\Big] \\ & + \sum_{v=1}^{V}\frac{1}{2}\Big[ D^v N\log 2\pi + D^v\log\big|\mathbf{K}_X^v\big| + \operatorname{tr}\big((\mathbf{K}_X^v)^{-1}\mathbf{Y}^v(\mathbf{Y}^v)^{T}\big)\Big] \\ & + \frac{\beta}{2}\operatorname{tr}\big(\mathbf{X}^{T}\mathbf{L}\mathbf{X}\big) + \big\langle\mathbf{F},\,\mathbf{X}-\mathbf{X}_Z\big\rangle + \frac{\mu}{2}\big\|\mathbf{X}-\mathbf{X}_Z\big\|_F^2 \end{aligned} \tag{3.59} $$
where F refers to the Lagrange multiplier and the parameter μ is nonnegative v ,w v V penalty parameter. Here we update {X, θ }, {XZ , γ }, {wXk Zkg }, and {P }v=1 g alternatively by utilizing the gradient descend approach. Update XZ and γ The gradients of Eq. (3.59) with respect to XZ in the encoder is calculated. ∂L F = (KZ )−1 XZ − μ X − XZ + (3.60) ∂XZ μ
88
3 Information Fusion Based on Gaussian Process Latent Variable Model
According to the chain rule, we compute gradient values of relaxed objective function with respect to γ and get the formulation shown in Eq. (3.61). ∂L = tr ∂γkg ,i
#
∂Lg ∂KZ
T
∂KZ ∂γkg ,i
(3.61)
in which γkg ,i represents the i-th kernel parameter from the kg -th type of kernels. Update X and θ Similarly, the derivatives of Eq. (3.59) w.r.t. X and θ are V v F ∂L ∂Lf = + βLX + μ X − XZ + ∂X ∂X μ
(3.62)
v=1
⎛# ⎞ T ∂Lvf ∂KvX ∂L ⎠ = tr ⎝ ∂θkvf ,i ∂KvX ∂θkvf ,i where
∂Lvf ∂xij
∂Lv = tr ( ∂Kfv )T X
∂KvX ∂xij
(3.63)
.
Update and wZkg The gradient of objective function w.r.t. the w vX and wZ can also be calculated by using the chain rule. In detail, v wXk g
∂L = tr ∂wZkg
#
∂Lg ∂KZ
T
∂KZ ∂wZkg
⎞ ⎛# T ∂Lvf ∂KvX ∂L ⎠ = tr ⎝ v v v ∂wXk ∂K ∂w X Xk f f
(3.64)
(3.65)
K K v v v v , and Since kgg=1 wZkg = 1, kff=1 wXk = 1, and wZk 0, wXk 0, wZk g g f f v wXkf are set to 0 if their values are negative. Then we do the normalization for sum v } to 1 after each iteration. of wZ = {wZkg }kg or wX = {wXk kf f Update P The derivative of Eq. (3.59) w.r.t. Pv is computed as following: ∂L ∂L = ⊗ yi v ∂P ∂zi N
i=1
in which ⊗ represents the outer-product.
(3.66)
3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent. . .
89
Algorithm 3.3 Shared linear encoder-based multi-kernel Gaussian process latent variable model (SLEMKGP) Input: V views of training and test data: {Yv }Vv and {yv∗ }Vv ; the dimensionality of the latent space q and linear projection matrix dp ; the value of discriminative prior β. Output: the latent variables X and x∗ . 1: while not converged do 2: Obtain XZ and γ : 3: update the latent variable XZ and its corresponding parameter γ by Eq. (3.60) and Eq. (3.61) 4: Obtain X and θ: 5: update the latent variable X and its corresponding parameter θ by Eq. (3.62) and Eq. (3.63) v , {v = 1, . . . , V , k = 1, . . . , K , k = 1, . . . , K } 6: Obtain wZkg and wXk g g f f g v 7: update the weight wZkg and wXk by Eq. (3.64) and Eq. (3.65) g 8: Obtain P: 9: update the linear mapping matrices {Pv }Vv by Eq. (3.66), and then calculate Z through Z = V v v v=1 Y P 10: Obtain F and μ: 11: update Lagrange multiplier F and parameter μ using Eq. (3.67). 12: end while 13: Test: get the latent variable for the test sample by Eq. (3.68).
Update F and μ For variable F, closed-form solution exits. For μ, we increase it linearly after each iteration. In detail, F and μ are updated from the t-th iteration to the (t + 1)-th iteration by following Eq. (3.67). Ft +1 = Ft + μt (X − XZ ) μt +1 = min(μmax , ρμt )
(3.67)
where ρ is used for constraining the step size μ and we set it to 1.1; μmax is a constant to control the maximum of μ, which is typically set to 1000.
3.4.3 Inference After minimizing the objective function Eq. (3.59), P = {Pv }Vv=1 , KZ , and Z are calculated. Following the procedure in [9] Chapter 2, when we get a test data y∗ = {yv∗}Vv , the associated latent variable is x∗ = Kz∗ Z (KZ )−1 X
(3.68)
in which z∗ = Vv=1 y∗v Pv and Kz∗ Z is the kernel matrix evaluated at all pairs of Z and z∗ . We present the overall optimization processes and the inference procedure at Algorithm 3.3.
90
3 Information Fusion Based on Gaussian Process Latent Variable Model
3.4.4 Experimental Results To prove the superiority of SLEMKGP for classification, we do comparison experiments on three databases. We firstly state experimental settings briefly and then make the experiment between the proposed method and comparisons used for single- and multi-view dataset.
3.4.4.1 Experimental Setting To quantitatively measure the capacity of SLEMKGP, several approaches including JSSL [29], CCA [27, 28], DCCA [30], DCCAE [31], MvDA/MvDA-VC [38], GPLVM [6], DGPLVM [8], DSGPLVM [5], m-RSimGP [39], hm-SimGP [40], hm-RSimGP [40], and SAGP [15] are used for comparison with SLEMKGP. Furthermore, we use the overall and average accuracy values as the quantitative measurements. Because SLEMKGP adopts multi-kernel learning, different types of kernel functions will be used. Here, we will select six kinds of kernels including Linear, RBF, Polynomial, Rational Quadratic (Ratquad), Multilayer Perceptron (MLP), and Matern kernels for comparison analysis to compute the covariance matrix respecting to each transformation function.
3.4.4.2 Result Comparison Wiki Text-Image The overall and average classification accuracy values calculated by SLEMKGP and other comparison strategies are stated in Table 3.8. Note that we make the parameters β and dp as 70 and 10 in the experiment, respectively. It is obvious that the performances obtained by different comparison approaches are not good as that computed by proposed strategy SLEMKGP. Apart from the case q = 8, the results got by SLEMKGP are always the best. Compared with CCAbased approaches, GPLVM based approaches, and RSimGP-based approaches, the presented method obtains a significant enhancement. By comparing with JSSL and DSGPLVM, SLEMKGP also performs better. Compared with SAGP, SLEMKGP still is very competitive. AWA The two types of classification accuracy values obtained by SLEMKGP and other comparison approaches are stated in Table 3.9. Here we make β and dp as 50 and 200, respectively. From this table we can find that with dimensionality q increasing, SLEMKGP always performs best compared with approaches used for single-view and multi-view based datasets. In contrast to GPLVM and DGPLVM, SAGP, DSGPLVM, and our proposed method obtain the significant improvement, illustrating the enhanced capacity at multi-view learning. Compared to CCA, the improvement is about 10% in accuracy measuring from overall and average values. Compared to the deep learning methods, such as DCCA and DCCAE, SLEMKGP has remarkable improvement on the evaluation metrics. The major contribution is
hm-SimGP
m-SimGP
DSGPLVMS
DSGPLVMI
DGPLVM
GPLVM
MvDA-VC
MvDA
CCA
Methods JSSL
Result Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average
Dimensionality q =1 q=2 68.0% 64.9% 28.1% 45.0% 27.0% 42.4% 25.8% 44.7% 25.0% 42.4% 25.8% 44.4% 25.0% 41.8% 14.9% 35.9% 12.8% 33.2% 37.2% 55.0% 34.2% 51.7% 38.4% 53.8% 33.7% 50.0% 40.8% 55.8% 36.3% 52.7% 49.6% 53.5% 46.1% 51.2% 50.9% 53.4% 47.4% 50.0% 56.0% 51.8% 52.0% 49.2% 51.8% 47.6% 46.9% 42.8% 56.1% 51.4% 58.6% 54.3% 62.2% 57.7% 55.7% 52.0% 55.4% 50.6%
q =3
59.0% 55.4% 55.8% 52.1% 56.7% 52.9% 54.3% 50.0% 61.9% 57.5% 63.4% 58.9% 65.4% 60.5% 58.4% 54.2% 57.6% 53.6%
q=4
61.8% 58.4% 62.9% 58.1% 60.5% 55.4% 58.6% 53.2% 62.9% 60.4% 66.1% 60.8% 69.4% 65.3% 58.4% 54.2% 54.3% 50.7%
q =5
63.1% 59.3% 64.5% 61.5% 64.2% 61.2% 61.3% 57.0% 63.5% 58.5% 66.8% 62.6% 69.0% 65.1% 59.3% 55.9% 58.4% 54.4%
q=6
63.8% 61.1% 65.2% 61.6% 65.5% 62.4% 61.9% 57.4% 63.4% 60.0% 66.1% 61.8% 69.3% 65.9% 57.7% 53.9% 57.9% 54.5%
q =7
62.6% 59.3% 64.7% 61.1% 66.7% 64.1% 61.9% 58.8% 63.4% 59.6% 66.8% 62.8% 69.4% 65.1% 59.2% 55.4% 58.3% 54.9%
q=8
63.8% 60.4% 64.8% 62.2% 66.8% 63.7% 60.2% 56.2% 63.2% 59.3% 65.7% 61.4% 68.7% 64.7% 60.2% 56.0% 57.7% 53.6%
q =9
(continued)
63.8% 60.4% 63.9% 60.8% 63.9% 60.9% 59.0% 54.1% 61.8% 57.6% 64.4% 60.6% 67.4% 62.9% 60.3% 55.8% 58.6% 54.8%
q = 10
Table 3.8 Overall/average classifying accuracy values calculated by SLEMKGP and various comparison strategies with dimensionality q varying from 1 to 10 using dataset Wiki Text-Image. Bold values mean the best performances
3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent. . . 91
SLEMKGP
SAGP
DCCAE
DCCA
Methods hm-RSimGP
Result Overall Average Overall Average Overall Average Overall Average Overall Average
Table 3.8 (continued)
Dimensionality q =1 q =2 26.6% 40.1% 24.4% 37.4% 21.5% 49.1% 20.4% 43.9% 24.0% 40.8% 22.5% 38.8% 50.5% 58.9% 48.0% 54.9% 52.4% 60.6% 49.9% 57.4% q =3 47.9% 44.2% 53.1% 49.1% 57.7% 52.9% 65.6% 62.3% 65.8% 62.3%
q=4 55.4% 50.4% 57.9% 54.1% 58.0% 54.9% 67.7% 63.8% 68.7% 64.6%
q=5 58.2% 54.0% 62.9% 58.3% 59.0% 55.0% 70.6% 67.2% 71.6% 68.2%
q=6 60.3% 56.9% 64.5% 60.8% 61.3% 57.9% 70.0% 66.5% 71.0% 67.3%
q=7 57.3% 54.2% 64.9% 61.6% 60.2% 57.0% 69.7% 65.5% 70.4% 66.4%
q=8 61.5% 58.0% 65.8% 63.1% 65.7% 62.4% 71.0% 67.1% 70.9% 66.6%
q=9 61.8% 57.0% 65.2% 62.0% 62.9% 58.3% 69.3% 64.6% 70.3% 65.9%
q = 10 64.1% 58.3% 66.4% 63.7% 67.1% 64.3% 65.1% 61.7% 69.6% 66.0%
92 3 Information Fusion Based on Gaussian Process Latent Variable Model
SLEMKGP
SAGP
DCCAE
DCCA
DSGPLVMS
DSGPLVMI
DGPLVM
GPLVM
MvDA-VC
MvDA
CCA
Methods JSSL
Result Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average Overall Average
Dimensionality q = 40 q = 50 83.8% 81.7% 71.8% 73.6% 69.2% 71.1% 77.2% 77.3% 74.7% 74.8% 77.2% 77.2% 74.7% 74.8% 70.9% 71.9% 69.5% 70.6% 80.1% 82.0% 78.3% 80.1% 81.8% 82.9% 80.0% 81.0% 82.0% 83.4% 80.0% 81.4% 76.6% 78.3% 74.4% 76.3% 76.7% 77.1% 74.8% 75.0% 82.4% 84.3% 80.0% 82.3% 83.4% 85.3% 81.2% 83.4% 74.3% 72.8% 77.6% 74.9% 77.6% 75.0% 71.4% 69.9% 80.2% 78.6% 82.8% 80.9% 83.1% 81.5% 77.5% 75.3% 76.9% 74.8% 83.4% 81.5% 85.3% 83.3%
q = 60
74.8% 72.4% 77.1% 74.5% 77.1% 74.4% 71.1% 70.3% 78.8% 77.3% 81.7% 79.8% 82.4% 80.7% 77.9% 76.0% 76.7% 74.9% 82.7% 80.8% 84.7% 82.9%
q = 70
75.1% 72.8% 77.0% 74.7% 77.0% 74.7% 69.1% 67.2% 77.6% 76.1% 82.8% 81.1% 82.5% 80.8% 76.9% 74.6% 76.8% 74.9% 83.3% 81.4% 84.9% 82.9%
q = 80
75.1% 72.7% 77.1% 74.7% 77.1% 74.7% 72.1% 69.7% 76.5% 75.0% 82.7% 80.8% 82.7% 80.8% 77.9% 75.6% 77.0% 74.9% 83.3% 81.4% 85.5% 83.6%
q = 90
75.5% 73.2% 77.1% 74.7% 77.2% 74.7% 71.4% 70.9% 75.4% 74.0% 82.4% 80.6% 82.7% 80.8% 78.2% 76.0% 76.9% 74.8% 83.1% 81.0% 85.4% 83.5%
q = 100
73.8% 71.6% 76.8% 74.3% 76.8% 74.4% 71.8% 70.7% 75.0% 73.6% 82.3% 80.5% 82.5% 81.4% 78.4% 76.1% 76.9% 74.8% 83.3% 81.5% 84.2% 82.1%
q = 110
74.1% 72.0% 76.6% 74.3% 76.7% 74.3% 72.1% 71.6% 74.5% 73.2% 82.4% 80.7% 83.2% 81.5% 78.1% 75.8% 76.9% 74.8% 82.6% 80.9% 84.6% 82.8%
q = 120
74.1% 71.8% 76.5% 74.0% 76.5% 74.0% 71.6% 70.8% 73.8% 72.3% 82.3% 80.8% 82.7% 81.2% 78.3% 76.1% 76.8% 74.7% 82.9% 81.4% 84.2% 82.1%
q = 130
Table 3.9 Overall/average classifying accuracy values calculated by SLEMKGP and various comparison approaches with dimensionality q varying from 40 to 130 using AWA dataset
3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent. . . 93
94
3 Information Fusion Based on Gaussian Process Latent Variable Model
Fig. 3.9 The confusion matrix calculated by SLEMKGP with q set as 50 using AWA dataset. And the elements on the diagonal representing classifying results
that our GPLVM based strategy is nonparametric, having ability to process the dataset with only a few of instances in the training data. Due to fully connect deep learning models, DCCA and DCCAE would cause the over-fitting with a large amount of parameters. Furthermore, the results gained by approaches DSGPLVM, JSSL, and SAGP are also lower than that got by SLEMKGP. We present the confusion matrix shown in Fig. 3.9. Here the dimensionality q equals to 50. The classification results of different categories obtained by our strategy are almost larger than 70% in most cases. NUS-WIDE-LITE The classification accuracy values gained by various approaches are shown in Table 3.10 using NUS-WIDE-LITE dataset. Here we make β and dp as 50 and 30, respectively. As the CCA-based methods such as CCA, DCCA, and DCCAE are not usable for dataset with three or more modalities, the second and third features are combined as a single one in the experiment. It is obvious that SLEMKGP performs much better than CCA-based approaches and JSSL. Compared with DGPLVM, the proposed method also gains enhancement with 5–10%. Compared to DSGPLVMS and SAGP, the performance from SLEMKGP is competitive. Even though DSGPLVMS and SAGP are slightly higher than SLEMKGP on the accuracy value with the dimensionality ranging from 5 to20, they are inferior to our developed strategy in remaining cases, relatively illustrating the effectiveness of our approach.
3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent. . .
95
Table 3.10 Overall/average classifying accuracy values calculated by SLEMKGP and various comparison approaches with dimensionality q varying from 1 to 30 using dataset NUS-WIDELITE Methods JSSL CCA MvDA MvDA-VC GPLVM DGPLVM DSGPLVMI DSGPLVMS DCCA DCCAE SAGP SLEMKGP
Dimensionality q =1 q =5 47.2% 48.9% 15.1% 30.6% 14.8% 29.6% 24.0% 41.7% 18.8% 38.1% 25.2% 41.7% 19.3% 38.3% 16.3% 33.1% 14.9% 32.8% 19.6% 43.9% 21.5% 49.9% 19.1% 24.5% 17.6% 30.7% 25.4% 46.1% 25.0% 52.5% 14.6% 27.2% 14.4% 26.2% 17.1% 29.8% 17.1% 28.4% 23.6% 52.6% 25.2% 51.9% 27.5% 53.8% 26.1% 52.3%
q = 10
q = 15
q = 20
q = 25
q = 30
35.6% 34.2% 44.0% 42.1% 44.1% 42.0% 36.7% 34.8% 51.2% 52.8% 32.9% 34.9% 54.8% 56.6% 31.5% 31.5% 35.4% 33.7% 58.7% 56.6% 57.4% 56.0%
36.7% 34.7% 45.2% 42.7% 45.4% 43.4% 40.7% 39.8% 50.6% 53.4% 33.8% 35.9% 54.4% 56.1% 33.4% 32.5% 36.4% 34.4% 56.2% 55.0% 55.8% 54.6%
37.4% 35.9% 44.7% 42.5% 44.8% 42.4% 42.1% 40.5% 51.9% 54.0% 33.5% 37.5% 53.8% 55.6% 34.0% 32.2% 36.2% 35.0% 56.9% 55.7% 54.9% 53.1%
37.7% 35.3% 44.4% 41.6% 44.9% 42.6% 40.0% 37.9% 46.2% 45.5% 32.6% 34.3% 50.5% 49.6% 34.1% 32.3% 37.3% 36.1% 53.1% 50.4% 56.2% 54.4%
37.9% 36.6% 44.3% 41.1% 44.8% 42.1% 40.3% 39.6% 45.1% 44.8% 31.9% 36.0% 48.9% 48.6% 35.0% 33.6% 36.5% 36.0% 51.8% 49.8% 54.2% 52.4%
The average accuracy values from nine classes gained by different methods are illustrated in Fig. 3.10. It is obvious that SLEMKGP gains a satisfied results on all classes by comparing to other methods. Compared with CCA-based approaches, the proposed strategy obtains a significant improvement. In addition, our developed method also gets competitive results compared with that from JSSL and DSGPLVM. Specifically, SLEMKGP has better or similar performance than DSGPLVMS on the classes such as birds, flowers, rocks, sun, tower, and vehicle. Because of the linear transformation, we use the parameter dp in the task. To state the effect of dp , two comparison experiments are given using the datasets AWA and NUS-WIDE-LITE, as stated in Tables 3.11 and 3.12. In Table 3.11, we make q and β as (50, 50) and (10, 50), respectively. At Table 3.12, β equals to 50 for both datasets. From Table 3.11 we notice that there are only few influences on the performance with changing values of dp . And another experiment also was conducted with setting q = dp . From Table 3.12, we could see that the results are
Fig. 3.10 The average accuracy values from selected 9 classes calculated by JSSL, CCA, DCCA, DCCAE, GPLVM, DSGPLVMI, DSGPLVMS, and SLEMKGP
96 3 Information Fusion Based on Gaussian Process Latent Variable Model
3.4 Shared Linear Encoder-Based Multi-Kernel Gaussian Process Latent. . .
97
Table 3.11 Overall/average classifying accuracy values calculated by SLEMKGP with varying dimensionality dp of {Pv }Vv=1 by using the datasets AWA and NUS-WIDE-LITE AWA Overall Average NUS Overall Average
Dimensionality dp = 100 84.9% 83.0% dp = 10 57.1% 55.2%
dp = 150 84.8% 82.8% dp = 15 56.7% 55.2%
dp = 200 85.3% 83.4% dp = 20 57.3% 56.5%
dp = 250 85.2% 83.2% dp = 25 57.2% 55.7%
dp = 300 85.3% 83.4% dp = 30 57.4% 56.0%
Table 3.12 Overall/average classifying accuracy values calculated by SLEMKGP when dp = q using the datasets AWA and NUS-WIDE-LITE AWA Overall Average NUS Overall Average
Dimensionality dp = 50 dp = 60 84.9% 85.3% 82.8% 83.3% dp = 5 dp = 10 55.0% 57.1% 53.5% 55.2%
dp = 70 84.9% 82.8% dp = 15 56.1% 55.3%
dp = 80 84.3% 82.3% dp = 20 55.8% 54.8%
dp = 90 85.5% 83.4% dp = 25 56.3% 55.1%
dp = 100 85.5% 83.4% dp = 30 54.2% 52.4%
similar to that from Tables 3.9 and 3.10 and relatively indicate the robustness of our approach on dp .
3.4.5 Conclusion In this chapter, we develop a superior strategy using the GPLVM for the multiview classifying. Be different from many GPLVM methods with only assuming that observed samples are transformed from the latent component, the developed strategy uses the transformations from and to the latent variable both. Because it is difficulty to calculate the covariance matrix construction in the projection from observations to the latent space, we adopt a linear mapping to project multiple views component into a consistent subspace. The GP prior is then used to map it to the latent variable. Considering the data distribution is complex in applications, we utilize multi-kernel learning to adaptively calculate the covariance matrices. In order to fully mine the supervised information for classifying, a Laplacian matrix based prior on the latent variable also is introduced to efficiently distinguish the different classes and cluster the same class. The comparison results demonstrate that proposed strategy gets outstanding performances on three datasets.
98
3 Information Fusion Based on Gaussian Process Latent Variable Model
References 1. Akaho S. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071. 2006. 2. Fukumizu K, Bach FR, Gretton A. Statistical consistency of kernel canonical correlation analysis. J Mach Learn Res. 2007;8(Feb):361–383. 3. Lai PL, Fyfe C. Kernel and nonlinear canonical correlation analysis. Int J Neural Syst. 2000;10(05):365–377. 4. Lopez-Paz D, Sra S, Smola A, Ghahramani Z, Schölkopf B. Randomized nonlinear component analysis. In: International conference on machine learning. 2014. p. 1359–1367. 5. Eleftheriadis S, Rudovic O, Pantic M. Discriminative shared gaussian processes for multiview and view-invariant facial expression recognition. IEEE Trans Image Process. 2015;24(1):189– 204. 6. Lawrence ND. Gaussian process latent variable models for visualisation of high dimensional data. Adv Neural Inf Process Syst. 2004;16(3):329–336. 7. Ek CH, Lawrence PHTND. Shared Gaussian process latent variable models. PhD Thesis. 2009. 8. Urtasun R, Darrell T. Discriminative Gaussian process latent variable model for classification. In: Proceedings of the 24th international conference on machine learning. New York: ACM; 2007. p. 927–934. 9. Rasmussen CE. Gaussian processes for machine learning. Cambridge: MIT Press; 2006. 10. Song G, Wang S, Huang Q, Tian Q. Similarity gaussian process latent variable model for multi-modal data analysis. In: Proceedings of the IEEE international conference on computer vision. 2015. p. 4050–4058. 11. Lawrence ND, Quiñonero-Candela J. Local distance preservation in the GP-LVM through back constraints. In: Proceedings of the 23rd international conference on machine learning. New York: ACM; 2006. p. 513–520. 12. Jiang X, Gao J, Hong X, Cai Z. Gaussian processes autoencoder for dimensionality reduction. In: Pacific-Asia conference on knowledge discovery and data mining. Berlin: Springer; 2014. p. 62–73. 13. Ng A. Sparse autoencoder. CS294A Lecte Notes. 2011;72:1–19. 14. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY. Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11). 2011. p. 689–696. 15. Li J, Zhang B, Zhang D. Shared autoencoder gaussian process latent variable model for visual classification. IEEE Trans Neural Netw Learn Syst. 2017;29:4272–4286. 16. Li J, Zhang B, Lu G, Ren H, Zhang D. Visual classification with multikernel shared gaussian process latent variable model. IEEE Trans Cybern. 2018;49(8):2886–2899. 17. Li J, Lu G, Zhang B, You J, Zhang D. Shared linear encoder-based multikernel gaussian process latent variable model for visual classification. IEEE Transactions Cybern. 2019;51:534–547. 18. Chung FRK. Spectral graph theory, vol. 92. Providence: American Mathematical Society; 1997. 19. Zhong G, Li W-J, Yeung D-Y, Hou X, Liu C-L et al. Gaussian process latent random field. In: Association for the advancement of artificial intelligence. 2010. 20. Zhang L, Zhang Q, Zhang L, Tao D, Huang X, Du B. Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding. Pattern Recognit. 2015;48(10):3102–3112. 21. Salzmann M, Urtasun R. Implicitly constrained gaussian process regression for monocular non-rigid pose estimation. In: Advances in neural information processing systems. 2010. p. 2065–2073. 22. Rasiwasia N, Pereira JC, Coviello E, Doyle G, Lanckriet GRG , Levy R, Vasconcelos N. A new approach to cross-modal multimedia retrieval. In: Proceedings of the 18th ACM international conference on multimedia. 2010. p. 251–260. 23. 
Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N. Learning systems of concepts with an infinite relational model. In: AAAI, vol. 3. 2006. p. 5.
References
99
24. Lampert CH, Nickisch H, Harmeling S. Learning to detect unseen object classes by betweenclass attribute transfer. In: CVPR 2009. IEEE conference on computer vision and pattern recognition, 2009. Piscataway: IEEE; 2009. p. 951–958.. 25. Lampert CH, Nickisch H, Harmeling S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans Pattern Analy Mach Intell. 2014;36(3):453–465. 26. Chua T-S, Tang J, Hong R, Li H, Luo Z, Zheng Y. NUS-WIDE: a real-world web image database from national university of Singapore. In: Proceedings of the ACM international conference on image and video retrieval. New York: ACM; 2009. p. 48. 27. Thompson B. Canonical correlation analysis. In: Encyclopedia of statistics in behavioral science. Hoboken: Wiley; 2005. 28. Hardoon DR, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 2004;16(12):2639–2664. 29. Li J, Zhang D, Li Y, Wu J, Zhang B. Joint similar and specific learning for diabetes mellitus and impaired glucose regulation detection. Inf Sci. 2016;384. 30. Andrew G, Arora R, Bilmes J, Livescu K. Deep canonical correlation analysis. In: International conference on machine learning. 2013. p. 1247–1255. 31. Wang W, Arora R, Livescu K, Bilmes J. On deep multi-view representation learning. In: International conference on machine learning. 2015. p. 1083–1092. 32. Li J, Zhang B, Ren H, Zhang D. Visual classification with multi-kernel shared Gaussian process latent variable model. IEEE Trans Cybern. 2018;49:1–14. 33. Lawrence N. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J Mach Learn Res. 2005;6(Nov):1783–1816. 34. Gao X, Wang X, Tao D, Li X. Supervised Gaussian process latent variable model for dimensionality reduction. IEEE Trans Syst Man Cybern Part B. 2011;41(2):425–434. 35. Li S, Fu Y. Learning robust and discriminative subspace with low-rank constraints. IEEE Trans Neural Netw Learn Syst. 2015;27(11):2160-2173. 36. Spielman D. Spectral graph theory. Lecture notes, yale university. 2009. p. 740–0776. 37. Li J, Lu G, Zhang B, You J, Zhang D. Shared linear encoder-based multikernel gaussian process latent variable model for visual classification. IEEE Trans Cybern. 2019;51:534–547. 38. Kan M, Shan S, Zhang H, Lao S, Chen X. Multi-view discriminant analysis. IEEE Trans Pattern Analy Mach Intelll. 2016;38(1):188–194. 39. Song G, Wang S, Huang Q, Tian Q. Multimodal similarity gaussian process latent variable model. IEEE Trans Image Process. 2017;26(9):4168–4181. 40. Song G, Wang S, Huang Q, Tian Q. Multimodal gaussian process latent variable models with harmonization. In: Proceedings of the IEEE international conference on computer vision. 2017. p. 5029–5037.
Chapter 4
Information Fusion Based on Multi-View and Multi-Feature Learning
In an amount of practical fields, the same instance can be represented with multiple modalities or views, and each modality can be further represented with various features. This kind of data is often named as multi-view and multi-feature data. To address this data, two probabilistic and generative fusion methods are studied in this chapter. After reading this chapter, people can have preliminary knowledge on Bayesian theory-based fusion algorithms.
4.1 Motivation Although the aforementioned methods including JSSL, RCR, JDCR, SAGP, MKSGP, and SLEMKGP have been studied to efficiently utilize the relationship among distinctive views, the research for multi-view and multi-feature data is still understudied. In an amount of practical fields, an instance can be represented by distinctive views, and each view can further be described by multiple kinds of features. For example, in JSSL, we argue that a patient suffering from Diabetes Mellitus (DM) can be diagnosed according to the tongue, face, and sublingual vessel. Furthermore, these multiple modalities can further be represented by different types of features, such as texture and color, as displayed in Fig. 4.1. To process this kind of data, a straightforward method is to simply combine different features from a view to be a single one. The conventional multi-view learning approaches can be applied. Another way is to independently regard each feature as a modality, so that these features from all views are jointly analyzed in a multiview learning way. However, for the first method, it fails to efficiently learn the correlation among different features which are from the same view. For the second method, it loses the hierarchical information between features and views. In this chapter, we present two different multi-view and multi-feature strategies [1, 2], which are capable of jointly fusing the data mentioned above. These two models follow the assumption that there is a projection for each kind of feature © Springer Nature Singapore Pte Ltd. & Higher Education Press, China 2022 J. Li et al., Information Fusion, https://doi.org/10.1007/978-981-16-8976-5_4
101
102
4 Information Fusion Based on Multi-View and Multi-Feature Learning
Multi-view
Multi-feature Face
Texture Color
Certain Relationship
Tongue
Geometric Texture Color
Certain Relationship
Sublingual
Geometric Color
Certain Relationship
Fig. 4.1 The example containing multiple views and multiple features
from each view. The inputting (observed) feature is a projected vector from a latent variable by using the projection function. Making a comparison between these two, we adopt different strategies to generate the latent variables, followed by different optimization algorithms.
4.2 Generative Multi-View and Multi-Feature Learning Here we name this first method as multi-view and multi-feature learning (MVMFL) [1]. As shown in Fig. 4.2, which is the pipeline of MVMFL, there is a projection matrix for each kind of feature in each view. By using this projection matrix, a latent variable is linearly projected to its related observed feature. Here we assume that these multiple features as well as multiple views (latent variables) follow Gaussian distributions. However, differently, we enforce the latent variables follow the labelrelated distributions. In detail, the latent variables from the same class follow the same distribution, and vice versa. Also, in the inference of MVMFL, the model is further transformed into solving the inference problem of determining the classconditional densities p(x | Cp ) for each category Cp individually, where x is the observation, p = {1, · · · , P }, and P is the number of categories. Thanks to this transformation, we can optimize MVMFL more simply. The main contributions of MVMFL are that: 1. A novel model is present in a generative way, in which the multi-view and multifeature data is hierarchically fused for classification. Different from JSSL, RCR, JDCR, SAGP, MKSGP, and SLEMKGP which only consider the multi-view data, MVMFL further considers the case that each view can be represented by multiple kinds of features, efficiently extracting the correlation among them. 2. The presented hierarchical model is proved to be transformed into a classconditional model p(x | Cp ). Thanks to this transformation, the supervised information (label) can be directly embedded into the observed data, simplifying
4.2 Generative Multi-View and Multi-Feature Learning
103
Sample
View 1
View j
Feature 1
j
···
··· Feature K1
K
Feature 1
KJ
Follow the distributions with label information
···
··· Feature
K
Latent variable
Latent variable 1
Follow the distributions with label information
Follow the distributions with label information
···
Latent variable
Latent variable 1
Latent variable K1
Latent variable 1
View J
j
Feature 1
Feature
KJ
Follow the distributions without label information
Fig. 4.2 The framework of the proposed method, where the green round means the observed features, the blue round means the latent variable, and the orange round means the observed label; J means the number of views and Kj means Kj types of features are obtained in j -th view
the optimization of the proposed method. Thus, the prediction inference of p(Cp | x) can be easily estimated. Compared with some existing classifiers, e.g., KNN and SVM, which only provide the label belonging to a certain category, the outputs of MVMFL are the probabilities belonging to various classes. It is beneficial for us to predefine a fuzzy value to reject an observation whose output value is not larger than others’ enough.
4.2.1 Problem Formulation Before analyzing MVMFL, some notations used in this chapter are denoted in Table 4.1. Assume that the i-th observation in the kj -th type of features of the D ×1 j -th view is denoted as xij kj ∈ R jkj , where Dj kj is its dimensionality, i ∈ {1, · · · , M}, j ∈ {1, · · · , J }, and kj ∈ {1, · · · , Kj }. In other words, we assume that there are M number of observations in the training set, and each observation is composed of J number of views. Meanwhile, the j -th view also can be represented with Kj number of features. j kj is the covariance matrix associated with the kj -th kind of features in the j -th view; the hij kj ∈ RDj ×1 denotes the i-th latent variable associated with the kj -th kind of the feature in the j -th view, where Dj is the
104
4 Information Fusion Based on Multi-View and Multi-Feature Learning
Table 4.1 Some necessary notations in this chapter M; J Kj ; P zpi ∈ {0, 1} πp xij kj Dj kj ; Dj hij kj N (μjp , jp ) j kj Aj k j
The number of training samples; the number of views The number of features corresponding to the j -th view; the number of categories The parameter showing whether the i-th observation belongs to the p-th category The probability that an observation belongs to the p-th class The i-th observation in the kj -th feature of the j -th view The dimensionality of x·j kj ; the dimensionality of h·j ·j The latent variable corresponding to xij kj The mean and covariance of a Gaussian distribution corresponding to p-th class for hij kj The covariance for the kj -th feature of the j -th view The projection matrix for the kj -th feature of the j -th view
dimension of hij kj . Note that, in this chapter, we assume that different features in a certain view are the projected vectors from a common space, and thus the K D ×D dimensionality of their related variables {hij kj }kjj is the same. Aj kj ∈ R jkj j is the projection matrix for the kj -th kind of features in the j -th view, which projects the latent variable to the observed space. zpi = 1 denotes the i-th observation belongs to the p-th category, otherwise zpi = 0, where p ∈ {1, · · · , P } and P is the number of categories. For each class, a prior πp can be computed to indicate the probability that an instance belongs to the p-th class. μjp and jp are the mean and covariance of a Gaussian distribution associated with the p-th category for hij kj . Here we make an assumption that there is the linear representation which can project the latent variable to the observed data. Equation (4.1) gives the formulation of the representation between xij kj and hij kj . xij kj = Aj kj hij kj
(4.1)
where Aj kj is the projection matrix for the kj -th kind of feature in the j -th view. Note that, in our model, we make assumption that the observed data xij kj follows the Gaussian distribution whose mean and covariance are, respectively, Aj kj hij kj and j kj . Besides, to fuse various features in each view and exploit the label in the presented strategy, the latent variable hij kj is encouraged to follow a viewand category-specific Gaussian distribution N (μjp , jp ), where μjp and jp are the mean and covariance matrix for the p-th (p ∈ {1, · · · , P }) class in the j -th view, respectively. In general, for the observed multi-view and multi-feature data, it is unable to make various features in a certain view meet the same distribution as their dimensionality and numerical value scales are distinctive. Differently, in our method, hij kj is first computed which tackles the problems mentioned above, being suitable for different kinds of data. To exploit the supervised information in the training data, it is reasonable for the latent variable hij kj in each view to estimate the class-specific distribution by imposing the ground-truth zpi on it, where zi = [z1i , · · · , zpi , · · · , zP i ] is the label vector following the multinomial
4.2 Generative Multi-View and Multi-Feature Learning
105
Fig. 4.3 The probabilistic framework and graphic model of the presented approach. (a) The probabilistic framework of the presented approach. (b) The graphical model of the presented approach. xij kj and zpi are the observed features and labels, respectively. hij kj is the latent variable associated with xij kj . For the label zpi , it follows the multinomial distribution, where zpi ∈ {0, 1} and Pp=1 zpi = 1. According to the supervised information, the latent variable hij kj is generated by following a Gaussian distribution N (μjp , jp ). Then we also assume that xij kj is generated from its associated latent variable, and xij kj also follows a Gaussian distribution N (Aj kj hij kj , j kj ). (c) The graphical model of the transformed class-conditional structure of the proposed method. By integrating out the latent variable hij kj , all parameters {zpi , μjp , jp , Aj kj , j kj } are directly imposed on xij kj , and xij kj follows a novel Gaussian distribution N (Aj kj ( Pp=1 zpi μjp ), j kj + Aj kj ( Pp=1 zpi jp )ATjkj )
distribution, zpi ∈ {0, 1} and Pp=1 zpi = 1. Thus, the proposed method exploits both the correlation among various features/views and discriminative information among different classes. Figure 4.3a, b display the probabilistic framework and graphic model, respectively. Therefore, the distribution for each kind of variable
106
4 Information Fusion Based on Multi-View and Multi-Feature Learning
is formulated as zi ∼ p(zi | θZ ) =
P '
z
πppi ;
p=1
⎛
P
zpi = 1;
p=1
hij kj ∼ p(hij kj | zi , θH ) = N ⎝hij kj |
P
P
πp = 1
p=1
zpi μjp ,
p=1
P
⎞ zpi jp ⎠
p=1
xij kj ∼ p(xij kj | Aj kj , hij kj , θX ) = N (xij kj | Aj kj hij kj , j kj ) (4.2) where πp is the prior value which means the probability of xij kj from the p-th $ %P $ $ %J,P %J,K is πp , θ Z = πp p , θ H = μjp , jp j,p , and θ X = Aj kj , hij kj , j kj j,k j j
{zi }M i ,
M,J,K {hij kj }i,j,kj j ,
M,J,K {xij kj }i,j,kj j ,
are the parameters and variables related to and respectively. Additionally, the current model in Fig. 4.3b can be transformed into a more brief model as displayed in Fig. 4.3c by making an integration operation with respect to hij kj . In order to be simple, we directly use the lemma given by Bishop [3] in Chap. 2. Lemma Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form p(x ) = N x | μ , −1 p(y | x ) = N y | A x + b , L−1
(4.3)
the marginal distribution of y is given by p(y ) = N y | A μ + b , L−1 + A −1 AT
(4.4)
In this chapter, p(x ) and p(y | x ) can be regarded as p(hij kj | zi ) and p(xij kj | Aj kj , hij kj ), respectively. In detail, μ is replaced with Pp=1 zpi μjp , −1 is replaced with Pp=1 zpi jp , A μ + b is replaced with Aj kj hij kj , and L−1 is replaced with j kj , respectively. According to Eq. (4.4), the likelihood density of the observed data xij kj on the label information would be computed as follows: p(xij kj | zi , θ Z , θ H , θ X ) ⎛ ⎛ ⎞ ⎞ P P = N ⎝xij kj | Aj kj ⎝ zpi μjp ⎠ , j kj + Aj kj ( zpi jp )ATjkj ⎠ p=1
p=1
(4.5)
4.2 Generative Multi-View and Multi-Feature Learning
107
M,J,K
Thus, given the dataset X = {xij kj }i,j,kj j in which each observation is assumed to be drawn independently from the multivariate Gaussian distribution, we estimate the parameters displayed in Eq. (4.5) by minimizing the negative log-likelihood function. M,J,Kj
min L = − log
'
p(xij kj | zi , θ Z , θ H , θ X )
i,j,kj
=−
#
#
M,J,Kj
log N
Aj kj
p
i,j,kj M,J,Kj
=
i,j,kj
*
Dj kj 2
zpi μjp , j kj +Aj kj
#
zpi jp
ATjkj
p
# 1 T log(2π) + log j kj + Aj kj zpi jp Aj kj 2 p
# −1 # T # 1 + j kj +Aj kj xij kj −Aj kj ( zpi μjp ) zpi jp ATjkj 2 p p + ×(xij kj − Aj kj ( zpi μjp )) p
(4.6) Note that, despite the fact that we transform the hierarchical model in Fig. 4.3b into the class-conditional one, it does not change the capacity of the original model. By integrating the latent variables, μj kj , j kj , μjp , jp , and Aj kj are only transformed into another type which can be easily optimized, but their relationship with the observations and ground-truth labels is kept. Thus, the class-conditional model in Fig. 4.3c is also able to model the multi-view and multi-feature data in a hierarchical way. Although the Gaussian distribution is widely used as a density model, it suffers from a significant limitation. In our method, the covariance matrices j kj and jp are set to be general symmetric matrices, including Dj kj (Dj kj + 1)/2 and Djp (Djp + 1)/2 independent parameters, respectively. If the number of training samples is smaller than the dimensionality of inputs, the total number of parameters thus increases quadratically with the dimensionality, increasing the difficulty of accurate estimation of these parameters. To address this issue, we restrict the form of the covariance matrix. In detail, j kj and jp are set to be diagonal matrices 2 I, respectively, where σ σj2kj I and σjp j kj and σjp are 1D values, and I denotes the identical matrix. We name this modification as MVMFL(diag).
108
4 Information Fusion Based on Multi-View and Multi-Feature Learning
4.2.2 Optimization for MVMFL To estimate various parameters, we calculate the derivatives of the objective function with respect to each parameter by following Eq. (4.7). ∂L = ∂θ X
1
∂L ∂L , ∂Aj kj ∂ j kj
2J,Kj j,kj
∂L , = ∂θ H
1
2J,P
∂L ∂L , ∂μjp ∂ jp
(4.7) j,p
The derivative of the objective function with respective to the covariance matrix j kj is ⎧ # # −1 M,P ⎨1 ∂L T j kj + Aj kj = zpi jp Aj kj ⎩2 ∂ j kj p i,p
# −1 # # # 1 T − zpi jp Aj kj zpi μjp xij kj −Aj kj j kj +Aj kj 2 p p #
#
× xij kj − Aj kj
T zpi μjp
p
# × j kj + Aj kj
#
zpi jp ATjkj
−1 ⎫ ⎬ ⎭
p
(4.8) The derivative of the objective function with respective to the covariance matrix jp is ⎧ # # −1 M,Kj ⎨ ∂L 1 T T zpi Aj kj j kj + Aj kj = zpi jp Aj kj Aj kj ⎩2 ∂ jp p i,kj
1 − zpi ATjkj 2
# j kj + Aj kj
#
zpi jp
−1 ATjkj
p
#
× xij kj − Aj kj
#
# zpi μjp
xij kj − Aj kj
p
# ×
j kj + Aj kj
# p
zpi jp ATjkj
#
⎫ ⎬
−1 Aj kj
T zpi μjp
p
⎭ (4.9)
4.2 Generative Multi-View and Multi-Feature Learning
109
The derivative of the objective function with respective to the mean vector μjp is #
M,Kj ∂L =− zpi ATjkj ∂μjp
#
j kj + A j kj
zpi jp
−1 T A j kj
p
i,kj
#
#
× xij kj − Aj kj
(4.10)
zpi μjp
p
Note that there is a closed-form solution for the mean vector μjp by setting its derivative to zero. Consequently, the solution of μjp is shown as follows:
μjp =
⎧ ⎨M,K j ⎩ ×
#
# j kj + Aj kj
zpi ATjkj
⎩
⎫−1 ⎬
−1
zpi jp ATjkj
Aj kj
p
i,kj
⎧ ⎨M,K j
#
# j kj + Aj kj
zpi ATjkj
⎫ ⎬
−1
zpi jp ATjkj
xij kj
p
i,kj
⎭
⎭ (4.11)
Since it is difficult to directly calculate the derivative of the term (xij kj − Aj kj ( p zpi μjp ))T ( j kj + Aj kj ( p zpi jp )ATjkj )−1 (xij kj − Aj kj ( p zpi μjp )) with respect to the mapping matrix Aj kj , we independently compute the derivative of each bracket. For instance, we only calculate the derivative of the term in the first bracket (xij kj − Aj kj ( p zpi μjp ))T , while keeping ( j kj +Aj kj ( p zpi jp )ATjkj )−1 (xij kj −Aj kj ( p zpi μjp )) unchanged. It is also similar to the terms in the second and third brackets. Therefore, the representation of ∂A∂Ljk is shown in Eq. (4.12). j
∂L = ∂Aj kj
⎧# # −1 # # ⎨ T j kj + Aj kj Aj kj zpi jp Aj kj zpi jp ⎩
M,P
p
i,p
# − j kj + Aj kj
#
p
zpi jp
−1 # xij kj − Aj kj
ATjkj
p
# ×
#
xij kj − Aj kj
#
zpi μjp
p
T zpi μjp
p
# × j kj + Aj kj
#
p
zpi jp
−1 # ATjkj
# Aj kj
p
zpi jp
110
4 Information Fusion Based on Multi-View and Multi-Feature Learning
Algorithm 4.1 Multi-view and multi-feature learning for classification M,J,K
Input: Observed data: {xij kj }i,j,kj j ; label: {zpi }M,P i,p ; J,K
J,K
j j Initialization: θ Z = {πp }Pp , θ H = {μjp , jp }J,P j,p ; θ X = {Aj kj , j kj }j,kj , where {Aj kj }j,kj are initialized through Eq. (4.13). 1: while not converged do 2: for p = 1, . . . , P do 3: for j = 1, . . . , J do 4: for kj = 1, . . . , Kj do 5: Obtain the derivatives of the objective function with respective to j kj and Aj kj according to Eqs. (4.8) and (4.12). 6: end for 7: Obtain the derivative of the objective function with respective to jp according to Eq. (4.9) 8: end for 9: end for 10: The gradient decent technique is applied to updated j kj , Aj kj and jp alternatively, followed by updating μjp according to Eq. (4.11) 11: end while J,Kj Output: θ H = {μjp , jp }J,P j,p ; θ X = {Aj kj , j kj }j,kj
#
#
− j kj + Aj kj
−1
zpi jp ATjkj
p
#
#
× xij kj − Aj kj
p
zpi μjp
# p
zpi μjp
T ⎫ ⎬
(4.12)
⎭
Additionally, we find that a reasonable initialization of Aj kj would contribute to obtaining a quick convergence. Thus, we design the following formula to initialize the mapping matrix: min
M |xij k − Aj k hij k |2 + λ |hij k |2 j j j j F F
(4.13)
i
where||·| |F is the Frobenius norm and λ is the nonnegative penalty. The initialized Aj kj can be estimated by updating Aj kj and hij kj alternatively in Eq. (4.13). Until now, the derivative representations of θ H and θ X are computed. As there are no closed-form solutions for Aj kj , j kj , and jp , the gradient decent technique is applied to updated them alternatively. The details of estimating different parameters and variables are shown in Algorithm 4.1.
4.2 Generative Multi-View and Multi-Feature Learning
111
4.2.3 Inference for MVMFL Given a testing sample xt = {xtj kj }, the posterior probability belonging to each class needs to be computed. Based on the Bayesian theory, we can get the prediction formula as displayed in Eq. (4.14). p(zp | xt , θ Z , θ H , θ X ) =
p(xt | zp , θ Z , θ H , θ X )p(zp ) ∝ p(xt | zp , θ Z , θ H , θ X )p(zp ) p(xt )
(4.14)
where p(zp ) is the prior of each class, and it can be computed through M
i=1 zpi M p=1 i=1 zpi
p(zp ) = P
(4.15)
4.2.4 Experimental Results Here, we first introduce the used datasets as well as the setting in experiments. We can compare MVMFL with other comparison methods.
4.2.4.1 Datasets and Experimental Setting We obtain the synthetic data by following the designation of MVMFL. Predefining Dj kj and Dj kj which are the dimensions of xij kj and hij kj , respectively, the Aj kj , j kj , jp , and μjp are generated randomly. Specifically, this step is implemented by exploiting the “randn” function in MATLAB. To enforce generated data to be more compact for different classes, the parameters μjp are multiplied with 0.6. Based on the assumption of MVMFL, the latent variables are obtained by following N (hij kj | Pp=1 zpi μjp , Pp=1 zpi jp ). Then, the observed data are generated by following N (xij kj | Aj kj hij kj , j kj ). Note that, we use the “mvnrnd” function in MATLAB to obtain the data whose distribution is Gaussian. Additionally, 5 categories are generated, and training number and testing number are, respectively, 250 and 2000. Here, Dj kj and Dj kj are 20 and 10, respectively. When J = 3 and Kj = 4, the visualization of the first three dimensions of the generated data in the first type of feature in the first view is displayed in Fig. 4.4. Obviously, the distributions of various classes are quite mixed. According to JSSL [4], the biomedical dataset is selected to evaluate the performance of our method. This dataset is composed of 198 Diabetes Mellitus (DM) and 192 Healthy samples collected by the Hong Kong Polytechnic University at the Guangdong Provincial TCM Hospital, Guangdong, China, from the early
112
4 Information Fusion Based on Multi-View and Multi-Feature Learning
Fig. 4.4 The visualization of the generated data when J = 3 and Kj = 4
2014 to the late 2015. As described in [4], there are three views containing the tongue, facial, and sublingual images for each observation. Particularly, the tongue image can be represented with 12 dimensional color feature, 9 dimensional texture feature, and 13 dimensional geometry feature, the facial image can be represented with 24 (4 block×6 dimension) dimensional feature and 5 dimensional texture feature, and the sublingual image can be represented with 6 dimensional color feature and 6 geometrical feature. Being similar to JSSL [4], 30, 40, 50, 60, 70, 80, and 90 samples for each class are selected for training with five independent times, and the remaining observations are used for testing. For a quantitative evaluation, we apply both single- and multi-view based methods to have the comparison. Generally speaking, the learning of Aj kj is similar to some existing subspace learning methods, and thus we compare our method with Convex Subspace Representation Learning (CSRL) [5] and Local Fisher Discriminant Analysis (LFDA) [6]. Due to the effectiveness of the representation based methods (sparse representation, collaborative representation, or dictionary learning), we also make a comparison with Dictionary Pair Learning (DPL) [7], Multi-task Joint Sparse Representation (MTJSRC) [8], RCR [9], JSSL [4], JDCR [10], Multimodal Task-driven Dictionary Learning (MDL) [11]. Since DPL and LFDA are the single-view based methods, we concatenate features in all views as a single vector. For other multi-view based methods, we concatenate diverse features in each view as their inputs. Additionally, since there are two scenarios for learning
4.2 Generative Multi-View and Multi-Feature Learning
113
the dictionary in MDL: supervised and unsupervised learning, we refer to them as SMDL and UMDL, respectively. To have a brief notation, we denote our multi-view and multi-feature learning method as MVMFL. Note that the features extracted by CSRL and LFDA are classified by SVM.
4.2.4.2 Experimental Results on Synthetic Data In this experiment, we make comparison between MVMFL and other approaches in the case of different number of views and features (Kj = 3 and J = 2, 3, 4 and J = 3 and Kj = 1, 2, 4). The experimental results computed by various methods are listed in Table 4.2. Obviously, MVMFL remarkably outperforms other methods. In contrast to DPL, MTJSRC, RCR, JDCR, and JSSL, our presented strategy gets an obvious enhancement. In contrast to CSRL, MVMFL is also superior. In detail, the performance computed by MVMFL is higher than 80% at all times, while CSRL only gets the best result (80.00%) when Kj = 3 and J = 4. Besides, compared with UMDL and SMDL, MVMFL is also much better. The data generated via the computed parameters is also visualized. To better show the distribution of different classes, we randomly regenerate synthetic data (J = 3 and Kj = 3) to enforce the gap between various categories be large, as displayed in Fig. 4.5a (the visualization of the first two dimensions of the generated data in the first type of feature in the first view). Inputting this synthetic data into MVMFL, θ Z , θ H , and θ X are estimated. We then use them to generate the five-category data, as displayed in Fig. 4.5b. Obviously, distributions in these two figures are quite similar, demonstrating that the probability assumption in MVMFL is reasonable. Additionally, we make another experiment on the synthetic dataset to further demonstrate the significance of our hierarchical fusion structure. For example, set (J = 3, Kj = 3), we then combine multiple features in each view as a single one. Table 4.2 The classification accuracy on synthetic dataset with a different number of views and features. Bold values mean the best performances Methods DPL MTJSRC RCR JSSL UMDL SMDL JDCR CSRL MVMFL
Kj = 3 J =2 51.60% 50.70% 50.60% 51.20% 75.45% 82.15% 50.70% 56.55% 89.35%
J =3 58.60% 59.05% 56.65% 59.10% 87.65% 87.95% 62.45% 73.55% 95.60%
J =4 60.80% 60.65% 57.95% 57.20% 88.60% 88.00% 66.90% 80.00% 95.70%
J =3 Kj = 1 54.20% 62.50% 48.35% 67.50% 73.75% 81.50% 52.30% 53.40% 85.95%
Kj = 2 57.45% 60.50% 55.65% 59.00% 84.85% 86.00% 61.40% 55.65% 92.85%
Kj = 4 46.20% 54.90% 51.90% 55.80% 77.95% 82.70% 62.00% 64.60% 95.95%
114
4 Information Fusion Based on Multi-View and Multi-Feature Learning
Fig. 4.5 The visualization of the original and generated data. (a) The original data. (b) The generated data
Thus, a novel input whose J is 3 and Kj is 1 is got. Furthermore, we also regard the observation as a novel one with 9 types of views (each feature is regarded as a view) and the model (J = 9, Kj = 1) is consequently acquired. The experimental results are listed in Table 4.3. Note that the number of views is the same when (J = 3, Kj = 2), (J = 3, Kj = 3), and (J = 3, Kj = 4) in the first case. We denote them as (3, 1)1, (3, 1)2 , and (3, 1)3 in the second column, respectively. So do other cases. Obviously, MVMFL always gets a noticeable result in comparison with other cases, showing the importance of our hierarchical fusion structure.
4.2.4.3 Experimental Results on Biomedical Data The evaluation of the proposed MVMFL in detection of Diabetes Mellitus disease compared with other methods is shown in this subsection. According to JSSL [4], 30, 40, 50, 60, 70, 80, and 90 samples in each class are randomly selected from the original dataset as the training data, and the rest of samples are exploited for testing. Additionally, the selection is independently conducted 5 times in each sub-
4.2 Generative Multi-View and Multi-Feature Learning Table 4.3 The classification accuracies on the synthetic dataset obtained by MVMFL. Bold values mean the best performances
Model Accuracy Model Accuracy Model Accuracy Model Accuracy Model Accuracy
115 (J , Kj ) (2,3) 89.35% (3,2) 92.85% (3,3) 95.60% (3,4) 95.95% (4,3) 95.70%
(2,1) 82.30% (3, 1)1 87.35% (3, 1)2 91.85% (3, 1)3 87.55% (4,1) 93.10%
(6, 1)1 85.45% (6, 1)2 89.00% (9,1) 92.40% (12, 1)1 88.30% (12, 1)2 94.30%
experiment, and the average accuracy and error bar are exploited to evaluate the performance in a quantitative way. Table 4.4 tabulates the average accuracy and the error bar computed by different methods. Obviously, MVMFL is superior to other comparison strategies when the number of training samples is 30, 40, 50, 60, 70, and 90, respectively. In comparison to CSRL and SMDL, MVMFL gains about more than 5% improvement. Compared with DPL, LFDA, UMDL, MTJSRC, and RCR, there is also about 2–4% enhancement on average classification accuracy at some cases. In contrast to JSSL and JDCR, MVMFL is also competitive. Despite that the result computed by JSSL is slightly larger than that got by our method when the training number is 80, MVMFL gains higher values in remaining cases. Besides, the values of the error bar acquired by MVMFL are relatively small in most cases, which also indicates the stability of the proposed method. Figure 4.6 further plots the ROC curves of different methods for the Diabetes Mellitus disease diagnosis when the number of training samples is 70, followed by their corresponding covered areas listed in Table 4.5. From Fig. 4.6, we can see that the covered area obtained by MVMFL is larger than that obtained by other strategies. Compared with SMDL, LFDA, CSRL, DPL, UMDL, MTJSRC, and RCR, our method has a noticeable enhancement. From Table 4.5, we can observe that our approach is much better than other approaches.
4.2.5 Conclusion In this chapter, we present a generative multi-view and multi-feature learning model for classification. Different from existing approaches that ignore the hierarchical information, our model jointly takes multi-view and multi-feature into account. Specifically, it follows the assumption that various features in a view are the projected vectors from a latent variable. To encourage the method to be adaptive for classification, the label information is further used to semantically represent the latent variables in various views. The proposed hierarchical model is proved
Methods DPL MTJSRC RCR JSSL UMDL SMDL JDCR LFDA CSRL MVMFL
Training Number 30 78.01±2.18% 77.76±2.93% 77.81±2.49% 79.82±1.92% 75.53±1.97% 75.77±2.21% 78.61±1.97% 51.36±1.28% 73.84±1.50% 80.91±1.29%
40 77.23±2.38% 79.04±2.56% 78.64±2.10% 81.45±1.50% 77.36±1.96% 77.43±1.99% 78.14±2.03% 67.27±5.13% 71.32±1.60% 81.61±0.57%
50 79.86±2.53% 80.48±1.18% 80.51±2.60% 82.82±1.46% 74.71±2.84% 73.61±1.73% 80.07±2.86% 72.71±1.53% 72.23±1.90% 83.09±1.41%
60 79.41±2.17% 80.96±2.00% 80.91±3.33% 82.88±2.15% 79.11±2.48% 74.69±3.66% 79.93±1.42% 75.87±4.16% 73.06±1.73% 83.39±1.22%
70 82.39±1.68% 80.08±2.16% 82.53±2.03% 83.27±1.55% 79.60±0.99% 74.50±3.39% 82.71±1.40% 78.41±2.26% 75.40±2.67% 84.14±0.99%
80 82.42±1.52% 82.51±1.49% 83.36±1.12% 85.06±0.87% 82.34±1.80% 74.37±4.03% 83.03±2.06% 78.61±3.42% 73.94±2.30% 84.94±1.20%
90 81.61±1.08% 81.04±2.17% 83.67±1.82% 83.17±2.15% 79.43±1.97% 75.07±2.89% 83.32±1.62% 82.18±2.73% 75.07±1.82% 83.98±2.26%
Table 4.4 The average classification accuracies and error bars on the biomedical dataset obtained by MVMFL and other comparison methods when the number of training samples changes from 30 to 90. Bold values mean the best performances
116 4 Information Fusion Based on Multi-View and Multi-Feature Learning
4.2 Generative Multi-View and Multi-Feature Learning
117
Fig. 4.6 ROC curves of different approaches for DM diagnosis Table 4.5 The area under curve (AUC) for the different methods in DM detection. Bold values mean the best performances
Methods DPL MTJSRC RCR JSSL UMDL
AUC 0.8639 0.8549 0.8667 0.8809 0.8639
Methods SMDL JDCR LFDA CSRL MVMFL
AUC 0.8089 0.8703 0.8397 0.8420 0.9072
that it can be transformed into a class-conditional model, and an efficient algorithm is designed to optimize the presented method. Finally, experiments conducted on synthetic and real-world datasets demonstrates the superiority of our approach.
118
4 Information Fusion Based on Multi-View and Multi-Feature Learning
4.3 Hierarchical Multi-View Multi-Feature Fusion Different from MVMFL, which learns different latent variables but follows the same distribution corresponding to different features in a view, we propose another method that aims to estimate a shared variable for different features in a view. Thanks to this assumption, this approach is capable of optimizing the model by introducing the expectation maximization (EM) algorithm. Here we name this novel method as Hierarchical Multi-view Multi-feature Fusion (HMMF) [2]. The main contributions of this chapter are shown as follows: 1. A probabilistic hierarchical method is proposed for multi-view and multi-feature learning. This model hierarchically fuses multiple features through a latent variable. 2. The EM-based algorithm [3] is introduced to solve our model effectively. Particularly, a closed-form solution for each parameter or variable is obtained, and we alternatively update the parameters and variables until convergence.
4.3.1 Problem Formulation The framework and graphic model of the proposed method are displayed in Fig. 4.7. An instance is represented by J views, and the j -th (j ∈ {1, · · · , J }) view is described by Kj kinds of features. Particularly, the data and the label of the ith instance can be represented as {xij kj ∈ R
Djkj J,Kj }j,kj =1
and a one-hot categorical
RP ,
variable zi ∈ respectively, where Dj kj is the dimension of the kj -th feature in the j -th view, P is the number of the categories, and zi satisfies zpi ∈ {0, 1} and P p=1 zpi = 1. Then, some probability assumptions for these variables are made to construct our hierarchical model. Firstly, the categorical distribution is introduced for the categorical variable zi , which has the following form: p(zi | θ Z ) =
P '
z
πppi
(4.16)
p=1
where θ Z = {πp }Pp=1 , zpi = 1 if the i-th sample belongs to the p-th class, otherwise zpi = 0, and the variable πp ∈ [0, 1] follows Pp=1 πp = 1, which means the probability of a sample belonging to its corresponding category. To utilize the label information, a latent variable hij ∈ RDj associated with the j -th view of the i-th sample is then computed by imposing the ground-truth label on it. Particularly, for different categories, the distributions of the latent variables belonging to different views are different, greatly utilizing the complementary
Fig. 4.7 (a) The framework of the proposed method, where the number of the latent variables is equal to the number of observed views. (b) The probabilistic framework of the proposed method
The most common and simple assumption for $\mathbf{h}_{ij}$ is that

$$p(\mathbf{h}_{ij} \mid \mathbf{z}_i, \theta_H) = \prod_{p=1}^{P} \mathcal{N}(\mathbf{h}_{ij} \mid \boldsymbol{\mu}_{jp}, \boldsymbol{\Sigma}_{jp})^{z_{pi}} \qquad (4.17)$$
where $\theta_H = \{\boldsymbol{\mu}_{jp}, \boldsymbol{\Sigma}_{jp}\}_{j,p=1}^{J,P}$, meaning that its distribution for the $p$-th category is a Gaussian distribution with mean $\boldsymbol{\mu}_{jp}$ and covariance matrix $\boldsymbol{\Sigma}_{jp}$. A general assumption is that multiple features are projections from a shared variable through different mapping functions. Thus, in the presented method, a mapping matrix for each feature in the same view is computed to transform the latent variable $\mathbf{h}_{ij}$ to the observed data $\mathbf{x}_{ijk_j}$ by a linear Gaussian model, which can be formulated as follows:

$$p(\mathbf{x}_{ijk_j} \mid \mathbf{h}_{ij}, \theta_X) = \mathcal{N}(\mathbf{x}_{ijk_j} \mid \mathbf{A}_{jk_j}\mathbf{h}_{ij} + \mathbf{b}_{jk_j}, \boldsymbol{\Sigma}_{jk_j}) \qquad (4.18)$$

where $\theta_X = \{\mathbf{A}_{jk_j}, \mathbf{b}_{jk_j}, \boldsymbol{\Sigma}_{jk_j}\}_{j,k_j=1}^{J,K_j}$, $\mathbf{A}_{jk_j}$ is the learned mapping matrix, $\mathbf{b}_{jk_j}$ is the bias, and $\boldsymbol{\Sigma}_{jk_j}$ denotes the covariance matrix. Furthermore, we also make some reasonable conditional independence assumptions about different features and different views, including

$$p\big(\{\mathbf{x}_{ijk_j}\}_{k_j=1}^{K_j} \mid \mathbf{h}_{ij}, \theta_X\big) = \prod_{k_j=1}^{K_j} p(\mathbf{x}_{ijk_j} \mid \mathbf{h}_{ij}, \theta_X)$$
$$p\big(\{\{\mathbf{x}_{ijk_j}\}_{k_j=1}^{K_j}, \mathbf{h}_{ij}\}_{j=1}^{J} \mid \theta_X, \theta_H, \mathbf{z}_i\big) = \prod_{j=1}^{J} p\big(\{\mathbf{x}_{ijk_j}\}_{k_j=1}^{K_j}, \mathbf{h}_{ij} \mid \theta_X, \theta_H, \mathbf{z}_i\big) \qquad (4.19)$$

In order to acquire a simple representation and derivation, let $\mathbf{Z} = \{\mathbf{z}_i\}_{i=1}^{M}$, $\mathbf{H} = \{\mathbf{h}_{ij}\}_{i,j=1}^{M,J}$, and $\mathbf{X} = \{\mathbf{x}_{ijk_j}\}_{i,j,k_j=1}^{M,J,K_j}$. Taking the aforementioned independent and identically distributed (i.i.d.) assumption into account, the joint distribution w.r.t. all variables is obtained:

$$P(\mathbf{X}, \mathbf{Z}, \mathbf{H} \mid \theta_X, \theta_Z, \theta_H) = \prod_{i=1}^{M}\left\{ p(\mathbf{z}_i \mid \theta_Z)\prod_{j=1}^{J}\left\{ p(\mathbf{h}_{ij} \mid \mathbf{z}_i, \theta_H)\prod_{k_j=1}^{K_j} p(\mathbf{x}_{ijk_j} \mid \mathbf{h}_{ij}, \theta_X)\right\}\right\} \qquad (4.20)$$

which is a probabilistic hierarchical model. Generally speaking, it is infeasible to estimate the covariance matrices $\boldsymbol{\Sigma}_{jk_j}$ and $\boldsymbol{\Sigma}_{jp}$ when the dimensions of the features
and the latent variables are large. To avoid overfitting, $\boldsymbol{\Sigma}_{jp}$ and $\boldsymbol{\Sigma}_{jk_j}$ can be set to $\sigma_{jp}^{2}\mathbf{I}$ and $\sigma_{jk_j}^{2}\mathbf{I}$ in this case, where $\sigma_{jp}$ and $\sigma_{jk_j}$ are two one-dimensional variables that control the variances of all dimensions, and $\mathbf{I}$ is the identity matrix. To estimate the parameters of this probabilistic method, the log-likelihood function w.r.t. all variables should be optimized. Since it is difficult to directly observe the latent variable $\mathbf{H}$, the log-likelihood function related only to the multi-view and multi-feature data $\mathbf{X}$ and its label variable $\mathbf{Z}$ is considered. Therefore, the objective function is

$$\log P(\mathbf{X}, \mathbf{Z} \mid \theta_X, \theta_Z, \theta_H) \qquad (4.21)$$

However, optimizing the objective function (4.21) requires marginalizing $\mathbf{H}$ out of Eq. (4.20), which is difficult to do directly. Fortunately, the Expectation Maximization (EM) algorithm [3] can be readily utilized to efficiently solve this kind of problem with latent variables.
4.3.2 Optimization for HMMF

As analyzed above, the EM algorithm, which is a two-stage iterative optimization technique for finding maximum likelihood solutions, is used to estimate the model parameters. Specifically, the posterior probability of the latent variable $\mathbf{H}$ is computed in the E-step, followed by the estimation of the parameters $\theta_Z$, $\theta_H$, and $\theta_X$ in the M-step.

E-Step First, we use the current values of all parameters to compute the posterior probabilities of $\mathbf{H}$. The log-posterior function

$$\begin{aligned}\log P(\mathbf{H} \mid \mathbf{X}, \mathbf{Z}, \theta_X, \theta_Z, \theta_H) &\propto \log P(\mathbf{X}, \mathbf{H}, \mathbf{Z} \mid \theta_X, \theta_Z, \theta_H)\\ &\propto \sum_{i=1}^{M}\sum_{j=1}^{J}\left\{-\frac{1}{2}\mathbf{h}_{ij}^{T}\left(\sum_{p=1}^{P}z_{pi}\boldsymbol{\Sigma}_{jp}^{-1} + \sum_{k_j=1}^{K_j}\mathbf{A}_{jk_j}^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}\mathbf{A}_{jk_j}\right)\mathbf{h}_{ij}\right.\\ &\qquad\left. + \mathbf{h}_{ij}^{T}\left(\sum_{p=1}^{P}z_{pi}\boldsymbol{\Sigma}_{jp}^{-1}\boldsymbol{\mu}_{jp} + \sum_{k_j=1}^{K_j}\mathbf{A}_{jk_j}^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}(\mathbf{x}_{ijk_j} - \mathbf{b}_{jk_j})\right)\right\}\end{aligned} \qquad (4.22)$$

is a quadratic function w.r.t. $\mathbf{h}_{ij}$. Thus, the posterior probability of $\mathbf{h}_{ij}$ follows a Gaussian distribution, which can be formulated as follows:

$$p(\mathbf{h}_{ij} \mid \mathbf{X}, \mathbf{Z}, \theta_X, \theta_Z, \theta_H) = \mathcal{N}\big(\mathbf{h}_{ij} \mid \boldsymbol{\mu}^{H}_{ij}, \boldsymbol{\Sigma}^{H}_{ij}\big) \qquad (4.23)$$
where

$$\boldsymbol{\Sigma}^{H}_{ij} = \left(\sum_{p=1}^{P}z_{pi}\boldsymbol{\Sigma}_{jp}^{-1} + \sum_{k_j=1}^{K_j}\mathbf{A}_{jk_j}^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}\mathbf{A}_{jk_j}\right)^{-1}$$
$$\boldsymbol{\mu}^{H}_{ij} = \boldsymbol{\Sigma}^{H}_{ij}\left\{\sum_{p=1}^{P}z_{pi}\boldsymbol{\Sigma}_{jp}^{-1}\boldsymbol{\mu}_{jp} + \sum_{k_j=1}^{K_j}\mathbf{A}_{jk_j}^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}(\mathbf{x}_{ijk_j} - \mathbf{b}_{jk_j})\right\} \qquad (4.24)$$
M-Step In the M-step, all parameters are re-estimated by optimizing a concave lower-bound function of (4.21),

$$\mathcal{L}(\theta_X, \theta_Z, \theta_H) = \mathbb{E}_{\mathbf{H}}\big[\log p(\mathbf{X}, \mathbf{Z}, \mathbf{H} \mid \theta_X, \theta_Z, \theta_H)\big] \qquad (4.25)$$

which has a unique maximum point. Eq. (4.25) has the same form as the log joint density $\log p(\mathbf{X}, \mathbf{Z}, \mathbf{H} \mid \theta_Z, \theta_H, \theta_X)$, except that $\mathbf{h}_{ij}$ and $\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}$ are replaced with $\mathbb{E}[\mathbf{h}_{ij}]$ and $\mathbb{E}[\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}]$, respectively. Since the mean and covariance of the posterior probability of $\mathbf{h}_{ij}$ are $\boldsymbol{\mu}^{H}_{ij}$ and $\boldsymbol{\Sigma}^{H}_{ij}$, which have been calculated in the E-step, $\mathbb{E}[\mathbf{h}_{ij}]$ and $\mathbb{E}[\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}]$ can be obtained through the following equations:

$$\mathbb{E}[\mathbf{h}_{ij}] = \boldsymbol{\mu}^{H}_{ij}, \qquad \mathbb{E}\big[\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}\big] = \boldsymbol{\Sigma}^{H}_{ij} + \boldsymbol{\mu}^{H}_{ij}\big(\boldsymbol{\mu}^{H}_{ij}\big)^{T} \qquad (4.26)$$
By calculating the derivative of the lower-bound function $\mathcal{L}(\theta_Z, \theta_H, \theta_X)$ w.r.t. $\theta_Z$, $\theta_H$, and $\theta_X$, and setting it to zero, the parameters can be estimated with closed-form solutions. The results are listed as follows. For the parameters corresponding to $\theta_X$, the solutions are

$$\mathbf{A}_{jk_j} = \left[\sum_{i=1}^{M}(\mathbf{x}_{ijk_j} - \mathbf{b}_{jk_j})\mathbb{E}[\mathbf{h}_{ij}]^{T}\right]\left[\sum_{i=1}^{M}\mathbb{E}[\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}]\right]^{-1}$$
$$\mathbf{b}_{jk_j} = \frac{1}{M}\sum_{i=1}^{M}\big\{\mathbf{x}_{ijk_j} - \mathbf{A}_{jk_j}\mathbb{E}[\mathbf{h}_{ij}]\big\} \qquad (4.27)$$
$$\boldsymbol{\Sigma}_{jk_j} = \frac{1}{M}\sum_{i=1}^{M}\Big\{\mathbf{A}_{jk_j}\mathbb{E}[\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}]\mathbf{A}_{jk_j}^{T} - 2\mathbf{A}_{jk_j}\mathbb{E}[\mathbf{h}_{ij}](\mathbf{x}_{ijk_j} - \mathbf{b}_{jk_j})^{T} + (\mathbf{x}_{ijk_j} - \mathbf{b}_{jk_j})(\mathbf{x}_{ijk_j} - \mathbf{b}_{jk_j})^{T}\Big\}$$
Algorithm 4.2 HMMF: Hierarchical multi-view multi-feature fusion
Input: Observed data X; label Z; initialization of θ_Z, θ_H, θ_X
1: (Calculate θ_Z)
2: for p = 1, ..., P do
3:   Calculate π_p by Eq. (4.29)
4: end for
5: while not converged do
6:   E-step:
7:   for i = 1, ..., M do
8:     for j = 1, ..., J do
9:       Evaluate Σ^H_ij and μ^H_ij by Eq. (4.24)
10:      Calculate E[h_ij] and E[h_ij h_ij^T] by Eq. (4.26)
11:    end for
12:  end for
13:  M-step (re-estimate θ_H and θ_X):
14:  for j = 1, ..., J do
15:    for p = 1, ..., P do
16:      Calculate μ_jp and Σ_jp through Eq. (4.28)
17:    end for
18:    for k_j = 1, ..., K_j do
19:      Calculate Σ_jk_j, b_jk_j, and A_jk_j through Eq. (4.27)
20:    end for
21:  end for
22: end while
Output: θ_Z; θ_H; θ_X
For the parameters corresponding to $\theta_H$, the solutions are

$$\boldsymbol{\mu}_{jp} = \frac{1}{\sum_{i=1}^{M}z_{pi}}\sum_{i=1}^{M}z_{pi}\,\mathbb{E}[\mathbf{h}_{ij}]$$
$$\boldsymbol{\Sigma}_{jp} = \frac{1}{\sum_{i=1}^{M}z_{pi}}\sum_{i=1}^{M}z_{pi}\Big\{\mathbb{E}[\mathbf{h}_{ij}\mathbf{h}_{ij}^{T}] - 2\mathbb{E}[\mathbf{h}_{ij}]\boldsymbol{\mu}_{jp}^{T} + \boldsymbol{\mu}_{jp}\boldsymbol{\mu}_{jp}^{T}\Big\} \qquad (4.28)$$

To estimate the parameter $\theta_Z = \{\pi_p\}_{p=1}^{P}$, a Lagrange multiplier term is introduced to enforce $\sum_{p=1}^{P}\pi_p = 1$. By calculating the derivative of the Lagrange function w.r.t. $\pi_p$ and setting it to 0, the solution of $\pi_p$ is obtained according to the following equation:

$$\pi_p = \frac{\sum_{i=1}^{M}z_{pi}}{\sum_{p=1}^{P}\sum_{i=1}^{M}z_{pi}} \qquad (4.29)$$
From Eqs. (4.27), (4.28), and (4.29), we can see that each step has a closed-form solution which would greatly facilitate the parameter estimation process. According
to the convergence theory of the EM algorithm, each update of the parameters acquired from an E-step followed by an M-step guarantees that the log-likelihood function does not decrease. Hence, to obtain a local maximum, we alternately execute the E-step and M-step until convergence. Algorithm 4.2 illustrates the details of the optimization. In this chapter, our proposed method is named Hierarchical Multi-view Multi-feature Fusion (HMMF).
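To make these updates concrete, the following minimal NumPy sketch performs one EM iteration of HMMF for the single-view case ($J = 1$) with full covariance matrices, following Eqs. (4.24), (4.26), (4.27), and (4.28). All names (X, Z, A, b, Psi, mu, Sig) are illustrative placeholders rather than the authors' implementation, and the update order inside the M-step is one of several reasonable choices.

```python
import numpy as np

def hmmf_em_step(X, Z, A, b, Psi, mu, Sig):
    """One EM iteration of HMMF for a single view (J = 1).
    X: list of K arrays, X[k] has shape (M, D_k); Z: (M, P) one-hot labels.
    A[k], b[k], Psi[k]: per-feature mapping, bias, noise covariance;
    mu[p], Sig[p]: per-class latent mean and covariance (illustrative names)."""
    M, P = Z.shape
    K = len(X)
    D = mu[0].shape[0]
    Sig_inv = [np.linalg.inv(S) for S in Sig]
    Psi_inv = [np.linalg.inv(Ps) for Ps in Psi]

    # ---- E-step: posterior mean/covariance of h_i, Eqs. (4.24) and (4.26) ----
    E_h = np.zeros((M, D))
    E_hh = np.zeros((M, D, D))
    feat_prec = sum(A[k].T @ Psi_inv[k] @ A[k] for k in range(K))
    for i in range(M):
        prec = sum(Z[i, p] * Sig_inv[p] for p in range(P)) + feat_prec
        cov = np.linalg.inv(prec)
        rhs = sum(Z[i, p] * Sig_inv[p] @ mu[p] for p in range(P))
        rhs += sum(A[k].T @ Psi_inv[k] @ (X[k][i] - b[k]) for k in range(K))
        m = cov @ rhs
        E_h[i], E_hh[i] = m, cov + np.outer(m, m)

    # ---- M-step: closed-form updates, Eqs. (4.27) and (4.28) ----
    for k in range(K):
        A[k] = (sum(np.outer(X[k][i] - b[k], E_h[i]) for i in range(M))
                @ np.linalg.inv(E_hh.sum(axis=0)))
        b[k] = np.mean(X[k] - E_h @ A[k].T, axis=0)
        Psi[k] = np.mean([A[k] @ E_hh[i] @ A[k].T
                          - 2 * A[k] @ np.outer(E_h[i], X[k][i] - b[k])
                          + np.outer(X[k][i] - b[k], X[k][i] - b[k])
                          for i in range(M)], axis=0)
    for p in range(P):
        w = Z[:, p]
        mu[p] = (w[:, None] * E_h).sum(axis=0) / w.sum()
        Sig[p] = sum(w[i] * (E_hh[i] - 2 * np.outer(E_h[i], mu[p])
                             + np.outer(mu[p], mu[p])) for i in range(M)) / w.sum()
    pi = Z.sum(axis=0) / M                      # Eq. (4.29)
    return A, b, Psi, mu, Sig, pi
```

Iterating this step until the log-likelihood stops improving reproduces the loop of Algorithm 4.2 for one view.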
4.3.3 Inference for HMMF

According to the Bayesian principle, the posterior probability of a given test sample $\mathbf{x} = \{\mathbf{x}_{jk_j}\}_{j,k_j=1}^{J,K_j}$ belonging to the $p$-th class is calculated through

$$p(z_p = 1 \mid \mathbf{x}, \theta_Z, \theta_H, \theta_X) = \frac{p(\mathbf{x} \mid z_p = 1, \theta_Z, \theta_H, \theta_X)\,p(z_p = 1)}{\sum_{p=1}^{P}p(\mathbf{x} \mid z_p = 1, \theta_Z, \theta_H, \theta_X)\,p(z_p = 1)} \qquad (4.30)$$

Since the second term of the numerator is $p(z_p = 1) = \pi_p$, the key problem is how to calculate the first term of the numerator. Actually, we get its value by the following process:

$$\begin{aligned}\log p(\mathbf{x} \mid z_p = 1, \theta_Z, \theta_H, \theta_X) &= \log\int p(\mathbf{x}, \mathbf{h} \mid z_p = 1, \theta_Z, \theta_H, \theta_X)\,d\mathbf{h}\\ &= \sum_{j=1}^{J}\left\{-\frac{1}{2}\log|\boldsymbol{\Sigma}_{jp}| - \frac{1}{2}\boldsymbol{\mu}_{jp}^{T}\boldsymbol{\Sigma}_{jp}^{-1}\boldsymbol{\mu}_{jp} + \frac{1}{2}\log|\boldsymbol{\Sigma}^{H}_{jp}| + \frac{1}{2}\big(\boldsymbol{\mu}^{H}_{jp}\big)^{T}\big(\boldsymbol{\Sigma}^{H}_{jp}\big)^{-1}\boldsymbol{\mu}^{H}_{jp}\right.\\ &\qquad\left. - \sum_{k_j=1}^{K_j}\left[\frac{D_{jk_j}}{2}\log(2\pi) + \frac{1}{2}\log|\boldsymbol{\Sigma}_{jk_j}| + \frac{1}{2}(\mathbf{x}_{jk_j} - \mathbf{b}_{jk_j})^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}(\mathbf{x}_{jk_j} - \mathbf{b}_{jk_j})\right]\right\}\end{aligned} \qquad (4.31)$$
where

$$\boldsymbol{\Sigma}^{H}_{jp} = \left(\boldsymbol{\Sigma}_{jp}^{-1} + \sum_{k_j=1}^{K_j}\mathbf{A}_{jk_j}^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}\mathbf{A}_{jk_j}\right)^{-1}$$
$$\boldsymbol{\mu}^{H}_{jp} = \boldsymbol{\Sigma}^{H}_{jp}\left\{\boldsymbol{\Sigma}_{jp}^{-1}\boldsymbol{\mu}_{jp} + \sum_{k_j=1}^{K_j}\mathbf{A}_{jk_j}^{T}\boldsymbol{\Sigma}_{jk_j}^{-1}(\mathbf{x}_{jk_j} - \mathbf{b}_{jk_j})\right\} \qquad (4.32)$$
Every parameter or variable is well defined in Eq. (4.31). Thus, the first term of the numerator in Eq. (4.30), and hence the posterior probability of the new sample belonging to the $p$-th class, is easily obtained. If only the predicted label is required, the logarithm of the numerator in Eq. (4.30) can be calculated for all classes, and the predicted category is the one with the maximum value.
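The following sketch carries out this inference for the single-view case, mirroring the notation of the EM sketch above; it evaluates the reconstructed Eq. (4.31) per class and normalizes with Eq. (4.30). The function name and argument layout are illustrative assumptions.

```python
import numpy as np

def hmmf_predict(x, A, b, Psi, mu, Sig, pi):
    """Class posterior for one test sample (single view, J = 1).
    x: list of K feature vectors; remaining arguments as in hmmf_em_step."""
    K, P = len(x), len(mu)
    Psi_inv = [np.linalg.inv(Ps) for Ps in Psi]
    feat_prec = sum(A[k].T @ Psi_inv[k] @ A[k] for k in range(K))
    log_lik = np.zeros(P)
    for p in range(P):
        Sig_inv_p = np.linalg.inv(Sig[p])
        Sig_h = np.linalg.inv(Sig_inv_p + feat_prec)                # Eq. (4.32)
        mu_h = Sig_h @ (Sig_inv_p @ mu[p]
                        + sum(A[k].T @ Psi_inv[k] @ (x[k] - b[k]) for k in range(K)))
        ll = (-0.5 * np.linalg.slogdet(Sig[p])[1]
              - 0.5 * mu[p] @ Sig_inv_p @ mu[p]
              + 0.5 * np.linalg.slogdet(Sig_h)[1]
              + 0.5 * mu_h @ np.linalg.inv(Sig_h) @ mu_h)
        for k in range(K):                                          # Eq. (4.31)
            r = x[k] - b[k]
            ll -= 0.5 * (len(r) * np.log(2 * np.pi)
                         + np.linalg.slogdet(Psi[k])[1]
                         + r @ Psi_inv[k] @ r)
        log_lik[p] = ll
    log_post = log_lik + np.log(pi)                                 # numerator of (4.30)
    log_post -= log_post.max()                                      # numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

The predicted label is simply the arg-max of the returned posterior vector.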
4.3.4 Experimental Results

In this section, similar to MVMFL, we conduct experiments on both synthetic and real-world datasets to demonstrate the superiority of the proposed method. The datasets used in this section are first described, followed by the experimental setting. The comparison among different approaches is then analyzed.
4.3.4.1 Datasets and Experimental Setting

The synthetic data is generated according to the assumptions of the proposed method. Particularly, given the values of $D_j$ and $D_{jk_j}$, the parameters $\mathbf{A}_{jk_j}$, $\boldsymbol{\Sigma}_{jk_j}$, $\boldsymbol{\Sigma}_{jp}$, and $\boldsymbol{\mu}_{jp}$ are randomly generated. The latent variable $\mathbf{h}_{ij}$ is then drawn from $\mathcal{N}(\mathbf{h}_{ij} \mid \sum_{p=1}^{P}z_{pi}\boldsymbol{\mu}_{jp}, \sum_{p=1}^{P}z_{pi}\boldsymbol{\Sigma}_{jp})$. Consequently, the observations are acquired according to $\mathcal{N}(\mathbf{x}_{ijk_j} \mid \mathbf{A}_{jk_j}\mathbf{h}_{ij}, \boldsymbol{\Sigma}_{jk_j})$. Without loss of generality, we set the dimensionality $D_j$ to be the same for each view, and likewise for $D_{jk_j}$. In this experiment, we set $D_j$ and $D_{jk_j}$ to 10 and 20, respectively. We also select the biomedical dataset [4] to evaluate the performance of the proposed method. Additionally, 40, 50, 60, and 70 instances in each category are randomly selected for training, repeated five times independently, and the rest of the samples are used for testing. In order to illustrate the superiority of our method, we also compare it with some single- and multi-view based strategies, including DPL [7], MDL (UMDL and SMDL) [11], JDCR [10], and CSRL [5], on the real-world datasets. For DPL, we concatenate all features in all views as a single one. For the other approaches, we concatenate all features in each view as a vector, and thus vectors in different views
are regarded as their inputs. Since CSRL aims to learn a latent variable, we apply an SVM to the learned representation for classification. For the parameter tuning on the synthetic dataset, we tune the dimension $D_j$ of the latent variable through fivefold cross validation using the training data. In fact, we find that setting $D_j$ close to $\min(D_{jk_j})$ works well for both datasets. For the biomedical dataset, since the dimension of several features is around 5, we set $D_j$ to 5 empirically.
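As a concrete illustration of the synthetic data generation just described, the sketch below draws multi-view multi-feature data from the assumed hierarchical model; the function name, default sizes, and noise levels are illustrative choices, not the authors' exact script.

```python
import numpy as np

def make_synthetic(J=3, K=3, P=5, Dh=10, Dx=20, n_per_class=120, seed=0):
    """Generate data following HMMF's generative assumptions:
    per view/class latent Gaussian h, then x = A h + Gaussian noise."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(P), n_per_class)
    M = labels.size
    mu = rng.normal(size=(J, P, Dh))                     # class/view latent means
    Sig = np.stack([[np.eye(Dh) * rng.uniform(0.5, 1.5) for _ in range(P)]
                    for _ in range(J)])                  # latent covariances
    A = rng.normal(size=(J, K, Dx, Dh))                  # per-feature mappings
    Psi = np.stack([[np.eye(Dx) * 0.1 for _ in range(K)] for _ in range(J)])
    X = np.zeros((J, K, M, Dx))
    for i, p in enumerate(labels):
        for j in range(J):
            h = rng.multivariate_normal(mu[j, p], Sig[j, p])
            for k in range(K):
                X[j, k, i] = rng.multivariate_normal(A[j, k] @ h, Psi[j, k])
    return X, labels
```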
4.3.4.2 Experimental Results on Datasets

Synthetic Dataset Note that, since the assumption of HMMF is different from that of MVMFL, the synthetic dataset is generated in a different way from that of MVMFL. In this experiment, we randomly generate four types of synthetic datasets, namely ($J$ = 2, $K_j$ = 2), ($J$ = 3, $K_j$ = 3), ($J$ = 4, $K_j$ = 4), and ($J$ = 5, $K_j$ = 5). As mentioned above, without loss of generality, we set the number of feature types $K_j$ in each view to be the same. Note that $D_j$ and $D_{jk_j}$ are set to 10 and 20, respectively. Besides, we randomly generate 5 classes, with 20 training and 100 test samples per category. To show the superiority of the hierarchical fusion, we reconstruct the inputs in another three ways. For instance, as shown in Table 4.6, when the number of views and their corresponding features is ($J$ = 3, $K_j$ = 3), we concatenate the features in each view as a single one, obtaining a new input whose $J$ is 3 and $K_j$ is 1, denoted (3,1). Additionally, we also follow the input format used by many existing multi-view methods: regarding the sample with 9 types of features as the input, the two cases ($J$ = 1, $K_j$ = 9), denoted (1,9), and ($J$ = 9, $K_j$ = 1), denoted (9,1), are consequently obtained. From Table 4.6, we can see that our proposed hierarchical fusion strategy always achieves a noticeably higher accuracy than the other cases. The performance obtained by HMMF is particularly better than that in the third column, indicating the significance of the hierarchical fusion structure. Furthermore, with the increase of the number of views and features, there is also a remarkable improvement in the classification accuracy. The result is only 82.2% in the case of ($J$ = 2, $K_j$ = 2), while it improves substantially when ($J$, $K_j$) rises to (4,4). Furthermore, the data generated through the estimated parameters is also visualized. To show the distributions of the various classes more clearly, we randomly regenerate the synthetic data by enlarging the means of the distributions belonging to different categories. Figure 4.8a shows the locations of the first two dimensions of the synthetic data for the first type of features in the first view when $J$ = 3 and $K_j$ = 3. Inputting this data into our model, we subsequently obtain the parameters $\theta_Z$, $\theta_H$, and $\theta_X$. Then, these estimated parameters are used in our model to regenerate the five-category data, as displayed in Fig. 4.8b. Obviously, the distributions of the various classes generated according to the estimated parameters are quite similar to those of the original data, indicating the superiority of HMMF.
Table 4.6 The classification accuracies on the synthetic dataset obtained by HMMF. Bold values mean the best performances

Setting          (J, Kj)         (J, 1)          (1, J·Kj)        (J·Kj, 1)
J = 2, Kj = 2    (2,2)  82.2%    (2,1)  80.6%    (1,4)   63.6%    (4,1)   80.4%
J = 3, Kj = 3    (3,3)  86.8%    (3,1)  85.0%    (1,9)   68.6%    (9,1)   84.8%
J = 4, Kj = 4    (4,4)  93.6%    (4,1)  92.4%    (1,16)  82.0%    (16,1)  92.0%
J = 5, Kj = 5    (5,5)  94.4%    (5,1)  93.0%    (1,25)  90.2%    (25,1)  91.0%
Fig. 4.8 The comparison of the distributions between the original data and generated data
Biomedical Dataset Results on the biomedical dataset are listed in Table 4.7. Note that Sensitivity = TruePos./(TruePos.+FalseNeg.) and Specificity = TrueNeg./(TrueNeg.+FalsePos.). Obviously, HMMF obtains better classification performance than the other methods. Compared with UMDL, SMDL, and CSRL, HMMF is clearly superior. Specifically, HMMF gains 82.0, 82.7, 83.2, and 84.9% when the training number is 40, 50, 60, and 70, respectively, while the best performances gained by the aforementioned approaches are only 77.4, 74.7, 79.1, and 79.6%. Compared with DPL, RCR, MTJSRC, and JDCR, HMMF is also competitive, achieving about a 2% improvement. For the sensitivity and specificity, HMMF gains better values in most cases. For the specificity, the
Table 4.7 The accuracy (Acc), sensitivity (Sen), and specificity (Spe) values obtained by different methods on the Biomedical dataset when the number of training samples is 40, 50, 60, and 70, respectively. Best results are highlighted in bold

          Num=40                Num=50                Num=60                Num=70
Method    Acc    Sen    Spe     Acc    Sen    Spe     Acc    Sen    Spe     Acc    Sen    Spe
DPL       77.2%  75.4%  79.0%   79.9%  78.2%  81.5%   79.4%  77.7%  81.0%   82.4%  80.3%  84.4%
MTJSRC    79.0%  82.2%  76.0%   80.5%  80.6%  80.3%   81.0%  86.0%  76.1%   80.1%  82.1%  78.2%
RCR       78.6%  77.0%  80.3%   80.5%  80.3%  80.7%   80.9%  78.1%  83.6%   82.5%  80.6%  84.4%
UMDL      77.4%  79.0%  75.8%   74.7%  73.4%  76.0%   79.1%  80.3%  78.0%   79.6%  81.5%  77.8%
SMDL      77.4%  79.0%  76.0%   73.6%  76.1%  71.2%   74.7%  70.4%  78.8%   74.5%  75.9%  73.1%
JDCR      78.1%  75.8%  80.4%   80.1%  78.0%  82.0%   79.9%  78.8%  81.0%   82.7%  80.7%  84.7%
CSRL      71.3%  73.5%  69.2%   72.2%  74.0%  70.5%   73.1%  77.9%  68.4%   75.4%  77.2%  73.7%
HMMF      82.0%  80.3%  83.7%   82.7%  80.7%  84.6%   83.2%  78.7%  87.5%   84.9%  83.1%  86.6%
Fig. 4.9 The ROC curves obtained by different methods in DM detection

Table 4.8 The area under curve (AUC) obtained by the different methods in DM detection. Bold values mean the best performances

Methods   DPL      UMDL     SMDL     JDCR     CSRL     HMMF
AUC       0.8639   0.8639   0.8089   0.8703   0.8420   0.9027
values obtained by our method are always superior to those computed by the other comparison approaches. For the sensitivity, although MTJSRC gains a superior result to ours when the training number is 40 and 60, our method reaches the best value in the remaining cases. When the training number is 70, Fig. 4.9 and Table 4.8 show the ROC curves and their AUC values, respectively. Obviously, HMMF obtains better results than the other methods.
4.3.5 Conclusion

In this chapter, we propose a novel method for multi-view and multi-feature fusion. For each view, a latent variable is first computed, so that its corresponding multiple features are fused. Across multiple views, the label is imposed on the learned
latent variables, so that the correlation among multiple views is exploited. The model is optimized by introducing the EM algorithm, and a closed-form solution for each parameter is obtained. We conduct experiments on both synthetic and real-world datasets, and the experimental results demonstrate the superiority of our method.
References
1. Li J, Zhang B, Lu G, Zhang D. Generative multi-view and multi-feature learning for classification. Inf Fusion. 2019;45:215–26.
2. Li J, Yong H, Zhang B, Li M, Zhang L, Zhang D. A probabilistic hierarchical model for multi-view and multi-feature classification. In: Thirty-second AAAI conference on artificial intelligence; 2018.
3. Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. New York: Springer; 2006.
4. Li J, Zhang D, Li Y, Wu J, Zhang B. Joint similar and specific learning for diabetes mellitus and impaired glucose regulation detection. Inf Sci. 2017;314:191–204.
5. Guo Y. Convex subspace representation learning from multi-view data. In: Proceedings of the twenty-seventh AAAI conference on artificial intelligence (AAAI), vol 1; 2013. p. 2.
6. Sugiyama M. Local Fisher discriminant analysis for supervised dimensionality reduction. In: Proceedings of the 23rd international conference on machine learning. New York: ACM; 2006. p. 905–12.
7. Gu S, Zhang L, Zuo W, Feng X. Projective dictionary pair learning for pattern classification. In: Advances in neural information processing systems; 2014. p. 793–801.
8. Yuan XT, Liu X, Yan S. Visual classification with multitask joint sparse representation. IEEE Trans Image Process. 2012;21(10):4349–60.
9. Yang M, Zhang L, Zhang D, Wang S. Relaxed collaborative representation for pattern classification. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway: IEEE; 2012. p. 2224–31.
10. Li J, Zhang B, Zhang D. Joint discriminative and collaborative representation for fatty liver disease diagnosis. Expert Syst Appl. 2017;89:31–40.
11. Bahrampour S, Nasrabadi NM, Ray A, Jenkins WK. Multimodal task-driven dictionary learning for image classification. IEEE Trans Image Process. 2016;25(1):24–38.
Chapter 5
Information Fusion Based on Metric Learning
Metric learning aims to measure the similarity or dissimilarity between each pair of samples. Until now, various types of metrics have been studied and have achieved satisfactory performance in many applications. To comprehensively exploit the advantages of different metrics, this chapter proposes two metric fusion methods and applies them to classification and verification. After reading this chapter, readers will have a preliminary understanding of metric learning based fusion algorithms.
5.1 Motivation

In recent years, metric learning has attracted much attention in classification research [1–4]. Metric learning offers a basic way to answer the following question in pattern recognition: how do we determine whether two data points are similar? As far as we know, most metric learning approaches calculate a Mahalanobis distance metric $\mathbf{M}$ in one of two ways. First, for classification tasks, the labeled training set is used. Second, for verification tasks, the sets of positive (similar) and negative (dissimilar) pairs based on a predefined objective function are adopted. The objective function is often employed to penalize the two distances, i.e., a large distance between similar pairs and a small distance between dissimilar pairs [5–8]. The expression $(\mathbf{x}_i - \mathbf{x}_j)^{T}\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j)$ is the distance metric function of the sample pair $(\mathbf{x}_i, \mathbf{x}_j)$. Additionally, other methods have been proposed for similarity metric learning, for instance, the cosine similarity metric [3] and the bilinear similarity metric [4]. Especially for k-nearest neighbor classification, researchers have widely used learned Mahalanobis distance metrics, such as information theoretic metric learning (ITML) [6], large margin nearest neighbor (LMNN) [8], where doublets are extended to triplets in the constraints to obtain an ideal distance metric, and scalable large margin Mahalanobis metric learning [9, 10]. Recently, [11] proposed kernel-based distance metric learning methods (doublet-SVM vs. triplet-SVM) which are solved with an SVM solver.
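As a small illustration of this pairwise distance function, the following sketch evaluates $(\mathbf{x}_i - \mathbf{x}_j)^{T}\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j)$ for a given positive semi-definite matrix; the function name is ours, not part of any particular library.

```python
import numpy as np

def mahalanobis_pair(xi, xj, M):
    """Distance metric function (xi - xj)^T M (xi - xj) used throughout
    this chapter; M should be symmetric positive semi-definite."""
    d = xi - xj
    return float(d @ M @ d)

# With M = np.eye(len(xi)) this reduces to the squared Euclidean distance.
```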
Many variants of metric learning have been proposed for semi-supervised learning [12, 13], multi-task learning [14], non-linear distance/similarity learning [15], etc. However, the common characteristic of the above metric learning methods is that a single distance metric for verification or classification is learned by searching for a discriminative score threshold. Their shortcomings are twofold: first, a single metric is not powerful enough, owing to the diversity of data structures; second, it is not stable to decide whether two samples are similar or dissimilar via a single score threshold, for essentially the same reason. In earlier machine learning research, multiple instance learning [16–18] and multi-view learning [19, 20] have been presented. For example, researchers have introduced multi-metric learning to solve face/kinship verification [21] and classification [22]. Even though learning multiple distance metrics for classification can improve performance, only Mahalanobis distance metrics are considered, so the diversity brought by similarity metrics is missed and the full potential of multiple metrics is not realized. [4] proposed a similarity metric learning approach that integrates a coupled distance metric and demonstrated better performance for face verification. Nevertheless, the weights of the distance metric and the similarity metric are treated as equal, without appropriate optimization. Additionally, for identification, a decisive score threshold is still sought, in the same way as other metric learning methods. Therefore, the potential and discriminative capacity of the similarity measure may be limited.

In this chapter, based on the above discussion, we introduce a metric fusion method for face verification and classification, named generalized metric swarm learning (GMSL) [23]. The goal of GMSL is to learn each local diagonal patch of the metric swarm M by enforcing different functions. Besides, we propose to map the pairwise samples into a discriminative metric swarm space of vectors by using the local patches of sub-metrics, such that a standard SVM can be used for final verification. We demonstrate the effectiveness of our framework on several UCI benchmark datasets and on the real-world LFW and PubFig face datasets.

Up to now, two main categories of methods have been applied for learning similarity measures, i.e., pairwise classification and metric learning. Unlike metric learning, pairwise classification approaches [24, 25] train a classifier with a pairwise loss to evaluate similarity by learning whether a pair of samples is from the same class or not. To achieve this, the pairwise classification is modeled as a typical binary classification problem via appropriate pairwise kernels [25]. In the case of similarity measure learning, recent research has shown that similarity and distance measures are complementary and can together further promote classification performance, and several approaches have been proposed [4, 26]. However, these methods only consider learning the combined metrics based on a pairwise loss. In fact, the margin between the distance (similarity) of a matching pair and that of a non-matching pair is usually enforced by a triplet loss [27], whereas a threshold separating matching and non-matching pairs is required by a pairwise loss; thus the triplet loss does not demand that the samples of one class be compressed into a small ball and is more flexible in dealing with data with complex
intra-class variations. In addition, for pairwise classification, the pairwise kernels only consider the distance-based measure while ignoring the similarity measure, and the classifier is also learned using a pairwise loss. Moreover, it is non-trivial to kernelize the previous methods for non-linear metric learning, which generally requires a new formulation and implementation. Most kernelization methods have to learn a matrix of the same size as the kernel matrix, which makes the computational cost very heavy, especially when the training set is large. To solve the problems mentioned above, this chapter also presents a joint distance and similarity measure learning approach that minimizes a triplet loss. First, the distance and similarity measures are combined to promote classification performance. Second, the integrated distance and similarity measure is learned with a max-margin model by minimizing the triplet loss. Moreover, the proposed combined distance and similarity measure (CDSM) [28] can also be explained from the pairwise kernel perspective, which can be used to kernelize the proposed method for non-linear measure learning. Furthermore, in the training and testing phases, two strategies are used to improve the efficiency of the kernelized CDSM approach. Experimental results demonstrate the effectiveness of the proposed CDSM and its kernel extension.
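To give a feel for the kind of combined measure and triplet loss described above, the sketch below pairs a bilinear similarity with a Mahalanobis distance and applies a triplet hinge. This is an illustrative construction under our own assumptions, not the exact CDSM formulation; the names `combined_measure` and `triplet_hinge_loss` are hypothetical.

```python
import numpy as np

def combined_measure(xi, xj, M, S):
    """One simple way to combine a Mahalanobis distance (matrix M) with a
    bilinear similarity (matrix S); illustrative only."""
    d = xi - xj
    return xi @ S @ xj - d @ M @ d

def triplet_hinge_loss(triplets, M, S, margin=1.0):
    """Triplet loss: the combined measure of a matching pair (xi, xj) should
    exceed that of a non-matching pair (xi, xk) by at least a margin."""
    loss = 0.0
    for xi, xj, xk in triplets:
        loss += max(0.0, margin - (combined_measure(xi, xj, M, S)
                                   - combined_measure(xi, xk, M, S)))
    return loss / len(triplets)
```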
5.2 Generalized Metric Swarm Learning

Measuring the correlation between two data points by learning distance metrics with unsupervised and supervised machine learning methods has attracted much attention as a way to improve performance in unconstrained face verification. Since learning a single distance metric and verifying through an empirically chosen score threshold is not robust under uncontrolled experimental conditions, in our work a metric swarm is obtained by learning local patches, i.e., sub-metrics, which simultaneously defines a generalized metric swarm learning (GMSL) model [23] via a joint similarity score function, solved with an efficient alternating optimization algorithm. Moreover, the learned metric swarm represents each sample pair as a similarity vector, so that the face verification problem is cast as a standard SVM-like classification task. Hence, in the defined metric swarm space, we can perform the verification, which greatly improves its robustness under irregular data structures. In this work, we first conduct experiments on several UCI benchmark datasets to address the general classification task. Moreover, on the PubFig and real-world LFW datasets, the experimental results of face verification show that our proposed method achieves better performance than state-of-the-art metric learning approaches. Figure 5.1 illustrates the structure of our GMSL for unconstrained face verification.
Fig. 5.1 Illustration of our GMSL for face verification
Although our GMSL model shares some similarity with the previous approach in [4], our method contains unique innovations compared with [4] and other pre-existing approaches. Generally, the main contributions of this work are threefold:
1. We propose to combine multiple metrics, optimized with weights, into an optimal metric, so that the similarity/dissimilarity between pairwise samples can be correctly measured. In contrast, the relative importance of the two metrics combined in [4] is not considered.
2. Unlike [4], where two different metrics are learned for similarity score computation, the goal of our model is to learn an optimal metric swarm that represents pairwise samples in a generalized framework, so that a binary classifier (e.g., an SVM) can address the matching problem.
3. Compared with [4], where the capacity of the learned metrics is limited because they are learned from one feature type, in this chapter each metric is learned with and without data centralization.
5.2.1 Problem Formulation

We denote the sets of similar and dissimilar pairs as $\mathcal{S}$ and $\mathcal{D}$, respectively. Suppose that we use $G$ latent metrics, defined as $\mathbf{M}_g \in \mathbb{R}^{d\times d}$ ($g = 1, \cdots, G$), to measure the similarity of a sample pair $(\mathbf{x}_i, \mathbf{x}_j)$ through a joint metric swarm function $f_M(\mathbf{x}_i, \mathbf{x}_j)$. Here $\mathbf{x}_i \in \mathbb{R}^{d}$ ($d$ is the dimension of each sample) and $\mathbf{M}$ denotes the metric swarm. $F^{g}_{M_g}(\mathbf{x}_i, \mathbf{x}_j)$ represents the metric function of a single $\mathbf{M}_g$. $L(\cdot)$ and $\Gamma(\cdot)$ represent the loss function and the Lagrange function, respectively. In this chapter, the Frobenius norm and the $\ell_2$-norm are denoted as $\|\cdot\|_F$ and $\|\cdot\|_2$.
In our GMSL method, we aim to learn the joint metric function implied in the data, which is built on a predefined metric swarm $\mathbf{M}$ consisting of $G$ different sub-metrics $\mathbf{M}_1, \mathbf{M}_2, \cdots, \mathbf{M}_G$. Assume that a sample set $\mathcal{V}$ is composed of a similar set $\mathcal{S}$ and a dissimilar set $\mathcal{D}$, i.e., $\mathcal{V} = \mathcal{S}\cup\mathcal{D}$. The introduced joint metric (score) function of a sample pair $(\mathbf{x}_i, \mathbf{x}_j)$ is given by the following definitions.

Definition 5.1 $\forall i, j \in \mathcal{V}$, under the metric swarm $\mathbf{M} \leftarrow \{\mathbf{M}_1, \mathbf{M}_2, \cdots, \mathbf{M}_G\}$, the joint similarity score function $f_M$ of a pair $(\mathbf{x}_i, \mathbf{x}_j)$ is defined as

$$f_M(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{h}^{T}F_M(\mathbf{x}_i, \mathbf{x}_j) \qquad (5.1)$$

where $F_M(\mathbf{x}_i, \mathbf{x}_j) = \big[F^{1}_{M_1}(\mathbf{x}_i, \mathbf{x}_j), \cdots, F^{G}_{M_G}(\mathbf{x}_i, \mathbf{x}_j)\big]^{T} \in \mathbb{R}^{G}$ is the multi-metric distance function of the metric fusion, $\mathbf{h} = [h_1, \cdots, h_G]^{T} \in \mathbb{R}^{G}$, and $F^{g}_{M_g}(\mathbf{x}_i, \mathbf{x}_j)$ is the sub-metric distance function of $\mathbf{M}_g$. The indicator vector $\mathbf{h}$ is a known vector for each sub-metric, as specified in Definition 5.2. The arrow $\leftarrow$ indicates that the sub-metrics $\{\mathbf{M}_1, \mathbf{M}_2, \cdots, \mathbf{M}_G\}$ form $\mathbf{M}$ in some specific way in applications (e.g., Eq. (5.4)).

Definition 5.2 $\forall (i, j) \in \mathcal{S}$, $(i, k) \in \mathcal{D}$, $g \in \{1, \cdots, G\}$, we have

$$h_g = \begin{cases}-1, & \text{if } F^{g}_{M_g}(\mathbf{x}_i, \mathbf{x}_j) \text{ changes negatively with the similarity}\\ +1, & \text{if } F^{g}_{M_g}(\mathbf{x}_i, \mathbf{x}_j) \text{ changes positively with the similarity}\end{cases} \qquad (5.2)$$
136
5 Information Fusion Based on Metric Learning
In view of the above Definitions 5.1 and 5.2, the main idea of the our GMSL is represented as min L(M) +
Mg ,∀g
G γ 4 g 2 g Mg − M F 2
(5.3)
g=1
4 g as a known matrix that is where we denote L(·) as the loss function. Denote M similar to the previous matrix in ITML, denote γ as the regularization parameter, and denote 0 < g < 1 as the contribution coefficient for the g-th sub-metric. In order to clearly understand the metric swarm M, we show it as a large metric denoted by G sub-metrics (diagonal patch) as ⎡ ⎢ M=⎣
⎤
h1 M1 ..
⎥ Gd×Gd ⎦∈
.
(5.4)
hG MG From the above function, the large metric formed via local patches of sub-metrics is denoted as the target metric swarm M. From formula (5.3), we can observe that the G sub-metrics are learned rather than M. We can easily learn the metric swarm by using (5.4) after optimizing G sub-metrics. Generally, in this chapter, we define hinge-loss function L(M) in (5.3) as the following equation. L(M) =
(i,j )∈V=S∪D
1 − yi,j hT FM xi , xj
+
(5.5)
where yi,j represent the label +1(for positive pair) and the label −1(for negative pair), respectively. The discriminative ability of the similarity score function with complicated data could be guaranteed via minimizing the loss function term which is built by a / 0T 1 x ,x ,...,FG x ,x metric swarm, and FM xi , xj = FM · referred in i j i j MG 1 Remarks. In the function FM xi , xj , we can transform the sample pair xi , xj in other form (see Eq. (5.11)) for similarity measure by M, since we can easily see that the dimension of M is not consistent with xi , xj . We use the regularization term to prevent the calculated sub-metric from being distorted, such that the robustness is retained for the changes of data structure.
5.2 Generalized Metric Swarm Learning
137
To address the formula (5.3) attached with the hinge-loss function term (5.5), we propose the relaxing variables ξi,j (xi , xj ), i, j ∈ V and define the novel model with constraints as the following formula. 4 g 2 minMg ,∀g (i,j )∈V ξi,j xi , xj + γ2 G g Mg − M g=1 F s.t. yi,j hT FM xi , xj 1 − ξi,j xi , xj ξi,j xi , xj 0, ∀(i, j ) ∈ V
(5.6)
It will be difficult if we directly optimize the original problem (5.6) with inequal constraints. Therefore we employ Lagrange multiplier approach to address its dual problem in the following section.
5.2.2 Optimization for GMSL As described above section, for the metric swarm M in (5.6) optimization problem, we try to address its dual problem, not the primal problem. Further, we describe the deduction procedure of its dual problem and the optimization algorithm in detail as follows: First, the Lagrange function (·) of (5.6) is defined as (M, ξ ; α, β) =
(i,j )∈V
−
G γ 4 g 2 ξi,j xi , xj + g Mg − M F 2
g=1
αi,j
yi,j hT FM xi , xj − 1 + ξi,j xi , xj
(5.7)
(i,j )∈V
−
βi,j ξi,j xi , xj
(i,j )∈V
where α 0 and β 0 represent the Lagrange multipliers, and αi,j + βi,j = 1 is implied in subsequent dual analysis. The partial derivatives of (M, ξ ; α, β) in (5.7) on Mg and ξi,j xi , xj are calculated via the following equations: ⎧ g ∂FMg (xi ,xj ) ⎨ ∂(M,ξ ;α,β) = γ M − M 4g − α y h g g i,j i,j g (i,j )∈V ∂Mg ∂Mg ⎩ ∂T (M,ξ ;α,β) = 1 − αi,j − βi,j ∂ξij (xi ,xj )
(5.8)
138
5 Information Fusion Based on Metric Learning
;α,β) ;α,β) = 0, sub-metric Mg could be easily obtained by Set ∂(M,ξ = ∂(M,ξ ∂Mg ∂ξi,j (xi ,xj ) the following equations:
*
4g + Mg = M
hg γ g
g
(i,j )∈V αi,j yi,j
βi,j = 1 − αi,j
∂FMg (xi ,xj ) ∂Mg
(5.9)
We conduct the deduction by replacing Mg and βi,j back into (M, ξ ; α, β) in (5.7), the process is shown as follows: (M, ξ ; α, β) 2 g 1 ∂FMg (xi ,xj ) = (i,j )∈V ξi,j xi , xj + γ2 G α y g=1 g γ g (i,j )∈V i,j i,j ∂Mg F T − (i,j )∈V αi,j yi,j h FM xi , xj + (i,j )∈V αi,j − (i,j )∈V αi,j ξi,j xi , xj − (i,j )∈V ξi,j xi , xj + (i,j )∈V αi,j ξi,j xi , xj 2 g ∂FMg (xi ,xj ) 1 G 1 = (i,j )∈V αi,j + 2γ g=1 g (i,j )∈V αi,j yi,j ∂Mg F T − (i,j )∈V αi,j yi,j h FM xi , xj (5.10) Note that h2g = 1 (see Definition 5.2 in this chapter). In order to easily solve the introduced optimization problem of metric swarm M, g we hope FMg xi , xj to be strictly convex on Mg , and thus we can eliminate the g ∂F (xi ,xj ) g variable Mg with Mg∂Mg . Finally, FMg xi , xj is represented as the following expression: g FMg xi , xj = uTg Mg vg , ∀g = 1, . . . , G
(5.11)
where vg and ug are vectors generated by a sample pair xi , xj (see the Theorem 5.1 in this chapter). Based on (5.11), we can calculate that g ∂FMg xi , xj ∂Mg
= ug vTg
(5.12)
By replacing (5.12) back into (5.9), we could get 4 g + hg Mg = M αi,j yi,j ug vTg γ g (i,j )∈V
(5.13)
5.2 Generalized Metric Swarm Learning
139
So, we can rewrite (5.10) step by step as follows: (M, ξ ; α, β) =
αi,j +
g=1
(i,j )∈V
−
G 1 1 || αi,j yi,j ug vTg ||2F 2γ g
αi,j yi,j
G
(i,j )∈V
g hg FMg xi , xj
g=1
(i,j )∈V
2 G 1 1 T = αi,j + αi,j yi,j ug vg 2γ g (i,j )∈V g=1 (i,j )∈V F ⎞ ⎛ G − hg ⎝ αi,j yi,j uTg Mg vg ⎠
g=1
(i,j )∈V
2 G 1 1 T = αi,j + αi,j yi,j ug 2γ g g=1 (i,j )∈V (i,j )∈V F ⎛ ⎛ ⎛ ⎞ ⎞⎞ G 4 g + hg − hg ⎝ αi,j yi,j ⎝uTg ⎝M αi,j yi,j ug vTg ⎠ vg ⎠⎠ γ g
g=1
=
(i,j )∈V
αi,j
(i,j )∈V
−
G g=1
⎛ ⎝hg
(i,j )∈V
2 G 1 1 T + αi,j yi,j ug vg 2γ g (i,j )∈V g=1 (i,j )∈V
⎞ 4 g vg ⎠ − αi,j yi,j uTg M
F
G g=1
1 γ g
2 G 1 T = αi,j − α y u v i,j i,j g g 2γ g g=1 (i,j )∈V (i,j )∈V F ⎛ ⎞ G 4 g vg ⎠ ⎝hg − αi,j yi,j uTg M g=1
2 T αi,j yi,j ug vg (i,j )∈V
F
(5.14)
(i,j )∈V
4 g as a known matrix to further simplify the As described before, we define M 4 g = δg I is defined in this Lagrange function (5.14). Hence, a diagonal matrix M chapter. Observe that I is the identity matrix, simultaneously 0 δg 1. Further,
140
5 Information Fusion Based on Metric Learning
we write the Eq. (5.14) as (M, ξ ; α, β) 2 1 T α y u v = (i,j )∈V αi,j − G g=1 2γ g (i,j )∈V i,j i,j g g F G − g=1 δg hg (i,j )∈V αi,j yi,j uTg vg 2 1 T α y u v = (i,j )∈V αi,j − G i,j i,j g g g=1 2γ g (i,j )∈V F +δg hg (i,j )∈V αi,j yi,j uTg vg
(5.15)
Specifically, based on the above discussion, we summarize the Lagrange dual formulation of GMSL expression (5.6) as g Theorem 5.1 With FMg xi , xj = uTg Mg vg , ∀g = 1, . . . as a precondition, G and 4 g = δg I, 0 δg 1, we can write the dual problem of our GMSL model (5.6) M as ⎛ 2 ⎞ G 1 T α y u v i,j i,j g g ⎠ (i,j )∈V ⎝ 2γ g (5.16) αi,j − max F T 0α1 +δg hg (i,j )∈V αi,j yi,j ug vg g=1 (i,j )∈V where ug and vg can be, but not limited to, the following cases: a. u1 = xi − xj , v1 =xi − xj − → − → − → → − b. u2 = xi − xi ◦ 1 − xj − xj ◦ 1 , v2 = xi − xi ◦ 1 − xj − xj ◦ 1 c. u3 = xi , v3 = xj − → → − d. u4 = xi − xi ◦ 1 , v4 = xj − xj ◦ 1 → − where 1 represents a full one vector, x represents the mean of vector x, and ◦ represents element-wise multiplication. Obviously, the common Mahalanobis distance measure is used in case a, which simultaneously is implied in case b with data centralization; meanwhile, in case c and d, we can use similarity metric through bilinear function. g ∂F (xi ,xj ) g Observe that FMg xi , xj = uTg Mg vg is required due to the steady Mg∂Mg = ug vTg , such that the optimization becomes easier. Simultaneously, the known 4 g = δg I is defined as a diagonal matrix to avoid distorting the learned matrix M metric swarm and make it robust. Obviously, in Theorem 5.1, we regard the expression (5.16) as a standard quadratic programming (SQP) problem and QP solvers could easily address it. Nevertheless, the QP solvers with intrinsic point approaches will become inefficient when the number of training sample pairs is increasing. Thus, based on the optimized first order algorithm achieved in [4, 29], the solution algorithm is proposed in this chapter.
5.2 Generalized Metric Swarm Learning
141
If the optimal result a is obtained, we could learn and calculate the sub-metric Mg (g = 1, . . . , G) via the following formula based on Karush-Kuhn-Tucker (KKT) constraint. M∗g = δg I +
hg ∗ αi,j yi,j ug vTg γ g
(5.17)
(i,j )∈V
5.2.3 Solving with Model Modification We observe from the formula (5.16) where exists a coefficient in the iterations. In order to solve this coefficient, we suggest the following novel model. Observe q that to differentiate the in (5.16), ϑ and g ← ϑg are used. The constraints G g=1 ϑg = 1 and 0 < ϑg < 1 are imposed. To solve the coefficient ϑ, we show the solution in detail as follows. First, we slightly modify the model (5.6) as q 4 g 2 minMg ,g ,∀g (i,j )∈V ξi,j xi , xj + γ2 G ϑg Mg − M g=1 F s.t. yi,j hT FM xi , xj 1 − ξi,j xi , xj ξi,j xi , xj 0, ∀(i, j ) ∈ V, G g=1 ϑg = 1, q > 1
(5.18)
q
Note that if q > 1 we can use q-square ϑg for avoiding the trivial result like g = 0 or g = 1, whose aim is to learn the best metric. Nevertheless, in order to improve the robustness of the method, we hope to use the data of all sub-metrics in the ∂ M ,ϑ learning process. The partial derivative ( g g ) of the Lagrange Eq. (5.18) is set ∂Mg
to 0, such that obtain the following expression: *
q−1 γ qϑg Mg − 2 G g=1 ϑg = 1
4 g 2 = μ M F
(5.19)
where μ represent the Lagrange multiplier. From (5.19), we could easily calculate ϑg as follows: # ϑg =
1
Mg − M 4 g 2 F
1/(q−1) /
G g=1
#
1
Mg − M 4 g 2 F
1/(q−1) (5.20)
In above expression, we set q to 2 and set the initial ϑg to 1/G. We introduce another optimization algorithm to address the proposed GMSL, as shown in Algorithm 5.1.
142
5 Information Fusion Based on Metric Learning
Algorithm 5.1 GMSL Input: Similar pairwise sample set S Dissimilar pairwise sample set D; Indicator vector h = [h1 , . . . , hG ]T Initialize M(0) g , g = 1, . . . , G Initialize ϑg(0) ← 1/G, g = 1, . . . , G and t ← 0 Procedure: Repeat 1. Update α (t) by solving the dual optimization problem (5.16) using FISTA algorithm; d×d , g = 1, . . . , G in metric swarm by using (5.17) and α (t) 2. Update the sub-metric M(t) g ∈ (t) 3. Update ϑg , g = 1, . . . , G 4. t ← t + 1 until convergence. 0 / Output: (t) with (5.4) M(t) = diag h1 M(t) 1 , . . . , hG MG
5.2.4 Representation of Pairwise Samples in Metric Swarm Space In order to match sample pairs with SVM, we propose to denote each sample pair as a vector in metric swarm space, thus SVM can realize similar binary classification. In other words, we hope that each vector can correspond to a similar face. For positive and negative pairs, labels +1 and −1 are used, respectively, so that the discrimination classifier can be used to realize the verification task. Obviously, the merit is that it not only inherits the advantages of metric learning and common classifier but also avoids using a simple metric function score threshold for verification. Moreover, we learn the discriminant SVM classifier in the represented metric swarm space, so that the verification solution is more robust in collecting image when exist the noise, corruption, etc. Thus, considering the learned metrics, we further represent the similarity vector (MS-space) for each sample pair xi , xj as follows: / 1 0T G MS xi , xj = FM , x , x ∈ G x , . . . , F x i j i j MG 1
(5.21)
5.2.5 Sample Pair Verification Based on the learned metric swarm in (5.17), according to (5.21) we represent all sample face pairs in the similarity vector space. At the same time, we define the labels of positive as +1, and negative face pairs as −1, respectively. Finally, we transform the verification task into a standard binary classification problem. We can easily solve this problem by using the standard SVM of the ordinary toolbox.
5.2 Generalized Metric Swarm Learning
143
Fig. 5.2 A diagram of the proposed GMSL for classification. Left: the circles with different colors denote similar pairs, the squares with different colors denote dissimilar pairs; medium: the triangles denote the represented vector space of similar pairs, and the diamonds denote the represented vector space of dissimilar pairs; right: the binary classification in the learned metric space. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article)
The main structure of face verification is illustrated in Fig. 5.1. Generally speaking, the proposed model is also self-adapting in classification. A diagram of the method proposed in the common classification is shown in Fig. 5.2. In Theorem 5.1, in view of their strict convex property on Mg , four sub-metrics in the metric swarm based on similarity metrics and distance are learned. Firstly, to evaluate the contribution of each sub-metric, the coefficients g are pre-trained by (5.20) Then, when the coefficients are fixed, the dual problem (5.16) is solved by using the fast iterative shrinkage thresholding algorithm (FISTA) [29]. Based on optimization of the metric swarm, we reconstruct the discriminative space via the metric swarm space representation (5.21). Next, we transform the sample pairs into vectors, respectively. Further, we implement to estimate the verification of similar or not similar via a standard SVM in this part. It should be noted that we describe the verification procedure as Algorithm 5.2. Algorithm 5.2 GMSL for verification Input: similar pairwise sample set S; Dissimilar pairwise sample set D Parameters γ, δ and . Procedure: Step 1. GMSL optimization. Obtain the metric swarm M by using Algorithm (5.1). Step 2. Metric swarm spacerepresentation. Represent each pair ( xi , xj as a similarity vector by using Eq. (5.18) with M; Step 3.Verification. Train and Train and Drain and test a standard SVM in the represented metric swarm space by using 10-fold cross validation. Output: Verification result.
144
5 Information Fusion Based on Metric Learning
5.2.6 Remarks 5.2.6.1 Joint Metric Swarm Score Function From Theorem 1, we could define the final joint metric score function fM xi , xj / 1 0T G in Eq. (5.1) with FM xi , xj = FM x , . . . , F x , x , x ∈ RG , and h = i j i j M 1 G [h1 , . . . , hG ]T ∈ RG as the following expression via utilizing the predefined size of the metric swarm M ≡ {M1 , . . . , M4 } about the four cases of a, b, c, and d. fM xi , xj = hT FM xi , xj = −FM1 xi , xj − FM2 xi , xj + FM3 xi , xj + FM4 xi , xj = −uT1 M1 v1 − uT2 M2 v2 + uT3 M3 v3 + uT4 M4 v4 ⎡ ⎤⎡ ⎤ v1 −M1 0⎢ / ⎢ ⎥ ⎥ −M2 ⎥ ⎢ v2 ⎥ = uT1 , uT2 , uT3 , uT4 ⎢ ⎣ ⎣ ⎦ M3 v3 ⎦ M4
v4 (5.22)
where according to Definition 5.2 set h = [−1, −1, 1, 1]T, u1 , . . . , u4 and v1 , . . . , v4 are denoted in Theorem 5.1. Here, itis known that when the similarity between xi and xj increases, if the g score FMg xi , xj has a positive change, we set hg = 1; if it has a negative change, set hg = −1. Based on above, it indicates that if they are similar, the high joint score function (5.22) of a sample pair xi , xj is necessary in the GMSL. From (5.22), obviously, we can divide the optimization of presented metric swarm M into four sub-metrics in diagonal. Therefore, with the complicated data structure, it is more robust for the learned metric via local patches. Inspired by the proposed GMSL, we summarize some new views on metric learning: we could optimize the metric M in patches, based on which the similarity of sample pairs is calculated.
5.2.6.2 Optimality Condition Via examining the duality gap (DualityGap), the optimality conditions of GMSL are studied. The dual gap here is expressed as the discrepancy between the original objective function and the dual objective function (5.16) (with the q-power on g
5.2 Generalized Metric Swarm Learning
145
at the t-th iteration, i.e., DualityGap (t) =
(i,j )∈V
G q (t) γ (t) 4 2 (t) ξi,j xi , xj + − M αi,j (t) M − g g g F 2
⎛
g=1
2 ⎞ (t) T G 1 q α y u v i,j g g ⎟ (i,j )∈V i,j ⎜ 2γ θg(t) F⎠ + ⎝ (t) g=1 +δg hg (i,j )∈V αi,j yi,j uTg vg
(i,j )∈V
(5.23) where the error term is computed as / 0 (t ) ξi,j xi , xj = 1 − yi,j hT FM(t) xi , xj
+
(5.24)
Here, yi,j = 1 is defined as similar pairs, and -1, otherwise. We could calculate the duality gap via replacing (5.22) and (5.24) into (5.23). For instance, the duality gap curve is learned via utilizing LFW data within 10 iterations, as shown in Fig. 5.3
Fig. 5.3 Duality gap vs. the iterations on LFW data
146
5 Information Fusion Based on Metric Learning
where it can be seen that GMSL can converge within 10 iterations and obtain a global optimal solution. Therefore, the superiority of GMSL is clear.
5.2.7 Experimental Results In the experiments, UCI benchmark datasets are used to estimate the our GMSL model for classification task, while LFW and PubFig datasets for unconstrained face verification task. To verify the superiority of our model, most current metric learning approaches are compared, including LMNN [8], Sparse-ML [29], CSML [3], DML-eig [7], ITML [6], LDML [1], Sub-SML [4], and deep metric learning method DDML [30].
5.2.7.1 Parameter Setting In our experiments for this chapter, we, respectively, set the parameters of GMSL model γ and δ to 100 and 0.5. The parameter θg is initialize to 1/G.
5.2.7.2 Test Results on UCI Datasets We first select the representative 8 UCI benchmark datasets introduced in Table 5.1 from UCI machine learning repository [31] to evaluate proposed GMSL model in our experiments. The aim of the experiments is to study the main superiority for common classification task with our GMSL model. When we conduct the experiments, 10-folds cross validation is used and the mean classification accuracy is taken as the evaluation criteria for the comparisons. It is clear that we should take similar/dissimilar sample pairs as the inputs of algorithm. As shown in Fig. 5.4, popular metric learning methods, including the Sparse-ML [32], ITML [6], DML-eig [7], Sub-SML [4], LMNN [8], have been compared with our GMSL model via classification error rate. It can be seen that our GMSL model Table 5.1 Description of 8 UCI datasets Dataset Wine Iris SPECTF-heart Statlog-heart UserKnowledge ILPD Sonar Seeds
Feature dimensions 13 4 44 13 5 10 60 7
# of classes 3 3 2 2 4 2 2 3
# of training samples # of test samples 125 53 105 45 80 187 189 81 101 44 525 58 188 20 147 63
Fig. 5.4 Classification error (%) of Euclidean, DML-eig, LMNN, ITML, Sparse ML, Sub-SML, and the proposed GMSL metric learning methods
is more excellent than other learning methods on the specific five datasets. However, for other three datasets(5,7,8), it is slightly inferior to ITML. Further, based on all datasets, we have made an overall comparison and the average rank of error rate are shown in Fig. 5.5. From Fig. 5.5, we can easily see that our proposed GMSL model has achieved the best performance with the lowest error rate than the other seven approaches. Based on the experimental results, the superiority of our GMSL model in common classification task is verified.
5.2.7.3 Test Results on LFW Faces LFW (Labeled Faces in the Wild) is generally regarded as a difficult dataset for unconstrained face verification task, which is composed of 13,233 face images with 5749 persons [5]. Confined and unconfined protocols are existed in LFW. The only obtained information in confined protocol is whether each sample pair illustrates the same subject or not. For image unconfined protocols, we could get the identity information of every face image and also use additional face images
Fig. 5.5 Average rank of the classification error rate
with known identities. In order to compare with most advanced approaches, we use image confined protocol with the identical experiment setting in this chapter. The extracted 300 dimensional LBP and SIFT low-level image features from [4] reducted dimension by PCA method are used to analyze. In addition, we also used the attribute feature with 73 dimensions such as gender, race, hair style, age, etc. from [33]. We evaluate the effectiveness of a face verification algorithm by 10-fold cross validation where 300 positive and 300 negative image pairs are included for every fold. Naturally, it is known that in the every fold cross validation, 300 positive and 300 negative image pairs from 1 fold are taken as testing set. Simultaneously, 2700 positive and 2700 negative image pairs from the remaining 9 folds are taken as training set. For better demonstrating the superiority of the our GMSL model, in Table 5.2, we have described the verification accuracy of every single feature (%) by training every metric Mi (i = 1, . . . , 4) separately. As shown in Table 5.2, the GMSLComb represents the direct aggregation of the, respectively, learned 4 metrics rather than joint learning. The experiment demonstrates that it is more superior for the GMSL joint learning method than the combined method. Moreover, we have conducted the comparison experiments by calculating each metric, respectively, based on numerous features, and the experimental results are shown in Table 5.3. It can be observed that the proposed GMSL joint learning structure can bring obvious promotion. From Tables 5.2 and 5.3, we can clearly observe that comparing with learning single metric, learning multiple metrics are more effective, and the association and relationship among different metrics could not be exploited via the direct aggregation of multiple metrics. We report the mean accuracy (%) and standard deviation of 10 folds, as shown in Table 5.4, where the experimental results, respectively, illustrate the performance of the several current metric learning approaches on dataset LFW for SIFT, LBP, and Attribute features. It can be seen that several experimental results are missing since
Single feature LBP SIFT Attribute
GMSL-M1 86.10 ± 0.49 84.40 ± 0.43 84.10 ± 0.59
GMSL-M2 86.17 ± 0.46 84.38 ± 0.41 84.05 ± 0.52
GMSL-M3 85.17 ± 0.49 82.93 ± 0.28 81.78 ± 0.44
GMSL-M4 85.08 ± 0.51 82.75 ± 0.32 79.78 ± 0.79
Table 5.2 Comparisons with baselines of single metric and combined metrics with single feature on LFW dataset GMSL-Comb 86.68 ± 0.46 84.85 ± 0.45 85.45 ± 0.64
GMSL 87.12 ± 0.41 85.98 ± 0.40 85.62 ± 0.69
Multiple features LBP + SIFT LBP + Attribute SIFT + Attribute LBP + SIFT + Attribute
GMSL-M1 87.82 ± 0.53 89.10 ± 0.44 87.30 ± 0.36 89.67 ± 0.35
GMSL-M2 87.82 ± 0.45 88.98 ± 0.45 87.23 ± 0.36 89.73 ± 0.39
GMSL-M3 86.73 ± 0.38 87.85 ± 0.38 85.92 ± 0.52 88.22 ± 0.57
GMSL-M4 86.70 ± 0.38 87.60 ± 0.54 85.98 ± 0.60 88.17 ± 0.60
Table 5.3 Comparisons with baselines of single metric and combined metrics with multiple features on LFW dataset GMSL-Comb 87.98 ± 0.61 89.40 ± 0.49 87.95 ± 0.38 90.06 ± 0.41
GMSL 89.35 ± 0.44 89.45 ± 0.41 88.20 ± 0.51 90.67 ± 0.46
Method LBP SIFT Attribute
SILD [32] 80.07 ± 1.35 80.85 ± 0.61 –
ITML [9] 79.98 ± 0.39 78.12 ± 0.45 84.00
DML-eig [10] 82.28 ± 0.41 81.27 ± 2.30 –
LDML [5] 80.65 ± 0.47 77.50 ± 0.50 83.40
KissME [34] 83.37 ± 0.54 83.08 ± 0.56 84.60
Table 5.4 Comparisons with the state-of-the-art metric learning methods on LFW dataset CSML [7] 85.57 ± 0.52 – –
Att.&Sim. [26] – – 85.29
Sub-SML [8] 86.73 ± 0.53 85.55 ± 0.61 84.77
GMSL 87.12 ± 0.41 85.98 ± 0.40 85.62
Fig. 5.6 The weights of MetricFusion for different q-values (a) and performance curve of three features with different q-values (q = 2, . . . , 10)
they are not provided in their previous works. Additionally, we do not compare the LMNN for face verification in our study because the triplet information of image pairs are required. As we see from the experimental results, it is very clear that for SIFT, LBP, and Attribute features, our model, respectively, achieves better performance than other popular approaches. In view of a learned threshold, for verification task, the experiments are conducted via applying the joint metric score in Eq. (5.22). In the experiments, the accuracy rates are 85.6 ± 0.45 in percent, 87.0 ± 0.52 in percent, 85.2 ± 0.68 in percent for SIFT, LBP, and Attribute features, respectively. Besides, for further verifying the performance with different q-values and deeper understanding the superiority of the proposed MetricFusion, some investigations on that how q-value performs on the metric weights and face verification have been carried out, as shown in Fig. 5.6. The θ values (weights of metric fusion) for respective q-values are shown in Fig. 5.6a. The effectiveness curve of respective q-values for three features given in this work is shown in Fig. 5.6b. In the effectiveness curve of Fig. 5.6b, it is clear that the best performance of MetricFusion is obtained when q = 2. The weight of every metric for q = 2 is illustrated in Fig. 5.6a. The experimental results show similar performance change for three features with increasing q-value. When set the θ = 0.25 in our experiment, the values of accuracy for 4 metrics in average are 83.9%, 84.9%, and 85.2% for SIFT, LBP, and Attribute features, respectively. Based on the above results, we can see that the method simply averaging multiple metrics cannot obtain the best performance. Particularly, by comparing with several popular metric learning methods on constrained LFW, we carry out a methodical comparison and demonstrate the best performance in Table 5.5 where the number of feature descriptors is denoted as NoD. As shown in Table 5.5, it is clear that our model with the best face verification accuracy of 90.67% through 3 descriptors outperforms Sub-SML [4] with stateof-the-art 89.73% through 6 descriptors. If Sub-SML is implemented on the same 3 descriptors, we could obtain the 89.33% accuracy with 1.34% lower than ours.
Table 5.5 Accuracy (%) comparison with the state-of-the-art results on the LFW dataset under the image restricted protocol

Method | Features | NoD | Accuracy
Combine b/g samples [4] | SIFT, LBP, etc. | 8 | 86.83 ± 0.34
LDML + SVM [31] | SIFT, LBP, etc. | 8 | 79.27 ± 0.60
DML-eig + SVM [11] | SIFT, LBP, etc. | 8 | 85.65 ± 0.56
SILD + SVM [21] | Intensity, LBP, etc. | 8 | 85.78 ± 2.05
CSML + SVM [35] | Intensity + LBP, etc. | 6 | 88.00 ± 0.37
HTBI [33] | Inspired features | 16 | 88.13 ± 0.58
Att.&Sim. classifiers [19] | Attributes | 1 | 85.29
Sub-SML + SVM [8] | LBP + SIFT | 2 | 88.87 ± 0.60
Sub-SML + SVM [8] | LBP, SIFT, Attributes | 3 | 89.33 ± 0.54
SFRD + PMML [18] | Spatial-temporal | 8 | 89.35 ± 0.50
Sub-SML + SVM [8] | LBP, SIFT, etc. | 6 | 89.73 ± 0.38
SEAML [23] | SIFT, Attributes | 2 | 87.50 ± 1.30
ITML + Multiple OSS [18] | SIFT, LBP, etc. | 16 | 89.50 ± 1.58
GMSL | LBP, SIFT | 2 | 89.35 ± 0.44
GMSL | LBP, SIFT, Attributes | 3 | 90.67 ± 0.46
We have also run the experiments for other methods. For example, SFRD + PMML and ITML + Multiple OSS are implemented with 8 and 16 descriptors, respectively, and obtain inferior results of 89.35% and 89.50%. Moreover, we report the ROC curves and AUCs (area under the curve) of the most popular approaches, as shown in Fig. 5.7, where it is clearly observed that the presented GMSL method surpasses these popular learning methods in the restricted setting. In addition, several recent deep learning methods are also compared with our GMSL model for face verification on LFW under the restricted protocol, including discriminative shallow metric learning (DSML) [30], deep non-linear metric learning with independent subspace analysis (DNLML-ISA) [35], the convolutional deep belief network (CDBN) [36], and discriminative deep metric learning (DDML) [30]. The comparison results are shown in Table 5.6. Although DDML achieves the highest accuracy of 90.68% with 6 descriptors, our GMSL model reaches 90.67% with only 3 descriptors, merely 0.01% lower than DDML. To be more intuitive, we show the ROC curves of DDML and GMSL in Fig. 5.8, from which we can observe that GMSL obtains performance comparable to DDML. In this work, we simultaneously learn four metrics in our GMSL model. To gain a deeper understanding of the proposed GMSL model, we also conduct experiments with each single metric and with the straightforward integration of the four separately learned metrics. Figure 5.9 illustrates the error rates on LFW for different feature types in the restricted setting when the four learned metrics and their direct integration are applied. It is clear that our GMSL model, which combines multiple metrics learned synchronously, obtains better performance than the others. Among the four single metrics, M1 and M2 are superior to M3 and M4.
Fig. 5.7 Comparisons of ROC curves and AUCs between our GMSL and the state-of-the-art methods on LFW

Table 5.6 Accuracy (%) comparisons with existing deep metric learning on LFW data in the restricted protocol

Method | Features | NoD | Accuracy
CDBN [34] | Image descriptors, etc. | 6 | 86.88 ± 0.62
CDBN + Hand-crafted | Hand-crafted, etc. | 12 | 87.77 ± 0.62
DNLML-ISA [37] | LBP, SIFT, etc. | 8 | 88.50 ± 0.40
DSML [27] | LBP, SIFT, etc. | 6 | 87.45 ± 1.45
DDML [27] | Sparse SIFT (SSIFT) | 1 | 87.83 ± 0.93
DDML [27] | LBP, SIFT, etc. | 6 | 90.68 ± 1.41
GMSL | LBP, SIFT | 2 | 89.35 ± 0.44
GMSL | LBP, SIFT, Attribute | 3 | 90.67 ± 0.46
The straightforward integration of the four separately learned metrics does not help to improve the performance, which, as an advantage of our GMSL, indicates the significance of learning the latent metrics synchronously.
Fig. 5.8 Comparison of ROC curves and AUCs
5.2.7.4 Test Results on PubFig Faces
The Public Figures (PubFig) dataset has properties similar to LFW for unconstrained face verification [33]. It is composed of 58,797 face images from 200 individuals; several face images from PubFig are shown in Fig. 5.10. To evaluate the performance, we apply 10-fold cross validation with 20,000 pairs from 140 people, where every fold has 1000 identical pairs and 1000 different pairs, and we report the mean accuracies of several current metric learning methods. We compare the GMSL model with popular metric learning methods on the PubFig dataset under the constrained setting, and the comparison results are given in Table 5.7, where some results come from [37]. It can be observed that our GMSL, with an accuracy of 78.5%, achieves about a 1% improvement over KissMe [37]. The ROC curves of our model and Sub-SML are shown in Fig. 5.11, which clearly demonstrates the superiority of our GMSL model.
Fig. 5.9 Comparison with single metric and combination of four metrics. 1: LBP; 2: SIFT; 3: Attribute; 4: LBP + SIFT; 5: LBP + Attribute; 6: SIFT + Attribute; and 7: LBP + SIFT + Attribute
Fig. 5.10 Some intra-class and inter-class pairs of faces in PubFig
5.2.8 Conclusion
In this chapter, we propose to learn a metric swarm in local patches and to reconstruct a vectorized similarity space for general classification tasks and unconstrained face verification.
Table 5.7 Performance comparisons between our GMSL and other metric learning methods on PubFig faces

Method | Euclidean | LMNN | ITML | DML-eig | LDML | KissMe | Sub-SML | GMSL
Accuracy (%) | 72.5 | 73.5 | 69.3 | 77.4 | 77.6 | 77.6 | 77.3 | 78.5
Fig. 5.11 ROC curves and AUCs of Sub-SML and GMSL
Particularly, in our GMSL model, the target is to obtain a metric swarm by learning local patch-based sub-metrics synchronously with a regularized metric learning model. The dual problem of GMSL is formulated as a quadratic programming problem, which can be solved effectively with the FISTA algorithm in a selective optimization manner, and the local patch sub-metrics are then recovered from the dual solution. With the solved metric swarm, the sample pairs are transformed into a vectorized similarity space (metric swarm space) through a constructed joint similarity function. Based on the above, the face verification problem can be easily solved as an SVM-like classification task in the represented space. The experimental results on several benchmark UCI datasets demonstrate the superiority of our model in general classification tasks. Moreover, on the real-world LFW and PubFig face datasets, our experiments under the restricted setting show the effectiveness of our GMSL for unconstrained face verification.
5.3 Combined Distance and Similarity Measure
Distance and similarity measures are often regarded as complementary for pattern classification [4, 26]. Several methods have been introduced to integrate distance and similarity measures under pairwise constraints. Nevertheless, fewer studies have been carried out on the joint learning of distance and similarity measures under triplet constraints. Moreover, the kernel extension of a triplet-based model is nontrivial and computationally expensive. In this chapter, a new approach is suggested to learn a combined distance and similarity measure (CDSM) [28]. By combining it with the max-margin model, a triplet-based CDSM learning model is proposed with a unified Frobenius-norm regularizer, and the resulting optimization problem is solved by a support vector machine (SVM)-based algorithm. Besides, CDSM is extended to learn non-linear measures through the kernel trick, and two useful strategies are employed to accelerate the training and testing of kernelized CDSM. Experiments are conducted on handwritten digit, UCI, and person re-identification datasets, and the results show that the CDSM and kernelized CDSM methods are superior to many existing metric learning approaches.
5.3.1 Problem Formulation
In this section, the combined distance and similarity measure (CDSM) is first presented and explained from a pairwise kernel viewpoint. Then, a triplet-based method for learning the combined distance and similarity measure is proposed.
5.3.1.1 CDSM and Pairwise Kernel Explanation
Given a pair of samples $x_i$ and $x_j$, the combined distance and similarity measure (CDSM) is defined as

$$C(x_i, x_j) = \mu x_i^T S x_j - (1-\mu)(x_i - x_j)^T M (x_i - x_j) \tag{5.25}$$

where $\mu$ ($0 \le \mu \le 1$) balances the distance and similarity measures. Unlike [4], the positive semi-definite (PSD) constraint on $S$ and $M$ is removed in Eq. (5.25), so CDSM is only a generalized similarity measure, not a metric. As shown in [34, 38], ignoring the PSD constraint has little influence on classification, while the training efficiency can be greatly improved. The CDSM can be explained from the pairwise kernel perspective. Several pairwise kernels have been proposed, such as the metric learning pairwise kernel and the tensor learning pairwise kernel. Given two pairs of samples $(x_k, x_l)$ and $(x_i, x_j)$,
we define the tensor learning pairwise kernel with a linear basis kernel [39] as

$$K_{TL}((x_i, x_j), (x_k, x_l)) = \mathrm{tr}\big(x_i x_j^T x_k x_l^T + x_j x_i^T x_k x_l^T\big) \tag{5.26}$$

and the metric learning pairwise kernel with a linear basis kernel [39] as

$$K_{ML}((x_i, x_j), (x_k, x_l)) = \mathrm{tr}\big((x_i - x_j)(x_i - x_j)^T (x_k - x_l)(x_k - x_l)^T\big) \tag{5.27}$$
The training set of $P$ pairs is represented by $\{(x_{1,1}, x_{1,2}), \ldots, (x_{p,1}, x_{p,2}), \ldots, (x_{P,1}, x_{P,2})\}$. The decision function with $K_{TL}$ can be written as

$$\sum_{p=1}^{P} \alpha_p K_{TL}((x_i, x_j), (x_{p,1}, x_{p,2})) = x_i^T \Big( \sum_{p=1}^{P} \alpha_p (x_{p,1} x_{p,2}^T + x_{p,2} x_{p,1}^T) \Big) x_j = x_i^T S x_j \tag{5.28}$$

and the decision function with $K_{ML}$ can be written as

$$\sum_{p=1}^{P} \alpha_p K_{ML}((x_i, x_j), (x_{p,1}, x_{p,2})) = (x_i - x_j)^T \Big( \sum_{p=1}^{P} \alpha_p (x_{p,1} - x_{p,2})(x_{p,1} - x_{p,2})^T \Big) (x_i - x_j) = (x_i - x_j)^T M (x_i - x_j) \tag{5.29}$$

where $S = \sum_{p=1}^{P} \alpha_p (x_{p,1} x_{p,2}^T + x_{p,2} x_{p,1}^T)$ and $M = \sum_{p=1}^{P} \alpha_p (x_{p,1} - x_{p,2})(x_{p,1} - x_{p,2})^T$. It cannot be guaranteed that the matrices $S$ and $M$ are positive semi-definite (PSD), because some of the coefficients $\{\alpha_1, \alpha_2, \ldots, \alpha_P\}$ may be negative. Therefore, Eq. (5.25) is a measure rather than a metric. Then, the CDSM can be written as the combination of Eqs. (5.28) and (5.29),

$$C(x_i, x_j) = \sum_{p=1}^{P} \alpha_p \big[ \mu K_{TL}((x_i, x_j), (x_{p,1}, x_{p,2})) - (1-\mu) K_{ML}((x_i, x_j), (x_{p,1}, x_{p,2})) \big] = \mu x_i^T S x_j - (1-\mu)(x_i - x_j)^T M (x_i - x_j) \tag{5.30}$$
Thus, the CDSM could be considered as the combination of metric learning and tensor learning pairwise kernels, which is similar to multiple kernel learning (MKL).
In contrast, in CDSM a positive weight is explicitly assigned to the tensor learning pairwise kernel and a negative weight to the metric learning pairwise kernel, whereas in MKL the combination weights are required to be nonnegative. Note that, even with the negative weight, our model remains convex and can be solved effectively. In addition, since the matrices $S$ and $M$ measure similarity and dissimilarity, respectively, it is natural to assign the positive weight to the tensor learning pairwise kernel and the negative weight to the metric learning pairwise kernel. The experimental results also demonstrate that this combination scheme achieves better performance than either of the two individual models, i.e., the tensor learning pairwise kernel (similarity measure) alone and the metric learning pairwise kernel (distance measure) alone.
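To make the measure concrete, the following is a minimal NumPy sketch of Eq. (5.25); the matrices S and M are placeholders that would normally come from the learning procedure described below, and the function name is only illustrative.

```python
import numpy as np

def cdsm(xi, xj, S, M, mu):
    """Combined distance and similarity measure of Eq. (5.25).

    The bilinear term mu * xi^T S xj rewards similarity, while the
    Mahalanobis-like term (1 - mu) * (xi - xj)^T M (xi - xj) penalizes
    the distance between the pair; larger values indicate a more similar pair.
    """
    diff = xi - xj
    return mu * (xi @ S @ xj) - (1.0 - mu) * (diff @ M @ diff)

# Toy usage with identity matrices standing in for the learned S and M.
rng = np.random.default_rng(0)
d = 5
xi, xj = rng.normal(size=d), rng.normal(size=d)
print(cdsm(xi, xj, S=np.eye(d), M=np.eye(d), mu=0.5))
```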
5.3.1.2 Triplet-Based Learning Model
The triplet-based learning model proposed in this work consists of three parts: triplet-based constraints, a regularization term, and a margin loss. In the following, these three parts are introduced first; then the final model is given; finally, the convexity of the model is discussed.

Triplet-Based Constraints We define a set of $T$ triplets $\mathcal{T} = \{(x_{1,1}, x_{1,2}, x_{1,3}), \ldots, (x_{t,1}, x_{t,2}, x_{t,3}), \ldots, (x_{T,1}, x_{T,2}, x_{T,3})\}$, where $x_{t,1}$ and $x_{t,2}$ are from the same class (i.e., similar) and $x_{t,1}$ and $x_{t,3}$ are from different classes (i.e., dissimilar). Several methods have been proposed to create the triplet set from the training set $D = \{(x_i, y_i)\,|\,i = 1, 2, \ldots, n\}$, in which $x_i \in \mathbb{R}^d$ is the $i$-th training sample and $y_i$ is its class label. In LMNN, the triplet set is constructed by finding the k-nearest neighbors of each $x_i$. For each triplet $(x_{t,1}, x_{t,2}, x_{t,3})$, we require the combined distance and similarity measure of the similar pair $(x_{t,1}, x_{t,2})$ to be larger than that of the dissimilar pair $(x_{t,1}, x_{t,3})$ by a large margin, which can be written as

$$C(x_{t,1}, x_{t,2}) - C(x_{t,1}, x_{t,3}) \ge 1, \quad \forall t \tag{5.31}$$

A slack variable $\xi_t$ is further introduced to form the soft triplet-based constraint

$$C(x_{t,1}, x_{t,2}) - C(x_{t,1}, x_{t,3}) \ge 1 - \xi_t, \quad \xi_t \ge 0, \quad \forall t \tag{5.32}$$

If the constraint of Eq. (5.31) is met, $\xi_t$ is zero; otherwise the minimal $\xi_t$ that holds the inequality is used. Moreover, the pairwise kernels in Eqs. (5.26) and (5.27) can be extended to the corresponding triplet kernels,

$$K_{TL}((x_{t,1}, x_{t,2}, x_{t,3}), (x_{s,1}, x_{s,2}, x_{s,3})) = \mathrm{tr}\big((X_{t,1} + X_{t,2})(X_{s,1}^T + X_{s,2}^T)\big) \tag{5.33}$$

$$K_{ML}((x_{t,1}, x_{t,2}, x_{t,3}), (x_{s,1}, x_{s,2}, x_{s,3})) = \mathrm{tr}\big(U_t U_s^T\big) \tag{5.34}$$
where $X_{t,1} = x_{t,1}x_{t,2}^T - x_{t,1}x_{t,3}^T$, $X_{s,1} = x_{s,1}x_{s,2}^T - x_{s,1}x_{s,3}^T$, $X_{t,2} = x_{t,2}x_{t,1}^T - x_{t,3}x_{t,1}^T$, $X_{s,2} = x_{s,2}x_{s,1}^T - x_{s,3}x_{s,1}^T$, $U_t = (x_{t,1}-x_{t,2})(x_{t,1}-x_{t,2})^T - (x_{t,1}-x_{t,3})(x_{t,1}-x_{t,3})^T$, and $U_s = (x_{s,1}-x_{s,2})(x_{s,1}-x_{s,2})^T - (x_{s,1}-x_{s,3})(x_{s,1}-x_{s,3})^T$. Then, we rewrite the triplet-based constraint in Eq. (5.32) as

$$\mathrm{tr}(X_{t,1}+X_{t,2}, \mu S) - \mathrm{tr}(U_t, (1-\mu)M) \ge 1 - \xi_t, \quad \xi_t \ge 0, \quad \forall t \tag{5.35}$$

Penalty on ξ A margin loss term is introduced so that the constraints in Eq. (5.31) are violated to the smallest possible extent. Similar to SVM, the sum of all slack variables is chosen as the margin loss,

$$\rho(\xi) = \sum_{t=1}^{T} \xi_t \tag{5.36}$$

To satisfy the constraints in Eq. (5.31) as far as possible, the CDSM measure can be obtained by minimizing the above margin loss term.

Regularization Term To promote the generalization ability of the learning model, we further add regularization terms on $M$ and $S$. The Frobenius norm [40] is usually taken as a general regularizer in metric learning, i.e., $\|M\|_F^2$ and $\|S\|_F^2$. In [4], Cao et al. proposed another regularizer that makes the learned matrix approximate the identity matrix, i.e., $\|M-I\|_F^2$ and $\|S-I\|_F^2$. In this chapter, we integrate these two forms of regularizers, which gives

$$r(M,S) = (1-\alpha_1)\|M\|_F^2 + \alpha_1\|M-I\|_F^2 + (1-\alpha_2)\|S\|_F^2 + \alpha_2\|S-I\|_F^2 = \|M-\alpha_1 I\|_F^2 + \|S-\alpha_2 I\|_F^2 + \mathrm{const} \tag{5.37}$$

where the balance parameters satisfy $0 \le \alpha_1, \alpha_2 \le 1$. Dropping the constant, the regularization term becomes

$$r(M,S) = \|M-\alpha_1 I\|_F^2 + \|S-\alpha_2 I\|_F^2 \tag{5.38}$$

The CDSM Learning Model Given the three parts above, the triplet-based CDSM learning model is formulated as

$$\min_{M,S,\xi} F(M,S,\xi) = \frac{1}{2}\big(\|M-\alpha_1 I\|_F^2 + \|S-\alpha_2 I\|_F^2\big) + C\sum_{t=1}^{T}\xi_t$$
$$\text{s.t.}\quad \mathrm{tr}(X_{t,1}+X_{t,2}, \mu S) - \mathrm{tr}(U_t, (1-\mu)M) \ge 1 - \xi_t, \quad \xi_t \ge 0, \quad \forall t \tag{5.39}$$
where the tradeoff parameter C is defined to balance the margin loss and the regularizer. From Eq. (5.39), it can be seen that the function F (M, S, ξ ) is convex
quadratic, and the triplet-based constraints are linear. Thus, the model is a convex quadratic programming (QP) problem that can, in principle, be solved with standard QP solvers. Nevertheless, standard QP solvers become computationally expensive as the problem scale grows, e.g., with the sample dimension and the number of triplets. In the following, an efficient SVM-based solver is proposed to optimize this problem.
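As a concrete illustration of the LMNN-style triplet construction mentioned above, here is a minimal sketch; the function name and the Euclidean neighbor search are assumptions for illustration, and the actual construction used in the experiments may differ.

```python
import numpy as np

def build_triplets(X, y, m1=3, m2=3):
    """Build (anchor, similar, dissimilar) index triplets from labeled data.

    For every sample, its m1 nearest same-class neighbors and m2 nearest
    different-class neighbors (Euclidean distance) are paired, giving
    m1 * m2 triplets per sample, in the spirit of LMNN-style construction.
    """
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a sample is never its own neighbor
    triplets = []
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        pos = same[np.argsort(dist[i, same])[:m1]]
        neg = diff[np.argsort(dist[i, diff])[:m2]]
        for p in pos:
            for q in neg:
                triplets.append((i, p, q))
    return triplets

# Toy usage on random two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, size=20)
print(len(build_triplets(X, y, m1=2, m2=2)))
```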
5.3.2 Optimization for CDSM
Denote $X_t = X_{t,1} + X_{t,2}$, $e_t = x_{t,1}^T x_{t,2} - x_{t,1}^T x_{t,3}$, and $f_t = (x_{t,1}-x_{t,2})^T(x_{t,1}-x_{t,2}) - (x_{t,1}-x_{t,3})^T(x_{t,1}-x_{t,3})$. The Lagrangian function of the CDSM learning model in Eq. (5.39) is

$$L(\beta, \lambda; M, S, \xi) = C\sum_{t=1}^{T}\xi_t + \frac{1}{2}\big(\|M-\alpha_1 I\|_F^2 + \|S-\alpha_2 I\|_F^2\big) - \sum_{t=1}^{T}\beta_t\big[\mathrm{tr}(X_t, \mu S) - \mathrm{tr}(U_t, (1-\mu)M) - 1 + \xi_t\big] - \sum_{t=1}^{T}\lambda_t\xi_t \tag{5.40}$$

where $\lambda_t, \beta_t \ge 0$. The Karush-Kuhn-Tucker (KKT) conditions are formulated as follows:

$$\frac{\partial L(\beta, \lambda; M, S, \xi)}{\partial M} = M - \alpha_1 I + (1-\mu)\sum_t \beta_t U_t = 0 \tag{5.41}$$

$$\frac{\partial L(\beta, \lambda; M, S, \xi)}{\partial S} = S - \alpha_2 I - \mu\sum_t \beta_t X_t = 0 \tag{5.42}$$

$$\frac{\partial L(\beta, \lambda; M, S, \xi)}{\partial \xi_t} = C - \beta_t - \lambda_t = 0 \tag{5.43}$$

Combining Eqs. (5.40)–(5.43), the Lagrange dual problem of the CDSM learning model is obtained as

$$\max_{\beta}\; -\frac{1}{2}\Big[(1-\mu)^2\Big\|\sum_t \beta_t U_t\Big\|_F^2 + \mu^2\Big\|\sum_t \beta_t X_t\Big\|_F^2\Big] - \sum_t \beta_t\big(\mu\alpha_2 e_t - (1-\mu)\alpha_1 f_t - 1\big)$$
$$\text{s.t.}\quad C \ge \beta_t \ge 0, \quad \forall t \tag{5.44}$$
Algorithm 5.3 Linear CDSM learning
Input: triplet set T, parameters C, α1, α2, μ.
Output: M, S
1: Initialization: β ← 0.
2: Compute the matrix B and the vector p in Eq. (5.45).
3: Initialize the gradient vector g by g = p.
4: while not converged do
5:    Select the βt that violates the KKT condition most.
6:    βt_old ← βt.
7:    Update βt ← βt − gt/btt.
8:    Update the gradient vector: gi ← gi + (βt − βt_old) · bit, i = 1, ..., T.
9: end while
10: Calculate the matrices M and S using Eqs. (5.46) and (5.47).
We can rewrite the above model as

$$\min_{\beta}\; \frac{1}{2}\beta^T B \beta + p^T \beta, \quad \text{s.t.}\quad C \ge \beta_t \ge 0, \;\forall t \tag{5.45}$$

where $p_t = \mu\alpha_2 e_t - (1-\mu)\alpha_1 f_t - 1$ and $b_{st} = \mu^2\,\mathrm{tr}(X_s X_t^T) + (1-\mu)^2\,\mathrm{tr}(U_s U_t^T)$. After obtaining $\beta$, the matrices $M$ and $S$ are calculated by

$$M = \alpha_1 I - (1-\mu)\sum_t \beta_t U_t \tag{5.46}$$

$$S = \alpha_2 I + \mu\sum_t \beta_t X_t \tag{5.47}$$
The Lagrange dual problem is addressed by an SMO-type algorithm: in each step, the $\beta_t$ that violates the KKT conditions most is selected and updated, which reduces the complexity of the procedure shown in Algorithm 5.3. To find the $\beta_t$ with the largest absolute gradient, all the gradients of $\beta$ are scanned:

$$g_t = p_t + \sum_{i=1}^{T} \beta_i b_{it}, \quad t = 1, \ldots, T \tag{5.48}$$

In every iteration, once $\beta_t$ is updated, the whole gradient vector needs to be recalculated, which can be done incrementally by

$$g_i = g_i + (\beta_t - \beta_t^{old})\, b_{it}, \quad \forall i \tag{5.49}$$

where $\beta_t^{old}$ is the value of $\beta_t$ before the update.
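For illustration, below is a minimal sketch of this coordinate-wise (SMO-style) procedure under stated assumptions: a precomputed matrix B with positive diagonal entries and a vector p are given, the most violating coordinate is picked from the gradient, and the Newton-type step is clipped to the box [0, C] (the explicit clipping is an assumption added here to keep the iterate feasible).

```python
import numpy as np

def solve_cdsm_dual(B, p, C, n_iter=1000, tol=1e-8):
    """Coordinate-wise solver for min 0.5 b^T B b + p^T b, s.t. 0 <= b_t <= C.

    Sketch of the SMO-style update: pick the coordinate with the largest KKT
    violation, take a clipped Newton step, and refresh the gradient
    incrementally as in Eq. (5.49).
    """
    beta = np.zeros(len(p))
    g = p.copy()                                   # gradient at beta = 0, cf. Eq. (5.48)
    for _ in range(n_iter):
        # A coordinate violates the KKT conditions if its gradient allows a
        # feasible decrease of the objective: g_t < 0 with beta_t < C, or
        # g_t > 0 with beta_t > 0.
        viol = np.where((g < 0) & (beta < C), -g, 0.0) \
             + np.where((g > 0) & (beta > 0), g, 0.0)
        t = int(np.argmax(viol))
        if viol[t] < tol:                          # all KKT conditions hold
            break
        old = beta[t]
        beta[t] = np.clip(old - g[t] / B[t, t], 0.0, C)
        g += (beta[t] - old) * B[:, t]             # incremental update, Eq. (5.49)
    return beta
```

Given the returned β, Eqs. (5.46) and (5.47) recover the matrices M and S.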
5.3.3 Kernelized CDSM
In this section, the kernelization of CDSM is first presented; then, two strategies are proposed for improving the efficiency of kernelized CDSM at both the training and testing stages.

Kernelization The model in Eq. (5.45) is proposed to learn a linear CDSM. It is natural to ask whether the linear basis kernel can be replaced by other kernel functions, e.g., the Gaussian RBF kernel, to further improve the classification performance. By substituting the inner product of two sample vectors, $x_i$ and $x_j$, with a kernel function $k(x_i, x_j)$, the quantities $\mathrm{tr}(X_s, X_t^T)$, $\mathrm{tr}(U_s, U_t^T)$, $e_t$, and $f_t$ can be written as

$$\begin{aligned}
\mathrm{tr}(X_s, X_t^T) ={}& 2k(x_{s,2}, x_{t,1})k(x_{s,1}, x_{t,2}) - 2k(x_{s,2}, x_{t,1})k(x_{s,1}, x_{t,3}) \\
&+ 2k(x_{s,2}, x_{t,2})k(x_{s,1}, x_{t,1}) - 2k(x_{s,2}, x_{t,3})k(x_{s,1}, x_{t,1}) \\
&- 2k(x_{s,3}, x_{t,1})k(x_{s,1}, x_{t,2}) + 2k(x_{s,3}, x_{t,1})k(x_{s,1}, x_{t,3}) \\
&- 2k(x_{s,3}, x_{t,2})k(x_{s,1}, x_{t,1}) + 2k(x_{s,3}, x_{t,3})k(x_{s,1}, x_{t,1})
\end{aligned} \tag{5.50}$$

$$\begin{aligned}
\mathrm{tr}(U_s, U_t^T) ={}& k(x_{s,2}, x_{t,2})k(x_{s,2}, x_{t,2}) - 2k(x_{s,2}, x_{t,1})k(x_{s,2}, x_{t,2}) - k(x_{s,2}, x_{t,3})k(x_{s,2}, x_{t,3}) \\
&- 2k(x_{s,2}, x_{t,2})k(x_{s,1}, x_{t,2}) + 2k(x_{s,2}, x_{t,1})k(x_{s,1}, x_{t,2}) + 2k(x_{s,2}, x_{t,2})k(x_{s,1}, x_{t,1}) \\
&+ 2k(x_{s,2}, x_{t,3})k(x_{s,1}, x_{t,3}) - 2k(x_{s,2}, x_{t,1})k(x_{s,1}, x_{t,3}) - 2k(x_{s,2}, x_{t,3})k(x_{s,1}, x_{t,1}) \\
&- k(x_{s,3}, x_{t,2})k(x_{s,3}, x_{t,2}) + 2k(x_{s,3}, x_{t,1})k(x_{s,3}, x_{t,2}) + k(x_{s,3}, x_{t,3})k(x_{s,3}, x_{t,3}) \\
&+ 2k(x_{s,3}, x_{t,2})k(x_{s,1}, x_{t,2}) - 2k(x_{s,3}, x_{t,1})k(x_{s,1}, x_{t,2}) - 2k(x_{s,3}, x_{t,2})k(x_{s,1}, x_{t,1}) \\
&- 2k(x_{s,3}, x_{t,3})k(x_{s,1}, x_{t,3}) + 2k(x_{s,3}, x_{t,1})k(x_{s,1}, x_{t,3}) + 2k(x_{s,3}, x_{t,3})k(x_{s,1}, x_{t,1}) \\
&+ 2k(x_{s,2}, x_{t,1})k(x_{s,2}, x_{t,3}) - 2k(x_{s,3}, x_{t,1})k(x_{s,3}, x_{t,3})
\end{aligned} \tag{5.51}$$

$$e_t = k(x_{t,1}, x_{t,2}) - k(x_{t,1}, x_{t,3}) \tag{5.52}$$

$$f_t = k(x_{t,2}, x_{t,2}) - 2k(x_{t,1}, x_{t,2}) - k(x_{t,3}, x_{t,3}) + 2k(x_{t,1}, x_{t,3}) \tag{5.53}$$
with $k(x_i, x_j) = x_i^T x_j$. For the kernelization of CDSM, the linear basis kernel $k(x_i, x_j) = x_i^T x_j$ is simply substituted with the Gaussian RBF kernel $k_{RBF}(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|_2^2)$. Given a test sample pair $x_i, x_j$, to compute $(x_i - x_j)^T M (x_i - x_j)$ and $x_i^T S x_j$, we also express $x_i^T I x_j$, $(x_i - x_j)^T I (x_i - x_j)$, $\sum_t \beta_t x_i^T X_t x_j$, and $\sum_t \beta_t (x_i - x_j)^T U_t (x_i - x_j)$ in the kernel space:

$$x_i^T I x_j = k(x_i, x_j) \tag{5.54}$$

$$(x_i - x_j)^T I (x_i - x_j) = k(x_i, x_i) + k(x_j, x_j) - 2k(x_i, x_j) \tag{5.55}$$

$$\sum_t \beta_t x_i^T X_t x_j = \sum_t \beta_t\big(k(x_j, x_{t,1})k(x_i, x_{t,2}) - k(x_j, x_{t,1})k(x_i, x_{t,3}) + k(x_j, x_{t,2})k(x_i, x_{t,1}) - k(x_j, x_{t,3})k(x_i, x_{t,1})\big) \tag{5.56}$$

$$\begin{aligned}
\sum_t \beta_t (x_i - x_j)^T U_t (x_i - x_j) ={}& \sum_t \beta_t\big(k(x_i, x_{t,2})k(x_i, x_{t,2}) - 2k(x_i, x_{t,1})k(x_i, x_{t,2}) - k(x_i, x_{t,3})k(x_i, x_{t,3}) \\
&+ k(x_i, x_{t,1})k(x_i, x_{t,3}) + k(x_j, x_{t,2})k(x_j, x_{t,2}) - 2k(x_j, x_{t,1})k(x_j, x_{t,2}) \\
&- 2k(x_j, x_{t,2})k(x_i, x_{t,2}) + 2k(x_j, x_{t,1})k(x_i, x_{t,2}) + 2k(x_j, x_{t,2})k(x_i, x_{t,1}) \\
&+ 2k(x_j, x_{t,3})k(x_i, x_{t,3}) - 2k(x_j, x_{t,1})k(x_i, x_{t,3}) - 2k(x_j, x_{t,3})k(x_i, x_{t,1}) \\
&- k(x_j, x_{t,3})k(x_j, x_{t,3}) + k(x_j, x_{t,1})k(x_j, x_{t,3})\big)
\end{aligned} \tag{5.57}$$
Linear CDSM can be extended to its kernelized version by utilizing Eqs. (5.50)– (5.53). Moreover, the model in Eq. (5.45) is solved to get β. Given β, we then employ Eqs. (5.54)–(5.57) to calculate the kernelized CDSM between any pairs of samples. Strategies for Improving Efficiency It is noted that the number of non-zero βt values controls the computational cost of Eqs. (5.56) and (5.57). The computational cost will become bigger with the number of triplets increasing, restricting the effect of kernelized CDSM on large-scale datasets. In this work, there are two methods introduced to promote the performance of kernelized CDSM. First, the calculation of the complexity of kernel CDSM is analyzed in both the training process and the testing process. We represent the number of triplets in the training process as T . We also represent the number of training samples as N. The computational complexity of kernelized CDSM similar to the linear CDSM learning is denoted as O(T ) in the training stage. The complexity to calculating C(xi , xj )
is denoted as O(T ) in the test stage. When a test sample is given, it should be compared with all training samples to learn its class label. Obviously, based on above, we can calculate the complexity of CDSM-based classification as O(NT ). Thus, to improve the performance in the testing stage, how to reduce the T and N values will be further considered. In the following, two strategies are proposed to shorten the T and N values. Reducing the Number of Triplets Conventionally, the triplet set T is constructed from the training set D via using the following program: m1 like nearest neighbors and m2 unlike nearest neighbors are selected to established m1 m2 triplets for each training sample; thus we can obtain the triplet set with in total the size as T = Nm1 m2 . In our method constructing triplet set, we will first discriminate whether the class label of every training sample is same as the class labels of its k1 (= 3) nearest neighbors. If it is, the triplets for the sample are not generated and it is regarded as an inner sample. In view of above method, we reduce the size of the triplet set to T = N m1 m2 , where N represents the number of reduced training samples. It is very clear that N < N. It is noted that, the performance for both training and testing is promoted based on the strategy. Reducing the Number of Kernel CDSM Computation In the testing process, the general method is that the kernel CDSM of the testing sample is calculated for every training sample. Thus its complexity is O(NT ). Considering the heavy complexity, we propose a two-stage strategy to shorten the computational cost. To begin with, the Euclidean distance is adopted to compute the k2 (= 20) nearest neighbors of the testing sample. Then, the kernel CDSM between the testing sample and its k2 nearest neighbors is calculated. Finally, we can find the training sample with the lowest kernel CDSM and take its class label to the testing sample. It is noted that k2 N usually holds. Based on this schema, the testing complexity could be shorten from O(NT ) to O(N + k2 T ), such that the performance could be dramatically promoted in the testing stage.
5.3.4 Experimental Results
In this section, our CDSM and kernelized CDSM are evaluated on 11 UCI datasets (i.e., Ecoli, Glass, ILPD, Segment, Sonar, Cardiotocography, BreastTissue, Parkinson's, SPECTHeart, Satellite, and Letter) [41] and 4 handwritten digit datasets (i.e., USPS, Semeion, PenDigits, and MNIST). These datasets are representative and are often adopted to evaluate metric learning approaches [6–8, 42]; they are described in detail in Table 5.8. In this work, the 1-nearest neighbor (1-NN) classifier is applied for classification. The mean error rate is used as the assessment criterion, computed on the originally defined splits for six datasets (i.e., PenDigits, Satellite, USPS, MNIST, SPECTHeart, and Letter) and via 10-fold cross validation for the remaining datasets.
Table 5.8 Descriptions of the 11 UCI and the 4 handwritten digit datasets

Dataset | # of training samples | # of test samples | # of classes | Dimension
BreastTissue | 96 | 10 | 6 | 9
Ecoli | 303 | 33 | 8 | 7
Glass | 193 | 21 | 7 | 9
ILPD | 525 | 58 | 2 | 10
Parkinson's | 176 | 19 | 2 | 22
SPECTHeart | 80 | 187 | 2 | 44
Satellite | 4435 | 2000 | 7 | 36
Segment | 2079 | 231 | 7 | 19
Sonar | 188 | 20 | 2 | 60
Cardiotocography | 1914 | 212 | 10 | 21
Letter | 16,000 | 4000 | 26 | 16
PenDigits | 7494 | 3498 | 10 | 16
USPS | 7291 | 2007 | 10 | 256
Semeion | 1436 | 159 | 10 | 256
MNIST | 60,000 | 10,000 | 10 | 784
5.3.4.1 Evaluation on CDSM and Kernelized CDSM
We first evaluate the influence of the distance and similarity parts in the proposed CDSM learning model. As discussed above, μ is a tradeoff parameter that balances their influence: with μ = 0 only the distance part (CDSMdis) is used, while with μ = 1 only the similarity part (CDSMsim) is used. The results are reported in Table 5.9, which shows that CDSMsim is roughly comparable with CDSMdis. In particular, CDSMdis performs better on some datasets (e.g., Ecoli, PenDigits, BreastTissue, and USPS) but worse on others (e.g., SPECTHeart, Glass, and MNIST). Notably, by using both distance and similarity metric learning, the proposed CDSM obtains the best performance on most datasets, which demonstrates that the two parts are complementary and that exploiting both improves the performance of CDSM. Moreover, for most datasets (except Ecoli), the selected μ deviates from 0.5, indicating that the contributions of the distance and similarity parts are usually unequal. The value of μ for each dataset, obtained by cross validation on the training data, is listed in Table 5.9.
5.3.4.2 Comparison with the State-of-the-Art Metric Learning Methods
In this subsection, our CDSM is compared with six popular metric learning approaches. These include a pairwise constraints based joint distance and similarity metric learning method: SML [4]; triplet constraints based distance and similarity metric learning methods: LMNN [8] and OASIS [43]; and pairwise constraints
Table 5.9 Comparison of classification error rate (%) between CDSM, CDSM with the distance part only (CDSMdis), and CDSM with the similarity part only (CDSMsim). Here, μ is the balance parameter controlling the effect of the distance and similarity parts in CDSM. Bold values mean the best performances

Dataset | CDSMdis | CDSMsim | CDSM | μ | α1 | α2
BreastTissue | 28.75 | 33.87 | 28.75 | 0.0 | 1 | 1
Ecoli | 18.56 | 18 | 17.39 | 0.5 | 1 | 1
Glass | 26.84 | 24.58 | 24.06 | 0.9 | 0 | 1
ILPD | 30.24 | 28.65 | 27.31 | 0.9 | 0 | 1
Parkinson's | 4.08 | 12.3 | 4.08 | 0.0 | 1 | 1
SPECTHeart | 25.67 | 8.02 | 6.95 | 0.9 | 0 | 0
Satellite | 10.1 | 12.25 | 9.9 | 0.2 | 1 | 1
Segment | 2.12 | 2.81 | 2.12 | 0.0 | 0 | 1
Sonar | 14.93 | 11.5 | 11.02 | 0.9 | 1 | 1
Cardiotocography | 18.38 | 25.92 | 18.26 | 0.4 | 0 | 1
Letter | 2.5 | 4.95 | 2.32 | 0.3 | 1 | 1
PenDigits | 1.97 | 2.26 | 1.95 | 0.6 | 0 | 1
USPS | 4.58 | 5.23 | 4.38 | 0.3 | 1 | 1
Semeion | 4.46 | 6.66 | 4.08 | 0.4 | 1 | 1
MNIST | 2.96 | 2.65 | 2.08 | 0.88 | 1 | 1
Table 5.10 Comparison of classification error rate (%) between our CDSM and the state-of-the-art metric learning methods on the UCI datasets. Bold values mean the best performances

Dataset | Euclidean | NCA | ITML | LDML | LMNN | OASIS | SML | CDSM
BreastTissue | 31 | 41.27 | 35.82 | 32.09 | 34.37 | 36.25 | 36.37 | 28.75
Ecoli | 18.55 | 17.6 | 19.81 | 19.63 | 19.81 | 19.37 | 23.94 | 17.39
Glass | 30.1 | 29.87 | 28.27 | 40.74 | 32 | 34.63 | 30.02 | 24.06
ILPD | 35.69 | 34.65 | 35.35 | 35.84 | 34.12 | 28.65 | 30.72 | 27.31
Parkinson's | 4.08 | 6.63 | 6.13 | 7.15 | 5.26 | 24.82 | 5.68 | 4.08
SPECTHeart | 38.5 | 26.74 | 34.76 | 33.16 | 34.76 | 8.02 | 8.56 | 6.95
Satellite | 10.95 | 10.4 | 11.45 | 15.9 | 10.05 | 21.65 | 21.9 | 9.9
Segment | 2.86 | 2.51 | 2.73 | 2.86 | 2.64 | 14.33 | 2.82 | 2.12
Sonar | 12.98 | 15.4 | 12.07 | 22.86 | 11.57 | 12.3 | 14 | 11.02
Cardiotocography | 21.4 | 21.16 | 18.67 | 22.29 | 19.21 | 21.61 | 21.47 | 18.26
Letter | 4.33 | 2.47 | 3.8 | 11.05 | 3.45 | 4.3 | 3.4 | 2.32
Ave. ERR. | 19.13 | 18.97 | 18.99 | 22.14 | 18.84 | 20.54 | 18.08 | 13.83
Ave. rank | 4.9 | 4.18 | 4.45 | 6.63 | 4.09 | 5.45 | 5.18 | 1
based distance metric learning methods: NCA [42], ITML [6], and LDML [1]. Additionally, the Euclidean distance (with no training process) is used as a baseline. We conduct the comparison experiments on the UCI and handwritten digit datasets, and the results are reported in Tables 5.10 and 5.11, respectively.
Table 5.11 Comparison of classification error rate (%) between our CDSM and the state-of-the-art metric learning methods on the handwritten digit datasets. Bold values mean the best performances

Dataset | Euclidean | NCA | ITML | LDML | LMNN | OASIS | SML | CDSM
PenDigits | 2.26 | 2.23 | 2.29 | 6.2 | 2.52 | 2.14 | 2.17 | 1.95
USPS | 5.08 | 5.68 | 6.33 | 8.77 | 5.38 | 5.13 | 5.13 | 4.38
Semeion | 8.54 | 8.6 | 5.71 | 11.98 | 6.09 | 8.16 | 5.08 | 4.08
MNIST | 2.87 | 5.46 | 2.89 | 6.05 | 2.28 | 2.97 | 2.11 | 2.08
Ave. ERR. | 4.69 | 5.49 | 4.31 | 8.25 | 4.07 | 4.6 | 3.62 | 3.12
Ave. rank | 4.5 | 6.25 | 5.5 | 8 | 4 | 4 | 2.5 | 1
5.3.4.3 Results on the UCI Datasets
From Table 5.10, we can observe that our CDSM dramatically improves the classification accuracy over the baseline Euclidean distance and outperforms the popular metric learning methods. Note that NCA, ITML, and LMNN only mildly surpass the baseline, while LDML and OASIS perform worse than the baseline Euclidean distance, probably because they employ only a distance or only a similarity measure. SML, which uses both distance and similarity measures, obtains the second lowest mean classification error. Owing to the triplet-based loss exploited in our CDSM, the proposed CDSM achieves better performance than SML.
5.3.4.4 Results on the Handwritten Digit Datasets
On the four handwritten digit datasets, the proposed CDSM again performs best, with the highest average rank and the lowest average classification error. We can observe that CDSM dramatically surpasses the pairwise constraints based distance metric learning methods (i.e., LDML, ITML, and NCA), some of which even perform worse than the baseline Euclidean distance. This suggests that, compared with pairwise constraints, approaches based on triplet constraints may offer better classification performance; for example, both LMNN and OASIS, which are based on triplet constraints, outperform the baseline Euclidean distance. Moreover, the improvement of CDSM over LMNN and OASIS indicates the superiority of joint distance and similarity metric learning. As the closest counterpart of our CDSM, SML obtains the second best performance. The basic differences between SML and CDSM are as follows. First, SML adopts pairwise constraints, while CDSM adopts triplet constraints. Second, μ is simply set to 0.5 in SML, while in CDSM μ is tuned to control the similarity and distance parts.
Table 5.12 Comparison of classification error rate (%) between CDSM and kernelized CDSM. Bold values mean the best performances

Dataset | CDSM | Kernelized CDSM | γ in RBF kernel
BreastTissue | 28.75 | 28.12 | 0.1
Ecoli | 17.39 | 15.18 | 0.1
Glass | 24.06 | 23.58 | 0.2
ILPD | 27.31 | 27.36 | 0.7
Parkinson's | 4.08 | 4.08 | 0.1
SPECTHeart | 6.95 | 8.02 | 0.1
Satellite | 9.9 | 9.35 | 1
Segment | 2.12 | 2.32 | 0.1
Sonar | 11.02 | 9.57 | 0.2
Cardiotocography | 18.26 | 18.58 | 0.6
Letter | 2.32 | 2.15 | 0.1
PenDigits | 1.95 | 1.88 | 0.1
USPS | 4.38 | 4.26 | 0.06
Semeion | 4.08 | 4.96 | 0.1
MNIST | 2.08 | 2.02 | 0.06
5.3.4.5 CDSM vs. Kernelized CDSM
We present the kernelized CDSM, which extends CDSM with the RBF kernel, to learn a non-linear measure. In this subsection, our CDSM and its kernelized extension are compared on the various datasets. As shown in Table 5.12, the parameter γ of the kernelized CDSM is determined by cross validation. The kernelized CDSM improves over the linear CDSM on most datasets, showing that non-linear similarity learning is beneficial to classification, especially for sophisticated tasks. Nevertheless, kernel methods incur a heavy computational cost. This problem is alleviated by the two improvement strategies, namely reducing the number of triplets (S1) and reducing the number of kernel CDSM computations (S2). We assess these strategies on three datasets: BreastTissue, ILPD, and USPS. From Table 5.13, it can be seen that the computational cost of kernelized CDSM is dramatically reduced by the two strategies with little performance degradation, and the gain of kernelized CDSM over the linear CDSM becomes more pronounced on larger datasets. Meanwhile, the training and testing times of kernelized CDSM with S1 and S2 are comparable to those of the linear CDSM. To balance efficiency and effectiveness, the proposed strategies S1 and S2 are used only on datasets with more than 1000 training samples.
5.3.4.6 Comparison of Running Time
Lastly, the empirical running time of the proposed CDSM is compared with that of the other approaches. The mean running time of each approach over ten trials on every dataset is reported.
Table 5.13 Influence on running time and accuracy of the two improvement strategies. In counting the training time, we leave out the time for constructing triplets

Dataset | Optimization methods | Training time (s) | Testing time (s) | # of Triplets | Error rate
BreastTissue | No | 0.007 | 0.02 | 95 | 28.12
BreastTissue | S1 | 0.005 | 0.015 | 53 | 29.12
BreastTissue | S2 | 0.007 | 0.013 | 95 | 28.12
BreastTissue | S1+S2 | 0.005 | 0.006 | 53 | 29.12
BreastTissue | linear | 0.003 | 0.002 | 95 | 28.75
ILPD | No | 0.11 | 0.6 | 525 | 27.36
ILPD | S1 | 0.07 | 0.6 | 360 | 27.53
ILPD | S2 | 0.11 | 0.1 | 525 | 27.64
ILPD | S1+S2 | 0.07 | 0.06 | 360 | 27.78
ILPD | linear | 0.035 | 0.01 | 525 | 27.31
USPS | No | 2.36 | 3178.6 | 7291 | 4.08
USPS | S1 | 0.006 | 150.2 | 426 | 4.19
USPS | S2 | 2.36 | 9.8 | 7291 | 4.17
USPS | S1+S2 | 0.006 | 3.3 | 426 | 4.26
USPS | linear | 1.12 | 1.2 | 7291 | 4.38
Table 5.14 Running time (s) of different metric learning methods on the UCI datasets. Note that we compute the time for the training stage without building constraints

Dataset | NCA | ITML | LDML | LMNN | OASIS | SML | CDSM
BreastTissue | 0.02 | 2.52 | 0.11 | 5.08 | 0.017 | 0.075 | 0.003
Ecoli | 0.59 | 2.39 | 0.25 | 0.72 | 0.02 | 0.38 | 0.012
Glass | 0.2 | 2.55 | 0.25 | 0.64 | 0.019 | 0.21 | 0.009
ILPD | 2.63 | 0.3 | 1.8 | 0.7 | 0.02 | 1.04 | 0.035
Parkinson's | 0.68 | 0.48 | 0.25 | 0.29 | 0.02 | 0.23 | 0.007
SPECTHeart | 0.38 | 0.96 | 0.14 | 0.35 | 0.15 | 0.04 | 0.006
Satellite | 8123.35 | 4.94 | 10.53 | 11.74 | 0.14 | 10.628 | 0.85
Segment | 202.61 | 2.82 | 21.59 | 6.2 | 0.027 | 3.47 | 0.23
Sonar | 5.65 | 0.74 | 0.49 | 0.66 | 0.048 | 0.48 | 0.011
Cardiotocography | 166.67 | 3.16 | 26.52 | 11.99 | 0.028 | 7.2 | 0.19
Letter | 84,636.1 | 3.6 | 1155.65 | 595.42 | 0.11 | 56.42 | 2.24
The experiments are implemented in Matlab on a PC with an Intel Core i7-4700MQ CPU and 8 GB RAM, and the results are shown in Table 5.14. We can observe that our CDSM has the second lowest training time, close to that of the fastest method, OASIS, which is an online method. Moreover, thanks to the proposed improvements, our CDSM is clearly faster than SML. In addition, while the number of training samples has little effect on the online algorithm ITML, it strongly affects the running time of NCA, LDML, and LMNN.
5.3.5 Conclusion
In this chapter, a max-margin model is suggested to learn a combined distance and similarity measure (CDSM) under triplet-based constraints, and an SMO-type algorithm is proposed to solve it effectively. Experiments are conducted on 11 UCI datasets and 4 handwritten digit datasets, where our CDSM model is compared with several state-of-the-art metric learning methods. The experimental results show that our model achieves the best performance on all the datasets compared with these cutting-edge metric learning methods, which strongly demonstrates the effectiveness of CDSM. By studying CDSM from the pairwise kernel viewpoint, it is extended to a kernelized version to learn non-linear similarity measures, and two strategies are proposed to reduce its computational cost. The kernelized CDSM is tested on the same datasets and improves the performance on 10 of them. Moreover, we evaluate the two strategies on the USPS dataset: the training time is decreased from 2.36s to 0.006s, while the testing time is reduced from 3178.6s to 3.3s. In our future work, we will combine CDSM with deep network structures to learn better non-linear similarity measures and further tailor our methods to several vision applications such as face verification and person re-identification.
References

1. Guillaumin M, Verbeek J, Schmid C. Is that you? Metric learning approaches for face identification. In: 2009 IEEE 12th international conference on computer vision. Piscataway: IEEE; 2009. p. 498–505.
2. Taigman Y, Wolf L, Hassner T, et al. Multiple one-shots for utilizing class label information. In: BMVC, vol. 2. 2009. p. 1–12.
3. Nguyen HV, Bai L. Cosine similarity metric learning for face verification. In: Asian conference on computer vision. Berlin: Springer; 2010. p. 709–720.
4. Cao Q, Ying Y, Li P. Similarity metric learning for face recognition. In: Proceedings of the IEEE international conference on computer vision. 2013. p. 2408–2415.
5. Huang GB, Mattar M, Berg T, Learned-Miller E. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition. 2008.
6. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS. Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning. New York: ACM; 2007. p. 209–216.
7. Ying Y, Li P. Distance metric learning with eigenvalue optimization. J Mach Learn Res. 2012;13(Jan):1–26.
8. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res. 2009;10(Feb):207–244.
9. Shen C, Kim J, Wang L. Scalable large-margin Mahalanobis distance metric learning. IEEE Trans Neural Netw. 2010;21(9):1524–1530.
10. Bian W, Tao D. Constrained empirical risk minimization framework for distance metric learning. IEEE Trans Neural Netw Learn Syst. 2012;23(8):1194–1205.
11. Wang F, Zuo W, Zhang L, Meng D, Zhang D. A kernel classification framework for metric learning. IEEE Trans Neural Netw Learn Syst. 2014;26(9):1950–1962.
12. Niu G, Dai B, Yamada M, Sugiyama M. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Comput. 2014;26(8):1717–1762.
13. Hoi SCH, Liu W, Chang S-F. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Trans Multimedia Comput Commun Appl. 2010;6(3):18.
14. Parameswaran S, Weinberger KQ. Large margin multi-task metric learning. In: Advances in neural information processing systems. 2010. p. 1867–1875.
15. Kulis B, et al. Metric learning: a survey. Found Trends Mach Learn. 2013;5(4):287–364.
16. Xia T, Tao D, Mei T, Zhang Y. Multiview spectral embedding. IEEE Trans Syst Man Cybern Part B. 2010;40(6):1438–1446.
17. Xie B, Mu Y, Tao D, Huang K. M-SNE: multiview stochastic neighbor embedding. IEEE Trans Syst Man Cybern Part B. 2011;41(4):1088–1096.
18. Wang H, Yuan J. Collaborative multifeature fusion for transductive spectral learning. IEEE Trans Cybern. 2014;45(3):451–461.
19. Xiao Y, Liu B, Hao Z, Cao L. A similarity-based classification framework for multiple-instance learning. IEEE Trans Cybern. 2013;44(4):500–515.
20. Nguyen DT, Nguyen CD, Hargraves R, Kurgan LA, Cios KJ. mi-DS: Multiple-instance learning algorithm. IEEE Trans Cybern. 2012;43(1):143–154.
21. Hu J, Lu J, Yuan J, Tan Y-P. Large margin multi-metric learning for face and kinship verification in the wild. In: Asian conference on computer vision. Berlin: Springer; 2014. p. 252–267.
22. Cui Z, Li W, Xu D, Shan S, Chen X. Fusing robust face region descriptors via multiple metric learning for face recognition in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2013. p. 3554–3561.
23. Zhang L, Zhang D. Metricfusion: generalized metric swarm learning for similarity measure. Inf Fusion. 2016;30:80–90.
24. Hertz T, Bar-Hillel A, Weinshall D. Learning distance functions for image retrieval. In: Proceedings of the 2004 IEEE Computer Society conference on computer vision and pattern recognition (CVPR 2004), vol. 2. 2004. p. II-570–II-577.
25. Brunner C, Fischer A, Luig K, Thies T. Pairwise support vector machines and their application to large scale problems. J Mach Learn Res. 2012;13(1):2279–2292.
26. Chen D, Cao X, Wang L, Wen F, Sun J. Bayesian face revisited: a joint formulation. In: Computer vision–ECCV 2012. Berlin: Springer; 2012. p. 566–579.
27. Schroff F, Kalenichenko D, Philbin J. Facenet: a unified embedding for face recognition and clustering. In: The IEEE conference on computer vision and pattern recognition (CVPR). 2015.
28. Li M, Wang Q, Zhang D, Li P, Zuo W. Joint distance and similarity measure learning based on triplet-based constraints. Inf Sci. 2017;406:119–132.
29. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci. 2009;2(1):183–202.
30. Hu J, Lu J, Tan Y-P. Discriminative deep metric learning for face verification in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. p. 1875–1882.
31. Frank A, Asuncion A. UCI machine learning repository online. 2011.
32. Ying Y, Huang K, Campbell C. Sparse metric learning via smooth optimization. In: Advances in neural information processing systems. 2009. p. 2214–2222.
33. Kumar N, Berg AC, Belhumeur PN, Nayar SK. Attribute and simile classifiers for face verification. In: 2009 IEEE 12th international conference on computer vision. Piscataway: IEEE; 2009. p. 365–372.
34. Li Z, Chang S, Liang F, Huang TS, Cao L, Smith JR. Learning locally-adaptive decision functions for person verification. In: CVPR. 2013.
35. Cai X, Wang C, Xiao B, Chen X, Zhou J. Deep nonlinear metric learning with independent subspace analysis for face verification. In: Proceedings of the 20th ACM international conference on Multimedia. New York: ACM; 2012. p. 749–752.
36. Huang GB, Lee H, Learned-Miller E. Learning hierarchical representations for face verification with convolutional deep belief networks. In: 2012 IEEE conference on computer vision and pattern recognition. Piscataway: IEEE; 2012. p. 2518–2525.
37. Koestinger M, Hirzer M, Wohlhart P, Roth PM, Bischof H. Large scale metric learning from equivalence constraints. In: 2012 IEEE conference on computer vision and pattern recognition. Piscataway: IEEE; 2012. p. 2288–2295.
38. Wang F, Zuo W, Zhang L, Meng D, Zhang D. A kernel classification framework for metric learning. IEEE Trans Neural Netw Learn Syst. 2015;26(9):1950–1962.
39. Vert J-P, Qiu J, Noble WS. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinf. 2007;8(Suppl 10):S8.
40. Schultz M, Joachims T. Learning a distance metric from relative comparisons. In: Advances in neural information processing systems (NIPS). 2004. p. 41.
41. Lichman M. UCI machine learning repository. 2013.
42. Goldberger J, Hinton GE, Roweis ST, Salakhutdinov R. Neighbourhood components analysis. In: Advances in neural information processing systems. 2004. p. 513–520.
43. Chechik G, Shalit U, Sharma V, Bengio S. An online algorithm for large scale image similarity learning. In: Advances in neural information processing systems. 2009. p. 306–314.
Chapter 6
Information Fusion Based on Score/Weight Classifier Fusion
Applying different classifiers to a categorization task produces multiple classification scores. Generally, each classifier has its own advantages and limitations, and a single classifier frequently fails to achieve reliable categorization. To maximize the benefits of multiple classifiers, it is critical to employ score fusion. This chapter discusses two strategies for score fusion and their application to classification. After reading this chapter, the readers will have a basic understanding of score fusion methods.
6.1 Motivation
Fusion is often accomplished at three levels, i.e., the feature level, score level, and decision level [1, 2], and fusion at these levels offers different advantages. Decision level fusion is very simple to construct; however, it does not fully leverage multisource information, since the decision retains very little information from the original input. For instance, in a biometrics-based personal verification system, the decision to fuse is just a binary variable taking the value accept or reject. Fusion at the feature level merges multisource information at the earliest possible stage, allowing the most information to be extracted from the original data [2]. However, fusion at the feature level has to overcome the inconsistency and incompatibility between different types of data. Accordingly, score level fusion is a viable option [3–5]. Currently used score fusion approaches can be classified into three types, namely transformation-based score fusion, classifier-based score fusion, and density-based score fusion [6]. Classifier-based score fusion techniques aggregate scores from many data sources into a feature vector and then train a classifier to perform classification [7, 8]. Apart from conventional machine learning techniques such as the multi-layer perceptron (MLP) for classifier-based score fusion [9], boosting techniques such as those presented in [10, 11] have also demonstrated good
performance in this type of fusion. The critical concerns with classifier-based score fusion approaches are as follows. First, they encounter the issue of an unbalanced training set; for example, in personal verification, the number of genuine match scores available for training is far smaller than the number of impostor scores. Second, the cost of misclassification should be carefully chosen, as should the classifier. In transformation-based procedures, the scores should be transformed (normalized) to a common domain before being combined. Because the transformation (normalization) scheme is sample dependent, empirical evaluation is typically part of the implementation of these approaches [4, 12]. To merge scores from disparate data sources, transformation-based score fusion often employs the sum rule, maximum rule, minimum rule, and product rule [1]; the majority of prior work indicates that the sum rule performs admirably [13]. Density-based techniques can achieve optimal performance when the score densities are estimated appropriately. However, they suffer from the disadvantage of being overly complicated and difficult to apply; modeling density distributions, in particular, is extremely complicated [14]. Owing to the good theoretical qualities of Gaussian mixture model (GMM)-based density estimation, it has been widely used in score fusion [15], but determining an appropriate number of components for a GMM is a difficult task [6]. Apart from the three methodologies described above, there are a few additional methods for score fusion. For instance, score fusion techniques based on receiver operating characteristic (ROC) analysis have been presented; typical examples include the least-square-error-based framework proposed in [16], the margin-based ranking approach [17], and the optimization approach for the area under the curve (AUC) [18]. It is certain that fusion at the weighted score level can maximize the utility of data from disparate sources [19–23]. However, optimal weights for weighted score level fusion are difficult to find; there appear to be no automatic weight selection procedures, and many earlier weighted score level fusion approaches rely solely on empirical weight selection. As a result, it is critical to develop an automatic and adaptive weighted score level fusion approach. Several attempts have been made to improve weighted score level fusion. For instance, Jain et al. proposed user-specific multi-biometric parameters [24], and Moumene et al. investigated fusion weight estimation in the exposure fusion problem [25]; however, because the latter approach is based on Lagrange multipliers, it is incompatible with pattern classification tasks. The work [24] can be seen as a user-specific adaptive fusion strategy. Apart from these methodologies, other adaptive fusion studies have been conducted [26–28]. For example, the strategy presented in [28] is a classic test-sample-specific adaptive weighted fusion technique, in which the weights of the various data sources vary with the test samples. The approach presented in [14], which combines weighted scores based on sample quality, can alternatively be considered a sample-specific adaptive weighted fusion strategy; it is based on the premise that because sample quality varies, the weights should vary for different samples.
It should be noted that the majority of the aforementioned methods fall short of fully
automating the determination of adaptive weights; several of them estimate the weights using an exhaustive search with an error-rate-based criterion [24]. This chapter proposes an optimal adaptive weighted fusion algorithm (AWFA) [29]. The technique automatically determines appropriate weights for score fusion, obviating the requirement for manual adjustment. Because the automatically obtained weights are reasonable, this approach is well suited to integrating the benefits of complementary data sources, and extensive experiments demonstrate that it outperforms previously available state-of-the-art approaches. Additionally, because real-world data frequently follow a complex distribution that a linear representation cannot model, it is reasonable to develop non-linear weighted fusion approaches. Thus, we also propose a fusion approach based on adaptive kernel selection in this chapter. Specifically, we use a self-adaptive weight fusion approach for combining local support vector machine classifiers (FaLK-SVM) [30] and present two fusion methods, a distance-based weighting method (FaLK-SVMad) and a rank-based weighting method (FaLK-SVMar) [28]. FaLK-SVMa achieves a higher classification accuracy than FaLK-SVM, as demonstrated by experimental results on fourteen UCI datasets and three large-scale datasets.
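For reference, the fixed transformation-based combination rules mentioned in the motivation can be sketched in a few lines; this is an illustrative example only (the array shape and names are assumptions), and it is precisely the absence of per-sample weighting here that the adaptive approach below addresses.

```python
import numpy as np

def fixed_rule_fusion(scores, rule="sum"):
    """Combine per-source match scores with a fixed rule.

    `scores` is an (n_sources, n_classes) array of normalized scores;
    the sum / product / max / min rules simply collapse the source axis,
    treating every source and every test sample identically.
    """
    rules = {"sum": np.sum, "product": np.prod, "max": np.max, "min": np.min}
    fused = rules[rule](scores, axis=0)
    return int(np.argmax(fused)), fused

# Toy usage: two sources scoring three classes.
s = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.3, 0.2]])
print(fixed_rule_fusion(s, "sum"))
```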
6.2 Adaptive Weighted Fusion Approach
Score fusion is a very effective fusion technique, and weighted score fusion is the most preferable form of it. The ability to automatically set appropriate weights is the most critical aspect of weighted score fusion, yet there appear to be no truly adaptive weighted score fusion approaches available currently. In this chapter, we design an optimal adaptive weighted fusion approach that automatically determines the weights without any manual setting. While the proposed approach is quite simple and straightforward to implement, it can outperform previous state-of-the-art methods.
6.2.1 Problem Formulation
For clarity, we assume in this subsection that there are only two types of samples, i.e., two types of data scores. The adaptive weighted fusion approach (AWFA) [29] consists of the following steps:

Step 1. Feature extraction is conducted on all samples, including the test sample and the first and second types of training samples.

Step 2. The test sample's scores on the first type of training samples (i.e., its distances to the first type of training samples) are calculated, and $d_i^1$ denotes the test sample's distance to the $i$-th class, $i = 1, \ldots, C$, where $C$ is the number of classes. The test sample's scores on the second type of training samples are calculated in the same way, and $d_i^2$ denotes the corresponding distance to the $i$-th class. Define $\beta_r = \frac{h - \min(d_1^r, \ldots, d_C^r)}{h}$, $h = \sum_{j=1}^{C} d_j^r$, $r = 1, 2$.

Step 3. The values of $d_i^1$ and $d_i^2$ are normalized to $[0, 1]$ through $\hat{d}_i^1 = (d_i^1 - d_{\min}^1)/(d_{\max}^1 - d_{\min}^1)$ and $\hat{d}_i^2 = (d_i^2 - d_{\min}^2)/(d_{\max}^2 - d_{\min}^2)$, where $d_{\max}^1$ and $d_{\min}^1$ are the maximum and minimum of $d_i^1$, and $d_{\max}^2$ and $d_{\min}^2$ are the maximum and minimum of $d_i^2$, respectively.

Step 4. $\hat{d}_i^1$ are sorted in ascending order and the sorted result is denoted by $e_1^1 \le e_2^1 \le \cdots \le e_C^1$; $\hat{d}_i^2$ are sorted in ascending order and the sorted result is denoted by $e_1^2 \le e_2^2 \le \cdots \le e_C^2$. Let $w = (e_2^1 - e_1^1) + (e_2^2 - e_1^2)$, $w_1 = (e_2^1 - e_1^1)/w$, and $w_2 = (e_2^2 - e_1^2)/w$. Because $e_1^2 = e_1^1 = 0$, we also have $w_1 = e_2^1/(e_2^1 + e_2^2)$ and $w_2 = e_2^2/(e_2^1 + e_2^2)$. Let $f_i = \beta_1 w_1 \hat{d}_i^1 + \beta_2 w_2 \hat{d}_i^2$ $(i = 1, \ldots, C)$. If $k = \arg\min_i f_i$, then the test sample is assigned to the $k$-th class.

The algorithm of AWFA is described in Algorithm 6.1.

Algorithm 6.1 Algorithm for AWFA
1. Calculate $\beta_r = \frac{h - \min(d_1^r, \ldots, d_C^r)}{h}$, $h = \sum_{j=1}^{C} d_j^r$, $r = 1, 2$.
2. Calculate $\hat{d}_i^1 = (d_i^1 - d_{\min}^1)/(d_{\max}^1 - d_{\min}^1)$ and $\hat{d}_i^2 = (d_i^2 - d_{\min}^2)/(d_{\max}^2 - d_{\min}^2)$.
3. Calculate $f_i = \beta_1 w_1 \hat{d}_i^1 + \beta_2 w_2 \hat{d}_i^2$.
4. If $k = \arg\min_i f_i$, then the test sample is assigned to the $k$-th class.
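The following is a minimal, hedged sketch of Algorithm 6.1 in NumPy; the function name is illustrative, and the min-max normalization follows the order-preserving reading used above (smaller normalized values still mean closer classes).

```python
import numpy as np

def awfa(d1, d2):
    """Sketch of AWFA (Algorithm 6.1) for two score sources.

    d1, d2 are the test sample's distances to each of the C classes obtained
    from the two types of data sources; the smaller the fused score f_i,
    the more likely the test sample belongs to class i.
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    betas, dn, e2 = [], [], []
    for d in (d1, d2):
        h = d.sum()
        betas.append((h - d.min()) / h)                       # confidence beta_r
        nd = (d - d.min()) / max(d.max() - d.min(), 1e-12)    # normalize to [0, 1]
        dn.append(nd)
        e = np.sort(nd)                                       # e_1 = 0 <= e_2 <= ...
        e2.append(e[1])                                       # gap to the second best
    w1 = e2[0] / (e2[0] + e2[1])                              # adaptive weights
    w2 = e2[1] / (e2[0] + e2[1])
    f = betas[0] * w1 * dn[0] + betas[1] * w2 * dn[1]
    return int(np.argmin(f)), f

# Toy usage: distances of one test sample to 4 classes from two sources.
print(awfa([0.2, 0.8, 0.9, 1.0], [0.5, 0.45, 0.9, 1.0]))
```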
6.2.2 Rationale and Advantages of AWFA
Our approach (i.e., AWFA) has three rationales and benefits. First, it selects the weights fully automatically. Second, it makes good use of the confidence (i.e., $\beta_g w_g$, $g = 1, 2$) of the scores. Third, unlike most previous weighted score fusion methods, which assign fixed or adaptive weights to the various types of scores (data sources) while ignoring the specific test sample, AWFA adaptively determines the optimal weights for each test sample. This allows the dissimilarities between each test sample and each type of data source to be taken into account flexibly and in detail. Notably, the setting of $w_1$ and $w_2$ in AWFA is quite reasonable for the following reason. $e_1^1$ and $e_2^1$ represent the best and second best scores for the first type of data source, respectively. Previous research established that a significant difference between the best and second best scores indicates a reliable classification based on those scores [31, 32]. In other words, the significance of the classification results produced from the first kind of data source is proportional to the value of $e_2^1 - e_1^1$. Similarly, the significance of the classification results produced from the second kind of
data source is proportional to the value of $e_2^2 - e_1^2$. As a consequence, it is reasonable to adopt $w_1 = (e_2^1 - e_1^1)/w$ and $w_2 = (e_2^2 - e_1^2)/w$ as the weights of the first and second kinds of data sources, respectively. It should be noted that, although the above analysis assumes just two types of samples, i.e., two types of data sources, AWFA can also be used with several types of data sources, as briefly shown below. Assume that there are $M$ different types of data sources. For each test sample, let $t_i^1, \ldots, t_i^M$ denote its distances (i.e., scores) to the $i$-th class for the first to $M$-th types of data sources, respectively. In Step 4 of AWFA, after obtaining $e_2^1, \ldots, e_2^M$, the $M$ weights can be calculated by

$$w_r = \frac{e_2^r}{\sum_{j=1}^{M} e_2^j}, \quad r = 1, \ldots, M$$

where $w_1, \ldots, w_M$ are the weights of the first to $M$-th kinds of data sources, respectively. Given $f_i = \sum_{j=1}^{M} \beta_j w_j \hat{d}_i^j$ $(i = 1, \ldots, C)$ with $\beta_r = \frac{h - \min(d_1^r, \ldots, d_C^r)}{h}$, $h = \sum_{j=1}^{C} d_j^r$, $r = 1, \ldots, M$, the test sample is assigned to the $k$-th class if $k = \arg\min_i f_i$. The above description demonstrates that AWFA is also viable for a variety of data sources. Notably, AWFA can be used directly only if the test sample's distances (i.e., scores) to each class are available for each kind of data source. This condition is not met when only the distances (i.e., scores) of the test sample to each training sample, rather than to each class, are provided. In this case, one of the following two techniques can be used before applying AWFA. The first is to use the distances (i.e., scores) between the test sample and the training samples to obtain the test sample's scores for each class in advance and then to apply AWFA's native Steps 3 and 4. The second is to modify AWFA's Steps 3 and 4 in the manner shown in Fig. 6.1.
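As a small illustration of the first technique, the helper below (a hypothetical function, not part of the original method) converts the test sample's distances to individual training samples into per-class distances by taking the smallest distance within each class, after which the usual AWFA steps can be applied.

```python
import numpy as np

def class_distances(sample_dists, train_labels, C):
    """Hypothetical helper: per-class scores from per-sample distances.

    One simple choice is the class-wise nearest-neighbor score, i.e., the
    smallest distance between the test sample and any training sample of
    each class; other reductions (e.g., the mean) could be used instead.
    """
    sample_dists = np.asarray(sample_dists, float)
    train_labels = np.asarray(train_labels)
    return np.array([sample_dists[train_labels == c].min() for c in range(C)])
```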
6.2.3 Experimental Results

This section evaluates the proposed fusion approach's performance by utilizing five publicly available datasets, which are the Heterogeneous Face Biometrics (HFB) dataset [33], the 2D plus 3D palmprint dataset [34], the PolyU Multispectral dataset [35], the Georgia Tech face dataset [36], and the Labeled Faces in the Wild (LFW) dataset [37]. To evaluate the proposed fusion approach's performance, we conduct comparative tests using the Product rule [38], Sum rule [38], Min fusion [4], Max fusion [4], and Average Score fusion. Average Score fusion indicates a score fusion strategy in which the scores from the first and second types of data sources are acquired separately and then merged with equal weights for classification. The techniques of score level fusion discussed in Sect. 6.2 are all based on posterior probabilities. To facilitate implementation, we simply employ the Product rule and the Sum rule presented in [38] and the Min rule and the Max rule proposed in [1]. We extract features based on these fusion schemes using principal component
analysis (PCA) [39], collaborative representation based classification (CRC) [40], sparse representation based classification (SRC) [41], and linear regression classification (LRC) [42].

Fig. 6.1 The modified Steps 3 and 4 of AWFA
6.2.3.1 Experiments on the HFB

The Heterogeneous Face Biometrics (HFB) dataset [33] contains 400 near infrared (NIR) images and 400 visual (VIS) images of 100 individuals, which were captured with various poses, expressions, lighting conditions, and eyeglass accessories. Each pair of NIR and VIS images was not collected concurrently, resulting in misalignments. Figure 6.2 illustrates images of one subject in the HFB dataset. Table 6.1 shows the experimental results for NIR and VIS images. The results of our method and the conventional weighted matching score level fusion approaches are compared in Table 6.2. As can be seen, our strategy outperforms conventional weighted matching score level fusion. The first row of each table details the feature extraction procedure. "Original samples" refers to performing classification directly on the original test and training data without any feature extraction. Specifically, in Tables 6.1, 6.3, 6.5, and 6.8 (Tables 6.3, 6.5, and 6.8 are shown later), the classification error rate for "Original samples" is calculated using the nearest neighbor classifier. Additionally, the nearest neighbor classifier is utilized to determine the classification error rate for all PCA experiments.
Fig. 6.2 Some visible and infrared face images. The first row shows the visible face images. The second row shows the infrared face images

Table 6.1 Classification error rate (%) of either of NIR and VIS images from the HFB dataset

                                   PCA     CRC     Original samples   SRC     LRC     Method in []   DFD
One training sample per subject
  NIR                              7.35    5.33    10.33              4.72    5.56    4.67           2.67
  VIS                              12.00   10.00   17.00              8.89    10.55   7.00           5.67
Two training samples per subject
  NIR                              5.50    4.50    8.50               4.00    4.50    3.50           2.00
  VIS                              7.50    3.50    8.00               3.00    4.00    2.50           1.50
As seen in Table 6.1, DFD has the best classification accuracy of all the compared techniques. Take the NIR images as an example: when a single training sample is chosen from each subject, DFD's accuracy is 97.33%, which is 2% (= 97.33% − 95.33%) higher than the second best. According to Table 6.2, AWFA performs the best. When one training sample for each subject is chosen, the accuracy of AWFA based on PCA is 97.79%, which is 2.79% higher than the second best.
6.2.3.2 Experiments on the 2D Plus 3D Palmprint Dataset

The 2D plus 3D palmprint dataset contains 8000 samples from 400 different palms [34]. Each palm has twenty 2D palmprint images and twenty 3D palmprint images. They were collected in two distinct sessions, with 10 palmprint samples from each palm captured in each session. It should be noted that a sample in this case comprises a 3D ROI (i.e., region of interest) and its corresponding 2D ROI. A 3D ROI is represented by a binary file composed of 128 × 128 mean curve ratios, while each 2D ROI is represented by a BMP format image file. We considered the first four samples acquired in the first session as training samples (i.e., four 2D ROI images and four 3D ROI images)
Table 6.2 Classification error rate (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the HFB dataset

                                   PCA     CRC     Original samples   SRC     LRC
One training sample per subject
  AWFA                             2.21    1.67    4.67               1.11    1.67
  Average score fusion             5.27    1.67    6.00               1.39    1.94
  Sum rule                         5.56    1.94    6.39               1.67    2.22
  Product rule                     5.00    2.22    5.56               1.67    1.94
  Min rule                         5.56    2.78    6.94               1.94    2.50
  Max rule                         5.83    3.05    7.22               2.22    2.78
Two training samples per subject
  AWFA                             1.50    0.50    3.00               0.50    1.00
  Average score fusion             2.50    1.50    4.00               1.50    1.50
  Sum rule                         2.50    1.00    4.00               2.00    2.00
  Product rule                     3.00    1.00    4.50               1.50    1.50
  Min rule                         3.50    1.50    5.00               2.50    2.50
  Max rule                         3.50    1.50    5.00               2.50    2.50
and all ten samples collected in the second session as test samples. Before testing the techniques, we resized each image to a 32 × 32 matrix and converted it to a 1024-dimensional vector of unit length. Figure 6.3 illustrates several 2D palmprint images from the 2D plus 3D palmprint dataset.

Fig. 6.3 Some 2D palmprint images from the 2D plus 3D palmprint dataset
Table 6.3 Classification error rate (%) of either of the 2D and 3D palmprint images

Images   PCA    CRC    Original samples   SRC    LRC
2D       9.65   3.82   9.58               3.55   5.35
3D       4.75   6.80   4.75               4.13   4.50

Table 6.4 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the 2D plus 3D palmprint dataset

Fusion approaches      PCA    CRC    Original samples   SRC    LRC
AWFA                   3.67   3.47   3.65               3.10   3.55
Average score fusion   4.23   3.90   3.50               3.45   4.10
Sum rule               4.67   4.10   3.68               3.15   3.98
Product rule           4.10   4.00   3.78               3.65   4.10
Min rule               5.20   4.55   4.05               4.13   4.78
Max rule               4.88   4.88   4.10               4.45   4.85
Table 6.3 shows the experimental results for the 2D and 3D images. The results of our method and the conventional weighted matching score level fusion methods are shown in Table 6.4. Overall, our method outperforms the conventional weighted matching score level fusion approaches in terms of classification error rate.
6.2.3.3 Experiments on the PolyU Multispectral Palmprint Dataset

The PolyU multispectral palmprint dataset was collected from 250 subjects (55 women and 195 men) using the palmprint acquisition device developed by PolyU [35]. Each subject provided palmprint images of both the left and right palms. There were four types of illumination, i.e., red, green, blue, and near infrared, and consequently four types of palmprint images, i.e., red, green, blue, and near infrared palmprint images. These multispectral palmprint images were collected in two separate sessions. In each session, every palm provided 6 palmprint images at each spectral band, so for each spectral band the dataset contains 6000 images from 500 different palms. We use the first three images of each spectral band of a palm from the first session as training samples and all six images of each spectral band of a palm from the second session as test samples in the experiments below. The resolution of each palmprint image is 352 × 288. The 128 × 128 region of interest (ROI) is extracted from each palmprint image using the approach proposed in [53], and the ROI images are resized to 32 × 32. Figure 6.4 illustrates several palmprint images from the PolyU multispectral palmprint dataset. The classification error rates for the blue, green, or near infrared images from the Multispectral dataset are shown in Table 6.5. The classification error rates of our method and the conventional weighted matching score level fusion methods on the Multispectral dataset with the score fusion of blue and near infrared images are shown in Table 6.6.
Fig. 6.4 Four ROI images of the same palm. The first, second, third, and fourth ROI images were extracted from the red, green, blue, and near infrared images, respectively

Table 6.5 Classification error rate (%) of blue, green, or near infrared images of the Multispectral dataset

Channel   PCA    CRC    Original samples   SRC    LRC
B         5.67   3.27   4.80               3.46   4.10
G         5.62   3.43   7.90               3.16   3.67
I         3.57   3.70   5.23               3.50   3.96

Table 6.6 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the Multispectral dataset with the score fusion of blue and near infrared images

Fusion approaches      PCA    CRC    Original samples   SRC    LRC
AWFA                   3.26   3.10   4.97               2.96   3.15
Average score fusion   4.10   3.20   5.10               3.46   3.98
Sum rule               4.02   3.15   5.17               3.33   4.10
Product rule           3.95   3.50   5.23               3.67   4.23
Min rule               4.56   3.87   5.68               4.12   4.98
Max rule               4.87   4.12   5.79               4.65   5.12

Table 6.7 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the Multispectral dataset with the score fusion of green and near infrared images

Fusion approaches      PCA    CRC    Original samples   SRC    LRC
AWFA                   2.63   1.10   2.11               0.96   1.68
Average score fusion   4.11   1.23   2.33               1.26   2.10
Sum rule               4.01   1.20   2.47               1.35   2.22
Product rule           4.26   1.33   2.46               1.66   2.64
Min rule               4.53   1.84   2.57               1.87   2.18
Max rule               4.66   1.76   2.68               2.31   2.68
The classification error rates of our method and the conventional weighted matching score level fusion methods on the Multispectral dataset with the score fusion of green and near infrared images are shown in Table 6.7. We can observe from the results that AWFA outperforms the conventional weighted matching score level fusion methods.
6.2.3.4 Experiments on the Georgia Tech Face Dataset

The Georgia Tech (GT) face dataset was collected at the Georgia Institute of Technology. It comprises images of 50 individuals captured in two or three sessions. Each subject is represented by 15 color JPEG images with cluttered backgrounds taken at a resolution of 640 × 480 pixels. The pictures show frontal and/or tilted faces with different facial expressions, lighting conditions, and scales. All images were manually annotated to ascertain the face's location within each image, and each face image is resized to 60 × 50 pixels. Several image samples from the GT face dataset are shown in Fig. 6.5. Each subject's first five samples are utilized as training samples, while the remaining images are used as test samples. The classification error rates of the R, G, and B color channels are shown in Table 6.8. The classification error rates of our method and the conventional weighted matching score level fusion methods based on the score fusion of two color channels from the original three color channels are shown in Tables 6.9, 6.10, and 6.11. We can observe from the experimental results that AWFA outperforms the conventional weighted matching score level fusion methods.
Fig. 6.5 Some image samples from the GT face dataset

Table 6.8 Classification error rate (%) of the R, G, B color channels of the GT face dataset

Channel     PCA     CRC     Original samples   SRC     LRC
R-channel   44.00   44.40   48.40              38.20   38.20
G-channel   50.40   46.80   52.20              40.20   41.40
B-channel   54.80   50.20   56.40              43.20   46.40
Table 6.9 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the GT face dataset based on the score fusion of the red and green channels

Fusion approaches      PCA     CRC     Original samples   SRC     LRC
AWFA                   40.20   39.60   46.60              35.20   36.60
Average score fusion   43.00   41.20   47.20              36.20   37.20
Sum rule               43.20   40.80   47.60              36.40   37.80
Product rule           42.80   41.60   47.20              36.60   36.80
Min rule               43.60   42.20   47.60              37.20   37.20
Max rule               43.80   42.60   47.80              37.80   38.00

Table 6.10 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the GT face dataset based on the score fusion of the blue and green channels

Fusion approaches      PCA     CRC     Original samples   SRC     LRC
AWFA                   46.20   42.20   48.60              37.20   38.20
Average score fusion   47.80   43.80   49.20              38.60   39.60
Sum rule               47.40   43.60   49.60              38.40   39.40
Product rule           47.60   44.20   50.00              39.00   39.00
Min rule               48.20   44.60   50.20              39.20   40.20
Max rule               48.00   44.80   50.60              39.40   40.60

Table 6.11 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the GT face dataset based on the score fusion of the red and blue channels

Fusion approaches      PCA     CRC     Original samples   SRC     LRC
AWFA                   42.00   42.20   45.20              36.00   36.20
Average score fusion   43.20   43.20   46.40              37.20   37.60
Sum rule               43.60   43.00   46.20              37.40   37.80
Product rule           44.00   43.60   46.00              37.00   37.40
Min rule               43.60   43.80   47.20              37.80   38.00
Max rule               44.20   44.00   47.60              38.00   38.20
6.2.3.5 Experiments on the LFW Dataset

Labeled Faces in the Wild (LFW) [37] is a web-based database devoted to the study of unconstrained face recognition. It contains 13,233 images of 5749 individuals with different poses, occlusions, and expressions. Each face has been labeled with the individual's name, and 1680 of the individuals pictured have two or more distinct photos in the dataset. All of these faces are constrained only by the fact that they were detected by the Viola-Jones face detector. We chose 1251 images of 86 people [37] for the experiments. The images were manually cropped and resized to 40 × 50 pixels. Figure 6.6 illustrates example images of one individual from the LFW database.
Fig. 6.6 Sample images of one individual from the LFW dataset

Table 6.12 Classification error rate (%) of the R, G, B color channels of the LFW face dataset

Channel     PCA     CRC     Original samples   SRC     LRC
R-channel   74.07   64.48   75.49              61.28   65.19
G-channel   75.49   66.79   76.20              62.52   65.36
B-channel   74.78   65.19   76.38              62.70   65.72

Table 6.13 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the LFW face dataset based on the score fusion of the red and green channels

Fusion approaches      PCA     CRC     Original samples   SRC     LRC
AWFA                   66.25   63.59   66.61              55.24   56.66
Average score fusion   66.96   65.19   68.21              57.19   57.19
Sum rule               68.21   65.72   67.67              57.37   57.73
Product rule           68.74   65.54   67.14              57.55   58.79
Min rule               70.52   67.14   69.63              59.15   59.15
Max rule               69.80   68.56   70.87              58.79   60.00

Table 6.14 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the LFW face dataset based on the score fusion of the blue and green channels

Fusion approaches      PCA     CRC     Original samples   SRC     LRC
AWFA                   66.25   64.12   68.56              57.19   59.15
Average score fusion   67.85   65.72   69.27              58.61   61.63
Sum rule               67.32   65.54   69.63              58.44   61.46
Product rule           67.67   66.25   69.98              58.97   62.70
Min rule               68.20   67.67   71.23              59.15   63.23
Max rule               68.03   68.74   71.58              59.33   63.59
The first eight images of each subject are used as the training set and the remaining images are utilized as the test set. The classification error rates of the competing approaches on the R, G, B color channels of the LFW face dataset are shown in Table 6.12. The classification error rates of the competing approaches with the score fusion of two color channels from the three color channels are shown in Tables 6.13, 6.14, and 6.15. As shown by the results, AWFA outperforms the conventional weighted matching score level fusion methods.
Table 6.15 Classification error rates (%) of the proposed fusion approach and the conventional weighted matching score level fusion approaches on the LFW face dataset based on the score fusion of the red and blue channels

Fusion approaches      PCA     CRC     Original samples   SRC     LRC
AWFA                   63.94   62.17   65.19              55.95   59.15
Average score fusion   65.19   63.23   66.43              57.19   60.57
Sum rule               65.54   63.59   66.25              57.37   61.81
Product rule           64.12   63.77   66.07              57.02   61.46
Min rule               66.61   64.83   67.14              57.73   61.99
Max rule               66.25   65.01   67.67              58.08   62.17
6.2.4 Conclusions

This chapter describes an adaptive weighted fusion method, which can automatically determine the optimal weights for each test sample, requiring no manual setting. As a consequence, this method is capable of effectively integrating the advantages of complementary data sources. Extensive experiments reveal that the proposed method outperforms previous state-of-the-art fusion methods.
6.3 Adaptive Weighted Fusion of Local Kernel Classifiers

As discussed in Chap. 3, non-linearity exists in real-world data. However, AWFA only represents data in a linear fashion, which is a limitation. To address this issue, we present another adaptive fusion technique [28] built upon the fast and scalable local kernel support vector machine (FaLK-SVM) approach [30], which we refer to as FaLK-SVMa hereafter. To combine the local classifiers in FaLK-SVMa, we use a self-adaptive weighted fusion approach and present two fusion methods: distance-based weighting (FaLK-SVMad) and rank-based weighting (FaLK-SVMar). Experimental results indicate that FaLK-SVMa can provide more accurate classification results than FaLK-SVM [28].
6.3.1 FaLK-SVM

Brief Survey on FaLK-SVM We first introduce FaLK-SVM, a fast and scalable kernel approach for pattern classification [30]. Specifically, $\chi$ represents the training set $\{(\mathbf{x}_i, y_i)\,|\,i = 1, \cdots, N\}$, in which $\mathbf{x}_i$ represents the i-th training sample, $y_i$ represents the class label of $\mathbf{x}_i$, and $N$ represents the number of training samples. The number of nearest neighbors is $k$. During the training process, FaLK-SVM uses $k' \le k$, and a greedy method is
utilized to create an (approximate) minimal $k'$-neighborhood covering set of centers $\mathcal{C} \subseteq \chi$ [30]. For each center $\mathbf{c} \in \mathcal{C}$, a local SVM classifier is trained with the $k$-nearest neighbors of $\mathbf{c}$. As a result, $|\mathcal{C}|$ SVMs are trained, where $|\cdot|$ represents the cardinality of the set. During the test process, given a sample $\mathbf{x}$, one can utilize cover trees [43] to determine its nearest neighbor and the relevant model center $\mathbf{c}$ with its $k$ training samples $\{(\mathbf{x}_{r_\mathbf{c}(i)}, y_{r_\mathbf{c}(i)})\,|\,i = 1, \cdots, k\}$. Thus the local SVM classifier associated with $\mathbf{c}$ can be utilized for classifying the test sample $\mathbf{x}$,

$\mathrm{kNNSVM}_\mathbf{c}(\mathbf{x}) = \mathrm{sgn}\left(\sum_{i=1}^{k} \alpha_{r_\mathbf{c}(i)} y_{r_\mathbf{c}(i)} K(\mathbf{x}_{r_\mathbf{c}(i)}, \mathbf{x}) + b_\mathbf{c}\right) \qquad (6.1)$

in which $\mathbf{x}_{r_\mathbf{c}(i)}$ represents the i-th nearest neighbor of the center $\mathbf{c}$, $\alpha_{r_\mathbf{c}(i)}$ represents the local Lagrangian multiplier, and $b_\mathbf{c}$ represents the local bias. During the classification process, two strategies can be used to choose the center $\mathbf{c} \in \mathcal{C}$. For a test sample $\mathbf{x}$, one can choose the local SVM classifier whose center $\mathbf{c} \in \mathcal{C}$ is the nearest center to $\mathbf{x}$. Segata and Blanzieri named this strategy FaLK-SVMc,

$\mathrm{FaLK\text{-}SVMc}(\mathbf{x}) = \mathrm{kNNSVM}_\mathbf{c}(\mathbf{x}), \quad \text{where } \mathbf{c} \text{ is the center in } \mathcal{C} \text{ nearest to } \mathbf{x} \qquad (6.2)$

Besides, Segata and Blanzieri proposed another approach, FaLK-SVM, in which the nearest neighbor $\mathbf{x}_{NN}$ of $\mathbf{x}$ is first found in the training set $\chi$. Given that $\mathbf{x}_{NN}$ is the $j_l$-th nearest neighbor of the center $\mathbf{c}_l$, the local classifier $\mathrm{kNNSVM}_{cnt(\mathbf{x}_{NN})}(\mathbf{x})$ is selected such that this rank is minimal,

$cnt(\mathbf{x}_{NN}) = \mathrm{choose}\left\{\mathbf{c}_z \in \mathcal{C}\,|\,\mathbf{x}_{NN} = \mathbf{x}_{r_{\mathbf{c}_z}(h)}\right\}, \quad h = \min\left\{t \in 1, \cdots, k\,|\,\mathbf{x}_{r_{\mathbf{c}_j}(t)} = \mathbf{x}_{NN}, \ \forall \mathbf{c}_j \in \mathcal{C}\right\} \qquad (6.3)$

The FaLK-SVM strategy can be described as follows:

$\mathrm{FaLK\text{-}SVM}(\mathbf{x}) = \mathrm{kNNSVM}_{cnt(\mathbf{t})}(\mathbf{x}), \quad \text{where } \mathbf{t} = \mathbf{x}_{NN} \qquad (6.4)$
Segata and Blanzieri expand on Vapnik and Bottou's local risk minimization theory [44, 45] by discussing the generalization bounds for FaLK-SVM [30]. Additionally, new theoretical studies on consistency and localizability [46] corroborate the effectiveness of FaLK-SVM.

Discussion Segata and Blanzieri [30] proposed FaLK-SVMc and FaLK-SVM, each of which has its own selection technique for local classifiers. While FaLK-SVM outperforms FaLK-SVMc in terms of classification accuracy, neither selection strategy can easily be considered optimal. Additionally, when the greedy technique is used, the minimal $k'$-neighborhood covering set is only approximately optimal, complicating the task of selecting the ideal local classifier.
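To illustrate the training and prediction pipeline summarized by Eqs. (6.1)-(6.4), the sketch below trains one local RBF-kernel SVM per covering center and classifies a query with the classifier of its nearest center (the FaLK-SVMc rule); the greedy covering step is replaced here by an externally supplied list of center indices, purely as a simplifying assumption for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def train_local_svms(X, y, centers_idx, k, sigma, C=1.0):
    """Train one RBF-kernel SVM on the k-nearest neighbors of every center."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    models = []
    for c in centers_idx:
        _, idx = nn.kneighbors(X[c].reshape(1, -1))
        clf = SVC(C=C, kernel="rbf", gamma=1.0 / sigma)   # K = exp(-||x-x'||^2 / sigma)
        clf.fit(X[idx[0]], y[idx[0]])
        models.append(clf)
    return models

def predict_falk_svmc(x, X, centers_idx, models):
    """FaLK-SVMc rule: use the local SVM whose center is nearest to the query."""
    dists = np.linalg.norm(X[centers_idx] - x, axis=1)
    return models[int(np.argmin(dists))].predict(x.reshape(1, -1))[0]

# centers_idx stands in for the (approximate) minimal k'-neighborhood covering set;
# a random subset of training indices can be used purely for experimentation.
```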
In addition, for each center $\mathbf{c}$, a local SVM classifier $\mathrm{kNNSVM}_\mathbf{c}(\mathbf{x})$ is trained on its $k$-nearest neighbors to classify test samples located in the neighborhood of $\mathbf{c}$. Accordingly, when a test sample $\mathbf{x}$ is located in the neighborhoods of several centers, it could be more reasonable to classify $\mathbf{x}$ using the local classifiers associated with all of these centers.
6.3.2 Adaptive Fusion of Local SVM Classifiers

This section first introduces an improved FaLK-SVM method, FaLK-SVMa, which selects several local SVM classifiers and utilizes a self-adaptive fusion approach for classifying the test samples. In addition, some model selection issues of FaLK-SVM and FaLK-SVMa are discussed.
6.3.2.1 The Adaptive Fusion Method

In FaLK-SVM and FaLK-SVMc, a test sample in the neighborhood of the center $\mathbf{c}$ is classified using the local classifier $\mathrm{kNNSVM}_\mathbf{c}(\mathbf{x})$. Because the $k'$-neighborhood covering set is overcomplete, one test sample can fall inside the neighborhoods of several centers $\{\mathbf{c}_1, \ldots, \mathbf{c}_m\}$. In [30], two strategies are used to determine which single center, and thus which local classifier, is the proper one. We believe that a combination of these local classifiers [13, 40, 47] could further improve classification accuracy. During the classification process, given the test sample $\mathbf{x}$, the centers $\{\mathbf{c}_1, \ldots, \mathbf{c}_m\}$ for which $\mathbf{x}$ lies in the neighborhood of $\mathbf{c}_i$ ($i = 1, \ldots, m$) are first found. The classification result of the local classifier with respect to the center $\mathbf{c}_i$ is calculated as follows:

$y_{\mathbf{c}_i} = \mathrm{kNNSVM}_{\mathbf{c}_i}(\mathbf{x}) \qquad (6.5)$
A self-adaptive weighted fusion approach is offered for combining the classification results $y_{\mathbf{c}_i}$. Let $d_i$ represent the distance between the sample $\mathbf{x}$ and the center $\mathbf{c}_i$. The weight assigned to the classifier $\mathrm{kNNSVM}_{\mathbf{c}_i}(\mathbf{x})$ is

$\mathrm{weight}_i = (1/d_i)^a \qquad (6.6)$

in which $i = 1, \ldots, m$ and $a$ is the weight parameter, whose default value is set to 1.0. In addition, each model's normalized weight can be calculated as

$w_i = \frac{\mathrm{weight}_i}{\sum_{i=1}^{m} \mathrm{weight}_i} \qquad (6.7)$
Accordingly, the test sample $\mathbf{x}$ can be classified by utilizing the weighted fusion of the classifiers $\mathrm{kNNSVM}_{\mathbf{c}_i}(\mathbf{x})$,

$f(\mathbf{x}) = \sum_{i=1}^{m} w_i y_{\mathbf{c}_i} \qquad (6.8)$
The FaLK-SVMa variant with the weights described in Eqs. (6.6) and (6.7) is referred to as FaLK-SVMad in the following. In practice, there are several other options for determining the weight $w_i$. Let $\mathbf{x}_{NN}$ represent the nearest neighbor of $\mathbf{x}$ in the training set. Assuming $\mathbf{x}_{NN}$ is the $k_i$-th nearest neighbor of the center $\mathbf{c}_i$, the weight $w_i$ can be defined as

$w_i = \frac{(1/k_i)^a}{\sum_{j=1}^{m} (1/k_j)^a} \qquad (6.9)$

in which $a$ represents the weight parameter with the default value 1.0. The FaLK-SVMa variant with the weights defined in Eq. (6.9) is named FaLK-SVMar. According to the results of [30], the FaLK-SVM method outperforms the FaLK-SVMc method. Accordingly, FaLK-SVMar is expected to be effective for enhancing classification performance.
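The following sketch illustrates the two weighting schemes under stated assumptions: `local_preds` holds the ±1 outputs of the local SVMs whose neighborhoods contain the test sample, and `dists`/`ranks` are the corresponding center distances and neighbor ranks. It is a minimal illustration, not the authors' implementation.

```python
import numpy as np

def falk_svma_fuse(local_preds, dists=None, ranks=None, a=1.0):
    """Self-adaptive weighted fusion of local SVM outputs (FaLK-SVMa sketch).

    local_preds : array of +/-1 predictions from the m local classifiers
                  whose neighborhoods contain the test sample.
    dists       : distances from the test sample to the m centers (FaLK-SVMad).
    ranks       : rank of the test sample's nearest training neighbor within
                  each center's neighbor list (FaLK-SVMar).
    a           : weight parameter (default 1.0), as in Eqs. (6.6) and (6.9).
    """
    base = dists if dists is not None else ranks
    raw = (1.0 / np.asarray(base, dtype=float)) ** a   # Eq. (6.6) or (6.9) numerator
    w = raw / raw.sum()                                # normalized weights, Eq. (6.7)
    f = np.dot(w, local_preds)                         # weighted fusion, Eq. (6.8)
    return 1 if f >= 0 else -1
```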
6.3.2.2 Model Selection

The choice of kernel function and the corresponding kernel parameter value are critical for the performance of kernel approaches. We use the Gaussian RBF kernel for both FaLK-SVM and FaLK-SVMa,

$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma}\right) \qquad (6.10)$
in which $\mathbf{x}$ and $\mathbf{x}'$ represent two samples, and $\sigma$ represents the parameter of the Gaussian RBF kernel. The same $\sigma$ value is selected for all the local classifiers, and the standard cross validation method is utilized for model selection. For FaLK-SVM and FaLK-SVMa, the value of the parameter $\sigma$ is selected from an adaptive range determined by the distances between samples. In addition, $k$ is set as a proportion of the size of the training set. Another parameter to consider when choosing a FaLK-SVMa model is the soft margin regularization constant $C$. After defining the range of each parameter, the values of these parameters are determined using the k-fold cross validation approach.
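A minimal sketch of this model selection step follows, assuming scikit-learn-style cross-validation splits and the parameter grids reported in the experiments of Sect. 6.3.3; the helper `falk_svma_error`, which trains and evaluates one configuration, is hypothetical.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

# Parameter grids roughly as used in Sect. 6.3.3 (the grid for k also includes N in the text).
C_grid     = [2.0 ** p for p in range(-2, 11)]
sigma_grid = [2.0 ** p for p in range(-3, 6)]
k_grid     = [2 ** p for p in range(1, 11)]

def select_model(X, y, n_splits=10):
    """k-fold cross-validated grid search over (C, sigma, k), with k' = k/2."""
    best, best_err = None, np.inf
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for C, sigma, k in itertools.product(C_grid, sigma_grid, k_grid):
        errs = []
        for tr, va in kf.split(X):
            # falk_svma_error is a hypothetical helper that trains FaLK-SVMa on the
            # training fold with the given parameters and returns the validation error.
            errs.append(falk_svma_error(X[tr], y[tr], X[va], y[va],
                                        C=C, sigma=sigma, k=k, k_prime=k // 2))
        if np.mean(errs) < best_err:
            best, best_err = (C, sigma, k), np.mean(errs)
    return best
```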
6.3.3 Experimental Results

We compare the classification performance of SVM, FaLK-SVM, and FaLK-SVMa in our experiments. The SVM source code is available in the Lib-SVM library [16], and the FaLK-SVM code is available in [48]. We then implement the FaLK-SVMa code on top of the FaLK-SVM source code. We utilize fourteen datasets from the UCI machine learning repository [49] and three large-scale datasets to test the classification performance of FaLK-SVMa. The input features of each dataset are normalized to values in [−1, 1].
6.3.3.1 Experimental Results on the UCI Datasets

The number of features d and the total number of instances N for each of the fourteen datasets are described in Table 6.16. The tenfold cross validation (CV) method is used for classifier parameter estimation and for evaluating the classification performance of the classifiers. For FaLK-SVM and FaLK-SVMa, three parameters have to be determined: $C$ is selected from $\{2^{-2}, 2^{-1}, \ldots, 2^{10}\}$, the kernel parameter $\sigma$ is selected from $\{2^{-3}, 2^{-2}, \ldots, 2^{4}, 2^{5}\}$, and the nearest neighbor parameter $k$ is selected from $\{2^{1}, 2^{2}, \ldots, 2^{9}, 2^{10}, N\}$. As for the value of the $k'$ parameter, $k' = k/2$. The classification accuracy of the methods on the fourteen UCI datasets is shown in Table 6.17.

Table 6.16 Summary of the fourteen datasets

Dataset name   # of features   # of samples
Sonar          60              208
Heart          13              270
Haberman       3               306
Liver          6               345
Ionosphere     34              351
Hill           100             606
Breast         10              683
Australian     14              690
Transfusion    4               748
Diabetes       8               768
Fourclass      2               862
Tic_tac_toe    9               958
Man            5               829
Numer          24              1000
Table 6.17 Classification accuracy (%) of Lib-SVM, FaLK-SVM, FaLK-SVMad, and FaLK-SVMar on the fourteen UCI datasets

Dataset name   Lib-SVM   FaLK-SVM   FaLK-SVMad   FaLK-SVMar
Sonar          87.5      90.38      91.83        91.35
Heart          82.22     84.44      84.81        84.44
Haberman       74.51     76.72      76.72        76.72
Liver          71.59     74.2       74.2         74.2
Ionosphere     93.73     94.87      94.87        95.16
Hill           63.2      65.29      64.46        64.46
Breast         96.63     97.22      97.51        97.36
Australian     86.23     87.39      86.96        87.25
Transfusion    78.07     79.41      79.28        79.41
Diabetes       76.43     78.26      78.78        78.39
Fourclass      99.65     99.88      99.77        99.88
Tic_tac_toe    99.9      99.69      99.58        99.69
Man            82.63     83.21      83.21        83.21
Numer          75.3      75.9       76           75.9

As can be seen, FaLK-SVMad has better classification accuracy than FaLK-SVM for eight out of the fourteen datasets, and FaLK-SVMar outperforms FaLK-SVM for eleven out of the fourteen datasets. Moreover, FaLK-SVMad achieves the best classification accuracy on five of the fourteen datasets, while FaLK-SVM and Lib-SVM achieve the highest classification accuracy on two and only one of the fourteen datasets, respectively. Generally, FaLK-SVMa achieves better classification performance.

Table 6.18 Summary of the three large-scale datasets

Dataset name   # of features   # of training points   # of testing points
Splice         60              1000                   2175
Astro          4               3089                   4000
a1a            119             1605                   30,956
6.3.3.2 Experimental Results on the Large-Scale Datasets

Three large-scale datasets are utilized to evaluate the classification accuracy of FaLK-SVMa, FaLK-SVM, and SVM. Since a dataset of limited size provides no independent test set for classification evaluation, three large-scale datasets are used here, each split into a training set and a test set (summarized in Table 6.18). The values of the model parameters are estimated on the training set by standard tenfold cross validation, and the classification performance of the methods is then evaluated on the test set. As can be found in Table 6.19, FaLK-SVMar achieves better classification performance, which indicates that FaLK-SVMa outperforms FaLK-SVM. Besides, although FaLK-SVMa improves over FaLK-SVM, it incurs no additional computational cost during the training process.
Table 6.19 Classification accuracy (%) of Lib-SVM, FaLK-SVM, FaLK-SVMad, and FaLK-SVMar on the three large-scale datasets

Dataset name   Lib-SVM   FaLK-SVM   FaLK-SVMad   FaLK-SVMar
Splice         87.72     90.06      89.42        89.42
Astro          95.50     96.25      96.05        96.40
a1a            80.81     84.27      83.72        84.38
During the test process, FaLK-SVMa incurs only slightly more computational cost than FaLK-SVM due to the limited number of local classifiers to be fused.
6.3.4 Conclusion

In this chapter, an improved FaLK-SVM method, FaLK-SVMa, is introduced. Given that a single test sample can lie in the neighborhoods of several centers $\{\mathbf{c}_1, \ldots, \mathbf{c}_m\}$, rather than selecting one center and its local classifier, we propose that a proper combination of these local classifiers is more effective at increasing classification accuracy. For integrating the local SVM classifiers, we offer a self-adaptive weighted fusion approach. FaLK-SVMa achieves greater classification accuracy than FaLK-SVM, as shown by the experimental results on fourteen UCI datasets and three large-scale datasets. Local learning methods have also shown great promise in the construction of globally consistent dimensionality reduction models in manifold learning [50–52]. Additionally, Bengio et al. [53] developed a technique for building a kernel model for out-of-sample extension by utilizing the kernel trick. In future work, we will continue to improve the FaLK-SVM approach and investigate whether a single globally consistent model can represent the FaLK-SVM classifier.
References

1. Zhang D, Song F, Xu Y, Liang Z. Other tensor analysis and further direction. In: Advanced pattern recognition technologies with applications to biometrics. Hershey: IGI Global; 2009. p. 226–2.
2. Xu Y, Zhang D. Represent and fuse bimodal biometric images at the feature level: complex-matrix-based fusion scheme. Opt Eng. 2010;49(3):037002.
3. Ross A, Nandakumar K. Fusion, score-level. Encyclopedia Biom. 2009:611–616.
4. Jain A, Nandakumar K, Ross A. Score normalization in multimodal biometric systems. Pattern Recognit. 2005;38(12):2270–85.
5. Xu Y, Zhu Q, Zhang D. Combine crossing matching scores with conventional matching scores for bimodal biometrics and face and palmprint recognition experiments. Neurocomputing 2011;74(18):3946–52.
6. Nandakumar K, Chen Y, Dass SC, Jain A. Likelihood ratio-based biometric score fusion. IEEE Trans Pattern Anal Mach Intell. 2007;30(2):342–7.
7. Brunelli R, Falavigna D. Person identification using multiple cues. IEEE Trans Pattern Anal Mach Intell. 1995;17(10):955–66.
8. Ma Y, Cukic B, Singh H. A classification approach to multi-biometric score fusion. In: International conference on audio- and video-based biometric person authentication. Berlin: Springer; 2005. p. 484–93.
9. Bengio S, Marcel C, Marcel S, Mariéthoz J. Confidence measures for multimodal identity verification. Inf Fusion 2002;3(4):267–76.
10. Moin MS, Parviz M. Exploring AUC boosting approach in multimodal biometrics score level fusion. In: 2009 fifth international conference on intelligent information hiding and multimedia signal processing. Piscataway: IEEE; 2009. p. 616–9.
11. Parviz M, Moin MS. Multivariate polynomials estimation based on GradientBoost in multimodal biometrics. In: International conference on intelligent computing. Berlin: Springer; 2008. p. 471–7.
12. Snelick R, Uludag U, Mink A, Indovina M, Jain A. Large-scale evaluation of multimodal biometric authentication using state-of-the-art systems. IEEE Trans Pattern Anal Mach Intell. 2005;27(3):450–5.
13. Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 1998;20(3):226–239.
14. Ulery B, Hicklin AR, Watson C, Fellner W, Hallinan P. Studies of biometric fusion: NIST. Technical Report NISTIR 7346. 2006.
15. Rakhlin A, Panchenko D, Mukherjee S. Risk bounds for mixture density estimation. ESAIM: Probab Stat. 2005;9:220–229.
16. Toh KA, Kim J, Lee S. Maximizing area under ROC curve for biometric scores fusion. Pattern Recognit. 2008;41(11):3373–3392.
17. Rudin C, Schapire RE. Margin-based ranking and an equivalence between AdaBoost and RankBoost. J Mach Learn Res. 2009;10:2193–2232.
18. Gao S, Sun Q. Improving semantic concept detection through optimizing ranking function. IEEE Trans Multimedia 2007;9(7):1430–1442.
19. Sim HM, Asmuni H, Hassan R, Othman RM. Multimodal biometrics: weighted score level fusion based on non-ideal IRIS and face images. Exp Syst Appl. 2014;41(11):5390–5404.
20. McCloskey S, Liu J. Metadata-weighted score fusion for multimedia event detection. In: 2014 Canadian conference on computer and robot vision. Piscataway: IEEE; 2014. p. 299–305.
21. Butakoff C, Frangi AF. A framework for weighted fusion of multiple statistical models of shape and appearance. IEEE Trans Pattern Anal Mach Intell. 2006;28(11):1847–1857.
22. Xu Y, Li X, Yang J, Lai Z, Zhang D. Integrating conventional and inverse representation for face recognition. IEEE Trans Cybern. 2013;44(10):1738–1746.
23. Casazza PG, Peterson J. Weighted fusion frame construction via spectral Tetris. Adv Comput Math. 2014;40(2):335–51.
24. Jain AK, Ross A. Learning user-specific parameters in a multibiometric system. In: International conference on image processing proceedings, vol. 1. Piscataway: IEEE; 2002. p. I.
25. Moumene ME, Nourine R, Ziou D. Generalized exposure fusion weights estimation. In: 2014 Canadian conference on computer and robot vision. Piscataway: IEEE; 2014. p. 71–6.
26. Kim S, Choi JY, Han S, Ro YM. Adaptive weighted fusion with new spatial and temporal fingerprints for improved video copy detection. Signal Process Image Commun. 2014;29(7):788–806.
27. Gao G, Zhang L, Yang J, Zhang L, Zhang D. Reconstruction based finger-knuckle-print verification with score level adaptive binary fusion. IEEE Trans Image Process 2013;22(12):5050–62.
28. Yang S, Zuo W, Liu L, Li Y, Zhang D. Adaptive weighted fusion of local kernel classifiers for effective pattern classification. In: International conference on intelligent computing. Berlin: Springer; 2011. p. 63–70.
29. Xu Y, Lu Y. Adaptive weighted fusion: a novel fusion approach for image classification. Neurocomputing 2015;168:566–74.
30. Segata N, Blanzieri E. Fast and scalable local kernel machines. J Mach Learn Res. 2010;11:1883–926.
31. Kim C, Choi CH. Image covariance-based subspace method for face recognition. Pattern Recognit. 2007;40(5):1592–604.
32. Price JR, Gee TF. Face recognition using direct, weighted linear discriminant analysis and modular subspaces. Pattern Recognit. 2005;38(2):209–219.
33. Li SZ, Lei Z, Ao M. The HFB face database for heterogeneous face biometrics research. In: 2009 IEEE computer society conference on computer vision and pattern recognition workshops. Piscataway: IEEE; 2009. p. 1–8.
34. HK-PolyU 2D+3D palmprint dataset. http://www.comp.polyu.edu.hk/~biometrics/2D_3D_Palmprint.htm
35. Han D, Guo Z, Zhang D. Multispectral palmprint recognition using wavelet-based image fusion. In: 2008 9th international conference on signal processing. Piscataway: IEEE; 2008. p. 2074–77.
36. Wang SJ, Yang J, Zhang N, Zhou CG. Tensor discriminant color space for face recognition. IEEE Trans Image Process. 2011;20(9):2490–501.
37. Wang SJ, Yang J, Sun MF, Peng XJ, Sun MM, Zhou CG. Sparse tensor discriminant color space for face verification. IEEE Trans Neural Netw Learn Syst. 2012;23(6):876–88.
38. Ross A, Jain A. Information fusion in biometrics. Pattern Recognit Lett. 2003;24(13):2115–2125.
39. Belhumeur PN, Hespanha JP, Kriegman DJ. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell. 1997;(7):711–20.
40. Zhang L, Yang M, Feng X. Sparse representation or collaborative representation: which helps face recognition? In: 2011 international conference on computer vision. Piscataway: IEEE; 2011. p. 471–8.
41. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y. Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell. 2008;31(2):210–27.
42. Huang SM, Yang JF. Improved principal component regression for face recognition under illumination variations. IEEE Signal Process Lett. 2012;19(4):179–82.
43. Beygelzimer A, Kakade S, Langford J. Cover trees for nearest neighbor. In: Proceedings of the 23rd international conference on machine learning. New York: ACM; 2006. p. 97–104.
44. Vapnik V, Bottou L. Local algorithms for pattern recognition and dependencies estimation. Neural Comput. 1993;5(6):893–909.
45. Vapnik V. The nature of statistical learning theory. Berlin: Springer; 2013.
46. Zakai A, Ritov Y. Consistency and localizability. J Mach Learn Res. 2009;10:827–856.
47. Liu CL. Classifier combination based on confidence transformation. Pattern Recognit. 2005;38(1):11–28.
48. Segata N. FaLKM-lib: a library for fast local kernel machines. 2009. Freely available for research and education purposes at http://disi.unitn.it/~segata/FaLKM-lib.
49. Newman DJ. UCI repository of machine learning databases. 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
50. Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science 2000;290(5500):2319–23.
51. Saul LK, Roweis ST. Think globally, fit locally: unsupervised learning of low dimensional manifolds. J Mach Learn Res. 2003;4:119–155.
52. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science 2000;290(5500):2323–6.
53. Bengio Y, Paiement JF, Vincent P, Delalleau O, Roux NL, Ouimet M. Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering. In: Advances in neural information processing systems. 2004. p. 177–84.
Chapter 7
Information Fusion Based on Deep Learning
Recently, deep learning has shown exceptional performance in a variety of areas. In contrast to standard shallow models, deep learning approaches use deeper architectures to fit the complex distributions in real-world datasets more effectively. This chapter discusses three strategies for fusing multiple branches of networks into a single feature based on deep learning. After reading this chapter, readers should have a basic understanding of deep learning-based fusion algorithms.
7.1 Motivation

Several approaches to fusion have been presented in past chapters, including sparse representation-based methods, collaborative representation-based methods, GPLVM-based methods, Bayes model-based methods, metric learning-based methods, and adaptive score fusion-based methods. Almost all of them, however, rely only on hand-crafted features for fusion. In reality, real-world data is highly complicated, and hand-crafted features based on human-defined settings cannot adequately capture it. Fortunately, with the fast development of deep learning, which is capable of adaptively and autonomously extracting features, several deep learning-based fusion algorithms have been presented. We present three deep learning algorithms in this chapter and demonstrate their application to image retrieval. In particular, hashing methods have gained considerable interest and are commonly used in nearest neighbor search for information retrieval on large-scale datasets owing to their low storage cost and fast retrieval performance. By producing compact codes, hashing learning seeks to map the original data points to binary compact hash codes. Not only can these codes significantly decrease the storage cost and realize a constant or sub-linear time complexity for information search, but they can also maintain the semantic information of the original data points.
In general, hashing methods can be classified as data-independent or data-dependent. Locality Sensitive Hashing (LSH) [1] is a widely used data-independent approach that seeks to generate hashing codes using many randomly generated projections, guaranteeing that the likelihood of collision is substantially greater for data points that are close together than for those that are far apart. LSH has been further developed into a kernel version (KLSH) [2] in order to non-linearly describe complicated real-world datasets. Additionally, multiple distance or similarity priors have been placed on the basic LSH in order to produce a variety of extensions [3–5]. However, LSH has a performance disadvantage since it is completely data-independent and ignores the data distribution that is beneficial for performance enhancement. To address this issue, researchers have concentrated their efforts on data-dependent algorithms that learn an adaptive hashing function for a given dataset. In general, data-dependent algorithms may be classified as unsupervised and supervised. Unsupervised hashing algorithms make use of the structural information in the input data to generate compact codes. DSH [6] was offered as a replacement for the random projection in LSH. Weiss et al. [7] proposed Spectral Hashing (SH), which establishes a link between binary coding and graph partitioning. To handle the high complexity of SH when dealing with large datasets, Heo et al. proposed Spherical Hashing based on hyperspheres to learn a spherical Hamming distance [8]. Jiang and Li [9] presented a method for large-scale graph hashing called Scalable Graph Hashing (SGH). Unlike SH and SGH, which neglect the discrete constraint, the Discrete Graph Hashing (DGH) [10] algorithm was investigated for its ability to discover the underlying neighborhood structure in a discrete code space. In comparison to graph hashing, Iterative Quantization (ITQ) [11] and Double-bit Quantization (DBQ) [12] were proposed to decrease the quantization error. Unlike unsupervised hashing approaches, which disregard the label information in the training set, supervised hashing learning focuses on learning the hash function so that the projected hashing codes maintain the semantic information of the original space. Several commonly used supervised hashing techniques include Kernel-based Supervised Hashing (KSH) [13], Fast Supervised Hashing (FastH) [14], Supervised Discrete Hashing (SDH) [15], and Column Sampling-based Discrete Supervised Hashing (COSDISH) [16]. In supervised hashing, both KSH and FastH achieve non-linearity, while SDH and COSDISH optimize their models discretely. While other data-dependent approaches have been investigated, they often encounter performance limitations as a result of the usage of hand-crafted features. Deep learning with an end-to-end network has emerged as a viable and promising approach in recent years. Liong et al. [17] advocated using a deep network to jointly represent the input and generate binary codes. Convolutional Neural Networks (CNNs) are commonly used in hashing learning because of their strong image representation. For example, Zhang et al. [18] integrated CNNs with hashing learning to create a unified model for image retrieval and person re-identification. Additionally, deep hashing networks (DHN) [19], deep supervised hashing (DSH-DL) [20], and convolutional neural network-based hashing (CNNH)
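As a brief illustration of the data-independent idea behind LSH, the sketch below builds sign-random-projection hash codes; the code length and random projection matrix are arbitrary choices for the example, not values taken from the chapter.

```python
import numpy as np

def lsh_codes(X, n_bits=32, seed=0):
    """Sign-random-projection LSH: one random hyperplane per bit."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))   # data-independent projections
    return (X @ W >= 0).astype(np.int8)             # binary codes in {0, 1}

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

# Nearby points agree on most bits (collide) with higher probability than distant ones.
```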
[21] have been suggested. The authors of [22] introduced a pairwise loss (DPSH) to maintain the semantic information between each pair of outputs. Shen et al. [23] (DAPH) and Jiang et al. [24] (ADSH) suggested asymmetric structures and established their advantages experimentally. In this chapter, we present two additional asymmetric deep hashing learning algorithms for image retrieval. Specifically, these two algorithms extract features from two branches of deep networks, and we then fuse the two extracted features to create hashing codes using two distinct strategies. The aim of person re-identification is to match two pedestrian photographs taken from disparate viewpoints [25]. It has generated tremendous attention and effort in recent years as a result of its wide range of applications in video surveillance [26, 27]. This topic, nevertheless, remains quite difficult and deserves deeper research because of the wide variation in lighting, poses, views, and backgrounds of pedestrian images. The task of person re-identification can be handled by two strategies: first, using distance or similarity measures on a single-image representation, i.e., the representation of a single image [28–35]; second, using classification on a cross-image representation, i.e., the representation of an image pair [36–38]. In the first kind of strategy, a single-image representation (SIR) is first obtained using either hand-crafted features [28–33, 39–42] or deep convolutional neural network (CNN) approaches [18, 35, 43], and then a distance measure with a threshold is used to predict whether or not two pedestrian images are matched. After acquiring the cross-image representation (CIR), person re-identification may be considered a straightforward binary classification task [36–38, 44]. Each of these two groups of approaches has distinct benefits. The SIR provides significant efficiency benefits: given a gallery set of N images, one may precompute their SIRs, and during the matching step we only need to extract the SIR of the probe image and calculate its distances from the SIRs of the gallery images, whereas with CIR classification we need to extract the CIR between the probe image and each gallery image (i.e., N times). In comparison to SIR, CIR is more successful in capturing the relationships between the two images, and numerous ways of addressing horizontal displacement through local patch matching have been proposed. Thus, SIR and CIR both have distinct benefits, and this observation motivates us to study a unified method for integrating these two representations in terms of both efficacy and efficiency. The purpose of this work is to examine the relationship between SIR and CIR and to propose a collaborative learning framework with a deep CNN that makes use of the strengths of these two kinds of representation.
7.2 Dual Asymmetric Deep Hashing Learning As mentioned previously, despite the widespread use of deep neural networks for hashing learning, the majority of these networks are symmetric structures in which the Hamming distance between the outputs of the same hash function [23] is used
to estimate the similarity [45] between each pair of points. As mentioned in [23], a critical issue is that this symmetric method would result in the difficulty of optimizing the discrete constraint. Thus, in this chapter, we propose an asymmetric hashing approach called Dual Asymmetric Deep Hashing Learning (DADH) [46] to overcome the aforementioned difficulty. Notably, Shen et al. [23] published a similar technique dubbed deep asymmetric pairwise hashing (DAPH). Our research is different from DAPH. Shen et al. attempts to estimate the similarity affinity by using two distinct hashing methods that maintain more information about the similarity between real-value variables. On the other hand, DAPH uses a basic Euclidean distance and ignores the semantic relationship between learned realvalue characteristics and binary codes [24, 47]. One limitation is the difficulty of effectively preserving similarity in discrete codes. Additionally, during the training step of DAPH, two distinct kinds of discrete hash codes corresponding to two distinct hash functions are calculated. This strategy, however, would widen the gap between the two systems, resulting in performance reduction. By contrast, we not only propose a novel asymmetric structure for learning two distinct hash functions and a single consistent binary code for each sample during the training phase but also exploit real-value and binary values asymmetrically, which allows for improved preservation of the similarity between learned features and hash codes. Experiments demonstrate that this innovative asymmetric structure is capable of improving image retrieval performance and accelerating convergence during the training process.
7.2.1 Problem Formulation Considering that there exist two streams in the proposed model, we use the uppercase letters X = {x1 , . . . , xi , . . . , xN } ∈ RN×d1 ×d2 ×3 and Y = {y1 , . . . , yi , . . . , yN } ∈ RN×d1 ×d2 ×3 to represent the input images in the first and second deep neural networks, respectively, in which N represents the number of training samples, d1 and d2 represent the height and width of the image. It is worth noting that despite X and Y have distinct symbols, they both refer to the same training data. Here, we employ training samples X and Y in the first and second networks. Because our approach involves supervised learning, we can make use of the label information. Here we use the uppercase letter S ∈ {−1, 0} to represent the similarity between X and Y, and Sij signify the element in the i-th row and j -th column in S. If xi and yj have the same semantic information or label, then Sij = 1, else Sij = 0. The binary codes are represented as B = [b1 , . . . , bi , . . . , bN ]T ∈ RN×k and the k-bit binary code of the i-th sample is represented as bi ∈ {−1, +1}k×1. Our model is designed to learn two mapping functions F and G for projecting X and Y into the Hamming space B. bi = sign(F (xi )) and bj = sign(G(yj )), in which sign(·) is an element-wise sign function, and sign(x) = 1 if x 0, otherwise sign(x) = −1. Given Sij = 1, then the Hamming distance distH (bi , bj ) between bi and bj has to be as small as feasible and vice versa. We use a convolutional neural
7.2 Dual Asymmetric Deep Hashing Learning Table 7.1 The network structure of CNN-F
Layer conv1 conv2 conv3 conv4 conv5 full6 full7 full8
201 Structure f. 64 × 11 × 11; st. 4×4; pad. 0; LRN.; ×2 pool f. 265 × 5 × 5; st. 1×1; pad. 2; LRN.; ×2 pool f. 265 × 3 × 3; st. 1×1; pad. 1 f. 265 × 3 × 3; st. 1×1; pad. 1 f. 265 × 3 × 3; st. 1×1; pad. 1; ×2 pool 4096 4096 → k-bit hash code
Fig. 7.1 The framework of the proposed method. Two streams with five convolution layers and three full-connected layers are used for feature extraction. For the real-valued outputs from these two neural networks, their similarity is preserved by using a pairwise loss. Based on the outputs, a consistent hash code is generated. Furthermore, an asymmetric loss is introduced to exploit the semantic information between the binary code and real-valued data
network to train the hash functions due to the strength of deep neural networks in data representation. To be more precise, the CNN-F structure [48] is used to carry out the feature learning. The CNN-F structure contains eight layers, and five of them are convolutional and the rest of them are completely linked. Table 7.1 shows the network structure, in which “f.” represents the filter, “st.” represents the convolution stride, and “LRN” represents the Local Response Normalization [49]. To obtain the final binary code, the last layer in CNN-F is replaced by a k-D vector and the k-bit binary codes can be generated by using a sign operation on the output of the last layer. Here, the CNN-F model is utilized for both streams in the proposed asymmetric structure. Our method’s framework is illustrated in Fig. 7.1. As shown, two end-toend neural networks are utilized for discriminatively representing the inputs. The semantic information included in a pair of outputs F and G in these two streams is used through a pairwise loss based on their preset similarity matrix. Given the
objective of obtaining hash functions using deep networks, the binary code B is also created by decreasing its distance from F and G. Additionally, to maintain similarity between the learned binary codes and real-valued features and to circumvent the binary constraint, an additional asymmetric pairwise loss is added using the inner product of the hash codes B and the learned features F/G. Apart from the advantages mentioned in the preceding section, another advantage of this two-stream network is that we can treat the input in one stream as a query and the output in the other as a database, which is advantageous for updating the weights in each stream in a supervised manner, as illustrated in Eq. (7.4). Let $\mathbf{f}(\mathbf{x}_i, W_f) \in \mathbb{R}^{k \times 1}$ be the output of the i-th sample in the last layer of the first stream, in which $W_f$ represents the parameters of the network. For clarity, $\mathbf{f}_i$ is used in place of $\mathbf{f}(\mathbf{x}_i, W_f)$. Similarly, the output $\mathbf{g}_j$ related to the j-th sample with the parameters $W_g$ of the second network can be obtained. Accordingly, the features $\mathbf{F} = [\mathbf{f}_1, \ldots, \mathbf{f}_i, \ldots, \mathbf{f}_n]^T \in \mathbb{R}^{N \times k}$ and $\mathbf{G} = [\mathbf{g}_1, \ldots, \mathbf{g}_i, \ldots, \mathbf{g}_n]^T \in \mathbb{R}^{N \times k}$ of the first and second networks can also be obtained. In order to get high-quality binary codes, $\mathrm{sign}(\mathbf{f}_i)$ and $\mathrm{sign}(\mathbf{g}_i)$ are expected to be close to the corresponding hash code $\mathbf{b}_i$. In other words, the L2 loss between them should be minimized as follows:

$\min \sum_{i=1}^{n} \left(\|\mathrm{sign}(\mathbf{f}_i) - \mathbf{b}_i\|_2^2 + \|\mathrm{sign}(\mathbf{g}_i) - \mathbf{b}_i\|_2^2\right), \quad \text{s.t. } \mathbf{b}_i \in \{-1, +1\} \qquad (7.1)$

Nevertheless, it is challenging to perform back-propagation for the gradients of $\mathbf{f}_i$ or $\mathbf{g}_i$ in Eq. (7.1) because their gradients are all zero. Here, $\tanh(\cdot)$ is utilized as a soft approximation of the $\mathrm{sign}(\cdot)$ function. Accordingly, Eq. (7.1) is relaxed to

$\min \sum_{i=1}^{n} \left(\|\tanh(\mathbf{f}_i) - \mathbf{b}_i\|_2^2 + \|\tanh(\mathbf{g}_i) - \mathbf{b}_i\|_2^2\right), \quad \text{s.t. } \mathbf{b}_i \in \{-1, +1\} \qquad (7.2)$

Additionally, the negative log-likelihood of the dual-stream similarities with the following likelihood function is used to leverage the label information and maintain a continuous similarity between the two outputs F and G:

$p(S_{ij}\,|\,\tanh(\mathbf{f}_i), \tanh(\mathbf{g}_j)) = \begin{cases} \sigma(\Theta_{ij}) & S_{ij} = 1 \\ 1 - \sigma(\Theta_{ij}) & S_{ij} = 0 \end{cases} \qquad (7.3)$
7.2 Dual Asymmetric Deep Hashing Learning
203
in which $\Theta_{ij} = \frac{1}{2}\tanh(\mathbf{f}_i)^T\tanh(\mathbf{g}_j)$ and $\sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$. As a result, the pairwise loss for these two different outputs is calculated by

$\min -\sum_{i,j=1}^{n} \left(S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}})\right) \qquad (7.4)$

While Eq. (7.2) attempts to estimate discrete codes and Eq. (7.4) makes use of intra- and inter-class information, the resemblance between binary codes and real-valued features is overlooked. Another asymmetric pairwise loss is provided to address this issue:

$\min \sum_{i,j=1}^{n} \left(\left\|\tanh(\mathbf{f}_i)^T\mathbf{b}_j - kS_{ij}\right\|_2^2 + \left\|\tanh(\mathbf{g}_i)^T\mathbf{b}_j - kS_{ij}\right\|_2^2\right), \quad \text{s.t. } \mathbf{b}_j \in \{-1, +1\} \qquad (7.5)$

The similarity between real-valued data and binary codes is quantified in Eq. (7.5). It is easy to see that Eq. (7.5) not only encourages consistency between $\tanh(\mathbf{f}_i)$ ($\tanh(\mathbf{g}_i)$) and $\mathbf{b}_i$ but also preserves their similarity. Additionally, our experiments demonstrate that this kind of asymmetric inner product rapidly brings the network's real-valued features and hash codes to a stable value. It is worth noting that, although F and G are updated asymmetrically (as stated in the next section), they are initialized with the same network structure. Additionally, Eqs. (7.8), (7.9), and (7.14) indicate that F and G contribute similarly to the binary variable optimization. This is the primary reason why we set the two penalty terms in Eq. (7.5) to be equal. Jointly considering Eqs. (7.2), (7.4), and (7.5), the objective function can be derived as

$\min_{\mathbf{F},\mathbf{G},\mathbf{B}} L = \left\|\tanh(\mathbf{F})\mathbf{B}^T - k\mathbf{S}\right\|_F^2 + \left\|\tanh(\mathbf{G})\mathbf{B}^T - k\mathbf{S}\right\|_F^2 - \tau\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}})\right) + \gamma\left(\|\tanh(\mathbf{F}) - \mathbf{B}\|_F^2 + \|\tanh(\mathbf{G}) - \mathbf{B}\|_F^2\right) + \eta\left(\left\|\tanh(\mathbf{F})^T\mathbf{1}\right\|_F^2 + \left\|\tanh(\mathbf{G})^T\mathbf{1}\right\|_F^2\right), \quad \text{s.t. } \mathbf{B} \in \{-1, +1\} \qquad (7.6)$
in which $\tau$, $\gamma$, and $\eta$ are nonnegative parameters that trade off the various terms. It is notable that the aim of the fourth term $\|\tanh(\mathbf{F})^T\mathbf{1}\|_F^2 + \|\tanh(\mathbf{G})^T\mathbf{1}\|_F^2$ in the objective function Eq. (7.6) is to maximize the information carried by each bit [50]. In particular, this term balances each bit, ensuring that the numbers of −1 and +1 values over all training samples are approximately equal.
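To make the composition of Eq. (7.6) concrete, the following PyTorch-style sketch computes the four loss terms for a batch; the tensor names (`F_out`, `G_out` for the tanh-activated outputs of the two streams, `B` for the current binary codes, `S` for the pairwise similarity matrix) and the use of full-batch tensors are simplifying assumptions, not the authors' released code.

```python
import torch

def dadh_loss(F_out, G_out, B, S, k, tau, gamma, eta):
    """Objective of Eq. (7.6) for DADH (illustrative sketch).

    F_out, G_out : (N, k) tanh-activated outputs of the two network streams.
    B            : (N, k) current binary codes in {-1, +1}.
    S            : (N, N) pairwise similarity matrix with entries in {0, 1}.
    """
    # Asymmetric inner-product terms, Eq. (7.5): tanh(F) B^T should approach k*S.
    asym = ((F_out @ B.t() - k * S) ** 2).sum() + ((G_out @ B.t() - k * S) ** 2).sum()

    # Pairwise negative log-likelihood between the two streams, Eq. (7.4);
    # softplus(theta) is the numerically stable form of log(1 + exp(theta)).
    theta = 0.5 * F_out @ G_out.t()
    pairwise = -(S * theta - torch.nn.functional.softplus(theta)).sum()

    # Quantization terms, Eq. (7.2): keep the tanh outputs close to the binary codes.
    quant = ((F_out - B) ** 2).sum() + ((G_out - B) ** 2).sum()

    # Bit-balance terms: encourage each bit to be half -1 and half +1 over the batch.
    balance = (F_out.sum(dim=0) ** 2).sum() + (G_out.sum(dim=0) ** 2).sum()

    return asym + tau * pairwise + gamma * quant + eta * balance
```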
7.2.2 Optimization for DADH

From Eq. (7.6), we can see that the objective function requires optimizing the real-valued features and the weights of the two neural networks, (F, W_f)/(G, W_g), as well as the discrete codes B. Notably, this NP-hard problem is substantially non-convex, which makes it rather difficult to obtain the optimal solution directly. We therefore build an efficient method to optimize the variables alternately; to be more precise, we update one variable while fixing the others.

Update (F, W_f) with (G, W_g) and B Fixed By fixing (G, W_g) and B, the objective function Eq. (7.6) can be converted into

\min_F \|\tanh(F)B^T - kS\|_F^2 - \tau \sum_{i,j=1}^{n} \left( S_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) + \gamma \|\tanh(F) - B\|_F^2 + \eta \|\tanh(F)^T \mathbf{1}\|_F^2    (7.7)
(F, W_f) can then be updated with back-propagation. Let U = tanh(F) and V = tanh(G). The gradient of the objective function with respect to f_i is

\frac{\partial L}{\partial f_i} = \left\{ \sum_{j=1}^{n} \left[ 2 b_j (b_j^T u_i - kS_{ij}) + \frac{\tau}{2}\left( \sigma(\Theta_{ij}) v_j - S_{ij} v_j \right) \right] + 2\gamma (u_i - b_i) + 2\eta U^T \mathbf{1} \right\} \odot (1 - u_i^2)    (7.8)

in which \odot represents the element-wise (dot) product. After obtaining the gradient \partial L / \partial f_i, \partial L / \partial W_f can be calculated by the chain rule, and W_f is updated by back-propagation.
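As a rough illustration of Eq. (7.8), the sketch below computes the gradient of the loss with respect to a single output f_i in NumPy; propagating this gradient into W_f is left to the deep learning framework. The names and shapes are assumptions for illustration only.

```python
import numpy as np

def grad_f_i(i, F, G, B, S, k, tau, gamma, eta):
    """Gradient of the DADH loss w.r.t. f_i following Eq. (7.8) (sketch)."""
    U, V = np.tanh(F), np.tanh(G)
    u_i, b_i = U[i], B[i]
    theta_i = 0.5 * V @ u_i                      # Theta_ij for all j
    sigma_i = 1.0 / (1.0 + np.exp(-theta_i))     # sigmoid of Theta_ij
    term = (2.0 * B * (B @ u_i - k * S[i])[:, None]
            + 0.5 * tau * (sigma_i - S[i])[:, None] * V).sum(axis=0)
    term += 2.0 * gamma * (u_i - b_i) + 2.0 * eta * (U.T @ np.ones(len(U)))
    return term * (1.0 - u_i ** 2)               # element-wise factor from the tanh derivative
```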
Update (G, W_g) with (F, W_f) and B Fixed Similarly, by fixing (F, W_f) and B, (G, W_g) can be updated by back-propagation. The gradient of the objective function with respect to g_i is

\frac{\partial L}{\partial g_i} = \left\{ \sum_{j=1}^{n} \left[ 2 b_j (b_j^T v_i - kS_{ij}) + \frac{\tau}{2}\left( \sigma(\Theta_{ij}) u_j - S_{ij} u_j \right) \right] + 2\gamma (v_i - b_i) + 2\eta V^T \mathbf{1} \right\} \odot (1 - v_i^2)    (7.9)

After the gradient \partial L / \partial g_i is obtained, \partial L / \partial W_g can be calculated by the chain rule, and W_g is updated by back-propagation.
Update B with (F, W_f) and (G, W_g) Fixed By fixing (F, W_f) and (G, W_g), the following formulation can be obtained:

\min_B L(B) = \|UB^T - kS\|_F^2 + \|VB^T - kS\|_F^2 + \gamma \left( \|U - B\|_F^2 + \|V - B\|_F^2 \right), \quad \text{s.t. } b_i \in \{-1, +1\}    (7.10)
Then, Eq. (7.10) can be rewritten as

\min_B L(B) = -2\,\mathrm{Tr}\!\left[ B\left( k(U^T S + V^T S) + \gamma (U^T + V^T) \right) \right] + \|BU^T\|_F^2 + \|BV^T\|_F^2 + \mathrm{const}, \quad \text{s.t. } b_i \in \{-1, +1\}    (7.11)
in which "const" stands for a constant value unrelated to B. For simplicity, let Q = -2k(S^T U + S^T V) - 2\gamma (U + V). Equation (7.11) can then be simplified to

\min_B L(B) = \|BU^T\|_F^2 + \|BV^T\|_F^2 + \mathrm{Tr}[BQ^T] + \mathrm{const}, \quad \text{s.t. } b_i \in \{-1, +1\}    (7.12)
On the basis of Eq. (7.12) and [24], B can be updated bit by bit; that is, one column of B is updated while the other columns remain unchanged. Let B_{*c} denote the c-th column and \hat{B}_c the remaining columns of B, and define U_{*c}, \hat{U}_c, V_{*c}, \hat{V}_c, Q_{*c}, and \hat{Q}_c analogously. Accordingly, Eq. (7.12) can be rewritten as

\min_{B_{*c}} \mathrm{Tr}\!\left( B_{*c} \left[ 2\left( U_{*c}^T \hat{U}_c + V_{*c}^T \hat{V}_c \right) \hat{B}_c^T + Q_{*c}^T \right] \right) + \mathrm{const}, \quad \text{s.t. } B \in \{-1, +1\}^{n\times k}    (7.13)
Algorithm 7.1 Dual Asymmetric Deep Hashing Learning (DADH)
Input: Training data X/Y; similarity matrix S; hash code length k; predefined parameters τ, γ, and η.
Output: Hashing functions F and G.
Initialization: Initialize the weights of the first seven layers with the pretrained ImageNet model; initialize the last layer randomly; set B to an all-zero matrix.
1: while not converged and the maximum iteration is not reached do
2:   Update (F, W_f):
3:   Fix (G, W_g) and B and update (F, W_f) using back-propagation according to Eq. (7.8).
4:   Update (G, W_g):
5:   Fix (F, W_f) and B and update (G, W_g) using back-propagation according to Eq. (7.9).
6:   Update B:
7:   Fix (F, W_f) and (G, W_g) and update B according to Eq. (7.14).
8: end while
Obviously, the optimal solution for B_{*c} is

B_{*c} = -\operatorname{sign}\!\left( 2\hat{B}_c \left( \hat{U}_c^T U_{*c} + \hat{V}_c^T V_{*c} \right) + Q_{*c} \right)    (7.14)
After B_{*c} is calculated, B can be updated by replacing its c-th column with B_{*c}. By repeating Eq. (7.14), all columns can be updated. In conclusion, the optimization of the proposed method is summarized in Algorithm 7.1.
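A minimal sketch of the column-wise discrete update of Eqs. (7.12)-(7.14) is given below; all names are illustrative and the routine simply performs one sweep over the k columns of B.

```python
import numpy as np

def update_B(U, V, B, S, k, gamma):
    """One sweep of the bit-by-bit update of B following Eqs. (7.12)-(7.14) (sketch)."""
    Q = -2.0 * k * (S.T @ U + S.T @ V) - 2.0 * gamma * (U + V)
    B = B.copy()
    for c in range(B.shape[1]):
        rest = [j for j in range(B.shape[1]) if j != c]        # the other columns
        B_hat, U_hat, V_hat = B[:, rest], U[:, rest], V[:, rest]
        u_c, v_c, q_c = U[:, c], V[:, c], Q[:, c]
        # Eq. (7.14); note np.sign(0) == 0, so in practice ties would be broken to +1 or -1
        B[:, c] = -np.sign(2.0 * B_hat @ (U_hat.T @ u_c + V_hat.T @ v_c) + q_c)
    return B
```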
7.2.3 Inference for DADH

After learning W_f and W_g, the hash functions corresponding to the two neural networks are acquired. Two types of binary codes can be generated for a given testing sample x^*: b_f^* = \operatorname{sign}(f(x^*, W_f)) and b_g^* = \operatorname{sign}(g(x^*, W_g)). Notably, since tanh has no effect on the sign of any element, we do not apply it during the testing phase. We observe that the performances obtained using the first and second networks are quite comparable. To generate a more robust result, we average the two outputs in our experiments:

b^* = \operatorname{sign}\!\left( 0.5\left[ f(x^*, W_f) + g(x^*, W_g) \right] \right)    (7.15)
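Assuming two trained forward functions are available (here called f_net and g_net, hypothetical names), the fused code of Eq. (7.15) is simply:

```python
import numpy as np

def encode(x, f_net, g_net):
    """Binary code for a query x by averaging the two streams, as in Eq. (7.15) (sketch)."""
    return np.sign(0.5 * (f_net(x) + g_net(x)))
```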
7.2.4 Experimental Results

This section conducts tests on three large-scale datasets to illustrate the suggested method's effectiveness in comparison to several state-of-the-art techniques. The datasets utilized in our studies are described first, followed by the baselines, evaluation protocol, and implementation. We then compare the proposed method with the comparison methods, and finally the convergence is analyzed.
7.2.4.1 Datasets

The IAPR TC-12 dataset [51] comprises 20,000 images from 255 categories. Some samples have multiple labels; for the i-th and j-th samples, if they share at least one label, then S_ij = 1. Here, 2000 images are utilized as the test set, and 5000 samples are chosen from the remaining 18,000 images (the retrieval set) as the training set.

The MIRFLICKR-25K dataset [52] consists of 25,000 images collected from the Flickr website. Following [50], 20,015 images from 24 categories are chosen. Similar to IAPR TC-12, some images have multiple labels, and two images are considered ground-truth neighbors if they share at least one label. Besides, 2000 images are utilized as the test set, and the remaining samples form the retrieval set; 5000 samples selected from the retrieval set are utilized as the training set.

The CIFAR-10 dataset [53] consists of 60,000 color images from ten categories, each with a single label. Two samples are considered semantic neighbors if they share the same label. Following [22], 1000 samples are randomly selected as the test set. In addition, 5000 images are selected from the remaining 59,000 images to make up the training set, and the rest are used as the retrieval set.
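For all three datasets the similarity matrix S can be derived directly from the annotations: two images are neighbors (S_ij = 1) whenever their label vectors share at least one tag, which reduces to label equality in the single-label CIFAR-10 case. A minimal sketch, assuming multi-hot label matrices:

```python
import numpy as np

def build_similarity(labels_a, labels_b):
    """S_ij = 1 if samples i and j share at least one label, else 0 (sketch).

    labels_a, labels_b: multi-hot matrices of shape (n_a, n_classes) and (n_b, n_classes).
    """
    return (labels_a @ labels_b.T > 0).astype(np.float32)
```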
7.2.4.2 Baseline and Evaluation Protocol

To illustrate DADH's superiority over other hashing algorithms, LSH [1], SpH [8], ITQ [11], DSH [6], DPSH [22], ADSH [24], and DAPH [23] are compared. Because LSH, SpH, ITQ, and DSH are not deep learning techniques, pre-extracted features are required; we utilize the CNN feature [54] as their input in this chapter. For all deep learning approaches, the CNN-F network is used to extract features, and the parameters of DPSH and ADSH are set according to their published specifications. Because the source code for DAPH is not publicly available, we carefully reimplement it using the deep learning toolkit MatConvNet [54]. Additionally, since DAPH's original network topology is not CNN-F, the settings in [23] may not be ideal, so we make an effort to fine-tune the settings of DAPH. As with related approaches [23, 50], the MAP, the Top-500 MAP, and the Top-500 Precision are used to assess the proposed method's performance.
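For reference, one possible way to compute MAP over a Hamming ranking is sketched below; the exact evaluation code used in these experiments is not shown in this chapter, so this is only an assumed implementation of the standard protocol.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance, top_k=None):
    """MAP of Hamming-ranked retrieval; relevance[i, j] = 1 if db item j matches query i (sketch)."""
    aps = []
    for q, rel in zip(query_codes, relevance):
        dist = 0.5 * (db_codes.shape[1] - db_codes @ q)   # Hamming distance from the inner product
        order = np.argsort(dist, kind="stable")
        rel_sorted = rel[order][:top_k] if top_k else rel[order]
        if rel_sorted.sum() == 0:
            continue
        hits = np.cumsum(rel_sorted)
        ranks = np.arange(1, len(rel_sorted) + 1)
        aps.append(np.sum(hits / ranks * rel_sorted) / rel_sorted.sum())
    return float(np.mean(aps)) if aps else 0.0
```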
7.2.4.3 Implementation

We implement DADH with the deep learning toolkit MatConvNet [55] on a Titan X GPU. The first seven layers of each stream are initialized with the pretrained ImageNet model, while the last layer's weights are set randomly. We set the mini-batch size to 64 during training and divide the learning-rate range [10^{-6}, 10^{-4}] over 150 iterations; in other words, the learning rate decreases steadily from 10^{-4} to 10^{-6}. The weights are updated using stochastic gradient descent.
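One possible reading of this schedule is a logarithmically spaced decay, for example (an assumed interpretation, not the authors' exact code):

```python
import numpy as np

# 150 learning rates decaying steadily from 1e-4 to 1e-6, one per iteration
learning_rates = np.logspace(-4, -6, num=150)
```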
7.2.4.4 Comparison with Other Methods

IAPR TC-12 Table 7.2 shows the MAP scores achieved by the various approaches on the IAPR TC-12 dataset. It can be seen that DADH yields a significant improvement in MAP scores when compared to the other techniques. In comparison to the data-independent approach LSH, DADH improves MAP scores by 10-20%. There is also a noticeable improvement when compared to ITQ, SpH, and DSH. Specifically, our suggested technique achieves a MAP score of at least 46.54% and as high as 55.39% when the bit length is 48, whereas the best result produced by ITQ, SpH, and DSH is only 43.91%, which is much lower than our results. In contrast to DPSH, ADSH, and DAPH, DADH also improves MAP scores to a degree. When compared to DAPH, the current technique improves performance by around or more than 5% when the bit length is between 12 and 48. Compared to DPSH and ADSH, DADH also performs roughly 3-5% better when the hashing bit length is 24, 36, or 48.

The IAPR TC-12 dataset's Top-500 MAP and Precision scores are provided in Table 7.3. As can be seen from this table, the experimental results acquired using deep learning-based techniques are much superior to those calculated using conventional methods. In comparison to DPSH, ADSH, and DAPH, the suggested approach DADH consistently achieves the best performance. Except when the bit length is 8, DADH improves the MAP@Top500 and Precision@Top500 scores by at least 4%, demonstrating the method's usefulness.

Figure 7.2 illustrates the Precision-Recall curves generated using various approaches on the IAPR TC-12 dataset when the bit length is changed from 8 to 48. We can plainly see that the covered regions acquired by DADH are much
Table 7.2 The MAP scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     34.07  33.98   35.11   35.79   36.71   37.94
SpH     36.04  36.04   36.75   37.77   38.56   39.11
ITQ     41.36  42.20   42.89   43.10   43.20   43.91
DSH     40.64  40.09   39.73   41.66   42.42   42.33
DPSH    45.19  46.03   46.82   47.37   47.97   48.60
ADSH    44.69  46.98   48.25   49.06   50.24   50.59
DAPH    44.33  44.48   44.73   45.12   45.24   45.52
DADH    46.54  49.27   50.83   52.71   54.47   55.39
Table 7.3 The Top-500 MAP and Top-500 Precision scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

MAP@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     39.87  39.83   42.72   44.80   48.07   50.78
SpH     44.38  45.43   47.73   50.41   52.45   54.30
ITQ     51.75  54.48   55.93   57.64   58.83   59.90
DSH     48.34  48.63   49.70   52.74   54.78   55.51
DPSH    57.07  58.13   59.94   61.61   63.05   64.49
ADSH    53.70  58.12   61.35   63.25   64.90   65.59
DAPH    56.26  57.98   59.48   61.27   62.57   63.94
DADH    57.12  62.67   65.15   67.80   70.11   70.93

Precision@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     38.14  37.90   40.41   42.05   44.45   46.85
SpH     42.33  42.80   44.69   46.94   48.49   49.95
ITQ     49.95  52.16   53.31   54.33   55.02   56.01
DSH     47.46  46.87   47.21   50.12   51.61   52.20
DPSH    55.03  55.90   57.73   58.96   60.15   61.40
ADSH    52.31  56.47   58.97   60.29   61.94   62.50
DAPH    54.32  55.44   56.52   57.92   58.95   60.03
DADH    55.31  60.20   62.52   64.93   67.13   67.81
Fig. 7.2 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and DADH on the IAPR TC-12 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
Table 7.4 The MAP scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     57.14  57.21   58.25   59.37   59.62   59.86
SpH     59.20  60.71   60.91   60.97   61.63   62.43
ITQ     63.09  63.40   63.52   63.82   64.06   64.34
DSH     64.48  63.85   64.10   65.96   65.00   65.62
DPSH    73.48  74.68   75.58   76.01   76.09   76.05
ADSH    75.39  76.41   76.98   76.59   76.20   74.53
DAPH    72.79  74.70   74.30   74.14   73.81   73.41
DADH    77.15  78.16   78.64   79.44   79.72   79.26
bigger than those gained by the other comparison techniques. The suggested strategy significantly outperforms both classic data-independent and data-dependent methods. With regard to DPSH, ADSH, and DAPH, there is also a clear improvement in all circumstances as the code length is varied.

MIRFLICKR-25K Table 7.4 summarizes the MAP results of the trials conducted on the MIRFLICKR-25K dataset. As can be seen, DADH provides the best performance in all circumstances when the code length is varied. Similar to the results on the IAPR TC-12 dataset, DPSH, ADSH, DAPH, and DADH achieve higher MAP values than LSH, SpH, ITQ, and DSH. When compared to other deep hashing algorithms, DADH continues to lead the competition. Table 7.5 shows the MAP@Top500 and Precision@Top500 scores for the MIRFLICKR-25K dataset. It is clear that DADH outperforms all other strategies in both MAP@Top500 and Precision@Top500, proving its superiority over the other current techniques. The MAP@Top500 and Precision@Top500 values grow from (85.80%, 84.73%) to (87.58%, 86.80%), whereas the maximum values attained by LSH, SpH, ITQ, and DSH are only (77.06%, 74.66%), much lower than DADH. Additionally, the results obtained with DADH are much greater than those obtained using DPSH, ADSH, or DAPH: DADH's MAP@Top500 and Precision@Top500 scores are nearly always above 85%, whereas the scores computed using the other deep hashing algorithms are often below 85%.

The Precision-Recall curves for the MIRFLICKR-25K dataset when the bit length is changed from 8 to 48 are shown in Fig. 7.3. As seen in Fig. 7.3, DADH significantly surpasses LSH, SpH, ITQ, DSH, ADSH, and DAPH. When DPSH and DADH are compared, the suggested technique is clearly better than DPSH when the code length is 8 or 12. Although DPSH covers a greater area when the recall value is less than 0.4 in Fig. 7.3c-f, it falls short of DADH as the recall value increases. By and large, our technique beats DPSH for code lengths of 16, 24, 36, and 48.

CIFAR-10 On the CIFAR-10 dataset, Table 7.6 shows the MAP scores obtained using the suggested technique and the comparison methods. When the code length is increased from 8 to 48, the MAP scores obtained by DADH are much greater than those of typical hashing techniques. In comparison
Table 7.5 The Top-500 MAP and Top-500 Precision scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

MAP@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     60.02  60.81   63.66   65.86   67.81   68.47
SpH     65.61  68.36   69.49   70.33   71.64   73.65
ITQ     72.30  73.90   74.30   75.60   76.19   77.06
DSH     70.59  71.83   73.31   75.43   74.89   76.37
DPSH    82.88  83.84   84.34   84.84   85.77   85.64
ADSH    82.14  83.80   84.94   84.90   84.20   82.06
DAPH    81.08  84.20   83.71   84.45   84.02   84.07
DADH    85.80  86.83   86.90   87.42   87.98   87.58

Precision@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     59.25  59.65   62.20   64.33   65.74   66.31
SpH     64.19  66.65   67.56   68.34   69.43   71.26
ITQ     70.95  72.09   72.42   73.54   74.08   74.66
DSH     69.90  70.50   71.82   73.90   73.12   74.52
DPSH    81.85  83.01   83.58   84.11   84.97   84.80
ADSH    81.50  83.18   84.15   84.03   83.52   81.31
DAPH    80.37  83.24   82.73   83.40   82.93   82.91
DADH    84.73  85.78   85.76   86.68   87.08   86.80
Fig. 7.3 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and DADH on the MIRFLICKR-25K dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
Table 7.6 The MAP scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     14.69  14.22   17.17   15.56   16.54   16.44
SpH     18.22  19.14   18.98   21.50   21.82   23.09
ITQ     24.47  25.56   24.52   24.75   26.51   26.49
DSH     22.07  20.99   21.48   22.69   23.93   23.87
DPSH    63.48  66.97   68.83   73.45   74.66   75.02
ADSH    56.67  71.41   76.50   80.40   82.73   82.73
DAPH    59.09  61.17   68.15   69.22   70.74   70.28
DADH    71.86  75.12   80.33   81.70   83.16   83.90
to DPSH and DAPH, it is clear that the proposed technique performs better across the range of code lengths. Except for code lengths of 8 and 12, the MAP scores obtained by DADH are always above 80%, whereas the best performance obtained by DPSH and DAPH is just 75.02%. Additionally, ADSH is inferior to the suggested technique, particularly when the code is short. Table 7.7 displays the Top-500 MAP and Top-500 Precision scores obtained using various approaches on the CIFAR-10 dataset for varying code lengths. Clearly, the four deep hashing algorithms consistently provide superior experimental outcomes when compared to the more conventional procedures. The comparison of DADH with the other deep hashing algorithms further demonstrates the proposed strategy's advantages: in comparison to DPSH, ADSH, and DAPH, DADH increases the Top-500 MAP and Precision scores by around 3-4% when the code length is between 12 and 48.

Figure 7.4 shows the Precision-Recall curves generated using several approaches on the CIFAR-10 dataset with bit lengths ranging from 8 to 48. When code lengths of 8, 12, 16, and 24 are used, DADH's Precision-Recall curves cover the largest regions. While ADSH's performance is comparable to our model's when the code length is 36 or 48, DADH is still much superior to LSH, SpH, ITQ, DSH, DPSH, and DAPH. In addition, we record recall and accuracy scores at the Top-50 and Top-100 for each of the three datasets when the code length is set to 24. There is no question that the approaches based on deep learning are superior to the traditional ones, and the suggested technique also outperforms the existing deep learning-based hash algorithms such as DPSH, DAPH, and ADSH.

CIFAR-10 with a Larger Training Set In [24], 1000 samples are chosen for testing and the rest are utilized for training (another 1000 samples in the training set are selected for validation). To provide a fair comparison, we replicate this setting and present the experimental results in Table 7.8. It is worth noting that the results of all comparison techniques are taken directly from [24]. It is clear that the suggested approach DADH outperforms the existing state-of-the-art deep hashing algorithms. Specifically, when the code length is 12 bits, DADH achieves a MAP score of more than 94%, whereas the best of the other approaches reaches just 84.66%.
Table 7.7 The Top-500 MAP and Top-500 Precision scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

MAP@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     17.73  24.97   29.06   28.03   30.39   32.59
SpH     31.10  32.87   33.41   38.64   41.24   43.35
ITQ     26.43  37.48   37.42   41.06   45.57   46.16
DSH     27.25  34.14   34.64   37.00   40.02   41.33
DPSH    58.58  66.85   71.48   75.74   79.69   80.62
ADSH    58.92  70.07   74.95   78.09   78.85   77.61
DAPH    50.88  66.24   72.43   77.31   79.21   80.40
DADH    67.11  73.08   78.42   82.02   83.51   84.17

Precision@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     14.37  18.30   22.85   22.70   25.54   27.19
SpH     24.15  27.59   28.59   34.24   36.60   38.73
ITQ     24.10  33.00   31.86   34.87   40.53   40.78
DSH     24.47  28.20   29.33   33.07   35.54   36.55
DPSH    64.13  70.80   74.71   78.34   80.59   81.55
ADSH    61.09  73.71   78.85   81.84   83.45   82.86
DAPH    54.22  68.28   74.42   77.36   78.80   79.55
DADH    72.95  77.53   82.68   84.18   85.23   85.59
Fig. 7.4 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and DADH on the CIFAR-10 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
Table 7.8 The MAP scores obtained by different methods on the CIFAR-10 dataset, where 1000 samples are selected for testing and the remaining samples are used for training. Bold values mean the best performances

Method  12-bit  24-bit  32-bit  48-bit
DSH-DL  63.70   74.93   78.03   80.86
DHN     67.94   72.01   73.09   74.08
DPSH    68.63   72.76   74.06   75.20
ADSH    84.66   90.62   91.75   92.63
DADH    94.92   95.36   95.29   95.43
Fig. 7.5 The changing of the objective function values and MAP scores with an increase in iterations
7.2.4.5 Convergence Analysis

Our model achieves convergence within a few iterations. Figure 7.5 illustrates the changes in the objective function values and MAP scores for the three datasets when the code length is 48 bits. It is straightforward to see that DADH converges to a stable result in fewer than 30 rounds. Indeed, we observe that the asymmetric terms \|\tanh(F)B^T - kS\|_F^2 and \|\tanh(G)B^T - kS\|_F^2 contribute significantly to the rapid convergence. To analyze their influence, we exclude these two components from the objective function; notably, after removing the asymmetric terms, the binary code B is updated through sign(γ[tanh(G) + tanh(F)]). As seen in Fig. 7.6, when \|\tanh(F)B^T - kS\|_F^2 and \|\tanh(G)B^T - kS\|_F^2 are removed from Eq. (7.6), not only do the MAP scores degrade, but the model also converges more slowly than the original DADH.

Fig. 7.6 The changes of the objective function values and MAP scores with an increase in iterations. Note that the asymmetric terms \|\tanh(F)B^T - kS\|_F^2 and \|\tanh(G)B^T - kS\|_F^2 are removed
7.2.5 Conclusion

In this section, we presented a new deep hashing approach for image retrieval called Dual Asymmetric Deep Hashing learning (DADH). Two asymmetric streams are used to combine feature representation and hash function learning in a unified framework. To utilize the semantic structure between each pair of outputs from the two streams, a pairwise loss is employed. Besides, an additional asymmetric loss is proposed to capture the similarity between the discrete binary codes and the learned real-valued features and to aid convergence in the training phase. Experiments on three large-scale datasets demonstrate the suggested method's superiority over state-of-the-art techniques.
7.3 Relaxed Asymmetric Deep Hashing Learning

Although numerous asymmetric approaches, such as DADH, DAPH, and ADSH, have been proposed, previous research on deep hashing algorithms still faces several issues:

(1) The first limitation is that the majority of deep hashing algorithms match pairs of samples point-to-point. Assume that f_i \in R^{k\times 1} and f_j \in R^{k\times 1} are the outputs corresponding to the i-th and j-th samples, respectively, where k is the output dimension. A popular strategy in hashing learning is to minimize their Hamming distance through (f_i^T f_j - kS_{ij})^2, where S_{ij} is the element in the i-th row and j-th column of the similarity matrix. Although this term pushes (f_i, f_j) toward discrete values (either +1 or -1) and makes use of their pairwise semantic information, it fails in certain cases. As seen in Fig. 7.7, suppose that f_1 and f_2 have the same ground truth and are situated in the same Hamming space R1. However, when both points are transformed to f_1' and f_2' in another Hamming space R2, the value of f_1'^T f_2' remains unchanged, despite the fact that the corresponding hash codes sign(f_1') and sign(f_2') are clearly different from the ground-truth codes sign(f_1) and sign(f_2). To address this issue, some works [24] have attempted to update the binary code b_1 and the real-valued output f_2 using the asymmetric loss (f_2^T b_1 - kS_{21})^2, and the advantage of this asymmetric inner product has been shown experimentally. Thus, it is important to encourage f_1, f_2, and b_1 to lie in the same Hamming space. However, the typical asymmetric product f_1^T b_1 (f_2^T b_1) not only requires f_1 (f_2) and b_1 to be in the same Hamming space but also requires them to be close in Euclidean distance, which is excessive and unnecessary. As a result, [56, 57] adopted the cosine distance f_i^T f_j / (\|f_i\|\,\|f_j\|) to ensure that comparable points are likely to lie in the same hypercube. However, this strategy also cannot meet our requirement in some specific cases. As seen in Fig. 7.7, suppose that (f_3, f_4) and (f_5, f_6) have the same semantic information. We prefer the configuration of f_5 and f_6 and want to avoid that of f_3 and f_4; yet the cosine similarity between both pairs is large, despite the fact that f_3 and f_4 have distinct hash codes. Motivated by the above analysis, we present a novel relaxed asymmetric technique for point-to-angle matching. Particularly, as seen in Fig. 7.7, points f_5 and f_6 are encouraged to be close in cosine distance to the binary variable b_1. Not only can we place samples belonging to the same class in a similar
Fig. 7.7 The motivation of the point-to-angle matching. Assume f_1, f_2, f_3, f_4, f_5, f_6, and b_1 belong to the same Hamming space R1. (1) (f_1^T f_2 - k)^2 makes them similar and discrete. However, if f_1 and f_2 are both transformed to f_1' and f_2' in R2, the value of f_1'^T f_2' remains unchanged but their hash codes change. (2) The asymmetric inner product (f_1^T b_1 - k)^2 can well tackle the problem in (1), but this term strongly enforces each element in f_1 (f_2) to approximate either +1 or -1. This is too strict and unnecessary since what we need is sign(f_1). (3) Thus, some works try to use the cosine distance to measure the similarity between each pair of samples. Although this strategy can address the problem in (2), it may be unable to handle some specific cases like (f_3, f_4). (4) In order to solve the problems mentioned in (1), (2), and (3), we propose a relaxed asymmetric strategy that exploits the cosine distance between the real-valued outputs and the binary codes. In this figure, (f_5^T b_1 / (\|f_5\|\,\|b_1\|) - 1)^2 or (f_6^T b_1 / (\|f_6\|\,\|b_1\|) - 1)^2 not only encourages them onto the same hypercube without any length constraint but also efficiently avoids the case that occurs between f_3 and f_4
hypercube without imposing a length limit on them, but we can also efficiently avoid the situation between f_3 and f_4.

(2) The second issue is determining a more effective way to use the semantic affinity inherent in the original space. Several functions have been presented in recent years to determine the similarity and dissimilarity of each pair of data; the pairwise loss and the triplet loss are the two most prevalent. The pairwise loss encourages samples belonging to the same class to be close together, while samples belonging to other classes are separated by a large distance. In comparison, the triplet loss promotes a smaller distance between each pair of similar (positive) samples than between each pair of dissimilar (negative) samples. Particularly in the image retrieval task, we want to bring comparable examples closer together than dissimilar ones; as a result, we concentrate on measuring semantic affinity using the triplet loss. The conventional triplet loss separates similar and dissimilar pairs using the Euclidean distance, which, however, is not suitable for our relaxed asymmetric method. As seen in Fig. 7.8, f_1 and f_2 are homogeneous and reside in the same Hamming space, but f_3 is distinct from them and resides in a different Hamming space. Admittedly, both f_1 and f_2 are in proper positions from the relaxed point-to-angle viewpoint due to their large cosine distances from
Fig. 7.8 The motivation of the novel triplet loss. f_1 and f_2 belong to the same class, while f_3 belongs to another class. According to the analysis in Fig. 7.7, f_1 and f_2 are located in appropriate positions. However, the Euclidean distance between them is much larger than that between f_2 and f_3, so the traditional triplet loss cannot be used directly. Thus, we propose a novel triplet loss. Firstly, each output is normalized onto a unit ball to get \bar{f}_1, \bar{f}_2, and \bar{f}_3 corresponding to f_1, f_2, and f_3, respectively. Then [1 - (\bar{f}_1^T \bar{f}_3 - 1)^2 + (\bar{f}_1^T \bar{f}_2 - 1)^2]_+ is constructed to encourage \bar{f}_1 to be closer to \bar{f}_2 than to \bar{f}_3 under the Hamming distance
b_1. The Euclidean distance between f_1 and f_2 is, however, much greater than the distance between f_2 and f_3. As a result, if the classic triplet loss were used directly for semantic information discovery, our relaxed asymmetric technique would be impaired. Thus, we introduce an updated triplet loss based on the Hamming distance to overcome this issue. We begin by normalizing the outputs onto a multi-dimensional unit ball to get \bar{f}_1, \bar{f}_2, and \bar{f}_3, which correspond to f_1, f_2, and f_3, respectively. Unlike standard techniques, which consider just the Euclidean distance in the triplet loss, we consider the Hamming distance to urge (\bar{f}_1, \bar{f}_2) to be closer than (\bar{f}_1, \bar{f}_3) and (\bar{f}_2, \bar{f}_3). The triplet losses among these three points can be expressed mathematically as [1 - (\bar{f}_1^T \bar{f}_3 - 1)^2 + (\bar{f}_1^T \bar{f}_2 - 1)^2]_+ and [1 - (\bar{f}_2^T \bar{f}_3 - 1)^2 + (\bar{f}_1^T \bar{f}_2 - 1)^2]_+, where [z]_+ = max(z, 0). The next section provides further discussion of this point.

We propose a new hashing learning approach called Relaxed Asymmetric Deep Hashing (RADH) [58] in this section. The method's primary contributions are summarized as follows.

(1) In point-to-angle matching, a relaxed asymmetric technique is presented to expose the similarity between real-valued outputs and discrete binary codes. Through an inner product with no length limitation, the real-valued features and hashing variables are encouraged to fall into the same Hamming space.

(2) We provide an updated triplet loss that is well suited to our relaxed asymmetric technique. Unlike the standard version, which uses just the Euclidean distance to rank positive and negative pairs, we normalize each output onto a multi-dimensional unit ball and use the Hamming distance to rank the various pairs.
(3) An efficient method is proposed to update numerous variables in an end-to-end deep structure in an alternate manner. The binary codes, in particular, can be acquired in a discrete manner.
7.3.1 Problem Formulation

It is worth noting that this section primarily discusses the efficiency of the relaxed asymmetric and novel triplet losses. As with DADH and DAPH, we simply use the CNN-F [48] and asymmetric deep network [23] to represent the input data, as seen in Fig. 7.9. The primary benefit of this two-stream network is that we can use the input of one stream as a query and the output of the other as a database, which is advantageous for supervised weight updates in each stream. It is self-evident that this deep structure can be substituted with other network structures, such as VGG and ResNet, but this is not the subject of this chapter. Since we seek a k-bit binary code, we substitute the last layer of CNN-F with a k-dimensional vector. Additionally, we initialize the first seven layers of CNN-F with the pretrained ImageNet model, and the final layer's weights are set to random values. Here we denote the inputs of the first and second streams as X = \{x_1, ..., x_i, ..., x_N\} \in R^{N\times d_1\times d_2\times 3} and Y = \{y_1, ..., y_i, ..., y_N\} \in R^{N\times d_1\times d_2\times 3}, where N is the number of training samples and (d_1, d_2) is the size of an image.
Fig. 7.9 The framework of the proposed method. The top and bottom streams, with different weights, are used to perform feature extraction for the images. A shared binary code b_i is then learned corresponding to the same input in both streams. Here we assume that (b_1, f_1, f_2) share the same semantic information, while (b_2, f_3) carry semantic information different from b_1. In the learning step, the relaxed asymmetric strategy is exploited to make (b_1, f_1, f_2) locate in the same Hamming space without a length constraint. We also propose a novel triplet loss to rank the positive and negative pairs
Note that X and Y differ only in notation; they both represent the same training set. Similar to [23], the purpose of our model is to learn two hash functions F and G that map the raw data X and Y into the Hamming space. In this section, we denote the outputs associated with X and Y as F = [f_1, ..., f_i, ..., f_N]^T \in R^{N\times k} and G = [g_1, ..., g_i, ..., g_N]^T \in R^{N\times k}, respectively. Additionally, their corresponding shared binary codes are denoted as B = [b_1, ..., b_i, ..., b_N]^T \in R^{N\times k}, where b_i \in \{-1, +1\}^{k\times 1}. As our method is supervised, a matrix S \in R^{N\times N} is introduced to measure the semantic similarity between X, Y, and B, with S_{ij} its element in the i-th row and j-th column. S_{ij} = 1 if x_i and y_j (b_j) share the same semantic information or label; otherwise S_{ij} = 0 - ε, where ε is a slack variable, e.g., 0.11. Note that the sizes of F, G, and B are all the same. In our two-stream network, as in DAPH, we alternately input the images X or Y (both are the same training set) into the first or second stream. When the images are input into the first stream and produce the output F, we encourage F to perform retrieval from G; the same applies to the second stream. Then B is the binary code that is fully associated with F and G row by row.

Our method's framework is shown in Fig. 7.9. The network has two streams, both of which are used to extract features from the input images. As in DADH, we learn a common hashing code from the first and second streams. More precisely, a relaxed asymmetric strategy is utilized to make the cosine distance between real-valued outputs and discrete hashing variables that share the same semantic information small. After that, the updated triplet loss is used to provide an efficient ranking for each pair of positive and negative samples. Let F and G represent the outputs of the first and second streams, respectively. Since our objective is to obtain hash functions from deep networks, the binary code B is also formed by decreasing the distance between B and F/G. As shown by DADH, DAPH, and ADSH, the asymmetric technique has the potential to significantly reduce the complexity of discrete constraint optimization. Equation (7.16) is utilized not only to reduce the Hamming distance between the real-valued outputs and hashing variables but also to exploit the semantic affinity inherent in the original space:

\min \left( b_i^T f_j - kS_{ij} \right)^2 + \left( b_i^T g_j - kS_{ij} \right)^2    (7.16)
However, Eq. (7.16) also requires F or G to have a particular length, which is excessive and unnecessary. What we need is to guarantee that hashing codes and real-valued outputs share the same semantic information in a shared Hamming space, regardless of the output length. To resolve this issue, we convert Eq. (7.16) into Eq. (7.17):

\min \left( \frac{b_i^T f_j}{\|b_i\|\,\|f_j\|} - S_{ij} \right)^2 + \left( \frac{b_i^T g_j}{\|b_i\|\,\|g_j\|} - S_{ij} \right)^2    (7.17)
in which \|b_i\|, \|f_j\|, and \|g_j\| represent the lengths of b_i, f_j, and g_j, respectively. Since \|b_i\| = \sqrt{k}, we further transform Eq. (7.17) into Eq. (7.18):

\min L_a = \sum_{i,j} \left[ \left( \frac{b_i^T f_j}{\|f_j\|} - \sqrt{k} S_{ij} \right)^2 + \left( \frac{b_i^T g_j}{\|g_j\|} - \sqrt{k} S_{ij} \right)^2 \right]    (7.18)
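As an illustration only (not the authors' code), the relaxed asymmetric term L_a of Eq. (7.18) can be evaluated with a few NumPy lines; variable names are assumptions.

```python
import numpy as np

def relaxed_asymmetric_loss(F, G, B, S, k):
    """L_a of Eq. (7.18): cosine-style asymmetric matching with no length constraint (sketch)."""
    F_bar = F / np.linalg.norm(F, axis=1, keepdims=True)   # f_j / ||f_j||
    G_bar = G / np.linalg.norm(G, axis=1, keepdims=True)   # g_j / ||g_j||
    target = np.sqrt(k) * S
    return np.sum((B @ F_bar.T - target) ** 2) + np.sum((B @ G_bar.T - target) ** 2)
```

Because each output is divided by its own norm, only the direction of f_j and g_j matters, which is exactly the point-to-angle relaxation.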
Obviously, it can be observed from Eq. (7.18) that there is no length limitation on the learned features, and this equation simultaneously places real-valued outputs and binary codes in the same hypercube if they belong to the same class; otherwise, they are positioned in distinct Hamming spaces. In comparison to Eq. (7.16), this technique allows for greater flexibility in estimating f_j and g_j, which makes hashing learning more adaptable and reasonable.

The retrieval task requires that the distance between each positive pair be less than the distance between the corresponding negative pair. Thus, we include the triplet loss, which is well suited to the retrieval task. However, as mentioned above, applying the existing triplet loss directly is not feasible due to the characteristics of RADH, so an updated triplet loss is introduced. The updated triplet comparison formulation is trained on a sequence of triplets (f_i, f_j, f_t), in which f_i and f_j are members of the same class while f_j and f_t are members of distinct classes. The same definition applies to (g_i, g_j, g_t). To ensure that the Hamming distance between f_i and f_j is less than that between f_j and f_t, the three samples must satisfy the following constraint for each triplet (f_i, f_j, f_t):

\left( \frac{f_t^T f_j}{\|f_t\|\,\|f_j\|} - 1 \right)^2 - \left( \frac{f_i^T f_j}{\|f_i\|\,\|f_j\|} - 1 \right)^2 > 1 - \xi_{ijt}    (7.19)
in which \xi_{ijt} denotes a nonnegative slack variable. Notably, since the length of the output is unconstrained, we first normalize (f_i, f_j, f_t) onto a k-dimensional unit ball. However, obtaining the gradient of f_j is tricky since it may be coupled with other samples as a retrieval point. Here we consider f_j as a query and \{g_i\}_{i=1}^{N} as a retrieval dataset, according to our two-stream network topology. Accordingly, when (f_i, f_j, f_t) and (g_i, g_j, g_t) are considered simultaneously, the triplet loss function in RADH is

\min L_p = \sum_{i,j,t} \left[ 1 - \left( \frac{g_t^T f_j}{\|g_t\|\,\|f_j\|} - 1 \right)^2 + \left( \frac{g_i^T f_j}{\|g_i\|\,\|f_j\|} - 1 \right)^2 \right]_+ + \sum_{i,j,t} \left[ 1 - \left( \frac{f_t^T g_j}{\|f_t\|\,\|g_j\|} - 1 \right)^2 + \left( \frac{f_i^T g_j}{\|f_i\|\,\|g_j\|} - 1 \right)^2 \right]_+    (7.20)
in which [z]_+ = \max(z, 0). In comparison to the existing triplet loss function, which relies solely on the Euclidean distance to determine the ranking between positive and negative pairs, Eq. (7.20) converts it to a hashing-based function that makes efficient use of the Hamming distance to achieve a satisfactory ranking. As mentioned in [23], an additional Euclidean regularization between b_j and f_j/g_j improves performance. Thus, as shown in Eq. (7.21), b_j and f_j/g_j are further constrained to be close with respect to the Euclidean distance:

\min L_e = \sum_j \left( \left\| \sqrt{k}\,\frac{f_j}{\|f_j\|} - b_j \right\|_2^2 + \left\| \sqrt{k}\,\frac{g_j}{\|g_j\|} - b_j \right\|_2^2 \right)    (7.21)
Here \sqrt{k}\,\frac{f_j}{\|f_j\|} and \sqrt{k}\,\frac{g_j}{\|g_j\|} are applied to make their elements approach either -1 or +1. Additionally, as indicated in Eq. (7.22), a further regularization is applied to balance each bit over the training samples and maximize the information provided by each bit:

\min L_r = \left\| \sqrt{k}\,\bar{F}^T \mathbf{1} \right\|_2^2 + \left\| \sqrt{k}\,\bar{G}^T \mathbf{1} \right\|_2^2    (7.22)
¯ = [¯g1, . . . , g¯ N ]T = in which F¯ = [¯f1 , . . . , ¯fN ]T = [ ff11 , . . . , ffNN ]T , G g1 gN T [ g1 , . . . , gN ] and 1 is a N × 1 vector with all the elements of 1. In conclusion, the objective function can be obtained as follows: min La + τ Lp + γ Le + ηLr , s.t. bi ∈ {−1, +1}
(7.23)
in which τ , γ , and η are the nonnegative parameters to make a trade-off among various terms.
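To complement the loss definitions above, the sketch below shows how the relaxed triplet term L_p of Eq. (7.20) could be evaluated for an explicit list of triplets (i, j, t); the triplet list and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def relaxed_triplet_loss(F, G, triplets):
    """L_p of Eq. (7.20): Hamming-style ranking on length-normalized outputs (sketch).

    triplets: list of (i, j, t) where samples i and j are similar and j and t are dissimilar.
    """
    F_bar = F / np.linalg.norm(F, axis=1, keepdims=True)
    G_bar = G / np.linalg.norm(G, axis=1, keepdims=True)
    loss = 0.0
    for i, j, t in triplets:
        # first-stream output f_j queried against the second stream's database
        loss += max(0.0, 1.0 - (G_bar[t] @ F_bar[j] - 1.0) ** 2 + (G_bar[i] @ F_bar[j] - 1.0) ** 2)
        # and symmetrically, g_j queried against the first stream
        loss += max(0.0, 1.0 - (F_bar[t] @ G_bar[j] - 1.0) ** 2 + (F_bar[i] @ G_bar[j] - 1.0) ** 2)
    return loss
```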
7.3.2 Optimization for RADH

The variables F, G, and B must be optimized in the objective function Eq. (7.23). Due to the discrete constraint on B, obtaining the optimal solution directly is challenging. This section introduces an efficient optimization that updates the variables alternately and discretely.
(1) Update F with G and B fixed: By fixing G and B, the objective function Eq. (7.23) is transformed into

\min L_f = \sum_{i,j} \left( b_i^T \bar{f}_j - \sqrt{k} S_{ij} \right)^2 + \gamma \sum_j \left\| \sqrt{k}\,\bar{f}_j - b_j \right\|_2^2 + \tau \sum_{i,j,t} \left[ 1 - (\bar{g}_t^T \bar{f}_j - 1)^2 + (\bar{g}_i^T \bar{f}_j - 1)^2 \right]_+ + \eta \left\| \sqrt{k}\,\bar{F}^T \mathbf{1} \right\|_2^2    (7.24)
The derivative of L_f with respect to \bar{f}_j is

\frac{\partial L_f}{\partial \bar{f}_j} = 2\sum_i b_i \left( b_i^T \bar{f}_j - \sqrt{k} S_{ij} \right) + 2\gamma \sqrt{k}\left( \sqrt{k}\,\bar{f}_j - b_j \right) + 2\tau \sum_{i,t} M_{it} \left( -\bar{g}_t (\bar{g}_t^T \bar{f}_j - 1) + \bar{g}_i (\bar{g}_i^T \bar{f}_j - 1) \right) + 2\eta k \bar{F}^T \mathbf{1}    (7.25)
in which M is a mask matrix whose element in the i-th row and t-th column is denoted M_{it}; M_{it} = 1 if 1 - (\bar{g}_t^T \bar{f}_j - 1)^2 + (\bar{g}_i^T \bar{f}_j - 1)^2 > 0, and M_{it} = 0 otherwise. Then the derivative of \bar{f}_j with respect to f_j is calculated as follows:

\frac{\partial \bar{f}_j}{\partial f_j} = \frac{I}{\|f_j\|} - \frac{f_j f_j^T}{\|f_j\|^3}    (7.26)
Considering Eqs. (7.25) and (7.26) and exploiting the chain rule, the derivative of L_f with respect to f_j is

\frac{\partial L_f}{\partial f_j} = \frac{\partial \bar{f}_j}{\partial f_j} \frac{\partial L_f}{\partial \bar{f}_j}    (7.27)
After getting the gradient \partial L_f / \partial f_j, the chain rule is used to obtain \partial L_f / \partial W_f, in which W_f denotes the weights of the first stream, and W_f is updated by back-propagation.

(2) Update G with F and B fixed: By fixing F and B, the objective function Eq. (7.23) is transformed into

\min L_g = \sum_{i,j} \left( b_i^T \bar{g}_j - \sqrt{k} S_{ij} \right)^2 + \gamma \sum_j \left\| \sqrt{k}\,\bar{g}_j - b_j \right\|_2^2 + \tau \sum_{i,j,t} \left[ 1 - (\bar{f}_t^T \bar{g}_j - 1)^2 + (\bar{f}_i^T \bar{g}_j - 1)^2 \right]_+ + \eta \left\| \sqrt{k}\,\bar{G}^T \mathbf{1} \right\|_2^2    (7.28)
Similarly, the derivative of L_g with respect to g_j is

\frac{\partial L_g}{\partial g_j} = \frac{\partial \bar{g}_j}{\partial g_j} \frac{\partial L_g}{\partial \bar{g}_j}    (7.29)

where

\frac{\partial \bar{g}_j}{\partial g_j} = \frac{I}{\|g_j\|} - \frac{g_j g_j^T}{\|g_j\|^3}    (7.30)

\frac{\partial L_g}{\partial \bar{g}_j} = 2\sum_i b_i \left( b_i^T \bar{g}_j - \sqrt{k} S_{ij} \right) + 2\gamma \sqrt{k}\left( \sqrt{k}\,\bar{g}_j - b_j \right) + 2\tau \sum_{i,t} M_{it} \left( -\bar{f}_t (\bar{f}_t^T \bar{g}_j - 1) + \bar{f}_i (\bar{f}_i^T \bar{g}_j - 1) \right) + 2\eta k \bar{G}^T \mathbf{1}    (7.31)
Similar to the optimization of W_f, after getting the gradient \partial L_g / \partial g_j, the chain rule is used to obtain \partial L_g / \partial W_g, where W_g denotes the weights of the second stream, and W_g is updated by back-propagation.

(3) Update B with F and G fixed: By fixing F and G, the objective function Eq. (7.23) is transformed into

\min L_b = \left\| \bar{F}B^T - \sqrt{k} S \right\|_F^2 + \left\| \bar{G}B^T - \sqrt{k} S \right\|_F^2 + \gamma \left( \left\| \sqrt{k}\,\bar{F} - B \right\|_F^2 + \left\| \sqrt{k}\,\bar{G} - B \right\|_F^2 \right), \quad \text{s.t. } b_i \in \{-1, +1\}    (7.32)

Then Eq. (7.32) can be rewritten as

\min L_b = -2\,\mathrm{Tr}\!\left[ B\left( \sqrt{k}\,(\bar{F}^T + \bar{G}^T)(S + \gamma I) \right) \right] + \left\| B\bar{F}^T \right\|_F^2 + \left\| B\bar{G}^T \right\|_F^2 + \mathrm{const}, \quad \text{s.t. } b_i \in \{-1, +1\}    (7.33)

where "const" denotes a constant value independent of B. Let Q = -2\sqrt{k}\,(S^T + \gamma I)(\bar{F} + \bar{G}). Equation (7.33) can then be simplified to

\min L_b = \left\| B\bar{F}^T \right\|_F^2 + \left\| B\bar{G}^T \right\|_F^2 + \mathrm{Tr}[BQ^T] + \mathrm{const}, \quad \text{s.t. } b_i \in \{-1, +1\}    (7.34)

Due to the difficulty of optimizing B directly, we update it incrementally; in other words, we change one column of B while keeping the other columns fixed.
Algorithm 7.2 Relaxed Asymmetric Deep Hashing (RADH)
Input: Training data X/Y; similarity matrix S; hash code length k; predefined parameters τ, γ, and η.
Output: Hashing functions F and G for the two streams, respectively.
Initialization: Initialize the weights of the first seven layers with the pretrained ImageNet model; initialize the last layer randomly; set B to an all-zero matrix.
1: while not converged and the maximum iteration is not reached do
2:   Update (F, W_f):
3:   Fix (G, W_g) and B and update (F, W_f) using back-propagation according to Eq. (7.27).
4:   Update (G, W_g):
5:   Fix (F, W_f) and B and update (G, W_g) using back-propagation according to Eq. (7.29).
6:   Update B:
7:   Fix (F, W_f) and (G, W_g) and update B according to Eq. (7.36).
8: end while
Let B_{*c} denote the c-th column of B and \hat{B}_c the remaining columns; the same notation holds for F_{*c}, \hat{F}_c, G_{*c}, \hat{G}_c, Q_{*c}, and \hat{Q}_c. To optimize B_{*c}, we can rewrite Eq. (7.34) as

\min_{B_{*c}} \mathrm{Tr}\!\left( B_{*c} \left[ 2\left( F_{*c}^T \hat{F}_c + G_{*c}^T \hat{G}_c \right) \hat{B}_c^T + Q_{*c}^T \right] \right) + \mathrm{const}, \quad \text{s.t. } B \in \{-1, +1\}^{n\times k}    (7.35)

The optimal solution for B_{*c} is

B_{*c} = -\operatorname{sign}\!\left( 2\hat{B}_c \left( \hat{F}_c^T F_{*c} + \hat{G}_c^T G_{*c} \right) + Q_{*c} \right)    (7.36)

After obtaining B_{*c}, B can be updated by replacing its c-th column with B_{*c}. By repeating Eq. (7.36), all columns can be updated. A description of the optimization is given in Algorithm 7.2.
7.3.3 Inference for RADH

After training, the weights W_f and W_g of the first and second streams are learned. We can get the binary code for an out-of-sample image x^* in the query set by using either b_f^* = \operatorname{sign}(F(x^*, W_f)) or b_g^* = \operatorname{sign}(G(x^*, W_g)). According to the experimental results, b_f^* and b_g^* exhibit comparable performance. To reach a more robust conclusion, we use their average, as indicated in Eq. (7.37):

b^* = \operatorname{sign}\!\left( F(x^*, W_f) + G(x^*, W_g) \right)    (7.37)
7.3.4 Implementation

We develop RADH with the deep learning toolkit MatConvNet [55] on a Titan X GPU. Specifically, the CNN-F model pretrained on the ImageNet dataset is used to initialize the weights of both streams, which significantly improves performance. Because the output of the final layer differs from that of the original CNN-F model, we randomize the weights in this layer. We set the maximum number of epochs to 150, the learning rate to 10^{-3}, the batch size to 64, and the weight decay to 5 × 10^{-4} throughout the training phase. The weights are then updated using stochastic gradient descent (SGD).

The triplet loss in Eq. (7.20) is calculated by ranking each positive and negative pair. However, taking all training samples into consideration is time-consuming. We therefore sort the positive pairs in descending order and the negative pairs in ascending order according to (f_i^T b_j - 1)^2 / (g_i^T b_j - 1)^2, and then choose the top 200 samples from each as the positive and negative examples. In other words, we pick just 200 samples with the greatest distance to the input as the positive set, and another 200 samples with the least distance to the input as the negative set. This strategy encourages the triplet loss to focus on the hardest samples, contributing to an efficient update of our network.
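The hard-example selection just described might look like the following sketch, where for a given query index the hardest positives and negatives are picked according to (f_i^T b_j - 1)^2; all names are illustrative, the hard-coded 200 is taken from the description above, and this is only an assumed realization of that procedure.

```python
import numpy as np

def select_hard_pairs(F_bar, B, S, j, num=200):
    """Pick the hardest positives/negatives for query j, as described in Sect. 7.3.4 (sketch)."""
    scores = (F_bar @ B[j] - 1.0) ** 2               # (f_i^T b_j - 1)^2 for all candidates i
    pos = np.where(S[:, j] > 0)[0]                   # samples sharing a label with j
    neg = np.where(S[:, j] <= 0)[0]                  # samples with a different label
    hard_pos = pos[np.argsort(-scores[pos])][:num]   # positives farthest from b_j
    hard_neg = neg[np.argsort(scores[neg])][:num]    # negatives closest to b_j
    return hard_pos, hard_neg
```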
7.3.5 Experimental Results

We conduct experiments on several datasets.
7.3.5.1 Experimental Settings

Apart from the three datasets used above, we also use the NUS-WIDE dataset, which contains 269,648 web images. Following [22] and [24], 195,834 images corresponding to the 21 most frequently used tags are chosen; 2100 images are selected for testing and the remaining images form the retrieval set, from which 10,500 images are chosen for training. The MAP is used as the evaluation protocol in this case. Additionally, following [23], we use the MAP of the top 500 retrieved samples (MAP@500) and the mean precision of the top 500 retrieved samples (Precision@500) as additional evaluation metrics.
7.3.5.2 Comparison with Other Methods

On four real-world datasets, we compare the proposed strategy with various state-of-the-art methods.
Table 7.9 The MAP scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     14.69  14.22   17.17   15.56   16.54   16.44
SpH     18.22  19.14   18.98   21.50   21.82   23.09
ITQ     24.47  25.56   24.52   24.75   26.51   26.49
DSH     22.07  20.99   21.48   22.69   23.93   23.87
DPSH    63.48  66.97   68.83   73.45   74.66   75.02
ADSH    56.67  71.41   76.50   80.40   82.73   82.73
DAPH    59.09  61.17   68.15   69.22   70.74   70.28
RADH    76.27  77.92   79.79   82.35   83.12   84.43
Table 7.10 The MAP@Top500 and Precision@Top500 scores obtained by different methods on the CIFAR-10 dataset. Bold values mean the best performances

MAP@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     17.73  24.97   29.06   28.03   30.39   32.59
SpH     31.10  32.87   33.41   38.64   41.24   43.35
ITQ     26.43  37.48   37.42   41.06   45.57   46.16
DSH     27.25  34.14   34.64   37.00   40.02   41.33
DPSH    58.58  66.85   71.48   75.74   79.69   80.62
ADSH    58.92  70.07   74.95   78.09   78.85   77.61
DAPH    50.88  66.24   72.43   77.31   79.21   80.40
RADH    68.72  70.39   77.36   79.39   82.90   84.92

Precision@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     14.37  18.30   22.85   22.70   25.54   27.19
SpH     24.15  27.59   28.59   34.24   36.60   38.73
ITQ     24.10  33.00   31.86   34.87   40.53   40.78
DSH     24.47  28.20   29.33   33.07   35.54   36.55
DPSH    64.13  70.80   74.71   78.34   80.59   81.55
ADSH    61.09  73.71   78.85   81.84   83.45   82.86
DAPH    54.22  68.28   74.42   77.36   78.80   79.55
RADH    76.37  76.94   81.62   82.88   85.31   86.47
CIFAR-10 Table 7.9 summarizes the MAP scores acquired using the various approaches. As can be seen from this table, RADH always achieves the highest MAP score, especially when the code length is short. DPSH, ADSH, DAPH, and our suggested RADH always outperform LSH, SpH, ITQ, and DSH, which are not deep learning approaches. In comparison to these three deep learning based approaches, RADH also shows a clear improvement; particularly when the code length is 8-bit, our technique achieves a gain exceeding 10% in MAP score. Table 7.10 also includes the MAP@Top500 and Precision@Top500 scores derived using the various methods. As shown, RADH is superior to the other methods. When the code length is set to 8 bits, RADH improves the MAP@Top500 and Precision@Top500 scores by over 10% compared to the other techniques. Although the performance gap narrows as the code length increases, RADH remains better than the other techniques.

On the CIFAR-10 dataset, Fig. 7.10 illustrates the Precision-Recall (PR) curves for different code lengths ranging from 8 to 48 bits. It is self-evident that the proposed method covers the largest or competitive areas. Except for the case when the code length is 36 bits, RADH improves performance considerably. Compared with LSH, SpH, ITQ, and DSH, RADH covers substantially bigger
Fig. 7.10 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the CIFAR-10 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
Table 7.11 The MAP scores obtained by different methods on the NUS-WIDE dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     38.78  38.28   41.58   41.32   44.32   46.36
SpH     43.18  44.64   45.28   46.66   47.24   48.20
ITQ     54.40  56.16   56.04   56.75   57.34   57.56
DSH     52.80  47.66   49.24   49.85   50.51   52.24
DPSH    62.86  63.63   66.23   66.65   67.96   68.12
ADSH    58.79  61.73   63.97   65.38   63.86   63.98
DAPH    60.17  61.88   62.56   63.52   63.30   64.62
RADH    64.90  68.07   70.30   71.60   73.04   73.47
Table 7.12 The MAP@Top500 and Precision@Top500 scores obtained by different methods on the NUS-WIDE dataset. Bold values mean the best performances

MAP@Top500
Method  8-bit  12-bit  16-bit  24-bit  32-bit  48-bit
LSH     43.49  47.46   53.39   57.58   63.95   68.40
SpH     57.78  63.70   65.69   69.11   72.01   73.34
ITQ     67.85  72.96   74.95   77.07   78.86   78.83
DSH     62.85  65.55   67.28   70.95   72.24   73.82
DPSH    75.27  77.42   79.62   81.39   82.27   82.90
ADSH    63.35  68.77   72.62   74.99   74.02   75.14
DAPH    74.27  78.03   80.07   81.22   82.39   83.71
RADH    75.03  79.15   81.19   82.94   84.35   84.96

Precision@Top500
Method  8-bit  12-bit  16-bit  24-bit  32-bit  48-bit
LSH     43.82  45.83   51.81   55.28   61.48   65.82
SpH     57.07  61.95   63.62   67.07   69.78   71.06
ITQ     67.96  72.46   74.07   75.98   77.63   78.51
DSH     62.30  64.26   65.88   69.34   70.34   72.02
DPSH    75.04  76.88   79.25   80.82   81.79   82.32
ADSH    63.77  68.82   72.35   74.39   73.72   74.75
DAPH    74.40  77.67   79.28   80.63   81.79   82.97
RADH    75.21  79.26   81.06   82.78   84.15   84.63
regions. Although RADH has only a slight advantage over DPSH, ADSH, and DAPH when the code length is 48 bits, it achieves considerably larger gaps in covered regions when the code length is 8, 12, 16, or 24 bits, indicating its superiority.

NUS-WIDE Table 7.11 illustrates the MAP scores of the different approaches. As can be seen, our method RADH significantly outperforms the comparison methods. Relative to non-deep learning methods such as LSH, SpH, ITQ, and DSH, RADH always achieves an improvement of around 15%. Compared to the deep hashing algorithms, RADH remains better; specifically, our method achieves a MAP score of 73.47%, whereas the best performance achieved by the existing deep hashing approaches is just 68.12%. Table 7.12 summarizes the MAP@Top500 and Precision@Top500 scores, demonstrating the suggested method's superiority. Except when the code length is 8 bits, RADH always reaches the highest positions. Both metrics increase by at least 5% when compared to LSH, SpH, ITQ, and DSH, and in comparison to DPSH, ADSH, and DAPH, our technique still exhibits some improvement.

Figure 7.11 illustrates the PR curves generated using the different approaches. It is obvious that RADH covers the maximum region when the code length is between
Fig. 7.11 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the NUS-WIDE dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
Table 7.13 The MAP scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     34.07  33.98   35.11   35.79   36.71   37.94
SpH     36.04  36.04   36.75   37.77   38.56   39.11
ITQ     41.36  42.20   42.89   43.10   43.20   43.91
DSH     40.64  40.09   39.73   41.66   42.42   42.33
DPSH    45.19  46.03   46.82   47.37   47.97   48.60
ADSH    44.69  46.98   48.25   49.06   50.24   50.59
DAPH    44.33  44.48   44.73   45.12   45.24   45.52
RADH    47.16  50.59   52.13   53.71   54.80   55.38
Table 7.14 The MAP@Top500 and Precision@Top500 scores obtained by different methods on the IAPR TC-12 dataset. Bold values mean the best performances

MAP@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     39.87  39.83   42.72   44.80   48.07   50.78
SpH     44.38  45.43   47.73   50.41   52.45   54.30
ITQ     51.75  54.48   55.93   57.64   58.83   59.90
DSH     48.34  48.63   49.70   52.74   54.78   55.51
DPSH    57.07  58.13   59.94   61.61   63.05   64.49
ADSH    53.70  58.12   61.35   63.25   64.90   65.59
DAPH    56.26  57.98   59.48   61.27   62.57   63.94
RADH    58.99  64.67   67.34   69.12   70.91   71.64

Precision@Top500
Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     38.14  37.90   40.41   42.05   44.45   46.85
SpH     42.33  42.80   44.69   46.94   48.49   49.95
ITQ     49.95  52.16   53.31   54.33   55.02   56.01
DSH     47.46  46.87   47.21   50.12   51.61   52.20
DPSH    55.03  55.90   57.73   58.96   60.15   61.40
ADSH    52.31  56.47   58.97   60.29   61.94   62.50
DAPH    54.32  55.44   56.52   57.92   58.95   60.03
RADH    57.29  61.81   64.45   66.25   67.86   68.40
8 and 48 bits. Specifically, DPSH and RADH consistently provide acceptable PR curves compared to the other approaches, and RADH also results in a noticeable improvement over DPSH.

IAPR TC-12 Table 7.13 summarizes the MAP scores of the different methods. As can be seen, RADH outperforms all comparison methods. The MAP scores of RADH are much higher than those of LSH, SpH, ITQ, and DSH, demonstrating that data-independent and conventional data-dependent hashing techniques are often inferior to deep hashing schemes. In comparison to the existing deep hashing approaches, our method is also better; in most instances, RADH improves the MAP score by more than 4%, demonstrating the effectiveness of our relaxed strategy and updated triplet loss. According to Table 7.14, RADH achieves a significant improvement over the other methods on MAP@Top500 and Precision@Top500. In most instances, RADH improves the MAP@Top500 and Precision@Top500 scores by more than 10% in comparison to LSH, SpH, ITQ, and DSH. Compared to DPSH, ADSH, and DAPH, our method also improves both evaluation protocols by more than 5% when the code length is between 12 and 48. Additionally, although ADSH outperforms DPSH and DAPH when the code length is 12, 16, 24, 36, or 48, it performs much
worse when the code length is 8. By comparison, our method RADH consistently achieves the best performance regardless of the code length, substantiating its superiority. Figure 7.12 illustrates the Precision-Recall (PR) curves of the various techniques for bit lengths ranging from 8 to 48. It is easy to see that the PR curves calculated using RADH span a broader region than those generated using the other comparison methods; relative to DPSH, ADSH, and DAPH, RADH obtains a remarkable enhancement.

MIRFLICKR-25K Table 7.15 summarizes the MAP scores derived using RADH and the various comparison approaches on the MIRFLICKR-25K dataset. Similarly, our proposed approach outperforms all other approaches under this evaluation methodology. When compared to data-independent and standard data-dependent hashing algorithms, the deep hashing methods yield much superior outcomes. In comparison to DPSH, ADSH, and DAPH, although RADH improves the MAP scores only somewhat when the bit length is 8 or 12, it improves significantly when the code length is between 16 and 48. RADH achieves 80.13% and 80.38% with code lengths of 36 and 48, respectively, whereas the best of the three deep hashing techniques achieves only 76.20% and 76.05%, which is much lower than RADH. In terms of the MAP@Top500 and Precision@Top500 scores shown in Table 7.16, our method achieves further gains: both evaluation metrics show a greater-than-10% improvement over LSH, SpH, ITQ, and DSH, and compared to the three deep hashing methods, RADH improves performance by around 2-4% when the code length is between 8 and 48.

The PR curves calculated by LSH, SpH, ITQ, DSH, DPSH, ADSH, DAPH, and RADH on the MIRFLICKR-25K dataset are illustrated in Fig. 7.13. As can be seen, although RADH and ADSH provide similar results when the code length is 8, RADH covers a larger region in the other cases. Specifically, as the code length increases, our proposed method widens the gap in covered regions relative to the three deep learning based hashing approaches. Additionally, as seen in Fig. 7.13, despite ADSH's competitive performance on the PR curve, ADSH performs worse than DPSH when the code length grows to 24, 36, and 48, demonstrating its instability. By comparison, RADH always achieves competitive or better performance regardless of the code length, demonstrating its effectiveness and robustness.
7.3.6 Conclusion

This section proposed a Relaxed Asymmetric Deep Hashing technique. In contrast to previous approaches, which adopt a point-to-point strategy to match each pair of instances, we proposed point-to-angle matching. In particular, there is no length limitation on the acquired real-valued features, but the asymmetric
Fig. 7.12 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the IAPR TC-12 dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
Table 7.15 The MAP scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

Method  8-bit  12-bit  16-bit  24-bit  36-bit  48-bit
LSH     57.14  57.21   58.25   59.37   59.62   59.86
SpH     59.20  60.71   60.91   60.97   61.63   62.43
ITQ     63.09  63.40   63.52   63.82   64.06   64.34
DSH     64.48  63.85   64.10   65.96   65.00   65.62
DPSH    73.48  74.68   75.58   76.01   76.09   76.05
ADSH    75.39  76.41   76.98   76.59   76.20   74.53
DAPH    72.79  74.70   74.30   74.14   73.81   73.41
RADH    75.50  76.77   78.09   79.18   80.13   80.38
Table 7.16 The MAP@Top500 and Precision@Top500 scores obtained by different methods on the MIRFLICKR-25K dataset. Bold values mean the best performances

MAP@Top500
Method   8-bit   12-bit   16-bit   24-bit   36-bit   48-bit
LSH      60.02   60.81    63.66    65.86    67.81    68.47
SpH      65.61   68.36    69.49    70.33    71.64    73.65
ITQ      72.30   73.90    74.30    75.60    76.19    77.06
DSH      70.59   71.83    73.31    75.43    74.89    76.37
DPSH     82.88   83.84    84.34    84.84    85.77    85.64
ADSH     82.14   83.80    84.94    84.90    84.20    82.06
DAPH     81.08   84.20    83.71    84.45    84.02    84.07
RADH     84.91   85.78    86.52    87.75    88.66    88.39

Precision@Top500
Method   8-bit   12-bit   16-bit   24-bit   36-bit   48-bit
LSH      59.25   59.65    62.20    64.33    65.74    66.31
SpH      64.19   66.65    67.56    68.34    69.43    71.26
ITQ      70.95   72.09    72.42    73.54    74.08    74.66
DSH      69.90   70.50    71.82    73.90    73.12    74.52
DPSH     81.85   83.01    83.58    84.11    84.97    84.80
ADSH     81.50   83.18    84.15    84.03    83.52    81.31
DAPH     80.37   83.24    82.73    83.40    82.93    82.91
RADH     84.14   84.88    85.96    87.07    88.00    87.77
technique permits discrete hashing variables and real-valued features to coexist in the same Hamming space if they share semantic information. Additionally, an updated triplet loss is presented to further improve the ranking of positive and negative pairs; this loss is well suited to our proposed relaxed hashing learning. Experiments on four real-world datasets confirm the superiority of the proposed method.
7.4 Joint Learning of Single-Image and Cross-Image Representations for Person Re-identification

The aim of person re-identification is to match two pedestrian images taken from different viewpoints. Existing approaches for re-identification can be classified into two groups based on whether they employ hand-crafted or deep CNN features. Numerous kinds of hand-crafted features have been utilized for person re-identification, including local binary patterns (LBP) [30, 40], color histograms [30], and local maximum occurrence (LOMO) [41, 42]. When using hand-crafted features, the approaches often concentrate on learning an efficient distance/similarity
Fig. 7.13 The Precision-Recall curves computed by LSH, SpH, ITQ, SDH, SGH, DPSH, ADSH, DAPH, and RADH on the MIRFLICKR-25K dataset. Figures from (a) to (f) are associated with the code length of 8-bit, 12-bit, 16-bit, 24-bit, 36-bit, and 48-bit
metric for comparing the features. For deep CNN-based features, the feature representation and the classifier can be optimized concurrently by learning either the SIR or the CIR. Numerous distance metric learning techniques have been developed for person re-identification. They aim at learning a distance metric that decreases the distance between matched images and increases the distance between mismatched images. Several of the known distance metric learning techniques are based on pairwise comparison constraints. Guillaumin et al. created a logistic discriminant metric learning (LDML) model in which the probability of a given sample pair $(x_i, x_j)$ was modeled and the maximum log-likelihood function was employed as the objective function [28]. Köstinger et al. developed the KISS metric learning (KISSME) technique to overcome the scalability problem associated with metric learning from equivalence constraints [30]. Li et al. introduced an adjustable threshold into the Mahalanobis distance [32] to create a generic similarity measure for person re-identification. Li and Wang introduced the locally aligned feature transform to match person images across camera views [33]. Liao et al. improved the KISSME approach by learning a distinct low-dimensional subspace [41] based on the LOMO features. Additionally, they improved the LDML model by imposing a positive semi-definite constraint and using an asymmetric sample weighting approach [42]. Other methods, such as pairwise constrained component analysis (PCCA) [31], local Fisher discriminant analysis (LFDA) [59], and information-theoretic metric learning (ITML) [60], are also based on pairwise comparison constraints. Apart from methods based on pairwise comparison constraints, various other approaches rely on triplet comparison constraints. Weinberger et al. suggested the large margin nearest neighbor (LMNN) model [61], in which the distance metric is learnt to separate the matched neighbors from the mismatched ones by a large margin. Dikmen et al. enhanced LMNN by adding a rejection option [62]. Zheng et al. devised a model for person re-identification by maximizing the likelihood that each sample is closer to its matched sample than to its mismatched samples [63]. Due to the strength of deep CNNs in learning discriminative features from large-scale image data, several approaches have adopted a deep architecture to concurrently train the representation and the classifier [36, 37, 43, 44, 64]. Several of them place a premium on learning the SIR in conjunction with the similarity function. Schroff et al. presented the FaceNet model for face verification [64], which uses a deep convolutional neural network to learn a Euclidean embedding for each image through the triplet comparison loss. Additionally, online triplet generation is developed to progressively increase the difficulty of the triplets during training. Ding et al. developed a deep SIR learning model for person re-identification based on relative distance comparison [35]. It begins by presenting an efficient triplet generation strategy for constructing triplets composed of one image together with a matched image and a mismatched image. This approach determines the SIR for each triplet by optimizing the relative distance between the matched and mismatched pairs. Apart from learning the SIR, various approaches to person re-identification based on the CIR have also been developed. Li et al.
presented a filter pairing neural network (FPNN) [36] that employs a patch matching layer followed by a maxout-grouping layer to train
the CIRs. In FPNN, the patch matching layer is used to model the displacement of each horizontal stripe in the images across views, the maxout-grouping layer is used to increase the robustness of patch matching, and finally a softmax classifier is imposed on the learnt CIR for person re-identification. The work in [37] is similar in concept, but incorporates a new layer for learning the cross-image representation by calculating the neighborhood difference between two input images. The work in [44] learns the CIR by formulating person re-identification as a learning-to-rank problem. This model stitches the two images of each pair together horizontally to create a holistic image and then feeds these images to a CNN to learn their representations. Finally, a ranking loss is employed to guarantee that each sample is more similar to its positive matched image than to its negative matched image. Liu et al. introduced a Matching CNN (M-CNN) architecture for human parsing [38], which uses a multi-layer cross-image convolutional path to learn the CIR of an image and a semantic region in order to predict their matching confidence and displacements.

The purpose of this work is to study the connection between the SIR and the CIR and to propose a collaborative deep CNN learning framework that makes use of the advantages of these two kinds of representations. Two pedestrian images are denoted by $x_i$ and $x_j$. On the basis of the cross-image representation $g(x_i, x_j)$, we utilize the following classifier:
$$S_{\mathrm{CIR}}(x_i, x_j) = \mathbf{w}^T g(x_i, x_j) - b \tag{7.38}$$
and the Euclidean distance is utilized as the dissimilarity measure between the SIRs of $x_i$ and $x_j$:
$$S_{\mathrm{SIR}}(x_i, x_j) = \| f(x_i) - f(x_j) \|_2^2 \tag{7.39}$$
in which $(\mathbf{w}, b)$ denotes the parameters of the classifier, $f(x_i)$ and $f(x_j)$ are the SIRs of $x_i$ and $x_j$, respectively, and $\|\cdot\|_2$ denotes the $L_2$ norm. With $S_{\mathrm{SIR}}(x_i, x_j)$, we introduce a threshold $t_S$ to determine whether the two pedestrian images are from the same person. It can be shown that classification on the CIR is a generalization of conventional similarity measures based on the SIR. Denote by $[\cdot]_{\mathrm{vec}}$ the vector form of a matrix. Taking the Euclidean distance in (7.39) as an instance, $S_{\mathrm{SIR}}(x_i, x_j)$ is obviously a special case of $S_{\mathrm{CIR}}(x_i, x_j)$ with $\mathbf{w} = [\mathbf{I}]_{\mathrm{vec}}$, $b = t_S$, and $g(x_i, x_j) = [(f(x_i) - f(x_j))(f(x_i) - f(x_j))^T]_{\mathrm{vec}}$, in which $\mathbf{I}$ is the identity matrix. As introduced in Sect. 7.4.1.1, other distance or similarity measures [32, 65] are also special instances of the CIR.

We propose a framework for concurrently learning the SIR and CIR that enhances matching performance with the least increase in computational cost by using a deep CNN architecture. As shown in Fig. 7.14, our network is divided into three sub-networks, the first of which is a shared sub-network, followed by two sub-networks for extracting the SIR and CIR, respectively. To save computational costs, we can store the CNN feature maps produced by the shared sub-network for the gallery images in advance and reduce the depth of the CIR sub-network to only one convolutional layer and one fully connected layer. The shared feature maps and the SIR of each probe image must be calculated only once during the test stage, and only the CIR sub-network is utilized to compute the CIR between the probe image and each gallery image. Thus, we can use the CIR to increase matching accuracy while minimizing the computational cost via the use of the SIR and the shared sub-network. Additionally, we enhance our model by including two distinct deep CNNs for joint SIR and CIR learning based on a pairwise or a triplet comparison objective [66]. For the pairwise comparison-based network, the CIR is learned using a standard support vector machine (SVM) [67]. For the network based on triplet comparisons, we learn the CIR using a ranking SVM (RankSVM) [68]. Finally, we calculate the similarity of an image pair by combining the matching scores from these two networks. Experiments on person re-identification have been performed on several public datasets, including CUHK03 [36], CUHK01 [69], and VIPeR [70]. The results indicate that combining SIR and CIR learning improves the performance of person re-identification, and that the matching accuracy can be increased further by combining the learnt models with pairwise and triplet comparison objectives. When compared with state-of-the-art methods, the proposed methods achieve superior person re-identification performance.

Fig. 7.14 The sketch of the network for learning the single-image and cross-image representations
7.4.1 Joint SIR and CIR Learning

This section first discusses the relationship between the SIR and CIR, then proposes two formulations for combining SIR and CIR learning (i.e., a pairwise comparison formulation and a triplet comparison formulation), and finally introduces the matching scores for person re-identification.
7.4.1.1 Connection Between SIR and CIR

There are four widely used distance/similarity metrics for person re-identification with SIR features: the Euclidean distance, the Mahalanobis distance, joint Bayesian [65], and LADF [32]. As previously stated, the Euclidean distance on SIRs can be considered as a special case of CIR-based classification. We demonstrate in the following that the other measures are also special cases of CIR-based classification.

The Mahalanobis distance based on the SIR $z_i = f(x_i)$ can be formulated as $s(x_i, x_j) = (z_i - z_j)^T \mathbf{M} (z_i - z_j)$, in which $\mathbf{M}$ is positive semi-definite. This formulation is equivalent to (7.38) when $\mathbf{w} = [\mathbf{M}]_{\mathrm{vec}}$ and $g(x_i, x_j) = [(z_i - z_j)(z_i - z_j)^T]_{\mathrm{vec}}$.

The joint Bayesian formulation [65] can be defined as follows:
$$s(x_i, x_j) = z_i^T \mathbf{A} z_i + z_j^T \mathbf{A} z_j - 2 z_i^T \mathbf{G} z_j \tag{7.40}$$
which is a generalization of the Mahalanobis distance. By setting $\mathbf{w} = \left[ [\mathbf{A}]_{\mathrm{vec}}^T \; [\mathbf{G}]_{\mathrm{vec}}^T \right]^T$ and $g(x_i, x_j) = \left[ [z_i z_i^T + z_j z_j^T]_{\mathrm{vec}}^T \; [-2 z_j z_i^T]_{\mathrm{vec}}^T \right]^T$ in (7.38), joint Bayesian can be considered as a classifier $\mathbf{w}$ on the CIR $g(x_i, x_j)$.

The LADF [32] is defined as follows:
$$s(x_i, x_j) = \frac{1}{2} z_i^T \mathbf{A} z_i + \frac{1}{2} z_j^T \mathbf{A} z_j + z_i^T \mathbf{B} z_j + \mathbf{c}^T (z_i + z_j) + b \tag{7.41}$$
which is a generalization of both the Mahalanobis distance and joint Bayesian. We can also consider it as a special instance of (7.38) with $\mathbf{w} = \left( [\mathbf{A}]_{\mathrm{vec}}^T \; [\mathbf{B}]_{\mathrm{vec}}^T \; \mathbf{c}^T \; b \right)^T$ and $g(x_i, x_j) = \left[ \frac{1}{2} [z_i z_i^T + z_j z_j^T]_{\mathrm{vec}}^T \; [z_j z_i^T]_{\mathrm{vec}}^T \; (z_i + z_j)^T \; 1 \right]^T$.
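To make the special-case argument above concrete, the short NumPy check below (our own illustration, not code from the book) verifies numerically that the Mahalanobis distance equals a linear classifier $\mathbf{w}^T g(x_i, x_j)$ with $\mathbf{w} = [\mathbf{M}]_{\mathrm{vec}}$ and $g(x_i, x_j) = [(z_i - z_j)(z_i - z_j)^T]_{\mathrm{vec}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
z_i, z_j = rng.standard_normal(d), rng.standard_normal(d)

# A random positive semi-definite matrix M for the Mahalanobis distance.
A = rng.standard_normal((d, d))
M = A @ A.T

# Direct Mahalanobis distance between the two SIRs.
diff = z_i - z_j
s_direct = diff @ M @ diff

# Rewritten as a linear classifier w on the CIR g(x_i, x_j) as in (7.38)
# (with the bias b dropped): w = [M]_vec, g = [(z_i - z_j)(z_i - z_j)^T]_vec.
w = M.ravel()
g = np.outer(diff, diff).ravel()
s_cir = w @ g

assert np.isclose(s_direct, s_cir)
print(s_direct, s_cir)
```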
Despite their connections, SIR and CIR do have distinct advantages that can be used to enhance the matching performance. With the SIR-based method, the gallery
set’s SIR features can be precomputed in advance. We then simply need to extract the SIR from each probe image and calculate its distance/similarity measure to the precomputed SIRs of the gallery images, which makes the SIR computationally efficient for re-identification. The CIR-based technique, in turn, successfully models the intricate interactions between gallery and probe images and is robust to spatial displacement and view changes. In the following sections, we explore the losses for joint SIR and CIR learning and develop a network architecture that is both accurate and efficient.
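The efficiency argument can be summarized by the following schematic Python sketch of the test-time pipeline. The three sub-network functions are placeholders standing in for the trained shared, SIR, and CIR sub-networks (their bodies are made up); only the structure of the computation — cache the gallery features once, run the shallow CIR sub-network per probe–gallery pair, score with (7.52) — reflects the text.

```python
import numpy as np

# Placeholder components standing in for the trained sub-networks.
def shared_subnet(image):        # image -> shared feature maps
    return image

def sir_subnet(feature_maps):    # feature maps -> SIR vector f(x)
    return feature_maps.mean(axis=(1, 2))

def cir_subnet(maps_a, maps_b):  # pair of feature maps -> CIR score w^T g(., .)
    return float(np.abs(maps_a - maps_b).mean())

def rank_gallery(probe, gallery_images, lam=0.7):
    # Gallery feature maps and SIRs are computed once and cached.
    gal_maps = [shared_subnet(g) for g in gallery_images]
    gal_sirs = [sir_subnet(m) for m in gal_maps]

    # Each probe needs one pass through the shared and SIR sub-networks ...
    p_maps = shared_subnet(probe)
    p_sir = sir_subnet(p_maps)

    # ... and only the shallow CIR sub-network is run per probe-gallery pair.
    scores = []
    for m, s in zip(gal_maps, gal_sirs):
        score = float(np.sum((p_sir - s) ** 2)) + lam * cir_subnet(p_maps, m)
        scores.append(score)
    return np.argsort(scores)  # smaller score = better match, as in (7.52)

rng = np.random.default_rng(0)
gallery = [rng.standard_normal((16, 8, 4)) for _ in range(5)]
print(rank_gallery(rng.standard_normal((16, 8, 4)), gallery))
```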
7.4.1.2 Pairwise Comparison Formulation

The doublet training set is denoted as $\{(x_i, x_j, h_{ij})\}$, where $x_i$ and $x_j$ are the $i$th and $j$th training samples, respectively, and $h_{ij}$ is the label assigned to the doublet $(x_i, x_j)$: if $x_i$ and $x_j$ are members of the same class, $h_{ij} = 1$; otherwise, $h_{ij} = -1$. Let $f(x_i)$ denote the SIR of $x_i$ and $b_{\mathrm{SIR}}$ a distance threshold. In the pairwise comparison formulation, the distance between the SIRs of a positive pair is expected to be smaller than the threshold, whereas the distance for a negative pair is expected to be larger than the threshold. The Euclidean distance between the SIRs of every doublet $(x_i, x_j)$ must therefore meet the following constraints:
$$\begin{cases} \| f(x_i) - f(x_j) \|_2^2 \le b_{\mathrm{SIR}} - 1 + \xi_{ij}^P, & \text{if } h_{ij} = 1 \\ \| f(x_i) - f(x_j) \|_2^2 \ge b_{\mathrm{SIR}} + 1 - \xi_{ij}^P, & \text{if } h_{ij} = -1 \end{cases} \tag{7.42}$$
in which $\xi_{ij}^P$ is a nonnegative slack variable. Accordingly, the loss function of SIR learning is
$$L_{\mathrm{SIR}}^P = \sum_{i,j} \left[ 1 + h_{ij} \left( \| f(x_i) - f(x_j) \|_2^2 - b_{\mathrm{SIR}} \right) \right]_+ \tag{7.43}$$
where $[z]_+ = \max(z, 0)$. The CIR learning can be formulated as a binary classification problem, in which the CIR of any doublet $(x_i, x_j)$ should satisfy the constraints:
$$\begin{cases} \mathbf{w}^T g(x_i, x_j) \le b_{\mathrm{CIR}} - 1 + \zeta_{ij}^P, & \text{if } h_{ij} = 1 \\ \mathbf{w}^T g(x_i, x_j) \ge b_{\mathrm{CIR}} + 1 - \zeta_{ij}^P, & \text{if } h_{ij} = -1 \end{cases} \tag{7.44}$$
in which $b_{\mathrm{CIR}}$ is the threshold and $\zeta_{ij}^P$ is a nonnegative slack variable. We use the loss function of the standard SVM [67] to learn the CIR:
$$L_{\mathrm{CIR}}^P = \frac{\alpha_P}{2} \| \mathbf{w} \|_2^2 + \sum_{i,j} \left[ 1 + h_{ij} \left( \mathbf{w}^T g(x_i, x_j) - b_{\mathrm{CIR}} \right) \right]_+ \tag{7.45}$$
in which $\alpha_P$ is a trade-off parameter, and we set $\alpha_P = 0.0005$ in the experiments. By combining (7.43) and (7.45), the overall loss function of the pairwise comparison-based representation learning approach is
$$L^P = L_{\mathrm{SIR}}^P + \eta_P L_{\mathrm{CIR}}^P \tag{7.46}$$
where $\eta_P$ is a trade-off parameter, and we set $\eta_P = 1$ in the experiments.
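For clarity, a small NumPy sketch of the pairwise losses (7.43), (7.45), and (7.46) is given below. It is an illustration under our own naming and shapes, not the authors' implementation; in the actual model the SIRs and CIRs would come from the deep sub-networks rather than random arrays.

```python
import numpy as np

def hinge(z):
    """[z]_+ = max(z, 0), applied elementwise."""
    return np.maximum(z, 0.0)

def pairwise_losses(f_i, f_j, g_ij, h, w, b_sir, b_cir, alpha_p=0.0005, eta_p=1.0):
    """Pairwise comparison loss L^P = L^P_SIR + eta_P * L^P_CIR.

    f_i, f_j : (n, d) SIRs of the two images in each doublet
    g_ij     : (n, m) CIRs of the doublets
    h        : (n,) labels, +1 for matched pairs, -1 for mismatched pairs
    """
    dist = np.sum((f_i - f_j) ** 2, axis=1)             # ||f(x_i) - f(x_j)||_2^2
    l_sir = hinge(1.0 + h * (dist - b_sir)).sum()       # Eq. (7.43)
    score = g_ij @ w                                    # w^T g(x_i, x_j)
    l_cir = 0.5 * alpha_p * np.dot(w, w) \
            + hinge(1.0 + h * (score - b_cir)).sum()    # Eq. (7.45)
    return l_sir + eta_p * l_cir                        # Eq. (7.46)

# Toy call with random doublets.
rng = np.random.default_rng(0)
n, d, m = 6, 500, 1000
loss = pairwise_losses(rng.standard_normal((n, d)), rng.standard_normal((n, d)),
                       rng.standard_normal((n, m)), rng.choice([-1.0, 1.0], n),
                       rng.standard_normal(m), b_sir=1.0, b_cir=0.0)
print(loss)
```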
7.4.1.3 Triplet Comparison Formulation

The triplet comparison formulation is trained on a set of triplets $(x_i, x_j, x_k)$, in which $x_i$ and $x_j$ belong to the same class, while $x_i$ and $x_k$ belong to different classes. To make the distance between $x_i$ and $x_j$ smaller than the distance between $x_i$ and $x_k$, the SIR should meet the following constraint for each triplet $(x_i, x_j, x_k)$:
$$\| f(x_i) - f(x_k) \|_2^2 - \| f(x_i) - f(x_j) \|_2^2 \ge 1 - \xi_{ijk}^T \tag{7.47}$$
in which $\xi_{ijk}^T$ is a nonnegative slack variable. The loss function of SIR learning is then
$$L_{\mathrm{SIR}}^T = \sum_{i,j,k} \left[ 1 - \| f(x_i) - f(x_k) \|_2^2 + \| f(x_i) - f(x_j) \|_2^2 \right]_+ \tag{7.48}$$
The learning of the CIR can be formulated as a learning-to-rank problem, with the CIRs required to satisfy the following constraint:
$$\mathbf{w}^T g(x_i, x_k) - \mathbf{w}^T g(x_i, x_j) \ge 1 - \zeta_{ijk}^T \tag{7.49}$$
in which $\zeta_{ijk}^T$ is a nonnegative slack variable. The loss function of the RankSVM [68] is utilized for learning the CIR:
$$L_{\mathrm{CIR}}^T = \frac{\alpha_T}{2} \| \mathbf{w} \|_2^2 + \sum_{i,j,k} \left[ 1 - \mathbf{w}^T g(x_i, x_k) + \mathbf{w}^T g(x_i, x_j) \right]_+ \tag{7.50}$$
in which $\alpha_T$ is a trade-off parameter, and we set $\alpha_T = 0.0005$ in the experiments. By combining (7.48) and (7.50), the overall loss function of the triplet comparison-based learning approach is
$$L^T = L_{\mathrm{SIR}}^T + \eta_T L_{\mathrm{CIR}}^T \tag{7.51}$$
in which $\eta_T$ is a trade-off parameter, and we set $\eta_T = 1$ in the experiments.
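The triplet counterpart (7.48)–(7.51) can be sketched in the same way; the function below is again our own illustration, and its hinge terms follow the form consistent with constraints (7.47) and (7.49).

```python
import numpy as np

def triplet_losses(f_i, f_j, f_k, g_ij, g_ik, w, alpha_t=0.0005, eta_t=1.0):
    """Triplet comparison loss L^T = L^T_SIR + eta_T * L^T_CIR.

    In each triplet (x_i, x_j, x_k), x_j is matched with x_i and x_k is not.
    f_* are (n, d) SIRs, g_* are (n, m) CIRs, w is the (m,) RankSVM weight.
    """
    hinge = lambda z: np.maximum(z, 0.0)
    d_pos = np.sum((f_i - f_j) ** 2, axis=1)             # matched SIR distance
    d_neg = np.sum((f_i - f_k) ** 2, axis=1)             # mismatched SIR distance
    l_sir = hinge(1.0 - d_neg + d_pos).sum()             # Eq. (7.48)
    l_cir = 0.5 * alpha_t * np.dot(w, w) \
            + hinge(1.0 - g_ik @ w + g_ij @ w).sum()     # Eq. (7.50)
    return l_sir + eta_t * l_cir                         # Eq. (7.51)

# Toy call with random triplets.
rng = np.random.default_rng(0)
n, d, m = 4, 500, 1000
print(triplet_losses(rng.standard_normal((n, d)), rng.standard_normal((n, d)),
                     rng.standard_normal((n, d)), rng.standard_normal((n, m)),
                     rng.standard_normal((n, m)), rng.standard_normal(m)))
```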
7.4.1.4 Prediction

Both the SIR and the CIR are utilized for matching. For a given image pair $(x_i, x_j)$, we use the Euclidean distance $\| f(x_i) - f(x_j) \|_2^2$ as an indicator for the SIRs and $\mathbf{w}^T g(x_i, x_j)$ as an indicator for the CIRs. Accordingly, we use the combination of these two indicators:
$$S(x_i, x_j) = \| f(x_i) - f(x_j) \|_2^2 + \lambda \mathbf{w}^T g(x_i, x_j) \tag{7.52}$$
in which $\lambda$ is a trade-off parameter that can be determined by cross validation. We set $\lambda = 0.7$ in the pairwise comparison model and $\lambda = 1$ in the triplet comparison model in the experiments. $S(x_i, x_j)$ is compared with a threshold $t$ to determine whether the two images $x_i$ and $x_j$ are matched: $S(x_i, x_j) < t$ indicates that $x_i$ and $x_j$ are matched; otherwise, they are not. The matching scores of the learning models based on the pairwise and triplet comparison formulations, denoted by $S_P(x_i, x_j)$ and $S_T(x_i, x_j)$, respectively, are also combined. The combined matching score is $S_{P\&T}(x_i, x_j) = S_P(x_i, x_j) + \mu S_T(x_i, x_j)$, in which $\mu$ is a trade-off parameter set to $\mu = 0.5$ in the experiments.
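A minimal sketch of these scoring rules, with our own function names:

```python
import numpy as np

def match_score(f_i, f_j, g_ij, w, lam):
    """S(x_i, x_j) in (7.52); a smaller score means a more likely match."""
    return float(np.sum((f_i - f_j) ** 2) + lam * np.dot(w, g_ij))

def combined_score(s_pairwise, s_triplet, mu=0.5):
    """S_{P&T}(x_i, x_j) = S_P(x_i, x_j) + mu * S_T(x_i, x_j)."""
    return s_pairwise + mu * s_triplet

# A pair (x_i, x_j) is declared a match when the (combined) score falls
# below the decision threshold t.
```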
7.4.2 Deep Convolutional Neural Network

7.4.2.1 Network Architecture

Rather than employing hand-crafted image features, we use a deep CNN to jointly learn the SIRs and CIRs. For the pairwise comparison formulation, we must determine the SIRs ($f(x_i)$ and $f(x_j)$) and the CIR $g(x_i, x_j)$ for each image pair $(x_i, x_j)$. For the triplet comparison formulation, we learn the SIRs ($f(x_i)$, $f(x_j)$, and $f(x_k)$) and the CIRs ($g(x_i, x_j)$ and $g(x_i, x_k)$) for each image triplet $(x_i, x_j, x_k)$. The deep architectures of the pairwise and triplet comparison models are shown in Figs. 7.15 and 7.16, respectively. Each of these two networks is composed of a sub-network for SIR learning (green part), a sub-network for CIR learning (red part), and a sub-network shared by SIR and CIR learning (blue part). For each probe and gallery image, the CNN feature maps (yellow part) from the shared sub-network are calculated once, as is the SIR feature. To extract the CIR feature of each probe-gallery image pair, only the CIR learning sub-network is employed.

Shared Sub-network The sub-network shown in blue in Figs. 7.15 and 7.16 is shared by SIR and CIR learning. It is composed of two convolutional layers activated by rectified linear units (ReLU), each followed by a pooling layer. The first and second convolutional layers have kernel sizes of 5 × 5 and 3 × 3, respectively, with a stride of one pixel. The kernel sizes of the first and second pooling layers are set to 3 × 3 and 2 × 2, respectively.
Fig. 7.15 The proposed deep architecture of the pairwise comparison model (best viewed in color)
Fig. 7.16 The proposed deep architecture of the triplet comparison model (best viewed in color)
SIR Sub-network We learn the SIR $f(x_i)$ of the input image $x_i$ using the sub-network shown in the green part of Figs. 7.15 and 7.16. This sub-network is composed of one convolutional layer activated by ReLU, one pooling layer, and two fully connected layers. The convolutional layer and the pooling layer have kernel sizes of 3 × 3 and 2 × 2, respectively. The two fully connected layers have output dimensions of 1000 and 500, respectively. For the pairwise and triplet comparison models, the SIR is learned using two and three such sub-networks, respectively, which share the same parameters.

CIR Sub-network The sub-network shown in red in Figs. 7.15 and 7.16 is used to learn the CIR $g(x_i, x_j)$ of the input image pair $(x_i, x_j)$. This sub-network is composed of a convolutional layer activated by ReLU, a pooling layer, and a fully connected layer. The convolutional layer and the pooling layer have kernel sizes of 3 × 3 and 2 × 2, respectively. The fully connected layer's output dimension is 1000. The $q$th channel of the CNN feature map of $x_i$ from the shared sub-network is denoted by $\phi_q(x_i)$. When we extract the CIR of $(x_i, x_j)$, the CIR sub-network is fed with the shared sub-network's feature maps of $x_i$ and $x_j$. The cross-image feature map is computed by the first convolutional layer of the CIR sub-network as follows:
$$\varphi_r(x_i, x_j) = \max\left( 0,\; b_r + \sum_q \left( k_{q,r} * \phi_q(x_i) + l_{q,r} * \phi_q(x_j) \right) \right) \tag{7.53}$$
in which $\varphi_r(x_i, x_j)$ denotes the $r$th channel of the cross-image feature map, and $k_{q,r}$ and $l_{q,r}$ are the convolutional kernels connecting the $q$th channel of the shared sub-network feature maps to the $r$th channel of the cross-image feature map. A similar operation has also been utilized in [38].
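Equation (7.53) is exactly a pair of standard multi-channel convolutions whose outputs are summed, biased, and passed through a ReLU. The PyTorch snippet below expresses this; it is an illustrative re-implementation with made-up shapes, not the original Caffe layer.

```python
import torch
import torch.nn.functional as F

def cross_image_feature_map(phi_i, phi_j, k, l, b):
    """Cross-image feature map of Eq. (7.53).

    phi_i, phi_j : (N, Q, H, W) shared sub-network feature maps of x_i and x_j
    k, l         : (R, Q, kh, kw) convolution kernels applied to phi_i / phi_j
    b            : (R,) bias, one value per cross-image channel
    Returns a (N, R, H', W') tensor, one channel per r.
    """
    # Each output channel r sums the per-channel convolutions over q, which is
    # exactly what a standard multi-channel convolution computes.
    return F.relu(F.conv2d(phi_i, k) + F.conv2d(phi_j, l, bias=b))

# Toy shapes: 2 pairs, 32 shared channels, 3x3 kernels, 16 cross-image channels.
phi_i = torch.randn(2, 32, 40, 15)
phi_j = torch.randn(2, 32, 40, 15)
k = torch.randn(16, 32, 3, 3)
l = torch.randn(16, 32, 3, 3)
b = torch.randn(16)
print(cross_image_feature_map(phi_i, phi_j, k, l, b).shape)  # (2, 16, 38, 13)
```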
7.4.2.2 Network Training

The training procedure is divided into three stages: data preprocessing, doublet/triplet generation, and network training. As with the majority of deep models, back-propagation (BP) is used to train the proposed network. The first two stages are detailed below.

Data Preprocessing To make the model robust to variations in image translation, we randomly crop the input images prior to the training procedure. In our experiments, the original image size is 180 × 80 pixels. We crop the original image to 160 × 60 pixels by randomly selecting the cropped image center from [80, 100] × [30, 50]. Additionally, we enlarge the training set by constructing the horizontal mirror of each training image.

Doublet/Triplet Generation Based on Mini-batch Strategy Since the training set may be too large to load into memory, we partition it into several mini-batches. Following the strategy described in [35], we randomly
choose 80 classes from the training set for each iteration and create 60 doublets or triplets for each class. We can generate 4800 doublets or triplets using this method in each round of training. We employ all 4800 doublets or triplets in training for SIR learning. We randomly choose 100 doublets or triplets for CIR learning.
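A rough sketch of the preprocessing and triplet sampling just described is given below. The helper names are our own; the sampler assumes at least 80 distinct classes and at least two images per selected class. Doublets for the pairwise model can be derived analogously by splitting each triplet into a matched and a mismatched pair.

```python
import numpy as np

def random_crop_with_mirror(image, rng, out_h=160, out_w=60):
    """Crop a 180x80 image to 160x60 around a randomly chosen center drawn
    from [80, 100] x [30, 50], and mirror it horizontally with probability 0.5."""
    cy = rng.integers(80, 101)   # center row in [80, 100]
    cx = rng.integers(30, 51)    # center column in [30, 50]
    top, left = cy - out_h // 2, cx - out_w // 2
    crop = image[top:top + out_h, left:left + out_w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]     # horizontal mirror
    return crop

def sample_triplets(labels, rng, n_classes=80, per_class=60):
    """Draw 80 classes and build 60 (anchor, positive, negative) index triplets
    per class for one training iteration (4800 triplets in total)."""
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=n_classes, replace=False)
    triplets = []
    for c in classes:
        pos_idx = np.flatnonzero(labels == c)
        neg_idx = np.flatnonzero(labels != c)
        for _ in range(per_class):
            a, p = rng.choice(pos_idx, size=2, replace=False)
            triplets.append((a, p, rng.choice(neg_idx)))
    return triplets
```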
7.4.3 Experiments

In this section, our method is evaluated on three person re-identification datasets, i.e., CUHK03 [36],1 CUHK01 [69] (see footnote 1), and VIPeR [70].2 Our method is implemented based on the Caffe framework [71]. The momentum is set as $\gamma = 0.5$ and the weight decay is set as $\mu = 0.0005$. The network is trained for 150,000 iterations, which takes about 28–34 hours on an NVIDIA Tesla K40 GPU. The learning rates of the pairwise and triplet comparison models are $1 \times 10^{-3}$ and $3 \times 10^{-4}$ before the 100,000th iteration, respectively; after that, they are reduced to $1 \times 10^{-4}$ and $3 \times 10^{-5}$.
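The solver settings quoted above correspond, for the pairwise model, to something like the following sketch. It is written in PyTorch purely for illustration (the actual experiments used Caffe), and the model here is a stand-in.

```python
import torch

# Hypothetical stand-in model; only the optimizer settings mirror the ones
# quoted above (momentum 0.5, weight decay 0.0005, 10x step decay).
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.5, weight_decay=5e-4)
# Drop the learning rate by 10x after 100,000 iterations (pairwise model:
# 1e-3 -> 1e-4; the triplet model would use 3e-4 -> 3e-5 instead).
# Call scheduler.step() once per training iteration.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.1)
```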
7.4.3.1 CUHK03 Dataset

The CUHK03 dataset comprises 14,096 pedestrian images of 1467 individuals captured by two surveillance cameras [36]. Each individual has an average of 4.8 images, and each image is derived from one of five video clips. The dataset includes both manually cropped bounding boxes and bounding boxes automatically cropped by a pedestrian detector [72]. In this section, we conduct tests using the images cropped by the pedestrian detector. The identities in this dataset are randomly separated into non-overlapping training and test sets in accordance with the testing strategy described in [36]. The training set contains 1367 individuals, whereas the test set contains 100 individuals. Following this protocol, 20 partitions of training and test sets are created, and the reported cumulative matching characteristic (CMC) curves and accuracies are averaged over these 20 groups. For each person in the test set, we randomly choose one camera view to provide the probe image and one image from the other camera view for the gallery set. In this way, we construct 10 pairs of probe and gallery sets for testing, and the average over these 10 groups is reported as the result. The results reported on the CUHK03 dataset are based on the single-shot setting.

To begin with, we report the accuracies of different settings of the proposed pairwise and triplet comparison models in Table 7.17, presenting the matching accuracies obtained using the SIR, the CIR, and their combination for each of the pairwise and triplet comparison models. As can be seen from the results, SIR- and CIR-based matching provide comparable results. Their combination, however, achieves a higher accuracy than
1 http://www.ee.cuhk.edu.hk/~rzhao/.
2 http://vision.soe.ucsc.edu/projects.
Table 7.17 The rank-1 accuracies (%) of the proposed pairwise and triplet comparison models

Model:      Pairwise   Triplet   Combined
SIR         37.15      43.23     44.35
CIR         35.70      43.46     45.40
Combined    43.36      51.33     52.17

Table 7.18 The training times of the proposed pairwise and triplet comparison models

Model:      Pairwise   Triplet
SIR         24h18m     29h33m
SIR&CIR     28h25m     33h27m
Fig. 7.17 Rank-1 accuracy versus λ in the CUHK03 dataset (best viewed in color)
each of them alone. Triplet comparison models are more accurate than pairwise comparison models, and their combination surpasses either of them. Additionally, we report the training time of the proposed model in Table 7.18. In comparison to SIR learning alone, the proposed joint SIR and CIR learning model improves matching accuracy significantly while increasing training time only slightly. Second, we investigate the sensitivity of the rank-1 accuracy to the trade-off parameter λ in (7.52). Figure 7.17 shows the curves of rank-1 accuracy on the test set versus λ. It can be found that the pairwise and triplet comparison models reach their highest accuracies when λ = 0.7 and λ = 1, respectively. We then compare the performance of our method with that of other state-of-the-art methods, including the Euclidean distance, ITML [60], LMNN [61], metric learning to rank (RANK) [73], LDML [28], symmetry-driven accumulation of local features (SDALF) [74], eSDC [75], KISSME [30], FPNN [36], the work by Ahmed et al. [37], and LOMO+XQDA [41]. Figure 7.18 shows the CMC curves and the rank-1 accuracies of these methods. It can be found that the rank-1 accuracy of the proposed method reaches 52.17%, which is 5.92% higher than that of the second best method (LOMO+XQDA).

Fig. 7.18 The rank-1 accuracies and CMC curves of different methods on the CUHK03 dataset [37] (best viewed in color). Rank-1 accuracies shown in the legend: FPNN 19.89%, Euclidean 4.94%, ITML 5.14%, LMNN 6.25%, RANK 8.52%, LDML 10.92%, SDALF 4.87%, eSDC 7.68%, KISSME 11.70%, Ahmed et al. 44.96%, LOMO+XQDA 46.25%, Ours (Pairwise) 43.36%, Ours (Triplet) 51.33%, Ours (Combined) 52.17%
7.4.3.2 CUHK01 Dataset

The CUHK01 dataset contains 3884 pedestrian images of 971 individuals captured by two surveillance cameras; each individual has four images. The dataset has been randomly divided into 10 partitions of training and test sets, and the reported CMC curves and rank-1 accuracies are averaged over these 10 groups. Following the protocol in [37], we utilize the samples of 871 individuals for training and those of 100 individuals for testing. We pretrain the deep network for 100,000 iterations using the CUHK03 dataset, and then fine-tune the CNN for 50,000 iterations using the training set of CUHK01. In Fig. 7.19, we report the CMC curves and rank-1 accuracies of the proposed model (labeled "Ours (Pairwise/Triplet/Combined, Pretrain)") in comparison with other state-of-the-art person re-identification methods, including FPNN [36], the Euclidean distance, ITML [60], LMNN [61], RANK [73], LDML [28], SDALF [74], eSDC [75], KISSME [30], and the work by Ahmed et al. [37]. The proposed approach achieves a substantially higher rank-1 accuracy than the other competing methods. Additionally, we present the results obtained using the same setting as [37] without pre-training (labeled "Ours (Pairwise/Triplet/Combined)").
Fig. 7.19 The rank-1 accuracies and CMC curves of different methods on the CUHK01 dataset [37] (best viewed in color). Rank-1 accuracies shown in the legend: FPNN 27.87%, Euclidean 10.52%, ITML 17.10%, LMNN 21.17%, RANK 20.61%, LDML 26.45%, SDALF 9.90%, eSDC 22.83%, KISSME 29.40%, Ahmed et al. 65.00%, Ours (Pairwise) 55.69%, Ours (Triplet) 63.64%, Ours (Combined) 65.38%, Ours (Pairwise, Pretrain) 65.50%, Ours (Triplet, Pretrain) 71.00%, Ours (Combined, Pretrain) 72.50%
In this setting, the proposed method's rank-1 accuracy is much higher than that of the majority of competing approaches and is comparable to that of [37].
7.4.3.3 VIPeR Dataset

The VIPeR dataset contains 1264 images of 632 persons [70], captured from two camera views. 316 persons are randomly selected for training, and the remaining 316 persons are used for testing. For each person in the test set, we randomly select one camera view as the probe set, and the other camera view is utilized as the gallery set. Following the testing protocol in [37], we pretrain the CNN using the CUHK03 and CUHK01 datasets and fine-tune the network on the training set of VIPeR. As shown in Fig. 7.20, we compare the CMC curves and rank-1 accuracies of local Fisher discriminant analysis (LF) [59], pairwise constrained component analysis (PCCA) [31], aPRDC [76], PRDC [63], enriched BiCov (eBiCov) [77], PRSVM [78], ELF [79], saliency matching (SalMatch) [39], patch matching (PatMatch) [39], the locally adaptive decision function (LADF) [32], mid-level filters (mFilter) [80], visWord [81], the work by Ahmed et al. [37], and the proposed model. Our method outperforms most of the other baseline methods except mFilter [80]+LADF [32], which is the combination of two methods.
Fig. 7.20 The rank-1 accuracies and CMC curves of different methods on the VIPeR dataset [37] (best viewed in color). Rank-1 accuracies shown in the legend: mFilter + LADF 43.39%, mFilter 29.11%, SalMatch 30.16%, PatMatch 26.90%, LADF 29.34%, LF 24.18%, KISSME 19.60%, PCCA 19.27%, aPRDC 16.14%, PRDC 15.66%, eSDC 26.31%, eBiCov 20.66%, SDALF 19.87%, visWord 30.70%, PRSVM 14.00%, ELF 12.00%, Ahmed et al. 34.81%, Ours (Pairwise) 29.75%, Ours (Triplet) 35.13%, Ours (Combined) 35.76%
7.4.4 Conclusion

In this section, we present a method for person re-identification based on joint SIR and CIR learning. Due to the efficiency of the SIR in matching and the effectiveness of the CIR in modeling the connection between probe and gallery images, we combine their losses to make use of the advantages of both representations. For joint SIR and CIR learning, we provide a pairwise comparison formulation and a triplet comparison formulation, and we construct a deep neural network for each of these two models in order to jointly learn the SIR and CIR. Experimental results validate the efficacy of joint SIR and CIR learning, and the proposed method outperforms most state-of-the-art models on the CUHK03, CUHK01, and VIPeR datasets. In the future, we will explore other methods for integrating SIR and CIR learning (e.g., explicit modeling of patch correspondence), as well as model-level fusion using pairwise and triplet comparisons.
References

1. Gionis A, Indyk P, Motwani R, et al. Similarity search in high dimensions via hashing. In: VLDB, 1999. vol. 99, p. 518–29.
2. Kulis B, Grauman K. Kernelized locality-sensitive hashing for scalable image search. In: 2009 IEEE 12th international conference on computer vision. IEEE, 2009. p. 2130–7.
3. Broder AZ. On the resemblance and containment of documents. In: Proceedings of compression and complexity of sequences 1997. IEEE, 1997. p. 21–9.
4. Charikar MS. Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, 2002. p. 380–8.
5. Datar M, Immorlica N, Indyk P, Mirrokni VS. Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on computational geometry. ACM, 2004. p. 253–62.
6. Jin Z, Li C, Lin Y, Cai D. Density sensitive hashing. IEEE Trans Cybern. 2014;44(8):1362–71.
7. Weiss Y, Torralba A, Fergus R. Spectral hashing. In: Advances in neural information processing systems, 2009. p. 1753–60.
8. Heo J-P, Lee Y, He J, Chang S-F, Yoon S-E. Spherical hashing. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, 2012. p. 2957–64.
9. Jiang Q-Y, Li W-J. Scalable graph hashing with feature transformation. In: IJCAI, 2015. p. 2248–54.
10. Liu W, Mu C, Kumar S, Chang S-F. Discrete graph hashing. In: Advances in neural information processing systems, 2014. p. 3419–27.
11. Gong Y, Lazebnik S, Gordo A, Perronnin F. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans Pattern Anal Mach Intell. 2013;35(12):2916–29.
12. Kong W, Li W-J. Double-bit quantization for hashing. In: AAAI, 2012. vol. 1, p. 5.
13. Liu W, Wang J, Ji R, Jiang Y-G, Chang S-F. Supervised hashing with kernels. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, 2012. p. 2074–81.
14. Lin G, Shen C, Shi Q, Van den Hengel A, Suter D. Fast supervised hashing with decision trees for high-dimensional data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014. p. 1963–70.
15. Shen F, Shen C, Liu W, Shen HT. Supervised discrete hashing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015. p. 37–45.
16. Kang W-C, Li W-J, Zhou Z-H. Column sampling based discrete supervised hashing. In: AAAI, 2016. p. 1230–6.
17. Liong VE, Lu J, Wang G, Moulin P, Zhou J. Deep hashing for compact binary codes learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015. p. 2475–83.
18. Zhang R, Lin L, Zhang R, Zuo W, Zhang L. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans Image Process. 2015;24(12):4766–79.
19. Zhu H, Long M, Wang J, Cao Y. Deep hashing network for efficient similarity retrieval. In: AAAI, 2016. p. 2415–21.
20. Liu H, Wang R, Shan S, Chen X. Deep supervised hashing for fast image retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. p. 2064–72.
21. Xia R, Pan Y, Lai H, Liu C, Yan S. Supervised hashing for image retrieval via image representation learning. In: AAAI, vol. 1. 2014. p. 2156–62.
22. Li W-J, Wang S, Kang W-C. Feature learning based deep supervised hashing with pairwise labels. Preprint, arXiv:1511.03855. 2015.
23. Shen F, Gao X, Liu L, Yang Y, Shen HT. Deep asymmetric pairwise hashing, 2017.
24. Jiang Q-Y, Li W-J. Asymmetric deep supervised hashing. Preprint, arXiv:1707.08325. 2017.
25. Gong S, Cristani M, Yan S, Loy CC, editors. Person re-identification. Springer, 2014.
26. Wang X. Intelligent multi-camera video surveillance: a review. Pattern Recognit Lett. 2013;34:3–19.
27. Vezzani R, Baltieri D, Cucchiara R. People reidentification in surveillance and forensics: a survey. ACM Comput Surv. 2013;46(2):29.
28. Guillaumin M, Verbeek J, Schmid C. Is that you? Metric learning approaches for face identification. In: ICCV, 2009.
29. Hirzer M, Roth PM, Köstinger M, Bischof H. Relaxed pairwise learned metric for person re-identification. In: ECCV, 2012.
30. Köstinger M, Hirzer M, Wohlhart P, Roth PM, Bischof H. Large scale metric learning from equivalence constraints. In: CVPR, 2012.
31. Mignon A, Jurie F. PCCA: a new approach for distance learning from sparse pairwise constraints. In: CVPR, 2012.
32. Li Z, Chang S, Liang F, Huang TS, Cao L, Smith JR. Learning locally-adaptive decision functions for person verification. In: CVPR, 2013.
33. Li W, Wang X. Locally aligned feature transforms across views. In: CVPR, 2013.
34. Martinel N, Micheloni C, Foresti GL. Saliency weighted features for person re-identification. In: ECCV workshop on visual surveillance and re-identification, 2014.
35. Ding S, Lin L, Wang G, Chao H. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognit. 2015;48(10):2993–3003.
36. Li W, Zhao R, Xiao T, Wang X. DeepReID: deep filter pairing neural network for person re-identification. In: CVPR, 2014.
37. Ahmed E, Jones M, Marks TK. An improved deep learning architecture for person reidentification. In: CVPR, 2015.
38. Liu S, Liang X, Liu L, Shen X, Yang J, Xu C, Lin L, Cao X, Yan S. Matching-CNN meets KNN: quasi-parametric human parsing. In: CVPR, 2015.
39. Zhao R, Ouyang W, Wang X. Person re-identification by salience matching. In: ICCV, 2013.
40. Xiong F, Gou M, Camps O, Sznaier M. Person re-identification using kernel-based metric learning methods. In: ECCV, 2014.
41. Liao S, Hu Y, Zhu X, Li SZ. Person re-identification by local maximal occurrence representation and metric learning. In: CVPR, 2015.
42. Liao S, Li SZ. Efficient PSD constrained asymmetric metric learning for person reidentification. In: ICCV, 2015.
43. Yi D, Lei Z, Liao S, Li SZ. Deep metric learning for person re-identification. In: ICPR, 2014.
44. Chen SZ, Guo CC, Lai JH. Deep ranking for person re-identification via joint representation learning. arXiv: 1505.0682. 2015.
45. Zhou X, Shang Y, Yan H, Guo G. Ensemble similarity learning for kinship verification from facial images in the wild. Inf Fusion 2016;32:40–48.
46. Li J, Zhang B, Lu G, Zhang D. Dual asymmetric deep hashing learning. IEEE Access 2019;7:113372–84.
47. Da C, Xu S, Ding K, Meng G, Xiang S, Pan C. AMVH: Asymmetric multi-valued hashing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017. p. 736–44.
48. Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: delving deep into convolutional nets. Preprint, arXiv:1405.3531. 2014.
49. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012. p. 1097–105.
50. Jiang Q-Y, Li W-J. Deep cross-modal hashing. Preprint, arXiv:1602.02255. 2016.
51. Escalante HJ, Hernández CA, Gonzalez JA, López-López A, Montes M, Morales EF, Enrique Sucar L, Villaseñor L, Grubinger M. The segmented and annotated IAPR TC-12 benchmark. Comput Vis Image Underst. 2010;114(4):419–28.
52. Huiskes MJ, Lew MS. The MIR flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM, 2008. p. 39–43.
53. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images, 2009.
54. Siagian C, Itti L. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Trans Pattern Anal Mach Intell. 2007;29(2):300–12.
55. Vedaldi A, Lenc K. MatConvNet: convolutional neural networks for MATLAB. In: Proceedings of the 23rd ACM international conference on multimedia. ACM, 2015. p. 689–92.
56. Cao Y, Long M, Wang J, Yu PS. Correlation hashing network for efficient cross-modal retrieval. Preprint, arXiv:1602.06697. 2016.
57. He Y, Xiang S, Kang C, Wang J, Pan C. Cross-modal retrieval via deep and bidirectional representation learning. IEEE Trans Multimedia 2016;18(7):1363–77.
58. Li J, Zhang B, Lu G, You J, Xu Y, Wu F, Zhang D. Relaxed asymmetric deep hashing learning: point-to-angle matching. IEEE Trans Neural Netw Learn Syst. 2019;31(11):4791–805.
59. Pedagadi S, Orwell J, Velastin S, Boghossian B. Local fisher discriminant analysis for pedestrian re-identification. In: CVPR, 2013.
60. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS. Information-theoretic metric learning. In: ICML, 2007.
61. Weinberger KQ, Blitzer J, Saul LK. Distance metric learning for large margin nearest neighbor classification. In: NIPS, 2005.
62. Dikmen M, Akbas E, Huang TS, Ahuja N. Pedestrian recognition with a learned metric. In: ACCV, 2010.
63. Zheng WS, Gong S, Xiang T. Person re-identification by probabilistic relative distance comparison. In: CVPR, 2011.
64. Schroff F, Kalenichenko D, Philbin J. FaceNet: a unified embedding for face recognition and clustering. In: CVPR, 2015.
65. Chen D, Cao X, Wang L, Wen F, Sun J. Bayesian face revisited: a joint formulation. In: ECCV, 2012.
66. Wang F, Zuo W, Lin L, Zhang D, Zhang L. Joint learning of single-image and cross-image representations for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. p. 1288–96.
67. Vapnik VN. The nature of statistical learning theory. 2nd ed. Springer, 2000.
68. Prosser B, Zheng WS, Gong S, Xiang T. Person re-identification by support vector ranking. In: BMVC, 2010.
69. Li W, Zhao R, Wang X. Human re-identification with transferred metric learning. In: ACCV, 2012.
70. Gray D, Tao H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In: ECCV, 2008.
71. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: convolutional architecture for fast feature embedding. In: ACM international conference on multimedia, 2014.
72. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 2010;32(9):1627–45.
73. Mcfee B, Lanckriet G. Metric learning to rank. In: ICML, 2010.
74. Farenzena M, Bazzani L, Perina A, Murino V, Cristani M. Person re-identification by symmetry-driven accumulation of local features. In: CVPR, 2010.
75. Zhao R, Ouyang W, Wang X. Unsupervised salience learning for person re-identification. In: CVPR, 2013.
76. Liu C, Gong S, Loy CC, Lin X. Person re-identification: What features are important? In: ECCV workshops and demonstrations, 2012.
77. Ma B, Su Y, Jurie F. BiCov: a novel image representation for person re-identification and face verification. In: BMVC, 2012.
78. Bazzani L, Cristani M, Perina A, Murino V. Multiple-shot person re-identification by chromatic and epitomic analyses. Pattern Recognit Lett. 2012;33(7):898–903.
79. Gheissari N, Sebastian TB, Hartley R. Person reidentification using spatiotemporal appearance. In: CVPR, 2006.
80. Zhao R, Ouyang W, Wang X. Learning mid-level filters for person re-identification. In: CVPR, 2014.
81. Zhang Z, Chen Y, Saligrama V. A novel visual word co-occurrence model for person reidentification. In: ECCV workshop on visual surveillance and re-identification, 2014.
Chapter 8
Conclusion
In this book, we focus on information fusion and propose several machine learning and deep learning based methods for pattern recognition. In Chap. 2, considering the similar and specific parts existing across different views, JSSL is proposed to effectively and sparsely divide the representation coefficients into similar and diverse ones. In this case, a balance between similarity and distinctiveness among all views is achieved, which leads to a more stable and accurate representation for classification tasks. Due to the large time cost of sparse representation based methods, we also propose a collaborative representation based fusion method, RCR, which is capable of obtaining a closed-form solution, greatly decreasing the time consumption. To further exploit the discriminative information existing in different classes, we extend RCR to JDCR by embedding a discriminative regularization. In Chap. 3, to tackle the non-linearity existing in real-world data, SAGP is presented by introducing the auto-encoder and GPLVM to learn a shared variable on the manifold. Thanks to these two structures, we can not only represent the data in a non-linear and smooth way but also obtain the latent variable corresponding to a testing sample in a simple manner. To address the limitations of SAGP in covariance construction and classifier learning, we further extend SAGP to MKSGP by jointly taking multi-kernel learning and the large margin prior into account. Compared with SAGP, MKSGP is more powerful in data representation and more adaptive for classifier learning. Considering that SAGP and MKSGP only summarize the various covariance matrices in the encoder, we extend MKSGP to SLEMKGP by introducing a linear projection for each view, obtaining the covariance of the encoder more reasonably. In Chap. 4, considering multi-view and multi-feature data, we also propose two probabilistic models, MVMFL and HMMF, to hierarchically fuse multiple views and multiple features, fully exploiting the correlation among them. For MVMFL, a mapping matrix is estimated for each feature from a view to extract the relationship between the multi-view and multi-feature data. For HMMF, a shared latent variable is first fused from the observed features of a view or modality.
These learned variables associated with different views are then assumed to be independently influenced by their ground-truth label. In Chap. 5, two metric learning based fusion methods are proposed. The first one obtains a metric swarm by simultaneously learning local patch-like sub-metrics, which naturally formulates a generalized metric swarm learning (GMSL) model with a joint similarity score function. The second one combines distance and similarity measures (CDSM), which also achieves an improvement in classification. In Chap. 6, two adaptive weighted classifier score fusion methods for classification are proposed. We first design an adaptive weighted fusion approach, which automatically determines the optimal weights without any manual setting. Furthermore, since real-world data often follow complex distributions that a linear representation is incapable of modeling, we also propose a fusion method based on adaptive kernel selection. In Chap. 7, two end-to-end deep learning structures are presented for image retrieval. Both methods utilize two branches of deep networks to simultaneously learn the deep features of the same input, which are then fused into a shared one. Specifically, we first propose DADH, which utilizes dual network structures and an asymmetric hashing loss to learn better discrete codes. We then extend DADH to a relaxed version by transforming it from point-to-point matching to point-to-angle matching, which achieves competitive performance in image retrieval.
Index

Adaptive weighted fusion approach (AWFA), 177
Alternating Directions Method (ADM), 16
Asymmetric supervised deep hashing method, 7
Augmented Lagrangian Multiplier (ALM), 18
Auto-encoder, 5
Bayesian statistics, 4
Binary code, 8
Canonical correlation analysis (CCA), 3
CIFAR-10, 207
Classifiers, 1
Closed-form solution, 5
Collaborative representation, 3
Combined distance and similarity measure (CDSM), 7
Computer vision, 1
Convolutional neural networks (CNNs), 4
Data representation, 5
Decoder, 52
Deep learning, 1
Dictionary, 15
Discrete, 198
Discriminative, 3
Diversity, 5
Dual Asymmetric Deep Hashing learning (DADH), 7
Euclidean distance, 2
Expectation Maximization (EM), 7
Face recognition, 1
Face verification, 132
Gaussian Process, 3
Gaussian Process Latent Variable Model (GPLVM), 5
Generalized metric swarm learning (GMSL), 7
Generative, 6
Gradient descent, 52
Hamming distance, 2
Hashing learning, 7
Hierarchical Multi-view Multi-feature Fusion (HMMF), 7
Hierarchical structure, 6
IAPR TC-12, 207
ImageNet, 208
Image restoration, 1
Image retrieval, 7
Information fusion, 1
Joint discriminant and collaborative representation (JDCR), 3
Joint Similar and Specific Learning (JSSL), 5
Joint sparse representation, 3
Kernel, 3
K-nearest neighbor (KNN), 1
Large margin, 4
Latent Variable Model, 3
Linear discriminative analysis (LDA), 3
Machine learning, 2
Mahalanobis distance, 2
Mean average precision, 207
Metric learning, 2
MIRFLICKR-25K, 207
Multi-feature, 2
Multi-kernel, 6
Multi-Kernel Shared Gaussian Process latent variable model (MKSGP), 6
Multi-modal, 2
Multi-view, 2
Multi-view and multi-feature learning (MVMFL), 7
Neural networks, 4
Nonlinear, 4
Object categorization, 34
Optimization algorithm, 4
Pattern recognition, 1
Precision-Recall, 208
Radial Basis Function (RBF), 6
Relaxed Asymmetric Deep Hashing learning (RADH), 8
Relaxed Collaborative Representation (RCR), 5
Representation coefficient, 14
Restricted Boltzmann machines (RBMs), 4
Score fusion, 2
Semantic affinity, 8
Shared Auto-encoder Gaussian Process latent variable model (SAGP), 5
Shared Linear Encoder based Multi-Kernel Gaussian Process latent variable model (SLEMKGP), 6
Similarity, 2
Single view, 1
Sparse representation, 1
Sparse representation classifier (SRC), 1
Subspace learning, 3
Support vector machine (SVM), 1
Triplet loss, 4