Studies in Computational Intelligence Volume 1082
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, selforganizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
Avik Hati · Rajbabu Velmurugan · Sayan Banerjee · Subhasis Chaudhuri
Image Co-segmentation
Avik Hati
Department of Electronics and Communication Engineering, National Institute of Technology Tiruchirappalli, Tiruchirappalli, Tamilnadu, India

Rajbabu Velmurugan
Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, Maharashtra, India

Sayan Banerjee
Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, Maharashtra, India

Subhasis Chaudhuri
Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, Maharashtra, India
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-981-19-8569-0 ISBN 978-981-19-8570-6 (eBook) https://doi.org/10.1007/978-981-19-8570-6 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
Image segmentation is a classical and well-known problem in image processing, where an image is partitioned into non-overlapping regions. Such regions may be objects or meaningful parts of a scene. It is usually a challenging task to perform image segmentation and automatically extract objects without high-level knowledge of the object category. Instead, if we have two or more images containing a common object of interest, jointly segmenting the images to obtain the common object will help in automating the segmentation process. This is referred to as the problem of image co-segmentation. This monograph explores several approaches to perform robust co-segmentation of images.

The problem of co-segmentation is not as well researched as segmentation. For us, the motivation for understanding image co-segmentation arose from considering the problem of identifying videos with similar content and also retrieving images by searching for a similar image, even before deep learning became popular. We realized that earlier approaches had considered the saliency of an object in an image as one of the cues for co-segmentation. However, realizing various restrictive issues with this approach, we started exploring other methods that can perform robust co-segmentation. We believe that a good representation for the foreground and background in an image is essential, and hence use a graph representation for the images, which helped in both unsupervised and supervised approaches. This way we could use and extend graph matching algorithms that can be made more robust. This could also be done in the deep neural network framework, extending the strength of the model to supervised approaches. Given that graph-based approaches for co-segmentation have not been sufficiently explored in the literature, we decided to bring out this monograph on co-segmentation.

In this monograph, we present several methods for co-segmentation that were developed over a period of seven years. Most of these methods use the power of superpixels to represent images and graphs to represent the connectedness among them. Such representations could exploit efficient graph matching algorithms that could lead to co-segmentation. However, there were several challenges in developing such algorithms, which are brought out in the chapters of this monograph. The challenges both in formulating and implementing such algorithms are illustrated with analytical
and experimental results. In the unsupervised framework, one of the analytical challenges relates to statistical mode detection in a multidimensional feature space. While a solution is discussed in the monograph, this is one of the problems still considered a challenge in machine learning algorithms.

After presenting unsupervised approaches, we present supervised approaches to solve the problem of co-segmentation. These methods lead to better performance with sufficiently large labeled datasets of images. However, with fewer images, these methods do not perform well. Hence, in the monograph, we present some recent techniques, such as few-shot learning, to address the problem of having access to only a few samples during training for co-segmentation. Most of the methods presented address the problem of co-segmenting a single object across multiple images; the problem of co-segmenting multiple objects across multiple images is still challenging. We believe the approaches presented in this monograph will help researchers address such co-segmentation problems in less constrained settings.

Most of the methods presented are good references for practicing researchers. In addition, the primary target group for this monograph is graduate students in electrical engineering, computer science, or mathematics who have an interest in image processing and machine learning. Since co-segmentation is a natural extension of segmentation, the monograph briefly describes topics that would be required for a smooth transition from segmentation problems. The later chapters in the monograph will be useful for students in the area of machine learning, including a method for data-deprived settings. Overall, the chapters can help practitioners to consider the use of co-segmentation in developing efficient image or video retrieval algorithms. We strongly believe that the monograph will be useful to its readers and welcome any suggestions or comments.

Mumbai, India
July 2022
Avik Hati Rajbabu Velmurugan Sayan Banerjee Subhasis Chaudhuri
Acknowledgements

The authors would like to acknowledge partial support provided by National Centre for Excellence in Internal Security (NCETIS), IIT Bombay. Funding support in the form of JC Ghosh Fellowship to the last author is also gratefully acknowledged. We also acknowledge the contributions of Dr. Feroz Ali and Divakar Bhat in developing some of the methods discussed in this monograph. The authors thank the publisher for accommodating our requests and supporting the development of this monograph. We also thank our families for their support throughout this endeavor.
Contents

1 Introduction ..... 1
1.1 Image Co-segmentation ..... 2
1.2 Image Saliency and Co-saliency ..... 7
1.3 Basic Components of Co-segmentation ..... 10
1.3.1 The Problem ..... 12
1.4 Organization of the Monograph ..... 14
1.4.1 Co-segmentation of an Image Pair ..... 14
1.4.2 Robust Co-segmentation of Multiple Images ..... 16
1.4.3 Co-segmentation by Superpixel Classification ..... 16
1.4.4 Co-segmentation by Graph Convolutional Neural Network ..... 18
1.4.5 Conditional Siamese Convolutional Network ..... 18
1.4.6 Co-segmentation in Few-Shot Setting ..... 18

2 Survey of Image Co-segmentation ..... 21
2.1 Unsupervised Co-segmentation ..... 21
2.1.1 Markov Random Field Model-Based Methods ..... 21
2.1.2 Saliency-Based Methods ..... 22
2.1.3 Other Co-segmentation Methods ..... 23
2.2 Supervised Co-segmentation ..... 25
2.2.1 Semi-supervised Methods ..... 25
2.2.2 Deep Learning-Based Methods ..... 26
2.3 Co-segmentation Datasets ..... 27

3 Mathematical Background ..... 29
3.1 Superpixel Segmentation ..... 29
3.2 Label Propagation ..... 32
3.2.1 Two-class Label Propagation ..... 33
3.2.2 Multiclass Label Propagation ..... 34
3.3 Subgraph Matching ..... 36
3.4 Convolutional Neural Network ..... 44
3.4.1 Nonlinear Activation Functions ..... 45
3.4.2 Pooling in CNN ..... 47
3.4.3 Regularization Methods ..... 48
3.4.4 Loss Functions ..... 50
3.4.5 Optimization Methods ..... 51
3.5 Graph Convolutional Neural Network ..... 53
3.6 Variational Inference ..... 55
3.7 Few-shot Learning ..... 57

4 Maximum Common Subgraph Matching ..... 59
4.1 Introduction ..... 59
4.1.1 Problem Formulation ..... 59
4.2 Co-segmentation for Two Images ..... 60
4.2.1 Image as Attributed Region Adjacency Graph ..... 60
4.2.2 Maximum Common Subgraph Computation ..... 62
4.2.3 Region Co-growing ..... 65
4.2.4 Common Background Elimination ..... 71
4.3 Multiscale Image Co-segmentation ..... 72
4.4 Experimental Results ..... 73
4.5 Extension to Co-segmentation of Multiple Images ..... 81

5 Maximally Occurring Common Subgraph Matching ..... 89
5.1 Introduction ..... 89
5.2 Problem Formulation ..... 90
5.2.1 Mathematical Definition ..... 90
5.2.2 Multi-image Co-segmentation Problem ..... 91
5.2.3 Overview of the Method ..... 92
5.3 Superpixel Clustering ..... 93
5.3.1 Feature Computation ..... 94
5.3.2 Coarse-level Co-segmentation ..... 94
5.3.3 Hole Filling ..... 98
5.4 Common Object Detection ..... 99
5.4.1 Latent Class Graph ..... 100
5.4.2 Region Growing ..... 103
5.5 Experimental Results ..... 110
5.5.1 Quantitative and Qualitative Analysis ..... 110
5.5.2 Multiple Class Co-segmentation ..... 115
5.5.3 Computation Time ..... 121

6 Co-segmentation Using a Classification Framework ..... 123
6.1 Introduction ..... 123
6.1.1 Problem Definition ..... 123
6.2 Co-segmentation Algorithm ..... 127
6.2.1 Mode Estimation in a Multidimensional Distribution ..... 127
6.2.2 Discriminative Space for Co-segmentation ..... 130
6.2.3 Spatially Constrained Label Propagation ..... 136
6.3 Experimental Results ..... 142
6.3.1 Quantitative and Qualitative Analyses ..... 142
6.3.2 Ablation Study ..... 143
6.3.3 Analysis of Discriminative Space ..... 146
6.3.4 Computation Time ..... 148

7 Co-segmentation Using Graph Convolutional Network ..... 151
7.1 Introduction ..... 151
7.2 Co-segmentation Framework ..... 152
7.2.1 Global Graph Computation ..... 152
7.3 Graph Convolution-Based Feature Computation ..... 154
7.3.1 Graph Convolution Filters ..... 156
7.3.2 Analysis of Filter Outputs ..... 157
7.4 Network Architecture ..... 158
7.4.1 Network Training and Testing Strategy ..... 160
7.5 Experimental Results ..... 162
7.5.1 Internet Dataset ..... 162
7.5.2 PASCAL-VOC Dataset ..... 163

8 Conditional Siamese Convolutional Network ..... 167
8.1 Introduction ..... 167
8.2 Co-segmentation Framework ..... 171
8.2.1 Conditional Siamese Encoder-Decoder Network ..... 171
8.2.2 Siamese Metric Learning Network ..... 173
8.2.3 Decision Network ..... 174
8.2.4 Loss Function ..... 174
8.2.5 Training Strategy ..... 175
8.3 Experimental Results ..... 176
8.3.1 PASCAL-VOC Dataset ..... 176
8.3.2 Internet Dataset ..... 178
8.3.3 MSRC Dataset ..... 178
8.3.4 Ablation Study ..... 181

9 Few-shot Learning for Co-segmentation ..... 185
9.1 Introduction ..... 185
9.2 Co-segmentation Framework ..... 186
9.2.1 Class Agnostic Meta-Learning ..... 187
9.2.2 Directed Variational Inference Cross-Encoder ..... 192
9.3 Network Architecture ..... 193
9.3.1 Encoder-Decoder ..... 193
9.3.2 Channel Attention Module (ChAM) ..... 194
9.3.3 Spatial Attention Module (SpAM) ..... 194
9.4 Experimental Results ..... 194
9.4.1 PMF Implementation Details ..... 195
9.4.2 Performance Analysis ..... 195
9.4.3 Ablation Study ..... 199

10 Conclusions ..... 205
10.1 Future Work ..... 213

References ..... 215
About the Authors
Avik Hati is currently an Assistant Professor at National Institute of Technology Tiruchirappalli, Tamilnadu. He received his B.Tech. degree in Electronics and Communication Engineering from Kalyani Government Engineering College, West Bengal, in 2010 and M.Tech. degree in Electronics and Electrical Engineering from the Indian Institute of Technology Guwahati in 2012. He received his Ph.D. degree in Electrical Engineering from the Indian Institute of Technology Bombay in 2018. He was a Postdoctoral Researcher at the Pattern Analysis and Computer Vision Department of Istituto Italiano di Tecnologia, Genova, Italy. He was an Assistant Professor at Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar from 2020 to 2022. He joined National Institute of Technology Tiruchirappalli in 2022. His research interests include image and video co-segmentation, subgraph matching, saliency detection, scene analysis, robust computer vision, and adversarial machine learning.

Rajbabu Velmurugan is a Professor in the Department of Electrical Engineering, Indian Institute of Technology Bombay. He received his Ph.D. in Electrical and Computer Engineering from Georgia Institute of Technology, USA, in 2007. He was in L&T, India, from 1995 to 1996 and in the MathWorks, USA, from 1998 to 2001. He joined IIT Bombay in 2007. His research interests are broadly in signal processing and inverse problems with application in image and audio processing, such as blind deconvolution and source separation, low-level image processing and video analysis, speech enhancement using multi-microphone arrays, and developing efficient hardware systems for signal processing applications.

Sayan Banerjee received his B.Tech. degree in Electrical Engineering from the West Bengal University of Technology, India, in 2012 and M.E. degree in Electrical Engineering from Jadavpur University, Kolkata, in 2015. Currently, he is completing doctoral studies at the Indian Institute of Technology Bombay. His research areas include image processing, computer vision, and machine learning.
Prof. Subhasis Chaudhuri received his B.Tech. degree in Electronics and Electrical Communication Engineering from the Indian Institute of Technology Kharagpur in 1985. He received his M.Sc. and Ph.D. degrees, both in Electrical Engineering, from the University of Calgary, Canada, and the University of California, San Diego, respectively. He joined the Department of Electrical Engineering at the Indian Institute of Technology Bombay, Mumbai, in 1990 as an Assistant Professor and is currently serving as K. N. Bajaj Chair Professor and Director of the Institute. He has also served as Head of the Department, Dean (International Relations), and Deputy Director. He has also served as a Visiting Professor at the University of Erlangen-Nuremberg, Technical University of Munich, University of Paris XI, Hong Kong Baptist University, and National University of Singapore. He is a Fellow of IEEE and the science and engineering academies in India. He is a Recipient of the Dr. Vikram Sarabhai Research Award (2001), the Swarnajayanti Fellowship (2003), the S. S. Bhatnagar Prize in engineering sciences (2004), GD Birla Award (2010), and the ACCS Research Award (2021). He is Co-author of the books Depth from Defocus: A Real Aperture Imaging Approach, Motion-Free Super-Resolution, Blind Image Deconvolution: Methods and Convergence, and Kinesthetic Perception: A Machine Learning Approach, all published by Springer, New York (NY). He is an Associate Editor for the International Journal of Computer Vision. His primary areas of research include image processing and computational haptics.
Acronyms
CNN: Convolutional neural network
GCN: Graph convolutional network
LCG: Latent class graph
LDA: Linear discriminant analysis
MCS: Maximum common subgraph
MOCS: Maximally occurring common subgraph
RAG: Region adjacency graph
RCG: Region co-growing
x: Vector
X: Matrix
I: Image
n1 × n2: Image dimension
N: Number of images
C: Cluster
K: Number of clusters
σ: Standard deviation or sigmoid function (depending on context)
λ: Eigenvalue
H: Histogram
P: Probability or positive set
p(·): Probability density function
χ: Regularizer in cost function
t: Threshold
s, r: Superpixels or regions
f, x: Feature vectors
d(·): Feature distance
S(·): Feature similarity function
S: Feature similarity matrix
G = (V, E): Graph
V: Set of nodes in a graph
u, v: Nodes in a graph
E: Set of edges in a graph
e: Edge in a graph
H: Subgraph
V_H: Set of nodes in a subgraph
Ḡ: Set of graphs
W: Product graph
U_W: Set of nodes in a product graph
N(·): Neighborhood
F: Foreground object
O(·): Order of computation
Q: Compactness
Q: Scatter matrix
Y, L: Label matrix
L: Label
L: Loss
ω: Weights in a linear combination
Chapter 1
Introduction
Image segmentation is the problem of partitioning an image into several non-overlapping regions where every region represents a meaningful object or a part of the scene captured by the image. The example image in Fig. 1.1a can be divided into three coarse segments: one object ('cow') and two regions ('field' and 'water body') shown in Fig. 1.1b–d, respectively. This problem is well researched in the area of computer vision and forms the backbone of several applications. Given low-level features, e.g., color and texture, it is very difficult to segment an image with complex structure into meaningful objects or regions without any knowledge of high-level features, e.g., shape, size or category of objects present in the image. Unlike the example image in Fig. 1.1a, the foreground and background regions in the example image of Fig. 1.1e cannot be easily segmented from low-level features alone because the foreground and background contain regions of different textures. Although the human visual system can easily do this segmentation, the lack of high-level information (e.g., presence of a house) makes the image segmentation problem in computer vision very challenging. Existing image segmentation techniques can be classified into three categories, which may be applied depending on the difficulty level of the task.
• Unsupervised segmentation: It is not aided by any additional information or property of the image regions.
• Semi-supervised segmentation: Users provide some information regarding different image segments in the form of (i) foreground/background scribbles or (ii) additional images containing regions with similar features.
• Fully supervised segmentation: A segmentation model is learned from the ground truth available with the training data.
Fig. 1.1 Image segmentation. a Input image and b–d the corresponding segmentation results. e An image that is difficult to segment using only low-level features. Image Courtesy: Source images from the MSRC dataset [105]
1.1 Image Co-segmentation

In this monograph, we discuss the image co-segmentation problem [78, 103], which is the process of finding objects with similar features from a set of two or more images. This is illustrated using an image pair in Fig. 1.2. With many media sharing websites, there is now a large volume of image data available to researchers over the internet. For example, several people often capture images of the same or a similar scene from different viewpoints or at different time instants and upload them. In such a scenario, co-segmentation can be used to find the object of common interest. Figure 1.3 shows a set of five images captured at different locations by different people containing a common object 'tiger', and it is quite apparent that the photographers were interested in the 'tiger'.
Fig. 1.2 Co-segmentation of an image pair. a, b Input image pair and c, d the common object in them. Image Courtesy: Source images from the image pair dataset [69]
Given two images, an image similarity measure computed from just global statistics may lead to a wrong conclusion if the image pair contains a similar background of large area but unrelated objects of interest of small size. Detection of co-segmented objects gives a better measure of image similarity. Co-segmentation can also be used to discard images that do not contain co-occurring objects from a group of images for database pruning. Hence, co-segmentation has several applications, and it has attracted much attention. This problem can be solved using either completely unsupervised or fully supervised techniques. Methods from both categories work with the assumption that every image in the image set is known to contain at least one object of a common class. In this context, it is worth mentioning that we may need to co-segment an image set where only a majority of the images contain the common object and the rest need not. This is a challenging problem, and this monograph discusses several approaches to solve it. It may be noted that co-segmentation extracts the common object without knowledge of its class information, and it does not perform pattern recognition. For the image set in Fig. 1.3, co-segmentation yields a set of binary masks that can be used to extract the common object ('tiger') from the respective images. But it does not necessarily have to recognize the common object as a 'tiger'. Thus, co-segmentation can be used as a preprocessing step for object recognition. The co-segmentation problem is not limited to finding only a single common object. It is quite common for the image set being co-segmented to contain multiple common objects.
Fig. 1.3 Co-segmentation of more than two images. Column 1: Images retrieved by a child from the internet, when asked to provide pictures of a tiger. Column 2: The common object quite apparent from the given set of images. Image Courtesy: Source images from the internet
Figure 1.4 shows an example of an image set containing two common objects, 'kite' and 'bear', present in different images. Thus, not having to identify the class of object(s) provides greater flexibility in designing a general-purpose co-segmentation scheme. Moreover, the different common objects need not be present in all the images.
Fig. 1.4 Multiple class co-segmentation. Column 1 shows a set of six images that includes images from two classes: ‘kite’ and ‘bear’. Column 2 shows the common object in ‘kite’ class. Column 3 shows the common object in ‘bear’ class. Image Courtesy: Source images from the iCoseg dataset [8]
Further, the image subsets containing different common objects can be overlapping. The example in Fig. 1.5 shows a set of four images containing three common objects: 'butterfly' (present in all images), 'red flower' (present in two images) and 'pink flower' (present in two images). The co-segmentation results can be used in image classification, object recognition, image annotation, etc., justifying the importance of image co-segmentation in computer vision.
Fig. 1.5 Co-segmentation where multiple common objects are present in overlapping image subsets. Row 1 shows a set of four images. Row 2 shows common object ‘butterfly’ present in all images. Rows 3,4 show ‘red flower’ and ‘pink flower’ present in Images-1, 2 and Images-3, 4, respectively. Image Courtesy: Source images from the FlickrMFC dataset [59]
Given an image, humans are more attentive to certain objects present in it. Hence, in most applications of co-segmentation, we are interested in finding the common object rather than common background regions in different images. Figure 1.6 shows a set of four images that contain a common foreground ('cow') as well as common background regions ('field' and 'water body'). In this monograph, co-segmentation is restricted only to common foreground objects ('cow'), while 'field' and 'water body' are ignored. It may be noted that detection of common background regions also has some, although limited, applications, including semantic segmentation [3] and scene understanding [44]. It is worth mentioning that a related problem is image co-saliency [16], which measures the saliency of co-occurring objects in an image set. There, one can make use of computer vision algorithms that are able to detect attentive objects in images. We briefly describe image saliency detection [14] next.
Fig. 1.6 Example of an image set containing common foreground (‘cow’) as well as common background regions (‘field’ and ‘water body’). Image Courtesy: Source images from the MSRC dataset [105]
Fig. 1.7 Salient object detection. a Input image and b the corresponding salient object shown in the image and c the cropped salient object through a tightly fitting bounding box. Image Courtesy: Source image from the MSRA dataset [26]
1.2 Image Saliency and Co-saliency

The human visual system can easily locate objects of interest and identify important information in a large volume of real-world data with complex scenes or cluttered backgrounds by interpreting them in real time. The mechanism behind this remarkable ability of the human visual system is researched in neuroscience and psychology, and saliency detection is used to study it. Saliency is a measure of the importance of objects or regions in an image, or of important events in a video scene, that capture our attention. The salient regions in an image are different from the rest of the image in certain features (e.g., color, spatial or frequency characteristics).
Fig. 1.8 Difference between salient object and foreground object. a Input image [105], b foreground objects and c the salient object. Image Courtesy: Source image from the MSRC dataset [105]
The saliency of a patch in an image depends on its uniqueness with respect to other patches in the image, i.e., rare patches (in terms of features) are likely to be more salient than frequently occurring patches. In the image shown in Fig. 1.7a, the 'cow' is distinct from the surrounding background ('field') in terms of color and shape features; hence, it is the salient object (Fig. 1.7b). It is important to distinguish between salient object extraction and foreground segmentation of an image. In Fig. 1.8a, both the 'ducks' are foreground objects (Fig. 1.8b), but only the 'white duck' is the salient object (Fig. 1.8c) since the color of the 'black duck' is quite similar to the 'water', unlike the 'white duck'. Moreover, the salient object need not always be distinct only from the background. In Fig. 1.9, the 'red apple' is the salient object, which stands out from the rest of the 'green apples'. Hence, a salient object, in principle, can also be detected from a set of objects of the same class. It is also possible that an image contains more than one salient object. For example, in the image shown in Fig. 1.10, there are four salient objects ('bowling pins'). Here, the four objects may not be equally salient, and some may be more salient than others. Hence, we need to formulate a mathematical definition of saliency [97, 120, 151] and assign a saliency value to every object or pixel in an image to find its relative attention. This, however, is beyond the scope of this book. In analogy to the enormous amount of incoming information from the retina that is required to be processed by the human visual system, a large number of images are publicly available to be used in many real-time computer vision algorithms that also have high computational complexity. Saliency detection methods offer efficient solutions by identifying important regions in an image so that operations can be performed only on those regions. This reduces complexity in many image and vision applications that work with large image databases and long video sequences.
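To make the patch-rarity notion above concrete, the following toy sketch scores each patch of an image by its average color distance to all other patches, so that rare patches receive higher scores. This is only an illustration of the uniqueness idea and not one of the saliency models cited in this chapter; the patch size and the distance measure are arbitrary choices made for the example.

```python
import numpy as np

def patch_rarity_saliency(img, patch=16):
    """Toy saliency: a patch is scored by its mean color distance to all
    other patches, so rare patches score high.  img is an (H, W, 3) array."""
    h, w = img.shape[0] // patch, img.shape[1] // patch
    feats = np.array([img[i*patch:(i+1)*patch, j*patch:(j+1)*patch].mean(axis=(0, 1))
                      for i in range(h) for j in range(w)])             # mean color per patch
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)   # pairwise distances
    sal = d.mean(axis=1).reshape(h, w)                                  # rarity score per patch
    return (sal - sal.min()) / (sal.ptp() + 1e-12)                      # normalize to [0, 1]
```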
Fig. 1.9 Salient object detection among a set of objects of the same class. a Input image and b the corresponding salient object. Image Courtesy: Source image from the HFT dataset [70]
Fig. 1.10 Multiple salient object detection. a Input image and b the corresponding salient objects. Image Courtesy: Source image from the MSRA dataset [26]
For example, saliency detection can aid in video summarization [52, 79], image segmentation and co-segmentation [19, 43, 45, 90], content-based image retrieval [23, 76], image classification [58, 112], object detection and recognition [93, 109, 113, 133], photo collage [42, 137], dominant color detection [140], person re-identification [149], advertisement design [83, 101] and image editing for content-aware image compression, retargeting, resizing, image cropping/thumbnailing (see Fig. 1.7c), and adaptive image display on mobile devices [2, 37, 47, 82, 94, 96, 122]. Since saliency detection can be used to extract highly attentive objects from an image, we may make use of it to find the common object(s) from multiple images. This is evident from the example image set in Fig. 1.11, where the common object 'balloon' is salient in all the images, and it can be extracted through saliency value matching across images. This extraction of common and salient objects is known as image co-saliency. But this scenario is limited in practice. For example, in Fig. 1.12, although 'dog' is the common object, it is not salient in Image 3. Here, saliency alone will fail to detect the common object ('dog') in all the images. Hence, saliency is not always suitable for robust co-segmentation, and we have restricted the discussions in this monograph to image co-segmentation methods that do not use image saliency. In Chap. 10, we will discuss this with more examples. We also show that the image
sets in Figs. 1.11 and 1.12 can be co-segmented through feature matching without using saliency in Chaps. 6 and 4, respectively.
1.3 Basic Components of Co-segmentation

In this monograph, we will describe co-segmentation methods in both unsupervised and supervised settings. While all the unsupervised methods discussed in the monograph utilize image superpixels, the supervised methods may use either superpixels or pixels. Further, graph-based representation of images forms the basis of some of the supervised and unsupervised frameworks discussed. Some methods are based on classification of image pixels or superpixels. Next, we briefly describe these representations for a better understanding of the problem being discussed.

Superpixels: Since the images to be co-segmented may be large, pixel-based operations have a high computational complexity. Hence, to reduce the complexity, it is efficient to decompose every image into groups of pixels and perform operations on them. A common basic image primitive used by researchers is the rectangular patch. But pixels within a rectangular patch may belong to multiple objects and may not be similar in features, e.g., color. So the natural patch, called a superpixel, is a good choice to represent image primitives. Unlike rectangular patches, the use of superpixels helps to retain the shape of an object boundary. Simple linear iterative clustering (SLIC) [1] is an accurate method to oversegment the input images into non-overlapping superpixels (see Fig. 1.13a, b). Each superpixel contains pixels from a single object or region and is homogeneous in color. Typically, superpixels are attributed with appropriate features for further processing (Chaps. 4, 5, 6 and 7).

Region adjacency graphs (RAG): A global representation of an image can be obtained from the information contained in the local image structure using some local conditional density function. Since rectangular patches are row and column ordered, one could have defined a random field, e.g., a Markov random field. But here, we choose superpixels as image primitives. As no natural ordering can be specified for superpixels, a random field model cannot be defined on the image lattice. Graph representations allow us to handle this kind of superpixel-neighborhood relationship (see Fig. 1.13c). Hence, in this monograph, we use graph-based approaches, among others, for co-segmentation. In a graph representation of an image, each node corresponds to an image superpixel. A node pair is connected by an edge if the corresponding superpixels are spatially adjacent to each other in the image space. Hence, this is called a region adjacency graph (G). Detailed explanation of the graph formulation is provided later. Chapters 4, 5 and 7 of this monograph describe graph-based methods, both unsupervised and supervised, where an object is represented as a subgraph (H) of the graph representing the image (see Fig. 1.13d, e). A minimal code sketch of this superpixel-plus-RAG representation is given below.

Convolutional neural networks (CNN): Over the past decade, several deep learning-based methods have been deployed for a range of computer vision tasks, including visual recognition and image segmentation [108].
Fig. 1.11 Saliency detection on an image set (shown in Column 1) with the common foreground (‘balloon’, shown in Column 2) being salient in all the images. Image Courtesy: Source images from the iCoseg dataset [8]
Fig. 1.12 Saliency detection on an image set (shown in top row) where the common foreground (‘dog’, shown in bottom row) is not salient in all the images. Image Courtesy: Source images from the MSRC dataset [105]
The basic unit in such deep learning architectures is the CNN, which learns to obtain semantic features from images. Given a sufficiently large dataset of labeled images for training, these learned features have been shown to outperform the hand-crafted features used in the unsupervised methods. We will demonstrate this in Chaps. 7 and 8, where co-segmentation is performed using a model learned utilizing labeled object masks. Further, among the deep learning-based co-segmentation methods described in this monograph, the method in Chap. 9 focuses on utilizing a lesser amount of training data to mimic several practical situations where a sufficient amount of labeled data is not available.
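As a minimal sketch of the superpixel-plus-RAG representation described above, the following snippet oversegments an image with the SLIC implementation available in scikit-image and builds a region adjacency graph as an edge list over superpixel labels, with the mean color of each superpixel as its node attribute. The parameter values (e.g., the number of segments) are illustrative and are not those used in the later chapters.

```python
import numpy as np
from skimage import io, segmentation

def superpixel_rag(image_path, n_segments=200):
    """Oversegment an image with SLIC and build a region adjacency graph
    as an edge list over superpixel labels (a sketch, not the exact
    construction used in Chaps. 4-7)."""
    img = io.imread(image_path)
    labels = segmentation.slic(img, n_segments=n_segments, compactness=10, start_label=0)
    n = labels.max() + 1
    # node attribute: mean color of each superpixel
    mean_color = np.array([img[labels == k].mean(axis=0) for k in range(n)])
    # edges: pairs of distinct labels that touch horizontally or vertically
    edges = set()
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        diff = a != b
        pairs = np.sort(np.stack([a[diff], b[diff]], axis=1), axis=1)
        edges |= set(map(tuple, pairs.tolist()))
    return labels, mean_color, sorted(edges)
```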
1.3.1 The Problem

The primary objectives of this monograph are to
• design computationally efficient image co-segmentation algorithms so that they can be used on image sets of large cardinality, without compromising on accuracy, and
• ensure robustness of the co-segmentation algorithm in the presence of outlier images (in the image set) which do not contain the common object present in the majority of the images.
Fig. 1.13 Graph representation of an image. a Input image, b its superpixel segmentation and c the corresponding region adjacency graph whose nodes are drawn at the centroids of the superpixels. d The subgraph representing e the segmented object (‘flower’). Image Courtesy: Source image from the HFT dataset [70]
Since the common object present in multiple images is constituted by a set of pixels (or superpixels), there must be high feature similarity among these pixels (or superpixels) from all the images, and hence, we need to find matches. Since we are working with natural images, this poses three challenges apart from the high computation associated with finding correspondences:
• the common object in the images may have a different pose in each image,
• it may have different sizes (see Fig. 1.14), and
• it may have been captured by different cameras under different illumination conditions.
The co-segmentation methods described in this monograph aim to overcome these challenges.
1.4 Organization of the Monograph

In this monograph, we describe three unsupervised and three supervised methods for image co-segmentation. The monograph is organized as follows. In Chaps. 2 and 3, we describe existing works on co-segmentation and provide mathematical background on the tools that will be used in the co-segmentation methods described in subsequent chapters. Then, in Chap. 4, we describe an image co-segmentation algorithm for image pairs using maximum common subgraph matching and region co-growing. In Chap. 5, we explain a computationally efficient unsupervised image co-segmentation method for multiple images using a concept called the latent class graph. In Chap. 6, we demonstrate a solution of the image co-segmentation problem in a classification setup, although in an unsupervised manner, using discriminant feature-based label propagation. We describe a graph convolutional network (GCN)-based co-segmentation method in Chap. 7, as the first of the supervised methods. In Chap. 8, we describe a siamese network to perform co-segmentation of a pair of images. More recently, supervised methods have been trying to use less data during training, and few-shot learning is one such approach. A co-segmentation method under the few-shot setting is discussed in Chap. 9. Finally, in Chap. 10, we conclude with discussions and possible future directions in image co-segmentation. We next provide a brief overview of the above six co-segmentation methods.
1.4.1 Co-segmentation of an Image Pair

To co-segment an image pair using graph-based approaches, we need to find pairwise correspondences among the nodes of the corresponding graphs, i.e., superpixels across the image pair. Since the common object present in the image pair may have different pose and size, we may have one-to-one, one-to-many or many-to-many correspondences. In Chap. 4, we describe a method where
• the maximum common subgraph (MCS) of the graph pair (G_1 and G_2) is first computed, which provides an approximate result that detects the common objects partially,
• a region co-growing method then simultaneously grows the seed superpixels (i.e., nodes in the MCS) in both images to obtain the common objects completely, and
• a progressive method combining the two stages, MCS computation and region co-growing, significantly improves the computation time.
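As a toy illustration of how seed correspondences between the two superpixel graphs might be initialized, the sketch below greedily pairs superpixels across the two images by descending feature similarity above a threshold. This is only a simplified stand-in and not the maximum common subgraph algorithm of Chap. 4; the similarity function and the threshold value are arbitrary assumptions.

```python
import numpy as np

def greedy_seed_matches(f1, f2, tau=0.9):
    """Greedy one-to-one matching of superpixels across two images by feature
    similarity; f1, f2 are (n1, d) and (n2, d) superpixel feature arrays."""
    sim = 1.0 / (1.0 + np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=2))
    matches, used1, used2 = [], set(), set()
    # visit candidate pairs from most to least similar
    for i, j in zip(*np.unravel_index(np.argsort(-sim, axis=None), sim.shape)):
        if sim[i, j] < tau:
            break
        if i in used1 or j in used2:
            continue
        matches.append((int(i), int(j)))
        used1.add(i); used2.add(j)
    return matches
```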
Fig. 1.14 Co-segmentation of image pairs with the common objects having different size and pose. a, b and e, f two input image pairs, and c, d and g, h the corresponding common objects, respectively. Image Courtesy: Source images from the image pair dataset [69]
1.4.2 Robust Co-segmentation of Multiple Images

To co-segment a set of N (> 2) images (see Fig. 1.3 for example), finding the MCS of N graphs involves very high computation. Moreover, the node correspondences obtained from this MCS must lead to consistent matching of corresponding superpixels across the image set, and this is very difficult. Further, it is quite common for the image set to contain a few outlier images that do not share the common object present in the majority of the images (see Fig. 1.15). This makes co-segmentation even more challenging. In Chap. 5, we describe an efficient method that can handle these challenges.
• First, a latent class graph (LCG) is built by combining all the graphs G_i. In particular, we compute pairwise MCSs sequentially until all graphs have been included. This LCG (H_L) contains information of all graphs, and its cardinality is limited by
$$
|H_L| \;=\; \sum_{i} |H_i| \;-\; \sum_{i}\sum_{j>i} \mathrm{MCS}(H_i, H_j) \;+\; \sum_{i}\sum_{j>i}\sum_{k>j} \mathrm{MCS}(H_i, H_j, H_k) \;-\; \cdots \tag{1.1}
$$
where H_i ⊆ G_i is obtained through joint clustering of all image superpixels (see Chap. 5).
• A maximally occurring common subgraph (MOCS) matching algorithm finds the common object completely by using the LCG as a reference graph.
• We show in Chap. 5 that MOCS can handle the problem of outliers and reduce the required number of graph matchings from O(N 2^{N-1}) to O(N). We also show that this formulation can perform multiclass co-segmentation (see Fig. 1.4 for example).
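For instance, with N = 3 graphs, and reading each MCS(·) term as the node count of the corresponding maximum common subgraph (the interpretation assumed for this illustration), Eq. (1.1) expands to the familiar inclusion-exclusion form:

```latex
|H_L| = |H_1| + |H_2| + |H_3|
        - \mathrm{MCS}(H_1, H_2) - \mathrm{MCS}(H_1, H_3) - \mathrm{MCS}(H_2, H_3)
        + \mathrm{MCS}(H_1, H_2, H_3)
```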
1.4.3 Co-segmentation by Superpixel Classification

Crowd-sourced images are generally captured under different camera illumination and compositional contexts. These variations make feature selection difficult, which in turn makes it very difficult to extract the common object(s) accurately. Hence, discriminative features, which better distinguish between the background and the common foreground, are required. In Chap. 6, co-segmentation is formulated as a foreground–background classification problem where superpixels belonging to the common object across images are labeled as foreground and the remaining superpixels are labeled as background in an unsupervised manner.
• First, a novel statistical mode detection method is used to initially label a set of superpixels as foreground and background.
• Using the initially labeled superpixels as seeds, labels of the remaining superpixels are obtained, thus finding the common object completely. This is achieved through a novel feature iterated label propagation technique (a generic form of label propagation is sketched at the end of this subsection).
Fig. 1.15 Multi-image co-segmentation in the presence of outlier images. Top block shows a set of six images that includes an outlier image (image-5). Bottom block shows that the common object (‘kite’) is present only in five of the six images. Image Courtesy: Source images from the iCoseg dataset [8]
• There may be some feature variation among the superpixels belonging to the common object, and this may lead to incorrect labeling. Hence, a modified linear discriminant analysis (LDA) is designed to compute discriminative features, which significantly improves the labeling accuracy.
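For reference, a generic two-class label propagation iteration over a superpixel similarity graph is sketched below, in the spirit of the label propagation reviewed in Sect. 3.2. It is only a baseline form: the spatially constrained, feature-iterated variant used in Chap. 6 adds further ingredients, and the normalization and parameter values here are common textbook choices rather than those of the chapter.

```python
import numpy as np

def propagate_labels(W, seeds, alpha=0.9, n_iter=100):
    """Diffuse seed labels over a superpixel similarity graph.
    W:     (n, n) nonnegative similarity matrix between superpixels.
    seeds: (n,) array with +1 (foreground seed), -1 (background seed), 0 (unlabeled).
    Returns a boolean array marking superpixels assigned to the common foreground."""
    d = W.sum(axis=1)
    S = W / (np.sqrt(np.outer(d, d)) + 1e-12)   # symmetrically normalized similarities
    y0 = seeds.astype(float)
    f = y0.copy()
    for _ in range(n_iter):
        f = alpha * S @ f + (1.0 - alpha) * y0  # propagate, while retaining the seeds
    return f > 0.0
```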
1.4.4 Co-segmentation by Graph Convolutional Neural Network

Notwithstanding the usefulness of different unsupervised co-segmentation algorithms, the effectiveness of these approaches is reliant on an appropriate choice of hand-crafted features for the task. When sufficient annotated data is available, however, we can compute learned features using deep learning methods, which do away with manual feature selection. Hence, we next explore co-segmentation methods based on deep neural networks. In Chap. 7, we discuss an end-to-end foreground–background classification framework using a graph convolutional neural network.
• In this framework, each image pair is jointly represented as a weighted graph by exploiting both intra-image and inter-image feature similarities.
• The model then uses the graph convolution operation to learn features as well as classify each superpixel into the common foreground or the background class, thus achieving co-segmentation.
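One widely used form of the graph convolution operation, the symmetrically normalized propagation rule, is sketched below for a single layer acting on a joint superpixel graph. It only illustrates the basic operation; the specific filters analyzed in Chap. 7 are described there, and the weight matrix W is a learnable parameter assumed to be given.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    A: (n, n) adjacency matrix of the joint superpixel graph.
    H: (n, d) node feature matrix.  W: (d, d') layer weight matrix."""
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)           # ReLU nonlinearity
```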
1.4.5 Conditional Siamese Convolutional Network

In Chap. 8, we shift to a framework based on standard convolutional neural networks that directly works on image pixels instead of superpixels. It consists of a metric learning network, a decision network and a conditional siamese encoder-decoder network.
• The metric learning network's job is to identify an optimal latent feature space in which samples of the same class are closer together and those of different classes are separated.
• The encoder-decoder network estimates the co-segmentation masks appropriately.
• The decision network determines whether the input images include common objects or not, based on the extracted characteristics. As a result, the model can handle outliers in the input set.
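A standard contrastive loss is one common way to realize the metric learning objective stated above: embeddings of same-class pairs are pulled together while different-class pairs are pushed apart beyond a margin. The actual losses used in Chap. 8 are detailed there; the margin value and the distance choice below are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(z1, z2, same, margin=1.0):
    """z1, z2: (n, d) batches of embeddings from the two siamese branches.
    same: (n,) array with 1 for same-class pairs and 0 for different-class pairs."""
    d = np.linalg.norm(z1 - z2, axis=1)                     # Euclidean distance per pair
    loss = same * d**2 + (1 - same) * np.maximum(margin - d, 0.0)**2
    return loss.mean()
```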
1.4.6 Co-segmentation in Few-Shot Setting

Fully supervised methods perform well when a large amount of training data is available. However, collecting sufficient training samples is not always easy, and for some tasks, it may be almost impossible. Hence, a framework for multi-image co-segmentation that uses a meta-learning technique is required in such scenarios, and is discussed in Chap. 9.
• We discuss a directed variational inference cross-encoder, which is an encoder-decoder network that learns a continuous embedding space to provide superior similarity learning.
• It is a class agnostic technique that can generalize to new classes with only a limited number of training samples.
• To address the limited sample size problem in co-segmentation with small datasets like iCoseg and MSRC, a few-shot learning strategy is also discussed.
Having introduced the problem of co-segmentation and briefly described the approaches, in the next chapter we review the related literature on image co-segmentation and the relevant datasets.
Chapter 2
Survey of Image Co-segmentation
In this chapter, we first review the literature related to unsupervised image co-segmentation. We then review the available supervised co-segmentation methods.
2.1 Unsupervised Co-segmentation

Image co-segmentation methods aim to extract the common object present in more than one image by simultaneously segmenting all the images. The co-segmentation problem was first explored by Rother et al. [103], who considered the case of two images, and it was followed by the methods in [49, 88, 130]. Subsequently, researchers have actively worked on co-segmentation of more than two images [78, 105] because of its many practical applications. Recently, the focus has shifted to multiple class co-segmentation. Some of these works include the methods in [18, 56, 57, 59, 60, 78, 136] that jointly segment the images into multiple classes to find common foreground objects of different classes.
2.1.1 Markov Random Field Model-Based Methods

Early methods in [49, 88, 103, 130] provide a solution for co-segmentation of two images by histogram matching in a Markov random field (MRF) model-based energy minimization framework. These methods extend the MRF model-based single-image segmentation technique [15] to co-segmentation. The energy function to be minimized can be written as:
$$
E_t(\mathbf{y}) \;=\; \sum_{k=1}^{2} \left( \sum_{i \in I_k} E_u(y_i) \;+\; \sum_{(i \in I_k,\, j \in \mathcal{N}(i))} E_p(y_i, y_j) \right) \;+\; \chi_h\, E_h(H_1, H_2), \tag{2.1}
$$
where y_i ∈ {0, 1} is the unknown label (0 for common foreground and 1 for background) of pixel i, N(·) denotes the neighborhood, χ_h is a weight, and H_1, H_2 are the histograms of the common foreground regions in the image pair. The first two terms in Eq. (2.1) together correspond to the standard MRF model-based single-image segmentation cost function: E_u(y_i) is the unary term, which can be computed using a histogram or a foreground–background Gaussian mixture model, and
$$
E_p(y_i, y_j) = S(i, j)\,|y_i - y_j| \tag{2.2}
$$
is the pairwise term that ensures smoothness within, and distinction between, the segmented foreground and background via the feature similarity S(·). The third term E_h(·) in Eq. (2.1) measures the histogram distance between the unknown common object regions in the image pair, and it is responsible for inter-image region matching. Rother et al. [103] used the L_1-norm of the histogram difference to compute E_h(·) and, since optimizing a cost function with the L_1-norm is difficult, proposed an approximation method called the submodular-supermodular procedure, whereas Mukherjee et al. [88] replaced the L_1-norm by the L_2-norm for approximation. However, the optimization problem in both methods is computationally intensive. To simplify the optimization, Hochbaum and Singh [49] rewarded foreground histogram consistency by using an inner product to measure E_h(·), instead of minimizing the foreground histogram difference. Moreover, prior information about foreground and background colors has been used in [49, 88] to compute the unary term E_u(·), whereas Vicente et al. [130] ignored the unary term. Instead, they considered separate models for the common foreground, background-1 and background-2, and added a constraint that all pixels belonging to a histogram bin must have the same label. The methods in [49, 88, 103, 130] perform well only for common objects with highly similar appearance on different backgrounds.
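To make Eq. (2.1) concrete, the sketch below evaluates the energy for a candidate labeling of an image pair, using the L_1 histogram distance for E_h(·) as in Rother et al. [103]. The data structures (unary cost tables, neighbor lists, per-pixel color-bin indices) are assumptions made for this illustration and do not reproduce the exact representation of any particular method.

```python
import numpy as np

def cosegmentation_energy(y, unary, nbrs, sim, bins, n_bins, chi_h=1.0):
    """Evaluate Eq. (2.1) for two images (k = 0, 1); label 0 = common foreground.
    y:     list of (n_k,) integer label arrays.   unary: list of (n_k, 2) cost tables.
    nbrs:  list of neighboring pixel-index pairs. sim:   list of (n_k, n_k) similarities S.
    bins:  list of (n_k,) per-pixel color-bin indices used to build H1, H2."""
    e_unary = sum(unary[k][np.arange(len(y[k])), y[k]].sum() for k in range(2))
    e_pair = 0.0
    for k in range(2):
        for i, j in nbrs[k]:
            e_pair += sim[k][i, j] * abs(int(y[k][i]) - int(y[k][j]))   # E_p of Eq. (2.2)
    hists = []
    for k in range(2):
        fg = bins[k][y[k] == 0]                                         # foreground pixels
        h = np.bincount(fg, minlength=n_bins).astype(float)
        hists.append(h / max(h.sum(), 1.0))                             # normalized histogram
    e_hist = np.abs(hists[0] - hists[1]).sum()                          # L1 distance for E_h
    return e_unary + e_pair + chi_h * e_hist
```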
2.1.2 Saliency-Based Methods

The methods in [22, 54, 63, 105, 139] first compute image saliency and use it as a key component in their co-segmentation methods. Rubinstein et al. [105] use salient pixels in the images as seeds to find inter-image pixel correspondences using SIFT flow. These are used to initialize an iterative algorithm for the optimization of a cost function involving correspondence and histogram similarity, and to match the common regions across images. The clustering-based method in [63] can extract only the salient foregrounds from multiple images. However, it should be noted that the common object may not always be salient in all constituent images.
Recently, co-saliency-based methods [16, 19, 21, 39, 69, 73, 77, 124, 126] have also been used for co-segmentation. These methods detect common, salient objects from the image set. Typically, most methods [19, 69] define the co-saliency of a region or superpixel in an image as a combination of its single-image saliency value and the average of its highest feature similarities with the regions in the remaining images. The method in [19] extracts the co-salient object through MRF model-based labeling using salient pixels for label initialization. Cao et al. [16] combine the outputs of multiple saliency detection methods. The weight for a method is computed from the histogram vectors of the salient regions detected by that method in all images. Since dependent vectors indicate that the corresponding regions are co-salient, the weight is inversely proportional to the low-rank approximation error of the combined histogram matrix. Tsai et al. [126] jointly compute co-saliency values and co-segmentation labels of superpixels. They build a combined graph of all superpixels in the image set, and optimize a cost similar to Eq. (2.1) to obtain the superpixel labels. The co-saliency of each superpixel is obtained as a weighted sum of an ensemble of single-image saliency maps, and the weights are learned using the unary term. The cost also includes a coupling term to make the co-saliency and co-segmentation results coherent, so that a common object superpixel has a high co-saliency value and vice-versa. Liu et al. [77] hierarchically segment all the images. Coarse-level segments are used to compute an object prior, and fine-level segments are used to compute saliency. Co-saliency is computed using global feature similarity and saliency similarity among fine-level segments. Tan et al. [124] used a bipartite graph to compute feature similarity. Fu et al. [39] cluster all the pixels of all images into a certain number of clusters, and find the saliency of every cluster using the center bias of the clusters, the distribution of image pixels in every cluster, and inter-cluster feature distances. Co-saliency methods can detect the common object from an image set only if it is highly salient in the respective images. Since most image sets to be co-segmented do not satisfy this criterion (examples shown in Chap. 1 and more examples to be shown in Chap. 10), co-saliency can be applied to a limited number of image sets and it cannot be generalized for all kinds of datasets. Hence, we use co-segmentation for the detection of common objects in this monograph.
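Following the generic definition above (single-image saliency combined with the average of the highest feature similarities to the regions of the other images), a toy per-region co-saliency score can be sketched as below. The similarity function and the multiplicative combination are illustrative assumptions and do not reproduce any specific method cited here.

```python
import numpy as np

def co_saliency_scores(saliency, feats):
    """saliency: list of (n_k,) single-image saliency values per region.
    feats:      list of (n_k, d) region feature arrays, one entry per image."""
    scores = []
    for k, (sal_k, f_k) in enumerate(zip(saliency, feats)):
        best = []
        for l, f_l in enumerate(feats):
            if l == k:
                continue
            d = np.linalg.norm(f_k[:, None, :] - f_l[None, :, :], axis=2)
            best.append((1.0 / (1.0 + d)).max(axis=1))   # best match in image l per region
        scores.append(sal_k * np.mean(best, axis=0))      # combine with own saliency
    return scores
```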
2.1.3 Other Co-segmentation Methods
Joulin et al. [56] formulated co-segmentation as a two-class clustering problem using a discriminative clustering method. They extended this work to multiple classes in [57] by incorporating spectral clustering. As their kernel matrix is defined for all possible pixel pairs of all images, the complexity grows rapidly with the number of images. Kim et al. [60] used anisotropic diffusion to optimize the number and location of image segments. As all the images are segmented into an equal number of clusters, oversegmentation may become an issue in a set of different types of images. Furthermore, this method cannot co-segment heterogeneous objects. An improvement to this method has been proposed in [59] using supervision such as
bounding box or pixelwise annotations for different foreground classes in some selected images of the set. However, the methods in [56, 57, 60] cannot determine the common object class automatically, and it has to be selected manually. The scale invariant co-segmentation method in [89] solves co-segmentation for common foreground objects of different sizes: instead of minimizing the distance between histograms of common foregrounds, it constrains them to have low entropy and to be linearly dependent. Lee et al. [64] and Collins et al. [29] employed random walks for co-segmentation. The idea is to perform a random walk from each image pixel to a set of user specified seed points (foreground and background). The walk being biased by image intensity gradients, each pixel is labeled as foreground if the pixel-specific walk reaches a foreground seed first, and vice-versa for a background label. Tao et al. [125] use shape consistency among common foreground regions as a cue for co-segmentation. But changes in pose and viewpoint, which are quite common for natural images, inherently result in changes in the shape of the common object, making the method ineffective in such cases. The method in [136] performs pairwise image matching, resulting in high computational complexity. Being a pairwise method, it does not produce consistent co-segmentation across the entire dataset, and requires further optimization techniques to ensure consistency. Meng et al. [84] and Chen et al. [22] split the input image set into simple and complex subsets. Both subsets contain a common object of the same class. However, co-segmentation from the simple subset (foreground and background are homogeneous and well separated) is easier than from the complex subset (foreground is not homogeneous, background is cluttered, and there is less contrast between them). Subsequently, they use results obtained from the simple subset for co-segmentation of the complex one. In their co-segmentation energy function, Meng et al. [84] split the common foreground feature distance/similarity term E_h(·) of Eq. (2.1) into two components: similarity between region pairs present in the same image group and across different image groups. This helps to match a few inter-group region pairs, instead of forcing matching between all pairs. Chen et al. [22] assume that the salient foregrounds in the simple image set are well separated from the background, and the well-segmented object masks are used as a segmentation prior in order to segment more difficult images. Li et al. [68] proposed to improve the co-segmentation results of existing methods by repairing bad segments using information propagation from good segments as a post-processing step. Sun and Ponce [123] proposed to learn discriminative part detectors for each class using a support vector machine model. The category label information of images is used as the supervision, and the learned part detectors of each class discriminate that class from the background. Then, co-segmentation is performed by feeding the object cue obtained from the part detectors into the discriminative clustering framework of [56]. Rubio et al. [106] first compute objectness scores of pixels and image segments, and use them to form the unary term of Eq. (2.1), and incorporate an energy term in E_t(·) that considers the similarity values of every two inter-image matched region pairs.
In order to obtain high-level features, semi-supervised methods in [71, 74, 85, 86, 131, 146] compute region proposals from images using pretrained networks, whereas Quan et al. [99] use CNN features. The graph-based method in [85] includes
high-level information like object detection, which is also a complex problem. They construct a directed graph by representing the computed local image regions (generated using saliency and object detection) as nodes, and sequentially connecting edges among nodes of consecutive images. Then the common object is detected as the set of nodes on the shortest path in the graph. Vicente et al. [131] used proposal object segmentations to train a random forest regressor for co-segmentation. This method relies heavily on the accuracy of individual segmentation outputs as it is assumed that one segment contains the complete object. Li et al. [71] extended the proposal selection-based co-segmentation methods in [85, 131] by a self-adaptive feature selection strategy. Quan et al. [99] build a graph from all superpixels of all images, and use a ranking algorithm and boundary prior to segment the common objects such that the nodes corresponding to the common foreground are assigned high rank scores. In the first part of this monograph, we discuss image co-segmentation methods based on unsupervised frameworks. Hence, these methods do not involve CNN features or region proposals or saliency. Every image is segmented into superpixels, a region adjacency graph (RAG) is constructed, and the nodes in the graph (superpixels) are attributed with only low-level and mid-level features, e.g., color, HOG and SIFT features. These co-segmentation algorithms are based on graph matching and superpixel classification. The graph-based framework performs maximum common subgraph matching of the RAGs obtained from the image set to be co-segmented. In the classification framework, discriminative features are computed using a modified linear discriminant analysis (LDA) to classify superpixels as background and the common foreground.
2.2 Supervised Co-segmentation
2.2.1 Semi-supervised Methods
There has been a small volume of work on semi-supervised co-segmentation that tries to simplify the co-segmentation problem by including user interaction for segmentation. Batra et al. [8] proposed interactive co-segmentation for a group of similar images using scribbles drawn by users. This method guides the user to draw scribbles in order to refine the segmentation boundary in uncertain regions. It is quite similar to the method in [130], but considers only one background model for the entire image set. Similar semi-supervised methods have been proposed in [29, 33, 142].
2.2.2 Deep Learning-Based Methods
Very little work has been done on the application of deep learning to solve the co-segmentation problem. Recently, the methods in [20, 72] suggested end-to-end training of deep siamese encoder-decoder networks for co-segmentation. The CNN-based encoder is responsible for learning object features, and the decoder performs the segmentation task. The siamese nature of the network allows an image pair to be input simultaneously, and the segmentation loss is computed for the two images jointly. To capture the inter-image similarity, Li et al. [72] compute the spatial correlation of the encoder feature pair (say, of shape C × H × W each) obtained from the two images such that high correlation values identify the common object pixel locations in the respective images. Using the two correlation results (of shape H × W × HW each) as seed, the decoder generates the segmentation masks through deconvolution operations. Different from this, Chen et al. [20] simply concatenate the encoder feature pair, which is decoded to obtain the common object masks. The common class is identified by fusing channel attention measures of the image pair, and the objects are localized using spatial attention maps. Specifically, the encoder features are modulated by the attention measures before feeding them to the decoder. More recently, the methods in [67, 147] consider more than two images simultaneously for training their co-segmentation networks. Li et al. [67] use a co-attention recurrent neural network (RNN) that learns a prototype representation for the image set. The RNN is trained in the traditional manner by taking the images as input in any random sequence. In addition, the update gate model of the RNN unit involves both channel and spatial attention so that common object information is infused into the group representation. This prototype is then combined with the encoder feature of each image to obtain the common object masks. Zhang et al. [147] use an extra classifier to learn the common class. Specifically, the fused channel attention measures obtained from all images in the set are combined to (i) predict the co-category label, and (ii) modulate the encoder feature maps through a semantic modulation subnetwork. Further, a spatial modulation subnetwork learns to jointly separate the foreground and the background in all images to obtain a coarse localization of the common objects. The methods in [25, 50] perform multiple image co-segmentation using deep learning, while training the network in an unsupervised manner, that is, without using any ground-truth mask. Hsu et al. [50] train a generator to estimate co-attention maps that highlight the objects approximately. Then, feature representations of the approximate foreground and background regions obtained from the co-attention maps are used to train the generator using a co-attention loss that aims to reduce the inter-image object distance and increase the intra-image figure-ground discrepancy for all image pairs. To ensure that the co-attention maps capture complete objects instead of object parts, a mask loss is used. It is formulated as a binary cross-entropy loss between the predicted co-attention maps and the best fit object proposals. Thus, the mask loss and the co-attention loss complement each other. Given that the object proposals are computed from a pretrained model, and the ones that best fit the co-attention map are used to refine the co-attention map itself, the method may lead to
a solution with incomplete objects. The method in [25] is built on the model of Li et al. [72], discussed earlier. Since ground-truth masks are not used, the segmentation loss, which is typically computed between the segmentation maps predicted by the decoder pair and the corresponding ground-truth, is replaced by the co-attention loss of Hsu et al. [50], where the figure-ground is estimated from the segmentation predictions. In addition, a geometric transformation is learned using consistency losses that align the feature map pair and the segmentation prediction pair. However, these methods are not capable of efficiently handling outliers in the image set since the network is always trained with image sets in which all images contain the common object. In the second part of this monograph, we discuss image co-segmentation methods based on deep learning. The first method uses superpixel and region adjacency graph representations of images. Then a graph convolutional network performs foreground–background classification of the nodes. The other two methods do not use any superpixel representation. Instead, they directly classify image pixels using convolutional neural networks. Specifically, object segmentation masks are obtained using encoder-decoder architectures. Network training is done in a fully supervised manner as well as in a few-shot setting where the number of labeled images is small. At the same time, negative samples (image pairs without any common object and image sets containing outlier images) are also used so that the model can identify outliers.
2.3 Co-segmentation Datasets
We provide a summary of the datasets used in this monograph for co-segmentation. The image pair dataset [69] contains 105 pairs with a single common object present in the majority of the pairs. The common object present in both images of a pair is visually very similar. However, the objects themselves are not homogeneously featured in most cases. Faktor et al. [36] created a dataset by collecting images from the PASCAL-VOC dataset. It consists of 20 classes with an average of 50 images per class. The classes are: person, bird, cat, cow, dog, horse, sheep, airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa and tv/monitor. The images within a class have significant appearance and pose variations. It is a challenging dataset to work with due to these variations and the presence of background clutter. In particular, the images capture both indoor and outdoor scenes. The Microsoft Research Cambridge (MSRC) [131] subdataset consists of the following classes: cow, plane, car, sheep, cat, dog and bird. Each class has 10 images. The iCoseg dataset was introduced by Batra et al. [8] for interactive segmentation. It contains a total of 643 images of 38 homogeneous classes. This dataset contains a varying number of images of the same object instance under very different viewpoints and illumination, articulated or deformable objects like people, complex backgrounds and occlusions. Even though the image pair dataset, as the name suggests, already contains pairs for co-segmentation, the other three datasets mentioned
do not. Hence, for co-segmentation on these datasets, image sets are constructed by grouping multiple images from each class. The multiple image co-segmentation methods discussed in this monograph also consider outlier contaminated image sets. Given a set of N images, they find the common object from M (≤ N) images. Since such datasets are not available in abundance, the iCoseg dataset has been used to create a much larger dataset by embedding each class with several outlier images randomly chosen from other classes. Each set may have up to 30% of the data as outlier images. This dataset contains 626 sets with a total of 11,433 non-unique images, where each set contains 5–50 images. Similarly, outlier contaminated image sets have been created using images from the Weizmann horse dataset [13], the flower dataset [92] and the MSRC dataset. The MIT object discovery dataset of internet images, or the Internet dataset [105], has three classes: car, horse and airplane with 100 samples per class. Though the number of classes is small, this dataset has high intra-class variation and is relatively large. Every class also contains a small, variable number of outlier images.
Chapter 3
Mathematical Background
In this chapter, we describe some concepts that will be instrumental in developing the co-segmentation algorithms of this monograph. We begin with the superpixel segmentation algorithm in Sect. 3.1, as superpixels are the basic component in the majority of the chapters. In Sect. 3.2, we explain binary and multiclass label propagation algorithms that aid in the classification of samples (e.g., superpixels) by assigning different labels (e.g., foreground and background) to them, thus facilitating segmentation. Then in Sect. 3.3, we describe the maximum common subgraph computation algorithm, which is applied in the graph matching-based co-segmentation algorithms. The deep learning-based co-segmentation methods of this monograph are based on convolutional neural networks (CNNs), and we describe them briefly in Sect. 3.4. Next, in Sect. 3.5, we provide a brief introduction to graph convolutional neural networks, which are a variant of CNNs specifically designed to be applied on graphs. We also include a short discussion on variational inference and few-shot learning as these will be used in a co-segmentation method discussed in the monograph.
3.1 Superpixel Segmentation
Pixel-level processing of images of large spatial size comes with high computational requirements. Hence, oversegmentation techniques are often used to divide an image into many non-overlapping segments or atomic regions. These segments are then used as image primitives instead of pixels for further processing of the image. In addition to the reduction in computation, these segments also provide better local features than pixels alone because a group of pixels provides more context. This makes them meaningful substitutes for pixels. Currently, the superpixel segmentation algorithm is the most common and popular choice for oversegmentation, and its resulting atomic regions are called image superpixels.
Fig. 3.1 Superpixel segmentation. a Input image. b 500 superpixels with compactness Q = 20. c 500 superpixels with Q = 40. d 1000 superpixels with Q = 40. Image courtesy: source image from the internet
Each superpixel is a group of spatially contiguous pixels. Hence, unlike image pixels or rectangular patches, superpixels do not form any rigid grid structure. However, the pixel values inside a superpixel are similar, thus making it homogeneous. Figure 3.1 shows an example of superpixel-based oversegmentation. It can be observed that the pixels constituting each superpixel belong to a single object and the superpixel contours follow the object boundary. Thus, the irregular shape of superpixels represents both foreground object and background region parts more appropriately than what regular-shaped rectangular patches can do. This allows superpixels to be
the building blocks for several computer vision algorithms such as visual recognition in the PASCAL-VOC Challenge [145], depth estimation [152], segmentation [75] and object localization [40]. A good superpixel segmentation algorithm should be efficient in terms of computation and memory requirements. Further, the superpixels should follow the object and region boundaries, thus enabling the computer vision algorithms that use them to achieve high accuracy. The following oversegmentation algorithms have been designed to generate superpixels directly or have been adapted for superpixels: simple linear iterative clustering (SLIC) [1], turbopixel [66], mean shift [30], the watershed method [132], the normalized cuts algorithm [114], the agglomerative clustering-based segmentation method of Felzenszwalb and Huttenlocher [38], the image patch stitching-based superpixel computation method of Veksler et al. [128], and the superpixel generation method of Moore et al. [87] that imposes a grid conformability criterion. Here, we will discuss the SLIC superpixel algorithm in detail since it is more efficient and often outperforms the other approaches. It produces superpixels of relatively uniform size, which are regularly shaped to a certain degree, and they follow the boundary adherence property with high accuracy. Hence, we have used it in the co-segmentation algorithms of this monograph. The SLIC algorithm is an adaptation of the well-known k-means clustering algorithm. Given an image, it clusters the pixels, and each cluster represents a superpixel. For a color image, the clustering of pixels is performed using their five-dimensional feature vectors: CIE L, a, b values and X, Y coordinate values, i.e., f = [L, a, b, X, Y]^T. For a grayscale image, f = [L, X, Y]^T is used as the feature. Thus, in addition to the pixel intensities, the locations of pixels are also used. This ensures that the pixels belonging to a cluster are spatially cohesive. Further, to constrain all pixels in a cluster to form a single connected component, the pixel-to-cluster assignment is done within a specified spatial window, as described next. To oversegment an n_1 × n_2 image into n_S superpixels, the clusters are initialized as P × P non-overlapping windows, where P = sqrt(n_1 n_2 / n_S). In each window, the smallest gradient pixel in the 3 × 3 neighborhood of its center pixel is set as a cluster center. Assignment step: Each pixel-i is assigned to the cluster whose center c has the shortest feature distance among all cluster centers in a 2P × 2P window N_i^{2P}:

c = arg min_{c ∈ N_i^{2P}} d(f_i, f_c)    (3.1)
where d(·) is the feature distance. Restricting the search space to N_i^{2P} reduces the number of distance calculations, so the algorithm requires less computation than standard k-means clustering, which compares every pixel with all cluster centers to find the minimum distance. It can be observed that, for a pixel, the assigned cluster center is one among its eight spatially closest cluster centers. So, the computational complexity of SLIC for an image with N pixels is O(N). Cluster update step: After each pixel is assigned to a cluster, each cluster center is updated as the average of all pixels in that cluster, yielding an updated feature f_c.
These two steps are repeated iteratively until the shift in the cluster center locations is below a certain threshold. The final number of clusters after convergence of the clustering is the number of generated superpixels, which may be slightly different from n_S. Now, we explain the distance measure used in Eq. (3.1). Each pixel feature vector f consists of the pixel intensity and chromaticity information L, a, b, and the pixel coordinates X, Y. However, the (L, a, b) and (X, Y) components exhibit different ranges of values. Hence, the distance measure d(·) must be designed as a combination of two distance measures: a color distance d_1(·) and a spatial distance d_2(·), which are computed as the Euclidean distances between [L_i, a_i, b_i]^T and [L_c, a_c, b_c]^T, and between [X_i, Y_i]^T and [X_c, Y_c]^T, respectively. Then, these two measures are combined as:

d(f_i, f_c) = sqrt( (d_1/t_1)^2 + (d_2/t_2)^2 )    (3.2)

where t_1 and t_2 are two normalization constants. Here, the arguments of d_1(·) and d_2(·) have been dropped for simplicity. Since the search space is restricted to a 2P × 2P window, the maximum possible spatial distance between a pixel and the cluster center obtained using Eq. (3.1) is P. So, t_2 can be set to P. Thus, Eq. (3.2) can be rewritten as:

d(f_i, f_c) = sqrt( d_1^2 + Q (d_2/P)^2 )    (3.3)

where Q = t_1^2 is a compactness factor that acts as a weight for the two distances. Note that the denominator t_1 has been dropped because it does not impact the arg min operation in Eq. (3.1). A small value of Q gives more weight to the color distance and the resulting superpixels better adhere to the object boundaries (Fig. 3.1b), although the superpixels are less compact and regular in shape and size. The converse is true when Q is large (Fig. 3.1c). In the CIELab color space, typical values are Q ∈ [1, 40] [1]. It is possible that, even after convergence of the algorithm, there are some isolated pixels that are not connected to the superpixels they have been assigned to. As a post-processing step, such a pixel is assigned to the spatially nearest superpixel.
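A minimal usage sketch of SLIC via the scikit-image implementation is given below; the number of superpixels, the compactness value and the input file name are illustrative assumptions rather than prescriptions from this chapter.

```python
# Minimal SLIC superpixel sketch using scikit-image (illustrative parameters).
import numpy as np
from skimage import io, color
from skimage.segmentation import slic, mark_boundaries

image = io.imread("input.jpg")              # hypothetical input path
# n_segments plays the role of n_S and compactness that of Q in Eq. (3.3)
labels = slic(image, n_segments=500, compactness=20, start_label=0)
print("number of superpixels:", labels.max() + 1)

# Mean Lab color of each superpixel: a simple low-level attribute per region
lab = color.rgb2lab(image)
features = np.stack([lab[labels == s].mean(axis=0)
                     for s in range(labels.max() + 1)])
print("feature matrix shape:", features.shape)   # (num_superpixels, 3)

# Overlay superpixel boundaries for visual inspection
boundary_image = mark_boundaries(image, labels)
```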
3.2 Label Propagation
Datasets often contain many samples that are unlabeled and only a small number that are labeled. This may occur due to the large cardinality of datasets or the lack of annotations. Let X = X_a ∪ X_b be a set of n data sample features, where X_a = {x_1, x_2, . . . , x_m} and X_b = {x_{m+1}, x_{m+2}, . . . , x_n} are the labeled and unlabeled sets, respectively. Each x_i ∈ X_a has a label L(x_i) ∈ {1, 2, . . . , K}, which indicates that every labeled sample belongs to one of the K classes. Label propagation techniques are designed to predict the labels of the unlabeled samples from the available labeled samples. Hence, it is a type of semi-supervised learning. This has several applications including classification, image segmentation and website categorization.
Semi-supervised learning methods [12, 53, 55, 150] learn a classifying function from the available label information and the combined space of the labeled and unlabeled samples. This learning step incorporates a local and a global labeling consistency assumption. (i) Local consistency: neighboring samples should have the same label, and (ii) global consistency: samples belonging to a cluster should have the same label. Then this classifying function assigns labels to the unlabeled samples. We first describe this process for two classes (K = 2) and then for multiple classes (K > 2).
3.2.1 Two-class Label Propagation
We explain the two-class label propagation process by performing foreground–background segmentation of an image. Let each image pixel-i be represented by its feature vector x_i. A certain subset of the pixels is labeled as foreground (L(x_i) = 1) or background (L(x_i) = 2), and they constitute the set X_a. Let p_F(x) and p_B(x) be two distributions obtained from the foreground and background labeled features, respectively. Subsequently, all pixels in the image are classified as either foreground (y_i → 1) or background (y_i → 0) by minimizing the following cost function:

E(y) = Σ_{i=1}^{n} (1 − y_i)^2 p_F(x_i) + Σ_{i=1}^{n} y_i^2 p_B(x_i) + Σ_{i=1}^{n} Σ_{j=1}^{n} S_{ij} (y_i − y_j)^2,    (3.4)
where y_i ∈ [0, 1] is the likelihood of pixel-i being foreground, and S ∈ R^{n×n} is a feature similarity matrix. One can use any appropriate measure to compute S for a particular task. For example, S_{ij} can be calculated as the negative exponential of the normalized distance between x_i and x_j. Here, p_F(x) and p_B(x) act as the prior probabilities of a pixel x being foreground or background, respectively. In Eq. (3.4), minimization of the first term forces y_i to take a value close to 1 for foreground pixels because a pixel-i with a large p_F(x_i) is more likely to belong to the foreground. For similar reasons, minimization of the second term forces y_i to take a value close to 0 for background pixels since they have a large p_B(x_i). Observe that these two terms together attain the global consistency requirement mentioned earlier. The third term maintains smoothness in labeling by forcing neighboring pixels to have the same label. If two pixels are close in the feature space, S_{ij} will be large, and the resulting y_i and y_j will be close in order to minimize the third term in Eq. (3.4); consequently, the local consistency requirement is satisfied. Here, the diagonal elements of S are set to 0 to avoid self-bias. So far, the neighborhood has been considered only in the feature space. However, one may also consider a spatial neighborhood by (i) setting S_{ij} = 0 if pixel-i is not in the neighborhood of pixel-j, or (ii) scaling S_{ij} by the inverse of the spatial distance between pixel-i and pixel-j, or even (iii) by certain post-processing when (i) and (ii) are not applicable. We will describe one such method in Chap. 6. As another example, consider an undirected graph where each node
represents a sample x_i and the neighborhood is defined by the presence of edges. It may be noted that, irrespective of the initial labels L(x_i), the design of Eq. (3.4) will assign labels to all pixels in X_a as well as in X_b. However, the labels of most pixels in X_a will not change since the features x_i ∈ X_a have been used to compute the prior probabilities and they have a large p_F(x_i) (if L(x_i) = 1, i.e., y_i → 1) or a large p_B(x_i) (if L(x_i) = 2, i.e., y_i → 0). Thus, the first two terms in Eq. (3.4) jointly act as a data fitting term. Label propagation is evident in the third term of Eq. (3.4), as an unlabeled pixel-j is likely to have the same label as pixel-i if S_{ij} is large. In the first two terms, label propagation occurs indirectly through the prior probabilities, which are computed from the labeled pixel features x_i ∈ X_a. The final labeling y_i is influenced by p_F(x) and p_B(x) as explained earlier. Minimization of E can be performed by first expressing it in vector–matrix notation:

E(y) = Σ_{i=1}^{n} p_F(x_i) − 2 y^T p_F + y^T P_F y + y^T P_B y + y^T (D − S) y
     = Σ_{i=1}^{n} p_F(x_i) − 2 y^T p_F + y^T (P_F + P_B + D − S) y,    (3.5)

where P_F and P_B are diagonal matrices with {p_F(x_i)}_{i=1}^{n} and {p_B(x_i)}_{i=1}^{n} as diagonal elements, respectively, and D is also a diagonal matrix with D_{ii} = Σ_j S_{ij}. The cost E can be minimized with respect to y to obtain the solution as:

−2 p_F + 2 (P_F + P_B + D − S) y = 0
⟹ y = (P_F + P_B + D − S)^{−1} p_F.    (3.6)
This solution will yield yi ∈ [0, 1], and it can be thresholded to obtain the label for each pixel-i as either foreground or background.
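The closed-form solution in Eq. (3.6) can be computed directly with a linear solve. The sketch below assumes precomputed features and priors; the Gaussian-style similarity and the 0.5 threshold are illustrative assumptions, not choices made in this chapter.

```python
# Two-class label propagation via Eq. (3.6): y = (P_F + P_B + D - S)^(-1) p_F
import numpy as np

def propagate_labels(features, p_f, p_b, sigma=1.0, threshold=0.5):
    """features: (n, d) array; p_f, p_b: (n,) foreground/background priors."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    S = np.exp(-dists / (sigma * dists.mean() + 1e-12))   # feature similarity
    np.fill_diagonal(S, 0.0)                               # avoid self-bias
    D = np.diag(S.sum(axis=1))
    A = np.diag(p_f) + np.diag(p_b) + D - S                # P_F + P_B + D - S
    y = np.linalg.solve(A, p_f)                            # soft foreground scores
    return y, (y > threshold).astype(int)                  # 1 = foreground, 0 = background

# Tiny synthetic example: two feature clusters with a weak prior on a few samples
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
p_f = np.zeros(40); p_f[:3] = 1.0      # a few samples marked foreground
p_b = np.zeros(40); p_b[-3:] = 1.0     # a few samples marked background
y_soft, labels = propagate_labels(x, p_f, p_b)
```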
3.2.2 Multiclass Label Propagation
In the two-class label propagation process, each sample x_i is classified into either of the two classes based on the obtained value of y_i. However, in the case of multiple classes (K > 2), the initial label information of a sample x_i ∈ X_a is represented using L_i ∈ R^{1×K}, where L_{ik} = 1 if x_i has label k and L_{ik} = 0 otherwise. Since the samples x_i ∈ X_b are unlabeled, for them L_{ik} = 0, ∀k. Let us denote L = [L_1^T, L_2^T, . . . , L_n^T]^T ∈ R^{n×K}. It is evident that (i) for an unlabeled sample x_i, the i-th row L_i is all-zero, and (ii) for a labeled sample x_i, the index of the value 1 in L_i specifies its initial label. Similar to the two-class process, the goal of the multiclass label propagation process is to obtain a final label matrix containing the labels of all samples in X. Following the design of L, let us denote it as Y = [Y_1^T, Y_2^T, . . . , Y_n^T]^T. The label information
of each sample x_i is obtained from Y_i ∈ R^{1×K}, whose elements Y_{ik} determine the likelihood of the sample belonging to class-k. Thus, the final label to be assigned to x_i is obtained as:

L(x_i) = arg max_k Y_{ik}.    (3.7)

To calculate Y, a label propagation cost function E_multi(Y) can be formulated as:

E_multi(Y) = χ Σ_{i=1}^{n} ||Y_i − L_i||^2 + Σ_{i=1}^{n} Σ_{j=1}^{n} S_{ij} || (1/sqrt(D_{ii})) Y_i − (1/sqrt(D_{jj})) Y_j ||^2,    (3.8)
where χ is a regularization parameter that weighs the two terms. The first term is the data fitting term that ensures that the final labeling is not far from the initial labeling. Unlike in Eq. (3.4), multiple prior probabilities are not computed here for different classes. Similar to Eq. (3.4), the second term in Eq. (3.8) satisfies the local consistency requirement and ensures smoothness in labeling. Y_i and Y_j are further normalized by D_{ii} and D_{jj}, respectively, to incorporate the similarities of x_i and x_j with their respective neighboring samples. To obtain the optimal Y, the cost E_multi is minimized with respect to Y as [150]:

2χ(Y − L) + 2(I − D^{−1/2} S D^{−1/2}) Y = 0
⟹ Y = μ(I − ω_l S̃)^{−1} L,    (3.9)
where μ = χ/(1 + χ), ω_l = 1/(1 + χ) and S̃ = D^{−1/2} S D^{−1/2}. Then labels can be computed using Eq. (3.7). The above regularization framework can also be expressed as an iterative algorithm [150] where the initial label matrix L gets iteratively updated to finally obtain the optimal Y at convergence. Let Y^{(0)} = L, and the label update equation for t ≥ 1 is:

Y^{(t)} = ω_l S̃ Y^{(t−1)} + (1 − ω_l) L,    (3.10)
where 0 < ω_l < 1 is a regularization parameter. The first term updates Y^{(t−1)} to Y^{(t)} using the normalized similarity matrix S̃. To understand the label propagation, consider ω_l = 1 and Y^{(t)} = S̃ Y^{(t−1)}. Thus, Y_{ik}^{(t)} = Σ_{j=1}^{n} S̃_{ij} Y_{jk}^{(t−1)}. This illustrates that if sample x_i has a large similarity with sample x_j, the likelihood of x_j belonging to class-k, i.e., Y_{jk}^{(t−1)}, influences Y_{ik}^{(t)}, i.e., the likelihood of x_i also belonging to class-k. Thus, label propagation occurs from x_j to its neighbor x_i. The second term in Eq. (3.10) ensures that the final label matrix is not far from the initial label matrix L. To obtain the optimal label matrix Y* = lim_{t→∞} Y^{(t)} at convergence, we rewrite Eq. (3.10) using recursion as:
t−1 i=0
˜ iL . (ωl S)
(3.11)
Fig. 3.2 Example of maximum common subgraph of two graphs G1 and G2 . The set of nodes V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 } and edges in the maximum common subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted (in blue)
Since the eigenvalues of S̃ lie in [0, 1], we have (i) lim_{t→∞} (ω_l S̃)^{t−1} = 0 and (ii) lim_{t→∞} Σ_{i=0}^{t−1} (ω_l S̃)^i = (I − ω_l S̃)^{−1}. Hence,

Y* = lim_{t→∞} Y^{(t)} = (1 − ω_l)(I − ω_l S̃)^{−1} L,    (3.12)

which is proportional to the closed form solution in Eq. (3.9).
3.3 Subgraph Matching
In this section, we briefly describe the maximum common subgraph (MCS) computation for any two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2). Here, V_i = {v_k^i} and E_i = {e_{kl}^i} for i = 1, 2 denote the sets of nodes and edges, respectively. Typically, each node is attributed with a label (e.g., digits or strings) or a representative vector depending on the task that is being solved using graphs, whereas edges represent a certain association among the nodes as may be specified in the dataset under consideration. The MCS corresponds to a pair of subgraphs, H_1 in G_1 and H_2 in G_2. Thus, the nodes in the resulting H_1 should have a high similarity in their attributes with the nodes in H_2. Further, the nodes in both H_1 and H_2 should be cohesive through edge connectivity. The maximum common subgraphs for an example graph pair G_1 and G_2 are highlighted in Fig. 3.2. This being a computationally very demanding task, we use an appropriate solution to compute the MCS. To find the MCS, we first build a product graph W (also known as the vertex product graph) from the graphs G_1 and G_2 based on their inter-graph attribute similarities. If labels are used as attributes, one may consider that a node v^1 ∈ G_1 is similar to a node v^2 ∈ G_2 when their labels match exactly, i.e., L(v^1) = L(v^2). In the case of attribute vectors, one may compare the corresponding vector distance with a pre-decided threshold to conclude whether a specific node pair matches or not. A node in a product graph [61] is denoted as a 2-tuple (v_k^1, v_l^2) with v_k^1 ∈ G_1 and v_l^2 ∈ G_2. Let us call it a product node to differentiate it from single graph nodes. We define the set of product nodes U_W of the product graph W as:
3.3 Subgraph Matching
U_W = { (v_k^1, v_l^2) | v_k^1 ∈ V_1, v_l^2 ∈ V_2, L(v_k^1) = L(v_l^2) }, considering attribute labels,    (3.13)
or
U_W = { (v_k^1, v_l^2) | v_k^1 ∈ V_1, v_l^2 ∈ V_2, d(v_k^1, v_l^2) < t_G }, considering attribute vectors,
(3.14) where tG is a threshold. In W , an edge is added between two product nodes vk11 , vl21
and vk12 , vl22 with k1 = k2 ∧ l1 = l2 if
UW =
C1. ek11 k2 exists in G1 and el21 l2 exists in G2 , or C2. ek11 k2 is not present in G1 and el21 l2 is not present in G2 ,
where ∧ stands for the logical AND operation. In the case of product nodes vk1 , vl21
and vk1 , vl22 (i.e., k1 = k2 ), an edge is added if el21 l2 exists. As edges in the product graph W represent matching, the edges in its complement graph W C and the product nodes which they are incident on, represent non-matching, and such product nodes are essentially the minimum vertex cover (MVC) of W C . The MVC of a graph is the smallest set of vertices required to cover all the edges in that graph [31]. The set of product nodes (U M ⊆ U W ) other than this MVC represents the matched product nodes that form the maximal clique of W in the literature [17, 61]. Let V1H ⊆ V1 and V2H ⊆ V2 be the set of nodes in the corresponding common subgraphs H1 in G1 and H2 in G2 , respectively, with
V_1^H = { v_k^1 | (v_k^1, v_l^2) ∈ U_M } and    (3.15)
V_2^H = { v_l^2 | (v_k^1, v_l^2) ∈ U_M },    (3.16)
and they correspond to the matched nodes in G1 and G2 , respectively. Note H1 and H2 are induced subgraphs. Step-by-step demonstration of MCS computation is shown in Figs. 3.3, 3.4, 3.5, 3.6, 3.7 and 3.8 using three example graph pairs. The graphs G1 and G2 in Fig. 3.3 have node set
V_1 = {v_1^1, v_2^1, v_3^1, v_4^1, v_5^1, v_6^1, v_7^1, v_8^1}, V_2 = {v_1^2, v_2^2, v_3^2, v_4^2, v_5^2, v_6^2, v_7^2, v_8^2, v_9^2}, and the edge information is captured by the binary adjacency matrices

A_1 =
[0 1 0 0 0 0 1 1
 1 0 1 0 0 0 0 1
 0 1 0 1 0 0 0 1
 0 0 1 0 1 0 0 1
 0 0 0 1 0 1 0 1
 0 0 0 0 1 0 1 1
 1 0 0 0 0 1 0 1
 1 1 1 1 1 1 1 0],

A_2 =
[0 1 0 0 0 0 0 1 1
 1 0 1 0 0 0 0 0 1
 0 1 0 1 0 0 0 0 1
 0 0 1 0 1 0 0 0 1
 0 0 0 1 0 1 0 0 0
 0 0 0 0 1 0 1 0 1
 0 0 0 0 0 1 0 1 0
 1 0 0 0 0 0 1 0 1
 1 1 1 1 0 1 0 1 0].
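As a sketch of the above procedure under stated assumptions (node labels as attributes; graphs represented as NetworkX objects), the MCS can be obtained by building the product graph and then taking the complement of the minimum vertex cover of its complement, which is equivalent to a maximum clique of the product graph itself. Clique enumeration is exponential in the worst case, so this is only illustrative for small graphs such as the examples in this section.

```python
# Maximum common subgraph via the product graph and a maximum clique (illustrative sketch).
import networkx as nx
from itertools import combinations

def maximum_common_subgraph(G1, G2, label="label"):
    """G1, G2: networkx graphs whose nodes carry a 'label' attribute."""
    W = nx.Graph()
    # Product nodes: pairs with matching labels, as in Eq. (3.13)
    W.add_nodes_from((u, v) for u in G1 for v in G2
                     if G1.nodes[u][label] == G2.nodes[v][label])
    # Product edges: conditions C1 and C2, plus the shared G1-node case
    for (u1, v1), (u2, v2) in combinations(W.nodes, 2):
        if u1 == u2 and v1 != v2:
            if G2.has_edge(v1, v2):
                W.add_edge((u1, v1), (u2, v2))
        elif u1 != u2 and v1 != v2:
            if G1.has_edge(u1, u2) == G2.has_edge(v1, v2):   # C1 or C2
                W.add_edge((u1, v1), (u2, v2))
    # U_M = complement of the MVC of W^C = a maximum clique of W
    U_M = max(nx.find_cliques(W), key=len) if W.number_of_nodes() else []
    V1H = {u for u, _ in U_M}
    V2H = {v for _, v in U_M}
    return G1.subgraph(V1H), G2.subgraph(V2H)
```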
Fig. 3.3 Maximum common subgraph computation. a, b Two graphs G1 and G2 . c Product nodes U W of the product graph W are added based on inter-graph similarities among nodes in C G1 and G2 . d Edges in W are added based on the conditions. complement graph (W ) of the
1e The 2 product graph W shows that its minimum vertex cover is v1 , v6 . f The nodes in the complement set (U M ) of the MVC constitute the MCS. g, h The subgraphs H1 and H2 of G1 and G2 , respectively
The product node set obtained using Eq. (3.13) or Eq. (3.14) is
U W = { v11 , v12 , v21 , v22 , v71 , v82 , v81 , v92 , v11 , v62 } . Here, v11 ∈ G1 matched with both v12 , v62 ∈ G2 . Then edges are added between the nodes
using the conditions C1 and C2. Specifically, the edge between
product v21 , v22 and v71 , v82 exists due to condition C2 and the remaining edges exist due to condition C1 (Fig. 3.3d). The minimum vertex cover of the complement graph is
Fig. 3.4 Maximum common subgraph computation for the same two graphs G1 and G2 of Fig. 3.3 considering different inter-graph node similarities. a Product nodes U W of the product graph W . C
b Edges in W . c The complement graph (W ) of W shows that its minimum vertex cover is v11 , v62 . d The nodes in the complement set (U M ) of the MVC constitute the MCS. e The set of nodes V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 } and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted
1 2 v1 , v6 since it covers all the edges in W C (Fig. 3.3e). The complement set of the MVC is
U M = U W \MVC
= { v11 , v12 , v21 , v22 , v71 , v82 , v81 , v92 }, and it provides the nodes of the resulting subgraphs as V1H = {v11 , v21 , v81 , v71 } and V2H = {v12 , v22 , v92 , v82 }. Figure 3.4 shows an example of a graph pair with the same set of nodes and edges as in Fig. 3.3, but having different node attributes, and resulting in a different product node set, given as:
U W = { v11 , v12 , v21 , v82 , v71 , v22 , v81 , v92 , v11 , v62 } . However, we observe that the resulting subgraphs are the same. Figure 3.5 shows another similar example graph pair, but having different node attributes from both Figs. 3.3 and 3.4. Here, the complement of the product graph contains a sin-
Fig. 3.5 Maximum common subgraph computation for the same two graphs G1 and G2 of Fig. 3.3 considering different inter-graph node similarities. This example demonstrates the possibility of non-unique minimum vertex cover (MVC). a Product nodes U W of the product
graph W . b Edges in W . c The complement graph (W C ) of W shows that its MVC is either v61 , v62 or v71 , v82 , and d, e corresponding nodes in the complement set (U M ) of the MVC constitute the MCS. f, g Two possible subgraph pairs with the set of nodes V1H = {v11 , v21 , v81 , v71 }, V2H = {v12 , v22 , v92 , v82 }, or V1H = {v11 , v21 , v81 , v61 }, V2H = {v12 , v22 , v92 , v62 }
Fig. 3.6 Maximum common subgraph computation for the same two graphs G1 and G2 of Fig. 3.3 considering different inter-graph node similarities. a Product nodes U W of the product graph W . b C Edges in W (W
. c The complement
graph
) of W shows that its minimum vertex cover is either
{ v11 , v62 , v31 , v22 } or { v11 , v62 , v11 , v12 }. d The nodes in the complement set (U M ) of the MVC
{ v11 , v62 , v31 , v22 } constitute the MCS. e The set of nodes V1H = {v11 , v81 , v71 }, V2H = {v12 , v92 , v82 } and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted
gle edge, an ambiguity in MVC. The MVC can be chosen as
and this creates either v61 , v62 or v71 , v82 . These choices result in maximum common subgraphs with node sets either V1H = {v11 , v21 , v81 , v71 } and V2H = {v12 , v22 , v92 , v82 } (Fig. 3.5f), or V1H = {v11 , v21 , v81 , v61 } and V2H = {v12 , v22 , v92 , v62 } (Fig. 3.5g). One may choose either result depending on the task in hand. Figures 3.6, 3.7 and 3.8 show more examples. It can be observed in Fig. 3.7d that the product graph W is a complete graph, hence, the MVC is an empty set resulting in U M = U W . Thus, all product nodes constitute the MCS.
Fig. 3.7 Maximum common subgraph computation. a, b Two graphs G1 and G2 . c Product nodes U W of the product graph W are added based on inter-graph similarities among nodes in G1 and G2 . d Edges in W . e Since W is a complete graph; its complement (W C ) does not have any edges. Hence, the minimum vertex cover of W C is an empty set. f The nodes in the complement set (U M ) of the MVC constitute the MCS. g The set of nodes V1H = {v11 , v21 , v31 , v81 , v71 }, V2H = {v12 , v22 , v42 , v92 , v82 } and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted
Fig. 3.8 Maximum common subgraph computation. a, b Two graphs G1 and G2 . c Product nodes U W of the product graph W are added based on inter-graph similarities among nodes in C G1 and G2 . d Edges in W . e The
complement graph (W ) of the product graph W shows that its minimum vertex cover is either v11 , v12 or v51 , v22 . f The nodes in the complement set (U M ) of the
MVC v11 , v12 constitute the MCS. g The set of nodes V1H = {v81 , v41 , v51 , v61 }, V2H = {v92 , v32 , v22 , v62 } and edges in the subgraphs H1 and H2 of G1 and G2 , respectively, are highlighted
3.4 Convolutional Neural Network
A convolutional neural network (CNN) is one of the variants of neural networks. It typically operates on images and extracts semantic features using convolution filters. The primary efficacy of a CNN is that it automatically learns those filters by optimizing a task specific loss function computed over a set of training samples. CNNs were primarily developed for classification, and when compared to other classification methods, the amount of preprocessing required by a CNN is significantly less. While basic approaches require handcrafting of filters, a CNN can learn these filters and feature extractors with enough training. The architecture of a CNN is inspired by the organization of the visual cortex and is akin to the connectivity pattern of neurons in the human brain. Individual neurons can only react to stimuli in a limited region of the visual field called the receptive field. A number of similar fields can be stacked on top of each other to span the full visual field. The benefits of applying CNNs to image data in place of fully connected networks (FCNs) are manifold. First, a convolutional layer (CL) in a CNN extracts meaningful features by preserving the spatial relations within the image. On the other hand, a fully connected (FC) layer in an FCN, despite being a universal function approximator, remains poor at identifying and generalizing from raw image pixels. Another important aspect is that a CL summarizes the image and yields more concise feature maps for the next CL in the CNN pipeline (also known as the network architecture). In this way, CNNs provide dimensionality reduction and reduce the computational complexity. This is not possible with FC layers. Lastly, FCNs enforce static image sizes due to their inherent properties, whereas CNNs permit us to work on arbitrarily sized images, especially in the fully convolutional setting. Suppose the image I ∈ R^{C×M×N} is a tensor with C channels and M × N pixels. Let h ∈ R^{C×K×L} be the convolution kernel of a filter, where K < M and L < N. Typically, K and L are chosen to be equal and odd. Now, the convolution operation at a pixel (i, j) of the image I with respect to the kernel h can be defined as:

F[i, j] = Σ_{c=1}^{C} Σ_{k=0}^{K−1} Σ_{l=0}^{L−1} h[c, k, l] I[c, i − k, j − l],    (3.17)
where F ∈ R M×N is an output feature map obtained after the first convolutional layer. The kernel slides over the entire tensor to produce values at all pixel locations. It may be noted that, the same spatial size of I and F is ensured by padding zeros to boundaries of I before convolution. Without zero-padding, the output spatial size will be less than that of the input. Typically, instead of using a single filter, a set of Co filters {h} are used, and outputs of all the filters are concatenated channelwise to obtain the consolidated feature map F ∈ RCo ×M×N . Next, this feature map is input to the second convolutional layer, which uses a different set of filters and produces an output. This process is then repeated for subsequent convolutional layers in the CNN. The output of the final CL can be used as the image feature in a range of
Fig. 3.9 CNN architecture. The network in this example has seven layers. The first four layers are the convolutional layers and the last three layers are fully connected layers. Figure courtesy: Internet
computer vision problems. Typically, any CNN designed for the task of classification or regression requires a set of fully connected layers to follow the final CL (Fig. 3.9). In some cases, the convolution operation is not performed at each and every point of the input tensor. Specifically, this approach is adopted when the input image has pixelwise redundancy. In this context, the stride of a convolution operation is defined as the number of pixels the convolution kernel slides to reach the next point of operation. For example, if δ_w and δ_h are the strides across the width and height of the image I, then after the operation at a certain point (i, j) as shown in Eq. (3.17), the next operations will be performed at points (i + δ_h, j), (i, j + δ_w) and (i + δ_h, j + δ_w). Thus, the output feature map shape will be C_o × ((M + P − K)/δ_h + 1) × ((N + P − L)/δ_w + 1), where P is the number of zero rows and columns padded. The value of the stride can be determined from the amount of information one may want to drop. Thus, the stride is certainly a hyperparameter. Further, the kernel size is also a hyperparameter. If the size is very large, the kernel accumulates a large amount of contextual information at every point, which might be very useful for some complex tasks. However, this increases the space and time complexity. On the other hand, reducing the kernel size essentially simplifies the complexity, but also reduces the network's expressive quality. Similar to FCNs, a sequence of convolutional layers can also be viewed as a graph with image pixels (in the input layer) and points in feature maps (in the subsequent layers) as nodes and kernel values as edge weights. However, the edge connectivity is sparse since the kernel size is much smaller than the image size. An FC layer is essentially a CL when (i) the input layer vector is reshaped to a 1 × M × N array and (ii) C filters with convolution kernels of size 1 × M × N are applied at the center point only, without any sliding. The resulting C × 1 output map will be the same as the output of the FC layer.
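The padded, strided output shape described above can be verified with an off-the-shelf convolutional layer. The PyTorch sketch below uses arbitrary illustrative channel counts, kernel size and stride; note that the padding argument there is applied per side.

```python
# Checking the output shape C_o x ((M + 2P - K)/stride + 1) x ((N + 2P - L)/stride + 1)
# with a standard convolutional layer (PyTorch); all sizes are illustrative.
import torch
import torch.nn as nn

C, M, N = 3, 32, 32            # input channels and spatial size
C_o, K = 16, 3                 # number of filters and kernel size
stride, padding = 2, 1         # stride and zero-padding per side

conv = nn.Conv2d(in_channels=C, out_channels=C_o,
                 kernel_size=K, stride=stride, padding=padding)

x = torch.randn(1, C, M, N)    # a single input image tensor
y = conv(x)
print(y.shape)                 # torch.Size([1, 16, 16, 16])
```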
3.4.1 Nonlinear Activation Functions
In modern neural networks, nonlinear activation functions are used to enable the network to build complicated mappings between inputs and outputs, which are critical for learning and modeling complex and higher dimensional data distributions. Being
linear, convolutional layers and FC layers alone cannot achieve this. Hence, they are typically followed by nonlinear activation layers in the network. One of the most widely used nonlinear activation functions in the neural network community is the sigmoid, defined as follows:

σ(x) = 1 / (1 + exp(−x)),    (3.18)
where x denotes the output of a CL or an FC layer. Another classical nonlinear activation function is the hyperbolic tangent, which is used whenever the intermediate activations are required to be zero-centered. It is defined as:

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)).    (3.19)
However, it can be deduced from both equations that whenever x becomes extremely high or low, the functions start saturating. As a result, the gradient at those points becomes almost flat. This phenomenon is called the vanishing gradient problem, which can inhibit learning, and hence the optimization may not converge at all. The issue of vanishing gradients becomes more prominent and devastating with an increasing number of layers in neural networks. Therefore, almost all modern deep learning models do not use sigmoid or hyperbolic tangent-based activation functions. Instead, the rectified linear unit (ReLU) nonlinearity is used, which produces a constant gradient value independent of the scale of any positive input. As a result, the network converges faster than sigmoid and hyperbolic tangent-based networks. It is defined as:

ReLU(x) = { 0,  for x < 0
            x,  for x ≥ 0    (3.20)

Furthermore, since ReLU introduces sparsity in a network, the calculation load becomes significantly less than when using the sigmoid or hyperbolic tangent functions. This leads to a higher preference for it in deeper networks. However, it should be noted that when the input is negative, the function's gradient becomes zero, and the network is unable to learn through backpropagation. Therefore, sometimes the blessing of this nonlinearity can become a curse depending upon the task in hand. To avoid this problem, a small positive slope is added in the negative region. Thus, backpropagation is possible even for negative input values. This variant of ReLU is called the leaky ReLU. It is defined as:

Leaky ReLU(x) = { 0.01x,  for x < 0
                  x,      for x ≥ 0    (3.21)
With this concept, more flexibility can be achieved by introducing a scale factor α for the negative component, and this is called the parametric ReLU, defined as:

Parametric ReLU(x) = { αx,  for x < 0
                       x,   for x ≥ 0    (3.22)
Here, the slope α of the negative component is treated as a learnable parameter of the function. As a result, backpropagation can be used to determine the most appropriate value of α.
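The activation functions in Eqs. (3.18)–(3.22) can be written compactly with NumPy; the sketch below is purely illustrative, and the default slopes simply mirror the values used above.

```python
# Elementwise activation functions from Eqs. (3.18)-(3.22), written with NumPy.
import numpy as np

def sigmoid(x):                       # Eq. (3.18)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                          # Eq. (3.19)
    return np.tanh(x)

def relu(x):                          # Eq. (3.20)
    return np.where(x < 0, 0.0, x)

def leaky_relu(x, slope=0.01):        # Eq. (3.21)
    return np.where(x < 0, slope * x, x)

def parametric_relu(x, alpha):        # Eq. (3.22); alpha is learned in practice
    return np.where(x < 0, alpha * x, x)

x = np.linspace(-3, 3, 7)
print(relu(x))          # negative inputs are clipped to zero
print(leaky_relu(x))    # negative inputs keep a small slope
```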
3.4.2 Pooling in CNN
The output feature map of convolutional layers has the drawback of recording the exact position of features in the input. This means that even little changes in a feature's position in the input image will result in a different feature map. Recropping, rotation, shifting and other minor changes to the input image can cause this. Downsampling is a typical signal processing technique for solving this issue. This is done by reducing the spatial resolution of an input signal, keeping the main or key structural features but removing the fine detail that may not be as valuable to the task. In CNNs, the commonly used downsampling mechanism is called pooling. It is essentially a filter, with kernel size 2 × 2 and stride 2 in most networks. Different from a convolution filter, this kernel chooses either (i) the maximum value, or (ii) the average value from every patch of the feature map it overlaps with, and these values constitute the output. These two methods are known as max-pooling and average pooling, respectively. The standard practice is to have the pooling layer after the convolutional and nonlinearity layers. If a stride value of 2 is considered, the pooling layer halves the spatial dimensions of a feature map. Thus, if a feature map F ∈ R^{C_o×M×N} obtained after a convolutional and, say, ReLU layer is passed through a pooling layer, the resulting output is F̃ ∈ R^{C_o×M/2×N/2} (assuming even M and N), given as:

F̃[c, i, j] = max{ F[c, 2i − 1, 2j − 1], F[c, 2i − 1, 2j], F[c, 2i, 2j − 1], F[c, 2i, 2j] }    (3.23)

or

F̃[c, i, j] = (1/4) ( F[c, 2i − 1, 2j − 1] + F[c, 2i − 1, 2j] + F[c, 2i, 2j − 1] + F[c, 2i, 2j] ).    (3.24)

Figure 3.10 shows an example where a 4 × 4 patch of a feature map is pooled, both max and average, to obtain a 2 × 2 output. This change in shape is also depicted in Fig. 3.9 (after the first, third and fourth layers). In both strategies, there is no external parameter involved. Thus, the pooling operation is specified, rather than learned.
Fig. 3.10 Max and average pooling operations over a 4 × 4 feature map with a kernel of size 2 × 2 and stride 2
In addition to making a model invariant to small translations, pooling also makes training more computation- and memory-efficient due to the reduction in feature map size.
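A small NumPy sketch of the 2 × 2, stride-2 max and average pooling of Eqs. (3.23) and (3.24); the feature map values are arbitrary.

```python
# 2x2 max and average pooling with stride 2, as in Eqs. (3.23) and (3.24).
import numpy as np

def pool2x2(F, mode="max"):
    """F: feature map of shape (C, M, N) with even M and N."""
    C, M, N = F.shape
    patches = F.reshape(C, M // 2, 2, N // 2, 2)      # group non-overlapping 2x2 blocks
    if mode == "max":
        return patches.max(axis=(2, 4))               # Eq. (3.23)
    return patches.mean(axis=(2, 4))                  # Eq. (3.24)

F = np.arange(1 * 4 * 4, dtype=float).reshape(1, 4, 4)
print(pool2x2(F, "max"))     # shape (1, 2, 2)
print(pool2x2(F, "avg"))     # shape (1, 2, 2)
```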
3.4.3 Regularization Methods
In order to effectively limit the number of free parameters in a CNN so that overfitting can be avoided, it is necessary to enforce regularization over the parameters. A traditional method is to first formulate the problem in a Bayesian setting and then introduce a zero mean Gaussian or Laplacian prior over the parameters of the network while calculating its posterior. This is called L_2 or L_1 regularization depending upon the nature of the prior. In larger networks, while learning the weights, i.e., the filter kernel values, it is possible that some connections will be more predictive than others. As the network is trained iteratively through backpropagation over multiple epochs, in such scenarios, the stronger connections are learned more, while the weaker ones are ignored. Only a certain percentage of the connections get trained, and thus only the corresponding weights are learned properly and the rest cease taking part in learning. This phenomenon is called co-adaptation [119], and it cannot be prevented with the traditional L_1 or L_2 regularization. The reason for this is that they also regularize based on the connections' prediction abilities. As a result, they approach determinism in selecting and rejecting weights. Hence, the strong become stronger and the weak become weaker. To avoid such situations, dropout has been proposed [119]. Dropout: To understand the efficacy of dropout regularization, let us consider the simple case of an FCN with a single layer. It takes an input x ∈ R^d and has the weight vector w ∈ R^d. If t is the target, considering linear activation at the output neuron, the loss can be written as:
L = 0.5 ( t − Σ_{i=1}^{d} w_i x_i )^2    (3.25)
Now, let us introduce dropout to the above neuron, where the dropout variable is denoted as δ ∼ Bernoulli(p). In the context of that neuron, it signifies that the probability of randomly dropping any parameter w_i out of training is p. The loss function at the same neuron with dropout can be written as:
L_r = 0.5 ( t − Σ_{i=1}^{d} δ_i w_i x_i )^2    (3.26)
The expectation of the gradient of the loss with dropout [4] can be written in terms of the gradient of the loss without dropout as follows:

E[ ∂L_r / ∂w_i ] = ∂L/∂w_i + w_i p(1 − p) x_i^2    (3.27)
Thus, minimizing the dropout-based loss in Eq. (3.26) is effectively the same as optimizing a regularized network whose loss can be written as:
L̃_r = 0.5 ( t − Σ_{i=1}^{d} p w_i x_i )^2 + p(1 − p) Σ_{i=1}^{d} w_i x_i^2    (3.28)
While training a network using this regularization, in every iteration, a random p fraction of the weight parameters is not considered for the update. However, the randomness associated with the selection of the weights considered for the update ensures that, after a sufficient number of iterations when the network converges, all the weight parameters are learned properly. Batch normalization: In deep networks, the distribution of features varies across different layers at different points in time during training. As a result, the independent and identically distributed (i.i.d.) assumption over the input data does not hold [51]. This phenomenon in deep neural networks is called covariate shift [51], and it significantly slows down convergence since sufficient time is required for a network to adapt to the continuous shift of the data distribution. In order to reduce the effect of this shift, the features can be standardized batchwise at each layer by using the empirical mean and variance of those features computed over the concerned batch [51]. Since the normalization is performed in each batch, this method is called batch normalization.
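In practice, both regularizers are available as standard layers. The small PyTorch block below is a hedged illustration of where dropout and batch normalization typically sit in a convolutional pipeline; the channel sizes and dropout rate are arbitrary.

```python
# Illustrative placement of batch normalization and dropout in a small CNN block (PyTorch).
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),        # standardize features over the batch
    nn.ReLU(),
    nn.MaxPool2d(2),           # 2x2 pooling, stride 2
    nn.Dropout(p=0.5),         # randomly drop activations with probability p
)

x = torch.randn(8, 3, 32, 32)  # a batch of 8 images
block.train()                  # dropout and batch statistics active during training
print(block(x).shape)          # torch.Size([8, 16, 16, 16])
```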
3.4.4 Loss Functions
A neural network is a parametric function where the weights and biases are the learnable parameters. In order to learn those parameters, the network relies on a set of training samples with ground-truth annotations. The network predicts outputs for those samples, and the predicted outputs are compared with the corresponding ground-truth to compute the prediction error. The function used to measure this prediction error is known, in the deep neural network community, as the loss function. It takes two input arguments, the predicted output and the corresponding ground-truth, and provides the deviation between them, which is called the loss. The network then tries to minimize the loss, or the prediction error, by updating its parameters. Next, we discuss loss functions commonly used for training a CNN and the optimization process using them. The cross-entropy loss, or log loss, is used to measure the prediction error of a classification model that predicts the probabilities of each sample belonging to different classes. The loss (L_CE) for a sample is defined as:
L_{CE} = - \sum_{k=1}^{K} y_j(k) \log(\hat{y}_j(k)),    (3.29)
where K is the number of classes, and y_j(k) ∈ {0, 1} and \hat{y}_j(k) ∈ [0, 1] are the ground-truth and the predicted probability of sample-j belonging to class-k, respectively. It should be noted that softmax activation is applied to the logits (i.e., the final FC layer outputs) to transform the individual class scores into class probabilities before the L_CE computation. It can be seen that minimizing L_CE is equivalent to maximizing the log likelihood of correct predictions. The binary cross-entropy loss (L_BCE) is a special type of cross-entropy loss where the loss is computed only over two classes, the positive and the negative class, defined as:
L_{BCE} = -\left[ y_j \log(\hat{y}_j) + (1 - y_j) \log(1 - \hat{y}_j) \right],    (3.30)
where the ground-truth label y_j ∈ {1, 0} for the positive and negative class, respectively, and \hat{y}_j ∈ [0, 1] is the model's estimated probability that sample-j belongs to the positive class, obtained by applying the sigmoid nonlinearity to the logit. The focal loss is another type of cross-entropy loss that weighs the contribution of each sample to the loss based on the predictions. The rationale is that, if a sample can be easily classified correctly by the CNN, its estimated correct-class probability will be high. Thus, the contribution of the easy samples to the loss overwhelms that of the hard samples, whose estimated correct-class probabilities are low. Hence, the loss function should be formulated to reduce the contribution of easy samples. With this strategy, the loss focuses more on the hard samples during training. The focal loss (L_FL) for binary classification is defined as:
L_{FL} = -\left[ (1 - \hat{y}_j)^{\gamma}\, y_j \log(\hat{y}_j) + (\hat{y}_j)^{\gamma} (1 - y_j) \log(1 - \hat{y}_j) \right],    (3.31)
where (1 - \hat{y}_j)^{\gamma}, with the focusing parameter γ > 0, is a modulating factor that reduces the influence of easy samples in the loss.
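A short NumPy sketch of Eqs. (3.29)–(3.31) for a single sample is given below. It is only an illustrative implementation under the stated conventions; the small constant eps and the value γ = 2 are assumptions introduced here for numerical stability and a typical focusing strength, not values prescribed by the text.

```python
import numpy as np

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    # Eq. (3.29): y_onehot is the one-hot ground-truth, y_prob the softmax output
    return -np.sum(y_onehot * np.log(y_prob + eps))

def binary_cross_entropy(y, y_prob, eps=1e-12):
    # Eq. (3.30): y in {0, 1}, y_prob is the sigmoid output
    return -(y * np.log(y_prob + eps) + (1 - y) * np.log(1 - y_prob + eps))

def focal_loss(y, y_prob, gamma=2.0, eps=1e-12):
    # Eq. (3.31): easy samples (confident, correct predictions) are down-weighted
    return -((1 - y_prob) ** gamma * y * np.log(y_prob + eps)
             + y_prob ** gamma * (1 - y) * np.log(1 - y_prob + eps))

# a hard positive sample (low predicted probability) dominates the focal loss
print(binary_cross_entropy(1, 0.9), focal_loss(1, 0.9))
print(binary_cross_entropy(1, 0.2), focal_loss(1, 0.2))
```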
3.4.5 Optimization Methods

With a loss function in hand, the immediate purpose of a deep neural network is to minimize the loss by updating its parameters. Since a convolutional neural network includes a large number of convolutional and fully connected layers, and thus a large number of learnable parameters, minimizing the loss function with respect to that vast set of parameters is not straightforward. Employing an ineffective optimization approach can have a substantial impact on the training process, jeopardizing the network's performance and training duration. In this section, we discuss some common optimization methods and their pros and cons. Consider a CNN f_cnn(·; w_cnn), parameterized by a set of weights w_cnn and composed of a sequence of functions, each of which represents a specific layer with its own set of parameters:
f_{cnn}(\cdot; w_{cnn}) = f_L\big( \ldots \big( f_1(\cdot; w_1); \ldots \big); w_L \big).    (3.32)
The network consists of L layers, where each layer-i has its own set of parameters w_i. The parameters w_i represent the weights of the convolution filters and the biases in a convolutional layer, and the weights and biases of the fully connected operation in a fully connected layer. During training, if the network's prediction is \hat{y} for a sample x with ground-truth y, the loss L is computed as some distance measure (Sect. 3.4.4) between y and \hat{y} = f_cnn(x; w_cnn). Thus, the loss function can be parameterized by w_cnn as L(·; w_cnn). Further, it is designed to be continuous and differentiable at each point, allowing it to be minimized directly using gradient descent optimization, which minimizes the loss by updating w_cnn in the opposite direction of the gradient of the loss function ∇_{w_cnn} L(·; w_cnn) as:

w_{cnn}^{(t+1)} = w_{cnn}^{(t)} - \eta \nabla_{w_{cnn}} L(\cdot; w_{cnn}^{(t)}),    (3.33)
where the learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function L(·; w_cnn) downhill until we reach a valley. There exist three types of gradient descent approaches: batch gradient descent, stochastic gradient descent and mini-batch gradient descent, where each variant is characterized by the amount of data utilized to compute the gradient of the objective function. However, it should be noted that there is a trade-off between the accuracy of the parameter updation and the total training time of the model, which is discussed in detail next.

Batch gradient descent: Let {(x_1, y_1), …, (x_n, y_n)} be the training dataset. Batch gradient descent estimates the average of ∇_{w_cnn} L(·; w_cnn) over the entire training set
and uses that mean gradient to update the parameters in each iteration or epoch as:

w_{cnn}^{(t+1)} = w_{cnn}^{(t)} - \eta\, \frac{1}{n} \sum_{j=1}^{n} \nabla_{w_{cnn}} L(y_j, \hat{y}_j; w_{cnn}^{(t)})    (3.34)
For convex loss functions, batch gradient descent is guaranteed to converge to the global minimum; for non-convex loss functions, there is no such guarantee. Since the average gradient is computed over the entire training set, it results in a stable gradient and a stable convergence to an optimum. In practice, however, the entire training set may be too large to fit in memory at once, necessitating additional memory.

Stochastic gradient descent: In contrast to batch gradient descent, stochastic gradient descent (SGD) performs a parameter updation for each training sample (x_j, y_j) as:

w_{cnn}^{(t+1)} = w_{cnn}^{(t)} - \eta \nabla_{w_{cnn}} L(y_j, \hat{y}_j; w_{cnn}^{(t)})    (3.35)

This type of parameter updation enables SGD to escape local minima if the optimizer gets trapped in one; thus, it is able to arrive at a better minimum over time. Due to the intrinsically high variance of the gradient computed from a single sample, the noisier gradient computation is better suited to a loss surface with a large number of local minima. However, excessively frequent traversals of the loss surface may impair the optimizer's ability to stay at a decent minimum once it is discovered. In such situations, selecting an appropriate learning rate becomes critical so that the movement can be stabilized as necessary. Compared to batch gradient descent, larger datasets can be processed with SGD, since it stores a single sample at a time in memory for optimization. It is also computationally faster per update because it processes only one sample at a time, making it suitable for performing optimization in an online fashion.

Mini-batch gradient descent: This algorithm is a hybrid of stochastic and batch gradient descent. In each epoch, the training set is randomly partitioned into multiple mini-batches, each of which contains a specified number of training samples (say, m). A single mini-batch is passed through the network at a time, and the average of ∇_{w_cnn} L(·; w_cnn) over it is calculated to update the weights as:

w_{cnn}^{(t+1)} = w_{cnn}^{(t)} - \eta\, \frac{1}{m} \sum_{j=1}^{m} \nabla_{w_{cnn}} L(y_j, \hat{y}_j; w_{cnn}^{(t)})    (3.36)
This method establishes a trade-off between batch gradient descent and SGD and hence has the advantages of both. Compared to SGD, it reduces the variance of the parameter updates, potentially resulting in steadier convergence. Furthermore, by choosing an appropriate mini-batch size, the training data can easily fit in memory. Considering the various parameter optimization methods stated in Eqs. (3.34)–(3.36), the immediate question is how the gradient of the loss computed at the output of a network can influence the intermediate layer parameters in order to update them.
The method of backpropagation executes this task as follows. Considering only the weight parameters w_l in an intermediate layer-l, from Eq. (3.35) we can write

w_l^{(t+1)} = w_l^{(t)} - \eta \nabla_{w_l} L(y_j, \hat{y}_j; w_l^{(t)}).    (3.37)
From Eq. (3.32), the gradient of the loss with respect to the parameters of this layer can be computed using the chain rule as:

\nabla_{w_l} L(\cdot; w_l^{(t)}) = \nabla_{w_L} L(\cdot; w_L^{(t)})\, \nabla_{w_{L-1}} f_{L-1}(\cdot; w_{L-1}) \ldots \nabla_{w_l} f_l(\cdot; w_l),    (3.38)
that is, the loss gradient can be propagated back to the desired layer, and hence this method is called backpropagation.
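The following NumPy sketch ties Eqs. (3.33)–(3.36) together for the simplest possible model, a single linear layer trained with a squared-error loss. The closed-form gradient stands in for backpropagation only because the model has one layer; the learning rate, batch size and epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=50):
    """Mini-batch gradient descent (Eq. 3.36) for y_hat = X @ w with
    a 0.5 * (y - y_hat)^2 loss, averaged over each mini-batch."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                       # new random partition every epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # average gradient over the batch
            w -= lr * grad                               # update of Eq. (3.33)
    return w

# toy data generated from a known weight vector
X = rng.normal(size=(256, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=256)
print(np.round(minibatch_sgd(X, y), 2))
```

Setting batch_size to 1 or to the full dataset size recovers the stochastic and batch variants of Eqs. (3.35) and (3.34), respectively.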
3.5 Graph Convolutional Neural Network

Many important real-world datasets, such as social networks, knowledge graphs, protein-interaction networks and the World Wide Web, to name a few, come in the form of graphs or networks. Yet, until recently, very little attention had been devoted to the generalization of neural network models to such structured datasets. Given a graph with n nodes (Fig. 3.11a), a graph signal f_in ∈ R^n is constructed by concatenating the node attributes. The graph convolution operation on this graph signal can be defined as:

\bar{H} = \bar{h}_0 I + \bar{h}_1 A + \bar{h}_2 A^2 + \cdots + \bar{h}_L A^L,    (3.39)

f_{out} = \bar{H} f_{in},    (3.40)
where \bar{H} is the convolution filter, \{\bar{h}_l\} is the set of filter taps to be learned, I is the identity matrix, A ∈ R^{n×n} is the binary or weighted adjacency matrix, and f_out ∈ R^n is the output graph signal. As mentioned in [121], representing a graph convolution filter as a polynomial of the adjacency matrix serves two purposes. (1) Since the filter parameters (i.e., the scalar multipliers \bar{h}_l) are shared at different nodes, the filter becomes linear and shift-invariant. (2) Different powers of the adjacency matrix ensure the involvement of higher-order neighbors in the feature computation, which increases the filter's receptive field and induces more contextual information into the computed feature. To reduce the number of parameters, VGG network-like architectures can be used, where instead of a single large filter, a series of small convolution filters is used along different layers. One such filter is given as:

H = h_0 I + h_1 A.    (3.41)
Fig. 3.11 Graph CNN outcome. a An input graph with 14 nodes and scalar node attributes resulting in the input graph signal f in ∈ R14 . b Convolution operation resulting in updated node attributes (three-dimensional) with the output graph signal f out ∈ R14×3 . The adjacency relationship among the nodes remains unchanged
As one moves toward higher layers, this implicitly increases the receptive field of the filters by involving L-hop neighbors. Therefore, a cascade of L layers of such filter banks eventually makes Eq. (3.41) work like Eq. (3.39). To further improve the convolution operation, one may split the adjacency matrix A into multiple slices (say, T of them), where each slice A_t carries the adjacency information of a certain set of nodes and the corresponding features. For example, the slices can be designed to encode the relative orientations of different neighbors with respect to the node where the convolution is centered. Considering this formulation, Eq. (3.41) can be rewritten as:

H = h_0 I + h_{1,1} A_1 + h_{1,2} A_2 + \cdots + h_{1,T} A_T,    (3.42)
where A = \sum_{t=1}^{T} A_t. It may be noted that the network can take a graph of any size as the input. For the case of D-dimensional node attributes, the input graph signal is f_in ∈ R^{n×D}. This is exactly the same as in a traditional CNN, where the convolution operation does not impose any constraint on the image size and pixel connectivity. Like in a traditional CNN, at any layer, graph convolution is performed on each channel dimension of the graph signal (f_in^{(j)} ∈ R^n, for j = 1, 2, …, u) coming from the preceding layer separately, and then the convolution results are added up to obtain a single output signal (f_out^{(i)} ∈ R^n) as:

f_{out}^{(i)} = \sum_{j=1}^{u} H^{(i,j)} f_{in}^{(j)}, \quad \text{for } i = 1, 2, \ldots, p,    (3.43)
where p is the number of filters at that layer, \{H^{(i,j)}\}_{j=1}^{u} constitute filter-i, and the output signal has p channels, i.e., f_out ∈ R^{n×p}. The output graph signal (Fig. 3.11b) obtained after multiple convolution layers can be passed through fully connected layers for classification or regression tasks.
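A compact NumPy sketch of the polynomial filter of Eqs. (3.39)–(3.40) and of the multi-channel, first-order layer of Eqs. (3.41) and (3.43) is shown below. The random adjacency matrix and tap values are placeholders used only for illustration.

```python
import numpy as np

def graph_conv(A, f_in, taps):
    """Polynomial graph filter: f_out = (h0*I + h1*A + ... + hL*A^L) f_in."""
    n = A.shape[0]
    H = np.zeros((n, n))
    A_power = np.eye(n)
    for h in taps:                 # taps = [h0, h1, ..., hL]
        H += h * A_power
        A_power = A_power @ A
    return H @ f_in

def graph_conv_layer(A, F_in, W):
    """First-order multi-channel layer (Eqs. 3.41 and 3.43).
    F_in: (n, u) input graph signal; W: (p, u, 2) holding (h0, h1) per
    (output channel, input channel) pair; returns an (n, p) signal."""
    n, u = F_in.shape
    p = W.shape[0]
    F_out = np.zeros((n, p))
    I = np.eye(n)
    for i in range(p):
        for j in range(u):
            h0, h1 = W[i, j]
            F_out[:, i] += (h0 * I + h1 * A) @ F_in[:, j]
    return F_out

# toy example: 6-node undirected graph with scalar node attributes
rng = np.random.default_rng(0)
A = (rng.random((6, 6)) > 0.6).astype(float)
A = np.triu(A, 1)
A = A + A.T
f = rng.normal(size=6)
print(graph_conv(A, f, taps=[0.5, 0.3, 0.2]).shape)
print(graph_conv_layer(A, f[:, None], W=rng.normal(size=(3, 1, 2))).shape)
```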
3.6 Variational Inference

Inference in probabilistic models is frequently intractable. To address this issue, existing algorithms use subroutines that sample random variables to offer approximate solutions to the inference problem. Most sample-based inference algorithms are essentially different instances of Markov chain Monte Carlo (MCMC) methods, with Gibbs sampling and Metropolis–Hastings being the two most widely used MCMC approaches. Although they are guaranteed to discover a globally optimal solution given enough time, it is impossible to discern how near they are to a good solution given the finite amount of time available in practice. Furthermore, choosing the right sampling approach for a particular problem is an art rather than a science. The variational family of algorithms addresses these challenges by casting inference as an optimization problem. Suppose we have an intractable probability distribution p. Variational approaches attempt to solve an optimization problem over a class of tractable distributions Q in order to discover a q ∈ Q that is most similar to p, which can then be used as a representative of p. The following are the key distinctions between sampling and variational techniques:

• Variational techniques, unlike sampling-based methods, nearly never identify the globally optimal solution.
• We will, however, always be able to tell whether they have converged, and it is possible to put bounds on their accuracy in some circumstances.
• In practice, variational inference approaches scale better and are better suited to techniques such as stochastic gradient optimization, parallelization over several processors and GPU acceleration.

Suppose x is a data point which has a latent feature z. For most statistical inference tasks, one is interested in finding the distribution of the latent feature given the observation. This can be obtained using Bayes' rule as:

p(z \mid x; \theta) = \frac{p(z; \theta)\, p(x \mid z; \theta)}{\int p(x \mid z; \theta)\, p(z; \theta)\, dz},    (3.44)
where p(x | z; θ), p(z; θ) and p(z | x; θ) are the likelihood, prior and posterior distributions with parameter θ, respectively. The above expression is intractable as it involves a possibly high-dimensional integral, the marginal likelihood p(x) = \int p(x \mid z)\, p(z)\, dz; as a result, Bayes' rule is difficult to apply in general. As mentioned before, the goal of variational inference is to recast this integration problem as one of optimization, which involves taking derivatives rather than integrating, because the former is easier and generally faster. The primary objective of variational inference is to obtain an approximate distribution of the posterior p(z | x). Instead of sampling, it tries to find a distribution \hat{q}(z \mid x; \phi) from a parametric family of distributions Q that best approximates the posterior:

\hat{q}(z \mid x; \phi) = \arg\min_{q(z \mid x; \phi) \in Q} \text{KL}\left[ q(z \mid x; \phi) \,\|\, p(z \mid x; \theta) \right],    (3.45)
where KL[·||·] stands for the Kullback–Leibler divergence, defined as:

\text{KL}\left[ q(z \mid x; \phi) \,\|\, p(z \mid x; \theta) \right] = \int q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)}\, dz.    (3.46)
Now, the objective function in Eq. (3.45) cannot be solved directly since its computation still requires the marginal likelihood. It can be observed that:

\text{KL}[q(z \mid x) \,\|\, p(z \mid x)] = \mathbb{E}_{q(z \mid x)}\left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right]
= \mathbb{E}_{q(z \mid x)}[\log q(z \mid x)] - \mathbb{E}_{q(z \mid x)}[\log p(z \mid x)]
= \mathbb{E}_{q(z \mid x)}[\log q(z \mid x)] - \mathbb{E}_{q(z \mid x)}\left[ \log \frac{p(x \mid z)\, p(z)}{p(x)} \right]
= \mathbb{E}_{q(z \mid x)}[\log q(z \mid x)] - \mathbb{E}_{q(z \mid x)}[\log p(z)] - \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] + \log p(x)
= \text{KL}[q(z \mid x) \,\|\, p(z)] - \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] + \log p(x)    (3.47)
In the above equation, the term log p(x) can be ignored as it does not depend upon the optimizer q(z | x). As a result, minimizing KL[q(z | x) || p(z | x)] is equivalent to maximizing \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \text{KL}[q(z \mid x) \,\|\, p(z)], which is called the evidence lower bound (ELBO), since it acts as a lower bound on the evidence p(x), as shown in Eq. (3.48). This is possible because the KL divergence is always a non-negative quantity.

\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \text{KL}[q(z \mid x) \,\|\, p(z)]    (3.48)
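The ELBO of Eq. (3.48) can be estimated numerically once a concrete model is fixed. The sketch below assumes, purely for illustration, a scalar latent variable with a standard normal prior p(z) = N(0, 1), a Gaussian likelihood p(x|z) = N(z, σ²), and a Gaussian variational posterior q(z|x) = N(μ_q, σ_q²); the KL term is then available in closed form, while the expected log-likelihood is approximated by sampling from q.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, mu_q, sigma_q, sigma_lik=1.0, n_samples=2000):
    """Monte-Carlo estimate of E_q[log p(x|z)] - KL[q(z|x) || p(z)]."""
    # KL between N(mu_q, sigma_q^2) and the standard normal prior, in closed form
    kl = np.log(1.0 / sigma_q) + (sigma_q ** 2 + mu_q ** 2) / 2.0 - 0.5
    # expected log-likelihood under q, approximated with samples z ~ q(z|x)
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    log_lik = (-0.5 * np.log(2 * np.pi * sigma_lik ** 2)
               - 0.5 * (x - z) ** 2 / sigma_lik ** 2)
    return log_lik.mean() - kl

# a q(z|x) centred closer to the observation gives a larger (better) lower bound
print(elbo(x=1.0, mu_q=0.5, sigma_q=0.8), elbo(x=1.0, mu_q=-1.0, sigma_q=0.8))
```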
3.7 Few-shot Learning

Humans can distinguish new object classes based on a small number of examples. The majority of machine learning algorithms, on the other hand, require thousands of samples to reach comparable performance. In this context, few-shot learning techniques have been designed that aim to classify a novel class using a small number of training samples from that class. One-shot learning is the extreme case where each class has only one training example. Few-shot learning is extremely useful in applications where training examples are scarce (e.g., rare disease cases) or where the expense of labeling data is too high.

Let us consider a classification task of N categories where the training dataset has a small number (say, K) of samples per class. Any classifier trained using standard approaches will be heavily parameterized. Hence, it will generalize poorly and will not be able to distinguish test samples from the N classes. As the training data is insufficient to constrain the problem, one possible solution is to gain experience from other similar problems having large training datasets. To this end, few-shot learning is characterized as a meta-learning problem in an N-way-K-shot framework. In the classical learning framework, we learn how to classify from large training data, and we evaluate the results using test data. In the meta-learning framework, we learn how to learn to classify using a very small set of training samples. Here, several tasks are constructed to mimic the few-shot scenario. So, for N-way-K-shot classification, each task includes batches of N classes with K training examples of each, randomly chosen from a larger dataset different from the one that we finally want to perform classification on. These batches are known as the support set for the task and are used for learning how to solve that task. In addition, there are further examples of the same N classes that constitute a query set, and they are used to evaluate the performance on this task. Each task can be completely non-overlapping in terms of the classes involved; we may never see the classes from one task in any of the other tasks. The idea is that the network repeatedly sees instances (tasks) during training in a manner that matches the structure of the final few-shot task, i.e., few training samples per class, but involves different classes. At each step of meta-learning, the model parameters are updated based on a randomly selected training task. The loss function is determined by the classification performance on the query set of this training task, based on the knowledge gained from its support set [117]. Let \{D_1^s, D_2^s, \ldots, D_N^s\} and \{D_1^q, D_2^q, \ldots, D_N^q\} denote
the support set and the query set, respectively, for a task. Here, D_i^s = \{(x_j^s, y_j^s)\}_{j=1}^{K} and D_i^q = \{(x_j^q, y_j^q)\}_{j=1}^{K} contain examples from class-i, where x^s (or x^q) and y^s ∈ \{1, 2, \ldots, N\} (or y^q) denote a sample and the corresponding ground-truth label. Let f_θ(·) be the embedding function to be learned, parameterized by θ, that obtains the feature representation of samples. The mean representation of class-i is obtained as:

c_i = \frac{1}{K} \sum_{(x_j^s, y_j^s) \in D_i^s} f_{\theta}(x_j^s).    (3.49)
Given a query sample x^q, a probability distribution over the classes is computed as:

p_{\theta}(y = i \mid x^q) = \frac{\exp\left(-d(f_{\theta}(x^q), c_i)\right)}{\sum_{i'=1}^{N} \exp\left(-d(f_{\theta}(x^q), c_{i'})\right)},    (3.50)
where d(·) is an appropriate distance function, e.g., the cosine distance or the Euclidean distance. The model is learned by minimizing the negative log-probability of each query sample's true class label, and the corresponding loss is given as:

L = -\frac{1}{NK} \sum_{i=1}^{N} \sum_{(x_j^q, y_j^q) \in D_i^q} \log p_{\theta}(y = y_j^q \mid x_j^q)    (3.51)
Since the network is presented with a different task at each time step, it must learn how to discriminate data classes in general, rather than a particular subset of classes. To evaluate the few-shot learning performance, a set of test tasks with both support and query sets is constructed, which contains only unseen classes that were not in any of the training tasks. For each test task, we can measure the performance on the query set \{\tilde{D}_1^q, \tilde{D}_2^q, \ldots, \tilde{D}_N^q\} based on the knowledge provided by the corresponding support set \{\tilde{D}_1^s, \tilde{D}_2^s, \ldots, \tilde{D}_N^s\}. Similar to Eq. (3.49), the mean representation for a test support class-i is computed as:

\tilde{c}_i = \frac{1}{K} \sum_{(\tilde{x}_j^s, \tilde{y}_j^s) \in \tilde{D}_i^s} f_{\theta}(\tilde{x}_j^s).    (3.52)
Given a query sample \tilde{x}^q, its class label is predicted as:

\tilde{y} = \arg\min_{i \in \{1, 2, \ldots, N\}} d\left(f_{\theta}(\tilde{x}^q), \tilde{c}_i\right).    (3.53)
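The computations of Eqs. (3.49)–(3.53) reduce to a few lines once the embeddings f_θ(x) are available. The sketch below assumes pre-computed embedding vectors, class labels coded as 0, …, N−1, and the squared Euclidean distance as d(·); these conventions are illustrative choices, not requirements of the formulation.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_classes):
    """Mean embedding c_i of each class over its K support samples (Eq. 3.49)."""
    return np.stack([support_emb[support_labels == i].mean(axis=0)
                     for i in range(n_classes)])

def query_distribution(query_emb, prototypes):
    """Softmax over negative squared Euclidean distances to the prototypes (Eq. 3.50)."""
    d = np.sum((prototypes - query_emb) ** 2, axis=1)
    logits = -d - np.max(-d)                # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def episode_loss(query_emb, query_labels, prototypes):
    """Mean negative log-probability of the true labels over a query set (Eq. 3.51)."""
    losses = [-np.log(query_distribution(q, prototypes)[y] + 1e-12)
              for q, y in zip(query_emb, query_labels)]
    return float(np.mean(losses))

def predict(query_emb, prototypes):
    """Nearest-prototype prediction of Eq. (3.53)."""
    return int(np.argmin(np.sum((prototypes - query_emb) ** 2, axis=1)))
```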
Till now, we have discussed class-based meta-learning. However, we need the learning to be class agnostic for tasks such as co-segmentation that do not involve any semantic information. In Chap. 9, we discuss a class agnostic meta-learning scheme for solving the co-segmentation problem.
Chapter 4
Maximum Common Subgraph Matching
4.1 Introduction

Foreground segmentation from a single image without supervision is a difficult problem, since one would not know what constitutes the object of interest. If an additional image containing a similar foreground is provided, both images can possibly be segmented simultaneously with higher accuracy using co-segmentation. We study this aspect: if one is given multiple images without any other additional information, is it possible to fuse useful information for the purpose of segmentation? Thus, given a set of images (e.g., crowd-sourced images), the objects of common interest in those images are to be jointly segmented as co-segmented objects [103, 105, 106] (see Fig. 1.3).
4.1.1 Problem Formulation

In this chapter, we demonstrate the commonly understood graph matching algorithm for foreground co-segmentation. We set up the problem as a maximum common subgraph (MCS) computation problem. We find a solution to the MCS of two region adjacency graphs (RAG) obtained from an image pair and then perform region co-growing to obtain the complete co-segmented objects. In a standard MCS problem, typically, certain labels are assigned as the node attributes. Thus, given a pair of graphs, the inter-graph nodes can be matched exactly as discussed in Sect. 3.3. But in natural images, we expect some variations in attributes, i.e., in the features of similar objects or regions (e.g., color, texture, size). So, in the discussed approach, for an inter-graph node pair to match, the attributes need not be exactly equal. They are considered to match if the attribute difference is within a certain threshold (Eq. 3.14). The key aspects of this chapter are as follows.

• The MCS-based matching algorithm allows co-segmentation of multiple common objects.
• Region co-growing helps to detect common objects of different sizes.
• An efficient use of the MCS algorithm followed by region co-growing can co-segment high-resolution images without increasing computations.

We describe the co-segmentation algorithm initially for two images in Sects. 4.2 and 4.3. Then we show its extension to multiple images in Sect. 4.5. Comparative results are provided in Sect. 4.4.
4.2 Co-segmentation for Two Images

In the co-segmentation task for two images, we are interested in finding the objects of interest that are present in both images and have similar features. The flow of the co-segmentation algorithm is shown in Fig. 4.1 and is detailed in the subsequent sections. First, each image (Fig. 4.2a, b) is segmented into superpixels using the SLIC method [1]. Then a graph is obtained by representing the superpixels of an image as nodes, with every node pair corresponding to a spatially adjacent superpixel pair connected by an edge. Superpixel segmentation describes the image at a coarse level through a limited number (n_S) of nodes of the graph. An increase in n_S increases the computation during graph matching drastically. So, it is efficient to use superpixels as nodes instead of pixels, as this cuts down the number of nodes significantly. As an image is a group of connected components (i.e., objects, background), and each such component is constituted by a set of contiguous superpixels, this region adjacency graph (RAG) representation of images is favorable in the co-segmentation method for obtaining the common objects.
4.2.1 Image as Attributed Region Adjacency Graph

Let an image pair I_1, I_2 be represented using two RAGs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), respectively. Here, V_i = \{v_k^i\} and E_i = \{e_{kl}^i\} for i = 1, 2 denote the sets of nodes and edges, respectively. Any appropriate feature can be used as the node attribute. The experiments performed in this chapter consider two features for each node: (i) the CIE Lab mean color and (ii) the rotation-invariant histogram of oriented gradients (HoG) of the pixels within the corresponding superpixel. HoG features are useful to capture the image texture, and they help to distinguish superpixels that may have similar mean color in spite of being completely different in texture. Further, rotation-invariant HoG features can match similar objects with different orientations. If an image is rotated, the gradient direction at every pixel is also changed by the same angle. If h denotes the histogram of the directions of gradients computed at all pixels within a superpixel, the values in the vector h will be shifted as a function of the rotation angle. In order to achieve rotation invariance, the values in the computed HoG (h) are circularly shifted with respect to the index of the maximum value in it.
Fig. 4.1 Co-segmentation algorithm using a block diagram. Input image pair (I1 , I2 ) is represented as region adjacency graphs (RAGs) G1 and G2 . The maximum common subgraph (MCS) of the RAGs yields the node sets V1H and V2H , which form the initial matched regions in I1 and I2 , respectively. These are iteratively (index-(t)) co-grown using inter-image feature similarity between the nodes in them to obtain the final matched regions V1H ∗ and V2H ∗ . Figure courtesy: [48]
To incorporate both features in the algorithm, the feature similarity S_f(·) between nodes v_k^1 in G_1 and v_l^2 in G_2 is computed as a weighted sum of the corresponding color and HoG feature similarities, denoted as S_c(·) and S_h(·), respectively:

S_f(v_k^1, v_l^2) = 0.5\, S_c(v_k^1, v_l^2) + 0.5\, S_h(v_k^1, v_l^2).    (4.1)
Here, the similarity S_h(v_k^1, v_l^2) is computed as the additive inverse of the distance d_{kl} (e.g., the Euclidean distance) between the corresponding HoG features h(v_k^1) and h(v_l^2). Prior to computing S_h(·), each d_{kl} is normalized with respect to the maximum pairwise distance over all node pairs as:

d_{kl} = \frac{d_{kl}}{\max_{k', l' :\, v_{k'}^1 \in G_1,\, v_{l'}^2 \in G_2} d_{k'l'}}.    (4.2)
The same strategy is adopted for computing the color similarity measure S_c(·). Next, we proceed to compute the MCS between the two RAGs to obtain the common objects.
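A small NumPy sketch of the node attributes and the similarity of Eqs. (4.1)–(4.2) is given below. The histogram shift implements the rotation-invariance idea described above; the equal 0.5 weights follow Eq. (4.1), while using Euclidean distances for both features is an illustrative assumption.

```python
import numpy as np

def rotation_invariant_hog(h):
    """Circularly shift an orientation histogram so its largest bin comes first."""
    return np.roll(h, -int(np.argmax(h)))

def node_similarity(color1, hog1, color2, hog2):
    """Weighted similarity of Eq. (4.1) between every node pair of two RAGs.
    color*: (n_i, 3) mean Lab colors; hog*: (n_i, B) rotation-invariant HoGs.
    Distances are normalized by their maximum, as in Eq. (4.2)."""
    def sim(F1, F2):
        d = np.linalg.norm(F1[:, None, :] - F2[None, :, :], axis=2)
        d = d / (d.max() + 1e-12)
        return 1.0 - d                      # additive inverse of the normalized distance
    return 0.5 * sim(color1, color2) + 0.5 * sim(hog1, hog2)
```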
4.2.2 Maximum Common Subgraph Computation

To solve the co-segmentation problem for an image pair, we first need to find superpixel correspondences from one image to the other. Then the superpixels within the objects of similar features across the images can be matched to obtain the common objects. However, since no prior information about the objects is used in this unsupervised method, the matching becomes exhaustive. Colannino et al. [28] showed that the computational complexity of such matching is O((|G_1| + |G_2|)^3), assuming a minimum-cost many-to-many matching algorithm. Further, the resulting matched regions may contain many disconnected segments. Each of these segments may be a group of superpixels or even a single superpixel, and such matching may not be meaningful. To obtain a meaningful match, wherein the connectivity among the superpixels in the matched regions is maintained, we describe a graph-based approach to jointly segment the complete objects from an image pair. In this framework, the objective is to obtain the maximum common subgraph (MCS) that represents the co-segmented objects. The MCS corresponds to the common subgraphs H_1 in G_1 and H_2 in G_2 (see Fig. 4.3 for an illustration). However, H_1 and H_2 may not be identical as (i) G_1 and G_2 are region adjacency graphs with feature vectors as node attributes, and (ii) the common object regions in the two images need not undergo identical superpixel segmentation. Hence, unlike in a standard MCS finding algorithm, many-to-one matching must be permitted here. Complications arising from many-to-one node matching can be reduced by restricting the number of nodes in one image that can match to a node in the other image, and that number (say, τ) can be chosen based on the inter-image (superpixel) feature similarities in Eq. (4.1). Following the work
Fig. 4.2 Co-segmentation stages using an image pair. a, b Input images and their SLIC segmentation. c, d The matched nodes i.e., superpixels across images (shown in same color) obtained through MCS computation and the corresponding e, f object regions in the images. g, h Co-segmented objects obtained after performing region co-growing on the initially matched regions in (e, f). Figure courtesy: [48]
of Madry [80], it is possible to show that the computational complexity reduces to O((τ(|G_1| + |G_2|))^{10/7}) when the matching is restricted to a maximum of τ nodes only. We begin the MCS computation by building two product graphs W_{12} and W_{21} from the RAGs G_1 and G_2 based on the similarity values S_f(·), following a strategy similar to that described in Sect. 3.3. Here, we only describe the steps to compute W_{12} and the subsequent steps in the MCS computation, considering G_1 as the reference graph which is being matched to G_2. The steps involving W_{21} (i.e., the reference graph G_2 being matched to G_1) are identical. To find the product nodes of W_{12}, a threshold t_G (0 ≤ t_G ≤ 1) is selected for node matching, as node features do not need to match exactly for natural images. Further, to enforce the constraint τ, we need to find the τ most similar nodes in G_2 for every node v_k^1 ∈ V_1. Let V_2^{(k)} be the ordered list of nodes \{v_l^2\} in V_2 such that \{S_f(v_k^1, v_l^2)\}_{\forall l} are in descending order of magnitude. The product node set U_{12}^W of the product graph W_{12} is obtained as:

U_{12}^W = \bigcup_{\forall k} \left\{ (v_k^1, u_l) : u_l \in V_2^{(k)},\; l = 1, 2, \ldots, \tau,\; S_f(v_k^1, u_l) > t_G \right\}.    (4.3)
Similarly, we can compute U_{21}^W by keeping V_2 as the reference. It is evident from Eq. (4.3) that restricting one node in one graph to match to at most τ nodes in the other graph leads to U_{12}^W ≠ U_{21}^W, resulting in H_1 ≠ H_2 (i.e., the construction is not commutative), as noted earlier. Conversely, the two product graphs W_{12} and W_{21} would be identical in the absence of τ. In Sect. 3.3, the parameter τ was not considered, and hence a single product graph W was used. Let us analyze the effect of the two parameters t_G and τ. A large value of t_G and a small value of τ:
• restrict the matching to only a few candidate superpixels, while still allowing a certain amount of inter-image variation in the common objects,
• ensure a fast computation during subgraph matching, and
• reduce the product graph size as well as the possibility of spurious matching.

For example, the size of the product graph for many-to-many matching is O(|G_1||G_2|). The choice of τ in the matching process reduces the size to O(τ(|G_1| + |G_2|)), while the additional use of the threshold t_G makes it O(ζτ(|G_1| + |G_2|)) with 0 < ζ ≪ 1. This reduces the computation drastically. We will show in Sect. 4.2.3 that the soft matches can be recovered during the region co-growing phase. Then, using the method described in Sect. 3.3, the MVC of the complement graph W_{12}^C is computed. The set of product nodes (U_{12}^M ⊆ U_{12}^W) other than this MVC represents the left matched product nodes that form the maximal clique of W_{12}. Similarly, we obtain the right matched product nodes U_{21}^M ⊆ U_{21}^W from W_{21}. Let U^M = U_{12}^M ∪ U_{21}^M. The sets of nodes V_1^H ⊆ V_1 and V_2^H ⊆ V_2 (see Fig. 4.2c, d) in the corresponding common subgraphs H_1 and H_2 are obtained from U^M using Eqs. (3.15) and (3.16), respectively, and they correspond to the matched regions in I_1 and I_2 (see Fig. 4.2e, f).
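The construction of the product node set in Eq. (4.3) is straightforward to express in code. The sketch below assumes that the similarity matrix S of Eq. (4.1) has already been computed between the nodes of G_1 (rows) and G_2 (columns); the set U_{21}^W is obtained by calling the same routine on the transposed matrix.

```python
import numpy as np

def product_nodes(S, t_G, tau):
    """Product node set U12 of Eq. (4.3): for every node k of G1, keep at most
    tau nodes of G2 with the highest similarity, provided S[k, l] > t_G."""
    U12 = []
    for k in range(S.shape[0]):
        top = np.argsort(S[k])[::-1][:tau]          # tau most similar nodes of G2
        U12.extend((k, int(l)) for l in top if S[k, l] > t_G)
    return U12
```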
Fig. 4.3 Example of the maximum common subgraph of two graphs G_1 and G_2. The sets of nodes V_1^H = \{v_1^1, v_2^1, v_8^1, v_7^1\}, V_2^H = \{v_1^2, v_2^2, v_9^2, v_8^2\} and the edges in the maximum common subgraphs H_1 and H_2 of G_1 and G_2, respectively, are highlighted (in blue). Figure courtesy: [48]
Fig. 4.4 Requirement of condition C2 for edge assignment between the product graph nodes (v_1^1, v_1^2) and (v_3^1, v_3^2) obtained using Eq. (4.3). Here, condition C1 is not satisfied; however, condition C2 is satisfied, and an edge is added. It is easy to derive that the nodes in the MCS are (v_1^1, v_1^2) and (v_3^1, v_3^2). This shows that multiple disconnected but similar objects can be co-segmented. Figure courtesy: [48]
In Sect. 3.3, we discussed two conditions for connecting product nodes with edges, and we analyze them here. As seen in Figs. 3.3–3.8, both conditions are necessary for computing the maximal clique correctly. However, if multiple common objects are present in the image pair, and they are not connected to each other, condition C1 alone cannot co-segment both. We illustrate using an example that condition C2 helps to achieve this. In Fig. 4.4, let the disconnected nodes v_1^1 and v_3^1 in G_1 be similar to the disconnected nodes v_1^2 and v_3^2 in G_2, respectively. Here, the use of condition C1 alone will co-segment either (i) v_1^1 and v_1^2, or (ii) v_3^1 and v_3^2, but not both. Using both conditions, we will be able to co-segment both (i) v_1^1 and v_1^2, and (ii) v_3^1 and v_3^2, which is the correct result.
4.2.3 Region Co-growing

In the MCS algorithm of Sect. 4.2.2, we used certain constraints on the choice of the similarity threshold t_G and the maximal many-to-one matching parameter τ so that the product graph size remains small and the subsequent computations are reduced. However, the subgraphs H_1 and H_2 obtained at the MCS output do not cover the complete objects; i.e., the resulting V_1^H and V_2^H may not contain all the superpixels constituting the common objects. So, we need to grow these matched regions to obtain
the complete co-segmented objects. In this section, we discuss an iterative method that performs region growing in both images simultaneously based on neighborhood feature similarities across the image pair. Given V_1^H and V_2^H, which partially constitute the common objects in the respective images, for a superpixel pair in them that has matched, it is expected that superpixels in their neighborhoods will also match. Thus, we can perform region co-growing on V_1^H and V_2^H, using them as seeds, to obtain the complete objects as:

(H_1', H_2') = F_{RCG}(H_1, H_2, I_1, I_2),    (4.4)
where F_{RCG} denotes the region co-growing function, and H_1', H_2' denote the subgraphs representing the complete objects. Further, F_{RCG} has the following benefits.

• Even if an image pair contains common objects of different sizes (and numbers of superpixels), they are completely detected after region co-growing.
• Obtaining an MCS with a small product graph followed by region co-growing is computationally less intensive than solving for the MCS with a large product graph.

Any region growing method typically considers the neighborhood of the seed region and appends the superpixels (or pixels) from that neighborhood which are similar (in some metric) to the seed. However, here, instead of growing V_1^H and V_2^H independently, it is more appropriate to grow them jointly because the co-segmented objects must have commonality. Let N_{V_i^H} denote the set of neighbors of V_i^H, with

N_{V_i^H} = \bigcup_{v \in V_i^H} \{u \in N(v)\} \quad \text{for } i = 1, 2,    (4.5)
where N(·) denotes the first-order neighborhood. To co-grow V_1^H and V_2^H, the set of nodes N_{s_i} ⊆ N_{V_i^H} having high inter-image feature similarity to the nodes in V_j^H (j ≠ i) is obtained, and V_i^H grows as:

V_i^{H,(t+1)} \leftarrow V_i^{H,(t)} \cup N_{s_i}^{(t)} \quad \text{for } i = 1, 2.    (4.6)
To completely grow H_1, H_2 into H_1', H_2', this process is iterated until convergence. These iterations implicitly consider higher-order neighborhoods of V_1^H and V_2^H, which is necessary for their growth. In every iteration-t, V_i^{H,(t)} denotes the already matched regions (nodes), and N_{V_i^H}^{(t)} denotes the nodes in their first-order neighborhood. The region co-growing algorithm converges when

V_1^{H,(t)} = V_1^{H,(t-1)} \quad \text{and} \quad V_2^{H,(t)} = V_2^{H,(t-1)}.    (4.7)
After convergence, we denote by V_1^{H*} and V_2^{H*} the node sets (see Fig. 4.1) which constitute H_1', H_2', representing the common objects completely in I_1 and I_2, respectively (also see Fig. 4.2g, h).
Algorithm 1 Pairwise image co-segmentation algorithm
Input: An image pair I_1, I_2
Output: Common objects F_1, F_2 present in the image pair
1: Build RAGs G_1 = (V_1, E_1), G_2 = (V_2, E_2) from I_1, I_2
2: // MCS computation
3: Compute product graphs W_{12}, W_{21} using Eq. (4.3)
4: Compute minimum vertex covers of W_{12}^C, W_{21}^C and their complements U_{12}^M, U_{21}^M
5: U^M := U_{12}^M ∪ U_{21}^M
6: Find maximum common subgraphs H_1 ⊆ G_1, H_2 ⊆ G_2 and corresponding node sets V_1^H, V_2^H from U^M
7: // Region co-growing
8: t ← 1, V_1^{H,(t)} ← V_1^H, V_2^{H,(t)} ← V_2^H
9: while no convergence do
10:   N_{s_1}^{(t)} := ∪_{v_l^2 ∈ V_2^{H,(t)}} {v_k^1 ∈ N_{V_1^{H,(t)}} | S_f(v_k^1, v_l^2) > t_G}
11:   N_{s_2}^{(t)} := ∪_{v_k^1 ∈ V_1^{H,(t)}} {v_l^2 ∈ N_{V_2^{H,(t)}} | S_f(v_k^1, v_l^2) > t_G}
12:   Region growing in G_1: V_1^{H,(t+1)} ← V_1^{H,(t)} ∪ N_{s_1}^{(t)}
13:   Region growing in G_2: V_2^{H,(t+1)} ← V_2^{H,(t)} ∪ N_{s_2}^{(t)}
14:   t ← t + 1
15: end while
16: Obtain F_1, F_2 from V_1^{H*}, V_2^{H*}
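Lines 8–15 of Algorithm 1 can be prototyped with plain Python sets, as sketched below. The node-level similarity matrix S stands in for S_f of Eq. (4.1); the weighted measure introduced in the next subsection (Eq. 4.8) can be substituted without changing the structure of the loop.

```python
def co_grow(V1H, V2H, nbrs1, nbrs2, S, t_G):
    """Joint region growing of Algorithm 1. V1H, V2H: sets of matched node
    indices; nbrs1, nbrs2: dicts mapping a node to its first-order neighbors;
    S: inter-image similarity matrix indexed as S[node_of_G1, node_of_G2]."""
    while True:
        N1 = {u for v in V1H for u in nbrs1[v]} - V1H     # unmatched neighbors in G1
        N2 = {u for v in V2H for u in nbrs2[v]} - V2H     # unmatched neighbors in G2
        grow1 = {k for k in N1 if any(S[k, l] > t_G for l in V2H)}
        grow2 = {l for l in N2 if any(S[k, l] > t_G for k in V1H)}
        if not grow1 and not grow2:                       # convergence test of Eq. (4.7)
            return V1H, V2H
        V1H, V2H = V1H | grow1, V2H | grow2
```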
The example in Fig. 4.5a–f shows that region growing helps to completely detect common objects of different sizes. The larger object has been only partially detected by the MCS (Fig. 4.5b), and it is fully recovered after region co-growing (Fig. 4.5c). The co-segmentation algorithm is given as pseudocode in Algorithm 1.

Similarity metric. As discussed earlier, the neighborhood subset N_{s_2} is obtained by analyzing the feature similarity between the node sets \{v_k^1 ∈ V_1^H\} and \{v_l^2 ∈ N_{V_2^H}\}. However, the node-level similarity S_f(v_k^1, v_l^2) of Eq. (4.1) alone may not be enough. Hence, to further enhance co-growing, an additional neighborhood-level similarity is required, which can be defined as the average feature similarity between their neighbors (N_{v_k^1} and N_{v_l^2}) that are already in the set of matched regions, i.e., in V_1^H and V_2^H, respectively. A weighted feature similarity is computed as:

\bar{S}_f(v_k^1, v_l^2) = \omega_N\, S_f(v_k^1, v_l^2) + (1 - \omega_N)\, S_f^N(v_k^1, v_l^2),    (4.8)
where ω_N is an appropriately chosen weight, and S_f^N(·) is the neighborhood similarity. Thus, the similarity measure for region co-growing has an additional measure of neighborhood similarity compared to the measure used for graph matching in Sect. 4.2.2. We illustrate this using example graphs next. Figure 4.6a shows two graphs and their MCS output V_1^H = \{v_1^1, v_2^1, v_3^1, v_5^1\}, V_2^H = \{v_1^2, v_2^2, v_3^2, v_5^2\},
Fig. 4.5 Effectiveness of region co-growing in co-segmenting objects of different size. a, d Input images. b, e Similar object regions obtained using the MCS algorithm, where the larger object (of image d) is not completely detected. c, f Complete co-segmented objects are obtained after co-growing. Figure courtesy: [48]
with the correspondences v_1^1 ↔ v_1^2, v_2^1 ↔ v_2^2, v_3^1 ↔ v_3^2, v_5^1 ↔ v_5^2. While growing V_2^H, we need to analyze the similarity between the nodes in V_1^H and N_{V_2^H}. For the pair of a matched node v_1^1 ∈ V_1^H and an unmatched neighboring node v_4^2 ∈ N_{V_2^H}, the weighted measure \bar{S}_f(v_1^1, v_4^2) is computed considering their feature similarity S_f(v_1^1, v_4^2) and the feature similarity between the respective (matched) neighboring node pairs (v_3^1 ∈ V_1^H ∩ N_{v_1^1}, v_3^2 ∈ V_2^H ∩ N_{v_4^2}) and (v_5^1 ∈ V_1^H ∩ N_{v_1^1}, v_5^2 ∈ V_2^H ∩ N_{v_4^2}). The neighboring nodes v_2^1 ∈ V_1^H and v_1^2 ∈ V_2^H are ignored since they have not been matched to each other. If \bar{S}_f(v_1^1, v_4^2) computed using Eq. (4.8) exceeds a certain threshold t_G, v_4^2 is assigned to the set N_{s_2}, which is used in Eq. (4.6). Similarly, while growing V_1^H, the weighted feature similarity between v_4^1 ∈ N_{V_1^H} and the nodes in V_2^H is computed (see Fig. 4.6c).

Now we discuss the formulation of S_f^N(·). If a node in G_i has few already matched neighbors (i.e., neighbors from V_i^H), it is less likely to be part of the foreground in I_i. So, less importance should be given to it even if it has relatively high inter-image feature similarities with the nodes within the object in I_j. In Fig. 4.6a, the unmatched node v_4^2 ∈ N_{V_2^H} has three matched neighboring nodes v_1^2, v_3^2 and v_5^2, whereas in Fig. 4.6c, the unmatched node v_4^1 ∈ N_{V_1^H} has one matched neighboring node v_1^1. The neighborhood similarity measure S_f^N(v_k^1, v_l^2) is computed as:

S_f^N(v_k^1, v_l^2) = 1 - (\bar{Z})^{n_M},    (4.9)
Fig. 4.6 Region co-growing. a The sets of nodes V_1^H and V_2^H at the MCS outputs of the graphs G_1 and G_2, where v_1^1, v_2^1, v_3^1, v_5^1 match v_1^2, v_2^2, v_3^2, v_5^2, respectively. The nodes in the MCSs are V_1^{H,(t)} and V_2^{H,(t)} (blue) at t = 1. To grow V_2^{H,(t)}, we compare feature similarities of each node, e.g., v_4^2 (red), in the neighborhood of V_2^{H,(t)} to all the nodes in V_1^{H,(t)}. b V_2^{H,(t+1)} (green) has been obtained by growing V_2^{H,(t)}, where v_4^2 has been included in the set due to its high feature similarity with v_1^1 and their neighbors. c To grow V_1^{H,(t)}, we compare feature similarities of each node, e.g., v_4^1 (red), in the neighborhood of V_1^{H,(t)} to all the nodes in V_2^{H,(t)}. d The set of matched nodes (purple) after iteration-1 of region growing, assuming no match has been found for v_4^1. Figure courtesy: [48]
where

n_M = \sum_{u_1 \in \bar{V}_1} \sum_{u_2 \in \bar{V}_2} \mathbb{1}(u_1, u_2), \quad \text{with } \bar{V}_1 = N_{v_k^1} \cap V_1^H,\; \bar{V}_2 = N_{v_l^2} \cap V_2^H, \text{ and}

\bar{Z} = \frac{1}{n_M} \sum_{u_1 \in \bar{V}_1} \sum_{u_2 \in \bar{V}_2} \left(1 - S_f(u_1, u_2)\right) \mathbb{1}(u_1, u_2).    (4.10)
Here, \bar{Z} denotes the average distance between the already matched pairs belonging to the neighborhoods of v_k^1 and v_l^2. The indicator function \mathbb{1}(u_1, u_2) = 1 if the MCS matching algorithm yields a match between nodes u_1 and u_2, and \mathbb{1}(u_1, u_2) = 0 otherwise. It can be observed from Eq. (4.9) that S_f^N(·) increases as the number of neighbors that have already been matched increases, as desired.

Relevance Feedback. The weight ω_N in Eq. (4.8) is used to provide relevance to the two constituent similarity measures. Instead of using heuristics, relevance feedback [81] can be used to quantify the importance of the neighborhood information and to find ω_N. This is an iterative method that uses a set of training image pairs. In each iteration, users assign a score to the co-segmentation output (denoted as F_{ω_N}^{(t)}) based on its quality. It is then compared with the scores of the co-segmentation outputs (denoted as F_1 and F_0, respectively) obtained using each of the constituent similarity measures individually. The score difference is used to obtain the weights. As one of the notable applications of relevance feedback, Rocchio [81] used it to modify the query terms in a document retrieval application. Rui et al. [107] used relevance feedback to find appropriate weights for combining image features in content-based image retrieval. It may be mentioned here that, in case one does have access to the ground-truth, one can use it in the relevance feedback instead of manual scoring.

User feedback is used to rate F_{ω_N} for every training image pair, as well as F_1 and F_0 obtained separately using ω_N = 1 and ω_N = 0, respectively (Algorithm 2). Initially, equal weight is assigned, i.e., ω_N = 0.5, and F_{ω_N}^{(1)} is computed. Then the weights are iteratively updated by comparing the user ratings for F_1 and F_0 with those for F_{ω_N}^{(t)} until convergence, as explained next. User feedback is used to assign scores θ_{1,k} and θ_{0,k} to F_1 and F_0, respectively, for every image pair k in a set of N_t training image pairs. Then, in each iteration t, users assign a score θ_k^{(t)} to F_{ω_N}^{(t)} for every image pair k. The improvements π_1^{(t)}, π_0^{(t)} in F_{ω_N}^{(t)} over F_1 and F_0 are computed based on the score differences as:

\pi_i^{(t)} = \sum_{k=1}^{N_t} \left( \theta_{i,k} - \theta_k^{(t)} \right), \quad i = 0, 1.    (4.11)
Although the use of more levels of score improves the accuracy, it becomes inconvenient for the users to assign scores. As a trade-off, we use seven levels of scores: −3, −2, −1, 0, 1, 2, 3, where −3 and 3 indicate the worst and the best co-segmentation outputs, respectively. To find the ratio of π_1^{(t)} and π_0^{(t)}, they must be positive. As they are computed as differences between scores, they can be positive or negative.
Algorithm 2 Estimation of the weight ω_N used in the weighted feature similarity computation using relevance feedback
1: Input: Co-segmentation outputs F_1 and F_0 obtained using ω_N = 1 and ω_N = 0 separately for every image pair-k in a training dataset of size N_t; score ∈ {−3, −2, −1, 0, 1, 2, 3}
2: Output: Updated weight ω_N
3: Initialization: t ← 1, ω_N^{(t)} ← 0.5
4: for k = 1 to N_t do
5:   θ_{1,k} ← score assigned to F_1 obtained from image pair-k
6:   θ_{0,k} ← score assigned to F_0 obtained from image pair-k
7: end for
8: loop
9:   for k = 1 to N_t do
10:    Run co-segmentation algorithm on image pair-k with ω_N^{(t)} as weight
11:    F_{ω_N}^{(t)} ← co-segmentation output obtained
12:    θ_k^{(t)} ← score assigned to F_{ω_N}^{(t)}
13:  end for
14:  Cumulative improvement π_i^{(t)} = Σ_{k=1}^{N_t} (θ_{i,k} − θ_k^{(t)}), i = 0, 1
15:  Normalization π_i^{(t)} ← π_i^{(t)} − min(π_1^{(t)}, π_0^{(t)}) + |min(π_1^{(t)}, π_0^{(t)})|, i = 0, 1
16:  ω_N^{(t+1)} = π_1^{(t)} / (π_1^{(t)} + π_0^{(t)})
17:  if converged in ω_N then
18:    ω_N ← ω_N^{(t+1)}
19:    break
20:  end if
21:  t ← t + 1
22: end loop
To have positive values, we scale π_1^{(t)}, π_0^{(t)} (Algorithm 2). Then these improvements are normalized to obtain the weights ω_i^{(t+1)} for the next iteration as:

\omega_i^{(t+1)} = \frac{\pi_i^{(t)}}{\pi_1^{(t)} + \pi_0^{(t)}}, \quad i = 0, 1.    (4.12)
After convergence, we obtain ωN = ω1 .
4.2.4 Common Background Elimination

In the co-segmentation problem, we are interested in common foreground segmentation and not in common background segmentation. However, an image pair may contain similar background regions such as the sky, a field or a water body. In this scenario, the co-segmentation algorithm, as described so far, will also co-segment the background regions, since it is designed to capture the commonality across images. Thus, such background superpixels should be ignored while building the product graphs and
during region co-growing. Further, discarding the background nodes will reduce the product graph size and the subsequent computations. In the absence of any prior information, Zhu et al. [151] proposed a method to estimate the backgroundness probability of superpixels in an unsupervised framework. This method is briefly described next. Typically, we capture images keeping the objects of interest at the center of the image. This is called the center bias. Hence, most superpixels at the image boundary (B) are more likely to be part of the background. Additionally, several non-boundary superpixels also belong to the background, and we expect them to be highly similar to those in B. Thus, a boundary connectivity measure of each superpixel v is defined as:

C_B(v) = \frac{\sum_{v' \in B} S_f(v, v')}{\sum_{v' \in I} S_f(v, v')},    (4.13)

where S_f(·) can be the feature similarity measure of Eq. (4.1) or any other appropriate measure that may also incorporate the spatial coordinates of the superpixels. Then, the probability that a superpixel v belongs to the background is computed as:

P_B(v) = 1 - \exp\left( -\frac{1}{2} \left(C_B(v)\right)^2 \right).    (4.14)
To identify the possible background superpixels, we can compute this probability for all superpixels in the respective images I_1 and I_2, and the superpixels with P_B(v_i) > t_B can be marked as background. Here, t_B is a threshold.
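The backgroundness estimate of Eqs. (4.13)–(4.14) amounts to two vectorized operations once the superpixel similarity matrix is available, as the following sketch shows; the small constant added to the denominator is an assumption made here for numerical safety.

```python
import numpy as np

def background_probability(S, boundary_mask):
    """Boundary connectivity (Eq. 4.13) and background probability (Eq. 4.14).
    S: (n, n) superpixel similarity matrix of one image; boundary_mask: boolean
    vector marking superpixels that touch the image boundary."""
    C_B = S[:, boundary_mask].sum(axis=1) / (S.sum(axis=1) + 1e-12)
    return 1.0 - np.exp(-0.5 * C_B ** 2)

# superpixels whose probability exceeds t_B (e.g., 0.75 * max(P_B), see Sect. 4.4)
# are treated as background and excluded from the product graphs
```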
4.3 Multiscale Image Co-segmentation

The number of superpixels in an image containing a well-textured scene increases with the image size, which in turn makes the region adjacency graph larger. To maintain the computational efficiency of the co-segmentation algorithm for high-resolution images, we discuss a method using a pyramidal representation of images, where every image is downsampled into multiple scales. First, the images at every scale are oversegmented into superpixels, keeping the average superpixel size fixed, and RAGs are computed. Naturally, the images at the coarsest level (i.e., the smallest resolution) contain the least number of superpixels (nodes). Hence, the maximum common subgraph is computed at the coarsest level, where the computation of the MCS matching algorithm is the least. One could next perform region co-growing at this level and resize that output to the input image size. However, this would introduce object localization errors. To avoid this, the matched superpixels obtained from the MCS at the coarsest level are mapped to the upper-level superpixels through pixel coordinates, and region co-growing is performed there. This process of mapping and co-growing is successively repeated at the remaining finer levels of the pyramid to obtain the final result.
Let us explain this process using an example. The input images I_1 and I_2 are successively downsampled (by 2) R times, with I_{1,R} and I_{2,R} being the coarsest-level image pair, and let us denote I_{1,1} = I_1 and I_{2,1} = I_2. Let V_{1,R}^H and V_{2,R}^H be the sets of matched superpixels in I_{1,R} and I_{2,R} obtained using the MCS matching algorithm. To find the matched superpixels in I_{i,R-1}, every superpixel in V_{i,R}^H is mapped to certain superpixels in I_{i,R-1} based on the coordinates of the pixels inside the superpixels. Since I_{i,R-1} is larger than I_{i,R}, this mapping is one-to-many. To obtain the mapping of a superpixel v ∈ V_{i,R}^H in I_{i,R-1}, let \{(x_v, y_v)\} denote the coordinate set of the pixels constituting v, and \{(\tilde{x}_v, \tilde{y}_v)\} denote the twice-scaled coordinates. Now, a superpixel u ∈ I_{i,R-1} is marked as a mapping of v if \{(x_u, y_u)\} has the highest overlap with \{(\tilde{x}_v, \tilde{y}_v)\} among all superpixels in V_{i,R}^H, i.e., v → u if

v = \arg\max_{v'} \left| \{(\tilde{x}_{v'}, \tilde{y}_{v'})\} \cap \{(x_u, y_u)\} \right|.    (4.15)
Then region co-growing can be performed on the mapped superpixels in I_{1,R-1} and I_{2,R-1}, as discussed in Sect. 4.2.3, to obtain the matched superpixel sets V_{1,R-1}^H in I_{1,R-1} and V_{2,R-1}^H in I_{2,R-1}. This process is to be repeated for the subsequent levels to obtain the final matched superpixel sets V_{1,1}^H and V_{2,1}^H that constitute the co-segmented objects in I_{1,1} and I_{2,1}, respectively.
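One plausible reading of the mapping step in Eq. (4.15) is sketched below on integer superpixel label maps: a finer-level superpixel is marked as matched when the coarse-level superpixel that overlaps it the most (after scaling coordinates by the downsampling factor of 2) belongs to the matched set. The label-map representation and the even-dimension assumption are conveniences of this illustration, not requirements of the method.

```python
import numpy as np

def map_matched_superpixels(labels_coarse, labels_fine, matched_coarse):
    """labels_coarse, labels_fine: integer superpixel label maps at levels R and
    R-1 (fine dimensions assumed to be exactly twice the coarse ones);
    matched_coarse: set of matched superpixel labels at the coarse level."""
    # replicate every coarse pixel 2x2 so both maps share the fine coordinate grid
    up = np.kron(labels_coarse, np.ones((2, 2), dtype=labels_coarse.dtype))
    mapped = set()
    for u in np.unique(labels_fine):
        overlap = up[labels_fine == u]                    # coarse labels under superpixel u
        v_star = int(np.bincount(overlap).argmax())       # largest-overlap coarse superpixel
        if v_star in matched_coarse:
            mapped.add(int(u))
    return mapped
```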
4.4 Experimental Results

In this section, we analyze the performance of several image pair co-segmentation algorithms, including the method described in this chapter (denoted as PR), by performing experiments on images selected from five datasets: the image pair dataset [69], the MSRC dataset [105], the iCoseg dataset [8], the flower dataset [92] and the Weizmann horse dataset [13]. Let us begin by discussing the choice of different parameters in the PR method. For an n_1 × n_2 image (at the coarsest level), the number of superpixels is chosen to be N = min(100, n_1 n_2 / 250). This limits the size of the graph to at most 100 nodes. The maximal many-to-one matching parameter τ is limited to 2 as a trade-off between the product graph size and a possible reduction in the number of seed superpixels for region co-growing. The inter-image feature similarity threshold t_G in Eq. (4.3) is chosen adaptively to ensure that the size of the product graphs W_{12} and W_{21} is at most 40–50, due to computational restrictions. In Sect. 4.2.4, the threshold for the background probability is set as t_B = 0.75 max({P_B(v_i), ∀ v_i ∈ I}) to ignore the possible background superpixels in the co-segmentation algorithm. We first visually analyze the results and then discuss the quantitative evaluation. Row A in Figs. 4.7 and 4.8 shows multiple image pairs containing a single common object and multiple common objects, respectively. Co-segmentation results on these image pairs using the methods PC [21], CIP [69], CBC [39], DSAD [60], UJD [105], SAW [16], MRW [64] and PR are provided in Rows B-I, respectively. These results demonstrate
Fig. 4.7 Co-segmentation of image pairs containing single common object. Results obtained from the image pairs in a–h of Row A using methods PC, CIP, CBC, DSAD, UJD, SAW, MRW and the method described in this chapter (PR) are shown in Rows B-I, respectively. Ground-truth data is shown in Row J. Figure courtesy: [48]
Fig. 4.7 (Continued): Co-segmentation of image pairs containing single common object
Fig. 4.8 Co-segmentation of image pairs containing multiple common objects. Results obtained from the image pairs in a–d of Row A using methods PC, CIP, CBC, DSAD, UJD, SAW, MRW are shown in Rows B-H, respectively. See next page for continuation. Figure courtesy: [48]
Fig. 4.8 (Continued): Co-segmentation of image pairs containing multiple common objects. Results obtained from the image pairs in a–d of Row A using the method PR are shown in Row I. Ground-truth data is shown in Row J. Figure courtesy: [48]
the superior performance of PR (Row I) when compared with the ground-truth (Row J). Among these methods, PC, CIP, CBC and SAW are co-saliency detection methods. For the input image pairs in Fig. 4.7a, b, the methods PC, DSAD and UJD detect only one of the two common objects (shown in Rows B, E, F). Most of the outputs of PC, CIP, CBC and DSAD (shown in Rows B-E) contain discontiguous and spurious objects. Further, in most cases the common objects are either under-segmented or over-segmented. Although the method UJD yields contiguous objects, it very often fails to detect any object from both images (Row F in Fig. 4.7a, c, e, h). However, PR yields the entire object as a single entity with very little over- or under-segmentation. More experimental results are shown in Figs. 4.9, 4.10 and 4.11.

The quality of the co-segmentation output is quantitatively measured using precision, recall and F-measure, as used in earlier works, e.g., [69]. These metrics are computed by comparing the segmentation output mask (F) with the ground-truth mask (G) provided in the database, as defined next. Precision (P) is the ratio of the number of correctly detected co-segmented object pixels to the number of detected object pixels. It penalizes classifying background pixels as object. Recall (R) is the ratio of the number of correctly detected co-segmented object pixels to the number of object pixels in the ground-truth image (G). It penalizes not detecting all pixels of the object. F-measure (F_PR) is the weighted harmonic mean of precision and recall, computed as:

P = \frac{\sum_{i=1}^{n_1 n_2} F(i) G(i)}{\sum_{i=1}^{n_1 n_2} F(i)},    (4.16)

R = \frac{\sum_{i=1}^{n_1 n_2} F(i) G(i)}{\sum_{i=1}^{n_1 n_2} G(i)},    (4.17)

F_{PR} = \frac{(1 + \omega_F) \times P \times R}{\omega_F \times P + R},    (4.18)
Fig. 4.9 Co-segmentation results of the methods PR, UJD and MRW on an image pair selected from the MSRC dataset [105]. a, b Input image pairs (IN). c, d Co-segmentation outputs of UJD and e, f that of MRW. g, h Co-segmentation outputs of PR. The extracted objects are shown on gray background
where ω_F = 0.3 (as in other works) to place more emphasis on precision. It is worth mentioning here that the Jaccard similarity is another commonly used metric for evaluating segmentation results. However, we do not use it in this chapter because most of the methods considered here are co-saliency methods, and the F-measure is commonly used to evaluate saliency detection algorithms. In subsequent chapters, we will use the Jaccard similarity as the metric.
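Eqs. (4.16)–(4.18) translate directly into a few lines of NumPy operating on binary masks, as sketched below; the small constants guard against empty masks and are an assumption of this illustration.

```python
import numpy as np

def precision_recall_fmeasure(F, G, omega_F=0.3):
    """Precision, recall and weighted F-measure of Eqs. (4.16)-(4.18).
    F: binary co-segmentation mask; G: binary ground-truth mask."""
    tp = np.logical_and(F, G).sum()
    P = tp / (F.sum() + 1e-12)
    R = tp / (G.sum() + 1e-12)
    F_PR = (1 + omega_F) * P * R / (omega_F * P + R + 1e-12)
    return P, R, F_PR
```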
Fig. 4.10 Co-segmentation results of the methods PR, UJD and MRW on an image pair selected from the Weizmann horse dataset [13]. a, b Input image pairs (IN). c, d Co-segmentation outputs of UJD and e, f that of MRW. g, h Co-segmentation outputs of PR. The extracted objects are shown on gray background
Fig. 4.11 Co-segmentation results of the methods PR, UJD and MRW on three image pairs selected from the flower dataset [92]. a–f Input image pairs (IN). g–l Co-segmentation outputs of UJD and m–r that of MRW. s–x Co-segmentation outputs of PR. The extracted objects are shown on black background

Table 4.1 Precision (P), recall (R) and F-measure (F_PR) values of the method PR and of the methods MRW, UJD, SAW, CBC, CIP, DCC, DSAD, PC on the image pair dataset [69]

Metrics   PR      MRW     UJD     SAW     CBC     CIP     DCC     DSAD    PC
P         0.841   0.701   0.573   0.913   0.897   0.836   0.515   0.549   0.519
R         0.811   0.907   0.701   0.674   0.614   0.620   0.823   0.371   0.217
F_PR      0.817   0.719   0.575   0.798   0.788   0.752   0.542   0.428   0.358
Quantitative comparisons of the methods MRW, UJD, SAW, CBC, CIP, DCC, DSAD, PC and PR on the image pair dataset [69] and the MSRC dataset [105] are shown in Tables 4.1 and 4.2, respectively. The results show that the precision and recall values of PR are very close, as they should be, while also being very high. This indicates that PR reduces both false positives and false negatives.
Table 4.2 Mean precision (P), recall (R) and F-measure (F_PR) values of the method PR with the methods MRW, UJD, SAW, CBC, CIP, DCC, DSAD, PC on images selected from 'cow', 'duck', 'dog', 'flower' and 'sheep' classes in the MSRC dataset [105]

Metrics   PR      MRW     UJD     SAW     CBC     CIP     DCC     DSAD    PC
P         0.981   0.837   0.812   0.859   0.970   0.790   0.803   0.566   0.564
R         0.791   0.836   0.791   0.655   0.680   0.373   0.377   0.310   0.230
F_PR      0.928   0.818   0.789   0.787   0.872   0.510   0.593   0.432   0.394
Table 4.3 Computation time (in seconds) required by the methods PR, MRW, UJD, as the image pair size (86 × 128 and 98 × 128) increases by the shown factors

Method   1×1     2×2     2²×2²   2³×2³   2⁴×2⁴
MRW      32.65   51.63   78.83   163.61  820.00
UJD      1.80    6.00    25.20   107.40  475.80
PR       1.54    2.08    2.94    5.69    13.90
While the methods CBC and SAW (Table 4.1) also have high precision values, their recall rates are significantly inferior. The method MRW has a good recall measure, but its precision is quite low. In order to compare the speed, we execute all the methods on the same system and report the computation times in Table 4.3. It shows that the method PR is significantly faster than the methods MRW and UJD. The advantage of PR becomes more noticeable as the image size increases.
4.5 Extension to Co-segmentation of Multiple Images In this section, we describe an extension of the pairwise co-segmentation method to an image set. Finding matches over multiple images instead of just an image pair is more relevant in analyzing crowd-sourced images from an event or a tourist location. Simultaneous computation of the MCS of N graphs drastically grows the product graph size to the order of O(ζ τ^{N−1} |G_1|^{N−1}) (assuming, for simplicity, the same cardinality for every graph), making the algorithm computationally intractable. Hence, we describe a different scheme that solves the multiple image co-segmentation problem using a hierarchical setup, where N − 1 pairwise co-segmentation tasks are solved over a binary tree structured organization of the constituent images and results (see Figs. 4.12 and 4.13). For each task, there is a separate product graph of size only O(ζ τ (|G_1| + |G_2|)), which demonstrates the computational efficiency of this scheme, to be elaborated next.
Fig. 4.12 Hierarchical image co-segmentation scheme for N = 4 images. Input images I_1–I_4 are represented as graphs G_1–G_4. Co-segmentation of I_1 and I_2 yields MCS H_{1,1}. Co-segmentation of I_3 and I_4 yields MCS H_{2,1}. Co-segmentation of H_{1,1} and H_{2,1} yields MCS H_{1,2} that represents the co-segmented objects in images I_1–I_4. Thus, a total of N − 1 = 3 co-segmentation tasks is required. Figure courtesy: [48]
Fig. 4.13 Hierarchical image co-segmentation scheme for N = 6 images. Here, a total of N − 1 = 5 co-segmentation tasks is required
Let I_1, I_2, …, I_N be a set of N images, and let G_1, G_2, …, G_N denote the respective RAGs. To co-segment them using this hierarchical approach, T = ⌈log_2 N⌉ levels of co-segmentation are required. Let H_{j,l} denote the j-th subgraph at level l. First, the image pairs (I_1, I_2), (I_3, I_4), …, (I_{N−1}, I_N) are co-segmented independently. Let H_{1,1}, H_{2,1}, …, H_{N/2,1} be the resulting subgraphs at level l = 1 (see Fig. 4.12). Then the MCS of the pairs (H_{1,1}, H_{2,1}), (H_{3,1}, H_{4,1}), …, (H_{N/2−1,1}, H_{N/2,1}) are computed to obtain the corresponding co-segmentation maps H_{1,2}, H_{2,2}, … at level l = 2. This process is repeated until the final co-segmentation map H_{1,T} at level l = T is obtained. Figures 4.12 and 4.13 show the block diagrams when considering co-segmentation for four images (T = 2) and six images (T = 3), respectively. The advantages of this approach are as follows (a minimal sketch of the scheme is given after the list).
• The computational complexity reduces greatly after the first level of operation since |H_{j,l}| ≪ |G_i| at any level l, and the graph size reduces at every subsequent level.
• We need to perform co-segmentation at most N − 1 times for N input images; i.e., the complexity increases linearly with the number of images to be co-segmented.
• If at any level any MCS is null, we can stop the algorithm and conclude that the images in the set do not all share a common object.
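The following is a minimal sketch of this binary-tree scheme. It assumes a routine pairwise_mcs(Ga, Gb) (a stand-in for the MCS-plus-region-co-growing step of this chapter) that returns one of the two matched subgraphs, or None when the MCS is empty; the function name and the handling of an odd number of graphs at a level are our own choices.

```python
def hierarchical_cosegment(graphs, pairwise_mcs):
    """Reduce N region adjacency graphs to a single common subgraph using
    at most N - 1 pairwise co-segmentation (MCS) computations."""
    level = list(graphs)
    while len(level) > 1:
        next_level = []
        for a, b in zip(level[0::2], level[1::2]):
            h = pairwise_mcs(a, b)
            if h is None:                 # empty MCS: stop, no shared object
                return None
            next_level.append(h)
        if len(level) % 2 == 1:           # odd graph is carried to the next level
            next_level.append(level[-1])
        level = next_level
    return level[0]
```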
Fig. 4.14 Image co-segmentation from four images. Co-segmentation of the image pair in (a, b) yields outputs (e, f), and the pair in (c, d) yields outputs (g, h). These outputs are co-segmented to obtain the final outputs (i, j, k, l). Notice how small background regions present in (e, f) have been removed in (i, j) after the second round of co-segmentation. Figure courtesy: [48]
It may be noted that during the first level of co-segmentation, the images I_1, I_2, …, I_N can be paired in any order. Further, due to the non-commutativity discussed in Sect. 4.2.2, the MCS output at any level actually corresponds to two matched subgraphs from the respective input graphs, and we may choose either of them as H_{j,l} for the MCS computation at the next level. Figure 4.14 shows an example of co-segmentation for four images. For the input image pairs I_1, I_2 in (a, b) and I_3, I_4 in (c, d), the co-segmentation outputs at level l = 1 are shown in (e, f) and in (g, h), respectively. Final co-segmented objects (at level l = 2) are shown in (i, j) and (k, l). More experimental results for multi-image co-segmentation are shown in Figs. 4.15, 4.16, 4.17, and 4.18. To summarize this chapter, we have described a computationally efficient image co-segmentation algorithm based on the concept of maximum common subgraph matching. Computing the MCS from a relatively small product graph, and performing
Fig. 4.15 Co-segmentation results obtained using the methods PR, UJD and MRW on four images selected from the MSRC dataset [105]. a–d Input images (IN). e–h Co-segmentation outputs of UJD and i–l that of MRW. m–p Co-segmentation outputs of PR. The extracted objects are shown on gray background. The method UJD could not detect the common object (cow) from images (a), (d). The method MRW includes a large part of the background in the output
region co-growing on the nodes (seeds) obtained at the MCS output is efficient. Further, incorporating them in a pyramidal co-segmentation makes the method computationally very efficient for high-resolution images. This method can handle variations in shape, size, orientation and texture in the common object among constituent images. It can also deal with the presence of multiple common objects, unlike some of the methods analyzed in the results section.
Fig. 4.16 Co-segmentation results obtained using the methods PR, UJD and MRW on four images selected from the iCoseg dataset [8]. a–d Input images (IN). e–h Co-segmentation outputs of UJD and i–l that of MRW. m–p Co-segmentation outputs of PR. The extracted objects are shown on black background
Fig. 4.17 Co-segmentation results obtained using the methods PR, UJD and MRW on six images selected from the iCoseg dataset [8]. a–f Input images (IN). g–l Co-segmentation outputs of UJD and m–r that of MRW. s–x Co-segmentation outputs of PR. The extracted objects are shown on gray background. The method UJD could not find any common objects and yields significantly different outputs with the change in the number of input images to be co-segmented
The extension of the pairwise co-segmentation method to multiple images may not always yield the desired result, since it requires the common object to be present in all the images of the set. It is evident from Fig. 4.13 that the resulting subgraphs at every level form a shrinking subset of nodes. If there is at least one image that does not contain the common object, then H_{1,T} = {φ} (T = 3 in Fig. 4.13), and the method fails to detect the common object even in the images that do contain it. Hence, we explore a solution to this problem in the next chapter through a multi-image co-segmentation algorithm that can handle the presence of images without the common object.
Fig. 4.18 Co-segmentation results obtained using the methods PR, UJD and MRW on three challenging images selected from the MSRC dataset [105]. a–c Input images (IN). d–f Co-segmentation outputs of UJD and g–i that of MRW. j–l Co-segmentation outputs of PR. m–o Ground-truth. The extracted objects are shown on gray background
Chapter 5
Maximally Occurring Common Subgraph Matching
5.1 Introduction In image co-segmentation, we simultaneously segment the common objects present in multiple images. In the previous chapter, we studied an algorithm for co-segmentation of an image pair, its possible extension to multiple images and the challenges involved. These images are generally retrieved from the internet. Hence, not all images in the set may contain the common object, as some of these images may be totally irrelevant in the collected data set (see Fig. 5.1). The presence of such outlier images in the database makes the co-segmentation problem even more difficult. As mentioned earlier, it is possible that the set of crowd-sourced images to be co-segmented contains some outlier images that do not share the common object present in the majority of the images in the set. Several methods, including those in [18, 56, 57, 60, 64, 78], do not consider this scenario. Rubinstein et al. [105] and Wang et al. [139] do consider this. However, the method of Rubinstein et al. [105], being a saliency-based approach, misses all non-salient co-segments. Wang et al. [139] proposed a supervised method for co-segmentation of image sets containing outlier images. First, they learn an object segmentation model from a training image set of the same category. An outlier, if any, is rejected if it does not conform to the trained model during co-segmentation. For an unsupervised scheme, to co-segment a set of N images with the common object being present in an unknown number M (M ≤ N) of images, the order of image matching operations is O(N 2^{N−1}). In this chapter, we show that this problem can be addressed by processing all the images together and solving it in linear time using a greedy algorithm. The discussed method co-segments a large number (N) of images using only O(N) matching operations. It can also detect multiple common objects present in the image set.
Fig. 5.1 Image co-segmentation. The input image set that includes an outlier image (second image from right) is shown in the top row. The extracted common objects are shown in the bottom row. Image courtesy: Source images from the iCoseg dataset [8]
5.2 Problem Formulation In this section, we introduce relevant terminology and notations, and formulate the problem of multiple image co-segmentation using a graph-based approach. Let I = {I1 , I2 , . . . , I N } be the set of N images to be co-segmented. As in Chap. 4, every image Ii is represented as an undirected graph Gi = (Vi , Ei ) where Vi is the set of nodes (or vertices) and Ei is the set of edges.
5.2.1 Mathematical Definition Given two graphs G_1 and G_2, the maximum common subgraph (MCS) G* = (V*, E*) (see Fig. 5.2) is defined as:

G* = MCS(G_1, G_2), with V* = \arg\max_{V} {|V| : V ∈ G_1, V ∈ G_2},   (5.1)

where |·| denotes cardinality, G* ⊆ G_1 and G* ⊆ G_2. Obtaining the MCS is known to be an NP-complete problem [61]. This definition of MCS can be extended to a set of N graphs Ḡ = {G_1, G_2, …, G_N} as G* = MCS(G_1, G_2, …, G_N) with

V* = \arg\max_{V} {|V| : V ∈ G_1, V ∈ G_2, …, V ∈ G_N}.   (5.2)
We have seen in Chap. 4 that co-segmentation of N images can be performed by solving O(N) NP-complete problems through hierarchical subgraph matching using pairwise comparison, provided there is a non-empty set of nodes in every graph (image) sharing the same node labels and edge connections across all the graphs, i.e.,

MCS(G_1, G_2, …, G_N) ≠ {φ}.   (5.3)
Fig. 5.2 Maximum common subgraph (MCS). G1 , G2 are the input graphs. {u 1 , u 2 , u 4 , u 3 } and {v1 , v2 , v4 , v3 } are the set of nodes in the subgraphs that match and hence constitute the MCS(G1 , G2 ) shown in the bottom row
5.2.2 Multi-image Co-segmentation Problem Let F_1, F_2, …, F_N be the common object(s) present in I, and our objective is to find them. It is possible that not every image in the set contains the common object, i.e., it is permissible to have F_j = {φ} for some of the images. We refer to such images as outlier images, as they do not share the same object with the majority of the images, leading to MCS(G_1, G_2, …, G_N) = {φ}. Hence, we would like to find the MCS from a subset of graphs (images) and maximize the cardinality of that subset. At the same time, we need to set a minimum size (α) of the MCS, as it represents the unknown common object. So, we introduce the concept of the maximally occurring common subgraph (MOCS) G^α = (V^α, E^α), with |V^α| ≥ α, computed from a subset of M (≤ N) graphs Ḡ_M = {G_{i_1}, G_{i_2}, …, G_{i_M}} as:
G^α = MOCS(G_1, G_2, …, G_N) = MCS(G_{i_1}, G_{i_2}, …, G_{i_M}),   (5.4)

where i_k ∈ {1, 2, …, N}. M and Ḡ_M are computed as:

M = \arg\max_{n} {n : MCS(Sub_n(G_1, G_2, …, G_N)) ≠ {φ} and |V^α| ≥ α},   (5.5)

Ḡ_M = Sub^α_M(G_1, G_2, …, G_N).   (5.6)
Here Sub_n(·) means any combination of n such elements from the set of graphs G_1, G_2, …, G_N, whereas Sub^α_M(·) means the particular combination of M graphs that maximizes Eq. (5.5), and α is the minimum number of nodes in G^α. Hence, G^α is the solution to the general N-image co-segmentation problem in the presence of outlying observations. As discussed in Sect. 5.2.1, if the set of graphs (images) Ḡ_M containing the common object is known, then we need to find the MCS of M graphs, which requires solving O(M) NP-complete problems. However, in the co-segmentation problem setting where the input image set contains outlier images, both Ḡ_M and M are unknown. Hence, the solution to Eq. (5.4) requires solving O(M') NP-complete problems, where M' = \sum_{i=2}^{N} \binom{N}{i} i. This can be approximated as O(N 2^{N−1}).
Hence, it is a computationally nightmarish problem, and an appropriate greedy algorithm is required. The key aspects of this chapter are as follows.
• We describe the concept of maximally occurring common subgraph (MOCS) matching across N graphs and demonstrate how it can be approximately solved in O(N) matching steps, thus achieving a significant reduction in computation time compared to the existing methods.
• We discuss a concept called the latent class graph (LCG) to solve the above MOCS problem.
• We demonstrate that the multi-image co-segmentation problem can be solved as an MOCS matching problem, yielding an accurate solution that is robust in the presence of outlier images.
We describe MOCS and LCG in subsequent sections.
5.2.3 Overview of the Method In this section, we provide an overview of the co-segmentation method discussed in this chapter. It is described in Fig. 5.3 using a block diagram and illustrated using an image set in Fig. 5.5.
Coarse-level co-segmentation: First, RAGs are obtained by superpixel segmentation of all the images in I. Let G_i = (V_i, E_i) be the graph representation of image I_i with V_i = {v^i_j} ∀j, where v^i_j is the j-th node (superpixel) in G_i, and E_i = {e^i_{jl}} ∀(j,l)∈N, where (j, l) ∈ N denotes v^i_j ∈ N(v^i_l). Then the image superpixels are grouped into K clusters C_1, C_2, …, C_K (see Blocks 1, 2 in Fig. 5.5) such that \bigcup_{j=1}^{K} C_j = \bigcup_{i=1}^{N} V_i. The cluster C having the most number of images with more than a certain number (α) of superpixels (with a spatial constraint to be defined in Sect. 5.3.2) is selected to serve as seeds to grow back the common object at a later stage. This yields a very coarse co-segmentation of the images (details in Sect. 5.3).
Fine-level co-segmentation: The non-empty superpixel set belonging to cluster C in every image I_i is represented as an RAG H_i ⊆ G_i such that H_i = ({v^i_j ∈ C ∩ V_i} ∀j, {e^i_{jl}} ∀(j,l)∈N). Then a latent class graph H_L is constructed by combining the H_i's based on node correspondences (see Block 4 in Fig. 5.5). This graph embeds the feature (node attribute) similarity and the spatial relationships (edges in graphs) among superpixels, from all the images, belonging to that cluster. We obtain H_L = H_L^{(N)} as:

H_L^{(1)} = H_1;  H_L^{(i)} = F(H_L^{(i−1)}, H_i), ∀i = 2, 3, …, N,   (5.7)

where F(·) is the graph merging function defined in Sect. 5.4.1. This latent class graph is used for region growing on every H_i to obtain a finer level of co-segmentation and the complete common object H'_i (see Block 5 in Fig. 5.5).
H'_i = F_R(H_L, H_i, I_i), ∀i = 1, 2, 3, …, N,   (5.8)

where F_R(·) is the region growing function described in Sect. 5.4.2.
Fig. 5.3 Block diagram of the multiple image co-segmentation method
It may be noted that in graph theory, graph matching is typically done by matching nodes that have the same labels. Here, however, image superpixels are represented as nodes in the graph, and nodes within the common object in multiple images may not have exactly the same features. Moreover, the common object may be of different sizes (number of superpixels constituting the object) in different images. Hence, in the solution to the MOCS problem in Eq. (5.4), the resulting common subgraph G^α ⊆ G_i also has a different cardinality (≥ α) in each image I_i. This makes the problem computationally challenging. In Sect. 5.3, we describe the superpixel features and superpixel clustering, and select clusters containing the common object. In Sect. 5.4, we describe the process of obtaining the common object from the selected clusters. We conclude with experimental results and discussions in Sect. 5.5.
5.3 Superpixel Clustering First, the input images are segmented into superpixels using the simple linear iterative clustering algorithm [1] so that common objects across images can be identified by matching inter-image superpixels. Assuming each object to be a group of superpixels, we expect high feature similarity among all such superpixel groups corresponding to common objects. For every superpixel s ∈ F_j, there must be superpixels of similar features present in the other images Sub_{M−1}(F_1, F_2, …, F_{j−1}, F_{j+1}, …, F_N) to maintain globally consistent matching among the superpixels. To solve this problem of matching superpixel groups along with neighborhood constraints efficiently, every image I can be represented as a region adjacency graph (RAG) G. Each node in G represents a superpixel and is attributed with the corresponding superpixel features and the spatial coordinates of its centroid. A pair of nodes is connected by an edge if the corresponding superpixel pair is spatially adjacent. As the image set I may contain a large number of images, the total number of superpixels becomes very large and superpixel matching across images becomes computationally prohibitive. Hence, the superpixels are grouped into clusters for further processing, as described next.
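As an illustration of the RAG representation just described, here is a small sketch that builds such a graph with NetworkX from a superpixel label map; the dictionaries `features` and `centroids` (keyed by superpixel label) are assumed to be precomputed, and the function name is ours.

```python
import numpy as np
import networkx as nx

def build_rag(labels, features, centroids):
    """Region adjacency graph: one node per superpixel, attributed with its
    feature vector and centroid; an edge joins spatially adjacent superpixels."""
    g = nx.Graph()
    for lab in np.unique(labels):
        g.add_node(int(lab), feature=features[int(lab)], centroid=centroids[int(lab)])
    # horizontally / vertically neighboring pixels with different labels
    # identify pairs of adjacent superpixels
    right = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    down = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.concatenate([right, down])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    g.add_edges_from((int(a), int(b)) for a, b in pairs)
    return g
```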
5.3.1 Feature Computation To obtain homogeneous regions through clustering of superpixels, we must include spatial neighborhood constraints in addition to superpixel features. Since all superpixels from all images are considered simultaneously for clustering, it is difficult to design such spatial constraints without a location prior. Hence, we want the superpixel features to be embedded with this spatial neighborhood information. So for each superpixel s, first a feature vector f(s) is computed as a combination of low-level features: (i) color histogram and mean color in the RGB color space and (ii) rotation invariant histogram of oriented gradients (HOG). Then, features from the first-order neighborhood N_1(s) and the second-order neighborhood N_2(s) of every superpixel s are combined to design a new feature h(s). For example, u_4 ∈ N_1(u_1), u_5 ∈ N_2(u_1), v_2 ∈ N_1(v_1) and v_5 ∈ N_2(v_1) in Fig. 5.2. The most similar superpixels to s, s_1 in N_1(s) and s_2 in N_2(s), are chosen as:

s_i = \arg\min_{r} {d_f(f(s), f(r)), ∀r ∈ N_i(s)}, i = 1, 2,   (5.9)
where d_f(·) denotes the feature distance. One can use any appropriate distance measure, such as the Euclidean distance, as d_f(·). Using the neighborhood superpixels, h(s) is computed as:

h(s) = [f(s)  f(s_1)  f(s_2)],   (5.10)

where [·] denotes concatenation of vectors. This feature h(s) compactly contains the information about the neighborhood of s. Hence, it serves as a better feature than f(s) alone while grouping superpixels inside common objects in multiple images, as described next.
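A sketch of Eqs. (5.9) and (5.10), assuming the RAG of the previous sketch whose nodes carry a 'feature' attribute; the Euclidean distance is used as d_f, as suggested above.

```python
import numpy as np

def combined_feature(rag, s):
    """h(s) = [f(s) f(s1) f(s2)]: concatenate the feature of superpixel s with
    those of its most similar first- and second-order neighbours (Eq. (5.9))."""
    f = lambda v: np.asarray(rag.nodes[v]['feature'])
    n1 = set(rag.neighbors(s))                                    # N1(s)
    n2 = {w for v in n1 for w in rag.neighbors(v)} - n1 - {s}     # N2(s)
    s1 = min(n1, key=lambda r: np.linalg.norm(f(s) - f(r)))
    s2 = min(n2, key=lambda r: np.linalg.norm(f(s) - f(r))) if n2 else s1
    return np.concatenate([f(s), f(s1), f(s2)])                   # Eq. (5.10)
```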
5.3.2 Coarse-level Co-segmentation To find the maximally occurring common object(s) in multiple images, we need to find superpixels with similar features across images. As there is a large number of superpixels in a large image set, a large number of matchings are required to be performed. This increases the computational complexity. This problem is alleviated by clustering the superpixels using their features h(s) defined in Sect. 5.3.1. This results in a coarse-level matching of superpixels where one cluster contains the common object partially. Figure 5.4 shows an illustration. The number of clusters (K ) defines the coarseness of matching. A small value of K may not help much in saving computations during graph matching, while the choice of a large value of K may not help in picking up the common object in several of the images unless the common objects in the images are very close in the feature space, which is rarely the case. Any clustering technique such as the k-means can be used to cluster the superpixels.
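A minimal clustering sketch using k-means from scikit-learn (one of the clustering techniques mentioned above); `h_features` is assumed to stack the h(s) vectors of all superpixels from all images, and the default K follows the K ∈ [7, 10] range used in the experiments of this chapter.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_superpixels(h_features, K=10, seed=0):
    """Group the pooled superpixel features into K clusters; returns one
    cluster index per superpixel (the coarse-level matching)."""
    X = np.vstack(h_features)
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
```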
Fig. 5.4 How common objects in constituent images form a single common cluster in the entire pooled data. Each data point denotes a superpixel feature from any of the four images. Arrows indicate the common object. Note that image-4 does not contribute to the designated cluster C and hence is an outlier image, signifying absence of a common object
Each cluster contains superpixels from multiple images with similar features, and these superpixels constitute parts of the common object or background regions in each image. One of these clusters should contain the superpixels of the common object, and our goal is to find this cluster, leading to a coarse co-segmentation. Among the images, there could be two different types of commonality: the common foreground object and (very often) a similar background. Hence, these background superpixels are first discarded from every cluster using the background probability measure P(s) of Zhu et al. [151] before finding the cluster of interest. Background removal also helps in discarding clusters containing only the common background. A superpixel s is marked as background if P(s) > t_B, where t_B is an appropriately chosen threshold. Further, all sub-images (for a given cluster) having fewer than α superpixels (see Eq. (5.5)) are discarded. This means that the minimum size of the seeds for coarsely co-segmented images to grow into an arbitrarily sized common object is set to α superpixels. Thus, the number of non-empty sub-images in cluster C is the estimate of M in Eq. (5.5). Subsequently, to determine the cluster of interest C, we need to consider all (say n_{ij}) superpixels {s_{ij}(k), k = 1, 2, …, n_{ij}} in image I_i belonging to the j-th cluster. For an image with a segmentable object, which is typically compact, the superpixels constituting it should be spatially close. Here, we introduce the notion of compactness of a cluster. Let σ²_{ij} denote the spatial variance of the centroids of {s_{ij}(k)}; the initial co-segmented area is obtained as:

A_{ij} = \sum_{k=1}^{n_{ij}} area(s_{ij}(k)).   (5.11)
Fig. 5.5 The co-segmentation method. Block 1 shows the input image set I1 , I2 , . . . , I10 that contains one majority class (bear) in I1 , I2 , . . . , I8 and two outlier images I9 , I10 . Block 2 shows sub-images in clusters C1 , C2 , C3 , C4 , respectively, after removing background superpixels from them using the measure given in Sect. 5.3.2. Cluster 3 is the computed cluster of interest C (using the compactness measure given in Eq. (5.12)) and it shows the partial objects obtained from the coarse-level co-segmentation. This cluster has a single superpixel (encircled) from image I9 and it is discarded in subsequent stages as we have set α = 3 in Eq. (5.5). Arrows in Block 2 show that holes in sub-images 2, 5, 6 and 7 of C3 will be filled by transferring superpixels from C2 and C4 using the method given in Sect. 5.3.3. Block 3 shows the effect of hole filling (H.F.)
Fig. 5.5 (Continued): The co-segmentation method. Block 4 shows the latent class graph generation. Block 5 shows the complete common object(s) obtained after fine-level co-segmentation (F.S.) i.e., latent class graph generation and region growing. Image courtesy: Source images from the iCoseg dataset [8]
We can define the compactness measure of a cluster using an inverse relationship with the spatial variances of the superpixels belonging to it. The compactness Q_j of the j-th cluster is computed as:

Q_j = \left( \sum_{i} \frac{\sigma^2_{ij}}{A_{ij}} \right)^{-1}, ∀i ∈ [1, N] such that A_{ij} ≠ 0.   (5.12)
The cluster j for which Q_j is the maximum is chosen as the appropriate cluster C for further processing to obtain the fine-level co-segmentation. For an image with a compact, segmentable object, its superpixels are spatially close and the measure Q_j is correspondingly high. Figure 5.5 shows the clustering result of a set of 10 images into 10 clusters where the background superpixels have been removed. Here, we show only the top four densest clusters for simplicity. The coarse-level co-segmentation output is presented in Block 2, which shows the partially recovered common objects constituted by the superpixels belonging to the cluster of interest (C = C_3).
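The cluster selection of Eqs. (5.11) and (5.12) can be sketched as follows; the nested dictionary `clusters[j][i] = (centroids, area)` summarizing the superpixels of image i that fall in cluster j (after background removal) is an assumed data layout, not the book's.

```python
import numpy as np

def compactness(clusters):
    """Q_j of Eq. (5.12): inverse of the summed spatial-variance-to-area
    ratios over the images contributing to cluster j (A_ij = 0 skipped)."""
    Q = {}
    for j, per_image in clusters.items():
        terms = []
        for cents, area in per_image.values():
            if area == 0:
                continue
            var = np.asarray(cents).var(axis=0).sum()   # spatial variance sigma_ij^2
            terms.append(var / area)                    # A_ij from Eq. (5.11)
        Q[j] = 1.0 / sum(terms) if terms else 0.0
    return Q

# The cluster of interest maximizes the compactness:
# C_star = max(Q, key=Q.get)
```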
5.3.3 Hole Filling Cluster C may not contain the complete common object (as shown in Block 2, Fig. 5.5) because we are working with natural images containing objects of varying size and pose, and variations in the feature space. Moreover, the spatial coordinates of the superpixels have not been considered during clustering. Hence, the superpixels in an image belonging to the cluster C need not be spatially contiguous, and the partial object may have an intervening region (hole) within this segmented cluster, with the missing superpixels being part of another cluster. Since an image has no holes spatially, this hole can be safely assigned to the same cluster. These segmentation holes in cluster C are filled as explained next. Let V_i^C ⊆ V_i be the set of superpixels of image I_i belonging to cluster C. For every image, we represent V_i^C as an RAG H_i with |H_i| = |V_i^C|. Then all cycles present in every graph H_i are identified. A cycle in an undirected graph is a closed path of edges and nodes, such that any node in the path can be reached from itself by traversing along this path. As H_i is a subgraph of G_i (i.e., |H_i| ≤ |G_i|), every cycle in H_i is also present in G_i. For every cycle, the nodes (of G_i) interior to it are identified using the superpixel coordinates. Then every graph H_i (and the corresponding V_i^C) is updated by adding these interior nodes to the cycle, thus filling the holes. This is explained in Fig. 5.6. The superpixels corresponding to the interior nodes (v_5) that fill the holes belong to clusters other than cluster C. So, superpixel transfer across clusters occurs in this stage. Cluster C is updated accordingly by including those missing superpixels. This is illustrated in Block 3 of Fig. 5.5.
Fig. 5.6 Hole filling. a Graph G of image I . b Subgraph H of G constituted by the superpixels in cluster C . c Cycle v1 − v2 − v4 − v3 − v1 of length 4. d Find node v5 ∈ G interior to the cycle, and add it and its edges to H
Fig. 5.7 Hole filling in a cycle of length 3. a Graph G of image I . b Subgraph H of G constituted by the superpixels in cluster C and the cycle v1 − v2 − v3 − v1 of length 3. c Find node v4 ∈ G interior to the cycle, and add its edges to H
It is interesting to note that even for a cycle of three nodes, though the corresponding three superpixels are neighbors to each other in image space, it may yet contain a hole because superpixels can have any shape. In Fig. 5.7, nodes v1 , v2 and v3 (belonging to the cluster of interest) are neighbors to each other (in image space) and form a cycle. Node v4 is also another neighbor of each v1 , v2 and v3 , but it belongs to some other cluster. Hence, it creates a hole in the cycle formed by v1 , v2 and v3 . This illustrates that even a cycle of length three can contain a hole. So, we need to consider all cycles of length three or more for hole filling.
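The hole-filling step can also be implemented directly on the superpixel label map with a morphological fill (the chapter notes this alternative in Sect. 5.5.3); the following is a sketch under that formulation, with `selected` denoting the labels currently assigned to the cluster of interest.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def fill_cluster_holes(labels, selected):
    """Transfer into the cluster of interest every superpixel lying inside a
    hole of the currently selected region (morphological variant)."""
    mask = np.isin(labels, list(selected))
    filled = binary_fill_holes(mask)
    hole_pixels = filled & ~mask
    transferred = set(np.unique(labels[hole_pixels]).tolist())
    return set(selected) | transferred
```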
5.4 Common Object Detection Even after hole filling, the superpixels belonging to the updated cluster C may not constitute the complete object (due to the coarse-level co-segmentation and possible inter-image variations in the features) in the respective images they belong to. So, we need to perform region growing on them. In every image I_i, we can append certain neighboring superpixels to the updated V_i^C (after hole filling) based on inter-image feature similarity. This poses two challenges. First, we need to match each image with all possible combinations of the rest of the images, thus requiring O(N 2^{N−1}) matching operations, which is large when a large number of images are being co-segmented. Secondly, if each image is matched independently, global consistency in superpixel matching across all the images is not ensured, as the transitivity property
is violated while matching noisy attributes. In this section, we describe a technique to tackle both these challenges by representing V_i^C in every image as an RAG (the updated H_i) and combining them to build a much larger graph that contains information from all the constituent images. As the processing is restricted to superpixels in one specific cluster (cluster C), we call this combined graph the latent class graph (H_L) for that cluster. We will show that this requires only O(N) graph matching steps. Then pairwise matching and region growing between H_L and every constituent graph H_i are performed independently. So, the computational complexity gets reduced from O(N 2^{N−1}) to O(N) computations (more specifically O(M) if M ≠ N, where the common object is present in M out of the N input images). However, it may be noted (and will be seen later in Table 5.4) that during region growing, since |H_L| ≥ |H_i|, ∀i, matching H_i with H_L involves somewhat more computation than matching H_i with H_j. The details are explained next.
5.4.1 Latent Class Graph Building the latent class graph is a commutative process, and the order in which the constituent graphs (the H_i's) are merged does not matter. To build H_L for cluster C, we can start with the two largest graphs in that cluster and find correspondences among their nodes. It is possible that some nodes in one graph have no match in the other graph due to feature dissimilarity. The unmatched nodes of the second largest graph are appended to the largest graph based on attribute similarity and graph structure, resulting in a larger intermediate graph. Then this process is repeated using the updated intermediate graph and the third largest graph in cluster C, and so on. After processing all the input graphs of cluster C, H_L is obtained as the final updated intermediate graph. This is explained in Fig. 5.8. H_L describes the feature similarity and spatial relationships among the superpixels in cluster C. Being a combined graph, H_L need no longer be a planar graph, and it may not be physically realizable in the image space. It is just a notional representation for computational benefit. This algorithm is explained next. Let the images I_1, I_2, …, I_M be ordered using the size of their respective RAGs in cluster C (the H_i's as defined in Sect. 5.3.3) such that |H_1| ≥ |H_2| ≥ ··· ≥ |H_M|, and let H_L^{(1)} = H_1. First, we need to find node correspondences between the graphs H_L^{(1)} and H_2. As the sizes and shapes of co-occurring objects differ across images, we must allow many-to-many matches among nodes. One can use any many-to-many graph matching technique for this, including the maximum common subgraph (MCS) matching algorithm of Chap. 4. This method finds the MCS between two graphs by building a vertex product graph from the input graph pair and finding the maximal clique in its complement. The resulting subgraphs provide the required inter-graph node correspondences. Depending on the node attributes and the number of nodes in the input graph pair, there may be some nodes in H_2 not having any match in H_1 and vice-versa. We describe the latent class graph generation steps using the graphs shown in Fig. 5.8. Let v_1 ∈ H_2 have a match with u_1 ∈ H_1. However, v_4 ∈ H_2 does
Fig. 5.8 Latent class graph generation. Five nodes in graphs H_1 and H_2 matched with each other (circumscribed by dashed contour). The unmatched nodes v_3, v_4 and v_5 in H_2 are duplicated in H_1 as u_3, u_4 and u_5 to obtain the intermediate latent class graph H_L^{(2)}. Applying this method to H_3, H_4, …, H_M, we obtain the latent class graph H_L = H_L^{(M)}. See color image for node attributes
not have any match in H_1, and there is an edge between v_1 and v_4. In such a scenario, a node u_4 is added to H_1 by assigning it the attributes of v_4 and connecting it to u_1 with an edge. If v_1 matches more than one node in H_1, the node u_m in H_1 having the highest attribute similarity with v_1 is chosen, and u_4 is connected to that node u_m. Adding a new node to H_1 for every unmatched node in H_2 results in an updated intermediate graph, denoted as H_L^{(2)}. Then the above process is repeated between H_L^{(2)} and H_3 to obtain H_L^{(3)}, and so on. Finally, the latent class graph H_L = H_L^{(M)} is obtained. Thus, the latent class graph generation requires O(M) graph matching steps. Here, the H_L^{(t)}'s are non-decreasing, with |H_L^{(t)}| ≥ |H_L^{(t−1)}|. To prevent a possible exponential blow up of the latent class graph, in every iteration t exactly one node (u_4) is added to the intermediate graph H_L^{(t)} for every unmatched node (v_4) in the other graph H_i. Hence, the cardinality of H_L is equal to the sum of |H_1| and the total number of non-matching nodes (unique superpixels in cluster C) in H_i over the iterations. A sketch of the proof is as follows. The latent class graph generation function F(·) in (5.7) can be defined as:

F(H_L^{(i−1)}, H_i) = F_A({H_i \ MCS(H_L^{(i−1)}, H_i)}, H_L^{(i−1)}),   (5.13)
where F_A(·) is a function that appends the unmatched nodes of H_i to H_L^{(i−1)}, as explained above. For i = 2,

F(H_L^{(1)}, H_2) = F_A({H_2 \ MCS(H_L^{(1)}, H_2)}, H_L^{(1)}).   (5.14)
Hence, the cardinality of H_L^{(2)} is equal to the sum of the cardinality of H_L^{(1)} = H_1 and the number of non-matching nodes (unique superpixels in cluster C) in H_2, given by

|H_L^{(2)}| = |H_1| + (|H_2| − |MCS(H_1, H_2)|).   (5.15)
Similarly, the cardinality of H_L^{(3)} is given by

|H_L^{(3)}| = |H_L^{(2)}| + |H_3| − |MCS(H_L^{(2)}, H_3)|.   (5.16)
Now MCS(H_L^{(2)}, H_3) includes the nodes of both MCS(H_1, H_3) and MCS(H_2, H_3). As these two sets of nodes also overlap and contain the nodes of MCS(H_1, H_2, H_3),

|MCS(H_L^{(2)}, H_3)| = |MCS(H_1, H_3)| + |MCS(H_2, H_3)| − |MCS(H_1, H_2, H_3)|.   (5.17)
Combining (5.15), (5.16) and (5.17), we obtain

|H_L^{(3)}| = |H_1| + |H_2| + |H_3| − |MCS(H_1, H_2)| − |MCS(H_1, H_3)| − |MCS(H_2, H_3)| + |MCS(H_1, H_2, H_3)|.   (5.18)
Similarly, the cardinality of H_L = H_L^{(M)} is given by

|H_L^{(M)}| = \sum_{i} |H_i| − \sum_{i} \sum_{j>i} |MCS(H_i, H_j)| + \sum_{i} \sum_{j>i} \sum_{k>j} |MCS(H_i, H_j, H_k)| − ···   (5.19)
One can obtain a numerical estimate of the size of the latent class graph as follows. Consider a simple case where every sub-image in the cluster of interest (C ) has n H superpixels and a fraction of these (say, β) constitutes the common object partially in that sub-image obtained at the output of the coarse-level co-segmentation. Thus the number of unmatched nodes is (1 − β)n H which gets appended to the intermediate latent class graph at every iteration. Hence, the cardinality of the final latent class graph is
|H_L^{(M)}| = n_H + \sum_{i=1}^{M−1} (1 − β)\, n_H ≈ n_H M (1 − β).   (5.20)
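As a purely illustrative (hypothetical) numerical example of Eq. (5.20): with n_H = 20 superpixels per sub-image, β = 0.5 and M = 10 images, the exact count is 20 + 9 × 0.5 × 20 = 110 nodes, while the approximation n_H M(1 − β) gives 100.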
A step-wise demonstration of latent class graph generation is shown in Fig. 5.9. Here, we have used four images (from Block 3 of Fig. 5.5) for simplicity. The region adjacency graphs (RAGs) H_1, H_2, H_3, H_4 corresponding to the sub-images (I_1^{(s)}, I_2^{(s)}, I_3^{(s)}, I_4^{(s)}) present in the computed cluster of interest are used to generate the latent class graph.
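A minimal sketch of the merging procedure described above, assuming NetworkX graphs, a `match(HL, Hi)` routine (e.g., the MCS matching of Chap. 4) that returns, for every matched node of H_i, the list of its corresponding nodes in the current intermediate graph, and a node-attribute similarity `sim`; the fresh node identifiers are our own convention.

```python
def build_latent_class_graph(subgraphs, match, sim):
    """Merge the ordered RAGs H_1 >= H_2 >= ... >= H_M into a latent class
    graph: matched nodes are kept once, unmatched nodes are duplicated and
    attached to the best match of a matched neighbour."""
    HL = subgraphs[0].copy()
    for idx, Hi in enumerate(subgraphs[1:], start=2):
        corr = match(HL, Hi)                       # node of Hi -> list of nodes of HL
        for v in Hi.nodes:
            if v in corr:
                continue                           # matched nodes are not duplicated
            u_new = ('dup', idx, v)                # fresh identifier for the copy of v
            HL.add_node(u_new, **Hi.nodes[v])      # copy the attributes of v
            for nb in Hi.neighbors(v):
                if nb in corr:                     # connect to the most similar match
                    best = max(corr[nb], key=lambda u: sim(HL.nodes[u], Hi.nodes[nb]))
                    HL.add_edge(u_new, best)
    return HL
```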
5.4.2 Region Growing As every graph H_i in cluster C represents a partial object in the respective image I_i, region growing (RG) needs to be performed on it for object completion. But we should not grow them independently. As we are doing co-segmentation, we should grow the graphs with respect to a reference graph that contains information from all the graphs. We can follow the method described in Chap. 4 that performs region growing on a pair of graphs jointly. Here, every graph H_i is grown with respect to H_L using the region growing function F_R(·) in (5.8) to obtain H'_i. This method uses the node-to-node correspondence information obtained during the latent class graph generation stage to find the matched subgraph of H_L, and jointly grows this subgraph of H_L and H_i by appending similar (in attribute) neighboring nodes to them until convergence. These neighboring nodes (superpixels) belong to G_i \ H_i. Upon convergence, H_i grows into H'_i, which represents the complete object in image I_i. The set {H'_i, ∀i = 1, 2, …, M} represents the co-segmented objects in the image set I and the solution to the MOCS problem in Eq. (5.4), as explained in Sect. 5.2.3. This is explained in Fig. 5.10. As the same H_L, which contains information from all the constituent graphs, is used for growing every graph H_i, consistent matching in the detected common objects is ensured. The results of region growing are shown in Block 5 of Fig. 5.5. It may be noted that in Chap. 4, we compute the MCS of the graph representations (the G_i's) of the input image pair in their entirety. Unlike in Chap. 4, here the MCS computation during the latent class graph generation stage in Sect. 5.4.1 involves the graphs H_i, which are much smaller than the corresponding G_i (note H_i ⊆ G_i). Hence, it is not at all computationally expensive. Moreover, in Chap. 4, the common subgraph pair obtained using the MCS co-grows into the common objects. Unlike in Chap. 4, here we consider only the growth of every graph H_i with respect to the fixed H_L. In Fig. 5.11, a step-wise demonstration of region growing on the sub-image I_4^{(s)} and its RAG H_4 is shown. We obtain the complete object (shown in Fig. 5.11k) in image I_4, using the resulting latent class graph of Fig. 5.9n, as H'_4 (shown in Fig. 5.11l):
H'_4 = F_R(H_L, H_4, I_4).   (5.21)
Fig. 5.9 Steps for latent class graph generation described in Fig. 5.8. a–d Sub-images present (from Block 3, Fig. 5.5) in the computed cluster of interest. e–h The corresponding RAGs H1 , H2 , H3 and H4 . Values on axes indicate image dimension. See next page for continuation
Fig. 5.9 (Continued): Steps for latent class graph generation described in Fig. 5.8. i New nodes (shown in red) have been added to H_L^{(1)} = H_1 after matching it with H_2. j Intermediate latent class graph H_L^{(2)} = F(H_L^{(1)}, H_2) (note the new edges). k New nodes (shown in red) have been added to H_L^{(2)} after matching it with H_3. l Intermediate latent class graph H_L^{(3)}. m New nodes (shown in red) have been added to H_L^{(3)} after matching it with H_4. n Final latent class graph H_L = H_L^{(4)} = F(H_L^{(3)}, H_4)
Fig. 5.10 Region growing (RG). Here H'_1, H'_2, …, H'_M are the graphical representations of the corresponding co-segmented objects
Algorithm 1 Co-segmentation algorithm
Input: Set of images I_1, I_2, …, I_N
Output: Common objects F_{i_1}, F_{i_2}, …, F_{i_M} present in M images (M ≤ N)
1: for i = 1 to N do
2:   Superpixel segmentation of every image I_i
3:   Obtain region adjacency graph (RAG) representation G_i of every image I_i
4:   Compute background probability P(s) of every superpixel s ∈ I_i
5:   Compute feature h(s) of every superpixel s ∈ I_i
6: end for
7: Cluster all superpixels (from all images) together into K clusters
8: Remove every superpixel s from sub-images if P(s) > threshold t_B
9: for j = 1 to K do
10:   Compute compactness Q_j of every cluster j
11: end for
12: Select the cluster of interest as C = arg max_j Q_j
13: Find RAG H_i (⊆ G_i) in every non-empty sub-image I_i belonging to cluster C, where i = 1, 2, …, M and M ≤ N
14: Order the H_i's such that |H_1| ≥ |H_2| ≥ … ≥ |H_M|
15: // Latent class graph generation
16: H_L^{(1)} ← H_1
17: for i = 2 to M do
18:   Find matches between H_L^{(i−1)} and H_i
19:   Append non-matched nodes in H_i to H_L^{(i−1)} and obtain H_L^{(i)}
20: end for
21: Latent class graph H_L = H_L^{(M)}
22: // Region growing
23: for i = 1 to M do
24:   Perform region growing on H_i with respect to H_L using I_i and obtain H'_i
25: end for
26: H'_{i_1}, H'_{i_2}, …, H'_{i_M} are the graphical representations of the common objects F_{i_1}, F_{i_2}, …, F_{i_M}
Newly added nodes in every iteration of region growing are highlighted in different colors. Similarly, region growing is also performed on H1 , H2 and H3 to obtain the complete co-segmented object in the corresponding images. The overall algorithm for the method is presented in a complete block diagram in Fig. 5.12 and the complete algorithmic description as a pseudo-code is given in Algorithm 1.
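A sketch of the region growing step F_R(·), in the same NetworkX setting as the earlier sketches: `corr` maps nodes of H_i to their matched nodes in H_L (obtained while building the latent class graph), and a neighbouring superpixel of the current region is appended when its feature is sufficiently close to some matched latent-class node; the threshold `tau` and the convergence test are our own simplifications, not the book's exact criterion.

```python
import numpy as np

def region_grow(Gi, Hi_nodes, HL, corr, tau=0.15):
    """Grow the partial object H_i inside its image graph G_i with respect to
    the latent class graph H_L, until no more neighbours can be appended."""
    grown = set(Hi_nodes)
    ref = [np.asarray(HL.nodes[u]['feature'])
           for v in Hi_nodes for u in corr.get(v, [])]       # matched H_L features
    changed = bool(ref)
    while changed:
        changed = False
        frontier = {w for v in grown for w in Gi.neighbors(v)} - grown
        for w in frontier:
            fw = np.asarray(Gi.nodes[w]['feature'])
            if min(np.linalg.norm(fw - m) for m in ref) < tau:
                grown.add(w)                                  # append similar neighbour
                changed = True
    return grown
```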
Fig. 5.11 Steps of region growing on sub-image I_4^{(s)} and its RAG H_4 to obtain the complete object. a Sub-image I_4^{(s)} and b its corresponding RAG H_4. Note that this RAG has two components. c, e, g, i Partially grown object after iterations 1, 2, 3, 4, respectively and d, f, h, j the corresponding intermediate graphs, respectively. k Completely grown object after iteration 5 and l the corresponding graph
Fig. 5.12 The co-segmentation method using a block diagram
Fig. 5.12 (Continued): The co-segmentation method using a block diagram
5.5 Experimental Results In this section, we analyze the performance of the co-segmentation methods DCC [56], DSAD [60], MC [57], MFC [18], JLH [78], MRW [64], UJD [105], RSP [68], GMR [99], OC [131], CMP [36], EVK [24] and the method described in this chapter (denoted as PM). Experiments are performed on images selected from the following datasets: the MSRC dataset [105], the flower dataset [92], the Weizmann horse dataset [13], the Internet dataset [105], the 38-class iCoseg dataset [8] without any outliers and the 603-set iCoseg dataset containing outliers. We begin by discussing the choice of different parameters in the PM method. The number of superpixels in every image is set to 200. The RGB color histogram, mean color and HOG feature vectors are of lengths 36, 3 and 9, respectively. These three features are concatenated to generate f(s) of length 48. Hence, the length of the combined feature h(s) is 3 × 48 = 144. In Sect. 5.3.2, the background probability threshold t_B is chosen as 0.75. Further, α = 3 is chosen to prevent spurious points from growing during the region growing stage, and it helps to discard outlier images. Experiments are also performed by varying the number of clusters K ∈ [7, 10], and they yield quite comparable results.
5.5.1 Quantitative and Qualitative Analysis We first discuss the quantitative evaluation, and then visually analyze the results. We have used the Jaccard similarity (J) and accuracy (A) as the metrics to quantitatively evaluate [105] the performance of the methods. The Jaccard similarity is defined as the intersection over union between the ground-truth and the binary mask of the co-segmentation output. Accuracy is defined as the percentage of correctly labeled pixels (in both the common object and the background) with respect to the total number of pixels in the image. For small-sized objects, the accuracy measure is heavily biased towards a higher value, and hence the Jaccard similarity is commonly the preferred measure.
Quantitative result: The values of A and J obtained using different methods on the images from the iCoseg dataset are provided in Table 5.1. The methods DCC, DSAD, MC require the output segmentation class to be manually chosen for each dataset before computing the metrics. The poor performance of the method UJD is due to the use of saliency as a cue, as discussed in Chap. 4. It is often argued that one achieves robustness by significantly sacrificing accuracy [104]. Hence, the performance of the methods on datasets having no outliers at all is also provided in Table 5.1. We observe that the method JLH, being a GrabCut based method, performs marginally better when there is no outlier image in the dataset, whereas the method PM is more accurate than most of the other methods both in the presence and in the absence of outliers. Table 5.3 provides results of the methods DCC, DSAD, MC, MFC, MRW, UJD, GMR, EVK, PM on the Internet dataset.
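A small sketch of the two metrics just defined (our helper, with the co-segmentation output and the ground truth given as binary masks):

```python
import numpy as np

def jaccard_and_accuracy(mask, gt):
    """Jaccard similarity (intersection over union of output and ground truth)
    and pixel accuracy (percentage of correctly labeled pixels, object and
    background alike)."""
    mask = mask.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    J = inter / max(union, 1)
    A = 100.0 * (mask == gt).mean()
    return J, A
```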
Table 5.1 Accuracy (A) and Jaccard similarity (J ) of the methods PM, DCC, DSAD, MC, MFC, JLH, MRW, UJD, RSP, GMR, OC, CMP on the dataset created using the iCoseg dataset and the 38-class iCoseg dataset without outliers Metrics Methods PM MRW CMP MFC MC UJD DCC DSAD JLH GMR RSP OC A(%) (iCoseg 603) with outliers
90.46
88.67
89.95
88.72
82.42
73.69
75.38 A(%) 93.40 (iCoseg 38) without outliers 80.00 J (iCoseg 0.71 603) with outliers 0.36 J (iCoseg 0.76 38) without outliers 0.42
83.94 91.14
x 92.80
x 90.00
x 70.50
x 89.60
76.00 0.65
x 0.63
93.30 0.62
x 0.41
85.34 0.38
0.26 0.70
x 0.73
x 0.64
x 0.59
x 0.68
0.42
0.79
0.76
0.66
0.62
‘x’ : code not available to run on outlier data
Since objects in this dataset have more variations, the h(s) feature in Eq. (5.10) is not sufficient. In particular, the airplane class contains images of airplanes with highly varying pose, making it a very difficult dataset for unsupervised methods. Hence, some semi-supervised methods use saliency or region proposals (a measure of objectness) for initialization, whereas some unsupervised methods perform post-processing. For example, the methods JLH, CMP use GrabCut [102], and EVK uses graph-cut. The method PM uses additional hand-crafted features such as bag-of-words of SIFT [138] and histogram features [18]. Similarly, the method GMR uses learnt CNN features to tackle variations in images. It performs marginally better for the horse class, whereas the method PM achieves significantly higher Jaccard similarities on the car and airplane classes, achieving the highest average Jaccard similarity. Table 5.2 shows quantitative comparisons on the images from the Weizmann horse dataset [13] and the flower dataset [92], respectively. The quantitative analysis shows that the method PM performs well on all datasets, whereas the performance of the other methods varies across datasets.
Qualitative result: We show a visual comparison of the co-segmentation outputs obtained using the methods PM, DCC, DSAD, MC, MFC, MRW, UJD on images from the iCoseg dataset [8] in Fig. 5.13. The method PM correctly co-segments the soccer players in red, whereas the other methods wrongly detect additional objects (other players, referees and signboards). Moreover, they cannot handle the presence of outlier
Table 5.2 Accuracy (A) and Jaccard similarity (J) of the methods PM, DCC, DSAD, MC, MFC, MRW, UJD, CMP on images selected from the Weizmann horse dataset [13] and the flower dataset [92]

Horse data
Metrics   PM      MFC     CMP     MRW     DSAD    MC      DCC     UJD
A(%)      95.61   91.18   91.37   93.45   89.76   83.30   84.82   63.74
J         0.85    0.76    0.76    0.75    0.69    0.61    0.58    0.39

Flower data
Metrics   PM      CMP     MRW     DSAD    MC      DCC     UJD     MFC
A(%)      94.50   92.82   89.61   80.24   79.36   78.70   53.84   94.78
J         0.85    0.73    0.70    0.71    0.56    0.52    0.45    0.82
Table 5.3 Jaccard similarity (J) of the methods PM, DCC, DSAD, MC, MFC, MRW, UJD, GMR, EVK on the Internet dataset [105]

Metrics         PM      GMR     UJD     MFC     CMP     EVK     MRW     DCC     MC      DSAD
J (Car)         0.703   0.668   0.644   0.523   0.495   0.648   0.525   0.371   0.352   0.040
J (Horse)       0.556   0.581   0.516   0.423   0.477   0.333   0.402   0.301   0.295   0.064
J (Airplane)    0.625   0.563   0.558   0.491   0.423   0.403   0.367   0.153   0.117   0.079
J (Average)     0.628   0.604   0.573   0.479   0.470   0.461   0.431   0.275   0.255   0.061
images (containing baseball players in white), unlike PM. Figure 5.14 shows the co-segmentation results of the methods PM, MFC, MRW on images from the flower dataset [92]. The method MRW incorrectly co-segments the horse (in the outlier image) with the flowers. Figure 5.15 shows the co-segmentation outputs obtained using the methods PM, MFC, MRW on images from the MSRC dataset [105]. Both MFC and MRW fail to discard the outlier image containing sheep. These results show the robustness of PM in the presence of outlier images in the image sets to be co-segmented.
Fig. 5.13 Co-segmentation results on an image set from the iCoseg dataset. For the input image set (includes two outlier images) shown in Row A, the co-segmented objects obtained using MRW, MFC, UJD, MC are shown in Rows B-E, respectively
Fig. 5.13 (Continued): Co-segmentation results on an image set from the iCoseg dataset. The co-segmented objects obtained using DCC, DSAD and PM are shown in Rows F-H, respectively. Row I shows the ground-truth
Fig. 5.14 Co-segmentation results on an image set from the flower dataset. For the input image set (that includes one outlier image of horse) shown in Row A, the co-segmented objects obtained using MRW, MFC and PM are shown in Rows B–D, respectively
5.5.2 Multiple Class Co-segmentation The images analyzed so far consisted of objects belonging primarily to a single class. In real life, there could be multiple classes of common objects (e.g., a helicopter in M_1 images and a cheetah in M_2 images, with M_1 + M_2 ≤ N). We now demonstrate how the method described in this chapter can be adapted to handle co-segmentation of multiple classes of common objects. In Fig. 5.17, we show the co-segmentation results for a set of 22 images containing two different common objects. The intermediate clustering result for this set is given in Fig. 5.16. First, the cluster (C_1 = 7) having the largest compactness (Q_j in Eq. (5.12)) is selected. Then latent class graph generation and region growing are performed on that cluster to extract the first common object (cheetah). Then the cluster (C_2 = 10) having the second largest compactness is selected, and the same procedure is repeated on the leftover data to extract the second common object (helicopter). It is quite clear from Fig. 5.17 that this method is able to co-segment objects belonging to both classes quite accurately. It may be noted that this method, being unsupervised, does not use any class information while co-segmenting an image set. Here, we can only identify two subsets of images with different classes, without specifying the class, and provide the segmented objects in them.
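A sketch of this multi-class adaptation, reusing the single-class pipeline: `extract` is assumed to wrap latent-class-graph generation plus region growing for one cluster, and `Q` is the compactness of Eq. (5.12); the function and parameter names are ours.

```python
def cosegment_multiclass(clusters, Q, extract, num_classes=2):
    """Extract one common object per class by processing the clusters of
    interest in decreasing order of compactness."""
    ranked = sorted(Q, key=Q.get, reverse=True)       # clusters by compactness
    objects = []
    for j in ranked[:num_classes]:
        objects.append(extract(clusters[j]))          # e.g., cheetah, then helicopter
    return objects
```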
Fig. 5.15 Co-segmentation results on the cow image set from the MSRC dataset. For the input image set (which includes one outlier image of sheep) shown in Row A, the co-segmented objects obtained using PM, MRW, MFC are shown in Rows B-D, respectively. Unlike the methods MFC and MRW, the method PM (Row B) is able to reject the sheep as co-segmentable objects (shown in the final column)
Fig. 5.16 Clustering a set of 22 images (from the iCoseg dataset) containing two different classes (helicopter and cheetah) into ten clusters. The input images are shown in Row 1. Rows 2 to 11 show sub-images in clusters 1 to 10, respectively. Out of 22 images, 8 are shown here, and remaining are shown in next two pages. Reader may view all images simultaneously for better understanding
Fig. 5.16 (Continued): Clustering a set of 22 images (from the iCoseg dataset) containing two different classes (helicopter and cheetah) into ten clusters. The input images are shown in Row 1. Rows 2 to 11 show sub-images in clusters 1 to 10, respectively. Out of 22 images, 7 are shown here, and remaining are shown in previous page and next page
Fig. 5.16 (Continued): Clustering a set of 22 images (from the iCoseg dataset) containing two different classes (helicopter and cheetah) into ten clusters. The input images are shown in Row 1. Rows 2 to 11 show sub-images in clusters 1 to 10, respectively. Out of 22 images, 8 are shown here, and remaining are shown in previous two pages
Fig. 5.17 Co-segmentation when the image set contains two different classes of common objects. Rows A, B together show the input set of 22 images (from the iCoseg dataset) containing two objects helicopter and cheetah. Rows C,D show the extracted common objects. The numbers (1) and (2) indicate that cheetahs and helicopters have been obtained by processing clusters C1 and C2 , respectively
Table 5.4 Computation time (in seconds) required by the methods PM, DCC, DSAD, MC, MFC, MRW, UJD for image sets of different cardinalities on an i7 3.5 GHz PC with 16 GB RAM

No. of images   PM     DSAD   MFC    DCC    MRW    UJD    MC
8               12     41     69     91     256    266    213
16              33     81     141    180    502    672    554
24              64     121    205    366    818    1013   1202
32              111    162    273    452    1168   1342   1556
40              215    203    367    1112   2130   1681   3052
60              350    313    936    1264   5666   2534   4064
80              532    411    1106   2911   X      3353   5534
PE              M      M+C    M+C    M+C    M+C    M+C    M+C

Here, PE stands for programming environment, M for Matlab and C for the C language. X stands for the case when Matlab ran out of memory.
5.5.3 Computation Time In Table 5.4, we show the computation time (from reading the images to obtaining the co-segmented objects) taken by the methods PM, DCC, DSAD, MC, MFC, MRW, UJD for sets of 8, 16, 24, 32, 40, 60 and 80 images. In the case of the method PM, the supra-linear growth in computation with respect to N is due to the increased cardinality of the latent class graph, as explained in Sect. 5.4. It is evident from the results that the method PM, despite being run in Matlab, is computationally very fast compared to all the other methods. The method DSAD is actually faster, but its performance, as given in Table 5.1, is very poor in comparison. The method CMP uses region proposals, which is computationally very expensive, taking on average 32 min to co-segment a set of 10 images. It is worth mentioning that the method PM processes all the images in the set simultaneously, unlike some methods. For example, the method MRW cannot handle large image sets; it generates multiple random subsets of the image set, performs co-segmentation on them and computes the average accuracy over all the subsets. The method JLH requires O(n_R) operations in every round (10 rounds are needed) of foreground model updating alone, where n_R is the number of generated mid-level regions. The method UJD performs SIFT matching with only the 16 most similar images in the set, thus reducing the number of matching operations to O(16N) in each round, with the number of rounds ranging from 5 to 10. In addition, it requires optimizing a cost function in every round to find the pixel labels. The method MFC also performs multiple rounds of foreground label refinement by considering all the images in every round. In contrast, the method PM processes all the images in one round to obtain the latent class graph using only O(N) matching operations. The methods DCC, MC subsample the images to reduce computation, with the order of computation being O(n_P K) and O(n_S^2), respectively, where n_P is the number of pixels in every image, n_S is the total number of superpixels
in the image set and K is the number of classes being considered. Since the method PM uses only the superpixels from the most compact cluster to build the latent class graph, the cardinality of every graph (Hi ⊆ Gi) is very small compared to the total number of superpixels in that image Ii. Hence, it is computationally very efficient. The cycle bases in a graph G = (V, E) required in Sect. 5.3.3 can be computed using depth-first search with complexity O(|V| + |E|). Hole filling for a planar graph in Sect. 5.3.3 can alternatively be implemented using morphological operations: the interior nodes of a cycle can be found using morphological reconstruction by erosion [118] on the binary image formed by the superpixels that constitute the cycle.

In this chapter, we have described a fast method for image co-segmentation in an unsupervised framework applicable to a large set of images, with an unknown number of them (the majority class) containing the common object. The method is shown to be robust against the presence of outlier images in the dataset. We have discussed the concept of a latent class graph that defines a combined representation of all unique superpixels within a class, and this graph is used to detect the common object in the images using O(N) subgraph matching operations. The use of the latent class graph also helps to maintain global consistency in the matches. It may be noted that the accuracy of the boundary detection in the common objects depends on the accuracy of the superpixel generation. This can be further improved by post-processing with GrabCut [102] or image matting [65], which is not pursued in this chapter. In this MOCS-based co-segmentation algorithm, latent class graph generation and region growing are performed on the sub-images in the cluster of interest. But the choice of this cluster, based on the compactness measure, may not always be accurate. Some superpixels from the outlier images may also belong to this cluster due to feature similarity with some regions in the images from the majority class. This may lead to poor results, and the outlier images may not get excluded. In the next chapter, we describe another co-segmentation method to address this problem.
Chapter 6
Co-segmentation Using a Classification Framework
6.1 Introduction

Image co-segmentation, as explained in the earlier chapters, is useful for finding objects of common interest from a large set of crowd-sourced images. These images are generally captured by different people using different cameras under different illuminations and compositional contexts. These variations make it very difficult to extract the common object(s). Further, as discussed in Chap. 5, the presence of outlier data, i.e., totally irrelevant images (see Fig. 6.1 for illustration), in the database makes the co-segmentation problem even more difficult. Since we are required to co-segment natural images, the common objects may not be homogeneous. Hence, feature selection is difficult, and the use of low-level and mid-level features, whose computation does not require any supervision, may not yield good results. In order to obtain high-level features, semi-supervised methods in [36, 71, 74, 85, 86, 131] compute region proposals from images using pretrained networks, whereas Quan et al. [99] use CNN features. In this chapter, however, we do not involve any supervised learning [50, 72, 106, 123, 146] and describe a method for robust co-segmentation of an image set containing outlier images in a fully automated, unsupervised framework, yet yielding excellent accuracy.
6.1.1 Problem Definition

In a set of images to be co-segmented, typically (i) the common object regions in different images are concentrated in the feature space since they are similar in features, (ii) the background varies across images and (iii) the presence of unrelated images (outliers) produces features of low concentration away from the common object feature points. Hence, in the space containing features of all regions from the entire image set, the statistical mode and a certain neighborhood around it (in the feature space) correspond to the features of the common object regions.
Fig. 6.1 Image co-segmentation. Top row: an input image set in which four images contain a common object, and the fifth, without any commonality, is the outlier image. Bottom row: the common object (cheetah). Image Courtesy: Source images from the iCoseg dataset [8]
So, finding the mode in the multidimensional feature space can be a good starting point for performing co-segmentation. This problem, however, is difficult for two reasons. Firstly, computation of the mode in a higher-dimensional feature space is inherently very difficult. Secondly, variations in ambient imaging conditions across different images make the population density less concentrated around the mode.

Method overview. First, the images to be co-segmented (Fig. 6.2a) are tessellated into superpixels using the simple linear iterative clustering (SLIC) method [1]. Then low-level and mid-level features for every superpixel (Fig. 6.2b) are computed. Next, mode detection is performed on these features in order to classify a subset of all superpixels into background (Fig. 6.2e) and common foreground classes (Fig. 6.2c) in an unsupervised manner and mark them as seeds. Although these features are used to compute the mode, they are not sufficient for finding the complete objects since objects may not be homogeneously textured within an image and across the set of images. So, an appropriate distance measure between the common foreground class and the background class(es) is defined, and a discriminative space is obtained (using the seed samples) that maximizes this distance measure. In this discriminative space, samples from the same class come closer and samples from different classes move far apart (Fig. 6.2h). This discriminative space appropriately performs the task of maximally separating the common foreground from the background regions. Thus, it better caters to robust co-segmentation than the initial low-level and mid-level feature space. Next, to get the complete labeled objects, the seed region is grown in this discriminative space using a label propagation algorithm that assigns labels to the unlabeled regions as well as updates the labels of the already labeled regions (Fig. 6.2i). The discriminative space computation stage and the label propagation
Fig. 6.2 Co-segmentation algorithm: a Input images for co-segmentation. b Features extracted for each superpixel. c Mode detection performed in the feature space to obtain initial foreground seed samples. e Background seed samples computed using a background probability measure. d The remaining samples that do not belong to the foreground/background seeds. f Clustering performed on background seed samples to obtain K background clusters. g The labeled seed samples from 1 + K classes, and unlabeled samples shown together (compare g with b). Image Courtesy: Source images from the iCoseg dataset [8]
Fig. 6.2 (Continued): Co-segmentation algorithm: g The labeled seed samples from 1 + K classes, and unlabeled samples shown together. All samples are fed as input for cyclic discriminative subspace projection and label propagation. h A discriminative subspace is learned using the labeled samples, such that same class samples come closer and dissimilar class samples get well-separated. i Label propagation assigns new labels to unlabeled samples as well as updates previously assigned labels (Both h and i are repeated alternatively until convergence). j The final labeled and few unlabeled samples. Foreground labeled samples (in green) are used to obtain k the co-segmentation mask and l the common object
stage are iterated till convergence to obtain the common objects (Fig. 6.2j–l). The salient aspects of this chapter are
• We describe a multi-image co-segmentation method that can handle outliers.
• We discuss a method for statistical mode detection in a high-dimensional feature space. This is a difficult problem and has never been attempted in computer vision applications.
• We explain a foreground–background distance measure designed for the problem of co-segmentation and compute a discriminative space based on this distance measure, in an unsupervised manner, to maximally separate the background and the common foreground regions.
• We show that discriminative feature computation alleviates the limitations of low-level and mid-level features in co-segmentation.
• We describe a region growing technique, using label propagation with spatial constraints, that achieves robust co-segmentation.
In Sect. 6.2, we describe the co-segmentation algorithm through mode detection, discriminative feature computation and label propagation. We report experimental results in Sect. 6.3.
6.2 Co-segmentation Algorithm

For every superpixel extracted from a given set of input images, first a feature representation based on low-level (Lab color, SIFT [127]) and mid-level (locality-constrained linear coding-based bag-of-words [138]) features is obtained. Then, in that feature space, the mode of all the superpixel samples is detected. This mode is used to label two subsets of samples that partially constitute the common foreground and the background, respectively, with high confidence. Next, using them as seeds, the remaining superpixels are iteratively labeled as foreground or background using label propagation. Simultaneously, the labels of some of the incorrectly labeled seed superpixels get updated, where appropriate. In order to increase the accuracy of label propagation, instead of using the input feature space directly, a discriminative space is obtained where the foreground and background class samples are well separated, aiding more robust co-segmentation. All these stages are explained in the following sections.
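As a rough sketch of this first stage (not the authors' implementation), the snippet below tessellates an image into SLIC superpixels with scikit-image and collects a simple mean L*a*b* color feature per superpixel; the SIFT and LLC-coded mid-level features used in this chapter are omitted, and all parameter values are placeholders.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2lab
from skimage.data import astronaut  # stand-in image; any RGB image works


def superpixel_lab_features(image_rgb, n_segments=300, compactness=10.0):
    """Tessellate an image into SLIC superpixels and return, for every
    superpixel, the mean L*a*b* color as a simple low-level feature."""
    labels = slic(image_rgb, n_segments=n_segments,
                  compactness=compactness, start_label=0)
    lab = rgb2lab(image_rgb)
    n_sp = labels.max() + 1
    features = np.zeros((n_sp, 3))
    for s in range(n_sp):
        mask = labels == s
        features[s] = lab[mask].mean(axis=0)  # mean L, a, b over the superpixel
    return labels, features


if __name__ == "__main__":
    img = astronaut()
    labels, feats = superpixel_lab_features(img)
    print(labels.shape, feats.shape)  # (H, W) and (n_superpixels, 3)
```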
6.2.1 Mode Estimation in a Multidimensional Distribution

We expect that the superpixels constituting the common object in multiple images have similar features and they are closer to each other in the feature space compared to the background superpixels. Furthermore, the superpixels from the outlier images have features quite distinct from the common object. Under this assumption, the
seeds for the common foreground are obtained as the set of superpixels belonging to the densest region in the feature space. To find the seeds, we first introduce the notion of a dominant mode region in the multidimensional feature space by representing every superpixel as a data point in the feature space.

Definition of mode: Let p(x) denote the probability density function (pdf) of the samples x ∈ R^D, which is the D-dimensional feature of a superpixel. Then x_0 is the mode [110] if for each ε > 0 there exists δ > 0 such that the distance d(x, x_0) > ε implies

\[ p(\mathbf{x}_0) > p(\mathbf{x}) + \delta. \qquad (6.1) \]

Here, δ ≠ 0 is chosen to eliminate the case where a sequence {x_i} is bounded away from x_0, but p(x_i) → p(x_0). To prove the existence of the mode in a distribution, Sager [110] showed that given a sequence of integers {ℓ(j)} such that

\[ \ell(j)/j = o(1), \qquad (6.2) \]

and

\[ j/(\ell(j) \log j) = o(1), \qquad (6.3) \]

with S_j being the smallest volume set containing at least ℓ(j) samples, if x_0^{(ℓ(j))} ∈ S_j for each j, then x_0^{(ℓ(j))} → x_0 almost surely. Since the superpixel feature space is not densely populated by sample points, the previous relationship may be modified as:

\[ \int_{\mathrm{Cir}(\nu)} p(\mathbf{x}_0 + \mathbf{g})\, d\mathbf{g} > \int_{\mathrm{Cir}(\nu)} p(\mathbf{x} + \mathbf{g})\, d\mathbf{g} + \delta \qquad (6.4) \]

with g ∈ R^D and Cir(ν) denoting the integral over a ball of radius ν. One can find the dominant mode by fitting a Gaussian kernel with an appropriate bandwidth on the data points when the feature space is densely populated. But in this setup, the data points are sparse; i.e., the total number of samples (n_A) is small with respect to the feature dimension, and hence Gaussian kernel fitting is not useful. So, the mode for multidimensional data points is very difficult to compute, unlike in the unidimensional case [9, 27, 129]. Though the multidimensional mode for very low dimensions can be computed using the Hough transform or the EM algorithm, this is not applicable at high dimensions. Let the entire collection of superpixels in the feature space be modeled as a mixture of samples from the common foreground (F) and the background (B):

\[ p(\mathbf{x}) = \xi_F\, p_F(\mathbf{x}) + (1 - \xi_F)\, p_B(\mathbf{x}), \qquad (6.5) \]

where ξ_F is the mixing proportion, and p_F(x) and p_B(x) are the pdfs of the foreground and background samples, respectively. The superpixels belonging to the common foreground have similar visual characteristics. Hence, their corresponding data points are expected to be in close proximity in the feature space. Therefore, without loss of
generality, we can assume that the data samples of F are more concentrated around a mode in the feature space. On the other hand, the superpixels belonging to the background come from different images. Hence, they will have a much more diverse set of features that are widely spread in the feature space. Thus p_F(x) should have a much lower variance than p_B(x). Let us define

\[ E(\mathbf{x}, \nu) \triangleq \int_{\mathrm{Cir}(\nu)} p(\mathbf{x} + \mathbf{g})\, d\mathbf{g}. \qquad (6.6) \]

For the case of ξ_F = 1 (i.e., not a mixture distribution) and assuming p_F(·) to be spherically symmetric, Sager [110] computed the mode (x_0) by finding the radius (ν_0) of the smallest volume containing a certain number of points (n_0), such that E(x_0, ν_0) = n_0/n_A. Mathematically,

\[ \mathbf{x}_0 = \mathrm{Solution}\left\{ \inf_{\mathbf{x}} \nu : \int_{\mathrm{Cir}(\nu)} p(\mathbf{x} + \mathbf{g})\, d\mathbf{g} = \frac{n_0}{n_A} \right\}. \qquad (6.7) \]

Although the estimator has been shown to be consistent, it is not known how one can select n_0 in Eq. (6.7). Thus, we need to handle two specific difficulties: (i) how to extend the method to deal with mixture densities and (ii) how to choose n_0 appropriately. From Eq. (6.5), integrating both sides over a ball of radius ν, we obtain

\[ E(\mathbf{x}, \nu) = \xi_F E_F(\mathbf{x}, \nu) + (1 - \xi_F)\, E_B(\mathbf{x}, \nu). \qquad (6.8) \]

For the data points belonging to the background class, p_B(x) can be safely assumed to be uniformly distributed. We also observe that

\[ E_B(\mathbf{x}, \nu) \propto \left( \frac{\nu}{d_{\max}} \right)^{D}, \quad \text{for all } \mathbf{x}, \qquad (6.9) \]

where d_max is the largest pairwise distance among all feature points, and hence E_B(x, ν) is very small. However, due to the centrality in the concentration of p_F(x) (e.g., say Gaussian distributed), E_F(x, ν) is very much location dependent and is high when x = x_0 (i.e., the mode). Hence, Eq. (6.8) may be written as:

\[ E(\mathbf{x}, \nu) = \xi_F E_F(\mathbf{x}, \nu) + (1 - \xi_F)\, E_B(\nu) \triangleq \kappa_F + \kappa_B. \qquad (6.10) \]
Thus, a proper mode estimation requires that we select κ_F > κ_B, and although E_F(x, ν) and E_B(ν) are both monotonically increasing functions of ν, dE_F/dν becomes smaller than dE_B/dν beyond some value ν = ν_m due to the centrality in concentration of the foreground class. Ideally, one should select 0 < ν_0 ≤ ν_m while extending Sager's method [110] to the mixture distribution. Hence, we need to (i) ensure an upper bound ν_m on ν and (ii) constrain κ_F + κ_B to a small value (say κ_m) while searching for the mode x_0. For example, we may set κ_m = n_0/n_A = 0.2 and ν_m = 0.6 d_max to decide on the value of ν_0. Further assumptions can be made on the maximum background
(superpixels from p_B(x)) contamination (say α%) within the neighborhood of the mode x_0. Thus, at the end of co-segmentation, if one finds more than α% of the foreground-labeled data points being changed to background labels, it can be concluded that the mode estimation was unsuccessful. In order to speed up the computation, the mode can be approximated as one of the given data points only, as is commonly done for computing the median.

Seed labeling. In the feature space, the mode and only a certain number of data points in close proximity to the mode are chosen as seeds for the common foreground C_F (Fig. 6.2c). A restrictive choice of seeds is ensured by the bounds on ν_0 and κ_m. As this is an approximate solution, the mode region may yet contain a few background samples as well. In Sect. 6.2.3, we will show how this can be corrected. In a set of natural images, it is quite usual to have a common background, e.g., sky, field, etc. In such cases, superpixels belonging to these common background segments may also get detected as the mode region in the feature space. To avoid such situations, the background probability measure of Zhu et al. [151] can be used to compute the probability P(s) that a superpixel s belongs to the background. Only the superpixels having background probability P(s) < t1 are considered during mode detection, where t1 is an appropriately chosen threshold, thus eliminating some of the background superpixels. A small value of t1 helps reduce false positives. A superpixel is marked as a background seed if P(s) > t2. A high value of the threshold t2 ensures high confidence during initial labeling. Figure 6.3a shows seed regions in the common foreground (C_F) and background (C_B) for an image set.
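The following sketch illustrates one simple way of approximating the dominant mode as one of the data points themselves, in the spirit of Eq. (6.7): for every candidate point, the radius of the smallest ball containing n_0 = κ_m n_A points is computed, and the candidate with the smallest such radius (subject to the bound ν_m) is taken as the mode. This is a simplification for illustration only, not the authors' implementation, and the parameter values are the illustrative ones quoted above.

```python
import numpy as np
from scipy.spatial.distance import cdist


def approximate_mode(X, kappa_m=0.2, nu_frac=0.6):
    """Approximate the dominant mode of a point cloud X (n_A x D) as the data
    point whose ball containing n0 = kappa_m * n_A points has the smallest
    radius, subject to the upper bound nu_m = nu_frac * d_max."""
    n_A = X.shape[0]
    n0 = max(1, int(round(kappa_m * n_A)))
    D = cdist(X, X)                      # pairwise Euclidean distances
    d_max = D.max()
    nu_m = nu_frac * d_max               # upper bound on the search radius
    # radius of the smallest ball around each point that contains n0 points
    radii = np.sort(D, axis=1)[:, n0 - 1]
    radii[radii > nu_m] = np.inf         # discard candidates violating the bound
    mode_idx = int(np.argmin(radii))
    nu0 = radii[mode_idx]
    # points inside the mode ball: candidate common-foreground seeds
    seeds = np.where(D[mode_idx] <= nu0)[0]
    return mode_idx, nu0, seeds


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fg = rng.normal(0.0, 0.3, size=(40, 5))    # concentrated "foreground" samples
    bg = rng.uniform(-3, 3, size=(160, 5))     # diffuse "background" samples
    X = np.vstack([fg, bg])
    idx, r, seeds = approximate_mode(X)
    print(idx, round(float(r), 3), len(seeds))
```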
6.2.2 Discriminative Space for Co-segmentation

To obtain the common foreground object, we need to assign labels to all superpixels. Hence, the set of common foreground and background superpixels obtained as seeds are used to label the remaining unlabeled superpixels as well as to update the already labeled superpixels using label propagation. To achieve a more accurate labeling and better co-segmentation, it is beneficial to have the following conditions:
R1: Maximally separate the means of the foreground and background classes.
R2: Minimize the within-class variance of the foreground F to form a cluster.
R3: Minimize the within-class variance of the background B to form another separate cluster.
With this motivation, a discriminative space is learned using the labeled samples. In this space, dissimilar class samples get well separated and same class samples come closer, thus satisfying the above conditions. Since labeled and unlabeled samples come from the same distribution, the unlabeled samples come closer to the correctly labeled samples of the corresponding ground-truth class. This better facilitates the subsequent label propagation stage, yielding more accurate co-segmentation. As the background superpixels belong to different images and there is usually large diversity in these backgrounds, this superpixel set is heterogeneous, having
Fig. 6.3 a Seed labeling. Row 1 shows a set of input images for co-segmentation. Row 2 shows the regions corresponding to the common foreground (C F ) seed superpixels. Row 3 shows regions corresponding to the background seed superpixels (C B ). b The heterogeneous background C B seed is grouped into K = 3 clusters. The three rows show the regions corresponding to the background clusters C B1 , C B2 and C B3 , respectively
Fig. 6.4 R3 (Sect. 6.2.2): Satisfying R3 is equivalent to meeting requirements R3a and R3b sequentially
largely varying features. This makes the background distribution multimodal. The heterogeneous nature of the background can be observed even in the background seeds, as shown in Fig. 6.3a (Row 3). Hence, R3, i.e., enforcing the multimodal background class to form a single cluster, becomes an extremely difficult requirement for the space to satisfy. However, for co-segmentation, we are interested in accurate labeling of the foreground (rather than the background), for which R1 and R2 are more important than R3. Hence, R3 can be relaxed to a simpler condition R3a by allowing background samples to group into some K clusters (with some minimum within-cluster variance).
Justification: As illustrated in Fig. 6.4, meeting R3 is equivalent to meeting two requirements R3a and R3b sequentially.
• R3a: transforming the multimodal background class to form multiple (some K) clusters, with some minimum within-cluster variance (without any constraint on the separation of the cluster means).
• R3b: transforming all the above clusters to have a minimum between-cluster variance and finally form a single cluster.
The difficulty in meeting R3 is mostly due to R3b, which necessarily seeks a space where the multiple clusters have to come closer and form a single cluster. Clearly, achieving R3a alone is a much simpler task than achieving both R3a and R3b. So, the final requirements for the discriminative space to satisfy are R1, R2 and R3a.
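Since R3a only asks the background seeds to form some K compact clusters, the grouping itself (described in the next paragraph) can be done with any off-the-shelf clustering routine. A minimal sketch using k-means on hypothetical background seed features is given below; the cluster count and data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_background_seeds(bg_features, K=3, seed=0):
    """Group background seed superpixel features into K clusters (R3a).
    Returns the cluster label of every background seed and the cluster means."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed)
    labels = km.fit_predict(bg_features)
    return labels, km.cluster_centers_


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # hypothetical background seed features (e.g. sky / grass / water regions)
    bg = np.vstack([rng.normal(m, 0.2, size=(50, 8)) for m in (-2.0, 0.0, 2.0)])
    labels, centers = cluster_background_seeds(bg, K=3)
    print(np.bincount(labels), centers.shape)
```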
In order to facilitate R3a, the given background seeds are grouped into clusters. Figure 6.3b illustrates a case where the clustering algorithm forms K = 3 background clusters (C_B1, C_B2, C_B3). Next, using the foreground (C_F) and the K background clusters, a discriminative space is learned that satisfies R1, R2 and R3a. As mentioned earlier, in R3a, we do not need to enforce any separation among the K background clusters themselves. This is intuitive, as for co-segmentation we are only interested in separating the foreground class from the background, and not necessarily in the separation among all 1 + K classes. The significance of this multiple-cluster modeling of the background will be illustrated in Sect. 6.3.1, in terms of improved accuracy.

Learning the discriminative space. There exist various works on finding a discriminative space for person re-identification [143], action classification [98], face recognition [95, 141], human gait recognition [62], object and image classification [34, 144], and character recognition [148]. However, these methods give equal priority to discriminating all classes. Hence, they are not appropriate for meeting the above-mentioned specific requirements of co-segmentation. In order to find the appropriate discriminative space, first a measure of separation between the foreground class (C_F) and the background classes (C_B1, C_B2, ..., C_BK) is defined based on the requirements mentioned previously. Then the optimal discriminative space that maximizes this separation is computed. Let X ∈ R^{D×n_T} contain the feature vectors of all (say n_T) labeled superpixels, with n_T = n_F + n_B, where n_F is the number of labeled superpixels in class C_F and n_B = Σ_{k=1}^{K} n_Bk is the total number of samples from all the background classes. Here, the goal is to compute discriminant vectors that maximize the foreground–background separation E, defined to be of the form

\[ E = \frac{d\left(C_F, \cup_{k=1}^{K} C_{B_k}\right)}{v\left(C_F, \cup_{k=1}^{K} C_{B_k}\right)}, \qquad (6.11) \]

where d(·) is a measure of the foreground–background feature distance (achieves R1 and R3a) and v(·) is the measure of variance in all classes (achieves R2 and R3a), as defined next.

\[ d\left(C_F, \cup_{k=1}^{K} C_{B_k}\right) = \sum_{k=1}^{K} \frac{n_{B_k}}{n_B}\, d(C_F, C_{B_k}), \qquad (6.12) \]

where d(C_F, C_Bk) is the inter-class distance between the common foreground class C_F and any background class C_Bk, defined as

\[ d(C_F, C_{B_k}) = (\mathbf{m}_{B_k} - \mathbf{m}_F)^{T} (\mathbf{m}_{B_k} - \mathbf{m}_F) = \mathrm{tr}\left[ (\mathbf{m}_{B_k} - \mathbf{m}_F)(\mathbf{m}_{B_k} - \mathbf{m}_F)^{T} \right], \qquad (6.13) \]

where m_F is the mean of the feature vectors in the foreground class C_F and m_Bk is the mean of the background class C_Bk. Here, d(·) is formulated using only the distances d(C_F, C_Bk) to achieve a large discrimination between C_F and C_Bk, ∀k.
It is quite possible that two classes C_i, C_j, with large intra-class variances, overlap and still have a large inter-class distance d(C_i, C_j). Hence, in Eq. (6.11), d(·) is normalized using v(·), which is given by

\[ v\left(C_F, \cup_{k=1}^{K} C_{B_k}\right) = \omega_F \frac{n_F}{n_T} S(C_F) + \sum_{k=1}^{K} \omega_{B_k} \frac{n_{B_k}}{n_T} S(C_{B_k}), \qquad (6.14) \]

where S(C) denotes a measure of the scatter of class C and ω is the corresponding weight of the class. Typically, the characterization S(C) ≜ tr(V) is used to represent the scatter, where V is the covariance matrix of the class C. As we are interested in separating the foreground, we require the reduction in the scatter of the foreground C_F to be more than that of the background classes. Hence, to give more weight to V_F, we should select ω_F > ω_Bk. A good choice can be ω_F = 1, ω_Bk = 1/K. Using the definitions of d(·) and v(·) in Eqs. (6.12)–(6.14), we rewrite the foreground–background separation in Eq. (6.11) as:

\[ E = \frac{\sum_{k=1}^{K} \frac{n_{B_k}}{n_B}\, \mathrm{tr}\left[ (\mathbf{m}_{B_k} - \mathbf{m}_F)(\mathbf{m}_{B_k} - \mathbf{m}_F)^{T} \right]}{\frac{n_F}{n_T}\, \mathrm{tr}(\mathbf{V}_F) + \frac{1}{K} \sum_{k=1}^{K} \frac{n_{B_k}}{n_T}\, \mathrm{tr}(\mathbf{V}_{B_k})} = \frac{\mathrm{tr}(\mathbf{Q}_{fb})}{\mathrm{tr}(\mathbf{Q}_w)}, \qquad (6.15) \]

where the foreground–background inter-class scatter matrix Q_fb ∈ R^{D×D} and the intra-class (or within-class) scatter matrix Q_w ∈ R^{D×D} are given as

\[ \mathbf{Q}_{fb} = \frac{1}{n_B} \sum_{k=1}^{K} n_{B_k} (\mathbf{m}_{B_k} - \mathbf{m}_F)(\mathbf{m}_{B_k} - \mathbf{m}_F)^{T} \qquad (6.16) \]

\[ \mathbf{Q}_w = \frac{1}{n_T} \left\{ n_F \mathbf{V}_F + \frac{1}{K} \sum_{k=1}^{K} n_{B_k} \mathbf{V}_{B_k} \right\}, \qquad (6.17) \]
and they represent the variances among the superpixel features in X. A high value of E implies that the foreground is well-separated from the background classes, and the above formulations of Q_fb and Q_w ensure this. Next, we seek a discriminative space that maximizes the above-defined foreground–background separation E. Let W = [w_1 w_2 ... w_{D_r}] ∈ R^{D×D_r} be the projection matrix for mapping each data sample x ∈ R^D to the discriminative space to obtain z ∈ R^{D_r}, where D_r < D:

\[ \mathbf{z} = \mathbf{W}^{T} \mathbf{x}. \qquad (6.18) \]

Similar to Eqs. (6.16) and (6.17), the foreground–background inter-class and intra-class scatter matrices Q_fb^W and Q_w^W are derived using all the projected data {z} in the discriminative space as follows:

\[ \mathbf{Q}_w^{\mathbf{W}} = \mathbf{W}^{T} \mathbf{Q}_w \mathbf{W} \qquad (6.19) \]

\[ \mathbf{Q}_{fb}^{\mathbf{W}} = \mathbf{W}^{T} \mathbf{Q}_{fb} \mathbf{W}. \qquad (6.20) \]

A large value of tr{Q_fb^W} ensures good separation between the common foreground class and the background classes. A small value of tr{Q_w^W} ensures less overlap among the classes in the projected domain. Hence, the optimal W is obtained by maximizing the foreground–background separation in the feature space:

\[ \max_{\mathbf{W}} \; \mathrm{tr}(\mathbf{Q}_{fb}^{\mathbf{W}}) / \mathrm{tr}(\mathbf{Q}_w^{\mathbf{W}}) = \frac{\mathrm{tr}(\mathbf{W}^{T} \mathbf{Q}_{fb} \mathbf{W})}{\mathrm{tr}(\mathbf{W}^{T} \mathbf{Q}_w \mathbf{W})}. \qquad (6.21) \]

This is a trace ratio problem, for which an approximate closed-form solution can be obtained by transforming it to a ratio trace problem [32]:

\[ \max_{\mathbf{W}} \; \mathrm{tr}\left\{ (\mathbf{W}^{T} \mathbf{Q}_w \mathbf{W})^{-1} (\mathbf{W}^{T} \mathbf{Q}_{fb} \mathbf{W}) \right\}. \qquad (6.22) \]

This can be efficiently solved as a generalized eigenvalue problem:

\[ \mathbf{Q}_{fb} \mathbf{w}_i = \lambda \mathbf{Q}_w \mathbf{w}_i, \;\; \text{i.e.,} \;\; \mathbf{Q}_w^{-1} \mathbf{Q}_{fb} \mathbf{w}_i = \lambda \mathbf{w}_i. \qquad (6.23) \]
Thus, the solution W, which contains the discriminants w_i ∈ R^D, is determined by the eigenvectors corresponding to the D_r largest eigenvalues of Q_w^{-1} Q_fb. After solving for W, all the superpixel samples in the low-level and mid-level feature space are projected onto the discriminative space using Eq. (6.18). These samples in the discriminative space are used for label propagation as described in Sect. 6.2.3.

Discussion. (1) The learned discriminative space has two advantages. Firstly, there occurs a high separation between the foreground and the background class means. Secondly, same-class samples come closer in the discriminative space. This phenomenon occurs not only for the labeled samples but also for the unlabeled samples. This facilitates a more accurate assignment of labels in the subsequent label propagation stage of the co-segmentation algorithm. (2) Although the solution of W has a form similar to that of linear discriminant analysis (LDA), the formulations of the scatter matrices Q_fb and Q_w differ significantly. In LDA [10], the scatter matrices are designed to ensure good separation among all classes after projection, and hence the inter-class and intra-class scatter matrices are accordingly defined as:

\[ \mathbf{Q}'_b = \sum_{k=1}^{K} \frac{n_k}{n_T} (\mathbf{m}_k - \bar{\mathbf{m}})(\mathbf{m}_k - \bar{\mathbf{m}})^{T} \quad \text{and} \qquad (6.24) \]

\[ \mathbf{Q}'_w = \sum_{k=1}^{K} \frac{n_k}{n_T} \mathbf{V}_k, \qquad (6.25) \]

respectively, where m̄ is the mean of all the feature vectors in all classes. Since for co-segmentation we require only the common foreground class to be well-separated from the background classes, here the condition of discrimination among the background classes is relaxed and the discrimination of the foreground class from the rest is prioritized. The scatter matrices Q_fb and Q_w are consistent with these requirements. As seen in Eq. (6.16), the foreground–background inter-class scatter matrix Q_fb measures how well the foreground mean is separated from each of the background class means, without accounting for how well the background class means are separated from each other. Similarly, the foreground class scatter V_F is given more weight in the within-class scatter matrix Q_w, thus prioritizing the compactness of the foreground samples over that of each of the background classes.
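To make the construction concrete, the sketch below is an illustrative re-implementation (not the authors' code): it builds Q_fb and Q_w from given class-wise feature arrays following Eqs. (6.16)–(6.17) and solves the generalized eigenvalue problem of Eq. (6.23) with SciPy. The small ridge term added to Q_w for numerical invertibility is an assumption not present in the text.

```python
import numpy as np
from scipy.linalg import eigh


def discriminative_projection(X_F, X_B_list, D_r=10, ridge=1e-6):
    """X_F: (n_F x D) foreground seed features.
    X_B_list: list of K arrays (n_Bk x D), one per background cluster.
    Returns W (D x D_r) maximizing tr(W^T Q_fb W) / tr(W^T Q_w W) via the
    generalized eigenvalue problem Q_fb w = lambda Q_w w (Eq. 6.23)."""
    D = X_F.shape[1]
    n_F = X_F.shape[0]
    n_B = sum(Xb.shape[0] for Xb in X_B_list)
    n_T = n_F + n_B
    K = len(X_B_list)
    m_F = X_F.mean(axis=0)

    Q_fb = np.zeros((D, D))
    Q_w = (n_F / n_T) * np.cov(X_F, rowvar=False)
    for Xb in X_B_list:
        n_Bk = Xb.shape[0]
        diff = (Xb.mean(axis=0) - m_F)[:, None]
        Q_fb += (n_Bk / n_B) * (diff @ diff.T)                 # Eq. (6.16)
        Q_w += (n_Bk / (n_T * K)) * np.cov(Xb, rowvar=False)   # Eq. (6.17)

    Q_w += ridge * np.eye(D)          # keep Q_w invertible (assumed regularizer)
    eigvals, eigvecs = eigh(Q_fb, Q_w)       # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:D_r]]           # D_r leading discriminants


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_F = rng.normal(2.0, 0.5, size=(60, 20))
    X_B = [rng.normal(m, 1.0, size=(80, 20)) for m in (-2.0, 0.0, 4.0)]
    W = discriminative_projection(X_F, X_B, D_r=5)
    Z_F = X_F @ W                            # project samples via z = W^T x
    print(W.shape, Z_F.shape)
```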
6.2.3 Spatially Constrained Label Propagation

In Sect. 6.2.1, we have seen how a set of seed superpixels with different class labels can be initialized. Now, to find the common object, the regions constituted by the seeds need to be grown by assigning labels to the remaining superpixels. This region growing is done in two stages. First, label propagation is performed considering all superpixels (n_A) from all images simultaneously using the discriminative features (z) described in Sect. 6.2.2. Then the updated superpixel labels in every image are pruned independently using spatial constraints as described next. First, every seed superpixel s_i is assigned a label L ∈ {1, 2, ..., K + 1}. Specifically, superpixels belonging to C_F have label L = 1 and superpixels belonging to C_Bk have label L = k + 1 for k = 1, 2, ..., K. Then a binary seed label matrix Y_b ∈ {0, 1}^{n_A×(K+1)} is defined as

\[ \mathbf{Y}_b(i, l) = 1, \;\; \text{if } L(s_i) = l \qquad (6.26a) \]
\[ \mathbf{Y}_b(i, l) = 0, \;\; \text{if } (L(s_i) \neq l) \vee (s_i \text{ is unlabeled}), \qquad (6.26b) \]

which carries the class information of only the seed superpixels. Here, the unlabeled superpixels are the remaining superpixels other than the seed superpixels. The aim is to obtain an optimal label matrix Y ∈ R^{n_A×(K+1)} with the class information of all the superpixels from all images. Let S_0 ∈ R^{n_A×n_A} be the feature similarity matrix, where S_0(i, j) is a similarity measure between s_i and s_j in the discriminative space. One can use any appropriate measure to compute S_0(i, j), such as the additive inverse of the Euclidean distance between z_i and z_j that represent s_i and s_j, respectively. This similarity matrix can be normalized as:

\[ \mathbf{S} = \mathbf{D}^{-1/2} \mathbf{S}_0 \mathbf{D}^{-1/2}, \qquad (6.27) \]

where D is a diagonal matrix with D(i, i) = Σ_j S_0(i, j). To obtain the optimal Y, the following equation is iterated.
\[ \mathbf{Y}^{(t+1)} = \omega_l\, \mathbf{S}\, \mathbf{Y}^{(t)} + (1 - \omega_l)\, \mathbf{Y}_b, \qquad (6.28) \]

where ω_l is an appropriately chosen weight in (0, 1). The first term updates Y using the similarity matrix S. Thus, labels are assigned to unlabeled superpixels through label propagation from the labeled superpixels. The second term minimizes the difference between Y and Y_b. It has been shown in [150] that Y^{(t)} converges to

\[ \mathbf{Y}^{*} = \lim_{t \to \infty} \mathbf{Y}^{(t)} = (\mathbf{I} - \omega_l \mathbf{S})^{-1} \mathbf{Y}_b. \qquad (6.29) \]

The label of superpixel s_i is obtained as:

\[ L = \arg\max_{j}\, \mathbf{Y}^{*}(i, j), \;\; \text{under constraints C1, C2.} \qquad (6.30) \]
Every row and column of Y_b correspond to a superpixel and a class, respectively. If the number of superpixels in one class C_j is significantly large compared to the remaining classes (C_k, k ≠ j), the columns of Y_b corresponding to the C_k's will be sparse. In such a scenario, the solution to Eq. (6.30) will be biased toward C_j. Hence, every column of Y_b is normalized by its L_0-norm. Next, we need to add two constraints to this solution and update it.
C1: Y*(i, j) is a measure of similarity of superpixel s_i to the set of superpixels with label L = j. If Y*(i, j) is small for all j = 1, 2, ..., K + 1 (i.e., max_j Y*(i, j) < t_l), these similarity values may lead to a wrong label assignment; so it is discarded and the corresponding superpixel s_i remains unlabeled. A good choice of the threshold t_l can be median(Y*).
C2: The label update formulation in Eq. (6.28) does not use any spatial information of the superpixels. Thus, any unlabeled superpixel in an image can get assigned to one of the classes based only on feature similarity. Hence, every newly labeled superpixel may not be a neighbor of the seed regions in that subimage belonging to a certain class, and that subimage may contain many discontiguous regions. But typically, objects (in C_F) and background regions (in C_Bk), e.g., sky, field and water body, are contiguous regions. Hence, a spatial constraint is added to Eq. (6.28): an unlabeled superpixel s_i will be considered for assignment of label L = j using Eq. (6.30) only if it belongs to the first-order spatial neighborhood of an already labeled region (with label L = j) in that subimage.
The result of label propagation with the seed regions of Fig. 6.3a at convergence of Eq. (6.28) is shown in Fig. 6.5a. Due to the above two constraints, not all unlabeled superpixels are assigned labels. Only a limited number of superpixels in the spatial neighborhood of already labeled superpixels are assigned labels. After this label updating, all labeled superpixels are used to again compute a discriminative space using the original input feature vectors (low-level and mid-level features) following the method in Sect. 6.2.2, and label propagation is performed again in that newly computed discriminative space. These two stages are iterated
Fig. 6.5 Label propagation. a Label propagation assigns new labels to a subset of previously unlabeled samples, as well as updates previously labeled samples. Superpixel labels of the foreground and the three background classes in Fig. 6.3 have been updated, and some of the unlabeled superpixels have been assigned to one of the four classes after discriminative space projection and label propagation. b Final co-segmentation result (Row 2) after multiple iterations of successive discriminative subspace projection and label propagation. Rows 3–5 show the background labeled superpixels
alternately until convergence, as shown in the block diagram in Fig. 6.2. The iteration converges if
• either there is no unlabeled superpixel left, or
• labels no longer get updated.
It may be noted that some superpixels may yet remain unlabeled after convergence due to the spatial neighborhood constraint. However, this does not pose any problem as we are interested in the labels of the co-segments only. Figure 6.5b shows the final co-segmentation result after convergence. As ω_l in Eq. (6.28) is nonzero, initial labels of the labeled superpixels also get updated. This is evident from the fact that the green regions in subimages 5, 6 in C_F of Fig. 6.3a are not present after label propagation and are assigned to background classes, as shown in Fig. 6.5b. The strength of this method is further proved by the result that the missing balloon in subimage 4 in C_F of Fig. 6.3a gets recovered.

Label refinement as preprocessing. Every iteration of discriminative space computation (Sect. 6.2.2) and label propagation begins with a set of labeled superpixels and ends with updated labels, where some unlabeled superpixels are assigned labels. To achieve better results, the input labels can be refined before every iteration as a preprocessing step. The motivation behind this label refinement and the procedure are described next. As an illustration, Fig. 6.7 shows the updated common foreground class C_F after performing label refinement on the seed labels shown in Fig. 6.3a (Row 2).
• In Fig. 6.3a, we observe that some connected regions (groups of superpixels) in the common foreground class (C_F) spread from the image left boundary to the right boundary. These regions are most likely to be part of the background. Hence, they are removed from C_F, thus pruning the set. This is illustrated in subimages 3, 4 of C_F in Figs. 6.3a and 6.7.
• In Fig. 6.3, we also observe that there are 'holes' inside the connected regions in some subimages. These missing superpixels either belong to some other class or are unlabeled (not part of the set of already labeled superpixels). Such holes in C_F are filled by assigning the missing superpixels to it, thus enlarging the set. In the case of any background class C_Bk, such holes are filled only if the missing superpixels are unlabeled. Hole filling can be performed using the cycle detection method described in Chap. 5. Alternatively, morphological operations can also be applied (a small sketch of this operation follows this list): the missing superpixels in every image, if any, are found using morphological reconstruction by erosion [118] on the binary image formed by the already labeled superpixels belonging to that image. This is illustrated in Fig. 6.6. The result of hole filling in C_F (of Fig. 6.3a) is illustrated in subimage 1 in Fig. 6.7.
Further, the spatial constraint of the first-order neighborhood can be relaxed for fresh label assignment of superpixels in the case of subimages that have no labeled segment yet. This allowed label assignment to superpixels of subimage 4 in C_F of Fig. 6.7, thus providing much-needed seeds for subsequent label propagation. The entire co-segmentation method is given as pseudocode in Algorithm 1.
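The sketch below illustrates the morphological alternative mentioned in the list above: holes in a binary mask (formed from the labeled superpixels of one image) are filled using reconstruction by erosion from scikit-image. This is only a schematic stand-in for the operation in [118], not the book's implementation.

```python
import numpy as np
from skimage.morphology import reconstruction


def fill_holes(binary_mask):
    """Fill interior holes of a binary mask (e.g. the pixel mask formed by the
    C_F-labeled superpixels of one image) using morphological reconstruction by
    erosion: the marker equals the mask on the image border and the maximum
    value elsewhere, so only regions connected to the border stay unfilled."""
    image = binary_mask.astype(float)
    seed = np.copy(image)
    seed[1:-1, 1:-1] = image.max()   # marker: max inside, original values on the border
    filled = reconstruction(seed, image, method='erosion')
    return filled > 0.5


if __name__ == "__main__":
    mask = np.zeros((12, 12), dtype=bool)
    mask[2:10, 2:10] = True
    mask[5:7, 5:7] = False           # a hole inside the foreground region
    print(int(fill_holes(mask).sum() - mask.sum()))  # number of filled hole pixels
```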
Fig. 6.6 Hole filling. Subimage 1 of C F in Row 2 of Fig. 6.3a shows that the foreground seed region contains five holes (indicated by arrows) which can be filled by morphological operations
Fig. 6.7 Removal of connected regions in C F (in Row 2 of Fig. 6.3a) that spread from left to right image boundary. Note that subimage 4 does not have a foreground seed to begin with
It may be noted that the use of mode detection ensures that superpixels from the outlier images are not part of the computed common foreground seeds due to their distinct features. This in turn helps during the label propagation stage so that outlier superpixels are not added to C F . Thus, the method is able to discard outliers in the final co-segmentation results, and hence the resulting C F , at convergence, only constitutes the common object. This method’s robustness to outlier images is demonstrated in Fig. 6.8, which is discussed in detail in Sect. 6.3.1.
Algorithm 1 Co-segmentation algorithm
Input: Set of images I_1, I_2, ..., I_N
Output: Set of superpixels in all images belonging to the common objects
1: for i = 1 to N do
2:   {s} ← Superpixel (SP) segmentation of every image I_i
3:   Compute background probability P(s) of every s ∈ I_i
4: end for
5: Compute feature x for every SP s using LLC from SIFT, CSIFT and L*a*b* mean color
6: // Initial labeling
7: x̄ (and corresponding SP s̄) ← mode({x ∈ ∪_{i=1}^{N} I_i})
8: N_f(x̄) ← {s : d(x̄, x) < ν_0}
9: Foreground cluster C_F ← s̄ ∪ N_f(x̄)
10: Divide the set {s : P(s) > 0.99} into K clusters and find background clusters C_B1, C_B2, ..., C_BK
11: // Initial cluster update
12: Fill holes and update C_F, C_B1, C_B2, ..., C_BK
13: For each I_i, find R_i ⊆ C_F ∩ I_i that constitutes a contiguous region spreading from the image left to right boundary
14: Update C_F ← C_F \ {∪_{i=1}^{N} R_i}
15: Assign label L_s^(0) ← 1, ∀s ∈ C_F
16: For each k = 1 to K, assign label L_s^(0) ← k + 1, ∀s ∈ C_Bk
17: Initialize t ← 0, C_F^(0) ← C_F, C_B1^(0) ← C_B1, C_B2^(0) ← C_B2, ...
18: // Iterative discriminative subspace projection and label propagation
19: while no convergence do
20:   Find discriminant vectors w using the input features x of C_F^(t), C_B1^(t), C_B2^(t), ..., C_BK^(t) using (6.16), (6.17) and (6.23)
21:   W ← [w_1 w_2 ... w_Dr]
22:   Project every feature vector x as z ← W^T x
23:   Compute similarity matrix S_0 where S_0(i, j) ≜ 1 − d(z_i, z_j)
24:   Compute diagonal matrix D where D(i, i) ≜ Σ_j S_0(i, j)
25:   Compute normalized similarity matrix S ← D^{−1/2} S_0 D^{−1/2}
26:   for s_i ∈ C_F^(t) do
27:     Y_b(i, 1) ← 1/|C_F^(t)|   // Initialize foreground
28:   end for
29:   for k = 1 to K do
30:     for s_i ∈ C_Bk^(t) do
31:       Y_b(i, k + 1) ← 1/|C_Bk^(t)|   // Initialize background
32:     end for
33:   end for
34:   Y* ← (I − ω_l S)^{−1} Y_b   // regularizer ω_l
35:   t_l ← median(Y*)
36:   // Label update
37:   for all s_i ∈ N_s(C_F^(t) ∪ (∪_{k=1}^{K} C_Bk^(t))) do
38:     if max_k Y*(i, k) > t_l then
39:       L_si^(t+1) ← arg max_k Y*(i, k)
40:     end if
41:   end for
42:   C_F^(t+1) ← {s : L_s^(t+1) = 1}
43:   For each k = 1 to K, C_Bk^(t+1) ← {s : L_s^(t+1) = k + 1}
44:   t ← t + 1
45: end while
46: C_F at convergence is the set of superpixels in all images constituting the common objects
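The closed-form label-propagation step at the heart of the iteration (roughly steps 23–34 of Algorithm 1) can be prototyped as below. This is a simplified sketch: it omits the spatial-neighborhood constraint C2 and the exact column normalization, and the clipped similarity used for S_0 is only one possible choice.

```python
import numpy as np
from scipy.spatial.distance import cdist


def propagate_labels(Z, seed_labels, n_classes, omega_l=0.9):
    """Z: (n_A x D_r) projected superpixel features.
    seed_labels: length-n_A array, class index in {0,...,n_classes-1} for
                 seeds and -1 for unlabeled superpixels.
    Returns Y*: (n_A x n_classes) soft label scores (closed form of Eq. 6.29)."""
    n_A = Z.shape[0]
    # similarity: a clipped, scaled additive inverse of Euclidean distance
    d = cdist(Z, Z)
    S0 = np.maximum(1.0 - d / d.max(), 0.0)
    deg = S0.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    S = D_inv_sqrt @ S0 @ D_inv_sqrt                 # Eq. (6.27)
    Yb = np.zeros((n_A, n_classes))
    for c in range(n_classes):
        idx = np.where(seed_labels == c)[0]
        if len(idx) > 0:
            Yb[idx, c] = 1.0 / len(idx)              # per-class seed weight
    Y_star = np.linalg.solve(np.eye(n_A) - omega_l * S, Yb)   # Eq. (6.29)
    return Y_star


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    Z = np.vstack([rng.normal(0, 0.3, (30, 4)), rng.normal(3, 0.3, (30, 4))])
    labels = -np.ones(60, dtype=int)
    labels[:5], labels[30:35] = 0, 1                 # a few seeds per class
    Y = propagate_labels(Z, labels, n_classes=2)
    pred = Y.argmax(axis=1)
    print((pred[:30] == 0).mean(), (pred[30:] == 1).mean())
```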
Table 6.1 Comparison of Jaccard similarity (J) of the methods PM, DCC, DSAD, MC, MFC, JLH, MRW, UJD, RSP, CMP, GMR, OC on the dataset created using the iCoseg dataset and the 38-class iCoseg dataset without outliers

Methods                        PM     MRW    CMP    MFC    MC     UJD
J (iCoseg 626) with outliers   0.73   0.65   0.62   0.61   0.41   0.38
J (iCoseg 38) no outlier       0.76   0.70   0.73   0.64   0.59   0.68

Methods                        DCC    DSAD   JLH    GMR    RSP    OC
J (iCoseg 626) with outliers   0.36   0.26   x      x      x      x
J (iCoseg 38) no outlier       0.42   0.42   0.79   0.76   0.66   0.62
‘x’: code not available to run on outlier data
6.3 Experimental Results

In this section, we analyze the results obtained by the co-segmentation method described in this chapter, denoted as PM, on the same datasets considered in Chap. 5: the MSRC dataset [105], the flower dataset [92], the Weizmann horse dataset [13], the Internet dataset [105], the 38-class iCoseg dataset [8] without any outliers and the 603-set iCoseg dataset containing outliers. We begin by discussing the choice of features in the PM method. Dense SIFT and CSIFT features [127] are computed from all images, and they are encoded using locality-constrained linear coding [138], with a codebook size of 100, to obtain mid-level features. The L*a*b* mean color feature (length 3) and color histogram [18] (length 81) have been used as low-level features. Hence, the feature dimension is D = 100 + 100 + 3 + 81 = 284. Unlike semi-supervised methods, we do not use saliency, CNN features or region proposals (measures of objectness) for initialization here.
6.3.1 Quantitative and Qualitative Analyses

For each image, a binary mask is obtained by assigning the value 1 to all the pixels within every superpixel belonging to C_F and the value 0 to the remaining pixels. This mask is used to extract the common object from that image. If an image does not contain any superpixel labeled C_F, it is classified as an outlier image. Jaccard similarity (J) [105] is used as the metric to quantitatively evaluate the performance of the methods PM, DCC [56], DSAD [60], MC [57], MFC [18], JLH [78], MRW [64], UJD [105], RSP [68], CMP [36], GMR [99], OC [131], EVK [24]. Tables 6.1 and 6.2 provide results on the iCoseg dataset [8] and the Internet dataset [105], respectively. We also show results on images from the Weizmann horse dataset [13], the flower dataset [92] and the MSRC dataset [105] in Table 6.3, considering outliers. Figure 6.8 shows the co-segmentation outputs obtained using the methods PM, DCC, DSAD, MC, MFC, MRW, UJD on the challenging 'panda' images from the iCoseg dataset [8], which also include two outlier images (Image 5 and Image 9) from the 'stonehenge' subset. The method PM correctly co-segments the pandas, whereas
Table 6.2 Comparison of Jaccard similarity (J) on the Internet dataset

Methods              PM      GMR     UJD     MFC     EVK     MRW     DCC     DSAD
J (Car class)        0.653   0.668   0.644   0.523   0.648   0.525   0.371   0.040
J (Horse class)      0.519   0.581   0.516   0.423   0.333   0.402   0.301   0.064
J (Airplane class)   0.583   0.563   0.558   0.491   0.403   0.367   0.153   0.079
J (Average)          0.585   0.604   0.573   0.479   0.461   0.431   0.275   0.061

Table 6.3 Comparison of Jaccard similarity (J) on images selected from the Weizmann horse dataset [13], the flower dataset [92] and the MSRC dataset [105] considering outliers

Methods              PM     MFC    CMP    MRW    DSAD   MC     DCC    UJD
J (horse dataset)    0.81   0.76   0.76   0.75   0.69   0.61   0.58   0.39
J (flower dataset)   0.85   0.82   0.73   0.70   0.71   0.56   0.52   0.45
J (MSRC dataset)     0.73   0.60   0.46   0.62   0.46   0.60   0.61   0.66
other methods partially detect the common objects and also detect some background regions. The methods MC, MRW (in Rows 3, 6) and DSAD (in Row 5) detect either the white or the black segments of the panda. The method UJD (in Row 2) fails to co-segment all the images with large-sized pandas because it is a saliency-based method and a large panda is less salient than a small one. Moreover, methods other than PM cannot handle the presence of outlier images and wrongly co-segment regions of significant size from them.
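For reference, the Jaccard similarity J used in these comparisons is the intersection-over-union of the predicted binary foreground mask and the ground-truth mask. A minimal sketch is given below; treating an empty union as a perfect match is our own convention, not taken from the text.

```python
import numpy as np


def jaccard_similarity(pred_mask, gt_mask):
    """Intersection-over-union between two binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union


if __name__ == "__main__":
    a = np.zeros((10, 10), dtype=bool); a[2:8, 2:8] = True
    b = np.zeros((10, 10), dtype=bool); b[4:9, 4:9] = True
    print(round(jaccard_similarity(a, b), 3))  # overlap of the two squares
```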
6.3.2 Ablation Study

Number of background clusters: In Sect. 6.2.1, we motivated that grouping the background seed superpixels into multiple clusters improves the superpixel labeling accuracy because the variation of superpixel features in each background cluster is less than the variation of features in the entire set of background-labeled superpixels. In Fig. 6.9 (blue curve), this is validated through results of the method 'with' (i.e., K > 1) and 'without' (i.e., K = 1) background superpixel clustering. Further, Jaccard similarity values are also provided by setting the number of background clusters K ∈ {2, 3, 4, 5, 6, 7, 8} on the 626 image sets mentioned above. It is evident that a higher J is achieved when K > 1.

Comparison with baselines: In this co-segmentation method, low-level and mid-level features (LMF) are used for obtaining the seeds for the background regions and the common foreground region using mode detection, as described in Sect. 6.2.1. In contrast, in every iteration of label propagation, first discriminative features (DF) are computed using the LMF of the labeled superpixels (Sect. 6.2.2) and then the computed DF are used to perform label assignment (Sect. 6.2.3). In Sect. 6.2.2, it is explained that DF help to discriminate the different classes better, thus achieving better co-segmentation results. Figure 6.9 shows the values of J obtained by the method on 626 sets of images using LMF (black curve) and DF (blue curve) for performing
label assignment. This validates the choice of DF over LMF in the label propagation stage. Further, this figure shows the robustness of the discriminative space projection based on the measure of foreground–background separation in Eq. (6.15), which performs much better compared to LDA. It provides the values of J obtained using (i) Q_fb and Q_w as the scatter matrices (blue curve) and (ii) Q′_b and Q′_w instead of Q_fb and Q_w (red curve) for the discriminative space projection. It is evident that Q_fb and Q_w (in Eqs. 6.16 and 6.17) outperform LDA, thus validating the efficiency of the foreground–background separation measure in Eq. (6.15), which has been designed specifically for solving the co-segmentation problem. Hence, Q_fb and Q_w have been used for all quantitative analyses. Here, we comment on two aspects of this study. For
Fig. 6.8 Co-segmentation results on images from the iCoseg dataset. For the input image set shown in Row A (out of 12 images, 6 images are shown here, and 6 more images are shown in next page), which includes two outlier images, Image 5 and Image 9 (shown in next page), the co-segmented objects obtained using methods UJD, MC, DCC, DSAD, MRW, MFC, PM are shown in Rows B-H, respectively. Row I shows the ground-truth (GT)
Fig. 6.8 (Continued): Co-segmentation results on images from the iCoseg dataset. For the input image set shown in Row A (out of 12 images, 6 images are shown here, and 6 more images were shown in previous page), which includes two outlier images, Image 5 (shown in previous page) and Image 9, the co-segmented objects obtained using methods UJD, MC, DCC, DSAD, MRW, MFC, PM are shown in Rows B-H, respectively. Row I shows the ground-truth (GT)
Fig. 6.9 Ablation study on the outlier dataset created using the iCoseg dataset by varying the number of background clusters: K = 1 (i.e., no clustering) and K = 2, 3, 4, 5, 6, 7, 8. Jaccard similarity (J) values are provided while performing label propagation (i) with discriminative features (DF) obtained using the formulations of scatter matrices Q_fb and Q_w, (ii) with DF obtained using the formulations in LDA and (iii) with low-level and mid-level features (LMF) alone, i.e., without using DF
K = 1, the co-segmentation-oriented DF formulation reduces to LDA due to their design in Eqs. (6.16) and (6.17). Hence, they have the same J in the plot. Further, we observe that for any choice of K , the PM curve is always above LDA and LMF curves.
6.3.3 Analysis of Discriminative Space

In the co-segmentation algorithm, we begin with a set of seed superpixels for the common foreground class and multiple background classes (Fig. 6.3). This seed selection is done in the space of low-level and mid-level features. In Sect. 6.1, it is motivated that these features are not sufficient for co-segmentation because they do not discriminate the classes well. Figure 6.10a demonstrates this by showing all image superpixels (spatial regions) at their respective locations in the feature space. It is evident that the superpixels belonging to different classes are not well-separated. To attain a better separation among the classes, in Sect. 6.2.2, a discriminative space is obtained where the common foreground superpixels are well-separated from the background superpixels. Figure 6.10c shows all image superpixels (same set of superpixels as in (a)) in this discriminative space. Superpixels of different classes form clusters, and there exists better discrimination among classes compared to the input feature space shown in (a). The cluster constituted by the common foreground class superpixels (balloon in red and blue) is well-separated from the remaining clus-
Fig. 6.10 Effectiveness of discriminative space projection. a All superpixels from the input image set of Fig. 6.3 in their low-level and mid-level feature space. b Superpixels in the discriminative space after label propagation using LDA, i.e., using scatter matrices Q′_b and Q′_w, and c using scatter matrices Q_fb and Q_w. It can be seen that the foreground and background superpixels are better clustered, and the foreground cluster (indicated using bounding box(es)) is well-separated from the background clusters in (c) as compared to (a) and (b). Here, tSNE plots are used to visualize the multidimensional feature vectors in two dimensions
Table 6.4 Computation time (in seconds) required by the methods PM, DCC, DSAD, MC, MFC, MRW, UJD for different cardinalities of the image sets on an i7, 3.5 GHz PC with 16 GB RAM

No. of images   PM    DSAD   MFC    DCC    MRW    UJD    MC
8               9     41     69     91     256    266    213
16              30    81     141    180    502    672    554
24              79    121    205    366    818    1013   1202
32              150   162    273    452    1168   1342   1556
40              285   203    367    1112   2130   1681   3052
PE              M     M+C    M+C    M+C    M+C    M+C    M+C
Here, PE stands for programming environment, M for MATLAB and C for C language
ters constituted by the background class superpixels. This validates the formulations of scatter matrices in Eq. (6.16), Eq. (6.17). Figure 6.10b shows the superpixels in the discriminative space obtained using LDA, where the foreground class superpixels got clustered into multiple groups (3 in this case). These visualizations concur with the ablation study of Fig. 6.9.
6.3.4 Computation Time

In Table 6.4, we provide the computation time taken by the methods PM, DCC, DSAD, MC, MFC, MRW, UJD for processing sets of 8, 16, 24, 32 and 40 images. For the method PM, on average, 15 iterations are required for convergence of label propagation. Despite being run in MATLAB, it is computationally faster than all other methods. The method DSAD is quite competitive but the performance is comparatively poor as given in Table 6.1. The method PM processes all the images in the set simultaneously unlike some methods. For example, UJD performs SIFT matching with only the 16 most similar images in the set, whereas MRW generates multiple random subsets of the image set and computes average accuracy.

In this chapter, we have explained a co-segmentation algorithm that considers the challenging scenario where the image set contains outlier images. First, a discriminative space is obtained and then label assignment (background or common foreground) is performed for image superpixels in that space. Thus, we obtain the common object that is constituted by the set of superpixels having been assigned the common foreground label. Label propagation starts with a set of seed superpixels for different classes. It has been shown that statistical mode detection helps in automatically finding the seed from the images without any supervision. The choice of using mode was driven by the objective of generating seed superpixels for foreground robustly, efficiently and in an unsupervised manner. Further, it has been shown that multiple class modeling of background is more effective to capture its large heterogeneity.
The measure of foreground–background separation with multiple background classes helps to find a more discriminative space that efficiently separates the foreground from the rest, thus yielding robust co-segmentation. Further, multiple iterations of the discriminative space projection in conjunction with label propagation result in a more accurate labeling. Spatial cohesiveness of the detected superpixels constituting the co-segmented objects is achieved using a spatial constraint at the label propagation stage.

Acknowledgements Contributions of Dr. Feroz Ali are gratefully acknowledged.
Chapter 7
Co-segmentation Using Graph Convolutional Network
7.1 Introduction

Extracting a common object from a collection of images captured in a completely uncontrolled environment, resulting in significant variations of the object of interest in terms of its appearance, pose and illumination, is certainly a challenging task. The unsupervised approaches discussed in the previous chapters have limitations in handling such scenarios due to their requirement of finding appropriate features of the common object. In this chapter, we discuss an end-to-end learning method for co-segmentation based on graph convolutional neural networks (graph CNN) that formulates the problem as a classification task of superpixels into the common foreground or background class (similar to the method discussed in Chap. 6). Instead of predefining the choice of features, here superpixel features are learned with the help of a dedicated loss function. Since the overall network is trained in an end-to-end manner, the learned model is able to perform both feature extraction and superpixel classification simultaneously, and hence these two components assist each other in achieving their individual objectives.

We begin with an overview of the method described in this chapter. Similar to the previous chapters, each individual image is oversegmented into superpixels, and a graph is computed by exploiting the spatial adjacency relationship of the extracted superpixels. Then for each image pair, using their individual spatial adjacency graphs, a global graph is obtained by connecting each node in one graph to a group of very similar nodes in the other graph, based on a node feature similarity measure. Thus, the resulting global graph is a joint representation of the image pair. The graph CNN model then classifies each node (superpixel) in this global graph using the learned features into the common foreground or the background class. The rationale behind choosing a graph CNN is that it explicitly uses neighborhood relationships through the graph structure to compute superpixel features, which is not achievable with a regular CNN without considerable preprocessing and approximations. As a result, the learned superpixel features become more robust to appearance and pose variations of the object of interest and carry a greater amount of context. In order
to reduce the time required for the learning to converge, the model uses additional supervision in the form of semantic labels of the object of interest for some of the data points. The overall network, therefore, comprises two shared subnetworks, where one of them monitors the shared foreground and background labels, while the other extracts semantic characteristics from the associated class labels.
7.2 Co-segmentation Framework

Given a collection of images I, each image I_i ∈ I is oversegmented using the simple linear iterative clustering (SLIC) algorithm [1] into a non-overlapping superpixel set S_i. The goal is to obtain the label of each superpixel as either common foreground or background.
7.2.1 Global Graph Computation

In the graph convolutional neural network (graph CNN)-based co-segmentation of an image pair, the network requires a joint representation of the corresponding graph pair as a single entity. So, the first step in this algorithm is to combine the graph pair and obtain a global graph containing the feature representation of superpixels from both images. Then it can be processed through the graph CNN for co-segmentation. However, the number of superpixels (|S_i| = n_i) computed by the SLIC algorithm typically varies for each image. Since image superpixels represent nodes in the corresponding region adjacency graphs, the resulting graphs for the image set will be of different sizes. To avoid this non-uniformity, it is ideal to determine the minimum number n = min_i n_i, and perform superpixel merging in each image such that the cardinality of the superpixel set S_i corresponding to every image I_i becomes n. For this purpose, given an image's superpixel set, the superpixel pair with the highest feature similarity can be merged, and this process can be repeated until there are n superpixels left. To avoid the possibility of some superpixels blowing up in size, a merged superpixel should not be allowed to be merged again in subsequent iterations. A pseudocode for this procedure is provided in Algorithm 1. Let I_1, I_2 be the image pair to be co-segmented, and S_1, S_2 be the associated superpixel sets. In order to combine the corresponding graph pair to obtain a global graph, an initial feature f_s ∈ R^D for each superpixel s ∈ S_i, ∀i, is required. One may consider any appropriate feature such as an RGB color histogram, dense SIFT features, etc. Each image I_i is represented as an undirected, sparse graph G_i = (S_i, A_i), in which each superpixel represents a node, and superpixels that are spatially adjacent are connected by an edge. Different from the previous chapters, this graph is required to be weighted. Thus, the adjacency matrix A_i ∈ R^{n×n} is defined as:
Algorithm 1 Superpixel merging algorithm
Input: Set of images I = {I_1, I_2, ..., I_m}
Output: Superpixel sets S_1, S_2, ..., S_m such that |S_i| = n, ∀i
1: for i = 1 to m do
2:   Perform superpixel segmentation of I_i
3:   S_i ← set of superpixels {s} in I_i
4:   n_i = |S_i|
5: end for
6: n = min_i n_i
7: // Superpixel merging
8: for i = 1 to m do
9:   S_φ ← φ
10:  while n_i > n do
11:    for all pairs s, r ∈ S_i do
12:      W(s, r) = exp( -λ Σ_{l=1}^{D} (f_s(l) - f_r(l))^2 / (f_s(l) + f_r(l)) )
13:    end for
14:    (s_m, r_m) = argmax_{(s,r)} { W(s, r) : (s, r) ∈ S_i \ S_φ }
15:    // Obtain a larger superpixel by grouping the pixels of the two superpixels
16:    s̄ = Merge(s_m, r_m)
17:    S_i ← {S_i ∪ s̄} \ {s_m, r_m}
18:    n_i = |S_i|
19:    S_φ ← S_φ ∪ s̄
20:  end while
21: end for
22: S_1, S_2, ..., S_m are the superpixel sets used for graph construction and the remaining stages
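To make the merging step concrete, the following Python sketch implements the greedy procedure of Algorithm 1 for a single image. It assumes each superpixel is described by a D-dimensional histogram (e.g., an RGB color histogram), so that the feature of a merged region is simply the sum of the two histograms; the function names and this feature-merging rule are illustrative choices, not part of the original formulation.

```python
import numpy as np

def chi_square_similarity(f_s, f_r, lam=1.0):
    # Chi-square kernel of Eq. (7.2); eps avoids division by zero for empty bins.
    eps = 1e-12
    return np.exp(-lam * np.sum((f_s - f_r) ** 2 / (f_s + f_r + eps)))

def merge_superpixels(features, target_n, lam=1.0):
    """Greedy merging of Algorithm 1 for one image.

    features : list of D-dimensional histograms, one per superpixel.
    Returns a list of index groups; each group holds the original superpixel
    indices that were merged into a single superpixel.
    """
    feats = {i: np.asarray(f, dtype=float) for i, f in enumerate(features)}
    members = {i: [i] for i in range(len(features))}
    frozen = set()                 # ids produced by a merge (S_phi in Algorithm 1)
    next_id = len(features)

    while len(feats) > target_n:
        candidates = [i for i in feats if i not in frozen]
        if len(candidates) < 2:    # every remaining superpixel is frozen
            frozen.clear()
            continue
        # Pick the most similar unfrozen pair (lines 11-14 of Algorithm 1).
        best, pair = -np.inf, None
        for a_pos in range(len(candidates)):
            for b_pos in range(a_pos + 1, len(candidates)):
                a, b = candidates[a_pos], candidates[b_pos]
                w = chi_square_similarity(feats[a], feats[b], lam)
                if w > best:
                    best, pair = w, (a, b)
        a, b = pair
        feats[next_id] = feats[a] + feats[b]      # histogram of the merged region
        members[next_id] = members[a] + members[b]
        frozen.add(next_id)                       # do not merge it again (line 19)
        for k in (a, b):
            del feats[k], members[k]
        next_id += 1
    return list(members.values())
```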
$$
A_i(s, r) = \begin{cases} W(s, r), & \forall r \in \mathcal{N}^1(s),\ \forall s \in I_i \\ 0, & \text{otherwise} \end{cases} \qquad (7.1)
$$
where W(s, r) signifies the feature similarity between superpixels s and r, and N^1 denotes the first-order neighborhood defined by spatial adjacency. Thus, edge weights measure the feature similarity of linked nodes, i.e., spatially adjacent superpixel pairs. The Chi-square kernel used to determine W(s, r) for a superpixel pair (s, r) with feature vectors f_s and f_r is defined as:

$$
W(s, r) = \exp\left(-\lambda \sum_{l=1}^{D} \frac{\big(f_s(l) - f_r(l)\big)^2}{f_s(l) + f_r(l)}\right), \qquad (7.2)
$$

where λ is a parameter. To construct a global graph from the individual graphs G_1, G_2, we need to consider the affinity among nodes across the graph pair. Thus, the nodes in an inter-graph node pair possessing a high feature similarity value are considered to be neighbors, and they can be connected by an edge. This leads to a global graph G_global = (S_global, A_global) with node set S_global = {S_1 ∪ S_2}, |S_global| = 2n, and adjacency matrix A_global ∈ R^{2n×2n}, defined as:
$$
A_{\text{global}}(s, r) = \begin{cases} A_i(s, r), & \text{if } (s, r) \in \mathcal{G}_i,\ i = 1, 2 \\ W(s, r)\,\mathbb{1}\big(W(s, r) > t\big), & \text{if } s \in \mathcal{G}_i,\ r \in \mathcal{G}_j,\ i \neq j, \end{cases} \qquad (7.3)
$$
where 1(·) is an indicator function, and Aglobal (s, r) = 0 indicates s ∈ Gi , r ∈ G j are not connected. The threshold t controls the number of inter-image connections. Thus, a high threshold reduces the number of inter-graph edges, and vice-versa. It is evident that Aglobal retains information from A1 and A2 , and attaches the inter-image superpixel affinity information to them as:
$$
A_{\text{global}} = \begin{bmatrix} A_1 & O \\ O & A_2 \end{bmatrix} + A_{\text{inter}}, \qquad (7.4)
$$
where A_inter is the inter-image adjacency matrix representing the node (superpixel) connections made across the image pair while forming G_global. Co-segmentation utilizing this global graph, by learning a graph convolution model, is described next.
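As a concrete illustration of Eqs. (7.1)–(7.4), the NumPy sketch below assembles the global adjacency matrix for an image pair from precomputed superpixel features and first-order (spatial) neighbor lists. The helper names and data layout are assumptions of this sketch, not part of the original method.

```python
import numpy as np

def chi2_kernel(f_s, f_r, lam=1.0):
    # Chi-square feature similarity of Eq. (7.2).
    eps = 1e-12
    return np.exp(-lam * np.sum((f_s - f_r) ** 2 / (f_s + f_r + eps)))

def intra_adjacency(feats, neighbors, lam=1.0):
    """Eq. (7.1): weighted region adjacency matrix of one image.

    feats     : (n, D) array of superpixel features.
    neighbors : list of lists; neighbors[s] = first-order neighbors of superpixel s.
    """
    n = feats.shape[0]
    A = np.zeros((n, n))
    for s in range(n):
        for r in neighbors[s]:
            A[s, r] = chi2_kernel(feats[s], feats[r], lam)
    return A

def global_adjacency(feats1, feats2, A1, A2, t=0.65, lam=1.0):
    """Eqs. (7.3)-(7.4): block-diagonal intra-image part plus thresholded
    inter-image similarities (A_inter)."""
    n = feats1.shape[0]
    A_global = np.zeros((2 * n, 2 * n))
    A_global[:n, :n] = A1
    A_global[n:, n:] = A2
    for s in range(n):
        for r in range(n):
            w = chi2_kernel(feats1[s], feats2[r], lam)
            if w > t:                       # keep only strong inter-image edges
                A_global[s, n + r] = w
                A_global[n + r, s] = w      # the global graph is undirected
    return A_global
```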
7.3 Graph Convolution-Based Feature Computation

Given the global graph G_global containing 2n nodes, let

$$
F_{\text{in}} = [\,\mathbf{f}_1\ \mathbf{f}_2\ \cdots\ \mathbf{f}_{2n}\,]^T \in \mathbb{R}^{2n \times D}, \qquad (7.5)
$$
where each row represents the D-dimensional feature f_i^T of a node. This feature matrix is input to a graph CNN, and let F_out be the output feature matrix after graph convolution. In graph signal processing, these feature matrices are frequently referred to as graph signals. Rewriting the input feature matrix as F_in = [F_1 F_2 ··· F_D], each input signal F_i is of length 2n and the signal has D channels ({F_i}_{i=1}^D). In traditional CNNs, each convolution kernel is typically an array that is suitable for convolving with an image, since pixels lie on a rectangular grid. However, graphs do not have such a regular structure. Hence, a graph convolution filter is built from the global graph's adjacency matrix A_global so that the spatial connectivity and the intra-image as well as inter-image superpixel feature similarities can be exploited to obtain the output graph signal. Specifically, a convolution filter is designed as an affine function of the graph adjacency matrix, where the coefficients represent the filter taps. Because the filter is coupled with a single filter tap, it functions effectively as a low pass filter. This in turn facilitates the decomposition of the adjacency matrix into several components as:

$$
A_{\text{global}} = I + A_{\text{inter}} + \sum_{i=1}^{T} A_{d_i} \qquad (7.6)
$$
Fig. 7.1 At each node, eight angular bins are considered. Subsequently, depending upon the relative orientation with respect to the center node (i.e., node at which convolution is getting computed, shown in blue circle), each neighboring node falls into one of the bins. Consequently, the corresponding directional adjacency matrix (Adi ) dedicated to the bin will be activated. For example, node 3 falls in bin 2, and Ad2 gets activated. Thus, each neighboring node is associated with either of eight adjacency matrices (for eight different directions). Figure courtesy: [6]
where I is the 2n × 2n identity matrix, A_inter is the inter-image adjacency matrix introduced earlier, and the A_{d_i} are the directional adjacency matrices representing superpixel adjacency in both images. We now explain their roles. Since G_global is composed of the individual spatial adjacency graphs (G_1, G_2), the adjacency matrix A_inter is introduced in Eq. (7.6) to represent inter-image superpixel feature similarities. It is crucial for co-segmentation since it contains information about the common object in the form of feature similarities between various superpixel pairs across images. We will see in the next section that it receives distinct attention during convolution. The set of all potential directions around every superpixel (node) in the respective image is quantized into T bins or directions, and each matrix A_{d_i} contains information about the feature similarity of only the adjacent nodes along direction d_i. The direction of a node r in relation to a reference node s is given by the angle θ between the line segment connecting the centroids of the two nodes (superpixels) and the horizontal axis, as shown in Fig. 7.1. Thus, each row of A_{d_i} corresponds to a specific node of the global graph, and it contains feature similarities between that node and its first-order neighbors in the respective graph along direction d_i:

$$
A_{d_i}(s, r) = \begin{cases} A_{\text{global}}(s, r), & \text{if } (s, r) \in \mathcal{G}_1 \text{ or } \mathcal{G}_2, \text{ and } \theta(s, r) \in \text{bin}_i \\ 0, & \text{otherwise} \end{cases} \qquad (7.7)
$$
Multiple adjacency matrices have been proven to have a significant effect on graph convolution in [121]. Splitting an image’s pixel-based adjacency matrix into directional matrices is trivial because each pixel’s local neighborhood structure is identical, that is, each pixel has eight neighboring pixels on a regular grid. However,
performing the same operation on any random graph, as considered here, is not straightforward because region adjacency graphs {Gi } constructed from different images exhibit an inconsistent pattern of connectivity among nodes. As a result, each node has a unique local neighborhood structure, including the number and orientation of its neighbors. To address this scenario, uniformly partitioning the surrounding 360◦ of each node of individual graphs into a set of angular bins is beneficial. This helps to encode the orientation information of its various first-order neighbors. Using a set of eight angular bins implicitly maintains the filter size 3 × 3, motivated by the benefits and success of VGG nets. Thus, the number of distinct directions T is chosen as 8, where each Adi corresponds to an angular bin.
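One possible way to build the T = 8 directional matrices of Eq. (7.7) from superpixel centroids is sketched below; the convention that bin 0 starts at 0° on the horizontal axis is an assumption made only for illustration.

```python
import numpy as np

def directional_adjacencies(A_global, centroids, n, T=8):
    """Split the intra-image part of A_global into T directional matrices (Eq. 7.7).

    A_global  : (2n, 2n) global adjacency matrix.
    centroids : (2n, 2) array of (x, y) superpixel centroids; rows n..2n-1
                belong to the second image.
    """
    A_dirs = [np.zeros_like(A_global) for _ in range(T)]
    bin_width = 2 * np.pi / T
    for s in range(2 * n):
        for r in range(2 * n):
            same_image = (s < n) == (r < n)
            # Only intra-image edges are binned; inter-image edges stay in A_inter.
            if s == r or not same_image or A_global[s, r] == 0:
                continue
            dx, dy = centroids[r] - centroids[s]
            theta = np.arctan2(dy, dx) % (2 * np.pi)   # angle in [0, 2*pi)
            k = int(theta // bin_width) % T            # angular bin index
            A_dirs[k][s, r] = A_global[s, r]
    return A_dirs
```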
7.3.1 Graph Convolution Filters

Given an input graph signal F_in consisting of D input features {F_i}_{i=1}^D, and the adjacency matrices {A_{d_k}}_{k=1}^T and A_inter representing the global graph, the graph convolution filters are designed as:

$$
H_i = h_{i,0} I + \sum_{k=1}^{T} h_{i,k} A_{d_k} + h_{i,T+1} A_{\text{inter}}, \quad \forall i = 1, 2, \ldots, D, \qquad (7.8)
$$

where each graph convolution filter is the set {H_i}_{i=1}^D, with {h} being the filter taps. It produces one new feature \tilde{F}_j from the D input features {F_i}_{i=1}^D through the graph convolution operation:

$$
\tilde{F}_j = \sum_{i=1}^{D} H_i F_i. \qquad (7.9)
$$
Therefore, in a particular graph convolutional layer with p graph convolution filters, a series of new features {\tilde{F}_j}_{j=1}^p is produced, resulting in a new graph signal F^{(1)} = [\tilde{F}_1 \tilde{F}_2 ··· \tilde{F}_p] ∈ R^{2n×p}. Then, using the features in F^{(1)} as the input signal, a second graph convolutional layer outputs F^{(2)}. Similarly, F^{(3)}, F^{(4)}, ..., F^{(L)} are obtained after L layers of graph convolutions, with F_out = F^{(L)} being the final output graph signal. Next, we show that each F^{(l)} can be represented in a recursive framework, and this helps to analyze how node information propagates beyond the first-order neighborhood. Similar to CNNs, here a convolution kernel H_i is used for every input channel i, where D is the number of input channels, and this set of D kernels together is considered as one convolution filter. For every output channel j, let us denote H_i as H_{j,i} and define:
$$
\tilde{H}_j \triangleq [\,H_{j,1}\ H_{j,2}\ \ldots\ H_{j,D}\,] \in \mathbb{R}^{2n \times 2nD}. \qquad (7.10)
$$
A block diagonal matrix R^{(1)} is defined using all p convolution filters {\tilde{H}_j}_{j=1}^p as:

$$
R^{(1)} = \begin{bmatrix} \tilde{H}_1 & O & \ldots & O \\ O & \tilde{H}_2 & \ldots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \ldots & \tilde{H}_p \end{bmatrix} \qquad (7.11)
$$
Let F^{(0)} = F_in be the input graph signal for layer 1, and define v^{(0)} ∈ R^{2nD} and another block diagonal matrix B^{(0)} as:

$$
v^{(0)} = \begin{bmatrix} F_1^{(0)} \\ F_2^{(0)} \\ \vdots \\ F_D^{(0)} \end{bmatrix} \quad \text{and} \quad B^{(0)} = \begin{bmatrix} v^{(0)} & o & \ldots & o \\ o & v^{(0)} & \ldots & o \\ \vdots & \vdots & \ddots & \vdots \\ o & o & \ldots & v^{(0)} \end{bmatrix} \qquad (7.12)
$$
where the vector v^{(l)} at any particular layer l contains the output features of all nodes. Now, for layer 1, the output features can be calculated as:

$$
R^{(1)} B^{(0)} \mathbf{1}^{(1)} = \begin{bmatrix} F_1^{(1)} \\ F_2^{(1)} \\ \vdots \\ F_p^{(1)} \end{bmatrix} = v^{(1)} \qquad (7.13)
$$
It may be noted that F_j^{(1)} was denoted as \tilde{F}_j in Eq. (7.9) for simplicity of notation. The output features at any layer l can be computed using the recursive relationship:

$$
v^{(l)} = R^{(l)} B^{(l-1)} \mathbf{1}^{(l)} \quad \text{for } l \ge 1 \qquad (7.14)
$$
Here, \mathbf{1}^{(l)} is a vector containing p ones, where p is the number of channels at the output of the l-th convolutional layer. Assuming L layers and p convolution filters in the L-th layer, F_1^{(L)}, F_2^{(L)}, ..., F_p^{(L)} are obtained from v^{(L)}, and they constitute F_out.
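A direct NumPy sketch of Eqs. (7.8)–(7.9) follows: a graph convolutional layer is a learned affine combination of the identity, the directional adjacency matrices and the inter-image adjacency matrix, applied per input channel and summed over channels. Random filter taps stand in for learned parameters, and the layout of the tap array is an assumption of this sketch.

```python
import numpy as np

def graph_conv_layer(F_in, A_dirs, A_inter, taps):
    """One graph convolutional layer (Eqs. 7.8-7.9), without nonlinearity.

    F_in    : (2n, D) input graph signal.
    A_dirs  : list of T directional adjacency matrices, each (2n, 2n).
    A_inter : (2n, 2n) inter-image adjacency matrix.
    taps    : (p, D, T + 2) filter taps; taps[j, i] holds
              [h_{i,0}, h_{i,1}, ..., h_{i,T}, h_{i,T+1}] for output channel j.
    Returns the (2n, p) output graph signal.
    """
    two_n, D = F_in.shape
    p = taps.shape[0]
    T = len(A_dirs)
    I = np.eye(two_n)
    F_out = np.zeros((two_n, p))
    for j in range(p):                       # each output channel
        for i in range(D):                   # each input channel
            H = taps[j, i, 0] * I + taps[j, i, T + 1] * A_inter
            for k in range(T):
                H += taps[j, i, k + 1] * A_dirs[k]
            F_out[:, j] += H @ F_in[:, i]    # Eq. (7.9): sum over input channels
    return F_out

# Example usage with random data (n superpixels per image, T = 8 directions).
if __name__ == "__main__":
    n, D, p, T = 5, 4, 3, 8
    rng = np.random.default_rng(0)
    F_in = rng.normal(size=(2 * n, D))
    A_dirs = [rng.random((2 * n, 2 * n)) for _ in range(T)]
    A_inter = rng.random((2 * n, 2 * n))
    taps = rng.normal(size=(p, D, T + 2))
    print(graph_conv_layer(F_in, A_dirs, A_inter, taps).shape)   # (10, 3)
```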
7.3.2 Analysis of Filter Outputs

Each graph convolution layer has its own collection of filters, each with a unique set of learnable filter taps {h}, and as a result creates a new set of node features, i.e., a new graph signal. Further, Eq. (7.14) shows that the graph signal at various layers can be computed recursively from v^{(l)}. Now, v^{(1)} is a function of the adjacency matrix A of the graph (see Eqs. 7.8–7.14). Therefore, v^{(2)} becomes a function of A^2, and similarly, v^{(l)} becomes a function of A^l. Thus, convolution of graphs using a series of filters over
multiple layers gradually enhances the receptive field of the filter by involving higher order neighbors in the computation of individual node features. For instance, at layer l, l-th order neighbors are considered when computing the features of individual nodes, which also increases the model's nonlinearity. This substantially enhances the amount of contextual information used in feature computation, and hence the model's performance. Representing the graph convolution filter as a polynomial of the adjacency matrix serves two purposes: (1) it makes the filter linear and shift invariant by sharing filter parameters across nodes, and (2) the varying powers of the adjacency matrix ensure that higher order neighbors are involved in feature computation, which infuses the derived features with additional contextual information. It can be observed from the convolution Eqs. (7.8)–(7.14) that the model is capable of handling any heterogeneous graph. In a CNN, convolution does not place any constraint on the image size or pixel connectivity, since the operation of convolution is not dependent on the input size; however, such a constraint is imposed by the fully connected layers. Such et al. [121] addressed this issue by utilizing graph embed pooling, which converts heterogeneous graphs to a fixed-size regular graph. To accomplish node classification with the method described here, we can connect a set of fully connected layers and a softmax classifier to each node. However, it is not possible to employ any type of node pooling approach in this formulation, as this would result in the loss of genuine node information. As a result, it is necessary to maintain a consistent number of superpixels for each image, ensuring an identical number of nodes for both training and test images.
7.4 Network Architecture

The flow of the graph CNN-based co-segmentation method is shown in Fig. 7.2. The network takes the global graph generated from an image pair as input and performs co-segmentation by classifying the nodes with appropriate labels. Specifically, the co-segmentation branch assigns a binary label to each node, foreground or background, and the classification branch predicts the class label of the entire global graph.

Co-segmentation branch: Given a global graph, this branch (highlighted by the red dotted line) first extracts superpixel features through a series of graph convolutional layers, which make up the graph CNN. In this architecture, L = 8 convolutional layers are used, containing 32, 64, 64, 256, 256, 512, 512, 512 filters, respectively. Thus, the output feature matrix F_out ∈ R^{2n×512} obtained at the final convolutional layer contains the 512-dimensional feature vector of every superpixel (node). Next, to classify each node as either the common foreground or the background, the node features are passed through four 1 × 1 convolution layers with 256, 128, 64, 2 filters, respectively. All layers except the final 1 × 1 convolutional layer are associated with a ReLU nonlinearity. The final 1 × 1 convolutional layer uses a softmax layer for classifying each superpixel. Thus, this classification is based on the learned superpixel features and does not use any semantic label information.
Fig. 7.2 Complete architecture of the co-segmentation model described in this chapter, where GCN stands for graph convolutional network and FC stands for fully connected network. GCN:32F + 64F + 64F + 256F + 256F + 512F + 512F + 512F, 1 × 1 convolution:256F + 128F + 64F (for co-segmentation block), FC:128F + 64F (for semantic segmentation block). The GCN is shared between two subnetworks. Each of the convolution and fully connected layers are associated with ReLU nonlinearity and batch normalization. During training, both the co-segmentation and semantic classification blocks are present. During testing, only the co-segmentation block (as highlighted in red dotted line) is present. Figure courtesy: [6]
Classification branch: This branch is responsible for learning the commonality information in the image pair through classification of the global graph into one of the K semantic classes, assuming the images contain objects from a common class. It shares the eight convolutional layers of the graph CNN in the co-segmentation branch to extract the output node features F_out ∈ R^{2n×512}. This feature matrix is then passed through a fully connected network of three layers with 128, 64 and K filters, respectively, to classify the entire global graph into the common object class that the image pair belongs to. The first two layers are associated with a ReLU nonlinearity, whereas the final layer is associated with a softmax layer to predict the class label. Unlike in the co-segmentation branch, where the 512-dimensional feature vector of every node is classified into two classes, here the entire feature map F_out is flattened and input to the fully connected network. Training this classification branch infuses semantic information into the entire network while obtaining the graph convolution features. It may be noted that since the task at hand is co-segmentation, the classification branch is considered only during the training phase, where it affects the learned weights of the graph convolutional layers. During test time, only the co-segmentation branch is required for obtaining the label (foreground or background) of each node.
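The two heads that sit on top of the shared graph CNN can be written compactly in PyTorch, as sketched below. The layer widths follow the text (1 × 1 convolutions with 256, 128, 64, 2 filters and fully connected layers with 128, 64, K units); everything else, including the use of Conv1d to realize 1 × 1 convolutions over nodes, is an illustrative choice.

```python
import torch
import torch.nn as nn

class CosegHead(nn.Module):
    """Per-node 1x1 convolutions: 512 -> 256 -> 128 -> 64 -> 2 (fg/bg)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(512, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 2, 1),
        )

    def forward(self, f_out):            # f_out: (batch, 2n, 512)
        x = f_out.transpose(1, 2)        # -> (batch, 512, 2n) for Conv1d
        logits = self.net(x).transpose(1, 2)
        return logits.softmax(dim=-1)    # per-node fg/bg probabilities

class ClassificationHead(nn.Module):
    """Flattened node features -> FC 128 -> FC 64 -> K semantic classes."""
    def __init__(self, num_nodes, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes * 512, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, f_out):            # f_out: (batch, 2n, 512)
        return self.net(f_out.flatten(1)).softmax(dim=-1)
```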
7.4.1 Network Training and Testing Strategy

For the task of segmentation, the ground-truth data of an image is typically provided as a mask where each pixel is labeled (foreground or background here). However, for superpixel-based segmentation, as considered in this chapter, we first need to compute ground-truth labels for each superpixel. To obtain the superpixel labels, the oversegmentation map obtained using the SLIC algorithm is overlaid on the binary ground-truth mask, and each superpixel's area of overlap with the foreground region is noted. If a superpixel s has an overlap of more than fifty percent of its total area with the foreground region, it is assigned a foreground (j = 1) label: x_{1,s} = 1 and x_{2,s} = 0. Otherwise, it is assigned a background (j = 2) label: x_{1,s} = 0 and x_{2,s} = 1. With this ground-truth information, the co-segmentation branch of the network is trained using a binary cross-entropy loss L_bin that classifies each superpixel into foreground or background. It is defined as:
$$
\mathcal{L}_{\text{bin}} = -\frac{1}{n_{\text{samples}}} \sum_{k=1}^{n_{\text{samples}}} \sum_{i=1}^{2} \sum_{s=1}^{n} \sum_{j=1}^{2} \frac{1}{\mu_j}\, x_{j,s}^{i,k} \log\big(\hat{x}_{j,s}^{i,k}\big) \qquad (7.15)
$$
where n_samples denotes the number of image pairs (samples) in a mini-batch during training, and n is the number of superpixels in every image. For a superpixel s in image I_i of image pair k, the variables x_{j,s}^{i,k} ∈ {0, 1} and x̂_{j,s}^{i,k} ∈ [0, 1] denote the ground-truth label and the predicted probability, respectively, of superpixel s belonging to the foreground or the background. Further, it is quite common for the number of foreground-labeled superpixels to be smaller than the number of background-labeled superpixels, since the background typically covers more area in natural images. This creates data imbalance during training. Hence, the foreground and background class superpixel frequencies over the training dataset, μ_1 and μ_2, are incorporated in the loss function of Eq. (7.15) to balance out the contributions of the two classes. The classification branch of the network is trained using a categorical cross-entropy loss L_multi that classifies the global graph into one of the K semantic classes. It is defined as:
$$
\mathcal{L}_{\text{multi}} = -\frac{1}{n_{\text{samples}}} \sum_{k=1}^{n_{\text{samples}}} \sum_{i=1}^{2} \sum_{s=1}^{n} \sum_{c=1}^{K} \frac{1}{\phi_c}\, y_{c,s}^{i,k} \log\big(\hat{y}_{c,s}^{i,k}\big) \qquad (7.16)
$$
where y_{c,s} ∈ {0, 1} and ŷ_{c,s} ∈ [0, 1] denote the true semantic class label and the predicted probability of superpixel s belonging to class c. Since the number of superpixels belonging to different semantic classes varies over the training dataset, similar to the formulation of L_bin, the semantic class frequencies over the training dataset, φ_c, are utilized to eliminate the problem of semantic class imbalance. The entire network is trained using a weighted combination of the losses L_bin and L_multi:

$$
\mathcal{L} = \omega_1 \mathcal{L}_{\text{bin}} + (1 - \omega_1) \mathcal{L}_{\text{multi}}, \qquad (7.17)
$$

where ω_1 is a weight. Thus, the parameters in the graph CNN layers are influenced by both losses, whereas the 1 × 1 convolutional layers of the co-segmentation branch and the fully connected layers of the classification branch are influenced by L_bin and L_multi, respectively. This loss L can be minimized using a mini-batch stochastic gradient descent optimizer. Backpropagating L_multi ensures that the learned model can differentiate between classes, so the computed features become more discriminative and exclusive to distinct classes. Minimizing the loss L_bin, computed using the ground-truth binary masks of image pairs, ensures that the learned model can distinguish foreground and background superpixels well, resulting in good segmentation performance. Once the model is learned, to co-segment a test image pair, the corresponding global graph is passed through the co-segmentation branch, and the final softmax layer classifies each superpixel as either the common foreground or the background class. The classification branch is not required during testing.
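A minimal NumPy sketch of the class-balanced losses of Eqs. (7.15)–(7.17) is given below, assuming the network's per-superpixel probabilities are already available; the balancing by inverse class frequency follows the text, while the array layout is our own choice.

```python
import numpy as np

def balanced_bce(x_true, x_pred, mu):
    """Eq. (7.15): class-balanced binary cross-entropy over superpixels.

    x_true : (n_samples, 2, n, 2) one-hot fg/bg labels.
    x_pred : (n_samples, 2, n, 2) predicted probabilities.
    mu     : (2,) foreground/background superpixel frequencies.
    """
    eps = 1e-12
    per_term = x_true * np.log(x_pred + eps) / mu      # broadcast over last axis
    return -per_term.sum() / x_true.shape[0]

def balanced_cce(y_true, y_pred, phi):
    """Eq. (7.16): class-balanced categorical cross-entropy over K classes.

    y_true, y_pred : (n_samples, 2, n, K); phi : (K,) class frequencies.
    """
    eps = 1e-12
    per_term = y_true * np.log(y_pred + eps) / phi
    return -per_term.sum() / y_true.shape[0]

def total_loss(x_true, x_pred, y_true, y_pred, mu, phi, omega1=0.5):
    # Eq. (7.17): weighted combination of the two losses.
    return omega1 * balanced_bce(x_true, x_pred, mu) + \
           (1.0 - omega1) * balanced_cce(y_true, y_pred, phi)
```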
7.5 Experimental Results

In this section, we analyze the results obtained by the end-to-end graph CNN-based co-segmentation method described in this chapter, denoted as PMG, on images from the Internet dataset [105] and the PASCAL-VOC dataset [35]. We begin by listing the choice of parameters in the PMG method for both datasets. The feature similarity threshold t in Eq. (7.3) used for inter-graph node connections is chosen as t = 0.65. The weight ω_1 used to combine the losses L_bin and L_multi in Eq. (7.17) is set to 0.5. The network is initialized with Xavier's initialization [41]. For stochastic gradient descent, the learning rate and momentum are empirically fixed at 0.00001 and 0.9, respectively, and the mini-batch size n_samples is kept at 8 samples. Weight decay is set to 0.0004 for the Internet dataset and 0.0005 for the PASCAL-VOC dataset. The Jaccard similarity index (J) is used as the metric to quantitatively evaluate the performance of the methods. To evaluate a method on a set of m input images I_1, I_2, ..., I_m, all image pairs are first co-segmented, and the average Jaccard index over all such pairs is reported as the final co-segmentation accuracy for the given set (a short sketch of this metric is given below). Next, we describe the results on the two datasets.
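The evaluation metric itself is straightforward; a minimal sketch of the Jaccard index for a single mask pair follows.

```python
import numpy as np

def jaccard_index(pred_mask, gt_mask):
    """Intersection over union of two binary foreground masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                 # both masks empty: define J = 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```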
7.5.1 Internet Dataset

The Internet dataset [105] has three classes: airplane, car and horse, with a subset of 100 images per class used for experiments, as considered in relevant works [99]. However, there is no standard training-test split available for this dataset. Hence, the dataset is randomly split in a 3:1:1 ratio into training, validation and testing sets, and image pairs from them are input to the network to compute the co-segmentation performance. This process is repeated 100 times, and the average accuracy computed over the 100 different test sets is reported. The comparative results of the methods PMG, DCC [56], UJD [105], EVK [24], GMR [99], DD [146] are shown in Table 7.1. The method PMG performs better than all other methods except DD. However, DD is constrained in two aspects. First, it requires precomputed object proposals from the images; the proposals containing common objects are then identified, and those proposals are segmented independently. Thus, co-segmentation is not performed by directly leveraging the inter-image feature similarities. Second, to compute correct object proposals, separate fine-tuning is required to obtain an optimal model. Therefore, this method may fail to co-segment images containing complex object structures with occlusion. On the contrary, PMG uses a global graph constructed from the superpixels of an image pair. Superpixel computation is easier than object proposal computation and does not require learning any separate model. Further, PMG has the flexibility of using any oversegmentation algorithm for computing superpixels.
Table 7.1 Comparison of Jaccard index (J) of the methods PMG, DCC, UJD, EVK, GMR, DD on the Internet dataset

Method   Car (J)   Horse (J)   Airplane (J)
DCC      37.1      30.1        15.3
UJD      64.4      31.6        55.8
EVK      64.8      33.3        40.3
GMR      66.8      58.1        56.3
DD       72.0      65.0        66.0
PMG      70.8      63.3        62.2
Table 7.2 Comparison of Jaccard index (J) of the methods CMP, GMR, PMG evaluated on the PASCAL-VOC dataset

Method   Jaccard index
CMP      0.46
GMR      0.52
PMG      0.56
Unlike DD, the use of the global graph and graph convolution helps the PMG model perform co-segmentation even if the common object in one image is occluded by some other object, since the model can use information from the common object in the other image of the input pair. Figure 7.3 visually demonstrates co-segmentation results on six image pairs. The first and third columns show three easy and three difficult image pairs, respectively, and the second and fourth columns show the corresponding co-segmentation results.
7.5.2 PASCAL-VOC Dataset

The PASCAL-VOC dataset [35] has 20 classes and is more challenging due to significant intra-class variations and the presence of background clutter. Here, the same experimental protocol as in the experiments with the Internet dataset is used. The accuracy over the test sets of all 20 classes is calculated, and the average is reported here. Table 7.2 shows comparative results of different methods. The method PMG performs well because it involves semantic information, which infuses a high degree of context into the computed features, thus making it robust to pose and appearance changes. Common objects obtained using the method PMG on four image pairs are shown in Fig. 7.4. A visual comparison of co-segmentation results obtained using the methods PMG and CMP [36] on four image pairs is shown in Fig. 7.5. In the TV image pair, the appearance of some part of the background is similar to the common foreground (TV). Hence, the method CMP incorrectly segments some part of the background as foreground.
Fig. 7.3 Co-segmentation results of the method PMG described in this chapter on images from airplane, horse and car classes of the Internet dataset. Columns 1, 3 show input image pairs and Columns 2, 4 show the corresponding co-segmented objects. Figure courtesy: [6]
In the case of the other three image pairs, the similarity in shape of the common foreground with other objects and the background results in incorrect segmentation. However, the method PMG learns features with strong semantic information and context, and hence performs well. This is demonstrated by the dog image pair, where the three dogs in image-2 are not homogeneous in color; their body color is a mixture of white and its complement. Yet, PMG co-segments them correctly.
Fig. 7.4 Co-segmentation results of the method PMG on images from the PASCAL-VOC dataset. Columns 1, 3, 5, 7 show input image pairs and Columns 2, 4, 6, 8 show the corresponding cosegmented objects. Figure courtesy: [6]
Fig. 7.5 Visual comparison of co-segmentation results on images from the PASCAL-VOC dataset. Columns 1, 4: input image pairs (TV, boat, dog, sofa classes). Columns 2, 5: results of CMP [36] and Columns 3, 6: results of PMG. The detected common objects are shown using red contours. Figure courtesy: [6]
Chapter 8
Conditional Siamese Convolutional Network
8.1 Introduction

In the previous chapter, we discussed and showed through different experiments the advantage of employing a graph neural network for performing co-segmentation across input images. The experimental results demonstrate the utility of a deep learning model to find critical features for the co-segmentation task and then use those features to extract common objects efficiently in an end-to-end manner. However, performing co-segmentation over a set of superpixels predetermined from the input image set, instead of over the images directly, may be bottlenecked by the accuracy of the superpixel extraction method. Furthermore, the inconsistent performance of a GCN across graphs with a variable number of nodes constrains the method to maintain the same number of superpixels across the input images, which may hinder the network's performance in some cases. In addition, that model is not able to handle outliers in the input set. In this chapter, we discuss a CNN-based end-to-end architecture for co-segmentation that operates on the input images directly and can handle outliers as well. Specifically, we consider co-segmentation of image pairs in the challenging setting where the images do not always contain common objects, similar to the setup in Chaps. 5 and 6. Further, the shape, pose and appearance of the common objects in the images may vary significantly, and the objects should still be distinguishable from the background. Humans can easily identify such variations. To capture these aspects in the co-segmentation algorithm, a metric learning approach is used to learn a latent feature space, which ensures that objects from the same class get projected very close to each other and objects from different classes get projected apart, preferably at least by a certain margin. The objective of the co-segmentation problem is: (i) given an input image pair I_1, I_2 and the corresponding ground-truth binary masks (M_1, M_2) indicating common object pixels, learn a model through end-to-end training of the co-segmentation network, and (ii) given a test image pair, detect whether they share a common object and, if present, estimate the masks M̂_1, M̂_2 using the learned model. The co-segmentation network architecture is shown in Fig. 8.1, and we describe it next.
Fig. 8.1 The deep convolution neural network architecture for co-segmentation. An input image pair (I1 , I2 ) is input to a pair of encoder-decoder networks, with shared weights (indicated by vertical dotted lines). Output feature maps of the ninth decoder layer pair (cT9 ) are vectorized ( f u1 , f u2 ) by global average pooling (GAP) and input to a siamese metric learning network. Its output vectors f s1 and f s2 are then concatenated and input to the decision network that predicts ( yˆr ) the presence or absence of any common object in the input image pair. Red dotted line arrows show backpropagation for L1 , and green dotted line arrows show backpropagation for L2 and L3 . The complete network is trained for positive samples
Fig. 8.1 (Continued): The deep convolution neural network architecture for co-segmentation of an image pair without any common object. Green dotted line arrows show backpropagation for L2 and L3 . For negative samples, the decoder network after cT9 is not trained
Fig. 8.1 (Continued): Details of the encoder-decoder network. 64-conv indicates convolution using 64 filters followed by ReLU. MP stands for max-pooling with a kernel of size 2 × 2. NNI stands for nearest neighbor interpolation. Deconvolution is performed using convolution-transpose operation (convT). Figure courtesy: [7]
8.2 Co-segmentation Framework

The co-segmentation model consists of a siamese convolution–deconvolution network, or encoder-decoder synonymously (Sect. 8.2.1), a siamese metric learning network (Sect. 8.2.2) and a decision network (Sect. 8.2.3). We first highlight the key aspects of these modules, and then explain them in detail in subsequent sections.

• The siamese encoder network takes an image pair as input and produces intermediate convolutional features.
• These features are passed to the metric learning network, which learns an optimal latent feature space where objects belonging to the same class are closer and objects from different classes are well separated. This enables the convolutional features learned by the encoder to segment the common objects accurately.
• To make the model class-agnostic, no semantic class label is used for metric learning.
• The decision network uses the features learned by the metric learning network and predicts the presence or absence of a common object in the image pair.
• The siamese decoder network conditionally maps the encoder features into the corresponding co-segmentation masks. These masks are used to extract the common objects.
• The metric learning network and the decision network together condition the encoder-decoder network to perform accurate co-segmentation depending on the presence of a common object in the input image pair.
8.2.1 Conditional Siamese Encoder-Decoder Network

The encoder-decoder network has two parts: a feature encoder and a co-segmentation mask decoder. The siamese encoder consists of two identical feature extraction CNNs with shared parameters and is built upon the VGG-16 architecture. The feature extractor network is composed of 5 encoder blocks containing 2, 2, 3, 3, 3 convolutional layers (conv), respectively. Each block also has one max-pooling (MP) layer, which makes the extracted features spatially invariant and contextual. Provided with an image pair {I_i}_{i=1}^2 ∈ R^{N×N} (with N = 224, as required by VGG-16) as input, this network outputs high-level semantic feature maps f_1, f_2 ∈ R^{512×7×7}. The siamese decoder block, which follows the encoder, takes the semantic feature map pair f_1, f_2 produced by the encoder as input and produces foreground masks of the common objects through two identical deconvolution networks. The deconvolution network is formed by five spatial interpolation layers with 13 transposed convolutional layers (convT). The max-pooling operations in the encoder reduce the spatial resolution of the convolutional feature maps; hence, the use of 5 MP layers makes the encoder features very coarse. The decoder network transforms these low resolution feature maps into the co-segmentation masks. The feature maps are upsampled using nearest neighbor interpolation (NNI).
Fig. 8.2 Testing the co-segmentation model
Although NNI is fast compared to bilinear or bicubic interpolation, it introduces blurring and spatial artifacts. Therefore, each NNI is followed by a transposed convolution operation to reduce these artifacts. Every transposed convolutional layer except the final one is followed by a ReLU layer. The final deconvolution layer produces two single-channel maps of size 224 × 224, which are converted to co-segmentation masks M̂_1, M̂_2 by a sigmoid function. At test time, the output layer of this network is gated by the binary output of the decision network, making the siamese convolutional network conditional, as shown in Fig. 8.2.
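To illustrate the decoder structure described above, the PyTorch sketch below upsamples the 512 × 7 × 7 encoder features back to a 224 × 224 mask using five nearest-neighbor upsampling stages, each refined by transposed convolutions and ReLU. The exact channel counts and the number of convT layers per stage are not fully specified in the text, so the values here are illustrative.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Nearest-neighbor upsampling followed by convT refinement and ReLU."""
    def __init__(self, in_ch, out_ch, num_convt=2):
        super().__init__()
        layers = [nn.Upsample(scale_factor=2, mode="nearest")]
        ch = in_ch
        for _ in range(num_convt):
            layers += [nn.ConvTranspose2d(ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU()]
            ch = out_ch
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class MaskDecoder(nn.Module):
    """Maps 512x7x7 encoder features to a 1x224x224 mask via five up blocks."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            UpBlock(512, 512), UpBlock(512, 256), UpBlock(256, 128),
            UpBlock(128, 64), UpBlock(64, 32),
        )
        self.head = nn.ConvTranspose2d(32, 1, kernel_size=3, padding=1)

    def forward(self, f):                 # f: (batch, 512, 7, 7)
        return torch.sigmoid(self.head(self.blocks(f)))  # (batch, 1, 224, 224)
```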
8.2.2 Siamese Metric Learning Network

This network helps to learn features that better discriminate objects from different classes. It may be noted that in Chap. 6, we discussed an LDA-based approach with the same goal in an unsupervised setting. Here, in the fully supervised framework, this is achieved through metric learning, where the images are projected to a feature space such that the distance between two images containing objects of the same class is reduced and that for an image pair without any commonality is increased. This projection is performed using a series of two fully connected layers with dimensions 128 and 64, respectively. The first layer has a ReLU nonlinearity, and the second layer does not have any nonlinearity. These layers together constitute the metric learning network. It takes as input f_{u1}, f_{u2} ∈ R^{256} from the siamese encoder-decoder network and outputs a pair of feature vectors f_{s1}, f_{s2} ∈ R^{64} that represent the objects in the learned latent space. One can use the final convolutional layer or subsequent deconvolution layers as the source. We will analyze the choice of the source in Sect. 8.3.4 and show through experiments that the deconvolution layers at the middle of the decoder network are ideal, the reason being that they capture sufficient object information. Hence, the output of the ninth deconvolution layer (256 × 56 × 56) is used to compute f_{u1} and f_{u2}. Specifically, global average pooling (GAP) is performed over each channel of the deconvolution layer output to get the vectors f_{u1}, f_{u2}. The network is trained using the standard triplet loss (Sect. 8.2.4). Thus, during backpropagation, the first nine deconvolution layers of the decoder and all thirteen convolutional layers of the encoder also get updated. This infuses the commonality of the image pair into the encoder features, which leads to better masks. It is worth mentioning that f_{s1}, f_{s2} are not used for mask generation because the GAP operation completely destroys the spatial information while computing f_{u1}, f_{u2}. Hence, the encoder features, i.e., the output of the final convolutional layer, are passed to the decoder for obtaining the common object masks. However, f_{s1}, f_{s2} are utilized in the decision network, as described next.
8.2.3 Decision Network

In this chapter, the co-segmentation task is not limited to extracting common objects, if present. We are also required to detect cases where there is no common object in the image pair, and this is achieved using a decision network. This network uses the output of the metric learning network, f_{s1}, f_{s2}, to predict the occurrence of a common object. It takes the 128-dimensional vector obtained by concatenating f_{s1} and f_{s2} as input, and passes it through a series of two fully connected layers with dimensions 32 and 1, respectively. The first layer is associated with a ReLU nonlinearity. The second layer is associated with a sigmoid function that gives the probability of the presence of a common object in the image pair. During the test stage, this probability is thresholded to obtain a binary label. If the decision network predicts a binary label 1, the output of the siamese decoder network gives us the co-segmentation masks.
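The metric learning and decision networks are small; the following PyTorch sketch mirrors the described layer sizes, assuming the 256 × 56 × 56 output of the ninth deconvolution layer as input. Class and variable names are ours.

```python
import torch
import torch.nn as nn

class MetricHead(nn.Module):
    """GAP over cT9 features, then FC 128 -> FC 64 (no final nonlinearity)."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(128, 64)

    def forward(self, feat):                        # feat: (batch, 256, 56, 56)
        f_u = feat.mean(dim=(2, 3))                 # global average pooling -> (batch, 256)
        return self.fc2(torch.relu(self.fc1(f_u))) # f_s in the learned metric space

class DecisionHead(nn.Module):
    """Concatenated (f_s1, f_s2) -> FC 32 -> FC 1 -> sigmoid probability."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 32)
        self.fc2 = nn.Linear(32, 1)

    def forward(self, f_s1, f_s2):
        x = torch.cat([f_s1, f_s2], dim=1)          # (batch, 128)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x)))).squeeze(1)
```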
8.2.4 Loss Function

To train the co-segmentation network, we need to minimize the losses incurred in the siamese encoder-decoder network, the siamese metric learning network and the decision network. We define these losses next. Let (I_r^a, I_r^p) and (I_r^a, I_r^n) denote a positive pair and a negative pair of images, respectively, where I_r^a is an anchor, I_r^p belongs to the same class as the anchor, and I_r^n belongs to a different class. For the positive sample, let (M_r^a, M_r^p) and (M̂_r^a, M̂_r^p) be the corresponding pair of ground-truth masks and of predicted masks obtained from the decoder, respectively. We do not require the same for negative samples because of the absence of any common object in them. The pixelwise binary cross-entropy loss for training the encoder-decoder network is given as:
$$
\mathcal{L}_1 = -\sum_{r=1}^{n_{\text{samples}}} \sum_{i \in \{a, p\}} \sum_{j=1}^{N} \sum_{k=1}^{N} \Big\{ M_r^i(j,k) \log\big(\hat{M}_r^i(j,k)\big) + \big(1 - M_r^i(j,k)\big) \log\big(1 - \hat{M}_r^i(j,k)\big) \Big\}, \qquad (8.1)
$$
where n_samples is the total number of samples (positive and negative image pairs) in a mini-batch, and M̂(j, k) and M(j, k) denote the values of the (j, k)-th pixel of the predicted and true masks, respectively. It may be noted that this loss does not involve the negative samples, since the final part of the decoder is not trained for them; this will be elaborated in Sect. 8.2.5. The triplet loss is used to train the metric learning network, and it is given as:
$$
\mathcal{L}_2 = \sum_{r=1}^{n_{\text{samples}}} \max\Big(0,\ \big\| f_s(I_r^a) - f_s(I_r^p) \big\|^2 - \big\| f_s(I_r^a) - f_s(I_r^n) \big\|^2 + \alpha \Big), \qquad (8.2)
$$
where α is a scalar-valued margin. Therefore, by minimizing this loss, the network implicitly learns a feature space where the distance between I_r^a and I_r^p reduces, and that between I_r^a and I_r^n increases by at least the margin α. The decision network is trained using a binary cross-entropy loss:
$$
\mathcal{L}_3 = -\sum_{r=1}^{n_{\text{samples}}} \big\{ y_r \log \hat{y}_r + (1 - y_r) \log(1 - \hat{y}_r) \big\}, \qquad (8.3)
$$
where y_r = 1 for a positive pair (I_r^a, I_r^p) and y_r = 0 for a negative pair (I_r^a, I_r^n), and ŷ_r ∈ [0, 1] is the predicted label obtained from its sigmoid layer. The overall loss is computed as:

$$
\mathcal{L}_{\text{final}} = \omega_1 \mathcal{L}_1 + \omega_2 \mathcal{L}_2 + \omega_3 \mathcal{L}_3, \qquad (8.4)
$$

where ω_1, ω_2, ω_3 are the corresponding weights with ω_1 + ω_2 + ω_3 = 1.
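The three losses of Eqs. (8.1)–(8.4) map directly onto standard PyTorch primitives, as in the sketch below; the batching convention (masks stacked along the first dimension, metric features of shape (batch, 64)) is an assumption of this sketch, and the weight settings follow the training strategy described later in this chapter.

```python
import torch
import torch.nn.functional as F

def mask_loss(pred_masks, gt_masks):
    # Eq. (8.1): pixelwise BCE over the anchor/positive masks of positive pairs.
    # pred_masks, gt_masks: (num_masks, 224, 224) with values in [0, 1] / {0, 1}.
    return F.binary_cross_entropy(pred_masks, gt_masks, reduction="sum")

def triplet_loss(f_anchor, f_pos, f_neg, alpha=1.0):
    # Eq. (8.2): squared-distance triplet loss with margin alpha.
    d_pos = ((f_anchor - f_pos) ** 2).sum(dim=1)
    d_neg = ((f_anchor - f_neg) ** 2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + alpha, min=0.0).sum()

def decision_loss(y_pred, y_true):
    # Eq. (8.3): BCE on the common-object presence prediction.
    return F.binary_cross_entropy(y_pred, y_true, reduction="sum")

def overall_loss(l1, l2, l3, positive_batch=True):
    # Eq. (8.4); for negative samples the mask loss is switched off (w1 = 0).
    if positive_batch:
        w1, w2, w3 = 1/3, 1/3, 1/3
    else:
        w1, w2, w3 = 0.0, 0.5, 0.5
    return w1 * l1 + w2 * l2 + w3 * l3
```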
8.2.5 Training Strategy

As mentioned earlier, at the time of training, the metric learning part guides the encoder-decoder network to reduce the intra-class object distance and increase the inter-class object distance. The binary ground-truth masks guide the encoder-decoder network to differentiate common objects from the background based on the learned features. Thus, a positive sample (an image pair with a common object) should predict the common foreground at the output, whereas a negative sample (an image pair with no common object) should predict the absence of any common object by producing null masks. However, forcing the same network to produce an object mask or a null mask from a particular image, depending on whether that image is part of a positive or a negative sample, hinders learning. Hence, the learning strategy should differ between the two cases. In the case of positive samples, the whole network is trained by backpropagating all three losses. For negative samples, however, only the metric learning network, the decision network, the encoder and a certain part of the decoder network are trained, i.e., the part of the decoder responsible for mask generation is not trained. This is achieved by backpropagating only the losses L_2 and L_3. Thus, the predicted masks are not utilized in training, and they are ignored. This is motivated by the fact that, for negative examples, the decoder network is not required to produce any mask at all, since the decision network signals the absence of any common object. Hence, the role of the decision network is to reduce the overall difficulty of training the deconvolution layers by making them produce object masks only for positive samples. This helps to train the network properly and also improves performance, as shown in Fig. 8.6. To summarize, the entire network is trained for positive samples (y_r = 1), and only a part of the network is trained for negative samples (y_r = 0), thus making the mask estimation of the siamese network a conditioned one.
During testing, the decision network predicts the presence or absence of a common object in the input image pair. If the prediction is ŷ_r = 1, the output of the siamese decoder network provides the corresponding co-segmentation masks. If the prediction is ŷ_r = 0, the decoder output is not considered, and it is concluded that the image pair does not have any common object. This is implemented in the architecture by gating the decoder outputs through ŷ_r, as shown in Fig. 8.2. Thus, the prediction ŷ_r = 0 yields null masks. We show experimental results for negative samples in Fig. 8.6.
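The conditional behavior at training and test time can be summarized by two short helper functions; they assume the individual losses and network outputs are already computed as PyTorch tensors, and all names are illustrative.

```python
import torch

def training_loss(l1_mask, l2_triplet, l3_decision, is_positive):
    """Combine the losses according to the conditional strategy of Sect. 8.2.5."""
    if is_positive:
        # Positive pair: all three losses, equally weighted (Eq. 8.4).
        return (l1_mask + l2_triplet + l3_decision) / 3.0
    # Negative pair: the mask loss is dropped, so the mask-producing part of
    # the decoder receives no gradient from this sample.
    return 0.5 * l2_triplet + 0.5 * l3_decision

def gated_masks(mask1, mask2, y_hat, threshold=0.5):
    """Test-time gating: return null masks when no common object is predicted."""
    if float(y_hat) >= threshold:            # decision network fires: keep masks
        return mask1, mask2
    return torch.zeros_like(mask1), torch.zeros_like(mask2)
```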
8.3 Experimental Results

In this section, we analyze the results obtained by the end-to-end siamese encoder-decoder-based co-segmentation method described in this chapter, denoted as PMS. The co-segmentation performance is quantitatively evaluated using precision and the Jaccard index. Precision is the percentage of correctly segmented pixels of both the foreground and the background. It should be noted that this precision measure is different from the one discussed in Chap. 4, which does not consider the background. The Jaccard index is the intersection over union of the resulting co-segmentation masks and the ground-truth common foreground masks. As in Chap. 7, to evaluate the performance on a set of m input images I_1, I_2, ..., I_m, co-segmentation is performed on all image pairs, and the average precision and Jaccard index are computed for the given set. Next, we describe the results on the PASCAL-VOC dataset, the Internet dataset and the MSRC dataset. First, we specify the various parameters used in obtaining the co-segmentation results reported in this chapter using the PMS network. It is initialized with the weights of the VGG-16 network trained on the ImageNet dataset for the image classification task. Stochastic gradient descent is used as the optimizer. The learning rate and momentum are fixed at 0.00001 and 0.9, respectively, for all three datasets. For the PASCAL-VOC and MSRC datasets, the weight decay is set to 0.0004, and for the Internet dataset, it is set to 0.0005. At the time of training, the strategy of Schroff et al. [111] is followed for generating samples. For positive samples, the weights in Eq. (8.4) are set as ω_1 = ω_2 = ω_3 = 1/3 to give them equal importance. As explained in Sect. 8.2.5, the mask loss L_1 is not backpropagated for negative samples; hence, the weights are set as ω_1 = 0, ω_2 = ω_3 = 0.5. Due to memory constraints, the batch size is limited to 3 in the experiments, where each input sample in a batch is a pair of input images, either positive or negative. All input images are resized to 224 × 224, and the margin α in the triplet loss L_2 is set to 1.
8.3.1 PASCAL-VOC Dataset

This dataset has 20 classes. To prepare the dataset for training, it is randomly split in the ratio of 3:1:1 into training, validation and testing sets.
Table 8.1 Comparison of precision (P) and Jaccard index (J) of the methods CMP [36], GMR [99], ANP [134], CAT [50], PMS on the PASCAL-VOC dataset

Method   Precision (P)   Jaccard index (J)
CMP      84.0            0.46
ANP^a    84.3            0.52
GMR^a    89.0            0.52
CAT^a    91.0            0.60
PMS^a    95.4            0.68

^a Denotes deep learning-based methods
Since there is no standard split available, this splitting process is repeated 100 times, and the average performance is reported here. Table 8.1 shows comparative results of the methods PMS, CMP [36], GMR [99], ANP [134], CAT [50]. The PMS method performs very well because it involves convolution–deconvolution with pooling, which brings a high degree of context into feature computation. Furthermore, the metric learning network learns a latent feature space where common objects come closer irrespective of their pose and appearance variations, making the method robust to such changes. This can also be observed in Fig. 8.3, where PMS segments common objects even when they have significant pose and appearance variations (Rows 3, 4).
Fig. 8.3 Co-segmentation results on the PASCAL-VOC dataset. In each row, Columns 1, 3 show an input image pair, and Columns 2, 4 show the corresponding co-segmented objects obtained using PMS. Figure courtesy: [7]
Table 8.2 Comparison of precision (P) and Jaccard index (J) of the methods DCC [56], UJD [105], EVK [24], GMR [99], CAT [50], DD [146], DOC [72], CSA [20], PMS on the Internet dataset

Method   C (P)   C (J)   H (P)   H (J)   A (P)   A (J)   M (P)   M (J)
DCC      59.2    0.37    64.2    0.30    47.5    0.15    57.0    0.28
UJD      85.4    0.64    82.8    0.32    88.0    0.56    82.7    0.51
EVK      87.6    0.65    89.3    0.33    90.0    0.40    89.0    0.46
*GMR     88.5    0.67    85.3    0.58    91.0    0.56    89.6    0.60
*CAT     93.0    0.82    89.7    0.67    94.2    0.61    92.3    0.70
*DD      90.4    0.72    90.2    0.65    92.6    0.66    91.0    0.68
*DOC     94.0    0.83    91.4    0.65    94.6    0.64    93.3    0.70
*CSA     –       0.80    –       0.71    –       0.71    –       0.73
*PMS     95.2    0.87    96.2    0.72    96.7    0.71    96.1    0.77
C, H and A stand for the car, horse and airplane classes. M denotes the mean value. *Denotes deep learning-based methods

Table 8.3 Precision (P) and Jaccard index (J) of PMS trained with the PASCAL-VOC dataset and evaluated on the Internet dataset

C (P)   C (J)   H (P)   H (J)   A (P)   A (J)   M (P)   M (J)
94.6    0.85    93.0    0.67    94.3    0.65    94.0    0.72
8.3.2 Internet Dataset

This dataset has three classes: airplane, horse and car. Comparative results of different methods are shown in Table 8.2. The PMS method described in this chapter has some similarities with the method DOC [72], but the use of metric learning with a decision network in PMS, as opposed to the mutual correlator used in DOC, makes PMS six times faster per epoch. This, along with the conditional siamese encoder-decoder, significantly improves co-segmentation performance over DOC. We show visual results of PMS in Fig. 8.4. Further, PMS is also evaluated by training the network on the PASCAL-VOC dataset and testing on the Internet dataset. The results are shown in Table 8.3.
8.3.3 MSRC Dataset

To evaluate the PMS method on the MSRC dataset, a subset containing the classes cow, plane, car, sheep, bird, cat and dog is chosen, as has been widely done in relevant works to evaluate co-segmentation performance [105, 130, 135]. Each class has 10 images, and there is a single object in each image. The objects from each class have color, pose and scale variations in different images. The experimental protocol is the same as that of the PASCAL-VOC dataset.
Fig. 8.4 Co-segmentation results on Internet dataset. In each row, Columns 1, 3 and 5, 7 show two input image pairs, and Columns 2, 4 and 6, 8 show the corresponding co-segmented objects obtained using PMS. Figure courtesy: [7]
Fig. 8.4 (Continued): Co-segmentation results on Internet dataset. In each row, Columns 1, 3 and 5, 7 show two input image pairs, and Columns 2, 4 and 6, 8 show the corresponding co-segmented objects obtained using PMS
Table 8.4 Comparison of precision and Jaccard index of the methods in [130], UJD [105], CMP [36, 135], DOC [72], CSA [20], PMS on the MSRC dataset

Method   Precision   Jaccard index
[130]    90.0        0.71
UJD      92.2        0.75
CMP      92.0        0.77
[135]    92.2        –
*DOC     94.4        0.80
*CSA     95.3        0.77
*PMS     96.3        0.85

*Denotes deep learning-based methods
Comparative results of different methods are shown in Table 8.4. The PMS model was trained on the PASCAL-VOC dataset, since the MSRC dataset does not have a sufficient number of samples for training. Yet it outperforms the other methods. Figure 8.5 shows visual results obtained using PMS.
8.3.4 Ablation Study

In this section, we analyze (i) the role of the siamese metric learning network and the decision network in PMS, and (ii) the choice of the layer in the encoder-decoder network that acts as the input to the metric learning network. To analyze the first, a baseline model PMS-base is created that has only the siamese encoder-decoder network in the architecture. Further, to fuse information from both images, the encoder features f_1, f_2 ∈ R^{512×7×7} are concatenated along the channel dimension to obtain feature maps [f_1; f_2], [f_2; f_1] ∈ R^{1024×7×7} for I_1 and I_2, respectively, and they are passed to the corresponding decoder networks for mask generation. In the absence of the decision network in PMS-base, the entire siamese encoder-decoder network is trained for negative samples as well, using null masks. The performance of PMS-base is compared with the complete model PMS on different datasets in Table 8.5. The improved performance of PMS over PMS-base is also visually illustrated in Fig. 8.6. Different-class objects in the image pairs (Rows 1–3) are incorrectly detected as common objects by PMS-base, whereas PMS correctly infers that there is no common object in these image pairs by predicting null masks. The image pair in Row 4 contains objects from the same class; however, the objects have different pose and size. In the absence of the metric learning network, PMS-base performs poorly. This is due to the absence of any explicit object similarity learning module, which is essential for co-segmentation. Hence, the inclusion of both subnetworks in the PMS architecture, along with the novel training strategy of partially training the decoder for negative samples, is justified.
Fig. 8.5 Co-segmentation results on MSRC dataset. In each row, Columns 1, 3 and 5, 7 show two input image pairs, and Columns 2, 4 and 6, 8 show the corresponding co-segmented objects obtained using PMS. Figure courtesy: [7]
Fig. 8.6 Ablation study using images from the Internet dataset. Columns 1, 2 show input image pairs, Columns 3, 4 show the objects obtained (incorrectly) using PMS-base. Columns 5, 6 show that PMS correctly identifies the absence of common objects (Rows 1–3), indicated by empty boxes. Figure courtesy: [7]
Table 8.5 Comparison of Jaccard index (J) of the PMS model with the baseline model on different datasets

Dataset       PMS-base   PMS
PASCAL-VOC    0.47       0.68
Internet      0.61       0.77
MSRC          0.63       0.85
Table 8.6 Comparison of Jaccard index (J) of the PMS model while connecting the input of the metric learning network to different layers of the decoder network

Dataset\Layer   c13    cT3    cT6    cT9    cT11
PASCAL-VOC      0.63   0.65   0.66   0.68   0.64
Internet        0.67   0.68   0.68   0.72   0.65
Next, we analyze the input to the metric learning network. Table 8.6 shows the performance of the PMS model when choosing the output of different layers of the siamese encoder-decoder network:

• c13: the final layer of the siamese encoder network,
• cT3, cT6, cT9 and cT11: the third, sixth, ninth and eleventh deconvolution layers, respectively.

The model performs best for cT9. The reasons, we believe, are that (a) sufficient object information is available at cT9, which may not be available in the lower layers c13, cT3, cT6, and (b) the higher deconvolutional layers (cT11 and cT13) are dedicated to producing co-segmentation masks. Hence, cT9 is the optimal layer. To summarize, the co-segmentation method described in this chapter can extract common objects from an image pair, if present, even in the presence of variations in the datasets. In the next chapter, we will discuss a method that co-segments multiple images simultaneously.
Chapter 9
Few-shot Learning for Co-segmentation
9.1 Introduction

Convolutional neural network (CNN) based models automatically compute suitable features for co-segmentation with varying levels of supervision [20, 67, 72, 146]. To extract the common foreground, all images in the group are used to leverage shared features and recognize commonality across the group, as shown in Fig. 9.1a, b. However, all these methods require a large number of training samples for good feature computation and mask generation of the common foreground. In many real scenarios, we are presented with datasets containing only a few labeled samples (Fig. 9.2a). Annotating input images in the form of a mask for the common foreground is also a very tedious task. Hence, in this chapter, image co-segmentation is investigated in a few-shot setting. This implies performing the co-segmentation task over a set of a variable number of input images (called the co-seg set) by relying on the guidance provided by a set of images (called the guide set) to learn the commonality of features, as shown in Fig. 9.1c. We describe a method that learns commonality corresponding to the foreground of interest without any semantic information. For example, as shown in Fig. 9.3, commonality corresponding to the foreground horse is learned from the guide set and exploited to segment the common foreground of interest from the co-seg set images. To solve co-segmentation in a few-shot setting, a meta learning-based training method can be adopted, where an end-to-end model learns the concept of co-segmentation using a set of episodes sampled from a larger dataset and subsequently adapts its knowledge to a smaller target dataset of new classes. Each episode consists of a guide set and a co-seg set that together mimic the few-shot scenario encountered in the smaller dataset. The guide set learns commonality using a feature integration technique and associates it with the co-seg set individuals with the help of a variational encoder and an attention mechanism to segment the common foreground.
Fig. 9.1 a, b Traditional supervised co-segmentation methods use a large training set to learn to extract common objects. c The few-shot setting requires only a smaller training set (guide set) to perform co-segmentation of the test image set (co-seg set)
Thus, this encoder, along with the attention mechanism, helps to model the common foreground, while the intelligent feature integration method boosts the quality of its feature. To improve the generalization capacity of the model, it is trained only using the co-segmentation loss computed over the co-seg set.
9.2 Co-segmentation Framework n n Given a dataset Dtarget = {( xit , yit )}i=1 ∪ { x uj }rj=1 containing a small set {( xit , yit )}i=1 of annotated training images and corresponding ground-truth masks, the objective y2u , . . . , yru } for the of co-segmentation is to estimate common object masks { y1u , u u u x2 , . . . , xr }. A meta-learning approach unlabeled target samples or the test set { x1 , for solving this problem is explained next.
Fig. 9.2 The figure illustrates traditional multi-image co-segmentation and the few-shot multi-image co-segmentation described in this chapter. a Due to fewer training samples, the traditional method fails to obtain accurate co-segmentation results. b Few-shot training using a cross-encoder (DVICE) performs co-segmentation more accurately. Figure courtesy: [5]
9.2.1 Class Agnostic Meta-Learning

Inspired by [117], few-shot learning for co-segmentation is defined as follows. Let us consider two datasets: a base set Dbase with a large number of annotated samples and a target set Dtarget with a small number of annotated samples for co-segmentation. The model is iteratively trained over Dbase using a series of episodes consisting of a guide set and a co-seg set. Each guide set and co-seg set is designed to mimic the characteristics of the small training and test sets of Dtarget, as shown in Fig. 9.4. The role of the guide set and the co-seg set is similar to the
support set and query set, respectively, typically encountered in contemporary few-shot learning [91, 115, 117]. However, unlike the support set, the guide set here does not rely on semantic class labels, and it is even tolerant to the presence of outliers while guiding the network to learn and perform foreground extraction over the co-seg set as shown in Fig. 9.3. The guide set discussed in this chapter includes samples (images) that contain a dominant class and samples (outlier images) that contain other non-dominant classes, which we call positive and negative samples, respectively. The positive samples share a common foreground which is the same as the foreground of interest that is to be extracted from the co-seg set. Due to the lack of sufficient training samples in the target dataset Dtarget, meta-learning is used to learn and extract a transferable embedding, thus facilitating better foreground extraction on the target dataset. An episodic training scheme for this purpose is described next. A few-shot learning strategy is adopted to improve co-segmentation performance on the smaller target dataset Dtarget, on which standard training leads to over-fitting. Hence, a larger dataset Dbase is developed for the co-segmentation task such that Dtarget ∩ Dbase = ∅. In order to simulate the training scenario of Dtarget, multiple episodes over Dbase are created, and an episodic training scheme is developed so that the model learns to handle the co-segmentation task with few training samples without overfitting. Each episode consists of a guide set (G) and a co-seg set (C) such that any operation over set C is directed by set G, which provides the information of the common object to C over which co-segmentation is performed. To accommodate a practical scenario, G is allowed to contain negative samples (outliers). Thus, in each episode, the guide set is designed as $\mathcal{G} = \{P^g \cup N^g\} = \{(x_1^g, y_1^g), \ldots, (x_k^g, y_k^g)\}$, consisting of n randomly selected positive samples $\{P^g\}$ and k − n randomly selected negative samples $\{N^g\}$. The co-seg set is designed as $\mathcal{C} = \{(x_1^c, y_1^c), \ldots, (x_m^c, y_m^c)\}$. Here, the cardinality of $P^g$ is chosen as n because the number of annotated samples available in Dtarget is also n. Next, a prototype $O^g$ for the common object present in the guide set is obtained from the encoder features of the images in it as:

$$O^g = \frac{1}{|\mathcal{G}|} \sum_{i=1}^{k} \text{ChAM}\big(E(x_i^g)\big) \quad (9.1)$$
where the encoder E is a part of the directed variational inference cross-encoder, to be explained in detail in Sect. 9.2.2, and ChAM is a channel attention mechanism, to be described in Sects. 9.3.1 and 9.3.2. The averaging operation removes the influence of outliers and makes $O^g$ robust. The ChAM module is used to focus on the semantically meaningful part of the image by exploiting the inter-channel relationship of features. An embedding of each co-seg set sample $x_j^c$ is also computed as $z_j^c = E(x_j^c)$. Then its attention $\text{ChAM}(z_j^c)$ is concatenated channel-wise with $O^g$ and passed to the decoder for obtaining the co-segmentation mask $\hat{y}_j^c$. Thus, the decoder implicitly checks the similarity between the common object prototype and objects present in that image ($x_j^c$), and estimates $\hat{y}_j^c$ accordingly. The spatial importance of each pixel at different layers of the encoder is captured by the spatial attention module (SpAM),
Fig. 9.3 Illustration of the co-segmentation architecture described in this chapter. [Top-right] The spatial attention module (SpAM). [Bottom-right] The channel attention module (ChAM). [Bottom-left] A guide set containing horse as the majority class. [Top-left] The architecture of the cross-encoder consisting of the SpAM and ChAM modules, which learns the commonality from the guide set and learns to segment the common objects. During test time, the method extracts the common objects from an outlier contaminated co-seg set. Here for simplicity, only two images from the co-seg set are shown. Figure courtesy: [5]
Fig. 9.4 Sample image sets used for training and testing. a The base dataset Dbase used during training is a large set of labeled images. b The target dataset Dtarget used during testing has fewer labeled images. The method is evaluated on the unlabeled sets in Dtarget. Figure courtesy: [5]
Fig. 9.4 (Continued): c Guide sets and co-seg sets are sampled from the base set Dbase for training the cross-encoder (DVICE). d During the evaluation phase, guide sets and co-seg sets are sampled from the labeled and unlabeled sets of the target set Dtarget, respectively, and the common objects in the co-seg sets are extracted. Figure courtesy: [5]
to be described in Sect. 9.3.3, and it is used to aid the decoder in localizing the common foreground. While training for common foreground extraction, this framework relies only on the assumption that there exists some degree of similarity between the guide set and the co-seg set. Thus, no semantic class information is used during training, as can be seen in Fig. 9.3, and hence this few-shot co-segmentation strategy is completely class agnostic.
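To make the episodic guidance concrete, the sketch below (in PyTorch-style Python) shows how the guide-set prototype of Eq. (9.1) could be computed and paired with each co-seg image. The callables encoder, cham and decoder are hypothetical stand-ins for the DVICE blocks of Fig. 9.3, and treating ChAM's output as channel-attended feature maps is one plausible reading of the description above; this is an illustrative sketch, not the authors' released code.

import torch

def guide_prototype(guide_images, encoder, cham):
    # guide_images: (k, 3, H, W) tensor holding the guide set G,
    # possibly containing outlier (negative) samples.
    feats = encoder(guide_images)            # latent embeddings z^g, shape (k, C, h, w)
    attended = cham(feats)                   # channel-attended features ChAM(E(x_i^g))
    return attended.mean(dim=0)              # averaging over the guide set gives O^g

def coseg_forward(coseg_images, guide_images, encoder, cham, decoder):
    # Predict a mask for every co-seg image, conditioned on the prototype O^g.
    og = guide_prototype(guide_images, encoder, cham)        # (C, h, w)
    masks = []
    for x in coseg_images:                                   # each x: (3, H, W)
        z = encoder(x.unsqueeze(0))                          # embedding z^c
        zc = cham(z)                                         # attended embedding ChAM(z^c)
        cond = torch.cat([zc, og.unsqueeze(0)], dim=1)       # channel-wise concatenation
        masks.append(decoder(cond))                          # co-segmentation mask y^c
    return masks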
9.2.2 Directed Variational Inference Cross-Encoder

In this section, we describe the encoder-decoder model used for co-segmentation. It is built on the theory of variational inference to learn a continuous feature space over input images for better generalization. However, unlike the traditional variational auto-encoder setup, a cross-encoder is employed here for mapping an input image x to the corresponding mask y based on a directive $O^g$ obtained from the guide set $\mathcal{G}$. Thus, for any image $(x^c, y^c) \in \mathcal{C}$, randomly sampled from an underlying unknown joint distribution $p(y^c, x^c; \theta)$, the purpose of the encoder-decoder model is to estimate the parameters θ of the distribution from its likelihood, given $\mathcal{G}$. The joint probability needs to be maximized as:

$$\max_{\theta} p(y^c, x^c; \theta) = \max_{\theta} \int_{z^c} \int_{O^g} p(x^c, y^c, O^g, z^c)\, dO^g\, dz^c. \quad (9.2)$$
For simplicity, we drop θ from $p(y^c, x^c)$ in the subsequent analysis. The process of finding the distribution $p(y^c, x^c)$ implicitly depends upon the latent embedding of the sample $x^c$, which is $z^c$, and the common class prototype $O^g$ computed over $\mathcal{G}$. The crux of the variational approach here is to learn the conditional distribution $p(z^c|x^c)$ that can produce the output mask $y^c$, and thus maximize $p(y^c, x^c)$. Here, $O^g$ and $x^c$ are independent of each other as the sets $\mathcal{G}$ and $\mathcal{C}$ are generated randomly. Thus, Eq. (9.2) can be written as:

$$\begin{aligned} p(y^c, x^c) &= \int_{z^c} \int_{O^g} p(y^c|O^g, z^c)\, p(O^g|z^c, x^c)\, p(z^c|x^c)\, p(x^c)\, dO^g\, dz^c \\ &= \int_{z^c} \int_{O^g} p(y^c|O^g, z^c)\, p(O^g)\, p(z^c|x^c)\, p(x^c)\, dO^g\, dz^c \end{aligned} \quad (9.3)$$
Since $z^c$ is the latent embedding corresponding to $x^c$, they provide redundant information, and hence we refrain from using them together inside the joint probability. The main idea behind the variational method used here is to learn the distributions $q(O^g)$ and $q(z^c|x^c)$ that can approximate the distributions $p(O^g)$ and $p(z^c|x^c)$ over the latent variables, respectively. Therefore, Eq. (9.3) can be written as:

$$p(y^c, x^c) = \int_{z^c} \int_{O^g} \frac{p(y^c, O^g, z^c)}{q(O^g, z^c)}\, q(O^g)\, q(z^c|x^c)\, p(x^c)\, dO^g\, dz^c = p(x^c)\, \mathbb{E}_{(O^g, z^c) \sim q(O^g, z^c)}\!\left[\frac{p(y^c, O^g, z^c)}{q(O^g, z^c)}\right] \quad (9.4)$$
Taking the logarithm of Eq. (9.4) and applying Jensen's inequality, we have

$$\begin{aligned} \log p(y^c, x^c) &\geq \mathbb{E}_{(O^g, z^c) \sim q(O^g, z^c)}\!\left[\log \frac{p(y^c, O^g, z^c)}{q(O^g, z^c)}\right] \\ &\geq \mathbb{E}_{(O^g, z^c) \sim q(O^g, z^c)}\!\left[\log p(y^c|O^g, z^c)\right] - KL\!\left[q(O^g|\mathcal{G})\,\|\,p(O^g|\mathcal{G})\right] - KL\!\left[q(z^c|x^c)\,\|\,p(z^c|x^c)\right] \end{aligned} \quad (9.5)$$
From the evidence lower bound [11] obtained in Eq. (9.5), we observe that maximizing it will in turn maximize the target log-likelihood of generating a mask $y^c$ for a given input image $x^c$. Thus, unlike the traditional variational auto-encoders, here a continuous embedding $Q$ is learned, which guides mask generation. The terms $q(z^c|x^c)$ and $q(O^g)$ denote the mapping operation of encoders with shared weights, and $p(y^c|O^g, z^c)$ denotes the decoder part of the network that is responsible for generating the co-segmentation mask given the common object prototype $O^g$ and the latent embedding $z^c$. From Eq. (9.5), the loss (L) to train the network is formulated over the co-seg set as:

$$L = -\sum_{j=1}^{m} \sum_{(a,b)} \log p\big(y_j^c(a,b)\,\big|\,O^g, z_j^c\big) + KL\!\left[q(O^g|\mathcal{G})\,\|\,p(O^g|\mathcal{G})\right] + KL\!\left[q(z^c|x^c)\,\|\,p(z^c|x^c)\right], \quad (9.6)$$

where $y_j^c(a,b)$ is the prediction at pixel location (a, b). The network is trained over the larger dataset Dbase using multiple episodes until convergence. During test time, the labeled set $\{(x_i^t, y_i^t)\}_{i=1}^{n}$ of Dtarget is used as the guide set, and the unlabeled set $\{x_1^u, x_2^u, \ldots, x_r^u\}$ is used as the co-seg set. Hence, the final co-segmentation accuracy is examined over the corresponding co-seg set of Dtarget.
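A minimal sketch of how the loss of Eq. (9.6) could be assembled is given below, assuming the encoder outputs mean and log-variance parameters for Gaussian posteriors and that the priors on $O^g$ and $z^c$ are taken to be standard normal; the chapter does not fix these choices, so they are assumptions for illustration only.

import torch
import torch.nn.functional as F

def kl_standard_normal(mu, logvar):
    # Closed-form KL[ N(mu, diag(exp(logvar))) || N(0, I) ]
    return -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())

def episode_loss(pred_masks, gt_masks, mu_og, logvar_og, mu_zc, logvar_zc):
    # Pixel-wise negative log-likelihood over the co-seg set (first term of Eq. (9.6)),
    # with predictions assumed to be sigmoid probabilities in [0, 1].
    recon = F.binary_cross_entropy(pred_masks, gt_masks, reduction='sum')
    # KL regularizers on the prototype O^g and the co-seg embedding z^c,
    # assuming standard-normal priors (an assumption, not stated in the chapter).
    return recon + kl_standard_normal(mu_og, logvar_og) + kl_standard_normal(mu_zc, logvar_zc)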
9.3 Network Architecture

The overall network architecture is shown in Fig. 9.3. ResNet-50 forms the backbone of the encoder-decoder framework, which in combination with the channel and spatial attention modules forms the complete pipeline. The individual modules, as shown in Fig. 9.3, are explained briefly next.
9.3.1 Encoder-Decoder

The variational encoder-decoder is implemented using the ResNet-50 architecture at its backbone. The encoder (E) is just the ResNet-50 network with a final additional 1 × 1 convolutional layer. The decoder has five stages of upsampling and convolutional layers with skip connections through a spatial attention module as shown in Fig. 9.3. The encoder and decoder are connected through a channel attention module.
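A rough sketch of how such an encoder could be assembled in PyTorch is shown below; the output width of the extra 1 × 1 convolution (512 here) and the use of torchvision's ResNet-50 are assumptions for illustration.

import torch.nn as nn
import torchvision

def build_encoder(out_channels=512):
    # ResNet-50 backbone with its classification head removed, followed by the
    # extra 1 x 1 convolution mentioned in Sect. 9.3.1. The output width of 512
    # is an assumption for illustration.
    backbone = torchvision.models.resnet50()
    features = nn.Sequential(*list(backbone.children())[:-2])   # keep the conv stages only
    return nn.Sequential(features, nn.Conv2d(2048, out_channels, kernel_size=1))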
9.3.2 Channel Attention Module (ChAM)

Channel attention of an image x is computed from its embedding z = E(x) obtained from the encoder. First, z is compressed through pooling. To boost the representational power of the embedding, both global average-pooling and max-pooling are performed simultaneously. The output vectors from these operations, $z_{avg}$ and $z_{max}$, respectively, are then fed to a multi-layer perceptron (MLP) to produce the channel attention as:

$$\text{ChAM}(z) = \sigma\big(\text{MLP}(z_{avg}) + \text{MLP}(z_{max})\big), \quad (9.7)$$

where σ is the sigmoid function.
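The following sketch shows one possible PyTorch implementation of this channel attention; the reduction ratio of the shared MLP is an assumption, since the chapter does not specify it.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Channel attention of Eq. (9.7): a shared MLP over avg- and max-pooled embeddings.
    def __init__(self, channels, reduction=16):
        super().__init__()
        # reduction ratio 16 is a common default, not stated in the chapter
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, z):                                    # z: (B, C, H, W)
        z_avg = F.adaptive_avg_pool2d(z, 1).flatten(1)       # (B, C)
        z_max = F.adaptive_max_pool2d(z, 1).flatten(1)       # (B, C)
        attn = torch.sigmoid(self.mlp(z_avg) + self.mlp(z_max))  # (B, C)
        return attn.view(z.size(0), -1, 1, 1)                # broadcastable channel weights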
9.3.3 Spatial Attention Module (SpAM)

The inter-spatial relationship among features is utilized to generate the spatial attention map. To generate the attention map for a given feature $F \in \mathbb{R}^{C \times H \times W}$, both average-pooling and max-pooling are applied across the channels, resulting in $F_{avg} \in \mathbb{R}^{H \times W}$ and $F_{max} \in \mathbb{R}^{H \times W}$, respectively. Here, C, H and W denote the channels, height and width of the feature map. Then these are concatenated channel-wise to form $[F_{avg}; F_{max}]$. A convolution operation followed by a sigmoid function is performed over the concatenated features to get the spatial attention map $\text{SpAM}(F) \in \mathbb{R}^{H \times W}$ as:

$$\text{SpAM}(F) = \sigma\big(\text{conv}([F_{avg}; F_{max}])\big). \quad (9.8)$$
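A corresponding sketch of the spatial attention module is given below; the 7 × 7 kernel size is an assumption, since the chapter only states that a convolution is applied to the concatenated maps.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Spatial attention of Eq. (9.8): convolution + sigmoid over channel-pooled maps.
    def __init__(self, kernel_size=7):
        super().__init__()
        # kernel size 7 is an assumed, commonly used choice
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):                                         # feat: (B, C, H, W)
        f_avg = feat.mean(dim=1, keepdim=True)                       # average over channels
        f_max = torch.max(feat, dim=1, keepdim=True).values          # max over channels
        attn = torch.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))
        return attn                                                  # SpAM(F) in (B, 1, H, W)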
9.4 Experimental Results

In this section, we analyze the results obtained by the co-segmentation method described in this chapter, denoted as PMF. The PASCAL-VOC dataset is used as the base set Dbase over which the class-agnostic episodic training is performed as discussed in Sect. 9.2.1. It consists of 20 different classes with 50 samples per class [36], where samples within a class have significant appearance and pose variations. Following this, three different datasets have been considered as the target set Dtarget: the iCoseg dataset, the MSRC dataset, and the Internet dataset, over which the model is fine-tuned. The iCoseg and the MSRC datasets are challenging due to the limited number of samples per class; hence, they are not ideal for supervised learning. The PMF approach overcomes this small-sample problem by using a few-shot learning method for training.
As in previous chapters, we use precision (P) and Jaccard index (J) as metrics for evaluating different methods. Further, we experiment over the Internet dataset with a variable number of co-segmentable images along with outliers.
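For reference, the sketch below computes the two metrics for a single image, taking precision as the fraction of correctly labelled pixels, which is the usual convention in the co-segmentation literature; the exact per-dataset averaging protocol behind the reported tables is not reproduced here.

import numpy as np

def precision_and_jaccard(pred_mask, gt_mask):
    # pred_mask, gt_mask: binary numpy arrays of the same shape.
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    # precision: fraction of pixels labelled correctly (foreground and background)
    p = (pred == gt).mean()
    # Jaccard index: intersection over union of the foreground regions
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    j = inter / union if union > 0 else 1.0
    return p, j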
9.4.1 PMF Implementation Details

To build the encoder part of the PMF architecture, the pre-trained ResNet-50 is used. For the rest of the network, the strategy of Glorot et al. [41] has been adopted for initializing the weights. For the optimization, stochastic gradient descent is used with a learning rate of $10^{-5}$ and a momentum of 0.9 for all datasets. Each input image and the corresponding mask are resized to 224 × 224 pixels. Further, data augmentation is performed by random rotation and horizontal flipping of the images to increase the number of training samples. For all datasets, the guide set G and co-seg set C are randomly created such that there are no common images, and the episodic training scheme described in Sect. 9.2.1 is used.
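The quoted optimizer and augmentation settings could be set up as in the following sketch; the placeholder model and the rotation range are assumptions, and in practice the same geometric transform must be applied jointly to an image and its mask.

import torch
import torchvision.transforms as T

# Placeholder standing in for the DVICE network built in Sect. 9.3.
model = torch.nn.Conv2d(3, 1, kernel_size=1)

# SGD with learning rate 1e-5 and momentum 0.9, as quoted above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

# Resizing and augmentation; the 15-degree rotation range is an assumption,
# since the chapter only says "random rotation".
train_transform = T.Compose([
    T.Resize((224, 224)),
    T.RandomRotation(degrees=15),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])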
9.4.2 Performance Analysis

As mentioned earlier, with the PASCAL-VOC dataset as Dbase, we consider the iCoseg [8] dataset as a Dtarget set because the number of labeled samples present in it is small. It has 38 classes with 643 images, with some classes having fewer than 5 samples. Since this dataset is very small, and further to examine the few-shot learning-based PMF method, it is split into training and testing sets in the ratio of 1:1; as a result, the guide set to co-seg set ratio is also 1:1. We compare the performance of different methods in Table 9.1. The methods DOC [46, 72, 100], CSA [20], CAT [50, 126] are, by design, not equipped to handle the scenario of the small number of labeled samples in the iCoseg dataset, whereas the PMF method's few-shot learning scheme can fine-tune the model over the small set of available samples without any overfitting, which inherently boosts its performance. The method in [67] created additional annotated data to tackle the small-sample problem, which essentially requires extra human supervision. Visual results in Fig. 9.5 show that PMF performs well even for the most difficult class (panda). Next, we analyze the use of the MSRC [131] dataset as Dtarget. It consists of the following classes: cow, plane, car, sheep, cat, dog and bird, and each class has 10 images. Hence, the aforementioned 7 classes are removed from Dbase (PASCAL-VOC) to preserve the few-shot setting in the experiments of PMF. The training and testing split is set to 2:3. The quantitative and visual results are shown in Table 9.2 and Fig. 9.6. It may be noted that the methods DOC, CSA and PMS (the method described in Chap. 8) perform co-segmentation over only image pairs, and use a train-to-test split ratio of 3:2.
Fig. 9.5 Co-segmentation results obtained using the PMF method on the iCoseg dataset. (Top row) Images in a co-seg set and corresponding common objects (statue) obtained, shown pairwise. (Middle row) Images in a co-seg set and corresponding common objects (panda) obtained, shown pairwise. (Bottom row, left) Guide set for the statue class. (Bottom row, right) Guide set for the panda class that includes a negative sample. Figure courtesy: [5]
Fig. 9.6 Co-segmentation results obtained using the PMF method on the MSRC dataset. (Top row) Images in a co-seg set and corresponding common objects (dog) obtained, shown pairwise. (Middle row) Images in a co-seg set and corresponding common objects (cat) obtained, shown pairwise. (Bottom row, left) Guide set for the dog class that includes a negative sample. (Bottom row, right) Guide set for the cat class. Figure courtesy: [5]
Fig. 9.7 Co-segmentation results obtained using the PMF method on the Internet dataset. (Top row) Images in a co-seg set and corresponding common objects (horse) obtained, shown pairwise. The co-seg set also contains one outlier image. (Middle row) Images in a co-seg set and corresponding common object (airplane) obtained, shown pairwise. The co-seg set also contains two outlier images. (Bottom row, left) Guide set for the horse class that includes a negative sample. (Bottom row, right) Guide set for the airplane class that includes a negative sample. The co-segmentation results show that the model is robust to outliers in the co-seg set. Figure courtesy: [5]
Table 9.1 Comparison of precision (P) and Jaccard index (J) of different methods evaluated using the iCoseg dataset as the target set

Method       Precision (P)   Jaccard index (J)
DOC [72]     –               0.84
[100]        –               0.73
[46]         94.4            0.78
CSA [20]     –               0.87
CAT [50]     96.5            0.77
[126]        90.8            0.72
[67]         97.9            0.89
PMF          99.1            0.94
Table 9.2 Comparison of precision (P) and Jaccard index (J) of different methods evaluated using the MSRC dataset as the target set

Method       Precision (P)   Jaccard index (J)
CMP [36]     92.0            0.77
DOC [72]     94.4            0.80
CSA [20]     95.3            0.77
PMS          96.3            0.85
PMF          98.7            0.88
Finally, we consider the Internet [105] dataset as Dtarget, which has three classes: car, aeroplane and horse, with 100 samples per class. Though the number of classes is small, this dataset has high intra-class variation and is relatively large. But to examine the performance in the few-shot setting, it is split in a 1:9 ratio into training and testing sets. As the PASCAL-VOC dataset is considered as Dbase, the above three classes are removed from it. In the experiments, the cardinality of the co-seg set is varied (randomly selected 40, 60, or 80 images from the Internet dataset). Similarly, the number of outliers is also varied from 10 to 50% of the total samples of the set in steps of 10%. We report the average accuracy computed over all of these sets for the PMF method as well as accuracies of other methods in Table 9.3. The results indicate that the PMF method can handle a large number of input images and also a large number of outliers. The visual results are shown in Fig. 9.7.
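The outlier-contaminated co-seg sets used in this experiment could be sampled as in the sketch below; the function and argument names are hypothetical.

import random

def sample_coseg_set(class_images, other_images, set_size, outlier_fraction):
    # Draw an outlier-contaminated co-seg set: `set_size` images (40, 60 or 80 here),
    # of which a fraction `outlier_fraction` (0.1 to 0.5) come from other classes.
    n_out = int(round(outlier_fraction * set_size))
    positives = random.sample(class_images, set_size - n_out)
    outliers = random.sample(other_images, n_out)
    coseg_set = positives + outliers
    random.shuffle(coseg_set)
    return coseg_set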
9.4.3 Ablation Study

Encoder: The task of image co-segmentation can be divided into two sub-tasks in cascade. The first task is to identify similar objects without exploiting any semantic information or, more formally, to cluster similar objects together. The second task
Fig. 9.8 tSNE plots of the embeddings obtained, for five classes, using a ResNet-50 encoder (left) and the PMF encoder (right) described in this chapter. It can be observed that the PMF encoder, when compared to ResNet-50, has smaller intra-class distance and larger inter-class distance. Figure courtesy: [5]
Fig. 9.9 Illustration of the need for the channel attention and spatial attention modules (ChAM and SpAM). (Top row) A sample guide set and co-seg set used for co-segmentation. a Output co-segmented images from the model without using ChAM and SpAM. b Output co-segmented images from the model when only ChAM is used. c Output co-segmented images from the model when both ChAM and SpAM are used. When both attention modules are used, the model is able to correctly segment the common foreground (pyramid) from the co-seg set based on the dominant class present in the guide set. Figure courtesy: [5]
Table 9.3 Comparison of precision (P) and Jaccard index (J) of different methods evaluated using the Internet dataset as the target set

Method       Precision (P)   Jaccard index (J)
DOC [72]     93.3            0.70
[100]        85.0            0.53
CSA [20]     –               0.74
CAT [50]     92.2            0.69
PMS          96.1            0.77
[67]         97.1            0.84
PMF          99.0            0.87
Fig. 9.10 Performance comparison using the Jaccard index (J) for a varying number of positive samples in the guide set. It can be seen that the performance when the variational inference and attention are used is better than the case when they are not used. The performance is good even when the guide set has only a small number of positive samples. Figure courtesy: [5]
is to jointly segment similar objects, or to perform foreground segmentation over each cluster. In this context, to show the role of the directed variational inference cross-encoder described in this chapter for clustering, the encoder is replaced with the ResNet-50 (sans the final two layers). The resulting embedding spaces obtained with both encoders are shown using tSNE plots in Fig. 9.8. This experiment is conducted on the MSRC dataset where five classes are randomly chosen to examine the corresponding class embedding. The PMF encoder, with the help of variational inference, reduces intra-class distances and increases inter-class distances implicitly, which in turn boosts the co-segmentation performance significantly. Attention mechanism: The channel attention module (ChAM) and spatial attention module (SpAM) also play a significant role in obtaining the common object in the input image set correctly. It is evident from Fig. 9.9a, c that ChAM and SpAM help to identify common objects in a very cluttered background and objects with different scales. However, the role of the ChAM is more crucial in identifying common objects,
Fig. 9.11 Illustration shows the role of the guide set in obtaining the co-segmented objects for a given co-seg set (Rows 2 and 5). The guide set 1 (with pyramid as the majority class) leads to the output shown in Row 3 with pyramid as the co-segmented object. The guide set 2 (with horse as the majority class) leads to the output shown in Row 6 with horse as the co-segmented object. Figure courtesy: [5]
whereas the SpAM is responsible for better mask production. It can be observed in Fig. 9.9b that without the SpAM, the generated output masks are incomplete. Guide set: The common object prototype $O^g$ is calculated from the set G by feature averaging. The method of determining $O^g$ is similar to noise cancellation, where the motivation is to reduce the impact of outliers and to increase the influence of the positive samples (samples containing the common object). We experiment on the iCoseg dataset by varying the number of positive samples in the guide set to be 2, 4, 6, 8. The size of the guide set is fixed at 8. The performance of the PMF method with and without variational inference and the attention modules is shown in Fig. 9.10. It can also be observed that the method is robust against outliers and can work with a small number of positive guide samples. Multiple common objects: The fine control of the PMF approach over the foreground extraction process is demonstrated in Fig. 9.11. For a given co-seg set with multiple potential common foregrounds, i.e., pyramid and horse, the network can be guided to perform common foreground extraction on the co-seg set for each of these foregrounds just by varying the composition of the guide set. We can also relate this result to the multiple common object segmentation discussed in Sect. 5.5.2. There, the seeds for each common object class are obtained from the clusters with a large compactness factor, and the number of different classes present in the image needs to be provided by the user. However, here guidance comes from the guide set, that is, the majority class in the guide set is responsible for detecting common objects of that class from the co-seg set. In this chapter, we have described a framework to perform multiple image co-segmentation, which is capable of overcoming the small-sample problem by integrating few-shot learning and variational inference. The approach can learn a continuous embedding to extract a consistent foreground from multiple images of a given set. Further, it is capable of performing consistently, even in the presence of a large number of outlier samples in the co-seg set. Acknowledgements Contributions of S Divakar Bhatt are gratefully acknowledged.
Chapter 10
Conclusions
The objective of this monograph has been to detect common objects, i.e., objects with similar features present in a set of images. To achieve this goal, we have discussed three unsupervised and three supervised methods. In the graph-based approaches, each image is first segmented into superpixels, and a region adjacency graph is constructed whose nodes represent the image superpixels. This graph representation of images allows us to exploit the property that an object is typically a set of contiguous superpixels, and the neighborhood relationship among superpixels is embedded in the corresponding region adjacency graph. Since the common objects across different images contain superpixels with similar features, graph matching techniques have been used for finding inter-graph correspondences among nodes in the graphs. In the introduction of this monograph, we briefly introduced the concept of image co-saliency. It utilizes the fact that the salient objects in an image capture human attention. But we found that the common object in an image set may not be salient in all the images. This is due to the fact that interesting patches occur rarely in an image set. Hence, it is difficult to find many image sets with salient common objects in practice. So, we have restricted this monograph to image co-segmentation. However, for completeness, here we compare image co-segmentation and co-saliency in detail using several example images. In the following example, we try to find the common object with similar features in an image set, and saliency can be used as a feature. Figure 10.1 shows a set of four images with the 'cow' being the common object. Since 'cow' is the salient object in every image, we can successfully co-segment the image set using saliency. But most image sets do not show these characteristics. Next, we show two examples to explain this. In Fig. 10.2, the common object 'dog' is salient in Image 1 and Image 2. But in Image 3, people are salient. Next, the image set in Fig. 10.3 includes an outlier image that contains an object which is different from the common object. Here, all the objects (the common object 'kite' and the 'panda' object in the outlier image) are salient. Thus, the image set to be co-segmented may contain (i) common objects that are not salient in all the images (see Fig. 10.2) and (ii) salient objects in the outlier
image (see Fig. 10.3). Hence, co-segmentation of these image sets using saliency results in false negatives (the algorithm may miss out on the common object in some of the images) and false positives (objects in the outlier images may get incorrectly detected as the common objects). In the literature, co-saliency methods [16, 19, 21, 39, 69, 73, 77, 124] have been shown to detect common, salient objects by combining (i) individual image saliency outputs and (ii) pixel or superpixel feature distances among the images. Objects with a high saliency value may not necessarily have common features while considering a set of images. Hence, these saliency-guided methods do not always detect similar objects across images correctly. In Fig. 10.4, we show co-segmentation of two images without any common object present in them. Co-segmentation without using saliency yields the correct result (Fig. 10.4c) as it does not detect any meaningful common object (although small dark segments have been detected due to their feature similarity). But if saliency (Fig. 10.4b) is used, 'red flower' and 'black bird' are wrongly co-segmented (Fig. 10.4d) since they are highly salient in their respective images. It has been observed that the salient patches of an image have the least probability of being sampled from a corpus of similar images [116]. Hence, we are unlikely to obtain a large volume of data with image sets containing many salient common objects. Image segmentation using co-segmentation is, in principle, different from object segmentation using co-saliency as the segmented common object need not be the salient object in both images. In this monograph, we described co-segmentation methods that are independent of saliency or any prior knowledge or preprocessing. Notwithstanding the above-mentioned disadvantages of using saliency in co-segmentation, it may be noted that if the common object present in the image set is indeed salient in the respective images, saliency will certainly aid in co-segmentation. Then the problem becomes co-saliency. Next, we explain this using four examples of synthetic image pairs. The image pair in Fig. 10.5a contains two common objects ('maroon' and 'green') and both objects have high saliency (Fig. 10.5b) in their respective images. Hence, both the common objects can be co-segmented (i) using similarity between saliency values alone (Fig. 10.5c), (ii) using only feature similarities without using saliency at all (Fig. 10.5d) and (iii) using both feature similarity and saliency similarity together (Fig. 10.5e). In Fig. 10.6, the image pair contains only one salient common object ('maroon') and two dissimilar salient objects ('green' and 'blue'). (i) The dissimilar objects are wrongly detected as common objects if only saliency similarity is considered (Fig. 10.6c), (ii) but they are correctly discarded if only feature similarity is considered (Fig. 10.6d). (iii) If both saliency and feature similarities are used, there is a possibility of false positives (Fig. 10.6e). Here, the dissimilar objects may be wrongly co-segmented if the saliency similarity far outweighs the feature similarity. For the image pair in Fig. 10.7, we obtain the same result although the common object ('dark yellow') is less salient, because only the similarity between saliency values has been used for the results in Fig. 10.7c,e instead of directly using saliency values (saliency similarity is also considered in Fig. 10.5c,e, Fig. 10.6c,e, Fig. 10.8c,e). In Fig.
10.8, we consider the dissimilar objects to be highly salient, but the common object in one image to be less salient than the other (Fig. 10.8b). This reduces saliency
Fig. 10.1 Illustration of saliency detection on an image set (Column 1) with the common foreground (‘cow’, shown in Column 2) being salient in all the images. Image courtesy: Source images from the MSRC dataset [105]
Fig. 10.2 Illustration of saliency detection on an image set (shown in top row) where the common foreground (‘dog’, shown in bottom row) is not salient in all the images. Image courtesy: Source images from the MSRC dataset [105]
similarity between the common object present in the two images, resulting in a false negative when only saliency similarity is used (Fig. 10.8c). But the common object is correctly co-segmented by using feature similarity alone (Fig. 10.8d). With these observations, we solved the co-segmentation problem without using saliency. In Chap. 4, we have described a method for co-segmenting two images. Images are successively resized and oversegmented into multiple levels, and graphs are obtained. Given graph representations of the image pair in the coarsest level, we find the maximum common subgraph (MCS). As MCS computation is an NP-complete problem, an approximate method using the minimum vertex cover algorithm has been used. Since the common object in both the images may have different sizes, the MCS represents it partially. Then the nodes in the resulting subgraphs are mapped to graphs in the finer segmentation levels. Next, using them as seeds, these subgraphs are grown in order to obtain the complete objects. Instead of individually growing them, region co-growing (RCG) is performed where feature similarity among nodes (superpixels) is computed across images as well as within images. Since the algorithm has two components (MCS and RCG), progressive co-segmentation is possible, which in turn results in fast computation. We have shown that this method can be extended for co-segmenting more than two images. We can co-segment every (non-overlapping) pair of images and use the outputs for a second round of co-segmentation and so on. For a set of N images, O(N) MCS matching steps are required to obtain the
Fig. 10.3 Illustration of saliency detection on an image set (shown in Column 1) that includes an outlier image (last image) where both the common foreground (‘kite’, shown in Column 2) and the object (‘panda’) in the outlier image are salient. Image courtesy: Source images from the iCoseg dataset [8]
Fig. 10.4 Illustration of co-segmentation using saliency. a Input images and b corresponding saliency outputs. Co-segmentation result c without and d with saliency. Image courtesy: Source images from the MSRA dataset [26]
Fig. 10.5 Illustration of co-segmentation when saliency is redundant. a Input image pair and b saliency output. Co-segmentation c using saliency alone, d using feature similarity and e using both saliency and feature similarity correctly co-segments both common objects
final output. If at least one outlier image (that does not contain the common object) is present in the image set, the corresponding MCS involving that image will result in an empty set, and this result propagates to the final-level co-segmentation, which will also yield an empty set. Hence, this extension of two-image co-segmentation to multi-image co-segmentation will fail unless the common object is present in all the images. Next, in Chap. 5, a multi-image co-segmentation algorithm has been demonstrated to solve the problem associated with the presence of outlier images. First, the image superpixels are clustered based on features, and seed superpixels are identified from the spatially most compact cluster. In Chap. 4, we could co-grow seed superpixels since we had two images. In the case of more than two images (say N), the number of possible matches is very high ($O(N 2^{N-1})$). Hence, we need to combine all the
Fig. 10.6 Illustration of co-segmentation when high saliency of dissimilar objects introduces false positives. a Input image pair and b saliency output. Co-segmentation using c saliency alone and e both saliency and feature similarity introduces false positives by wrongly detecting dissimilar objects as common (‘?’ indicates the object may or may not be detected), whereas d co-segmentation using feature similarity alone correctly detects only the common object
Fig. 10.7 Illustration of co-segmentation when saliency introduces false positives. a Input image pair and b saliency output. Co-segmentation using c saliency alone and e both saliency and feature similarity introduce false positives, whereas d co-segmentation using feature similarity alone correctly detects only the common object even though the common object is less salient in both the images
seed graphs based on feature similarity and neighborhood relationships of nodes and build a combined graph, which is called the latent class graph (LCG). Region growing has been performed on each of the seed graphs independently by using this LCG as a reference graph. Thus, we have achieved consistent matching among superpixels within the common object across images and reduced the number of required graph matchings to O(N). In Chap. 6, we have discussed the formulation of co-segmentation as a classification problem where image superpixels are labeled as either the common foreground or the background. Since we usually get a variety of different background regions in
Fig. 10.8 Illustration of co-segmentation when saliency difference of the common object introduces false negatives. a Input image pair and b saliency output. Co-segmentation using c saliency alone and e both saliency and feature similarity introduce false positives as well as false negatives since the common object in one image is less salient than in the other image, whereas d co-segmentation using feature similarity alone correctly detects the common object
a set of images, more than one background class has been used. The training superpixels (seeds) have been obtained in a completely unsupervised manner, and a mode detection method in a multidimensional feature space has been used to find the seeds. Optimal discriminants have been learned using a modified LDA in order to compute discriminative features by projecting the input features to a space that increases the separation between the common foreground class and each of the background classes. Then, using the projected features, a spatially constrained label propagation algorithm assigns labels to the unlabeled superpixels in an iterative manner in order to obtain the complete objects while ensuring their cohesiveness. Images acquired in an uncontrolled environment present a number of challenges for segmenting the common foreground from them, including differences in the appearance and pose of common foregrounds, as well as foregrounds that are strikingly similar to the background. The approach described in Chap. 6 addresses these issues as an unsupervised foreground–background classification problem, in which superpixels belonging to the same foreground are detected using their corresponding handcrafted features. The method's effectiveness is heavily reliant on the superpixels' computed features, and manually obtaining the appropriate ones may be extremely difficult depending on the degree of variation of the common foreground across the input images. In Chap. 7, we discussed an end-to-end foreground–background classification framework where the features of each superpixel are computed automatically using a graph convolution neural network. In Chap. 8, we discussed a CNN-based architecture for solving image co-segmentation. Based on a conditional siamese encoder–decoder architecture, combined with siamese metric learning and a decision network, the model demonstrates good generalization performance on segmenting objects of the same classes across different datasets, and robustness to outlier images. In Chap. 9, we described a framework to perform multiple image
co-segmentation, which is capable of overcoming the small-sample problem by integrating few-shot learning and variational inference. We have shown that this framework is capable of learning a continuous embedding to extract a consistent foreground from multiple images of a given set. The approach is capable of performing consistently: (i) over small datasets, and (ii) even in the presence of a large number of outlier samples in the co-seg set. To demonstrate the robustness and superior performance of the discussed co-segmentation methods, we have experimented on standard datasets: the image pair dataset, the MSRC dataset, the iCoseg dataset, the Weizmann horse dataset, the flower dataset and our own outlier-contaminated dataset created by mixing images of different classes. The methods show results better than or comparable to other unsupervised methods in the literature in terms of both accuracy and computation time.
10.1 Future Work

This monograph focused on foreground co-segmentation. It is to be noted that background co-segmentation is also worth studying as it has a direct application in the annotation of semantic segments, which include both foreground and background in images. Assuming we have training data in the form of labeled superpixels of different background classes, the work in Chap. 6 can be extended to background co-segmentation with slight modifications. In the label propagation stage of Chap. 6, we have used spatial constraints in individual images. However, we do not have constraints on the spatial properties of the common object across images. Hence, in this regard, the method can be extended as future work by incorporating a shape similarity measure among the partial objects obtained after every iteration of label propagation. The multidimensional mode detection method of Chap. 6 aids in discriminative feature computation if there is only one type of common object present in the image set. It should be noted that mode detection in a high-dimensional setting is a challenging research problem, even in a generic setting. If there is more than one common object class, as in the case of multiple co-segmentation discussed in Sect. 5.5.2, we are required to compute multiple modes and consider multiple foreground classes instead of one. Hence, the study of multiple mode detection-based co-segmentation can also be considered challenging future work. For the machine learning-based approaches, having sufficient labeled training data for co-segmentation can be challenging. Hence, approaches that use less labeled data, such as the few-shot learning in Chap. 9, can be further explored. Specifically, the approach considered in Chap. 9 is class agnostic, but considering class-aware methods for fine-grained co-segmentation is an area for future research. Along a similar direction, incremental learning for co-segmentation will also be useful in certain settings and can be considered. The co-segmentation problem can be extended to perform self co-segmentation, i.e., segmenting similar objects or classes within an image. One approach is to consider this problem as self-similarity of subgraphs in an image. Doing
this in a learning-based setup may need a different approach. Another extension of image co-segmentation could be to perform co-segmentation across videos, where similar segments across different videos can be extracted. This poses the additional challenge of handling both spatial and temporal similarity, which will need a different approach.
References
1. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012) 2. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Trans. Graph. 26(3) (2007) 3. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Machine Intell. 39(12), 2481–2495 (2017) 4. Baldi, P., Sadowski, P.J.: Understanding dropout. In: Advances in Neural Information Processing Systems, vol. 26, pp. 2814–2822 (2013) 5. Banerjee, S., Bhat, S.D., Chaudhuri, S., Velmurugan, R.: Directed variational cross-encoder network for few-shot multi-image co-segmentation. In: Proceedings of ICPR, pp. 8431–8438 (2021) 6. Banerjee, S., Hati, A., Chaudhuri, S., Velmurugan, R.: Image co-segmentation using graph convolution neural network. In: Proceedings of Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), pp. 57:1–57:9 (2018) 7. Banerjee, S., Hati, A., Chaudhuri, S., Velmurugan, R.: Cosegnet: image co-segmentation using a conditional siamese convolutional network. In: Proceedings of IJCAI, pp. 673–679 (2019) 8. Batra, D., Kowdle, A., Parikh, D., Luo, J., Chen, T.: iCoseg: interactive co-segmentation with intelligent scribble guidance. In: Proceedings of CVPR, pp. 3169–3176 (2010) 9. Bickel, D.R., Frühwirth, R.: On a fast, robust estimator of the mode: comparisons to other robust estimators with applications. Comput. Stat. Data Anal. 50(12), 3500–3530 (2006) 10. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006) 11. Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017) 12. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph mincuts, pp. 19–26 (2001) 13. Borenstein, E., Ullman, S.: Combined top-down/bottom-up segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 30(12), 2109–2125 (2008) 14. Borji, A., Cheng, M.M., Jiang, H., Li, J.: Salient object detection: a benchmark. IEEE Trans. Image Process. 24(12), 5706–5722 (2015) 15. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. In: Proceedings of ICCV, vol. 1, pp. 105–112 (2001)
16. Cao, X., Tao, Z., Zhang, B., Fu, H., Feng, W.: Self-adaptively weighted co-saliency detection via rank constraint. IEEE Trans. Image Process. 23(9), 4175–4186 (2014) 17. Chandran, S., Kiran, N.: Image retrieval with embedded region relationships. In: Proceedings of ACM Symposium on Applied Computing, pp. 760–764 (2003) 18. Chang, H.S., Wang, Y.C.F.: Optimizing the decomposition for multiple foreground cosegmentation. Elsevier Comput. Vis. Image Understand. 141, 18–27 (2015) 19. Chang, K.Y., Liu, T.L., Lai, S.H.: From co-saliency to co-segmentation: an efficient and fully unsupervised energy minimization model. In: Proceedings of CVPR, pp. 2129–2136 (2011) 20. Chen, H., Huang, Y., Nakayama, H.: Semantic aware attention based deep object cosegmentation. In: Proceedings of ACCV, pp. 435–450 (2018) 21. Chen, H.T.: Preattentive co-saliency detection. In: Proceedings of ICIP, pp. 1117–1120 (2010) 22. Chen, M., Velasco-Forero, S., Tsang, I., Cham, T.J.: Objects co-segmentation: propagated from simpler images. In: Proceedings of ICASSP, pp. 1682–1686 (2015) 23. Chen, T., Cheng, M.M., Tan, P., Shamir, A., Hu, S.M.: Sketch2photo: internet image montage. ACM Trans. Graph. 28(5), 124 (2009) 24. Chen, X., Shrivastava, A., Gupta, A.: Enriching visual knowledge bases via object discovery and segmentation. In: Proceedings of CVPR, pp. 2035–2042 (2014) 25. Chen, Y.C., Lin, Y.Y., Yang, M.H., Huang, J.B.: Show, match and segment: joint weakly supervised learning of semantic matching and object co-segmentation. IEEE Trans. PAMI 43(10), 3632–3647 (2021) 26. Cheng, M.M., Zhang, G.X., Mitra, N., Huang, X., Hu, S.M.: Global contrast based salient region detection. In: Proceedings of CVPR, pp. 409–416 (2011) 27. Chernoff, H.: Estimation of the mode. Ann. Inst. Stat. Math. 16(1), 31–41 (1964) 28. Colannino, J., Damian, M., Hurtado, F., Langerman, S., Meijer, H., Ramaswami, S., Souvaine, D., Toussaint, G.: Efficient many-to-many point matching in one dimension. Graph. Comb. 23(1), 169–178 (2007) 29. Collins, M.D., Xu, J., Grady, L., Singh, V.: Random walks based multi-image segmentation: quasiconvexity results and GPU-based solutions. In: Proceedings of CVPR, pp. 1656–1663 (2012) 30. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002) 31. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms, 2nd edn. McGraw-Hill Higher Education (2001) 32. Ding, Z., Shao, M., Hwang, W., Suh, S., Han, J.J., Choi, C., Fu, Y.: Robust discriminative metric learning for image representation. IEEE Trans. Circuits Syst. Video Technol. (2019) 33. Dong, X., Shen, J., Shao, L., Yang, M.H.: Interactive cosegmentation using global and local energy optimization. IEEE Trans. Image Process. 24(11), 3966–3977 (2015) 34. Dornaika, F., El Traboulsi, Y.: Matrix exponential based semi-supervised discriminant embedding for image classification. Pattern Recogn. 61, 92–103 (2017) 35. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010) 36. Faktor, A., Irani, M.: Co-segmentation by composition. In: Proceedings of ICCV, pp. 1297– 1304 (2013) 37. Fang, Y., Chen, Z., Lin, W., Lin, C.W.: Saliency detection in the compressed domain for adaptive image retargeting. IEEE Trans. Image Process. 21(9), 3888–3901 (2012) 38. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vis. 
59(2), 167–181 (2004) 39. Fu, H., Cao, X., Tu, Z.: Cluster-based co-saliency detection. IEEE Trans. Image Process. 22(10), 3766–3778 (2013) 40. Fulkerson, B., Vedaldi, A., Soatto, S.: Class segmentation and object localization with superpixel neighborhoods. In: Proceedings of CVPR, pp. 670–677 (2009) 41. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
42. Goferman, S., Tal, A., Zelnik-Manor, L.: Puzzle-like collage. In: Computer Graphics Forum, vol. 29, pp. 459–468. Wiley Online Library (2010) 43. Goferman, S., Tal, A., Zelnik-Manor, L.: Puzzle-like collage. In: Computer Graphics Forum, vol. 29, pp. 459–468. Wiley Online Library (2010) 44. Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: Proceedings of ICCV, pp. 1–8 (2009) 45. Han, J., Ngan, K.N., Li, M., Zhang, H.J.: Unsupervised extraction of visual attention objects in color images. IEEE Trans. Circuits Syst. Video Technol. 16(1), 141–145 (2006) 46. Han, J., Quan, R., Zhang, D., Nie, F.: Robust object co-segmentation using background prior. IEEE Trans. Image Process. 27(4), 1639–1651 (2018) 47. Hati, A., Chaudhuri, S., Velmurugan, R.: Salient object carving. In: Proceedings of ICIP, pp. 1767–1771 (2015) 48. Hati, A., Chaudhuri, S., Velmurugan, R.: Image co-segmentation using maximum common subgraph matching and region co-growing. In: Proceedings of ECCV, pp. 736–752 (2016) 49. Hochbaum, D.S., Singh, V.: An efficient algorithm for co-segmentation. In: Proceedings of ICCV, pp. 269–276 (2009) 50. Hsu, K.J., Lin, Y.Y., Chuang, Y.Y.: Co-attention CNNs for unsupervised object cosegmentation. In: Proceedings of IJCAI, pp. 748–756 (2018) 51. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of ICML, pp. 448–456 (2015) 52. Itti, L.: Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Trans. Image Process. 13(10), 1304–1318 (2004) 53. Jaakkola, M.S.T., Szummer, M.: Partially labeled classification with markov random walks. In: Advances in Neural Information Processing Systems, vol. 14, pp. 945–952 (2002) 54. Jerripothula, K.R., Cai, J., Yuan, J.: Image co-segmentation via saliency co-fusion. IEEE Trans. Multimedia 18(9), 1896–1909 (2016) 55. Joachims, T.: Transductive learning via spectral graph partitioning. In: Proceedings of ICML, pp. 290–297 (2003) 56. Joulin, A., Bach, F., Ponce, J.: Discriminative clustering for image co-segmentation. In: Proceedings of CVPR, pp. 1943–1950 (2010) 57. Joulin, A., Bach, F., Ponce, J.: Multi-class cosegmentation. In: Proceedings of CVPR, pp. 542–549 (2012) 58. Kanan, C., Cottrell, G.: Robust classification of objects, faces, and flowers using natural image statistics. In: Proceedings of CVPR, pp. 2472–2479 (2010) 59. Kim, G., Xing, E.P.: On multiple foreground cosegmentation. In: Proceedings of CVPR, pp. 837–844 (2012) 60. Kim, G., Xing, E.P., Fei-Fei, L., Kanade, T.: Distributed cosegmentation via submodular optimization on anisotropic diffusion. In: Proceedings of ICCV, pp. 169–176 (2011) 61. Koch, I.: Enumerating all connected maximal common subgraphs in two graphs. Theor. Comput. Sci. 250(1), 1–30 (2001) 62. Lai, Z., Xu, Y., Jin, Z., Zhang, D.: Human gait recognition via sparse discriminant projection learning. IEEE Trans. Circuits Syst. Video Techn. 24(10), 1651–1662 (2014) 63. Lattari, L., Montenegro, A., Vasconcelos, C.: Unsupervised cosegmentation based on global clustering and saliency. In: Proceedings of ICIP, pp. 2890–2894 (2015) 64. Lee, C., Jang, W.D., Sim, J.Y., Kim, C.S.: Multiple random walkers and their application to image cosegmentation. In: Proceedings of CVPR, pp. 3837–3845 (2015) 65. Levin, A., Lischinski, D., Weiss, Y.: A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 228–242 (2008) 66. 
Levinshtein, A., Stere, A., Kutulakos, K.N., Fleet, D.J., Dickinson, S.J., Siddiqi, K.: Turbopixels: fast superpixels using geometric flows. IEEE Trans. Pattern Anal. Mach. Intell. 31(12), 2290–2297 (2009) 67. Li, B., Sun, Z., Li, Q., Wu, Y., Hu, A.: Group-wise deep object co-segmentation with coattention recurrent neural network. In: Proceedings of ICCV, pp. 8519–8528 (2019)
68. Li, H., Meng, F., Luo, B., Zhu, S.: Repairing bad co-segmentation using its quality evaluation and segment propagation. IEEE Trans. Image Process. 23(8), 3545–3559 (2014)
69. Li, H., Ngan, K.N.: A co-saliency model of image pairs. IEEE Trans. Image Process. 20(12), 3365–3375 (2011)
70. Li, J., Levine, M.D., An, X., Xu, X., He, H.: Visual saliency based on scale-space analysis in the frequency domain. IEEE Trans. Pattern Anal. Mach. Intell. 35(4), 996–1010 (2013)
71. Li, K., Zhang, J., Tao, W.: Unsupervised co-segmentation for indefinite number of common foreground objects. IEEE Trans. Image Process. 25(4), 1898–1909 (2016)
72. Li, W., Jafari, O.H., Rother, C.: Deep object co-segmentation. In: Proceedings of ACCV, pp. 638–653 (2018)
73. Li, Y., Fu, K., Liu, Z., Yang, J.: Efficient saliency-model-guided visual co-saliency detection. IEEE Sig. Process. Lett. 22(5), 588–592 (2015)
74. Li, Y., Liu, L., Shen, C., van den Hengel, A.: Image co-localization by mimicking a good detector's confidence score distribution. In: Proceedings of ECCV, pp. 19–34 (2016)
75. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. Graph. 23(3), 303–308 (2004)
76. Liu, H., Xie, X., Tang, X., Li, Z.W., Ma, W.Y.: Effective browsing of web image search results. In: Proceedings of ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 84–90 (2004)
77. Liu, Z., Zou, W., Li, L., Shen, L., Le Meur, O.: Co-saliency detection based on hierarchical segmentation. IEEE Sig. Process. Lett. 21(1), 88–92 (2014)
78. Ma, J., Li, S., Qin, H., Hao, A.: Unsupervised multi-class co-segmentation via joint-cut over l1-manifold hyper-graph of discriminative image regions. IEEE Trans. Image Process. 26(3), 1216–1230 (2017)
79. Ma, Y.F., Hua, X.S., Lu, L., Zhang, H.J.: A generic framework of user attention model and its application in video summarization. IEEE Trans. Multimedia 7(5), 907–919 (2005)
80. Madry, A.: Navigating central path with electrical flows: from flows to matchings, and back. In: IEEE Annual Symposium on FOCS, pp. 253–262 (2013)
81. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
82. Marchesotti, L., Cifarelli, C., Csurka, G.: A framework for visual saliency detection with applications to image thumbnailing. In: Proceedings of ICCV, pp. 2232–2239 (2009)
83. Mei, T., Hua, X.S., Li, S.: Contextual in-image advertising. In: Proceedings of ACM Multimedia, pp. 439–448 (2008)
84. Meng, F., Cai, J., Li, H.: Cosegmentation of multiple image groups. Elsevier Comput. Vis. Image Understand. 146, 67–76 (2016)
85. Meng, F., Li, H., Liu, G., Ngan, K.N.: Object co-segmentation based on shortest path algorithm and saliency model. IEEE Trans. Multimedia 14(5), 1429–1441 (2012)
86. Meng, F., Li, H., Ngan, K.N., Zeng, L., Wu, Q.: Feature adaptive co-segmentation by complexity awareness. IEEE Trans. Image Process. 22(12), 4809–4824 (2013)
87. Moore, A.P., Prince, S.J., Warrell, J., Mohammed, U., Jones, G.: Superpixel lattices. In: Proceedings of CVPR, pp. 1–8. IEEE (2008)
88. Mukherjee, L., Singh, V., Dyer, C.R.: Half-integrality based algorithms for cosegmentation of images. In: Proceedings of CVPR, pp. 2028–2035 (2009)
89. Mukherjee, L., Singh, V., Peng, J.: Scale invariant cosegmentation for image groups. In: Proceedings of CVPR, pp. 1881–1888 (2011)
90. Mukherjee, P., Lall, B., Shah, A.: Saliency map based improved segmentation. In: Proceedings of ICIP, pp. 1290–1294 (2015)
91. Nguyen, K., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In: Proceedings of ICCV, pp. 622–631 (2019)
92. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: Proceedings of CVPR, vol. 2, pp. 1447–1454 (2006)
93. Oliva, A., Torralba, A., Castelhano, M.S., Henderson, J.M.: Top-down control of visual attention in object detection. In: Proceedings of ICIP, vol. 1, pp. I–253 (2003)
94. Pal, R., Mitra, P., Mukherjee, J.: Visual saliency-based theme and aspect ratio preserving image cropping for small displays. In: Proceedings of National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 89–92 (2008)
95. Pang, Y., Yuan, Y., Li, X.: Gabor-based region covariance matrices for face recognition. IEEE Trans. Circuits Syst. Video Technol. 18(7), 989–993 (2008)
96. Patel, D., Raman, S.: Saliency and memorability driven retargeting. In: Proceedings of IEEE International Conference SPCOM, pp. 1–5 (2016)
97. Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A.: Saliency filters: contrast based filtering for salient region detection. In: Proceedings of CVPR, pp. 733–740 (2012)
98. Presti, L.L., La Cascia, M.: 3D skeleton-based human action classification: a survey. Pattern Recogn. 53, 130–147 (2016)
99. Quan, R., Han, J., Zhang, D., Nie, F.: Object co-segmentation via graph optimized-flexible manifold ranking. In: Proceedings of CVPR, pp. 687–695 (2016)
100. Ren, Y., Jiao, L., Yang, S., Wang, S.: Mutual learning between saliency and similarity: image cosegmentation via tree structured sparsity and tree graph matching. IEEE Trans. Image Process. 27(9), 4690–4704 (2018)
101. Rosenholtz, R., Dorai, A., Freeman, R.: Do predictions of visual perception aid design? ACM Trans. Appl. Percept. 8(2), 12 (2011)
102. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)
103. Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram matching—incorporating a global constraint into MRFs. In: Proceedings of CVPR, vol. 1, pp. 993–1000 (2006)
104. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection, vol. 589. Wiley (2005)
105. Rubinstein, M., Joulin, A., Kopf, J., Liu, C.: Unsupervised joint object discovery and segmentation in internet images. In: Proceedings of CVPR, pp. 1939–1946 (2013)
106. Rubio, J.C., Serrat, J., López, A., Paragios, N.: Unsupervised co-segmentation through region matching. In: Proceedings of CVPR, pp. 749–756 (2012)
107. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 8(5), 644–655 (1998)
108. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
109. Rutishauser, U., Walther, D., Koch, C., Perona, P.: Is bottom-up attention useful for object recognition? In: Proceedings of CVPR, vol. 2, pp. II–II (2004)
110. Sager, T.W.: Estimation of a multivariate mode. Ann. Stat. 802–812 (1978)
111. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of CVPR, pp. 815–823 (2015)
112. Sharma, G., Jurie, F., Schmid, C.: Discriminative spatial saliency for image classification. In: Proceedings of CVPR, pp. 3506–3513 (2012)
113. Shen, X., Wu, Y.: A unified approach to salient object detection via low rank matrix recovery. In: Proceedings of CVPR, pp. 853–860 (2012)
114. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
115. Siam, M., Oreshkin, B.N., Jagersand, M.: AMP: adaptive masked proxies for few-shot segmentation. In: Proceedings of ICCV, pp. 5249–5258 (2019)
116. Siva, P., Russell, C., Xiang, T., Agapito, L.: Looking beyond the image: unsupervised learning for object saliency and detection. In: Proceedings of CVPR, pp. 3238–3245 (2013)
117. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, pp. 4077–4087 (2017)
118. Soille, P.: Morphological Image Analysis: Principles and Applications, 2nd edn. Springer New York Inc., Secaucus, NJ, USA (2003)
119. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
120. Srivatsa, R.S., Babu, R.V.: Salient object detection via objectness measure. In: Proceedings of ICIP, pp. 4481–4485 (2015)
121. Such, F.P., Sah, S., Dominguez, M.A., Pillai, S., Zhang, C., Michael, A., Cahill, N.D., Ptucha, R.: Robust spatial filtering with graph convolutional neural networks. IEEE J. Sel. Topics Sig. Process. 11(6), 884–896 (2017)
122. Sun, J., Ling, H.: Scale and object aware image retargeting for thumbnail browsing. In: Proceedings of ICCV, pp. 1511–1518 (2011)
123. Sun, J., Ponce, J.: Learning dictionary of discriminative part detectors for image categorization and cosegmentation. Int. J. Comput. Vis. 120(2), 111–133 (2016)
124. Tan, Z., Wan, L., Feng, W., Pun, C.M.: Image co-saliency detection by propagating superpixel affinities. In: Proceedings of ICASSP, pp. 2114–2118 (2013)
125. Tao, W., Li, K., Sun, K.: SaCoseg: object cosegmentation by shape conformability. IEEE Trans. Image Process. 24(3), 943–955 (2015)
126. Tsai, C.C., Li, W., Hsu, K.J., Qian, X., Lin, Y.Y.: Image co-saliency detection and cosegmentation via progressive joint optimization. IEEE Trans. Image Process. 28(1), 56–71 (2019)
127. Van De Sande, K., Gevers, T., Snoek, C.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–1596 (2010)
128. Veksler, O., Boykov, Y., Mehrani, P.: Superpixels and supervoxels in an energy optimization framework. In: Proceedings of ECCV, pp. 211–224. Springer (2010)
129. Venter, J.: On estimation of the mode. Ann. Math. Stat. 1446–1455 (1967)
130. Vicente, S., Kolmogorov, V., Rother, C.: Cosegmentation revisited: models and optimization. In: Proceedings of ECCV, pp. 465–479 (2010)
131. Vicente, S., Rother, C., Kolmogorov, V.: Object cosegmentation. In: Proceedings of CVPR, pp. 2217–2224 (2011)
132. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 13(6), 583–598 (1991)
133. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Netw. 19(9), 1395–1407 (2006)
134. Wang, C., Zhang, H., Yang, L., Cao, X., Xiong, H.: Multiple semantic matching on augmented n-partite graph for object co-segmentation. IEEE Trans. Image Process. 26(12), 5825–5839 (2017)
135. Wang, F., Huang, Q., Guibas, L.J.: Image co-segmentation via consistent functional maps. In: Proceedings of ICCV, pp. 849–856 (2013)
136. Wang, F., Huang, Q., Ovsjanikov, M., Guibas, L.J.: Unsupervised multi-class joint image segmentation. In: Proceedings of CVPR, pp. 3142–3149 (2014)
137. Wang, J., Quan, L., Sun, J., Tang, X., Shum, H.Y.: Picture collage. In: Proceedings of CVPR, vol. 1, pp. 347–354 (2006)
138. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Proceedings of CVPR, pp. 3360–3367 (2010)
139. Wang, L., Hua, G., Xue, J., Gao, Z., Zheng, N.: Joint segmentation and recognition of categorized objects from noisy web image collection. IEEE Trans. Image Process. 23(9), 4070–4086 (2014)
140. Wang, P., Zhang, D., Wang, J., Wu, Z., Hua, X.S., Li, S.: Color filter for image search. In: Proceedings of ACM Multimedia, pp. 1327–1328 (2012)
141. Wang, S., Lu, J., Gu, X., Du, H., Yang, J.: Semi-supervised linear discriminant analysis for dimension reduction and classification. Pattern Recogn. 57, 179–189 (2016)
142. Wang, W., Shen, J.: Higher-order image co-segmentation. IEEE Trans. Multimedia 18(6), 1011–1021 (2016)
143. Wang, X., Zheng, W.S., Li, X., Zhang, J.: Cross-scenario transfer person reidentification. IEEE Trans. Circuits Syst. Video Technol. 26(8), 1447–1460 (2016)
144. Xiao, C., Chaovalitwongse, W.A.: Optimization models for feature selection of decomposed nearest neighbor. IEEE Trans. Syst. Man Cybern. Syst. 46(2), 177–184 (2016)
145. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object detection for multi-class segmentation. In: Proceedings of CVPR, pp. 3113–3120 (2010)
146. Yuan, Z., Lu, T., Wu, Y.: Deep-dense conditional random fields for object co-segmentation. In: Proceedings of IJCAI, pp. 3371–3377 (2017)
147. Zhang, K., Chen, J., Liu, B., Liu, Q.: Deep object co-segmentation via spatial-semantic network modulation. In: Proceedings of AAAI Conference on Artificial Intelligence, vol. 34, pp. 12813–12820 (2020)
148. Zhang, X.Y., Bengio, Y., Liu, C.L.: Online and offline handwritten Chinese character recognition: a comprehensive study and new benchmark. Pattern Recogn. 61, 348–360 (2017)
149. Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: Proceedings of CVPR, pp. 3586–3593 (2013)
150. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Advances in Neural Information Processing Systems, pp. 321–328 (2004)
151. Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: Proceedings of CVPR, pp. 2814–2821 (2014)
152. Zitnick, C.L., Kang, S.B.: Stereo for image-based rendering using image over-segmentation. Int. J. Comput. Vis. 75(1), 49–65 (2007)