Recent Advances in Logo Detection Using Machine Learning Paradigms: Theory and Practice (Intelligent Systems Reference Library, 255) [2024 ed.] 3031598105, 9783031598104

This book presents the current trends in deep learning-based object detection framework with a focus on logo detection t

English Pages 131 [128] Year 2024

Table of contents :
Preface
Contents
About the Authors
1 Deep Convolutional Neural Networks
1.1 Deep Learning Frameworks
1.1.1 Core Component and Key Elements of Deep Learning
1.2 Feature Extraction Networks
1.2.1 VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition
1.2.2 Residual Networks
1.2.3 Deep Layer Aggregation Networks
1.2.4 Hourglass Framework
1.3 Object Detection Frameworks: Detection Head
1.3.1 Detection Head Functionality in Object Detection Frameworks
1.3.2 Anchor Box-Based Detection Frameworks
1.3.3 Anchorfree Detection Frameworks
1.4 Summary
References
2 Introduction to Logo Detection
2.1 Logo Detection and Its Applications
2.2 Logo Detection Challenges
2.3 Related Work in Logo Detection
2.3.1 Deep Learning for Logo Detection
2.4 Proposed Approaches for Logo Detection
2.5 Summary
References
3 Weakly Supervised Logo Detection Approach
3.1 Weakly-Supervised Logo Detection Using Image-Level Annotation
3.2 Attention Mechanisms
3.3 Weakly Supervised Logo Detection with Dual-Attention Dilated Residual Network
3.3.1 Feature Extraction Backbone Network
3.3.2 Spatial Attention Mechanism
3.3.3 Channel Attention Mechanism
3.3.4 Gradient-Based Grad-CAM for Localization of Logos
3.3.5 Implementation of Channel and Spatial Attention
3.4 Experiments and Results
3.4.1 Implementation
3.4.2 Dataset
3.4.3 Evolution Measures
3.4.4 Comparison with Different Attention Modules
3.5 Summary
References
4 Anchorfree Logo Detection Framework
4.1 Dual-Attention LogoNet for Logo Detection
4.1.1 Overview of the Logo Detection Framework
4.1.2 Layer-Aggregated Hourglass Style Feature Extraction Network
4.1.3 Attention Modules
4.1.4 Detection Head
4.1.5 Overall Framework of Dual-Attention LogoNet
4.1.6 Lightweight CNNs Network Architecture for Practical Applications
4.1.7 Experiments and Results
4.1.8 Implementation
4.1.9 Evaluation on FlickrLogos-32 Dataset
4.1.10 Evaluation with Lightweight CNNs Method
4.2 Summary
References
5 Mitigating Domain Shift in Logo Detection: An Adversarial Learning-Based Approach
5.1 Domain Shift Problem
5.2 Domain Adaptation for Computer Vision Tasks
5.3 Related Work: Domain Adaptation
5.4 Adaptation Using Anchorfree Object Detector for Logo Detection
5.5 Evaluation with Adversarial-Based Domain Adaptation Using LogoNet
5.5.1 Experiments and Results
5.6 Summary
References
6 Unsupervised Logo Detection with Adversarial Domain Adaptation from Synthetic to Real Images
6.1 Unsupervised Domain Adaptation: Synthetic to Real Logo Detection
6.2 Synthesize Images to Avoid Manual Annotation Task
6.2.1 Synthetic Logo Images
6.3 Domain Alignment Using Entropy Minimization
6.3.1 Entropy Minimization
6.3.2 Entropy Minimization Maps Using Mid-Level Feature from Synthetic to Real Logo Images
6.4 Experiments and Results
6.4.1 Datasets
6.4.2 Implementation Details
6.5 Summary
6.6 Discussion and Future Recommendations
References

Intelligent Systems Reference Library 255

Yen-Wei Chen Xiang Ruan Rahul Kumar Jain

Recent Advances in Logo Detection Using Machine Learning Paradigms Theory and Practice

Intelligent Systems Reference Library Volume 255

Series Editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK

The aim of this series is to publish a Reference Library, including novel advances and developments in all aspects of Intelligent Systems in an easily accessible and well structured form. The series includes reference works, handbooks, compendia, textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains well integrated knowledge and current information in the field of Intelligent Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science, avionics, business, e-commerce, environment, healthcare, physics and life science are included. The list of topics spans all the areas of modern intelligent systems such as: Ambient intelligence, Computational intelligence, Social intelligence, Computational neuroscience, Artificial life, Virtual society, Cognitive systems, DNA and immunity-based systems, e-Learning and teaching, Human-centred computing and Machine ethics, Intelligent control, Intelligent data analysis, Knowledge-based paradigms, Knowledge management, Intelligent agents, Intelligent decision making, Intelligent network security, Interactive entertainment, Learning paradigms, Recommender systems, Robotics and Mechatronics including human-machine teaming, Self-organizing and adaptive systems, Soft computing including Neural systems, Fuzzy systems, Evolutionary computing and the Fusion of these paradigms, Perception and Vision, Web intelligence and Multimedia. Indexed by SCOPUS, DBLP, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Yen-Wei Chen College of Information Science and Engineering Ritsumeikan University Ibaraki, Osaka, Japan

Xiang Ruan Tiwaki Co., Ltd. Kusatsu, Shiga, Japan

Rahul Kumar Jain College of Information Science and Engineering Ritsumeikan University Ibaraki, Osaka, Japan

ISSN 1868-4394 ISSN 1868-4408 (electronic) Intelligent Systems Reference Library ISBN 978-3-031-59810-4 ISBN 978-3-031-59811-1 (eBook) https://doi.org/10.1007/978-3-031-59811-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland If disposing of this product, please recycle the paper.

Preface

In recent years, there has been a notable transition from rule-based Artificial Intelligence (AI) to Machine-Learning-based AI, with an increasing focus on Deep Learning. Currently, deep learning frameworks are at the forefront of AI technology, marking significant advancements in academic and industrial fields. The evolution of deep convolutional neural networks (CNNs) has positioned deep learning as the method of choice for computer vision tasks, including image classification, segmentation, object detection, and human keypoint detection. This surge in deep learning has sparked interest in applying CNNs across various domains, including the task of logo detection.

A logo is a unique symbol that identifies the products and services of a company. Logos serve as the quintessential elements of brand identity in today's market, symbolizing a company's products and services while setting them apart from competitors. Consequently, automatic logo detection in real-world images has emerged as a critical challenge with substantial implications across numerous applications, such as brand management, copyright protection, advertisement, and market analysis. To date, various logo recognition methods have been proposed, but most of them require huge annotated datasets for training and suffer from domain discrepancies between training (source domain) and testing (target domain) datasets. These discrepancies often result in suboptimal detection accuracy.

In this book, we aim to document recent advancements and explore future directions for deep learning in logo recognition. This book introduces:

(1) A weakly supervised logo detection approach leveraging attention mechanisms, which emphasizes key features and identifies logo positions without the need for detailed object-level annotations.
(2) A robust logo detection framework, incorporating a feature extraction network enhanced with spatial and channel attention modules, coupled with an anchorfree detection head for precise and efficient logo identification.
(3) Strategies to address the domain shift problem between training and testing data by integrating domain adaptation methods into detection frameworks, improving accuracy across different logo datasets.
(4) An innovative domain adaptation technique based on entropy minimization to bridge the domain gap between synthetic (simulated) and real logo images, facilitating unsupervised logo detection suitable for practical application.

This book also provides valuable insights into feature learning and the application of various deep learning frameworks in logo recognition through detailed experiments and analyses, offering readers a comprehensive understanding of deep learning and logo detection.

December 2023

Prof. Yen-Wei Chen, Ibaraki, Japan/Hangzhou, China
Dr. Xiang Ruan, Kusatsu, Japan
Rahul Kumar Jain, Ibaraki, Japan

About the Authors

Prof. Yen-Wei Chen received his B.E. degree in 1985 from Kobe University, Kobe, Japan. He received his M.E. degree in 1987 and his D.E. degree in 1990, both from Osaka University, Osaka, Japan. From 1991 to 1994, he was a research fellow at the Institute of Laser Technology, Osaka. From October 1994 to March 2004, he was an associate professor and then a professor in the Department of Electrical and Electronic Engineering, University of the Ryukyus, Okinawa, Japan. He is currently a professor at the College of Information Science and Engineering, Ritsumeikan University, Japan. He is an associate editor of the International Journal of Image and Graphics (IJIG) and the International Journal of Knowledge-based and Intelligent Engineering Systems. His research focuses on computer vision, deep learning, and medical image analysis. He has published more than 300 research papers in these fields.

Dr. Xiang Ruan received the B.E. degree from Shanghai Jiao Tong University, Shanghai, China, in 1997, and the M.E. and Ph.D. degrees from Osaka City University, Osaka, Japan, in 2001 and 2004, respectively. He was with OMRON Corporation, Kyoto, Japan, from 2007 to 2016 as an expert engineer. He is currently the founder and CEO of Tiwaki Co., Ltd., Japan. He is also a guest professor at Ritsumeikan University, Japan. His research interests include computer vision, machine learning, and image processing.

Dr. Rahul Kumar Jain received his Ph.D. degree from Ritsumeikan University, Shiga, Japan, in 2022. He has been an intern trainee at Tiwaki Co., Ltd., Japan, since 2019. He is now working as an Assistant Professor at the College of Information Science and Engineering, Ritsumeikan University. His research interests include computer vision, deep learning, and image processing, as well as the applications of Artificial Intelligence in areas including engineering, science, computer science, and healthcare.

Chapter 1

Deep Convolutional Neural Networks

Abstract Over the last decade, deep learning frameworks have revolutionized the field of computer vision, delivering state-of-the-art performance across various applications. The advent of deep convolutional neural networks (CNNs) has placed deep learning at the forefront of several fields including Biomedical, Healthcare, Information and Communications Technology (ICT), Digital Technology (Digitech), and Industrial Automation. Deep learning is a data-driven approach that combines a feature extraction network with a task-specific module for the downstream task. These tasks may include regression, classification, image segmentation, and object detection. Deep learning models involve two primary processes: the training phase and the testing phase. During the training phase, models learn from labeled or annotated data to make predictions. Training involves adjusting the weights and biases of models so that their predictions closely match the actual values, known as the ground truth. During training, a quantified value, known as the objective loss score, is computed to assess the deviation from expected results. This score guides the framework in adjusting its weights to improve accuracy. Following training, the testing or inference phase evaluates the model's performance on previously unseen and unlabeled data, determining its real-world applicability. This chapter provides an introduction to deep learning-based networks, with a concise explanation of common deep learning architectures used for image feature extraction, their key components, and a discussion of some recently introduced and widely utilized object detection frameworks.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 Y.-W. Chen et al., Recent Advances in Logo Detection Using Machine Learning Paradigms, Intelligent Systems Reference Library 255, https://doi.org/10.1007/978-3-031-59811-1_1

1.1 Deep Learning Frameworks

In recent years, there has been a significant shift from rule-based Artificial Intelligence (AI) to Machine Learning-based AI, marking a notable rise in interest towards deep learning technologies. Conventional machine learning algorithms typically involve formulating predictive rules based on the input data and associated outputs. These approaches often necessitate a manual feature extraction process or the creation of hand-crafted features, where data is transformed into representative features like statistical measures or texture information. In contrast, deep learning offers a data-driven methodology that learns to extract features directly from the provided data. This minimizes human intervention, streamlines the identification of intricate patterns within high-dimensional representations, and largely automates feature extraction.

In many academic and industrial fields, deep learning frameworks have achieved remarkable success and efficacy, becoming a cornerstone of modern AI applications. Deep learning has facilitated the development of robust computer-aided solutions for various computer vision tasks. These include healthcare applications like lesion segmentation and detection, disease classification, and anatomical modeling, alongside other fields such as pose estimation, action recognition, autonomous driving, image segmentation, face recognition, object detection, tracking, and specific tasks like logo-brand detection [1–4]. These applications underscore the transformative impact and broad applicability of deep learning frameworks in addressing complex challenges across diverse sectors.

Deep learning is a data-driven paradigm that integrates a feature extraction network with task-specific modules, enabling end-to-end execution of the task and fostering practical application capabilities. A feature extraction network based on deep convolutional neural networks (CNNs) excels at automatically identifying and modeling the most relevant salient features of the given data. A task-specific component, such as classification layers, regression layers, detection heads, or segmentation layers, can be attached on top of the feature extraction network to enable the deep learning framework to perform a range of tasks.
The fundamental concept of AI models based on deep learning involves two key processes. The first is the training phase, in which the model learns from labeled (annotated) data to recognize patterns and make predictions. In neural network architectures, learning proceeds through two primary mechanisms. In the forward pass, data is passed from input to output through the neural network layers, ultimately producing a prediction based on the extracted attributes. In the backward pass (backpropagation), the objective loss (the discrepancy between the predicted output and the ground truth label) is calculated to measure the prediction error, and the model's parameters (weights and biases) are adjusted to minimize it. Here, 'weights' refer to the parameters (such as values in convolutional filters or kernels) that the model adjusts based on the training data and tasks, while 'biases' are additional parameters that facilitate the learning of specific tasks by the neural network. Optimization refers to the process of adjusting the weights and biases of models to minimize the error or loss function during training.

After training, the second process is the testing phase, in which the learned model is evaluated using unseen and unlabeled test data to determine its efficiency and accuracy. This phase is critical for assessing the real-world applicability of the model without further adjusting its parameters. These processes enable deep learning frameworks to understand and interpret large and complex datasets, making them applicable across a wide range of domains and tasks. Figure 1.1 illustrates the training and testing scheme of a deep learning model for the logo detection task. The images in the figure are taken from the FlickrLogos-32 dataset [2].

Fig. 1.1 Deep learning-based logo detection process
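The forward pass, loss computation, backward pass, and parameter update described above can be sketched on a toy problem. The following is a minimal illustration only, not the training procedure used in this book: it fits a single weight and bias to data from a hypothetical rule y = 2x + 1 using mean squared error and plain gradient descent. The data, function names, learning rate, and step count are all assumptions chosen for illustration.

```python
# Toy training data generated by a hypothetical rule y = 2x + 1 (assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def forward(w, b, x):
    """Forward pass: map an input to a prediction with weight w and bias b."""
    return w * x + b


def mse_loss(w, b):
    """Objective loss: mean squared discrepancy between predictions and labels."""
    return sum((forward(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Backward pass: derivatives of the loss with respect to w and b."""
    n = len(xs)
    grad_w = sum(2.0 * (forward(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (forward(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def train(steps=2000, lr=0.05):
    """Optimization: repeatedly step the parameters against the gradient."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Training phase: parameters are adjusted to reduce the loss.
w, b = train()

# Testing phase: the learned parameters are applied to an unseen input
# without any further adjustment.
prediction = forward(w, b, 10.0)
```

Real detection networks have millions of parameters and obtain gradients through automatic differentiation, but the loop has the same four steps.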

1.1.1 Core Component and Key Elements of Deep Learning

Deep learning networks are built on several fundamental components: neural networks (such as Multi-Layer Perceptrons, MLPs), convolutional neural networks (CNNs), fully connected (FC) layers, activation functions (e.g., Softmax, Sigmoid), regularization techniques (like Dropout and Batch Normalization), optimization methods, and performance evaluation metrics [1]. Deep learning frameworks include optimizers, which control the optimization process (weight adjustment); regularization techniques to prevent overfitting and enhance generalization; activation functions for modeling complex non-linear relationships; pooling operations for dimensionality reduction; loss functions to quantify discrepancies between predictions and actual outcomes; and evaluation metrics to assess network performance. These components are explored in the subsequent subsections.

1.1.1.1 Input Data: Image Representation

In deep learning frameworks, especially when dealing with image data, the representation of input data is crucial. In this case, the input training data are images, which are commonly represented and stored as tensors (3D or multi-dimensional arrays), matrices (2D arrays), vectors (1D arrays), and scalar values using programming libraries. A color image, for instance, can be represented as a 3D tensor (height × width × 3 color channels) for feature processing in deep learning. These data representations are important for effectively utilizing deep learning frameworks, as they allow raw image data to be transformed into a format that deep learning models can process and analyze efficiently.
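As a small illustration of the tensor representation described above, the sketch below stores a hypothetical 2 × 3 RGB image as nested Python lists in height × width × channel order. The pixel values and helper names are assumptions for illustration; in practice an array library would be used instead of nested lists.

```python
# A hypothetical 2 x 3 RGB image: 2 rows, 3 columns, 3 channels per pixel.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[255, 255, 255], [128, 128, 128], [0, 0, 0]],
]


def shape(tensor):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


def to_channels_first(img):
    """Rearrange height x width x channel data into channel x height x width."""
    height, width, channels = shape(img)
    return [
        [[img[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


def normalize(img):
    """Scale 8-bit intensities to the range [0, 1]."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in img]


red_channel = to_channels_first(image)[0]   # a 2D matrix (one channel)
first_row = red_channel[0]                  # a 1D vector
first_value = first_row[0]                  # a scalar
```

The same image thus yields a 3D tensor, 2D matrices (one per channel), 1D vectors (rows), and scalars (single intensities), matching the hierarchy of representations listed in the text.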

1.1.1.2 Multi-Layer Perceptron (MLP) Layers

Neurons are the core processing units of a neural network, mimicking the structural elements of the human brain. In neural networks, neurons (or nodes) are interconnected and organized into layers, creating a hierarchical framework. They take in input signals, perform calculations, and generate output signals, allowing the network to map relationships between inputs and outputs. Each neuron multiplies its incoming inputs by corresponding weights, which indicate the significance or impact of each input. The neuron then sums these weighted inputs and adds a bias term, which shifts the result prior to activation. Finally, an activation function is applied to this biased weighted sum, and the result is passed as input to the neurons in the subsequent layer. This last step is crucial for refining the neuron's final output, as it decides whether the neuron should be activated (send a signal) based on the input received.

Activation functions are mathematical functions that determine the output of a neuron for a given input. They introduce non-linearity into the neuron's output, enabling the neural network to learn complex patterns and perform tasks beyond mere linear classification, which maps input directly to output. This non-linearity is vital for the network's ability to capture relationships and enhance the accuracy of its predictions. Popular activation functions include Sigmoid, ReLU (Rectified Linear Unit), and Softmax. Figure 1.2 illustrates a basic depiction of neurons within Multi-Layer Perceptron layers.
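The neuron computation described above (weighted sum, bias, activation) can be written in a few lines. This is a minimal sketch with hypothetical weights, biases, and inputs chosen only to make the arithmetic easy to follow; the function names are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Pass positive values through and clamp negative values to zero."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs, plus bias, followed by the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer(inputs, weight_rows, biases, activation):
    """A layer applies several neurons to the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


# A tiny two-layer perceptron with hypothetical parameters.
inputs = [1.0, 2.0]
hidden = layer(inputs, [[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0], relu)
output = neuron(hidden, [1.0, 0.5], 0.0, sigmoid)
```

Without the activation functions, stacking the two layers would collapse into a single linear map, which is why the non-linearity matters.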

Fig. 1.2 A simple representation of multi-layer perceptrons

1.1.1.3 Convolutional Neural Networks (CNNs)

A Convolutional Neural Network (CNN) is a specialized type of neural network commonly used in deep learning, particularly for analyzing visual imagery. CNNs are especially effective for tasks that involve spatial hierarchies or patterns, such as image and video recognition, image classification, and medical image analysis. As hardware technology has advanced, deep CNNs have become paramount in computer vision tasks.

CNNs are particularly acclaimed for their feature extraction capabilities, offering flexible architectures suitable for diverse applications. The success of CNNs, especially after the introduction of AlexNet, has propelled their widespread adoption in fields like image classification, object localization, and detection [3]. Since then, a numerous CNN architectures has been developed, demonstrating their versatility and cutting-edge performance. Convolutional neural networks in deep learning frameworks automatically identify informative critical features with high-level semantic and spatial characteristics. Next, we discuss the standard architectures of CNNs used for feature extraction and outline their principal components: Convolutional Layers: These layers perform linear mapping through filter operations, extracting spatial and semantic information. Convolutional layers employ filters or kernels that slide across the input, executing element-wise multiplications and aggregating the outcomes. These filtering operations involve sliding window operations, employing kernels that process data over their receptive fields. These filters, initially set with random or pre-trained values, are refined through backpropagation to reduce the discrepancy between the predicted outputs and actual ground truths. Pooling Layers: These layers in deep learning involve capturing key information by selecting the most representative or significant values within specified regions or window. Pooling layers downsample the input to reduce its spatial dimensions, aiding in the network’s computational efficiency. Activation Functions: Functions like ReLU (Rectified Linear Unit).[max(0, x)] introduce non-linearity, converting linear inputs to nonlinear outputs by applying a threshold at zero. Additionally, deep learning frameworks incorporate Batch Normalization to normalize inputs using batch-wise mean and standard deviation, which helps stabilize training. 
The term batch refers to a subset of the training data. During each iteration, a mini-batch (a randomly selected subset of input samples) is used to update the model’s parameters. This keeps training tractable, since the entire dataset usually cannot be processed in a single iteration. Zero Padding is another technique, applied to maintain or modify the spatial dimensions of the output tensor after convolutional and pooling operations. This approach involves surrounding the input data with zeros, ensuring uniform application of the convolutional and pooling layers and mitigating border effects. Figure 1.3 illustrates a simple example of convolution, pooling and ReLU activation functions.

Fig. 1.3 Example of convolution layer, pooling operation, ReLU activation

CNNs progressively extract features from shallow, fine layers to deep, coarse layers, building a robust feature hierarchy. Early layers identify simple attributes like edges and textures, while deeper layers advance towards complex object semantics. This hierarchical learning allows CNNs to fine-tune weights and pinpoint pertinent features effectively. The depth of a CNN-based network, determined by the number of stacked convolutional layers, directly influences the quality of feature extraction. The network’s ability to extract crucial feature information depends on both the number of these layers and the complexity and nature of the datasets and tasks being handled.
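The convolution, zero padding, ReLU and pooling operations described above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions: the function names, the 4 × 4 input and the 3 × 3 kernel are hypothetical, and the sliding-window product is written without kernel flipping, as is common in deep learning frameworks.

```python
def conv2d(image, kernel, pad=0):
    """Slide `kernel` over a zero-padded `image` (lists of lists), stride 1."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # Zero padding: surround the input with `pad` rows and columns of zeros.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = image[i][j]
    out = []
    for i in range(h + 2 * pad - kh + 1):
        row = []
        for j in range(w + 2 * pad - kw + 1):
            # Element-wise multiply over the receptive field, then sum.
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    s += padded[i + a][j + b] * kernel[a][b]
            row.append(s)
        out.append(row)
    return out


def relu(fmap):
    """Apply max(0, x) to every element."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]


# Hypothetical 4 x 4 input and 3 x 3 kernel, chosen only for illustration.
image = [[1, 2, 0, 1], [0, 1, 3, 1], [2, 1, 0, 0], [1, 0, 1, 2]]
kernel = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]
conv_out = conv2d(image, kernel, pad=1)   # padding of 1 keeps the 4 x 4 size
features = max_pool(relu(conv_out))       # pooling halves it to 2 x 2
```

With a padding of 1, the 3 × 3 convolution preserves the 4 × 4 spatial size, and the 2 × 2 pooling with stride 2 then reduces it to 2 × 2.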

1.1.1.4

Fully Connected Layers

FC layers or Fully Connected layers are a type of layer commonly used in deep learning frameworks, particularly towards the end of the network architecture. In an FC layer, every neuron is connected to every neuron in the previous layer, hence the term ‘fully connected’. This architecture allows FC layers to model intricate data relationships effectively and perform specific tasks, like producing final classification probabilities or estimating object sizes.


Fully connected (FC) layers are used in deep learning networks to perform downstream tasks. These layers are typically used to flatten the high-level features learned by previous layers (such as convolutional or pooling layers in the case of a CNN) into a vector that can be used for classifying the input into various categories or for performing regression tasks. Fully Connected layers are crucial for making sense of the features extracted by the convolutional layers, enabling the network to make decisions based on the comprehensive information contained in the input data. These are an essential component of classification and detection frameworks, responsible for mapping the extracted features into the final output, such as class scores in classification tasks.

1.1.1.5

Optimization and Regularization

Optimization refers to the process used to adjust the parameters (such as weights and biases in neural networks) to minimize the loss function, which measures the difference between the model’s predictions and the actual data. The goal of optimization is to find the parameters that lead to the best model performance. Optimization algorithms iteratively update the model’s parameters to minimize the loss until training converges. By employing gradient-based methods, these algorithms ascertain both the direction and magnitude required for each parameter update. This process ensures that the model learns from the training data and improves its performance on the given task, whether it be classification or regression. Stochastic Gradient Descent (SGD), Adam, and RMSprop are widely used optimizers in deep learning frameworks.

While optimization focuses on minimizing the loss to improve training performance, regularization is a technique used to prevent overfitting, where the model performs well on the training data but poorly on unseen data. Overfitting occurs when the model becomes too complex, mistaking random fluctuations in the training data for meaningful patterns. Regularization techniques, such as L1 and L2 regularization, Dropout, and Batch Normalization, add a penalty on the size of the coefficients or adjust the training process to reduce the model’s complexity. Dropout randomly nullifies a subset of neurons during training, effectively simplifying the network. Batch Normalization standardizes the inputs of each layer within a mini-batch, diminishing internal covariate shift and promoting a more stable and faster training process. These strategies encourage the model to learn more generalized patterns, so that it performs consistently across different datasets. Together, optimization and regularization ensure that the model fits the training data while remaining general enough to perform well on new and unseen data, which is important for developing robust and effective models.
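To make the interplay between gradient-based updates and an L2 penalty concrete, the following is a minimal sketch of mini-batch SGD on a one-variable linear model. The function name `sgd_fit`, the synthetic data (y = 2x + 1) and all hyperparameter values are assumptions chosen for illustration, not settings taken from this chapter.

```python
import random


def sgd_fit(data, lr=0.05, weight_decay=0.0, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b with mini-batch SGD on a squared-error loss.

    `weight_decay` scales an L2 penalty (weight_decay * w ** 2) on the weight.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # draw mini-batches in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the mean squared error over the mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            grad_w += 2 * weight_decay * w  # gradient of the L2 penalty
            # Step against the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic, noise-free data from y = 2x + 1 (illustrative assumption).
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w_plain, b_plain = sgd_fit(data)
w_reg, b_reg = sgd_fit(data, weight_decay=0.1)
```

Without the penalty, the fit recovers the generating weight and bias. With the penalty, the learned weight is pulled towards zero, which is the shrinkage effect that L2 regularization relies on.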


1.1.1.6


Loss Functions

Loss functions, also known as cost functions, play a critical role in deep learning by quantifying the discrepancy between a model’s predictions and the actual ground truth values. This quantified loss directs the model to adjust its weights and parameters during training. Throughout the training process, the loss value serves as a critical metric for the model’s performance, guiding the optimization algorithm to minimize the error. The gradient of the loss with respect to the model’s parameters is calculated to inform adjustments during the optimization process, refining the model’s weights and biases accordingly. The selection of an appropriate loss function is pivotal, as it must align with the specific objectives and nature of the problem at hand. Common types of loss functions include:

Binary Cross-Entropy Loss: Utilized for binary classification tasks.

Cross-Entropy Loss: Applied in multi-class classification scenarios.

Mean Squared Error (MSE) or L2 Loss: Predominantly used in regression tasks, calculating the squared differences between predicted and actual values.

For tasks like object detection, which aims to identify and classify objects within an image, a combination of loss functions is often employed. Cross-Entropy Loss assesses classification accuracy, while MSE evaluates the precision of bounding box predictions, including the center positions (x, y), width, and height of the detected objects. Such multi-part losses allow classification and localization to be assessed and optimized jointly.
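The loss functions listed above can be written out directly. The sketch below uses only the Python standard library. The function names and the `box_weight` balancing factor in the combined detection loss are hypothetical choices made for illustration.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for one predicted probability p and a binary label y (0 or 1)."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Multi-class loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[target_index])


def mse(pred, target):
    """Mean of the squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def detection_loss(class_logits, class_index, box_pred, box_true, box_weight=1.0):
    """Classification term plus a weighted box-regression term.

    Boxes are (x, y, width, height); `box_weight` is an assumed balancing factor.
    """
    return (cross_entropy(class_logits, class_index)
            + box_weight * mse(box_pred, box_true))
```

For example, `detection_loss([2.0, 0.5, 0.1], 0, [10, 10, 40, 20], [12, 9, 42, 20])` adds the cross-entropy of the class scores to the mean squared error of the four box parameters.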

1.1.1.7

Evaluation Metrics

Evaluation or measuring metrics are crucial for assessing the performance of deep learning models and algorithms across various tasks such as classification, segmentation, object detection, image enhancement, and regression, tailored to their respective goals and requirements. The fundamental components of evaluation metrics include True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN): True Positive (TP): Occurs when the model correctly predicts a positive class, and the ground truth is also positive. False Positive (FP): Occurs when the model incorrectly predicts a positive class, while the ground truth is negative. True Negative (TN): Occurs when the model correctly predicts a negative class, and the ground truth is also negative. False Negative (FN): Occurs when the model incorrectly predicts a negative class, but the ground truth is positive. These elements form the basis for deriving key evaluation metrics, such as Accuracy, Precision, Recall (Sensitivity), F1-Score (Dice Coefficient), and the Confusion Matrix.


Accuracy: Measures the overall correctness of the model. Calculated as (TP + TN)/(TP + TN + FP + FN).

Precision: Assesses the model’s accuracy in predicting positive classes. Computed as TP/(TP + FP).

Recall (Sensitivity): Evaluates the model’s ability to identify all relevant instances. Defined as TP/(TP + FN).

F1 Score: Balances precision and recall, providing a single score to measure a test’s accuracy. It is 2 × (Precision × Recall)/(Precision + Recall).

ROC-AUC: Represents the area under the Receiver Operating Characteristic curve, plotting true positive rate against false positive rate at various threshold settings.

These metrics are integral to the comprehensive evaluation of a deep learning model’s performance, guiding improvements and ensuring the model’s effectiveness in practical applications.
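As a worked example of the formulas above, the following sketch computes the four count-based metrics from a confusion matrix. The function name and the counts (40 TP, 10 FP, 45 TN, 5 FN) are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a binary classifier evaluated on 100 samples.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# accuracy = 85/100, precision = 40/50, recall = 40/45, f1 = 80/95
```

Each guard returns 0.0 when a denominator would be zero, which is one common convention rather than the only possible one.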

1.2 Feature Extraction Networks

Feature extraction networks, primarily composed of a hierarchical sequence of convolutional layers, are designed to tackle specific deep learning tasks, including classification, semantic segmentation, and object detection. The final output feature space or extracted feature maps can be tailored to specific task requirements, ensuring that they align with the downstream task.

In deep learning-based image classification, the goal is to assign a single label from a predefined set of categories to an image. This process entails creating a model capable of learning patterns and relationships from a labeled dataset to accurately predict categories. Within these models, Fully Connected layers are often integrated with the CNN architecture to produce class probabilities, facilitating categorical assignments.

The scope of image classification extends to more intricate tasks like object detection, where the goal shifts to identifying and categorizing specific regions within images. Here, the network’s output layers are modified to accommodate dense predictions, enabling the recognition of distinct objects. Deep learning-based object detection frameworks not only classify the objects but also pinpoint their locations by generating bounding boxes whose parameters (x, y, width, height) are determined by regression layers. These regression layers in the object detection module may consist of either 1D Convolutional layers or Fully Connected layers.

Semantic segmentation represents another complex task where the aim is to classify each pixel of an image into predefined categories. Networks designed for segmentation output binary masks, delineating regions with semantic similarity, which paves the way for detailed image analysis.

In the following subsections, we describe the designs of some commonly used convolutional neural networks that serve as feature extraction networks in various tasks.


1.2.1 VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition

The VGG network was introduced by Karen Simonyan et al. in their work titled “Very Deep Convolutional Networks for Large-Scale Image Recognition” [4]. This framework achieved second place in the image classification task of the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [5]. The architecture of VGG is known for its simplicity: it stacks convolutional layers in blocks, with each block followed by a max pooling operation. It employs 3 × 3 convolutional filters with a stride of 1 pixel. Max pooling is performed over a 2 × 2 pixel window with a stride of 2 pixels. The architecture ends with three fully connected (FC) layers. The final convolutional layer produces a multi-dimensional feature map, which is converted into a one-dimensional vector through a process called flattening, making the data suitable for processing by the fully connected layers. In each FC layer, every neuron is connected to every element of its input vector. The first two FC layers contain 4096 neurons each, while the third performs 1000-way classification. A softmax layer follows for probability estimation. Notably, the VGG network is recognized for its depth, featuring configurations like VGG-16 and VGG-19 that differ in the number of layers containing learnable weights. Besides classification, VGG has also been adapted for detection and segmentation tasks, with modifications that enlarge the feature space dimensions to suit these tasks. Figure 1.4 shows the structure of the VGG-13 network.
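The effect of the 3 × 3 convolutions and 2 × 2 pooling on the feature map size, and the length of the flattened vector, can be checked with simple arithmetic. The sketch below assumes values that are not stated in this section: a 224 × 224 input, a padding of 1 on every convolution, five pooling stages, and 512 channels in the last convolutional block. The helper name `out_size` is hypothetical.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1


size = 224  # assumed input resolution
for _ in range(5):  # assumed five conv-block + pooling stages
    # 3 x 3 convolution, stride 1, padding 1: spatial size is unchanged.
    # A block may repeat this convolution; the size stays the same each time.
    size = out_size(size, kernel=3, stride=1, pad=1)
    # 2 x 2 max pooling, stride 2: spatial size is halved.
    size = out_size(size, kernel=2, stride=2)

channels = 512                         # assumed width of the last conv block
flat_length = size * size * channels   # length of the flattened vector
fc1_params = flat_length * 4096 + 4096 # weights plus biases of the first FC layer
```

Under these assumptions the final feature map is 7 × 7, the flattened vector has 25,088 elements, and the first 4096-neuron FC layer alone holds roughly 103 million parameters, which illustrates why the FC layers dominate the model size.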

Fig. 1.4 The framework of the VGG-13 network

1.2.2 Residual Networks

Residual Network, usually known as ResNet, was introduced by Kaiming He et al. in their work “Deep Residual Learning for Image Recognition” [6]. This innovative deep convolutional neural network (CNN) architecture achieved notable success in the 2015 ImageNet Large Scale Visual Recognition Challenge [5]. ResNet is renowned for its deep architecture, which can have hundreds or even thousands of layers. ResNet structures comprise multiple blocks or stages, each containing a varying number of convolutional layers. These stages increase the number of channels while reducing spatial dimensions through downsampling. Typically, these stages consist of standard convolution layers, followed by batch normalization, ReLU activation, and max pooling. In the end, a global average pooling layer reduces the spatial dimensions to a single scalar value per feature map, generating a 1D vector. A fully connected layer then maps this feature vector to the number of classes in the dataset. Subsequently, a softmax or sigmoid layer is employed to compute classification probabilities. ResNet comes in variants such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, each named for the network’s depth in layers. In deeper models, those with 50 layers or more, ResNet utilizes a ‘Bottleneck’ design to manage channel dimensionality and computational cost efficiently. This design includes a sequence of three layers: first, a 1 × 1 convolution (pointwise) that reduces the channel dimensions, followed by a 3 × 3 convolution that keeps the number of channels unchanged, and finally another 1 × 1 convolution that expands the channel dimensions. The 1 × 1 kernel, known as pointwise convolution or linear transformation, keeps computational costs low. The 3 × 3 layers, on the other hand, cover a wider spatial neighborhood, enabling the incorporation of broader representations. This design balances computational efficiency with the ability to cover more spatial information.
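The saving offered by the bottleneck design can be illustrated by counting convolution weights. The channel widths used below (256 reduced to 64 and expanded back to 256) are an assumption chosen for illustration and are not given in this section. Biases and batch normalization parameters are ignored, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


# Bottleneck: 1 x 1 reduce, 3 x 3 at the reduced width, 1 x 1 expand.
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

# Comparison: two plain 3 x 3 convolutions at the full width of 256 channels.
plain = 2 * conv_weights(256, 256, 3)

ratio = plain / bottleneck  # how many times more weights the plain block needs
```

Under these assumed widths, the bottleneck uses 69,632 weights against 1,179,648 for the plain pair, roughly a 17-fold reduction, because the expensive 3 × 3 convolution only ever sees the reduced channel count.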
In recent developments, ResNet has been adapted for a variety of applications by modifying the output layer to suit tasks beyond classification, such as segmentation and object detection. This versatility makes ResNet a fundamental architecture for dense prediction tasks in modern deep learning frameworks. An illustrative diagram of the ResNet-50 model architecture is included for reference in Fig. 1.5.

Fig. 1.5 ResNet-50 model architecture

The structure of CNNs plays a pivotal role in extracting feature representations, progressing from shallow, finer layers to deeper, coarser layers, generating robust and valuable feature information. The depth of a network, signified by the stacked convolutional layers, plays an essential role in capturing the semantic, contextual, and structural attributes of objects. The performance of the network in downstream tasks such as classification, segmentation or object detection is naturally affected by the number of training samples and the depth of the network. Balancing depth for enhanced performance against efficiency therefore presents a crucial challenge, necessitating consideration of the dataset size and other training conditions. A shallow network might not significantly improve object recognition, particularly with large datasets. Therefore, there is a consensus towards the benefits of deep architectures for better outcomes. However, as the network gets deeper, a prevalent issue arises: vanishing gradients during parameter updates, which degrade the network’s capabilities and accuracy as depth increases. The ResNet architecture addresses this by introducing residual skip connections, allowing the construction of significantly deeper networks. In ResNet, each network block contains a skip connection, known as an identity shortcut, that bypasses one or more layers. Depending on the structure of the block, these shortcut connections either pass the input through unchanged (identity mapping) or apply a linear transformation using a 1 × 1 convolution; in both cases, their output is added to the output of the stack of layers. This approach of residual mapping significantly counters the vanishing gradient problem. Residual connections demonstrate remarkable efficacy across various applications. A sample representation of a residual block is depicted in Fig. 1.6. In the diagram, let x denote the input feature passed through the residual block, and let F(x) denote the output feature maps generated by the block’s stacked layers, ending at the second layer. The input x is directly added to this output, so the final output is F(x) + x. This process enables the network to learn from the original input features, while the skip connections retain feature representations learned at different depths. This methodology effectively mitigates the issue of network degradation and improves the overall network performance.
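A deliberately simplified toy can show why adding the input back helps gradients survive depth. Here each layer is assumed to be a scalar linear map F(x) = w · x, which is far simpler than a real convolutional block. The function names and the choice of 20 layers with w = 0.1 are illustrative assumptions.

```python
def plain_gradient(weights):
    """d(output)/d(input) for a plain stack y = w_n * ... * w_1 * x."""
    g = 1.0
    for w in weights:
        g *= w            # each layer multiplies the gradient by w
    return g


def residual_gradient(weights):
    """d(output)/d(input) when every layer computes F(x) + x with F(x) = w * x."""
    g = 1.0
    for w in weights:
        g *= (w + 1.0)    # the identity shortcut contributes the extra +1
    return g


def residual_forward(x, weights):
    """Forward pass through the stack of toy residual layers."""
    for w in weights:
        x = w * x + x     # F(x) + x
    return x


weights = [0.1] * 20  # 20 layers with small weights (assumed values)
g_plain = plain_gradient(weights)      # shrinks towards zero with depth
g_resid = residual_gradient(weights)   # stays of order one or larger
```

In the plain stack the gradient is the product of twenty factors of 0.1 and is vanishingly small, whereas with the shortcut each factor becomes 1.1, so the signal reaching the early layers does not vanish.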

Fig. 1.6 Representation of a skip or identity connection


1.2.3 Deep Layer Aggregation Networks

Deep Layer Aggregation (DLA) is a convolutional neural network architecture introduced by Yu et al. in their study titled “Deep Layer Aggregation” [7]. This network is designed for computer vision tasks, particularly dense prediction tasks like object detection and semantic segmentation. Like ResNet, DLA is available at various network depths, each containing multiple blocks or stages of convolutional layers. The essence of DLA lies in its approach to feature aggregation, which iteratively and hierarchically merges semantic and spatial features across different scales and levels. This leads to a more refined extraction of object features, leveraging densely connected layers enhanced by hierarchical and iterative skip connections. These connections are designed to capture locations and categories more accurately, improving resolution and feature representation. The design of DLA facilitates the effective capture of multi-scale information, allowing the network to discern features at varying levels of granularity. Consequently, DLA architectures have shown commendable performance in a variety of tasks, including object detection, semantic segmentation, and image classification, by adeptly harnessing diverse contextual information. DLA strikes a balance between model complexity, computational efficiency, and performance in various visual recognition tasks. Figure 1.7 depicts a simple layout of the Deep Layer Aggregation (DLA) network with hierarchical and iterative connections spanning various stages.

Fig. 1.7 Simplified architectural design of the deep layer aggregation (DLA) network, demonstrating the hierarchical and iterative connections across different stages

1.2.4 Hourglass Framework

The Hourglass network, designed for human pose estimation, was introduced by Newell et al. [8]. This network combines bottom-up and top-down structures within its architecture. It begins by expanding the number of channels and reducing the spatial dimensions of the feature maps via a series of convolutional, strided, and max-pooling operations. The design then mirrors these actions with upsampling operations, creating symmetric feature map blocks in the distinctive Hourglass layout. To mitigate information loss during upsampling, skip connections are strategically integrated. Its approach to maintaining and refining spatial and feature resolution across different stages of processing makes it a powerful tool for tasks beyond pose estimation, including various object detection applications.

In the Hourglass network, multiple modules are arranged in this bottom-up and top-down fashion. These CNN modules include convolution layers complemented by skip connections (ResNet style). Each processing module consists of two residual blocks, and each residual block includes two convolutional layers and a skip connection. Spatial dimensions are initially halved using a stride of 2 in the first convolutional layer of each residual block, while subsequent operations maintain the dimensions, employing a stride of 1. A consistent kernel size of 3 × 3 is used in every convolutional operation. The skip connections in these blocks perform a linear transformation (1 × 1 convolution) to align the dimensions of the input and output feature maps. The feature extraction involves a series of five downsampling and upsampling stages, reflecting the hourglass configuration. Additionally, the architecture utilizes skip connections between corresponding blocks of the network, enabling features to bypass certain layers directly, which preserves essential information throughout the network. The structure of the Hourglass network is depicted in Fig. 1.8.

Fig. 1.8 Hourglass feature extractor network

1.3 Object Detection Frameworks: Detection Head

Image classification refers to the process of assigning a single label to an image from a predetermined set of categories. This concept extends to the more complex task of object detection, a method in computer vision enabling the identification and precise localization of multiple objects within images or videos. This involves not only recognizing objects but also accurately outlining them with rectangular bounding boxes to indicate their positions and sizes. Recent deep learning detection frameworks are designed not just for identifying objects or patterns within images but also for accurately marking their locations with bounding boxes. These frameworks combine an advanced feature extraction network with sophisticated detection mechanisms to provide detailed object analysis. An illustration of logo detection is shown in Fig. 1.9. An object detection model typically comprises two main elements: a feature extraction CNN that discerns relevant features from images, and a detection head that interprets these features to identify and localize objects. While the design and functioning of feature extraction networks were detailed in the previous sections, the following subsections delve into the design and operational principles of various detection heads, showing how they contribute to the overall functioning of object detection frameworks.


Fig. 1.9 Example of logo detection

1.3.1 Detection Head Functionality in Object Detection Frameworks

The detection head is pivotal in object detection, tasked with determining the size, location, and category of each object within the image. Its objective is to identify potential target objects, or Regions of Interest (ROIs), ascertain their locations, and classify them into their respective categories. The detection head performs several key functions: it analyzes feature maps to propose candidate regions, delineates bounding boxes around these regions, assigns a confidence level to each, quantifying the likelihood that the candidate region contains a valid object, and applies Non-Maximum Suppression (NMS) to refine the proposals. Typically comprised of 1D convolutional layers or fully connected layers, detection heads act upon the feature space extracted by the earlier stages of the network (the feature extraction network). They select candidate regions with the highest probability scores as the final detected objects, ensuring that the outcomes are both precise and relevant. The detection head classifies object categories while simultaneously conducting a regression task to pinpoint the exact coordinates of each bounding box. The entire process involves identifying the object’s position, size, and shape within the image. During the training phase, object detection frameworks leverage a multitask loss approach, combining a classification loss (usually cross-entropy loss) for accuracy in predicting object categories with a location regression loss (typically Mean Squared Error loss) for precision in bounding box coordinates. The workflow of an object detection framework, from feature extraction through to the final detection output, is illustrated in Fig. 1.10.


Fig. 1.10 Illustration of object detection pipeline

After the prediction of bounding boxes, the detection head undergoes postprocessing, which includes additional steps to refine the detection. Various steps, including the application of Non-Maximum Suppression (NMS), are performed to refine and finalize the detection outcomes. This ensures that the resulting detections are both accurate and relevant, effectively representing the objects present in the image. Non-Maximum Suppression (NMS) is a critical post-processing technique used in object detection to eliminate redundant or overlapping bounding box predictions, streamlining the set of final detected objects. This method employs a confidence threshold, representing the estimated probability score, to discern between bounding boxes. By applying this threshold, bounding boxes with low confidence scores are filtered out or eliminated from consideration.
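The filtering and suppression described above can be sketched as a greedy procedure. The text mentions only the confidence threshold. The overlap test based on Intersection over Union (IoU), the corner-format boxes (x1, y1, x2, y2), the function names and both threshold values are assumptions that follow the commonly used form of NMS.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of the boxes kept after greedy NMS."""
    # Step 1: drop low-confidence boxes, then sort the rest by score.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 2: keep a box only if it does not overlap a higher-scoring kept box.
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two near-duplicates, one distinct box, one weak box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.82 and is suppressed, the fourth falls below the confidence threshold, and the first and third boxes remain.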

1.3.2 Anchor Box-Based Detection Frameworks

Anchor box-based detection frameworks play a crucial role in contemporary object detection techniques, especially in pinpointing object locations within images. Frameworks like Faster R-CNN have transformed object localization methods. The detection head in these frameworks employs regression layers to estimate the bounding box coordinates (x, y, width, height) by regressing from anchor boxes to the actual bounding boxes of the objects. In this methodology, candidate regions for potential objects are identified using anchor boxes, which are pre-defined bounding box templates with various sizes and aspect ratios. These anchor boxes are systematically distributed across the feature space extracted from the image, enabling a dense prediction grid. Each region suggested by an anchor box is considered a proposal for a possible object location. These candidate regions are then classified, and each is given a probability score reflecting the likelihood of containing a target object. The detection layers in the detection head are specifically trained to refine the dimensions and positions of these anchor boxes so that they best match the real objects’ positions and sizes, based on the computed scores. This fine-tuning aligns the anchor boxes with the actual object boundaries, enhancing the accuracy and relevance of the detection. This system allows for a systematic approach to object localization and classification, providing a structured pathway from generic, predefined proposals to precise object detection. In subsequent sections, we will delve into various widely adopted object detection frameworks that utilize this anchor box-based approach, highlighting their specific mechanisms and applications in the field of computer vision.
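The idea of tiling predefined templates over the feature grid and then regressing from them can be sketched as follows. The stride, scales and aspect ratios are illustrative assumptions. The offset parameterization in `decode` follows the form commonly used in the R-CNN family and is not specified in this section. Both function names are hypothetical.

```python
import math


def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Place len(scales) * len(ratios) anchors at every feature-map cell.

    Each anchor is (center_x, center_y, width, height) in image pixels.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of the cell, mapped back to image coordinates.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is height / width; the anchor area stays s * s.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return anchors


def decode(anchor, deltas):
    """Turn predicted offsets (dx, dy, dw, dh) into a refined box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    # Shift the center in units of the anchor size, and rescale
    # the width and height exponentially.
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


# Assumed setup: a 2 x 3 feature map, stride 16, 3 scales x 3 ratios.
anchors = make_anchors(2, 3, 16, scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
refined = decode((8.0, 8.0, 16.0, 16.0), (0.5, 0.0, math.log(2), 0.0))
```

With three scales and three ratios there are nine anchors per location, so the 2 × 3 grid yields 54 templates. In the `decode` example the box center moves half an anchor width to the right and the width doubles.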

1.3.2.1

Faster-RCNN

The development of object detection frameworks has progressed significantly, beginning with the introduction of R-CNN by Ross Girshick et al. [9], which utilized selective search [10] for proposing image regions. This approach involved processing multiple image patches through a CNN to extract deep features, followed by the use of SVMs [11] for class categorization. Building on this, Fast R-CNN [12] enhanced efficiency by running the CNN once per image and projecting the region proposals onto the resulting feature maps, thus bypassing the need for processing multiple separate patches. Here, selective search still played a role in generating proposals, but the overall process became significantly faster. Advancing further, Ren et al. developed Faster R-CNN [13], introducing a streamlined approach that combined a backbone feature extractor CNN, like ResNet [6] or VGG [4], with a novel component known as the Region Proposal Network (RPN). This combination allowed for the direct generation of region proposals from the feature maps, eliminating the need for separate selective search processes. This small CNN (the RPN) assigns a score to each region, indicating the likelihood of it containing an object, classifying regions in a class-agnostic manner and improving the efficiency and accuracy of object detection. To identify candidate proposals, the RPN in Faster R-CNN analyzes the feature maps using a sliding window approach. This process involves evaluating k anchor boxes at each spatial location in the extracted feature space. These anchor boxes serve as fixed reference boxes that help the detection layers identify potential objects and their shapes within the image. They are designed in predefined scales and aspect ratios to cover various sizes and shapes of potential objects.
If the objectness score of a given anchor box exceeds a set threshold, the associated area in the image is flagged as a possible candidate region. Inside the RPN, each sliding window (typically of size n × n) is converted into a lower-dimensional feature vector, which is then fed into two separate fully connected (FC) layers. One of these FC layers focuses on bounding box regression, tasked with refining the coordinates (x, y, width, height) of each anchor box to closely fit the potential objects. This layer outputs 4k values, representing the adjusted coordinates for each anchor box. The other FC layer is responsible for classification, determining whether each region is likely to contain an object (foreground) or not (background), producing 2k scores that indicate the likelihood of object presence or absence within each anchor box. This process allows the network to simultaneously refine the positions of potential bounding boxes while assessing their relevance as candidate regions for object detection. This method ensures that only the most promising regions, based on both position and probability, are passed forward in the detection pipeline, as depicted in Fig. 1.11.

Fig. 1.11 Potential region proposal candidates at the position of the sliding window

The Region of Interest (ROI) pooling step is a crucial component of the Faster R-CNN architecture. After the Region Proposal Network (RPN) identifies candidate regions, each of these regions (ROIs) is projected onto the feature space generated by the primary feature extraction backbone. These ROIs can vary significantly in size, but the architecture requires consistent input dimensions for the subsequent fully connected layers. To address this variance, ROI pooling standardizes the dimensions of these diverse ROIs into uniform-sized feature maps. Specifically, the ROI pooling layer segments each ROI into equally sized bins. It then applies a max-pooling operation within each bin to extract the most prominent features, ensuring that each ROI is represented by a feature map of fixed size. Following the ROI pooling step, these uniform feature maps are fed into subsequent fully connected layers dedicated to classification and bounding box regression tasks. These layers are tasked with determining the class of each object and refining the coordinates of its bounding box, respectively. This distinct separation allows for accurate object classification alongside precise location determination, contributing to the overall effectiveness of the Faster R-CNN framework in detecting objects within an image. This procedural flow is depicted in Fig. 1.12. The Faster R-CNN framework enables comprehensive end-to-end training, fine-tuning various components such as the Region Proposal Network (RPN), the feature extractor CNNs, and the fully connected (FC) detection layers.
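The ROI pooling step described above, which turns regions of different sizes into fixed-size outputs, can be sketched on a single-channel feature map. The function name, the end-exclusive ROI format (x1, y1, x2, y2) in feature-map cells and the 2 × 2 output size are assumptions for illustration. When an ROI does not divide evenly, the bins here are allowed to overlap slightly, which is one of several possible rounding choices.

```python
def roi_max_pool(fmap, roi, out_h=2, out_w=2):
    """Max-pool the region `roi` of `fmap` into an out_h x out_w grid.

    `roi` is (x1, y1, x2, y2) in feature-map cells, end-exclusive,
    and must cover at least one cell.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceiling for the end.
        r0 = y1 + (i * h) // out_h
        r1 = max(r0 + 1, y1 + ((i + 1) * h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = max(c0 + 1, x1 + ((j + 1) * w + out_w - 1) // out_w)
            # Keep the strongest activation inside the bin.
            row.append(max(fmap[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map whose value encodes its position (row * 6 + column).
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
square = roi_max_pool(fmap, (0, 0, 4, 4))  # a 4 x 4 region
wide = roi_max_pool(fmap, (1, 0, 6, 3))    # a 5-wide, 3-tall region
```

Both regions produce a 2 × 2 output despite their different shapes, which is what lets the subsequent fully connected layers accept them.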
Fig. 1.12 An overview of the Faster R-CNN framework with the ROI pooling process

Training of Faster R-CNN involves multiple stages in which different losses are calculated to iteratively refine the network weights. This approach ensures that each component of the system, from the initial object proposal to the final object classification and localization, contributes effectively to the overall learning process. The bounding box regression occurs in two phases: first aligning the anchor boxes more closely with the objects, and then refining these boxes after ROI pooling to improve accuracy. Throughout training, Faster R-CNN employs distinct sets of loss functions to handle object classification and localization:

• Objective Losses in the Region Proposal Network (RPN):
1. Objectness Classification Loss: This binary classification loss assesses each anchor box to determine whether it likely represents a foreground object or background, essentially evaluating the “objectness” of each box.
2. Bounding Box Regression Loss: This loss adjusts the coordinates of the anchor boxes proposed by the RPN, ensuring they align more closely with the actual ground truth object locations. Typically, the smooth L1 loss is utilized for bounding box regression.
• Objective Losses for the Final Fully Connected Detection Layers:
1. Multiclass Classification Loss: After ROI pooling, the resized regions are classified into distinct object categories or determined to be background. This is treated as a multiclass classification problem.
2. Bounding Box Regression Loss: This loss further refines the position of the bounding boxes through additional fully connected layers, applying a smooth L1 loss to minimize the discrepancies between the ground truth and predicted object coordinates. This step, based on the uniform feature maps from the ROI pooling, is crucial for enhancing the precision of the bounding boxes.

Post-processing steps further polish the final results, ensuring the output from the Faster R-CNN detection system is both accurate and reliable. With its integrated training approach and efficient architecture, Faster R-CNN remains a widely applicable and highly efficient framework for object detection tasks.
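The bounding box regression losses listed above can be sketched as follows. The smooth L1 form with a threshold `beta` and the anchor-relative targets (center shifts scaled by the anchor size, log-ratios for width and height) follow the parameterization commonly used with Faster R-CNN; the function names and the (cx, cy, w, h) box layout are assumptions made for this illustration.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def box_deltas(anchor, gt):
    """Regression targets of a ground-truth box relative to an anchor.

    Both boxes are (cx, cy, w, h). The returned tuple is (tx, ty, tw, th).
    """
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))


# Hypothetical anchor, ground-truth box, and network output.
anchor = (10.0, 10.0, 4.0, 4.0)
gt = (12.0, 10.0, 8.0, 4.0)
target = box_deltas(anchor, gt)
predicted = (0.4, 0.1, 0.6, 0.0)
print(target)
print(smooth_l1(predicted, target))
```

The same loss form applies in both regression phases; only the reference box changes (an anchor in the RPN, a proposal after ROI pooling).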

1.3.2.2 YOLO: You Only Look Once

The initial version of the YOLO (You Only Look Once) detection algorithm was introduced by Joseph Redmon et al. [14], marking a significant advance in object detection by proposing a unified, single-stage approach. Since then, several enhancements to the YOLO framework, involving modifications to the backbone, detection head, and data augmentation techniques, have been reported in subsequent works. This subsection discusses the YOLOv3 detection head [15], which is used in most of the latest versions of the YOLO series detectors [16].

The YOLO architectures employ a feature pyramid CNN backbone [17] to extract multi-scale spatial features from input images. This backbone produces feature maps at various scales, each with different spatial dimensions and depth, enabling the detection of objects across a range of sizes. YOLOv3, in particular, enhances detection capabilities through multi-scale processing, applying detection layers to three distinct scales extracted from different levels of the feature pyramid. In the YOLO framework, detection layers utilize predefined anchor boxes, which vary in aspect ratio and are tailored to different scales, enhancing the adaptability to various object sizes.

A distinctive aspect of the YOLOv3 detection approach is its grid-based (cell-based) strategy. The algorithm divides each feature map into a grid. Each cell in the grid is linked to multiple anchor boxes and detects objects whose centers fall within the cell’s spatial extent. Each cell is responsible for predicting the presence and location of an object’s center point, while the spatial region within the anchor box assists in recognizing the attributes of the object. YOLO divides the input image into a grid whose dimensions match those of the feature map at each scale, so that cell positions correspond accurately to spatial locations in the image.
This grid-based system allows for efficient and precise object detection. During training, YOLO estimates several key parameters for each anchor box within a grid cell. These include the objectness confidence score, which evaluates the likelihood that a given anchor box contains any object, and the bounding box attributes: center coordinates (x, y), width (w), and height (h). The objectness score is trained using a binary cross-entropy classification loss, with each ground truth bounding box assigned an objectness score of 1, indicating object presence regardless of category. Furthermore, YOLO goes beyond merely detecting the presence of an object within a cell; it also predicts the probabilities of the different object classes for the bounding box, enabling multi-class classification within each individual detection. Each prediction therefore comprises 5 + c elements: the center coordinates (x, y), the size (w, h), the objectness score, and a class probability vector of length c (where c is the number of object classes). Throughout training, the model learns from predictions of bounding boxes, objectness scores, and class probabilities for each anchor box at multiple scales within the feature map. This multi-scale strategy is crucial for accurately detecting objects of varying sizes, ensuring the model remains effective across a range of scenarios from small to large objects.

In the training phase of YOLOv3, the ground truth bounding boxes are adjusted to be compatible with the detection mechanism. This process involves several steps to ensure the model can effectively learn from diverse image scales and object sizes. First, the center coordinates of the ground truth bounding boxes (x and y) are normalized by the dimensions of the image, ensuring these values lie between 0 and 1. This normalization is crucial because it makes the targets independent of the image size, so the model can detect objects effectively regardless of the size of the image. Second, the widths and heights of the bounding boxes (w and h) undergo a specific normalization: they are divided by the dimensions of the anchor boxes assigned to them. This step helps the model predict more accurate dimensions by learning the ratio between the ground truth and predefined anchor sizes. After this division, a logarithmic transformation is applied to the width and height ratios; because sizes are recovered from the predictions through the inverse (exponential) mapping, the decoded widths and heights are always positive, which stabilizes training. This transformation is not about making these values fall between 0 and 1; instead, it is about handling the wide range of object sizes encountered in different images. By using the logarithm, small variations in large objects become less significant, whereas variations in smaller objects remain noticeable. These normalization and transformation steps enable the model to learn effectively from a varied dataset and improve its ability to generalize across different scales and aspect ratios of objects found in real-world scenarios.
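The target preparation described above (normalizing the center by the image size, dividing the size by the assigned anchor, and taking the logarithm) can be sketched as below. This is a simplified illustration rather than the YOLOv3 code: the function names, the pixel-based (cx, cy, w, h) box layout, and the single square grid of size `grid_size` are assumptions.

```python
import math


def encode_box(box, image_size, anchor, grid_size):
    """Turn a ground-truth box into a grid cell index and regression targets.

    box: (cx, cy, w, h) in pixels; image_size: (width, height) in pixels;
    anchor: (aw, ah) in pixels; grid_size: number of cells per side.
    """
    cx, cy, w, h = box
    img_w, img_h = image_size
    aw, ah = anchor
    # Step 1: normalize the center by the image dimensions (values in [0, 1]).
    nx, ny = cx / img_w, cy / img_h
    # The cell that contains the center is responsible for this object.
    col = min(int(nx * grid_size), grid_size - 1)
    row = min(int(ny * grid_size), grid_size - 1)
    # Step 2: size relative to the assigned anchor, on a logarithmic scale.
    tw, th = math.log(w / aw), math.log(h / ah)
    return (row, col), (nx, ny, tw, th)


def decode_size(tw, th, anchor):
    """Invert the log-ratio encoding; the result is always positive."""
    aw, ah = anchor
    return aw * math.exp(tw), ah * math.exp(th)


# Hypothetical 416 x 416 image, 13 x 13 grid, and 30 x 60 anchor.
cell, target = encode_box((208.0, 104.0, 60.0, 120.0), (416, 416), (30.0, 60.0), 13)
print(cell, target)
print(decode_size(target[2], target[3], (30.0, 60.0)))
```

A box twice as wide as its anchor and a box half as wide produce targets of equal magnitude and opposite sign, which is one reason the logarithmic scale is convenient for regression.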
Following the model’s initial predictions, post-processing techniques such as non-maximum suppression (NMS) are applied. NMS plays a vital role in refining detection results by removing overlapping and redundant bounding boxes: among boxes that overlap heavily, only the one with the highest confidence score is retained. The YOLO framework employs a combination of loss functions during training to optimize different aspects of detection:

• Bounding Box Regression Loss: Typically Mean Squared Error (MSE) loss, this function refines the location and dimensions of the predicted bounding boxes.
• Objectness Prediction Loss: Binary Cross-Entropy (BCE) loss, which optimizes the model’s ability to discern between background and foreground (object-containing) regions.
• Object Category Prediction Loss: Cross-Entropy (CE) loss, which enhances the model’s accuracy in classifying the detected objects into the correct categories.

YOLOv3, specifically, is designed with a focus on speed, making it well suited to real-time object detection tasks in applications such as video analysis and autonomous driving. This balance between speed and accuracy underpins YOLOv3’s widespread adoption in various practical applications.

During detection, YOLO predicts the bounding box parameters relative to the cell’s location and the anchor box’s dimensions. Post-processing adjustments then scale these relative coordinates and dimensions back to the original size of the input image, ensuring that the final detection results properly align with the true dimensions and position of objects within the image. The effectiveness of YOLO lies in its ability to perform these tasks rapidly and concurrently across the entire image, enabling real-time object detection. Figure 1.13 provides a visual representation of the YOLO framework, showing how these elements combine into a cohesive detection strategy.

Fig. 1.13 Architecture of the YOLO detection framework
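The non-maximum suppression step mentioned above can be sketched with a greedy procedure: boxes are visited in order of decreasing confidence, and a box is kept only if it does not overlap too strongly with a box already kept. The corner-based (x1, y1, x2, y2) box layout and the IoU threshold of 0.5 are assumptions chosen for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```

In a multi-class detector this procedure is usually run separately for each class, so that overlapping objects of different categories do not suppress each other.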

1.3.3 Anchorfree Detection Frameworks

The design and configuration of Regions of Interest (ROIs) and anchor boxes significantly influence the performance and learning efficiency of detection frameworks. Traditionally, detection algorithms have relied heavily on anchor boxes, which require meticulous fine-tuning of hyperparameters such as size, number, and aspect ratio. However, recent advancements in deep learning have sparked a shift toward anchorfree or anchorless detection frameworks. These approaches are gaining popularity due to their faster detection speeds and improved performance compared to traditional anchor-based techniques. Additionally, their lower memory requirements make them more suitable for deployment on devices with limited storage, addressing one of the practical constraints of modern computing devices.

Unlike their anchor-based counterparts, anchorfree detection methods do not rely on predefined template anchor boxes to identify the location and shape of objects. Instead, these methods detect objects based on keypoints, which could be the corners or the center of the objects. By recognizing objects as configurations of keypoints, these frameworks propose Regions of Interest (ROIs) without the need for anchor boxes, learning an adaptable shape vector that corresponds to the dimensions of the object. In the detection layers of anchorfree detectors, keypoints are identified based on the activated pixel values within the extracted feature space. The sizes associated with these keypoints are learned simultaneously within specific regions or patches. This keypoint-centric approach facilitates the direct identification and localization of objects without the intermediary step of matching anchor boxes, leading to a more streamlined and efficient detection process. These anchorfree approaches help alleviate issues such as class imbalance between background and foreground objects and the complexities involved in designing effective anchor boxes. Figure 1.14 illustrates the difference between traditional anchor box priors and the anchorfree designs. In the subsequent sections, we explore some anchorfree detection methodologies and their contributions to the field of object detection.

Fig. 1.14 Illustration of anchor box priors and learning shape vector

1.3.3.1 CornerNet: Detecting Objects as Paired Keypoints

CornerNet [18], introduced by Law et al., represents a shift in object detection methodologies by identifying objects as pairs of keypoints: the top-left corner and the bottom-right corner. This approach moves away from the traditional reliance on anchor box priors and offers a new perspective on the detection process. The CornerNet framework streamlines the detection task by directly mapping the presence of object corners in the image, bypassing the complexities and limitations associated with anchor boxes. The detection module in CornerNet comprises several key components: heatmap generation, corner point offset prediction, and embedding vectors that join the corners of the same object. CornerNet integrates two distinct corner prediction modules within its framework, one dedicated to identifying top-left corners and the other to bottom-right corners. These modules employ convolutional layers structured around the corner pooling method, a technique specifically designed to accurately localize corner points even in complex visual contexts.


CornerNet uses the Hourglass network [8] as its backbone for feature extraction. This CNN structure is adept at capturing and processing image features at multiple scales, which is essential for accurately locating corner keypoints across various object sizes. A complete overview of the CornerNet detection head, showing its components and procedure, is depicted in Fig. 1.15. Next, we describe the components and detection procedure introduced in CornerNet.

Corner Pooling: The CornerNet framework introduces corner pooling, a technique designed to enhance object detection by focusing on bounding box corners without depending on traditional anchor box priors. This method is particularly effective in identifying the extremities of objects, namely the top-left and bottom-right corners, which are essential for accurate bounding box localization. In the context of detecting a top-left corner, the corner pooling process involves:
1. Conducting a horizontal scan from right to left on one feature map, keeping the maximum value encountered so far at each position. This is known as horizontal max-pooling and captures evidence of the object’s topmost boundary.
2. Simultaneously, performing a vertical scan from bottom to top on a different feature map in the same manner. This is known as vertical max-pooling and captures evidence of the leftmost boundary.

These operations aim to capture the most prominent boundary signals from both directions, effectively identifying the location of corner keypoints. The corner pooling module employs convolutional layers that extract and process these boundary signals from the extracted feature maps. The maximum values from the two directions (horizontal and vertical) are then combined by addition, resulting in a more accurate representation of the potential top-left corner of an object. This combination aggregates the significant features that denote the presence of a corner.
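The two directional scans and their combination by addition can be sketched for a single channel as follows. This is an illustrative version operating on nested lists, not the CornerNet implementation; the function name `top_left_corner_pool` and the argument names are assumptions.

```python
def top_left_corner_pool(horiz_map, vert_map):
    """Top-left corner pooling for one channel.

    horiz_map is scanned right to left along each row, vert_map is scanned
    bottom to top along each column; the two running maxima are added.
    """
    h, w = len(horiz_map), len(horiz_map[0])
    out = [[0.0] * w for _ in range(h)]
    # Horizontal max-pooling: maximum over all positions to the right.
    for i in range(h):
        running = float("-inf")
        for j in range(w - 1, -1, -1):
            running = max(running, horiz_map[i][j])
            out[i][j] = running
    # Vertical max-pooling: maximum over all positions below, then add.
    for j in range(w):
        running = float("-inf")
        for i in range(h - 1, -1, -1):
            running = max(running, vert_map[i][j])
            out[i][j] += running
    return out


horiz = [[1, 3, 2],
         [0, 0, 5]]
vert = [[2, 0, 1],
        [4, 1, 0]]
print(top_left_corner_pool(horiz, vert))
```

For the bottom-right corner, the same idea applies with the scan directions reversed (left to right and top to bottom).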
Similarly, a parallel corner pooling process is applied to identify the bottom-right corners, employing the same principle but focusing on the opposite boundaries of the objects. The pooling strategy is illustrated in Fig. 1.16. By incorporating this method, CornerNet significantly improves its ability to localize corners with high precision, overcoming the limitations associated with anchor-based systems and improving the overall efficiency of object detection. The integration of corner pooling and the structure of the detection layers enable CornerNet to efficiently and accurately detect objects by connecting the appropriate corner pairs.

Fig. 1.15 Design structure of CornerNet framework

Fig. 1.16 Corner pooling involves a max-pooling operation for each channel, where the maximum values are extracted in two directions from different feature maps. These two maximum values are combined through addition

Detection Layers in CornerNet: In CornerNet, the detection process is carried out by several convolutional layers within the detection module. These layers perform different functions: generating heatmaps, corner embeddings, and offsets for corner localization and identification.

Heatmaps Generation: The detection layers produce two sets of heatmaps, one representing top-left corners and the other bottom-right corners, created using distinct corner prediction modules. This separation ensures that each heatmap precisely captures the spatial distribution relevant to its respective corner type across all object categories within the image. Each set of heatmaps has C channels, where C is the number of object categories, and each channel has dimensions H × W. These channels act as binary masks, highlighting the locations of corners for each class with activated points amidst a background of negatives. During training, a 2D Gaussian is applied around the positive locations of the ground truth heatmap, which smooths the target distribution and reduces the penalty for predictions that fall close to a true corner. During inference, this allows the model to better pinpoint exact corner locations, as it has learned to associate the characteristics of the smoothed heatmap regions with the precise corner points of objects.

Offsets: CornerNet predicts offsets to refine the alignment of the predicted corners with the actual object locations within the input image. These offsets are minor, yet crucial, adjustments applied to the positions of the corners. Their primary function is to counteract the discretization errors introduced when the image is downsampled and corner locations are mapped onto the lower-resolution heatmap. By applying these offsets, CornerNet enhances the accuracy of corner localization, ensuring a more precise fit between the predicted bounding boxes and the true object boundaries in the image.

Embeddings: CornerNet utilizes an embedding vector for each detected corner to match the top-left and bottom-right corners that belong to the same object. Inspired by the associative embedding technique [20], the network is trained to minimize the distance between embeddings of corner pairs that originate from the same object, aiding in their accurate association. A visualization of embeddings within the detection process is provided in Fig. 1.17. The network predicts an embedding vector for each identified corner, ensuring corners that belong together are matched correctly.
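The Gaussian-smoothed ground truth heatmap and the offset that compensates for downsampling can be sketched as below. The fixed `sigma` is a simplifying assumption made here (in practice the spread is usually tied to the object size), and the function names and the `stride` parameter are hypothetical.

```python
import math


def gaussian_heatmap(height, width, keypoints, sigma=1.0):
    """Ground-truth heatmap for one class: a 2D Gaussian around each keypoint.

    keypoints: list of (x, y) positions on the heatmap grid.
    """
    heat = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
                # Where Gaussians overlap, keep the larger value.
                heat[y][x] = max(heat[y][x], g)
    return heat


def corner_offset(x, y, stride):
    """Fractional part lost when an image coordinate is mapped to the heatmap."""
    return x / stride - x // stride, y / stride - y // stride


heat = gaussian_heatmap(5, 5, [(2, 2)])
print([round(v, 3) for v in heat[2]])
# A corner at image position (13, 22) with a downsampling factor of 4 lands in
# heatmap cell (3, 5); the offset stores the remainder.
print(corner_offset(13, 22, 4))
```

Adding the predicted offset to the integer heatmap location and multiplying by the downsampling factor recovers the corner position at the resolution of the input image.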


Fig. 1.17 Example of an embedding vector for each detected corner

Post-processing and Loss Calculation: Following the heatmap and embedding predictions, a post-processing algorithm is employed to derive the final bounding boxes. The algorithm evaluates the distances between embeddings to pair top-left and bottom-right corners accurately, forming a coherent bounding box for each detected object. The training scheme for CornerNet is structured around the following objective losses:

• Smooth L1 loss, which refines the corner locations through the offsets, ensuring high precision in corner prediction.
• Pull and push loss [21], which regulates the embeddings by drawing the corners of the same object closer together and pushing the corners of different objects apart, fostering correct corner matching.
• Focal loss [22] for the corner heatmaps, which mitigates the impact of the vast number of easy negative locations (non-corner areas) and focuses the model’s learning on hard-to-detect corners.

These losses, weighted appropriately during training, optimize the different facets of the detection layers, ensuring balanced learning across the tasks of localizing, detecting, and associating object corners.
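The pull and push terms can be illustrated with one-dimensional embeddings. In this sketch, the pull term draws the two corner embeddings of each object toward their mean, and the push term separates the means of different objects by at least a margin; the function name, the margin of 1.0, and the list-based inputs are assumptions made for the example.

```python
def pull_push_loss(tl_embed, br_embed, margin=1.0):
    """Pull and push terms for 1D corner embeddings.

    tl_embed[k] and br_embed[k] are the embeddings of the top-left and
    bottom-right corners of object k.
    """
    n = len(tl_embed)
    means = [(t + b) / 2 for t, b in zip(tl_embed, br_embed)]
    # Pull: both corners of an object should sit close to their mean.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_embed, br_embed, means)) / n
    # Push: means of different objects should be at least `margin` apart.
    push = 0.0
    if n > 1:
        push = sum(max(0.0, margin - abs(means[k] - means[j]))
                   for k in range(n) for j in range(n) if j != k) / (n * (n - 1))
    return pull, push


# Two objects whose corners are well grouped and well separated.
print(pull_push_loss([0.0, 2.0], [0.2, 2.0]))
# Two objects whose embeddings are too close: the push term becomes positive.
print(pull_push_loss([0.0, 0.5], [0.0, 0.5]))
```

At inference time, only the distance between a top-left and a bottom-right embedding matters: pairs whose embeddings are close are grouped into one bounding box.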

1.3.3.2 CenterNet: Object Detection as Points

CenterNet, introduced by Zhou et al. [19], stands out among anchorfree detection methods for its approach to object detection. Unlike traditional methods that rely on anchor boxes, CenterNet detects objects by identifying their center points and estimating each object’s dimensions using a shape vector. This method simplifies the detection process by reducing it to the prediction of the center point and the two dimensions (width and height) of the bounding box. It also avoids the complexities associated with anchor-based systems. CenterNet employs CNN architectures such as Deep Layer Aggregation (DLA) [7] or the Hourglass network [8] as its feature extraction backbone. These backbones are instrumental in extracting rich, multi-scale features from input images for accurate object detection. Next, we describe the detection layers of CenterNet and its training strategy.


During the training process, the ground truth Regions of Interest (ROIs) are transformed into class-specific keypoint maps. In these maps, each object is represented by the coordinates of its center, marking the exact location on the keypoint map, while the values of all other pixels are set to zero. This step ensures that each object within the same category is distinctly identified by its central point. Subsequently, these keypoint maps are transformed into Gaussian heatmaps. This step adapts the maps to better represent the shape and scale of the objects. By applying a Gaussian distribution around each central point, the activation extends beyond the exact center, encompassing the surrounding region within the map. This approach helps in capturing the spatial extent of each object, providing a continuous, smoothly varying representation rather than a binary point. Ultimately, these Gaussian keypoint maps are used as the ground truth during training, assisting in the tasks of heatmap classification and object size estimation. The model is trained to accurately predict the center points in the heatmaps of the corresponding category, enabling it to determine the location and category of objects, while other detection layers learn to estimate the dimensions of the objects.

The detection layers within CenterNet are divided into three main components or heads: the heatmap head, the offset head, and the size head. Each head is defined by a set of convolutional layers, tailored to perform a specific function:

Heatmap Head: This component is tasked with producing class-specific heatmaps to identify object categories. It activates pixel values at the center points of objects to generate the detections. A detection confidence score is then taken from the predicted keypoint values at the peaks of these heatmaps; selecting these peaks plays a role similar to non-maximum suppression in identifying the most likely object locations.
Offset Head: To achieve precise object localization, CenterNet employs the offset head to predict a local offset for each object’s center. This adjustment ensures that the center point aligns more accurately with the actual center of the object within the image.

Size Head: This head is responsible for generating shape vectors that represent the width and height of each object. During training, these dimensions are regressed from the learned features at the ground truth center locations. At inference, the size of each object is read from the size maps at the location where the heatmap predicts the object’s center. This allows the model to determine the bounding box coordinates from the height and width given by the shape vector. The activation region essentially indicates where the model predicts the center of an object to be, and the shape vector provides the additional information needed to frame the object precisely within its bounding box.

Unlike some other detection frameworks, CenterNet simplifies the process by using a single keypoint for each object, eliminating the need for an additional grouping step or the embedding of multiple keypoints. This makes the detection process more straightforward and efficient. The overview of CenterNet is depicted in Fig. 1.18. The shape vector and offset values can be learned using a loss such as the L1 distance, while the heatmap head is trained with a classification loss, the focal loss [22], to learn object categories. These mechanisms work together to minimize the disparity between the predicted bounding boxes and the actual ground truth, enabling CenterNet to accurately detect and localize multiple objects within an image.

Fig. 1.18 Overview of the CenterNet framework
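The way the three heads combine at inference can be sketched for a single class as follows. Local peaks of the heatmap are taken as centers, the offset head refines each center, and the size head supplies width and height. The function name, the 3 × 3 peak test, the score threshold, the output stride of 4, and the convention that sizes are stored in input-image pixels are assumptions made for this illustration.

```python
def decode_centers(heatmap, offsets, sizes, stride=4, threshold=0.3):
    """Turn head outputs for one class into (x1, y1, x2, y2, score) boxes.

    heatmap[y][x]: center score; offsets[y][x]: (dx, dy) in heatmap cells;
    sizes[y][x]: (w, h), assumed here to be in input-image pixels.
    """
    h, w = len(heatmap), len(heatmap[0])
    detections = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # A center must be a local peak within its 3 x 3 neighbourhood.
            neighbours = [heatmap[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))]
            if score < max(neighbours):
                continue
            dx, dy = offsets[y][x]
            bw, bh = sizes[y][x]
            # Refine the center with the offset and map it to image pixels.
            cx, cy = (x + dx) * stride, (y + dy) * stride
            detections.append((cx - bw / 2, cy - bh / 2,
                               cx + bw / 2, cy + bh / 2, score))
    return detections


heat = [[0.1, 0.1, 0.1],
        [0.1, 0.9, 0.1],
        [0.1, 0.1, 0.1]]
offs = [[(0.5, 0.25)] * 3 for _ in range(3)]
szs = [[(8.0, 4.0)] * 3 for _ in range(3)]
print(decode_centers(heat, offs, szs))
```

Because each object is represented by a single peak, no corner grouping step is needed, in contrast to the embedding-based pairing used by CornerNet.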

1.4 Summary

This chapter provides an overview of deep learning frameworks and their diverse applications in computer vision. It elaborates on foundational aspects such as core architectures, key building blocks, loss functions, optimizers, and evaluation metrics integral to these frameworks. Additionally, the chapter introduces various feature extraction networks, including well-known architectures such as VGG, Residual Networks, Deep Layer Aggregation Networks, and the Hourglass Framework. The discussion then extends to object detection frameworks, emphasizing the critical role of detection head modules. It covers both anchor box-based and anchorfree detection frameworks, providing insights into significant models including Faster R-CNN, YOLO (You Only Look Once), CornerNet, and CenterNet. The next chapter turns to logo detection, examining the specific challenges and practical applications associated with detecting logos.

References

1. Chen, Y.W., Jain, L.C. (eds.): Deep learning in healthcare. Springer, Berlin/Heidelberg, Germany (2020)
2. Romberg, S., Pueyo, L.G., Lienhart, R., Zwol, R.V.: Scalable logo recognition in real-world images. In: The 1st ACM International Conference on Multimedia Retrieval, USA, pp. 1–8 (2011)
3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
4. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs] (2014)
5. https://image-net.org/challenges/LSVRC/2010/index.php (accessed on January 25, 2023)
6. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
7. Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (2018)
8. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV, Springer (2016)
9. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
10. Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vis. 104(2) (2013)
11. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, New York, USA, pp. 144–152 (1992)
12. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
14. Redmon, J., et al.: You only look once: unified, real-time object detection. arXiv:1506.02640 (2015)
15. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv:1804.02767 (2018)
16. Jocher, G., et al.: Ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. https://doi.org/10.5281/zenodo.3908559 (2020)
17. Lin, T.-Y., et al.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), USA (2017)
18. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. Int. J. Comput. Vis. (2019)
19. Zhou, X., Wang, D., Krahenbuhl, P.: Objects as points. arXiv:1904.07850 (2019)
20. Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems (2017)
21. Newell, A., Deng, J.: Pixels to graphs by associative embedding. In: Advances in Neural Information Processing Systems (2017)
22. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. arXiv:1708.02002 (2017)

Chapter 2

Introduction to Logo Detection

Abstract This chapter discusses logo detection, highlighting its significant applications and the primary challenges it faces. Logo detection is pivotal in various fields such as market tracking, product growth analysis, and consumer behavior studies. It also plays a crucial role in the analysis of advertisements and sponsorships across different platforms, proving to be an indispensable tool for a myriad of applications. Despite its importance, logo detection encounters significant challenges: (1) Lack of Training Data Problem: The efficiency of supervised neural networks depends on the diversity and volume of training images. However, creating this training data, specifically labeling images at the object-level, is both time-consuming and expensive, posing a substantial challenge in the development of robust models. (2) Domain-Shift Problem: The appearance of logos in real-world images presents substantial variation. Factors such as contextual background, projective transformation, resolution, and illumination influence this variability. A domain-shift (domain-gap) problem occurs when the training and test datasets have different data features and characteristics. The domain shift between the training and test data (source and target domains) decreases detection performance. This chapter aims to explore these challenges in depth, review related work, and discuss potential solutions to improve the effectiveness of logo detection methods in practical settings.

2.1 Logo Detection and Its Applications

A logo doesn’t sell (directly), it identifies.
Paul Rand

A logo is a crucial element for every organization and brand, acting as a unique visual identity that sets a brand apart in the crowded marketplace. Automatic detection of logos in real-world images is an interesting and important problem. It is particularly relevant in various real-world applications, including brand promotion, social media monitoring, virtual reality (VR), intelligent transportation, autonomous driving, detection of illegal/fraudulent logos, economic capital assessment, and market research. The automated detection of logos plays a pivotal role in market tracking, facilitating product growth, and providing insights into consumer buying habits. Additionally, it proves invaluable for the analysis of advertisements and sponsorships across different platforms. Several practical uses of logos and logo detection are listed below:

• Brand and Sales Promotion: Logos are central to the visual identity of a brand, playing a key role in promotional activities.
• Market Presence and Tracking: Logo detection aids in analyzing and monitoring the presence of advertisements and sponsorships across various platforms, contributing significantly to market tracking and assessment.
• Market Recognition: Recognizable logos facilitate consumer identification and brand recall, enhancing market recognition.
• Professionalism and Consistency: A well-designed logo bolsters a business’s professional image and ensures consistency across all marketing materials.
• Brand Trust: Established logos foster consumer trust, crucial for long-term business success.
• Digital Presence and Social Media Monitoring: In today’s digital age, a brand’s presence across websites and social media platforms is vital, with logo detection playing a key role in online branding efforts.
• Intelligent Transportation and Autonomous Driving: In the context of intelligent transportation systems and autonomous vehicles, logo detection aids in quick brand recognition and enhances safety communication.
• Illegal/Fraudulent Logo Detection: Automated logo detection can identify unauthorized use of logos, protecting a brand’s unique visual identity from infringement and counterfeiting.
• Economic Capital Assessment and Market Research: Logos are integral to market research and promotional activities, influencing economic capital assessment and strategic planning.

Overall, automating logo detection not only aids in brand promotion and market research but also serves as a preventive measure against copyright infringement, ensuring the protection of intellectual property. Consequently, logo detection has become an indispensable tool for organizations across various sectors, underscoring its importance in the contemporary business landscape.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 Y.-W. Chen et al., Recent Advances in Logo Detection Using Machine Learning Paradigms, Intelligent Systems Reference Library 255, https://doi.org/10.1007/978-3-031-59811-1_2

2.2 Logo Detection Challenges

Logo detection and localization in real-world images is a challenging task due to the diverse appearance of logos. These challenges are exacerbated by factors such as context variability, projective transformations, differences in resolution, and lighting conditions. Unlike typical object detection tasks, logos can appear on different platforms with unknown and varying fonts, colors, sizes, scales, and viewpoints. The complexity of logo detection is compounded in real-world scenarios where images and videos may contain vast arrays of contextual information, leading to issues such as occlusion, rotation, shear, and fluctuating image resolutions. Such diversity can significantly undermine the performance of logo detection frameworks. Moreover, the challenges are heightened by the high inter-class similarity and notable intra-class variance among logo instances, which make it particularly difficult both to distinguish between different logos and to identify variations of the same logo. Figure 2.1 illustrates this complexity with examples of logos at different viewpoints, sizes, and scales. Figure 2.2 further highlights the challenges posed by the appearance of logos in real-world applications. In this figure, while the first image may pose less difficulty, subsequent images illustrate increasing complexity, with the final images presenting scenarios that are emblematic of the challenges prevalent in practical settings. Thus, the problem of detecting logos in the wild continues to be a significant area of research, critical to numerous applications and industries.

Fig. 2.1 Examples of different viewpoint, size and scale of logos

Fig. 2.2 Appearance of logos in normal and complex scene


Moreover, the natural environment hosts a multitude of brands and logos, while deep learning frameworks depend on a data-driven approach necessitating extensive datasets of object-level annotated logo images. This requirement poses significant scalability and applicability challenges for deep learning-based logo detection systems, constrained by the labor-intensive and time-consuming nature of object-level image annotation. These factors collectively contribute to making logo detection an exceptionally challenging and arduous task.

2.3 Related Work in Logo Detection

The domain of logo detection has been extensively explored in various studies, focusing on its application across real-world imagery for a range of purposes. Initially, traditional machine learning techniques dominated the field, leveraging hand-crafted features like Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), color histograms, and edge detection methods for logo identification [1–7]. Nowadays, the emergence of convolutional neural networks (CNNs) has shifted the paradigm towards deep learning-based approaches, which now set the benchmark in object detection due to their enhanced accuracy and effectiveness. Consequently, modern state-of-the-art methods in logo recognition predominantly incorporate deep learning-based frameworks, reflecting a significant shift from traditional to advanced detection techniques.

2.3.1 Deep Learning for Logo Detection

In recent studies on logo detection, Pan et al. [8] used the Fast R-CNN model [9] to train a deep learning detection framework. In [10], the authors addressed logo classification and recognition using deep learning networks. For logo classification, Bianco et al. [11] utilized a custom deep CNN architecture, incorporating the Selective Search algorithm [12] to generate object proposals. These region-of-interest (ROI) logo objects are subsequently classified. Fehervari et al. [13] trained a Faster R-CNN model [14] for class-agnostic logo detection, treating all logo categories as a single category. They then trained a separate network [15] using triplet loss and proxies [16] to retrieve logo images. Su et al. [17] suggested using synthesized training images, combining real and synthesized logo images to fine-tune deep learning models for logo detection. In [18], Su et al. explored the impact of data augmentation using synthesized logo images, employing GANs [19] for coherent training logo image synthesis. Generative Adversarial Networks (GANs) are deep learning models that consist of two networks, a generator and a discriminator. These networks are trained simultaneously through adversarial training to generate realistic data.


Adversarial training is a machine learning technique where a model is trained by exposing it to adversarial examples [23]. These adversarial examples can be considered challenging inputs, intentionally generated to mislead the model. This approach increases the robustness and generalization capabilities of the network. Model self-learning from web-collected data was implemented in [20]. The authors selected compatible training logo images from a noisy web dataset to iteratively update the training dataset for a deep learning-based detection model [14]. This approach was refined in [21], introducing model self-co-learning. Two detection models [14, 22] were used to identify compatible training images from the noisy dataset, cross-feeding them cooperatively. This cross-model training aimed to improve performance. In [24], the authors primarily focused on building a large logo dataset, namely the Logo-2K+ dataset. They performed logo image classification using a pyramid feature extraction network [25] by generating reliable logo regions from the input image. A sub-network was also added to determine the most informative regions. More recently, in order to enhance the performance of logo detection, Wang et al. [26] incorporated focal loss [27] and CIoU loss [28] into the YOLOv3 [29] framework. They also created a large dataset with over 3000 logo categories, called LogoDet-3K. Although these methods perform satisfactorily, they need a significant number of object-level annotated training images from real-world domains for model learning. In [30], Bernabeu et al. employed a CNN for multi-label classification of logo images based on their shape, color, and other characteristics. They used another CNN for similarity search. Most of these methods rely on object-level annotations for each logo class, emphasizing supervised learning with real training images.
However, object-level annotation for real images is not only time-consuming and labor-intensive but also impractical in certain cases.
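The generator and discriminator training described earlier in this subsection is usually written as a two-player minimax game. For reference, the objective below is the standard formulation from the original GAN paper [23]; it is included only as background and is not specific to any of the logo detection methods surveyed here. The symbols G (generator), D (discriminator), x (a real sample), and z (a noise vector) follow that paper and do not appear elsewhere in this chapter.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator is trained to increase V by telling real samples from generated ones, while the generator is trained to decrease it by producing samples that the discriminator accepts as real.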

2.4 Proposed Approaches for Logo Detection

Recent advancements in deep learning have led to significant improvements in object detection, setting new benchmarks across research domains. However, most existing models operate under a supervised learning framework, heavily dependent on extensive, human-annotated training datasets. This reliance presumes homogeneity between training and testing data distributions, an assumption often disproved in real-world applications due to the prevalence of domain gaps or shifts, resulting in decreased model performance on new, unseen datasets. Traditional models then require retraining with newly annotated data, a process that is often impractical because data labeling is time-intensive and costly.

Addressing the limitations inherent in traditional annotation-intensive approaches, this book proposes novel strategies tailored to mitigate the challenges associated with extensive data labeling and domain shift in logo detection. The outlined methodologies pivot towards leveraging fully supervised, weakly supervised, unsupervised, and domain-adaptive learning paradigms, aiming to streamline the training process while bridging the domain gap between different datasets. Key innovations presented include:

1. A weakly supervised logo detection approach that circumvents the need for detailed object-level annotations like bounding boxes, substantially reducing manual labeling efforts [31, 32].
2. An advanced detection framework featuring robust feature extraction networks enhanced by self-attention modules and adopting an anchor-free detection head for improved logo identification [33].
3. Integration of spatial and channel attention mechanisms to refine the feature extraction process, focusing on areas that are likely to contain logos and thereby increasing detection accuracy [34].
4. Development of lightweight CNN architectures to facilitate real-time logo detection without compromising accuracy, addressing the demands of practical application scenarios [34].
5. Introduction of domain adaptation techniques aimed at generalizing model applicability across varied test domains, minimizing the necessity for extensive domain-specific annotations [34–36].

These proposed solutions emphasize efficiency, accuracy, and adaptability, targeting the overarching goal of creating more generalized and practical logo detection systems capable of operating under varied conditions and datasets.
By addressing critical issues such as the labor-intensive nature of data annotation and the challenges posed by domain shifts, the outlined approaches promise to enhance the operational viability of logo detection frameworks, making them more suited to diverse real-world applications. The forthcoming chapters will delve into these proposed methods in detail, discussing their implementation, efficacy, and potential impact on the field of logo detection. Figure 2.3 illustrates the conceptual representation of the defined problems and their corresponding solutions.


Fig. 2.3 Conceptual representation of the defined problems and their corresponding solutions

2.5 Summary

This chapter explores the field of logo detection, highlighting its significance in market tracking and analyzing consumer behavior. It outlines the major challenges faced in logo detection, such as data annotation difficulties and domain shift issues. Additionally, it proposes innovative solutions aimed at improving logo detection techniques for real-world applications. By addressing these challenges, the chapter lays the groundwork for more effective and practical logo detection methods, fostering better understanding and implementation in various scenarios.

References

1. Li, K.W., Chen, S.Y., Su, S., Duh, D.J., Zhang, H., Li, S.: Logo detection with extendibility and discrimination. Multimed. Tools Appl. 72(2) (2014)
2. Li, Z., Schulte-Austum, M., Neschen, M.: Fast logo detection and recognition in document images. In: The 20th International Conference on Pattern Recognition, pp. 2716–2719 (2010)
3. Boia, R., Bandrabur, A., Florea, C.: Local description using multi-scale complete rank transform for improved logo recognition. In: IEEE International Conference on Communications, pp. 1–4 (2014)
4. Romberg, S., Lienhart, R.: Bundle min-hashing for logo recognition. In: Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, pp. 113–120. ACM (2013)
5. Kalantidis, Y., Pueyo, L.G., Trevisiol, M., Zwol, R.V., Avrithis, Y.: Scalable triangulation-based logo recognition. In: ACM International Conference on Multimedia Retrieval (2011)
6. Revaud, J., Douze, M., Schmid, C.: Correlation-based burstiness for logo retrieval. In: ACM International Conference on Multimedia, pp. 965–968 (2012)


7. Romberg, S., Pueyo, L.G., Lienhart, R., Zwol, R.V.: Scalable logo recognition in real-world images. In: The 1st ACM International Conference on Multimedia Retrieval, USA, pp. 1–8 (2011)
8. Pan, C., Yan, Z., Xu, X., Sun, M., Shao, J., Wu, D.: Vehicle logo recognition based on deep learning architecture in video surveillance for intelligent traffic system. In: IET International Conference on Smart and Sustainable City, pp. 123–126 (2013)
9. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
10. Iandola, F.N., et al.: Deeplogo: Hitting logo recognition with the deep neural network hammer. arXiv:1510.02131 (2015)
11. Bianco, S., Buzzelli, M., Mazzini, D., Schettini, R.: Deep learning for logo recognition. Neurocomputing 245, 23–30 (2017)
12. Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vis. 104(2) (2013)
13. Fehervari, I., Appalaraju, S.: Scalable logo recognition using proxies. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 715–725 (2019)
14. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
15. Hu, J., Shen, L., Sun, G.: Squeeze and excitation networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7142 (2017)
16. Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No fuss distance metric learning using proxies. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 360–368 (2017)
17. Su, H., Zhu, X., Gong, S.: Deep learning logo detection with data expansion by synthesising context. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 530–539 (2017)
18. Su, H., Zhu, X., Gong, S.: Open logo detection challenge. In: Proceedings of the British Machine Vision Conference, UK (2018)
19. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.: Image-to-image translation with conditional adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
20. Su, H., Zhu, X., Gong, S.: Weblogo-2m: scalable logo detection by deep learning from the web. In: Proceedings of the International Conference on Computer Vision Workshops, Italy, pp. 270–279 (2017)
21. Su, H., Zhu, X., Gong, S.: Scalable logo detection by self co-learning. Pattern Recognition 97, 107003 (2020)
22. Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, USA, pp. 7263–7271 (2017)
23. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
24. Wang, J., et al.: Logo-2K+: A large-scale logo dataset for scalable logo classification. arXiv:1911.07924 (2019)
25. Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of Conference Computer Vision Pattern Recognition, pp. 936–944 (2017)
26. Wang, J., Min, W., Hou, S., Jiang, S.: LogoDet-3K: A large-scale image dataset for logo detection. ACM Trans. Multimedia Comput. Commun. Appl. 18(1), Art. no. 21 (2022)
27. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2999–3007 (2017)
28. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: faster and better learning for bounding box regression. In: Proceedings of AAAI Conference on Artificial Intelligence, pp. 12993–13000 (2020)
29. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv:1804.02767 (2018)


30. Bernabeu, M., Gallego, A.J., Pertusa, A.: Multi-label logo recognition and retrieval based on weighted fusion of neural features. arXiv:2205.05419 (2022)
31. Jain, R.K., Iwamoto, Y., Watasue, T., Nakagawa, T., Sato, T., Ruan, X., Chen, Y.W.: Weakly supervised logo detection using a dual-attention dilated residual network. In: The 23rd Meeting on Image Recognition and Understanding (MIRU), Japan (2020)
32. Jain, R.K., Iwamoto, Y., Watasue, T., Nakagawa, T., Sato, T., Ruan, X., Chen, Y.W.: Weakly supervised logo detection using a dual-attention dilated residual network. IIEEJ Trans. Image Electron Visual Comput. 9(1), 12–19 (2021). https://doi.org/10.11371/tievciieej.9.1_12
33. Jain, R.K., Watasue, T., Nakagawa, T., Sato, T., Iwamoto, Y., Ruan, X., Chen, Y.W.: LogoNet: Layer-aggregated attention CenterNet for logo detection. In: Proceedings of the IEEE International Conference on Consumer Electronics (IEEE-ICCE), USA, pp. 1–6 (2021). https://doi.org/10.1109/icce50685.2021.9427658
34. Jain, R.K., Watasue, T., Nakagawa, T., Sato, T., Iwamoto, Y., Ruan, X., Chen, Y.W.: LogoNet: a robust layer-aggregated dual-attention anchorfree logo detection framework with an adversarial domain adaptation approach. Appl. Sci. 11(20), 9622 (2021). https://doi.org/10.3390/app11209622
35. Jain, R.K., Sato, T., Watasue, T., Nakagawa, T., Iwamoto, Y., Ruan, X., Chen, Y.W.: Unsupervised logo detection using adversarial learning from synthetic to real images. IEICE Tech. Rep. 121(304), PRMU2021, 43–44 (2021). https://ken.ieice.org/ken/paper/20211216BC6S/eng/
36. Jain, R.K., Sato, T., Watasue, T., Nakagawa, T., Iwamoto, Y., Ruan, X., Chen, Y.W.: Unsupervised logo detection using adversarial learning from synthetic to real images. IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–14 (2023). https://doi.org/10.1109/TETCI.2023.3256475

Chapter 3

Weakly Supervised Logo Detection Approach

Abstract The previous chapters discussed deep learning-based feature extraction frameworks and the challenges of detection tasks, with a specific focus on logo recognition. Most existing logo detection methods rely on precise object-level bounding box (position bounding box) annotations, which poses substantial challenges in practical settings because such annotation is labor-intensive. To address this issue, this chapter presents a novel weakly supervised logo detection algorithm that enables effective logo recognition without requiring detailed bounding box annotations. We begin by exploring an approach to weakly supervised logo recognition that utilizes only image-level annotations. We then explore the integration of attention mechanisms and a feature extraction network using dilated convolutions, aimed at compensating for the lack of precise object localization typically provided by bounding box annotations. In a weakly supervised training scheme, there is no guidance on object positions, as bounding box annotations are not available during training. The primary goal is to boost performance by adeptly utilizing image-level labeled data. To enhance logo image classification and localization, the chapter introduces the application of attention-based mechanisms with Convolutional Neural Networks (CNNs). These mechanisms are designed to emphasize critical attributes and information, assigning greater significance to the spatial and semantic aspects of objects. This scheme relies on image-level annotations and demonstrates substantial scalability and adaptability for widespread real-world deployment.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 Y.-W. Chen et al., Recent Advances in Logo Detection Using Machine Learning Paradigms, Intelligent Systems Reference Library 255, https://doi.org/10.1007/978-3-031-59811-1_3

3.1 Weakly-Supervised Logo Detection Using Image-Level Annotation

Deep learning-based model training typically falls into two main categories: fully supervised learning, which relies on object-level annotations, and weakly supervised learning, which utilizes image-level annotations. Figure 3.1 illustrates the difference between training data used for these two approaches. Object-level annotation requires marking specific regions within an image with corresponding categories to denote the presence of particular objects. This method demands detailed location information, that is, coordinates outlining the rectangular boundaries of these objects. In contrast, image-level annotation labels the whole image with one or more categories to signify the presence of specific objects, without specifying their exact locations.

Fig. 3.1 Annotations for training of deep learning based networks

Typically, fully supervised training employs datasets with object-level annotations, necessitating manual labeling of position bounding boxes for each object. Although this approach informs the network about the precise object positions, it comes with a significant drawback: it requires extensive manually labeled training data, which is both time-consuming and expensive to produce. This labor-intensive and costly annotation effort makes fully supervised methods less suitable for practical use. Figure 3.2 shows the training process of fully supervised learning with position bounding boxes and provides examples of object-level logo annotations.

Weakly supervised learning provides a more viable approach for real-world applications, especially when dealing with large numbers of logos with cluttered backgrounds and varying scales. Acquiring training data with image-level annotations is substantially easier than manually annotating object-level bounding boxes. Specifically for logo detection, image-level annotation simply involves marking an entire image as containing a certain logo without detailing the logo’s specific location within the image. This method is markedly less labor-intensive than object-level annotation, as it removes the necessity for annotators to pinpoint the exact position of the logo within each training sample. However, in weakly supervised training, there is an inherent lack of direct guidance for object localization due to the absence of bounding box annotations. The absence of explicit position coordinates complicates the process of accurately identifying the exact location of logos.
Figure 3.3 demonstrates the challenges associated with this constraint, underscoring the trade-offs between the ease of annotation and the granularity of detection information provided. This highlights the need for efficient weakly supervised techniques that effectively utilize image-level annotations, which signal the presence but not the precise location or size of logos within the images.

Fig. 3.2 Learning process using positioning bounding boxes for fully supervised training and examples of object-level annotations

Fig. 3.3 Training models with image-level annotations provides no precise localization information, which hinders accurate object localization

We aim to recognize logos and predict their locations within images using only image-level annotations. In such scenarios, attention mechanisms can prove to be beneficial. These techniques focus on the relevant information and emphasize the most salient features, improving performance and yielding more precise predictions. In our approach to logo detection, we deploy a specialized framework equipped with attention modules and dilated convolutional operations. This framework employs both spatial and channel attention mechanisms. The spatial attention module calculates attention weights that aid in pinpointing the logo’s spatial location within the image, allowing for accurate identification of crucial areas. Meanwhile, the channel attention module prioritizes semantic features within the global feature maps, facilitating the differentiation and categorization of logos. By incorporating attention-based mechanisms into the framework, we significantly enhance image classification and localization accuracy compared to conventional network architectures. Figure 3.4 shows a simple overview of the proposed attention-based weakly supervised training approach. In the following section, we delve into the specifics of attention mechanisms and explore related work in this area.

Fig. 3.4 Weakly supervised logo detection using attention mechanism
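The difference between the two annotation styles contrasted in this section can be made concrete with a small data-structure sketch. The class names, field names, file name, brand labels, and coordinates below are all hypothetical and chosen only for illustration; they do not come from any dataset used in this book.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BoxAnnotation:
    """Object-level label: a class name plus a bounding box in pixels."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class ObjectLevelSample:
    """Fully supervised sample: every logo instance carries its own box."""
    image_path: str
    objects: List[BoxAnnotation] = field(default_factory=list)


@dataclass
class ImageLevelSample:
    """Weakly supervised sample: only the logo classes present in the image."""
    image_path: str
    labels: List[str] = field(default_factory=list)


def to_image_level(sample: ObjectLevelSample) -> ImageLevelSample:
    """Drop the boxes and keep the distinct class names, in first-seen order."""
    seen: List[str] = []
    for obj in sample.objects:
        if obj.label not in seen:
            seen.append(obj.label)
    return ImageLevelSample(sample.image_path, seen)


# Hypothetical example: file name, brand names, and coordinates are made up.
full = ObjectLevelSample(
    "street_scene.jpg",
    [
        BoxAnnotation("brand_a", (34, 50, 120, 96)),
        BoxAnnotation("brand_a", (300, 210, 352, 240)),
        BoxAnnotation("brand_b", (410, 18, 470, 60)),
    ],
)
weak = to_image_level(full)
print(weak.labels)  # ['brand_a', 'brand_b']
```

Converting from object-level to image-level labels is trivial, as shown, but the reverse direction has no such shortcut: the positions and the number of instances are simply absent from the weak label, which is the gap the attention-based approach of this chapter aims to close.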

3.2 Attention Mechanisms

The advent of self-attention mechanisms has revolutionized various computational tasks by effectively modeling global dependencies within deep feature extractions. A landmark development occurred in 2017 with the adoption of self-attention mechanisms and Transformers in natural language processing [1]. Following this, the introduction of non-local blocks in [2] marked a significant advancement for vision tasks, enabling the capture of long-range dependencies through the construction of key, value, and query components tailored for visual data. A procedural depiction of the spatial attention mechanism is provided in Fig. 3.5, while the channel attention mechanism is illustrated in Fig. 3.6.

Fig. 3.5 Explanation of spatial-wise attention mechanisms

Fig. 3.6 Explanation of channel-wise attention mechanisms

The design of attention mechanisms involves convolutional layers dedicated to modeling important regions and semantic features through multi-dimensional cross-referencing, a process known as self-attention. This process encompasses several key steps: linear transformation of input features, generation of a similarity map through pairwise similarity scoring of transformed features, application of learnable weight matrices to input feature maps, calculation of attention weights, and synthesis of new weighted feature maps based on these attention weights. The first step linearly transforms the input features using 1×1 convolutional layers into a higher-dimensional space, enabling better discrimination of crucial regions. Subsequently, the input feature maps are multiplied by learnable weight matrices. Then, a similarity map is generated by computing the pairwise similarity scores between the transformed feature vectors. This is normally achieved through operations such as the dot product or cosine similarity. The similarity map highlights the relationships between different elements in the input data. The similarity scores obtained in the previous step are then used to calculate attention weights. These weights signify the importance or relevance of each element in the input data with respect to the others. Attention weights are computed using techniques like Softmax normalization to ensure that the weights sum to one. Once the attention weights are computed, they are applied to the original feature maps to compute weighted feature maps. This step synthesizes new weighted feature maps, where each element


is a weighted combination of the original features, with the weights determined by the attention mechanism. During the training phase, the parameters involved in the attention mechanism, including the weight matrices used for transformation and the attention weights, are adaptively learned through backpropagation. This enables the framework to dynamically adjust the attention mechanism based on the specific task and input data, optimizing its ability to capture relevant information. These steps collectively constitute the attention module, which facilitates the selective focus on informative regions of the input data, enhancing the performance of deep learning models across various tasks. Because these methods are instrumental in refining and enhancing informative features, attention mechanisms have gained significant importance within deep learning architectures. Over the years, several spatial and channel attention mechanisms have been introduced and applied across diverse applications, including image classification, semantic segmentation, and object detection. One attention method proposed in [3] emphasizes spatial feature enhancement via a mask module. This module is employed with a trunk branch comprising bottom-up and top-down feed-forward structures. The authors of [4] introduce a notable mechanism called the SENet module, which calculates channel-wise weights of convolutional layers to capture channel-wise responses. This module signifies a leap in modeling channel-wise attributes of images. Furthermore, [5] introduces the ECANet block, which is designed to model channel-wise features more effectively and efficiently. A method combining channel and spatial attention blocks sequentially is presented in [6], enhancing the model’s ability to identify and prioritize salient features within images.
Additionally, [7] proposes an architecture explicitly designed to capture and learn spatial information via attention mechanisms, marking a significant advancement in the application of attention for spatial feature discernment. Moreover, [8] incorporates a spatial and channel dual-attention mechanism for the classification and localization of liver lesions in CT images, demonstrating the usefulness and efficacy of attention mechanisms in medical imaging tasks. Such advancements not only show the adaptability of attention mechanisms across different domains but also highlight their contribution to enhancing model performance by focusing on relevant features. In the next section, we delve into the specifics of our proposed framework, illustrating how these attention methods are integrated and adapted for logo detection.
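The self-attention steps listed in this section (linear transformation, similarity map, Softmax weights, weighted feature maps) can be sketched in a few lines of NumPy. This is a minimal illustration in the style of a non-local block, not the module used in this book: the function names, the projection size, the random weights, and the residual connection at the end are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(features, w_query, w_key, w_value):
    """Apply spatial self-attention to a (C, H, W) feature map.

    w_query and w_key have shape (C_qk, C); w_value has shape (C, C).
    A 1x1 convolution is a per-position linear map, so each one is
    written here as a matrix product over the channel axis.
    """
    c, h, w = features.shape
    x = features.reshape(c, h * w)        # (C, N) with N = H * W positions

    # Step 1: linear transformations (the 1x1 convolutions).
    q = w_query @ x                       # (C_qk, N)
    k = w_key @ x                         # (C_qk, N)
    v = w_value @ x                       # (C, N)

    # Step 2: similarity map from pairwise dot products between positions.
    similarity = q.T @ k                  # (N, N)

    # Step 3: attention weights; each row is normalized to sum to one.
    attention = softmax(similarity, axis=-1)

    # Step 4: every output position is a weighted sum of all value vectors.
    out = v @ attention.T                 # (C, N)

    # Residual connection: an assumption, common in non-local blocks.
    return (x + out).reshape(c, h, w), attention


# Toy input: 8 channels on a 4x4 grid, with random (untrained) weights.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
wq = rng.standard_normal((16, 8)) * 0.1
wk = rng.standard_normal((16, 8)) * 0.1
wv = rng.standard_normal((8, 8)) * 0.1

refined, attn = spatial_self_attention(feat, wq, wk, wv)
print(refined.shape, attn.shape)  # (8, 4, 4) (16, 16)
```

In a trained network the three weight matrices would be learned by backpropagation rather than drawn at random, and row i of the attention matrix would indicate which positions the network consults when refining position i.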

3.3 Weakly Supervised Logo Detection with Dual-Attention Dilated Residual Network

We introduce a weakly supervised learning approach with attention mechanisms for logo detection, addressing the challenge of object-level annotation. Our method employs both spatial and channel-wise attention mechanisms to effectively highlight and learn crucial features for logo detection, thus enabling the precise localization of logo regions in a weakly supervised setting. This approach allows for the automatic identification of significant features necessary for logo detection and accurately locates logo positions using only image-level annotations, eliminating the need for specific position annotations. The integration of attention mechanisms significantly improves localization accuracy in images with complex backgrounds, by directing the network’s focus towards the most relevant features for logo detection. Figure 3.7 illustrates the network architecture with attention mechanisms. The components of the proposed framework are discussed in the following subsections.

3.3.1 Feature Extraction Backbone Network In our approach, we incorporate a Dilated Residual Network (DRN) [10] architecture enhanced with attention mechanisms for effective feature extraction. Dilated residual network (DRN) generates output feature maps with high-resolution. By integrating spatial and channel attention mechanisms in various configurations within the DRN, we further refine these output feature maps, making the network adept at focusing on salient features crucial for logo detection. Additionally, for tasks requiring weakly supervised localization, we generate class activation maps using the Gradient-weighted Class Activation Map (Grad-CAM) [12] method. This technique facilitates the visualization of potential logo regions, offering insights into the significant regions for logo presence. Convolutional Neural Networks (CNNs) are structured as a series of convolutional layers, pooling layers, and additional components. These convolutional operations normally use kernel sizes of 1.×1, 3.×3, and 5.×5. These kernels capture local features and their interactions with adjacent areas. Capturing larger spatial contexts can be beneficial, although increasing the kernel size may lead to higher computational costs. To address the challenge of computational cost, the concept of dilation in convolutional kernels was introduced. Dilated convolution networks are first introduced in [9]. Dilated convolutions enable the network to encompass an expanded receptive field, enabling a wider view by aggregating multi-scale contextual information. Dilated convolutions enable the network to maintain a high resolution of output feature maps throughout its layers. This ensures that the network retains detailed spatial information throughout processing, a crucial factor for tasks like logo detection where fine details may determine classification accuracy. Figure 3.8 illustrates a dilated convolution operation. 
Fig. 3.7 Overview of the proposed network with dual-attention for weakly-supervised logo detection

Fig. 3.8 Dilated convolution operation

Subsequently, Dilated Residual Networks (DRNs) were introduced in [10] with the aim of enhancing network performance without increasing depth or complexity. These networks have demonstrated superior performance compared to their non-dilated counterparts in image classification and have shown high accuracy in dense prediction tasks, such as semantic segmentation and object localization. DRNs combine residual (skip) connections [11] with dilated convolutions in their later layers. This combination allows the convolutions to cover a larger area, thus enlarging the receptive fields in the network. Pooling operations reduce the spatial dimensions along the network, and this downsampling leads to a loss of detail for small objects. Therefore, the last block of the DRN produces a high-resolution global feature map (e.g., 28 × 28) without further decreasing the spatial dimensions. Figure 3.9 illustrates the transformation of a residual network (ResNet-50 [11]) into a dilated residual network (DRN-54) [10]. Both networks comprise five groups (blocks) of convolutional layers. In the figure, Group-4 and Group-5 denote the fourth and fifth convolutional blocks, respectively. Notably, ResNet uses normal convolution operations ($d = 1$), while DRNs use an increasing dilation factor ($d = 1, 2, 4$) to capture coarse features effectively.

Our framework incorporates a dilated residual network (DRN-54) equipped with spatial and channel attention mechanisms. Our weakly supervised logo detection algorithm is based on the dual-attention mechanism proposed in our previous work [8]. The spatial and channel-wise attention mechanisms refine the output feature maps and make them more robust, so that the most probable candidate logo regions are activated. By employing these attention mechanisms in various combinations, we aim to enhance the quality of the final output feature maps and ensure precise activation of potential logo regions. The spatial and channel attention mechanisms are explained in the following sections.

Fig. 3.9 Dilated residual network (DRN-54) architecture

3.3.2 Spatial Attention Mechanism

The spatial attention mechanism enhances the discriminative power of feature maps by focusing on the spatial attributes of objects. It uses the global pixel positions within the feature maps to pinpoint crucial locations, assigning different weights to pixels based on their relevance. To construct a spatial attention map, the global feature map $A \in \mathbb{R}^{C \times H \times W}$ obtained from the feature extraction backbone is fed into two separate 1×1 convolutional layers. This transforms the feature map into two distinct feature spaces $B_1, B_2 \in \mathbb{R}^{C_1 \times (H \times W)}$, where $C_1$ is the number of features and $(H \times W)$ denotes the size of the flattened spatial dimension. The value of each spatial position vector $P_j$ is calculated from the values at all spatial positions:

$$P_j = \sum_{i=1}^{N} \alpha_{j,i} D_i \tag{3.1}$$

Here, $N = H \times W$ is the number of spatial positions, $\alpha_{j,i}$ is the sigmoid normalization of $B_1^{T} B_2$, which expresses the significance of the $i$th position in the formulation of the $j$th position, and $D_i$ is the feature map transformed by another 1×1 convolutional layer. The final output of the spatial attention mechanism is the element-wise sum of the original feature map and the attention feature map:

$$S_j = \beta P_j + A_j \tag{3.2}$$

In this formula, $\beta$ is a scale parameter that is initially set to 0 and adaptively adjusted throughout training, allowing the model to progressively learn the optimal degree of emphasis for different spatial regions. Figure 3.10 illustrates the operational flow of the spatial attention module.
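Equations (3.1) and (3.2) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the trained module: the 1×1 convolutions are replaced by plain weight matrices applied per position (biases omitted), the weights are random, and the names `spatial_attention`, `W1`, `W2` and `Wd` as well as the toy tensor sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(A, W1, W2, Wd, beta):
    """Spatial attention following Eqs. (3.1) and (3.2).

    A      : feature map of shape (C, H, W)
    W1, W2 : (C1, C) weights standing in for the two 1x1 convolutions
    Wd     : (C, C) weights standing in for the third 1x1 convolution
    beta   : scalar scale parameter (learned in the real network)
    """
    C, H, W = A.shape
    A_flat = A.reshape(C, H * W)      # column i holds the features at position i
    B1 = W1 @ A_flat                  # (C1, N)
    B2 = W2 @ A_flat                  # (C1, N)
    D = Wd @ A_flat                   # (C, N)
    alpha = sigmoid(B1.T @ B2)        # (N, N); alpha[j, i] weights position i for position j
    P = D @ alpha.T                   # P[:, j] = sum_i alpha[j, i] * D[:, i], Eq. (3.1)
    S = beta * P + A_flat             # Eq. (3.2)
    return S.reshape(C, H, W)


rng = np.random.default_rng(0)
C, C1, H, W = 4, 2, 3, 3              # toy sizes, chosen only for illustration
A = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C1, C))
W2 = rng.standard_normal((C1, C))
Wd = rng.standard_normal((C, C))

S0 = spatial_attention(A, W1, W2, Wd, beta=0.0)
S1 = spatial_attention(A, W1, W2, Wd, beta=0.5)
print(S1.shape)
print(np.allclose(S0, A))             # with beta = 0 the input passes through unchanged
```

Initializing $\beta$ to 0 means the module starts as an identity mapping, so the attention term is blended in only as training adjusts $\beta$.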


Fig. 3.10 Overview of the spatial attention module

3.3.3 Channel Attention Mechanism

Channel attention identifies crucial features across the channel dimension of the feature space, assigning variable importance to different semantic features. To generate the attention values, it assesses the entire global feature map to determine the significance of each channel. The channel attention map values, denoted $\gamma_{ji}$, are obtained by applying the sigmoid function to the matrix product of the feature map $A$ (flattened to $C \times (H \times W)$) and its transpose $A^{T}$. The channel attention output $C_j$ is the element-wise sum of the original feature map and the attention-weighted feature map, which strengthens the model's focus on relevant channels:

$$C_j = \rho \sum_{i=1}^{C} (\gamma_{ji} A_i) + A_j \tag{3.3}$$

where $\gamma_{ji} = \dfrac{1}{1 + e^{-(A A^{T})_{ji}}}$ represents the attention weights across channels, with $\gamma \in \mathbb{R}^{C \times C}$, and $\rho$ is a learnable scale parameter. Initially, $\rho$ is set to 0 and adjusted during training through back-propagation, allowing the model to gradually learn the importance of different channels. Figure 3.11 visualizes the channel attention module.

The output feature maps from the spatial ($S_j$) and channel ($C_j$) attention modules are concatenated as $[S_j, C_j]$, resulting in a combined feature map $X \in \mathbb{R}^{(C \times 2) \times H \times W}$, where $C \times 2$ is the number of channels of the concatenated attention maps and $H \times W$ is the spatial dimension. This enriched feature map $X$, carrying both spatial and channel-specific attention, is then forwarded to the classification branch of the network.
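The channel attention of Eq. (3.3) and the subsequent concatenation can be sketched in NumPy as follows. The function name `channel_attention`, the toy tensor sizes and the placeholder used for the spatial attention output are assumptions made for this illustration only.

```python
import numpy as np


def channel_attention(A, rho):
    """Channel attention following Eq. (3.3); A has shape (C, H, W)."""
    C, H, W = A.shape
    A_flat = A.reshape(C, H * W)
    gamma = 1.0 / (1.0 + np.exp(-(A_flat @ A_flat.T)))   # (C, C) sigmoid weights
    out = rho * (gamma @ A_flat) + A_flat                # row j: rho * sum_i gamma[j, i] * A_i + A_j
    return out.reshape(C, H, W)


rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((4, 3, 3))    # toy feature map, scaled to keep exp() well behaved
C_map = channel_attention(A, rho=0.5)
S_map = A.copy()                            # placeholder for the spatial attention output

X = np.concatenate([S_map, C_map], axis=0)  # stack the two maps along the channel axis
print(X.shape)                              # twice as many channels, same spatial size
```

Because the two outputs are stacked rather than summed, the classification branch can weight spatially refined and channel-refined features independently.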


Fig. 3.11 Overview of the channel attention module

3.3.4 Gradient-Based Grad-CAM for Localization of Logos

For weakly supervised logo localization, class activation heatmaps generated via the Gradient-weighted Class Activation Mapping (Grad-CAM) method [12] are used to visualize potential logo regions within images. Convolutional Neural Networks (CNNs), even when trained only for image-level classification, can weakly localize the objects of the identified categories, because such networks tend to activate strongly over regions containing the relevant objects. Grad-CAM uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map that emphasizes the regions that are crucial for the target object. Bounding boxes can then be generated around the regions highlighted by the heatmaps and labeled with the predicted image class in subsequent post-processing steps.

Grad-CAM visualizes class-specific heatmaps for a given input image, revealing the discriminative regions based on class-specific prediction scores across the channels of a feature map. Specifically, the gradient of the prediction score $y^c$ for class $c$ is computed with respect to the feature map activation $A^n$ ($n \in [1, N]$, where $N$ is the number of channels) of a convolutional layer, giving $\partial y^c / \partial A^n$. These gradients are averaged over all spatial positions to obtain one weight per channel. Subsequently, a ReLU activation function is applied to the weighted combination of activation maps, which keeps only the features that positively influence the classification of class $c$ and indicates which parts of the image are most indicative of the presence of the logo:

$$L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_{n} \left( \frac{1}{z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{n}_{ij}} \right) A^{n} \right) \tag{3.4}$$

where $L^{c}_{\text{Grad-CAM}}$ is the generated activation map and $z$ is the spatial size (height × width) of the feature maps. Figure 3.12 shows a simple representation of the Grad-CAM-based class-specific heatmap visualization of an image. Grad-CAM thus provides a focused visual representation of the image areas that contribute significantly to the classification decision for a particular class. This technique not only aids the localization of logos but also improves the interpretability of CNN-based logo detection models.

Fig. 3.12 Grad-CAM visualization of an input image
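Equation (3.4) can be written in a few lines of NumPy once the activations and their gradients are available. In the sketch below both arrays are hand-made toy values; in practice the gradients would come from back-propagating the class score through the network. The function name `grad_cam` is an assumption for this illustration.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Eq. (3.4): activations and gradients both have shape (N, H, W)."""
    weights = gradients.mean(axis=(1, 2))               # (1/z) * sum_i sum_j dy^c/dA^n_ij
    cam = np.tensordot(weights, activations, axes=1)    # sum_n weights[n] * A^n -> (H, W)
    return np.maximum(cam, 0.0)                         # ReLU keeps positive evidence only


activations = np.array([[[1.0, 0.0], [0.0, 0.0]],       # channel 0 fires top-left
                        [[0.0, 0.0], [0.0, 1.0]]])      # channel 1 fires bottom-right
gradients = np.array([[[2.0, 2.0], [2.0, 2.0]],         # channel 0 supports class c
                      [[-1.0, -1.0], [-1.0, -1.0]]])    # channel 1 opposes class c

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In this toy case only the top-left position survives the ReLU, since the bottom-right activation belongs to a channel whose averaged gradient is negative.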

3.3.5 Implementation of Channel and Spatial Attention

In our framework, we incorporate the channel and spatial attention mechanisms both separately and in combination. The combinations include parallel and sequential arrangements, as illustrated in Fig. 3.13. This allows us to explore the effectiveness and applicability of the modules beyond the parallel combination previously investigated in [8]. Figure 3.7 shows an overview of the network architecture where the attention mechanisms are implemented in parallel, enabling simultaneous processing by both attention modules. The outputs from each module are then concatenated, leveraging the complementary strengths of channel and spatial attention to refine the feature representation. Figure 3.14 presents a sequential implementation, where the channel (CH) and spatial (SP) attention modules are applied in a specific order after the Dilated Residual Network (DRN). Here, the output feature map from the DRN is first refined by the channel attention module, which focuses on relevant channels within the feature space. Subsequently, the spatial attention module processes this refined output, generating spatial attention maps that highlight critical regions within the image. By examining different configurations, we aim to identify the most effective way of applying these mechanisms within our weakly supervised logo detection framework.


Fig. 3.13 Parallel and sequential combinations of dual-attention modules

The final step in the classification branch applies global average pooling to the attention-enhanced feature maps, reducing their dimensions while retaining key information. The pooled features are then processed by a 256-way fully-connected layer with a softmax activation function, which classifies the logo images into their respective categories.
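The classification branch can be sketched as global average pooling followed by a linear layer and a softmax. The sizes below are assumptions for illustration: a 256-channel feature map of 28 × 28 (the example resolution mentioned for the DRN output) and 32 output classes (taken from the FlickrLogos-32 dataset used in the experiments). The weights are random and the function name `classification_head` is hypothetical.

```python
import numpy as np


def classification_head(X, W_fc, b_fc):
    pooled = X.mean(axis=(1, 2))               # global average pooling: (channels,)
    logits = W_fc @ pooled + b_fc              # fully-connected layer
    exp = np.exp(logits - logits.max())        # subtract the max for numerical stability
    return exp / exp.sum()                     # softmax probabilities


rng = np.random.default_rng(0)
X = rng.standard_normal((256, 28, 28))         # attention-enhanced feature map (assumed size)
W_fc = 0.01 * rng.standard_normal((32, 256))   # 32 classes is an assumption (FlickrLogos-32)
b_fc = np.zeros(32)

probs = classification_head(X, W_fc, b_fc)
print(probs.shape, int(probs.argmax()))
```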

3.4 Experiments and Results

3.4.1 Implementation

We implemented the DRN-54 framework using the PyTorch library [13], training with a mini-batch size of 16 for 200 epochs. The parameters are optimized using the Adam optimizer with an initial learning rate of 0.0001 and a weight decay of 0.1.

3.4.2 Dataset

Our experiments are conducted using the FlickrLogos-32 dataset [14], which encompasses 32 distinct logo classes. Each class contains 70 images, with 40 images used for training and validation and the remaining 30 used for testing. The 960 images in the test set contain a total of 1,602 logo objects, providing a comprehensive basis for evaluating the performance of our logo detection framework.

Fig. 3.14 DRN network architecture with sequential attention mechanisms (CH ⇒ SP)

3.4.3 Evaluation Measures

Classification accuracy for each network was determined as the ratio of the number of true positive cases to the total number of input test images. This gives a straightforward measure of how accurately the networks identify logos within the test dataset:

$$\text{Classification}_{acc} = \frac{\text{true positive predictions}}{\text{total predictions}} \tag{3.5}$$

For weakly supervised localization, a predicted bounding box was considered correct when its Intersection over Union (IoU) with the ground-truth bounding box exceeds 0.5, as shown in Fig. 3.15. The localization accuracy is given by Eq. (3.6):

$$\text{Localization}_{acc} = \frac{\text{true positives}}{\text{total number of ground truths}} \tag{3.6}$$
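The IoU criterion and the localization accuracy of Eq. (3.6) can be sketched as follows. Boxes are assumed to be given as `(x1, y1, x2, y2)` corner coordinates, and each ground-truth box is assumed to be paired with exactly one predicted box; both conventions are simplifications for this illustration, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predictions, ground_truths, threshold=0.5):
    """Eq. (3.6), with predictions[k] assumed to be the box proposed for ground_truths[k]."""
    hits = sum(1 for p, g in zip(predictions, ground_truths) if iou(p, g) > threshold)
    return hits / len(ground_truths)


gt = (0, 0, 10, 10)
preds = [(0, 0, 10, 5),      # IoU = 0.5, not counted because it does not exceed 0.5
         (2, 0, 10, 10),     # IoU = 0.8, counted
         (20, 20, 30, 30)]   # no overlap
print([iou(p, gt) for p in preds])
print(localization_accuracy(preds, [gt, gt, gt]))

# Same ratio for the best entry of Table 3.2: 401 of 1,602 logo objects.
print(round(100 * 401 / 1602, 2))
```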

3.4.4 Comparison with Different Attention Modules

Table 3.1 presents the classification accuracy of various networks, including the dilated residual network (DRN-54), DRN-54 with parallel dual attention, DRN-54 with separate spatial (SP) and channel (CH) attention modules, and DRN-54 with sequential combinations of attention modules (i.e., channel followed by spatial attention (CH ⇒ SP) and spatial followed by channel attention (SP ⇒ CH)), along with other existing methods such as SEResNet-50 [4] and ResNet-50 with CBAM [6]. The results indicate that the performance of the DRN-54 network improves consistently when attention modules are implemented in any combination. The DRN-54 network achieves an accuracy of approximately 84.9%. Notably, the classification accuracy of the DRN-54 network increases by around 4 percentage points for both parallel and sequential combinations of spatial and channel attention modules. Specifically, the accuracy improves by more than 4 percentage points with spatial attention alone, while the channel attention mechanism contributes an improvement of 2 percentage points. The other existing methods show lower accuracy than our proposed methods.

Fig. 3.15 Bounding box overlapping

Table 3.1 Correctly classified logo images and classification accuracy

| Method | True positives | Accuracy (%) |
|---|---|---|
| ResNet50 CBAM [6] | 783 | 81.563 |
| SEResNet [4] | 814 | 84.792 |
| DRN50 (w/o attention) | 815 | 84.895 |
| DRN50 channel attention (CH) | 834 | 86.875 |
| DRN50 spatial attention (SP) | 857 | 89.271 |
| DRN50 Dual sequential attention (CH ⇒ SP) | 846 | 88.125 |
| DRN50 Dual sequential attention (SP ⇒ CH) | 851 | 88.646 |
| DRN50 Dual parallel attention | 853 | 88.854 |

Table 3.2 reports the mean localization accuracy and the number of correctly located logos in images. Our approach focuses on a weakly-supervised logo localization training scheme based on attention mechanisms, where the model is trained only with image-level annotations. The results show a significant improvement in logo-object localization compared to the conventional DRN-54 network. Localization accuracy increases with all combinations of attention modules, except for the sequential implementation of spatial and channel attention modules. Notably, we achieve more than a 3% higher accuracy compared to the conventional network when the attention modules are implemented separately. The parallel implementation of both attention modules achieves the highest accuracy, increasing the accuracy of the DRN network from 20.84% to 25.03%. Additionally, the sequential implementation of channel and spatial attention mechanisms enhances the accuracy by more than 1.5%. The results suggest that the logo regions can be efficiently located using the dual-attention method. The parallel approach outperforms the sequential and conventional approaches for several reasons:

(1) Simultaneous Refinement: The parallel approach refines the output feature maps in both spatial and channel dimensions simultaneously. This concurrent refinement allows for a more comprehensive enhancement of the feature maps, leading to improved performance in classification and localization tasks.
(2) Concatenation of Feature Maps: By concatenating the feature maps produced by the spatial and channel attention mechanisms, the parallel approach provides important attention weights for learning the classification and localization tasks more accurately. This concatenation enables the network to leverage both spatial and semantic information effectively. In contrast, sequential approaches refine the attention-weighted feature maps one after the other. This sequential refinement may reduce the effectiveness of important weights during learning, leading to suboptimal performance. Table 3.3 provides a ranking comparison of different attention methods with various backbone networks for both classification and localization tasks, highlighting the effectiveness of the parallel approach in improving performance.

Table 3.2 Correctly localized logo accuracy

| Method | True positives | Accuracy (%) |
|---|---|---|
| ResNet50 CBAM [6] | 241 | 15.04 |
| SEResNet [4] | 231 | 14.41 |
| DRN50 (w/o attention) | 334 | 20.84 |
| DRN50 channel attention (CH) | 388 | 24.21 |
| DRN50 spatial attention (SP) | 389 | 24.28 |
| DRN50 Dual sequential attention (CH ⇒ SP) | 346 | 21.54 |
| DRN50 Dual sequential attention (SP ⇒ CH) | 311 | 19.41 |
| DRN50 Dual parallel attention | 401 | 25.03 |

Table 3.3 Ranking comparison of different attention methods with different backbone networks for classification and localization tasks

| Method | DRN-54 Classification | DRN-54 Localization | ResNet-50 Classification | ResNet-50 Localization | Mean rank ± std |
|---|---|---|---|---|---|
| Dual parallel attention | 2 | 1 | 1 | 3 | 1.75 ± 0.96 |
| Dual sequential attention (CH ⇒ SP) | 4 | 4 | 3 | 2 | 3.25 ± 0.96 |
| Dual sequential attention (SP ⇒ CH) | 3 | 3 | 6 | 6 | 5.25 ± 1.50 |
| Spatial attention | 1 | 2 | 3 | 4 | 2.50 ± 1.30 |
| Channel attention | 5 | 3 | 2 | 5 | 3.75 ± 1.50 |
| Without attention | 6 | 5 | 5 | 1 | 4.25 ± 2.22 |

We computed the channel-specific response of the channel attention module for each category, and the top five channels with the highest weights for each category are listed in Table 3.4. In the parallel implementation, the final output consists of a concatenated feature map generated by the channel and spatial attention modules. Of the 256 channels, the first 128 are generated by the spatial attention module, while channels 129–256 are generated by the channel attention module.
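The "mean rank ± std" column of Table 3.3 can be reproduced from the four per-task ranks of a method. The short check below assumes that the reported spread is the sample standard deviation (denominator n − 1); this is an inference from the printed values rather than something stated in the text, and only three rows of the table are used.

```python
from statistics import mean, stdev

# Ranks in the order: DRN-54 classification, DRN-54 localization,
# ResNet-50 classification, ResNet-50 localization (values from Table 3.3).
ranks = {
    "Dual parallel attention": [2, 1, 1, 3],
    "Channel attention": [5, 3, 2, 5],
    "Without attention": [6, 5, 5, 1],
}

for method, r in ranks.items():
    # stdev() uses the n - 1 denominator (sample standard deviation).
    print(f"{method}: {mean(r):.2f} ± {stdev(r):.2f}")
```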

Table 3.4 Channel specific response for DRN-54 parallel dual-attention network

| Brand | Channel number | Brand | Channel number |
|---|---|---|---|
| Adidas | 211, 231, 238, 251, 229 | Google | 229, 231, 190, 241, 254 |
| Aldi | 231, 169, 197, 199, 129 | Guiness | 211, 219, 146, 248, 245 |
| Apple | 231, 233, 238, 132, 251 | Heineken | 217, 146, 213, 238, 229 |
| Becks | 229, 217, 211, 190, 227 | HP | 231, 229, 240, 190, 197 |
| Bmw | 213, 149, 221, 231, 233 | Milka | 211, 241, 233, 151, 231 |
| Carlsberg | 146, 211, 147, 213, 245 | Nvidia | 229, 153, 238, 145, 146 |
| Chimay | 229, 147, 248, 216, 132 | Paulaner | 169, 213, 221, 135, 236 |
| Cocacola | 231, 129, 217, 190, 211 | Pepsi | 213, 233, 217, 189, 251 |
| Corona | 211, 229, 169, 245, 248 | Rittersport | 243, 145, 229, 185, 238 |
| Dhl | 213, 129, 197, 251, 221 | Shell | 231, 251, 197, 229, 129 |
| Erdinger | 211, 169, 251, 232, 213 | Singha | 232, 211, 189, 169, 238 |
| Esso | 217, 202, 211, 214, 233 | Starbucks | 238, 227, 221, 169, 189 |
| Fedex | 217, 213, 233, 241, 229 | Stellaartois | 227, 211, 189, 129, 147 |
| Ferrari | 216, 231, 238, 215, 207 | Texaco | 217, 197, 227, 190, 151 |
| Ford | 211, 245, 190, 216, 213 | Tsingtao | 189, 241, 177, 217, 213 |
| Fosters | 211, 189, 221, 216, 213 | Ups | 233, 211, 145, 251, 245 |

Upon analyzing the table, we observed that channels 211, 231, 229, 213, and 217 occur most frequently across the categories. This suggests that these channels play a significant role in identifying logo regions. These channel-specific responses indicate the importance of channels for the corresponding categories. Assigning biased weights to these specific channels could potentially improve performance. Figure 3.16 visualizes the channel-specific responses of the top five channels (from left to right: 211, 231, 229, 213, 217) and the final heatmaps, respectively. These binary images illustrate different attention weights corresponding to different responses. The final heatmaps are generated using Grad-CAM visualization for the last convolutional outputs, leveraging the positive influence of all channels to locate possible logo regions. Figure 3.17 shows the class activation regions generated through Grad-CAM for various methods, including DRN-54, DRN-54 with parallel attention modules, DRN-54 with spatial attention, and ResNet-50 with the CBAM attention block. The visual comparison shows that our proposed dual-attention modules in parallel combination generate more precise target-specific class activations. This improvement in object localization capability highlights the effectiveness of the feature extractor in our proposed method.
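The frequency analysis of Table 3.4 can be repeated with a simple counter. The dictionary below is transcribed from the table; the variable names are hypothetical, and the script only counts how often each channel index appears among the per-brand top-five lists.

```python
from collections import Counter

# Top-5 channel indices per brand, transcribed from Table 3.4.
top_channels = {
    "Adidas": [211, 231, 238, 251, 229], "Aldi": [231, 169, 197, 199, 129],
    "Apple": [231, 233, 238, 132, 251], "Becks": [229, 217, 211, 190, 227],
    "Bmw": [213, 149, 221, 231, 233], "Carlsberg": [146, 211, 147, 213, 245],
    "Chimay": [229, 147, 248, 216, 132], "Cocacola": [231, 129, 217, 190, 211],
    "Corona": [211, 229, 169, 245, 248], "Dhl": [213, 129, 197, 251, 221],
    "Erdinger": [211, 169, 251, 232, 213], "Esso": [217, 202, 211, 214, 233],
    "Fedex": [217, 213, 233, 241, 229], "Ferrari": [216, 231, 238, 215, 207],
    "Ford": [211, 245, 190, 216, 213], "Fosters": [211, 189, 221, 216, 213],
    "Google": [229, 231, 190, 241, 254], "Guiness": [211, 219, 146, 248, 245],
    "Heineken": [217, 146, 213, 238, 229], "HP": [231, 229, 240, 190, 197],
    "Milka": [211, 241, 233, 151, 231], "Nvidia": [229, 153, 238, 145, 146],
    "Paulaner": [169, 213, 221, 135, 236], "Pepsi": [213, 233, 217, 189, 251],
    "Rittersport": [243, 145, 229, 185, 238], "Shell": [231, 251, 197, 229, 129],
    "Singha": [232, 211, 189, 169, 238], "Starbucks": [238, 227, 221, 169, 189],
    "Stellaartois": [227, 211, 189, 129, 147], "Texaco": [217, 197, 227, 190, 151],
    "Tsingtao": [189, 241, 177, 217, 213], "Ups": [233, 211, 145, 251, 245],
}

counts = Counter(ch for channels in top_channels.values() for ch in channels)

# Occurrences of the five channels singled out in the text.
for channel in (211, 231, 229, 213, 217):
    print(channel, counts[channel])

# All listed indices lie in 129-256, the half produced by the channel attention module.
print(min(counts), max(counts))
```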


Fig. 3.16 Channel specific response for different logo objects

Fig. 3.17 Class activation map visualization using Grad-CAM


3.5 Summary

The majority of existing logo detection methods rely heavily on precise object-level bounding box annotations, making them less feasible for real-world applications due to the labor-intensive nature of this annotation process. In response to this challenge, we propose a weakly supervised logo detection algorithm capable of learning logo recognition without the need for bounding-box-annotated training data. Our approach leverages attention-based mechanisms to highlight important areas for logo recognition. By utilizing only image-level annotations, our method is highly scalable for large-scale real-world applications. Furthermore, we enhance logo image classification and localization by integrating attention-based mechanisms with Convolutional Neural Networks (CNNs). These attention mechanisms prioritize spatial and semantic features of objects, improving the robustness of the feature map.

References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
2. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
3. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6458 (2017)
4. Hu, J., Shen, L., Sun, G.: Squeeze and excitation networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7142 (2017)
5. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 14–19 (2020)
6. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of European Conference on Computer Vision, pp. 3–19 (2018)
7. Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2027–2036 (2017)
8. Chen, X., Lin, L., Liang, D., Hu, H., Zhang, Q., Iwamoto, Y., Han, X.H., Chen, Y.W., Tong, R., Wu, J.: A dual-attention dilated residual network for liver lesion classification and localization on CT images. In: Proceedings of IEEE International Conference on Image Processing, pp. 235–239 (2019)
9. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122 (2015)
10. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 472–480 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
12. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of IEEE International Conference on Computer Vision, pp. 618–626 (2017)
13. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8024–8035 (2019)
14. Romberg, S., Pueyo, L.G., Lienhart, R., van Zwol, R.: Scalable logo recognition in real-world images. In: Proceedings of the 1st ACM International Conference on Multimedia Retrieval, USA, pp. 1–8 (2011)

Chapter 4

Anchorfree Logo Detection Framework

Abstract This chapter discusses an attention mechanism-based anchorfree logo detection framework and lightweight CNN modules. To perform precise and efficient logo detection, we propose a framework, LogoNet, which consists of a robust feature extraction network with spatial and channel attention modules and an anchorfree detection head. We investigate novel attention mechanisms and a layer-aggregated hourglass-style CNN backbone network for feature extraction in logo detection. The proposed spatial and channel attention modules improve detection accuracy by enhancing attention to logo regions. This chapter also discusses a fast and lightweight CNN module architecture for practical use of the logo detection framework. The lightweight CNN module reduces the number of network parameters as well as the computational cost of the network while maintaining performance.

4.1 Dual-Attention LogoNet for Logo Detection

The previous chapter discussed attention mechanisms, feature extractor networks, dilated convolution, and a weakly supervised algorithm for logo classification and localization. This chapter explores an end-to-end detection framework with attention mechanisms, a layer-aggregated hourglass-style feature extraction CNN, and an anchorfree detection head. Furthermore, it discusses a fast and lightweight framework designed for the practical application of logo detection.

4.1.1 Overview of the Logo Detection Framework

We introduce LogoNet, a deep learning-based algorithm for logo detection. It comprises a densely layer-aggregated hourglass-style (top-down bottom-up) feature extraction network, spatial and channel attention modules, and an anchor-free detection head similar to CenterNet [1]. The incorporated backbone network effectively captures feature information across different scales and maintains information integrity during spatial resolution scaling via skip connections. To enhance performance, we introduce a dual-attention-based detection framework for logo detection that uses different attention mechanisms more efficiently. The proposed spatial attention module refines the feature space, with a particular focus on improving the identification of target objects, especially small logo regions. The integrated channel attention module facilitates the generation of more balanced feature maps with advanced semantic information. The overall architecture of LogoNet is illustrated in Fig. 4.1. The incorporation of spatial and channel-wise attention modules results in refined and robust feature maps, enabling more accurate prediction of visual and semantic information. The attention modules and the overall architecture of our proposed framework are discussed in detail in the subsequent sections.

Fig. 4.1 Overall network architecture of dual-attention LogoNet

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 Y.-W. Chen et al., Recent Advances in Logo Detection Using Machine Learning Paradigms, Intelligent Systems Reference Library 255, https://doi.org/10.1007/978-3-031-59811-1_4

4.1.2 Layer-Aggregated Hourglass Style Feature Extraction Network

An hourglass-style Convolutional Neural Network (CNN) was introduced for the task of human pose estimation [2]. This network architecture comprises structured bottom-up and top-down modules, wherein the input channels are expanded and the feature maps are down-sampled through a sequence of convolutional, stride, and max-pooling operations. Subsequently, upsampling operations are conducted to generate symmetric feature map blocks, characteristic of the hourglass style. Notably, skip connections are incorporated during the upsampling process to mitigate information loss. For object detection, the hourglass network architecture was first employed in CornerNet [3] and subsequently adopted in CenterNet [1]. We employ an hourglass-style feature extractor network, which is particularly well-suited for logo recognition tasks. This network architecture generates an output feature space with large spatial dimensions, facilitating dense predictions for small objects like logos. The introduced hourglass-style feature extractor network follows a convolutional block arrangement similar to that described in [3].


In the feature extractor network, the process begins with a resized input image of dimensions 3 × 512 × 512. This image is passed through a convolutional block, which halves its spatial resolution through 7 × 7 convolutional operations with a stride of 2, resulting in 128 channels. Subsequently, the feature maps traverse a residual block employing 3 × 3 convolutional operations with a stride of 2, yielding feature maps with 256 channels and a spatial dimension of 128 × 128. These feature maps are then fed into the stacked hourglass modules to produce feature maps containing rich global spatial and semantic information. Each hourglass module is structured with bottom-up and top-down designs, integrating residual learning blocks. The process encompasses five stages comprising downsampling and upsampling operations. Skip connection modules establish connections between symmetric blocks of an hourglass module. Within each stage, there are two residual blocks in the processing step, including the skip connection module. Each residual block comprises two convolutional layers and one skip connection layer. The spatial dimension of the feature map is reduced through the stride operation in the first convolutional layer of the first residual block. Subsequent convolution operations within each stage leave the spatial dimension unchanged. A 3 × 3 kernel is used for every convolutional operation. The skip connection layer in the residual blocks employs a linear transformation (via 1 × 1 convolution) to align the spatial and channel dimensions of the input feature maps with the output of the convolution layer. The spatial resolution of the feature maps is halved five times, and the number of channels increases as [256, 384, 384, 384, 512] along the way. For upsampling the feature maps, the nearest-neighbor algorithm is employed, followed by two residual blocks at each stage. The final output feature map has 256 channels and a spatial dimension of 128 × 128. The detailed structure of the hourglass module is provided in Table 4.1.

Based on the original hourglass architecture, our proposed network densely aggregates convolutional layers into each residual block at different scales. Each residual block incorporates two convolutional layers and a skip connection layer, and standard residual learning adds the skip connection to the output of the second convolutional layer. Inspired by [4], we propose to aggregate the outputs of both convolutional layers with the skip connection within each convolutional block. In each residual block, both convolutional operations and the skip connection layer generate feature maps of the same spatial and channel dimensions, so these output feature maps can be added directly without increasing network overhead. This can be formulated as Eq. (4.1):

$$X = X_s + X_1 + X_2 \tag{4.1}$$

where $X_1$ and $X_2$ are the outputs of the two convolutional operations applied to the input feature map, and $X_s$ denotes the output feature maps of the skip connection layer. Figure 4.2 illustrates the residual block structures of the hourglass network in [2] and of our proposed approach. Furthermore, the network includes two hourglass modules. To preserve important information, we also add the output feature maps of the two stacked hourglass modules. This final output is then provided to the attention modules. This strategy prevents the loss of crucial information and finer details that may occur during the downsampling and upsampling of feature maps. The process is shown in Fig. 4.3.

Table 4.1 The detailed operation and parameters of each layer in an hourglass module. Each operation is listed as: operation, kernel size, output channels, stride

| Layer name | Output dimension | Operation | Layer name | Output dimension | Operation |
|---|---|---|---|---|---|
| Conv1_1 | 64 × 64 | Conv, 3 × 3, 256, 2 | Conv10_2 | 128 × 128 | Conv, 3 × 3, 256, 1 |
| Conv1_2 | 64 × 64 | Conv, 3 × 3, 256, 1 | Conv10_1 | 128 × 128 | Conv, 3 × 3, 256, 1, upsampling |
| Conv2_1 | 32 × 32 | Conv, 3 × 3, 384, 2 | Conv9_2 | 64 × 64 | Conv, 3 × 3, 384, 1 |
| Conv2_2 | 32 × 32 | Conv, 3 × 3, 384, 1 | Conv9_1 | 64 × 64 | Conv, 3 × 3, 384, 1, upsampling |
| Conv3_1 | 16 × 16 | Conv, 3 × 3, 384, 2 | Conv8_2 | 32 × 32 | Conv, 3 × 3, 384, 1 |
| Conv3_2 | 16 × 16 | Conv, 3 × 3, 384, 1 | Conv8_1 | 32 × 32 | Conv, 3 × 3, 384, 1, upsampling |
| Conv4_1 | 8 × 8 | Conv, 3 × 3, 384, 2 | Conv7_2 | 16 × 16 | Conv, 3 × 3, 384, 1 |
| Conv4_2 | 8 × 8 | Conv, 3 × 3, 384, 1 | Conv7_1 | 16 × 16 | Conv, 3 × 3, 384, 1, upsampling |
| Conv5_1 | 4 × 4 | Conv, 3 × 3, 512, 2 | Conv6_2 | 8 × 8 | Conv, 3 × 3, 512, 1 |
| Conv5_2 | 4 × 4 | Conv, 3 × 3, 512, 1 | Conv6_1 | 8 × 8 | Conv, 3 × 3, 512, 1, upsampling |

Fig. 4.2 Illustration of proposed aggregation of various layers in a convolution block

Fig. 4.3 Proposed element-wise addition of both stacked hourglass modules. The red arrow shows the output of the first hourglass module added to the output of the second
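As a compact restatement of the layer aggregation in Eq. (4.1) and of the hourglass addition shown in Fig. 4.3, the standard residual block of [2] and the layer-aggregated block can be written side by side. The symbols $f_1$, $f_2$, $g$, $X_{in}$, $H_1$, $H_2$ and $F$ are shorthand introduced here for illustration and do not appear in the original formulation; the sketch also assumes that the two convolutions of a block are applied in sequence.

```latex
\begin{aligned}
X_1 &= f_1(X_{in}), \qquad X_2 = f_2(X_1), \qquad X_s = g(X_{in}) \\
X_{residual} &= X_s + X_2 && \text{(standard residual block [2])} \\
X &= X_s + X_1 + X_2 && \text{(layer-aggregated block, Eq. 4.1)} \\
F &= H_1 + H_2 && \text{(sum of the two hourglass outputs, Fig. 4.3)}
\end{aligned}
```

Here $f_1$ and $f_2$ stand for the two 3 × 3 convolutional layers, $g$ for the 1 × 1 skip connection layer, and $H_1$, $H_2$ for the output feature maps of the first and second hourglass modules. Because all summands share the same shape, the aggregation adds only element-wise additions and no learnable parameters.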

4.1.3 Attention Modules

The proposed spatial and channel attention modules strengthen the focus on logo objects, resulting in improved detection accuracy.

4.1.3.1 Spatial Attention Module

Our method generates spatial attention weights using the inter-spatial relationships of channels to obtain rich, global spatial information that helps to create a robust global feature map. Figure 4.4 gives an overview of the proposed spatial attention module. A feature map $A \in \mathbb{R}^{C \times H \times W}$ is provided as input to the spatial attention module, where $C$ denotes the number of channels and $H \times W$ are the height and width of the feature map; in this architecture the dimensions are $256 \times 128 \times 128$. The input $A$ is fed into a $1 \times 1$ linear transformation layer, and a normalized feature map $S_{sigmoid}$ is created for all channels using the sigmoid activation function:

$$S'_{ij} = \frac{1}{1 + \exp(-S_{ij})} \tag{4.2}$$

where $S_{ij}$ is the scalar value at the $(i, j)$th position and $S'_{ij}$ denotes the corresponding activated scalar value at that pixel position. The output of this operation is a sigmoid-activated map, i.e., $S_{sigmoid} \in \mathbb{R}^{C \times H \times W}$.

Fig. 4.4 Illustration of the proposed spatial-attention module

Additionally, the input $A \in \mathbb{R}^{C \times H \times W}$ is fed into a convolutional block, which generates a feature map $F_{CONV3}$. This convolutional block consists of three convolutional layers with kernel sizes of $1 \times 1$, $3 \times 3$ and $1 \times 1$, respectively. To keep channel-wise details, the number of channels $C$ remains unchanged at 256 for each convolutional layer. ReLU activation follows the first two convolutional operations, while batch normalization is performed for all the convolutional layers. A softmax normalization is applied across the channels over the output feature space of the convolutional block ($F_{CONV3}$). During softmax normalization, all scalar values at the same pixel position across all feature channels are considered: a new scalar value is synthesized for each pixel in each channel using the values at the same position in the other channels. If $P_{i,j,k}$ is the scalar value at the $(i, j)$th pixel position in the $k$th channel, the normalized scalar value $P'_{i,j,k}$ in Eq. (4.3) is obtained as:

$$P'_{i,j,k} = \frac{\exp(P_{i,j,k})}{\sum_{k'=1}^{C} \exp(P_{i,j,k'})} \tag{4.3}$$

where $C$ denotes the number of channels in the feature map $F_{CONV3}$. A softmax-normalized feature map $P_{softmax} \in \mathbb{R}^{C \times H \times W}$ is produced from these normalized scalar values ($P'_{i,j,k}$).

We then take the element-wise product of the two normalized feature maps (i.e., $S_{sigmoid}$ and $P_{softmax}$). The input feature map ($A$) is added to this product as a skip connection to obtain the final attention-weighted feature map:

$$A_{attention} = A + (S_{sigmoid} \odot P_{softmax}) \tag{4.4}$$

where $\odot$ is the element-wise product. We generate a weighted feature map and add it element-wise to the input to obtain a robust representation of the input image. The proposed technique uses both sigmoid and softmax activations to learn important spatial weights.
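To make Eqs. (4.2) to (4.4) concrete, the following minimal sketch evaluates them on small nested lists using only the Python standard library. It is an illustration under stated assumptions rather than the authors' code: the learned layers are not modelled, so the arguments `s` (output of the 1 × 1 layer) and `p` (output of the convolutional block $F_{CONV3}$) are simply passed in, and the function names are hypothetical.

```python
import math


def sigmoid(value):
    """Logistic function used in Eq. (4.2)."""
    return 1.0 / (1.0 + math.exp(-value))


def channel_softmax(p):
    """Softmax across channels at every pixel position, as in Eq. (4.3).

    p is a nested list indexed as p[k][i][j] (channel, row, column).
    """
    channels, height, width = len(p), len(p[0]), len(p[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            column = [p[k][i][j] for k in range(channels)]
            peak = max(column)  # subtract the maximum for numerical stability
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for k in range(channels):
                out[k][i][j] = exps[k] / total
    return out


def spatial_attention(a, s, p):
    """Eq. (4.4): A_attention = A + sigmoid(S) * softmax_over_channels(P)."""
    p_softmax = channel_softmax(p)
    return [
        [
            [a[k][i][j] + sigmoid(s[k][i][j]) * p_softmax[k][i][j]
             for j in range(len(a[0][0]))]
            for i in range(len(a[0]))
        ]
        for k in range(len(a))
    ]


# Toy input with 2 channels and 1 x 2 pixels; s and p stand in for layer outputs.
a = [[[1.0, 2.0]], [[3.0, 4.0]]]
s = [[[0.0, 0.0]], [[0.0, 0.0]]]
p = [[[0.0, 0.0]], [[0.0, 0.0]]]
attended = spatial_attention(a, s, p)
print(attended)
```

With all-zero `s` and `p`, every sigmoid value is 0.5 and every softmax value is 1/C, so each output element equals the input plus 0.25 for two channels. The softmax term sums to one across channels at each pixel, which bounds the attention term added to each pixel across all channels.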

4.1.3.2 Channel Attention Module

ECANet was introduced in [5], building on the previously proposed SENet [6], to capture channel-wise attention weights efficiently. To capture channel-wise dependencies, global average pooling (GAP) is performed on the input feature maps. A 1D convolution is then employed to learn cross-channel interaction, and a sigmoid activation function is applied to obtain the channel-wise attention weights. The authors proposed an adaptive kernel size to capture local cross-channel interactions by considering a channel and its $k$ neighbors (the coverage of interaction); in their method the kernel size $k$ is chosen adaptively as a function of the number of channels. The channel-wise response is emphasized by multiplying the attention weights with the input feature maps. These weight-enhanced feature maps are added to the input feature map to form the final output. Our proposed channel attention mechanism is inspired by ECANet [5], but we use a fixed kernel of size 3. Unlike ECANet, our approach directly uses the attention-based feature maps to produce category-wise heatmaps, without adding the input feature maps as a skip connection. Figure 4.5 shows the channel attention module.
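The channel attention steps described above (global average pooling, a 1D convolution with a fixed kernel of size 3, a sigmoid, and channel-wise scaling without a skip connection) can be sketched in plain Python as follows. This is a hedged illustration: the function name, the zero padding at the channel boundaries and the example kernel values are assumptions, since in the real network the kernel weights are learned.

```python
import math


def channel_attention(a, kernel=(0.25, 0.5, 0.25)):
    """Scale each channel of a[k][i][j] by a learned-style attention weight.

    kernel holds the three weights of the 1D convolution over channels.
    The values used here are placeholders; in practice they are learned.
    """
    channels = len(a)
    height, width = len(a[0]), len(a[0][0])

    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in a[k]) / (height * width)
              for k in range(channels)]

    # 1D convolution across neighboring channels (zero padding is an assumption).
    padded = [0.0] + pooled + [0.0]
    weights = []
    for k in range(channels):
        response = sum(kernel[t] * padded[k + t] for t in range(3))
        weights.append(1.0 / (1.0 + math.exp(-response)))  # sigmoid

    # Channel-wise scaling; no skip connection is added, as in the text.
    scaled = [[[value * weights[k] for value in row] for row in a[k]]
              for k in range(channels)]
    return scaled, weights


feature_map = [[[1.0, 1.0]], [[3.0, 3.0]]]
scaled_map, attention_weights = channel_attention(feature_map, kernel=(0.0, 1.0, 0.0))
print(attention_weights)
```

With the identity-like kernel `(0.0, 1.0, 0.0)` used in the example call, each weight is simply the sigmoid of the pooled value of its own channel, which makes the output easy to check by hand.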

Fig. 4.5 An overview of channel-wise attention module

4.1.4 Detection Head

We employ CenterNet, an anchor-free detector presented in [1]. CenterNet identifies objects by localizing each as a single point corresponding to the center of its bounding box. During CenterNet training, ground-truth Region of Interest (RoI) annotations are first transformed into class-wise keypoint maps denoted as $K_{(x,y,c)}$, where $x$ and $y$ are the coordinates of the center of an object and $c$ denotes its category. Within the keypoint map, the coordinates $(x, y)$ are activated to mark the peak corresponding to the object center, while all other points are set to zero. The class-wise keypoint maps are then processed to generate Gaussian heatmaps based on the height and width of the objects. The primary objective of training is to produce a set of class-wise heatmaps that identify objects, together with shape vectors that estimate their sizes. During training, a detection confidence score is computed using the predicted keypoint values $\hat{Y}_{(x_i, y_i, c)}$ derived from the generated class-wise heatmaps, where $(x_i, y_i)$ are the integer coordinates of a candidate keypoint location and $c$ denotes the category. Additionally, to localize the object precisely, a local offset for the object's center location is predicted. To support these functions, the CenterNet detection head comprises three sub-heads. First, the heatmap head generates class-wise heatmaps, one per category. These heatmaps are used to classify the relevant categories, with ground-truth heatmaps employed to compute the classification loss. Second, the size head generates shape vectors and is trained by regression against the ground-truth width and height (ground-truth RoIs) to estimate object sizes. Lastly, the offset head corrects the discretization error induced by the stride operations that reduce the dimension of the output feature map. The objective loss function can be represented as follows:

$$L_{det} = L_k + \lambda_{size} L_s + \lambda_{off} L_{off} \tag{4.5}$$

where $L_k$ denotes the focal loss used for classification [18], $L_s$ and $L_{off}$ are $L_1$ loss functions, and $\lambda_{size}$ and $\lambda_{off}$ are the loss weights. The detection framework for object identification and bounding box size regression is trained end to end by combining these losses. The three heads are trained concurrently, following the training strategy outlined in [1] and using the objective loss specified in Eq. (4.5). During detection (the inference phase), a detection confidence score is computed using the estimated keypoint values $\hat{\gamma}_{x_i, y_i, c}$, where $(x_i, y_i)$ are the integer coordinates of a candidate keypoint location and $c$ represents the category. In detail, a set of class-wise heatmaps is first generated, one for each category. Next, peak points are identified in these class-wise heatmaps. Typically, the top 100 peak points from each heatmap are considered for detection. A set of $n$ detected center points (peak points) for all $c$ classes is then estimated as $\hat{\gamma} = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{n}$, where $(x_i, y_i)$ is the integer coordinate of a keypoint location. Following this, the keypoint estimate (detection confidence score) $\hat{\gamma}$ is compared against a threshold to determine whether an object is active at that point. If the detection confidence score surpasses the predefined threshold, the point is designated as representing the presence of the corresponding object, and the size of the object is determined by the corresponding values of the shape vector. Finally, for the detection task, a predicted bounding box is deemed correct when the Intersection over Union (IoU) between the predicted and ground-truth bounding boxes exceeds a certain threshold (e.g., 0.5). Figure 4.6 illustrates the overall pipeline of the detection architecture.

Fig. 4.6 A simple architecture of CenterNet for logo detection
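The target rendering and inference steps described above can be sketched on a toy single-class heatmap. This is a simplified illustration, not the CenterNet code of [1]: the function names are hypothetical, the Gaussian width `sigma` is passed in directly rather than derived from the object size, peaks are found with a plain 3 × 3 neighborhood comparison, the output stride of 4 follows from the 512 to 128 reduction stated earlier, and object sizes are assumed to be expressed in input pixels.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma):
    """Render one object center (cx, cy) as a Gaussian peak of value 1."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def find_peaks(heatmap, threshold, top_k=100):
    """Return up to top_k (score, x, y) local maxima at or above the threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < threshold:
                continue
            neighborhood = [heatmap[j][i]
                            for j in range(max(0, y - 1), min(height, y + 2))
                            for i in range(max(0, x - 1), min(width, x + 2))]
            if score >= max(neighborhood):
                peaks.append((score, x, y))
    peaks.sort(reverse=True)
    return peaks[:top_k]


def decode_box(x, y, offset, size, stride=4):
    """Turn a peak, its predicted offset and size into (x1, y1, x2, y2)."""
    center_x = (x + offset[0]) * stride
    center_y = (y + offset[1]) * stride
    box_w, box_h = size
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


heat = gaussian_heatmap(height=8, width=8, cx=5, cy=3, sigma=1.0)
peaks = find_peaks(heat, threshold=0.5)
_, px, py = peaks[0]
box = decode_box(px, py, offset=(0.25, 0.5), size=(40.0, 20.0))
print(peaks, box)
```

In a full evaluation, a decoded box would be counted as correct when `iou(box, ground_truth_box)` exceeds the chosen threshold, such as the value of 0.5 used in this chapter.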

4.1.5 Overall Framework of Dual-Attention LogoNet

Figure 4.7 depicts the overall frameworks of CenterNet, LogoNet, and Dual-Attention LogoNet. The first architecture shows the CenterNet framework integrated with the hourglass feature extraction backbone, as introduced in [1]. LogoNet combines the proposed feature extraction network with the spatial attention module. The Dual-Attention LogoNet architecture integrates both the spatial and channel attention modules with the proposed feature extraction backbone. The CenterNet detection head is used in both the LogoNet and Dual-Attention LogoNet configurations. Our experimental results indicate that our approach yields a robust feature map at only a small additional computational cost (see the detection times in Table 4.3).


Fig. 4.7 (Left) CenterNet framework. (Middle) LogoNet with spatial attention module and final output feature maps. (Right) Dual-Attention LogoNet

4.1.6 Lightweight CNNs Network Architecture for Practical Applications

In practice, tasks such as logo detection often run on low-specification devices like mobile phones or IP cameras. These devices demand lightweight yet accurate algorithms to ensure efficient performance. Our objective is to develop a lightweight model that is suitable for deployment on embedded edge computing devices. To achieve this goal, we propose a lightweight CNN architecture with a reduced number of network parameters and lower computational complexity. This lightweight CNN architecture is designed for both LogoNet and Dual-Attention LogoNet to ensure practical applicability. It significantly reduces the number of network parameters and shortens the inference time to approach real-time performance while largely maintaining accuracy. The lightweight CNN module is explained in this section.

To develop a compact network and enhance detection speed for practical applications, we introduce a lightweight architecture inspired by the factorization technique employed in MobileNetV2 [7]. In our lightweight module, a convolutional operation consists of a combination of pointwise and depthwise separable convolutional layers. Pointwise convolution is a standard 1 × 1 convolution that performs a linear transformation on the input and adjusts the channel dimensionality. Depthwise convolution, on the other hand, applies a single filter per channel to filter the features. Batch normalization and ReLU activation are applied after the depthwise convolutional layer in our network. The same pattern of layers is used for the skip connection layers, while the spatial dimension is handled through max-pooling operations. This design is implemented within both the LogoNet and Dual-Attention LogoNet architectures. In our approach, each standard residual convolution block in the architecture is converted into a depthwise convolution block, following the MobileNetV2 block paradigm. However, we apply this conversion only to the layers within the hourglass module, while feature transformations in other layers, including the attention modules, use standard convolution operations. This strategy reduces network complexity and computational load compared to standard convolutions. The computation involved in depthwise convolution can be expressed as:

$$\hat{O}_{l,m,c} = \sum_{i,j} \hat{K}_{i,j,c} \cdot F_{l+i-1,\, m+j-1,\, c} \tag{4.6}$$

where $F$ and $\hat{O}$ are the input and output feature maps with $C$ channels. $\hat{K}$ is a depthwise convolution kernel of size $D_K \times D_K \times D_C$, where $D_K$ is the kernel size, which is 3 in our case. For a feature map of height and width $D_F$, the total computation cost of the depthwise and pointwise convolution operations can be computed as:

$$C_{in} \cdot C_{out} \cdot D_F \cdot D_F + D_K \cdot D_K \cdot C_{out} \cdot D_F \cdot D_F \tag{4.7}$$

where $C_{in}$ and $C_{out}$ are the numbers of input and output channels. For comparison with the proposed architecture, we also build lightweight models that exploit CP decomposition (CPD) [17]. CPD is a typical method for reducing complexity, which factorizes a tensor into a sum of outer products of vectors. For a given three-dimensional tensor, the CP decomposition can be written as:

$$T \approx \sum_{r=1}^{R} l_r \circ m_r \circ n_r \tag{4.8}$$

where $R > 0$, $l_r$, $m_r$ and $n_r$ are vectors of appropriate dimensions, and '$\circ$' denotes the outer product, i.e., element-wise,

$$t_{i,j,k} \approx \sum_{r=1}^{R} l_{ri}\, m_{rj}\, n_{rk} \tag{4.9}$$

Fig. 4.8 (a) Convolutional block with our lightweight module. (b) Convolutional block with the CPD method

In the rank-one case of CPD (i.e., $R = 1$), the 4D kernel $\hat{C} \in \mathbb{R}^{X \times Y \times Z \times S}$ is separated into the product of four 1D filters as follows:

$$\hat{C} = \alpha \times \beta \times \gamma \times \eta \tag{4.10}$$

where $\alpha, \beta, \gamma$ are 1D convolution vectors convolving across the dimensions and the fourth, $\eta$, corresponds to the channels. We convert each standard convolution into two 1D convolutions within each residual block of the proposed feature extractor. We use 1D convolutions along two axes ($X \times 1$ and $1 \times Y$) to convolve the feature maps. First, we convolve the features with a single filter per channel (depthwise) using a kernel of size $3 \times 1$. Then a kernel of size $1 \times 3$ is applied to map the number of feature channels. The same approach is applied to the skip connection layer to transform the feature maps. The block structures of the feature extractor with the depthwise convolution and CPD methods are shown in Fig. 4.8.
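To give a feel for the savings discussed in this section, the sketch below compares the cost in Eq. (4.7) with that of a standard convolution, and counts weights for the three block variants. It rests on several assumptions that are not stated in the text: bias and batch normalization parameters are ignored, the standard convolution cost uses the usual $D_K \cdot D_K \cdot C_{in} \cdot C_{out} \cdot D_F \cdot D_F$ expression, and the CPD-style count reflects one reading of the description above (a depthwise 3 × 1 convolution followed by a 1 × 3 convolution that maps the channels). All function names are hypothetical.

```python
def standard_conv_cost(c_in, c_out, d_k, d_f):
    """Multiply-accumulate count of a standard d_k x d_k convolution."""
    return d_k * d_k * c_in * c_out * d_f * d_f


def separable_conv_cost(c_in, c_out, d_k, d_f):
    """Eq. (4.7): pointwise cost plus depthwise cost."""
    pointwise = c_in * c_out * d_f * d_f
    depthwise = d_k * d_k * c_out * d_f * d_f
    return pointwise + depthwise


def standard_conv_params(c_in, c_out, d_k):
    """Weights of a standard convolution (biases ignored)."""
    return d_k * d_k * c_in * c_out


def separable_conv_params(c_in, c_out, d_k):
    """Weights of a pointwise convolution followed by a depthwise one."""
    return c_in * c_out + d_k * d_k * c_out


def cpd_style_params(c_in, c_out, d_k):
    """Weights of a depthwise d_k x 1 convolution plus a 1 x d_k channel-mapping one."""
    return d_k * c_in + d_k * c_in * c_out


c_in, c_out, d_k, d_f = 256, 256, 3, 64
cost_ratio = (separable_conv_cost(c_in, c_out, d_k, d_f)
              / standard_conv_cost(c_in, c_out, d_k, d_f))
print("cost ratio (separable / standard):", cost_ratio)
print("params standard:", standard_conv_params(c_in, c_out, d_k))
print("params separable:", separable_conv_params(c_in, c_out, d_k))
print("params CPD-style:", cpd_style_params(c_in, c_out, d_k))
```

The cost ratio simplifies to $1/D_K^2 + 1/C_{in}$, so for 3 × 3 kernels the separable layer needs roughly one ninth of the computation of a standard layer. The whole-network savings reported later are smaller than this per-layer figure because only the hourglass layers are converted.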

4.1.7 Experiments and Results

In this section, we present the results achieved with the proposed methods.

4.1.8 Implementation

To evaluate performance, we compare our proposed method with CenterNet (baseline) [1], Faster R-CNN [9] and SSD [10]. Performance is measured in terms of mAP and detection time. For the CenterNet framework, training was conducted with a batch size of 2 for 140 epochs. We use HourglassNet-104 as the feature extraction backbone, pretrained on the COCO dataset from ExtremeNet [11]. The initial learning rate is $1.25 \times 10^{-4}$, which is multiplied by 0.1 at epochs 90 and 120. The Adam optimizer [12] is used for network optimization. A spatial resolution of 512 × 512 is used for the input image. The Faster R-CNN detector is trained with a ResNet-50 backbone [13]; this model is trained for 50 epochs with a batch size of 4 and a learning rate of 0.001. The SSD network is trained using a VGG16 backbone [14] with a batch size of 4 and an initial learning rate of 0.001, for 16,000 iterations. The experimental results are reported as the percentage (%) mAP over all logo classes at an Intersection over Union (IoU) threshold of 0.5. The average inference time is given for one image. Inference time is measured on our machine with an Intel Core i7-8700 CPU, a GeForce GTX 980 Ti GPU, PyTorch 0.4.1, CUDA 9.0 and cuDNN 7.1.
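The step learning-rate schedule described for the CenterNet training can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, epochs are counted from zero, and each decay is taken to apply from the start of the milestone epoch.

```python
def learning_rate(epoch, base_lr=1.25e-4, milestones=(90, 120), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


for epoch in (0, 89, 90, 119, 120, 139):
    print(epoch, learning_rate(epoch))
```

Over the 140 training epochs this gives three plateaus: the initial rate for epochs 0 to 89, one tenth of it for epochs 90 to 119, and one hundredth of it for the remaining epochs.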

4.1.9 Evaluation on FlickrLogos-32 Dataset

Logo images from the FlickrLogos-32 dataset [8] were used for training. The FlickrLogos-32 dataset has 32 logo classes, each containing 70 images for experiments. For each class, we use 30 images for training, 10 images for validation and 30 images for testing. The 960 test images contain a total of 1602 logo objects across the different categories. Table 4.2 shows the details of the ablation study on the FlickrLogos-32 dataset. According to the results, mAP improves slightly when we aggregate feature maps at different scales (Proposed Method 1) or when we add the spatial attention module to the baseline network (Proposed Method 2). When the spatial attention module is combined with the layer-aggregated feature maps, detection accuracy improves more clearly (LogoNet, Proposed Method 3). Adding the channel-wise response improves detection accuracy further (Dual-Attention LogoNet, Proposed Method 4). We attribute the effectiveness of our methods to two factors: (i) aggregating feature maps at different scales improves the global feature representation, and (ii) combining the attention modules with the network generates a balanced and robust feature map with significant visual and semantic detail.

Table 4.2 Ablation experiments on FlickrLogos-32 Dataset

| Methods | Layer aggregation | Spatial attention | Channel attention | mAP |
|---|---|---|---|---|
| CenterNet (baseline) | | | | 80.7 |
| Proposed method 1 | √ | | | 81.0 |
| Proposed method 2 | | √ | | 80.8 |
| Proposed method 3 | √ | √ | | 82.2 |
| Proposed method 4 | √ | √ | √ | 82.5 |

Table 4.3 reports mAP and detection time for different detectors on the FlickrLogos-32 dataset. The methods are: Faster R-CNN with ResNet-50, SSD with VGG16, CenterNet with HourglassNet, CenterNet with SENet HourglassNet, CenterNet with ECANet HourglassNet, CenterNet with the channel attention module of CBAM [15] added to our proposed spatial attention module and backbone network, LogoNet, and Dual-Attention LogoNet.

Table 4.3 Performance evaluation of State-of-the-Art methods on the FlickrLogos-32 Dataset

| Methods | Detection accuracy (mAP, %) | Detection time (s) |
|---|---|---|
| SSD [10] | 76.7 | 0.0531 |
| Faster-RCNN [9] | 81.0 | 0.1115 |
| CenterNet (baseline) [1] | 80.7 | 0.1083 |
| CenterNet (SENet [6]) | 80.2 | 0.1354 |
| CenterNet (ECANet [5]) | 79.0 | 0.1260 |
| CenterNet (CBAM [15]) | 81.4 | 0.2010 |
| LogoNet | 82.2 | 0.1145 |
| Dual-Attention LogoNet | 82.5 | 0.1166 |

SSD achieves 76.6% mAP with the fastest detection time of 0.0531 s. Faster R-CNN has 81.0% accuracy with a 0.1115 s inference time. CenterNet with Hourglass achieves 80.7% accuracy with a detection time of 0.1083 s. There is a slight drop in the performance of CenterNet-Hourglass with the SENet and ECANet blocks: these approaches reach 80.2% and 79.0% accuracy with 0.1354 s and 0.1260 s detection time, respectively. The channel attention module of CBAM [15], employed with our proposed spatial attention module and backbone network, improves accuracy by around 0.7% in comparison to the baseline method, but its detection time is considerably higher (0.2010 s per image). LogoNet shows a clear improvement over the conventional methods at a comparable detection time, with 82.2% mAP and a 0.1145 s inference time. Our proposed Dual-Attention LogoNet improves performance further, with 82.5% mAP and a 0.1166 s detection time. The logo detection performance is depicted in Fig. 4.9. Figure 4.10 visualizes the feature maps from the last layer of several methods, including CenterNet, CenterNet with ECANet, LogoNet, and Dual-Attention LogoNet. These binary output images illustrate the response of the different attention-weighting methods. Our spatial attention and dual-attention-based methods emphasize logo regions while minimizing noise.


Fig. 4.9 Visualization of multiple logos detection and effectiveness of our approaches

4.1.10 Evaluation with Lightweight CNNs Method

We evaluated the efficacy of the proposed lightweight CNN methods on the FlickrLogos-32 dataset. Table 4.4 presents the detection accuracy in terms of mAP, the number of parameters in millions, and the per-image detection time in seconds. For the lightweight architectures, training was conducted with a batch size of 4 over 140 epochs. The remaining parameter settings are the same as before. Due to the limited data, we initialize the network weights using the (non-logo) PASCAL VOC object detection images [16]. The use of depthwise and pointwise convolution operations results in a significant reduction in network parameters compared to standard convolution operations. This reduction contributes to faster computation, albeit with a slight decline in detection accuracy. To provide a comparative analysis, we implemented the CPD and lightweight CNN modules within the CenterNet, LogoNet, and Dual-Attention LogoNet frameworks. The CenterNet architecture based on the CPD method (CenterNet-CPD) achieves 77.9% detection accuracy with 72.42 million parameters and a detection time of 0.1145 s. The LogoNet-CPD network achieves a higher accuracy of 78.8% with a detection time of 0.1073 s, and the Dual-Attention LogoNet-CPD network achieves 78.9% accuracy with a 0.1145 s detection time. LogoNet-CPD and Dual-Attention LogoNet-CPD use around 73.19 million parameters. With our lightweight modules, CenterNet (CenterNet-Lightweight) achieves 79.0% accuracy with a 0.0833 s detection time, using 27.94 million parameters. Dual-Attention LogoNet-Lightweight achieves 79.5% accuracy with a detection time of 0.0979 s. The proposed LogoNet-Lightweight network achieves the highest accuracy among the lightweight models, 79.7%, which is slightly lower than that of the baseline (CenterNet) and LogoNet methods (80.7% and 82.2%). LogoNet-Lightweight takes 0.0885 s per image, which is about 20% faster than the baseline method. The LogoNet-Lightweight and Dual-Attention LogoNet-Lightweight architectures use only 28.73 million parameters, which is only about 15% of the parameters of the standard baseline method (CenterNet). We observed that LogoNet-Lightweight, which incorporates only the spatial attention module, outperforms Dual-Attention LogoNet-Lightweight in both detection time and accuracy. This approach also leads to faster training and convergence of the network. Given the use of depthwise convolution operations in the lightweight modules, channel-wise attention proves to be less effective. A model with few parameters and a reasonable accuracy is preferable for edge computing devices. We believe our lightweight algorithm is better suited to low-spec machines and edge computing than conventional algorithms.

Fig. 4.10 Visualizing different attention specific response for logo detection

Table 4.4 Performance evaluation of the Lightweight methods on FlickrLogos-32 Dataset

| Detectors | Lightweight methods | mAP | Parameters (M) | Detection time (s) |
|---|---|---|---|---|
| CenterNet | No | 80.7 | 191.26 | 0.1093 |
| CenterNet | CPD [17] | 77.9 | 72.42 | 0.1145 |
| CenterNet | Proposed method | 79.0 | 27.94 | 0.0833 |
| LogoNet | No | 82.2 | 192.05 | 0.1145 |
| LogoNet | CPD [17] | 78.8 | 73.19 | 0.1073 |
| LogoNet | Proposed method | 79.7 | 28.73 | 0.0885 |
| Dual-Attention LogoNet | No | 82.5 | 192.05 | 0.1166 |
| Dual-Attention LogoNet | CPD [17] | 78.9 | 73.20 | 0.1145 |
| Dual-Attention LogoNet | Proposed method | 79.5 | 28.73 | 0.0979 |
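The relative figures quoted above can be checked directly from the values reported in Table 4.4. The short calculation below uses only those reported numbers; the variable names are introduced here for illustration.

```python
# Values reported in Table 4.4 (mAP in %, parameters in millions, time in seconds).
baseline = {"map": 80.7, "params_m": 191.26, "time_s": 0.1093}
logonet_lightweight = {"map": 79.7, "params_m": 28.73, "time_s": 0.0885}

# Share of the baseline parameter count kept by the lightweight model.
param_fraction = logonet_lightweight["params_m"] / baseline["params_m"]

# Relative reduction in per-image detection time.
time_reduction = 1.0 - logonet_lightweight["time_s"] / baseline["time_s"]

# Accuracy given up in exchange, in mAP points.
map_drop = baseline["map"] - logonet_lightweight["map"]

print(f"parameters kept: {param_fraction:.1%}")
print(f"detection time reduced by: {time_reduction:.1%}")
print(f"mAP drop: {map_drop:.1f} points")
```

The calculation gives a parameter share of about 15% and a detection time reduction of about 19%, in line with the approximate figures stated in the text, for a loss of one mAP point relative to the CenterNet baseline.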

4.2 Summary

We perform logo detection using attention-based mechanisms with an anchor-free detector. To achieve precise and efficient logo detection, we propose a framework that consists of a robust feature extraction network with spatial and channel attention modules and an anchor-free detection head. With LogoNet, we achieve robust and accurate logo detection. We also propose a lightweight CNN module architecture for fast detection and practical applications, which reduces the number of network parameters and the computational complexity.

References

1. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points (2019). arXiv:1904.07850
2. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV. Springer (2016)
3. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. Int. J. Comput. Vis. (2019)
4. Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
5. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 14–19 (2020)
6. Hu, J., Shen, L., Sun, G.: Squeeze and excitation networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7142 (2017)
7. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
8. Romberg, S., Pueyo, L.G., Lienhart, R., van Zwol, R.: Scalable logo recognition in real-world images. In: The 1st ACM International Conference on Multimedia Retrieval, USA, pp. 1–8 (2011)
9. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. (2017)
10. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: Proceedings of the ECCV, pp. 21–37 (2016)
11. Zhou, X., Zhuo, J., Krähenbühl, P.: Bottom-up object detection by grouping extreme and center points. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 850–859 (2019)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations (ICLR) (2015)
13. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556 [cs]
15. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of European Conference on Computer Vision, pp. 3–19 (2018)
16. Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010)
17. Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., Lempitsky, V.: Speeding-up convolutional neural networks using fine-tuned CP-decomposition (2014). arXiv:1412.6553
18. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2999–3007 (2017)

Chapter 5

Mitigating Domain Shift in Logo Detection: An Adversarial Learning-Based Approach

Abstract This chapter discusses a domain adaptation-based technique aimed at mitigating the domain-shift challenge in logo detection. In real-world applications, we face a domain shift problem between the training data (source domain) and the test data (target domain), which reduces performance. The domain gap, or domain shift, is caused by the difference between the feature distributions of the training and test data. In practical scenarios, trained models are deployed to detect objects in images from various platforms, which may have different feature distributions and styles from the training data, and this often results in decreased performance. As a result, the models must be retrained with user-specific annotated training data, which is time-consuming, laborious and expensive, and therefore impractical for real-world applications. To address this issue, this study introduces a domain adaptation-based technique for training the detection framework that aligns the network across different logo datasets. The proposed method uses unlabelled data samples from the target domain alongside labelled source domain data during model training to generalize the detection framework. To bridge the gap between domains, we exploit adversarial domain adaptation learning. This is a pragmatic way of dealing with the domain-shift problem using an anchor-free object detector.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
Y.-W. Chen et al., Recent Advances in Logo Detection Using Machine Learning Paradigms, Intelligent Systems Reference Library 255, https://doi.org/10.1007/978-3-031-59811-1_5

5.1 Domain Shift Problem

Domain shift, or domain gap, denotes a difference in data distributions between the source domain (where the model is trained) and the target domain (where the model is applied). In machine learning algorithms, feature extraction plays a crucial role because it maps input data to the corresponding output categories. These algorithms acquire their feature extraction capabilities from the provided sample data, such as images, text, or other modalities, through a process known as model learning. During training, models tend to become biased towards representations specific to the given data samples or domain. In practical applications, however, different datasets may contain varying data features or distributions across separate domains. This difference can degrade performance when models trained on one domain are applied to another. The shift can arise from various factors such as data drift and differences in lighting conditions, viewpoints, contextual backgrounds, and object appearances. If the model is not exposed to such variations during training, it may have difficulty generalizing to the target or test domain.

In real-world scenarios, logo detection frameworks often face challenges due to domain shift. Domain shift in logo recognition occurs when the characteristics of a logo in the training dataset (source domain) differ from the characteristics encountered during deployment (target domain). Figure 5.1 illustrates the difference between images from different datasets. These differences may arise from variations in the appearance of the logo, such as changes in lighting conditions, camera angle, advertising platform, weather conditions or logo design. Consequently, deep learning-based logo recognition models suffer when applied in new or unseen environments, leading to reduced performance and accuracy.

Fig. 5.1 Illustration of domain gap between images from different datasets

Conventional methods require users to retrain the model using their own annotated training data, because supervised neural networks rely heavily on the diversity and quantity of training samples. However, annotations may be unavailable for the new test dataset, and producing them is a time-consuming, expensive and laborious task. It is therefore essential to design resilient and adaptable networks capable of achieving better performance under new environmental conditions and data distributions. Recent techniques such as domain adaptation and domain generalization aim to mitigate the domain shift problem by aligning the distributions of the source and target domains. These methods generate domain-invariant representations that generalize robustly across diverse domains. By effectively handling domain shift, domain adaptation techniques enable detection frameworks to adapt and generalize well to new and unseen domains, enhancing their practical applicability and performance. Figure 5.2 depicts a symbolic representation of the domain adaptation process for feature extraction.

Fig. 5.2 Symbolic representation of the domain adaptation process

5.2 Domain Adaptation for Computer Vision Tasks

Despite advances in deep learning, there remains substantial scope for improving training schemes so that AI-based techniques become practical for real-world applications. Efficient domain adaptation approaches are needed across computer vision tasks to mitigate the effects of domain shift and insufficient training data. How to train a well-generalized model that can be applied effectively to new test domains, whose features and data distribution differ from those of the training dataset, is an active research topic. The ability of a model to generalize from one domain (the source domain, or training data) to another (the target domain, or test data) remains a significant challenge, and domain adaptation methods serve as valuable tools for bridging this gap. Apart from improving performance, domain adaptation methods increase the reusability and usefulness of deep learning frameworks. They can be applied to various real-world applications, including classification, detection, face recognition, and style translation.

Adversarial learning is one of the most widely used approaches for domain adaptation: a discriminator is trained to distinguish between source and target domains, while the model simultaneously learns representations that are invariant to domain shift. These methods facilitate knowledge transfer across domains by aligning the deep learning framework towards new test data. Hence, they offer an appealing solution for various tasks by adapting networks from a well-annotated dataset to an unlabeled or sparsely annotated dataset. Adversarial learning techniques focus on learning a common shared feature space in which the distributions of the two datasets (source/labeled and target/unlabeled) are aligned. This alignment helps the model learn domain-invariant features that are robust to variations in the appearance of logos, improving the generalization performance of the detector in diverse real-world settings.

For the feature alignment process, domain adaptive losses are computed to train the model. These losses aim to minimize the distribution discrepancy between the source domain (labeled data) and the target domain (unlabeled data). During training, both annotated source domain images and unlabeled target domain images are fed to the model. Typically, this type of learning is conducted through adversarial training, where the model learns to predict the task (e.g., logo detection) and to fool a domain discriminator into believing that the features extracted from both domains are indistinguishable. The domain discriminator is a classifier, implemented as a separate neural network component and integrated into a task-specific framework such as an object detector. The primary purpose of the domain adaptive loss function is to ensure that the learned representations are domain-invariant, that is, they capture the inherent characteristics of the data while remaining insensitive to domain-specific variations. By minimizing the domain discrepancy, the model becomes robust to variations across domains and generalizes well to unseen data. Various loss functions can be used to implement the adversarial learning scheme, including adversarial domain classification loss, discrepancy loss, and gradient reversal loss. These loss functions are combined with the overall training objective of the model, which includes the task-specific loss (e.g., classification or detection loss). Finally, the model acquires the capability to extract features that are tailored to the particular task while also being domain-invariant, resulting in enhanced performance in the target domain. Figure 5.3 shows a simple overview of the domain adaptive training scheme.

Fig. 5.3 Feature alignment process using domain adaptive techniques. In this setting some objective loss or parameter is calculated to train the model

5.3 Related Work: Domain Adaptation

Unsupervised domain adaptation methods operate under the assumption that no object-level annotations are available for the target domain. Consequently, they offer a solution to the data annotation problem for new test data, while also making models domain-invariant and suitable for practical applications. To date, a broad range of domain adaptation techniques have been presented to bridge the feature distribution gap between a source (training) domain and a target (test) domain. Ma et al. [1] proposed an entropy optimization scheme that uses a minimization and maximization strategy at the final output layer of the network to perform image classification. For domain adaptation, they combined the minimal and maximal entropies of source and target data. Hoffman et al. [2] proposed using the Cycle-GAN [3] approach to adapt feature representations by generating target images conditioned on the source images for the image segmentation task. Scene adaptation via channel-wise feature alignment was proposed by Wu et al. [4]. For semantic segmentation, Vu et al. [5] demonstrated adversarial learning using entropy minimization at the heatmap-level output. To reduce the domain gap, Chen et al. [6] proposed calculating the maximum square loss for the target images. Shen et al. [7] exploited domain adaptation for medical images (DICOM format): they accumulated the domain adaptive loss over a number of training iterations to minimize the domain gap between two different datasets of mammogram images.

Chen et al. [8] introduced domain adaptive Faster R-CNN [9] for detection tasks using instance-level and image-level adaptation. A consistency loss is applied for image-level and instance-level domain classification. The gradient reversal layer (GRL) [10] is placed between the domain classifier and the base convolutional layers (i.e., the feature extraction network). The GRL accomplishes domain adaptation training by flipping the sign of the gradients during backpropagation. The SW-Faster-RCNN approach [11] improved this method by introducing a strong local alignment model instead of the instance-level alignment scheme. Further, Xu et al. [12] proposed categorical regularization and designed a consistency regularization to align model learning between the two domains. Most recently proposed domain adaptive schemes for the detection task are designed for the Faster R-CNN [9] framework. Additionally, these studies focus on adapting weather conditions from normal urban street scene images to synthetic fog images, which are variations of the original images generated mostly by changing illumination (pixel intensity) conditions. In most of these studies, domain adaptation is carried out from the Cityscapes dataset (real images) [13] to the Foggy Cityscapes dataset (synthetic images) [14].

In some recent studies, domain adaptation has been considered for the logo detection task. Su et al. [15] proposed a domain adaptation-based training scheme for logo detection using the Faster R-CNN framework [9]. They employed a gradient reversal layer [10] to learn generalized feature characteristics. For training, they used real logo images (for both source and target domains), synthetic logo images, and non-logo images. In their method, real logo images with object-level annotations are used for half of the logo categories, and synthetic images are used for all logo classes. This method needs annotated real-world logo images, and the inclusion of many non-logo images increases the cost of training. Scheck et al. [16] studied domain adaptation from synthetic to real images for the detection task by analyzing two domain adaptive losses. However, they do not use a discriminator network or adversarial learning based on an adversarial domain classification loss in their anchorless detection network.
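The gradient reversal layer used in several of the methods above can be illustrated with a minimal sketch. The code below is a framework-free toy, not the implementation of [10]: the class name `GradientReversal` and its `scale` parameter are hypothetical, and real systems would implement the same idea as a custom autograd operation in a deep learning library.

```python
class GradientReversal:
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical scaling factor for the reversed gradient

    def forward(self, features):
        # Features pass through unchanged on the way to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # Gradients flowing back to the feature extractor have their sign flipped.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
features = [0.2, -1.0, 3.0]
grads_from_domain_classifier = [1.0, -2.0, 4.0]

out = grl.forward(features)
grads_to_backbone = grl.backward(grads_from_domain_classifier)
print(out)
print(grads_to_backbone)
```

Because the sign is flipped, a gradient step that would help the domain classifier separate the two domains instead pushes the feature extractor towards features the classifier cannot separate.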

5.4 Adaptation Using Anchorfree Object Detector for Logo Detection

In this section, we discuss a domain adaptation technique that uses an anchorfree detector to improve model generalization. We introduce a logo detection method based on adversarial training. To bridge the domain gap between source (training) and target (test) data, we use annotated logo images from the source domain and unlabeled logo images from the target domain during training. Inspired by existing adversarial learning-based methods primarily designed for segmentation applications [5–7], we design an approach tailored for logo detection. Given that anchorfree detectors are trained to recognize objects in terms of keypoints, we propose utilizing mid-level output feature maps instead of class-wise heatmaps to align the distributions of the target and source domains. Our adversarial learning approach is motivated by the observation that mid-level (feature-level) outputs carry robust information about the domain while retaining object-level details. We calculate an adversarial domain classification loss on the feature-level output, incorporating a domain discriminator (CNN) network into the LogoNet framework to facilitate adversarial learning. The detection network and the integrated discriminator can be trained in an end-to-end manner. This method can be easily adapted to other anchorfree detectors. The details of the proposed method using LogoNet are described below.

To achieve alignment across divergent domains, we leverage adversarial learning by integrating a domain discriminator network into the detection framework during the training phase. The architectural layout of the LogoNet framework, coupled with the proposed domain adaptation training methodology, is shown in Fig. 5.4. This architecture comprises a feature extraction network, a detection module [17], and a domain discriminator network. The detection module has three components, each a set of CNN layers: a heatmap head (for generating class-wise heatmaps), an offset head (for pinpointing object locations), and an object-size head (for estimating the size of objects). Functioning as an anchorfree detector, this framework generates class-specific heatmaps from the feature-level output space (referred to as the mid-level output) of the feature extraction backbone network. The offset and size output maps are generated by separate convolutional layers to complete the detection objective.

Fig. 5.4 Network architecture of LogoNet with domain adaptation setting

Previously proposed adversarial learning-based schemes, initially designed for semantic segmentation tasks, typically use the final class-wise output (segmentation mask) of the framework. Given that anchorfree detectors train the network to identify objects based on keypoints, we have observed that reliance on class-wise heatmaps can result in the loss of crucial domain-specific information. Selecting the most appropriate output feature maps is therefore crucial for effectively closing the domain gap, and we propose to use the mid-level outputs of the feature extraction network. The main advantage of the mid-level output is that it contains essential domain-specific semantic and visual information, which makes it well suited to adversarial learning. Using the design advantages of anchorfree detectors, LogoNet generates mid-level output feature maps for images from both the source domain and the target domain. The mid-level output maps of the source images are fed into the detection heads (heatmap head, offset head, size head) to train the network for their respective tasks, while the mid-level output feature maps of the target images are used to compute the adversarial loss, which aims to align the data distributions of the source and target domains. As a result, no object-level annotations are needed for the target images.

We assume that the source domain $S$ contains $N$ images with corresponding object-level annotations, $\{x_i^s \in X^S, y_i^s \in Y^S\}$, where $X^S$ is the set of input images in the source space and $Y^S$ denotes the set of corresponding labels. The target domain $T$ contains $M$ images without object-level annotations, $\{x_i^t \in X^T\}$, where $X^T$ denotes the set of images in the target domain. To employ the adversarial learning technique, we add a domain discriminator network to the LogoNet framework, which introduces the adversarial loss ($L_{adv}$) and the classification loss ($L_{cls}$). The domain discriminator network consists of 5 convolution layers with a kernel size of $4 \times 4$ and a stride of 2. Each layer except the last is followed by a leaky-ReLU activation with a fixed negative slope of 0.2. The numbers of output channels are [64, 128, 256, 512, 1] for the five layers, respectively. Finally, a classification layer gives the classification outputs.

Table 5.1 The design of the discriminator network

Layer name    Output dimension    Operation, kernel size, output channels, stride
Layer1        128 × 128           Conv, 4 × 4, 64, 2
Layer2        64 × 64             Conv, 4 × 4, 128, 2
Layer3        32 × 32             Conv, 4 × 4, 256, 2
Layer4        16 × 16             Conv, 4 × 4, 512, 2
Layer5        4 × 4               Conv, 4 × 4, 1, 2
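The layer configuration of the discriminator can be checked with the standard convolution output-size formula. The sketch below is only illustrative: the kernel size, stride, and channel counts come from Table 5.1, whereas the 256 × 256 input resolution and the padding of 1 are assumptions, since the text does not state them. The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# Output channels per layer, taken from Table 5.1.
channels = [64, 128, 256, 512, 1]

# Assumed values, not stated in the text: a 256 x 256 input map and padding of 1.
size = 256
layer_shapes = []
for out_channels in channels:
    size = conv_output_size(size)
    layer_shapes.append((out_channels, size, size))

for index, shape in enumerate(layer_shapes, start=1):
    print(f"Layer{index}: channels={shape[0]}, output={shape[1]} x {shape[2]}")
```

Under these assumptions the first four layers reproduce the output dimensions listed in Table 5.1, while the fifth layer gives 8 × 8. The 4 × 4 entry listed for Layer5 implies a setting for the last layer that the text does not specify.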

The detailed structure and operations of the discriminator network are described in Table 5.1. We provide the mid-level outputs of the source image ($\mathrm{Mid\_X}^S$) and the target image ($\mathrm{Mid\_X}^T$) as inputs to the discriminator network, which classifies each input as coming from the source domain ($S$) or the target domain ($T$). The classification loss ($L_{cls}$) is used to update the weights of the discriminator network so that it becomes better at assigning inputs to their respective domains. We assign source-domain images the domain label ‘0’ and target-domain images the domain label ‘1’. The binary classification loss $L_{cls}$ (the training objective of the domain discriminator network) can be defined as:

$$
L_{cls} = \frac{1}{|X^S|} \sum_{i=1}^{|X^S|} L_{cls}\left(\mathrm{Mid\_X}^S_i, 0\right) + \frac{1}{|X^T|} \sum_{i=1}^{|X^T|} L_{cls}\left(\mathrm{Mid\_X}^T_i, 1\right) \tag{5.1}
$$

where $\mathrm{Mid\_X}^S_i$ and $\mathrm{Mid\_X}^T_i$ are the mid-level features of the $i$th source training sample and the $i$th target training sample, respectively, and $|X^S|$ and $|X^T|$ are the numbers of samples in the source and target domains, respectively. On the right-hand side, $L_{cls}(\cdot, d)$ denotes the per-sample binary classification loss for domain label $d$. Meanwhile, to bring the target domain ($T$) and source domain ($S$) distributions closer, we feed the mid-level output feature maps ($\mathrm{Mid\_X}^T$) of the target image into the discriminator network and compute the adversarial loss ($L_{adv}$) with an inverted domain label, i.e., ‘0’ instead of ‘1’. The adversarial binary classification loss $L_{adv}$ can be defined as:

$$
L_{adv} = \frac{1}{|X^T|} \sum_{i=1}^{|X^T|} L_{cls}\left(\mathrm{Mid\_X}^T_i, 0\right) \tag{5.2}
$$
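Equations (5.1) and (5.2) can be sketched numerically as follows. This is a simplified illustration under stated assumptions: the per-sample binary classification loss is taken to be binary cross-entropy, and the discriminator output for each sample is reduced to a single probability of belonging to the target domain rather than a spatial map. The function names and the toy values are hypothetical.

```python
import math


def bce(prob_target, label):
    """Binary cross-entropy for one sample; prob_target is the predicted
    probability that the input comes from the target domain (label 1)."""
    eps = 1e-12  # guards against log(0)
    p = min(max(prob_target, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def discriminator_loss(source_probs, target_probs):
    """Eq. (5.1): source samples carry label 0, target samples label 1."""
    source_term = sum(bce(p, 0) for p in source_probs) / len(source_probs)
    target_term = sum(bce(p, 1) for p in target_probs) / len(target_probs)
    return source_term + target_term


def adversarial_loss(target_probs):
    """Eq. (5.2): target samples scored against the inverted label 0."""
    return sum(bce(p, 0) for p in target_probs) / len(target_probs)


# Hypothetical discriminator outputs (probability of "target") for a toy batch.
source_probs = [0.1, 0.2]
target_probs = [0.9, 0.8]

l_cls = discriminator_loss(source_probs, target_probs)
l_adv = adversarial_loss(target_probs)
print(f"L_cls = {l_cls:.4f}, L_adv = {l_adv:.4f}")
```

In this toy batch the discriminator separates the domains well, so the classification loss is small while the adversarial loss is large. Minimizing the adversarial loss with respect to the detector pushes target features towards those the discriminator labels as source.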

The adversarial loss is backpropagated to update the weights of the LogoNet framework. The objective loss function of the detection network is given by:

$$
L_{det} = L_k + \lambda_{size} L_s + \lambda_{off} L_{off} + \lambda_{adv} L_{adv} \tag{5.3}
$$

where $L_k$ denotes the focal loss used for classification, $L_s$ and $L_{off}$ are $L_1$ loss functions, and $\lambda_{size}$ and $\lambda_{off}$ are the corresponding loss weights. $\lambda_{adv}$ is the weight of the adversarial loss, which we set to 0.001 in our experiments. This approach encourages the network to produce output feature map distributions for the target domain ($T$) that are similar to those of the source domain ($S$) by fooling the discriminator network. The task-specific detection network and the domain discriminator network are jointly trained in an end-to-end manner. The discriminator network is needed only during training; at inference it is dropped, and the normal detection pipeline is used to perform the detection task.
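As a small worked example of Eq. (5.3), the sketch below combines hypothetical per-batch loss values into the detection objective. Only the adversarial weight of 0.001 is taken from the text; the values used for the size and offset weights are placeholders assumed for illustration, and the function name is hypothetical.

```python
def detection_objective(l_k, l_s, l_off, l_adv,
                        lambda_size=0.1, lambda_off=1.0, lambda_adv=0.001):
    """Weighted sum of Eq. (5.3).

    lambda_adv = 0.001 follows the text; lambda_size and lambda_off are
    placeholder values assumed for illustration only.
    """
    return l_k + lambda_size * l_s + lambda_off * l_off + lambda_adv * l_adv


# Hypothetical per-batch loss values.
l_det = detection_objective(l_k=1.2, l_s=3.0, l_off=0.25, l_adv=2.0)
print(f"L_det = {l_det:.4f}")
```

With such a small adversarial weight, the detection terms dominate the objective and the adversarial term acts as a mild regularizer on the mid-level features.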

5.5 Evaluation with Adversarial-Based Domain Adaptation Using LogoNet

Logos-32plus was presented in [18] as an extended version of the FlickrLogos-32 dataset [19], with training images for the same 32 logo classes. To implement the adversarial domain adaptation approach, we use the FlickrLogos-32 dataset [19] as the source domain and the Logos-32plus dataset [18] as the target domain. The training images of the target domain (Logos-32plus) were collected to provide a comprehensive representation of real-world data. They were captured using various platforms and exhibit diverse characteristics such as varying sizes, shapes, illumination conditions, and viewpoints. In contrast, most of the training images in the source domain (FlickrLogos-32) were captured on planar and cylindrical surfaces and are limited to selected viewpoints. Figure 5.5 shows examples of images from the FlickrLogos-32 and Logos-32plus datasets. The data distributions of these two datasets clearly differ, so there is a large domain gap between them. Detection becomes very challenging when the model is trained on a source domain with a less comprehensive data representation and tested on a target domain that does not share the distribution and style of the source.

5.5.1 Experiments and Results

To address the domain-shift problem in our experiment, we required a test set with representations from a different domain. We built it by randomly selecting 30 images per class from the target domain (Logos-32plus dataset), resulting in a new test set of 960 images. The training set consisted of the training images from the source domain (FlickrLogos-32 dataset), and the remaining images from the target domain (Logos-32plus dataset) were used during the training phase. Note that in this experiment only the source domain (FlickrLogos-32 dataset) is annotated, while the target domain (Logos-32plus dataset) remains unlabeled. This scenario exemplifies domain shift (from FlickrLogos-32 to Logos-32plus) in the form of scene adaptation, as the two datasets feature training images with different data distributions and styles. Detailed information regarding the datasets is provided in Table 5.2. In more detail, the target domain comprises a total of 7830 images, of which 6870 are used for training (unlabeled) and the remaining 960 form the test set.

Fig. 5.5 Some examples of images from the FlickrLogos-32 and Logos-32plus datasets

Table 5.2 Detail of training data setting

Datasets                           Labeled training images    Unlabeled target images    Test images
FlickrLogos-32 (Source dataset)    960                        –                          –
Logos-32plus (Target dataset)      –                          6870                       960

For the parameter settings of the detection network, we followed the settings discussed in the previous chapter. The discriminator network was trained from scratch with the Adam optimizer and an initial learning rate of 0.0001, which is reduced by a factor of 0.1 at epochs 90 and 120. During training, mini-batches of the same size are drawn from the source and target domains. In each epoch, only a randomly selected subset of 960 of the 6870 available target images is used for training.

As an ablation study, Table 5.3 presents the detection results of LogoNet under three settings: normal training, LogoNet with domain adaptation using class-wise heatmaps, and LogoNet with domain adaptation using mid-level feature maps (our proposed approach). In our experiments, LogoNet trained under normal conditions achieves an mAP of 63.2%. In a previous study [16], the authors proposed the use of class-wise heatmaps to adapt to the domain shift from synthetic to real images. In our case, we use class-wise heatmaps for implementing adversarial domain adaptation. However, the heatmap-based domain adaptation yields a lower accuracy of 59.6% mAP. The results indicate a significant decrease in accuracy when heatmaps are used to align the domains, primarily for two reasons. First, class-wise heatmaps fail to preserve crucial image-level (contextual background) information. Second, anchorfree detectors train the network to detect objects based on keypoints, so significant domain-specific information is lost at the final detection layers. In contrast, LogoNet with mid-level domain adaptation improves performance, achieving an mAP of 64.5%, which is 1.3 percentage points higher than direct transfer of the detection network.

Table 5.3 Effectiveness of the domain adaptation using Logos-32plus dataset

Methods                                                                   mAP (%)
LogoNet (w/o domain adaptation)                                           63.2
LogoNet + Domain Adaptation (Class-wise heatmaps)                         59.6
LogoNet + Domain Adaptation (Proposed method, mid-level feature maps)     64.5

Table 5.4 presents the comparison results for domain adaptation-based methods. To compare with other state-of-the-art techniques, we trained a domain-adaptive Faster R-CNN [20] on our datasets. This approach employs a gradient reversal layer [10] to train both the generator (backbone network) and the discriminator network, with a ResNet-50 feature pyramid network [21, 22] as the backbone. Scheck et al. [16] proposed utilizing entropy minimization loss [5] and maximum square loss [6] for the detection task. We applied their network to our datasets with the provided parameter settings. Domain adaptation using Faster R-CNN achieves an mAP of 59.7% on our dataset, while the entropy minimization and maximum square loss-based networks achieve mAPs of 59.4% and 59.6%, respectively. Our proposed method improves detection performance, achieving an mAP of 64.5%. The results indicate that factors such as dataset size, style, and data distribution have a significant influence on performance. Figure 5.6 illustrates the effectiveness of the domain adaptation-based approach.

Table 5.4 Comparison with existing domain adaptation methods

Methods                                            mAP (%)
Scheck et al. [16] (Entropy minimization loss)     59.4
Scheck et al. [16] (Maximum square loss)           59.6
Hsu et al. [20]                                    59.7
Proposed method                                    64.5
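The data split and training schedule reported in this section can be sketched in a few lines. The code below is an illustration under assumptions: epochs are counted from zero, the learning rate drops take effect from the milestone epoch onward, and the function names and the random seed are hypothetical. Only the numbers (32 classes, 30 test images per class, 7830 target images, a base rate of 0.0001, drops by 0.1 at epochs 90 and 120, and 960 target images per epoch) are taken from the text.

```python
import random


def discriminator_lr(epoch, base_lr=1e-4, milestones=(90, 120), gamma=0.1):
    """Step decay: multiply the rate by gamma at each milestone epoch reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** drops)


def sample_target_subset(num_target_images=6870, subset_size=960, seed=None):
    """Pick the indices of the unlabeled target images used in one epoch."""
    rng = random.Random(seed)
    return rng.sample(range(num_target_images), subset_size)


# Sanity checks on the split sizes reported in the text.
num_classes, test_per_class = 32, 30
test_set_size = num_classes * test_per_class
unlabeled_pool = 7830 - test_set_size

subset = sample_target_subset(seed=0)  # seed is an arbitrary choice
print(test_set_size, unlabeled_pool, len(subset))
for epoch in (0, 89, 90, 120):
    print(epoch, discriminator_lr(epoch))
```

Drawing a fresh subset each epoch keeps the number of target images per epoch equal to the 960 labeled source images, which matches the equal mini-batch sizes described above.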

Fig. 5.6 Effectiveness of the domain adaptation approach

5.6 Summary

In this study, we introduced a domain adaptation-based approach for training an anchorfree detection framework, with the objective of aligning the network across datasets from different domains. Conventional supervised neural network methods rely heavily on annotated training data for effective model training. However, manual annotation of training images can be time-consuming, labor-intensive, and sometimes impractical, particularly when dealing with new test datasets. To address this challenge, our aim is to develop robust and generalized neural networks for logo detection. Our method incorporates unlabeled target domain data into the training process alongside annotated source domain data. By leveraging adversarial learning on mid-level feature maps, the resulting logo detection framework improves accuracy on the target domain over direct transfer (64.5% versus 63.2% mAP in our experiments).

References

1. Ma, A., et al.: Adversarial entropy optimization for unsupervised domain adaptation. IEEE Trans. Neural Netw. Learn. Syst. 33(11), 6263–6274 (2022)
2. Hoffman, J., et al.: CyCADA: Cycle-consistent adversarial domain adaptation. In: Proceedings of International Conference on Machine Learning, pp. 1994–2003 (2018)
3. Zhu, J., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2242–2251 (2017)
4. Wu, Z., et al.: DCAN: Dual channel-wise alignment networks for unsupervised scene adaptation. In: Proceedings of the European Conference Computer Vision, pp. 535–552 (2018)
5. Vu, T.H., et al.: ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2512–2521 (2019)
6. Chen, M., Xue, H., Cai, D.: Domain adaptation for semantic segmentation with maximum squares loss. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2090–2099 (2019)
7. Shen, R., et al.: Unsupervised domain adaptation with adversarial learning for mass detection in mammogram. Neurocomputing 393, 27–37 (2020)
8. Chen, Y., et al.: Domain adaptive faster R-CNN for object detection in the wild. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3339–3348 (2018)
9. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. (2017)
10. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: Proceedings of the International Conference on Machine Learning, pp. 1180–1189 (2015)
11. Saito, K., Ushiku, Y., et al.: Strong-weak distribution alignment for adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6956–6965 (2019)
12. Xu, C.D., et al.: Exploring categorical regularization for domain adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11721–11730 (2020)
13. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)
14. Sakaridis, C., et al.: Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 126(9), 973–992 (2018)
15. Su, H., et al.: Multi-perspective cross-class domain adaptation for open logo detection. Comput. Vis. Image Underst. 204, Art. no. 103156 (2021)
16. Scheck, T., et al.: Unsupervised domain adaptation from synthetic to real images for anchorless object detection. In: Proceedings of Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 319–327 (2021)
17. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points (2019). arXiv:1904.07850
18. Bianco, S., Buzzelli, M., Mazzini, D., Schettini, R.: Deep learning for logo recognition. Neurocomputing 245, 23–30 (2017)
19. Romberg, S., Pueyo, L.G., Lienhart, R., Zwol, R.V.: Scalable logo recognition in real-world images. In: The 1st ACM International Conference on Multimedia Retrieval, USA, pp. 1–8 (2011)
20. Hsu, H.K., et al.: Progressive domain adaptation for object detection. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 738–746 (2020)
21. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
22. Lin, T.-Y., et al.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), USA (2017)

Chapter 6

Unsupervised Logo Detection with Adversarial Domain Adaptation from Synthetic to Real Images

Abstract In this chapter, we address the challenges posed by limited training data and domain shift in the field of logo detection. Recent advances have shown the effectiveness of convolutional neural networks (CNNs) trained on simulated or synthetic images for detecting objects in real-world images. Synthesized training images with automatically generated object-level annotations offer a promising alternative to the laborious and costly task of bounding box annotation. However, real-world conditions limit this assumption: object detectors face domain shift problems, which degrade performance. Knowledge transfer from one domain (synthetic images) to another (real-world images) suffers from domain shift because of the large differences in data styles and distributions between the source and target domains. This chapter discusses an approach that uses only synthesized images for model training and adapts knowledge from unlabeled real-world logo images. We generate synthesized logo images with automatically generated bounding box annotations to facilitate model training. Additionally, to close the domain gap between synthetic and real-world images, we propose entropy minimization of the mid-level output feature space. Our experiments show that the proposed method improves performance on different logo datasets compared to direct transfer from source to target domain (synthetic-to-real images), without any labeling cost or increase in network parameters.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 Y.-W. Chen et al., Recent Advances in Logo Detection Using Machine Learning Paradigms, Intelligent Systems Reference Library 255, https://doi.org/10.1007/978-3-031-59811-1_6

6.1 Unsupervised Domain Adaptation: Synthetic to Real Logo Detection

We use automatically generated synthetic logo images with object-level annotations, together with real-world logo images without any labels, for training. This methodology follows the paradigm of unsupervised domain adaptation. Unsupervised domain adaptation approaches include samples from the target domain dataset along with the source domain during training, assuming that the target domain has no annotations. These techniques improve the network’s performance on the target (test) domain by aligning the network parameters between the source and target domains. As a result, they address the problem of data labeling for new target data and are suitable for practical use. In this study, we explore logo detection from a novel perspective, taking into account both the challenge of data annotation and the domain shift problem. We realize unsupervised domain adaptation using synthesized logo images with automatically generated bounding box annotations. The use of synthetic images, however, causes a significant domain gap between the training (synthetic) and test (real-world) images, and the difference between the feature distributions of synthetic and real-world images leads to poor performance. To reduce the domain gap and train a reliable, robust deep model, we implement an adversarial domain classification-based learning strategy. This involves integrating a domain classifier network into the detection framework to facilitate alignment between the source and target domains. Furthermore, to reduce domain variation, we propose pixel-wise entropy minimization of mid-level feature maps. The proposed scheme implements entropy-based domain adaptation, generating low-entropy predictions on target domain images by imposing a constraint on independent pixel-wise dense predictions in the mid-level feature maps. By using weighted self-information-based (entropy map) adversarial learning, we aim to efficiently utilize and align the feature distributions of the source and target domains on a global scale. We also investigate various domain adaptive losses at different output levels, in an effort to find outputs that can effectively learn generalized features. While our framework is versatile and adaptable to various tasks, our primary focus in this study is unsupervised domain adaptation from synthetic to real logo images. Additionally, we synthesize 1-shot icon logo exemplar images with automatically generated bounding box annotations. Experimental results and feature distribution analysis demonstrate that the proposed approach effectively improves the performance of the anchorless detector for logo detection.

6.2 Synthesize Images to Avoid Manual Annotation Task

The considerable expense of object-level annotation encourages researchers to explore viable and pragmatic approaches for incorporating synthetic or simulated images into model training. Synthetic images are a valuable resource for training deep learning-based models: because bounding box annotations or segmentation masks can be generated automatically for synthesized images, they can supplement large annotated training sets while avoiding the cost and time of data labeling. However, the efficacy of supervised neural networks relies heavily on the diversity and size of the training samples, and synthetic images frequently yield suboptimal performance because of these limitations. The domain gap between synthetic images (source training dataset) and real-world images (target test dataset) significantly undermines the model’s performance. This discrepancy often leads to a domain shift problem, which can arise from variations in features, attributes, styles, and distributions between the datasets [1]. The challenge of domain shift or domain gap, where the same object categories exhibit distinct data features, is prevalent


Fig. 6.1 Different appearances and distributions (domain gap) between synthetic and real logo images

across various object detection and segmentation tasks spanning diverse fields. It is notably observed in medical imaging, where CT or MRI scan data comprise different phases [2], and in segmentation tasks for autonomous driving, which are characterized by varying scenes and illumination conditions [3]. Similarly, models trained only on synthetic images often perform poorly on real-world images due to domain shift. Synthetic images typically fail to capture the diverse attributes and characteristics of real-world images, leading to insufficient model generalization [3–5]. Figure 6.1 illustrates the disparity between real and synthetic logo images. Therefore, training a robust, generalized model on synthesized images that can be used in real-world applications has emerged as an active research area.

6.2.1 Synthetic Logo Images

Recent studies have demonstrated the efficacy of training convolutional neural networks (CNNs) to recognize objects in real images using artificial (synthetic) images. Synthesized training images, accompanied by automatically generated object-level annotations, are a promising way to reduce the expense of manual annotation. We generate training images for the FlickrLogos-32 [6], Toplogo-10 [7] and QMUL-OpenLogo [8] datasets to perform the experiments in an unsupervised domain adaptive manner.


Fig. 6.2 Examples of logo exemplar images and synthesized training images. (a) Logo exemplar images and synthetic images for the FlickrLogos-32 dataset. (b) Logo exemplar images and synthetic images for the Toplogo10 dataset

6.2.1.1 Logo Exemplar Transformation

For each logo category, we acquired a logo exemplar icon image from Google Image Search. These logo exemplar images are displayed against a homogeneous transparent background in Fig. 6.2 (rows 1 and 3). To generate coherent synthetic logo images during data synthesis, we apply various illumination, geometric and visual alterations, such as coloring, scaling, shearing, and rotation. For random scaling, we change the height and width of the logo exemplar images to different resolutions. For random rotation, we rotate the logo exemplar images by angles drawn from a uniform range. For random shearing, the scaled logo exemplar image undergoes a random affine transformation. For color variation, we change the pixel intensities of the logo exemplar images by multiplying them by a random value chosen between 0.1 and 1.
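The transformations described above can be sketched as sampling one random parameter set per exemplar and composing the geometric part into a 2 × 2 linear map. This is a minimal illustration under stated assumptions, not the authors’ implementation: only the intensity range (0.1 to 1) comes from the text, while the scale, rotation and shear ranges, and the function names `sample_transform`, `affine_matrix`, `transform_corners` and `scale_pixel`, are hypothetical.

```python
import math
import random


def sample_transform(rng):
    """Sample one random transformation for a logo exemplar.

    Only the intensity range (0.1 to 1) is stated in the text; the scale,
    rotation and shear ranges below are assumed values for illustration.
    """
    return {
        "scale_x": rng.uniform(0.5, 1.5),       # assumed range
        "scale_y": rng.uniform(0.5, 1.5),       # assumed range
        "angle_deg": rng.uniform(-30.0, 30.0),  # assumed range
        "shear": rng.uniform(-0.2, 0.2),        # assumed range
        "intensity": rng.uniform(0.1, 1.0),     # range stated in the text
    }


def _matmul2(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def affine_matrix(t):
    """Compose rotation * shear * scale into a single 2x2 linear map."""
    a = math.radians(t["angle_deg"])
    rot = [[math.cos(a), -math.sin(a)], [math.sin(a), math.cos(a)]]
    shear = [[1.0, t["shear"]], [0.0, 1.0]]
    scale = [[t["scale_x"], 0.0], [0.0, t["scale_y"]]]
    return _matmul2(rot, _matmul2(shear, scale))


def transform_corners(width, height, m):
    """Map the four corners of a width x height exemplar through m."""
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    return [(m[0][0] * x + m[0][1] * y, m[1][0] * x + m[1][1] * y)
            for x, y in corners]


def scale_pixel(value, intensity):
    """Color variation: multiply an 8-bit pixel value by the sampled factor."""
    return int(round(value * intensity))
```

The transformed corners give the extent of the warped exemplar, which is what later determines the size of the automatically generated bounding box.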

6.2.1.2 Synthesising Coherent Logo Images

We have used 6,000 non-logo images from the Flickr Image dataset as background context. These images contain a variety of background conditions in which a logo may appear, which supports realistic logo image synthesis. To create more coherent synthetic training data, we apply scene transformation effects to these background images; these are the same as those applied to the logo exemplar images discussed in the previous subsection. We use geometric and illumination alterations to explore a wide range of visual context and background diversity. We overlay exemplar logo images at random locations in randomly selected non-logo context images to ensure a range of visual scenes. Finally, we obtain synthetic training logo images with automatically created bounding-box annotations. Figure 6.2 (rows 2 and 4) shows some examples of synthetic training images with the corresponding logo exemplar images. For each logo class of the FlickrLogos-32 and Toplogo-10 datasets, we have synthesized 80 training images. Table 6.1 provides the details of both datasets.

Table 6.1 Data statistics of source and target images for the domain adaptation setting

                                       FlickrLogos-32 dataset   Toplogo-10 dataset
Number of classes                      32                       10
Source images (with annotations)       2560 (80 × 32)           800 (80 × 10)
Target images (without annotations)    1280 (40 × 32)           400 (40 × 10)
Test images                            960 (30 × 32)            300 (30 × 10)

The QMUL-OpenLogo dataset [8] was created by combining and expanding several existing logo datasets; it has a total of 27,083 logo images from 352 logo categories, including the logo categories and images of the FlickrLogos-32 [6] dataset. We randomly selected 100 logo categories to conduct experiments with the unsupervised training scheme, and we call the resulting subset the QMUL100 dataset. To obtain 100 different logo categories, we did not select any logo category of the FlickrLogos-32 dataset. Table 6.2 lists the selected logo categories and the number of real-world logo images for each. After organizing the dataset, we found a total of 4341 logo images for these 100 logo categories. We randomly divided the images of each category into 80% training data and 20% test data. In total, we used 3473 images as target samples for training, and the test set contains 868 images. Some training and test images include multiple logos from the same or different categories. Note that the training images are used in the experiments without annotations. In addition, we synthesized 50 logo images for each logo category in this dataset. The next section describes our domain adaptive anchorless detection framework.
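The overlay step of Sect. 6.2.1.2, in which a logo exemplar is pasted at a random location and its bounding box is recorded automatically, can be sketched as follows. This is an illustrative sketch only: images are represented as nested lists of grayscale values, `alpha` stands for the transparency mask of the exemplar, and the names `place_logo`, `overlay` and `synthesize` are hypothetical rather than taken from the source.

```python
import random


def place_logo(bg_w, bg_h, logo_w, logo_h, rng):
    """Pick a random top-left corner so the logo fits in the background.

    Returns the bounding box (x_min, y_min, x_max, y_max) in pixels.
    """
    if logo_w > bg_w or logo_h > bg_h:
        raise ValueError("logo must fit inside the background image")
    x0 = rng.randint(0, bg_w - logo_w)
    y0 = rng.randint(0, bg_h - logo_h)
    return (x0, y0, x0 + logo_w, y0 + logo_h)


def overlay(background, logo, alpha, box):
    """Alpha-blend a grayscale logo onto a copy of the background.

    alpha[r][c] lies in [0, 1]; 0 keeps the background pixel, which models
    the transparent area around the exemplar icon.
    """
    x0, y0, _, _ = box
    out = [row[:] for row in background]
    for r, logo_row in enumerate(logo):
        for c, value in enumerate(logo_row):
            a = alpha[r][c]
            out[y0 + r][x0 + c] = round(a * value + (1 - a) * out[y0 + r][x0 + c])
    return out


def synthesize(background, logo, alpha, class_id, rng):
    """Return a synthetic image and its automatically generated annotation."""
    bg_h, bg_w = len(background), len(background[0])
    logo_h, logo_w = len(logo), len(logo[0])
    box = place_logo(bg_w, bg_h, logo_w, logo_h, rng)
    image = overlay(background, logo, alpha, box)
    return image, {"bbox": box, "category": class_id}
```

Because the paste location is chosen by the generator itself, the annotation comes for free; no manual labeling step is involved for the source domain.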


Table 6.2 QMUL100 Dataset: logo categories and total images

Logo         Images   Logo          Images   Logo         Images
3M           20       INTEL         131      OBEY         20
ABUS         27       INTERNET.     24       OLYMPICS     28
AIRHAWK      17       JACKINT.      22       PAMPERS      87
ALLETT       15       JACOBSC.      93       PANASONIC    47
ANDROID      22       JAGERME.      20       PLAYSTA.     22
ARAL         86       JOHNNYW.      17       RBC          21
ASUS         29       KELLOGGS      108      RECYCL.      24
AXA          30       KFC           21       RENAULT      24
BACARDI      29       KIA           221      REPUBLI.     20
BARBIE       29       KODAK         20       SAMSUNG      63
BBC          27       KRAFT         105      SAP          20
BLIZZARD.    22       LAMBORG.      26       SCHWINN      20
CANON        158      LEGO          56       SEGA         24
CATERPIL.    165      LEVIS         25       SUBARU       30
CHEVRON      20       LG            20       SUBWAY       66
CHICKFILA    26       LONDONUN.     22       SUPERMAN     30
CHIQUITA     20       LUXOTTICA     20       SUPREME      68
CISCO        40       MASERATI      29       SUZUKI       21
CITI         19       MASTERC.      33       TACOBELL     24
CONVERSE     20       MEDIBANK      102      TESLAMO.     22
COSTCO       100      MICHELIN      27       TISSOT       25
DANONE       34       MICROSOFT     27       TOMMYHI.     20
DISNEY       20       MILLERHIG.    20       UNITEDN.     24
DRPEPPER     29       MITSUBISHI    26       VISA         62
DUNKIND.     20       MK            66       VOLVO        23
EBAY         20       MTV           23       WARNERB.     19
ESPN         28       NASA          21       WILLIAMH.    104
FACEBOOK     20       NB            62       WINDOWS      20
FIREFOX      28       NBC           26       WORDPRE.     21
GAP          20       NESCAFE       128      XBOX         21
HANES        13       NETFLIX       56       YAMAHA       21
HM           114      NINTENDO      20       YOUTUBE      14
IBM          121      NIVEA         209
IKEA         43       NORTHFAC      30


6.3 Domain Alignment Using Entropy Minimization

We present an adversarial learning approach based on the entropy minimization of mid-level output feature maps for the logo detection task. The architecture of our domain adaptive anchorless logo detector consists of a feature extraction backbone, a detection head, and a domain classifier network. The detection framework (feature extraction backbone and detection head) is trained to recognize logos. Through adversarial learning, it is also trained to generate more generalized feature maps for both the source and target domains in order to fool the domain classifier network. The training objective of the domain classifier network, in turn, is to classify the generated mid-level output feature space as coming from synthetic (source) or real (target) domain images. We employ CenterNet, an anchorless detector presented in [11]. However, our approach is general and can also be applied to other anchorless detectors. We use Residual Network (ResNet-18) [12] and Deep Layer Aggregation (DLA-34) [13] CNNs as feature extraction backbones to extract dense features from input images. The domain classifier (discriminator) network contains 5 convolutional layers, each with a 4 × 4 kernel and a stride of 2. The layers use [64, 128, 156, 512, 1] channels, respectively. A Leaky-ReLU activation with a fixed negative slope of 0.2 follows each of the first four convolutional layers. The final output of the discriminator network is used to calculate the pixel-wise domain classification loss. The detailed designs of the feature extraction networks, detection head and discriminator network have been given in the previous chapters. The following subsections discuss entropy map generation, the network architecture, and adversarial learning by entropy minimization.
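To see why the discriminator output still supports a pixel-wise loss, it helps to trace the spatial size through the five 4 × 4, stride-2 convolutions. The sketch below does this arithmetic only. The padding of 1 and the 128 × 128 input size are assumptions made for illustration (the text states neither), and `conv_out` and `discriminator_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a convolution along one spatial dimension.

    kernel=4 and stride=2 follow the text; padding=1 is an assumption.
    """
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_shapes(in_size, num_layers=5):
    """Spatial size of the input followed by the size after each layer."""
    sizes = [in_size]
    for _ in range(num_layers):
        sizes.append(conv_out(sizes[-1]))
    return sizes


# Assumed 128 x 128 input map (for example, a 512 x 512 image at stride 4).
print(discriminator_shapes(128))
```

Under these assumptions each layer halves the resolution, so the final single-channel output is a small spatial map of domain scores rather than one scalar per image.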

6.3.1 Entropy Minimization

Vu et al. [9] demonstrated the utility of entropy minimization using Shannon entropy [10] for semantic segmentation tasks. They established a correlation between segmentation model performance and prediction entropy, showing that training a model only on source-like images results in low-entropy predictions for the source domain and high-entropy predictions for the target domain. To address this issue, they introduced entropy minimization at the final output level (i.e., the segmentation mask) to generate high-confidence (low-entropy) predictions. Drawing inspiration from this approach, we propose a novel adversarial training architecture based on a weighted self-information space (entropy maps) for the logo detection task. Our objective is to reduce the gap between the source and target domains by indirectly enforcing low-entropy predictions on real (target) images. However, unlike Vu et al.’s method, which uses the final category-wise heatmaps, we propose to use mid-level output feature maps to minimize prediction entropy effectively. Through our investigation in the previous chapter, we have observed that mid-level output feature maps in keypoint-based anchorless detectors encapsulate crucial domain-specific semantic and visual


information. As a result, we use the mid-level output feature maps of both source and target domain images to generate a weighted self-information space (entropy map) and minimize the domain difference between them. For the mid-level feature maps $\mathit{Mid}_{x_i}$ of an input image $x_i$, a weighted self-information space (entropy map) $\hat{E}_{x_i} \in [0, 1]^{h \times w \times c}$ can be computed, where $c$ denotes the number of channels and $h$ and $w$ are the height and width. To calculate the entropy map, we first perform a linear transformation of the mid-level output feature map by applying a 1 × 1 convolution, and then we normalize the result using a softmax activation across the channel dimension ($c$). The newly generated feature maps $\mathit{Mid}'_{x_i}(x, y, c) = \mathrm{softmax}\{\mathrm{Conv}_{(1 \times 1)}(\mathit{Mid}_{x_i}(x, y, c))\}$ contain scalar values whose sum along the channel dimension ($c$) is equal to 1. Finally, a weighted self-information space (entropy map) for this output can be generated using the following equation.

\[
\hat{E}^{(h,w,c)}_{x_i} = \frac{-1}{\log(C)} \sum_{c=1}^{|C|} \mathit{Mid}'_{x_i}(x, y, c) \cdot \log \mathit{Mid}'_{x_i}(x, y, c) \tag{6.1}
\]

where $c$ is the channel index of the feature maps, $x$ and $y$ are the pixel coordinates, and $h$ and $w$ are the height and width. The entropy map $\hat{E}^{(h,w,c)}_{x_i}$ is composed of independent pixel-wise entropies normalized to the [0, 1] range.
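The entropy map computation can be sketched in a few lines of plain Python. The sketch follows Eq. 6.1 as written, that is, it sums over channels and yields one normalized entropy value per pixel. The 1 × 1 convolution is modeled as a per-pixel linear map whose `weights` and `bias` are placeholders for learned parameters, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of channel values."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def conv1x1(pixel, weights, bias):
    """A 1x1 convolution at one pixel is a linear map over its channels."""
    return [sum(w * p for w, p in zip(row, pixel)) + b
            for row, b in zip(weights, bias)]


def entropy_at_pixel(pixel, weights, bias):
    """Normalized Shannon entropy (Eq. 6.1) of the softmaxed channel vector."""
    probs = softmax(conv1x1(pixel, weights, bias))
    # p * log(p) tends to 0 as p tends to 0, so zero entries are skipped.
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def entropy_map(feature_map, weights, bias):
    """feature_map[y][x] is the list of channel values at pixel (x, y)."""
    return [[entropy_at_pixel(px, weights, bias) for px in row]
            for row in feature_map]
```

A pixel whose channel responses are all equal maps to an entropy close to 1 (maximally uncertain), while a pixel dominated by one channel maps close to 0, which is the behavior the adversarial scheme tries to encourage on target images.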

6.3.2 Entropy Minimization Maps Using Mid-Level Feature from Synthetic to Real Logo Images

For model training, we use labeled synthesized logo images and unlabeled real images. In this scenario, the synthetic images are considered the source domain and the unlabeled real images the target domain. We assume that the source domain $S$ has $N_s$ images with corresponding bounding box annotations, $\{(x_i^s \in X_S, y_i^s \in Y_S)\}_{i \in (1, 2, \ldots, N_s)}$, where $X_S$ denotes the set of input images, $Y_S$ denotes the set of corresponding bounding box labels, $x_i^s \in \mathbb{R}^{H \times W \times 3}$ denotes an input image, and $y_i^s = (b_i^s, c_i^s)$ denotes the corresponding label, i.e., the bounding box coordinates $b$ and the associated category $c$. Likewise, $N_t$ is the number of images in the target domain $T$, which has no object-level annotations, $\{x_i^t \in X_T\}_{i \in (1, 2, \ldots, N_t)}$, where $X_T$ denotes the set of images in the target domain and $x_i^t$ denotes an input target image. Adversarial learning is an effective way to bridge domain gaps. To use an adversarial learning scheme, we add a domain classifier network (discriminator) to the detection framework. During training, for a given source image $x_i^s$ and target image $x_i^t$, the backbone network extracts the mid-level output feature maps $\mathit{Mid}_{x_i^s}$ and $\mathit{Mid}_{x_i^t}$, respectively. The mid-level feature maps $\mathit{Mid}_{x_i^s}$ of the source domain images are given to the detection heads in the detection framework to learn their respective functions. According to the process described in the previous


chapter, a detection loss for the synthesized source domain images is calculated using the following equation.

\[
L_{det} = L_k + \lambda_{size} L_s + \lambda_{off} L_{off} \tag{6.2}
\]

where $L_k$ denotes the focal loss used for classification, $L_s$ and $L_{off}$ are $L_1$ loss functions, and $\lambda_{size}$ and $\lambda_{off}$ are the loss weights. We do not pass the feature maps of the target domain images to the detection heads because we do not use object-level annotations for the target domain images. Meanwhile, we generate entropy maps for the source and target domain images from the corresponding mid-level output feature maps, following the procedure outlined in Sect. 6.3.1. The entropy maps of the source domain images ($\hat{E}_{x_i^s}$) and target domain images ($\hat{E}_{x_i^t}$) are generated using Eq. 6.1 and fed into the discriminator network. We assign the domain label ‘0’ to the entropy maps of source images and the domain label ‘1’ to the entropy maps of target images. An adversarial loss is calculated on the entropy (weighted self-information) maps of the target domain images to indirectly minimize the prediction entropy on target images. To compute this adversarial domain classification loss, we pair the entropy maps of target domain images with an inverted domain label (i.e., ‘0’ instead of ‘1’). The adversarial binary classification loss $L_{adv}$ is given as:

\[
L_{adv} = \frac{1}{|X_T|} \sum_{i=1}^{|X_T|} L_{cls}(\hat{E}_{x_i^t}, 0) \tag{6.3}
\]

The detection framework is trained to generalize the network parameters across the data distributions of the source and target domain images by combining the adversarial domain classification loss with the network’s detection loss (i.e., Eq. 6.2). The optimization objective is given as:

\[
L_{det} = L_k + \lambda_{size} L_s + \lambda_{off} L_{off} + \lambda_{adv} L_{adv} \tag{6.4}
\]

where $\lambda_{adv}$ is the loss weight, which we set to 0.001 in our experiments. This adversarial learning scheme drives the network to bring the target and source domains closer by encouraging it to deceive the discriminator. At the same time, a domain classification loss $L_{cls}$ (the training objective of the domain classifier network) is calculated using the true domain labels to build the discriminator network’s ability to assign its inputs to their respective domains. This loss can be described as follows:

\[
L_{cls} = \frac{1}{|X_S|} \sum_{i=1}^{|X_S|} L_{cls}(\hat{E}_{x_i^s}, 0) + \frac{1}{|X_T|} \sum_{i=1}^{|X_T|} L_{cls}(\hat{E}_{x_i^t}, 1) \tag{6.5}
\]

This domain adaptation approach can be viewed as indirect entropy minimization using mid-level output feature maps. The two CNNs (the detection and discriminator networks) can be trained jointly in an end-to-end manner. During inference, we drop the discriminator network and use only the detection network to perform logo detection. The architecture of our network is shown in Fig. 6.3.

Fig. 6.3 Overview of the proposed domain alignment method. The network takes as input (i) real logo images (target) and (ii) synthesized 1-shot logo icon images (source). We calculate an entropy-map-based adversarial loss for training. In the figure, yellow arrows represent the source image and purple arrows the target image
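The interplay of the losses in Eqs. 6.3 to 6.5 can be illustrated with a small numeric sketch. It assumes that $L_{cls}$ is a binary cross-entropy and that the discriminator outputs, flattened into lists, are probabilities of the input being a target-domain entropy map; neither detail is stated in the text. The loss weight 0.001 is the reported value, while `lam_size` and `lam_off` are left as arguments because their values are not given here. All function names are hypothetical.

```python
import math

SOURCE, TARGET = 0, 1  # domain labels used in the text
LAMBDA_ADV = 0.001     # adversarial loss weight reported in the text


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one discriminator output (assumed L_cls)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def mean_bce(probs, label):
    """Average binary cross-entropy of all outputs against one label."""
    return sum(bce(p, label) for p in probs) / len(probs)


def adversarial_loss(d_target):
    """Eq. 6.3: target entropy maps scored against the inverted label 0."""
    return mean_bce(d_target, SOURCE)


def discriminator_loss(d_source, d_target):
    """Eq. 6.5: true domain labels, source = 0 and target = 1."""
    return mean_bce(d_source, SOURCE) + mean_bce(d_target, TARGET)


def detector_objective(l_k, l_size, l_off, d_target, lam_size, lam_off):
    """Eq. 6.4: source detection loss plus the weighted adversarial term."""
    return (l_k + lam_size * l_size + lam_off * l_off
            + LAMBDA_ADV * adversarial_loss(d_target))
```

The two objectives pull in opposite directions: the detector is rewarded when target entropy maps are scored as source, whereas the discriminator is rewarded for scoring them as target. In a training loop, the two losses would typically be used to update the detector and the discriminator separately.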

6.4 Experiments and Results

6.4.1 Datasets

We have used the FlickrLogos-32 [6], Toplogo-10 [7] and QMUL-OpenLogo [8] datasets for our experiments. These are benchmark datasets for logo detection; FlickrLogos-32 in particular has been widely used in previous works to validate logo recognition methods. The FlickrLogos-32 dataset has 32 logo classes, each comprising 40 images for training and 30 images for testing. The Toplogo-10 dataset contains 10 logo classes with image statistics similar to those of FlickrLogos-32. The details of the QMUL-OpenLogo dataset have been provided in the previous section. During training, we use only the training images of these datasets as target domain samples, and we use them without their bounding box annotations. The test sets of real images are used to evaluate the models.


6.4.2 Implementation Details

We train the detection network for 140 epochs using a batch size of 4. We use DLA-34 and ResNet-18 as feature extraction backbones, pretrained on the COCO dataset with CenterNet. The detection network is optimized with the Adam optimizer, with an initial learning rate of 0.000125. The input images have a spatial resolution of 512 × 512. The discriminator network is trained from scratch, also with the Adam optimizer, with an initial learning rate of 0.0001. During training, batches of the same size from the source and target domains are given to the framework. We use the same data augmentation methods as [11], such as cropping, scaling, rotation, and flipping. Similar augmentation strategies are widely used for object detection tasks, for example on the COCO object detection dataset, and they are as useful for logo recognition as for other detection problems. The performance of the methods is evaluated in terms of mean Average Precision (mAP). Precision and recall are computed for a specific IoU threshold: a predicted bounding box is considered a true detection when the Intersection over Union (IoU) between the predicted and ground-truth bounding boxes exceeds the threshold. We consider a logo detected when this IoU exceeds 50%. Following CenterNet, a flip test is used for evaluation, in which horizontally flipped images are also provided to the network. For the flip test, the outputs of the feature extraction networks are averaged before decoding the bounding boxes. We trained the networks twice and report the mean and standard deviation of the best results.
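The detection criterion used for evaluation can be made concrete with a short sketch. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples, a convention chosen here for illustration, and the function names `iou` and `is_true_positive` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_class, gt_box, gt_class, threshold=0.5):
    """A detection counts when the class matches and IoU exceeds 0.5."""
    return pred_class == gt_class and iou(pred_box, gt_box) > threshold
```

For example, two 10 × 10 boxes shifted horizontally by 2 pixels overlap with an IoU of 80/120, which passes the 0.5 threshold, whereas a shift of 5 pixels gives 50/150 and is rejected.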

6.4.2.1 Ablation Experiments on FlickrLogos-32 Dataset

In our investigation, we explore various domain adaptive losses to enhance the effectiveness of domain adaptive training. Specifically, we evaluate different output layers and domain adaptive losses when training on synthesized logo images. Our analysis examines the mid-level output feature maps and the heatmap outputs as the level at which model weights are adapted between the synthetic and real-world image domains. We found that training only on synthetic images yields subpar performance compared with training on real-world images. This performance degradation is primarily attributed to the domain shift in essential features between synthetic and real-world images. To address this, we investigate two domain adaptive losses, entropy loss [9] and maximum square loss [14], originally proposed for domain adaptive semantic segmentation. However, identifying target objects, determining their size (bounding box), and classifying them under domain shift is more challenging than pixel-wise classification (semantic segmentation). In a related study [15], the authors align domains at the level of the generated class-wise heatmaps of target domain images using


entropy loss and maximum square loss with an anchorless detector. However, their strategy does not incorporate an adversarial domain classification loss, a key technique for domain adaptation. To improve the network design, we integrate a domain classifier with the anchorless detector to incorporate an adversarial domain classification loss, which enables more effective generalization of the extracted feature maps. Furthermore, we propose integrating the domain adaptive losses into domain adaptation by using mid-level feature maps for the logo detection task. To compute the entropy loss, we directly use the output heatmaps. First, we generate entropy maps using Eq. 6.1 for a given set of category-wise heatmaps of the target domain images. The entropy loss is then computed as the sum of all normalized pixel-wise values in the entropy maps, as described by the following equation.

\[
L_{\mathrm{ent\_loss}}(x^t) = \sum_{h,w} \hat{E}^{(h,w)}_{x_i} \tag{6.6}
\]

where $\hat{E}^{(h,w)}_{x_i}$ can be obtained using the detection head layers. The maximum square loss, in contrast, can be calculated directly from the category-wise heatmaps of the target domain images using the following equation.

\[
L_{\mathrm{max\_sq\_loss}}(x_t) = -\frac{1}{2N} \sum_{n=1}^{N} \sum_{c=1}^{C} (p_t^{n,c})^2 \tag{6.7}
\]

where $p_t^{n,c}$ is the scalar value at pixel position $n$ and channel $c$. The ablation experiments performed on our synthetic FlickrLogos-32 dataset are shown in Table 6.3. The Deep Layer Aggregation (DLA-34) [13] feature extraction backbone is used with the CenterNet architecture in these experiments. The DLA network fuses semantic and spatial features to capture locations and categories more accurately. Its densely connected design includes feature pyramid networks with hierarchical and iterative skip connections, which enhance feature representation and improve resolution. Following the authors of [15], we weight the entropy loss and the maximum square loss during training by 0.001 and 0.3, respectively. The entropy loss and maximum square loss are also computed using the mid-level output feature maps. In that case, the mid-level output is passed through a 1 × 1 linear transformation, without a ReLU activation and with the same number of output channels, to generate visual and semantic features. According to Table 6.3, the CenterNet network with a DLA-34 feature extraction backbone achieves an accuracy of 28.0% mAP under normal training. The performance increases slightly when class-wise heatmaps are used for the different domain adaptation losses: the entropy loss calculated at the heatmap level achieves 28.04% mAP, and the maximum square loss achieves 28.45% mAP with heatmap-level domain alignment. When the entropy loss is calculated with mid-level output feature maps, performance increases by around 1.4% mAP (Proposed method 1), while performance for the maximum square loss increases by about 2% mAP


Table 6.3 Ablation study of different domain adaptive losses and output layers on FlickrLogos-32 dataset

                                                    Different losses for DA
Network                     Final     Mid-level     Entropy   Maximum   Entropy        Adv.    mAP IoU-0.5
                            heatmap   feature       loss      square    minimization   loss
                            outputs   maps                    loss      maps
CenterNet DLA34 (w/o DA)    -         -             -         -         -              -       28.0 (±0.42)
CenterNet DLA34 (DA) [15]   √         -             √         -         -              -       28.04 (±0.21)
CenterNet DLA34 (DA) [15]   √         -             -         √         -              -       28.45 (±0.35)
Proposed method 1           -         √             √         -         -              -       29.4 (±0.42)
Proposed method 2           -         √             -         √         -              -       29.8 (±0.0)
Proposed method 3           -         √             -         √         -              √       29.95 (±0.91)
Proposed method 4           -         √             -         -         √              √       31.75 (±0.92)

using mid-level output feature maps (Proposed method 2). A novel combination, the maximum square loss calculated on mid-level feature maps together with the adversarial domain classification loss, also shows a notable improvement, achieving an accuracy of 29.95% mAP (Proposed method 3). This can be seen as another approach to domain adaptation with anchorless detectors. The proposed entropy minimization strategy for adversarial learning (Proposed method 4) achieves an accuracy of 31.75% mAP, an improvement of 3.75% mAP over direct transfer from source to target domain images.
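The two domain adaptive losses compared in this ablation, Eqs. 6.6 and 6.7, reduce to simple sums once the entropy map or the normalized predictions are available. The following sketch is illustrative only: inputs are nested lists, the loss weights 0.001 and 0.3 are the values quoted from [15], and the function names are hypothetical.

```python
ENT_WEIGHT = 0.001    # entropy loss weight quoted in the text
MAX_SQ_WEIGHT = 0.3   # maximum square loss weight quoted in the text


def entropy_loss(entropy_map):
    """Eq. 6.6: sum of the normalized pixel-wise entropies over (h, w)."""
    return sum(sum(row) for row in entropy_map)


def max_square_loss(probs):
    """Eq. 6.7: probs[n][c] is the prediction at pixel n and channel c."""
    n = len(probs)
    return -sum(p * p for pixel in probs for p in pixel) / (2 * n)


def weighted_da_losses(entropy_map, probs):
    """Return both losses scaled by their training weights."""
    return (ENT_WEIGHT * entropy_loss(entropy_map),
            MAX_SQ_WEIGHT * max_square_loss(probs))
```

Both losses decrease as predictions become more confident: a one-hot pixel contributes -0.5 to the maximum square loss, while a uniform two-channel pixel contributes only -0.25.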

6.4.2.2 Results and Comparison on FlickrLogos-32 Dataset

Results and comparisons with various approaches are presented in Table 6.4. Without domain adaptation training, the CenterNet network with a DLA-34 backbone achieves an accuracy of 28.0% mAP. It is evident from earlier works that performance with synthetic training images does not match that with real images because the visual scenes have very dissimilar features. We believe that more coherent and homogeneous synthesized images would yield better performance. As proposed in [15], the implementation of entropy loss [9] achieves an accuracy of 28.04% mAP. The accuracy obtained using the maximum square loss [14] is 28.45% mAP.

Table 6.4 Comparison on FlickrLogos-32 dataset with Deep Layer Aggregation (DLA-34) backbone

Network                                         mAP IoU-0.5
CenterNet (w/o DA)                              28.0 (±0.42)
Scheck et al. [15] Entropy loss [9]             28.04 (±0.21)
Scheck et al. [15] Maximum square loss [14]     28.45 (±0.35)
Hsu et al. [17]                                 30.4 (±0.42)
Jain et al. [16]                                29.55 (±0.64)
Saito et al. [18]                               29.7 (±0.28)
Xu et al. [19]                                  31 (±0.14)
Ananda et al. [20]                              30.79 (±0.56)
Proposed method 4                               31.75 (±0.92)

An accuracy of 29.55% mAP is achieved when plain mid-level feature maps are used for the adversarial domain classification loss with CenterNet-DLA34 [16]. Hsu et al. [17] employed a gradient reversal layer (GRL) with the Faster-RCNN network. Following their method, we implemented GRL-based domain adaptation with CenterNet-DLA34, placing a GRL between the feature extraction and discriminator networks and using the mid-level output for domain adaptation. This technique yields 30.4% mAP. SW-Faster-RCNN [18] (a Faster-RCNN-based two-stage framework that includes a feature extraction backbone and region proposal network), with instance- and image-level alignment and a VGG feature extraction backbone, achieves 29.7% mAP. The consistency regularization loss-based SW-Faster-ICR-CCR [19] achieves 31% mAP. Recently, a dual discriminator-based domain adaptation method using feature-level and output-level alignment was proposed in [20] for the segmentation task. We implemented two discriminators, at the feature level and the mid-level output, with the CenterNet detection framework; this method achieves an accuracy of 30.79% mAP. Our proposed entropy minimization method for mid-level feature maps using adversarial learning achieves an accuracy of 31.75% mAP. The adversarial loss calculated on the weighted self-information space (entropy maps) thus improves performance compared with using plain mid-level output feature maps. Table 6.5 shows the results for CenterNet with the ResNet-18 feature extraction network. CenterNet with a ResNet-18 backbone achieves 28.6% mAP without domain adaptive training. Domain adaptation with entropy minimization of mid-level output feature maps improves the accuracy to 30.4% mAP.


Table 6.5 Experiments on FlickrLogos-32 dataset with ResNet-18 backbone

Network                                             mAP IoU-0.5
CenterNet ResNet-18 (w/o DA)                        28.6 (±0.42)
CenterNet ResNet-18 (with DA) proposed method 4     30.4 (±0.28)

6.4.2.3 Ablation Experiments on Toplogo10 Dataset

The Toplogo10 dataset was introduced in [7]. Its logo images are collected from various clothing brands. According to the authors of [7], it is very difficult to achieve high accuracy on this dataset. It is important to note that they also trained on synthetic logo images and obtained an accuracy of 10.2% mAP using the Faster-RCNN (ResNet-50) network [21]. Table 6.6 shows the ablation experiments using our synthetic logo images for the Toplogo10 dataset, with CenterNet and the deep layer aggregation (DLA-34) feature extraction network. The results for different domain adaptive losses and output layers are given.

Table 6.6 Ablation study of different domain adaptive losses and output layers on Toplogo10 dataset

                                                    Different losses for DA
Network                     Final     Mid-level     Entropy   Maximum   Entropy        Adv.    mAP IoU-0.5
                            heatmap   feature       loss      square    minimization   loss
                            outputs   maps                    loss      maps
CenterNet DLA34 (w/o DA)    -         -             -         -         -              -       17.0 (±0.28)
CenterNet DLA34 (DA) [15]   √         -             √         -         -              -       17.4 (±0.14)
CenterNet DLA34 (DA) [15]   √         -             -         √         -              -       17.65 (±0.07)
Proposed method 1           -         √             √         -         -              -       17.6 (±0.14)
Proposed method 2           -         √             -         √         -              -       18.29 (±0.14)
Proposed method 3           -         √             -         √         -              √       18.2 (±0.42)
Proposed method 4           -         √             -         -         √              √       19.30 (±1.55)

6.4.2.4 Results and Comparison on Toplogo10 Dataset

Table 6.7 shows the results on the Toplogo10 dataset using CenterNet with DLA-34 as the backbone. Trained on our synthetic logo images without domain adaptation, CenterNet obtains 17.0% mAP. The entropy loss-based scheme at the heatmap level achieves 17.4% mAP [9, 15], whereas the maximum square loss improves the accuracy to 17.65% mAP [14, 15]. Gradient reversal layer (GRL) based domain adaptation, which uses CenterNet-DLA-34 together with discriminator networks, yields 16.35% mAP [17]. The adversarial domain classification scheme using normal mid-level feature maps reaches 18.75% mAP [16]. SW-FasterRCNN [18] achieves 14.15% mAP, while SW-Faster-ICR-CCR [19] achieves 10.85% mAP. Domain adaptation using dual discriminators reaches 18.79% mAP [20]. Our proposed method achieves the highest accuracy, 19.30% mAP.

Table 6.8 shows the accuracy on the Toplogo10 dataset using CenterNet with a ResNet-18 backbone. CenterNet delivers 13.75% mAP in normal training. Our proposed approach increases the accuracy by about 2.5 percentage points, to 16.25% mAP.

Table 6.7 Experiments on Toplogo10 dataset with DLA-34 backbone

Network                                        mAP IoU-0.5
CenterNet DLA-34 (w/o DA)                      17.0 (±0.28)
Scheck et al. [15] Entropy loss [9]            17.4 (±0.14)
Scheck et al. [15] Maximum square loss [14]    17.65 (±0.07)
Hsu et al. [17]                                16.35 (±0.07)
Jain et al. [16]                               18.75 (±0.35)
Saito et al. [18]                              14.15 (±0.21)
Xu et al. [19]                                 10.85 (±0.63)
Ananda et al. [20]                             18.79 (±0.84)
Proposed method 4                              19.30 (±1.55)

Table 6.8 Experiments on Toplogo10 dataset with ResNet-18 backbone

Network                                            mAP IoU-0.5
CenterNet ResNet-18 (w/o DA)                       13.75 (±0.07)
CenterNet ResNet-18 (with DA) Proposed method 4    16.25 (±0.63)
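To make the difference between the two heatmap-level baselines concrete, the following minimal sketch computes an entropy loss and a maximum square loss on small detection heatmaps. It assumes that each heatmap score can be read as a Bernoulli probability (as with per-class sigmoid outputs) and uses the two-class form of each loss; the function names and the toy heatmaps are hypothetical, and this is not the implementation evaluated in Table 6.7.

```python
import math

EPS = 1e-7  # keeps log2() finite when a score is exactly 0 or 1


def binary_entropy(p):
    """Shannon entropy (in bits) of one heatmap score p, read as a Bernoulli probability."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_loss(heatmap):
    """Mean pixel-wise entropy; lower values mean more confident predictions."""
    values = [binary_entropy(p) for row in heatmap for p in row]
    return sum(values) / len(values)


def max_square_loss(heatmap):
    """Mean of -(p^2 + (1 - p)^2) / 2 over pixels (assumed two-class form)."""
    values = [-(p * p + (1.0 - p) ** 2) / 2.0 for row in heatmap for p in row]
    return sum(values) / len(values)


# Hypothetical 2x3 heatmaps: one confident, one uncertain.
confident = [[0.01, 0.02, 0.98], [0.03, 0.01, 0.02]]
uncertain = [[0.40, 0.55, 0.60], [0.45, 0.50, 0.52]]

for name, heatmap in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(entropy_loss(heatmap), 4), round(max_square_loss(heatmap), 4))
```

Both losses are lower for the confident heatmap, so minimizing either one on unlabeled target images pushes predictions away from 0.5. In this two-class form the gradient of the maximum square loss with respect to p is 1 - 2p, which stays bounded, whereas the gradient of the entropy grows without bound as p approaches 0 or 1.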

6.4.2.5 Experiments on QMUL-OpenLogo Dataset

Table 6.9 shows the results on the QMUL100 dataset using CenterNet with a DLA-34 backbone, for normal training and for the proposed method. Our experiment demonstrates that the lack of semantic alignment and the domain gap between the synthetic training dataset and real-world test images adversely affect the model’s performance: in normal training, the model achieves 14.89% mAP. The proposed method achieves 17.29% mAP, an improvement of about 2.4 percentage points. This improvement on a large number of logo classes with unsupervised training supports the effectiveness of the proposed entropy minimization-based domain adaptation method. Table 6.10 shows the accuracy on the QMUL100 dataset using CenterNet with a ResNet-18 backbone. CenterNet delivers 16.15% mAP in normal training. Our proposed approach increases the accuracy by about 2.4 percentage points, to 18.55% mAP.

Table 6.9 Experiments on QMUL100 dataset with DLA-34 backbone

Network                                         mAP IoU-0.5
CenterNet DLA-34 (w/o DA)                       14.89 (±0.42)
CenterNet DLA-34 (with DA) Proposed method 4    17.29 (±0.14)

Table 6.10 Experiments on QMUL100 dataset with ResNet-18 backbone

Network                                            mAP IoU-0.5
CenterNet ResNet-18 (w/o DA)                       16.15 (±0.21)
CenterNet ResNet-18 (with DA) Proposed method 4    18.55 (±0.77)

We analyze the feature distributions of real and synthetic training images for our proposed method. Figure 6.4 shows the t-SNE feature distributions [22] of real and synthetic training logo images from the Adidas logo category in the FlickrLogos-32 dataset and the HH logo category in the Toplogo10 dataset, for normal training and for the proposed method. In the normal training scheme, the class-wise feature distributions of real and synthetic images for the Adidas and HH logos are distinct. The figure demonstrates that our proposed approach brings the distributions of real images (blue dots) and synthetic images (orange dots) closer, resulting in more overlapping regions in their feature space. In Fig. 6.5, we additionally examine the feature distributions for all real images (blue dots) and synthetic images (orange dots) from the Toplogo10 dataset. As shown in Fig. 6.5, the real and synthetic image domains have distinct feature spaces, but our proposed method helps generate overlapping distributions for all logo classes. Figure 6.6 depicts the feature distributions for some randomly selected synthetic and real-world sample images from all categories of the QMUL100 dataset.

Fig. 6.4 Visualization of the t-SNE feature distributions of real (blue circles) and synthetic (orange circles) logo images. a Feature map distributions in normal training (left) and with the proposed method (right) for training images of the ‘Adidas’ logo class of the FlickrLogos-32 dataset. b Feature map distributions in normal training (left) and with the proposed method (right) for training images of the ‘HH’ logo class of the Toplogo10 dataset

6.4.2.6 Domain Adaptation Using Real Images (Scene Adaptation)

Fig. 6.5 Visualization of the t-SNE feature map distributions in normal training (left) and with the proposed method (right) for training images of the Toplogo10 dataset. Real logo images (blue circles) and synthetic logo images (orange circles)

Fig. 6.6 Visualization of the t-SNE feature map distributions in normal training (left) and with the proposed method (right) for training images of the QMUL100 dataset. Real logo images (blue circles) and synthetic logo images (orange circles)

The FlickrLogos-32 and Toplogo10 datasets have a common logo category, Adidas, but the category uses a different logo icon in each dataset. The FlickrLogos-32 training images mainly contain the Adidas icon in the Lotus design, although a few training images show the three-stripe icon design (see Fig. 6.2, row 1 and row 3), whereas the Toplogo10 dataset has training and test images only with the three-stripe Adidas icon. We use these real-world Adidas training images in our experiments.

In our first experiment, we adapt domain knowledge from the FlickrLogos-32 dataset to the Toplogo10 dataset for the Adidas logo. For model learning, we use the FlickrLogos-32 (Adidas class) training images and evaluate performance on the Adidas test set of the Toplogo10 dataset. We use the Adidas Toplogo10 training images (without bounding-box annotations) to reduce the domain gap. Table 6.11 shows the accuracy for normal training and for our proposed method. CenterNet achieves 58.5% AP on the Toplogo10 Adidas test images. Our proposed method improves on this by 5.8 points, achieving 64.3% AP. The DLA-34 network with CenterNet is used for feature extraction.
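The accuracy figures in this section are reported at an IoU threshold of 0.5. As a minimal sketch of what that threshold means, the code below computes the intersection over union of two boxes and labels detections as true or false positives. The (x1, y1, x2, y2) box format, the greedy one-to-one matching rule, and the example boxes are assumptions made for illustration; the evaluation code behind the reported numbers may differ in its details.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, ground_truth, iou_threshold=0.5):
    """Label each detection as a true positive (True) or false positive (False).

    detections: list of (score, box); ground_truth: list of boxes.
    Detections are visited by decreasing score, and each ground-truth box
    can be matched at most once.
    """
    matched = set()
    flags = []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags


# Hypothetical example: one annotated logo and two detections.
ground_truth = [(10, 10, 50, 50)]
detections = [(0.9, (12, 12, 48, 52)), (0.6, (30, 30, 80, 80))]
print(match_detections(detections, ground_truth))
```

The first detection overlaps the annotated box with an IoU of about 0.82 and counts as a true positive; the second has an IoU of about 0.11 and counts as a false positive. AP is then obtained from the precision-recall curve built over these flags.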

Table 6.11 Domain Adaptation for Adidas Icon: FlickrLogos-32 to Toplogo10 using DLA-34 backbone

Network                                       mAP IoU-0.5
CenterNet DLA-34 (w/o DA)                     58.5
CenterNet DLA-34 (with DA) Proposed method    64.3

Table 6.12 Domain Adaptation for Adidas Icon: Toplogo10 to FlickrLogos-32 using DLA-34 backbone

Network                                       mAP IoU-0.5
CenterNet DLA-34 (w/o DA)                     18.3
CenterNet DLA-34 (with DA) Proposed method    21.3

Additionally, we perform experiments that adapt domain knowledge from the Toplogo10 (Adidas class) dataset to the FlickrLogos-32 dataset. The Toplogo10 dataset has only Adidas images with the three-stripe icon, while the FlickrLogos-32 test set includes Lotus icon logo images. Table 6.12 contains the results. The accuracy of the trained model is very low because the logo icon in the test set is completely different from the one in the training set. In this case, CenterNet achieves 18.3% AP for normal training. Our proposed method improves accuracy by about 3 points, achieving 21.3% AP.

Using synthetic images poses two significant challenges: the domain shift problem and unnatural, less coherent backgrounds. Images with artificial backgrounds tend to yield lower accuracy than those with real-world backgrounds. Moreover, relying only on synthetic images makes it difficult to create a comprehensive training dataset that covers a wide range of sizes and feature variations. In this study, our primary focus is on addressing the domain shift between real-world and synthetic logo images. We achieve this by using logo exemplar icons to generate synthetic logo images. In the future, generating more coherent logo images could be a valuable research topic for building a good, large synthetic logo training dataset, especially in the context of domain adaptation.
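The idea of generating synthetic training images from a logo exemplar icon, with the bounding box obtained for free, can be sketched as follows. This is a deliberately minimal illustration under stated assumptions: images are grayscale lists of rows, the icon is pasted at a random position without scaling, rotation, or blending, and the function name `paste_icon` and all sizes are hypothetical. It is not the synthesis pipeline used in the experiments.

```python
import random


def paste_icon(background, icon, rng):
    """Paste `icon` onto a copy of `background` at a random location.

    Both images are lists of rows of grayscale pixel values. Returns the
    composited image and the bounding box (x1, y1, x2, y2) of the pasted
    icon, with x2 and y2 exclusive.
    """
    bg_h, bg_w = len(background), len(background[0])
    ic_h, ic_w = len(icon), len(icon[0])
    if ic_h > bg_h or ic_w > bg_w:
        raise ValueError("icon must fit inside the background")
    x1 = rng.randint(0, bg_w - ic_w)  # randint is inclusive on both ends
    y1 = rng.randint(0, bg_h - ic_h)
    image = [row[:] for row in background]  # copy so the input stays untouched
    for dy in range(ic_h):
        image[y1 + dy][x1:x1 + ic_w] = icon[dy]
    return image, (x1, y1, x1 + ic_w, y1 + ic_h)


rng = random.Random(0)  # fixed seed for reproducibility
background = [[0] * 8 for _ in range(6)]  # hypothetical blank scene, 8 wide by 6 high
icon = [[255] * 3 for _ in range(2)]      # hypothetical logo exemplar, 3 wide by 2 high
image, box = paste_icon(background, icon, rng)
print(box)
```

Because the paste location is chosen by the generator, the object-level annotation is known exactly and no manual labeling is needed. The unnatural backgrounds discussed above arise precisely because such pasted icons ignore scene context.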

6.5 Summary

We propose an unsupervised logo detection method using domain adaptation from synthetic to real logo images. To address the data annotation issue, we utilize automatically generated synthetic logo images with object-level annotations. For domain adaptation, we propose pixel-wise entropy minimization of mid-level feature maps for the detection task to effectively bridge the domain gap. Unsupervised domain adaptation methods aim to align the feature distributions between the labeled source domain (where annotated training data are available) and the unlabeled target domain (where annotations are absent). By learning domain-invariant representations from both domains, these techniques enable the model to generalize well to the target domain without requiring explicit annotations.
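The alignment of source and target feature distributions described above is commonly trained with a pair of adversarial objectives, in the style of ADVENT [9]. The sketch below shows the two losses on hypothetical discriminator outputs. It assumes a discriminator that outputs the probability that a map comes from the source (synthetic) domain; the label convention, function names, and numbers are illustrative assumptions rather than the chapter's actual implementation.

```python
import math

EPS = 1e-7  # keeps log() finite at 0 and 1
SOURCE, TARGET = 1, 0  # domain labels (a convention chosen for this sketch)


def bce(prob, label):
    """Binary cross-entropy for one discriminator output `prob` in [0, 1]."""
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_on_source, d_on_target):
    """Discriminator objective: tell source maps (label 1) from target maps (label 0)."""
    losses = [bce(p, SOURCE) for p in d_on_source]
    losses += [bce(p, TARGET) for p in d_on_target]
    return sum(losses) / len(losses)


def adversarial_loss(d_on_target):
    """Detector objective: make target maps look like source maps to the discriminator."""
    return sum(bce(p, SOURCE) for p in d_on_target) / len(d_on_target)


# Hypothetical discriminator outputs before the two domains are aligned.
d_on_source = [0.9, 0.8]  # synthetic (labeled) images
d_on_target = [0.2, 0.1]  # real (unlabeled) images
print(round(discriminator_loss(d_on_source, d_on_target), 4))
print(round(adversarial_loss(d_on_target), 4))
```

When the discriminator separates the domains easily, its own loss is small and the adversarial loss seen by the detector is large. Training alternates between the two objectives, and the adversarial loss approaches log 2 as the discriminator's outputs on target maps approach 0.5, which corresponds to features it can no longer tell apart.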

6.6 Discussion and Future Recommendations

The motivation behind this work is to reduce the heavy dependence on manually produced object-level annotations and to find a robust, generalized method that offers a better mechanism for logo detection in new test images. This work begins with an exploration of the importance of the automated logo recognition task, highlighting its potential applications in various business domains. The main focus of this work lies in attention mechanisms and in semi-supervised (weakly supervised) and unsupervised logo detection using an anchorfree detector.

In the initial work, this book discusses attention mechanisms for the logo classification and localization task. To address the challenge of object-level annotations, a weakly supervised algorithm for logo recognition with image-level annotation is implemented. Several experiments are conducted to explore the importance of attention mechanisms with feature extraction networks. The results demonstrate that a larger spatial resolution and a dilated receptive field are beneficial for extracting feature maps with rich representations from an image. Current deep learning-based object detection frameworks perform dense prediction to detect objects. In this context, the results of our experiments are instrumental in designing new robust feature maps capable of retaining spatial resolution in coarse layers for various tasks. This method scales well to a large number of logo images because it relies only on image-level annotations.

In the subsequent work, an anchorfree detection framework is presented for logo detection. It involves a robust hourglass-style feature extraction network and attention mechanisms to facilitate accurate logo detection. Additionally, a lightweight CNN module is designed specifically for practical applications, significantly reducing the number of parameters and computations while maintaining model accuracy.
Through analysis, it is determined that the parallel combination of spatial and channel attention mechanisms refines global information and generates more balanced feature maps, resulting in improved performance.

In our previous study, our focus was primarily on improving accuracy on given datasets. In our next work, however, we shifted our attention to domain adaptation-based approaches aimed at addressing the problem of domain shift. Within the anchorfree detection framework, we perform evaluations to identify the appropriate layer for performing domain alignment between datasets with varying data distribution and style. Further, we synthesized training logo images using the 1-shot logo icon. We found that synthesized images with automatically generated bounding box annotations can effectively fulfill the requirement for large training datasets. These synthesized training logo images are utilized to facilitate model learning from synthetic to real logo images in an unsupervised manner. To bridge the domain gap between synthetic and real logo images, we formulate a method utilizing entropy minimization of mid-level feature maps for adversarial learning. The results demonstrate that our approach significantly enhances accuracy on real test images.

Effectively training a deep network requires a large training dataset. Image-level supervision enhances the framework’s capabilities and scalability. Utilizing only image-level labeled real images can effectively enhance detection performance. Given that collecting training data with image-level annotations in a weakly supervised manner is much easier than manually annotating object-level bounding boxes, a promising approach would be to utilize semi-supervised learning using image-level annotations for real images alongside synthesized images. Furthermore, synthesized training images with automatically generated object-level annotations present a practical alternative to the time-consuming and expensive process of bounding box annotation. The utilization of synthetic images helps in learning object-level logo recognition and eliminates the need for object-level annotations, which are crucial components of any real-time detection problem.

Our results demonstrate an improvement in performance; however, there is still a need to further enhance existing domain adaptation-based methods for various computer vision tasks. Specifically, domain adaptation techniques can be explored more extensively to address the domain shift problem between synthetic and real image datasets, considering unsupervised, semi-supervised, and few-shot adaptation-based methods for both detection and segmentation tasks.
However, training in an adversarial manner presents challenges for detection networks. To further improve model generalization, it would be beneficial to incorporate methods such as Generative Adversarial Network-based image synthesis, domain generalization, and federated learning. This can lead to better performance and stability in training. We intend to propose and implement novel domain adaptation-based techniques using weakly supervised and unsupervised schemes to address the domain shift problem between synthetic and real image datasets. In the future, it would be valuable to explore more domain adaptation and generalization-based techniques, including federated learning for real-time detection. These advancements may help refine the architecture and reduce dependence on human-annotated training datasets.

References

1. Shen, R., et al.: Unsupervised domain adaptation with adversarial learning for mass detection in mammogram. Neurocomputing 393, 27–37 (2020)
2. Chen, Y.W., Jain, L.C. (eds.): Deep learning in healthcare. Springer, Berlin/Heidelberg, Germany (2020)
3. Sakaridis, C., et al.: Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 126(9), 973–992 (2018)
4. Song, L., et al.: Learning from synthetic images via active pseudo-labeling. IEEE Trans. Image Process. 29, 6452–6465 (2020)
5. Sankaranarayanan, S., et al.: Learning from synthetic data: addressing domain shift for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, pp. 3752–3761 (2018)
6. Romberg, S., Pueyo, L.G., Lienhart, R., Zwol, R.V.: Scalable logo recognition in real-world images. In: The 1st ACM International Conference on Multimedia Retrieval, USA, pp. 1–8 (2011)
7. Su, H., Zhu, X., Gong, S.: Deep learning logo detection with data expansion by synthesising context. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 530–539 (2017)
8. Su, H., Zhu, X., Gong, S.: Open logo detection challenge. In: Proceedings of the British Machine Vision Conference, UK (2018)
9. Vu, T.H., et al.: ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2512–2521 (2019)
10. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
11. Zhou, X., Wang, D., Krahenbuhl, P.: Objects as points (2019). arXiv:1904.07850
12. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
13. Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (2018)
14. Chen, M., Xue, H., Cai, D.: Domain adaptation for semantic segmentation with maximum squares loss. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2090–2099 (2019)
15. Scheck, T., et al.: Unsupervised domain adaptation from synthetic to real images for anchorless object detection. In: Proceedings of Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 319–327 (2021)
16. Jain, R.K., Watasue, T., Nakagawa, T., Sato, T., Iwamoto, Y., Ruan, X., Chen, Y.W.: LogoNet: a robust layer-aggregated dual-attention anchorfree logo detection framework with an adversarial domain adaptation approach. Appl. Sci. 11(20), 9622 (2021). https://doi.org/10.3390/app11209622
17. Hsu, H.K., et al.: Progressive domain adaptation for object detection. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 738–746 (2020)
18. Saito, K., Ushiku, Y., et al.: Strong-weak distribution alignment for adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6956–6965 (2019)
19. Xu, C.D., et al.: Exploring categorical regularization for domain adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11721–11730 (2020)
20. Ananda, S., et al.: Dual discriminator-based unsupervised domain adaptation using adversarial learning for liver segmentation on multiphase CT images. In: Proceedings of IEEE Engineering in Medicine and Biology Society, pp. 1552–1555 (2022)
21. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. (2017)
22. Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008)