Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX (Lecture Notes in Computer Science)
ISBN: 3031200764, 9783031200762



English · 811 pages


Table of contents :
Foreword
Preface
Organization
Contents – Part IX
BEVFormer: Learning Bird's-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers
1 Introduction
2 Related Work
3 BEVFormer
3.1 Overall Architecture
3.2 BEV Queries
3.3 Spatial Cross-attention
3.4 Temporal Self-attention
3.5 Applications of BEV Features
3.6 Implementation Details
4 Experiments
4.1 Datasets
4.2 Experimental Settings
4.3 3D Object Detection Results
4.4 Multi-tasks Perception Results
4.5 Ablation Study
4.6 Visualization Results
5 Discussion and Conclusion
References
Category-Level 6D Object Pose and Size Estimation Using Self-supervised Deep Prior Deformation Networks
1 Introduction
2 Related Work
3 Self-supervised Deep Prior Deformation Network
3.1 Deep Prior Deformation Network
3.2 Self-supervised Training Objective L_self-supervised
3.3 Supervised Training Objective L_supervised
4 Experiments
4.1 Comparisons with Existing Methods
4.2 Ablation Studies and Analyses
4.3 Visualization
References
Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection
1 Introduction
2 Related Works
2.1 Semi-supervised Learning
2.2 Object Detection
2.3 Semi-supervised Object Detection
3 Dense Teacher
3.1 Pseudo-Labeling Framework
3.2 Disadvantages of Pseudo-Box Labels
3.3 Dense Pseudo-Label
4 Experiments
4.1 Datasets and Experiment Settings
4.2 Main Results
4.3 Comparison with State-of-the-Arts
4.4 Ablation and Key Parameters
5 Conclusion
References
Point-to-Box Network for Accurate Object Detection via Single Point Supervision
1 Introduction
2 Related Work
2.1 Box-Supervised Object Detection
2.2 Image-Supervised Object Detection
2.3 Point-Supervised Object Detection
3 Point-to-Box Network
3.1 Coarse Pseudo Box Prediction
3.2 Pseudo Box Refinement
4 Experiments
4.1 Experiment Settings
4.2 Performance Comparisons
4.3 Ablation Study
5 Conclusion
References
Domain Adaptive Hand Keypoint and Pixel Localization in the Wild
1 Introduction
2 Related Work
3 Proposed Method
3.1 Geometric Augmentation Consistency
3.2 Confidence Estimation by Two Separate Networks
3.3 Teacher-Student Update by Knowledge Distillation
3.4 Overall Objectives
4 Experiments
4.1 Experiment Setup
4.2 Quantitative Results
4.3 Qualitative Results
4.4 Ablation Studies
5 Conclusion
References
Towards Data-Efficient Detection Transformers
1 Introduction
2 Related Work
2.1 Object Detection
2.2 Label Assignment
2.3 Data-Efficiency of Vision Transformers
3 Difference Analysis of RCNNs and DETRs
3.1 Detector Selection
3.2 Transformation from Sparse RCNN to DETR
3.3 Summary
4 Method
4.1 A Revisit of Detection Transformers
4.2 Model Improvement
4.3 Label Augmentation for Richer Supervision
5 Experiments
5.1 Main Results
5.2 Ablations
5.3 Generalization to Sample-Rich Dataset
6 Conclusion
References
Open-Vocabulary DETR with Conditional Matching
1 Introduction
2 Related Work
3 Open-Vocabulary DETR
3.1 Revisiting Closed-Set Matching in DETR
3.2 Conditional Matching for Open-Vocabulary Detection
3.3 Optimization
3.4 Inference
4 Experiments
4.1 Ablation Studies
4.2 Results on Open-Vocabulary Benchmarks
4.3 Generalization Ability of OV-DETR
4.4 Qualitative Results
4.5 Inference Time Analysis
5 Conclusion
References
Prediction-Guided Distillation for Dense Object Detection
1 Introduction
2 Related Work
3 Method
3.1 Key Predictive Regions
3.2 Prediction-Guided Weighting Module
3.3 Prediction-Guided Distillation
4 Experiments
4.1 Setup and Implementation Details
4.2 Main Results
4.3 Ablation Study
5 Conclusion
References
Multimodal Object Detection via Probabilistic Ensembling
1 Introduction
2 Related Work
3 Fusion Strategies for Multimodal Detection
4 Experiments
4.1 Implementation
4.2 Multimodal Pedestrian Detection on KAIST
4.3 Multimodal Object Detection on FLIR
5 Discussion and Conclusions
References
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
1 Introduction
2 Related Work
3 Method
3.1 Training Object Detectors with Unlabeled Data
3.2 VL-PLM: Pseudo Labels from Vision & Language Models
3.3 Using Our Pseudo Labels for Downstream Tasks
4 Experiments
4.1 Open-Vocabulary Object Detection
4.2 Semi-supervised Object Detection
4.3 Analysis of Pseudo Label Generation
5 Conclusion
References
CPO: Change Robust Panorama to Point Cloud Localization
1 Introduction
2 Related Work
3 Method
3.1 Fast Histogram Generation
3.2 Score Map Generation
3.3 Candidate Pose Selection
3.4 Pose Refinement
4 Experiments
4.1 Localization Performance on Scenes with Changes
4.2 Localization Performance on Scenes Without Changes
4.3 Ablation Study
5 Conclusion
References
INT: Towards Infinite-Frames 3D Detection with an Efficient Framework
1 Introduction
2 Related Work
2.1 3D Object Detection
2.2 Multi-frame Methods
3 Methodology
3.1 Overview of INT Framework
3.2 SeqFusion
3.3 SeqSampler
3.4 SeqAug
4 Experiments
4.1 Datasets
4.2 Experimental Settings
4.3 Effectiveness and Efficiency
4.4 Comparison with SOTAs
4.5 Ablation Studies
5 Conclusion
References
End-to-End Weakly Supervised Object Detection with Sparse Proposal Evolution
1 Introduction
2 Related Work
2.1 Weakly Supervised Object Detection
2.2 Object Proposal Generation
3 Methodology
3.1 Overview
3.2 Seed Proposal Generation
3.3 Sparse Proposal Refinement
3.4 End-to-End Training
4 Experiment
4.1 Experimental Setting
4.2 Ablation Study
4.3 Visualization Analysis
4.4 Performance
5 Conclusion
References
Calibration-Free Multi-view Crowd Counting
1 Introduction
2 Related Work
3 Calibration-Free Multi-view Crowd Counting
3.1 Single-View Counting Module (SVC)
3.2 View-Pair Matching Module (VPM)
3.3 Weight Map Prediction Module (WMP)
3.4 Total Count Calculation Module (TCC)
3.5 Adaptation to Novel Real Scenes
4 Experiment
4.1 Experiment Setting
4.2 Experiment Results
5 Conclusion
References
Unsupervised Domain Adaptation for Monocular 3D Object Detection via Self-training
1 Introduction
2 Related Work
2.1 Monocular 3D Object Detection
2.2 Unsupervised Domain Adaptation
3 STMono3D
3.1 Problem Definition
3.2 Framework Overview
3.3 Self-teacher with Temporal Ensemble
3.4 Geometry-Aligned Multi-scale Training
3.5 Quality-Aware Supervision
3.6 Crucial Training Strategies
4 Experiments
4.1 Experimental Setup
4.2 Main Results
4.3 Ablation Studies and Analysis
5 Conclusion
References
SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud
1 Introduction
2 Related Work
3 Method
3.1 Line Segmentation Model
3.2 Joint Training of Line Segmentation and Description
4 Experiments
4.1 Network Training
4.2 Point Cloud Registration Test
4.3 Line Segmentation Evaluation
4.4 Generalization on Unseen Dataset
4.5 Ablation Study
5 Conclusions
References
Exploring Plain Vision Transformer Backbones for Object Detection
1 Introduction
2 Related Work
3 Method
4 Experiments
4.1 Ablation Study and Analysis
4.2 Comparisons with Hierarchical Backbones
4.3 Comparisons with Previous Systems
5 Conclusion
References
Adversarially-Aware Robust Object Detector
1 Introduction
2 Related Work
2.1 Adversarial Attack and Defense
2.2 Attack and Robust Object Detector
3 Adversarial Robustness in Object Detection
3.1 Problem Setting
3.2 Analyses of the Detection Robustness Bottleneck
4 Methodology
4.1 Overall Framework
4.2 Adversarially-Aware Convolution (AAconv)
4.3 Adversarial Image Discriminator (AID)
4.4 Consistent Features with Reconstruction (CFR)
5 Experiments
5.1 Implementation Details
5.2 Detection Robustness Evaluation
5.3 Model Evaluation and Analysis
6 Conclusion
References
HEAD: HEtero-Assists Distillation for Heterogeneous Object Detectors
1 Introduction
2 Related Work
2.1 Object Detection
2.2 Knowledge Distillation
3 Method
3.1 Review of Detection KD
3.2 HEAD
3.3 TF-HEAD
4 Experiments
4.1 Main Results
4.2 Ablation Study
4.3 Visualization
5 Conclusion
References
You Should Look at All Objects
1 Introduction
2 Related Works
3 Revisit FPN
3.1 Backbone Network
3.2 FPN-free Detection Framework
3.3 FPN-Based Detection Framework
3.4 Analysis of FPN
4 Methodology
4.1 Auxiliary Losses
4.2 Feature Pyramid Generation Paradigm
5 Experiments
5.1 Ablation Studies
5.2 Performance with Various Detection Frameworks
5.3 Instance Segmentation
6 Conclusions
References
Detecting Twenty-Thousand Classes Using Image-Level Supervision
1 Introduction
2 Related Work
3 Preliminaries
4 Detic: Detector with Image Classes
5 Experiments
5.1 Implementation Details
5.2 Prediction-Based vs Non-prediction-Based Methods
5.3 Comparison with a Fully-Supervised Detector
5.4 Comparison with the State-of-the-Art
5.5 Detecting 21K Classes Across Datasets Without Finetuning
5.6 Ablation Studies
5.7 The Standard LVIS benchmark
6 Limitations and Conclusions
References
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation
1 Introduction
2 Related Work
3 Deep Correspondence Learning Network
3.1 Point-Wise Feature Extraction
3.2 Dual Feature Disengagement and Alignment
3.3 Confidence-Based Pose Estimation
3.4 Training of Deep Correspondence Learning Network
3.5 Confidence-Based Pose Refinement
4 Experiments
4.1 Ablation Studies and Analyses
4.2 Comparisons with Existing Methods
References
Monocular 3D Object Detection with Depth from Motion
1 Introduction
2 Related Work
3 Theoretical Analysis
3.1 Object Depth from Binocular Systems
3.2 Object Depth from General Two-View Systems
3.3 Achilles Heel of Depth from Motion
4 Methodology
4.1 Framework Overview
4.2 Geometry-Aware Stereo Cost Volume Construction
4.3 Monocular Compensation
4.4 Pose-Free Depth from Motion
5 Experiments
5.1 Experimental Setup
5.2 Quantitative Analysis
5.3 Qualitative Analysis
5.4 Ablation Studies
6 Conclusion
References
DISP6D: Disentangled Implicit Shape and Pose Learning for Scalable 6D Pose Estimation
1 Introduction
2 Related Works
3 Method
3.1 Disentangled Shape and Pose Learning
3.2 Contrastive Metric Learning for Object Shapes
3.3 Re-entanglement of Shape and Pose
4 Inference Under Different Settings
5 Experiments
5.1 Setup
5.2 Setting I: Novel Objects in a Given Category
5.3 Setting III (Extension): Novel Objects Across Categories Without 3D Models
5.4 Setting II: Novel Objects with 3D Models
5.5 Ablation Study
6 Conclusion
References
Distilling Object Detectors with Global Knowledge
1 Introduction
2 Related Works
2.1 Object Detection
2.2 Knowledge Distillation
3 Method
3.1 Prototype Generation Module
3.2 Robust Distillation Module
3.3 Optimization
4 Experiments
4.1 Comparison with Existing Methods on VOC and COCO Datasets
4.2 Effects of the Prototypes in Robust Knowledge Distillation
4.3 Analysis of the Hyperparameters
4.4 Analysis on the Prototype Generation Methods
4.5 Distilling with Larger Teacher
4.6 Analysis of Noisy Knowledge Transferring
5 Conclusion
References
Unifying Visual Perception by Dispersible Points Learning
1 Introduction
2 Related Work
3 Method
3.1 UniHead
3.2 Adaptation to Different Visual Tasks
3.3 Adaptation to Different Visual Frameworks
3.4 UniHead Initialization
4 Experiments
4.1 Implementation Details
4.2 Ablation Studies
4.3 Generalization Ability
4.4 Comparison with State-of-the-Art
5 Conclusion
References
PseCo: Pseudo Labeling and Consistency Training for Semi-Supervised Object Detection
1 Introduction
2 Related Works
3 Method
3.1 The Basic Framework
3.2 Noisy Pseudo Box Learning
3.3 Multi-view Scale-Invariant Learning
4 Experiments
4.1 Dataset and Evaluation Protocol
4.2 Implementation Details
4.3 Comparison with State-of-the-Art Methods
4.4 Ablation Study
5 Conclusion
References
Exploring Resolution and Degradation Clues as Self-supervised Signal for Low Quality Object Detection
1 Introduction
2 Related Works
2.1 Single Image Super Resolution
2.2 Image Restoration for Machine Perception
3 Down-sampling Degradation Transformations
4 Our Framework
4.1 CenterNet
4.2 Architecture and Training Pipeline
4.3 Inference Procedure
5 Experiments and Details
5.1 Datasets and Implementation Details
5.2 Multi-degradation Evaluation
5.3 Degradation Specific Evaluation
5.4 Ablation Study
6 Conclusion
References
Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features
1 Introduction
2 Related Work
3 Method
3.1 Notation
3.2 Prior Work: Render-And-Compare for Pose Estimation
3.3 Learning Scale-Invariant Contrastive Features
3.4 Coarse-to-Fine 6D Pose Estimation
3.5 Multi-object Reasoning
4 Experiments
4.1 Experimental Setup
4.2 Quantitative Results
4.3 Qualitative Examples
4.4 Ablation Study
5 Conclusions
References
Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection
1 Introduction
2 Related Work
2.1 Oriented Object Detection
2.2 Cross-Modal Image Alignment
3 Methodology and Analysis
3.1 Analysis
3.2 Translation-Scale-Rotation Alignment Module
3.3 TSRA-Based Oriented Detector
4 Experimental Results
4.1 Dataset and Evaluation Metrics
4.2 Statistics of DroneVehicle
4.3 Ablation Studies
4.4 Comparisons
5 Conclusions
References
RFLA: Gaussian Receptive Field Based Label Assignment for Tiny Object Detection
1 Introduction
2 Related Work
2.1 Object Detection
2.2 Tiny Object Detection
2.3 Label Assignment in Object Detection
3 Method
3.1 Receptive Field Modelling
3.2 Receptive Field Distance
3.3 Hierarchical Label Assignment
3.4 Application to Detectors
4 Experiment
4.1 Dataset
4.2 Experiment Settings
4.3 Ablation Study
4.4 Main Result
4.5 Analysis
4.6 Experiment on More Datasets
5 Conclusion
References
Rethinking IoU-based Optimization for Single-stage 3D Object Detection
1 Introduction
2 Related Work
3 Method
3.1 Our RDIoU
3.2 More Comparison Between RDIoU and 3D IoU
3.3 Incorporating RDIoU into Regression Supervision
3.4 Incorporating RDIoU into Classification Supervision
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Results on Real-world Datasets
4.4 Ablation Studies
5 Conclusion
References
TD-Road: Top-Down Road Network Extraction with Holistic Graph Construction
1 Introduction
2 Related Work
2.1 Road Network Extraction
2.2 Relation Network
3 A Holistic Model for Direct Graph Construction
3.1 Overview
3.2 Relation Reasoning for Graph Edges
3.3 Key Point Prediction for Graph Nodes
3.4 Loss Function
4 Experiments and Results
4.1 Datasets and Evaluation Settings
4.2 Implementation Details
4.3 Comparison Results
4.4 Ablation Studies and Analysis
5 Conclusions
References
Multi-faceted Distillation of Base-Novel Commonality for Few-Shot Object Detection
1 Introduction
2 Related Work
3 Multi-faceted Distillation of Base-Novel Commonality
3.1 Preliminary
3.2 Distilling Recognition-Related Semantic Commonalities
3.3 Distilling Localization-Related Semantic Commonalities
3.4 Distilling Distribution Commonalities
3.5 Unified Distillation Framework Based on Memory Bank
4 Experiments
4.1 Experimental Setup
4.2 Comparison with State-of-the-Art Methods
4.3 Integration with Different Baseline Methods
4.4 Ablation Studies
5 Conclusion
References
PointCLM: A Contrastive Learning-based Framework for Multi-instance Point Cloud Registration
1 Introduction
2 Related Work
3 Problem Formulation
4 Method
4.1 Feature Extractor
4.2 Pruning
4.3 Clustering and Transformations Estimation
5 Experiment
5.1 Experimental Settings
5.2 Experiment on Synthetic Dataset
5.3 Experiment on Real Dataset
5.4 Ablation Studies
6 Conclusion
References
Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration
1 Introduction
2 Related Work
2.1 Weakly Supervised Object Localization
2.2 Graph Diffusion
3 Methodology
3.1 Overall Architecture
3.2 Activation Diffusion Block
3.3 Prediction
4 Experiments
4.1 Experiment Settings
4.2 Performance
4.3 Ablation Study
5 Conclusions
References
MTTrans: Cross-domain Object Detection with Mean Teacher Transformer
1 Introduction
2 Related Work
2.1 Object Detection
2.2 Unsupervised Domain Adaptive Object Detection
3 Method
3.1 Mean Teacher-Based Knowledge Transfer Framework
3.2 Multi-level Cross-domain Adversarial Feature Alignment
3.3 Progressive Cross-domain Knowledge Transfer with Mean Teacher and Adversarial Feature Alignment
4 Evaluation
4.1 Experimental Setup
4.2 Comparisons with Other Methods
4.3 Ablation Studies
4.4 Visualization and Analysis
5 Conclusions
References
Multi-domain Multi-definition Landmark Localization for Small Datasets
1 Introduction
2 Related Works
3 Method
3.1 ViT Encoder
3.2 Facial Landmark Semantic Group (FLSG)
3.3 Definition Agnostic Decoder
3.4 Definition/Domain-Specific Prediction Heads
4 Experiments and Results
4.1 COFW
4.2 WFLW
4.3 Small Dataset Experiments
4.4 Ablation Analysis
5 Limitations and Conclusion
References
DEVIANT: Depth EquiVarIAnt NeTwork for Monocular 3D Object Detection
1 Introduction
2 Literature Review
3 Background
4 Depth Equivariant Backbone
5 Experiments
5.1 KITTI Test Monocular 3D Detection
5.2 KITTI Val Monocular 3D Detection
5.3 Waymo Val Monocular 3D Detection
6 Conclusions
References
Label-Guided Auxiliary Training Improves 3D Object Detector
1 Introduction
2 Related Works
2.1 3D Object Detection
2.2 Auxiliary Task and Knowledge Distillation
3 Method
3.1 Label-Knowledge Mapper
3.2 Label-Annotation-Inducer
3.3 Separable Auxiliary Tasks
4 Experiments
4.1 Experiment Settings
4.2 Main Results
4.3 Ablation Studies
4.4 Qualitative Results and Discussion
5 Conclusion
References
PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images
1 Introduction
2 Related Work
3 Methodology
3.1 Open Vocabulary Object Detector
3.2 Naïve Alignment via Detector Training
3.3 Alignment via Regional Prompt Learning
3.4 PromptDet: Alignment via Self-training
4 Experiment
4.1 Dataset and Evaluation Metrics
4.2 Implementation Details
4.3 Ablation Study
4.4 Comparison with the State-of-the-Art
5 Conclusion
References
Densely Constrained Depth Estimator for Monocular 3D Object Detection
1 Introduction
2 Related Work
3 Methodology
3.1 Geometric-Based 3D Detection Definition
3.2 Densely Geometric-Constrained Depth Estimation
3.3 Depth Weighting by Graph Matching
4 Experiments
4.1 Setup
4.2 Implementation Details
4.3 Comparison with State-of-the-Art Methods
4.4 Ablation Studies
4.5 Discussion About DCD and AutoShape
5 Conclusion
References
Polarimetric Pose Prediction
1 Introduction
2 Related Work
2.1 Polarimetric Imaging
2.2 6D Pose Prediction
3 Polarimetric Pose Prediction
3.1 Photometric Challenges for RGB-D
3.2 Surface Normals from Polarisation
3.3 Hybrid Polarimetric Pose Prediction Model
3.4 Learning Objectives
4 Polarimetric Data Acquisition
5 Experimental Results
5.1 Experiments Setup
5.2 PPP-Net Evaluation
5.3 Comparison with Established Benchmarks
6 Discussion
7 Conclusion
References
Author Index

LNCS 13669

Shai Avidan · Gabriel Brostow · Moustapha Cissé · Giovanni Maria Farinella · Tal Hassner (Eds.)

Computer Vision – ECCV 2022 17th European Conference Tel Aviv, Israel, October 23–27, 2022 Proceedings, Part IX

Lecture Notes in Computer Science

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members
Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Moti Yung, Columbia University, New York, NY, USA


More information about this series at https://link.springer.com/bookseries/558


Editors

Shai Avidan, Tel Aviv University, Tel Aviv, Israel
Gabriel Brostow, University College London, London, UK
Moustapha Cissé, Google AI, Accra, Ghana
Giovanni Maria Farinella, University of Catania, Catania, Italy
Tal Hassner, Facebook (United States), Menlo Park, CA, USA

ISSN 0302-9743 · ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-20076-2 · ISBN 978-3-031-20077-9 (eBook)
https://doi.org/10.1007/978-3-031-20077-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

Organizing the European Conference on Computer Vision (ECCV 2022) in Tel-Aviv during a global pandemic was no easy feat. The uncertainty level was extremely high, and decisions had to be postponed to the last minute. Still, we managed to plan things just in time for ECCV 2022 to be held in person. Participation in physical events is crucial to stimulating collaborations and nurturing the culture of the Computer Vision community.

There were many people who worked hard to ensure attendees enjoyed the best science at the 17th edition of ECCV. We are grateful to the Program Chairs Gabriel Brostow and Tal Hassner, who went above and beyond to ensure the ECCV reviewing process ran smoothly. The scientific program includes dozens of workshops and tutorials in addition to the main conference, and we would like to thank Leonid Karlinsky and Tomer Michaeli for their hard work. Finally, special thanks to the web chairs Lorenzo Baraldi and Kosta Derpanis, who put in extra hours to transfer information fast and efficiently to the ECCV community.

We would like to express gratitude to our generous sponsors and the Industry Chairs, Dimosthenis Karatzas and Chen Sagiv, who oversaw industry relations and proposed new ways for academia-industry collaboration and technology transfer. It's great to see so much industrial interest in what we're doing!

Authors' draft versions of the papers appeared online with open access on both the Computer Vision Foundation (CVF) and the European Computer Vision Association (ECVA) websites, as with previous ECCVs. Springer, the publisher of the proceedings, has arranged for archival publication. The final version of the papers is hosted by SpringerLink, with active references and supplementary materials. It benefits all potential readers that we offer both a free and citeable version for all researchers, as well as an authoritative, citeable version for SpringerLink readers. Our thanks go to Ronan Nugent from Springer, who helped us negotiate this agreement.

Last but not least, we wish to thank Eric Mortensen, our publication chair, whose expertise made the process smooth.

October 2022

Rita Cucchiara · Jiří Matas · Amnon Shashua · Lihi Zelnik-Manor

Preface

Welcome to the proceedings of the European Conference on Computer Vision (ECCV 2022). This was a hybrid edition of ECCV as we made our way out of the COVID-19 pandemic.

The conference received 5804 valid paper submissions, compared to 5150 submissions to ECCV 2020 (a 12.7% increase) and 2439 in ECCV 2018. 1645 submissions were accepted for publication (28%) and, of those, 157 (2.7% overall) were selected as orals. 846 of the submissions were desk-rejected for various reasons, many of them because they revealed author identity, thus violating the double-blind policy. This violation came in many forms: some had author names with the title, others added acknowledgments to specific grants, yet others had links to their GitHub account where their name was visible. Tampering with the LaTeX template was another reason for automatic desk rejection.

ECCV 2022 used the traditional CMT system to manage the entire double-blind reviewing process. Authors did not know the names of the reviewers and vice versa. Each paper received at least 3 reviews (except 6 papers that received only 2 reviews), totalling more than 15,000 reviews. Handling the review process at this scale was a significant challenge. To ensure that each submission received reviews that were as fair and of as high quality as possible, we recruited a large pool of reviewers; in the end, 4719 reviewers completed at least one review. Similarly, we recruited a large number of area chairs; eventually, 276 area chairs each handled a batch of papers. The area chairs were selected based on their technical expertise and reputation, largely among people who served as area chairs at previous top computer vision and machine learning conferences (ECCV, ICCV, CVPR, NeurIPS, etc.). Reviewers were similarly invited from previous conferences, and also from the pool of authors. We also encouraged experienced area chairs to suggest additional chairs and reviewers in the initial phase of recruiting. The median reviewer load was five papers per reviewer, while the average load was about four papers because of the emergency reviewers. The area chair load was 35 papers, on average.

Conflicts of interest between authors, area chairs, and reviewers were handled largely automatically by the CMT platform, with some manual help from the Program Chairs. Reviewers could describe themselves as senior reviewers (a load of 8 papers to review) or junior reviewers (a load of 4 papers). Papers were matched to area chairs based on a subject-area affinity score computed in CMT and an affinity score computed by the Toronto Paper Matching System (TPMS), which is based on the paper's full text. The area chair handling each submission would bid for preferred expert reviewers, and we balanced load and prevented conflicts. The assignment of submissions to area chairs was relatively smooth, as was the assignment of submissions to reviewers. A small percentage of reviewers were not happy with their assignments in terms of subjects and self-reported expertise; this is an area for improvement, although it is interesting that many of these cases involved reviewers handpicked by ACs. We made a later round of reviewer recruiting, targeted at the list of authors of papers submitted to the conference, and had an excellent response, which helped provide enough emergency reviewers. In the end, all but six papers received at least 3 reviews.

The challenges of the reviewing process are in line with past experience at ECCV 2020. As the community grows and the number of submissions increases, it becomes ever more challenging to recruit enough reviewers and ensure a high enough quality of reviews. Enlisting authors by default as reviewers might be one step to address this challenge.

Authors were given a week to rebut the initial reviews and address reviewers' concerns. Each rebuttal was limited to a single PDF page with a fixed template. The Area Chairs then led discussions with the reviewers on the merits of each submission. The goal was to reach consensus but, ultimately, it was up to the Area Chair to make a decision. The decision was then discussed with a buddy Area Chair to make sure decisions were fair and informative. The entire process was conducted virtually, with no in-person meetings taking place. The Program Chairs were informed in cases where the Area Chairs overturned a decisive consensus reached by the reviewers, and pushed for the meta-reviews to contain details that explained the reasoning for such decisions. Obviously these were the most contentious cases, where reviewer inexperience was the most common reported factor.

Once the list of accepted papers was finalized and released, we went through the laborious process of plagiarism (including self-plagiarism) detection. A total of 4 accepted papers were rejected because of that.

Finally, we would like to thank our Technical Program Chair, Pavel Lifshits, who did tremendous work behind the scenes, and we thank the tireless CMT team.

October 2022

Gabriel Brostow · Giovanni Maria Farinella · Moustapha Cissé · Shai Avidan · Tal Hassner

Organization

General Chairs

Rita Cucchiara, University of Modena and Reggio Emilia, Italy
Jiří Matas, Czech Technical University in Prague, Czech Republic
Amnon Shashua, Hebrew University of Jerusalem, Israel
Lihi Zelnik-Manor, Technion – Israel Institute of Technology, Israel

Program Chairs

Shai Avidan, Tel-Aviv University, Israel
Gabriel Brostow, University College London, UK
Moustapha Cissé, Google AI, Ghana
Giovanni Maria Farinella, University of Catania, Italy
Tal Hassner, Facebook AI, USA

Program Technical Chair

Pavel Lifshits, Technion – Israel Institute of Technology, Israel

Workshops Chairs

Leonid Karlinsky, IBM Research, Israel
Tomer Michaeli, Technion – Israel Institute of Technology, Israel
Ko Nishino, Kyoto University, Japan

Tutorial Chairs

Thomas Pock, Graz University of Technology, Austria
Natalia Neverova, Facebook AI Research, UK

Demo Chair

Bohyung Han, Seoul National University, Korea


Social and Student Activities Chairs

Tatiana Tommasi, Italian Institute of Technology, Italy
Sagie Benaim, University of Copenhagen, Denmark

Diversity and Inclusion Chairs

Xi Yin, Facebook AI Research, USA
Bryan Russell, Adobe, USA

Communications Chairs

Lorenzo Baraldi, University of Modena and Reggio Emilia, Italy
Kosta Derpanis, York University & Samsung AI Centre Toronto, Canada

Industrial Liaison Chairs

Dimosthenis Karatzas, Universitat Autònoma de Barcelona, Spain
Chen Sagiv, SagivTech, Israel

Finance Chair

Gerard Medioni, University of Southern California & Amazon, USA

Publication Chair

Eric Mortensen, MiCROTEC, USA

Area Chairs

Lourdes Agapito Zeynep Akata Naveed Akhtar Karteek Alahari Alexandre Alahi Pablo Arbelaez Antonis A. Argyros Yuki M. Asano Kalle Åström Hadar Averbuch-Elor

University College London, UK University of Tübingen, Germany University of Western Australia, Australia Inria Grenoble Rhône-Alpes, France École polytechnique fédérale de Lausanne, Switzerland Universidad de Los Andes, Colombia University of Crete & Foundation for Research and Technology-Hellas, Crete University of Amsterdam, The Netherlands Lund University, Sweden Cornell University, USA


Hossein Azizpour Vineeth N. Balasubramanian Lamberto Ballan Adrien Bartoli Horst Bischof Matthew B. Blaschko Federica Bogo Katherine Bouman Edmond Boyer Michael S. Brown Vittorio Caggiano Neill Campbell Octavia Camps Duygu Ceylan Ayan Chakrabarti Tat-Jen Cham Antoni Chan Manmohan Chandraker Xinlei Chen Xilin Chen Dongdong Chen Chen Chen Ondrej Chum John Collomosse Camille Couprie David Crandall Daniel Cremers Marco Cristani Canton Cristian Dengxin Dai Dima Damen Kostas Daniilidis Trevor Darrell Andrew Davison Tali Dekel Alessio Del Bue Weihong Deng Konstantinos Derpanis Carl Doersch


KTH Royal Institute of Technology, Sweden Indian Institute of Technology, Hyderabad, India University of Padova, Italy Université Clermont Auvergne, France Graz University of Technology, Austria KU Leuven, Belgium Meta Reality Labs Research, Switzerland California Institute of Technology, USA Inria Grenoble Rhône-Alpes, France York University, Canada Meta AI Research, USA University of Bath, UK Northeastern University, USA Adobe Research, USA Google Research, USA Nanyang Technological University, Singapore City University of Hong Kong, Hong Kong, China NEC Labs America, USA Facebook AI Research, USA Institute of Computing Technology, Chinese Academy of Sciences, China Microsoft Cloud AI, USA University of Central Florida, USA Vision Recognition Group, Czech Technical University in Prague, Czech Republic Adobe Research & University of Surrey, UK Facebook, France Indiana University, USA Technical University of Munich, Germany University of Verona, Italy Facebook AI Research, USA ETH Zurich, Switzerland University of Bristol, UK University of Pennsylvania, USA University of California, Berkeley, USA Imperial College London, UK Weizmann Institute of Science, Israel Istituto Italiano di Tecnologia, Italy Beijing University of Posts and Telecommunications, China Ryerson University, Canada DeepMind, UK


Matthijs Douze Mohamed Elhoseiny Sergio Escalera Yi Fang Ryan Farrell Alireza Fathi Christoph Feichtenhofer Basura Fernando Vittorio Ferrari Andrew W. Fitzgibbon David J. Fleet David Forsyth David Fouhey Katerina Fragkiadaki Friedrich Fraundorfer Oren Freifeld Thomas Funkhouser Yasutaka Furukawa Fabio Galasso Jürgen Gall Chuang Gan Zhe Gan Animesh Garg Efstratios Gavves Peter Gehler Theo Gevers Bernard Ghanem Ross B. Girshick Georgia Gkioxari Albert Gordo Stephen Gould Venu Madhav Govindu Kristen Grauman Abhinav Gupta Mohit Gupta Hu Han

Facebook AI Research, USA King Abdullah University of Science and Technology, Saudi Arabia University of Barcelona, Spain New York University, USA Brigham Young University, USA Google, USA Facebook AI Research, USA Agency for Science, Technology and Research (A*STAR), Singapore Google Research, Switzerland Graphcore, UK University of Toronto, Canada University of Illinois at Urbana-Champaign, USA University of Michigan, USA Carnegie Mellon University, USA Graz University of Technology, Austria Ben-Gurion University, Israel Google Research & Princeton University, USA Simon Fraser University, Canada Sapienza University of Rome, Italy University of Bonn, Germany Massachusetts Institute of Technology, USA Microsoft, USA University of Toronto, Vector Institute, Nvidia, Canada University of Amsterdam, The Netherlands Amazon, Germany University of Amsterdam, The Netherlands King Abdullah University of Science and Technology, Saudi Arabia Facebook AI Research, USA Facebook AI Research, USA Facebook, USA Australian National University, Australia Indian Institute of Science, India Facebook AI Research & UT Austin, USA Carnegie Mellon University & Facebook AI Research, USA University of Wisconsin-Madison, USA Institute of Computing Technology, Chinese Academy of Sciences, China


Bohyung Han Tian Han Emily Hand Bharath Hariharan Ran He Otmar Hilliges Adrian Hilton Minh Hoai Yedid Hoshen Timothy Hospedales Gang Hua Di Huang Jing Huang Jia-Bin Huang Nathan Jacobs C. V. Jawahar Herve Jegou Neel Joshi Armand Joulin Frederic Jurie Fredrik Kahl Yannis Kalantidis Evangelos Kalogerakis Sing Bing Kang Yosi Keller Margret Keuper Tae-Kyun Kim Benjamin Kimia Alexander Kirillov Kris Kitani Iasonas Kokkinos Vladlen Koltun Nikos Komodakis Piotr Koniusz Philipp Kraehenbuehl Dilip Krishnan Ajay Kumar Junseok Kwon Jean-Francois Lalonde


Seoul National University, Korea Stevens Institute of Technology, USA University of Nevada, Reno, USA Cornell University, USA Institute of Automation, Chinese Academy of Sciences, China ETH Zurich, Switzerland University of Surrey, UK Stony Brook University, USA Hebrew University of Jerusalem, Israel University of Edinburgh, UK Wormpex AI Research, USA Beihang University, China Facebook, USA Facebook, USA Washington University in St. Louis, USA International Institute of Information Technology, Hyderabad, India Facebook AI Research, France Microsoft Research, USA Facebook AI Research, France University of Caen Normandie, France Chalmers University of Technology, Sweden NAVER LABS Europe, France University of Massachusetts, Amherst, USA Zillow Group, USA Bar Ilan University, Israel University of Mannheim, Germany Imperial College London, UK Brown University, USA Facebook AI Research, USA Carnegie Mellon University, USA Snap Inc. & University College London, UK Apple, USA University of Crete, Crete Australian National University, Australia University of Texas at Austin, USA Google, USA Hong Kong Polytechnic University, Hong Kong, China Chung-Ang University, Korea Université Laval, Canada


Ivan Laptev Laura Leal-Taixé Erik Learned-Miller Gim Hee Lee Seungyong Lee Zhen Lei Bastian Leibe Hongdong Li Fuxin Li Bo Li Yin Li Ser-Nam Lim Joseph Lim Stephen Lin Dahua Lin Si Liu Xiaoming Liu Ce Liu Zicheng Liu Yanxi Liu Feng Liu Yebin Liu Chen Change Loy Huchuan Lu Cewu Lu Oisin Mac Aodha Dhruv Mahajan Subhransu Maji Atsuto Maki Arun Mallya R. Manmatha Iacopo Masi Dimitris N. Metaxas Ajmal Mian Christian Micheloni Krystian Mikolajczyk Anurag Mittal Philippos Mordohai Greg Mori

Inria Paris, France Technical University of Munich, Germany University of Massachusetts, Amherst, USA National University of Singapore, Singapore Pohang University of Science and Technology, Korea Institute of Automation, Chinese Academy of Sciences, China RWTH Aachen University, Germany Australian National University, Australia Oregon State University, USA University of Illinois at Urbana-Champaign, USA University of Wisconsin-Madison, USA Meta AI Research, USA University of Southern California, USA Microsoft Research Asia, China The Chinese University of Hong Kong, Hong Kong, China Beihang University, China Michigan State University, USA Microsoft, USA Microsoft, USA Pennsylvania State University, USA Portland State University, USA Tsinghua University, China Nanyang Technological University, Singapore Dalian University of Technology, China Shanghai Jiao Tong University, China University of Edinburgh, UK Facebook, USA University of Massachusetts, Amherst, USA KTH Royal Institute of Technology, Sweden NVIDIA, USA Amazon, USA Sapienza University of Rome, Italy Rutgers University, USA University of Western Australia, Australia University of Udine, Italy Imperial College London, UK Indian Institute of Technology, Madras, India Stevens Institute of Technology, USA Simon Fraser University & Borealis AI, Canada


Vittorio Murino P. J. Narayanan Ram Nevatia Natalia Neverova Richard Newcombe Cuong V. Nguyen Bingbing Ni Juan Carlos Niebles Ko Nishino Jean-Marc Odobez Francesca Odone Takayuki Okatani Manohar Paluri Guan Pang Maja Pantic Sylvain Paris Jaesik Park Hyun Soo Park Omkar M. Parkhi Deepak Pathak Georgios Pavlakos Marcello Pelillo Marc Pollefeys Jean Ponce Gerard Pons-Moll Fatih Porikli Victor Adrian Prisacariu Petia Radeva Ravi Ramamoorthi Deva Ramanan Vignesh Ramanathan Nalini Ratha Tammy Riklin Raviv Tobias Ritschel Emanuele Rodola Amit K. Roy-Chowdhury Michael Rubinstein Olga Russakovsky


Istituto Italiano di Tecnologia, Italy International Institute of Information Technology, Hyderabad, India University of Southern California, USA Facebook AI Research, UK Facebook, USA Florida International University, USA Shanghai Jiao Tong University, China Salesforce & Stanford University, USA Kyoto University, Japan Idiap Research Institute, École polytechnique fédérale de Lausanne, Switzerland University of Genova, Italy Tohoku University & RIKEN Center for Advanced Intelligence Project, Japan Facebook, USA Facebook, USA Imperial College London, UK Adobe Research, USA Pohang University of Science and Technology, Korea The University of Minnesota, USA Facebook, USA Carnegie Mellon University, USA University of California, Berkeley, USA University of Venice, Italy ETH Zurich & Microsoft, Switzerland Inria, France University of Tübingen, Germany Qualcomm, USA University of Oxford, UK University of Barcelona, Spain University of California, San Diego, USA Carnegie Mellon University, USA Facebook, USA State University of New York at Buffalo, USA Ben-Gurion University, Israel University College London, UK Sapienza University of Rome, Italy University of California, Riverside, USA Google, USA Princeton University, USA


Mathieu Salzmann Dimitris Samaras Aswin Sankaranarayanan Imari Sato Yoichi Sato Shin’ichi Satoh Walter Scheirer Bernt Schiele Konrad Schindler Cordelia Schmid Alexander Schwing Nicu Sebe Greg Shakhnarovich Eli Shechtman Humphrey Shi

Jianbo Shi Roy Shilkrot Mike Zheng Shou Kaleem Siddiqi Richa Singh Greg Slabaugh Cees Snoek Yale Song Yi-Zhe Song Bjorn Stenger Abby Stylianou Akihiro Sugimoto Chen Sun Deqing Sun Kalyan Sunkavalli Ying Tai Ayellet Tal Ping Tan Siyu Tang Chi-Keung Tang Radu Timofte Federico Tombari

École polytechnique fédérale de Lausanne, Switzerland Stony Brook University, USA Carnegie Mellon University, USA National Institute of Informatics, Japan University of Tokyo, Japan National Institute of Informatics, Japan University of Notre Dame, USA Max Planck Institute for Informatics, Germany ETH Zurich, Switzerland Inria & Google, France University of Illinois at Urbana-Champaign, USA University of Trento, Italy Toyota Technological Institute at Chicago, USA Adobe Research, USA University of Oregon & University of Illinois at Urbana-Champaign & Picsart AI Research, USA University of Pennsylvania, USA Massachusetts Institute of Technology, USA National University of Singapore, Singapore McGill University, Canada Indian Institute of Technology Jodhpur, India Queen Mary University of London, UK University of Amsterdam, The Netherlands Facebook AI Research, USA University of Surrey, UK Rakuten Institute of Technology Saint Louis University, USA National Institute of Informatics, Japan Brown University, USA Google, USA Adobe Research, USA Tencent YouTu Lab, China Technion – Israel Institute of Technology, Israel Simon Fraser University, Canada ETH Zurich, Switzerland Hong Kong University of Science and Technology, Hong Kong, China University of Würzburg, Germany & ETH Zurich, Switzerland Google, Switzerland & Technical University of Munich, Germany


James Tompkin Lorenzo Torresani Alexander Toshev Du Tran Anh T. Tran Zhuowen Tu Georgios Tzimiropoulos Jasper Uijlings Jan C. van Gemert Gul Varol Nuno Vasconcelos Mayank Vatsa Ashok Veeraraghavan Jakob Verbeek Carl Vondrick Ruiping Wang Xinchao Wang Liwei Wang Chaohui Wang Xiaolong Wang Christian Wolf Tao Xiang Saining Xie Cihang Xie Zeki Yalniz Ming-Hsuan Yang Angela Yao Shaodi You Stella X. Yu Junsong Yuan Stefanos Zafeiriou Amir Zamir Lei Zhang Lei Zhang Pengchuan Zhang Bolei Zhou Yuke Zhu


Brown University, USA Dartmouth College, USA Apple, USA Facebook AI Research, USA VinAI, Vietnam University of California, San Diego, USA Queen Mary University of London, UK Google Research, Switzerland Delft University of Technology, The Netherlands Ecole des Ponts ParisTech, France University of California, San Diego, USA Indian Institute of Technology Jodhpur, India Rice University, USA Facebook AI Research, France Columbia University, USA Institute of Computing Technology, Chinese Academy of Sciences, China National University of Singapore, Singapore The Chinese University of Hong Kong, Hong Kong, China Université Paris-Est, France University of California, San Diego, USA NAVER LABS Europe, France University of Surrey, UK Facebook AI Research, USA University of California, Santa Cruz, USA Facebook, USA University of California, Merced, USA National University of Singapore, Singapore University of Amsterdam, The Netherlands University of California, Berkeley, USA State University of New York at Buffalo, USA Imperial College London, UK École polytechnique fédérale de Lausanne, Switzerland Alibaba & Hong Kong Polytechnic University, Hong Kong, China International Digital Economy Academy (IDEA), China Meta AI, USA University of California, Los Angeles, USA University of Texas at Austin, USA


Todd Zickler Wangmeng Zuo

Harvard University, USA Harbin Institute of Technology, China

Technical Program Committee

Davide Abati Soroush Abbasi Koohpayegani Amos L. Abbott Rameen Abdal Rabab Abdelfattah Sahar Abdelnabi Hassan Abu Alhaija Abulikemu Abuduweili Ron Abutbul Hanno Ackermann Aikaterini Adam Kamil Adamczewski Ehsan Adeli Vida Adeli Donald Adjeroh Arman Afrasiyabi Akshay Agarwal Sameer Agarwal Abhinav Agarwalla Vaibhav Aggarwal Sara Aghajanzadeh Susmit Agrawal Antonio Agudo Touqeer Ahmad Sk Miraj Ahmed Chaitanya Ahuja Nilesh A. Ahuja Abhishek Aich Shubhra Aich Noam Aigerman Arash Akbarinia Peri Akiva Derya Akkaynak Emre Aksan Arjun R. Akula Yuval Alaluf Stephan Alaniz Paul Albert Cenek Albl

Filippo Aleotti Konstantinos P. Alexandridis Motasem Alfarra Mohsen Ali Thiemo Alldieck Hadi Alzayer Liang An Shan An Yi An Zhulin An Dongsheng An Jie An Xiang An Saket Anand Cosmin Ancuti Juan Andrade-Cetto Alexander Andreopoulos Bjoern Andres Jerone T. A. Andrews Shivangi Aneja Anelia Angelova Dragomir Anguelov Rushil Anirudh Oron Anschel Rao Muhammad Anwer Djamila Aouada Evlampios Apostolidis Srikar Appalaraju Nikita Araslanov Andre Araujo Eric Arazo Dawit Mureja Argaw Anurag Arnab Aditya Arora Chetan Arora Sunpreet S. Arora Alexey Artemov Muhammad Asad Kumar Ashutosh

Sinem Aslan Vishal Asnani Mahmoud Assran Amir Atapour-Abarghouei Nikos Athanasiou Ali Athar ShahRukh Athar Sara Atito Souhaib Attaiki Matan Atzmon Mathieu Aubry Nicolas Audebert Tristan T. Aumentado-Armstrong Melinos Averkiou Yannis Avrithis Stephane Ayache Mehmet Aygün Seyed Mehdi Ayyoubzadeh Hossein Azizpour George Azzopardi Mallikarjun B. R. Yunhao Ba Abhishek Badki Seung-Hwan Bae Seung-Hwan Baek Seungryul Baek Piyush Nitin Bagad Shai Bagon Gaetan Bahl Shikhar Bahl Sherwin Bahmani Haoran Bai Lei Bai Jiawang Bai Haoyue Bai Jinbin Bai Xiang Bai Xuyang Bai


Yang Bai Yuanchao Bai Ziqian Bai Sungyong Baik Kevin Bailly Max Bain Federico Baldassarre Wele Gedara Chaminda Bandara Biplab Banerjee Pratyay Banerjee Sandipan Banerjee Jihwan Bang Antyanta Bangunharcana Aayush Bansal Ankan Bansal Siddhant Bansal Wentao Bao Zhipeng Bao Amir Bar Manel Baradad Jurjo Lorenzo Baraldi Danny Barash Daniel Barath Connelly Barnes Ioan Andrei Bârsan Steven Basart Dina Bashkirova Chaim Baskin Peyman Bateni Anil Batra Sebastiano Battiato Ardhendu Behera Harkirat Behl Jens Behley Vasileios Belagiannis Boulbaba Ben Amor Emanuel Ben Baruch Abdessamad Ben Hamza Gil Ben-Artzi Assia Benbihi Fabian Benitez-Quiroz Guy Ben-Yosef Philipp Benz Alexander W. Bergman

Urs Bergmann Jesus Bermudez-Cameo Stefano Berretti Gedas Bertasius Zachary Bessinger Petra Bevandić Matthew Beveridge Lucas Beyer Yash Bhalgat Suvaansh Bhambri Samarth Bharadwaj Gaurav Bharaj Aparna Bharati Bharat Lal Bhatnagar Uttaran Bhattacharya Apratim Bhattacharyya Brojeshwar Bhowmick Ankan Kumar Bhunia Ayan Kumar Bhunia Qi Bi Sai Bi Michael Bi Mi Gui-Bin Bian Jia-Wang Bian Shaojun Bian Pia Bideau Mario Bijelic Hakan Bilen Guillaume-Alexandre Bilodeau Alexander Binder Tolga Birdal Vighnesh N. Birodkar Sandika Biswas Andreas Blattmann Janusz Bobulski Giuseppe Boccignone Vishnu Boddeti Navaneeth Bodla Moritz Böhle Aleksei Bokhovkin Sam Bond-Taylor Vivek Boominathan Shubhankar Borse Mark Boss

Andrea Bottino Adnane Boukhayma Fadi Boutros Nicolas C. Boutry Richard S. Bowen Ivaylo Boyadzhiev Aidan Boyd Yuri Boykov Aljaz Bozic Behzad Bozorgtabar Eric Brachmann Samarth Brahmbhatt Gustav Bredell Francois Bremond Joel Brogan Andrew Brown Thomas Brox Marcus A. Brubaker Robert-Jan Bruintjes Yuqi Bu Anders G. Buch Himanshu Buckchash Mateusz Buda Ignas Budvytis José M. Buenaposada Marcel C. Bühler Tu Bui Adrian Bulat Hannah Bull Evgeny Burnaev Andrei Bursuc Benjamin Busam Sergey N. Buzykanov Wonmin Byeon Fabian Caba Martin Cadik Guanyu Cai Minjie Cai Qing Cai Zhongang Cai Qi Cai Yancheng Cai Shen Cai Han Cai Jiarui Cai


Bowen Cai Mu Cai Qin Cai Ruojin Cai Weidong Cai Weiwei Cai Yi Cai Yujun Cai Zhiping Cai Akin Caliskan Lilian Calvet Baris Can Cam Necati Cihan Camgoz Tommaso Campari Dylan Campbell Ziang Cao Ang Cao Xu Cao Zhiwen Cao Shengcao Cao Song Cao Weipeng Cao Xiangyong Cao Xiaochun Cao Yue Cao Yunhao Cao Zhangjie Cao Jiale Cao Yang Cao Jiajiong Cao Jie Cao Jinkun Cao Lele Cao Yulong Cao Zhiguo Cao Chen Cao Razvan Caramalau Marlène Careil Gustavo Carneiro Joao Carreira Dan Casas Paola Cascante-Bonilla Angela Castillo Francisco M. Castro Pedro Castro

Luca Cavalli George J. Cazenavette Oya Celiktutan Hakan Cevikalp Sri Harsha C. H. Sungmin Cha Geonho Cha Menglei Chai Lucy Chai Yuning Chai Zenghao Chai Anirban Chakraborty Deep Chakraborty Rudrasis Chakraborty Souradeep Chakraborty Kelvin C. K. Chan Chee Seng Chan Paramanand Chandramouli Arjun Chandrasekaran Kenneth Chaney Dongliang Chang Huiwen Chang Peng Chang Xiaojun Chang Jia-Ren Chang Hyung Jin Chang Hyun Sung Chang Ju Yong Chang Li-Jen Chang Qi Chang Wei-Yi Chang Yi Chang Nadine Chang Hanqing Chao Pradyumna Chari Dibyadip Chatterjee Chiranjoy Chattopadhyay Siddhartha Chaudhuri Zhengping Che Gal Chechik Lianggangxu Chen Qi Alfred Chen Brian Chen Bor-Chun Chen Bo-Hao Chen

Bohong Chen Bin Chen Ziliang Chen Cheng Chen Chen Chen Chaofeng Chen Xi Chen Haoyu Chen Xuanhong Chen Wei Chen Qiang Chen Shi Chen Xianyu Chen Chang Chen Changhuai Chen Hao Chen Jie Chen Jianbo Chen Jingjing Chen Jun Chen Kejiang Chen Mingcai Chen Nenglun Chen Qifeng Chen Ruoyu Chen Shu-Yu Chen Weidong Chen Weijie Chen Weikai Chen Xiang Chen Xiuyi Chen Xingyu Chen Yaofo Chen Yueting Chen Yu Chen Yunjin Chen Yuntao Chen Yun Chen Zhenfang Chen Zhuangzhuang Chen Chu-Song Chen Xiangyu Chen Zhuo Chen Chaoqi Chen Shizhe Chen


Xiaotong Chen Xiaozhi Chen Dian Chen Defang Chen Dingfan Chen Ding-Jie Chen Ee Heng Chen Tao Chen Yixin Chen Wei-Ting Chen Lin Chen Guang Chen Guangyi Chen Guanying Chen Guangyao Chen Hwann-Tzong Chen Junwen Chen Jiacheng Chen Jianxu Chen Hui Chen Kai Chen Kan Chen Kevin Chen Kuan-Wen Chen Weihua Chen Zhang Chen Liang-Chieh Chen Lele Chen Liang Chen Fanglin Chen Zehui Chen Minghui Chen Minghao Chen Xiaokang Chen Qian Chen Jun-Cheng Chen Qi Chen Qingcai Chen Richard J. Chen Runnan Chen Rui Chen Shuo Chen Sentao Chen Shaoyu Chen Shixing Chen

Shuai Chen Shuya Chen Sizhe Chen Simin Chen Shaoxiang Chen Zitian Chen Tianlong Chen Tianshui Chen Min-Hung Chen Xiangning Chen Xin Chen Xinghao Chen Xuejin Chen Xu Chen Xuxi Chen Yunlu Chen Yanbei Chen Yuxiao Chen Yun-Chun Chen Yi-Ting Chen Yi-Wen Chen Yinbo Chen Yiran Chen Yuanhong Chen Yubei Chen Yuefeng Chen Yuhua Chen Yukang Chen Zerui Chen Zhaoyu Chen Zhen Chen Zhenyu Chen Zhi Chen Zhiwei Chen Zhixiang Chen Long Chen Bowen Cheng Jun Cheng Yi Cheng Jingchun Cheng Lechao Cheng Xi Cheng Yuan Cheng Ho Kei Cheng Kevin Ho Man Cheng

Jiacheng Cheng Kelvin B. Cheng Li Cheng Mengjun Cheng Zhen Cheng Qingrong Cheng Tianheng Cheng Harry Cheng Yihua Cheng Yu Cheng Ziheng Cheng Soon Yau Cheong Anoop Cherian Manuela Chessa Zhixiang Chi Naoki Chiba Julian Chibane Kashyap Chitta Tai-Yin Chiu Hsu-kuang Chiu Wei-Chen Chiu Sungmin Cho Donghyeon Cho Hyeon Cho Yooshin Cho Gyusang Cho Jang Hyun Cho Seungju Cho Nam Ik Cho Sunghyun Cho Hanbyel Cho Jaesung Choe Jooyoung Choi Chiho Choi Changwoon Choi Jongwon Choi Myungsub Choi Dooseop Choi Jonghyun Choi Jinwoo Choi Jun Won Choi Min-Kook Choi Hongsuk Choi Janghoon Choi Yoon-Ho Choi


Yukyung Choi Jaegul Choo Ayush Chopra Siddharth Choudhary Subhabrata Choudhury Vasileios Choutas Ka-Ho Chow Pinaki Nath Chowdhury Sammy Christen Anders Christensen Grigorios Chrysos Hang Chu Wen-Hsuan Chu Peng Chu Qi Chu Ruihang Chu Wei-Ta Chu Yung-Yu Chuang Sanghyuk Chun Se Young Chun Antonio Cinà Ramazan Gokberk Cinbis Javier Civera Albert Clapés Ronald Clark Brian S. Clipp Felipe Codevilla Daniel Coelho de Castro Niv Cohen Forrester Cole Maxwell D. Collins Robert T. Collins Marc Comino Trinidad Runmin Cong Wenyan Cong Maxime Cordy Marcella Cornia Enric Corona Huseyin Coskun Luca Cosmo Dragos Costea Davide Cozzolino Arun C. S. Kumar Aiyu Cui Qiongjie Cui

Quan Cui Shuhao Cui Yiming Cui Ying Cui Zijun Cui Jiali Cui Jiequan Cui Yawen Cui Zhen Cui Zhaopeng Cui Jack Culpepper Xiaodong Cun Ross Cutler Adam Czajka Ali Dabouei Konstantinos M. Dafnis Manuel Dahnert Tao Dai Yuchao Dai Bo Dai Mengyu Dai Hang Dai Haixing Dai Peng Dai Pingyang Dai Qi Dai Qiyu Dai Yutong Dai Naser Damer Zhiyuan Dang Mohamed Daoudi Ayan Das Abir Das Debasmit Das Deepayan Das Partha Das Sagnik Das Soumi Das Srijan Das Swagatam Das Avijit Dasgupta Jim Davis Adrian K. Davison Homa Davoudi Laura Daza

Matthias De Lange Shalini De Mello Marco De Nadai Christophe De Vleeschouwer Alp Dener Boyang Deng Congyue Deng Bailin Deng Yong Deng Ye Deng Zhuo Deng Zhijie Deng Xiaoming Deng Jiankang Deng Jinhong Deng Jingjing Deng Liang-Jian Deng Siqi Deng Xiang Deng Xueqing Deng Zhongying Deng Karan Desai Jean-Emmanuel Deschaud Aniket Anand Deshmukh Neel Dey Helisa Dhamo Prithviraj Dhar Amaya Dharmasiri Yan Di Xing Di Ousmane A. Dia Haiwen Diao Xiaolei Diao Gonçalo José Dias Pais Abdallah Dib Anastasios Dimou Changxing Ding Henghui Ding Guodong Ding Yaqing Ding Shuangrui Ding Yuhang Ding Yikang Ding Shouhong Ding


Haisong Ding Hui Ding Jiahao Ding Jian Ding Jian-Jiun Ding Shuxiao Ding Tianyu Ding Wenhao Ding Yuqi Ding Yi Ding Yuzhen Ding Zhengming Ding Tan Minh Dinh Vu Dinh Christos Diou Mandar Dixit Bao Gia Doan Khoa D. Doan Dzung Anh Doan Debi Prosad Dogra Nehal Doiphode Chengdong Dong Bowen Dong Zhenxing Dong Hang Dong Xiaoyi Dong Haoye Dong Jiangxin Dong Shichao Dong Xuan Dong Zhen Dong Shuting Dong Jing Dong Li Dong Ming Dong Nanqing Dong Qiulei Dong Runpei Dong Siyan Dong Tian Dong Wei Dong Xiaomeng Dong Xin Dong Xingbo Dong Yuan Dong

Samuel Dooley Gianfranco Doretto Michael Dorkenwald Keval Doshi Zhaopeng Dou Xiaotian Dou Hazel Doughty Ahmad Droby Iddo Drori Jie Du Yong Du Dawei Du Dong Du Ruoyi Du Yuntao Du Xuefeng Du Yilun Du Yuming Du Radhika Dua Haodong Duan Jiafei Duan Kaiwen Duan Peiqi Duan Ye Duan Haoran Duan Jiali Duan Amanda Duarte Abhimanyu Dubey Shiv Ram Dubey Florian Dubost Lukasz Dudziak Shivam Duggal Justin M. Dulay Matteo Dunnhofer Chi Nhan Duong Thibaut Durand Mihai Dusmanu Ujjal Kr Dutta Debidatta Dwibedi Isht Dwivedi Sai Kumar Dwivedi Takeharu Eda Mark Edmonds Alexei A. Efros Thibaud Ehret

Max Ehrlich Mahsa Ehsanpour Iván Eichhardt Farshad Einabadi Marvin Eisenberger Hazim Kemal Ekenel Mohamed El Banani Ismail Elezi Moshe Eliasof Alaa El-Nouby Ian Endres Francis Engelmann Deniz Engin Chanho Eom Dave Epstein Maria C. Escobar Victor A. Escorcia Carlos Esteves Sungmin Eum Bernard J. E. Evans Ivan Evtimov Fevziye Irem Eyiokur Yaman Matteo Fabbri Sébastien Fabbro Gabriele Facciolo Masud Fahim Bin Fan Hehe Fan Deng-Ping Fan Aoxiang Fan Chen-Chen Fan Qi Fan Zhaoxin Fan Haoqi Fan Heng Fan Hongyi Fan Linxi Fan Baojie Fan Jiayuan Fan Lei Fan Quanfu Fan Yonghui Fan Yingruo Fan Zhiwen Fan


Zicong Fan Sean Fanello Jiansheng Fang Chaowei Fang Yuming Fang Jianwu Fang Jin Fang Qi Fang Shancheng Fang Tian Fang Xianyong Fang Gongfan Fang Zhen Fang Hui Fang Jiemin Fang Le Fang Pengfei Fang Xiaolin Fang Yuxin Fang Zhaoyuan Fang Ammarah Farooq Azade Farshad Zhengcong Fei Michael Felsberg Wei Feng Chen Feng Fan Feng Andrew Feng Xin Feng Zheyun Feng Ruicheng Feng Mingtao Feng Qianyu Feng Shangbin Feng Chun-Mei Feng Zunlei Feng Zhiyong Feng Martin Fergie Mustansar Fiaz Marco Fiorucci Michael Firman Hamed Firooz Volker Fischer Corneliu O. Florea Georgios Floros

Wolfgang Foerstner Gianni Franchi Jean-Sebastien Franco Simone Frintrop Anna Fruehstueck Changhong Fu Chaoyou Fu Cheng-Yang Fu Chi-Wing Fu Deqing Fu Huan Fu Jun Fu Kexue Fu Ying Fu Jianlong Fu Jingjing Fu Qichen Fu Tsu-Jui Fu Xueyang Fu Yang Fu Yanwei Fu Yonggan Fu Wolfgang Fuhl Yasuhisa Fujii Kent Fujiwara Marco Fumero Takuya Funatomi Isabel Funke Dario Fuoli Antonino Furnari Matheus A. Gadelha Akshay Gadi Patil Adrian Galdran Guillermo Gallego Silvano Galliani Orazio Gallo Leonardo Galteri Matteo Gamba Yiming Gan Sujoy Ganguly Harald Ganster Boyan Gao Changxin Gao Daiheng Gao Difei Gao

Chen Gao Fei Gao Lin Gao Wei Gao Yiming Gao Junyu Gao Guangyu Ryan Gao Haichang Gao Hongchang Gao Jialin Gao Jin Gao Jun Gao Katelyn Gao Mingchen Gao Mingfei Gao Pan Gao Shangqian Gao Shanghua Gao Xitong Gao Yunhe Gao Zhanning Gao Elena Garces Nuno Cruz Garcia Noa Garcia Guillermo Garcia-Hernando Isha Garg Rahul Garg Sourav Garg Quentin Garrido Stefano Gasperini Kent Gauen Chandan Gautam Shivam Gautam Paul Gay Chunjiang Ge Shiming Ge Wenhang Ge Yanhao Ge Zheng Ge Songwei Ge Weifeng Ge Yixiao Ge Yuying Ge Shijie Geng

Zhengyang Geng Kyle A. Genova Georgios Georgakis Markos Georgopoulos Marcel Geppert Shabnam Ghadar Mina Ghadimi Atigh Deepti Ghadiyaram Maani Ghaffari Jadidi Sedigh Ghamari Zahra Gharaee Michaël Gharbi Golnaz Ghiasi Reza Ghoddoosian Soumya Suvra Ghosal Adhiraj Ghosh Arthita Ghosh Pallabi Ghosh Soumyadeep Ghosh Andrew Gilbert Igor Gilitschenski Jhony H. Giraldo Andreu Girbau Xalabarder Rohit Girdhar Sharath Girish Xavier Giro-i-Nieto Raja Giryes Thomas Gittings Nikolaos Gkanatsios Ioannis Gkioulekas Abhiram Gnanasambandam Aurele T. Gnanha Clement L. J. C. Godard Arushi Goel Vidit Goel Shubham Goel Zan Gojcic Aaron K. Gokaslan Tejas Gokhale S. Alireza Golestaneh Thiago L. Gomes Nuno Goncalves Boqing Gong Chen Gong

Yuanhao Gong Guoqiang Gong Jingyu Gong Rui Gong Yu Gong Mingming Gong Neil Zhenqiang Gong Xun Gong Yunye Gong Yihong Gong Cristina I. González Nithin Gopalakrishnan Nair Gaurav Goswami Jianping Gou Shreyank N. Gowda Ankit Goyal Helmut Grabner Patrick L. Grady Ben Graham Eric Granger Douglas R. Gray Matej Grci´c David Griffiths Jinjin Gu Yun Gu Shuyang Gu Jianyang Gu Fuqiang Gu Jiatao Gu Jindong Gu Jiaqi Gu Jinwei Gu Jiaxin Gu Geonmo Gu Xiao Gu Xinqian Gu Xiuye Gu Yuming Gu Zhangxuan Gu Dayan Guan Junfeng Guan Qingji Guan Tianrui Guan Shanyan Guan

Denis A. Gudovskiy Ricardo Guerrero Pierre-Louis Guhur Jie Gui Liangyan Gui Liangke Gui Benoit Guillard Erhan Gundogdu Manuel Günther Jingcai Guo Yuanfang Guo Junfeng Guo Chenqi Guo Dan Guo Hongji Guo Jia Guo Jie Guo Minghao Guo Shi Guo Yanhui Guo Yangyang Guo Yuan-Chen Guo Yilu Guo Yiluan Guo Yong Guo Guangyu Guo Haiyun Guo Jinyang Guo Jianyuan Guo Pengsheng Guo Pengfei Guo Shuxuan Guo Song Guo Tianyu Guo Qing Guo Qiushan Guo Wen Guo Xiefan Guo Xiaohu Guo Xiaoqing Guo Yufei Guo Yuhui Guo Yuliang Guo Yunhui Guo Yanwen Guo

Akshita Gupta Ankush Gupta Kamal Gupta Kartik Gupta Ritwik Gupta Rohit Gupta Siddharth Gururani Fredrik K. Gustafsson Abner Guzman Rivera Vladimir Guzov Matthew A. Gwilliam Jung-Woo Ha Marc Habermann Isma Hadji Christian Haene Martin Hahner Levente Hajder Alexandros Haliassos Emanuela Haller Bumsub Ham Abdullah J. Hamdi Shreyas Hampali Dongyoon Han Chunrui Han Dong-Jun Han Dong-Sig Han Guangxing Han Zhizhong Han Ruize Han Jiaming Han Jin Han Ligong Han Xian-Hua Han Xiaoguang Han Yizeng Han Zhi Han Zhenjun Han Zhongyi Han Jungong Han Junlin Han Kai Han Kun Han Sungwon Han Songfang Han Wei Han

Xiao Han Xintong Han Xinzhe Han Yahong Han Yan Han Zongbo Han Nicolai Hani Rana Hanocka Niklas Hanselmann Nicklas A. Hansen Hong Hanyu Fusheng Hao Yanbin Hao Shijie Hao Udith Haputhanthri Mehrtash Harandi Josh Harguess Adam Harley David M. Hart Atsushi Hashimoto Ali Hassani Mohammed Hassanin Yana Hasson Joakim Bruslund Haurum Bo He Kun He Chen He Xin He Fazhi He Gaoqi He Hao He Haoyu He Jiangpeng He Hongliang He Qian He Xiangteng He Xuming He Yannan He Yuhang He Yang He Xiangyu He Nanjun He Pan He Sen He Shengfeng He

Songtao He Tao He Tong He Wei He Xuehai He Xiaoxiao He Ying He Yisheng He Ziwen He Peter Hedman Felix Heide Yacov Hel-Or Paul Henderson Philipp Henzler Byeongho Heo Jae-Pil Heo Miran Heo Sachini A. Herath Stephane Herbin Pedro Hermosilla Casajus Monica Hernandez Charles Herrmann Roei Herzig Mauricio Hess-Flores Carlos Hinojosa Tobias Hinz Tsubasa Hirakawa Chih-Hui Ho Lam Si Tung Ho Jennifer Hobbs Derek Hoiem Yannick Hold-Geoffroy Aleksander Holynski Cheeun Hong Fa-Ting Hong Hanbin Hong Guan Zhe Hong Danfeng Hong Lanqing Hong Xiaopeng Hong Xin Hong Jie Hong Seungbum Hong Cheng-Yao Hong Seunghoon Hong

Yi Hong Yuan Hong Yuchen Hong Anthony Hoogs Maxwell C. Horton Kazuhiro Hotta Qibin Hou Tingbo Hou Junhui Hou Ji Hou Qiqi Hou Rui Hou Ruibing Hou Zhi Hou Henry Howard-Jenkins Lukas Hoyer Wei-Lin Hsiao Chiou-Ting Hsu Anthony Hu Brian Hu Yusong Hu Hexiang Hu Haoji Hu Di Hu Hengtong Hu Haigen Hu Lianyu Hu Hanzhe Hu Jie Hu Junlin Hu Shizhe Hu Jian Hu Zhiming Hu Juhua Hu Peng Hu Ping Hu Ronghang Hu MengShun Hu Tao Hu Vincent Tao Hu Xiaoling Hu Xinting Hu Xiaolin Hu Xuefeng Hu Xiaowei Hu

Yang Hu Yueyu Hu Zeyu Hu Zhongyun Hu Binh-Son Hua Guoliang Hua Yi Hua Linzhi Huang Qiusheng Huang Bo Huang Chen Huang Hsin-Ping Huang Ye Huang Shuangping Huang Zeng Huang Buzhen Huang Cong Huang Heng Huang Hao Huang Qidong Huang Huaibo Huang Chaoqin Huang Feihu Huang Jiahui Huang Jingjia Huang Kun Huang Lei Huang Sheng Huang Shuaiyi Huang Siyu Huang Xiaoshui Huang Xiaoyang Huang Yan Huang Yihao Huang Ying Huang Ziling Huang Xiaoke Huang Yifei Huang Haiyang Huang Zhewei Huang Jin Huang Haibin Huang Jiaxing Huang Junjie Huang Keli Huang

Lang Huang Lin Huang Luojie Huang Mingzhen Huang Shijia Huang Shengyu Huang Siyuan Huang He Huang Xiuyu Huang Lianghua Huang Yue Huang Yaping Huang Yuge Huang Zehao Huang Zeyi Huang Zhiqi Huang Zhongzhan Huang Zilong Huang Ziyuan Huang Tianrui Hui Zhuo Hui Le Hui Jing Huo Junhwa Hur Shehzeen S. Hussain Chuong Minh Huynh Seunghyun Hwang Jaehui Hwang Jyh-Jing Hwang Sukjun Hwang Soonmin Hwang Wonjun Hwang Rakib Hyder Sangeek Hyun Sarah Ibrahimi Tomoki Ichikawa Yerlan Idelbayev A. S. M. Iftekhar Masaaki Iiyama Satoshi Ikehata Sunghoon Im Atul N. Ingle Eldar Insafutdinov Yani A. Ioannou Radu Tudor Ionescu

Umar Iqbal Go Irie Muhammad Zubair Irshad Ahmet Iscen Berivan Isik Ashraful Islam Md Amirul Islam Syed Islam Mariko Isogawa Vamsi Krishna K. Ithapu Boris Ivanovic Darshan Iyer Sarah Jabbour Ayush Jain Nishant Jain Samyak Jain Vidit Jain Vineet Jain Priyank Jaini Tomas Jakab Mohammad A. A. K. Jalwana Muhammad Abdullah Jamal Hadi Jamali-Rad Stuart James Varun Jampani Young Kyun Jang YeongJun Jang Yunseok Jang Ronnachai Jaroensri Bhavan Jasani Krishna Murthy Jatavallabhula Mojan Javaheripi Syed A. Javed Guillaume Jeanneret Pranav Jeevan Herve Jegou Rohit Jena Tomas Jenicek Porter Jenkins Simon Jenni Hae-Gon Jeon Sangryul Jeon

Boseung Jeong Yoonwoo Jeong Seong-Gyun Jeong Jisoo Jeong Allan D. Jepson Ankit Jha Sumit K. Jha I-Hong Jhuo Ge-Peng Ji Chaonan Ji Deyi Ji Jingwei Ji Wei Ji Zhong Ji Jiayi Ji Pengliang Ji Hui Ji Mingi Ji Xiaopeng Ji Yuzhu Ji Baoxiong Jia Songhao Jia Dan Jia Shan Jia Xiaojun Jia Xiuyi Jia Xu Jia Menglin Jia Wenqi Jia Boyuan Jiang Wenhao Jiang Huaizu Jiang Hanwen Jiang Haiyong Jiang Hao Jiang Huajie Jiang Huiqin Jiang Haojun Jiang Haobo Jiang Junjun Jiang Xingyu Jiang Yangbangyan Jiang Yu Jiang Jianmin Jiang Jiaxi Jiang

Jing Jiang Kui Jiang Li Jiang Liming Jiang Chiyu Jiang Meirui Jiang Chen Jiang Peng Jiang Tai-Xiang Jiang Wen Jiang Xinyang Jiang Yifan Jiang Yuming Jiang Yingying Jiang Zeren Jiang ZhengKai Jiang Zhenyu Jiang Shuming Jiao Jianbo Jiao Licheng Jiao Dongkwon Jin Yeying Jin Cheng Jin Linyi Jin Qing Jin Taisong Jin Xiao Jin Xin Jin Sheng Jin Kyong Hwan Jin Ruibing Jin SouYoung Jin Yueming Jin Chenchen Jing Longlong Jing Taotao Jing Yongcheng Jing Younghyun Jo Joakim Johnander Jeff Johnson Michael J. Jones R. Kenny Jones Rico Jonschkowski Ameya Joshi Sunghun Joung

Felix Juefei-Xu Claudio R. Jung Steffen Jung Hari Chandana K. Rahul Vigneswaran K. Prajwal K. R. Abhishek Kadian Jhony Kaesemodel Pontes Kumara Kahatapitiya Anmol Kalia Sinan Kalkan Tarun Kalluri Jaewon Kam Sandesh Kamath Meina Kan Menelaos Kanakis Takuhiro Kaneko Di Kang Guoliang Kang Hao Kang Jaeyeon Kang Kyoungkook Kang Li-Wei Kang MinGuk Kang Suk-Ju Kang Zhao Kang Yash Mukund Kant Yueying Kao Aupendu Kar Konstantinos Karantzalos Sezer Karaoglu Navid Kardan Sanjay Kariyappa Leonid Karlinsky Animesh Karnewar Shyamgopal Karthik Hirak J. Kashyap Marc A. Kastner Hirokatsu Kataoka Angelos Katharopoulos Hiroharu Kato Kai Katsumata Manuel Kaufmann Chaitanya Kaul Prakhar Kaushik

Yuki Kawana Lei Ke Lipeng Ke Tsung-Wei Ke Wei Ke Petr Kellnhofer Aniruddha Kembhavi John Kender Corentin Kervadec Leonid Keselman Daniel Keysers Nima Khademi Kalantari Taras Khakhulin Samir Khaki Muhammad Haris Khan Qadeer Khan Salman Khan Subash Khanal Vaishnavi M. Khindkar Rawal Khirodkar Saeed Khorram Pirazh Khorramshahi Kourosh Khoshelham Ansh Khurana Benjamin Kiefer Jae Myung Kim Junho Kim Boah Kim Hyeonseong Kim Dong-Jin Kim Dongwan Kim Donghyun Kim Doyeon Kim Yonghyun Kim Hyung-Il Kim Hyunwoo Kim Hyeongwoo Kim Hyo Jin Kim Hyunwoo J. Kim Taehoon Kim Jaeha Kim Jiwon Kim Jung Uk Kim Kangyeol Kim Eunji Kim

Daeha Kim Dongwon Kim Kunhee Kim Kyungmin Kim Junsik Kim Min H. Kim Namil Kim Kookhoi Kim Sanghyun Kim Seongyeop Kim Seungryong Kim Saehoon Kim Euyoung Kim Guisik Kim Sungyeon Kim Sunnie S. Y. Kim Taehun Kim Tae Oh Kim Won Hwa Kim Seungwook Kim YoungBin Kim Youngeun Kim Akisato Kimura Furkan Osman Kınlı Zsolt Kira Hedvig Kjellström Florian Kleber Jan P. Klopp Florian Kluger Laurent Kneip Byungsoo Ko Muhammed Kocabas A. Sophia Koepke Kevin Koeser Nick Kolkin Nikos Kolotouros Wai-Kin Adams Kong Deying Kong Caihua Kong Youyong Kong Shuyu Kong Shu Kong Tao Kong Yajing Kong Yu Kong

Zishang Kong Theodora Kontogianni Anton S. Konushin Julian F. P. Kooij Bruno Korbar Giorgos Kordopatis-Zilos Jari Korhonen Adam Kortylewski Denis Korzhenkov Divya Kothandaraman Suraj Kothawade Iuliia Kotseruba Satwik Kottur Shashank Kotyan Alexandros Kouris Petros Koutras Anna Kreshuk Ranjay Krishna Dilip Krishnan Andrey Kuehlkamp Hilde Kuehne Jason Kuen David Kügler Arjan Kuijper Anna Kukleva Sumith Kulal Viveka Kulharia Akshay R. Kulkarni Nilesh Kulkarni Dominik Kulon Abhinav Kumar Akash Kumar Suryansh Kumar B. V. K. Vijaya Kumar Pulkit Kumar Ratnesh Kumar Sateesh Kumar Satish Kumar Vijay Kumar B. G. Nupur Kumari Sudhakar Kumawat Jogendra Nath Kundu Hsien-Kai Kuo Meng-Yu Jennifer Kuo Vinod Kumar Kurmi

Yusuke Kurose Keerthy Kusumam Alina Kuznetsova Henry Kvinge Ho Man Kwan Hyeokjun Kweon Heeseung Kwon Gihyun Kwon Myung-Joon Kwon Taesung Kwon YoungJoong Kwon Christos Kyrkou Jorma Laaksonen Yann Labbe Zorah Laehner Florent Lafarge Hamid Laga Manuel Lagunas Shenqi Lai Jian-Huang Lai Zihang Lai Mohamed I. Lakhal Mohit Lamba Meng Lan Loic Landrieu Zhiqiang Lang Natalie Lang Dong Lao Yizhen Lao Yingjie Lao Issam Hadj Laradji Gustav Larsson Viktor Larsson Zakaria Laskar Stéphane Lathuilière Chun Pong Lau Rynson W. H. Lau Hei Law Justin Lazarow Verica Lazova Eric-Tuan Le Hieu Le Trung-Nghia Le Mathias Lechner Byeong-Uk Lee

Chen-Yu Lee Che-Rung Lee Chul Lee Hong Joo Lee Dongsoo Lee Jiyoung Lee Eugene Eu Tzuan Lee Daeun Lee Saehyung Lee Jewook Lee Hyungtae Lee Hyunmin Lee Jungbeom Lee Joon-Young Lee Jong-Seok Lee Joonseok Lee Junha Lee Kibok Lee Byung-Kwan Lee Jangwon Lee Jinho Lee Jongmin Lee Seunghyun Lee Sohyun Lee Minsik Lee Dogyoon Lee Seungmin Lee Min Jun Lee Sangho Lee Sangmin Lee Seungeun Lee Seon-Ho Lee Sungmin Lee Sungho Lee Sangyoun Lee Vincent C. S. S. Lee Jaeseong Lee Yong Jae Lee Chenyang Lei Chenyi Lei Jiahui Lei Xinyu Lei Yinjie Lei Jiaxu Leng Luziwei Leng

Jan E. Lenssen Vincent Lepetit Thomas Leung María Leyva-Vallina Xin Li Yikang Li Baoxin Li Bin Li Bing Li Bowen Li Changlin Li Chao Li Chongyi Li Guanyue Li Shuai Li Jin Li Dingquan Li Dongxu Li Yiting Li Gang Li Dian Li Guohao Li Haoang Li Haoliang Li Haoran Li Hengduo Li Huafeng Li Xiaoming Li Hanao Li Hongwei Li Ziqiang Li Jisheng Li Jiacheng Li Jia Li Jiachen Li Jiahao Li Jianwei Li Jiazhi Li Jie Li Jing Li Jingjing Li Jingtao Li Jun Li Junxuan Li Kai Li

Kailin Li Kenneth Li Kun Li Kunpeng Li Aoxue Li Chenglong Li Chenglin Li Changsheng Li Zhichao Li Qiang Li Yanyu Li Zuoyue Li Xiang Li Xuelong Li Fangda Li Ailin Li Liang Li Chun-Guang Li Daiqing Li Dong Li Guanbin Li Guorong Li Haifeng Li Jianan Li Jianing Li Jiaxin Li Ke Li Lei Li Lincheng Li Liulei Li Lujun Li Linjie Li Lin Li Pengyu Li Ping Li Qiufu Li Qingyong Li Rui Li Siyuan Li Wei Li Wenbin Li Xiangyang Li Xinyu Li Xiujun Li Xiu Li

Xu Li Ya-Li Li Yao Li Yongjie Li Yijun Li Yiming Li Yuezun Li Yu Li Yunheng Li Yuqi Li Zhe Li Zeming Li Zhen Li Zhengqin Li Zhimin Li Jiefeng Li Jinpeng Li Chengze Li Jianwu Li Lerenhan Li Shan Li Suichan Li Xiangtai Li Yanjie Li Yandong Li Zhuoling Li Zhenqiang Li Manyi Li Maosen Li Ji Li Minjun Li Mingrui Li Mengtian Li Junyi Li Nianyi Li Bo Li Xiao Li Peihua Li Peike Li Peizhao Li Peiliang Li Qi Li Ren Li Runze Li Shile Li

Sheng Li Shigang Li Shiyu Li Shuang Li Shasha Li Shichao Li Tianye Li Yuexiang Li Wei-Hong Li Wanhua Li Weihao Li Weiming Li Weixin Li Wenbo Li Wenshuo Li Weijian Li Yunan Li Xirong Li Xianhang Li Xiaoyu Li Xueqian Li Xuanlin Li Xianzhi Li Yunqiang Li Yanjing Li Yansheng Li Yawei Li Yi Li Yong Li Yong-Lu Li Yuhang Li Yu-Jhe Li Yuxi Li Yunsheng Li Yanwei Li Zechao Li Zejian Li Zeju Li Zekun Li Zhaowen Li Zheng Li Zhenyu Li Zhiheng Li Zhi Li Zhong Li

Zhuowei Li Zhuowan Li Zhuohang Li Zizhang Li Chen Li Yuan-Fang Li Dongze Lian Xiaochen Lian Zhouhui Lian Long Lian Qing Lian Jin Lianbao Jinxiu S. Liang Dingkang Liang Jiahao Liang Jianming Liang Jingyun Liang Kevin J. Liang Kaizhao Liang Chen Liang Jie Liang Senwei Liang Ding Liang Jiajun Liang Jian Liang Kongming Liang Siyuan Liang Yuanzhi Liang Zhengfa Liang Mingfu Liang Xiaodan Liang Xuefeng Liang Yuxuan Liang Kang Liao Liang Liao Hong-Yuan Mark Liao Wentong Liao Haofu Liao Yue Liao Minghui Liao Shengcai Liao Ting-Hsuan Liao Xin Liao Yinghong Liao Teck Yian Lim

Che-Tsung Lin Chung-Ching Lin Chen-Hsuan Lin Cheng Lin Chuming Lin Chunyu Lin Dahua Lin Wei Lin Zheng Lin Huaijia Lin Jason Lin Jierui Lin Jiaying Lin Jie Lin Kai-En Lin Kevin Lin Guangfeng Lin Jiehong Lin Feng Lin Hang Lin Kwan-Yee Lin Ke Lin Luojun Lin Qinghong Lin Xiangbo Lin Yi Lin Zudi Lin Shijie Lin Yiqun Lin Tzu-Heng Lin Ming Lin Shaohui Lin SongNan Lin Ji Lin Tsung-Yu Lin Xudong Lin Yancong Lin Yen-Chen Lin Yiming Lin Yuewei Lin Zhiqiu Lin Zinan Lin Zhe Lin David B. Lindell Zhixin Ling

Zhan Ling Alexander Liniger Venice Erin B. Liong Joey Litalien Or Litany Roee Litman Ron Litman Jim Little Dor Litvak Shaoteng Liu Shuaicheng Liu Andrew Liu Xian Liu Shaohui Liu Bei Liu Bo Liu Yong Liu Ming Liu Yanbin Liu Chenxi Liu Daqi Liu Di Liu Difan Liu Dong Liu Dongfang Liu Daizong Liu Xiao Liu Fangyi Liu Fengbei Liu Fenglin Liu Bin Liu Yuang Liu Ao Liu Hong Liu Hongfu Liu Huidong Liu Ziyi Liu Feng Liu Hao Liu Jie Liu Jialun Liu Jiang Liu Jing Liu Jingya Liu Jiaming Liu

Jun Liu Juncheng Liu Jiawei Liu Hongyu Liu Chuanbin Liu Haotian Liu Lingqiao Liu Chang Liu Han Liu Liu Liu Min Liu Yingqi Liu Aishan Liu Bingyu Liu Benlin Liu Boxiao Liu Chenchen Liu Chuanjian Liu Daqing Liu Huan Liu Haozhe Liu Jiaheng Liu Wei Liu Jingzhou Liu Jiyuan Liu Lingbo Liu Nian Liu Peiye Liu Qiankun Liu Shenglan Liu Shilong Liu Wen Liu Wenyu Liu Weifeng Liu Wu Liu Xiaolong Liu Yang Liu Yanwei Liu Yingcheng Liu Yongfei Liu Yihao Liu Yu Liu Yunze Liu Ze Liu Zhenhua Liu

Zhenguang Liu Lin Liu Lihao Liu Pengju Liu Xinhai Liu Yunfei Liu Meng Liu Minghua Liu Mingyuan Liu Miao Liu Peirong Liu Ping Liu Qingjie Liu Ruoshi Liu Risheng Liu Songtao Liu Xing Liu Shikun Liu Shuming Liu Sheng Liu Songhua Liu Tongliang Liu Weibo Liu Weide Liu Weizhe Liu Wenxi Liu Weiyang Liu Xin Liu Xiaobin Liu Xudong Liu Xiaoyi Liu Xihui Liu Xinchen Liu Xingtong Liu Xinpeng Liu Xinyu Liu Xianpeng Liu Xu Liu Xingyu Liu Yongtuo Liu Yahui Liu Yangxin Liu Yaoyao Liu Yaojie Liu Yuliang Liu

Yongcheng Liu Yuan Liu Yufan Liu Yu-Lun Liu Yun Liu Yunfan Liu Yuanzhong Liu Zhuoran Liu Zhen Liu Zheng Liu Zhijian Liu Zhisong Liu Ziquan Liu Ziyu Liu Zhihua Liu Zechun Liu Zhaoyang Liu Zhengzhe Liu Stephan Liwicki Shao-Yuan Lo Sylvain Lobry Suhas Lohit Vishnu Suresh Lokhande Vincenzo Lomonaco Chengjiang Long Guodong Long Fuchen Long Shangbang Long Yang Long Zijun Long Vasco Lopes Antonio M. Lopez Roberto Javier Lopez-Sastre Tobias Lorenz Javier Lorenzo-Navarro Yujing Lou Qian Lou Xiankai Lu Changsheng Lu Huimin Lu Yongxi Lu Hao Lu Hong Lu Jiasen Lu

Juwei Lu Fan Lu Guangming Lu Jiwen Lu Shun Lu Tao Lu Xiaonan Lu Yang Lu Yao Lu Yongchun Lu Zhiwu Lu Cheng Lu Liying Lu Guo Lu Xuequan Lu Yanye Lu Yantao Lu Yuhang Lu Fujun Luan Jonathon Luiten Jovita Lukasik Alan Lukezic Jonathan Samuel Lumentut Mayank Lunayach Ao Luo Canjie Luo Chong Luo Xu Luo Grace Luo Jun Luo Katie Z. Luo Tao Luo Cheng Luo Fangzhou Luo Gen Luo Lei Luo Sihui Luo Weixin Luo Yan Luo Xiaoyan Luo Yong Luo Yadan Luo Hao Luo Ruotian Luo Mi Luo

Tiange Luo Wenjie Luo Wenhan Luo Xiao Luo Zhiming Luo Zhipeng Luo Zhengyi Luo Diogo C. Luvizon Zhaoyang Lv Gengyu Lyu Lingjuan Lyu Jun Lyu Yuanyuan Lyu Youwei Lyu Yueming Lyu Bingpeng Ma Chao Ma Chongyang Ma Congbo Ma Chih-Yao Ma Fan Ma Lin Ma Haoyu Ma Hengbo Ma Jianqi Ma Jiawei Ma Jiayi Ma Kede Ma Kai Ma Lingni Ma Lei Ma Xu Ma Ning Ma Benteng Ma Cheng Ma Andy J. Ma Long Ma Zhanyu Ma Zhiheng Ma Qianli Ma Shiqiang Ma Sizhuo Ma Shiqing Ma Xiaolong Ma Xinzhu Ma

Gautam B. Machiraju Spandan Madan Mathew Magimai-Doss Luca Magri Behrooz Mahasseni Upal Mahbub Siddharth Mahendran Paridhi Maheshwari Rishabh Maheshwary Mohammed Mahmoud Shishira R. R. Maiya Sylwia Majchrowska Arjun Majumdar Puspita Majumdar Orchid Majumder Sagnik Majumder Ilya Makarov Farkhod F. Makhmudkhujaev Yasushi Makihara Ankur Mali Mateusz Malinowski Utkarsh Mall Srikanth Malla Clement Mallet Dimitrios Mallis Yunze Man Dipu Manandhar Massimiliano Mancini Murari Mandal Raunak Manekar Karttikeya Mangalam Puneet Mangla Fabian Manhardt Sivabalan Manivasagam Fahim Mannan Chengzhi Mao Hanzi Mao Jiayuan Mao Junhua Mao Zhiyuan Mao Jiageng Mao Yunyao Mao Zhendong Mao Alberto Marchisio

Diego Marcos Riccardo Marin Aram Markosyan Renaud Marlet Ricardo Marques Miquel Martí i Rabadán Diego Martin Arroyo Niki Martinel Brais Martinez Julieta Martinez Marc Masana Tomohiro Mashita Timothée Masquelier Minesh Mathew Tetsu Matsukawa Marwan Mattar Bruce A. Maxwell Christoph Mayer Mantas Mazeika Pratik Mazumder Scott McCloskey Steven McDonagh Ishit Mehta Jie Mei Kangfu Mei Jieru Mei Xiaoguang Mei Givi Meishvili Luke Melas-Kyriazi Iaroslav Melekhov Andres Mendez-Vazquez Heydi Mendez-Vazquez Matias Mendieta Ricardo A. Mendoza-León Chenlin Meng Depu Meng Rang Meng Zibo Meng Qingjie Meng Qier Meng Yanda Meng Zihang Meng Thomas Mensink Fabian Mentzer Christopher Metzler

Gregory P. Meyer Vasileios Mezaris Liang Mi Lu Mi Bo Miao Changtao Miao Zichen Miao Qiguang Miao Xin Miao Zhongqi Miao Frank Michel Simone Milani Ben Mildenhall Roy V. Miles Juhong Min Kyle Min Hyun-Seok Min Weiqing Min Yuecong Min Zhixiang Min Qi Ming David Minnen Aymen Mir Deepak Mishra Anand Mishra Shlok K. Mishra Niluthpol Mithun Gaurav Mittal Trisha Mittal Daisuke Miyazaki Kaichun Mo Hong Mo Zhipeng Mo Davide Modolo Abduallah A. Mohamed Mohamed Afham Mohamed Aflal Ron Mokady Pavlo Molchanov Davide Moltisanti Liliane Momeni Gianluca Monaci Pascal Monasse Ajoy Mondal Tom Monnier

Aron Monszpart Gyeongsik Moon Suhong Moon Taesup Moon Sean Moran Daniel Moreira Pietro Morerio Alexandre Morgand Lia Morra Ali Mosleh Inbar Mosseri Sayed Mohammad Mostafavi Isfahani Saman Motamed Ramy A. Mounir Fangzhou Mu Jiteng Mu Norman Mu Yasuhiro Mukaigawa Ryan Mukherjee Tanmoy Mukherjee Yusuke Mukuta Ravi Teja Mullapudi Lea Müller Matthias Müller Martin Mundt Nils Murrugarra-Llerena Damien Muselet Armin Mustafa Muhammad Ferjad Naeem Sauradip Nag Hajime Nagahara Pravin Nagar Rajendra Nagar Naveen Shankar Nagaraja Varun Nagaraja Tushar Nagarajan Seungjun Nah Gaku Nakano Yuta Nakashima Giljoo Nam Seonghyeon Nam Liangliang Nan Yuesong Nan Yeshwanth Napolean

Dinesh Reddy Narapureddy Medhini Narasimhan Supreeth Narasimhaswamy Sriram Narayanan Erickson R. Nascimento Varun Nasery K. L. Navaneet Pablo Navarrete Michelini Shant Navasardyan Shah Nawaz Nihal Nayak Farhood Negin Lukáš Neumann Alejandro Newell Evonne Ng Kam Woh Ng Tony Ng Anh Nguyen Tuan Anh Nguyen Cuong Cao Nguyen Ngoc Cuong Nguyen Thanh Nguyen Khoi Nguyen Phi Le Nguyen Phong Ha Nguyen Tam Nguyen Truong Nguyen Anh Tuan Nguyen Rang Nguyen Thao Thi Phuong Nguyen Van Nguyen Nguyen Zhen-Liang Ni Yao Ni Shijie Nie Xuecheng Nie Yongwei Nie Weizhi Nie Ying Nie Yinyu Nie Kshitij N. Nikhal Simon Niklaus Xuefei Ning Jifeng Ning

Yotam Nitzan Di Niu Shuaicheng Niu Li Niu Wei Niu Yulei Niu Zhenxing Niu Albert No Shohei Nobuhara Nicoletta Noceti Junhyug Noh Sotiris Nousias Slawomir Nowaczyk Ewa M. Nowara Valsamis Ntouskos Gilberto Ochoa-Ruiz Ferda Ofli Jihyong Oh Sangyun Oh Youngtaek Oh Hiroki Ohashi Takahiro Okabe Kemal Oksuz Fumio Okura Daniel Olmeda Reino Matthew Olson Carl Olsson Roy Or-El Alessandro Ortis Guillermo Ortiz-Jimenez Magnus Oskarsson Ahmed A. A. Osman Martin R. Oswald Mayu Otani Naima Otberdout Cheng Ouyang Jiahong Ouyang Wanli Ouyang Andrew Owens Poojan B. Oza Mete Ozay A. Cengiz Oztireli Gautam Pai Tomas Pajdla Umapada Pal

Simone Palazzo Luca Palmieri Bowen Pan Hao Pan Lili Pan Tai-Yu Pan Liang Pan Chengwei Pan Yingwei Pan Xuran Pan Jinshan Pan Xinyu Pan Liyuan Pan Xingang Pan Xingjia Pan Zhihong Pan Zizheng Pan Priyadarshini Panda Rameswar Panda Rohit Pandey Kaiyue Pang Bo Pang Guansong Pang Jiangmiao Pang Meng Pang Tianyu Pang Ziqi Pang Omiros Pantazis Andreas Panteli Maja Pantic Marina Paolanti Joao P. Papa Samuele Papa Mike Papadakis Dim P. Papadopoulos George Papandreou Constantin Pape Toufiq Parag Chethan Parameshwara Shaifali Parashar Alejandro Pardo Rishubh Parihar Sarah Parisot JaeYoo Park Gyeong-Moon Park

Hyojin Park Hyoungseob Park Jongchan Park Jae Sung Park Kiru Park Chunghyun Park Kwanyong Park Sunghyun Park Sungrae Park Seongsik Park Sanghyun Park Sungjune Park Taesung Park Gaurav Parmar Paritosh Parmar Alvaro Parra Despoina Paschalidou Or Patashnik Shivansh Patel Pushpak Pati Prashant W. Patil Vaishakh Patil Suvam Patra Jay Patravali Badri Narayana Patro Angshuman Paul Sudipta Paul Rémi Pautrat Nick E. Pears Adithya Pediredla Wenjie Pei Shmuel Peleg Latha Pemula Bo Peng Houwen Peng Yue Peng Liangzu Peng Baoyun Peng Jun Peng Pai Peng Sida Peng Xi Peng Yuxin Peng Songyou Peng Wei Peng

Weiqi Peng Wen-Hsiao Peng Pramuditha Perera Juan C. Perez Eduardo Pérez Pellitero Juan-Manuel Perez-Rua Federico Pernici Marco Pesavento Stavros Petridis Ilya A. Petrov Vladan Petrovic Mathis Petrovich Suzanne Petryk Hieu Pham Quang Pham Khoi Pham Tung Pham Huy Phan Stephen Phillips Cheng Perng Phoo David Picard Marco Piccirilli Georg Pichler A. J. Piergiovanni Vipin Pillai Silvia L. Pintea Giovanni Pintore Robinson Piramuthu Fiora Pirri Theodoros Pissas Fabio Pizzati Benjamin Planche Bryan Plummer Matteo Poggi Ashwini Pokle Georgy E. Ponimatkin Adrian Popescu Stefan Popov Nikola Popovi´c Ronald Poppe Angelo Porrello Michael Potter Charalambos Poullis Hadi Pouransari Omid Poursaeed

Shraman Pramanick Mantini Pranav Dilip K. Prasad Meghshyam Prasad B. H. Pawan Prasad Shitala Prasad Prateek Prasanna Ekta Prashnani Derek S. Prijatelj Luke Y. Prince Véronique Prinet Victor Adrian Prisacariu James Pritts Thomas Probst Sergey Prokudin Rita Pucci Chi-Man Pun Matthew Purri Haozhi Qi Lu Qi Lei Qi Xianbiao Qi Yonggang Qi Yuankai Qi Siyuan Qi Guocheng Qian Hangwei Qian Qi Qian Deheng Qian Shengsheng Qian Wen Qian Rui Qian Yiming Qian Shengju Qian Shengyi Qian Xuelin Qian Zhenxing Qian Nan Qiao Xiaotian Qiao Jing Qin Can Qin Siyang Qin Hongwei Qin Jie Qin Minghai Qin

Yipeng Qin Yongqiang Qin Wenda Qin Xuebin Qin Yuzhe Qin Yao Qin Zhenyue Qin Zhiwu Qing Heqian Qiu Jiayan Qiu Jielin Qiu Yue Qiu Jiaxiong Qiu Zhongxi Qiu Shi Qiu Zhaofan Qiu Zhongnan Qu Yanyun Qu Kha Gia Quach Yuhui Quan Ruijie Quan Mike Rabbat Rahul Shekhar Rade Filip Radenovic Gorjan Radevski Bogdan Raducanu Francesco Ragusa Shafin Rahman Md Mahfuzur Rahman Siddiquee Hossein Rahmani Kiran Raja Sivaramakrishnan Rajaraman Jathushan Rajasegaran Adnan Siraj Rakin Michaël Ramamonjisoa Chirag A. Raman Shanmuganathan Raman Vignesh Ramanathan Vasili Ramanishka Vikram V. Ramaswamy Merey Ramazanova Jason Rambach Sai Saketh Rambhatla

Clément Rambour Ashwin Ramesh Babu Adín Ramírez Rivera Arianna Rampini Haoxi Ran Aakanksha Rana Aayush Jung Bahadur Rana Kanchana N. Ranasinghe Aneesh Rangnekar Samrudhdhi B. Rangrej Harsh Rangwani Viresh Ranjan Anyi Rao Yongming Rao Carolina Raposo Michalis Raptis Amir Rasouli Vivek Rathod Adepu Ravi Sankar Avinash Ravichandran Bharadwaj Ravichandran Dripta S. Raychaudhuri Adria Recasens Simon Reiß Davis Rempe Daxuan Ren Jiawei Ren Jimmy Ren Sucheng Ren Dayong Ren Zhile Ren Dongwei Ren Qibing Ren Pengfei Ren Zhenwen Ren Xuqian Ren Yixuan Ren Zhongzheng Ren Ambareesh Revanur Hamed Rezazadegan Tavakoli Rafael S. Rezende Wonjong Rhee Alexander Richard

Christian Richardt Stephan R. Richter Benjamin Riggan Dominik Rivoir Mamshad Nayeem Rizve Joshua D. Robinson Joseph Robinson Chris Rockwell Ranga Rodrigo Andres C. Rodriguez Carlos Rodriguez-Pardo Marcus Rohrbach Gemma Roig Yu Rong David A. Ross Mohammad Rostami Edward Rosten Karsten Roth Anirban Roy Debaditya Roy Shuvendu Roy Ahana Roy Choudhury Aruni Roy Chowdhury Denys Rozumnyi Shulan Ruan Wenjie Ruan Patrick Ruhkamp Danila Rukhovich Anian Ruoss Chris Russell Dan Ruta Dawid Damian Rymarczyk DongHun Ryu Hyeonggon Ryu Kwonyoung Ryu Balasubramanian S. Alexandre Sablayrolles Mohammad Sabokrou Arka Sadhu Aniruddha Saha Oindrila Saha Pritish Sahu Aneeshan Sain Nirat Saini Saurabh Saini

Takeshi Saitoh Christos Sakaridis Fumihiko Sakaue Dimitrios Sakkos Ken Sakurada Parikshit V. Sakurikar Rohit Saluja Nermin Samet Leo Sampaio Ferraz Ribeiro Jorge Sanchez Enrique Sanchez Shengtian Sang Anush Sankaran Soubhik Sanyal Nikolaos Sarafianos Vishwanath Saragadam István Sárándi Saquib Sarfraz Mert Bulent Sariyildiz Anindya Sarkar Pritam Sarkar Paul-Edouard Sarlin Hiroshi Sasaki Takami Sato Torsten Sattler Ravi Kumar Satzoda Axel Sauer Stefano Savian Artem Savkin Manolis Savva Gerald Schaefer Simone Schaub-Meyer Yoni Schirris Samuel Schulter Katja Schwarz Jesse Scott Sinisa Segvic Constantin Marc Seibold Lorenzo Seidenari Matan Sela Fadime Sener Paul Hongsuck Seo Kwanggyoon Seo Hongje Seong

Dario Serez Francesco Setti Bryan Seybold Mohamad Shahbazi Shima Shahfar Xinxin Shan Caifeng Shan Dandan Shan Shawn Shan Wei Shang Jinghuan Shang Jiaxiang Shang Lei Shang Sukrit Shankar Ken Shao Rui Shao Jie Shao Mingwen Shao Aashish Sharma Gaurav Sharma Vivek Sharma Abhishek Sharma Yoli Shavit Shashank Shekhar Sumit Shekhar Zhijie Shen Fengyi Shen Furao Shen Jialie Shen Jingjing Shen Ziyi Shen Linlin Shen Guangyu Shen Biluo Shen Falong Shen Jiajun Shen Qiu Shen Qiuhong Shen Shuai Shen Wang Shen Yiqing Shen Yunhang Shen Siqi Shen Bin Shen Tianwei Shen

Xi Shen Yilin Shen Yuming Shen Yucong Shen Zhiqiang Shen Lu Sheng Yichen Sheng Shivanand Venkanna Sheshappanavar Shelly Sheynin Baifeng Shi Ruoxi Shi Botian Shi Hailin Shi Jia Shi Jing Shi Shaoshuai Shi Baoguang Shi Boxin Shi Hengcan Shi Tianyang Shi Xiaodan Shi Yongjie Shi Zhensheng Shi Yinghuan Shi Weiqi Shi Wu Shi Xuepeng Shi Xiaoshuang Shi Yujiao Shi Zenglin Shi Zhenmei Shi Takashi Shibata Meng-Li Shih Yichang Shih Hyunjung Shim Dongseok Shim Soshi Shimada Inkyu Shin Jinwoo Shin Seungjoo Shin Seungjae Shin Koichi Shinoda Suprosanna Shit

Palaiahnakote Shivakumara Eli Shlizerman Gaurav Shrivastava Xiao Shu Xiangbo Shu Xiujun Shu Yang Shu Tianmin Shu Jun Shu Zhixin Shu Bing Shuai Maria Shugrina Ivan Shugurov Satya Narayan Shukla Pranjay Shyam Jianlou Si Yawar Siddiqui Alberto Signoroni Pedro Silva Jae-Young Sim Oriane Siméoni Martin Simon Andrea Simonelli Abhishek Singh Ashish Singh Dinesh Singh Gurkirt Singh Krishna Kumar Singh Mannat Singh Pravendra Singh Rajat Vikram Singh Utkarsh Singhal Dipika Singhania Vasu Singla Harsh Sinha Sudipta Sinha Josef Sivic Elena Sizikova Geri Skenderi Ivan Skorokhodov Dmitriy Smirnov Cameron Y. Smith James S. Smith Patrick Snape

Mattia Soldan Hyeongseok Son Sanghyun Son Chuanbiao Song Chen Song Chunfeng Song Dan Song Dongjin Song Hwanjun Song Guoxian Song Jiaming Song Jie Song Liangchen Song Ran Song Luchuan Song Xibin Song Li Song Fenglong Song Guoli Song Guanglu Song Zhenbo Song Lin Song Xinhang Song Yang Song Yibing Song Rajiv Soundararajan Hossein Souri Cristovao Sousa Riccardo Spezialetti Leonidas Spinoulas Michael W. Spratling Deepak Sridhar Srinath Sridhar Gaurang Sriramanan Vinkle Kumar Srivastav Themos Stafylakis Serban Stan Anastasis Stathopoulos Markus Steinberger Jan Steinbrener Sinisa Stekovic Alexandros Stergiou Gleb Sterkin Rainer Stiefelhagen Pierre Stock

Ombretta Strafforello Julian Straub Yannick Strümpler Joerg Stueckler Hang Su Weijie Su Jong-Chyi Su Bing Su Haisheng Su Jinming Su Yiyang Su Yukun Su Yuxin Su Zhuo Su Zhaoqi Su Xiu Su Yu-Chuan Su Zhixun Su Arulkumar Subramaniam Akshayvarun Subramanya A. Subramanyam Swathikiran Sudhakaran Yusuke Sugano Masanori Suganuma Yumin Suh Yang Sui Baochen Sun Cheng Sun Long Sun Guolei Sun Haoliang Sun Haomiao Sun He Sun Hanqing Sun Hao Sun Lichao Sun Jiachen Sun Jiaming Sun Jian Sun Jin Sun Jennifer J. Sun Tiancheng Sun Libo Sun Peize Sun Qianru Sun

Shanlin Sun Yu Sun Zhun Sun Che Sun Lin Sun Tao Sun Yiyou Sun Chunyi Sun Chong Sun Weiwei Sun Weixuan Sun Xiuyu Sun Yanan Sun Zeren Sun Zhaodong Sun Zhiqing Sun Minhyuk Sung Jinli Suo Simon Suo Abhijit Suprem Anshuman Suri Saksham Suri Joshua M. Susskind Roman Suvorov Gurumurthy Swaminathan Robin Swanson Paul Swoboda Tabish A. Syed Richard Szeliski Fariborz Taherkhani Yu-Wing Tai Keita Takahashi Walter Talbott Gary Tam Masato Tamura Feitong Tan Fuwen Tan Shuhan Tan Andong Tan Bin Tan Cheng Tan Jianchao Tan Lei Tan Mingxing Tan Xin Tan

Zichang Tan Zhentao Tan Kenichiro Tanaka Masayuki Tanaka Yushun Tang Hao Tang Jingqun Tang Jinhui Tang Kaihua Tang Luming Tang Lv Tang Sheyang Tang Shitao Tang Siliang Tang Shixiang Tang Yansong Tang Keke Tang Chang Tang Chenwei Tang Jie Tang Junshu Tang Ming Tang Peng Tang Xu Tang Yao Tang Chen Tang Fan Tang Haoran Tang Shengeng Tang Yehui Tang Zhipeng Tang Ugo Tanielian Chaofan Tao Jiale Tao Junli Tao Renshuai Tao An Tao Guanhong Tao Zhiqiang Tao Makarand Tapaswi Jean-Philippe G. Tarel Juan J. Tarrio Enzo Tartaglione Keisuke Tateno Zachary Teed

Ajinkya B. Tejankar Bugra Tekin Purva Tendulkar Damien Teney Minggui Teng Chris Tensmeyer Andrew Beng Jin Teoh Philipp Terhörst Kartik Thakral Nupur Thakur Kevin Thandiackal Spyridon Thermos Diego Thomas William Thong Yuesong Tian Guanzhong Tian Lin Tian Shiqi Tian Kai Tian Meng Tian Tai-Peng Tian Zhuotao Tian Shangxuan Tian Tian Tian Yapeng Tian Yu Tian Yuxin Tian Leslie Ching Ow Tiong Praveen Tirupattur Garvita Tiwari George Toderici Antoine Toisoul Aysim Toker Tatiana Tommasi Zhan Tong Alessio Tonioni Alessandro Torcinovich Fabio Tosi Matteo Toso Hugo Touvron Quan Hung Tran Son Tran Hung Tran Ngoc-Trung Tran Vinh Tran

Phong Tran Giovanni Trappolini Edith Tretschk Subarna Tripathi Shubhendu Trivedi Eduard Trulls Prune Truong Thanh-Dat Truong Tomasz Trzcinski Sam Tsai Yi-Hsuan Tsai Ethan Tseng Yu-Chee Tseng Shahar Tsiper Stavros Tsogkas Shikui Tu Zhigang Tu Zhengzhong Tu Richard Tucker Sergey Tulyakov Cigdem Turan Daniyar Turmukhambetov Victor G. Turrisi da Costa Bartlomiej Twardowski Christopher D. Twigg Radim Tylecek Mostofa Rafid Uddin Md. Zasim Uddin Kohei Uehara Nicolas Ugrinovic Youngjung Uh Norimichi Ukita Anwaar Ulhaq Devesh Upadhyay Paul Upchurch Yoshitaka Ushiku Yuzuko Utsumi Mikaela Angelina Uy Mohit Vaishnav Pratik Vaishnavi Jeya Maria Jose Valanarasu Matias A. Valdenegro Toro Diego Valsesia Wouter Van Gansbeke Nanne van Noord

Simon Vandenhende Farshid Varno Cristina Vasconcelos Francisco Vasconcelos Alex Vasilescu Subeesh Vasu Arun Balajee Vasudevan Kanav Vats Vaibhav S. Vavilala Sagar Vaze Javier Vazquez-Corral Andrea Vedaldi Olga Veksler Andreas Velten Sai H. Vemprala Raviteja Vemulapalli Shashanka Venkataramanan Dor Verbin Luisa Verdoliva Manisha Verma Yashaswi Verma Constantin Vertan Eli Verwimp Deepak Vijaykeerthy Pablo Villanueva Ruben Villegas Markus Vincze Vibhav Vineet Minh P. Vo Huy V. Vo Duc Minh Vo Tomas Vojir Igor Vozniak Nicholas Vretos Vibashan VS Tuan-Anh Vu Thang Vu Mårten Wadenbäck Neal Wadhwa Aaron T. Walsman Steven Walton Jin Wan Alvin Wan Jia Wan

Jun Wan Xiaoyue Wan Fang Wan Guowei Wan Renjie Wan Zhiqiang Wan Ziyu Wan Bastian Wandt Dongdong Wang Limin Wang Haiyang Wang Xiaobing Wang Angtian Wang Angelina Wang Bing Wang Bo Wang Boyu Wang Binghui Wang Chen Wang Chien-Yi Wang Congli Wang Qi Wang Chengrui Wang Rui Wang Yiqun Wang Cong Wang Wenjing Wang Dongkai Wang Di Wang Xiaogang Wang Kai Wang Zhizhong Wang Fangjinhua Wang Feng Wang Hang Wang Gaoang Wang Guoqing Wang Guangcong Wang Guangzhi Wang Hanqing Wang Hao Wang Haohan Wang Haoran Wang Hong Wang Haotao Wang

Hu Wang Huan Wang Hua Wang Hui-Po Wang Hengli Wang Hanyu Wang Hongxing Wang Jingwen Wang Jialiang Wang Jian Wang Jianyi Wang Jiashun Wang Jiahao Wang Tsun-Hsuan Wang Xiaoqian Wang Jinqiao Wang Jun Wang Jianzong Wang Kaihong Wang Ke Wang Lei Wang Lingjing Wang Linnan Wang Lin Wang Liansheng Wang Mengjiao Wang Manning Wang Nannan Wang Peihao Wang Jiayun Wang Pu Wang Qiang Wang Qiufeng Wang Qilong Wang Qiangchang Wang Qin Wang Qing Wang Ruocheng Wang Ruibin Wang Ruisheng Wang Ruizhe Wang Runqi Wang Runzhong Wang Wenxuan Wang Sen Wang

Shangfei Wang Shaofei Wang Shijie Wang Shiqi Wang Zhibo Wang Song Wang Xinjiang Wang Tai Wang Tao Wang Teng Wang Xiang Wang Tianren Wang Tiantian Wang Tianyi Wang Fengjiao Wang Wei Wang Miaohui Wang Suchen Wang Siyue Wang Yaoming Wang Xiao Wang Ze Wang Biao Wang Chaofei Wang Dong Wang Gu Wang Guangrun Wang Guangming Wang Guo-Hua Wang Haoqing Wang Hesheng Wang Huafeng Wang Jinghua Wang Jingdong Wang Jingjing Wang Jingya Wang Jingkang Wang Jiakai Wang Junke Wang Kuo Wang Lichen Wang Lizhi Wang Longguang Wang Mang Wang Mei Wang

Min Wang Peng-Shuai Wang Run Wang Shaoru Wang Shuhui Wang Tan Wang Tiancai Wang Tianqi Wang Wenhai Wang Wenzhe Wang Xiaobo Wang Xiudong Wang Xu Wang Yajie Wang Yan Wang Yuan-Gen Wang Yingqian Wang Yizhi Wang Yulin Wang Yu Wang Yujie Wang Yunhe Wang Yuxi Wang Yaowei Wang Yiwei Wang Zezheng Wang Hongzhi Wang Zhiqiang Wang Ziteng Wang Ziwei Wang Zheng Wang Zhenyu Wang Binglu Wang Zhongdao Wang Ce Wang Weining Wang Weiyao Wang Wenbin Wang Wenguan Wang Guangting Wang Haolin Wang Haiyan Wang Huiyu Wang Naiyan Wang Jingbo Wang

Jinpeng Wang Jiaqi Wang Liyuan Wang Lizhen Wang Ning Wang Wenqian Wang Sheng-Yu Wang Weimin Wang Xiaohan Wang Yifan Wang Yi Wang Yongtao Wang Yizhou Wang Zhuo Wang Zhe Wang Xudong Wang Xiaofang Wang Xinggang Wang Xiaosen Wang Xiaosong Wang Xiaoyang Wang Lijun Wang Xinlong Wang Xuan Wang Xue Wang Yangang Wang Yaohui Wang Yu-Chiang Frank Wang Yida Wang Yilin Wang Yi Ru Wang Yali Wang Yinglong Wang Yufu Wang Yujiang Wang Yuwang Wang Yuting Wang Yang Wang Yu-Xiong Wang Yixu Wang Ziqi Wang Zhicheng Wang Zeyu Wang Zhaowen Wang Zhenyi Wang

Zhenzhi Wang Zhijie Wang Zhiyong Wang Zhongling Wang Zhuowei Wang Zian Wang Zifu Wang Zihao Wang Zirui Wang Ziyan Wang Wenxiao Wang Zhen Wang Zhepeng Wang Zi Wang Zihao W. Wang Steven L. Waslander Olivia Watkins Daniel Watson Silvan Weder Dongyoon Wee Dongming Wei Tianyi Wei Jia Wei Dong Wei Fangyun Wei Longhui Wei Mingqiang Wei Xinyue Wei Chen Wei Donglai Wei Pengxu Wei Xing Wei Xiu-Shen Wei Wenqi Wei Guoqiang Wei Wei Wei XingKui Wei Xian Wei Xingxing Wei Yake Wei Yuxiang Wei Yi Wei Luca Weihs Michael Weinmann Martin Weinmann

Congcong Wen Chuan Wen Jie Wen Sijia Wen Song Wen Chao Wen Xiang Wen Zeyi Wen Xin Wen Yilin Wen Yijia Weng Shuchen Weng Junwu Weng Wenming Weng Renliang Weng Zhenyu Weng Xinshuo Weng Nicholas J. Westlake Gordon Wetzstein Lena M. Widin Klasén Rick Wildes Bryan M. Williams Williem Williem Ole Winther Scott Wisdom Alex Wong Chau-Wai Wong Kwan-Yee K. Wong Yongkang Wong Scott Workman Marcel Worring Michael Wray Safwan Wshah Xiang Wu Aming Wu Chongruo Wu Cho-Ying Wu Chunpeng Wu Chenyan Wu Ziyi Wu Fuxiang Wu Gang Wu Haiping Wu Huisi Wu Jane Wu

Jialian Wu Jing Wu Jinjian Wu Jianlong Wu Xian Wu Lifang Wu Lifan Wu Minye Wu Qianyi Wu Rongliang Wu Rui Wu Shiqian Wu Shuzhe Wu Shangzhe Wu Tsung-Han Wu Tz-Ying Wu Ting-Wei Wu Jiannan Wu Zhiliang Wu Yu Wu Chenyun Wu Dayan Wu Dongxian Wu Fei Wu Hefeng Wu Jianxin Wu Weibin Wu Wenxuan Wu Wenhao Wu Xiao Wu Yicheng Wu Yuanwei Wu Yu-Huan Wu Zhenxin Wu Zhenyu Wu Wei Wu Peng Wu Xiaohe Wu Xindi Wu Xinxing Wu Xinyi Wu Xingjiao Wu Xiongwei Wu Yangzheng Wu Yanzhao Wu

Yawen Wu Yong Wu Yi Wu Ying Nian Wu Zhenyao Wu Zhonghua Wu Zongze Wu Zuxuan Wu Stefanie Wuhrer Teng Xi Jianing Xi Fei Xia Haifeng Xia Menghan Xia Yuanqing Xia Zhihua Xia Xiaobo Xia Weihao Xia Shihong Xia Yan Xia Yong Xia Zhaoyang Xia Zhihao Xia Chuhua Xian Yongqin Xian Wangmeng Xiang Fanbo Xiang Tiange Xiang Tao Xiang Liuyu Xiang Xiaoyu Xiang Zhiyu Xiang Aoran Xiao Chunxia Xiao Fanyi Xiao Jimin Xiao Jun Xiao Taihong Xiao Anqi Xiao Junfei Xiao Jing Xiao Liang Xiao Yang Xiao Yuting Xiao Yijun Xiao

Yao Xiao Zeyu Xiao Zhisheng Xiao Zihao Xiao Binhui Xie Christopher Xie Haozhe Xie Jin Xie Guo-Sen Xie Hongtao Xie Ming-Kun Xie Tingting Xie Chaohao Xie Weicheng Xie Xudong Xie Jiyang Xie Xiaohua Xie Yuan Xie Zhenyu Xie Ning Xie Xianghui Xie Xiufeng Xie You Xie Yutong Xie Fuyong Xing Yifan Xing Zhen Xing Yuanjun Xiong Jinhui Xiong Weihua Xiong Hongkai Xiong Zhitong Xiong Yuanhao Xiong Yunyang Xiong Yuwen Xiong Zhiwei Xiong Yuliang Xiu An Xu Chang Xu Chenliang Xu Chengming Xu Chenshu Xu Xiang Xu Huijuan Xu Zhe Xu

Jie Xu Jingyi Xu Jiarui Xu Yinghao Xu Kele Xu Ke Xu Li Xu Linchuan Xu Linning Xu Mengde Xu Mengmeng Frost Xu Min Xu Mingye Xu Jun Xu Ning Xu Peng Xu Runsheng Xu Sheng Xu Wenqiang Xu Xiaogang Xu Renzhe Xu Kaidi Xu Yi Xu Chi Xu Qiuling Xu Baobei Xu Feng Xu Haohang Xu Haofei Xu Lan Xu Mingze Xu Songcen Xu Weipeng Xu Wenjia Xu Wenju Xu Xiangyu Xu Xin Xu Yinshuang Xu Yixing Xu Yuting Xu Yanyu Xu Zhenbo Xu Zhiliang Xu Zhiyuan Xu Xiaohao Xu

Yanwu Xu Yan Xu Yiran Xu Yifan Xu Yufei Xu Yong Xu Zichuan Xu Zenglin Xu Zexiang Xu Zhan Xu Zheng Xu Zhiwei Xu Ziyue Xu Shiyu Xuan Hanyu Xuan Fei Xue Jianru Xue Mingfu Xue Qinghan Xue Tianfan Xue Chao Xue Chuhui Xue Nan Xue Zhou Xue Xiangyang Xue Yuan Xue Abhay Yadav Ravindra Yadav Kota Yamaguchi Toshihiko Yamasaki Kohei Yamashita Chaochao Yan Feng Yan Kun Yan Qingsen Yan Qixin Yan Rui Yan Siming Yan Xinchen Yan Yaping Yan Bin Yan Qingan Yan Shen Yan Shipeng Yan Xu Yan

Yan Yan Yichao Yan Zhaoyi Yan Zike Yan Zhiqiang Yan Hongliang Yan Zizheng Yan Jiewen Yang Anqi Joyce Yang Shan Yang Anqi Yang Antoine Yang Bo Yang Baoyao Yang Chenhongyi Yang Dingkang Yang De-Nian Yang Dong Yang David Yang Fan Yang Fengyu Yang Fengting Yang Fei Yang Gengshan Yang Heng Yang Han Yang Huan Yang Yibo Yang Jiancheng Yang Jihan Yang Jiawei Yang Jiayu Yang Jie Yang Jinfa Yang Jingkang Yang Jinyu Yang Cheng-Fu Yang Ji Yang Jianyu Yang Kailun Yang Tian Yang Luyu Yang Liang Yang Li Yang Michael Ying Yang

Yang Yang Muli Yang Le Yang Qiushi Yang Ren Yang Ruihan Yang Shuang Yang Siyuan Yang Su Yang Shiqi Yang Taojiannan Yang Tianyu Yang Lei Yang Wanzhao Yang Shuai Yang William Yang Wei Yang Xiaofeng Yang Xiaoshan Yang Xin Yang Xuan Yang Xu Yang Xingyi Yang Xitong Yang Jing Yang Yanchao Yang Wenming Yang Yujiu Yang Herb Yang Jianfei Yang Jinhui Yang Chuanguang Yang Guanglei Yang Haitao Yang Kewei Yang Linlin Yang Lijin Yang Longrong Yang Meng Yang MingKun Yang Sibei Yang Shicai Yang Tong Yang Wen Yang Xi Yang

Xiaolong Yang Xue Yang Yubin Yang Ze Yang Ziyi Yang Yi Yang Linjie Yang Yuzhe Yang Yiding Yang Zhenpei Yang Zhaohui Yang Zhengyuan Yang Zhibo Yang Zongxin Yang Hantao Yao Mingde Yao Rui Yao Taiping Yao Ting Yao Cong Yao Qingsong Yao Quanming Yao Xu Yao Yuan Yao Yao Yao Yazhou Yao Jiawen Yao Shunyu Yao Pew-Thian Yap Sudhir Yarram Rajeev Yasarla Peng Ye Botao Ye Mao Ye Fei Ye Hanrong Ye Jingwen Ye Jinwei Ye Jiarong Ye Mang Ye Meng Ye Qi Ye Qian Ye Qixiang Ye Junjie Ye

Sheng Ye Nanyang Ye Yufei Ye Xiaoqing Ye Ruolin Ye Yousef Yeganeh Chun-Hsiao Yeh Raymond A. Yeh Yu-Ying Yeh Kai Yi Chang Yi Renjiao Yi Xinping Yi Peng Yi Alper Yilmaz Junho Yim Hui Yin Bangjie Yin Jia-Li Yin Miao Yin Wenzhe Yin Xuwang Yin Ming Yin Yu Yin Aoxiong Yin Kangxue Yin Tianwei Yin Wei Yin Xianghua Ying Rio Yokota Tatsuya Yokota Naoto Yokoya Ryo Yonetani Ki Yoon Yoo Jinsu Yoo Sunjae Yoon Jae Shin Yoon Jihun Yoon Sung-Hoon Yoon Ryota Yoshihashi Yusuke Yoshiyasu Chenyu You Haoran You Haoxuan You Yang You

Quanzeng You Tackgeun You Kaichao You Shan You Xinge You Yurong You Baosheng Yu Bei Yu Haichao Yu Hao Yu Chaohui Yu Fisher Yu Jin-Gang Yu Jiyang Yu Jason J. Yu Jiashuo Yu Hong-Xing Yu Lei Yu Mulin Yu Ning Yu Peilin Yu Qi Yu Qian Yu Rui Yu Shuzhi Yu Gang Yu Tan Yu Weijiang Yu Xin Yu Bingyao Yu Ye Yu Hanchao Yu Yingchen Yu Tao Yu Xiaotian Yu Qing Yu Houjian Yu Changqian Yu Jing Yu Jun Yu Shujian Yu Xiang Yu Zhaofei Yu Zhenbo Yu Yinfeng Yu

Zhuoran Yu Zitong Yu Bo Yuan Jiangbo Yuan Liangzhe Yuan Weihao Yuan Jianbo Yuan Xiaoyun Yuan Ye Yuan Li Yuan Geng Yuan Jialin Yuan Maoxun Yuan Peng Yuan Xin Yuan Yuan Yuan Yuhui Yuan Yixuan Yuan Zheng Yuan Mehmet Kerim Yücel Kaiyu Yue Haixiao Yue Heeseung Yun Sangdoo Yun Tian Yun Mahmut Yurt Ekim Yurtsever Ahmet Yüzügüler Edouard Yvinec Eloi Zablocki Christopher Zach Muhammad Zaigham Zaheer Pierluigi Zama Ramirez Yuhang Zang Pietro Zanuttigh Alexey Zaytsev Bernhard Zeisl Haitian Zeng Pengpeng Zeng Jiabei Zeng Runhao Zeng Wei Zeng Yawen Zeng Yi Zeng

Yiming Zeng Tieyong Zeng Huanqiang Zeng Dan Zeng Yu Zeng Wei Zhai Yuanhao Zhai Fangneng Zhan Kun Zhan Xiong Zhang Jingdong Zhang Jiangning Zhang Zhilu Zhang Gengwei Zhang Dongsu Zhang Hui Zhang Binjie Zhang Bo Zhang Tianhao Zhang Cecilia Zhang Jing Zhang Chaoning Zhang Chenxu Zhang Chi Zhang Chris Zhang Yabin Zhang Zhao Zhang Rufeng Zhang Chaoyi Zhang Zheng Zhang Da Zhang Yi Zhang Edward Zhang Xin Zhang Feifei Zhang Feilong Zhang Yuqi Zhang GuiXuan Zhang Hanlin Zhang Hanwang Zhang Hanzhen Zhang Haotian Zhang He Zhang Haokui Zhang Hongyuan Zhang

Hengrui Zhang Hongming Zhang Mingfang Zhang Jianpeng Zhang Jiaming Zhang Jichao Zhang Jie Zhang Jingfeng Zhang Jingyi Zhang Jinnian Zhang David Junhao Zhang Junjie Zhang Junzhe Zhang Jiawan Zhang Jingyang Zhang Kai Zhang Lei Zhang Lihua Zhang Lu Zhang Miao Zhang Minjia Zhang Mingjin Zhang Qi Zhang Qian Zhang Qilong Zhang Qiming Zhang Qiang Zhang Richard Zhang Ruimao Zhang Ruisi Zhang Ruixin Zhang Runze Zhang Qilin Zhang Shan Zhang Shanshan Zhang Xi Sheryl Zhang Song-Hai Zhang Chongyang Zhang Kaihao Zhang Songyang Zhang Shu Zhang Siwei Zhang Shujian Zhang Tianyun Zhang Tong Zhang

Tao Zhang Wenwei Zhang Wenqiang Zhang Wen Zhang Xiaolin Zhang Xingchen Zhang Xingxuan Zhang Xiuming Zhang Xiaoshuai Zhang Xuanmeng Zhang Xuanyang Zhang Xucong Zhang Xingxing Zhang Xikun Zhang Xiaohan Zhang Yahui Zhang Yunhua Zhang Yan Zhang Yanghao Zhang Yifei Zhang Yifan Zhang Yi-Fan Zhang Yihao Zhang Yingliang Zhang Youshan Zhang Yulun Zhang Yushu Zhang Yixiao Zhang Yide Zhang Zhongwen Zhang Bowen Zhang Chen-Lin Zhang Zehua Zhang Zekun Zhang Zeyu Zhang Xiaowei Zhang Yifeng Zhang Cheng Zhang Hongguang Zhang Yuexi Zhang Fa Zhang Guofeng Zhang Hao Zhang Haofeng Zhang Hongwen Zhang

Hua Zhang Jiaxin Zhang Zhenyu Zhang Jian Zhang Jianfeng Zhang Jiao Zhang Jiakai Zhang Lefei Zhang Le Zhang Mi Zhang Min Zhang Ning Zhang Pan Zhang Pu Zhang Qing Zhang Renrui Zhang Shifeng Zhang Shuo Zhang Shaoxiong Zhang Weizhong Zhang Xi Zhang Xiaomei Zhang Xinyu Zhang Yin Zhang Zicheng Zhang Zihao Zhang Ziqi Zhang Zhaoxiang Zhang Zhen Zhang Zhipeng Zhang Zhixing Zhang Zhizheng Zhang Jiawei Zhang Zhong Zhang Pingping Zhang Yixin Zhang Kui Zhang Lingzhi Zhang Huaiwen Zhang Quanshi Zhang Zhoutong Zhang Yuhang Zhang Yuting Zhang Zhang Zhang Ziming Zhang

Zhizhong Zhang Qilong Zhangli Bingyin Zhao Bin Zhao Chenglong Zhao Lei Zhao Feng Zhao Gangming Zhao Haiyan Zhao Hao Zhao Handong Zhao Hengshuang Zhao Yinan Zhao Jiaojiao Zhao Jiaqi Zhao Jing Zhao Kaili Zhao Haojie Zhao Yucheng Zhao Longjiao Zhao Long Zhao Qingsong Zhao Qingyu Zhao Rui Zhao Rui-Wei Zhao Sicheng Zhao Shuang Zhao Siyan Zhao Zelin Zhao Shiyu Zhao Wang Zhao Tiesong Zhao Qian Zhao Wangbo Zhao Xi-Le Zhao Xu Zhao Yajie Zhao Yang Zhao Ying Zhao Yin Zhao Yizhou Zhao Yunhan Zhao Yuyang Zhao Yue Zhao Yuzhi Zhao

Bowen Zhao Pu Zhao Bingchen Zhao Borui Zhao Fuqiang Zhao Hanbin Zhao Jian Zhao Mingyang Zhao Na Zhao Rongchang Zhao Ruiqi Zhao Shuai Zhao Wenda Zhao Wenliang Zhao Xiangyun Zhao Yifan Zhao Yaping Zhao Zhou Zhao He Zhao Jie Zhao Xibin Zhao Xiaoqi Zhao Zhengyu Zhao Jin Zhe Chuanxia Zheng Huan Zheng Hao Zheng Jia Zheng Jian-Qing Zheng Shuai Zheng Meng Zheng Mingkai Zheng Qian Zheng Qi Zheng Wu Zheng Yinqiang Zheng Yufeng Zheng Yutong Zheng Yalin Zheng Yu Zheng Feng Zheng Zhaoheng Zheng Haitian Zheng Kang Zheng Bolun Zheng

Haiyong Zheng Mingwu Zheng Sipeng Zheng Tu Zheng Wenzhao Zheng Xiawu Zheng Yinglin Zheng Zhuo Zheng Zilong Zheng Kecheng Zheng Zerong Zheng Shuaifeng Zhi Tiancheng Zhi Jia-Xing Zhong Yiwu Zhong Fangwei Zhong Zhihang Zhong Yaoyao Zhong Yiran Zhong Zhun Zhong Zichun Zhong Bo Zhou Boyao Zhou Brady Zhou Mo Zhou Chunluan Zhou Dingfu Zhou Fan Zhou Jingkai Zhou Honglu Zhou Jiaming Zhou Jiahuan Zhou Jun Zhou Kaiyang Zhou Keyang Zhou Kuangqi Zhou Lei Zhou Lihua Zhou Man Zhou Mingyi Zhou Mingyuan Zhou Ning Zhou Peng Zhou Penghao Zhou Qianyi Zhou

Shuigeng Zhou Shangchen Zhou Huayi Zhou Zhize Zhou Sanping Zhou Qin Zhou Tao Zhou Wenbo Zhou Xiangdong Zhou Xiao-Yun Zhou Xiao Zhou Yang Zhou Yipin Zhou Zhenyu Zhou Hao Zhou Chu Zhou Daquan Zhou Da-Wei Zhou Hang Zhou Kang Zhou Qianyu Zhou Sheng Zhou Wenhui Zhou Xingyi Zhou Yan-Jie Zhou Yiyi Zhou Yu Zhou Yuan Zhou Yuqian Zhou Yuxuan Zhou Zixiang Zhou Wengang Zhou Shuchang Zhou Tianfei Zhou Yichao Zhou Alex Zhu Chenchen Zhu Deyao Zhu Xiatian Zhu Guibo Zhu Haidong Zhu Hao Zhu Hongzi Zhu Rui Zhu Jing Zhu

Jianke Zhu Junchen Zhu Lei Zhu Lingyu Zhu Luyang Zhu Menglong Zhu Peihao Zhu Hui Zhu Xiaofeng Zhu Tyler (Lixuan) Zhu Wentao Zhu Xiangyu Zhu Xinqi Zhu Xinxin Zhu Xinliang Zhu Yangguang Zhu Yichen Zhu Yixin Zhu Yanjun Zhu Yousong Zhu Yuhao Zhu Ye Zhu Feng Zhu Zhen Zhu Fangrui Zhu Jinjing Zhu Linchao Zhu Pengfei Zhu Sijie Zhu Xiaobin Zhu Xiaoguang Zhu Zezhou Zhu Zhenyao Zhu Kai Zhu Pengkai Zhu Bingbing Zhuang Chengyuan Zhuang Liansheng Zhuang Peiye Zhuang Yixin Zhuang Yihong Zhuang Junbao Zhuo Andrea Ziani Bartosz Zieli´nski Primo Zingaretti

Nikolaos Zioulis Andrew Zisserman Yael Ziv Liu Ziyin Xingxing Zou Danping Zou Qi Zou

Shihao Zou Xueyan Zou Yang Zou Yuliang Zou Zihang Zou Chuhang Zou Dongqing Zou

Xu Zou Zhiming Zou Maria A. Zuluaga Xinxin Zuo Zhiwen Zuo Reyer Zwiggelaar

Contents – Part IX

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers . . . 1
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai

Category-Level 6D Object Pose and Size Estimation Using Self-supervised Deep Prior Deformation Networks . . . 19
Jiehong Lin, Zewei Wei, Changxing Ding, and Kui Jia

Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection . . . 35
Hongyu Zhou, Zheng Ge, Songtao Liu, Weixin Mao, Zeming Li, Haiyan Yu, and Jian Sun

Point-to-Box Network for Accurate Object Detection via Single Point Supervision . . . 51
Pengfei Chen, Xuehui Yu, Xumeng Han, Najmul Hassan, Kai Wang, Jiachen Li, Jian Zhao, Humphrey Shi, Zhenjun Han, and Qixiang Ye

Domain Adaptive Hand Keypoint and Pixel Localization in the Wild . . . 68
Takehiko Ohkawa, Yu-Jhe Li, Qichen Fu, Ryosuke Furuta, Kris M. Kitani, and Yoichi Sato

Towards Data-Efficient Detection Transformers . . . 88
Wen Wang, Jing Zhang, Yang Cao, Yongliang Shen, and Dacheng Tao

Open-Vocabulary DETR with Conditional Matching . . . 106
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy

Prediction-Guided Distillation for Dense Object Detection . . . 123
Chenhongyi Yang, Mateusz Ochal, Amos Storkey, and Elliot J. Crowley

Multimodal Object Detection via Probabilistic Ensembling . . . 139
Yi-Ting Chen, Jinghao Shi, Zelin Ye, Christoph Mertz, Deva Ramanan, and Shu Kong

Exploiting Unlabeled Data with Vision and Language Models for Object Detection . . . 159
Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, B.G Vijay Kumar, Anastasis Stathopoulos, Manmohan Chandraker, and Dimitris N. Metaxas

CPO: Change Robust Panorama to Point Cloud Localization . . . 176
Junho Kim, Hojun Jang, Changwoon Choi, and Young Min Kim

INT: Towards Infinite-Frames 3D Detection with an Efficient Framework . . . 193
Jianyun Xu, Zhenwei Miao, Da Zhang, Hongyu Pan, Kaixuan Liu, Peihan Hao, Jun Zhu, Zhengyang Sun, Hongmin Li, and Xin Zhan

End-to-End Weakly Supervised Object Detection with Sparse Proposal Evolution . . . 210
Mingxiang Liao, Fang Wan, Yuan Yao, Zhenjun Han, Jialing Zou, Yuze Wang, Bailan Feng, Peng Yuan, and Qixiang Ye

Calibration-Free Multi-view Crowd Counting . . . 227
Qi Zhang and Antoni B. Chan

Unsupervised Domain Adaptation for Monocular 3D Object Detection via Self-training . . . 245
Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, and Junjun Jiang

SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud . . . 263
Xiangrui Zhao, Sheng Yang, Tianxin Huang, Jun Chen, Teng Ma, Mingyang Li, and Yong Liu

Exploring Plain Vision Transformer Backbones for Object Detection . . . 280
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He

Adversarially-Aware Robust Object Detector . . . 297
Ziyi Dong, Pengxu Wei, and Liang Lin

HEAD: HEtero-Assists Distillation for Heterogeneous Object Detectors . . . 314
Luting Wang, Xiaojie Li, Yue Liao, Zeren Jiang, Jianlong Wu, Fei Wang, Chen Qian, and Si Liu

You Should Look at All Objects . . . 332
Zhenchao Jin, Dongdong Yu, Luchuan Song, Zehuan Yuan, and Lequan Yu

Detecting Twenty-Thousand Classes Using Image-Level Supervision . . . 350
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra

DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation . . . 369
Hongyang Li, Jiehong Lin, and Kui Jia

Monocular 3D Object Detection with Depth from Motion . . . 386
Tai Wang, Jiangmiao Pang, and Dahua Lin

DISP6D: Disentangled Implicit Shape and Pose Learning for Scalable 6D Pose Estimation . . . 404
Yilin Wen, Xiangyu Li, Hao Pan, Lei Yang, Zheng Wang, Taku Komura, and Wenping Wang

Distilling Object Detectors with Global Knowledge . . . 422
Sanli Tang, Zhongyu Zhang, Zhanzhan Cheng, Jing Lu, Yunlu Xu, Yi Niu, and Fan He

Unifying Visual Perception by Dispersible Points Learning . . . 439
Jianming Liang, Guanglu Song, Biao Leng, and Yu Liu

PseCo: Pseudo Labeling and Consistency Training for Semi-Supervised Object Detection . . . 457
Gang Li, Xiang Li, Yujie Wang, Yichao Wu, Ding Liang, and Shanshan Zhang

Exploring Resolution and Degradation Clues as Self-supervised Signal for Low Quality Object Detection . . . 473
Ziteng Cui, Yingying Zhu, Lin Gu, Guo-Jun Qi, Xiaoxiao Li, Renrui Zhang, Zenghui Zhang, and Tatsuya Harada

Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features . . . 492
Wufei Ma, Angtian Wang, Alan Yuille, and Adam Kortylewski

Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection . . . 509
Maoxun Yuan, Yinyan Wang, and Xingxing Wei

RFLA: Gaussian Receptive Field Based Label Assignment for Tiny Object Detection . . . 526
Chang Xu, Jinwang Wang, Wen Yang, Huai Yu, Lei Yu, and Gui-Song Xia

Rethinking IoU-based Optimization for Single-stage 3D Object Detection . . . 544
Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Min-Jian Zhao, and Gim Hee Lee

TD-Road: Top-Down Road Network Extraction with Holistic Graph Construction . . . 562
Yang He, Ravi Garg, and Amber Roy Chowdhury

Multi-faceted Distillation of Base-Novel Commonality for Few-Shot Object Detection . . . 578
Shuang Wu, Wenjie Pei, Dianwen Mei, Fanglin Chen, Jiandong Tian, and Guangming Lu

PointCLM: A Contrastive Learning-based Framework for Multi-instance Point Cloud Registration . . . 595
Mingzhi Yuan, Zhihao Li, Qiuye Jin, Xinrong Chen, and Manning Wang

Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration . . . 612
Haotian Bai, Ruimao Zhang, Jiong Wang, and Xiang Wan

MTTrans: Cross-domain Object Detection with Mean Teacher Transformer . . . 629
Jinze Yu, Jiaming Liu, Xiaobao Wei, Haoyi Zhou, Yohei Nakata, Denis Gudovskiy, Tomoyuki Okuno, Jianxin Li, Kurt Keutzer, and Shanghang Zhang

Multi-domain Multi-definition Landmark Localization for Small Datasets . . . 646
David Ferman and Gaurav Bharaj

DEVIANT: Depth EquiVarIAnt NeTwork for Monocular 3D Object Detection . . . 664
Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu

Label-Guided Auxiliary Training Improves 3D Object Detector . . . 684
Yaomin Huang, Xinmei Liu, Yichen Zhu, Zhiyuan Xu, Chaomin Shen, Zhengping Che, Guixu Zhang, Yaxin Peng, Feifei Feng, and Jian Tang

PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images . . . 701
Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma

Densely Constrained Depth Estimator for Monocular 3D Object Detection . . . 718
Yingyan Li, Yuntao Chen, Jiawei He, and Zhaoxiang Zhang

Polarimetric Pose Prediction . . . 735
Daoyi Gao, Yitong Li, Patrick Ruhkamp, Iuliia Skobleva, Magdalena Wysocki, HyunJun Jung, Pengyuan Wang, Arturo Guridi, and Benjamin Busam

Author Index . . . 753

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers

Zhiqi Li1,2, Wenhai Wang2, Hongyang Li2, Enze Xie3, Chonghao Sima2, Tong Lu1, Yu Qiao2, and Jifeng Dai2(B)

1 Nanjing University, Nanjing, China
2 Shanghai AI Laboratory, Shanghai, China
[email protected]
3 The University of Hong Kong, Pokfulam, Hong Kong

Abstract. 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-ofthe-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. The code is available at https:// github.com/zhiqi-li/BEVFormer. Keywords: Autonomous driving · Bird’s-Eye-View detection · Map segmentation · Transformer

1 Introduction

Perception in 3D space is critical for various applications such as autonomous driving, robotics, etc. Despite the remarkable progress of LiDAR-based methods [8,20,41,48,52], camera-based approaches [28,30,43,45] have attracted extensive attention in recent years.

(Z. Li, W. Wang, and H. Li contributed equally. The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_1.)



Fig. 1. BEVFormer leverages queries to look up the spatial/temporal space and aggregate spatiotemporal information accordingly, benefiting stronger representations for perception tasks.

Apart from the low cost of deployment, cameras own the desirable advantages of detecting long-range objects and identifying vision-based road elements (e.g., traffic lights, stop lines), compared to their LiDAR-based counterparts.

Visual perception of the surrounding scene in autonomous driving is expected to predict the 3D bounding boxes or the semantic maps from 2D cues given by multiple cameras. The most straightforward solution is based on monocular frameworks [3,29,33,42,43] and cross-camera post-processing. The downside of this framework is that it processes different views separately and cannot capture information across cameras, leading to low performance and efficiency [30,45].

As an alternative to the monocular frameworks, a more unified framework extracts holistic representations from multi-camera images. The bird's-eye-view (BEV) is a commonly used representation of the surrounding scene, since it clearly presents the location and scale of objects and is suitable for various autonomous driving tasks, such as perception and planning [27]. Although previous map segmentation methods demonstrate BEV's effectiveness [18,27,30], BEV-based approaches have not shown significant advantages over other paradigms in 3D object detection [29,32,45]. The underlying reason is that the 3D object detection task requires strong BEV features to support accurate 3D bounding box prediction, but generating BEV from the 2D planes is ill-posed. A popular BEV framework generates BEV features based on depth information [30,32,44], but this paradigm is sensitive to the accuracy of depth values or depth distributions. The detection performance of BEV-based methods is thus subject to compounding errors [45], and inaccurate BEV features can seriously hurt the final performance. Therefore, we are motivated to design a BEV generation method that does not rely on depth information and can learn BEV features adaptively rather than strictly relying on 3D priors. Transformer, which uses an attention mechanism to aggregate valuable features dynamically, meets our demands conceptually.


Another motivation for using BEV features to perform perception tasks is that BEV is a desirable bridge to connect temporal and spatial space. For the human visual perception system, temporal information plays a crucial role in inferring the motion state of objects and identifying occluded objects, and many works in vision have demonstrated the effectiveness of using video data [2,19,24,25,31]. However, the existing state-of-the-art multi-camera 3D detection methods rarely exploit temporal information. The significant challenges are that autonomous driving is time-critical and objects in the scene change rapidly, so simply stacking BEV features across timestamps brings extra computational cost and interfering information, which might not be ideal. Inspired by recurrent neural networks (RNNs) [10,17], we utilize the BEV features to deliver temporal information from past to present recurrently, which has the same spirit as the hidden states of RNN models.

To this end, we present a transformer-based bird's-eye-view (BEV) encoder, termed BEVFormer, which can effectively aggregate spatiotemporal features from multi-view cameras and history BEV features. The BEV features generated by BEVFormer can simultaneously support multiple 3D perception tasks such as 3D object detection and map segmentation, which is valuable for the autonomous driving system. As shown in Fig. 1, our BEVFormer contains three key designs: (1) grid-shaped BEV queries to flexibly fuse spatial and temporal features via attention mechanisms, (2) a spatial cross-attention module to aggregate spatial features from multi-camera images, and (3) a temporal self-attention module to extract temporal information from history BEV features, which benefits the velocity estimation of moving objects and the detection of heavily occluded objects while bringing negligible computational overhead. With the unified features generated by BEVFormer, the model can collaborate with different task-specific heads such as Deformable DETR [54] and a mask decoder [22] for end-to-end 3D object detection and map segmentation. Our main contributions are as follows:

• We propose BEVFormer, a spatiotemporal transformer encoder that projects multi-camera/timestamp input to BEV representations. With the unified BEV features, our model can simultaneously support multiple autonomous driving perception tasks, including 3D detection and map segmentation.
• We design learnable BEV queries along with a spatial cross-attention layer and a temporal self-attention layer to look up spatial features across cameras and temporal features from history BEV, respectively, and then aggregate them into unified BEV features.
• We evaluate the proposed BEVFormer on multiple challenging benchmarks, including nuScenes [4] and Waymo [38]. Our BEVFormer consistently achieves improved performance compared to the prior arts. For example, under comparable parameter and computation overhead, BEVFormer achieves 56.9% NDS on the nuScenes test set, outperforming the previous best detection method DETR3D [45] by 9.0 points (56.9% vs. 47.9%).


For the map segmentation task, we also achieve state-of-the-art performance, more than 5.0 points higher than Lift-Splat [30] on the most challenging lane segmentation. We hope this straightforward and strong framework can serve as a new baseline for subsequent 3D perception tasks.

2 Related Work

Transformer-Based 2D Perception. Recently, a new trend is to use transformers to reformulate detection and segmentation tasks [7,22,54]. DETR [7] uses a set of object queries to generate detection results with a cross-attention decoder directly. However, the main drawback of DETR is the long training time. Deformable DETR [54] solves this problem by proposing deformable attention. Different from the vanilla global attention in DETR, deformable attention interacts with local regions of interest, which only samples K points near each reference point and calculates attention results, resulting in high efficiency and significantly shortening the training time. The deformable attention mechanism is calculated by:

DeformAttn(q, p, x) = \sum_{i=1}^{N_{head}} W_i \sum_{j=1}^{N_{key}} A_{ij} \cdot W'_i x(p + \Delta p_{ij}),    (1)

where q, p, x represent the query, the reference point, and the input features, respectively. i indexes the attention head, and N_head denotes the total number of attention heads. j indexes the sampled keys, and N_key is the total number of sampled keys for each head. W_i ∈ R^{C×(C/N_head)} and W'_i ∈ R^{(C/N_head)×C} are learnable weights, where C is the feature dimension. A_{ij} ∈ [0, 1] is the predicted attention weight and is normalized by \sum_{j=1}^{N_{key}} A_{ij} = 1. Δp_{ij} ∈ R^2 are the predicted offsets to the reference point p. x(p + Δp_{ij}) represents the feature at location p + Δp_{ij}, which is extracted by bilinear interpolation as in Dai et al. [12]. In this work, we extend the deformable attention to 3D perception tasks to efficiently aggregate both spatial and temporal information.

Camera-Based 3D Perception. Previous 3D perception methods typically perform 3D object detection or map segmentation tasks independently. For the 3D object detection task, early methods are similar to 2D detection methods [1,26,37,47,51], which usually predict the 3D bounding boxes based on 2D bounding boxes. Wang et al. [43] follow an advanced 2D detector, FCOS [39], and directly predict 3D bounding boxes for each object. DETR3D [45] projects learnable 3D queries into 2D images, and then samples the corresponding features for end-to-end 3D bounding box prediction without NMS post-processing. Another solution is to transform image features into BEV features and predict 3D bounding boxes from the top-down view. Some methods transform image features into BEV features with depth information from depth estimation [44] or a categorical depth distribution [32]. OFT [34] and ImVoxelNet [35] project predefined voxels onto image features to generate the voxel representation of the scene.


Fig. 2. Overall architecture of BEVFormer. (a) The encoder layer of BEVFormer contains grid-shaped BEV queries, temporal self-attention, and spatial cross-attention. (b) In spatial cross-attention, each BEV query only interacts with image features in the regions of interest. (c) In temporal self-attention, each BEV query interacts with two features: the BEV queries at the current timestamp and the BEV features at the previous timestamp.

Recently, M^2BEV [46] further explored the feasibility of simultaneously performing multiple perception tasks based on BEV features. Actually, generating BEV features from multi-camera features is more extensively studied in map segmentation tasks [28,30]. A straightforward method is converting the perspective view into the BEV through Inverse Perspective Mapping (IPM) [5,33]. In addition, Lift-Splat [30] generates the BEV features based on a depth distribution. Other methods [9,16,28] utilize multilayer perceptrons to learn the translation from the perspective view to the BEV. PYVA [49] proposes a cross-view transformer that converts the front-view monocular image into the BEV, but this paradigm is not suitable for fusing multi-camera features due to the computational cost of the global attention mechanism [40]. In addition to spatial information, previous works [6,18,36] also consider temporal information by stacking BEV features from several timestamps. Stacking BEV features constrains the available temporal information within a fixed time duration and brings extra computational cost. In this work, the proposed spatiotemporal transformer generates BEV features of the current time by considering both spatial and temporal clues, and the temporal information is obtained from the previous BEV features in an RNN manner, which brings only a small computational cost.

3 BEVFormer

Converting multi-camera image features to bird’s-eye-view (BEV) features can provide a unified surrounding environment representation for various autonomous driving perception tasks. In this work, we present a new transformer-based framework for BEV generation, which can effectively aggregate spatiotemporal features from multi-view cameras and history BEV features via attention mechanisms.

3.1 Overall Architecture

As illustrated in Fig. 2, BEVFormer has 6 encoder layers, each of which follows the conventional structure of transformers [40], except for three tailored designs, namely BEV queries, spatial cross-attention, and temporal self-attention. Specifically, BEV queries are grid-shaped learnable parameters, which are designed to query features in BEV space from multi-camera views via attention mechanisms. Spatial cross-attention and temporal self-attention are attention layers working with BEV queries, which are used to look up and aggregate spatial features from multi-camera images as well as temporal features from history BEV, according to the BEV query.

During inference, at timestamp t, we feed multi-camera images to the backbone network (e.g., ResNet-101 [15]) and obtain the features F_t = {F_t^i}_{i=1}^{N_view} of different camera views, where F_t^i is the feature of the i-th view and N_view is the total number of camera views. At the same time, we preserve the BEV features B_{t-1} at the prior timestamp t−1. In each encoder layer, we first use BEV queries Q to query the temporal information from the prior BEV features B_{t-1} via the temporal self-attention. We then employ BEV queries Q to inquire about the spatial information from the multi-camera features F_t via the spatial cross-attention. After the feed-forward network [40], the encoder layer outputs the refined BEV features, which are the input of the next encoder layer. After 6 stacked encoder layers, the unified BEV features B_t at the current timestamp t are generated. Taking the BEV features B_t as input, the 3D detection head and map segmentation head predict the perception results such as 3D bounding boxes and semantic maps.
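For illustration, the per-layer data flow described above can be sketched in Python as follows. This is not the released implementation: the class and argument names are hypothetical, and the temporal/spatial attention modules are replaced with standard multi-head attention stand-ins (the actual layers are the deformable-attention variants of Sects. 3.3 and 3.4).

import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Stand-ins for the deformable temporal/spatial attention modules.
        self.temporal_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, bev_queries, prev_bev, cam_feats):
        # bev_queries: (B, H*W, C); prev_bev: (B, H*W, C) or None; cam_feats: (B, N, C)
        if prev_bev is None:          # first frame: fall back to plain self-attention
            prev_bev = bev_queries
        q = self.norms[0](bev_queries + self.temporal_self_attn(bev_queries, prev_bev, prev_bev)[0])
        q = self.norms[1](q + self.spatial_cross_attn(q, cam_feats, cam_feats)[0])
        return self.norms[2](q + self.ffn(q))

Stacking six such layers, each reusing the same prev_bev and cam_feats, yields the refined BEV features B_t.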

3.2 BEV Queries

We predefine a group of grid-shaped learnable parameters Q ∈ R^{H×W×C} as the queries of BEVFormer, where H, W are the spatial shape of the BEV plane. To be specific, the query Q_p ∈ R^{1×C} located at p = (x, y) of Q is responsible for the corresponding grid cell region in the BEV plane. Each grid cell in the BEV plane corresponds to a real-world size of s meters. The center of the BEV features corresponds to the position of the ego car by default. Following common practice [14], we add a learnable positional embedding to the BEV queries Q before inputting them to BEVFormer.
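A minimal sketch of this data structure is given below, using the default shape reported in Sect. 4.2 (200 x 200 grid, C = 256); the variable names are illustrative only.

import torch
import torch.nn as nn

H, W, C = 200, 200, 256                                   # default BEV shape (Sect. 4.2)
bev_queries = nn.Parameter(torch.zeros(H * W, C))         # Q, flattened to (H*W, C)
pos_embed = nn.Parameter(torch.randn(H * W, C) * 0.02)    # learnable positional embedding

queries_in = bev_queries + pos_embed                      # what is fed to the encoder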

3.3 Spatial Cross-attention

Due to the large input scale of multi-camera 3D perception (containing Nview camera views), the computational cost of vanilla multi-head attention [40] is extremely high. Therefore, we develop the spatial cross-attention based on deformable attention [54], which is a resource-efficient attention layer where each BEV query Qp only interacts with its regions of interest across camera views. However, deformable attention is originally designed for 2D perception, so some adjustments are required for 3D scenes.
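Since the spatial cross-attention is built upon deformable attention, a naive NumPy sketch of the aggregation in Eq. (1) may help fix ideas. It is illustrative only: the sampling offsets and attention weights are passed in as arguments rather than predicted from the query (as they are in practice), and bilinear interpolation is replaced by a nearest-neighbour lookup for brevity.

import numpy as np

def deform_attn(q, p, x, W_out, W_val, offsets, attn_w):
    """q: (C,) query; p: (2,) reference point in (col, row) pixels; x: (H, W, C) features.
    W_out: (n_head, C, C_h) weights W_i; W_val: (n_head, C_h, C) weights W'_i.
    offsets: (n_head, n_key, 2); attn_w: (n_head, n_key), each row summing to 1."""
    n_head, n_key, _ = offsets.shape
    out = np.zeros(x.shape[-1])
    for i in range(n_head):
        acc = np.zeros(W_val.shape[1])
        for j in range(n_key):
            # Nearest-neighbour lookup instead of bilinear interpolation.
            cx, cy = np.clip(np.round(p + offsets[i, j]).astype(int),
                             0, [x.shape[1] - 1, x.shape[0] - 1])
            acc += attn_w[i, j] * (W_val[i] @ x[cy, cx])   # A_ij * W'_i x(p + dp_ij)
        out += W_out[i] @ acc                              # W_i applied per head
    return out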


As shown in Fig. 2(b), we first lift each query on the BEV plane to a pillar-like query [20], sample N_ref 3D reference points from the pillar, and then project these points to the 2D views. For one BEV query, the projected 2D points can only fall on some views, and other views are not hit. Here, we term the hit views V_hit. After that, we regard these 2D points as the reference points of the query Q_p and sample the features from the hit views V_hit around these reference points. Finally, we perform a weighted sum of the sampled features as the output of spatial cross-attention. The process of spatial cross-attention (SCA) can be formulated as:

SCA(Q_p, F_t) = \frac{1}{|V_{hit}|} \sum_{i \in V_{hit}} \sum_{j=1}^{N_{ref}} DeformAttn(Q_p, \mathcal{P}(p, i, j), F_t^i),    (2)

where i indexes the camera view, j indexes the reference points, and N_ref is the total number of reference points for each BEV query. F_t^i is the feature of the i-th camera view. For each BEV query Q_p, we use a projection function \mathcal{P}(p, i, j) to get the j-th reference point on the i-th view image.

Next, we introduce how to obtain the reference points on the view image from the projection function \mathcal{P}. We first calculate the real-world location (x', y') corresponding to the query Q_p located at p = (x, y) of Q as in Eq. (3):

x' = (x - \frac{W}{2}) \times s;    y' = (y - \frac{H}{2}) \times s,    (3)

where H, W are the spatial shape of the BEV queries, s is the resolution of the BEV grid, and (x', y') are the coordinates in the frame whose origin is the position of the ego car. In 3D space, the objects located at (x', y') will appear at some height z' on the z-axis, so we predefine a set of anchor heights {z'_j}_{j=1}^{N_ref} to make sure we can capture clues that appear at different heights. In this way, for each query Q_p, we obtain a pillar of 3D reference points (x', y', z'_j)_{j=1}^{N_ref}. Finally, we project the 3D reference points to different image views through the projection matrices of the cameras, which can be written as:

\mathcal{P}(p, i, j) = (x_{ij}, y_{ij}),  where  z_{ij} \cdot [x_{ij}, y_{ij}, 1]^T = T_i \cdot [x', y', z'_j, 1]^T.    (4)

Here, \mathcal{P}(p, i, j) is the 2D point on the i-th view projected from the j-th 3D point (x', y', z'_j), and T_i ∈ R^{3×4} is the known projection matrix of the i-th camera.
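The following sketch illustrates Eqs. (3) and (4) for a single camera view. The defaults (200 x 200 grid, s = 0.512 m, N_ref = 4 anchor heights sampled uniformly in [-5 m, 3 m]) follow Sect. 4.2; the exact anchor values and function names are illustrative assumptions.

import numpy as np

def bev_cell_to_world(x, y, H=200, W=200, s=0.512):
    # Eq. (3): the ego car sits at the center of the BEV grid.
    return (x - W / 2.0) * s, (y - H / 2.0) * s

def project_pillar(x, y, T_i, z_anchors=(-5.0, -2.3333, 0.3333, 3.0)):
    """T_i: known 3x4 projection matrix of camera i."""
    xw, yw = bev_cell_to_world(x, y)
    points_2d = []
    for z in z_anchors:
        # Eq. (4): z_ij * [u, v, 1]^T = T_i [x', y', z'_j, 1]^T
        uh, vh, depth = T_i @ np.array([xw, yw, z, 1.0])
        if depth > 0:                       # the point "hits" this view
            points_2d.append((uh / depth, vh / depth))
    return points_2d                        # reference points for this camera

For the default grid, cell (100, 100) maps to the ego position (0, 0), and cell (0, 0) maps to roughly (-51.2 m, -51.2 m).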

3.4 Temporal Self-attention

In addition to spatial information, temporal information is also crucial for the visual system to understand the surrounding environment [25]. For example, it is challenging to infer the velocity of moving objects or detect highly occluded objects from static images without temporal clues. To address this problem, we design temporal self-attention, which can represent the current environment by incorporating history BEV features.


Given the BEV queries Q at the current timestamp t and the history BEV features B_{t-1} preserved at timestamp t−1, we first align B_{t-1} to Q according to ego-motion, so that the features at the same grid correspond to the same real-world location. Here, we denote the aligned history BEV features as B'_{t-1}. However, from time t−1 to t, movable objects travel in the real world with various offsets, and it is challenging to construct the precise association of the same objects between the BEV features of different times. Therefore, we model this temporal connection between features through the temporal self-attention (TSA) layer, which can be written as follows:

TSA(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q, B'_{t-1}\}} DeformAttn(Q_p, p, V),    (5)

where Q_p denotes the BEV query located at p = (x, y). In addition, different from the vanilla deformable attention, the offsets Δp in temporal self-attention are predicted from the concatenation of Q and B'_{t-1}. Specially, for the first sample of each sequence, the temporal self-attention degenerates into a self-attention without temporal information, where we replace the BEV features {Q, B'_{t-1}} with duplicate BEV queries {Q, Q}.

Compared to simply stacking BEV features as in [6,18,36], our temporal self-attention can more effectively model long temporal dependencies. BEVFormer extracts temporal information from the previous BEV features rather than from multiple stacked BEV features, thus requiring less computational cost and suffering from less disturbing information.
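A schematic sketch of Eq. (5), including the first-frame degeneration, is shown below. The deform_attn and align_to_current callables are placeholders for a deformable-attention layer and an ego-motion warping helper, respectively; both names are hypothetical.

def temporal_self_attention(bev_queries, prev_bev, deform_attn, align_to_current):
    if prev_bev is None:
        # First frame of a sequence: degenerate into plain self-attention
        # by duplicating the current BEV queries, i.e. {Q, Q}.
        values = [bev_queries, bev_queries]
    else:
        prev_bev_aligned = align_to_current(prev_bev)   # B'_{t-1}
        values = [bev_queries, prev_bev_aligned]        # {Q, B'_{t-1}}
    # Eq. (5): sum the deformable-attention outputs over the two value sources.
    return sum(deform_attn(bev_queries, v) for v in values)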

3.5 Applications of BEV Features

Since the BEV features B_t ∈ R^{H×W×C} are a versatile 2D feature map that can be used for various autonomous driving perception tasks, the 3D object detection and map segmentation task heads can be developed based on 2D perception methods [22,54] with minor modifications.

For 3D object detection, we design an end-to-end 3D detection head based on the 2D detector Deformable DETR [54]. The modifications include using the single-scale BEV features B_t as the input of the decoder, predicting 3D bounding boxes and velocity rather than 2D bounding boxes, and only using the L1 loss to supervise 3D bounding box regression. With this detection head, our model can predict 3D bounding boxes and velocity end-to-end without NMS post-processing.

For map segmentation, we design a map segmentation head based on a 2D segmentation method, Panoptic SegFormer [22]. Since map segmentation based on the BEV is basically the same as common semantic segmentation, we utilize the mask decoder of [22] and class-fixed queries to target each semantic category, including car, vehicles, road (drivable area), and lane.

3.6 Implementation Details

Training Phase. For each sample at timestamp t, we randomly sample another three samples from the consecutive sequence of the past 2 s.


This random sampling strategy can augment the diversity of ego-motion [55]. We denote the timestamps of these four samples as t−3, t−2, t−1, and t. The samples of the first three timestamps are responsible for recurrently generating the BEV features {B_{t-3}, B_{t-2}, B_{t-1}}, and this phase requires no gradients. For the first sample at timestamp t−3, there are no previous BEV features, and temporal self-attention degenerates into self-attention. At time t, the model generates the BEV features B_t based on both the multi-camera inputs and the prior BEV features B_{t-1}, so that B_t contains the temporal and spatial clues across the four samples. Finally, we feed the BEV features B_t into the detection and segmentation heads and compute the corresponding loss functions.

Inference Phase. During the inference phase, we evaluate each frame of the video sequence in chronological order. The BEV features of the previous timestamp are saved and used for the next, and this online inference strategy is time-efficient and consistent with practical applications. Although we utilize temporal information, our inference speed is still comparable with that of other methods [43,45].
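The recurrent training strategy can be summarized by the sketch below. The interfaces (model.encode_bev, heads_loss, and the dictionary keys) are assumptions made for illustration, not the released training code.

import torch

def training_step(model, clips, heads_loss):
    """clips: list of four multi-camera samples ordered as [t-3, t-2, t-1, t]."""
    prev_bev = None
    with torch.no_grad():                           # history BEV requires no gradients
        for sample in clips[:-1]:
            prev_bev = model.encode_bev(sample["images"], prev_bev)
    bev_t = model.encode_bev(clips[-1]["images"], prev_bev)  # gradients flow only here
    return heads_loss(bev_t, clips[-1]["targets"])           # detection/segmentation loss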

4 Experiments

4.1 Datasets

We conduct experiments on two challenging public autonomous driving datasets, namely the nuScenes dataset [4] and the Waymo Open Dataset (WOD) [38]; experiments on WOD are presented in the supplementary material. The nuScenes dataset [4] contains 1000 scenes of roughly 20 s duration each, and the key samples are annotated at 2 Hz. Each sample consists of RGB images from 6 cameras and covers a 360° horizontal FOV. For the detection task, there are 1.4M annotated 3D bounding boxes from 10 categories. We follow the settings in [30] to perform the BEV segmentation task. This dataset also provides official evaluation metrics for the detection task. The mean average precision (mAP) of nuScenes is computed using the center distance on the ground plane rather than the 3D Intersection over Union (IoU) to match the predicted results and ground truth. The nuScenes metrics also contain 5 types of true positive metrics (TP metrics), including ATE, ASE, AOE, AVE, and AAE for measuring translation, scale, orientation, velocity, and attribute errors, respectively. nuScenes also defines the nuScenes detection score (NDS) as

NDS = \frac{1}{10} \left[ 5\,\text{mAP} + \sum_{\text{mTP} \in \mathbb{TP}} (1 - \min(1, \text{mTP})) \right]

to capture all aspects of the nuScenes detection tasks.
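As a worked example of the NDS formula, plugging in the BEVFormer test-set numbers from Table 1 recovers the reported score:

mAP = 0.481
tp_metrics = {"mATE": 0.582, "mASE": 0.256, "mAOE": 0.375, "mAVE": 0.378, "mAAE": 0.126}

nds = (5 * mAP + sum(1 - min(1.0, v) for v in tp_metrics.values())) / 10.0
print(round(nds, 3))  # 0.569, matching the reported 56.9% NDS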

4.2 Experimental Settings

Following previous methods [29,43,45], we adopt two types of backbone: ResNet101-DCN [12,15] initialized from the FCOS3D [43] checkpoint, and VoVNet-99 [21] initialized from the DD3D [29] checkpoint. By default, we utilize the output multi-scale features from FPN [23] with sizes of 1/16, 1/32, and 1/64, and the feature dimension C = 256.


Table 1. 3D detection results on nuScenes test set. * notes that VoVNet-99 (V2-99) [21] was pre-trained on the depth estimation task with extra data [29]. "BEVFormer-S" does not leverage temporal information in the BEV encoder. "L" and "C" indicate LiDAR and Camera, respectively.

Method                  | Modality | Backbone | NDS↑  | mAP↑  | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓
SSN [53]                | L        | -        | 0.569 | 0.463 | -     | -     | -     | -     | -
CenterPoint-Voxel [50]  | L        | -        | 0.655 | 0.580 | -     | -     | -     | -     | -
PointPainting [41]      | L&C      | -        | 0.581 | 0.464 | 0.388 | 0.271 | 0.496 | 0.247 | 0.111
FCOS3D [43]             | C        | R101     | 0.428 | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124
PGD [42]                | C        | R101     | 0.448 | 0.386 | 0.626 | 0.245 | 0.451 | 1.509 | 0.127
BEVFormer-S             | C        | R101     | 0.462 | 0.409 | 0.650 | 0.261 | 0.439 | 0.925 | 0.147
BEVFormer               | C        | R101     | 0.535 | 0.445 | 0.631 | 0.257 | 0.405 | 0.435 | 0.143
DD3D [29]               | C        | V2-99*   | 0.477 | 0.418 | 0.572 | 0.249 | 0.368 | 1.014 | 0.124
DETR3D [45]             | C        | V2-99*   | 0.479 | 0.412 | 0.641 | 0.255 | 0.394 | 0.845 | 0.133
BEVFormer-S             | C        | V2-99*   | 0.495 | 0.435 | 0.589 | 0.254 | 0.402 | 0.842 | 0.131
BEVFormer               | C        | V2-99*   | 0.569 | 0.481 | 0.582 | 0.256 | 0.375 | 0.378 | 0.126

Table 2. 3D detection results on nuScenes val set. "C" indicates Camera.

Method        | Modality | Backbone | NDS↑  | mAP↑  | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓
FCOS3D [43]   | C        | R101     | 0.415 | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153
PGD [42]      | C        | R101     | 0.428 | 0.369 | 0.683 | 0.260 | 0.439 | 1.268 | 0.185
DETR3D [45]   | C        | R101     | 0.425 | 0.346 | 0.773 | 0.268 | 0.383 | 0.842 | 0.216
BEVFormer-S   | C        | R101     | 0.448 | 0.375 | 0.725 | 0.272 | 0.391 | 0.802 | 0.200
BEVFormer     | C        | R101     | 0.517 | 0.416 | 0.673 | 0.274 | 0.372 | 0.394 | 0.198

For experiments on nuScenes, the default size of BEV queries is 200 × 200, the perception range is [−51.2 m, 51.2 m] along the X and Y axes, and the resolution s of the BEV grid is 0.512 m. We adopt a learnable positional embedding for the BEV queries. The BEV encoder contains 6 encoder layers and constantly refines the BEV queries in each layer. The input BEV features B_{t-1} for each encoder layer are the same and require no gradients. In the spatial cross-attention module, implemented with the deformable attention mechanism, each BEV query corresponds to N_ref = 4 target points at different heights in 3D space, and the predefined height anchors are sampled uniformly from −5 m to 3 m. For each reference point on the 2D view features, we use four sampling points around this reference point for each head. By default, we train our models for 24 epochs with a learning rate of 2 × 10^-4.

Baselines. To eliminate the effect of task heads and compare other BEV generation methods fairly, we use VPN [28] and Lift-Splat [30] to replace our BEVFormer and keep the task heads and other settings the same. We also adapt BEVFormer into a static model called BEVFormer-S by replacing the temporal self-attention with vanilla self-attention that does not use history BEV features.
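The quoted grid resolution follows directly from the perception range and grid size; a quick arithmetic check:

bev_size = 200
perception_range = 51.2          # meters, symmetric around the ego car
s = 2 * perception_range / bev_size
print(s)                         # 0.512 m per BEV grid cell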

4.3 3D Object Detection Results

We train our model on the detection task with the detection head only, for a fair comparison with previous state-of-the-art 3D object detection methods. In Table 1 and Table 2, we report our main results on the nuScenes test and val splits. Our method outperforms the previous best method DETR3D [45] by 9.2 points on the val set (51.7% NDS vs. 42.5% NDS), under a fair training strategy and comparable model scale. On the test set, our model achieves 56.9% NDS without bells and whistles, 9.0 points higher than DETR3D (47.9% NDS). Our method even achieves performance comparable to some LiDAR-based baselines such as SSN (56.9% NDS) [53] and PointPainting (58.1% NDS) [41]. Previous camera-based methods [29,43,45] were almost unable to estimate velocity, and our method demonstrates that temporal information plays a crucial role in velocity estimation for multi-camera detection. The mean Average Velocity Error (mAVE) of BEVFormer is 0.378 m/s on the test set, outperforming other camera-based methods by a vast margin and approaching the performance of LiDAR-based methods [41].

Table 3. 3D detection and map segmentation results on nuScenes val set. Comparison of training the segmentation and detection tasks jointly or not. *: We use VPN [28] and Lift-Splat [30] to replace our BEV encoder for comparison, and the task heads are the same. †: Results from their papers.

Method            | Det | Seg | NDS↑  | mAP↑  | Car  | Vehicles | Road | Lane   (BEV Segmentation in IoU)
Lift-Splat† [30]  | ✗   | ✓   | -     | -     | 32.1 | 32.1     | 72.9 | 20.0
FIERY† [18]       | ✗   | ✓   | -     | -     | -    | 38.2     | -    | -
VPN* [28]         | ✓   | ✗   | 0.333 | 0.253 | -    | -        | -    | -
VPN*              | ✗   | ✓   | -     | -     | 31.8 | 31.0     | 76.9 | 19.4
VPN*              | ✓   | ✓   | 0.334 | 0.257 | 37.3 | 36.6     | 76.0 | 18.0
Lift-Splat* [30]  | ✓   | ✗   | 0.397 | 0.348 | -    | -        | -    | -
Lift-Splat*       | ✗   | ✓   | -     | -     | 41.7 | 42.1     | 77.7 | 20.0
Lift-Splat*       | ✓   | ✓   | 0.410 | 0.344 | 42.8 | 43.0     | 73.9 | 18.3
BEVFormer-S       | ✓   | ✗   | 0.448 | 0.375 | -    | -        | -    | -
BEVFormer-S       | ✗   | ✓   | -     | -     | 43.2 | 43.1     | 80.7 | 21.3
BEVFormer-S       | ✓   | ✓   | 0.453 | 0.380 | 44.4 | 44.3     | 77.6 | 19.8
BEVFormer         | ✓   | ✗   | 0.517 | 0.416 | -    | -        | -    | -
BEVFormer         | ✗   | ✓   | -     | -     | 44.8 | 44.8     | 80.1 | 25.7
BEVFormer         | ✓   | ✓   | 0.520 | 0.412 | 46.7 | 46.8     | 77.5 | 23.9

4.4 Multi-tasks Perception Results

We train our model with both detection and segmentation heads to verify the learning ability of our model for multiple tasks, and the results are shown in Table 3. When comparing different BEV encoders under the same settings, BEVFormer achieves higher performance on all tasks, except that its road segmentation result is comparable to that of BEVFormer-S. For example, with joint training, BEVFormer outperforms Lift-Splat* [30] by 11.0 points on the detection task (52.0% NDS vs. 41.0% NDS) and by 5.6 IoU points on lane segmentation (23.9% vs. 18.3%). Compared with training tasks individually, multi-task learning saves computational cost and reduces the inference time by sharing more modules, including the backbone and the BEV encoder. In this paper, we show that the BEV features generated by our BEV encoder can be well adapted to different tasks, and the model trained with multi-task heads performs even better on the detection task and vehicle segmentation. However, the jointly trained model does not perform as well as individually trained models for road and lane segmentation, which is a common phenomenon called negative transfer [11,13] in multi-task learning.

4.5 Ablation Study

To delve into the effect of different modules, we conduct ablation experiments on the nuScenes val set with the detection head. More ablation studies are in the Appendix.

Effectiveness of Spatial Cross-attention. To verify the effect of spatial cross-attention, we use BEVFormer-S to perform ablation experiments to exclude the interference of temporal information, and the results are shown in Table 4. The default spatial cross-attention is based on deformable attention. For comparison, we also construct two other baselines with different attention mechanisms: (1) using global attention to replace deformable attention; (2) making each query only interact with its reference points rather than the surrounding local regions, which is similar to previous methods [34,35]. For a broader comparison, we also replace BEVFormer with the BEV generation methods proposed by VPN [28] and Lift-Splat [30]. We can observe that deformable attention significantly outperforms the other attention mechanisms under a comparable model scale. Global attention consumes too much GPU memory, and point interaction has a limited receptive field. Sparse attention achieves better performance because it interacts with a priori determined regions of interest, balancing the receptive field and GPU consumption.

Effectiveness of Temporal Self-attention. From Table 1 and Table 3, we can observe that BEVFormer outperforms BEVFormer-S with remarkable improvements under the same setting, especially on challenging detection tasks. The effect of temporal information lies mainly in the following aspects: (1) the introduction of temporal information greatly benefits the accuracy of velocity estimation; (2) the predicted locations and orientations of the objects are more accurate with temporal information; (3) we obtain higher recall on heavily occluded objects since the temporal information contains past object clues, as shown in Fig. 3.


Table 4. The detection results of different methods with various BEV encoders on nuScenes val set. "Memory" is the consumed GPU memory during training. *: We use VPN [28] and Lift-Splat [30] to replace the BEV encoder of our model for comparison. †: We train BEVFormer-S using global attention in spatial cross-attention, and the model is trained with fp16 weights. In addition, we only adopt single-scale features from the backbone and set the spatial shape of BEV queries to 100 × 100 to save memory. ‡: We degrade the interaction targets of deformable attention from the local region to the reference points only by removing the predicted offsets and weights.

Method           | Attention | NDS↑  | mAP↑  | mATE↓ | mAOE↓ | #Param. | FLOPs   | Memory
VPN* [28]        | -         | 0.334 | 0.252 | 0.926 | 0.598 | 111.2M  | 924.5G  | ~20G
Lift-Splat* [30] | -         | 0.397 | 0.348 | 0.784 | 0.537 | 74.0M   | 1087.7G | ~20G
BEVFormer-S†     | Global    | 0.404 | 0.325 | 0.837 | 0.442 | 62.1M   | 1245.1G | ~36G
BEVFormer-S‡     | Points    | 0.423 | 0.351 | 0.753 | 0.442 | 68.1M   | 1264.3G | ~20G
BEVFormer-S      | Local     | 0.448 | 0.375 | 0.725 | 0.391 | 68.7M   | 1303.5G | ~20G

To evaluate the performance of BEVFormer on objects with different occlusion levels, we divide the validation set of nuScenes into four subsets according to the official visibility labels provided by nuScenes. In each subset, we also compute the average recall of all categories with a center distance threshold of 2 m during matching. The maximum number of predicted boxes is 300 for all methods to compare recall fairly. On the subset where only 0-40% of each object is visible, the average recall of BEVFormer outperforms BEVFormer-S and DETR3D by a margin of more than 6.0%.

Model Scale and Latency. We compare the performance and latency of different configurations in Table 5. We ablate the scale of BEVFormer in three aspects, namely whether to use multi-scale view features, the shape of BEV queries, and the number of layers, to verify the trade-off between performance and inference latency. We can observe that configuration C, using one encoder layer in BEVFormer, achieves 50.1% NDS and reduces the latency of BEVFormer from the original 130 ms to 25 ms. Configuration D, with single-scale view features, a smaller BEV size, and only 1 encoder layer, consumes only 7 ms during inference, although it loses 3.9 points compared to the default configuration. However, due to the multi-view image inputs, the bottleneck that limits efficiency lies in the backbone, and efficient backbones for autonomous driving deserve in-depth study. Overall, our architecture can adapt to various model scales and flexibly trade off performance and efficiency.
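The per-visibility recall protocol described above can be sketched as follows. This is a simplified illustration under assumed data structures: class-wise matching and one-to-one assignment between predictions and ground truths are omitted for brevity.

import numpy as np

def recall_by_visibility(gts, preds, thresh=2.0, max_boxes=300):
    """gts: dicts with 'center' (x, y) and 'visibility' in {0, 1, 2, 3};
    preds: dicts with 'center' and 'score'."""
    preds = sorted(preds, key=lambda p: -p["score"])[:max_boxes]
    recalls = {}
    for level in (0, 1, 2, 3):          # 0-40%, 40-60%, 60-80%, 80-100% visible
        subset = [g for g in gts if g["visibility"] == level]
        if not subset:
            continue
        hits = 0
        for g in subset:
            dists = [np.hypot(g["center"][0] - p["center"][0],
                              g["center"][1] - p["center"][1]) for p in preds]
            if dists and min(dists) < thresh:   # matched within 2 m center distance
                hits += 1
        recalls[level] = hits / len(subset)
    return recalls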

4.6 Visualization Results

We show the detection results of a complex scene in Fig. 4. BEVFormer produces impressive results, except for a few mistakes on small and distant objects. More qualitative results are provided in the Appendix.

Fig. 3. The detection results of subsets with different visibilities. We divide the nuScenes val set into four subsets based on how much of each object is visible: {0-40%, 40-60%, 60-80%, 80-100%}. (a): Enhanced by the temporal information, BEVFormer has a higher recall on all subsets, especially on the subset with the lowest visibility (0-40%). (b), (d) and (e): Temporal information benefits translation, orientation, and velocity accuracy. (c) and (f): The scale and attribute error gaps among different methods are minimal; temporal information does not benefit an object's scale prediction.

5 Discussion and Conclusion

In this work, we have proposed BEVFormer to generate bird's-eye-view features from multi-camera inputs. BEVFormer can efficiently aggregate spatial and temporal information and generate powerful BEV features that simultaneously support 3D detection and map segmentation tasks.

Table 5. Latency and performance of different model configurations on nuScenes val set. The latency is measured on a V100 GPU, and the backbone is R101-DCN. The input image shape is 900 × 1600. "MS" denotes multi-scale view features.

Method    | MS | BEV       | #Layer | Backbone (ms) | BEVFormer (ms) | Head (ms) | FPS | NDS↑  | mAP↑
BEVFormer | ✓  | 200 × 200 | 6      | 391           | 130            | 19        | 1.7 | 0.517 | 0.416
A         | ✗  | 200 × 200 | 6      | 387           | 87             | 19        | 1.9 | 0.511 | 0.406
B         | ✓  | 100 × 100 | 6      | 391           | 53             | 18        | 2.0 | 0.504 | 0.402
C         | ✓  | 200 × 200 | 1      | 391           | 25             | 19        | 2.1 | 0.501 | 0.396
D         | ✗  | 100 × 100 | 1      | 387           | 7              | 18        | 2.3 | 0.478 | 0.374


Fig. 4. Visualization results of BEVFormer on nuScenes val set. We show the 3D bounding box predictions in the multi-camera images and in the bird's-eye view.

Limitations. At present, camera-based methods still have a clear gap to LiDAR-based methods in both effectiveness and efficiency. Accurate inference of 3D locations from 2D information remains a long-standing challenge for camera-based methods.

Broader Impacts. BEVFormer demonstrates that using spatiotemporal information from multi-camera input can significantly improve the performance of visual perception models. The advantages demonstrated by BEVFormer, such as more accurate velocity estimation and higher recall on low-visibility objects, are essential for constructing a better and safer autonomous driving system and beyond. We believe BEVFormer is just a baseline for more powerful visual perception methods to follow, and vision-based perception systems still have tremendous potential to be explored.

Acknowledgement. This work is supported by the Natural Science Foundation of China under Grant 61672273 and Grant 61832008, the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100), and Shanghai AI Laboratory. This work was done while Zhiqi Li was an intern at Shanghai AI Laboratory.

References

1. Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9287-9296 (2019)
2. Brazil, G., Pons-Moll, G., Liu, X., Schiele, B.: Kinematic 3D object detection in monocular video. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 135-152. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_9
3. Bruls, T., Porav, H., Kunze, L., Newman, P.: The right (angled) perspective: improving the understanding of road scenes using boosted inverse perspective mapping. In: 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 302-309. IEEE (2019)
4. Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621-11631 (2020)


5. Can, Y.B., Liniger, A., Paudel, D.P., Van Gool, L.: Structured bird's-eye-view traffic scene understanding from onboard images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15661-15670 (2021)
6. Can, Y.B., Liniger, A., Unal, O., Paudel, D., Van Gool, L.: Understanding bird's-eye view semantic HD-maps using an onboard monocular camera. arXiv preprint arXiv:2012.03040 (2020)
7. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213-229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
8. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907-1915 (2017)
9. Chitta, K., Prakash, A., Geiger, A.: Neat: neural attention fields for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15793-15803 (2021)
10. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
11. Crawshaw, M.: Multi-task learning with deep neural networks: a survey. arXiv preprint arXiv:2009.09796 (2020)
12. Dai, J., et al.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 764-773 (2017)
13. Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., Finn, C.: Efficiently identifying task groupings for multi-task learning. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
14. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: International Conference on Machine Learning, pp. 1243-1252. PMLR (2017)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778 (2016)
16. Hendy, N., et al.: Fishing net: future inference of semantic heatmaps in grids. arXiv preprint arXiv:2006.09917 (2020)
17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (1997)
18. Hu, A., et al.: Fiery: future instance prediction in bird's-eye view from surround monocular cameras. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15273-15282 (2021)
19. Kang, K., Ouyang, W., Li, H., Wang, X.: Object detection from video tubelets with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 817-825 (2016)
20. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697-12705 (2019)
21. Lee, Y., Hwang, J.W., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
22. Li, Z., et al.: Panoptic segformer: delving deeper into panoptic segmentation with transformers. arXiv preprint arXiv:2109.03814 (2021)


23. Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936-944 (2017)
24. Luo, W., Yang, B., Urtasun, R.: Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569-3577 (2018)
25. Ma, X., Ouyang, W., Simonelli, A., Ricci, E.: 3D object detection from images for autonomous driving: a survey. arXiv preprint arXiv:2202.02980 (2022)
26. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074-7082 (2017)
27. Ng, M.H., Radia, K., Chen, J., Wang, D., Gog, I., Gonzalez, J.E.: BEV-seg: bird's eye view semantic segmentation using geometry and semantic point cloud. arXiv preprint arXiv:2006.11436 (2020)
28. Pan, B., Sun, J., Leung, H.Y.T., Andonian, A., Zhou, B.: Cross-view semantic segmentation for sensing surroundings. IEEE Robot. Autom. Lett. 5(3), 4867-4873 (2020)
29. Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is pseudo-lidar needed for monocular 3D object detection? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3142-3152 (2021)
30. Philion, J., Fidler, S.: Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 194-210. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_12
31. Qi, C.R., et al.: Offboard 3D object detection from point cloud sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6134-6144 (2021)
32. Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical depth distribution network for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8555-8564 (2021)
33. Reiher, L., Lampe, B., Eckstein, L.: A Sim2Real deep learning approach for the transformation of images from multiple vehicle-mounted cameras to a semantically segmented image in bird's eye view. In: 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1-7. IEEE (2020)
34. Roddick, T., Kendall, A., Cipolla, R.: Orthographic feature transform for monocular 3D object detection. In: BMVC (2019)
35. Rukhovich, D., Vorontsova, A., Konushin, A.: Imvoxelnet: image to voxels projection for monocular and multi-view general-purpose 3D object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2397-2406 (2022)
36. Saha, A., Maldonado, O.M., Russell, C., Bowden, R.: Translating images into maps. arXiv preprint arXiv:2110.00966 (2021)
37. Simonelli, A., Bulo, S.R., Porzi, L., Lopez-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
38. Sun, P., et al.: Scalability in perception for autonomous driving: waymo open dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446-2454 (2020)


39. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627-9636 (2019)
40. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
41. Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: sequential fusion for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4604-4612 (2020)
42. Wang, T., Xinge, Z., Pang, J., Lin, D.: Probabilistic and geometric depth: detecting objects in perspective. In: Conference on Robot Learning, pp. 1475-1485. PMLR (2022)
43. Wang, T., Zhu, X., Pang, J., Lin, D.: FCOS3D: fully convolutional one-stage monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 913-922 (2021)
44. Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.Q.: Pseudo-lidar from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8445-8453 (2019)
45. Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning, pp. 180-191. PMLR (2022)
46. Xie, E., et al.: M^2BEV: multi-camera joint 3D detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088 (2022)
47. Xu, B., Chen, Z.: Multi-level fusion based 3D object detection from monocular images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2345-2353 (2018)
48. Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
49. Yang, W., et al.: Projecting your view attentively: monocular road scene layout estimation via cross-view transformation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15536-15545 (2021)
50. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784-11793 (2021)
51. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
52. Zhou, Y., Tuzel, O.: Voxelnet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490-4499 (2018)
53. Zhu, X., Ma, Y., Wang, T., Xu, Y., Shi, J., Lin, D.: SSN: shape signature networks for multi-class object detection from point clouds. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 581-597. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_35
54. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: International Conference on Learning Representations (2020)
55. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349-2358 (2017)

Category-Level 6D Object Pose and Size Estimation Using Self-supervised Deep Prior Deformation Networks

Jiehong Lin (1,2), Zewei Wei (1), Changxing Ding (1), and Kui Jia (1,3)

1 South China University of Technology, Guangzhou, China ({lin.jiehong,eeweizewei}@mail.scut.edu.cn, {chxding,kuijia}@scut.edu.cn)
2 DexForce Co. Ltd., Shenzhen, China
3 Peng Cheng Laboratory, Shenzhen, China

Abstract. It is difficult to precisely annotate object instances and their semantics in 3D space, and as such, synthetic data are extensively used for these tasks, e.g., category-level 6D object pose and size estimation. However, the easy annotations in synthetic domains bring the downside effect of a synthetic-to-real (Sim2Real) domain gap. In this work, we aim to address this issue in the task setting of Sim2Real, unsupervised domain adaptation for category-level 6D object pose and size estimation. We propose a method that is built upon a novel Deep Prior Deformation Network, shortened as DPDN. DPDN learns to deform features of categorical shape priors to match those of object observations, and is thus able to establish deep correspondence in the feature space for direct regression of object poses and sizes. To reduce the Sim2Real domain gap, we formulate a novel self-supervised objective upon DPDN via consistency learning; more specifically, we apply two rigid transformations to each object observation in parallel, and feed them into DPDN respectively to yield dual sets of predictions; on top of the parallel learning, an inter-consistency term is employed to keep cross consistency between the dual predictions for improving the sensitivity of DPDN to pose changes, while individual intra-consistency ones are used to enforce self-adaptation within each learning itself. We train DPDN on both training sets of the synthetic CAMERA25 and real-world REAL275 datasets; our results outperform the existing methods on the REAL275 test set under both the unsupervised and supervised settings. Ablation studies also verify the efficacy of our designs. Our code is released publicly at https://github.com/JiehongLin/Self-DPDN.

Keywords: 6D pose estimation · Self-supervised learning

1 Introduction

The task of category-level 6D object pose and size estimation, formally introduced in [24], is to estimate the rotations, translations, and sizes of unseen object instances of certain categories in cluttered RGB-D scenes.

(Supplementary material is available at https://doi.org/10.1007/978-3-031-20077-9_2.)


Fig. 1. An illustration of our proposed self-supervised Deep Prior Deformation Network (DPDN). DPDN deforms categorical shape priors in the feature space to pair with object observations, and establishes deep correspondence for direct estimates of object poses and sizes; upon DPDN, a novel self-supervised objective is designed to reduce the synthetic-to-real domain gap via consistency learning. Specifically, we apply two rigid transformations to the point set P_o of an object observation, and feed them into DPDN in parallel to make dual sets of predictions; on top of the parallel learning, an inter-consistency term between the dual predictions is then combined with individual intra-consistency ones within each learning to form the self-supervision. For simplicity, we omit the image input of the object observation. Notations are explained in Sect. 3.

It plays a crucial role in many real-world applications, such as robotic grasping [17,27], augmented reality [2], and autonomous driving [5,6,12,26]. For this task, existing methods can be roughly categorized into two groups, i.e., those based on direct regression and those based on dense correspondence learning. Methods of the former group [4,14,15] are conceptually simple, but struggle to learn pose-sensitive features such that direct predictions can be made in the full SE(3) space; dense correspondence learning [3,11,20,24,25] makes the task easier by first regressing point-wise coordinates in the canonical space to align with points of observations, and then obtaining object poses and sizes via the Umeyama algorithm [21]. Recent works [3,11,20,25] of the second group exploit strong categorical priors (e.g., mean shapes of object categories) for improving the quality of the canonical point sets, and consistently achieve impressive results; however, their surrogate objectives for the learning of canonical coordinates are one step away from the true ones for estimating object poses and sizes, making their learning suboptimal for the end task.

The considered learning task is further challenged by the lack of real-world RGB-D data with careful object pose and size annotations in 3D space. As such, synthetic data are usually simulated and rendered, whose annotations can be freely obtained on the fly [7,24]. However, the easy annotations in synthetic domains bring a downside effect of a synthetic-to-real (Sim2Real) domain gap; learning with synthetic data with no consideration of Sim2Real domain adaptation would inevitably result in poor generalization in the real-world domain. This naturally falls in the realm of Sim2Real, unsupervised domain adaptation (UDA) [1,11,16,19,23,28-30].


In this work, we consider the task setting of Sim2Real UDA for category-level 6D object pose and size estimation. We propose a new method of self-supervised Deep Prior Deformation Network; Fig. 1 gives an illustration. Following dense correspondence learning, we first present a novel Deep Prior Deformation Network, shortened as DPDN, which implements a deep version of shape prior deformation in the feature space, and is thus able to establish deep correspondence for direct regression of poses and sizes with high precision. For a cluttered RGB-D scene, we employ a 2D instance segmentation network (e.g., Mask R-CNN [8]) to segment the objects of interest out, and feed them into our proposed DPDN for pose and size estimation. As shown in Fig. 2, the architecture of DPDN consists of three main modules: a Triplet Feature Extractor, a Deep Prior Deformer, and a Pose and Size Estimator. For an object observation, the Triplet Feature Extractor learns point-wise features from its image crop, point set, and categorical shape prior, respectively; then the Deep Prior Deformer deforms the prior in feature space by learning a feature deformation field and a correspondence matrix, and thus builds deep correspondence from the observation to its canonical version; finally, the Pose and Size Estimator is used to make reliable predictions directly from the built deep correspondence.

On top of DPDN, we formulate a self-supervised objective that combines an inter-consistency term with two intra-consistency ones for UDA. More specifically, as shown in Fig. 1, we apply two rigid transformations to an input point set of an object observation, and feed them into our DPDN in parallel to make dual sets of predictions. Upon the above parallel learning, the inter-consistency term enforces cross consistency between the dual predictions w.r.t. the two transformations for improving the sensitivity of DPDN to pose changes, and within each learning, the individual intra-consistency term is employed to enforce self-adaptation between the correspondence and the predictions. We train DPDN on both training sets of the synthetic CAMERA25 and real-world REAL275 datasets [24]; our results outperform the existing methods on the REAL275 test set under both unsupervised and supervised settings. We also conduct ablation studies that confirm the advantages of our designs. Our contributions can be summarized as follows:

– We propose a Deep Prior Deformation Network, termed DPDN, for the task of category-level 6D object pose and size estimation. DPDN deforms categorical shape priors to pair with object observations in the feature space, and is thus able to establish deep correspondence for direct regression of object poses and sizes.
– Given that the considered task largely uses synthetic training data, we formulate a novel self-supervised objective upon DPDN to reduce the synthetic-to-real domain gap. The objective is built upon enforcing consistencies between parallel learning w.r.t. two rigid transformations, and has the effects of both improving the sensitivity of DPDN to pose changes and making predictions more reliable.
– We conduct thorough ablation studies to confirm the efficacy of our designs. Notably, our method outperforms existing ones on the benchmark dataset of real-world REAL275 under both the unsupervised and supervised settings.

2 Related Work

Fully-Supervised Methods. Methods of fully-supervised category-level 6D pose and size estimation can be roughly divided into two groups, i.e., those based on direct regression [4,14,15] and those based on dense correspondence learning [3,20,24,25]. Direct estimates of object poses and sizes from object observations suffer from the difficulties in the learning of the full SE(3) space, and thus demand the extraction of pose-sensitive features. FS-Net [4] builds an orientation-aware backbone with 3D graph convolutions to encode object shapes, and makes predictions with a decoupled rotation mechanism. DualPoseNet [15] encodes pose-sensitive features from object observations based on rotation-equivariant spherical convolutions, while two parallel pose decoders with different working mechanisms are stacked to impose complementary supervision. A recent work, SS-ConvNet [14], designs Sparse Steerable Convolutions (SS-Conv) to further explore SE(3)-equivariant feature learning, and presents a two-stage pose estimation pipeline upon SS-Convs for iterative pose refinement. Another group of works first learn coordinates of object observations in the canonical space to establish dense correspondence, and then obtain object poses and sizes by solving the Umeyama algorithm on the correspondence in 3D space. NOCS [24], the first work for our focused task, is realized in this way by directly regressing canonical coordinates from RGB images. SPD [20] then makes the learning of canonical points easier by deforming categorical shape priors, rather than directly regressing from object observations. The follow-up works also confirm the advantages of shape priors, and make efforts on the prior deformation to further improve the quality of the canonical points, e.g., via recurrent reconstruction for iterative refinement [25], or structure-guided adaptation based on a transformer [3].

Unsupervised Methods. Due to the time-consuming and labor-intensive annotation of real-world data in 3D space, UDA-COPE [11] presents a new setting of unsupervised domain adaptation for the focused task, and adapts a teacher-student scheme with bidirectional point filtering to this setting, which, however, heavily relies on the quality of pseudo labels. In this paper, we exploit inter-/intra-consistency in the self-supervised objective to explore the data characteristics of real-world data and fit the data for the reduction of the domain gap.

3 Self-supervised Deep Prior Deformation Network

Given a cluttered RGB-D scene, the goal of Category-Level 6D Object Pose and Size Estimation is to detect object instances of interest with compact 3D bounding boxes, each of which is represented by a rotation R ∈ SO(3), a translation t ∈ R^3, and a size s ∈ R^3 w.r.t. the categorical canonical space. A common practice to deal with this complicated task is decoupling it into two steps, including 1) object detection/instance segmentation, and 2) object pose and size estimation. For the first step, there exist quite mature techniques to accomplish it effectively, e.g., employing an off-the-shelf Mask R-CNN [8] to segment the object instances of interest; for the second step, however, it is still challenging to directly regress poses of unknown objects, especially due to the learning in SO(3) space. To settle this problem, we propose in this paper a novel Deep Prior Deformation Network Φ, shortened as DPDN, which deforms categorical shape priors to match object observations in the feature space, and estimates object poses and sizes directly from the built deep correspondence; Fig. 2 gives an illustration. We detail the architecture of Φ in Sect. 3.1.

Another challenge of this task is the difficulty in precisely annotating real-world data in 3D space. Although synthetic data at scale are available for the learning of deep models, the resulting models are often less precise than those trained with annotated real-world data, due to the downside effect of the large domain gap. To this end, we take a mixture of labeled synthetic data and unlabeled real-world data for training, and design a novel self-supervised objective for synthetic-to-real (Sim2Real) unsupervised domain adaptation. Specifically, given a mini-batch of B training instances {V_i}_{i=1}^{B}, we solve the following optimization problem on top of Φ:

\min_{\Phi} \; \sum_{i=1}^{B} \Big[ \frac{1}{B_1}\,\alpha_i\,\mathcal{L}^{\mathcal{V}_i}_{\mathrm{supervised}} + \frac{1}{B_2}\,(1-\alpha_i)\,\mathcal{L}^{\mathcal{V}_i}_{\mathrm{self\text{-}supervised}} \Big],   (1)

with B_1 = \sum_{i=1}^{B}\alpha_i and B_2 = \sum_{i=1}^{B}(1-\alpha_i). Here {\alpha_i}_{i=1}^{B} is a binary mask; α_i = 1 if the observation of V_i is fully annotated and α_i = 0 otherwise. In Sect. 3.2, we give a detailed illustration of the self-supervised objective L_self-supervised, which enforces inter-consistency and intra-consistency upon DPDN, while the supervised objective L_supervised is described in Sect. 3.3.
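As a concrete reading of Eq. (1), the following minimal PyTorch-style sketch shows how the supervised and self-supervised per-instance losses can be mixed with the binary mask α within one mini-batch. The function name and the clamp guards are our own illustrative additions, not part of the paper.

```python
import torch

def mixed_batch_loss(sup_losses, selfsup_losses, alpha):
    """Combine per-instance losses as in Eq. (1).

    sup_losses, selfsup_losses: tensors of shape (B,) holding L_supervised and
    L_self-supervised for every instance in the mini-batch.
    alpha: binary mask of shape (B,); 1 for fully annotated instances,
    0 for unlabeled real-world instances.
    """
    alpha = alpha.float()
    B1 = alpha.sum().clamp(min=1.0)          # number of labeled instances
    B2 = (1.0 - alpha).sum().clamp(min=1.0)  # number of unlabeled instances
    return (alpha * sup_losses).sum() / B1 + ((1.0 - alpha) * selfsup_losses).sum() / B2
```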

3.1 Deep Prior Deformation Network

Fig. 2. An illustration of our Deep Prior Deformation Network (DPDN). For an object observation V, we take its image crop Io, point set Po, and shape prior Qc of the same category as inputs of a Triplet Feature Extractor, for the learning of their point-wise features FIo, FPo, and FQc, respectively; then a Deep Prior Deformer is employed to learn a feature deformation field FD and a correspondence matrix A to deform FQc, yielding a feature map FQo in the canonical space to pair with FPo; a Pose and Size Estimator makes final predictions (R, t, s) directly from the built deep correspondence between FPo and FQo.

For an object instance V belonging to a category c of interest, we represent its RGB-D observation in the scene as (Io, Po), where Io ∈ R^{H×W×3} denotes the RGB segment compactly containing the instance with a spatial size of H × W, and Po ∈ R^{N×3} denotes the masked point set with N object surface points. Direct regression of object pose and size from (Io, Po) struggles with the learning in SO(3) space without an object CAD model [4,14,15]. Alternatively, a recent group of works [3,11,20,25] achieve impressive results by taking advantage of strong categorical priors to establish dense point-wise correspondence, from which object pose and size can be obtained by solving the Umeyama algorithm [21]. Specifically, assuming Qc ∈ R^{M×3} is a point set with M points sampled from the shape prior of c, a point-wise deformation field D ∈ R^{M×3} and a correspondence matrix A ∈ R^{N×M} are learned from the triplet (Io, Po, Qc). D contains point-wise deviations with respect to Qc, deforming Qc to Qv ∈ R^{M×3}, which represents a complete shape of V in the canonical space. A models the relationships between points in Qv and Po, serving as a sampler from Qv to generate a partial point set Qo ∈ R^{N×3}, paired with Po, as follows:

Q_o = A × Q_v = A × (Q_c + D).   (2)

Finally, solving the Umeyama algorithm to align Qo with Po gives the target pose and size. However, surrogate objectives for the learning of A and D are a step away from the true ones for estimates of pose and size; for example, small deviations of A or D may lead to large changes in the pose space. Thereby, we present a Deep Prior Deformation Network (DPDN), which implements a deep version of (2) as follows:

F_{Q_o} = A × F_{Q_v} = A × (F_{Q_c} + F_D),   (3)

where FQc, FQv, and FQo denote point-wise features of Qc, Qv, and Qo, respectively, and FD is a feature deformation field w.r.t. FQc. The deep version (3) deforms Qc in the feature space, such that features of Qo and Po are paired to establish deep correspondence, from which object pose and size (R, t, s) can be predicted via a subsequent network. Direct regression from deep correspondence thus alleviates the difficulties encountered by regression from the object observation. We note that upon the correspondence and the predictions, a self-supervised signal of intra-consistency can also be built for unlabeled data (see Sect. 3.2). As depicted in Fig. 2, the architecture of DPDN consists of three main modules: a Triplet Feature Extractor, a Deep Prior Deformer, and a Pose and Size Estimator. We detail them below.

Triplet Feature Extractor. Given the inputs of object observation (Io, Po) and categorical shape prior Qc, we firstly extract their point-wise features
(FIo , FPo ) ∈ (RN ×d , RN ×d ) and FQc ∈ RM ×d , where d denotes the number of feature channels. Following [22], we firstly employ a PSP network [31] with ResNet-18 [9] to learn pixel-wise appearance features of Io , and then select those corresponding to Po out to form FIo . For both Po and Qc , two networks of PointNet++ [18] decorated with 4 set abstract levels are individually applied to extract their point-wise geometric features FPo and FQc . Deep Prior Deformer. After obtaining FIo , FPo and FQc , the goal of Deep Prior Deformer is to learn the feature deformation field FD ∈ RM ×d and the correspondence matrix A ∈ RN ×M in (3), and then implement (3) in the feature space to establish deep correspondence. Specifically, as shown in Fig. 2, we obtain global feature vectors f Io ∈ Rd and f Po ∈ Rd of Io and Po , respectively, by averaging their point-wise features FIo and FPo ; then each point feature of FQc is fused with f Io and f Po , and fed into a subnetwork of Multi-Layer Perceptron (MLP) to learn its deformation. Collectively, the whole feature deformation field FD could be learned as follows: FD = MLP([FQc , TileM (f Io ), TileM (f Po )]), s.t. f Io = AvgPool(FIo ), f Po = AvgPool(FPo ),

(4)

where [·, ·] denotes concatenation along feature dimension, MLP(·) denotes a trainable subnetwork of MLP, AvgPool(·) denotes an average-pooling operation over surface points, and TileM (·) denotes M copies of the feature vector. FD is used to deform the deep prior FQc to match V in the feature space. Thereby, according to (3), we have FQv = FQc + FD , with a global feature f Qv generated by averaging FQv over M points. Then A could be learned from the fusion of FIo , FPo , and N copies of f Qv , via another MLP as follows: A = MLP([FIo , FPo , TileN (f Qv )]), s.t. f Qv = AvgPool(FQv ) = AvgPool(FQc + FD ).

(5)

Compared to the common practice in [3,20,25] to learn A by fusing FIo and FPo with N copies of f Qc = AvgPool(FQc ), our deformed version of the deep prior F Qc via adding FD could effectively improve the quality of A. We also learn Qv in (2) from FQv as follows: Qv = MLP(FQv ) = MLP(FQc + FD ),

(6)

such that according to (2) and (3), we have Qo = A × Qv and FQo = A × FQv , respectively. Supervisions on Qv and Qo could guide the learning of FD and A. Pose and Size Estimator. Through the module of Deep Prior Deformer, we establish point-to-point correspondence for the observed Po with FPo by learning Qo in the canonical space with FQo . As shown in Fig. 2, for estimating object pose and size, we firstly pair the correspondence via feature concatenation and apply an MLP to lift the features as follows: Fcorr = MLP([FIo , MLP(Po ), FPo , MLP(Qo ), FQo ]).

(7)

26

J. Lin et al.

We then inject global information into the point-wise correspondence in Fcorr by concatenating its averaged feature f corr , followed by an MLP to strengthen the correspondence; a pose-sensitive feature vector f pose is learned from all the correspondence information via an average-pooling operation: f pose = AvgPool(MLP([Fcorr , TileN (f corr )])), s.t. f corr = AvgPool(Fcorr ).

(8)

Finally, we apply three parallel MLPs to regress R, t, and s, respectively: R, t, s = ρ(MLP(f pose )), MLP(f pose ), MLP(f pose ),

(9)

where we choose a 6D representation of rotation [32] as the regression target of the first MLP, for its continuous learning space in SO(3), and ρ(·) represents the transformation from the 6D representation to the 3 × 3 rotation matrix R. For the whole DPDN, we can summarize it as follows:

R, t, s, Q_v, Q_o = Φ(I_o, P_o, Q_c).   (10)
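To make the data flow of Eqs. (3)–(6) concrete, here is a minimal PyTorch-style sketch of the Deep Prior Deformer. The layer widths, the softmax normalization of A, and the channel-first tensor convention (transposed with respect to the paper's N×M notation) are our own assumptions for illustration; the paper's exact network specifics are those given in Fig. 2.

```python
import torch
import torch.nn as nn

def mlp(dims):
    # Point-wise MLP implemented with 1-D convolutions; no activation on the last layer.
    layers = []
    for i in range(len(dims) - 1):
        layers += [nn.Conv1d(dims[i], dims[i + 1], 1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers[:-1])

class DeepPriorDeformer(nn.Module):
    def __init__(self, d=128, M=1024):
        super().__init__()
        self.deform_mlp = mlp([3 * d, 256, d])   # predicts F_D, cf. Eq. (4)
        self.corr_mlp = mlp([3 * d, 256, M])     # predicts A (logits), cf. Eq. (5)
        self.coord_mlp = mlp([d, 256, 3])        # predicts Q_v, cf. Eq. (6)

    def forward(self, F_Io, F_Po, F_Qc):
        # F_Io, F_Po: (B, d, N); F_Qc: (B, d, M)
        N, M = F_Po.shape[2], F_Qc.shape[2]
        f_Io = F_Io.mean(dim=2, keepdim=True)    # global image feature (B, d, 1)
        f_Po = F_Po.mean(dim=2, keepdim=True)    # global point feature (B, d, 1)
        F_D = self.deform_mlp(torch.cat(
            [F_Qc, f_Io.expand(-1, -1, M), f_Po.expand(-1, -1, M)], dim=1))
        F_Qv = F_Qc + F_D                        # deformed deep prior
        f_Qv = F_Qv.mean(dim=2, keepdim=True)
        A = torch.softmax(self.corr_mlp(torch.cat(
            [F_Io, F_Po, f_Qv.expand(-1, -1, N)], dim=1)), dim=1)   # (B, M, N)
        Q_v = self.coord_mlp(F_Qv)               # canonical complete shape (B, 3, M)
        F_Qo = torch.bmm(F_Qv, A)                # deep correspondence, Eq. (3): (B, d, N)
        Q_o = torch.bmm(Q_v, A)                  # sampled canonical points, Eq. (2): (B, 3, N)
        return F_Qo, Q_v, Q_o, A
```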

3.2 Self-supervised Training Objective L_self-supervised

For an observed point set P_o = {p_o^{(j)}}_{j=1}^{N}, if we transform it with (ΔR_1, Δt_1, Δs_1) and (ΔR_2, Δt_2, Δs_2), we can obtain P_{o,1} = {p_{o,1}^{(j)}}_{j=1}^{N} = { \frac{1}{Δs_1} ΔR_1^{\top}(p_o^{(j)} − Δt_1) }_{j=1}^{N} and P_{o,2} = {p_{o,2}^{(j)}}_{j=1}^{N} = { \frac{1}{Δs_2} ΔR_2^{\top}(p_o^{(j)} − Δt_2) }_{j=1}^{N}, respectively. When inputting them into DPDN in parallel, we have

R_{P_{o,1}}, t_{P_{o,1}}, s_{P_{o,1}}, Q_{v,1}, Q_{o,1} = Φ(I_o, P_{o,1}, Q_c),   (11)
R_{P_{o,2}}, t_{P_{o,2}}, s_{P_{o,2}}, Q_{v,2}, Q_{o,2} = Φ(I_o, P_{o,2}, Q_c),   (12)

with Q_{v,1} = {q_{v,1}^{(j)}}_{j=1}^{M}, Q_{o,1} = {q_{o,1}^{(j)}}_{j=1}^{N}, Q_{v,2} = {q_{v,2}^{(j)}}_{j=1}^{M}, and Q_{o,2} = {q_{o,2}^{(j)}}_{j=1}^{N}. There exist two solutions to (R, t, s) of P_o from (11) and (12), respectively; for clarity, we use subscripts '1' and '2' for (R, t, s) to distinguish them: 1) R_1, t_1, s_1 = ΔR_1 R_{P_{o,1}}, Δt_1 + Δs_1 ΔR_1 t_{P_{o,1}}, Δs_1 s_{P_{o,1}}; 2) R_2, t_2, s_2 = ΔR_2 R_{P_{o,2}}, Δt_2 + Δs_2 ΔR_2 t_{P_{o,2}}, Δs_2 s_{P_{o,2}}.

Upon the above parallel learning of (11) and (12), we design a novel self-supervised objective for the unlabeled real-world data to reduce the Sim2Real domain gap. Specifically, it combines an inter-consistency term L_inter with two intra-consistency ones (L_intra,1, L_intra,2) as follows:

L_self-supervised = λ_1 L_inter + λ_2 (L_intra,1 + L_intra,2),   (13)

where λ1 and λ2 are superparameters to balance the loss terms. Linter enforces consistency across the parallel learning from Po with different transformations, making the learning aware of pose changes to improve the precision of predictions, while Lintra,1 and Lintra,2 enforce self-adaptation between correspondence and predictions within each learning, respectively, in order to realize more reliable predictions inferred from the correspondence.
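For illustration, the sketch below generates the two transformed inputs and maps the predictions made on them back to the pose of the original P_o, following the relations stated above. It samples ΔR uniformly from SO(3) via SciPy as a stand-in for the paper's random-Euler-angle sampling; the ranges for Δt and Δs follow the implementation details in Sect. 4, and all function names are our own.

```python
import torch
from scipy.spatial.transform import Rotation

def random_rigid_transform():
    # dR from SO(3), dt ~ U(-0.02, 0.02)^3, ds ~ U(0.8, 1.2), as in Sect. 4.
    dR = torch.tensor(Rotation.random().as_matrix(), dtype=torch.float32)
    dt = (torch.rand(3) - 0.5) * 0.04
    ds = 0.8 + 0.4 * torch.rand(1)
    return dR, dt, ds

def transform_points(P_o, dR, dt, ds):
    # P_{o,k}^{(j)} = (1/ds) dR^T (p_o^{(j)} - dt); P_o has shape (N, 3).
    return (P_o - dt) @ dR / ds          # right-multiplying by dR applies dR^T to each row

def recover_solution(R_pred, t_pred, s_pred, dR, dt, ds):
    # Solution for the original P_o: R = dR R_pred, t = dt + ds dR t_pred, s = ds s_pred.
    return dR @ R_pred, dt + ds * (dR @ t_pred), ds * s_pred
```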

Inter-consistency Term. We construct the inter-consistency loss based on the following two facts: 1) the two solutions to the pose and size of P_o from those of P_{o,1} and P_{o,2} are required to be consistent; 2) as representations of the same object V in the canonical space, Q_{v,1} and Q_{v,2} should be invariant to any pose transformations, and thus keep consistent to each other, as should Q_{o,1} and Q_{o,2}. Therefore, with two input transformations, the inter-consistency loss L_inter can be formulated as follows:

L_inter = D_pose(R_1, t_1, s_1, R_2, t_2, s_2) + β_1 D_cham(Q_{v,1}, Q_{v,2}) + β_2 D_{L2}(Q_{o,1}, Q_{o,2}),   (14)

where β_1 and β_2 are balancing parameters, and

D_pose(R_1, t_1, s_1, R_2, t_2, s_2) = ||R_1 − R_2||_2 + ||t_1 − t_2||_2 + ||s_1 − s_2||_2,

D_cham(Q_{v,1}, Q_{v,2}) = \frac{1}{2M} \sum_{j=1}^{M} \Big( \min_{q_{v,2} \in Q_{v,2}} ||q_{v,1}^{(j)} − q_{v,2}||_2 + \min_{q_{v,1} \in Q_{v,1}} ||q_{v,1} − q_{v,2}^{(j)}||_2 \Big),

D_{L2}(Q_{o,1}, Q_{o,2}) = \frac{1}{N} \sum_{j=1}^{N} ||q_{o,1}^{(j)} − q_{o,2}^{(j)}||_2.
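A minimal sketch of the three distances in Eq. (14), written with PyTorch tensor operations; the helper names and the default β values (taken from the hyper-parameters listed in Sect. 4) are for illustration only.

```python
import torch

def pose_distance(R1, t1, s1, R2, t2, s2):
    # D_pose: L2 distances between the two pose/size solutions.
    return torch.norm(R1 - R2) + torch.norm(t1 - t2) + torch.norm(s1 - s2)

def chamfer_distance(Q1, Q2):
    # Symmetric Chamfer distance between two complete point sets of shape (M, 3).
    d = torch.cdist(Q1, Q2)                  # (M, M) pairwise distances
    return 0.5 * (d.min(dim=1).values.mean() + d.min(dim=0).values.mean())

def l2_distance(Q1, Q2):
    # Point-to-point L2 distance for the ordered partial sets of shape (N, 3).
    return torch.norm(Q1 - Q2, dim=1).mean()

def inter_consistency(R1, t1, s1, Qv1, Qo1, R2, t2, s2, Qv2, Qo2,
                      beta1=5.0, beta2=1.0):
    return (pose_distance(R1, t1, s1, R2, t2, s2)
            + beta1 * chamfer_distance(Qv1, Qv2)
            + beta2 * l2_distance(Qo1, Qo2))
```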

The Chamfer distance D_cham is used to restrain the distance between the two complete point sets Q_{v,1} and Q_{v,2}, while for the partial Q_{o,1} and Q_{o,2}, we use the stricter L2 distance D_{L2} for point-to-point constraints, since their points should be ordered to correspond with those of P_o.

Intra-consistency Terms. For an observation (I_o, P_o), DPDN learns deep correspondence between P_o = {p_o^{(j)}}_{j=1}^{N} and Q_o = {q_o^{(j)}}_{j=1}^{N} to predict their relative pose and size (R, t, s); ideally, for ∀j = 1, ..., N, q_o^{(j)} = \frac{1}{||s||_2} R^{\top}(p_o^{(j)} − t). Accordingly, the predictions Q_{o,1} and Q_{o,2} in (11) and (12) should be restrained to be consistent with \tilde{Q}_{o,1} = { \frac{1}{||s_1||_2} R_1^{\top}(p_o^{(j)} − t_1) }_{j=1}^{N} and \tilde{Q}_{o,2} = { \frac{1}{||s_2||_2} R_2^{\top}(p_o^{(j)} − t_2) }_{j=1}^{N}, respectively, which gives the formulations of the two intra-consistency terms based on the Smooth-L1 distance as follows:

L_intra,1 = D_SL1(Q_{o,1}, \tilde{Q}_{o,1}),   L_intra,2 = D_SL1(Q_{o,2}, \tilde{Q}_{o,2}),   (15)

where

D_SL1(Q_1, Q_2) = \frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{3} \begin{cases} 5\,(q_1^{(jk)} − q_2^{(jk)})^2, & \text{if } |q_1^{(jk)} − q_2^{(jk)}| \le 0.1, \\ |q_1^{(jk)} − q_2^{(jk)}| − 0.05, & \text{otherwise}, \end{cases}

with Q_1 = {(q_1^{(j1)}, q_1^{(j2)}, q_1^{(j3)})}_{j=1}^{N} and Q_2 = {(q_2^{(j1)}, q_2^{(j2)}, q_2^{(j3)})}_{j=1}^{N}.
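A short sketch of one intra-consistency term: the observed points are re-projected into the canonical space with the predicted pose and compared against the predicted Q_o with the Smooth-L1 distance of Eq. (15). The turning point of 0.1 follows the formula above; the function names are our own.

```python
import torch

def smooth_l1(Q1, Q2, beta=0.1):
    # D_SL1 in Eq. (15), summed over xyz and averaged over the N points.
    diff = (Q1 - Q2).abs()
    per_elem = torch.where(diff <= beta, 5.0 * diff ** 2, diff - 0.05)
    return per_elem.sum(dim=1).mean()

def intra_consistency(Qo_pred, P_o, R, t, s):
    # Q~_o^{(j)} = (1/||s||_2) R^T (p_o^{(j)} - t), compared against the predicted Q_o.
    Q_tilde = (P_o - t) @ R / torch.norm(s)
    return smooth_l1(Qo_pred, Q_tilde)
```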

3.3 Supervised Training Objective L_supervised

Given a triplet of inputs (I_o, P_o, Q_c) along with the annotated ground truths (\hat{R}, \hat{t}, \hat{s}, \hat{Q}_v, \hat{Q}_o), we generate dual input triplets by applying two rigid transformations to P_o, as done in Sect. 3.2, and use the following supervised objective on top of the parallel learning of (11) and (12):

L_supervised = D_pose(R_1, t_1, s_1, \hat{R}, \hat{t}, \hat{s}) + D_pose(R_2, t_2, s_2, \hat{R}, \hat{t}, \hat{s}) + γ_1 (D_cham(Q_{v,1}, \hat{Q}_v) + D_cham(Q_{v,2}, \hat{Q}_v)) + γ_2 (D_SL1(Q_{o,1}, \hat{Q}_o) + D_SL1(Q_{o,2}, \hat{Q}_o)).   (16)

We note that this supervision also implies inter-consistency between the parallel learning defined in (14), making DPDN more sensitive to pose changes.
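A compact sketch of Eq. (16), reusing the hypothetical pose_distance, chamfer_distance, and smooth_l1 helpers from the earlier sketches; the default γ values follow the hyper-parameters listed in Sect. 4.

```python
def supervised_loss(pred1, pred2, gt, gamma1=5.0, gamma2=1.0):
    # pred_k = (R_k, t_k, s_k, Qv_k, Qo_k); gt = (R_hat, t_hat, s_hat, Qv_hat, Qo_hat).
    loss = 0.0
    for R, t, s, Qv, Qo in (pred1, pred2):
        loss = loss + pose_distance(R, t, s, gt[0], gt[1], gt[2])
        loss = loss + gamma1 * chamfer_distance(Qv, gt[3])
        loss = loss + gamma2 * smooth_l1(Qo, gt[4])
    return loss
```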

4 Experiments

Datasets. We train DPDN on both training sets of synthetic CAMERA25 and real-world REAL275 datasets [24], and conduct evaluation on REAL275 test set. CAMERA25 is created by a context-aware mixed reality approach, which renders 1, 085 synthetic object CAD models of 6 categories to real-world backgrounds, yielding a total of 300, 000 RGB-D images, with 25, 000 ones of 184 objects set aside for validation. REAL275 is a more challenging real-world dataset, which includes 4, 300 training images of 7 scenes and 2, 754 testing ones of 6 scenes. Both datasets share the same categories, yet impose large domain gap. Implementation Details. To obtain instance masks for both training and test sets of REAL275, we train a MaskRCNN [8] with a backbone of ResNet101 [9] on CAMERA25; for settings of available training mask labels, we use the same segmentation results as [14,15,20] to make fair comparisons. We employ the shape priors released by [20]. For DPDN, we resize the image crops of object observations as 192 × 192, and set the point numbers of shape priors and observed point sets as M = N = 1, 024, respectively. In Triplet Feature Extractor, a PSP Network [31] based on ResNet-18 [9] and two networks of PointNet++ [18] are employed, sharing the same architectures as those in [3,10]. To aggregate multi-scale features, each PointNet++ is built by stacking 4 set abstract levels with multi-scale grouping. The output channels of point-wise features (FIo , FPo , FQc ) are all set as d = 128; other network specifics are also given in Fig. 2. We use ADAM to train DPDN with a total of 120, 000 iterations; the data size of a mini training batch is B = 24 with B1 : B2 = 3 : 1. The superparameters of λ1 , λ2 , β1 , β2 , γ1 and γ2 are set as 0.2, 0.02, 5.0, 1.0, 5.0 and 1.0, respectively. For each pose transformation, ΔR is sampled from the whole SO(3) space by randomly generating three euler angles, while Δt ∈ R3 with each element Δt ∼ U (−0.02, 0.02), and Δs ∼ U (0.8, 1.2). Evaluation Metrics. Following [24], we report mean Average Precision (mAP) of Intersection over Union (IoU) for object detection, and mAP of n◦ m cm for 6D pose estimation. IoUx denotes precision of predictions with IoU over a threshold of x%, and n◦ m cm denotes precision of those with rotation error less than n◦ and transformation error less than m cm.
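For reference, a minimal sketch of how the n° m cm criterion can be checked for a single prediction. It ignores the special handling of symmetric categories used by the NOCS evaluation protocol and assumes translations are given in meters; these simplifications and the function names are our own.

```python
import numpy as np

def pose_error(R_pred, t_pred, R_gt, t_gt):
    # Rotation error in degrees and translation error in centimeters.
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    trans_err = np.linalg.norm(t_pred - t_gt) * 100.0   # meters -> centimeters (assumed)
    return rot_err, trans_err

def matches_threshold(R_pred, t_pred, R_gt, t_gt, n_deg=5.0, m_cm=2.0):
    rot_err, trans_err = pose_error(R_pred, t_pred, R_gt, t_gt)
    return rot_err <= n_deg and trans_err <= m_cm
```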

Table 1. Quantitative comparisons of different methods for category-level 6D pose and size estimation on REAL275 [24]. ‘Syn’ and ‘Real’ denote the uses of training data of synthetic CAMERA25 and real-world REAL275 datasets, respectively. ‘∗’ denotes training with mask labels Method

Syn Real Real mAP w/o label with label IoU50 IoU75 5◦

5◦

10◦

10◦

    

3.4 12.0 27.1 37.3 45.5

33.5 33.1 53.7 59.8

20.4 37.8 56.8 67.0 71.3

Unsupervised NOCS [24] SPD [24] DualPoseNet [15] DPDN (Ours) Self-DPDN (Ours)

UDA-COPE [11]  Self-DPDN (Ours) 



36.7 71.0 68.4 71.7 72.6

3.4 43.1 49.5 60.8 63.8

11.4 15.9 29.7 37.8

∗ ∗

82.6 83.0

62.5 70.3

30.4 34.8 56.9 66.0 39.4 45.0 63.2 72.1

78.0 77.3 79.3 79.8 79.3 79.8 80.1 83.4

30.1 53.2 55.9 62.2 62.4 65.6 61.9 76.0

7.2 19.3 27.8 29.3 31.6 36.6 35.9 46.0

Supervised NOCS [24] SPD [20] CR-Net [25] DualPoseNet [15] SAR-Net [13] SS-ConvNet [14] SGPA [3] DPDN (Ours)

       

       

10.0 21.4 34.3 35.9 42.3 43.4 39.6 50.7

13.8 43.2 47.2 50.0 50.3 52.6 61.3 70.4

25.2 54.1 60.8 66.8 68.3 63.5 70.7 78.4

4.1 Comparisons with Existing Methods

We compare our method with the existing ones for category-level 6D object pose and size estimation under both unsupervised and supervised settings. Quantitative results are given in Table 1, where results under supervised setting significantly benefit from the annotations of real-world data, compared to those under unsupervised one; for example, on the metric of 5◦ 2 cm, SPD [20] improves the results from 11.4% to 19.3%, while DualPoseNet [15] improves from 15.9% to 29.3%. Therefore, the exploration of UDA for the target task in this paper is of great practical significance, due to the difficulties in precisely annotating real-world object instances in 3D space. Unsupervised Setting. Firstly, a basic version of DPDN is trained on the synthetic data and transferred to real-world domain for evaluation; under this setting, our basic DPDN outperforms the existing methods on all the evaluation metrics, as shown in Table 1. To reduce the Sim2Real domain gap, we further include the unlabeled Real275 training set via our self-supervised DPDN for UDA; results in Table 1 verify the effectiveness of our self-supervised DPDN (dubbed Self-DPDN in the table), which significantly improves the precision

Table 2. Ablation studies on the variants of our proposed DPDN under supervised setting. Experiments are evaluated on REAL275 test set [24]. Input Deep prior Pose & size mAP estimator Po,1 Po,2 deformer IoU50 IoU75 5◦ 2 cm 5◦ 5 cm 10◦ 2 cm 10◦ 5 cm    

×  × 

× ×  

   

79.9 83.3 83.4 83.4

65.4 72.9 76.2 76.0

26.7 35.2 39.6 46.0

35.6 43.9 46.1 50.7

47.2 57.3 65.5 70.4

63.9 70.8 76.7 78.4







×

59.6

45.6

27.9

33.0

50.2

63.9

of the basic version, e.g., a performance gain of 8.1% on 5◦ 2 cm from 29.7% to 37.8%. UDA-COPE [11] is the first work to introduce the unsupervised setting, which trains deep model with a teacher-student scheme to yield pseudo labels for realworld data; in the process of training, pose annotations of real-world data are not employed, yet mask labels are used for learning instance segmentation. To fairly compare with UDA-COPE, we evaluate DPDN under the same setting; results in Table 1 also show the superiority of our self-supervised DPDN over UDA-COPE, especially for the metrics of high precisions, e.g., an improvement of 10.2% on 5◦ 5 cm. The reason for the great improvement is that UDA-COPE heavily relies on the qualities of pseudo labels, while our self-supervised objective could guide the optimization moving for the direction meeting inter-/intra- consistency, to make the learning fit the characteristics of REAL275 and decrease the downside effect of the synthetic domain. Supervised Setting. We also compare our DPDN with the existing methods, including those of direct regression [14,15], and those based on dense correspondence learning [3,13,20,24,25], under supervised setting. As shown in Table 1, DPDN outperforms the existing methods on all the evaluation metrics, e.g., reaching the precisions of 76.0% on IoU75 and 78.4% on 10◦ 5 cm. Compared with the representative SS-ConvNet [14], which directly regresses object poses and sizes from observations, our DPDN takes the advantages of categorical shape priors, and achieves more precise results by regressing from deep correspondence; compared with SGPA [3], the recent state-of-the-art method based on correspondence learning (c.f. Eq. (2)), our DPDN shares the same feature extractor, yet benefits from the direct objectives for pose and size estimation, rather than the surrogate ones, e.g., for regression of D and A in Eq. (2); DPDN thus achieves more reliable predictions. 4.2

4.2 Ablation Studies and Analyses

In this section, we conduct experiments to evaluate the efficacy of both the designs in our DPDN and the self-supervision upon DPDN.

Table 3. Ablation studies of our proposed self-supervised objective upon DPDN under unsupervised setting. Experiments are evaluated on REAL275 test set [24]. Linter Lintra,1 Lintra,2

mAP IoU50 IoU75 5◦ 2 cm 5◦ 5 cm 10◦ 2 cm 10◦ 5 cm

×  × 

71.7 72.6 70.9 72.6

× ×  

60.8 63.2 58.6 63.8

29.7 36.9 35.8 37.8

37.3 43.7 43.6 45.5

53.7 58.7 56.6 59.8

67.0 68.7 69.3 71.3

Effects of the Designs in DPDN. We verify the efficacy of the designs in DPDN under supervised setting, with the results of different variants of our DPDN shown in Table 2. Firstly, we confirm the effectiveness of our Deep Prior Deformer with categorical shape priors; by removing Deep Prior Deformer, the precision of DPDN with parallel learning on 5◦ 2 cm drops from 46.0% to 35.2%, indicating that learning from deep correspondence by deforming priors in feature space indeed makes the task easier than that directly from object observations. Secondly, we show the advantages of using true objectives for direct estimates of object poses and sizes, over the surrogate ones for the learning of the canonical point set Qo to pair with the observed Po . Specifically, we remove our Pose and Size Estimator, and make predictions by solving Umeyama algorithm to align Po and Qo ; precisions shown in the table decline sharply on all the evaluation metrics, especially on IoUx . We found that the results on IoUx are also much lower than those methods based on dense correspondence learning [3,20,25], while results on n◦ m are comparable; the reason is that we regress the absolute coordinates of Qc , rather than the deviations D in Eq. (2), which may introduce more outliers to affect the object size estimation. Thirdly, we confirm the effectiveness of the parallel supervisions in (16), e.g., inputting Po,1 and Po,2 of the same instance with different poses. As shown in Table 2, results of DPDN with parallel learning are improved (with or without Deep Prior Deformer), since the inter-consistency between dual predictions is implied in the parallel supervisions, making the learning aware of pose changes. Effects of the Self-supervision upon DPDN. We have shown the superiority of our novel self-supervised DPDN under the unsupervised setting in Sect. 4.1; here we include the evaluation on the effectiveness of each consistency term in the self-supervised objective, which is confirmed by the results shown in Table 3. Taking results on 5◦ , 2 cm as examples, DPDN with inter-consistency term Linter improves the results of the baseline from 29.7% to 36.9%, and DPDN with the intra-consistency ones Lintra,1 and Lintra,2 improves to 35.8%, while their combinations further refresh the results, revealing their strengths on reduction of domain gap. We also show the influence of data size of unlabeled real-world images on the precision of predictions in Fig. 3, where precisions improve along with the increasing ratios of training data.

Fig. 3. Plots of mAP versus the ratio of unlabeled REAL275 training data under the unsupervised setting. Experiments are evaluated on the REAL275 test set [24]

Fig. 4. Qualitative results of DPDN on the REAL275 test set [24] for examples (a)–(e), comparing Unsupervised (Syn), Unsupervised (Syn + Real w/o label), Supervised (Syn + Real with label), and the Ground Truth. 'Syn' and 'Real' denote the uses of the synthetic CAMERA25 and real-world REAL275 training sets, respectively.

4.3 Visualization

We visualize in Fig. 4 the qualitative results of our proposed DPDN under different settings on the REAL275 test set [24]. As shown in the figure, our self-supervised DPDN without annotations of real-world data, in general, achieves results comparable to the fully-supervised version, although there still exist some difficult examples, e.g., the cameras in Fig. 4(a) and (b), due to the inaccurate masks from MaskRCNN trained on synthetic data. Under the unsupervised setting, our self-supervised DPDN also outperforms the basic version trained with only CAMERA25, by including unlabeled real-world data with self-supervision; for example, more precise poses of laptops are obtained in Fig. 4(c) and (d).

Acknowledgements. This work is supported in part by the Guangdong R&D key project of China (No.: 2019B010155001), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No.: 2017ZT07X183).

References 1. Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M.: Domainadversarial neural networks. arXiv preprint arXiv:1412.4446 (2014) 2. Azuma, R.T.: A survey of augmented reality. Presence Teleoperators Virtual Environ. 6(4), 355–385 (1997) 3. Chen, K., Dou, Q.: SGPA: structure-guided prior adaptation for category-level 6D object pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2773–2782 (2021) 4. Chen, W., Jia, X., Chang, H.J., Duan, J., Shen, L., Leonardis, A.: FS-Net: fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1581–1590 (2021) 5. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915 (2017) 6. Deng, S., Liang, Z., Sun, L., Jia, K.: Vista: boosting 3D object detection via dual cross-view spatial attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8448–8457 (2022) 7. Denninger, M., et al.: Blenderproc: reducing the reality gap with photorealistic rendering. In: International Conference on Robotics: Science and Systems, RSS 2020 (2020) 8. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017) 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 10. He, Y., Sun, W., Huang, H., Liu, J., Fan, H., Sun, J.: PVN3D: a deep point-wise 3D keypoints voting network for 6DoF pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632– 11641 (2020) 11. Lee, T., et al.: UDA-COPE: unsupervised domain adaptation for category-level object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14891–14900 (2022) 12. Levinson, J., et al.: Towards fully autonomous driving: systems and algorithms. In: 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 163–168. IEEE (2011) 13. Lin, H., Liu, Z., Cheang, C., Fu, Y., Guo, G., Xue, X.: SAR-Net: shape alignment and recovery network for category-level 6D object pose and size estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717 (2022) 14. Lin, J., Li, H., Chen, K., Lu, J., Jia, K.: Sparse steerable convolutions: an efficient learning of se (3)-equivariant features for estimation and tracking of object poses in 3D space. In: Advances in Neural Information Processing Systems, vol. 34 (2021) 15. Lin, J., Wei, Z., Li, Z., Xu, S., Jia, K., Li, Y.: Dualposenet: category-level 6D object pose and size estimation using dual pose network with refined learning of pose consistency. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3560–3569 (2021) 16. Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97– 105. PMLR (2015)

17. Mousavian, A., Eppner, C., Fox, D.: 6-DOF GraspNet: variational grasp generation for object manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2901–2910 (2019) 18. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 19. Qin, C., You, H., Wang, L., Kuo, C.C.J., Fu, Y.: Pointdan: a multi-scale 3D domain adaption network for point cloud representation. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 20. Tian, M., Ang, M.H., Lee, G.H.: Shape prior deformation for categorical 6D object pose and size estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 530–546. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58589-1 32 21. Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 13(04), 376–380 (1991) 22. Wang, C., et al.: Densefusion: 6D object pose estimation by iterative dense fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3343–3352 (2019) 23. Wang, G., Manhardt, F., Shao, J., Ji, X., Navab, N., Tombari, F.: Self6D: selfsupervised monocular 6D object pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 108–125. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 7 24. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2642–2651 (2019) 25. Wang, J., Chen, K., Dou, Q.: Category-level 6d object pose estimation via cascaded relation and recurrent reconstruction networks. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4807–4814. IEEE (2021) 26. Wang, Z., Jia, K.: Frustum convnet: sliding frustums to aggregate local pointwise features for amodal 3D object detection. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1742–1749. IEEE (2019) 27. Wu, C., et al.: Grasp proposal networks: an end-to-end solution for visual learning of robotic grasps. Adv. Neural. Inf. Process. Syst. 33, 13174–13184 (2020) 28. Zhang, Y., Deng, B., Jia, K., Zhang, L.: Label propagation with augmented anchors: a simple semi-supervised learning baseline for unsupervised domain adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 781–797. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-58548-8 45 29. Zhang, Y., Deng, B., Tang, H., Zhang, L., Jia, K.: Unsupervised multi-class domain adaptation: theory, algorithms, and practice. IEEE Trans. Pattern Anal. Mach. Intell. (2020) 30. Zhang, Y., Tang, H., Jia, K., Tan, M.: Domain-symmetric networks for adversarial domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5031–5040 (2019) 31. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890 (2017) 32. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5745–5753 (2019)

Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection

Hongyu Zhou1,3, Zheng Ge1, Songtao Liu1, Weixin Mao1,2, Zeming Li1, Haiyan Yu3(B), and Jian Sun1

1 MEGVII Technology, Beijing, China
2 Waseda University, Tokyo, Japan
3 Harbin Institute of Technology, Harbin, China
[email protected]

Abstract. To date, the most powerful semi-supervised object detectors (SS-OD) are based on pseudo-boxes, which need a sequence of post-processing with fine-tuned hyper-parameters. In this work, we propose replacing the sparse pseudo-boxes with the dense prediction as a united and straightforward form of pseudo-label. Compared to the pseudo-boxes, our Dense Pseudo-Label (DPL) does not involve any post-processing method, thus retaining richer information. We also introduce a region selection technique to highlight the key information while suppressing the noise carried by dense labels. We name our proposed SS-OD algorithm that leverages the DPL as Dense Teacher. On COCO and VOC, Dense Teacher shows superior performance under various settings compared with the pseudo-box-based methods. Code is available at https://github.com/Megvii-BaseDetection/DenseTeacher.

Keywords: Semi-supervised object detection · Dense pseudo-label

1 Introduction

Current high-performance object detection neural networks rely on a large amount of labeled data to ensure their generalization capability. However, labeling samples takes a high cost of human effort. Thus the industry and academia pay extensive attention to the use of relatively easy-to-obtain unlabeled data. An effective way to use these data is Semi-Supervised Learning (SSL), where at the training time, only part of the data is labeled while the rest are unlabeled. On image classification tasks, the dominant method of mining information from unlabeled data is "Consistency-based Pseudo-Labeling" [3–5,23]. Pseudo-Labeling [12] is a technique that utilizes trained models to generate labels for unlabeled data. Meanwhile, Consistency-based regularization [1], from another perspective, forces a model to have similar output when given a normal and a perturbed input with different data augmentations and perturbations like Dropout [25]. (H. Zhou and Z. Ge contributed equally to this work.)

Fig. 1. The overview of our proposed pipeline for unlabeled data compared with the traditional pseudo-box based pipeline. For each iteration, Dense Pseudo-Labels (DPL) are generated by the teacher model on unlabeled images. The student model then calculates the unsupervised loss on perturbed images and the corresponding DPLs. By removing post-processing steps, DPL retains rich information from the teacher model. Note that the vanilla learning approach uses only labeled data (not plotted in the figure) and the total loss is the sum of both the supervised and unsupervised losses.

This pipeline has been successfully transferred to Semi-Supervised Object Detection (SS-OD) [18,24,30]. Specifically, the predicted boxes from a pre-trained "teacher" detector are used as the annotations of unlabeled images to train the "student" detector, where the same images are applied with different augmentations for the teacher and student models. This instinctive method has proven to be effective in SS-OD and has achieved state-of-the-art scores on benchmarks such as COCO [17] and Pascal VOC [6]. However, it is not reasonable to replicate all the empirics directly from the classification task. While the generated pseudo-label is a single and united class label for an image in classification, object detectors predict a set of pseudo-boxes as the annotation of an image. As shown in Fig. 1, making direct supervision on the unlabeled image with these pseudo-boxes requires several additional steps, including Non-Maximum Suppression (NMS), Thresholding, and Label Assignment. Such a lengthy label-generating procedure introduces many hyper-parameters, such as the NMS threshold σnms and the score threshold σt, substantially affecting the SS-OD performance (see also Sect. 3.2 for a related discussion). This motivates us to explore a simpler and more effective form of pseudo-labels for SS-OD. In this work, we propose a new SS-OD pipeline named Dense Teacher with a united form of pseudo-label—Dense Pseudo-Label (DPL), which enables more efficient knowledge transfer between the teacher and student models. DPL is an integral label. Different from the existing box-like labels in a human-readable form, it is the original output from the network without ordinary post-processing. Our Dense Teacher, following existing Pseudo-Labeling paradigms, works in the
following way: For each iteration, labeled and unlabeled images are randomly sampled to form a data batch. The "teacher" model, which is an Exponential Moving Average of the "student" model, generates DPL for unlabeled data. The student model is then trained on both the ground truth of labeled images and the DPL of unlabeled images. Since the DPL does not require any post-processing, the pipeline of Dense Teacher is extremely simple. The overall pipeline of Dense Teacher can be seen in Fig. 1. Although DPL provides more information than pseudo-box labels, it contains high-level noise (e.g., low-scoring predictions) as well. We show in Sect. 4.2 that learning to make those low-scoring predictions can distract the student model, resulting in poor detection performance. Therefore, we propose a region division method to suppress noise and highlight the key regions on which the student model should concentrate. According to our experiments, the region division strategy can effectively utilize the rich information contained in hard negative regions to enhance training. As a result, our proposed Dense Teacher, together with the region division strategy, shows state-of-the-art performance on MS-COCO and Pascal VOC. Our main contributions in this paper are:

– We conduct a thorough analysis on the drawbacks of pseudo-boxes in the SS-OD task.

– We propose a united form of pseudo-label named DPL to better fit the semi-supervised setting, and the Dense Teacher framework to apply the DPL on one-stage detectors.

– The proposed Dense Teacher achieves state-of-the-art performance on MS-COCO and Pascal VOC benchmarks under various settings. Gain analysis and an ablation study are provided to verify the effectiveness of each part of Dense Teacher.

2 Related Works

2.1 Semi-supervised Learning

Semi-Supervised Learning (SSL) means that a portion of the data is labeled at training time while the rest is not. Currently, there are two main approaches to achieve this goal: pseudo-labeling and consistency regularization. Pseudo-label-based methods [12] first train a network on labeled data, then use the trained network as a teacher to make inferences on unlabeled data. This prediction result is then assigned to a specific class according to a threshold based on the predicted confidence and used as labeled data to train another student network. In [27], the teacher model is replaced by the Exponential Moving Average (EMA) of the student to conduct online pseudo-labeling. Consistency-regularization-based methods [1] construct a regularization loss to force predictions under a set of perturbations {Ti} to be the same. Perturbations can be implemented using augmentation [5,21,29], dropout [11], or adversarial training [19]. This approach does not require annotation and can be used in combination with other methods; therefore, it is widely adopted in many SSL frameworks [2,4,23].

2.2 Object Detection

Object detectors can be divided into anchor-based and anchor-free paradigms. Anchor-based detectors predict the offsets and scales of target boxes from predefined anchor boxes. Although this approach has succeeded on many tasks, one needs to redefine new anchor boxes when applying such models to new data. In contrast to anchor-based detectors, predefined anchor boxes are not required for anchor-free detectors. These detectors directly predict the box size and location on the feature map. Take the FCOS model as an example: this detector predicts a classification score, distances to the four box boundaries, and a quality score on each pixel of the Feature Pyramid Network (FPN) [15]. A variety of subsequent improvements, such as adaptive label assignment during training [33] and boundary distribution modeling [14], were proposed to improve its performance. Considering the wide application, streamlined architecture, and excellent performance of FCOS, we conduct our experiments under this framework.

2.3 Semi-supervised Object Detection

The label type is the main difference between Semi-Supervised Object Detection (SS-OD) and SSL. Previous studies have transferred a great deal of experience from SSL works to the SS-OD domain. CSD [10] uses a flipped image I′ to introduce a consistency loss between F(I) and F(I′); this regularization can be applied to unlabeled images. STAC [24] trains a teacher detector on labeled images and generates pseudo-labels on unlabeled data using this static teacher. These pseudo-labels are then selected and used for training like labeled data. Unbiased Teacher [18] uses thresholding to filter pseudo-labels, and Focal Loss [16] is also applied to address the pseudo-labeling bias issue. Adaptive Class-Rebalancing [32] artificially adds foreground targets to images to achieve inter-class balance. Li, et al. [13] propose dynamic thresholding and loss reweighting for each category. Soft Teacher [30] proposed a score-weighted classification loss and a box jittering approach to select and utilize the regression loss of pseudo-boxes. While these methods successfully transferred paradigms from SSL to SS-OD, they ignored the unique characteristics of SS-OD. These pseudo-box-based strategies treat pseudo-boxes, or selected pseudo-boxes, as ordinary target boxes, and thus they invariably follow the detector's label assignment strategy.

3 Dense Teacher

In Sect. 3.1, we first introduce the existing Pseudo-Labeling SS-OD framework. Then we analyze the disadvantages of utilizing pseudo-boxes in Sect. 3.2. In the remaining part, we propose Dense Pseudo-Label to overcome the issues mentioned above and introduce our overall pipeline in detail, including the label generation, loss function, and the learning region selection strategy. Since our primary motivation is to show the superiority of dense pseudo-labels compared to pseudo-box labels, we naturally choose to verify our idea on dense detectors (i.e., one-stage detectors).

3.1 Pseudo-Labeling Framework

Our Dense Teacher follows the existing pseudo-labeling framework [18,30] as shown in Fig. 1. Within each iteration:

1. Labeled and unlabeled images are randomly sampled to form a data batch.
2. The teacher model, an exponential moving average (EMA) of the student, takes the augmented unlabeled images to generate pseudo-labels.
3. The student model then takes the labeled data for vanilla training and calculates the supervised loss Ls, while the unlabeled data together with the pseudo-labels are used to produce the unsupervised loss Lu.
4. The two losses are weighted and learned to update the parameters of the student model. The student model updates the teacher model in an EMA manner.

Finally, the overall loss function is defined as:

L = L_s + w_u L_u,   (1)

where wu is the unsupervised loss weight. Traditionally, the unsupervised loss Lu is calculated with pseudo-boxes. However, in the following section, we point out that using processed boxes as pseudo-labels can be inefficient and sub-optimal.
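To make the four steps above concrete, here is a minimal PyTorch-style sketch of one training iteration. The model methods (supervised_loss, predict_dense, unsupervised_loss) are hypothetical placeholders for the detector's actual interfaces, and the EMA momentum is an assumed value; only the structure of the iteration follows the description above.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The teacher is an exponential moving average of the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def train_step(student, teacher, optimizer, labeled_batch, unlabeled_batch, w_u=4.0):
    # 1) supervised loss on labeled data (vanilla detector training)
    loss_s = student.supervised_loss(labeled_batch)          # hypothetical API
    # 2) teacher generates dense pseudo-labels on the unlabeled images
    with torch.no_grad():
        dpl = teacher.predict_dense(unlabeled_batch)          # hypothetical API
    # 3) unsupervised loss between the student's dense predictions and the DPL
    loss_u = student.unsupervised_loss(unlabeled_batch, dpl)  # hypothetical API
    loss = loss_s + w_u * loss_u                              # Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # 4) the student updates the teacher in an EMA manner
    ema_update(teacher, student)
    return loss.detach()
```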

3.2 Disadvantages of Pseudo-Box Labels

In this part, we study the behavior of pseudo-box-based SS-OD algorithms on COCO [17], as well as CrowdHuman2 [22] since the impact of the NMS threshold can be more clearly demonstrated in the crowd situation. We adopt Unbiased Teacher [18] as a representative algorithm to FCOS for these experiments. Dilemma in Thresholding. In SS-OD algorithms [18,30], the output of the teacher model is expected to play the role of ground-truth labels for unsupervised images. To this end, Thresholding is a key operation to screen out low-scoring boxes so that the quality of pseudo-box labels can be improved. However, our preliminary experiments show that the threshold σt introduced by this operation may substantially affect the entire training process. In Fig. 2(a), we present the training results of Unbiased Teacher under different σt . It shows that the detection performance fluctuates significantly on both datasets as the σt varies. Moreover, when σt is set to a high value (e.g., 0.7 and 0.9), the training process even fails to converge. Such a phenomenon is possibly caused by a large number of false negatives in the teacher’s prediction, as shown in Fig. 2(c) and (d). When this is the case, Thresholding will eliminate many high-quality predictions and mislead the learning process of the student model. Conversely, when set σt to a low value such as 0.3, the performance shows apparent degradation due to the increasing number of false positives (see in Fig. 2(c) and (d) as well). As a result, one can not find a perfect threshold to ensure the quality of generated pseudo-boxes. 2

CrowdHuman is a benchmark for detecting humans in a crowded situation, performance is measured by Log-average Miss Rate (mMR). The lower the better.

Fig. 2. Analysis of Pseudo-box based approaches. (a) and (b): Performances under different σt and σNMS . Note that the gray  represents the training fails to converge. (c) and (d): False Positive and False Negative boxes on 128 images under different threshold on COCO and CrowdHuman, the green line denotes the ground truth box number

Dilemma in Non-Maximum Suppression (NMS). NMS is adopted on the detector’s original outputs for most object detection algorithms to remove redundant predictions. It is also indispensable to the teacher model in existing SS-OD frameworks, without which the resulting pseudo-labels will be a mess. NMS introduces a threshold σnms to control the degree of suppression. According to our experiments, we find that σnms also has a non-negligible effect on the SS-OD algorithms. Figure 2(b) shows the relationship between σnms and performance of Unbiased Teacher. From this figure, we can tell that 1). different σnms may lead to fluctuations in the detection performance (especially on CrowdHuman). 2). the optimal σnms values for different datasets are different (i.e., 0.7 on COCO and 0.8 on CrowdHuman), which will bring in extra workload for developers to tune the optimal σnms on their custom datasets. Moreover, previous works [7,8] show that in a crowd scene like in the CrowdHuman dataset, there does not exist a perfect σnms that can keep all true positive predictions while suppressing all false positives. As a result, with NMS adopted, the unreliability of pseudo-box labels is further exacerbated.
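For contrast with the DPL introduced next, the sketch below spells out the post-processing chain that pseudo-box-based methods rely on: NMS followed by score thresholding (the surviving boxes would still have to go through the detector's label assignment). The threshold values are purely illustrative, not recommended settings.

```python
import torch
from torchvision.ops import nms

def pseudo_boxes(boxes, scores, nms_thresh=0.6, score_thresh=0.5):
    # boxes: (K, 4) in (x1, y1, x2, y2); scores: (K,)
    keep = nms(boxes, scores, nms_thresh)       # suppress overlapping predictions
    boxes, scores = boxes[keep], scores[keep]
    keep = scores >= score_thresh               # drop low-scoring boxes
    return boxes[keep], scores[keep]
```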

Fig. 3. Comparisons between (a) the raw image, (b) foreground pixels assigned by ground truth boxes, and (c) foreground pixels assigned by pseudo-boxes

Inconsistent Label Assignment. As shown in Fig. 1, existing pseudo-label-based algorithms convert the sparse pseudo-boxes into a dense form by label assignment to form the final supervision. An anchor box (or point) will be assigned as either positive or negative during label assignment based on a particular pre-defined rule. Although this process is natural in the standard object detection task, we believe it is harmful to SS-OD tasks. The reason is quite simple: the pseudo-boxes may suffer from the inaccurate localization problem, making the label assigning results inconsistent with the potential ground-truth labels. In Fig. 3, we can find that although the predicted box matches the actual box under IoU threshold 0.5, a severe inconsistent assigning result appears due to the inaccurate pseudo-box. This inconsistency with the ground truth is likely to degrade the performance. Due to the above three issues, we challenge the convention of using the pseudo-box as the middle-ware of unsupervised learning and propose a new form of pseudo-label that is dense and free of post-processing.

3.3 Dense Pseudo-Label

To address the problems mentioned above, we propose Dense Pseudo-Label (DPL) that encompasses richer and undistorted supervising signals. Specifically, we adopt the post-sigmoid logits predicted by the trained model as our desired dense pseudo-label, as shown in the green box in Fig. 1. After bypassing those lengthy post-processing methods, one can naturally discover that our proposed DPL reserves more detailed information from the teacher than its pseudo-box counterpart.

Since DPL represents information in continuous values (between 0 and 1) and the standard Focal Loss [16] can only deal with discrete binary values (0 or 1), we adopt Quality Focal Loss [14] to conduct learning between dense pseudo-labels and the student's predictions. Let us denote the DPL (i.e., the teacher's prediction) as y_i and the student's prediction as p_i^s for the i-th anchor ("anchor" stands for an anchor point in anchor-free detectors and an anchor box in anchor-based detectors); we hope the prediction and the target to be similar for the same anchor. Therefore, we can write the classification loss on the i-th anchor for an unlabeled image as:

L_cls^i = -|y_i − p_i^s|^{γ} \big[ y_i \log(p_i^s) + (1 − y_i) \log(1 − p_i^s) \big],   (2)

where γ is the suppression factor. While DPL contains rich information, it also keeps many low-scoring predictions due to the absence of the thresholding operation. Since those low-scoring predictions usually involve the background regions, intuitively, the knowledge encompassed in them shall be less informative. In Sect. 4.4, we experimentally prove that learning to mimic the teacher's response in those regions will hurt the SS-OD algorithm's performance. Therefore, we propose to divide the whole input image into a learning region and a suppressing region (i.e., the negative region in a positive-negative division) based on the teacher's Feature Richness Score (FRS) [34]. With the help of this richness score, we select the pixels with top k% scores as the learning region, and the other regions are suppressed to 0. As a result, our DPL is extended to:

S_i = \max_{c \in [1, C]} p_{i,c}^t,   (3)

y_i = \begin{cases} p_i^t, & \text{if } S_i \text{ is in the top } k\%, \\ 0, & \text{otherwise}, \end{cases}   (4)

where p_{i,c}^t denotes the score prediction of the c-th class for the i-th sample from the teacher, and C denotes the total number of classes. Besides, this design has other advantages:

1. By modifying the learning region, we can easily achieve Hard Negative Mining by selecting extra samples (see Fig. 4). In Sect. 4.4 we analyze the gain from this part in detail.

2. Since the learning region is selected, unsupervised learning for the regression branch can be easily achieved. We apply an IoU Loss on this branch and analyze its gain in Sect. 4.2.
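The sketch below combines Eqs. (2)–(4) for a single image: the FRS-based top-k% region division followed by the Quality Focal Loss on the surviving dense pseudo-labels. The fraction form of k (0.01 for k = 1) and the clamp guards are our own illustrative choices.

```python
import torch

def dense_teacher_cls_loss(t_logits, s_logits, k=0.01, gamma=2.0):
    """Region division + Quality Focal Loss on dense pseudo-labels.

    t_logits, s_logits: raw classification logits of teacher and student,
    shape (num_anchors, num_classes) for one image.
    k: fraction of anchors kept as the learning region (top k% by FRS score).
    """
    with torch.no_grad():
        y = torch.sigmoid(t_logits)                 # Dense Pseudo-Label
        frs = y.max(dim=1).values                   # S_i in Eq. (3)
        num_keep = max(1, int(k * frs.numel()))
        keep = frs.topk(num_keep).indices           # learning region
        target = torch.zeros_like(y)
        target[keep] = y[keep]                      # Eq. (4): suppress the rest to 0
    p = torch.sigmoid(s_logits)
    # Quality Focal Loss, Eq. (2); clamps added only for numerical stability
    loss = -(target - p).abs().pow(gamma) * (
        target * torch.log(p.clamp(min=1e-6)) +
        (1.0 - target) * torch.log((1.0 - p).clamp(min=1e-6)))
    return loss.sum()
```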

4 Experiments

4.1 Datasets and Experiment Settings

Datasets. We present our experimental results on the MS-COCO [17] and Pascal VOC [6] benchmarks. For MS-COCO, both the labeled and unlabeled training datasets are used. The train2017 set contains 118k images with target bounding boxes and unlabeled2017 contains 123k unlabeled images. Validation is performed on the val2017 subset. For Pascal VOC, the training set uses VOC07 train and VOC12 train, and the validation set uses VOC07 test. The following three experimental settings are mainly studied:

– COCO-Standard: 1%, 2%, 5% and 10% of the train2017 set are sampled as labeled data, respectively. The rest of the images are viewed as unlabeled data while training. For fairness of comparison, we follow the same dataset division as in [18], which contains 5 different data folds. The mean score of all 5 folds is taken as the final performance.

– COCO-Full: train2017 is used as labeled data while unlabeled2017 is used as unlabeled data.

– VOC Mixture: VOC07 train is used as labeled data, while VOC12 train and COCO20cls (the sampled COCO train2017 set containing only the 20 classes shared with VOC) are taken as unlabeled data.

Implementation Details. Without loss of generality, we take FCOS [28] as the representative anchor-free detector for experiments. ResNet-50 [9] pre-trained on ImageNet [20] is used as the backbone. We use batch size 16 for both labeled and unlabeled images. The base learning rate and γ in QFL are set to 0.01 and 2 in all of our experiments. The loss weight wu on unlabeled data is set to 4 on COCO-Standard and 2 on the other settings. Following previous works [18,24], we adopt the "burn-in" strategy to initialize the teacher model, and the same data augmentations as in [18] are applied.

4.2 Main Results

In this section, we progressively improve the Dense Teacher and analyze the performance gain from each part in detail. We adopt Unbiased Teacher as our baseline. Results on the COCO-Standard 10% setting are shown in Table 1. We first replace Unbiased Teacher's pseudo-boxes with our proposed Dense Pseudo-Labels without the region division strategy. It shows that this improves the mAP from 31.52% to 32.0%. Then, we apply our region division strategy on DPL and the mAP is further improved by 1.34%. Finally, we extend the unsupervised learning scheme to the regression branch as done by [30], and our final mAP comes to 35.11%. To the best of our knowledge, this is the new state-of-the-art under the COCO-Standard 10% setting. According to these results, we can attribute the advantages of Dense Teacher over existing methods to two major improvements:

Table 1. Performance under different model configurations on COCO-Standard 10%. * denotes our re-implemented result on FCOS. “Our Division” means the learning/suppression region division based on FRS score Method

Learning region

Cls Reg AP

AP50 AP75

Supervised

-

-

42.69 28.11

-

26.44

Unbiased Teacher [18] Predicted Positive  All  Dense Teacher Our Division  Dense Teacher

× × ×

31.52 48.80 33.57 32.00 50.29 34.17 33.34 52.14 35.53

Unbiased Teacher∗ [18] Predicted Positive  Our Division  Dense Teacher

 

33.13 49.96 35.36 35.11 53.35 37.79



Fig. 4. Illustration of (c) our Dense Pseudo-Label compared with (a) the ground truth and (b) the pseudo-box label and its assigning result. Blue areas denote the assigned negative samples. In Dense Pseudo-Label, red means high quality scores, which denotes positive samples in the pseudo-box label. It can be seen that Dense Pseudo-Label is able to leverage more hard negative regions compared to the pseudo-box based method (Color figure online)

1) The new form of pseudo-label resolves the deterioration problem of the pseudo-box label as mentioned in Sect. 3.2. It is worth mentioning that by getting rid of the lengthy post-processing procedure, our Dense Teacher forms a much simpler SS-OD pipeline but still with better performance. However, the resulting improvement (from 31.52% to 32.00%) remains marginal without the advanced learning region division strategy.

2) Our region division strategy can efficiently utilize hard negative regions to enhance training. Specifically, we conduct label assignment on FCOS using the ground truth annotations of COCO, finding that there are only about 0.4% positive samples in the COCO train2017 set. By specifying k = 1, we take a fair amount of hard negative samples for unsupervised training. In Fig. 4, we can see that hard negative samples distribute over meaningful background objects (chair cushion, cabinet, and other parts of the dog) in the image. These responses from the teacher are valuable in improving the student's performance.

4.3 Comparison with State-of-the-Arts

Table 2. Experimental results on COCO-Standard. * means our re-implemented results on FCOS; "(LSJ)" marks results trained with large scale jittering.

Method | 1% | 2% | 5% | 10%
Supervised | 11.24 ± 0.18 | 15.04 ± 0.31 | 20.82 ± 0.13 | 26.44 ± 0.11
CSD [10] | 10.51 ± 0.06 | 13.93 ± 0.12 | 18.63 ± 0.07 | 22.46 ± 0.08
STAC [24] | 13.97 ± 0.35 | 18.25 ± 0.25 | 24.38 ± 0.12 | 28.64 ± 0.21
Instant teaching [35] | 18.05 ± 0.15 | 22.45 ± 0.30 | 26.75 ± 0.05 | 30.40 ± 0.05
ISMT [31] | 18.88 ± 0.74 | 22.43 ± 0.56 | 26.37 ± 0.24 | 30.52 ± 0.52
Unbiased teacher [18] | 20.75 ± 0.12 | 24.30 ± 0.07 | 28.27 ± 0.11 | 31.50 ± 0.10
Humble teacher [26] | 16.96 ± 0.38 | 21.72 ± 0.24 | 27.70 ± 0.15 | 31.61 ± 0.28
Li, et al. [13] | 19.02 ± 0.25 | 23.34 ± 0.18 | 28.40 ± 0.15 | 32.23 ± 0.14
Unbiased teacher* [18] | 18.31 ± 0.44 | 22.39 ± 0.26 | 27.73 ± 0.13 | 31.52 ± 0.15
Ours | 19.64 ± 0.34 | 25.39 ± 0.13 | 30.83 ± 0.21 | 35.11 ± 0.12
Soft teacher [30] (LSJ) | 20.46 ± 0.39 | − | 30.74 ± 0.08 | 34.04 ± 0.14
Ours (LSJ) | 22.38 ± 0.31 | 27.20 ± 0.20 | 33.01 ± 0.14 | 37.13 ± 0.12

COCO-Standard. We compare Dense Teacher with several existing methods under the COCO-Standard setting in Table 2. When the labeled data varies from 2% to 10%, our model consistently shows superior results. Under the 1% labeled setting, the performance of Dense Teacher is lower than that of the Faster R-CNN based Unbiased Teacher. However, a more direct comparison between our method and Unbiased Teacher under the 1% setting on FCOS shows that Dense Teacher still leads by 1.2% mAP. Moreover, when applying large-scale jittering for augmentation following the implementation of Soft Teacher, our method obtains more significant improvements and becomes the new state of the art. VOC & COCO-Full. Results in Table 3 and Table 4 show that Dense Teacher leads the performance in both settings. On the VOC dataset, Dense Teacher improves its supervised baseline by 8.2% on AP50 and 10.0% on mAP (i.e., AP50:95). Under the COCO-Full setting, since the baselines reported in other works are not the same, we list the performance of each method in the form of "baseline→result". Our approach obtains a boost of 3.1% from the unlabeled2017 set, which is much higher than CSD, STAC, and Unbiased Teacher. We finally apply the large-scale jittering trick and a longer training schedule for a fair comparison with Soft Teacher, where Dense Teacher boosts mAP by 4.9%, reaching 46.12% mAP.


Table 3. Comparison with existing methods on Pascal VOC. Evaluations are performed on VOC07 test.

Method | Labeled | Unlabeled | AP50 | AP50:95
Supervised (Ours) | VOC07 | None | 71.69 | 45.87
CSD [10] | VOC07 | VOC12 | 74.7 | -
STAC [24] | VOC07 | VOC12 | 77.45 | 44.64
ISMT [31] | VOC07 | VOC12 | 77.23 | 46.23
Unbiased teacher [18] | VOC07 | VOC12 | 77.37 | 48.69
Li, et al. [13] | VOC07 | VOC12 | 79.00 | 54.60
Instant teaching [35] | VOC07 | VOC12 | 79.20 | 50.00
Ours | VOC07 | VOC12 | 79.89 | 55.87
CSD [10] | VOC07 | VOC12 + COCO20cls | 75.1 | -
STAC [24] | VOC07 | VOC12 + COCO20cls | 79.08 | 46.01
ISMT [31] | VOC07 | VOC12 + COCO20cls | 77.75 | 49.59
Unbiased teacher [18] | VOC07 | VOC12 + COCO20cls | 78.82 | 50.34
Li, et al. [13] | VOC07 | VOC12 + COCO20cls | 79.60 | 56.10
Instant teaching [35] | VOC07 | VOC12 + COCO20cls | 79.90 | 55.70
Ours | VOC07 | VOC12 + COCO20cls | 81.23 | 57.52

4.4 Ablation and Key Parameters

Effect of Hard Negative Samples. In Dense Teacher, hard negative samples/anchors can be better utilized. We explore three different strategies to study the impact of these hard negative regions, namely "suppressing", "ignoring", and "selecting". Results are shown in Table 5. We first suppress these samples to 0 and observe a significant performance drop on both the classification and regression branches compared to the original setting. Then, we ignore these samples when calculating the loss. This setting performs better than the "suppress" setting but still falls short of selecting them for training, indicating that learning to predict these hard negative samples positively affects model performance. Regression Branch. We have shown that unsupervised learning on the regression branch effectively improves model performance. However, since the predicted deltas in background regions are not meaningful in FCOS, the quality of pseudo-labels in this branch is highly dependent on the design of the learning region. As shown in Table 1, with our region division and with the pseudo-box based method, the model obtains gains of 1.8% mAP and 1.6% mAP, respectively. When using ground truth positives as the learning region (see Table 5, "suppress" and "ignore"), the model gains about 2% mAP. Therefore, our region division strategy produces sufficiently reliable regions for this task.

Since the “unlabeled images” under the COCO-Standard setting actually come with annotations, we can perform label assignments on images using these annotations.
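The three ablated treatments differ only in how the hard-negative positions enter the dense unsupervised loss. A minimal sketch, with hypothetical tensor names and a simplified target/mask formulation, might look as follows:

```python
import torch

def treat_hard_negatives(teacher_scores, hard_negative_mask, strategy):
    """Return (targets, loss_mask) for the dense unsupervised loss.

    Minimal sketch of the three ablated treatments of hard-negative
    positions; the names and masking formulation are ours.
    """
    targets = teacher_scores.clone()
    loss_mask = torch.ones_like(teacher_scores, dtype=torch.bool)
    if strategy == "suppress":
        targets[hard_negative_mask] = 0.0      # train them as pure background
    elif strategy == "ignore":
        loss_mask[hard_negative_mask] = False  # exclude them from the loss
    elif strategy == "select":
        pass                                   # keep the teacher's responses as targets
    else:
        raise ValueError(strategy)
    return targets, loss_mask
```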


Table 4. Experimental results on COCO-Full. Evaluations are done on COCO val2017. Note that 1x represents 90K training iterations and Nx represents N×90K iterations; large scale jittering is adopted for the 8x experiments.

Method | mAP | Gain
CSD [10] (3x) | 40.20 → 38.82 | −1.38
STAC [24] (6x) | 39.48 → 39.21 | −0.27
ISMT [31] | 37.81 → 39.64 | +1.83
Instant-teaching [35] | 37.63 → 40.20 | +2.57
Unbiased teacher [18] (3x) | 40.20 → 41.30 | +1.10
Humble teacher [26] (3x) | 37.63 → 42.37 | +4.74
Li, et al. [13] (3x) | 40.20 → 43.30 | +3.10
Ours (3x) | 41.22 → 43.90 | +2.66
Ours (6x) | 41.22 → 44.94 | +3.70
Soft teacher [30] (8x) | 40.90 → 44.60 | +3.70
Ours (8x) | 41.24 → 46.12 | +4.88

Effectiveness on Other Detectors and Datasets. Apart from the comparison with state-of-the-art methods, we also validate the effectiveness of our method on an anchor-based detector and on the CrowdHuman [22] dataset. For the anchor-based detector, we take RetinaNet as the representative. When compared with the pseudo-box based Unbiased Teacher, our method stays ahead: as shown in Table 6, Dense Teacher achieves a 2.2% mAP improvement over Unbiased Teacher. On the CrowdHuman dataset with the FCOS detector, our method obtains a 2.1% mMR improvement as well.

Table 5. Impact of strategies for dealing with hard negatives.

HN samples | Cls | Reg | AP
Suppress | ✓ | × | 31.56
Ignore | ✓ | × | 32.64
Select | ✓ | × | 33.34
Suppress | ✓ | ✓ | 33.47
Ignore | ✓ | ✓ | 34.72
Select | ✓ | ✓ | 35.11

Table 6. Extensive comparison on different detectors and datasets.

Method | Anchor | Dataset | AP/MR
UT [18] | ✓ | COCO | 28.9
Ours | ✓ | COCO | 31.1
UT [18] | × | CH [22] | 62.8
Ours | × | CH [22] | 60.7

The difference between our division (k = 1) and the assigned foreground is defined as hard negatives.

Table 7. Ablation study on hyper-parameters introduced by our method. COCO 10% stands for COCO-Standard 10%.

(a) Setting | k(%) | AP | AP50 | AP75
COCO 10% | 0.1 | 25.76 | 41.79 | 27.34
COCO 10% | 0.5 | 34.11 | 51.94 | 36.80
COCO 10% | 1 | 35.11 | 53.35 | 37.79
COCO 10% | 3 | 34.47 | 53.13 | 36.87
COCO 10% | 5 | 33.85 | 53.00 | 36.03

(b) Setting | wu | AP | AP50 | AP75
COCO 10% | 2 | 34.86 | 53.22 | 37.34
COCO 10% | 4 | 35.11 | 53.35 | 37.79
COCO 10% | 8 | 33.81 | 51.97 | 36.34
COCO-Full | 2 | 44.92 | 63.71 | 48.79
COCO-Full | 4 | 43.04 | 61.60 | 46.87

Size of Learning Region. We compare Dense Teacher's performance under different selecting ratios k in Table 7(a). As shown in the table, we obtain the best

performance when selecting 1% of the samples (over the 5 FPN levels) for unsupervised learning. According to our statistics, about 0.4% of the samples in COCO are positive under the label assignment rule of FCOS. Therefore, the optimal learning region not only contains the ground truth positives but also encompasses valuable hard negatives. This also suggests that, although the model performance is affected by this hyperparameter, the statistical characteristics of the dataset can help us determine its optimal value, which mitigates the difficulty of migrating and deploying the model. Unsupervised Loss Weight. The weight of the unsupervised data also has an important impact on the training results. The experimental results in Table 7(b) show that for a limited amount of supervised data, as in the COCO-Standard 10% setting, a relatively large weight of 4 is favorable, while for the COCO-Full setting, where much more labeled data is available, a weight of 2 is sufficient. We attribute this phenomenon to different degrees of overfitting. When only a small amount of annotations is available, a relatively large weight on the unsupervised part introduces stronger supervision. In contrast, when a large amount of labeled data is given, a smaller unsupervised weight allows the abundant labeled supervision to be better utilized.
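In other words, wu simply scales the unsupervised term in the overall objective. Assuming the standard weighted-sum formulation used by the compared pseudo-labeling frameworks, the total training loss can be written as:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{sup}} \;+\; w_u \, \mathcal{L}_{\mathrm{unsup}}
```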

5 Conclusion

In this paper, we revisit the form of pseudo-labels in existing semi-supervised object detection. By analyzing the various flaws caused by the lengthy pseudo-box generation pipeline, we point out that the pseudo-box is a sub-optimal choice for unlabeled data. To address this issue, we propose Dense Teacher, an SS-OD framework that adopts dense predictions from the teacher model as pseudo-labels for unlabeled data. Our approach is simpler yet stronger. We demonstrate its efficacy by comparing Dense Teacher with other pseudo-box based SS-OD algorithms on the MS-COCO and Pascal VOC benchmarks. Results on both benchmarks show that Dense Teacher achieves state-of-the-art performance.

References

1. Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in Neural Information Processing Systems, vol. 27 (2014)


2. Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in Neural Information Processing Systems, vol. 27 (2014) 3. Berthelot, D., et al.: Remixmatch: semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785 (2019) 4. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 5. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020) 6. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC 2012) Results. www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html 7. Ge, Z., Hu, C., Huang, X., Qiu, B., Yoshie, O.: Dualbox: generating bbox pair with strong correspondence via occlusion pattern clustering and proposal refinement. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2097–2102. IEEE (2021) 8. Ge, Z., Jie, Z., Huang, X., Xu, R., Yoshie, O.: PS-RCNN: detecting secondary human instances in a crowd via primary object suppression. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2020) 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 10. Jeong, J., Lee, S., Kim, J., Kwak, N.: Consistency-based semi-supervised learning for object detection. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 11. Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016) 12. Lee, D.H., et al.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks (2013) 13. Li, H., Wu, Z., Shrivastava, A., Davis, L.S.: Rethinking pseudo labels for semisupervised object detection. arXiv preprint arXiv:2106.00168 (2021) 14. Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. Adv. Neural. Inf. Process. Syst. 33, 21002–21012 (2020) 15. Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017) 16. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017) 17. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 18. Liu, Y.C., et al.: Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480 (2021) 19. Miyato, T., Maeda, S.I., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1979–1993 (2018)


20. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-0150816-y 21. Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 22. Shao, S., et al.: Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018) 23. Sohn, K., et al.: Fixmatch: simplifying semi-supervised learning with consistency and confidence. Adv. Neural. Inf. Process. Syst. 33, 596–608 (2020) 24. Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757 (2020) 25. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) 26. Tang, Y., Chen, W., Luo, Y., Zhang, Y.: Humble teachers teach better students for semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3132–3141 (2021) 27. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 28. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019) 29. Xie, Q., Luong, M.T., Hovy, E., Le, Q.V.: Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698 (2020) 30. Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069 (2021) 31. Yang, Q., Wei, X., Wang, B., Hua, X.S., Zhang, L.: Interactive self-training with mean teachers for semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5941– 5950 (2021) 32. Zhang, F., Pan, T., Wang, B.: Semi-supervised object detection with adaptive class-rebalancing self-training. arXiv preprint arXiv:2107.05031 (2021) 33. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768 (2020) 34. Zhixing, D., Zhang, R., Chang, M., Liu, S., Chen, T., Chen, Y., et al.: Distilling object detectors with feature richness. In: Advances in Neural Information Processing Systems, vol. 34 (2021) 35. Zhou, Q., Yu, C., Wang, Z., Qian, Q., Li, H.: Instant-teaching: an end-to-end semi-supervised object detection framework. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4081–4090 (2021)

Point-to-Box Network for Accurate Object Detection via Single Point Supervision

Pengfei Chen1, Xuehui Yu1, Xumeng Han1, Najmul Hassan2, Kai Wang2, Jiachen Li3, Jian Zhao4, Humphrey Shi2,3,5, Zhenjun Han1(B), and Qixiang Ye1

1 University of Chinese Academy of Sciences, Beijing, China {chenpengfei20,yuxuehui17,hanxumeng19}@mails.ucas.ac.cn, {hanzhj,qxye}@ucas.ac.cn
2 SHI Lab, University of Oregon, Eugene, USA
3 UIUC, Champaign, USA
4 Institute of North Electronic Equipment, Beijing, China [email protected]
5 Picsart AI Research (PAIR), Princeton, USA

Abstract. Object detection using single point supervision has received increasing attention over the years. However, the performance gap between point supervised object detection (PSOD) and bounding box supervised detection remains large. In this paper, we attribute such a large performance gap to the failure of generating high-quality proposal bags, which are crucial for multiple instance learning (MIL). To address this problem, we introduce a lightweight alternative to the off-the-shelf proposal (OTSP) methods and thereby create the Point-to-Box Network (P2BNet), which can construct an inter-objects balanced proposal bag by generating proposals in an anchor-like way. By fully investigating the accurate position information, P2BNet further constructs an instance-level bag, avoiding the mixture of multiple objects. Finally, a coarse-to-fine policy in a cascade fashion is utilized to improve the IoU between proposals and ground-truth (GT). Benefiting from these strategies, P2BNet is able to produce high-quality instance-level bags for object detection. P2BNet improves the mean average precision (AP) by more than 50% relative to the previous best PSOD method on the MS COCO dataset. It also demonstrates the great potential to bridge the performance gap between point supervised and bounding-box supervised detectors. The code will be released at www.github.com/ucas-vg/P2BNet.

Keywords: Object detection · Single point annotation · Point supervised object detection

1 Introduction

Object detectors [4,13,23,25,29,30,38,46] trained with accurate bounding box annotations have been well received in academia and industry. However, collecting high-quality bounding box annotations requires extensive human effort.


Fig. 1. Based on OTSP methods, the image-level bag in WSOD exhibits many problems: too much background, a mixture of different objects, and unbalanced, low-quality proposals. With point annotation, the previous work UFO2 filters most of the background in its first stage and splits bags for different objects during refinement. Our P2BNet produces balanced instance-level bags in the coarse stage and improves bag quality by adaptively sampling proposal boxes around the boxes estimated in the former stage for better optimization. Performance numbers are on COCO-14; the 27.6 AP50 is obtained with UFO2 using ResNet-50 and our point annotation for a fair comparison.

To solve this problem, weakly supervised object detection [2,6,8,39–41,49,51] (WSOD) replaces bounding box annotations with low-cost image-level annotations. However, lacking crucial location information and having difficulty distinguishing dense objects, WSOD methods perform poorly in complex scenarios. Point supervised object detection (PSOD), on the other hand, provides distinctive location information about the object and is much cheaper than bounding box supervision. Recently, point-based annotations have been widely used in many tasks, including object detection [28,32] and localization [33,37,45], instance segmentation [7], and action localization [21]. However, the performance gap between point supervised detection methods [28,32] and bounding box supervised detectors remains large. Although it is understandable that the location information provided by bounding boxes is richer than that of points, we argue that this is not the only reason. We believe most PSOD methods do not utilize the full potential of point-based annotations. Previous works use off-the-shelf proposal (OTSP) methods (e.g., Selective Search [34], MCG [1], and EdgeBox [53]) to obtain proposals for constructing bags. Despite the wide adoption of these OTSP-based methods in weakly supervised detectors, they suffer from the following problems, illustrated in Fig. 1:


Fig. 2. (a) The number of assigned proposal boxes per object produced by MCG (OTSP -based) is unbalanced, which is unfair for training. (b) Histogram of mIoUprop for different proposal generation methods. mIoUprop denotes the mean IoU between proposal boxes and ground-truth for an object. Small mIoUprop in MCG brings semantic confusion. Whereas for our P2BNet with refinement, large mIoUprop is beneficial for optimization. Statistics are on COCO-17 training set, and both figures have 50 bins.

1) There are too many background proposals in the bags: OTSP methods generate many proposal boxes that have no intersection with any of the foreground objects. 2) The positive proposals per object are unbalanced. The positive proposals per object produced by MCG on the COCO-17 training set are shown in Fig. 2(a), which is clearly off-balance. 3) The majority of the proposals in the bags have very low IoU, indicating low-quality proposals (Fig. 2(b)). Also, since previous PSOD methods only construct image-level bags, they cannot utilize the point annotations during MIL training, leading to a mixture of different objects in the same bag. All these problems limit the overall quality of the constructed bags, which contributes to the poor performance of the model. In this paper, we propose P2BNet as an alternative to the OTSP methods for generating high-quality object proposals. The number of proposals generated by P2BNet is balanced for each object, and they cover varied scales and aspect ratios. Additionally, the proposal bags are instance-level instead of image-level. This preserves the exclusivity of objects for a given proposal bag, which is very helpful during MIL training. To further improve the quality of the bag, a coarse-to-fine procedure is designed in a cascade fashion in P2BNet. The refinement stage consists of two parts: the coarse pseudo-box prediction (CBP) and the precise pseudo-box refinement (PBR). The CBP stage predicts the coarse scale (width and height) of objects, whereas the PBR stage iteratively finetunes the scale and position. Our P2BNet generates high-quality, balanced proposal bags and ensures the contribution of point annotations in all stages (before, during, and after MIL training). Detailed experiments on COCO demonstrate the effectiveness and robustness of our model, which outperforms previous point-based detectors by a large margin. Our main contributions are as follows:
– P2BNet, a generative and OTSP-free network, is designed for predicting pseudo boxes. It generates inter-objects balanced instance-level bags and is beneficial for better optimization of MIL training. In addition, P2BNet is much more time-efficient than the OTSP-based methods.


– A coarse-to-fine fashion in P2BNet, with the CBP and PBR stages, is proposed for higher-quality proposal bags and better prediction.
– The detection performance of our proposed P2BNet-FR framework with P2BNet under single quasi-center point supervision improves the mean average precision (AP) of the previous best PSOD method by more than 50% (relative) on COCO and narrows the gap to bounding box supervised detectors, achieving comparable performance on AP50.

2 Related Work

In this section, we briefly discuss the research status of box-supervised, imagelevel and point-level supervised object detection. 2.1

Box-Supervised Object Detection

Box-supervised object detection [4,13,23,25,29,30,38,46] is a traditional object detection paradigm that gives the network a specific category and box information. One-stage detectors based on sliding-window, like YOLO [29], SSD [25], and RetinaNet [23], predict classification and bounding-box regression through setting anchors. Two-stage detectors predict proposal boxes through OTSP methods (like selective search [34] in Fast R-CNN [13]) or deep networks (like RPN in Faster R-CNN [30]) and conduct classification and bounding-box regression with filtered proposal boxes sparsely. Transformer-based detectors (DETR [4], Deformable-DETR [52], and Swin-Transformer [26]) come, utilizing global information for better representation. Sparse R-CNN [38] combines the advantages of transformer and CNN to a sparse detector. [9,14,43] study on oriented object detection in aerial scenario. However, box-level annotation requires high costs. 2.2

Image-Supervised Object Detection

Image-supervised object detection [2,6,8,27,35,39–41,48,49,51] is the traditional field in WSOD. The traditional image-supervised WSOD methods can be divided into two styles: MIL-based [2,6,39–41], and CAM-based [8,49,51]. In MIL-based methods, a bag is positively labelled if it contains at least one positive instance; otherwise, it is negative. The objective of MIL is to select positive instances from a positive bag. WSDDN [2] introduced MIL into WSOD with a representative two-stream weakly supervised deep detection network that can classify positive proposals. OICR [39] introduces iterative fashion into WSOD and attempts to find the whole part instead of a discriminative part. PCL [40] develops the proposal cluster learning and uses the proposal clusters as supervision to indicate the rough locations where objects most likely appear. Subsequently, SLV [6] brings in spatial likelihood voting to replace the max score proposal, further looking for the whole context of objects. Our paper produces the anchor-like [30,35] proposals around the point annotation as a bag and uses instance-level MIL to train the classifier. It moves the fixed pre-generated proposals (e.g. OICR, PCL and UWSOD [35]) to achieve the coarse to fine purpose.


In CAM-based methods, the main idea is to produce the class activation maps (CAM) [51], use threshold to choose a high score region, and find the smallest circumscribed rectangle of the largest general domain. WCCN [8] uses a threestage cascade structure. The first stage produces the class activation maps and obtains the initial proposals, the second stage is a segmentation network for refining object localization, and the last stage is a MIL stage outputting the results. Acol [49] introduces two parallel-classifiers for object localization using adversarial complementary learning to alleviate the discriminative region. 2.3

Point-Supervised Object Detection

Point-level annotation is a fairly recent innovation. The average time for annotating a single point is about 1.87 s per image, close to image-level annotation (1.5 s/image) and much lower than that for bounding box (34.5 s/image). The statistics [11,28] are performed on VOC [10], which can be analogized to COCO [24]. [28] introduces center-click annotation to replace box supervision and estimates scale with the error between two times of center-click. [32] designs a network compatible with various supervision forms like tags, points, scribbles, and boxes annotation. However, these frameworks are based on OTSP methods and are not specially designed for point annotation. Therefore, the performance is limited and performs poorly in complex scenarios like the COCO [24] dataset. We introduce a new framework with P2BNet which is free of OTSP methods.

3 Point-to-Box Network

The P2BNet-FR framework consists of the Point-to-Box Network (P2BNet) and Faster R-CNN (FR). P2BNet predicts pseudo boxes with point annotations to train the detector. We use standard settings for Faster R-CNN without any bells and whistles. Hence, we go over the proposed P2BNet in detail in this section. The architecture of P2BNet is shown in Fig. 3, which includes the coarse pseudo box prediction (CBP) stage and the pseudo box refinement (PBR) stage. The CBP stage predicts the coarse scale (width and height) of objects, whereas the PBR stage iteratively finetunes the scale and position. The overall loss function of P2BNet is the summation of the losses of these two stages, i.e.,

$L_{p2b} = L_{cbp} + \sum_{t=1}^{T} L_{pbr}^{(t)}, \quad (1)$

where PBR includes T iterations, and $L_{pbr}^{(t)}$ is the loss of the t-th iteration.

3.1 Coarse Pseudo Box Prediction

Fig. 3. The architecture of P2BNet. Firstly, to predict coarse pseudo boxes in the CBP stage, proposal bags are sampled in a fixed manner around the point annotations for classifier training. Then, to predict refined pseudo boxes in the PBR stage, high-quality proposal bags and negative proposals are sampled with the coarse pseudo boxes for training. Finally, the pseudo boxes generated by the trained P2BNet serve as supervision for training the classic detector. (Best viewed in color.) (Color figure online)

In the CBP stage, firstly, proposal boxes of different widths and heights are generated in an anchor style for each object, taking the annotated point as the

box center. Secondly, features of the sampled proposals are extracted to train a MIL classifier for selecting the best-fitting proposal for each object. Finally, a top-k merging policy is utilized to estimate the coarse pseudo boxes.

CBP Sampling: fixed sampling around the annotated point. With the point annotation p = (px, py) as the center, s as the size, and v adjusting the aspect ratio, a proposal box b = (bx, by, bw, bh) is generated, i.e., b = (px, py, v·s, (1/v)·s). The schematic diagram of proposal box sampling is shown in Fig. 4 (Left). By adjusting s and v, each point annotation pj generates a bag of proposal boxes with different scales and aspect ratios, denoted by Bj (j ∈ {1, 2, ..., M}, where M is the number of objects). The detailed settings of s and v are given in the supplementary material. All proposal bags are utilized for training the MIL classifier in the CBP module, with the category labels of the points as supervision. A minor issue is that an oversized s may place most of b outside the image and introduce too many meaningless padding values. In this case, we clip b to guarantee that it is inside the image (see Fig. 4 (Left)), i.e.,

$b = \big(p_x,\; p_y,\; \min(v \cdot s,\; 2(p_x - 0),\; 2(W - p_x)),\; \min(\tfrac{1}{v} \cdot s,\; 2(p_y - 0),\; 2(H - p_y))\big), \quad (2)$

where W and H denote the image size, and (px − 0) and (W − px) are the distances from the center to the left and right edges of the image, respectively. A minimal sketch of this sampling procedure is given below.
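The following sketch generates such an anchor-like bag for one annotated point and clips each box to the image, following Eq. 2; the concrete sets of sizes s and aspect-ratio factors v are assumptions (the paper's exact values are in its supplementary material).

```python
def cbp_sampling(px, py, W, H,
                 sizes=(32, 64, 128, 256),         # assumed scale set s
                 aspect_factors=(0.5, 1.0, 2.0)):  # assumed aspect-ratio set v
    """Generate a bag of center-aligned proposals (x, y, w, h) for one point."""
    bag = []
    for s in sizes:
        for v in aspect_factors:
            w = min(v * s, 2 * (px - 0), 2 * (W - px))   # clip width (Eq. 2)
            h = min(s / v, 2 * (py - 0), 2 * (H - py))   # clip height (Eq. 2)
            bag.append((px, py, w, h))
    return bag

# Example: a point annotation at (300, 200) in a 640x480 image.
proposal_bag = cbp_sampling(300, 200, W=640, H=480)
```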


Fig. 4. Details of the sampling strategies in the CBP stage and the PBR stage. The arrows in PBR sampling indicate the offsets of the center jitter. Samples are obtained through center jitter followed by scale and aspect ratio jitter in PBR sampling.

CBP Module. For a proposal bag Bj, features Fj ∈ R^{U×D} are extracted through 7 × 7 RoIAlign [15] and two fully connected (fc) layers, where U is the number of proposals in Bj and D is the feature dimension. We refer to WSDDN [2] and design a two-stream structure as the MIL classifier to find the bounding box region that best represents the object. Specifically, applying the classification branch fcls to Fj yields O^cls_j ∈ R^{U×K}, which is then passed through an activation function to obtain the classification score S^cls_j ∈ R^{U×K}, where K is the number of instance categories. Likewise, the instance score S^ins_j ∈ R^{U×K} is obtained through the instance selection branch fins and an activation function, i.e.,

$O^{cls}_j = f_{cls}(F_j), \quad [S^{cls}_j]_{uk} = \frac{e^{[O^{cls}_j]_{uk}}}{\sum_{i=1}^{K} e^{[O^{cls}_j]_{ui}}}; \quad (3)$

$O^{ins}_j = f_{ins}(F_j), \quad [S^{ins}_j]_{uk} = \frac{e^{[O^{ins}_j]_{uk}}}{\sum_{i=1}^{U} e^{[O^{ins}_j]_{ik}}}, \quad (4)$

where [·]_{uk} denotes the value at row u and column k of the matrix. The proposal score Sj is obtained by computing the Hadamard product of the classification score and the instance score, and the bag score S̃j is obtained by summing the proposal scores over the U proposal boxes, i.e.,

$S_j = S^{cls}_j \odot S^{ins}_j \in \mathbb{R}^{U \times K}, \quad \tilde{S}_j = \sum_{u=1}^{U} [S_j]_u \in \mathbb{R}^{K}. \quad (5)$

S̃j can be seen as the weighted summation of the classification scores [S^cls_j]_u with the corresponding selection scores [S^ins_j]_u.

CBP Loss. The MIL loss in the CBP module (termed L_mil1 to distinguish it from the MIL loss in PBR) uses the form of a cross-entropy loss, defined as:

$L_{cbp} = \alpha_{mil1} L_{mil1} = -\frac{\alpha_{mil1}}{M} \sum_{j=1}^{M} \sum_{k=1}^{K} [c_j]_k \log([\tilde{S}_j]_k) + (1 - [c_j]_k) \log(1 - [\tilde{S}_j]_k), \quad (6)$

where c_j ∈ {0, 1}^K is the one-hot category label and α_mil1 is 0.25. The CBP loss drives each proposal to correctly predict the category and the instance it belongs to.
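The two-stream scoring and the bag-level cross-entropy of Eqs. 3–6 can be sketched as follows. This is a minimal PyTorch-style illustration for a single bag with assumed tensor sizes, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cbp_mil_loss(feat, f_cls, f_ins, label, alpha_mil1=0.25):
    """Two-stream MIL over one proposal bag.

    feat:  (U, D) proposal features; f_cls, f_ins: linear layers D -> K;
    label: (K,) one-hot category vector for the annotated point.
    """
    s_cls = F.softmax(f_cls(feat), dim=1)    # Eq. 3: softmax over categories
    s_ins = F.softmax(f_ins(feat), dim=0)    # Eq. 4: softmax over proposals
    s = s_cls * s_ins                        # Eq. 5: Hadamard product, (U, K)
    bag_score = s.sum(dim=0).clamp(1e-6, 1 - 1e-6)   # Eq. 5: bag score, (K,)
    loss = -(label * bag_score.log()
             + (1 - label) * (1 - bag_score).log()).sum()  # Eq. 6 (one bag)
    return alpha_mil1 * loss, s

# Example with assumed sizes: U = 48 proposals, D = 256 features, K = 80 classes.
feat = torch.randn(48, 256)
f_cls = torch.nn.Linear(256, 80)
f_ins = torch.nn.Linear(256, 80)
label = F.one_hot(torch.tensor(3), 80).float()
loss, proposal_scores = cbp_mil_loss(feat, f_cls, f_ins, label)
```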


Finally, the top-k boxes with the highest proposal scores Sj for each object are weighted to obtain the coarse pseudo boxes used in the following PBR sampling.

3.2 Pseudo Box Refinement

The PBR stage aims to finetune the position, width and height of pseudo boxes, and it can be performed iteratively in a cascaded fashion for better performance. By adjusting the height and width of the pseudo box obtained in the previous stage (or iteration) in a small span while jittering its center position, finer proposal boxes are generated as positive examples for module training. Further, because the positive proposal bags are generated in the local region, negative samples can be sampled far from the proposal bags to suppress the background. The PBR module also weights the top-k proposals with the highest predicted scores to obtain the refined pseudo boxes, which are the final output of P2BNet. PBR Sampling. Adaptive sampling around estimated boxes. As shown in Fig. 4 (Right), for each coarse pseudo box b∗ = (b∗x , b∗y , b∗w , b∗h ) obtained in the previous stage (or iteration), we adjust its scale and aspect ratio with s and v, and jitter its position with ox , oy to obtain the finer proposal b = (bx , by , bw , bh ): bw = v · s · b∗w , bx = b∗x + bw · ox ,

bh =

1 · s · b∗h , v

by = b∗y + bh · oy .

(7) (8)

These finer proposals are used as positive proposal bag Bj to train PBR module. Furthermore, to better suppress the background, negative samples are introduced in the PBR sampling. We randomly sample many proposal boxes, which have small IoU (by default set as smaller than 0.3) with all positive proposals in all bags, to compose the negative sample set N for the PBR module. Through sampling proposal boxes by pseudo box distribution, high-quality proposal boxes are obtained for better optimization (shown in Fig. 5). PBR Module. The PBR module has a similar structure to the CBP module. It shares the backbone network and two fully connected layers with CBP, and also has a classification branch fcls and an instance selection branch fins . Note that fcls and fins do not share parameters between different stages and iterations. For instance selection branch, we adopt the same structure as the CBP module, for the proposal bag Bj . and utilize Eq. 4 to predict the instance score Sins j Differently, the classification branch uses the sigmoid activation function σ(x) to predict the classification score Scls j , i.e., σ(x) = 1/(1 + e−x ),

U ×K Scls . j = σ(fcls (Fj )) ∈ R

(9)

This form makes it possible to perform multi-label classification, which can distinguish overlapping proposal boxes from different objects. According to the  ∗ is calculated using Scls and Sins of the current stage. form of Eq. 5, bag score S j j j

Point-to-Box Network for PSOD

59

Fig. 5. The progression of the mIoUprop during refinement. By statistics, the mIoUpred is gradually increasing in the PBR stage, indicating that the quality of the proposal bag improves in iterative refinement

For the negative sample set N , we calculate its classification score as: |N |×K Scls . neg = σ(fcls (Fneg )) ∈ R

(10)

PBR Loss. The PBR loss consists of MIL loss Lmil2 for positive bags and negative loss Lneg for negative samples, i.e., Lpbr = αmil2 Lmil2 + αneg Lneg ,

(11)

where αmil2 = 0.25 and αneg = 0.75 are the settings in this paper. 1) MIL Loss. The MIL loss Lmil2 in the PBR stage is defined as: FL(ζ, τ ) = −

K 

[τ ]k (1 − [ζ]k )γ log([ζ]k ) + (1 − [τ ]k )([ζ]k )γ log(1 − [ζ]k ),

k=1

(12) Lmil2

M 1  T ∗  j , cj ), = c , S  · FL(S M j=1 j j

(13)

∗ where FL(ζ, τ ) is the focal loss [23], and γ is set as 2 following [23]. S j represents the bag score of the last PBR iteration (for the first iteration of ∗ PBR, using the bag score in CBP). cT j , Sj  represents the inner product of the two vectors, which means the predicted bag score of the previous stage or iteration on ground-truth category. Score is used to weight the FL of each object for stable training. 2) Negative Loss. Conventional MIL treats proposal boxes belonging to other categories as negative samples. In order to further suppress the backgrounds, we sample more negative samples in the PBR stage and introduce the negative loss (γ is also set to 2 following FL), i.e., β=

M 1  T ∗ c , S , M j=1 j j

K

Lneg = −

1  γ cls β · ([Scls neg ]k ) log(1 − [Sneg ]k ). |N | N k=1

(14)

60

4 4.1

P. Chen et al.

Experiments Experiment Settings

Datasets and Evaluate Metrics. For experiments, we use the public available MS COCO [24] dataset. COCO has 80 different categories and two versions. COCO-14 has 80K training and 40K validation images whereas COCO-17 has 118K training and 5K validation images. Since the ground truth on the test set is not released, we train our model on the training set and evaluate it on the validation set reporting AP50 and AP (averaged over IoU thresholds in [0.5 : 0.05 : 0.95]) on COCO. The mIoUpred is calculated by the mean IoU between predicted pseudo boxes and their corresponding ground-truth bounding-boxes of all objects in the training set. It can directly evaluate the ability of P2BNet to transform annotated points into accurate pseudo boxes. Implementation Details. Our codes of P2BNet-FR are based on MMDetection [5]. The stochastic gradient descent (SGD [3]) algorithm is used to optimize in 1× training schedule. The learning rate is set to 0.02 and decays by 0.1 at the 8-th and 11-th epochs, respectively. In P2BNet, we use multi-scale (480, 576, 688, 864, 1000, 1200) as the short side to resize the image during training and single-scale (1200) during inference. We choose the classic Faster R-CNN FPN [22,30] (backbone is ResNet-50 [16]) as the detector with the default setting, and single-scale (800) images are used during training and inference. More details are included in the supplementary section. Quasi-Center Point Annotation. We propose a quasi-center (QC) point annotation that is friendly for object detection tasks with a low cost. In practical scenarios, we ask annotators to annotate the object in the non-high limit center region with a loose rule. Since datasets in the experiment are already annotated with bounding boxes or masks, it is reasonable that the manually annotated points follow Gaussian distribution in the central region. We utilize Rectified Gaussian Distribution (RG) defined in [45] with central ellipse constraints. For a bounding box of b = (bx , by , bw , bh ), its central ellipse can be defined as Ellipse(κ), using (bx , by ) as the ellipse center and (κ · bw , κ · bh ) as the two axes of the ellipse. In addition, in view of the fact that the absolute position offset for a large object is too large under the above rule, we limit the two axes to no longer than 96 pixels. If the object’s mask M ask overlaps with the central ellipse Ellipse(κ), V is used to denote the intersection. If there is no intersecting area, V represents the entire M ask. When generated from bounding box annotations, the boxes are treated as masks. Then RG is defined as,   Gauss(p;μ,σ) ,p∈V Gauss(p;μ,σ)dp V (15) RG(p; μ, σ, κ) = 0, p∈ /V where μ and σ are mean and standard deviation of RG. κ decides the Ellipse(κ). In this paper, RG(p; 0, 14 , 14 ) is chosen to generate the QC point annotations.

Point-to-Box Network for PSOD

61

Table 1. The performance comparison of box-supervised, image-supervised, and pointsupervised detectors on COCO dataset. ∗ means UFO2 with image-level annotation. † means the performance we reproduce with the original setting. ‡ means we re-implement UFO2 with our QC point annotation. The performance of P2BNet-FR, UFO2 , and the box-supervised detector is tested on a single scale dataset. Our P2BNet-FR is based on P2BNet with top-4 merging and one PBR stage. SS is selective search [34], PP means proposal box defined in [38], and Free represents OTSP-free based method. Method

Backbone Proposal COCO-14

COCO-17

AP

AP50 AP

AP50

Box-supervised detectors Fast R-CNN [13]

VGG-16

SS

18.9

38.6

19.3

39.3

Faster R-CNN [30]

VGG-16

RPN

21.2

41.5

21.5

42.1

FPN [5]

R-50

RPN

35.5 56.7

37.4

58.1

RetinaNet [5, 23]

R-50

-

34.3

53.3

36.5

55.4

Reppoint [5, 44]

R-50

-

-

-

37.0

56.7

Sparse R-CNN [5, 38]

R-50

PP

-

-

37.9 56.0

Image-supervised detectors OICR+Fast [13, 39]

VGG-16

SS

7.7

17.4

-

-

PCL [40]

VGG-16

SS

8.5

19.4

-

-

PCL+Fast [13, 40]

VGG-16

SS

9.2

19.6

-

-

MEFF+Fast [12, 13]

VGG-16

SS

8.9

19.3

-

-

C-MIDN [42]

VGG-16

SS

9.6

21.4

-

-

WSOD2 [47]

VGG-16

SS

10.8

22.7

-

-

UFO2∗ [32]

VGG-16

MCG

10.8

23.1

-

-

GradingNet-C-MIL [18] VGG-16

SS

11.6

25.0

-

-

ICMWSD [31]

VGG-16

MCG

11.4

24.3

-

-

ICMWSD [31]

R-50

MCG

12.6

26.1

-

-

ICMWSD [31]

R-101

MCG

13.0

26.3

-

-

CASD [17]

VGG-16

SS

12.8

26.4

-

-

CASD [17]

R-50

SS

13.9 27.8

-

-

Point-supervised detectors

4.2

Click [28]

AlexNet

SS

-

18.4

-

-

UFO2 [32]

VGG-16

MCG

12.4

27.0

-

-

UFO2† [32]

VGG-16

MCG

12.8

26.6

13.2

27.2

UFO2‡ [32]

VGG-16

MCG

12.7

26.5

13.5

27.9

UFO2‡ [32]

R-50

MCG

12.6

27.6

13.2

28.9

P2BNet-FR (Ours)

R-50

Free

19.4 43.5

22.1 47.3

Performance Comparisons

Unless otherwise specified, the default components of our P2BNet-FR framework are P2BNet and Faster R-CNN. We compare the P2BNet-FR with the existing PSOD methods while choosing the state-of-the-art UFO2 [32] framework as the baseline for comprehensive comparisons. In addition, to demonstrate the perfor-

62

P. Chen et al. Table 2. Ablation study (Part I) CBP stage

PBR stage

Performance

Lpos Lmil1 Lmil2 Lneg Lpesudo mIoUpred     

  

 



AP

AP50

25.0 50.2

2.9 13.7

10.3 37.8

52.0 57.4 56.7

12.7 21.7 18.5

35.4 46.1 44.1

(a) The effectiveness of training loss in P2BNet: Lmil1 in CBP stage, Lmil2 and Lneg in PBR stage. Lpos and Lpesudo is for comparison. top-k mIoUpred 1 3 4 7 10

49.2 54.7 57.5 57.4 57.1

AP

AP50

T mIoUpred

12.2 21.3 22.1 21.7 21.5

35.9 46.6 47.3 46.1 46.0

0 1 2 3

(b) The top-k policy for box merging. k is set the same for all stages.

50.2 57.4 57.0 56.2

AP

AP50

13.7 21.7 21.9 21.3

37.8 46.1 46.1 45.6

(c) The number of iterations T in the PBR stage. T = 0 means only the CBP stage is conducted.

mance advantages of the PSOD methods, we compare them with the state-ofthe-art WSOD methods. At the same time, we compare the performance of the box-supervised object detectors to reflect their performance upper bound. Comparison with PSOD Methods. We compare the existing PSOD methods Click [28] and UFO2 [32] on COCO, as shown in Table 1. Both Click and UFO2 utilize OTSP-based methods (SS [34] or MCG [1]) to generate proposal boxes. Since the point annotation used by UFO2 is different from the QC point proposed in this paper, for a fair comparison, we re-train UFO2 on the public code with our QC point annotation. In addition, the previous methods are mainly based on VGG-16 [36] or AlexNet [20]. For consistency, we extend the UFO2 to the ResNet-50 FPN backbone and compare it with our framework. In comparison with Click and UFO2 , our P2BNet-FR framework outperforms them by a large margin. On COCO-14, P2BNet-FR improves AP and AP50 by 6.8 and 15.9, respectively. Also, our framework significantly outperforms state-of-the-art performance by 8.9 AP and 18.4 AP50 on COCO-17. In Fig. 6, the visualization shows our P2BNet-FR makes full use of the precise location information of point annotation and can distinguish dense objects in complex scenes. Comparison with WSOD Methods. We compare the proposed framework to the state-of-the-art WSOD methods on the COCO-14 in Table 1. The performance of P2BNet-FR proves that compared with WSOD, PSOD significantly improves the detection performance with little increase in the annotation cost, showing that the PSOD task has great prospects for development. Comparison with Box-Supervised Methods. In order to verify the feasibility of P2BNet-FR in practical applications and show the upper bound under this supervised manner, we compare the box-supervised detector [30] in Table 1. Under AP50 , P2BNet-FR-R50 (47.3 AP50 ) is much closer to box-supervised

Point-to-Box Network for PSOD

63

Table 3. Ablation study (Part II) Methods

AR1 AR10

AR100

UFO2 14.7 22.6 P2BNet-FR 21.3 32.8

23.3 34.2

Detectors

(a) Comparisons of average recall for UFO2 and P2BNet-FR. Balance  -

AP AP50 Jitter 21.7 46.1  12.9 36.0 -

(b) Unbalance issue.

AP AP50 21.7 46.1 14.2 38.2

(c) Jitter strategy.

RetinaNet [23] Reppoint [44] Sparse R-CNN [38] FR-FPN [22, 30]

GT box AP AP50 36.5 37.0 37.9 37.4

55.4 56.7 56.0 58.1

Pseudo box AP AP50 21.0 20.8 21.1 22.1

44.9 45.1 43.3 47.3

(d) Performance of different detectors on ground-truth box annotations and pseudo boxes generated by P2BNet. We use the top-4 for box merging.

detector FPN-R50 (58.1 AP50 ) than previous WSOD and PSOD method. It shows that PSOD can be applied in industries that are less demanding on box quality and more inclined to find objects [19,50], with greatly reduced annotation cost. 4.3

Ablation Study

In this section, all the ablation studies are conducted on t he COCO-17 dataset. The top-k setting is k = 7 except for the box merging policy part in Table 2(b) and different detectors part (k = 4) in Table 3(d). Training Loss in P2BNet. The ablation study of the training loss in P2BNet is shown in Table 2(a). 1) CBP loss. Only with Lmil1 in the CBP stage, we can obtain 13.7 AP and 37.8 AP50 . For comparison, we conduct Lpos , which views all the proposal boxes in the bag as positive samples. We find it hard to optimize, and the performance is bad, demonstrating the effectiveness of our proposed Lmil1 for pseudo box prediction. Coarse proposal bags can cover most objects in high IoU, resulting in a low missing rate. However, the performance still has the potential to be refined because the scale and aspect ratio are coarse, and the center position needs adjustment. 2) PBR loss. With a refined sampling of proposal bag (shown in Fig. 5), corresponding PBR loss is introduced. Only with Lmil2 , the performance is just 12.7 AP. The main reasons of performance degradation are error accumulation in a cascade fashion and lacking negative samples for focal loss. There are no explicit negative samples to suppress background for Sigmoid activation function, negative sampling and negative loss Lneg is introduced. Performance increases by 9.0 AP and 10.7 AP50 , indicating that it is essential and effectively improves the optimization. We also evaluate the mIoUpred to discuss the predicted pseudo box’s quality. In the PBR stage with Lmil2 and Lneg , the mIoU increases from 50.2 to 57.4, suggesting better quality of the pseudo box. Motivated by [45], we conduct Lpesudo , viewing pseudo boxes from the CBP stage as positive samples. However, the Lpesudo limits the refinement and the performance decreases. In Table 3(c), if we remove the jitter strategy of proposal boxes in PBR stage, the performance drops to 14.2 AP.

64

P. Chen et al.

Fig. 6. Visualization of detection results of P2BNet-FR and UFO2 . Our P2BNet-FR can distinguish dense objects and perform well in complex scene. (Best viewed in color.) (Color figure online)

Number of Refinements in PBR. Refining pseudo boxes is a vital part of P2BNet, and the cascade structure is used for iterative refinement to improve performance. Table 2(c) shows the effect of the refining number in the PBR stage. One refinement brings a performance gain of 8.0 AP, up to a competitive 21.7 AP. The highest 21.9 AP is obtained with two refinements, and the performance is saturated. We choose one refinement as the default configuration. Box Merging Policy. We use the top-k score average weight as our merging policy. We find that the hyper-parameter k is slightly sensitive and can be easily generalized to other datasets, as presented in Table 2(b), and only the top-1 or top-f ew proposal box plays a leading role in box merging. The best performance is 22.1 AP and 47.3 AP50 when k = 4. The mIoUpred between the pseudo box and ground-truth box is 57.5. In inference, if bag score S is replaced by classification score Scls for merging, the performance drops to 17.4 AP (vs 21.7 AP). Average Recall. In Table 3(a), the AR in UFO2 is 23.3, indicating a higher missing rate. Whereas the P2BNet-FR obtains 34.2 AR, far beyond that of the UFO2 . It shows our OTSP-free method is better at finding objects. Unbalance Sampling Analysis. To demonstrate the effect of unbalance sampling, we sample different numbers of proposal boxes for each object and keep them constant in every epoch during the training period. The performance drops in Table 3(b) suggests the negative impact of unbalanced sampling. Different Detectors. We train different detectors [22,23,30,38,44] for the integrity experiments, all of which are conducted on R-50, as shown in Table 3(d). Our framework exhibits competitive performance on other detectors. Box supervised performances are listed to demonstrate the upper bound of our framework.

5 Conclusion

In this paper, we give an in-depth analysis of shortcomings in OTSP-based PSOD frameworks, and further propose a novel OTSP-free network termed P2BNet to obtain inter-objects balanced and high-quality proposal bags. The coarse-to-fine strategy divides the prediction of pseudo boxes into CBP and PBR stages. In the CBP stage, fixed sampling is performed around the annotated points, and coarse pseudo boxes are predicted through instance-level MIL. The PBR stage performs adaptive sampling around the estimated boxes to finetune the predicted boxes in a cascaded fashion. As mentioned above, P2BNet takes full advantage of point information to generate high-quality proposal bags, which is more conducive to optimizing the detector (FR). Remarkably, the conceptually simple P2BNet-FR framework yields state-of-the-art performance with single point annotation. Acknowledgements. This work was supported in part by the Youth Innovation Promotion Association CAS, the National Natural Science Foundation of China (NSFC) under Grant No. 61836012, 61771447 and 62006244, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No.XDA27000000, and Young Elite Scientist Sponsorship Program of China Association for Science and Technology YESS20200140.

References 1. Arbel´ aez, P.A., Pont-Tuset, J., et al.: Multiscale combinatorial grouping. In: CVPR (2014) 2. Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016) 3. Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., M¨ uller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8 25 4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 5. Chen, K., Wang, J., Pang, J.E.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019) 6. Chen, Z., Fu, Z., et al.: SLV: spatial likelihood voting for weakly supervised object detection. In: CVPR (2020) 7. Cheng, B., Parkhi, O., Kirillov, A.: Pointly-supervised instance segmentation. CoRR (2021) 8. Diba, A., Sharma, V., et al.: Weakly supervised cascaded convolutional networks. In: CVPR (2017) 9. Ding, J., Xue, N., Long, Y., Xia, G., Lu, Q.: Learning RoI transformer for oriented object detection in aerial images. In: CVPR (2019) 10. Everingham, M., Gool, L.V., et al.: The pascal visual object classes (VOC) challenge. In: IJCV (2010) 11. Gao, M., Li, A., et al.: C-WSL: count-guided weakly supervised localization. In: ECCV (2018)


12. Ge, W., Yang, S., Yu, Y.: Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In: CVPR (2018) 13. Girshick, R.B.: Fast R-CNN. In: ICCV (2015) 14. Guo, Z., Liu, C., Zhang, X., Jiao, J., Ji, X., Ye, Q.: Beyond bounding-box: convexhull feature adaptation for oriented and densely packed object detection. In: CVPR (2021) 15. He, K., Gkioxari, G., et al.: Mask R-CNN. In: ICCV (2017) 16. He, K., Zhang, X., et al.: Deep residual learning for image recognition. In: CVPR (2016) 17. Huang, Z., Zou, Y., et al.: Comprehensive attention self-distillation for weaklysupervised object detection. In: NeurIPS (2020) 18. Jia, Q., Wei, S., et al.: Gradingnet: towards providing reliable supervisions for weakly supervised object detection by grading the box candidates. In: AAAI (2021) 19. Jiang, N., et al.: Anti-UAV: a large multi-modal benchmark for UAV tracking. IEEE TMM (2021) 20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012) 21. Lee, P., Byun, H.: Learning action completeness from points for weakly-supervised temporal action localization. In: ICCV (2021) 22. Lin, T., Doll´ ar, P., et al.: Feature pyramid networks for object detection. In: CVPR (2017) 23. Lin, T., Goyal, P., et al.: Focal loss for dense object detection. In: ICCV (2017) 24. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 25. Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 2 26. Liu, Z., Lin, Y., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021) 27. Meng, M., Zhang, T., Yang, W., Zhao, J., Zhang, Y., Wu, F.: Diverse complementary part mining for weakly supervised object localization. IEEE TIP 31, 1774–1788 (2022) 28. Papadopoulos, D.P., Uijlings, J.R.R., et al.: Training object class detectors with click supervision. In: CVPR (2017) 29. Redmon, J., Divvala, S.K., et al.: You only look once: unified, real-time object detection. In: CVPR (2016) 30. Ren, S., He, K., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE TPAMI 39(6), 1137–1149 (2017) 31. Ren, Z., Yu, Z., et al.: Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In: CVPR (2020) 32. Ren, Z., Yu, Z., Yang, X., Liu, M.-Y., Schwing, A.G., Kautz, J.: UFO2 : a unified framework towards omni-supervised object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 288–313. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7 18 33. Ribera, J., Guera, D., Chen, Y., Delp, E.J.: Locating objects without bounding boxes. In: CVPR (2019) 34. van de Sande, K.E.A., Uijlings, J.R.R., et al.: Segmentation as selective search for object recognition. In: ICCV (2011)


35. Shen, Y., Ji, R., Chen, Z., Wu, Y., Huang, F.: UWSOD: toward fully-supervisedlevel capacity weakly supervised object detection. In: NeurIPS (2020) 36. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015) 37. Song, Q., et al.: Rethinking counting and localization in crowds: a purely pointbased framework. In: ICCV (2021) 38. Sun, P., Zhang, R., et al.: Sparse R-CNN: end-to-end object detection with learnable proposals. In: CVPR (2021) 39. Tang, P., et al.: Multiple instance detection network with online instance classifier refinement. In: CVPR (2017) 40. Tang, P., Wang, X., et al.: PCL: proposal cluster learning for weakly supervised object detection. IEEE TPAMI 42(1), 176–191 (2020) 41. Wan, F., Wei, P., et al.: Min-entropy latent model for weakly supervised object detection. IEEE TPAMI 41(10), 2395–2409 (2019) 42. Yan, G., Liu, B., et al.: C-MIDN: coupled multiple instance detection network with segmentation guidance for weakly supervised object detection. In: ICCV (2019) 43. Yang, X., Yan, J., Feng, Z., He, T.: R3Det: refined single-stage detector with feature refinement for rotating object. In: AAAI (2021) 44. Yang, Z., Liu, S., et al.: Reppoints: point set representation for object detection. In: ICCV (2019) 45. Yu, X., Chen, P., et al.: Object localization under single coarse point supervision. In: CVPR (2022) 46. Yu, X., Gong, Y., et al.: Scale match for tiny person detection. In: IEEE WACV (2020) 47. Zeng, Z., Liu, B., et al.: WSOD2: learning bottom-up and top-down objectness distillation for weakly-supervised object detection. In: ICCV (2019) 48. Zhang, D., Han, J., Cheng, G., Yang, M.: Weakly supervised object localization and detection: a survey. IEEE TPAMI 44(9), 5866–5885 (2021) 49. Zhang, X., Wei, Y., et al.: Adversarial complementary learning for weakly supervised object localization. In: CVPR (2018) 50. Zhao, J., et al.: The 2nd anti-UAV workshop & challenge: methods and results. In: ICCVW 2021 (2021) 51. Zhou, B., Khosla, A., et al.: Learning deep features for discriminative localization. In: CVPR (2016) 52. Zhu, X., Su, W., et al.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021) 53. Zitnick, C.L., Doll´ ar, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-31910602-1 26

Domain Adaptive Hand Keypoint and Pixel Localization in the Wild

Takehiko Ohkawa1,2(B), Yu-Jhe Li2, Qichen Fu2, Ryosuke Furuta1, Kris M. Kitani2, and Yoichi Sato1

1 The University of Tokyo, Tokyo, Japan
{ohkawa-t,furuta,ysato}@iis.u-tokyo.ac.jp
2 Carnegie Mellon University, Pittsburgh, PA, USA
{yujheli,qichenf,kkitani}@cs.cmu.edu
https://tkhkaeio.github.io/projects/22-hand-ps-da/

Abstract. We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that a model trained for both tasks works under various imaging conditions, yet the variation covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt the model trained on the labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from the unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to the noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of each target image for both tasks. These predictions are given by two separate networks, and their divergence helps identify the noisy predictions. To integrate the proposed confidence estimation into self-training, we propose a teacher-student framework in which the two networks (teachers) provide supervision to a third network (student) for self-training, and the teachers are learned from the student by knowledge distillation. Our experiments show its superiority over state-of-the-art methods in adaptation settings with different lighting, grasping objects, backgrounds, and camera viewpoints. Our method improves the multi-task score on HO3D by 4% over the latest adversarial adaptation method. We also validate our method on Ego4D, egocentric videos with rapid changes in imaging conditions outdoors.

1 Introduction

Fig. 1. We aim to adapt the model of localizing hand keypoints and pixel-level hand masks to new imaging conditions without annotation.

In the real world, hand keypoint regression and hand segmentation are considered important to work under broad imaging conditions for various computer

vision applications, such as egocentric video understanding [17,28], hand-object interaction analysis [12,21], AR/VR [39,70], and assistive technology [37,40]. To build models for both tasks, several labeled hand datasets have been created in laboratory settings, such as multi-camera studios [13,34,46,80] and setups that attach sensors to hands [24,26,75]. However, the imaging conditions of these datasets do not adequately cover real-world conditions [51], which vary in lighting, hand-held objects, backgrounds, and camera viewpoints. In addition, the annotation of keypoints and pixel-level masks is not always available in real-world environments because it is labor-intensive to acquire. As shown in Fig. 1, when localizing hand keypoints and pixels in real-world egocentric videos [28] (e.g., outdoors), we may only have access to a hand dataset [13] taken under completely different imaging conditions (e.g., indoors). Given these limitations, we need methods that can robustly adapt the models trained on the available labeled images (source) to unlabeled images (target) with new imaging conditions.

To enable such adaptation, self-training domain adaptation has been developed for both tasks. This approach learns from unlabeled target images by optimizing a self-supervised task and has proven effective in various domain adaptation tasks [7,15,18,67,77]. For keypoint estimation, consistency training, which regularizes keypoint predictions to be consistent under geometric transformations, has been proposed [66,73,77]. For hand segmentation, prior studies use pseudo-labeling [7,53], which produces hard labels by thresholding a predicted class probability for updating a network. However, these self-training methods perform well only when the predictions are reasonably correct. When the predictions become noisy due to the gap in imaging conditions, the trained network over-fits to the noisy predictions, resulting in poor performance in the target domain. To avoid this, it is crucial to assign a low importance (confidence) weight to the self-training loss computed on noisy predictions. This confidence weighting mitigates the distraction caused by the noisy predictions.

To this end, we propose self-training domain adaptation with confidence estimation for hand keypoint regression and hand segmentation. Our proposed method consists of (i) confidence estimation based on the divergence of two networks' predictions and (ii) an update rule that integrates a training network for self-training and the two networks for confidence estimation.


To (i) estimate confidence, we utilize the predictions of two different networks. While class probability can serve as the confidence in classification tasks, it is not trivial to obtain such a measure in keypoint regression. Thus, we newly focus on the divergence of the two networks' predictions for each target image. We design the two networks to share an identical architecture but have different learning parameters. We observe that when the divergence measure is high, the predictions of both networks are noisy and should be avoided in self-training.

To (ii) integrate the estimated confidence into self-training, inspired by the single-teacher-single-student update [54,64], we develop mutual training in which a training network (student) performs self-training based on consistency training and the two networks (teachers) are updated by distillation. For training the student network, we build a unified self-training framework that works favorably for the two tasks. Motivated by supervised or weakly-supervised learning for jointly estimating both tasks [16,27,49,68,76], we expect that jointly adapting both tasks allows one task to provide useful cues to the other even in the unlabeled target domain. Specifically, we require the student network to generate consistent predictions for both tasks under geometric augmentation. We weight the loss of the consistency training by the confidence estimated from the divergence of the teachers' predictions, which reduces the weight of the noisy predictions during the consistency training. To make the two teacher networks learn differently, we train the teachers independently from different mini-batches by knowledge distillation, which matches the teacher-student predictions at the output level. This framework lets the teachers update more carefully than the student and prevents over-fitting to the noisy predictions. Such stable teachers provide reliable confidence estimation for the student's training.

In our experiments, we validate our proposed method in adaptation settings where lighting, grasping objects, backgrounds, camera viewpoints, etc., vary between labeled source images and unlabeled target images. We use a large-scale hand dataset captured in a multi-camera system [13] as the source dataset (see Fig. 1). For the target dataset, we use HO3D [29] with different environments, HanCo [78] with multiple viewpoints and diverse backgrounds, and FPHA [24] with a novel first-person camera viewpoint. We also apply our method to the in-the-wild egocentric videos of Ego4D [28] (see Fig. 1), which cover diverse indoor and outdoor activities worldwide. Our method improves the average score of the two tasks by 14.4%, 14.9%, and 18.0% on HO3D, HanCo, and FPHA, respectively, compared to an unadapted baseline. Our method further exhibits distinct improvements over the latest adversarial adaptation method [33] and consistency training baselines with uncertainty estimation [7], confident instance selection [53], and the teacher-student scheme [64]. We finally confirm that our method also performs qualitatively well on the Ego4D videos.

Our contributions are summarized as follows:
– We propose a novel confidence estimation method based on the divergence of the predictions from two teacher networks for self-training domain adaptation of hand keypoint regression and hand segmentation.


– To integrate our proposed confidence estimation into self-training, we propose mutual training using knowledge distillation, with a student network for self-training and two teacher networks for confidence estimation.
– Our proposed framework outperforms state-of-the-art methods under three adaptation settings across different imaging conditions. It also shows improved qualitative performance on in-the-wild egocentric videos.

2 Related Work

Hand keypoint regression is the task of regressing the positions of hand joint keypoints from a cropped hand image. 2D hand keypoint regression is trained by optimizing keypoint heatmaps [50,69,79] or by directly predicting keypoint coordinates [60]. The 2D keypoints are informative for estimating 3D hand poses [5,47,61,74]. Building an accurate keypoint regressor requires massive hand keypoint annotations, which are laborious to collect. While early works annotate the keypoints manually from a single view [48,56,62], recent studies have collected the annotation more densely and efficiently using synthetic hand models [30,47,48,79], hand sensors [24,26,63,75], or multi-camera setups [6,13,29,34,43,46,80]. However, these methods suffer from the gap in imaging conditions between such data and real-world images at deployment [51]. For instance, synthetic hand models and hand sensors induce lighting conditions different from those of actual human hands, and multi-camera setups lack variety in lighting, grasping objects, and backgrounds. To tackle these problems, domain adaptation is a promising solution that transfers the knowledge of a network trained on source data to unlabeled target data. Jiang et al. proposed an adversarial domain adaptation for human and hand keypoint regression, optimizing the discrepancy between regressors [33]. Additionally, self-training adaptation methods have been studied in the keypoint regression of animals [11], humans [66], and objects [77]. Unlike these prior works, we incorporate confidence estimation into a self-training method based on consistency training for keypoint regression.

Hand segmentation is the task of segmenting pixel-level hand masks in a given image. CNN-based segmentation networks [3,35,65] are popularly used. The task can be jointly trained with hand keypoint regression because detecting hand regions helps improve keypoint localization [16,27,49,68,76]. Since hand mask annotation is as laborious as hand keypoint annotation, a few domain adaptation methods with pseudo-labeling have been explored [7,53]. To reduce the effect of highly noisy pseudo-labels in the target domain, Cai et al. incorporate the uncertainty of pseudo-labels in model adaptation [7], and Ohkawa et al. select confident pseudo-labels by the overlap of two predicted hand masks [53]. Unlike [7], we estimate the target confidence using two networks. Instead of using the estimated confidence for instance selection [53], we use the confidence to weight the loss of consistency training.

Domain adaptation via self-training aims to learn unlabeled target data in a self-supervised manner. This approach can be divided into three categories. (i) Pseudo-labeling [7,15,53,59,81] learns unlabeled data with hard


labels assigned by confidence thresholding on the output of a network. (ii) Entropy minimization [42,55,67] regularizes the conditional entropy of unlabeled data and increases the confidence of the class probability. (iii) Consistency regularization [14,20,71] enforces the prediction on unlabeled data to be invariant under data perturbation. We choose to leverage this consistency-based method for our task because it works for various tasks [41,45,52] and the first two approaches cannot be directly applied. Similar to our work, Yang et al. [73] enforce consistency across two different views and modalities in hand keypoint regression. Mean teacher [64] provides teacher-student training with consistency regularization, which regularizes a teacher network by the student's weights and avoids over-fitting to incorrect predictions. Unlike [73], we propose to integrate confidence estimation into the consistency training and adopt a teacher-student scheme with two teacher networks. To encourage the two networks to have different representations, we propose a distillation-based update rule instead of updating the teacher with an exponential moving average (EMA) [64].
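For readers unfamiliar with the mean-teacher update mentioned above, the following minimal PyTorch-style sketch illustrates the EMA rule of [64]; the function name, the momentum value, and the model variables are illustrative assumptions rather than code from any of the cited works.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """Mean-teacher style update: the teacher's weights are an exponential
    moving average of the student's weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```

Applying this rule to two identically initialized teachers would drive them to the same weights, which is why the method described in Sect. 3.3 instead trains each teacher by distillation on different mini-batches.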

3 Proposed Method

In this section, we present our proposed self-training domain adaptation with confidence estimation for adapting hand keypoint regression and hand segmentation. We first present our problem formulation and network initialization with supervised learning from source data. We then introduce our proposed modules: (1) geometric augmentation consistency, (2) confidence weighting by using two networks, and (3) teacher-student update via knowledge distillation. As shown in Fig. 2, our adaptation is done with two different networks (teachers) for confidence estimation and another network (student) for self-training of both tasks.

Problem Formulation. Given labeled images from one source domain and unlabeled images from another target domain, we aim to jointly estimate hand keypoint coordinates and pixel-level hand masks on the target domain. We have a source image $x_s$ drawn from a set $X_s \subset \mathbb{R}^{H\times W\times 3}$, its corresponding labels $(y^p_s, y^m_s)$, and a target image $x_t$ drawn from a set $X_t \subset \mathbb{R}^{H\times W\times 3}$. The pose label $y^p_s$ consists of the 2D keypoint coordinates of 21 hand joints obtained from a set $Y^p_s \subset \mathbb{R}^{21\times 2}$, while the mask label $y^m_s$ denotes a binary mask obtained from a set $Y^m_s \subset (0,1)^{H\times W}$. A network parameterized by $\theta$ learns the mappings $f^k(x;\theta): X \rightarrow Y^k$, where $k \in \{p, m\}$ is the indicator of the two tasks.

Initialization with Supervised Learning. To initialize the networks used in our adaptation, we train the network $f$ on the labeled source data following multi-task learning. Given the labeled dataset $(X_s, Y_s)$ and the network $\theta$, a supervised loss function is defined as
$$\mathcal{L}_{task}(\theta, X_s, Y_s) = \sum_{k} \lambda^k\, \mathbb{E}_{(x_s, y^k_s)\sim(X_s, Y^k_s)}\left[ L^k(p^k_s, y^k_s) \right], \qquad (1)$$
where $Y_s = \{Y^p_s, Y^m_s\}$ and $p^k_s = f^k(x_s;\theta)$. $L^k(\cdot,\cdot): Y^k \times Y^k \rightarrow \mathbb{R}_{+}$ is a loss function for each task and $\lambda^k$ is a hyperparameter to balance the two tasks. We use a smooth L1 loss [32,58] as $L^p$ and a binary cross-entropy loss as $L^m$.
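As a concrete illustration of Eq. (1), the following PyTorch sketch computes the supervised multi-task loss on a source batch; the network interface (returning a keypoint tensor and a mask logit map) and the default weight values are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def supervised_loss(model, images, kpt_gt, mask_gt, lambda_p=1.0, lambda_m=1.0):
    """Multi-task supervised loss of Eq. (1): smooth L1 on 2D keypoints (pose task)
    plus binary cross-entropy on hand masks (segmentation task).

    images:  (B, 3, H, W) source images
    kpt_gt:  (B, 21, 2) ground-truth 2D keypoint coordinates
    mask_gt: (B, 1, H, W) binary ground-truth hand masks
    """
    kpt_pred, mask_logits = model(images)          # assumed model interface
    loss_p = F.smooth_l1_loss(kpt_pred, kpt_gt)    # L^p
    loss_m = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)  # L^m
    return lambda_p * loss_p + lambda_m * loss_m
```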

3.1 Geometric Augmentation Consistency

Inspired by semi-supervised learning using hand keypoint consistency [73], we advance a unified consistency training for both hand keypoint regression and hand segmentation. We expect that joint adaptation of both tasks will allow one task to provide useful cues to the other in consistency training, as studied in supervised or weakly-supervised learning setups [16,27,49,68,76]. We design consistency training by predicting the location of hand keypoints and hand pixels in a geometrically transformed image, including rotation and translation. This consistency under geometric augmentation encourages the network to learn against positional bias in the target domain, which helps capture the hand structure related to poses and regions.

Specifically, given a paired augmentation function $(T_x, T^k_y) \sim \mathcal{T}$ for an image and a label, we generate the prediction on the target images $p^k_t = f^k(x_t;\theta)$ and on the augmented target images $p^k_{t,aug} = f^k(T_x(x_t);\theta)$. We define the loss function of geometric augmentation consistency (GAC) $\mathcal{L}_{gac}$ between $p^k_{t,aug}$ and $T^k_y(p^k_t)$ as
$$\mathcal{L}_{gac}(\theta, X_t, \mathcal{T}) = \mathbb{E}_{x_t, (T_x, T^p_y, T^m_y)}\left[ \sum_{k\in\{p,m\}} \tilde{\lambda}^k\, \tilde{L}^k\!\left( p^k_{t,aug},\, T^k_y(p^k_t) \right) \right]. \qquad (2)$$
To correct the augmented prediction $p^k_{t,aug}$ by $T^k_y(p^k_t)$, we stop the gradient update for $p^k_t$, which can be viewed as the supervision to $p^k_{t,aug}$. We use the smooth L1 loss (see Eq. 1) as $\tilde{L}^p$ and a mean squared error as $\tilde{L}^m$. We introduce $\tilde{\lambda}^k$ as a hyperparameter to control the balance of the two tasks. The augmentation set $\mathcal{T}$ contains geometric augmentation and photometric augmentation, such as color jitter and blurring. We set $T_y(\cdot)$ to align geometric information with the augmented input $T_x(x_t)$. For example, we apply a rotation $T_y(\cdot)$ to the outputs $p^k_t$ with the same degree of rotation as $T_x(\cdot)$ applied to the input $x_t$.
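A minimal sketch of this consistency loss for the keypoint branch is given below, assuming a rotation-only augmentation; the helper names, the model interface, and the use of torchvision's rotate are illustrative assumptions, not the paper's implementation, and the sign convention of the keypoint rotation must match the image-rotation routine.

```python
import math
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def rotate_keypoints(kpts, angle_deg, height, width):
    """Rotate (B, 21, 2) pixel coordinates about the image center by angle_deg,
    matching a counter-clockwise image rotation in image coordinates (y down)."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    center = torch.tensor([width / 2.0, height / 2.0], device=kpts.device)
    shifted = kpts - center
    x, y = shifted[..., 0], shifted[..., 1]
    rotated = torch.stack([cos_t * x + sin_t * y, -sin_t * x + cos_t * y], dim=-1)
    return rotated + center

def gac_keypoint_loss(model, target_images, angle_deg=15.0):
    """Geometric augmentation consistency (Eq. 2), keypoint branch only."""
    _, _, h, w = target_images.shape
    with torch.no_grad():                            # stop-gradient on the supervision p_t
        kpt_plain, _ = model(target_images)
    kpt_aug, _ = model(TF.rotate(target_images, angle_deg))   # prediction on T_x(x_t)
    kpt_plain_rot = rotate_keypoints(kpt_plain, angle_deg, h, w)  # T_y(p_t)
    return F.smooth_l1_loss(kpt_aug, kpt_plain_rot)
```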

3.2 Confidence Estimation by Two Separate Networks

Fig. 2. Method overview. Left: Student training with confidence-aware geometric augmentation consistency. The student learns from the consistency between its prediction and the two teachers' predictions. The training is weighted by the target confidence computed from the divergence of both teachers. Right: Teacher training with knowledge distillation. Each teacher independently learns to match the student's predictions. The task index k is omitted for simplicity.

Since the target predictions are not always reliable, we aim to incorporate an estimated confidence weight for each target instance into the consistency training. In Eq. 2, the generated outputs $p^k_t$, which serve as the supervision to $p^k_{t,aug}$, may be unstable and noisy due to the domain gap between the source and target domains. As a result, the network trained with the consistency loss readily over-fits to the incorrect supervision $p^k_t$, which is known as confirmation bias [2,64]. To reduce this bias, it is crucial to assign a low importance (confidence) weight to the consistency training with incorrect supervision. This enables the network to learn primarily from reliable supervision while avoiding being biased toward erroneous predictions. In classification tasks, the predicted class probability can serve as the confidence, but such a measure is not trivially defined or available in regression tasks. To estimate the confidence of keypoint predictions, Yang et al. [73] measure the confidence of 3D hand keypoints by the distance to a fitted 3D hand template, but hand template fitting is an ill-posed problem for 2D hands and is not applicable to hand segmentation. Dropout [7,8,22] is a generic way of estimating uncertainty (confidence), calculated as the variance of multiple stochastic forward passes. However, the estimated confidence is biased toward the current state of the training network because the training and confidence estimation are done by a single network. When the training network works poorly, the confidence estimation readily becomes unreliable.

To perform reliable confidence estimation for both tasks, we propose a confidence measure computed from the divergence of two predictions. Specifically, we introduce two networks (a.k.a. teachers) for the confidence estimation, and the estimated confidence is used to train another network (a.k.a. student) with the consistency training. The architecture of the teachers is identical, yet they have different learning parameters. We observe that when the divergence of the two teachers' predictions for a target instance is high, the predictions of both networks become unstable. In contrast, a lower divergence indicates that the two teacher networks predict stably and agree with each other. Thus, we use the divergence to represent the target confidence. Given the teachers $\theta^{tch1}, \theta^{tch2}$, we define a disagreement measure $\ell_{disagree}$ to compute the divergence as
$$\ell_{disagree}\!\left(\theta^{tch1}, \theta^{tch2}, x_t\right) = \sum_{k\in\{p,m\}} \tilde{\lambda}^k\, \tilde{L}^k\!\left(p^k_{t1},\, p^k_{t2}\right), \qquad (3)$$

where $p^k_{t1} = f^k(x_t;\theta^{tch1})$ and $p^k_{t2} = f^k(x_t;\theta^{tch2})$.

As a proof of concept, we visualize the correlation between the disagreement measure and a validation score averaged over the evaluation metrics of the two tasks (PCK and IoU) in Fig. 3. We compute the score between the ensemble of the teachers' predictions $p^k_{ens} = (p^k_{t1} + p^k_{t2})/2$ and its ground truth on the validation set of HO3D [29]. The instances with a small disagreement measure tend to have high validation scores. In contrast, the instances with a high disagreement measure entail false predictions, e.g., detecting the hand-held object as a hand joint and as the hand class. When the disagreement measure was high at the bottom of Fig. 3, we found that both predictions were particularly unstable on the keypoints of the ring finger (yellow). This study shows that the disagreement measure can represent the correctness of the target predictions.

Fig. 3. The correlation between a disagreement measure and task scores. Target instances with smaller disagreement values between the two teacher networks tend to have higher task scores. (Color figure online)

With the disagreement measure $\ell_{disagree}$, we define a confidence weight $w_t \in [0,1]$ for assigning importance to the consistency training. We compute the weight as
$$w_t = 2\left(1 - \mathrm{sigm}\!\left(\lambda_d\, \ell_{disagree}\!\left(\theta^{tch1}, \theta^{tch2}, x_t\right)\right)\right),$$
where $w_t$ is a normalized disagreement measure with sign inversion, $\mathrm{sigm}(\cdot)$ denotes a sigmoid function, and $\lambda_d$ controls the scale of the measure. With the confidence weight $w_t$, we enforce the consistency training between the student's prediction on the augmented target images $p^k_{s,aug}$ and the ensemble of the two teachers' predictions $p^k_{ens}$. Our proposed loss function of confidence-aware geometric augmentation consistency (C-GAC) $\mathcal{L}_{cgac}$ for the student $\theta^{stu}$ is formulated as
$$\mathcal{L}_{cgac}\!\left(\theta^{stu}, \theta^{tch1}, \theta^{tch2}, X_t, \mathcal{T}\right) = \mathbb{E}_{x_t, (T_x, T^p_y, T^m_y)}\left[ w_t \sum_{k\in\{p,m\}} \tilde{\lambda}^k\, \tilde{L}^k\!\left( p^k_{s,aug},\, T^k_y(p^k_{ens}) \right) \right], \qquad (4)$$
where $p^k_{s,aug} = f^k\!\left(T_x(x_t); \theta^{stu}\right)$. Following [54,64], we design the student prediction $p^k_{s,aug}$ to be supervised by the teachers. We generate the teachers' prediction by ensembling, $p^k_{ens}$, which is better than the prediction of either teacher.
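To make the confidence weighting concrete, here is a minimal PyTorch-style sketch of the disagreement measure (Eq. 3) and the weight $w_t$; the model interface, the per-task weights, the assumed tensor shapes, and the per-sample reduction are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def disagreement(teacher1, teacher2, images, lam_p=1.0, lam_m=1.0):
    """Returns the per-image disagreement (Eq. 3) and the teachers' ensembled
    keypoint and mask predictions.  Assumed shapes: keypoints (B, 21, 2),
    masks (B, 1, H, W)."""
    with torch.no_grad():
        kpt1, mask1 = teacher1(images)
        kpt2, mask2 = teacher2(images)
    d_pose = F.smooth_l1_loss(kpt1, kpt2, reduction="none").mean(dim=(1, 2))
    d_mask = F.mse_loss(mask1, mask2, reduction="none").mean(dim=(1, 2, 3))
    disagree = lam_p * d_pose + lam_m * d_mask
    return disagree, (kpt1 + kpt2) / 2, (mask1 + mask2) / 2

def confidence_weight(disagree, lam_d=0.5):
    """w_t = 2 * (1 - sigmoid(lam_d * disagreement)), lying in (0, 1]."""
    return 2.0 * (1.0 - torch.sigmoid(lam_d * disagree))
```

The returned weight can then multiply the per-image consistency loss between the student's augmented prediction and the geometrically aligned teacher ensemble, as in Eq. (4).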

3.3 Teacher-Student Update by Knowledge Distillation

In addition to the student’s training, we formulate an update rule for the two teacher networks by using knowledge distillation. Since disagree would not work

76

T. Ohkawa et al.

if the two teachers had the same output values, we aim to learn two teachers that have different representations yet keep high task performance, as in co-training [4,15,57,59]. In a prior teacher-student update, Tarvainen et al. [64] found that updating the teacher by an exponential moving average (EMA), which iteratively averages the student's weights, makes the teacher learn more slowly and mitigates the confirmation bias discussed in Sect. 3.2. While this EMA-based teacher-student framework is widely used in various domain adaptation tasks [9,19,25,38,72], naively applying the EMA rule to the two teachers would produce exactly the same weights for both networks. To prevent this, we propose independent knowledge distillation for building two different teachers. The distillation matches the teacher-student predictions at the output level. To let both networks have different parameters, we train the teachers from different mini-batches and with stochastic augmentation as
$$\mathcal{L}_{distill}\!\left(\theta, \theta^{stu}, X_t, \mathcal{T}\right) = \mathbb{E}_{x_t, T_x}\left[ \sum_{k\in\{p,m\}} \tilde{\lambda}^k\, \tilde{L}^k\!\left(p^k_{t,aug},\, p^k_{s,aug}\right) \right], \qquad (5)$$
where $\theta \in \{\theta^{tch1}, \theta^{tch2}\}$, $p^k_{t,aug} = f^k(T_x(x_t);\theta)$, and $p^k_{s,aug} = f^k(T_x(x_t);\theta^{stu})$. The distillation loss $\mathcal{L}_{distill}$ is used for updating the teacher networks only. This helps the teachers adapt to the target domain more carefully than the student and avoid falling into exactly the same predictions on a target instance.
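The sketch below illustrates one teacher-update step by distillation (Eq. 5); drawing an independent mini-batch and augmentation per teacher is how we read the text, and the optimizer setup, augmentation, and model interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def distill_teacher_step(teacher, teacher_opt, student, target_images, angle_deg=15.0):
    """One distillation step (Eq. 5): the teacher is trained to match the
    student's predictions on the same augmented target batch."""
    aug_images = TF.rotate(target_images, angle_deg)   # stochastic augmentation T_x
    with torch.no_grad():                              # the student provides fixed targets
        kpt_s, mask_s = student(aug_images)
    kpt_t, mask_t = teacher(aug_images)
    loss = F.smooth_l1_loss(kpt_t, kpt_s) + F.mse_loss(mask_t, mask_s)
    teacher_opt.zero_grad()
    loss.backward()
    teacher_opt.step()
    return loss.item()
```

Each of the two teachers would call this with its own mini-batch and its own sampled augmentation, so their parameters stay distinct.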

3.4 Overall Objectives

Overall, the objective of the student’s training consists of the supervised loss (Eq. 1) from the source domain and the self-training with confidence-aware geometric augmentation consistency (Eq. 4) in the target domain as  stu  stu tch1 tch2 θ θ + L min L , X , Y , θ , θ , X , T . (6) task s s cgac t stu θ

The two teachers are asynchronously trained with the distillation loss (Eq. 5) in the target domain, which is formulated as   (7) min Ldistill θ, θ stu , Xt , T , θ

where θ ∈ {θ ,θ }. Since the teachers are updated carefully and can perform better than the student, we use the ensemble of the two teachers’ predictions for a final output in inference. tch1
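Putting Eqs. (6) and (7) together, a simplified training loop might look as follows; the data loaders, optimizers, batch splitting, and alternation schedule are assumptions, and details such as the warm-up of the segmentation branch described in Sect. 4.1 are omitted.

```python
import torch

def adaptation_epoch(student, teachers, opt_student, opt_teachers,
                     source_loader, target_loader,
                     supervised_loss, cgac_loss, distill_step):
    """One epoch of the proposed adaptation (schematic sketch).

    supervised_loss, cgac_loss, distill_step are callables implementing
    Eq. (1), Eq. (4), and Eq. (5), respectively (see the earlier sketches)."""
    for (x_s, y_kpt, y_mask), x_t in zip(source_loader, target_loader):
        # Student update: source supervision + confidence-aware consistency (Eq. 6)
        loss = supervised_loss(student, x_s, y_kpt, y_mask) \
             + cgac_loss(student, teachers, x_t)
        opt_student.zero_grad()
        loss.backward()
        opt_student.step()

        # Teacher updates: independent distillation from the student (Eq. 7),
        # each teacher on its own half of the target batch
        half = x_t.shape[0] // 2
        distill_step(teachers[0], opt_teachers[0], student, x_t[:half])
        distill_step(teachers[1], opt_teachers[1], student, x_t[half:])

@torch.no_grad()
def predict(teachers, images):
    """Inference uses the ensemble of the two teachers' predictions."""
    kpt1, mask1 = teachers[0](images)
    kpt2, mask2 = teachers[1](images)
    return (kpt1 + kpt2) / 2, (mask1 + mask2) / 2
```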

4 Experiments

In this section, we first present our experimental datasets and implementation details and then provide quantitative and qualitative results along with the ablation studies. We analyze our proposed method by comparing it with several existing methods in three different domain adaptation settings. We also show qualitative results by applying our method to in-the-wild egocentric videos.

4.1 Experiment Setup

Datasets. We experimented with several hand datasets that include a variety of hand-object interactions together with annotations of 2D hand keypoints and hand masks. We adopted the DexYCB [13] dataset as our source dataset since it contains a large number of training images, their corresponding labels, and natural hand-object interactions. We chose the following datasets as our target datasets: HO3D [29], captured in different environments with the same YCB objects [10] as the source dataset; HanCo [78], captured in a multi-camera studio and generated with synthesized backgrounds; and FPHA [24], captured from a first-person view. We also used Ego4D [28] to verify the effectiveness of our method in real-world scenarios. During training, we used images cropped around the hand regions of the original images as input.

Implementation Details. Our teacher-student networks share an identical network architecture, which consists of a unified feature extractor and task-specific branches for hand keypoint regression and hand segmentation. For training our student network, we used the Adam optimizer [36] with a learning rate of $10^{-5}$, while the learning rate of the teacher networks was set to $5\times 10^{-6}$. We set the hyperparameters $(\lambda^p (= \tilde{\lambda}^p), \lambda^m, \tilde{\lambda}^m, \lambda_d)$ to $(10^7, 10^2, 5, 0.5)$. Since the two task-specific branches have different training speeds, we began our adaptation with the backbone and keypoint regression branch and then trained all sub-networks, including the hand segmentation branch. We report the percentage of correct keypoints (PCK) and the mean joint position error (MPE) for hand keypoint regression, and the intersection over union (IoU) for hand segmentation.

Baseline Methods. We compared quantitative performance with the following methods. Source only denotes the network trained on the source dataset without any adaptation. To compare with the adversarial adaptation approach, we trained DANN [23], which aligns marginal feature distributions between domains, and RegDA [33], which uses an adversarial regressor that optimizes domain disparity. In addition, we implemented several self-training adaptation methods by replacing pseudo-labeling with the consistency training. GAC is a simple baseline with the consistency training updated by Eq. 2. GAC + UMA [7] is a GAC method with confidence estimation by Dropout [22]. GAC + CPL [53] is a GAC method with confident instance selection using the agreement with another network. GAC + MT [64] is a GAC method with the single-teacher-single-student architecture using EMA for the teacher update. Target only indicates the network trained on the target dataset with labels, which gives an empirical performance upper bound.

Our Method. We denote our full method, introduced in Sect. 3.4, as C-GAC. As an ablation, we present a variant of the proposed method, GAC-Distill, with a teacher-student pair updated by the consistency training (Eq. 2) and the distillation loss (Eq. 5). GAC-Distill differs from GAC + MT only in how the teacher is updated.
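For reference, the evaluation metrics named above can be computed roughly as in the sketch below; the PCK threshold, the averaging scheme, and the tensor shapes are assumptions, since they are not spelled out in this section.

```python
import torch

def keypoint_metrics(pred_kpts, gt_kpts, pck_thresh_px=20.0):
    """PCK (%) and MPE (px) for 2D keypoints.
    pred_kpts, gt_kpts: (N, 21, 2) pixel coordinates."""
    dist = torch.linalg.norm(pred_kpts - gt_kpts, dim=-1)   # (N, 21) per-joint errors
    pck = (dist < pck_thresh_px).float().mean() * 100.0     # percentage of correct keypoints
    mpe = dist.mean()                                        # mean joint position error
    return pck.item(), mpe.item()

def mask_iou(pred_mask, gt_mask):
    """IoU (%) between binary hand masks of shape (N, H, W)."""
    pred, gt = pred_mask.bool(), gt_mask.bool()
    inter = (pred & gt).float().sum(dim=(1, 2))
    union = (pred | gt).float().sum(dim=(1, 2)).clamp(min=1.0)
    return (inter / union).mean().item() * 100.0
```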


Table 1. DexYCB [13] → HO3D [29]. We report PCK (%) and MPE (px) for hand keypoint regression and IoU (%) for hand segmentation. Each score of the form val/test indicates the validation and test scores. Red and blue letters indicate the best and second-best values.

Method              | 2D Pose PCK ↑ (%) | 2D Pose MPE ↓ (px) | Seg IoU ↑ (%) | 2D Pose + Seg Avg. ↑ (%)
Source only         | 42.8/33.5         | 15.39/19.32        | 57.9/49.1     | 50.3/41.3
DANN [23]           | 49.0/46.8         | 12.39/13.39        | 52.8/54.7     | 50.9/50.8
RegDA [33]          | 48.8/48.2         | 12.50/12.64        | 55.7/55.3     | 52.2/51.7
GAC                 | 47.6/47.4         | 12.47/12.54        | 58.0/56.9     | 52.8/52.2
GAC + UMA [7]       | 47.1/45.3         | 12.97/13.51        | 58.0/55.0     | 52.5/50.2
GAC + CPL [53]      | 48.1/48.1         | 12.74/12.61        | 57.2/55.6     | 52.7/51.8
GAC + MT [64]       | 45.5/44.4         | 13.65/14.05        | 54.8/52.3     | 50.2/48.3
GAC-Distill (Ours)  | 49.9/50.4         | 11.98/11.51        | 60.7/60.6     | 55.3/55.5
C-GAC (Ours-Full)   | 50.3/51.1         | 11.89/11.22        | 60.9/60.3     | 55.6/55.7
Target only         | 55.1/58.6         | 11.00/9.29         | 61.7/62.4     | 68.2/66.1

4.2 Quantitative Results

We show the results of three adaptation settings, DexYCB → {HO3D, HanCo, FPHA}, in Tables 1 and 2. We then provide detailed comparisons of our method.

DexYCB → HO3D. Table 1 shows the results of the adaptation from DexYCB to HO3D, where the grasped objects overlap between the two datasets. The consistency training baseline (GAC) was effective at learning the target images for both tasks. Our proposed method (C-GAC) improved the average task score by 5.3/14.4 over the source-only performance. The method also outperformed all comparison methods and approached the upper bound.

DexYCB → HanCo. Table 2 shows the results of the adaptation from DexYCB to HanCo, i.e., across laboratory setups. The source-only network generalized poorly to the target domain because HanCo has diverse backgrounds, while GAC succeeded in adapting up to 47.4/47.9 in the average score. Our method C-GAC showed further improved results in hand keypoint regression.

DexYCB → FPHA. Table 2 also shows the results of the adaptation from DexYCB to FPHA, which captures egocentric users' activities. Since hand markers and in-the-wild target environments cause large appearance gaps, the source-only model performed the most poorly among the three adaptation settings. In this challenging setting, RegDA and GAC + UMA performed well for hand segmentation, while their performance on hand keypoint regression was inferior to the GAC baseline. Our method C-GAC further improved over the GAC method in the MPE and IoU metrics and exhibited stable adaptation training compared with the comparison methods.


Table 2. DexYCB [13] → {HanCo [78], FPHA [24]}. We report PCK (%) and MPE (px) for hand keypoint regression and IoU (%) for hand segmentation. We show the validation and test results on HanCo and the validation results on FPHA. Red and blue letters indicate the best and second-best values.

DexYCB → HanCo
Method              | 2D Pose PCK ↑ (%) | 2D Pose MPE ↓ (px) | Seg IoU ↑ (%) | Avg. ↑ (%)
Source only         | 26.0/27.3         | 21.82/21.48        | 41.8/41.4     | 33.9/34.3
DANN [23]           | 32.3/33.0         | 19.99/19.82        | 56.3/56.9     | 44.3/45.0
RegDA [33]          | 33.0/33.6         | 19.51/19.44        | 57.8/58.4     | 45.4/46.0
GAC                 | 36.6/37.1         | 16.63/16.59        | 58.1/58.8     | 47.4/47.9
GAC + UMA [7]       | 35.1/35.6         | 17.51/17.48        | 57.1/57.7     | 46.1/46.6
GAC + CPL [53]      | 32.7/33.5         | 19.85/19.62        | 55.8/56.4     | 44.2/45.0
GAC + MT [64]       | 33.2/33.8         | 18.93/18.83        | 54.3/55.1     | 43.8/44.4
GAC-Distill (Ours)  | 38.8/39.5         | 16.06/15.97        | 57.5/57.7     | 48.1/48.6
C-GAC (Ours-Full)   | 39.2/39.9         | 15.83/15.74        | 58.2/58.6     | 48.7/49.2
Target only         | 76.8/77.3         | 4.91/4.80          | 75.9/76.1     | 76.3/76.7

DexYCB → FPHA
Method              | 2D Pose PCK | 2D Pose MPE | Seg IoU | Avg.
Source only         | 14.0        | 31.32       | 24.8    | 19.4
DANN [23]           | 24.4        | 25.79       | 28.4    | 26.4
RegDA [33]          | 23.7        | 24.27       | 41.7    | 32.7
GAC                 | 37.2        | 17.02       | 33.3    | 35.3
GAC + UMA [7]       | 36.8        | 17.29       | 39.2    | 38.0
GAC + CPL [53]      | 25.7        | 24.99       | 32.7    | 29.2
GAC + MT [64]       | 31.3        | 20.81       | 38.4    | 34.9
GAC-Distill (Ours)  | 36.8        | 15.99       | 35.5    | 36.1
C-GAC (Ours-Full)   | 37.2        | 15.36       | 37.7    | 37.4
Target only         | 63.3        | 8.11        | -       | -

Comparison to Different Confidence Estimation Methods. We compare the results with existing confidence estimation methods. GAC + UMA and GAC + CPL estimate the confidence of target predictions by computing the variance of multiple stochastic forward passes and the task scores between a training network and an auxiliary network, respectively. GAC + UMA performed effectively on DexYCB → FPHA, whereas its performance gain over GAC was marginal in the other settings. GAC + CPL worked well for keypoint regression on DexYCB → HO3D, but it could not handle the other settings with a large domain gap well, since the prediction of the auxiliary network became unstable. Although these prior methods had different disadvantages depending on the setting, our method C-GAC, which uses the divergence of the two teachers for confidence estimation, performed stably in all three settings.

Comparison to Standard Teacher-Student Update. We compare our teacher update with the update based on an exponential moving average (EMA) [64]. The EMA-based update (GAC + MT) degraded the hand segmentation performance below the source-only baseline in Table 1. This suggests that the EMA update can be sensitive to the task. In contrast, our method GAC-Distill, which matches the teacher-student predictions at the output level, did not produce such degradation and worked more stably.

Comparison to Adversarial Adaptation Methods. We compared our method with another major adaptation approach based on adversarial training. In Tables 1 and 2, the performance of DANN and RegDA was mostly worse than that of the consistency-based baseline GAC. We found that, instead of matching features between both domains [23,33], directly learning the target images with the consistency training was critical in the adaptation of our tasks.


Fig. 4. Qualitative results. We show qualitative examples of the source-only network (top), the Ours-Full method (middle), and ground truth (bottom) on HO3D [29], HanCo [78], FPHA [24], and Ego4D [28] without ground truth.

Comparison to an Off-the-Shelf Hand Pose Estimator. We tested the generalization ability of an open-source library for pose estimation, OpenPose [31]. It resulted in MPEs of 15.75/12.72, 18.31/18.42, and 29.02 on HO3D, HanCo, and FPHA, respectively. Since it is built on multiple source datasets [1,34,44], this baseline generalized better than the source-only network. However, it did not surpass our proposed method in MPE. This shows that generalizing hand keypoint regression to other datasets is still challenging and that our adaptation framework helps improve target performance.

4.3 Qualitative Results

We show the qualitative results of hand keypoint regression and hand segmentation in Fig. 4. When hands are occluded in HO3D and FPHA or the backgrounds are diverse in HanCo, the keypoint predictions of the source-only model (top) represented infeasible hand poses, and its hand segmentation was too noisy or missing. In contrast, our method C-GAC (middle) corrected the hand keypoint errors and improved the localization of hand regions. Hand segmentation in FPHA was still noisy because visible white markers obstructed the hand appearance. We can also see distinct improvements on the Ego4D dataset. We provide additional qualitative analysis of adaptation to Ego4D, spanning different countries, cultures, ages, indoor/outdoor scenes, and hand-related tasks, in our supplementary material.

4.4 Ablation Studies

Effect of Confidence Estimation. To confirm the effect of our proposed confidence estimation, we compare our full method C-GAC with our ablation model GAC-Distill, which lacks the confidence weighting. In Tables 1 and 2, while GAC-Distill already surpassed the comparison methods in most cases, C-GAC showed a further performance gain in all three adaptation settings.


Fig. 5. Visualization of bone length distributions. We show the distributions of the bone length between hand joints, namely, Wrist, metacarpophalangeal (MCP), proximal interphalangeal (PIP), distal interphalangeal (DIP), and fingertip (TIP). Using kernel density estimation, we plotted the density of the bone length for the predictions of the source only, the Ours-Full method, and ground truth on test data of HO3D [29]. (Color figure online)

Multi-task vs. Single-Task Adaptation. We studied the effect of our multi-task adaptation compared with single-task adaptation on DexYCB → HO3D. The single-task adaptation results are 50.1/51.0 in PCK and 58.2/57.7 in IoU. Compared to Table 1, our method in the multi-task setting improved hand segmentation by 2.7/2.6 over the single-task adaptation, while it provided a marginal gain in hand keypoint regression. This shows that adapting hand keypoint regression helps localize hand regions in the target domain.

Bone Length Distributions. To study our adaptation results for each hand joint, we show the distributions of bone length between hand joints in Fig. 5. For Wrist-MCP, PIP-DIP, and DIP-TIP, the distribution of the source-only predictions on target images (blue) was far from that of the target ground truth (green), whereas our method (orange) better approximated the target distribution (green). For MCP-PIP, we could not observe such clear differences because the source-only model already represented the target distribution well. This indicates that our method better learned the hand structure near the palm and fingertips.
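The bone-length analysis in Fig. 5 can be reproduced along the following lines; the joint-index layout and the use of scipy's Gaussian KDE are assumptions about an analysis the paper only describes in prose.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical joint indexing: 0 = wrist, then 4 joints (MCP, PIP, DIP, TIP) per finger.
BONES = {
    "Wrist-MCP": [(0, 1 + 4 * f) for f in range(5)],
    "MCP-PIP":   [(1 + 4 * f, 2 + 4 * f) for f in range(5)],
    "PIP-DIP":   [(2 + 4 * f, 3 + 4 * f) for f in range(5)],
    "DIP-TIP":   [(3 + 4 * f, 4 + 4 * f) for f in range(5)],
}

def bone_length_density(kpts, bone_name, grid):
    """kpts: (N, 21, 2) predicted or ground-truth 2D keypoints.
    Returns the kernel density estimate of the given bone's length over `grid`."""
    lengths = np.concatenate([
        np.linalg.norm(kpts[:, j1] - kpts[:, j2], axis=-1) for j1, j2 in BONES[bone_name]
    ])
    return gaussian_kde(lengths)(grid)
```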

5 Conclusion

In this work, we tackled the problem of joint domain adaptation of hand keypoint regression and hand segmentation. Our proposed method consists of self-training with geometric augmentation consistency, confidence weighting by two teacher networks, and a teacher-student update by knowledge distillation. The consistency training under geometric augmentation enabled learning from the unlabeled target images for both tasks. The divergence of the predictions from the two teacher networks represented the confidence of each target instance, which enabled the student network to learn from reliable target predictions. The distillation-based teacher-student update guided the teachers to learn from the student carefully and mitigated over-fitting to the noisy predictions. Our method delivered state-of-the-art performance on the three adaptation setups and showed improved qualitative results on real-world egocentric videos.


Acknowledgments. This work was supported by JST ACT-X Grant Number JPMJAX2007, JSPS Research Fellowships for Young Scientists, JST AIP Acceleration Research Grant Number JPMJCR20U1, and JSPS KAKENHI Grant Number JP20H04205, Japan. This work was also supported in part by a hardware donation from Yu Darvish.

References 1. Andriluka, M., Pishchulin, L., Gehler, P.V., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3686–3693 (2014) 2. Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudolabeling and confirmation bias in deep semi-supervised learning. In: IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2020) 3. Benitez-Garcia, G., et al.: Improving real-time hand gesture recognition with semantic segmentation. Sensors 21(2), 356 (2021) 4. Blum, A., Mitchell, T.M.: Combining labeled and unlabeled data with co-training. In: Proceedings of the ACM Annual Conference on Computational Learning Theory (COLT), pp. 92–100 (1998) 5. Boukhayma, A., Bem, R.D., Torr, P.H.S.: 3D hand shape and pose from images in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10843–10852 (2019) 6. Brahmbhatt, S., Tang, C., Twigg, C.D., Kemp, C.C., Hays, J.: ContactPose: a dataset of grasps with object contact and hand pose. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp. 361–378. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0 22 7. Cai, M., Lu, F., Sato, Y.: Generalizing hand segmentation in egocentric videos with uncertainty-guided model adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14380–14389 (2020) 8. Cai, M., Luo, M., Zhong, X., Chen, H.: Uncertainty-aware model adaptation for unsupervised cross-domain object detection. CoRR, abs/2108.12612 (2021) 9. Cai, Q., Pan, Y., Ngo, C.-W., Tian, X., Duan, L., Yao, T.: Exploring object relation in mean teacher for cross-domain detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11457–11466 (2019) 10. C ¸ alli, B., Walsman, A., Singh, A., Srinivasa, S.S., Abbeel, P., Dollar, A.M.: Benchmarking in manipulation research: using the Yale-CMU-Berkeley object and model set. IEEE Robot. Autom. Mag. 22(3), 36–52 (2015) 11. Cao, J., Tang, H., Fang, H., Shen, X., Tai, Y.-W., Lu, C.: Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9497–9506 (2019) 12. Cao, Z., Radosavovic, I., Kanazawa, A., Malik, J.: Reconstructing hand-object interactions in the wild. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 12417–12426 (2021) 13. Chao, Y.-W., et al.: DexYCB: a benchmark for capturing hand grasping of objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9044–9053 (2021)


14. Chen, C.-H., et al.: Unsupervised 3D pose estimation with geometric selfsupervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5714–5724 (2019) 15. Chen, M., Weinberger, K.Q., Blitzer, J.: Co-training for domain adaptation. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 2456–2464 (2011) 16. Chen, X., Wang, G., Zhang, C., Kim, T.-K., Ji, X.: SHPR-Net: deep semantic hand pose regression from point clouds. IEEE Access 6, 43425–43439 (2018) 17. Damen, D., et al.: Rescaling egocentric vision. Int. J. Comput. Vision (IJCV) (2021) 18. Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4091–4101 (2021) 19. French, G., Mackiewicz, M., Fisher, M.H.: Self-ensembling for visual domain adaptation. In: Proceedings of the International Conference on Learning Representations (ICLR) (2018) 20. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., Tao, D.: Geometryconsistent generative adversarial networks for one-sided unsupervised domain mapping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2427–2436 (2019) 21. Fu, Q., Liu, X., Kitani, K.M.: Sequential decision-making for active object detection from hand. CoRR, abs/2110.11524 (2021) 22. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1050–1059 (2016) 23. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1180–1189 (2015) 24. Garcia-Hernando, G., Yuan, S., Baek, S., Kim, T.-K.: First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 409–419 (2018) 25. Ge, Y., Chen, D., Li, H.: Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. In Proceedings of the International Conference on Learning Representations (ICLR) (2020) 26. Glauser, O., Wu, S., Panozzo, D., Hilliges, O., Sorkine-Hornung, O.: Interactive hand pose estimation using a stretch-sensing soft glove. ACM Trans. Graph. 38(4), 41:1-41:15 (2019) 27. Goudie, D., Galata, A.: 3D hand-object pose estimation from depth with convolutional neural networks. In: Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 406–413 (2017) 28. Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18995–19012 (2022) 29. Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: a method for 3D annotation of hand and object poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3196–3206 (2020) 30. Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11807–11816 (2019)


31. Hidalgo, G., et al.: OpenPose. https://github.com/CMU-Perceptual-ComputingLab/openpose 32. Huang, W., Ren, P., Wang, J., Qi, Q., Sun, H.: AWR: adaptive weighting regression for 3D hand pose estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 11061–11068 (2020) 33. Jiang, J., Ji, Y., Wang, X., Liu, Y., Wang, J., Long, M.: Regressive domain adaptation for unsupervised keypoint detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6780–6789 (2021) 34. Joo, H., et al.: Panoptic studio: a massively multiview system for social motion capture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3334–3342 (2015) 35. Kim, S., Chi, H.-G., Hu, X., Vegesana, A., Ramani, K.: First-person view hand segmentation of multi-modal hand activity video dataset. In: Proceedings of the British Machine Vision Conference (BMVC) (2020) 36. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR) (2014) 37. Lee, K., Shrivastava, A., Kacorri, H.: Hand-priming in object localization for assistive egocentric vision. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3422–3432 (2020) 38. Li, Y.-J., et al.: Cross-domain object detection via adaptive self-training. CoRR, abs/2111.13216 (2021) 39. Liang, H., Yuan, J., Thalmann, D., Magnenat-Thalmann, N.: AR in hand: egocentric palm pose tracking and gesture recognition for augmented reality applications. In: Proceedings of the ACM International Conference on Multimedia (MM), pp. 743–744 (2015) 40. Likitlersuang, J., Sumitro, E.R., Cao, T., Vis´ee, R.J., Kalsi-Ryan, S., Zariffa, J.: Egocentric video: a new tool for capturing hand use of individuals with spinal cord injury at home. J. Neuroeng. Rehabil. (JNER) 16(1), 83 (2019) 41. Liu, Y.-C., et al.: Unbiased teacher for semi-supervised object detection. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021) 42. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Unsupervised domain adaptation with residual transfer networks. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 136–144 (2016) 43. Lu, Y., Mayol-Cuevas, W.W.: Understanding egocentric hand-object interactions from hand pose estimation. CoRR, abs/2109.14657 (2021) 44. McKee, R., McKee, D., Alexander, D., Paillat, E.: NZ sign language exercises. Deaf Studies Department of Victoria University of Wellington. http://www.victoria.ac. nz/llc/llc resources/nzsl 45. Melas-Kyriazi, L., Manrai, A.K.: Pixmatch: unsupervised domain adaptation via pixelwise consistency training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12435–12445 (2021) 46. Moon, G., Yu, S.-I., Wen, H., Shiratori, T., Lee, K.M.: InterHand2.6M: a dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 548–564. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058565-5 33 47. Mueller, F., et al.: GANerated hands for real-time 3D hand tracking from monocular RGB. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 49–59 (2018)


48. Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1163–1172 (2017) 49. Neverova, N., Wolf, C., Nebout, F., Taylor, G.W.: Hand pose estimation through semi-supervised and weakly-supervised learning. Comput. Vis. Image Underst. 164, 56–67 (2017) 50. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-31946484-8 29 51. Ohkawa, T., Furuta, R., Sato, Y.: Efficient annotation and learning for 3D hand pose estimation: a survey. CoRR, abs/2206.02257 (2022) 52. Ohkawa, T., Inoue, N., Kataoka, H., Inoue, N.: Augmented cyclic consistency regularization for unpaired image-to-image translation. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 362–369 (2020) 53. Ohkawa, T., Yagi, T., Hashimoto, A., Ushiku, Y., Sato, Y.: Foreground-aware stylization and consensus pseudo-labeling for domain adaptation of first-person hand segmentation. IEEE Access 9, 94644–94655 (2021) 54. Pham, H., Dai, Z., Xie, Q., Le, Q.V.: Meta pseudo labels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11557–11568 (2021) 55. Prabhu, V., Khare, S., Kartik, D., Hoffman, J.: SENTRY: selective entropy optimization via committee consistency for unsupervised domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8558–8567 (2021) 56. Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1106–1113 (2014) 57. Qiao, S., Shen, W., Zhang, Z., Wang, B., Yuille, A.: Deep co-training for semisupervised image recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 142–159. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0 9 58. Ren, P., Sun, H., Qi, Q., Wang, J., Huang, W.: SRN: stacked regression network for real-time 3D hand pose estimation. In: Proceedings of the British Machine Vision Conference (BMVC) (2019) 59. Saito, K., Ushiku, Y., Harada, T.: Asymmetric tri-training for unsupervised domain adaptation. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 2988–2997 (2017) 60. Santavas, N., Kansizoglou, I., Bampis, L., Karakasis, E., Gasteratos, A.: Attention! A lightweight 2D hand pose estimation approach. CoRR, abs/2001.08047 (2020) 61. Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4645–4653 (2017) 62. Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D., Oulasvirta, A., Theobalt, C.: Real-time joint tracking of a hand manipulating an object from RGB-D input. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 294– 310 (2016)


63. Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: GRAB: a dataset of wholebody human grasping of objects. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 581–600. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8 34 64. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Proceedings of the International Conference on Learning Representations (ICLR) (2017) 65. Urooj, A., Borji, A.: Analysis of hand segmentation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4710–4719 (2018) 66. Vasconcelos, L.O., Mancini, M., Boscaini, D., Bul` o, S.R., Caputo, B., Ricci, E.: Shape consistent 2D keypoint estimation under domain shift. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 8037–8044 (2020) 67. Vu, T.H., Jain, H., Bucher, M., Cord, M., Perez, P.: Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2512–2521 (2019) 68. Wang, Y., Peng, C., Liu, Y.: Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 29(11), 3258–3268 (2019) 69. Wei, S.-E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732 (2016) 70. Wu, M.-Y., Ting, P.-W., Tang, Y.-H., Chou, E.T., Fu, L.-C.: Hand pose estimation in object-interaction based on deep learning for virtual reality applications. J. Vis. Commun. Image Represent. 70, 102802 (2020) 71. Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q.: Unsupervised data augmentation for consistency training. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2020) 72. Yan, L., Fan, B., Xiang, S., Pan, C.: CMT: cross mean teacher unsupervised domain adaptation for VHR image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022) 73. Yang, L., Chen, S., Yao, A.: Semihand: semi-supervised hand pose estimation with consistency. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11364–11373 (2021) 74. Yang, L., Li, J., Xu, W., Diao, Y., Lu, C.: Bihand: recovering hand mesh with multi-stage bisected hourglass networks. In: Proceedings of the British Machine Vision Conference (BMVC) (2020) 75. Yuan, S., Ye, Q., Stenger, B., Jain, S., Kim, T.K.: BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2605–2613 (2017) 76. Zhang, C., Wang, G., Chen, X., Xie, P., Yamasaki, T.: Weakly supervised segmentation guided hand pose estimation during interaction with unknown objects. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP), pp. 2673–2677 (2020) 77. Zhou, X., Karpur, A., Gan, C., Luo, L., Huang, Q.: Unsupervised domain adaptation for 3D keypoint estimation via view consistency. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 141–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8 9 78. Zimmermann, C., Argus, M., Brox, T.: Contrastive representation learning for hand shape estimation. CoRR, abs/2106.04324 (2021)


79. Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4913–4921 (2017) 80. Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 813–822 (2019) 81. Zou, Y., Yu, Z., Kumar, B.V., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305 (2018)

Towards Data-Efficient Detection Transformers

Wen Wang1, Jing Zhang2, Yang Cao1,3(B), Yongliang Shen4, and Dacheng Tao2,5

1 University of Science and Technology of China, Hefei, China
[email protected], [email protected]
2 The University of Sydney, Camperdown, Australia
[email protected]
3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
4 Zhejiang University, Hangzhou, China
[email protected]
5 JD Explore Academy, Beijing, China

Abstract. Detection transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer from significant performance drops on small-size datasets, like Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how key and value sequences are constructed in the cross-attention layer, with minimum modifications to the original models. Besides, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improve their performance on both small-size and sample-rich datasets. Code will be made publicly available at https://github.com/encounter1997/DE-DETRs.

Keywords: Data efficiency · Detection transformer · Sparse feature · Rich supervision · Label augmentation

W. Wang—This work was done during Wen Wang’s internship at JD Explore Academy. J. Zhang—Co-first author.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_6.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
S. Avidan et al. (Eds.): ECCV 2022, LNCS 13669, pp. 88–105, 2022. https://doi.org/10.1007/978-3-031-20077-9_6

1 Introduction

Object detection is a long-standing topic in computer vision. Recently, a new family of object detectors, named detection transformers, has drawn increasing attention due to their simplicity and promising performance. The pioneering work of this class of methods is DETR [3], which views object detection as a direct set prediction problem and applies a transformer to translate the object queries to the target objects. It achieves better performance than the seminal Faster RCNN [31] on the commonly used COCO dataset [24], but its convergence is significantly slower than that of CNN-based detectors. For this reason, most of the subsequent works have been devoted to improving the convergence of DETR, through efficient attention mechanisms [50], conditional spatial queries [29], regression-aware co-attention [14], etc. These methods are able to achieve better performance than Faster RCNN with comparable training costs on the COCO dataset, demonstrating the superiority of detection transformers.

Fig. 1. Performance of different object detectors on COCO 2017 with 118K training data and Cityscapes with 3K training data. The respective training epochs are shown below the name of each method. While the RCNN family show consistently high average precision, the detection transformer family degrades significantly on the small-size dataset. FRCN-FPN, SRCN, and SMCA-SS represent Faster-RCNN-FPN, Sparse RCNN, and single-scale SMCA, respectively.

Current works seem to suggest that detection transformers are superior to CNN-based object detectors, like Faster RCNN, in both simplicity and model performance. However, we find that detection transformers show superior performance only on datasets with rich training data like COCO 2017 (118K training images), while the performance of most detection transformers drops significantly when the amount of training data is small. For example, on the commonly used autonomous driving dataset Cityscapes [7] (3K training images), the average precisions (AP) of most of the detection transformers are less than half of Faster RCNN's AP, as shown in Fig. 1. Moreover, although the performance gaps between different detection transformers on the COCO dataset are less than 3 AP, a significant difference of more than 15 AP exists on the small-size Cityscapes dataset. These findings suggest that detection transformers are generally more data-hungry than CNN-based object detectors. However, the acquisition of labeled data is time-consuming and labor-intensive, especially for the object detection task, which requires both categorization and localization of multiple objects in a single image. What's more, a large amount of training data means more training iterations to traverse the dataset, and thus more computational resources are consumed to train the detection transformers, increasing the carbon footprint. In short, it takes a lot of human labor and computational resources to meet the training requirements of existing detection transformers.

To address these issues, we first empirically analyze the key factors affecting the data efficiency of detection transformers through a step-by-step transformation from the data-efficient Sparse RCNN to the representative DETR. Our investigation and analysis show that sparse feature sampling from local areas holds the key: on the one hand, it alleviates the difficulty of learning to focus on specific objects, and on the other hand, it avoids the quadratic complexity of modeling image features and makes it possible to utilize multi-scale features, which have been proven critical for the object detection task. Based on these observations, we improve the data efficiency of existing detection transformers by simply altering how the key and value are constructed in the transformer decoder. Specifically, we perform sparse feature sampling on the key and value features sent to the cross-attention layer under the guidance of the bounding boxes predicted by the previous decoder layer, with minimum modifications to the original model and without any specialized module. In addition, we mitigate the data-hungry problem by providing richer supervisory signals to detection transformers. To this end, we propose a label augmentation method that repeats the labels of foreground objects during label assignment, which is both effective and easy to implement. Our method can be applied to different detection transformers to improve their data efficiency. Interestingly, it also brings performance gains on the COCO dataset with a sufficient amount of data.

To summarize, our contributions are listed as follows.

– We identify the data-efficiency problem of detection transformers. Though they achieve excellent performance on the COCO dataset, they generally suffer from significant performance degradation on small-size datasets.
– We empirically analyze the key factors that affect detection transformers' data efficiency through a step-by-step model transformation from Sparse RCNN to DETR, and find that sparse feature sampling from local areas holds the key to data efficiency.
– With minimum modifications, we significantly improve the data efficiency of existing detection transformers by simply altering how key and value sequences are constructed in the cross-attention layer.


– We propose a simple yet effective label augmentation strategy to provide richer supervision and improve the data efficiency. It can be combined with different methods to achieve performance gains on different datasets.

2 Related Work

2.1 Object Detection

Object detection [13,16,23,26,30,31,35] is essential to many real-world applications, like autonomous driving, defect detection, and remote sensing. Representative deep-learning-based object detection methods can be roughly categorized into two-stage detectors like Faster RCNN [31] and one-stage object detectors like YOLO [30] and RetinaNet [23]. While effective, these methods generally rely on many heuristics like anchor generation and rule-based label assignment. Recently, DETR [3] provides a simple and clean pipeline for object detection. It formulates object detection as a set prediction task, and applies a transformer [37] to translate sparse object candidates [33] to the target objects. The success of DETR has sparked the recent surge of detection transformers [4,8,12,14,25,29,39,40,44,50], and most of the follow-up works focus on alleviating the slow convergence problem of DETR. For example, DeformDETR [50] proposes a deformable attention mechanism for learnable sparse feature sampling and aggregates multi-scale features to accelerate model convergence and improve model performance. CondDETR [29] proposes to learn a conditional spatial query from the decoder embedding, which helps the model quickly learn to localize the four extremities for detection. These works achieve better performance than Faster RCNN on the COCO dataset [24] with comparable training costs. It seems that detection transformers have surpassed the seminal Faster RCNN in both simplicity and performance. But we show that detection transformers are generally more data-hungry and perform much worse than Faster RCNN on small-size datasets.

2.2 Label Assignment

Label assignment [15,32,38,43,48,49] is a crucial component in object detection. It matches the ground truth of an object with a specific prediction from the model, and thereby provides the supervision signal for training. Prior to DETR, most object detectors [23,30,31] adopt the one-to-many matching strategy, which assigns each ground truth to multiple predictions based on local spatial relationships. By contrast, DETR makes one-to-one matching between ground truths and predictions by minimizing a global matching loss. This label assignment approach has been followed by various subsequent variants of the detection transformer [8,12,29,40,50]. Despite the merits of avoiding the duplicates removal process, only a small number of object candidates are supervised by the object labels in each iteration. As a result, the model has to obtain enough supervised signals from a larger amount of data or more training epochs.
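To make the one-to-one matching concrete, the following minimal sketch builds a pair-wise cost matrix between predictions and ground truths and solves it with the Hungarian algorithm, in the spirit of DETR's global matching. The simplified cost (negative class probability plus an L1 box distance, omitting the GIoU term DETR-style detectors also use), the function name, and the weights are illustrative assumptions, not code from any of the cited works.

```python
import torch
from scipy.optimize import linear_sum_assignment


def one_to_one_assign(pred_logits, pred_boxes, gt_labels, gt_boxes,
                      cls_weight=1.0, box_weight=5.0):
    """Minimal one-to-one label assignment in the spirit of DETR.

    pred_logits: (N, C) classification logits for N object candidates
    pred_boxes:  (N, 4) predicted boxes, normalized (cx, cy, w, h)
    gt_labels:   (M,)   ground-truth class indices (LongTensor)
    gt_boxes:    (M, 4) ground-truth boxes, normalized (cx, cy, w, h)
    Returns the indices of matched (prediction, ground-truth) pairs.
    """
    prob = pred_logits.softmax(-1)                      # (N, C)
    cost_cls = -prob[:, gt_labels]                      # (N, M): negative class probability
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 distance between boxes
    cost = cls_weight * cost_cls + box_weight * cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```

Because only the matched candidates receive positive supervision and every remaining query is assigned "no object", this style of assignment yields exactly the sparse supervision discussed above.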

2.3 Data-Efficiency of Vision Transformers

Vision Transformers [6,10,11,17,28,41,42,45,46] (ViTs) are emerging as an alternative to CNNs for feature extraction and visual recognition. Despite their superior performance, they are generally more data-hungry than their CNN counterparts. To tackle this problem, DeiT [36] improves its data efficiency by knowledge distillation from pre-trained CNNs, coupled with a better training recipe. Liu et al. propose a dense relative localization loss to improve ViTs' data efficiency [27]. Unlike the prior works [2,27,36] that focus on the data efficiency issue of transformer backbones on image classification tasks, we tackle the data efficiency issue of detection transformers on the object detection task.

Table 1. Model transformation from Sparse RCNN (SRCN for short) to DETR, experimented on Cityscapes [7]. "50E AP" and "300E AP" indicate average precision after training for 50 and 300 epochs, respectively. The change in AP is shown in brackets, where red indicates drops and blue indicates gains in AP.

Model | Removed           | Added                 | 50E AP      | 300E AP     | Params | FLOPs
SRCN  | –                 | –                     | 29.4        | 35.9        | 106M   | 631G
Net1  | SRCN Recipe       | DETR Recipe           | 30.6 (+1.2) | 34.4 (-1.5) | 106M   | 294G
Net2  | FPN               | –                     | 23.3 (-7.3) | 26.6 (-7.8) | 103M   | 244G
Net3  | –                 | transformer encoder   | 21.0 (-2.3) | 27.5 (+0.9) | 111M   | 253G
Net4  | dynamic conv      | cross-attn in decoder | 18.1 (-2.9) | 25.4 (-2.1) | 42M    | 86G
Net5  | –                 | dropout in decoder    | 16.7 (-1.4) | 26.1 (+0.7) | 42M    | 86G
Net6  | bbox refinement   | –                     | 15.0 (-1.7) | 22.7 (-3.4) | 41M    | 86G
Net7  | RoIAlign          | –                     | 6.6 (-8.4)  | 17.7 (-5.0) | 41M    | 86G
DETR  | initial proposals | –                     | 1.6 (-5.0)  | 11.5 (-6.2) | 41M    | 86G

3 Difference Analysis of RCNNs and DETRs

As can be seen in Fig. 1, detection transformers are generally more data-hungry than RCNNs. To find out the key factors for data efficiency, we transform a data-efficient RCNN step by step into a data-hungry detection transformer to ablate the effects of different designs. A similar research approach has also been adopted by ATSS [47] and Visformer [5], but for different research purposes.

3.1 Detector Selection

To obtain insightful results from the model transformation, we need to choose appropriate detectors to conduct the experiments. To this end, we choose Sparse RCNN and DETR for the following reasons. Firstly, they are representative detectors from the RCNN and detection transformer families, respectively. The observations and conclusions drawn from the transformation between them shall also be helpful for other detectors. Secondly, there is a large difference between the two detectors in data efficiency, as shown in Fig. 1. Thirdly, they share many similarities in label assignment, loss design, and optimization, which helps us eliminate the less significant factors while focusing more on the core differences.

3.2 Transformation from Sparse RCNN to DETR

During the model transformation, we consider two training schedules that are frequently used in detection transformers. The first is training for 50 epochs with the learning rate decayed after 40 epochs, denoted as 50E. The second is training for 300 epochs with the learning rate decayed after 200 epochs, denoted as 300E. The transformation process is summarized in Table 1.

Alternating Training Recipe. Though Sparse RCNN and DETR share many similarities, there are still slight differences in their training recipes, including the classification loss, the number of object queries, the learning rate, and gradient clipping. We first eliminate these differences by replacing the Sparse RCNN training recipe with the DETR training recipe. Eliminating the differences in training recipes helps us focus more on the key factors that affect data efficiency.

Removing FPN. Multi-scale feature fusion has been proved effective for object detection [22]. The attention mechanism has a quadratic complexity with respect to the image scale, making the modeling of multi-scale features in DETR non-trivial. Thus DETR only takes the 32× down-sampled single-scale feature for prediction. In this stage, we remove the FPN neck and send only the 32× down-sampled feature to the detection head, which is consistent with DETR. As expected, without multi-scale modeling, the model performance degrades significantly by 7.3 AP under the 50E schedule, as shown in Table 1.

Introducing Transformer Encoder. In DETR, the transformer encoder can be regarded as the neck of the detector, which is used to enhance the features extracted by the backbone. After removing the FPN neck, we add the transformer encoder neck to the model. It can be seen that the AP result decreases under the 50E schedule while it improves under the 300E schedule. We conjecture that, similar to ViT [10], the attention mechanism in the encoder requires longer training to converge and manifest its advantages, due to the lack of inductive biases.

Replacing Dynamic Convolutions with Cross-attention. A very interesting design in Sparse RCNN is the dynamic convolution [20,34] in the decoder, which plays a role very similar to that of cross-attention in DETR. Specifically, they both adaptively aggregate context from the image features to the object candidates based on their similarity. In this step, we replace the dynamic convolution with a cross-attention layer with learnable query positional embeddings, and the corresponding results are shown in Table 1. Counter-intuitively, a larger number of learnable parameters does not necessarily make the model more data-hungry. In fact, the dynamic convolutions with about 70M parameters can exhibit better data efficiency than the parameter-efficient cross-attention layer.

Aligning Dropout Settings in the Decoder. A slight difference between Sparse RCNN and DETR is the use of dropout layers in the self-attention and FFN layers of the decoder. In this stage, we eliminate the interference of these factors.

Removing Cascaded Bounding Box Refinement. Sparse RCNN follows the cascaded bounding box regression in Cascade RCNN [1], where each decoder layer iteratively refines the bounding box predictions made by the previous layer. We remove it in this stage and, as expected, the model performance degrades to some extent.

Removing RoIAlign. Sparse RCNN, like other detectors in the RCNN family, samples features from local regions of interest, and then makes predictions based on the sampled sparse features [33]. By contrast, each content query in DETR aggregates object-specific information directly from the global feature map. In this step, we remove the RoIAlign [18] operation in Sparse RCNN, together with the box target transformation [16]. A significant degradation of the model performance occurs; in particular, under the 50E schedule the model performance decreases by 8.4 AP. We conjecture that learning to focus on local object regions from the entire feature map is non-trivial, and that the model requires more data and training epochs to capture the locality properties.

Removing Initial Proposals. Finally, DETR directly predicts the target bounding boxes, while RCNNs make predictions relative to some initial guesses. In this step, we eliminate this difference by removing the initial proposals. Unexpectedly, this results in a significant decrease in model performance. We suspect that the initial proposals work as a spatial prior that helps the model to focus on object regions, thus reducing the need to learn locality from large amounts of training data.

3.3 Summary

By now, we have completed the model transformation from Sparse RCNN to DETR. From Table 1 and our analysis in Sect. 3.2, it can be seen that three factors result in more than 5 AP performance changes and are key to data efficiency: (a) sparse feature sampling from local regions, e.g., using RoIAlign; (b) multi-scale features, which depend on sparse feature sampling to be computationally feasible; and (c) prediction relative to initial spatial priors. Among them, (a) and (c) help the model to focus on local object regions and alleviate the requirement of learning locality from a large amount of data, while (b) facilitates a more comprehensive utilization and enhancement of the image features, though it also relies on sparse features. It is worth mentioning that DeformDETR [50] is a special case in the detection transformer family, which shows comparable data efficiency to Sparse RCNN. Our conclusions drawn from the Sparse-RCNN-to-DETR model transformation can also explain DeformDETR's data efficiency. Specifically, multi-scale deformable attention samples sparse features from local regions of the image and utilizes multi-scale features, and the prediction of the model is relative to the initial reference points. Thus, all three key factors are satisfied in DeformDETR, though it was not intended to be data-efficient on small-size datasets.

4 Method

In this section, we aim to improve the data efficiency of existing detection transformers while making minimum modifications to their original designs. Firstly, we provide a brief revisit of existing detection transformers. Subsequently, based on the experiments and analysis in the previous section, we make minor modifications to existing data-hungry detection transformer models, like DETR [3] and CondDETR [29], to significantly improve their data efficiency. Finally, we propose a simple yet effective label augmentation method that provides richer supervision signals to detection transformers to further improve their data efficiency.

4.1 A Revisit of Detection Transformers

Model Structure. Detection transformers generally consist of a backbone, a transformer encoder, a transformer decoder, and the prediction heads. The backbone first extracts multi-scale features from the input image, denoted as $\{f^l\}_{l=1}^{L}$, where $f^l \in \mathbb{R}^{H^l \times W^l \times C^l}$. Subsequently, the last feature level with the lowest resolution is flattened and embedded to obtain $z^L \in \mathbb{R}^{S^L \times D}$, where $S^L = H^L \times W^L$ is the sequence length and $D$ is the feature dimension. Correspondingly, the positional embedding is denoted as $p^L \in \mathbb{R}^{S^L \times D}$. Afterward, the single-scale sequence feature is encoded by the transformer encoder to obtain $z_e^L \in \mathbb{R}^{S^L \times D}$. The decoder consists of a stack of $L_d$ decoder layers, and the query content embedding is initialized as $q_0 \in \mathbb{R}^{N \times D}$, where $N$ is the number of queries. Each decoder layer $\mathrm{DecoderLayer}_\ell$ takes the previous decoder layer's output $q_{\ell-1}$, the query positional embedding $p_q$, the image sequence feature $z_\ell$ and its position embedding $p_\ell$ as inputs, and outputs the decoded sequence features:

$$q_\ell = \mathrm{DecoderLayer}_\ell\left(q_{\ell-1}, p_q, z_\ell, p_\ell\right), \quad \ell = 1 \ldots L_d. \tag{1}$$

In most detection transformers, like DETR and CondDETR, a single-scale image feature is utilized in the decoder, and thus $z_\ell = z_e^L$ and $p_\ell = p^L$, where $\ell = 1 \ldots L_d$.

Label Assignment. Detection transformers view the object detection task as a set prediction problem and perform deep supervision [21] on the predictions made by each decoder layer. Specifically, the label set can be denoted as $y = \{y_1, \ldots, y_M, \emptyset, \ldots, \emptyset\}$, where $M$ denotes the number of foreground objects in the image and the $\emptyset$ (no object) entries pad the label set to a length of $N$. Correspondingly, the output of each decoder layer can be written as $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$. During label assignment, detection transformers search for a permutation $\tau \in \mathcal{T}_N$ with the minimum matching cost:

$$\hat{\tau} = \arg\min_{\tau \in \mathcal{T}_N} \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\tau(i)}\right), \tag{2}$$

where $\mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\tau(i)}\right)$ is the pair-wise loss between the ground truth $y_i$ and the prediction with index $\tau(i)$.

4.2 Model Improvement

In this section, we make slight adjustments to data-hungry detection transformers such as DETR and CondDETR to largely boost their data efficiency.

Sparse Feature Sampling. From the analysis in Sect. 3, we can see that local feature sampling is critical to data efficiency. Fortunately, in detection transformers, the object locations are predicted after each decoder layer. Therefore, we can sample local features under the guidance of the bounding box predictions made by the previous decoder layer without introducing new parameters, as shown in Fig. 2. Although more sophisticated local feature sampling methods could be used, we simply adopt the commonly used RoIAlign [18]. Formally, the sampling operation can be written as:

$$z_\ell^L = \mathrm{RoIAlign}\left(z_e^L, b_{\ell-1}\right), \quad \ell = 2 \ldots L_d, \tag{3}$$

where $b_{\ell-1}$ denotes the bounding boxes predicted by the previous layer, $z_\ell^L \in \mathbb{R}^{N \times K^2 \times D}$ is the sampled feature, and $K$ is the feature resolution in RoIAlign sampling. Note that the reshape and flatten operations are omitted in Eq. 3. Similarly, the corresponding positional embedding $p_\ell^L$ can be obtained. The cascaded structure in the detection transformer makes it natural to use layer-wise bounding box refinement [1,50] to improve detection performance. Our experiments in Sect. 3 also validate the effectiveness of iterative refinement and of making predictions with respect to initial spatial references. For this reason, we also introduce bounding box refinement and initial reference points in our implementation, as is done in CondDETR [29].
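Below is a minimal sketch of the box-guided sampling in Eq. 3 using torchvision's RoIAlign. The tensor layout, the normalized (cx, cy, w, h) box convention, and the helper name are assumptions made for illustration; they are not the authors' code.

```python
import torch
from torchvision.ops import roi_align


def sample_key_value(z_e, boxes, K=4):
    """Sample K x K local features per query from the encoder output (cf. Eq. 3).

    z_e:   (B, D, H, W) encoder memory reshaped back into a feature map
    boxes: (B, N, 4) boxes from the previous decoder layer, normalized (cx, cy, w, h)
    Returns (B, N, K*K, D) key/value tokens for the cross-attention layer.
    """
    B, D, H, W = z_e.shape
    N = boxes.shape[1]
    cx, cy, w, h = boxes.unbind(-1)
    # normalized (cx, cy, w, h) -> absolute (x1, y1, x2, y2) in feature-map coordinates
    xyxy = torch.stack([(cx - 0.5 * w) * W, (cy - 0.5 * h) * H,
                        (cx + 0.5 * w) * W, (cy + 0.5 * h) * H], dim=-1)
    rois = list(xyxy)                                          # one (N, 4) tensor per image
    feat = roi_align(z_e, rois, output_size=K, aligned=True)   # (B*N, D, K, K)
    return feat.flatten(2).permute(0, 2, 1).reshape(B, N, K * K, D)
```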

Fig. 2. The proposed data-efficient detection transformer structure. With minimum modifications, we perform sparse feature sampling on the key and value features sent to the cross-attention layers in the decoder, under the guidance of the bounding boxes predicted by the previous layer. Note that the box head is part of the original detection transformers, which utilize deep supervision on the predictions made by each decoder layer. The backbone, the transformer encoder, and the first decoder layer are kept unchanged.

Incorporating Multi-scale Feature. Our sparse feature sampling makes it possible to use multi-scale features in detection transformers with little computation cost. To this end, we also flatten and embed the high-resolution features extracted by the backbone to obtain $\{z^l\}_{l=1}^{L-1}$, with $z^l \in \mathbb{R}^{S^l \times D}$, for local feature sampling. However, these features are not processed by the transformer encoder. Although more sophisticated techniques could be used, the single-scale features sampled by RoIAlign are simply concatenated to form our multi-scale feature, and they are naturally fused by the cross-attention in the decoder:

$$z_\ell^{ms} = \left[z_\ell^1, z_\ell^2, \ldots, z_\ell^L\right], \quad \ell = 2 \ldots L_d, \tag{4}$$

where $z_\ell^{ms} \in \mathbb{R}^{N \times LK^2 \times D}$ is the multi-scale feature, and $z_\ell^l = \mathrm{RoIAlign}\left(z^l, b_{\ell-1}\right)$, $l = 1 \ldots L-1$. The corresponding positional embedding $p_\ell^{ms}$ is obtained in a similar way. The decoding process is the same as in the original detection transformers, as shown in Eq. 1, where we have $z_\ell = z_\ell^{ms}$ and $p_\ell = p_\ell^{ms}$. Please refer to the Appendix for implementation details.
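Eq. 4 amounts to running the same box-guided sampler over each feature level and concatenating the results along the token axis. The sketch below reuses sample_key_value from the earlier sketch; the assumption that all levels have already been projected to the common dimension D is ours.

```python
import torch


def sample_multi_scale(feature_maps, boxes, K=4):
    """Multi-scale sparse sampling (cf. Eq. 4).

    feature_maps: list of L tensors (B, D, H_l, W_l) -- high-resolution backbone
                  features plus the encoder output, all projected to D channels
    boxes:        (B, N, 4) boxes from the previous decoder layer, normalized (cx, cy, w, h)
    Returns (B, N, L*K*K, D) multi-scale key/value tokens per query.
    """
    sampled = [sample_key_value(f, boxes, K) for f in feature_maps]  # each (B, N, K*K, D)
    return torch.cat(sampled, dim=2)
```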

Fig. 3. Illustration of the proposed label augmentation method. The predictions and the ground truths are represented by circles and rectangles, respectively. The matching between foreground instances is represented by solid lines, while the matching between background instances is represented by dotted lines. The prediction in blue that was originally matched to a background instance in (a) is now matched to a foreground instance in our method (b), thus obtaining more abundant supervision.

4.3 Label Augmentation for Richer Supervision

Detection transformers perform one-to-one matching for label assignment, which means only a small number of detection candidates are provided with a positive supervision signal in each iteration. As a result, the model has to obtain enough supervision from a larger amount of data or more training epochs. To alleviate this problem, we propose a label augmentation strategy to provide a richer supervision signal to the detection transformers, by simply repeating positive labels during bipartite matching. As shown in Fig. 3, we repeat the label of each foreground sample $y_i$ for $R_i$ times, while keeping the total length $N$ of the label set unchanged:

$$y = \left\{y_1^1, y_1^2, \ldots, y_1^{R_1}, \ldots, y_M^1, y_M^2, \ldots, y_M^{R_M}, \emptyset, \ldots, \emptyset\right\}. \tag{5}$$

Subsequently, the label assignment is achieved according to the operation in Eq. 2.


Two label repeat strategies are considered in our implementation, as sketched below. (a) Fixed repeat times, where all positive labels are repeated the same number of times, i.e., $R_i = R$, $i = 1 \ldots M$. (b) Fixed positive sample ratio, where the positive labels are sampled repeatedly to ensure a proportion of $r$ positive samples in the label set. Specifically, $F = N \times r$ is the expected number of positive samples after repeating labels. We first repeat each positive label $\lfloor F/M \rfloor$ times, and subsequently randomly sample $F \bmod M$ positive labels without repetition. By default, we use the fixed repeat times strategy, because it is easier to implement and the resulting label set is deterministic.
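The two repeat strategies can be sketched as follows. The list-based representation of the label set and the function names are illustrative only, and the padding with "no object" entries up to length N is assumed to happen in the existing assignment code.

```python
import random


def repeat_fixed_times(gt_labels, R=2):
    """Fixed repeat times: every foreground label is duplicated R times."""
    return [label for label in gt_labels for _ in range(R)]


def repeat_fixed_ratio(gt_labels, num_queries, r=0.25):
    """Fixed positive sample ratio: duplicate labels until about r * N are positive."""
    M = len(gt_labels)
    if M == 0:
        return []
    F = int(num_queries * r)               # expected number of positive samples
    repeated = list(gt_labels) * (F // M)  # repeat each label floor(F / M) times
    repeated += random.sample(list(gt_labels), F % M)  # remainder, sampled without repetition
    return repeated
```

With the default repeat time of 2, the augmented label set is deterministic, which is the reason given above for preferring this strategy.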

5 Experiments

Datasets. To explore detection transformers' data efficiency, most of our experiments are conducted on small-size datasets, including Cityscapes [7] and sub-sampled COCO 2017 [24]. Cityscapes contains 2,975 images for training and 500 images for evaluation. For the sub-sampled COCO 2017 dataset, the training images are randomly sub-sampled by 0.1, 0.05, 0.02, and 0.01, while the evaluation set is kept unchanged (one plausible way to construct such subsets is sketched below). Besides, we also validate the effectiveness of our method on the full-size COCO 2017 dataset with 118K training images.

Implementation Details. By default, our feature sampling is implemented as RoIAlign with a feature resolution of 4. Three different feature levels are included for multi-scale feature fusion. A fixed repeat time of 2 is adopted for our label augmentation, and non-maximum suppression (NMS) with a threshold of 0.7 is used for duplicate removal. All models are trained for 50 epochs and the learning rate decays after 40 epochs, unless specified otherwise. ResNet-50 [19] pre-trained on ImageNet-1K [9] is used as the backbone. To guarantee a sufficient number of training iterations, all experiments on Cityscapes and sub-sampled COCO 2017 are trained with a batch size of 8, and the results are averaged over five repeated runs with different random seeds. Our data-efficient detection transformers make only slight modifications to existing methods. Unless specified, we follow the original implementation details of the corresponding baseline methods [3,29]. Run time is evaluated on an NVIDIA A100 GPU.
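The paper does not specify how the COCO subsets were drawn; one plausible way (assumed here, with a fixed random seed for reproducibility) is to filter the standard COCO annotation JSON by a random subset of image ids:

```python
import json
import random


def subsample_coco(ann_file, out_file, ratio=0.1, seed=0):
    """Keep a random `ratio` of the training images and their annotations."""
    with open(ann_file) as f:
        coco = json.load(f)
    random.seed(seed)
    n_keep = int(len(coco["images"]) * ratio)
    keep_ids = {img["id"] for img in random.sample(coco["images"], n_keep)}
    coco["images"] = [img for img in coco["images"] if img["id"] in keep_ids]
    coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep_ids]
    with open(out_file, "w") as f:
        json.dump(coco, f)
```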

5.1 Main Results

Results on Cityscapes. In this section, we compare our method with existing detection transformers. As shown in Table 2, most of them suffer from the data-efficiency issue. Nevertheless, with minor changes to the CondDETR model, our DE-CondDETR is able to achieve data efficiency comparable to DeformDETR. Further, with the richer supervision provided by label augmentation, our DELA-CondDETR surpasses DeformDETR by 2.2 AP. Besides, our method can be combined with other detection transformers to significantly improve their data efficiency; for example, our DE-DETR and DELA-DETR trained for 50 epochs perform significantly better than DETR trained for 500 epochs.


Table 2. Comparison of detection transformers on Cityscapes. DE denotes data-efficient and LA denotes label augmentation. † indicates the query number is increased from 100 to 300.

Method                   | Epochs | AP   | AP50 | AP75 | APS | APM  | APL  | Params | FLOPs | FPS
DETR [3]                 | 300    | 11.5 | 26.7 | 8.6  | 2.5 | 9.5  | 25.1 | 41M    | 86G   | 44
UP-DETR [8]              | 300    | 23.8 | 45.7 | 20.8 | 4.0 | 20.3 | 46.6 | 41M    | 86G   | 44
PnP-DETR-α=0.33 [39]     | 300    | 11.2 | 11.5 | 8.7  | 2.3 | 21.2 | 25.6 | 41M    | 79G   | 43
PnP-DETR-α=0.80 [39]     | 300    | 11.4 | 26.6 | 8.1  | 2.5 | 9.3  | 24.7 | 41M    | 83G   | 43
CondDETR [29]            | 50     | 12.1 | 28.0 | 9.1  | 2.2 | 9.8  | 27.0 | 43M    | 90G   | 39
SMCA (single scale) [14] | 50     | 14.7 | 32.9 | 11.6 | 2.9 | 12.9 | 30.9 | 42M    | 86G   | 39
DeformDETR [50]          | 50     | 27.3 | 49.2 | 26.3 | 8.7 | 28.2 | 45.7 | 40M    | 174G  | 28
DE-DETR                  | 50     | 21.7 | 41.7 | 19.2 | 4.9 | 20.0 | 39.9 | 42M    | 88G   | 34
DELA-DETR†               | 50     | 24.5 | 46.2 | 22.5 | 6.1 | 23.3 | 43.9 | 42M    | 91G   | 29
DE-CondDETR              | 50     | 26.8 | 47.8 | 25.4 | 6.8 | 25.6 | 46.6 | 44M    | 107G  | 29
DELA-CondDETR            | 50     | 29.5 | 52.8 | 27.6 | 7.5 | 28.2 | 50.1 | 44M    | 107G  | 29

Fig. 4. Performance comparison of different methods on sub-sampled COCO 2017 dataset. Note the sample ratio is shown on a logarithmic scale. As can be seen, both local feature sampling and label augmentation consistently improve the model performance under varying data sampling ratios.

Results on Sub-sampled COCO 2017. The sub-sampled COCO 2017 datasets contain 11,828 (10%), 5,914 (5%), 2,365 (2%), and 1,182 (1%) training images, respectively. As shown in Fig. 4, our method consistently outperforms the baseline methods by a large margin. In particular, DELA-DETR trained with only ∼1K images significantly outperforms the DETR baseline trained with five times the training data. Similarly, DELA-CondDETR consistently outperforms the CondDETR baseline trained with twice the data volume.

5.2 Ablations

In this section, we perform ablation experiments to better understand each component of our method. All the ablation studies are conducted on DELA-CondDETR and the Cityscapes dataset, while more ablation studies based on DELA-DETR can be found in our Appendix.

Table 3. Ablations on each component in DELA-CondDETR. "SF", "MS", and "LA" represent sparse feature sampling, multi-scale feature fusion, and label augmentation, respectively.

Method        | SF | MS | LA | AP   | AP50 | AP75 | APS | APM  | APL  | Params | FLOPs | FPS
CondDETR [29] |    |    |    | 12.1 | 28.0 | 9.1  | 2.2 | 9.8  | 27.0 | 43M    | 90G   | 39
              |    |    | ✓  | 14.7 | 31.6 | 12.1 | 2.9 | 12.5 | 32.1 | 43M    | 90G   | 38
              | ✓  |    |    | 20.4 | 40.7 | 17.7 | 2.9 | 16.9 | 42.0 | 44M    | 95G   | 32
DE-CondDETR   | ✓  | ✓  |    | 26.8 | 47.8 | 25.4 | 6.8 | 25.6 | 46.6 | 44M    | 107G  | 29
DELA-CondDETR | ✓  | ✓  | ✓  | 29.5 | 52.8 | 27.6 | 7.5 | 28.2 | 50.1 | 44M    | 107G  | 29

Effectiveness of Each Module. We first ablate the role of each module in our method, as shown in Table 3. The use of local feature sampling and multi-scale feature fusion significantly improves the performance of the model, by 8.3 and 6.4 AP, respectively. In addition, label augmentation further improves the performance by 2.7 AP. Besides, using it alone also brings a gain of 2.6 AP.

Table 4. Ablations on multi-scale feature levels and feature resolutions for RoIAlign. Note that label augmentation is not utilized, for clarity.

MS Lvls | RoI Res. | AP   | AP50 | AP75 | APS | APM  | APL  | Params | FLOPs | FPS
1       | 1        | 14.8 | 35.1 | 11.0 | 2.4 | 11.7 | 31.1 | 44M    | 90G   | 32
1       | 4        | 20.4 | 40.7 | 17.7 | 2.9 | 16.9 | 42.0 | 44M    | 95G   | 32
1       | 7        | 20.7 | 40.9 | 18.5 | 2.9 | 16.8 | 42.7 | 44M    | 104G  | 31
3       | 4        | 26.8 | 47.8 | 25.4 | 6.8 | 25.6 | 46.6 | 44M    | 107G  | 29
4       | 4        | 26.3 | 47.1 | 25.1 | 6.5 | 24.8 | 46.5 | 49M    | 112G  | 28

Feature Resolution for RoIAlign. In general, a larger sampling resolution in RoIAlign provides richer information and thus improves detection performance. However, sampling at a larger feature resolution is also more time-consuming and increases the computational cost of the decoding process. As shown in Table 4, the model performance is significantly improved, by 5.6 AP, when the resolution is increased from 1 to 4. However, when the resolution is further increased to 7, the improvement is minor while the FLOPs and latency increase. For this reason, we set the feature resolution for RoIAlign to 4 by default.

Number of Multi-scale Features. To incorporate multi-scale features, we also sample the 8× and 16× down-sampled features from the backbone to construct multi-scale features of 3 different levels. As can be seen from Table 4, this significantly improves the model performance by 6.4 AP. However, when we further add the 64× down-sampled features for multi-scale fusion, the performance drops by 0.5 AP. By default, we use 3 feature levels for multi-scale feature fusion.

Table 5. Ablations on label augmentation using a fixed repeat time. Params, FLOPs, and FPS are omitted since they are consistent for all settings.

Time | AP   | AP50 | AP75 | APS | APM  | APL
–    | 26.8 | 47.8 | 25.4 | 6.8 | 25.6 | 46.6
2    | 29.5 | 52.8 | 27.6 | 7.5 | 28.2 | 50.1
3    | 29.4 | 52.6 | 28.0 | 7.6 | 28.1 | 50.3
4    | 29.0 | 52.0 | 27.7 | 7.8 | 27.9 | 49.5
5    | 28.7 | 51.3 | 27.4 | 7.8 | 27.7 | 49.3

Table 6. Ablations on label augmentation using a fixed positive sample ratio.

Ratio | AP   | AP50 | AP75 | APS | APM  | APL
–     | 26.8 | 47.8 | 25.4 | 6.8 | 25.6 | 46.6
0.1   | 27.7 | 49.7 | 26.1 | 7.4 | 26.5 | 47.2
0.2   | 28.2 | 50.2 | 26.9 | 7.4 | 26.8 | 48.5
0.25  | 28.3 | 50.5 | 27.2 | 7.5 | 27.1 | 48.3
0.3   | 27.9 | 50.3 | 26.5 | 7.3 | 27.1 | 47.4
0.4   | 27.6 | 49.7 | 26.0 | 7.0 | 27.0 | 46.8

Strategies for Label Augmentation. In this section, we ablate the two proposed label augmentation strategies, namely fixed repeat times and fixed positive sample ratio. As shown in Table 5, using different fixed repeat times consistently improves the performance of the DE-CondDETR baseline, but the performance gain tends to decrease as the number of repetitions increases. Moreover, as shown in Table 6, although using different ratios can bring improvements in AP, the best performance is achieved when the positive-to-negative sample ratio is 1:3, which, interestingly, is also the most commonly used positive-to-negative sampling ratio in the RCNN series of detectors, e.g., Faster RCNN.

Table 7. Performance of our data-efficient detection transformers on COCO 2017. All models are trained for 50 epochs.

Method        | Epochs | AP   | AP50 | AP75 | APS  | APM  | APL  | Params | FLOPs | FPS
DETR [3]      | 50     | 33.6 | 54.6 | 34.2 | 13.2 | 35.7 | 53.5 | 41M    | 86G   | 43
DE-DETR       | 50     | 40.2 | 60.4 | 43.2 | 23.3 | 42.1 | 56.4 | 43M    | 88G   | 33
DELA-DETR†    | 50     | 41.9 | 62.6 | 44.8 | 24.9 | 44.9 | 56.8 | 43M    | 91G   | 29
CondDETR [29] | 50     | 40.2 | 61.1 | 42.6 | 19.9 | 43.6 | 58.7 | 43M    | 90G   | 39
DE-CondDETR   | 50     | 41.7 | 62.4 | 44.9 | 24.4 | 44.5 | 56.3 | 44M    | 107G  | 28
DELA-CondDETR | 50     | 43.0 | 64.0 | 46.4 | 26.0 | 45.5 | 57.7 | 44M    | 107G  | 28

5.3 Generalization to Sample-Rich Dataset

Although the above experiments show that our method can improve model performance when only limited training data is available, there is no guarantee that our method remains effective when the training data is sufficient. To this end, we evaluate our method on COCO 2017 with a sufficient amount of data. As can be seen from Table 7, our method does not degrade the model performance on COCO 2017. Conversely, it delivers a promising improvement. Specifically, DELA-DETR and DELA-CondDETR improve their corresponding baselines by 8.3 and 2.8 AP, respectively.

6 Conclusion

In this paper, we identify the data-efficiency issue of detection transformers. Through step-by-step model transformation from Sparse RCNN to DETR, we find that sparse feature sampling from local areas holds the key to data efficiency. Based on these, we improve existing detection transformers by simply sampling multi-scale features under the guidance of predicted bounding boxes, with minimum modifications to the original models. In addition, we propose a simple yet effective label augmentation strategy to provide richer supervision and thus further alleviate the data-efficiency issue. Extensive experiments validate the effectiveness of our method. As transformers become increasingly popular for visual tasks, we hope our work will inspire the community to explore the data efficiency of transformers for different tasks. Acknowledgement. This work is supported by National Key R&D Program of China under Grant 2020AAA0105701, National Natural Science Foundation of China (NSFC) under Grants 61872327, Major Special Science and Technology Project of Anhui (No. 012223665049), and the ARC project FL-170100117.

References 1. Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018) 2. Cao, Y.H., Yu, H., Wu, J.: Training vision transformers with only 2040 images. arXiv preprint arXiv:2201.10728 (2022) 3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 4. Chen, Z., Zhang, J., Tao, D.: Recurrent glimpse-based decoder for detection with transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5260–5269 (2022) 5. Chen, Z., Xie, L., Niu, J., Liu, X., Wei, L., Tian, Q.: Visformer: the vision-friendly transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 589–598 (2021) 6. Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C.: Twins: Revisiting the design of spatial attention in vision transformers. In: 34th Proceedings of the International Conference on Advances in Neural Information Processing Systems (2021) 7. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)


8. Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: unsupervised pre-training for object detection with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020) 9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database.ac In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009) 10. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 11. Fang, J., Xie, L., Wang, X., Zhang, X., Liu, W., Tian, Q.: Msg-transformer: exchanging local spatial information by manipulating messenger tokens. arXiv preprint arXiv:2105.15168 (2021) 12. Fang, Y., et al.: You only look at one sequence: rethinking transformer in vision through object detection. arXiv preprint arXiv:2106.00666 (2021) 13. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2009) 14. Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of DETR with spatially modulated co-attention. In: Proceedings of the IEEE International Conference on Computer Vision (2021) 15. Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J.: OTA: optimal transport assignment for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 303–312 (2021) 16. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semaantic segmentation. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014) 17. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: 34th Proceedings of the Conference on Advances in Neural Information Processing Systems (2021) 18. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017) 19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 20. Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. In: 29th Proceedings of the Conference on Advances in Neural Information Processing Systems (2016) 21. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: Artificial Intelligence and Statistics, pp. 562–570. PMLR (2015) 22. Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017) 23. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017) 24. Lin, T., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 25. Liu, F., Wei, H., Zhao, W., Li, G., Peng, J., Li, Z.: WB-DETR: transformerbased detector without backbone. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2979–2987 (2021)


26. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 2 27. Liu, Y., Sangineto, E., Bi, W., Sebe, N., Lepri, B., Nadai, M.: Efficient training of visual transformers with small datasets. In: 34th Proceedings of the Conference on Advances in Neural Information Processing Systems (2021) 28. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021) 29. Meng, D., et al.: Conditional DETR for fast training convergence. In: Proceedings of the IEEE International Conference on Computer Vision (2021) 30. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016) 31. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016) 32. Shen, Y., et al.: Parallel instance query network for named entity recognition. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2022). arxiv.org/abs/2203.10545 33. Sun, P., et al.: Sparse r-CNN: end-to-end object detection with learnable proposals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14454–14463 (2021) 34. Tian, Z., Shen, C., Chen, H.: Conditional convolutions for instance segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 282–298. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058452-8 17 35. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019) 36. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., J´egou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021) 37. Vaswani, A., et al.: Attention is all you need. In: Conference on Neural Information Processing Systems (2017) 38. Wang, J., Chen, K., Yang, S., Loy, C.C., Lin, D.: Region proposal by guided anchoring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2965–2974 (2019) 39. Wang, T., Yuan, L., Chen, Y., Feng, J., Yan, S.: PNP-DETR: towards efficient visual analysis with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021) 40. Wang, W., Cao, Y., Zhang, J., Tao, D.: FP-DETR: detection transformer advanced by fully pre-training. In: International Conference on Learning Representations (2022). ’openreview.net/forum?id=yjMQuLLcGWK 41. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021) 42. Xu, Y., Zhang, Q., Zhang, J., Tao, D.: Vitae: vision transformer advanced by exploring intrinsic inductive bias. In: 34th Proceedings of the Conference on Advances in Neural Information Processing Systems (2021)


43. Yang, T., Zhang, X., Li, Z., Zhang, W., Sun, J.: MetaAnchor: learning to detect objects with customized anchors. In: 31st Proceedings of the Conference on Advances in Neural Information Processing Systems (2018) 44. Yuan, H., et al.: Polyphonicformer: unified query learning for depth-aware video panoptic segmentation. In: European Conference on Computer Vision (2022) 45. Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567 (2021) 46. Zhang, Q., Xu, Y., Zhang, J., Tao, D.: ViTAEv2: vision transformer advanced by exploring inductive bias for image recognition and beyond. arXiv preprint arXiv:2202.10108 (2022) 47. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768 (2020) 48. Zhang, X., Wan, F., Liu, C., Ji, R., Ye, Q.: FreeAnchor: learning to match anchors for visual object detection. In: 32nd Proceedings of the Conference on Advances in Neural Information Processing Systems (2019) 49. Zhu, B., et al.: AutoAssign: differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496 (2020) 50. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: International Conference on Learning Representations (2020)

Open-Vocabulary DETR with Conditional Matching

Yuhang Zang1, Wei Li1, Kaiyang Zhou1, Chen Huang2, and Chen Change Loy1(B)

1 S-Lab, Nanyang Technological University, Singapore, Singapore
{zang0012,wei.l,kaiyang.zhou,ccloy}@ntu.edu.sg
2 Carnegie Mellon University, Pittsburgh, USA
[email protected]

Abstract. Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR—hence the name OV-DETR—which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that our OV-DETR—the first end-to-end Transformer-based open-vocabulary detector—achieves non-trivial improvements over the current state of the art. Code is available at https://github.com/yuhangzang/OV-DETR.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_7.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
S. Avidan et al. (Eds.): ECCV 2022, LNCS 13669, pp. 106–122, 2022. https://doi.org/10.1007/978-3-031-20077-9_7

1 Introduction

Object detection, a fundamental computer vision task aiming to localize objects with tight bounding boxes in images, has been significantly advanced in the last decade thanks to the emergence of deep learning [9,15,25,29,32]. However, most object detection algorithms are unscalable in terms of the vocabulary size, i.e., they are limited to a fixed set of object categories defined in detection datasets [8,22].


Fig. 1. Comparison between an RPN-based detector and our Open-Vocabulary Transformer-based detector (OV-DETR) using conditional queries. The RPN trained on closed-set object classes tends to ignore novel classes (e.g., the "cat" region receives little response); hence the cats in this example are largely missed, with few to no proposals. By contrast, our OV-DETR is trained to perform matching between a conditional query and its corresponding box, which helps to learn correspondence that can generalize to queries from unseen classes. Note we can take input queries in the form of either text (class name) or exemplar images, which offers greater flexibility for open-vocabulary object detection.

For example, an object detector trained on COCO [22] can only detect 80 classes and is unable to handle new classes beyond the training ones. A straightforward approach to detecting novel classes is to collect and add their training images to the original dataset, and then re-train or fine-tune the detection model. This is, however, both impractical and inefficient due to the large cost of data collection and model training. In the detection literature, generalization from base to novel classes has been studied as a zero-shot detection problem [1], where zero-shot learning techniques like word embedding projection [10] are widely used. Recently, open-vocabulary detection, a new formulation that leverages large pre-trained language models, has gained increasing attention from the community [13,36]. The central idea in existing works is to align the detector's features with the embeddings provided by models pre-trained on large-scale image-text pairs like CLIP [27] (see Fig. 1(a)). This way, we can use an aligned classifier to recognize novel classes only from their descriptive texts. A major problem with existing open-vocabulary detectors [13,36] is that they rely on region proposals that are often not reliable enough to cover all novel classes in an image due to the lack of training data, see Fig. 1(a). This problem has also been identified by a recent study [17], which suggests that the binary nature of the


region proposal network (RPN) could easily lead to overfitting to seen classes (and thus failing to generalize to novel classes). In this paper, we propose to train an open-vocabulary detector end-to-end under the Transformer framework, aiming to enhance its novel-class generalization without using an intermediate RPN. To this end, we propose a novel open-vocabulary detector based on DETR [2]—hence the name OV-DETR—which is trained to detect any object given its class name or an exemplar image. This offers greater flexibility than conventional open-vocabulary detection from natural language only. Despite the simplicity of end-to-end DETR training, turning it into an open-vocabulary detector is non-trivial. The biggest challenge is the inability to calculate the classification cost for novel classes without their training labels. To overcome the challenge, we re-formulate the learning objective as binary matching between input queries (class name or exemplar image) and the corresponding objects. Such a matching loss over diverse training pairs allows the model to learn useful correspondence that can generalize to unseen queries during testing. For training, we extend the Transformer decoder of DETR to take conditional input queries. Specifically, we condition the Transformer decoder on the query embeddings obtained from the pre-trained vision-language model CLIP [27], in order to perform conditional matching for either text or image queries. Figure 1 shows this high-level idea, which proves better at detecting novel classes than RPN-based closed-set detectors. We conduct comprehensive experiments on two challenging open-vocabulary object detection datasets, and show consistent improvements in performance. Concretely, our OV-DETR method achieves 17.4 mask mAP on novel classes of the open-vocabulary LVIS dataset [13] and 29.4 box mAP on novel classes of the open-vocabulary COCO dataset [36], surpassing SOTA methods by 1.3 and 1.8 mAP, respectively.

2 Related Work

Open-Vocabulary Object Detection leverages recent advances in large pre-trained language models [13,36] to incorporate open-vocabulary information into object detectors. OVR-CNN [36] first uses BERT [6] to pre-train the Faster R-CNN detector [29] on image-caption pairs and then fine-tunes the model on downstream detection datasets. ViLD [13] adopts a distillation-based approach that aligns the image feature extractor of Mask R-CNN [15] with the image and text encoders of CLIP [27], so that CLIP can be used to synthesize the classification weights for any novel class. Prompt tuning techniques [37–39] for pre-trained vision-language models have also been applied to open-vocabulary detectors, like DetPro [7]. Our approach differs from these works in that we train a Transformer-based detector end-to-end, with a novel framework of conditional matching.

Zero-Shot Object Detection is also concerned with the problem of detecting novel classes [1,20,28,31,40]. However, this setting is less practical due to the harsh constraint of limiting access to resources relevant to unseen classes [36].


A common approach to zero-shot detection is to employ word embeddings like GloVe [26] as the classifier weights [1]. Other works have found that using external resources like textual descriptions can help improve the generalization of classifier embeddings [20,28]. Alternatively, Zhao et al. [31] used a Generative Adversarial Network (GAN) [12] to generate feature representations of novel classes, while Zhu et al. [40] synthesized unseen classes using a data augmentation strategy.

Visual Grounding is another relevant research area, where the problem is to ground a target object in an image using natural language input [3,5]. Different from open-vocabulary detection, which aims to identify all target objects in an image, visual grounding methods typically involve a particular single object, and hence cannot be directly applied to generic object detection. There is a relevant visual grounding method, though, called MDETR [16]. This method similarly trains DETR along with a given language model so as to link the output tokens of DETR with specific words. MDETR also adopts a conditional framework, where the visual and textual features are combined to be fed to the Transformer encoder and decoder. However, the MDETR method is not applicable to open-vocabulary detection because it is unable to calculate the cost matrix for novel classes under the classification framework. Our OV-DETR bypasses this challenge by using a conditional matching framework instead.

Object Detection with Transformers. The pioneering DETR approach [2] greatly simplifies the detection pipeline by casting detection as a set-to-set matching problem. Several follow-up methods have been developed to improve performance and training efficiency. Deformable DETR [41] features a deformable attention module, which samples sparse pixel locations for computing attention, and further mitigates the slow convergence issue with a multi-scale scheme. SMCA [11] accelerates training convergence with a location-aware co-attention mechanism. Conditional DETR [24] also addresses the slow convergence issue, but with conditional spatial queries learned from reference points and the decoder embeddings. Our work for the first time extends DETR to the open-vocabulary domain by casting open-vocabulary detection as a conditional matching problem, and achieves non-trivial improvements over the current SOTA.

3 Open-Vocabulary DETR

Our goal is to design a simple yet effective open-vocabulary object detector that can detect objects described by arbitrary text inputs or exemplar images. We build on the success of DETR [2], which casts object detection as an end-to-end set matching problem (among closed classes), thus eliminating the need for hand-crafted components like anchor generation and non-maximum suppression. This pipeline makes DETR a suitable framework on which to build our end-to-end open-vocabulary object detector. However, it is non-trivial to retrofit a standard DETR with closed-set matching into an open-vocabulary detector that requires matching against unseen classes. One intuitive approach to such open-set matching is to learn a class-agnostic module (e.g., ViLD [13]) to handle all classes. This is, however, still

Fig. 2. Overview of OV-DETR. Unlike the standard DETR, our method does not separate ‘objects’ from ‘non-objects’ for a closed set of classes. Instead, OV-DETR performs open-vocabulary detection by measuring the matchability (‘matched’ vs. ‘not matched’) between some conditional inputs (text or exemplar image embeddings from CLIP) and detection results. We show that this pipeline is flexible enough to detect open-vocabulary classes with arbitrary text or image inputs.

unable to match those open-vocabulary classes that come with no labeled images. Here we provide a new perspective on the matching task in DETR, which leads us to reformulate the fixed set-matching objective as a conditional binary matching one between conditional inputs (text or image queries) and detection outputs. An overview of our Open-Vocabulary DETR is shown in Fig. 2. At a high level, DETR first takes query embeddings (text or image) as conditional inputs obtained from a pre-trained CLIP [27] model, and then a binary matching loss is imposed against the detection result to measure their matchability. In the following, we revisit the closed-set matching process in standard DETR in Sect. 3.1. We then describe how to perform conditional binary matching in our OV-DETR in Sect. 3.2.

3.1 Revisiting Closed-Set Matching in DETR

For an input image $x$, a standard DETR infers $N$ object predictions $\hat{y}$, where $N$ is determined by the fixed size of the object queries $q$ that serve as learnable positional encodings. One single pass of the DETR pipeline consists of two main steps: (i) set prediction, and (ii) optimal bipartite matching.

Set Prediction. Given an input image $x$, the global context representation $c$ is first extracted by a CNN backbone $f_\phi$ and then a Transformer encoder $h_\psi$:

$c = h_\psi(f_\phi(x)),$    (1)


where the output $c$ denotes a sequence of feature embeddings. Taking the context feature $c$ and the object queries $q$ as inputs, the Transformer decoder $h_\theta$ (with prediction heads) then produces the set prediction $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$:

$\hat{y} = h_\theta(c, q),$    (2)

where $\hat{y}$ contains both bounding box predictions $\hat{b}$ and class predictions $\hat{p}$ for a closed set of training classes.

Optimal Bipartite Matching is to find the best match between the set of $N$ predictions $\hat{y}$ and the set of ground-truth objects $y = \{y_i\}_{i=1}^{M}$ (including no object $\emptyset$). Specifically, one needs to search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ that has the lowest matching cost:

$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{cost}(y_i, \hat{y}_{\sigma(i)}),$    (3)

where $\mathcal{L}_{cost}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost between the ground truth $y_i$ and the prediction $\hat{y}_{\sigma(i)}$ with index $\sigma(i)$. Note that $\mathcal{L}_{cost}$ comprises the losses for both class prediction, $\mathcal{L}_{cls}(\hat{p}, p)$, and bounding box localization, $\mathcal{L}_{box}(\hat{b}, b)$. The whole bipartite matching process produces one-to-one label assignments, where each prediction $\hat{y}_i$ is assigned to a ground-truth annotation $y_j$ or $\emptyset$ (no object). The optimal assignment can be found efficiently by the Hungarian algorithm [18].

Challenge. As mentioned above, the bipartite matching method cannot be directly applied to an open-vocabulary setting that contains both base and novel classes. The reason is that computing the matching cost in Eq. (3) requires access to the label information, which is unavailable for novel classes. We can follow previous works [7,13,35] to generate class-agnostic object proposals that may cover the novel classes, but we do not know the ground-truth classification labels of these proposals. As a result, the predictions for the $N$ object queries cannot generalize to novel classes due to the lack of training labels for them. As shown in Fig. 3(a), bipartite matching can only be performed for base classes with available training labels.
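To make the closed-set matching step concrete, the following is a minimal sketch (not the authors' implementation) of how a DETR-style cost matrix could be assembled and solved with the Hungarian algorithm; the cost weights and the omission of the GIoU term are simplifying assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_cls=1.0, w_box=5.0):
    """Toy closed-set matcher. pred_logits: (N, C), pred_boxes: (N, 4),
    gt_labels: (M,) long, gt_boxes: (M, 4). Returns matched (pred_idx, gt_idx)."""
    prob = pred_logits.softmax(-1)                     # (N, C)
    cost_cls = -prob[:, gt_labels]                     # (N, M): -p(GT class)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M): L1 box distance
    cost = w_cls * cost_cls + w_box * cost_box         # pair-wise L_cost (GIoU term omitted)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```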

3.2 Conditional Matching for Open-Vocabulary Detection

To enable DETR to go beyond closed-set classification and perform open-vocabulary detection, we equip the Transformer decoder with conditional inputs and reformulate the learning objective as a binary matching problem.

Conditional Inputs. Given an object detection dataset with standard annotations for all the training (base) classes, we need to convert those annotations into conditional inputs to facilitate our new training paradigm. Specifically, for each ground-truth annotation with bounding box $b_i$ and class label name $y_i^{class}$, we use the CLIP model [27] to generate the corresponding image embedding $z_i^{image}$ and text embedding $z_i^{text}$:

$z_i^{image} = \mathrm{CLIP}_{image}(x, b_i), \quad z_i^{text} = \mathrm{CLIP}_{text}(y_i^{class}).$    (4)
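As an illustration of Eq. (4), the sketch below shows how such conditional embeddings could be produced with the open-source CLIP package; the prompt template and the crop-then-encode treatment of the box region are assumptions, not details taken from the paper.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # frozen CLIP encoders

@torch.no_grad()
def conditional_embeddings(image_path, box, class_name):
    # Text embedding z_text from the class label name.
    tokens = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    z_text = model.encode_text(tokens)                      # (1, 512)

    # Image embedding z_image from the ground-truth box region.
    crop = Image.open(image_path).convert("RGB").crop(box)  # box = (x1, y1, x2, y2)
    z_image = model.encode_image(preprocess(crop).unsqueeze(0).to(device))

    # CLIP aligns both modalities, so either embedding can condition the decoder.
    z_text = z_text / z_text.norm(dim=-1, keepdim=True)
    z_image = z_image / z_image.norm(dim=-1, keepdim=True)
    return z_text, z_image
```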


Fig. 3. Comparing the label assignment mechanisms of DETR and our OV-DETR. (a) In the original DETR, the set-to-set prediction is conducted via bipartite matching between predictions and closed-set annotations, in which a cost matrix is computed with respect to the queries and categories. Due to the absence of class label annotations for novel classes, computing such a class-specific cost matrix is impossible. (b) On the contrary, our OV-DETR casts open-vocabulary detection as a conditional matching process and formulates a binary matching problem that computes a class-agnostic matching cost matrix for conditional inputs.

Such image and text embeddings are already well aligned by the CLIP model. Therefore, we can choose either of them as input queries to condition DETR's decoder and train it to match the corresponding objects. Once training is done, we can then take arbitrary input queries during testing to perform open-vocabulary detection. To ensure equal training conditioned on image and text queries, we randomly select $z_i^{text}$ or $z_i^{image}$ with probability $\xi = 0.5$ as conditional inputs. Moreover, we follow previous works [7,13,35] to generate additional object proposals for novel classes to enrich our training data. We only extract image embeddings $z_i^{image}$ for such novel-class proposals as conditional inputs, since their class names are unavailable to extract text embeddings. Please refer to the supplementary materials for more details.

Conditional Matching. Our core training objective is to measure the matchability between the conditional input embeddings and the detection results. In order to perform such conditional matching, we start with a fully-connected layer $F_{proj}$ to project the conditional input embeddings ($z_i^{text}$ or $z_i^{image}$) to the same dimension as $q$. Then the input to the DETR decoder $q'$ is given by:

$q' = q \oplus F_{proj}(z_i^{mod}), \quad mod \in \{text, image\},$    (5)


Fig. 4. DETR decoder with (a) single conditional input or (b) multiple conditional inputs in parallel.

where we use a simple addition operation $\oplus$ to convert the class-agnostic object queries $q$ into class-specific queries $q'$ informed by $F_{proj}(z_i^{mod})$. In practice, adding the conditional input embeddings $z$ to only one object query would lead to a very limited coverage of the target objects, which may appear many times in the image. Indeed, in existing object detection datasets there are typically multiple object instances in each image from the same or different classes. To enrich the training signal for our conditional matching, we copy the object queries $q$ for $R$ times, and the conditional inputs ($z_i^{text}$ or $z_i^{image}$) for $N$ times, before performing the conditioning in Eq. (5). As a result, we obtain a total of $N \times R$ queries for matching during each forward pass, as shown in Fig. 4(b). Experiments in the supplementary material validate the importance of such "feature cloning" and also show how we determine $N$ and $R$ based on the performance-memory trade-off. Note that for the final conditioning process, we further add an attention mask to ensure the independence between different query copies, as is similarly done in [4]. Given the conditioned query features $q'$, our binary matching loss for label assignment is given as:

$\mathcal{L}_{cost}(y, \hat{y}_\sigma) = \mathcal{L}_{match}(p, \hat{p}_\sigma) + \mathcal{L}_{box}(b, \hat{b}_\sigma),$    (6)

where $\mathcal{L}_{match}(p, \hat{p}_\sigma)$ denotes a new matching loss that replaces the classification loss $\mathcal{L}_{cls}(p, \hat{p}_\sigma)$ in Eq. (3). Here, in our case, $p$ is a 1-dimensional sigmoid probability vector that characterizes the matchability (‘matched’ vs. ‘not matched’), and $\mathcal{L}_{match}$ is simply implemented by a Focal loss [21] $\mathcal{L}_{Focal}$ between the predicted $\hat{p}_\sigma$ and the ground-truth $p$. For instance, with the ‘bird’ query as input, our matching loss should allow us to match all the bird instances in one image, while tagging instances from other classes as ‘not matched’.
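The query cloning and conditioning described above can be sketched as follows; the module layout, dimensions and parameter names are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    """Sketch of Eq. (5) with feature cloning: N class-agnostic object queries are
    copied for each of the R conditional inputs, which are tiled N times and added."""
    def __init__(self, num_queries=300, d_model=256, d_clip=512):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, d_model)    # class-agnostic q
        self.proj = nn.Linear(d_clip, d_model)                   # F_proj

    def forward(self, z_cond):                  # z_cond: (R, d_clip) CLIP embeddings
        R = z_cond.shape[0]
        N = self.query_embed.num_embeddings
        q = self.query_embed.weight.unsqueeze(0).expand(R, N, -1)   # (R, N, d)
        z = self.proj(z_cond).unsqueeze(1).expand(R, N, -1)         # (R, N, d)
        # N x R conditioned queries; an attention mask (not shown) keeps copies independent.
        return (q + z).reshape(R * N, -1)
```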


3.3 Optimization

After optimizing Eq. (6), we obtain the optimized label assignments $\sigma$ for the different object queries. This process produces a set of detected objects with assigned box coordinates $\hat{b}$ and 2-dim matching probability $\hat{p}$ that we use to compute our final loss function for model training. We further attach an embedding reconstruction head to the model, which learns to predict an embedding $e$ that reconstructs each conditional input embedding $z^{text}$ or $z^{image}$:

$\mathcal{L}_{embed}(e, z) = \| e - z^{mod} \|_{1}, \quad mod \in \{text, image\}.$    (7)

The supplementary materials validate the effectiveness of $\mathcal{L}_{embed}$. Our final loss for model training combines $\mathcal{L}_{embed}$ with $\mathcal{L}_{match}(p, \hat{p})$ and the bounding box losses $\mathcal{L}_{box}(b, \hat{b})$ again:

$\mathcal{L}_{loss}(y, \hat{y}) = \mathcal{L}_{match}(p, \hat{p}) + \mathcal{L}_{box}(b, \hat{b}) + \mathcal{L}_{embed}(e, z) = \lambda_{Focal}\mathcal{L}_{Focal} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{GIoU}\mathcal{L}_{GIoU} + \lambda_{embed}\mathcal{L}_{embed},$    (8)

where $\mathcal{L}_{box}$ consists of the L1 loss and the generalized IoU (GIoU) [30] loss for boxes, while $\lambda_{Focal}$, $\lambda_{L1}$, $\lambda_{GIoU}$ and $\lambda_{embed}$ are the weighting parameters.
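A minimal sketch of how the terms in Eq. (8) could be combined for the matched query-object pairs is shown below; the loss weights and the box format are assumptions, and the GIoU term is derived from torchvision's pairwise GIoU for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def ov_detr_loss(match_logit, match_target, pred_box, gt_box, pred_embed, z_cond,
                 w_focal=2.0, w_l1=5.0, w_giou=2.0, w_embed=1.0):
    """match_logit/match_target: (Q,) matchability logits and 0/1 float targets;
    pred_box, gt_box: (P, 4) xyxy boxes of matched pairs; pred_embed, z_cond: (P, D)."""
    l_match = sigmoid_focal_loss(match_logit, match_target, reduction="mean")
    l_l1 = F.l1_loss(pred_box, gt_box)
    l_giou = (1.0 - torch.diag(generalized_box_iou(pred_box, gt_box))).mean()
    l_embed = F.l1_loss(pred_embed, z_cond)        # embedding reconstruction, Eq. (7)
    return w_focal * l_match + w_l1 * l_l1 + w_giou * l_giou + w_embed * l_embed
```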

3.4 Inference

During testing, for each image, we send the text embeddings $z^{text}$ of all the base and novel classes to the model and merge the results by selecting the top-$k$ predictions with the highest prediction scores. We follow the prior work [13] in using $k = 100$ for the COCO dataset and $k = 300$ for the LVIS dataset. To obtain the context representation $c$ in Eq. (1), we forward the input image through the CNN backbone $f_\phi$ and the Transformer encoder $h_\psi$. Note that $c$ is computed only once and shared across all conditional inputs for efficiency. The conditioned object queries from different classes are then sent to the Transformer decoder in parallel. In practice, we copy the object queries $R$ times, as shown in Fig. 4(b).
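The inference procedure could look roughly like the sketch below, where the encoder output is computed once and the conditioned queries for all class names are decoded in parallel; the module interfaces are placeholders, not the released API.

```python
import torch

@torch.no_grad()
def ov_detr_inference(encode_image, condition_queries, decode, class_text_embeds, image, k=100):
    """class_text_embeds: (C, d_clip) CLIP text embeddings of base + novel class names.
    Returns the top-k boxes, scores and class indices over all classes."""
    c = encode_image(image)                          # backbone + encoder, computed once
    q_prime = condition_queries(class_text_embeds)   # (C * N, d) conditioned queries
    boxes, match_logits = decode(c, q_prime)         # (C * N, 4), (C * N,)
    scores = match_logits.sigmoid()

    C = class_text_embeds.shape[0]
    N = q_prime.shape[0] // C
    cls_idx = torch.arange(C, device=scores.device).repeat_interleave(N)

    top = scores.topk(min(k, scores.numel())).indices
    return boxes[top], scores[top], cls_idx[top]
```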

4 Experiments

Datasets. We evaluate our approach on two standard open-vocabulary detection benchmarks modified from LVIS [14] and COCO [22], respectively. LVIS [14] contains 100K images with 1,203 classes. The classes are divided into three groups, namely frequent, common and rare, based on the number of training images. Following ViLD [13], we treat the 337 rare classes as novel classes and use only the frequent and common classes for training. The COCO [22] dataset is a widely-used benchmark for object detection, which consists of 80 classes. Following OVR-CNN [36], we divide the classes in COCO into 48 base categories and 17 novel categories, while removing 15 categories without a synset in the WordNet hierarchy. The training set is the same as the full COCO, but only images containing at least one base class are used. We refer to these two benchmarks as OV-LVIS and OV-COCO hereafter.


Table 1. Mask R-CNN and Def DETR on OV-LVIS, both trained on base classes. †: copied from ViLD [13].

# | Method      | AP^m | AP^m_novel | AP^m_c | AP^m_f
1 | Mask R-CNN† | 22.5 | 0.0        | 22.6   | 32.4
2 | Def DETR    | 22.4 | 0.0        | 22.4   | 32.0

Table 2. Ablation study on using object proposals (P) and our conditional binary matching mechanism (M).

# | P | M | AP^m | AP^m_novel | AP^m_c | AP^m_f
1 |   |   | 24.2 | 9.5        | 23.2   | 31.7
2 | ✓ |   | 19.9 | 6.3        | 17.4   | 28.6
3 | ✓ | ✓ | 26.6 | 17.4       | 25.0   | 32.5

Evaluation Metrics. For OV-LVIS, we report the mask mAP for rare, common and frequent classes, denoted by AP^m_r, AP^m_c and AP^m_f. The rare classes are treated as novel classes (AP^m_novel). The symbol AP^m denotes the mAP over all the classes. For OV-COCO, we follow previous work and only report the AP50^b metric, i.e. the box mAP at an IoU threshold of 0.5.

Extension for Instance Segmentation. For OV-LVIS, instance segmentation results are needed for the evaluation process. Although DETR [2] and its follow-ups [24,41] are developed for the object detection task, they can also be extended to instance segmentation. We follow DETR [2] and add an external class-agnostic segmentation head to solve the instance segmentation task. The segmentation head employs a fully convolutional network (FCN [23]) structure, which takes features extracted from the Transformer decoder as input and produces segmentation masks.

Implementation Details. Our model is based on Deformable DETR [41]. Following ViLD [13], we also use the open-source CLIP model [27] based on ViT-B/32 for extracting text and image embeddings. Please refer to our supplementary material for more training details.
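For illustration, a class-agnostic FCN-style mask head of the kind described above could be sketched as follows; the channel sizes and upsampling depth are assumptions.

```python
import torch.nn as nn

class ClassAgnosticMaskHead(nn.Module):
    """Sketch of an FCN-style, class-agnostic segmentation head that maps per-query
    decoder feature maps to a single-channel mask logit map."""
    def __init__(self, in_ch=256, hidden=128):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(hidden, hidden, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),                   # one mask logit map per query
        )

    def forward(self, query_feature_maps):             # (Q, in_ch, H, W)
        return self.fcn(query_feature_maps)            # (Q, 1, 2H, 2W)
```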

4.1 Ablation Studies

We conduct an ablation study on OV-LVIS to evaluate the main components of our approach.

The Architecture Difference. Previous works such as ViLD [13] are based on the RPN-based Mask R-CNN [15], while our work is based on the Transformer-based detector Deformable DETR [41]. We first study the difference between these two detectors in the open-vocabulary setting when trained with base classes only. As shown in Table 1 rows 1-2, we observe that Mask R-CNN performs slightly


better than Deformable DETR [41]. This gap is small, indicating that we have a fair starting point compared to ViLD [13].

Object Proposals. We then replace Deformable DETR's classifier layer with the text embeddings provided by CLIP and train with base classes only. This step is similar to the previous ViLD-text [13] method. Results are presented in Table 2 row 1. We observe that the AP^m_novel metric improves from 0.0 to 9.5. To further improve the AP^m_novel metric, we add object proposals that may contain regions of novel classes into the training stage. Because we do not know the category IDs of these object proposals, the label assignment for them is inaccurate, which decreases the AP^m_novel performance from 9.5 to 6.3.

Table 3. Main results on OV-LVIS and OV-COCO. For OV-LVIS (w/ 886 base classes and 317 novel classes), we report mask mAP and a breakdown on novel (rare), common, and frequent classes. For OV-COCO (w/ 48 base classes and 17 novel classes), we report bounding box mAP at IoU threshold 0.5. †: zero-shot methods that do not use captions or image-text pairs. ‡: ensemble model.

# | Method               | OV-LVIS: AP^m | AP^m_novel | AP^m_c | AP^m_f | OV-COCO: AP50^b | AP50^b_novel | AP50^b_base
1 | SB [1]†              | -    | -    | -    | -    | 24.9 | 0.3  | 29.2
2 | DELO [40]†           | -    | -    | -    | -    | 13.0 | 3.1  | 13.8
3 | PL [28]†             | -    | -    | -    | -    | 27.9 | 4.1  | 35.9
4 | OVR-CNN [36]         | -    | -    | -    | -    | 39.9 | 22.8 | 46.0
5 | ViLD-text [13]       | 24.9 | 10.1 | 23.9 | 32.5 | 49.3 | 5.9  | 61.8
6 | ViLD [13]            | 22.5 | 16.1 | 20.0 | 28.3 | 51.3 | 27.6 | 59.5
7 | ViLD-ens. [13]‡      | 25.5 | 16.6 | 24.6 | 30.3 | -    | -    | -
8 | OV-DETR (ours vs. #6)| 26.6 (+4.1) | 17.4 (+1.3) | 25.0 (+5.0) | 32.5 (+4.2) | 52.7 (+1.4) | 29.4 (+1.8) | 61.0 (+1.5)

Conditional Binary Matching. Now we replace DETR's default closed-set label assignment with our proposed conditional binary matching. The comparison between rows 2 and 3 of Table 2 shows that our binary matching strategy can better leverage the knowledge from object proposals and improves AP^m_novel from 9.5 to 17.4. Such a large improvement shows that the proposed conditional matching is essential when applying a DETR-series detector to the open-vocabulary setting.

4.2 Results on Open-Vocabulary Benchmarks

Table 3 summarizes our results. We compare our method with SOTA open-vocabulary detection methods including: (1) OVR-CNN [36] (see Table 3 row


4). It pre-trains the detector's projection layer on image-caption pairs using a contrastive loss and then fine-tunes on the object detection task; (2) Variants of ViLD [13] such as ViLD-text and ViLD-ensemble (see Table 3 rows 5-7). ViLD is the first study that uses CLIP embeddings [27] for open-vocabulary detection. Compared with ViLD-text, ViLD uses knowledge distillation from the CLIP visual backbone, which improves AP_novel at the cost of hurting AP_base. ViLD-ens. combines the two models and shows improvements for both metrics, but such an ensemble-based method also brings extra time and memory costs. For completeness, we also list the results of some previous zero-shot methods such as SB [1], DELO [40] and PL [28] in Table 3 rows 1-3.

On the OV-LVIS benchmark, OV-DETR improves over the previous SOTA ViLD by 4.1 on AP^m and 1.3 on AP^m_novel. Compared with ViLD, our method does not hurt the performance on base classes while improving the novel classes. Even compared with the ensemble result of ViLD-ensemble, OV-DETR still boosts the performance by 1.5, 0.8, 1.0 and 2.2, respectively (%). Note that OV-DETR only uses a single model and does not leverage any ensemble-based technique. On the OV-COCO benchmark, OV-DETR improves over the baseline and outperforms OVR-CNN [36] by a large margin, notably a 6.6 mAP improvement on novel classes. Compared with ViLD [13], OV-DETR still achieves 1.4 mAP gains on all classes and 1.8 mAP gains on novel classes. In summary, OV-DETR achieves superior performance across different datasets compared with different methods.

Table 4. Generalization to Other Datasets. We evaluate OV-DETR trained on LVIS when transferred to other datasets such as the PASCAL VOC 2007 test set and the COCO validation set by simply replacing the text embeddings. The experimental setting is the same as that of ViLD [13]. We observe that OV-DETR achieves better generalization performance than ViLD [13].

# | Method               | Pascal VOC: AP^b_50 | AP^b_75 | COCO: AP^b | AP^b_50 | AP^b_75
1 | ViLD-text [13]       | 40.5 | 31.6 | 28.8 | 43.4 | 31.4
2 | ViLD [13]            | 72.2 | 56.7 | 36.6 | 55.6 | 39.8
3 | OV-DETR (ours vs #2) | 76.1 (+3.9) | 59.3 (+2.6) | 38.1 (+1.5) | 58.4 (+2.8) | 41.1 (+1.3)

4.3 Generalization Ability of OV-DETR

We follow ViLD [13] to test the generalization ability of OV-DETR by training the model on the LVIS [14] dataset and evaluating it on PASCAL VOC [8] and COCO [22]. We keep the same implementation details as ViLD [13]. We switch the text embeddings of the category names from the source dataset to the new datasets, and the text embeddings of the new classes are used as conditional inputs during the inference phase. As shown in Table 4, we observe that OV-DETR


achieves better transfer performance than ViLD. The experimental results show that the model trained with our conditional matching mechanism transfers well to other domains.

4.4 Qualitative Results

We visualize OV-DETR's detection and segmentation results in Fig. 5. The results based on conditional text queries, conditional image queries, and a mixture of conditional text and image queries are shown in the top, middle and bottom rows, respectively. Overall, our OV-DETR can accurately localize and precisely segment the target objects from novel classes despite having no annotations of these classes during training. It is worth noting that the conditional image queries, such as "crape" in (d) and "fork" in (h), appear drastically different from those in the target images, but OV-DETR can still robustly detect them.

Fig. 5. Qualitative results on LVIS. OV-DETR can precisely detect and segment novel objects (e.g., ‘crape’, ‘fishbowl’, ‘softball’) given the conditional text query (top) or conditional image query (middle) or a mixture of them (bottom).

4.5 Inference Time Analysis

OV-DETR exhibits great potential in open-vocabulary detection but is by no means a perfect detector. Its biggest limitation is that inference is slow when the number of classes to detect is huge, e.g. 1,203 on LVIS [13]. This problem is caused by the conditional design, which requires multiple forward passes through the Transformer decoder (depending on the number of classes). We show a detailed comparison of the inference time between Deformable DETR and OV-DETR in Table 5. Without using any tricks, the vanilla OV-DETR (#2), i.e., using a single forward pass for each class, is about 2× slower than Deformable DETR (#1) on COCO (w/ 80 classes) and 16× slower on LVIS (w/ 1,203 classes). As discussed in Sect. 3.2 and shown in Fig. 4(b), we optimize the speed by forwarding multiple conditional queries to the Transformer decoder in parallel, which reduces the inference time by 12.5% on COCO and nearly 60% on LVIS (see #3 in Table 5). Still, there is much room for improvement.

Table 5. Comparison of the inference time (seconds per iteration) between Deformable DETR [41] and our OV-DETR before/after optimization on LVIS and COCO.

# | Method           | COCO | LVIS
1 | Def DETR         | 0.31 | 1.49
2 | Ours             | 0.72 | 23.84
3 | Ours (optimized) | 0.63 (↓12.5% vs #2) | 9.57 (↓59.9% vs #2)

It is worth noting that such a slow inference problem is not unique to our approach; most instance-conditional models have the same issue [19], which is the common price to pay in exchange for better performance. The computational bottleneck of our method lies in the Transformer decoder in Eq. (2). A potential solution is to design more efficient attention modules [33,34], which we leave as future work. In human-computer interaction scenarios where users already have target object(s) in mind, e.g., a piece of missing luggage or a specific type of logo, the conditional input is fixed and small in number, so the inference time is negligible.

5 Conclusion

Open-vocabulary detection is known to be a challenging problem due to the lack of training data for unseen classes. Recent advances in large language models have offered a new perspective for designing open-vocabulary detectors. In this work, we show how an end-to-end Transformer-based detector can be turned into


an open-vocabulary detector based on conditional matching, with the help of pre-trained vision-language models. The results show that, despite having a simplified training pipeline, our Transformer-based open-vocabulary detector significantly outperforms the current state of the art, which is entirely based on two-stage detectors. We hope our approach and the findings presented in this paper can inspire more future work on the design of efficient open-vocabulary detectors.

Acknowledgments. This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). It is also partly supported by the NTU NAP grant and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001). This work was supported by SenseTime SenseCore AI Infrastructure-AIDC.

References

1. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 397–414. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_24
2. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
3. Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: ICCV, pp. 824–832 (2017)
4. Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: unsupervised pre-training for object detection with transformers. In: CVPR, pp. 1601–1610 (2021)
5. Deng, C., Wu, Q., Wu, Q., Hu, F., Lyu, F., Tan, M.: Visual grounding via accumulated attention. In: CVPR, pp. 7746–7755 (2018)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G.: Learning to prompt for open-vocabulary object detection with vision-language model. In: CVPR, pp. 14084–14093 (2022)
8. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
9. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR, pp. 1–8. IEEE (2008)
10. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: NeurIPS (2013)
11. Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of DETR with spatially modulated co-attention. In: ICCV, pp. 3621–3630 (2021)
12. Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS, vol. 27 (2014)
13. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: ICLR (2022)
14. Gupta, A., Dollár, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: CVPR, pp. 5356–5364 (2019)


15. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961–2969 (2017)
16. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. In: ICCV, pp. 1780–1790 (2021)
17. Kim, D., Lin, T.Y., Angelova, A., Kweon, I.S., Kuo, W.: Learning open-world object proposals without learning to classify. Rob. Autom. Lett. 7(2) (2022)
18. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955)
19. Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., Wang, X.: Person search with natural language description. In: CVPR, pp. 1970–1979 (2017)
20. Li, Z., Yao, L., Zhang, X., Wang, X., Kanhere, S., Zhang, H.: Zero-shot object detection with textual descriptions. In: AAAI, pp. 8690–8697 (2019)
21. Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV, pp. 2999–3007 (2017)
22. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
23. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
24. Meng, D., et al.: Conditional DETR for fast training convergence. In: ICCV, pp. 3651–3660 (2021)
25. Papageorgiou, C., Poggio, T.: A trainable system for object detection. Int. J. Comput. Vis. 38(1), 15–33 (2000)
26. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
27. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
28. Rahman, T., Chou, S.H., Sigal, L., Carenini, G.: An improved attention for visual question answering. In: CVPR, pp. 1653–1662 (2021)
29. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS, vol. 28, pp. 91–99 (2015)
30. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: CVPR, pp. 658–666 (2019)
31. Shizhen, Z., et al.: GTNet: generative transfer network for zero-shot object detection. In: AAAI (2020)
32. Szegedy, C., Toshev, A., Erhan, D.: Deep neural networks for object detection. In: NeurIPS, vol. 26 (2013)
33. Tay, Y., Bahri, D., Yang, L., Metzler, D., Juan, D.C.: Sparse Sinkhorn attention. In: ICML, pp. 9438–9447. PMLR (2020)
34. Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020)
35. Xie, J., Zheng, S.: ZSD-YOLO: zero-shot YOLO detection using vision-language knowledge distillation. arXiv preprint arXiv:2109.12066 (2021)
36. Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: CVPR, pp. 14393–14402 (2021)
37. Zhang, Y., Zhou, K., Liu, Z.: Neural prompt search. arXiv (2022)
38. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR (2022)


39. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022)
40. Zhu, P., Wang, H., Saligrama, V.: Don't even look once: synthesizing features for zero-shot detection. In: CVPR, pp. 11693–11702 (2020)
41. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2020)

Prediction-Guided Distillation for Dense Object Detection

Chenhongyi Yang1(B), Mateusz Ochal2,3, Amos Storkey2, and Elliot J. Crowley1

1 School of Engineering, University of Edinburgh, Edinburgh, UK
[email protected]
2 School of Informatics, University of Edinburgh, Edinburgh, UK
3 School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, UK

Abstract. Real-world object detection models should be cheap and accurate. Knowledge distillation (KD) can boost the accuracy of a small, cheap detection model by leveraging useful information from a larger teacher model. However, a key challenge is identifying the most informative features produced by the teacher for distillation. In this work, we show that only a very small fraction of features within a ground-truth bounding box are responsible for a teacher's high detection performance. Based on this, we propose Prediction-Guided Distillation (PGD), which focuses distillation on these key predictive regions of the teacher and yields considerable gains in performance over many existing KD baselines. In addition, we propose an adaptive weighting scheme over the key regions to smooth out their influence and achieve even better performance. Our proposed approach outperforms current state-of-the-art KD baselines on a variety of advanced one-stage detection architectures. Specifically, on the COCO dataset, our method achieves between +3.1% and +4.6% AP improvement using ResNet-101 and ResNet-50 as the teacher and student backbones, respectively. On the CrowdHuman dataset, we achieve +3.2% and +2.0% improvements in MR and AP, also using these backbones. Our code is available at https://github.com/ChenhongyiYang/PGD.

Keywords: Dense object detection · Knowledge distillation

1 Introduction

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_8.

Advances in deep learning have led to considerable performance gains on object detection tasks [2,6,11,15,18,25–27,31]. However, detectors can be computationally expensive, making it challenging to deploy them on devices with limited resources. Knowledge distillation (KD) [1,13] has emerged as a promising approach for compressing models. It allows for the direct training of a smaller student

(a) Box   (b) Box Gaussian   (c) FGFI   (d) Ours

Fig. 1. A comparison between different foreground distillation regions. The ground-truth bounding box is marked in blue. The colour heatmaps indicate the distillation weight for different areas. In contrast to other methods (a)–(c) [9,30,34], our approach (d) focuses on a few key predictive regions of the teacher.

model [17,24,28,33] using information from a larger, more powerful teacher model; this helps the student to generalise better than if trained alone. KD was first popularised for image classification [13], where a student model is trained to mimic the soft labels generated by a teacher model. However, this approach does not work well for object detection [34], which consists of jointly classifying and localising objects. While soft label-based KD can be directly applied for classification, finding an equivalent for localisation remains a challenge. Recent work [8,9,30,34,35,37,41] alleviates this problem by forcing the student model to generate feature maps similar to their teacher counterparts; a process known as feature imitation. However, which features should the student imitate? This question is of the utmost importance for dense object detectors [6,15,18,31,38,42] because, unlike two-stage detectors [2,11,27], they do not use the RoIAlign [11] operation to explicitly pool and align object features; instead they output predictions at every location of the feature map [16]. Recent work [30,35] has shown that distilling the whole feature map with equal weighting is sub-optimal because not all features carry equally meaningful information. Therefore, a weighting mechanism that assigns appropriate importance to different regions, particularly to foreground regions near the objects, is highly desirable for dense object detectors, and has featured in recent work. For example, in DeFeat [9], foreground features that lie within ground-truth (GT) boxes (Fig. 1a) are distilled with equal weighting. In [30], the authors postulate that useful features are located at the centre of GT boxes and weigh the foreground features using a Gaussian (Fig. 1b). In Fine-grained Feature Imitation (FGFI) [34], the authors distil features covered by anchor boxes whose Intersection over Union (IoU) with the GTs is above a certain threshold (Fig. 1c). In this paper, we treat feature imitation for foreground regions differently. Instead of assigning distillation weights using hand-designed policies, we argue that feature imitation should be conducted on a few key predictive regions: the locations where the teacher model generates the most accurate predictions. Our intuition is that these regions should be distilled because they hold the information that leads to the best predictions; other areas will be less informative and can contaminate the distillation process by distracting from more essential features. To achieve our goal, we adapt the quality measure from [6] to score


teacher predictions. Then, we conduct an experiment to visualise how these scores are distributed and verify that high-scoring key predictive regions contribute the most to teacher performance. Those findings drive us to propose a Prediction-Guided Weighting (PGW) module to weight the foreground distillation loss: inspired by recent progress in label assignment [6,20,32,38,42] for dense detectors, we sample the top-K positions with the highest quality score from the teacher model and use an adaptive Gaussian distribution to fit the key predictive regions for smoothly weighting the distillation loss. Figure 1d shows a visual representation of the regions selected for distillation. We call our method Prediction-Guided Distillation (PGD). Our contributions are as follows:

1. We conduct experiments to study how the quality scores of teacher predictions are distributed in the image plane and observe that the locations that make up the top-1% of scores are responsible for most of the teacher's performance in modern state-of-the-art dense detectors.
2. Based on our observations, we propose using the key predictive regions of the teacher as foreground features. We show that focusing distillation mainly on these few areas yields significant performance gains for the student model.
3. We introduce a parameterless weighting scheme for foreground distillation pixels and show that when applied to our key predictive regions, we achieve even stronger distillation performance.
4. We benchmark our approach on the COCO and CrowdHuman datasets and show its superiority over the state-of-the-art across multiple detectors.

2 Related Work

Dense Object Detection. In the last few years, object detection has seen considerable gains in performance [2,3,6,11,15,18,25–27,31]. The demand for simple, fast models has brought one-stage detectors into the spotlight [6,31]. In contrast to two-stage detectors, one-stage detectors directly regress and classify candidate bounding boxes from a pre-defined set of anchor boxes (or anchor points), alleviating the need for a separate region proposal mechanism. Anchor-based detectors [6,18] achieve good performance by regressing from anchor boxes with pre-defined sizes and ratios. In contrast, anchor-free methods [15,31,42] regress directly from anchor points (or locations), eliminating the need for the additional hyper-parameters used in anchor-based models. A vital challenge for detectors is determining which bounding box predictions to label as positive and negative, a problem frequently referred to as label assignment [42]. Anchors are commonly labelled as positives when their IoU with the GT is over a certain threshold (e.g. IoU ≥ 0.5) [18,31]; however, more elaborate mechanisms for label assignment have been proposed [6,31,38,42]. For example, FCOS [31] applies a weighting scheme to suppress low-quality positive predictions using a “centerness” score. Other works dynamically adjust the number of positive instances according to statistical characteristics [38] or by using a differentiable confidence module [42]. In DDOD [6], the authors separate label assignment for the classification and regression branches and balance the influence of positive samples between different scales of the feature pyramid network (FPN).


Knowledge Distillation for Object Detection. Early KD approaches for classification focus on transferring knowledge to student models by forcing their predictions to match those of the teacher [13]. More recent work [34,35,41] claims that feature imitation, i.e. forcing the intermediate feature maps of student models to match their teacher counterparts, is more effective for detection. A vital challenge when performing feature imitation for dense object detectors is determining which feature regions to distil from the teacher model. Naively distilling all feature maps equally results in poor performance [9,30,35]. To solve this problem, FGFI [34] distils features that are covered by anchor boxes which have a high IoU with the GT. However, distilling in this manner is still sub-optimal [8,30,35,40,41]. TADF [30] suppresses foreground pixels according to a static 2D Gaussian fitted over the GT. LD [40] gives higher priority to central locations of the GT using DIoU [39]. GID [8] proposes to use the top-scoring predictions according to the L1 distance between the classification scores of the teacher and the student, but does not account for location quality. In LAD [21], the authors use label assignment distillation, where the detector's encoded labels are used to train a student. Others weight foreground pixels according to intricate adaptive weighting or attention mechanisms [8,14,35,36,41]. However, these weighting schemes still heavily rely on the GT dimensions, and they are agnostic to the capabilities of the teacher. In contrast, we focus distillation on only a few key predictive regions using a combination of classification and regression scores as a measure of quality. We then smoothly aggregate and weigh the selected locations using an estimated 2D Gaussian, which further focuses distillation and improves performance. This allows us to dynamically adjust to different sizes and orientations of objects independently of the GT dimensions while accounting for the teacher's predictive abilities.

3 Method

We begin by describing how to measure the predictive quality of a bounding box prediction and find the key predictive regions of a teacher network (Sect. 3.1). Then, we introduce our Prediction-Guided Weighting (PGW) module that returns a foreground distillation mask based on these regions (Sect. 3.2). Finally, we describe our full Prediction-Guided Distillation pipeline (Sect. 3.3).

3.1 Key Predictive Regions

Our goal is to amplify the distillation signal for the most meaningful features produced by a teacher network. For this purpose, we look at the quality of a teacher's bounding box predictions taking both classification and localisation into consideration, as defined in [6]. Formally, the quality score of a box $\hat{b}_{(i,j)}$ predicted from a position $X_i = (x_i, y_i)$ w.r.t. a ground truth $b$ is:

$q(\hat{b}_{(i,j)}, b) = \underbrace{\mathbb{1}[X_i \in b]}_{\text{indicator}} \cdot \underbrace{\hat{p}_{(i,j)}(b)^{\,1-\xi}}_{\text{classification}} \cdot \underbrace{\mathrm{IoU}(b, \hat{b}_{(i,j)})^{\,\xi}}_{\text{localisation}}$    (1)

(a) ATSS   (b) FCOS   (c) AutoAssign   (d) GFL   (e) DDOD

Fig. 2. A visualisation of quality scores for various dense object detectors with ξ = 0.8 following [6]. We acquire the quality heatmap by taking the maximum value at each position across FPN layers.

where $\mathbb{1}[X_i \in b]$ is an indicator function that is 1 if $X_i$ lies inside box $b$ and 0 otherwise; $\hat{p}_{(i,j)}(b)$ is the classification probability w.r.t. the GT box's category; $\mathrm{IoU}(b, \hat{b}_{(i,j)})$ is the IoU between the predicted and ground-truth box; and $\xi$ is a hyper-parameter that balances classification and localisation. We calculate the quality score of location $X_i$ as the maximum value of all prediction scores at that particular location, i.e. $\hat{q}_i = \max_{j \in J_i} q(\hat{b}_{(i,j)}, b)$, where $J_i$ is the set of predictions at location $X_i$. While this quality score has been applied for standard object detection [6], we are the first to use it to identify useful regions for distillation. In Fig. 2 we visualise the heatmaps of prediction quality scores for five state-of-the-art detectors, including anchor-based (ATSS [38] and DDOD [6]) and anchor-free (FCOS [31], GFL [15] and AutoAssign [42]) detectors. Across all detectors, we observe some common characteristics: (1) for the vast majority of objects, high scores are concentrated around a single region; (2) the size of this region does not necessarily correlate strongly with the size of the actual GT box; (3) whether or not the centring prior [31,42] is applied for label assignment during training, this region tends to be close to the centre of the GT box. These observations drive us to develop a Prediction-Guided Weighting (PGW) module to focus the distillation on these important regions.
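For clarity, the quality score of Eq. (1) for a single ground-truth box could be computed as in the sketch below; the box format and tensor shapes are assumptions.

```python
import torch
from torchvision.ops import box_iou

def quality_scores(points, cls_prob, pred_boxes, gt_box, gt_label, xi=0.8):
    """Eq. (1) for one GT box. points: (N, 2) anchor coordinates, cls_prob: (N, C)
    sigmoid classification scores, pred_boxes: (N, 4) xyxy, gt_box: (4,) xyxy."""
    x1, y1, x2, y2 = gt_box
    inside = (points[:, 0] >= x1) & (points[:, 0] <= x2) \
           & (points[:, 1] >= y1) & (points[:, 1] <= y2)       # indicator term
    p = cls_prob[:, gt_label]                                   # classification term
    iou = box_iou(pred_boxes, gt_box.unsqueeze(0)).squeeze(1)   # localisation term
    return inside.float() * p.pow(1 - xi) * iou.pow(xi)         # per-location quality
```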

3.2 Prediction-Guided Weighting Module

The purpose of KD is to allow a student to mimic a teacher's strong generalisation ability. To better achieve this goal, we propose to focus foreground distillation on locations where a teacher model can yield predictions with the highest quality scores, because those locations contain the most valuable information for detection and are critical to a teacher's high performance. In Fig. 3 we present the results of a pilot experiment to identify how vital these high-scoring locations are for a detector. Specifically, we measure the performance of different pre-trained detectors after masking out their top-X% predictions before non-maximum suppression (NMS) during inference. We observe that in all cases the mean Average Precision (mAP) drops dramatically as the mask-out ratio increases. Masking out the top-1% of predictions incurs around a 50% drop in

[Plot: COCO mAP vs. mask-out ratio (%) for FCOS, ATSS, AutoAssign, GFL and DDOD]
Fig. 3. COCO mAP performance of pre-trained detectors after ignoring predictions in the top-X% of quality scores during inference. We observe that the top-1% predictions within the GT box region are responsible for most performance gains.

AP. This suggests that the key predictive regions (responsible for the majority of a dense detector's performance) lie within the top-1% of all anchor positions bounded by the GT box. Given their significance, how do we incorporate these regions into distillation? We could simply use all feature locations weighted by their quality score; however, as we show in Sect. 4.3, this does not yield the best performance. Inspired by recent advances in label assignment for dense object detectors [6,32], we instead propose to focus foreground distillation on the top-K positions (feature pixels) with the highest quality scores across all FPN levels. We then smooth the influence of each position according to a 2D Gaussian distribution fitted by Maximum-Likelihood Estimation (MLE) for each GT box. Finally, foreground distillation is conducted only on those K positions with their weights assigned by the Gaussian. Formally, for an object $o$ with GT box $b$, we first compute the quality score for each feature pixel inside $b$, then we select the $K$ pixels with the highest quality score $T^o = \{(X_k^o, l_k^o)\,|\,k = 1, ..., K\}$ across all FPN levels, in which $X_k^o$ and $l_k^o$ are the absolute coordinate and the FPN level of the $k$-th pixel. Based on our observation in Sect. 3.1, we assume the selected pixels $T_k^o$ are drawn as $T_k^o \sim \mathcal{N}(\mu, \Sigma\,|\,o)$ defined on the image plane and use MLE to estimate $\mu$ and $\Sigma$:

$\hat{\mu} = \frac{1}{K}\sum_{k=1}^{K} X_k^o, \quad \hat{\Sigma} = \frac{1}{K}\sum_{k=1}^{K}(X_k^o - \hat{\mu})(X_k^o - \hat{\mu})^T$    (2)

Then, for every feature pixel $P_{(i,j),l}$ on FPN layer $l$ with absolute coordinate $X_{i,j}$, we compute its distillation importance w.r.t. object $o$ by:

$I_{(i,j),l}^{o} = \begin{cases} 0 & P_{(i,j),l} \notin T^o \\ \exp\left(-\frac{1}{2}(X_{i,j} - \hat{\mu})\,\hat{\Sigma}^{-1}(X_{i,j} - \hat{\mu})^T\right) & P_{(i,j),l} \in T^o \end{cases}$    (3)

If a feature pixel has non-zero importance for multiple objects, we use its maximum: $I_{(i,j),l} = \max_o \{I_{(i,j),l}^{o}\}$. Finally, for each FPN level $l$ with size $H_l \times W_l$, we assign the distillation weight $M_{(i,j),l}$ by normalising the distillation importance by the number of non-zero importance pixels at that level:


Fig. 4. Our Prediction-Guided Distillation (PGD) pipeline. The Prediction-Guided Weighting (PGW) module finds the teacher's key predictive regions and generates a foreground distillation weighting mask by fitting a Gaussian over these regions. Our pipeline also adopts the attention masks from FGD [35] and distils them together with the features. We distil the classification and regression heads separately to accommodate these two distinct tasks [6].

$M_{(i,j),l} = \dfrac{I_{(i,j),l}}{\sum_{i=1}^{H_l}\sum_{j=1}^{W_l} \mathbb{1}_{(i,j),l}}$    (4)

where $\mathbb{1}_{(i,j),l}$ is an indicator function that outputs 1 if $I_{(i,j),l}$ is not zero. The process above constitutes our Prediction-Guided Weighting (PGW) module, whose output is a foreground distillation weight $M$ across all feature levels and pixels.
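A single-level sketch of the PGW computation (top-K selection, the Gaussian MLE of Eq. (2), the importance of Eq. (3) and the normalisation of Eq. (4)) is given below; it is an illustrative simplification, not the released implementation.

```python
import torch

def pgw_mask(quality, coords, K=30):
    """quality: (H*W,) quality scores w.r.t. one GT box at a single FPN level,
    coords: (H*W, 2) absolute pixel coordinates. Returns distillation weights (H*W,)."""
    k = min(K, quality.numel())
    topk_idx = quality.topk(k).indices
    X = coords[topk_idx].float()                    # key predictive pixels

    mu = X.mean(dim=0)                              # Eq. (2): MLE mean
    d = X - mu
    sigma = d.t() @ d / k + 1e-6 * torch.eye(2, device=X.device)   # MLE covariance (+ jitter)

    importance = torch.zeros_like(quality)
    maha = (d @ torch.linalg.inv(sigma) * d).sum(dim=1)
    importance[topk_idx] = torch.exp(-0.5 * maha)   # Eq. (3): non-zero only on top-K pixels

    denom = (importance > 0).sum().clamp(min=1)     # Eq. (4): per-level normalisation
    return importance / denom
```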

3.3 Prediction-Guided Distillation

In this section, we introduce our KD pipeline, which is applicable to any dense object detector. We build our work on top of the state-of-the-art Focal and Global Distillation (FGD) [35] and incorporate their spatial and channel-wise attention mechanisms. In contrast to other distillation methods, we use the output mask from our PGW module to focus the distillation loss on the most important foreground regions. Moreover, we decouple the distillation for the classification and regression heads to better suit the two different tasks [6,22]. An illustration of the pipeline is shown in Fig. 4.

Distillation of Features. We perform feature imitation at each FPN level, encouraging imitation on the first feature maps of the regression and classification heads. Taking inspiration from [6], we separate the distillation process for the classification and regression heads, distilling the features of each head independently. Formally, at each feature level of the FPN, we generate two foreground distillation masks $M^{cls}, M^{reg} \in \mathbb{R}^{H \times W}$ with different $\xi^{cls}$ and $\xi^{reg}$ using PGW. Then, student features $F^{S,cls}, F^{S,reg} \in \mathbb{R}^{C \times H \times W}$ are encouraged to mimic teacher features $F^{T,cls}, F^{T,reg} \in \mathbb{R}^{C \times H \times W}$ as follows:


$\mathcal{L}_{fea}^{cls} = \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} \left(\alpha M_{i,j}^{cls} + \beta N_{i,j}^{cls}\right) P_{i,j}^{T,cls}\, A_{k,i,j}^{T,cls}\, \left(F_{k,i,j}^{T,cls} - F_{k,i,j}^{S,cls}\right)^2$    (5)

$\mathcal{L}_{fea}^{reg} = \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} \gamma M_{i,j}^{reg}\, A_{k,i,j}^{T,reg}\, \left(F_{k,i,j}^{T,reg} - F_{k,i,j}^{S,reg}\right)^2$    (6)

where $\alpha, \beta, \gamma$ are hyper-parameters that balance the loss weights; $N^{cls}$ is the normalised mask over background distillation regions: $N_{i,j}^{cls} = \mathbb{1}^{-}_{i,j} / \sum_{h=1,w=1}^{H,W} \mathbb{1}^{-}_{h,w}$, where $\mathbb{1}^{-}_{a,b}$ is the background indicator that becomes 1 if pixel $(a, b)$ does not lie within any GT box. $P$ and $A$ are the spatial and channel attention maps from [35] as defined below. Note that we do not use the Global Distillation Module in FGD, nor the adaptation layer that is commonly used in many KD methods [4,9,34,35,37,41], as we find they have negligible impact on the overall performance.

Distillation of Attention. We build on the work in FGD [35] and additionally encourage the student to imitate the attention maps of the teacher. We use spatial attention as defined in [35], but we modify their channel attention by computing it independently for each feature location instead of over all spatial locations. Specifically, we define spatial attention $P \in \mathbb{R}^{1 \times H \times W}$ and channel attention $A \in \mathbb{R}^{C \times H \times W}$ over a single feature map $F \in \mathbb{R}^{C \times H \times W}$ as follows:

$P_{i,j} = \dfrac{HW \cdot \exp\left(\sum_{k=1}^{C} |F_{k,i,j}|/\tau\right)}{\sum_{i=1}^{H}\sum_{j=1}^{W} \exp\left(\sum_{k=1}^{C} |F_{k,i,j}|/\tau\right)}, \quad A_{k,i,j} = \dfrac{C \cdot \exp\left(|F_{k,i,j}|/\tau\right)}{\sum_{k=1}^{C} \exp\left(|F_{k,i,j}|/\tau\right)}$    (7)

Similar to feature distillation, we decouple the attention masks for classification and regression for the teacher and student: $A^{T,cls}, A^{T,reg}, P^{S,cls}$. The two attention losses are defined as follows:

$\mathcal{L}_{att}^{cls} = \dfrac{\delta}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} \left|P_{i,j}^{T,cls} - P_{i,j}^{S,cls}\right| + \dfrac{\delta}{CHW}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{k=1}^{C} \left|A_{k,i,j}^{T,cls} - A_{k,i,j}^{S,cls}\right|$    (8)

$\mathcal{L}_{att}^{reg} = \dfrac{\delta}{C\sum_{i=1}^{H}\sum_{j=1}^{W} \mathbb{1}_{i,j}}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{k=1}^{C} \mathbb{1}_{i,j}\left|A_{k,i,j}^{T,reg} - A_{k,i,j}^{S,reg}\right|$    (9)

where $\delta$ is a balancing loss weight hyper-parameter, and $\mathbb{1}_{i,j}$ is an indicator that becomes 1 when $M_{i,j}^{reg} \neq 0$.

Full Distillation. The full distillation loss is

$\mathcal{L}_{distill} = \mathcal{L}_{fea}^{cls} + \mathcal{L}_{fea}^{reg} + \mathcal{L}_{att}^{cls} + \mathcal{L}_{att}^{reg}$    (10)
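To make Eqs. (5), (7) and (8) concrete, the sketch below computes the attention maps and the classification-branch distillation terms for a single FPN level; the weight values are illustrative and follow the form of the equations, not necessarily the released code.

```python
import torch

def attention_maps(feat, tau=0.8):
    """Eq. (7): spatial attention P (1, H, W) and channel attention A (C, H, W)
    from a single feature map feat of shape (C, H, W)."""
    C, H, W = feat.shape
    s = feat.abs().sum(dim=0) / tau                               # (H, W)
    P = (H * W) * torch.softmax(s.flatten(), dim=0).view(1, H, W)
    A = C * torch.softmax(feat.abs() / tau, dim=0)                # softmax over channels
    return P, A

def cls_distill_loss(F_t, F_s, M_cls, N_cls, alpha=0.8, beta=0.4, delta=0.0008):
    """Eq. (5) feature loss plus the classification part of Eq. (8).
    F_t, F_s: (C, H, W) teacher/student features; M_cls, N_cls: (H, W) masks."""
    P_t, A_t = attention_maps(F_t)
    P_s, A_s = attention_maps(F_s)
    w = (alpha * M_cls + beta * N_cls).unsqueeze(0) * P_t * A_t   # broadcast to (C, H, W)
    l_fea = (w * (F_t - F_s) ** 2).sum()                          # Eq. (5)
    l_att = delta * ((P_t - P_s).abs().mean() + (A_t - A_s).abs().mean())   # Eq. (8)
    return l_fea + l_att
```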

4 Experiments

4.1 Setup and Implementation Details

We evaluate PGD on two benchmarks: COCO [19] for general object detection and CrowdHuman [29] for crowd scene detection; this contains a large number of


Table 1. A comparison between our PGD and other state-of-the-art distillation methods on the COCO mini-val set. All models are trained locally. We set hyper-parameters for competing methods following their papers or open-sourced code bases.

Detector | Setting | AP | AP50 | AP75 | APS | APM | APL
FCOS [31] | Teacher | 43.1 | 62.4 | 46.6 | 25.5 | 47.1 | 54.7
FCOS [31] | Student | 38.2 | 57.9 | 40.5 | 23.1 | 41.3 | 49.4
FCOS [31] | DeFeat [9] | 40.7 (+2.5) | 60.5 (+2.6) | 43.5 (+3.0) | 24.7 (+1.6) | 44.4 (+3.1) | 52.4 (+3.0)
FCOS [31] | FRS [41] | 40.9 (+2.7) | 60.6 (+2.7) | 44.0 (+3.5) | 25.0 (+1.9) | 44.4 (+3.1) | 52.6 (+3.2)
FCOS [31] | FKD [37] | 41.3 (+3.1) | 60.9 (+3.0) | 44.1 (+3.6) | 23.9 (+0.8) | 44.9 (+3.6) | 53.8 (+4.4)
FCOS [31] | FGD [35] | 41.4 (+3.2) | 61.1 (+3.2) | 44.2 (+3.7) | 25.3 (+2.2) | 45.1 (+3.8) | 53.8 (+4.4)
FCOS [31] | Ours | 42.5 (+4.3) | 62.0 (+4.1) | 45.4 (+4.9) | 24.8 (+1.7) | 46.1 (+5.8) | 55.5 (+6.1)
AutoAssign [42] | Teacher | 44.8 | 64.1 | 48.9 | 27.3 | 48.8 | 57.5
AutoAssign [42] | Student | 40.6 | 60.1 | 43.8 | 23.6 | 44.3 | 52.4
AutoAssign [42] | DeFeat [9] | 42.3 (+1.7) | 61.6 (+1.5) | 46.1 (+2.3) | 24.1 (+0.5) | 46.0 (+1.7) | 54.4 (+2.0)
AutoAssign [42] | FRS [41] | 42.4 (+1.8) | 61.9 (+1.8) | 46.0 (+2.2) | 24.9 (+1.3) | 46.0 (+1.7) | 54.8 (+2.4)
AutoAssign [42] | FKD [37] | 42.8 (+2.2) | 62.1 (+2.0) | 46.5 (+2.7) | 25.7 (+2.1) | 46.4 (+2.1) | 55.5 (+3.1)
AutoAssign [42] | FGD [35] | 43.2 (+2.6) | 62.5 (+2.4) | 46.9 (+3.1) | 25.2 (+1.6) | 46.7 (+2.4) | 56.2 (+3.8)
AutoAssign [42] | Ours | 43.8 (+3.1) | 62.9 (+2.8) | 47.4 (+3.6) | 25.8 (+2.2) | 47.3 (+3.0) | 57.5 (+5.1)
ATSS [38] | Teacher | 45.5 | 63.9 | 49.7 | 28.7 | 50.1 | 57.8
ATSS [38] | Student | 39.6 | 57.6 | 43.2 | 23.0 | 42.9 | 51.2
ATSS [38] | DeFeat [9] | 41.8 (+2.2) | 60.3 (+2.7) | 45.3 (+2.1) | 24.8 (+1.8) | 45.6 (+2.7) | 53.5 (+2.3)
ATSS [38] | FRS [41] | 41.6 (+2.0) | 60.1 (+2.5) | 44.8 (+1.6) | 24.9 (+1.9) | 45.2 (+2.3) | 53.2 (+2.0)
ATSS [38] | FGFI [34] | 41.8 (+2.2) | 60.3 (+2.7) | 45.3 (+2.1) | 24.8 (+1.8) | 45.6 (+2.7) | 53.5 (+2.3)
ATSS [38] | FKD [37] | 42.3 (+2.7) | 60.7 (+3.1) | 46.2 (+3.0) | 26.3 (+3.3) | 46.0 (+3.1) | 54.6 (+3.4)
ATSS [38] | FGD [35] | 42.6 (+3.0) | 60.9 (+3.3) | 46.2 (+3.0) | 25.7 (+2.7) | 46.7 (+3.8) | 54.5 (+3.3)
ATSS [38] | Ours | 44.2 (+4.6) | 62.3 (+4.7) | 48.3 (+5.1) | 26.5 (+3.5) | 48.6 (+5.7) | 57.1 (+5.9)
GFL [15] | Teacher | 45.8 | 64.2 | 49.8 | 28.3 | 50.3 | 58.6
GFL [15] | Student | 40.2 | 58.4 | 43.3 | 22.7 | 43.6 | 52.0
GFL [15] | DeFeat [9] | 42.1 (+1.9) | 60.5 (+2.1) | 45.2 (+1.9) | 24.4 (+1.7) | 46.1 (+2.5) | 54.5 (+2.5)
GFL [15] | FRS [41] | 42.2 (+2.0) | 60.6 (+2.2) | 45.6 (+2.3) | 24.7 (+2.0) | 46.0 (+2.4) | 55.5 (+3.5)
GFL [15] | FKD [37] | 43.1 (+2.9) | 61.6 (+3.2) | 46.6 (+3.3) | 25.1 (+2.4) | 47.2 (+3.6) | 56.5 (+4.5)
GFL [15] | FGD [35] | 43.2 (+3.0) | 61.8 (+3.4) | 46.9 (+3.6) | 25.2 (+2.5) | 47.5 (+3.9) | 56.2 (+4.2)
GFL [15] | LD [40] | 43.5 (+3.3) | 61.8 (+3.4) | 47.4 (+4.1) | 24.7 (+2.0) | 47.5 (+3.9) | 57.3 (+5.3)
GFL [15] | Ours | 43.8 (+3.6) | 62.0 (+3.6) | 47.4 (+4.1) | 25.4 (+2.7) | 47.8 (+4.2) | 57.6 (+5.6)
DDOD [6] | Teacher | 46.6 | 65.0 | 50.7 | 29.0 | 50.5 | 60.1
DDOD [6] | Student | 42.0 | 60.2 | 45.5 | 25.7 | 45.6 | 54.9
DDOD [6] | DeFeat [9] | 43.2 (+1.2) | 61.6 (+1.4) | 46.7 (+1.2) | 25.7 (+0.0) | 46.5 (+0.9) | 57.3 (+2.4)
DDOD [6] | FRS [41] | 43.7 (+1.7) | 62.2 (+2.0) | 47.6 (+2.1) | 25.7 (+0.0) | 46.8 (+1.2) | 58.1 (+3.2)
DDOD [6] | FGFI [34] | 44.1 (+2.1) | 62.6 (+2.4) | 47.9 (+2.4) | 26.3 (+0.6) | 47.3 (+1.7) | 58.5 (+3.6)
DDOD [6] | FKD [37] | 43.6 (+1.6) | 62.0 (+1.8) | 47.1 (+1.6) | 25.9 (+0.2) | 47.0 (+1.4) | 58.1 (+3.2)
DDOD [6] | FGD [35] | 44.1 (+2.1) | 62.4 (+2.2) | 47.9 (+2.4) | 26.8 (+1.1) | 47.2 (+1.6) | 58.5 (+3.6)
DDOD [6] | Ours | 45.4 (+3.4) | 63.9 (+3.7) | 49.0 (+3.5) | 26.9 (+1.2) | 49.2 (+3.6) | 59.7 (+4.8)

heavily occluded objects. Our codebase is built on PyTorch [23] and the MMDetection [5] toolkit and is available at https://github.com/ChenhongyiYang/PGD. All models are trained on 8 Nvidia 2080Ti GPUs. For both COCO and CrowdHuman, all models are trained using a batch size of 32 and an initial learning rate of 0.02; we adopt ImageNet pre-trained backbones and freeze all


Batch Normalisation layers during training. Unless otherwise specified, on both datasets we train teacher models for a 3× schedule (36 epochs) [10] with multi-scale inputs using ResNet-101 [12] as the backbone, and train student models for a 1× schedule (12 epochs) with single-scale inputs using ResNet-50 as the backbone. The COCO models are trained using the train2017 set and evaluated on the mini-val set following the official evaluation protocol [19]. The CrowdHuman models are trained on the CrowdHuman training set and evaluated on the CrowdHuman validation set following [7]. We set K in the top-K operation to 30 for all detectors and set α to 0.8 and 0.4 for anchor-based and anchor-free detectors, respectively. Following [35], we set σ = 0.0008, τ = 0.8 and β = 0.5α; we set ξ^cls = 0.8 and ξ^reg = 0.6 following [6]. We empirically set γ = 1.6α with minimal tuning.

Table 2. Distillation results on COCO mini-val using MobileNetV2 as the student backbone.

Detector | Setting | AP | AP50 | AP75 | APS | APM | APL
FCOS | Teacher | 43.1 | 62.4 | 46.6 | 25.5 | 47.1 | 54.7
FCOS | Student | 32.8 | 51.3 | 34.5 | 18.4 | 35.4 | 42.6
FCOS | FGD | 34.7 (+1.9) | 53.0 (+1.7) | 36.8 (+2.3) | 19.8 (+1.4) | 36.8 (+1.4) | 44.9 (+2.3)
FCOS | Ours | 37.3 (+4.5) | 55.6 (+4.3) | 39.8 (+5.3) | 20.5 (+2.1) | 40.3 (+4.9) | 49.9 (+7.3)
ATSS | Teacher | 45.5 | 63.9 | 49.7 | 28.7 | 50.1 | 57.8
ATSS | Student | 33.5 | 50.1 | 36.0 | 18.7 | 36.2 | 43.6
ATSS | FGD | 35.8 (+2.3) | 52.6 (+2.5) | 38.8 (+2.8) | 20.6 (+1.9) | 38.4 (+2.2) | 46.2 (+2.6)
ATSS | Ours | 38.3 (+4.8) | 55.1 (+5.0) | 41.7 (+5.7) | 21.3 (+2.6) | 41.6 (+5.4) | 51.6 (+8.0)

4.2 Main Results

Comparison with State-of-the-Art. We compare our PGD against other recent state-of-the-art object detection KD approaches for five high-performance dense detectors on COCO; these are a mixture of anchor-based (ATSS and DDOD) and anchor-free (FCOS, GFL and AutoAssign) detectors. The results are presented in Table 1. We use the same teacher and student models and the same training settings in each case, and all training is conducted locally. For competing distillation methods, we follow the hyper-parameter settings in their corresponding papers or open-sourced code repositories. We observe that our method surpasses the other KD methods by a large margin for all five detectors, which validates the effectiveness of our approach. Our approach significantly improves over the baseline FGD [35] and even outperforms LD [40] when applied to GFL [15], although LD was specifically designed for this detector. We observe that PGD is particularly good at improving the AP75 of student models, suggesting that the student model's localisation abilities have been largely improved.

Distilling to a Lightweight Backbone. Knowledge distillation is usually used to transfer useful information from a large model to a lightweight model suitable for deployment on the edge. With this in mind, we apply PGD using


Table 3. A comparison between our PGD and other state-of-the-art distillation methods on the CrowdHuman validation set using DDOD as the object detector.

Setting | MR ↓ | AP ↑ | JI ↑
Teacher | 41.4 | 90.2 | 81.4
Student | 46.0 | 88.0 | 79.0
FKD [37] | 44.3 (–1.7) | 89.1 (+1.1) | 80.0 (+1.0)
DeFeat [9] | 44.2 (–1.8) | 89.1 (+1.1) | 79.9 (+0.9)
FRS [41] | 44.1 (–1.9) | 89.2 (+1.2) | 80.3 (+1.3)
FGFI [34] | 43.8 (–2.2) | 89.2 (+1.2) | 80.3 (+1.3)
FGD [35] | 43.1 (–2.9) | 89.3 (+1.3) | 80.4 (+1.4)
Ours | 42.8 (–3.2) | 90.0 (+2.0) | 80.7 (+1.7)

a ResNet-101 as the teacher backbone and a MobileNetV2 [28] as the student backbone on anchor-based (ATSS) and anchor-free (FCOS) detectors. The results are provided in Table 2. Our method surpasses the baseline by a significant margin, pointing to its potential for resource-limited applications.

Distillation for Crowd Detection. We compare our approach to other KD methods on the challenging CrowdHuman dataset, which features heavily crowded scenes. We use the DDOD object detector for this experiment as it achieves the strongest performance. In addition to detection AP, we report the log miss rate (MR) [7] designed for evaluation in crowded scenes, as well as the Jaccard Index (JI), which evaluates a detector's counting ability. The results are available in Table 3. Our approach performs better than all competing methods. While FGD achieves MR and JI scores comparable to our method, the AP of our method is significantly greater. We believe this is because PGD strongly favours highly accurate predictions during distillation, which directly impacts the AP metric.

Table 4. Self-distillation performance on COCO mini-val. ResNet-50 is adopted as both the teacher and student backbone, and both are trained for a 1x schedule.

Detector  Setting  AP          AP50        AP75        APS         APM         APL
FCOS      S&T      38.2        57.9        40.5        23.1        41.3        49.4
FCOS      FGD      39.0(+0.8)  58.6(+0.7)  41.4(+0.9)  23.7(+0.6)  42.1(+0.8)  50.6(+1.2)
FCOS      Ours     39.5(+1.3)  59.2(+1.3)  41.9(+1.4)  24.4(+1.3)  42.8(+1.5)  50.6(+1.2)
ATSS      S&T      39.6        57.6        43.2        23.0        42.9        51.2
ATSS      FGD      40.2(+0.6)  58.6(+1.0)  43.6(+1.4)  23.3(+0.3)  43.7(+0.8)  52.3(+1.1)
ATSS      Ours     40.7(+1.1)  58.9(+1.3)  44.2(+2.0)  24.0(+0.9)  44.2(+1.3)  52.9(+1.7)

Self-distillation. Self-distillation is a special case of knowledge distillation where the teacher and student models are exactly the same. It is useful as it can boost


a model's performance without introducing extra parameters. We compare our method's self-distillation performance with the baseline FGD and present results for both anchor-free FCOS and anchor-based ATSS in Table 4. The teachers and students use ResNet-50 as the backbone and are trained with a 1x schedule using single-scale inputs. We can see that our approach achieves better performance than the baseline, indicating its effectiveness in self-distillation.

Table 5. Ablation study on different foreground distillation strategies on the COCO mini-val set using ATSS as the object detector.

Setting   AP          AP50        AP75        APS         APM         APL
Teacher   45.5        63.9        49.7        28.7        50.1        57.8
Student   39.6        57.6        43.2        23.0        42.9        51.2
Box       43.3(+3.7)  61.4(+3.8)  47.2(+4.0)  25.9(+2.9)  47.6(+4.7)  56.4(+5.2)
BoxGauss  43.7(+4.1)  61.9(+4.3)  47.6(+4.4)  26.7(+3.7)  47.8(+4.9)  56.6(+5.4)
Centre    43.1(+3.5)  61.0(+3.4)  46.9(+3.7)  25.9(+2.9)  47.3(+4.4)  56.1(+4.9)
Quality   43.8(+4.2)  61.8(+4.2)  47.8(+4.6)  25.7(+2.7)  48.2(+5.3)  56.8(+5.6)
TopkEq    43.9(+4.3)  62.0(+4.4)  47.7(+4.5)  27.1(+4.1)  48.0(+5.1)  56.8(+5.6)
KDE       44.0(+4.4)  62.1(+4.5)  47.8(+4.6)  26.3(+3.3)  48.5(+5.6)  56.8(+5.6)
Ours      44.2(+4.6)  62.3(+4.7)  48.3(+5.1)  26.5(+3.5)  48.6(+5.7)  57.1(+5.9)

Table 6. Hyper-parameter ablation studies on COCO mini-val.

(a) Ablation on K in the top-K operation, using ATSS as the detector:

K    1     5     9     15    30    45    60
AP   43.2  43.5  43.6  43.9  44.2  44.0  43.9

(b) Ablation on the distillation loss magnitude α, using FCOS and ATSS:

α     0.005  0.01  0.03  0.05  0.07  0.1   0.2
FCOS  41.7   42.0  42.5  42.5  42.4  42.2  41.8
ATSS  42.9   43.2  43.7  43.9  44.2  44.1  43.2

4.3 Ablation Study

Comparing Foreground Distillation Strategies. We compare alternative strategies for distilling foreground regions to investigate how important it is to distil different foreground regions. We use ATSS as our object detector and present results in Table 5. Note that here we only modify the foreground distillation strategy while keeping everything else the same. We first evaluate the strategy used in FGD [35] and DeFeat [9], where regions in the GT box are distilled equally. We dub this the Box strategy (Fig. 1a). Compared to our method, Box achieves 0.9 AP worse performance. A possible reason for this is that it can include sub-optimal prediction locations that distract from more meaningful features.


Note that the Box strategy still outperforms the baseline FGD; we attribute this improvement to the decoupling of distillation for the classification and regression branches. Several works [30,40] postulate that the most meaningful regions lie near the centre of the GT box. We evaluate the BoxGauss strategy that was proposed in TADF [30] (Fig. 4b). Specifically, a Gaussian distribution is used to weight the distillation loss, where its mean is the centre of the GT box and its standard deviation is calculated from the box dimensions. This strategy yields a +0.4 AP improvement over the vanilla Box strategy, suggesting the importance of focusing on the centre area; however, it is still surpassed by our approach. We also consider a Centre strategy, which distils a 0.2H x 0.2W area at the middle of the GT box. Somewhat surprisingly, this achieves an even worse AP than the vanilla Box strategy in almost all instances, with comparable performance on small objects. A possible explanation is that a fixed-ratio region fails to cover the full span of useful regions for different-sized objects and limits the amount of distilled information. We then compare to an adaptive loss-weighting mechanism where we directly use the quality score in Eq. 1 to weight features for the distillation loss. This strategy, which we refer to as Quality, improves slightly on BoxGauss, especially for medium and high-scoring boxes; however, it significantly under-performs on small objects. In contrast, the TopkEq strategy, where we limit distillation to only the top-K pixels according to the quality score (we set K = 30 to match our method), provides a significant improvement to the detection of small objects. A possible explanation is that distilling on positions with lower scores still introduces considerable noise, whereas limiting distillation to only the highest-scoring pixels focuses the student on the most essential features of the teacher. Finally, we compare our method to one that replaces the Gaussian MLE with kernel density estimation, the KDE strategy. It achieves similar performance to our Gaussian MLE approach, but is more complicated.

Hyper-Parameter Settings. Here, we examine the effect of changing two important hyper-parameters used in our approach, as presented in Table 6. The first is K, the number of high-scoring pixels used for distillation. The best performance is obtained for K = 30. A small K can cause distillation to neglect important regions, whereas a large K can introduce noise that distracts the distillation process from the most essential features. The second hyper-parameter we vary is α, which controls the magnitude of the distillation loss. We can see how this affects performance for anchor-based ATSS and anchor-free FCOS: ATSS's performance is quite robust when α is between 0.05 and 0.1, and FCOS achieves good performance when α is between 0.03 and 0.1. For both types of detectors, a small α minimises the effect of distillation, while a large α can make training unstable.

Decoupled Distillation. In our pipeline, we decouple the KD loss to distil the classification and regression heads separately (see Sect. 3.3). This practice differs from previous feature-imitation approaches where the FPN neck features are distilled. Here we conduct experiments to test this design and present the results in Table 7. Firstly, we remove the regression KD loss and only apply the


classification KD loss using FPN features. The model achieves 43.6 mAP on COCO. Then we test applying only the classification KD loss using the classification feature map; the performance improves very slightly (by 0.2). Next, we test only the regression KD loss using the regression features, resulting in 41.7 COCO mAP. The performance is significantly harmed because the regression KD loss only considers foreground regions while ignoring background areas. Finally, we arrive at our design by combining both the classification and regression KD losses, which achieves the best performance, at 44.2 COCO mAP.
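To make the decoupled design concrete, a minimal PyTorch-style sketch is given below. It is our own simplification, not the released implementation: plain mean-squared feature imitation stands in for the actual distillation losses, and the foreground weights are assumed to be the top-K quality-based weights described earlier (zero outside the selected positions).

import torch.nn.functional as F

def decoupled_kd_loss(stu_cls, tea_cls, stu_reg, tea_reg, fg_weight,
                      cls_weight=1.0, reg_weight=1.0):
    # stu_*/tea_*: (B, C, H, W) feature maps from the student/teacher
    #              classification and regression heads at one FPN level.
    # fg_weight:   (B, H, W) non-negative weights concentrated on the top-K
    #              high-quality foreground positions (zero elsewhere).
    # The classification term imitates the whole map, while the regression
    # term is restricted to the weighted foreground, mirroring the ablation
    # above in which regression-only distillation ignores background.
    cls_loss = F.mse_loss(stu_cls, tea_cls)
    per_pixel = (stu_reg - tea_reg).pow(2).mean(dim=1)          # (B, H, W)
    reg_loss = (per_pixel * fg_weight).sum() / fg_weight.sum().clamp(min=1e-6)
    return cls_weight * cls_loss + reg_weight * reg_loss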


Fig. 5. Visualisation of detection results on the COCO mini-val set using ATSS as the detector and PGD for distillation. GTs are shown in blue, plain student detections in red, and distilled student predictions in orange. (Color figure online)

Table 7. Comparison between different distillation branches.

neck  cls  reg  AP          AP50        AP75        APS         APM         APL
-     -    -    39.6        57.6        43.2        23.0        42.9        51.2
x     -    -    43.6(+4.0)  61.8(+4.2)  47.5(+4.3)  26.1(+3.1)  47.8(+4.9)  56.8(+5.6)
-     x    -    43.8(+4.2)  62.1(+4.5)  47.5(+4.3)  26.5(+3.5)  48.0(+5.1)  56.8(+5.6)
-     -    x    41.7(+2.1)  60.2(+2.6)  45.2(+2.0)  25.3(+2.3)  45.4(+2.5)  53.9(+2.7)
-     x    x    44.2(+4.6)  62.3(+4.7)  48.3(+5.1)  26.5(+3.5)  48.6(+5.7)  57.1(+5.9)

Qualitative Studies. We visualise box predictions using ATSS as our object detector in Fig. 5, in which we show GT boxes alongside student predictions with and without distillation using PGD. While the high-performance ATSS is able to accurately detect objects in most cases, we observe some clear advantages of using our distillation approach: it outputs fewer false positives (Fig. 5b,c), improves detection recall (Fig. 5a,d), and localises objects better (Fig. 5b,d,e).

5 Conclusion

In this work, we highlight the need to focus distillation on features of the teacher that are responsible for high-scoring predictions. We find that these key predictive regions constitute only a small fraction of all features within the boundaries of the ground-truth bounding box. We use this observation to design a novel distillation technique—PGD—that amplifies the distillation signal from these features. We use an adaptive Gaussian distribution to smoothly aggregate those


top locations to further enhance performance. Our approach can significantly improve state-of-the-art detectors on COCO and CrowdHuman, outperforming many existing KD methods. In the future, we could investigate the applicability of high-quality regions to two-stage and transformer-based detection models.

Acknowledgement. The authors would like to thank Joe Mellor, Kaihong Wang, and Zehui Chen for their useful comments and suggestions. This work was supported by a PhD studentship provided by the School of Engineering, University of Edinburgh, as well as the EPSRC Centre for Doctoral Training in Robotics and Autonomous Systems (Grant No. EP/S515061/1) and SeeByte Ltd., Edinburgh, UK.

References

1. Ba, L.J., Caruana, R.: Do deep nets really need to be deep? In: NeurIPS (2014)
2. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018)
3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213-229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
4. Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object detection models with knowledge distillation. In: NeurIPS (2017)
5. Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
6. Chen, Z., Yang, C., Li, Q., Zhao, F., Zha, Z.J., Wu, F.: Disentangle your dense object detector. In: ACM MM (2021)
7. Chu, X., Zheng, A., Zhang, X., Sun, J.: Detection in crowded scenes: one proposal, multiple predictions. In: CVPR (2020)
8. Dai, X., et al.: General instance distillation for object detection. In: CVPR (2021)
9. Guo, J., et al.: Distilling object detectors via decoupled features. In: CVPR (2021)
10. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: CVPR (2019)
11. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
13. Hinton, G., et al.: Distilling the knowledge in a neural network. In: NeurIPS 2014 Deep Learning Workshop (2014)
14. Kang, Z., Zhang, P., Zhang, X., Sun, J., Zheng, N.: Instance-conditional knowledge distillation for object detection. In: NeurIPS (2021)
15. Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In: NeurIPS (2020)
16. Li, Y., Chen, Y., Wang, N., Zhang, Z.: Scale-aware trident networks for object detection. In: ICCV (2019)
17. Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: Light-head R-CNN: in defense of two-stage object detector. arXiv preprint arXiv:1711.07264 (2017)
18. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
19. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740-755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


20. Ma, Y., Liu, S., Li, Z., Sun, J.: IQDet: instance-wise quality distribution sampling for object detection. In: CVPR (2021)
21. Nguyen, C.H., Nguyen, T.C., Tang, T.N., Phan, N.L.: Improving object detection by label assignment distillation. In: WACV (2022)
22. Oksuz, K., Cam, B.C., Akbas, E., Kalkan, S.: A ranking-based, balanced loss function unifying classification and localisation in object detection. Adv. Neural Inf. Process. Syst. 33, 15534-15545 (2020)
23. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
24. Qin, Z., Li, Z., Zhang, Z., Bao, Y., Yu, G., Peng, Y., Sun, J.: ThunderNet: towards real-time generic object detection on mobile devices. In: ICCV (2019)
25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
26. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR (2017)
27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
28. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR (2018)
29. Shao, S., et al.: CrowdHuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018)
30. Sun, R., Tang, F., Zhang, X., Xiong, H., Tian, Q.: Distilling object detectors with task adaptive regularization. arXiv preprint arXiv:2006.13108 (2020)
31. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
32. Wang, J., Song, L., Li, Z., Sun, H., Sun, J., Zheng, N.: End-to-end object detection with fully convolutional network. In: CVPR (2021)
33. Wang, R.J., Li, X., Ling, C.X.: Pelee: a real-time object detection system on mobile devices. In: NeurIPS (2018)
34. Wang, T., Yuan, L., Zhang, X., Feng, J.: Distilling object detectors with fine-grained feature imitation. In: CVPR (2019)
35. Yang, Z., et al.: Focal and global knowledge distillation for detectors. arXiv preprint arXiv:2111.11837 (2021)
36. Yao, L., Pi, R., Xu, H., Zhang, W., Li, Z., Zhang, T.: G-DetKD: towards general distillation framework for object detectors via contrastive and semantic-guided feature imitation. In: ICCV (2021)
37. Zhang, L., Ma, K.: Improve object detection with feature-based knowledge distillation: towards accurate and efficient detectors. In: ICLR (2021)
38. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: CVPR (2020)
39. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: faster and better learning for bounding box regression. In: AAAI (2020)
40. Zheng, Z., Ye, R., Wang, P., Wang, J., Ren, D., Zuo, W.: Localization distillation for object detection. arXiv preprint arXiv:2102.12252 (2021)
41. Zhixing, D., et al.: Distilling object detectors with feature richness. In: NeurIPS (2021)
42. Zhu, B., et al.: AutoAssign: differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496 (2020)

Multimodal Object Detection via Probabilistic Ensembling

Yi-Ting Chen1, Jinghao Shi2, Zelin Ye2, Christoph Mertz2, Deva Ramanan2,3, and Shu Kong2,4(B)

1 University of Maryland, College Park, USA. [email protected]
2 Carnegie Mellon University, Pittsburgh, USA. {jinghaos,zeliny,cmertz}@andrew.cmu.edu, [email protected]
3 Argo AI, Pittsburgh, USA
4 Texas A&M University, College Station, USA. [email protected]

Abstract. Object detection with multimodal inputs can improve many safety-critical systems such as autonomous vehicles (AVs). Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter provides much stronger object signatures under poor illumination. We explore strategies for fusing information from different modalities. Our key contribution is a probabilistic ensembling technique, ProbEn, a simple non-learned method that fuses together detections from multiple modalities. We derive ProbEn from Bayes' rule and first principles that assume conditional independence across modalities. Through probabilistic marginalization, ProbEn elegantly handles missing modalities when detectors do not fire on the same object. Importantly, ProbEn also notably improves multimodal detection even when the conditional independence assumption does not hold, e.g., fusing outputs from other fusion methods (both off-the-shelf and trained in-house). We validate ProbEn on two benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal images, showing that ProbEn outperforms prior work by more than 13% in relative performance!

Keywords: Object detection · Multimodal detection · Infrared · Thermal · Probabilistic model · Ensembling · Multimodal fusion · Uncertainty

Y.-T. Chen, J. Shi and Z. Ye contributed equally; the work was mostly done when the authors were with CMU. D. Ramanan and S. Kong contributed equal supervision. Open-source code is available on GitHub.

Supplementary Information: the online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_9.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. S. Avidan et al. (Eds.): ECCV 2022, LNCS 13669, pp. 139-158, 2022. https://doi.org/10.1007/978-3-031-20077-9_9


1 Introduction

Object detection is a canonical computer vision problem that has been greatly advanced by the end-to-end training of deep neural detectors [23,45]. Such detectors are widely adopted in various safety-critical systems such as autonomous vehicles (AVs) [7,19]. Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter can provide much stronger object signatures under poor illumination [4,12,26,32,51,57].


Fig. 1. Multimodal detection via ensembling single-modal detectors. (a) A naive approach is to pool detections from each modality, but this will result in multiple detections that overlap the same object. (b) To remedy this, one can apply nonmaximal suppression (NMS) to suppress overlapping detections from different modalities, which always returns the higher (maximal) scoring detection. Though quite simple, NMS is an effective fusion strategy that has not been previously proposed as such. However, NMS fails to incorporate cues from the lower-scoring modality. (c) A natural strategy for doing so might average scores of overlapping detections (instead of suppressing the weaker ones) [33, 36]. However, this must decrease the reported score compared to NMS. Intuitively, if two modalities agree on a candidate detection, one should boost its score. (d) To do so, we derive a simple probabilistic ensembling approach, ProbEn, to score fusion that increases the score for detections that have strong evidence from multiple modalities. We further extend ProbEn to box fusion in Sect. 3. Our non-learned ProbEn significantly outperforms prior work (Tables 2 and 4).

Multimodal Data. There exist several challenges in multimodal detection. One is the lack of data. While there exist large repositories of annotated single-modal (RGB) datasets and pre-trained models, there is much less annotated data for other modalities (thermal), and even fewer annotations of them paired together. One often-ignored aspect is the alignment of the modalities: aligning RGB and thermal images requires special-purpose hardware, e.g., a beamsplitter [26] or a specialized rack [48] for spatial alignment, and a GPS clock synchronizer for temporal alignment [42]. Fusion on unaligned RGB-thermal inputs (cf. Fig. 4) remains relatively unexplored. For example, even annotating bounding boxes is cumbersome because separate annotations are required for each modality, increasing overall cost. As a result, many unaligned datasets annotate only one modality (e.g., FLIR [17]), further complicating multimodal learning.

Multimodal Fusion. The central question in multimodal detection is how to fuse information from different modalities.


Fig. 2. High-level comparisons between mid- and late-fusion. (a) Past work primarily focuses on mid-fusion, e.g., concatenating features computed by single-modal feature extractors. (b) We focus on late-fusion via detector ensemble that fuses detections from independent detectors, e.g., two single-modal detectors trained with RGB and thermal images respectively.

Previous work has explored strategies for fusion at various stages [4,9,32,51,56,57], which are often categorized into early-, mid- and late-fusion. Early-fusion constructs a four-channel RGB-thermal input [49], which is then processed by a (typical) deep network. In contrast, mid-fusion keeps RGB and thermal inputs in different streams and then merges their features downstream within the network (Fig. 2a) [30,36,49]. The vast majority of past work focuses on the architectural design of where and how to merge. Our key contribution is the exploration of an extreme variant: very-late fusion of detectors trained on separate modalities (Fig. 2b) through detector ensembling. Though conceptually simple, ensembling can be effective because one can learn from single-modal datasets that often dwarf the size of multimodal datasets. However, ensembling can be practically challenging because different detectors might not fire on the same object. For example, RGB-based detectors often fail to fire in nighttime conditions, implying one needs to deal with "missing" detections during fusion.

Probabilistic Ensembling (ProbEn). We derive our very-late fusion approach, ProbEn, from first principles: simply put, if single-modal signals are conditionally independent of each other given the true label, the optimal fusion strategy is given by Bayes' rule [41]. ProbEn requires no learning, and so does not require any multimodal data for training. Importantly, ProbEn elegantly handles "missing" modalities via probabilistic marginalization. While ProbEn is derived assuming conditional independence, we empirically find that it can be used to fuse outputs that are not strictly independent, e.g., by fusing outputs from other fusion methods (both off-the-shelf and trained in-house). In this sense, ProbEn is a general technique for ensembling detectors. We achieve significant improvements over prior art, both on aligned and unaligned multimodal benchmarks.

Why Ensemble? One may ask why detector ensembling should be regarded as an interesting contribution, given that ensembling is a well-studied approach [3,13,18,29] that is often viewed as an "engineering detail" for improving


leaderboard performance [22,25,31]. Firstly, we show that the precise ensembling technique matters: prior approaches proposed in the (single-modal) detection literature, such as score-averaging [14,31] or max-voting [52], are not as effective as ProbEn, particularly when dealing with missing modalities. Secondly, to our knowledge, we are the first to propose detector ensembling as a fusion method for multimodal detection. Though quite simple, it is remarkably effective and should be considered a baseline for future research.

2 Related Work

Object Detection and Detector Ensembling. State-of-the-art detectors train deep neural networks on large-scale datasets such as COCO [34] and often focus on architectural design [37,43-45]. Crucially, most architectures generate overlapping detections, which need to be post-processed with non-maximal suppression (NMS) [5,10,47]. Overlapping detections can also be generated by detectors tuned for different image crops and scales, which typically make use of ensembling techniques for post-processing their output [1,22,25]. Somewhat surprisingly, although detector ensembling and NMS are widely studied in single-modal RGB detection, to the best of our knowledge, they have not been used to (very) late-fuse multimodal detections; we find them remarkably effective.

Multimodal Detection, particularly with RGB-thermal images, has attracted increasing attention. The KAIST pedestrian detection dataset [26] is one of the first benchmarks for RGB-thermal detection, fostering growth of research in this area. Inspired by successful RGB-based detectors [37,43,45], current multimodal detectors train deep models with various methods for fusing multimodal signals [4,9,28,32,51,56,57,58]. Most of these multimodal detection methods work on aligned RGB-thermal images, but it is unclear how they perform on heavily unaligned modalities such as the images in Fig. 4 taken from the FLIR dataset [17]. We study multimodal detection under both aligned and unaligned RGB-thermal scenarios.

Multimodal fusion is the central question in multimodal detection. Compared to early-fusion, which simply concatenates RGB and thermal inputs, mid-fusion of single-modal features performs better [49]. Therefore, most multimodal methods study how to fuse features and focus on designing new network architectures [30,36,49]. Because RGB-thermal pairs might not be aligned, some methods train an RGB-thermal translation network to synthesize aligned pairs, but this requires annotations in each modality [12,27,38]. Interestingly, few works explore learning from unaligned data that are annotated in only a single modality; we show that mid-fusion architectures can still learn in this setting by acting as an implicit alignment network. Finally, few fusion architectures explore (very) late fusion of single-modal detections via detector ensembling. Most that do simply take heuristic (weighted) averages of confidence scores [20,32,57]. In contrast, we introduce probabilistic ensembling (ProbEn) for late-fusion, which significantly outperforms prior methods on both aligned and unaligned RGB-thermal data.

3 Fusion Strategies for Multimodal Detection

We now present multimodal fusion strategies for detection. We first point out that single-modal detectors are viable methods for processing multimodal signals, and so include them as a baseline. We also include fusion baselines for early-fusion, which concatenates RGB and thermal as a four-channel input, and mid-fusion, which concatenates single-modal features inside a network (Fig. 2). As a preview of results, we find that mid-fusion is generally the most effective baseline (Table 1). Surprisingly, this holds even for unaligned data that is annotated with a single modality (Fig. 4), indicating that mid-fusion can perform some implicit alignment (Table 3). We then describe strategies for late-fusing detectors from different modalities, i.e., detector ensembling. We begin with a naive approach (Fig. 1). Late-fusion needs to fuse scores and boxes; we discuss the latter at the end of this section.

Naive Pooling. The simplest possible strategy is to naively pool detections from multiple modalities together. This will probably result in multiple detections overlapping the same ground-truth object (Fig. 1a).

Non-Maximum Suppression (NMS). The natural solution for dealing with overlapping detections is NMS, a crucial component in contemporary RGB detectors [14,24,60]. NMS finds bounding box predictions with high spatial overlap and removes the lower-scoring bounding boxes. This can be implemented in a sequential fashion via sorting of predictions by confidence, as depicted in Algorithm 1, or in a parallel fashion amenable to GPU computation [6]. While NMS has been used to ensemble single-modal detectors [47], it has (surprisingly) not been advocated for fusion of multi-modal detectors. We find it to be shockingly effective, outperforming the majority of past work on established benchmarks (Table 2). Specifically, when two detections from two different modalities overlap (e.g., IoU>0.5), NMS simply keeps the higher-scoring detection and suppresses the other (Fig. 1b). This allows each modality to "shine" where effective: thermal detections tend to score high (and so will be selected) when RGB detections perform poorly due to poor illumination conditions. That said, rather than selecting one modality at the global image level (e.g., daytime vs. nighttime), NMS selects one modality at the local bounding-box level. However, in some sense, NMS fails to "fuse" information from multiple modalities together, since each of the final detections is supported by only one modality.

Average Fusion. To actually fuse multimodal information, a straightforward strategy is to modify NMS to average the confidence scores of overlapping detections from different modalities, rather than suppressing the weaker modality. Such averaging has been proposed in prior work [32,36,52]. However, averaging scores will necessarily decrease the NMS score, which reports the max of an overlapping set of detections (Fig. 1c). Our experiments demonstrate that averaging produces worse results than NMS and single-modal detectors. Intuitively, if two modalities agree that there exists a detection, fusion should increase the overall confidence rather than decrease it.


Algorithm 1. Multimodal Fusion by NMS or ProbEn
1: Input: class priors π_k for k ∈ {1, ..., K}; the fusion-method flag (NMS or ProbEn); a set D of detections from multiple modalities, where each detection d = (y, z, m) ∈ D contains classification posteriors y, box coordinates z, and a modality tag m.
2: Initialize the set of fused detections F = {}
3: while D ≠ ∅ do
4:   Find the detection d ∈ D with the largest posterior
5:   Find all detections in D that overlap d (e.g., IoU > 0.5), denoted T ⊆ D
6:   if NMS then
7:     d' ← d
8:   else if ProbEn then
9:     Find the highest-scoring detection in T from each modality, denoted S ⊆ T
10:    Compute d' from S by fusing scores y with Eq. (4) and boxes z with Eq. (8)
11:  end if
12:  F ← F + {d'},  D ← D − T
13: end while
14: return the set F of fused detections
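The loop above can be prototyped in a few lines of Python. The sketch below is our own simplification rather than the authors' released code: it assumes a single object class with a uniform class prior, so that Eq. (4) reduces to multiplying object/background posteriors and renormalizing, and it uses the simple "avg" variant of the box fusion in Eq. (8).

import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes in (x1, y1, x2, y2) format.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_detections(dets, method="proben", iou_thr=0.5):
    # dets: list of (score, box, modality), with score a single-class posterior.
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    fused = []
    while dets:
        top = dets[0]
        group = [d for d in dets if iou(top[1], d[1]) > iou_thr]
        if method == "nms":
            fused.append(top)                      # keep the maximal detection
        else:
            best = {}                              # best detection per modality
            for score, box, modality in group:
                best.setdefault(modality, (score, box))
            scores = np.array([s for s, _ in best.values()])
            boxes = np.array([b for _, b in best.values()], dtype=float)
            pos, neg = scores.prod(), (1.0 - scores).prod()
            fused_score = pos / (pos + neg)        # Eq. (4) with a uniform prior
            fused_box = tuple(boxes.mean(axis=0))  # "avg" box fusion, cf. Eq. (8)
            fused.append((fused_score, fused_box, "fused"))
        dets = [d for d in dets if d not in group]
    return fused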

Probabilistic Ensembling (ProbEn). We derive our probabilistic approach for late-fusion of detections by starting with how to fuse detection scores (Algorithm 1). Assume we have an object with label y (e.g., a "person") and measured signals from two modalities: x1 (RGB) and x2 (thermal). We write out our formulation for two modalities, but the extension to more (evaluated in our experiments) is straightforward. Crucially, we assume measurements are conditionally independent given the object label y:

p(x_1, x_2 | y) = p(x_1 | y)\, p(x_2 | y)   (1)

This can also be written as p(x_1 | y) = p(x_1 | x_2, y), which may be easier to intuit: given the person label y, predict its RGB appearance x_1; if this prediction would not change given knowledge of the thermal signal x_2, then conditional independence holds. We wish to infer labels given multimodal measurements:

p(y | x_1, x_2) = \frac{p(x_1, x_2 | y)\, p(y)}{p(x_1, x_2)} \propto p(x_1, x_2 | y)\, p(y)   (2)

By applying the conditional independence assumption from (1) to (2), we have:

p(y | x_1, x_2) \propto p(x_1 | y)\, p(x_2 | y)\, p(y) \propto \frac{p(x_1 | y) p(y)\; p(x_2 | y) p(y)}{p(y)}   (3)

\propto \frac{p(y | x_1)\, p(y | x_2)}{p(y)}   (4)

The above suggests a simple approach to fusion that is provably optimal when single-modal features are conditionally independent given the true object label:

1. Train independent single-modal classifiers that predict the distributions over the label y given each individual feature modality, p(y | x_1) and p(y | x_2).
2. Produce a final score by multiplying the two distributions, dividing by the class prior distribution, and normalizing the final result (4) to sum to one.

To obtain the class prior p(y), we can simply normalize the counts of per-class examples. Extending ProbEn (4) to M modalities is simple:

p(y | \{x_i\}_{i=1}^{M}) \propto \frac{\prod_{i=1}^{M} p(y | x_i)}{p(y)^{M-1}}.   (5)

Independence Assumptions. ProbEn is optimal given the independence assumption from (1). Even when such independence assumptions do not hold in practice, the resulting models may still be effective [11] (i.e., just as assumptions of Gaussianity can still be useful even if strictly untrue [29,41]). Interestingly, many fusion methods, including NMS and averaging, make the same underlying assumption, as discussed in [29]. In fact, [29] points out that Average Fusion (which averages class posteriors) makes an even stronger assumption: posteriors do not deviate dramatically from class priors. This is likely not true, as corroborated by the poor performance of averaging in our experiments (despite its apparent widespread use [32,36,52]).

Relationship to Prior Work. To compare to prior fusion approaches that tend to operate on logit scores, we rewrite the single-modal softmax posterior for class k given modality i in terms of the single-modal logit score s_i[k]. For notational simplicity, we suppress its dependence on the underlying input modality x_i:

p(y=k | x_i) = \frac{\exp(s_i[k])}{\sum_j \exp(s_i[j])} \propto \exp(s_i[k]),

where we exploit the fact that the partition function in the denominator is not a function of the class label k. We now plug the above into Eq. (5):

p(y=k | \{x_i\}_{i=1}^{M}) \propto \frac{\prod_{i=1}^{M} p(y=k | x_i)}{p(y=k)^{M-1}} \propto \frac{\exp\big(\sum_{i=1}^{M} s_i[k]\big)}{p(y=k)^{M-1}}   (6)

ProbEn is thus equivalent to summing logits, dividing by the class prior and normalizing via a softmax. Our derivation (6) reveals that summing logits without the division may over-count class priors, where the over-counting grows with the number of modalities M. The supplement shows that dividing by class priors p(y) marginally helps. In practice, we empirically find that assuming uniform priors works surprisingly well, even on imbalanced datasets. This is the default for our experiments, unless otherwise noted.

Missing Modalities. Importantly, summing and averaging behave profoundly differently when fusing across "missing" modalities (Fig. 3). Intuitively, different single-modal detectors often do not fire on the same object. This means that to output a final set of detections above a confidence threshold (e.g., necessary for computing precision-recall metrics), one will need to compare scores from fused multi-modal detections with single-modal detections, as illustrated in Fig. 3. ProbEn elegantly deals with missing modalities because probabilistically-normalized multi-modal posteriors p(y | x_1, x_2) can be directly compared with single-modal posteriors p(y | x_1).
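In code, the logit-space form of Eq. (6) is a one-liner per class: sum the per-modality logits, subtract (M - 1) times the log class prior, and renormalize with a softmax. The sketch below is ours; by default it assumes the uniform prior used in our experiments, in which case the prior term cancels inside the softmax.

import numpy as np

def proben_scores(logits_per_modality, class_prior=None):
    # logits_per_modality: (M, K) array of single-modal logits s_i[k].
    logits = np.asarray(logits_per_modality, dtype=float)
    M, _ = logits.shape
    fused = logits.sum(axis=0)                    # sum logits over modalities
    if class_prior is not None:                   # optional prior correction, Eq. (6)
        fused -= (M - 1) * np.log(np.asarray(class_prior, dtype=float))
    fused -= fused.max()                          # numerical stability
    probs = np.exp(fused)
    return probs / probs.sum()

# Example: two modalities, two classes (object vs. background).
rgb = np.log([0.6, 0.4])                          # posteriors re-expressed as logits
thermal = np.log([0.7, 0.3])
print(proben_scores([rgb, thermal]))              # approximately [0.78, 0.22]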


Fig. 3. Missing Modalities. The orange-person (a) fails to trigger a thermal detection (b), resulting in a single-modal RGB detection (0.85 confidence). To generate an output set of detections (for downstream metrics such as average precision), this detection must be compared to the fused multimodal detection of the red-person (RGB: 0.80, thermal: 0.70). (c) averaging confidences for the red-person lowers their score (0.75) below the orange-person, which is unintuitive because additional detections should boost confidence. (d) ProbEn increases the red-person fused score to 0.90, allowing for proper comparisons to single-modal detections. (Color figure online)
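To make the caption's numbers concrete (our worked example, assuming a binary person-vs-background label and a uniform class prior so that the prior term in Eq. (4) cancels), the red person's fused score is

p(\text{person} \mid x_1, x_2) = \frac{0.80 \times 0.70}{0.80 \times 0.70 + 0.20 \times 0.30} = \frac{0.56}{0.62} \approx 0.90,

which is higher than either single-modal score, whereas averaging yields (0.80 + 0.70)/2 = 0.75.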

Bounding Box Fusion. Thus far, we have focused on fusion of class posteriors. We now extend ProbEn to probabilistically fuse the bounding box (bbox) coordinates of overlapping detections. We repurpose the derivation from (4) for a continuous bbox label rather than a discrete one. Specifically, we write z for the continuous random variable defining the bounding box (parameterized by its centroid, width, and height) associated with a given detection. We assume single-modal detections provide a posterior p(z|x_i) that takes the form of a Gaussian with a single variance σ_i², i.e., p(z|x_i) = N(μ_i, σ_i² I), where μ_i are the box coordinates predicted from modality i. We also assume a uniform prior on p(z), implying bbox coordinates can lie anywhere in the image plane. Doing so, we can write

p(z | x_1, x_2) \propto p(z | x_1)\, p(z | x_2) \propto \exp\Big(\frac{-\|z-\mu_1\|^2}{2\sigma_1^2}\Big) \exp\Big(\frac{-\|z-\mu_2\|^2}{2\sigma_2^2}\Big)   (7)

\propto \exp\Big(\frac{-\|z-\mu\|^2}{2\sigma^2}\Big), \quad \text{where } \mu = \frac{\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} \text{ and } \frac{1}{\sigma^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}.   (8)

We refer the reader to the supplement for a detailed derivation. Equation (8) suggests a simple way to probabilistically fuse box coordinates: compute a weighted average of box coordinates, where the weights are given by the inverse covariance. We explore three methods for setting σ_i². The first method, "avg", fixes σ_i² = 1, amounting to simply averaging bounding box coordinates. The second, "s-avg", approximates σ_i² ≈ 1/p(y=k|x_i), implying that more confident detections should have a higher weight when fusing box coordinates; this performs marginally better than simple averaging. The third, "v-avg", trains the detector to predict the regression variance/uncertainty using the Gaussian negative log-likelihood (GNLL) loss [39] alongside the box regression loss. Interestingly, incorporating GNLL not only produces better variance/uncertainty estimates, which help fusion, but also improves the detection performance of the trained detectors (details in the supplement).
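A minimal sketch of the inverse-variance weighting in Eq. (8) follows (ours, not the released implementation); the avg, s-avg and v-avg options above correspond to the three ways of supplying weights below.

import numpy as np

def fuse_box_coordinates(boxes, variances=None, scores=None):
    # boxes:     (N, 4) box coordinates, one row per modality.
    # variances: optional (N,) predicted sigma_i^2 values ("v-avg").
    # scores:    optional (N,) class posteriors; if variances is None, use
    #            sigma_i^2 ~ 1 / p(y=k|x_i), i.e. weight = posterior ("s-avg").
    # With neither argument, all modalities get equal weight ("avg").
    boxes = np.asarray(boxes, dtype=float)
    if variances is not None:
        w = 1.0 / np.asarray(variances, dtype=float)
    elif scores is not None:
        w = np.asarray(scores, dtype=float)
    else:
        w = np.ones(len(boxes))
    w = w / w.sum()
    return (w[:, None] * boxes).sum(axis=0)

# Example: the more confident detection pulls the fused box towards itself.
print(fuse_box_coordinates([[10, 10, 50, 50], [14, 12, 54, 52]], scores=[0.9, 0.3]))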


Fig. 4. RGB and thermal images are unaligned both spatially and temporally in FLIR [17], which annotates only thermal images. As a result, prior methods rely on thermal and drop the RGB modality. We find that mid-fusion, which takes both RGB and thermal as input, notably improves detection accuracy. When late-fusing detections computed by the mid-fusion and thermal-only detectors, our ProbEn yields much better performance (Tables 3 and 4).

4 Experiments

We validate different fusion methods on two datasets: KAIST [26], which is released under the Simplified BSD License, and FLIR [17] (Fig. 4), which allows for non-commercial educational and research purposes. Because the two datasets contain personally identifiable information such as faces and license plates, we assure that we (1) use them only for research, and (2) will release our code and models to the public without redistributing the data. We first describe implementation details and then report the experimental results on each dataset (alongside their evaluation metrics) in separate subsections.

4.1 Implementation

We conduct experiments with PyTorch [40] on a single GPU (Nvidia GTX 2080). We train our detectors (based on Faster-RCNN) with Detectron2 [50], using SGD and a learning rate of 5e-3. For data augmentation, we adopt random flipping and resizing. We pre-train our detector on the COCO dataset [34]. As COCO has only RGB images, fine-tuning the pre-trained detector on thermal inputs needs careful pre-processing of thermal images (detailed below).

Pre-processing. All RGB and thermal images have intensity in [0, 255]. In training an RGB-based detector, RGB input images are commonly processed using mean subtraction [50], where the mean values are computed over all the training images. Similarly, we calculate the mean value (135.438) over the thermal training data. We find that using a precise mean subtraction to process thermal images yields better performance when fine-tuning the pre-trained detector.

Stage-wise Training. We fine-tune the pre-trained detector to train the single-modal detectors and the early-fusion detectors. To train a mid-fusion detector, we truncate the already-trained single-modal detectors, concatenate their features, add a new detection head, and train the whole model (Fig. 2a). The late-fusion methods fuse detections from (single-modal) detectors. Note that all the late-fusion


Fig. 5. Detections overlaid on two KAIST testing examples in columns. Top: detections by our mid-fusion model. Bottom: detections by our ProbEn by fusing detections of thermal-only and mid-fusion models. Green, red and blue boxes stand for true positives, false negative (miss-detection) and false positives. Visually, ProbEn performs much better than the mid-fusion model, which is already comparable to the prior work as shown in Tables 1 and 2. (Color figure online)

methods are non-learned. We also experimented with learning-based late-fusion methods (e.g., learning to fuse logits) but found them to be only marginally better than ProbEn (9.08 vs. 9.16 in LAMR using argmax box fusion). Therefore, we focus on the non-learned late-fusion methods in the main paper and study learning-based ones in the supplement.

Post-processing. When ensembling two detectors, we find it crucial to calibrate scores, particularly when we fuse detections from our in-house models and off-the-shelf models released by others. We adopt simple temperature scaling for score calibration [21]. Please refer to the supplement for details.
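For reference, temperature scaling [21] simply divides a detector's logits by a scalar temperature before the softmax; the sketch below is a generic illustration (the temperature value shown is a placeholder, not one used in the paper).

import numpy as np

def calibrate(logits, temperature):
    # Temperature scaling: T > 1 softens, T < 1 sharpens the class posteriors,
    # so that scores from different detectors become comparable before fusion.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

print(calibrate([2.0, -1.0], temperature=1.5))  # placeholder temperature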

4.2 Multimodal Pedestrian Detection on KAIST

Dataset. The KAIST dataset is a popular multimodal benchmark for pedestrian detection [26]. In KAIST, RGB and thermal images are aligned with a beamsplitter and have resolutions of 640 x 480 and 320 x 256, respectively. We resize thermal images to 640 x 480 during training. KAIST also provides day/night tags for breakdown analysis. The original KAIST dataset contains 95,328 RGB-thermal image pairs, which are split into a training set (50,172) and a testing set (45,156). Because the original KAIST dataset contains noisy annotations, the literature introduces cleaned versions of the train/test sets: a sanitized train-set (7,601 examples) [32] and a cleaned test-set (2,252 examples) [35]. We also follow the literature [26] and evaluate under the "reasonable setting", ignoring annotated persons that are occluded (tagged by KAIST) or too small. Detection performance is measured by the log-average miss rate (LAMR): true positives require IoU > 0.5 with a ground-truth box [26], false positives are detections that do not match any ground-truth, and false negatives are miss-detections.
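As a rough sketch of the standard LAMR protocol (ours, following the usual pedestrian-detection convention rather than the exact benchmark script): sample the miss rate at nine false-positives-per-image (FPPI) points spaced evenly in log-space over [10^-2, 10^0] and take the geometric mean.

import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    # fppi, miss_rate: parallel arrays traced out by sweeping the detection
    # score threshold, with fppi in increasing order.
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    samples = []
    for ref in np.logspace(-2.0, 0.0, num=9):
        below = np.where(fppi <= ref)[0]
        m = miss_rate[below[-1]] if len(below) else miss_rate[0]
        samples.append(max(m, 1e-10))          # avoid log(0)
    return float(np.exp(np.mean(np.log(samples))))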


Ablation Study on KAIST. Table 1 shows ablation studies on KAIST. Single-modal detectors tend to work well in different environments, with RGB detectors working well on well-lit day images and Thermal working well on nighttime images. EarlyFusion reduces the miss rate by a modest amount, while MidFusion is more effective. Naive strategies for late fusion (such as pooling together detections from different modalities) are quite poor because they generate many repeated detections on the same object, which are counted as false positives. Interestingly, simple NMS, which uses max score fusion and argmax box fusion, is quite effective at removing overlapping detections from different modalities, already outperforming Early and MidFusion. Instead of suppressing the weaker modality, one might average the scores of overlapping detections, but this is quite ineffective because it always decreases the score relative to NMS. Intuitively, one should increase the score when different modalities agree on a detection. ProbEn accomplishes this by probabilistic integration of information from the RGB and Thermal single-modal detectors. Moreover, it can be further improved by probabilistically fusing the coordinates of overlapping boxes. Lastly, ProbEn3, which ensembles three models (RGB, Thermal and MidFusion), performs the best.

Qualitative Results are displayed in Fig. 5. Visually, ProbEn detects all persons, while the MidFusion model has multiple false negatives / miss-detections.

Quantitative Comparison on KAIST

Compared Methods. Among many prior methods, we particularly compare against four recent ones: AR-CNN [57], MBNet [58], MLPD [28], and GAFF [55]. AR-CNN focuses on weakly-unaligned RGB-thermal pairs and explores multiple heuristic methods for fusing features, scores and boxes. MBNet addresses modality imbalance w.r.t. illumination and features to improve detection. Both MLPD and GAFF are mid-fusion methods that design sophisticated network architectures; MLPD adopts aggressive data augmentation techniques and GAFF extensively exploits attentive modules to fuse multimodal features. Table 2 lists more methods.

Results. Table 2 compares ProbEn against the prior work. ProbEn3, which ensembles three models trained in-house (RGB, Thermal, and MidFusion), achieves competitive performance (7.95 LAMR) against the prior art. When replacing our MidFusion detector with off-the-shelf mid-fusion detectors [28,55], ProbEn3 significantly outperforms all the existing methods, boosting the performance from the prior art's 6.48 to 5.14! This clearly shows that ProbEn works well even when the conditional independence assumption does not hold, i.e., when fusing outputs from other fusion methods (both off-the-shelf and trained in-house). As a non-learned solution that performs better than past work, we argue that ProbEn should serve as a new baseline for future research on multimodal detection.


Table 2. Benchmarking on KAIST measured by % LAMR↓. We report numbers from the respective papers. Results are comparable to Table 1. Simple probabilistic ensembling of independently-trained detectors (ProbEn) outperforms 9 of the 12 methods on the leaderboard. In fact, even NMS (MaxFusion) outperforms 8 of the 12 methods, indicating the under-appreciated effectiveness of detector ensembling as a multimodal fusion technique. Performance further increases when adding a MidFusion detector to the probabilistic ensemble (ProbEn3). Replacing our in-house MidFusion with the off-the-shelf mid-fusion detectors MLPD [28] and GAFF [55] significantly boosts the state of the art from 6.48 to 5.14! This shows ProbEn remains effective even when fusing models for which conditional independence does not hold.

Method              Day    Night  All
HalfwayFusion [36]  36.84  35.49  36.99
RPN+BDT [30]        30.51  27.62  29.83
TC-DET [4]          34.81  10.31  27.11
IATDNN [20]         27.29  24.41  26.37
IAF R-CNN [33]      21.85  18.96  20.95
SyNet [2]           22.64  15.80  20.19
CIAN [56]           14.77  11.13  14.12
MSDS-RCNN [32]      12.22  7.82   10.89
AR-CNN [57]         9.94   8.38   9.34
MBNet [58]          8.28   7.86   8.13
MLPD [28]           7.95   6.95   7.58
GAFF [55]           8.35   3.46   6.48
MaxFusion (NMS)     13.25  6.42   10.78
ProbEn              9.93   5.41   8.50
ProbEn3             9.07   4.89   7.66
ProbEn3 w/ MLPD     7.81   5.02   6.76
ProbEn3 w/ GAFF     6.04   3.59   5.14

4.3 Multimodal Object Detection on FLIR

Dataset. The FLIR dataset [17] consists of RGB images (captured by a FLIR BlackFly RGB camera with 1280 x 1024 resolution) and thermal images (acquired by a FLIR Tau2 thermal camera with 640 x 512 resolution). We resize all images to a resolution of 640 x 512. FLIR has 10,228 unaligned RGB-thermal image pairs and annotates only the thermal images (Fig. 4). Image pairs are split into a train-set (8,862 images) and a validation set (1,366 images). FLIR evaluates on three classes, which have imbalanced examples [8,12,27,38,54]: 28,151 persons, 46,692 cars, and 4,457 bicycles. Following [54], we remove 108 thermal images in the val-set that do not have RGB counterparts. For breakdown analysis w.r.t. day/night scenes, we manually tag the validation images with "day" (768) and "night" (490). We will release our annotations to the public.


Misaligned Modalities. Because FLIR's RGB and thermal images are heavily unaligned, it labels only the thermal images and does not have RGB annotations. We can still train Early and MidFusion models using multimodal inputs and the thermal annotations. These detectors might learn to internally align the unaligned modalities so as to predict bounding boxes according to the thermal annotations. Because we do not have an RGB-only detector, our ProbEn ensembles the EarlyFusion, MidFusion, and thermal-only detectors.

Metric. We measure performance using Average Precision (AP) [16,46]. Precision is computed over testing images within a single class, with true positives that overlap ground-truth bounding boxes (e.g., IoU>0.5). Computing the average precision (AP) across all classes measures performance in multi-class object detection. Following [8,12,27,38,54], we define a true positive as a detection that overlaps a ground-truth with IoU>0.5. Note that the AP used in the multimodal detection literature is different from mAP [34], which averages over different APs computed with different IoU thresholds.

Table 3. Ablation study on FLIR with a breakdown over day/night scenes (AP↑ in percentage with IoU>0.5). Compared to the thermal-only detector, incorporating RGB via EarlyFusion and MidFusion notably improves performance. Late-fusion (lower panel) ensembles three detectors: Thermal, EarlyFusion and MidFusion. All the explored late-fusion methods lead to better performance than MidFusion. In particular, ProbEn performs the best. Moreover, similar to the results on KAIST, using predicted uncertainty to fuse boxes (v-avg) performs better than the other two heuristic box-fusion methods: avg, which naively averages box coordinates, and s-avg, which uses classification scores to compute a weighted average of box coordinates.

Baselines     Day    Night  All
Thermal       75.35  82.90  79.24
EarlyFusion   77.37  79.56  78.80
MidFusion     79.37  81.64  80.53
Pooling       52.57  55.15  53.66

Score-fusion  Box-fusion  Day    Night  All
max           argmax      81.91  84.42  83.14
max           avg         81.84  84.62  83.21
max           s-avg       81.85  84.48  83.19
max           v-avg       81.80  85.07  83.31
avg           argmax      81.34  84.69  82.65
avg           avg         81.26  84.81  82.91
avg           s-avg       81.26  84.72  82.89
avg           v-avg       81.26  85.39  83.03
ProbEn3       argmax      82.19  84.73  83.27
ProbEn3       avg         82.19  84.91  83.63
ProbEn3       s-avg       82.20  84.84  83.61
ProbEn3       v-avg       82.21  85.56  83.76


Fig. 6. Detections overlaid on two FLIR testing images (in columns) with RGB (top) and thermal images (middle and bottom). To avoid clutter, we do not mark class labels for the bounding boxes. Ground-truth annotations are shown on the RGB images, emphasizing that the RGB and thermal images are strongly unaligned. On the thermal images, we compare the thermal-only (middle row) and our ProbEn (bottom row) models. Green, red and blue boxes stand for true positives, false negatives (miss-detections) and false positives. In particular, in the second column, the thermal-only model has many false negatives (miss-detections), which are "bicycles". Understandably, thermal cameras do not capture bicycles well because they do not emit heat. In contrast, RGB captures bicycle signatures better than thermal. This explains why our fusion performs better on bicycles.

Ablation Study on FLIR. We compare our fusion methods in Table 3, along with qualitative results in Fig. 6. We analyze results using our day/night tags. Compared to the single-modal detector (Thermal), our learning-based early-fusion (EarlyFusion) and mid-fusion (MidFusion) models produce better performance. MidFusion outperforms EarlyFusion, implying that end-to-end learning of feature fusion better handles the mis-alignment between RGB and thermal images. By applying late-fusion methods to the detections of the Thermal, EarlyFusion and MidFusion detectors, we further boost detection performance. Note that typical ensembling methods in the single-modal (RGB) detection literature [32,36,52] often use max/average score fusion and argmax/average box fusion, which are outperformed by our ProbEn. This suggests that ProbEn is potentially a better ensembling method for object detection in general.

Quantitative Comparison on FLIR

Compared Methods. We compare against prior methods including ThermalDet [8], BU [27], ODSC [38], MMTOD [12], CFR [54], and GAFF [55]. As FLIR does not have aligned


Table 4. Benchmarking on FLIR measured by AP↑ in percentage with IoU>0.5, with a breakdown over the three categories. Perhaps surprisingly, end-to-end training on thermal already outperforms all the prior methods, presumably because of using a better pre-trained model (Faster-RCNN). Importantly, our ProbEn increases AP from the prior art's 74.6% to 84.4%! These results are comparable to Table 3.

Method            Bicycle  Person  Car    All
MMTOD-CG [12]     50.26    63.31   70.63  61.40
MMTOD-UNIT [12]   49.43    64.47   70.72  61.54
ODSC [38]         55.53    71.01   82.33  69.62
CFR3 [54]         55.77    74.49   84.91  72.39
BU(AT,T) [27]     56.10    76.10   87.00  73.10
BU(LT,T) [27]     57.40    75.60   86.50  73.20
GAFF [55]         -        -       -      72.90
ThermalDet [8]    60.04    78.24   85.52  74.60
Thermal           62.63    84.04   87.11  79.24
EarlyFusion       63.43    85.27   87.69  78.80
MidFusion         69.80    84.16   87.63  80.53
ProbEn3           73.49    87.65   90.14  83.76

RGB-thermal images and only annotates thermal images, many methods exploit domain adaptation to adapt a pre-trained RGB detector to thermal input. For example, MMTOD [12] and ODSC [38] adopt image-to-image translation techniques [53,59] to generate RGB from thermal, hypothesizing that this helps train a better multimodal detector by fine-tuning a detector pre-trained on large-scale RGB images. BU [27] operates such translation/adaptation on features, encouraging thermal features to be similar to RGB features. ThermalDet [8] exclusively exploits thermal images and ignores RGB images; it proposes to combine features from multiple layers for the final detection. GAFF [55] trains on RGB-thermal images with a sophisticated attention module that fuses single-modal features. Perhaps because of the complexity of the attention module, GAFF is limited to using small network backbones (ResNet18 and VGG16). Somewhat surprisingly, to the best of our knowledge, there is no prior work that trains early-fusion or mid-fusion deep networks (Fig. 2a) on heavily unaligned RGB-thermal image pairs (as in FLIR) for multimodal detection. We find that directly training them performs much better than prior work (Table 4).

Results. Table 4 shows that all our methods outperform the prior art. Our single-modal detector (trained on thermal images) achieves slightly better performance than ThermalDet [8], which also exclusively trains on thermal images. This is probably because we use a better pre-trained Faster-RCNN model provided by the excellent Detectron2 toolbox. Surprisingly, our simpler EarlyFusion and MidFusion models achieve big boosts over the thermal-only model (Thermal), with MidFusion performing much better. This confirms our hypothesis that


fusing features better handles mis-alignment of RGB-thermal images than the early-fusion method. Our ProbEn performs the best, significantly better than all compared methods! Notably, our fusion methods boost “bicycle” detection. We conjecture that bicycles do not emit heat to deliver strong signatures in thermal, but are more visible in RGB; fusing them greatly improves bicycle detection.

5 Discussion and Conclusions

We explore different fusion strategies for multimodal detection under both aligned and unaligned RGB-thermal images. We show that non-learned probabilistic fusion, ProbEn, significantly outperforms prior approaches. Key reasons for its strong performance are that (1) it can take advantage of highly-tuned single-modal detectors trained on large-scale single-modal datasets, and (2) it can deal with missing detections from particular modalities, a common occurrence when fusing together detections. One by-product of our diagnostic analysis is the remarkable performance of NMS as a fusion technique, precisely because it exploits the same key insights. Our ProbEn yields >13% relative improvement over prior work, both on aligned and unaligned multimodal benchmarks. Acknowledgement. This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.

References 1. Akiba, T., Kerola, T., Niitani, Y., Ogawa, T., Sano, S., Suzuki, S.: PFDet: 2nd place solution to open images challenge 2018 object detection track. arXiv:1809.00778 (2018) 2. Albaba, B.M., Ozer, S.: SyNet: an ensemble network for object detection in UAV images. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 10227–10234. IEEE (2021) 3. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Mach. Learn. 36(1), 105–139 (1999) 4. Kieu, M., Bagdanov, A.D., Bertini, M., del Bimbo, A.: Task-conditioned domain adaptation for pedestrian detection in thermal imagery. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 546–562. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6 33 5. Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-NMS-improving object detection with one line of code. In: ICCV (2017) 6. Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019) 7. Caesar, H., et al.: nuScenes a multimodal dataset for autonomous driving. In: CVPR (2020) 8. Cao, Y., Zhou, T., Zhu, X., Su, Y.: Every feature counts: an improved one-stage detector in thermal imagery. In: IEEE International Conference on Computer and Communications (ICCC) (2019) 9. Choi, H., Kim, S., Park, K., Sohn, K.: Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks. In: International Conference on Pattern Recognition (ICPR) (2016)

156

Y.-T. Chen et al.

10. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 11. Dawid, A.P.: Conditional independence in statistical theory. J. Roy. Stat. Soc.: Ser. B (Methodol.) 41(1), 1–15 (1979) 12. Devaguptapu, C., Akolekar, N., M Sharma, M., N Balasubramanian, V.: Borrow from anywhere: pseudo multi-modal object detection in thermal imagery. In: CVPR Workshops (2019) 13. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000). https:// doi.org/10.1007/3-540-45014-9 1 14. Doll´ ar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR (2009) 15. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34(4), 743–761 (2011) 16. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015) 17. FLIR: Flir thermal dataset for algorithm training (2018). https://www.flir.in/oem/ adas/adas-dataset-form 18. Freund, Y., et al.: Experiments with a new boosting algorithm. In: ICML, vol. 96, pp. 148–156. Citeseer (1996) 19. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012) 20. Guan, D., Cao, Y., Yang, J., Cao, Y., Yang, M.Y.: Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion 50, 148–157 (2019) 21. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. arXiv:1706.04599 (2017) 22. Guo, R., et al.: 2nd place solution in google ai open images object detection track 2019. arXiv:1911.07171 (2019) 23. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017) 24. Hosang, J., Benenson, R., Schiele, B.: Learning non-maximum suppression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4507–4515 (2017) 25. Huang, Z., Chen, Z., Li, Q., Zhang, H., Wang, N.: 1st place solutions of waymo open dataset challenge 2020–2D object detection track. arXiv:2008.01365 (2020) 26. Hwang, S., Park, J., Kim, N., Choi, Y., So Kweon, I.: Multispectral pedestrian detection: Benchmark dataset and baseline. In: CVPR (2015) 27. Kiew, M.Y., Bagdanov, A.D., Bertini, M.: Bottom-up and layer-wise domain adaptation for pedestrian detection in thermal images. ACM Transactions on Multimedia Computing Communications and Applications (2020) 28. Kim, J., Kim, H., Kim, T., Kim, N., Choi, Y.: MLPD: multi-label pedestrian detector in multispectral domain. IEEE Rob. Auto. Lett. 6(4), 7846–7853 (2021) 29. Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998) 30. Konig, D., Adam, M., Jarvers, C., Layher, G., Neumann, H., Teutsch, M.: Fully convolutional region proposal networks for multispectral person detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 49–56 (2017)

Multimodal Object Detection via Probabilistic Ensembling

157

31. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural. Inf. Process. Syst. 25, 1097–1105 (2012) 32. Li, C., Song, D., Tong, R., Tang, M.: Multispectral pedestrian detection via simultaneous detection and segmentation. arXiv:1808.04818 (2018) 33. Li, C., Song, D., Tong, R., Tang, M.: Illumination-aware faster r-CNN for robust multispectral pedestrian detection. Pattern Recogn. 85, 161–171 (2019) 34. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 35. Liu, J., Zhang, S., Wang, S., Metaxas, D.: Improved annotations of test set of KAIST (2018) 36. Liu, J., Zhang, S., Wang, S., Metaxas, D.N.: Multispectral deep neural networks for pedestrian detection. In: BMVC (2016) 37. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46448-0 2 38. Munir, F., Azam, S., Rafique, M.A., Sheri, A.M., Jeon, M.: Thermal object detection using domain adaptation through style consistency. arXiv:2006.00821 (2020) 39. Nix, D.A., Weigend, A.S.: Estimating the mean and variance of the target probability distribution. In: Proceedings of 1994 IEEE international conference on neural networks (ICNN 1994), vol. 1, pp. 55–60. IEEE (1994) 40. Paszke, A., et al.: Automatic differentiation in Pytorch (2017) 41. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier, San Mateo (2014) 42. Quigley, M., et al.: ROS: an open-source robot operating system. In: ICRA Workshop on Open Source Software, vol. 3, p. 5. Kobe, Japan (2009) 43. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR (2016) 44. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR (2017) 45. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015) 46. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015) 47. Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: ensembling boxes from different object detection models. Image Vis. Comput. 107, 104117 (2021) 48. Valverde, F.R., Hurtado, J.V., Valada, A.: There is more than meets the eye: selfsupervised multi-object detection and tracking with sound by distilling multimodal knowledge. In: CVPR (2021) 49. Wagner, J., Fischer, V., Herman, M., Behnke, S.: Multispectral pedestrian detection using deep fusion convolutional neural networks. In: Proceedings of European Symposium on Artificial Neural Networks (2016) 50. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github. com/facebookresearch/detectron2 (2019) 51. Xu, D., Ouyang, W., Ricci, E., Wang, X., Sebe, N.: Learning cross-modal deep representations for robust pedestrian detection. In: CVPR (2017) 52. Xu, P., Davoine, F., Denoeux, T.: Evidential combination of pedestrian detectors. In: British Machine Vision Conference, pp. 1–14 (2014) 53. Zhang, H., Dana, K.: Multi-style generative network for real-time transfer. arXiv:1703.06953 (2017)

158

Y.-T. Chen et al.

54. Zhang, H., Fromont, E., Lef`evre, S., Avignon, B.: Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In: IEEE International Conference on Image Processing (ICIP) (2020) 55. Zhang, H., Fromont, E., Lef`evre, S., Avignon, B.: Guided attentive feature fusion for multispectral pedestrian detection. In: WACV (2021) 56. Zhang, L., et al.: Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fus. 50, 20–29 (2019) 57. Zhang, L., Zhu, X., Chen, X., Yang, X., Lei, Z., Liu, Z.: Weakly aligned cross-modal learning for multispectral pedestrian detection. In: ICCV (2019) 58. Zhou, K., Chen, L., Cao, X.: Improving multispectral pedestrian detection by addressing modality imbalance problems. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 787–803. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58523-5 46 59. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017) 60. Zitnick, C.L., Doll´ ar, P.: Edge Boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-31910602-1 26

Exploiting Unlabeled Data with Vision and Language Models for Object Detection Shiyu Zhao1(B) , Zhixing Zhang1 , Samuel Schulter2 , Long Zhao3 , B.G Vijay Kumar2 , Anastasis Stathopoulos1 , Manmohan Chandraker2,4 , and Dimitris N. Metaxas1 1

Rutgers University, New Brunswick, USA [email protected] 2 NEC Labs America, San Jose, USA 3 Google Research, Los Angeles, USA 4 UC San Diego, La Jolla, USA

Abstract. Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.

1

Introduction

Recent advances in object detection build on large-scale datasets [17,27,41], which provide rich and accurate human-annotated bounding boxes for many object categories. However, the annotation cost of such datasets is significant. Moreover, the long-tailed distribution of natural object categories makes it even harder to collect sufficient annotations for all categories. Semi-supervised object detection (SSOD) [44,60] and open-vocabulary object detection (OVD) [4,16,54] are two tasks that lower annotation costs by leveraging different forms of unlabeled S. Zhao and Z. Zhang—Equal contribution. Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_10. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 S. Avidan et al. (Eds.): ECCV 2022, LNCS 13669, pp. 159–175, 2022. https://doi.org/10.1007/978-3-031-20077-9_10

160

S. Zhao et al.

Fig. 1. (a) Overview of leveraging the semantic knowledge contained in vision and language models for mining unlabeled data to improve object detection systems for open-vocabulary and semi-supervised tasks. (b) Illustration of the weak localization ability when applying CLIP [37] on raw object proposals (top), compared with our improvements (bottom). The left images show the pseudo label with the highest score. The right images show all pseudo labels with scores greater than 0.8. The proposed scoring gives much cleaner pseudo labels.

data. In SSOD, a small fraction of fully-annotated training images is given along with a large corpus of unlabeled images. In OVD, a fraction of the desired object categories is annotated (the base categories) in all training images and the task is to also detect a set of novel (or unknown) categories at test time. These object categories can be present in the training images, but are not annotated with ground truth bounding boxes. A common and successful approach for leveraging unlabeled data is by generating pseudo labels. However, all prior works on SSOD only leveraged the small set of labeled data for generating pseudo labels, while most prior work on OVD does not leverage pseudo labels at all. In this work, we propose a simple but effective way to mine unlabeled images using recently proposed vision and language (V&L) models to generate pseudo labels for both known and unknown categories, which suits both tasks, SSOD and OVD. V&L models [23,29,37] can be trained from (noisy) image caption pairs, which can be obtained at a large scale without human annotation efforts by crawling websites for images and their alt-texts. Despite the noisy annotations, these models demonstrate excellent performance on various semantic tasks like zero-shot classification or image-text retrieval. The large amount of diverse images, combined with the free-form text, provides a powerful source of information to train robust and generic models. These properties make vision and language models an ideal candidate to improve existing object detection pipelines that leverage unlabeled data, like OVD or SSOD, see Fig. 1(a). Specifically, our approach leverages the recently proposed vision and language model CLIP [37] to generate pseudo labels for object detection. We first predict region proposals with a two-stage class-agnostic proposal generator which was trained with limited ground truth (using only known base categories in OVD and only labeled images in SSOD), but generalizes to unseen categories. For each region proposal, we then obtain a probability distribution over the desired object

Exploiting Unlabeled Data with V&L Models

161

categories (depending on the task) with the pre-trained V&L model CLIP [37]. However, as shown in Fig. 1(b), a major challenge of V&L models is their rather low object localization quality, also observed in [57]. To improve localization, we propose two strategies in which the two-stage proposal generator helps the V&L model: (1) fusing CLIP scores with the objectness scores of the two-stage proposal generator, and (2) removing redundant proposals by repeated application of the localization head (2nd stage) of the proposal generator. Finally, the generated pseudo labels are combined with the original ground truth to train the final detector. We name our method V&L-guided Pseudo-Label Mining (VL-PLM). Extensive experiments demonstrate that VL-PLM successfully exploits the unlabeled data for open-vocabulary detection and outperforms the state-of-the-art ViLD [16] on novel categories by +6.8 AP on the COCO dataset [32]. Moreover, VL-PLM improves the performance on known categories in SSOD and beats the popular baseline STAC [44] by a clear margin, simply by replacing its pseudo labels with ours. In addition, we conduct various ablation studies on the properties of the generated pseudo labels and analyze the design choices of our proposed method. We also believe that VL-PLM can be further improved with better V&L models like ALIGN [23] or ALBEF [29]. The contributions of our work are as follows: (1) We leverage V&L models to improve object detection frameworks by generating pseudo labels on unlabeled data. (2) A simple but effective strategy to improve the localization quality of pseudo labels scored with the V&L model CLIP [37]. (3) State-of-the-art results for novel categories in the COCO open-vocabulary detection setting. (4) We showcase the benefits of VL-PLM in a semi-supervised object detection setting.
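The pipeline can be summarized with the Python sketch below. The callables rpn, roi_head, and vl_scorer are placeholders for the class-agnostic proposal generator and the CLIP-based region classifier detailed in Sect. 3; the number of refinement steps and the confidence threshold follow values discussed later in the paper, while the NMS IoU value is an assumption.

from torchvision.ops import nms

def mine_pseudo_labels(image, rpn, roi_head, vl_scorer, category_names,
                       n_refine=10, tau=0.8, nms_iou=0.5):
    boxes, objectness = rpn(image)                   # class-agnostic proposals + RPN scores
    for _ in range(n_refine):                        # repeated RoI-head refinement pushes
        boxes = roi_head(boxes, image)               # redundant boxes onto the same object
    probs = vl_scorer(image, boxes, category_names)  # (N, C) distribution per proposal
    conf, labels = probs.max(dim=1)
    scores = 0.5 * (objectness + conf)               # fuse objectness with the V&L score
    keep = scores >= tau                             # keep only confident pseudo labels
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, nms_iou)               # remove duplicate boxes
    return boxes[keep], labels[keep], scores[keep]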

2

Related Work

The goal of our work is to improve object detection systems by leveraging unlabeled data via vision and language models that carry rich semantic information. Vision & Language (VL) Models: Combining natural language and images has enabled many valuable applications in recent years, like image captioning [2,7, 12,25], visual question answering [1,13,20,30,36,55], referring expression comprehension [8,24,26,34,35,52,53], image-text retrieval [29,37,47] or language-driven embodied AI [3,9]. While early works proposed task-specific models, generic representation learning from vision and language inputs has gained more attention [8,19,33,34,45]. Most recent works like CLIP [37] or ALIGN [23] also propose generic vision and language representation learning approaches, but have significantly increased the scale of training data, which led to impressive results in tasks like zero-shot image classification or image-text retrieval. The training data consist of image and text pairs, typically crawled from the web at a very large scale (400M for [37] and 1.2B for [23]), but without human annotation effort. In our work, we leverage such pre-trained models to mine unlabeled data and to generate pseudo labels in the form of bounding boxes, suitable for object detection. One challenge with using such V&L models [23,37] is their limited capability in localizing objects (recall Fig. 1(b)), likely due to the lack of region-word alignment in

162

S. Zhao et al.

the image-text pairs of their training data. In Sect. 3.2, we show how to improve localization quality with our proposal generator. Vision & Language Models for Dense Prediction Tasks: The success of CLIP [37] (and others [23,29]) has motivated the extension of zero-shot classification capabilities to dense image prediction tasks like object detection [16,21, 42,54] or semantic segmentation [28,39,50,59]. These works try to map features of individual objects (detection) or pixels (segmentation) into the joint visionlanguage embedding space provided by models like CLIP. For example, ViLD [16] trains an object detector in the open-vocabulary regime by predicting the text embedding (from the CLIP text-encoder) of the category name for each image region. LSeg [28] follows a similar approach, but is applied to zero-shot semantic segmentation. Both works leverage task-specific insights and do not generate explicit pseudo labels. In contrast, our proposed VL-PLM is more generic by generating pseudo labels, thus enabling also other tasks like semi-supervised object detection [44]. Similar to our work, both Gao et al. [14] and Zhong et al. [57] generate explicit pseudo labels in the form of bounding boxes. In [14], the attention maps of a pretrained V&L model [29] between words of a given caption and image regions are used together with object proposals to generate pseudo labels. In contrast, our approach does not require image captions as input and we use only unlabeled images, while still outperforming [14] in an open-vocabulary setting on COCO. RegionCLIP [57] assigns semantics to region proposals via a pretrained V&L model, effectively creating pseudo labels in the form of bounding boxes. While our approach uses such pseudo labels directly for training object detectors, [57] uses them for fine-tuning the original V&L model, which then builds the basis for downstream tasks like open-vocabulary detection. We believe this contribution is orthogonal to ours as it effectively builds a better starting point of the V&L model, and can be incorporated into our framework as well. Interestingly, even without the refined V&L model, we show improved accuracy with pseudo labels specifically for novel categories as shown in Sect. 4.1. The main focus of all the aforementioned works is to enable the dynamic expansion of the label space and to recognize novel categories. While our work also demonstrates state-of-the-art results in this open-vocabulary setting, where we mine unlabeled data for novel categories, we want to stress that our pseudo labels are applicable more generally. In particular, we also use a V&L model to mine unlabeled images for known categories in a semi-supervised object detection setting. Furthermore, by building on the general concept of pseudo labels, our approach may be extended to other dense prediction tasks like semantic segmentation in future works as well. Object Detection From Incomplete Annotations: Pseudo labels are proven useful in many recent object detection methods trained with various forms of weak annotations: semi-supervised detection [44,60], unsupervised object discovery [43], open-vocabulary detection [14,57], weakly-supervised detection [10,58], unsupervised domain adaptation [22,51] or multi-dataset detection [56]. In all cases, an initial model trained from base information is applied on the training data to obtain the missing information. Our main proposal is to leverage V&L

Exploiting Unlabeled Data with V&L Models

163

models to improve these pseudo labels and have one unified way of improving the accuracy in multiple settings, see Sect. 3.3. In this work, we focus on two important forms of weak supervision: zero-shot/open-vocabulary detection (OVD) and semi-supervised object detection (SSOD). In zero-shot detection [4] a model is trained from a set of base categories. Without ever seeing any instance of a novel category during training, the model is asked to predict novel categories, typically via association in a different embedding space, like attribute or text embeddings. Recent works [16,38,54] relax the setting to include novel categories in the training data, but without bounding box annotations, which also enables V&L models to be used (via additional images that come with caption data). ViLD [16], as described above, uses CLIP [37] with model distillation losses to make predictions in the joint vision-text embedding space. In contrast, we demonstrate that explicitly creating pseudo labels for novel categories via mining the training data can significantly improve the accuracy, see Sect. 4.1. The second task we focus on is semisupervised object detection (SSOD), where a small set of images with bounding box annotations and a large set of unlabeled images are given. In contrast to OVD, the label space does not change from train to test time. A popular and recent baseline that builds on pseudo labels is STAC [44]. This approach employs a consistency loss between predictions on a strongly augmented image and pseudo labels computed on the original image. We demonstrate the benefit of leveraging V&L models to improve the pseudo label quality in such a framework. Other works on SSOD, like [49,60] propose several orthogonal improvements which can be incorporated into our framework as well. In this work, however, we focus purely on the impact of the pseudo labels. Finally, note that our concepts may also be applicable to other tasks beyond open-vocabulary and semi-supervised object detection, but we leave this for future work.

3

Method

The goal of our work is to mine unlabeled images with vision & language (V&L) models to generate semantically rich pseudo labels (PLs) in the form of bounding boxes so that object detectors can better leverage unlabeled data. We start with a generic training strategy for object detectors with the unlabeled data in Sect. 3.1. Then, Sect. 3.2 describes the proposed VL-PLM for pseudo label generation. Finally, Sect. 3.3 presents specific object detection tasks with our PLs. 3.1

Training Object Detectors with Unlabeled Data

Unlabeled data comes in many different forms for object detectors. In semisupervised object detection, we have a set of fully-labeled images IL with annotations for the full label space S, as well as unlabeled images IU , with IL ∩IU = ∅. In open-vocabulary detection, we have partly-labeled images with annotations for the set of base categories SB , but without annotations for the unknown/novel categories SN . Note that partly-labeled images are therefore contained in both IL and IU , i.e., IL = IU . A popular and successful approach to learn from unlabeled data is via pseudo labels. Recent semi-supervised object detection methods follow this approach by

164

S. Zhao et al.

first training a teacher model on the limited ground truth data, then generating pseudo labels for the unlabeled data, and finally training a student model. In the following, we describe a general training strategy for object detection to handle different forms of unlabeled data. We define a generic loss function for an object detector with parameters θ over both labeled and unlabeled images as

L(\theta, I) = \frac{1}{N_I} \sum_{i=1}^{N_I} \Big( [I_i \in I_L]\, l_s(\theta, I_i) + \alpha\, [I_i \in I_U]\, l_u(\theta, I_i) \Big),   (1)

where α is a hyperparameter that balances the supervised loss l_s and the unsupervised loss l_u, and [·] is the indicator function returning either 0 or 1 depending on the condition. Note again that I_i can be contained in both I_L and I_U. Object detection ultimately is a set prediction problem, and to define a loss function, the set of predictions (class probabilities and bounding box estimates) needs to be matched with the set of ground truth boxes. Different options exist to find a matching [6,18], but it is mainly defined by the similarity (IoU) between predicted and ground truth boxes. We define the matching for prediction i as σ(i), which returns a ground truth index j if successfully matched or nil otherwise. The supervised loss l_s contains a standard cross-entropy loss l_cls for the classification and an \ell_1 loss l_reg for the box regression. Given an image I ∈ I, we define l_s as

l_s(\theta, I) = \frac{1}{N^*} \sum_i \Big( l_{cls}\big(C_i^\theta(I), c^*_{\sigma(i)}\big) + [\sigma(i) \neq nil]\, l_{reg}\big(T_i^\theta(I), t^*_{\sigma(i)}\big) \Big),   (2)

where N^* is the number of predicted bounding boxes. C_i^\theta(·) and T_i^\theta(·) are the predicted class distributions and bounding boxes of the object detector. The corresponding (matched) ground truth is denoted as c^*_{\sigma(i)} and t^*_{\sigma(i)}, respectively. The unsupervised loss l_u is defined similarly, but uses pseudo labels with high confidence as supervision signals:

l_u(\theta, I) = \frac{1}{N^u} \sum_i [\max(p^u_{\sigma(i)}) \geq \tau] \cdot \Big( l_{cls}\big(C_i^\theta(I), \hat{c}^u_{\sigma(i)}\big) + [\sigma(i) \neq nil]\, l_{reg}\big(T_i^\theta(I), t^u_{\sigma(i)}\big) \Big).   (3)

Here, p^u_{\sigma(i)} denotes the probability distribution over the label space of the pseudo label matched with prediction i, and N^u is the number of adopted pseudo labels, i.e., N^u = \sum_i [\max(p^u_{\sigma(i)}) \geq \tau]. The pseudo labels for the classification and the box regression losses are \hat{c}^u_{\sigma(i)} = \arg\max(p^u_{\sigma(i)}) and t^u_{\sigma(i)}, respectively. The key to successfully training object detectors from unlabeled data is accurate pseudo labels. In the next section, we present our approach, VL-PLM, which leverages V&L models as external models to exploit unlabeled data for generating pseudo labels.
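As a concrete reading of Eq. (3), the following PyTorch snippet computes the unsupervised loss for one image, assuming predictions have already been matched to pseudo labels. It is a simplified sketch (unmatched predictions simply contribute no regression term) rather than the exact training code.

import torch
import torch.nn.functional as F

def unsupervised_loss(cls_logits, box_preds, pl_probs, pl_boxes, matched, tau=0.8):
    # cls_logits: (N, C) predicted class logits, box_preds: (N, 4) predicted boxes
    # pl_probs:   (N, C) distribution of the matched pseudo label p^u_{sigma(i)}
    # pl_boxes:   (N, 4) matched pseudo boxes, matched: (N,) bool mask, sigma(i) != nil
    conf, pl_classes = pl_probs.max(dim=1)
    adopt = (conf >= tau).float()                     # [max(p^u) >= tau]
    n_u = adopt.sum().clamp(min=1.0)                  # N^u, number of adopted pseudo labels
    cls_term = F.cross_entropy(cls_logits, pl_classes, reduction="none")
    reg_term = F.l1_loss(box_preds, pl_boxes, reduction="none").sum(dim=1)
    return (adopt * (cls_term + matched.float() * reg_term)).sum() / n_u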

3.2 VL-PLM: Pseudo Labels from Vision & Language Models

V&L models are trained on large scale datasets with image-text pairs that cover a diverse set of image domains and rich semantics in natural text. Moreover, the

Exploiting Unlabeled Data with V&L Models

165

Fig. 2. Overview of the proposed VL-PLM to mine unlabeled images with vision & language models to generate pseudo labels for object detection. The top part illustrates our class-agnostic proposal generator, which improves the pseudo label localization by using the class-agnostic proposal score and the repeated application of the RoI head. The bottom part illustrates the scoring of cropped regions with the V&L model based on the target category names. The chosen category names can be adjusted for the desired downstream task. After thresholding and NMS, we get the final pseudo labels. For some tasks like SSOD, we will merge external pseudo labels for a teacher model with ours before thresholding and NMS.

image-text pairs can be obtained without costly human annotation by using webcrawled data (images and corresponding alt-texts) [23,37]. Thus, V&L models are ideal sources of external knowledge to generate pseudo labels for arbitrary categories, which can be used for downstream tasks like open-vocabulary or semi-supervised object detection. Overview: Figure 2 illustrates the overall pipeline of our pseudo label generation with the recent V&L model CLIP [37]. We first feed an unlabeled image into our two-stage class-agnostic detector (described in the next section below) to obtain region proposals. We then crop image patches based on those regions and feed them into the CLIP image-encoder to obtain an embedding in the CLIP visionand-language space. Using the corresponding CLIP text-encoder and template text prompts, we generate embeddings for category names that are desired for the specific task. For each region, we compute the similarities between the region embedding and the text embeddings via a dot product and use softmax to obtain a distribution over the categories. We then generate the final pseudo labels using scores from both class-agnostic detector and V&L model, which we describe in detail below. There are two key challenges in our framework: (1) Generating robust proposals for novel categories, required by open-vocabulary detection, and (2) overcoming the poor localization quality of the raw CLIP model, see Fig. 1(b).

166

S. Zhao et al.

Fig. 3. (a) RPN scores indicate localization quality. Top: Top 50 boxes from RPN in an image which correctly locates nearly all objects. Bottom: A positive correlation between RPN and IoU scores for RPN boxes of 50 randomly sampled COCO images. The correlation coefficient is 0.51. (b) Box refinement by repeating RoI head. “×N” indicates how many times we repeat the RoI head.

We introduce simple but effective solutions to address the two challenges in the following. Generating Robust and Class-Agnostic Region Proposals: To benefit tasks like open vocabulary detection with the unlabeled data, the proposal generator should be able to locate not only objects of categories seen during training but also of objects of novel categories. While unsupervised candidates like selective search [46] exist, these are often time-consuming and generate many noisy boxes. As suggested in prior studies [16,54], the region proposal network (RPN) of a two-stage detector generalizes well for novel categories. Moreover, we find that the RoI head is able to improve the localization of region proposals, which is elaborated in the next section. Thus, we train a standard two-stage detector, e.g., Faster-RCNN [40], as our proposal generator using available ground truth, which are annotations of base categories for open vocabulary detection and annotations from the small fraction of annotated images in semi-supervised detection. To further improve the generalization ability, we ignore the category information of the training set and train a class-agnostic proposal generator. Please refer to Sect. 4.3 and the supplement for a detailed analysis of the proposal generator. Generating Pseudo Labels with a V&L Model: Directly applying CLIP [37] on cropped region proposals yields low localization quality, as was observed in Fig. 1(b) and also in [57]. Here, we demonstrate how to improve the localization ability with our two-stage class-agnostic proposal generator in two ways. Firstly, we find that the RPN score is a good indicator for localization quality of region proposals. Figure 3(a) illustrates a positive correlation between RPN and IoU scores. We leverage this observation and average the RPN score with those of the CLIP predictions. Secondly, we remove thresholding and NMS of the proposal generator and feed proposal boxes into the RoI head multiple times, similar to [5]. We observe that it pushes redundant boxes closer to each other by repeating the

Exploiting Unlabeled Data with V&L Models

167

RoI head, which can be seen in Fig. 3(b). In this way, we obtain better-located bounding boxes and thus better pseudo labels. Please refer to Sect. 4.3 for a corresponding empirical analysis. To further improve the quality of our pseudo labels, we adopt the multi-scale region embedding from CLIP as described in [16]. Moreover, as suggested in [44], we employ a high threshold to pick pseudo labels with high confidence. The confidence score of the pseudo label for the region R_i is formulated as

c_i^u = [s_i^u \geq \tau] \cdot s_i^u, \quad \text{with} \quad s_i^u = \frac{S_{RPN}(R_i) + \max(p_i^u)}{2},   (4)

where S_{RPN}(·) denotes the RPN score. The prediction probability distribution p_i^u is defined as

p_i^u = \mathrm{softmax}\big\{ \phi\big(E_{im}(R_i) + E_{im}(R_i^{1.5\times})\big) \cdot E_{txt}(\mathrm{Categories})^T \big\}.   (5)

Here, R_i^{1.5×} is a region cropped at 1.5× the size of R_i. E_im and E_txt are the image and text encoders of CLIP, respectively, and φ(x) = x/||x||. If c_i^u = 0, we exclude R_i from our pseudo labels.
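A sketch of Eqs. (4)–(5) using the public CLIP package is shown below. The prompt template is an assumption (the paper only mentions template text prompts), the released CLIP model applies a learned logit scale that Eq. (5) leaves implicit, and the RPN scores are taken as given.

import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

def score_regions(image: Image.Image, boxes, rpn_scores, category_names, device="cuda"):
    model, preprocess = clip.load("RN50", device=device)
    prompts = [f"a photo of a {c}" for c in category_names]   # assumed prompt template
    text = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        txt = model.encode_text(text)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores = []
        for (x1, y1, x2, y2), s_rpn in zip(boxes, rpn_scores):
            crops = []
            for scale in (1.0, 1.5):                  # R_i and the enlarged R_i^{1.5x}
                cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
                w, h = (x2 - x1) * scale, (y2 - y1) * scale
                box = (int(cx - w / 2), int(cy - h / 2), int(cx + w / 2), int(cy + h / 2))
                crops.append(preprocess(image.crop(box)))
            emb = model.encode_image(torch.stack(crops).to(device)).sum(dim=0)
            emb = emb / emb.norm()                    # phi(E_im(R_i) + E_im(R_i^{1.5x}))
            p = (emb @ txt.T).softmax(dim=-1)         # Eq. (5)
            scores.append(0.5 * (float(s_rpn) + p.max().item()))  # Eq. (4), before thresholding
    return scores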

3.3 Using Our Pseudo Labels for Downstream Tasks

Finally, we briefly describe how we use the pseudo labels generated from unlabeled data for the two specific downstream tasks that we focus on in this work. Open-Vocabulary Detection: In this task, the detector has access to images with annotations for base categories and needs to generalize to novel categories. We leverage the data of the base categories to train a class-agnostic Mask R-CNN as our proposal generator, and take the names of the novel categories as the input texts of the CLIP text-encoder in the aforementioned pseudo label generation process. Then, we train a standard Mask R-CNN with ResNet50-FPN [31] on both the base ground truth and the novel pseudo labels as described in Sect. 3.1. Semi-supervised Object Detection: In this task, relevant methods usually train a teacher model using ground truth from the limited set of labeled images, and then generate pseudo labels with the teacher on the unlabeled images. We also generate those pseudo labels and merge them with the pseudo labels from our VL-PLM. Please refer to the supplementary document for details. Thus, the student model is trained on the available ground truth and on pseudo labels from both our V&L-based approach and the teacher model.
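The exact merging step for SSOD is detailed in the supplement; one plausible realization, sketched below, simply pools the two sets of pseudo labels and reuses the same confidence threshold and class-wise NMS as in the rest of the pipeline (the NMS IoU value is an assumption).

import torch
from torchvision.ops import nms

def merge_pseudo_labels(vl, teacher, tau=0.8, nms_iou=0.5):
    # vl and teacher are (boxes, scores, labels) tuples of tensors produced by
    # VL-PLM and by the teacher model, respectively.
    boxes = torch.cat([vl[0], teacher[0]])
    scores = torch.cat([vl[1], teacher[1]])
    labels = torch.cat([vl[2], teacher[2]])
    keep = scores >= tau
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = []
    for c in labels.unique():                         # class-wise NMS over the pooled set
        idx = (labels == c).nonzero(as_tuple=True)[0]
        kept.append(idx[nms(boxes[idx], scores[idx], nms_iou)])
    kept = torch.cat(kept) if kept else torch.zeros(0, dtype=torch.long)
    return boxes[kept], scores[kept], labels[kept]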

4

Experiments

We experimentally evaluate the proposed VL-PLM first on open-vocabulary detection in Sect. 4.1 and then on semi-supervised object detection in Sect. 4.2. In Sect. 4.3 we ablate various design choices of VL-PLM.

168

S. Zhao et al.

Table 1. Evaluations for open vocabulary detection on the COCO 2017 [32]. RegionCLIP* indicates a model without refinement using image-caption pairs.

Method | Training source | Novel AP | Base AP | Overall AP
Bansal et al. [4] | instance-level labels in SB | 0.31 | 29.2 | 24.9
Zhu et al. [61] | instance-level labels in SB | 3.41 | 13.8 | 13.0
Rahman et al. [38] | instance-level labels in SB | 4.12 | 35.9 | 27.9
OVR-CNN [54] | image-caption pairs in SB ∪ SN, instance-level labels in SB | 22.8 | 46.0 | 39.9
Gao et al. [14] | image-caption pairs in SB ∪ SN, instance-level labels in SB | 30.8 | 46.1 | 42.1
RegionCLIP [57] | raw image-text pairs via Internet, image-caption pairs in SB ∪ SN, instance-level labels in SB | 31.4 | 57.1 | 50.4
RegionCLIP* [57] | raw image-text pairs via Internet, instance-level labels in SB | 14.2 | 52.8 | 42.7
ViLD [16] | raw image-text pairs via Internet, instance-level labels in SB | 27.6 | 59.5 | 51.3
VL-PLM (Ours) | raw image-text pairs via Internet, instance-level labels in SB | 34.4 | 60.2 | 53.5

4.1 Open-Vocabulary Object Detection

In this task, we have a training set with annotations for known base categories SB. Our goal is to train a detector for novel categories SN. Usually, the labeled images IL and the unlabeled images IU are the same, i.e., IL = IU. Experimental Setup: Following prior studies [4,14,16,54], we base our evaluation on COCO 2017 [32] in the zero-shot setting (COCO-ZS), where there are 48 known base categories and 17 unknown novel categories. Images from the training set are regarded as labeled for base classes and also as unlabeled for novel classes. We take the widely adopted mean Average Precision at an IoU of 0.5 (AP50) as the metric and mainly compare our method with ViLD [16], the state-of-the-art method for open vocabulary detection. Thus, we follow ViLD and report AP50 over novel categories, base categories, and all categories as Novel AP, Base AP, and Overall AP, respectively. Our supplemental material contains results for the LVIS [17] dataset. Implementation Details: We set an NMS threshold of 0.3 for the RPN of the proposal generator. The confidence threshold for pseudo labels (PLs) is τ = 0.8. With these settings, we obtain an average of 4.09 PLs per image, which achieve a Novel AP of 20.9. We use the above hyperparameters for pseudo label generation in all experiments, unless otherwise specified. The proposal generator and the final detector were implemented in Detectron2 [48] and trained on a server with NVIDIA A100 GPUs. The proposal generator was trained for 90,000 iterations with a batch size of 16. Similar to ViLD, the final detector is trained from scratch for 180,000 iterations with an input size of 1024 × 1024, large-scale jitter augmentation [15], a batch size of 128 with synchronized batch normalization, a weight decay of 4e-5, and an initial learning rate of 0.32. Comparison to SOTA: As shown in Table 1, the detector trained with VL-PLM significantly outperforms the prior state-of-the-art ViLD by nearly +7% in Novel AP. Compared with [54] and [14], our method achieves much better performance not only on novel but also on base categories. This indicates that training with our PLs has less impact on the predictions of base categories, where

Exploiting Unlabeled Data with V&L Models

169

Table 2. Open-vocabulary models trained with base categories from COCO are evaluated on unseen datasets. The evaluation protocol follows [14] and reports AP50 PLs

Iterations × Batch size VOC 2007 Object365 LVIS

Gao et al. [14] 150K × 64

59.2

6.9

8.0

180K × 16

67.4

10.9

22.2

VL-PLM

previous approaches suffered a huge performance drop. Overall, we can see that using V&L models to explicitly generate PLs for novel categories to train the model can give a clear performance boost. Although this introduces an overhead compared to ViLD (and others), which can include novel categories dynamically into the label space, many practical applications easily tolerate this overhead in favor of significantly improved accuracy. Such a setup is also similar to prior works that generate synthetic features of novel categories [61]. Moreover, our method has large potential for further improvement with better V&L model. [16] demonstrates a 60% performance boost of ViLD when using ALIGN [23] as the V&L model. We expect similar improvements on VL-PLM if ALIGN is available. Generalizing to Unseen Datasets: Following Gao et al.’s evaluation protocol [14], we evaluate COCO-trained models on three unseen datasets: VOC 2007 [11], Object365 [41] and LVIS [17]. To do so, we generate PLs for the novel label spaces of these datasets on the COCO dataset and train a standard Faster R-CNN model. The results of our approach on the three unseen datasets is compared to [14] in Table 2. VL-PLM significantly outperforms [14] with similar iterations and smaller batch sizes. Note that [14] requires additional image captions to generate PLs, while VL-PLM can generate PLs for any given category. 4.2

Semi-supervised Object Detection

In this task, we have annotations for all categories on a small portion of a large image set. This portion is regarded as the labeled set IL and the remaining images are regarded as the unlabeled set IU i.e. IL ∩ IU = ∅. Experimental Setup: Following previous studies [44,49,60], we conduct experiments on COCO [32] with 1, 2, 5, and 10% of the training images selected as the labeled data and the rest as the unlabeled data, respectively. In the supplement, we provide more results for varying numbers of unlabeled data. To demonstrate how VL-PLM improves PLs for SSOD, we mainly compare our method with the following baselines. (1) Supervised : A vanilla teacher model trained on the labeled set IL . (2) Supervised +PLs: We apply the vanilla teacher model on the unlabeled set IU to generate PLs and train a student model with both ground truth and PLs. To compare with Supervised +PLs, VL-PLM generates PLs for all categories on IU . Then, those PLs are merged into the PLs from the vanilla teacher as the final PLs to train a student model named as Supervised +VLPLM. (3) STAC [44]: A popular SSOD baseline. To compare with STAC, we

170

S. Zhao et al.

Table 3. Evaluation of pseudo labels for semi-supervised object detection on COCO [32].

Methods | 1% COCO | 2% COCO | 5% COCO | 10% COCO
Supervised | 9.25 | 12.70 | 17.71 | 22.10
Supervised +PLs | 11.18 | 14.88 | 21.20 | 25.98
Supervised +VL-PLM | 15.35 | 18.60 | 23.70 | 27.23
STAC [44] | 13.97 | 18.25 | 24.38 | 28.64
STAC+VL-PLM | 17.71 | 21.20 | 26.21 | 29.61

only replace its PLs with ours that are used to train Supervised +VL-PLM. The new STAC student model is denoted as STAC+VL-PLM. Here we report the standard metric for COCO, mAP, which is an average over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. Implementation Details: We follow the same PL generation pipeline and hyperparameters as the OVD experiment, except that we take a class-agnostic Faster R-CNN [40] as our proposal generator and train it on the different COCO splits. Supervised and Supervised +PLs are implemented in Detectron2 [48] and trained for 90,000 iterations with a batch size of 16. For models related to STAC [44], we use the official code of STAC with default settings. Results: As shown in Table 3, models with VL-PLM outperform Supervised + PLs and STAC by a clear margin, respectively. Since the only change to the baselines is the addition of VL-PLM’s PLs, we can conclude that V&L adds clear value to the PLs and can benefit SSOD. Another interesting finding is that models with VL-PLM provide bigger gains for smaller labeled data, which is the most important regime for SSOD as it brings down annotation costs. In that regime, PLs from V&L models are likely stronger than PLs from the small amount of annotated data. We also want to mention two recent SSOD methods [49,60] that achieve higher absolute performance, however, only with additional and orthogonal contributions. VL-PLM may also improve these methods, but here we focus on a fair comparison to other PL-based methods. Moreover, we believe that with better V&L models, VL-PLM can further improve SSOD. 4.3

Analysis of Pseudo Label Generation

We base our ablation studies on the COCO-ZS setting for OVD unless otherwise specified. All models are trained for 90,000 iterations with a batch size of 16. Understanding the Quality of PLs: Average precision (AP) is a dominant metric to evaluate object detection methods. However, AP alone does not fully indicate the quality of PLs, and the number of PLs also needs to be considered. To support this claim, we generate 5 sets of PLs as follows. (1) PL v1 : We take the raw region proposals from RPN without RoI refinement in our pseudo label

Exploiting Unlabeled Data with V&L Models

171

Fig. 4. The quality of PLs with different combinations of RPN and RoI head. We change the threshold τ to ensure that each combination has a similar #@PL. "×N" means we apply the RoI head N times to refine the proposal boxes.

Table 4. Relationship between the quality of pseudo labels and the performance of the final open vocabulary detectors. AP@PL and #@PL evaluate the pseudo labels; Base AP, Novel AP, and Overall AP evaluate the final detector.

PL setting | AP@PL | #@PL | Base AP | Novel AP | Overall AP
PL v1 (No RoI, τ = 0.05) | 17.4 | 89.92 | 33.3 | 14.6 | 28.4
PL v2 (No RoI, τ = 0.95) | 14.6 | 2.88 | 56.1 | 26.0 | 48.2
PL v3 (VL-PLM, τ = 0.05) | 20.6 | 85.15 | 29.7 | 19.3 | 27.0
PL v4 (VL-PLM, τ = 0.95) | 18.0 | 2.93 | 55.4 | 31.3 | 49.1
PL v5 (VL-PLM, τ = 0.99) | 11.1 | 1.62 | 56.7 | 27.2 | 49.0

generation and set τ = 0.05. (2) PL v2 : The same as PL v1 but with τ = 0.95. (3) PL v3 : VL-PLM with τ = 0.05. (4) PL v4 : VL-PLM with τ = 0.95. (5) PL v5 : VL-PLM with τ = 0.99. In Table 4, we report AP50 (AP@PL) and the average per-image number (#@PL) of pseudo labels on novel categories. We also report the performance of detection models trained with the corresponding PLs as Novel AP, Base AP and Overall AP. Comparing PL v1 with PL v4 and PL v2 with PL v4, we can see that a good balance between AP@PL and #@PL is desired. Many PLs may achieve high AP@PL, but drop the performance of the final detector. A high threshold reduces the number of PLs but degrades AP@PL as well as the final performance. We found τ = 0.8 to provide a good trade-off. The table also demonstrates the benefit of VL-PLM over no RoI refinement. The supplement contains more analysis and visualizations of our pseudo labels. Two-Stage Proposal Generator Matters: As mentioned in Sect. 3.2, we improve the localization ability of CLIP with the two-stage proposal generator in two ways: 1) we merge CLIP scores with RPN scores, and 2) we repeatedly refine the region proposals from RPN with the RoI Head. To showcase how RPN and the RoI head help PLs, we evaluate the quality of PLs from different settings in Fig. 4. As shown, RPN score fusion always improves the quality of PLs. As we increase the number of refinement steps with RoI head, the quality increases

172

S. Zhao et al.

Table 5. The quality of pseudo labels generated from different region proposals. The threshold τ is tuned to ensure a similar #@PL for each method.

Proposal source | τ | AP@PL | #@PL
Selective search [46] | 0.99 | 5.7 | 34.92
RoI Head | 0.55 | 8.8 | 5.01
RPN | 0.88 | 19.7 | 4.70
RPN+RoI (Ours) | 0.82 | 25.3 | 4.26

and converges after about 10 steps. Besides proposals from our RPN with RoI refinement (RPN+RoI), we investigate region proposals from different sources, i.e., 1) selective search [46], 2) RPN only, and 3) the RoI head with default thresholding and NMS. Table 5 shows that selective search with a high τ still leads to a large #@PL with a low AP@PL for at least two reasons. First, unlike RPN, selective search does not provide objectness scores to improve the localization of CLIP. Second, it returns ten times more proposals than RPN, which contain too many noisy boxes. Finally, the RoI head alone also leads to a poor quality of PLs because it classifies many novel objects as background, due to its training protocol. In the supplement, we show that the proposal generator, which is trained on base categories, generalizes to novel categories. Time Efficiency: VL-PLM sequentially generates PLs for each region proposal, which is time-consuming. For example, VL-PLM with ResNet50 takes 0.54 s per image on average. We provide two solutions to reduce the time cost. 1) Simple multithreading on 8 GPUs can generate PLs for the whole COCO training set within 6 h. 2) We provide a faster version (Fast VL-PLM) by sharing the ResNet50 feature extraction for all region proposals of the same image. This reduces the runtime by 5× with a slight performance drop. Adding multi-scale features (Multiscale Fast VL-PLM) avoids the performance drop but still reduces runtime by 3×. Please refer to the supplement for more details.
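The exact Fast VL-PLM implementation is given in the supplement; the sketch below only illustrates the general idea of sharing backbone computation, pooling region features from a single CLIP ResNet feature map with RoIAlign instead of encoding every crop separately. Splitting the CLIP visual encoder into a backbone trunk and an attnpool head, and the pooling resolution, are assumptions.

import torch
from torchvision.ops import roi_align

def shared_region_embeddings(backbone, attnpool, image_tensor, boxes,
                             output_size=7, spatial_scale=1.0 / 32):
    # backbone: CLIP ResNet trunk up to the final feature map; attnpool: its pooling head.
    # image_tensor: (3, H, W); boxes: (N, 4) in image coordinates.
    with torch.no_grad():
        fmap = backbone(image_tensor[None])                           # (1, C, H/32, W/32)
        batch_idx = torch.zeros(len(boxes), 1, device=boxes.device, dtype=boxes.dtype)
        rois = torch.cat([batch_idx, boxes], dim=1)                   # (N, 5): batch index + box
        feats = roi_align(fmap, rois, output_size, spatial_scale=spatial_scale)
        emb = attnpool(feats)                                         # (N, D) region embeddings
        return emb / emb.norm(dim=-1, keepdim=True)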

5

Conclusion

This paper demonstrates how to leverage pre-trained V&L models to mine unlabeled data for different object detection tasks, e.g., OVD and SSOD. We propose a V&L model guided pseudo label mining framework (VL-PLM) that is simple but effective, and is able to generate pseudo labels (PLs) for a task-specific labelspace. Our experiments showcase that training a standard detector with our PLs sets a new state-of-the-art for OVD on COCO. Moreover, our PLs can benefit SSOD models, especially when the amount of ground truth labels is limited. We believe that VL-PLM can be further improved with better V&L models. Acknowledgments. This research has been partially funded by research grants to D. Metaxas from NEC Labs America through NSF IUCRC CARTA-1747778, NSF: 1951890, 2003874, 1703883, 1763523 and ARO MURI SCAN.

Exploiting Unlabeled Data with V&L Models

173

References 1. Agrawal, A., et al.: VQA: visual question answering. In: ICCV (2015) 2. Agrawal, H., et al.: nocaps: novel object captioning at scale. In: ICCV (2019) 3. Anderson, P., et al.: Vision-and-Language navigation: interpreting visuallygrounded navigation instructions in real environments. In: CVPR (2018) 4. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 397–414. Springer, Cham (2018). https://doi.org/10. 1007/978-3-030-01246-5 24 5. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018) 6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 7. Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server (2015) 8. Chen, Y.C., et al.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058577-8 7 9. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: CVPR (2018) 10. Dong, B., Huang, Z., Guo, Y., Wang, Q., Niu, Z., Zuo, W.: Boosting weakly supervised object detection via learning bounding box adjusters. In: ICCV., pp. 2876– 2885 (2021) 11. Everingham, M., Eslami, S., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015) 12. Fang, H., et al.: From captions to visual concepts and back. In: CVPR (2015) 13. Fukui, A., et al..: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP (2016) 14. Gao, M., Xing, C., Niebles, J.C., Li, J., Xu, R., Liu, W., Xiong, C.: Towards open vocabulary object detection without human-provided bounding boxes. In: ECCV 2022 (2021) 15. Ghiasi, G., et al.: : Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR, pp. 2918–2928 (2021) 16. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: ICLR (2022) 17. Gupta, A., Doll´ ar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: CVPR (2019) 18. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017) 19. Hu, R., Singh, A.: UniT: multimodal Multitask Learning with a unified transformer. In: ICCV (2021) 20. Hudson, D.A., Manning, C.D.: Learning by abstraction: the neural state machine. In: NeurIPS (2019) 21. Huynh, D., Kuen, J., Lin, Z., Gu, J., Elhamifar, E.: Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling (2021)

174

S. Zhao et al.

22. Inoue, N., Furuta, R., Yamasaki, T., Aizawa, K.: Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation. In: CVPR (2018) 23. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (D2021) 24. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR modulated detection for end-to-end multi-modal understanding. In: ICCV (2021) 25. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015) 26. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: EMNLP (2014) 27. Kuznetsova, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis, 128, 1956–1981 (2020) 28. Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: ICLR (2022) 29. Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021) 30. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-58577-8 8 31. Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017) 32. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 33. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019) 34. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for Vision-and-Language Tasks. In: NeurIPS (2019) 35. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and Comprehension of Unambiguous Object Descriptions. In: CVPR (2016) 36. Peng, G., et al.: Dynamic fusion with Intra- and inter- modality attention flow for visual question answering. In: CVPR (2019) 37. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021) 38. Rahman, S., Khan, S., Barnes, N.: Improved visual-semantic alignment for zeroshot object detection. In: AAAI, pp. 11932–11939 (2020) 39. Rao, Y., et al.: Denseclip: Language-guided dense prediction with context-aware prompting. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 40. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with Region Proposal Networks. In: NeurIPS (2015) 41. Shao, S., et al.: Objects365: a large-scale. high-quality dataset for object detection. In : 2019 IEEE/CVF International Conference on Computer Vision (2019) 42. Shi, H., Hayat, M., Wu, Y., Cai, J.: ProposalCLIP: unsupervised open-category object proposal generation via exploiting clip cues. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 43. Sim´eoni, O., et al.: Localizing objects with self-supervised transformers and no labels. In: BMVC (2021)

Exploiting Unlabeled Data with V&L Models

175

44. Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semisupervised learning framework for object detection. In: arXiv:2005.04757 (2020) 45. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: Videobert: A joint model for video and language representation learning. In: ICCV (2019) 46. Uijlings, J., van de Sande, K., Gevers, T., Smeulders, A.: Selective search for object recognition. Int. J. Comput. Vis. 104, 154–171 (2013) 47. Wang, L., Li, Y., Lazebnik, S.: Learning Deep Structure-Preserving Image-Text Embeddings. In: CVPR (2016) 48. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github. com/facebookresearch/detectron2 (2019) 49. Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: ICCV, pp. 3060–3069 (2021) 50. Xu, M., et al.: A simple baseline for zero-shot semantic segmentation with pretrained vision-language model (2021) 51. Yu, F., et al.: Unsupervised domain adaptation for object detection via crossdomain semi-supervised learning. In: WACV (2022) 52. Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: CVPR (2018) 53. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/9783-319-46475-6 5 54. Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: CVPR (2021) 55. Zhang, P., et al.: VinVL: revisiting visual representations in vision-language models. In: CVPR (2021) 56. Zhao, X., Schulter, S., Sharma, G., Tsai, Y.-H., Chandraker, M., Wu, Y.: Object detection with a unified label space from multiple datasets. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 178–193. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6 11 57. Zhong, Y., et al.: RegionCLIP: Region-based language-image pretraining. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 58. Zhong, Y., Wang, J., Peng, J., Zhang, L.: Boosting weakly supervised object detection with progressive knowledge transfer. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 615–631. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7 37 59. Zhou, C., Loy, C.C., Dai, B.: DenseCLIP: extract free dense labels from clip. In: ECCV 2022 (2021) 60. Zhou, Q., Yu, C., Wang, Z., Qian, Q., Li, H.: Instant-teaching: an end-to-end semi-supervised object detection framework. In: CVPR (2021) 61. Zhu, P., Wang, H., Saligrama, V.: Don’t even look once: synthesizing features for zero-shot detection. In: CVPR, pp. 11693–11702 (2020)

CPO: Change Robust Panorama to Point Cloud Localization

Junho Kim1, Hojun Jang1, Changwoon Choi1, and Young Min Kim1,2(B)

1 Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea
2 Interdisciplinary Program in Artificial Intelligence and INMC, Seoul National University, Seoul, South Korea
[email protected]

Abstract. We present CPO, a fast and robust algorithm that localizes a 2D panorama with respect to a 3D point cloud of a scene possibly containing changes. To robustly handle scene changes, our approach deviates from conventional feature point matching, and focuses on the spatial context provided from panorama images. Specifically, we propose efficient color histogram generation and subsequent robust localization using score maps. By utilizing the unique equivariance of spherical projections, we propose very fast color histogram generation for a large number of camera poses without explicitly rendering images for all candidate poses. We accumulate the regional consistency of the panorama and point cloud as 2D/3D score maps, and use them to weigh the input color values to further increase robustness. The weighted color distribution quickly finds good initial poses and achieves stable convergence for gradient-based optimization. CPO is lightweight and achieves effective localization in all tested scenarios, showing stable performance despite scene changes, repetitive structures, or featureless regions, which are typical challenges for visual localization with perspective cameras.

Keywords: Visual localization · Panorama · Point cloud

1 Introduction

The location information is a crucial building block to develop applications for AR/VR, autonomous driving, and embodied agents. Visual localization is one of the cheapest methods for localization as it could operate only using camera inputs and a pre-captured 3D map. While many existing visual localization algorithms utilize perspective images [29,31,35], they are vulnerable to repetitive structures, lack of visual features, or scene changes. Recently, localization using panorama images [6,7,20,37] has gained attention, as devices with 360° cameras are becoming more accessible. The holistic view of panorama images



Fig. 1. Overview of our approach. CPO first creates 2D and 3D score maps that attenuate regions containing scene changes. The score maps are further used to guide candidate pose selection and pose refinement.

Fig. 2. Qualitative results of CPO. We show the query image (top), and the projected point cloud on the estimated pose (bottom). CPO can flexibly operate using raw color measurements or semantic labels.

has the potential to compensate for few outliers in localization and thus is less susceptible to minor changes or ambiguities compared to perspective images. Despite the potential of panorama images, it is challenging to perform localization amidst drastic scene changes while simultaneously attaining efficiency and accuracy. On the 3D map side, it is costly to collect the up-to-date 3D map that reflects the frequent changes within the scenes. On the algorithmic side, existing localization methods have bottlenecks either in computational efficiency or accuracy. While recent panorama-based localization methods [6,7,20,37] perform accurate localization by leveraging the holistic context in panoramas, they are vulnerable to scene changes without dedicated treatment to account for changes. For perspective cameras, such scene changes are often handled by a two-step approach, using learning-based robust image retrieval [3,13] followed by feature matching [30]. However, the image retrieval step involves global feature extraction which is often costly to compute and memory intensive. We propose CPO, a fast localization algorithm that leverages the regional distributions within the panorama images for robust pose prediction under scene changes. Given a 2D panorama image as input, we find the camera pose using



a 3D point cloud as the reference map. With careful investigation on the precollected 3D map and the holistic view of the panorama, CPO focuses on regions with consistent color distributions. CPO represents the consistency as 2D/3D score maps and quickly selects a small set of initial candidate poses from which the remaining discrepancy can be quickly and stably optimized for accurate localization as shown in Fig. 1. As a result, CPO enables panorama to point cloud localization under scene changes without the use of pose priors, unlike the previous state-of-the-art [20]. Further, the formulation of CPO is flexible and can be applied on both raw color measurements and semantic labels, which is not possible with conventional structure-based localization relying on visual features. To the best of our knowledge, we are the first to explicitly propose a method for coping with changes in panorama to point cloud localization. The key to fast and stable localization is the efficient color histogram generation that scores the regional consistency of candidate poses. Specifically, we utilize color histograms generated from synthetic projections of the point cloud and make comparisons with the query image. Instead of extensively rendering a large number of synthetic images, we first cache histograms in a few selected views. Then, color histograms for various other views are efficiently approximated by re-using the histograms of the nearest neighbor from the pre-computed color distribution of overlapping views. As a result, CPO generates color histograms for millions of synthetic views within a matter of milliseconds and thus can search a wide range of candidate poses within an order-of-magnitude shorter runtime than competing methods. We compare the color histograms and construct the 2D and 3D score maps, as shown in Fig. 1 (middle). The score maps impose higher scores in regions with consistent color distribution, indicating that the region did not change from the reference 3D map. The 2D and 3D score maps are crucial for change-robust localization, which is further verified with our experiments. We test our algorithm in a wide range of scenes with various input modalities where a few exemplar results are presented in Fig. 2. CPO outperforms existing approaches by a large margin despite a considerable amount of scene change or lack of visual features. Notably, CPO attains highly accurate localization, flexibly handling both RGB and semantic labels in both indoor and outdoor scenes, without altering the formulation. Since CPO does not rely on point features, our algorithm is quickly applicable in an off-the-shelf manner without any training of neural networks or collecting pose-annotated images. We expect CPO to be a lightweight solution for stable localization in various practical scenarios.


2 Related Work

In this section, we describe prior works for localization under scene changes, and further elaborate on conventional visual localization methods that employ either a single-step or two-step approach. Localization Under Scene Changes. Even the state-of-the-art techniques for visual localization can fail when the visual appearance of the scene changes.



This is because conventional localization approaches are often designed to find similar visual appearances from pre-collected images with ground-truth poses. Many visual localization approaches assume that the image features do not significantly change, and either train a neural network [19,19,22,35] or retrieve image features [14,18,23,31,32]. Numerous datasets and approaches have been presented in recent years to account for change-robust localization. The proposed datasets reflect day/night [25,35] or seasonal changes [5,25,33] for outdoor scenes and changes in the spatial arrangement of objects [34,36,38] for indoor scenes. To cope with such changes, most approaches follow a structure-based paradigm, incorporating a robust image retrieval method [3,10,13,17] along with a learned feature matching module [9,29,30,39]. An alternative approach utilizes indoor layouts from depth images, which stay constant despite changes in object layouts [16]. We compare CPO against various change-robust localization methods, and demonstrate that CPO outperforms the baselines amidst scene changes. Single-Step Localization. Many existing methods [6,7,37] for panorama-based localization follow a single-step approach, where the pose is directly found with respect to the 3D map. Since panorama images capture a larger scene context, fewer ambiguities arise than perspective images, and reasonable localization is possible even without a refinement process or a pose-annotated database. Campbell et al. [6,7] introduced a class of global optimization algorithms that could effectively find pose in diverse indoor and outdoor environments [4,26]. However, these algorithms require consistent semantic segmentation labels for both the panorama and 3D point cloud, which are often hard to acquire in practice. Zhang et al. [37] propose a learning-based localization algorithm using panoramic views, where networks are trained using rendered views from the 3D map. We compare CPO with optimization-based algorithms [6,7], and demonstrate that CPO outperforms these algorithms under a wide variety of practical scenarios. Two-Step Localization. Compared to single-step methods, more accurate localization is often acquired by two-step approaches that initialize poses with an effective search scheme followed by refinement. For panorama images, PICCOLO [20] follows a two-step paradigm, where promising poses are found and further refined using sampling loss values that measure the color discrepancy in 2D and 3D. While PICCOLO does not incorporate learning, it shows competitive performance in conventional panorama localization datasets [20]. Nevertheless, the initialization and refinement is unstable to scene changes as the method lacks explicit treatment of such adversaries. CPO improves upon PICCOLO by leveraging score maps in 2D that attenuate changes for effective initialization and score maps in 3D that guide sampling loss minimization for stable convergence. For perspective images, many structure-based methods [13,29] use a twostep approach, where candidate poses are found with image retrieval [3] or scene coordinate regression [22] and further refined with PnP-RANSAC [11] from feature matching [29,30,39]. While these methods can effectively localize perspective images, the initialization procedure often requires neural networks that are memory and compute intensive, trained with a dense, pose-annotated



database of images. We compare CPO against prominent two-step localization methods, and demonstrate that CPO attains efficiency and accuracy with an effective formulation in the initialization and refinement.


3 Method

Given a point cloud P = {X, C}, CPO aims to find the optimal rotation R* ∈ SO(3) and translation t* ∈ R^3 at which the image I_Q is taken. Let X, C ∈ R^{N×3} denote the point cloud coordinates and color values, and I_Q ∈ R^{H×W×3} the query panorama image. Figure 1 depicts the steps by which CPO localizes the panorama image under scene changes. First, we extensively measure the color consistency between the panorama and point cloud at various poses. We propose the fast histogram generation described in Sect. 3.1 for efficient comparison. The consistency values are recorded as a 2D score map M_2D ∈ R^{H×W×1} and a 3D score map M_3D ∈ R^{N×1}, which are defined in Sect. 3.2. We use the color histograms and score maps to select candidate poses (Sect. 3.3), which are further refined to deduce the final pose (Sect. 3.4).

3.1 Fast Histogram Generation

Instead of focusing on point features, CPO relies on the regional color distribution of images to match the global context between the 2D and 3D measurements. To cope with color distribution shifts from illumination change or camera white balance, we first preprocess the raw color measurements in 2D and 3D via color histogram matching [1,8,15]. Specifically, we generate a single color histogram for the query image and point cloud, and establish a matching between the two distributions via optimal transport. While more sophisticated learning-based methods [12,24,40] may be used to handle drastic illumination changes such as night-to-day shifts, we find that simple matching can still handle a modest range of color variations prevalent in practical settings. After preprocessing, we compare the intersections of the RGB color histograms between the patches from the query image I_Q and the synthetic projections of the point cloud P. The efficient generation of color histograms is a major building block for CPO. While there could be an enormous number of poses that the synthetic projections can be generated from, we re-use the pre-computed histograms from another view to accelerate the process. Suppose we have created color histograms for patches of images taken from the original view I_o, as shown in Fig. 3. Then the color histogram for the image in a new view I_n can be quickly approximated without explicitly rendering the image and counting bins of colors for pixels within the patches. Let S_o = {S_i^o} denote the image patches of I_o and C_o = {c_i^o} the 2D image coordinates of the patch centroids. S_n and C_n are similarly defined for the novel view I_n. For each novel-view patch, we project the patch centroid using the relative transformation and obtain the color histogram of the nearest patch of the original image, as described in Fig. 3. To elaborate, we first map the patch centroid location c_i^n of S_i^n ∈ S_n to the original image coordinate frame,

$$p_i = \Pi(R_{rel}\,\Pi^{-1}(c_i^n) + t_{rel}), \qquad (1)$$



Fig. 3. Illustration of fast histogram generation. For each image patch in the novel view In , we first project the patch centroid cn i to the view of the original image Io . The color histogram of the patch in the novel view is estimated as the histogram of image patch c∗ in the original view that is closest to the transformed centroid pi .

Fig. 4. Illustration of 2D score map generation. The 2D score map for the ith patch Mi is the maximum histogram intersection between the ith patch in query image IQ and the synthetic views Yn ∈ Y.

where R_rel, t_rel is the relative pose and Π^{-1}(·): R^2 → R^3 is the inverse projection function that maps a 2D coordinate to its 3D world coordinate. The color histogram for S_i^n is assigned as the color histogram of the patch centroid in I_o that is closest to p_i, namely c* = argmin_{c ∈ C_o} ||c − p_i||_2. We specifically utilize the cached histograms to generate histograms with arbitrary rotations at a fixed translation. In this case, the camera observes the same set of visible points without changes in occlusion or parallax effects due to depth. Therefore the synthetic image is rendered only once, and the patch-wise histograms can be closely approximated by our fast variant with p_i = Π(R_rel Π^{-1}(c_i^n)).
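To make the cached-histogram trick concrete, the following is a minimal NumPy sketch of fast histogram generation for rotated views of an equirectangular panorama. The patch grid, the 8-bin histograms, the y-up spherical convention, and all function names are illustrative assumptions rather than the paper's implementation, and the nearest-neighbour lookup ignores the longitude wrap-around at the image seam.

```python
import numpy as np

def equirect_unproject(uv, width, height):
    """Pi^-1: panorama pixel coords (u, v) -> unit rays on the sphere (y-up assumed)."""
    lon = uv[:, 0] / width * 2 * np.pi - np.pi
    lat = np.pi / 2 - uv[:, 1] / height * np.pi
    return np.stack([np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)], axis=1)

def equirect_project(rays, width, height):
    """Pi: unit rays -> panorama pixel coords (u, v)."""
    lon = np.arctan2(rays[:, 0], rays[:, 2])
    lat = np.arcsin(np.clip(rays[:, 1], -1.0, 1.0))
    u = (lon + np.pi) / (2 * np.pi) * width
    v = (np.pi / 2 - lat) / np.pi * height
    return np.stack([u, v], axis=1)

def patch_histograms(img, grid=(8, 16), bins=8):
    """Per-patch RGB histograms of an equirectangular image (H, W, 3) with values in [0, 1]."""
    H, W, _ = img.shape
    gh, gw = grid
    hists, centroids = [], []
    for i in range(gh):
        for j in range(gw):
            patch = img[i * H // gh:(i + 1) * H // gh, j * W // gw:(j + 1) * W // gw]
            h = [np.histogram(patch[..., c], bins=bins, range=(0, 1), density=True)[0]
                 for c in range(3)]
            hists.append(np.stack(h))                        # (3, bins)
            centroids.append([(j + 0.5) * W / gw, (i + 0.5) * H / gh])
    return np.stack(hists), np.asarray(centroids)

def rotated_view_histograms(cached_hists, centroids, R_rel, width, height):
    """Approximate patch histograms of the view rotated by R_rel without re-rendering:
    each patch centroid of the novel view is rotated into the original view (Eq. 1 with
    t_rel = 0) and assigned the histogram of the nearest cached patch."""
    rays = equirect_unproject(centroids, width, height)
    p = equirect_project(rays @ R_rel.T, width, height)      # centroids seen in the original view
    d = np.linalg.norm(p[:, None, :] - centroids[None, :, :], axis=-1)
    return cached_hists[d.argmin(axis=1)]
```

With this, histograms for many rotations at one translation only cost one rendering plus nearest-neighbour lookups, which is the source of the speed-up discussed above.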

3.2 Score Map Generation

Based on the color histogram of the query image and the synthetic views from the point cloud, we generate 2D and 3D score maps to account for possible changes in the measurements. Given a query image IQ ∈ RH×W ×3 , we create multiple synthetic views Y ∈ Y at various translations and rotations within the point



cloud. Specifically, we project the input point cloud P = {X, C} and assign the measured color Y(u, v) = C_n at the projected location of the corresponding 3D coordinate (u, v) = Π(R_Y X_n + t_Y) to create the synthetic view Y. We further compare the color distribution of the synthetic views Y ∈ 𝒴 against the input image I_Q and assign higher scores to regions with high consistency. We first divide both the query image and the synthetic views into patches and calculate the color histograms of the patches. Following the notation in Sect. 3.1, we denote the patches of the query image as S_Q = {S_i^Q} and those of each synthetic view as S_Y = {S_i^Y}. The color distribution of patch i is recorded into a histogram with B bins per channel, h_i(·): R^{H×W×3} → R^{B×3} (computed over patch S_i), and the consistency of two patches is calculated as the intersection between two histograms, Λ(·, ·): R^{B×3} × R^{B×3} → R. Finally, we aggregate the consistency values from multiple synthetic views into the 2D score map M_2D for the query image and the 3D score map M_3D for the point cloud. We verify the efficacy of the score maps for CPO in Sect. 4.3.

2D Score Map. The 2D score map M_2D ∈ R^{H×W} assigns higher scores to regions in the query image I_Q that are consistent with the point cloud color. As shown in Fig. 4, we split M_2D into patches and assign a score to each patch. We define the 2D score as the maximum histogram intersection that each patch in the input query image I_Q achieves, compared against the multiple synthetic views in 𝒴. Formally, denoting M = {M_i} as the scores for patches in M_2D, the score for the ith patch is

$$M_i = \max_{Y \in \mathcal{Y}} \Lambda(h_i(Y), h_i(I_Q)). \qquad (2)$$

If a patch in the query image contains scene change, it will have small histogram intersections with all of the synthetic views. Note that for computing Eq. 2 we use the fast histogram generation from Sect. 3.1 to avoid the direct rendering of Y. We utilize the 2D score map to attenuate image regions with changes during candidate pose selection in Sect. 3.3.

3D Score Map. The 3D score map M_3D ∈ R^N measures the color consistency of each 3D point with respect to the query image. We compute the 3D score map by back-projecting the histogram intersection scores to the point cloud locations, as shown in Fig. 5. Given a synthetic view Y ∈ 𝒴, let B_Y ∈ R^N denote the assignment of patch-based intersection scores between Y and I_Q to the 3D points whose locations project onto the corresponding patches of Y. The 3D score map is the average of the back-projected scores B_Y over all views, namely

$$M_{3D} = \frac{1}{|\mathcal{Y}|} \sum_{Y \in \mathcal{Y}} B_Y. \qquad (3)$$

If a region in the point cloud contains scene changes, one can expect the majority of the back-projected scores BY to be small for that region, leading to smaller 3D scores. We use the 3D score map to weigh the sampling loss for pose refinement in Sect. 3.4. By placing smaller weights on regions that contain scene changes, the 3D score map leads to more stable convergence.
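The score-map construction of Eqs. 2 and 3 can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes the per-view patch histograms and the patch-to-point visibility lists have already been computed, and all function and variable names are placeholders.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Lambda(., .): similarity of two (3, B) per-patch color histograms."""
    return float(np.minimum(h1, h2).sum())

def score_maps(query_hists, synth_view_hists, patch_to_points, n_points):
    """Build the 2D score map (Eq. 2) and the 3D score map (Eq. 3).

    query_hists:      (P, 3, B) histograms of the query panorama patches
    synth_view_hists: list over synthetic views Y of (P, 3, B) histograms
    patch_to_points:  list over views of lists mapping each patch index to the indices
                      of 3D points that project into that patch (back-projection B_Y)
    """
    P = query_hists.shape[0]
    m2d = np.zeros(P)
    m3d = np.zeros(n_points)
    for view_hists, assign in zip(synth_view_hists, patch_to_points):
        inter = np.array([histogram_intersection(view_hists[i], query_hists[i])
                          for i in range(P)])
        m2d = np.maximum(m2d, inter)          # Eq. 2: per-patch max over synthetic views
        b_y = np.zeros(n_points)              # back-projected scores B_Y for this view
        for i, pts in enumerate(assign):
            b_y[pts] = inter[i]
        m3d += b_y
    m3d /= max(len(synth_view_hists), 1)      # Eq. 3: average of B_Y over all views
    return m2d, m3d
```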



Fig. 5. Illustration of 3D score map generation. For each synthetic view Y ∈ Y, the patch-wise color histogram is compared against the query image and the resulting intersection scores are back-projected onto 3D locations. The back-projected scores BY are averaged for all synthetic views to form the 3D score map M3D .


3.3 Candidate Pose Selection

For the final step, CPO optimizes the sampling loss [20] from selected initial poses, as shown in Fig. 1. CPO chooses the candidate starting poses by efficiently leveraging the color distribution of the panorama and point cloud. The space of candidate starting poses is selected in two steps. First, we choose N_t 3D locations within various regions of the point cloud and render N_t synthetic views. For datasets with large open spaces lacking much clutter, the positions are selected from uniform grid partitions. On the other hand, for cluttered indoor scenes, we efficiently handle valid starting positions by building octrees to approximate the amorphous empty spaces, as in Rodenberg et al. [28], and select the octree centroids as the N_t starting positions. Second, we select the final K candidate poses out of N_t × N_r poses, where N_r is the number of rotations assigned to each translation, uniformly sampled from SO(3). We only render a single view for each of the N_t locations, and obtain patch-wise histograms for the N_r rotations using the fast histogram generation from Sect. 3.1. We select the final K poses that have the largest histogram intersections with the query panorama image. The fast generation of color histograms at synthetic views enables efficient candidate pose selection, which is quantitatively verified in Sect. 4.3. Here, we compute the patch-wise histogram intersections for the N_t × N_r poses, where the 2D score map M_2D from Sect. 3.2 is used to place smaller weights on image patches that are likely to contain scene change. Let 𝒴_c denote the N_t × N_r synthetic views used for finding candidate poses. For a synthetic view Y ∈ 𝒴_c, the weighted histogram intersection w(Y) with the query image I_Q is expressed as

$$w(Y) = \sum_i M_i\, \Lambda(h_i(Y), h_i(I_Q)). \qquad (4)$$



Conceptually, the affinity between a synthetic view Y and the query image I_Q is computed as the sum of the patch-wise intersections weighted by the corresponding patch scores M_i from the 2D score map M_2D. We can expect changed regions to be attenuated in the candidate pose selection process, and therefore CPO can quickly compensate for possible scene changes.
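A small sketch of the candidate selection step of Eq. 4 is given below, assuming the patch histograms of the N_t × N_r synthetic views have already been produced by the fast histogram generation of Sect. 3.1 and that the per-patch 2D scores M_i are available; variable names and the value of K are illustrative.

```python
import numpy as np

def weighted_intersection(view_hists, query_hists, m2d_patch):
    """w(Y) of Eq. 4: patch-wise histogram intersections weighted by the 2D score map."""
    inter = np.minimum(view_hists, query_hists).sum(axis=(1, 2))   # (P,) per-patch intersections
    return float((m2d_patch * inter).sum())

def select_candidate_poses(candidate_hists, candidate_poses, query_hists, m2d_patch, k=5):
    """Rank the Nt x Nr candidate poses by weighted intersection and keep the top K.

    candidate_hists: (N, P, 3, B) patch histograms of the synthetic views, one per pose
    candidate_poses: length-N list of (R, t) tuples
    """
    scores = np.array([weighted_intersection(h, query_hists, m2d_patch)
                       for h in candidate_hists])
    top = np.argsort(scores)[::-1][:k]
    return [candidate_poses[i] for i in top], scores[top]
```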

3.4 Pose Refinement

We individually refine the selected K poses by optimizing a weighted variant of the sampling loss [20], which quantifies the color differences between 2D and 3D. To elaborate, let Π(·) be the projection function that maps a point cloud to coordinates in the 2D panorama image I_Q. Further, let Γ(·; I_Q) indicate the sampling function that maps 2D coordinates to pixel values sampled from I_Q. The weighted sampling loss enforces each 3D point's color to be similar to the color sampled at its 2D projection, while placing lesser weight on points that are likely to contain change. Given the 3D score map M_3D, this is expressed as

$$L_{sampling}(R, t) = M_{3D} \odot \left[\Gamma(\Pi(RX + t); I_Q) - C\right]^2, \qquad (5)$$

where ⊙ is the Hadamard product and RX + t is the transformed point cloud under the candidate camera pose (R, t). To obtain the refined poses, we minimize the weighted sampling loss for each of the K candidate poses using gradient descent [21]. At termination, the refined pose with the smallest sampling loss value is chosen.
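The refinement step can be illustrated with the following PyTorch sketch of the weighted sampling loss of Eq. 5 under an equirectangular projection with an assumed y-up camera convention. It is a simplified stand-in for the paper's implementation: the rotation is parameterized as an axis-angle update around the candidate pose, Adam [21] is used as the gradient-based optimizer, and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

def hat(k):
    """Skew-symmetric matrix of a 3-vector (built with stack to keep autograd intact)."""
    zero = torch.zeros((), dtype=k.dtype)
    return torch.stack([torch.stack([zero, -k[2], k[1]]),
                        torch.stack([k[2], zero, -k[0]]),
                        torch.stack([-k[1], k[0], zero])])

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = torch.linalg.norm(w) + 1e-8
    K = hat(w / theta)
    return torch.eye(3, dtype=w.dtype) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

def project_equirect(pts, width, height):
    """Pi: 3D points in the camera frame -> equirectangular pixel coordinates (u, v)."""
    r = pts.norm(dim=1) + 1e-8
    lon = torch.atan2(pts[:, 0], pts[:, 2])
    lat = torch.asin(torch.clamp(pts[:, 1] / r, -1.0, 1.0))
    u = (lon + torch.pi) / (2 * torch.pi) * width
    v = (torch.pi / 2 - lat) / torch.pi * height
    return torch.stack([u, v], dim=1)

def sample_colors(img, uv, width, height):
    """Gamma: differentiable bilinear sampling of panorama colors at 2D coordinates."""
    grid = torch.stack([uv[:, 0] / width * 2 - 1, uv[:, 1] / height * 2 - 1], dim=1)
    img_b = img.permute(2, 0, 1).unsqueeze(0)                       # (1, 3, H, W)
    out = F.grid_sample(img_b, grid.view(1, 1, -1, 2), align_corners=False)
    return out.view(3, -1).t()                                      # (N, 3)

def refine_pose(X, C, img, m3d, R0, t0, iters=200, lr=1e-2):
    """Minimise the weighted sampling loss (Eq. 5) starting from one candidate pose (R0, t0)."""
    H, W, _ = img.shape
    w = torch.zeros(3, dtype=X.dtype, requires_grad=True)           # rotation update in so(3)
    t = t0.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w, t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R = so3_exp(w) @ R0
        pts = X @ R.t() + t                                         # RX + t for row-vector points
        sampled = sample_colors(img, project_equirect(pts, W, H), W, H)
        loss = (m3d.unsqueeze(1) * (sampled - C) ** 2).mean()       # 3D-score-weighted color error
        loss.backward()
        opt.step()
    return (so3_exp(w) @ R0).detach(), t.detach(), float(loss)
```

In practice this refinement would be run for each of the K candidates and the pose with the lowest final loss kept, mirroring the selection rule stated above.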


4 Experiments

In this section, we analyze the performance of CPO in various localization scenarios. CPO is mainly implemented using PyTorch [27], and is accelerated with a single RTX 2080 GPU. We report the full hyperparameter setup for running CPO and further qualitative results for each tested scenario in the supplementary material. All translation and rotation errors are reported using median values, and for evaluating accuracy a prediction is considered correct if the translation error is below 0.05m and the rotation error is below 5◦ . Baselines. We select five baselines for comparison: PICCOLO [20], GOSMA [7], GOPAC [6], structure-based approach, and depth-based approach. PICCOLO, GOSMA, and GOPAC are optimization-based approaches that find pose by minimizing a designated objective function. Structure-based approach [29,31] is one of the most performant methods for localization using perspective images. This baseline first finds promising candidate poses via image retrieval using global features [13] and further refines pose via learned feature matching [30]. To adapt structure-based method to our problem setup using panorama images, we construct a database of pose-annotated synthetic views rendered from the point cloud and use it for retrieval. Depth-based approach first performs learningbased monocular depth estimation on the query panorama image [2], and finds



Table 1. Quantitative results on all splits containing changes in OmniScenes [20].

Method           | t-error (m)           | R-error (°)              | Accuracy
                 | Robot  Hand   Extreme | Robot   Hand    Extreme  | Robot  Hand   Extreme
PICCOLO          | 3.78   4.04   3.99    | 104.23  121.67  122.30   | 0.06   0.01   0.01
PICCOLO w/ prior | 1.07   0.53   1.24    | 21.03   7.54    23.71    | 0.39   0.45   0.38
Structure-Based  | 0.05   0.06   0.77    | 0.86    0.99    0.56     | 0.51   0.46   0.04
Depth-Based      | 0.46   0.09   0.48    | 1.35    1.24    2.37     | 0.38   0.39   0.30
CPO              | 0.02   0.02   0.03    | 1.46    0.37    0.37     | 0.58   0.58   0.57

Table 2. Quantitative results on all splits containing changes in Structured3D [38].

Method          | t-error (m) | R-error (°) | Acc. (0.05 m, 5°) | Acc. (0.02 m, 2°) | Acc. (0.01 m, 1°)
PICCOLO         | 0.19        | 4.20        | 0.47              | 0.45              | 0.43
Structure-Based | 0.02        | 0.64        | 0.59              | 0.47              | 0.29
Depth-Based     | 0.18        | 1.98        | 0.45              | 0.33              | 0.19
CPO             | 0.01        | 0.29        | 0.56              | 0.54              | 0.51

the pose that best aligns the estimated depth to the point cloud. The approach is similar to the layout-matching baseline from Jenkins et al. [16], which demonstrated effective localization under scene change. Additional details about implementing the baselines are deferred to the supplementary material.

4.1 Localization Performance on Scenes with Changes

We assess the robustness of CPO using the OmniScenes [20] and Structured3D [38] dataset, which allows performance evaluation for the localization of panorama images against point clouds in changed scenes. OmniScenes. The OmniScenes dataset consists of seven 3D scans and 4121 2D panorama images, where the panorama images are captured with cameras either handheld or robot mounted. Further, the panorama images are obtained at different times of day and include changes in scene configuration and lighting. OmniScenes contains three splits (Robot, Handheld, Extreme) that are recorded in scenes with changes, where the Extreme split contains panorama images captured with extreme camera motion. We compare CPO against PICCOLO [20], structure-based approach, and depth-based approach. The evaluation results for all three splits in OmniScenes are shown in Table 1. In all splits, CPO outperforms the baselines without the help of prior information or training neural networks. While PICCOLO [20] performs competitively with gravity direction prior, the performance largely degrades without such information. Further, outliers triggered from scene changes and motion blur make accurate localization difficult using structurebased or depth-based methods. CPO is immune to such adversaries as it explicitly models scene changes and regional inconsistencies with 2D, 3D score maps.



Fig. 6. Visualization of 2D, 3D score maps in OmniScenes [20] and Structured3D [38]. The 2D score map assigns lower scores to the capturer’s hand and objects not present in 3D. Similarly, the 3D score map assigns lower scores to regions not present in 2D.

The score maps of CPO effectively attenuate scene changes, providing useful evidence for robust localization. Figure 6 visualizes the exemplar 2D and 3D score maps generated in the wedding hall scene from OmniScenes. The scene contains drastic changes in object layout, where the carpets are removed and the arrangement of chairs has largely changed since the 3D scan. As shown in Fig. 6, the 2D score map assigns smaller scores to new objects and the capturer’s hand, which are not present in the 3D scan. Further, the 3D score map shown in Fig. 6 assigns smaller scores to chairs and blue carpets, which are present in the 3D scan but are largely modified in the panorama image. Structured3D. We further compare CPO against PICCOLO in Structured3D, which is a large-scale dataset containing synthetic 3D models with changes in object layout and illumination, as shown in Fig. 2. Due to the large size of the dataset (21845 indoor rooms), 672 rooms are selected for evaluation. For



Table 3. Quantitative results on Stanford 2D-3D-S [4], compared against PICCOLO (PC), structure-based approach (SB), and depth-based approach (DB).

        | t-error (m)            | R-error (°)               | Accuracy
        | PC    SB    DB    CPO  | PC     SB    DB     CPO   | PC    SB    DB    CPO
Area 1  | 0.02  0.05  1.39  0.01 | 0.46   0.81  89.48  0.25  | 0.66  0.51  0.28  0.89
Area 2  | 0.76  0.18  3.00  0.01 | 2.25   2.08  89.76  0.27  | 0.42  0.41  0.14  0.81
Area 3  | 0.02  0.05  1.39  0.01 | 0.49   1.01  88.94  0.24  | 0.53  0.50  0.24  0.76
Area 4  | 0.18  0.05  1.30  0.01 | 4.17   1.07  89.12  0.28  | 0.48  0.50  0.28  0.83
Area 5  | 0.50  0.10  2.37  0.01 | 14.64  1.31  89.88  0.27  | 0.44  0.47  0.18  0.73
Area 6  | 0.01  0.04  1.54  0.01 | 0.31   0.74  89.39  0.18  | 0.68  0.55  0.29  0.90
Total   | 0.03  0.06  1.72  0.01 | 0.63   1.04  89.51  0.24  | 0.53  0.49  0.23  0.83

each room, the dataset contains three object configurations (empty, simple, full) along with three lighting configurations (raw, cold, warm), leading to nine configurations in total. We consider the object layout change from empty to full, where the illumination change is randomly selected for each room. We provide further details about the evaluation in the supplementary material. The median errors and localization accuracy at various thresholds are reported in Table 2. CPO outperforms the baselines in most metrics, due to the change compensation of the 2D/3D score maps as shown in Fig. 6.

4.2 Localization Performance on Scenes Without Changes

We further demonstrate the wide applicability of CPO by comparing CPO with existing approaches in various scene types and input modalities (raw color / semantic labels). The evaluation is performed in one indoor dataset (Stanford 2D-3D-S [4]), and one outdoor dataset (Data61/2D3D [26]). Unlike OmniScenes and Structured3D, most of these datasets lack scene change. Although CPO mainly targets scenes with changes, it shows state-of-the-art results in these datasets. This is due to the fast histogram generation that allows for effective search from the large pool of candidate poses, which is an essential component of panorama to point cloud localization given the highly non-convex nature of the objective function presented in Sect. 3. Localization with Raw Color. We first make comparisons with PICCOLO [20], structure-based approach, and depth-based approach in the Stanford 2D-3D-S dataset. In Table 3, we report the localization accuracy and median error, where CPO outperforms other baselines by a large margin. Note that PICCOLO is the current state-of-the-art algorithm for the Stanford 2D-3D-S dataset. The median translation and rotation error of PICCOLO [20] deviates largely in areas 2, 4, and 5, which contain a large number of scenes such as hallways that exhibit repetitive structure. On the other hand, the error metrics and accuracy of CPO are much more consistent in all areas.



Table 4. Localization performance using semantic labels on a subset of Area 3 from Stanford 2D-3D-S [4]. Q1, Q2, Q3 are quartile values of each metric.

        | t-error (m)       | R-error (°)       | Runtime (s)
        | Q1    Q2    Q3    | Q1    Q2    Q3    | Q1    Q2    Q3
PICCOLO | 0.00  0.01  0.07  | 0.11  0.21  0.56  | 14.0  14.3  16.1
GOSMA   | 0.05  0.08  0.15  | 0.91  1.13  2.18  | 1.4   1.8   4.4
CPO     | 0.01  0.01  0.02  | 0.20  0.32  0.51  | 1.5   1.6   1.6

Table 5. Localization performance on all areas of the Data61/2D3D dataset [26].

      | t-error (m)          | R-error (°)
      | GOPAC  PICCOLO  CPO  | GOPAC  PICCOLO  CPO
Error | 1.1    4.9      0.1  | 1.4    28.8     0.3

Localization with Semantic Labels. We evaluate the performance of CPO against algorithms that use semantic labels as input, namely GOSMA [7] and GOPAC [6]. We additionally report results from PICCOLO [20], as it can also function with semantic labels. To accommodate the different input modality, CPO and PICCOLO use color-coded semantic labels as input, as shown in Fig. 2(c). We first compare CPO with PICCOLO and GOSMA on 33 images in Area 3 of the Stanford 2D-3D-S dataset, following the evaluation procedure of Campbell et al. [7]. As shown in Table 4, CPO outperforms GOSMA [7] by a large margin, with the 3rd quartile values of the errors being smaller than the 1st quartile values of GOSMA [7]. Further, while the performance gap with PICCOLO [20] is smaller than with GOSMA, CPO consistently exhibits a much smaller runtime. We further compare CPO with PICCOLO and GOPAC [6] on the Data61/2D3D dataset [26], which is an outdoor dataset that contains semantic labels for both 2D and 3D. The dataset is mainly recorded in rural regions of Australia, where large portions of the scene are highly repetitive and lack features, as shown in Fig. 2(c). Nevertheless, CPO exceeds GOPAC [6] in localization accuracy, as shown in Table 5. Note that CPO only uses a single GPU for acceleration, whereas GOPAC employs a quad-GPU configuration for effective performance [6]. Due to the fast histogram generation from Sect. 3.1, CPO can efficiently localize using a smaller amount of computational resources.

4.3 Ablation Study

In this section, we ablate key components of CPO, namely histogram-based candidate pose selection and 2D, 3D score maps. The ablation study for other constituents of CPO is provided in the supplementary material.



Table 6. Ablation of various components of CPO in OmniScenes [20] Extreme split. t-error (m) R-error (◦ ) Acc.

Method

w/o Histogram initialization 3.29 0.10 w/o 2D score map 0.03 w/o 3D score map

75.60 1.19 1.56

0.20 0.48 0.55

Ours

0.37

0.57

0.03

Table 7. Average runtime for a single synthetic view in Room 3 from OmniScenes [20]. Method

PICCOLO Structure-Based Depth-Based CPO

Runtime (ms) 2.135

38.70

2.745

0.188

Histogram-Based Candidate Pose Selection. We verify the effect of using color histograms for candidate pose selection on the Extreme split from the OmniScenes dataset [20]. CPO is compared with a variant that performs candidate pose selection using sampling loss values as in PICCOLO [20], where all other conditions remain the same. As shown in Table 6, a drastic performance gap is present. CPO uses patch-based color histograms for pose selection and thus considers larger spatial context compared to pixel-wise sampling loss. This allows for CPO to effectively overcome ambiguities that arise from repetitive scene structures and scene changes that are present in the Extreme split. We further validate the efficiency of histogram-based initialization against various initialization methods used in the baselines. In Table 7, we report the average runtime for processing a single synthetic view in milliseconds. The histogram based initialization used in CPO exhibits an order-of-magnitude shorter runtime than other competing methods. The effective utilization of spherical equivariance in fast histogram generation allows for efficient search within a wide range of poses and quickly generate 2D/3D score maps. Score Maps. We validate the effectiveness of the score maps for robust localization under scene changes on the Extreme split from the OmniScenes dataset [20]. Recall that we use the 2D score map for guiding candidate pose selection and the 3D score map for guiding pose refinement. We report evaluation results for variants of CPO that do not use either the 2D or 3D score map. As shown in Table 6, optimal performance is obtained by using both score maps. The score maps effectively attenuate scene changes, leading to stable pose estimation.


5 Conclusion

In this paper, we present CPO, a fast and robust algorithm for 2D panorama to 3D point cloud localization. To fully leverage the potential of panoramic images for localization, we account for possible scene changes by saving the



color distribution consistency in 2D, 3D score maps. The score maps effectively attenuate regions that contain changes and thus lead to more stable camera pose estimation. With the proposed fast histogram generation, the score maps are efficiently constructed and CPO can subsequently select promising initial poses for stable optimization. By effectively utilizing the holistic context in 2D and 3D, CPO achieves stable localization results across various datasets including scenes with changes. We expect CPO to be widely applied in practical localization scenarios where scene change is inevitable. Acknowledgements. This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. 2020R1C1C1008195), Creative-Pioneering Researchers Program through Seoul National University, and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2021-002068, Artificial Intelligence Innovation Hub).

References 1. Afifi, M., Barron, J.T., LeGendre, C., Tsai, Y.T., Bleibel, F.: Cross-camera convolutional color constancy. In: The IEEE International Conference on Computer Vision (ICCV) (2021) 2. Albanis, G., et al.: Pano3d: a holistic benchmark and a solid baseline for 360◦ depth estimation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3722–3732 (2021). https://doi.org/10. 1109/CVPRW53098.2021.00413 3. Arandjelovi´c, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016) 4. Armeni, I., Sax, S., Zamir, A.R., Savarese, S.: Joint 2d–3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105 (2017) 5. Badino, H., Huber, D., Kanade, T.: The CMU Visual Localization Data Set. http:// 3dvis.ri.cmu.edu/data-sets/localization (2011) 6. Campbell, D., Petersson, L., Kneip, L., Li, H.: Globally-optimal inlier set maximisation for camera pose and correspondence estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, June 2018. https://doi.org/10.1109/ TPAMI.2018.2848650 7. Campbell, D., Petersson, L., Kneip, L., Li, H., Gould, S.: The alignment of the spheres: globally-optimal spherical mixture alignment for camera pose estimation. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Long Beach, USA, June 2019 8. Coltuc, D., Bolon, P., Chassery, J.M.: Exact histogram specification. IEEE Trans. Image Process. 15, 1143–52 (2006). https://doi.org/10.1109/TIP.2005.864170 9. Dong, S., et al.: Robust neural routing through space partitions for camera relocalization in dynamic indoor environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8544–8554, June 2021 10. Dusmanu, M., et al.: D2-Net: a trainable CNN for joint detection and description of local features. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)



11. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM. 24(6), 381–395 (1981). http://dblp.uni-trier.de/db/journals/cacm/cacm24. htmlFischlerB81 12. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 13. Ge, Y., Wang, H., Zhu, F., Zhao, R., Li, H.: Self-supervising fine-grained region similarities for large-scale image localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 369–386. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8 22 14. Gee, A.P., Mayol-Cuevas, W.W.: 6d relocalisation for RGBD cameras using synthetic view regression. In: Bowden, R., Collomosse, J.P., Mikolajczyk, K. (eds.) British Machine Vision Conference, BMVC 2012, Surrey, UK, 3–7 September 2012, pp. 1–11. BMVA Press (2012). https://doi.org/10.5244/C.26.113 15. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall, Upper Saddle River (2008). http://www.amazon.com/Digital-Image-Processing-3rdEdition/dp/013168728X 16. Howard-Jenkins, H., Ruiz-Sarmiento, J.R., Prisacariu, V.A.: LaLaLoc: Latent layout localisation in dynamic, unvisited environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10107– 10116, October 2021 17. Humenberger, M., et al.: Robust image retrieval-based visual localization using kapture (2020) 18. Irschara, A., Zach, C., Frahm, J., Bischof, H.: From structure-from-motion point clouds to fast location recognition. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2599–2606 (2009). https://doi.org/10.1109/CVPR. 2009.5206587 19. Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for realtime 6-DOF camera relocalization (2015) 20. Kim, J., Choi, C., Jang, H., Kim, Y.M.: PICCOLO: point cloud-centric omnidirectional localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3313–3323, October 2021 21. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015). http://arxiv.org/abs/1412.6980 22. Li, X., Wang, S., Zhao, Y., Verbeek, J., Kannala, J.: Hierarchical scene coordinate classification and regression for visual localization. In: CVPR (2020) 23. Li, Y., Snavely, N., Huttenlocher, D.P.: Location recognition using prioritized feature matching. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 791–804. Springer, Heidelberg (2010). https://doi.org/10. 1007/978-3-642-15552-9 57 24. Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep photo style transfer. arXiv preprint arXiv:1703.07511 (2017) 25. Maddern, W., Pascoe, G., Gadd, M., Barnes, D., Yeomans, B., Newman, P.: Real-time kinematic ground truth for the oxford robotcar dataset. arXiv preprint arXiv: 2002.10152 (2020), http://arxiv.org/pdf/2002.10152



26. Namin, S., Najafi, M., Salzmann, M., Petersson, L.: A multi-modal graphical model for scene analysis. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1006–1013. IEEE Computer Society, Los Alamitos, CA, USA (2015). https://doi.org/10.1109/WACV.2015.139, http://doi.ieeecomputersociety. org/10.1109/WACV.2015.139 27. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alch´e-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035. Curran Associates, Inc (2019). http://papers.neurips.cc/paper/9015pytorch-an-imperative-style-high-performance-deep-learning-library.pdf 28. Rodenberg, O.B.P.M., Verbree, E., Zlatanova, S.: Indoor A* Pathfinding Through an Octree Representation of a Point Cloud. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, IV21, pp. 249–255, October 2016. https://doi.org/10.5194/isprs-annals-IV-2-W1-249-2016 29. Sarlin, P.E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: robust hierarchical localization at large scale. In: CVPR (2019) 30. Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: Learning feature matching with graph neural networks. In: CVPR (2020) 31. Sattler, T., Leibe, B., Kobbelt, L.: Improving image-based localization by active correspondence search. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 752–765. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33718-5 54 32. Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell. 39(9), 1744–1756 (2017) 33. Sattler, T., et al.: Benchmarking 6DOF outdoor visual localization in changing conditions. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 34. Taira, H., et al.: InLoc: Indoor visual localization with dense matching and view synthesis. In: CVPR 2018 - IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, United States, June 2018. http://hal.archivesouvertes.fr/hal-01859637 35. Walch, F., Hazirbas, C., Leal-Taixe, L., Sattler, T., Hilsenbeck, S., Cremers, D.: Image-based localization using LSTMS for structured feature correlation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017 36. Wald, J., Sattler, T., Golodetz, S., Cavallari, T., Tombari, F.: Beyond controlled environments: 3d camera re-localization in changing indoor scenes. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 467–487. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6 28 37. Zhang, C., Budvytis, I., Liwicki, S., Cipolla, R.: Rotation equivariant orientation estimation for omnidirectional localization. In: ACCV (2020) 38. Zheng, J., et al.: Structured3D: a large photo-realistic dataset for structured 3d modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 519–535. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-58545-7 30 39. Zhou, Q., Sattler, T., Leal-Taixe, L.: Patch2pix: epipolar-guided pixel-level correspondences. In: CVPR (2021) 40. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)

INT: Towards Infinite-Frames 3D Detection with an Efficient Framework

Jianyun Xu, Zhenwei Miao(B), Da Zhang, Hongyu Pan, Kaixuan Liu, Peihan Hao, Jun Zhu, Zhengyang Sun, Hongmin Li, and Xin Zhan

Alibaba Group, Hangzhou, China
{xujianyun.xjy,zhenwei.mzw}@alibaba-inc.com

Abstract. It is natural to construct a multi-frame rather than a single-frame 3D detector for a continuous-time stream. Although increasing the number of frames might improve performance, previous multi-frame studies only used a very limited number of frames to build their systems due to the dramatically increased computational and memory cost. To address these issues, we propose a novel on-stream training and prediction framework that, in theory, can employ an infinite number of frames while keeping the same amount of computation as a single-frame detector. This infinite framework (INT), which can be used with most existing detectors, is utilized, for example, on the popular CenterPoint, with significant latency reductions and performance improvements. We have also conducted extensive experiments on two large-scale datasets, nuScenes and Waymo Open Dataset, to demonstrate the scheme's effectiveness and efficiency. By employing INT on CenterPoint, we can get around a 7% (Waymo) and 15% (nuScenes) performance boost with only 2–4 ms of latency overhead, and are currently SOTA on the Waymo 3D Detection leaderboard.

Keywords: Infinite · Multi-frame · 3D detection · Efficient · Pointcloud


1 Introduction

3D object detection from pointclouds has been proven as a viable robotics vision solution, particularly in autonomous driving applications. Many single-frame 3D detectors [10,15,20,23,26,27,36,39,42] have been developed to meet the real-time requirement of online systems. Nevertheless, it is more natural for a continuous-time system to adopt multi-frame detectors that can fully take advantage of the time-sequence information. However, as far as we know, few multi-frame 3D detectors are available for long frame sequences due to the heavy computation and memory burden. It is desirable to propose a concise real-time long-sequence 3D detection framework with promising performance. Existing works [5,11,22,32,37–39] demonstrate that multi-frame models yield performance gains over single-frame ones. However, these approaches require



Fig. 1. Impact of frames used in detectors on Waymo val set. While CenterPoint’s performance improves as the number of frames grows, the latency also increases dramatically. On the other hand, our INT keeps the same latency while increasing frames.

loading all the used frames at once during training, resulting in very limited frames being used due to computational, memory, or optimization difficulties. Taking the SOTA detector CenterPoint [39] as an example, it only uses two frames [39] on the Waymo Open Dataset [29]. While increasing the number of frames can boost the performance, it also leads to significant latency burst as shown in Fig. 1. Memory overflow occurs if we keep increasing the frames of CenterPoint, making both training and inference impossible. As a result, we believe that the number of frames used is the bottleneck preventing multi-frame development, and we intend to break through this barrier first. There are two major problems that limit the number of frames in a multiframe detector: 1) repeated computation. Most of the current multi-frame frameworks have a lot of repeated calculations or redundant data that causes computational spikes or memory overflow; 2) optimization difficulty. Some multi-frame systems have longer gradient conduction links as the number of frames increases, introducing optimization difficulties. To alleviate the above problems, we propose INT (short for infinite), an onstream system that theoretically allows training and prediction to utilize infinite number of frames. INT contains two primary components: 1) a Memory Bank (MB) for temporal information fusion and 2) a Dynamic Training Sequence Length (DTSL) strategy for on-stream training. The MB is a place to store the recursively updated historical information so that we don’t have to compute past frames’ features repeatedly. As a result, it only requires a small amount of memory but can fuse infinite data frames. To tackle the problem of optimization difficulty, we truncate the gradient of back propagation to the MB during training. However, an inherent flaw of iteratively updating MB on a training stream is that historical and current information are not given by the same model parameters, leading to training issues. To solve this problem, DTSL is employed. The primary idea of DTSL is to start with short sequences and quickly clear the MB to avoid excessive inconsistency between historical and current data; then gradually lengthen the sequence as training progresses since the gap between historical and current model parameters becomes negligible. To make INT feasible, we propose three modules: SeqFusion, SeqSampler and SeqAug. SeqFusion is a module in MB for temporal fusion that proposes multiple



fusion methods for two types of data commonly used in pointcloud detection, i.e., point-style and image-style. SeqSampler is a sequence index generator that DTSL uses to generate training sequences of different lengths at each epoch. Finally, SeqAug is a data augmentation for on-stream training, capable of maintaining the same random state on the same stream. Our contributions can be summarized as:

– We present INT, an on-stream multi-frame system made up of MB and DTSL that can theoretically be trained and predicted using infinite frames while consuming similar computation and memory as a single-frame system.
– We propose three modules, SeqFusion, SeqSampler and SeqAug, to make INT feasible.
– We conduct extensive experiments on nuScenes and Waymo Open Dataset to illustrate the effectiveness and efficiency of INT.


2 Related Work

2.1 3D Object Detection

Recent studies on 3D object detection can be broadly divided into three categories: LiDAR-based [10,15,16,19,23,27,35,36,39,42,43], image-based [4,14,17,31], and fusion-based [3,18,22,24,30,40,41]. Here we focus on LiDAR-based schemes. According to the views of pointclouds, 3D detectors can be classified as pointbased, voxel-based, range-based, and hybrid. PointRCNN [27] and VoteNet [23] are two representative point-based methods that use a structure like PointNet++ [25] to extract point-by-point features. These schemes are characterized by better preservation of pointclouds’ original geometric information. Still, they are sensitive to the number of pointclouds that pose serious latency and memory issues. In contrast, voxel-based solutions, such as VoxelNet [42], PointPillars [15], Second [36], CenterPoint [39] and AFDetV2 [10] are less sensitive to the number of pointclouds. They convert the pointcloud into 2D pillars or 3D voxels first, then extract features using 2D or 3D (sparse) convolution, making them easier to deploy. Another category is rangeview-based schemes [6,16,19] that perform feature extraction and target prediction on an efficient unfolded spherical projection view. Meanwhile, they contend with target scale variation and occlusion issues, resulting in performance that is generally inferior to that of voxel-based schemes. Hybrid methods [20,26,35,43] attempt to integrate features from several views to collect complementing information and enhance performance. We define two data styles according to the data format to facilitate the analysis of different detectors: – Image-style. Well-organized data in 2D, 3D, or 4D dimensions similar to that of an image. – Point-style. Disorganized data such as pointclouds and sparse 3D voxels. The data in pointcloud-based detectors, including input, intermediate feature, and output, is generally point-style or image-style. We design the fusion



Fig. 2. Training phase of different multi-frame schemes. Operations inside dash rectangle either involve repetitive computation or raise memory burden, which leads to a very limited frame number for training.

algorithms for both point-style and image-style data in INT's Memory Bank, so that INT can be employed in most 3D detectors. In this work, we choose the recently popular CenterPoint [39] as the baseline for our studies, since it performs well in terms of both efficiency and performance, and it is currently scoring at the SOTA level on the large-scale open datasets nuScenes [2] and Waymo Open Dataset [29].

2.2 Multi-frame Methods

There have been a variety of LSTM-based techniques in the field of video object detection, such as [7,13,33]. Transformer-based video detection schemes have recently emerged [9], however transformer may not be suitable for working onstream because it naturally needs to compute the relationship between all frames, which implies a lot of repeated computations. Recent methods for multi-frame pointclouds can be roughly divided into three categories as shown in Fig. 2(a), (b) and (c). [2,22,36] directly concatenate multi-frame pointclouds and introduce a channel indicating their relative timestamps, as shown in Fig. 2(a). While this method is simple and effective, it involves a lot of unnecessary computations and increases the memory burden, making it unsuitable for more frames. Instead of merging at the point level, MinkowskiNet [5] and MotionNet [32] combine multiple frames at the feature map level. They must voxelize multi-frame pointclouds independently before stacking the feature maps together and extracting spatio-temporal information using 3D or 4D convolution, as depicted in Fig. 2(b). Obviously, this approach requires repeated data processing and is memory intensive, thus the number of frames is very limited. To overcome above difficulties, 2020An [11] and 3DVID [38] proposed LSTM or GRU-based solutions to solve the computational and memory issues in the inference phase. However, the gradient transfer to the history frames still results in



a considerable memory overhead and introduce optimization difficulties during training, so the number of frames cannot be high, as shown in Fig. 2(c). To handle the problem more thoroughly, we propose computing the gradient for the current data and not for the historical data during training, as shown in Fig. 2(d). We then employ a Dynamic Training Sequence Length (DTSL) strategy to eliminate the potential information inconsistency problem. Similarly, 3D-MAN [37] stores historical information in a Memory Bank that does not participate in the gradient calculation. However, 3D-MAN needs to store a fixed number of frames of historical proposals and feature maps, which increases the amount of memory required for its training as the number of frames increases. To get around this problem, we propose recursively updating the Memory Bank’s historical information. To the best of our knowledge, we are the first 3D multi-frame system that can be trained and inferred with infinite frames.


3 Methodology

We present the INT framework in this section. The overall architecture is detailed in Sect. 3.1. Section 3.2 gives the sequence fusion (SeqFusion) methods of the Memory Bank. Sections 3.3 and 3.4 illustrate the training strategies, including the sequence sampler (SeqSampler) and sequence data augmentation (SeqAug), respectively.

3.1 Overview of INT Framework

The INT framework in Fig. 3 is highly compact. The main body consists of a single-frame detector and a recursively updated Memory Bank (MB). The Dynamic Training Sequence Length (DTSL) below serves as a training strategy that is not needed for inference (inference only needs the pointcloud input in chronological order). In addition, there are no special requirements for single-frame detector selection; for example, any detector listed in Sect. 2.1 can be utilized.
Memory Bank (MB). The primary distinction between INT and a regular multi-frame detector is the MB, which stores historical data so that we do not have to compute past features repeatedly. The MB is comparable to the hidden state of an LSTM, while being more flexible, interpretable, and customizable. The user has complete control over where and what information should be saved or retrieved. For example, in Sect. 3.2, we show how to store and update several forms of data. Furthermore, we choose to update the MB recursively to solve the problem of excessive memory cost. To tackle the problem of optimization difficulty, we truncate the gradient of backpropagation to the MB during training.
Dynamic Training Sequence Length (DTSL). A problem with training INT on-stream is that the information gained from the current observation is not derived with the same model parameters as the past data in the MB. This could lead to inconsistencies in training and prediction, which is one of the key
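The following is a minimal sketch of such a recursively updated, gradient-truncated Memory Bank. It is our own illustration, not the authors' implementation: the class name, keys, and the plain dict storage are assumptions; only the recursive-overwrite and gradient-truncation behavior follows the description above.

import torch

class MemoryBank:
    """Minimal recursive memory bank: keeps one fused history entry per key and
    detaches it from the autograd graph so gradients never flow into past frames."""

    def __init__(self):
        self.store = {}  # e.g. {"feature_map": Tensor, "foreground_points": Tensor}

    def read(self, key):
        return self.store.get(key)              # None on the first frame of a sequence

    def update(self, key, fused_value):
        # Recursive update: the newly fused result replaces the old entry, so memory
        # stays constant no matter how many frames have been processed.
        self.store[key] = fused_value.detach()  # gradient truncation towards history

    def reset(self):
        self.store.clear()                      # called at sequence boundaries

In each frame the detector reads the history, fuses it with the current feature, and writes the detached result back, so training cost does not grow with the number of frames.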


Fig. 3. Overview of INT framework. It consists of a single-frame detector and a Memory Bank. The Dynamic Training Sequence Length below serves as a training strategy.

reasons why prior multi-frame work was not trained on stream. To solve this problem, we offer the DTSL: beginning with a small sequence length and gradually increasing it, as indicated at the bottom of Fig. 3. This is based on the following observation: as the number of training steps increases, model parameter updates get slower, and the difference in information acquired from different model parameters becomes essentially trivial. As a result, when the model parameters are updated quickly, the training sequence should be short so that the Memory Bank can be cleaned up in time. Once the training is stable, the sequence length can be increased with confidence. DTSL could be defined in a variety of ways, one of which is as follows:

DTSL = max(1, lmax · min(1, max(0, 2 · (epcur / epall − 0.5))))    (1)

where lmax is the maximum training sequence length, and epcur and epall are the current epoch and the total number of epochs, respectively.
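As a concrete illustration of Eq. (1), a minimal sketch of the schedule follows (the function and variable names are ours): the sequence length stays at 1 for the first half of training and then grows linearly to the maximum.

def dtsl(ep_cur: int, ep_all: int, l_max: int) -> int:
    """Dynamic Training Sequence Length of Eq. (1).

    Returns 1 during the first half of training (fast parameter updates),
    then grows linearly towards l_max during the second half.
    """
    progress = 2.0 * (ep_cur / ep_all - 0.5)   # < 0 in the first half, 0..1 in the second
    return max(1, int(l_max * min(1.0, max(0.0, progress))))

# Example: with l_max = 10 and 20 epochs in total,
# epochs 0-9 train with length 1, epoch 15 with length 5, epoch 20 with length 10.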

3.2 SeqFusion

Temporal fusion in the Memory Bank is critical to the INT framework. As the data in a 3D detector is either point-style or image-style, as indicated in Sect. 2.1, we develop both point-style and image-style fusion algorithms. In general, original pointclouds, sparse voxels, predicted objects, etc., fall into the point-style category, whereas dense voxels, intermediate feature maps, final prediction maps, etc., fall into the image-style category.
Point-Style Fusion. Here we propose a general and straightforward practice: concatenating past point-style data directly with the present data, using a channel


to identify the temporal relationship. The historical point-style data is kept in a fixed-length FIFO queue: as new observations arrive, foreground data is pushed into the queue and the oldest data is popped out. According to the ego-vehicle poses, the position information in the history data must be spatially transformed before fusion to remove the influence of ego motion. The point-style fusion is formulated as:

Trel = Tcur^{-1} · Tlast,    (2)
Pf = PointConcat(Pcur, Trel · Plast)    (3)

where Tlast and Tcur are the last and current frame's ego-vehicle poses, respectively, and Trel is the resulting relative pose between the two frames. Plast refers to the past point-style data in the Memory Bank and Pf is the fusion of Plast and the current Pcur.
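A minimal NumPy sketch of this point-style fusion is given below. It assumes 4x4 homogeneous ego poses and points stored as (x, y, z, ...) rows; the FIFO length, the time-lag channel, and the helper names are our own illustrative choices rather than details from the paper.

import numpy as np
from collections import deque

class PointMemory:
    def __init__(self, max_frames: int = 10):
        self.queue = deque(maxlen=max_frames)   # FIFO of (points, ego_pose) per past frame

    def push(self, foreground_points: np.ndarray, ego_pose: np.ndarray):
        self.queue.append((foreground_points, ego_pose))

    def fuse(self, points_cur: np.ndarray, pose_cur: np.ndarray) -> np.ndarray:
        """Eqs. (2)-(3): warp history into the current frame and concatenate,
        adding a channel that encodes how many frames ago each point was observed."""
        fused = [np.concatenate([points_cur, np.zeros((len(points_cur), 1))], axis=1)]
        for lag, (pts, pose_last) in enumerate(reversed(self.queue), start=1):
            t_rel = np.linalg.inv(pose_cur) @ pose_last          # Trel = Tcur^-1 · Tlast
            xyz1 = np.concatenate([pts[:, :3], np.ones((len(pts), 1))], axis=1)
            warped = (t_rel @ xyz1.T).T[:, :3]                   # past points in current coords
            extra = np.concatenate([pts[:, 3:], np.full((len(pts), 1), lag)], axis=1)
            fused.append(np.concatenate([warped, extra], axis=1))
        return np.concatenate(fused, axis=0)                     # Pf = PointConcat(...)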


Fig. 4. Four temporal fusion methods for image-style data. Occupancy Mask and Occupancy Count in Add and Max are used to distinguish different moments.

Image-Style Fusion. We propose four fusion algorithms for the image-style data, including Add, Max, Concat and GRU-like, as depicted in Fig. 4(a), (b), (c) and (d). Since the historical and current image-style data have identical dimensions thanks to the recursive update, Add and Max are simple in design and implementation, and their computational overhead is low. We also devise a Concat fusion approach, in which both the historical and the current feature channels are first compressed to 1/2 of the original and then concatenated along the channel dimension. To investigate the impact of long-term data, we develop a GRU-like fusion method with learnable parameters that selects which data should be kept and which should be discarded. To eliminate the effect of ego-vehicle motion, historical image-style data must be spatially transformed first. The image-style fusion process can be summarized as follows:

Ĩlast = Fsample(Ilast, Faffine(Trel, s)),    (4)
If = Fusion(Icur, Ĩlast)    (5)

where Trel is the relative pose defined in Eq. (2). Faffine(·) and Fsample(·) refer to the affine-grid and grid-sample operations, respectively, proposed in [12] for image-style data transformation. Ilast refers to the past image-style data in the Memory Bank, If is the fusion of Ilast and the current Icur, and s is the shape of Ilast. Fusion(·) can be Add, Max, Concat or GRU-like, as shown in Fig. 4.
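The following PyTorch sketch shows the warp-then-fuse pattern of Eqs. (4)–(5) using torch.nn.functional.affine_grid and grid_sample; the Concat variant with 1x1 convolutions follows the description above. The theta argument is assumed to already express the 2D rigid transform derived from Trel in the normalized coordinates expected by affine_grid; this is our simplification, not the paper's exact interface.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):
    """Concat variant: compress both maps to C/2 channels with 1x1 convs, then concatenate."""
    def __init__(self, channels: int):
        super().__init__()
        self.squeeze_cur = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.squeeze_last = nn.Conv2d(channels, channels // 2, kernel_size=1)

    def forward(self, i_cur, i_last_warped):
        return torch.cat([self.squeeze_cur(i_cur), self.squeeze_last(i_last_warped)], dim=1)

def fuse_image_style(i_cur, i_last, theta, mode="max", concat_module=None):
    """Eqs. (4)-(5): spatially align the stored map, then fuse it with the current map.
    theta: (N, 2, 3) affine matrix in normalized coordinates (derived from Trel)."""
    grid = F.affine_grid(theta, size=i_last.shape, align_corners=False)
    i_last_warped = F.grid_sample(i_last, grid, align_corners=False)   # ~ Fsample(Ilast, Faffine(Trel, s))
    if mode == "add":
        return i_cur + i_last_warped
    if mode == "max":
        return torch.maximum(i_cur, i_last_warped)
    if mode == "concat":
        return concat_module(i_cur, i_last_warped)
    raise ValueError(f"unknown fusion mode: {mode}")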

3.3 SeqSampler

SeqSampler is the key to training INT in an infinite-frames manner. It is designed to split the original sequences into segments of a target length and then generate their indices in order. If a sequence is infinitely long, training or inference can go on infinitely. DTSL is realized by executing SeqSampler with a different target length for each epoch. The length of the original sequences in a dataset generally varies, as in the Waymo Open Dataset [29], where sequence lengths are around 200 frames. Certain datasets, such as nuScenes [2], may also be labeled at intervals, with one frame annotated every ten frames. As a result, SeqSampler is designed under the assumption that the source sequences are of non-fixed length and may be annotated at intervals. The procedure of SeqSampler consists of two simple steps, Sequence Sort and Sequence Split, as indicated in Fig. 5. In Sequence Sort, we rearrange the randomly ordered input samples by sequence. In Sequence Split, we then split them into segments of the target length, and may pad some of them to meet the batch or iteration demands.

Fig. 5. An example of SeqSampler. There are two sequences: seq1 contains 5 frames and seq2 has 3, both are interval labeled. Given the desired batch size 2 and target length 4, we need to get the final iteration indices. First, the two sequences are sorted separately. Then, the original sequences are splitted to 3 segments in the target length, and a segment is randomly replicated (dashed rectangles) to guarantee that both batches have the same number of samples.
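A minimal sketch of the sort-and-split logic, mirroring the Fig. 5 example, could look as follows. The function and argument names are illustrative, and padding is done by replicating a random segment, as described in the caption.

import random
from collections import defaultdict

def seq_sampler(samples, target_len, batch_size):
    """samples: list of (sequence_id, frame_index) pairs in arbitrary order.
    Returns a list of segments (each a list of samples) for batched, in-order replay."""
    # Sequence Sort: group by sequence and restore chronological order.
    per_seq = defaultdict(list)
    for seq_id, frame_idx in samples:
        per_seq[seq_id].append((seq_id, frame_idx))
    for seq_id in per_seq:
        per_seq[seq_id].sort(key=lambda s: s[1])

    # Sequence Split: cut every sequence into segments of at most target_len frames.
    segments = []
    for frames in per_seq.values():
        for start in range(0, len(frames), target_len):
            segments.append(frames[start:start + target_len])

    # Pad with randomly replicated segments so every batch is full.
    while len(segments) % batch_size != 0:
        segments.append(random.choice(segments))
    return segments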

3.4 SeqAug

Data augmentation has been successful in many recent 3D detectors [15,36,39]. However, because of the shift in training paradigm, our proposed INT framework cannot directly reuse existing, validated data augmentation methods.


One of the main reasons for this is that INT is trained on a stream with a clear association between consecutive frames, whereas data augmentation is typically random, so consecutive frames could undergo different augmentation procedures. To solve this problem and allow INT to benefit from data augmentation, we must ensure that a given data augmentation method maintains the same random state across the whole stream. We term data augmentation that meets this condition SeqAug. According to the data augmentation methods widely employed in pointcloud detection, SeqAug can be split into two categories: Sequence Point Transformation (flipping, rotation, translation, scaling, and so on) and Sequence GtAug (copy-and-paste of ground-truth pointclouds).
Sequence Point Transformation. If a pointcloud is successively augmented by flipping Tf, rotation Tr, scaling Ts, and translation Tt, the other frames in the same stream must keep the same random state to establish a reasonable temporal relationship. In addition, because of these transformations, Trel in Eq. (2) must be recalculated:

Trel = Tt · Ts · Tr · Tf · Tcur^{-1} · Tlast · Tf · Tr^{-1} · Ts^{-1} · Tt^{-1}    (6)

where Tlast and Tcur are the last and current frame's ego-vehicle poses, as before.
Sequence GtAug. Similarly, the random states of the same stream must be recorded to ensure that sequential objects from the ground-truth database can be copied and pasted consistently across consecutive frames, as shown in Fig. 6.
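A small NumPy sketch of Eq. (6) under these assumptions: all transforms are 4x4 homogeneous matrices and the same sampled augmentation (shared random state) is applied to every frame of a stream. The function name is ours.

import numpy as np

def relative_pose_after_aug(t_cur, t_last, t_flip, t_rot, t_scale, t_trans):
    """Eq. (6): relative pose between two augmented frames of the same stream.
    All arguments are 4x4 homogeneous matrices; the augmentation matrices must be
    sampled once per stream (shared random state) and reused for every frame."""
    aug = t_trans @ t_scale @ t_rot @ t_flip    # combined per-stream augmentation
    aug_inv = np.linalg.inv(aug)                # = Tf · Tr^-1 · Ts^-1 · Tt^-1 (a flip is its own inverse)
    return aug @ np.linalg.inv(t_cur) @ t_last @ aug_inv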

Fig. 6. An example of Sequence GtAug. The colors of pointclouds represent different moments, with red being the current frame. (Color figure online)

4 Experiments

In this paper, we build the proposed INT framework based on the highly competitive and popular detector CenterPoint [39]. In the following sections, we first briefly introduce the datasets in Sect. 4.1, followed by a description of a few critical experimental setups in Sect. 4.2. The efficiency and effectiveness of the INT


framework are then illustrated in Sect. 4.3 by comparing it to the baseline CenterPoint, followed by Sect. 4.4, which compares the results of INT on the Waymo test set with other SOTAs. Finally, Sect. 4.5 presents several ablation experiments on INT.

4.1 Datasets

This section briefly describes the two open datasets used in this paper.
Waymo Open Dataset. Waymo [29] comprises 798, 202 and 150 sequences for training, validation and test, respectively. Each sequence lasts around 20 s and contains about 200 frames. There are three categories for detection: VEHICLE, PEDESTRIAN, and CYCLIST. The mean Average Precision (mAP) and mAP weighted by heading accuracy (mAPH) are the official 3D detection metrics. There are two degrees of difficulty: LEVEL 1 for boxes with more than five LiDAR points, and LEVEL 2 for boxes with at least one LiDAR point. In this paper, we report the officially prescribed mAPH on LEVEL 2 by default.
nuScenes. There are 1000 driving sequences in nuScenes [2], with 700, 150, and 150 for training, validation, and testing, respectively. Each sequence lasts about 20 s and has a LiDAR frequency of 20 frames per second. The primary metrics for 3D detection are mean Average Precision (mAP) and the nuScenes Detection Score (NDS). NDS is a weighted average of mAP and other attribute measurements such as translation, scale, orientation, velocity, and other box properties. In this study, we employ mAP and NDS as evaluation metrics.

4.2 Experimental Settings

We employ the same network designs and training schedules as CenterPoint [39] and keep the positive and negative sample strategies, post-processing settings, loss functions, etc., unchanged.
Backbone Settings. VoxelNet [36,42] and PointPillars [15] are the two 3D encoders used by CenterPoint, dubbed CenterPoint-Voxel and CenterPoint-Pillar, respectively. Our INT also experiments with these two backbones, which correspond to INT-Voxel and INT-Pillar, respectively.
Frame Settings. Although INT can be trained and inferred on an infinite number of frames, the sequence length of an actual dataset is finite. To facilitate comparison with previous work and demonstrate the benefits of the INT framework, we select a few specific frame counts on nuScenes and the Waymo Open Dataset. We use 10, 20 and 100 training frames on nuScenes, and 10 training frames on the Waymo Open Dataset.
Fusion Settings. In INT, we choose three kinds of data to add to the Memory Bank: the point-style foreground pointcloud, the image-style intermediate feature map, and the image-style final prediction map. For the foreground pointcloud, we fuse the historical points with the current points during the input phase and then update them based on the predictions at the end of the network. For the intermediate feature


map, we fuse and update its historical information at the same position before the Region Proposal Network. For the final prediction map, we fuse and update the historical information just before the detection head. Appendix A.4 takes INT-Voxel as a typical example to provide a more specific explanation.
Latency Settings. To test the network's actual latency, we remove redundant parts of the data processing in CenterPoint [39] and keep only the data IO and memory transfer (to the GPU) operations. We move the essential voxelization component to the GPU to limit the CPU's influence. We also enable data prefetching in the dataloader to lessen the IO effect and run latency tests once timing has stabilized. Finally, the following test environment is used: CUDA 10.2, cuDNN 7.6.5, GeForce RTX 2070 SUPER, driver version 460.91.03.

4.3 Effectiveness and Efficiency

As shown in Tables 1 and 2, we first compare against the baseline CenterPoint to demonstrate the paper's main point, i.e., the effectiveness and efficiency of INT. In these two tables, we try two types of backbone, termed E-PointPillars and E-VoxelNet in the columns, and latency is reported in milliseconds. The settings of INT can be found in Tables 4 and 5. CenterPoint with multiple frames refers to concatenating multi-frame pointclouds at the input level, which introduces repetitive computation and an additional memory burden, as analyzed in Sect. 2.2. For example, the two-frame CenterPoint in Table 1 increases the latency by around 20 ms compared to its single-frame counterpart (more details in Appendix A.6). In contrast, the latency of INT is unaffected by the number of frames used, and its performance is much better than that of multi-frame CenterPoint as the number of frames grows. As can be observed in Tables 1 and 2, INT shows significant improvements in both latency and performance, gaining around a 7% mAPH (Waymo) and 15% NDS (nuScenes) boost while only adding 2–4 ms of delay compared to single-frame CenterPoint.

4.4 Comparison with SOTAs

In this section, we compare the results of INT on the Waymo test set with those of other SOTA schemes, as shown in Table 3. The approaches are divided into two groups in the table: single-frame and multi-frame. It can be seen that the multi-frame schemes are generally superior to the single-frame ones. Most multi-frame approaches employ the original pointcloud concatenation [10,28,39], and only a few frames are used due to computational and memory constraints. Finally, our proposed INT scheme outperforms the other SOTA schemes by a large margin. As far as we know, INT is the best non-ensemble approach on the Waymo Open Dataset leaderboard1.

1 https://waymo.com/open/challenges/2020/3d-detection/.


Table 1. Effectiveness and efficiency of INT on the Waymo Open Dataset val set. The APH of L2 difficulty is reported. The "-2s" suffix in the rows denotes the two-stage model. CenterPoint's mAPH results are obtained from the official website, except for those with a *, which are missing from the official results and were reproduced by us.

Methods | Frames | E-PointPillars (VEH↑ / PED↑ / CYC↑ / mAPH↑ / Latency↓) | E-VoxelNet (VEH↑ / PED↑ / CYC↑ / mAPH↑ / Latency↓)
CenterPoint | 1 | 65.5 / 55.1 / 60.2 / 60.3 / 57.7 | 66.2 / 62.6 / 67.6 / 65.5 / 71.7
CenterPoint | 2 | 66.6* / 61.9* / 62.3* / 63.6* / 77.8 | 67.3 / 67.5 / 69.9 / 68.2 / 90.9
INT (ours) | 2 | 66.2 / 60.4 / 64.4 / 63.7 / 61.6 | 69.4 / 69.1 / 72.6 / 70.3 / 74.0
INT (ours) | 10 | 69.6 / 66.3 / 65.7 / 67.2 / 61.6 | 72.2 / 72.1 / 75.3 / 73.2 / 74.0
CenterPoint-2s | 1 | 66.7 / 55.9 / 61.7 / 61.4 / 61.7 | 67.9 / 65.6 / 68.6 / 67.4 / 76.6
CenterPoint-2s | 2 | 68.4* / 63.0* / 64.3* / 65.2* / 82.9 | 69.7 / 70.3 / 70.9 / 70.3 / 95.8
INT-2s (ours) | 2 | 67.9 / 61.7 / 66.0 / 65.2 / 65.9 | 70.8 / 68.7 / 73.1 / 70.8 / 78.9
INT-2s (ours) | 10 | 70.8 / 67.0 / 68.1 / 68.6 / 65.9 | 73.3 / 71.9 / 75.6 / 73.6 / 78.9

Table 2. Effectiveness and efficiency of INT on the nuScenes val set. CenterPoint's mAP and NDS results are obtained from the official website, except for those with a *, which are missing from the official results and were reproduced by us.

Methods | Frames | E-PointPillars (mAP↑ / NDS↑ / Latency↓) | E-VoxelNet (mAP↑ / NDS↑ / Latency↓)
CenterPoint | 1 | 42.5* / 46.4* / 39.2 | 49.7* / 50.7* / 81.1
CenterPoint | 10 | 50.3 / 60.2 / 49.4 | 59.6 / 66.8 / 117.2
INT (ours) | 10 | 49.3 / 59.9 / 43.0 | 58.5 / 65.5 / 84.1
INT (ours) | 20 | 50.7 / 61.0 / 43.0 | 60.9 / 66.9 / 84.1
INT (ours) | 100 | 52.3 / 61.8 / 43.0 | 61.8 / 67.3 / 84.1

Table 3. Comparison with SOTAs on the Waymo Open Dataset test set. We only present the non-ensemble approaches, and INT is currently the best non-ensemble solution on the Waymo Open Dataset leaderboard1, to the best of our knowledge. Accessed on 2 March 2022.

Methods | Frames | VEH-APH↑ (L1 / L2) | PED-APH↑ (L1 / L2) | CYC-APH↑ (L1 / L2) | mAPH↑ (L1 / L2)
StarNet [21] | 1 | 61.0 / 54.5 | 59.9 / 54.0 | - / - | - / -
PointPillars [15] | 1 | 68.1 / 60.1 | 55.5 / 50.1 | - / - | - / -
RCD [1] | 1 | 71.6 / 64.7 | - / - | - / - | - / -
M3DeTR [8] | 1 | 77.2 / 70.1 | 58.9 / 52.4 | 65.7 / 63.8 | 67.1 / 61.9
HIK-LiDAR [34] | 1 | 78.1 / 70.6 | 69.9 / 64.1 | 69.7 / 67.2 | 72.6 / 67.3
CenterPoint [39] | 1 | 79.7 / 71.8 | 72.1 / 66.4 | - / - | - / -
3D-MAN [37] | 15 | 78.3 / 70.0 | 66.0 / 60.3 | - / - | - / -
RSN [28] | 3 | 80.3 / 71.6 | 75.6 / 67.8 | - / - | - / -
CenterPoint [39] | 2 | 80.6 / 73.0 | 77.3 / 71.5 | 73.7 / 71.3 | 77.2 / 71.9
CenterPoint++ [39] | 3 | 82.3 / 75.1 | 78.2 / 72.4 | 73.3 / 71.1 | 78.0 / 72.8
AFDetV2 [10] | 2 | 81.2 / 73.9 | 78.1 / 72.4 | 75.4 / 73.0 | 78.2 / 73.1
INT (ours) | 10 | 83.1 / 76.2 | 78.5 / 72.8 | 74.8 / 72.7 | 78.8 / 73.9
INT (ours) | 100 | 84.3 / 77.6 | 79.7 / 74.0 | 76.3 / 74.1 | 80.1 / 75.2

4.5 Ablation Studies

Impact of Different Fusion Data. This section investigates the impact of the various fusion data used in INT. We use one kind of point-style data, the foreground pointcloud, and two kinds of image-style data, the intermediate feature map before the RPN and the final prediction map. The fusion method for the foreground pointcloud is termed PC Fusion and is explained in Sect. 3.2. The fusion method for the intermediate feature map and the final prediction map is Concat, as described in Sect. 3.2, and these are named FM Fusion and PM Fusion, respectively. The fusion results of these three kinds of data on the Waymo Open Dataset and nuScenes are shown in Tables 4 and 5. First, the performance columns of the tables show that all three kinds of fusion data bring considerable performance boosts, with PC Fusion giving the largest gain. Second, according to the latency columns, the added time cost of the different fusion data is relatively small, which is very cost-effective given the performance benefit.

Table 4. Impact of different fusion data on the Waymo Open Dataset val set. By default, the training sequence length was set to 10 frames. To indicate how the final result in Table 1 is obtained, we also add a column called "Two Stage". Columns: PC fusion, FM fusion, PM fusion, Two stage; E-PointPillars and E-VoxelNet (VEH↑, PED↑, CYC↑, mAPH↑, Latency↓).

















65.5

55.1

60.2

60.3

57.4

66.2

62.6

67.6

65.5

71.5

68.1

65.8

65.4

66.4

59.5

71.7

70.8

74.2

72.3

72.7

63.6

63.3

64.7

63.8

59.0

66.1

67.3

73.8

69.1

72.2

66.4

64.0

64.2

64.5

58.2

67.7

68.1

74.1

70.0

72.0

69.5

66.8

64.8

67.0

60.9

72.0

71.8

76.5

73.5

73.3

69.6

66.3

65.7

67.2

61.6

72.2

72.1

76.1

73.5

74.0

70.8

67.0

68.1

68.6

65.9

73.3

71.9

75.6

73.6

78.9

Table 5. Impact of different fusion data on the nuScenes val set. By default, the training sequence length was set to 10 frames. To indicate how the final result in Table 2 is obtained, we also add a column called "100 frames". Columns: PC fusion, FM fusion, PM fusion, 100 frames; E-PointPillars and E-VoxelNet (mAP, NDS, Latency).

















42.5

46.4

39.2

49.7

50.7

81.1

47.1

58.5

41.2

56.6

64.5

82.4

48.4

57.4

40.5

55.9

56.6

82.1

45.3

56.0

39.7

53.9

62.7

82.0

48.7

59.8

42.4

58.4

65.2

83.3

49.3

59.9

43.0

58.5

65.5

84.1

52.3

61.8 43.0

61.8

67.3 84.1


Impact of Training Sequence Length. The length indicated here actually refers to the maximum length, since Dynamic Training Sequence Length (DTSL) is used in training. Figure 1 depicts the relationship between frames and performance for the 2-stage INT-Voxel model on the Waymo Open Dataset. We also plot the 2-stage CenterPoint-Voxel results together to make the comparison clearer. As shown in Fig. 1(b), INT improves as the number of frames increases, although there is saturation after a certain point (see Appendix A.5 for more explanation); as for Fig. 1(c), the time consumed by INT is slightly higher than that of single-frame CenterPoint, but it does not increase with the number of frames.
Impact of Sequence Augmentation. Section 3.4 introduces SeqAug, a data augmentation technique for on-stream training, and this section examines the roles of Point Transformation and GtAug in SeqAug. As seen in Table 6, both augmentation strategies result in significant performance improvements, making data augmentation as essential for INT training as it is for regular detectors.

Table 6. Impact of SeqAug on the Waymo Open Dataset val set. One-stage INT-Pillar and INT-Voxel are used. Columns: Sequence point trans., Sequence GtAug; E-PointPillars and E-VoxelNet (VEH-APH↑, PED-APH↑, CYC-APH↑, mAPH↑).



59.6

56.7

56.2

57.5

65.0

62.4

61.9

69.3

65.0

62.4

65.6

71.8

71.4

73.1

63.1 72.1

69.6

66.3

65.7

67.2

72.2

72.1

76.1

73.5

More Ablation Studies in Appendix. The impact of sequence length on nuScenes is shown in Appendix A.1. The performance and latency of temporal fusion methods for image-style data (proposed in Sect. 3.2) are shown in Appendix A.2. The impact of DTSL proposed in Sect. 3.1 can be found in Appendix A.3.

5 Conclusion

In this paper, we present INT, a novel on-stream training and prediction framework that, in theory, can employ an infinite number of frames while incurring about the same computational and memory cost as a single-frame detector. To make INT feasible, we propose three key modules, i.e., SeqFusion, SeqSampler, and SeqAug. We apply INT to the popular CenterPoint, obtaining significant latency reductions and performance improvements, and currently rank 1st on the Waymo Open Dataset 3D Detection leaderboard among the non-ensemble SOTA methods. Moreover, INT is a general multi-frame framework, which may be used for tasks such as segmentation and motion prediction as well as detection.
Acknowledgement. This work was supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program and the Alibaba Research Intern Program.


References 1. Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range conditioned dilated convolutions for scale invariant 3d object detection. arXiv preprint arXiv:2005.09927 (2020) 2. Caesar, H., et al.: nuscenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020) 3. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915 (2017) 4. Chong, Z., et al.: MonodiStill: learning spatial features for monocular 3d object detection. arXiv preprint arXiv:2201.10830 (2022) 5. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3075–3084 (2019) 6. Fan, L., Xiong, X., Wang, F., Wang, N., Zhang, Z.: Rangedet:in defense of range view for lidar-based 3d object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2021) 7. Feng, Y., Ma, L., Liu, W., Luo, J.: Spatio-temporal video re-localization by warp LSTM. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1288–1297 (2019) 8. Guan, T., et al.: M3DeTR: multi-representation, multi-scale, mutual-relation 3d object detection with transformers. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (2021) 9. He, L., et al.: End-to-end video object detection with spatial-temporal transformers. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1507–1516 (2021) 10. Hu, Y., et al.: Afdetv2: rethinking the necessity of the second stage for object detection from point clouds. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021) 11. Huang, R., et al.: An LSTM approach to temporal 3d object detection in LiDAR point clouds. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 266–282. Springer, Cham (2020). https://doi.org/10. 1007/978-3-030-58523-5 16 12. Jaderberg, M., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems., vol. 28 (2015) 13. Kang, K., et al.: Object detection in videos with tubelet proposal networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 727–735 (2017) 14. Ku, J., Pon, A.D., Waslander, S.L.: Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11867–11876 (2019) 15. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Beijbom, O.: PointPillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) 16. Li, B., Zhang, T., Xia, T.: Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916 (2016) 17. Li, B., Ouyang, W., Sheng, L., Zeng, X., Wang, X.: Gs3d: an efficient 3d object detection framework for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1019–1028 (2019)


18. Liang, M., Yang, B., Chen, Y., Hu, R., Urtasun, R.: Multi-task multi-sensor fusion for 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7345–7353 (2019) 19. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686 (2019) 20. Miao, Z., et al.: PVGNet: a bottom-up one-stage 3d object detector with integrated multi-level features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3279–3288 (2021) 21. Ngiam, J., Caine, B., Han, W., Yang, B., Vasudevan, V.: StarNet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069 (2019) 22. Piergiovanni, A., Casser, V., Ryoo, M.S., Angelova, A.: 4d-net for learned multimodal alignment. In: Proceedings of the IEEE International Conference on Computer Vision (2021) 23. Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE International Conference on Computer Vision (2019) 24. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointNets for 3d object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018) 25. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 26. Shi, S., Guo, C., Jiang, L., Wang, Z., Li, H.: PV-RCNN: point-voxel feature set abstraction for 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020) 27. Shi, S., Wang, X., Li, H.: PointRCNN: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) 28. Sun, P.,et al.: RSN: range sparse net for efficient, accurate lidar 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 29. Sun, P., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2446–2454 (2020) 30. Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: sequential fusion for 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4604–4612 (2020) 31. Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.Q.: Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8445–8453 (2019) 32. Wu, P., Chen, S., Metaxas, D.N.: MotionNet: joint perception and motion prediction for autonomous driving based on bird’s eye view maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11385–11395 (2020) 33. Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: European Conference on Computer Vision, pp. 485–501 (2018) 34. Xu, J., Tang, X., Dou, J., Shu, X., Zhu, Y.: CenterAtt: Fast 2-stage center attention network. arXiv preprint arXiv:2106.10493 (2021)


35. Xu, J., Zhang, R., Dou, J., Zhu, Y., Sun, J., Pu, S.: RPVNet: a deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 16024–16033 (2021) 36. Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018) 37. Yang, Z., Zhou, Y., Chen, Z., Ngiam, J.: 3d-man: 3d multi-frame attention network for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1863–1872 (2021) 38. Yin, J., Shen, J., Guan, C., Zhou, D., Yang, R.: Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020) 39. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 40. Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3D-CVF: generating joint camera and LiDAR features using cross-view spatial feature fusion for 3d object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12372, pp. 720–736. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058583-9 43 41. Zeng, Y., et al.: Lift: Learning 4d lidar image fusion transformer for 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 17172–17181 (2022) 42. Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 43. Zhou, Y., et al.: End-to-end multi-view fusion for 3d object detection in lidar point clouds. In: Conference on Robot Learning, pp. 923–932. PMLR (2020)

End-to-End Weakly Supervised Object Detection with Sparse Proposal Evolution

Mingxiang Liao1, Fang Wan1(B), Yuan Yao1, Zhenjun Han1, Jialing Zou1, Yuze Wang2, Bailan Feng2, Peng Yuan2, and Qixiang Ye1

1 University of Chinese Academy of Sciences, Beijing, China
{liaomingxiang20,yaoyuan17}@mails.ucas.ac.cn, {wanfang,hanzhj,qxye}@ucas.ac.cn
2 Huawei Noah's Ark Lab, Beijing, China
{wangyuze1,fengbailan,yuanpeng126}@huawei.com

Abstract. Conventional methods for weakly supervised object detection (WSOD) typically enumerate dense proposals and select the discriminative proposals as objects. However, these two-stage "enumerate-and-select" methods suffer from object feature ambiguity brought by dense proposals and low detection efficiency caused by the proposal enumeration procedure. In this study, we propose a sparse proposal evolution (SPE) approach, which advances WSOD from the two-stage pipeline with dense proposals to an end-to-end framework with sparse proposals. SPE is built upon a visual transformer equipped with a seed proposal generation (SPG) branch and a sparse proposal refinement (SPR) branch. SPG generates high-quality seed proposals by taking advantage of the cascaded self-attention mechanism of the visual transformer, and SPR trains the detector to predict sparse proposals which are supervised by the seed proposals in a one-to-one matching fashion. SPG and SPR are iteratively performed so that seed proposals update to accurate supervision signals and sparse proposals evolve to precise object regions. Experiments on VOC and COCO object detection datasets show that SPE outperforms the state-of-the-art end-to-end methods by 7.0% mAP and 8.1% AP50. It is an order of magnitude faster than the two-stage methods, setting the first solid baseline for end-to-end WSOD with sparse proposals. The code is available at https://github.com/MingXiangL/SPE.

Keywords: Weakly supervised object detection · Sparse proposals · Proposal evolution · End-to-end training

1 Introduction

Visual object detection has achieved unprecedented progress in the past decade. However, such progress heavily relies on a large amount of data annotations (e.g., object bounding boxes), which require extensive human effort and time cost. (Supplementary information: the online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_13.)


Fig. 1. Comparison of (a) detection efficiency and (b) activation maps between the conventional methods and the proposed SPE for weakly supervised object detection (WSOD) on VOC 2007. In (a), larger circles denote higher proposal generation speeds. All speeds in (a) are evaluated on an NVIDIA RTX GPU.

Weakly supervised object detection (WSOD), which only requires image-level annotations indicating the presence or absence of a class of objects, significantly reduces the annotation cost [4,14,31–33,52]. Lacking instance-level annotations, WSOD methods must localize objects while estimating object detectors at the same time during training. To fulfill this purpose, the early WSDDN method [6] used an "enumerate-and-select" pipeline. It first enumerates dense proposals using empirical cues [27,37] to ensure a high recall rate and then selects the most discriminative proposal as the pseudo object for detector training. Recent studies improved either the proposal enumeration [38,41,44] or the proposal selection module [19,23,43,51]. However, this "enumerate-and-select" pipeline hits a performance upper bound because of the following two problems: (1) The redundant and near-duplicate proposals aggravate the difficulty of localizing objects and decrease the detection efficiency, Fig. 1(a). (2) During training, the labels of the dense proposals are assigned by a single pseudo object through a many-to-one matching strategy, i.e., multiple proposals with large IoUs with the pseudo object are selected for detector training, which introduces ambiguity to the feature representation, Fig. 1(b). In this paper, we propose the sparse proposal evolution (SPE) approach, which advances WSOD from the enumerate-and-select pipeline with dense proposals (Fig. 2(a)) to an end-to-end framework with sparse proposals (Fig. 2(b)). SPE adopts a "seed-and-refine" approach, which first produces sparse seed proposals and then refines them to achieve accurate object localization. SPE consists of a seed proposal generation (SPG) branch and a sparse proposal refinement (SPR) branch. During training, SPG leverages the visual transformer [46] to generate semantic-aware attention maps. By taking advantage of the cascaded self-attention mechanism born with the visual transformer, the semantic-aware attention map can extract long-range feature dependencies and activate the full object extent, Fig. 1(b). With these semantic-aware attention


Fig. 2. Comparison of (a) the conventional “enumerate-and-select” pipeline with (b) our “seed-and-refine” framework for weakly supervised object detection.

maps, SPG can generate high-quality seed proposals. Using the seed proposals as pseudo supervisions, SPR trains a detector by introducing a set of sparse proposals that are learned to match with the seed proposals in a one-to-one matching fashion. During the proposal matching procedure, each seed proposal is augmented in multiple orientations, which provides the opportunity to refine object locations as the proposals and the detector evolve. The contributions of this study include:
– We propose the sparse proposal evolution (SPE) approach, opening a promising direction for end-to-end WSOD with sparse proposals.
– We update many-to-one proposal selection to one-to-one proposal-proposal matching, making it possible to apply the "seed-and-refine" mechanism to the challenging WSOD problem.
– SPE significantly improves the efficiency and precision of end-to-end WSOD methods, demonstrating the potential to be a new baseline framework.

2 Related Work

2.1 Weakly Supervised Object Detection

Enumerate-and-Select Method (Two-stage). This line of methods enumerates object locations using a stand-alone region proposal algorithm. A multiple instance learning (MIL) procedure iteratively performs proposal selection and detector estimation. Nevertheless, as the object proposals are dense and redundant, MIL is often puzzled by the partial activation problem [5,13,32,48]. WSDDN [6] built the first deep MIL network by integrating an MIL loss into a deep network. Online instance classifier refinement (OICR) [15,18,26,43,48,53] was proposed to select high-quality instances as pseudo objects to refine the instance classifier. Proposal cluster learning (PCL) [24,42] further alleviated networks from concentrating on object parts by proposal clustering [42]. In the two-stage framework, object pixel gradient [39], segmentation collaboration [15,21,28,40], dissimilarity coefficient [3], attention and selfdistillation [23] and extra annotations from other domains [7,17] were introduced to optimize proposal selection. Context information [25,50] was also explored to


identify the instances from surrounding regions. In [48,49], a min-entropy model was proposed to alleviate localization randomness. In [26], object-aware instance labeling was explored for accurate object localization by considering instance completeness. In [19,51], continuation MIL was proposed to alleviate the non-convexity of the WSOD loss function. Despite the substantial progress, most WSOD methods use a stand-alone proposal generation module, which decreases not only the overall detection efficiency but also the performance upper bound.
Enumerate-and-Select Method (End-to-End). Recent methods [38,44] attempted to break the two-stage WSOD routine. WeakRPN [44] utilized object contours in convolutional feature maps to generate proposals to train a region proposal network (RPN). However, it still relies on proposal enumeration during the training stage. In [38], an RPN [34] was trained using the pseudo objects predicted by the weakly supervised detector in a self-training fashion. Nevertheless, it requires generating dense object proposals with sliding windows. Both methods suffer from selecting inaccurate candidates from dense proposals.

2.2 Object Proposal Generation

Empirical Enumeration Method. This line of methods enumerates dense proposals based on simple features and classifiers [2,9,37]. Constrained Parametric Min-Cuts (CPMC) [9] produced up to 10,000 regions based on figure-ground segments and trained a regressor to select high-scored proposals. Selective Search [37] and MCG [2] adopted hierarchical segmentation and region merging on color and contour features for proposal generation. BING [12] generated redundant proposals with sliding windows and filtered them with a classifier. EdgeBoxes [27] estimated objectness by detecting complete contours in dense bounding-box proposals.
Learning-Based Method. Recent methods have tried to learn an RPN under weak supervision. In [44], an EdgeBoxes-like algorithm is embedded into DNNs. In [41], extra video datasets were used to learn an RPN [34]. In [38], the RPN was trained using the pseudo objects selected by the weakly supervised detector in a self-supervised fashion. However, these methods required generating very dense object proposals. The problem of achieving a high recall rate using sparse (hundreds or tens of) proposals without precise supervision still remains.

3 Methodology

In this section, we first give an overview of the proposed sparse proposal evolution (SPE) approach. We then introduce the seed proposal generation (SPG) and sparse proposal refinement (SPR) modules. Finally, we describe the end-to-end training procedure based on iterative optimization of SPG and SPR.


Fig. 3. Flowchart of the proposed sparse proposal evolution (SPE) approach. The diagram consists of a transformer backbone, a seed proposal generation (SPG) branch and a sparse proposal refinement (SPR) branch. During the training phase, SPG and SPR are jointly performed under a “seed-and-refine” mechanism for end-to-end WSOD with sparse object proposals.

3.1 Overview

Figure 3 presents the flowchart of SPE, which consists of a backbone network, an SPG branch, and an SPR branch. The backbone network, which is built upon CaiT [46], contains two sub-branches with l shared transformer blocks (each block has a self-attention layer and a multi-layer perceptron (MLP) layer). The SPG branch consists of two modules, one for image classification and the other for seed proposal generation. The initial supervision comes from the image classification loss (in the SPG branch), which drives the learning of the image classifiers used for semantic-aware attention maps and for seed proposal generation through a thresholding algorithm [54]. The SPR branch is an encoder-decoder structure [30], which is trained by the one-to-one matching loss between seed proposals and sparse proposals. During training, an input image is first divided into w × h patches to construct N = w × h patch tokens tp. These patch tokens are fed to the transformer to extract semantic-sensitive patch embeddings tps and location-sensitive patch embeddings tpl, which are respectively fed to the SPG branch and the SPR branch.

3.2 Seed Proposal Generation

The core of SPE is generating sparse yet high-quality seed proposals. The visual transformer has been observed to extract long-range feature dependencies by taking advantage of its cascaded self-attention mechanism, which facilitates activating and localizing the full object extent [20]. This inspires us to introduce it to WSOD to produce high-quality seed proposals for object localization.


Fig. 4. Flowchart of the class-attention layer in the proposed SPG branch.

Semantic-Aware Attention Maps. As shown in Fig. 3, the SPG branch contains an image classification module and a seed proposal generation module. The image classification module contains two class-attention blocks and a fully connected (FC) layer, following CaiT [46]. Each class-attention block consists of a class-attention layer and an MLP layer with a shortcut connection. A class token tc ∈ R^{1×D} is fed to the first class-attention block, where the class-attention CA(·) is performed on tc and tps as

t*c = CA(tc, tps, wq, wk, wv)
    = Softmax((tc wq)([tc, tps] wk)^T / √D) ([tc, tps] wv)
    = A([tc, tps] wv),    (1)

where wq, wk, wv denote the weights of the class-attention layer, Fig. 4. [tc, tps] denotes concatenating tc and tps along the first dimension, and A ∈ R^{1×(N+1)} is the attention vector of the class token tc. In a multi-head attention layer with J heads, D in Eq. 1 is replaced by D0, where D0 = D/J, and A is then computed as the average of the per-head attention vectors weighted by their standard deviations. t*c is then projected by the MLP layer in the first class-attention block and further fed to the second class-attention block to calculate the final embedding t*c ∈ R^{1×D} for image classification. The FC layer parameterized by wc ∈ R^{D×C} projects the class token t*c to a classification score. Considering that the class token tc is class-agnostic and cannot produce attention maps for each semantic category, we further add C semantic-aware tokens ts ∈ R^{C×D}, where C denotes the number of classes. By feeding both tc and ts to the class-attention blocks and applying the class-attention defined in Eq. 1, we obtain the final token embeddings t*c and t*s. The attention vector A is updated to the attention matrix A ∈ R^{(C+1)×(C+N+1)}. An extra FC layer parameterized with ws ∈ R^{D×1} is added to classify the semantic tokens t*s. Given the image label y = [y1, y2, ..., yC]^T ∈ R^{C×1}, where yc = 1 or 0 indicates the presence or absence of the c-th object category in the image, the loss function for SPG is defined as

Lspg(t*c, t*s) = LBCE(t*c wc, y) + LBCE(t*s ws, y),    (2)

where LBCE(·) denotes the binary cross-entropy loss [6].
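To make Eq. (1) concrete, here is a minimal single-head PyTorch sketch of the class-attention operation (queries from the class/semantic tokens only, keys and values from the concatenation of those tokens with the patch tokens). The layer names and the single-head simplification are ours; the paper uses CaiT-style multi-head class-attention.

import math
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Single-head class-attention: queries come from the class/semantic tokens,
    keys and values from [tokens, patch_tokens], as in Eq. (1)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, tokens, patch_tokens):
        # tokens: (B, 1 + C, D) class + semantic tokens; patch_tokens: (B, N, D)
        full = torch.cat([tokens, patch_tokens], dim=1)        # [tc, ts, tps]
        q = self.w_q(tokens)                                   # (B, 1 + C, D)
        k = self.w_k(full)                                     # (B, 1 + C + N, D)
        v = self.w_v(full)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        return attn @ v, attn   # attn plays the role of the attention matrix A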


Fig. 5. Comparison of matching strategies. (a) Many-to-one matching of previous WSOD methods. (b) One-to-one matching strategy [8] applied to WSOD. (c) One-to-one matching with proposal augmentation (ours).

Seed Proposals. By optimizing Eq. 2 and executing Eq. 1 in the second class-attention block, we obtain the attention matrix A ∈ R^{(C+1)×(C+N+1)}. The semantic-aware attention matrix A* ∈ R^{C×N} is produced by indexing the first C rows and the middle N columns of A. The attention map Ac of the c-th class is then obtained by reshaping the c-th row of A* to w × h and resizing it to the resolution of the original image. A thresholding function T(Ac, δseed) with a fixed threshold δseed [54] is used to binarize each semantic-aware attention map into foreground and background pixels. Based on T(Ac, δseed), the seed proposals are generated as

P = {C(T(Ac, δseed), δmulti), ...}_{c=1}^{C} = {B, O} = {(b1, o1), (b2, o2), ...},    (3)

where the function C(·) outputs a set of tight bounding boxes enclosing the connected regions in the binary map T(Ac, δseed), under the constraint that the area of each connected region is larger than δmulti of the largest connected region. Consequently, we obtain a set of bounding boxes B = [b1, b2, ..., bM] ∈ R^{M×4} for the foreground categories in the image, where each category produces at least one seed proposal. The one-hot class labels of these bounding boxes are denoted as O = [o1, o2, ..., oM]^T ∈ R^{M×C}.
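A NumPy/SciPy sketch of the seed-proposal generation in Eq. (3) is shown below, under our own simplifying choices (scipy.ndimage for connected components, a relative threshold, and boxes in (x0, y0, x1, y1) format); the exact thresholding utility the paper cites [54] may differ.

import numpy as np
from scipy import ndimage

def seed_proposals(attn_map: np.ndarray, delta_seed: float, delta_multi: float):
    """attn_map: (H, W) semantic-aware attention map of one class, resized to image size.
    Returns tight boxes around connected foreground regions whose area is at least
    delta_multi times the area of the largest region (Eq. (3))."""
    binary = attn_map > delta_seed * attn_map.max()        # T(Ac, delta_seed), relative threshold
    labeled, num = ndimage.label(binary)                   # connected components
    if num == 0:
        return []
    areas = ndimage.sum(binary, labeled, index=range(1, num + 1))
    boxes = []
    for idx, slc in enumerate(ndimage.find_objects(labeled)):
        if areas[idx] >= delta_multi * areas.max():
            ys, xs = slc
            boxes.append((xs.start, ys.start, xs.stop - 1, ys.stop - 1))  # C(.): tight box
    return boxes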

3.3 Sparse Proposal Refinement

Although SPG can perform object localization using seed proposals, the performance is far from satisfactory due to the lack of instance-level supervision. We therefore propose sparse proposal refinement (SPR), with the aim of learning an object detector while refining the seed proposals.


Sparse Proposals. As shown in Fig. 3, the SPR branch follows recently proposed fully-supervised transformer detectors (DETR [8] and Conditional DETR [30]), which leverage a transformer encoder, a transformer decoder, and a feed-forward network (FFN) to predict object categories and locations. The location-sensitive patch embedding tpl from the transformer backbone is first encoded by the transformer encoder into t*pl. In the transformer decoder, a fixed set of sparse proposal tokens tp ∈ R^{K×D} is defined to perform conditional cross-attention [30] with the encoded location-aware embedding t*pl. The decoded t*p is then fed to the FFN to predict K sparse proposals, as

P̂ = FFN(t*p, wFFN) = {B̂, Ô} = {(b̂1, ô1), (b̂2, ô2), ..., (b̂K, ôK)},    (4)

where wFFN and K respectively denote the parameters of the FFN and the number of proposal tokens.
One-to-One Proposal Matching. Using the seed proposals defined by Eq. 3 as pseudo objects, an optimal bipartite match between seed and sparse proposals is applied. The optimal bipartite match [8] is formulated as Ŝ = [σ̂1, σ̂2, ..., σ̂K], where σ̂i ∈ {∅, 1, 2, ..., m, ..., M}. σ̂i = m denotes that the i-th sparse proposal is matched with the m-th seed proposal, and σ̂i = ∅ means that the i-th sparse proposal has no matched object and is categorized as "background". The loss function of the SPR branch is defined as

Lspr(P, P̂) = Σ_{i=1}^{K} [ λFL · LFL(oi, ôσ̂i) + 1{σ̂i ≠ ∅} · λL1 · LL1(bi, b̂σ̂i) + 1{σ̂i ≠ ∅} · λGIoU · LGIoU(bi, b̂σ̂i) ],    (5)

where LFL, LL1 and LGIoU are the Focal loss [47], L1 loss and generalized IoU loss [36], respectively, and λFL, λL1 and λGIoU are regularization factors.
Seed Proposal Augmentation. The above-defined one-to-one matching breaks the many-to-one label assignment, Fig. 5(a). However, the supervision signals (seed proposals) generated from attention maps contain localization noise that cannot be corrected by the one-to-one matching mechanism alone, Fig. 5(b). To alleviate this problem, we augment the seed proposals through a "box jittering" strategy, which produces randomly jittered bounding boxes in four orientations. The box jittering process for a bounding box bi = (tx, ty, tw, th) is defined as

Γbi = (tx, ty, tw, th) ± (εx·tx, εy·ty, εw·tw, εh·th),    (6)

where the coefficients (εx, εy, εw, εh) are randomly sampled from a uniform distribution U(−δaug, +δaug), and δaug is a small value that ensures Γbi stays close to bi. By applying "box jittering" to the boxes B, we extend the seed proposals P = {O, B} to the augmented seed proposals {P, ΓP} = {[O, ΓO], [B, ΓB]}, where the class label Γoi is the same as oi. With seed proposal augmentation, the sparse proposals can correct noise in the seed proposals, Fig. 5(c), which facilitates seed proposal refinement and improves detection performance.
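The sketch below illustrates box jittering (Eq. (6)) followed by one-to-one bipartite matching against the augmented seeds. For brevity the matching cost here is only an L1 box distance plus a class-score term, whereas Eq. (5) combines focal, L1 and GIoU terms; scipy's Hungarian solver stands in for the matcher of [8], and the cost weights are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def jitter_boxes(boxes_cxcywh: np.ndarray, delta_aug: float, rng=np.random) -> np.ndarray:
    """Eq. (6): per-coordinate relative jitter sampled from U(-delta_aug, +delta_aug)."""
    eps = rng.uniform(-delta_aug, delta_aug, size=boxes_cxcywh.shape)
    return boxes_cxcywh * (1.0 + eps)

def one_to_one_match(pred_boxes, pred_logits, seed_boxes, seed_labels):
    """Hungarian matching of K predictions to M (augmented) seed proposals.
    Returns (pred_index, seed_index) pairs; unmatched predictions act as background."""
    cls_cost = -pred_logits[:, seed_labels]                      # (K, M): prefer a high class score
    box_cost = np.abs(pred_boxes[:, None, :] - seed_boxes[None, :, :]).sum(-1)
    cost = cls_cost + 5.0 * box_cost                             # simplified stand-in for Eq. (5)
    pred_idx, seed_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, seed_idx))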


Table 1. Performance with respect to Table 2. Performance of SPE under SPR δseed and δmulti on VOC 2007 test set. branch numbers on VOC 2007 test set. Modules δseed δmulti mAP CorLoc

#SPR branches δaug mAP CorLoc

SPG

0 (SPG)

0

29.7

57.8 61.3

SPE

3.4

0.1

1

23.0

48.2

0.2

1

29.7 57.8

1

0

42.6

0.3

1

18.2

43.6

2

0

42.9 61.5

0.4

1

8.9

25.9

3

0

42.7

61.3

0.2

1

37.8

56.9

1

0.05 42.7

61.3

0.2

0.75

41.0

61.0

1

0.1

0.2

0.5

42.6 61.3

1

0.15 45.1

64.0

0.2

0.25

42.4

61.5

1

0.2

61.6

45.6 64.0 43.4

3.4 End-to-End Training

As the proposal generation and proposal refinement branches are unified upon the transformer backbone, we are able to train the seed proposal generator, the object detector, and the backbone network in an end-to-end fashion. As shown in Fig. 3, the SPG branch and the SPR branch share the transformer backbone [46]. Considering that the optimization objectives of the two network branches are not exactly the same, we separate the backbone transformer from the (l + 1)-th block so that they share only part of the backbone network. The two network branches are jointly optimized by the total loss defined as

Lspe = Lspg + 1{e ≥ τ} · Lspr,    (7)

where e denotes the training epoch and τ is a threshold number of epochs. During end-to-end training, the SPG branch is first optimized for τ epochs as a “warm-up” step, which guarantees that the seed proposals are semantic-aware and can coarsely cover object extent. Subsequently, the transformer backbone, the SPG, and SPR branches are jointly trained under the supervision of the image classification loss and the proposal matching loss.
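A two-line sketch of the warm-up schedule in Eq. (7); the variable names are ours.

def spe_total_loss(loss_spg, loss_spr, epoch: int, tau: int):
    """Eq. (7): SPG-only warm-up for the first tau epochs, then joint optimization."""
    return loss_spg + (loss_spr if epoch >= tau else 0.0)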

4 Experiment

In this section, we first introduce the experimental settings. We then conduct an ablation study and quantitative and qualitative model analyses. We finally compare the proposed SPE approach with the state-of-the-art (SOTA) methods.

4.1 Experimental Setting

SPE is implemented based on the CaiT-XXS36 model [46] pre-trained on the ILSVRC 2012 dataset [1]. We evaluate SPE on the PASCAL VOC 2007, 2012 and MS COCO 2014, 2017 datasets. On VOC, we use mAP [29] and correct localization (CorLoc) [45] as the evaluation metric. The model is trained on the

SPE

219

union set of VOC 2007 trainval and VOC 2012 trainval ("0712", containing 16551 images of 20 object classes), and evaluated on the test set of VOC 2007 (containing 4952 images). On the MS COCO datasets, we use average precision (AP) as the evaluation metric. The COCO datasets contain 80 object categories and have more challenging aspects, including multiple objects per image and complex backgrounds. On COCO 2014 and 2017, we respectively use the 83k and 118k train sets for training and the 40k and 5k val sets for testing. Each input image is re-scaled to a fixed size and randomly horizontally flipped. During training, we employ the AdamW gradient descent algorithm with weight decay 5e-2 and a batch size of 8 on 8 GPUs. The model iterates 50 and 15 epochs on the VOC and COCO datasets, respectively. During training, the learning rate for the backbone is fixed to 1e-5. The learning rate for the remaining branches is initialized to 1e-4 and drops to 1e-5 after 40 and 11 epochs on the VOC and COCO datasets, respectively. The number K of proposal tokens is set to 300 following [8]. The "warm-up" time τ in Eq. 7 is empirically set to 7.

4.2 Ablation Study

We analyze SPE's hyper-parameters δseed and δmulti, the number of proposal refinements, matching manners, and the detection efficiency. We also study the effect of the number of shared backbone blocks, the detector and the backbone network. All the ablation experiments are conducted on PASCAL VOC.
SPG. Table 1 reports the detection and localization performance of SPG under different δseed and δmulti. It can be seen that δseed has a key influence on the generation of seed proposals. When δseed = 0.2, SPG achieves 29.7% mAP and 57.8% CorLoc. When δmulti decreases, the performance first increases and then decreases. This implies that as δmulti decreases, SPG discovers more and more objects, which enriches the supervision signals and improves the detection performance. On the other hand, as δmulti decreases further, SPG also produces more noisy proposals, which degenerates the detection performance.
SPR. Table 2 shows the effect of the number of proposal refinements, obtained by adding extra SPR branches and introducing seed proposal augmentation. When adding one SPR branch, the detection performance is significantly improved by 12.9% (29.7% vs 42.6%) and the localization performance by 3.5% (57.8% vs 61.3%), which clearly demonstrates SPR's effectiveness for refining the seed proposals. When more SPR branches are added, only marginal performance improvements are achieved. By introducing seed proposal augmentation, the performance is further improved by 3.0% (42.6% vs 45.6%) and 2.7% (61.3% vs 64.0%) with δaug = 0.1, demonstrating that the proposal augmentation mechanism can suppress the noise of seed proposals and achieve more accurate localization.
Detection Efficiency. In Table 3, we compare the proposed SPE with "enumerate-and-select" methods, including the two-stage OICR method [43] and the SOTA end-to-end method UWSOD [38]. The compared terms include the number of parameters (#Params), MACs, the time for proposal generation (τ) and the inference speed. The experiments are carried out at image scale 512 × 512 on the


Table 3. Comparison of parameters, MACs, time to generate proposals, and inference speed on the VOC test set. SPE is implemented based on CaiT-XXS36. Test speeds ("Speed") are evaluated on a single NVIDIA RTX GPU.

Methods            | #Params (M) | MACs (G) | τ (s/img) | Speed (fps) | mAP
OICR (VGG16) [43]  | 120.9       | 304.26   | 3.79      | 0.26        | 44.1
UWSOD (VGG16) [38] | 138.5       | 923.31   | 0.002     | 4.2         | 45.7
UWSOD (WSR18) [38] | 135.0       | 237.97   | 0.002     | 4.3         | 46.9
SPE (Ours)         | 33.9        | 51.25    | 0         | 14.3        | 51.0

Table 4. Performance of SPE with l shared blocks on the VOC 2007 test set.

l shared blocks | mAP  | CorLoc
36              | 32.8 | 50.0
24              | 45.6 | 64.0
12              | 43.1 | 60.1

Table 5. Performance of different detectors on the VOC 2007 test set.

Detector         | Backbone    | mAP
Faster R-CNN     | VGG16       | 78.3
Faster R-CNN     | ResNet50    | 80.9
Faster R-CNN     | CaiT-XXS36  | 81.4
Conditional DETR | CaiT-XXS36  | 77.5

PASCAL VOC dataset (0712 trainval for training and the VOC 2007 test set for testing). SPE has far fewer parameters than OICR and UWSOD (only ∼1/4 of theirs) and uses far fewer MACs (only 1/20∼1/4 of theirs). These results show that SPE, which discards dense proposals by learning sparse proposals, is efficient for object detection. At test time, SPE directly uses the backbone and the SPR branch for object detection and incurs no computational cost for proposal generation. With such high detection efficiency, SPE achieves 51.0% mAP, outperforming OICR and UWSOD by 9.8% and 7.0%, respectively.

Number of Shared Backbone Blocks. We analyze the effect of the number of backbone blocks shared by SPG and SPR (denoted by l) in Table 4. When l = 36, i.e., the two branches share all backbone blocks of CaiT-XXS36, the detection and localization performance are 32.8% and 50.0%, respectively. This is because learning the regression task interferes with the attention maps in SPG and thus degrades the quality of the generated seed proposals. When l = 24, this problem is largely alleviated, and the detection and localization performance increase to 45.6% and 64.0%, respectively. When fewer layers are shared, the performance slightly decreases due to the increase of inductive bias.

Backbone and Detector. In Table 5, we compare Faster R-CNN w/ VGG16, Faster R-CNN w/ ResNet50, Faster R-CNN w/ CaiT-XXS36, and Conditional DETR w/ CaiT-XXS36 on VOC 0712 under the fully supervised setting. The mAP of Conditional DETR w/ CaiT-XXS36 is 77.5%, which is lower than that of Faster R-CNN w/ VGG16 (78.3%), showing that the detector is not the key factor of the performance gain. The mAP of Faster R-CNN w/ CaiT-XXS36 is 81.4%, which is 3.1% higher than w/ VGG16 and 0.5% higher than w/ ResNet50. We also


conducted experiments with MIST [35] w/ CaiT-XXS36, but obtained much worse results than MIST [35] w/ VGG16. Although CaiT-XXS36 performs better on the fully supervised detection task, it is not superior to VGG16 for traditional WSOD methods.
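For intuition on how the threshold δseed studied in this section could act, the sketch below derives seed boxes by binarizing a semantic-aware attention map at δseed and boxing each connected component. This is only an illustration of threshold-based seed generation under our own simplifying assumptions (SciPy connected components, one box per component); the exact SPG procedure is defined earlier in the paper.

```python
import numpy as np
from scipy import ndimage

def seed_boxes_from_attention(attn: np.ndarray, delta_seed: float = 0.2):
    """Illustrative seed-proposal generation: threshold a normalized
    attention map at delta_seed and box each connected component."""
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    mask = attn >= delta_seed
    labels, num = ndimage.label(mask)
    boxes = []
    for k in range(1, num + 1):
        ys, xs = np.where(labels == k)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x1, y1, x2, y2)
    return boxes

# Example: a synthetic attention map with two activated blobs.
attn = np.zeros((64, 64))
attn[10:20, 10:25] = 1.0
attn[40:55, 30:50] = 0.6
print(seed_boxes_from_attention(attn, delta_seed=0.2))
```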

4.3 Visualization Analysis

Qualitative Analysis. Figure 7 shows the evolution of seed proposals and matched proposals, together with their corresponding attention maps (heatmaps), generated by SPG and SPR. At early training epochs, SPG activates most of the objects and can produce seed proposals for object location initialization. However, these proposals still suffer from background activation or partial activation. After matching with the sparse proposals, the seed proposals are refined to more accurate object locations, which demonstrates the effectiveness of the SPR module with the proposal augmentation strategy. As training goes on, the seed proposals are gradually refined by and matched with the sparse proposals, and finally evolve to full object extent.

Fig. 6. Comparison of CorLoc accuracy of SPE and OICR [43] during training.

Figure 8 visualizes the seed proposals, the matched sparse proposals, and the corresponding attention maps.


Fig. 7. Evolution of seed proposals and matched sparse proposals (yellow bounding boxes) during training. Heatmaps in the "seed proposal" column show the semantic-aware attention maps, while heatmaps in the "matched proposal" column show the cross-attention maps of the matched sparse proposals. (Color figure online)


Fig. 8. Visualization of seed proposals and matched sparse proposals (yellow boxes). Heatmaps in “seed proposal” column show the semantic-aware attention maps for object classes. Heatmaps in the “matched proposal” column show the cross-attention maps of the matched sparse proposals. (Color figure online)

With the long-range feature dependencies of the transformer, the semantic-aware attention maps in SPG can activate the full object extent. Based on these attention maps, SPG generates sparse yet high-quality seed proposals. By introducing SPR, the matched proposals promote the seed proposals and achieve more precise object localization. These results validate the effectiveness of the proposed SPG and SPR branches of SPE, where the seed proposals and sparse proposals evolve towards true object locations.

Quantitative Analysis. Figure 6 shows the CorLoc accuracy of SPE and OICR [43] over training iterations. By introducing transformer blocks, SPE generates much more precise proposals at very early iterations. In contrast, OICR suffers from dense and noisy proposals and struggles to select the object proposals, so its localization performance deteriorates in early iterations.
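CorLoc, plotted in Fig. 6, counts an image as correctly localized for a class when the top-scoring detection of that class overlaps a ground-truth box of that class with IoU ≥ 0.5. A straightforward reference implementation of that standard definition (for one class) is sketched below.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def corloc(top_detections, ground_truths, thr=0.5):
    """For one class: each entry pairs the top-scoring predicted box in an image
    that contains the class with that image's list of ground-truth boxes."""
    hits = sum(
        any(iou(det, gt) >= thr for gt in gts)
        for det, gts in zip(top_detections, ground_truths)
    )
    return hits / max(len(ground_truths), 1)

# Toy example: two images of one class, one localized correctly.
dets = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [[(12, 12, 48, 52)], [(60, 60, 90, 90)]]
print(corloc(dets, gts))  # 0.5
```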

4.4 Performance

PASCAL VOC. Table 6 shows the performance of SPE and the SOTA methods on the VOC 2007 dataset. "07" in the "Set" column denotes the trainval set of VOC 2007, and "0712" denotes the union of the trainval sets of VOC 2007 and 2012. "CaiT" denotes CaiT-XXS36. † refers to our implementation using the official code. With image scale 384, SPE achieves a competitive 48.5% mAP and 66.4% CorLoc accuracy when trained on the 0712 trainval set. With image scale 512, SPE achieves 51.0% mAP and 70.4% CorLoc accuracy, which outperforms the two-stage methods WSDDN [6] and OICR [43] by 16.2% and 9.8%, respectively. Compared with the end-to-end methods using dense object proposals, the performance of SPE is very competitive; it also outperforms the SOTA UWSOD by 7.0% mAP.

MS COCO. In Table 7, we report the performance of SPE and the SOTA methods on the MS COCO 2014 and 2017 datasets.


Table 6. Detection performance (%) on the PASCAL VOC 2007 test set.

Backbone | Set  | Method          | mAP  | CorLoc
Enumerate-and-select methods (two-stage):
VGG16    | 07   | WSDDN [6]       | 34.8 | 53.5
VGG16    | 07   | OICR [43]       | 41.2 | 60.6
VGG16    | 07   | SLV [10]        | 53.5 | 71.0
VGG16    | 07   | DC-WSOD [3]     | 52.9 | 70.9
VGG16    | 07   | TS²C [50]       | 44.3 | 61.0
VGG16    | 07   | SDCN [28]       | 50.2 | 68.6
VGG16    | 07   | C-MIL [19]      | 50.5 | 65.0
VGG16    | 07   | PCL [42]        | 43.5 | 62.7
VGG16    | 07   | MIST [35]       | 54.9 | 68.8
VGG16    | 0712 | WSDDN† [6]      | 36.9 | 56.8
VGG16    | 0712 | OICR† [43]      | 43.6 | 61.7
Enumerate-and-select methods (end-to-end):
VGG16    | 07   | UWSOD [38]      | 44.0 | 63.0
Seed-and-refine methods (end-to-end):
VGG16    | 07   | OM+MIL [18]     | 23.4 | 41.2
VGG16    | 07   | OPG [39]        | 28.8 | 43.5
VGG16    | 07   | SPAM [22]       | 27.5 | -
CaiT     | 0712 | SPE (ours)-384  | 48.5 | 66.4
CaiT     | 0712 | SPE (ours)-512  | 51.0 | 70.4

Table 7. Detection performance (%) on the MS COCO 2014 and 2017 sets.

Backbone | Set | Method          | AP   | AP50 | AP75
Enumerate-and-select methods (two-stage):
VGG16    | 14  | WSDDN [6]       | -    | 11.5 | -
VGG16    | 14  | WCCN [15]       | -    | 12.3 | -
VGG16    | 14  | ODGA [16]       | -    | 12.8 | -
VGG16    | 14  | PCL [42]        | 8.5  | 19.4 | -
VGG16    | 14  | WSOD² [53]      | 10.8 | 22.7 | -
VGG16    | 14  | C-MIDN [21]     | 9.6  | 21.4 | -
VGG16    | 14  | MIST [35]       | 12.4 | 25.8 | 10.5
VGG16    | 14  | PG-PS [11]      | -    | 20.7 | -
Enumerate-and-select methods (end-to-end):
VGG16    | 17  | UWSOD [38]      | 2.5  | 9.3  | 1.1
WSR18    | 17  | UWSOD [38]      | 3.1  | 10.1 | 1.4
Seed-and-refine methods (end-to-end):
CaiT     | 14  | SPE (ours)-384  | 5.7  | 15.2 | 3.4
CaiT     | 17  | SPE (ours)-384  | 6.3  | 16.3 | 4.0
CaiT     | 17  | SPE (ours)-512  | 7.2  | 18.2 | 4.8

"14" and "17" in the "Set" column denote training on MS COCO 2014 and 2017, respectively. On COCO 2014, SPE achieves 5.7%, 15.2%, and 3.4% under the AP, AP50, and AP75 metrics, respectively, which is comparable with the two-stage "enumerate-and-select" methods. On MS COCO 2017, SPE achieves 6.3% AP, 16.3% AP50, and 4.0% AP75, outperforming the end-to-end UWSOD method [38] by 3.2%, 6.2%, and 2.6%. When increasing the image scale to 512, the APs are further improved by 0.9%, 1.9%, and 0.8%, respectively.
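The AP, AP50, and AP75 numbers above follow the standard COCO protocol, which is typically computed with pycocotools; a typical evaluation call looks like the following (the annotation and detection file names are placeholders, not the authors' artifacts).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: a COCO-format annotation file and a detection results file.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("spe_detections_val2017.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP (IoU=0.50:0.95), AP50, AP75, ...
```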

5 Conclusion

We proposed the sparse proposal evolution (SPE) approach, which advances WSOD from dense-proposal methods to an end-to-end framework with sparse proposals. SPE uses a "seed-and-refine" framework, which is efficient for both training and testing. By taking advantage of the vision transformer, SPE generates sparse yet high-quality seed proposals. With the one-to-one proposal matching strategy, SPE iteratively improves seed proposals and object detectors in a self-evolution fashion. As the first end-to-end framework with sparse proposals, SPE demonstrates tremendous potential and provides fresh insight into the challenging WSOD problem.


Acknowledgement. This work was supported by National Natural Science Foundation of China (NSFC) under Grant 62006216, 61836012, 62171431 and 62176260, the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No. XDA27000000.

References 1. Alex, K., Ilya, S., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NeurIPS, pp. 1097–1115 (2012) 2. Arbel´ aez, P.A., Pont-Tuset, J., Barron, J.T., Marqu´es, F., Malik, J.: Multiscale combinatorial grouping. In: IEEE CVPR, pp. 328–335 (2014) 3. Arun, A., Jawahar, C.V., Kumar, M.P.: Dissimilarity coefficient based weakly supervised object detection. In: IEEE CVPR, pp. 9432–9441 (2019) 4. Bilen, H., Pedersoli, M., Tuytelaars, T.: Weakly supervised object detection with posterior regularization. In: BMVC, pp. 1997–2005 (2014) 5. Bilen, H., Pedersoli, M., Tuytelaars, T.: Weakly supervised object detection with convex clustering. In: IEEE CVPR, pp. 1081–1089 (2015) 6. Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: IEEE CVPR. pp. 2846–2854 (2016) 7. Cao, T., Du, L., Zhang, X., Chen, S., Zhang, Y., Wang, Y.: Cat: Weakly supervised object detection with category transfer (2021) 8. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 9. Carreira, J., Sminchisescu, C.: CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE TPAMI 34(7), 1312–1328 (2012) 10. Carreira, J., Sminchisescu, C.: CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE TPAMI 34(7), 1312–1328 (2012) 11. Cheng, G., Yang, J., Gao, D., Guo, L., Han, J.: High-quality proposals for weakly supervised object detection. IEEE TIP 29, 5794–5804 (2020) 12. Cheng, M., Zhang, Z., Lin, W., Torr, P.H.S.: BING: binarized normed gradients for objectness estimation at 300fps. In: IEEE CVPR, pp. 3286–3293 (2014) 13. Chong, W., Kaiqi, H., Weiqiang, R., Junge, Z., Steve, M.: Large-scale weakly supervised object localization via latent category learning. IEEE TIP 24(4), 1371–1385 (2015) 14. Wang, C., Ren, W., Huang, K., Tan, T.: Weakly supervised object localization with latent category learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 431–445. Springer, Cham (2014). https:// doi.org/10.1007/978-3-319-10599-4 28 15. Diba, A., Sharma, V., Pazandeh, A., Pirsiavash, H., Van Gool, L.: Weakly supervised cascaded convolutional networks. In: IEEE CVPR, pp. 5131–5139 (2017) 16. Diba, A., Sharma, V., Stiefelhagen, R., Van Gool, L.: Object discovery by generative adversarial & ranking networks. arXiv preprint arXiv:1711.08174 (2017) 17. Dong, B., Huang, Z., Guo, Y., Wang, Q., Niu, Z., Zuo, W.: Boosting weakly supervised object detection via learning bounding box adjusters. In: IEEE ICCV (2021) 18. Dong, L., Bin, H.J., Yali, L., Shengjin, W., Hsuan, Y.M.: Weakly supervised object localization with progressive domain adaptation. In: IEEE CVPR, pp. 3512–3520 (2016)


19. Fang, W., Chang, L., Wei, K., Xiangyang, J., Jianbin, J., Qixiang, Y.: CMIL: continuation multiple instance learning for weakly supervised object detection. In: IEEE CVPR (2019) 20. Gao, W., et al.: TS-CAM: token semantic coupled attention map for weakly supervised object localization. CoRR abs/2103.14862 (2021) 21. Gao, Y., et al.: C-MIDN: coupled multiple instance detection network with segmentation guidance for weakly supervised object detection. In: IEEE ICCV (2019) 22. Gudi, A., van Rosmalen, N., Loog, M., van Gemert, J.C.: Object-extent pooling for weakly supervised single-shot localization. In: BMVC (2017) 23. Huang, Z., Zou, Y., Kumar, B.V.K.V., Huang, D.: Comprehensive attention selfdistillation for weakly-supervised object detection. In: NeurIPS (2020) 24. Kantorov, V., et al.: Deep self-taught learning for weakly supervised object localization. In: IEEE CVPR, pp. 4294–4302 (2017) 25. ContextLocNet: context-aware deep network models for weakly supervised localization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 350–365. Springer, Cham (2016). https://doi.org/10.1007/978-3-31946454-1 22 26. Kosugi, S., Yamasaki, T., Aizawa, K.: Object-aware instance labeling for weakly supervised object detection. In: IEEE ICCV (2019) 27. Zitnick, C.L., Doll´ ar, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-31910602-1 26 28. Li, X., Kan, M., Shan, S., Chen, X.: Weakly supervised object detection with segmentation collaboration. In: IEEE ICCV (2019) 29. Mark, E., Luc, V.G., KI, W.C., John, W., Andrew, Z.: The pascal visual object classes (VOC) challenge. IJCV. 88(2), 303–338 (2010) 30. Meng, D., et al.: Conditional DETR for fast training convergence. In: IEEE ICCV, pp. 3651–3660, October 2021 31. Oh, S.H., Jae, L.Y., Stefanie, J., Trevor, D.: Weakly supervised discovery of visual pattern configurations. In: NeurIPS, pp. 1637–1645 (2014) 32. Oh, S.H., Ross, G., Stefanie, J., Julien, M., Zaid, H., Trevor, D.: On learning to localize objects with minimal supervision. In: ICML, pp. 1611–1619 (2014) 33. Parthipan, S., Tao, X.: Weakly supervised object detector learning with model drift detection. In: IEEE ICCV, pp. 343–350 (2011) 34. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS, pp. 91–99 (2015) 35. Ren, Z., et al.: Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In: IEEE CVPR, pp. 10595–10604 (2020) 36. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: IEEE CVPR, June 2019 37. RR, U.J., de Sande Koen EA, V., Theo, G., WM, S.A.: Selective search for object recognition. IJCV. 104(2), 154–171 (2013) 38. Shen, Y., Ji, R., Chen, Z., Wu, Y., Huang, F.: UWSOD: toward fully-supervisedlevel capacity weakly supervised object detection. In: NeurIPS (2020) 39. Shen, Y., Ji, R., Wang, C., Li, X., Li, X.: Weakly supervised object detection via object-specific pixel gradient. IEEE TNNLS 29(12), 5960–5970 (2018) 40. Shen, Y., Ji, R., Wang, Y., Wu, Y., Cao, L.: Cyclic guidance for weakly supervised joint detection and segmentation. In: IEEE CVPR, pp. 697–707 (2019)


41. Singh, K.K., Lee, Y.J.: You reap what you sow: using videos to generate high precision object proposals for weakly-supervised object detection. In: IEEE CVPR, pp. 9414–9422 (2019) 42. Tang, P., et al.: PCL: proposal cluster learning for weakly supervised object detection. IEEE TPAMI 42(1), 176–191 (2020) 43. Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: IEEE CVPR, pp. 3059–3067 (2017) 44. Tang, P., et al.: Weakly supervised region proposal network and object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 370–386. Springer, Cham (2018). https://doi.org/10.1007/978-3030-01252-6 22 45. Thomas, D., Bogdan, A., Vittorio, F.: Weakly supervised localization and learning with generic knowledge. IJCV 100(3), 275–293 (2012) 46. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., J´egou, H.: Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021) 47. Tsung-Yi, L., Priya, G., Ross, G., Kaiming, H., Doll´ ar, P.: Focal loss for dense object detection. In: IEEE ICCV (2017) 48. Wan, F., Wei, P., Jiao, J., Han, Z., Ye, Q.: Min-entropy latent model for weakly supervised object detection. In: IEEE CVPR, pp. 1297–1306 (2018) 49. Wan, F., Wei, P., Jiao, J., Han, Z., Ye, Q.: Min-entropy latent model for weakly supervised object detection. IEEE TPAMI 41(10), 2395–2409 (2019) 50. Wei, Y., et al.: TS2 C: tight box mining with surrounding segmentation context for weakly supervised object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 454–470. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6 27 51. Ye, Q., Wan, F., Liu, C., Huang, Q., Ji, X.: Continuation multiple instance learning for weakly and fully supervised object detection. IEEE TNNLS, pp. 1–15 (2021). https://doi.org/10.1109/TNNLS.2021.3070801 52. Ye, Q., Zhang, T., Qiu, Q., Zhang, B., Chen, J., Sapiro, G.: Self-learning scenespecific pedestrian detectors using a progressive latent model. In: IEEE CVPR, pp. 2057–2066 (2017) 53. Zeng, Z., Liu, B., Fu, J., Chao, H., Zhang, L.: WSOD2: learning bottom-up and top-down objectness distillation for weakly-supervised object detection. In: IEEE ICCV (2019) 54. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: IEEE CVPR, pp. 2921–2929 (2016)

Calibration-Free Multi-view Crowd Counting

Qi Zhang¹,² and Antoni B. Chan²

¹ College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, China
² Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
[email protected], [email protected]

Abstract. Deep learning based multi-view crowd counting (MVCC) has been proposed to handle scenes with large size, in irregular shape or with severe occlusions. The current MVCC methods require camera calibrations in both training and testing, limiting the real application scenarios of MVCC. To extend and apply MVCC to more practical situations, in this paper we propose calibration-free multi-view crowd counting (CFMVCC), which obtains the scene-level count directly from the density map predictions for each camera view without needing the camera calibrations in the test. Specifically, the proposed CF-MVCC method first estimates the homography matrix to align each pair of camera-views, and then estimates a matching probability map for each camera-view pair. Based on the matching maps of all camera-view pairs, a weight map for each camera view is predicted, which represents how many cameras can reliably see a given pixel in the camera view. Finally, using the weight maps, the total scene-level count is obtained as a simple weighted sum of the density maps for the camera views. Experiments are conducted on several multi-view counting datasets, and promising performance is achieved compared to calibrated MVCC methods that require camera calibrations as input and use scene-level density maps as supervision.

1 Introduction

Crowd counting has many applications in real life, such as crowd control, traffic scheduling or retail shop management, etc. In the past decade, with the strong learning ability of deep learning models, single-view image counting methods based on density map prediction have achieved good performance. However, these single-view image methods may not perform well when the scene is too large or too wide, in irregular shape, or with severe occlusions. Therefore, multiview crowd counting (MVCC) has been proposed to fuse multiple camera views to mitigate these shortcomings of single-view image counting. Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9 14. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022  S. Avidan et al. (Eds.): ECCV 2022, LNCS 13669, pp. 227–244, 2022. https://doi.org/10.1007/978-3-031-20077-9_14


Fig. 1. The proposed calibration-free multi-view crowd counting (CF-MVCC) combines single-view predictions with learned weight maps to obtain the scene-level count.

The current MVCC methods rely on camera calibrations (both intrinsic and extrinsic camera parameters) to project features or density map predictions from the single camera views to the common ground-plane for fusion (see Fig. 1 top). The camera calibration is also required to obtain the ground-truth people locations on the ground-plane to build scene-level density maps for supervision. Although the latest MVCC method [55] handles the cross-view cross-scene (CVCS) setting, it still requires the camera calibrations during training and testing, which limits its real application scenarios. Therefore, it is important to explore calibration-free multi-view counting methods. For calibration-free MVCC, the key issue is to align the camera views without pre-provided camera calibrations. However, it is difficult to calibrate the cameras online from the multi-view images in MVCC, since there are a relatively small number of cameras (less than 5) that are typically on opposite sides of the scene (i.e., large change in camera angle). It may also be inconvenient to perform multiview counting by calibrating the camera views first if the model is tested on many different scenes. Besides, extra priors about the scenes are required to estimate camera intrinsic or extrinsic, such as in [1,2,4]. We observe that the people’ heads are approximately on a plane in the 3D world, and thus the same person’s image coordinates in different camera views can be roughly modeled with a homography transformation matrix. Thus, instead of using a common groundplane for aligning all the camera views together like previous methods [53,55], we propose to align pairs of camera views by estimating pairwise homography transformations. To extend and apply MVCC to more practical situations, in this paper, we propose a calibration-free multi-view crowd counting (CF-MVCC) method, which obtains the scene-level count as a weighted summation over the predicted density maps from the camera-views (see Fig. 1). The weight maps applied to each density map consider the number of cameras in which the given pixel is visible (to avoid double counting) and the confidence of each pixel (to avoid poorly predicted regions such as those far from the camera). The weight maps are generated using estimated pairwise homographies in the testing stage, and thus CF-MVCC can be applied to a novel scene without camera calibrations.


Fig. 2. Pipeline of CF-MVCC. The single-view counting (SVC) module predicts density maps Di for each camera-view. Given a pair of camera-views (i, j), the view-pair matching (VPM) module estimates the homography Hij and a matching probability map Mij between them. The weight map prediction (WMP) module calculates the weight map Wi for each camera using the matching probability maps Mij and confidence maps Ci , where the confidence maps are estimated from image features Fih and distance features Ti . Finally, the total count calculation (TCC) is obtained as a weighted sum between the density maps Di and the weight maps Wi .

Specifically, the proposed CF-MVCC method estimates the total crowd count in the scene via 4 modules. 1) Single-view counting module (SVC) consists of feature extraction and density map prediction submodules. 2) View-pair matching module (VPM) estimates the homography between pairs of camera views. For each camera pair, the features from one camera view are then projected to the other view, concatenated, and used to estimate a matching probability map between the two camera view. 3) Weight map prediction module (WMP) calculates a weight map for each view using all the matching probability maps. In addition, image content and distance information are used when calculating the weight maps to adjust for the confidence from each camera view. 4) Total count calculation module (TCC) obtains the total count as a weighted sum of the predicted single-view density maps using the estimated weight maps. In summary, the contributions of the paper are three-fold: 1. We propose a calibration-free multi-view counting model (CF-MVCC) to further extend the application of MVCC methods to more unconstrained scenarios, which can be applied to new scenes without camera calibrations. As far as we know, this is the first work to extend multi-view counting to the calibration-free camera setting. 2. The proposed method uses single-view density map predictions to directly estimate the scene crowd count without pixel-level supervision, via a weighting map with confidence score that is guided by camera-view content and distance information. 3. We conduct extensive experiments on multi-view counting datasets and achieve better performance than calibration-free baselines, and promising performance compared to well-calibrated MVCC methods. Furthermore, our model trained on a large synthetic dataset can be applied to real novel scenes with domain adaptation.
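A high-level sketch of how the four modules compose at inference time is given below. It is PyTorch-style pseudocode with assumed module interfaces (the `svc`, `vpm`, and `wmp` callables are placeholders), since the exact network definitions are given in the following subsections and the supplemental.

```python
import torch

def cf_mvcc_count(images, svc, vpm, wmp):
    """images: list of V camera-view tensors. svc/vpm/wmp are assumed callables
    implementing the SVC, VPM and WMP modules described in Sect. 3."""
    V = len(images)
    densities = [svc(img) for img in images]                  # D_i, one per view
    matches = {}
    for i in range(V):
        for j in range(V):
            if i != j:
                H_ij, M_ij = vpm(images[i], images[j])        # homography + matching map
                matches[(i, j)] = M_ij
    weights = [wmp(images[i], [matches[(i, j)] for j in range(V) if j != i])
               for i in range(V)]                             # W_i per view
    return sum((weights[i] * densities[i]).sum() for i in range(V))   # scene count S
```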

2 Related Work

In this section, we review single-image and multi-view counting, followed by DNN-based homography estimation. Single-Image Counting. Early research works on single-image counting rely on hand-crafted features [13,41], including detection-based [35], regression-based [6] or density map based methods [16]. Deep-learning based methods have been proposed for single image counting via estimating density maps [3,29,37,51]. Among them, many have focused on handling the scale variation and perspective change issues [12,14,17,20,40]. Unlike [38] and [47], [49] corrected the perspective distortions by uniformly warping the input images guided by a predicted perspective factor. Recent research explore different forms of supervision (e.g., regression methods or loss functions) [43,45]. [22] introduced local counting maps and an adaptive mixture regression framework to improve the crowd estimation precision in a coarse-to-fine manner. [25] proposed Bayesian loss, which adopts a more reliable supervision on the count expectation at each annotated point. To extend the application scenarios of crowd counting, weakly supervised [5,21,50,57] or semi-supervised methods [23,36,42] have also been proposed. Synthetic data and domain adaptation have been incorporated for better performance [46]. Other modalities are also fused with RGB images for improving the counting performance under certain conditions, such as RGBD [18] or RGBT [19]. In contrast to category-specific counting methods (e.g., people), general object counting has also been proposed recently [24,31,48]. [31] proposed a general object counting dataset and a model that predicts counting maps from the similarity of the reference patches and the testing image. Generally, all these methods aim at counting objects in single views, while seldom have targeted at the counting for whole scenes where a single camera view is not enough to cover a large or a wide scene with severe occlusions. Therefore, multi-view counting is required to enhance the counting performance for large and wide scenes. Multi-view Counting. Multi-view counting fuses multiple camera views for better counting performance for the whole scene. Traditional multi-view counting methods consist of detection-based [8,26], regression-based [34,44] and 3D cylinder-based methods [10]. These methods are frequently trained on a small dataset like PETS2009 [9]. Since they rely on hand-crafted features and foreground extraction techniques, their performance is limited. Recently, deep-learning multi-view counting methods have been proposed to better fuse single views and improve the counting performance. A multi-view multi-scale (MVMS) model [53] is the first DNNs based multi-view counting method. MVMS is based on 2D projection of the camera-view feature maps to a common ground-plane for predicting ground-plane scene-level density maps. However, the projection operation requires that camera calibrations are provided for training and testing. Follow-up work [54] proposed to use 3D density maps and 3D projection to improve counting performance. [55] proposed a cross-view cross-scene (CSCV) multi-view counting model by camera selection and noise


injection training. [58] enhanced the performance of the late fusion model in MVMS by modeling the correlation between each pair of views. For previous works, the single camera views (feature maps or density maps) are projected on the ground plane for fusion to predict the scene-level density maps, and thus camera calibrations are needed in the testing stage, which limits their applicability on novel scenes where camera calibrations are unavailable. In contrast, we propose a calibration-free multi-view counting method that does not require camera calibrations during testing. Our calibration-free setting is more difficult compared to previous multi-view counting methods. Deep Homography Estimation. Our work is also related to homography estimation works [27,30], especially DNNs-based methods [7,52]. [7] proposed to estimate the 8◦ C-of-freedom homography from an image pair with CNNs. [28] proposed an unsupervised method that minimizes the pixel-wise intensity error between the corresponding regions, but their unsupervised loss is not applicable when the change in camera view angle is large. [52] proposed to learn an outlier mask to select reliable regions for homography estimation. [15] proposed a multi-task learning framework for dynamic scenes by jointly estimating dynamics masks and homographies. Our proposed model estimates the homography matrix between the people head locations in the two views of each camera pair. Note that the change in view angle for camera-view pairs in the multi-view counting datasets (e.g., CityStreet) is quite large, which is in contrast to the typical setting for previous DNN-based homography estimation works where the change in angle is small. Therefore, the priors for unsupervised methods (e.g., [28]) are not applicable. Furthermore, the homography matrix in the proposed model is constructed based on the correspondence of people heads in the camera view pair, which are more difficult to observe compared to the objects in typical homography estimation datasets. Instead, we use a supervised approach to predict the homography matrix.

3 Calibration-Free Multi-view Crowd Counting

In this section we propose our model for calibration-free multi-view crowd counting (CF-MVCC). In order to avoid using the projection operation, which requires camera calibration, we could obtain the total count by summing the density maps predicted from each camera view. However, just summing all the singleview density maps would cause double counting on pixels that are also visible from other cameras. Therefore, we apply a weight map to discount the contribution of pixels that are visible from other camera views (see Fig. 1). The weight map is computed from a matching score map, which estimates the pixel-to-pixel correspondence between a pair of camera-views, and a confidence score map, which estimates the reliability of a given pixel (e.g., since predictions on faraway regions are less reliable). Specifically, our proposed CF-MVCC model consists of following 4 modules: single-image counting, view-pair matching, weight map prediction, and total count calculation. The pipeline is illustrated in Fig. 2. Furthermore, to validate the proposed method’s effectiveness on novel scenes, we


also train our model on a large synthetic dataset, and then apply it to real scenes via domain adaptation.

3.1 Single-View Counting Module (SVC)

The SVC module predicts the counting density map D_i for each camera-view i, based on an extracted feature map F_i^c. For a fair comparison with the SOTA calibrated MVCC method CVCS [55], our implementation follows CVCS [55] and uses the first 7 layers of VGG-net [39] as the feature extraction subnet and the remaining layers of CSR-net [17] as the decoder for predicting D_i. Other single-view counting models are also tested in the ablation study of the experiments and in the Supp. The loss used for training the SVC is $l_d = \sum_{i=1}^{V} \|D_i - D_i^{gt}\|_2^2$, where D_i and D_i^{gt} are the predicted and ground-truth density maps, the summation is over cameras, and V is the number of camera-views.
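The density-map loss l_d amounts to a per-view squared error; a minimal sketch, assuming the density maps are already available as tensors, is:

```python
import torch

def svc_loss(pred_densities, gt_densities):
    """l_d: sum over the V camera views of the squared L2 error between
    predicted and ground-truth density maps."""
    return sum(((p - g) ** 2).sum() for p, g in zip(pred_densities, gt_densities))

# Toy check with V = 3 views of 1 x 90 x 160 density maps.
preds = [torch.rand(1, 90, 160) for _ in range(3)]
gts = [torch.rand(1, 90, 160) for _ in range(3)]
print(svc_loss(preds, gts))
```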

3.2 View-Pair Matching Module (VPM)

The VPM module estimates the matching score M_ij between any two camera views i and j. First, we use a CNN to estimate the homography transformation matrix from camera view i to j, denoted as H_ij. This CNN extracts the two camera views' feature maps F_i^h and F_j^h. Next, the correlation map is computed between F_i^h and F_j^h, and a decoder is applied to predict the homography transformation matrix H_ij. For supervision, the ground-truth homography matrix H_ij^{gt} is calculated based on the corresponding people head locations in the two camera views. In the case that the camera-view pair has no overlapping field-of-view, a dummy homography matrix is used as the ground-truth to indicate that the two camera views do not overlap. The loss used to train the homography estimation CNN is $l_h = \sum_{i=1}^{V}\sum_{j \neq i} \|H_{ij} - H_{ij}^{gt}\|_2^2$. Next, a subnetwork is used to predict the matching score map M_ij, whose elements indicate the probability that a given pixel in view i has a match anywhere in view j. The input to the subnet is the concatenation of the features F_i^c from view i and the aligned features from view j, P(F_j^c, H_ij), where P is the projection layer adopted from STN [11].
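The projection layer P that aligns view-j features with view i can be realized with an STN-style sampler. A minimal sketch is shown below, assuming (as for the ground-truth homographies in Sect. 4.1) that the homography is defined over image coordinates normalized to [-1, 1]; this is an illustration, not the paper's exact layer.

```python
import torch
import torch.nn.functional as F

def project_features(feat_j, H_ij):
    """P(F_j, H_ij): warp view-j features into view i's frame.
    feat_j: (B, C, H, W); H_ij: (B, 3, 3) mapping view-i coords to view-j coords,
    both expressed in normalized [-1, 1] image coordinates (assumption)."""
    B, C, Hh, Ww = feat_j.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, Hh, device=feat_j.device),
        torch.linspace(-1, 1, Ww, device=feat_j.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pts_i = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)       # (H*W, 3)
    pts_j = H_ij @ pts_i.T.unsqueeze(0).expand(B, -1, -1)            # (B, 3, H*W)
    pts_j = pts_j / (pts_j[:, 2:3, :] + 1e-8)                        # dehomogenize
    grid = pts_j[:, :2, :].permute(0, 2, 1).reshape(B, Hh, Ww, 2)    # (B, H, W, 2)
    return F.grid_sample(feat_j, grid, align_corners=True)

# Identity homography leaves the features unchanged (up to interpolation).
feat = torch.rand(2, 8, 45, 80)
H_id = torch.eye(3).expand(2, 3, 3)
print(torch.allclose(project_features(feat, H_id), feat, atol=1e-5))
```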

3.3 Weight Map Prediction Module (WMP)

The WMP module calculates the weight map W_i for each view i based on the matching score maps {M_ij}_{j≠i} with the other camera views. Specifically, the weight map W_i is

$$W_i = 1 \big/ \Big(1 + \sum_{j \neq i} M_{ij}\Big). \quad (1)$$

Note that for pixel p, the denominator $1 + \sum_{j \neq i} M_{ij}(p)$ is the number of camera-views that see pixel p of camera-view i (including camera-view i itself). Thus the weight W_i(p) will average the density map values of corresponding pixels across


Fig. 3. Example of distance map (1 − Δi ). Usually, in surveillance cameras, the top and side areas on the image plane are faraway regions and the bottom areas are the nearer regions.

visible views, thus preventing double-counting of camera-view density maps with overlapping fields-of-view. In (1), the contribution of each camera-view is equal. However, single-view density map predictions may not always be reliable. Generally, the confidence (reliability) of regions with occlusions is lower than that of regions without occlusions, and the confidence of regions far from the camera is lower than that of near-camera regions. Therefore, to account for these issues, we estimate a confidence score map C_i for each camera view i, based on the image content features and pixel-wise distance information. The confidence maps are then incorporated into (1):

$$W_i = C_i \big/ \Big(C_i + \sum_{j \neq i} C_{ji} M_{ij}\Big), \quad (2)$$

where Cji = P (Cj , Hij ) is the projection of confidence map Cj to camera view i. Note that in (2), the views with higher confidence will have higher contribution to the count of a given pixel. The confidence map Ci is estimated with a CNN whose inputs are the image feature map Fih and distance feature map Ti . Ideally, Ti should be computed by feeding a distance map Δi , where each pixel is the distance-to-camera in the 3D scene, into a small CNN. We note the surveillance cameras are usually angled downward to cover the scene, where the top and side areas on the image plane are faraway regions and the bottom areas are the nearer regions. Since we do not have camera calibration to compute 3D distances, we use a simple approximation for Δi where the bottom-middle pixel is considered as the pixel nearest to the current camera (the value is 0), and values of other pixels are the Euclidean distance to the bottom-middle pixel (See Fig. 3). The distance map Δi is then normalized to [0, 1], and (1 − Δi ) is fed into a CNN to output the distance feature Ti . Related Work. The weight map of our proposed method is different from the comparison method Dmap weighted from [53]. Specifically, Dmap weighted uses the camera calibrations and assumes each image pixel’s height in 3D world is the average person height to calculate how many cameras can see a given pixel. Dmap weighted also does not consider occlusion handling and prediction


confidence. In contrast, our method does not use camera calibrations, but instead estimates matching scores based on estimated homographies between camera views, image content, and geometry constraints (see Eq. 1). Furthermore, we incorporate confidence scores to adjust each view's contribution, due to occlusion and distance (see Eq. 2).
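Both weighting schemes reduce to a few tensor operations once the matching maps, the projected confidence maps, and the normalized distance prior are available; the sketch below assumes those inputs are given and is only an illustration of Eqs. (1)-(2) and of the bottom-middle distance approximation.

```python
import torch

def weight_map_eq1(matching_maps):
    """Eq. (1): W_i = 1 / (1 + sum_j M_ij); matching_maps is a list of (H, W) maps."""
    return 1.0 / (1.0 + torch.stack(matching_maps, dim=0).sum(dim=0))

def weight_map_eq2(conf_i, projected_confs, matching_maps):
    """Eq. (2): W_i = C_i / (C_i + sum_j C_ji * M_ij), where C_ji is the
    confidence of view j already projected into view i's frame."""
    denom = conf_i + sum(c * m for c, m in zip(projected_confs, matching_maps))
    return conf_i / (denom + 1e-8)

def distance_prior(h, w):
    """Approximate distance map: the bottom-middle pixel is treated as closest
    to the camera; values are Euclidean distances normalized to [0, 1],
    and (1 - map) is what gets fed to the distance-feature CNN."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ref_y, ref_x = h - 1, (w - 1) / 2.0
    d = torch.sqrt((ys - ref_y) ** 2 + (xs - ref_x) ** 2)
    return 1.0 - d / d.max()

maps = [torch.rand(90, 160) for _ in range(2)]   # matching maps from two other views
print(weight_map_eq1(maps).shape, distance_prior(90, 160).shape)
```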

3.4 Total Count Calculation Module (TCC)

With the estimated weight map W_i for each camera view i, the final count S is the weighted summation of the density map predictions D_i: $S = \sum_{i=1}^{V} \mathrm{sum}(W_i \odot D_i)$, where ⊙ is element-wise multiplication and sum(·) is the summation over the map. For training, the total count loss is the MSE of the count prediction, $l_s = \|S - S^{gt}\|_2^2$, where S^{gt} is the ground-truth count. Finally, the loss for training the whole model is $l = l_s + l_d + l_h$.
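Putting the pieces together, the scene count and the training objective can be sketched as follows (again assuming the per-view weight and density maps are given as tensors):

```python
import torch

def scene_count(weights, densities):
    """S = sum_i sum( W_i ⊙ D_i ): weighted sum of per-view density maps."""
    return sum((w * d).sum() for w, d in zip(weights, densities))

def total_loss(S, S_gt, l_d, l_h):
    """l = l_s + l_d + l_h, with l_s the squared error on the scene count."""
    l_s = (S - S_gt) ** 2
    return l_s + l_d + l_h

weights = [torch.rand(90, 160) for _ in range(3)]
densities = [torch.rand(90, 160) for _ in range(3)]
S = scene_count(weights, densities)
print(total_loss(S, torch.tensor(120.0), torch.tensor(0.5), torch.tensor(0.2)))
```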

3.5 Adaptation to Novel Real Scenes

To apply our model to new scenes with novel camera views, we need a large number of multi-view counting scenes for training. Therefore, we train the proposed model on a large multi-view counting dataset [55]. However, directly applying the trained model to real scenes might not achieve satisfying performance due to the domain gap between the synthetic and real data in terms of single-view counting, view-pair homography estimation and matching. To reduce the domain gap, we first fine-tune the model trained on synthetic data on each real test scene with an unsupervised domain adaptation (UDA) technique [55], where only the test images are used without counting annotations or camera calibrations. To further improve the performance, we use one image with density map annotations from the training set of the target scenes, and only fine-tune the SVC module of the proposed model with the one labeled frame. Compared to [46], we only use synthetic labels and one labeled frame from the target scene, and do not require large amounts of target scene annotations; while compared to [55], we do not need calibrations of the real scenes. Therefore, ours is a more difficult and practical setting for applying the trained multi-view counting model to real scenes.

4 Experiment

4.1 Experiment Setting

Ground-Truth. We use the single-view density maps, the homography transformation matrices, and the scene crowd count as ground-truth for training. The ground-truth single-view density maps are constructed as in typical single-image counting methods [56]. The ground-truth homography transformation matrix of a camera-view pair is calculated from the corresponding people head coordinates (normalized to [−1, 1]). If there are no common people in the two camera views (no overlapping region), a "dummy" homography matrix is used as the ground-truth: H = [0, 0, −10; 0, 0, −10; 0, 0, 1]. As for the ground-truth people count, we only


require the total scene-level count, which is in contrast to [53], which requires scene-level people annotations on the ground-plane. Thus our setting is more difficult compared to the previous multi-view counting methods that use camera calibration and pixel-level supervision. Training and Evaluation. The training is stage by stage: we train the SVC and homography estimation CNNs, then fix both of them and train the remaining modules. On the large synthetic dataset, we use learning rates of 10−3 . On the real scene datasets, the learning rate is 10−4 . Network settings are in the supplemental. Mean absolute error (MAE) and mean normalized absolute error (NAE) of the predicted counts are used as the evaluation metrics. Datasets. We validate the proposed calibration-free multi-view counting on both a synthetic dataset CVCS [55] and real datasets, CityStreet [53] and PETS2009 [9]. Furthermore, we also apply the proposed model trained on CVCS dataset to real datasets CityStreet, PETS2009 and DukeMTMC [32,53]. – CVCS is synthetic dataset for multi-view counting task, which contains 31 scenes. Each scene contains 100 frames and about 100 camera views (280k total images). 5 camera views are randomly selected for 5 times for each scene in the training, and 5 camera views are randomly selected for 21 times for each test scene during testing. No camera calibrations are used in the training or testing. The input image resolution is 640 × 360. – CityStreet, PETS2009 and DukeMTMC are 3 real scene datasets for multi-view counting. CityStreet contains 3 camera views and 300 multi-view frames (676 × 380 resolution) for training and 200 for testing. PETS2009 contains 3 camera views and 1105 multi-view frames (384 × 288) for training and 794 for testing. DukeMTMC contains 4 camera views and 700 multiview frames (640 × 360) for training and 289 for testing. Among these 3 datasets, CityStreet is the most complicated dataset as it contains more severe occlusions and larger angle changes between camera views. Comparison Methods. We denote our method using the weight maps in Eq. 1 as CF-MVCC, and the weight maps with confidence scores in Eq. 2 as CFMVCC-C. As there are no previous calibration-free methods proposed, we adapt existing approaches to be calibration-free: – Dmap weightedH: This is the calibration-free version of Dmap weighted in [53]. With Dmap weighted, the density maps are weighted by how many times an image pixel can be seen by other camera views, based on the camera calibrations. Since camera calibrations are not available in our setting, the estimated homography Hij is used to calculate the weight maps. Note that this method only considers the camera geometry, and not other factors (e.g., image contents, occlusion, and distance) when computing the weights. – Dmap weightedA: The camera-view features are concatenated and used to estimate the weight maps for summing single-view predictions, which is a self-attention operation. Compared to Dmap weightH and our method, Dmap weightedA only considers image contents, and no geometry constraints.

236

Q. Zhang and A. B. Chan

Table 1. Scene-level counting performance on synthetic multi-scene dataset CVCS.

Calibrated

Method

MAE NAE

CVCS backbone CVCS (MVMS) CVCS

14.13 9.30 7.22

0.115 0.080 0.062

28.28 19.85 18.89 17.76 16.46 13.90

0.239 0.165 0.157 0.149 0.140 0.118

Calibration-free Dmap weightedH Dmap weightedA Total count 4D corr CF-MVCC (ours) CF-MVCC-C (ours)

– Total count: Since scene-level density maps are not available in our setting, we replace scene-level density maps with total count loss in CVCS [55]. – 4D corr: Replacing the VPM module in CF-MVCC with a 4D correlation [33] method for estimating the matching score Mij of the camera-view pair. Finally, we compare with multi-view counting methods that use camera calibrations: MVMS [53], 3D [54], CVCS backbone and CVCS [55], and CVF [58]. 4.2

Experiment Results

Scene-Level Counting Performance. We show the scene-level counting performance of the proposed models and comparison methods on CVCS, CityStreet and PETS2009 in Tables 1 and 2. On CVCS dataset, the proposed CF-MVCC-C achieves the best performance among the calibration-free methods. The comparison methods Dmap weightedH and Dmap weightedA only consider the camera geometry or the image contents, and thus their performance is worse than CFMVCC, which considers both. Including confidence score maps into the weights (CF-MVCC-C) will further improve the performance. Total count replaces the pixel-level supervision in CVCS with the total count loss, but directly regressing the scene-level count is not accurate since the projection to the ground stretches the features and makes it difficult to learn to fuse the multi-view features without pixel-level supervision. The 4D corr method also performs poorly because the supervision from the total-counting loss is too weak to guide the learning of the matching maps from the 4D correlation maps. Finally, our CF-MVCC-C performs worse than calibrated methods CVCS and CVCS (MVMS), but still better than CVCS backbone, which is reasonable since our method does not use any calibrations and no pixel-wise loss is available for the scene-level prediction. In Table 2, on both real single-scene datasets, our proposed calibration-free methods perform better than the other calibration-free methods. Furthermore, CF-MVCC-C is better than CF-MVCC, indicating the effectiveness of the confidence score in the weight map estimation. Compared to calibrated methods,

Calibration-Free Multi-view Crowd Counting Img

View 1

View 2

View 3

View 4

237

View 5

1.0 0.8 0.6 0.4 0.2 0.0 0.2

0.0

Fig. 4. Example of confidence maps C, weight maps W and density maps D. Table 2. Scene-level counting performance on real single-scene datasets.

Calibrated

Method

CityStreet MAE NAE

MVMS 3D counting CVF

8.01 7.54 7.08

0.096 3.49 0.091 3.15 3.08

0.124 0.113 -

9.84 9.40 11.28 8.82 8.24 8.06

0.107 0.123 0.152 0.102 0.103 0.102

0.136 0.252 0.265 0.147 0.125 0.116

Calibration-free Dmap weightedH Dmap weightedA Total count 4D corr CF-MVCC (ours) CF-MVCC-C (ours)

PETS2009 MAE NAE

4.23 6.25 6.95 4.55 3.84 3.46

CF-MVCC-C is comparable to MVMS [53], and slightly worse than 3D [54] and CVF [58]. The reason might be that the calibrated methods can implicitly learn some specific camera geometry in the fusion step, since the methods are trained and tested on the same scenes. Visualization Results. We show the visualization results the predicted confidence, weight, and density maps in Fig. 4. The red boxes indicate regions that cannot be seen by other cameras, and thus their predicted weights are large regardless of the confidence scores. The red circles show a person that can be seen in 3 camera views (3, 4 and 5) – the weights are small since the person can be seen by multiple cameras. This shows that the proposed method is effective at estimating weight maps with confidence information. See the supplemental for more visualizations (eg. projection results with ground-truth and predicted homography matrix). Ablation Studies. Various ablation studies are evaluated on the CVCS dataset.

238

Q. Zhang and A. B. Chan

Table 3. Ablation study on estimating the confidence map using image features and/or distance information. Method

Feat. Dist. MAE NAE

CF-MVCC CF-MVCC-F 

16.46

0.140

16.13

0.139

CF-MVCC-D



16.12

0.135

CF-MVCC-C 



13.90 0.118

Table 4. Ablation study on single-view counting networks for SVC module. SVC

Method

MAE NAE

CSR-Net [17] CF-MVCC 16.46 CF-MVCC-C 13.90 LCC [22]

0.140 0.118

CF-MVCC 14.01 0.117 CF-MVCC-C 12.79 0.109

Ablation Study on Confidence Map. We conduct an ablation study on the confidence score estimation: 1) without the confidence scores, i.e., CF-MVCC; 2) using only image features to estimate confidence scores, denoted as CF-MVCCF; 3) using only distance information, denoted as CF-MVCC-D; 4) using both image features and distance, i.e., our full model CF-MVCC-C. The results are presented in Table 3. Using either image features (CF-MVCC-F) or distance information (CF-MVCC-D) can improve the performance compared to not using the confidence map (CF-MVCC). Furthermore, using both image features and distance information (CF-MVCC-C) further improves the performance. Thus, the confidence map effectively adjusts the reliability of the each camera view’s prediction, in order to handle occlusion and/or low resolution. Ablation Study on Single-View Counting Network. We implement and test our proposed model with another recent single-view counting network LCC [22], which uses a larger feature backbone than CSRnet, and is trained with traditional counting density maps as in our model. The results presented in Table 4 show that the proposed CF-MVCC-C achieves better performance than CVCS when using different single-view counting networks in the SVC module. Ablation Study on the Homography Prediction Module. We also conduct experiments to show how the homography prediction module affects the performance of the model. Here the ground-truth homography matrix is used for training the proposed model. The performance of the proposed model trained with homography prediction Hpred or ground-truth Hgt is presented in Table 5. The model with ground-truth homography achieves better performance, and CF-MVCC-C performs better than CF-MVCC. Ablation Study on Variable Numbers of Camera-Views. The modules of the proposed models are shared across camera-views and camera-view pairs,

Calibration-Free Multi-view Crowd Counting

239

Table 5. Ablation study on the homography matrix input. Homography Method

MAE NAE

Hpred

CF-MVCC 16.64 CF-MVCC-C 13.90

Hgt

CF-MVCC 12.04 0.101 CF-MVCC-C 11.69 0.098

0.140 0.118

Table 6. Ablation study on testing with different numbers of input camera-views. The model is trained on CVCS dataset with 5 camera-views as input. No. Views CVCS backbone CVCS CF-MVCC-C MAE NAE MAE NAE MAE NAE 3

14.28 0.130

7.24

0.071 11.01 0.107

5

14.13 0.115

7.22

0.062 13.90 0.118

7

14.35 0.113

7.07

0.058 18.45 0.147

9

14.56 0.112

7.04

0.056 22.23 0.174

so our method can be applied to different numbers of camera views at test time. In Table 6, the proposed models are trained on the CVCS dataset [55] with 5 input camera views and tested on different number of views. Note that the ground-truth count is the people count covered by the multi-camera views. The performance of the proposed method CF-MVCC-C is worse than the calibrated method CVCS [55], but better than the calibrated method CVCS backbone [55] when the number of test camera views are close to the number of training views (3 and 5). Unlike CVCS method, the performance of CF-MVCC-C degrades as the number of cameras increases. The reason is that the error in weight map prediction might increase when the number of camera views changes. Adaptation to Novel Real Scenes. In this part, we use domain adaption to apply the proposed model CF-MVCC-C pre-trained on the synthetic CVCS dataset to the real scene datasets CityStreet, PETS2009 and DukeMTMC. We consider 3 training methods: 1) Synth, where pre-trained model is directly tested on the real scenes; 2) Synth+UDA, where unsupervised domain adaptation is applied to the pre-trained model. Specifically, 2 discriminators are added to distinguish the single-view density maps and weight maps of the source and target scenes. 3) Synth+F, where the models pre-trained on the synthetic dataset are fine-tuned with one labeled image set. Specifically, our pre-trained proposed model’s SVC module is fine-tuned with only 1 labeled camera-view image (V) from the training set of the real dataset, denoted as “+F(V)”. For comparison, the calibrated CVCS backbone and CVCS are fine-tuned with one set of multi-view images (V) and one labeled scene-level density map (S), denoted “+F(V+S)”. The results are presented in Table 7. The first 7 methods are calibrated methods that train and test on the same single scene (denoted as ‘RealSame’). This

240

Q. Zhang and A. B. Chan

Table 7. Results on real testing datasets. “Training” column indicates different training methods: “RealSame” means training and testing on the same single real scene; “Synth” means cross-scene training on synthetic dataset and directly testing on the real scenes; “+UDA” means adding unsupervised domain adaptation; “+F(V+S) means finetune the calibrated methods on a set of multi-view images (V) and one corresponding scene-level density map (S); “+F(V)” means finetuning the single-view counting with one labeled camera view image (V) from the training set of real scenes. Model

Training

Calibrated Dmap weighted [34] RealSame

PETS2009

DukeMTMC CityStreet

MAE NAE

MAE NAE

MAE NAE

7.51

0.261 2.12

0.255 11.10 0.121

Dect+ReID [53]

RealSame

9.41

0.289 2.20

0.342 27.60 0.385

LateFusion [53]

RealSame

3.92

0.138 1.27

0.198 8.12

0.097

EarlyFusion [53]

RealSame

5.43

0.199 1.25

0.220 8.10

0.096

MVMS [53]

RealSame

3.49

0.124 1.03

0.170 8.01

0.096

3D [54]

RealSame

3.15

0.113 1.37

0.244 7.54

0.091

CVF [58]

RealSame

3.08 -

Calibrated CVCS backbone [55] Synth

0.87 -

7.08 -

8.05

0.257 4.19

0.913 11.57 0.156

5.91

0.200 3.11

0.551 10.09 0.117

CVCS backbone [55] Synth+F(V+S) 5.78

0.186 2.92

0.597 9.71

CVCS [55]

Synth

5.33

0.174 2.85

0.546 11.09 0.124

CVCS [55]

Synth+UDA

5.17

0.165 2.83

0.525 9.58

CVCS [55]

Synth+F(V+S) 5.06 0.164 2.81 0.567 9.13 0.108

CVCS backbone [55] Synth+UDA

Calib-free CF-MVCC-C (ours) Synth

0.111 0.117

14.63 0.458 5.16

0.984 48.58 0.602

CF-MVCC-C (ours) Synth+UDA

12.76 0.398 2.65

0.498 14.89 0.176

CF-MVCC-C (ours) Synth+F(V)

4.85 0.162 1.80 0.293 8.13 0.095

can be considered as the upper-bound performance for this experiment. The remaining 9 methods are calibrated and calibration-free methods using domain adaptation. The proposed method trained with Synth+F(V) achieves better performance than other training methods or CVCS [55] with domain adaptation or Synth+F(V+S). Compared to calibrated single-scene models [34,53,54,58], the CF-MVCC-C training with Synth+F(V) still achieves promising performance, and is slightly worse than MVMS and 3D. Note that Synth+F(V) only uses one frame annotated with people during fine-tuning, and does not require camera calibrations during test time. Thus, our method has practical advantage over the calibrated single-scene methods, which require much more annotations and the camera calibrations.

5

Conclusion

In this paper, we propose a calibration-free multi-view counting method that fuses the single-view predictions with learned weight maps, which consider both similarity between camera-view pairs and confidence guided by image content

Calibration-Free Multi-view Crowd Counting

241

and distance information. The experiments show the proposed method can achieve better performance than other calibration-free baselines. Compared to previous calibrated multi-view methods, our proposed method is more practical for real applications, since our method does not need camera calibrations in the testing stage. The performance can be further improved by pre-training on a synthetic dataset, and applying domain adaptation with a single annotated image. In this case, our fine-tuned calibration-free method outperforms fine-tuned calibrated methods. Our work provides a promising step towards practical multiview crowd counting, which requires no camera calibrations from the test scene and only one image for fine-tuning the single-view density map regressor. Acknowledgements. This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 11212518, CityU 11215820), and by a Strategic Research Grant from City University of Hong Kong (Project No. 7005665).

References 1. Agarwal, S., et al.: Building Rome in a day. Commun. ACM 54(10), 105–112 (2011) 2. Ammar Abbas, S., Zisserman, A.: A geometric approach to obtain a bird’s eye view from an image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019) 3. Bai, S., He, Z., Qiao, Y., Hu, H., Wu, W., Yan, J.: Adaptive dilated network with self-correction supervision for counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4594–4603 (2020) 4. Bhardwaj, R., Tummala, G.K., Ramalingam, G., Ramjee, R., Sinha, P.: Autocalib: automatic traffic camera calibration at scale. ACM Trans. Sensor Netw. (TOSN) 14(3–4), 1–27 (2018) 5. von Borstel, M., Kandemir, M., Schmidt, P., Rao, M.K., Rajamani, K., Hamprecht, F.A.: Gaussian process density counting from weak supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 365– 380. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 22 6. Chan, A.B., Vasconcelos, N.: Counting people with low-level features and Bayesian regression. IEEE Trans. Image Process. 21(4), 2160–2177 (2012) 7. DeTone, D., Malisiewicz, T., Rabinovich, A.: Deep image homography estimation. arXiv preprint arXiv:1606.03798 (2016) 8. Dittrich, F., de Oliveira, L.E., Britto Jr, A.S., Koerich, A.L.: People counting in crowded and outdoor scenes using a hybrid multi-camera approach. arXiv preprint arXiv:1704.00326 (2017) 9. Ferryman, J., Shahrokni, A.: Pets 2009: dataset and challenge. In: 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 1–6. IEEE (2009) 10. Ge, W., Collins, R.T.: Crowd detection with a multiview sampler. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 324–337. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15555-0 24 11. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 2017–2025 (2015)


12. Jiang, X., et al.: Attention scaling for crowd counting. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020 13. Junior, J.C.S.J., Musse, S.R., Jung, C.R.: Crowd analysis using computer vision techniques. IEEE Signal Process. Mag. 27(5), 66–77 (2010) 14. Kang, D., Chan, A.: Crowd counting by adaptively fusing predictions from an image pyramid. In: BMVC (2018) 15. Le, H., Liu, F., Zhang, S., Agarwala, A.: Deep homography estimation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7652–7661 (2020) 16. Lempitsky, V., Zisserman, A.: Learning to count objects in images. In: Advances in Neural Information Processing Systems, pp. 1324–1332 (2010) 17. Li, Y., Zhang, X., Chen, D.: CSRNET: dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100 (2018) 18. Lian, D., Li, J., Zheng, J., Luo, W., Gao, S.: Density map regression guided detection network for RGB-D crowd counting and localization. In: CVPR, pp. 1821–1830 (2019) 19. Liu, L., Chen, J., Wu, H., Li, G., Li, C., Lin, L.: Cross-modal collaborative representation learning and a large-scale RGBT benchmark for crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4823–4833, June 2021 20. Liu, W., Salzmann, M., Fua, P.: Context-aware crowd counting. In: CVPR, pp. 5099–5108 (2019) 21. Liu, X., van de Weijer, J., Bagdanov, A.D.: Exploiting unlabeled data in CNNs by self-supervised learning to rank. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1862–1878 (2019) 22. Liu, X., Yang, J., Ding, W., Wang, T., Wang, Z., Xiong, J.: Adaptive mixture regression network with local counting map for crowd counting. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 241–257. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58586-0 15 23. Liu, Y., Liu, L., Wang, P., Zhang, P., Lei, Y.: Semi-supervised crowd counting via self-training on surrogate tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 242–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6 15 24. Lu, E., Xie, W., Zisserman, A.: Class-agnostic counting. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 669–684. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6 42 25. Ma, Z., Wei, X., Hong, X., Gong, Y.: Bayesian loss for crowd count estimation with point supervision, pp. 6141–6150 (2019) 26. Maddalena, L., Petrosino, A., Russo, F.: People counting by learning their appearance in a multi-view camera environment. Pattern Recogn. Lett. 36, 125–134 (2014) 27. Mishkin, D., Matas, J., Perdoch, M., Lenc, K.: WXBS: wide baseline stereo generalizations. In: British Machine Vision Conference (2015) 28. Nguyen, T., Chen, S.W., Shivakumar, S.S., Taylor, C.J., Kumar, V.: Unsupervised deep homography: a fast and robust homography estimation model. IEEE Robot. Autom. Lett. 3(3), 2346–2353 (2018) 29. O˜ noro-Rubio, D., L´ opez-Sastre, R.J.: Towards perspective-free object counting with deep learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 615–629. Springer, Cham (2016). https://doi.org/10. 1007/978-3-319-46478-7 38


30. Pritchett, P., Zisserman, A.: Wide baseline stereo matching. In: International Conference on Computer Vision (1998) 31. Ranjan, V., Sharma, U., Nguyen, T., Hoai, M.: Learning to count everything. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3394–3403, June 2021 32. Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: Hua, G., J´egou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 17–35. Springer, Cham (2016). https://doi.org/ 10.1007/978-3-319-48881-3 2 33. Rocco, I., Cimpoi, M., Arandjelovi´c, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus networks. arXiv preprint arXiv:1810.10510 (2018) 34. Ryan, D., Denman, S., Fookes, C., Sridharan, S.: Scene invariant multi camera crowd counting. Pattern Recogn. Lett. 44(8), 98–112 (2014) 35. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007) 36. Sam, D.B., Sajjan, N.N., Maurya, H., Radhakrishnan, V.B.: Almost unsupervised learning for dense crowd counting. In: Thirty-Third AAAI Conference on Artificial Intelligence, vol. 33(1), pp. 8868–8875 (2019) 37. Sam, D.B., Surya, S., Babu, R.V.: Switching convolutional neural network for crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, p. 6 (2017) 38. Shi, M., Yang, Z., Xu, C., Chen, Q.: Revisiting perspective information for efficient crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7279–7288 (2019) 39. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 40. Sindagi, V.A., Patel, V.M.: Generating high-quality crowd density maps using contextual pyramid CNNs. In: IEEE International Conference on Computer Vision (ICCV), pp. 1879–1888. IEEE (2017) 41. Sindagi, V.A., Patel, V.M.: A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recogn. Lett. 107, 3–16 (2018) 42. Sindagi, V.A., Yasarla, R., Babu, D.S., Babu, R.V., Patel, V.M.: Learning to count in the crowd from limited labeled data. arXiv preprint arXiv:2007.03195 (2020) 43. Song, Q., et al.: Rethinking counting and localization in crowds: a purely pointbased framework. arXiv preprint arXiv:2107.12746 (2021) 44. Tang, N., Lin, Y.Y., Weng, M.F., Liao, H.Y.: Cross-camera knowledge transfer for multiview people counting. IEEE Trans. Image Process. 24(1), 80–93 (2014) 45. Wan, J., Liu, Z., Chan, A.B.: A generalized loss function for crowd counting and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1974–1983, June 2021 46. Wang, Q., Gao, J., et al.: Learning from synthetic data for crowd counting in the wild. In: CVPR, pp. 8198–8207 (2019) 47. Yan, Z., et al.: Perspective-guided convolution networks for crowd counting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 952–961 (2019) 48. Yang, S.D., Su, H.T., Hsu, W.H., Chen, W.C.: Class-agnostic few-shot object counting. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 870–878 (2021)


49. Yang, Y., Li, G., Wu, Z., Su, L., Huang, Q., Sebe, N.: Reverse perspective network for perspective-aware object counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4374–4383 (2020) 50. Yang, Y., Li, G., Wu, Z., Su, L., Huang, Q., Sebe, N.: Weakly-supervised crowd counting learns from sorting rather than locations. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 1–17. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3 1 51. Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 833–841 (2015) 52. Zhang, J., Wang, C., Liu, S., Jia, L., Ye, N., Wang, J., Zhou, J., Sun, J.: Content-aware unsupervised deep homography estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 653–669. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 38 53. Zhang, Q., Chan, A.B.: Wide-area crowd counting via ground-plane density maps and multi-view fusion CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8297–8306 (2019) 54. Zhang, Q., Chan, A.B.: 3d crowd counting via multi-view fusion with 3d gaussian kernels. In: AAAI Conference on Artificial Intelligence, pp. 12837–12844 (2020) 55. Zhang, Q., Lin, W., Chan, A.B.: Cross-view cross-scene multi-view crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 557–567 (2021) 56. Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597 (2016) 57. Zhao, Z., Shi, M., Zhao, X., Li, L.: Active crowd counting with limited supervision. arXiv preprint arXiv:2007.06334 (2020) 58. Zheng, L., Li, Y., Mu, Y.: Learning factorized cross-view fusion for multi-view crowd counting. In: 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021)

Unsupervised Domain Adaptation for Monocular 3D Object Detection via Self-training

Zhenyu Li1, Zehui Chen2, Ang Li3, Liangji Fang3, Qinhong Jiang3, Xianming Liu1, and Junjun Jiang1(B)

1 Harbin Institute of Technology, Harbin, China
{zhenyuli17,csxm,jiangjunjun}@hit.edu.cn
2 University of Science and Technology, Hefei, China
[email protected]
3 SenseTime Research, Hong Kong, China
{liang1,fangliangji,jiangqinhong}@senseauto.com

Abstract. Monocular 3D object detection (Mono3D) has achieved unprecedented success with the advent of deep learning techniques and emerging large-scale autonomous driving datasets. However, drastic performance degradation remains an under-studied challenge for practical cross-domain deployment due to the lack of labels on the target domain. In this paper, we first comprehensively investigate the significant underlying factor of the domain gap in Mono3D, where the critical observation is a depth-shift issue caused by the geometric misalignment of domains. Then, we propose STMono3D, a new self-teaching framework for unsupervised domain adaptation on Mono3D. To mitigate the depth-shift, we introduce the geometry-aligned multi-scale training strategy to disentangle the camera parameters and guarantee the geometry consistency of domains. Based on this, we develop a teacher-student paradigm to generate adaptive pseudo labels on the target domain. Benefiting from the end-to-end framework that provides richer information about the pseudo labels, we propose the quality-aware supervision strategy to take instance-level pseudo confidences into account and improve the effectiveness of the target-domain training process. Moreover, the positive focusing training strategy and dynamic threshold are proposed to handle the large numbers of false negative (FN) and false positive (FP) pseudo samples. STMono3D achieves remarkable performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection dataset. To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D.

Keywords: Monocular 3D object detection · Domain adaptation · Unsupervised method · Self-training

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_15.

1 Introduction

Monocular 3D object detection (Mono3D) aims to categorize and localize objects from single input RGB images. With the prevalent deployment of cameras on autonomous vehicles and mobile robots, this field has drawn increasing research attention. Recently, it has obtained remarkable advancements [2,7,31,32,39,40,44] driven by deep neural networks and large-scale human-annotated autonomous driving datasets [3,16,20]. However, 3D detectors trained on one specific dataset (i.e., the source domain) may suffer from tremendous performance degradation when generalizing to another dataset (i.e., the target domain) due to unavoidable domain gaps arising from different types of sensors, weather conditions, and geographical locations. In particular, as shown in Fig. 1, the severe depth-shift caused by different imaging devices leads to completely incorrect localization. Hence, a monocular 3D detector trained on data collected in Singapore with nuScenes [3] cameras cannot work well (i.e., average precision drops to zero) when evaluated on data from European cities captured by KITTI [16] cameras. While collecting and training with more data from different domains could alleviate this problem, it is unfortunately infeasible, given diverse real-world scenarios and expensive annotation costs. Therefore, methods for effectively adapting a monocular 3D detector trained on a labeled source domain to a novel unlabeled target domain are highly demanded in practical applications. We call this task unsupervised domain adaptation (UDA) for monocular 3D object detection. While intensive UDA studies [9,12,15,19,26,35] have been proposed for the 2D image setting, they mainly focus on handling lighting, color, and texture variations. For Mono3D, however, since detectors must estimate the

Fig. 1. Depth-shift illustration: (a) camera view, (b) BEV view, (c) STMono3D. When inferring on the target domain, models can accurately locate the objects on the 2D image but predict totally wrong object depth with tremendous shifts. Such unreliable predictions used as pseudo labels cannot improve but instead hurt the model performance in STMono3D. GAMS guarantees the geometry consistency and enables models to predict correct object depth. Best viewed in color: predictions and ground truth are in orange and blue, respectively. Depth-shift is shown with green arrows. (Color figure online)


spatial information of objects from monocular RGB images, the geometric alignment of domains is much more crucial. Moreover, for UDA on LiDAR-based 3D detection [27,46–48], the fundamental differences in data structures and network architectures render those approaches not readily applicable to this problem. In this paper, we propose STMono3D for UDA on monocular 3D object detection. We first thoroughly investigate the depth-shift issue caused by the tight entanglement of models and camera parameters during the training stage: models can accurately locate objects on the 2D image but predict totally wrong object depth with tremendous shifts when inferring on the target domain. To alleviate this issue, we develop the geometry-aligned multi-scale (GAMS) training strategy to guarantee the geometric consistency of domains and to predict pixel-size depth, overcoming the otherwise inevitable misalignment and ambiguity. Hence, models can provide effective predictions on the unlabeled target domain. Based upon this, we adopt the mean teacher [37] paradigm to facilitate learning. The teacher model is essentially a temporal ensemble of student models, whose parameters are updated by an exponential moving average over student models of preceding iterations. It produces stable supervision for the student model without prior knowledge of the target domain. Moreover, we observe that the Mono3D teacher model suffers from extremely low confidence scores and numerous failed predictions on the target domain. To handle these issues, we adopt Quality-Aware Supervision (QAS), Positive Focusing Training (PFT), and Dynamic Threshold (DT) strategies. Benefiting from the flexibility of the end-to-end mean teacher framework, we utilize the reliability of each teacher-generated prediction to dynamically reweight the supervision loss of the student model, which takes the instance-level quality of pseudo labels into account and prevents low-quality samples from interfering with the training process. Since the backgrounds of domains are similar in Mono3D UDA for autonomous driving, we ignore the negative samples and only utilize positive pseudo labels to train the model. This avoids excessive FN pseudo labels at the beginning of the training process impairing the capability of the model to recognize objects. In synchronization with training, we utilize a dynamic threshold to adjust the filtering score, which stabilizes the growth of the number of pseudo labels. To the best of our knowledge, this is the first study to explore effective UDA methods for Mono3D. Experimental results on the KITTI [16], nuScenes [3], and Lyft [20] datasets demonstrate the effectiveness of our proposed methods, where the performance gaps between source-only results and fully supervised oracle results are closed by a large margin. It is noteworthy that STMono3D even outperforms the oracle results under the nuScenes→KITTI setting. Code will be released at https://github.com/zhyever/STMono3D.

2 Related Work

2.1 Monocular 3D Object Detection

Mono3D has drawn increasing attention in recent years [2,24,29–31,33,39,40, 42,43]. Earlier work utilizes sub-networks to assist 3D detection. For instance,


3DOP [8] and MLFusion [44] use depth estimators, while Deep3DBox [30] adopts 2D object detectors. Another line of research converts the RGB input to 3D representations, such as OFTNet [33] and Pseudo-LiDAR [43]. While these methods have shown promising performance, they rely on the design and performance of sub-networks or on dense depth labels. Recently, some methods propose to design the Mono3D framework in an end-to-end manner, as in 2D detection. M3D-RPN [2] implements a single-stage multi-class detector with a region proposal network and depth-aware convolution. SMOKE [25] proposes a simple framework to predict 3D objects without generating 2D proposals. Some methods [10,42] develop a DETR-like [4] bbox head, where 3D objects are predicted by independent queries in a set-to-set manner. In this paper, we mainly conduct UDA experiments based on FCOS3D [40], a neat and representative Mono3D paradigm that keeps the well-developed designs for 2D feature extraction and is adapted to the 3D task with only minimal additions for the specific 3D detection targets.

2.2 Unsupervised Domain Adaptation

UDA aims to generalize a model trained on a source domain to unlabeled target domains. So far, numerous methods have been proposed for various computer vision tasks [9,12,15,19,26,35,49] (e.g., recognition, detection, segmentation). Some methods [5,28,36] employ statistic-based metrics to model the differences between two domains. Other approaches [21,34,49] utilize a self-training strategy to generate pseudo labels for unlabeled target domains. Moreover, inspired by Generative Adversarial Networks (GANs) [17], adversarial learning has been employed to align feature distributions [13,14,38], which can be explained as minimizing the H-divergence [1] or the Jensen-Shannon divergence [18] between two domains. [23,41] alleviate the domain shift in batch normalization layers by modulating the statistics in the BN layer before evaluation or by specializing BN parameters domain by domain. Most of these domain adaptation approaches are designed for general 2D image recognition tasks, and directly adopting them for the large-scale monocular 3D object detection task may not work well due to the distinct characteristics of Mono3D, especially targets in 3D spatial coordinates. In terms of 3D object detection, [27,47,48] investigate UDA strategies for LiDAR-based detectors. SRDAN [48] adopts adversarial losses to align features and instances with similar scales between two domains. ST3D [47] and MLC-Net [27] develop self-training strategies with delicate designs, such as random object scaling, a triplet memory bank, and multi-level alignment, for domain adaptation. Following the successful trend of UDA on LiDAR-based 3D object detection, we investigate self-training strategies for Mono3D.

3 STMono3D

In this section, we first formulate the UDA task for Mono3D (Sect. 3.1), and present an overview of our framework (Sect. 3.2), followed by the self-teacher with temporal ensemble paradigm (Sect. 3.3). Then, we explain the details of


Fig. 2. Framework overview. STMono3D leverages the mean-teacher [37] paradigm where the teacher model is the exponential moving average of the student model and updated at each iteration. We design the GAMS (Sect. 3.4) to alleviate the severe depth-shift in cross domain inference and ensure the availability of pseudo labels predicted by the teacher model. QAS (Sect. 3.5) is a simple soft-teacher approach which leverages richer information from the teacher model to reweight losses and provide quality-aware supervision on the student model. PFT and DT are another two crucial training strategies presented in Sect. 3.6.

the geometry-aligned multi-scale training (GAMS, Sect. 3.4), the quality-aware supervision (QAS, Sect. 3.5), and two further crucial training strategies, positive focusing training (PFT) and dynamic threshold (DT) (Sect. 3.6).

3.1 Problem Definition

Under the unsupervised domain adaptation setting, we have access to labeled images from the source domain D_S = {x_s^i, y_s^i, K_s^i}_{i=1}^{N_S} and unlabeled images from the target domain D_T = {x_t^i, K_t^i}_{i=1}^{N_T}, where N_S and N_T are the numbers of samples from the source and target domains, respectively. Each 2D image x^i is paired with a camera parameter K^i that projects points in 3D space onto the 2D image plane, while y_s^i denotes the label of the corresponding training sample in the specific camera coordinate frame of the source domain. A label y consists of the object class k, location (c_x, c_y, c_z), size in each dimension (d_x, d_y, d_z), and orientation θ. We aim to train models with {D_S, D_T} and avoid performance degradation when evaluating on the target domain.
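For concreteness, a minimal sketch of how such source and target samples could be represented is given below; the class and field names are illustrative and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

# One 3D box label: (class, (cx, cy, cz), (dx, dy, dz), yaw)
Box3D = Tuple[str, Tuple[float, float, float], Tuple[float, float, float], float]

@dataclass
class MonoSample:
    """A single image with its intrinsics; labels exist only for source-domain data."""
    image: np.ndarray              # H x W x 3 RGB image
    K: np.ndarray                  # 3 x 3 camera intrinsic matrix
    labels: Optional[List[Box3D]] = None

def is_source(sample: MonoSample) -> bool:
    # Target-domain images come without 3D annotations under the UDA setting.
    return sample.labels is not None
```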

3.2 Framework Overview

We illustrate our STMono3D in Fig. 2. The labeled source domain data {x_S, y_S} is utilized for supervised training of the student model F_S with a loss L_S. For the unlabeled target domain data x_T, we first perturb it by applying a strong random augmentation to obtain x̂_T. Before being passed to the models, both the target and source domain inputs are further augmented by the GAMS strategy


in Sect. 3.4, where images and camera intrinsic parameters are carefully aligned via simultaneous rescaling. Subsequently, the original and perturbed images are sent to the teacher and student model, respectively, where the teacher model generates intuitively reasonable pseudo labels ŷ_T and supervises the student model via a loss L_T on the target domain:

L_T = L_T^r + L_T^c,   (1)

where L_T^r and L_T^c are the regression loss and classification loss, respectively. Here, we adopt the QAS strategy of Sect. 3.5 to further leverage richer information from the teacher model by instance-wise reweighting of the loss L_T. In each iteration, the student model is updated through gradient descent with the total loss L, a linear combination of L_S and L_T:

L = λ L_S + L_T,   (2)

where λ is the weight coefficient. The teacher model parameters are then updated from the corresponding parameters of the student model, as detailed in Sect. 3.3. Moreover, we observe that the teacher model suffers from numerous FN and FP pseudo labels on the target domain. To handle this issue, we utilize the PFT and DT strategies illustrated in Sect. 3.6.
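A compact PyTorch-style sketch of one training iteration under Eqs. (1)–(2) is shown below. The `student.loss` and `teacher.predict` interfaces, the batch keys, and the 0.35 score threshold are placeholders for illustration, not the authors' implementation.

```python
import torch

def train_step(student, teacher, src_batch, tgt_batch, optimizer,
               lam=1.0, score_thr=0.35):
    """One gradient step of L = lam * L_S + L_T (Eq. 2)."""
    # Supervised loss on labeled source-domain images.
    loss_s = student.loss(src_batch["images"], src_batch["labels"])

    # Teacher generates pseudo labels on the original target images.
    with torch.no_grad():
        pseudo = [p for p in teacher.predict(tgt_batch["images"])
                  if p["score"] > score_thr]   # keep confident boxes only

    # Student is supervised on strongly augmented target images with the pseudo labels.
    loss_t = student.loss(tgt_batch["images_strong_aug"], pseudo)

    loss = lam * loss_s + loss_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The teacher is subsequently refreshed by an EMA of the student (Sect. 3.3).
    return float(loss)
```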

3.3 Self-teacher with Temporal Ensemble

Following the successful trend of the mean teacher paradigm [37] in semi-supervised learning, we adapt it to our Mono3D UDA task as illustrated in Fig. 2. The teacher model F_T and the student model F_S share the same network architecture but have different parameters θ_T and θ_S, respectively. During training, the parameters of the teacher model are updated by taking an exponential moving average (EMA) of the student parameters:

θ_T = m θ_T + (1 − m) θ_S,   (3)

where m is the momentum, which is commonly set close to 1, e.g., 0.999 in our experiments. Moreover, the input of the student model is perturbed by a strong augmentation, which ensures that the pseudo labels generated by the teacher model are more accurate than the student model predictions, thus providing valid optimization directions for the parameter update. In addition, the strong augmentation also improves the generalization of the model to different domain inputs. Hence, by supervising the student model with pseudo targets ŷ_T generated by the teacher model (i.e., forcing consistency between the predictions of the student and the teacher model), the student can learn domain-invariant representations to adapt to the unlabeled target domain. Figure 4 shows that the teacher model can provide effective supervision to the student model, and Tables 4 and 5 demonstrate the effectiveness of the mean teacher paradigm.
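The EMA update of Eq. (3) can be written in a few lines; the buffer handling below (copying BatchNorm statistics from the student) is a common choice in mean-teacher implementations and is an assumption here, not a detail stated in the text.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.999):
    """theta_T <- m * theta_T + (1 - m) * theta_S, applied parameter-wise (Eq. 3)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1.0 - m)
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)   # e.g., BatchNorm running statistics
```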

3.4 Geometry-Aligned Multi-scale Training

Observation. As shown in Fig. 1, depth-shift drastically harms the quality of pseudo labels on the target domain. It is mainly caused by the domain-specific geometric correspondence between 3D objects and images (i.e., the camera imaging process). For instance, since the pixel size (defined in Eq. 6) of the KITTI dataset is larger than that of the nuScenes dataset, objects in images captured by KITTI cameras appear smaller than in nuScenes images. While the model can predict accurate 2D locations on the image plane, it tends to estimate a relatively more distant object depth based on the cue that far objects tend to be smaller in perspective view. We call this phenomenon depth-shift: models localize the accurate 2D location but predict depth with tremendous shifts on the target domain. To mitigate it, we propose a straightforward yet effective augmentation strategy, geometry-aligned multi-scale training, which disentangles the camera parameters from the detector and ensures geometric consistency in the imaging process.

Method. Given the source input {x_S, y_S, K_S} and the target input {x_T, K_T}, a naive geometry-aligned strategy is to rescale the camera parameters to the same constant values and resize the images correspondingly:

K' = diag(r_x, r_y, 1) \begin{bmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{bmatrix},   (4)

where r_x and r_y are resize rates, f and p are the focal length and optical center, and x and y indicate the image coordinate axes, respectively. However, since the ratio f/p cannot be changed by resizing, it is impractical to strictly align the geometric correspondences of 3D objects and images between different domains via such simple transformations. The inevitable discrepancy and ambiguity lead to a failure of UDA. To solve this issue, motivated by DD3D [31], we propose to predict the pixel-size depth d_p instead of the metric depth d_g:

d_p = \frac{s \cdot d_g}{c},   (5)

s = \sqrt{\frac{1}{f_x^2} + \frac{1}{f_y^2}},   (6)



where s and c are the pixel size and a constant, and d_p is the model prediction, which is scaled back to the final metric result d_g. Therefore, while there are inevitable discrepancies between the aligned geometry correspondences of the two domains, the model can infer depth from the pixel size and is more robust to varying imaging processes. Moreover, we further rescale the camera parameters into a multi-scale range, instead of to the same constant values, and resize the images correspondingly to increase the variability of training scales. During training, we keep the ground-truth 3D bounding boxes y_S and pseudo labels ŷ_T unchanged, avoiding changes to the real 3D scenes.
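The alignment and depth re-parameterization above amount to the following sketch, assuming the forms of Eqs. (4)–(6); the constant c and the function names are illustrative.

```python
import numpy as np
import cv2

def rescale_camera(image, K, rx, ry):
    """Resize the image and rescale the intrinsics consistently (Eq. 4); 3D boxes stay unchanged."""
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(round(w * rx)), int(round(h * ry))))
    return resized, np.diag([rx, ry, 1.0]) @ K

def pixel_size(K):
    """s = sqrt(1/fx^2 + 1/fy^2) (Eq. 6)."""
    fx, fy = K[0, 0], K[1, 1]
    return np.sqrt(1.0 / fx ** 2 + 1.0 / fy ** 2)

def metric_to_pixel_depth(d_g, K, c=1.0):
    """Training target: d_p = s * d_g / c (Eq. 5)."""
    return pixel_size(K) * d_g / c

def pixel_to_metric_depth(d_p, K, c=1.0):
    """Inference: invert Eq. 5 to recover metric depth from the network output."""
    return c * d_p / pixel_size(K)
```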

3.5 Quality-Aware Supervision

Observation. The cross-domain performance of the detector highly depends on the quality of the pseudo labels. In practice, we have to use a high threshold on the foreground score to filter out most false positive (FP) box candidates with low confidence. However, unlike the teacher models in semi-supervised 2D detection or in UDA for LiDAR-based 3D detectors, which detect objects with high confidence (e.g., the threshold is set to 90% and 70% in [45] and [47], respectively), we find that the Mono3D cross-domain teacher suffers from much lower confidence, as shown in Fig. 3. This is another phenomenon unique to Mono3D UDA, caused by the fact that oracle monocular 3D detection performance is much worse than that of 2D detection and LiDAR-based 3D detection. It indicates that even though the prediction confidence surpasses the threshold, we cannot ensure the sample quality, much less for the samples near the threshold. To alleviate this impact, we propose quality-aware supervision (QAS) to leverage richer information from the teacher and take instance-level quality into account.

Method. Thanks to the flexibility of the end-to-end mean teacher framework, we assess the reliability of each teacher-generated bbox being a real foreground, which is then used to weight the foreground classification loss of the student model. Given the foreground bounding box set {b_i^{fg}}_{i=1}^{N^{fg}}, the classification loss on the unlabeled images of the target domain is defined as:

L_T^c = \frac{μ}{N^{fg}} \sum_{i=1}^{N^{fg}} w_i \cdot l_{cls}(b_i^{fg}, G_{cls}),   (7)


where G_{cls} denotes the set of pseudo class labels, l_{cls} is the box classification loss, w_i is the confidence score of the i-th foreground pseudo box, N^{fg} is the number of foreground pseudo boxes, and μ is a constant hyperparameter. QAS resembles a simple positive mining strategy, which is intuitively reasonable in that pseudo labels with higher confidence should contribute more heavily to the supervision. Moreover, compared with semi-supervised and supervised tasks that focus on simple/hard negative samples [6,45], it is more critical for UDA Mono3D models to prevent the harmful influence of low-quality pseudo labels near the threshold. Such an instance-level weighting strategy balances the loss terms based on foreground confidence scores and significantly improves the effectiveness of STMono3D.
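Eq. (7) reduces to a confidence-weighted mean of per-box classification losses; a sketch with illustrative tensor names:

```python
import torch

def qas_classification_loss(cls_losses: torch.Tensor,
                            teacher_scores: torch.Tensor,
                            mu: float = 1.0) -> torch.Tensor:
    """L_T^c = mu / N_fg * sum_i w_i * l_cls_i (Eq. 7).

    cls_losses:     per-box classification losses l_cls(b_i^fg, G_cls), shape (N_fg,)
    teacher_scores: teacher confidences w_i for the foreground pseudo boxes, shape (N_fg,)
    """
    n_fg = max(int(cls_losses.numel()), 1)   # guard against an empty pseudo-label set
    return mu / n_fg * (teacher_scores.detach() * cls_losses).sum()
```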

3.6 Crucial Training Strategies

Positive Focusing Training. Since the whole STMono3D is trained in an end-to-end manner, the teacher model can hardly detect objects with confidence scores higher than the threshold at the start of training. Numerous FN pseudo samples impair the capability of the model to recognize objects. Because the backgrounds of different domains are similar, with negligible domain gaps in Mono3D UDA (e.g., street, sky, and houses), we propose the positive focusing training strategy. As


Fig. 3. (a) Correlation between confidence value and box IoU with ground-truth. (b) Distribution of confidence scores. The teacher suffers from low scores on the target domain. (c) Distribution of IoU between ground-truth and pseudo labels near the threshold (0.35–0.4). We highlight the existence of numerous low-quality and FP samples in these pseudo labels.

for L_T^c, we discard negative background pseudo labels and only utilize the positive samples to supervise the student model, which ensures that the model does not collapse to overfitting on the FN pseudo labels during training.

Dynamic Threshold. In practice, we find that the mean confidence score of pseudo labels gradually increases over the course of training. An increasing number of false positive (FP) samples appear in the middle and late stages of training, which severely hurts the model performance. While the QAS strategy proposed in Sect. 3.5 can reduce the negative impact of low-quality pseudo labels, completely wrong predictions still introduce unavoidable noise into the training process. To alleviate this issue, we propose a simple progressively increasing threshold strategy to dynamically change the threshold τ:

τ = \begin{cases} α, & iter < n_1 \\ α + k \cdot (iter − n_1), & n_1 ≤ iter < n_2 \\ α + k \cdot (n_2 − n_1), & iter ≥ n_2 \end{cases}   (8)

where α is the base threshold, set to 0.35 in our experiments based on the statistics in Fig. 3(a), k is the slope of the increasing threshold, and iter is the training iteration. The threshold is fixed at the minimum during the first n_1 warm-up iterations, as the teacher model can hardly detect objects with confidence scores higher than the base threshold. It then increases linearly once the teacher model starts to predict pseudo labels containing FP samples, to avoid the model being harmed by the increasing number of failure predictions. Finally, we find that the increase of the average score tends to saturate. Therefore, the threshold is fixed at the end of the training stage to guarantee the number of pseudo labels.
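The schedule of Eq. (8) is piecewise linear in the iteration count; a sketch with illustrative values for k, n1, and n2 (only alpha = 0.35 is given in the text):

```python
def dynamic_threshold(it: int, alpha: float = 0.35, k: float = 1e-5,
                      n1: int = 2000, n2: int = 10000) -> float:
    """Filtering threshold tau(iter) of Eq. (8)."""
    if it < n1:                       # warm-up: keep the base threshold
        return alpha
    if it < n2:                       # linear increase once FP pseudo labels appear
        return alpha + k * (it - n1)
    return alpha + k * (n2 - n1)      # saturate at the end of training

# Usage: keep only pseudo boxes whose confidence exceeds the current threshold.
# kept = [b for b in pseudo_boxes if b["score"] > dynamic_threshold(iteration)]
```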



Table 1. Dataset Overview. We focus on their properties related to frontal-view cameras and 3D object detection. The dataset size refers to the number of images used in the training stage. For Waymo and nuScenes, we subsample the data. See text for details.

Dataset      | Size  | Anno.  | Loc.     | Shape        | FOV        | Objects | Night
KITTI [16]   | 3712  | 17297  | EUR      | (375, 1242)  | (29°, 81°) | 8       | No
nuScenes [3] | 27522 | 252427 | SG., EUR | (900, 1600)  | (39°, 65°) | 23      | Yes
Lyft [20]    | 21623 | 139793 | SG., EUR | (1024, 1224) | (60°, 70°) | 9       | No

4 Experiments

4.1 Experimental Setup

Datasets. We conduct experiments on three widely used autonomous driving datasets: KITTI [16], nuScenes [3], and Lyft [20]. Our experiments cover two aspects: cross-domain adaptation between different cameras (present in all source-target pairs) and adaptation from label-rich domains to label-scarce domains (i.e., nuScenes→KITTI). We summarize the dataset information in detail in Table 1 and present more visual comparisons in the supplementary material.

Comparison Methods. In our experiments, we compare STMono3D with three methods: (i) Source Only, which directly evaluates the source-domain trained model on the target domain; (ii) Oracle, the fully supervised model trained on the target domain; and (iii) Naive ST (with GAMS), the basic self-training method, where we first train a model (with GAMS) on the source domain, then generate pseudo labels for the target domain, and finally fine-tune the trained model on the target domain.

Evaluation Metric. We adopt the KITTI evaluation metric for nuScenes→KITTI and Lyft→KITTI, and the nuScenes metric for Lyft→nuScenes. We focus on the commonly used car category. For Lyft→nuScenes, we evaluate models on the full ring view, which is more useful in real-world applications. For KITTI, we report the average precision (AP) with an IoU threshold of 0.5 for both the bird's-eye-view (BEV) IoU and the 3D IoU. For nuScenes, since the attribute labels differ from those of the source domain (i.e., Lyft), we discard the average attribute error (mAAE) and report the average translation error (mATE), scale error (mASE), orientation error (mAOE), and average precision (mAP). Following [47], we also report the fraction of the performance gap from Source Only to Oracle that is closed.

Implementation Details. We validate our proposed STMono3D with the FCOS3D [40] detection backbone. Since there is no modification to the model, our method can be adapted to other Mono3D backbones as well. We implement STMono3D based on the popular 3D object detection codebase mmDetection3D [11]. We utilize the SGD [22] optimizer. Gradient clipping and a warm-up policy are exploited with



Table 2. Performance of STMono3D on three source-target pairs. We report the AP of the car category at IoU = 0.5 as well as the domain gap closed by STMono3D. On nus→KITTI, STMono3D even achieves better AP11 results than the Oracle model, which demonstrates the effectiveness of our proposed method.

nus→K

APBEV IoU  0.5 Easy Mod. Hard

Method

Source Only 0 33.46 Oracle STMono3D 35.63 Closed Gap

0 23.62 27.37

0 22.18 23.95

0 29.01 28.65

0 19.88 21.89

0 17.17 19.55

APBEV IoU  0.5 Easy Mod. Hard 0 33.70 31.85

0 23.22 22.82

106.5% 115.8% 107.9% 98.7% 110.1% 113.8%

94.5% 98.2%

AP11

L→nus

L→K

APBEV IoU  0.5 Easy Mod. Hard

Method

AP40

AP3D IoU  0.5 Easy Mod. Hard

AP3D IoU  0.5 Easy Mod. Hard

Source Only 0 33.46 Oracle STMono3D 26.46

0 23.62 20.71

0 22.18 17.66

0 29.01 18.14

Closed Gap

87.6%

79.6%

62.5% 67.0%

79.0%

0 19.88 13.32

Method

Easy

AP3D IoU  0.5 Mod. Hard

0 20.68 19.30

0 28.33 24.00

0 18.97 16.85

0 16.57 13.66

93.3%

84.7%

88.8%

82.4%

Metrics AP

ATE

ASE

AOE

0 17.17 11.83

Source Only Oracle STMono3D

2.40 28.2 21.3

1.302 0.798 0.911

0.190 0.160 0.170

0.802 0.209 0.355

68.8%

Closed Gap

73.2%

77.5%

66.7%

82.9%

Table 3. Ablation study of the geometry-aligned multi-scale training. AP11

Nus→K

AP40

GAMS

APBEV IoU  0.5 AP3D IoU  0.5 APBEV IoU  0.5 AP3D IoU  0.5 Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard



0 0 0 0 0 0 0 0 0 0 0 0 35.63 27.37 23.95 28.65 21.89 19.55 31.85 22.82 19.30 24.00 16.85 13.66

the learning rate 2 × 10−2 , the number of warm-up iterations 500, warm-up ratio 0.33, and batch size 32 on 8 T V100s. The loss weight λ of different domains in Eq. 2 is set to 1. We apply a momentum m of 0.999 in Eq. 3 following most of mean teacher paradigms [27,45]. As for the strong augmentation, we adopt the widely used image data augmentation, including random flipping, random erase, random toning, etc.. We subsample 14 dataset during the training stage of NuScenes and Lyft dataset for simplicity. Notably, unlike the mean teacher paradigm or the self-training strategy used in UDA of LiDAR-based 3D detector [27,47], our STMono3D is trained in a totally end-to-end manner. 4.2

4.2 Main Results

As shown in Table 2, we compare the performance of our STMono3D with Source Only and Oracle. Our method outperforms the Source Only baseline in all evaluated UDA settings. Owing to the domain gap, the Source Only model can hardly detect 3D objects, with mAP dropping almost to 0%. In contrast, STMono3D improves the performance on the nuScenes→KITTI and Lyft→KITTI tasks by a large margin, closing around 110%/67% of the AP3D performance gap. Notably, the APBEV and AP3D (AP11, IoU ≥ 0.5) of STMono3D surpass the Oracle results, which indicates the effectiveness of our method. Furthermore,



when transferring Lyft models to other domains that have full ring-view annotations for evaluation (i.e., Lyft→nuScenes), our STMono3D also attains a considerable performance gain, closing up to 66% of the gap between Oracle and Source Only on AP3D.
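The "Closed Gap" entries in Table 2 follow directly from the reported numbers; a one-line helper (a sketch of the metric as described, not the authors' evaluation code):

```python
def closed_gap(result: float, source_only: float, oracle: float) -> float:
    """Fraction of the Source Only -> Oracle gap closed by a method; also valid for
    error metrics (lower is better) since the formula is sign-symmetric."""
    return (result - source_only) / (oracle - source_only)

# Example (nuScenes -> KITTI, AP3D Easy, AP11): closed_gap(28.65, 0.0, 29.01) ~ 0.987
```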

4.3 Ablation Studies and Analysis

In this section, we conduct extensive ablation experiments to investigate the individual components of our STMono3D. All experiments are conducted on the task of nuScenes→KITTI. Effective of Geometry-Aligned Multi-scale Training. We study the effects of GAMS in the mean teacher paradigm of STMono3D and the Naive ST pipeline. Table 3 first reports the experimental results when GAMS is disabled. Caused by the depth-shift analyzed in Sect. 3.4, the teacher model generates incorrect pseudo labels on the target domain, thus leading to a severe drop in model performance. Furthermore, as shown in Table 4, GAMS is crucial for effective Naive ST as well. It is reasonable that GAMS supports the model trained on the source domain to generate valid pseudo labels on the target domain, making the fine-tuning stage helpful for the model performance. We present pseudo labels predicted by the teacher model of STMono3D in Fig. 1, which shows that the depth-shift is well alleviated. All the results highlight the importance of GAMS for effective Mono3D UDA. Comparison of Self-training Paradigm. We compare our STMono3D with other commonly used self-training paradigms (i.e., Naive ST) in Table 4. While the GAMS helps the Naive ST teacher generate effective pseudo labels on the target domain to boost UDA performance, our STMono3D still outperforms it by a significant margin. One of the primary concerns lies in low-quality pseudo Table 4. Comparison of different self-training paradigms. Nus→K

KITTI AP11

Method

APBEV IoU  0.5 AP3D IoU  0.5 Easy Mod. Hard Easy Mod. Hard

Nus Metrics AP

ATE

ASE

AOE

Naive ST 0 0 0 0 0 0 – Naive ST with GAMS 9.05 9.08 8.82 3.72 3.69 3.58 14.0 STMono3D 35.63 27.37 23.95 28.65 21.89 19.55 36.5

– 0.906 0.731

– 0.164 0.160

– 0.264 0.167

Table 5. Ablation study of the exponential moving average strategy. Nus→K

AP11

AP40

EMA

APBEV IoU  0.5 AP3D IoU  0.5 APBEV IoU  0.5 AP3D IoU  0.5 Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard



2.55 2.41 2.38 0.82 0.82 0.82 0.45 0.31 0.25 0.06 0.03 0.02 35.63 27.37 23.95 28.65 21.89 19.55 31.85 22.82 19.30 24.00 16.85 13.66

STMono3D

257

Table 6. Ablation study of QAS on different loss terms. AP11

Nus→K Lreg T

Lcls T

√ √

√ √

AP40

APBEV IoU  0.5 AP3D IoU  0.5 APBEV IoU  0.5 AP3D IoU  0.5 Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard 26.33 21.50 35.63 21.74

21.92 17.57 27.37 19.56

19.57 15.35 23.95 17.22

21.17 16.57 28.65 18.09

18.14 13.80 21.89 15.67

16.46 11.34 19.55 14.71

21.66 20.47 31.85 16.01

16.64 15.77 22.82 13.26

14.03 13.12 19.30 11.15

15.55 15.32 24.00 10.89

12.06 11.69 16.85 9.22

9.88 9.35 13.66 7.49

Fig. 4. Performance comparison. (a) Oracle vs. Source Only with GAMS: while the Oracle performance progressively improves, the Source Only model suffers from drastic performance fluctuation. (b) Mean Teacher vs. Student on the target domain: not only does the teacher model outperform the student at the end of the training phase, its performance curve is also smoother and more stable.

labels caused by the domain gap. Moreover, as shown in Fig. 4(a), while the performance of Oracle improves progressively, the Source Only model on the target domain suffers from a performance fluctuation. It is also troublesome to choose a specific and suitable model from immediate results to generate pseudo labels for the student model. In terms of our STMono3D, the whole framework is trained in an end-to-end manner. The teacher is a temporal ensemble of student models at different time stamps. Figure 4(b) shows that our teacher model is much more stable compared with the ones in Naive ST and has a better performance than the student model at the end of the training phase, where the teacher model starts to generate more predictions over the filtering score threshold. This validates our analysis in Sect. 3.3 that the mean teacher paradigm provides a more effective teacher model for pseudo label generation. Table 5 demonstrates the effectiveness of the EMA of STMono3D. The performance significantly degrades when the EMA is disabled, and the model is easily crashed during the training stage. Effective of Quality-Aware Supervision. We study the effects of different applied loss terms of the proposed QAS strategy. Generally, the loss terms of Mono3D can be divided into two categories: (i) Lcls containing the object classification loss and attribute classification loss, and (ii) Lreg consisting of the location loss, dimension loss, and orientation loss. We separately apply the

258

Z. Li et al.

Fig. 5. Effects of the proposed DFT and DT. (a) Correlation between the average of the number of pseudo labels and training iters. (b) Examples of harmful FN and FP pseudo labels caused by disabling DFT and DT, respectively. Table 7. Ablation study of PFT and DT. AP11

Nus→K

AP40

APBEV IoU  0.5 AP3D IoU  0.5 APBEV IoU  0.5 AP3D IoU  0.5 PFT DT Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard √ √

√ √

13.57 19.59 18.90 35.63

11.33 16.00 16.57 27.37

10.31 14.35 15.75 23.95

9.10 15.96 15.15 28.65

7.80 13.15 13.73 21.89

7.00 12.23 12.85 19.55

12.36 13.44 12.74 31.85

9.42 9.76 10.35 22.82

8.03 7.90 9.42 19.30

7.82 9.23 8.41 24.00

5.82 6.52 6.81 16.85

5.08 5.13 5.96 13.66

QAS on these two kinds of losses and report the corresponding results in Table 6. Interestingly, utilizing the confidence score from the teacher to reweight L_reg does not improve the model performance. We speculate this is caused by the loose correlation between the IoU score and localization quality (see the yellow and blue lines in Fig. 3(a)), which is in line with the findings for the LiDAR-based method [47]. However, we find that QAS is well suited to L_cls, where the model performance increases by about 20.6% AP3D, indicating the effectiveness of our proposed QAS strategy. This is intuitively reasonable, since the score of a pseudo label is itself a measure of the confidence of the predicted object classification.

Effectiveness of Crucial Training Strategies. We further investigate the effectiveness of our proposed PFT and DT strategies. We first present the ablation results in Table 7. When we disable these strategies, model performance suffers drastic degradation, with AP3D dropping by 64.3%. The results demonstrate that they are crucial strategies in STMono3D. As shown in Fig. 5(a), we also present their influence in a more intuitive manner. If we disable PFT, the model is severely impaired by the numerous FN predictions (shown in Fig. 5(b), top) in the warm-up stage, leading to a failure to recognize objects in the following training iterations. On the other hand, for the teacher model without DT, the number of predictions abruptly increases at the end of the training process, introducing more FP predictions (shown in Fig. 5(b), bottom) that are harmful to the model performance.

5 Conclusion

In this paper, we have presented STMono3D, a meticulously designed unsupervised domain adaptation framework tailored for the monocular 3D object detection task. We show that the depth-shift caused by the geometric discrepancy between domains leads to drastic performance degradation during cross-domain inference. To alleviate this issue, we leverage a teacher-student paradigm for pseudo label generation and propose quality-aware supervision, positive focusing training, and a dynamic threshold to handle the difficulties of Mono3D UDA. Extensive experimental results demonstrate the effectiveness of STMono3D.

Acknowledgment. The research was supported by the National Natural Science Foundation of China (61971165, 61922027), and by the Fundamental Research Funds for the Central Universities.

References 1. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Mach. Learn. 79(1), 151–175 (2010) 2. Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: International Conference on Computer Vision (ICCV), pp. 9287–9296 (2019) 3. Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving. In: Computer Vision and Pattern Recognition (CVPR), pp. 11621–11631 (2020) 4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 5. Carlucci, F.M., Porzi, L., Caputo, B., Ricci, E., Bulo, S.R.: Autodial: Automatic domain alignment layers. In: International Conference on Computer Vision (ICCV), pp. 5077–5085. IEEE (2017) 6. Chen, X., Yuan, Y., Zeng, G., Wang, J.: Semi-supervised semantic segmentation with cross pseudo supervision. In: Computer Vision and Pattern Recognition (CVPR), pp. 2613–2622 (2021) 7. Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3D object detection for autonomous driving. In: Computer Vision and Pattern Recognition (CVPR), pp. 2147–2156 (2016) 8. Chen, X., et al.: 3D object proposals for accurate object class detection. In: Advances in Neural Information Processing Systems (NIPS) 28 (2015) 9. Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Domain adaptive faster RCNN for object detection in the wild. In: Computer Vision and Pattern Recognition (CVPR), pp. 3339–3348 (2018) 10. Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F.: Graph-DETR3D: rethinking overlapping regions for multi-view 3D object detection. arXiv preprint arXiv:2204.11582 (2022) 11. Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d (2020)



12. Dubourvieux, F., Audigier, R., Loesch, A., Ainouz, S., Canu, S.: Unsupervised domain adaptation for person re-identification through source-guided pseudolabeling. In: International Conference on Pattern Recognition (ICPR), pp. 4957– 4964 (2021) 13. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning (ICML), pp. 1180–1189. PMLR (2015) 14. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. (JMLR) 17(1), 2030–2096 (2016) 15. Ge, Y., et al.: Self-paced contrastive learning with hybrid memory for domain adaptive object Re-ID. Adv. Neural Inf. Process. Syst. (NIPS) 33, 11309–11321 (2020) 16. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361 (2012) 17. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS) 27 (2014) 18. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein GANs. In: Advances in Neural Information Processing Systems (NIPS) 30 (2017) 19. Hoffman, J., Wang, D., Yu, F., Darrell, T.: FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649 (2016) 20. Kesten, R., et al.: Level 5 perception dataset 2020. https://level-5.global/level5/ data/ (2019) 21. Khodabandeh, M., Vahdat, A., Ranjbar, M., Macready, W.G.: A robust learning approach to domain adaptive object detection. In: International Conference on Computer Vision (ICCV), pp. 480–490 (2019) 22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 23. Li, Y., Wang, N., Shi, J., Hou, X., Liu, J.: Adaptive batch normalization for practical domain adaptation. Pattern Recogn. (PR) 80, 109–117 (2018) 24. Li, Z., et al.: SimIPU: simple 2d image and 3D point cloud unsupervised pretraining for spatial-aware visual representations. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 1500–1508 (2022) 25. Liu, Z., Wu, Z., T´ oth, R.: Smoke: single-stage monocular 3D object detection via keypoint estimation. In: Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 996–997 (2020) 26. Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning (ICML), pp. 97–105 (2015) 27. Luo, Z., et al.: Unsupervised domain adaptive 3d detection with multi-level consistency. In: International Conference on Computer Vision (ICCV), pp. 8866–8875 (2021) 28. Mancini, M., Porzi, L., Bulo, S.R., Caputo, B., Ricci, E.: Boosting domain adaptation by discovering latent domains. In: Computer Vision and Pattern Recognition (CVPR), pp. 3771–3780 (2018) 29. Mao, J., Shi, S., Wang, X., Li, H.: 3D object detection for autonomous driving: a review and new outlooks. arXiv preprint arXiv:2206.09474 (2022) 30. Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: Computer Vision and Pattern Recognition (CVPR), pp. 7074–7082 (2017)



31. Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is pseudo-lidar needed for monocular 3d object detection? In: International Conference on Computer Vision (ICCV), pp. 3142–3152 (2021) 32. Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical depth distribution network for monocular 3D object detection. In: Computer Vision and Pattern Recognition (CVPR), pp. 8555–8564 (2021) 33. Roddick, T., Kendall, A., Cipolla, R.: Orthographic feature transform for monocular 3D object detection. arXiv preprint arXiv:1811.08188 (2018) 34. Saito, K., Ushiku, Y., Harada, T.: Asymmetric tri-training for unsupervised domain adaptation. In: International Conference on Machine Learning (ICML), pp. 2988– 2997. PMLR (2017) 35. Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Strong-weak distribution alignment for adaptive object detection. In: Computer Vision and Pattern Recognition (CVPR), pp. 6956–6965 (2019) 36. Sun, B., Saenko, K.: Deep CORAL: correlation alignment for deep domain adaptation. In: Hua, G., J´egou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 443–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8 35 37. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in neural information processing systems (NIPS) 30 (2017) 38. Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: International Conference on Computer Vision (ICCV), pp. 4068–4076 (2015) 39. Wang, T., Xinge, Z., Pang, J., Lin, D.: Probabilistic and geometric depth: detecting objects in perspective. In: Conference on Robot Learning (CoRL), pp. 1475–1485 (2022) 40. Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3D: fully convolutional one-stage monocular 3d object detection. In: International Conference on Computer Vision Workshop (ICCVW), pp. 913–922 (2021) 41. Wang, X., Jin, Y., Long, M., Wang, J., Jordan, M.I.: Transferable normalization: towards improving transferability of deep neural networks. In: Advances in Neural Information Processing Systems (NIPS) 32 (2019) 42. Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: Conference on Robot Learning (CoRL), pp. 180–191 (2022) 43. Weng, X., Kitani, K.: Monocular 3D object detection with pseudo-lidar point cloud. In: International Conference on Computer Vision Workshops (ICCVW) (2019) 44. Xu, B., Chen, Z.: Multi-level fusion based 3D object detection from monocular images. In: Computer Vision and Pattern Recognition (CVPR), pp. 2345–2353 (2018) 45. Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: International Conference on Computer Vision (ICCV), pp. 3060–3069 (2021) 46. Yang, J., Shi, S., Wang, Z., Li, H., Qi, X.: ST3D++: denoised self-training for unsupervised domain adaptation on 3D object detection. arXiv preprint arXiv:2108.06682 (2021) 47. Yang, J., Shi, S., Wang, Z., Li, H., Qi, X.: ST3D: self-training for unsupervised domain adaptation on 3D object detection. In: Computer Vision and Pattern Recognition (CVPR), pp. 10368–10378 (2021)



48. Zhang, W., Li, W., Xu, D.: SRDAN: scale-aware and range-aware domain adaptation network for cross-dataset 3D object detection. In: Computer Vision and Pattern Recognition (CVPR), pp. 6769–6779 (2021) 49. Zou, Y., Yu, Z., Vijaya Kumar, B.V.K., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 297–313. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9 18

SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud

Xiangrui Zhao1,2, Sheng Yang2, Tianxin Huang1, Jun Chen1, Teng Ma2, Mingyang Li2, and Yong Liu1(B)

1 APRIL Lab, Zhejiang University, Hangzhou, Zhejiang, China
{xiangruizhao,21725129,junc}@zju.edu.cn, [email protected]
2 Autonomous Driving Lab, DAMO Academy, Hangzhou, Zhejiang, China
[email protected]

Abstract. Poles and building edges are frequently observable objects on urban roads, conveying reliable hints for various computer vision tasks. To repetitively extract them as features and associate them across discrete LiDAR frames for registration, we propose the first learning-based feature segmentation and description model for 3D lines in LiDAR point clouds. To train our model without a time-consuming and tedious data labeling process, we first generate synthetic primitives covering the basic appearance of target lines, and build an iterative line auto-labeling process to gradually refine line labels on real LiDAR scans. Our segmentation model can extract lines under arbitrary scale perturbations, and we use shared EdgeConv encoder layers to train the segmentation and descriptor heads jointly. Based on this model, we can build a readily applicable global registration module for point cloud registration, for conditions without initial transformation hints. Experiments have demonstrated that our line-based registration method is highly competitive with state-of-the-art point-based approaches. Our code is available at https://github.com/zxrzju/SuperLine3D.git.

Keywords: 3D Line Feature · Point cloud registration

1 Introduction

Point cloud registration is an essential technique for LiDAR-based vehicle localization in urban road scenes [28]. Considering recent research [15,18], the SLAM community [19] divides these algorithms into two categories regarding their purpose: local and global search methods. The local search category [6,7] typically constructs a non-convex optimization problem by greedily associating the nearest entities to align. This often relies on a good initial guess, and is thus mostly used for incremental positioning modules such as LiDAR odometry [41]


and map-based localization [32]. The global search category is used for less informative conditions, i.e., relocalization and map-initialization problems in which the initial guess is not reliable and large positional and rotational changes exist. Since nearest-neighbor search cannot find correct matching pairs in the Euclidean space, global search algorithms extract distinct entities and construct feature descriptors [45] to establish matches in the description space. There exist a variety of classical hand-crafted features (e.g., FPFH [33]) for global search and registration, and recent learning-based methods [43] have improved registration accuracy and success rate. However, the performance of some methods [25,27] drops severely when adapting to real LiDAR scans, because the density of scanned points is inversely proportional to the scanning distance, which harms the coherence of point descriptions. Considering this limitation of single points, we propose to use structural lines, analogously to previous approaches for images [16,40], to see whether a relatively stable descriptor can be obtained from a semantically meaningful group of scattered points. In typical LiDAR point clouds scanned from urban road scenes, there are three categories of lines: 1) intersections of planes, e.g., edges between two building facades and curbs; 2) standalone pole objects, e.g., street lamps and road signs alongside the road; 3) virtual lines consisting of edge points across multiple scan rings, generated by ray-obstacle occlusions. While the last category is not repeatable and thus inappropriate for localization, the first two types are practical landmarks suitable for extraction and description. Since these line segments are larger targets than point features, they have a higher chance of being repeatably observed. Moreover, the estimated position of each line is more precise than that of a single corresponding point between frames, because the limited scanning resolution causes sampling issues for points. In this paper, we propose a self-supervised learning method for line segmentation and description on LiDAR scans (Fig. 1). Following the training procedure of SuperPoint [11] to overcome the lack of publicly available training data, we train our line extraction model by first constructing limited synthetic data and then performing auto-labeling on real scans. By sharing point cloud encoding layers and using two separate branches for decoding and application heads, we are able to jointly train the two tasks on the generated data. We view this pipeline for training and using line features for scan registration as the key contribution of our work, which includes:

– To the best of our knowledge, we propose the first learning-based line segmentation and description method for LiDAR scans, bringing up an applicable feature category for global registration.
– We propose a line segment labeling method for point clouds, which can migrate a model learned from synthetic data to real LiDAR scans for automatic labeling.
– We explore the scale invariance of point cloud features, and provide a feasible idea for improving the generalization of learning-based tasks on point clouds under scale perturbations by eliminating the scale factor in the Sim(3) transformation.

Fig. 1. Pipeline overview. a): We train a scale-invariant segmentation on the synthetic data and obtain precise line segment labels after multiple geometric adaptation iterations. b): We simultaneously train segmentation and description on labeled LiDAR scans, where red, purple, and green layers stand for the encoders, segmentation header, and description header, respectively (Color figure online).

Extensive experimental results have shown that our line-based registration maintains a high success rate and accuracy under large-angle perturbations, and that the model trained on one real-scan dataset is highly adaptable to other urban scene datasets.

2 Related Work

Learning-Based Point Cloud Registration. In recent research, a variety of learning-based approaches have been proposed for registering point clouds, and we can divide them into two groups according to whether explicit features are extracted. End-to-end approaches use the ground-truth transformation in the loss calculation and predict the transformation directly through the network: FMR [17] registers point clouds by minimizing a feature-metric loss, and PCRNet [34] evaluates the similarity of PointNet [30] features and regresses poses directly through fully connected layers. These trained end-to-end models work well on the tested sequences, but they face a practical problem of how to perform joint state estimation in a multi-sensor fusion system [12], and the knowledge of these models is hardly adaptable to different motion schemes and other datasets. Therefore, methods with explicit feature extraction and description are still an active branch in the SLAM community.

Registration with Explicit Features. Starting with hand-crafted features (e.g., FPFH [33] and ISS [44]) that summarize local patch appearances of point clouds, methods that extract and describe explicit features mainly aim at the saliency of entities and the coherency of description. While hand-crafted features are mostly designed for evenly sampled clouds, learning-based features [4,9,10,21,22,25]


have better robustness and generalization once trained on the target LiDAR scan datasets. D3Feat [5] uses kernel-based convolution [36] to learn feature detection and description. SpinNet [3] builds a rotation-invariant local surface descriptor through a novel spatial point transformer and a 3D cylindrical convolutional layer. Both D3Feat [5] and SpinNet [3] are state-of-the-art learning-based point features, but they still suffer from the inherent problems of point features and thus require sample consensus as a post-pruning procedure to filter correct feature associations.

Line Features for SLAM. Image-based line-aware approaches for detection (e.g., LSD [37], EDLines [2], and TP-LSD [16]), description (e.g., LBD [42] and SOLD2 [26]), and systematic SLAM designs (e.g., PL-SLAM [29]) have been well studied in recent years, whereas LiDAR-scan-based extraction and description methods, although heavily used in modern LiDAR SLAM approaches (e.g., LOAM [41] and LeGO-LOAM [35]), are underexplored. To the best of our knowledge, Lu et al. [24] proposed a 3D line detection method that projects LiDAR points onto an image and thus converts the task into a 2D detection problem. Chen et al. [8] build on this work [24] to carry out a line-based registration approach for structural scenes. However, their limitations are twofold: 1) they only work on organized point clouds, and 2) they do not address line description and are thus not suitable for global search registration problems. In contrast, we follow the idea of descriptor aggregation from SOLD2 [26], which proves useful in our paper for coherently describing a group of points.

3 Method

Considering the lack of labeled line datasets for LiDAR scans, we follow the self-supervised idea of SuperPoint [11] to train our line segmentation model by first constructing a simple synthetic dataset to initialize a base model, and then refining the model iteratively with real LiDAR scans auto-labeled via geometric adaptation (Sect. 3.1). After that, we gather line correspondences between different LiDAR scans and jointly train line segmentation and description in an end-to-end manner (Sect. 3.2).

3.1 Line Segmentation Model

Synthetic Data Generation. As discussed in Sect. 1, there are two types of reliable line segments to detect: 1) intersections between planes, and 2) poles. Hence, we use the two mesh primitives shown in Fig. 2(a) to simulate their local appearances. These two mesh models are first uniformly sampled into 4,000 points as in Fig. 2(b), with 5% relative 3-DOF positional perturbation added to each point. Then, to simulate possible background points nearby, we randomly crop 40 basic primitives, each containing 1,000 points, from real scans [14], and put them together to compose the final synthetic data. In total, we generated 5,000 synthetic point clouds with 5,000 points per cloud.
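To make the composition step concrete, the following minimal NumPy sketch mirrors the procedure described above (uniform sampling, 5% relative jitter, real-scan crops as background). The function name, the uniform noise model, and the input conventions are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def make_synthetic_cloud(primitive_pts, background_crops, jitter_ratio=0.05):
    """Compose one synthetic training cloud: a jittered line primitive plus
    real-scan crops used as background noise (a sketch of the procedure
    described above; inputs are assumed to be pre-sampled arrays of shape (N, 3))."""
    # 5% relative positional perturbation on the sampled primitive points.
    extent = primitive_pts.max(axis=0) - primitive_pts.min(axis=0)
    noise = (np.random.rand(*primitive_pts.shape) - 0.5) * jitter_ratio * extent
    jittered = primitive_pts + noise

    # Primitive points carry label 1 (line), background points carry label 0.
    pts = [jittered]
    labels = [np.ones(len(jittered), dtype=np.int64)]
    for crop in background_crops:
        pts.append(crop)
        labels.append(np.zeros(len(crop), dtype=np.int64))
    return np.concatenate(pts, axis=0), np.concatenate(labels, axis=0)
```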

Fig. 2. Synthetic data generation steps: (a) mesh model, (b) point cloud, (c) noise sampling, (d) synthetic data, (e) scale comparison. We generate synthetic data by sampling primitive mesh models and augmenting scattered points from real scans as noise.

Scale-Invariant Line Segmentation. We treat line detection as a point cloud segmentation problem, and the main challenge is the primitive scaling issue: in a real LiDAR frame, the density of points decreases with the scanning distance, and voxel grid downsampling cannot fully normalize the density when the target feature is far away from the sensor. Moreover, our synthetic data generation also did not consider the scale of line segments (as visualized in Fig. 2(e) when put together). If this issue is not handled, the model will not produce reasonable predictions when the training and test data are on different scales. To address it, our network obtains scale invariance by eliminating the scale factor s of the Sim(3) transformation and using relative distances, as:

\[
p' = s \cdot R p + t, \qquad
f = \frac{\sum_i \lVert p' - p'_i \rVert_1}{\sum_i \lVert p' - p'_i \rVert_2}
  = \frac{s \cdot \sum_i \lVert R\,(p - p_i) \rVert_1}{s \cdot \sum_i \lVert p - p_i \rVert_2}.
\tag{1}
\]

In Eq. (1), we search the k = 20 nearest points {p_1, p_2, ..., p_k} of a point p and calculate the scale-invariant local feature f as the ratio of the Manhattan distance to the Euclidean distance between p and its neighbors; the scale factor s cancels out. The trade-off of this feature definition is that f cannot reflect the position of the original point in the Euclidean space, so the transformation loses information. This influence is further evaluated in Sect. 4.3.

Model Architecture. We choose DGCNN [39] as our backbone, since it directly encodes points and their nearest neighbors without complicated operations. Equation (2) shows its local feature encoding function, called EdgeConv [39], where x_j is the j-th feature, {}^S x_i is a neighbor of x_j in the feature space S, and h is the learnable model:

\[
h\left(x_j, {}^{S}x_i\right) = \bar{h}\left(x_j, {}^{S}x_i - x_j\right).
\tag{2}
\]

In the first EdgeConv layer, x represents the point coordinates in Euclidean space. In our implementation, we gather the k = 20 nearest neighbors of each point and calculate the scale-invariant feature f. We then turn the first EdgeConv layer into:

\[
h\left(f_j, {}^{E}f_i\right) = \bar{h}\left(f_j, {}^{E}f_i - f_j\right).
\tag{3}
\]

This replaces the coordinates in the Euclidean space with the scale-invariant feature f, but {}^E f_i is still the feature of the i-th neighbor of point p_j in Euclidean space, not the neighbor of f_j in feature space. Since part of the information in the original Euclidean space has already been lost when generating scale-invariant features, preserving the neighbor relationship in the original Euclidean space reduces further information loss.

Automatic Line Segment Labeling. There is no available labeled line dataset for LiDAR scans, and manually labeling point clouds is difficult. Hence, we build an automatic line labeling pipeline (Fig. 3). Inspired by the homographic adaptation in SuperPoint [11], we perform geometric adaptation on LiDAR scans. First, we train a scale-invariant segmentation model purely on the synthetic data, and apply 2D transformations drawn uniformly from a 20 m range in XOY and 360° in yaw to the LiDAR scans. Then, we use the trained model to predict labels on the perturbed data, aggregate the scan labels from all perturbations, and take the points predicted as belonging to lines in more than 80% of the perturbations as candidate points. To cluster the binary points into lines, we use a region-growing algorithm: the connectivity between points is defined through a 0.5 m KD-tree radius search, and we use the labeled points as seeds, grow to nearby labeled points, and fit lines. Once such line segments are extracted, we continue to refine the segmentation model on the obtained labeled LiDAR scans. We repeat the geometric adaptation 3 times to generate 12,989 automatically labeled LiDAR frames on the KITTI odometry sequences [14].
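For concreteness, a minimal NumPy sketch of the per-point scale-invariant feature of Eq. (1) is given below. It follows the reading of f as the ratio of summed Manhattan to summed Euclidean neighbor distances, with a brute-force kNN for clarity; the exact output shape used by the released network may differ.

```python
import numpy as np

def scale_invariant_features(points, k=20):
    """Per-point scale-invariant feature of Eq. (1): for each point, the ratio
    between the summed L1 (Manhattan) and summed L2 (Euclidean) distances to
    its k nearest Euclidean neighbors. A uniform scaling of the cloud cancels
    out of this ratio. Brute-force kNN for clarity; points has shape (N, 3)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nn_idx = np.argsort(dists, axis=1)[:, 1:k + 1]        # skip the point itself
    offsets = points[:, None, :] - points[nn_idx]         # (N, k, 3)
    l1 = np.abs(offsets).sum(axis=-1).sum(axis=-1)        # sum of Manhattan distances
    l2 = np.linalg.norm(offsets, axis=-1).sum(axis=-1)    # sum of Euclidean distances
    return l1 / np.maximum(l2, 1e-8)                      # (N,) scale-invariant ratio
```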


Fig. 3. Automatic line labeling pipeline. We use geometric adaptation and line fitting to reduce the network prediction noise and improve model accuracy on real LiDAR scans through iterative training.
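A hedged sketch of the vote-aggregation step of this pipeline is shown below. The number of random perturbations (`n_aug`) is an assumption; the 20 m XOY range and the 80% voting threshold come from the description above, and the region-growing line fitting is omitted. `segmentation_model` is a placeholder for the trained scale-invariant segmentation network.

```python
import numpy as np

def geometric_adaptation_labels(points, segmentation_model, n_aug=16,
                                xy_range=20.0, vote_thresh=0.8):
    """Aggregate line predictions over random 2D (XOY + yaw) perturbations,
    as in the auto-labeling pipeline above. `segmentation_model(points)` is
    assumed to return a per-point boolean line mask for an (N, 3) array."""
    votes = np.zeros(len(points))
    for _ in range(n_aug):
        yaw = np.random.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        t = np.array([np.random.uniform(-xy_range, xy_range),
                      np.random.uniform(-xy_range, xy_range), 0.0])
        votes += segmentation_model(points @ R.T + t).astype(np.float64)
    # Points predicted as "line" in more than 80% of the perturbations become
    # candidates; lines are then fitted by region growing (not shown here).
    return votes / n_aug > vote_thresh
```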

3.2 Joint Training of Line Segmentation and Description

Definition of Line Descriptors. Different from the geometric definition, which only requires the two endpoints of a line segment, a descriptor for each line should convey local appearance through all of its member points, since the observed end


Fig. 4. Network architecture. The network uses the EdgeConv [39] module to extract features. The segmentation head and the description head predict the label and descriptor for each point, respectively.

points may vary between frames due to possible occlusions. Therefore, we define the descriptor of a line as the average of the descriptors of all its member points.

Network Architecture. Our network (Fig. 4) consists of three stacked EdgeConv [39] layers for feature encoding and two decoders for line segmentation and description, respectively. Each EdgeConv layer outputs an N × 64 tensor that, after a MaxPooling layer, feeds the 3-layer segmentation and description heads. We use ReLU for activation. The segmentation head turns the feature vector into a tensor of size N × 2 after convolution (N is the number of input points), and then obtains a boolean label for each point through a Softmax layer to predict whether it belongs to a line. The descriptor head outputs a tensor of size N × d and then applies an L2 norm to get a d-dimensional descriptor.

Loss Functions. Our segmentation loss L_seg is a standard cross-entropy loss, and we follow [38] and [5] to build a discriminative loss for the descriptor. In detail, we first use the line segment labels to get the mean descriptor μ of each line segment, and then use L_same for each line to pull point descriptors towards μ. L_diff is proposed to make the descriptors of different lines repel each other. In addition, for a point cloud pair, we calculate the matched loss L_match and the loss between non-matched lines L_mismatch. Each term can be written as follows:

\[
\begin{aligned}
L_{same} &= \frac{1}{N} \sum_{i}^{N} \frac{1}{|K_i|} \sum_{j}^{K_i}
  \left[ \lVert \mu_i - d_j \rVert_1 - \delta_s \right]_{+}^{2}, \\
L_{diff} &= \frac{1}{|C_N^2|} \sum_{\langle i_a, i_b \rangle}^{C_N^2}
  \left[ 2\delta_d - \lVert \mu_{i_a} - \mu_{i_b} \rVert_1 \right]_{+}^{2}, \\
L_{match} &= \frac{1}{N} \sum_{i}^{N}
  \left[ \lVert \mu_i^{A} - \mu_i^{B} \rVert_1 - \delta_s \right]_{+}^{2}, \\
L_{mismatch} &= \frac{1}{|C_N^2|} \sum_{\langle i_a, i_b \rangle}^{C_N^2}
  \left[ 2\delta_d - \lVert \mu_{i_a}^{A} - \mu_{i_b}^{B} \rVert_1 \right]_{+}^{2},
\end{aligned}
\tag{4}
\]


where N is the number of detected lines and C_N^2 stands for all pairs of two lines. i and j are two iterators, over lines and over points on a line, respectively. μ_i is the aforementioned mean descriptor of a line, and d_j is the descriptor of its j-th member point. μ_i^A and μ_i^B are the mean descriptors of a matched line in the two associated point clouds, and δ_s and δ_d are the positive and negative margins. [x]_+ = max(0, x), and ‖·‖_1 denotes the L1 distance. Finally, we use ω = 2 to balance the final loss L as:

\[
L = \omega \cdot L_{seg} + L_{same} + L_{diff} + L_{match} + L_{mismatch}.
\tag{5}
\]

Line-Based Registration. Our network outputs a label and a descriptor for each point. We first extract lines using the steps in Sect. 3.1, and then perform descriptor matching to get line correspondences; the descriptor matching threshold is set to 0.1. The transformation T that registers the source cloud S to the target cloud T is optimized by minimizing the point-to-line distances of all line matching costs ξ_i, i ∈ N:

\[
\xi_i = \sum_{j}^{N_i}
\frac{\left\lVert \left( T \cdot p_j^{S} - p_{i_{e0}}^{T} \right) \times \left( T \cdot p_j^{S} - p_{i_{e1}}^{T} \right) \right\rVert}
     {\left\lVert p_{i_{e0}}^{T} - p_{i_{e1}}^{T} \right\rVert},
\tag{6}
\]

where p_j^S are the line points in the source frame, and p_{i_{e0}}^T and p_{i_{e1}}^T are the endpoints ⟨i_{e0}, i_{e1}⟩ of the target line matched to line i.
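As a concrete illustration of Eq. (6), the sketch below computes the point-to-line residuals for one line correspondence via the cross-product form; summing these residuals over all correspondences and minimizing over T (with, e.g., a nonlinear least-squares solver, which is not shown) yields the registration. The function name and the 4 × 4 transform convention are assumptions for illustration.

```python
import numpy as np

def point_to_line_cost(T, src_line_pts, e0, e1):
    """Point-to-line residuals of Eq. (6) for one line correspondence:
    distance of each transformed source point to the matched target line
    through endpoints e0, e1, computed via the cross-product formula.
    T is a 4x4 homogeneous transform; src_line_pts has shape (N_i, 3)."""
    p = src_line_pts @ T[:3, :3].T + T[:3, 3]            # T * p_j^S
    num = np.linalg.norm(np.cross(p - e0, p - e1), axis=-1)
    den = np.linalg.norm(e0 - e1)
    return num / den                                      # one residual per point
```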

4 Experiments

4.1 Network Training

Starting from our generated synthetic data, we first train the line segmentation network on the synthetic point clouds for 50 epochs until convergence. Then, to generate sufficient and qualified labeled real-world scans with the auto-labeling method, we obtain 12,989 LiDAR frames and iteratively train for 100 epochs to refine the auto-labeling results. Finally, we train the whole line segmentation and description network for 120 epochs to obtain the final model applicable to real-world scans. We train on sequences 00–07 of the KITTI odometry dataset [14], with sequences 06–07 as the validation set and sequences 00–05 as the training set. For each LiDAR frame, we voxelize the point cloud with a 0.25 m voxel size. We sample 20,000 points for evaluation and 15,000 points for training, since the kNN in EdgeConv has O(N²) space complexity and consumes a large amount of memory during training. We calculate point-to-line distances following Eq. (6) on the line segments from Sect. 3.1; line pairs whose mean distance is within 0.2 m are selected as line correspondences for the descriptor loss. We implement our network in TensorFlow [1] with the Adam [20] optimizer. The learning rate is set to 0.001 and decreases by 50% every 15 epochs. The whole network is trained on 8 NVIDIA RTX 3090 GPUs.

4.2 Point Cloud Registration Test

Benchmarking. We use sequences 08–10 from the KITTI odometry dataset [14] to test the ability of our network to extract line features and use them for point cloud registration. The preprocessing steps remain the same as in our data preparation, and we compare with traditional and learning-based methods for global search registration. The traditional methods, ICP [6], RANSAC [13], and Fast Global Registration (FGR) [45], are all implemented in Open3D [46]. Specifically, RANSAC and FGR use FPFH [33] features extracted from point clouds downsampled with a 0.25 m voxel grid, and the maximum number of iterations is set to 4 × 10^6. The learning-based methods HRegNet [22] and Deep Global Registration (DGR) [9] use the ground-truth pose to calculate the loss and predict the transformation directly through the network. PointDSC [4] learns to prune outlier correspondences. D3Feat [5] and SpinNet [3] extract salient features from point clouds. Our line-based registration extracts 18 line segments with 350 points per frame on average. For fair comparison, the number of keypoints in the learning-feature-based methods is also set to 350, while other parameters remain unchanged.

Metrics. We use both the Relative Translation Error (RTE) and Relative Rotation Error (RRE) [22] to measure the registration accuracy. Additionally, as a reference for evaluating the success rate of global search registration methods, we treat a calculated transformation with relative error w.r.t. the ground truth smaller than 2 m and 5° as a successful registration attempt.

Table 1. Registration performance on the KITTI dataset. Our line segmentation and description method is highly competitive with the SOTA point-based approaches on the success rate, and both RTE and RRE can be refined with a subsequent coarse-to-fine ICP strategy.

Method        | RTE (m) Mean | RTE (m) Std | RRE (deg) Mean | RRE (deg) Std | Recall
ICP [6]       | 0.417        | 0.462       | 0.707          | 0.741         | 11.30%
FGR [45]      | 0.685        | 0.514       | 1.080          | 0.921         | 81.17%
RANSAC [13]   | 0.214        | 0.193       | 0.924          | 0.907         | 52.45%
HRegNet [22]  | 0.299        | 0.380       | 0.712          | 0.643         | 75.93%
DGR [9]       | 0.164        | 0.385       | 0.226          | 0.569         | 41.41%
PointDSC [4]  | 0.187        | 0.225       | 0.306          | 0.297         | 44.98%
SpinNet [3]   | 0.183        | 0.142       | 1.267          | 0.761         | 93.98%
D3Feat [5]    | 0.088        | 0.043       | 0.343          | 0.242         | 98.90%
SuperLine3D   | 0.087        | 0.104       | 0.591          | 0.444         | 97.68%
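For reference, a minimal sketch of the error metrics and the success criterion used above (RTE < 2 m, RRE < 5°) is given below; the rotation error is computed as the geodesic angle between the predicted and ground-truth rotations. Function and variable names are illustrative, not taken from any released evaluation code.

```python
import numpy as np

def registration_errors(T_pred, T_gt):
    """Relative translation / rotation error between a predicted and a
    ground-truth 4x4 transform, plus the success test used above
    (RTE < 2 m and RRE < 5 deg). A sketch of the evaluation protocol."""
    rte = np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3])
    R_delta = T_pred[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    rre = np.degrees(np.arccos(cos_angle))
    return rte, rre, (rte < 2.0 and rre < 5.0)
```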

Results and Discussions. Table 1 shows the registration performances. Under random rotation perturbation, the recall of ICP is only 11.3%. The FGR and


Fig. 5. Registration recall with different RRE and RTE thresholds on the KITTI dataset. The registration success rate of our line-based approach (blue) is close to the SOTA point-based approach D3Feat (orange) under different criteria. (Color figure online)

RANSAC methods based on FPFH features have higher recall but larger errors. The learning-based end-to-end methods HRegNet and DGR also drop in recall and accuracy when dealing with heavily perturbed scenarios. PointDSC relies on a feature model whose features are not fully rotation invariant, so its performance also deteriorates. Figure 5 shows the registration recall under different error thresholds. SpinNet and D3Feat perform better, with recalls of over 90%. Our line-based registration achieves performance comparable to point features, with a similar mean translation error and 1.22% lower recall than D3Feat. Figure 7 shows the visualization results on the KITTI test sequences: our method successfully registers point clouds under arbitrary rotation perturbations. More results are given in the supplementary material.

Fig. 6. Registration performance with different RANSAC iterations. There are many mismatches in point feature correspondences, which leads to unstable results when the number of iterations is small.

Ablation on RANSAC Iterations. Point-feature-based registration requires RANSAC to remove outliers and calculate the correct transformation. In Table 1 and Fig. 5, the maximum number of RANSAC iterations for the D3Feat and SpinNet


experiments is set to 5 × 10^4. In contrast, our line-based registration does not rely on RANSAC to filter erroneous matches: to perform outlier removal during transformation estimation, we calculate the line-to-line distances of line correspondences after the initial alignment, remove the correspondences with a mean distance greater than 1 m, and recompute the transformation. Figure 6 shows the point cloud registration performance under different numbers of RANSAC iterations (the x-axis is logarithmic). Our method does not use RANSAC for outlier rejection, so we show it as a blue dashed line as a reference when comparing with methods requiring RANSAC post-processing. The star near the y-axis represents the original result, and the star with an x-coordinate of 1 is the result after outlier removal. Both D3Feat and SpinNet cannot obtain an accurate transformation without RANSAC until the maximum number of iterations exceeds 1,000.

Fig. 7. Qualitative visualization on KITTI test sequence. Top: line associations between two LiDAR frames, Bottom: registration results of two frames.

4.3 Line Segmentation Evaluation

To evaluate the scale invariance of our base segmentation model, we train PointNet [30], PointNet++ [31], and vanilla DGCNN [39] on the synthetic dataset. The training set includes 4,000 synthetic point clouds normalized within [0, 1]. We test the trained models with point clouds scaled from 0.1 to 3.0. Figure 8 shows the accuracy and mIoU of the network predictions. Methods without scale adaptation suffer a performance decrease when the scale changes. Vanilla DGCNN achieves the best accuracy and mIoU under small scale disturbances (0.8 to 1.6), while our scale-invariant approach is stable under arbitrary scales. When the scale is fixed, using the scale-invariant approach decreases accuracy, so we only use it for training on synthetic data; in the joint training of segmentation and description, we use vanilla DGCNN instead.


Fig. 8. Accuracy and mIOU of network predictions under different scale disturbances. Our scale-invariant approach is stable under arbitrary scales, but is a little worse than the vanilla DGCNN in the original scale.

Fig. 9. Qualitative visualization of line segmentation between Lu et al. [24] (left) and ours (right). Our method segments most of the poles and building edges.

Figure 9 shows a qualitative comparison of our line segmentation with the only open-source 3D line detection method [24] we found. Our method segments most of the lines, while the open-source method extracts LiDAR scan lines on the ground and cannot detect the poles.

4.4 Generalization on Unseen Dataset

To compare the generalization of learned feature-based models, we test our method against state-of-the-art point feature methods on the unseen Apollo SouthBay dataset [23] using the models trained on the KITTI dataset. We uniformly choose half of the point clouds from the SanJoseDowntown sequence as source frames, select target frames every 5 frames, and add random yaw-axis rotation perturbations to the source frames, obtaining 8,296 point cloud pairs for evaluation. The point cloud preprocessing is the same as for the KITTI dataset.


Fig. 10. Qualitative visualization on the Apollo SouthBay dataset, SanJoseDowntown sequence. The majority of the line correspondences are stable poles, which helps reduce the translation error by a large margin.

Table 2 shows the point cloud registration results. On the unseen dataset, all methods show a drop in recall. D3Feat has the best overall performance, while the mean translation error of our method is the smallest. Figure 10 shows a qualitative visualization on the test data; there are more poles in this sequence, which is beneficial to our line-based registration.

Table 2. Test on the unseen Apollo SouthBay dataset, SanJoseDowntown sequence.

Method        | RTE (m) Mean | RTE (m) Std | RRE (deg) Mean | RRE (deg) Std | Recall
SpinNet [3]   | 0.199        | 0.203       | 1.207          | 0.874         | 75.66%
D3Feat [5]    | 0.079        | 0.046       | 0.206          | 0.144         | 95.94%
SuperLine3D   | 0.045        | 0.107       | 0.262          | 0.402         | 93.84%

4.5 Ablation Study

Skip Encoding. The receptive field is directly related to the number of encoded features in the EdgeConv module. When k is greater than 20, we can only set the batch size to 1 due to the enormous space complexity of EdgeConv, so the receptive field cannot be enlarged simply by increasing k. To this end, we utilize skip encoding: we gather S × k nearest-neighbor features each time and select k features with stride S for encoding. In this way, the receptive field increases S times without consuming too much memory (gathering S × k nearest-neighbor features only slightly increases memory usage). In the experiments, we test strides of 1 (nearest-neighbor encoding), 2, 4, and 6. As shown in Table 3, a stride of 4 gives the best performance, since local features cannot be well encoded when the stride is too large.

Descriptor Dimension. The descriptor dimension is one of the key factors for feature matching performance, and the matching performance is poor


when the dimension is low. Our network extracts dense descriptors: each point has a descriptor of d floating-point numbers, which takes up a lot of storage space when the dimension is too large. Compared with the 16-dimensional descriptor, the 32-dimensional one brings a clear improvement in recall, while the 64-dimensional descriptor adds a small further improvement; increasing the dimension to 128 only brings a slightly smaller rotation error variance. Considering the overall performance, we choose the 64-dimensional implementation.

Table 3. Ablation study on stride and descriptor dimension.

Stride               | RTE (m) Mean | RTE (m) Std | RRE (deg) Mean | RRE (deg) Std | Recall
1                    | 0.092        | 0.134       | 0.594          | 0.449         | 96.51%
2                    | 0.088        | 0.116       | 0.595          | 0.465         | 96.70%
4                    | 0.087        | 0.104       | 0.591          | 0.444         | 97.68%
6                    | 0.134        | 0.216       | 0.783          | 0.757         | 64.23%

Descriptor dimension | RTE (m) Mean | RTE (m) Std | RRE (deg) Mean | RRE (deg) Std | Recall
16                   | 0.115        | 0.175       | 0.627          | 0.510         | 87.56%
32                   | 0.095        | 0.132       | 0.597          | 0.462         | 95.28%
64                   | 0.087        | 0.104       | 0.591          | 0.444         | 97.68%
128                  | 0.090        | 0.120       | 0.593          | 0.441         | 96.70%

5 Conclusions

This paper proposes the first learning-based 3D line feature segmentation and description method for LiDAR scans, which achieves performance highly competitive with point-feature-based methods in point cloud registration. In the future, we will explore the use of our learned line features in SLAM problems such as mapping, map compression, and relocalization. We will also optimize the network structure and reduce training resource consumption.

Acknowledgments. This work is supported by the National Key R&D Program of China (Grant No. 2018AAA0101503) and the Alibaba-Zhejiang University Joint Institute of Frontier Technologies.

References 1. Abadi, M., et al.: {TensorFlow}: A system for {Large-Scale} machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016) 2. Akinlar, C., Topal, C.: EDLines: a real-time line segment detector with a false detection control. Pattern Recogn. Lett. 32(13), 1633–1642 (2011) 3. Ao, S., Hu, Q., Yang, B., Markham, A., Guo, Y.: SpinNet: learning a general surface descriptor for 3d point cloud registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11753–11762 (2021)


4. Bai, X., et al.: PointDSC: robust point cloud registration using deep spatial consistency. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15859–15869 (2021) 5. Bai, X., Luo, Z., Zhou, L., Fu, H., Quan, L., Tai, C.L.: D3feat: joint learning of dense detection and description of 3d local features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6359– 6367 (2020) 6. Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, pp. 586–606. SPIE (1992) 7. Biber, P., Straßer, W.: The normal distributions transform: a new approach to laser scan matching. In: Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No. 03CH37453), vol. 3, pp. 2743–2748. IEEE (2003) 8. Chen, G., Liu, Y., Dong, J., Zhang, L., Liu, H., Zhang, B., Knoll, A.: Efficient and robust line-based registration algorithm for robot perception under largescale structural scenes. In: 2021 6th IEEE International Conference on Advanced Robotics and Mechatronics (ICARM), pp. 54–62. IEEE (2021) 9. Choy, C., Dong, W., Koltun, V.: Deep global registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2514– 2523 (2020) 10. Choy, C., Park, J., Koltun, V.: Fully convolutional geometric features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8958– 8966 (2019) 11. DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest point detection and description. In: Proceedings of the IEEE Cnference on Computer Vision and Pattern Recognition Workshops, pp. 224–236 (2018) 12. Fang, F., Ma, X., Dai, X.: A multi-sensor fusion slam approach for mobile robots. In: IEEE International Conference Mechatronics and Automation, 2005. vol. 4, pp. 1837–1841. IEEE (2005) 13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 14. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the Kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012) 15. Gu, X., Wang, X., Guo, Y.: A review of research on point cloud registration methods. IOP Conf. Ser. Mater. Sci. Eng. 782(2), 022070 (2020) 16. Huang, S., Qin, F., Xiong, P., Ding, N., He, Y., Liu, X.: TP-LSD: tri-points based line segment detector. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12372, pp. 770–785. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58583-9 46 17. Huang, X., Mei, G., Zhang, J.: Feature-metric registration: a fast semi-supervised approach for robust point cloud registration without correspondences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11366–11374 (2020) 18. Huang, X., Mei, G., Zhang, J., Abbas, R.: A comprehensive survey on point cloud registration. arXiv preprint arXiv:2103.02690 (2021) 19. Khan, M.U., Zaidi, S.A.A., Ishtiaq, A., Bukhari, S.U.R., Samer, S., Farman, A.: A comparative survey of lidar-slam and lidar based sensor technologies. In: 2021 Mohammad Ali Jinnah University International Conference on Computing (MAJICC), pp. 1–8. IEEE (2021)


20. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 21. Li, J., Lee, G.H.: USIP: unsupervised stable interest point detection from 3d point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 361–370 (2019) 22. Lu, F., et al.: HRegNet: a hierarchical network for large-scale outdoor lidar point cloud registration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16014–16023 (2021) 23. Lu, W., Zhou, Y., Wan, G., Hou, S., Song, S.: L3-net: towards learning based lidar localization for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6389–6398 (2019) 24. Lu, X., Liu, Y., Li, K.: Fast 3d line segment detection from unorganized point cloud. arXiv preprint arXiv:1901.02532 (2019) 25. Pais, G.D., Ramalingam, S., Govindu, V.M., Nascimento, J.C., Chellappa, R., Miraldo, P.: 3dregnet: a deep neural network for 3d point registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7193–7203 (2020) 26. Pautrat, R., Lin, J.T., Larsson, V., Oswald, M.R., Pollefeys, M.: Sold2: selfsupervised occlusion-aware line description and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11368– 11378 (2021) 27. Perez-Gonzalez, J., Luna-Madrigal, F., Pi˜ na-Ramirez, O.: Deep learning point cloud registration based on distance features. IEEE Latin Am. Trans. 17(12), 2053–2060 (2019) 28. Pomerleau, F., Colas, F., Siegwart, R.: A review of point cloud registration algorithms for mobile robotics. Found. Trends Robot. 4(1), 1–104 (2015) 29. Pumarola, A., Vakhitov, A., Agudo, A., Sanfeliu, A., Moreno-Noguer, F.: Pl-slam: real-time monocular visual slam with points and lines. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4503–4508. IEEE (2017) 30. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017) 31. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 32. Rozenberszki, D., Majdik, A.L.: Lol: lidar-only odometry and localization in 3d point cloud maps. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4379–4385. IEEE (2020) 33. Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3d registration. In: 2009 IEEE International Conference on Robotics and Automation, pp. 3212–3217. IEEE (2009) 34. Sarode, V., Li, X., Goforth, H., Aoki, Y., Srivatsan, R.A., Lucey, S., Choset, H.: PCRNet: point cloud registration network using pointnet encoding. arXiv preprint arXiv:1908.07906 (2019) 35. Shan, T., Englot, B.: Lego-loam: lightweight and ground-optimized lidar odometry and mapping on variable terrain. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4758–4765. IEEE (2018) 36. Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.: KPConv: flexible and deformable convolution for point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6411–6420 (2019)


37. Von Gioi, R.G., Jakubowicz, J., Morel, J.M., Randall, G.: LSD: a line segment detector. Image Process. Line 2, 35–55 (2012) 38. Wang, X., Liu, S., Shen, X., Shen, C., Jia, J.: Associatively segmenting instances and semantics in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4096–4105 (2019) 39. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (tog) 38(5), 1–12 (2019) 40. Zhang, H., Luo, Y., Qin, F., He, Y., Liu, X.: ELSD: efficient line segment detector and descriptor. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2969–2978 (2021) 41. Zhang, J., Singh, S.: Loam: Lidar odometry and mapping in real-time. In: Robotics: Science and Systems. Berkeley, CA (2014) 42. Zhang, L., Koch, R.: An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency. J. Vis. Commun. Image Representation 24(7), 794–805 (2013) 43. Zhang, Z., Dai, Y., Sun, J.: Deep learning based point cloud registration: an overview. Virtual Reality Intell. Hardware 2(3), 222–246 (2020) 44. Zhong, Y.: Intrinsic shape signatures: a shape descriptor for 3d object recognition. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 689–696. IEEE (2009) 45. Zhou, Q.-Y., Park, J., Koltun, V.: Fast global registration. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 766–782. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6 47 46. Zhou, Q.Y., Park, J., Koltun, V.: Open3d: a modern library for 3d data processing. arXiv preprint arXiv:1801.09847 (2018)

Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li(B), Hanzi Mao, Ross Girshick, and Kaiming He
Facebook AI Research, Menlo Park, USA
[email protected]

Abstract. We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 APbox on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available (https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).

1 Introduction

Modern object detectors in general consist of a backbone feature extractor that is agnostic to the detection task and a set of necks and heads that incorporate detection-specific prior knowledge. Common components in the necks/heads may include Region-of-Interest (RoI) operations [18,23,24], Region Proposal Networks (RPN) or anchors [45], Feature Pyramid Networks (FPN) [34], etc. If the design of the task-specific necks/heads is decoupled from the design of the backbone, they may evolve in parallel. Empirically, object detection research has benefited from the largely independent exploration of general-purpose backbones [25,27,46,47] and detection-specific modules. For a long while, these backbones have been multi-scale, hierarchical architectures due to the de facto design of convolutional networks (ConvNet) [29], which has heavily influenced the neck/head design for detecting objects at multiple scales (e.g., FPN).


Fig. 1. A typical hierarchical-backbone detector (left) vs. our plain-backbone detector (right). Traditional hierarchical backbones can be naturally adapted for multi-scale detection, e.g., using FPN. Instead, we explore building a simple pyramid from only the last, large-stride (16) feature map of a plain backbone.

Over the past year, Vision Transformers (ViT) [12] have been established as a powerful backbone for visual recognition. Unlike typical ConvNets, the original ViT is a plain, non-hierarchical architecture that maintains a single-scale feature map throughout. Its “minimalist” pursuit is met by challenges when applied to object detection—e.g., How can we address multi-scale objects in a downstream task with a plain backbone from upstream pre-training? Is a plain ViT too inefficient to use with high-resolution detection images? One solution, which abandons this pursuit, is to re-introduce hierarchical designs into the backbone. This solution, e.g., Swin Transformers [39] and related works [15,26,31,52], can inherit the ConvNet-based detector design and has shown successful results. In this work, we pursue a different direction: we explore object detectors that use only plain, non-hierarchical backbones.1 If this direction is successful, it will enable the use of original ViT backbones for object detection; this will decouple the pre-training design from the fine-tuning demands, maintaining the independence of upstream vs.downstream tasks, as has been the case for ConvNet-based research. This direction also in part follows the ViT philosophy of “fewer inductive biases” [12] in the pursuit of universal features. As the non-local self-attention computation [51] can learn translation-equivariant features [12], they may also learn scale-equivariant features from certain forms of supervised or self-supervised pre-training. In our study, we do not aim to develop new components; instead, we make minimal adaptations that are sufficient to overcome the aforementioned challenges. In particular, our detector builds a simple feature pyramid from only the last feature map of a plain ViT backbone (Fig. 1). This abandons the FPN design [34] and waives the requirement of a hierarchical backbone. To efficiently extract features from high-resolution images, our detector uses simple non-overlapping window attention (without “shifting”, unlike [39]). A small number of crosswindow blocks (e.g., 4), which could be global attention [51] or convolutions, are used to propagate information. These adaptations are made only during fine-tuning and do not alter pre-training. Our simple design turns out to achieve surprising results. We find that the FPN design is not necessary in the case of a plain ViT backbone and its benefit 1

In this paper, “backbone” refers to architectural components that can be inherited from pre-training and “plain” refers to the non-hierarchical, single-scale property.


can be effectively gained by a simple pyramid built from a large-stride (16), single-scale map. We also find that window attention is sufficient as long as information is well propagated across windows in a small number of layers. More surprisingly, under some circumstances, our plain-backbone detector, named ViTDet, can compete with the leading hierarchical-backbone detectors (e.g., Swin [39], MViT [15,31]). With Masked Autoencoder (MAE) [22] pretraining, our plain-backbone detector can outperform the hierarchical counterparts that are pre-trained on ImageNet-1K/21K [10] with supervision (Fig. 3). The gains are more prominent for larger model sizes. The competitiveness of our detector is observed under different object detector frameworks, including Mask R-CNN [23], Cascade Mask R-CNN [3], and their enhancements. We report 61.3 APbox on the COCO dataset [36] with a plain ViT-Huge backbone, using only ImageNet-1K pre-training with no labels. We also demonstrate competitive results on the long-tailed LVIS detection dataset [21]. While these strong results may be in part due to the effectiveness of MAE pre-training, our study demonstrates that plain-backbone detectors can be promising, challenging the entrenched position of hierarchical backbones for object detection. Beyond these results, our methodology maintains the philosophy of decoupling the detector-specific designs from the task-agnostic backbone. This philosophy is in contrast to the trend of redesigning Transformer backbones to support multi-scale hierarchies [15,26,39,52]. In our case, the detection-specific prior knowledge is introduced only during fine-tuning, without needing to tailor the backbone design a priori in pre-training. This makes our detector compatible with ViT developments along various directions that are not necessarily limited by the hierarchical constraint, e.g., block designs [49,50], self-supervised learning [1,22], and scaling [53]. We hope our study will inspire future research on plain-backbone object detection.2

2 Related Work

Object Detector Backbones. Pioneered by the work of R-CNN [19], object detection and many other vision tasks adopt a pre-training + fine-tuning paradigm: a general-purpose, task-agnostic backbone is pre-trained with supervised or self-supervised training, whose structure is later modified and adapted to the downstream tasks. The dominant backbones in computer vision have been ConvNets [29] of various forms, e.g., [25,27,46,47]. Earlier neural network detectors, e.g., [18,24,44,45], were based on a singlescale feature map when originally presented. While they use ConvNet backbones that are by default hierarchical, in principle, they are applicable on any plain backbone. SSD [37] is among the first works that leverage the hierarchical nature of the ConvNet backbones (e.g., the last two stages of a VGG net [46]). FPN [34] pushes this direction further by using all stages of a hierarchical backbone, approached by lateral and top-down connections. The FPN design is widely 2

This work is an extension of a preliminary version [32] that was unpublished and not submitted for peer review.


used in object detection methods. More recently, works including Trident Networks [30] and YOLOF [6] have revisited single-scale feature maps, but unlike our work they focus on a single-scale taken from a hierarchical backbone. ViT [12] is a powerful alternative to standard ConvNets for image classification. The original ViT is a plain, non-hierarchical architecture. Various hierarchical Transformers have been presented, e.g., Swin [39], MViT [15,31], PVT [52], and PiT [26]. These methods inherit some designs from ConvNets, including the hierarchical structure and the translation-equivariant priors (e.g., convolutions, pooling, sliding windows). As a result, it is relatively straightforward to replace a ConvNet with these backbones for object detection. Plain-Backbone Detectors. The success of ViT has inspired people to push the frontier of plain backbones for object detection. Most recently, UViT [8] is presented as a single-scale Transformer for object detection. UViT studies the network width, depth, and input resolution of plain ViT backbones under object detection metrics. A progressive window attention strategy is proposed to address the high-resolution inputs. Unlike UViT that modifies the architecture during pre-training, our study focuses on the original ViT architecture without a priori specification for detection. By maintaining the task-agnostic nature of the backbone, our approach supports a wide range of available ViT backbones as well as their improvements in the future. Our method decouples the backbone design from the detection task, which is a key motivation of pursuing plain backbones. UViT uses single-scale feature maps for the detector heads, while our method builds a simple pyramid on the single-scale backbone. In the context of our study, it is an unnecessary constraint for the entire detector to be single-scale. Note the full UViT detector has several forms of multi-scale priors too (e.g., RPN [45] and RoIAlign [23]) as it is based on Cascade Mask R-CNN [3]. In our study, we focus on leveraging pre-trained plain backbones and we do not constrain the detector neck/head design. Object Detection Methodologies. Object detection is a flourishing research area that has embraced methodologies of distinct properties—e.g., two-stage [18, 19,24,45] vs.one-stage [35,37,44], anchor-based [45] vs.anchor-free [13,28,48], and region-based [18,19,24,45] vs.query-based (DETR) [4]. Research on different methodologies has been continuously advancing understandings of the object detection problem. Our study suggests that the topic of “plain vs.hierarchical” backbones is worth exploring and may bring in new insights.

3 Method

Our goal is to remove the hierarchical constraint on the backbone and to enable explorations of plain-backbone object detection. To this end, we aim for minimal modifications to adapt a plain backbone to the object detection task only during fine-tuning time. After these adaptations, in principle one can apply any detector heads, for which we opt to use Mask R-CNN [23] and its extensions. We do not aim to develop new components; instead, we focus on what new insights can be drawn in our exploration.


Fig. 2. Building a feature pyramid on a plain backbone. (a) FPN-like: to mimic a hierarchical backbone, the plain backbone is artificially divided into multiple stages. (b) FPN-like, but using only the last feature map without stage division. (c) Our simple feature pyramid without FPN. In all three cases, strided convolutions/deconvolutions are used whenever the scale changes.

Simple Feature Pyramid. FPN [34] is a common solution for building an in-network pyramid for object detection. If the backbone is hierarchical, the motivation of FPN is to combine the higher-resolution features from earlier stages and the stronger features from later stages. This is realized in FPN by top-down and lateral connections [34] (Fig. 1 left). If the backbone is non-hierarchical, the foundation of the FPN motivation is lost, as all the feature maps in the backbone are of the same resolution. In our scenario, we simply use only the last feature map from the backbone, which should have the strongest features. On this map, we apply a set of convolutions or deconvolutions in parallel to produce multi-scale feature maps. Specifically, with the default ViT feature map of a scale of 1/16 (stride = 16 [12]), we produce feature maps of scales {1/32, 1/16, 1/8, 1/4} using convolutions of strides {2, 1, 1/2, 1/4}, where a fractional stride indicates a deconvolution. We refer to this as a "simple feature pyramid" (Fig. 1 right). The strategy of building multi-scale feature maps from a single map is related to that of SSD [37]. However, our scenario involves upsampling from a deep, low-resolution feature map, unlike [37], which taps into shallower feature maps. In hierarchical backbones, upsampling is often aided by lateral connections [34]; in plain ViT backbones, we empirically find this is not necessary (Sec. 4) and simple deconvolutions are sufficient. We hypothesize that this is because ViT can rely on positional embeddings [51] for encoding locations and also because the high-dimensional ViT patch embeddings do not necessarily discard information.3 We will compare with two FPN variants that are also built on a plain backbone (Fig. 2). In the first variant, the backbone is artificially divided into multiple stages to mimic the stages of a hierarchical backbone, with lateral and top-down connections applied (Fig. 2(a)) [14]. The second variant is like the first one, but uses only the last map instead of the divided stages (Fig. 2(b)). We show that these FPN variants are not necessary (Sec. 4).4

With a patch size of 16 × 16 and 3 colors, a hidden dimension ≥768 (ViT-B and larger) can preserve all information of a patch if necessary. From a broader perspective, the spirit of FPN [34] is “to build a feature pyramid inside a network”. Our simple feature pyramid follows this spirit. In the context of this paper, the term of “FPN” refers to the specific architecture design in [34].
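The following PyTorch sketch illustrates the simple feature pyramid described above: from the single stride-16 ViT map, strided convolutions and deconvolutions produce the {1/32, 1/16, 1/8, 1/4} maps in parallel. The channel widths, the absence of norm layers, and the module name are simplifications for illustration and are not the exact ViTDet implementation.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Sketch of a simple feature pyramid on a plain ViT backbone: parallel
    strided conv / deconv layers applied to the single 1/16-scale feature map."""
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.p6 = nn.Conv2d(dim, out_dim, kernel_size=2, stride=2)           # 1/32 scale
        self.p5 = nn.Conv2d(dim, out_dim, kernel_size=1)                     # 1/16 scale
        self.p4 = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)  # 1/8 scale
        self.p3 = nn.Sequential(                                             # 1/4 scale
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_dim, kernel_size=2, stride=2),
        )

    def forward(self, x):  # x: (B, dim, H/16, W/16), the last ViT feature map
        return [self.p3(x), self.p4(x), self.p5(x), self.p6(x)]
```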


Backbone Adaptation. Object detectors benefit from high-resolution input images, but computing global self-attention throughout the backbone is prohibitive in memory and is slow. In this study, we focus on the scenario where the pre-trained backbone performs global self-attention, which is then adapted to higher-resolution inputs during fine-tuning. This is in contrast to the recent methods that modify the attention computation directly with backbone pretraining (e.g., [15,39]). Our scenario enables us to use the original ViT backbone for detection, without redesigning pre-training architectures. We explore using window attention [51] with a few cross-window blocks. During fine-tuning, given a high-resolution feature map, we divide it into regular non-overlapping windows.5 Self-attention is computed within each window. This is referred to as “restricted ” self-attention in the original Transformer [51]. Unlike Swin, we do not “shift” [39] the windows across layers. To allow information propagation, we use a very few (by default, 4) blocks that can go across windows. We evenly split a pre-trained backbone into 4 subsets of blocks (e.g., 6 in each subset for the 24-block ViT-L). We apply a propagation strategy in the last block of each subset. We study these two strategies: (i) Global propagation. We perform global self-attention in the last block of each subset. As the number of global blocks is small, the memory and computation cost is feasible. This is similar to the hybrid window attention in [31] that was used jointly with FPN. (ii) Convolutional propagation. As an alternative, we add an extra convolutional block after each subset. A convolutional block is a residual block [25] that consists of one or more convolutions and an identity shortcut. The last layer in this block is initialized as zero, such that the initial status of the block is an identity [20]. Initializing a block as identity allows us to insert it into any place in a pre-trained backbone without breaking the initial status of the backbone. Our backbone adaptation is simple and makes detection fine-tuning compatible with global self-attention pre-training. As stated, it is not necessary to redesign the pre-training architectures. Discussion. Object detectors contain components that can be task agnostic, such as the backbone, and other components that are task-specific, such as RoI heads. This model decomposition enables the task-agnostic components to be pre-trained using non-detection data (e.g., ImageNet), which may provide an advantage since detection training data is relatively scarce. Under this perspective, it becomes reasonable to pursue a backbone that involves fewer inductive biases, since the backbone may be trained effectively using large-scale data and/or self-supervision. In contrast, the detection taskspecific components have relatively little data available and may still benefit from additional inductive biases. While pursuing detection heads with fewer inductive biases is an active area of work, leading methods like DETR [4] are challenging to train and still benefit from detection-specific prior knowledge [56]. 5

We set the window size as the pre-training feature map size by default (14 × 14 [12]).
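A minimal sketch of the two fine-tuning adaptations is given below: non-overlapping window partitioning for window attention, and a zero-initialized residual convolutional block for cross-window propagation. Shapes, the GELU choice, and the omission of padding are assumptions for illustration; the paper's exact block design is described in its appendix.

```python
import torch
import torch.nn as nn

def window_partition(x, win=14):
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows
    so self-attention can be computed per window during fine-tuning.
    H and W are assumed to be divisible by `win` (padding omitted)."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

class ConvPropagationBlock(nn.Module):
    """Residual conv block for cross-window propagation. Its last layer is
    zero-initialized so the block starts as an identity and can be inserted
    into a pre-trained backbone without disturbing its initial behavior."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv2.weight)
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x):  # x: (B, C, H, W)
        return x + self.conv2(self.act(self.conv1(x)))
```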


Discussion. Object detectors contain components that can be task-agnostic, such as the backbone, and other components that are task-specific, such as RoI heads. This model decomposition enables the task-agnostic components to be pre-trained using non-detection data (e.g., ImageNet), which may provide an advantage since detection training data is relatively scarce. Under this perspective, it becomes reasonable to pursue a backbone that involves fewer inductive biases, since the backbone may be trained effectively using large-scale data and/or self-supervision. In contrast, the detection task-specific components have relatively little data available and may still benefit from additional inductive biases. While pursuing detection heads with fewer inductive biases is an active area of work, leading methods like DETR [4] are challenging to train and still benefit from detection-specific prior knowledge [56].

Driven by these observations, our work follows the spirit of the original plain ViT paper with respect to the detector's backbone. While the ViT paper's discussion [12] focused on reducing inductive biases on translation equivariance, in our case it is about having fewer or even no inductive biases on scale equivariance in the backbone. We hypothesize that the way for a plain backbone to achieve scale equivariance is to learn the prior knowledge from data, analogous to how it learns translation equivariance and locality without convolutions [12]. Our goal is to demonstrate the feasibility of this approach. Thus we choose to implement our method with standard detection-specific components (i.e., Mask R-CNN and its extensions). Exploring even fewer inductive biases in the detection heads is an open and interesting direction for future work. We hope it can benefit from and build on our work here.

Implementation. We use the vanilla ViT-B, ViT-L, and ViT-H [12] as the pre-training backbones. We set the patch size as 16, and thus the feature map scale is 1/16, i.e., stride = 16. Our detector heads follow Mask R-CNN [23] or Cascade Mask R-CNN [3], with architectural details described in the appendix. The input image is 1024 × 1024, augmented with large-scale jittering [17] during training. Due to this heavy regularization, we fine-tune for up to 100 epochs on COCO. We use the AdamW optimizer [40] and search for optimal hyper-parameters using a baseline version. More details are in the appendix.
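The recipe above can be summarized as a configuration sketch; values not stated in the text (e.g., the learning rate, weight decay, and jitter range) are placeholders rather than the authors' settings.

```python
# Hedged summary of the fine-tuning recipe described above; values marked with *
# are placeholders/assumptions, not settings quoted from the paper.
finetune_cfg = dict(
    backbone="ViT-B/L/H, patch size 16 (feature stride 1/16), MAE pre-trained",
    detector="Mask R-CNN or Cascade Mask R-CNN",
    image_size=(1024, 1024),
    augmentation="large-scale jittering",   # *scale range not specified here
    max_epochs=100,                         # "up to 100 epochs" on COCO
    optimizer=dict(type="AdamW", lr=1e-4, weight_decay=0.1),  # *lr/wd assumed
)
```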

4 Experiments

4.1 Ablation Study and Analysis

We perform ablation experiments on the COCO dataset [36]. We train on the train2017 split and evaluate on the val2017 split. We report results on bounding-box object detection (APbox) and instance segmentation (APmask).

By default, we use the simple feature pyramid and global propagation described in Sec. 3. We use 4 propagation blocks, evenly placed in the backbone. We initialize the backbone with MAE [22] pre-trained on IN-1K without labels. We ablate these defaults and discuss our main observations as follows.

A Simple Feature Pyramid is Sufficient. In Table 1 we compare the feature pyramid building strategies illustrated in Fig. 2. We study a baseline with no feature pyramid: both the RPN and RoI heads are applied on the backbone's final, single-scale (1/16) feature map. This case is similar to the original Faster R-CNN [45] before FPN was proposed. All feature pyramid variants (Table 1 a-c) are substantially better than this baseline, increasing AP by up to 3.4 points. We note that using a single-scale feature map does not mean the detector is single-scale: the RPN head has multi-scale anchors and the RoI heads operate on regions of multiple scales.

Changing the stride affects the scale distribution and presents a different accuracy shift for objects of different scales. This topic is beyond the scope of this study. For simplicity, we use the same patch size of 16 for all of ViT-B, L, H (see the appendix).


Table 1. Ablation on feature pyramid design with plain ViT backbones, using Mask R-CNN evaluated on COCO. The backbone is ViT-B (left) and ViT-L (right). The entries (a-c) correspond to Fig. 2 (a-c), compared to a baseline without any pyramid. Both FPN and our simple pyramid are substantially better than the baseline, while our simple pyramid is sufficient.

Pyramid design              | ViT-B APbox  | ViT-B APmask | ViT-L APbox  | ViT-L APmask
No feature pyramid          | 47.8         | 42.5         | 51.2         | 45.4
(a) FPN, 4-stage            | 50.3 (+2.5)  | 44.9 (+2.4)  | 54.4 (+3.2)  | 48.4 (+3.0)
(b) FPN, last-map           | 50.9 (+3.1)  | 45.3 (+2.8)  | 54.6 (+3.4)  | 48.5 (+3.1)
(c) Simple feature pyramid  | 51.2 (+3.4)  | 45.5 (+3.0)  | 54.6 (+3.4)  | 48.6 (+3.2)

Even so, feature pyramids are beneficial. This observation is consistent with the observation in the FPN paper [34] on hierarchical backbones. However, the FPN design is not needed, and our simple feature pyramid is sufficient for a plain ViT backbone to enjoy the benefit of a pyramid. To ablate this design, we mimic the FPN architecture (i.e., the top-down and lateral connections) as in Fig. 2 (a, b). Table 1 (a, b) shows that while both FPN variants achieve strong gains over the baseline with no pyramid (as has been widely observed with the original FPN on hierarchical backbones), they are no better than our simple feature pyramid. The original FPN [34] was motivated by combining lower-resolution, stronger feature maps with higher-resolution, weaker feature maps. This foundation is lost when the backbone is plain and has no high-resolution maps, which can explain why our simple pyramid is sufficient.

Our ablation reveals that the set of pyramidal feature maps, rather than the top-down/lateral connections, is the key to effective multi-scale detection. To see this, we study an even more aggressive case of the simple pyramid: we generate only the finest-scale (1/4) feature map by deconvolution, and then from this finest map we subsample the other scales in parallel by strided average pooling. There are no unshared, per-scale parameters in this design. This aggressively-simple pyramid is nearly as good: it has 54.5 AP (ViT-L), 3.3 higher than the no-pyramid baseline. This shows the importance of pyramidal feature maps. For any variant of these feature pyramids, the anchors (in RPN) and regions (in RoI heads) are mapped to the corresponding level in the pyramid based on their scales, as in [34]. We hypothesize that this explicit scale-equivariant mapping, rather than the top-down/lateral connections, is the main reason why a feature pyramid can greatly benefit multi-scale object detection.
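The scale-based mapping of region proposals to pyramid levels mentioned above follows the assignment rule of the FPN paper [34]; a minimal sketch is below, where the clamping bounds correspond to the usual P2-P5 levels and are assumptions here.

```python
import math

def assign_pyramid_level(box_w, box_h, k_min=2, k_max=5, k0=4, canonical=224):
    """FPN-style assignment [34]: an RoI of the canonical size (224^2) goes to
    level k0; larger (smaller) RoIs go to coarser (finer) levels, clamped to the
    available pyramid levels."""
    k = k0 + math.floor(math.log2(math.sqrt(box_w * box_h) / canonical))
    return max(k_min, min(k_max, k))

assert assign_pyramid_level(224, 224) == 4   # canonical-size box -> level 4
assert assign_pyramid_level(112, 112) == 3   # half the size -> one level finer
```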

Window Attention is Sufficient when Aided by a Few Propagation Blocks. Table 2 ablates our backbone adaptation approach. In short, on top of a baseline that has purely window attention and none of the cross-window propagation blocks (Table 2, "none"), various ways of propagation show decent gains. Even this baseline with no propagation in the backbone is reasonably good (52.9 AP); this can be explained by the fact that the layers beyond the backbone (the simple feature pyramid, RPN, and RoI heads) also induce cross-window communication.


Table 2. Ablation on backbone adaptation strategies using a plain ViT backbone and Mask R-CNN evaluated on COCO. All blocks perform window attention, unless modified by the propagation strategy. In sum, compared to the baseline that uses only window attention (52.9 APbox), most configurations work effectively as long as information can be well propagated across windows. Here the backbone is ViT-L; the observations on ViT-B are similar (see the appendix).

(a) Window attention with various cross-window propagation strategies.
prop. strategy   | APbox        | APmask
none             | 52.9         | 47.2
4 global blocks  | 54.6 (+1.7)  | 48.6 (+1.4)
4 conv blocks    | 54.8 (+1.9)  | 48.8 (+1.6)
shifted win.     | 54.0 (+1.1)  | 47.9 (+0.7)

(b) Convolutional propagation with different residual block types (4 blocks).
prop. conv   | APbox        | APmask
none         | 52.9         | 47.2
naïve        | 54.3 (+1.4)  | 48.3 (+1.1)
basic        | 54.8 (+1.9)  | 48.8 (+1.6)
bottleneck   | 54.6 (+1.7)  | 48.6 (+1.4)

(c) Locations of cross-window global propagation blocks.
prop. locations  | APbox        | APmask
none             | 52.9         | 47.2
first 4 blocks   | 52.9 (+0.0)  | 47.1 (–0.1)
last 4 blocks    | 54.3 (+1.4)  | 48.3 (+1.1)
evenly 4 blocks  | 54.6 (+1.7)  | 48.6 (+1.4)

(d) Number of global propagation blocks. †: memory optimization required.
prop. blks  | APbox        | APmask
none        | 52.9         | 47.2
2           | 54.4 (+1.5)  | 48.5 (+1.3)
4           | 54.6 (+1.7)  | 48.6 (+1.4)
24†         | 55.1 (+2.2)  | 48.9 (+1.7)

Table 3. Practical performance of backbone adaptation strategies. The backbone is ViT-L. The training memory (per GPU) is benchmarked with a batch size of 1. The testing time (per image) is benchmarked on an A100 GPU. †: this 3.34× memory (49G) is estimated as if the same training implementation could be used, which is not practical and requires special memory optimization that altogether slows down training by 2.2× vs. the baseline.

Prop. strategy       | APbox        | # params      | Train mem      | Test time
None                 | 52.9         | 1.00× (331M)  | 1.00× (14.6G)  | 1.00× (88 ms)
4 conv (bottleneck)  | 54.6 (+1.7)  | 1.04×         | 1.05×          | 1.04×
4 global             | 54.6 (+1.7)  | 1.00×         | 1.39×          | 1.16×
24 global            | 55.1 (+2.2)  | 1.00×         | 3.34׆         | 1.86×

In Table 2a, we compare our global and convolutional propagation strategies vs. the no-propagation baseline. They have gains of 1.7 and 1.9 points over the baseline, respectively.


Table 4. Ablation on pre-training strategies with plain ViT backbones using Mask R-CNN evaluated on COCO.

Pre-train            | ViT-B APbox  | ViT-B APmask | ViT-L APbox  | ViT-L APmask
None (random init.)  | 48.1         | 42.6         | 50.0         | 44.2
IN-1K, supervised    | 47.6 (–0.5)  | 42.4 (–0.2)  | 49.6 (–0.4)  | 43.8 (–0.4)
IN-21K, supervised   | 47.8 (–0.3)  | 42.6 (+0.0)  | 50.6 (+0.6)  | 44.8 (+0.6)
IN-1K, MAE           | 51.2 (+3.1)  | 45.5 (+2.9)  | 54.6 (+4.6)  | 48.6 (+4.4)

We also compare with the "shifted window" (Swin [39]) strategy, in which the window grid is shifted by a half-window size for every other block. This shifted-window variant has a 1.1-point gain over the baseline, but is worse than ours. Note that here we focus only on the "shifted window" aspect of Swin [39]: the backbone is still a plain ViT, adapted to shifted window attention only during fine-tuning; it is not the Swin architecture, which we compare to later.

Table 2b compares different types of residual blocks for convolutional propagation. We study the basic (two 3 × 3) [25], bottleneck (1 × 1 → 3 × 3 → 1 × 1) [25], and a naïve block that has a single 3 × 3 convolution. They all improve over the baseline, while the specific block design makes only marginal differences. Interestingly, even though convolution is a local operation, if its receptive field covers two adjacent windows, it is sufficient in principle to connect all pixels of the two windows. This connectivity is thanks to the self-attention in both windows in the succeeding blocks, which may explain why it can perform as well as global propagation.

In Table 2c we study where cross-window propagation should be located in the backbone. By default, 4 global propagation blocks are placed evenly. We compare with placing them in the first or last 4 blocks instead. Interestingly, performing propagation in the last 4 blocks is nearly as good as even placement. This is in line with the observation in [12] that ViT has longer attention distances in later blocks and is more localized in earlier ones. In contrast, performing propagation only in the first 4 blocks shows no gain: in this case, there is no propagation across windows in the backbone after these 4 blocks. This again demonstrates that propagation across windows is helpful.

Table 2d compares the number of global propagation blocks to use. Even using just 2 blocks achieves good accuracy and clearly outperforms the baseline. For comprehensiveness, we also report a variant where all 24 blocks in ViT-L use global attention. This has a marginal gain of 0.5 points over our 4-block default, while its training requires special memory optimization (we use memory checkpointing [7]). This requirement makes scaling to larger models (like ViT-H) impractical. Our solution of window attention plus a few propagation blocks offers a practical, high-performing tradeoff.

We benchmark this tradeoff in Table 3. Using 4 propagation blocks gives a good trade-off. Convolutional propagation is the most practical, increasing memory and time by merely ≤5%, at a small cost of 4% more parameters.


Global propagation with 4 blocks is also feasible and does not increase the model size. Global self-attention in all 24 blocks is not practical. In sum, Table 2 shows that various forms of propagation are helpful, while we can keep using window attention in most or all blocks. Importantly, all of these architecture adaptations are performed only during fine-tuning; they do not require a redesign of the pre-training architecture.

Masked Autoencoders Provide Strong Pre-trained Backbones. Table 4 compares backbone pre-training strategies. Supervised pre-training on IN-1K is slightly worse than no pre-training, similar to the observation in [17]. Supervised pre-training on IN-21K is marginally better for ViT-L. In contrast, MAE [22] pre-training on IN-1K (without labels) shows massive gains, increasing APbox by 3.1 for ViT-B and 4.6 for ViT-L. We hypothesize that the vanilla ViT [12], with fewer inductive biases, may require higher capacity to learn translation- and scale-equivariant features, while higher-capacity models are prone to heavier overfitting; MAE pre-training can help to relieve this problem. We discuss more about MAE in context next.

4.2 Comparisons with Hierarchical Backbones

Modern detection systems involve many implementation details and subtleties. To focus on comparing backbones under conditions that are as fair as possible, we incorporate the Swin [39] and MViTv2 [31] backbones into our implementation.

Settings. We use the same implementation of Mask R-CNN [23] and Cascade Mask R-CNN [3] for all ViT, Swin, and MViTv2 backbones, and we use FPN for the hierarchical Swin/MViTv2 backbones. We search for optimal hyper-parameters separately for each backbone (see the appendix). Our Swin results are better than their counterparts in the original paper; our MViTv2 results are better than or on par with those reported in [31]. Following the original papers [31,39], Swin and MViTv2 both use relative position biases [43]. For a fairer comparison, here we also adopt relative position biases in our ViT backbones as per [31], but only during fine-tuning, without affecting pre-training. This addition improves AP by ∼1 point. Note that our ablations in Sec. 4.1 are without relative position biases.

Results and Analysis. Table 5 shows the comparisons, and Fig. 3 plots the tradeoffs. The comparisons here involve two factors: the backbone and the pre-training strategy. Our plain-backbone detector, combined with MAE pre-training, presents better scaling behavior. When the models are large, our method outperforms the hierarchical counterparts of Swin/MViTv2, including those using IN-21K supervised pre-training. Our result with ViT-H is 2.6 points better than that with MViTv2-H. Moreover, the plain ViT has better wall-clock performance (Fig. 3 right, see ViT-H vs. MViTv2-H), as its simpler blocks are more hardware-friendly.

For example, Swin-B (IN-1K, Cascade Mask R-CNN) has 51.9 APbox reported in the official repo. This result in our implementation is 52.7.


Table 5. Comparisons of plain vs. hierarchical backbones using Mask R-CNN [23] and Cascade Mask R-CNN [3] on COCO. Tradeoffs are plotted in Fig. 3. All entries are implemented and run by us to align low-level details.

Backbone  | Pre-train | Mask R-CNN APbox | Mask R-CNN APmask | Cascade Mask R-CNN APbox | Cascade Mask R-CNN APmask
Hierarchical-backbone detectors:
Swin-B    | 21K, sup  | 51.4 | 45.4 | 54.0 | 46.5
Swin-L    | 21K, sup  | 52.4 | 46.2 | 54.8 | 47.3
MViTv2-B  | 21K, sup  | 53.1 | 47.4 | 55.6 | 48.1
MViTv2-L  | 21K, sup  | 53.6 | 47.5 | 55.7 | 48.3
MViTv2-H  | 21K, sup  | 54.1 | 47.7 | 55.8 | 48.3
Our plain-backbone detectors:
ViT-B     | 1K, MAE   | 51.6 | 45.9 | 54.0 | 46.7
ViT-L     | 1K, MAE   | 55.6 | 49.2 | 57.6 | 49.8
ViT-H     | 1K, MAE   | 56.7 | 50.1 | 58.7 | 50.9

Fig. 3. Tradeoffs of accuracy vs. model size (left), FLOPs (middle), and wall-clock testing time (right). All entries are implemented and run by us to align low-level details. Swin [39] and MViTv2 [31] are pre-trained on IN-1K/21K with supervision. The ViT models are pre-trained using MAE [22] on IN-1K. Here the detector head is Mask R-CNN; similar trends are observed for Cascade Mask R-CNN and the one-stage detector RetinaNet (Figure A.2 in the appendix). Detailed numbers are in the appendix (Table A.2).

We are also curious about the influence of MAE on hierarchical backbones. This is largely beyond the scope of this paper, as it involves finding good training recipes for hierarchical backbones with MAE. To provide some insight, we implement a naïve extension of MAE with the MViTv2 backbone (see the appendix). We observe that MViTv2-L with this MAE pre-training on IN-1K is 1.3 points better than with IN-21K supervised pre-training (54.9 vs. 53.6 APbox). As a comparison, this gap is 4 points for our plain-backbone detector (Table 4).


Table 6. System-level comparisons with the leading results on COCO reported by the original papers. The detection framework is Cascade Mask R-CNN [3] (denoted as "Cascade"), Hybrid Task Cascade (HTC) [5], or its extension (HTC++ [39]). Here we compare results that use ImageNet data (1K or 21K); better results are reported in [9,38] using extra data. †: [33] combines two Swin-L backbones.

Method         | Framework | Pre-train | Single-scale APbox | Single-scale APmask | Multi-scale APbox | Multi-scale APmask
Hierarchical-backbone detectors:
Swin-L [39]    | HTC++     | 21K, sup  | 57.1 | 49.5 | 58.0 | 50.4
MViTv2-L [31]  | Cascade   | 21K, sup  | 56.9 | 48.6 | 58.7 | 50.5
MViTv2-H [31]  | Cascade   | 21K, sup  | 57.1 | 48.8 | 58.4 | 50.1
CBNetV2 [33]†  | HTC       | 21K, sup  | 59.1 | 51.0 | 59.6 | 51.8
SwinV2-L [38]  | HTC++     | 21K, sup  | 58.9 | 51.2 | 60.2 | 52.1
Plain-backbone detectors:
UViT-S [8]     | Cascade   | 1K, sup   | 51.9 | 44.5 | –    | –
UViT-B [8]     | Cascade   | 1K, sup   | 52.5 | 44.8 | –    | –
ViTDet, ViT-B  | Cascade   | 1K, MAE   | 56.0 | 48.0 | 57.3 | 49.4
ViTDet, ViT-L  | Cascade   | 1K, MAE   | 59.6 | 51.1 | 60.4 | 52.2
ViTDet, ViT-H  | Cascade   | 1K, MAE   | 60.4 | 52.0 | 61.3 | 53.1

This shows that the plain ViT backbone may benefit more from MAE pre-training than the hierarchical backbones, suggesting that the lack of inductive biases on scales could be compensated by the self-supervised training of MAE. While improving hierarchical backbones with MAE pre-training is an interesting future topic, our plain-backbone detector enables us to use the readily available ViT backbones from MAE to achieve strong results.

We also note that hierarchical backbones in general involve enhanced self-attention block designs. Examples include the shifted-window attention in Swin [39] and the pooling attention in MViT v1/v2 [15,31]. These block designs, if applied to plain backbones, may also improve accuracy and parameter efficiency. While this may put our competitors at an advantage, our method is still competitive without these enhancements.

4.3 Comparisons with Previous Systems

Next we provide system-level comparisons with the leading results reported in previous papers. We refer to our system as ViTDet, i.e., ViT Detector, reflecting our aim of using a ViT backbone for detection. Since these comparisons are system-level, the methods use a variety of different techniques. While we make efforts to balance the comparisons (as noted below), making a perfectly controlled comparison is infeasible in general; our goal, instead, is to situate our method in the context of current leading methods.

Comparisons on COCO. Table 6 reports the system-level comparisons on COCO.


Table 7. System-level comparisons with the leading results on LVIS (v1 val) reported by the original papers. All results are without test-time augmentation. Detic [54] uses pre-trained CLIP [41] text embeddings. †: these entries use CBNetV2 [33], which combines two Swin-L backbones.

Method                                  | Pre-train           | APmask | APmask (rare) | APbox
Hierarchical-backbone detectors:
Copy-Paste [17], Eff-B7 FPN             | None (random init)  | 36.0 | 29.7 | 39.2
Detic [54], Swin-B                      | 21K, sup; CLIP      | 41.7 | 41.7 | –
competition winner 2021 [16], baseline  | 21K, sup            | 43.1 | 34.3 | –
competition winner 2021 [16], full†     | 21K, sup            | 49.2 | 45.4 | –
Plain-backbone detectors:
ViTDet, ViT-L                           | 1K, MAE             | 46.0 | 34.3 | 51.2
ViTDet, ViT-H                           | 1K, MAE             | 48.1 | 36.9 | 53.4

For a fairer comparison, here we make two changes following our competitors: we adopt soft-nms [2], as used by all competitors [31,33,38,39] in this table, and increase the input size (from 1024 to 1280) following [33,38]. We note that we do not use these improvements in the previous ablations. As in the previous subsection (Sec. 4.2), we use relative position biases here.

The leading systems thus far are all based on hierarchical backbones (Table 6). For the first time, we show that a plain-backbone detector can achieve highly accurate results on COCO and can compete with the leading systems. We also compare with UViT [8], a recent plain-backbone detection method. As discussed in Sec. 2, UViT and our work have different focuses: UViT aims at designing a new plain backbone that is good for detection, while our goal here is to support general-purpose ViT backbones, including the original ones in [12]. Despite the different focuses, both UViT and our work suggest that plain-backbone detection is a promising direction with strong potential.

Comparisons on LVIS. We further report system-level comparisons on the LVIS dataset [21]. LVIS contains ∼2M high-quality instance segmentation annotations for 1203 classes that exhibit a natural, long-tailed object distribution. Unlike COCO, the class distribution is heavily imbalanced and many classes have very few training examples.

0.5 in blue and the assigned (selected) box in red. Top: the prediction-based method selects different boxes across training, and the selected box may not cover the objects in the image. Bottom: our simpler max-size variant selects a box that covers the objects and is more consistent across training. (Color figure online)

Table 2. Open-vocabulary LVIS compared to ViLD [17]. We train our model using their training settings and architecture (MaskRCNN-ResNet50, training from scratch). We report mask mAP and its breakdown into novel (rare), common, and frequent classes. Variants of ViLD use distillation (ViLD) or ensembling (ViLD-ensemble). Detic (with IN-L) uses a single model and improves both mAP and mAPnovel.

Method              | mAPmask | mAPmask (novel) | mAPmask (common) | mAPmask (frequent)
ViLD-text [17]      | 24.9    | 10.1            | 23.9             | 32.5
ViLD [17]           | 22.5    | 16.1            | 20.0             | 28.3
ViLD-ensemble [17]  | 25.5    | 16.6            | 24.6             | 30.3
Detic               | 26.8    | 17.8            | 26.3             | 31.6

can provide orthogonal gains to existing open-vocabulary detectors [2]. To further understand the open-vocabulary capabilities of Detic, we also report the top-line results trained with box labels for all classes (Table 1, last row). Despite not using box labels for the novel classes, Detic with ImageNet performs favorably compared to the fully-supervised detector. This result also suggests that bounding-box annotations may not be required for new classes. Detic combined with large image classification datasets is a simple and effective alternative for increasing detector vocabulary.

5.4 Comparison with the State-of-the-Art

We compare Detic's open-vocabulary object detectors with state-of-the-art methods on the open-vocabulary LVIS and open-vocabulary COCO benchmarks. In each case, we strictly follow the architecture and setup from prior work to ensure fair comparisons.

Open-vocabulary LVIS. We compare to ViLD [17], which first used CLIP embeddings [42] for open-vocabulary detection.


Table 3. Open-vocabulary COCO [2]. We compare Detic using the same training data and architecture as OVR-CNN [72]. We report box mAP at IoU threshold 0.5 using Faster R-CNN with a ResNet50-C4 backbone. Detic builds upon the CLIP baseline (second row) and shows significant improvements over prior work. †: results quoted from the OVR-CNN [72] paper or code. ‡: results quoted from the original publications.

Method            | box mAP50 (all) | box mAP50 (novel) | box mAP50 (base)
Base-only†        | 39.9            | 0                 | 49.9
Base-only (CLIP)  | 39.3            | 1.3               | 48.7
WSDDN [3]†        | 24.6            | 20.5              | 23.4
Cap2Det [71]†     | 20.1            | 20.3              | 20.1
SB [2]‡           | 24.9            | 0.31              | 29.2
DELO [78]‡        | 13.0            | 3.41              | 13.8
PL [43]‡          | 27.9            | 4.12              | 35.9
OVR-CNN [72]†     | 39.9            | 22.8              | 46.0
Detic             | 45.0            | 27.8              | 47.1

We strictly follow their training setup and model architecture (Appendix G) and report results in Table 2. Here ViLD-text is exactly our Box-Supervised baseline. Detic provides a gain of 7.7 points on mAPnovel. Compared to ViLD-text, ViLD, which uses knowledge distillation from the CLIP visual backbone, improves mAPnovel at the cost of hurting overall mAP. Ensembling the two models, ViLD-ensemble provides improvements on both metrics. In contrast, Detic uses a single model that improves both novel and overall mAP, and outperforms the ViLD ensemble.

Open-vocabulary COCO. Next, we compare with prior work on the popular open-vocabulary COCO benchmark [2] (see benchmark and implementation details in Appendix H). We strictly follow OVR-CNN [72], using Faster R-CNN with a ResNet50-C4 backbone and none of the improvements from Sect. 5.1. Following [72], we use COCO captions as the image-supervised data: we extract nouns from the captions and use both the image labels and the captions as supervision. Table 3 summarizes our results. As the training set contains only 48 base classes, the base-class-only model (second row) yields low mAP on novel classes. Detic improves the baseline and outperforms OVR-CNN [72] by a large margin, using exactly the same model, training recipe, and data. Additionally, similar to Table 1, we compare to prior prediction-based methods on the open-vocabulary COCO benchmark in Appendix H. In this setting too, Detic improves over prior work, providing significant gains on novel-class detection and overall detection performance.

5.5 Detecting 21K Classes Across Datasets Without Finetuning

Next, we train a detector with the full 21K classes of ImageNet, using our strong recipe with the Swin-B [37] backbone.


Table 4. Detecting 21K classes across datasets. We use Detic to train a detector and evaluate it on multiple datasets without retraining. We report the bounding-box mAP on Objects365 and OpenImages. Compared to the Box-Supervised baseline (trained on LVIS-all), Detic leverages image-level supervision to train robust detectors. The performance of Detic is 70%–80% of that of dataset-specific models (bottom row) that use dataset-specific box labels.

Model                     | Objects365 [49] mAPbox | Objects365 [49] mAPbox (rare) | OpenImages [28] mAP50box | OpenImages [28] mAP50box (rare)
Box-Supervised            | 19.1 | 14.0 | 46.2 | 61.7
Detic w. IN-L             | 21.2 | 17.8 | 53.0 | 67.1
Detic w. IN-21K           | 21.5 | 20.0 | 55.2 | 68.8
Dataset-specific oracles  | 31.2 | 22.5 | 69.9 | 81.8

Table 5. Detic with different classifiers. We vary the classifier used with Detic and observe that it works well with different choices. While CLIP embeddings give the best performance (* indicates our default), all classifiers benefit from Detic.

Classifier     | Box-Supervised mAPmask | Box-Supervised mAPmask (novel) | Detic mAPmask | Detic mAPmask (novel)
*CLIP [42]     | 30.2 | 16.4 | 32.4 | 24.9
Trained        | 27.4 | 0    | 31.7 | 17.4
FastText [24]  | 27.5 | 9.0  | 30.9 | 19.2
OpenCLIP [23]  | 27.1 | 8.9  | 30.7 | 19.4

In practice, training a classification layer over 21K classes is computationally involved. We adopt a modified Federated Loss [76] that uniformly samples 50 classes from the vocabulary at every iteration; we only compute classification scores and back-propagate on the sampled classes. As there is no direct benchmark for evaluating detectors with such a large vocabulary, we evaluate our detectors on new datasets without finetuning. We evaluate on two large-scale object detection datasets, Objects365v2 [49] and OpenImages [28], both with around 1.8M training images. We follow LVIS and treat the 1/3 of classes with the fewest training images as rare classes. Table 4 shows the results. On both datasets, Detic improves the Box-Supervised baseline by a large margin, especially on classes with fewer annotations. Using all 21K classes further improves performance owing to the larger vocabulary. Our single model significantly reduces the gap to the dataset-specific oracles, reaching 70%–80% of their performance without using the corresponding 1.8M detection annotations. See Fig. 5 for qualitative results.


(This is more pronounced in detection than in classification, as the effective "batch size" for the classification layer is 512× the image batch size, where 512 is the number of RoIs per image.)
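A minimal sketch of this sampled classification loss is given below for illustration (it is not Detic's implementation); the binary cross-entropy form and always keeping the classes present in the batch are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_classification_loss(scores, targets, num_sampled=50):
    """Compute the classification loss only on a small, per-iteration subset of
    classes, so gradients touch only those columns of the huge classifier.
    scores: (N_roi, C) logits over the full vocabulary; targets: (N_roi, C) multi-hot."""
    num_classes = scores.shape[1]
    present = (targets > 0).any(dim=0).nonzero(as_tuple=True)[0]   # classes present in this batch
    rand = torch.randperm(num_classes, device=scores.device)[:num_sampled]
    sampled = torch.unique(torch.cat([present, rand]))
    return F.binary_cross_entropy_with_logits(scores[:, sampled],
                                              targets[:, sampled].float())
```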


Fig. 5. Qualitative results of our 21K-class detector. We show random samples from images containing novel classes in the OpenImages (top) and Objects365 (bottom) validation sets. We use the CLIP embeddings of the corresponding vocabularies. We show LVIS classes in purple and novel classes in green. We use a score threshold of 0.5 and show the most confident class for each box. Best viewed on screen.

Table 6. Detic with different pretraining data. Top: our method using ImageNet-1K for pretraining and ImageNet-21K for co-training; bottom: using ImageNet-21K for both pretraining and co-training. Co-training helps pretraining in both cases.

Model           | Pretrain data | mAPmask      | mAPmask (novel)
Box-Supervised  | IN-1K         | 26.1         | 13.6
Detic           | IN-1K         | 28.8 (+2.7)  | 21.7 (+8.1)
Box-Supervised  | IN-21K        | 30.2         | 16.4
Detic           | IN-21K        | 32.4 (+2.2)  | 24.9 (+8.5)

5.6 Ablation Studies

We now ablate our key components under the open-vocabulary LVIS setting with IN-L as the image-classification data. We use our strong training recipe described in Sect. 5.1 for all these experiments.

Classifier Weights. We study the effect of different classifier weights W. While our main open-vocabulary experiments use CLIP [42], we show that the gain of Detic is independent of CLIP. We train Box-Supervised and Detic with different classifiers, including a standard randomly-initialized, trained classifier and other fixed language models [23,24]. The results are shown in Table 5. By default, a trained classifier cannot recognize novel classes; however, Detic enables novel-class recognition even in this setting (17.4 mAPnovel for classes without detection labels). Using language models such as FastText [24] or an open-source version of CLIP [23] leads to better novel-class performance, and CLIP [42] performs the best among them.
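For illustration, a classification head whose weights W are fixed language embeddings (CLIP, OpenCLIP, or FastText class-name embeddings) might look like the sketch below; the L2 normalization and temperature are common practice and our assumptions, not quoted settings.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingClassifier(nn.Module):
    """Classification head with frozen language embeddings as its weight matrix W."""

    def __init__(self, class_embeddings, temperature=0.01):
        super().__init__()
        # class_embeddings: (C, D) precomputed text embeddings, kept frozen
        self.register_buffer("weight", F.normalize(class_embeddings, dim=-1))
        self.temperature = temperature

    def forward(self, roi_features):  # roi_features: (N, D)
        feats = F.normalize(roi_features, dim=-1)
        return feats @ self.weight.t() / self.temperature  # (N, C) logits
```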


Effect of Pretraining. Many existing methods use additional data only for pretraining [11,72,73], while we use image-labeled data for co-training. We present results of Detic with different types of pretraining in Table 6. Detic provides similar gains across the different types of pretraining, suggesting that our gains are orthogonal to advances in pretraining. We believe this is because pretraining improves the overall features, while Detic uses co-training, which improves both the features and the classifier.

5.7 The Standard LVIS Benchmark

Finally, we evaluate Detic on the standard LVIS benchmark [18]. In this setting, the baseline (Box-Supervised) is trained with box and mask labels for all classes, while Detic uses additional image-level labels from IN-L. We train Detic with the same recipe as in Sect. 5.1, using a strong Swin-B [37] backbone and an 896 × 896 input size. We report the mask mAP across all classes and also split into rare, common, and frequent classes. Notably, Detic achieves 41.7 mAP and 41.7 mAPr (Table 7), closing the gap between the overall mAP and the rare-class mAP. This suggests Detic effectively uses image-level labels to improve the performance of classes with very few box labels. Appendix I provides more comparisons to prior work [73] on LVIS, and Appendix J shows that Detic generalizes to DETR-based [79] detectors.

Table 7. Standard LVIS. We evaluate our baseline (Box-Supervised) and Detic using different backbones on the LVIS dataset and report the mask mAP. We also list prior work on LVIS using large backbone networks (single-scale testing) for reference (not for an apples-to-apples comparison). †: detectors using additional data. Detic improves over the baseline, with increased gains for the rare classes.

Method           | Backbone        | mAPmask | mAPmask (rare) | mAPmask (common) | mAPmask (frequent)
MosaicOS† [73]   | ResNeXt-101     | 28.3 | 21.7 | 27.3 | 32.4
CenterNet2 [76]  | ResNeXt-101     | 34.9 | 24.6 | 34.7 | 42.5
AsyncSLL† [19]   | ResNeSt-269     | 36.0 | 27.8 | 36.7 | 39.6
SeesawLoss [64]  | ResNeSt-200     | 37.3 | 26.4 | 36.3 | 43.1
Copy-paste [15]  | EfficientNet-B7 | 38.1 | 32.1 | 37.1 | 41.9
Tan et al. [57]  | ResNeSt-269     | 38.8 | 28.5 | 39.5 | 42.7
Baseline         | Swin-B          | 40.7 | 35.9 | 40.5 | 43.1
Detic†           | Swin-B          | 41.7 | 41.7 | 40.8 | 42.6

6 Limitations and Conclusions

We present Detic, a simple way to use image supervision in large-vocabulary object detection. While Detic is simpler than prior assignment-based weakly-supervised detection methods, it supervises all image labels with the same region and does not consider overall dataset statistics; we leave incorporating such information for future work. Moreover, open-vocabulary generalization has no guarantees on extreme domains. Our experiments show that Detic improves large-vocabulary detection with various weak data sources, classifiers, detector architectures, and training recipes.

Acknowledgement. We thank Bowen Cheng and Ross Girshick for helpful discussions and feedback. This material is in part based upon work supported by the National Science Foundation under Grant No. IIS-1845485 and IIS-2006820. Xingyi is supported by a Facebook PhD Fellowship.

References 1. Arbel´ aez, P., Pont-Tuset, J., Barron, J.T., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: CVPR (2014) 2. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 397–414. Springer, Cham (2018). https://doi.org/10. 1007/978-3-030-01246-5 24 3. Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016) 4. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020) 5. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018) 6. Chang, N., Yu, Z., Wang, Y.X., Anandkumar, A., Fidler, S., Alvarez, J.M.: Imagelevel or object-level? A tale of two resampling strategies for long-tailed detection. In: ICML (2021) 7. Chen, L., Yang, T., Zhang, X., Zhang, W., Sun, J.: Points as queries: weakly semisupervised object detection by points. In: CVPR (2021) 8. Dave, A., Doll´ ar, P., Ramanan, D., Kirillov, A., Girshick, R.: Evaluating largevocabulary object detectors: the devil is in the details. arXiv:2102.01066 (2021) 9. Dave, A., Tokmakov, P., Ramanan, D.: Towards segmenting anything that moves. In: ICCVW (2019) 10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009) 11. Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: CVPR (2021) 12. Dong, B., Huang, Z., Guo, Y., Wang, Q., Niu, Z., Zuo, W.: Boosting weakly supervised object detection via learning bounding box adjusters. In: ICCV (2021) 13. Fang, S., Cao, Y., Wang, X., Chen, K., Lin, D., Zhang, W.: WSSOD: a new pipeline for weakly-and semi-supervised object detection. arXiv:2105.11293 (2021) 14. Feng, C., Zhong, Y., Huang, W.: Exploring classification equilibrium in long-tailed object detection. In: ICCV (2021) 15. Ghiasi, G., et al.: Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR (2021) 16. Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Open-vocabulary image segmentation. arXiv:2112.12143 (2021) 17. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. ICLR (2022)


18. Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: CVPR (2019) 19. Han, J., Niu, M., Du, Z., Wei, L., Xie, L., Zhang, X., Tian, Q.: Joint coco and Lvis workshop at ECCV 2020: Lvis challenge track technical report: asynchronous semi-supervised learning for large vocabulary instance segmentation (2020) 20. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017) 21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 22. Huang, Z., Zou, Y., Bhagavatula, V., Huang, D.: Comprehensive attention selfdistillation for weakly-supervised object detection. In: NeurIPS (2020) 23. Ilharco, G., et al.: Openclip, July 2021. https://doi.org/10.5281/zenodo.5143773 24. Joulin, A., Grave, E., Bojanowski, P., Douze, M., J´egou, H., Mikolov, T.: Fasttext. Zip: compressing text classification models. arXiv:1612.03651 (2016) 25. Kim, D., Lin, T.Y., Angelova, A., Kweon, I.S., Kuo, W.: Learning open-world object proposals without learning to classify. arXiv:2108.06753 (2021) 26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015) 27. Konan, S., Liang, K.J., Yin, L.: Extending one-stage detection with open-world proposals. arXiv:2201.02302 (2022) 28. Kuznetsova, A., et al.: The open images dataset v4. In: IJCV (2020) 29. Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: ICLR (2022) 30. Li, X., Kan, M., Shan, S., Chen, X.: Weakly supervised object detection with segmentation collaboration. In: ICCV (2019) 31. Li, Y., Zhang, J., Huang, K., Zhang, J.: Mixed supervised object detection with robust objectness transfer. In: TPAMI (2018) 32. Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., Feng, J.: Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In: CVPR (2020) 33. Li, Z., Yao, L., Zhang, X., Wang, X., Kanhere, S., Zhang, H.: Zero-shot object detection with textual descriptions. In: AAAI (2019) 34. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ar, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 35. Liu, Y., Zhang, Z., Niu, L., Chen, J., Zhang, L.: Mixed supervised object detection by transferringmask prior and semantic similarity. In: NeurIPS (2021) 36. Liu, Y.C., et al.: Unbiased teacher for semi-supervised object detection. In: ICLR (2021) 37. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021) 38. Maaz, M., Rasheed, H., Khan, S., Khan, F.S., Anwer, R.M., Yang, M.H.: Multimodal transformers excel at class-agnostic object detection. arXiv:2111.11430 (2021) 39. Pan, T.Y., et al.: On model calibration for long-tailed object detection and instance segmentation. In: NeurIPS (2021) 40. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014) 41. Pinheiro, P.O., Collobert, R.: Weakly supervised semantic segmentation with convolutional networks. In: CVPR (2015) 42. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)


43. Rahman, S., Khan, S., Barnes, N.: Improved visual-semantic alignment for zeroshot object detection. In: AAAI (2020) 44. Ramanathan, V., Wang, R., Mahajan, D.: DLWL: improving detection for lowshot classes with weakly labelled data. In: CVPR (2020) 45. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR (2017) 46. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015) 47. Ren, Z., Yu, Z., Yang, X., Liu, M.-Y., Schwing, A.G., Kautz, J.: UFO2 : a unified framework towards omni-supervised object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 288–313. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7 18 48. Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: Imagenet-21k pretraining for the masses. In: NeurIPS (2021) 49. Shao, S., et al.: Objects365: a large-scale, high-quality dataset for object detection. In: ICCV (2019) 50. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018) 51. Shen, Y., et al.: Enabling deep residual networks for weakly supervised object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 118–136. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-58598-3 8 52. Shen, Y., Ji, R., Wang, Y., Wu, Y., Cao, L.: Cyclic guidance for weakly supervised joint detection and segmentation. In: CVPR (2019) 53. Singh, B., Li, H., Sharma, A., Davis, L.S.: R-FCN-3000 at 30fps: decoupling detection and classification. In: CVPR (2018) 54. Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semisupervised learning framework for object detection. arXiv:2005.04757 (2020) 55. Tan, J., Lu, X., Zhang, G., Yin, C., Li, Q.: Equalization loss v2: a new gradient balance approach for long-tailed object detection. In: CVPR (2021) 56. Tan, J., et al.: Equalization loss for long-tailed object recognition. In: CVPR (2020) 57. Tan, J., et al.: 1st place solution of Lvis challenge 2020: a good box is not a guarantee of a good mask. arXiv:2009.01559 (2020) 58. Tan, M., Pang, R., Le, Q.V.: Efficientdet: scalable and efficient object detection. In: CVPR (2020) 59. Tang, P., et al.: PCL: proposal cluster learning for weakly supervised object detection. In: TPAMI (2018) 60. Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: CVPR (2017) 61. Uijlings, J., Popov, S., Ferrari, V.: Revisiting knowledge transfer for training object class detectors. In: CVPR (2018) 62. Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. In: IJCV (2013) 63. Wan, F., Liu, C., Ke, W., Ji, X., Jiao, J., Ye, Q.: C-mil:continuation multiple instance learning for weakly supervised object detection. In: CVPR (2019) 64. Wang, J., et al.: Seesaw loss for long-tailed instance segmentation. In: CVPR (2021) 65. Wu, J., Song, L., Wang, T., Zhang, Q., Yuan, J.: Forest R-CNN: large-vocabulary long-tailed object detection and instance segmentation. In: ACM Multimedia (2020) 66. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https:// github.com/facebookresearch/detectron2


67. Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: ICCV (2021) 68. Yan, Z., Liang, J., Pan, W., Li, J., Zhang, C.: Weakly-and semi-supervised object detection with expectation-maximization algorithm. arXiv:1702.08740 (2017) 69. Yang, H., Wu, H., Chen, H.: Detecting 11k classes: large scale object detection without fine-grained bounding boxes. In: ICCV (2019) 70. Yang, K., Li, D., Dou, Y.: Towards precise end-to-end weakly supervised object detection network. In: ICCV (2019) 71. Ye, K., Zhang, M., Kovashka, A., Li, W., Qin, D., Berent, J.: Cap2DET: learning to amplify weak caption supervision for object detection. In: ICCV (2019) 72. Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: CVPR (2021) 73. Zhang, C., et al.: MosaicOS: a simple and effective use of object-centric images for long-tailed object detection. In: ICCV (2021) 74. Zhang, S., Li, Z., Yan, S., He, X., Sun, J.: Distribution alignment: a unified framework for long-tail visual recognition. In: CVPR (2021) 75. Zhong, Y., Wang, J., Peng, J., Zhang, L.: Boosting weakly supervised object detection with progressive knowledge transfer. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 615–631. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7 37 76. Zhou, X., Koltun, V., Kr¨ ahenb¨ uhl, P.: Probabilistic two-stage detection. arXiv:2103.07461 (2021) 77. Zhou, X., Koltun, V., Kr¨ ahenb¨ uhl, P.: Simple multi-dataset detection. In: CVPR (2022) 78. Zhu, P., Wang, H., Saligrama, V.: Don’t even look once: Synthesizing features for zero-shot detection. In: CVPR (2020) 79. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DeTR: deformable transformers for end-to-end object detection. In: ICLR (2021)

DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation

Hongyang Li1, Jiehong Lin1,2, and Kui Jia1,3(B)

1 South China University of Technology, Guangzhou, China
{eeli.hongyang,lin.jiehong}@mail.scut.edu.cn, [email protected]
2 DexForce Co., Ltd., Shenzhen, China
3 Peng Cheng Laboratory, Shenzhen, China

H. Li and J. Lin contributed equally.
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20077-9_22.

Abstract. Establishment of point correspondence between the camera and object coordinate systems is a promising way to solve 6D object poses. However, surrogate objectives of correspondence learning in 3D space are a step away from the true ones of object pose estimation, making the learning suboptimal for the end task. In this paper, we address this shortcoming by introducing a new method of Deep Correspondence Learning Network for direct 6D object pose estimation, shortened as DCL-Net. Specifically, DCL-Net employs dual newly proposed Feature Disengagement and Alignment (FDA) modules to establish, in the feature space, partial-to-partial correspondence for the partial object observation and complete-to-complete correspondence for its complete CAD model, which result in aggregated pose and match feature pairs from the two coordinate systems; these two FDA modules thus bring complementary advantages. The match feature pairs are used to learn confidence scores for measuring the quality of the deep correspondence, while the pose feature pairs are weighted by the confidence scores for direct object pose regression. A confidence-based pose refinement network is also proposed to further improve pose precision in an iterative manner. Extensive experiments show that DCL-Net outperforms existing methods on three benchmarking datasets, including YCB-Video, LineMOD, and Occlusion-LineMOD; ablation studies also confirm the efficacy of our novel designs. Our code is released publicly at https://github.com/Gorilla-Lab-SCUT/DCL-Net.

Keywords: 6D pose estimation · Correspondence learning

1 Introduction

6D object pose estimation is a fundamental task of 3D semantic analysis with many real-world applications, such as robotic grasping [7,44], augmented reality [27], and autonomous driving [8,9,21,42].


Fig. 1. Illustrations of two kinds of point correspondence between the camera coordinate system (cam) and the object coordinate system (obj): (a) partial-to-partial correspondence, from the partial observation (cam) to a partial prediction (obj); (b) complete-to-complete correspondence, from the complete CAD model (obj) to a complete prediction (cam). Best viewed in the electronic version.

The non-linearity of the rotation space SO(3) makes it hard to handle this nontrivial task through direct pose regression from object observations [6,11,15,18,24–26,39,45,47]. Many data-driven methods [3,14,20,23,28,31,33,34,38,41] thus achieve the estimation by learning point correspondence between the camera and object coordinate systems.

Given a partial object observation in the camera coordinate system along with its CAD model in the object coordinate system, we show in Fig. 1 two possible ways to build point correspondence: i) inferring the observed points in the object coordinate system for partial-to-partial correspondence; ii) inferring the sampled points of the CAD model in the camera coordinate system for complete-to-complete correspondence. These two kinds of correspondence have different advantages: the partial-to-partial correspondence is of higher quality than the complete-to-complete one, due to the difficulty of shape completion, while the latter is more robust for recovering the poses of objects with severe occlusions, which the former can hardly handle. While these methods are promising in solving 6D poses from point correspondence (e.g., via a PnP algorithm), their surrogate correspondence objectives are a step away from the true ones of estimating 6D object poses, thus making their learning suboptimal for the end task [40].

To this end, we present a novel method that realizes the above two ways of correspondence establishment in the feature space via dual newly proposed Feature Disengagement and Alignment (FDA) modules, and directly estimates object poses from feature pairs of the two coordinate systems, which are weighted by confidence scores measuring the quality of the deep correspondence. We term our method the Deep Correspondence Learning Network, shortened as DCL-Net; Fig. 2 gives an illustration.

For the partial object observation and its CAD model, DCL-Net firstly extracts their point-wise feature maps in parallel; dual Feature Disengagement and Alignment (FDA) modules are then designed to establish, in feature space, the partial-to-partial correspondence and the complete-to-complete one between the camera and object coordinate systems. Specifically, each FDA module takes as inputs two point-wise feature maps and disengages each feature map into individual pose and match ones; the match feature maps of the two systems are then used to learn an attention map for building deep correspondence; finally, both pose and match feature maps are aligned and paired across systems based on the attention map, resulting in pose and match feature pairs, respectively. Since the two sets of correspondence bring complementary advantages, DCL-Net aggregates them by fusing the respective pose and match feature pairs of the two FDA modules.


The aggregated match feature pairs are used to learn confidence scores for measuring the quality of the deep correspondence, while the pose feature pairs are weighted by the scores to directly regress object poses. A confidence-based pose refinement network is also proposed to further improve the results of DCL-Net in an iterative manner. Extensive experiments show that DCL-Net outperforms existing methods for 6D object pose estimation on three well-acknowledged datasets, including YCB-Video [4], LineMOD [16], and Occlusion-LineMOD [3]; remarkably, on the more challenging Occlusion-LineMOD, DCL-Net outperforms the state-of-the-art method [13] with an improvement of 4.4% on the metric of ADD(S), revealing the strength of DCL-Net in handling occlusion. Ablation studies also confirm the efficacy of the individual components of DCL-Net. Our technical contributions are summarized as follows:

– We design a novel Feature Disengagement and Alignment (FDA) module to establish deep correspondence between two point-wise feature maps from different coordinate systems; more specifically, the FDA module disengages each feature map into individual pose and match ones, which are then aligned across systems to generate pose and match feature pairs, respectively, such that deep correspondence is established within the aligned feature pairs.

– We propose a new method, termed DCL-Net, for direct regression of 6D object poses, which employs dual FDA modules to establish, in feature space, partial-to-partial correspondence and complete-to-complete correspondence between the camera and object coordinate systems, respectively; these two FDA modules bring complementary advantages.

– Match feature pairs of the dual FDA modules are aggregated and used to learn confidence scores that measure the quality of the correspondence, while pose feature pairs are weighted by the scores for estimation of the 6D pose; a confidence-based pose refinement network is also proposed to iteratively improve pose precision.

2 Related Work

6D Pose Estimation from RGB Data. This body of work can be broadly categorized into three types: i) holistic methods [11,15,18] that directly estimate object poses; ii) keypoint-based methods [28,33,34], which establish 2D-3D correspondence via 2D keypoint detection, followed by a PnP/RANSAC algorithm to solve for the poses; and iii) dense correspondence methods [3,20,23,31], which make dense pixel-wise predictions and vote for the final results. Due to the loss of geometry information, these methods are sensitive to lighting conditions and appearance textures, and are thus inferior to RGB-D methods.

6D Pose Estimation from RGB-D Data. Depth maps provide rich geometry information complementary to the appearance information of RGB images. Traditional methods [3,16,32,37,43] solve object poses by extracting features from RGB-D data and performing correspondence grouping and hypothesis verification. Earlier deep methods, such as PoseCNN [45] and SSD-6D [19], first learn coarse poses from RGB images and then refine the poses on point clouds using ICP [2] or MCN [22].


Recently, learning deep features of point clouds has become an effective way to improve pose precision, especially for direct-regression methods [39,47], which strive to enhance pose embeddings with deep geometry features due to the difficulty of learning rotations in a nonlinear space. Wang et al. present DenseFusion [39], which fuses local features of RGB images and point clouds in a point-wise manner and thus explicitly reasons about appearance and geometry information to make the learning more discriminative; to cope with incomplete and noisy shape information, Zhou et al. propose PR-GCN [47] to polish point clouds and enhance pose embeddings via a Graph Convolutional Network. On the other hand, dense correspondence methods show the advantages of deep networks in building point correspondence in Euclidean space; for example, He et al. propose PVN3D [14] to regress dense keypoints and achieve remarkable results. While promising, these methods are usually trained with surrogate objectives instead of the true ones of estimating 6D poses, making the learning suboptimal for the end task. Our proposed DCL-Net borrows the idea of dense correspondence methods by learning deep correspondence in feature space, and weights the feature correspondence with confidence scores for direct estimation of object poses. Besides, the learned correspondence is also utilized by an iterative pose refinement network to improve precision.

3 Deep Correspondence Learning Network

Given the partial object observation X_c in the camera coordinate system, along with the object CAD model Y_o in the object coordinate system, our goal is to estimate the 6D pose (R, t) between these two systems, where R ∈ SO(3) stands for a rotation and t ∈ R^3 for a translation. Figure 2 gives an illustration of our proposed Deep Correspondence Learning Network (dubbed DCL-Net). DCL-Net firstly extracts point-wise features of X_c and Y_o (cf. Sect. 3.1), then establishes correspondence in feature space via dual Feature Disengagement and Alignment modules (cf. Sect. 3.2), and finally regresses the object pose (R, t) with confidence scores based on the learned deep correspondence (cf. Sect. 3.3). The training objectives of DCL-Net are given in Sect. 3.4. A confidence-based pose refinement network is also introduced to iteratively improve pose precision (cf. Sect. 3.5).

3.1 Point-Wise Feature Extraction

We represent the inputs of the object observation X_c and its CAD model Y_o as (I^{X_c}, P^{X_c}) and (I^{Y_o}, P^{Y_o}) with N_X and N_Y sampled points, respectively, where P denotes a point set and I denotes the RGB values corresponding to the points in P. As shown in Fig. 2, we use two parallel backbones to extract their point-wise features F^{X_c} and F^{Y_o}, respectively. Following [12], both backbones are built on 3D sparse convolutions [10], whose volumetric features are then converted to point-level ones; more details about the architectures are given in the supplementary material. Note that for each object instance, F^{Y_o} can be pre-computed during inference for efficiency.


Fig. 2. An illustration of DCL-Net. Given an object observation and its CAD model, DCL-Net first extracts their point-wise features F^{X_c} and F^{Y_o} separately; dual Feature Disengagement and Alignment (FDA) modules are then employed to establish, in feature space, partial-to-partial and complete-to-complete correspondence between the camera and object coordinate systems, respectively, which result in aggregated pose and match feature pairs; the match feature pairs are used to learn confidence scores s for measuring the quality of the deep correspondence, while the pose ones are weighted by s for estimating the 6D object pose (R, t). Best viewed in the electronic version.

3.2 Dual Feature Disengagement and Alignment

The key to figuring out the pose between the object observation and its CAD model lies in the establishment of correspondence. As pointed out in Sect. 1, there exist at least two ways to achieve this goal: i) learning the partial point set $\hat{P}^{X_o}$ in the object system from the complete $P^{Y_o}$ to pair with $P^{X_c}$, e.g., $(P^{X_c}, \hat{P}^{X_o})$, for partial-to-partial correspondence; ii) inferring the complete point set $\hat{P}^{Y_c}$ in the camera coordinate system from the partial $P^{X_c}$ to pair with $P^{Y_o}$, e.g., $(\hat{P}^{Y_c}, P^{Y_o})$, for complete-to-complete correspondence. In this paper, we propose to establish the correspondence in the deep feature space, from which pose feature pairs along with match feature pairs can be generated for the learning of object pose and confidence scores, respectively. Figure 2 gives illustrations of the correspondence in both 3D space and feature space.

Specifically, we design a novel Feature Disengagement and Alignment (FDA) module to learn the pose feature pairs, e.g., $(F_p^{X_c}, \hat{F}_p^{X_o})$ and $(\hat{F}_p^{Y_c}, F_p^{Y_o})$ w.r.t. the above $(P^{X_c}, \hat{P}^{X_o})$ and $(\hat{P}^{Y_c}, P^{Y_o})$, respectively, and the match feature pairs, e.g., $(F_m^{X_c}, \hat{F}_m^{X_o})$ and $(\hat{F}_m^{Y_c}, F_m^{Y_o})$, which can be formulated as follows:

$$F_p^{X_c}, F_m^{X_c}, \hat{F}_p^{X_o}, \hat{F}_m^{X_o}, \hat{P}^{X_o} = \mathrm{FDA}(F^{X_c}, F^{Y_o}), \qquad (1)$$

$$F_p^{Y_o}, F_m^{Y_o}, \hat{F}_p^{Y_c}, \hat{F}_m^{Y_c}, \hat{P}^{Y_c} = \mathrm{FDA}(F^{Y_o}, F^{X_c}). \qquad (2)$$

We term the partial-to-partial (1) and complete-to-complete (2) FDA modules as the P2P-FDA and C2C-FDA modules, respectively.

Feature Disengagement and Alignment Module. The FDA module takes point-wise feature maps of different coordinate systems as inputs, and disengages each feature map into pose and match ones, which are then aligned across systems to establish deep correspondence.


Fig. 3. Illustrations of dual Feature Disengagement and Alignment modules. “T” denotes matrix transposition, and “×” denotes matrix multiplication.

Figure 3 gives illustrations of both P2P-FDA and C2C-FDA modules, where network specifics are also given. We take the P2P-FDA module (1) as an example to illustrate the implementation of FDA. Specifically, as shown in Fig. 3, we firstly disengage $F^{X_c}$ into a pose feature $F_{p1}^{X_c}$ and a match one $F_{m1}^{X_c}$:

$$F_{p1}^{X_c} = \mathrm{MLP}(F^{X_c}), \qquad F_{m1}^{X_c} = \mathrm{MLP}(F^{X_c}), \qquad (3)$$

where MLP(·) denotes a subnetwork of Multi-Layer Perceptron (MLP). The same applies to $F^{Y_o}$, and we have $F_{p1}^{Y_o}$ and $F_{m1}^{Y_o}$. The match features $F_{m1}^{X_c}$ and $F_{m1}^{Y_o}$ are then used for the learning of an attention map $A_1 \in \mathbb{R}^{N_X \times N_Y}$ as follows:

$$A_1 = \mathrm{Softmax}(F_{m1}^{X_c} \times \mathrm{Transpose}(F_{m1}^{Y_o})), \qquad (4)$$

where Transpose(·) denotes tensor transposition, and Softmax(·) denotes the softmax operation along columns. Each element $a_{1,ij}$ in $A_1$ indicates the match degree between the $i$-th point in $P^{X_c}$ and the $j$-th one in $P^{Y_o}$. Then the pose and match features of the partial observation $X_o$ in the object system can be interpolated by matrix multiplication of $A_1$ and those of $P^{Y_o}$, respectively, to be aligned with features of $X_c$ in the camera coordinate system:

$$F_p^{X_c} = F_{p1}^{X_c}, \quad \hat{F}_p^{X_o} = A_1 \times F_{p1}^{Y_o}, \qquad F_m^{X_c} = F_{m1}^{X_c}, \quad \hat{F}_m^{X_o} = A_1 \times F_{m1}^{Y_o}. \qquad (5)$$

Through feature alignment, $\hat{P}^{X_o}$ is expected to be decoded out from $\hat{F}_p^{X_o}$:

$$\hat{P}^{X_o} = \mathrm{MLP}(\hat{F}_p^{X_o}). \qquad (6)$$

Supervisions on the reconstruction of $\hat{P}^{X_o}$ guide the learning of deep correspondence in the P2P-FDA module.
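To make the FDA computation concrete, the following PyTorch sketch implements Eqs. (3)-(6) under assumed settings: the feature width, the MLP depths, and the sharing of the pose/match MLPs between the two inputs are illustrative choices rather than the authors' exact architecture, and the softmax is assumed to normalise over the second point set.

```python
import torch
import torch.nn as nn

class FDA(nn.Module):
    """Sketch of a Feature Disengagement and Alignment module (Eqs. 3-6).
    Layer widths and the sharing of MLPs between the two inputs are
    illustrative assumptions, not the paper's exact architecture."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim))
        self.match_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        self.decode_mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 3))  # decodes aligned 3D points, Eq. (6)

    def forward(self, f_src, f_tgt):
        # f_src: (B, N_src, C) features of one system, e.g. F^{Xc}
        # f_tgt: (B, N_tgt, C) features of the other system, e.g. F^{Yo}
        fp_src, fm_src = self.pose_mlp(f_src), self.match_mlp(f_src)   # Eq. (3)
        fp_tgt, fm_tgt = self.pose_mlp(f_tgt), self.match_mlp(f_tgt)
        attn = torch.softmax(fm_src @ fm_tgt.transpose(1, 2), dim=-1)  # Eq. (4): (B, N_src, N_tgt)
        fp_aligned = attn @ fp_tgt                                     # Eq. (5): interpolate pose features
        fm_aligned = attn @ fm_tgt                                     #          and match features
        pts_aligned = self.decode_mlp(fp_aligned)                      # Eq. (6): decode the aligned point set
        return fp_src, fm_src, fp_aligned, fm_aligned, pts_aligned
```

In the P2P-FDA case the inputs would be $F^{X_c}$ and $F^{Y_o}$; swapping them gives the C2C-FDA direction.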


The P2P-FDA module (1) learns deep correspondence of the partial $X$ in the two coordinate systems, while the C2C-FDA module (2) infers that of the complete $Y$ via the same network structure, as shown in Fig. 3(b). We adopt dual FDA modules in our design to enable robust correspondence establishment, since they bring complementary functions: the P2P-FDA module provides more accurate correspondence than the C2C-FDA module, due to the difficulty of shape completion from partial observations for the latter; however, the C2C-FDA module plays a vital role under severe occlusions, which the P2P-FDA module can hardly handle.

3.3 Confidence-Based Pose Estimation

After dual feature disengagement and alignment, we construct the pose and match feature pairs as follows:

$$F_p = \left\{ (F_p^{X_c}, \hat{F}_p^{X_o}),\ (\hat{F}_p^{Y_c}, F_p^{Y_o}) \right\}, \qquad F_m = \left\{ (F_m^{X_c}, \hat{F}_m^{X_o}),\ (\hat{F}_m^{Y_c}, F_m^{Y_o}) \right\}. \qquad (7)$$

As shown in Fig. 2, the paired match feature $F_m$ is fed into an MLP for the learning of confidence scores $s = \{s_i\}_{i=1}^{N_X + N_Y}$ to reflect the qualities of the deep correspondence:

$$s = \mathrm{MLP}(F_m). \qquad (8)$$

The paired pose feature $F_p$ is also fed into an MLP and weighted by $s$ for precisely estimating the 6D pose $(R, t)$:

$$R = \mathrm{MLP}(f), \quad t = \mathrm{MLP}(f), \quad \text{s.t.} \; f = \mathrm{SUM}(\mathrm{SoftMax}(s) \cdot \mathrm{MLP}(F_p)), \qquad (9)$$

where SUM denotes summation along rows. Rather than numerical calculation from two paired point sets, we directly regress the 6D object pose from deep pair-wise features with confidence scores, which effectively weakens the negative impact of low-quality correspondence on pose estimation, and thus realizes more precise results.
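As a rough illustration of Eqs. (8)-(9), the sketch below weights per-pair pose features by learned confidences before regressing the pose; the sigmoid on the scores, the quaternion parameterisation of the rotation, and the layer widths are assumptions made for this example.

```python
import torch
import torch.nn as nn

class ConfidencePoseHead(nn.Module):
    """Sketch of confidence-based pose estimation (Eqs. 8-9). The sigmoid on the
    scores, the quaternion parameterisation of R and the layer widths are
    assumptions made for this example."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conf_mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                      nn.Linear(128, 1))                 # Eq. (8)
        self.pose_mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.rot_head = nn.Linear(256, 4)    # quaternion for R
        self.trans_head = nn.Linear(256, 3)  # translation t

    def forward(self, f_pose, f_match):
        # f_pose, f_match: (B, Nx+Ny, C) paired pose / match features (Eq. 7)
        s = torch.sigmoid(self.conf_mlp(f_match)).squeeze(-1)            # confidence scores, (B, Nx+Ny)
        w = torch.softmax(s, dim=-1).unsqueeze(-1)                       # SoftMax(s)
        f = (w * self.pose_mlp(f_pose)).sum(dim=1)                       # SUM over pairs, Eq. (9)
        quat = nn.functional.normalize(self.rot_head(f), dim=-1)
        trans = self.trans_head(f)
        return quat, trans, s
```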

3.4 Training of Deep Correspondence Learning Network

For the dual FDA modules, we supervise the reconstruction of $\hat{P}^{X_o} = \{\hat{p}_i^{X_o}\}_{i=1}^{N_X}$ and $\hat{P}^{Y_c} = \{\hat{p}_i^{Y_c}\}_{i=1}^{N_Y}$ to guide the learning of deep correspondence via the following objectives:

$$L_{p2p} = \frac{1}{N_X} \sum_{i=1}^{N_X} \left\| \hat{p}_i^{X_o} - R^{*T}(p_i^{X_c} - t^*) \right\|, \qquad (10)$$

$$L_{c2c} = \frac{1}{N_Y} \sum_{i=1}^{N_Y} \left\| \hat{p}_i^{Y_c} - (R^* p_i^{Y_o} + t^*) \right\|, \qquad (11)$$

where $P^{X_c} = \{p_i^{X_c}\}_{i=1}^{N_X}$ and $P^{Y_o} = \{p_i^{Y_o}\}_{i=1}^{N_Y}$ are the input point sets, and $R^*$ and $t^*$ denote the ground-truth 6D pose.

Fig. 4. An illustration of the iterative confidence-based pose estimation network.

For the confidence-based pose estimation, we use the following objectives on top of the learning of the predicted object pose $(R, t)$ and the confidence scores $s = \{s_i\}_{i=1}^{N_X + N_Y}$, respectively:

$$L_{pose} = \frac{1}{N_Y} \sum_{i=1}^{N_Y} \left\| R p_i^{Y_o} + t - (R^* p_i^{Y_o} + t^*) \right\|, \qquad (12)$$

$$L_{conf} = \frac{1}{N_X} \sum_{i=1}^{N_X} \sigma\!\left(\| \hat{p}_i^{X_o} - R^{T}(p_i^{X_c} - t) \|,\ s_i\right) + \frac{1}{N_Y} \sum_{j=1}^{N_Y} \sigma\!\left(\| \hat{p}_j^{Y_c} - (R p_j^{Y_o} + t) \|,\ s_{N_X + j}\right), \qquad (13)$$

where $\sigma(d, s) = ds - w\log(s)$, and $w$ is a balancing hyperparameter. We note that the objectives (10), (11) and (12) are designed for asymmetric objects; for symmetric ones, we modify them by replacing the L2 distance with the Chamfer distance, as done in [39]. The overall training objective combines (10), (11), (12), and (13), resulting in the following optimization problem:

$$\min L = \lambda_1 L_{p2p} + \lambda_2 L_{c2c} + \lambda_3 L_{pose} + \lambda_4 L_{conf}, \qquad (14)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are penalty parameters.
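A minimal sketch of the confidence-regularised term $\sigma(d, s) = ds - w\log(s)$ and the weighted objective of Eq. (14) is given below, using the penalty values reported later in the implementation details; the Chamfer-distance variant for symmetric objects is omitted.

```python
import torch

def confidence_loss(dist, conf, w=0.01):
    """sigma(d, s) = d * s - w * log(s), averaged over points (terms of Eq. 13)."""
    return (dist * conf - w * torch.log(conf.clamp(min=1e-6))).mean()

def total_loss(l_p2p, l_c2c, l_pose, l_conf, lambdas=(5.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the four objectives (Eq. 14)."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_p2p + l2 * l_c2c + l3 * l_pose + l4 * l_conf
```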

3.5 Confidence-Based Pose Refinement

To take full advantage of the learned correspondence, we propose a confidence-based pose refinement network, as shown in Fig. 4, where the input point set $P^{X_c}$ is transformed with the predicted pose and paired with $\hat{F}_p^{X_o}$ for residual pose estimation in an iterative manner. Specifically, assuming that after $k-1$ iterations of refinement the current object pose is updated as $(R_{k-1}, t_{k-1})$, we use it for transforming $P^{X_c} = \{p_i^{X_c}\}_{i=1}^{N_X}$ to $P_{k-1}^{X_c} = \{R_{k-1}^{T}(p_i^{X_c} - t_{k-1})\}_{i=1}^{N_X}$; for forming


pair-wise pose features with the learned correspondence in the dual FDA modules, we reuse $\hat{F}_p^{X_o}$ by concatenating it with $P_{k-1}^{X_c}$. Similarly to Sect. 3.3, we feed the pose feature pairs into an MLP, and weight them by reusing the confidence scores $s_{N_X}$ (denoting the first $N_X$ elements of $s$) for estimating the residual pose $(\Delta R_k, \Delta t_k)$:

$$\Delta R_k = \mathrm{MLP}(f_k), \quad \Delta t_k = \mathrm{MLP}(f_k), \quad \text{s.t.} \; f_k = \mathrm{SUM}(\mathrm{SoftMax}(s_{N_X}) \cdot \mathrm{MLP}([P_{k-1}^{X_c}, \hat{F}_p^{X_o}])). \qquad (15)$$

Finally, the pose $(R_k, t_k)$ of the $k$-th iteration can be obtained as follows:

$$R_k = \Delta R_k R_{k-1}, \qquad t_k = R_{k-1} \Delta t_k + t_{k-1}. \qquad (16)$$
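The update rule of Eq. (16) simply composes each predicted residual with the current estimate; a NumPy sketch is shown below, where `predict_residual` is a hypothetical stand-in for the confidence-based refinement network.

```python
import numpy as np

def refine_pose(R, t, predict_residual, num_iters=2):
    """Compose residual updates with the current pose estimate (Eq. 16).
    `predict_residual` is a hypothetical callable standing in for the
    refinement network; it returns (delta_R: 3x3, delta_t: (3,))."""
    for _ in range(num_iters):
        delta_R, delta_t = predict_residual(R, t)
        t = R @ delta_t + t          # t_k = R_{k-1} dt_k + t_{k-1}
        R = delta_R @ R              # R_k = dR_k R_{k-1}
    return R, t

# toy usage: an "identity" refiner leaves the pose unchanged
R, t = refine_pose(np.eye(3), np.zeros(3), lambda R, t: (np.eye(3), np.zeros(3)))
```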

4 Experiments

Datasets. We conduct experiments on three benchmark datasets: YCB-Video [4], LineMOD [16], and Occlusion-LineMOD [3]. The YCB-Video dataset consists of 92 RGB-D videos with 21 different object instances, fully annotated with object poses and masks. Following [39], we use 80 videos therein for training along with an additional 80,000 synthetic images, and evaluate DCL-Net on 2,949 keyframes sampled from the remaining 12 videos. LineMOD is also a fully annotated dataset for 6D pose estimation, containing 13 videos with 13 low-textured object instances; we follow the prior work [39] to split training and testing sets. Occlusion-LineMOD is an annotated subset of LineMOD with 8 different object instances, which handpicks RGB-D images of scenes with heavy object occlusions and self-occlusions from LineMOD, making the task of pose estimation more challenging; following [35], we use the DCL-Net trained on the original LineMOD to evaluate on Occlusion-LineMOD.

Implementation Details. For both object observations and CAD models, we sample point sets with 1,024 points as inputs of DCL-Net; that is, $N_X = N_Y = 1{,}024$. For the training objectives, we set the penalty parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ in (14) as 5.0, 1.0, 1.0, and 1.0, respectively; $w$ in (13) is set to 0.01. During inference, we run the confidence-based pose refinement twice to improve pose precision.

Evaluation Metrics. We use the same evaluation metrics as in [39]. For the YCB-Video dataset, the average closest point distance (ADD-S) [45] is employed to measure the pose error; following [39], we report the Area Under the Curve (AUC) of ADD-S with the maximum threshold at 0.1 m, and the percentage of ADD-S smaller than the minimum tolerance of 2 cm (<2 cm). For both LineMOD and Occlusion-LineMOD, ADD-S is employed only for symmetric objects, while the Average Distance (ADD) is used for asymmetric objects; we report the percentage of distances smaller than 10% of the object diameter. Besides, we use the Chamfer Distance (CD) to measure the reconstruction results.
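For reference, the two pose-error metrics used above can be sketched as follows, assuming the object model is given as an (N, 3) array of points; the AUC and threshold-based percentages reported in the tables are computed on top of these per-instance distances.

```python
import numpy as np
from scipy.spatial import cKDTree

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under the two poses."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def adds_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean closest-point distance, used for symmetric objects."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    dists, _ = cKDTree(gt).query(pred, k=1)
    return dists.mean()
```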


Table 1. Ablation studies of the use of dual FDA modules on the YCB-Video dataset [4]. Experiments are conducted without confidence-based weighting and pose refinement.

$$p(\text{dog} \mid v) = \frac{\exp(\langle v, c_{\text{dog}} \rangle / \tau)}{\sum_{c} \exp(\langle v, c \rangle / \tau)}$$

where $v \in \mathbb{R}^D$ denotes the ROI-pooled features from the region proposal network, $g(\cdot)$ refers to a simple tokenisation procedure with no trainable parameters, and $\phi_{\text{text}}$ denotes a hyper-network that maps the natural language to its corresponding latent embedding. Note that the input text usually requires a template with manual prompts, e.g. "this is a photo of [category]", which converts the classification task into the same format as that used during pre-training. As both the visual and textual embeddings have been L2-normalised, a temperature parameter $\tau$ is also introduced. The visual backbone is trained by optimising the classification loss, to pair the regional visual embedding with the textual embedding of the corresponding category.
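The classification described here reduces to temperature-scaled cosine similarities between the regional embedding and the per-category text embeddings; the sketch below illustrates this, with the tensor shapes and the temperature value being assumptions.

```python
import torch
import torch.nn.functional as F

def region_class_logits(region_feats, text_embeds, temperature=0.01):
    """Cosine-similarity logits between ROI features and category text embeddings.
    region_feats: (num_rois, D); text_embeds: (num_categories, D); the
    temperature value here is an assumed placeholder."""
    v = F.normalize(region_feats, dim=-1)
    c = F.normalize(text_embeds, dim=-1)
    return (v @ c.t()) / temperature   # feed to softmax / BCE for p(category | region)
```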


Discussion: Despite the simplicity of formulating an open-vocabulary detector, training such models suffers from great challenges, due to the lack of exhaustive annotations for large-scale datasets. Until recently, large-scale visual-language models, such as CLIP and ALIGN, have been trained to align the latent space between vision and language, using simple noise contrastive learning at the image level. Benefiting from the rich information in text descriptions, e.g. actions, objects, human-object interactions, and object-object relationships, these visual-language models have demonstrated remarkable 'zero-shot' generalisation for various image classification tasks, which opens up the opportunity for expanding the vocabulary of an object detector.

3.2 Naïve Alignment via Detector Training

In this section, we aim to train an open-vocabulary object detector (based on Mask-RCNN) on $\mathcal{D}_{\text{train}}$, i.e. only base categories, by optimising the visual backbone and the class-agnostic RPN to align with the object category classifier, which is inherited from the pre-trained, frozen text encoder of CLIP, as shown in Fig. 2 (left). Note that such a training regime has also been investigated in several previous works, for example [2,11]. However, as indicated by our experiments, naïvely aligning the visual latent space to the textual one only yields very limited open-vocabulary detection performance. We conjecture that the poor generalisation mainly comes from three aspects. Firstly, computing the category embedding with only the class name is suboptimal, as it may not be precise enough to describe a visual concept, leading to lexical ambiguity; for example, "almond" either refers to an edible oval nut with a hard shell or to the tree that it grows on. Secondly, the web images for training CLIP are scene-centric, with objects occupying only a small portion of the image, whereas the object proposals from RPNs often closely localise the object, leading to an obvious domain gap in the visual representation. Thirdly, the base categories used for detector training are significantly less diverse than those used for training CLIP, and thus may not be sufficient to guarantee generalisation towards novel categories. In the following sections, we propose a few simple steps to alleviate these issues.

3.3 Alignment via Regional Prompt Learning

Comparing with the scene-centric images used for training CLIP, the output features from RPNs are local and object-centric. Na¨ıvely aligning the regional visual representation to the frozen CLIP text encoder would therefore encourage each proposal to capture more context than it is required. To this end, we propose a simple idea of regional prompt learning (RPL), steering the textual latent space to better fit object-centric images. Specifically, while computing the category classifier or embedding, we prepend and append a sequence of learnable vectors to the textual input, termed as ‘continuous prompt vectors’. These prompt vectors do not correspond to any real concrete words, and will be attended at the subsequent layers as if they were a sequence of ‘virtual tokens’. Additionally, we also include more detailed description into the prompt template to alleviate the lexical ambiguity, for instance, {category: “almond”, description: “oval-shaped edible seed of the almond tree”}.


Note that the description can often be easily sourced from Wikipedia or metadata from the dataset. The embedding for each individual category can thus be generated as:

$$c_{\text{almond}} = \phi_{\text{text}}([p_1, \ldots, p_j, g(\text{category}), p_{j+1}, \ldots, p_{j+h}, g(\text{description})]) \qquad (5)$$

where $p_i$ ($i \in \{1, 2, ..., j + h\}$) denote the learnable prompt vectors with the same dimension as the word embedding, and [category] and [description] are obtained by tokenising the category name and the detailed description. As the learnable vectors are class-agnostic and shared across all categories, they are expected to be transferable to novel categories after training.

Optimising Prompt Vectors. To save computation, we learn the prompt vectors in an off-line manner: we take the object crops of base categories from LVIS, resize them accordingly and pass them through the frozen CLIP visual encoder to generate the image embeddings. To optimise the prompt vectors, we keep both the visual and textual encoders frozen, and only leave the learnable prompt vectors to be updated, with a standard cross-entropy loss for classifying these image crops. The process of RPL is displayed in Fig. 2 (right).

Discussion. With the proposed RPL, the textual latent space is therefore re-calibrated to match the object-centric visual embeddings. Once trained, we can re-compute all the category embeddings, and train the visual backbone to align with the prompted text encoder, as described in Fig. 2 (left). Our experiments in Sect. 4.3 confirm the effectiveness of RPL, which indeed leads to noticeable improvements in open-vocabulary generalisation.
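A minimal sketch of the prompt construction of Eq. (5) is given below; `token_embedding` and `text_encoder` are hypothetical stand-ins for the frozen CLIP text tower, and only the prefix/suffix vectors carry gradients.

```python
import torch
import torch.nn as nn

class RegionalPrompt(nn.Module):
    """Sketch of regional prompt learning (Eq. 5). Only the prefix/suffix vectors
    are trainable; `token_embedding` and `text_encoder` are hypothetical stand-ins
    for the frozen CLIP text tower (parameters set to requires_grad=False), so
    gradients flow only into the prompt vectors."""
    def __init__(self, embed_dim=512, n_prefix=1, n_suffix=1):
        super().__init__()
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, embed_dim))
        self.suffix = nn.Parameter(0.02 * torch.randn(n_suffix, embed_dim))

    def forward(self, cat_tokens, desc_tokens, token_embedding, text_encoder):
        cat_emb = token_embedding(cat_tokens)     # g(category) word embeddings, (Lc, D)
        desc_emb = token_embedding(desc_tokens)   # g(description) word embeddings, (Ld, D)
        # [p_1..p_j, g(category), p_{j+1}..p_{j+h}, g(description)], cf. Eq. (5)
        seq = torch.cat([self.prefix, cat_emb, self.suffix, desc_emb], dim=0)
        return text_encoder(seq.unsqueeze(0)).squeeze(0)   # category embedding c
```

With the [1+1] configuration used in the ablations, n_prefix = n_suffix = 1.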

3.4 PromptDet: Alignment via Self-training

So far, we have obtained an open-vocabulary object detector by aligning the visual backbone to the prompted text encoder. However, RPL has only exploited limited visual diversity, i.e. only the base categories. In this section, we lift this limitation and propose to leverage large-scale, uncurated, noisy web images to further improve the alignment. Specifically, as shown in Fig. 3, we describe a learning framework that iterates the procedure of RPL and candidate image sourcing, followed by generating pseudo ground-truth boxes and self-training the open-vocabulary detector.

Sourcing Candidate Images. We take the LAION-400M dataset as an initial corpus of images, with the visual embeddings pre-computed by CLIP's visual encoder. To acquire candidate images for each category, we compute the similarity score between the visual embedding and the category embedding, which is computed with the learnt regional prompt. We keep the images with the highest similarity (an ablation study on the number of selected images is conducted in Sect. 4.3). As a consequence, an additional set of images is constructed with both base and novel categories, with no ground-truth bounding boxes available, e.g. $\mathcal{D}_{\text{ext}} = \{(I_{\text{ext}})_i\}_{i=1}^{|\mathcal{D}_{\text{ext}}|}$.


Fig. 3. Illustration of the self-training framework. Stage-I: we use the base categories to learn regional prompts, as already demonstrated in Fig. 2 (right). Stage-II: we source and download the Internet images with the learned prompt. Stage-III: we self-train the detector with both LVIS images of base categories and the sourced images of novel categories. Note that the prompt learning and image sourcing can be iteratively conducted to better retrieve relevant images.

Iterative Prompt Learning and Image Sourcing. Here, we can alternate the procedure of regional prompt learning (Fig. 3 Stage-I) and sourcing Internet images with the learned prompt (Fig. 3 Stage-II). Experimentally, such an iterative sourcing procedure has been shown to be beneficial for mining object-centric images with high precision. It makes it possible to generate more accurate pseudo ground-truth boxes and, as a result, largely improves the detection performance on novel categories after self-training.

Bounding Box Generation. For each image in $\mathcal{D}_{\text{ext}}$, we run inference with our open-vocabulary detector. Since these sourced candidate images are often object-centric, the output object proposals from the class-agnostic RPN usually guarantee a decent precision and recall. We retain the top-K proposals with maximal objectness scores (experiments on selecting the value of K are conducted in Sect. 4.3), and then keep the box with the maximal classification score as the pseudo ground truth for each image. Note that, although an image may contain multiple objects of interest, we only pick one box as pseudo ground truth. Overall, this procedure successfully mines a large set of previously unlabeled instances with pseudo ground truth, which are later used for re-training the visual backbone and RPN (including the regression head) in Mask-RCNN, effectively resembling the self-training procedure. A sketch of this selection rule is given below.

Discussion. Recent Detic [36] also attempts to train an open-vocabulary detector by exploiting external data; our proposed approach differs in three major aspects: (1) Detic uses ImageNet21K as the initial image corpus, which has already been well curated with manual annotations, whereas we advocate a more challenging and scalable scenario with all external images uncurated; (2) Detic uses a heuristic to pseudo-label the bounding boxes, i.e. it always picks the max-sized proposal, while in our case we choose the box with the most confident prediction; (3) our image sourcing and self-training can be iteratively conducted, and are shown to lead to a significant performance boost, as indicated in Sect. 4.3.
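The selection rule described under "Bounding Box Generation" can be sketched as follows; the argument names are placeholders for whatever the detector returns.

```python
def generate_pseudo_box(proposals, objectness, cls_scores, k=20):
    """Pick one pseudo ground-truth box per sourced image.

    proposals:  list of N candidate boxes from the class-agnostic RPN
    objectness: N objectness scores
    cls_scores: N open-vocabulary classification scores for the queried category
    k:          number of top proposals kept before scoring (K in the ablation)
    """
    order = sorted(range(len(proposals)), key=lambda i: objectness[i], reverse=True)[:k]
    best = max(order, key=lambda i: cls_scores[i])   # most confident prediction
    return proposals[best]
```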

4 Experiment

4.1 Dataset and Evaluation Metrics

Here, we describe the open-vocabulary detection setup on LVIS [12]; more details for the MS-COCO benchmark can be found in the supplementary material.

LVIS. The latest LVIS v1.0 [12] contains 1203 categories with both bounding box and instance mask annotations. The categories are divided into three groups based on the number of images that each category appears in within the train set: rare (1-10 images), common (11-100 images), and frequent (>100 images). We follow the same problem setting as in ViLD [11] and Detic [36], where the frequent and common classes are treated as base categories ($\mathcal{C}_{\text{base}}$), and the rare classes as the novel categories ($\mathcal{C}_{\text{novel}}$).

LAION-400M and LAION-Novel. For self-training, we also use an external dataset, LAION-400M [28], which consists of 400 million image-text pairs filtered by pre-trained CLIP. It provides pre-computed CLIP embeddings for all the images, and we search for images using its 64G knn indices and download about 300 images for each novel category, as illustrated by Stage-II in Fig. 3. We refer to this subset of LAION-400M as LAION-novel. While training the initial open-vocabulary object detector, we use LVIS-base; for self-training, we use a combination of the LVIS-base and LAION-novel datasets. We summarise the dataset statistics in Table 1.

For evaluation on the LVIS v1.0 minival set, we mainly consider the mask Average Precision for novel categories, i.e. APnovel. However, to complete the AP metric, we also report APc (for common classes) and APf (for frequent classes). Lastly, the mask Average Precision for all categories is denoted by AP, which is computed as the mean of the APs at IoU thresholds ranging from 0.5 to 0.95 (in steps of 0.05).

Table 1. A summary of dataset statistics. The numbers in brackets refer to the number of base and novel categories.

Dataset | Train | Eval | Definition | #Images | #Categories
LVIS | – | – | original LVIS dataset | 0.1M | 1203
LAION-400M | – | – | image-text pairs filtered by CLIP | 400M | unlabeled
LVIS-base | ✓ | ✗ | common and frequent categories | 0.1M | 866
LAION-novel | ✓ | ✗ | image subset of novel categories | 0.1M | 337 (noisy)
LVIS minival | ✗ | ✓ | standard LVIS validation set | 20K | 1203 (866+337)

4.2 Implementation Details

Detector Training. We conduct all experiments using Mask-RCNN [13] with a ResNet-50-FPN backbone. Similar to Detic [36], we use sigmoid activation and binary cross-entropy loss for classification. We adopt the Stochastic Gradient Descent (SGD) optimizer with a weight decay of 0.0001 and a momentum of 0.9. Unless specified, the models are trained for 12 epochs (1× learning schedule); the initial learning rate is set to 0.02 and then reduced by a factor of 10 at the 8-th and 11-th epochs. This detector training schedule is used for both the naïve alignment (Sect. 3.2) and self-training (Sect. 3.4). As data augmentation for the naïve alignment, we use 640-800 scale jittering and horizontal flipping.

Regional Prompt Learning. We train the learnable prompt vectors for 6 epochs. Empirically, we find that the model is not sensitive to the number of prompt vectors; we therefore use two vectors, one before g(category) as a prefix vector and one after g(category) as a suffix vector.

One-Iteration Prompt Learning and Image Sourcing. For the first iteration, we train the prompt using the image crops from LVIS-base. Specifically, we expand the ground-truth box to triple the height or width on each side, and take crops from the images based on the extended bounding box. We then randomly select up to 200 image crops for each base class to train the prompt vectors. At the image-sourcing stage, we search for web images in LAION-400M using the learned prompt via the knn indices of LAION-400M, and download about 300 images for each novel class, forming LAION-novel for later self-training.

Multi-Iteration Prompt Learning and Image Sourcing. If we perform prompt learning and image sourcing for more than one iteration, we start the prompt learning using LVIS-base and search for more images of the base categories from LAION-400M. Combining the original LVIS-base and the newly sourced images, we again update the prompt vectors, which are then used to search images for the novel categories, constructing the external LAION-novel dataset.

Self-training. We first train the detector on the LVIS-base images for 6 epochs, and then train on both LVIS-base and LAION-novel for another 6 epochs. Since the LAION-novel images are often object-centric due to the regional prompt learning, we use a smaller resolution and apply 160-800 scale jittering. To guarantee high-quality pseudo labels, we adopt a multi-scale inference scheme to generate the pseudo ground-truth bounding boxes for images from LAION-novel. Specifically, one image scale is randomly selected from 160-360, and the other is randomly selected from 360-800. The two images of different scales are fed into the detector, and one pseudo bounding box is generated for each of them. We select the most confident prediction from the two as the final pseudo bounding box, and use it to further self-train the detector.

Training for More Epochs. To compare with state-of-the-art detectors, we train the models with a batch size of 64 on 8 GPUs for 72 epochs (6× learning schedule), with 100-1280 scale jittering.

4.3 Ablation Study

In this section, we conduct ablation studies on the LVIS dataset to thoroughly validate the effectiveness of the proposed components, including the RPL, iterative candidate sourcing, and self-training. In addition, we also study other design choices and hyper-parameters: alternative heuristics


for box selection, the effect of the number of sourced candidate images, and whether to update the class-agnostic region proposal network during self-training.

Regional Prompt Learning (RPL). To demonstrate the effectiveness of training the open-vocabulary detector by aligning the visual and textual latent spaces, we compare the learned prompt with the manual prompt. For simplicity, we only use two learnable vectors (one for prefix, and one for suffix) in RPL. When using more prompt vectors, we did not observe clear benefits. As shown in Table 2, we first investigate the performance with the manual prompt "a photo of [category]", which has also been used in previous works [11,35,36]. However, it only brings limited generalisation, yielding 7.4 AP on novel categories. Secondly, after adding a more detailed description to the prompt template, i.e. using "a photo of [category], which is [description]", the lexical ambiguity can be alleviated, leading to an improvement of 1.6 AP on novel categories. Lastly, we verify the effectiveness of our proposed prompt learning, which further brings performance improvements of 3.7 AP and 2.1 AP on novel categories compared to the two manual prompts, respectively.

Table 2. Comparison of manually designed and learned prompts. Here, we only use two learnable prompt vectors in RPL, i.e. [1+1] refers to using one vector for prefix and one vector for suffix.

Prompt | | APnovel | APc | APf | AP
"a photo of [category]" | manual | 7.4 | 17.2 | 26.1 | 19.0
"a photo of [category], which is [description]" | manual | 9.0 | 18.6 | 26.5 | 20.1
regional prompt learning | [1+1] | 11.1 | 18.8 | 26.6 | 20.3

Self-training. We evaluate the performance after self-training the detector, both with and without iterative candidate image sourcing. As shown in Table 3, self-training always brings a noticeable improvement with different prompts, for example, from 9.0 to 15.3 AP for the manual prompt, and from 11.1 to 15.9 AP for the learnt prompt. Additionally, when conducting a second round of regional prompt learning and image sourcing, our proposed PromptDet really shines, significantly outperforming the manually designed prompt and reaching 19.0 AP on novel categories. This demonstrates the effectiveness of self-training and iterative prompt learning for sourcing higher-quality images. We also conduct a third round of regional prompt learning, which yields 19.3 AP on novel categories. For simplicity, we iterate the prompt learning twice in the following experiments.

Box Generation. For pseudo-labeling the boxes, we show some visualisation examples obtained by taking the most confident predictions on the sourced candidate images, as shown in Fig. 4.


Table 3. Effectiveness of self-training with different prompts. 1-iter, 2-iter and 3-iter denote that Stage-I (i.e. RPL) and Stage-II (i.e. image sourcing) are performed for one, two or three iterations, respectively.

Prompt method | Self-training | APnovel | APc | APf | AP
"a photo of [category], which is [description]" | | 9.0 | 18.6 | 26.5 | 20.1
"a photo of [category], which is [description]" | ✓ | 15.3 | 17.7 | 25.8 | 20.4
Regional prompt learning | | 11.1 | 18.8 | 26.6 | 20.3
PromptDet (1-iter) | ✓ | 15.9 | 17.6 | 25.5 | 20.4
PromptDet (2-iter) | ✓ | 19.0 | 18.5 | 25.8 | 21.4
PromptDet (3-iter) | ✓ | 19.3 | 18.3 | 25.8 | 21.4

Fig. 4. Visualisation of the generated pseudo ground truth for the sourced images.

Quantitatively, we compare with three different heuristic strategies for box generation, as validated in Detic [36]: (1) using the whole image as the proposed box; (2) the proposal with max size; (3) the proposal with max RPN score. As shown in Table 4 (left), we observe that using the most confident predictions as pseudo ground truth significantly outperforms the other strategies. We conjecture that this performance gap between ours and the max-size boxes (used in Detic) might be due to the difference in the external data: Detic exploits ImageNet21K, whose images have been manually verified by human annotators, whereas we only adopt noisy, uncurated web images, so training on bounding boxes generated by a heuristic may incur erroneous supervision during detector training.

Sourcing Variable Candidate Images. We investigate the performance variation while increasing the number of uncurated web images for self-training. As shown in Table 4 (right), 0 images denotes the training scenario with no self-training involved; when increasing the number of sourced images from 50 to 300, the performance increases monotonically, from 14.6 to 19.0 AP on novel categories, showing the scalability of our proposed self-training mechanism. However, we found that the LAION-400M dataset can only support at most 300 images for most categories, and sourcing more images would require a larger corpus; we leave this as future work.

Updating Class-Agnostic RPN and Box Head. Here, we conduct the ablation study on updating or freezing the class-agnostic RPN or box regression


Table 4. Left: comparison of different box generation methods. Right: the effect of increasing the number of sourced candidate images.

Left:
Method | APnovel | APc | APf | AP
w/o self-training | 10.4 | 19.5 | 26.6 | 20.6
image | 9.9 | 18.8 | 26.0 | 20.1
max-size | 9.5 | 18.8 | 26.1 | 20.1
max-obj.-score | 11.3 | 18.7 | 26.0 | 20.3
max-pred.-score (ours) | 19.0 | 18.5 | 25.8 | 21.4

Right:
#Web images | APnovel | APc | APf | AP
0 | 10.4 | 19.5 | 26.6 | 20.6
50 | 14.6 | 19.3 | 26.2 | 21.2
100 | 15.8 | 19.3 | 26.2 | 21.4
200 | 17.4 | 19.1 | 26.0 | 21.5
300 | 19.0 | 18.5 | 25.8 | 21.4

during self-training. As shown in Table 5 (left), we find that freezing these two components can be detrimental, leading to a 1.8 AP (from 19.0 to 17.2) performance drop on detecting the novel categories.

Top-K Proposals for Pseudo Labeling. We investigate the performance when varying the number of box proposals in the pseudo-labeling procedure. As shown in Table 5 (right), selecting the most confident prediction among the top-20 proposals yields the best performance, while taking all object proposals gives the worst performance (10.4 AP vs. 19.0 AP) for novel categories. The other options are all viable, though with some performance drop. We set K = 20 for our experiments.

Table 5. Left: the ablation study on updating the class-agnostic RPN and box regression during self-training. Right: the analysis of the effect of generating variable numbers of pseudo boxes from the RPN.

Left:
RPN classifier | Box head | APnovel | APc | APf | AP
 | | 17.2 | 18.2 | 25.8 | 21.0
✓ | | 18.1 | 18.3 | 25.7 | 21.2
✓ | ✓ | 19.0 | 18.5 | 25.8 | 21.4

Right:
#Proposals | APnovel | APc | APf | AP
10 | 17.2 | 18.7 | 25.7 | 21.2
20 | 19.0 | 18.5 | 25.8 | 21.4
30 | 16.1 | 18.8 | 25.9 | 21.1
1000 (all) | 10.4 | 19.5 | 26.6 | 20.6

4.4 Comparison with the State-of-the-Art

In Table 6, we compare the proposed method with other open-vocabulary object detectors [11,36] on the LVIS v1.0 validation set. Limited by computational resources, our best model is only trained for 72 epochs, achieving 21.4 AP on the novel categories and surpassing the recent state-of-the-art ViLD-ens [11] and Detic [36] by 4.8 AP and 3.6 AP, respectively. Additionally, we observe that training with a longer schedule significantly improves the detection performance on common and frequent categories, from 18.5 AP to 23.3 AP and from 25.8 AP to 29.3 AP, respectively. We also compare with previous works on the open-vocabulary COCO benchmark. Following [2,36], we apply the 48/17 base/novel split setting on


Table 6. Detection results on the LVIS v1.0 validation set. Both Detic and our proposed approach exploit external images; however, in Detic the images are manually annotated, as indicated by '*'. Notably, PromptDet does not require knowledge distillation from the CLIP visual encoder during detector training, which is known to prominently boost performance but significantly increases the training cost.

Method | Epochs | Scale jitter | Input size | #External | APnovel | APc | APf | AP
ViLD-text [11] | 384 | 100∼2048 | 1024×1024 | 0 | 10.1 | 23.9 | 32.5 | 24.9
ViLD [11] | 384 | 100∼2048 | 1024×1024 | 0 | 16.1 | 20.0 | 28.3 | 22.5
ViLD-ens. [11] | 384 | 100∼2048 | 1024×1024 | 0 | 16.6 | 24.6 | 30.3 | 25.5
Detic [36] | 384 | 100∼2048 | 1024×1024 | 1.2M* | 17.8 | 26.3 | 31.6 | 26.8
PromptDet | 12 | 640∼800 | 800×800 | 0.1M | 19.0 | 18.5 | 25.8 | 21.4
PromptDet | 72 | 100∼1280 | 800×800 | 0.1M | 21.4 | 23.3 | 29.3 | 25.3

MS-COCO, and report the box Average Precision at an IoU threshold of 0.5. As Table 7 shows, PromptDet trained for 24 epochs outperforms Detic on both the novel-class mAP (26.6 AP vs. 24.1 AP) and the overall mAP (50.6 AP vs. 44.7 AP) with the same input image resolution (i.e. 640×640).

Table 7. Results on open-vocabulary COCO. Numbers are copied from [36].

Method | Epochs | Input size | AP50 (novel, box) | AP50 (all, box)
WSDDN [3] | 96 | 640×640 | 5.9 | 39.9
DLWL [24] | 96 | 640×640 | 19.6 | 42.9
Predicted [25] | 96 | 640×640 | 18.7 | 41.9
Detic [36] | 96 | 640×640 | 24.1 | 44.7
PromptDet | 24 | 640×640 | 26.6 | 50.6

5 Conclusion

In this work, we propose PromptDet, an open-vocabulary object detector that is able to detect novel categories without any manual annotations. Specifically, we first use the pretrained, frozen CLIP text encoder as an "off-the-shelf" classifier generator in a two-stage object detector. We then propose a regional prompt learning method to steer the textual latent space towards the task of object detection, i.e., to transform the textual embedding space to better align with the visual representation of object-centric images. In addition, we further develop a self-training regime, which iteratively sources high-quality candidate images from a large corpus of uncurated, external images and self-trains the detector. With


these improvements, PromptDet achieves 21.4 AP on novel classes on LVIS, surpassing the state-of-the-art open-vocabulary object detectors by a large margin, with much lower training costs.

References 1. Akata, Z., Malinowski, M., Fritz, M., Schiele, B.: Multi-cue zero-shot learning with strong supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 59–68 (2016) 2. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Proceedings of the European Conference on Computer Vision. pp. 384–400 (2018) 3. Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR, pp. 2846–2854 (2016) 4. Cacheux, Y.L., Borgne, H.L., Crucianu, M.: Modeling inter and intra-class relations in the triplet loss for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10333–10342 (2019) 5. Elhoseiny, M., Zhu, Y., Zhang, H., Elgammal, A.: Link the head to the “ beak”: zero shot learning from noisy text description at part precision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5640–5649 (2017) 6. Everingham, M., Eslami, S., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015) 7. Fan, Q., Zhuo, W., Tang, C.K., Tai, Y.W.: Few-shot object detection with attention-rpn and multi-relation detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4022 (2020) 8. Feng, C., Zhong, Y., Gao, Y., Scott, M.R., Huang, W.: Tood: task-aligned one-stage object detection. In: Proceedings of the International Conference on Computer Vision, pp. 3490–3499. IEEE Computer Society (2021) 9. Feng, C., Zhong, Y., Huang, W.: Exploring classification equilibrium in long-tailed object detection. In: Proceedings of the International Conference on Computer Vision, pp. 3417–3426 (2021) 10. Frome, A., et al.: Devise: a deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems 26 (2013) 11. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021) 12. Gupta, A., Dollar, P., Girshick, R.: Lvis: a dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5356–5364 (2019) 13. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask r-cnn. In: Proceedings of the International Conference on Computer Vision, pp. 2961–2969 (2017) 14. Ji, Z., Fu, Y., Guo, J., Pang, Y., Zhang, Z.M., et al.: Stacked semantics-guided attention model for fine-grained zero-shot learning. In: Advances in Neural Information Processing Systems 31 (2018) 15. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)


16. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature reweighting. In: Proceedings of the International Conference on Computer Vision, pp. 8420–8429 (2019) 17. Kaul, P., Xie, W., Zisserman, A.: Label, verify, correct: a simple few shot object detection method. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022) 18. Li, Z., Yao, L., Zhang, X., Wang, X., Kanhere, S., Zhang, H.: Zero-shot object detection with textual descriptions. In: Proceedings of the AAAI Conference on Artificial Intelligence (2019) 19. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: Proceedings of the International Conference on Computer Vision, pp. 2980–2988 (2017) 20. Lin, T.Y., et al.: Microsoft coco: common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755 (2014) 21. Mori, Y., Takahashi, H., Oka, R.: Image-to-word transformation based on dividing and vector quantizing images with words. In: MISRM (1999) 22. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning, pp. 8748–8763. PMLR (2021) 23. Rahman, S., Khan, S., Barnes, N.: Improved visual-semantic alignment for zeroshot object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) 24. Ramanathan, V., Wang, R., Mahajan, D.: Dlwl: improving detection for lowshot classes with weakly labelled data. In: CVPR, pp. 9342–9352 (2020) 25. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR, pp. 7263– 7271 (2017) 26. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015) 27. Rohrbach, M., Stark, M., Schiele, B.: Evaluating knowledge transfer and zeroshot learning in a large-scale setting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011) 28. Schuhmann, C., et al.: Laion-400m: open dataset of clip-filtered 400 million imagetext pairs. arXiv preprint arXiv:2111.02114 (2021) 29. Tian, Z., Shen, C., Chen, H., He, T.: Fcos: fully convolutional one-stage object detection. In: Proceedings of the International Conference on Computer Vision, pp. 9627–9636 (2019) 30. Weston, J., Bengio, S., Usunier, N.: Wsabie: scaling up to large vocabulary image annotation. In: IJCAI (2011) 31. Xie, J., Zheng, S.: Zsd-yolo: zero-shot yolo detection using vision-language knowledgedistillation. arXiv preprint arXiv:2109.12066 (2021) 32. Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 33. Zhao, H., Puig, X., Zhou, B., Fidler, S., Torralba, A.: Open vocabulary scene parsing. In: Proceedings of the International Conference on Computer Vision, pp. 2002–2010 (2017) 34. Zhong, Y., Deng, Z., Guo, S., Scott, M.R., Huang, W.: Representation sharing for fast object detector search and beyond. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 471–487. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7 28


35. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134 (2021) 36. Zhou, X., Girdhar, R., Joulin, A., Kr¨ ahenb¨ uhl, P., Misra, I.: Detecting twentythousand classes using image-level supervision. arXiv preprint arXiv:2201.02605 (2022)

Densely Constrained Depth Estimator for Monocular 3D Object Detection

Yingyan Li 1,2,4,5, Yuntao Chen 3, Jiawei He 1,2,4, and Zhaoxiang Zhang 1,2,3,4,5(B)

1 Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China
{liyingyan2021,hejiawei2019,zhaoxiang.zhang}@ia.ac.cn
2 University of Chinese Academy of Sciences (UCAS), Beijing, China
3 Centre for Artificial Intelligence and Robotics, HKISI CAS, Hong Kong, China
4 National Laboratory of Pattern Recognition (NLPR), Beijing, China
5 School of Future Technology, University of Chinese Academy of Sciences, Beijing, China

Abstract. Estimating accurate 3D locations of objects from monocular images is a challenging problem because of lacking depth. Previous work shows that utilizing the object’s keypoint projection constraints to estimate multiple depth candidates boosts the detection performance. However, the existing methods can only utilize vertical edges as projection constraints for depth estimation. So these methods only use a small number of projection constraints and produce insufficient depth candidates, leading to inaccurate depth estimation. In this paper, we propose a method that utilizes dense projection constraints from edges of any direction. In this way, we employ much more projection constraints and produce considerable depth candidates. Besides, we present a graph matching weighting module to merge the depth candidates. The proposed method DCD (Densely Constrained Detector) achieves state-of-the-art performance on the KITTI and WOD benchmarks. Code is released at https://github.com/BraveGroup/DCD. Keywords: Monocular 3D object detection · Dense geometric constraint · Message passing · Graph matching

1 Introduction

Monocular 3D detection [7,17,44,50] has become popular because images are large in number, easy to obtain, and have dense information. Nevertheless, the lack of depth information in monocular images is a fatal problem for 3D detection. Some methods [2,22] use deep neural networks to regress the 3D bounding boxes directly, but it is challenging to estimate the 3D locations of objects from 2D images. Another line of work [8,29,43] employs a pre-trained depth estimator. However, training the depth estimator is separated from the


Fig. 1. The object’s depth is estimated by 2D-3D edge projection constraints. This figure compares the involved edges in the object’s depth estimation between (a) previous work and (b) ours. The previous work only deals with vertical edges. Our work is able to handle the edges of any direction.

detection part, requiring a large amount of additional data. In addition, some works [17,23,50] use geometric constraints, i.e., regresses the 2D/3D edges, and then estimates the object’s depth from the 2D-3D edge projection constraints. These works employ 3D shape prior information and exhibit state-of-the-art performance, which is worthy of future research. A problem of the previous work is that their geometric constraints are insufficient. Specifically, some existing methods [25,49,50] estimate the height of the 2D bounding box and the 3D bounding box, and then generate the depth candidates of an object from 2D-3D height projection constraints. The final depth is produced by weighting all the depth candidates. As Fig. 1 shows, this formulation is only suitable for the vertical edges, which means they only utilize a tiny amount of constraints and 3D prior, leading to inaccurate depth estimations. Some of the depth candidates are of low quality, so weighting is needed. However, the previous work’s weighting methods are suboptimal. Since the final depth is derived from the weighted average of depth candidates, the weight should reflect the quality of each depth candidate. Existing methods [23] use a branch to regress the weight of each depth candidate directly, and this branch is paralleled to the keypoints regression branch. So the weighting branch does not know each keypoint’s quality. Some work predicts the uncertainty of each depth to measure the quality of the depth and use the uncertainty to weight [25,50]. However, they obtain the uncertainty of each depth candidate independently, and they do not supervise the weight explicitly. To address the problem of insufficient geometric constraints, we propose a Densely Geometric-constrained Depth Estimator (DGDE). DGDE can estimate depth candidates from projection constraints provided by edges of any direction, no more limited to the vertical edges. This estimator allows better use of the


shape information of the object. In addition, training the neural network with abundant 2D-3D projection constraints helps the network understand the mapping relationship from the 2D plane to the 3D space. To weight the depth candidates properly, we propose a new depth-candidate weighting module that employs graph matching, named the Graph Matching Weighting module. We construct complete graphs based on 2D and 3D semantic keypoints. In a 2D keypoint graph, the 2D keypoint coordinates are placed on the vertices, and an edge represents a pair of 2D keypoints. The 3D keypoint graph is constructed in the same way. We then match the 2D edges and 3D edges and produce the matching scores. The 2D-3D edge matching score is used as the weight of the corresponding depth candidate. These weights are explicitly supervisable. Moreover, the information of all the 2D/3D edges is used to generate each 2D-3D edge matching score.

In summary, our main contributions are:
1. We propose a Densely Geometric-constrained Depth Estimator (DGDE). Different from previous methods, DGDE estimates depth candidates utilizing projection constraints of edges of any direction. Therefore, considerably more 2D-3D projection constraints are used, producing considerably more depth candidates. We produce a high-quality final depth based on these candidates.
2. We propose an effective and interpretable Graph Matching Weighting module (GMW). We construct the 2D/3D graphs from the 2D/3D keypoints, respectively. Then we regard the graph matching score of a 2D-3D edge as the weight of the corresponding depth candidate. This strategy utilizes all the keypoints' information and produces explicitly supervised weights.
3. We localize each object more accurately by weighting the estimated depth candidates with the corresponding matching scores. Our Densely Constrained Detector (DCD) achieves state-of-the-art performance on the KITTI and Waymo Open Dataset (WOD) benchmarks.

2 Related Work

Monocular 3D Object Detection. Monocular 3D object detection [5,7,13,16] is becoming more and more popular because monocular images are easy to obtain. Existing methods can be divided into two categories: single-center-point-based and multi-keypoints-based.

Single-center-point-based methods [22,51,53] use the object's center point to represent an object. In detail, M3D-RPN [2] proposes a depth-aware convolutional layer with the estimated depth. MonoPair [7] discovers that the relationships between nearby objects are useful for optimizing the final results. Although single-center-point-based methods are simple and fast, location regression is unstable because only one center point is utilized. Therefore, multi-keypoints-based methods [17,23,50] have been drawing more and more attention recently.

Multi-keypoints-based methods predict multiple keypoints for an object. More keypoints provide more projection constraints, which are useful for training the neural network because they build the mapping

Densely Constrained Depth Estimator for Monocular 3D Object Detection

721

relationship from the 2D image plane to the 3D space. Deep MANTA [5] defines 4 wireframes as templates for matching cars, while 3D-RCNN [16] proposes a render-and-compare loss. Deep3DBox [28] utilizes 2D bounding boxes as constraints to refine 3D bounding boxes. KM3D [17] localizes the objects utilizing eight bounding boxes points projection. MonoJSG [20] constructs an adaptive cost volume with semantic features to model the depth error. AutoShape [23], highly relevant to this paper, regresses 2D and 3D semantic keypoints and weighs them by predicted scores from a parallel branch. MonoDDE [19] is a concurrent work. It uses more geometric constraints than MonoFlex [50]. However, it only uses geometric constraints based on the object’s center and bounding box points, while we use dense geometric constraints derived from semantic keypoints. In this paper, we propose a novel edge-based depth estimator. The estimator produces depth candidates by projection constraints, consisting of edges of any direction. We weight each edge by graph matching to derive the final depth of the object. Graph Matching and Message Passing. Graph matching [36] is defined as matching vertices and edges between two graphs. This problem is formulated as a Quadratic Assignment Problem (QAP) [3] originally. It has been widely applied in different tasks, such as multiple object tracking [14], semantic keypoint matching [52] and point cloud registration [10]. In the deep learning era, graph matching has become a differentiable module. The spectral relaxation [48], quadratic relaxation [14] and Lagrange decomposition [31] are common in use. However, one simple yet effective way to implement the graph matching layer is using the Sinkhorn [4,32,41] algorithm, which is used in this paper. This paper utilizes a graph matching module to achieve the message passing between keypoints. This means that we calculate the weighting score not only from the keypoint itself but also by taking other keypoints’ regression quality into consideration, which is a kind of message passing between keypoints. All the keypoints on the object help us judge each keypoint’s importance from the perspective of the whole object. Message passing [12] is a popular design, e.g., in graph neural network [1] and transformer [38]. The message passing module learns to aggregate features between nodes in the graph and brings the global information to each node. In the object detection task, Relation Networks [15] is a pioneer work utilizing message passing between proposals. Message passing has also been used for 3D vision. Message passing between frames [45], voxels [9] and points [33] is designed for LiDAR-based 3D object detection. However, for monocular 3D object detection, message passing is seldom considered. In recent work, PGD [42] conducts message passing in the geometric relation graph. But it does not consider the message passing within the object.
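For illustration, the Sinkhorn algorithm mentioned above can be sketched as alternating row and column normalisation of an exponentiated score matrix; this is a generic, simplified version and not necessarily the exact variant used in our graph matching layer.

```python
import torch

def sinkhorn(scores, num_iters=10, eps=1e-8):
    """Differentiable Sinkhorn normalisation of a score matrix.
    scores: (N, M) matching scores; returns a soft assignment matrix whose
    rows and columns are approximately normalised."""
    p = torch.exp(scores)
    for _ in range(num_iters):
        p = p / (p.sum(dim=1, keepdim=True) + eps)   # row normalisation
        p = p / (p.sum(dim=0, keepdim=True) + eps)   # column normalisation
    return p
```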

3 Methodology

The overview of our framework is shown in Fig. 2. We employ a single-stage detector [50] to detect the object's 3D attributes from the monocular image. We propose the Densely Geometric-constrained Depth Estimator (DGDE), which can calculate depth from 2D-3D edges of any direction. The DGDE can effectively utilize the


Fig. 2. Overview of our framework. (a) We propose the Densely Geometric-constrained Depth Estimator (DGDE). DGDE is able to estimate the object's depth candidates from 2D-3D projection constraints of edges of any direction. (b) The Graph Matching Weighting module (GMW) obtains the weights of the estimated depth candidates by graph matching. A robust depth is derived by combining the multiple depth candidates with the corresponding weights.

semantic keypoints of the object and produce many depth candidates. Besides, we utilize the regressed 2D edges, 3D edges, and orientation as the input to our 2D-3D edge graph matching network. Our Graph Matching Weighting module (GMW) matches each 2D-3D edge pair and produces a matching score. By combining the multiple depths with their corresponding matching scores, we can finally generate a robust depth for the object.

3.1 Geometric-Based 3D Detection Definition

Geometric-based monocular 3D object detection estimates the object's location from 2D-3D projection constraints. Specifically, the network predicts the object's dimensions $(h, w, l)$ and rotation $r_y$, since autonomous driving datasets generally assume that the ground is flat. Assuming an object has $n$ semantic keypoints, we regress the $i$-th ($i = 1, 2, \ldots, n$) keypoint's 2D coordinate $(u_i, v_i)$ in the image frame and 3D coordinate $(x_o^i, y_o^i, z_o^i)$ in the object frame. The object frame's coordinate origin is the object's center point. The $i$-th 2D-3D keypoint projection constraint is established from $(u_i, v_i, x_o^i, y_o^i, z_o^i, r_y)$. Given $n$ semantic 2D-3D keypoint projection constraints, solving for the 3D object location $(x_c, y_c, z_c)$, i.e., the translation vector transforming points from the object frame into the camera frame, is an overdetermined problem. The method for generating the semantic keypoints of each object is adapted from [23]: we establish a few car models by PCA and refine them using 3D points segmented from the point clouds and 2D masks. After obtaining the keypoints, we can use our DGDE to estimate the object's depth from the keypoint projection constraints.

3.2 Densely Geometric-Constrained Depth Estimation

While previous depth estimation methods [50] only take vertical edges into account, our DGDE can handle edges of any direction. Therefore, we are able to utilize many more constraints to estimate the depth candidates. Next, we show the details of estimating dense depth candidates of an object from 2D-3D keypoint projection constraints. The solution is based on the keypoint's projection relationship from 3D space to the 2D image. The $i$-th ($i = 1, 2, \ldots, n$) keypoint's 3D coordinate $(x_o^i, y_o^i, z_o^i)$ is defined in the object frame and is projected onto the 2D image plane by:

$$s_i [u_i, v_i, 1]^T = K[R|t][x_o^i, y_o^i, z_o^i, 1]^T, \qquad (1)$$

where $s_i$ is the $i$-th keypoint's depth, $K$ is the camera intrinsic matrix, and $K$, $R$, $t$ are represented as:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \quad R = \begin{bmatrix} \cos r_y & 0 & \sin r_y \\ 0 & 1 & 0 \\ -\sin r_y & 0 & \cos r_y \end{bmatrix}, \quad t = [x_c, y_c, z_c]^T. \qquad (2)$$

By Eq. (1) and Eq. (2), the $i$-th keypoint's projection constraint is denoted as:

$$\begin{cases} s_i = z_c - x_o^i \sin r_y + z_o^i \cos r_y, \\ \tilde{u}_i (z_c - x_o^i \sin r_y + z_o^i \cos r_y) = x_c + x_o^i \cos r_y + z_o^i \sin r_y, \\ \tilde{v}_i (z_c - x_o^i \sin r_y + z_o^i \cos r_y) = y_c + y_o^i, \end{cases} \qquad (3)$$

where $\tilde{u}_i = \frac{u_i - c_x}{f_x}$ and $\tilde{v}_i = \frac{v_i - c_y}{f_y}$. Intuitively, $(z_c - x_o^i \sin r_y + z_o^i \cos r_y)$ is the object's $i$-th 3D keypoint's Z coordinate (i.e., depth) in the camera frame, $(x_c + x_o^i \cos r_y + z_o^i \sin r_y)$ is the 3D keypoint's X coordinate, and $(y_c + y_o^i)$ is its Y coordinate. Similarly, the $j$-th ($j = 1, 2, \ldots, n$) projection constraint is denoted as:

$$\begin{cases} s_j = z_c - x_o^j \sin r_y + z_o^j \cos r_y, \\ \tilde{u}_j (z_c - x_o^j \sin r_y + z_o^j \cos r_y) = x_c + x_o^j \cos r_y + z_o^j \sin r_y, \\ \tilde{v}_j (z_c - x_o^j \sin r_y + z_o^j \cos r_y) = y_c + y_o^j. \end{cases} \qquad (4)$$

From Eq. (3) and Eq. (4), we can densely obtain $z_c$ from the $i$-th and $j$-th ($i \neq j$) keypoint (i.e., edge$_{ij}$) projection constraints as:

$$z_c^{ij} = \frac{l_i - l_j}{\tilde{u}_i - \tilde{u}_j} \qquad (5) \qquad \text{or} \qquad z_c^{ij} = \frac{h_i - h_j}{\tilde{v}_i - \tilde{v}_j}, \qquad (6)$$

where $l_i = x_o^i \cos r_y + z_o^i \sin r_y + \tilde{u}_i (x_o^i \sin r_y - z_o^i \cos r_y)$ and $h_i = y_o^i + \tilde{v}_i (x_o^i \sin r_y - z_o^i \cos r_y)$. This equation reveals that depth can be calculated by


the projection constraints of an edge of any direction. Given $z_c$, we can estimate $x_c$ and $y_c$ from Eq. (3) as $x_c = \tilde{u}_i z_c - l_i$, $y_c = \tilde{v}_i z_c - h_i$. We generate $m = n(n-1)/2$ depth candidates given $n$ keypoints. Among such a large number of candidates, some low-quality ones are inevitable; therefore, an appropriate weighting method is necessary to ensemble these depth candidates.
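As a concrete illustration of Eqs. (3)-(6), the NumPy sketch below computes a depth candidate for every keypoint pair; choosing between the u-based and v-based forms by the larger denominator is an assumption for numerical stability, not the paper's stated rule.

```python
import numpy as np

def depth_candidates(kps_2d, kps_3d, ry, fx, fy, cx, cy):
    """Depth candidates z_c^{ij} from all keypoint pairs (Eqs. 5-6).
    kps_2d: (n, 2) image keypoints (u, v); kps_3d: (n, 3) object-frame keypoints."""
    u_t = (kps_2d[:, 0] - cx) / fx                       # \tilde{u}_i
    v_t = (kps_2d[:, 1] - cy) / fy                       # \tilde{v}_i
    xo, yo, zo = kps_3d[:, 0], kps_3d[:, 1], kps_3d[:, 2]
    proj = xo * np.sin(ry) - zo * np.cos(ry)
    l = xo * np.cos(ry) + zo * np.sin(ry) + u_t * proj   # l_i
    h = yo + v_t * proj                                  # h_i
    cands = []
    n = len(kps_2d)
    for i in range(n):
        for j in range(i + 1, n):
            du, dv = u_t[i] - u_t[j], v_t[i] - v_t[j]
            # pick the better-conditioned constraint (assumption, not the paper's rule)
            if abs(du) > abs(dv):
                cands.append((l[i] - l[j]) / du)         # Eq. (5)
            else:
                cands.append((h[i] - h[j]) / dv)         # Eq. (6)
    return np.array(cands)                               # m = n(n-1)/2 candidates
```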

3.3 Depth Weighting by Graph Matching

As we estimate the depth candidates $z_c^{ij}$ ($i, j = 1, \ldots, n$) for the object from DGDE, the final depth $z_c$ of the object can be computed as a weighted combination of these estimates according to their estimation quality $w_{i,j}$, as

$$z_c = \sum_{i,j} w_{i,j} \, z_c^{ij}. \qquad (7)$$
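A minimal sketch of the weighted combination in Eq. (7), assuming the matching-score weights are normalised to sum to one:

```python
import numpy as np

def combine_depths(depth_cands, weights):
    """Final depth as a weighted average of the candidates (Eq. 7)."""
    w = np.asarray(weights, dtype=float)
    w = w / (w.sum() + 1e-8)          # assume weights normalised to sum to 1
    return float(np.dot(w, depth_cands))
```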