LNCS 13808
Leonid Karlinsky Tomer Michaeli Ko Nishino (Eds.)
Computer Vision – ECCV 2022 Workshops Tel Aviv, Israel, October 23–27, 2022 Proceedings, Part VIII
Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA
Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Moti Yung Columbia University, New York, NY, USA
More information about this series at https://link.springer.com/bookseries/558
Leonid Karlinsky · Tomer Michaeli · Ko Nishino (Eds.)
Computer Vision – ECCV 2022 Workshops Tel Aviv, Israel, October 23–27, 2022 Proceedings, Part VIII
Editors Leonid Karlinsky IBM Research - MIT-IBM Watson AI Lab Massachusetts, USA
Tomer Michaeli Technion – Israel Institute of Technology Haifa, Israel
Ko Nishino Kyoto University Kyoto, Japan
ISSN 0302-9743, ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-25084-2, ISBN 978-3-031-25085-9 (eBook)
https://doi.org/10.1007/978-3-031-25085-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Foreword
Organizing the European Conference on Computer Vision (ECCV 2022) in Tel-Aviv during a global pandemic was no easy feat. The uncertainty level was extremely high, and decisions had to be postponed to the last minute. Still, we managed to plan things just in time for ECCV 2022 to be held in person. Participation in physical events is crucial to stimulating collaborations and nurturing the culture of the Computer Vision community.

There were many people who worked hard to ensure attendees enjoyed the best science at the 17th edition of ECCV. We are grateful to the Program Chairs Gabriel Brostow and Tal Hassner, who went above and beyond to ensure the ECCV reviewing process ran smoothly. The scientific program included dozens of workshops and tutorials in addition to the main conference, and we would like to thank Leonid Karlinsky and Tomer Michaeli for their hard work. Finally, special thanks to the web chairs Lorenzo Baraldi and Kosta Derpanis, who put in extra hours to transfer information fast and efficiently to the ECCV community.

We would like to express gratitude to our generous sponsors and the Industry Chairs Dimosthenis Karatzas and Chen Sagiv, who oversaw industry relations and proposed new ways for academia-industry collaboration and technology transfer. It's great to see so much industrial interest in what we're doing!

Authors' draft versions of the papers appeared online with open access on both the Computer Vision Foundation (CVF) and the European Computer Vision Association (ECVA) websites, as with previous ECCVs. Springer, the publisher of the proceedings, has arranged for archival publication. The final version of the papers is hosted by SpringerLink, with active references and supplementary materials. It benefits all potential readers that we offer both a free and citeable version for all researchers, as well as an authoritative, citeable version for SpringerLink readers. Our thanks go to Ronan Nugent from Springer, who helped us negotiate this agreement. Last but not least, we wish to thank Eric Mortensen, our publication chair, whose expertise made the process smooth.

October 2022
Rita Cucchiara Jiří Matas Amnon Shashua Lihi Zelnik-Manor
Preface
Welcome to the workshop proceedings of the 17th European Conference on Computer Vision (ECCV 2022). This year, the main ECCV event was accompanied by 60 workshops, scheduled on October 23–24, 2022.

We received 103 workshop proposals on diverse computer vision topics and unfortunately had to decline many valuable proposals because of space limitations. We strove to achieve a balance between topics, as well as between established and new series. Due to the uncertainty associated with the COVID-19 pandemic around the proposal submission deadline, we allowed two workshop formats: hybrid and purely online. Some proposers switched their preferred format as we drew near the conference dates. The final program included 30 hybrid workshops and 30 purely online workshops. Not all workshops published their papers in the ECCV workshop proceedings, or had papers at all. These volumes collect the edited papers from 38 out of the 60 workshops.

We sincerely thank the ECCV general chairs for trusting us with the responsibility for the workshops, the workshop organizers for their hard work in putting together exciting programs, and the workshop presenters and authors for contributing to ECCV.

October 2022
Tomer Michaeli Leonid Karlinsky Ko Nishino
Organization
General Chairs Rita Cucchiara Jiří Matas Amnon Shashua Lihi Zelnik-Manor
University of Modena and Reggio Emilia, Italy Czech Technical University in Prague, Czech Republic Hebrew University of Jerusalem, Israel Technion – Israel Institute of Technology, Israel
Program Chairs Shai Avidan Gabriel Brostow Giovanni Maria Farinella Tal Hassner
Tel-Aviv University, Israel University College London, UK University of Catania, Italy Facebook AI, USA
Program Technical Chair Pavel Lifshits
Technion – Israel Institute of Technology, Israel
Workshops Chairs Leonid Karlinsky Tomer Michaeli Ko Nishino
IBM Research - MIT-IBM Watson AI Lab, USA Technion – Israel Institute of Technology, Israel Kyoto University, Japan
Tutorial Chairs Thomas Pock Natalia Neverova
Graz University of Technology, Austria Facebook AI Research, UK
Demo Chair Bohyung Han
Seoul National University, South Korea
Social and Student Activities Chairs Tatiana Tommasi Sagie Benaim
Italian Institute of Technology, Italy University of Copenhagen, Denmark
Diversity and Inclusion Chairs Xi Yin Bryan Russell
Facebook AI Research, USA Adobe, USA
Communications Chairs Lorenzo Baraldi Kosta Derpanis
University of Modena and Reggio Emilia, Italy York University and Samsung AI Centre Toronto, Canada
Industrial Liaison Chairs Dimosthenis Karatzas Chen Sagiv
Universitat Autònoma de Barcelona, Spain SagivTech, Israel
Finance Chair Gerard Medioni
University of Southern California and Amazon, USA
Publication Chair Eric Mortensen
MiCROTEC, USA
Workshops Organizers W01 - AI for Space Tat-Jun Chin Luca Carlone Djamila Aouada Binfeng Pan Viorela Ila Benjamin Morrell Grzegorz Kakareko
The University of Adelaide, Australia Massachusetts Institute of Technology, USA University of Luxembourg, Luxembourg Northwestern Polytechnical University, China The University of Sydney, Australia NASA Jet Propulsion Lab, USA Spire Global, USA
W02 - Vision for Art Alessio Del Bue Peter Bell Leonardo L. Impett Noa Garcia Stuart James
Istituto Italiano di Tecnologia, Italy Philipps-Universität Marburg, Germany École Polytechnique Fédérale de Lausanne (EPFL), Switzerland Osaka University, Japan Istituto Italiano di Tecnologia, Italy
W03 - Adversarial Robustness in the Real World Angtian Wang Yutong Bai Adam Kortylewski Cihang Xie Alan Yuille Xinyun Chen Judy Hoffman Wieland Brendel Matthias Hein Hang Su Dawn Song Jun Zhu Philippe Burlina Rama Chellappa Yinpeng Dong Yingwei Li Ju He Alexander Robey
Johns Hopkins University, USA Johns Hopkins University, USA Max Planck Institute for Informatics, Germany University of California, Santa Cruz, USA Johns Hopkins University, USA University of California, Berkeley, USA Georgia Institute of Technology, USA University of Tübingen, Germany University of Tübingen, Germany Tsinghua University, China University of California, Berkeley, USA Tsinghua University, China Johns Hopkins University, USA Johns Hopkins University, USA Tsinghua University, China Johns Hopkins University, USA Johns Hopkins University, USA University of Pennsylvania, USA
W04 - Autonomous Vehicle Vision Rui Fan Nemanja Djuric Wenshuo Wang Peter Ondruska Jie Li
Tongji University, China Aurora Innovation, USA McGill University, Canada Toyota Woven Planet, UK Toyota Research Institute, USA
W05 - Learning With Limited and Imperfect Data Noel C. F. Codella Zsolt Kira Shuai Zheng Judy Hoffman Tatiana Tommasi Xiaojuan Qi Sadeep Jayasumana Viraj Prabhu Yunhui Guo Ming-Ming Cheng
Microsoft, USA Georgia Institute of Technology, USA Cruise LLC, USA Georgia Institute of Technology, USA Politecnico di Torino, Italy The University of Hong Kong, China University of Oxford, UK Georgia Institute of Technology, USA University of Texas at Dallas, USA Nankai University, China
W06 - Advances in Image Manipulation Radu Timofte Andrey Ignatov Ren Yang Marcos V. Conde Furkan Kınlı
University of Würzburg, Germany, and ETH Zurich, Switzerland AI Benchmark and ETH Zurich, Switzerland ETH Zurich, Switzerland University of Würzburg, Germany Özyeğin University, Turkey
W07 - Medical Computer Vision Tal Arbel Ayelet Akselrod-Ballin Vasileios Belagiannis Qi Dou Moti Freiman Nicolas Padoy Tammy Riklin Raviv Mathias Unberath Yuyin Zhou
McGill University, Canada Reichman University, Israel Otto von Guericke University, Germany The Chinese University of Hong Kong, China Technion, Israel University of Strasbourg, France Ben Gurion University, Israel Johns Hopkins University, USA University of California, Santa Cruz, USA
W08 - Computer Vision for Metaverse Bichen Wu Peizhao Zhang Xiaoliang Dai Tao Xu Hang Zhang Péter Vajda Fernando de la Torre Angela Dai Bryan Catanzaro
Meta Reality Labs, USA Facebook, USA Facebook, USA Facebook, USA Meta, USA Facebook, USA Carnegie Mellon University, USA Technical University of Munich, Germany NVIDIA, USA
W09 - Self-Supervised Learning: What Is Next? Yuki M. Asano Christian Rupprecht Diane Larlus Andrew Zisserman
University of Amsterdam, The Netherlands University of Oxford, UK Naver Labs Europe, France University of Oxford, UK
W10 - Self-Supervised Learning for Next-Generation Industry-Level Autonomous Driving Xiaodan Liang Hang Xu
Sun Yat-sen University, China Huawei Noah’s Ark Lab, China
Fisher Yu Wei Zhang Michael C. Kampffmeyer Ping Luo
ETH Zürich, Switzerland Huawei Noah’s Ark Lab, China UiT The Arctic University of Norway, Norway The University of Hong Kong, China
W11 - ISIC Skin Image Analysis M. Emre Celebi Catarina Barata Allan Halpern Philipp Tschandl Marc Combalia Yuan Liu
University of Central Arkansas, USA Instituto Superior Técnico, Portugal Memorial Sloan Kettering Cancer Center, USA Medical University of Vienna, Austria Hospital Clínic of Barcelona, Spain Google Health, USA
W12 - Cross-Modal Human-Robot Interaction Fengda Zhu Yi Zhu Xiaodan Liang Liwei Wang Xiaojun Chang Nicu Sebe
Monash University, Australia Huawei Noah’s Ark Lab, China Sun Yat-sen University, China The Chinese University of Hong Kong, China University of Technology Sydney, Australia University of Trento, Italy
W13 - Text in Everything Ron Litman Aviad Aberdam Shai Mazor Hadar Averbuch-Elor Dimosthenis Karatzas R. Manmatha
Amazon AI Labs, Israel Amazon AI Labs, Israel Amazon AI Labs, Israel Cornell University, USA Universitat Autònoma de Barcelona, Spain Amazon AI Labs, USA
W14 - BioImage Computing Jan Funke Alexander Krull Dagmar Kainmueller Florian Jug Anna Kreshuk Martin Weigert Virginie Uhlmann
HHMI Janelia Research Campus, USA University of Birmingham, UK Max Delbrück Center, Germany Human Technopole, Italy EMBL-European Bioinformatics Institute, Germany École Polytechnique Fédérale de Lausanne (EPFL), Switzerland EMBL-European Bioinformatics Institute, UK
Peter Bajcsy Erik Meijering
National Institute of Standards and Technology, USA University of New South Wales, Australia
W15 - Visual Object-Oriented Learning Meets Interaction: Discovery, Representations, and Applications Kaichun Mo Yanchao Yang Jiayuan Gu Shubham Tulsiani Hongjing Lu Leonidas Guibas
Stanford University, USA Stanford University, USA University of California, San Diego, USA Carnegie Mellon University, USA University of California, Los Angeles, USA Stanford University, USA
W16 - AI for Creative Video Editing and Understanding Fabian Caba Anyi Rao Alejandro Pardo Linning Xu Yu Xiong Victor A. Escorcia Ali Thabet Dong Liu Dahua Lin Bernard Ghanem
Adobe Research, USA The Chinese University of Hong Kong, China King Abdullah University of Science and Technology, Saudi Arabia The Chinese University of Hong Kong, China The Chinese University of Hong Kong, China Samsung AI Center, UK Reality Labs at Meta, USA Netflix Research, USA The Chinese University of Hong Kong, China King Abdullah University of Science and Technology, Saudi Arabia
W17 - Visual Inductive Priors for Data-Efficient Deep Learning Jan C. van Gemert Nergis Tömen Ekin Dogus Cubuk Robert-Jan Bruintjes Attila Lengyel Osman Semih Kayhan Marcos Baptista Ríos Lorenzo Brigato
Delft University of Technology, The Netherlands Delft University of Technology, The Netherlands Google Brain, USA Delft University of Technology, The Netherlands Delft University of Technology, The Netherlands Bosch Security Systems, The Netherlands Alice Biometrics, Spain Sapienza University of Rome, Italy
W18 - Mobile Intelligent Photography and Imaging Chongyi Li Shangchen Zhou
Nanyang Technological University, Singapore Nanyang Technological University, Singapore
Ruicheng Feng Jun Jiang Wenxiu Sun Chen Change Loy Jinwei Gu
Nanyang Technological University, Singapore SenseBrain Research, USA SenseTime Group Limited, China Nanyang Technological University, Singapore SenseBrain Research, USA
W19 - People Analysis: From Face, Body and Fashion to 3D Virtual Avatars Alberto Del Bimbo Mohamed Daoudi Roberto Vezzani Xavier Alameda-Pineda Marcella Cornia Guido Borghi Claudio Ferrari Federico Becattini Andrea Pilzer Zhiwen Chen Xiangyu Zhu Ye Pan Xiaoming Liu
University of Florence, Italy IMT Nord Europe, France University of Modena and Reggio Emilia, Italy Inria Grenoble, France University of Modena and Reggio Emilia, Italy University of Bologna, Italy University of Parma, Italy University of Florence, Italy NVIDIA AI Technology Center, Italy Alibaba Group, China Chinese Academy of Sciences, China Shanghai Jiao Tong University, China Michigan State University, USA
W20 - Safe Artificial Intelligence for Automated Driving Timo Saemann Oliver Wasenmüller Markus Enzweiler Peter Schlicht Joachim Sicking Stefan Milz Fabian Hüger Seyed Ghobadi Ruby Moritz Oliver Grau Frédérik Blank Thomas Stauner
Valeo, Germany Hochschule Mannheim, Germany Esslingen University of Applied Sciences, Germany CARIAD, Germany Fraunhofer IAIS, Germany Spleenlab.ai and Technische Universität Ilmenau, Germany Volkswagen Group Research, Germany University of Applied Sciences Mittelhessen, Germany Volkswagen Group Research, Germany Intel Labs, Germany Bosch, Germany BMW Group, Germany
W21 - Real-World Surveillance: Applications and Challenges Kamal Nasrollahi Sergio Escalera
Aalborg University, Denmark Universitat Autònoma de Barcelona, Spain
Radu Tudor Ionescu Fahad Shahbaz Khan Thomas B. Moeslund Anthony Hoogs Shmuel Peleg Mubarak Shah
University of Bucharest, Romania Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates Aalborg University, Denmark Kitware, USA The Hebrew University, Israel University of Central Florida, USA
W22 - Affective Behavior Analysis In-the-Wild Dimitrios Kollias Stefanos Zafeiriou Elnar Hajiyev Viktoriia Sharmanska
Queen Mary University of London, UK Imperial College London, UK Realeyes, UK University of Sussex, UK
W23 - Visual Perception for Navigation in Human Environments: The JackRabbot Human Body Pose Dataset and Benchmark Hamid Rezatofighi Edward Vendrow Ian Reid Silvio Savarese
Monash University, Australia Stanford University, USA University of Adelaide, Australia Stanford University, USA
W24 - Distributed Smart Cameras Niki Martinel Ehsan Adeli Rita Pucci Animashree Anandkumar Caifeng Shan Yue Gao Christian Micheloni Hamid Aghajan Li Fei-Fei
University of Udine, Italy Stanford University, USA University of Udine, Italy Caltech and NVIDIA, USA Shandong University of Science and Technology, China Tsinghua University, China University of Udine, Italy Ghent University, Belgium Stanford University, USA
W25 - Causality in Vision Yulei Niu Hanwang Zhang Peng Cui Song-Chun Zhu Qianru Sun Mike Zheng Shou Kaihua Tang
Columbia University, USA Nanyang Technological University, Singapore Tsinghua University, China University of California, Los Angeles, USA Singapore Management University, Singapore National University of Singapore, Singapore Nanyang Technological University, Singapore
W26 - In-Vehicle Sensing and Monitorization Jaime S. Cardoso Pedro M. Carvalho João Ribeiro Pinto Paula Viana Christer Ahlström Carolina Pinto
INESC TEC and Universidade do Porto, Portugal INESC TEC and Polytechnic of Porto, Portugal Bosch Car Multimedia and Universidade do Porto, Portugal INESC TEC and Polytechnic of Porto, Portugal Swedish National Road and Transport Research Institute, Sweden Bosch Car Multimedia, Portugal
W27 - Assistive Computer Vision and Robotics Marco Leo Giovanni Maria Farinella Antonino Furnari Mohan Trivedi Gérard Medioni
National Research Council of Italy, Italy University of Catania, Italy University of Catania, Italy University of California, San Diego, USA Amazon, USA
W28 - Computational Aspects of Deep Learning Iuri Frosio Sophia Shao Lorenzo Baraldi Claudio Baecchi Frederic Pariente Giuseppe Fiameni
NVIDIA, Italy University of California, Berkeley, USA University of Modena and Reggio Emilia, Italy University of Florence, Italy NVIDIA, France NVIDIA, Italy
W29 - Computer Vision for Civil and Infrastructure Engineering Joakim Bruslund Haurum Mingzhu Wang Ajmal Mian Thomas B. Moeslund
Aalborg University, Denmark Loughborough University, UK University of Western Australia, Australia Aalborg University, Denmark
W30 - AI-Enabled Medical Image Analysis: Digital Pathology and Radiology/COVID-19 Jaime S. Cardoso Stefanos Kollias Sara P. Oliveira Mattias Rantalainen Jeroen van der Laak Cameron Po-Hsuan Chen
INESC TEC and Universidade do Porto, Portugal National Technical University of Athens, Greece INESC TEC, Portugal Karolinska Institutet, Sweden Radboud University Medical Center, The Netherlands Google Health, USA
Diana Felizardo Ana Monteiro Isabel M. Pinto Pedro C. Neto Xujiong Ye Luc Bidaut Francesco Rundo Dimitrios Kollias Giuseppe Banna
IMP Diagnostics, Portugal IMP Diagnostics, Portugal IMP Diagnostics, Portugal INESC TEC, Portugal University of Lincoln, UK University of Lincoln, UK STMicroelectronics, Italy Queen Mary University of London, UK Portsmouth Hospitals University, UK
W31 - Compositional and Multimodal Perception Kazuki Kozuka Zelun Luo Ehsan Adeli Ranjay Krishna Juan Carlos Niebles Li Fei-Fei
Panasonic Corporation, Japan Stanford University, USA Stanford University, USA University of Washington, USA Salesforce and Stanford University, USA Stanford University, USA
W32 - Uncertainty Quantification for Computer Vision Andrea Pilzer Martin Trapp Arno Solin Yingzhen Li Neill D. F. Campbell
NVIDIA, Italy Aalto University, Finland Aalto University, Finland Imperial College London, UK University of Bath, UK
W33 - Recovering 6D Object Pose Martin Sundermeyer Tomáš Hodaň Yann Labbé Gu Wang Lingni Ma Eric Brachmann Bertram Drost Sindi Shkodrani Rigas Kouskouridas Aleš Leonardis Carsten Steger Vincent Lepetit Jiří Matas
DLR German Aerospace Center, Germany Reality Labs at Meta, USA Inria Paris, France Tsinghua University, China Reality Labs at Meta, USA Niantic, Germany MVTec, Germany Reality Labs at Meta, USA Scape Technologies, UK University of Birmingham, UK Technical University of Munich and MVTec, Germany École des Ponts ParisTech, France, and TU Graz, Austria Czech Technical University in Prague, Czech Republic
W34 - Drawings and Abstract Imagery: Representation and Analysis Diane Oyen Kushal Kafle Michal Kucer Pradyumna Reddy Cory Scott
Los Alamos National Laboratory, USA Adobe Research, USA Los Alamos National Laboratory, USA University College London, UK University of California, Irvine, USA
W35 - Sign Language Understanding Liliane Momeni Gül Varol Hannah Bull Prajwal K. R. Neil Fox Ben Saunders Necati Cihan Camgöz Richard Bowden Andrew Zisserman Bencie Woll Sergio Escalera Jose L. Alba-Castro Thomas B. Moeslund Julio C. S. Jacques Junior Manuel Vázquez Enríquez
University of Oxford, UK École des Ponts ParisTech, France University of Paris-Saclay, France University of Oxford, UK University College London, UK University of Surrey, UK Meta Reality Labs, Switzerland University of Surrey, UK University of Oxford, UK University College London, UK Universitat Autònoma de Barcelona, Spain Universidade de Vigo, Spain Aalborg University, Denmark Universitat Autònoma de Barcelona, Spain Universidade de Vigo, Spain
W36 - A Challenge for Out-of-Distribution Generalization in Computer Vision Adam Kortylewski Bingchen Zhao Jiahao Wang Shaozuo Yu Siwei Yang Dan Hendrycks Oliver Zendel Dawn Song Alan Yuille
Max Planck Institute for Informatics, Germany University of Edinburgh, UK Max Planck Institute for Informatics, Germany The Chinese University of Hong Kong, China Hong Kong University of Science and Technology, China University of California, Berkeley, USA Austrian Institute of Technology, Austria University of California, Berkeley, USA Johns Hopkins University, USA
W37 - Vision With Biased or Scarce Data Kuan-Chuan Peng Ziyan Wu
Mitsubishi Electric Research Labs, USA United Imaging Intelligence, USA
W38 - Visual Object Tracking Challenge Matej Kristan Aleš Leonardis Jiří Matas Hyung Jin Chang Joni-Kristian Kämäräinen Roman Pflugfelder
Luka Čehovin Zajc Alan Lukežič Gustavo Fernández Michael Felsberg Martin Danelljan
University of Ljubljana, Slovenia University of Birmingham, UK Czech Technical University in Prague, Czech Republic University of Birmingham, UK Tampere University, Finland Technical University of Munich, Germany, Technion, Israel, and Austrian Institute of Technology, Austria University of Ljubljana, Slovenia University of Ljubljana, Slovenia Austrian Institute of Technology, Austria Linköping University, Sweden ETH Zurich, Switzerland
Contents – Part VIII
W31 - Challenge on Compositional and Multimodal Perception

YORO - Lightweight End to End Visual Grounding . . . . . . . . . . . . . . . . . . . . . . . . 3
Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, and Nuno Vasconcelos
W32 - Uncertainty Quantification for Computer Vision

Localization Uncertainty Estimation for Anchor-Free Object Detection . . . . . . . . 27
Youngwan Lee, Joong-Won Hwang, Hyung-Il Kim, Kimin Yun, Yongjin Kwon, Yuseok Bae, and Sung Ju Hwang
Variational Depth Networks: Uncertainty-Aware Monocular Self-supervised Depth Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Georgi Dikov and Joris van Vugt
Unsupervised Joint Image Transfer and Uncertainty Quantification Using Patch Invariant Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Christoph Angermann, Markus Haltmeier, and Ahsan Raza Siyal
Uncertainty Quantification Using Query-Based Object Detectors . . . . . . . . . . . . . 78
Meet P. Vadera, Colin Samplawski, and Benjamin M. Marlin
W33 - Recovering 6D Object Pose

CenDerNet: Center and Curvature Representations for Render-and-Compare 6D Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Peter De Roovere, Rembert Daems, Jonathan Croenen, Taoufik Bourgana, Joris de Hoog, and Francis wyffels
Trans6D: Transformer-Based 6D Object Pose Estimation and Refinement . . . . . 112 Zhongqun Zhang, Wei Chen, Linfang Zheng, Aleš Leonardis, and Hyung Jin Chang
Learning to Estimate Multi-view Pose from Object Silhouettes . . . . . . . . . . . . . . . 129 Yoni Kasten, True Price, David Geraghty, and Jan-Michael Frahm TransNet: Category-Level Transparent Object Pose Estimation . . . . . . . . . . . . . . . 148 Huijie Zhang, Anthony Opipari, Xiaotong Chen, Jiyue Zhu, Zeren Yu, and Odest Chadwicke Jenkins W34 - Drawings and Abstract Imagery: Representation and Analysis Fuse and Attend: Generalized Embedding Learning for Art and Sketches . . . . . . 167 Ujjal Kr Dutta 3D Shape Reconstruction from Free-Hand Sketches . . . . . . . . . . . . . . . . . . . . . . . . 184 Jiayun Wang, Jierui Lin, Qian Yu, Runtao Liu, Yubei Chen, and Stella X. Yu Abstract Images Have Different Levels of Retrievability Per Reverse Image Search Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 Shawn M. Jones and Diane Oyen W35 - Sign Language Understanding ECCV 2022 Sign Spotting Challenge: Dataset, Design and Results . . . . . . . . . . . 225 Manuel Vázquez Enríquez, José L. Alba Castro, Laura Docio Fernandez, Julio C. S. Jacques Junior, and Sergio Escalera Hierarchical I3D for Sign Spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 Ryan Wong, Necati Cihan Camgöz, and Richard Bowden Multi-modal Sign Language Spotting by Multi/One-Shot Learning . . . . . . . . . . . 256 Landong Liu, Wengang Zhou, Weichao Zhao, Hezhen Hu, and Houqiang Li Sign Spotting via Multi-modal Fusion and Testing Time Transferring . . . . . . . . . 271 Hongyu Fu, Chen Liu, Xingqun Qi, Beibei Lin, Lincheng Li, Li Zhang, and Xin Yu
W36 - A Challenge for Out-of-Distribution Generalization in Computer Vision Domain-Conditioned Normalization for Test-Time Domain Generalization . . . . 291 Yuxuan Jiang, Yanfeng Wang, Ruipeng Zhang, Qinwei Xu, Ya Zhang, Xin Chen, and Qi Tian Unleashing the Potential of Adaptation Models via Go-getting Domain Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 Xin Jin, Tianyu He, Xu Shen, Songhua Wu, Tongliang Liu, Jingwen Ye, Xinchao Wang, Jianqiang Huang, Zhibo Chen, and Xian-Sheng Hua ModSelect: Automatic Modality Selection for Synthetic-to-Real Domain Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 Zdravko Marinov, Alina Roitberg, David Schneider, and Rainer Stiefelhagen Consistency Regularization for Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . 347 Kian Boon Koh and Basura Fernando W37 - Vision With Biased or Scarce Data CAT: Controllable Attribute Translation for Fair Facial Attribute Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Jiazhi Li and Wael Abd-Almageed Weakly Supervised Invariant Representation Learning via Disentangling Known and Unknown Nuisance Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382 Jiageng Zhu, Hanchen Xie, and Wael Abd-Almageed Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396 Ioanna Gkartzonika, Nikolaos Gkalelis, and Vasileios Mezaris Self-supervised Orientation-Guided Deep Network for Segmentation of Carbon Nanotubes in SEM Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 Nguyen P. Nguyen, Ramakrishna Surya, Matthew Maschmann, Prasad Calyam, Kannappan Palaniappan, and Filiz Bunyak
W38 - Visual Object Tracking Challenge The Tenth Visual Object Tracking VOT2022 Challenge Results . . . . . . . . . . . . . . 431 Matej Kristan, Aleš Leonardis, Jiˇrí Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Hyung Jin Chang, ˇ Martin Danelljan, Luka Cehovin Zajc, Alan Lukežiˇc, Ondrej Drbohlav, Johanna Björklund, Yushan Zhang, Zhongqun Zhang, Song Yan, Wenyan Yang, Dingding Cai, Christoph Mayer, Gustavo Fernández, Kang Ben, Goutam Bhat, Hong Chang, Guangqi Chen, Jiaye Chen, Shengyong Chen, Xilin Chen, Xin Chen, Xiuyi Chen, Yiwei Chen, Yu-Hsi Chen, Zhixing Chen, Yangming Cheng, Angelo Ciaramella, Yutao Cui, Benjamin Džubur, Mohana Murali Dasari, Qili Deng, Debajyoti Dhar, Shangzhe Di, Emanuel Di Nardo, Daniel K. Du, Matteo Dunnhofer, Heng Fan, Zhenhua Feng, Zhihong Fu, Shang Gao, Rama Krishna Gorthi, Eric Granger, Q. H. Gu, Himanshu Gupta, Jianfeng He, Keji He, Yan Huang, Deepak Jangid, Rongrong Ji, Cheng Jiang, Yingjie Jiang, Felix Järemo Lawin, Ze Kang, Madhu Kiran, Josef Kittler, Simiao Lai, Xiangyuan Lan, Dongwook Lee, Hyunjeong Lee, Seohyung Lee, Hui Li, Ming Li, Wangkai Li, Xi Li, Xianxian Li, Xiao Li, Zhe Li, Liting Lin, Haibin Ling, Bo Liu, Chang Liu, Si Liu, Huchuan Lu, Rafael M. O. Cruz, Bingpeng Ma, Chao Ma, Jie Ma, Yinchao Ma, Niki Martinel, Alireza Memarmoghadam, Christian Micheloni, Payman Moallem, Le Thanh Nguyen-Meidine, Siyang Pan, ChangBeom Park, Danda Paudel, Matthieu Paul, Houwen Peng, Andreas Robinson, Litu Rout, Shiguang Shan, Kristian Simonato, Tianhui Song, Xiaoning Song, Chao Sun, Jingna Sun, Zhangyong Tang, Radu Timofte, Chi-Yi Tsai, Luc Van Gool, Om Prakash Verma, Dong Wang, Fei Wang, Liang Wang, Liangliang Wang, Lijun Wang, Limin Wang, Qiang Wang, Gangshan Wu, Jinlin Wu, Xiaojun Wu, Fei Xie, Tianyang Xu, Wei Xu, Yong Xu, Yuanyou Xu, Wanli Xue, Zizheng Xun, Bin Yan, Dawei Yang, Jinyu Yang, Wankou Yang, Xiaoyun Yang, Yi Yang, Yichun Yang, Zongxin Yang, Botao Ye, Fisher Yu, Hongyuan Yu, Jiaqian Yu, Qianjin Yu, Weichen Yu, Kang Ze, Jiang Zhai, Chengwei Zhang, Chunhu Zhang, Kaihua Zhang, Tianzhu Zhang, Wenkang Zhang, Zhibin Zhang, Zhipeng Zhang, Jie Zhao, Shaochuan Zhao, Feng Zheng, Haixia Zheng, Min Zheng, Bineng Zhong, Jiawen Zhu, Xuefeng Zhu, and Yueting Zhuang Efficient Visual Tracking via Hierarchical Cross-Attention Transformer . . . . . . . 461 Xin Chen, Ben Kang, Dong Wang, Dongdong Li, and Huchuan Lu
Learning Dual-Fused Modality-Aware Representations for RGBD Tracking . . . . 478 Shang Gao, Jinyu Yang, Zhe Li, Feng Zheng, Aleš Leonardis, and Jingkuan Song Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
W31 - Challenge on Compositional and Multimodal Perception
The International Challenge on Compositional and Multimodal Perception (CAMP) of this ECCV 2022 workshop aims at gathering researchers who work on activity/scene recognition, compositionality, multimodal perception and its applications.

People understand the world by breaking it down into parts. Events are perceived as a series of actions, objects are composed of multiple parts, and this sentence can be decomposed into a sequence of words. Although our knowledge representation is naturally compositional, most approaches to computer vision tasks generate representations that are not compositional. We also understand that people use a variety of sensing modalities. Vision is an essential modality, but it can be noisy and requires a direct line of sight to perceive objects. Other sensors (e.g. audio, smell) can combat these shortcomings. They may allow us to detect otherwise imperceptible information about a scene. Prior workshops on multimodal learning have focused primarily on audio, video, and text as sensor modalities, but we found that this set of modalities may not be inclusive enough.

Both these points present interesting components that can add structure to the task of activity/scene recognition, yet they appear to be underexplored. To help encourage further exploration in these areas, we believe a challenge with each of these aspects is appropriate.

October 2022
Kazuki Kozuka Zelun Luo Ehsan Adeli Ranjay Krishna Juan Carlos Niebles Li Fei-Fei
YORO - Lightweight End to End Visual Grounding

Chih-Hui Ho1(B), Srikar Appalaraju2, Bhavan Jasani2, R. Manmatha2, and Nuno Vasconcelos1

1 UC San Diego, La Jolla, USA {chh279,nvasconcelos}@ucsd.edu
2 AWS AI Labs, San Diego, USA {srikara,bjasani,manmatha}@amazon.com
(C.-H. Ho: work done during an internship at Amazon.)
Abstract. We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens and predicts coordinates of the referred object, using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on 5 different datasets with ablations on architecture design choices. YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature. Code is available at https://github.com/chihhuiho/yoro.
1 Introduction
There has been significant recent interest in Vision-Language (VL) learning and Visual Grounding (VG) [2,10,33,42,45,51,74,76,77,79,81,83,85,87,90]. This aims to localize, in an image, an object referred to by natural language, using a text query (see Fig. 1(a)). VG is potentially useful for many applications, ranging from cloud-based image retrieval from large corpora to resource constrained problems on devices of limited computational capacity. For example, in Fig. 1(a), a robot must use its limited computing resources to resolve a natural language query and locate the desired object. These applications currently pose an extreme challenge to VG, by compounding the difficulty of the task itself with
the need to solve it efficiently. When this is the case, computational efficiency becomes an unavoidable requirement of architecture design. Several examples exist in the vision literature, where branches of computationally efficient architectures have evolved for problems like object recognition [24] or detection [60], among others. The goal is not necessarily to develop the most powerful method in the literature, but to find the best trade-off [27] between task performance and some variable that reflects computation, e.g. speed or latency. This has led to the introduction of architectures, such as the MobileNet family [23,24,67] for recognition or the YOLO family [3,60–62] for object detection, that have become popular despite their weaker than state of the art accuracy. Given the obvious relationship between object detection and VG, it is not surprising that similar trends are starting to emerge for the latter. Like object detectors, VG methods can be broadly grouped in two categories: single [6,11,40,50,78,79] and multi-stage [7,19,45,69,81] methods. Multi-stage methods typically rely on a pre-trained visual backbone (such as the FasterRCNN [63]) as a visual encoder module. This simplifies the VG problem, reducing it to the selection of the object proposals that best match the query text. Like multi-stage object detectors, they are more accurate than single stage methods, but usually too complex for the applications of Fig. 1(a). In fact, the run-time complexity of many of these VG systems is lower-bounded by the run-time computation of the FasterRCNN [63] proposal stage, which is usually considered too expensive even for object detection, in domains like robotics. In these domains, single-stage detectors such as YOLO [3,60–62] are the architecture of choice. Unsurprisingly, the YOLO family has been the backbone of choice for various single-stage VG systems [6,79]. However, these and the other CNN-based single stage VG systems [6,40,50,78,79] previously proposed in the literature lack the powerful attention mechanisms that enable recent transformer architectures to excel at multi-modal data fusion [11]. This usually results in weak accuracy.
Fig. 1. (a) Example of VG with limited computational resources ("fetch the yellow toy at the top of the shelf"). (b) Accuracy-Speed plot: YORO surpasses real-time FPS 15 and achieves a better trade-off than other methods in the literature. (Color figure online)
In this work, we seek to endow single stage VG methods with transformer-style attention, so as to enable a better trade-off between accuracy and speed than currently available methods. For this, we propose an efficient and end-to-end differentiable VG architecture, denoted as You Only Refer Once (YORO). YORO is inspired by a transformer architecture, ViT [14], that has achieved success for image classification. ViT has also recently been extended to a transformer architecture, ViLT [33], for vision-language tasks such as visual reasoning [70] or visual question answering [20]. We investigate the more challenging extension of ViLT for tasks that require object localization, such as VG. The main challenge is how to establish explicit connections between words and bounding boxes without the addition of an object detector. Inspired by recent approaches to the design of transformers for object detection [4,17,30], we augment the ViLT architecture with a set of detection tokens that enable this localization. This makes YORO an encoder-only multi-modal transformer architecture that consumes three data streams: referred text tokens, an image in the form of flattened patches, and learnable detection tokens. YORO then uses self-attention to fuse information from all these streams in order to localize the referred object.

Similarly to transformer-based detectors [4,17], YORO is trained by Hungarian matching between the localization hypotheses derived from its tokens and the ground-truth bounding boxes. This combines a bounding box regression loss that minimizes the difference between predicted and ground truth boxes and a classification loss that associates the referred text to the bounding box. The alignment between language and vision representations is further strengthened by an object-text alignment loss that operates in feature space, encouraging the visual features of the predicted object to match the text features of the corresponding text. However, this must be done carefully. In the early stages of training, when object predictions tend to be incorrect, this object-text alignment loss can hamper learning. To both address this issue and encourage more fine-grained alignment, we propose a novel patch-text alignment loss that aligns features of input patches to their corresponding text.

Extensive experiments on five datasets show that YORO establishes the new SOTA for single stage VG models. Overall, as illustrated in Fig. 1(b), YORO (red) achieves the best accuracy/speed trade-off among the methods in the literature. It has substantially better accuracy than previous single stage VG methods (green) and is substantially faster than multi-stage VG models (blue). Ablation studies compare different variants of the proposed model and demonstrate the effectiveness of the proposed patch-text alignment loss. With this we summarize our contributions:

– A novel multi-modal visual grounding architecture YORO is proposed using an encoder-only transformer (Sect. 3). YORO is a simple single-stage design and is end-to-end trainable.
– A novel patch-text alignment loss is introduced for enabling fine-grained visual language alignment (Sect. 4).
– YORO achieves the best accuracy/speed trade-off in the literature, outperforming existing real-time VG methods on five datasets (Sect. 5).
2 Related Work
Prior work in visual grounding (VG) may be mainly classified into multi-stage and single-stage approaches based on the design of the visual branch.

Multi-stage VG approaches [7,9,15,19,48,49,81,84] usually leverage a region proposal network (RPN) [63] to handle the selection of candidate object locations. In a first stage, the RPN proposes a set of object candidates. In a second stage, the VG model selects the region proposal that best matches the query text. This reduces VG to a retrieval problem [18,25,26,65], where the model only needs to search within the proposed object candidates. Similarly to two-stage object detectors [29,44], these methods are powerful for VG. However, like two-stage detectors, they tend to be computationally heavy, making them impractical for real-time operation on complexity constrained devices, as illustrated in Fig. 1. When compared to the object detection literature [29,44], the optimization of the trade-off between speed and performance has received much less attention in VG. On the contrary, recent VG methods [9,30] pursue higher accuracy by adopting more powerful language models [46,57] and encoder-decoder transformer architectures [37,55,57,58,72,86,88] with more parameters and slower speeds. Instead, YORO pursues a better trade-off between accuracy and speed. In this context, it is not clear that a multi-stage architecture is the best solution. Since there is no way to recover objects missed by the RPN, strong performance can require a large number of proposals [52,85], which is inherently expensive. In fact, the use of a preliminary object selection prevents the exploitation of contextual clues that may be available in the query text (e.g. the component "on the top of the shelf" of the query of Fig. 1) and could be critical for solving the task with fewer proposals. The integration of language and vision is, after all, the difference between object detection and VG. This suggests that single stage architectures, like YORO, could be more effective when computation has a cost. Our experimental results support this hypothesis.

Single-stage VG approaches [1,6,11,38,40,50,78,79] achieve greater inference speeds by discarding the proposal generation stage. Most of these methods are based on CNN object detectors, namely the single-stage detector YOLOV3 [62]. This is, for example, adopted to extract visual features by SSG [6] and FAOA [79]. These features are then fused with text features to regress the bounding boxes. ReSC [78] improves FAOA [79] with a recursive sub-query construction module. MCN [50] further extends YOLOV3 [62] for the joint solution of VG and segmentation, using multitask losses. These models lack the powerful attention mechanisms of the transformer architecture, which are critical for the integration of information both spatially, across image regions, and modally, across the vision and language information streams. When compared to them, YORO has the benefit of leveraging the transformer to implement this powerful integration of information. This enables substantially higher accuracy than previous single-stage approaches. Recently, TransVG [11] has also adopted a transformer model [72] to fuse language and vision features. While TransVG does not use a two-stage object detector, it uses two stages of transformers. The
vision and language streams are first processed by independent transformers, whose outputs are then fed to a multi-modal transformer that integrates the two modalities. This makes it substantially slower than YORO, which performs all operations with a single transformer, and could give rise to loss of information needed to reason jointly about vision and language, as discussed above for the RPN. The fact that TransVG is both slower and less accurate than YORO suggests that there is no benefit to the two-transformer-stage architecture.

Vision Transformer. YORO is based on the Vision transformer (ViT), introduced in [14] for image classification by direct consumption of image patches, i.e. without a preliminary CNN backbone. Beyond classification [14,56,71], the vision transformer has been applied to detection [4,17,89] and segmentation [59,68], as discussed in the recent surveys of [21,32]. Recently, ViLT [33] has extended the ViT for various visual-language tasks that do not require object localization, such as visual question answering. The extension to tasks, such as VG, that require object localization is non-trivial. This is due to the need to establish explicit connections between words and bounding boxes, which requires much more fine-grained understanding of both language and images. YORO addresses this limitation by the addition of learnable detection tokens and Hungarian matching between token-based object predictions and ground truth bounding boxes, a component of modern transformer-based object detectors [4,17,89]. To the best of our knowledge, it is the first architecture to perform VG tasks with an encoder-only transformer without a visual backbone. This is critical for speed. YORO is shown to achieve the best trade-off between speed and accuracy in the VG literature, as illustrated in Fig. 1(b).
3 YORO Architecture
The YORO architecture is summarized in Fig. 2. The input query sentence and image are first converted into text and visual embeddings respectively. These embeddings are augmented by a set of learnable detection tokens and fed into a transformer encoder. Within the transformer, learning happens via self-attention [72]. The transformer predicted features corresponding to the detection tokens are finally fed into several heads, implemented with a multi-layer perceptron, which produce class and bounding box predictions. The model is trained end-to-end, using a combination of losses. Each model component is discussed next.

3.1 Multi-modal Inputs
YORO consumes multi-modal inputs: language, vision and detection tokens.

The Language Modality is depicted in the blue block of Fig. 2. The referred text is converted into sequences of tokens with a BERT tokenizer [12]. Assume the sequence has length m and a dictionary size of v for the pre-trained tokenizer. Each text token t_i ∈ R^v in the sequence {t_1; ...; t_m} is then projected into a d-dimensional feature vector f^l_i = W^l t_i, where W^l ∈ R^{d×v} is a projection layer.
The m projected token vectors are then augmented with a text classification token f^l_{cls} ∈ R^d. Similar to [33], each of these tokens is summed with a text positional embedding f^l_{pos} ∈ R^{(m+1)×d} and a modality type embedding f^l_{type} ∈ R^{(m+1)×d}. The positional embedding specifies the location of each token, while the modality type embedding tells apart the text input from the visual input. The overall text input embedding may be written as

f^l = [f^l_{cls}; f^l_1; \ldots; f^l_m] + f^l_{pos} + f^l_{type}.    (1)
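As an illustration of how the text stream of (1) can be assembled, the following PyTorch sketch is provided. It is not the released implementation; the module name, feature dimension, and maximum length are assumptions.

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Hypothetical sketch of Eq. (1): embed BERT token ids, prepend a [CLS]
    token, then add positional and modality-type embeddings."""
    def __init__(self, vocab_size, d=768, max_len=40):
        super().__init__()
        # Multiplying a one-hot token by W^l is equivalent to an embedding lookup.
        self.proj = nn.Embedding(vocab_size, d)                  # W^l
        self.cls = nn.Parameter(torch.zeros(1, 1, d))            # f^l_cls
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d))  # f^l_pos
        self.type_emb = nn.Parameter(torch.zeros(1, 1, d))       # f^l_type

    def forward(self, token_ids):                     # token_ids: (B, m)
        x = self.proj(token_ids)                      # (B, m, d)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                # (B, m+1, d)
        return x + self.pos[:, : x.size(1)] + self.type_emb  # Eq. (1)
```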
The Vision Modality is shown in the green block of Fig. 2. The input image I ∈ R^{H×W×3} is partitioned into n visual patches {p_1, ..., p_n} of size s, where p_i ∈ R^{s²×3} and n = HW/s². Each visual patch is then projected, with a linear layer of weight matrix W^v ∈ R^{d×(s²×3)}, into a d-dimensional visual patch feature f^v_i = W^v p_i. Similar to the language input, the n visual patch features are augmented with a visual classification token f^v_{cls} ∈ R^d and summed with a visual positional embedding f^v_{pos} ∈ R^{(n+1)×d} and a modality type embedding f^v_{type} ∈ R^{(n+1)×d}, to produce an overall visual input embedding

f^v = [f^v_{cls}; f^v_1; \ldots; f^v_n] + f^v_{pos} + f^v_{type}.    (2)
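A matching sketch for the visual stream of (2) is given below. Again this is only an illustration; the image size and patch size are assumed values rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedding(nn.Module):
    """Hypothetical sketch of Eq. (2): split the image into s x s patches,
    flatten and project them with W^v, prepend a visual [CLS] token and add
    positional and modality-type embeddings."""
    def __init__(self, img_size=384, s=32, d=768):
        super().__init__()
        n = (img_size // s) ** 2
        self.proj = nn.Linear(3 * s * s, d)                   # W^v
        self.cls = nn.Parameter(torch.zeros(1, 1, d))         # f^v_cls
        self.pos = nn.Parameter(torch.zeros(1, n + 1, d))     # f^v_pos
        self.type_emb = nn.Parameter(torch.zeros(1, 1, d))    # f^v_type
        self.s = s

    def forward(self, img):                                   # img: (B, 3, H, W)
        # (B, 3*s*s, n): each column is one flattened patch p_i
        patches = F.unfold(img, kernel_size=self.s, stride=self.s)
        x = self.proj(patches.transpose(1, 2))                # (B, n, d)
        cls = self.cls.expand(img.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                        # (B, n+1, d)
        return x + self.pos[:, : x.size(1)] + self.type_emb   # Eq. (2)
```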
Fig. 2. YORO architecture. Blue, green, yellow and pink blocks represent the language, vision, detection and prediction branches respectively. YORO does not use a large pre-trained visual backbone. The input image is divided into patches; those of IOU ≥ 0.5 with the ground truth box (red) are marked in light green. (Color figure online)
Learnable Detection Tokens. The vision and language models above are similar to those implemented by ViT [14] and ViLT [33]. These models were proposed for tasks, such as image classification or visual question-answering, that do not require object localization. In this case, the use of the holistic classification tokens f^l_{cls}, f^v_{cls} is sufficient to encode the image identity. However, our experiments (see ablations in Sect. 5.3) show that this is not the case for VG, which depends critically on object localization. In the object detection literature, this problem is
addressed by introducing learnable detection tokens, which are individually associated with bounding boxes [4]. YORO leverages such tokens, as shown in the yellow block of Fig. 2. The text f^l and visual f^v embeddings are complemented by q learnable detection tokens f^{det} = [f^{det}_1; \ldots; f^{det}_q]. Each token represents an object and different tokens may represent objects of different sizes.

3.2 Transformer Encoder
YORO transformer encoder consists of D stacked layers, each including a multi-headed self-attention (MSA) network and a feed forward network (FFN). Denoting x^d as the input of layer d, the transformer output is computed as follows:

x^0 = [f^l; f^v; f^{det}]    (3)
\hat{x}^d = x^{d-1} + MSA(LN(x^{d-1})), \quad d = 1 \ldots D    (4)
x^d = \hat{x}^d + FFN(LN(\hat{x}^d)), \quad d = 1 \ldots D    (5)

where LN denotes layer normalization [72]. D is set to 12 in our experiments.
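The encoder of (3)-(5) is a standard pre-LayerNorm transformer. The sketch below shows one way to realize it in PyTorch; the hidden size, number of heads and FFN width are assumptions, not the paper's exact hyper-parameters.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of Eqs. (4)-(5): pre-LN multi-headed self-attention followed by
    a feed-forward network, both with residual connections."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                                   # x: (B, m+n+q+2, d)
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]    # Eq. (4)
        return x + self.ffn(self.ln2(x))                    # Eq. (5)

def encode(layers, f_text, f_vis, det_tokens):
    """Eq. (3): concatenate the three streams and run the D encoder layers."""
    det = det_tokens.unsqueeze(0).expand(f_text.size(0), -1, -1)  # (B, q, d)
    x = torch.cat([f_text, f_vis, det], dim=1)                    # x^0
    for layer in layers:                                          # D = 12 layers
        x = layer(x)
    return x
```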
3.3 Multi-modal Transformer Outputs
The transformer input [f^l; f^v; f^{det}] is a sequence of length n + m + q + 2. The output is a sequence [o^l; o^v; o^{det}] of the same length, where o^l = [o^l_{cls}; o^l_1; \ldots; o^l_m], o^v = [o^v_{cls}; o^v_1; \ldots; o^v_n] and o^{det} = [o^{det}_1; \ldots; o^{det}_q] are the outputs of the language, vision and detection branches respectively. The transformed text class token o^l_{cls} is added to the transformed detection tokens o^{det}_i before being passed to the detection heads. This strengthens the dependence of the bounding box predictions on the input text, beyond what would be possible by the use of attention alone.

3.4 Feature Projector and Detection Heads
To optimize YORO with the proposed losses, several modules are added to the transformer output, including a detection head for the bounding box and classification loss (red block in Fig. 2), and a projection head to enable the object-text alignment and the patch-text alignment loss discussed in Sect. 4. All these heads are stacked fully connected layers. We next discuss the details of all losses.
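To make the head structure concrete, a possible form of the stacked fully connected heads is sketched below. The output conventions (normalized box coordinates, one logit per text-token position) follow the descriptions in Sects. 3.3 and 4, but the exact dimensions and activations are assumptions.

```python
import torch.nn as nn

class MLP(nn.Module):
    """A stack of fully connected layers, as used for all heads."""
    def __init__(self, d_in, d_hidden, d_out, n_layers=3):
        super().__init__()
        dims = [d_in] + [d_hidden] * (n_layers - 1) + [d_out]
        layers = []
        for i in range(n_layers):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < n_layers - 1:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

d, max_tokens = 768, 40                  # assumed dimensions
bbox_head = MLP(d, d, 4)                 # box coordinates per detection token
cls_head = MLP(d, d, max_tokens)         # softmax over text-token positions (Sect. 4)
proj_det = nn.Linear(d, 256)             # H_det projector for the alignment losses
proj_txt = nn.Linear(d, 256)             # H_l projector

def predict(o_det, o_l_cls):
    # Sect. 3.3: the transformed text [CLS] feature is added to each detection token.
    h = o_det + o_l_cls.unsqueeze(1)
    return bbox_head(h).sigmoid(), cls_head(h)   # sigmoid box normalization is an assumption
```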
4 Training Losses
YORO is trained with 4 different losses: (1) a bounding box regression loss to regress bounding box coordinates, (2) a classification loss to classify which text tokens the bounding box is associated to, (3) an object-text alignment loss that aligns detection token features with corresponding text features and (4) a novel patch-text alignment loss that aligns image patches and text tokens. During training, Hungarian matching [35] computes the optimal bipartite matching
between predicted and ground truth boxes. During inference, the bounding box with the highest classification score is selected.

Bounding Box Regression Loss. Two losses are used for bounding box regression: the L_1 loss and the generalized IoU loss [64] L_{giou}. These losses are applied to ground truth bounding box B_{g.t.} ∈ R^4 and predicted box B_{pred} ∈ R^4 pairs. This results in the bounding box regression loss

L_{bbox} = \lambda_1 L_1(B_{pred}, B_{g.t.}) + \lambda_2 L_{giou}(B_{pred}, B_{g.t.}),    (6)

where λ_1 and λ_2 are experimentally set to 2 and 5 respectively.
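The following sketch illustrates the bipartite assignment and the regression loss of (6) with SciPy and torchvision utilities. Using the regression cost alone as the matching cost is a simplification introduced here, and the (x1, y1, x2, y2) box format is an assumption.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def match_and_bbox_loss(pred_boxes, gt_boxes, l1_w=2.0, giou_w=5.0):
    """Sketch of the Hungarian matching and Eq. (6); lambda_1 = 2, lambda_2 = 5."""
    # pred_boxes: (q, 4) from the q detection tokens; gt_boxes: (g, 4)
    l1 = torch.cdist(pred_boxes, gt_boxes, p=1)                  # (q, g) pairwise L1
    giou = generalized_box_iou(pred_boxes, gt_boxes)             # (q, g) pairwise GIoU
    cost = l1_w * l1 + giou_w * (1.0 - giou)                     # L1 + GIoU loss terms
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return cost[rows, cols].mean(), (rows, cols)
```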
Classification Loss. Unlike the classification losses of object detection, where a single object class is predicted, YORO predicts the set of language tokens to be associated with a bounding box. This is needed to encourage the association of the text phrase with the corresponding objects. For example, in Fig. 2 the red box in the image is associated with the words "fuzzy" and "bench" with probabilities 0.5 and 0.5, respectively. This is implemented by generating predictions with a softmax output of dimension equal to the maximum token length and using a soft cross entropy loss L_{sce} based on the ground-truth text/object assignments. This is the classification loss L_{cls} = L_{sce}(P_{pred}, P_{g.t.}), where P_{pred} is the predicted probability distribution and P_{g.t.} the ground truth distribution.

Object-Text Alignment Loss. Inspired by recent uses of region-text alignment for fusing words and image regions [5,7,13,30,36,39], YORO further aligns the language transformer output o^l and the detection transformer output o^{det} in a suitable feature space. o^l and o^{det} are first mapped to a common feature space F, with projection heads H_l and H_{det}, which implement the mappings \hat{o}^l = H_l(o^l) and \hat{o}^{det} = H_{det}(o^{det}), where H_l, H_{det} are linear projections. Consider an image with g ground truth bounding boxes and let T_i ⊆ {j : 1 ≤ j ≤ m}, where m is the length of the language token sequence, be the subset of language token indices that a projected detection feature \hat{o}^{det}_i should be aligned to. This alignment is encouraged with the object-to-text alignment (OTA) loss

L^{OTA} = \sum_{i=1}^{g} \frac{1}{|T_i|} \sum_{j \in T_i} -\log \frac{\exp(\hat{o}^{det}_i \hat{o}^{l}_j / \tau)}{\sum_{k=1}^{m} \exp(\hat{o}^{det}_i \hat{o}^{l}_k / \tau)},    (7)
where τ is experimentally set to 0.07. Consider Fig. 2 and assume, for example, that detection token 4 is selected to match the ground truth (red) bounding box by the Hungarian matching algorithm. In this case, L^{OTA} encourages the embeddings \hat{o}^l_0 and \hat{o}^l_1 of the words "fuzzy" and "bench" to be closer to the embedding of \hat{o}^{det}_4 than those of the words "closest," "to," and "you". Similarly, let O_i ⊆ {j : 1 ≤ j ≤ g} be the set of object indices that a projected language token feature \hat{o}^l_i should be aligned to. This alignment is encouraged by the text-object alignment loss

L^{TOA} = \sum_{i=1}^{m} \frac{1}{|O_i|} \sum_{j \in O_i} -\log \frac{\exp(\hat{o}^{l}_i \hat{o}^{det}_j / \tau)}{\sum_{k=1}^{g} \exp(\hat{o}^{l}_i \hat{o}^{det}_k / \tau)},    (8)

which is a symmetric version of L^{OTA}. The overall object alignment loss is L^{OA} = ½ L^{OTA} + ½ L^{TOA}.
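A sketch of the symmetric object alignment loss L^{OA} of (7)-(8) is shown below. The feature dimensions and the layout of the binary alignment matrix are assumptions, and the matched, projected detection features are taken as given.

```python
import torch
import torch.nn.functional as F

def object_alignment_loss(o_det_hat, o_l_hat, align, tau=0.07):
    """Sketch of L^OA = 0.5 * L^OTA + 0.5 * L^TOA (Eqs. 7-8).
    o_det_hat: (g, d') projected features of the matched detection tokens.
    o_l_hat:   (m, d') projected text-token features.
    align:     (g, m) binary matrix, align[i, j] = 1 if object i matches token j."""
    logits = o_det_hat @ o_l_hat.t() / tau              # (g, m) similarity scores
    log_p_ota = F.log_softmax(logits, dim=1)            # normalize over text tokens
    log_p_toa = F.log_softmax(logits.t(), dim=1)        # normalize over objects
    n_t = align.sum(dim=1).clamp(min=1)                 # |T_i|
    n_o = align.t().sum(dim=1).clamp(min=1)             # |O_i|
    l_ota = -(log_p_ota * align).sum(dim=1) / n_t       # one term per object, Eq. (7)
    l_toa = -(log_p_toa * align.t()).sum(dim=1) / n_o   # one term per token, Eq. (8)
    return 0.5 * l_ota.sum() + 0.5 * l_toa.sum()
```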
Patch-Text Alignment Loss. While the object-text alignment loss is desirable to encourage feature space alignment between predicted boxes and corresponding text tokens, this assumes that the predicted boxes are correct. However, as illustrated in Fig. 3, this assumption is frequently violated during training, especially in the early epochs. In this case, the object-text alignment loss can provide ambiguous supervision (e.g. supervision from background instead of the object in the early epochs of Fig. 3). To address this limitation, we propose to align the features extracted from the ground truth patches at the input of the network (green patches of Fig. 2) with the features of the associated text tokens. More specifically, the indices of input patches (p_5 and p_6 in Fig. 2) that have at least 0.5 IOU with the ground truth box (red box in Fig. 2) are first identified and the corresponding features aligned with those of the ground truth language tokens ("fuzzy" and "bench" in Fig. 2). As shown in Fig. 3, while each patch feature may not provide full information about the object, the patch features never provide contradictory supervision. This helps stabilize the learning.
Fig. 3. Visualization of the predicted box through training epochs (Epoch 0, Epoch 5, Epoch 20). Red, green and blue boxes indicate the ground truth box, ground truth patch and predicted box, respectively. The Patch-Text Alignment loss helps localization. Best viewed in color digitally. (Color figure online)
Fig. 4. Patch-Text Alignment Loss: (a) Ground truth for patch-text alignment. (b) Column wise normalization (c) Row wise normalization.
To align text and patch features, the text transformer output $o^l$ and vision transformer outputs $o^v$ are first projected to a shared feature space using the patch alignment projection heads $H^{PA}_l$ and $H^{PA}_v$, respectively, which leads to $\tilde{o}^l$ and $\tilde{o}^v$. Given m text tokens and n visual patches, a ground truth table $A \in \mathbb{R}^{m \times n}$ is then created, where $A_{ij}$ is 1 if text token i and visual patch j should be aligned and 0 otherwise, as shown in Fig. 4(a). This matrix is column normalized into matrix $A^c$ (Fig. 4(b)) using

$$
A^c_{ij} = \begin{cases} \dfrac{A_{ij}}{\sum_i A_{ij}} & \text{if } i \in S^{TPA} \\ 0 & \text{otherwise,} \end{cases} \qquad (9)
$$

where $S^{TPA} = \{ i \mid \sum_j A_{ij} > 0 \}$ is the set of token indices associated with at least 1 patch. For example, $S^{TPA} = \{1, 2\}$ for Fig. 4. The row-wise normalization of A into matrix $A^r$ is defined in a similar fashion, as shown in Fig. 4(c). To encourage the fine-grained alignment between patches and text, a text-patch alignment loss $L_{TPA}$ is defined as

$$
L_{TPA} = \sum_{i \in S^{TPA}} \sum_{j} p^c_{ij} \ln \frac{p^c_{ij}}{A^c_{ij}}, \qquad (10)
$$

where

$$
p^c_{ij} = \frac{\exp(\tilde{o}^l_i \cdot \tilde{o}^v_j / \tau)}{\sum_{k} \exp(\tilde{o}^l_i \cdot \tilde{o}^v_k / \tau)}
$$

is a contrastive distribution defined over the language
and visual embedding. For example, the word “fuzzy” ($t_1$) of Fig. 4(b) should be aligned with patches $p_5$ and $p_6$ according to the ground truth distribution $A^c_1 = [0; 0; 0; 0; 0.5; 0.5]$. (10) minimizes the KL divergence between this and the predicted distribution $p^c_1 = [p^c_{11}; \ldots; p^c_{1n}]$. This encourages the alignment of the feature vector $\tilde{o}^l_1$ extracted from word $t_1$ with the feature vectors $\tilde{o}^v_5$ and $\tilde{o}^v_6$ extracted from patches $p_5$ and $p_6$. Similar to the $L_{TPA}$ of (10), the patch-text alignment loss $L_{PTA}$ aligns each patch to its associated tokens. As shown in Fig. 4(c), $p_6$ should be aligned to $t_1$ and $t_2$. The final patch alignment loss is $L_{PA} = \frac{1}{2} L_{TPA} + \frac{1}{2} L_{PTA}$.

Combined Loss. The final loss combines the bounding box regression loss, the classification loss, the object alignment loss and the patch alignment loss with $L = L_{bbox} + L_{cls} + L_{OA} + L_{PA}$.
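As a concrete illustration of the patch-text alignment of (9)–(10), here is a minimal PyTorch sketch; the tensor names, shapes, and numerical clamping are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the text-patch alignment loss (10).
# A is the m x n ground-truth table of Fig. 4(a); o_l_tilde / o_v_tilde are the
# projected text and patch features. Zeros are clamped for numerical stability.
import torch
import torch.nn.functional as F

def text_patch_alignment_loss(o_l_tilde, o_v_tilde, A, tau=0.07):
    """o_l_tilde: (m, d) projected text token features.
    o_v_tilde: (n, d) projected visual patch features.
    A: (m, n) binary ground-truth alignment table."""
    keep = A.sum(dim=1) > 0                                   # tokens in S^TPA
    A_c = A / A.sum(dim=0, keepdim=True).clamp(min=1)         # column-wise normalization (Eq. 9)
    logits = o_l_tilde @ o_v_tilde.t() / tau
    p_c = F.softmax(logits, dim=1)                            # contrastive distribution p^c_ij
    # KL(p^c_i || A^c_i), summed over tokens with at least one positive patch (Eq. 10)
    kl = (p_c * (p_c.clamp(min=1e-8) / A_c.clamp(min=1e-8)).log()).sum(dim=1)
    return kl[keep].sum()
```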
Table 1. Architecture Design Ablation: Comparison between YORO and ablated versions missing either the learnable det tokens (w/o det) or the cls feature $o^l_{cls}$ (w/o cls).

| Model variant | Refcoco Val | Refcoco TestA | Refcoco TestB | Refcoco+ Val | Refcoco+ TestA | Refcoco+ TestB | Refcocog Val | Refcocog Test |
| YORO w/o det. | 74.9 (−8.0) | 79.4 (−6.2) | 68.8 (−8.6) | 66.6 (−6.9) | 73.2 (−5.4) | 59.2 (−5.7) | 69.7 (−3.7) | 68.9 (−5.4) |
| YORO w/o cls. | 82.4 (−0.5) | 84.8 (−0.8) | 76.7 (−0.7) | 72.8 (−0.7) | 77.7 (−0.9) | 64.0 (−0.9) | 72.4 (−1.0) | 73.9 (−0.4) |
| YORO | 82.9 | 85.6 | 77.4 | 73.5 | 78.6 | 64.9 | 73.4 | 74.3 |
Fig. 5. (a) YORO w/o det. token. (b) YORO w/o cls. feature $o^l_{cls}$.
5 Experiments

5.1 Dataset
YORO is evaluated on five VG datasets using the accuracy metric, where a predicted box is correct when its IOU with the ground truth box is at least 0.5.

RefCoco/RefCoco+/RefCocog [31,82] are curated from MSCOCO [41]. RefCOCO consists of 142,209 referred expressions for 50,000 objects in 19,994 images. We adopt the split of [7,10,43,69,74,80,87] into train, val, testA, and testB datasets with 120,624, 10,834, 5,657 and 5,095 expressions, respectively. RefCOCO+ contains 141,564 expressions for 49,856 objects in 19,992 images. We use the split of [6,19,42,50,80,90], where the train, val, testA and testB datasets contain 120,191, 10,758, 5,726 and 4,889 image-text pairs, respectively. RefCOCOg consists of 85,474 image-text pairs of 54,822 objects across 26,711 images. We adopt the UMD split of [42,45,50,77,81] into 80,512, 4,896 and 9,602 pairs for the train, val and test datasets, respectively.

ReferItGame [31] contains 19,997 images from the SAIAPR-12 [16] dataset. It has 130,363 text expressions for 99,296 objects. We follow the split of [25,51,73,81,87] for the training, testing and validation sets.

CopsRef [8], derived from GQA [28], contains 508 object classes and an average expression length of 14.4. When compared to RefCOCO and RefCOCOg (80 object classes and average length around 6), CopsRef has longer expressions and more diverse object classes. We adopt the WithoutDist version of the dataset [8,15], where train, val and test contain 119,628, 12,586 and 16,524 pairs, respectively.
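For clarity, a minimal sketch of the evaluation metric described above (accuracy at IoU ≥ 0.5); the (x1, y1, x2, y2) box format is an assumption for illustration.

```python
# Hypothetical sketch of the accuracy metric: a prediction counts as correct
# when its IoU with the ground-truth box is at least 0.5.
# Boxes are assumed to be (x1, y1, x2, y2) in pixels.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    correct = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```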
5.2 Training and Evaluation Details
We summarize some implementation details for reproducibility. YORO uses the bert-base-uncased [12] pre-trained tokenizer from HuggingFace [75] with a maximum text length of 40. The pre-trained checkpoint from [33] is used to obtain initial weights. Five detection tokens are used (see Table 3 for a token number ablation) and randomly initialized using a normal distribution with zero mean and standard deviation 0.02. The AdamW optimizer [47] with an initial learning rate of $10^{-4}$ and weight decay of $10^{-2}$ is adopted. A learning-rate warm-up is used for the first 10% of training epochs, after which the learning rate is linearly decayed to 0.
Table 2. Loss Ablation. CL: Classification loss, RE: Bounding box regression loss, OA: Object-text Alignment loss, PA: Patch-text Alignment loss.

| CL | RE | OA | PA | CopsRef Val. | CopsRef Test | ReferItGame Val. | ReferItGame Test |
| ✓ | ✓ |   |   | 67.05 | 70.78 | 71.55 | 69.86 |
| ✓ | ✓ | ✓ |   | 67.73 (+0.68) | 71.0 (+0.22) | 71.89 (+0.34) | 70.39 (+0.53) |
| ✓ | ✓ | ✓ | ✓ | 68.08 (+1.03) | 71.3 (+0.52) | 72.67 (+1.12) | 71.9 (+2.04) |

Table 3. Ablation of the number of detection tokens on RefCoco+.

| Det. Tok. # | Val. | TestA | TestB |
| 1 | 73.2 | 79.1 | 64.9 |
| 5 | 73.5 | 78.6 | 64.9 |
| 10 | 72.8 | 78.78 | 62.8 |
YORO is pre-trained on the concatenated detection dataset curated by [30] for 40 epochs. Since the VG task contains a single referred object, image-text pairs with a single bounding box are sampled. This pre-training allows a good initialization of the detection branch. The pre-trained model is fine-tuned on the downstream datasets for 40 epochs. All experiments are conducted using PyTorch [53] with batch size 128. No VL augmentation like MixGen [22] was used.
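A minimal sketch of the optimization schedule described above (AdamW with 10% linear warm-up followed by linear decay to 0); the step bookkeeping and the use of LambdaLR are our own assumptions, not the authors' training code.

```python
# Hypothetical sketch of the training schedule: AdamW (lr 1e-4, weight decay 1e-2),
# linear warm-up over the first 10% of steps, then linear decay to 0.
import torch

def build_optimizer_and_scheduler(model, total_steps, base_lr=1e-4, weight_decay=1e-2):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    warmup_steps = int(0.1 * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                      # linear warm-up
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))  # decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```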
5.3 Ablation Study
Architecture Design. The performance of YORO is compared to the two variants of Fig. 5 on RefCoco/RefCoco+/RefCocog, with the results shown in Table 1.

YORO w/o Det. Token. Figure 5(a) shows a simple extension of the ViLT [33] architecture to VG. In this case, the classification feature of the text modality $o^l_{cls}$ (see Sect. 3.3) is forwarded through the detection head to regress a bounding box, without the use of the detection branch or any detection tokens. The first row of Table 1 shows that, when compared to YORO, this is between 3.7% and 8.6% worse. For almost all splits, the performance loss is larger than 5%, demonstrating a clear benefit in using detection tokens for the VG task.

YORO w/o Cls. Feature $o^l_{cls}$. Conversely, the use of detection tokens without the holistic classification feature $o^l_{cls}$ is implemented with the architecture of Fig. 5(b). This performs slightly worse (0.4% to 1%) than YORO. The holistic feature contributes to YORO's performance but is much less critical than the detection tokens. Altogether, these ablations support the proposed design.

Loss Function. Table 2 studies the importance of the various losses of Sect. 4. The baseline consists of the classification (CL) and bounding-box regression (RE) losses. Adding the object-text alignment (OA) loss improves accuracy by +0.22% and +0.53%. When the PA loss is also optimized, the gain over the baseline increases to +0.52% and +2.04% on the test sets of CopsRef and ReferItGame, respectively. This indicates that both the OA and PA losses improve detection accuracy on multiple datasets. The improved performance of PA validates the importance of accounting for the inaccuracy of bounding box predictions in the early training epochs.

Number of Detection Tokens. Table 3 presents an ablation of the number of detection tokens.
Table 4. Refcoco/Refcoco+/Refcocog. Compared with single-stage models, YORO achieves the best accuracy. YORO is one of the smallest models and is also the fastest.

| Method | Backbone (Lang.) | Backbone (Visual) | Refcoco Val | Refcoco TestA | Refcoco TestB | Refcoco+ Val | Refcoco+ TestA | Refcoco+ TestB | Refcocog Val | Refcocog Test | Param. (M) | FPS |
| FAOA [79] | Bert | Darknet53 | 72.05 | 74.81 | 67.91 | - | - | - | - | - | 182.9 | 18.9 |
| RCCF [40] | BiLSTM | DLA34 | - | 81.06 | 71.85 | - | 70.35 | 56.32 | - | - | - | 40 |
| SSG [6] | BiLSTM | Darknet53 | - | 76.51 | 67.50 | - | 62.14 | 49.27 | 58.80 | - | - | - |
| MCN [50] | BiGRU | Darknet53 | 80.08 | 82.29 | 74.98 | 67.16 | 72.86 | 57.31 | 66.46 | 66.01 | - | - |
| ReSC [78] | Bert | Darknet53 | 76.59 | 78.22 | 73.25 | 63.23 | 66.64 | 55.53 | 64.87 | 64.87 | 179.9 | 15.8 |
| TransVG [11] | Bert | Res101 | 81.02 | 82.72 | 78.35 | 64.82 | 70.70 | 56.94 | 68.67 | 67.73 | 149.7 | 18.1 |
| YORO | Bert | Linear | 82.9 | 85.6 | 77.4 | 73.5 | 78.6 | 64.9 | 73.4 | 74.3 | 114.3 | 41.9 |

Table 5. Compared to single/two-stage methods, YORO achieves SOTA on ReferItGame [31].

| Two-stage | Acc. | Single-stage (*) | Acc. |
| CMN [25] | 28.33 | RCCF* [40] | 63.79 |
| VC [87] | 31.13 | SSG* [6] | 54.24 |
| Luo [51] | 31.85 | ReSC-B* [78] | 64.33 |
| MAttNet [81] | 29.04 | ReSC-L* [78] | 64.60 |
| Sim. Net [73] | 34.54 | FAOA* [79] | 59.30 |
| CITE [54] | 35.07 | ZSGNet* [66] | 58.63 |
| PIRC [34] | 59.13 | TransVG* [11] | 70.73 |
| DDPN [85] | 63.00 | YORO* | 71.90 |

Table 6. YORO achieves SOTA on CopsRef [8] w/o using g.t. box.

| Method (* indicates single stage) | Proposal | Acc. |
| GroundeR [65] | g.t. | 75.7 |
| Shuffle-GroundeR | g.t. | 58.5 |
| Obj-Attr-GroundeR | g.t. | 68.8 |
| MattNet [81] | g.t. | 77.9 |
| CM-Att-Erase [45] | g.t. | 80.4 |
| MattNet-Mine [8] | g.t. | 78.4 |
| VGTR-ResNet50 [15] | Pred. | 66.73 |
| VGTR-ResNet101 [15] | Pred. | 67.75 |
| ReSC-B* [78] | Pred. | 64.49 |
| ReSC-L* [78] | Pred. | 65.32 |
| YORO* | Pred. | 71.3 |
On average, using 1 or 5 tokens gives comparable results, with a slight decrease for 10 tokens. Since the VG task only contains a single referent, the lack of benefit from using a larger number of detection tokens is sensible.
5.4 Quantitative Comparison
RefCoco/RefCoco+/RefCocog. Table 4 summarizes the performance of single-stage methods on RefCoco, RefCoco+ and RefCocog for the different splits (val, testA and testB). These are the methods that achieve real-time speed (FPS > 15) and are thus directly comparable to YORO. YORO outperforms all single-stage methods on seven splits (by between +1.88% and +8.6% absolute). The only exception is Refcoco testB (−0.95% lower than TransVG [11]). The most competitive method in this class is TransVG, which has 1.3× the size, is 2.3× slower than YORO, and achieves significantly lower accuracies on most splits. The only method with speed comparable to YORO (RCCF) has significantly lower accuracy on all splits (up to a 14 point drop on Refcocog Val). A more extensive comparison, including the much heavier two-stage models, is presented in the appendix. Overall, as summarized in Fig. 1(b) and Fig. 6, YORO has a significantly better trade-off between speed, parameter size and accuracy than all methods in the literature.
Fig. 6. YORO beats the prior Pareto front for the memory cost/accuracy trade-off.

Fig. 7. Top: Predicted (blue) and ground truth (red) box, for example queries such as “big cow” (a) and “blue mattress” (b). Bottom: Attention from the selected head. (Color figure online)
ReferItGame. Table 5 compares YORO to both two-stage (left) and single-stage (right) baselines. YORO outperforms the best two-stage (DDPN [85]) and single-stage (TransVG [11]) methods by 8.9% and 1.2%, respectively. For roughly the same inference speed, YORO also beats RCCF [40] by 8.1%. Note that YORO achieves SOTA on ReferItGame for both speed and performance.

CopsRef. Table 6 summarizes the accuracy of YORO and the baselines on CopsRef [8]. The lower part of the table contains baselines that either detect the referent or select the referent from object proposals. YORO outperforms the previous SOTA by 3.4%, without using proposals. The upper part of the table refers to models that use ground truth bounding boxes. YORO fares well even under this unfair comparison, outperforming some of these baselines. Overall, these results demonstrate that YORO can generalize to VG datasets containing more object classes and longer input text.
5.5 Comparison of Trade-Offs Between Size, Speed, and Accuracy
To quantify the efficiency of YORO, we compare YORO with other VG baselines in terms of parameter size and inference speed. Inference speed (FPS) is measured by forwarding 100 images (batch size 1) from the Refcoco testA dataset through the VG model. The results are shown in the rightmost column of Table 4 and Fig. 1(b). All FPS measurements besides RCCF [40], MattNet [81] and DGA [77] were conducted by ourselves, using a single Titan Xp GPU and an Intel Xeon CPU E5-2630 v4 @ 2.20GHz.¹ The FPS of [40,77,81] are copied from [40], which measures speed on a Titan Xp GPU (identical to ours) and an Intel Xeon CPU E5-2680 v4 @ 2.4GHz (superior to ours).
¹ Note that [79] reported an FPS of 26.3 for FAOA, using an NVIDIA 1080TI and Intel Core i9-9900K @ 3.60GHz set-up, while we measured an FPS of 18.9 under our set-up.
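A minimal sketch of how such an FPS measurement can be done (forwarding samples one at a time and averaging wall-clock time); the model interface and the CUDA synchronization calls are our own assumptions.

```python
# Hypothetical FPS measurement sketch: forward 100 samples with batch size 1
# and report images per second. Assumes `model(image, text)` is the VG forward pass.
import time
import torch

@torch.no_grad()
def measure_fps(model, samples, device="cuda"):
    model.eval().to(device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for image, text in samples:            # e.g. 100 (image, text) pairs
        model(image.to(device), text)
    if device == "cuda":
        torch.cuda.synchronize()
    return len(samples) / (time.time() - start)
```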
Fig. 8. YORO Qualitative: Visualization of the predicted box (blue) and ground truth box (red) in Refcoco+ (a–b), Refcocog (c–d), ReferItGame (e–f) and CopsRef (g–h). The input text for the referent is listed below each image: (a) “remote at 900”, (b) “raspberries”, (c) “a sandwich with colby jack cheese, tomato, and lettuce, on fresh cut bread”, (d) “a computer screen with a blue background and itunes open”, (e) “horse 2nd from left”, (f) “blue bldg, top floor leftest window”, (g) “The cat that is not brown and that is not to the right of the chair”, (h) “The tree that is not leafless and that is not to the right of the sign”. (Color figure online)
YORO achieves 41.9 FPS and is the fastest VG model in the literature. It is between 3.3× and 14× faster than multi-stage models and at least twice as fast as all single-stage models other than RCCF, which has much weaker accuracy. Figure 1(b) summarizes the trade-off between speed and accuracy of several methods on Refcoco testA. Single-stage methods tend to have significantly lower accuracy than multi-stage methods. The major exception is YORO, which is substantially more accurate than the other single-stage methods, bridging a significant portion of the accuracy gap between the two types of models. Note that YORO is well above the line connecting MDETR and RCCF, which determines the Pareto front for the VG problem. It also has a large gain in inference speed (close to 3.5× faster) over MDETR. More plots are shown in the supplemental. We also break down the percentage of inference time consumed per YORO module. The multi-modal VG inputs (Sect. 3.1), transformer encoder (Sect. 3.2) and detection head (Sect. 3.4) consume 33.5%, 62.8% and 3.7% of the inference time, respectively. This suggests that future research should focus on optimizing the design of the transformer layers for both speed and accuracy. Parameter sizes are compared in the second column from the right of Table 4 and in Fig. 6. YORO is smaller than all baselines besides NMTREE [42], which uses an LSTM-based language model. Note that, in particular, YORO is the smallest of the single-stage methods (see the green × in Fig. 6).
5.6 Qualitative Results
Figure 7 shows the patches of higher attention weight for two example images, computed as follows. Let the detection token corresponding to the predicted box
be token k. The attention weight between each patch and detection token k is converted to a heatmap and overlaid on the input image. It is clear that YORO attends to the referred objects. See the appendix for other examples. Figure 8 and the appendix show predictions by YORO on all 5 datasets. YORO can perform high-quality detection not only on short sentences (a–b), but also on longer and convoluted sentences (c–h). It also performs well in the presence of challenging input text, such as the digits in (a, e) and the abbreviation in (f), and for objects of various sizes (b, f, h).
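The attention visualization described above can be reproduced with a short sketch like the following; the way the attention matrix is exposed by the model and the location of the patch tokens in the sequence are assumptions for illustration.

```python
# Hypothetical sketch of the attention visualization: take the attention weights
# from the selected detection token to the image patches, reshape them to the
# patch grid, upsample to the image size, and normalize to [0, 1] for overlay.
import torch
import torch.nn.functional as F

def detection_token_heatmap(attn, det_token_idx, patch_slice, grid_hw, image_hw):
    """attn: (num_tokens, num_tokens) attention matrix of a chosen layer/head.
    det_token_idx: index of the detection token matched to the predicted box.
    patch_slice: slice selecting the image-patch positions in the token sequence.
    grid_hw: (H_p, W_p) patch grid; image_hw: (H, W) input image size."""
    weights = attn[det_token_idx, patch_slice]                  # attention to patches
    heat = weights.reshape(1, 1, *grid_hw)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat[0, 0]                                           # (H, W) map in [0, 1]
```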
6 Conclusion
In this work, we investigate the speed-accuracy trade-off of VG models. To pursue a better trade-off, we propose YORO, a single-stage end-to-end trainable architecture. A novel patch-text alignment loss is proposed to improve the alignment between language and image features. Experiments show that YORO outperforms existing single-stage methods. It is the fastest VG model in the literature and achieves the best trade-off between size, speed, and accuracy. We believe this is of interest for many resource-constrained VG applications.
References 1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: DocFormer: end-toend transformer for document understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 993–1003, October 2021 2. Biten, A.F., Litman, R., Xie, Y., Appalaraju, S., Manmatha, R.: LaTr: layoutaware transformer for scene-text VQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16548–16558, June 2022 3. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020) 4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. ArXiv abs/2005.12872 (2020) 5. Chen, T., Deng, J., Luo, J.: Adaptive offline quintuplet loss for image-text matching. arXiv preprint arXiv:2003.03669 (2020) 6. Chen, X., Ma, L., Chen, J., Jie, Z., Liu, W., Luo, J.: Real-time referring expression comprehension by single-stage grounding network. ArXiv abs/1812.03426 (2018) 7. Chen, Y.-C., et al.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058577-8 7 8. Chen, Z., Wang, P., Ma, L., Wong, K.Y.K., Wu, Q.: Cops-Ref: a new dataset and task on compositional referring expression comprehension. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020 9. Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: ICML (2021)
10. Deng, C., Wu, Q., Wu, Q., Hu, F., Lyu, F., Tan, M.: Visual grounding via accumulated attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018 11. Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: TransVG: end-to-end visual grounding with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1769–1779, October 2021 12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019) 13. Diao, H., Zhang, Y., Ma, L., Lu, H.: Similarity reasoning and filtration for imagetext matching. In: AAAI (2021) 14. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy 15. Du, Y., Fu, Z., Liu, Q., Wang, Y.: Visual grounding with transformers. CoRR abs/2105.04281 (2021). https://arxiv.org/abs/2105.04281 16. Escalante, H.J., et al.: The segmented and annotated IAPR TC-12 benchmark. Comput. Vis. Image Underst. 114(4), 419–428 (2010). https://doi.org/10.1016/j. cviu.2009.03.008 17. Fang, Y., et al.: You only look at one sequence: rethinking transformer in vision through object detection. arXiv preprint arXiv:2106.00666 (2021) 18. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 457–468. Association for Computational Linguistics, November 2016. https://doi.org/10.18653/v1/D16-1044. https://aclanthology.org/D16-1044 19. Gan, Z., Chen, Y.C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. In: NeurIPS (2020) 20. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6325–6334 (2017) 21. Han, K., et al.: A survey on visual transformer. ArXiv abs/2012.12556 (2020) 22. Hao, X., et al.: MixGen: a new multi-modal data augmentation (2022). https:// doi.org/10.48550/ARXIV.2206.08358. https://arxiv.org/abs/2206.08358 23. Howard, A.G., et al.: Searching for MobileNetV3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314–1324 (2019) 24. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017). http://arxiv.org/abs/ 1704.04861 25. Hu, R., Rohrbach, M., Andreas, J., Darrell, T., Saenko, K.: Modeling relationships in referential expressions with compositional modular networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4418–4427 (2017) 26. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 27. Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3296–3297 (2017)
28. Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6693–6702 (2019) 29. Jiao, L., et al.: A survey of deep learning-based object detection. IEEE Access 7, 128837–128868 (2019). https://doi.org/10.1109/ACCESS.2019.2939201 30. Kamath, A., Singh, M., LeCun, Y., Misra, I., Synnaeve, G., Carion, N.: MDETRmodulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763 (2021) 31. Kazemzadeh, S., Ordonez, V., Matten, M.A., Berg, T.L.: ReferitGame: referring to objects in photographs of natural scenes. In: EMNLP (2014) 32. Khan, S.H., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: a survey. ArXiv abs/2101.01169 (2021) 33. Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 5583–5594. PMLR, 18–24 July 2021. http://proceedings. mlr.press/v139/kim21k.html 34. Kovvuri, R., Nevatia, R.: PIRC Net: using proposal indexing, relationships and context for phrase grounding. CoRR abs/1812.03213 (2018). http://arxiv.org/abs/ 1812.03213 35. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955). https://doi.org/10.1002/nav.3800020109 36. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. arXiv preprint arXiv:1803.08024 (2018) 37. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Association for Computational Linguistics, July 2020. https://doi.org/10.18653/ v1/2020.acl-main.703. https://aclanthology.org/2020.acl-main.703 38. Li, C., Feh´erv´ ari, I., Zhao, X., Macedo, I., Appalaraju, S.: SeeTek: very large-scale open-set logo recognition with text-aware metric learning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2544–2553, January 2022 39. Li, K., Zhang, Y., Li, K., Li, Y., Fu, Y.: Visual semantic reasoning for image-text matching. In: ICCV (2019) 40. Liao, Y., et al.: A real-time cross-modality correlation filtering method for referring expression comprehension. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10877–10886 (2020) 41. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 42. Liu, D., Zhang, H., Wu, F., Zha, Z.J.: Learning to assemble neural module tree networks for visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019 43. Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4866–4874 (2017). https://doi.org/10.1109/ICCV.2017.520 44. Liu, L., et al.: Deep learning for generic object detection: a survey. Int. J. Comput. Vis. 128, 261–318 (2019). https://doi.org/10.1007/s11263-019-01247-4
45. Liu, X., Wang, Z., Shao, J., Wang, X., Li, H.: Improving referring expression grounding with cross-modal attention-guided erasing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019 46. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. ArXiv abs/1907.11692 (2019) 47. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019). https://openreview.net/forum? id=Bkg6RiCqY7 48. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, pp. 13–23 (2019) 49. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020 50. Luo, G., et al.: Multi-task collaborative network for joint referring expression comprehension and segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10031–10040 (2020) 51. Luo, R., Shakhnarovich, G.: Comprehension-guided referring expressions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 52. Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 792–807. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46493-0 48 53. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. ArXiv abs/1912.01703 (2019) 54. Plummer, B.A., Kordas, P., Kiapour, M.H., Zheng, S., Piramuthu, R., Lazebnik, S.: Conditional image-text embedding networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 258–274. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8 16 55. Qi, W., et al.: ProphetNet: predicting future N-gram for sequence-to-sequence pretraining. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 2401–2410 (2020) 56. Radford, A., et al.: Learning transferable visual models from natural language supervision. CoRR abs/2103.00020 (2021). https://arxiv.org/abs/2103.00020 57. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020). http://jmlr.org/papers/ v21/20-074.html 58. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv abs/1910.10683 (2020) 59. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. ArXiv preprint (2021) 60. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, realtime object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016). https://doi.org/10.1109/CVPR.2016.91 61. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525 (2017) 62. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. ArXiv abs/1804.02767 (2018)
63. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 2015, vol. 1, pp. 91– 99. MIT Press, Cambridge (2015) 64. Rezatofighi, S.H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I.D., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 658–666 (2019) 65. Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. ArXiv abs/1511.03745 (2016) 66. Sadhu, A., Chen, K., Nevatia, R.: Zero-shot grounding of objects from natural language queries. In: The IEEE International Conference on Computer Vision (ICCV), October 2019 67. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018) 68. Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: transformer for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7262–7272, October 2021 69. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (2020). https:// openreview.net/forum?id=SygXPaEYvH 70. Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., Artzi, Y.: A corpus for reasoning about natural language grounded in photographs. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6418–6428. Association for Computational Linguistics, July 2019. https://doi. org/10.18653/v1/P19-1644. https://aclanthology.org/P19-1644 71. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, vol. 139, pp. 10347–10357, July 2021 72. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a84 5aa-Paper.pdf 73. Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. TPAMI 41(2), 394–407 (2019) 74. Wang, P., Wu, Q., Cao, J., Shen, C., Gao, L., van den Hengel, A.: Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019 75. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, October 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 76. Yang, S., Li, G., Yu, Y.: Cross-modal relationship inference for grounding referring expressions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019 77. Yang, S., Li, G., Yu, Y.: Dynamic graph attention for referring expression comprehension. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
78. Yang, Z., Chen, T., Wang, L., Luo, J.: Improving one-stage visual grounding by recursive sub-query construction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 387–404. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6 23 79. Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019 80. Yu, F., et al.: ERNIE-ViL: knowledge enhanced vision-language representations through scene graphs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, pp. 3208–3216, May 2021. https://ojs.aaai.org/index.php/ AAAI/article/view/16431 81. Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018 82. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. ArXiv abs/1608.00272 (2016) 83. Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3521–3529 (2017) 84. Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3521–3529 (2017). https://doi.org/10.1109/CVPR.2017. 375 85. Yu, Z., Yu, J., Xiang, C., Zhao, Z., Tian, Q., Tao, D.: Rethinking diversified and discriminative proposal generation for visual grounding. ArXiv abs/1805.03508 (2018) 86. Zaheer, M., et al.: Big bird: transformers for longer sequences. In: Advances in Neural Information Processing Systems, vol. 33 (2020) 87. Zhang, H., Niu, Y., Chang, S.F.: Grounding referring expressions in images by variational context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018 88. Zhang, J., Zhao, Y., Saleh, M., Liu, P.J.: PEGASUS: pre-training with extracted gap-sentences for abstractive summarization (2019) 89. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: International Conference on Learning Representations (2021). https://openreview.net/forum? id=gZ9hCDWe6ke 90. Zhuang, B., Wu, Q., Shen, C., Reid, I., van den Hengel, A.: Parallel attention: a unified framework for visual object discovery through dialogs and queries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
W32 - Uncertainty Quantification for Computer Vision
Nowadays, machine learning and deep learning approaches continually demonstrate their viability for solving vision challenges, with models deployed on practical tasks. While performance (in terms of accuracy) is good, these models are predominantly used as black boxes, and it is difficult to ascertain whether or not their outputs are reasonable. Even manual data set inspection, to discriminate between well-predicted simple samples and errors on hard samples, may not be feasible. Uncertainty quantification and calibration are powerful tools that may be employed by engineers and researchers to better understand the reliability of model outputs, which is hugely beneficial to safe decision making. The machine learning community has placed great effort into developing novel techniques (e.g., Bayesian methods, post-hoc calibration and distribution-free approaches) and benchmarking them with classic research data sets. Our goal for this workshop is twofold. Firstly, we are interested in extending uncertainty quantification methods to more challenging computer vision data sets or practical use cases from an industrial perspective. Secondly, we wish to stimulate debate in the community about how to best integrate uncertainty in a community that often aims at 100% accuracy.

October 2022
Andrea Pilzer Martin Trapp Arno Solin Yingzhen Li Neill Campbell
Localization Uncertainty Estimation for Anchor-Free Object Detection

Youngwan Lee¹,²(B), Joong-Won Hwang¹, Hyung-Il Kim¹, Kimin Yun¹, Yongjin Kwon¹, Yuseok Bae¹, and Sung Ju Hwang²

¹ Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea {yw.lee,jwhwang,hikim,kimin.yun,scocso,ysbae}@etri.re.kr
² Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea [email protected]
Abstract. Since many safety-critical systems, such as surgical robots and autonomous driving cars operate in unstable environments with sensor noise and incomplete data, it is desirable for object detectors to take the localization uncertainty into account. However, there are several limitations of the existing uncertainty estimation methods for anchor-based object detection. 1) They model the uncertainty of the heterogeneous object properties with different characteristics and scales, such as location (center point) and scale (width, height), which could be difficult to estimate. 2) They model box offsets as Gaussian distributions, which is not compatible with the ground truth bounding boxes that follow the Dirac delta distribution. 3) Since anchor-based methods are sensitive to anchor hyper-parameters, their localization uncertainty could also be highly sensitive to the choice of hyper-parameters. To tackle these limitations, we propose a new localization uncertainty estimation method called UAD for anchor-free object detection. Our method captures the uncertainty in four directions of box offsets (left, right, top, bottom) that are homogeneous, so that it can tell which direction is uncertain, and provide a quantitative value of uncertainty in [0, 1]. To enable such uncertainty estimation, we design a new uncertainty loss, negative power log-likelihood loss, to measure the localization uncertainty by weighting the likelihood loss by its IoU, which alleviates the model misspecification problem. Furthermore, we propose an uncertainty-aware focal loss for reflecting the estimated uncertainty to the classification score. Experimental results on COCO datasets demonstrate that our method significantly improves FCOS [32], by up to 1.8 points, without sacrificing computational efficiency. We hope that the proposed uncertainty estimation method can serve as a crucial component for the safety-critical object detection tasks.
1 Introduction
CNN-based object detection models are widely used in many safety-critical systems such as autonomous vehicles and surgical robots [29].
Fig. 1. Example of 4-directional uncertainty for anchor-free object detection. C_L, C_R, C_T, and C_B denote the estimated certainty as a value in [0, 1] with respect to left, right, top, and bottom. For example, the proposed UAD estimates a low top-directional certainty due to the ambiguous head boundary of the cat wearing a hat. This demonstrates that our method enables the detection network to quantify which direction is uncertain, whether objects are unclear or obvious.
For such safety-critical systems, it is essential to measure how reliable the estimated output of the object detector is, in addition to achieving good performance. Although object detection is a task that requires both object localization and classification, most of the state-of-the-art methods [1,3,21,26,37] utilize the classification scores as detection scores without considering the localization uncertainty. As a result, these methods output mislocalized detection boxes that are highly overconfident [10]. Thus, the confidence of the bounding box localization should also be taken into consideration when estimating the uncertainty of the detection.

Recently, there have been many attempts [8,10,15,18] to model localization uncertainty for object detection. All of these efforts model the uncertainty of location (center point) and scale (width, height) by modeling their distributions as Gaussian distributions [23,26] with extra channels in the regression output. However, since the center points, widths, and heights have semantically different characteristics [15], this approach, which considers each value equally, is inappropriate for modeling localization uncertainty. For example, the estimated distributions of the center point and scale are largely different according to [15]. In terms of the loss function for uncertainty modeling, conventional methods [8,15,18] use the negative log-likelihood loss to regress outputs as Gaussian distributions. He et al. [10] introduce a KL divergence loss by modeling the box prediction as a Gaussian distribution and the ground-truth box as a Dirac delta function. From the perspective of cross-entropy, however, these methods face the model misspecification problem [14] in that the Dirac delta function cannot be exactly represented with Gaussian distributions, i.e., for any μ and σ, $\delta(x) \neq \mathcal{N}(x \mid \mu, \sigma^2)$.
Recently, anchor-free methods [3,17,32,39,40] that do not require heuristic anchor-box tuning (e.g., scale, aspect ratio) have outperformed conventional anchor-box based methods such as Faster R-CNN [26], RetinaNet [21], and their variants [1,38]. As a representative anchor-free method, FCOS [32] adopts the concept of centerness as an implicit localization uncertainty that measures how well the center of the predicted bounding boxes fits the ground-truth boxes. The centerness score can be multiplied by the classification score to calibrate the box quality score during the test phase. However, the centerness faces the problem of inconsistency [19,36] between the training and test phase as it is trained separately with the classification network. Besides, the centerness does not fully account for localization uncertainty of bounding boxes (e.g., scale or direction). To deal with these limitations, in this paper, we propose Uncertainty-Aware Detection (UAD), which explicitly estimates the localization uncertainty for an anchor-free object detection model. The proposed method estimates the uncertainty of the four values that define the box offsets (left, right, top, bottom) to fully describe the localization uncertainty. It is advantageous to estimate the uncertainty of the four box offsets having a similar semantic characteristic compared to conventional algorithms that estimate localization uncertainty for anchor-based detection [8,10,15,18]. Our method can also capture richer information for localization uncertainty than just the centerness of FCOS. The proposed method enables to inform which direction of a box boundary is uncertain as a quantitative value in [0,1] independently from the overall box uncertainty as shown in Fig. 1 (please refer to more examples in Fig. 3). To this end, we model the box offset and its uncertainty as Gaussian distributions by introducing a newly designed uncertainty loss and an uncertainty network. To resolve the aforementioned model misspecification [14] between Dirac delta and Gaussian distribution, we design a novel uncertainty loss, negative power loglikelihood loss (NPLL), inspired by Power likelihood [7,11,24,28,31], to enable the uncertainty network to learn to estimate localization uncertainty by weighing the log-likelihood loss by Intersection-over-Union (IoU). To handle the inconsistency between the training and test phase, we also propose an uncertainty-aware classification by reflecting the estimated uncertainty into the classification score in both the training/inference phase. To this end, we introduce a Certainty-aware representation Network (CRN) to represent features with the classification network jointly. We also define an uncertaintyaware focal loss (UFL) that adjusts the loss contributions of examples differently based on their estimated uncertainties. UFL focuses on the high-quality examples obtaining lower uncertainty (i.e., higher certainty) by weighting the estimated certainty, which enables to generate uncertainty-reflected localization score. The difference from other focal losses such as QFL [19] and VFL [36] is that we use the estimated uncertainty as a weighting factor instead of IoU. The 4-directions uncertainty captures the object localization quality better than IoU (i.e., scale), and the uncertainty-based methods empirically yield better performance.
30
Y. Lee et al.
By introducing the uncertainty network with NPLL and uncertainty-aware classification with UFL to FCOS [33], we build an uncertainty-aware detector, UAD. We validate UAD on the challenging COCO [22] benchmarks. Through extensive experiments, we find that the Gaussian modeling with the proposed NPLL and the newly defined focal loss, UFL, yields better performance than baseline methods. Besides, UAD improves the FCOS [33] baseline by 1–1.8 points in AP using different backbone networks, without additional computation burden. In addition to detection performance, the proposed UAD accurately estimates the uncertainty in the four directions, as shown in Figs. 1 and 3. The main contributions of our work can be summarized as follows:

– We propose a simple and effective method to measure the localization uncertainty for anchor-free object detectors that can serve as a detection quality measure and provide confidences in [0, 1] for four directions (left, right, top, bottom) from the center of the object.
– We propose a novel uncertainty-aware loss function, inspired by the power likelihood, that weighs the negative log-likelihood loss by the IoU, which resolves the model misspecification problem.
– We introduce an uncertainty-aware classification scheme with the proposed uncertainty-aware focal loss by leveraging the estimated uncertainty during both the training and the test phase in a consistent manner.
2 Related Works

2.1 Anchor-Free Object Detection
Recently, anchor-free object detectors [3,17,32,39,40] have attracted attention beyond anchor-based methods [1,21,26,38] that need to tune sensitive hyperparameters related to the anchor box (e.g., scale, aspect ratio, etc.). CornerNet [17] predicts an object location as a pair of keypoints (top-left and bottom-right). This idea is extended by CenterNet [3], which utilizes a triplet instead of a pair of keypoints to boost performance. ExtremeNet [40] locates four extreme points (top, bottom, left, right) and one center point to generate the object box. Zhu et al. [39] utilize keypoint estimation to predict the center points of objects and regress other attributes, including size, orientation, pose, and 3D location. FCOS [32] views all points (or locations) inside the ground-truth box as positive samples and regresses four distances (left, right, top, bottom) from the points. We propose to endow FCOS with localization uncertainty due to its simplicity and performance.
2.2 Uncertainty Estimation
Uncertainty in deep neural networks can be categorized into two types [4,13,18]: epistemic (sampling-based) and aleatoric (sampling-free) uncertainty.
Fig. 2. Architecture of the proposed UAD. Differently from FCOS [32], our UAD can estimate localization uncertainty from a separate uncertainty network that outputs the uncertainty on the four values that describe a bounding box (left, right, top, and bottom). In addition, the estimated uncertainty is utilized for uncertainty-aware classification as a box-quality confidence.
Epistemic uncertainty measures the model uncertainty in the model's parameters through Bayesian neural networks [30], Monte Carlo dropout [5], and Bootstrap Ensembles [16]. As they need to be re-evaluated several times and store several sets of weights for each network, it is hard to apply them to real-time applications. Aleatoric uncertainty is inherent to the data and the problem, such as sensor noise and ambiguities in the data. It can be estimated by explicitly modeling it as a model output. Recent works [8,10,16,18] have adopted uncertainty estimation for object detection. Lakshminarayanan et al. [16] and Harakeh et al. [8] use Monte Carlo dropout in epistemic-based methods. As described above, since epistemic uncertainty needs to be inferred several times, it is not suitable for real-time object detection. Le et al. [18] and Choi et al. [2] are aleatoric-based methods and jointly estimate the uncertainties of the four parameters of the bounding box for SSD [23] and YOLOv3 [25]. He [10] estimates the uncertainty of the bounding box by minimizing the KL-divergence loss between the Gaussian distribution of the predicted box and the Dirac delta distribution of the ground-truth box on Faster R-CNN [26] (an anchor-based method). From the cross-entropy perspective, however, the Dirac delta distribution cannot be represented as a Gaussian distribution, which results in a misspecification problem [14]. To overcome this problem, we design a new uncertainty loss function, the negative power log-likelihood loss, inspired by the power likelihood concept [7,11,24,28,31]. The latest concurrent work is the Generalized Focal loss (GFocal) [19], which jointly represents localization quality and classification, and models the bounding box as an arbitrary distribution. The distinct difference from GFocal [19] is that our method estimates 4-direction uncertainties as quantitative values in the range [0, 1]; thus, these estimated values can be used as an informative cue for decision-making.
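To illustrate why sampling-based (epistemic) estimates are costly at inference time, here is a generic Monte Carlo dropout sketch (our own illustration, not code from any of the cited works): the model is run several times with dropout active, and the spread of the outputs serves as the uncertainty.

```python
# Generic Monte Carlo dropout sketch: keep dropout active at test time and
# run T stochastic forward passes; the variance across passes estimates
# epistemic uncertainty. Illustrative only.
import torch

def mc_dropout_predict(model, x, num_samples=20):
    model.eval()
    for m in model.modules():                       # re-enable dropout layers only
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(num_samples)])
    return samples.mean(dim=0), samples.var(dim=0)  # prediction and its uncertainty
```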
3 Uncertainty-Aware Detection (UAD)
To estimate the uncertainty of object detection, we use an anchor-free detector, FCOS [32], for the following reasons: 1) Simplicity. FCOS directly regresses the target bounding boxes in a pixel-wise prediction manner without heuristic anchor tuning (aspect ratio, scales, etc.). 2) 4-direction uncertainty. Anchor-based methods [8,10,16,18] regress the center point (x, y), width, and height based on each anchor box, while FCOS directly regresses the four boundaries (left, right, top, bottom) of a bounding box at each location. Besides, the center, width, and height from the anchor-based methods have different characteristics, whereas the distances between the four boundaries and each location are semantically symmetric. In terms of modeling, it is easier to model values that share semantic meanings and have similar properties. Furthermore, it enables the network to indicate which direction of a box boundary is uncertain, separately from the overall box uncertainty.

FCOS also adopts centerness to suppress low-quality predicted boxes in the inference stage, which estimates how well the center of the predicted box fits any of the objects. However, centerness is an implicit score of uncertainty that is insufficient for measuring localization uncertainty, since it does not fully capture the box quality such as scale or directions. It also has the inconsistency problem, since centerness as localization uncertainty is obtained independently of the classification score and is multiplied by the classification score only at the inference phase.

To overcome such limitations of centerness-based uncertainty estimation, we propose a novel method, Uncertainty-Aware Detection (UAD), to endow FCOS with a localization uncertainty estimator that reflects the box quality along the four directions of the bounding box. This approach allows the network to estimate not only a single measure of object localization uncertainty but also to explicitly output the uncertainty in each of the four directions. To this end, we first introduce an uncertainty network to FCOS with a newly proposed uncertainty loss, the negative power log-likelihood loss (NPLL). Next, we propose an uncertainty-aware focal loss (UFL) to jointly represent the estimated uncertainty with the classification network during both the training and test steps. The overview of UAD is illustrated in Fig. 2.
3.1 Power Likelihood
In FCOS [32], if the location belongs to the ground-truth box area, it is regarded as a positive sample, and as a negative sample otherwise. On each location (x, y), the box offsets are regressed as a 4D vector $B_{x,y} = [l, r, t, b]$, which contains the distances from the location to the four sides of the bounding box (i.e., left, right, top, and bottom). The regression targets $B^g_{x,y} = [l^g, r^g, t^g, b^g]$ are computed as

$$
l^g = x - x^g_{lt}, \quad r^g = x^g_{rb} - x, \quad t^g = y - y^g_{lt}, \quad b^g = y^g_{rb} - y, \qquad (1)
$$
where $(x^g_{lt}, y^g_{lt})$ and $(x^g_{rb}, y^g_{rb})$ denote the coordinates of the left-top and right-bottom corners of the ground-truth box, respectively. Then, for all locations of
positive samples, the IoU loss [35] is measured between the predicted $B_{x,y}$ and the ground truth $B^g_{x,y}$ for the regression loss.

To better estimate localization uncertainty than centerness, it is necessary to consider the four directions that represent the object boundary. Therefore, we introduce an uncertainty network that estimates the localization uncertainty of the box based on the regressed box offsets (l, r, t, b). To predict the uncertainties of the four box offsets, we model the box offsets as Gaussian distributions and train the network to estimate their uncertainties (standard deviations). Assuming that each instance of box offsets is independent, we use a multivariate Gaussian distribution of output $B^*$ with diagonal covariance matrix $\Sigma_B$ to model each box offset B:

$$
P_\Theta(B^* \mid B) = \mathcal{N}(B^*; \mu_B, \Sigma_B), \qquad (2)
$$
where Θ denotes the learnable network parameters. $\mu_B = [\mu_l, \mu_r, \mu_t, \mu_b]$ and $\Sigma_B = \mathrm{diag}(\sigma_l^2, \sigma_r^2, \sigma_t^2, \sigma_b^2)$ denote the predicted box offsets and their uncertainties, respectively. Existing works [2,10,15,18] model the ground truth as a Dirac delta distribution and the box offset as a Gaussian distribution. [2,15,18] adopt the negative log-likelihood loss (NLL) and [10] uses a KL-divergence loss (KL-Loss). From the cross-entropy perspective, minimizing the NLL and the KL-loss is equivalent to minimizing

$$
L = -\frac{1}{N} \sum_{x} P_D(x) \cdot \log P_\Theta(x), \qquad (3)
$$
where $P_D$ and $P_\Theta$ are the Dirac delta function and the Gaussian probability density function, respectively. When the box offset is located in a ground-truth box, $P_D$ is 1 and Eq. 3 becomes the negative log-likelihood loss. However, this is problematic since the Dirac delta distribution does not belong to the family of Gaussian distributions, which results in the model misspecification problem [14]. In the statistical literature [7,11,24,28,31], to estimate parameters of interest in a robust way when the model is misspecified, the Power likelihood ($P_\Theta(\cdot)^w$) is often used to replace the original likelihood; it raises the likelihood ($P_\Theta(\cdot)$) to a power (w) to reflect how influential a data instance is. To fill the gap between the Dirac delta distribution and the Gaussian distribution, we utilize this Power likelihood and introduce a novel uncertainty loss, the negative power log-likelihood loss (NPLL), that exploits the Intersection-over-Union (IoU) as the power term, since an offset with higher IoU should be more influential. By the property of the logarithm, the log-likelihood is multiplied by the IoU. Thus, the new uncertainty loss is defined as

$$
\mathcal{L}_{uc} = - \sum_{k \in \{l,r,t,b\}} IoU \cdot \log P_\Theta\!\left(B^g_k \mid \mu_k, \sigma_k^2\right) \qquad (4)
$$

$$
= IoU \times \sum_{k \in \{l,r,t,b\}} \left[ \frac{(B^g_k - \mu_k)^2}{2\sigma_k^2} + \frac{1}{2}\log \sigma_k^2 + \frac{1}{2}\log 2\pi \right], \qquad (5)
$$
where IoU is the intersection-over-union between the predicted box and the ground-truth box and $k \in \{l, r, t, b\}$. With this uncertainty loss, when the predicted coordinate $\mu_k$ from the regression branch is inaccurate, the network will output a larger uncertainty $\sigma_k$. Note that, unlike centerness, our network is trained to directly estimate the localization uncertainties for the four directions $\sigma_u = (\sigma_l, \sigma_r, \sigma_t, \sigma_b)$ that define the box offsets. This allows the network to estimate which direction of a box boundary is uncertain, separately from the overall box uncertainty. To realize this idea, we add a 3 × 3 Conv layer with 4 channels as an uncertainty network for FCOS [32] (w/o centerness), as shown in Fig. 2. Our network predicts a probability distribution in addition to the box coordinates. The mean values $\mu_k$ of each box offset are predicted from the regression branch, and the new uncertainty network with the sigmoid function outputs four uncertainty values $\sigma_k \in [0, 1]$. The regression and the uncertainty networks share the same feature (4 Conv layers) as their inputs to estimate the mean $\mu_k$ and the standard deviation $\sigma_k$. Note that the computation burden of adding the uncertainty network is negligible.
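A minimal PyTorch sketch of the 3 × 3 convolutional uncertainty head with sigmoid and of the NPLL of (4)–(5); the module wiring, names, and the reduction over locations are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the uncertainty head and the NPLL loss (Eqs. 4-5).
import math
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)

    def forward(self, reg_feat):
        return torch.sigmoid(self.conv(reg_feat))   # sigma in [0, 1] for (l, r, t, b)

def npll_loss(mu, sigma, target, iou, eps=1e-6):
    """mu, sigma, target: (N, 4) box offsets (l, r, t, b) for positive locations.
    iou: (N,) IoU between each predicted box and its ground-truth box."""
    var = sigma.clamp(min=eps) ** 2
    nll = (target - mu) ** 2 / (2 * var) + 0.5 * var.log() + 0.5 * math.log(2 * math.pi)
    return (iou.unsqueeze(1) * nll).sum(dim=1).mean()   # weight the log-likelihood by IoU
```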
3.2 Uncertainty-Aware Classification
Dense object detectors including FCOS generate several predicted boxes and then utilize non-maximum suppression (NMS) to filter out redundant boxes by ranking the boxes with the classification confidence. However, as the classification score does not account for the quality of the detected bounding box, centerness [32] or IoU [34] is often used to weigh the classification confidence at the inference phase to calibrate the detection score. To address the misalignment between confidence and localization quality, GFL [19] and VFL [36] suggest methods to reflect the box quality (e.g., IoU) in the classification score in both the training and test phases.
Certainty Representation Network. As the localization uncertainty is negatively correlated with the detected box quality, it is natural to apply the estimated localization uncertainty to the classification network for uncertainty-aware classification. The higher the box quality is, the closer the predicted probability should be to 1, so the uncertainty is taken in reverse. Thus, we convert the uncertainty features Xuc ∈ R^{4×W×H} from the uncertainty network into certainty features Xc ∈ R^{1×W×H} through the certainty representation network (CRN), to combine the confidence scores with the classification features. The CRN consists of two fully-connected layers, W1 ∈ R^{4×4} and W2 ∈ R^{1×4}. The specific process is described below:

$$X_{c} = \phi\big(W_{2}(\delta(W_{1}(1 - X_{uc})))\big), \qquad (6)$$

where φ and δ denote the Sigmoid and ReLU activation functions, respectively. The certainty feature is multiplied by the classification features Xcls ∈ R^{256×W×H} to obtain a representation that accounts for both certainty and the classification score.
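As an illustration, a CRN of this form could be sketched in PyTorch as below, where the per-pixel fully-connected layers W1 and W2 are realised as 1 × 1 convolutions; the module and tensor names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class CRN(nn.Module):
    """Certainty representation network of Eq. (6)."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Conv2d(4, 4, kernel_size=1)  # W1 in R^{4x4}, applied per pixel
        self.w2 = nn.Conv2d(4, 1, kernel_size=1)  # W2 in R^{1x4}, applied per pixel

    def forward(self, x_uc):
        # x_uc: (B, 4, H, W) localization uncertainties in [0, 1]
        return torch.sigmoid(self.w2(torch.relu(self.w1(1.0 - x_uc))))  # (B, 1, H, W)

# The (B, 1, H, W) certainty map is broadcast-multiplied with the
# (B, 256, H, W) classification features: x_cls = x_cls * crn(x_uc)
```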
Table 1. Comparison of various focal losses. y is the target IoU between the predicted box and the ground truth, used as a soft label instead of a one-hot category. p denotes the predicted classification score, α is a weighting factor, and LBCE means the binary cross-entropy loss. UFL denotes the proposed uncertainty-aware focal loss. f(σu) and σu denote a certainty function and the estimated localization uncertainties from the uncertainty branch, respectively.

Loss type   | y > 0            | y = 0
QFL [19]    | |y − p|^γ · LBCE | −p^γ log(1 − p)
VFL [36]    | y · LBCE         | −αp^γ log(1 − p)
UFL (ours)  | f(σu) · LBCE     | −αp^γ log(1 − p)
Uncertainty-Aware Focal Loss. Both GFL [19] and VarifocalNet [36] introduce an IoU-classification representation and new classification losses, the Quality focal loss (QFL) and the Varifocal loss (VFL), which inherit the Focal loss (FL) [21] to address the class imbalance between positive and negative examples. Table 1 describes the detailed definitions of the focal losses. QFL and VFL utilize the IoU score between the predicted box and the ground truth as a soft label (e.g., y ∈ [0, 1]), instead of a one-hot category (e.g., y ∈ {0, 1}). α is a weighting factor and p denotes the predicted classification score. Following FL, QFL focuses on hard positive examples by down-weighting the loss contribution of easy examples with a modulating factor (|y − p|^γ), while VFL pays more attention to easy examples (i.e., higher IoU) by weighting with the target IoU (i.e., y) and only reduces the loss contribution from negative examples by scaling the loss with a weighting factor of αp^γ. That is, QFL and VFL adjust the loss contribution of training samples by weighting a box quality measure (e.g., IoU). Unlike GFL and VFL, which are based on deterministic detection results (e.g., only box boundaries), our method can estimate object boundaries as probabilistic distributions. Specifically, our method estimates the means (μk) and the uncertainties, as standard deviations (σk), of the four offset distributions. In this way, our method utilizes the estimated uncertainty as a box quality measure instead of the IoU as in QFL and VFL. From this perspective, we design a new focal loss, the uncertainty-aware focal loss (UFL), for learning uncertainty-aware classification, as defined in the third row of Table 1. p is the estimated probability from the uncertainty-aware classification network. f(σu) and σu denote the certainty function and the estimated localization uncertainties from the uncertainty branch, respectively. Instead of weighting by the IoU, we use the certainty score obtained from the certainty function f(σu) to give more attention to high-quality examples. In this paper, f(σu) is defined as:

$$f(\sigma_{u}) = \frac{1}{4} \sum_{k \in \{l,r,t,b\}} (1 - \sigma_{k}), \qquad (7)$$
which averages (1 − σk ) for all k ∈ {l, r, t, b}. Thus, we weigh the positive examples with the estimated certainty score instead of the target IoU (y). For
Table 2. Comparison with data representations of box regression on FCOS. NPLL denotes the proposed negative power log-likelihood loss.

Distribution               AP    AP75  APS   APM   APL
Dirac delta [32]           37.8  40.8  21.2  42.1  48.2
Gaussian w/NLL [2,10,15]   38.3  42.2  21.2  42.5  49.4
General w/DFL [19]         39.0  42.3  22.6  43.0  50.6
Gaussian w/NPLL (ours)     39.0  42.9  21.8  43.2  50.7
negative samples, we find that down-weighting the loss contribution yields better performance with a smaller scale factor α than VFL (e.g., 0.25 vs. 0.75), as shown in Table 3 (right). Also, following QFL and VFL, we set γ to 2.
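A minimal sketch of UFL is given below, assuming PyTorch; the use of the certainty score both as the positive weight and as the soft classification target, as well as the detach call, are our own assumptions and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def certainty_score(sigma_u):
    # sigma_u: (N, 4) localization uncertainties in [0, 1] -> (N,) f(sigma_u), Eq. (7)
    return (1.0 - sigma_u).mean(dim=1)

def ufl(cls_logits, sigma_u, pos_mask, alpha=0.25, gamma=2.0):
    """Uncertainty-aware focal loss (Table 1) for a single class.

    cls_logits: (N,) raw classification logits
    sigma_u:    (N, 4) estimated localization uncertainties
    pos_mask:   (N,) boolean mask of positive samples
    """
    p = cls_logits.sigmoid()
    f_sigma = certainty_score(sigma_u)
    target = torch.where(pos_mask, f_sigma, torch.zeros_like(p))  # soft label (assumption)
    bce = F.binary_cross_entropy_with_logits(cls_logits, target, reduction="none")
    # positives: weighted by the certainty score; negatives: down-weighted by alpha * p^gamma
    weight = torch.where(pos_mask, f_sigma, alpha * p.pow(gamma)).detach()
    return (weight * bce).sum()
```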
3.3 Training and Inference
We define the total loss L of UAD as below:

$$L = \frac{1}{N_{pos}} \sum_{i} L_{uac} + \frac{1}{N_{pos}} \sum_{i} \mathbb{1}_{\{c^{*}_{i} > 0\}} \big( L_{bbox} + \lambda L_{uc} \big), \qquad (8)$$
where Luc is the NPLL (Sect. 3.1), Lbbox is the GIoU loss [27], and Luac is the UFL (Sect. 3.2). Npos denotes the number of positive samples, 1 is the indicator function, c∗i is the class label of location i, and λ is the balancing weight for Luc. The summation is calculated over all positive locations i on the feature maps. Since our baseline is FCOS, we follow the sampling strategy of FCOS. With the proposed NPLL and UFL, UAD learns to estimate uncertainties in the four directions as well as the uncertainty-aware classification score. In the test phase, the predicted uncertainty-aware classification score is utilized in the NMS post-processing step for ranking the detected boxes.
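The combination of the three terms can be sketched as follows, assuming the per-term losses have already been summed over the relevant locations; the function and argument names are illustrative, not the authors' API.

```python
def uad_total_loss(loss_uac, loss_bbox, loss_uc, num_pos, lam=0.05):
    # Eq. (8): UFL over all locations plus GIoU and NPLL over positive locations,
    # each normalised by the number of positives; lambda = 0.05 follows Table 3 (left).
    num_pos = max(float(num_pos), 1.0)
    return (loss_uac + loss_bbox + lam * loss_uc) / num_pos
```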
4 Experiments
Experimental Setup. In this section, we evaluate the effectiveness of UAD on the challenging COCO [22] dataset, which has 80 object categories. We use the COCO train2017 set for training and the val2017 set for ablation studies. Final results are evaluated on test-dev2017 on the evaluation server for comparison with the state of the art. Since FCOS [32] without centerness is our baseline, we use the default hyper-parameters of FCOS. We train UAD with the stochastic gradient descent algorithm with a mini-batch size of 16 and an initial learning rate of 0.01. For the ablation study, we use a ResNet-50 backbone with ImageNet pre-trained weights and the 1× schedule [9] without multi-scale training. For comparison with state-of-the-art methods, we adopt the 2× schedule with multi-scale augmentation, where the shorter image side is randomly sampled from [640, 800] pixels. We implement the proposed UAD based on Detectron2 [6].
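For reference, the optimisation schedule described above could be set up as in the following sketch; the momentum, weight decay and iteration milestones are common Detectron2 defaults and are assumptions here, not values stated in the text.

```python
import torch

def build_optimizer(model, base_lr=0.01, momentum=0.9, weight_decay=1e-4):
    return torch.optim.SGD(model.parameters(), lr=base_lr,
                           momentum=momentum, weight_decay=weight_decay)

def build_lr_scheduler(optimizer):
    # "1x" schedule at batch size 16: roughly 90k iterations with 10x LR drops
    # at 60k and 80k iterations; a "2x" schedule doubles these milestones.
    return torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60_000, 80_000], gamma=0.1)
```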
Table 3. Varying λ and α for NPLL and UFL, respectively.

λ      AP    |  α     AP
0.100  39.5  |  1.00  39.4
0.075  39.5  |  0.75  39.4
0.050  39.8  |  0.50  39.6
0.025  39.3  |  0.25  39.8
0.010  39.3  |  0.10  39.4
Table 4. Comparison between box quality estimation methods on FCOS.

Box quality methods      AP    AP75  APS   APM   APL
FCOS w/o centerness      37.8  40.8  21.2  42.1  48.2
centerness-branch [32]   38.5  41.6  22.4  42.4  49.1
IoU-branch [12,34]       38.7  42.0  21.6  43.0  50.3
QFL [19]                 39.0  41.9  22.0  43.1  51.0
VFL [36]                 39.0  41.9  21.9  42.6  51.0
UFL (ours)               39.5  42.6  22.1  43.4  51.8
4.1 Ablation Study
Power Likelihood. We investigate the effectiveness of the proposed Gaussian modeling with the proposed uncertainty loss (i.e., NPLL) for localization uncertainty. Table 2 shows different data representation methods. The Dirac delta distribution in the first row is the baseline, which is FCOS without centerness. Compared to naive Gaussian distribution methods [2,10,15], our method obtains a larger accuracy gain (+1.2% vs. +0.5%), which shows that the proposed IoU power term effectively overcomes the model misspecification problem [14]. Besides, our method shows similar performance to the general distribution with DFL [19], demonstrating that the Gaussian distribution with NPLL effectively models the underlying distribution of the object localization as well. Meanwhile, we test the hyper-parameter λ of NPLL in Table 3 (left); 0.05 shows the best performance, so we use this value in the remaining experiments. It is also worth noting that the proposed method allows the network to reliably estimate which direction is uncertain as a quantified value in [0, 1], as shown in Fig. 3.
Uncertainty-Aware Classification. We compare the proposed uncertainty-aware classification using UFL with other box quality estimation methods in Table 4. We find that the box-quality-based representation methods, including QFL [19], VFL [36], and UFL (ours), surpass the centerness [32] and IoU branch [12,34] methods, which combine the quality score only during the test phase. The proposed UFL (39.5%) consistently achieves better performance than both QFL (39.0%) and VFL (39.0%). We also investigate the effects of various focal
Table 5. Comparison of different focal losses on UAD (ours).

Focal loss type  AP    AP75  APS   APM   APL
QFL [19]         39.1  42.9  21.9  43.2  50.8
VFL [36]         39.3  42.5  21.6  43.2  51.1
UFL (ours)       39.8  43.0  22.0  44.0  51.4
Table 6. The effect of each component on UAD (ours). The baseline is FCOS without centerness. The inference time is measured on a V100 GPU with a batch size of 1. CRN denotes the certainty representation network in Fig. 2.

NPLL | UFL | CRN | AP   | Inference time (s)
     |     |     | 37.8 | 0.037
  ✓  |     |     | 39.0 | 0.038
  ✓  |  ✓  |     | 39.5 | 0.038
  ✓  |  ✓  |  ✓  | 39.8 | 0.038
losses on UAD as shown in Table 5. UFL still outperforms IoU-based focal losses such as QFL and VFL, which demonstrates that our uncertainty modeling strategy over the four directions captures more degrees (l, r, t, b) of the box quality than a single overall IoU.
Components of UAD. As shown in Table 6, we investigate the effect of each component of the proposed UAD. In addition, we measure the GPU inference time on an NVIDIA V100 GPU with a batch size of 1. We start from FCOS without centerness as a baseline (37.8 AP). Adding the uncertainty network and learning with NPLL improves the baseline to 39.0 AP (+1.2). Replacing the focal loss with the proposed UFL obtains a 0.5 AP gain, and adding the CRN boosts performance further (+0.3). These results demonstrate that the proposed method not only effectively estimates the four-directional localization uncertainty but also boosts detection performance. We also emphasize that all these components require negligible computational overhead, as shown in Table 6.
4.2 Comparison with Other Methods
Using different backbones, we compare the proposed UAD with FCOS and other methods on COCO [22] test-dev2017. Table 7 summarizes the results. Compared to FCOS [33], UAD achieves consistent performance gains with various backbones, such as ResNet-50/101 and ResNeXt-101-32x8d/64x4d. We note that UAD boosts performance further on the deeper and more complex ResNeXt backbones than on ResNet. Specifically, ResNeXt-101-32x8d/64x4d/32x8d-DCN obtain +1.3/+1.4/+1.8 AP gains, while ResNet-50/101 each obtain a 1.0 AP gain. These results show that the uncertainty representation of UAD creates a synergy effect with deeper feature representations.
Table 7. UAD performance on COCO test-dev2017. These results are tested with a single model and single scale. Note that the results of FCOS are from the latest updated version in [33]. FCOS and UAD are trained with the same training protocols, such as multi-scale augmentation and the 2× schedule in [9]. DCN: Deformable Convolutional Network v2.

Method                    Backbone                 Epoch  AP           AP50  AP75  APS   APM   APL
Anchor-based
  FPN [20]                ResNet-101               24     36.2         59.1  39.0  18.2  39.0  48.2
  RetinaNet [21]          ResNet-101               18     39.1         59.1  42.3  21.8  42.7  50.2
  ATSS [37]               ResNet-101               24     43.6         62.1  47.4  26.1  47.0  53.6
  GFL [19]                ResNet-101               24     45.0         63.7  48.9  27.2  48.8  54.5
  GFL [19]                ResNeXt-101-32x8d-DCN    24     48.2         67.4  52.6  29.2  51.7  60.2
Anchor-free
  CornerNet [17]          Hourglass-104            200    40.6         56.4  43.2  19.1  42.8  54.3
  CenterNet [3]           Hourglass-104            190    44.9         62.4  48.1  25.6  47.4  57.4
  SAPD [41]               ResNet-101               24     43.5         63.6  46.5  24.9  46.8  54.6
  FCOS [33]               ResNet-50                24     41.4         60.6  44.9  25.1  44.2  50.9
  UAD                     ResNet-50                24     42.4 (+1.0)  60.6  46.1  24.5  45.5  53.1
  FCOS [33]               ResNet-101               24     43.2         64.5  46.8  26.1  46.2  52.8
  UAD                     ResNet-101               24     44.2 (+1.0)  62.6  48.0  25.6  47.4  55.2
  FCOS [33]               ResNeXt-101-32x8d        24     44.1         66.0  47.9  27.4  46.8  53.7
  UAD                     ResNeXt-101-32x8d        24     45.4 (+1.3)  64.1  49.3  27.7  48.6  56.0
  FCOS [33]               ResNeXt-101-64x4d        24     44.8         66.4  48.5  27.7  47.4  55.0
  UAD                     ResNeXt-101-64x4d        24     46.2 (+1.4)  64.9  50.3  28.2  49.6  57.2
  FCOS [33]               ResNeXt-101-32x8d-DCN    24     46.6         65.9  50.8  28.6  49.1  58.6
  UAD                     ResNeXt-101-32x8d-DCN    24     48.4 (+1.8)  67.1  52.7  29.5  51.6  61.0
Compared to GFL [19], which utilizes a better sampling method (e.g., ATSS [37]) based on anchor-based detection, UAD shows lower AP on ResNet-101. However, UAD achieves 48.4 AP, which is higher than GFL (48.2), when using the ResNeXt-101-32x8d-DCN backbone.
4.3 Discussion
GFL [19] also estimates general distributions along the four directions, like UAD. However, GFL only qualitatively interprets the directional uncertainty according to the shape of the distribution (e.g., sharp or flat). Different from GFL, our UAD not only captures the overall uncertainty of an object, but also estimates the four-directional uncertainties as quantitative values in [0, 1]. Specifically, as shown in Fig. 3, UAD estimates lower certainty values on unclear and ambiguous boundaries caused by occlusion and shadow. For example, UAD estimates a lower bottom-directional certainty value (e.g., C B: 34%) for the dog in the center-bottom image due to its shadow. For the upper-mid image, the left-directional certainty value (C L) of the bird is estimated at only 25% because the body of the bird is occluded by the tree branch. Besides, as the giraffe in the upper-rightmost image is occluded by another giraffe, it has lower right- and bottom-directional certainties (57% and 36%). Hence, UAD captures lower certainties on unclear or occluded sides. From these results, we
might expect that the estimated four-directional quantitative values can serve as crucial information for safety-critical applications or decision-making systems.
Fig. 3. Estimated uncertainty examples of the proposed UAD. Since there is no supervision of uncertainty, we analyze the estimated uncertainty qualitatively. UAD estimates lower certainties on unclear and ambiguous boundaries due to occlusion and shadow. For example, Both the surfboard and the person in the left-bottom image have much lower bottom-directional certainties (i.e., C B: 55% and 22%) as their shapes are quite unclear due to the water. Also, the right-directional certainty (C R) of the woman in the right-bottom image is estimated only by 1% because it is covered by the tail of a cat on the TV.
5 Conclusion
We have proposed UAD, which estimates four-directional uncertainty for anchor-free object detectors. To this end, we design a new uncertainty loss, the negative power log-likelihood loss, to train a network that produces localization uncertainty and enables accurate localization. Our uncertainty estimation method captures not only the quality of the detected box but also which direction is uncertain, as a quantified value in [0, 1]. Furthermore, we propose an uncertainty-aware focal loss and the certainty representation network for uncertainty-aware classification, which helps to correctly rank detected objects in the NMS step. Experiments on the challenging COCO dataset demonstrate that UAD improves our baseline, FCOS, without additional computational overhead. We hope the proposed UAD can serve as a component providing localization uncertainty as an essential cue and improving the performance of anchor-free object detection methods.
Acknowledgement. This work was supported by Institute of Information & Communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) (No.2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis, No.2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities).
References 1. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018) 2. Choi, J., Chun, D., Kim, H., Lee, H.J.: Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In: ICCV (2019) 3. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: CenterNet: keypoint triplets for object detection. In: ICCV (2019) 4. Gal, Y.: Uncertainty in deep learning. Ph.D. thesis (2016) 5. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: ICML (2016) 6. Girshick, R., Radosavovic, I., Gkioxari, G., Doll´ ar, P., He, K.: Detectron (2018). https://github.com/facebookresearch/detectron 7. Gr¨ unwald, P., van Ommen, T.: Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12(4), 1069–1103 (2017) 8. Harakeh, A., Smart, M., Waslander, S.L.: BayesOD: a Bayesian approach for uncertainty estimation in deep object detectors. arXiv:1903.03838 (2019) 9. He, K., Girshick, R., Dollar, P.: Rethinking ImageNet pre-training. In: ICCV (2019) 10. He, Y., Zhu, C., Wang, J., Savvides, M., Zhang, X.: Bounding box regression with uncertainty for accurate object detection. In: CVPR (2019) 11. Holmes, C.C., Walker, S.G.: Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104(2), 497–503 (2017) 12. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y.: Acquisition of localization confidence for accurate object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 816–832. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9 48 13. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NeurIPS (2017) 14. Kleijn, B.J.K., van der Vaart, A.W.: The Bernstein-Von-Mises theorem under misspecification. Electron. J. Stat. 6, 354–381 (2012) 15. Kraus, F., Dietmayer, K.: Uncertainty estimation in one-stage object detection. In: ITSC (2019) 16. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: NeurIPS (2017) 17. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: ECCV, pp. 734–750 (2018) 18. Le, M.T., Diehl, F., Brunner, T., Knol, A.: Uncertainty estimation for deep neural object detectors in safety-critical applications. In: ITSC (2018) 19. Li, X., et al.: Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In: NeurIPS (2020)
20. Lin, T.Y., Doll´ ar, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: CVPR (2017) 21. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: ICCV (2017) 22. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 23. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 2 24. Lyddon, S.P., Holmes, C.C., Walker, S.G.: General Bayesian updating and the loss-likelihood bootstrap. Biometrika 106(2), 465–478 (2019) 25. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv:1804.02767 (2018) 26. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015) 27. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union (2019) 28. Royall, R., Tsou, T.S.: Interpreting statistical evidence by using imperfect models: robust adjusted likelihood functions. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 65(2), 391–404 (2003) 29. Sarikaya, D., Corso, J.J., Guru, K.A.: Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Trans. Med. Imaging 36, 1542–1549 (2017) 30. Shridhar, K., Laumann, F., Liwicki, M.: A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv:1901.02731 (2019) 31. Syring, N., Martin, R.: Calibrating general posterior credible regions. Biometrika 106(2), 479–486 (2019) 32. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019) 33. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: a simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) (2021) 34. Wu, S., Li, X., Wang, X.: IoU-aware single-stage object detector for accurate localization. Image Vision Comput. 97, 103911 (2020) 35. Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T.: Unitbox: an advanced object detection network. In: MM (2016) 36. Zhang, H., Wang, Y., Dayoub, F., S¨ underhauf, N.: VarifocalNet: an IoU-aware dense object detector. arXiv preprint arXiv:2008.13367 (2020) 37. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchorbased and anchor-free detection via adaptive training sample selection. In: CVPR (2020) 38. Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Single-shot refinement neural network for object detection. In: CVPR (2018) 39. Zhou, X., Wang, D., Kr¨ ahenb¨ uhl, P.: Objects as points. arXiv:1904.07850 (2019) 40. Zhou, X., Zhuo, J., Krahenbuhl, P.: Bottom-up object detection by grouping extreme and center points. In: CVPR (2019) 41. Zhu, C., Chen, F., Shen, Z., Savvides, M.: Soft anchor-point object detection. arXiv preprint arXiv:1911.12448 (2019)
Variational Depth Networks: Uncertainty-Aware Monocular Self-supervised Depth Estimation

Georgi Dikov and Joris van Vugt
Qualcomm Technologies Netherlands B.V., Nijmegen, The Netherlands
[email protected]

Abstract. Using self-supervised learning, neural networks are trained to predict depth from a single image without requiring ground-truth annotations. However, they are susceptible to input ambiguities and it is therefore important to express the corresponding depth uncertainty. While there are a few truly monocular and self-supervised methods modelling uncertainty, none correlates well with errors in depth. To this end we present Variational Depth Networks (VDN): a probabilistic extension of the established monocular depth estimation framework, MonoDepth2, in which we leverage variational inference to learn a parametric, continuous distribution over depth, whose variance is interpreted as uncertainty. The utility of the obtained uncertainty is then assessed quantitatively in a 3D reconstruction task, using the ScanNet dataset, showing that the accuracy of the reconstructed 3D meshes highly correlates with the precision of the predicted distribution. Finally, we benchmark our results using 2D depth evaluation metrics on the KITTI dataset.

Keywords: Self-supervised learning · Depth estimation · Variational inference

1 Introduction
Depth estimation is an important task in computer vision, since it forms the basis of many algorithms in applications such as 3D scene reconstruction [2,38, 39,47,55] or autonomous driving [52,57,60] among others. Inferring depth from a single image is an inherently ill-posed problem due to a scale ambiguity: an object in an image will appear the same if it were twice as large and placed twice as far away [20]. Nevertheless, deep neural networks are able to provide reliable, dense depth estimates by learning relative object sizes from data [10]. To this end, there are two main learning paradigms: supervised training from J. van Vugt—Work done while at Qualcomm Technologies Netherlands B.V.
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9_3.
Fig. 1. (a) Top to bottom: input images from ScanNet [7] (scene 0019 00), predicted depths (far → close), predicted uncertainties (low → high); (b) Top: reconstructed scene from all dense depth predictions, bottom: reconstructed scene from filtered depths. Notice how the predicted uncertainty highlights regions (circled in red) for which the network would not have received a meaningful error signal during self-supervised training, and is therefore susceptible to mistakes. Those can then be filtered using thresholded uncertainties as masks, leading to a sparser but more accurate scene reconstruction (Color figure online)
dense [12,49,51] or sparse [34] ground truth depth maps, e. g.obtained by a Timeof-Flight [22,43] sensor such as LiDAR [6], and self-supervised training which exploits 3D geometric constraints to construct an auxiliary task of photometric consistency between different views of the same scene [13,16,17,62]. The latter approach is particularly useful as it does not require ground truth depth images, and can be applied on sequences of frames taken by an ordinary, off-the-shelf monocular camera. To reliably make use of the estimated depth in downstream tasks, a dense quantification of the uncertainty associated to the predictions is essential [28]. Consider the example given in Fig. 1 where the depth predictions for the overexposed, blank white walls are compromised (see red markings in Fig. 1a), leading to a noisy scene reconstruction as shown in the top part of Fig. 1b. To mitigate this, one can use the uncertainty maps to filter the potentially erroneous depth pixels and produce a sparser but more accurate mesh, cf.bottom of Fig. 1b. However, obtaining meaningful confidence values from a single image in a fully self-supervised learning setting is an especially challenging task, as the depth is only indirectly learnt. Consequently the majority of existing uncertainty-aware methods are either trained in a supervised fashion [4,31,35,50], assume that multiple views are available at test time [26] or model other types of uncertainty, e. g.on the photometric error [58]. The goal of our work is to extend self-supervised depth training with principled uncertainty estimation. To that end, we present Variational Depth Networks (VDN)—an entirely monocular, probabilistically motivated approach to depth uncertainty. It builds upon established self-supervised methods and leverages advancements in approximate Bayesian learning. Specifically, VDN extends MonoDepth2 [17] to model the depth as a continuous distribution, whose parameters are optimised using the framework of variational inference [18,30]. As a
result, the network learns to assign high uncertainty to regions for which the depth can vary a lot without significantly increasing the photometric error, and low uncertainty otherwise. Building up on this idea, in Sect. 4 we also present a new method to quantitatively evaluate the utility of the uncertainty maps in a 3D reconstruction task using the ScanNet dataset [7], and benchmark the quality of the 2D depth predictions on the KITTI dataset [14]. In summary, our main contributions are as follows: – We propose VDN as a novel probabilistic framework for monocular, selfsupervised depth estimation, which uses approximate Bayesian inference to learn a continuous, parametric distribution over the depth. The uncertainty is then expressed as the variance of this distribution. – We show qualitatively that the obtained uncertainty is more interpretable as it highlights regions in the image which are difficult to learn in a self-supervised setting. – We also demonstrate that high confidence predictions are more likely to be accurate. For that, we propose an evaluation scheme based on the task of 3D scene reconstruction, where the depth uncertainty is used to filter unreliable predictions before fusion.
2 Related Work
Self-supervised Uncertainty. Self-supervised learning for monocular depth estimation was originally proposed by Zhou et al. [62]. Their core idea is that a network that predicts the depth and relative pose of a video frame can be optimised by using the photometric consistency with warped neighbouring frames as a loss function. They also include an explainability mask in their network to account for moving objects and non-Lambertian surfaces, which can be interpreted as a form of uncertainty estimation. Later, Godard et al. [17] consolidated several improvements into a conceptually simple method called MonoDepth2, which did not include the explainability mask since it did not have a significant impact on the accuracy of the estimated depth in practice. Klodt and Vedaldi [31] were the first to probabilistically model the depth, pose and photometric error and use the estimated uncertainties to down-weight regions in the image that violate the colour constancy assumption made by the photometric objective function. The depth and poses are modelled through Laplacian distributions where the likelihoods of target depth and pose, obtained from a classical Structure-from-Motion system [36], are maximised. In contrast to their method, ours is self-contained, i. e.it does not rely on external teachers and therefore its performance is not bounded by the quality of those. In an analogous way, Yang et al. [58] also model the photometric error as a Laplacian distribution, and show that its variance can be used to improve the downstream task of visual odometry [59]. Alternatively, depth estimation can be reframed as a discrete classification problem, as proposed by Johnston and Carneiro [24], which allows for computing the variance without any additional prediction head in the network. However,
their approach does not have strong guarantees on the quality of the output distribution [19] and in practice the variance appears to mostly inversely correlate with the predicted disparity except for the furthest regions in the image. On the other hand, Poggi et al. [42] present a comprehensive summary of various depth uncertainty estimation techniques for self-supervised learning and propose a combination of ensembling and self-teaching methods as an effective way to improve depth accuracy. They also propose evaluation metrics based on sparsification, which can be used to assess the quality of the predicted uncertainty. In our work we will compare to baselines from both [24] and [42]. Last, the shortcomings of photometric uncertainty estimation in the context of Multi-View Stereo [44] are addressed by Xu et al. [56] with the goal of directly improving the predicted depth. In contrast, we aim for a monocular method with interpretable uncertainty values. Supervised Uncertainty. A fully supervised probabilistic approach is taken by Liu et al. [35], where the authors update a discrete depth probability volume (DPV) for each image, by fusing information from consecutive frames in an iterative Bayesian filtering fashion. Due to the discrete nature of the DPV, arbitrary distributions can be expressed, however to obtain an initial estimate for it, one needs to compute a cost volume from a number of frames in a video sequence. Moreover, their confidence maps show banding artefacts originating from the discrete depth representation in the cost volume. Whereas most prior work uses a Laplacian or Gaussian distribution to model the depth and its uncertainty, ProbDepthNet from Brickwedde et al. [4] uses a Gaussian mixture model (GMM). The main benefit of GMMs is that they can represent multi-modal distributions, which can occur in foreground-background ambiguity. Walz et al. [50] propose a method for depth estimation on gated images and model the aleatoric depth uncertainty. Ke et al. [26] have the goal to improve scene reconstruction using depth uncertainty in a two-stage method: (i) predict a rough depth and uncertainty estimates using optical flow and triangulation from multiple frames; and (ii) refine the outputs of the first stage in an iterative procedure based on recurrent neural networks.
3 Methods

3.1 Background and Motivation
Fundamentals. Let D = {I_t}_{t=1...N} be a sequence of image frames and T_{t→s} the corresponding 3D camera motion from a target frame t to a source frame s. Further, let K denote the camera intrinsic matrix, projecting from 3D camera coordinates to 2D pixel coordinates x ∈ X. Then, by exploiting 3D geometric constraints, one can cast the task of learning a depth map D_t for a frame I_t as a photometric consistency optimisation problem between the target and the warped source frames [13,17,62]:

$$\mathcal{L}_{photo}(I_t, D_t) = \sum_{x \in \mathcal{X}} \big| I_s\langle K\, T_{t\to s}\, D_t(x)\, K^{-1} x \rangle - I_t(x) \big|, \qquad (1)$$
Fig. 2. (a) A sample input image from ScanNet [7] (scene 0000 00); (b) Photometric uncertainty; (c) Variational depth uncertainty (low → high)
where I_s⟨·⟩ stands for a (bilinear) interpolation on the source frame I_s, following the notation of [17]. For the sake of notational brevity, here and throughout the rest of the paper we omit the dependencies on K as well as on I_s and T_{t→s} in the losses. The estimated depth D_t is usually expressed as the inverse of the disparity output of a deterministic convolutional neural network μ_θ, parametrised by weights θ:

$$D_t = \mu_\theta(I_t)^{-1}. \qquad (2)$$
For numerical reasons, the disparity output is activated by a sigmoid non-linearity and stretched to a predefined [d_max^{-1}, d_min^{-1}] range. In practice, the loss from Eq. (1) is also extended to account for multiple source frames (e.g. using the minimum reprojection error [17]) and combined with other terms such as structural similarity [53] or smoothness regularisation [16,17]. In this paper we will refer to the extended loss as L_photo and to the full model as MonoDepth2. Importantly, this will serve us as a basis framework for monocular, self-supervised depth learning upon which we will introduce a probabilistic extension in Sect. 3.2.
Uncertainty Estimation. Despite its wide-spread popularity, MonoDepth2 is not designed to account for the uncertainty associated with D_t. Following a paradigm of modelling the aleatoric uncertainty explicitly [28], one can reframe the loss from Eq. (1) into an exponential-family likelihood with a learnable variance σ̂_θ:

$$p(I_t \mid D_t) \propto \frac{1}{\hat{\sigma}_\theta(I_t)} \exp\!\left( -\frac{\mathcal{L}_{photo}(I_t, D_t)}{\hat{\sigma}_\theta(I_t)} \right), \qquad (3)$$

where we abuse the notation for the weights θ and the neural network σ̂, which may share only some of its parameters with μ. At this point, it is important to clarify that σ̂_θ(I_t), as used in Eq. (3), accounts merely for the variance in the photometric error, L_photo, and not in the predicted depth D_t. To give an intuitive explanation why the two uncertainties are not interchangeable, consider the following thought experiment: let all pixels in I_t and I_s have the same colour value. Then, for any predicted D_t and arbitrary T_{t→s} we have that
$I_t(x) = I_s\langle K\, T_{t\to s}\, D_t(x)\, K^{-1} x \rangle = I_s(x)$ for all $x \in \mathcal{X}$, and the likelihood from Eq. (3) is maximised with σ̂_θ(I_t) → 0. Thus the photometric variance will collapse, while the actual depth variance is large. In reality, this scenario can occur at large textureless surfaces, such as walls or overexposed regions close to light sources. Figures 2a and 2b show an example input and the corresponding photometric uncertainty. Notice how the network confidence is lowest in the aforementioned regions and highest on their boundaries or in high-frequency patterned areas, where small changes in D_t can substantially increase L_photo. Thus, the photometric variance does not necessarily correlate with the uncertainty in the depth estimate, and in some cases it is even complementary to the latter. On the other hand, VDN is able to assign high depth variance to those regions, cf. Fig. 2c. Despite that, the photometric uncertainty has been reported to quantitatively improve the depth estimates [42,58]. We hypothesise that this can be attributed to the effect of loss attenuation, as the supervisory signal is not dominated by noise stemming from difficult, depth-sensitive areas such as non-Lambertian objects, similarly to the observations made by [28] in a supervised depth regression setup. Nevertheless, there are real-world applications, such as 3D scene reconstruction, where proper depth uncertainty estimation is of greater importance, as we will show experimentally in Sect. 4.
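For illustration, the negative log of the likelihood in Eq. (3) reduces to the familiar attenuated photometric loss sketched below in PyTorch; it assumes the network predicts the log of the photometric scale, and all names are ours rather than the authors'.

```python
import torch

def photometric_nll(photo_error, log_scale):
    """photo_error: (B, 1, H, W) per-pixel photometric residual L_photo
    log_scale:   (B, 1, H, W) predicted log of the photometric scale sigma_hat"""
    # -log p(I_t | D_t) = L_photo / sigma_hat + log sigma_hat (up to a constant);
    # note that this attenuates the *photometric* error and is not a depth uncertainty.
    return (photo_error * torch.exp(-log_scale) + log_scale).mean()
```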
3.2 Variational Depth Networks
Objective. In the following, we will introduce a probabilistic extension to the self-supervised depth learning pipeline, in which the variance of the predicted depth maps can be reliably estimated. Intuitively speaking, we will assume that D_t is a random variable following some conditional distribution, and we will make the image warping transformation in Eq. (1) aware of the probabilistic nature of D_t. We find this intuition to fit well into the Bayesian framework of reasoning, and we leverage approximate variational inference [18,25,30] to optimise a parametric distribution over D_t. In essence, it requires that we specify a likelihood p(I_t | D_t), a prior p(D_t) and a posterior distribution p(D_t | I_t), to which a tractable approximation q_θ(D_t | I_t) is fit. Then, using q_θ we can derive a lower bound on the marginal log-likelihood:

$$\mathbb{E}_{p_D}[\log p_\theta(I_t)] = \mathbb{E}_{p_D}\!\left[ \log \mathbb{E}_{q_\theta}\!\left[ \frac{p(I_t \mid D_t)\, p(D_t)}{q_\theta(D_t \mid I_t)} \right] \right] \qquad (4)$$
$$\geq \mathbb{E}_{p_D,\, q_\theta}\!\left[ \log \frac{p(I_t \mid D_t)\, p(D_t)}{q_\theta(D_t \mid I_t)} \right]. \qquad (5)$$

This can be further decomposed into a log-likelihood term and a KL-divergence term, the so-called evidence lower bound:

$$\mathcal{L}_{ELBO}(I_t, D_t) = \mathbb{E}_{p_D,\, q_\theta}[\log p(I_t \mid D_t)] - \mathbb{E}_{p_D}[\mathrm{KL}(q_\theta(D_t \mid I_t)\,\|\, p(D_t))]. \qquad (6)$$
One can show that maximising L_ELBO w.r.t. θ is equivalent to minimising E_{p_D}[KL(q_θ(D_t | I_t) || p(D_t | I_t))], thus closing the gap between the approximation and the underlying true posterior [18,25]. For the likelihood of VDN we choose an unnormalised density as in Eq. (3); however, throughout this work we will not model both the photometric and the depth uncertainty, so as to isolate the effects of our contribution. In the subsequent sections we will specify the exact form of q_θ(D_t | I_t) and p(D_t).
Approximate Posterior. In the context of depth estimation, one has to take into account two considerations when choosing a suitable family of variational distributions. First, it has to have a positive, bounded support over [d_min, d_max] and, second, it has to allow for reparametrisation so that the weights θ can be learnt with backpropagation. One such candidate distribution is given by the truncated normal distribution [5], constrained to the aforementioned interval, whose location parameter is defined by the output of the neural network μ_θ and the scale by σ_θ, similarly to Eq. (3). Unlike the photometric variance σ̂_θ, σ_θ will have a direct relation to the variance of the estimated depth. For numerical reasons, however, it may be beneficial to express the approximate posterior over disparity instead of depth [17], and convert disparity samples to depth as per Eq. (2):
$$q_\theta(D_t^{-1} \mid I_t) = \mathcal{N}_{tr}\!\big(D_t^{-1} \mid \mu_\theta(I_t),\, \sigma_\theta(I_t),\, d_{max}^{-1},\, d_{min}^{-1}\big). \qquad (7)$$

Backpropagating to μ_θ and σ_θ is possible through a reparametrisation using the inverse CDF function, which is readily implemented in TensorFlow [1,11] and in third-party packages [40] for PyTorch [41]. Here we assume that q_θ is a pixelwise factorised distribution, and we obtain a disparity prediction using the mode, μ_θ(I_t). Since we have defined a distribution over the disparity, it is not straightforward to obtain the mode of the transformed distribution over the depth, q_θ^{-1}. Fortunately, however, for the given truncated normal parametrisation and the reciprocal transformation one can compute it analytically from the density of q_θ^{-1} using the change-of-variables trick, see Appendix A.2 for details:
$$\mathrm{mode}\big(q_\theta^{-1}(D_t \mid I_t)\big) = \min(\max(m, d_{min}), d_{max}), \quad \text{where}\;\; m = \frac{\sqrt{\mu_\theta(I_t)^2 + 8\sigma_\theta(I_t)^2} - \mu_\theta(I_t)}{4\sigma_\theta(I_t)^2}. \qquad (8)$$

Finally, to obtain the estimated pixelwise depth uncertainty, one can compute the sample variance of q_θ^{-1}.
Prior. The choice of depth prior is particularly important for us because it can adversely bias the shape of the variational posterior. To understand the reason for that, one has to compare the VDN model with a regular VAE [30]: while both models encode the input image in a latent representation, a VDN does not use
Fig. 3. Model overview of VDN with an example input from ScanNet [7] (scene 0000 00). Given a target image It , the subnetworks μθ and σθ predict the pixelswise location and scale parameters of the approximate posterior, resulting in a factorised distribution over disparities. Then, multiple samples are drawn and the reciprocal of each is used independently in a warping transformation of a source image Is , assuming known intrinsics K and pose Tt→s . The warped and interpolated source frames are used to compute the likelihood. The prior is given by the predicted location and denotes a scale parameters from a set of pseudo-inputs Ui as per [48]. The arrow sampling operation
a learnable decoder to form the likelihood but rather a fixed warping transformation. This means that a bias in the latent space cannot be compensated for during decoding, resulting in hindered weight optimisation. For this reason we opt for a learnable prior given by the aggregated approximate posterior, which is provably the optimal prior for that task [46,48], see Appendix A.1 for details:

$$p^{*}(D_t) = \sum_{I_t \in \mathcal{D}} q_\theta(D_t \mid I_t)\, p_D(I_t). \qquad (9)$$
Unfortunately, however, the estimation of the aggregate posterior is computationally prohibitive for large, high-dimensional datasets. Therefore, we employ an approximation by Tomczak et al. [48], called VampPrior, where the prior is given as a mixture of the variational posteriors computed on a set of learnable pseudo-inputs {U_i}_{i=1...k}:

$$p(D_t) \approx \frac{1}{k} \sum_{i=1}^{k} q_\theta(D_t \mid U_i). \qquad (10)$$
Earlier, we expressed the approximate posterior in disparity- rather than depth-space, and consequently the prior becomes a mixture distribution over disparities too. Since the KL-divergence is invariant to continuous, invertible transformations [33] (such as the reciprocal relation of depth and disparity), one can compute KL(q_θ(D_t^{-1} | I_t) || p(D_t^{-1})) instead. In summary, all of the components of VDN are presented in Fig. 3.
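To summarise the machinery of this section, the sketch below shows how the depth mode of Eq. (8) and a Monte-Carlo estimate of the KL term to the VampPrior mixture of Eq. (10) could be computed in PyTorch. It assumes a truncated-normal distribution object exposing rsample and log_prob (e.g. from the third-party package cited as [40]); all names are illustrative and this is not the authors' implementation.

```python
import torch

def depth_mode(mu, sigma, d_min, d_max):
    # Mode of the depth distribution implied by the disparity posterior, Eq. (8)
    m = (torch.sqrt(mu ** 2 + 8.0 * sigma ** 2) - mu) / (4.0 * sigma ** 2)
    return m.clamp(d_min, d_max)

def mc_kl_to_vampprior(posterior, pseudo_posteriors, num_samples=10):
    """Monte-Carlo KL(q || p) with p the VampPrior mixture of Eq. (10).

    posterior:         pixelwise factorised disparity distribution q_theta(. | I_t)
    pseudo_posteriors: list of k distributions q_theta(. | U_i) on the pseudo-inputs
    """
    d = posterior.rsample((num_samples,))       # reparametrised disparity samples
    log_q = posterior.log_prob(d)
    log_p = torch.stack([q_i.log_prob(d) for q_i in pseudo_posteriors], dim=0)
    log_p = torch.logsumexp(log_p, dim=0) - torch.log(
        torch.tensor(float(len(pseudo_posteriors))))
    return (log_q - log_p).mean()               # averaged over samples and pixels
```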
4 Experiments

4.1 Setup
Datasets ScanNet. The ScanNet [7] dataset contains 1513 video sequences collected in indoor environments, annotated with 3D poses, dense depth maps and reconstructed meshes. The reason to use this dataset is to evaluate the per-image depth and uncertainty estimation and to assess the utility of uncertainty in 3D reconstruction. Consequently, we use the ground-truth poses to compute the photometric error instead of predicting them. For training we only consider every 10th frame as target to reduce redundancy and for each, we find a source frame both backwards and forwards in time with a relative translation of 5–10cm and a relative rotation of at most 5◦ . All images are resized to 384 × 256 pixels. We use the ground-truth poses to compute the photometric error and do not train a network to predict the pose since we want to focus our analysis in this experiment on the depth and uncertainty estimates only. KITTI. The KITTI dataset [14] is an established benchmark dataset for depth estimation research and consists of 61 sequences collected from a vehicle. Following [17], we use the Eigen split [12], resize the input images to 640 × 192 and evaluate against LiDAR ground-truth capped at 80 m. Unlike the ScanNet experiments, here the camera poses are learnt the same way as in [17] so as to allow for fair comparison. Metrics 3D. Previous works on 3D reconstruction [3,37,45] use point-to-point distances as the basis for comparing to ground-truth meshes. They convert each mesh to a point cloud by only considering its vertices, or by sampling points on the faces, essentially discarding the surface information of the mesh. However, if a predicted point lies on the surface of the ground-truth mesh it can still incur a non-zero error since only the distance to the closest vertex is considered. To mitigate this, we propose to use a cloud-to-mesh (c → m) distance as a basis for our 3D reconstruction error computation, which is readily available in opensource software like CloudCompare [15]. Given a mesh M = (V, F), where V denotes the vertices and F the faces, we compute the accuracy as the fraction of vertices for which the Euclidean distance to the closest face f ∈ F in another mesh M is smaller than a threshold : 1 accc→m (M, M ) = 1 min dist(v, f ) < . (11) f ∈F |V| v∈V
Here 1[·] denotes the indicator function. Given predicted and ground-truth meshes, Mpred and Mgt respectively, we define the precision as accc→m (Mpred ,
Mgt ) and the recall as accc→m (Mgt , Mpred ). The F-score is the harmonic mean of the precision and recall [32]. Following standard practices in 3D reconstruction literature [37,45] we use a threshold of = 5 cm in all our evaluations. 2D. For the evaluation of the 2D predicted depth maps we compute the widely used metrics proposed by Eigen et al. [12]. Uncertainty is evaluated using sparsification curves [23] and the Area Under the Sparsification Error (AUSE) and Area Under the Random Gain (AURG) as proposed by Poggi et al. [42]. Note AURG and AUSE are computed w. r. t.another 2D depth metric and therefore comparison among different models is fair only when they perform similarly on that metric too. Implementation Details Network Architectures and Training Details. Even though our model architecture strictly follows [17] there are a couple of deviations. In particular, to accommodate the prediction of the distribution location and scale parameters, we duplicate the original disparity decoder architecture and, for the scale parameter only, change the output activation to linear. To avoid numerical instability issues with the scale, we clip it to the [10−6 , 3] interval. In all our experiments we use a ResNet-18 encoder [21], pretrained on ImageNet [9], the Adam optimiser [29] with an initial learning rate of 10−4 which we reduce by a factor of 10 after 30 epochs, for a total of 40 epochs. The VampPrior for our VDN models is computed as described in Sect. 3.2 with 20 pseudo-inputs, which we initialise by broadcasting a random colour value over the height and width dimensions. To estimate the loss LELBO from Eq. (6) the approximate posterior is sampled 10 times. 3D Reconstruction. We use the TSDF-fusion algorithm implemented in Open3D [61] to reconstruct ScanNet [7] scenes. To speed up reconstruction, we only integrate every 10th frame and, during fusion, we use a sample size of 5 cm and a truncation distance of 20 cm. For evaluation we use the ground-truth meshes provided with the dataset. 4.2
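As an illustration of this evaluation protocol, the following sketch computes acc_{c→m}, precision, recall and the F-score; the point-to-mesh distance helper is a placeholder for whichever geometry backend is used (the text relies on CloudCompare [15]), and all names are ours.

```python
import numpy as np

def acc_cloud_to_mesh(vertices, target_mesh, point_to_mesh_distance, eps=0.05):
    # Eq. (11): fraction of vertices closer than eps (5 cm) to the target mesh surface
    d = np.array([point_to_mesh_distance(v, target_mesh) for v in vertices])
    return float((d < eps).mean())

def f_score(pred_mesh, gt_mesh, point_to_mesh_distance, eps=0.05):
    precision = acc_cloud_to_mesh(pred_mesh.vertices, gt_mesh, point_to_mesh_distance, eps)
    recall = acc_cloud_to_mesh(gt_mesh.vertices, pred_mesh, point_to_mesh_distance, eps)
    return 2.0 * precision * recall / max(precision + recall, 1e-12)
```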
ScanNet: Uncertainty-Aware Reconstruction
To evaluate the usefulness of the predicted uncertainty we use the task of 3D reconstruction on ScanNet [7] scenes. In this experiment we leverage the depth uncertainty for measurement selection by masking out pixels with uncertainty above a preselected threshold during the integration process. We compare our method to several other recently proposed depth uncertainty estimation methods, all implemented on top of the same MonoDepth2 framework. Photometric uncertainty refers to Eq. (3), which is used by D3VO [58] to improve visual odometry. Self-teaching refers to the method proposed by Poggi et al. [42], where we use the model without uncertainty as a teacher for training the student network
Fig. 4. ScanNet: mean reconstruction precision (a) and recall (b) as well as 2D depth RMSE (c) curves on the validation set for various filtering thresholds on the uncertainty. A monotonically decreasing precision curve indicates that the uncertainty correlates well with the errors in the depth maps used for fusion while a higher recall means that smaller portions of the geometry are being removed
in a supervised way. Discrete depth predicts a discrete disparity volume [24], from which continuous depth and variance can be derived. Each of these methods constitutes a fair baseline, as all are fully self-supervised and monocular.

Table 1. ScanNet: 2D depth, 2D uncertainty and 3D reconstruction metrics. All methods are based on the same MonoDepth2 architecture and are our own (re)implementations. ↑ and ↓ denote whether a higher or lower score is better. AUSE and AURG are measured on Abs Rel.

Method                        Abs Rel↓ Sq Rel↓ RMSE↓ RMSE log↓ δ<1.25↑ δ<1.25²↑ δ<1.25³↑ AUSE↓ AURG↑  Precision↑ Recall↑ F-score↑
No uncertainty (MonoDepth2)   0.146    0.088   0.425 0.204     0.800   0.948    0.985    -     -      0.216      0.395   0.276
Photometric uncertainty       0.154    0.098   0.426 0.215     0.787   0.940    0.979    0.102 -0.008 0.204      0.395   0.266
Self-teaching [42]            0.170    0.115   0.529 0.246     0.690   0.914    0.975    0.056 0.034  0.194      0.223   0.204
Discrete depth [24]           0.147    0.086   0.419 0.202     0.796   0.946    0.984    0.091 -0.001 0.212      0.392   0.272
VDN (fixed prior)             0.148    0.093   0.416 0.211     0.797   0.944    0.981    0.083 0.006  0.217      0.392   0.276
VDN (VampPrior)               0.144    0.085   0.402 0.194     0.801   0.948    0.985    0.083 0.003  0.219      0.395   0.279
VDN (fixed prior, 10 scenes)  0.414    0.495   1.036 0.787     0.318   0.546    0.675    0.125 0.094  0.096      0.262   0.138
VDN (VampPrior, 10 scenes)    0.287    0.238   0.680 0.348     0.494   0.797    0.931    0.152 0.008  0.103      0.261   0.144
Table 1 summarises the results for the standard 2D depth and 3D reconstruction metrics. First, we note that photometric uncertainty performs considerably worse than the other methods. Discrete depth performs generally on par with the No uncertainty baseline. VDN slightly outperforms all baselines on most metrics. Figure 4a shows the mean reconstruction precision when increasing the uncertainty threshold at which predictions are considered valid. We expect to see a downwards trend, as using more uncertain predictions should decrease the accuracy of the reconstructed mesh. Here, the photometric uncertainty does not show this behaviour, whereas the variational and discrete uncertainty do show it, with discrete generally having a higher precision everywhere except when using more than 90% of the pixels. Conversely, Fig. 4b shows the mean reconstruction recall where a rapid increase signifies that larger pieces of the scene geometry are being cut out. For the sake of completeness, in Fig. 4c we also a provide
Fig. 5. (a) Meshes constructed using the ground-truth depth maps from ScanNet [7] (scene 0019 00); (b) Coloured meshes using the predicted depths; (c) Meshes from predicted depth, coloured by the cloud-to-mesh distances from the ground truth; (d) Meshes from predicted depth, coloured by the depth uncertainty (low → high)
similar plots for the mean RMSE as measured on 2D depth images. Figures 5a and 5b show reconstructions from ground-truth and predicted depths for all uncertainty-aware baselines, and Figs. 5c and 5d depict the corresponding cloud-to-mesh distances and uncertainties. Notice how the photometric uncertainty anti-correlates with the precision, while the discrete depth merely increases the uncertainty with the distance from the camera. The output of the self-teaching model is not very interpretable either, as it models the aleatoric noise in the teacher network. More qualitative examples are disclosed in Appendix B.
4.3 ScanNet: Prior Ablation Study
To investigate the adverse effects of naively specifying a prior distribution over the disparity, we compare the VampPrior against a truncated normal distribution with fixed location and scale parameters at 0.5 and 2.0 respectively, in two training scenarios: on the full training data and on a subset of 10 scenes only. The latter setup is especially interesting because it exacerbates any undesirable influence the prior might have onto the approximate posterior due to the lack of sufficient training data. While both priors are capable of regularising the spread of the variational posterior, the VampPrior shows superior results as presented
Fig. 6. (a) Sample input images from the Eigen test split [12] in KITTI [14]; (b) Predicted disparities; (c) Predicted disparity variance; (d) Estimated depth variance using 100 samples

Table 2. KITTI: 2D depth and uncertainty evaluation results on the Eigen test split [12] with raw LiDAR ground truth (80 m). AUSE/AURG are reported for the Abs Rel, RMSE and δ<1.25 metrics. The Boot+Self and Photometric uncertainty scores are taken from Tables 10 and 13 in the supplementary material of [42].

Method                            Abs Rel↓ Sq Rel↓ RMSE↓ RMSE log↓ δ<1.25↑ δ<1.25²↑ δ<1.25³↑ | Abs Rel: AUSE↓ AURG↑ | RMSE: AUSE↓ AURG↑ | δ<1.25: AUSE↓ AURG↑
No uncertainty (MonoDepth2 [17])  0.115    0.882   4.791 0.190     0.879   0.961    0.982    | -     -              | -     -           | -     -
Boot+Self [42]                    0.111    0.826   4.667 0.184     0.880   0.961    0.983    | 0.033 0.040          | 2.124 1.857       | 0.033 0.077
Photometric uncertainty [42]      0.113    0.928   4.919 0.192     0.876   0.958    0.981    | 0.051 0.027          | 3.097 1.188       | 0.060 0.056
VDN (ours)                        0.117    0.882   4.815 0.195     0.873   0.959    0.981    | 0.058 0.018          | 1.942 2.140       | 0.085 0.030
in the bottom half of Table 1. In particular, in the low data regime, it achieves significantly better scores on most metrics.
4.4 KITTI: 2D Depth Evaluation
In order to benchmark VDN on the KITTI dataset [14] against comparable prior work, we have selected as baselines the original MonoDepth2 [17], referred to as No uncertainty, the MonoDepth2 (Boot+Self ) from Poggi et al. [42], which does account for depth uncertainty through self-teaching and bootstrapped ensemble learning, and the Photometric uncertainty baseline also presented in [42] under the name MonoDepth2-Log. Table 2 shows the depth and uncertainty results for the VDN and the baselines. Our model performs slightly worse than the baselines except for the RMSE-AUSE and AURG metrics, which we attribute to the increased amount of noise during training, stemming from the stochastic sampling operations. Figure 6a shows three example inputs from the test set with their corresponding disparity location and scale predicted parameters in Figs. 6b and 6c. The resulting depth uncertainty is illustrated in Fig. 6d, which highlights the depth ambiguity of the sky and distant, indistinguishable objects.
5 Conclusions
We have presented a probabilistic extension of MonoDepth2, which learns a parametric posterior distribution over depth. The method yields useful uncertainty, which correlates well with the error in the depth predictions and consequently,
we have shown that one can use the uncertainty to mask out unreliable pixels and improve the precision of meshes in a 3D scene reconstruction task. Such masking, however, can come at a cost of decreased recall, resulting in sparser meshes. It is therefore a promising direction for future work to combine our method with a disparity [27] or mesh completion algorithm [8]. Other extensions of our work could combine the photometric and variational depth uncertainties, as the former is complementary to the latter, or apply VDN to multi-view, selfsupervised depth estimation [54]. Finally, we note that due to the stochastic nature of our method, it is moderately demanding on computation and memory resources during training, as an additional forward-pass is needed for the VampPrior, and multiple samples are drawn from the approximate posterior to estimate the likelihood and KL-divergence terms of the loss. In addition, the depth uncertainty is computed from samples of the transformed disparity posterior. For the training and evaluation of all models we have used a single NVIDIA RTX A5000 GPU with 24 GB of memory. Acknowledgements. We highly appreciate the constructive feedback and suggestions from our colleagues Mohsen Ghafoorian and Alex Bailo as well as the consistent support from Gerhard Reitmayr and Eduardo Esteves.
References 1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/ 2. Bloesch, M., Czarnowski, J., Clark, R., Leutenegger, S., Davison, A.J.: CodeSLAMlearning a compact, optimisable representation for dense visual slam. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2560–2568 (2018) 3. Boˇziˇc, A., Palafox, P., Thies, J., Dai, A., Nießner, M.: TransformerFusion: monocular RGB scene reconstruction using transformers. arXiv preprint arXiv:2107.02191 (2021) 4. Brickwedde, F., Abraham, S., Mester, R.: Mono-SF: multi-view geometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2780–2790 (2019) 5. Burkardt, J.: The truncated normal distribution. Department of Scientific Computing Website, Florida State University, pp. 1–35 (2014). https://people.sc.fsu. edu/jburkardt/presentations/truncated normal.pdf 6. Christian, J.A., Cryan, S.: A survey of lidar technology and its use in spacecraft relative navigation. In: AIAA Guidance, Navigation, and Control (GNC) Conference, p. 4641 (2013) 7. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR). IEEE (2017) 8. Dai, A., Ruizhongtai Qi, C., Nießner, M.: Shape completion using 3D-encoderpredictor CNNs and shape synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5868–5877 (2017)
9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
10. Dijk, T.V., Croon, G.D.: How do neural networks see depth in single images? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2183–2191 (2019)
11. Dillon, J.V., et al.: TensorFlow distributions. arXiv preprint arXiv:1711.10604 (2017)
12. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283 (2014)
13. Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_45
14. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) 32, 1231–1237 (2013)
15. Girardeau-Montaut, D.: CloudCompare. France: EDF R&D Telecom ParisTech (2016). https://www.cloudcompare.org/
16. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279 (2017)
17. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838 (2019)
18. Graves, A.: Practical variational inference for neural networks. Adv. Neural Inf. Process. Syst. 24 (2011)
19. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: International Conference on Machine Learning, pp. 1321–1330. PMLR (2017)
20. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2004). https://doi.org/10.1017/cbo9780511811685
21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
22. Horaud, R., Hansard, M., Evangelidis, G., Ménier, C.: An overview of depth cameras and range scanners based on time-of-flight technologies. Mach. Vis. Appl. 27(7), 1005–1020 (2016). https://doi.org/10.1007/s00138-016-0784-4
23. Ilg, E., et al.: Uncertainty estimates and multi-hypotheses networks for optical flow. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 652–667 (2018)
24. Johnston, A., Carneiro, G.: Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4756–4765 (2020)
25. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)
26. Ke, T., Do, T., Vuong, K., Sartipi, K., Roumeliotis, S.I.: Deep multi-view depth estimation with predicted uncertainty. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 9235–9241. IEEE (2021)
27. Keltjens, B., van Dijk, T., de Croon, G.: Self-supervised monocular depth estimation of untextured indoor rotated scenes. arXiv preprint arXiv:2106.12958 (2021)
28. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977 (2017)
29. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: ICLR: International Conference on Learning Representations, pp. 1–15 (2015)
30. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014, Conference Track Proceedings (2014). http://arxiv.org/abs/1312.6114
31. Klodt, M., Vedaldi, A.: Supervising the new with the old: learning SFM from SFM. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 698–713 (2018)
32. Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph. (ToG) 36(4), 1–13 (2017)
33. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
34. Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6647–6655 (2017)
35. Liu, C., Gu, J., Kim, K., Narasimhan, S.G., Kautz, J.: Neural RGB (R) D sensing: depth and uncertainty from a video camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10986–10995 (2019)
36. Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Rob. 31(5), 1147–1163 (2015)
37. Murez, Z., van As, T., Bartolozzi, J., Sinha, A., Badrinarayanan, V., Rabinovich, A.: Atlas: end-to-end 3D scene reconstruction from posed images. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020, Part VII. LNCS, vol. 12352, pp. 414–431. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_25
38. Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136. IEEE (2011)
39. Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M.: Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. Graph. (ToG) 32(6), 1–11 (2013)
40. Obukhov, A.: Truncated normal distribution in PyTorch (2020). https://github.com/toshas/torch_truncnorm
41. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
42. Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: On the uncertainty of self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3227–3237 (2020)
43. Remondino, F., Stoppa, D.: TOF Range-imaging Cameras, vol. 68121. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-27523-4
44. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 1, pp. 519–528. IEEE (2006)
45. Sun, J., Xie, Y., Chen, L., Zhou, X., Bao, H.: NeuralRecon: real-time coherent 3D reconstruction from monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15598–15607 (2021)
46. Takahashi, H., Iwata, T., Yamanaka, Y., Yamada, M., Yagi, S.: Variational autoencoder with implicit optimal priors. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5066–5073 (2019)
47. Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6243–6252 (2017)
48. Tomczak, J., Welling, M.: VAE with a VampPrior. In: International Conference on Artificial Intelligence and Statistics, pp. 1214–1223. PMLR (2018)
49. Ummenhofer, B., et al.: DeMoN: depth and motion network for learning monocular stereo. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047 (2017)
50. Walz, S., Gruber, T., Ritter, W., Dietmayer, K.: Uncertainty depth estimation with gated images for 3D reconstruction. In: 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8. IEEE (2020)
51. Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030 (2018)
52. Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.Q.: Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8445–8453 (2019)
53. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
54. Watson, J., Mac Aodha, O., Prisacariu, V., Brostow, G., Firman, M.: The temporal opportunist: self-supervised multi-frame monocular depth. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1164–1174 (2021)
55. Whelan, T., Leutenegger, S., Salas-Moreno, R., Glocker, B., Davison, A.: ElasticFusion: dense SLAM without a pose graph. In: Robotics: Science and Systems (2015)
56. Xu, H., et al.: Digging into uncertainty in self-supervised multi-view stereo. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6078–6087 (2021)
57. Xu, Y., Zhu, X., Shi, J., Zhang, G., Bao, H., Li, H.: Depth completion from sparse LiDAR data with depth-normal constraints. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2811–2820 (2019)
58. Yang, N., Stumberg, L.v., Wang, R., Cremers, D.: D3VO: deep depth, deep pose and deep uncertainty for monocular visual odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1281–1292 (2020)
59. Yang, N., Wang, R., Stuckler, J., Cremers, D.: Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 817–833 (2018)
60. Yin, Z., Shi, J.: GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992 (2018)
61. Zhou, Q.Y., Park, J., Koltun, V.: Open3D: a modern library for 3D data processing. arXiv:1801.09847 (2018)
62. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858 (2017)
Unsupervised Joint Image Transfer and Uncertainty Quantification Using Patch Invariant Networks Christoph Angermann(B) , Markus Haltmeier , and Ahsan Raza Siyal Department of Mathematics, University of Innsbruck, Technikerstraße 13, 6020 Innsbruck, Austria [email protected] http://applied-math.uibk.ac.at/
Abstract. Unsupervised image transfer enables intra- and inter-modality image translation in applications where a large amount of paired training data is not abundant. To ensure a structure-preserving mapping from the input to the target domain, existing methods for unpaired image transfer are commonly based on cycle-consistency, which requires additional computational resources and introduces instability due to the learning of an inverse mapping. This paper presents a novel method for unidirectional domain mapping that does not rely on any paired training data. A proper transfer is achieved by using a GAN architecture and a novel generator loss based on patch invariance. To be more specific, the generator outputs are evaluated and compared at different scales, also leading to an increased focus on high-frequency details as well as an implicit data augmentation. This novel patch loss also offers the possibility to accurately predict aleatoric uncertainty by modeling an input-dependent scale map for the patch residuals. The proposed method is comprehensively evaluated on three well-established medical databases. As compared to four state-of-the-art methods, we observe significantly higher accuracy on these datasets, indicating great potential of the proposed method for unpaired image transfer with uncertainty taken into account. Implementation of the proposed framework is released here: https://github.com/anger-man/unsupervised-image-transfer-and-uq.

Keywords: Unsupervised image transfer · Uncertainty quantification · Generative adversarial network · Patch invariance · Modality propagation · Accelerated MRI · Radiotherapy

1 Introduction
Image transfer, e.g., within and across medical imaging modalities, has gained a lot of popularity in the last decade of research [1]. The application range of inter- and intra-modality image translation is multifaceted and can help to overcome key weaknesses of an acquisition method. For example, modality propagation in
magnetic resonance imaging (MRI) is of high interest, since acquisition of multiple contrasts is crucial for better diagnosis in many clinical protocols [2]. Especially acquiring T1-weighted (T1w) and T2-weighted (T2w) contrasts increases scanning time significantly and thus is often formulated as a translation task from T1w to T2w. Another example is accelerated MRI, where medical costs and patient stress are minimized by decreasing the amount of k-space measurements [3,4]. Such methods for MRI reconstruction of undersampled measurements allow the use of MRI in applications where it is currently too time and resource intensive. A third application to mention here is automated computed tomography (CT) synthesis based on MRI images, which allows for MRI-only treatment planning in radiation therapy. An MRI-to-CT synthesis method can eliminate the need for CT simulation and therefore improves the treatment workflow and reduces radiation exposure for the patient during radiotherapy [1,5–8]. The applications just discussed correspond to (un)supervised image transfer, which aims to translate an image from one domain to another. Supervised approaches exploit the inter-domain correspondence between input and output data [9]. These methods rely on large amounts of paired data and perfectly registered scans of the same patient, which are not always abundant in medical applications [5]. Unsupervised approaches are commonly built on a generative adversarial network (GAN) [10] that assimilates the distribution of the generated samples to the real distribution of the target domain by employing an adversarial discriminator network. To ensure that the synthesized output does not become irrelevant to the input, additional constraints may be added to the generator loss [11–14]. Especially cycle-consistency is a well-received method for structure preservation in fully unsupervised medical image transfer [6–9]. However, a cycle-consistent GAN (cycleGAN) requires the parallel learning of an inverse mapping. As a result, the training time is significantly increased and the final performance depends on the inverse transfer function. Furthermore, cycle-consistent GANs compare the reconstruction on a whole-image basis and therefore pay less attention to fine structures and high-frequency details. Although current methods provide powerful tools for high-dimensional image transfer, the generator loss calculation usually assumes that the learned mappings are correct. This represents a significant source of instability and erroneous predictions when considering out-of-distribution (OOD) data during training [9,15]. Tackling data-dependent uncertainty in deep computer vision has raised a lot of interest in recent years and provided effective tools to check the reliability of a model's predictions in supervised applications. However, research on modeling uncertainty inherent from data in a completely unsupervised setting is still limited [9,16] and needs deeper investigation. This paper presents a novel GAN approach to fully unpaired medical image transfer, including prediction of data-dependent uncertainty and invariance over patches. To be more exact, a Wasserstein generative adversarial network (WGAN) [17] is leveraged for a uni-directional image transfer model. Structural correspondence between input and target modality is guaranteed using a novel generator loss that enforces invariance over image patches.
Furthermore, the patch-based residuals are assumed to follow a zero-mean Laplace distribution with the scale parameter being a function of the input. As a consequence, the generator is allowed to predict uncertainty that operates as a learned loss attenuation and can be used to indicate the quality of a transferred image in the absence of ground truth data. The proposed model and training strategy are evaluated in three different unsupervised scenarios: modality propagation using T1w and T2w brain MRI from the IXI [18] database; accelerated MRI enhancement using emulated single-coil knee MRI from the FastMRI [3] database; and MRI-to-CT synthesis using head CT scans from the CQ500 [19] database. The proposed framework is benchmarked against state-of-the-art works for uni-directional and bi-directional image translation [9,11,12,14]. We not only evaluate accuracy on unseen test data but further investigate robustness to perturbed inputs.

Contributions:
– We present a uni-directional framework that enables fully unsupervised image transfer of medical data while preserving fine structures.
– Structural correspondence between different characteristics and modalities is ensured by an improved generator loss based on patch invariance. This also yields implicit data augmentation for the critic and generator networks.
– In addition to the transferred image, the model provides an uncertainty map that correlates with the prediction error, indicating the quality of a mapped instance.
2 Related Work
2.1 Generative Adversarial Networks
The GAN architecture [10] is composed of a generator network G : Z → X and an adversarial part f : X → [0, 1]. The generator maps from a latent space Z to image space X, where the parameters of G are adapted such that the distribution of the synthesized examples assimilates to the real data distribution on X. Simultaneously, the adversarial f is trained to distinguish between synthesized and real instances. In a two-player min-max game, generator parameters are updated to fool a steadily improving discriminator [20]. Modifications to the joint loss functional of the generating and the adversarial part yielded improved variants of the initial GAN framework, such as WGAN [17], improved WGAN [21], LSGAN [22] and SNGAN [23]. While GANs reach outstanding performance in image synthesis [24,25], they are also well accepted for improving prediction quality in supervised image applications such as super-resolution [26], paired image translation [27,28], and medical image enhancement [5,29].
2.2 Unpaired Image Transfer and Domain Mapping
Unpaired image translation maps an image from input to target domain where corresponding samples from both spaces are hard to obtain or applied registration methods yield too much misalignment. In these cases, cycleGAN [11]
has become the gold standard, since it learns an inverse mapping from the target domain back to the input space. The core idea of cycleGAN is that the synthesized image must retain enough detail of the input instance in the target domain to allow for reconstruction. Especially in the medical sector, a structure-preserving transfer function is of high priority. Wolterink et al. [7] utilized cycle-consistency for MRI-only treatment planning in radiotherapy. Hiasa et al. [8] improved the cycleGAN architecture in this application area by adding a gradient consistency loss to pay more attention to the edges in the image. Yang et al. [6] added to cycleGAN a structure-consistency loss based on a modality independent neighborhood descriptor. Learning an inverse GAN framework simultaneously (bi-directional) in order to ensure input-output consistency increases hardware requirements and introduces additional instability if the inverse generator is not trained sufficiently or if the transfer mapping is not injective. Fu et al. [12] investigated the geometry-consistent GAN (gcGAN), a uni-directional approach that enforces consistency when applying geometric transformations (rotation, flipping) before and after propagation through the generator network. Benaim and Wolf [13] considered a GAN in combination with a distance constraint, where the distance between two samples from the input domain should be preserved after mapping to the target domain. A very recent and successful uni-directional approach to unpaired domain mapping was made by Park et al. [14] using contrastive unpaired translation (CUT), i.e., structure-consistency is preserved by matching patches of the input and the synthesized instance using an additional classification step.
2.3 Uncertainty Quantification
Uncertainty quantification methods have been applied to solve a variety of real-world problems in computer vision where, in addition to the model's response, a measure of its confidence is provided [30]. In general, two broad categories of uncertainty are considered: aleatoric uncertainty captures noise inherent in the data, and epistemic/model uncertainty describes uncertainty in the model parameters [15]. The latter type of uncertainty occurs in finite data settings and thus can be explained away by providing a sufficient amount of data. Bayesian models provide a mathematically grounded framework that can account for model uncertainty in combination with Bayesian inference techniques. Gal and Ghahramani [31] set up a theoretical framework that casts the dropout technique as approximate Bayesian inference, enabling a rather simple calculation of epistemic uncertainty by multiple network forward passes. The works of Saatci and Wilson [16] as well as Palakkadavath and Srijith [32] leverage this framework for Bayesian GANs and show that considering Bayesian learning principles can address mode collapse in image synthesis. Kendall and Gal [15] explored the benefits of modeling aleatoric and epistemic uncertainty simultaneously in image segmentation and regression and concluded that the two types of uncertainty are not mutually exclusive, but in fact complementary in different data scenarios. Upadhyay et al. modeled aleatoric uncertainty for MRI image enhancement [2]
and unsupervised image transfer [9] by introducing uncertainty-aware generalized adaptive cycleGAN (UGAC). Therefore, the latter work will also serve as a benchmark method for the proposed uncertainty-aware uni-directional image transfer approach.
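To make the MC-dropout procedure referenced above concrete, the following minimal sketch shows how epistemic uncertainty is typically approximated by keeping dropout active at test time and aggregating several stochastic forward passes. It is a generic illustration (model, input and sample count are placeholders), not code from any of the cited works:

```python
import torch

def mc_dropout_predict(model, x, num_samples=20):
    """Approximate epistemic uncertainty via MC dropout [31]: keep dropout
    stochastic at test time and aggregate several forward passes.
    `model`, `x` and `num_samples` are illustrative placeholders."""
    model.train()  # enables dropout; in practice only dropout layers should be switched
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(num_samples)], dim=0)
    return samples.mean(dim=0), samples.var(dim=0)  # predictive mean, epistemic variance
```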
3 Method
3.1 Preliminaries and GAN Architecture
The underlying structure of the proposed uncertainty-aware domain mapping is a GAN combined with a patch-invariant generator term. Let $\mathcal{X} \subset \mathbb{R}^{d \times d \times c_{in}}$ and $\mathcal{Y} \subset \mathbb{R}^{d \times d \times c_{out}}$ denote the input and the target domain, respectively. For simplicity we consider quadratic instances with the number of image pixels equal to $d^2$. Furthermore, let $X := \{x_1, \ldots, x_M\}$ be the set of $M$ given input images and $Y := \{y_1, \ldots, y_N\}$ the set of $N$ available but unaligned target images. $\mathbb{P}_X$ and $\mathbb{P}_Y$ denote the distributions of the images in both domains. The proposed image transfer is built on a generator function $G_\theta : \mathcal{X} \to \mathcal{Y}$, which aims to map an input sample to a corresponding instance in the target domain. The generator function is approximated by a convolutional neural network (CNN), which is parameterized by a weight vector $\theta$. By adjusting $\theta$, the distribution $\mathbb{P}_\theta$ of generator outputs may be brought closer to the real data distribution $\mathbb{P}_Y$ in the target domain. The distance between the generator distribution and the real distribution is estimated with the help of the critic $f_\omega : \mathcal{Y} \to \mathbb{R}$, which is parameterized by weight vector $\omega$ and is trained simultaneously with the generator network since $\mathbb{P}_\theta$ changes after each update to the generator weights $\theta$ [20]. We choose a network critic based on the Wasserstein-1 distance [17,20,33]. The Wasserstein-1 distance between two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ is defined as $W_1(\mathbb{P}_1, \mathbb{P}_2) := \inf_{J \in \mathcal{J}(\mathbb{P}_1, \mathbb{P}_2)} \mathbb{E}_{(x,y) \sim J} \lVert x - y \rVert$, where the infimum is taken over the set of all joint probability distributions that have marginal distributions $\mathbb{P}_1$ and $\mathbb{P}_2$. The Kantorovich-Rubinstein duality [33] yields

$$W_1(\mathbb{P}_1, \mathbb{P}_2) = \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{y \sim \mathbb{P}_1} f(y) - \mathbb{E}_{y \sim \mathbb{P}_2} f(y), \qquad (1)$$

where $\lVert \cdot \rVert_L \le C$ denotes that a function is $C$-Lipschitz. Equation (1) indicates that a good approximation to $W_1(\mathbb{P}_Y, \mathbb{P}_\theta)$ is found by maximizing $\mathbb{E}_{y \sim \mathbb{P}_Y} f_\omega(y) - \mathbb{E}_{y \sim \mathbb{P}_\theta} f_\omega(y)$ over the set of CNN weights $\{\omega \mid f_\omega : \mathcal{Y} \to \mathbb{R} \text{ 1-Lipschitz}\}$, where the Lipschitz continuity of $f_\omega$ can be enhanced via a gradient penalty [21]. Given training batches $y = \{y_n\}_{n=1}^{b}$, $y_n \overset{iid}{\sim} \mathbb{P}_Y$ and $x = \{x_n\}_{n=1}^{b}$, $x_n \overset{iid}{\sim} \mathbb{P}_X$, this yields the following empirical risk for critic $f_\omega$:

$$\ell_{cri}(\omega, \theta, y, x, p) := \frac{1}{b} \sum_{n=1}^{b} \Big[ f_\omega(G_\theta(x_n)) - f_\omega(y_n) + p \cdot \big( \lVert \nabla_{\tilde{y}_n} f_\omega(\tilde{y}_n) \rVert_2 - 1 \big)_+^2 \Big], \qquad (2)$$
C. Angermann et al.
Fig. 1. Utilizing patch invariance for unsupervised MRI propagation and uncertainty quantification. The T1w input and a corresponding random patch on a finer scale are fed to the generator Gθ , which outputs the synthesized T2w counterparts and corresponding scale maps (red). The synthesized patch and the corresponding patch of the full-size output are compared. Loss attenuation is introduced by the scale map of the synthesized patch. The generator is additionally updated using the Wasserstein-1 distance, estimated with the help of fω . (Color figure online)
where p denotes the influence of the gradient penalty, (·)+ := max({0, ·}) and iid y˜n := n · Gθ (xn ) + (1 − n ) · yn for n ∼ U[0, 1]. Since only the first term of the functional in (2) depends on θ and the goal for the generator is to minimize the Wasserstein-1 distance, the adversarial empirical risk for generator Gθ simplifies as follows: gen (θ, ω, x) := −
3.2
b 1 fω (Gθ (xn )). b n=1
(3)
Patch Invariance
In the frame of medical image translation, it is not sufficient to ensure that the output samples lie in the target domain. Great attention should be paid that a model also preserves global structure as well as fine local details. Let x ∈ Rd×d×c and Φ := (ρ, j1 , j2 ) ∈ [0.7, 1] × [0, d]2 j1 + ρd ≤ d ∧ j2 + ρd ≤ d . We define the patch operator P : Φ × Rd×d×c → Rd×d×c as follows: P(ρ, j1 , j2 )(x) := Rd×d×c (x [j1 : j1 + ρd, j2 : j2 + ρd, :]) ,
(4)
where (ρ, j1 , j2 ) ∈ Φ and Rd×d×c resizes the patch to original image size d×d×c. The patch operator P chooses a quadratic patch of 70% to 100% the input size and conducts resampling to original size (cf. Fig. 1). For resampling, we use bicubic interpolation1 . The basic intuition now is: if we take a patch of the input image and propagate it through the generator, than it should be equal to the corresponding patch 1
https://www.tensorflow.org/api docs/python/tf/image/resize.
Unsupervised Joint Image Transfer and Uncertainty Quantification
67
of the transferred full-size image. We choose the 1-norm for comparing the corresponding patches and ensure realistic synthesized patches by adding the patch operator also to the Wasserstein-1 critic. This yields the following improvements for the critic and the generator risks: := cri (ω, θ, y, x, p, φ)
b 1 fω (Gθ (xn )) − fω (yn ) b n=1
+ fω (Pφn (Gθ (xn ))) − fω (Pφn (yn )) + p · (. . .) b λ) := 1 gen (θ, ω, x, φ, − fω (Gθ (xn )) − fω (Pφn (Gθ (xn ))) b n=1
+λ·d
−2
(5)
2
,
Gθ (Pφn (xn )) − Pφn (Gθ (xn ))1 ,
(6)
patch (θ,xn ,φn )
where λ controls the influence of the patch loss and the patch extraction settings φ = {φn }bn=1 , φn ∈ Φ are chosen randomly at each risk calculation. This approach yields some practical advantages: The generator is forced to be consistent over smaller patches, which prevents the network to generate modes with highest similarity to the real data (mode collapse). Furthermore, the generator is prevented from learning arbitrary mappings between input and target domain (e.g., mapping a T1w MRI of an old lady to a T2w MRI of a young boy), because this memorized mapping would then also have to be fulfilled for all smaller patches, i.e., the transfer would also have to be memorized on arbitrary scales. The patch extractor can also be viewed as a magnification and cropping operation. This yields a higher penalty when comparing fine structures that may not have much effect on the loss function when compared on full image scale. Finally, patch extraction causes implicit data augmentation and can help to avoid critic overfitting where the critic is tempted to memorize training samples. 3.3
Uncertainty by Loss Attenuation
We consider now Eq. (6) from a probabilistic point of view. For x ∈ X , let a = Pφ (Gθ (x))) and b = Gθ (Pφ (x)) for a patch configuration φ ∈ Φ. If we force the patch invariance via the 1-norm, the underlying assumption is that every pixel of the residual := a − b should follow a zero-mean and fixed-scale Laplace 1 exp (−|j |/σ) distribution [9]. Consider residual pixel j ∼ Laplace(0, σ)(j ) = 2σ where σ represents the scale parameter of the distribution. Maximum likelihood (ML) optimization on the full image (note a and b are functions of θ) yields d 1 exp (−|aj − bj |/σ) . max θ 2σ j=1 2
(7)
68
C. Angermann et al.
Applying the negative logarithm and dividing by factor d2 results in d 1 min 2 |aj − bj |/σ + log(2σ), θ d j=1 2
(8)
which is equivalent to minimizing patch (θ, x, φ) in Eq. (6) when considering a fixed scale σ. The assumption of a fixed scale for the pixel-wise residuals is quite strong and may not hold in the presence of OOD data. The idea now is to consider individual scales for every pixel. Inspired by [9,15], we make the scale σ a function of input x, i.e., we split the generator Gθ (x) = [GIθ (x), Gσθ (x)] in the output branch and return two images, the transferred image GIθ (x) and the corresponding pixel-wise scale map Gσθ (x) for the residuals. This results in d 1 |GIθ (Pφ (x))j − Pφ (GIθ (x))j | + log (2 · Gσθ (Pφ (x))j ) . (9) d2 j=1 Gσθ (Pφ (x))j 2
patch (θ, x, φ) =
This can be seen as a loss attenuation as we get high values in Gσθ (x) for image regions with high absolute residuals. At the same time, the logarithmic term discourages the model to predict high uncertainty for all pixels. The proposed generator loss for patch invariant and uncertainty-aware image transfer is obtained by inserting patch (9) into gen (6). 3.4
Implementation Details
In this work, the generator is a U-net [34] with five downsampling operations and approximately 10.7 × 106 parameters. After the last upsampling operation, the U-Net is split into two branches to generate two responses, the transferred image GIθ (·) and the corresponding uncertainty map Gσθ (·), c.f. (9). A non-negative scale map is enforced by applying the softplus activation function sof tplus(x) := log(exp(x) + 1) to the latter output branch. A decoding network for the Wasserstein critic is built following the DCGAN critic [35] with 5 downsampling steps and approximately 4.7 × 106 parameters. Detailed information on critic and generator implementation can be found in the github repository. All models are trained using the Adam optimizer [36] with β1 = 0, β2 = 0.9 and minibatch size 8. The learning rate is set to 5 × 10−5 for the generator and 2 × 10−5 for the critic network. No learning rate scheduler or further data augmentation techniques are applied. The total amount of generator updates is 15k and we iterate between 15 critic updates and 1 generator update. Gradient penalty parameter p equals 10, the influence λ of the patch constraint is chosen for each data set individually by a grid search.
4 4.1
Experiments Datasets
We consider three different tasks in medical image-to-image translation.
Unsupervised Joint Image Transfer and Uncertainty Quantification
69
Modality Propagation: The IXI [18] database consists of registered T1w and T2w scans of 577 patients. We want to demonstrate the plausibility of our model for unpaired modality propagation and thus build a model for T1w to T2w transfer. To do so, we remove 10% of the patients for evaluation. The remaining patients are randomly split into input and target data, where no patient contributes to both domains at the same time. This is done to simulate a scenario where no paired slices are available throughout the entire training process. We only use the core 60% of all axial slices, which yields approximately 20k train slices for input, 20k train slices for target, and 4k pairs for evaluation. The spatial dimension d equals 256. Accelerated MRI Enhancement: The FastMRI [3] database consists of more than 1500 multi-coil diagnostic knee MRI scans and corresponding emulated single-coil data. Our experiments are based on a subset of nearly 800 coronal proton-density weighted scans without fat-suppression from the official train and validation single-coil releases. We remove 10% of the patients for evaluation and split the remaining patients into input and target data. While the target domain consists of slices from fully-sampled MRI scans, we consider 4x acceleration (only 25% of k-space measurements) for the slices by using the subsampling scheme discussed in [3,4]. This yields 7.1k train slices for input and 6.9k for target. We are aware that enhancement of accelerated MRI can also be considered as a supervised task since generation of paired instances is feasible. However, this experiment should demonstrate possible applicability of the proposed framework to inverse problems in general. MRI-to-CT Synthesis: The CQ500 [19] consists of CT scans of nearly 500 patients, where we use a randomly selected subset of 80 patients for the target domain. Furthermore, we make use of T1w MRI scans of 144 randomly selected patients taken from IXI [18] as input data. For each dataset we use the core 60% of all axial slices, which yields 11.4k train slices for input and 10.4k train slices for target. All slices are subsampled to spatial size d = 256. Note that in this experiment input and target data is coming from completely separated datasets (inconsistent head orientations, different brain areas, resolution, etc.). An additional challenge are artifacts outside the skull caused by the CT table and the measurement equipment in CQ500. We step away from any preprocessing here and investigate how the method reacts to this kind of artifacts. Since no ground truth data is available for the two separated databases only qualitative evaluation is conducted. 4.2
Compared Methods and Scenarios
We compare our approach to a variety of state-of-the-art methods for unsupervised image transfer that have already been introduced in Sect. 2.2 and Sect. 2.3: cycle-consistent GAN (cycleGAN) [11], uncertainty-aware generalized adaptive cycleGAN (UGAC) [9], geometry-consistent GAN (gcGAN) [12] with horizontal flip and contrastive unpaired translation (CUT) [14]. We test two versions of our approach: the first version utilizes only patch invariance (PI, cf. (6)) and the
70
C. Angermann et al.
second version utilizes uncertainty-aware patch invariance (UAPI, cf. (9)). For cycleGAN, UGAC and gcGAN we make use of the same generator and critic architecture and training configurations as described in Sect. 3.4 to guarantee a fair comparison. For CUT, we use the publicly available github repository2 . All slices are normalized to range [0, 255] and handled as grayscale images. During optimization, the images are scaled to [−1, 1] to speed up training while evaluation metrics, the structural similarity index (SSIM) [37] and the peaksignal-to-noise-ratio (PSNR), are calculated on original image scale. For each of the three applications, we want to test not only performance on unseen assessment data, but also robustness to different types of perturbations. All approaches are trained on unaffected images (without additional noise) and evaluated in the following scenarios: GN0 (original test images); GN5 (adding Gaussian noise, deviation 5% of image range); GN10 (adding Gaussian noise, deviation 10%); GN20 (adding Gaussian noise, deviation 20%); IP2 (impulse perturbation, 2% of pixels replaced by random values); IP5 (5% of pixels replaced); IP10 (10% of pixels replaced). 4.3
Quantitative Evaluation
Table 1. Quantitative evaluation of our approach and four compared methods on two datasets under seven different noise scenarios. The reported metrics are the structural similarity index and the peak-signal-to-noise-ratio (SSIM/PSNR, higher is better). Methods
GN0
GN5
GN10
GN20
IP2
IP5
IP10
IXI test data cycleGAN 78.99/20.90
77.04/20.71 72.53/20.14
63.87/19.19
69.81/20.39 64.51/19.99
60.08/19.50
UGAC
79.31/21.21
77.24/21.05 72.53/20.43
64.03/19.37 70.15/20.68 64.88/20.22
61.01/19.78
gcGAN
81.00/21.52
76.51/20.66 69.96/19.50
57.75/17.90
73.79/20.30 69.24/19.55
63.91/18.78
CUT
78.67/21.15
76.10//21.22 71.43/20.58
61.27/19.67
74.11/21.19 71.78/20.94
68.97/.65
PI(ours)
82.53/22.18 69.82/21.49 51.90/20.64
36.96/19.26
54.85/20.92 44.10/20.24
38.73/19.70
UAPI(ours)79.99/22.62 78.76/22.2074.49/21.70 62.24/20.34 77.44/21.9374.29/21.56 69.35/21.14 FastMRI test data cycleGAN 81.60/21.48
70.71/19.45 59.59/17.89
40.87/15.80
62.95/18.35 57.50/17.74
50.80/17.02
UGAC
85.29/22.62
72.17/19.91 59.64/17.99
43.21/16.24
72.01/20.03 63.97/18.73
56.36/17.71
gcGAN
80.73/20.76
68.67/18.13 59.23/16.60
48.31/15.26
71.51/18.61 65.22/17.54
58.66/16.61
CUT
89.36/23.37
85.67/22.7183.32/ 22.1576.63/ 20.5886.61/23.0284.45/ 22.4582.59 /21.86
PI(ours)
85.85/23.37
76.22/20.85 66.44/18.94
53.46/17.19
75.77/20.80 67.61/19.28
59.44/18.04
UAPI(ours)90.37/ 24.7481.08/21.37 69.97/18.41
55.72/16.07
79.18/19.81 71.69/17.97
65.44/16.98
The quantitative metrics in Table 1 obtained on unseen test data indicate superior performance of our approach for the modality propagation task on IXI. Considering evaluation on clean test data, usage of patch invariance gives an increase in SSIM and PSNR metrics compared to the bi-directional (cycleGAN, UGAC) and uni-directional (gcGAN, CUT) benchmarks. As compared to the benchmark methods, consideration of uncertainty-aware patch invariance even 2
https://github.com/taesungp/contrastive-unpaired-translation.
Unsupervised Joint Image Transfer and Uncertainty Quantification
71
improves results, also for the scenarios with perturbed test data (GN5 to IP10). Robustness of UAPI is also established visually in Fig. 2. Especially for scenarios with high perturbations (GN10, GN20, IP5, IP10), we observe a performance advandtage when using the uncertainty-aware method UAPI. Interestingly, usage of PI without uncertainty awareness on perturbation scenarios GN5 to IP10 yields a significant decrease for the SSIM but not for the PSNR metric. This rather unexpected observation needs further investigation in future research.
Fig. 2. Visual analysis of the PSNR values on IXI under seven different test scenarios, obtained by our two approaches (PI, UAPI) and 4 compared methods (cycleGAN, UGAC, gcGAN, CUT).
In Table 1 we observe for the accelerated MRI enhancement task on FastMRI that our method UAPI significantly outperforms other benchmark on unaffected test data but gives modest accuracy when additional noise and perturbed pixels are added. Figure 3 shows the superior performance of CUT in terms of the PSNR metric for noisy data. This is quite interesting since that has not been the case for the previous modality propagation application. The task of accelerated MRI enhancement strongly differs from the other two applications. While the goal of modality propagation and MRI-to-CT synthesis is to come up with a completely new image, the aim of accelerated MRI enhancement is to improve quality of a already existing image. In fact, the methods cycleGAN, UGAC, gcGAN, PI and UAPI depend on a rather simple U-Net [34] implementation and a standard DCGAN critic [35] with the aim to demonstrate plausibility of different transfer approaches on easy-to-implement frameworks. The CUT method is a benchmark where the publicly available source code had to be used, consisting of a ResNet-based generator [11] and built-in data augmentation techniques that may better compensate for noisy input data. Nevertheless, our methods PI and UAPI seemingly achieve better results compared to the U-Net based benchmarks. We will take up investigation of robustness of our methods in combination with different network architectures as a future goal.
72
C. Angermann et al.
Fig. 3. Visual analysis of the PSNR values on FastMRI under seven different test scenarios, obtained by our two approaches (PI, UAPI) and 4 compared methods (cycleGAN, UGAC, gcGAN, CUT).
4.4
Qualitative Evaluation
Fig. 4. Evaluation samples of our approach and four compared methods on IXI and FastMRI for scenario GN0 (test data without perturbations). From left to right: input, images transferred by cycleGAN, UGAC, gcGAN, CUT, PI, UAPI and ground truth.
In Fig. 4 we analyze the prediction quality of our and compared approaches in a qualitative way. Considering modality propagation in MRI, we see that usage of uncertainty-aware patch invariance (UAPI) gives a better detailed weighting of the cerebrospinal fluid in the middle of the brain. In general, employing patch invariance yields better preservation of fine structures. This observation also applies to accelerated MRI enhancement. In particular, CUT and UAPI provide
Unsupervised Joint Image Transfer and Uncertainty Quantification
73
comparatively sharper knee images with more high-frequency details than the other methods.
Fig. 5. Evaluation samples of the UAPI method on unseen brain MRI slices. For every data pair, the input slice and the corresponding UAPI prediction are visualized on the left and the right side, respectively. The first row contains the images on original scale, the second row selected patches to visualize the prediction quality for detailed structures.
Qualitative evaluation plays an important role for the third investigated application, namely MRI-to-CT synthesis, where quantitative comparison is not possible due to lack of ground truth data. Satisfying results were obtained with the UAPI method, which are visualized in Fig. 5. Cavities and brain shapes are well preserved by our method although we used two completely independent and unaligned head datasets for this experiment. UAPI synthesizes brain table artifacts that are also visible in CQ500. A proper evaluation on cleaned CT data is necessary and thus will be considered as a future working step. 4.5
Uncertainty Scores
Additional to improved accuracy we demonstrate the efficacy of estimating the scale maps with the proposed method. The input-dependent non-negative scale maps are derived from the second output branch Gσθ , see (9). Indeed, the predicted scale maps are able to model uncertainty inherent from data. This can be observed in Fig. 6, where in addition to the transferred images also the predicted scale maps and the absolute residuals between predicted and ground truth images are displayed. Obviously, uncertainty is relatively greater in regions with higher residual values. From the scale maps it can be deduced for which positions the generator is comparatively uncertain in its prediction, such as the cerebral cortex and eye sockets in head MRI or the lateral knee ligaments in knee MRI. The correspondence between residual and scale maps suggests that the latter can be used as an approximation to a prediction’s residuals that are not available due to the lack of ground truth data in unsupervised learning. In order to quantitatively study this relationship we visualize mean absolute residual score and mean uncertainty maps for 512 randomly selected unseen test images in a scatter plot (see Fig. 7). Moreover, we compare our uni-directional method UAPI also to the relations observed by UGAC that models uncertainty with the help
74
C. Angermann et al.
Fig. 6. Position-based relation between abs. residuals and predicted scale maps on IXI (top) and FastMRI (bottom). Left to right: input, ground truth, prediction by UAPI, abs. residuals and predicted scale map.
Fig. 7. Scatter plot between abs. residual and scale map values on IXI (top) and FastMRI (bottom). The predictions are generated by UAPI (left) and UGAC (right).
of a bi-directional cycleGAN [9]. For modality propagation as well as accelerated MRI enhancement we visually observe an approximate positive linear correlation between mean absolute residual scores and mean uncertainty scores. We calculate the Pearson correlation coefficient (PCC) to obtain a quality estimate for the linear correlation and compare between UAPI and UGAC. Our method returns a slightly higher PCC on IXI (UAPI: 0.69, UGAC: 0.67). The discrepancy between both methods even increases on FastMRI (UAPI: 0.72, UGAC: 0.45). This further encourages the idea that scale maps derived from our approach can be used to indicate the overall quality of a transferred image.
5
Conclusions
In this paper we proposed a WGAN-based approach using patch invariance to employ joint image transfer and uncertainty quantification in an fully unsupervised manner. We demonstrate superior performance of our uni-directional method for modality propagation and accelerated MRI enhancement compared to four state-of-the-art benchmarks in unpaired image translation. Moreover, the method reaches qualitatively satisfying results for MRI-to-CT synthesis using completely unaligned databases during training. The predicted uncertainty can be representative of the residual maps and thus indicate the quality of a transferred image in the absence of ground truth data. Further investigation of the network architecture and improvement in robustness represents an important goal for future research. Future work will also include the application of the uncertainty-aware and patch invariant network to other unpaired image-to-image applications outside the medical sector.
Unsupervised Joint Image Transfer and Uncertainty Quantification
75
References 1. Wang, T., et al.: A review on medical imaging synthesis using deep learning and its clinical applications. J. Appl. Clin. Med. Phys. 22(1), 11–36 (2021), https:// aapm.onlinelibrary.wiley.com/doi/abs/10.1002/acm2.13121 2. Upadhyay, U., Sudarshan, V.P., Awate, S.P.: Uncertainty-aware GAN with adaptive loss for robust MRI image enhancement. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 3248–3257 (2021) 3. Zbontar, J., Knoll, F., et al.: fastMRI: an open dataset and benchmarks for accelerated MRI (2018). https://arxiv.org/abs/1811.08839 4. Hyun, C.M., Kim, H.P., Lee, S.M., Lee, S., Seo, J.K.: Deep learning for undersampled MRI reconstruction. Phys. Med. Biol. 63(13), 135007 (2018). https://doi.org/ 10.1088/1361-6560/aac71a 5. Lei, Y., et al.: MRI-only based synthetic CT generation using dense cycle consistent generative adversarial networks. Med. Phys. 46(8), 3565–3581 (2019) 6. Yang, H., et al.: Unpaired brain MR-to-CT synthesis using a structure-constrained CycleGAN. In: Stoyanov, D., et al. (eds.) DLMIA/ML-CDS -2018. LNCS, vol. 11045, pp. 174–182. Springer, Cham (2018). https://doi.org/10.1007/978-3-03000889-5 20 7. Wolterink, J.M., Dinkla, A.M., Savenije, M.H.F., Seevinck, P.R., van den Berg, C.A.T., Iˇsgum, I.: Deep MR to CT synthesis using unpaired data. In: Tsaftaris, S.A., Gooya, A., Frangi, A.F., Prince, J.L. (eds.) SASHIMI 2017. LNCS, vol. 10557, pp. 14–23. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68127-6 2 8. Hiasa, Y., et al.: Cross-modality image synthesis from unpaired data using CycleGAN. In: Gooya, A., Goksel, O., Oguz, I., Burgos, N. (eds.) SASHIMI 2018. LNCS, vol. 11037, pp. 31–41. Springer, Cham (2018). https://doi.org/10.1007/978-3-03000536-8 4 9. Upadhyay, U., Chen, Y., Akata, Z.: Robustness via uncertainty-aware cycle consistency. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 28261–28273. Curran Associates, Inc. (2021). https://proceedings.neurips.cc/ paper/2021/file/ede7e2b6d13a41ddf9f4bdef84fdc737-Paper.pdf 10. Goodfellow, I., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014). https://proceedings. neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf 11. Zhu, J., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251 (2017) 12. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., Tao, D.: Geometryconsistent generative adversarial networks for one-sided unsupervised domain mapping. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2427–2436 (2019) 13. Benaim, S., Wolf, L.: One-sided unsupervised domain mapping. Adv. Neural Inf. Process. Syst. 30 (2017) 14. Park, T., Efros, A.A., Zhang, R., Zhu, J.-Y.: Contrastive learning for unpaired image-to-image translation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 319–345. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-58545-7 19
76
C. Angermann et al.
15. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 30 (2017) 16. Saatci, Y., Wilson, A.G.: Bayesian GAN. Adv. Neural Inf. Process. Syst. 30 (2017) 17. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 214–223. PMLR (2017) 18. Robinson, E.C., Hammers, A., Ericsson, A., Edwards, A.D., Rueckert, D.: Identifying population differences in whole-brain structural networks: a machine learning approach. Neuroimage 50(3), 910–919 (2010) 19. Chilamkurthy, S., et al.: Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392(10162), 2388–2396 (2018) 20. Angermann, C., Schwab, M., Haltmeier, M., Laubichler, C., J´ onsson, S.: Unsupervised single-shot depth estimation using perceptual reconstruction. CoRR abs/2201.12170 (2022). https://arxiv.org/abs/2201.12170 21. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein GANs. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https:// proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6Paper.pdf 22. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802 (2017) 23. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: International Conference on Learning Representations (2018) 24. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019) 25. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018) 26. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690 (2017) 27. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017) 28. Arslan, A.T., Seke, E.: Face depth estimation with conditional generative adversarial networks. IEEE Access 7, 23222–23231 (2019) 29. Armanious, K., et al.: MedGAN: medical image translation using GANs. Comput. Med. Imaging Graph. 79, 101684 (2020) 30. Abdar, M., et al.: A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf. Fusion 76, 243–297 (2021) 31. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 1050–1059. PMLR, New York (2016). https://proceedings.mlr.press/v48/gal16.html 32. Palakkadavath, R., Srijith, P.: Bayesian generative adversarial nets with dropout inference. In: 8th ACM IKDD CODS and 26th COMAD, pp. 92–100 (2021)
Unsupervised Joint Image Transfer and Uncertainty Quantification
77
33. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-71050-9 34. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 35. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2015) 36. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). https:// arxiv.org/abs/1412.6980 37. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Uncertainty Quantification Using Query-Based Object Detectors Meet P. Vadera(B) , Colin Samplawski, and Benjamin M. Marlin University of Massachusetts, Amherst, USA {mvadera,csamplawski,marlin}@cs.umass.edu Abstract. Recently, a new paradigm of query-based object detection has gained popularity. In this paper, we study the problem of quantifying the uncertainty in the predictions of these models that derive from model uncertainty. Such uncertainty quantification is vital for many highstakes applications that need to avoid making overconfident errors. We focus on quantifying multiple aspects of detection uncertainty based on a deep ensembles representation. We perform extensive experiments on two representative models in this space: DETR and AdaMixer. We show that deep ensembles of these query-based detectors result in improved performance with respect to three types of uncertainty: location uncertainty, class uncertainty, and objectness uncertainty (Code available at: https://github.com/colinski/uq-query-object-detectors). Keywords: Uncertainty quantification · Deep ensembles Query-based object detection · Transformer · Mixer
1
·
Introduction
The advent of deep learning has led to the deployment of computer vision systems in countless domains. Of particular interest is the problem of object detection. However, there is considerable risk involved when deploying object detectors for high-stakes applications, such as in autonomous vehicles or medical imaging. This is due to the well-known property that deep networks often make overconfident errors. This can be mitigated by leveraging approaches that quantify aspects of detection uncertainty, such as deep ensembles or approximate Bayesian methods [2,6,8,16,32]. Recently a new paradigm of query-based object detection has generated significant interest. These models learn a set of query embeddings that can be directly decoded into bounding box predictions. The query embeddings learn to attend to any part of the image, allowing prediction at any location and scale. This offers significant benefits over the earlier dense detectors that make predictions for a set of anchor points. These query-based models are learned end-to-end with less hand engineering, require less post-processing, and show Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9 5. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Karlinsky et al. (Eds.): ECCV 2022 Workshops, LNCS 13808, pp. 78–93, 2023. https://doi.org/10.1007/978-3-031-25085-9_5
equal or better detection performance than previous approaches. Therefore, it is likely that query-based detectors will become a popular choice for real-world deployment. However, to the best of our knowledge, no prior work has studied how uncertainty in model parameters impacts uncertainty in detection outputs for these models. In this paper, we take the first step in answering that question by studying detection uncertainty derived from a deep ensemble of query-based detectors. We focus on three different aspects of detection uncertainty: a) the uncertainty in the location of a bounding box prediction, b) class uncertainty from the classification output of the detector, and c) objectness uncertainty, which considers the uncertainty in the model's prediction of whether there is an object at all in a given area in the scene. We show that using a deep ensemble of query-based detectors results in improved performance with respect to all three types of output uncertainty. In Fig. 1 we show an example of how a deep ensemble can be used to correct an overconfident error from a single detector.
Fig. 1. Example of an overconfident error corrected by a deep ensemble. Far left image shows ground-truth boxes. Center image shows detections from a single DETR model, with overconfident error marked in blue. Far right image shows boxes from merged clusters from the ensemble. We see that the overconfident error has been removed. (Color figure online)
2 Background and Related Work

2.1 Object Detection
Until recently, the dominant paradigm for detector design was dense object detection. In this framework, bounding box predictions are made at a dense grid of anchor points [3,10,13,25,26] or even at every pixel in the anchor-free setting [7,17,28]. This allows the model to more easily predict boxes at every location and scale. However, this approach can often lead to an over-prediction of boxes, necessitating the use of non-maximal suppression (NMS) and other post-processing steps. Recently, a new paradigm of object detection, called query-based detection, has been gaining in popularity. In this setting, a set of query embeddings are learned that can be decoded into bounding box predictions directly. The first work proposing this approach is the Detection Transformer (DETR) model [4]. This model uses a transformer encoder-decoder [30] pair to predict a set of detections from learned query embeddings. DETR is trained using a set-based
objective that uses bipartite matching to match query-generated predictions to ground-truth boxes. A number of follow-on works to DETR have been proposed that improve performance and speed up convergence [20,31,33]. Most recently, the AdaMixer model [9] replaces the transformers of the DETR family with a mixer model [29]. This leads to performance and training time that is superior to earlier transformer-based approaches. A major benefit of query-based detectors is that they no longer need to make dense predictions, as the query embeddings can naturally learn to attend to relevant objects in the image. This reduces the need for post-processing and hand-crafted model components, such as NMS and anchor boxes.

2.2 Uncertainty Quantification in Object Detection
There has been some prior work on approximate Bayesian inference for deep CNN-based object detection models. Much of this work looks at estimating uncertainty in the prediction of bounding-box properties such as location and size. [15] was one of the earliest works in this space that proposed using log-likelihood loss to estimate heteroscedastic aleatoric uncertainty for bounding-box regression. [18] builds upon this to have the neural network model explicitly produce an estimated output variance as in heteroscedastic regression. This requires a single forward pass through the neural network to estimate the output variance. [22] use MC Dropout and then compute the mean and variance in the context of bounding-box regression by computing statistics over spatially correlated outputs. The BayesOD approach [12] builds upon [18] to replace non-maximal suppression with Bayesian inference and provides a multivariate extension instead of the diagonal covariance of the standard log-likelihood for bounding box regression. In terms of evaluating detection uncertainty, the widely used metric for evaluating object detection models is the mean average precision (mAP) score, but the mAP score only evaluates the accuracy of the hard predictions. It does not account for the uncertainty in model predictions. Previous work by [11] proposed Probability-based Detection Quality (PDQ), a metric to evaluate the probabilistic aspect of the outputs. PDQ consists of two components: spatial quality and label quality. However, there are certain issues with the PDQ evaluation framework. First, the label quality is measured as the probability of the correct class, with the unmatched ground-truth and predicted bounding boxes contributing a score of 0. This is problematic because, as we highlight later in Sect. 3.3, a more reasonable way to assess the class uncertainty on unmatched predicted bounding boxes is to consider the NLL of the background class prediction. Second, while looking at spatial quality, the metric looks at two components: foreground loss and background loss. Foreground loss is the negative log-likelihood of the pixels in the ground-truth bounding box under the predicted distribution that assigns a probability value on whether a given pixel in the scene is a part of the detection. Background loss penalizes pixels by using the complement of the predicted distribution, by computing the NLL of the pixels present in the predicted bounding box, but not in the ground-truth bounding box. This is again
problematic, as it composes location uncertainty with evaluating how well the ground-truth bounding box area is covered in a way that is hard to decompose. Finally, while matching ground-truth bounding boxes to the predicted bounding boxes, the authors propose computing PDQ scores for every combination of the predicted bounding boxes and ground-truth bounding boxes. This can lead to leakage of label information at the time of matching, which is not desirable.
3 Uncertainty Estimation in Query-Based Detectors
In this section we first describe the specific query-based detectors we use in more detail. We then explain our proposed approach to merge the output of an ensemble of these detectors. Finally, we describe how we quantify different kinds of uncertainty derived from the ensemble.

3.1 Detection Transformer and AdaMixer
At the most abstract level, query-based detectors can be thought of as learning a set of query embeddings in a latent embedding space that are decoded into bounding box predictions. A decoder function is then learned which takes as input the image features from a CNN backbone and the query embeddings. The output of the decoder is a new set of transformed query embeddings of the same shape. These can then be transformed into class and bounding box predictions using MLP output heads. In the case of DETR and AdaMixer, a set of 100 256-dimensional query embedding vectors is used. Furthermore, both models use ResNet50 as the backbone [14]. In the case of DETR, the decoder is a transformer that computes cross-attention between image features and query embeddings. AdaMixer replaces the transformer with an MLP mixer model. This decoder uses position-aware adaptive mixing between the query embeddings and image features. This has been shown to result in faster convergence and better performance than DETR. For each bounding box at index i in the training dataset, the ground-truth can be denoted as $y_i = (c_i, b_i)$, where $c_i$ denotes the class label, and $b_i \in [0, 1]^4$ denotes the x, y coordinates of the center of the box, as well as the height and width in normalized coordinates. In general, there are more queries than ground-truth objects in an image. To assign predictions to ground-truth boxes, a bipartite matching problem is solved using the Hungarian algorithm. This results in the following training loss:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \text{``background''}\}}\, \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \quad (1)$$

Here, $\hat{\sigma}(i)$ denotes the permutation of the predictions that minimizes the matching loss between predictions and the ground-truth. Furthermore, $\hat{p}_{\hat{\sigma}(i)}(c_i)$ denotes the predicted probability for class $c_i$ under the $\hat{\sigma}(i)$ permutation, and $\hat{b}_{\hat{\sigma}(i)}$ denotes the predicted bounding box coordinates under $\hat{\sigma}(i)$. It is important to note that this matching computation is only required during training. $\mathcal{L}_{\text{box}}$
is constructed using two components: the IoU loss ($\mathcal{L}_{\text{iou}}$) and the L1 regression loss:

$$\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) = \lambda_{\text{iou}} \mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\hat{\sigma}(i)}) + \lambda_{L1} \| b_i - \hat{b}_{\hat{\sigma}(i)} \|_1 \quad (2)$$
Here, $\lambda_{\text{iou}}$ and $\lambda_{L1}$ denote the hyperparameters for weighting the two components of the loss function.
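The bipartite matching step above can be illustrated with an off-the-shelf assignment solver. The sketch below is not the authors' implementation: it builds a toy cost matrix from a classification term and an L1 box term only (the IoU term is omitted for brevity), and the weights and helper names are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes,
                      w_cls=1.0, w_l1=5.0):
    """Toy bipartite matching between query predictions and ground-truth boxes.

    pred_probs: (Q, C+1) softmax outputs per query (last class = background).
    pred_boxes: (Q, 4) predicted boxes in normalized (cx, cy, w, h).
    gt_labels:  (G,) ground-truth class indices.
    gt_boxes:   (G, 4) ground-truth boxes in normalized (cx, cy, w, h).
    Returns arrays (query_idx, gt_idx) of matched pairs.
    """
    # Classification cost: negative predicted probability of the true class.
    cost_cls = -pred_probs[:, gt_labels]                          # (Q, G)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_l1 * cost_l1                      # (Q, G)
    # Hungarian algorithm: minimum-cost one-to-one assignment.
    return linear_sum_assignment(cost)

# Usage with random tensors: 5 queries, 2 ground-truth objects, 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=5)
boxes = rng.uniform(0.0, 1.0, size=(5, 4))
q_idx, g_idx = match_predictions(probs, boxes, np.array([0, 2]),
                                 rng.uniform(0.0, 1.0, size=(2, 4)))
```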
3.2 Merging Bounding Boxes
When using an ensemble of classifiers, it is straightforward to merge the outputs of the ensemble members by computing the ensemble average class distribution. However, in our case, we have multiple bounding box candidates generated by each ensemble member. Different ensemble members may predict different numbers of boxes and boxes in different locations as well as different class distributions for similar boxes. This variability across the ensemble is exactly the manifestation of model uncertainty that we aim to quantify. To summarize class and location uncertainty, we begin by merging the bounding box predictions from different members of the ensemble using the Basic Sequential Algorithmic Scheme (BSAS) clustering method [21,27]. For a given input, denote the set of outputs of each of the members of the ensemble as $D_k = \cup \{(b_{ik}, s_{ik})\}_{i=1}^{N}$, where $i$ denotes the index of the output bounding box, $k$ denotes the index of the model in the ensemble, $b$ denotes the predicted output corresponding to the location of the bounding box, and $s$ denotes the predicted class distribution for the bounding box. Furthermore, let $O_m$ denote the $m$th cluster. Detections are sequentially assigned to clusters if the maximum IoU between detection $D_i$ and any existing $O_m$ is greater than or equal to a set threshold $\omega$. Otherwise, the detection is assigned as a cluster of its own. Once the clusters are formed, we produce the posterior predictive class distribution as follows:

$$\hat{s}_i = \frac{1}{K} \sum_{k=1}^{K} s_k, \quad \text{where } K = |O_i| \quad (3)$$
For merging the bounding box location predictions, there are several options. First, we could simply compute the "average" bounding box, formed by taking the average of the top-left and bottom-right coordinates of boxes in the same cluster. In fact, the average merging strategy has been used in the past to merge bounding boxes [23]. In addition to the average bounding box, we could also generate a "maximal" bounding box or a "minimal" bounding box. A maximal bounding box, by definition, is the tightest bounding box that encapsulates all the bounding boxes within a cluster. On the other hand, a minimal bounding box is the intersection of all the bounding boxes within a cluster. Consider the bounding box $b_j = (x_{1j}, y_{1j}, x_{2j}, y_{2j})$ in a cluster $O_i$, where $(x_{1j}, y_{1j})$ indicates the coordinates of the top-left corner, and $(x_{2j}, y_{2j})$ indicates the coordinates of the bottom-right corner; then the merged bounding box is obtained by:
$$\hat{b}_i^{\text{avg}} = \frac{1}{J} \sum_{j=1}^{J} b_j, \quad \text{where } J = |O_i| \quad \text{(average bounding box)} \quad (4)$$

$$\hat{b}_i^{\text{max}} = (\min(\mathbf{x}_1), \min(\mathbf{y}_1), \max(\mathbf{x}_2), \max(\mathbf{y}_2)) \quad \text{(maximal bounding box)} \quad (5)$$

$$\hat{b}_i^{\text{min}} = (\max(\mathbf{x}_1), \max(\mathbf{y}_1), \min(\mathbf{x}_2), \min(\mathbf{y}_2)) \quad \text{(minimal bounding box)} \quad (6)$$
In the above set of equations, $\mathbf{x}_1 = [x_{11}, x_{12}, \ldots, x_{1J}]$, $\mathbf{y}_1 = [y_{11}, y_{12}, \ldots, y_{1J}]$, $\mathbf{x}_2 = [x_{21}, x_{22}, \ldots, x_{2J}]$, and $\mathbf{y}_2 = [y_{21}, y_{22}, \ldots, y_{2J}]$. Figure 2 illustrates the individual bounding box predictions of different members of the ensemble and the merging strategy. Note that the clustering method involves representing the bounding box by its top-left and bottom-right coordinates, instead of the representation using center coordinates and bounding box dimensions that DETR typically outputs.
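A minimal sketch of this clustering-and-merging step is given below. It is an illustration, not the reference code: the sequential cluster assignment is a simplified BSAS variant (a detection joins the first cluster exceeding the IoU threshold), and the parameter names (`omega`, `strategy`) are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cluster_detections(detections, omega=0.75):
    """Sequentially assign (box, class_dist) detections to clusters by IoU."""
    clusters = []
    for box, dist in detections:
        for cluster in clusters:
            if max(iou(box, b) for b, _ in cluster) >= omega:
                cluster.append((box, dist))
                break
        else:
            clusters.append([(box, dist)])   # start a new cluster
    return clusters

def merge_cluster(cluster, strategy="average"):
    """Merge one cluster into a single box and an averaged class distribution."""
    boxes = np.array([b for b, _ in cluster])
    dists = np.array([d for _, d in cluster])
    s_hat = dists.mean(axis=0)                               # Eq. (3)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    if strategy == "average":                                # Eq. (4)
        box = boxes.mean(axis=0)
    elif strategy == "maximal":                              # Eq. (5): enclosing box
        box = np.array([x1.min(), y1.min(), x2.max(), y2.max()])
    else:                                                    # Eq. (6): intersection
        box = np.array([x1.max(), y1.max(), x2.min(), y2.min()])
    return box, s_hat
```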
Fig. 2. (Left) Image with ground-truth bounding boxes. (Middle) Clustered bounding boxes from the full ensemble for one object. (Right) Average bounding box post merging for one object.
3.3 Uncertainty Quantification
Location Uncertainty. The first aspect of predictive uncertainty that we evaluate is the uncertainty in the location of the object. Specifically, we look at the uncertainty in the center location of the output bounding box. To evaluate the location uncertainty, we compute the negative log likelihood of the true location of the center $(x_{\text{center}}, y_{\text{center}})$ under a Gaussian distribution $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$, where $\hat{\mu}$ is the predicted center location, that is,

$$\hat{\mu} = (\hat{x}_{\text{center}}, \hat{y}_{\text{center}}) \quad (7)$$

$$\hat{\sigma}^2 = \left( \frac{1}{N} \sum_{n=1}^{N} (\hat{x}_{n,\text{center}} - x_{n,\text{center}})^2,\; \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_{n,\text{center}} - y_{n,\text{center}})^2 \right) \quad (8)$$
In the above set of equations, $(x_{n,\text{center}}, y_{n,\text{center}}) \in \mathcal{D}_{\text{boxes,train}}$, where $\mathcal{D}_{\text{boxes,train}}$ denotes the collection of the bounding boxes from the entire training set. The variance computation is equivalent to computing the average of the squared residuals of the model predictions. A prerequisite to computing the
variance is that we first need to match the ground-truth bounding boxes with the predicted bounding boxes on the training data. Predicted bounding boxes are assigned to ground-truth bounding boxes by solving the minimum weight matching problem, where each matching has a weight of (1 - IoU). Furthermore, we set a threshold on minimum IoU for the matching between the predicted bounding box and the ground-truth bounding box to be valid during evaluation. An important point to note here is that we can only compute the variance using the matched bounding boxes on the training data. Although the equations mentioned above look at the predictions from a single model, this can be extended in a straightforward way for ensembles by using a Gaussian mixture model likelihood, where the mixture is the cluster forming the detections, and thus each member of the mixture is an individual detection in the cluster. For every member of the mixture, the mean parameter is the center of the detection, and the variance is the average of the variances across all members of the ensemble (each computed using the equations given above).

Class Uncertainty. Along with each bounding box, the DETR model outputs a (K + 1)-dimensional softmax probability distribution, where K denotes the number of classes and the final dimension is reserved for the "background" class. If the bounding box has maximum probability on the background class, it is filtered out during evaluation. In order to compute any metrics for the classification performance, we need to assign predicted bounding boxes to the ground-truth bounding boxes. This aligns a predicted softmax distribution with a ground-truth class label. We use the assignment described in the previous subsection for assigning the predicted bounding boxes to ground-truth bounding boxes. For the predicted bounding boxes that do not get assigned to any ground-truth bounding box, we set their respective ground-truth class to background. On the other hand, if after assignment there exist ground-truth bounding boxes that do not have any assigned predicted bounding boxes, we assert a uniform distribution over classes as the predicted distribution. This penalizes the missed bounding box targets. Once this assignment is complete, we evaluate the negative log-likelihood (NLL), expected calibration error (ECE), and accuracy for this multi-class classification task. NLL and ECE are especially useful as they account for the probabilistic nature of the output.

Objectness Uncertainty. In safety-critical applications of object detection models, it is often important to understand how well the detector performs when it comes to identifying the presence of "any" object in an area of interest in the scene. For example, in the case of autonomous driving, it may be sufficient to identify that there are one or more objects in a given area in the scene for the car to avoid it, without identifying the correct classes of the objects. As we described earlier, among other object categories, DETR and AdaMixer also output the probability that the bounding box belongs to the background class. This information can be used to predict whether the bounding box belongs
to any actual object or to the background. We can consider this a kind of binary classification problem, where the probability of a bounding box belonging to an actual object is p(“Objectness”|·) = 1 − p(“Background”|·). Next, we assign this objectness probability to every pixel belonging to the predicted bounding box. In the case where a pixel belongs to multiple bounding boxes, we take the maximum of the objectness probabilities across the boxes, to reflect the model’s highest objectness confidence value. Generating the ground-truth objectness map is very straightforward. Every pixel in an image is labelled as belonging to an object if it belongs to even a single ground-truth bounding box, or background otherwise. Next, we downsample the prediction and ground-truth grids using max pooling, as processing numerous images with a high number of pixels in each of them becomes computationally prohibitive. For each image, we obtain the objectness probability and ground-truth vectors by flattening the grids, which are then finally concatenated to compute performance metrics on the objectness classification problem. The final concatenation step is important because different images in the dataset can have different sizes, and thus it is important not to bias the performance computation by assigning the same weight to metrics from different image sizes.
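A small sketch of how such per-pixel objectness maps could be assembled is shown below. It is an illustration of the procedure described above, not the authors' code; the box format, the (4, 4) pooling kernel default, and the function names are assumptions.

```python
import numpy as np

def objectness_map(boxes, obj_probs, height, width):
    """Per-pixel objectness map: the maximum over all boxes covering a pixel.

    boxes: (N, 4) integer pixel boxes (x1, y1, x2, y2).
    obj_probs: (N,) values of 1 - p("background") for each predicted box.
    """
    grid = np.zeros((height, width), dtype=np.float32)
    for (x1, y1, x2, y2), p in zip(boxes, obj_probs):
        region = grid[y1:y2, x1:x2]
        np.maximum(region, p, out=region)   # keep the most confident box per pixel
    return grid

def max_pool(grid, k=4):
    """Downsample a grid with non-overlapping k x k max pooling."""
    h, w = (grid.shape[0] // k) * k, (grid.shape[1] // k) * k
    return grid[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

# Per image: build prediction and ground-truth grids, max-pool, flatten, and
# concatenate the flattened vectors across images before computing metrics.
pred = objectness_map(np.array([[10, 10, 60, 40], [30, 20, 80, 70]]),
                      np.array([0.9, 0.4]), height=96, width=128)
pred_flat = max_pool(pred, k=4).ravel()
```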
4 Experiments
In this section, we present our experimental framework and results. We begin by describing the experimental protocols involved, followed by the different experiments and results, each looking at a specific aspect of uncertainty estimation, as described in the previous section.

4.1 Experimental Protocols
For our empirical evaluations, we use the DETR and AdaMixer models with the COCO 2017 dataset [19]. COCO 2017 is a widely used large-scale dataset for evaluating object detection models, containing 80 object categories. The COCO training dataset contains 118,287 images and a total of 860,001 bounding boxes. The COCO validation dataset contains 5,000 images, and a total of 36,781 bounding boxes. For uncertainty estimation in DETR and AdaMixer, we generate two variants of deep ensembles: a) full ensembles and b) decoder ensembles. In full ensembles, we train each member from scratch on the training data with different random initialization. For the decoder ensembles, we freeze the backbone and encoder to the pre-trained model parameters, and only unfreeze and retrain the parameters of the decoder and output heads after new random initialization. Each of these ensembles has 9 models, unless otherwise specified. Deep ensembles have shown promising performance at uncertainty estimation with smaller ensemble size, and that motivates our choice of using 9 models [1]. During merging, we remove clusters that have fewer than 3 members [23]. Note that the AdaMixer model as published uses 80 binary classifiers rather than a single softmax classifier as in
DETR. To make our results comparable, we replace the classifier in AdaMixer with the same (80 + 1)-way softmax classifier of DETR. We then train using the same loss as DETR, replacing AdaMixer's focal loss with standard cross entropy. We notice a drop of ∼0.2 average precision when making this change. To train our models, we use the PyTorch-based MMDetection library [5,24].

4.2 Experiment 1: Evaluating Location Uncertainty
In the first experiment, we evaluate the location uncertainty aspect of the different methods. We use the evaluation procedure reported in Sect. 3.3, and set the matching IoU threshold to 0.5. We also note that the location uncertainty is only computed for the matched bounding boxes, as setting penalties for unmatched bounding boxes is not straightforward. Furthermore, given that the dataset consists of images of different sizes, we transform the coordinates to normalized coordinates ranging between 0 and 1. We present the negative log-likelihood results for the center location prediction in Table 1. While aggregating the bounding boxes for ensemble-based methods, we use the average merging strategy. From the results presented in Table 1, we observe that while the full ensemble has the lowest negative log-likelihood of the methods presented for each model architecture, the performance improvement is not very significant. While comparing against models and methods, DETR full ensemble shows the best performance. Next, we turn our focus towards evaluating the class uncertainty in the model predictions.

Table 1. Location uncertainty results.

Method                       NLL (↓)
Base DETR                    –5.68
Base AdaMixer                –5.59
DETR decoder ensemble        –5.88
DETR full ensemble           –6.03
AdaMixer full ensemble       –5.91
AdaMixer decoder ensemble    –5.87
4.3 Experiment 2: Evaluating Classification Performance
In this experiment, we evaluate and compare the classification performance across different methods and merging strategies. In Fig. 3, we present the results comparing classification performance across different methods. We focus on the average merging strategy for aggregating bounding boxes for deep ensembles, as it results in better overall performance on the entire validation dataset and leads to a higher fraction of the ground-truth bounding boxes being matched (shown in Appendix A.1). We observe that deep ensembles help us improve over the performance of a single model across all three performance metrics we measure. However, especially for NLL, the performance gap between a single model and the ensemble closes as we set a stricter matching IoU threshold. This is again due to the dominance of penalty terms from unmatched bounding boxes. We defer the additional results from this experiment evaluating
different merging strategies, along with the results on only the matched bounding boxes, to Appendix A.1.
Fig. 3. A comparison of classification performance metrics across different methods on the entire validation data set. We present negative log-likelihood (NLL, ↓), expected calibration error (ECE, ↓), and accuracy (↑) for different matching IoU thresholds. For deep ensembles, we use the average bounding box merging strategy.
Additionally, we present the average precision and recall results using the COCO evaluation API [19] in Table 2. The COCO evaluation API has been widely used in the existing literature to evaluate object detectors' performance. However, we must note that the average precision and recall results do not evaluate the probabilistic aspect of the classification performance, unlike the NLL and ECE metrics shown earlier in this section. From the results in Table 2, we observe that the full ensemble with average merging performs the best in terms of average precision and recall; however, the gains are still somewhat modest as compared to the respective base models. Finally, while comparing across models and methods, we observe that the AdaMixer full ensemble with the average merging strategy tends to outperform the rest on average precision and recall. We observe a similar trend when looking at NLL, ECE, and accuracy on the entire validation set.
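For reference, the standard COCO evaluation API workflow looks roughly as follows; the annotation and detection file paths are hypothetical placeholders, and the detections file would contain the merged ensemble boxes exported in COCO results format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical paths: COCO 2017 validation annotations and exported detections.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("ensemble_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints the average precision / average recall summary
```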
Table 2. Comparison of average precision and recall results using COCO evaluation API.

Method                                 Average precision   Average recall
Base DETR                              0.386               0.480
Base AdaMixer                          0.400               0.503
DETR decoder ensemble - Average        0.375               0.464
DETR decoder ensemble - Maximal        0.355               0.440
DETR decoder ensemble - Minimal        0.336               0.416
AdaMixer decoder ensemble - Average    0.399               0.503
AdaMixer decoder ensemble - Maximal    0.375               0.475
AdaMixer decoder ensemble - Minimal    0.367               0.465
DETR full ensemble - Average           0.399               0.494
DETR full ensemble - Maximal           0.349               0.439
DETR full ensemble - Minimal           0.332               0.419
AdaMixer full ensemble - Average       0.408               0.518
AdaMixer full ensemble - Maximal       0.373               0.473
AdaMixer full ensemble - Minimal       0.364               0.463
4.4 Experiment 3: Evaluating Objectness Uncertainty
In this experiment, we evaluate the objectness uncertainty across different approximate inference methods and merging strategies. As we describe in Sect. 3.3, our goal is to evaluate how the models perform on the binary classification task of identifying whether a local region belongs to any object or not. We introduce additional methods for this task that look at composing the objectness information from deep ensembles without clustering. To do this, we simply generate an objectness probability map for each member of the ensemble, and while composing the maps, for every pixel, we average the objectness probabilities across all the members of the ensemble. The results of this experiment are presented in Table 3. Note that for the final downsampling step of the image grids using max-pooling, we use a kernel of size (4, 4). There are a few key takeaways from this experiment. First, when evaluating the probabilistic aspect of the methods using NLL and AUROC, we observe that the full ensemble without merging predicted bounding boxes performs much better than the rest of the methods. Specifically, we see that both the decoder ensemble and the full ensemble without merging perform better than their counterparts which include merging. This is intuitive, as the merging step can lead to filtering out information if a cluster does not satisfy the minimum number of detections threshold. Averaging the most confident objectness probability scores across multiple members of the ensemble can lead to small objectness probability mass being assigned to areas of the image to which an individual model assigns none. Furthermore, the reverse is also true. In cases where an individual model has an overconfident
Table 3. Objectness Uncertainty Results.

Method                      Merging strategy   NLL     Accuracy   AUROC   Precision   Recall   F1
Base DETR                   –                  0.313   0.916      0.959   0.865       0.958    0.909
Base AdaMixer               –                  0.369   0.913      0.955   0.864       0.952    0.906
DETR decoder ensemble       Average            0.317   0.914      0.954   0.866       0.953    0.907
DETR decoder ensemble       Maximal            0.304   0.909      0.956   0.848       0.969    0.904
DETR decoder ensemble       Minimal            0.383   0.910      0.942   0.878       0.923    0.900
AdaMixer decoder ensemble   Average            0.319   0.911      0.951   0.861       0.951    0.904
AdaMixer decoder ensemble   Maximal            0.298   0.904      0.954   0.839       0.969    0.899
AdaMixer decoder ensemble   Minimal            0.411   0.904      0.935   0.875       0.912    0.893
DETR full ensemble          Average            0.289   0.917      0.958   0.865       0.961    0.911
DETR full ensemble          Maximal            0.287   0.907      0.959   0.837       0.980    0.903
DETR full ensemble          Minimal            0.390   0.909      0.940   0.881       0.917    0.899
AdaMixer full ensemble      Average            0.305   0.913      0.954   0.863       0.954    0.906
AdaMixer full ensemble      Maximal            0.286   0.904      0.956   0.834       0.976    0.899
AdaMixer full ensemble      Minimal            0.418   0.904      0.934   0.878       0.907    0.893
DETR decoder ensemble       No merging         0.261   0.921      0.968   0.878       0.953    0.914
DETR full ensemble          No merging         0.205   0.928      0.977   0.888       0.956    0.921
AdaMixer decoder ensemble   No merging         0.238   0.921      0.969   0.884       0.945    0.913
AdaMixer full ensemble      No merging         0.214   0.926      0.974   0.884       0.951    0.919
prediction, averaging the objectness scores can also help reduce the prediction probability to a more moderate level. An example demonstrating this is shown in Fig. 4. Next, when comparing between merging strategies, we find that the minimal merging strategy achieves the lowest performance among the merging techniques. Again, this is intuitive, as the minimal bounding-box-generating strategy would assign objectness probability to fewer pixels in the image. Comparing the full ensemble and the decoder ensemble, except for the minimal merging strategy, the full ensemble tends to perform better for a given merging strategy. Finally, when comparing across models and methods, we note that the DETR full ensemble with no merging performs the best on all the metrics (except recall).

4.5 Experiment 4: Runtime-Aware Performance Comparison
In this experiment, we evaluate the prediction latency of the methods introduced to assess the cost of using the proposed deep ensembles approach to quantify detection uncertainty. In the previous experiments, we used an ensemble size of 9 models for both the decoder and full ensembles. However, the decoder ensemble backbone is fixed; thus, we can cache the outputs from the encoder and run them
Fig. 4. (First) Objectness probability map using full ensemble (without merging). (Second) Objectness probability map using Base DETR. (Third) Original image with ground-truth bounding boxes. (Fourth) Ground-truth objectness map. The blue area corresponds to objects, and the white area corresponds to the background. Note that the base DETR model assigns a strong objectness probability to the shopping cart in the middle of the image, which does not have a ground-truth bounding box assigned to it. However, the full ensemble helps moderate the confidence due to model averaging. (Color figure online)

Table 4. Runtime latency results. Median runtime latency numbers are presented as average milliseconds per image on the validation dataset. The number of models in the ensemble is indicated in parentheses. We use an NVIDIA 3080Ti GPU to evaluate the runtime latency for all models.

Method                          Backbone   Encoder   Decoder   Clustering   Total   FPS
Base DETR                       15.5       4.28      5.25      –            25.0    40.0
Base AdaMixer                   17.7       –         16.4      –            34.1    29.7
DETR decoder ensemble (9)       15.5       4.28      47.3      7.1          74.1    13.5
DETR decoder ensemble (18)      15.5       4.3       94.5      28.3         142.6   7.0
DETR full ensemble (5)          77.4       21.4      26.3      2.9          127.9   7.8
DETR full ensemble (9)          139.3      38.5      47.3      6.3          231.4   4.3
AdaMixer decoder ensemble (9)   17.7       –         147.6     7.2          172.5   5.8
AdaMixer full ensemble (9)      159.3      –         147.6     7.2          314.1   3.2
through all 9 decoders. This helps us to reduce the overall prediction latency. We give a detailed breakdown of the latency numbers in Table 4. We note that the results presented in Table 4 are approximate runtime numbers and that they do not account for additional overheads such as the time required to load models into memory. Furthermore, the runtime numbers for ensembles are obtained using linear extrapolation over the results presented for a single model. We also provide frames/sec results based on the total runtime latency. For this experimental section, we compare classification performance results from a full ensemble consisting of 5 models and a decoder ensemble that consists of 18 models using the DETR architecture. As noted in Table 4, the runtimes of these are very similar, and thus they make for a fairer runtime-aware comparison. We focus specifically on the average bounding box merging strategy. For merging, we use the same IoU threshold of 0.75 for grouping and ensure the same fraction of the total ensemble as the threshold on the minimum number of detections by setting the minimum number of detections for a cluster to 2 for
Fig. 5. A comparison of classification performance metrics across different methods on DETR on the entire validation data set. We present negative log-likelihood (NLL, ↓), expected calibration error (ECE, ↓), and accuracy (↑) for different matching IoU thresholds. For deep ensembles, we use the average bounding box merging strategy.
the 5-member full ensemble and 7 for the 18-member decoder ensemble [23]. The results of the classification performance are presented in Fig. 5. We observe that the full ensemble still performs better than the decoder ensemble on the NLL metric. Looking at ECE and accuracy on the entire validation dataset, we observe that the decoder ensemble has better performance, while on the matched bounding boxes, the full ensemble tends to yield better accuracy and ECE when compared with the decoder ensemble. This is an especially interesting observation, as the accuracy and ECE improvements of the larger decoder ensemble on the full validation set do not necessarily translate to an improvement in NLL. Additional results on the matched bounding boxes are presented in Appendix A.2.
5 Conclusions
In this paper, we have explored the application of deep ensemble-based uncertainty quantification methods to query-based object detectors. Through our empirical analysis, we have shown how this approach can help us improve on class uncertainty, location uncertainty, and objectness uncertainty estimates. Furthermore, our runtime latency results show how a larger decoder ensemble with similar latency to a smaller full ensemble can lead to interesting tradeoffs among different performance metrics. Several important directions for future work emerge from this paper, including studying other methods for deriving and representing model uncertainty, exploring other detection merging techniques, and investigating the application of these methods to other detection architectures. Acknowledgements. Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The
U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
References 1. Ashukha, A., Lyzhov, A., Molchanov, D., Vetrov, D.: Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJxI5gHKDr 2. Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 37, pp. 1613–1622. PMLR, Lille (2015). https://proceedings.mlr.press/v37/ blundell15.html 3. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020) 4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 5. Chen, K., et al.: MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019) 6. Chen, T., Fox, E., Guestrin, C.: Stochastic gradient hamiltonian monte carlo. In: International Conference on Machine Learning, pp. 1683–1691. PMLR (2014) 7. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: Centernet: keypoint triplets for object detection. In: ICCV (2019) 8. Gal, Y., Ghahramani, Z.: Dropout as a bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp. 1050–1059 (2016) 9. Gao, Z., Wang, L., Han, B., Guo, S.: Adamixer: a fast-converging query-based object detector. In: CVPR (2022) 10. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014) 11. Hall, D., et al.: Probabilistic object detection: definition and evaluation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1031–1040 (2020) 12. Harakeh, A., Smart, M., Waslander, S.L.: Bayesod: a bayesian approach for uncertainty estimation in deep object detectors. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 87–93. IEEE (2020) 13. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2980– 2988 (2017) 14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 15. Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017(December), Long Beach, CA, USA, pp. 4–9, 2017, pp. 5574– 5584 (2017). https://proceedings.neurips.cc/paper/2017/hash/2650d6089a6d640 c5e85b2b88265dc2b-Abstract.html
16. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems, pp. 6402–6413 (2017) 17. Law, H., Deng, J.: Cornernet: detecting objects as paired keypoints. In: ECCV (2018) 18. Le, M.T., Diehl, F., Brunner, T., Knol, A.: Uncertainty estimation for deep neural object detectors in safety-critical applications. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3873–3878. IEEE (2018) 19. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 20. Meng, D., et al.: Conditional detr for fast training convergence. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2021) 21. Miller, D., Dayoub, F., Milford, M., S¨ underhauf, N.: Evaluating merging strategies for sampling-based uncertainty techniques in object detection. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 2348–2354. IEEE (2019) 22. Miller, D., Nicholson, L., Dayoub, F., S¨ underhauf, N.: Dropout sampling for robust object detection in open-set conditions. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3243–3249. IEEE (2018) 23. Miller, D., S¨ underhauf, N., Milford, M., Dayoub, F.: Uncertainty for identifying open-set errors in visual object detection. IEEE Rob. Autom. Lett. (2021). https:// doi.org/10.1109/LRA.2021.3123374 24. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alch´e-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035. Curran Associates, Inc. (2019) 25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016) 26. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015) 27. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Elsevier, Amsterdam (2006) 28. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019) 29. Tolstikhin, I., et al.: MLP-mixer: an all-mlp architecture for vision. In: NeurIPS (2021) 30. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017) 31. Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor detr: query design for transformerbased detector. In: AAAI (2022) 32. Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-2011), pp. 681–688 (2011) 33. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021). https://openreview.net/forum?id=gZ9hCDWe6ke
W33 - Recovering 6D Object Pose
This workshop covers topics related to 6DoF object pose estimation, which is of major importance to many higher level applications such as robotic manipulation and augmented/virtual reality. The workshop features four invited talks by experts in the field, discussion on open problems, and presentations of accepted workshop papers and of relevant papers invited from the main conference. In conjunction with the workshop, we organize the BOP Challenge 2022 to determine state-of-the-art object pose estimation methods. October 2022
Martin Sundermeyer, Tomas Hodan, Yann Labbé, Gu Wang, Lingni Ma, Eric Brachmann, Bertram Drost, Sindi Shkodrani, Rigas Kouskouridas, Ales Leonardis, Carsten Steger, Vincent Lepetit, Jiří Matas
CenDerNet: Center and Curvature Representations for Render-and-Compare 6D Pose Estimation

Peter De Roovere1,2(B), Rembert Daems1,3,4, Jonathan Croenen2, Taoufik Bourgana5, Joris de Hoog5, and Francis wyffels1

1 IDLab-AIRO, Ghent University – imec, Ghent, Belgium
[email protected]
2 RoboJob, Heist-op-den-Berg, Belgium
3 Department of Electromechanical, Systems and Metal Engineering, Ghent University, Ghent, Belgium
4 EEDT-DC, Flanders Make, Gent, Belgium
5 DecisionS, Flanders Make, Leuven, Belgium

Abstract. We introduce CenDerNet, a framework for 6D pose estimation from multi-view images based on center and curvature representations. Finding precise poses for reflective, textureless objects is a key challenge for industrial robotics. Our approach consists of three stages: First, a fully convolutional neural network predicts center and curvature heatmaps for each view; Second, center heatmaps are used to detect object instances and find their 3D centers; Third, 6D object poses are estimated using 3D centers and curvature heatmaps. By jointly optimizing poses across views using a render-and-compare approach, our method naturally handles occlusions and object symmetries. We show that CenDerNet outperforms previous methods on two industry-relevant datasets: DIMO and T-LESS.

Keywords: Object detection · 6D object pose estimation · Industrial robotics · Render-and-Compare · Curvature maps

1 Introduction

1.1 Context
6D pose estimation is an essential aspect of industrial robotics. Today's high-volume production lines are powered by robots reliably executing repetitive movements. However, as the manufacturing industry shifts towards high-mix, low-volume production, there is a growing need for robots that can handle more variability [2]. Estimating 6D poses for diverse sets of objects is crucial to that goal.
Manufacturing use cases present unique challenges. Many industrial objects are reflective and textureless, with scratches or saw patterns affecting their appearance [4,31]. Parts are often stacked in dense compositions, with many occlusions. These densely stacked, reflective parts are problematic for existing depth sensors. In addition, object shapes vary greatly, often exhibiting symmetries leading to ambiguous poses. Many applications require sub-millimeter precision and the ability to integrate new, unseen parts swiftly.
Fig. 1. Multi-view input images (left) are converted to center and curvature heatmaps (center), used to estimate 6D object poses (right).
This work presents a framework for 6D pose estimation targeted to these conditions. Our approach predicts object poses for known textureless parts from RGB images with known camera intrinsics and extrinsics. We use multi-view data as monocular images suffer from ambiguities in appearance and depth. In practice, images from multiple viewpoints are easily collected using multi-camera or hand-in-eye setups. This setup reflects many real-world industrial use-cases.

1.2 Related Work
Recent progress in 6D pose estimation from RGB images was demonstrated in the 2020 BOP challenge [10]. Convolutional neural networks (CNNs) trained on large amounts of synthetic data are at the core of this success. Many recent methods consist of three stages: (1) object detection, (2) pose estimation, and (3) refinement, using separate neural networks for each stage. While most approaches operate on monocular RGB images, some focus on multi-view data. Object Detection. In the first stage, CNN-based neural networks are used to detect object instances. Although these detectors typically represent objects by 2D bounding boxes [6,7,17,22], 2D center points can be a simple and efficient alternative [33]. Pose Estimation. In the second stage, the 6D pose of each detected object is predicted. Classical methods based on local features [1] or template matching [8] have been replaced by learning systems. CNNs are used to detect local
features [14,25,26,30] or find 2D-3D correspondences [18,19,29,32]. Crucial aspects are the parameterization of 6D poses [34] and how symmetries are handled [20]. Refinement. In the final stage, the estimated poses are iteratively refined by comparing object renders to the original image. This comparison is not trivial, as real-world images are affected by lighting, texture, and background changes, not captured in the corresponding renders. However, CNNs can be trained to perform this task [14,16,32]. Multi-view. Approaches for multi-view 6D object pose estimation extend existing single-view methods. First, pose hypotheses are generated from individual images. Next, these estimates are fused across views [14,15].

1.3 Contributions
We present CenDerNet, a framework for 6D pose estimation from multi-view images based on center and curvature representations. First, a convolutional neural network is trained to predict center and curvature heatmaps. Second, center heatmaps are used to detect object instances and find their 3D centers. These centers initialize and constrain the pose estimates. Third, curvature heatmaps are used to optimize these poses further using a render-and-compare approach (Fig. 1). Our system is conceptually simple and easy to use. Many existing methods consist of multiple stages, each with different training and tuning requirements. Our framework is more straightforward. We use a single, fully-convolutional neural network to convert RGB images to interpretable representations. Next, we use classical optimization techniques that require little tuning. Using a render-and-compare approach, we jointly estimate poses for all objects in a scene across all viewpoints. As a result, our method naturally handles occlusions and object symmetries. We provide a GPU implementation of our render-and-compare method that allows evaluating over 2,000 scene pose estimates per second. We evaluate CenDerNet using DIMO and T-LESS, two challenging, industry-relevant datasets. On DIMO, our method outperforms PVNet by a large margin. On T-LESS, CenDerNet outperforms the 2020 ECCV results of CosyPose, the leading multi-view method.
2 CenDerNet
Our system consists of three stages:
1. A convolutional neural network predicts center and curvature heatmaps for multi-view input images.
2. The predicted center heatmaps are converted to 3D center points. These 3D centers initialize and constrain the set of predicted object poses.
3. Object poses are optimized by comparing curvature renders to the predicted curvature heatmaps.
Fig. 2. Step 1: multi-view RGB images are converted to center and curvature heatmaps by a single, fully convolutional network.
2.1 From Images to Center and Curvature Heatmaps
This step eliminates task-irrelevant variations by converting images into center and curvature representations (Fig. 2). RGB images can vary significantly due to lighting, background, and texture changes. These changes, however, do not affect object poses. This step eliminates these effects by transforming images into representations that simplify pose estimation. As we want our system to apply to a wide range of objects, these representations should be category-independent. We identify centers and curvatures as suitable representations with complementary properties. Centers. We use center heatmaps—modeling the probability of object center points—to detect objects and roughly estimate their locations. Previous work has shown that detecting objects as center points is simple and efficient [33]. Moreover, 2D center points can be triangulated to 3D, initializing object poses and enabling geometric reasoning. For example, center predictions located at impossible locations can be discarded. When predicting center points, there is a trade-off between spatial precision and generalization. Precise center locations can differ subtly between similar objects. This makes it difficult for a learning system to predict precise centers for unseen objects. As we want our system to generalize to unseen categories, we relax spatial precision requirements, by training our model to predict gaussian blobs at center locations. We use a single center heatmap for all object categories. Curvature. We use curvature heatmaps to highlight local geometry and enable comparison between images and renders. Representations based on 3D geometry are robust to changes in lighting, texture, and background and can be created from textureless CAD files. Representations based on global geometry [29] or category-level semantics [5,19] do not generalize to unseen object types. Therefore, we focus on local geometry. Previous work has shown that geometric edges can be used for accurately estimating 6D poses [3,12,13]. We base our representation on view-space curvature. To obtain these view-space curvatures, we first
render normals in view-space. Next, we approximate the gradients for each pixel using the Prewitt operator [21]. Finally, we calculate the 2-norm of these gradients to obtain a per-pixel curvature value. Areas with high curvatures correspond to geometric edges or object boundaries and are visually distinct. Similarly, visually similar areas, like overlapping parallel planes, contain no curvature values, as shown in Fig. 3b.
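A minimal NumPy sketch of this curvature computation is given below. It assumes a view-space normal map is already available from a renderer; the function name and the use of SciPy's Prewitt filter are illustrative choices, not the paper's GPU implementation.

```python
import numpy as np
from scipy.ndimage import prewitt

def curvature_map(normals):
    """Approximate view-space curvature from an (H, W, 3) normal map.

    Per-pixel gradients of each normal component are estimated with the
    Prewitt operator; the curvature is the 2-norm over all six gradient values.
    High values mark geometric edges and object boundaries.
    """
    grads = []
    for c in range(3):                                   # x, y, z normal components
        grads.append(prewitt(normals[..., c], axis=0))   # vertical gradient
        grads.append(prewitt(normals[..., c], axis=1))   # horizontal gradient
    return np.linalg.norm(np.stack(grads, axis=-1), axis=-1)
```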
Fig. 3. (a) Curvatures are calculated by rendering normals in view space, approximating per-pixel gradients using the Prewitt operator, and calculating the 2-norm. (b) Visually similar areas, like overlapping parallel planes, exhibit no curvature values.
Model and Training. We use a fully convolutional network to predict center and curvature heatmaps. The same weights are applied to images from different viewpoints. The architecture is based on U-net [23] and shown in Fig. 4. A shared backbone outputs feature maps with a spatial resolution equal to that of the input images. Separate heads are used for predicting center and curvature heatmaps. Ground-truth center heatmaps are created by projecting 3D object centers to each image and splatting the resulting points using a Gaussian kernel, with standard deviation adapted by object size and distance. Curvature heatmaps are created as explained in Sect. 2.1. Binary cross-entropy loss is used for both outputs.
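The ground-truth center heatmap construction can be sketched as follows. This is an illustration under assumptions: per-object standard deviations are passed in directly (in the paper they are derived from object size and distance), and overlapping blobs are combined with a maximum.

```python
import numpy as np

def center_heatmap(centers, sigmas, height, width):
    """Splat projected 2D object centers as Gaussian blobs into one heatmap.

    centers: (N, 2) projected (x, y) pixel coordinates of 3D object centers.
    sigmas:  (N,) per-object standard deviations of the Gaussian kernel.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (cx, cy), s in zip(centers, sigmas):
        blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * s ** 2))
        np.maximum(heatmap, blob, out=heatmap)   # overlapping blobs keep the max
    return heatmap
```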
2.2 From Center Heatmaps to 3D Centers
This step converts multi-view center heatmaps to 3D center points. First, local maxima are found in every heatmap using a peak local max filter [28] (Fig. 5). Each of these 2D maxima represents a 3D ray, defined by the respective camera intrinsics and extrinsics. Next, for each pair of 3D rays, the shortest mutual distance and midpoint are calculated [24]. When this distance is below a threshold $d_t$, the midpoint is added to the set of candidates. Within this set, points that are closer to each other than a distance $d_c$ are merged. Finally, the remaining points are refined by maximizing their reprojection score across views, using
Fig. 4. The architecture of our fully convolutional network is based on U-net. Input images are first processed by a shared backbone. Afterwards, separate heads output center and curvature heatmaps.
Fig. 5. Step 2: multi-view 2D center heatmaps are converted to 3D center points.
Scipy’s Nelder-Mead optimizer [27]. This leads to a set of 3D points, each with per-view heatmap scores. If information about the number of objects in the scene is available, this set is further pruned. 3D centers are sorted by their aggregated heatmap score (accumulated for all views) and removed if they are closer than a distance do to higher-scoring points. 2.3
2.3 6D Pose Estimation
This step optimizes the 6D poses for all detected objects, using a render-and-compare approach based on curvatures (Fig. 6). Object CAD models, camera intrinsics, and extrinsics are available. Consequently, curvature maps can be rendered for each set of 6D object pose candidates. We define a cost function that compares such curvature renders to the predicted curvature heatmaps. This cost function is used for optimizing a set of object poses, initialized by the previously detected 3D centers.
Fig. 6. Step 3: Multi-view curvature heatmaps and 3D object centers are used to find 6D object poses.
Cost Function. Predicted curvature heatmaps are converted to binary images with threshold $t_b$. Next, for each binary image, a distance map is created where each pixel contains the distance to the closest non-zero (true) pixel, using scikit-image's distance transform [28]. The resulting distance maps have to be calculated only once and can be reused throughout optimization (Fig. 7).
Fig. 7. Target curvature heatmaps (left) are converted to binary images (center). Next, distance maps (right) are calculated, where each pixel contains the distance to the closest non-zero (true) pixel.
Using these distance maps, curvature renders can be efficiently compared to the target heatmaps. Pixel-wise multiplication of a render with a distance map returns an image where each pixel contains the distance to the closest true curvature pixel, weighted by its curvature value. As a result, regions with high curvature weigh more, and regions with zero curvature do not contribute. The final cost value is obtained by summing the resulting image and dividing by the sum of the rendered curvature map. This is done for each viewpoint. The resulting costs are weighted by view-specific weights $w_v$ and summed, resulting in a final scalar cost value.
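The per-view cost can be illustrated with a NumPy/SciPy sketch (the paper's implementation runs on the GPU together with the curvature renderer; this version only mirrors the arithmetic). The threshold $t_b$ and weights $w_v$ follow the text; everything else is an assumption.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def precompute_distance_map(target_curvature, t_b=0.1):
    """Distance of every pixel to the nearest above-threshold target curvature pixel."""
    binary = target_curvature > t_b
    # distance_transform_edt measures distance to the nearest zero, so invert the mask.
    return distance_transform_edt(~binary)

def view_cost(rendered_curvature, distance_map):
    """Curvature-weighted mean distance between a render and the target curvatures."""
    weighted = rendered_curvature * distance_map
    return weighted.sum() / (rendered_curvature.sum() + 1e-9)

def scene_cost(renders, distance_maps, view_weights):
    """Weighted sum of per-view costs; weights come from per-view heatmap scores."""
    return sum(w * view_cost(r, d)
               for r, d, w in zip(renders, distance_maps, view_weights))
```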
Fig. 8. Overview of the cost function. First, curvature maps are rendered for each set of 6D object pose estimates. These curvature maps are compared to a pre-computed distance map based on the target curvatures. This results in a cost value for each sample, calculated in parallel.
Figure 8 shows an overview of the cost function. We implement this function—including curvature rendering—on GPU. Our implementation runs at 2,000 calls per second on an NVIDIA RTX 3090 Ti for six 256 × 320 images per call.

Optimization. We sequentially optimize 6D object poses by evaluating pose candidates anchored by the detected 3D centers. As scenes can consist of densely stacked objects with many occlusions, we optimize objects sequentially. We argue highly visible objects are easier to optimize and—crucially—should be taken into account when optimizing objects they occlude. For each 3D center, we use per-view center heatmap scores as a proxy for visibility. We optimize objects in order of decreasing visibility, and weigh the contribution of each view to the cost according to these scores. When estimating an object pose, we first evaluate a set of 2,000 pose candidates with random rotations and translations normally distributed around the 3D center. Afterwards, the best candidates are further optimized using a bounded Nelder-Mead optimizer [27].
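The two-stage search could be sketched as follows, assuming a scalar `cost_fn` over a 6-vector pose parameterization (axis-angle rotation plus a translation offset from the detected 3D center). This parameterization, the candidate distribution, and the unbounded Nelder-Mead call are assumptions made for illustration; the paper uses a bounded variant.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def estimate_pose(center_3d, cost_fn, n_candidates=2000, sigma_t=0.01):
    """Coarse-to-fine pose search anchored at one detected 3D center."""
    rng = np.random.default_rng(0)
    # Stage 1: score random candidates (uniform rotations, Gaussian translation offsets).
    rotvecs = Rotation.random(n_candidates, random_state=rng).as_rotvec()
    offsets = rng.normal(scale=sigma_t, size=(n_candidates, 3))
    candidates = np.hstack([rotvecs, offsets])
    best = min(candidates, key=cost_fn)
    # Stage 2: local refinement of the best candidate with Nelder-Mead.
    result = minimize(cost_fn, best, method="Nelder-Mead")
    return Rotation.from_rotvec(result.x[:3]), center_3d + result.x[3:]
```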
3 Experiments
We evaluate our method on the 6D localization task as defined in the BOP challenge [10] on DIMO [4] and T-LESS [9], two industry-relevant datasets. On DIMO, we show our method significantly outperforms PVNet [19], a strong single-view baseline. On T-LESS, CenDerNet improves upon the 2020 ECCV results of CosyPose [14], the leading multi-view method.

3.1 DIMO
Dataset. The dataset of industrial metal objects (DIMO) [4] reflects real-world industrial conditions. Scenes consist of symmetric, textureless and highly reflective metal objects, stacked in dense compositions.
Baseline. We compare our method to PVNet with normalized pose rotations. PVNet [19] is based on estimating 2D keypoints followed by perspective n-point optimization. This method is robust to occlusion and truncation, but suffers from pose ambiguities caused by object symmetries. This is amended by mapping rotations to canonical rotations before training [20].

Experiments. We use a subset of the dataset, focusing on the most reflective parts, and scenes with objects of the same category. For training, only synthetic images are used. We report results on both synthetic and real-world test data. When predicting center and curvature heatmaps, the same output channel is used for all object categories.

Evaluation. We report average recall (AR) for two different error functions:
– MSPD, the maximum symmetry-aware projection distance, as it is relevant for evaluating RGB-only methods.
– MSSD, the maximum symmetry-aware surface distance, as it is relevant for robotic manipulation.
Strict thresholds of correctness $\theta_e$ are chosen, as manufacturing use-cases require high-precision poses. $\theta_{\text{MSSD}}$ is reported for 5% of the object diameter, $\theta_{\text{MSPD}}$ for 5px. Our system predicts poses in world frame, whereas PVNet—a single-view method—predicts per-image poses that are relative to the camera. For better comparison, we transform our estimated poses to camera frames and compute per-image metrics.

Table 1. Results on DIMO for synthetic test images. CenDerNet significantly outperforms the PVNet baseline. The difference between MSSD and MSPD shows the importance of multi-view images for robotics applications where spatial precision is important.

Method             AR_MSPD < 5px   AR_MSSD < 5%
PVNet              0.577           0.079
CenDerNet (ours)   0.721           0.639
Results. As shown in Table 1 and Table 2, CenDerNet significantly outperforms the PVNet baseline. Despite rotation normalizations, PVNet still suffers from pose ambiguities, struggling to predict high-precision poses.
106
P. De Roovere et al.
Table 2. Results on DIMO for real-world test images. Both methods are affected by the sim-to-real gap. While the results of PVNet plummet, CenDerNet is more robust. ARMSPD < 5px ARMSSD < 5% PVNet
0.016
CenDerNet (ours) 0.516
0.000 0.403
Fig. 9. Results on synthetic test image of the DIMO dataset. Despite severe reflections and shadows, our system recovers qualitative center and curvature representations and precise 6D poses.
Fig. 10. Results on real-world test image of the DIMO dataset. The lower quality of the center and curvature heatmaps is due to the sim-to-real gap. Nevertheless, the estimated 6D poses are accurate.
Fig. 11. Failure case on a real-world test image of the DIMO dataset. The unexpected output in the center heatmap (orange arrow) could be due to the sim-to-real gap. Objects cannot be differentiated when object centers align (red arrow). (Color figure online)
The large gap in MSPD en MSSD scores for PVNet shows the importance of multi-view images for robotics applications. While PVNet’s MSPD score (in image space) is reasonable, its MSSD score (in world/robot space) is subpar. Reflections and shadows significantly affect many test images, leading to challenging conditions, as shown in Fig. 9 and Fig. 10. Nevertheless, center and curvature predictions are often sufficient to find accurate object poses. The difference
CenDerNet
107
in results between Table 1 and Table 2 shows the effect of the sim-to-real gap. Here, many of the keypoints predicted by PVNet are too far off to meet the strict accuracy requirements, causing results to plummet. CenDerNet is more robust. An important advantage of our method is the interpretability of the predicted center and curvature representations. The top-right area of the center heatmap in Fig. 10 and Fig. 11 could be due to the sim-to-real gap. Adding more background variations while training could improve the robustness of the model. Figure 11 shows a failure case. When object centers align, the system fails to differentiate between them. 3.2
T-LESS
Dataset. The T-LESS dataset provides multi-view images of non-reflective, colorless industrial objects. As part of the BOP benchmark, it allows easy comparison to state-of-the-art methods. For the 2020 BOP challenge [11], photorealistic synthetic training images (PBR) were provided. Baseline. We compare our method to CosyPose [14], the multi-view method that won the 2020 BOP challenge. We use the ECCV 2020 BOP evaluation results for 8 views made available by the authors. Experiments. We use only PBR images for training. As each scene contains a mix of object types, we extend our model to predict a separate center heatmap for each object category. This reduces the generality of our method, but allows us to select the correct CAD model when optimizing poses. As training images contain distractor objects, we take object visibility into account when generating ground-truth center and curvature heatmaps. Evaluation. In addition to reporting ARMSPD < 5px and ARMSSD < 5%, we report the default BOP [10] evaluation metrics (VSD, MSSD and MSPD). We transform poses to camera-frames and calculate metrics for all images. Results. As shown in Table 3, our method outperforms CosyPose in highprecision scenarios. Table 4 shows we also outperform the provided CosyPose predictions on the default BOP metrics. However, the CosyPose-variant that is currently at the top of the BOP leaderboard still performs significantly better. Table 3. Results on TLESS. CenDerNet outperforms CosyPose on both metrics. The stark difference in MSSD score is relevant for high-precision robotics applications. ARMSPD < 5px ARMSSD < 5% CosyPose (ECCV 2020) 0.499
0.250
CenDerNet (ours)
0.544
0.543
108
P. De Roovere et al.
Table 4. Results on TLESS for the default BOP evaluation metrics. CenDerNet outperforms the provided CosyPose results. However, the scores reported on the BOP leaderboard are still significantly higher. AR
ARMSPD ARMSSD ARVSD
CosyPose (ECCV 2020)
0.617
0.686
CenDerNet (ours)
0.713
0.715
CosyPose (BOP leaderboard) 0.839 0.907
0.610
0.557
0.717
0.707
0.836
0.773
Fig. 12. Results on T-LESS.Our model is able to eliminate irrelevant scene elements, like backgrounds and distractor objects.
Fig. 13. Failure case on T-LESS. The model is confused about the charuco tags around the scene, leading to an incorrect 3D center. The pose optimization process cannot recover from this error.
Figure 12 shows predicted heatmaps, object centers and poses. In many situations, our method is capable of eliminating irrelevant backgrounds and objects, despite being trained only on synthetic data. There is, however, still a sim-to-real gap, as shown in Fig. 13. Again, the predicted representations provide hints on how to improve the system. In this case, the model is confused by charuco tags on the edge of the table.
4
Conclusions
We present CenDerNet, a system for multi-view 6D pose estimation based on center and curvature representations. Our system is conceptually simple and therefore easy-to-use. First, a single neural network converts images into interpretable representations. Next, a render-and-compare approach, with GPUoptimized cost function, allows for jointly optimizing object poses across multiple viewpoints, thereby naturally handling occlusions and object symmetries. We demonstrate our system on two challenging, industry-relevant datasets and
CenDerNet
109
show it outperforms PVNet, a strong single-view baseline, and CosyPose, the leading multi-view approach. In future work, we plan to explore ways to further decrease the processing time of our method. In addition, we will investigate improvements to the robustness of our pipeline. For example, fusing information from multiple viewpoints could improve center and curvature predictions. We also look forward to evaluating our method in a real-world setup. Acknowledgements. The authors wish to thank everybody who contributed to this work. The authors thank RoboJob (https://www.robojob.eu) for their support and the members of the Keypoints Gang for many insightfull discussions. This research was supported by VLAIO Baekeland Mandate HBC.2019.2162 and by Flanders Make (https:// www.flandersmake.be) through the PILS SBO project (Project Inspection with Little Supervision) and the CADAIVISION SBO project. Furthermore this research received funding from the Flemish Government under the “Onderzoeksprogramma Artifici¨ele Intelligentie (AI) Vlaanderen” programme.
References 1. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008). https://doi.org/10.1016/j. cviu.2007.09.014 2. Commission, E.: Directorate-General for Internal Market, Industry, E., SMEs: a vision for the European industry until 2030 : final report of the Industry 2030 high level industrial roundtable. Publications Office (2019). https://doi.org/10.2873/ 34695 3. Daems, R., Victor, J., Baets, P.D., Onsem, S.V., Verstraete, M.: Validation of three-dimensional total knee replacement kinematics measurement using singleplane fluoroscopy. Int. J. Sustain. Constr. Des. 7(1), 14 (2016). https://doi.org/10. 21825/scad.v7i1.3634 4. De Roovere, P., Moonen, S., Michiels, N., Wyffels, F.: Dataset of industrial metal objects. arXiv preprint arXiv:2208.04052 (2022) 5. Florence, P.R., Manuelli, L., Tedrake, R.: Dense object nets: learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756 (2018) 6. Fu, C.Y., Shvets, M., Berg, A.C.: RetinaMask: learning to predict masks improves state-of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353 (2019) 7. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988 (2017). https:// doi.org/10.1109/ICCV.2017.322 8. Hinterstoisser, S., et al.: Multimodal templates for real-time detection of textureless objects in heavily cluttered scenes. In: 2011 International Conference on Computer Vision. IEEE, November 2011. https://doi.org/10.1109/iccv.2011.6126326 9. Hodan, T., Haluza, P., Obdrzalek, S., Matas, J., Lourakis, M., Zabulis, X.: T-LESS: an RGB-d dataset for 6d pose estimation of texture-less objects. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, March 2017. https://doi.org/10.1109/wacv.2017.103
110
P. De Roovere et al.
10. Hodaˇ n, T., et al.: BOP challenge 2020 on 6D object localization. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12536, pp. 577–594. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66096-3 39 11. Hodaˇ n, T., et al.: Photorealistic image synthesis for object instance detection. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 66–70 (2019). https://doi.org/10.1109/ICIP.2019.8803821 12. Jensen, A., Flood, P., Palm-Vlasak, L., Burton, W., Rullkoetter, P., Banks, S.: Joint track machine learning: an autonomous method for measuring 6dof tka kinematics from single-plane x-ray images. arXiv preprint arXiv:2205.00057 (2022) 13. Kaskman, R., Shugurov, I., Zakharov, S., Ilic, S.: 6 DoF pose estimation of textureless objects from multiple RGB frames. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12536, pp. 612–630. Springer, Cham (2020). https://doi.org/10. 1007/978-3-030-66096-3 41 14. Labb´e, Y., Carpentier, J., Aubry, M., Sivic, J.: CosyPose: consistent multi-view multi-object 6D pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 574–591. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4 34 15. Li, C., Bai, J., Hager, G.D.: A unified framework for multi-view multi-class object pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 263–281. Springer, Cham (2018). https://doi. org/10.1007/978-3-030-01270-0 16 16. Li, Y., Wang, G., Ji, X., Xiang, Yu., Fox, D.: DeepIM: deep iterative matching for 6D pose estimation. Int. J. Comput. Vis. 128(3), 657–678 (2019). https://doi.org/ 10.1007/s11263-019-01250-9 17. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, October 2017. https://doi.org/10.1109/iccv.2017.324 18. Park, K., Patten, T., Vincze, M.: Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October 2019. https://doi.org/10.1109/iccv.2019. 00776 19. Peng, S., Liu, Y., Huang, Q., Zhou, X., Bao, H.: PVNet: pixel-wise voting network for 6dof pose estimation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019. https://doi.org/10.1109/cvpr. 2019.00469 20. Pitteri, G., Ramamonjisoa, M., Ilic, S., Lepetit, V.: On object symmetries and 6D pose estimation from images. In: 2019 International Conference on 3D Vision (3DV). IEEE, September 2019. https://doi.org/10.1109/3dv.2019.00073 21. Prewitt, J.M., et al.: Object enhancement and extraction. Pict. Process. Psychopictorics 10(1), 15–19 (1970) 22. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/tpami.2016.2577031 23. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 24. Szeliski, R.: Computer Vision: Algorithms and Applications. Springer Science & Business Media, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-1-84882935-0
CenDerNet
111
25. Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6d object pose prediction. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, June 2018. https://doi.org/10.1109/cvpr.2018.00038 26. Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., Birchfield, S.: Deep object pose estimation for semantic robotic grasping of household objects. In: 2nd Annual Conference on Robot Learning, CoRL 2018, Z¨ urich, Switzerland, 29–31 October 2018, Proceedings. Proceedings of Machine Learning Research, vol. 87, pp. 306–316. PMLR (2018) 27. Virtanen, P., et al.: SciPy 1.0 contributors: SciPy 1.0: fundamental algorithms for scientific computing in python. Nat. Methods 17, 261–272 (2020). https://doi.org/ 10.1038/s41592-019-0686-2 28. van der Walt, S., et al.: The scikit-image contributors: scikit-image: image processing in python. PeerJ 2, e453 (2014). https://doi.org/10.7717/peerj.453 29. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019. https://doi.org/10.1109/cvpr.2019.00275 30. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. In: Robotics: Science and Systems XIV. Robotics: Science and Systems Foundation, June 2018. https:// doi.org/10.15607/rss.2018.xiv.019 31. Yang, J., Gao, Y., Li, D., Waslander, S.L.: ROBI: a multi-view dataset for reflective objects in robotic bin-picking. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, September 2021. https://doi.org/ 10.1109/iros51168.2021.9635871 32. Zakharov, S., Shugurov, I., Ilic, S.: DPOD: 6d pose object detector and refiner. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October 2019. https://doi.org/10.1109/iccv.2019.00203 33. Zhou, X., Wang, D., Kr¨ ahenb¨ uhl, P.: Objects as points. CoRR abs/1904.07850 (2019) 34. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019. https://doi.org/10. 1109/cvpr.2019.00589
Trans6D: Transformer-Based 6D Object Pose Estimation and Refinement Zhongqun Zhang1(B) , Wei Chen2 , Linfang Zheng1,3 , Aleˇs Leonardis1 , and Hyung Jin Chang1 1
2
University of Birmingham, Birmingham, UK [email protected] Defense Innovation Institute, Boston, China 3 SUSTech, Shenzhen, China
Abstract. Estimating 6D object pose from a monocular RGB image remains challenging due to factors such as texture-less and occlusion. Although convolution neural network (CNN)-based methods have made remarkable progress, they are not efficient in capturing global dependencies and often suffer from information loss due to downsampling operations. To extract robust feature representation, we propose a Transformer-based 6D object pose estimation approach (Trans6D). Specifically, we first build two transformer-based strong baselines and compare their performance: pure Transformers following the ViT (Trans6D-pure) and hybrid Transformers integrating CNNs with Transformers (Trans6D-hybrid). Furthermore, two novel modules have been proposed to make the Trans6D-pure more accurate and robust: (i) a patch-aware feature fusion module. It decreases the number of tokens without information loss via shifted windows, cross-attention, and token pooling operations, which is used to predict dense 2D-3D correspondence maps; (ii) a pure Transformer-based pose refinement module (Trans6D+) which refines the estimated poses iteratively. Extensive experiments show that the proposed approach achieves state-of-the-art performances on two datasets. Keywords: 6D object pose estimation
1
· Transformer
Introduction
In this paper, we are interested in estimating the 6D pose of objects from monocular RGB images. 6D object pose estimation has been gaining attention as it can be applied in many fields such as augmented reality, robotic manipulation, autonomous driving, etc. However, estimating the 6D pose from a monocular RGB image is still challenging, especially when the target object is under heavy occlusion or in changing illumination conditions. With the explosive growth of deep learning, deep Convolutional Neural Networks (CNNs) have greatly improved monocular 6D object pose estimation [22], even at times surpassing RGB-D-based methods [1,22,31]. Recent works in this field can be roughly divided into two categories: i) approaches that use a CNN c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Karlinsky et al. (Eds.): ECCV 2022 Workshops, LNCS 13808, pp. 112–128, 2023. https://doi.org/10.1007/978-3-031-25085-9_7
Trans6D
113
to regress the 6D poses directly [5,8,33,36–38,40] and ii) indirectly [29,30]. A common drawback of the methods in the first category is their poor generalization ability due to the large search space of the rotation [15,40]. The second category overcomes this limitation by either utilizing CNNs to detect the 2D keypoints of the objects [30,34] or estimating the dense 2D-3D correspondence maps between the input image and the available 3D models [17,42]. Given the correspondences, one can recover the 6D pose parameters via Perspective-n-Point (PnP) algorithm [2]. Great success has been achieved with these approaches. However, these CNN-based methods are not efficient in capturing non-local spatial relationships. Therefore, their performance is limited in both accuracy and robustness. The codes will be publicly available on our website. Even though Transformer [11] is originally designed for Natural Language Processing (NLP) tasks, recent works [4,9,14,21,26,41] show that both “Pure Transformer” and “CNN+Transformer” have the potential to become the universal models for computer vision tasks. The self-attention mechanism [41] makes them particularly effective in capturing the global dependencies. Therefore, we are interested in designing an algorithm that can leverage the advantage of Transformers in the 6D object pose estimation task. In this work, we propose Trans6D, a simple yet effective framework that employs Transformers for 6D pose estimation. Firstly, we build two strong baseline frameworks (Trans6D-pure and Trans6D-hybrid) using Transformer. Among them, Trans6D-pure applies the ViT [12] directly, while Trans6D-hybrid uses ResNet-34 to learn local feature maps and then uses Transformer encoders to capture global dependencies. Since dividing the image into small patches will impact the accuracy of 6D object pose regression tasks, the Trans6D-Hybrid significantly outperforms the Trans6D-pure. Secondly, we propose two novel modules to improve the performance of the Trans6D-pure: (i) a patch-aware feature fusion (PAFF) module is proposed to predict the 2D-3D correspondence map. We reshape the tokens from the last layer of Trans6D-pure into feature maps. Inspired by the pooling operation and stridden convolution in CNN, the PAFF module proposes to use token pooling and shifted windows to reduce the dimension of feature maps. To avoid information loss and alleviate the impact of image division, the PAFF module uses cross-attention to fuse the local feature of tokens in each window. (ii) The pure Transformer-based pose refinement module (Trans6D+) is introduced to use the input image and the initial pose estimation to learn an accurate transformation between the predicted object pose and the ground-truth pose. Experimental results on two widely used benchmark datasets, LINEMOD [16] and Occlusion LINEMOD [3], demonstrate that Trans6D has state-of-the-art performances. The contributions of the paper are summarised: – Two Transformer-based strong baselines are proposed and assessed for 6D object pose estimation, which achieve comparable performance with CNNbased frameworks. – A patch-aware feature fusion (PAFF) module is designed to decrease the number of tokens without information loss, which is used to predict dense 2D-3D correspondence maps.
114
Z. Zhang et al.
– A pure Transformer-based Refinement (Trans6D+) module is introduced to refine the estimated poses iteratively. – For the first time, we show a simple but effective method based on Transformers achieves state-of-the-art performance, 96.9% on the LINEMOD dataset and 57.9% on the Occlusion LINEMOD dataset.
2 2.1
Related Work 6D Object Pose Estimation
Recent work in this field can be roughly divided into two categories: direct methods and correspondence-based approaches. Direct methods use CNN-based networks to regress a 6D pose from a single RGB image directly. For Instance, PoseCNN [40] predicted 2D localization, depth information, and rotation from a CNN backbone. However, the direct methods usually exhibit poor generalization ability due to the large search space of the rotation and the lack of depth information. Therefore, Oberweger et al. [28] and BB8 [31] attempted to limit the rotation range; they discretize the pose space and seem the prediction of rotation as classification rather than regression. G2LNet [6] harnesses the embedding vector features to regress the 6D pose. GDR [39] regress dense correspondences first and then use patch-PnP to learning 6D pose from the correspondences. The correspondence-based methods are prevalent recently. These methods first built 2D-3D correspondence maps, then computed the 6D pose via PnP with the RANSAC algorithm. For example, Pixel2Pose [29] used an auto-encoder architecture to estimate the 3D coordinates per pixel to build dense correspondences, while PVNet [30], PVN3D [15], and PointPoseNet [7] adopted a voting net to select 2D keypoint or 3D keypoint respectively to build dense correspondences. Furthermore, HybridPose [34] suggested predicting hybrid correspondences (including keypoints, edge vectors, and symmetry) to enhance the robustness. DPOD [42] first estimateed multi-instance correspondences and then used a learning-based refiner to improve the accuracy. It is the first approach that unifies the correspondence-based rotation estimation and the direct regressionbased translation estimation. 2.2
Vision Transformer
Transformer architecture was built for the NLP task, consisting of the multihead self-attention mechanisms and feed-forward layer, to capture the long-term correlation between words. Recently, there is a growing interest in Transformer based computer-vision tasks. On the one hand, pure Transformer [12,20,23] is attracting more and more attention. ViT [12] applied pure Transformer directly to image classification tasks and attains excellent results compared to the stateof-the-art CNN-based method. TransReID [14] showed that a pure Transformerbased model can be used for the object ReID task. On the other hand, “CNN +
Trans6D Object Image
Pure Transformer
Pose Estimation
*
115
6D Object Pose
Transformer Encoder …
9
3
R6d 2
0 *
1
*
tSITE
Projection of Flattened Patches
… * Global Feature
Positional Coding
xN
FC Layer
MLP
Hybrid Transformer
ResNet-34
* Reshape & GAP
LayerNorm
Transformer Encoder MHSA
CNN-backbone
Feature Map
LayerNorm
Fig. 1. An overview of the proposed Trans6D-pure and Trans6D-hybrid baselines. Given an input image of the target object, Trans6D-pure encodes the image as a sequence of patches and then models the global dependencies among each patch via ViT-like Transformer Layers. Trans6D-hybrid first extracts feature maps using a CNN-based backbone and flatten them to a sequence, then feeds the sequence into Transformer Layers. Output token marked with ∗ is served as the global feature. The global feature is then used to regress 3D rotation and 3D translation.
Transformer” [4,43] also has better performance. DETR [4] extracted features from CNN-backbone, then viewed object detection as a direct set prediction problem that was suitable for Transformer structure. TransPose [41] predicted 2D heatmaps for human pose estimation by Transformer encoder and CNN, while METRO [26] tried to model non-local interactions among body joints and mesh vertices for human mesh reconstruction.
3
Methodology
In Fig. 1 and Fig. 2, we show an overview of our proposed framework that estimates the 6D object pose from a single RGB image. The input to Trans6D is an image of size 256 × 256 which contains only one object with a known class. The outputs of Trans6D-pure and Trans6D-hybrid are the predicted 3D translation and 3D rotation, while the outputs of Trans6D and Trans6D+ are 2D-3D correspondence maps of size 64 × 64. Given the correspondences, 6D pose is calculated by PnP and RANSAC algorithm. Our method consists of three modules: Two Transformer-based strong baselines, the Patch-Aware Feature Fusion (PAFF) Module, and the Pure Transformer-based Pose Refinement Module. In the following subsections, we describe each module in detail.
116
Z. Zhang et al. Pose Refinement
Pose Estimation Global Feature
Per-patch Feature
1
2
3
4
5
6
R6d
*
Patch-Aware Feature Fusion
7
Linear Unflatten
8
2
3
4
tSITE 3
2
4
5
6
7
8
Linear Projection
Transformer Layers 0 *
1
2
3
4
5
6
7
…
…
9
Patch-wise Sub
Object Image
Render
3D Model
9
1
*
1
8
Linear Projection
9 0
*
Projection of Flattened Patches
1
2
3
…
9
Transformer Layers *
Rotation Residual Translation Residual
Fig. 2. An overview of the proposed Trans6D and Trans6D+. Trans6D is based on Trans6D-pure. The outputs of Trans6D-pure are reshaped as a feature map and feed it into patch-aware feature fusion (PAFF) module to downsample the tokens. Instead of directly regressing the 6D pose, Trans6D predicts the 2D-3D correspondence maps and compute the 6D pose by PnP algorithm. Trans6D+ renders the object at the estimated pose and learns to align the real image and the rendered image incrementally.
Following GDR-Net [39], the 6D pose is represented as a decoupled way, composed of a continuous 6-dimensional representation for rotation R6d in SO(3) and a scale-invariant representation for translation tSITE . 3.1
Transformer-Based Baselines for 6D Object Pose Estimation
We build two Transformer-based strong baselines for 6D object pose estimation, which are based on pure Transformers following the ViT (Trans6D-pure) and hybrid Transformers integrating CNNs with Transformers (Trans6D-hybrid) separately. Trans6D-Hybrid. Trans6D-hybrid consists of two components: a CNN backbone to extract low-level image feature maps; a Transformer Encoder to capture global dependencies between feature vectors, and each feature vector is distinguished by the position embedding, as shown in Fig. 1. We employ a Convolutional Neural Network (CNN), ResNet34 [13], as the backbone for feature extraction. This backbone is pre-trained on ImageNet classification task [10], therefore Transformer can easily benefit from large-scale pretrained CNNs. Given an initial image ximg ∈ R3×H0 ×W0 , the CNN-based backbone generates a feature map with lower-resolution f ∈ RC×H×W . Specifically, 0 W0 C = 512 and H, W = H 32 , 32 . The Transformer encoder is employed to model interaction among all the pixel-level features in the image. First, we transform the feature dimension to f ∈ Rd×H×W by a 1 × 1 convolution. Since the Transformer encoder expects a sequence as input, we then flattened the feature into a sequence f ∈ RL×d , where L = H × W , and this sequence is fed into the Transformer Encoder. As shown in Fig. 1, we follow the standard Transformer architecture as closely as possible,
Trans6D
117
Fig. 3. Illustration of Patch-Aware Feature Fusion. We design a PAFF module which can not only downsample the number of tokens without information loss but also alleviate the impact by patch division.
which consists of a multi-head self-attention module and a fully connected feedforward network (FFN). Since the Transformer architecture is permutation-invariant, position embedding aims at giving an order to the sequence of the image feature map. Following [4], 2D Sine position embedding is used in Trans6D-hybrid. The output sequence of Transformer is reshaped to feature maps. We utilize a global average pooling operation to extract the global feature, which is fed into two FC layers to regress the 6D pose. The whole process can be formulated as: R6d , tSITE = FCLayers(GAP(Transformer(f )))
(1)
Trans6D-Pure. Given an image ximg ∈ RC×H0 ×W0 , Trans6D-pure first divides the images into non-overlapping P × P patches xin | i = 1, 2, · · · , P . These 2 patches is then flattened into a sequence of 2D patches x ∈ RN ×(P ·C ) by a p
linear projection F, P 2 is dimension of each feature vector and N = HW/P 2 . An extra learnable embedding token (∗) is added into the input sequences and this token is used to extract a global feature. Different from Trans6D-hybrid, 2 position embeddings zpos ∈ R(1+P )×N C in Tran6D-pure are learnable. Overall, The input feature matrix can be formulated as: + zpos Z = x∗ ; F x1n ; F x2n ; · · · ; F xP (2) n where [ ] represented concatenation operations. Then, we feed the feature into Transformer encoder layers. The whole process can be express as: R6d , tSITE = FCLayers(Transformer(Z)[0, :])
(3)
118
3.2
Z. Zhang et al.
Patch-Aware Feature Fusion
Instead of directly regressing the 6D pose, Trans6D predicts 2D-3D correspondence maps. Standard ViT architecture [12] cannot be directly applied in a dense prediction task because they use a constant dimensionality of the hidden embeddings for all transformer layers. However, downsampling operations in CNNs (e.g. pooling) suffer from information loss. Furthermore, the patch division might also greatly impact the prediction because it corrupts the image structure. To solve the aforementioned problems, We design a patch-aware feature fusion (PAFF) module that can not only downsample the number of tokens without information loss but also alleviate the impact by patch division, as shown in Fig. 3. ∗ The output of consists of the global token Z and a sequence Trans6D-pure i of patch tokens Z | i = 1, 2, · · · , N . As shown in Fig. 2, we first reshape the patch tokensZ ∈ Rl×c as a feature map f ∈ Rh×w×c in the spatial dimension. After obtaining the re-organized feature map, we apply token pooling operation (1D convolution with 1D maxpooling) to downsample the feature map to h w Z ∈ R 4 × 4 ×c . Z will serve as the learnable queries q i | i = 1, 2, · · · , N4 in PAFF module. However, simply adding a token pooling will decrease the model’s representation ability. Inspired by the strided convolution in CNN, we split the feature map into overlapping patches with a sliding window. As shown in Fig. 3, the patch tokens in each window are fed into PAFF module with the global token. Supposing each window contains k × k + 1 patches and the sliding stride is s, the number of windows is h−k w−k h w × = +1 × +1 . (4) 4 4 s s The tokens of each sliding window server as query and key, and we compute their cross-attention (CA) [4] with the learnable queries. Thus the local information can be aggregated. The outputs of CA network is concatenated with each other, h w then they will be reshaped again to feature maps f ∈ R 4 × 4 ×D and regress to h w 2D-3D correspondences map M2D3D ∈ R 4 × 4 ×4 by Linear Unflatten. 3.3
Pure Transformer-Based Pose Refinement
In order to further improve the performance of Trans6D, we propose a pure Transformer-based pose refinement (Trans6D+) module. Inspired by DPOD [42], our refiner aims at regress the residual of rotation and translation with the loss function:
1
ˆ i xk + tˆi (5) Rpose = min (Rxj + t) − R M j x∈Ms where Ms represents the randomly selected 3D points from the object’s 3D ˆ and tˆ is the predicted model.R and t is the ground truth of 6D pose, while R rotation and translation. In Fig. 2, we show the pipeline of our Trans6D+. Given the 3D model of an object and the predicted 6D pose parameters, we first render the object at the
Trans6D
119
initial pose and crop the image around the object. The rendered image embeds the initial pose information and our idea is to make the Trans6D+ learning to align these two images incrementally. Trans6D+ contains two parallel Transformer branches, one receives the real image as input while the other extracts the feature from rendered image. Similar to Tran6D-pure, rendered image is also divided into the same number of patches with real image branch (fc ), and 2 then flattened them into a sequence of 2D patches fr ∈ RN ×(P ·C ) by a linear projection. Then patches from two branches are subtracted and fed into the Transformer encoder layers to extract the global dependencies. An extra learnable embedding token x∗ is added into the input sequences and this token is used to extract a global feature. Finally, the residual of rotation ΔR and translation Δt is regressed by the global feature. The whole process can be formulated as: ΔR, Δt = FCLayers(Transformer(fc − fr )[0, :]) 3.4
(6)
Training
To train the transformer encoder, we apply loss functions on top of the transformer outputs, the network predicts a confidence value for each pixel to indicate whether it belongs to the object. The corresponding loss function is defined as ⎞ ⎛ n
c =3
¯ vis ◦ M ˆ XYZ − M ¯ XYZ ⎠ M LCorr =α · 1 ⎝ j j , (7) j=1
ˆ vis − M ¯ vis + β · 1 M ¯ vis ¯ XY Z represents the 3D coordinate of the available 3D model, M where M represents the visible masks. Once we get the confidence map and the correspondence map, the coordinates belong to the object can be obtained by setting a threshold for the confidence. However, the size of the object in an RGB image is different from the original image since we use a detector. To build the 2D-3D correspondences, we map the pixel from the coordinates map back to the RGB image.
4
Experiments
In this section, we first conduct ablation studies on the effectiveness of each module in Trans6D, and then evaluate Trans6D on prevalent benchmark datasets. The results show that our method achieves state-of-the-art performance. 4.1
Implementation Details
We implemented our framework using Pytorch and conducted all the experiments on an Intel i7-4930K CPU with one GTX 2080 Ti GPU. During training,
120
Z. Zhang et al.
we use Adam [27] for optimization. Also, we set the initial learning rate as 1e-4 and halve it every 50 epochs. The maximum epoch is 300. For object detection part, we fine-tune the YOLO-V3 [32] architecture which is pre-trained on the ImageNet [10] to locate the 2D object and fixed its size to 256 × 256. 4.2
Datasets
LINEMOD: is a widely used dataset for 6D object pose estimation. It consists of 13 objects, each containing about 1.2 k images with ground-truth poses for a single object. This dataset exhibits many challenges for pose estimation: textureless objects, cluttered scenes, and lighting condition variations. Following [25], we select 15% of the RGB images for training and 85% for testing. We also render 1000 images for each object as a supplement to the training set. Occlusion LINEMOD: is a widely used dataset for 6D object pose estimation under severe occlusion. It consists of 8 objects, each containing about 1214 images with more occlusion are provided for testing. All Occlusion LINEMOD datasets are used for testing. 4.3
Evaluation Metrics
We evaluate Trans6D using average 3D distance of model points (ADD) metric [16]. ADD Metric. This metric computes the mean distance between two transformed object model using the estimated pose and the ground-truth pose. When the distance is less than 10% of the model’s diameter, it is claimed that the estimated pose is correct. It can be computed by: 1
· x + T), (R · x + T) − (R |M|
(8)
x∈M
where |M| is the number of points in the object model. x represents the point and T are the in object 3D model, R and T are the ground truth pose, and R estimated pose. For symmetric objects, we use the ADD-S metric [2], where the mean distance is computed based on the closest point distance. :
1
· x2 + T (9) min (R · x1 + T) − R , x2 ∈M |M| x1 ∈M|
4.4
Ablation Studies
Compared to other methods [25], our proposed Trans6D has three novelties. First, we build two Transformer-based baselines for 6D pose estimation: pure Transformers-based Trans6D-pure and Trans6D-hybrid combining CNNs with Transformers. As shown in Table 1, we compare Trans6D-hybrid (ResNet34
Trans6D
121
Table 1. Ablation studies of the effectiveness of two Transformer-based baselines on LINEMOD dataset. The metric we used to measure performance is ADD(S) metric. “CNN” means CNNs-based method, “Trand6D-p” and “Trans6D-h” denote Trans6D-pure and Trans6D-hybrid, respectively. Metric
ADD(-s)
Method
CNN
Trans6D-p Trans6D-h
Ape
11.43%
42.33%
49.11%
Benchvise
95.05%
94.18%
96.46%
Camera
75.49%
88.04%
91.84%
Can
89.57%
92.87%
95.02%
Cat
51.50%
77.73%
Driller
97.36% 94.01%
Duck
23.19%
Eggbox
99.53% 96.63%
Glue
94.21% 89.14%
Holepuncher 68.22%
54.34%
87.26%
81.51% 96.79% 55.99% 98.75% 93.02% 89.49%
Iron
93.77%
94.80%
96.75%
Lamp
97.02%
94.86%
98.21%
Phone
82.72%
90.15%
92.96%
Average
75.04%
84.34%
87.73%
Table 2. Ablation studies of the effectiveness of patch-aware feature fusion (PAFF) Module on Occlusion LINEMOD dataset. The metrics we used to measure performance are ADD(-S). “SW” means sliding windows operation, “TP” means token pooling operation and “CA” means cross-attention network. Method SW TP CA ADD(-S) EXP1
×
×
40.9%
EXP2
×
43.1%
EXP3
×
45.4%
EXP4
×
50.6%
Table 3. Ablation studies of the effectiveness of pure Transformer-based refinement (Trans6D+) Module on LINEMOD dataset. The metrics we used to measure performance are ADD(-S). Compared approaches: PVNet [30], DPOD [42], DeepIm [24] PVNet +DPOD +DeepIm +Trans6D+ 85.56% 92.83%
87.13%
95.95%
DPOD +DPOD +DPIM
+Trans6D+
82.98% 95.15%
95.86%
88.6%
122
Z. Zhang et al.
Table 4. Quantitative evaluations on LINEMOD dataset. We use ADD metric to evaluate the methods. For symmetric objects Egg Box and Glue, we use the ADD-S metric. Note that, we summarize the pose estimation results reported in the original papers on LINEMOD dataset. Baseline approaches: BB8 [31], Pix2Pose [29], DPOD [42], PVNet [30], CDPN [25], Hybrid [34], GDRN [39]. Method
BB8
Pix2Pose DPOD
PVNet CDPN Hybrid
Refinement
×
×
×
Ape
40.4% 58.1%
Bench Vise Camera
×()
GDRN Trans6D Trans6D+
×
×
×
53.3% (87.7%) 43.6%
64.4%
63.1%
–
68.1%
88.3%
91.0% 97.5%
95.3% (95.3%) 99.9%
97.8%
99.9%
–
99.5%
99.4%
55.7% 60.9%
90.4% (96.0%) 86.9%
91.7%
90.4%
–
93.7%
97.8%
Can
64.1% 84.4%
94.1% (99.7%) 95.5%
95.9%
98.5%
–
99.4%
99.1%
Cat
62.6% 65.0%
60.4% (94.7%) 79.3%
83.8%
89.4%
–
87.9%
93.2%
Driller
74.4% 76.3%
97.7% (98.8%) 96.4%
96.2%
98.5%
–
97.1%
99.5%
–
67.9%
87.8%
100.0% –
100%
100%
Duck
44.3% 43.8%
66.0% (86.3%) 52.6%
66.8% 65.0%
Egg Box
57.8% 96.8%
99.7% (99.9%) 99.2%
99.7%
Glue
41.2% 79.4%
93.8% (96.8%) 95.7%
99.6%
98.8%
–
98.3%
99.8%
Hole Puncher 74.8% 52.8%
65.8% (86.9%) 81.9%
85.8%
89.7%
–
93.5%
96.7% 99.9%
Iron
84.7% 83.4%
99.8% (100%)
98.9%
97.9%
100.0% –
99.9%
Lamp
76.5% 82.0%
88.1% (96.8%) 99.3%
97.9%
99.5%
–
99.5%
99.7%
Phone
54.0% 45.0%
74.2% (94.7%) 92.4%
90.8%
94.9%
–
98.7%
99.5%
Average
62.7% 72.4%
83.0% (95.2%) 86.3%
89.9 % 91.3%
93.7%
92.6%
96.9%
Table 5. Quantitative evaluations on Occlusion LINEMOD dataset. We use the ADD metric to evaluate the methods. For symmetric objects Egg Box and Glue, we use the ADD-S metric. Note that, we summarize the pose estimation results reported in the original papers on LINEMOD dataset. Baseline approaches: PoseCNN [40], Pix2Pose [29], DPOD [42], PVNet [30], Single-Stage [18], HybridPose [34], GDRN [39]. Method
PoseCNN Pix2Pose PVNet Single-Stage DPOD Hybrid GDRN Trans6D Trans6D+
Refinement
×
×
×
×
×
Ape
9.6%
22.0%
15.8%
19.2%
–
20.9%
46.8% 31.2%
36.9%
Can
45.2%
44.7%
63.3%
65.1%
–
75.3%
90.8%
85.1%
91.6%
Cat
0.9%
22.7%
16.7%
18.9%
–
24.9%
40.5%
38.3%
42.5%
Driller
41.4%
44.7%
65.7%
69.0%
–
70.2%
82.6% 66.5%
70.8%
×
Duck
19.6%
15.0%
25.2%
25.3%
–
27.9%
46.9% 35.0%
41.1%
Egg Box
22.0%
25.2%
50.2%
52.0%
–
52.4%
54.2%
56.3%
Glue
52.9%
38.5%
32.4%
49.6%
51.4%
–
53.8%
75.8% 54.3%
62.0%
Hole Puncher 22.1%
49.5%
39.7%
45.6%
–
54.2%
60.1%
57.7%
61.9%
Average
32.0%
40.8%
43.3%
47.3 % 47.5%
62.2% 52.6%
57.9%
24.9%
+ Transformer) with CNN-based method (ResNet34 + CNN Head) using the same backbone, and also assess the performance of Trans6D-pure and Trans6Dhybrid. We observe that (i) using Transformer instead of CNN to estimate 6D parameters increased the accuracy from 75.04% to 87.73% when evaluated with ADD(-S) metric on LINEMOD dataset. (ii) The Trans6D-Hybrid (87.73%) significantly outperforms the Trans6D-pure (84.34%). The reason is that the pure
Trans6D
123
Table 6. Quantitative evaluations on YCB-Video Datasets. We use 2D-Proj, ADD AUC and ADD(-S) metrics. The threshold of the ADD(-S) metric is 2 cm. Note that, we summarize the pose estimation results reported in the original papers on LINEMOD dataset. State-of-the-art approaches: PoseCNN [40], DeepIM [24], Ameni et al. [35], PVNet [30], Singel-Stage [19], GDR-Net [39] Methods
PoseCNN DeepIM PVNet Single-Stage GDR-Net Ameni et al. Trans6D Trans6D+
2D-Proj.
–
–
47.4%
–
–
55.6%
53.2%
ADD AUC 61.3%
81.9%
73.4%
–
84.4%
83.1%
82.5%
85.9%
ADD(-S)
–
–
53.9%
60.1%
73.6%
67.7%
75.2%
21.3%
62.4%
Transformer-based method divides the image into small patches, which seriously impact the accuracy of regression tasks. Second, we propose the patch-aware feature fusion (PAFF) module, which is used to predict dense 2D-3D correspondence maps. PAFF module decreases the number of tokens without information loss via shifted windows, cross-attention, and token pooling operations. We compare the impact of the three operations and show the results in Table 2. As it can be seen that using the token pooling operation only exhibits the worst performance since it will decrease the model’s representation ability. When combining the cross-attention with either sliding window or token pooling operations, the method has better performance than combining the sliding window and token pooling. It is because that the crossattention can aggregate the local information. Therefore, combining those three operations can avoid information loss and alleviate the impact of image division. Moreover, cross-attention play a decisive role. Third, we propose a pure Transformer-based pose refinement module. In Table 3, we compare our Trans6D+ with DPOD and DeepIM using the same initial pose and number of iterations. Trans6D+ achieves almost 13% improvement on the PVNet while DPOD only has 7% improvement. 4.5
Comparison with State of the Arts
Object 6D Pose Estimation on LINEMOD: In Table 4, we compare our approach with state-of-the-art methods on LINEMOD Dataset. We use 15% of each object sequence to train and the rest of the sequence to test on LINEMOD dataset following other methods. The numbers in brackets are the results without post-refinement. We use Trans6D-pure as our baseline. From Table 1 and Table 4, we can see that Trans6D outperforms the baseline by 8.3% in ADD metric. Trans6D also outperforms Pix2Pose by 20.2% predicts the 2D-3D correspondence by CNN encoder, while we use Transformer to regress such correspondence. Trans6D+ improves the Trans6D by 4.3% and achieves the state-of-the-art performance. Comparing to the second-best method GDR-Net [39] that using hybrid presentation to estimate the 6D pose, our method outperforms it by 3.2% in ADD accuracy. DPOD has two results, one is only using correspondence to estimate
124
Z. Zhang et al.
Fig. 4. Qualitative pose estimation results on LINEMOD and Occlusion LINEMOD dataset. Green 3D bounding boxes denote ground truth. Blue 3D bounding boxes represent our results. Our results match ground truth well. (Color figure online)
the 6D pose and Trans6D outperforms it by 9.6%, the other is a result after refinement and Trans6D+ outperforms it by 1.7%. In Fig. 4, we provide a visual comparison of predict pose versus ground truth pose. Object 6D Pose Estimation on Occlusion LINEMOD: We use the model trained on the LINEMOD dataset for testing on the Occlusion LINEMOD dataset. Table 5 compares our method with other state-of-the-art methods [30,39] on Occlusion LINEMOD dataset in terms of ADD metric. From Table 5, we can see that Trans6D+ achieves a comparable accuracy (57.9%) and Trans6D achieves 52.6%. The improved performance demonstrates that the proposed method, enables Trans6D robust to partial occlusion. Object 6D Pose Estimation on YCB-Video [40] Dataset: YCB-Video dataset contains 92 real video sequences for 21 YCB object instances. This dataset is challenging due to the image noise and occlusions. By following PoseCNN, we report the results on three metrics, 2D-Proj, ADD AUC and ADD(-S) metrics. From Table 6, we can see that Trans6D outperforms the baseline, PoseCNN, by 53.9% in ADD(-s) metric. Trans6D+ improves the Trans6D by 7.5% and achieves the state-of-the-art performance. Comparing to the second-
Trans6D
125
best method, Ameni et al. [35] which also using pose refinement, our method outperforms it by 1.6% in ADD(-s) accuracy and 2.8% in AUC of ADD metric.
5
Conclusions
In this paper, we present a novel 6D object pose estimation framework build upon Transformers. We first construct two Transformer-based baselines and then compare their performance. One of the baselines uses pure Transformers (Trans6D-pure), and the other integrates CNNs with Transformers (Trans6Dhybrid). Then, we introduce two novel modules to improve the performance of the Trans6D-pure. The first is the patch-aware feature fusion (PAFF) module, which predicts the 2D-3D dense correspondence maps without information loss. The second is the pure Transformer-based pose refinement (Trans6D+) module, which iteratively refines the estimated pose. Our experiments demonstrate that the proposed method (Trans6D) achieves state-of-the-art performance in the LINEMOD and Occlusion LINEMOD datasets. Furthermore, our method can be naturally extended to estimate the 6D object pose from the point cloud because the point cloud is sequence data and therefore suitable for Transformer. Despite the state-of-the-art performance, our method is not memory-friendly due to stacking a lot of self-attention modules. In future work, we plan to overcome the memory problem and extend Trans6D to more challenging scenes. Acknowledgements. This work was supported by Institute of Information and communications Technology Planning and evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00537, Visual common sense through self-supervised learning for restoration of invisible parts in images). ZQZ was supported by China Scholarship Council (CSC) Grant No. 202208060266. AL was supported in part by the EPSRC (grant number EP/S032487/1).
References 1. Billings, G., Johnson-Roberson, M.: SilhoNet: an RGB method for 6D object pose estimation. IEEE Robot. Autom. Lett. 4(4), 3727–3734 (2019). https://doi.org/ 10.1109/LRA.2019.2928776 2. Brachmann, E., Michel, F., Krull, A., Yang, M.Y., Gumhold, S., Rother, C.: Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3364–3372 (2016). https://doi.org/10.1109/CVPR.2016.366 3. Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6D object pose estimation using 3D object coordinates. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 536–551. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-106052 35. https://www.microsoft.com/en-us/research/publication/learning-6d-objectpose-estimation-using-3d-object-coordinates/
126
Z. Zhang et al.
4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 5. Chen, D., Li, J., Wang, Z., Xu, K.: Learning canonical shape space for categorylevel 6D object pose and size estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11970–11979 (2020). https:// doi.org/10.1109/CVPR42600.2020.01199 6. Chen, W., Jia, X., Chang, H.J., Duan, J., Leonardis, A.: G2L-Net: global to local network for real-time 6d pose estimation with embedding vector features. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4232–4241 (2020). https://doi.org/10.1109/CVPR42600.2020.00429 7. Chen, W., Duan, J., Basevi, H., Chang, H.J., Leonardis, A.: PonitPoseNet: point pose network for robust 6d object pose estimation. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), March 2020 8. Chen, W., Jia, X., Chang, H.J., Duan, J., Shen, L., Leonardis, A.: FS-Net: fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism (2021) 9. Cheng, Y., Lu, F.: Gaze estimation using transformer. arXiv preprint arXiv:2105.14424 (2021) 10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009. 5206848 11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018) 12. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90 14. He, S., Luo, H., Wang, P., Wang, F., Li, H., Jiang, W.: TransReiD: transformerbased object re-identification. arXiv preprint arXiv:2102.04378 (2021) 15. He, Y., Sun, W., Huang, H., Liu, J., Fan, H., Sun, J.: PVN3D: a deep point-wise 3D keypoints voting network for 6dof pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11629–11638 (2020). https://doi.org/10.1109/CVPR42600.2020.01165 16. Hinterstoisser, S., et al.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012. LNCS, vol. 7724, pp. 548–562. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37331-2 42 17. Hodaˇ n, T., Bar´ ath, D., Matas, J.: Epos: Estimating 6D pose of objects with symmetries. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11700–11709 (2020). https://doi.org/10.1109/CVPR42600.2020. 01172 18. Hu, Y., Fua, P., Wang, W., Salzmann, M.: Single-stage 6D object pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2927–2936 (2020). https://doi.org/10.1109/CVPR42600.2020.00300 19. Hu, Y., Fua, P., Wang, W., Salzmann, M.: Single-stage 6d object pose estimation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2930–2939 (2020)
Trans6D
127
20. Hudson, D.A., Zitnick, C.L.: Generative adversarial transformers (2021) 21. Jiang, Y., Chang, S., Wang, Z.: TransGAN: two transformers can make one strong GAN. arXiv preprint arXiv:2102.07074 (2021) 22. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: making RGBbased 3D detection and 6D pose estimation great again. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1530–1538 (2017). https://doi. org/10.1109/ICCV.2017.169 23. Kumar, M., Weissenborn, D., Kalchbrenner, N.: Colorization transformer (2021) 24. Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: Deepim: deep iterative matching for 6d pose estimation. In: European Conference on Computer Vision (ECCV) (2018) 25. Li, Z., Wang, G., Ji, X.: CDPN: coordinates-based disentangled pose network for real-time RGB-based 6-dof object pose estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7677–7686 (2019). https://doi. org/10.1109/ICCV.2019.00777 26. Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1954–1963 (2021) 27. Morrison, D., et al.: Cartman: the low-cost cartesian manipulator that won the amazon robotics challenge. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7757–7764. IEEE (2018) 28. Oberweger, M., Rad, M., Lepetit, V.: Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 125–141. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0 8 29. Park, K., Patten, T., Vincze, M.: Pix2pose: pixel-wise coordinate regression of objects for 6d pose estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7667–7676 (2019). https://doi.org/10.1109/ICCV. 2019.00776 30. Peng, S., Zhou, X., Liu, Y., Lin, H., Huang, Q., Bao, H.: Pvnet: pixel-wise voting network for 6dof object pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2020). https://doi.org/10.1109/TPAMI.2020.3047388 31. Rad, M., Lepetit, V.: BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3848–3856 (2017). https://doi.org/10.1109/ICCV.2017.413 32. Redmon, J., Farhadi, A.: Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018) 33. Shao, J., Jiang, Y., Wang, G., Li, Z., Ji, X.: PFRL: pose-free reinforcement learning for 6d pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11451–11460 (2020). https://doi.org/10.1109/ CVPR42600.2020.01147 34. Song, C., Song, J., Huang, Q.: Hybridpose: 6d object pose estimation under hybrid representations. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 428–437 (2020). https://doi.org/10.1109/CVPR42600. 2020.00051 35. Trabelsi, A., Chaabane, M., Blanchard, N., Beveridge, R.: A pose proposal and refinement network for better 6d object pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2382–2391, January 2021
128
Z. Zhang et al.
36. Wada, K., Sucar, E., James, S., Lenton, D., Davison, A.J.: MoreFusion: multiobject reasoning for 6d pose estimation from volumetric fusion. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14528–14537 (2020). https://doi.org/10.1109/CVPR42600.2020.01455 37. Wang, C., et al.: 6-pack: category-level 6d pose tracker with anchor-based keypoints. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10059–10066 (2020). https://doi.org/10.1109/ICRA40945.2020. 9196679 38. Wang, C., et al.: Densefusion: 6d object pose estimation by iterative dense fusion. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3338–3347 (2019). https://doi.org/10.1109/CVPR.2019.00346 39. Wang, G., Manhardt, F., Tombari, F., Ji, X.: GDR-NET: geometry-guided direct regression network for monocular 6d object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16611– 16621 (2021) 40. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes (2017) 41. Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: towards explainable human pose estimation by transformer (2020) 42. Zakharov, S., Shugurov, I., Ilic, S.: DPOD: 6d pose object detector and refiner. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1941–1950 (2019). https://doi.org/10.1109/ICCV.2019.00203 43. Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers (2020)
Learning to Estimate Multi-view Pose from Object Silhouettes Yoni Kasten1 , True Price2(B) , David Geraghty2 , and Jan-Michael Frahm2 1
NVIDIA Research, Santa Clara, USA [email protected] 2 Meta, Seattle, USA {jtprice,dger,jmfrahm}@meta.com
Abstract. While Structure-from-Motion pipelines certainly have their success cases in the task of 3D object reconstruction from multiple images, they still fail on many common objects that lack distinctive texture or have complex appearance qualities. The central problem lies in 6DOF camera pose estimation for the source images: without the ability to obtain a good estimate of the epipolar geometries, all state-of-the-art methods will fail. Although alternative solutions exist for specific objects, general solutions have proved elusive. In this work, we revisit the notion that silhouette cues can provide reasonable constraints on multi-view pose configurations when texture and priors are unavailable. Specifically, we train a neural network to holistically predict camera poses and pose confidences for a given set of input silhouette images, with the hypothesis that the network will be able to learn cues for multi-view relationships in a data-driven way. We show that our network generalizes to unseen synthetic and real object instances under reasonable assumptions about the input pose distribution of the images, and that the estimates are suitable to initialize state-of-the-art 3D reconstruction methods.
1 Introduction
Three-dimensional object reconstruction, the process of converting imagery of an object into a representation of its geometry, is an increasingly mainstream component of augmented- and virtual-reality (AR/VR) research and applications, with much of this growth due to the increasing facility and scalability of capture technologies. In AR/VR entertainment, for example, commodity 3D scanning technology can efficiently generate photorealistic models for use in virtual worlds, reducing manual effort required by 3D artists. Likewise, many research applications now utilize realistic 3D models to drive synthetic data generation. Historically, high-quality object capture methodologies have required a certain level of controlled capture, such as a fixed camera rig, or specific imaging equipment, such as a depth sensor [29,38].

Y. Kasten—This work was completed while Yoni was an intern at Meta.
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9_8.
Fig. 1. Our deep neural network takes as input a set of silhouette masks of an object observed from different viewpoints. After applying several permutation-equivariant layers that combine image-specific and image-set-generic features, the network outputs a 6DOF pose and a pose confidence for each input image.
Modernized pipelines driven by structure-from-motion (SfM) followed by multi-view stereo (MVS) and depth-map fusion [47,48] have increasingly democratized the process in recent years, enabling less-experienced users to run photogrammetry from a handheld camera, either with known temporal sequencing (i.e., video capture) or capturing images as an unordered collection. These general-purpose pipelines also enable distributed collection, where photos from multiple users in different environments are leveraged to create a 3D model. However, while casual 3D reconstruction is increasingly feasible, output reconstruction quality in these pipelines varies widely depending on the input imagery and target object, and there exist several categorical limitations, particularly for low-texture objects and objects having non-Lambertian surface reflectance properties. A number of approaches, for example the recent works of Yariv et al. [66] and Schmitt et al. [46], have pushed the envelope of dense surface estimation pipelines by jointly modeling object geometry, view-dependent lighting/reflectance effects, and – importantly – allowing for camera pose parameters to be refined as part of the optimization process, which is not generally possible with traditional MVS. These approaches have shown a remarkable improvement in completeness of the reconstructed object surface, as well as impressive quality for low-texture objects with complex appearance. In this paper, we address a key remaining gap for state-of-the-art reconstruction methods: camera pose initialization, particularly when photometric methods fail. To this end, we introduce a neural-network-based alternative to SfM tackling the classical computer vision problem of multi-view pose from unknown object silhouettes [7,8,17,31]. Given a set of binary object masks obtained from multiple images of an object, the goal is to produce a camera pose for each image, relative to an arbitrary, unspecified object coordinate frame. The driving concept here is that, when constraints like point correspondences cannot be utilized, the
object contour in the image still provides signal on the space of possible relative camera poses. For example, each foreground pixel in one image has a corresponding location in every other image, and thus the epipolar lines for those pixels must intersect the object silhouette in the other image. While previous work has wielded such principles using handcrafted features and/or controlled scenarios, our hypothesis is that a neural network should be able to naturally learn the joint space of camera viewpoints. For the current work, we assume we are given a set of pre-extracted silhouette images with known camera intrinsic calibration, generally upright orientation, and medium-baseline camera motion. This image set is fed to an order-equivariant neural network (Fig. 1) that regresses a six-degrees-of-freedom (6DOF; i.e. rotation and translation) pose for each image simultaneously, as well as a confidence estimate that helps to identify images with higher levels of pose ambiguity. The network’s task during training is to learn how to map the object contours into a common latent representation while also taking into account the global state of all contours together, and then to form a mapping from this representation into a final 6DOF pose. For training, we render randomly posed silhouettes of CAD models and directly optimize the network’s output to match the poses used for rendering. At test time, the network uses only the input silhouette images, without any knowledge about the 3D geometry of the observed object. Our method provides the following overall contributions:
– A deep-learning approach for silhouette-based multi-view 6DOF pose estimation for unknown objects. Previous works in this space have been very tailored to controlled settings or known objects [64], or have required carefully handcrafted features in a robust framework while only estimating 3DOF camera rotations [31].
– A neural network architecture leveraging DSS and DeepSets layers [33,68] to achieve unordered multi-view pose estimation. Such architectures have not been previously used for this task. In our case, the selection of the output global coordinate system is arbitrary, and there is not just one “correct” solution, in contrast to previous applications of permutation-equivariant layers. We thus introduce a new loss function that is agnostic to the output global coordinate system (Eqs. (2–7)). This formulation is crucial for making the training problem possible.
– A loss formulation that incorporates the von Mises-Fisher distribution to allow for pose confidence regression. We demonstrate that our network’s confidence predictions reliably correspond to per-view pose accuracy results.
– Generalizability: While we train on only 15 object classes of CAD models, we show that the network generalizes to unseen object classes on a number of synthetic and real datasets, including datasets with imperfect masks.
– Putting silhouettes into practice: Considering the case of uncontrolled, unknown object capture, we demonstrate that silhouette-based reasoning offers a workable solution for low-texture objects where color-based reconstruction methods have inherent limitations. Examples are shown for a new
“Glass Figurines” dataset, where our method succeeds in several challenging cases where a state-of-the-art SfM pipeline [48] fails.
2 Related Work
Camera pose estimation for object reconstruction has a long history in the field of computer vision. For unknown objects and unordered images, possibly the most well-established approaches are photogrammetric methods like Structure-from-Motion (SfM) [48]. These methods are driven by 2D feature correspondence search, where distinct 2D image keypoints are detected, described, and matched between the input images. Assuming that such 2D image-to-image correspondences can be reliably found, additional geometric reasoning is used to begin recovering 6DOF image poses. In contrast to incremental SfM methods that build the final 3D reconstruction one image at a time, the method we propose is more in line with global SfM approaches [12,50] and recent holistic deep-learned approaches [37], where pose properties for all images are determined simultaneously. Typical global SfM methods rely on two-view pose estimates to initially solve for absolute image rotations, followed by a second stage to solve for absolute image positions. Recent work by Kasten et al. [25] has also suggested a one-step global approach by averaging essential matrices. While our neural network architecture does not leverage two-view relationships directly, it does employ a global representation of all images when deriving latent representations at different stages of the network. It is also worth noting that many active-capture applications, for example object reconstruction pipelines that run on a smartphone [42,53], augment the camera pose estimates with inertial measurement unit readings available on the device, which provide a strong prior for the differential motion of the camera. In our work, we assume a different capture scenario, where the object of interest may be moved between different frames, or even where the collection of object images is derived from different locations at different times. Moreover, all SfM-type methods heavily depend on the reconstructibility of the object of interest. For objects with low texture or complex appearance, these methods often fail because the photometric assumptions underlying keypoint detection and description are violated.

Learning-Based Methods for 3D Object Reasoning. A litany of methods have been proposed for camera pose detection for known objects, especially in single-view contexts. Early methods leveraged deformable parts models for discrete viewpoint prediction [13,32,40]. Related work [16] achieved 6DOF pose estimation via view synthesis with brute-force evaluation of geometry priors. With the advent of deep learning, numerous approaches have advanced single-view object detection and pose estimation, including for discrete prediction, 3D bounding box estimation, direct pose regression, and direct 2D-to-3D point correspondence regression [2,5,27,41,43,54,62,63]. One recent extension to these works is HybridPose [51], which combines object pose regression with learned feature extraction and subsequent pose refinement. Also relevant to our work is SilhoNet [1], an object pose regressor that is trained to predict occlusion-aware and occlusion-agnostic object silhouette masks as an intermediate output. The
silhouette is used as the primary cue for rotation, which allows the rotation regression submodule to train entirely on synthetic data. Reconstruction-focused approaches have emphasized learning shape priors for object classes, especially for single-view geometry prediction. Choy et al. [4] trained a recurrent neural network for volumetric 3D reconstruction of multiple object classes. Beyond single-view shape estimation, this network is able to iteratively aggregate multiple images to refine the output, resulting in a coarse 3D model for instances of the trained-for classes. While this and related methods [6,10,35,61] penalize errors in 3D geometry, prior work leveraging deformable shapes [3,24] and subsequent works in differentiable projection and rendering have used object masks directly. Several methods [14,26,44,56,58,65] train single-view volumetric or mesh reconstruction models by reprojecting voxels into other views and optimizing the predicted voxel occupancy against the ground-truth object mask. Some such methods have also reported results on 2 to 5 input views [14,44,58]. Several learning-based reconstruction methods exist that estimate camera pose for canonical object frames [70] or between image pairs [20,55] with a silhouette-based loss. In the latter cases, a network is shown a pair of images and jointly predicts (1) the relative pose between them and (2) a 3D geometry (a voxelization or a point cloud) for the object. The models are trained by reprojecting the predicted geometry into the first image and penalizing disagreements with the associated object mask. Each input image is independently processed, allowing for single-view applications of geometry and pose estimation at inference time. A number of recent works learn a neural radiance field (NeRF) [36] while optimizing camera parameters [22,30,57,67]. These methods either require camera initialization, or can only handle roughly forward-facing scenes. Very recently, [34] used a generative adversarial training strategy without input camera poses in a general camera setup with a known camera distribution. For each scene, they train from scratch a (NeRF, discriminator) network pair by sampling camera poses according to the distribution and training the NeRF to fool the discriminator for whether a patch is fake (rendered) or real. This training process is heavy, on the order of hours, and must be done separately for each scene without any generalizability to new scenes. While the majority of the work focuses on color image processing, the authors do present a single proof-of-concept result taking in a large collection of silhouette images as input. In contrast to NeRF methods, our approach trains a neural network that holistically processes image sets in a single pass and generalizes to unseen objects and object classes.

Pose from Silhouettes of an Unknown Object. Methods for camera pose estimation from silhouettes date back more than two decades. Classical approaches utilize epipolar mapping constraints, where all epipolar silhouette lines mapped from one view must intersect the silhouette in another. For two views, epipolar constraints yield corresponding 2D object contour points with tangent epipolar lines. For multiple views, the constraints amount to finding a consistent visual hull for all images.
Fig. 2. Our permutation-equivariant network architecture. Starting from the N input silhouettes, five groups of three convolutional DSS layers are sequentially applied, interspersed by max-pooling operations. Then, three deep sets (fully-connected) layers are applied to finally obtain 10 output values per image, representing the image’s rotation, translation, and pose confidence. Each convolution is followed by a batch normalization layer (not shown), and we use ELU activations throughout the network.
Many early approaches optimized pose by identifying corresponding silhouette frontier points, either as single-take methods under controlled capture (e.g., a turntable or using mirrors) [7,18,19,59,60] or by optimizing a camera rig configuration under repeated observations [8,23,49]. Visual hull optimization [17] has also been proposed for controlled capture scenarios. One similar work to ours is that of Littwin et al. [31], which aims to estimate the rotation distance between cameras using a handcrafted measure of silhouette contour similarity. While this two-view measure is quite noisy, the authors show that, given a sufficiently large source image set, an inlier set of relative rotation measurements can be determined via a robust fitting procedure. In contrast, our network considers all available images jointly when making its predictions, and it additionally can reason about camera translations and pose confidence. Finally, Xiao et al. [64] trained a neural network to regress 6DOF pose for a novel object in an image under the assumption that the object geometry is known at inference time. Their approach first computes separate shape and appearance encodings and then feeds these to a pose regression sub-network. While the authors did not analyze the network’s activations, it is quite possible that their network learns to encode object contours in the input image and compare these to possible projections of the 3D shape.
3 Method
We assume an input set of N images taken by N cameras capturing the same 3D object at different viewpoints. We further assume that the object silhouette masks are pre-extracted from the input images. In practice, this can be done either by classical or learning-based approaches [15] for 2D object segmentation. Our goal is to regress the camera poses solely from the silhouette masks. To tackle this problem, we introduce a deep neural network architecture that learns to infer a set of 6DOF camera poses and pose confidences from silhouettes using a large training set of general 3D objects. To improve robustness against two-view ambiguities, our network considers all N input silhouettes jointly. While we directly optimize pose error during training, we observe that our network outputs poses that respect the silhouette constraints leveraged by earlier non-learning-based methods (see supplementary).
3.1 Network Architecture
Our network architecture (Fig. 2) is based on the recently introduced “deep sets of symmetric elements” (DSS) layers [33], which have been shown to be effective across a variety of learning tasks involving inputs of unordered image sets. The input for each DSS layer is a set of N images with the same number of channels. Two learnable convolutional filters are then applied: a Siamese filter that is applied on each input image independently, and an aggregation module filter that is applied on the sum of all input images. The output of the second filter is then added to each output of the first filter, resulting in a new set of N images with a possibly different number of channels. Since summation is a permutation-invariant operation, it follows that a DSS layer is permutation-equivariant, meaning that applying a permutation to the N input images results in permuted outputs. In our case, since the multi-view input silhouettes are unordered, we design our network to be permutation-equivariant. The inputs to the network are N one-channel {0, 1} silhouette binary masks, and the outputs are the corresponding N camera poses, each represented by 10 coordinates: 6 for the world-to-camera rotation using the 6D parameterization of [69], 3 for the world-to-camera translation, and another scalar to represent the confidence of the network in the estimated pose. We use a sequence of 5 DSS blocks, each consisting of 3 DSS layers with max-pooling operations between each block, followed by permutation-equivariant fully connected (“deep sets”) layers as in [68].
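As a concrete sketch of how such a permutation-equivariant layer and the 10-coordinate output decoding could look, the following PyTorch fragment implements one DSS convolution (a Siamese branch plus an aggregation-over-the-sum branch, with batch normalization and ELU as in Fig. 2) and the Gram-Schmidt decoding of the 6D rotation parameterization of [69]. The kernel size, channel handling, and function names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSSConv(nn.Module):
    """One permutation-equivariant DSS convolution over a set of N images.

    A Siamese filter is applied to each image independently, an aggregation
    filter is applied to the sum over the set, and the latter is added back to
    every element, so permuting the inputs simply permutes the outputs.
    """
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.siamese = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.aggregate = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):                                    # x: (N, C, H, W), one silhouette set
        per_image = self.siamese(x)                          # element-wise branch
        pooled = self.aggregate(x.sum(dim=0, keepdim=True))  # permutation-invariant branch
        return F.elu(self.bn(per_image + pooled))            # broadcast-add over the set

def decode_rotation_6d(r6):
    """Gram-Schmidt decoding of the 6D rotation parameterization of [69].

    r6: (N, 6) raw network outputs -> (N, 3, 3) rotation matrices.
    """
    a1, a2 = r6[:, :3], r6[:, 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)                 # stacked as matrix rows
```

The full 10-dimensional output per image would then be split into the 6D rotation block, a 3D translation, and the confidence scalar κ (e.g., passed through a softplus to keep it positive); the exact activation used for κ is not specified above and is an assumption here.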
3.2 Confidence-Based Loss Function
We use ground-truth camera poses (available at training time) for training the network. Let (R_1, t_1), . . . , (R_N, t_N) denote the output camera poses from the deep network and (\bar{R}_1, \bar{t}_1), . . . , (\bar{R}_N, \bar{t}_N) the respective ground-truth (GT) camera poses. For each input silhouette image, we predict a single pose confidence κ ∈ R corresponding to the scale parameter of a von Mises-Fisher (vMF) distribution [52]. First, it is worth taking a moment to discuss possible formulations for the network loss function and confidence prediction. On one hand, we could forego confidence estimation entirely and directly penalize the rotation and translation errors using, e.g., an L2 penalty. We empirically found this approach to give similar accuracy to our formulation, with the caveat that the network no longer provides quality ratings for individual pose estimates. Alternatively, we could adopt a complete probability distribution like the Bingham distribution on rotation [11], which can properly model uncertainty directions in the tangent space of SO(3). In practice, however, we found that introducing more confidence parameters for rotation made the network more difficult to train. This may be caused in part by the fact that the full space of SO(3) is much larger than that of our assumed input viewpoints, which are generally upright and always object-centric. As such, we have chosen to model a single confidence parameter for rotation alone, and we show in our experiments that this approach is effective in separating good-quality pose estimates from those that are less certain. To model 3DOF camera rotation and its confidence, we adopt a maximum-likelihood formulation where we predict a 2DOF probability distribution mean
and scale for each axis of the local camera frame. We define three vMF distributions that share the same scale parameter κ, each with a probability density function of

    f_i(x; r_i, \kappa) = C_3(\kappa)\, e^{\kappa r_i^T x},    (1)

where r_i ∈ S^2 for i ∈ {1, 2, 3} are the (unit) row vectors of the predicted rotation matrix for the image, and C_3(\kappa) = \frac{\kappa}{2\pi (e^{\kappa} - e^{-\kappa})} forms a normalization factor. We aim to predict distributions that explain the GT rotation axes with as high of a probability as possible. Denoting the GT row as \bar{r}_i ∈ S^2, the log-likelihood for this vector to be sampled from the corresponding distribution is

    l_i(R_0) = \log(C_3(\kappa)) + \kappa\, r_i^T (R_0 \bar{r}_i),    (2)

where R_0 is a global rotation ambiguity of our solution relative to the GT cameras. For log-likelihood l_i^j of camera j, we can compute R_0 as

    R_0^* = \arg\max_{R_0} \sum_{j=1}^{N} \sum_{i=1}^{3} l_i^j(R_0).    (3)

In the supplementary material, we show that this can simply be done by weighted relative rotation averaging:

    \tilde{q}_0^* = \frac{1}{N} \sum_{j=1}^{N} \kappa_j\, q_j^{-1} \bar{q}_j, \qquad q_0^* = \frac{\tilde{q}_0^*}{\|\tilde{q}_0^*\|},    (4)

where q_i, \bar{q}_i, and q_0^* are the quaternions corresponding to R_i, \bar{R}_i, and R_0^*, respectively. The final loss function for the rotation and confidence outputs is

    L_{R,\kappa} = \frac{1}{3N} \sum_{j=1}^{N} \sum_{i=1}^{3} -l_i^j(R_0^*).    (5)

For our predicted translation vectors, a camera-center loss is applied by considering global translation and scaling ambiguities. Denoting c_i = -R_0^{*T} R_i^T t_i and \bar{c}_i = -\bar{R}_i^T \bar{t}_i as predicted and GT camera centers, respectively, the camera-center loss is defined by

    L_c = \frac{1}{N} \sum_{i=1}^{N} \left\| \frac{c_i - c}{s} - \frac{\bar{c}_i - \bar{c}}{\bar{s}} \right\|,    (6)

where the mean vectors c = \frac{1}{N}\sum_{j=1}^{N} c_j and \bar{c} = \frac{1}{N}\sum_{j=1}^{N} \bar{c}_j account for the global translation ambiguity, and we divide by the mean distance between each camera center and the average center: s = \frac{1}{N}\sum_{j=1}^{N} \|c - c_j\| and \bar{s} = \frac{1}{N}\sum_{j=1}^{N} \|\bar{c} - \bar{c}_j\|. Our total loss function is

    L = \beta L_c + L_{R,\kappa},    (7)

with scalar weight β balancing the two loss parts. In our experiments, we use β = 2, which was chosen based on examining the validation set error. L_c values are in [0, 1], and values of L_{R,κ} are typically around −10.
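To make Eqs. (1)-(5) concrete, the sketch below computes the rotation/confidence loss for one image set in PyTorch, assuming the aligning rotation R_0^* of Eq. (4) has already been obtained; it also includes one closed-form way to convert a predicted κ into the 95% angular spread used later to visualize confidence. The numerical details and function names are illustrative assumptions, not the authors' implementation.

```python
import math
import torch

def log_C3(kappa):
    # log of the vMF normalization constant C3(k) = k / (2*pi*(e^k - e^-k)),
    # rewritten as log k - log(2*pi) - k - log(1 - e^(-2k)) for numerical stability.
    return torch.log(kappa) - math.log(2.0 * math.pi) - kappa - torch.log1p(-torch.exp(-2.0 * kappa))

def rotation_confidence_loss(R_pred, R_gt, kappa, R0):
    """Negative vMF log-likelihood over the three rotation rows (Eqs. (2) and (5)).

    R_pred, R_gt: (N, 3, 3) predicted / GT world-to-camera rotations (rows are axes).
    kappa:        (N,) predicted confidences, assumed positive.
    R0:           (3, 3) global aligning rotation, assumed precomputed via Eq. (4).
    """
    aligned = torch.einsum('ab,nib->nia', R0, R_gt)   # R0 applied to each GT row
    cos_sim = (R_pred * aligned).sum(dim=-1)          # r_i^T (R0 * GT row), shape (N, 3)
    log_lik = log_C3(kappa)[:, None] + kappa[:, None] * cos_sim
    return -log_lik.mean()                            # mean over all 3N terms, as in Eq. (5)

def vmf_angular_spread(kappa, coverage=0.95):
    # Angle around the predicted axis containing `coverage` of the vMF mass,
    # from the closed-form CDF of the polar angle of a 3D vMF distribution.
    cos_theta = 1.0 + torch.log1p(-coverage * (1.0 - torch.exp(-2.0 * kappa))) / kappa
    return torch.acos(torch.clamp(cos_theta, -1.0, 1.0))
```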
3.3 Training
For training our network, we use a collection of synthetic object models from multiple object categories. For each training iteration, we render N 256 × 256 px silhouettes with random camera poses around the given object. Each input set of N images is generated with azimuth and elevation sampled uniformly in the range [−30◦ , 30◦ ], while the camera roll from the scene vertical is sampled from a normal distribution with a standard deviation of 5◦ . This viewing range is selected to approximate a typical set of casually captured viewpoints of one side of an object; for example, the DTU dataset [21] used in our experiments has a similar range of viewing angles. Each camera is initially positioned to look at the object origin (defined as the median vertex), with a distance from the origin sampled uniformly within the range [3.2r, 6r], where r is the object radius. The camera translations are then perturbed by an offset sampled from N (0, (0.005r)I3×3 ). The object itself is rotated randomly around its origin. In all experiments, we use N = 10 input views for training. This number was reported by [31] to be a large-enough support set in the multi-view setting. We trained the network by minimizing Eq. (7) on a training split of 15 object categories. We used the ADAM optimizer [28] with a learning rate of 0.001. To improve initial training acceleration, we began with a batch size of 5 (i.e., 5 groups of 10 random poses) with all images coming from the same object. To better maintain training acceleration in later epochs, we switched to a batch size of 1 after ∼60 epochs, and we trained overall for ∼250 epochs.
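A minimal NumPy sketch of this training-time viewpoint sampling is given below. Only the sampled distributions are specified above, so the look-at construction, axis conventions, and function name are assumptions made for illustration.

```python
import numpy as np

def sample_training_camera(obj_radius, rng=None):
    """Sample one world-to-camera pose (R, t) from the training distribution described above."""
    rng = np.random.default_rng() if rng is None else rng
    azim, elev = np.radians(rng.uniform(-30.0, 30.0, size=2))   # azimuth / elevation
    roll = np.radians(rng.normal(0.0, 5.0))                     # roll about the optical axis
    dist = rng.uniform(3.2 * obj_radius, 6.0 * obj_radius)      # distance to the object origin

    # Camera center on a sphere around the object origin (the median vertex).
    center = dist * np.array([np.cos(elev) * np.sin(azim),
                              np.sin(elev),
                              np.cos(elev) * np.cos(azim)])

    # Look-at rotation: the camera z-axis points from the camera toward the origin.
    z = -center / np.linalg.norm(center)
    x = np.cross(np.array([0.0, 1.0, 0.0]), z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                                     # rows are the camera axes

    # Apply the sampled roll about the optical axis.
    c, s = np.cos(roll), np.sin(roll)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]) @ R

    # Perturb the camera center and convert to a translation t = -R c.
    center = center + rng.normal(0.0, 0.005 * obj_radius, size=3)
    return R, -R @ center
```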
4 Experiments

4.1 Datasets
We evaluated our trained network on 3D objects from a validation split of object models from 3D Warehouse. We tested unseen objects and camera configurations from our 15 training classes, plus 5 unseen object classes. In addition to manually created models, we further evaluated the network on a new dataset, RealScan, that consists of 30 high-resolution scans of a variety of real 3D objects ranging from stuffed animals to office supplies. We projected these scans with the same sampling described in Sect. 3.3 for 100 sets of 20 random views, with each set of views coming from either the front, back, top, sides, or bottom of the object. These high-polygon meshes, as well as their projected contours, are very different from the ones that are used for training the network. We further applied our method to real images from (1) the DTU MVS dataset [21] and (2) a new “Glass Figurines” dataset containing objects that are difficult to reconstruct using traditional SfM methods. For DTU, we evaluated the 15 back-row cameras (available for scans with id number >80), which are far enough from the object that most of the object is visible in the image. For the 8 scans, we used the input masks that were extracted manually by [39,66]. The Glass Figurines dataset consists of 11 objects with 10 images each, plus manually extracted object masks and ground-truth camera poses computed using ArUco Tags [9,45]. We plan to publicly release the dataset.
Table 1. Camera pose accuracy and reprojection IOU for the 3D Warehouse dataset.

Metrics | | R [°] | ts [ratio] | td [°] | R [IOU] | t [IOU] | ts [IOU] | td [IOU] | R + t [IOU]

Average per-class validation statistic over all 15 training classes
Valid. | Mean (Med.) | 6.40 (4.59) | 0.03 (0.02) | 2.36 (1.50) | 0.84 (0.88) | 0.70 (0.73) | 0.92 (0.94) | 0.70 (0.74) | 0.46 (0.47)
Valid. | ↑ 5 (Oracle) | 5.17 (3.88) | 0.03 (0.03) | 2.17 (2.23) | 0.86 (0.87) | 0.71 (0.71) | 0.93 (0.93) | 0.72 (0.71) | 0.49 (0.48)
Valid. | ↓ 5 (Oracle) | 7.63 (8.92) | 0.04 (0.04) | 2.55 (2.48) | 0.82 (0.81) | 0.68 (0.69) | 0.91 (0.91) | 0.69 (0.69) | 0.42 (0.44)

Unseen test classes
Bathtub | Mean (Med.) | 6.61 (4.69) | 0.03 (0.02) | 2.07 (1.33) | 0.92 (0.95) | 0.86 (0.88) | 0.95 (0.96) | 0.87 (0.89) | 0.67 (0.74)
Bathtub | ↑ 5 (Oracle) | 5.33 (4.13) | 0.03 (0.03) | 1.90 (1.92) | 0.94 (0.94) | 0.87 (0.87) | 0.95 (0.96) | 0.88 (0.88) | 0.70 (0.69)
Bathtub | ↓ 5 (Oracle) | 7.89 (9.09) | 0.03 (0.04) | 2.23 (2.22) | 0.91 (0.90) | 0.85 (0.85) | 0.94 (0.94) | 0.87 (0.87) | 0.64 (0.65)
Car | Mean (Med.) | 6.12 (4.53) | 0.03 (0.02) | 2.27 (1.45) | 0.92 (0.94) | 0.84 (0.86) | 0.94 (0.96) | 0.85 (0.87) | 0.62 (0.69)
Car | ↑ 5 (Oracle) | 4.92 (3.81) | 0.03 (0.03) | 1.99 (2.13) | 0.93 (0.94) | 0.85 (0.85) | 0.95 (0.95) | 0.86 (0.86) | 0.66 (0.64)
Car | ↓ 5 (Oracle) | 7.32 (8.43) | 0.03 (0.03) | 2.55 (2.40) | 0.90 (0.90) | 0.83 (0.83) | 0.94 (0.94) | 0.84 (0.84) | 0.58 (0.60)
Chair | Mean (Med.) | 6.79 (4.96) | 0.03 (0.02) | 2.67 (1.70) | 0.83 (0.87) | 0.74 (0.78) | 0.91 (0.94) | 0.75 (0.79) | 0.49 (0.51)
Chair | ↑ 5 (Oracle) | 5.38 (4.14) | 0.03 (0.03) | 2.38 (2.52) | 0.85 (0.87) | 0.76 (0.75) | 0.92 (0.92) | 0.76 (0.76) | 0.53 (0.51)
Chair | ↓ 5 (Oracle) | 8.19 (9.43) | 0.04 (0.04) | 2.97 (2.83) | 0.80 (0.78) | 0.73 (0.73) | 0.90 (0.91) | 0.74 (0.74) | 0.45 (0.47)
Lamp | Mean (Med.) | 10.60 (7.26) | 0.04 (0.03) | 3.57 (2.24) | 0.77 (0.83) | 0.63 (0.68) | 0.89 (0.93) | 0.64 (0.69) | 0.32 (0.27)
Lamp | ↑ 5 (Oracle) | 9.19 (6.57) | 0.04 (0.04) | 3.48 (3.51) | 0.79 (0.80) | 0.65 (0.65) | 0.90 (0.90) | 0.66 (0.66) | 0.34 (0.33)
Lamp | ↓ 5 (Oracle) | 12.00 (14.62) | 0.05 (0.05) | 3.66 (3.63) | 0.75 (0.73) | 0.61 (0.62) | 0.88 (0.88) | 0.62 (0.63) | 0.30 (0.31)
Mailbox | Mean (Med.) | 11.15 (5.13) | 0.06 (0.03) | 3.98 (2.12) | 0.82 (0.88) | 0.73 (0.78) | 0.93 (0.95) | 0.74 (0.78) | 0.36 (0.30)
Mailbox | ↑ 5 (Oracle) | 8.98 (7.56) | 0.05 (0.05) | 3.49 (3.42) | 0.84 (0.86) | 0.74 (0.75) | 0.94 (0.93) | 0.75 (0.76) | 0.38 (0.38)
Mailbox | ↓ 5 (Oracle) | 13.32 (14.73) | 0.07 (0.07) | 4.47 (4.54) | 0.81 (0.79) | 0.72 (0.71) | 0.93 (0.93) | 0.73 (0.71) | 0.33 (0.33)
4.2 Results
Camera pose accuracy results are presented for the 3D Warehouse dataset in Table 1 and for the RealScan dataset in Table 2. For both, we evaluate on 10 random views of the object per test instance. Due to space limitations, we only show a representative subset of the RealScan results, and for 3D Warehouse, we show the average per-class validation result across all 15 training classes, and for our 5 unseen testing classes. See our supplementary material for complete results. In each row, we show mean and median errors, plus a confidence-ordered breakdown of the mean error, for a variety of metrics. All “Top 5” and “Bottom 5” metrics are taken using our confidence ranking from highest to lowest; we also show “Oracle” rankings for these that consider the ordering from lowest rotation error to highest rotation error. (The oracle ordering is the same for all columns.) The oracle provides a lower bound on the Top-5 error and thus can be used to assess the effectiveness of our confidence predictions. When our confidence output is near to or better than the oracle, this indicates that our network has learned a reasonable confidence for pose. We also report intersection-over-union (IOU), computed by rerendering the test object using our predicted poses after global alignment to the GT poses. An example IOU result is shown in Fig. 3, along with a visualization of the visual hull for our estimated poses. In Tables 1 and 2, we report our mean rotation (R), translation-scale (ts), and translation-direction (td) error. ts is the absolute value of one minus the magnitude ratio of the predicted and GT translation vectors. td is the angle between the predicted and GT translation vectors. Also in Table 1, we isolate the different network outputs: the sixth column shows our rotation combined with GT translation, the next our translation with GT rotation, and so on.
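The sketch below shows how these per-image metrics and the confidence-ordered Top-5 / Bottom-5 split could be computed with NumPy, assuming the predicted poses have already been globally aligned to the ground truth; the function and key names are illustrative assumptions.

```python
import numpy as np

def pose_error_metrics(R_pred, t_pred, R_gt, t_gt, kappa):
    """Per-image rotation, translation-scale, and translation-direction errors.

    Shapes: R_* (N, 3, 3), t_* (N, 3), kappa (N,). Poses are assumed to be
    already aligned to the GT global frame.
    """
    # Rotation error: geodesic angle of R_pred @ R_gt^T.
    rel = np.einsum('nij,nkj->nik', R_pred, R_gt)
    cos = np.clip((np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos))

    # ts: |1 - ||t_pred|| / ||t_gt|||,  td: angle between translation vectors.
    n_pred, n_gt = np.linalg.norm(t_pred, axis=1), np.linalg.norm(t_gt, axis=1)
    ts_err = np.abs(1.0 - n_pred / n_gt)
    cos_t = np.clip(np.sum(t_pred * t_gt, axis=1) / (n_pred * n_gt), -1.0, 1.0)
    td_err = np.degrees(np.arccos(cos_t))

    # Confidence-ordered breakdown: most- and least-confident five images.
    order = np.argsort(-kappa)
    top5, bottom5 = order[:5], order[-5:]
    return {
        'R_mean': rot_err.mean(), 'R_top5': rot_err[top5].mean(), 'R_bottom5': rot_err[bottom5].mean(),
        'ts_mean': ts_err.mean(), 'td_mean': td_err.mean(),
    }
```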
Table 2. Pose accuracy for a representative subset of RealScan.

Object | Metric | R [°] | ts [ratio] | td [°]
Cheetah | Mean (Med.) | 8.78 (6.43) | 0.04 (0.03) | 3.37 (2.57)
Cheetah | ↑ 5 (Oracle) | 7.14 (5.50) | 0.03 (0.03) | 2.80 (3.13)
Cheetah | ↓ 5 (Oracle) | 10.42 (12.06) | 0.04 (0.04) | 3.94 (3.62)
Chess knight | Mean (Med.) | 10.29 (8.01) | 0.04 (0.03) | 4.84 (3.61)
Chess knight | ↑ 5 (Oracle) | 8.12 (5.99) | 0.04 (0.04) | 4.97 (4.64)
Chess knight | ↓ 5 (Oracle) | 12.47 (14.60) | 0.05 (0.04) | 4.70 (5.03)
Glasses | Mean (Med.) | 9.15 (7.57) | 0.05 (0.03) | 4.39 (3.00)
Glasses | ↑ 5 (Oracle) | 7.45 (5.95) | 0.04 (0.04) | 3.91 (4.23)
Glasses | ↓ 5 (Oracle) | 10.85 (12.35) | 0.05 (0.06) | 4.87 (4.56)
Plastic cup | Mean (Med.) | 15.33 (12.96) | 0.05 (0.04) | 6.45 (5.41)
Plastic cup | ↑ 5 (Oracle) | 12.84 (9.70) | 0.05 (0.05) | 6.50 (6.75)
Plastic cup | ↓ 5 (Oracle) | 17.82 (20.96) | 0.05 (0.05) | 6.41 (6.16)
Stapler | Mean (Med.) | 8.88 (4.68) | 0.03 (0.02) | 2.77 (1.81)
Stapler | ↑ 5 (Oracle) | 7.44 (5.50) | 0.03 (0.03) | 2.43 (2.70)
Stapler | ↓ 5 (Oracle) | 10.31 (12.25) | 0.03 (0.03) | 3.10 (2.83)
Toy bunny | Mean (Med.) | 8.29 (5.92) | 0.03 (0.02) | 2.92 (1.95)
Toy bunny | ↑ 5 (Oracle) | 6.06 (4.51) | 0.03 (0.03) | 2.78 (2.80)
Toy bunny | ↓ 5 (Oracle) | 10.52 (12.08) | 0.03 (0.03) | 3.06 (3.04)
Wooden spoon | Mean (Med.) | 10.04 (8.18) | 0.05 (0.03) | 3.52 (2.57)
Wooden spoon | ↑ 5 (Oracle) | 7.68 (5.76) | 0.04 (0.04) | 3.51 (3.65)
Wooden spoon | ↓ 5 (Oracle) | 12.40 (14.32) | 0.05 (0.05) | 3.53 (3.40)

Table 3. Camera pose accuracy for the DTU dataset.

Id | R [°] Mean | R [°] Median | ts [ratio] Mean | ts [ratio] Median | td [°] Mean | td [°] Median
83 | 4.69 | 4.11 | 0.02 | 0.02 | 3.63 | 3.73
97 | 16.44 | 16.73 | 0.11 | 0.08 | 9.54 | 8.72
105 | 4.35 | 4.16 | 0.01 | 0.01 | 2.57 | 2.37
106 | 6.20 | 5.08 | 0.02 | 0.02 | 1.78 | 1.84
110 | 4.59 | 3.43 | 0.02 | 0.01 | 0.68 | 0.68
114 | 3.13 | 3.12 | 0.02 | 0.02 | 0.78 | 0.66
118 | 7.17 | 6.34 | 0.03 | 0.02 | 5.55 | 4.89
122 | 9.76 | 8.74 | 0.03 | 0.03 | 6.23 | 5.43
Concerning the results themselves, we observe that we obtain consistent generalization from our training data to our unseen test classes and more realistic object scans. For many objects, rotation error is around 8◦ on average, with a substantially lower median error, and we observe a similar error distribution for validation and test instances (Fig. 4). We also observe generally low translation errors, and that our confidence ranking is consistently able to achieve rotation errors within a few degrees of the oracle. This ranking is also on par with the oracle in the IOU metrics. From the IOU metrics, we also see that our rotation estimates are generally high quality, achieving 80–90% IOU for nearly all test cases. Translation fares slightly worse, especially for the direction estimate, for which the IOU metric is very sensitive. We observe lower IOUs for both rotation and translation (rightmost column), which is expected since it reflects the full network output. See the supplementary for additional RealScan visualizations. Qualitatively, our network understandably performs worse for objects with rotational symmetry, for example the 3D Warehouse lamps and the RealScan plastic cup. In the latter case, while the cup has a handle, this handle is not always visible in the input images, and so an unambiguous pose estimate cannot be determined. IOU is also a conservative metric for pose estimation, especially for thin structures like the RealScan eyeglasses and spoon, because even with a perfect rotation estimate, a small amount of translation error can cause the reprojection to shift considerably.
Fig. 3. Estimated poses for 10 silhouette masks of an airplane. Left: Object reprojections by our cameras (red) versus the original input masks (green), ordered from greatest confidence (top left) to least (bottom right). Middle: Visual hull projection for our method (yellow) versus the original input masks (green), with the same ordering. Right: Predicted poses (blue frusta) relative to GT poses (red frusta), with the target object mesh in red. (Color figure online)
Fig. 4. Histogram of rotation errors for all test instances of airplanes (left, training class) and cars (right, unseen class).
Fig. 5. Predicted versus ground-truth two-view angular distances for a validation class instance (airplane, left two columns) and an unseen class instance (car, right two columns), both from top-down viewpoints. Each point represents an image pair from 100 total images of the object, colored according to the image in the pair with the higher uncertainty (vMF angular spread at 95% of the vMF CDF). Top row: Example silhouettes. Second row: (1, 3) Result from running our network on 2000 samples of 10 images, showing the average error and lowest confidence per pair. (2, 4) Result from running our network once with all 100 images as input.
As for real-world datasets, results for the DTU and Glass Figurines datasets are presented in Tables 3 and 4, respectively. Different from the previous experiments, we provided our network with all 15 DTU images as input. We observe low rotation and translation errors for the majority of the objects in both datasets.
Comparison to Reconstruction Methods. For the Glass Figurines dataset, we also compare our method to GNeRF [34], a recent deep method that can optimize per-scene camera poses without accurate initialization. While GNeRF is built for color images, the authors also showed an example result on silhouettes. We evaluate both masked color images and silhouettes in Table 4. For this dataset, GNeRF performs much worse in pose estimation. This is understandable due to the limited size (10 images) and pose distribution of each image set. The NeRF is accordingly unable to generalize to novel viewpoints, especially for silhouettes where cross-view occupancy constraints must be leveraged. GNeRF also must train on a single image set at a time and takes hours to converge. In contrast, our network runs in a single pass without any additional training. We also note in Table 4 whether COLMAP’s SfM algorithm [48] could process the masked color images. When COLMAP succeeded, its poses tended to be very accurate (see supplementary). However, due to the lack of a consistent object appearance or background, COLMAP failed to reconstruct 5 of the 11 objects.

Pairwise Angular Distances. We conducted a small experiment to compare our method against the method of Littwin et al. [31], which is the only method we are aware of that can jointly estimate multi-view camera poses (albeit only relative rotations) for a collection of casually captured object silhouettes. We unfortunately were unable to obtain a copy of their implementation or data, and so we instead provide a qualitative comparison of Fig. 5 versus Fig. 2 in [31]. In the first and third graphs in Fig. 5, we have rendered 100 images of an object and from this sampled 2000 sets of 10 images. We plot the average estimated angular distance over all samples in which that pair appeared together, and we compare this to the ground-truth distance. Compared to [31], our estimates are much less noisy, and they match the ground truth with at least as much accuracy as [31] for a novel class. We also show our confidence estimates in Fig. 5 and observe that they correlate to prediction accuracy and ground-truth distance, with nearer relative poses having higher confidence. Although confidence κ (Eq. (1)) is difficult to interpret directly in our loss formulation, we provide a rough sense of its scale by converting to an angular “spread” of the vMF distribution. Specifically, we consider the vMF CDF and, for a given value of κ, compute the angle arccos(r_i^T x) that covers 95% of the distribution over the surface of the sphere. Put more simply, a darker color in the plot indicates a tighter distribution and higher confidence.

Many Network Inputs. As evidenced by our DTU experiments, our network generalizes to more inputs than it was trained on. In the second and fourth graphs in Fig. 5, we take this to the extreme and provide our network with all 100 views of the object. Surprisingly, our network easily handles this configuration, producing similar error distributions to our 10-image samples. Our confidence predictions also have a qualitatively higher sensitivity in this scenario.

Additional Results. We include a number of experiments in our supplementary, including complete results on the 3D Warehouse, RealScan, and Glass Figurines datasets; images of the RealScan IOU errors; and a visualization of the network satisfying epipolar constraints even for a failure case. We also include
qualitative results on two real-world scenarios of a single object photographed in different environments: (1) a transparent swan sculpture with manually segmented masks, and (2) a chair with masks segmented via Mask R-CNN [15]. The latter case is a promising example of providing automatically extracted masks to our network. Finally, we include three proof-of-concept results of IDR [66] applied to our pose estimates for DTU. These results indicate that our approach has sufficient accuracy to initialize state-of-the-art reconstruction methods.

Table 4. Example images and mean camera pose accuracy for the Glass Figurines dataset. We compare our method to GNeRF [34] with color images and with silhouette inputs ([34]-S), and we note if SfM [48] succeeded or failed for the dataset. GNeRF failures are marked with dashes. See the supplementary for more information.

Object | R [°] ([34] / [34]-S / Ours) | ts [ratio] ([34] / [34]-S / Ours) | td [°] ([34] / [34]-S / Ours)
brown sq. | 18.71 / – / 3.74 | 0.04 / – / 0.03 | 7.12 / – / 2.59
dog (c.) | 14.00 / 18.43 / 6.92 | 0.13 / 0.12 / 0.04 | 3.02 / 2.19 / 4.88
dog (p.) | 17.56 / – / 8.14 | 0.05 / – / 0.02 | 9.36 / – / 7.32
dolphin | 24.70 / – / 5.10 | 0.03 / – / 0.03 | 7.52 / – / 3.01
flamingo | 21.05 / – / 6.13 | 0.10 / – / 0.05 | 2.97 / – / 2.02
flower | – / 17.27 / 3.94 | – / 0.06 / 0.02 | – / 16.79 / 3.33
frog | 21.66 / 15.48 / 3.59 | 0.02 / 0.07 / 0.02 | 2.40 / 11.41 / 0.91
parrot | 27.40 / 11.92 / 21.16 | 0.05 / 0.04 / 0.03 | 7.24 / 8.98 / 1.46
penguin | 20.04 / – / 9.09 | 0.03 / – / 0.04 | 3.38 / – / 1.97
rabbit | 25.32 / 20.75 / 5.38 | 0.02 / 0.05 / 0.02 | 6.38 / 1.96 / 3.21
snake | 29.85 / 24.59 / 11.29 | 0.05 / 0.05 / 0.02 | 7.85 / 4.24 / 1.90

5 Conclusion
The experimental results above support our hypothesis that neural networks can be trained to regress relative pose information, as well as pose confidences, for a given set of silhouette images of an unknown object. Our network model generalizes well to novel object classes and from the synthetic to the real domain. Although we train on a fixed number of 10 images, we observe that our network can capably regress poses for many more inputs at a time. While the benefit of silhouette constraints for pose estimation has long been recognized, our work shows that silhouette cues on their own can effectively initialize pose estimates for state-of-the-art 3D reconstruction methods on untextured objects. Our work also suggests that permutation-equivariant processing may prove to be an invaluable tool in many-view object reconstruction pipelines, and that multi-view reasoning in neural networks (e.g., aggregating features over all inputs in our pipeline) can yield more-robust estimates compared to two-view methods for 6DOF pose regression, particularly if confidence is also captured. One limitation of our current work is its reliance on pre-segmented masks. Since our network takes masks as input, however, it could be integrated into an end-to-end pipeline that starts with an object segmentation network, applies a silhouette-based pose estimation, and then performs additional color-image-based pose refinement. Other future work includes leveraging symmetries and
texture to resolve silhouette ambiguities when they arise. Also, while we show promising results on medium-baseline views, more work is needed to achieve full generalization w.r.t. rotation, e.g., in scenarios where the object is viewed from opposite sides, or where the images have substantial relative roll.
References 1. Billings, G., Johnson-Roberson, M.: SilhoNet: an RGB method for 6d object pose estimation. IEEE Rob. Autom. Lett. 4(4), 3727–3734 (2019) 2. Cai, M., Reid, I.: Reconstruct locally, localize globally: a model free method for object pose estimation. In: Computer Vision and Pattern Recognition (CVPR), pp. 3153–3163 (2020) 3. Cashman, T.J., Fitzgibbon, A.W.: What shape are dolphins? Building 3D morphable models from 2D images. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 35(1), 232–244 (2012) 4. Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8 38 5. Do, T.T., Pham, T., Cai, M., Reid, I.: Real-time monocular object instance 6D pose estimation. In: British Machine Vision Conference (BMVC) (2019) 6. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: Computer Vision and Pattern Recognition (CVPR), pp. 605–613 (2017) 7. Forbes, K., Nicolls, F., de Jager, G., Voigt, A.: Shape-from-silhouette with two mirrors and an uncalibrated camera. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 165–178. Springer, Heidelberg (2006). https:// doi.org/10.1007/11744047 13 8. Forbes, K., Voigt, A., Bodika, N., et al.: Using silhouette consistency constraints to build 3D models. In: Pattern Recongnition Associaton of South Africa (PRASA), pp. 33–38 (2003) 9. Garrido-Jurado, S., Munoz-Salinas, R., Madrid-Cuevas, F.J., Medina-Carnicer, R.: Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recogn. 51, 481–491 (2016) 10. Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 484–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4 29 11. Glover, J., Popovic, S.: Bingham procrustean alignment for object detection in clutter. In: International Conference on Intelligent Robots and Systems, pp. 2158– 2165. IEEE (2013) 12. Govindu, V.M.: Combining two-view constraints for motion estimation. In: Computer Vision and Pattern Recognition (CVPR), vol. 2, p. II. IEEE (2001) 13. Gu, C., Ren, X.: Discriminative mixture-of-templates for viewpoint classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 408–421. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3642-15555-0 30 14. Gwak, J., Choy, C.B., Chandraker, M., Garg, A., Savarese, S.: Weakly supervised 3D reconstruction with adversarial constraint. In: International Conference on 3D Vision (3DV), pp. 263–272. IEEE (2017)
15. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: International Conference on Computer Vision (ICCV), pp. 2961–2969 (2017) 16. Hejrati, M., Ramanan, D.: Analysis by synthesis: 3D object recognition by object reconstruction. In: Computer Vision and Pattern Recognition (CVPR), pp. 2449– 2456 (2014) 17. Hern´ andez, C., Schmitt, F., Cipolla, R.: Silhouette coherence for camera calibration under circular motion. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 29(2), 343–349 (2007) 18. Huang, P.H., Lai, S.H.: Contour-based structure from reflection. In: Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 379–386. IEEE (2006) 19. Huang, P.H., Lai, S.H.: Silhouette-based camera calibration from sparse views under circular motion. In: Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE (2008) 20. Insafutdinov, E., Dosovitskiy, A.: Unsupervised learning of shape and pose with differentiable point clouds. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 2802–2812 (2018) 21. Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view stereopsis evaluation. In: Computer Vision and Pattern Recognition (CVPR), pp. 406–413. IEEE (2014) 22. Jeong, Y., Ahn, S., Choy, C., Anandkumar, A., Cho, M., Park, J.: Self-calibrating neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5846–5854 (2021) 23. Joshi, T., Ahuja, N., Ponce, J.: Structure and motion estimation from dynamic silhouettes under perspective projection. In: International Conference on Computer Vision (ICCV), pp. 290–295. IEEE (1995) 24. Kar, A., Tulsiani, S., Carreira, J., Malik, J.: Category-specific object reconstruction from a single image. In: Computer Vision and Pattern Recognition (CVPR), pp. 1966–1974 (2015) 25. Kasten, Y., Geifman, A., Galun, M., Basri, R.: Algebraic characterization of essential matrices and their averaging in multiview settings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019 26. Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: Computer Vision and Pattern Recognition (CVPR), pp. 3907–3916 (2018) 27. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D making RGBbased 3D detection and 6d pose estimation great again. In: International Conference on Computer Vision (ICCV), pp. 1521–1529 (2017) 28. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 29. Levoy, M., et al.: The digital Michelangelo project: 3D scanning of large statues. In: Conference on Computer Graphics and Interactive Techniques, pp. 131–144 (2000) 30. Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural radiance fields. arXiv preprint arXiv:2104.06405 (2021) 31. Littwin, E., Averbuch-Elor, H., Cohen-Or, D.: Spherical embedding of inlier silhouette dissimilarities. In: Computer Vision and Pattern Recognition (CVPR), pp. 3855–3863 (2015) 32. L´ opez-Sastre, R.J., Tuytelaars, T., Savarese, S.: Deformable part models revisited: a performance evaluation for object category pose estimation. In: International Conference on Computer Vision (ICCV) Workshops, pp. 1052–1059. IEEE (2011) 33. Maron, H., Litany, O., Chechik, G., Fetaya, E.: On learning sets of symmetric elements. In: International Conference on Machine Learning (ICML) (2020)
34. Meng, Q., et al.: GNeRF: GAN-based neural radiance field without posed camera. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021) 35. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: learning 3D reconstruction in function space. In: Computer Vision and Pattern Recognition (CVPR), pp. 4460–4470 (2019) 36. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058452-8 24 37. Moran, D., Koslowsky, H., Kasten, Y., Maron, H., Galun, M., Basri, R.: Deep permutation equivariant structure from motion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5976–5986, October 2021 38. Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: International Symposium on Mixed and Augmented Reality (ISMAR), pp. 127–136. IEEE (2011) 39. Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: learning implicit 3D representations without 3D supervision. In: Computer Vision and Pattern Recognition (CVPR) (2020) 40. Pepik, B., Stark, M., Gehler, P., Schiele, B.: Teaching 3D geometry to deformable part models. In: Computer Vision and Pattern Recognition (CVPR), pp. 3362– 3369. IEEE (2012) 41. Poirson, P., Ammirato, P., Fu, C.Y., Liu, W., Kosecka, J., Berg, A.C.: Fast single shot detection and pose estimation. In: International Conference on 3D Vision (3DV), pp. 676–684. IEEE (2016) 42. Prisacariu, V.A., K¨ ahler, O., Murray, D.W., Reid, I.D.: Simultaneous 3D tracking and reconstruction on a mobile phone. In: International Symposium on Mixed and Augmented Reality (ISMAR), pp. 89–98. IEEE (2013) 43. Rad, M., Lepetit, V.: BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: International Conference on Computer Vision (ICCV), pp. 3828–3836 (2017) 44. Rezende, D.J., Eslami, S., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 29, pp. 4996–5004 (2016) 45. Romero-Ramirez, F.J., Mu˜ noz-Salinas, R., Medina-Carnicer, R.: Speeded up detection of squared fiducial markers. Image Vis. Comput. 76, 38–47 (2018) 46. Schmitt, C., Donne, S., Riegler, G., Koltun, V., Geiger, A.: On joint estimation of pose, geometry and svBRDF from a handheld scanner. In: Computer Vision and Pattern Recognition (CVPR), pp. 3493–3503 (2020) 47. Sch¨ onberger, J.L., Zheng, E., Frahm, J.-M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46487-9 31 48. Sch¨ onberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 49. Sinha, S.N., Pollefeys, M., McMillan, L.: Camera network calibration from dynamic silhouettes. In: Computer Vision and Pattern Recognition (CVPR), vol. 1, p. I. IEEE (2004)
50. Sinha, S.N., Steedly, D., Szeliski, R.: A multi-stage linear approach to structure from motion. In: Kutulakos, K.N. (ed.) ECCV 2010. LNCS, vol. 6554, pp. 267–281. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35740-4 21 51. Song, C., Song, J., Huang, Q.: HybridPose: 6D object pose estimation under hybrid representations. In: Computer Vision and Pattern Recognition (CVPR), pp. 431– 440 (2020) 52. Sra, S.: Directional statistics in machine learning: a brief review. In: Applied Directional Statistics: Modern Methods and Case Studies, p. 225 (2018) 53. Tanskanen, P., Kolev, K., Meier, L., Camposeco, F., Saurer, O., Pollefeys, M.: Live metric 3D reconstruction on mobile phones. In: International Conference on Computer Vision (ICCV), pp. 65–72 (2013) 54. Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6d object pose prediction. In: Computer Vision and Pattern Recognition (CVPR), pp. 292–301 (2018) 55. Tulsiani, S., Efros, A.A., Malik, J.: Multi-view consistency as supervisory signal for learning shape and pose prediction. In: Computer Vision and Pattern Recognition (CVPR), pp. 2897–2905 (2018) 56. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: Computer Vision and Pattern Recognition (CVPR), pp. 2626–2634 (2017) 57. Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A.: NeRF−: neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064 (2021) 58. Wiles, O., Zisserman, A.: SilNet: single-and multi-view reconstruction by learning from silhouettes. In: British Machine Vision Conference (BMVC) (2017) 59. Wong, K.Y., Cipolla, R.: Structure and motion from silhouettes. In: International Conference on Computer Vision (ICCV), vol. 2, pp. 217–222. IEEE (2001) 60. Wong, K.Y., Cipolla, R.: Reconstruction of sculpture from its profiles with unknown camera positions. IEEE Trans. Image Process. 13(3), 381–389 (2004) 61. Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, B., Tenenbaum, J.: MarrNet: 3D shape reconstruction via 2.5D sketches. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 540–550 (2017) 62. Wu, J., et al.: Single image 3D interpreter network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 365–382. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4 22 63. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017) 64. Xiao, Y., Qiu, X., Langlois, P.A., Aubry, M., Marlet, R.: Pose from shape: deep pose estimation for arbitrary 3D objects. In: British Machine Vision Conference (BMVC) (2019) 65. Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 1696–1704 (2016) 66. Yariv, L., et al.: Multiview neural surface reconstruction by disentangling geometry and appearance. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33 (2020) 67. Yen-Chen, L., Florence, P., Barron, J.T., Rodriguez, A., Isola, P., Lin, T.Y.: iNeRF: inverting neural radiance fields for pose estimation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
68. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in neural information processing systems (NeurIPS), pp. 3391–3401 (2017) 69. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: Computer Vision and Pattern Recognition (CVPR) (2019) 70. Zhu, R., Kiani Galoogahi, H., Wang, C., Lucey, S.: Rethinking reprojection: closing the loop for pose-aware shape reconstruction from a single image. In: International Conference on Computer Vision (ICCV), pp. 57–65 (2017)
TransNet: Category-Level Transparent Object Pose Estimation

Huijie Zhang(B), Anthony Opipari, Xiaotong Chen, Jiyue Zhu, Zeren Yu, and Odest Chadwicke Jenkins

University of Michigan, Ann Arbor, MI 48109, USA
{huijiezh,topipari,cxt,cormaczh,yuzeren,ocj}@umich.edu
Abstract. Transparent objects present multiple distinct challenges to visual perception systems. First, their lack of distinguishing visual features makes transparent objects harder to detect and localize than opaque objects. Even humans find certain transparent surfaces with little specular reflection or refraction, e.g. glass doors, difficult to perceive. A second challenge is that common depth sensors typically used for opaque object perception cannot obtain accurate depth measurements on transparent objects due to their unique reflective properties. Stemming from these challenges, we observe that transparent object instances within the same category (e.g. cups) look more similar to each other than to ordinary opaque objects of that same category. Given this observation, the present paper sets out to explore the possibility of category-level transparent object pose estimation rather than instance-level pose estimation. We propose TransNet, a two-stage pipeline that learns to estimate category-level transparent object pose using localized depth completion and surface normal estimation. TransNet is evaluated in terms of pose estimation accuracy on a recent, large-scale transparent object dataset and compared to a state-of-the-art category-level pose estimation approach. Results from this comparison demonstrate that TransNet achieves improved pose estimation accuracy on transparent objects and key findings from the included ablation studies suggest future directions for performance improvements. The project webpage is available at: https://progress.eecs.umich.edu/projects/transnet/. Keywords: Transparent objects · Category-level object pose estimation · Depth completion · Surface normal estimation
1
Introduction
From glass doors and windows to kitchenware and all kinds of containers, transparent materials are prevalent throughout daily life. Thus, perceiving the pose (position and orientation) of transparent objects is a crucial capability for autonomous perception systems seeking to interact with their environment. However, transparent objects present unique perception challenges both in the RGB and depth domains. As shown in Fig. 2, for RGB, the color appearance c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Karlinsky et al. (Eds.): ECCV 2022 Workshops, LNCS 13808, pp. 148–164, 2023. https://doi.org/10.1007/978-3-031-25085-9_9
TransNet: Category-Level Transparent Object Pose Estimation
149
Fig. 1. Overview of TransNet, a pipeline for category-level transparent object pose estimation. Given instance-level segmentation masks as input, TransNet estimates the 6 degrees of freedom pose and scale for each transparent object in the image. Internally, TransNet uses surface normal estimation, depth completion, and a transformer-based architecture for accurate pose estimation despite noisy sensor data.
of transparent objects is highly dependent on the background, viewing angle, material, lighting condition, etc. due to light reflection and refraction effects. For depth, common commercially available depth sensors record mostly invalid or inaccurate depth values within the region of transparency. Such visual challenges, especially missing detection in the depth domain, pose severe problems for autonomous object manipulation and obstacle avoidance tasks. This paper sets out to address these problems by studying how category-level transparent object pose estimation may be achieved using end-to-end learning. Recent works have shown promising results on grasping transparent objects by completing the missing depth values followed by the use of a geometry-based grasp engine [9,12,29], or transfer learning from RGB-based grasping neural networks [36]. For more advanced manipulation tasks such as rigid body pickand-place or liquid pouring, geometry-based estimations, such as symmetrical axes, edges [27] or object poses [26], are required to model the manipulation trajectories. Instance-level transparent object poses could be estimated from keypoints on stereo RGB images [23,24] or directly from a single RGB-D image [38] with support plane assumptions. Recently emerged large-scale transparent object datasets [6,9,23,29,39] pave the way for addressing the problem using deep learning. In this work, we aim to extend the frontier of 3D transparent object perception with three primary contributions. – First, we explore the importance of depth completion and surface normal estimation in transparent object pose estimation. Results from these studies indicate the relative importance of each modality and their analysis suggests promising directions for follow-on studies. – Second, we introduce TransNet, a category-level pose estimation pipeline for transparent objects as illustrated in Fig. 1. It utilizes surface normal estimation, depth completion, and a transformer-based architecture to estimate transparent objects’ 6D poses and scales.
150
H. Zhang et al.
– Third, we demonstrate that TransNet outperforms a baseline that uses a state-of-the-art opaque object pose estimation approach [7] along with transparent object depth completion [9].
Fig. 2. Challenge for transparent object perception. Images are from Clearpose dataset [6]. The left is an RGB image. The top right is the raw depth image and the bottom right is the ground truth depth image.
2 2.1
Related Works Transparent Object Visual Perception for Manipulation
Transparent objects need to be perceived before being manipulated. Lai et al. [18] and Khaing et al. [16] developed CNN models to detect transparent objects from RGB images. Xie et al. [37] proposed a deep segmentation model that achieved state-of-the-art segmentation accuracy. ClearGrasp [29] employed depth completion for use with pose estimation on robotic grasping tasks, where they trained three DeepLabv3+ [4] models to perform image segmentation, surface normal estimation, and boundary segmentation. Follow-on studies developed different approaches for depth completion, including implicit functions [47], NeRF features [12], combined point cloud and depth features [39], adversarial learning [30], multi-view geometry [1], and RGB image completion [9]. Without completing depth, Weng et al. [36] proposed a method to transfer the learned grasping policy from the RGB domain to the raw sensor depth domain. For instance-level pose estimation, Xu et al. [38] utilized segmentation, surface normal, and image coordinate UV-map as input to a network similar to [32] that can estimate 6 DOF object pose. Keypose [24] was proposed to estimate 2D keypoints and regress object poses from stereo images using triangulation. For other special sensors, Xu et al. [40] used light-field images to do segmentation using a graphcut-based approach. Kalra et al. [15] trained Mask R-CNN [11] using polarization images as input to outperform the baseline that was trained on only RGB
TransNet: Category-Level Transparent Object Pose Estimation
151
images by a large margin. Zhou et al. [44–46] employed light-field images to learn features for robotic grasping and object pose estimation. Along with the proposed methods, massive datasets, across different sensors and both synthetic and real-world domains, have been collected and made public for various related tasks [6,9,15,23,24,29,37,39,44,47]. Compared with these previous works, and to the best of our knowledge we propose the first category-level pose estimation approach for transparent objects. Notably, the proposed approach provides reliable 6D pose and scale estimates across instances with similar shapes. 2.2
Opaque Object Category-Level Pose Estimation
Category-level object pose estimation is aimed at estimating unseen objects’ 6D pose within seen categories, together with their scales or canonical shape. To the best of our knowledge, there is not currently any category-level pose estimation works focusing on transparent objects, and the works mentioned below mostly consider opaque objects. They won’t work well for transparency due to their dependence on accurate depth. Wang et al. [35] introduced the Normalized Object Coordinate Space (NOCS) for dense 3D correspondence learning, and used the Umeyama algorithm [33] to solve the object pose and scale. They also contributed both a synthetic and a real dataset used extensively by the following works for benchmarking. Later, Li et al. [19] extended the idea towards articulated objects. To simultaneously reconstruct the canonical point cloud and estimate the pose, Chen et al. [2] proposed a method based on canonical shape space (CASS). Tian et al. [31] learned category-specific shape priors from an autoencoder, and demonstrated its power for pose estimation and shape completion. 6D-ViT [48] and ACR-Pose [8] extended this idea by utilizing pyramid visual transformer (PVT) and generative adversarial network (GAN) [10] respectively. Structure-guided prior adaptation (SGPA) [3] utilized a transformer architecture for a dynamic shape prior adaptation. Other than learning a dense correspondence, FS-Net [5] regressed the pose parameters directly, and it proposed to learn two orthogonal axes for 3D orientation. Also, it contributed to an efficient data augmentation process for depth-only approaches. GPV-Pose [7] further improved FS-Net by adding a geometric consistency loss between 3D bounding boxes, reconstruction, and pose. Also with depth as the only input, category-level point pair feature (CPPF) [42] could reduce the sim-to-real gap by learning deep point pairs features. DualPoseNet [20] benefited from rotation-invariant embedding for category-level pose estimation. Differing from other works using segmentation networks to crop image patches as the first stage, CenterSnap [13] presented a single-stage approach for the prediction of 3D shape, 6D pose, and size. Compared with opaque objects, we find the main challenge to perceive transparent objects is the poor quality of input depth. Thus, the proposed TransNet takes inspiration from the above category-level pose estimation works regarding feature embedding and architecture design. More specifically, TransNet leverages both Pointformer from PVT and the pose decoder from FS-Net and GPV-Pose. In the following section, the TransNet architecture is described, focusing on how to integrate the single-view depth completion module and utilize imperfect depth predictions to learn pose estimates of transparent objects.
152
3
H. Zhang et al.
TransNet
Fig. 3. Architecture for TransNet. TransNet is a two-stage deep neural network for category-level transparent object pose estimation. The first stage uses an object instance segmentation (from Mask R-CNN [11], which is not included in the diagram) to generate patches of RGB-D then used as input to a depth completion and a surface normal estimation network (RGB only). The second stage uses randomly sampled pixels within the objects’ segmentation mask to generate a generalized point cloud formed as the per-pixel concatenation of ray direction, RGB, surface normal, and completed depth features. Pointformer [48], a transformer-based point cloud embedding architecture, transforms the generalized point cloud into high-dimensional features. A concatenation of embedding features, global features, and a one-hot category label (from Mask R-CNN) is provided for the pose estimation module. The pose estimation module is composed of four decoders, one each for translation, x-axis, z-axis, and scale regression respectively. Finally, the estimated object pose is recovered and returned as output.
Given an input RGB-D pair (I, D), our goal is to predict objects’ 6D rigid body transformations [R|t] and 3D scales s in the camera coordinate frame, where R ∈ SO(3), t ∈ R3 and s ∈ R3+ . In this problem, inaccurate/invalid depth readings exist within the image region corresponding to transparent objects (represented as a binary mask Mt ). To approach the category-level pose estimation problem along with inaccurate depth input, we propose a novel two-stage deep neural network pipeline, called TransNet. 3.1
Architecture Overview
Following recent work in object pose estimation [5,7,34], we first apply a pretrained instance segmentation module (Mask R-CNN [11]) that has been finetuned on the pose estimation dataset to extract the objects’ bounding box
TransNet: Category-Level Transparent Object Pose Estimation
153
patches, masks, and category labels to separate the objects of interest from the entire image. The first stage of TransNet takes the patches as input and attempts to correct the inaccurate depth posed by transparent objects. Depth completion (TransCG [9]) and surface normal estimation (U-Net [28]) are applied on RGB-D patches to obtain estimated depth-normal pairs. The estimated depth-normal pairs, together with RGB and ray direction patches, are concatenated to feature patches, followed by a random sampling strategy within the instance masks to generate generalized point cloud features. In the second stage of TransNet, the generalized point cloud is processed through Pointformer [48], a transformer-based point cloud embedding module, to produce concatenated feature vectors. The pose is then separately estimated in four decoder modules for object translation, x-axis, z-axis, and scale respectively. The estimated rotation matrix can be recovered using the estimated two axes. Each component is discussed in more detail in the following sections. 3.2
Object Instance Segmentation
Similar to other categorical pose estimation work [7], we train a Mask R-CNN [11] model on the same dataset used for pose estimation to obtain the object’s bounding box B, mask M and category label Hc . Patches of ray direction RB , RGB IB and raw depth DB are extracted from the original data source following bounding box B, before inputting to the first stage of TransNet. 3.3
Transparent Object Depth Completion
Due to light reflection and refraction on transparent material, the depth of transparent objects is very noisy. Therefore, depth completion is necessary to reduce the sensor noise. Given the raw RGB-D patch (IB , DB ) pair and transparent mask Mt (a intersection of transparent objects’ masks within bounding box B), transparent object depth completion FD is applied to obtain the completed ˆ (i,j) |(i, j) ∈ Mt }. depth of the transparent region {D Inspired by one state-of-the-art depth completion method, TransCG [9], we incorporate a similar multi-scale depth completion architecture into TransNet. ˆ B = FD (IB , DB ) D
(1)
We use the same training loss as TransCG: L = Ld + λsmooth Ls 2 1 ˆ Ld = Dp − Dp∗ Np p∈Mt B 1 ˆ p ), N (D∗ ) 1 − cos N (D Ls = p Np p∈Mt
B
(2)
154
H. Zhang et al.
where D∗ is the ground truth depth image patch, p ∈ Mt B represents the transparent region in the patch, · , · denotes the dot product operator and N (·) denotes the operator to calculate surface normal from depth. Ld is L2 distance between estimated and ground truth depth within the transparency mask. Ls is the cosine similarity between surface normal calculated from estimated and ground truth depth. λsmooth is the weight between the two losses. 3.4
Transparent Object Surface Normal Estimation
Surface normal estimation FSN estimates surface normal SB from RGB image IB . Although previous category-level pose estimation works [5,7] show that depth is enough to obtain opaque objects’ pose, experiments in Sect. 4.3 demonstrate that surface normal is not a redundant input for transparent object pose estimation. Here, we slightly modify U-Net [28] to perform the surface normal estimation. SˆB = FSN (IB ) (3) We use the cosine similarity loss: 1 1 − cos Sˆp , Sp∗ L= Np
(4)
p∈B
where p ∈ B means the loss is applied for all pixels in the bounding box B. 3.5
Generalized Point Cloud
As input to the second stage, generalized point cloud P ∈ RN ×d is a stack of d-dimensional features from the first stage taken at N sample points, inspired from [38]. To be more specific, d = 10 in our work. Given the completed depth ˆ B and predicted surface normal SˆB from Eq. (1), (3), together with RGB D patch IB and ray
direction patch RB , a concatenated feature patch is given as ˆ ˆ IB , DB , SB , RB ∈ RH×W ×10 . Here the ray direction R represents the direction from camera origin to each pixel in the camera frame. For each pixel (u, v): T p= uv1 R=
K −1 p
(5)
2 K −1 p
where p is the homogeneous UV coordinate in the image plane and K is the camera intrinsic. The UV mapping itself is an important cue when estimating poses from patches [14], as it provides information about the relative position and size of the patches within the overall image. We use ray direction instead of UV mapping because it also contains camera intrinsic information. We randomly sample N pixels within the transparent mask of the feature patch to obtain the generalized point cloud P ∈ RN ×10 . A more detailed experiment in Sect. 4.3 explores the best choice of the generalized point cloud.
TransNet: Category-Level Transparent Object Pose Estimation
3.6
155
Transformer Feature Embedding
Given generalized point cloud P, we apply an encoder and multi-head decoder strategy to get objects’ poses and scales. We use Pointformer [48], a multi-stage transformer-based point cloud embedding method: Pemb = FP F (P)
(6)
where Pemb ∈ RN ×demb is a high-dimensional feature embedding. During our experiments, we considered other common point cloud embedding methods such as 3D-GCN [21] demonstrating their power in many category-level pose estimation methods [5,7]. During feature aggregation for each point, they use the nearest neighbor algorithm to search nearby points within coordinate space, then calculate new features as a weighted sum of the features within surrounding points. Due ˆ from Eq. (1), the nearest neighbor may become unreliable by to the noisy input D producing noisy feature embeddings. On the other hand, Pointformer aggregates feature by a transformer-based method. The gradient back-propagates through the whole point cloud. More comparisons and discussions in Sect. 4.2 demonstrate that transformer-based embedding methods are more stable than nearest neighborbased methods when both are trained on noisy depth data. Then we use a Point Pooling layer (a multilayer perceptron (MLP) plus max-pooling) to extract the global feature Pglobal , and concatenate it with local feature Pemb and the one-hot category Hc label from instance segmentation for the decoder: Pglobal = MaxPool (MLP (Pemb )) Pconcat = [Pemb , Pglobal , Hc ] 3.7
(7)
Pose and Scale Estimation
After we extract the feature embeddings from multi-modal input, we apply four separate decoders for translation, x-axis, z-axis, and scale estimation. Translation Residual Estimation. As demonstrated in [5], residual estimation achieves better performance than direct regression by learning the distribution of the residual between the prior and actual value. The translation decoder Ft learns a 3D translation residual from the object translation prior tprior calculated as the average of predicted 3D coordinate over the sampled pixels in P. To be more specific: 1 −1 T tprior = K [up vp 1] Dˆp Np (8) p∈N tˆ = tprior + Ft ([Pconcat , P]) where K is the camera intrinsic and up , vp are the 2D pixel coordinate for the selected pixel. We also use the L1 loss between the ground truth and estimated position:
(9) Lt = tˆ − t∗
156
H. Zhang et al.
Pose Estimation. Similar to [5], rather than directly regress the rotation matrix R, it is more effective to decouple it into two orthogonal axes and estimate them separately. As shown in Fig. 3, we decouple R into the z-axis az (red axis) and x-axis ax (green axis). Following the strategy of confidence learning in [7], the network learns confidence values to deal with the problem that the regressed two axes are not orthogonal: [ˆ ai , ci ] = Fi (Pconcat ) , i ∈ {x, z} π cx θ− θz = cx + cz 2 π cz θ− θx = cx + cz 2
(10)
where cx , cz denote the confidence for the learned axes. θ represents the angle between ax and az . θx , θz are obtained by solving an optimization problem and then used to rotate the ax and az within their common plane. More details can be found in [7]. For the training loss, first, we use L1 loss and cosine similarity loss for axis estimation: Lri = |ˆ ai − a∗i | + 1 − ˆ ai , a∗i , i ∈ {x, z}
(11)
Then to constrain the perpendicular relationship between two axes, we add the angular loss: La = ˆ ax , a ˆz
(12)
To learn the axis confidence, we add the confidence loss, which is the L1 distance between estimated confidence and exponential L2 distance between the ground truth and estimated axis: Lconi = |ci − exp (α ˆ ai − a∗i 2 )| , i ∈ {x, z}
(13)
where α is a constant to scale the distance. Thus the overall loss for the second stage is: L = λs Ls + λt Lt + λrx Lrx + λrz Lrz + λra La + λconx Lconx + λconz Lconz
(14)
To deal with object symmetry, we apply specific treatments for different symmetry types. For axial symmetric objects (those that remain the same shape when rotating around one axis), we ignore the loss for the x-axis, i.e., Lconx , Lrx . For planar symmetric objects (those that remain the same shape when mirrored about one or more planes), we generate all candidate x-axis rotations. For example, for an object symmetric about the x − z plane and y − z plane, rotating the x-axis about the z-axis by π radians will not affect the object’s shape. The new x-axis is denoted as axπ and the loss for the x-axis is defined as the minimum loss of both candidates: Lx = min (Lx (ax ), Lx (axπ ))
(15)
TransNet: Category-Level Transparent Object Pose Estimation
157
Scale Residual Estimation. Similar to the translation decoder, we define the scale prior sprior as the average of scales of all object 3D CAD models within each category. Then the scale of a given instance is calculated as follows: sˆ = sprior + Fs (Pconcat )
(16)
The loss function is defined as the L1 loss between the ground truth scale and estimated scale: s − s∗ | Ls = |ˆ
4
(17)
Experiments
Dataset. We evaluated TransNet and baseline models on the Clearpose Dataset [6] for categorical transparent object pose estimation. The Clearpose Dataset contains over 350K real-world labeled RGB-D frames in 51 scenes, 9 sets, and around 5M instance annotations covering 63 household objects. We selected 47 objects and categorize them into 6 categories, bottle, bowl, container, tableware, water cup, wine cup. We used all the scenes in set2, set4, set5, and set6 for training and scenes in set3 and set7 for validation and testing. The division guaranteed that there were some unseen objects for testing within each category. Overall, we used 190K images for training and 6K for testing. For training depth completion and surface normal estimation, we used the same dataset split. Implementation Details. Our model was trained in several stages. For all the experiments in this paper, we were using the ground truth instance segmentation as input, which could also be obtained by Mask R-CNN [11]. The image patches were generated from object bounding boxes and re-scaled to a fixed shape of 256 × 256 pixels. For TransCG, we used AdamW optimizer [25] for training with λsmooth = 0.001 and the overall learning rate is 0.001 to train the model till converge. For U-Net, we used the Adam optimizer [17] with a learning rate of 1e−4 to train the model until convergence. For both surface normal estimation and depth completion, the batch size was set to 24 images. The surface normal estimation and depth completion model were frozen during the training of the second stage. For the second stage, the training hyperparameters for Pointformer followed those used in [48]. We used data augmentation for RGB features and instance mask for sampling generalized point cloud. A batch size of 18 was used. To balance sampling distribution across categories, 3 instance samples were selected randomly for each of 6 categories. We followed GPV-Pose [7] on training hyperparameters. The learning rate for all loss terms were kept the same during training, {λrx , λrz , λra , λt , λs , λconx , λconz } = {8, 8, 4, 8, 8, 1, 1} × 0.0001. We used the Ranger optimizer [22,41,43] and used a linear warm-up for the first 1000 iterations, then used a cosine annealing method at the 0.72 anneal point. All the experiments for pose estimation were trained on a 16G RTX3080 GPU for 30 epochs with 6000 iterations each. All the categories were trained on the same model, instead of one model per category.
158
H. Zhang et al.
Evaluation Metrics. For category-level pose estimation, we followed [5,7] using 3D intersection over union (IoU) between the ground truth and estimated 3D bounding box (we used the estimated scale and pose to draw an estimated 3D bounding box) at 25%, 50% and 75% thresholds. Additionally, we used 5◦ 2 cm, 5◦ 5 cm, 10◦ 5 cm, 10◦ 10 cm as metrics. The numbers in the metrics represent the percentage of the estimations with errors under such degree and distance. For Sect. 4.4, we also used separated translation and rotation metrics: 2 cm, 5 cm, 10 cm, 5◦ , 10◦ that calculate percentage with respect to one factor. For depth completion evaluation, we calculated the root of mean squared error (RMSE), absolute relative error (REL) and mean absolute error (MAE), and used δ1.05 , δ1.10 , δ1.25 as metrics, while δn was calculated as: ˆ p Dp∗ D 1 δn = 0. In order to make fθ (.) domain robust, the focus of our method is to leverage fθ˜(.) to obtain an augmented feature (positive) fθ˜a (xdy=c ) corresponding to fθ (xdy=c ), and pull the embeddings fθ (xdy=c ) and fθ˜a (xdy=c ) closer to each other. We obtain fθ˜a (xdy=c ) in such a way that it could capture a broad notion of the semantics of the class y = c, to which xdy=c belongs. To do so, we main d d d =d tain a pool/set Qcd + = d =d Q(xy=c ), such that Q(xy=c ) = {fθ˜(xy=c )} is a set of embeddings of examples from the same class as xdy=c , but from domain d = d. The detailed blocks of the Augmentation Network from Fig. 2 are shown in the Fig. 3. The Augmentation Network takes as input the query embedding a d fθ (xdy=c ) and the embeddings of Qcd + , to produce fθ˜ (xy=c ). It consists of two major components: i) gated fusion, and ii) attention network, which we discuss next. 2.1
Gated Fusion for Representative Refinement
Qcd + is treated as a set of positive representatives which take the responsibility =d of refining their own initial representations (via the Key Encoder) {fθ˜(xdy=c )}
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
171
Fig. 4. The Gated Fusion Network takes one (irresponsible) positive representative d =d fθ˜(xy=c ) at a time from Qcd + , and produces a fused/refined (responsible) positive representative.
Fig. 5. The Attention Network takes the fused/refined positive representatives to compute attention weights of the query embedding against them, and forms an augmented feature/positive, corresponding to the query.
r d =d into Qcdr + = {fθ˜ (xy=c )} (we consolidate/abuse the notation by using a single set of same class examples from other domains, to denote the union of sets of examples). This refinement is done by taking into account the representation fθ (xdy=c ), using a gated fusion mechanism (Fig. 4). By considering one represen =d r d =d tative fθ˜(xdy=c ) from Qcd + at a time, the refined/fused representative fθ˜ (xy=c ) is obtained as:
=d =d fθ˜r (xdy=c ) = z tanh(fθ (xdy=c )) + (1 − z) tanh(fθ˜(xdy=c ))
(1)
=d Here, z is a learnable gate vector obtained as: z = σ(fθg ([fθ (xdy=c ), fθ˜(xdy=c )])), such that θg represents parameters of the gated fusion network. Also, , σ(.), tanh(.) and [, ] are the element-wise product, sigmoid, tanh and concatenation operations respectively.
2.2
Attention-Based Query Embedding Refinement and Augmentation for Positive Generation
are then attended by fθ (xdy=c ), one repreThe refined features from Qcdr + =d sentative at a time (Fig. 5), to obtain attention weights: {w(fθ˜r (xdy=c )) = d r d =d sof tmax(fφ1 (fθ (xy=c )) fφ1 (fθ˜ (xy=c )))}. Here, sof tmax(.) is used to normalize the attention weights across the representatives. Using these weights, a linear
172
U. K. Dutta
combination of the (refined) representatives in Qcdr + is performed to obtain a sin r d =d =d gle positive representative py=c = w(fθ˜ (xy=c ))fφ1 (fθ˜r (xdy=c )), which is then used to obtain the final augmented feature as: fθ˜a (xdy=c ) = relu(fθ (xdy=c )
(2)
+ fφ3 (relu(fφ2 ([fφ1 (fθ (xdy=c )), py=c ])))).
fθ˜a (xdy=c ) now captures the accumulated semantic information of the class y = c from across all the domains, and serves as a positive for the query embedding. Here, θa = {φ1 , φ2 , φ3 } represents parameters of the attention network. 2.3
Student-Teacher Based Contrastive Learning
The Student-Teacher framework arises in our method due to the fact that the Key Encoder serves as a Teacher to provide a positive embedding (as a target/guidance), with respect to which the query/anchor embedding fθ (xdy=c ) obtained by the (Student/Query) Encoder needs to be pulled closer. The notions of (anchor/query, positive/key) are quite popular in the embedding learning literature, where, in order to group similar examples, triplets of examples in the form of (anchor, positive, negative) are sampled. Usually, the anchor and positive are semantically similar (and thus, need to be pulled closer), while the negative is dissimilar to the other two (and thus, needs to be pushed away). Now, although we have obtained an augmented feature to serve as the positive for the query encoding fθ (xdy=c ), we also need to push away embeddings of dissimilar examples, in order to learn a robust, yet semantically discriminative embedding. For this purpose, corresponding to each xdy=c , we also main d tain a pool/set of negatives denoted as Qcd − = ∀d :d =d,d =d { c =c Q(xy=c )} = {fθ˜(xy=c )} (abusing the notation to concisely represent the entire pool of negatives as a single set). Essentially, this pool contains embeddings of example images from other classes, across all the domains (including the same domain as xdy=c ). It should be carefully noted that the embeddings in the pools Qcd + and cd Q− are obtained using the Key Encoder fθ˜(.). Keeping in mind the computational aspects, we maintain a dynamically updated queue Qc d ≈ Q(xdy=c ) of fixed size queue sz, for each class c , from across all domains d . While training, for each considered example xdy=c , the cd c d pools Qcd . + and Q− are obtained using union from the collection of queues Q During the network updates, the examples present in a mini-batch are used to replace old examples from the collection Qc d . Now, for a given xdy=c , in order to pull its embedding closer to the augmented feature fθ˜a (xdy=c ) while moving away the negatives from Qcd − , we compute an adapted Normalized Temperature-scaled cross entropy (NT-Xent) loss [3] as follows: LCL (xdy=c ) exp(fθ (xdy=c ) fθ˜a (xdy=c )/τ )
= −log
d f ˜(xy=c )∈Qcd exp(fθ (xy=c ) fθ˜(xy=c )/τ ) θ
−
.
(3)
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
173
Algorithm 1. Pseudocode of RCERM 1 2 3 4
d e f enQ deQ ( queue , data emb ) :# enqueue+dequeue queue=t o r c h . c a t ( ( queue , data emb ) , 0 ) i f queue . s i z e ( 0 ) > q u e u e s z : queue = queue [− q u e u e s z : ]
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
’ ’ ’ I n a l g o r i t h m s . py ’ ’ ’ c l a s s RCERM( A l g o r i t h m ) : def init ( s e l f ) :# i n i t i a l i z e n e t w o r k s # featurizer , c l a s s i f i e r , f gated , f attn k e y e n c o d e r=copy . deepcopy ( f e a t u r i z e r ) # optim . Adam( f e a t u r i z e r , c l a s s i f i e r , f g a t e d , f a t t n ) d e f update ( s e l f , m i n i b a t c h ) : # m i n i b a t c h : domainbed s t y l e d m i n i b a t c h l o s s c l , a l l x , a l l y =0 ,None , None f o r i d c in range ( n c l a s s ) : f o r i d d i n r a n g e ( ndomains ) : #d a t a t e n s o r : m i n i b a t c h ( i d c , i d d ) q = featurizer ( data tensor ) a l l x = torch . cat ( ( a l l x , q ) , 0) a l l y = torch . cat ( ( a l l y , l a b e l s ) , 0) pos Q , neg Q=g e t p o s n e g q u e u e s ( i d c , i d d ) # Using f g a t e d+f a t t n o b t a i n : k= s e l f . g e t a u g m e n t e d b a t c h ( q , pos Q ) # . d e t a c h ( ) k and l 2 n o r m a l i z e q , k l o s s c l += l o s s f u n c ( q , k , neg Q ) data emb=k e y e n c o d e r ( d a t a t e n s o r ) #. d e t a c h ( )+l 2 n o r m a l i z e data emb enQ deQ ( q u e u e s [ i d c ] [ i d d ] , data emb ) a l l p r e d= c l a s s i f i e r ( a l l x ) l o s s c e=F . c r o s s e n t r o p y ( a l l p r e d , a l l y ) l o s s = l o s s c e+lambda ∗ l o s s c l optim . z e r o g r a d ( ) l o s s . backward ( ) optim . s t e p ( ) f o r t h t i l d , t h t i n ( k e y e n c o d e r . params ( ) , f e a t u r i z e r . params ( ) ) : t h t i l d=mu∗ t h t i l d + ( 1 − mu) ∗ t h t
Here, τ > 0 denotes the temperature parameter. LCL (xdy=c ) in (3) is summed over all the examples from across all classes c and all domains d to obtain the following aggregated contrastive loss: LCL (xdy=c ). (4) LCL = c
d xd y=c
Now, assuming that we have a classifier gφ (.) that predicts the class label of an example xdy=c , using the embedding obtained by the domain robust Query Encoder fθ (.), the Empirical Risk Minimization (ERM) can be approximated by minimizing the following loss: lCE (gφ (fθ (xdy=c )), y = c). (5) LCE = d
c
Here, lCE (.) is the standard cross-entropy loss for classification. Then, the total loss for our method can be expressed as: Ltotal = LCE + λLCL .
(6)
174
U. K. Dutta
Here, λ > 0 is a hyperparameter. Thus, the overall optimization problem of our method can be expressed as: min Ltotal .
θ,φ,θg ,θa
(7)
Here, θ, φ, θg , θa respectively denote the parameters of the (Query) Encoder fθ (.), classifier gφ (.), gated-fusion network fθg (.) and the attention network fθa (.). It should be noted that as the Key Encoder fθ˜(.) is obtained as an EMA of the Query Encoder, we do not backpropagate gradients through it. In Algorithm 1, we provide a pseudo-code, roughly outlining the integration of our method into the PyTorch-based DomainBed framework. We call our method as the Refined Contrastive ERM (RCERM), owing to its refinement based feature augmentation, for positive generation in Contrastive learning.
3
Related Work and Experiments
Dataset: To evaluate the prowess of our proposed method, we make use of the Photos, Art, Cartoons, and Sketches (PACS) dataset. It consists of images from 4 different domains: i) Photos (1,670 images), ii) Art Paintings (2,048 images), iii) Cartoons (2,344 images) and iv) Sketches (3,929 images). There are seven semantic categories shared across the domains, namely, ‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘horse’, ‘house’, ‘person’. It is a widely popular dataset to test the robustness of Domain Generalization (DG) models. In particular, in this work, we are specifically interested in figuring out how our proposed method performs in the domains containing drawings and abstract imagery, such as, Art, Cartoons, and Sketches domains. The images contained in these domains are significantly different from that of the Photos domain containing natural scene images. Baseline/Related Methods: Contrary to the traditional supervised learning assumption of the training and test data belonging to an identical distribution, Domain Generalization (DG) methods assume that training data is divided into a number of different, but semantically related domains, with an underlying causal mechanism, and test data could be from a different distribution. To generalize well in unseen data from a different distribution, and account for real-world situations, they try to learn invariance criteria among these domains. Following are some of the related DG methods that have been used as baselines in this paper: 1. 2. 3. 4.
ERM [17]: Naive approach of minimizing consolidated domain errors. IRM [1]: Attempts at learning correlations that are invariant across domains. GroupDRO [14]: Minimizes the worst-case training loss over the domains. Mixup [18]: Pairs of examples from random domains along with their labels are interpolated to perform ERM. 5. MLDG [8]: MAML based meta-learning for DG.
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
175
Table 1. Performance of SOTA methods on PACS using the training-domain validation model selection criterion Method
A
ERM
84.7 ± 0.4 80.8 ± 0.6 97.2 ± 0.3 79.3 ± 1.0 85.5
C
P
S
Avg
IRM
84.8 ± 1.3 76.4 ± 1.1 96.7 ± 0.6 76.1 ± 1.0 83.5
GroupDRO
83.5 ± 0.9 79.1 ± 0.6 96.7 ± 0.3 78.3 ± 2.0 84.4
Mixup
86.1 ± 0.5 78.9 ± 0.8 97.6 ± 0.1 75.8 ± 1.8 84.6
MLDG
85.5 ± 1.4 80.1 ± 1.7 97.4 ± 0.3 76.6 ± 1.1 84.9
CORAL
88.3 ± 0.2 80.0 ± 0.5 97.5 ± 0.3 78.8 ± 1.3 86.2
MMD
86.1 ± 1.4 79.4 ± 0.9 96.6 ± 0.2 76.5 ± 0.5 84.6
DANN
86.4 ± 0.8 77.4 ± 0.8 97.3 ± 0.4 73.5 ± 2.3 83.6
CDANN
84.6 ± 1.8 75.5 ± 0.9 96.8 ± 0.3 73.5 ± 0.6 82.6
MTL
87.5 ± 0.8 77.1 ± 0.5 96.4 ± 0.8 77.3 ± 1.8 84.6
SagNet
87.4 ± 1.0 80.7 ± 0.6 97.1 ± 0.1 80.0 ± 0.4 86.3
ARM
86.8 ± 0.6 76.8 ± 0.5 97.4 ± 0.3 79.3 ± 1.2 85.1
VREx
86.0 ± 1.6 79.1 ± 0.6 96.9 ± 0.5 77.7 ± 1.7 84.9
RSC
85.4 ± 0.8 79.7 ± 1.8 97.6 ± 0.3 78.2 ± 1.2 85.2
AND-mask
85.3 ± 1.4 79.2 ± 2.0 96.9 ± 0.4 76.2 ± 1.4 84.4
SAND-mask 85.8 ± 1.7 79.2 ± 0.8 96.3 ± 0.2 76.9 ± 2.0 84.6 Fishr
6. 7. 8. 9. 10. 11. 12. 13. 14.
88.4 ± 0.2 78.7 ± 0.7 97.0 ± 0.1 77.8 ± 2.0 85.5
CORAL [16]: Aligning second-order statistics of a pair of distributions. MMD [9]: MMD alignment of a pair of distributions. DANN [4]: Adversarial approach to learn features to be domain agnostic. CDANN [10]: Variant of DANN conditioned on class labels. MTL [2]: Mean embedding of a domain is used to train a classifier. SagNet [11]: Preserves the image content while randomizing the style. ARM [19]: Meta-learning based adaptation of test time batches. VREx [6]: Variance penalty based IRM approximation. RSC [7]: Iteratively discarding challenging features to improve generalizability. 15. AND-mask [12]: Hessian matching based DG approach. 16. SAND-mask [15]: An enhanced Gradient Masking strategy is employed to perform DG. 17. Fishr [13]: Matching the variances of domain-level gradients.
176
U. K. Dutta
Table 2. Performance of SOTA methods on PACS using the leave-one-domain-out cross-validation model selection criterion Method
A
ERM
83.2 ± 1.3 76.8 ± 1.7 97.2 ± 0.3 74.8 ± 1.3 83.0
C
P
S
Avg.
IRM
81.7 ± 2.4 77.0 ± 1.3 96.3 ± 0.2 71.1 ± 2.2 81.5
GroupDRO 84.4 ± 0.7 77.3 ± 0.8 96.8 ± 0.8 75.6 ± 1.4 83.5 Mixup
85.2 ± 1.9 77.0 ± 1.7 96.8 ± 0.8 73.9 ± 1.6 83.2
MLDG
81.4 ± 3.6 77.9 ± 2.3 96.2 ± 0.3 76.1 ± 2.1 82.9
CORAL
80.5 ± 2.8 74.5 ± 0.4 96.8 ± 0.3 78.6 ± 1.4 82.6
MMD
84.9 ± 1.7 75.1 ± 2.0 96.1 ± 0.9 76.5 ± 1.5 83.2
DANN
84.3 ± 2.8 72.4 ± 2.8 96.5 ± 0.8 70.8 ± 1.3 81.0
CDANN
78.3 ± 2.8 73.8 ± 1.6 96.4 ± 0.5 66.8 ± 5.5 78.8
MTL
85.6 ± 1.5 78.9 ± 0.6 97.1 ± 0.3 73.1 ± 2.7 83.7
SagNet
81.1 ± 1.9 75.4 ± 1.3 95.7 ± 0.9 77.2 ± 0.6 82.3
ARM
85.9 ± 0.3 73.3 ± 1.9 95.6 ± 0.4 72.1 ± 2.4 81.7
VREx
81.6 ± 4.0 74.1 ± 0.3 96.9 ± 0.4 72.8 ± 2.1 81.3
RSC
83.7 ± 1.7 82.9 ± 1.1 95.6 ± 0.7 68.1 ± 1.5 82.6
Table 3. Performance of SOTA methods on PACS using the test-domain validation model selection criterion Method
A
ERM
86.5 ± 1.0 81.3 ± 0.6 96.2 ± 0.3 82.7 ± 1.1 86.7
C
P
S
Avg.
IRM
84.2 ± 0.9 79.7 ± 1.5 95.9 ± 0.4 78.3 ± 2.1 84.5
GroupDRO
87.5 ± 0.5 82.9 ± 0.6 97.1 ± 0.3 81.1 ± 1.2 87.1
Mixup
87.5 ± 0.4 81.6 ± 0.7 97.4 ± 0.2 80.8 ± 0.9 86.8
MLDG
87.0 ± 1.2 82.5 ± 0.9 96.7 ± 0.3 81.2 ± 0.6 86.8
CORAL
86.6 ± 0.8 81.8 ± 0.9 97.1 ± 0.5 82.7 ± 0.6 87.1
MMD
88.1 ± 0.8 82.6 ± 0.7 97.1 ± 0.5 81.2 ± 1.2 87.2
DANN
87.0 ± 0.4 80.3 ± 0.6 96.8 ± 0.3 76.9 ± 1.1 85.2
CDANN
87.7 ± 0.6 80.7 ± 1.2 97.3 ± 0.4 77.6 ± 1.5 85.8
MTL
87.0 ± 0.2 82.7 ± 0.8 96.5 ± 0.7 80.5 ± 0.8 86.7
SagNet
87.4 ± 0.5 81.2 ± 1.2 96.3 ± 0.8 80.7 ± 1.1 86.4
ARM
85.0 ± 1.2 81.4 ± 0.2 95.9 ± 0.3 80.9 ± 0.5 85.8
VREx
87.8 ± 1.2 81.8 ± 0.7 97.4 ± 0.2 82.1 ± 0.7 87.2
RSC
86.0 ± 0.7 81.8 ± 0.9 96.8 ± 0.7 80.4 ± 0.5 86.2
AND-mask
86.4 ± 1.1 80.8 ± 0.9 97.1 ± 0.2 81.3 ± 1.1 86.4
SAND-mask 86.1 ± 0.6 80.3 ± 1.0 97.1 ± 0.3 80.0 ± 1.3 85.9 Fishr
87.9 ± 0.6 80.8 ± 0.5 97.9 ± 0.4 81.1 ± 0.8 86.9
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
177
Evaluation Protocol: The presence of multiple domains, the numerous available choices to construct a fair evaluation protocol, due to the multiple ways of forming the training data from across multiple domains makes model selection in DG a non-trivial problem. Inconsistencies in evaluation protocol, network architectures, etc, also makes a fair comparison among the plethora of available approaches difficult. To address this, the recently proposed framework DomainBed [5] was introduced, that ensures an uniform evaluation protocol to inspect and compare DG approaches, by running a large number of hyperparameter and model combinations (called as a sweep) automatically. It also introduces 3 model selection criteria: 1. Training-domain validation: The idea is to partition each training domain into a big split and a small split. Sizes of respective big and small splits would be the same for all training domains. The union of the big splits is used to train a model configuration, and the one performing the best in the union of the small splits is chosen as the best model for evaluating on the test data. 2. Leave-one-out validation: As the name indicates, it requires to train a model on all train domains except one, and evaluate on the left out domain. This process is repeated by leaving out one domain at a time, and then choosing the final model with the best average performance. 3. Test-domain validation: A model is trained on the union of the big splits of the train domains (similar to Training-domain validation), and the final epoch performance is evaluated on a small split of the test data itself. The third criterion though not suited for real-world applications, is often chosen only for evaluating methods. Performance of State-Of-The-Art (SOTA) approaches on PACS: Following the standard sweep of DomainBed, in Table 1, Table 2 and Table 3, we respectively report the performance of the SOTA DG methods on PACS dataset (in terms of classification accuracy, meaning that a higher value is better). The domains Art painting, Cartoons, Photos (natural images), and Sketches are shown as columns A, C, P and S respectively, along with the average performance of a method across these as column Avg. As also claimed by the DomainBed paper, we observed that no single method outperforms the classical ERM by more than one point, thus making ERM a reasonable baseline for DG. Very recently, the Fishr method [13] has been proposed which performs competitive to ERM, on average. The most important observation is the fact that none of the methods performs the best across all the domains, showcasing the difficult nature of the dataset, and the problem of generalization in particular.
178
3.1
U. K. Dutta
Comparison of Our Method RCERM Against the SOTA ERM and Fishr Methods:
Due to the SOTA performances of the ERM and the Fishr methods, we now compare our proposed method against them. In Table 4, we report the comparison of our method against the ERM method (best method for a column, within a selection criterion, is shown in bold). For the first two model selection criteria (train domain validation and leave-one-out) our proposed method outperforms the ERM method on all the domains, except on Cartoons. While the performance gain on natural scene Photos domain is not large, we found that on Arts and Sketches, our method performs better than ERM by a large margin (upto 2% point). In the sketch domain, our method performs better by 3.7% point over ERM using the leaveone-out criterion. Our method also performs better on Cartoons than ERM, when feedback from the test set is provided. However, in all other cases, ERM in itself performs quite competitive, in the first place. Table 4. Comparison of our proposed method against the state-of-the-art ERM method. Model Selection train-domain validation Method
A
C
P
S
Avg.
ERM 84.7 ± 0.4 80.8 ± 0.6 97.2 ± 0.3 79.3 ± 1.0 85.5 RCERM 86.6 ± 1.5 78.1 ± 0.4 97.3 ± 0.1 81.4 ± 0.6 85.9 Model Selection leave-one-domain out Method
A
C
P
S
Avg
ERM 83.2 ± 1.3 76.8 ± 1.7 97.2 ± 0.3 74.8 ± 1.3 83.0 RCERM 84.9 ± 1.7 70.6 ± 0.1 97.6 ± 0.8 78.5 ± 0.2 82.9 Model Selection test-domain validation set Method
A
C
P
S
Avg
ERM 86.5± 1.0 81.3 ± 0.6 96.2 ± 0.3 82.7± 1.1 86.7 RCERM 85.0 ± 0.7 83.2 ± 1.4 96.6 ± 0.2 80.8 ± 2.0 86.4
We also compare our method with the most recently proposed Fishr approach in Table 5, and observe that our method outperforms Fishr on the Sketch domain by 3.6% points using train-domain validation criterion, and the Cartoons domain by 2.4% points using test-domain validation criterion (NOTE: Results of Fishr on leave-one-out have not been evaluated by the authors). In fact, on average, we outperform Fishr using the trainvalidation criterion by 0.4% points, which despite being a low looking value, is actually quite significant in the DG problem for the PACS dataset.
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
179
Table 5. Comparison of our proposed method against the state-of-the-art Fishr method. Model Selection train-domain validation Method
A
C
P
S
Avg
Fishr 88.4± 0.2 78.7± 0.7 97± 0.1 77.8 ± 2.0 85.5 RCERM 86.6 ± 1.5 78.1 ± 0.4 97.3 ± 0.1 81.4 ± 0.6 85.9 Model Selection test-domain validation set (oracle) Method
A
C
P
S
Avg
Fishr 87.9± 0.6 80.8± 0.5 97.9± 0.4 81.1± 0.8 86.9 RCERM 85.0 ± 0.7 83.2 ± 1.4 96.6 ± 0.2 80.8 ± 2.0 86.4
3.2
Ablation/Analysis of Our Method:
While DomainBed frees the user of the hyperparameter searches automatically, we perform an ablation analysis of our method by removing the gatedfusion component of our method, and call it as RCERM with No Gated fusion (RCERMNG). The results are reported in Table 6. We found that RCERM by virtue of its gated fusion component indeed performs better than RCERMNG without the gated fusion. This is because the positives update their representation by taking into account a query. This helps in contributing to the alignment of the embeddings of semantically similar objects from across domains. Table 6. Ablation studies showing role of the gated-fusion component Model Selection train-domain validation Method
A
C
P
S
Avg.
RCERMNG 86.5 ± 0.8 80.4 ± 0.2 96.3 ± 0.4 77.1 ± 1.4 85.1 86.6 ± 1.5 78.1 ± 0.4 97.3 ± 0.1 81.4 ± 0.6 85.9 RCERM Model Selection leave-one-domain out Method
A
C
P
S
Avg
RCERMNG 82.9 ± 3.4 75.4 ± 1.5 96.5 ± 1.3 76± 0.7 82.7 84.9 ± 1.7 70.6 ± 0.1 97.6 ± 0.8 78.5 ± 0.2 82.9 RCERM
Fig. 6. Illustration of PACS type images.
180
U. K. Dutta
Fig. 7. Illustration of a few Cartoon images with abnormal semantics.
3.3
Key Takeaways:
1. The DG methods when compared in a fair, uniform setting using the DomainBed framework, perform more or less similar to the classical ERM method on the PACS dataset. The recently proposed Fishr method performs competitive to ERM. 2. On the PACS dataset, for all methods in general, the classification performance is the best for the Photos domain containing natural scene images (∼97% using training-domain validation), because, the underlying backbone (eg, ResNet50) has been pretrained on the ImageNet dataset which contains natural scenes. The models perform relatively well on the Art paintings domain (∼84−88% using training-domain validation), as the distribution of art images is relatively closer to that of natural images (in terms of texture, shades, etc.). The performance is poorer on both Sketches and Cartoons (∼73−80% using training-domain validation), which have greater distribution shift compared to that of natural images, and where attributes like shape play a prominent role than texture, color etc. Figure 6 illustrates PACS type of images. 3. Despite the challenging nature of the Art and Sketches domains, our proposed method outperforms/performs competitive against SOTA methods like ERM and Fishr . This validates the merit of our work, to be a suitable alternative to address images with drawings and abstract imagery, while being robust across different domains. The promise of our method lies on the fact that we are able to learn a single, common embedding which performs well on domains like art and sketches. 4. Regarding the Cartoons domain, by inspecting certain images qualitatively, we suspect the very nature of the images to be the hurdle in obtaining a better performance. This is because of the fact that Cartoon images very often violate the semantics of objects as observed in real life, by virtue of abnormal attributes (Fig. 7). For instance, a house having eyes, an animal having extremely large/small eyes relative to the size of the entire head of the animal, or, even unusually large head. We believe, that a separate study could be dedicated to such cartoon images.
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
181
Table 7. Ranking of all the SOTA methods (including our RCERM) using two model selection criteria, on the Art (A) and Sketch (S) domains. Avg (A, S) denotes the average ranks across these two domains. A lower value indicates a better rank. Training-domain Method A SagNet 4 CORAL 2 RCERM 6 5 ARM 1 Fishr 3 MTL 10 VREx 16 ERM 13 RSC 8 MMD SAND-mask 11 GroupDRO 18 12 MLDG 7 DANN 9 Mixup AND-mask 14 15 IRM 17 CDANN
validation Leave-one-domain-out cross-validation S Avg (A, S) 2 3 5 3.5 Method A S Avg (A, S) 1 3.5 RCERM 5 2 3.5 3 4 MMD 4 4 4 8 4.5 Mixup 3 8 5.5 10 6.5 MTL 2 9 5.5 9 9.5 GroupDRO 6 6 6 4 10 ARM 1 11 6 7 10 CORAL 14 1 7.5 13 10.5 ERM 9 7 8 11 11 SagNet 13 3 8 6 12 MLDG 12 5 8.5 12 12 DANN 7 13 10 17 12 VREx 11 10 10.5 16 12.5 IRM 10 12 11 14 14 RSC 8 14 11 15 15 CDANN 15 15 15 18 17.5
5. Based on the results from Table 1, Table 2 and Table 3, in Table 7 we additionally report the rankings of the performances of all the SOTA methods, including our RCERM, using two model selection criteria, on the Art (A) and Sketch (S) domains. It can be concluded that across all DG methods, for Art and Sketches, RCERM obtains the second best average rank using train-domain validation, and best average rank using leaveone-out criterion.
4
Conclusion
In this paper, we propose a novel Embedding Learning approach that seeks to generalize well across domains such as Drawings and Abstract images. During training, for a given query image, we obtain an augmented positive example for Contrastive Learning by leveraging gated fusion and attention. At the same time, to make the model discriminative, we push away examples from different semantic categories (across domains). We showcase the prowess of our method
182
U. K. Dutta
using the DomainBed framework, on the popular PACS (Photo, Art painting, Cartoon, and Sketch) dataset. Acknowledgment. I would like to thank Professor Haris, MBZUAI, for the insightful conversations.
References 1. Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. arXiv preprint arXiv:1907.02893 (2019) 2. Blanchard, G., Lee, G., Scott, C.: Generalizing from several related classification tasks to a new unlabeled sample. Adv. Neural Inf. Process. Syst. 24 (2011) 3. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of International Conference on Machine Learning (ICML), pp. 1597–1607. PMLR (2020) 4. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2030–2030 (2016) 5. Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: Proceedings of International Conference on Learning Representations (ICLR) (2020) 6. Krueger, D., Caballero, E., Jacobsen, J.H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., Courville, A.: Out-of-distribution generalization via risk extrapolation (rex). In: International Conference on Machine Learning. pp. 5815–5826. PMLR (2021) 7. Krueger, D., et al.: Out-of-distribution generalization via risk extrapolation (rex). In: International Conference on Machine Learning, pp. 5815–5826. PMLR (2021) 8. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.: Learning to generalize: meta-learning for domain generalization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) 9. Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409 (2018) 10. Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., Tao, D.: Deep domain generalization via conditional invariant adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639 (2018) 11. Nam, H., Lee, H., Park, J., Yoon, W., Yoo, D.: Reducing domain gap via styleagnostic networks, vol. 2, no. 7, p. 8. arXiv preprint arXiv:1910.11645 (2019) 12. Parascandolo, G., Neitz, A., Orvieto, A., Gresele, L., Sch¨ olkopf, B.: Learning explanations that are hard to vary. In: International Conference on Learning Representations (2020) 13. Rame, A., Dancette, C., Cord, M.: Fishr: invariant gradient variances for out-ofdistribution generalization. In: International Conference on Machine Learning, pp. 18347–18377. PMLR (2022) 14. Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731 (2019) 15. Shahtalebi, S., Gagnon-Audet, J.C., Laleh, T., Faramarzi, M., Ahuja, K., Rish, I.: Sand-mask: an enhanced gradient masking strategy for the discovery of invariances in domain generalization. arXiv preprint arXiv:2106.02266 (2021)
Fuse and Attend: Generalized Embedding Learning for Art and Sketches
183
16. Sun, B., Saenko, K.: Deep CORAL: correlation alignment for deep domain adaptation. In: Hua, G., J´egou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 443–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8 35 17. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999) 18. Xu, M., et al.: Adversarial domain adaptation with domain mixup. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 6502–6509 (2020) 19. Zhang, M., Marklund, H., Gupta, A., Levine, S., Finn, C.: Adaptive risk minimization: a meta-learning approach for tackling group shift, vol. 8. arXiv preprint arXiv:2007.02931 (2020)
3D Shape Reconstruction from Free-Hand Sketches Jiayun Wang1(B) , Jierui Lin1 , Qian Yu1,2 , Runtao Liu1 , Yubei Chen1 , and Stella X. Yu1 1 UC Berkeley/ICSI, Berkeley, CA, USA {peterwg,jerrylin0928,runtao liu,yubeic,stellayu}@berkeley.edu 2 Beihang University, Beijing, China [email protected]
Abstract. Sketches are arguably the most abstract 2D representations of real-world objects. Although a sketch usually has geometrical distortion and lacks visual cues, humans can effortlessly envision a 3D object from it. This suggests that sketches encode the information necessary for reconstructing 3D shapes. Despite great progress achieved in 3D reconstruction from distortion-free line drawings, such as CAD and edge maps, little effort has been made to reconstruct 3D shapes from free-hand sketches. We study this task and aim to enhance the power of sketches in 3D-related applications such as interactive design and VR/AR games. Unlike previous works, which mostly study distortion-free line drawings, our 3D shape reconstruction is based on free-hand sketches. A major challenge for free-hand sketch 3D reconstruction comes from the insufficient training data and free-hand sketch diversity, e.g. individualized sketching styles. We thus propose data generation and standardization mechanisms. Instead of distortion-free line drawings, synthesized sketches are adopted as input training data. Additionally, we propose a sketch standardization module to handle different sketch distortions and styles. Extensive experiments demonstrate the effectiveness of our model and its strong generalizability to various free-hand sketches. Our code is available. Keywords: Sketch · Interactive design insufficiency · 3D reconstruction
1
· Shape reconstruction · Data
Introduction
Human free-hand sketches are the most abstract 2D representations for 3D visual perception. Although a sketch may consist of only a few colorless strokes and exhibit various deformation and abstractions, humans can effortlessly envision the corresponding real-world 3D object from it. It is of interest to develop a computer vision model that can replicate this ability. Although sketches and 3D representations have drawn great interest from researchers in recent years, these two modalities have been studied relatively independently. We explore the c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Karlinsky et al. (Eds.): ECCV 2022 Workshops, LNCS 13808, pp. 184–202, 2023. https://doi.org/10.1007/978-3-031-25085-9_11
Sketch to 3D
185
plausibility of bridging the gap between sketches and 3D, and build a computer vision model to recover 3D shapes from sketches. Such a model will unleash many applications, like interactive CAD design and VR/AR games.
Edge map Pervious works
Sketch Distortion-free line drawings to 3D Ours
standardization
Free-hand sketch to 3D
Fig. 1. Left: We study 3D reconstruction from a single-view free-hand sketch, differing from previous works [7, 30, 49] which use multi-view distortion-free line-drawings as training data. Center: While previous works [7, 49] employ distortion-free line drawings (e.g. edge-maps) as proxies for sketches, our model trained on synthesized sketches can generalize better to free-hand sketches. Further, the proposed sketch standardization module makes the method generalizes well to free-hand sketches by standardizing different sketching styles and distortion levels. Right: Our model unleashes many practical applications such as real-time 3D modeling with sketches. A demo is here.
With the development of new devices and sensors, sketches and 3D shapes, as representations of real-world objects beyond natural images, become increasingly important. The popularity of touch-screen devices makes sketching not a privilege to professionals anymore and increasingly popular. Researchers have applied sketch in tasks like image retrieval [12,26,45,51,58,62] and image synthesis [12,13,29,40,52,63] to leverage its power in expression. Furthermore, as depth sensors, such as structured light device, LiDAR, and TOF cameras, become more ubiquitous, 3D data become an emerging modality in computer vision. 3D reconstruction, the process of capturing the shape and appearance of real objects, is an essential topic in 3D computer vision. 3D reconstruction from multi-view images has been studied for many years [1,4,11,38]. Recent works [8,10,43] have further explored 3D reconstruction from a single image. Despite these trends and progress, there are limited works connecting 3D and sketches. We argue that sketches are abstract 2D representations of 3D perception, and it is of great significance to study sketches in a 3D-aware perspective and build connections between two modalities. Researchers have explored the potential of distortion-free line drawings (e.g. edge maps) for 3D modeling [27,28,57]. These works are based on distortion-free line drawings and generalize poorly to free-hand sketches (Fig. 1L). Furthermore, the role of line drawings in such works is to provide geometrical information for the subsequent 3D modeling. Some other works [7,30] employ neural networks to reconstruct 3D shapes
directly from line drawings. However, their decent reconstructions come with two major limitations: a) they use distortion-free line drawings as training data, which makes such models hard to generalize to free-hand sketches; b) they usually require inputs depicting the object from multiple views to achieve satisfactory outcomes. Therefore, such methods cannot reconstruct the 3D shape from a single-view free-hand sketch well, as we show later in the experiment section. Other works such as [19,49] tackle 3D retrieval instead of 3D shape reconstruction from sketches. Retrieved shapes come from the pre-collected gallery set and may not resemble novel sketches well. Overall, reconstructing a 3D shape from a single free-hand sketch remains under-explored. We explore single-view free-hand sketch-based 3D reconstruction (Fig. 1C). A free-hand sketch is defined as a line drawing created without any additional tool. As an abstract and concise representation, it is different from distortion-free line drawings (e.g. edge maps) since it commonly has some spatial distortions, but it can still reflect the essential geometric shape. 3D reconstruction from sketches is challenging for the following reasons: a) Data insufficiency. Paired sketch-3D datasets are rare although there exist several large-scale sketch datasets and 3D shape datasets, respectively. Furthermore, collecting sketch-3D pairs can be far more time-consuming and expensive than collecting sketch-image pairs, as each 3D shape could be sketched from various viewing angles. b) Misalignment between two representations. A sketch depicts an object from a certain view while a 3D shape can be viewed from multiple angles due to the encoded depth information. c) Due to the nature of hand drawing, a sketch is usually geometrically imprecise and carries an individual style compared to the real object. Thus a sketch can only provide suggestive shape and structural information. In contrast, a 3D shape is faithful to its corresponding real-world object with no geometric deformation. To address these challenges, we propose a single-view sketch-to-3D shape reconstruction framework. Specifically, it takes a sketch from an arbitrary angle as input and reconstructs a 3D point cloud. Our model cascades a sketch standardization module U and a reconstruction module G. U handles various sketching styles/distortions and transfers inputs to standardized sketches while G takes a standardized sketch to reconstruct the 3D shape (point cloud) regardless of the object category. The key novelty lies in the mechanisms we propose to tackle the data insufficiency issue. Specifically, we first train a photo-to-sketch model on unpaired large-scale datasets. Based on the model, sketch-3D pairs can be automatically generated from 2D renderings of 3D shapes. Together with the standardization module U which unifies input sketch styles, the synthesized sketches provide sufficient information to train the reconstruction model G. We conduct extensive experiments on a composed sketch-3D dataset, spanning 13 classes, where sketches are synthesized and 3D objects come from the ShapeNet dataset [3]. Furthermore, we collect an evaluation set, which consists of 390 real sketch-3D pairs. Results demonstrate that our model can reconstruct 3D shapes with certain geometric details from real sketches across different styles, stroke line-widths, and object categories. Our model also enables practical applications such as real-time 3D modeling with sketches (Fig. 1R).
To summarize our contributions: a) We are among the first to study the plausibility of reconstructing 3D shapes from single-view free-hand sketches. b) We propose a novel framework for this task and explore various design choices. c) To handle data insufficiency, we propose to train on synthetic sketches. Moreover, sketch standardization is introduced to make the model generalize better to free-hand sketches. It is a general method for zero-shot domain translation, and we show applications to zero-shot image translation tasks.
Fig. 2. Model overview. The model consists of three major components: sketch generation, sketch standardization, and 3D reconstruction. To generate synthesized sketches, we first render 2D images for a 3D shape from multiple viewpoints. We then employ an image-to-sketch translation model to generate sketches of corresponding views. The standardization module standardizes sketches with different styles and distortions. The deformation D1 is only used in training for augmentation such that the model is robust to geometric distortions of sketches. For inference, sketches are dilated (D2) and refined (R) so their style matches that of the training sketches. For 3D reconstruction, a view estimation module is adopted to align the output's view with the ground-truth 3D shape. (Color figure online)
2 Related Works
3D Reconstruction from Images. While SfM [38] and SLAM [11] achieve success in handling multi-view 3D reconstructions in various real-world scenarios, their reconstructions can be limited by insufficient input viewpoints and 3D scanning data. Deep-learning-based methods have been proposed to further improve reconstructions by completing 3D shapes with occluded or hollowed-out areas [4,22,59]. In general, recovering the 3D shape from a single-view image is an ill-posed problem. Attempts to tackle the problem include 3D shape reconstructions from silhouettes [8], shading [43], and texture [54]. However, these methods require strong assumptions and expertise about natural images [65], limiting their usage in real-world scenarios. Generative adversarial networks (GANs) [14] and variational autoencoders (VAEs) [24] have achieved success in image synthesis and enabled 3D shape reconstruction from a single-view image [55]. Fan et al. [10] further adopt point clouds as the 3D representation, enabling models to reconstruct certain geometric details from an image. These methods may not directly work on sketches as many visual cues are missing. 3D reconstruction networks are designed differently depending on the output 3D representation. 3D voxel reconstruction networks [18,48,56] benefit from
many image processing networks as convolutions are appropriate for voxels. They are usually constrained to low resolution due to the computational overhead. Mesh reconstruction networks [25,53] are able to learn directly from meshes, but they suffer from topology issues and heavy computation [39]. We adopt the point cloud representation as it can capture certain 3D geometric details with low computational overhead. Reconstructing 3D point clouds from images has been shown to benefit from well-designed network architectures [10,32], latent embedding matching [31], additional image supervision [35], etc.
Sketch-Based 3D Retrievals/Reconstructions. Free-hand sketches are used for 3D shape retrieval [19,49] given their power in expression. However, retrieval methods are significantly constrained by the gallery dataset. Precise sketching is also studied in the computer graphics community for 3D shape modeling or procedural modeling [20,27,28]. These works are designed for professionals and require additional information for shape modeling, e.g., surface normals or procedural model parameters. Delanoy et al. [7] first employ neural networks to learn 3D voxels from line-drawings. While it achieves impressive performance, this model has several limitations: a) The model uses distortion-free edge maps as training data. While working on some sketches with small distortions, it cannot generalize to general free-hand sketches. b) The model requires multiple inputs from different viewpoints for a satisfactory result. These limitations prevent the model from generalizing to real free-hand sketches. Recent works also explore reconstructing 3D models from sketches with direct shape optimization [17], shape contours [16], a differentiable renderer [64], and unsupervised learning [50]. Compared to existing works, the proposed method in this work reconstructs the 3D point cloud from a single-view free-hand sketch. Our model may make 3D reconstruction and its applications more accessible to the general public.
Fig. 3. Synthesized sketches are visually more similar to free-hand sketches than edge maps as they contain distortions and emphasize perceptually significant contours. After standardization, the free-hand sketches share a uniform style similar to training data.
3 3D Reconstruction from Sketches
The proposed framework has three modules (Fig. 2). To deal with data insufficiency, we first synthesize sketches as the training set. The module U transfers an input sketch to a standardized sketch. Then, the module G takes the standardized sketch to reconstruct a 3D shape (point clouds). We also present details of a new sketch-3D dataset, which is collected for evaluating the proposed model.
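The forward path at inference time is simply a cascade of the two learned modules. The snippet below is a minimal sketch in PyTorch; the wrapper class and argument names are ours (hypothetical), not the authors' released code, and the two sub-networks are assumed to be trained as described in the rest of this section.

```python
import torch
import torch.nn as nn

class SketchTo3D(nn.Module):
    """Cascade of the standardization module U and the reconstruction module G."""

    def __init__(self, standardizer: nn.Module, reconstructor: nn.Module):
        super().__init__()
        self.U = standardizer   # free-hand sketch -> standardized sketch
        self.G = reconstructor  # standardized sketch -> (B, N, 3) point cloud

    def forward(self, sketch: torch.Tensor) -> torch.Tensor:
        # sketch: (B, 1, H, W) grayscale sketch image
        standardized = self.U(sketch)
        return self.G(standardized)
```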
3.1 Synthetic Sketch Generation
To the best of our knowledge, there exists no paired sketch-3D dataset. While it is possible to resort to edge maps [7], edge maps are different from sketches (as shown in the 3rd and 4th rows of Fig. 3). We show that the reconstruction model trained on edge maps cannot generalize well to real free-hand sketches in Sect. 4.4. Thus it is crucial to find an efficient and reliable way to synthesize sketches for 3D shapes. Inspired by [29], we employ a generative model to synthesize sketches from rendered images of 3D shapes. Figure 2L depicts the procedure. Specifically, we first render m images for each 3D shape, where each image corresponds to a particular view of the 3D shape. We then adopt the model introduced in [29] to synthesize gray-scale sketch images, denoted as {S_i | S_i ∈ R^{W×H}}, as our training data. W and H refer to the width and height of a sketch image.
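The data-generation loop can be sketched as follows. This is only an illustration under stated assumptions: `render_views` and `image_to_sketch` are caller-supplied placeholders (e.g., an off-the-shelf renderer and the unpaired image-to-sketch model of [29]); neither name comes from the paper's code.

```python
from pathlib import Path

def build_sketch_3d_pairs(shape_paths, render_views, image_to_sketch,
                          m=24, out_dir=Path("synth_sketches")):
    """render_views(shape_path, num_views) -> list of rendered PIL images;
    image_to_sketch(image) -> gray-scale PIL sketch of the same view."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pairs = []
    for shape_path in shape_paths:
        for view_idx, image in enumerate(render_views(shape_path, num_views=m)):
            sketch = image_to_sketch(image)
            sketch_file = out_dir / f"{Path(shape_path).stem}_{view_idx:02d}.png"
            sketch.save(sketch_file)
            pairs.append((sketch_file, shape_path))  # (synthetic sketch, 3D shape) pair
    return pairs
```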
3.2 Sketch Standardization
Sketches usually have strong individual styles and geometric distortions. Due to the gap between free-hand sketches and the synthesized sketches, directly using the synthesized sketches as training data would not lead to a robust model. The main issues are that the synthesized sketches have a uniform style and do not contain enough geometric distortions. Rather, the synthesized sketches can be treated as an intermediate representation if we can find a way to project a free-hand sketch to the synthesized sketch domain. We propose a zero-shot domain translation technique, the sketch standardization module, to achieve this domain adaptation goal without using free-hand sketches as training data. The training of the sketch standardization module only involves synthesized sketches. The general idea is to project a distorted synthesized sketch back to the original synthesized sketch. The training consists of two parts: a) since free-hand sketches usually have geometric distortions, we first apply predefined distortion augmentation to the input synthesized sketches. b) A geometrically distorted synthesized sketch still differs in style and line quality from free-hand sketches. Thus, the first stage of the standardization is a dilation operation, which projects distorted synthesized sketches and free-hand sketches into the same domain. Then, a refinement network follows to project the dilated sketch back to the synthesized sketch domain.
In summary, as in Fig. 2, the standardization module U first applies a dilation operator D2 to the input sketch, followed by a refinement operator R that transfers it to the standardized synthesized-sketch style (or training-sketch style) Ŝ_i, i.e. U = R ◦ D2. R is implemented as an image translation network. During training, a synthesized sketch S_i is first augmented by the deformation operator D1 to mimic drawing distortion, and then U aims to project it back to S_i. Note that D1 is not used during testing. We illustrate the standardization process in Fig. 2R with more details in the following.

Deformation. When training U, each synthesized sketch is deformed with moving least squares [46] for random, local and rigid distortion. Specifically, we randomly sample a set of control points on sketch strokes and denote them as p, and denote the corresponding deformed point set as q. Following moving least squares, we solve for the best affine transformation l_v(x) such that

\min \sum_i w_i \, | l_v(p_i) - q_i |^2, \quad \text{with weights } w_i = \frac{1}{|p_i - v|^{2\alpha}},

where p_i and q_i are row vectors. The affine transformation can be written as l_v(p_i) = p_i M + T. We add the constraint M^T M = I to make the deformation rigid and avoid too much distortion. Details can be found in [46].

Style Translation. Adapting to an unknown input free-hand sketch style during inference can be considered as a zero-shot domain translation problem, which is
Fig. 4. Sketch standardization can be considered as a general zero-shot domain translation method. Given a sample from a zero-shot (input) domain X, we first translate it to a universal intermediate domain Y, and finally to the target domain Z. 1st example (2nd row): the input domain is an unseen free-hand sketch. With sketch standardization, it is translated to an intermediate domain: a standardized sketch, which shares a similar style with the synthesized sketches used for training. With 3D reconstruction, the standardized sketch can be translated to the target domain: 3D point clouds. 2nd example (last row): the input domain is an unseen nighttime image. With edge extraction, it is translated to an intermediate domain: an edge map. With the image-to-image translation model, the standardized edge map can be translated to the target domain: a daytime image.
challenging. Inspired by [61], we first dilate the augmented training sketch strokes by 4 pixels and then use the image-to-image translation network Pix2Pix [21] to translate the dilated sketches to the undistorted synthesized sketches. During inference, we also dilate the free-hand sketches and apply the trained Pix2Pix model such that the style of an input free-hand sketch is adapted to the style of the synthesized sketches used during training. The dilation step can be considered as introducing uncertainty for the style adaptation. Further, we show in Sect. 4.5 that the proposed style standardization module can be used as a general zero-shot domain translation technique, which generalizes to more applications such as sketch classification and zero-shot image-to-image translation.

A More General Message: Zero-Shot Domain Translation. We illustrate in Fig. 4 a more general message of the standardization module: it can be considered as a general method for zero-shot domain translation. Consider the following problem: we would like to build a model to transfer domain X to domain Z but we do not have any training data from domain X. Our general idea to solve this problem is to build an intermediate domain Y as a bridge such that: a) we can translate data from domain X to domain Y, and b) we can further translate data from domain Y to domain Z. We give two examples in the caption of Fig. 4 and provide experimental results in Sect. 4.5.
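As a concrete illustration of the inference-time standardization path, the snippet below binarizes an input sketch, dilates its strokes, and passes the result through a trained refinement network. It is a minimal sketch under stated assumptions: the `refiner` argument stands in for the trained Pix2Pix generator R, and the kernel size and iteration count are illustrative choices, not values taken from the released code.

```python
import cv2
import numpy as np
import torch

def standardize(sketch_gray: np.ndarray, refiner: torch.nn.Module,
                dilate_iters: int = 4) -> np.ndarray:
    """sketch_gray: (H, W) uint8 gray-scale sketch; returns a standardized sketch."""
    # Binarize so that strokes become foreground on a clean background.
    _, binary = cv2.threshold(sketch_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # D2: thicken strokes so differently drawn inputs fall into a common rough domain.
    dilated = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=dilate_iters)
    # R: project the rough sketch back to the synthesized-sketch style.
    x = torch.from_numpy(dilated).float().div(255.0)[None, None]  # (1, 1, H, W)
    with torch.no_grad():
        refined = refiner(x)
    return (refined.squeeze().clamp(0, 1).numpy() * 255).astype(np.uint8)
```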
3.3 Sketch-Based 3D Reconstruction
Our 3D reconstruction network G (pipeline in Fig. 2R) consists of several components. Given a standardized sketch Ŝ_i, the view estimation module first estimates its viewpoint. Ŝ_i is then fed to the sketch-to-3D module to generate a point cloud P_i,pre, whose pose aligns with the sketch viewpoint. A 3D rotation corresponding to the viewpoint is then applied to P_i,pre to output the canonically-posed point cloud P_i. The objective of G is to minimize distances between the reconstructed point cloud P_i and the ground-truth point cloud P_i,gt.

View Estimation Module. The view estimation module g_1 aims to determine the 3D pose from an input sketch Ŝ. Similar to the input transformation module of PointNet [42], g_1 estimates a 3D rotation matrix A from a sketch Ŝ, i.e., A = g_1(Ŝ). A regularization loss L_orth = ||I − AA^T||_F^2 is applied to ensure A is a rotation (orthogonal) matrix. The rotation matrix A rotates a point cloud from the viewpoint pose to a canonical pose, which matches the ground truth.

3D Reconstruction Module. The reconstruction network g_2 learns to reconstruct a 3D point cloud P_pre from a sketch Ŝ, i.e., P_pre = g_2(Ŝ). P_pre is further transformed by the corresponding rotation matrix A to P so that P aligns with the ground-truth 3D point cloud P_gt's canonical pose. Overall, we have P = g_1(Ŝ) · g_2(Ŝ). To train G, we penalize the distance between an output point cloud P and the ground-truth point cloud P_gt. We employ the Chamfer distance (CD) between two point clouds P, P_gt ⊂ R^3:

d_{CD}(P, P_{gt}) = \sum_{p \in P} \min_{q \in P_{gt}} \|p - q\|_2^2 + \sum_{q \in P_{gt}} \min_{p \in P} \|p - q\|_2^2    (1)
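For reference, a direct (O(N·M)) batched implementation of Eq. (1) in PyTorch is sketched below; it is a readable reference, not the optimized kernel one would use for training at scale.

```python
import torch

def chamfer_distance(P: torch.Tensor, P_gt: torch.Tensor) -> torch.Tensor:
    """P: (B, N, 3) predicted points, P_gt: (B, M, 3) ground-truth points."""
    diff = P.unsqueeze(2) - P_gt.unsqueeze(1)        # (B, N, M, 3) pairwise differences
    dist2 = (diff ** 2).sum(dim=-1)                  # squared Euclidean distances
    pred_to_gt = dist2.min(dim=2).values.sum(dim=1)  # sum over p of min over q
    gt_to_pred = dist2.min(dim=1).values.sum(dim=1)  # sum over q of min over p
    return (pred_to_gt + gt_to_pred).mean()          # averaged over the batch
```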
The final loss of the entire network is

L = \sum_i d_{CD}(G \circ U(S_i), P_{i,gt}) + \lambda L_{orth} = \sum_i d_{CD}(A_i \cdot P_{i,pre}, P_{i,gt}) + \lambda L_{orth} = \sum_i d_{CD}(g_1(\hat{S}_i) \cdot g_2(\hat{S}_i), P_{i,gt}) + \lambda L_{orth},

where λ is the weight of the orthogonal regularization loss and Ŝ_i = R ◦ D2 ◦ D1(S_i) is the standardized sketch from S_i. We employ CD rather than EMD (Sect. 4.2) to penalize the difference between the reconstruction and the ground-truth point clouds because CD emphasizes the geometric outline of point clouds and leads to reconstructions with better geometric details. EMD, however, emphasizes the point cloud distribution and may not preserve the geometric details well at locations with low point density.
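A compact realization of this objective is sketched below in PyTorch, reusing the `chamfer_distance` sketch above. The per-point rotation convention and the λ = 1e-3 default (taken from Sect. 4.3) are stated assumptions rather than a verbatim transcription of the authors' code.

```python
import torch

def total_loss(A: torch.Tensor, P_pre: torch.Tensor, P_gt: torch.Tensor,
               lambda_orth: float = 1e-3) -> torch.Tensor:
    """A: (B, 3, 3) rotations from g1, P_pre: (B, N, 3) reconstructions from g2."""
    # Rotate each reconstructed point into the canonical pose (A · p per point).
    P = torch.einsum('bij,bnj->bni', A, P_pre)
    # Orthogonality regularizer: ||I - A A^T||_F^2, averaged over the batch.
    eye = torch.eye(3, device=A.device).expand_as(A)
    l_orth = ((eye - torch.bmm(A, A.transpose(1, 2))) ** 2).sum(dim=(1, 2)).mean()
    return chamfer_distance(P, P_gt) + lambda_orth * l_orth
```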
4 Experimental Results
We first present the datasets, training and evaluation details, followed by qualitative and quantitative results. Then, we provide comparisons with some state-of-the-art methods. We also conduct ablation studies to understand each module.
4.1 3D Sketching Dataset
To evaluate the performance of our method, we collected a real-world evaluation set containing paired sketch-3D data. Specifically, we randomly choose ten 3D shapes from each of the 13 categories of the ShapeNet dataset [3]. Then we randomly render 3 images from different viewpoints for each 3D shape. In total, there are 130 different 3D shapes and 390 rendered images. We recruited 11 volunteers to draw sketches for the rendered images. The final sketches were reviewed for quality control. We present several examples in Fig. 3.
Fig. 5. Left: Performance on free-hand sketches with different design choices. The design pool includes the model with a cascaded two-stage structure (2nd column), the model trained on edge maps (3rd column), the model whose 3D output is represented by voxels (4th column), and the proposed model (5th column). Overall, the proposed method achieves better performance and keeps more fine-grained details, e.g., the legs of chairs. Right: 3D reconstructions on our newly-collected free-hand sketch evaluation dataset. Outlined in green: examples of good reconstruction results. Our model reconstructs 3D shapes of multiple categories unconditionally with fine geometric fidelity. Outlined in red: examples of failure cases. Our model may not handle detailed structures well (e.g., watercraft), may recognize the wrong category (e.g., a display as a lamp) due to the ambiguity of the sketch, and may fail to generate a 3D shape from very abstract sketches where little geometric information is available (e.g., rifle). (Color figure online)
4.2 Training Details and Evaluation Metrics
Training. The proposed model is trained on a subset of the ShapeNet [3] dataset, following the settings of [56]. The dataset consists of 43,783 3D shapes spanning 13 categories, including car, chair, table, etc. For each category, we randomly select 80% of the 3D shapes for training and the rest for evaluation. As mentioned in Sect. 3.1, corresponding sketches of rendered images from 24 viewpoints of each 3D shape of ShapeNet are synthesized with our synthetic sketch generation module.

Evaluation. To evaluate our method's 3D reconstruction performance on free-hand sketches, we use our proposed sketch-3D dataset (Sect. 4.1). To evaluate the generalizability of our model, we also evaluate on three additional free-hand sketch datasets, including the Sketchy dataset [45], the TU-Berlin dataset [9], and the QuickDraw dataset [15]. For these additional datasets, only sketches from categories that overlap with the ShapeNet dataset are considered. Following previous works [10,31,60], we adopt two evaluation metrics to measure the similarity between the reconstructed 3D point cloud P and the ground-truth point cloud P_gt. The first one is the Chamfer Distance (Eq. 1), and the other is the Earth Mover's Distance (EMD):

d_{EMD}(P, P_{gt}) = \min_{\phi: P \to P_{gt}} \sum_{x \in P} \|x - \phi(x)\|_2,

where P and P_gt have the same size |P| = |P_gt| and \phi: P \to P_{gt} is a bijection. CD and EMD evaluate the similarity between two point clouds from two different perspectives (more details can be found in [10]).
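Because the two point sets have equal size, the EMD above reduces to a minimum-cost bijection, which can be computed exactly with the Hungarian algorithm. The SciPy-based reference below is exact but O(n^3), so it is only practical for modest point counts; it is our own illustrative sketch, not the evaluation code used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def earth_movers_distance(P: np.ndarray, P_gt: np.ndarray) -> float:
    """P, P_gt: (N, 3) arrays with the same number of points."""
    cost = cdist(P, P_gt)                            # pairwise Euclidean distances
    row_ind, col_ind = linear_sum_assignment(cost)   # optimal bijection phi
    return float(cost[row_ind, col_ind].sum())
```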
4.3 Implementation Details
Sketch Generation. We utilize an off-the-shelf sketch-image translation model [29] to synthesize sketches for training. Given the appropriate quality of the generated sketches on the ShapeNet dataset (with some samples depicted in Fig. 3), we directly use the model without any fine-tuning.

Table 1. Ours outperforms baselines for 3D reconstruction. [7] uses edge-maps rather than sketches as input. [56] uses voxels rather than point clouds as output. [37] represents the 3D shape with multi-view depth maps. "Cas." refers to the two-stage cascaded training following [10, 29]. CD and EMD measure distances between reconstructions and ground-truths from different perspectives (see text for details). The lower, the better.

Chamfer Distance (×10^-4):

Error            | Points | Edge [7] | Voxel [56] | Cas.  | [37] | Retrieval [49] | Ours
Airplane         | 11.4   | 7.8      | 35.1       | 71.7  | 8.0  | 11.2           | 6.1
Bench            | 29.2   | 16.7     | 202.8      | 414.1 | 16.8 | 14.5           | 13.0
Cabinet          | 61.7   | 50.4     | 59.1       | 354.5 | 51.5 | 45.3           | 39.2
Car              | 20.8   | 13.3     | 173.2      | 114.2 | 14.1 | 14.2           | 10.4
Chair            | 41.8   | 36.4     | 108.6      | 237.1 | 36.1 | 33.0           | 26.9
Display          | 68.6   | 48.3     | 33.1       | 340.2 | 49.3 | 38.2           | 37.7
Lamp             | 63.3   | 59.4     | 107.0      | 214.0 | 60.2 | 63.5           | 46.3
Speaker          | 88.2   | 79.7     | 203.2      | 406.4 | 81.2 | 72.3           | 62.1
Rifle            | 17.0   | 12.1     | 170.1      | 15.4  | 12.3 | 14.2           | 10.1
Sofa             | 32.8   | 20.9     | 141.2      | 482.4 | 22.3 | 20.3           | 16.3
Table            | 55.2   | 49.4     | 134.7      | 469.5 | 50.5 | 49.1           | 40.7
Telephone        | 30.7   | 27.3     | 26.9       | 259.8 | 27.1 | 27.4           | 21.3
Watercraft       | 32.9   | 26.0     | 129.1      | 53.8  | 26.0 | 27.3           | 20.3
Avg.             | 42.6   | 34.4     | 117.2      | 264.1 | 35.0 | 33.1           | 26.9
Free-hand sketch | 87.1   | 89.0     | 162.5      | 334.2 | 91.8 | 89.2           | 86.1

Earth Mover's Distance (×10^-2):

Error            | Points | Edge [7] | Voxel [56] | Cas. | [37] | Retrieval [49] | Ours
Airplane         | 8.5    | 7.3      | 10.8       | 12.7 | 8.5  | 11.9           | 6.5
Bench            | 11.1   | 8.7      | 22.0       | 25.8 | 10.0 | 8.6            | 7.8
Cabinet          | 17.6   | 17.8     | 17.0       | 29.6 | 18.4 | 17.2           | 16.0
Car              | 8.9    | 20.0     | 25.2       | 20.0 | 21.6 | 21.2           | 18.0
Chair            | 15.1   | 15.6     | 19.4       | 22.8 | 16.1 | 15.3           | 13.0
Display          | 15.5   | 15.1     | 13.1       | 27.9 | 16.4 | 14.6           | 14.4
Lamp             | 21.3   | 22.6     | 21.2       | 24.9 | 22.3 | 22.6           | 20.4
Speaker          | 19.4   | 19.2     | 23.8       | 28.0 | 21.8 | 20.0           | 17.9
Rifle            | 11.2   | 13.8     | 23.7       | 15.4 | 15.2 | 17.6           | 12.4
Sofa             | 11.1   | 8.5      | 18.6       | 25.4 | 9.1  | 8.6            | 7.7
Table            | 19.1   | 17.7     | 18.5       | 26.5 | 18.2 | 18.2           | 17.3
Telephone        | 13.4   | 13.6     | 15.1       | 27.2 | 15.1 | 15.3           | 12.3
Watercraft       | 12.5   | 11.1     | 23.1       | 17.8 | 12.2 | 12.7           | 10.6
Avg.             | 14.2   | 14.7     | 19.3       | 23.4 | 15.8 | 15.7           | 13.4
Free-hand sketch | 18.6   | 16.4     | 22.9       | 26.1 | 17.0 | 16.8           | 16.0
Data Augmentation. During training, to improve the model's generalizability and robustness, we perform data augmentation for synthetic sketches before feeding them to the standardization module. Specifically, we apply image spatial translation (up to ±10 pixels) and rotation (up to ±10°) to each input sketch.

Sketch Standardization. Each input sketch S_i is first randomly deformed with moving least squares [46] both globally and locally (D1), and then binarized and dilated five times iteratively (D2) to obtain a rough sketch S_r. The rough sketch S_r is then used to train a Pix2Pix model [21], R, to reconstruct the input sketch S_i. The network is trained for 100 epochs with an initial learning rate of 2e-4. The Adam optimizer [23] is used for parameter optimization. During evaluation, the random deformation D1 is discarded.

3D Reconstruction. The 3D reconstruction network is based on [10]'s framework with an hourglass network architecture [36]. We compare several different network architectures (a simple encoder-decoder architecture, a two-prediction-branch architecture, etc.) and find that the hourglass architecture gives the best performance. This may be due to its ability to extract key points from images [2,36]. We train the network for 260 epochs with an initial learning rate of 3e-5. The weight λ of the orthogonal loss is 1e-3. To enhance the performance on every category, all categories of 3D shapes are trained together. Class-aware mini-batch sampling [47] is adopted to ensure a balanced category-wise distribution for each mini-batch. We choose the Adam optimizer [23] for parameter optimization. 3D point clouds are visualized with the rendering tool from [33].
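One possible way to realize this augmentation with torchvision is sketched below; the 256-pixel image size is an assumption used only to convert the ±10 pixel shift into the fractional units that RandomAffine expects, and the white fill assumes sketches drawn on a white background.

```python
import torchvision.transforms as T

IMG_SIZE = 256  # assumed input resolution of the sketch images

augment = T.Compose([
    T.RandomAffine(degrees=10,                                # rotation within ±10°
                   translate=(10 / IMG_SIZE, 10 / IMG_SIZE),  # shift within ±10 pixels
                   fill=255),                                 # keep the background white
    T.ToTensor(),
])
```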
4.4 Results and Comparisons
We first present our model's 3D shape reconstruction performance, along with comparisons with various baseline methods. Then we present the results on sketches from different viewpoints and of different categories, as well as the results on other free-hand sketch datasets. Note that unless specifically mentioned, all evaluations are on free-hand sketches rather than synthesized sketches.

Baseline Methods. Our 3D reconstruction network is a one-stage model where the input sketch is treated as an image, and point clouds represent the output 3D shape. As this is among the first works on single-view sketch-based 3D reconstruction, we explore different design options adopted by previous works on distortion-free line drawings and/or 3D reconstruction, including architectures and the representation of sketches and 3D shapes. We compare with different variants to demonstrate the effectiveness of each choice in our model.

1) Model Design: End-to-End vs. Two-Stage. Although the task of reconstructing 3D shapes from free-hand sketches is new, sketch-to-image synthesis and 3D shape reconstruction from images have been studied before [10,29,56]. Is a straightforward combination of the two models, instead of an end-to-end model, enough to perform well for the task? To compare these two architectures' performance, we implement a cascaded model by composing a sketch-to-image model [66] and an image-to-3D model [10] to reconstruct 3D shapes.
2) Sketch: Point-Based vs. Image-Based. Considering that a sketch is relatively sparse in pixel space and consists of colorless strokes, we can employ 2D point clouds to represent a sketch. Specifically, 512 points are randomly sampled from the strokes of each binarized sketch, and we use a point-to-point network architecture (adapted from PointNet [42]) to reconstruct 3D shapes from the 2D point clouds.

3) Sketch: Using Edge Maps as a Proxy. We compare with a previous work [7]. Our proposed model uses synthetic sketches for training. However, an alternative option is using edge maps as a proxy for free-hand sketches. As edge maps can be generated automatically (we use the Canny edge detector in our implementation), the comparison helps us understand whether our proposed synthesizing method is necessary.

4) 3D Shape: Voxel vs. Point Cloud. We compare with a previous work [56]. In this variant, we follow their settings and represent a 3D shape with voxels. As the voxel representation is adopted from the previous method, the comparison helps to understand whether representing 3D shapes with point clouds has benefits.

5) 3D Shape: Depth Map vs. Point Cloud. In this variant, we exactly follow a previous work [37] and represent the 3D shape with multi-view depth maps.

Comparison and Results. Table 1 and Fig. 5L present quantitative and qualitative results of our method and the different design variants. Specifically, for the quantitative comparisons (Table 1), we report 3D shape reconstruction performance on both synthesized (evaluation set) and free-hand sketches. This is because the collected free-hand sketch dataset is relatively small, and together they provide a more comprehensive evaluation. We have the following observations: a) Representing sketches as images outperforms representing them as 2D point clouds (points vs. ours). b) The model trained on synthesized sketches performs better on real free-hand sketches than the model trained on edge maps (89.0 vs. 86.1 on CD, 16.4 vs. 16.0 on EMD). Training with edge maps can reconstruct a reasonable overall coarse shape. However, the unsatisfactory performance on geometric details reveals that such methods hardly generalize to free-hand sketches with distortions. It also shows the necessity of the proposed sketch generation and standardization modules. c) For model design, the end-to-end model outperforms the two-stage model by a large margin (cas. vs. ours). d) For 3D shape representation, while the voxel representation can reconstruct the general shape well, fine-grained details are frequently missing due to its low resolution (32 × 32 × 32). Thus, point clouds outperform voxels. The proposed method also outperforms a previous work that uses depth maps as the 3D shape representation [37]. Note that the resolution of voxels can hardly be increased much due to the complexity and computational overhead. However, we show that increasing the number of points improves the reconstruction quality (details in the supplementary material).
Fig. 6. Left: Ours (2nd row) versus nearest-neighbor retrieval results (last row) for given sketches. Our model generalizes better to unseen 3D shapes and has higher geometric fidelity. Right: 3D reconstructions of sketches from different viewpoints. Before the view estimation module, the reconstructed 3D shape aligns with the input sketch's viewpoint. The module transforms the pose of the output 3D shape to align with the canonical pose, i.e. the pose of the ground-truth 3D shape.
Fig. 7. Left: Our approach trained on ShapeNet can be directly applied to other unseen sketch datasets [9, 15, 45] and generalizes well. Our model is able to reconstruct 3D shapes from sketches with different styles and line-widths, and even from low-resolution data. Right: Standardized sketches converted from different individual styles (by different volunteers). For each rendered image of a 3D object, we show free-hand sketches from two volunteers and the standardized sketches derived from them. Contents are preserved after the standardization process, and the standardized sketches share a style similar to the synthesized ones.
Retrieval Results. We compare with nearest-neighbor retrievals, following the methods and settings of [49] (Fig. 6L). Our method generalizes to unseen 3D shapes and reconstructs with higher geometric fidelity (e.g., the stand of the lamp).

Reconstruction with Different Categories and Views. Figure 5R shows 3D reconstruction results with sketches from different object categories. Our model reconstructs 3D shapes of multiple categories unconditionally. There are some failure cases that the model may not handle well. Figure 6R depicts reconstructions with sketches from different views. Our model can reconstruct 3D shapes from different views even if certain parts are occluded (e.g. the legs of the table). Slight variations in details exist across different views.

Evaluation on Other Free-Hand Sketch Datasets. We also evaluate on three other free-hand sketch datasets [9,15,45]. Our model can reconstruct 3D shapes from sketches with different styles, line-widths, and levels of distortion, even at low resolution (Fig. 7L).
Table 2. Ablation studies of the standardization and view estimation modules. CD is scaled by 10^4 and EMD by 10^2. (a) 3D shape reconstruction errors for ablations of the standardization and view estimation modules. Having both the standardization and view estimation modules gives the best performance. The lower, the better. (b) Reconstruction error with different components of the standardization module: deformation and style translation. Having both parts gives the best performance. (c) The sketch standardization module improves cross-dataset sketch classification accuracy. A ResNet-50 model is trained on TU-Berlin [9] and evaluated on the Sketchy dataset [45]. The sketch standardization module gives a 3-percentage-point gain.
(c)
(b)
error no standard. no view est. ours
deform. trans.
CD EMD
(%)
acc.
CD
92.6
86.8
86.1
×
×
92.6
18.2
w/o std. 75.1
EMD
18.2
16.2
16.0
×
87.2
16.3
w/ std. 78.1
×
90.1
17.4
86.1 16.0
4.5 Sketch Standardization Module
Visualization. The standardization module can be considered as a domain translation module designed for sketches. We show the standardized sketches of free-hand sketches and compare them to the synthesized ones in Fig. 7R. With the standardization module, sketches share a style similar to the synthesized sketches used as training data. Thus, standardization diminishes the domain gap among sketches with various styles and enhances generalizability.

Ablation Studies of the Entire Module. The sketch standardization module is introduced to handle the various drawing styles of humans. We thus verify this module's effectiveness on real sketches, both quantitatively (Table 2a) and qualitatively (Fig. 5R). As shown in Table 2a, the reconstruction performance drops significantly when the standardization module is removed. Its effect is also evident in the visualizations. In Fig. 5R, we can observe that our full model equipped with the standardization module produces 3D shapes of higher quality that are more similar to the GT shapes, e.g., the airplane and the lamp.

Ablation Studies of Different Components. The standardization module consists of two components: sketch deformation and style translation. We study each component's contribution and report the results in Table 2b. We observe that the style translation part improves the reconstruction performance more than the deformation part, while having both parts gives the best performance.

Additional Applications. We show the effectiveness of the proposed sketch standardization with two more applications. The first application is cross-dataset sketch classification. We identify the 98 common categories of the TU-Berlin sketch dataset [9] and the Sketchy dataset [45]. We then train on TU-Berlin and evaluate on Sketchy. As reported in Table 2c, adding the sketch standardization module improves the classification accuracy by 3 percentage points.
Fig. 8. Zero-shot domain translation results. We aim to translate zero-shot images (Left) to the training data domain (Right). Specifically, we evaluate the proposed zero-shot domain translation performance on three new datasets: UNDD [34], NightTime Driving [6] and GTA [44]. The novel domains of night-time and simulated images can be translated to the target domain of daytime and real-world images by leveraging the synthetic edge map domain as a bridge. The target domain is CityScapes dataset [5]. We extract corresponding edge maps and train an image-to-image translation model [21] to translate edge maps to the corresponding RGB images. 1st, 3rd, 5th, 7th rows of column 1 depict some sample training RGB images and 2nd, 4th, 6th, 8th rows of column 1 depict the corresponding edge maps respectively.
The second application corresponds to the second example depicted in Fig. 4. The target domain is the CityScapes dataset [5], where the training data comes from. We extract corresponding edge maps with a deep learning approach [41] and train an image-to-image translation model [21] to translate edge maps to the corresponding RGB images. We evaluate the zero-shot domain translation performance on three new datasets: UNDD [34] (night images), Night-Time Driving [6] (night images) and GTA [44] (synthetic images; screenshots taken from a simulated environment). The novel domains of night and simulated images can be translated to the target domain of daytime and real-world images. We visualize the results in Fig. 8.
4.6 View Estimation Module
Removing the view estimation module leads to a performance drop of CD and EMD (Table 2a). For qualitative results (Fig. 6R), without the 3D rotation, the reconstructed 3D shape has the pose aligned with the input sketch. With the 3D rotation, the 3D shape is aligned to the ground truth’s canonical pose.
5 Summary
We study 3D shape reconstruction from a single-view free-hand sketch. The major novelty is that we use synthesized sketches as training data and introduce a sketch standardization module, in order to tackle the data insufficiency and sketch style variation issues. Extensive experimental results show that the
proposed method is able to successfully reconstruct 3D shapes from single-view free-hand sketches, unconditioned on viewpoints and categories. The work may unleash more of the potential of sketches in applications such as sketch-based 3D design and games, making them more accessible to the general public.
References 1. Alexiadis, D.S., et al.: An integrated platform for live 3D human reconstruction and motion capturing. IEEE Trans. Circuits Syst. Video Technol. 27(4), 798–813 (2016) 2. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 3. Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015) 4. Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8 38 5. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 6. Dai, D., Van Gool, L.: Dark model adaptation: Semantic image segmentation from daytime to nighttime. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3819–3824. IEEE (2018) 7. Delanoy, J., Aubry, M., Isola, P., Efros, A.A., Bousseau, A.: 3D sketching using multi-view deep volumetric prediction. Proc. ACM Comput. Graph. Interact. Tech. 1(1), 1–22 (2018) 8. Dibra, E., Jain, H., Oztireli, C., Ziegler, R., Gross, M.: Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 9. Eitz, M., Hays, J., Alexa, M.: How do humans sketch objects? ACM Trans. Graph. 31(4), 44:1–44:10 (2012) 10. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 11. Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rend´ on-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. 43(1), 55–81 (2015) 12. Gao, X., Wang, N., Tao, D., Li, X.: Face sketch-photo synthesis and retrieval using sparse representation. IEEE Trans. Circuits Syst. Video Technol. 22(8), 1213–1226 (2012) 13. Ghosh, A., et al.: Interactive sketch & fill: multiclass sketch-to-image translation. In: Proceedings of the IEEE International Conference on Computer Vision (2019) 14. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014) 15. Google: The quick, draw! Dataset (2017). quickdraw.withgoogle.com/data 16. Guillard, B., Remelli, E., Yvernay, P., Fua, P.: Sketch2Mesh: reconstructing and editing 3D shapes from sketches. arXiv preprint arXiv:2104.00482 (2021)
17. Han, Z., Ma, B., Liu, Y.S., Zwicker, M.: Reconstructing 3D shapes from multiple sketches using direct shape optimization. IEEE Trans. Image Process. 29, 8721– 8734 (2020) 18. H¨ ane, C., Tulsiani, S., Malik, J.: Hierarchical surface prediction for 3D object reconstruction. In: 2017 International Conference on 3D Vision. IEEE (2017) 19. He, X., Zhou, Y., Zhou, Z., Bai, S., Bai, X.: Triplet-center loss for multi-view 3D object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 20. Huang, H., Kalogerakis, E., Yumer, E., Mech, R.: Shape synthesis from sketches via procedural models and convolutional networks. IEEE Trans. Visual Comput. Graphics 23(8), 2003–2013 (2016) 21. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 22. Kar, A., H¨ ane, C., Malik, J.: Learning a multi-view stereo machine. In: Advances in Neural Information Processing Systems (2017) 23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 24. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013) 25. Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) 26. Lei, J., Song, Y., Peng, B., Ma, Z., Shao, L., Song, Y.Z.: Semi-heterogeneous threeway joint embedding network for sketch-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 30(9), 3226–3237 (2019) 27. Li, C., Pan, H., Liu, Y., Tong, X., Sheffer, A., Wang, W.: BendSketch: modeling freeform surfaces through 2D sketching. ACM Trans. Graph. 36(4), 1–14 (2017) 28. Li, C., Pan, H., Liu, Y., Tong, X., Sheffer, A., Wang, W.: Robust flow-guided neural prediction for sketch-based freeform surface modeling. ACM Trans. Graph. (TOG) 37(6), 1–12 (2018) 29. Liu, R., Yu, Q., Yu, S.: An unpaired sketch-to-photo translation model. arXiv preprint arXiv:1909.08313 (2019) 30. Lun, Z., Gadelha, M., Kalogerakis, E., Maji, S., Wang, R.: 3D shape reconstruction from sketches via multi-view convolutional networks. In: 2017 International Conference on 3D Vision (3DV), pp. 67–77. IEEE (2017) 31. Mandikal, P., Navaneet, K., Agarwal, M., Babu, R.V.: 3D-LMNet: latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. arXiv preprint arXiv:1807.07796 (2018) 32. Mandikal, P., Radhakrishnan, V.B.: Dense 3D point cloud reconstruction using a deep pyramid network. In: 2019 IEEE Winter Conference on Applications of Computer Vision. IEEE (2019) 33. Mo, K., Guerrero, P., Yi, L., Su, H., Wonka, P., Mitra, N., Guibas, L.J.: StructureNet: hierarchical graph networks for 3D shape generation. ACM Trans. Graph. 38(6) (2019) 34. Nag, S., Adak, S., Das, S.: What’s there in the dark. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 2996–3000. IEEE (2019) 35. Navaneet, K., Mandikal, P., Agarwal, M., Babu, R.V.: CapNet: continuous approximation projection for 3D point cloud reconstruction using 2D supervision. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8819–8826 (2019)
36. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-31946484-8 29 37. Nozawa, N., Shum, H.P., Ho, E.S., Morishima, S.: Single sketch image based 3D car shape reconstruction with deep learning and lazy learning. In: VISIGRAPP (1: GRAPP), pp. 179–190 (2020) ¨ 38. Ozye¸ sil, O., Voroninski, V., Basri, R., Singer, A.: A survey of structure from motion*. Acta Numer. 26, 305–364 (2017) 39. Pan, J., Han, X., Chen, W., Tang, J., Jia, K.: Deep mesh reconstruction from single RGB images via topology modification networks. In: Proceedings of the IEEE International Conference on Computer Vision (2019) 40. Peng, C., Gao, X., Wang, N., Li, J.: Superpixel-based face sketch-photo synthesis. IEEE Trans. Circuits Syst. Video Technol. 27(2), 288–299 (2015) 41. Poma, X.S., Riba, E., Sappa, A.: Dense extreme inception network: towards a robust CNN model for edge detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1923–1932 (2020) 42. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 43. Richter, S.R., Roth, S.: Discriminative shape from shading in uncalibrated illumination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 44. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: ground truth from computer games. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 102–118. Springer, Cham (2016). https://doi.org/10. 1007/978-3-319-46475-6 7 45. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph. 35(4), 1–12 (2016) 46. Schaefer, S., McPhail, T., Warren, J.: Image deformation using moving least squares. In: ACM SIGGRAPH 2006 Papers, pp. 533–540 (2006) 47. Shen, L., Lin, Z., Huang, Q.: Relay backpropagation for effective learning of deep convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 467–482. Springer, Cham (2016). https://doi. org/10.1007/978-3-319-46478-7 29 48. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 49. Wang, F., Kang, L., Li, Y.: Sketch-based 3d shape retrieval using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 50. Wang, L., Qian, C., Wang, J., Fang, Y.: Unsupervised learning of 3D model reconstruction from hand-drawn sketches. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1820–1828 (2018) 51. Wang, L., Qian, X., Zhang, X., Hou, X.: Sketch-based image retrieval with multiclustering re-ranking. IEEE Trans. Circuits Syst. Video Technol. 30(12), 4929–4943 (2019) 52. Wang, N., Gao, X., Sun, L., Li, J.: Anchored neighborhood index for face sketch synthesis. IEEE Trans. Circuits Syst. Video Technol. 28(9), 2154–2163 (2017)
53. Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.-G.: Pixel2Mesh: generating 3D mesh models from single RGB images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 55–71. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6 4 54. Witkin, A.P.: Recovering surface shape and orientation from texture. Artif. Intell. 17(1–3), 17–45 (1981) 55. Wu, J., Zhang, C., Xue, T., Freeman, B., Tenenbaum, J.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: Advances in Neural Information Processing Systems (2016) 56. Xie, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2Vox: context-aware 3D reconstruction from single and multi-view images. In: Proceedings of the IEEE International Conference on Computer Vision (2019) 57. Xu, B., Chang, W., Sheffer, A., Bousseau, A., McCrae, J., Singh, K.: True2Form: 3D curve networks from 2D sketches via selective regularization. Trans. Graph. 33(4) (2014). 2601097.2601128 58. Xu, P., et al.: Fine-grained instance-level sketch-based video retrieval. IEEE Trans. Circuits Syst. Video Technol. 31(5), 1995–2007 (2020) 59. Yang, B., Rosa, S., Markham, A., Trigoni, N., Wen, H.: Dense 3D object reconstruction from a single depth view. IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2820–2834 (2018) 60. Yang, G., Huang, X., Hao, Z., Liu, M.Y., Belongie, S., Hariharan, B.: PointFlow: 3D point cloud generation with continuous normalizing flows. In: Proceedings of the IEEE International Conference on Computer Vision (2019) 61. Yang, S., Wang, Z., Liu, J., Guo, Z.: Deep plastic surgery: robust and controllable image editing with human-drawn sketches. arXiv preprint arXiv:2001.02890 (2020) 62. Yu, Q., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M., Loy, C.C.: Sketch me that shoe. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 63. Zhang, S., Gao, X., Wang, N., Li, J.: Face sketch synthesis from a single photosketch pair. IEEE Trans. Circuits Syst. Video Technol. 27(2), 275–287 (2015) 64. Zhang, S.H., Guo, Y.C., Gu, Q.W.: Sketch2Model: view-aware 3D modeling from single free-hand sketches. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6012–6021 (2021) 65. Zhang, Y., Liu, Z., Liu, T., Peng, B., Li, X.: RealPoint3D: an efficient generation network for 3D object reconstruction from a single image. IEEE Access 7, 57539– 57549 (2019) 66. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Abstract Images Have Different Levels of Retrievability Per Reverse Image Search Engine

Shawn M. Jones and Diane Oyen
Los Alamos National Laboratory, Los Alamos, NM 87545, USA {smjones,doyen}@lanl.gov
Abstract. Much computer vision research has focused on natural images, but technical documents typically consist of abstract images, such as charts, drawings, diagrams, and schematics. How well do general web search engines discover abstract images? Recent advancements in computer vision and machine learning have led to the rise of reverse image search engines. Where conventional search engines accept a text query and return a set of document results, including images, a reverse image search accepts an image as a query and returns a set of images as results. This paper evaluates how well common reverse image search engines discover abstract images. We conducted an experiment leveraging images from Wikimedia Commons, a website known to be well indexed by Baidu, Bing, Google, and Yandex. We measure how difficult an image is to find again (retrievability), what percentage of images returned are relevant (precision), and the average number of results a visitor must review before finding the submitted image (mean reciprocal rank). When trying to discover the same image again among similar images, Yandex performs best. When searching for pages containing a specific image, Google and Yandex outperform the others when discovering photographs, with precision scores of 0.8191 and 0.8297, respectively. In both of these cases, Google and Yandex perform better with natural images than with abstract ones, achieving a difference in retrievability as high as 54% between images in these categories. These results affect anyone applying common web search engines to search for technical documents that use abstract images.
Keywords: Image similarity · Image retrieval

1 Introduction
Technical documents, such as manuals and research papers, contain visualizations in the form of abstract images such as charts, diagrams, schematics, and illustrations. Yet, technical information is often searched and retrieved by text alone [12,17,40]. Meanwhile, computer vision has significantly advanced the ability to search and retrieve images [28,52], including sketch-based retrieval
(a) schematic, from left: Bipolar Hi Voltage DC, Bevel Gear, diving rebreather, carriage house
(b) diagram, from left: Galileo spacecraft, neuron, bicycle, systems engineering V diagram
(c) photo, from left: manatee, downtown Toronto 1890, James Garfield, World Trade Center ground zero
(d) photograph, from left: the Snake River, sunrise over Lake Michigan, New York City, photographing a model

Fig. 1. Example images from Wikimedia Commons for our image categories. The categories of schematic and diagram represent abstract images while photo and photograph represent natural ones.
Fig. 2. Baidu, Bing, Google, and Yandex support similar outputs if a visitor uploads an image file. A visitor can choose to view images similar to the input, or view the pages with the image that they uploaded.
[54,57,59,75]. Yet there are few computer vision research papers focused on querying and retrieving abstract, technical drawings [38,51,69], and so it is unclear whether the computer vision approaches that are successful with general images are similarly successful with technical drawings. Rather than evaluating computer vision algorithms directly, we used common search engines to evaluate whether these black-box systems (presumably incorporating state-of-the-art image retrieval algorithms) work as well in retrieving diagrams as they do with retrieving photographs. Our goal is to highlight the gap between the success of searching general images on the web and the capability to search images
containing technical information on the web. We achieve this goal through the development of a method to quantify the retrievability of schematics (Fig. 1a) and diagrams (Fig. 1b) – abstract images – versus photos (Fig. 1c) and photographs (Fig. 1d) – natural images – using common search engines. In this work, we provide an answer to the following research question: When using the reverse image search capability of general web search engines, are natural images more easily discovered than abstract ones? Our experiment uses the reverse image search capability of Baidu, Bing, Google, and Yandex. Rather than submitting textual queries and receiving images as search engine results (SERs), with reverse image search we upload image files. As seen in Fig. 2, we uploaded an image of the Snake River to all four search engines. From here, a visitor can choose to explore images that are similar-to their uploaded image, or they can view the list of pages-with the uploaded image, including images that are highly visually similar. We used Wikimedia Commons as a source of images to upload. Prior work has focused on the presence of Wikipedia content in Google SERs [44] as well as those from DuckDuckGo and Bing [66]. These efforts established that Wikipedia articles have high retrievability for certain types of queries. Search engines discover images through their source documents. Thus we assume that the images in Wikipedia documents are equally retrievable. We experimented with these four search engines. We acquired 200 abstract images from Wikimedia Commons with the search terms diagram and schematic and 199 natural images with the search terms photo and photograph. We submitted these images to the reverse image search engines at Baidu, Bing, Google, and Yandex. From there, we evaluated how often the search engine returned the same image in the results, as established by perceptual hash [36,49,77]. We evaluated the SERs in terms of precision, mean reciprocal rank [67], and retrievability [7]. Yandex performed best in all cases. When searching for pages containing a specific image, Google and Yandex outperform the others when discovering photographs, with precision scores of 0.8191 and 0.8297, respectively. In both of these cases, Google and Yandex perform better with natural images than with abstract ones, achieving a difference in retrievability as high as 54% between images in these categories. These results indicate a clear difference in capability between natural images and abstract images, which will affect a technical document reader's ability to complete the task of discovering similar images or other documents containing an image of interest.
2 Background
Our study primarily focuses on using image files as queries, which we refer to as query images. Using image files as queries makes our work different from many other studies evaluating public search engines. Similar constructs exist within the SERs at the search engines in this study. As noted above, these four search engines support similar-to and pages-with SERs. The screenshots in Fig. 2 demonstrate how a visitor is presented with summaries of these options
before they choose the type of SER they would like to explore further. Each SER contains an image URL that identifies the image indexed by the search engine. Each SER also provides a page URL identifying the page where a visitor can find that image in use. If the image from the image URL is no longer available, each SER also contains a thumbnail URL that represents a smaller version of the image. The search engine hosts these thumbnails to speed up the presentation of results.

2.1 Metrics to Evaluate Search Engines

\text{Precision@}k = \begin{cases} \frac{|\text{relevant SERs} \cap \text{retrieved SERs}|}{k} & \text{if } k < |\text{retrieved SERs}| \\ \frac{|\text{relevant SERs}|}{|\text{retrieved SERs}|} & \text{otherwise} \end{cases}    (1)
We apply different metrics to measure the performance of the different Reverse Image Search engines. Equation 1 demonstrates how to calculate precision at a cutoff k, also written p@k. This measure helps us understand the performance of the search engine by answering the question “What is the percentage of images in the SERs that are relevant if we stop at k results?” If the most relevant results exist at the beginning of a set of SERs, then precision decreases as the value of k increases.

\[
r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c) \tag{2}
\]
Second, we want to understand if the search engine returned the query image within a given set of results. Equation 2 shows how to calculate Azzopardi’s retrievability [7]. Retrievability helps us answer the question, “Given a document, was it retrieved within the cutoff c?” Retrievability leverages a cutoff c, similar to p@k’s k, to indicate how many results the user reviews before stopping. Its cost function f(k_dq, c) returns 1 if document d is found within cutoff c for query q, and 0 otherwise. As we increase c, we increase the chance of the search engine returning a relevant document; hence retrievability increases.

\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \tag{3}
\]
Finally, we measure a visitor’s effort before finding their query image in the results. Equation 3 demonstrates how to calculate mean reciprocal rank (MRR). MRR helps us answer the question “How many results, on average across all queries, must a visitor review before finding a relevant one?” For example, if the first result is relevant for an individual query, its reciprocal rank is 1/1. Its reciprocal rank is 1/2 if the second is relevant. If the third is relevant, it is 1/3. MRR is the mean of these reciprocal ranks across all queries. An MRR of 1.0 indicates that the first result was relevant for all queries. An MRR of 0 indicates that no queries returned a relevant result.
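For concreteness, a minimal Python sketch of how these three metrics could be computed for a single ranked result list follows. The relevance flags and first-relevant ranks are assumed to come from the perceptual-hash comparison described in the next subsection, and the single-query, binary form of retrievability (with opportunity weight o_q = 1) is an assumption of this sketch rather than the authors' exact implementation.

```python
def precision_at_k(relevant_flags, k):
    """Precision@k for one query (Eq. 1). `relevant_flags` lists, in rank
    order, whether each retrieved SER matched the query image."""
    retrieved = len(relevant_flags)
    denominator = k if k < retrieved else retrieved
    return sum(relevant_flags[:k]) / denominator if denominator else 0.0

def retrievability(relevant_flags, c):
    """Binary retrievability (Eq. 2) for a single query with o_q = 1:
    1 if the query image appears within cutoff c, 0 otherwise."""
    return 1 if any(relevant_flags[:c]) else 0

def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over all queries (Eq. 3). Each entry is the 1-based rank of the
    first relevant result, or None when nothing relevant was returned
    (contributing a reciprocal rank of 0, as described above)."""
    reciprocal = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal) if reciprocal else 0.0

# Example: one query whose first and third results matched the query image.
flags = [True, False, True, False]
assert precision_at_k(flags, 2) == 0.5
assert retrievability(flags, 1) == 1
assert mean_reciprocal_rank([1, None, 3]) == (1.0 + 0.0 + 1/3) / 3
```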
2.2 Similarity Approaches to Establish Relevance
Each of these metrics requires that we identify which SERs are relevant. In our case, “relevant” means that it is the same image we uploaded. To determine if an image is relevant, we apply a perceptual hash [77] to each image. This perceptual hash provides a non-unique hash describing the content of the image. We compute the distance between two hashes to determine if two images are the same. If they score below some threshold, then we consider them the same. Perceptual hashes are intended to be a stand-in for human evaluation of the similarity of two images. Many perceptual hash algorithms exist. For this work, we apply ImageHash’s [10] implementation of pHash [36] and GoFigure’s [50] implementation of VisHash [49]. ImageHash’s pHash, hereafter referred to as pHash, was designed to compare photographs. It applies Discrete Cosine Transforms (DCT). A user compares pHashes by computing the Hamming distance between them. As established by Krawetz [36], we use a distance of 5 bits to indicate that two images are likely the same. pHash [36,77] demonstrates improvements over simpler algorithms such as the average hash algorithm (aHash), which computes a hash based on the mean of color values in each pixel. Fei et al. [23], Chamoso et al. [13], Arefkhani and Soryani [3], and Vega et al. [64] have demonstrated that pHash is more accurate than aHash in a variety of applications. These studies also evaluated the difference hashing (dHash) algorithm [37] and found that it performs almost as well as pHash. VisHash [49] was designed as an extension to dHash with a focus on providing a similarity metric for diagrams and technical drawings. VisHash relies more heavily on the shapes in the image, whereas pHash relies on frequencies. A user compares VisHashes by computing a normalized L2 distance between them. As established by Oyen et al. [49], we use a distance of ≤0.3 to determine if a query image is the same as the one in its SER. VisHash has fewer collisions than pHash and fewer false positive matches, especially for diagrams and technical drawings.
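As an illustration, a minimal sketch of the pHash comparison with the ImageHash package follows; the file paths are placeholders, and a VisHash comparison would follow the same pattern with the GoFigure implementation (normalized L2 distance, threshold 0.3), whose API is not shown here.

```python
from PIL import Image
import imagehash  # the ImageHash package [10]

# pHash comparison: two images are treated as "the same" when the Hamming
# distance between their 64-bit perceptual hashes is at most 5 bits [36].
query_hash = imagehash.phash(Image.open("query.png"))      # placeholder path
thumb_hash = imagehash.phash(Image.open("thumbnail.png"))   # placeholder path
same_image = (query_hash - thumb_hash) <= 5  # '-' gives the Hamming distance
```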
3 Related Work
Precision and MRR [18,42] are well-known metrics applied in evaluating search engines, especially with the established TREC dataset [68]. Retrievability is a newer metric. Azzopardi conducted several studies [5–7,14,70–74] to support the effectiveness of the retrievability metric with different search engines and corpora. Retrievability has been used to evaluate simulated user queries [63], assess the impact of OCR errors on information retrieval [60], examine retrievability in web archives [32,56], and expose issues with retrievability of patents and legal documents [8]. These studies give us confidence in this metric’s capability. As far as we know, we are the first to apply it to image SERs. In a 2013 manual evaluation, Nieuwenhuysen [47] found Google’s image search to have higher precision than competitor Tineye (https://tineye.com/). They repeated their
manual study in 2018 [48], and included Yandex but still discovered that Google performed best. In 2015, Kelly [35] conducted a similar study and found that both Google and Tineye performed poorly. In 2020, Bitirim et al. [9] manually evaluated Google’s reverse image search capability using 25 query images and declared its precision poor (52%) across multiple categories of items. Our study differs because we use 380 query images and use perceptual hashes to evaluate similarity, thus avoiding human disagreements about relevance. We also include both types of SER results. d’Andréa and Mintz [20,21] noted that reverse image searches could help researchers study events for which they do not speak the language but wish to find related sources. Similar studies were conducted on images about repealing Ireland’s 8th Amendment [19], Dutch politics [34], biological research [41], disinformation/misinformation [1,4,27], dermatologic diagnoses [31,53,58], and the spread of publicly available images [61]. Our work differs because it evaluates the search engines themselves for retrievability rather than as a tool to address another research question. Horváth [30] applied reverse image search to construct a dataset useful for training classifiers. Similarly, Guinness et al. [29] performed reverse image searches with Google and scraped captions for the visually impaired. Chutel and Sakhare [15,16] summarized existing reverse image search techniques. Gaillard and Egyed-Zsigmond [24,25], Araujo et al. [2], Mawoneke et al. [43], Diyasa et al. [22], Veres et al. [65], and Gandhi et al. [26] evaluated their own reverse image search algorithms. Rather than evaluating our own system, we evaluate how well state-of-the-art, publicly available reverse image search engines function for discovering the same image again. We are more interested in whether the type of image matters to the results. Perceptual hashes have been applied to verify images converted for digital preservation [62], to detect plagiarism [45], to evaluate machine learning results [33], and to uncover disinformation [76]. Alternatives to our chosen pHash and VisHash methods include work by Ruchay et al. [55], Monga and Evans [46], Cao et al. [11], and Lei et al. [39]. We chose ImageHash’s pHash for this preliminary study due to its prevalence in other literature and its robust implementation as part of ImageHash. We selected GoFigure’s VisHash as an alternative for comparison because it was specifically designed for drawings and diagrams.
4 Methods
To conduct this experiment, we needed a set of query images that we could reasonably be assured were indexed by search engines, so we turned to Wikimedia Commons. We wanted images from the categories of diagram and schematic (abstract imagery) as well as photo and photograph (natural images), so we submitted these search terms to the Wikimedia Commons API in January 2022 and recorded the resultant images. Figure 1 shows example Wikimedia Commons images from each of these queries. Those returned from the search terms diagram or schematic reflect abstract imagery. Those returned from the search
terms photo and photograph represent natural images. We took the first 100 images returned from each category. One image failed to download. Two images overlap between the schematic and diagram categories. Seventeen images overlap between the photo and photograph categories. Of the 400 Wikimedia Commons images returned by the API, we were left with 380 unique images to use as queries for the reverse image search engines. Because the images have many different sizes and search engines have file size limitations for uploaded images, we asked the Wikimedia API to generate images that were 640px wide, scaling their height to preserve their aspect ratio.

Table 1. The number of queries for which we acquired similar-to results

             Baidu  Bing  Google  Yandex  Category total
Diagram         82    83      79      97  100
Schematic       74    87      82     100  100
Photo           80    78      72      93  100
Photograph      77    85      79      94   99
Total          313   333     310     384  399 (380 unique)
Table 2. The number of queries for which we acquired pages-with results

             Baidu  Bing  Google  Yandex  Category total
Diagram         64    79      97      94  100
Schematic       54    57      87      95  100
Photo           38    44      71      94  100
Photograph      38    50      76      94   99
Total          194   230     331     377  399 (380 unique)
We developed scraping programs to submit images and extract both pages-with and similar-to SERs from Baidu, Bing, Google, and Yandex. To provide the best evaluation, we removed any context from the image files that might impart additional metadata to the search engine. To avoid the possibility that the search engines might apply the filename to their analysis, we changed the file name of each image to a generic file.ext, where ext reflected the appropriate extension for the content type. We opted to use each search engine’s file upload feature rather than providing the URLs directly from Wikimedia to ensure that the URL did not influence the results. We ran the scraping programs in June 2022 to successfully acquire the number of results shown in Tables 1 and 2.
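A sketch of the image-acquisition step is shown below, under the assumption that the standard MediaWiki imageinfo endpoint (with iiurlwidth for server-side scaling) was used to obtain the 640px renditions; the file title and the neutral destination name are illustrative placeholders rather than the exact values used.

```python
import mimetypes
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def fetch_640px(file_title, dest_stem="file"):
    """Request a 640px-wide rendition of a Commons file and save it under a
    neutral name so the filename cannot leak metadata to the search engine.
    `file_title` (e.g. "File:Example diagram.svg") is a placeholder."""
    params = {
        "action": "query", "format": "json", "titles": file_title,
        "prop": "imageinfo", "iiprop": "url",
        "iiurlwidth": 640,   # server-side scaling preserves the aspect ratio
    }
    pages = requests.get(COMMONS_API, params=params).json()["query"]["pages"]
    thumb_url = next(iter(pages.values()))["imageinfo"][0]["thumburl"]
    resp = requests.get(thumb_url)
    ext = mimetypes.guess_extension(resp.headers.get("Content-Type", "")) or ".bin"
    path = dest_stem + ext              # e.g. "file.png" or "file.jpg"
    with open(path, "wb") as fh:
        fh.write(resp.content)
    return path
```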
We submitted the same 380 query images to each search engine’s scraping program and captured the pages-with and similar-to results. We recorded the SER URL, page URL, image URL, thumbnail URL, and position of a maximum of 100 SERs from each search engine. In some cases, we did not have 100 SERs. Because downloading image URLs resulted in some images that were no longer available, we instead downloaded all 154,191 thumbnail images, which were more reliable. Thumbnail sizes differ by search engine, ranging from 61px wide with Google’s pages-with results to 480px tall with Yandex’s similar-to results. All of our metrics require a relevance determination for each result. Due to the scale, rather than applying human judgment, we considered an SER relevant if the distance between its thumbnail image and the query image fell under the VisHash or pHash thresholds mentioned earlier. We encountered issues downloading 8,268/154,191 (5.362%) thumbnail images. VisHash encountered issues processing 170/154,191 (0.110%) thumbnail images. Because these results could not be analyzed, we treated them as not relevant for metrics purposes.
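The per-SER relevance decision, including the treatment of missing or unreadable thumbnails as not relevant, can be sketched as follows (pHash variant only; a VisHash variant would swap in the VisHash distance and the 0.3 threshold).

```python
from PIL import Image
import imagehash

def ser_is_relevant(query_path, thumbnail_path, max_bits=5):
    """One SER counts as relevant if its thumbnail hashes to within 5 bits of
    the query image; download or processing failures count as not relevant."""
    try:
        q = imagehash.phash(Image.open(query_path))
        t = imagehash.phash(Image.open(thumbnail_path))
    except Exception:
        return False   # unreadable or missing thumbnail
    return (q - t) <= max_bits
```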
5 Results and Discussion
Tables 1 and 2 list the number of queries that returned results. These numbers are close to the total number of images per category (right column), giving us confidence in our scrapers’ ability. In the rest of this section, we break down the results for the SERs with results. Precision helps us understand how many relevant images were found within the cutoff k. Figure 3 features the average precision scores for these queries with VisHash. Figure 4 shows the same for pHash. Green lines represent natural images in the categories of photo and photograph. Blue lines represent abstract images in the categories of diagram and schematic. We keep these colors consistent across all figures in this paper. We compute precision for each value of the cutoff k within the first 10 SERs. A score of 1.0 is ideal. We see similar patterns with both image similarity measures (pHash and VisHash). For detecting if the same image exists in the result, recall that we selected a VisHash threshold ≤0.3 as recommended by Oyen et al. [49] and a pHash threshold of ≤5 bits as recommended by Krawetz [36]. For similar-to results, we see that Baidu and Bing start with higher average precisions for diagram than other categories, with VisHash showing higher scores. Google’s precision is lower than 0.15 regardless of category, and the drop is more severe with pHash than with VisHash. Yandex starts strong with the highest precision. For pages-with results, the natural image categories score the highest precision across all search engines except for Bing. Google’s precision scores are much higher with pages-with than with similar-to and are competitive with Yandex. For photograph at k = 1 and pages-with results, Google scores a precision of 0.8158, which is much better than the 0.52 discovered by Bitirim et al. [9] in 2020. Yandex performs slightly better than Google with photograph at k = 1 and pages-with results, scoring a precision of 0.8280. For Google, the difference in precision at k = 1
Fig. 3. Average precision at cutoff k between the values of 1 and 10 for Wikimedia Commons query images, using a VisHash threshold of 0.3. Error bars represent standard error at that value of k.
Fig. 4. Average precision at cutoff k between the values of 1 and 10 for Wikimedia Commons query images, using a pHash threshold of 5 bits. Error bars represent standard error at that value of k.
Fig. 5. Mean retrievability at cutoff c between the values of 1 and 10 for the Wikimedia Commons query images, using a VisHash threshold of 0.3. Error bars represent standard error at that value of c.
Fig. 6. Mean retrievability at cutoff c between the values of 1 and 10 for the Wikimedia Commons query images, using a pHash threshold of 5 bits. Error bars represent standard error at that value of c.
Fig. 7. MRR scores for search engine results at VisHash distance ≤0.3
Fig. 8. MRR scores for search engine results at pHash distance ≤5 bits
and pages-with results is greatest at 0.4491 between abstract images and natural images, with diagram at 0.3667 and photograph at 0.8158. Retrievability helps us determine if our query image was found within the cutoff c. Figure 5 demonstrates the mean retrievability of images within the first ten results, as compared by VisHash. Figure 6 does the same for pHash. A score of 1.0 is ideal. For similar-to results, we see the highest mean retrievability scores with diagram for Baidu and Bing. Google performs poorly here, regardless of the image similarity method. Yandex performs best, but better for natural images compared to abstract ones. At c = 1, Yandex experiences a maximum of 27% difference in retrievability between photograph (0.8085) and diagram (0.5417) when measured with VisHash. This difference goes down to 19% when measured with pHash. When comparing the retrievability of pages-with results, for the most part, we see that all search engines tend to favor natural images over abstract ones. Bing has the poorest retrievability (less than 0.21) regardless of the image similarity method, so it is difficult to determine if it truly favors images in the category of diagram. Yandex has high retrievability for natural images at c = 1, but Google’s retrievability passes Yandex by c = 10. Google’s and Yandex’s retrievability is higher than the others for abstract images. At c = 1, Yandex has a maximum 32% difference in retrievability between photograph (0.8297) and diagram (0.5054) for pages-with results when measured with VisHash. At c = 1, Google has a maximum 48% difference in retrievability between photograph (0.8157) and diagram (0.3402)
for pages-with results when measured with pHash. This number reaches 54% by c = 10. Our final measure, MRR, establishes how many results a visitor must review before finding the query image. A score of 1.0 is ideal because it indicates that the query image was found as the first result every time. Figure 7 shows a series of bar charts demonstrating the MRR scores established with VisHash. Figure 8 shows the same with pHash. We represent each search engine with four bars, each bar representing an image category’s performance. Again, green represents natural image categories while blue represents abstract images. Yandex is the clear winner regardless of image similarity measure for similar-to results. Yandex consistently provides an MRR score of around 0.8 for natural images. Bing and Baidu are competitive for second place, and Google does not perform well for this result type. For pages-with results, Google performs competitively and scores slightly higher than Yandex in some cases. Baidu comes in a distant third, and Bing does not appear to be competitive. Search engines crawl the web and add each page (and its images) to their search indexes. Perhaps some of these search engines have not indexed the query image, which may explain the disparity in results. Unfortunately, it is difficult to assess the size of a search engine’s index. Assuming that each search engine encountered these images via Wikipedia, we can search for the term “Wikipedia” in each search engine and discover how many pages the search engine has indexed with that term. Each page represents an opportunity to index an image from Wikimedia Commons. Using this method, the search engines, in order of result count, are Google (12.5 billion), Bing (339 million), Baidu (89.8 million), and Yandex (5 million). Assuming these numbers are an accurate estimate, different index sizes do not explain the disparity in our results because Yandex, with the smallest number of indexed Wikipedia pages, still scores best by every measure. Through our black-box analysis, by all measures, Google and Yandex perform best when returning pages-with image results, and they score best with natural images. Yandex performs best with similar-to image results and scores best with natural images. Google’s results for similar-to may indicate that it is not trying to return the query image at all, instead favoring diversity over similarity.
6 Future Work
Here we report the preliminary results of a much larger experiment. We aim to build a stable dataset of Wikimedia Commons images that all four search engines have indexed; thus, we intend to repeat this experiment with more images. Once we have this list of images, we intend to conduct a follow-on study where we transform the images (e.g., cropping, resizing, rotating, color changes) to see how well the search engines perform despite each transformation.
7 Conclusion
Technical documents typically contain abstract images, but users often employ text queries to discover technical information. Computer vision has advanced such that we no longer need to rely only on text queries to retrieve images and find associated content. We can leverage the reverse image search capability of many general web search engines. Reverse image search allows one to upload an image as a query and review other images as search engine results. Few computer vision papers focus on retrieving abstract imagery; thus, we are uncertain if computer vision approaches that are successful with natural images perform as well with abstract ones. We leveraged the reverse image search capabilities of Baidu, Bing, Google, and Yandex to evaluate these black-box systems to answer the research question When using the reverse image search capability of general web search engines, are natural images more easily discovered than abstract ones? We experimented with 200 abstract images and 199 natural images from Wikimedia Commons. Using scrapers, we submitted these to the reverse image search engines of Baidu, Bing, Google, and Yandex to produce 154,191 results. Each search engine has two types of results: pages-with and similar-to. Pages-with results help the visitor discover which pages contain the same image as the one uploaded. Similar-to results help the visitor discover images that are like the one uploaded. Applying perceptual hashes to compare image queries to their SER images, we discovered that Yandex performs best in all cases through precision-at-k, retrievability, and mean reciprocal rank. Yandex also performs better for natural images than abstract ones. When considering pages-with results, Google, with a precision of 0.8191, and Yandex, with a precision of 0.8297, are competitive when reviewing the first result against its uploaded image. Both favor natural images in their performance, achieving a retrievability difference as high as 54% between natural and abstract imagery at ten results. Many reasons exist for users to conduct reverse image searches with abstract imagery. They may wish to protect intellectual property, build datasets, provide evidence for legal cases, establish scholarly evidence, or justify funding through image reuse in the community. That abstract images are at a disadvantage hurts users leveraging search engines for these use cases and provides opportunities for computer vision and information retrieval researchers.
References 1. Aprin, F., Chounta, I.A., Hoppe, H.U.: “See the image in different contexts”: using reverse image search to support the identification of fake news in Instagram-like social media. In: Intelligent Tutoring Systems, pp. 264–275. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-09680-8 25 2. Araujo, F.H., et al.: Reverse image search for scientific data within and beyond the visible spectrum. Exp. Syst. Appl. 109, 35–48 (2018). https://doi.org/10.1016/j. eswa.2018.05.015 3. Arefkhani, M., Soryani, M.: Malware clustering using image processing hashes. In: Proceedings of the 9th Iranian Conference on Machine Vision and Image Processing (MVIP), pp. 214–218 (2015). https://doi.org/10.1109/IranianMVIP.2015.7397539
4. Askinadze, A.: Fake war crime image detection by reverse image search. In: Datenbanksysteme f¨ ur Business, Technologie und Web (BTW 2017) - Workshopband, pp. 345–354. Gesellschaft f¨ ur Informatik e.V., Bonn (2017). https://dl.gi.de/handle/ 20.500.12116/930 5. Azzopardi, L.: Theory of retrieval: the retrievability of information. In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval, ICTIR 2015, pp. 3–6. Association for Computing Machinery, Northampton (2015). https://doi.org/10.1145/2808194.2809444 6. Azzopardi, L., English, R., Wilkie, C., Maxwell, D.: Page retrievability calculator. Adv. Inf. Retrieval, 737–741 (2014). https://doi.org/10.1007/978-3-319-060286 85 7. Azzopardi, L., Vinay, V.: Retrievability: an evaluation measure for higher order information access tasks. In: Proceeding of the 17th ACM Conference on Information and Knowledge Mining, p. 561. ACM Press, Napa Valley (2008). https://doi. org/10.1145/1458082.1458157 8. Bashir, S., Rauber, A.: Improving retrievability of patents with cluster-based pseudo-relevance feedback documents selection. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, pp. 1863–1866 (2009). https://doi.org/10.1145/1645953.1646250 9. Bitirim, Y., Bitirim, S., Celik Ertugrul, D., Toygar, O.: An evaluation of reverse image search performance of google. In: 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1368–1372 (2020). https:// doi.org/10.1109/COMPSAC48688.2020.00-65 10. Buchner, J.: Johannesbuchner/imagehash (2021). https://github.com/ JohannesBuchner/imagehash 11. Cao, Y., Qi, H., Kato, J., Li, K.: Hash ranking with weighted asymmetric distance for image search. IEEE Trans. Comput. Imaging 3(4), 1008–1019 (2017). https:// doi.org/10.1109/TCI.2017.2736980 12. Caragea, C., et al.: CiteSeerx : A Scholarly Big Dataset. In: de Rijke, M., et al. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 311–322. Springer, Cham (2014). https:// doi.org/10.1007/978-3-319-06028-6 26 13. Chamoso, P., Rivas, A., Mart´ın-Limorti, J.J., Rodr´ıguez, S.: A hash based image matching algorithm for social networks. In: De la Prieta, F., et al. (eds.) PAAMS 2017. AISC, vol. 619, pp. 183–190. Springer, Cham (2018). https://doi.org/10. 1007/978-3-319-61578-3 18 14. Chen, R.C., Azzopardi, L., Scholer, F.: An empirical analysis of pruning techniques: performance, retrievability and bias. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, Singapore, pp. 2023–2026 (2017). https://doi.org/10.1145/3132847.3133151 15. Chutel, P.M., Sakhare, A.: Evaluation of compact composite descriptor based reverse image search. In: Proceedings of the 2014 International Conference on Communication and Signal Processing, Melmaruvathur, India, pp. 1430–1434 (2014). https://doi.org/10.1109/ICCSP.2014.6950085 16. Chutel, P.M., Sakhare, A.: Reverse image search engine using compact composite descriptor. Int. J. Adv. Res. Comput. Sci. Manag. Stud. 2(1) (2014). https://www. ijarcsms.com/docs/paper/volume2/issue1/V2I1-0106.pdf 17. Clark, C., Divvala, S.: PDFFigures 2.0: mining figures from research papers. In: IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp. 143–152 (2016). https://ieeexplore.ieee.org/abstract/document/7559577 18. Croft, W.B., Metzler, D., Strohman, T.: Information Retrieval in Practice. Pearson Education, Boston (2015). https://ciir.cs.umass.edu/irbook/
19. Curran, A.: Ordinary and extraordinary images: making visible the operations of stock photography in posters against the repeal of the 8th amendment. Feminist Encounters 6(1) (2022). https://doi.org/10.20897/femenc/11746 20. d’Andrea, C., Mintz, A.: Studying ‘Live’ cross-platform circulation of images with a computer vision API: an experiment based on a sports media event. In: The 19th Annual Conference of the Association of Internet Researchers, Montr´eal, Canada (2018). https://doi.org/10.5210/spir.v2018i0.10477 21. d’Andrea, C., Mintz, A.: Studying the live cross-platform circulation of images with computer vision API: an experiment based on a sports media event. Int. J. Commun. 13(0) (2019). https://ijoc.org/index.php/ijoc/article/view/10423 22. Diyasa, I.G.S.M., Alhajir, A.D., Hakim, A.M., Rohman, M.F.: Reverse image search analysis based on pre-trained convolutional neural network model. In: Proceedings of the 6th Information Technology International Seminar (ITIS), Surabaya, Indonesia, pp. 1–6 (2020). https://doi.org/10.1109/ITIS50118.2020. 9321037 23. Fei, M., Li, J., Liu, H.: Visual tracking based on improved foreground detection and perceptual hashing. Neurocomputing 152, 413–428 (2015). https://doi.org/ 10.1016/j.neucom.2014.09.060 24. Gaillard, M., Egyed-Zsigmond, E.: Large scale reverse image search. XXXV`eme Congr`es INFORSID, p. 127 (2017). https://inforsid.fr/actes/2017/INFORSID 2017 paper 34.pdf 25. Gaillard, M., Egyed-Zsigmond, E., Granitzer, M.: CNN features for Reverse Image Search. Document num´erique 21(1–2), 63–90 (2018). https://www.cairn. info/revue-document-numerique-2018-1-page-63.htm 26. Gandhi, V., Vaidya, J., Rana, N., Jariwala, D.: Reverse image search using discrete wavelet transform, local histogram and canny edge detector. Int. J. Eng. Res. Technol. 7(6) (2018). https://www.ijert.org/reverse-image-search-using-discretewavelet-transform-local-histogram-and-canny-edge-detector 27. Ganti, D.: A novel method for detecting misinformation in videos, utilizing reverse image search, semantic analysis, and sentiment comparison of metadata. In: SSRN (2022). https://ssrn.com/abstract=4128499 28. Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 124(2), 237–254 (2017). https://doi.org/10.1007/s11263-017-1016-8 29. Guinness, D., Cutrell, E., Morris, M.R.: Caption crawler: enabling reusable alternative text descriptions using reverse image search. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, pp. 1–11 (2018). https://doi.org/10.1145/3173574.3174092 30. Horv´ ath, A.: Object recognition based on Google’s reverse image search and image similarity. In: Proceedings of the Seventh International Conference on Graphic and Image Processing (ICGIP 2015), vol. 9817, pp. 162–166. International Society for Optics and Photonics, SPIE (2015). https://doi.org/10.1117/12.2228505 31. Jia, J.L., Wang, J.Y., Mills, D.E., Shen, A., Sarin, K.Y.: Fitzpatrick phototype disparities in identification of cutaneous malignancies by google reverse image. J. Am. Acad. Dermatol. 84(5), 1415–1417 (2021). https://doi.org/10.1016/j.jaad. 2020.05.005 32. Jones, S.M.: Improving collection understanding for web archives with storytelling: shining light into dark and stormy archives. Ph.D. thesis, Old Dominion University (2021). https://doi.org/10.25777/zts6-v512
33. Jones, S.M., Weigle, M.C., Klein, M., Nelson, M.L.: Automatically selecting striking images for social cards. In: Proceedings of the 13th ACM Web Science Conference, pp. 36–45 (2021). https://doi.org/10.1145/3447535.3462505 34. Kateˇrina, Z.: Propaganda on social media: the case of geert wilders. Master’s thesis, Charles University (2018). https://hdl.handle.net/20.500.11956/99767 35. Kelly, E.: Reverse image lookup of a small academic library digital collection. Codex J. Louisiana Chap. ACRL 3(2) (2015). https://journal.acrlla.org/index. php/codex/article/view/101 36. Krawetz, N.: Looks like it (2011). https://hackerfactor.com/blog/index.php%3F/ archives/432-Looks-Like-It.html 37. Krawetz, N.: Kind of like that (2013). https://www.hackerfactor.com/blog/index. php?/archives/529-Kind-of-Like-That.html 38. Kucer, M., Oyen, D., Castorena, J., Wu, J.: DeepPatent: large scale patent drawing recognition and retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2309–2318 (2022). https://openaccess.thecvf.com/content/WACV2022/html/Kucer DeepPatent Large Scale Patent Drawing Recognition and Retrieval WACV 2022 paper.html 39. Lei, Y., Wang, Y., Huang, J.: Robust image hash in Radon transform domain for authentication. Sig. Process. Image Commun. 26(6), 280–288 (2011). https://doi. org/10.1016/j.image.2011.04.007 40. Li, S., Hu, J., Cui, Y., Hu, J.: DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics 117(2), 721–744 (2018). https://doi.org/10.1007/s11192-018-2905-5 41. Mamrosh, J.L., Moore, D.D.: Using google reverse image search to decipher biological images. Current Protoc. Mol. Biol. 111(1), 19.13.1–19.13.4 (2015). https:// doi.org/10.1002/0471142727.mb1913s111 42. Manning, C.D., Raghavan, P., Sch¨ utze, H.: Introduction to Information Retrieval. Cambridge University Press (2008). https://nlp.stanford.edu/IR-book/html/ htmledition/irbook.html 43. Mawoneke, K.F., Luo, X., Shi, Y., Kita, K.: Reverse image search for the fashion industry using convolutional neural networks. In: 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), pp. 483–489 (2020). https:// doi.org/10.1109/ICSIP49896.2020.9339350 44. McMahon, C., Johnson, I., Hecht, B.: The substantial interdependence of wikipedia and google: a case study on the relationship between peer production communities and information technologies. In: Proceedings of the Eleventh International AAAI Conference on Web and Social Media (ICWSM 2017), Montr´eal, Qu´ebec, Canada, p. 10 (2017). https://www.aaai.org/ocs/index.php/ICWSM/ICWSM17/ paper/viewPaper/15623 45. Meuschke, N., Gondek, C., Seebacher, D., Breitinger, C., Keim, D., Gipp, B.: An adaptive image-based plagiarism detection approach. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2018, pp. 131– 140. Association for Computing Machinery, Fort Worth (2018). https://doi.org/ 10.1145/3197026.3197042 46. Monga, V., Evans, B.: Perceptual image hashing via feature points: performance evaluation and tradeoffs. IEEE Trans. Image Process. 15(11), 3452–3465 (2006). https://doi.org/10.1109/TIP.2006.881948 47. Nieuwenhuysen, P.: Search by image through the WWW: an additional tool for information retrieval. In: The International Conference on Asia-Pacific Library and Information Education and Practices A-LIEP, p. 38 (2013)
48. Nieuwenhuysen, P.: Finding copies of an image: a comparison of reverse image search systems on the WWW. In: Proceedings of 14th International Conference on Webometrics, Informetrics and Scientometrics, Macau, China, pp. 97–106 (2018). https://doi.org/10.22032/dbt.39355 49. Oyen, D., Kucer, M., Wohlberg, B.: VisHash: visual similarity preserving image hashing for diagram retrieval. In: Applications of Machine Learning 2021, vol. 11843, pp. 50–66. International Society for Optics and Photonics, SPIE (2021). https://doi.org/10.1117/12.2594720 50. Oyen, D., Wohlberg, B., Kucer, M.: GoFigure-LANL/VisHash (2021). https:// github.com/GoFigure-LANL/VisHash 51. Piroi, F., Lupu, M., Hanbury, A., Zenz, V.: CLEF-IP 2011: retrieval in the intellectual property domain. In: Conference and Labs of the Evaluation Forum (2011). https://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-PiroiEt2011.pdf 52. Radenovi´c, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1655–1668 (2018). https://doi.org/10.1109/TPAMI.2018.2846566 53. Ransohoff, J.D., Li, S., Sarin, K.Y.: Assessment of accuracy of patient-initiated differential diagnosis generation by google reverse image searching. JAMA Dermatol. 152(10), 1164–1166 (2016). https://doi.org/10.1001/jamadermatol.2016.2096 54. Ribeiro, L.S.F., Bui, T., Collomosse, J., Ponti, M.: Sketchformer: transformer-based representation for sketched structure. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. https://openaccess. thecvf.com/content CVPR 2020/html/Ribeiro Sketchformer Transformer-Based Representation for Sketched Structure CVPR 2020 paper.html 55. Ruchay, A., Kober, V., Yavtushenko, E.: Fast perceptual image hash based on cascade algorithm. In: Applications of Digital Image Processing XL, vol. 10396, pp. 424–430. International Society for Optics and Photonics, SPIE (2017). https:// doi.org/10.1117/12.2272716 56. Samar, T., Traub, M.C., van Ossenbruggen, J., Hardman, L., de Vries, A.P.: Quantifying retrieval bias in Web archive search. Int. J. Dig. Libr. 19(1), 57–75 (2018). https://doi.org/10.1007/s00799-017-0215-9 57. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph. (Proc. SIGGRAPH) (2016). https://doi.org/10.1145/2897824.2925954 58. Sharifzadeh, A., Smith, G.P.: Inaccuracy of Google reverse image search in complex dermatology cases. J. Am. Acad. Dermatol. 84(1), 202–203 (2021). https://doi. org/10.1016/j.jaad.2020.04.107 59. Song, J., Song, Y.Z., Xiang, T., Hospedales, T.M.: Fine-grained image retrieval: the text/sketch input dilemma. In: BMVC, vol. 2, p. 7 (2017). https://doi.org/10. 5244/C.31.45 60. van Strien, D., Beelen, K., Ardanuy, M.C., Hosseini, K., McGillivray, B., Colavizza, G.: Assessing the impact of OCR quality on downstream NLP tasks. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence (2020). https://doi.org/10.5220/0009169004840496 61. Thompson, S., Reilly, M.: “A picture is worth a thousand words”: reverse image lookup and digital library assessment. J. Assoc. Inf. Sci. Technol. 68(9), 2264–2266 (2017). https://doi.org/10.1002/asi.23847 62. Tikhonov, A.: Preservation of digital images: question of fixity. Heritage 2(2), 1160– 1165 (2019). https://doi.org/10.3390/heritage2020075
63. Traub, M.C., Samar, T., van Ossenbruggen, J., He, J., de Vries, A., Hardman, L.: Querylog-based assessment of retrievability bias in a large newspaper corpus. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL 2016, pp. 7–16. Association for Computing Machinery, New York (2016). https://doi.org/10.1145/2910896.2910907 64. Vega, F., Medina, J., Mendoza, D., Saquicela, V., Espinoza, M.: A robust video identification framework using perceptual image hashing. In: 2017 XLIII Latin American Computer Conference (CLEI), pp. 1–10 (2017). https://doi.org/10.1109/ CLEI.2017.8226396 65. Veres, O., Rusyn, B., Sachenko, A., Rishnyak, I.: Choosing the method of finding similar images in the reverse search system. In: COLINS, pp. 99–107 (2018). https://ceur-ws.org/Vol-2136/10000099.pdf 66. Vincent, N., Hecht, B.: A deeper investigation of the importance of Wikipedia links to search engine results. Proc. ACM Hum. Comput. Interact. 5(CSCW1), 1–15 (2021). https://doi.org/10.1145/3449078 67. Voorhees, E.M.: The TREC-8 question answering track report. In: Proceedings of the 8th Text Retrieval Conference (TREC-8) (1999). https://nvlpubs.nist.gov/ nistpubs/Legacy/SP/nistspecialpublication500-246.pdf 68. Voorhees, E.M., Harman, D.K.: TREC: Experiment and Evaluation in Information Retrieval. MIT Press (2005) 69. Vrochidis, S., Moumtzidou, A., Kompatsiaris, I.: Concept-based patent image retrieval. World Patent Inf. 34(4), 292–303 (2012). https://doi.org/10.1016/j.wpi. 2012.07.002 70. Wilkie, C., Azzopardi, L.: Relating retrievability, performance and length. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, pp. 937–940 (2013). https://doi. org/10.1145/2484028.2484145 71. Wilkie, C., Azzopardi, L.: Efficiently estimating retrievability bias. Adv. Inf. Retrieval, 720–726 (2014). https://doi.org/10.1007/978-3-319-06028-6 82 72. Wilkie, C., Azzopardi, L.: A retrievability analysis: exploring the relationship between retrieval bias and retrieval performance. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 81–90. Association for Computing Machinery, Shanghai (2014). https:// doi.org/10.1145/2661829.2661948 73. Wilkie, C., Azzopardi, L.: Retrievability and retrieval bias: a comparison of inequality measures. Adv. Inf. Retrieval, 209–214 (2015). https://doi.org/10.1007/978-3319-16354-3 22 74. Wilkie, C., Azzopardi, L.: Algorithmic bias: do good systems make relevant documents more retrievable? In: Proceedings of the 2017 ACM Conference on Information and Knowledge Management, Singapore, Singapore, pp. 2375–2378 (2017). https://doi.org/10.1145/3132847.3133135 75. Xu, P., Hospedales, T.M., Yin, Q., Song, Y.Z., Xiang, T., Wang, L.: Deep learning for free-hand sketch: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2022). https://doi.org/10.1109/TPAMI.2022.3148853
76. Zannettou, S., Caulfield, T., Bradlyn, B., De Cristofaro, E., Stringhini, G., Blackburn, J.: Characterizing the use of images in state-sponsored information warfare operations by Russian Trolls on Twitter. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, no. (1), pp. 774–785 (2020). https:// ojs.aaai.org/index.php/ICWSM/article/view/7342 77. Zauner, C.: Implementation and benchmarking of perceptual image hash functions. Master’s thesis, Upper Austria University of Applied Sciences (2010). https:// www.phash.org/docs/pubs/thesis zauner.pdf
W35 - Sign Language Understanding
Sign languages are spatial-temporal languages and constitute a key form of communication for Deaf communities. Recent progress in fine-grained gesture and action classification, machine translation and image captioning points to the possibility of automatic sign language understanding becoming a reality. The Sign Language Understanding Workshop is a full-day event which brings together two workshops: firstly, Open Challenges in Continuous Sign Language Recognition (morning session), and secondly, Sign Language Recognition, Translation & Production (afternoon session - virtual). The Open Challenges in Continuous Sign Language Recognition workshop will focus on advances and new challenges on the topic of sign language recognition (SLR). To advance and motivate research in the field, this workshop will have an associated challenge on fine-grain sign spotting for continuous SLR. The aim of this workshop is to put a spotlight on the strengths and limitations of the existing approaches, and to define the future directions of the field. The Sign Language Recognition, Translation & Production workshop will bring together computer vision researchers, sign language linguists and members of the Deaf community. The workshop will consist of invited talks and also a challenge with three tracks: individual sign recognition; English sentence to sign sequence alignment; and sign spotting. The focus of this workshop is to engage with members of the Deaf community, broaden participation in sign language research, and cultivate collaborations. October 2022
Liliane Momeni Gul Varol Hannah Bull Prajwal K. R. Neil Fox Ben Saunders Necat Cihan Camgöz Richard Bowden Andrew Zisserman Bencie Woll Sergio Escalera Jose L. Alba-Castro Thomas B. Moeslund Julio C. S. Jacques Junior Manuel Vazquez-Enriquez
ECCV 2022 Sign Spotting Challenge: Dataset, Design and Results

Manuel Vázquez Enríquez1(B), José L. Alba Castro1, Laura Docio Fernandez1, Julio C. S. Jacques Junior2, and Sergio Escalera2,3

1 University of Vigo, Vigo, Spain
{mvazquez,jalba,ldocio}@gts.uvigo.es
2 Computer Vision Center, Bellaterra, Spain
[email protected]
3 University of Barcelona, Barcelona, Spain
[email protected]

Abstract. The ECCV 2022 Sign Spotting Challenge focused on the problem of fine-grain sign spotting for continuous sign language recognition. We have released and made publicly available a new dataset of Spanish sign language of around 10 h of video data in the health domain, performed by 7 deaf people and 3 interpreters. The added value of this dataset over existing ones is the frame-level precise annotation of 100 signs with their corresponding glosses and variants, made by sign language experts. This paper summarizes the design and results of the challenge, which attracted 79 participants, contextualizing the problem and defining the dataset, protocols and baseline models, as well as discussing top-winning solutions and future directions on the topic.

Keywords: Sign spotting · Sign language · 3DCNN · ST-GCN

1 Introduction
In recent years, the scientific community has accelerated research in automatic Sign Language Recognition (SLR), supported by the latest advances in deep learning models with flexible spatial-temporal representation capacities [27]. It is well known that these methods need to be fed with huge amounts of data, and this is one of the main drawbacks for faster progress in automatic SLR. Sign languages are purely visual languages lacking a fully accepted writing system. Transcribing the visual information of continuous signing to glosses or any of the few alternative codes (e.g., HamNoSys [26]) can only be done by a few specialists and is an extremely time-consuming task. SLR can be roughly classified into Isolated (ISLR) and Continuous (CSLR) sign language recognition. ISLR is the most studied scenario by the scientific community and the one for which more annotated datasets are available, because of the relatively easy annotation protocols for a discrete and predefined set of signs. CSLR, however, is much more complex due to three main factors. First, the coarticulation between signs drastically increases the variability of the sign realization with respect to isolated signs and reflects the signing style of each person.
Second, the speed is much higher than in examples of isolated signs, as is the interplay of non-manual components: torso, head, eyebrows, eyes, mouth, lips, and even tongue [9]. The third factor is the already mentioned shortage of experts capable of annotating signs in sentences. The ultimate goal consists in Sign Language Translation (SLT) from a Sign Language (SL) to a spoken language. Therefore, available corpora composed of a visual sign language and the corresponding transcription in glosses are scarce. Recent works [1,4,36] are leveraging captioned and signed broadcast media to develop direct SLT without passing explicitly through any intermediate transcribed representation. Nevertheless, in those restricted-domain cases where the transcription to glosses is available, either decoupling SLR to glosses and translation from glosses to spoken language, or using the glosses as a training guidance for the end-to-end translation process, yields better performance [5]. Sign spotting is a special case of CSLR in which the specific grammar of each SL is not taken into account; the task is only to delimit the location of a particular sign. It allows the development of a wide range of applications such as indexing SL content, enabling efficient search and “intelligent fast-forward” to topics of interest, supporting linguistic studies, or even learning sign language. Sign spotting can also be reliably used for collecting co-articulated samples of the query sign to improve an ISLR model without the need for extensive expert annotations [20].
2 Related Work
Early works on sign/gesture localization in video were supported by time-series pattern spotting techniques like Dynamic Time Warping [35], Global Alignment Kernel [25], Hidden Markov Models [2], Conditional Random Fields [39] or Hierarchical Sequential Interval Pattern Trees [24] to build models over hand-crafted features, usually directed by linguistic knowledge. Most of these approaches were tested over ad-hoc datasets of isolated signs in lab conditions or continuous sign language with limited variability of signers or annotated signs. The statistical approach used in speech recognition [18] was adapted to SL and tested on the “new” RWTH-Phoenix-Weather database [12], which set a milestone in CSLR. With the release of the RWTH-Phoenix-Weather database of German continuous SL and the advent of deep learning, CSLR started to attract the attention of many research groups. Cui et al. [10] proposed the first CSLR system completely trained with deep networks, based on weak supervision and end-to-end training. They used a convolutional neural network (CNN) with temporal convolution and pooling for spatio-temporal representation learning, and an RNN model with a long short-term memory (LSTM) module to learn the mapping of feature sequences to sequences of glosses. The trend then shifted to learning short-term spatial-temporal features of signs directly from the RGB(+D) sequences or modalities derived from them (e.g., optical flow or skeletal data), mostly pushed by the success of 3D Convolutional Networks and the two-stream Inflated 3D ConvNet (I3D model) used in action classification [7]. Spatio-temporal Graph Convolutional Networks (ST-GCNs) [38], inherited from the action recognition
field, also helped to stimulate research on ISLR and CSLR, using skeleton data from RGB body sequences [6,8] to learn spatio-temporal patterns. Sign spotting has benefited from these advances in spatial-temporal representation and sequence learning from weakly labeled signs, but it faces a specific domain mismatch problem between the query (usually isolated signs) and the target (co-articulated signs). Jiang et al. [16] designed a one-shot sign spotting approach using I3D for learning the spatial-temporal representation to address the temporal scale discrepancies between the query and the target videos, building multiple queries from a single video using different frame-level strides. They proposed a transformer-based network that exploits self-attention and mutual attention between the query and target features to learn the self and mutual temporal dependency. Their results on the BSLCORPUS [30] (continuous), using SignBank [11] (isolated) as queries, outperformed previous approaches. The recently released BSL-1K [1] dataset, with weakly-aligned subtitles from broadcast footage and a vocabulary of 1000 signs automatically located in 1000 h of video, allows training larger CSLR models and testing sign spotting techniques. In [34], the authors proposed an approach to one/few-shot sign spotting through three supervision cues: mouthing, subtitled words and an isolated sign dictionary. They used a unified learning framework based on the principles of Noise Contrastive Estimation and Multiple Instance Learning [22], which allows the learning of representations from weakly-aligned subtitles while exploiting sparse labels from mouthings [1] and explicitly accounts for sign variations. The approach is based on I3D and was evaluated for sign spotting on BSL-1K using the isolated signs dataset collected by the same research group, BSLDICT [23].
3 Challenge Design
The ECCV 2022 Sign Spotting Challenge (https://chalearnlap.cvc.uab.cat/challenge/49/description/) aimed to draw attention to the strengths and limitations of the existing approaches, and to help define the future directions of the field. It was divided into two competition tracks:
– MSSL (multiple shot supervised learning). MSSL is a classical machine learning track where the signs to be spotted are the same in the training, validation and test sets. The three sets contain samples of signs cropped from the continuous stream of Spanish sign language, meaning that all of them have coarticulation influence. The training set contains the begin-end timestamps (in milliseconds), annotated by a deaf person and an SL interpreter (with homogeneous criteria), of multiple instances of each of the query signs. Participants needed to spot those signs in a set of test videos.
– OSLWL (one shot learning and weak labels). OSLWL is a realistic variation of one-shot learning adapted to the sign-language-specific problem, where it is relatively easy to obtain a couple of examples of a sign using just a sign language dictionary, but it is much more difficult to find co-articulated
versions of that specific sign. When subtitles are available, as in broadcast-based datasets, the typical approach consists in using the text to predict a likely interval where the sign might be performed. In this track, we simulated that case by providing a set of queries (20 isolated signs performed by a signer of the dataset and 20 by an external one) and a set of 4-second video intervals around each and every co-articulated instance of the queries. Intervals with no instances of the queries were provided to simulate the typical case where the subtitle or translated text shows a word that the signer chose not to perform, or even a subtitle error. Participants needed to spot the exact location of the sign instances in the provided video intervals. In this track, only one sign needs to be located for each video.
The participants were free to join any of these challenge tracks. Each track was composed of two phases, i.e., development and test phases. At the development phase, public train data was released and participants needed to submit their predictions with respect to a validation set. At the test phase, participants needed to submit their results with respect to the test data. Participants were ranked, at the end of the challenge, using the test data. Note that this competition involved the submission of results (and not code). Therefore, participants were required to share their codes and trained models after the end of the challenge so that the organizers could reproduce their results in a “code verification stage”. At the end of the challenge, top-ranked methods that passed the code verification stage (discussed in Sect. 4) were considered as valid submissions.

3.1 The Dataset
LSE eSaude UVIGO, released for the ECCV 2022 Sign Spotting Challenge, is a dataset of Spanish Sign Language (LSE: “Lengua de Signos Española”) in the health domain (around 10 h of video data), signed by 10 people (7 deaf and 3 interpreters) and partially annotated with the exact location of 100 signs. This dataset has been collected under different conditions than the typical continuous sign language datasets, which are mainly based on subtitled broadcast and real-time translation. In our case, the signs are performed to explain health contents by translating printed cards, so reliance on the signers is large, with richer expressivity and naturalness. The dataset was acquired in studio conditions with blue chroma-key, no shadow effects and uniform illumination, at 25 FPS and a resolution of 1080×720. The added value of the dataset is the rich and rigorous hand-made annotations. Expert interpreters and deaf people were in charge of annotating the location of the selected signs. Additional details about the dataset, data split and annotation protocol can be found on the dataset webpage (https://chalearnlap.cvc.uab.cat/dataset/42/description/). The signers in the test set can be the same or different to those in the training and validation sets. Signers are men, women, right- and left-handed. The amount of samples per sign was not uniform, as some signs are common terms but others are related to a specific health topic. The total number of hand annotations
are 4822 in the MSSL track and 1921 in the OSLWL track. The duration of the co-articulated signs is not a Gaussian variable, ranging from 120 ms (3 frames) up to 2 s (50 frames), with a mean duration of 520 ms (13 frames).

3.2 Evaluation Protocol
The evaluation protocol is the same for both tracks, based on the F1 score. The matching score per sign instance is evaluated as the Intersection over Union (IoU) of the ground-truth interval and the predicted interval. In order to allow relaxed locations, the IoU threshold is swept from t = 0.2 to 0.8 using 0.05 as the step size. Given a specific video file n and query sign i to spot, the submitted solutions are evaluated as follows. First, let us define Ret_i(k, n) as the k-th interval in video file n retrieved as a prediction of sign i, and Rel_i(n) as an interval annotated for an instance of sign i. Then, IoU_i(k, n) is obtained as:

\[
\mathrm{IoU}_i(k, n) = \frac{\mathrm{Ret}_i(k, n) \cap \mathrm{Rel}_i(n)}{\mathrm{Ret}_i(k, n) \cup \mathrm{Rel}_i(n)} \tag{1}
\]

A True Positive (TP) occurs if IoU_i(k, n) ≥ t. Note that IoU is calculated only between intervals with matching prediction and ground truth. Moreover, no overlap is allowed among Ret_i intervals. Then, for all Ret_i(k, n) reported from each and every video file n containing L_n ground-truth instances, we compute, for each IoU threshold t,

\[
TP(t) = \sum_{n}\sum_{i}\sum_{k} \mathbb{1}\!\left[\mathrm{IoU}_i(k, n) \geq t\right], \qquad
FP(t) = \Big(\sum_{n}\sum_{i}\sum_{k} 1\Big) - TP(t), \qquad
FN(t) = \Big(\sum_{n} L_n\Big) - TP(t).
\]

Then, the accumulated counts over all t values are TP = Σ_t TP(t), FP = Σ_t FP(t), and FN = Σ_t FN(t). Precision (P) and Recall (R) average the amounts from the different thresholds as P = TP/(TP + FP) and R = TP/(TP + FN). Finally, the F1 score is obtained using Eq. 2:

\[
F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{2}
\]
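A minimal sketch of this evaluation follows; the pairing of retrieved intervals with ground-truth annotations is assumed to be done upstream and is not shown, so the exact handling of that matching step is an assumption of the sketch.

```python
def interval_iou(pred, gt):
    """IoU of two time intervals given as (start, end), as in Eq. (1)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def accumulated_f1(matched_ious, n_retrieved, n_ground_truth):
    """F1 accumulated over IoU thresholds 0.20:0.05:0.80, as in Eq. (2).
    `matched_ious` holds one IoU value per retrieved interval that was paired
    with a ground-truth instance; unmatched retrievals/annotations only enter
    through the two totals."""
    thresholds = [0.2 + 0.05 * i for i in range(13)]   # 0.20, 0.25, ..., 0.80
    tp = sum(1 for t in thresholds for iou in matched_ious if iou >= t)
    fp = len(thresholds) * n_retrieved - tp
    fn = len(thresholds) * n_ground_truth - tp
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```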
3.3 The Baseline
The baseline model is similar for both tracks, with a common pipeline and two branches that take into account the requirements of the MSSL and OSLWL tracks. First, a person detection model [28] locates the position of the signer and a 512×512 ROI is extracted from the original footage. Then, the Mediapipe Holistic keypoint detector [13] extracts 11 + 21 × 2 = 53 coordinate points every 40 ms. The core of the model is a multi-scale ST-GCN, MSG3D [21], trained with upper body and hand skeletons [37]. Two models are independently trained and averaged, one with coordinates as input features (joints), and the other with distances between connected coordinates (bones). No explicit RGB or motion information is used for spatial-temporal modeling. The model was pre-trained on the AUTSL [31] dataset, fine-tuned using 3 signers of the MSSL train set and validated over the other 2. Ground-truth annotations of the 60 class-signs were used with 2 different context windows: 400 ms to train a short-context model
and 1000 ms to train a long-context model. Small random shifts around the ground-truth interval within the training window provided data augmentation. At the inference stage, the short- and long-context models are applied in a sequential decision pipeline. The short-context model is applied first with a stride of one video frame (40 ms). It yields one class decision per frame if the winning class surpasses a threshold of 0.75. As this output is noisy, a non-linear filter designed as a morphological operator eliminates potential short-time false positives and false negatives by building convex-like sign outputs. The isolated candidate signs are then fed to the long-context model, which is in charge of eliminating false positive candidates with a stricter threshold of 0.8. Figure 1 shows the main blocks of the pipeline for the MSSL (upper branch) and OSLWL (lower branch) tracks.
Fig. 1. Pipeline of the baseline solution for MSSL and OSLWL tracks.
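The short-context stage produces a noisy per-frame decision stream, and the description above mentions a morphological operator that removes short false positives and fills short gaps. The sketch below shows one plausible clean-up of that kind using binary opening and closing; the window length, the use of SciPy morphology and the function names are assumptions, not the baseline's actual implementation.

```python
# A plausible morphological clean-up of per-frame class decisions (assumed
# operator; the baseline's exact filter is not published here).
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def clean_frame_decisions(frame_labels, cls, min_len=5):
    """frame_labels: int array with one class id (or -1) per 40 ms frame."""
    mask = frame_labels == cls
    structure = np.ones(min_len, dtype=bool)
    mask = binary_opening(mask, structure)   # drop detections shorter than min_len
    mask = binary_closing(mask, structure)   # bridge gaps shorter than min_len
    # convert the cleaned mask into candidate (start, end_exclusive) intervals
    padded = np.concatenate(([0], mask.astype(np.int8), [0]))
    idx = np.flatnonzero(np.diff(padded))
    return list(zip(idx[::2], idx[1::2]))
```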
The solution for the OSLWL track involves no training at all. The inference pipeline shares all the blocks with MSSL up to the short-context model. At this point, the baseline extracts an internal 384D embedding vector just before the last FC layer. Within every 4 s interval, the specific sign can appear anywhere (or not appear at all), so sub-sequence Dynamic Time Warping (DTW) with cosine distance is applied between the query of the searched sign and the 4 s sequence of 384D embeddings. A fixed threshold of 0.2 was set for discarding low-valued matching scores; the result of the sub-sequence DTW is the time interval of the spotted sign, reported only if the score surpasses the threshold.
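Sub-sequence DTW differs from plain DTW in that the query may start and end anywhere in the target window. The sketch below illustrates the standard free-start/free-end recurrence with a cosine cost, in the spirit of the OSLWL baseline; feature shapes, the returned average cost and the acceptance convention are assumptions.

```python
# Minimal sub-sequence DTW with cosine distance (sketch under assumptions).
import numpy as np

def cosine_dist(a, b):
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - a @ b.T                        # (len_query, len_target)

def subsequence_dtw(query, target):
    """query: (Q, D) embeddings of the isolated sign; target: (T, D)
    embeddings of the 4 s window. Returns (start, end, mean_cost)."""
    C = cosine_dist(query, target)
    Q, T = C.shape
    D = np.full((Q + 1, T + 1), np.inf)
    D[0, :] = 0.0                               # free start anywhere in target
    for i in range(1, Q + 1):
        for j in range(1, T + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[Q, 1:]))              # free end anywhere in target
    i, j = Q, end + 1                           # backtrack to find the match start
    while i > 1:
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return j - 1, end, D[Q, end + 1] / Q        # average per-query-frame cost
```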
4 Challenge Results and Winning Methods
The challenge ran from 21 April 2022 to 24 June 2022 on Codalab3, a powerful open-source framework for running competitions. It attracted a total of 79 participants, 37 in the MSSL track and 42 in the OSLWL track, which gives an indication of where the research community is currently paying more attention between the two challenge tracks.
Codalab - https://codalab.lisn.upsaclay.fr.
4.1 The Leaderboard
The leaderboards at the test phase of both tracks, for the submissions that passed the code verification and improved on the baseline scores, are shown in Table 1, ranked by average F1 over multiple IoU thresholds from 0.2 to 0.8 with a stride of 0.05. As can be observed, the top-3 winning methods improved on the baseline scores in both tracks by a large margin.

Table 1. Leaderboard of the MSSL and OSLWL tracks at the test phase.

MSSL                               OSLWL
Rank  Participant   Avg F1         Rank  Participant  Avg F1
1     ryanwong      0.606554       1     th           0.595802
2     th            0.566752       2     Mikedddd     0.559295
3     Random guess  0.564260       3     ryanwong     0.514309
4     Baseline      0.300123       4     Baseline     0.395083
Table 2 shows the scores for different thresholds, where the performance under relaxed to strict localization requirements can be observed. It is interesting to highlight three points: i) the top-3 winning methods surpassed the baseline regardless of the IoU threshold; ii) F1 drops noticeably when stricter localization is required; iii) the th team would win both tracks under the strictest IoU threshold (0.8).

Table 2. F1 score at different IoU threshold values (0.2, 0.5, 0.8). Bold values highlight the overall winning method; italics indicate a swapped leaderboard position.

MSSL                                          OSLWL
Participant   F1@20  F1@50  F1@80             Participant  F1@20  F1@50  F1@80
ryanwong      0.744  0.677  0.280             th           0.784  0.647  0.269
th            0.660  0.626  0.282             Mikedddd     0.809  0.594  0.160
Random guess  0.715  0.652  0.204             ryanwong     0.744  0.552  0.164
Baseline      0.465  0.339  0.056             Baseline     0.621  0.437  0.080
Table 3 shows general information about the top-3 winning methods. As can be seen, common strategies employed by the top-winning solutions are the use of pre-trained models (most of them for feature extraction, as detailed in the next section) and some face, hand or body detection, alignment or segmentation strategy, combined with pose estimation and/or spatio-temporal information modeling.
Table 3. General information about the top-3 winning approaches.
Columns (left to right): MSSL rank 1 (ryanwong), MSSL rank 2 (th), MSSL rank 3 (Random guess); OSLWL rank 1 (th), OSLWL rank 2 (Mikedddd), OSLWL rank 3 (ryanwong).

Pre-trained models                                                    ✓  ✓  ✓   ✓  ✓  ✓
External data                                                         ✗  ✗  ✗   ✗  ✗  ✗
Any kind of depth information (e.g., 3D pose estimation)              ✗  ✗  ✓   ✗  ✓  ✗
Use of validation set as part of the training data (at test stage)    ✓  ✓  ✓   ✗  ✓  ✗
Handcrafted features                                                  ✗  ✗  ✗   ✗  ✗  ✗
Face / hand / body detection, alignment or segmentation               ✓  ✓  ✗   ✓  ✓  ✓
Use of any pose estimation method                                     ✗  ✓  ✓   ✓  ✓  ✗
Spatio-temporal feature extraction                                    ✓  ✓  ✓   ✓  ✓  ✓
Explicitly classify any attribute (e.g., gender/handedness)           ✗  ✗  ✗   ✗  ✗  ✗
Bias mitigation technique (e.g., rebalancing training data)           ✓  ✗  ✓   ✗  ✗  ✗
Next, we briefly introduce the top-winning methods based on the information provided by the authors. For detailed information, we refer the reader to the associated fact sheets, available for download on the challenge webpage (see footnote 4).
Fig. 2. Top-1, MSSL track (ryanwong team): proposed pipeline.
4.2 Top Winning Approaches: MSSL Track
Top-1: ryanwong team. The ryanwong team proposed to modify and use an I3D [7] model from [1], pretrained on the WLASL [19] dataset. Originally, the I3D model outputs a single sign prediction for a given sign video sequence of 32 frames, which limits the network to coarse temporal predictions. Thus, they proposed a hierarchical I3D model which can predict signs at the frame level. Instead of taking the output at the final layer with spatio-temporal global average pooling and applying an FC layer for class predictions, they take the output before global average pooling and additional feature outputs before the 3D max-pooling layers in the I3D model, obtaining 3 feature outputs, each with a higher temporal resolution. For a given sequence of 32 frames with dimensions 224 × 224, the base I3D model outputs features of size 1024 × 4 × 7 × 7, 832 × 8 × 14 × 14 and 480 × 16 × 28 × 28, with temporal resolutions of 4, 8 and 16, respectively. The proposed network, illustrated in Fig. 2, uses these inputs to output coarse-to-fine temporal predictions with 4, 8, 16 and 32 temporally aligned predictions, that is, 1 prediction every 8, 4, 2 and 1 frame(s), respectively. Cross-entropy loss is used to predict the sign at each time segment for the coarse-to-fine predictions. A trade-off between precision and recall is obtained with different random sampling probabilities, where instead of selecting frame regions only around known sign classes, they also randomly select frame regions from other areas. The final predictions are obtained by temporally interpolating the softmax of the logits of each prediction level to the original sequence length (32) and averaging the 5 outputs to obtain the class probabilities at frame level.

Top-2: th team. The th team proposed a two-stage framework, which consists of feature extraction and temporal sign action localization, illustrated in Fig. 3a.
Fig. 3. a) Top-2 and b) Top-1 winning solutions on MSSL and OSLWL track, respectively, both from th team.
First, multiple modalities (RGB, optical flow [33] and pose) are used to extract robust spatio-temporal feature representations. Then, a Transformer backbone is adopted to identify actions in time and recognize their categories. As pre-processing, MMDetection [8] is employed to detect the signer's spatial location, followed by MMPose [6] (from the OpenMMLab project) for extracting body and hand poses, which are used to crop the upper-body patch of the signer. Finally, three types of data are generated, one for each modality. Different backbones are used for feature extraction, given their complementary representations. The RGB modality is fed into the BSL-1k [1] pre-trained I3D [7]. The flow modality is processed by the AUTSL [31] pre-trained I3D [7]. For the pose modality, a pre-trained GCN [3,14] is used to extract body and hand cues. The extracted features are concatenated for the next localization stage, where a Transformer is used to perform localization similar to [40]. Specifically, it combines a multi-scale feature representation with local self-attention, using a lightweight decoder to classify every moment in time and to estimate the corresponding action boundaries. During training, focal loss for sign action classification and generalized IoU loss for distance regression are adopted. At the inference stage, the output of every time step is taken, organized as a triplet of sign action confidence score, onset and offset of the action. These candidates are further processed via NMS to remove highly overlapping instances, which leads to the final localization outputs.

Top-3: Random guess team. The Random guess team proposed a two-stage pipeline for sign spotting. The aim of the first stage is to condense sign-language-relevant information from multiple domain experts into a compact representation for sign spotting, while the goal of the second stage is to spot signs from longer videos with a more powerful temporal module. Their workflow diagram is presented in Fig. 4. At the inference stage, the processed video clips (cropped video, masked video, optical flow, and 3D skeleton) are fed into four trained expert modules. The extracted features of each clip are concatenated into a vector to represent sign information for spotting. The temporal module takes these vectors as input and extracts contextual information.
Fig. 4. Top-3, MSSL track (Random guess team): workflow diagram.
It is composed of a two-layer 1D convolutional network and a two-layer BiLSTM for local and contextual temporal information extraction, respectively. Their outputs are integrated through a convolutional layer. The integrated features are fed into the classification head and the regression head, separately, to predict the corresponding class and the offsets to its start and end. Then, a vote-based method is applied to provide better localization results. During training, they first train the backbone for feature extraction; then, the whole model is fine-tuned for sign spotting.
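As an illustration of the temporal module just described, the PyTorch sketch below combines local 1D convolutions, a two-layer BiLSTM and separate classification and regression heads. All layer sizes, the fused feature dimension and the class count are assumptions; this is not the team's released code.

```python
# Hypothetical temporal module in the spirit of the Random guess pipeline.
import torch
import torch.nn as nn

class TemporalSpotter(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_classes=61):  # sizes assumed
        super().__init__()
        self.local = nn.Sequential(                      # local temporal context
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.context = nn.LSTM(feat_dim, hidden // 2, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.fuse = nn.Conv1d(2 * hidden, hidden, kernel_size=1)
        self.cls_head = nn.Conv1d(hidden, num_classes, kernel_size=1)
        self.reg_head = nn.Conv1d(hidden, 2, kernel_size=1)  # offsets to start/end

    def forward(self, x):                                # x: (B, T, feat_dim)
        local = self.local(x.transpose(1, 2))            # (B, hidden, T)
        ctx, _ = self.context(x)                         # (B, T, hidden)
        fused = self.fuse(torch.cat([local, ctx.transpose(1, 2)], dim=1))
        return self.cls_head(fused), self.reg_head(fused)  # per-frame outputs
```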
4.3 Top Winning Approaches: OSLWL Track
Top-1: th team. The th team presented a two-stage framework consisting of feature extraction and sign instance matching, illustrated in Fig. 3b. First, they extract frame-level feature representations from the pose modality. Then, they build a similarity graph to perform frame-wise matching between the query isolated sign and the long video. MMDetection [8] is used to detect the signer's spatial location, and MMPose [6] for extracting body and hand poses. The compact pose is used to indicate the gesture state of a certain frame and to trim the effective time span of the query sign. A pre-trained GCN [3,14] is used to extract body and hand representations in a frame-wise manner. Given the extracted features of the query and the long video, an ad-hoc similarity graph is built to perform sign instance matching. The authors suggest incorporating non-manual cues (e.g., facial expressions) as a possible way to further boost the performance.

Top-2: Mikedddd team. The Mikedddd team proposed a multi-modal framework for extracting sign language features from RGB images using I3D-MLP [7], 2D pose [32] and 3D pose [29] based on SL-GCN [15]. They introduced a novel sign spotting loss function that combines the triplet loss and the cross-entropy loss to obtain more discriminative feature representations, training each model separately. At the inference stage, the three models are employed to extract visual features before multi-modality feature fusion for the sign spotting task. The pipeline of the proposed framework is shown in Fig. 5. More precisely, to achieve the goal of the challenge they proposed the following steps: 1) First, they observed that there is an obvious domain gap between isolated signs and continuous signs. Therefore, they trimmed the isolated sign videos by removing the video frames before hands up and after hands down. Then, they fed the gallery sign videos into the model to generate feature representations. As a result, 20 feature representations are obtained in total, each of which represents an isolated sign.
Fig. 5. Mikedddd team (top-2, OSLWL track): the inference pipeline.
2) For each video from the query set, they use a sliding window to crop video clips from the beginning to the end. Then, they input all clips into the proposed model to extract feature representations. 3) All feature representations from this video are used to calculate the cosine distance with respect to the corresponding feature representation of the sign provided in the gallery set. The clip that has the maximal similarity with the gallery sign representation is regarded as the retrieval result. 4) They propose a top-K transferring technique to address the domain gap between the gallery set and the query set. After the second step, they obtain several retrieved results for each isolated sign, which are sorted by the computed distance to find the most similar K clips. Each feature representation from the gallery set is then updated by the average of the feature representations of the top-K retrieved clips. Iteratively, the updated feature representation for a certain sign is used to retrieve signs from the continuous sign videos again.

Top-3: ryanwong team. The ryanwong team presented a method for sign spotting using existing I3D [7] models pretrained on sign language datasets. First, they show how these models can be used for identifying important frames from isolated sign dictionaries. The goal at this stage is to discard the most common frames, where the signer is in their resting pose, and those frames with irrelevant information. To this end, feature vectors are obtained for each frame using I3D. Then, the cosine similarity between the frame features across all of the isolated sign videos is computed, which creates a cosine similarity matrix that is used to determine how common a frame is within the sign video dictionary. The "key" frames are kept for each isolated sign sequence based on the obtained similarity matrix and predefined thresholds. Next, for each isolated query video they randomly select 8 frames (sorted by index) from the important "key" frames previously identified. This query sequence is used as input to an I3D model for feature extraction, where a feature vector q is obtained. Similarly, the co-articulated sign video is used as input to the I3D model for feature extraction, and a feature vector k is obtained. The cosine similarity between q and k is calculated to obtain the similarity matrix s. This process is repeated with 64 different combinations of query sequences and 64 different co-articulated sign video samples (with random data augmentation and random frame selection), obtaining 4096 similarity scores. The mean of all similarity scores at each time step is calculated to obtain the final similarity score sf. Finally, they compute the "normalized similarity score" sn, under the assumption that there exists at least 1 occurrence of the isolated sign in the co-articulated sequence, by dividing sf by the maximum value in the sequence. Any index in sn greater than a predefined threshold (0.9) is considered a match between the isolated sign video and the continuous sign segment. For each time index with a matched spotting, they include the 8 subsequent frames as spotting matches and combine matched spottings if they are within 10 frames of each other. The final results are obtained by ensembling the results of 2 models [1] pretrained on WLASL [19] and MSASL [17], and taking the average between the normalized cosine similarities.
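The normalised-similarity spotting step can be summarised in a few lines. The sketch below omits the 64 × 64 augmentation averaging and the two-model ensemble; all shapes and names are assumptions, and only the normalisation by the sequence maximum, the 0.9 threshold, the 8-frame span and the merging of nearby hits follow the description above.

```python
# Hedged sketch of normalised cosine-similarity spotting (assumed shapes/names).
import numpy as np

def spot_sign(query_feat, window_feats, thr=0.9, span=8, merge_gap=10):
    """query_feat: (D,) feature of the isolated sign query.
    window_feats: (T, D) features at each time step of the continuous video.
    Returns merged (start, end) frame intervals."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    k = window_feats / (np.linalg.norm(window_feats, axis=1, keepdims=True) + 1e-8)
    s = k @ q                                   # cosine similarity per time step
    s_n = s / (s.max() + 1e-8)                  # assume at least one occurrence
    intervals = []
    for t in np.flatnonzero(s_n > thr):
        if intervals and t <= intervals[-1][1] + merge_gap:
            intervals[-1][1] = t + span         # merge with the previous spotting
        else:
            intervals.append([t, t + span])     # include the subsequent frames
    return [tuple(iv) for iv in intervals]
```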
4.4 Performance on Marginal Distributions of the Test Set
In this section, we analyze the performance of the top-winning methods on marginal distributions of the test set. Table 4 shows model performance in the MSSL track on signers seen during training (SD) and on new signers (SI). As expected, the top-winning methods performed worse in the signer-independent (SI) scenario, with a performance drop greater than 10% w.r.t. the signer-dependent scenario. Note that all methods used models pre-trained with external sign language data; however, they used the challenge training data for fine-tuning. The baseline also performed worse in the SI scenario, but by a smaller ratio compared with its SD F1 score. This may be explained by the fact that it was tuned with skeleton data only. It is also worth noting that the ranking would stay almost the same when looking only at SI performance, with a swap between the 2nd and 3rd positions, which is not surprising as their F1 values were already very close.

Table 4. Signer dependency of the systems of the MSSL track at the test phase. MSSL: SD → p01, p05; SI → p03, p08.
Rank  Participant   SD/SI  TP    FP    FN    Pr     Re     F1
1     ryanwong      SD     5803  2439  2985  0.704  0.660  0.682
                    SI     3813  2102  4949  0.645  0.435  0.520
2     th            SD     5607  2973  3181  0.653  0.638  0.646
                    SI     3448  2376  5314  0.592  0.394  0.473
3     Random guess  SD     4982  2298  3806  0.684  0.567  0.620
                    SI     3406  1495  5356  0.695  0.389  0.499
4     Baseline      SD     2763  5076  6025  0.352  0.314  0.332
                    SI     1638  2301  7124  0.416  0.187  0.258
Note that these results accumulate the counts over the 13 IoU thresholds (0.2 to 0.8) tested.
Table 5. Per-signer performance of the MSSL track at the test phase. MSSL: Signers → p05 (interpreter); p01, p03, p08 (deaf).

Rank  Participant   Signer  TP    FP    FN    Pr     Re     F1
1     ryanwong      p05     4823  1573  1781  0.754  0.730  0.742
                    p01     980   866   1204  0.531  0.449  0.486
                    p03     3348  1865  4478  0.642  0.427  0.514
                    p08     465   237   471   0.662  0.497  0.568
2     th            p05     4635  1852  1969  0.714  0.702  0.708
                    p01     972   1121  1212  0.464  0.445  0.454
                    p03     3067  2224  4759  0.580  0.392  0.468
                    p08     381   152   555   0.715  0.407  0.518
3     Random guess  p05     4143  1473  2461  0.738  0.627  0.678
                    p01     839   825   1345  0.504  0.384  0.436
                    p03     2953  1324  4873  0.690  0.377  0.488
                    p08     453   171   483   0.725  0.484  0.581
4     Baseline      p05     2389  3864  4215  0.382  0.362  0.372
                    p01     374   1212  1810  0.236  0.171  0.198
                    p03     1354  2104  6472  0.391  0.173  0.240
                    p08     284   197   652   0.590  0.303  0.401
Table 5 shows the results marginalized per signer. Signer p05 is a sign language interpreter; the remaining signers in the MSSL test set are native deaf signers. Participants' methods obtained around a 10% larger F1 on the interpreter videos w.r.t. the deaf signers. A possible explanation is that interpreters display less naturalness, hence less variability, while signing. This hypothesis should be verified further, as p05 is also the signer with the most training data. The F1 ranking among signers holds across the three competing methods (p05 > p08 > p03 > p01), so there is a clear dependency on signing style that is difficult to separate from the dependency on the amount of training data (p03 and p08 were not seen during training). We leave this important question open to further testing because many of the current larger datasets are based on broadcast video with interpreters signing.

Table 6 shows the per-signer performance, precision (Pr), recall (Re) and F1, in the OSLWL track. Note that OSLWL does not have explicit training signs/signers, but two of the analyzed methods, the 2nd (Mikedddd) and the 4th (Baseline), used MSSL data in the model that extracts feature vectors from the queries and the videos to be searched; Mikedddd also used the OSLWL validation data. We can make some observations from Table 6. First, performance on the interpreters (p05, p09) is better than on the deaf signers if the model used training data containing them, which is not observed for the methods ranked 1st and 3rd, where the deaf signer p04 obtained better or similar performance. So, as in MSSL, performance is quite signer dependent. Second, performance is not better on signer p03, who is the one signing half of the queries. This observation shows that, even though the scenario and the signer are the same, the domain change between isolated signs and co-articulated signs plays a more important role. In short, the top-winning methods show consistent performance on the marginal distributions of the test set. The per-signer evaluation shows a remarkable performance difference that cannot be fully attributed to the amount of training data available for each signer but, probably, to factors related to signing style. It seems that the interpreters' style is easier to learn than the deaf signers' style, but there is not enough evidence to support this hypothesis, which deserves further investigation.

Table 6. Per-signer performance of the OSLWL track at the test phase. OSLWL: Signers → p05, p09 (interpreters); p01, p03, p04, p08 (deaf).
Rank  Participant  Signer  Pr     Re     F1
1     th           p05     0.588  0.596  0.592
                   p09     0.645  0.645  0.645
                   p01     0.507  0.497  0.502
                   p03     0.453  0.453  0.453
                   p04     0.702  0.728  0.715
                   p08     0.632  0.438  0.517
2     Mikedddd     p05     0.554  0.617  0.584
                   p09     0.624  0.680  0.651
                   p01     0.444  0.477  0.460
                   p03     0.421  0.510  0.462
                   p04     0.555  0.612  0.582
                   p08     0.361  0.361  0.361
3     ryanwong     p05     0.488  0.521  0.504
                   p09     0.562  0.576  0.569
                   p01     0.415  0.484  0.447
                   p03     0.474  0.568  0.517
                   p04     0.529  0.553  0.541
                   p08     0.405  0.467  0.434
4     Baseline     p05     0.487  0.372  0.422
                   p09     0.529  0.448  0.485
                   p01     0.294  0.239  0.264
                   p03     0.357  0.226  0.277
                   p04     0.457  0.393  0.423
                   p08     0.365  0.225  0.278
5 Conclusions
This paper summarized the ECCV 2022 Sign Spotting Challenge. We analysed and discussed the challenge design, the top-winning solutions and the results. The newly released dataset allowed training and testing sign spotting methods under challenging conditions with deaf and interpreter signers. Although it provides around 10 h of video data with rich and rigorous hand-made annotations, the dataset is still limited by its small number of participants, which has an impact on generalization. To address this problem, future research should move towards the development of novel large-scale public datasets and towards methods that are both fair and accurate, where people of different genders, ages, demographics and signing styles, among others, are considered. The top-winning solutions of this challenge shared similar pipeline blocks regarding spatio-temporal representation, using models pre-trained on larger sign language datasets from different languages, based on both 3DCNN and ST-GCN deep learning models. That is, they benefited from state-of-the-art approaches for feature extraction, modeling and/or fusion. The main differences lie in how the domain adaptation problem is handled and in the use of the target dataset to train/fine-tune the models or their meta-parameters. The post-challenge experiments showed that signer style can affect the quality of the sign spotting performance, together with the amount of training data and the data distribution previously mentioned. In this line, future work should pay more attention to explainability/interpretability, which are key to understanding which parts or components of the model are most relevant to solve a particular problem, or to explaining possible sources of bias or misclassification.

Acknowledgments. This work has been supported by the Spanish projects PID2019-105093GB-I00 and RTI2018-101372-B-I00, by ICREA under the ICREA Academia programme and by the Xunta de Galicia and ERDF through the Consolidated Strategic Group AtlanTTic (2019–2022). Manuel Vázquez is funded by the Spanish Ministry of Science and Innovation through the predoc grant PRE2019-088146.
References 1. Albanie, Samuel, et al.: BSL-1K: scaling up co-articulated sign language recognition using mouthing cues. In: Vedaldi, Andrea, Bischof, Horst, Brox, Thomas, Frahm, Jan-Michael. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 35–53. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8 3 2. Alon, J., Athitsos, V., Yuan, Q., Sclaroff, S.: A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 31(9), 1685–1699 (2009) 3. Cai, Y., et al.: Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In: International Conference on Computer Vision (ICCV), pp. 2272–2281 (2019) 4. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR) (2018)
5. Camgoz, N.C., Koller, O., Hadfield, S., Bowden, R.: Sign language transformers: joint end-to-end sign language recognition and translation. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10023–10033 (2020) 6. Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y.: Openpose: realtime multiperson 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(01), 172–186 (2021) 7. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 8. Chen, K., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. CoRR abs/1906.07155 (2019) 9. Cooper, H., Holt, B., Bowden, R.: Sign language recognition. In: Moeslund, T., Hilton, A., Kr¨ uger, V., Sigal, L. (eds.) Visual Analysis of Humans, Springer, London, pp. 539–562 (2011). https://doi.org/10.1007/978-0-85729-997-0 27 10. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1610–1618 (2017) 11. Fenlon, J.B., et al.: Bsl signbank: a lexical database and dictionary of British sign language 1st edn (2014) 12. Forster, J., Schmidt, C., Koller, O., Bellgardt, M., Ney, H.: Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-weather. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1911–1916 (2014) 13. Grishchenko, I., Bazarevsky, V.: Mediapipe holistic - simultaneous face, hand and pose prediction, on device. https://ai.googleblog.com/2020/12/mediapipeholisticsimultaneous-face.html (2022). Accessed 18 Jul 2022 14. Hu, H., Zhao, W., Zhou, W., Wang, Y., Li, H.: SignBERT: pre-training of handmodel-aware representation for sign language recognition. In: International Conference on Computer Vision (ICCV), pp. 11087–11096 (2021) 15. Jiang, S., Sun, B., Wang, L., Bai, Y., Li, K., Fu, Y.: Skeleton aware multi-modal sign language recognition. In: Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3408–3418 (2021) 16. Jiang, T., Camgoz, N.C., Bowden, R.: Looking for the signs: identifying isolated sign instances in continuous video footage. In: IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pp. 1–8 (2021) 17. Joze, H.R.V., Koller, O.: MS-ASL: a large-scale data set and benchmark for understanding American sign language. CoRR abs/1812.01053 (2018) 18. Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Understand. 141, 108–125 (2015) 19. Li, D., Rodriguez, C., Yu, X., Li, H.: Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. In: Winter Conference on Applications of Computer Vision (WACV) (2020) 20. Li, D., Yu, X., Xu, C., Petersson, L., Li, H.: Transferring cross-domain knowledge for video sign language recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6204–6213 (2020) 21. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 140–149 (2020)
22. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-toEnd Learning of Visual Representations from Uncurated Instructional Videos. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 23. Momeni, L., Varol, G., Albanie, S., Afouras, T., Zisserman, A.: Watch, read and lookup: learning to spot signs from multiple supervisors. In: ACCV (2020) 24. Ong, E.J., Koller, O., Pugeault, N., Bowden, R.: Sign spotting using hierarchical sequential patterns with temporal intervals. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1931–1938 (2014) 25. Pfister, T., Charles, J., Zisserman, A.: Domain-adaptive discriminative one-shot learning of gestures. In: European Conference on Computer Vision (ECCV), vol. 8694, pp. 814–829 (2014) 26. Prillwitz, S.: HamNoSys Version 2.0. Hamburg Notation System for Sign Languages: An Introductory Guide. Intern. Arb. z. Geb¨ ardensprache u. Kommunik, Signum Press, Dresden (1989) 27. Rastgoo, R., Kiani, K., Escalera, S.: Sign Language recognition: a deep Survey. Expert Syst. Appl. 164, 113794 (2021) 28. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 91–99. Curran Associates, Inc. (2015) 29. Rong, Y., Shiratori, T., Joo, H.: Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. In: International Conference on Computer Vision Workshops (ICCVW), pp. 1749–1759 (2021) 30. Schembri, A.C., Fenlon, J.B., Rentelis, R., Reynolds, S., Cormier, K.: Building the British sign language corpus. Lang. Documentation Conserv. 7, 136–154 (2013) 31. Sincan, O.M., Keles, H.Y.: AUTSL: a large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access 8, 181340–181355 (2020) 32. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5686–5696 (2019) 33. S´ anchez P´erez, J., Meinhardt-Llopis, E., Facciolo, G.: TV-L1 optical flow estimation. Image Process. Line 3, 137–150 (2013) 34. Varol, G., Momeni, L., Albanie, S., Afouras, T., Zisserman, A.: Scaling up sign spotting through sign language dictionaries. Int. J. Comput. Vis. 1–24 (2022). https://doi.org/10.1007/s11263-022-01589-6 35. Viitaniemi, V., Jantunen, T., Savolainen, L., Karppa, M., Laaksonen, J.: S-pot - a benchmark in spotting signs within continuous signing. In: International Conference on Language Resources and Evaluation (LREC), pp. 1892–1897 (2014) 36. Voskou, A., Panousis, K.P., Kosmopoulos, D., Metaxas, D.N., Chatzis, S.: Stochastic transformer networks with linear competing units: application to end-to-end SL translation. In: International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp. 11926–11935 (2021) 37. V´ azquez-Enr´ıquez, M., Alba-Castro, J.L., Doc´ıo-Fern´ andez, L., Rodr´ıguez-Banga, E.: Isolated sign language recognition with multi-scale spatial-temporal graph convolutional networks. In: Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2021) 38. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 7444–7452 (2018)
39. Yang, H.D., Sclaroff, S., Lee, S.W.: Sign language spotting with a threshold model based on conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell. 31(7), 1264–1277 (2009) 40. Zhang, C., Wu, J., Li, Y.: Actionformer: localizing moments of actions with transformers. CoRR abs/2202.07925 (2022)
Hierarchical I3D for Sign Spotting
Ryan Wong(B), Necati Cihan Camgöz, and Richard Bowden
University of Surrey, Guildford, UK
{r.wong,n.camgoz,r.bowden}@surrey.ac.uk
Abstract. Most of the vision-based sign language research to date has focused on Isolated Sign Language Recognition (ISLR), where the objective is to predict a single sign class given a short video clip. Although there has been significant progress in ISLR, its real-life applications are limited. In this paper, we focus on the challenging task of Sign Spotting instead, where the goal is to simultaneously identify and localise signs in continuous co-articulated sign videos. To address the limitations of current ISLR-based models, we propose a hierarchical sign spotting approach which learns coarse-to-fine spatio-temporal sign features to take advantage of representations at various temporal levels and provide more precise sign localisation. Specifically, we develop the Hierarchical Sign I3D model (HS-I3D), which consists of a hierarchical network head that is attached to the existing spatio-temporal I3D model to exploit features at different layers of the network. We evaluate HS-I3D on the ChaLearn 2022 Sign Spotting Challenge - MSSL track and achieve a state-of-the-art 0.607 F1 score, which was the top-1 winning solution of the competition.

Keywords: Sign language recognition · Sign spotting
1 Introduction
Sign Languages are visual languages that incorporate the motion of the hands, facial expression and body movement [3]. They are the primary form of communication amongst deaf communities, with most countries having their own sign languages and different dialects across regions. Although there are many commonalities across sign languages in terms of linguistics and grammatical rules, each has a very different lexicon [24]. There has been increasing interest in computational sign language research. A popular research topic has been Isolated Sign Language Recognition (ISLR), where the goal is to identify which single sign is present in a short isolated sign video clip [15,17]. Although there are still challenges to solve, such as signer-independent recognition [25], the real-life applications of ISLR are limited. In this work we focus on the closely related field of Sign Spotting, where the objective is to identify and localise instances of signs within a co-articulated continuous sign video. Sign spotting is beneficial for several real-life applications, such as Sign Content Retrieval, where spotting models are used to search through large unlabelled corpora to locate instances of signs.
Current sign spotting approaches can be categorized into two groups. The first is dictionary-based sign spotting, where, given an isolated sign, the objective is to identify and locate co-articulated instances of that sign in a continuous sign video. This usually involves one-shot/few-shot learning, where minimal annotated examples are available [26]. The second group of sign spotting approaches aligns more closely with ISLR, but instead of a video containing an isolated sign, the video segment is usually longer, with one or multiple instances of co-articulated signs that need to be identified from a set vocabulary. This involves multiple shot supervised learning, where there are multiple examples of a set vocabulary within a larger corpus of continuous sign videos.
Fig. 1. Concept of the Hierarchical Sign I3D model which takes an input video sequence and predicts the localisation of signs at various temporal resolutions
In this work we build upon the latter group of sign spotting. To address the limitations of the previous approaches, we propose a novel hierarchical spatio-temporal network architecture, named the Hierarchical Sign I3D model (HS-I3D), and identify coarse-to-fine temporal locations of signs in continuous sign videos, as shown in Fig. 1. HS-I3D comprises a backbone and a head. Although our approach can be used with any spatio-temporal backbone, we have chosen I3D due to its success in related sign tasks [26], which enables the use of other pretrained SLR models. The additional hierarchical spatio-temporal network head has the ability to predict sign labels at the frame level for better estimation of the boundaries between signs. The main contributions of this work can be summarised as:
1. We introduce a novel hierarchical spatio-temporal network head which can be attached to existing spatio-temporal sign models to learn the coarse-to-fine temporal locations of signs.
2. We demonstrate the importance of incorporating random sampling techniques during training and show the impact and trade-off they have between precision and recall.
3. Our architecture achieves state-of-the-art results on the 2022 ChaLearn Sign Spotting Challenge in the multiple shot supervised learning (MSSL) track.
2 Related Work

2.1 Sign Language Recognition
Over the last few decades, significant progress has been made towards Sign Language Recognition. Traditional feature-engineered approaches, such as hand shape and motion modeling techniques [2,6,9,11], have been replaced by data-driven machine learning approaches. These data-driven approaches require large annotated datasets; therefore, many ISLR datasets have been created, including but not limited to Turkish Sign Language (TID) [25], American Sign Language (ASL) [15,17], Chinese Sign Language [30] and British Sign Language (BSL) [1,7]. Most of the current ISLR approaches use either raw RGB videos or pose-based input. The pose-based input utilizes human pose estimators such as OpenPose [4] and MediaPipe [19], which distill the signer to a set of keypoints. By using keypoints, irrelevant appearance information is discarded, such as the background and a person's visual appearance. Various models have been developed to work with keypoint input: Pose→Sign [1] makes use of a 2D ResNet architecture [10] and found that keypoint inputs underperformed compared to RGB-based approaches. More recently, Graph Convolutional Networks (GCNs) using human keypoints have achieved comparable results to RGB models [13]. RGB-based approaches, which use the raw frames as input, have been extended to spatio-temporal architectures building on existing action recognition models, such as 3DCNNs [21] and, more recently, the I3D model [5]. Such architectures achieve strong classification performance on ISLR datasets [13,15,17]. As in other areas of computer vision, transfer learning has been shown to be effective in improving results of sign language recognition [1], which is especially important for transferring domain knowledge across different sign languages. Motivated by this, we use a pretrained I3D model as the backbone for our proposed Hierarchical Sign I3D model, which allows us to leverage models pretrained on larger scale ISLR datasets.
2.2 Sign Spotting
While ISLR aims to identify an isolated sign in a given sequence, Sign Spotting requires identification of both the start and end of a sign instance from a set vocabulary within a continuous sign video. Early methods utilized hand crafted features, such as thresholding-based approaches using Conditional Random Fields, to distinguish the difference between signs in the vocabulary and non-sign patterns [28]. Another technique detected skin-coloured regions in frames and utilized temporal alignment techniques, such as Dynamic Time Warping [27]. Sequential Interval Patterns were also proposed, which used hierarchical trees to learn a strong classifier to spot signs [20]. One-shot approaches, which use sign dictionaries have recently been explored, where given a set vocabulary from a dictionary, the objective is to locate these
signs within a continuous sign video. Sign-Lookup makes use of a transformer-based network which utilises cross-attention to spot signs from a query dictionary in the target continuous video [14]. Most ISLR models are trained on datasets that are collected specifically for the task, where the signs are slower and not co-articulated. This makes transfer learning from such models to tasks that focus on co-articulated sign videos, such as sign spotting, difficult. However, there has been recent work [18] which focuses on transferring knowledge from models trained on isolated sign language recognition datasets to find signs in continuous co-articulated sign videos. In this work we focus on multiple shot supervised learning using the LSE eSaude UVIGO dataset [8], where we have multiple instances of the precise locations of signs from a vocabulary to be spotted in co-articulated sign videos. This allows transfer learning to address the domain adaptation problem from models trained on ISLR datasets to the sign spotting dataset.
3 Approach
The HS-I3D model consists of a spatio-temporal backbone based on the Inception I3D model with a hierarchical network head to locate sign instances from a set vocabulary. We first briefly give an overview of the feature extraction method used to extract spatio-temporal representations at different resolutions. Then we introduce the hierarchical network head, which makes predictions at different temporal resolutions. Finally, we describe the learning objectives for the localisation of signs.
3.1 I3D Feature Extraction Layers
The I3D model is a general video classification architecture where given a sequence of frames, one class is predicted. This does not naturally allow for the localisation of signs or multiple predictions within a sequence. To address the objective of localisation, we use the I3D model to extract features at different spatio-temporal resolutions. Instead of taking the output at the final layer, after spatio-temporal global average pooling and applying a fully connected layer for classification, we take the output before global average pooling and additional intermediary features obtained from the outputs before the spatio-temporal max pooling layers in the I3D model. We therefore obtain three spatio-temporal features, each with different temporal resolutions, where features extracted from earlier layers have larger temporal dimensions. For the rest of this work we assume that the input to the I3D model contains 32 consecutive frames with a resolution of 224 × 224. For a given input sequence, the dimensions of the features extracted from the I3D model are as follows 1024 × 4 × 7 × 7, 832 × 8 × 14 × 14 and 480 × 16 × 28 × 28, with a temporal dimension of 4, 8 and 16, and feature channel dimensions of 1024, 832 and 480, respectively.
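One simple way to obtain such intermediate features from an existing I3D implementation is to register forward hooks on the blocks that precede the temporal max-pooling layers. The sketch below assumes placeholder module names ("layer2", "layer3", "layer4"); the exact names depend on the I3D implementation and are not taken from the paper.

```python
# Hedged sketch: pulling multi-scale spatio-temporal features via forward hooks.
import torch

def extract_multiscale_features(model, clip, layer_names=("layer2", "layer3", "layer4")):
    """clip: (B, 3, 32, 224, 224) video tensor. Returns a dict of features with
    decreasing temporal resolution (roughly 16, 8 and 4 time steps)."""
    feats, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:                      # placeholder block names
        def hook(_m, _inp, out, key=name):
            feats[key] = out                      # (B, C, T', H', W')
        handles.append(modules[name].register_forward_hook(hook))
    with torch.no_grad():
        model(clip)                               # the forward pass fills `feats`
    for h in handles:
        h.remove()
    return feats
```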
3.2 Hierarchical Network Head
The features extracted from the I3D model are used as input to a hierarchical network head to predict signs at different temporal resolutions. The hierarchical network head follows a U-Net [23] design, but instead of increasing the spatial resolution we increase the temporal resolution, making class prediction outputs at different temporal levels by creating skip connections between the different feature layers.
Fig. 2. Flow of the introduced hierarchical network head when given a sequence of 32 frames with dimensions 224 × 224. The three inputs are taken from outputs of various stages of the I3D model. The output consists of five temporal segment predictions of various segment lengths
An overview of the flow of the hierarchical network architecture is given in Fig. 2, where the network blocks are defined as follows:

POOL consists of 3D global average pooling, which reduces the spatial dimensions to 1 × 1 while keeping the original temporal and feature dimensions.

UP consists of a 3D transpose convolution which doubles the input temporal dimension and halves the feature dimension. The temporal dimension is doubled using a kernel size of (2, 1, 1) and a stride of (2, 1, 1).

CAT represents concatenating the feature outputs, which is used as a form of skip connection between the different feature layers.
MERGE consists of a 3D convolution layer followed by batch normalisation [12] and swish activation [22], repeated twice. The first 3D convolution has a kernel size of (1, 1, 1) and the same number of output channels as input channels. The second 3D convolution has a kernel size of (1, 1, 1), with output channel dimensions being double the lowest input feature channel dimension before the respective CAT block.

CONV consists of a 3D convolution with a kernel size of (1, 1, 1), keeping the output dimension the same as the input dimension.

DOWN + POOL consists of a Residual Block [10], replacing the ReLU activation with the swish activation function. The Residual Block halves the feature channel dimension, keeping the temporal dimension the same but reducing the spatial dimensions to 7 × 7 by adjusting the spatial stride to (2, 2) for an input spatial resolution of 14 × 14 and (4, 4) for an input spatial resolution of 28 × 28. This is followed by the POOL block defined above.

INTER + CAT consists of interpolation of the temporal dimension to size 32 using the nearest-neighbour approach, followed by concatenation of the input features.

FC is a fully connected layer with output size equal to the number of classes (C).

As shown in Fig. 2, the hierarchical network head uses the features extracted from the I3D model as input for coarse-to-fine localisation predictions. The temporal prediction levels are as follows: temporal level 4 (x4) makes 1 prediction every 8 consecutive frames, with a total of 4 predictions; temporal level 8 (x8) makes 1 prediction every 4 consecutive frames, with a total of 8 predictions; temporal level 16 (x16) makes 1 prediction every 2 consecutive frames, with a total of 16 predictions; temporal level 32 (x32) makes 1 prediction every frame, with a total of 32 predictions. An additional combined temporal level (x) is created to combine features from all temporal levels, which makes 32 predictions, one for every frame. This is achieved by temporally interpolating the features from all temporal levels to a temporal dimension of 32 and concatenating the features before a fully connected layer, which produces the interpolated prediction. We therefore obtain a hierarchy of 5 sign localisation predictions at different temporal resolutions.
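To make the UP / CAT / MERGE pattern concrete, the PyTorch sketch below implements a heavily simplified coarse-to-fine head: spatial pooling is applied up front, the DOWN + POOL residual path, batch normalisation and the combined level x are omitted, and 1D transposed convolutions replace the (2, 1, 1) 3D ones. It is an illustration under those assumptions, not a faithful reproduction of HS-I3D; only the channel sizes follow the I3D feature dimensions quoted above.

```python
# Simplified coarse-to-fine head (sketch; not the authors' implementation).
import torch
import torch.nn as nn

class Up(nn.Module):
    """Double the temporal resolution and halve the channel dimension."""
    def __init__(self, c_in):
        super().__init__()
        self.up = nn.ConvTranspose1d(c_in, c_in // 2, kernel_size=2, stride=2)
    def forward(self, x):                         # x: (B, C, T)
        return self.up(x)

class SimpleHierarchicalHead(nn.Module):
    def __init__(self, num_classes, dims=(1024, 832, 480)):
        super().__init__()
        c4, c8, c16 = dims                        # channels at temporal levels 4, 8, 16
        self.up4 = Up(c4)                         # T = 4  -> T = 8
        self.up8 = Up(c4 // 2 + c8)               # T = 8  -> T = 16
        self.up16 = Up((c4 // 2 + c8) // 2 + c16) # T = 16 -> T = 32
        self.fc4 = nn.Linear(c4, num_classes)
        self.fc8 = nn.Linear(c4 // 2 + c8, num_classes)
        self.fc16 = nn.Linear((c4 // 2 + c8) // 2 + c16, num_classes)
        self.fc32 = nn.Linear(self.up16.up.out_channels, num_classes)

    def forward(self, f4, f8, f16):
        # f4: (B, 1024, 4, 7, 7), f8: (B, 832, 8, 14, 14), f16: (B, 480, 16, 28, 28)
        x4 = f4.mean(dim=(-2, -1))                # spatial average pooling -> (B, C, T)
        x8 = torch.cat([self.up4(x4), f8.mean(dim=(-2, -1))], dim=1)    # skip connection
        x16 = torch.cat([self.up8(x8), f16.mean(dim=(-2, -1))], dim=1)
        x32 = self.up16(x16)
        levels = ((x4, self.fc4), (x8, self.fc8), (x16, self.fc16), (x32, self.fc32))
        return [fc(x.transpose(1, 2)) for x, fc in levels]  # (B, T_level, C) per level
```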
3.3 Learning Objectives
Cross-entropy loss is used as the learning objective, where the target is set to the sign class label if it exists within the prediction window; otherwise the target is set to an additional "background" class. To obtain the final prediction, the softmaxed logits from all five levels are temporally interpolated to the original sequence length. We then take their average for each frame to get sign class probabilities. A greedy search is applied which selects the highest-probability class as the sign label for each frame. Consecutive frames with the same sign label, excluding the background class, are then selected as the temporal window for the identified sign.
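A minimal sketch of this decoding step is given below: per-level logits are softmaxed, interpolated to the input length, averaged, and turned into intervals by grouping consecutive frames with the same non-background arg-max label. Tensor shapes, names and the background index are assumptions.

```python
# Hedged sketch of the prediction-decoding step.
import torch
import torch.nn.functional as F

def decode(level_logits, seq_len=32, background=0):
    """level_logits: list of (B, T_level, C) tensors, one per temporal level."""
    probs = []
    for logits in level_logits:
        p = logits.softmax(dim=-1).transpose(1, 2)                 # (B, C, T_level)
        probs.append(F.interpolate(p, size=seq_len, mode="linear",
                                   align_corners=False))
    avg = torch.stack(probs).mean(dim=0)                           # (B, C, seq_len)
    labels = avg.argmax(dim=1)                                     # greedy class per frame
    spans, start, cur = [], None, None
    for b in range(labels.shape[0]):
        start = None
        for t, c in enumerate(labels[b].tolist() + [background]):  # sentinel flushes last run
            if start is None and c != background:
                start, cur = t, c
            elif start is not None and c != cur:
                spans.append((b, cur, start, t - 1))               # (batch, class, first, last)
                start = None if c == background else t
                cur = c
    return spans
```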
4 Experiments
In this section, we evaluate the proposed HS-I3D model on the challenging LSE eSaude UVIGO dataset. We first introduce the dataset and the sign spotting objective. Then we give an overview of the evaluation metric used to measure the sign spotting performance of the model. Next, we present qualitative results of the proposed approach to give the reader more insight. Finally, we perform ablation studies, demonstrating the importance of the hierarchical components of our network and the impact of random data sampling on precision and recall.
4.1 LSE eSaude UVIGO Dataset
LSE eSaude UVIGO [8] is a Spanish Sign Language (LSE: Lengua de Signos Española) dataset in the health domain with around 10 h of continuous sign videos. The dataset contains 10 signers, comprising seven deaf signers and three interpreters, captured in studio conditions with a constant blue background. It is partially annotated with the exact locations of 100 signs, namely beginning and end timestamps, which were annotated by interpreters and deaf signers. For our work, the experiments are performed using the ChaLearn 2022 MSSL (multiple shot supervised learning) track protocol. MSSL is the classical machine learning track where sign classes are the same in the training, validation and test sets (i.e. no out-of-vocabulary samples). We use the official dataset splits, namely: the MSSL Train Set, the training dataset containing around 2.5 h of videos with annotations of 60 signs performed by five people; the MSSL Val Set, the validation dataset containing around 1.5 h of videos with annotations of the same 60 signs performed by four people; and the MSSL Test Set, the test set with around 1.5 h of videos with annotations of the same 60 signs performed by four people.
4.2 Evaluation Metric
For evaluation we use the matching score per sign instance metric, which is based on the intersection over union (IoU) of the ground-truth and predicted intervals. The IoU results are thresholded using values spread between 0.20 and 0.80 with 0.05 steps (i.e. {0.20, 0.25, ..., 0.75, 0.80}), yielding a set of 13 true positive, false positive and false negative counts. These are then summed and used to calculate the final F1-score.
4.3 Random Sampling Probabilities
SLR approaches conventionally select consecutive frames around a known sign vocabulary during training. However, since the objective is sign spotting with an F1-score evaluation metric, a balance between precision and recall is required. Some sign videos may contain out-of-vocabulary signs that are visually similar to
the signs in the vocabulary, which will impact the precision if they are predicted as false positives. We address this issue by introducing a random sampling probability (rsp): instead of selecting frame regions only around known sign classes, we randomly select consecutive frame regions from other areas of the continuous sign video based on the rsp. A higher rsp indicates a higher probability of selecting consecutive frames from random locations in the video.
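A minimal sketch of window selection with an rsp is shown below; the annotation format, the centring of windows on annotated signs and the function names are assumptions.

```python
# Hedged sketch of training-window selection with a random sampling probability.
import random

def sample_window(num_frames, annotations, rsp=0.1, win=32):
    """annotations: assumed list of (start_frame, end_frame, class_id) tuples.
    Returns the first frame of a `win`-frame training window."""
    if annotations and random.random() >= rsp:
        start, end, _ = random.choice(annotations)           # window around a known sign
        first = (start + end) // 2 - win // 2
    else:
        first = random.randint(0, max(0, num_frames - win))  # anywhere in the video
    return min(max(first, 0), max(0, num_frames - win))      # clamp to valid range
```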
4.4 Implementation Details
We use the Inception I3D model [5] as our spatio-temporal backbone, which was pretrained on the large-scale BSL-1K dataset and then on the WLASL dataset [1]. The ReLU activation functions are replaced by the Swish activation function [22], as it has been shown to improve results for SLR [13]. HS-I3D is trained in an end-to-end manner. The input to the model is 32 consecutive frames of size 224 × 224 with random data augmentations, such as random cropping, rotation, horizontal flipping, colour jitter and gray scaling. Mixup [29] is also applied with an α value of 1.0. Since the LSE eSaude UVIGO dataset has large class imbalances, we re-balance the class distributions by under-sampling majority classes and over-sampling minority classes. The HS-I3D model is trained for 200 epochs with a batch size of 8 using an Adam optimizer [16] with an initial learning rate of 3 × 10−4 and cosine annealing decay until the model converges. The final predictions of the model are based on the average probability output of the different temporal levels, which are temporally interpolated to a temporal length of 32 to match the frame input size. Four-fold cross-validation is used, where each fold is separated by signer ID using the MSSL Train Set, always keeping signer number five in the training set due to the large number of annotations for their samples. This training process is repeated three times with different random sampling probabilities. The rsp values used during training are 0.0, 0.1 and 0.5. An additional six-fold cross-validation set is used, consisting of a mixture of the training and validation datasets, and follows the same process as the original four-fold cross-validation approach. Therefore, an ensemble based on the mean probability outputs of 30 models (4 × 3 + 6 × 3) is used for the final predictions, where we apply a temporal stride of one over the test videos, taking the average probabilities between overlapping time segments.
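The stride-one sliding-window inference with probability averaging can be sketched as follows; the per-window model call is a placeholder, and the class count (60 signs plus background) is an assumption. Averaging over the 30-model ensemble would wrap this same accumulation.

```python
# Hedged sketch of stride-1 sliding-window inference with probability averaging.
import numpy as np

def sliding_window_probs(video_frames, predict_window, win=32, num_classes=61):
    """video_frames: (N, H, W, 3) array; predict_window is a placeholder for the
    trained model and returns (win, num_classes) frame probabilities per window."""
    n = len(video_frames)
    acc = np.zeros((n, num_classes))
    cnt = np.zeros((n, 1))
    for start in range(0, n - win + 1):                  # temporal stride of one
        p = predict_window(video_frames[start:start + win])
        acc[start:start + win] += p
        cnt[start:start + win] += 1
    return acc / np.maximum(cnt, 1)                      # averaged per-frame probabilities
```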
4.5 Results
The results of the HS-I3D model are compared to the baseline approach in Table 1. Our model was able to surpass the baseline F1-score by 0.307, which is a significant improvement. HS-I3D using the averaged output from all temporal levels has a 0.022 performance improvement over just using the main temporal level x output.
Table 1. Performance of the ensemble HS-I3D models on the LSE eSaude UVIGO test set.

Model                            F1-score
Baseline                         0.300
HS-I3D ([x])                     0.585
HS-I3D (avg[x,x4,x8,x16,x32])    0.607
4.6 Ablation Studies
In this section, we analyze the impact of the hierarchical network head at different levels. Then, we demonstrate the importance and benefits of different random sampling probabilities. We report results with models trained only on the training dataset (MSSL Train Set) and evaluated on the validation dataset (MSSL Val Set).

Temporal Levels. In Table 2, we demonstrate the impact of the different temporal levels on the precision, recall and overall F1-score. We find that the coarsest temporal level, x4, which makes a prediction every 8 frames, has the lowest F1-score, while the temporal level x8 has the highest F1-score. Averaging the prediction probabilities shows a small improvement in the F1-score, with the highest precision compared to the individual temporal levels.

Table 2. Impact of the different temporal levels from the HS-I3D model on the LSE eSaude UVIGO validation set.
Temporal level                 Precision  Recall  F1-score
x4                             0.628      0.496   0.554
x8                             0.631      0.543   0.583
x16                            0.643      0.531   0.581
x32                            0.634      0.513   0.567
x                              0.636      0.498   0.558
Average (x4, x8, x16, x32, x)  0.655      0.531   0.586
In Fig. 3 we show the prediction probabilities of the different temporal levels. In the top diagram we tend to see that the temporal level x4 probabilities start to identify the sign slightly earlier and end slightly later compared to the more fine-grained temporal levels. This has a negative impact when instances of the same sign are in close proximity, as shown in the bottom diagram for the sign CERTIFICADO, where the distinction is not as easy for temporal level x4 compared to the other temporal levels.
Fig. 3. Comparison of prediction probabilities (y-axis) between different temporal levels over the frames (x-axis) in a video sequence. The graphs show probabilities for the signs CIRCULACIÓN(s) (top) and CERTIFICADO (bottom) over a continuous sign video, with the ground-truth interval highlighted by the red region. (Color figure online)
Random Sampling Probability. In Table 3, we demonstrate the impact of the sampling on the precision, recall and overall F1-score. We note a strong correlation between precision and the random sampling probability, where an rsp of 0.5 improves precision by 0.216 compared to a model trained with an rsp of 0.0. While a higher rsp increases the precision, there is a trade-off in the recall, where we find a significant decrease of 0.117.

Table 3. Impact of random sampling probabilities on the LSE eSaude UVIGO validation set.
Random sampling prob.     Precision  Recall  F1-score
0.0                       0.439      0.648   0.523
0.1                       0.621      0.553   0.584
0.5                       0.655      0.531   0.586
Ensemble (0.0, 0.1, 0.5)  0.634      0.576   0.604
In Fig. 4 we show the prediction probabilities for the sign HÍGADO over time in a continuous sign video. We find that a model trained with an rsp of 0.0 is able to recall more instances of the sign, but at the cost of producing more false positives than a model trained with an rsp of 0.5.
Fig. 4. Comparison of the probabilities (y-axis) of spotting HÍGADO (blue line) over frames (x-axis) within a continuous sign video with the ground-truth intervals (red regions). The top graph shows the probabilities obtained with models trained with rsp = 0.0, while the bottom graph shows the probabilities obtained with models trained with rsp = 0.5. (Color figure online)
We find improved results, i.e. better balance between the precision and recall, with an overall higher F1-score, by ensembling 3 different rsp models. The ensembling is done by averaging the prediction probabilities.
5 Conclusion
In this paper, we propose the novel Hierarchical Sign I3D (HS-I3D) model, which takes advantage of the intermediary layers of an I3D model for coarse-to-fine temporal sign representations. We develop a hierarchical network head which combines spatio-temporal features for spotting signs at various temporal resolutions. Our approach is able to make use of existing I3D models pretrained on large-scale SLR datasets. We demonstrate the effectiveness of our model and show the importance of random data sampling, noting the trade-off between precision and recall. Our model achieves state-of-the-art performance of 0.607 F1-score on the ChaLearn 2022 MSSL track for the LSE eSaude UVIGO dataset, winning the Multiple Shot Supervised Learning sign spotting challenge. Future work involves the exploration of better ensemble techniques with the different rsp models and temporal windows, instead of simple averaging of prediction probabilities. Additionally, keypoint-based models, such as Pose→Sign or GCNs, can be adapted to work with our hierarchical network head as an alternative input modality for sign spotting.
Acknowledgments. This work received funding from the SNSF Sinergia project 'SMILE II' (CRSII5 193686), the European Union's Horizon 2020 research and innovation programme under grant agreement no. 101016982 'EASIER' and the EPSRC project 'ExTOL' (EP/R03298X/1). This work reflects only the authors' view and the Commission is not responsible for any use that may be made of the information it contains.
Multi-modal Sign Language Spotting by Multi/One-Shot Learning

Landong Liu, Wengang Zhou(B), Weichao Zhao, Hezhen Hu, and Houqiang Li(B)

CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China (USTC), Hefei, China
{china07,saruka,alexhu}@mail.ustc.edu.cn, {zhwg,lihq}@ustc.edu.cn
Abstract. The sign spotting task aims to identify whether and where an isolated sign of interest exists in a continuous sign language video. Recently, it has received substantial attention since it is a promising tool to annotate large-scale sign language data. Previous methods utilized multiple sources of available supervision information to localize the sign actions in the RGB domain. However, these methods overlook the complementary nature of different modalities, i.e., RGB, optical flow, and pose, which are beneficial to the sign spotting task. To this end, we propose a framework to merge multiple modalities for multiple-shot supervised learning. Furthermore, we explore the sign spotting task in the one-shot setting, which needs fewer annotations and has broader applications. To evaluate our approach, we participated in the Sign Spotting Challenge, organized by ECCV 2022. The competition contains two tracks, i.e., multiple-shot supervised learning (MSSL for track 1) and one-shot learning with weak labels (OSLWL for track 2). In track 1, our method achieves around 0.566 F1-score and is ranked 2nd. In track 2, we are ranked 1st with a 0.6 F1-score. These results demonstrate the effectiveness of our proposed method. We hope our solution will provide some insight for future research in the community.
Keywords: Sign language spotting · Action localization · One-shot learning

1 Introduction
Sign spotting is a task that aims to determine the onset and offset of a specific sign category in a continuous sign language video. The results of sign spotting can be used in downstream tasks such as sign language translation [5] and recognition [23]. Moreover, it is significantly useful for automatically annotating datasets, which is a gloss-free [15] approach. For instance, datasets like BSL-1K [1] utilize the sign spotting technique to replace manual annotation of sign language data. This technique is expected to significantly lower costs and speed up the labeling process.
However, current sign spotting algorithms still suffer from the following limitations. (1) It is very difficult to utilize data-driven algorithms due to the lack of data. (2) Different signers exhibit variations (fast or slow, left-handed or right-handed), and there are visually similar sign language words. (3) Sign actions exhibit co-articulated characteristics in continuous video. Traditional sign spotting methods mainly deploy handcrafted features like SURF, SIFT, and HOG to reduce the reliance on big data. However, these methods achieve relatively poor performance. Later, researchers studied sign spotting either with dynamic time warping and skin color histograms [32] or with hierarchical sequential patterns [25]. These approaches consider a fully-supervised setting with a single source of supervision. Recently, some methods have conducted a thorough analysis of sophisticated sign spotting algorithms like contrastive learning and Transformers [3,31] with auxiliary information like mouth shapes [1] and subtitles [3]. However, these methods generally overlook the importance of multi-modal features, i.e., RGB, optical flow, and pose, which are beneficial to the sign spotting task. To explore these different modalities, we fully exploit them in different settings of sign spotting. Inspired by the success of action localization and one-shot learning, we borrow a mature action localization algorithm for multiple-shot supervised learning (MSSL) (track 1) [11]. Specifically, we merge RGB, optical flow, and pose features, which exhibit complementary effects. The fused features are then fed into a multi-scale Transformer encoder. With the powerful Transformer backbone, we can obtain more powerful hidden-state features. Finally, we utilize a lightweight 1D-CNN decoder to predict whether an action exists and where its onset and offset are. For track 2 [11], i.e., one-shot learning with a weak label, we propose a new graph-based one-shot learning method. Specifically, we first extract pose features and then build a graph based on the cosine similarity between them. After that, we use a longest-path search algorithm to match query and reference videos. The rest of this paper is organized as follows. In Sect. 2, we provide a review of related work, including sign spotting, action localization, and video copy detection. In Sect. 3, we present the top-winning method in OSLWL and the 2nd-place strategy for MSSL. In Sect. 4, we show ablation studies and additional experiments. In Sect. 5, we conclude the paper and discuss future research.
2 Related Work
In this section, we briefly review the related topics, including sign language spotting, action localization, and video copy detection. Sign Spotting. The objective of sign spotting is to find temporal boundaries between signs in continuous sign language. Sign spotting has been proposed to ease the workload of human labeling. Albanie et al. have pioneered the use of sign spotting to automatically annotate datasets like BSL-1K [1]. Additionally, the phenomenon of co-articulation between sign language motions,
Fig. 1. The annotations of key points obtained by MMPose [9].
which confounds the model, increases the difficulty of sign spotting. To address this issue, some researchers advocate using auxiliary cues like mouth shape and video subtitles [1,24,28] to assist in training. Action Localization. The methods of action localization can be divided into two categories, i.e., two-stage methods and one-stage methods. The two-stage methods start by generating a large number of proposals, classifying them and regressing the onset and offset of the action, and then use post-processing techniques like non-maximal suppression (NMS) to obtain the outcome. Detecting action boundaries [12,17,19,22,36], graph representations [2,34], and Transformers [7,30,33] are examples of representative works. Most one-stage methods incorporate the anchor concept and are based on temporal detection. For instance, Lin et al. [18] localize actions via a detection scheme [21]. We use the action localization method ActionFormer [35] for MSSL, the competition's track 1. Video Copy Detection. Video copy detection determines whether and where a long video contains a variant of a short video. Early works use the Hough voting algorithm [10] and a network flow algorithm [29] on a cosine similarity graph to detect the target video. However, they use hand-crafted features, which limits their performance. In [14], researchers resort to CNN features to further improve the performance. This task is similar to the one-shot setting of sign spotting. However, the latter is more challenging since it involves temporal variations between the isolated sign and its continuous counterpart.
3 Methods
In this section, we will first introduce the feature extraction for RGB, optical flow, and pose, respectively. Then, we present our solutions to multiple-shot supervised learning (MSSL) and one-shot learning with weak labels (OSLWL) tasks separately.
Fig. 2. Two examples of our utilized modality visualization results are presented. (a) RGB information. (b) Skeleton information generated by MMPose. (c) Optical flow extracted by TV-L1 flow.
3.1 Feature Extraction
We utilize multiple modalities to extract robust spatio-temporal feature representations depicting the continuous sign video.

Preprocessing. Since the signer only occupies a relatively small part of the original video frame, we utilize MMDetection [8] to first detect the signer's spatial location, followed by MMPose [9] (Fig. 1) to extract the body and hand poses. Then, the detected pose is utilized to crop the upper-body patch of the signer. Besides, the optical flow is calculated by the TV-L1 algorithm. Finally, three types of data are obtained, i.e., RGB, flow and pose modalities (Fig. 2).

RGB. Given the cropped RGB video V_r, we utilize the I3D backbone [6] pretrained on BSL-1K [1] to extract representations F_r. Specifically, we extract clip-level features with a receptive field of 8 and a stride of 2.

Optical Flow. Since the features extracted from RGB images are easily affected by the appearance of signers, we also extract motion information from the flow modality. First, we use the TV-L1 algorithm to extract optical flow, and then we utilize the I3D backbone [6] pre-trained on AUTSL [27] to extract flow features F_f with the same receptive field and stride setting.

Pose. To provide sufficient sign language information for downstream tasks, we also extract the body and hand cues with a pre-trained GCN [4,13]. The pose sequence J_p includes three parts, i.e., right hand J_right, left hand J_left and body J_body. The pose sequence is fed into the GCN backbone and converted into frame-level features F_out, which is computed as follows:

$$F_{out} = D^{-\frac{1}{2}}(A + I)\,D^{-\frac{1}{2}}\,J_p\,W, \qquad (1)$$

where the adjacency matrix A represents intra-body connections, the identity matrix I represents self-connections, D is the diagonal degree matrix of (A + I), and W is a trainable weight matrix of the convolution. Then, we utilize an average pooling operation with the same receptive field and stride setting to generate pose features F_p.
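A minimal PyTorch sketch of the graph convolution in Eq. (1), assuming a fixed adjacency matrix over the selected joints; the dense-matrix formulation and tensor shapes are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer: F_out = D^{-1/2} (A + I) D^{-1/2} J_p W."""

    def __init__(self, adjacency: torch.Tensor, in_dim: int, out_dim: int):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))      # A + I (self-connections)
        deg = a_hat.sum(dim=1)                                # diagonal degree of (A + I)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # trainable W

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (batch, num_joints, in_dim) pose coordinates/confidences per joint
        return self.weight(self.norm_adj @ joints)
```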
3.2 MSSL Framework
Given the representations of the multiple modalities, we first utilize the multi-scale Transformer encoder to map these features into a multi-scale feature representation F = {F_1, F_2, ..., F_L}.
Fig. 3. The detailed structure of our framework in MSSL. We utilize the strong Transformer backbone to extract multi-modal features, followed by a lightweight CNN decoder to output detected actions.
Then, our model decodes the feature pyramid F into the sequence label $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_T\}$, where $\hat{y}_i = (p_i, start_i, end_i)$. The detailed workflow of our MSSL algorithm is illustrated in Fig. 3.

Multi-scale Transformer Encoder. We leverage the strong modeling capability of the Transformer to perform localization similar to [35]. Specifically, it combines a multi-scale feature representation with local self-attention. We use three fully connected modules ($W_Q \in \mathbb{R}^{D \times D_Q}$, $W_K \in \mathbb{R}^{D \times D_K}$, $W_V \in \mathbb{R}^{D \times D_V}$) to project the features into a suitable space, obtaining the query, key, and value with $D_K = D_Q$. In our experiments we set $D_Q = D_K = D_V = 512$. The formulas are as follows:

$$F_0 = \mathrm{concatenate}[F_r, F_f, F_p], \qquad Q = F_l \cdot W_Q, \quad K = F_l \cdot W_K, \quad V = F_l \cdot W_V. \qquad (2)$$

Then, we use softmax to calculate the attention map and retrieve useful information from the value features:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{D_Q}}\right)V. \qquad (3)$$

To jointly attend to information from different representation sub-spaces at different positions, researchers propose multi-head self-attention (MSA):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \cdots, head_n)\,W, \quad \text{where } head_i = \mathrm{Attention}(Q, K, V). \qquad (4)$$
Once self-attention is defined, to gather information at multiple scales, we use a 2x downsampling module between every Transformer block, as follows:

$$\bar{F}^l = \mathrm{MSA}(\mathrm{LN}(F^{l-1})) + F^{l-1}, \qquad \tilde{F}^l = \mathrm{MLP}(\mathrm{LN}(\bar{F}^l)) + \bar{F}^l, \qquad F^l = \mathrm{downsampling}(\tilde{F}^l), \qquad (5)$$

where MLP denotes several fully connected modules with GELU introducing nonlinearity, LN is the abbreviation of layer normalization, and downsampling is a depthwise 1D convolutional layer with kernel size 1 and stride 2, chosen for its speed and small number of parameters. Finally, following ActionFormer [35], we apply another lightweight 1D convolutional network to each pyramid level $F^l$, with parameters shared across levels.

Lightweight Decoder. The decoder is a lightweight 1D convolutional network, including a classification head and a regression head, which classify every moment in time and regress the corresponding action boundaries, respectively. We use NMS as post-processing to obtain the final proposals.

Objective Function. During training, $\mathcal{L}_{cls}$ is the focal loss [20] for sign action classification and $\mathcal{L}_{reg}$ is the generalized IoU loss [26] for distance regression:

$$\mathcal{L} = \sum_{time}\left(\frac{1}{T}\,\mathcal{L}_{cls} + \frac{\lambda_{reg}}{T_{pos}}\,\mathbb{1}\,\mathcal{L}_{reg}\right), \qquad (6)$$
where T is the length of the reference video, $T_{pos}$ is the total number of positive samples, $\mathbb{1}$ is an indicator function denoting whether time step t falls within an action, and $\lambda_{reg}$ is a parameter balancing the two losses.
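As a rough illustration of the encoder-decoder structure in Eqs. (2)-(6), the following PyTorch sketch stacks Transformer blocks separated by stride-2 depthwise 1D convolutions and applies shared 1D-convolutional classification/regression heads to every pyramid level. It is a simplified approximation rather than the authors' ActionFormer-based implementation; the use of full (rather than local) self-attention via nn.MultiheadAttention, the number of blocks and the head design are assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # depthwise 1D conv, kernel 1, stride 2: halves the temporal length (Eq. 5)
        self.down = nn.Conv1d(dim, dim, kernel_size=1, stride=2, groups=dim)

    def forward(self, x):                                  # x: (batch, T, dim)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual self-attention
        x = x + self.mlp(self.ln2(x))                      # residual MLP
        return self.down(x.transpose(1, 2)).transpose(1, 2)

class MSSLHead(nn.Module):
    """Shared lightweight 1D-conv classification and regression heads."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # distances to onset/offset

    def forward(self, pyramid):                            # pyramid: list of (batch, T_l, dim)
        outs = []
        for f in pyramid:
            f = f.transpose(1, 2)
            outs.append((self.cls(f), self.reg(f)))
        return outs

# Usage sketch: build the feature pyramid F = {F_1, ..., F_L} from fused features F_0.
# blocks = nn.ModuleList([EncoderBlock(512) for _ in range(4)])
# pyramid, x = [], fused_features                          # fused_features: (batch, T, 512)
# for blk in blocks:
#     x = blk(x); pyramid.append(x)
# predictions = MSSLHead(512, num_classes)(pyramid)
```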
3.3 OSLWL Framework
In this section, we introduce our framework for the OSLWL task. Due to the diversity of appearances among different signers, we utilize the appearance-invariant features F_p from the pose modality. Then we build a similarity graph to perform frame-wise matching between the isolated query sign video and the long reference video, as shown in Fig. 4.

Sign Instance Matching. We first calculate the cosine similarity between the pose feature of each frame in the query video and each frame in the reference video, obtaining a two-dimensional matrix $sims \in \mathbb{R}^{T_1 \times T_2}$, which is computed as follows:

$$sims[t_1, t_2] = \frac{F_{t_1}^T \cdot F_{t_2}}{\|F_{t_1}\|\,\|F_{t_2}\|}, \quad t_1 \in \{1, 2, \cdots, T_1\},\ t_2 \in \{1, 2, \cdots, T_2\}, \qquad (7)$$

where $F_{t_1}$ and $F_{t_2}$ denote the frame features from the query and reference video, respectively, and $T_1$ and $T_2$ denote the number of frames in the query and reference video, respectively. Then we create a graph. We use a tuple (x, y) to represent each point in the graph, where x denotes a query video frame and y denotes a reference
Fig. 4. The workflow diagram of the OSLWL track. We use pose features to calculate cosine similarity and then use the longest path algorithm from graph theory to maximize the matching score.
video frame. All directed edge weights connected to node (x, y) are set to sims[x, y]. Our objective in this directed similarity graph is to find the longest weighted path, i.e., to maximize the accumulated cosine similarity. This path identifies the portions of the query video and the reference video that are most similar. Additionally, for the matched sequence to be consistent with the query and reference video sequences, the path must not roll back, and time must be monotonically increasing. Therefore, we connect a directed edge between two points $(x_1, y_1)$ and $(x_2, y_2)$ if they simultaneously satisfy the following four conditions:

$$x_1 \le x_2, \quad x_2 - x_1 \le max_{step}, \quad y_1 \le y_2, \quad y_2 - y_1 \le max_{step}, \qquad (8)$$

where we set max_step = 4. We artificially introduce a start node (0, 0) and an end node (T_1, T_2) into the graph. Since the reference video may be very long and there may be too many nodes in the graph, we utilize a pruning operation to decrease the time complexity: for each frame of the query, we keep the top-k = 16 reference frames with the highest cosine similarity in the graph and remove the other edges. Finally, the longest weighted path in the graph is selected, and its start and end positions in the reference video (y_min to y_max) are extracted as the result
of query and reference matching. In the experiments section, we discuss the meaning of max_step and top-k in detail.

Select Proper Features. It is well known that signs convey information through hand pose. To this end, we adaptively utilize the proper parts of the pose feature for sign instance matching. Given a query video, we first extract its pose information. Since the pose feature includes three parts, i.e., right hand, left hand and upper body, if OpenPose cannot detect the left-hand information, we delete the corresponding parts of the pose feature of the query and reference videos.

Extract Key Frames. The onset and offset of the query video usually contain a series of invalid frames, which are harmful to sign instance matching. To remove these invalid frames, we use the coordinates of the key points to detect the beginning and end of the query video. Specifically, given the query video, we select the continuous frames in which the hands are above the elbow joints as the effective duration T_1. Further, we keep a specific proportion of the effective duration as the final query video.

Use Polynomial Kernel. To enhance the discrimination of the matching results, we apply a polynomial kernel to the cosine similarity, which is computed as follows:

$$sims\_kernel[t_1, t_2] = sims^{\alpha}[t_1, t_2], \qquad (9)$$

where $\alpha$ denotes the power coefficient of the cosine similarity. We finally use sims_kernel instead of sims in our experiments.
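A compact sketch of the matching procedure described above: cosine similarities sharpened with the polynomial kernel, top-k pruning per query frame, and a dynamic-programming search for the maximum-weight monotonic path under the max_step constraint. This is an illustrative re-implementation under those assumptions (it omits the virtual start/end nodes), not the authors' released code.

```python
import numpy as np

def spot_sign(query_feat, ref_feat, alpha=4, max_step=4, top_k=16):
    """Return (start, end) frame indices in the reference video that best match the query.

    query_feat: (T1, D) pose features of the (trimmed) query video.
    ref_feat:   (T2, D) pose features of the long reference video.
    """
    q = query_feat / np.linalg.norm(query_feat, axis=1, keepdims=True)
    r = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    sims = (q @ r.T) ** alpha                               # Eq. (9): polynomial kernel

    # keep only the top-k reference frames per query frame (graph pruning)
    nodes = []
    for x in range(sims.shape[0]):
        for y in np.argsort(sims[x])[-top_k:]:
            nodes.append((x, int(y)))
    nodes.sort()                                            # process nodes in (x, y) order

    best = {n: sims[n] for n in nodes}                      # best path score ending at node n
    prev = {n: None for n in nodes}
    for i, (x2, y2) in enumerate(nodes):
        for x1, y1 in nodes[:i]:
            ok = x1 <= x2 and y1 <= y2 and x2 - x1 <= max_step and y2 - y1 <= max_step
            if ok and best[(x1, y1)] + sims[x2, y2] > best[(x2, y2)]:
                best[(x2, y2)] = best[(x1, y1)] + sims[x2, y2]
                prev[(x2, y2)] = (x1, y1)

    # trace back the highest-scoring path and report its reference-frame span (y_min, y_max)
    end, ys = max(best, key=best.get), []
    node = end
    while node is not None:
        ys.append(node[1])
        node = prev[node]
    return min(ys), max(ys)
```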
4 Experiments

4.1 Experiment Settings
Basic Settings. In MSSL, all networks were implemented in PyTorch and trained on an NVIDIA RTX 3090. In OSLWL, we simply use CPU resources for feature matching. In MSSL, we utilize the Adam optimizer with a peak learning rate of 2e−4. The training lasts 100 epochs, with 10 epochs as warm-up. During inference, we take the output for every time step, which is organized as a triplet of sign action confidence score, onset, and offset of the action. These candidates are further processed via NMS to remove highly overlapping instances, which leads to the final localization outputs. Then, using the least squares method, we fit a linear relationship between the length of a training video and the number of actions it contains.

Metrics. We use the IoU between the ground-truth interval and the predicted video range to evaluate the score per sign instance. Specifically, the IoU threshold 't' ranges from 0.2 to 0.8 in steps of 0.05. Given a specific reference video file 'n', query video file 'k', and query sign 'i' to spot, we can calculate the True Positives, False Negatives,
Table 1. Effectiveness of the input modality on the CSSL MSSL validation dataset. 'RGB', 'Pose' and 'Flow' denote the different input modalities. The final row, marked with asterisks, denotes that the three modalities are summed directly rather than concatenated.

RGB  Pose  Flow  F1 Score
✓                0.447
     ✓           0.492
           ✓     0.354
✓    ✓           0.564
✓          ✓     0.547
     ✓     ✓     0.525
✓    ✓     ✓     0.594
✓*   ✓*    ✓*    0.503
and so on according to the following formulas:

$$IoU_i = \frac{Ret_i \cap Rel_i}{Ret_i \cup Rel_i},$$
$$TP = \sum_t \sum_n \sum_i \sum_k \big(IoU_i(k, n) \ge t\big), \quad FP = \Big(\sum_t \sum_n \sum_i \sum_k 1\Big) - TP, \quad FN = \Big(\sum_t \sum_n GT_n\Big) - TP, \qquad (10)$$
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$

where $Ret_i$ denotes the predicted intervals of sign 'i' and $Rel_i$ the ground-truth intervals of sign 'i'. Finally, we use the F1-score (the harmonic mean of precision and recall) to evaluate how good an algorithm is.

Data Augmentation. In MSSL, due to the small scale of the training data, we utilize data augmentation to strengthen the robustness of our framework. The data augmentation covers three aspects, i.e., the spatial, temporal, and feature levels. At the spatial level, we utilize a random center crop operation on the RGB images and slightly perturb the coordinates of the hand and body poses. At the temporal level, we utilize the temporal shift module proposed in [16] to enhance temporal consistency. At the feature level, we apply random dropout to the input feature sequence with a probability of 0.5. Our framework achieves better performance with the above-mentioned augmentations.

Datasets. We conduct all experiments on the public sign spotting datasets, i.e., CSSL MSSL and CSSL OSLWL (https://chalearnlap.cvc.uab.cat/dataset/42/description/).
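For concreteness, a small sketch of the evaluation in Eq. (10), assuming that matched prediction/ground-truth intervals are available as (start, end) frame pairs per sign instance; the data layout and the matching step itself are illustrative simplifications of the official protocol.

```python
import numpy as np

def interval_iou(pred, gt):
    """IoU of two 1-D intervals given as (start, end)."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def f1_score(pairs, num_gt, thresholds=np.arange(0.2, 0.8001, 0.05)):
    """pairs: list of (pred_interval, gt_interval) per spotted sign instance;
    num_gt: total number of ground-truth instances (per threshold)."""
    tp = sum(interval_iou(p, g) >= t for t in thresholds for p, g in pairs)
    fp = len(pairs) * len(thresholds) - tp
    fn = num_gt * len(thresholds) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```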
Table 2. Effectiveness of data augmentation on the CSSL MSSL dataset. 'Spatial', 'Temporal' and 'Feature' denote the different augmentations applied during the training procedure. The rows correspond to different augmentation combinations, with F1 scores of 0.532, 0.575, 0.587, 0.563 and 0.594; the best result (0.594) is obtained when all three augmentations are applied.

Table 3. Ablation of the "power" parameter on the CSSL OSLWL dataset.

Power  F1 Score
1      0.629
2      0.644
3      0.644
4      0.645
5      0.642

Table 4. Ablation of the "max step" parameter on the CSSL OSLWL dataset.

max step  F1 Score
1         0.531
2         0.609
3         0.636
4         0.645
5         0.644

Table 5. Ablation of the "top-k" parameter on the CSSL OSLWL dataset.

Top-k  F1 score
12     0.636
13     0.637
14     0.640
15     0.643
16     0.645
17     0.640
18     0.636

Table 6. Ablation of the "percent" parameter on the CSSL OSLWL dataset.

Percent  F1 score
60%      0.629
65%      0.639
70%      0.645
75%      0.645
80%      0.643
85%      0.637
90%      0.625
95%      0.606
CSSL MSSL. CSSL MSSL is a dataset of Spanish sign language in the health domain, which contains samples of trimmed continuous sign language videos. The dataset includes 270 sign videos and 60 signs performed by 9 persons. All samples are split into three parts, i.e., 112 for training, 80 for validation, and 78 for testing. Since the annotation of the test set is unavailable, we verify the performance of our proposed method on the validation set. CSSL OSLWL. CSSL OSLWL is divided into two parts: query and reference. The query dataset contains 40 short videos of 40 isolated signs. The reference dataset comprises 1500 videos, performed by 5 people. For gender balance, both datasets have male and female signers.
4.2 Ablation on MSSL
In this section, we perform several ablation studies on the data augmentation and input modalities of our proposed method in MSSL. As shown in Table 1, we experiment with different modalities. We observe that the F1-score only reaches about 0.49 at best when a single modality is used as input. We boost the performance by merging the features of the three modalities, which supports our assumption that multiple modalities provide richer information. Concatenation produces better results than summation when aggregating the three modalities. We also validate the effectiveness of data augmentation during training in Table 2. Each of these data augmentations enhances the robustness of our proposed method, and the best result is achieved by applying all three proposed augmentations.
4.3 Ablation on OSLWL
In this section, we perform several ablation studies on the hyper-parameters of our proposed method in OSLWL. There are four hyper-parameters in our ablation: top-k in Table 5, max_step in Table 4, power in Table 3 and percent in Table 6. We also conduct an ablation in Table 7, illustrating that removing outlier features can boost the performance. We now explain what these four parameters mean. For top-k, we want to reduce the number of nodes in the graph to lower the time complexity of the algorithm. For each frame of the short video, we keep the top-k frames with the highest cosine similarity in the long video as graph nodes, and the remaining point pairs are removed from the graph. For max_step, we do not link two nodes $(x_1, y_1)$ and $(x_2, y_2)$ in the graph if

$$|x_2 - x_1| > max_{step} \quad \text{or} \quad |y_2 - y_1| > max_{step}. \qquad (11)$$

The optimal path is more continuous with a lower max_step value, and the target path becomes slightly discontinuous when using a higher max_step. As observed in Table 4, performance is highest when max_step is 4. For power, increasing the 'power' of the cosine similarity further enhances our algorithm's performance. As observed in Table 3, the best performance reaches 0.645 when the 'power' is 4, and the performance decreases when the 'power' increases further. For percent, once we obtain the onset and offset of the query videos using pose information, we keep 'percent' of the center frames as key frames, to let the algorithm focus on important frames. The performance improves greatly once we reduce the number of key frames, achieving a 0.645 F1-score when 'percent' is 70%. Grid search is used first to determine the general location of the optimal parameters, and line search is then used to obtain more precise results. In our best setting, top-k, max_step, power, and percent are set to 16, 4, 4, and 70%, respectively. The ablation results for the line search are reported in Tables 3, 4, 5 and 6.
Table 7. Ablation of adaptively utilizing the proper parts of the pose feature on the CSSL OSLWL dataset. This experiment is conducted with the best settings.

Full feature F1 Score    Remove left hand F1 Score
0.641                    0.645
The tables above show that performance is significantly impacted by max_step. The performance drops to as low as 0.531 when it equals 1. As max_step increases, the performance first ascends and then descends, reaching its maximum when max_step equals 4. Moreover, fine-tuning top-k, power, and percent has a positive impact on OSLWL performance, although the improvement is not as large as for max_step. We also experiment with optical flow features using the best parameters described above. The resulting F1-score is 0.186, which is much lower than the result using the pose feature; we speculate that this is because the optical flow features are too noisy. Additionally, we attempt to replace the cosine similarity with other similarity functions such as MSE. Since a smaller MSE is better, we would have to find the path with the smallest total edge weight. However, because all MSE edge weights are positive, the shortest path generally contains just one step, and the outcome is clearly incorrect. For cosine similarity, the path weights lie in [−1, 1], both positive and negative, so the same problem does not arise.
5 Conclusion
In this work, we explore suitable architectures for different settings of sign spotting. Specifically, the overall framework contains two stages, i.e., feature extraction from multiple modalities and sign spotting for two different settings. Different from previous methods that focus on uni-modal features, we fully exploit the complementarity of multiple modalities, i.e., RGB, optical flow, and pose. To this end, we propose a framework to merge the multiple modalities for multiple-shot supervised learning. Furthermore, we address the one-shot sign spotting setting with a training-free method. Specifically, we introduce the sign instance matching algorithm to better localize the sign language action in the reference video. To validate the effectiveness of our approach, we participated in the Sign Spotting Challenge, organized by ECCV 2022. The competition includes two tracks, i.e., multiple-shot supervised learning (MSSL for track 1) and one-shot learning with weak labels (OSLWL for track 2). Finally, we achieve around 0.566 F1-score (rank 2) in track 1 and first place with a 0.6 F1-score in track 2, which validates the superiority of our proposed method.
5.1 Limitation and Future Work
MSSL Task. For the MSSL task, we list three limitations. First, the features used in our proposed method are not fine-tuned on the CSSL MSSL dataset. We attempted to fine-tune the RGB features on CSSL MSSL, but the dataset is too small, which makes overfitting easy. Second, the number of actions in a video is predicted with a linear model, and there is some difference between the predicted and actual number of actions; future methods may remedy this problem. Third, a straightforward concatenation method is used for multi-feature fusion; in the future, a well-designed fusion network might produce better outcomes.

OSLWL Task. For the OSLWL task, we find that our method is sensitive to hyper-parameters, especially the parameter "max step". Different parameter settings can affect performance by up to 10 points. Future research might concentrate on decreasing the algorithm's sensitivity to its parameters.

Acknowledgement. This work was supported by the National Natural Science Foundation of China under Contract U20A20183. It was also supported by the GPU cluster built by the MCC Lab of Information Science and Technology Institution, USTC.
References 1. Albanie, S., et al.: BSL-1K: scaling up co-articulated sign language recognition using mouthing cues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 35–53. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58621-8 3 2. Bai, Y., Wang, Y., Tong, Y., Yang, Y., Liu, Q., Liu, J.: Boundary content graph neural network for temporal action proposal generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1 8 3. Bull, H., Afouras, T., Varol, G., Albanie, S., Momeni, L., Zisserman, A.: Aligning subtitles in sign language videos. In: ICCV, pp. 11552–11561 (2021) 4. Cai, Y., et al.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: ICCV, pp. 2272–2281 (2019) 5. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: CVPR, pp. 7784–7793 (2018) 6. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: CVPR, pp. 6299–6308 (2017) 7. Chang, S., Wang, P., Wang, F., Li, H., Feng, J.: Augmented transformer with adaptive graph for temporal action proposal generation. arXiv (2021) 8. Chen, K., et al.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv (2019) 9. MMP Contributors: OpenMMLab pose estimation toolbox and benchmark (2020). https://github.com/open-mmlab/mmpose 10. Douze, M., J´egou, H., Schmid, C., P´erez, P.: Compact video description for copy detection with precise temporal alignment. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 522–535. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15549-9 38
11. Enriquez, M.V., Alba-Castro, J.L., Docio-Fernandez, L., Junior, J.C.S.J., Escalera, S.: ECCV 2022 sign spotting challenge: dataset, design and results. In: ECCVW (2022) 12. Gong, G., Zheng, L., Mu, Y.: Scale matters: temporal scale aggregation network for precise action localization in untrimmed videos. In: ICME, pp. 1–6 (2020) 13. Hu, H., Zhao, W., Zhou, W., Wang, Y., Li, H.: SignBERT: pre-training of handmodel-aware representation for sign language recognition. In: ICCV, pp. 11087– 11096 (2021) 14. Jiang, Y.G., Wang, J.: Partial copy detection in videos: a benchmark and an evaluation of popular methods. TBD 2(1), 32–42 (2016) 15. Li, D., et al.: TSPNet: hierarchical feature learning via temporal semantic pyramid for sign language translation. In: NeurIPS, vol. 33, pp. 12034–12045 (2020) 16. Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV, pp. 7083–7093 (2019) 17. Lin, T., Liu, X., Li, X., Ding, E., Wen, S.: BMN: boundary-matching network for temporal action proposal generation. In: ICCV, pp. 3889–3898 (2019) 18. Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: ACM MM, pp. 988–996 (2017) 19. Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: BSN: boundary sensitive network for temporal action proposal generation. In: ECCV, pp. 3–19 (2018) 20. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: ICCV, pp. 2980–2988 (2017) 21. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 2 22. Liu, Y., Ma, L., Zhang, Y., Liu, W., Chang, S.F.: Multi-granularity generator for temporal action proposal. In: CVPR, pp. 3604–3613 (2019) 23. Min, Y., Hao, A., Chai, X., Chen, X.: Visual alignment constraint for continuous sign language recognition. In: ICCV, pp. 11542–11551 (2021) 24. Momeni, L., Afouras, T., Stafylakis, T., Albanie, S., Zisserman, A.: Seeing wake words: audio-visual keyword spotting. arXiv (2020) 25. Ong, E.J., Koller, O., Pugeault, N., Bowden, R.: Sign spotting using hierarchical sequential patterns with temporal intervals. In: CVPR, pp. 1923–1930 (2014) 26. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: CVPR, pp. 658–666 (2019) 27. Sincan, O.M., Keles, H.Y.: AUTSL: a large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access 8, 181340–181355 (2020) 28. Stafylakis, T., Tzimiropoulos, G.: Zero-shot keyword spotting for visual speech recognition in-the-wild. In: ECCV, pp. 513–529 (2018) 29. Tan, H.K., Ngo, C.W., Hong, R., Chua, T.S.: Scalable detection of partial nearduplicate videos by visual-temporal consistency. In: ACM MM, pp. 145–154 (2009) 30. Tan, J., Tang, J., Wang, L., Wu, G.: Relaxed transformer decoders for direct action proposal generation. In: ICCV, pp. 13526–13535 (2021) 31. Varol, G., Momeni, L., Albanie, S., Afouras, T., Zisserman, A.: Read and attend: temporal localisation in sign language videos. In: CVPR, pp. 16857–16866 (2021) 32. Viitaniemi, V., Jantunen, T., Savolainen, L., Karppa, M., Laaksonen, J.: S-pot-a benchmark in spotting signs within continuous signing. In: LREC (2014) 33. Wang, L., Yang, H., Wu, W., Yao, H., Huang, H.: Temporal action proposal generation with transformers. arXiv (2021)
34. Xu, M., Zhao, C., Rojas, D.S., Thabet, A., Ghanem, B.: G-tad: sub-graph localization for temporal action detection. In: CVPR, pp. 10156–10165 (2020) 35. Zhang, C., Wu, J., Li, Y.: ActionFormer: localizing moments of actions with transformers. arXiv preprint arXiv:2202.07925 (2022) 36. Zhao, P., Xie, L., Ju, C., Zhang, Y., Wang, Y., Tian, Q.: Bottom-up temporal action localization with mutual regularization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 539–555. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3 32
Sign Spotting via Multi-modal Fusion and Testing Time Transferring

Hongyu Fu1,3, Chen Liu1,2, Xingqun Qi1,2, Beibei Lin1, Lincheng Li1, Li Zhang3, and Xin Yu2(B)

1 NetEase Fuxi AI Lab, Hangzhou, China
2 University of Technology Sydney, Sydney, Australia
[email protected]
3 Tsinghua University, Beijing, China
Abstract. This work aims to locate a query isolated sign in a continuous sign video. In this task, the domain gap between the isolated and continuous sign videos often handicaps the localization performance. To address this issue, we propose a parallel multi-modal sign spotting framework. In a nutshell, our framework first takes advantage of multi-modal information (including RGB frames, 2D key-points and 3D key-points) to achieve representative sign features. The multi-modal features complement each other and compensate for the deficiencies of any single modality, leading to informative representations for sign spotting. Moreover, we introduce a testing-time top-k transferring technique into our framework to reduce the aforementioned domain gap. Concretely, we first compare the query sign with extracted sign video clips, and then update the feature of the query sign with the features of the top-k best matching video clips. In this manner, the updated query feature exhibits a smaller domain gap with respect to continuous signs, facilitating feature matching in the following iterations. Experiments on the challenging OSLWL-Test-Set benchmark demonstrate that our method achieves superior performance (0.559 F1-score) compared to the baseline (0.395 F1-score). Our code is available at https://github.com/bb12346/OpenSLR.

Keywords: Sign language spotting · Multi-modal fusion framework · Pose-based feature extraction · RGB-based feature extraction · Top-K transferring technique
1 Introduction
Sign language spotting (SLS) aims to find the accurate locations of given isolated signs in continuous co-articulated sign language videos. As sign language is a common form of communication used in the hearing-impaired community, the sign spotting task plays a pivotal role in various real-world applications [7,8,19,20,24]. Inspired by the great achievements of deep neural networks, RGB image based SLS methods have achieved impressive progress [6,9,10,18,22,37]. However, the domain gap between isolated and continuous sign videos often limits the performance of many RGB-based SLS approaches,
Fig. 1. The domain gap between isolated signs and continuous signs. (a) is an isolated query sign, while the target is the same sign cropped from the continuous sign video (b). It is obvious that the sign in (b) is performed much faster, without hand-up or hand-down actions. In addition, one sign may be performed in many different ways; for example, the signer in (b) performs the sign while holding her left arm up.
as illustrated in Fig. 1. Pose-based SLS methods [1,3,4,15,34] have been proposed to leverage the changes of human joints for sign spotting. Pose-based SLS methods are roughly divided into 2D-pose based approaches and 3D-pose based approaches. Both of them need to extract the key joints of the signer from the given RGB images. However, estimating accurate poses in continuous sign videos is difficult, as motion blur often exists in continuous sign videos, and the blurry frames lead to erroneous pose estimation. To address this problem, we propose a Multi-modal Sign Spotting Framework (MSSF) to generate more discriminative feature representations. Our MSSF is a dual-branch structure, including an RGB-based branch and a pose-based branch. The pose-based branch pays more attention to the motion of different key points. To further capture the changes of key points, we introduce two types of poses, 2D key-points and 3D angle-axis poses. The 2D key-points describe the exact coordinates of each point in an image, while the 3D angle-axis pose represents the angle changes of different points. The RGB-based branch focuses on extracting image details, such as holistic motions and hand shapes, and it also supports feature representation when pose-based methods suffer from severe motion blur and fail to capture accurate joint poses. As a result, a total of three models, including 2D-Pose based SL-GCN [17], 3D-Pose based SL-GCN, and I3D-MLP [3], are employed to extract sign language features. As shown in Fig. 1, the domain gap can be roughly summarized in two aspects: (i) a sign from the query set often contains the actions of hands up and hands down, while its counterpart in a continuous video does not contain such actions; (ii) different signers perform the same sign differently. To address these problems, we first trim the isolated sign videos by clipping the video frames before hands up and after hands down. Compared with the video before trimming, the trimmed sign videos look more similar to sign instances in a continuous video (Fig. 1).
Fig. 2. Illustration of the manual trimming and testing-time Top-K transferring strategies. Triangles represent query features, circles represent gallery features, and different colors represent different categories. (a) The feature distributions of isolated signs and continuous signs. (b) The feature distributions after manually trimming the isolated sign videos, where the domain gap has been coarsely reduced. (c) The feature distribution after applying testing-time Top-K transferring.
Moreover, we propose a testing-time top-k transferring technique to alleviate the domain gap between the assigned isolated signs and the continuous sign videos. To be specific, we first generate all feature representations from the query set. For each video in the test set, we utilize a sliding window to crop multiple clips. Then, all cropped clips are fed into our MSSF to generate feature representations. Next, we compare the feature representation of the query sign (specified by the video name) with all feature representations of the cropped video clips, and find the most similar clip to the query one. After all results have been retrieved, each sign from the query set obtains many similar clips. We then sort the distances of all retrieved results in ascending order and find the most similar K clips. Finally, each feature representation from the query set is updated by the average of the feature representations of the top-K retrieved clips. The updated feature representations are used to retrieve the test data in the following iteration. We conduct this top-K retrieval a few times to further reduce the domain gap in the retrieval. Overall, the contributions of this paper are summarized as follows: – We propose a multi-modal sign spotting framework which comprehensively extracts sign language features from RGB images, 2D-pose based and 3D angle-axis pose based joints, leading to representative sign features for sign spotting. – We propose a testing-time Top-K transferring technique to address the domain gap between the gallery set and the query set and further improve the sign retrieval performance. – Extensive experiments conducted on the OSLWL-Test-Set dataset demonstrate the superiority of our method compared with the baseline model. Remarkably, our method surpasses the baseline provided by the challenge organizer by a large margin of 41.5% on the F1-score and is the runner-up in the sign spotting challenge held in conjunction with ECCV 2022.
2 Related Work
Sign language spotting has achieved significant progress in recent years due to the powerful feature representations of deep neural networks. According to the input data types, sign-spotting methods can be grouped into three classes: RGB-based, pose-based, and multi-modal approaches.

RGB-Based Sign Spotting Methods: RGB images are a widely used data type in sign language spotting tasks, and informative feature representations play a key role in sign spotting. To obtain effective feature representations, some studies first segment hand regions or remove face regions before extracting features [6,16,22,28]. Using colored gloves is a convenient way to attain better segmentation performance and solve hand occlusion problems [11,12,25]. Although the above methods improve sign spotting accuracy, wearing colored gloves is impractical in daily-life communication. Other studies use auxiliary information, such as hand motion speed, trajectory information, and skin colors, to attain high hand region segmentation quality [9,10,37]. The above methods work well in acquiring superb feature representations, but they may neglect the negative effects of the domain gap between isolated signs and continuous signs. The work [21] proposes a method to learn domain-invariant descriptors and coarsely align the two domains, showing that the domain gap is another key factor affecting the quality of feature representations. In this work, we first crop the frames of the gallery videos to coarsely bridge the domain gap and then further propose a novel testing-time Top-K transferring technique to reduce the domain gap.

Pose-Based Sign Spotting Methods: A major challenge in sign language spotting tasks is exploring a distinguishable pattern and localizing it in a continuous sign video. Many studies show that pose data is suitable for sign spotting tasks [4,30,34,36]. Nowadays, skeleton data is popular for sign language spotting tasks. It includes the body key points at bone junctions, such as shoulders, wrists, and elbows. Moreover, multi-modal data, such as bone motion and joint motion, can be derived from the skeleton joint data. Several methods [1,3,5,15] show that the collaboration of modalities promotes spotting accuracy. Pose-based methods can reduce the interference of complex backgrounds and sign-irrelevant information [5]. However, it is difficult to estimate accurate poses from videos captured in the wild, due to occlusions and fast movements. Besides, finger details are crucial for distinguishing similar signs, but poses may be erroneous when frames undergo severe blur. To compensate for the negative impact of inaccurate pose estimation on spotting performance, we introduce two feature extraction branches (RGB-based and pose-based) into our framework.

Multi-modal Sign Language Spotting Methods: Multi-modal methods leverage different data modalities in order to produce a complementary effect in sign feature extraction. The extracted comprehensive features improve sign-spotting performance. To achieve better combination results, the work [14] designs a shared-weights model for multi-modal scenarios. The method [32]
Fig. 3. The overview of our proposed method. Our network contains three branches, including 2D-Pose based SL-GCN, 3D-Pose based SL-GCN, and I3D-MLP, to extract sign language features. The network is trained on isolated signs with a combined loss, including a triplet loss and a cross-entropy loss. In particular, in the testing phase, we introduce testing-time Top-K transferring to mitigate the domain gap between the isolated signs and their continuous counterparts.
designs view-specific and view-independent modules to extract sign features and then combines disparate prediction results efficiently. Other studies design view-invariant representation learning frameworks to attain high-quality feature representations [33,38]. The approach [2] fuses multi-view feature representations via a specifically designed super-vector. To acquire correlations among predictions from different modalities, the work [35] develops latent view distribution connections and proposes a post-fusion strategy. Inspired by the above studies, we fuse prediction results together via an ensemble strategy.
3 Proposed Method
The overview of the proposed method is shown in Fig. 3. First, we use different data processing operations to generate multi-modality data, including RGB images, 2D key points, and 3D angle-axis poses. Then, we employ two branches, RGB-based feature extraction and pose-based feature extraction, to generate multi-modal feature representations. The RGB-based feature extraction is built on I3D-MLP [3], while the pose-based feature extraction is built on SL-GCN [17]. As a result, we develop three networks in total: 2D-Pose based SL-GCN, 3D-Pose based SL-GCN, and RGB-based I3D-MLP. During the training phase, each network is trained separately. After feature extraction, a combined loss, consisting of a triplet loss and a cross-entropy loss [23], is proposed to train the network. In the inference phase, we employ our three models to extract visual features and then fuse them to generate multi-modal feature representations for
Fig. 4. Visualization of our employed modalities: (a) RGB image sequences; (b) 2D key-joints; (c) 3D key-joints. Here, we only visualize our extracted 3D keypoints via a 3D hand model (MANO) for illustration.
the sign spotting task. Moreover, we propose a novel top-k technique to alleviate the domain gap between the test set and the query set (Fig. 2).

3.1 Data Processing
Our framework takes multi-modal data as input to generate discriminative feature representations. The multi-modal data includes three types: RGB images, 2D key points, and 3D angle-axis poses (Fig. 4).

RGB Images. The original data provided by the organizers is in MP4 format. Thus, we first use OpenCV to extract the RGB frames of each video. However, the signer is not spatially aligned across different videos. Hence, we further use HRNet [31] to obtain 133 key points of the human pose, and then select the nose as the central landmark to crop the video frames. In this way, signers are centered in images of 512 × 512 pixels.

2D Key Points. We use HRNet [31] to estimate 133 whole-body key points from the normalized RGB videos [17]. Then, we select 27 key points to construct a skeleton graph. Notably, each key point is represented by its 2-dimensional spatial coordinates and a 1-dimensional confidence.

3D Angle-Axis Pose. We leverage FrankMocap [29] to acquire 54 angle-axis based 3D whole-body joints from the normalized RGB videos. Afterwards, 39 upper-body joints are selected to construct the 3D skeleton graph.
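A minimal sketch of this preprocessing, assuming HRNet whole-body keypoints are already available as a (133, 3) array of (x, y, confidence). The nose-centered 512 x 512 crop follows the text, while the nose index and the exact 27-joint subset are not specified in the paper and are placeholders here.

```python
import numpy as np

NOSE_IDX = 0                    # assumed index of the nose in the 133-point layout
SELECTED_27 = list(range(27))   # placeholder: the exact 27-joint subset is not listed in the paper

def crop_around_nose(frame, keypoints, size=512):
    """Center a size x size crop on the signer's nose keypoint.
    frame: (H, W, 3) image; keypoints: (133, 3) array of (x, y, confidence)."""
    cx, cy = keypoints[NOSE_IDX, :2].astype(int)
    half = size // 2
    h, w = frame.shape[:2]
    x0 = int(np.clip(cx - half, 0, max(w - size, 0)))
    y0 = int(np.clip(cy - half, 0, max(h - size, 0)))
    return frame[y0:y0 + size, x0:x0 + size]

def select_2d_joints(keypoints):
    """Keep the 27 selected joints; each carries (x, y, confidence)."""
    return keypoints[SELECTED_27]
```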
3.2 Multi-modal Feature Extraction
In this paper, we propose a dual-branch structure, including RGB-based feature extraction and pose-based feature extraction, to extract sign language features from different multi-modal data types. To be specific, the RGB-based feature
extraction is built on I3D-MLP [3], while SL-GCN [17] is used to develop the pose-based feature extraction.

I3D-MLP. Inflated 3D CNN (I3D) [3] is one of the most common 3D CNN architectures in recent years and is widely used to extract features from RGB input data. In this work, we utilize the I3D-MLP model from the Watch-Read-Lookup framework, since I3D has shown better performance than other 3D CNN architectures in the sign spotting task [26]. The model mainly contains a two-stage architecture, including an I3D trunk and a three-layer MLP. The I3D network extracts 1024-dimensional feature vectors from RGB input frames and outputs a list of classification predictions. The 1024-dimensional feature vectors are then fed into the MLP network for dimensionality reduction.

SL-GCN. Performing the pose skeleton extraction auxiliary task along with sign language spotting is a straightforward way to enhance the preservation of spatial-structure information. Such structure information facilitates the adaptation of the extracted sign features between the source RGB image domain and the target domain. Taking this intrinsic observation into consideration, we further adopt the SL-GCN [17] model for structure-aware feature extraction. More specifically, we leverage the spatial-temporal GCN proposed in [36] to model the signer's upper body with the 2D-based skeleton and the angle-axis 3D-based skeleton. Each joint is represented as a graph node in SL-GCN. Formally, the SL-GCN is expressed as:

$$F_{out} = D^{-\frac{1}{2}}(A + I)\,D^{-\frac{1}{2}}\,F_{in}\,W, \qquad (1)$$
where D_ii = Σ_j (A_ij + I_ij). A denotes the intra-body connections among the joints (i.e., 27 joints for the 2D-based pose and 39 joints for the 3D-based pose), the identity matrix I indicates the self-connections, and W represents the trainable weight matrix. In practice, we implement the SL-GCN with conventional 2D convolutions. For each input joint, we obtain a 256-dimensional vector as the representative modality-aware feature.
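As an illustration of Eq. (1), the sketch below implements the normalized graph convolution in PyTorch. It shows the single operation only, not the full SL-GCN (which stacks spatial-temporal blocks); the class name, shapes, and the empty adjacency used in the example are assumptions.

```python
# A minimal sketch of the normalized graph convolution in Eq. (1).
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))     # A + I (self-connections)
        D = torch.diag(A.sum(dim=1).pow(-0.5))           # D^{-1/2}
        self.register_buffer("A_hat", D @ A @ D)         # normalized adjacency
        self.W = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, x):
        # x: (batch, num_joints, in_channels)
        return self.W(torch.einsum("ij,bjc->bic", self.A_hat, x))

# Example: a 27-joint 2D-pose graph with 3 input channels (x, y, confidence)
# mapped to 256-dimensional joint features.
adj = torch.zeros(27, 27)                                 # fill with skeleton edges in practice
layer = GraphConv(3, 256, adj)
out = layer(torch.randn(8, 27, 3))                        # -> (8, 27, 256)
```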
3.3 Loss Function
Since the spotting task is similar to retrieval tasks, we present a combined loss, consisting of a triplet loss and a cross-entropy loss, to train the proposed network [23]. Given three sign language features Y_{p_i}, Y_{p_j} and Y_{n_k} corresponding to the sign samples p_i, p_j and n_k, respectively, where p_i and p_j are from the same class P while n_k comes from the class N, the triplet loss L_tri is defined as:

L_tri = max(D(Y_{p_i}, Y_{p_j}) − D(Y_{p_i}, Y_{n_k}) + margin, 0),    (2)

where D(d_1, d_2) is the Euclidean distance between features d_1 and d_2, and margin is the margin threshold used in the triplet loss. The cross-entropy loss can be formulated as:

L_cse = −(1/N) Σ_{i=1}^{N} log( e^{W_{y_i}^T x_i + b_{y_i}} / Σ_{j=1}^{n} e^{W_j^T x_i + b_j} ),    (3)
Fig. 5. Illustration of our proposed testing-time Top-K transferring technique. We firstly use the extracted query features from isolated sign videos to retrieve similar ones from the gallery continuous sign video clips. The retrieved sign features will be used to update the query sign features for the next round retrieval. This technique significantly reduces the feature domain gap between isolated and continuous sign videos.
where x_i is the feature of the i-th sample and y_i is its label. Finally, the combined loss is defined as:

L_combined = L_tri + L_cse.    (4)
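A hedged sketch of the combined objective in Eq. (4) is shown below, written in PyTorch. The Batch-All triplet mining used during training is omitted for brevity, so the function takes a single (anchor, positive, negative) triplet; all names are illustrative.

```python
# Sketch of the combined loss: triplet loss (Eq. 2) plus cross-entropy (Eq. 3).
import torch
import torch.nn.functional as F

def combined_loss(anchor, positive, negative, logits, labels, margin=0.2):
    # Triplet term: pull same-class features together, push different classes apart.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    l_tri = torch.clamp(d_pos - d_neg + margin, min=0).mean()
    # Cross-entropy term on the classifier outputs.
    l_cse = F.cross_entropy(logits, labels)
    return l_tri + l_cse
```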
3.4 Sign Spotting from Isolated Signs
For the spotting task, the dataset usually contains two subsets: the test set (unlabeled data) and the query set (labeled data). To spot the exact location of a sign instance in a provided video, we implement our method in the following steps. First, we observe that there is an obvious domain gap between isolated signs and continuous signs. Therefore, we trim the isolated sign videos by removing the video frames before hands up and after hands down. Then, we feed the query sign videos into our model to generate feature representations. Second, for each video from the test set, we utilize a sliding window to crop video clips from the beginning to the end of the video, and input all clips into the proposed model to extract feature representations. Here, the stride of the sliding window is 1. Third, all feature representations from this video are used to calculate the cosine distance with respect to the corresponding feature representation of the sign provided in the query set. From the video name, we also know which sign is to be spotted in a certain video. The clip that has the maximal similarity with the gallery sign representation is regarded as our retrieval result.
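The sketch below illustrates this sliding-window retrieval step. `extract_clip_feature` stands in for the fused three-model feature extractor and is an assumption, not a function of the released code.

```python
# Minimal sketch of the spotting step in Sect. 3.4: slide a fixed-length window
# over a continuous test video and pick the clip closest to the query feature.
import numpy as np

def spot_sign(video_frames, query_feature, extract_clip_feature, window=16, stride=1):
    best_idx, best_sim = None, -np.inf
    q = query_feature / np.linalg.norm(query_feature)
    for start in range(0, len(video_frames) - window + 1, stride):
        clip = video_frames[start:start + window]
        f = extract_clip_feature(clip)
        f = f / np.linalg.norm(f)
        sim = float(np.dot(f, q))                 # cosine similarity
        if sim > best_sim:
            best_sim, best_idx = sim, start
    return best_idx, best_idx + window, best_sim  # predicted [start, end) interval
```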
3.5 Top-K Transferring Technique
To alleviate the domain gap between the test set and the query set, we propose a Top-K transferring technique (Fig. 5). First, we denote Q = {q_i | i = 1, 2, ..., N} as the set of all query features from the query set, where q_i is the feature of the i-th isolated sign. Note that each feature in the query set is combined from the three model outputs.
After the third step in Sect. 3.4, we obtain many retrieved results for each isolated sign. We sort the distances of all retrieved results and find the K most similar clips. For the isolated sign q_i, we denote V_i = {v_i^j | j = 1, 2, ..., K} as the features of its K most similar clips in the test set, where v_i^j is the feature of the j-th most similar clip. Then, each feature representation from the query set is updated by the average of the feature representations of the top-K retrieved clips. The updated query set can be defined as Q̂ = {q̂_i | i = 1, 2, ..., N}, where q̂_i is the transferred sign feature of q_i and is formulated as:

q̂_i = (1/K) Σ_{j=1}^{K} v_i^j.    (5)
Iteratively, the updated feature representation for a certain sign will be used to retrieve signs from continuous sign videos again.
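A minimal sketch of the Top-K transferring update in Eq. (5) is given below; it assumes L2-normalized feature matrices and NumPy, and is illustrative only.

```python
# Testing-time Top-K transferring (Sect. 3.5): replace each query feature with the
# average of its K most similar gallery clip features.
import numpy as np

def top_k_transfer(query_feats, gallery_feats, k=5):
    # query_feats: (N, D), gallery_feats: (M, D), both row-wise L2-normalized.
    sims = query_feats @ gallery_feats.T            # cosine similarities
    updated = np.empty_like(query_feats)
    for i in range(query_feats.shape[0]):
        top_k = np.argsort(-sims[i])[:k]            # indices of the K closest clips
        updated[i] = gallery_feats[top_k].mean(axis=0)
    return updated                                   # used for the next retrieval round
```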
4 Experiments

4.1 Dataset
We evaluate our framework on a continuous Spanish sign language video dataset in the health domain. The total video length of the dataset is 10 h. Ten people (7 deaf signers and 3 interpreters) perform the continuous signing; they were asked to explain health contents in sign language after reading printed cards. Because of the expressivity and naturalness of the recordings, the dependence on the individual presenter is very large. In this task, videos are partially annotated with the exact locations of 100 signs, at 25 fps and a frame resolution of 1280 × 720. The sign durations are not uniform: the fastest sign lasts about 120 ms (3 frames), while the slowest one lasts 2 s (50 frames). The durations are not Gaussian-distributed, with a mean of 520 ms (13 frames). In detail, the dataset released for the ChaLearn ECCV 2022 Challenge on Sign Spotting is divided into 5 splits and a query set. As shown in Table 1, Track 1 is named MSSL (multiple-shot supervised learning) and Track 2 is named OSLWL (one-shot learning and weak labels). The total number of hand annotations exceeds 5000 in MSSL and 3000 in OSLWL. Our experiments mainly use MSSL-Train-Set for training, and OSLWL-Query-Set and OSLWL-Test-Set for evaluation.
4.2 Implementation Details
In this paper, we propose a dual-branch structure to extract sign language features that includes RGB-based feature extraction and Pose-based feature extraction. Meanwhile, we introduce three data types to portray the human signs. Note that the data processing has been described in Sect. 3.1. As a result, a total of three networks are built to extract sign features. For the RGB-based feature extraction, we choose the I3D network [3] as our backbone and load pre-trained
Table 1. The ChaLearn ECCV 2022 sign spotting dataset has 5 parts containing continuous sign videos and a query set which includes isolated sign videos. There are 7 signers in this dataset. Moreover, different performing customs also contribute to the domain gap.

Name | Duration | Signers | Number of signs
MSSL-Train-Set | Around 2.5 h | 5 (p1, p2, p4, p5, p6) | 60
MSSL-Val-Set | Around 1.5 h | 4 (p1, p5, p7, p9) | 60
MSSL-Test-Set | Around 1.5 h | 4 (p1, p3, p5, p8) | 60
OSLWL-Val-Set | 1501 4-s videos | 5 (p3, p4, p6, p9, p10) | 40
OSLWL-Test-Set | 607 4-s videos | 6 (p1, p3, p4, p5, p8, p9) | 40
OSLWL-Query-Set | 40 short videos | 2 (p0, p3) | 40
model from [27]. For the Pose-based feature extraction, SL-GCN [17] is employed to extract pose-based features; the pre-trained model is also from [17]. Since the Pose-based feature extraction covers two data types, we separately train two networks, a 2D-Pose based SL-GCN and a 3D-Pose based SL-GCN. The main difference between the two networks is the graph construction: the 2D-Pose graph is built from 27 key joints, while the 3D-Pose graph is constructed from 39 angle-axis joints. As described in Sect. 3.3, we introduce a combined loss, consisting of a triplet loss and a cross-entropy loss, to train the proposed network. A Batch-All strategy is used for sampling in the training stage [13]. Specifically, each batch includes P × K samples, where P is the number of sign instances and K is the number of samples per sign instance. The margin in Eq. (2) is set to 0.2. For I3D, P and K are set to 6 and 2, respectively. For the 2D-Pose based SL-GCN and the 3D-Pose based SL-GCN, P × K is set to 12 × 4. The length of the sequences in each batch is 16 frames. The optimizer in all experiments is Adam. The numbers of training iterations for I3D, the 2D-Pose based SL-GCN, and the 3D-Pose based SL-GCN are 100, 240, and 190, respectively.
4.3 Evaluation Metrics
We employ the F1 score to evaluate the prediction performance of the different methods. The F1 score is the harmonic mean of recall and precision, which better reflects the spotting performance of a model. In this task, we regard the IoU between the ground-truth interval and the predicted interval as the matching score of each sign instance. The IoU threshold is set from 0.2 to 0.8 in steps of 0.05 to allow relatively slack localization. Positive and false instances are obtained by averaging over the 13 thresholds. Given a query of sign_i and a gallery video file_n, the IoU_i score is calculated as:

IoU_i = (Ret_i ∩ Rel_i) / (Ret_i ∪ Rel_i),    (6)

where Ret_i and Rel_i represent the predicted and ground-truth intervals of sign_i, respectively. Notably, we regard a predicted instance as a positive sample only if IoU_i is greater than the threshold t, and the IoU is not computed if the
retrieved and groundtruth classes do not match. For all Ret_i reported from each video file_n containing L_n groundtruth instances, the true positive, false positive, and false negative counts are calculated as follows:

TP(t) = Σ_n Σ_i Σ_k [IoU_i(k, n) ≥ t],    (7)

FP(t) = (Σ_n Σ_i Σ_k 1) − TP(t),    (8)

FN(t) = (Σ_n L_n) − TP(t),    (9)

TP = Σ_t TP(t),  FP = Σ_t FP(t),  FN = Σ_t FN(t).    (10)
With the above data, Precision, Recall, and the F1 score are calculated by:

Precision = TP / (TP + FP),    (11)

Recall = TP / (TP + FN),    (12)

F1 = 2 · (Precision · Recall) / (Precision + Recall).    (13)
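For clarity, the sketch below computes the F1 score following Eqs. (6)-(13) in a simplified form: predictions are matched to ground-truth intervals by temporal IoU at thresholds from 0.2 to 0.8 in 0.05 steps, without enforcing one-to-one matching. It is an illustration of the metric, not the official evaluation script.

```python
# Simplified sketch of the IoU-thresholded F1 evaluation.
import numpy as np

def interval_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def f1_score(predictions, ground_truth, thresholds=np.arange(0.2, 0.8001, 0.05)):
    # predictions / ground_truth: lists of (start, end, sign_class) tuples.
    tp = fp = fn = 0
    for t in thresholds:
        matched = 0
        for p in predictions:
            hits = [g for g in ground_truth
                    if g[2] == p[2] and interval_iou(p, g) >= t]
            matched += bool(hits)
        tp += matched
        fp += len(predictions) - matched
        fn += len(ground_truth) - matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```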
4.4 Main Results
In this section, we first compare our method with existing methods that our network builds on, as well as the official baseline, and then conduct an ablation study on the components of our method.

Comparison with the State-of-the-Art. In this work, our method is built upon BSL-I3D [27] and SL-GCN [17]. Note that the original SL-GCN only employs 2D key joints rather than 3D poses for sign representation. As shown in Table 2, benefiting from the model ensemble and our newly proposed testing-time Top-K transferring, the spotting performance of our method is much higher than that of the state-of-the-art approaches. Moreover, our method significantly outperforms the official baseline. Visual results are provided in Fig. 6.

Ablation Study

Model Ensemble. We propose a multi-modal sign spotting framework to generate comprehensive feature representations. To gain superior spotting performance, we explore various modality combinations. The experimental results are shown in Table 2. In terms of single-model performance, the 2D-Pose based SL-GCN achieves the best result, followed by the RGB-based I3D-MLP. Based
Table 2. The contributions of different models on the final performance. We evaluate the F1-score of different models on the OSLWL-Test-Set dataset. ↑ indicates the higher, the better.

Single models: 2D-based 0.4692, 3D-based 0.4076, RGB-based 0.4249
Two-model combinations: 0.4954, 0.5057, 0.5503
All three models (2D + 3D + RGB): 0.5593
Reference methods: Baseline 0.3951, SL-GCN [17] 0.4543, BSL [27] 0.4512, Ours 0.5593
Table 3. Impact of the proposed testing-time Top-K transferring on the final performance. We evaluate the performance using F1-score on the OSLWL-Test-Set dataset. "Trim" means manually removing video frames before hands up and after hands down (✓/× = with/without this pre-processing). "Times" is the number of iterations of the Top-K transferring technique, and "K" is the number of retrieved clips used to update the feature representations. ↑ means the higher, the better.

Trim | Times | K | F1 ↑
× | 0 | – | 0.4602
✓ | 0 | – | 0.5168
✓ | 1 | [1] | 0.5272
✓ | 1 | [3] | 0.5411
✓ | 1 | [5] | 0.5593
✓ | 2 | [5, 1] | 0.5498
✓ | 2 | [5, 3] | 0.5538
✓ | 2 | [5, 5] | 0.5442
× | 1 | [5] | 0.5205
on the assumption that multi-modal combinations can generate complementary results, we conduct extensive experiments to find the best combination. As indicated in Table 2, all combinations have a positive impact on the spotting results, and the combination of the three models (2D-Pose based SL-GCN, 3D-Pose based SL-GCN, and RGB-based I3D-MLP) attains the highest F1 score of 0.5593. Besides, we observe that any combination of two models achieves better performance than any single model. Among all pairwise combinations, the combination of the 2D-Pose based SL-GCN and the 3D-Pose based SL-GCN is the best, reaching an F1 score of 0.5503. The remaining pairwise combinations achieve 0.4954 and 0.5057; in particular, combining the 3D-Pose based SL-GCN with the RGB-based I3D-MLP yields 0.5057, an increase of 0.0981 over the 3D-Pose based SL-GCN alone. The above results imply the positive complementarity of combining the various models. As a result, the combination of the three models is the final setting in our sign language spotting framework.
Table 4. The impact of different sliding-window lengths, which are used to crop video clips from gallery continuous sign videos, on the final sign spotting performance. The performance is measured by F1-score on the OSLWL-Test-Set dataset. ↑ indicates the higher, the better. The last two columns combine the results of several window lengths.

Window length | 13 | 14 | 15 | 16 | 17 | combined {13, 15, 17} | combined {13–17}
F1 ↑ | 0.5012 | 0.5369 | 0.5470 | 0.5593 | 0.5304 | 0.5258 | 0.5269
Top-K Transferring. We introduce a Top-K transferring strategy into our framework to diminish the gap between query signs and gallery signs. To attain superior spotting performance, we conduct extensive parameter-tuning experiments. All experiments are based on the three-model ensemble setting, and the length of the sliding window is set to 16. Comparing the first and fifth rows in Table 3, using the Top-K strategy significantly raises the F1 score, by 21.5%. In the experiments, we also vary the number of retrieved feature representations (K) and the iteration number Times. When Times is set to 1, the F1 score increases with K, and K = 5 achieves a 0.5593 F1 score. Based on the above results, we increase Times. The experimental results show that raising Times does not boost the sign spotting performance. This is mainly because the updated features selected by the Top-K strategy do not always match the isolated sign instances in the query set; as Times increases, the deviation between the updated features and the original sign features becomes larger, which degrades the spotting performance. As a result, Times = 1 is the optimal choice for our task. Besides, we also conduct experiments with and without the Trim step. According to the results of the fifth and the last rows, clipping the video frames before hands up and after hands down improves the final spotting results: the Trim step (Row 5) improves the F1 score by about 7.5% compared to the untrimmed result (Row 9). The above results fully demonstrate the effectiveness of the Top-K strategy in reducing the domain gap between the query set and the gallery set.

Sliding Window. For each video of the gallery set, we adopt a sliding-window strategy to crop video clips from the beginning to the end of the video. In all experiments, we use the three-model combination to extract features and also apply the Top-K strategy with Times = 1 and K = 5. To find a suitable window length, we conduct extensive experiments on the OSLWL-Test-Set dataset. Considering the distribution of sign instance lengths, we choose lengths 13, 14, 15, 16, and 17 as the basic settings. As shown
Fig. 6. Visualization of our retrieved results. As indicated by the similarity bar, the peak values predicted by our method are consistent with the ground-truth positions, demonstrating the effectiveness of our method.
in Table 4, the F1 score is positively correlated with the sliding-window length within a certain range, and it reaches its peak of 0.5593 when the window length is 16. A common consensus is that integrating the results obtained under various sliding-window lengths improves sign spotting accuracy. To this end, we first integrate the results achieved with window lengths 13, 15, and 17, selecting the result with the minimum distance to the corresponding query sign instance. As the results show, there is a 6% decrease compared to using a sliding-window length of 16. Although combining all window lengths slightly improves the F1 score, there is still a gap to the result with window length 16, being 0.0324 lower. We analyze that the main reason is that the model tends to select features from comparatively short windows. All in all, we select a sliding-window length of 16 as our final setting.
5 Conclusions
In this paper, we propose a dual-branch based multi-modal sign spotting framework. In our framework, we take full advantage of the information aggregated from the various modalities by the RGB-based branch and the Pose-based branch. The aggregated information provides discriminative feature representations of the retrieved signs. Moreover, we present a testing-time Top-K transferring technique to address the domain gap between isolated and continuous sign videos. In particular, this technique exploits an iterative retrieval strategy to update the extracted feature representations, leading to a significant sign spotting performance improvement. As a result, our method outperforms the baseline model by a large margin of 41.5% on the OSLWL-Test-Set dataset.

Acknowledgement. The research was supported by the Australian Research Council DP220100800 and Google Research Scholar Program.
References 1. Baradel, F., Wolf, C., Mille, J.: Human action recognition: pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 604–613 (2017) 2. Cai, Z., Wang, L., Peng, X., Qiao, Y.: Multi-view super vector for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 596–603 (2014) 3. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) 4. Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., Lu, H.: Decoupling GCN with DropGraph module for skeleton-based action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 536–553. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58586-0 32 5. Choutas, V., Weinzaepfel, P., Revaud, J., Schmid, C.: Potion: pose motion representation for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024–7033 (2018) 6. Cihan Camgoz, N., Hadfield, S., Koller, O., Bowden, R.: SubUNets: end-to-end hand shape and continuous sign language recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3056–3065 (2017) 7. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019) 8. Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 9. Cooper, H., Ong, E.J., Pugeault, N., Bowden, R.: Sign language recognition using sub-units. J. Mach. Learn. Res. 13, 2205–2231 (2012) 10. Dardas, N.H., Georganas, N.D.: Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques. IEEE Trans. Instrum. Meas. 60(11), 3592–3607 (2011) 11. Fels, S.S., Hinton, G.E.: Glove-talk: a neural network interface between a dataglove and a speech synthesizer. IEEE Trans. Neural Netw. 4(1), 2–8 (1993) 12. Grobel, K., Assan, M.: Isolated sign language recognition using hidden Markov models. In: 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, vol. 1, pp. 162–167. IEEE (1997) 13. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person reidentification. arXiv preprint arXiv:1703.07737 (2017) 14. Hoffman, J., Gupta, S., Darrell, T.: Learning with side information through modality hallucination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 826–834 (2016)
15. Hu, J.F., Zheng, W.S., Pan, J., Lai, J., Zhang, J.: Deep bilinear learning for RGB-D action recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 335–351 (2018) 16. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) 17. Jiang, S., Sun, B., Wang, L., Bai, Y., Li, K., Fu, Y.: Skeleton aware multi-modal sign language recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 3413–3423 (2021) 18. Li, D., Rodriguez, C., Yu, X., Li, H.: Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1459–1469 (2020) 19. Li, D., et al.: TSPNet: hierarchical feature learning via temporal semantic pyramid for sign language translation. In: Advances in Neural Information Processing Systems, vol. 33, pp. 12034–12045 (2020) 20. Li, D., Yu, X., Xu, C., Petersson, L., Li, H.: Transferring cross-domain knowledge for video sign language recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6205–6214 (2020) 21. Li, D., Yu, X., Xu, C., Petersson, L., Li, H.: Transferring cross-domain knowledge for video sign language recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 22. Lim, K.M., Tan, A.W.C., Lee, C.P., Tan, S.C.: Isolated sign language recognition using convolutional neural network hand modelling and hand energy image. Multimed. Tools Appl. 78(14), 19917–19944 (2019). https://doi.org/10.1007/s11042019-7263-7 23. Lin, B., Zhang, S., Yu, X.: Gait recognition via effective global-local feature representation and local temporal aggregation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14648–14656 (2021) 24. Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 8, 726–742 (2020) 25. Mehdi, S.A., Khan, Y.N.: Sign language recognition using sensor gloves. In: Proceedings of the 9th International Conference on Neural Information Processing, ICONIP 2002, vol. 5, pp. 2204–2206. IEEE (2002) 26. Momeni, L., Varol, G., Albanie, S., Afouras, T., Zisserman, A.: Watch, read and lookup: learning to spot signs from multiple supervisors. CoRR abs/2010.04002 (2020). https://arxiv.org/abs/2010.04002 27. Momeni, L., Varol, G., Albanie, S., Afouras, T., Zisserman, A.: Watch, read and lookup: learning to spot signs from multiple supervisors. In: Proceedings of the Asian Conference on Computer Vision (2020) 28. Parelli, M., Papadimitriou, K., Potamianos, G., Pavlakos, G., Maragos, P.: Exploiting 3D hand pose estimation in deep learning-based sign language recognition from RGB videos. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12536, pp. 249–263. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66096-3 18 29. Rong, Y., Shiratori, T., Joo, H.: FrankMocap: a monocular 3D whole-body pose estimation system via regression and integration. In: IEEE International Conference on Computer Vision Workshops (2021) 30. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with multistream adaptive graph convolutional networks. IEEE Trans. Image Process. 29, 9532–9545 (2020)
31. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019) 32. Wang, D., Ouyang, W., Li, W., Xu, D.: Dividing and aggregating network for multiview action recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 451–467 (2018) 33. Wang, H., Kl¨ aser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103(1), 60–79 (2013) 34. Wang, H., Wang, L.: Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 499–508 (2017) 35. Wang, L., Ding, Z., Tao, Z., Liu, Y., Fu, Y.: Generative multi-view human action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6212–6221 (2019) 36. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018) 37. Yang, Q.: Chinese sign language recognition based on video sequence appearance modeling. In: 2010 5th IEEE Conference on Industrial Electronics and Applications, pp. 1537–1542. IEEE (2010) 38. Zheng, J., Jiang, Z., Chellappa, R.: Cross-view action recognition via transferable dictionary learning. IEEE Trans. Image Process. 25(6), 2542–2556 (2016)
W36 - A Challenge for Out-of-Distribution Generalization in Computer Vision
Deep learning sparked a tremendous increase in the performance of computer vision systems over the past decade, under the implicit assumption that the training and test data are drawn independently and identically distributed (IID) from the same distribution. However, Deep Neural Networks (DNNs) are still far from reaching human-level performance at visual recognition tasks in real-world environments. The most important limitation of DNNs is that they fail to give reliable predictions in unseen or adverse viewing conditions, which would not fool a human observer, such as when objects have an unusual pose, texture, shape, or when objects occur in an unusual context or in challenging weather conditions. The lack of robustness of DNNs in such out-of-distribution (OOD) scenarios is generally acknowledged but largely remains unsolved, however, it needs to be overcome to make computer vision a reliable component of AI. October 2022
Adam Kortylewski Bingchen Zhao Jiahao Wang Shaozuo Yu Siwei Yang Dan Hendrycks Oliver Zendel Dawn Song Alan Yuille
Domain-Conditioned Normalization for Test-Time Domain Generalization Yuxuan Jiang1(B) , Yanfeng Wang1,2 , Ruipeng Zhang1 , Qinwei Xu1 , Ya Zhang1,2 , Xin Chen3 , and Qi Tian4 1
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China {g.e.m jyx,wangyanfeng,zhangrp,qinweixu,ya zhang}@sjtu.edu.cn
2 Shanghai AI Laboratory, Shanghai, China
3 Huawei Inc., Shenzhen, China
4 Huawei Cloud & AI, Guangzhou, China [email protected]

Abstract. Domain generalization aims to train a robust model on multiple source domains that generalizes well to unseen target domains. While considerable attention has focused on training domain generalizable models, a few recent studies have shifted the attention to test time, i.e., leveraging test samples for better target generalization. To this end, this paper proposes a novel test-time domain generalization method, Domain Conditioned Normalization (DCN), to infer the normalization statistics of the target domain from only a single test sample. In order to learn to predict the normalization statistics, DCN adopts a meta-learning framework and simulates the inference process of the normalization statistics at training. Extensive experimental results have shown that DCN brings consistent improvements to many state-of-the-art domain generalization methods on three widely adopted benchmarks.

Keywords: Domain generalization · Batch normalization · Test time adaptive · Distribution shift

1 Introduction
The performance of deep neural networks degrades drastically when the distributions of the train (source) and test (target) data are different. In order to solve the problem of distribution shift, domain generalization (DG) aims to train a robust model on multiple source domains that generalizes well to arbitrary unseen target domains. Existing DG methods can be mainly divided into three categories: invariant representation learning, which aims to learn a shared feature space across source domains; data augmentation, which expands the source distributions; and regularization techniques, which are utilized to learn more semantic information from source domains. (Supplementary Information: the online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9_17.)
Fig. 1. The difference in the first BN layer for (a) standardization statistics and (b) rescaling statistics. We train one model on all domains of PACS benchmark with different BN layers for each domain. The L2 distance is used to measure the statistics differences between every two domains.
Despite their success, most existing DG methods only focus on the training stage, i.e., how to train a generalizable model based on source domains. In fact, DG is naturally an ill-posed problem due to information incompleteness. Knowledge of the target domain may greatly help the model generalization. Actually, the model always has access to at least one unlabeled test sample at inference which may provide important clues about the target domain. Some recent studies start to investigate how to make use of the test sample at inference to improve a model’s generalizability to the target domain. TTT [33] and TENT [37] utilize target domain data to update model parameters by constructing an unsupervised loss. Dubey et al. [7] use some target samples to construct domain embeddings so that the classifier can make dynamic predictions based on it. T3A [13] computes pseudo-prototype representation for each class with past predicted target samples to modify the classifier. However, these methods have the following two limitations. (1) Updating parameters with target data increases the inference time significantly and may lead to catastrophic failure; (2) Batches of data are not necessarily available at inference and there is no guarantee that the domain labels of test samples are available. In this paper, we propose a test-time DG method called Domain Conditioned Normalization (DCN), which estimates the normalization statistics for target domain at inference. DCN is based on the observation that the statistics (including standardization and rescaling statistics) in batch normalization (BN) [12] are unique for each domain and larger distribution shift is reflected by the increased difference in statistics, as shown in Fig. 1. Thus, to improve the domain generalizability of a model at test time, a straight-forward solution is to normalize the test data with the statistics of the domain it comes from. The challenge lies in we can assume access to only one test sample. Nevertheless, the good news is that, the source statistics are expected to contain essential semantic and style information shared by the source and target domains. We thus attempt to combine the instance statistics from a single test sample that reflect domain-specific information for the target domain, with the source statistics to infer the statistics of the target domain. The final inferred statistics is then used to normalize this test sample during inference.
In order to infer the target statistics, a meta-learning framework is adopted by DCN. In each meta-iteration, one source domain is chosen as the meta-target domain and the other source domains are considered as meta-source domains. DCN learns to infer the statistics of the meta-target domain by approximating the ground-truth meta-target statistics with the instance statistics of the metatarget sample and the statistics of meta-source domains. At testing, DCN infers the statistics of the target domain with a single test sample and the source statistics, which is used to replace the source statistics in the trained model for prediction. Our DCN only requires one forward pass of a single target sample, which avoids training at test time and requires no batched test data. To validate the effectiveness of the proposed method, we experiment on several standard DG benchmarks, namely PACS [17], VLCS [35], OfficeHome [36]. Extensive experimental results have shown that DCN brings consistent improvements to the state-of-the-art DG approaches, indicating that the inferred target statistics obtained from a single test sample does help model generalize better across domains. We have also demonstrated that DCN becomes more effective as the distribution shift increases.
2 Related Work

2.1 Domain Generalization
The main goal of domain generalization (DG) is to learn a robust model from multiple available source domains that can generalize well to arbitrary unseen target domains [38,44]. Existing DG methods can be roughly categorized into three classes: invariant representation learning, data augmentation and regularization methods. Invariant representation learning aims to learn a shared feature space that retains the semantic information across source domains. Some works [19,22] learn this shared feature space in an adversarial training manner. Other works [29,31,32] align second-order statistics or gradients to learn the invariant representation. Data augmentation is also a common method in DG research. DDAIG [45] and FACT [41] generate new images to enlarge the training dataset. Different from augmentation at the image level, some methods [20,46,47] expand source distributions at the feature level. Another popular approach in DG is to use regularization techniques. Self-supervised objectives are a popular choice of regularization [2,4,16,39]. Other works [1,18] rely on meta-learning to simulate the domain shift during training. Furthermore, RSC [11] masks the maximum response gradient to learn more semantic information. These approaches only focus on source domains at the training stage and ignore the use of test data during inference.

2.2 Normalization in Neural Networks
Batch Normalization [12] is a common technique that optimizes the training of deep network by reducing internal covariate shift. However, normalizing the
target domain data with the source statistics is sub-optimal because there is a distribution shift between source and target domains. To solve this problem, BIN [27] combines different standardization statistics with the learnable weights. ILM [14] learns the rescaling statistics for each sample by using neural networks. DSON [28] proposes to learn separate BIN for each source domain and ensemble the normalization for the target domain. MetaNorm [6] and ASR-Norm [8] learn the adaptive statistics based on a single sample. Different from all the methods above, our DCN uses both a single target sample and source statistics for adaptive normalization, so DCN can infer more accurate target statistics and achieve better generalization performance.

2.3 Adaptation and Generalization at Test Time
Recently, a lot of work has begun to study test-time adaptation. TTT [33] and TENT [37] update model parameters by training the target domain data with an unsupervised loss function. Apart from that, some work improves model generalization with test data while avoiding training at test time. BN-Test [26] computes statistics on batches of target samples to normalize them during inference. Dubey et al. [7] use very few test samples to construct the domain embedding for the target domain and use the domain embedding as a supplementary signal to make adaptive predictions. T3A [13] computes the representation templates with previous target samples and use these representation templates to modify the classifier. Compared with these methods, our DCN only requires a single test sample and avoids training during inference so that it avoids the increase of inference time and memory overhead.
3 Methodology

We here attempt to explore adaptive normalization at test time by inferring the target domain statistics with source domain statistics and a single target sample. In this section, we first describe the formal formulation of domain generalization and domain-specific batch normalization, followed by the details of our method, Domain Conditioned Normalization (DCN).

3.1 Preliminary
Domain Generalization. In domain generalization, the training dataset D_S consists of N_S source domains. Each source domain D_d = {(x_d^i, y_d^i)}_{i=1}^{n_d} (d ∈ {1, · · · , N_S}) contains n_d data-label pairs. There is also a target domain D_T, which is only available during testing. The goal of domain generalization is to train a model f_θ with D_S that generalizes well on the target domain D_T, where θ denotes the parameters of the model. A vanilla baseline is to train on all source domain data with the empirical risk minimization (ERM) objective:

min_θ (1/N_S) Σ_{d=1}^{N_S} (1/n_d) Σ_{i=1}^{n_d} l(f_θ(x_d^i), y_d^i),    (1)
where l(·) is the standard cross-entropy loss function for classification problems.

Domain Specific Batch Normalization. BN [12] is a widely used training technique in deep learning that reduces internal covariate shift. The statistics in BN represent the characteristics of the training dataset. However, when presented with multiple training domains, it is inappropriate to share the BN statistics among different domains, considering that each domain has its own characteristics. Therefore, in this paper, we employ DSBN [3] to separately store the domain-specific statistics for each source domain, which makes data from different domains go through different BN layers. Given a batch of samples from domain d during training, the input feature map of a normalization layer is denoted by x_d ∈ R^{N×C×H×W}, where N denotes the batch size, C the number of channels, and H and W the height and width, respectively. DSBN first uses the domain-specific batch mean μ_d^{bn} and variance σ_d^{bn} for standardization, and then applies an affine transformation with domain-specific rescaling parameters {γ_d, β_d} to the standardized features x_d^{stan}. The whole process is denoted as:

x_d^{stan} = (x_d − μ_d^{bn}) / \sqrt{σ_d^{bn} + ε},    (2)

x_d^{dsbn} = γ_d · x_d^{stan} + β_d,    (3)

where μ_d^{bn}, σ_d^{bn}, γ_d, β_d ∈ R^C and ε is a small constant for numerical stability. The batch mean μ_d^{bn} and variance σ_d^{bn} are calculated as:

μ_d^{bn} = (1/(N·H·W)) Σ_{n,h,w} x_d  and  σ_d^{bn} = (1/(N·H·W)) Σ_{n,h,w} (x_d − μ_d^{bn})^2.    (4)

Following standard implementations [12], we acquire the domain-specific mean μ_d and variance σ_d of a source domain D_d through an Exponential Moving Average (EMA). With the help of DSBN, we obtain the domain-specific normalization statistics (i.e., standardization statistics {μ_d, σ_d} and rescaling statistics {γ_d, β_d}) of the different source domains, which facilitates the training of DCN.
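A minimal sketch of DSBN, assuming PyTorch, is given below: each source domain keeps its own BatchNorm2d, so per-domain statistics {μ_d, σ_d} and affine parameters {γ_d, β_d} are stored separately. Class and argument names are illustrative.

```python
# Domain-specific batch normalization (DSBN, Eqs. 2-4): one BN layer per domain.
import torch
import torch.nn as nn

class DSBN2d(nn.Module):
    def __init__(self, num_channels, num_domains):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(num_channels) for _ in range(num_domains))

    def forward(self, x, domain_idx):
        # x: (N, C, H, W), a batch drawn from a single source domain.
        return self.bns[domain_idx](x)

# Example: normalize a batch from source domain 2 out of 3 training domains.
layer = DSBN2d(num_channels=64, num_domains=3)
out = layer(torch.randn(8, 64, 56, 56), domain_idx=2)
```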
3.2 Domain Conditioned Normalization
Our approach aims to use a single target sample and the source domain-specific statistics to infer the normalization statistics of an unseen target domain. To accomplish this, we adopt a meta-learning strategy to simulate such a target-statistics inference task during each meta-iteration. Specifically, within a meta-iteration, we randomly split the training domains D_S into two disjoint sets: one domain as the meta-target and the others as meta-source domains. In the meta-train stage, we train the backbone network on the meta-source domains and update the DSBN statistics of each meta-source domain by minimizing the cross-entropy loss. In the meta-test stage, we update the DSBN statistics of the meta-target domain in the same way to provide ground-truth meta-target statistics. Then we train the DCN modules, which take the
Fig. 2. Illustration of DCN for each layer when Dd is the meta-target domain. We use auto-encoders to infer statistics of Dd with input feature map fd and the meta-source domains’ statistics and use the inferred statistics to normalize fd . Lstan and Lres are L2 loss between the inferred statistics and the ground truth statistics.
statistics of the meta-source domains and a single meta-target sample as inputs, to approximate the ground-truth meta-target statistics. By looping over multiple meta-iterations, we hope the model will be able to infer the DSBN statistics of a novel target domain during testing. The learning process of DCN for a single layer is illustrated in Fig. 2, and can be further divided into domain conditioned standardization and domain conditioned rescaling. Next, we describe these two parts in detail.

Domain Conditioned Standardization. The standardization statistics stand for the channel-wise mean μ and variance σ. To infer the meta-target standardization statistics, we adopt an auto-encoder structure, where both the encoder and decoder are composed of one fully-connected layer followed by a ReLU activation. The predictions of the standardization auto-encoder are conditioned on the instance mean μ_d^i and variance σ_d^i of the meta-target sample x_d^i, as well as the standardization statistics {μ_p, σ_p} (p ≠ d, p ∈ {1, · · · , N_S}) of the meta-source domains. The instance mean μ_d^i and variance σ_d^i of x_d^i are calculated as:

μ_d^i = (1/(H·W)) Σ_{h,w} x_d^i  and  σ_d^i = (1/(H·W)) Σ_{h,w} (x_d^i − μ_d^i)^2.    (5)

Inspired by [15,42], we assume the standardization statistics of one domain to be a shifted version of those of another domain. Therefore, the meta-target standardization statistics can be inferred as a linear combination of the meta-source standardization statistics and the instance statistics of x_d^i, which is specified as:

\hat{μ}_d = (1/(N_S − 1)) Σ_{p≠d} (w_{p,d}^i · μ_d^i + (1 − w_{p,d}^i) · μ_p),    (6)

\hat{σ}_d = (1/(N_S − 1)) Σ_{p≠d} (w_{p,d}^i · σ_d^i + (1 − w_{p,d}^i) · σ_p),    (7)

where w_{p,d}^i ∈ R^C are the channel-wise linear combination weights learned by the standardization auto-encoder. Intuitively, different channels correspond to
different degrees of transferability from the meta-source domain D_p to the meta-target domain D_d, so using a specific weight for each channel is the wiser choice. To learn the combination weight w_{p,d}^i, we adopt a similar strategy as [21,40] and use the difference between the instance statistics {μ_d^i, σ_d^i} of x_d^i and the meta-source standardization statistics {μ_p, σ_p} as the input to the auto-encoder. The difference metric is calculated by:

M_{p,d}^i = μ_p / \sqrt{σ_p + ε} − μ_d^i / \sqrt{σ_d^i + ε}.    (8)

Intuitively, the channel-wise statistics difference M_{p,d}^i can be viewed as an indicator of domain transferability, where a small value of an entry in M_{p,d}^i means the meta-source domain and the meta-target domain are similar in the corresponding channel, and vice versa. Such a transferability indicator allows the auto-encoder to learn a better trade-off between the meta-source standardization statistics and the instance statistics of x_d^i. Finally, the output of the decoder goes through a sigmoid activation to scale the learned weights into [0, 1]. After obtaining the inferred standardization statistics of D_d through Eq. (6) and Eq. (7), we train the standardization auto-encoder by minimizing the L2 distance to the ground-truth standardization statistics {μ_d, σ_d} obtained from DSBN on the meta-target domain:
L_stan = ||\hat{μ}_d − μ_d||_2^2 + ||\hat{σ}_d − σ_d||_2^2.    (9)
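The sketch below illustrates the domain-conditioned standardization inference (Eqs. (5)-(8)), assuming PyTorch. The encoder/decoder stands in for the standardization auto-encoder; layer sizes and names are assumptions, not the authors' implementation.

```python
# Inferring target standardization statistics from one sample and source statistics.
import torch
import torch.nn as nn

class StandardizationInference(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.net = nn.Sequential(
            nn.Linear(channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, channels), nn.Sigmoid(),
        )

    def forward(self, x, source_mu, source_sigma):
        # x: (1, C, H, W) single target sample; source_mu/source_sigma: lists of (C,) tensors.
        mu_i = x.mean(dim=(0, 2, 3))                           # instance mean, Eq. (5)
        sigma_i = x.var(dim=(0, 2, 3), unbiased=False)         # instance variance, Eq. (5)
        mu_hat, sigma_hat = 0.0, 0.0
        for mu_p, sigma_p in zip(source_mu, source_sigma):
            diff = mu_p / (sigma_p + self.eps).sqrt() - mu_i / (sigma_i + self.eps).sqrt()  # Eq. (8)
            w = self.net(diff)                                  # channel-wise weights in [0, 1]
            mu_hat = mu_hat + w * mu_i + (1 - w) * mu_p         # Eq. (6)
            sigma_hat = sigma_hat + w * sigma_i + (1 - w) * sigma_p  # Eq. (7)
        n = len(source_mu)
        return mu_hat / n, sigma_hat / n
```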
Domain Conditioned Rescaling. The rescaling statistics stand for the channel-wise affine parameters {γ, β}. Denote the rescaling statistics of the meta-target domain D_d and the meta-source domain D_p as {γ_d, β_d} and {γ_p, β_p}, respectively. Similar to domain conditioned standardization, our goal is to learn a rescaling auto-encoder to infer the rescaling statistics of the meta-target domain D_d. The rescaling auto-encoder shares the same structure as the standardization auto-encoder, while its predictions are conditioned on the instance statistics {μ_d^i, σ_d^i} of the single meta-target sample x_d^i and the rescaling statistics {γ_p, β_p} of the meta-source domains. As stated in [10,23,34], the rescaling statistics {γ, β} act like an attention mechanism that measures the contribution of each channel to the current domain. When transferring from the meta-source domain to the meta-target domain, the rescaling statistics of the domain-shared channels should be kept the same, since these channels make similar contributions in both domains. In contrast, the rescaling statistics of the non-shared channels should be calibrated to learn the meta-target domain-specific information. Therefore, we devise the learning strategy of the meta-target rescaling statistics as a scaling of the meta-source rescaling statistics. Specifically, we concatenate σ_d^i with γ_p and μ_d^i with β_p. Then we feed the two concatenated vectors into the rescaling auto-encoder and infer the meta-target rescaling statistics as:

\hat{γ}_d = (1/(N_S − 1)) Σ_{p≠d} (g_r([σ_d^i, γ_p]) + 1) · γ_p,    (10)
Algorithm 1. The Meta-learning Strategy for DCN
Input: Model f_θ, Total Iterations T, Source Domains D_S.
Output: Model f_θ, Statistics on D_S ({μ, σ}, {γ, β}).
1: for all t in 1 · · · T do
2:   Start a meta-iteration:
3:   Random split D_S into meta-source domains D_p and meta-target domain D_d.
4:   Meta-train:
5:   Forward and infer meta-source DSBN statistics {μ_p, σ_p}, {γ_p, β_p}.
6:   Meta-test:
7:   Random select one sample x_d^i from D_d.
8:   Obtain instance statistics {μ_d^i, σ_d^i} with Eq. (5).
9:   Infer standardization statistics with Eq. (6) and Eq. (7).
10:  Infer rescaling statistics with Eq. (10) and Eq. (11).
11:  Infer ground-truth meta-target DSBN statistics {μ_d, σ_d}, {γ_d, β_d}.
12:  Compute the total loss with Eq. (13) and backward.
13: end for
\hat{β}_d = (1/(N_S − 1)) Σ_{p≠d} (g_r([μ_d^i, β_p]) + 1) · β_p,    (11)
where a Tanh activation is further appended to scale the decoder outputs. After acquiring the inferred rescaling statistics of the meta-target domain, we minimize their L2 distance to the ground truth {γ_d, β_d} obtained from DSBN on the meta-target domain:

L_res = ||\hat{γ}_d − γ_d||_2^2 + ||\hat{β}_d − β_d||_2^2.    (12)
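A corresponding sketch of the domain-conditioned rescaling inference (Eqs. (10)-(11)) follows, under the same assumptions. For brevity a single auto-encoder g_r is shared between γ and β here, whereas the paper notes that the decoders for γ and β are not shared.

```python
# Inferring target rescaling statistics from instance statistics and source affine parameters.
import torch
import torch.nn as nn

class RescalingInference(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.g_r = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.Tanh(),
        )

    def forward(self, mu_i, sigma_i, source_gamma, source_beta):
        gamma_hat, beta_hat = 0.0, 0.0
        for gamma_p, beta_p in zip(source_gamma, source_beta):
            gamma_hat = gamma_hat + (self.g_r(torch.cat([sigma_i, gamma_p])) + 1) * gamma_p  # Eq. (10)
            beta_hat = beta_hat + (self.g_r(torch.cat([mu_i, beta_p])) + 1) * beta_p          # Eq. (11)
        n = len(source_gamma)
        return gamma_hat / n, beta_hat / n
```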
3.3 Training and Inference
Training. In our meta-learning process, the DCN modules attempt to approximate the meta-target DSBN statistics conditioned on only a single meta-target sample and the meta-source DSBN statistics. To obtain the meta-source DSBN statistics, we need to train the backbone network during the meta-train stage. Similarly, we should also train the backbone network during the meta-test stage to acquire reliable meta-target DSBN statistics as the ground truth for the approximation. Since the meta-source and meta-target domain are randomly selected in each meta-iteration and the whole training process goes through multiple meta-iterations, we could merge the training of DSBN on both the meta-source and meta-target domain into only on the meta-target domain. In this way, the meta-source DSBN statistics can be obtained with only a forward propagation without additional training, which results in a more simplified and efficient meta-learning process. The training in the meta-test stage is then composed of training the DSBN statistics with the classification loss and training the DCN modules to approximate DSBN with Eq. (9) and Eq. (12). To ensure the discriminality of DCN, we also add a classification loss on the features
normalized by DCN. The total objective of the meta-test stage is then formulated as:

L = L_cls + λ(L'_cls + \bar{L}_stan + \bar{L}_res),    (13)

where L_cls and L'_cls are the standard cross-entropy loss functions for training DSBN and DCN respectively, \bar{L}_stan and \bar{L}_res are the averages of Eq. (9) and Eq. (12) over all DCN layers, and λ is the balancing weight, which is set according to the gradient magnitude of L_cls. The total training process is shown in Algorithm 1.

Inference. The inference process only adds a small amount of computation to the normalization operation and achieves nearly the same speed as the ERM baseline. When inferring a test example x_t from the target domain D_T, we first infer the normalization statistics of D_T at each DCN layer using x_t and the domain-specific statistics of each source domain. We obtain one set of inferred statistics per source domain and average them to normalize x_t. Finally, we feed the feature map of x_t normalized in this way into the classifier to obtain the prediction.
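The inference-time normalization can be sketched as follows, reusing the two inference modules above; `source_stats` holds the per-domain DSBN statistics and everything here is illustrative.

```python
# Normalizing a single test sample with inferred target statistics (Sect. 3.3).
import torch

def dcn_normalize(x, std_infer, res_infer, source_stats, eps=1e-5):
    # source_stats: list of per-domain dicts with keys "mu", "sigma", "gamma", "beta".
    mus = [s["mu"] for s in source_stats]
    sigmas = [s["sigma"] for s in source_stats]
    gammas = [s["gamma"] for s in source_stats]
    betas = [s["beta"] for s in source_stats]

    mu_hat, sigma_hat = std_infer(x, mus, sigmas)              # inferred standardization stats
    mu_i = x.mean(dim=(0, 2, 3))
    sigma_i = x.var(dim=(0, 2, 3), unbiased=False)
    gamma_hat, beta_hat = res_infer(mu_i, sigma_i, gammas, betas)

    x_stan = (x - mu_hat.view(1, -1, 1, 1)) / (sigma_hat.view(1, -1, 1, 1) + eps).sqrt()
    return gamma_hat.view(1, -1, 1, 1) * x_stan + beta_hat.view(1, -1, 1, 1)
```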
4 Experiments

In this section, we present both quantitative and qualitative results of our method. First, we describe the datasets and implementation details. Then, we compare our method with state-of-the-art methods to confirm the effectiveness of DCN. Furthermore, we conduct an ablation study to analyze the properties of our method.

4.1 Experiment Setup
Datasets. As our evaluation benchmarks, we use PACS [17], VLCS [35] and Office-Home [36]. Following [17], we apply the leave-one-domain-out protocol on all benchmarks, i.e., we leave one domain out as the target domain and use the other domains as source domains. For PACS and VLCS, we use the official splits to conduct the experiments. For Office-Home, we randomly split the dataset into a 90% training set and a 10% validation set [17]. We select the best model on the validation set. All reported results are averaged over five runs.

Implementation Details. For a fair comparison, we follow the implementations of [2,8,41]. We employ ResNet [9] as our backbone in all experiments and use the pretraining protocol in [17]. As for optimization, the backbone is trained with the SGD optimizer with a weight decay of 5e-4; the auto-encoders are trained with the Adam optimizer. We use a batch size of 32 and train the network for 30 epochs. The learning rate for the backbone is 0.001 on PACS and Office-Home and 5e-5 on VLCS; the auto-encoders use the same learning rates except 0.0001 on PACS. Both learning rates are decayed by 0.1 at 80% of the total epochs. We also apply some augmentations from RandAug [5] with probability 0.5, which makes the statistics of each source domain more robust.
Table 1. Comparison with different DG methods (%) using ResNet-18 and ResNet-50 on the PACS [17] dataset. The best results are marked as bold. The asterisk means that ASR-Norm must be combined with RSC.

Backbone | Method | Photo | Art | Cartoon | Sketch | Avg.
ResNet-18 | ERM | 95.12 | 78.37 | 75.16 | 75.41 | 81.02
ResNet-18 | BIN [27] | 95.00 | 82.10 | 74.10 | 80.00 | 82.80
ResNet-18 | EISNet [39] | 95.93 | 81.89 | 76.44 | 74.33 | 82.15
ResNet-18 | FACT [41] | 95.15 | 85.37 | 78.38 | 79.15 | 84.51
ResNet-18 | DSON [28] | 95.87 | 84.67 | 77.65 | 82.23 | 85.11
ResNet-18 | RSC [11] | 95.99 | 83.43 | 80.31 | 80.85 | 85.15
ResNet-18 | MetaNorm [6] | 95.99 | 85.01 | 78.63 | 83.17 | 85.70
ResNet-18 | ASR-Norm* [8] | 96.10 | 84.80 | 81.80 | 82.60 | 86.30
ResNet-18 | DCN (Ours) | 96.51 | 86.60 | 81.47 | 83.60 | 87.05
ResNet-50 | ERM | 97.64 | 84.94 | 76.98 | 76.75 | 84.08
ResNet-50 | EISNet [39] | 97.11 | 86.64 | 81.53 | 78.07 | 85.84
ResNet-50 | DSON [28] | 95.99 | 87.04 | 80.62 | 82.90 | 86.64
ResNet-50 | RSC [11] | 97.92 | 87.89 | 82.16 | 83.85 | 87.83
ResNet-50 | FACT [41] | 96.75 | 89.63 | 81.77 | 84.46 | 88.15
ResNet-50 | DCN (Ours) | 96.82 | 89.16 | 86.09 | 86.15 | 89.56
For the dimensions of the auto-encoders in DCN, the encoder and decoder dimensions are (C, C/2) and (C/2, C) in the standardization stage and (2C, C) and (C, C) in the rescaling stage, where C denotes the number of channels in that layer. Moreover, the decoders for γ and β are not shared. We set ε to 1e-5. For all experiments, only the BN layers in the first three residual blocks are replaced by DCN.
4.2 Comparison with State-of-the-Art Methods
In this section, we evaluate our method on PACS, VLCS and Office-Home. We compare our DCN with recent state-of-the-art DG methods to demonstrate the effectiveness of DCN. PACS. We use ResNet-18 and ResNet-50 as our backbone to perform the evaluation on PACS. We compare with many normalization-based DG methods, such as BIN [27], DSON [28], Meta-Norm [6] and ASR-Norm [8]. We also compare with other SOTA methods, such as EISNet [39], RSC [11] and FACT [41]. The results are shown in Table 1. The ERM baseline can achieve good performance on the photo domain, because the photo domain is similar to the pretrained dataset ImageNet. However, ERM fails on art, cartoon and sketch domains due to the large distribution shift. When inferring target samples, our DCN takes full advantage of the single target sample’s hint about the distribution of the target domain. Therefore, the improvement of DCN on ERM is more obvious when the distribution shift is larger. Compared with the SOTA, our DCN significantly outperforms other DG methods. In ResNet-18, DCN achieves the best performance on three domains
Table 2. Comparison with different DG methods (%) using ResNet-18 on VLCS [35] dataset. The best performance is marked as bold. Method
Caltech LabelMe Pascal Sun
Avg.
ERM
91.86
61.81
67.48
68.77
72.48
JiGen [2]
96.17
62.06
70.93
71.40
75.14
MMLD [24]
97.01
62.20
73.01
72.49
76.18
RSC [11]
96.21
62.51
73.81
72.10
76.16
65.36
73.59
74.97
77.65
62.11
74.88
75.78 77.75
StableNet [43] 96.67 DCN (Ours)
98.23
Fig. 3. The analysis of the divergence between the two outputs. (a) compares the classification loss of the two outputs, and (b) compares the L2 distance on both the two feature maps before the classifier and the two prediction vectors. The ablation study is conducted on VLCS dataset with LabelMe as the target domain. (Color figure online)
and the second best in the other domain, resulting in the best average accuracy. In ResNet-50, DCN has an improvement of 3.93% and 1.69% over the second best on cartoon and sketch domains respectively. Moreover, DCN improves 1.41% over the SOTA method FACT on average accuracy. This results show the effectiveness of DCN when it is incorporated into different backbones. VLCS. The results on VLCS are presented in Table 2. We compare with four DG methods: JiGen [2], MMLD [24], RSC [11] and StableNet [43]. Our method achieves the best performance on three domains and surpasses the latest StableNet [43] in terms of average performance. This shows that DCN can also perform well when the domain shift is small between source and target domains. Office-Home. The results on Office-Home are presented in Table 3. Due to the variations being mainly background and viewpoint, the distribution discrepancy is small on Office-Home, and the best DG method lifts very little on Product and Real-World (0.5% in Product and 0.35% in Real-World). Therefore, ERM acts as a strong baseline and outperforms many DG methods. Our DCN focuses
302
Y. Jiang et al.
Table 3. Comparision with different DG methods (%) using ResNet-18 on OfficeHome [36] dataset. The best performance is marked as bold. Method
Art
Clipart Product Real-World Avg.
ERM
58.90
49.40
74.30
76.20
64.70
CCSA [25]
59.90
49.90
74.10
75.70
64.90
MMD-AAE [19] 56.50
47.30
72.10
74.80
62.70
CrossGrad [30]
58.40
49.40
73.90
75.80
64.40
JiGen [2]
53.04
47.51
71.47
72.79
61.20
DDAIG [45]
59.20
52.30
74.60
76.00
65.50
L2A-OT [46]
60.60 50.10
74.80
77.00
65.60
RSC [11]
58.42
47.90
71.63
74.54
63.12
FACT [41]
60.34
54.85
74.48
76.55
66.56
DCN (Ours)
59.83
57.16
73.78
75.63
66.60
Table 4. The ablation study of different components of our method on the PACS dataset with ResNet-18. DCR and DCS represent the model are trained with only the domain conditioned rescaling and domain conditioned standardization components respectively. Method Photo Art
Cartoon Sketch Avg.
ERM
95.12
78.37
75.16
75.41
81.02
DCR
95.33
80.35
76.02
76.53
82.06
DCS
96.18
85.65
80.76
82.52
86.28
DCN
96.51 86.60 81.47
83.60
87.05
on inferring the normalization statistics for the target domain while the spatial features such as viewpoint are not reflected on the normalization statistics, so our method is slightly worse than ERM on Product and Real-World. However, our method improves 7.76% over ERM on Clipart, which is the hardest generalization task. It demonstrates once again that DCN is more effective when the distribution shift between source and target domains is larger. Moreover, DCN achieves comparable performance to the current SOTA method FACT. This again validates the effectiveness of our method. 4.3
4.3 Ablation Study
Impact of Different Components. Table 4 illustrates the effects of domain conditioned standardization (DCS) and domain conditioned rescaling (DCR) respectively. As shown in Tabel 4, the model only applying DCR has a small improvement on the baseline, while only applying DCS can improve the baseline by 5.26%. This is because the difference in standardization statistics between different domains is much larger than that in rescaling statistics. Furthermore, the model performs best when applying both DCR and DCS (i.e. DCN), which validates the necessity of both modules.
Table 5. The performance comparison of DCN in different locations of the network. block1 to block4 denote the first to the last residual blocks of ResNet-18, respectively (✓: BN replaced with DCN in that block; –: standard BN). The ablation study is conducted on the PACS dataset.
Method | block1 | block2 | block3 | block4 | Photo | Art | Cartoon | Sketch | Avg.
Baseline | – | – | – | – | 95.12 | 78.37 | 75.16 | 75.41 | 81.02
Model A | ✓ | – | – | – | 95.65 | 83.57 | 77.86 | 81.09 | 84.54
Model B | ✓ | ✓ | – | – | 96.35 | 85.79 | 79.86 | 82.43 | 86.11
Model C | – | ✓ | ✓ | – | 96.47 | 85.55 | 79.61 | 81.70 | 85.83
Model D | ✓ | ✓ | ✓ | – | 96.51 | 86.6 | 81.47 | 83.6 | 87.05
Model E | – | ✓ | ✓ | ✓ | 96.59 | 84.03 | 78.84 | 81.37 | 85.21
Model F | ✓ | ✓ | ✓ | ✓ | 96.77 | 84.72 | 80.33 | 81.98 | 85.95
The Divergence Between the Two Outputs. We analyse the divergence between the two outputs from different aspects. Figure 3(a) compares the classification losses of the two outputs: the blue and red lines represent the classification losses of the first and second outputs, respectively. As shown in Fig. 3(a), the curves of the two classification losses are very close. Figure 3(b) compares the distance between the two feature maps before the classifier (blue line) and the distance between the two prediction vectors (red line). We can observe that both distances become smaller and smaller. These two figures demonstrate that the inferred statistics of the meta-target domain are reasonable and get closer to the ground-truth statistics of the meta-target domain during training.
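For reference, the two distances plotted in Fig. 3(b) can be computed as follows; the tensor names (feat_1, feat_2, pred_1, pred_2) are placeholders for the two outputs' pre-classifier feature maps and prediction vectors, so this is only an illustrative sketch.

import torch

def divergence_metrics(feat_1, feat_2, pred_1, pred_2):
    """L2 distance between the two pre-classifier feature maps and
    between the two prediction vectors, averaged over the batch."""
    feat_dist = (feat_1 - feat_2).flatten(1).norm(p=2, dim=1).mean()
    pred_dist = (pred_1 - pred_2).norm(p=2, dim=1).mean()
    return feat_dist.item(), pred_dist.item()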
4.4 Further Analysis
Effect of Where DCN is Used. We conduct an extensive ablation study to investigate the effect of where DCN is used. As shown in Table 5, replacing BN with DCN only in the first residual block already greatly improves the performance, giving a 3.52% gain over the baseline. As the number of residual blocks using DCN increases, the generalization performance of the model keeps improving. We can also notice that Model F is worse than Model D and Model E is worse than Model C. This is mainly because the normalization statistics in the last residual block are closely related to semantics, so using DCN in the last residual block may cause certain disturbances. Meanwhile, Model B is better than Model C and Model D is better than Model E. This means the statistics difference between source and target domains is larger in the first residual block, so DCN is more needed in block1 to infer the target statistics instead of using the source statistics during inference.
Weight Differences Between DCN Layers. Figure 4 shows the average weight (which is learned by the standardization auto-encoder) for each DCN layer when cartoon is the target domain on the PACS dataset. It is obvious that the shallow DCN layers have a larger average weight than the deep DCN layers for all source domains. This confirms the intuition that shallow layers extract richer
Fig. 4. Analysis of average weights (averaged over the channels) for different DCN layers when cartoon is the target domain on the PACS dataset. The backbone is ResNet-18 and we only substitute the first 15 BN layers with DCN.
domain-specific features, so the shallow DCN layers rely more on a single target sample’s instance information to explore the domain-specific information of the target domain. On the other hand, deep DCN layers rely more on source statistics because the deep layers extract more semantic features, which are similar between source and target domains.
5 Conclusions
In this paper, we propose a novel domain generalization method that can infer the normalization statistics of the target domain and use the inferred statistics to normalize test data during inference. We name our method Domain Conditioned Normalization (DCN). DCN simulates the inference process of the normalization statistics with a meta-learning framework during training, and uses the source statistics and one single target sample to infer the normalization statistics of the target domain with an optimization-free procedure during inference. Extensive experiments on three benchmarks demonstrate that our DCN can infer reasonable target statistics and thus achieve state-of-the-art performance for domain generalization. Furthermore, we conduct ablation studies to analyze the effects and characteristics of DCN.
Acknowledgement. This work is supported by the National Key R&D Program of China (No. 2019YFB1804304), 111 plan (No. BP0719010), and STCSM (No. 18DZ2270700, No. 21DZ1100100), and State Key Laboratory of UHD Video and Audio Production and Presentation.
References 1. Balaji, Y., Sankaranarayanan, S., Chellappa, R.: MetaReg: towards domain generalization using meta-regularization. In: Advances in Neural Information Processing Systems, vol. 31, pp. 998–1008 (2018) 2. Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving Jigsaw puzzles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2229–2238 (2019) 3. Chang, W.G., You, T., Seo, S., Kwak, S., Han, B.: Domain-specific batch normalization for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7354–7362 (2019) 4. Chen, Y., Wang, Y., Pan, Y., Yao, T., Tian, X., Mei, T.: A style and semantic memory mechanism for domain generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9164–9173 (2021) 5. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703 (2020) 6. Du, Y., Zhen, X., Shao, L., Snoek, C.G.: MetaNorm: learning to normalize few-shot batches across domains. In: International Conference on Learning Representations (2020) 7. Dubey, A., Ramanathan, V., Pentland, A., Mahajan, D.: Adaptive methods for real-world domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14340–14349 (2021) 8. Fan, X., Wang, Q., Ke, J., Yang, F., Gong, B., Zhou, M.: Adversarially adaptive normalization for single domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8208–8217 (2021) 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 10. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018) 11. Huang, Z., Wang, H., Xing, E.P., Huang, D.: Self-challenging improves crossdomain generalization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 124–140. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58536-5 8 12. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015) 13. Iwasawa, Y., Matsuo, Y.: Test-time classifier adjustment module for model-agnostic domain generalization. In: Advances in Neural Information Processing Systems, vol. 34 (2021) 14. Jia, S., Chen, D.J., Chen, H.T.: Instance-level meta normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4865–4873 (2019) 15. Khurana, A., Paul, S., Rai, P., Biswas, S., Aggarwal, G.: SITA: single image testtime adaptation. arXiv preprint arXiv:2112.02355 (2021) 16. Kim, D., Yoo, Y., Park, S., Kim, J., Lee, J.: SelfReg: self-supervised contrastive regularization for domain generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9619–9628 (2021)
17. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5550 (2017) 18. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Learning to generalize: metalearning for domain generalization. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018) 19. Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning (2018) 20. Li, P., Li, D., Li, W., Gong, S., Fu, Y., Hospedales, T.M.: A simple feature augmentation for domain generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8886–8895 (2021) 21. Li, S., Xie, B., Lin, Q., Liu, C.H., Huang, G., Wang, G.: Generalized domain conditioned adaptation network. IEEE Trans. Pattern Anal. Mach. Intell. (2021) 22. Li, Y., et al.: Deep domain generalization via conditional invariant adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639 (2018) 23. Liang, S., Huang, Z., Liang, M., Yang, H.: Instance enhancement batch normalization: an adaptive regulator of batch noise. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4819–4827 (2020) 24. Matsuura, T., Harada, T.: Domain generalization using a mixture of multiple latent domains. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11749–11756 (2020) 25. Motiian, S., Piccirilli, M., Adjeroh, D.A., Doretto, G.: Unified deep supervised domain adaptation and generalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5715–5725 (2017) 26. Nado, Z., Padhy, S., Sculley, D., D’Amour, A., Lakshminarayanan, B., Snoek, J.: Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963 (2020) 27. Nam, H., Kim, H.E.: Batch-instance normalization for adaptively style-invariant neural networks. arXiv preprint arXiv:1805.07925 (2018) 28. Seo, S., Suh, Y., Kim, D., Kim, G., Han, J., Han, B.: Learning to optimize domain specific normalization for domain generalization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 68–83. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6 5 29. Shahtalebi, S., Gagnon-Audet, J.C., Laleh, T., Faramarzi, M., Ahuja, K., Rish, I.: Sand-mask: an enhanced gradient masking strategy for the discovery of invariances in domain generalization. arXiv preprint arXiv:2106.02266 (2021) 30. Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., Sarawagi, S.: Generalizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745 (2018) 31. Shi, Y., et al.: Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937 (2021) 32. Sun, B., Saenko, K.: Deep CORAL: correlation alignment for deep domain adaptation. In: Hua, G., J´egou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 443–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8 35 33. Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., Hardt, M.: Test-time training with self-supervision for generalization under distribution shifts. In: International Conference on Machine Learning, pp. 9229–9248. PMLR (2020) 34. Tang, Z., Gao, Y., Zhu, Y., Zhang, Z., Li, M., Metaxas, D.N.: CrossNorm and SelfNorm for generalization under distribution shifts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 52–61 (2021)
35. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR 2011, pp. 1521–1528. IEEE (2011) 36. Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027 (2017) 37. Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020) 38. Wang, J., Lan, C., Liu, C., Ouyang, Y., Zeng, W., Qin, T.: Generalizing to unseen domains: a survey on domain generalization. arXiv preprint arXiv:2103.03097 (2021) 39. Wang, S., Yu, L., Li, C., Fu, C.-W., Heng, P.-A.: Learning from extrinsic and intrinsic supervisions for domain generalization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 159–176. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7 10 40. Wang, X., Jin, Y., Long, M., Wang, J., Jordan, M.: Transferable normalization: towards improving transferability of deep neural networks. In: Advances in Neural Information Processing Systems (2019) 41. Xu, Q., Zhang, R., Zhang, Y., Wang, Y., Tian, Q.: A Fourier-based framework for domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14383–14392 (2021) 42. You, F., Li, J., Zhao, Z.: Test-time batch statistics calibration for covariate shift. arXiv preprint arXiv:2110.04065 (2021) 43. Zhang, X., Cui, P., Xu, R., Zhou, L., He, Y., Shen, Z.: Deep stable learning for out-of-distribution generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5372–5382 (2021) 44. Zhou, K., Liu, Z., Qiao, Y., Xiang, T., Loy, C.C.: Domain generalization: a survey. arXiv preprint arXiv:2103.02503 (2021) 45. Zhou, K., Yang, Y., Hospedales, T., Xiang, T.: Deep domain-adversarial image generation for domain generalisation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 13025–13032 (2020) 46. Zhou, K., Yang, Y., Hospedales, T., Xiang, T.: Learning to generate novel domains for domain generalization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12361, pp. 561–578. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-58517-4 33 47. Zhou, K., Yang, Y., Qiao, Y., Xiang, T.: Domain generalization with mixstyle. arXiv preprint arXiv:2104.02008 (2021)
Unleashing the Potential of Adaptation Models via Go-getting Domain Labels
Xin Jin1(B), Tianyu He2, Xu Shen2, Songhua Wu3, Tongliang Liu3, Jingwen Ye4, Xinchao Wang4, Jianqiang Huang2, Zhibo Chen5, and Xian-Sheng Hua2
1 Eastern Institute for Advanced Study, Ningbo, China ([email protected])
2 Alibaba Group, Hangzhou, China
3 The University of Sydney, Sydney, Australia
4 National University of Singapore, Singapore, Singapore
5 University of Science and Technology of China, Hefei, China
Abstract. In this paper, we propose an embarrassingly simple yet highly effective adversarial domain adaptation (ADA) method. We view the ADA problem primarily from an optimization perspective and point out a fundamental dilemma: real-world data often exhibit an imbalanced distribution in which large data clusters typically dominate and bias the adaptation process. Unlike prior works that attempt loss re-weighting or data re-sampling to alleviate this defect, we introduce a new concept of go-getting domain labels (Go-labels) that replace the original immutable domain labels on the fly. We call them "Go-labels" because "go-getting" means being able to deal with new or difficult situations easily; here, Go-labels adaptively transfer the model's attention from over-studied aligned data to overlooked samples, which allows each sample to be well studied (i.e., alleviating the influence of data imbalance) and fully unleashes the potential of the adaptation model. Albeit simple, this dynamic adversarial domain adaptation framework with Go-labels effectively addresses the data imbalance issue and promotes adaptation. We demonstrate through theoretical insights, empirical results on real data, and toy games that our method leads to efficient training without bells and whistles, while being robust to different backbones.
1 Introduction
Most deep models rely on huge amounts of labeled data and their learned features have proven brittle to data distribution shifts [58,68]. To mitigate the data discrepancy issue and reduce dataset bias, unsupervised domain adaptation (UDA) is extensively explored, which has access to labeled samples from a source domain and unlabeled data from a target domain. Its objective is to train a model that generalizes well to the target domain [8,10,14,15,19,24,25,28].
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9_18.
As a mainstream branch of UDA, adversarial domain adaptation (ADA) approaches leverage a domain discriminator paired with a feature generator to adversarially learn a domain-invariant feature [9,11,15,35,50]. For the domain discriminator training, all source data are equally taken as one domain (e.g., positive '1') while target data are taken as another (e.g., negative '0') [11,15,35]. However, this fixed positive-negative separation neglects the fact that most real-world data exhibit imbalanced distributions [12,13]: the clusters with abundant examples (i.e., large clusters) may swamp the clusters with few examples (i.e., small clusters). Such imbalance has two aspects, intra-class long-tailed distribution [34,44] and inter-class long-tailed distribution [55,64], and widely exists in many UDA benchmarks. For example, in DomainNet [42], the "dog" class in the "clipart" domain has 70 image samples while it has 782 image samples in the "real" domain. The majority of "bike" samples (90%) in the "Amazon" domain of Office31 [46] have no background scene (empty) while the minority of "bike" samples have real-world backgrounds instead. On the other hand, deep neural networks (DNNs) typically learn simple patterns first before memorizing. In other words, DNN optimization is content-aware, taking advantage of patterns shared by multiple training examples [2]. Therefore, in the process of domain adaptation, the large domain clusters dominate the optimization of the domain discriminator, biasing its decision boundary and hindering effective adaptation. As shown in Fig. 1(a), only the large clusters of the two domains (i.e., the two large circles) have been pulled close as the adaptation goes on, but the minority clusters (four small circles) are still under-aligned. This biases the optimization of the domain discriminator and misleads the feature extractor into learning unexpected domain-specific knowledge from the large clusters. As a result, the adapted model still cannot correctly classify these under-explored samples (marked by "misclassify"). In this paper, we attempt to design an optimization strategy to progressively take full advantage of both large and small data clusters across different domains,
Fig. 1. Motivation illustration. (×, ) denote two different classes, and the colors (blue, orange) denote different domains. (a) Previous DA methods tend to be dominated by the large clusters and neglect small clusters, which biases the domain discriminator optimization, leading to sub-optimal adaptation accuracy. (b) Our method attempts to fully leverage both large and small data clusters for alignment, to enhance domain-invariant representation learning, and thus achieve better adaptation performance on the target set. (Color figure online)
as shown in Fig. 1(b). In this way, domain-invariant representation learning can be gradually promoted and the potential of the adaptation model will be unleashed, leading to satisfactory classification performance. Our study differs from existing methods that are purely designed for long-tailed classification [22,69,71] in application scenarios and exhibits advantages in domain-agnostic representation learning. This problem is challenging, but valuable and meaningful for the DA task. A few works have also noticed the distribution imbalance issues in the domain adaptation task and try to tackle them by re-weighting (IWAN [61,70]), data re-sampling (RADA [24]), or data augmentation (Domain Mixup [65]). Differently, our paper focuses on a more general imbalance setting, which contains the two aspects of long-tailed intra-class and inter-class distribution. Besides, we aim for a high-powered optimization strategy that empowers the DA model to study each sample well and promote distribution alignment without any cost increase. To this end, we propose to replace the original immutable domain labels with an adjustable and importance-aware alternative, dubbed Go-getting Domain Labels (Go-labels). Its core idea is to adaptively reduce the importance of the dominant training data that have already been aligned, and timely encourage the domain discriminator to pay more attention to those easy-to-miss minority clusters, which ensures each sample can be well studied. In the implementation, we assign a go-getting domain label (Go-label) to each sample according to its own optimization situation: if one sample has ambiguous domain predictions (e.g., ∼0.5) when passing through the domain discriminator, it means this sample has been well studied, or in other words, the learned feature w.r.t. this sample is already domain-invariant. Then, we enforce a relaxation constraint on it by changing its groundtruth (i.e., directly taking 0.5 as its new domain label), so as to reduce its optimization importance. Our contributions are summarized as follows:
– We revisit the domain adaptation problem from an optimization perspective, and pinpoint the training defect caused by the imbalanced data distribution issue.
– To alleviate this issue, we propose a novel concept of go-getting domain labels (Go-labels) to achieve dynamic adaptation, which allows each sample to be well studied and reduces the long-tailed influence, so as to promote domain alignment for DA without any increase in computational cost.
– As a byproduct, our work also provides a new perspective to understand the task of adaptation, and gives theoretical insights about the effectiveness of the dynamic training strategy with Go-labels.
We thoroughly study the proposed Go-labels with several toy cases, and conduct experiments on multiple domain adaptation benchmarks, including Digit-Five, Office-31, Office-Home, VisDA-2017, and the large-scale DomainNet, upon various baselines, to show it is effective and reasonable.
2 Related Work
Unsupervised Domain Adaptation. Recent UDA works focus on two mainstream branches, (1) moment matching and (2) adversarial training. The former
works typically align features across domains by minimizing some distribution similarity metric, such as Maximum Mean Discrepancy (MMD) [7,36,62] and second-/higher-order statistics [28,42,54]. Adversarial domain adaptation (ADA) methods have achieved superior performance and this paper also focuses on them. The pioneering works of DANN [15] and ADDA [59] both employ a domain discriminator to compete with a feature extractor in a two-player mini-max game. CDAN [35] improves this idea by conditioning the domain discriminator on the information conveyed by the category classifier. MADA [41] uses multiple domain discriminators to capture multi-modal structures for fine-grained domain alignment. The recent GVB [11] gradually reduces the domain-specific characteristics in domain-invariant representations via a bridge layer between the generator and discriminator. MCD [49], STAR [37] and Symnet [72] all build an adversarial adaptation framework by leveraging the collision of multiple object classifiers. Unfortunately, all these methods ignore the imbalanced distribution issue in DA.
Imbalanced Domain Adaptation. Several prior works have noticed the distribution imbalance issues in the domain-adversarial field, and provided rigorous analysis and explanations [23,26,55,64,74]. In particular, IWAN [70] leverages the idea of re-weighting for adaptation, and RADA [24] enhances the ability of the domain discriminator in DA via sample re-sampling and augmentation. Besides, the works of [30,55,64] focus on the subpopulation shift issue (partial DA), where the source and target domains have imbalanced label distributions. Differently, our paper focuses on the more general covariate shift setting in DA, which contains the two aspects of long-tailed intra-class and inter-class distribution. Such imbalance problems widely exist in existing UDA benchmarks.
Adversarial Training. Our work is also related to research that aims to leverage or modify the discriminator output to further augment standard GAN training [1,3,17,18,39,52,63]. The core idea is to distill useful information from the discriminator to further regularize the generator and obtain better generation performance. Although our work shares a similar idea of enhancing adversarial training, the main contributions and target task are different.
3 Adversarial Domain Adaptation with Go-labels
3.1 Prior Knowledge Recap and Problem Definition
To be self-contained, we first briefly review the problem formulation of adversarial domain adaptation (ADA). Taking the classification task as an example, we denote the source domain as $\mathcal{D}_S = \{(x_i^s, y_i^s, d_i^s)\}_{i=1}^{N_s}$ with $N_s$ labeled samples covering $C$ classes, $y_i^s \in [0, C-1]$. $d_i^s$ is the domain label of each source sample and it always equals '1' during training [15,35]. The target domain is similarly denoted as $\mathcal{D}_T = \{(x_j^t, d_j^t)\}_{j=1}^{N_t}$ with $N_t$ unlabeled samples that belong to the same $C$ classes; $d_j^t$ denotes the domain label of each target sample and it always equals '0', so as to construct a '0–1' pair with the source samples for adversarial optimization. Most ADA algorithms tend to learn domain-invariant representations by adversarially training the feature extractor and domain discriminator
in a min-max two-player game [11,15,21,35]. They typically use a classification loss $\mathcal{L}_{cls}$ (i.e., cross-entropy loss $L_{ce}$) and a domain adversarial loss $\mathcal{L}_{adv}$ (i.e., binary cross-entropy loss $L_{bce}$) for training:
$$\mathcal{L}_{cls} = \frac{1}{N_s}\sum_{i=1}^{N_s} L_{ce}\big(C(F(x_i^s)),\, y_i^s\big),$$
$$\mathcal{L}_{adv} = \frac{1}{N_s}\sum_{i=1}^{N_s} L_{bce}\big(D(F(x_i^s)),\, d_i^s=1\big) + \frac{1}{N_t}\sum_{i=1}^{N_t} L_{bce}\big(D(F(x_i^t)),\, d_i^t=0\big), \quad (1)$$
where $F$, $C$, $D$ represent the feature extractor, the category classifier, and the domain discriminator, respectively. They are shared across domains. The total optimization objective is described as $\min_{D} \mathcal{L}_{adv} + \min_{F,C} (\mathcal{L}_{cls} - \mathcal{L}_{adv})$. Note that
a gradient reversal layer (GRL) [15] is often used to connect the feature extractor $F$ and the domain discriminator $D$ to achieve the adversarial function, by multiplying the gradient from $D$ by a certain negative constant during back-propagation to the feature extractor $F$.
Problem Definition of Imbalanced Data Distributions in DA. This paper focuses on the general covariate shift setting following [51,53] in the DA field, and assumes each domain presents an "imbalanced" data distribution. Suppose a source/target domain $\{(x_i, y_i)\}_{i=1}^{n}$ is drawn i.i.d. from an imbalanced distribution $P(x, y)$. Such imbalance comprises two aspects: 1) the marginal distribution $P(y)$ of classes is likely long-tailed, i.e., inter-class long-tailed; 2) the data distribution within each class is also long-tailed, i.e., intra-class long-tailed. We expect to learn a well-adapted model $F(\cdot;\theta)$ with the adversarial DA technique, equipped with a domain discriminator $D(\cdot;\omega)$, to learn domain-invariant representations.
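As a concrete illustration of the GRL mentioned above, a minimal PyTorch sketch is given below; the class and function names and the fixed coefficient lambd are our own choices rather than the authors' released implementation.

import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd
    in the backward pass, so F and D are trained adversarially."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: domain_logits = D(grad_reverse(F(x), lambd=1.0))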
Fig. 2. Visualization of aligned source and target samples at training iterations 50, 500, and 5000. Red and green points denote source and target domain data, respectively. The darker the color, the better the alignment, the more possible to be mis-classified by domain discriminator. (Color figure online)
Motivation Re-clarification. Here we look into whether the imbalanced data distribution issue actually hinders effective ADA training, through the t-SNE [48] visualization results. This experiment is conducted on Office31 [46] (W→A setting) with the baseline of DANN [15]. We count the number of times each sample is misclassified by the domain discriminator during DA training, and use this number as the color parameter. The darker the color, the
better the alignment, and the more likely the sample is to be mis-classified by the domain discriminator. From Fig. 2, we see that an imbalanced situation obviously emerges as training goes on: some samples (surrounded by a blue circle) have been well aligned/studied by the domain discriminator (the darker the color, the better the alignment), but some samples are still under-studied or not aligned well. Therefore, treating the aligned and not-yet-aligned training data in different ways, so that each sample is well explored and the imbalance influence is alleviated, is urgently required.
3.2 Proposed Go-getting Domain Labels
To alleviate the optimization difficulty caused by imbalanced data distributions and thus enhance domain-invariant representation learning, we introduce a dynamic adversarial domain adaptation framework with the proposed Go-labels: when calculating the domain adversarial loss on a mini-batch that contains both source and target domain samples, we replace the original immutable domain labels of the samples (source as '1', target as '0') with adjustable domain labels (i.e., Go-labels) on the fly. Formally, we modify the domain adversarial loss $\mathcal{L}_{adv}$ of Eq. 1 to
$$\mathcal{L}_{adv} = \frac{1}{N_s}\sum_{i=1}^{N_s} L_{bce}\big(D(F(x_i^s)),\, g_i^s\big) + \frac{1}{N_t}\sum_{i=1}^{N_t} L_{bce}\big(D(F(x_i^t)),\, g_i^t\big), \quad (2)$$
where $g_i^s$ and $g_i^t$ are the updated go-getting domain labels for the $i$-th source sample and the $i$-th target sample in the mini-batch; they are no longer a fixed '1' or '0', but become adjustable and adaptive. Intuitively, a reliable metric that distinguishes the well-aligned large-cluster data from the not-yet-aligned small-cluster data is needed for the new domain label assignment.
Measurement of Alignment. The critic, i.e., the domain discriminator $D$, can be seen as an online scoring function for data: a sample receives a higher score (∼1) if its extracted feature is close to the source distribution, and a lower score (∼0) if its extracted feature is close to the target distribution. Thus, we directly take the predicted domain results of the domain discriminator, denoted as $\hat{d}^s$/$\hat{d}^t$, as the alignment measurement for each source/target sample. For example, if the domain discriminator prefers to classify a source sample ($d_i^s = 1$) as target data, i.e., $\hat{d}_i^s \to 0$, we believe the learned feature w.r.t. this sample has been well aligned and is fake enough to fool the domain discriminator. In this way, we can distinguish well-aligned and not-yet-aligned data online during training.
Go-getting Domain Labels Update. In the implementation, we merge the alignment measurement (i.e., well-aligned sample selection) and the domain label update into a single step. Formally, we leverage a non-parametric mathematical rounding $\mathrm{Round}(\cdot)$ to modify the original domain labels $d_i^s = 1$, $d_i^t = 0$ of the $i$-th source/target sample according to their predicted domain results $\hat{d}_i^s$, $\hat{d}_i^t$:
$$g_i^s = \frac{d_i^s + \mathrm{Round}(\hat{d}_i^s)}{2}, \qquad g_i^t = \frac{d_i^t + \mathrm{Round}(\hat{d}_i^t)}{2} \quad (3)$$
where go-getting domain labels of gis , git are dynamic and adjustable, depending on the different domain prediction results dsi , dti . The original domain label dsi = 1, dti = 0 can be taken as groundtruth, and their intermediate decision boundary (dsi + dti )/2 = 0.5 can be regarded as a threshold to automatically update go-getting domain labels through Round(·). That means that if the domain prediction result of a source sample is lower than the threshold of 0.5, i.e., dsi < 0.5, we believe the learned feature w.r.t this sample has been well aligned and is fake enough to fool domain discriminator. It can be see that the Round(·) function could keep the raw domain labels unchanged for those correctly classified samples by D. They have not been well aligned (i.e., dsi > 0.5 and dti < 0.5). We only update the domain labels for these mis-classified well-aligned samples (i.e., dsi ≤ 0.5 and dti ≥ 0.5), which reduces the optimization importance of these aligned training data and encourage the domain discriminator to pay more attention to those not aligned data. Implementation in PyTorch. A simple PyTorch-like [40] pseudo-code snippet is shown below. The dynamic adversarial DA with go-getting domain labels (Golabels) modification amounts simply to the addition of lines 9, 10 of the example code, which indicates its ease of implementation and generality. 1 2
# Extract features from source ( s ) and target ( t ) domain feat_s , feat_t = Extractor ( sample_s , sample_t )
3 4 5 6
# Get true domain labels and domain predictions d_s , d_t = 1 , 0 p_s , p_t = D o m a i n _ D i s c r i m i n a t o r ( feat_s , feat_t )
7 8 9 10
# Get updated go - getting domain labels g_s = ( d_s + torch . Round ( p_s . detach () ) / 2.0 g_t = ( d_t + torch . Round ( p_t . detach () ) / 2.0
11 12 13
# Compute adversarial loss with new go - getting domain labels loss_adv = torch . BCELoss ( p_s , g_s ) + torch . BCELoss ( p_t , g_t )
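For completeness, the snippet below sketches how the Go-labels adversarial loss could be combined with the source classification loss in one training step. It is a hedged sketch, not the authors' code: the module names (extractor, classifier, discriminator), the helper grad_reverse (assumed to be a gradient reversal layer as sketched in Sect. 3.1), and the use of F.binary_cross_entropy on sigmoid outputs are all assumptions.

import torch
import torch.nn.functional as F

def training_step(extractor, classifier, discriminator, optimizer,
                  x_s, y_s, x_t, lambd=1.0):
    feat_s, feat_t = extractor(x_s), extractor(x_t)

    # Standard source classification loss.
    loss_cls = F.cross_entropy(classifier(feat_s), y_s)

    # Domain probabilities through a gradient reversal layer;
    # the discriminator is assumed to end with a sigmoid and output (B, 1).
    p_s = discriminator(grad_reverse(feat_s, lambd)).squeeze(1)
    p_t = discriminator(grad_reverse(feat_t, lambd)).squeeze(1)

    # Go-getting domain labels: 1 (source) / 0 (target), softened to 0.5
    # for samples the discriminator already mis-classifies.
    g_s = (1.0 + torch.round(p_s.detach())) / 2.0
    g_t = (0.0 + torch.round(p_t.detach())) / 2.0

    loss_adv = F.binary_cross_entropy(p_s, g_s) + F.binary_cross_entropy(p_t, g_t)

    loss = loss_cls + loss_adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()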
Discussion: Why Use Rounding? Rounding-based dynamic domain labels only progressively reduce the importance of the well-aligned (i.e., mis-classified by the discriminator) majority samples, while keeping the labels of the not-yet-aligned minority data unchanged. This design makes the dynamic change of go-getting domain labels more "targeted". Without rounding, the real-valued soft Go-labels would always be affected by the probability scores of the domain discriminator, even when the discriminator has not yet been well trained at the early stage. In short, the physical meaning behind Go-labels is to softly reduce the importance of the dominant majority samples on the fly while progressively transferring the optimization focus to the minority data.
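A small numerical example of the update rule in Eq. (3): only samples that the discriminator already mis-classifies receive the softened label 0.5; the probability values below are made up purely for illustration.

import torch

# Predicted "source" probabilities for three source samples (d = 1)
# and three target samples (d = 0); values are illustrative only.
p_s = torch.tensor([0.95, 0.70, 0.30])  # 0.30 already fools the discriminator
p_t = torch.tensor([0.10, 0.45, 0.80])  # 0.80 already fools the discriminator

g_s = (1.0 + torch.round(p_s)) / 2.0    # -> [1.0, 1.0, 0.5]
g_t = (0.0 + torch.round(p_t)) / 2.0    # -> [0.0, 0.0, 0.5]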
3.3 Theoretical Insights of Go-labels
Many classic domain adaptation approaches typically bound/model the adapted target error by the sum of the (1) source error and (2) a notion of distance
between the source and the target distributions. The classic generalization bound based on the $\mathcal{H}$-divergence, which builds on the earlier work of [29] and is used by [4,5,15], is obtained following Theorem 1 in [6]:
$$R_t(h) \le \hat{R}_s(h) + \frac{1}{2}\, d_{\mathcal{H}}\big(\hat{D}_S^{N}, \hat{D}_T^{N}\big) + C, \quad (4)$$
where $C$ is a constant when such a bound is achieved by a hypothesis in $\mathcal{H}$, and $\hat{D}_S^{N}$, $\hat{D}_T^{N}$ denote the empirical distributions induced by samples of size $N$ drawn from $D_S$, $D_T$ respectively. $R_t$ denotes the true risk on the target domain, and $\hat{R}_s$ denotes the empirical risk on the source domain. Let $\{x_i^s\}_{i=1}^{N}$, $\{x_i^t\}_{i=1}^{N}$ be the samples in the empirical distributions $\hat{D}_S$ and $\hat{D}_T$ respectively. The empirical source risk can be written as $\hat{R}_s(h) = \frac{1}{N}\sum_i^{N} \hat{R}_{x_i^s}(h)$.
Now, considering the dynamically updated source-target domain distributions $\hat{D}_{dS}$ and $\hat{D}_{dT}$ achieved by the proposed adjustable go-getting domain labels, which corresponds to relabeling the well-aligned samples (assuming that the number of selected well-aligned target samples is $M$), the generalization bound for this updated data distribution can be modified as
$$R_t(h) \le \frac{1}{N}\sum_i^{N} \hat{R}_{x_i^s}(h) + \frac{1}{M}\sum_j^{M} \hat{R}_{x_j^t}(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}\big(\hat{D}_{dS}^{N+M}, \hat{D}_{dT}^{N-M}\big), \quad (5)$$
where the first two terms on the right become an updated empirical risk that could re-energize the object classifier optimization, and the last term becomes an updated domain discrepancy/divergence that could re-energize the domain discriminator optimization. Together they unleash the potential of the adaptation model. Besides, the risk of the target domain can be re-bounded by the risk of the updated source domain and the updated domain discrepancy, providing theoretical guarantees for the proposed approach. When $M = 0$, we recover the original bound of Eq. (4). Hence, the original bound is in the feasible set of our optimization with Go-labels.
4 Experiments
4.1 Validation on Toy Problems
2D Random Point Classification. First, we observe the behavior of our method on a toy problem of 2D random point classification. We compare the class decision boundary of our method with that of a Baseline obtained from the domain discriminator trained with immutable domain labels. To better evaluate the adaptation performance of the trained model, we visualize source and target data separately. Experimental details are provided in the Supplementary. We observe that the Baseline scheme is prone to missing the small tail cluster, especially when it is very close to a large cluster belonging to a different class. In contrast, our method can better leverage both large/head and small/tail data clusters in the different domains to reduce the discrepancy.
Inter-twinning Moons. Furthermore, we observe the behavior of Go-labels on the toy problem of inter-twinning moons [15,49]. We compare our method with the model trained with source data only and with DANN [15] in Fig. 3. We observe that both baselines, Source Only and DANN, neglect the outlier samples. In contrast, our method not only obtains a satisfactory classification boundary between the two classes in the source domain, but also covers these minority tail data well and classifies them into the correct class. More details are presented in the Supplementary.
Fig. 3. The second toy game of inter-twinning moons: (a) label classification and (b) representation PCA, each comparing Source Only, DANN, and Ours. Red "+", green "-", and black "·" markers indicate the source positive samples (label 1), source negative samples (label 0), and target samples, respectively.
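The inter-twinning moons data can be generated along the following lines; since the exact protocol is in the Supplementary, the use of sklearn's make_moons, the rotation angle, the noise level, and the sample counts are assumptions, and this sketch only illustrates the kind of covariate shift involved.

import numpy as np
from sklearn.datasets import make_moons

def make_moons_domains(n_source=300, n_target=300, angle_deg=35.0,
                       noise=0.1, seed=0):
    """Source: standard two-moons; target: the same moons rotated,
    mimicking a covariate shift between domains (illustrative setup)."""
    rng = np.random.RandomState(seed)
    x_s, y_s = make_moons(n_samples=n_source, noise=noise, random_state=rng)
    x_t, y_t = make_moons(n_samples=n_target, noise=noise, random_state=rng)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    x_t = x_t @ rot.T
    return (x_s, y_s), (x_t, y_t)  # target labels y_t unused during adaptation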
4.2 Experiments on the General UDA Benchmarks
Table 1. Classification accuracy (mean ± std %) of different schemes. We evaluate the effectiveness of Go-labels with different baselines, including DANN [15], CDAN [35], GVB [11], on the Digit-Five/Office31 datasets with a Conv3FC2 [42]/ResNet-50 [20] backbone. We re-implement all the baselines, thus the results are slightly different from the ones reported in the original papers.
(a) Comparison results on Digit-Five.
Method | mn→sv | mn→sy | sv→mn | sv→sy | sy→mn | sy→sv | Avg.
DANN [15] | 23.2 ± 0.5 | 40.0 ± 0.3 | 71.0 ± 0.3 | 84.6 ± 0.1 | 93.6 ± 0.4 | 84.7 ± 0.3 | 66.2
+Go-labels | 26.3 ± 0.4 | 40.7 ± 0.3 | 79.0 ± 0.2 | 87.7 ± 0.7 | 95.3 ± 0.2 | 85.1 ± 0.2 | 69.0
CDAN [35] | 29.8 ± 0.3 | 39.3 ± 0.5 | 69.3 ± 0.1 | 90.5 ± 0.0 | 92.5 ± 0.5 | 86.3 ± 0.1 | 67.9
+Go-labels | 28.1 ± 0.5 | 41.3 ± 0.3 | 78.6 ± 0.0 | 90.6 ± 0.0 | 95.5 ± 0.5 | 86.4 ± 0.1 | 70.1
GVB [11] | 30.0 ± 0.1 | 40.4 ± 0.2 | 72.5 ± 0.2 | 90.8 ± 0.5 | 91.9 ± 0.3 | 86.6 ± 0.3 | 68.7
+Go-labels | 30.3 ± 0.1 | 42.1 ± 0.2 | 79.6 ± 0.1 | 90.9 ± 0.5 | 95.9 ± 0.3 | 87.2 ± 0.0 | 71.0
(b) Comparison results on Office31.
Method | A→D | A→W | D→W | W→D | D→A | W→A | Avg.
DANN [15] | 82.9 ± 0.5 | 88.7 ± 0.3 | 98.5 ± 0.3 | 100 ± 0.0 | 64.9 ± 0.4 | 62.8 ± 0.3 | 82.9
+Go-labels | 89.9 ± 0.4 | 92.4 ± 0.3 | 98.9 ± 0.2 | 100 ± 0.0 | 71.6 ± 0.0 | 68.3 ± 0.2 | 86.9
CDAN [35] | 92.2 ± 0.3 | 93.1 ± 0.5 | 98.7 ± 0.1 | 100 ± 0.0 | 72.8 ± 0.5 | 70.1 ± 0.0 | 87.8
+Go-labels | 93.2 ± 0.5 | 93.3 ± 0.3 | 98.6 ± 0.0 | 100 ± 0.0 | 73.8 ± 0.5 | 74.2 ± 0.3 | 88.9
GVB [11] | 94.8 ± 0.1 | 92.2 ± 0.3 | 94.5 ± 0.3 | 100 ± 0.0 | 75.3 ± 0.2 | 73.2 ± 0.3 | 88.3
+Go-labels | 95.0 ± 0.3 | 93.7 ± 0.2 | 98.5 ± 0.2 | 100 ± 0.0 | 74.9 ± 0.4 | 74.3 ± 0.5 | 89.4
Datasets. Besides the toy tasks, we also conduct experiments on the commonly-used domain adaptation (DA) datasets, including Digit-Five [14], Office31 [46], Office-Home [60], VisDA-2017 [43], and DomainNet [42]. These datasets cover various kinds of domain gaps, such as handwritten digit style discrepancy, office-supply imaging discrepancy, and synthetic↔real-world environment discrepancy. The data distribution imbalance issue also exists widely, and is especially serious for the large-scale set, DomainNet. Detailed introductions to each dataset can be found in the Supplementary.
Implementation Details. As a plug-and-play optimization strategy, we apply our Go-labels on top of four representative ADA baselines, DANN [15], CDAN [35], GVB [11], and ASAN [45], for validation. DANN has been described in Sect. 3, and CDAN additionally conditions the domain discriminator on the information conveyed by the category classifier predictions (class likelihoods). The recently proposed GVB equips the adversarial adaptation framework with a gradually vanishing bridge, which reduces the transfer difficulty by reducing the domain-specific characteristics in representations. ASAN [45] integrates relevance spectral alignment and spectral normalization into CDAN. All reported results are obtained from the average of multiple runs (Supplementary).
Effectiveness of Go-getting Domain Labels. Our proposed Go-labels is generic and can be applied to most existing ADA frameworks to alleviate the optimization difficulty caused by imbalanced domain data distributions, and thus enhance domain-invariant representation learning. To prove that, we adopt three baselines, DANN [15], CDAN [35], GVB [11], and evaluate adaptation performance on Digit-Five and Office31, respectively. Table 1(a)(b) shows the comparison results. We observe that, regardless of the difference in framework design, our Go-labels (all +Go-labels schemes) consistently improves the accuracy of all three baselines on both datasets, i.e., 2.8%/4.0%, 2.2%/1.1%, 2.3%/1.1% average gains for DANN, CDAN, GVB on Digit-Five/Office31, respectively. With the help of Go-labels, each sample can be well explored in a dynamic way, resulting in better adaptation performance.
What Happens to the Domain Discriminator When Training with Go-labels? For this experiment, we collect statistics on the mis-classified cases of the domain discriminator during training, and then visualize the changing trend in Fig. 4. There are two symmetrical mis-classified cases to count: a raw source sample mis-classified into the target domain, or a raw target sample mis-classified into the source domain. Experiments are conducted on the Office31 and VisDA-2017 datasets; the compared baseline scheme is DANN [15]. As shown in Fig. 4, the number of cases mis-classified by the domain discriminator in our method is larger than that in the baseline. We know that 'mis-classified by the domain discriminator' is approximately equivalent to 'well-aligned'. Therefore, more 'mis-classified' samples by the domain
discriminator indicate that our method with Go-labels has the capability to align more samples, or in other words, to better cover those easy-to-miss minority clusters for alignment.
Loss Curve Comparison. Here we also show and compare the loss curves of the domain discriminator for the baseline DANN and our method. From Fig. 5, we can observe that the loss curve of the baseline first drops quickly and then gradually rises to near a constant as training progresses. In comparison, the domain discriminator loss curve of our method drops slowly, because more samples (including both large- and small-cluster data) need to be studied/aligned during training, which in turn drives better domain-invariant representation learning.
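The statistic plotted in Fig. 4 can be collected with a simple counter such as the one below; the function name, variable names, and the 0.5 threshold on the (assumed sigmoid) discriminator outputs are straightforward bookkeeping assumptions rather than the authors' exact script.

import torch

@torch.no_grad()
def count_misclassified_by_D(p_s, p_t, threshold=0.5):
    """Count samples the domain discriminator gets wrong in one mini-batch:
    source samples predicted as target (p_s < threshold) and
    target samples predicted as source (p_t >= threshold)."""
    mis_src = (p_s < threshold).sum().item()
    mis_tgt = (p_t >= threshold).sum().item()
    return mis_src, mis_tgt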
Fig. 4. Trend analysis of the mis-classified case statistics of the domain discriminator during training. Here, the baseline is DANN [15] with ResNet-50 as the backbone.
Fig. 5. Domain discriminator loss curves of baseline (DANN) and our method (DANN + Go-labels). Experiments are on the setting of W→A of Office31.
Why Not Directly Ignore Well-Aligned Data? The core idea of our dynamic adversarial domain adaptation with Go-labels is to progressively transfer the model's attention from over-studied aligned data to the overlooked samples, so as to allow each sample to be well studied. Therefore, an intuitive alternative solution is to directly discard the over-aligned data, e.g., simply zero out their gradients. We conduct this experiment on Office31 based on DANN [15]. In Table 2, we see that the scheme of DANN + Zero Out, which directly discards these well-aligned samples, is even inferior to the Baseline (DANN) by 2.1% on average. This indicates that such a 'hard and rude' data filtering trick is
sub-optimal because it may lose some important knowledge by mistake. Differently, our Go-labels training strategy can softly and progressively transfer the focus of optimization from the over-aligned samples to the under-explored data.
Comparison with Re-weighting Based Methods. As pointed out in previous research [27,33], re-weighting schemes risk over-fitting the tail data (by over-sampling) and under-fitting the global data distribution (by under-sampling) when the data imbalance is extreme [76]. Besides, most sample re-weighting techniques [67] start the re-weighting operation from the beginning of the entire training process. However, the non-converged feature extractor may affect the re-weighting decision and cause unstable training. To prove that, we further compare our Go-labels with some sample (re-)weighting based methods, including entropy-based re-weighting (+ E) [35] and IWAN [70]. Entropy-based re-weighting (+ E) prioritizes the easy-to-transfer samples according to the predictions of the category classifier to ease the entire adaptation optimization. IWAN [70] re-weights the source samples to exclude the outlier classes in the source domain. Table 2 shows the comparison results. We can observe that even though all the sample re-weighting strategies bring performance gains (3.0% for + E and 2.4% for + IWAN), our Go-labels strategy still outperforms all competitors. In addition, our Go-labels is also complementary to these re-weighting techniques: the scheme of DANN + E + Go-labels still achieves a 1.6% gain in comparison with DANN + E.
Table 2. Comparison with gradient penalization and re-weighting related methods on Office31. The adopted baseline is DANN.
Method | A→D | A→W | D→W | W→D | D→A | W→A | Avg.
DANN [15] | 82.9 ± 0.5 | 88.7 ± 0.3 | 98.5 ± 0.3 | 100 ± 0.0 | 64.9 ± 0.4 | 62.8 ± 0.3 | 82.9
+ Zero Out | 84.3 ± 0.2 | 82.9 ± 0.1 | 98.2 ± 0.3 | 100 ± 0.0 | 58.0 ± 0.4 | 61.1 ± 0.5 | 80.8
+ E [35] | 86.3 ± 0.1 | 91.0 ± 0.2 | 98.8 ± 0.3 | 100 ± 0.0 | 69.6 ± 0.3 | 69.8 ± 0.5 | 85.9
+ IWAN [70] | 85.9 ± 0.1 | 91.9 ± 0.1 | 98.3 ± 0.2 | 100 ± 0.0 | 68.3 ± 0.4 | 67.5 ± 0.5 | 85.3
+ Go-labels | 89.9 ± 0.4 | 92.4 ± 0.3 | 98.9 ± 0.2 | 100 ± 0.0 | 71.6 ± 0.0 | 68.3 ± 0.2 | 86.9
+ E + Go-labels | 91.2 ± 0.2 | 91.4 ± 0.4 | 99.1 ± 0.1 | 100 ± 0.0 | 71.4 ± 0.1 | 71.9 ± 0.3 | 87.5
Go-labels is Well-Suited to DA Settings with Intra-class and Inter-class Imbalance. The results on DomainNet [42] can be taken as experimental evidence for this point. Because DomainNet has multiple domains, when testing the model's adaptation ability on a certain target domain, the remaining domains are mixed into a large source domain. Such a large source domain is seriously imbalanced, in both the intra-class and inter-class sense [55]. From Table 3, we can observe that our Go-labels consistently achieves gains on the different sub-settings, which demonstrates it remains effective for DA settings with different degrees of imbalance. We analyze that the go-getting labeling encourages the domain discriminator to learn each sample well so as to obtain a
better source and target domain alignment. This in turn drives a better feature extractor that learns discriminative and domain-invariant features for all samples (they promote each other). Thus, a better feature extractor further improves the classifier and the classification accuracy, even when the classes are still imbalanced.
Analysis About the Rounding Operation in Go-labels. To validate the rounding design in Go-labels, we experimented with real-valued soft go-getting domain labels (based on the probability scores without rounding) for comparison. Actually, this was the initial version of our Go-labels. This scheme of using real-valued soft go-getting domain labels (built upon DANN) is inferior to our rounding version by 9.4% in average accuracy on Office31 (77.5% vs. 86.9%; the DANN baseline is 82.9%). We attribute such a large drop to the fact that the real-valued soft go-getting domain labels of training samples are always affected by the probability scores of the domain discriminator, even when the discriminator has not yet been well trained at the early stage. On the contrary, our rounding-based Go-labels has no influence on the optimization at the stage where the domain discriminator can clearly/correctly classify a source/target sample. It only reduces the importance of the well-aligned (mis-classified by the discriminator) majority samples progressively, while keeping the not-yet-aligned minority data unchanged. In short, the rounding design makes Go-labels more robust.
Table 3. Classification accuracy on DomainNet. ResNet-101 as backbone.
Methods | Clipart | Infograph | Painting | Quickdraw | Real | Sketch | Average
MDAN [75] | 60.3 ± 0.41 | 25.0 ± 0.43 | 50.3 ± 0.36 | 8.2 ± 1.92 | 61.5 ± 0.46 | 51.3 ± 0.58 | 42.8
M3SDA [42] | 58.6 ± 0.53 | 26.0 ± 0.89 | 52.3 ± 0.55 | 6.3 ± 0.58 | 62.7 ± 0.51 | 49.5 ± 0.76 | 42.7
CMSS [67] | 64.2 ± 0.18 | 28.0 ± 0.20 | 53.6 ± 0.39 | 16.0 ± 0.12 | 63.4 ± 0.21 | 53.8 ± 0.35 | 46.5
CDAN [35] | 63.3 ± 0.21 | 23.2 ± 0.11 | 54.0 ± 0.34 | 16.8 ± 0.41 | 62.8 ± 0.14 | 50.9 ± 0.43 | 45.2
+ Go-labels | 65.7 ± 0.22 | 25.7 ± 0.34 | 55.6 ± 0.21 | 18.4 ± 0.31 | 63.6 ± 0.28 | 53.6 ± 0.13 | 47.1
Table 4. Performance (%) comparisons with the state-of-the-art UDA approaches on Office31. All experiments are based on ResNet-50 pre-trained on ImageNet.
Method | Venue | A→D | A→W | D→W | W→D | D→A | W→A | Avg.
DANCE [47] | NeurIPS'20 | 89.4 ± 0.1 | 88.6 ± 0.2 | 97.5 ± 0.4 | 100.0 ± .0 | 69.5 ± 0.5 | 68.2 ± 0.2 | 85.5
Re-weight [61] | IJCAI'20 | 91.7 ± – | 95.2 ± – | 98.6 ± – | 100.0 ± – | 74.5 ± – | 73.7 ± – | 89.0
DADA [57] | AAAI'20 | 92.3 ± 0.1 | 93.9 ± 0.2 | 99.2 ± 0.1 | 100.0 ± .0 | 74.4 ± 0.1 | 74.2 ± 0.1 | 89.0
MetaAlign [62] | CVPR'21 | 93.0 ± 0.5 | 94.5 ± 0.3 | 98.6 ± 0.0 | 100.0 ± .0 | 75.0 ± 0.3 | 73.6 ± 0.0 | 89.2
FGDA [16] | ICCV'21 | 93.3 ± – | 93.2 ± – | 99.1 ± – | 100.0 ± .0 | 73.2 ± – | 72.7 ± – | 88.6
SCDA [32] | ICCV'21 | 94.2 ± – | 95.2 ± – | 98.7 ± – | 99.8 ± – | 75.7 ± – | 76.2 ± – | 90.0
RADA [24] | ICCV'21 | 96.1 ± 0.4 | 96.2 ± 0.4 | 99.3 ± 0.1 | 100.0 ± .0 | 77.5 ± 0.1 | 77.4 ± 0.3 | 91.1
CDAN + E (Baseline) [35] | NIPS'18 | 90.8 ± 0.3 | 94.0 ± 0.5 | 98.1 ± 0.3 | 100.0 ± .0 | 72.4 ± 0.4 | 72.1 ± 0.3 | 87.9
CDAN + E + Go-labels | This work | 94.2 ± 0.3 | 93.3 ± 0.1 | 99.0 ± 0.1 | 100.0 ± .0 | 75.8 ± 0.1 | 75.2 ± 0.3 | 89.6
GVB (Baseline) [11] | CVPR'20 | 94.8 ± 0.1 | 92.2 ± 0.2 | 94.5 ± 0.2 | 100.0 ± .0 | 75.3 ± 0.3 | 73.2 ± 0.4 | 88.3
GVB + Go-labels | This work | 95.0 ± 0.3 | 93.7 ± 0.1 | 98.5 ± 0.1 | 100.0 ± .0 | 74.9 ± 0.1 | 74.3 ± 0.3 | 89.4
ASAN (Baseline) [45] | ACCV'20 | 95.6 ± 0.4 | 94.4 ± 0.9 | 98.8 ± 0.1 | 100.0 ± .0 | 74.7 ± 0.3 | 74.0 ± 0.9 | 90.0
ASAN + Go-labels | This work | 98.8 ± 0.2 | 96.9 ± 0.2 | 99.1 ± 0.2 | 100.0 ± .0 | 77.0 ± 0.3 | 75.9 ± 0.6 | 91.3
4.3 Comparison with State-of-the-Arts
As a general technique, we insert our Go-labels into multiple DA algorithms for validation: CDAN with entropy regularization [35] (CDAN+E), GVB [11], ASAN [45], and RADA [24]. Table 4, Table 5 and Table 6 show the comparisons with the state-of-the-art approaches on Office31, Office-Home and VisDA-2017, respectively. For a fair comparison, we report the results from their original papers if available, and we also report the results of the baseline schemes GVB and CDAN+E reproduced by our implementation. We find that GVB+Go-labels, CDAN+E+Go-labels, and ASAN+Go-labels all outperform their corresponding baselines and achieve state-of-the-art performance on the three datasets, while Go-labels remains simple and efficient, incurring no extra computation.
Table 5. Performance (%) comparisons with the state-of-the-art UDA approaches on Office-Home. All experiments are based on ResNet-50 pre-trained on ImageNet.
Method | Venue | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg
BNM [10] | CVPR'20 | 52.3 | 73.9 | 80.0 | 63.3 | 72.9 | 74.9 | 61.7 | 49.5 | 79.7 | 70.5 | 53.6 | 82.2 | 67.9
DANCE [47] | NIPS'20 | 54.3 | 75.9 | 78.4 | 64.8 | 72.1 | 73.4 | 63.2 | 53.0 | 79.4 | 73.0 | 58.2 | 82.9 | 69.1
Reweight [61] | IJCAI'20 | 55.5 | 73.5 | 78.7 | 60.7 | 74.1 | 73.1 | 59.5 | 55.0 | 80.4 | 72.4 | 60.3 | 84.3 | 68.9
SRDC [56] | CVPR'20 | 52.3 | 76.3 | 81.0 | 69.5 | 76.2 | 78.0 | 68.7 | 53.8 | 81.7 | 76.3 | 57.1 | 85.0 | 71.3
CKB [38] | CVPR'21 | 54.2 | 74.1 | 77.5 | 64.6 | 72.2 | 71.0 | 64.5 | 53.4 | 78.7 | 72.6 | 58.4 | 82.8 | 68.7
TSA [31] | CVPR'21 | 57.6 | 75.8 | 80.7 | 64.3 | 76.3 | 75.1 | 66.7 | 55.7 | 81.2 | 75.7 | 61.9 | 83.8 | 71.2
MetaAg. [62] | CVPR'21 | 59.3 | 76.0 | 80.2 | 65.7 | 74.7 | 75.1 | 65.7 | 56.5 | 81.6 | 74.1 | 61.1 | 85.2 | 71.3
FGDA [16] | ICCV'21 | 52.3 | 77.0 | 78.2 | 64.6 | 75.5 | 73.7 | 64.0 | 49.5 | 80.7 | 70.1 | 52.3 | 81.6 | 68.3
SCDA [32] | ICCV'21 | 57.1 | 75.9 | 79.9 | 66.2 | 76.7 | 75.2 | 65.3 | 55.6 | 81.9 | 74.7 | 62.6 | 84.5 | 71.3
CDAN+E [35] | NIPS'18 | 55.6 | 72.5 | 77.9 | 62.1 | 71.2 | 73.4 | 61.2 | 52.6 | 80.6 | 73.1 | 55.5 | 81.4 | 68.1
+Go-labels | This work | 56.0 | 74.4 | 78.2 | 63.9 | 72.7 | 72.0 | 63.7 | 54.1 | 81.7 | 73.3 | 59.6 | 83.0 | 69.4
ASAN [45] | ACCV'20 | 53.6 | 73.0 | 77.0 | 62.1 | 73.9 | 72.6 | 61.6 | 52.8 | 79.8 | 73.3 | 60.2 | 83.6 | 68.6
+Go-labels | This work | 55.5 | 75.1 | 79.3 | 65.0 | 74.1 | 74.3 | 64.8 | 54.4 | 81.8 | 74.7 | 61.8 | 85.2 | 70.5
GVB [11] | CVPR'20 | 57.0 | 74.7 | 79.8 | 64.6 | 74.1 | 74.6 | 65.2 | 55.1 | 81.0 | 74.6 | 59.7 | 84.3 | 70.4
+Go-labels | This work | 57.9 | 76.2 | 81.1 | 65.9 | 75.0 | 73.7 | 67.0 | 56.6 | 82.9 | 75.2 | 61.0 | 84.6 | 71.4
Table 6. Performance (%) comparisons with the state-of-the-art UDA approaches on VisDA-2017. All experiments are based on ResNet-50 pre-trained on ImageNet.
Method | Venue | Avg.
MDD [73] | ICML'19 | 74.61
SAFN [66] | ICCV'19 | 76.10
DANCE [47] | NeurIPS'20 | 70.20
CDAN + E (Baseline) [35] | NIPS'18 | 70.83
CDAN + E + Go-labels | This work | 75.12
ASAN (Baseline) [45] | ACCV'20 | 72.34
ASAN + Go-labels | This work | 75.21
GVB (Baseline) [11] | CVPR'20 | 75.34
GVB + Go-labels | This work | 76.42
RADA (Baseline) [24] | ICCV'21 | 76.30
RADA + Go-labels | This work | 77.52
5 Conclusion
We propose a simple plug-and-play technique dubbed go-getting domain labels (Go-labels) to achieve a dynamic adversarial domain adaptation framework, which effectively alleviates the imbalanced data distribution issue and significantly enhances domain-invariant representation learning. Go-labels requires changing only two lines of code, yet yields non-trivial improvements across a wide variety of adversarial-based UDA architectures. In fact, the improvements of Go-labels come without bells and whistles on all the domain adaptation benchmarks we evaluated, despite being embarrassingly simple.
Acknowledgments. This work was supported in part by NSFC under Grant U1908209, 62021001 and the National Key Research and Development Program of China 2018AAA0101400. This work was also supported in part by the Advanced Research and Technology Innovation Centre (ARTIC), the National University of Singapore under Grant (project number: A-0005947-21-00).
References 1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: ICML, pp. 214–223. PMLR (2017) 2. Arpit, D., et al.: A closer look at memorization in deep networks. In: ICML, pp. 233–242. PMLR (2017) 3. Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., Odena, A.: Discriminator rejection sampling. arXiv preprint arXiv:1810.06758 (2018) 4. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Mach. Learn. 79(1–2), 151–175 (2010) 5. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations for domain adaptation. In: NeurIPS, pp. 137–144 (2007) 6. Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Wortman, J.: Learning bounds for domain adaptation. In: NeurIPS, pp. 129–136 (2008) 7. Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Sch¨ olkopf, B., Smola, A.J.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14), e49–e57 (2006) 8. Chen, L., et al.: Reusing the task-specific classifier as a discriminator: discriminator-free adversarial domain adaptation. In: CVPR, pp. 7181–7190 (2022) 9. Chen, Q., Liu, Y., Wang, Z., Wassell, I., Chetty, K.: Re-weighted adversarial adaptation network for unsupervised domain adaptation. In: CVPR, pp. 7976–7985 (2018) 10. Cui, S., Wang, S., Zhuo, J., Li, L., Huang, Q., Tian, Q.: Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In: CVPR, pp. 3941–3950 (2020) 11. Cui, S., Wang, S., Zhuo, J., Su, C., Huang, Q., Tian, Q.: Gradually vanishing bridge for adversarial domain adaptation. In: CVPR, pp. 12455–12464 (2020) 12. Feldman, V.: Does learning require memorization? a short tale about a long tail. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959 (2020) 13. Feldman, V., Zhang, C.: What neural networks memorize and why: discovering the long tail via influence estimation. arXiv preprint arXiv:2008.03703 (2020)
14. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML, pp. 1180–1189. PMLR (2015) 15. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2030–2096 (2016) 16. Gao, Z., Zhang, S., Huang, K., Wang, Q., Zhong, C.: Gradient distribution alignment certificates better adversarial domain adaptation. In: ICCV, pp. 8937–8946 (2021) 17. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017) 18. Guo, T., et al.: On positive-unlabeled classification in gan. In: CVPR, pp. 8385– 8393 (2020) 19. Haeusser, P., Frerix, T., Mordvintsev, A., Cremers, D.: Associative domain adaptation. In: ICCV, pp. 2765–2773 (2017) 20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016) 21. Hoffman, J., et al.: Cycada: cycle-consistent adversarial domain adaptation. In: ICML, pp. 1989–1998. PMLR (2018) 22. Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced classification. In: CVPR, pp. 5375–5384 (2016) 23. Jiang, X., Lao, Q., Matwin, S., Havaei, M.: Implicit class-conditioned domain alignment for unsupervised domain adaptation. In: ICML, pp. 4816–4827. PMLR (2020) 24. Jin, X., Lan, C., Zeng, W., Chen, Z.: Re-energizing domain discriminator with sample relabeling for adversarial domain adaptation. In: ICCV (2021) 25. Jin, X., Lan, C., Zeng, W., Chen, Z.: Style normalization and restitution for domain generalization and adaptation. IEEE Trans. Multimedia 24, 3636–3651 (2021) 26. Johansson, F.D., Sontag, D., Ranganath, R.: Support and invertibility in domaininvariant representations. In: AISTATS, pp. 527–536. PMLR (2019) 27. Kang, B., et al.: Decoupling representation and classifier for long-tailed recognition. In: ICLR (2019) 28. Kang, G., Jiang, L., Yang, Y., Hauptmann, A.G.: Contrastive adaptation network for unsupervised domain adaptation. In: CVPR, pp. 4893–4902 (2019) 29. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: VLDB, vol. 4, pp. 180–191. Toronto, Canada (2004) 30. Li, B., et al.: Rethinking distributional matching based domain adaptation. arXiv preprint arXiv:2006.13352 (2020) 31. Li, S., Xie, M., Gong, K., Liu, C.H., Wang, Y., Li, W.: Transferable semantic augmentation for domain adaptation. In: CVPR, pp. 11516–11525 (2021) 32. Li, S., et al.: Semantic concentration for domain adaptation. In: ICCV, pp. 9102– 9111 (2021) 33. Li, Y., et al.: Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In: CVPR, pp. 10991–11000 (2020) 34. Liu, J., Sun, Y., Han, C., Dou, Z., Li, W.: Deep representation learning on longtailed data: a learnable embedding augmentation perspective. In: CVPR, pp. 2970– 2979 (2020) 35. Long, M., Cao, Z., Wang, J., Jordan, M.I.: Conditional adversarial domain adaptation. In: NeurIPS, pp. 1640–1650 (2018) 36. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML, pp. 2208–2217 (2017) 37. Lu, Z., Yang, Y., Zhu, X., Liu, C., Song, Y.Z., Xiang, T.: Stochastic classifiers for unsupervised domain adaptation. In: CVPR, pp. 9111–9120 (2020)
324
X. Jin et al.
38. Luo, Y.W., Ren, C.X.: Conditional bures metric for domain adaptation. In: CVPR, pp. 13989–13998 (2021) 39. Nowozin, S., Cseke, B., Tomioka, R.: f-gan: training generative neural samplers using variational divergence minimization. In: NeurIPS (2016) 40. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019) 41. Pei, Z., Cao, Z., Long, M., Wang, J.: Multi-adversarial domain adaptation. In: AAAI, vol. 32 (2018) 42. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV, pp. 1406–1415 (2019) 43. Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., Saenko, K.: Visda: the visual domain adaptation challenge. arXiv preprint arXiv:1710.06924 (2017) 44. Peng, Z., Huang, W., Guo, Z., Zhang, X., Jiao, J., Ye, Q.: Long-tailed distribution adaptation. In: ACMMM, pp. 3275–3282 (2021) 45. Raab, C., Vath, P., Meier, P., Schleif, F.M.: Bridging adversarial and statistical domain transfer via spectral adaptation networks. In: ACCV (2020) 46. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 213–226. Springer, Heidelberg (2010). https://doi.org/10. 1007/978-3-642-15561-1 16 47. Saito, K., Kim, D., Sclaroff, S., Saenko, K.: Universal domain adaptation through self supervision. In: NeurIPS (2020) 48. Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Strong-weak distribution alignment for adaptive object detection. In: CVPR, pp. 6956–6965 (2019) 49. Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR, pp. 3723–3732 (2018) 50. Sankaranarayanan, S., Balaji, Y., Castillo, C.D., Chellappa, R.: Generate to adapt: aligning domains using generative adversarial networks. In: CVPR, pp. 8503–8512 (2018) 51. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 90(2), 227–244 (2000) 52. Sinha, S., Zhao, Z., Alias Parth Goyal, A.G., Raffel, C.A., Odena, A.: Top-k training of gans: improving gan performance by throwing away bad samples. In: NeurIPS, vol. 33, pp. 14638–14649 (2020) 53. Sugiyama, M., Krauledat, M., M¨ uller, K.R.: Covariate shift adaptation by importance weighted cross validation. J. Mach. Learn. Res. 8(5), 1–21 (2007) 54. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: AAAI, vol. 30 (2016) 55. Tan, S., Peng, X., Saenko, K.: Class-imbalanced domain adaptation: an empirical odyssey. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12535, pp. 585–602. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66415-2 38 56. Tang, H., Chen, K., Jia, K.: Unsupervised domain adaptation via structurally regularized deep clustering. In: CVPR, pp. 8725–8735 (2020) 57. Tang, H., Jia, K.: Discriminative adversarial domain adaptation. In: AAAI, vol. 34, pp. 5940–5947 (2020) 58. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR 2011. pp. 1521–1528 59. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 7167–7176 (2017) 60. Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: CVPR (2017)
Go-getting Domain Labels for Unsupervised Domain Adaptation
325
61. Wang, S., Zhang, L.: Self-adaptive re-weighted adversarial domain adaptation. IJCAI (2020) 62. Wei, G., Lan, C., Zeng, W., Chen, Z.: Metaalign: coordinating domain alignment and classification for unsupervised domain adaptation. In: CVPR (2021) 63. Wu, Y., Donahue, J., Balduzzi, D., Simonyan, K., Lillicrap, T.: Logan: latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953 (2019) 64. Wu, Y., Winston, E., Kaushik, D., Lipton, Z.: Domain adaptation with asymmetrically-relaxed distribution alignment. In: ICML, pp. 6872–6881. PMLR (2019) 65. Xu, M., et al.: Adversarial domain adaptation with domain mixup. In: AAAI (2020) 66. Xu, R., Li, G., Yang, J., Lin, L.: Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In: ICCV, pp. 1426– 1435 (2019) 67. Yang, L., Balaji, Y., Lim, S.-N., Shrivastava, A.: Curriculum manager for source selection in multi-source domain adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 608–624. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6 36 68. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: NeurIPS (2014) 69. You, C., Li, C., Robinson, D.P., Vidal, R.: Scalable exemplar-based subspace clustering on class-imbalanced data. In: ECCV, pp. 67–83 (2018) 70. Zhang, J., Ding, Z., Li, W., Ogunbona, P.: Importance weighted adversarial nets for partial domain adaptation. In: CVPR, pp. 8156–8164 (2018) 71. Zhang, X., Fang, Z., Wen, Y., Li, Z., Qiao, Y.: Range loss for deep face recognition with long-tailed training data. In: ICCV, pp. 5409–5418 (2017) 72. Zhang, Y., Tang, H., Jia, K., Tan, M.: Domain-symmetric networks for adversarial domain adaptation. In: CVPR, pp. 5031–5040 (2019) 73. Zhang, Y., Liu, T., Long, M., Jordan, M.I.: Bridging theory and algorithm for domain adaptation. In: ICML (2019) 74. Zhao, H., Des Combes, R.T., Zhang, K., Gordon, G.: On learning invariant representations for domain adaptation. In: ICML, pp. 7523–7532. PMLR (2019) 75. Zhao, H., Zhang, S., Wu, G., Moura, J.M., Costeira, J.P., Gordon, G.J.: Adversarial multiple source domain adaptation. In: NeurIPS, pp. 8559–8570 (2018) 76. Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M.: Bbn: bilateral-branch network with cumulative learning for long-tailed visual recognition. In: CVPR, pp. 9719–9728 (2020)
ModSelect: Automatic Modality Selection for Synthetic-to-Real Domain Generalization

Zdravko Marinov(B), Alina Roitberg, David Schneider, and Rainer Stiefelhagen

Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
{zdravko.marinov,alina.roitberg,david.schneider,rainer.stiefelhagen}@kit.edu

Abstract. Modality selection is an important step when designing multimodal systems, especially in the case of cross-domain activity recognition, as certain modalities are more robust to domain shift than others. However, selecting only the modalities which have a positive contribution requires a systematic approach. We tackle this problem by proposing an unsupervised modality selection method (ModSelect), which does not require any ground-truth labels. We determine the correlation between the predictions of multiple unimodal classifiers and the domain discrepancy between their embeddings. Then, we systematically compute modality selection thresholds, which select only modalities with a high correlation and low domain discrepancy. We show in our experiments that our method ModSelect chooses only modalities with positive contributions and consistently improves the performance on a Synthetic→Real domain adaptation benchmark, narrowing the domain gap.

Keywords: Modality selection · Domain generalization · Action recognition · Robust vision · Synthetic-to-real · Cross-domain
1 Introduction
Human activity analysis is vital for intuitive human-machine interaction, with applications ranging from driver assistance [55] to smart homes and assistive robotics [71]. Domain shifts, such as appearance changes, constitute a significant bottleneck for deploying such models in real-life. For example, while simulations are an excellent way of economical data collection, a Synthetic→Real domain shift leads to > 60% drop in accuracy when recognizing daily living activities [70]. Multimodality is a way of mitigating this effect, since different types of data, such as RGB videos, optical flow and body poses, exhibit individual strengths and weaknesses. For example, models operating on body poses are
less affected by appearance changes, as the relations between different joints are more stable given a good skeleton detector [22,23]. RGB videos, in contrast, are more sensitive to domain shifts [57,72] but also convenient since they cover the complete scene and video is the most ubiquitous modality [14,26,73,75]. Given the complementary nature of different data types (see Fig. 2), we believe, that multimodality has a strong potential for improving domain generalization of activity recognition models, but which modalities to select and how to fuse the information become important questions. Despite its high relevance for applications, the question of modality selection has been often overlooked in this field. The main goal of our work is to develop a systematic framework for studying the contribution of individual modalities in cross-domain human activity recognition. We specifically focus on the Synthetic→Real distributional shift [70], which opens new doors for economical data acquisition but comes with an especially large domain gap. We study five different modalities and examine how the prediction outcomes of multiple unimodal classifiers correlate as well as the domain discrepancy between their embeddings. We hope that our study will provide guidance for a better modality selection process in the future. Contributions and Summary. We aim to make a step towards effective use of multimodality in the context of cross-domain activity recognition, which has been studied mostly for RGB videos in the past [14,26,73,75]. This work develops the modality selection framework ModSelect for quantifying the importance of individual data streams and can be summarized in two major contributions. (1) We propose a metric for quantifying the contribution of each modality for the specific task by calculating how the performance changes when the modality is included in the late fusion. Our new metric can be used by future research to justify decisions in modality selection. However, to estimate these performance changes, we use the ground-truth labels from the test data. (2) To detach ourselves from supervised labels, we propose to study the domain discrepancy between the embeddings and the correlation between the predictions of the unimodal classifiers of each modality. We use the discrepancy and the correlation to compute modality selection thresholds and show that these thresholds can be used to select only modalities with positive contributions w.r.t. our proposed metric in (1). Our unsupervised modality selection ModSelect can be applied in settings where no labels are present, e.g., in a multi-sensor setup deployed in unseen environments, where ModSelect would identify which sensors to trust.
2 Related Work
2.1 Multimodal Action Recognition
The usage of multimodal data represents a common technique in the field of action recognition, and is applied for both: increasing performance in supervised learning as well as unsupervised representation learning. Multimodal methods for action recognition include approaches which make use of video and audio
[3,4,50,62,64], optical flow [39,64], text [52] or pose information [22,23,65]. Such methods can be divided into lower level early/feature fusion which is based on merging latent space information from multiple modality streams [1,2,34,43, 56,63,84,94] and late/score fusion which combines the predictions of individual classifiers or representation encoders either with learned fusion modules [1,47, 61,76,77,93] or with rule-based algorithms. For this work, we focus on the latter, since the variety of early fusion techniques and learned late fusion impedes a systematic comparison, while rulebased late fusion builds on few basic but successful techniques such as averaging single-modal scores [5,7,12,13,24,27,28,42,89], the max rule [5,27,45,67], product rule [25,27,42,45,48,67,78,81,91] or median rule. Ranking based solutions [20,30,31,66] like Borda count are less commonly used for action recognition but recognised in other fields of computer vision. 2.2
Modality Contribution Quantification
While the performance contribution of modalities has been analyzed in multiple previous works, e.g., by measuring the signal-to-noise ratio between modalities [83], determining class-wise modality contribution by learning an optimal linear modality combination [6,46] or extracting modality relations with threshold-based rules [80], in-depth analysis of modality contributions in the field of action recognition remains sparse and mostly limited to small ablation studies. Metrics to measure data distribution distances like Maximum Mean Discrepancy (MMD) or Mean Pairwise Distance (MPD) have been applied in fields like domain adaptation. MMD is commonly used to estimate and to reduce domain shift [35,54,60,79] and can be adapted to be robust against class bias, e.g. in the form of weighted MMD [87], while Mean Pairwise Distance (MPD) was applied to analyze semantic similarities of word embeddings, e.g., in [29]. In this work, we introduce a systematic approach for analyzing modality contributions in the context of cross-domain activity recognition, which, to the best of our knowledge, has not been addressed in the past. 2.3
Domain Generalization and Adaptation
Both domain generalization and domain adaptation present strategies to learn knowledge from a source domain which is transferable to a given target domain. While domain adaptation allows access to data from the target domain to fulfill this task, either paired with labels [16,59,69] or in the form of unsupervised domain adaptation [10,15,16,18,19,59,69,74], domain generalization assumes an unknown target domain and builds upon methods which condition a neural network to make use of features which are found to be more generalizable [88], apply heavy augmentations to increase robustness [90] or explore different methods of leveraging temporal data [90].
3 Approach
Our approach consists of three main steps. (1) We extract multiple modalities and train a unimodal action recognition classifier on each modality. Afterwards, we evaluate all possible combinations of the modalities with different late fusion methods. We define the action recognition task in Sect. 3.1, the datasets we use in Sect. 3.2, and the modality extraction and training in Sect. 3.3. (2) In Sect. 3.4, we determine which modalities lead to a performance gain based on our evaluation results from (1). This establishes a baseline for (3) the third step (Sect. 3.5), where we show how to systematically select these beneficial modalities in an unsupervised way with our framework ModSelect, without the need for labels or evaluation results. We offer an optional notation table in the Supplementary for a better understanding of all of our equations.
Fig. 1. ModSelect: our approach for unsupervised modality selection which uses predictions correlations and domain discrepancy.
We intentionally do not make use of learned late fusion techniques, such as [1,47,76,77,93], since such methods do not allow for comparing the contribution of individual modalities. Instead, a specific learned late fusion architecture could be better suited to some modalities in contrast to others, overshadowing a neutral evaluation. However, our work can be used to select modalities upon which such learned late fusion techniques can be designed. 3.1
Action Recognition Task
Our goal is to produce a systematic method for unsupervised modality selection in multimodal action recognition. More specifically, we focus on Synthetic→Real domain generalization to show the need for a modality selection approach when a large domain gap is present. In this scenario an action classifier is trained only on samples from a Synthetic source domain xs ∈ Xs with action labels y ∈ Y . In domain generalization, the goal is to generalize to an unseen target domain Xt , without using any samples xt ∈ Xt from it during training. In our case, the target domain Xt consists of Real data and the source
and target data originate from distinct probability distributions xs ∼ psynthetic and xt ∼ preal . The goal is to classify each instance xt from the Real target test domain Xt , which has a shared action label set Y with the training set. To achieve this, we use the synthetic Sims4Action dataset [70] for training and the real Toyota Smarthome (Toyota) [21] and ETRI-Activity3D-LivingLab (ETRI) [44] as two separate target test sets. We also evaluate our models on the Sims4Action official test split [70] in our additional Synthetic→Synthetic experiments. 3.2
Datasets
We focus on Synthetic→Real domain generalization between the synthetic Sims4Action [70] as a training dataset and the real Toyota Smarthome [21] and ETRI [44] as test datasets. Sims4Action consists of ten hours of video material recorded from the computer game Sims 4, covering 10 activities of daily living which have direct correspondences in the two Real datasets. Toyota Smarthome [21] contains videos of 18 subjects performing 31 different everyday actions within a single apartment, and ETRI [44] consists of 50 subjects performing 55 actions recorded from perspectives of home service robots in various residential spaces. However, we use only the 10 action correspondences to Sims4Action from the Real datasets for our evaluation. 3.3
Modality Extraction and Training
We leverage the multimodal nature of actions to extract additional modalities for our training data, such as body pose, movement dynamics, and object detections. To this end, we utilize the RGB videos from the synthetic Sims4Action [70] to produce four new modalities - heatmaps, limbs, optical flow, and object detections. An overview of all modalities can be seen in Fig. 2.
Heatmaps and Limbs. The heatmaps and limbs (H and L) are extracted via AlphaPose [32,51,86], which infers 17 joint locations of the human body. The heatmaps modality h(x, y) at pixel (x, y) is obtained by stacking 2D Gaussian maps, which are centered at each joint location (x_i, y_i), and each map is weighted by its detection confidence c_i as shown in Eq. 1, where σ = 6.

h(x, y) := exp(−((x − x_i)^2 + (y − y_i)^2) / (2σ^2)) · c_i    (1)

The limbs modality is produced by connecting the joints with white lines and weighting each line by the smaller confidence of its endpoints. We weight both modalities by the detection confidences c_i so that uncertain and occluded body parts are dimmer and have a smaller contribution.
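The heatmap rendering of Eq. 1 can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' code; the function name, the stacking of one map per joint, and the random example inputs are assumptions.

```python
import numpy as np

def joint_heatmaps(joints, confidences, height, width, sigma=6.0):
    """Render one Gaussian map per detected joint, weighted by its
    detection confidence (Eq. 1); the maps can be stacked as channels."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    maps = []
    for (xi, yi), ci in zip(joints, confidences):
        g = np.exp(-((xs - xi) ** 2 + (ys - yi) ** 2) / (2.0 * sigma ** 2)) * ci
        maps.append(g)
    return np.stack(maps, axis=0)  # shape: (num_joints, H, W)

# Example with 17 AlphaPose-style joints at random positions/confidences
rng = np.random.default_rng(0)
joints = rng.uniform([0, 0], [512, 512], size=(17, 2))
conf = rng.uniform(0.2, 1.0, size=17)
heatmaps = joint_heatmaps(joints, conf, 512, 512)
```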
Optical Flow. The optical flow modality (OF) is estimated via the Gunnar-Farneback method [33]. The optical flow value at pixel (x, y) encodes the magnitude and angle of the pixel intensity changes between two frames in the value and hue components of the HSV color space. The saturation is used to adjust the visibility and we set it to its maximum value. The heatmaps, limbs, and optical flow are all image-based and are used as an input to models which usually utilize RGB images.
Object Detections. Our last modality (YOLO) consists of object detections obtained by YOLOv3 [68], which detects 80 different objects. Unlike the other modalities, we represent the detections as a vector, instead of an image. We show that such a simple representation achieves good domain generalization in our experiments. The YOLO modality for an image sample consists of a k-dimensional vector v, where v[i] corresponds to the reciprocal Euclidean distance between the person's and the ith object's bounding box centers, and k = 80 is the number of detection classes. This way, objects closer to the person have a larger weight in v than ones which are further away. After computing the distances, v is normalized by its norm: v ← v/||v||. We denote the set of all modalities as M := {H, L, OF, RGB, YOLO} and use the term M in our equations.
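The construction of the object-detection vector can be illustrated with the minimal sketch below. It assumes axis-aligned boxes given as (x1, y1, x2, y2) and, since the paper does not state it, arbitrarily keeps the closest detection per class; all names are illustrative.

```python
import numpy as np

def object_distance_vector(person_box, detections, num_classes=80):
    """Build the k-dimensional YOLO modality vector: v[c] holds the reciprocal
    distance between the person's box centre and a detection of class c,
    then v is L2-normalised (Sect. 3.3)."""
    def centre(box):  # box = (x1, y1, x2, y2)
        return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

    v = np.zeros(num_classes, dtype=np.float32)
    p = centre(person_box)
    for cls, box in detections:                      # detections: list of (class_id, box)
        d = np.linalg.norm(p - centre(box)) + 1e-6   # avoid division by zero
        v[cls] = max(v[cls], 1.0 / d)                # assumption: keep closest per class
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

v = object_distance_vector((100, 50, 180, 300),
                           [(56, (90, 260, 150, 330)), (41, (400, 200, 430, 240))])
```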
Fig. 2. Examples of all extracted modalities. Note: the YOLO modality is represented as a vector v, which encodes distances to the person’s detection (see Sect. 3.3).
Training. We train unimodal classifiers on each modality and evaluate all possible modality combinations with different late fusion methods. We utilize 3DCNN models with the S3D backbone [85] for each one of the RGB, H, L, and OF modalities. The YOLO modality utilizes an MLP model as it is not image-based. We train all 5 action recognition models end-to-end on Sims4Action [70].
Evaluation. For our late fusion experiments, we combine the predictions of all unimodal classifiers at the class score level and obtain results for all Σ_{i=1}^{5} (5 choose i) = 31 modality combinations. We investigate 6 late fusion strategies - Sum, Squared Sum, Product, Maximum, Median, and Borda Count [8,40], which all operate on the class probability scores. Borda Count also uses the ranking of the class scores. For brevity, we refer to a late fusion of unimodal classifiers as a multimodal classifier and present our late fusion results in Sect. 4.1.
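The rule-based fusion strategies named above can be sketched as follows. This is a generic illustration under the assumption that each classifier outputs a class-probability vector; the Borda variant shown here simply sums score ranks, which may differ in detail from the cited implementation.

```python
import numpy as np
from itertools import combinations

def fuse(scores, rule="sum"):
    """Combine per-modality class-probability vectors (each of shape (C,))
    with a rule-based late-fusion strategy; argmax of the result is the prediction."""
    s = np.stack(scores)                      # (num_modalities, C)
    if rule == "sum":
        return s.sum(0)
    if rule == "squared_sum":
        return (s ** 2).sum(0)
    if rule == "product":
        return s.prod(0)
    if rule == "max":
        return s.max(0)
    if rule == "median":
        return np.median(s, axis=0)
    if rule == "borda":                       # rank-based Borda count
        ranks = s.argsort(1).argsort(1)       # higher score -> higher rank
        return ranks.sum(0).astype(float)
    raise ValueError(rule)

# All 31 non-empty subsets of the 5 modalities
modalities = ["H", "L", "OF", "RGB", "YOLO"]
subsets = [c for r in range(1, 6) for c in combinations(modalities, r)]
assert len(subsets) == 31
```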
3.4 Quantification Study: Modality Contributions
In this section, we propose how to quantify the contributions of each modality based on the performance of the models on the target test sets. To this end, we propose a with-without metric, which computes the average difference f(m) of the performance of a multimodal classifier with a modality m to the performance without it. Formally the contribution f(m) of a modality m is defined as:

f(m) := E_{C∈C}[acc(C ∪ {m}) − acc(C)]    (2)
where C is the set of all modality combinations, acc(C) is the test accuracy of the multimodal classifier with the modality combination C, and m ∈ M. We compute f(m) for all modalities based on the late fusion results listed in Sect. 4.1. The contribution of each modality can be used to determine the modalities which positively influence the performance on the test dataset, M+ ⊆ M.
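A compact way to compute the with-without metric is sketched below. It assumes a lookup table of fused test accuracies is already available and that the expectation in Eq. 2 runs over the non-empty combinations that do not contain m; both of these are assumptions, and this requires ground-truth labels as the text states.

```python
from itertools import combinations

def contribution(m, acc, modalities):
    """Eq. (2): average accuracy change when modality m is added to every
    combination that does not contain it. `acc` maps a frozenset of modalities
    to the test accuracy of the corresponding late-fusion classifier."""
    others = [x for x in modalities if x != m]
    diffs = []
    for r in range(1, len(others) + 1):       # assumption: non-empty combinations only
        for c in combinations(others, r):
            base = frozenset(c)
            diffs.append(acc[frozenset(base | {m})] - acc[base])
    return sum(diffs) / len(diffs)
```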
3.5 ModSelect: Unsupervised Modality Selection
In this section, we introduce our method ModSelect for unsupervised modality selection. In this setting, we assume that we do not have any labels in the target test domain Xt. This is exactly the case for Synthetic→Real domain generalization, where a model trained on simulated data is deployed in real-world conditions. In this case, the contribution of each modality cannot be estimated with Eq. 2 as acc(C) cannot be computed without ground-truth labels. Note that we do have labels in our test sets but we only use them in our quantification study and ignore them for our unsupervised experiments.
We propose ModSelect - a method for unsupervised modality selection based on the consensus of two metrics: (1) the correlation between the unimodal classifiers' predictions ρ and (2) the Maximum Mean Discrepancy (MMD) [36] between the classifiers' embeddings. We compute both metrics with our unimodal classifiers and propose how to systematically estimate modality selection thresholds. We show that the thresholds select the same modalities with positive contributions M+ as our quantification study in Sect. 3.4.
Correlation Metric. We define the predictions correlation vector ρ_mn between modalities m and n as:

ρ_mn := E[(z_m − μ_m) ⊙ (z_n − μ_n)] / (σ_m ⊙ σ_n)    (3)

where z_m, z_n are the softmax class scores of the action classifiers trained on modalities m and n respectively, (μ_m, σ_m), (μ_n, σ_n) are the mean and standard deviation vectors of z_m, z_n, and ⊙ is the element-wise multiplication operator. We define the predictions correlation ρ(m, n) between two modalities m, n as:

ρ(m, n) := (1/N) Σ_{i=1}^{N} ρ_mn[i]    (4)

where N = 10 is the number of action classes.
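Eqs. 3 and 4 amount to a class-wise Pearson correlation between two classifiers' softmax outputs, averaged over classes. A minimal sketch, assuming the scores are stacked over the same test samples (names are illustrative):

```python
import numpy as np

def prediction_correlation(z_m, z_n, eps=1e-8):
    """Class-wise Pearson correlation between the softmax scores of two
    unimodal classifiers (shape (num_samples, N)), averaged over the N classes."""
    mu_m, mu_n = z_m.mean(0), z_n.mean(0)
    sd_m, sd_n = z_m.std(0) + eps, z_n.std(0) + eps
    rho_vec = ((z_m - mu_m) * (z_n - mu_n)).mean(0) / (sd_m * sd_n)  # one value per class
    return rho_vec.mean()
```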
MMD Metric. We show that the distance between the distributions of the embeddings of two unimodal classifiers can also be used to compute a modality selection threshold. The MMD metric [36] between two distributions P and Q over a set X is formally defined as:

MMD(P, Q) := ||E_{X∼P}[ϕ(X)] − E_{Y∼Q}[ϕ(Y)]||_H    (5)

where ϕ : X → H is a feature map, and H is a reproducing kernel Hilbert space (RKHS) [9,36,37]. For our empirical calculation of MMD between the embeddings h_m, h_n of two modalities m, n we set X = H = R^d and ϕ(x) = x:

MMD(m, n) := ||E[h_m] − E[h_n]||    (6)
where h_m, h_n are the embeddings from the second-to-last linear layer of the action classifiers for modalities m and n respectively, and d is the embedding size. Note that using a linear feature map ϕ(x) = x lets us determine only the discrepancy between the distributions' means. A linear mapping is sufficient to produce a good modality selection threshold, but one could also consider more complex alternatives, such as ϕ(x) = (x, x^2) or a Gaussian kernel [38].
We make the following observations regarding both metrics for modality selection. Firstly, a high correlation between correct predictions is statistically more likely than a high correlation between wrong predictions, since there is only 1 correct class and N − 1 possibilities for error. We believe that a stronger correlation between the predictions results in a higher performance. Secondly, unimodal classifiers should have a high agreement on easy samples and a disagreement on difficult cases [40]. A higher domain discrepancy between the classifiers' embeddings has been shown to indicate a lower agreement on their predictions [53,92], and hence, a decline in performance when fused. We therefore believe that good modalities are characterized by a low discrepancy and high correlation.
Modality Selection Thresholds. After computing ρ(m, n) and MMD(m, n) for all pairs (m, n) ∈ M^2, we systematically calculate modality selection thresholds for each metric ρ and MMD. We consider two types of thresholds: (1) an aggregated threshold δ_agg, which selects a set of individual modalities M_agg ⊆ M, and (2) a pairs-threshold δ_pair, which selects a set of modality pairs C_pair ⊆ M^2.
Aggregated Threshold δ_agg. For the first threshold, we aggregate the ρ and MMD values for a modality m by averaging over all of its pairs:

ρ(m) := (1/|M|) Σ_{n∈M} ρ(m, n),    MMD(m) := (1/|M|) Σ_{n∈M} MMD(m, n)    (7)
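The empirical MMD of Eq. 6 and the aggregation of Eq. 7 reduce to a few numpy operations. The sketch below is illustrative only; whether the trivial pair (m, m) enters the average of Eq. 7 is an assumption, as is the symmetric pairwise dictionary.

```python
import numpy as np

def linear_mmd(h_m, h_n):
    """Eq. (6) with the identity feature map: distance between the mean
    embeddings of two unimodal classifiers (each of shape (num_samples, d))."""
    return np.linalg.norm(h_m.mean(0) - h_n.mean(0))

def aggregate(pairwise, modalities):
    """Eq. (7): average a pairwise metric, stored in a dict keyed by (m, n),
    over all available partners of each modality."""
    return {m: float(np.mean([pairwise[(m, n)] for n in modalities if (m, n) in pairwise]))
            for m in modalities}
```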
Thus, we produce the sets Aρ := {ρ(m)|m ∈ M} and AM M D := {M M D(m)|m ∈ M}. A simple approach would be to use the mean or median as a threshold for Aρ and AM M D . However, such thresholds are sensitive to outliers (mean) or do not use all the information from the values (median). Additionally, one cannot tune the threshold with prior knowledge. To mitigate these issues,
we propose to use the Winsorized Mean [41,82] μ_λ(A) for both sets, which is defined as:

μ_λ(A) := λ a_λ + (1 − 2λ) ā_λ + λ a_{1−λ}    (8)

where a_λ is the λ-percentile of A, ā_λ is the λ-trimmed mean of A, and λ ∈ [0, 0.5] is a "trust" hyperparameter. A higher λ results in a lower contribution of edge values in A and a bigger trust in values near the center. Therefore, we set λ = 0.2 as we have 5 modalities and expect to trust at least 3. We compute two separate thresholds δ^ρ_agg := μ_0.2(A_ρ) and δ^MMD_agg := μ_0.2(A_MMD) and select the modalities M_agg as a consensus between the two metrics as:

M_agg := {m ∈ M | ρ(m) ≥ δ^ρ_agg ∨ MMD(m) ≤ δ^MMD_agg}    (9)
(10)
Summary of our Approach. A summary of our unsupervised modality selection method ModSelect is illustrated in Fig. 1. We use the embeddings of unimodal action classifiers to compute the Maximum Mean Discrepancy (MMD) between all pairs of modalities. We also compute the correlation ρ between the predictions of all pairs of classifiers. We systematically estimate thresholds for MMD and ρ which discard certain modalities and select only modalities on which both metrics have a consensus. In the following Experiments 4 we show that the selected modalities m ∈ M with our method ModSelect are exactly the modalities with a positive contribution f (m) > 0 according to Eq. 2, although our unsupervised selection does not utilize any ground-truth labels.
4 4.1
Experiments Late Fusion: Results
We evaluate our late fusion multimodal classifiers following the cross-subject protocol from [21] for Toyota, the inter-dataset protocol from [49] for ETRI, and the official test split for Sims4Action from [70]. We follow the original Sims4Action→Toyota evaluation protocol of [70] and utilize the mean-perclass accuracy (mPCA) as the number of samples per class are imbalanced in the Real test sets. The mPCA metric avoids bias towards overrepresented classes and is often used in unbalanced activity recognition datasets [11,21,55,70].
ModSelect: Automatic Modality Selection
335
The results from our evaluation are displayed in Table 1. The domain gap of transferring to Real data is apparent in the drastically lower performance, especially on the ETRI dataset. Combinations including the H, L, or RGB modalities exhibit the best performance on the Sims4Action dataset, whereas OF and YOLO are weaker. However, combinations including the RGB modality seem to have an overall lower performance on the Real datasets, perhaps due to the large appearance change. Combinations with the YOLO modality show the best performance for both Real test sets. Inspecting the results in Table 1 is tedious and prone to misinterpretation or confirmation bias [58]. It is also possible to overlook important tendencies. Hence, we show in Sect. 4.2 how our quantification study tackles these problems by systematically disentangling the modalities with a positive contribution M+ from the rest M− . 4.2
Quantification Study: Results
We use the results from Table 1 for the acc(·) term in Eq. 2 and compute the contribution of each modality f (m). We do this for all late fusion strategies and all three test datasets and plot the results in Fig. 3. The Synthetic→Real domain gap is clearly seen in the substantial difference in the height of the bars in the test split of Sims4Action [70] compared to the Real test datasets. The limbs and RGB modalities have the largest contribution on Sims4Action [70], followed by the heatmaps. The only modalities with negative contributions are the optical flow and YOLO, where YOLO reaches a drastic drop of over (−10%) for the Squared Sum and Maximum late fusion methods. We conclude that YOLO and optical flow have a negative contribution on Sims4Action [70]. The results on the Real test datasets Toyota [21] and ETRI [44] show different tendencies. The contributions are smaller due to the domain shift, especially Table 1. Results for the action classifiers trained on Sims4Action [70] in the mPCA metric. The late fusion results are averaged over the 6 fusion strategies discussed in Sect. 3.3. H: Heatmaps, L: Limbs, OF: Optical Flow. mPCA [%] Synthetic Test Set
Real
Synthetic
Sims4Action [70] Toyota [21] ETRI [44] Test Set
Modalities
Real
Sims4Action [70] Toyota [21] ETRI [44]
Modalities
Modalities          Sims4Action [70]   Toyota [21]   ETRI [44]
H                   71.38              20.23         12.12
L                   75.09              22.00         12.88
OF                  44.50              21.34         9.28
RGB                 61.79              13.74         9.37
YOLO                50.54              26.08         38.25
H L                 91.48              23.44         16.95
H OF                89.86              25.97         15.10
H RGB               94.44              21.38         13.91
L OF                90.84              25.60         15.90
L RGB               94.42              20.58         15.46
OF RGB              86.15              18.50         12.85
H YOLO              74.86              25.51         20.55
L YOLO              80.25              24.77         19.91
OF YOLO             62.21              27.56         20.34
RGB YOLO            76.23              17.29         19.41
H L OF              94.03              26.70         17.26
H L RGB             95.72              23.25         15.78
H OF RGB            95.41              23.73         13.79
L OF RGB            95.61              24.09         15.49
H L YOLO            86.57              25.80         20.46
H OF YOLO           82.97              29.48         19.91
H RGB YOLO          88.79              23.27         17.77
L OF YOLO           86.31              28.51         20.68
L RGB YOLO          90.59              23.29         19.09
OF RGB YOLO         82.37              21.15         16.92
H L OF RGB          96.95              26.00         16.22
H L OF YOLO         90.05              28.98         20.27
H L RGB YOLO        92.88              25.36         18.55
H OF RGB YOLO       91.14              26.10         17.20
L OF RGB YOLO       92.56              26.12         18.56
H L OF RGB YOLO     94.27              27.52         18.53
on the ETRI dataset. The RGB modality has explicitly negative contributions on both Real datasets. We hypothesize that this is due to the appearance changes when transitioning from synthetic to real data. Apart from RGB, optical flow also has a consistently negative contribution on the ETRI dataset. An interesting observation is that the domain gap is much larger on ETRI than on Toyota. A reason for this might be that Roitberg et al. [70] design Sims4Action specifically as a Synthetic→Real domain adaptation benchmark to Toyota Smarthome [21], e.g., in Sims4Action the rooms are furnished the same way as in Toyota Smarthome. Our results indicate that optical flow has a negative contribution in ETRI, whereas RGB is negative in both Real datasets. The average contribution of each modality over the 6 fusion methods is in Table 2(a). 4.3
Results from ModSelect: Unsupervised Modality Selection
Table 2(a) shows which modalities have a negative contribution on each target test dataset. However, to estimate these values we needed the performance, and hence, the labels for the target test sets. In this section, we show how to select only the modalities with a positive contribution without using any labels.
Fig. 3. Quantification study: quantification of the contribution of each modality for 6 late fusion methods and on three different test sets. The height of each bar corresponds to the contribution value f (m) which is computed with Eq. 2.
Fig. 4. Chord plots of the prediction correlations ρ(m, n) for all modality pairs (m, n). The thickness of each arch corresponds to the correlation ρ(m, n) between its two endpoint modalities m and n. Each value is computed according to Eq. 4.
Predictions Correlation. We utilize the predictions correlation metric ρ(m, n) and compute it for all modality pairs using Eq. 4. The results for all datasets are illustrated in the chord diagrams in Fig. 4. The chord diagrams allow us to identify the same tendencies, which we observed in our quantification study. Each arch connects two modalities (m, n) and its thickness corresponds to the value ρ(m, n). The YOLO modality has the weakest correlations on Sims4Action [70], depicted in the thinner green arches. Optical flow also exhibits weaker correlations compared to the heatmaps, limbs, and RGB. We see significantly thinner arches for the RGB modality in both Real test datasets, and for optical flow in ETRI, which matches our results in Table 2(a). However, simply inspecting the chord plots is not a systematic method for modality selection. Hence, we first compute the aggregated correlations ρ(m) with Eq. 7, which constitute the set A_ρ. Then, we compute the aggregated threshold δ^ρ_agg := μ_0.2(A_ρ) using the 0.2-Winsorized Mean [82] from Eq. 8. We compute these terms for each test dataset and obtain three thresholds. The aggregated correlations ρ(m) and their thresholds δ^ρ_agg for each test dataset are illustrated in Table 2(b). The modalities underneath the thresholds are exactly the ones with negative contributions from Table 2(a).
Table 2. (a) Average contribution f(m) over the 6 late fusion methods of each modality m. Negative contributions are colored in red. (b) Aggregated prediction correlation ρ(m) values for each modality m on the three test datasets and the aggregated thresholds δ^ρ_agg. Values below the threshold are colored in red. (c) Aggregated MMD(m) values for each modality m on the three test datasets and the aggregated thresholds δ^MMD_agg. Values above the threshold are colored in red.
(a) Contribution f(m)
Test Dataset        H       L       OF      RGB     YOLO
Sims4Action [70]    4.37    8.86    -0.74   6.98    -2.57
Toyota [21]         2.14    2.46    2.90    -1.86   2.13
ETRI [44]           0.76    1.60    -0.13   -1.17   2.02

(b) Aggregated ρ(m)
Test Dataset        H       L       OF      RGB     YOLO    δ^ρ_agg
Sims4Action [70]    0.57    0.55    0.38    0.50    0.37    0.40
Toyota [21]         0.23    0.21    0.14    0.08    0.14    0.10
ETRI [44]           0.14    0.14    0.06    0.05    0.13    0.08

(c) Aggregated MMD(m)
Test Dataset        H       L       OF      RGB     δ^MMD_agg
Sims4Action [70]    9.49    8.12    13.07   9.92    10.15
Toyota [21]         11.93   11.47   13.34   20.79   14.38
ETRI [44]           17.84   17.76   22.04   24.91   20.64
We also show that applying the pairs-threshold δ^ρ_pair leads to the same results. We skip the aggregation step of δ^ρ_agg and compose the A_ρ set out of the ρ(m, n) values, i.e. we focus on modality pairs instead of individual modalities. We compute the threshold δ^ρ_pair := μ_0.2(A_ρ) again with the 0.2-Winsorized Mean [82] from Eq. 8. The correlation values ρ(m, n) as well as the accuracies of all bi-modal action classifiers from Table 1 are shown in Fig. 5. The pairs-thresholds δ^ρ_pair for each test set are drawn as dashed lines and divide the modality pairs into two groups. The modality pairs below the thresholds are marked with an × so that it is possible to identify which pairs are selected. For Sims4Action [70], optical flow (OF) and RGB show a negative f(m) in Table 2(a) and are also discarded by δ^ρ_agg. Figure 5 shows that all modality pairs containing either OF or RGB are below the threshold, i.e. δ^ρ_pair discards the same modalities as δ^ρ_agg for Sims4Action. The same is true for the Real test datasets, where all models containing RGB are discarded in Toyota and ETRI,
as well as OF for ETRI. Another observation is that the majority of "peaks" in the yellow accuracy lines coincide with the peaks in the blue correlation lines. This result is in agreement with our theory that a high correlation between correct predictions is statistically more likely than a high correlation between wrong predictions, since there is only one correct class and multiple incorrect ones. Moreover, while we do achieve the same results with both thresholds, we recommend using δ^ρ_agg when discarding an entire input modality, e.g. a faulty sensor in a multi-sensor setup, and δ^ρ_pair when searching for the best synergies from all modality combinations.
Domain Discrepancy. The second metric we use to discern the contributing modalities from the rest is the Maximum Mean Discrepancy (MMD) [36] between the embeddings of the action classifiers. Note that the YOLO modality is not included in this experiment as its MLP model's embedding size is different from that of the other 4 image-based modalities. While MMD is widely used as a loss term for minimizing the domain gap between source and target domains [17,79,87], a large MMD is also associated with a decline in performance in fusion methods [53,92]. To utilize the MMD metric to separate the modalities, we first compute MMD(m, n) for all modality pairs and the aggregated MMD(m) with Eqs. 6 and 7. We then compute the pairs-thresholds δ^MMD_pair and aggregated thresholds δ^MMD_agg with the 0.2-Winsorized Mean [82] from Eq. 8 the same way as we did for the predictions' correlations ρ. The MMD(m, n) values for all modality pairs in the three test datasets are illustrated in Fig. 6. The higher discrepancy values are clearly apparent by their bright colors and contrast to the rest of the values. Optical flow has the largest values on Sims4Action, and RGB has the highest discrepancy on Toyota and ETRI. Optical flow also exhibits a high discrepancy on ETRI. Once again, we can see that the domain gap to the ETRI dataset is much larger, which is manifested in drastically higher MMD(m, n) values. The pairs-thresholds δ^MMD_pair for the three datasets are {13.68, 19.18, 27.50} and separate exactly the same modality
Fig. 5. Prediction correlations ρ(m, n) between all modality pairs and the late fusion accuracy of the bi-modal action classifiers. The pairs-thresholds δ^ρ_pair are depicted as dashed lines. Pairs under the thresholds are crossed out with an × in the yellow line. (Color figure online)
pairs as our quantification study and the δ^ρ_pair thresholds, with the exception of the (H,OF) pair in ETRI. The aggregated discrepancies MMD(m) for each modality and the aggregated thresholds are listed in Table 2(c), where the values above the thresholds are colored in red. The red values coincide exactly with the negative contributions f(m) from our quantification study in Table 2(a).
ModSelect: Unsupervised Modality Selection. Finally, we select the modalities with either the aggregated δagg or the pairs-thresholds δpair , by constructing the consensus between our two metrics ρ and M M D (see Eqs. 9 and 10). The selected modalities Magg with our aggregated thresholds δagg and selected modality pairs Cpair with our pairs-thresholds δpair are listed in Table 3. The selected modalities Magg from our aggregated thresholds δagg are exactly the ones with a positive contribution M+ in Table 2(a) from our quantification study in Sect. 4.2. The pairs-thresholds δpair have selected only modality pairs Cpair which are constituted out of modalities from M+ , which means that Cpair contains only pairs of modalities (m, n) with positive contributions, i.e. f (m), f (n) > 0. In other words, our proposed unsupervised modality selection is able to select only the modalities with positive contributions by utilizing the predictions correlation and MMD between the embeddings of the unimodal action classifiers, without the need of any ground-truth labels on the test datasets. Impact on the Multimodal Accuracy. The impact of the unsupervised modality selection on the mean multimodal accuracy can be seen in Table 3. Selecting the modalities with our proposed thresholds leads to an average improvement of 5.2%, 3.6%, and 4.3% for Sims4Action [70], Toyota [21], and ETRI [44] respectively. This is a substantial improvement, given the low accuracies on the Real test datasets due to the synthetic-to-real domain gap. These results confirm that our modality selection approach is able to discern between good and bad sources of information, even in the case of a large distributional shift.
Fig. 6. Maximum Mean Discrepancy values M M D(m, n) computed with Eq. 6 for all modality pairs (m, n). Warmer colors correspond to a higher discrepancy.
Table 3. Results from ModSelect: Selected modalities M_agg with δ_agg and selected modality pairs C_pair with δ_pair computed with Eqs. 9 and 10. The last two columns report the average multimodal accuracy.

Test dataset       M_agg               C_pair                              All modalities M   Ours: M_agg
Sims4Action [70]   {H, L, RGB}         {(m, n) ∈ M^2 | m ∈ M+ ∧ n ∈ M+}    85.7%              90.9% (+5.2%)
Toyota [21]        {H, L, OF, YOLO}    {(m, n) ∈ M^2 | m ∈ M+ ∧ n ∈ M+}    22.9%              26.5% (+3.6%)
ETRI [44]          {H, L, YOLO}        {(m, n) ∈ M^2 | m ∈ M+ ∧ n ∈ M+}    17.7%              22.0% (+4.3%)

5 Limitations and Conclusion
Limitations. A limitation of our work is that the contributions quantification metric relies on the evaluation results on the test datasets, i.e., the ground-truth labels are needed to calculate the metric. Moreover, the with-without metric f(m) in Eq. 2 requires O(M · 2^(M−1)) computations, where M is the number of modalities. However, one should note that in practice M is not too large, e.g., M ≤ 10. Additionally, since our method is novel, it has only been tested on the task of cross-domain action recognition. To safely apply our method to other multimodal tasks, e.g., object recognition, future investigations are needed. Moreover, the overall accuracy is still relatively low for cross-domain activity recognition.
Self-training loss of the student network L_T for a target image x_T can thus be defined as

L_T(x_T) = − (1/(HW)) Σ_{j=1}^{H×W} Σ_{c=1}^{C} q_T × p_T^{(j,c)} × log h(x_T; θ)^{(j,c)}    (6)
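A minimal PyTorch sketch of Eq. 6 is given below. It assumes q_T is a per-image scalar confidence estimate and that the pseudo labels are hard class indices; the exact weighting scheme of the DAFormer implementation may differ, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def self_training_loss(student_logits, pseudo_labels, q_t):
    """Pseudo-label cross entropy of Eq. (6): per-pixel cross entropy between the
    student prediction and the teacher's pseudo labels, scaled by the confidence q_t.
    Shapes: logits (B, C, H, W), pseudo_labels (B, H, W), q_t broadcastable to (B, H, W)."""
    log_p = F.log_softmax(student_logits, dim=1)            # log h(x_T; theta)
    ce = F.nll_loss(log_p, pseudo_labels, reduction="none")  # (B, H, W)
    return (q_t * ce).mean()

# minimal usage
logits = torch.randn(2, 19, 64, 64, requires_grad=True)
labels = torch.randint(0, 19, (2, 64, 64))
loss = self_training_loss(logits, labels, q_t=torch.tensor([0.8, 0.6]).view(2, 1, 1))
loss.backward()
```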
We follow DAFormer's [10] method of using non-augmented target images for the teacher network f to generate pseudo labels and augmented target images to train the student network h using Eq. 6. We also follow their usage of color jitter, Gaussian blur, and ClassMix [15] as data augmentations in our training process.
Consistency Regularization. As mentioned in Mean Teacher [21], the cross entropy loss in Eq. 6 between predictions of the student model and pseudo labels (which are predictions from the teacher model) can be considered as a form of consistency regularization. However, different from classification problems, semantic segmentation problems have a property where pixel-wise class predictions are correlated with each other. Thus, we propose to further enhance consistency regularization by focusing on this inter-pixel relationship. Inspired by the method of Kim et al. [11], we use the inter-pixel cosine similarity of networks' predictions on target images to regularize the model. Formally, we define the similarity between pixel i and j class predictions on a target image x_T as

a_{i,j} = (p_i^T p_j) / (||p_i|| · ||p_j||)    (7)
where a_{i,j} represents the cosine similarity between the prediction vector of the i-th pixel and the prediction vector of the j-th pixel. Note that the similarity between the probability vectors p_i and p_j can also be computed using Kullback-Leibler (KL) divergence and cross entropy. We investigate these options in Sect. 4.3. The consistency regularization term L_C can then be defined as the mean squared error (MSE) between the student network's similarity matrix and the teacher network's similarity matrix

L_C = (1/(HW)^2) Σ_{i=1}^{H×W} Σ_{j=1}^{H×W} (a^s_{i,j} − a^t_{i,j})^2    (8)
where a^s_{i,j} is the similarity obtained from the student network and a^t_{i,j} is the similarity obtained from the teacher network. We also follow the method of Kim et al. [11] to restrict the number of pixels used in the calculation of similarity matrices by performing a random sample of N_pair pixels for comparison. Thus, the consistency regularization in Eq. 8 is updated to the following equation

L_C = (1/N_pair^2) Σ_{i=1}^{N_pair} Σ_{j=1}^{N_pair} (a^s_{i,j} − a^t_{i,j})^2    (9)
This term LC is particularly useful for domain adaptation as it helps to minimize the divergence between the source representation and the target representation by enforcing a structural consistency in the image segmentation task.
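Eqs. 7-9 can be sketched in PyTorch as below. This is not the authors' implementation: it assumes the same N_pair pixel indices are sampled for both networks, that the teacher branch is detached from the gradient, and that sampling is uniform over all pixels; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def structured_consistency_loss(student_logits, teacher_logits, n_pair=512):
    """Cosine similarity between the class-score vectors of randomly sampled pixels,
    matched between student and teacher with an MSE penalty (Eqs. 7-9).
    Logits have shape (B, C, H, W)."""
    b, c, h, w = student_logits.shape
    p_s = F.softmax(student_logits, dim=1).permute(0, 2, 3, 1).reshape(b, h * w, c)
    p_t = F.softmax(teacher_logits, dim=1).permute(0, 2, 3, 1).reshape(b, h * w, c)

    idx = torch.randint(0, h * w, (n_pair,), device=student_logits.device)
    s = F.normalize(p_s[:, idx], dim=-1)       # (B, n_pair, C), unit-norm prediction vectors
    t = F.normalize(p_t[:, idx], dim=-1)

    a_s = s @ s.transpose(1, 2)                # (B, n_pair, n_pair) cosine-similarity matrix
    a_t = t @ t.transpose(1, 2)
    return F.mse_loss(a_s, a_t.detach())       # teacher side is not back-propagated

loss_c = structured_consistency_loss(torch.randn(2, 19, 64, 64, requires_grad=True),
                                     torch.randn(2, 19, 64, 64))
loss_c.backward()
```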
4 Experiments
4.1 Implementation Details
Datasets. We use the Cityscapes street scenes dataset [5] as our target domain. Cityscapes contains 2975 training and 500 validation images with resolution of 2048 × 1024, and is labelled with 19 classes. For our source domain, we use the synthetic datasets GTA5 [16] and SYNTHIA [18]. GTA5 contains 24,966 images with resolution of 1914 × 1052, and is labelled with the same 19 classes as Cityscapes. For compatibility, we use a variant of SYNTHIA that is labelled with 16 of the 19 Cityscapes classes. It contains 9,400 images with resolution of 1280 × 760. Following DAFormer [10], we resize images from Cityscapes to 1024 × 512 pixels and images from GTA5 to 1280 × 720 pixels before training.
Network Architecture. Our implementation is based on DAFormer [10]. Previous UDA methods mostly used DeepLabV2 [4] or FCN8s [19] network architecture with ResNet [7] or VGG [20] backbone as their segmentation model. DAFormer proposes an updated UDA network architecture based on Transformers that was able to achieve state-of-the-art performance. They hypothesized that self-attention is more effective than convolutions in fostering the learning of domain-invariant features.
Training. We follow DAFormer [10] and train the network with AdamW [14], a learning rate of η_base = 6 × 10−5 for the encoder and 6 × 10−4 for the decoder, a weight decay of 0.01, linear learning rate warmup with t_warm = 1500, and linear decay. Images are randomly cropped to 512 × 512 and trained for 40,000 iterations on a batch size of 2 on an NVIDIA GeForce RTX 3090. We also adopt DAFormer's training strategy of rare class sampling and thing-class ImageNet feature distance to further improve results. For hyperparameters used in self-training, we follow DAFormer and set α = 0.99 and τ = 0.968. For hyperparameters used in consistency regularization, we set N_pair = 512, λ_c = 1.0 when calculating similarity using cosine similarity and λ_c = 0.8 × 10−3 when calculating similarity using KL divergence.
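The optimizer setup described above could be sketched as follows. The encoder/decoder modules are hypothetical placeholders, and the linear warmup/decay schedule is a simplification of the actual DAFormer training code.

```python
import torch

# Hypothetical placeholder modules standing in for the segmentation network
encoder = torch.nn.Conv2d(3, 64, 3, padding=1)
decoder = torch.nn.Conv2d(64, 19, 1)

base_lr = 6e-5
optimizer = torch.optim.AdamW(
    [{"params": encoder.parameters(), "lr": base_lr},
     {"params": decoder.parameters(), "lr": base_lr * 10}],   # 6e-4 for the decoder
    lr=base_lr, weight_decay=0.01)

warmup_iters, total_iters = 1500, 40000

def lr_factor(step):
    # linear warmup followed by linear decay, as described in the text
    if step < warmup_iters:
        return step / warmup_iters
    return max(0.0, 1.0 - (step - warmup_iters) / (total_iters - warmup_iters))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```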
4.2 Results
Table 1. Comparison with other UDA methods on GTA5 to Cityscapes. Results for DAFormer and our method using cosine similarity are averaged over 6 random runs, while results for our method using KL Divergence are averaged over 3 random runs.

Method         Road  Sidewalk  Build.  Wall  Fence  Pole  Tr.Light  Sign  Veget.  Terrain  Sky   Person  Rider  Car   Truck  Bus   Train  M.bike  Bike  mIoU19
BDL            91.0  44.7      84.2    34.6  27.6   30.2  36.0      36.0  85.0    43.6     83.0  58.6    31.6   83.3  35.3   49.7  3.3    28.8    35.6  48.5
ProDA          87.8  56.0      79.7    46.3  44.8   45.6  53.5      53.5  88.6    45.2     82.1  70.7    39.2   88.8  45.5   59.4  1.0    48.9    56.4  57.5
DAFormer       95.5  68.9      89.3    53.2  49.3   47.8  55.5      61.2  89.5    47.7     91.6  71.1    43.3   91.3  67.5   77.6  65.5   53.6    61.2  67.4
Ours (Cosine)  95.5  69.2      89.5    52.1  49.6   48.9  55.2      62.1  89.8    49.0     91.1  71.7    45.1   91.7  70.0   77.6  65.2   56.6    62.8  68.0
Ours (KL)      96.1  71.6      89.5    53.2  48.6   49.5  54.7      61.1  90.0    49.4     91.7  70.7    44.0   91.6  70.0   78.1  68.9   55.1    62.9  68.2
Table 2. Comparison with other UDA methods on SYNTHIA to Cityscapes. Results for DAFormer and our method using cosine similarity are averaged over 6 random runs, while results for our method using KL Divergence are averaged over 3 random runs.

Method         Road  Sidewalk  Build.  Wall  Fence  Pole  Tr.Light  Sign  Veget.  Sky   Person  Rider  Car   Bus   M.bike  Bike  mIoU16  mIoU13
BDL            86.0  46.7      80.3    –     –      –     14.1      11.6  79.2    81.3  54.1    27.9   73.7  42.2  25.7    45.3  –       51.4
ProDA          87.8  45.7      84.6    37.1  0.6    44.0  54.6      37.0  88.1    84.4  74.2    24.3   88.2  51.1  40.5    40.5  55.5    62.0
DAFormer       80.5  37.6      87.9    40.3  9.1    49.9  55.0      51.8  85.9    88.4  73.7    47.3   87.1  58.1  53.0    61.0  60.4    66.7
Ours (Cosine)  86.3  44.2      88.3    39.2  7.5    49.2  54.7      54.7  87.2    90.7  73.8    47.3   87.4  55.9  53.7    60.7  61.3    68.1
Ours (KL)      89.0  49.6      88.1    40.3  7.3    49.2  53.5      52.1  87.0    88.0  73.8    46.4   87.1  58.7  53.9    61.7  61.6    68.4
Fig. 1. Qualitative results comparing predictions on validation data of Cityscapes. From left: input image, predictions by DAFormer, predictions by our method, and the ground truth. The last row provides an example where DAFormer performed better compared to our method as it was able to correctly predict the sidewalks
We compare the results of our method against other state-of-the-art UDA segmentation methods such as BDL [12], ProDA [29] and DAFormer [10]. In Table 1, we present our experimental results on the GTA5 to Cityscapes problem. It can be observed that our method improved UDA performance from an mIoU19 of 67.4 to 68.0 when cosine similarity is used in Eq. 7 and 68.2 when KL divergence is used. Table 2 shows our experimental results on the SYNTHIA to Cityscapes problem. Similarly, our method improved performance from an mIoU16 of 60.4 to 61.3 with cosine similarity and 61.6 with KL divergence. We also observed that our method was able to make notable improvements on the “Road” and “Sidewalk” categories. This is especially so on the SYNTHIA to Cityscapes problem, where we improved UDA performance on “Road” from 80.5 to 89.0 and “Sidewalk” from 37.6 to 49.6. We further verify this improvement in our qualitative analysis presented in Fig. 1, where we observed that our method had better recognition on the “Sidewalk” and “Road” categories. We attribute this improvement to our method’s effectiveness in generating more accurate pseudo labels. We present pseudo labels generated during the training process in Fig. 2, where we observed more accurate pseudo labels for the “Road” and “Sidewalk” categories.
Fig. 2. Qualitative results on pseudo labels generated from the training data of Cityscapes. From left: target image, pseudo labels generated by DAFormer, and pseudo labels generated by our method using cosine similarity
It should be noted that experimental results obtained using the DAFormer method in Tables 1 and 2 were obtained by averaging 6 random runs using the official DAFormer implementation2 . Even though we were unable to reproduce the exact numbers published in the DAFormer paper, we believe our experimental results for DAFormer are comparable. 4.3
Ablation Study
Table 3. Influence of N_pair on UDA performance. Results for all experiments were averaged over 3 random runs except for N_pair = 512, which was an average over 6 runs.

N_pair    4     16    64    256   512   1024
mIoU16    61.4  60.8  61.4  61.1  61.3  60.6
Number of Pixels Sampled. We conducted additional experiments on SYNTHIA to Cityscapes to observe the effect of Npair (from Eq. 9) on model performance. Theoretically, sampling more pixels for similarity calculation (i.e. a larger Npair ) allows us to have a more complete model of the inter-pixel relationship between predictions. However, empirical results in Table 3 suggests that Npair does not have significant influence on UDA performance. We observe that very small samples, such as Npair = 4, were able to obtain comparable results with larger sample sizes. Additional experiments using Npair = 4 were conducted to observe the locations of sampled pixels. Visualization of our ablation study is presented in Fig. 3. We found that after 40,000 training iterations, sampled pixels covered approximately 45.73% of the 512 × 512 images the network was trained on despite the small sampling size. This suggests that if a reasonable image coverage can be obtained during the training process, a small Npair is sufficient to model the inter-pixel relationship between predictions, allowing us to minimize computational cost of our consistency regularization method. The influence of sampling coverage and sampling distribution on the effectiveness of consistency regularization is an interesting study that can be explored in the future. Proximity of Sampled Pixels. Kim et al. adopted cutmix augmentation [27] in their consistency regularization method [11] to limit sampled pixel pairs to within a local region. They theorized that pixel pairs that are in close proximity to each other have high correlation, and hence have more effect on UDA performance. We tested this theory on SYNTHIA to Cityscapes by performing Nbox crops and sampling Npair pixels from each crop. This localizes sampled pixels and restricts them to have closer proximity. Sampled pixels are then used to compute inter-pixel similarity to obtain a Nbox × Npair × Npair similarity matrix 2
https://github.com/lhoyer/DAFormer.
Fig. 3. Visualization of pixels sampled in experiments using N_pair = 4. (a) Target image cropped to 512 × 512; (b), (c) and (d) visualize sampled pixels in three separate runs.

Table 4. Influence of N_box and N_pair on UDA performance. Total number of sampled pixels, i.e. N_box × N_pair, is kept at 512 for a fair comparison.

Crop size   N_box   N_pair   mIoU16
256         32      16       61.2
128         32      16       61.1
64          32      16       60.2
which is used for loss calculation in Eq. 8. We present the experimental results in Table 4 where three different crop sizes were varied to restrict the proximity of sampled pixels. We did not observe an improvement in UDA performance compared to results presented in Table 2, suggesting that proximity of sampled pixels perhaps may not be that influential for consistency regularization.
Measuring Inter-pixel Similarity. In Sect. 3.1, we adopted the method of Kim et al. to use cosine similarity in the measure of inter-pixel similarity [11]. In this section, we conducted additional experiments on SYNTHIA to Cityscapes to observe the influence different methods of measuring inter-pixel similarity have on UDA performance.

Table 5. Comparison of UDA performance using different methods to calculate inter-pixel similarity. We also provide the optimal λ_c obtained using hyperparameter tuning.

Method              λ_c           mIoU16
Cosine similarity   1.0           61.3
Cross entropy       1.0 × 10−3    61.2
KL divergence       0.8 × 10−3    61.6
We tested the usage of cross entropy and KL divergence to measure interpixel similarity instead of cosine similarity in Eq. 7. Results from our empirical experiments are presented in Table 5. We observed that all three methods provided comparable results with each other, with KL divergence providing slightly better results.
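Replacing the cosine similarity of Eq. 7 with a divergence-based measure only changes how the pairwise matrix is built. A hedged sketch of a pairwise KL computation over sampled per-pixel class distributions is shown below; the vectorization and the epsilon smoothing are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def pairwise_kl(p, eps=1e-8):
    """Pairwise KL divergence KL(p_i || p_j) between per-pixel class distributions
    p of shape (n, C); returns an (n, n) matrix that can stand in for a_{i,j}."""
    log_p = torch.log(p + eps)
    # KL(p_i || p_j) = sum_c p_i[c] * (log p_i[c] - log p_j[c])
    entropy_term = (p * log_p).sum(dim=1, keepdim=True)   # (n, 1): sum_c p_i[c] log p_i[c]
    cross_term = p @ log_p.t()                            # (n, n): sum_c p_i[c] log p_j[c]
    return entropy_term - cross_term

probs = F.softmax(torch.randn(512, 19), dim=1)
kl_matrix = pairwise_kl(probs)   # divergence-based alternative to the cosine matrix
```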
5 Conclusion
In this work we presented a new consistency regularization method for UDA based on relationships between pixel-wise class predictions from semantic segmentation models. Using this technique we were able to improve the performance of the state-of-the-art DAFormer method. We also observed that even with smaller number of sampled pixel pairs Npair , this regularization method was still able to be effective. Therefore, with minimal computational cost, we are able to improve the results of self-training methods for unsupervised domain adaptation. Acknowledgment. This research is supported by the Centre for Frontier AI Research (CFAR) and Robotics-HTPO seed fund C211518008.
References 1. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2017) 2. Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. CoRR abs/1612.05424 (2016) 3. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(04), 834–848 (2018). https://doi.org/10.1109/TPAMI.2017.2699184 4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. (2016). https://doi.org/ 10.1109/TPAMI.2017.2699184 5. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedingse IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 6. Gong, R., Li, W., Chen, Y., Van Gool, L.: Dlow: domain flow for adaptation and generalization. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2472–2481 (2019). https://doi.org/10.1109/CVPR.2019. 00258 7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
8. Hoffman, J., et aal.: CyCADA: cycle-consistent adversarial domain adaptation. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 1989– 1998. PMLR (2018). https://proceedings.mlr.press/v80/hoffman18a.html 9. Hoffman, J., Wang, D., Yu, F., Darrell, T.: FCNS in the wild: pixel-level adversarial and constraint-based adaptation (2016) 10. Hoyer, L., Dai, D., Gool, L.V.: Daformer: improving network architectures and training strategies for domain-adaptive semantic segmentation. CoRR abs/2111.14887 (2021). https://arxiv.org/abs/2111.14887 11. Kim, J., Jang, J., Park, H.: Structured consistency loss for semi-supervised semantic segmentation. CoRR abs/2001.04647 (2020). https://arxiv.org/abs/2001. 04647 12. Li, Y., Yuan, L., Vasconcelos, N.: Bidirectional learning for domain adaptation of semantic segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6929–6938 (2019). https://doi.org/10.1109/ CVPR.2019.00710 13. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021) 14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019). https://openreview.net/forum? id=Bkg6RiCqY7 15. Olsson, V., Tranheden, W., Pinto, J., Svensson, L.: Classmix: segmentation-based data augmentation for semi-supervised learning. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1368–1377 (2021). https://doi. org/10.1109/WACV48630.2021.00141 16. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: ground truth from computer games. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 102–118. Springer, Cham (2016). https://doi.org/10. 1007/978-3-319-46475-6 7 17. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 18. Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M.: The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3234–3243 (2016). https://doi.org/10.1109/CVPR.2016.352 19. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2017). https://doi.org/10.1109/TPAMI.2016.2572683 20. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556 (2014) 21. Tarvainen, A., Valpola, H.: Weight-averaged consistency targets improve semisupervised deep learning results. CoRR abs/1703.01780 (2017) 22. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. CoRR abs/1702.05464 (2017) 23. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
24. Verma, V., Lamb, A., Kannala, J., Bengio, Y., Lopez-Paz, D.: Interpolation consistency training for semi-supervised learning. In: Proceedings of the TwentyEighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3635–3641. International Joint Conferences on Artificial Intelligence Organization (2019). https://doi.org/10.24963/ijcai.2019/504 25. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203 (2021) 26. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016) 27. Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 28. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=r1Ddp1-Rb 29. Zhang, P., Zhang, B., Zhang, T., Chen, D., Wang, Y., Wen, F.: Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. arXiv preprint arXiv:2101.10979 (2021) 30. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6230–6239 (2017). https://doi.org/10.1109/CVPR.2017.660
W37 - Vision With Biased or Scarce Data
With the increasing appetite for data in data-driven methods, the issues of biased and scarce data have become a major bottleneck in developing generalizable and scalable computer vision solutions, as well as in effectively deploying these solutions in real-world scenarios. To tackle these challenges, researchers from both academia and industry must collaborate and make progress in fundamental research and applied technologies. The organizing committee and keynote speakers of VBSD 2022 consist of experts from both academia and industry with rich experience in designing and developing robust computer vision algorithms and transferring them to real-world solutions. VBSD 2022 provides a focused venue to discuss and disseminate research related to bias and scarcity topics in computer vision. October 2022
Kuan-Chuan Peng Ziyan Wu
CAT: Controllable Attribute Translation for Fair Facial Attribute Classification

Jiazhi Li1,2,3(B) and Wael Abd-Almageed1,2,3

1 USC Ming Hsieh Department of Electrical and Computer Engineering, Los Angeles, USA
2 USC Information Sciences Institute, Marina del Rey, USA
3 Visual Intelligence and Multimedia Analytics Laboratory, Marina del Rey, USA
{jiazli,wamageed}@isi.edu

Abstract. As the social impact of visual recognition has been under scrutiny, several protected-attribute balanced datasets emerged to address dataset bias in imbalanced datasets. However, in facial attribute classification, dataset bias stems from both protected attribute level and facial attribute level, which makes it challenging to construct a multi-attribute-level balanced real dataset. To bridge the gap, we propose an effective pipeline to generate high-quality and sufficient facial images with desired facial attributes and supplement the original dataset to be a balanced dataset at both levels, which theoretically satisfies several fairness criteria. The effectiveness of our method is verified on sex classification and facial attribute classification by yielding comparable task performance as the original dataset and further improving fairness in a comprehensive fairness evaluation with a wide range of metrics. Furthermore, our method outperforms both resampling and balanced dataset construction to address dataset bias, and debiasing models to address task bias.

Keywords: Fairness
· Synthetic face image generation

1 Introduction
Bias issues in machine learning methods involving protected attributes [7,46] (e.g., sex and race) have lately garnered tremendous attention [5,19,56], for example in facial attribute classification [48,61] and sex classification [7] on face datasets. Many studies [7,61] show that one of the most visible symptoms of these bias issues in downstream tasks is that the underlying learning algorithms yield uneven performance across demographic groups, with relatively worse performance for minorities. Such unfair performance across demographic cohorts has undermined the credibility of computer vision algorithms on face datasets for both individuals and society as a whole, and urgently needs to be resolved. Ensuring fairness of visual recognition on face datasets has aroused great interest in the machine learning community [6,14,15,26,33,37,64,65,68].
(Supplementary Information: the online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9_21.)
Without loss of
generality, there are two main challenges with respect to two distinct types of bias—(1) lack of generalized learned representations due to the spurious correlation between the prediction target and confounding factors, which could include a sensitive attribute (e.g., sex) in the training dataset, referred to as sensitive attribute bias and task bias [3,55,70], and (2) uneven performance across cohorts caused by insufficient samples for particular demographic groups in the existing dataset, referred to as minority group bias or dataset bias [1,49,53]. Mainstream debiasing methods [31,55,70] focus on mitigating task bias, with less emphasis on tackling dataset bias. Furthermore, although these methods effectively address task bias [61], they are limited in addressing the issue of insufficient samples of minority demographics in the training dataset, as demonstrated in Table 3 with high bias scores. Instead, in the specific scenarios involving dataset bias, constructing a balanced dataset is a more appropriate approach. For example, some works mitigate sex distribution differences or race distribution differences by either collecting fair datasets [29,67] or strategically resampling from existing datasets [6,37]. However, merely maintaining the balance across the protected attributes may not resolve the risk of dataset bias, which may be implicitly derived from imbalance on other facial attributes (e.g., blond hair or chubby) [48].

Fig. 1. As the original facial attribute dataset may be imbalanced at both protected attribute level (e.g., sex) and facial attribute level (e.g., blond hair), which induce dataset bias, we propose a pipeline to generate high-quality synthetic datasets with sufficient samples in a balanced distribution of both levels to rectify skew of the original dataset.

Furthermore, due to the resource limitations in academia to collect sufficient real images for minority groups, the invasion of privacy involved in collecting more real data, and the impossibility of covering all known and unknown confounding factors, it may be impractical to construct a real dataset which is balanced across both protected attributes and all facial attributes so as to set up the research scenario to address dataset bias. We therefore propose a method to iteratively generate high-quality synthetic images with desired facial attributes and combine the synthetic dataset with the original dataset to create a semi-synthetic balanced dataset by supplementing insufficient samples at both the protected attribute level and the facial attribute level for downstream tasks, as illustrated in Fig. 1. We use facial attribute classification and sex classification on a face dataset as the proxy problem to discuss the effectiveness of our method instead of face recognition, since the generation of synthetic face recognition ground truth relies on pre-trained models [23,66], which may already be biased with respect to protected attributes in the debiasing face recognition literature. On the other hand, the generation of facial attribute ground truth
of our synthetic dataset is bundled with the synthetic images directly rather than relying on classifiers, as elaborated in Sect. 3. To construct semi-synthetic datasets which are balanced at both the protected attribute level and the facial attribute level, our method explicitly controls the facial attributes during image generation. Compared with debiasing models that focus on the process of learning fairly, our method of adjusting the training dataset to achieve fairness is straightforward and effective, and is easier to generalize. Furthermore, compared with fair dataset collection, which is restricted by long-tail distributions [71] and limited resources, the advantages of a semi-synthetic dataset allow our method to generate sufficient face images with more diversity of facial attributes. Finally, different from data augmentation methods that add slightly modified copies of samples in existing datasets, we create a generative pipeline to first learn the facial attribute appearance of real data and then generate synthetic images with desired facial attributes. The key contributions of this paper can be summarized as follows:
– A pipeline to construct semi-synthetic face datasets to introduce fairness in facial attribute classification and sex classification.
– An investigation of the nature of dataset bias stemming from facial attribute-level imbalance instead of protected attribute-level imbalance.
– A comprehensive fairness evaluation for a wide range of debiasing techniques.
2 Related Work
Debiasing Face Datasets. Long-tail distribution [71] leads to inequity towards minorities in face datasets [39,41], and several methods have been designed to address dataset bias by collecting fair datasets, such as PPB [8], RFW [59], UTKFace [67] and FairFace [29]. Meanwhile, strategic resampling methods [6,37,58] attempt to balance the appearance of training data with respect to different demographic groups, referred to as domains. Since fair dataset collection and strategic resampling address bias issues before model training, they are also referred to as pre-processing methods in [9,28]. On the other hand, to mitigate task bias from the perspective of model training, several debiasing models have been proposed. First, adversarial forgetting methods [3,16] manage to extract a new representation which only contains information for the recognition task and excludes the information of protected attributes. Second, domain adaptation methods [59,69] adapt learned recognition knowledge from the majority domain to the minority domain. Third, domain independent training methods [17,61] delicately design several separate classifiers for different demographic groups and garner an ensemble of classifiers by representation sharing. Fairness Using GANs. Over the past couple of years, the Generative Adversarial Network (GAN) [18] has been developed further in a tremendous number of works [10,24,27,43,45,60]. As pointed out by [2], the three main purposes of GANs are fidelity, diversity and privacy. Besides the mainstream developments, GANs have also been used to augment real datasets [21,50,72]. Some works [12,25] manipulate latent vectors to augment real datasets to learn disentangled representations. Moreover, Balakrishnan et al. [4] propose an experimental method with
face images generated by StyleGAN2 [30] to reveal correlation between protected attributes and fairness performance, which in nature is to study and measure bias instead of debiasing. While [32,62,63] use GANs to construct synthetic dataset to improve fairness, they conduct experiments on Census dataset, e.g., Adult dataset (predict whether income exceeds $50K/yr based on census data), which is inapplicable for face dataset since label and data of census dataset are both census data and can be generated together, which is different from generating accurate labels together with generated images in face dataset. Recently, a minority route of augmentation by GANs to introduce fairness [48,51,52] emerged. However, as pointed out by [48], some work [13,51,52] need to train a new GAN per bias since different facial attributes may yield different levels of bias. By contrast, V. Ramaswamy et al. [48] use a single GAN to construct a fair synthetic dataset by generating the complementary image with the same target label but the opposite protected attribute label corresponding to the randomly generated image. However, to assign the attribute labels, [48] relies on an attribute classifier trained on the original dataset, which may be harmful for fairness since the pre-trained classifier may be already biased by skew of the original dataset. By contrast, our method does not need the pre-trained classifier or significant resources of human annotations using Amazon Mechanical Turk as in [4] to assign labels for each generated image. As elaborated in Sect. 3, our method only needs a small set of images with labels as seeds and then construct the whole synthetic datasets automatically with labels, which can be directly used in the following performance and fairness evaluation pipeline. Furthermore, compared with [48] using additional images for the good recognition performance, our method achieves comparable performance and better fairness even with the same-size training dataset, as elaborated in Sect. 4.5.
3 Approach
As discussed in Sect. 2, some methods [13,51,52] focus on applying different data augmentation techniques to mitigate different types of bias. By contrast, we train a unified GAN on the original dataset to introduce fairness for both protected attributes and all facial attributes. In the facial attribute classification task, given an attribute dataset D involving instances (x_i, y_i, z_i), the classification model H consumes an image x_i ∈ X annotated with a set of binary facial attributes y_i ∈ Y (e.g., hair color) and protected attributes z_i ∈ Z (e.g., sex), and produces the predicted facial attribute label ŷ_i ∈ Y. As illustrated by the conditions of three fairness criteria [20] defined with conditional probabilities, dataset bias may be induced by skew of the training dataset. Specifically, D may naturally be imbalanced across the positive samples Z and the negative samples Z̄ with respect to the protected attribute Z (if binary), i.e., |D_Z| ≠ |D_Z̄|; training the classification model H on such a dataset may invalidate demographic parity [20], which requires that the prediction label Ŷ (if binary) and the protected attribute Z are independent, i.e.,

P(Ŷ = 1) = P(Ŷ = 1 | Z) = P(Ŷ = 1 | Z̄).
(1)
Algorithm 1. Controllable Attribute Translation (CAT).
1: initialize A_Y^intra and B_Y^inter to the full index set {1, …, k}
2: for e_i ∈ R^k, e_i ∈ S_Y do
3:   for e_j ∈ R^k, e_j ∈ S_Y do
4:     A_ij = {l | |e_i^l − e_j^l| < intra_threshold, l ∈ [1, k]}      ▷ Intra-class similarity
5:     A_Y^intra ← A_Y^intra ∩ A_ij
6:   end for
7:   for ē_j ∈ R^k, ē_j ∈ S_Ȳ do
8:     B_ij = {l | |e_i^l − ē_j^l| > inter_threshold, l ∈ [1, k]}      ▷ Inter-class difference
9:     B_Y^inter ← B_Y^inter ∩ B_ij
10:  end for
11: end for
12: output C_Y ← A_Y^intra ∪ B_Y^inter
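The following NumPy sketch of this selection step is an editorial illustration rather than the authors' code; the seed matrices, the threshold values and the choice to start from the full index set are assumptions based on the description of Algorithm 1.

import numpy as np

def cat_dimensions(seeds_with, seeds_without, intra_threshold, inter_threshold):
    # seeds_with:    (n, k) latent vectors whose generated images show the AOI (S_Y)
    # seeds_without: (m, k) latent vectors whose generated images lack the AOI (S_Ybar)
    # Returns C_Y, the union of intra-class similarity and inter-class difference indices.
    k = seeds_with.shape[1]
    a_intra = np.ones(k, dtype=bool)   # start from the full index set
    b_inter = np.ones(k, dtype=bool)
    for e_i in seeds_with:
        # Dimensions on which all positive seeds agree (intra-class similarity).
        for e_j in seeds_with:
            a_intra &= np.abs(e_i - e_j) < intra_threshold
        # Dimensions on which positive and negative seeds differ (inter-class difference).
        for e_bar in seeds_without:
            b_inter &= np.abs(e_i - e_bar) > inter_threshold
    return np.flatnonzero(a_intra | b_inter)

# Toy usage with random seeds in a 512-dimensional latent space.
rng = np.random.default_rng(0)
s_y, s_ybar = rng.standard_normal((8, 512)), rng.standard_normal((8, 512))
c_y = cat_dimensions(s_y, s_ybar, intra_threshold=2 * np.sqrt(2),
                     inter_threshold=np.sqrt(2))
print(len(c_y), "candidate dimensions for this AOI")

With thresholds in the range discussed in Sect. 4.5, the intra-class term does most of the pruning while the inter-class term adds dimensions the first term misses.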
Furthermore, the original dataset D may be imbalanced at the facial attribute level across protected attributes for a facial attribute Y, i.e., |D_Z^Y| ≠ |D_Z̄^Y|, where |D_Z^Y| is the number of positive samples of Y among Z and |D_Z̄^Y| is the number of positive samples of Y among Z̄. This invalidates equal opportunity [20], another criterion that requires independence between the prediction label Ŷ and the protected attribute Z conditional on Y, i.e.,

P(Ŷ = 1 | Z, Y) = P(Ŷ = 1 | Z̄, Y).
(2)
Finally, equalized odds [20], which is a stronger criterion, requires that Ŷ and Z are independent conditional on both Y and Ȳ, i.e., with the additional constraint beyond equal opportunity that

P(Ŷ = 1 | Z, Ȳ) = P(Ŷ = 1 | Z̄, Ȳ),
(3)
where Ȳ represents the negative samples with respect to the facial attribute Y. To mitigate the influence of training-set skew under the requirements of equalized odds, D should be balanced across Z for both Y and Ȳ, i.e., |D_Z^Y| = |D_Z̄^Y| and |D_Z^Ȳ| = |D_Z̄^Ȳ|, where |D_Z^Ȳ| is the number of negative samples Ȳ among Z and |D_Z̄^Ȳ| is the number of Ȳ among Z̄ with respect to the facial attribute Y. However, even though prior methods that depend on fair dataset collection [29] or resampling [6,37] may satisfy demographic parity by balancing the dataset across protected attributes, it is hard for these methods to satisfy the equal opportunity and/or equalized odds constraints, which require balance at the facial attribute level, due to the lack of sufficient images that are balanced across even one facial attribute, much less across multiple facial attributes in the multi-attribute classification task. Thus, to mitigate the effect of dataset bias on the trained model, we generate a semi-synthetic dataset D_syn which meets the requirements of all three fairness criteria by controllably translating the representation of facial attributes in latent space to image space and generating sufficient images with desired facial attributes, referred to as Controllable Attribute Translation (CAT).
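As a concrete reading of the three criteria in Eqs. (1)–(3), the snippet below, added by the editor and not part of the paper, estimates the corresponding gaps from arrays of binary predictions, target labels and protected-attribute labels; the variable names are assumptions.

import numpy as np

def fairness_gaps(y_hat, y, z):
    # y_hat, y, z: binary {0,1} arrays of predictions, attribute labels and
    # protected-attribute labels, all of the same length.
    y_hat, y, z = (np.asarray(a).astype(bool) for a in (y_hat, y, z))

    def rate(mask):
        # Empirical P(Yhat = 1 | mask).
        return y_hat[mask].mean() if mask.any() else np.nan

    dp_gap = abs(rate(z) - rate(~z))                        # demographic parity, Eq. (1)
    eo_gap = abs(rate(z & y) - rate(~z & y))                # equal opportunity, Eq. (2)
    eodds_gap = max(eo_gap,
                    abs(rate(z & ~y) - rate(~z & ~y)))      # equalized odds, Eqs. (2)+(3)
    return dp_gap, eo_gap, eodds_gap

# Toy usage: an ideal classifier on independent labels gives gaps near zero.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, 10_000)
y = rng.integers(0, 2, 10_000)
y_hat = y
print(fairness_gaps(y_hat, y, z))

A dataset that is balanced across Z within both Y and Ȳ removes the sampling skew that would otherwise push these gaps away from zero during training.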
Fig. 2. Examples of generated female images and male images.
Given a k-dimensional latent space E ∈ R^k, a generative model G : E → I trained on the real dataset D produces the synthetic image I_i from the latent vector e_i ∈ E. By human annotations, denoted as F, we construct a succinct set of latent vectors with the facial attribute Y, referred to as attribute seeds S_Y ⊂ E, such that ∀e_i ∈ S_Y, F(G(e_i)) = Y. We refer to Y as the Attribute of Interest (AOI)¹. To stably generate images with the AOI, we first study the common similarity of latent vectors among S_Y, since I_i = G(e_i), e_i ∈ S_Y, may yield the same AOI while randomly yielding other facial attributes. Given two randomly picked latent vectors e_i, e_j ∈ S_Y, we traverse and aggregate the dimension indices where the absolute difference is smaller than a hyperparameter intra_threshold, i.e., A_ij = {l | |e_i^l − e_j^l| < intra_threshold, l ∈ [1, k]}, which are the dimensions representing the attribute similarity between e_i and e_j. Furthermore, since pairs of images generated by latent vectors from S_Y may be similar in other unwanted facial attributes, to purify the similarity of the AOI and find the predominant dimensions controlling the AOI, we traverse all combinations of pairs in S_Y to find A_ij of each pair and take the intersection across all A_ij, i.e., A_Y^intra = ∩_{e_i, e_j ∈ S_Y} A_ij, which is the least common similarity, referred to as the intra-class similarity of the AOI; it effectively eliminates the similarity of other unwanted facial attributes. To select more representative dimension indices for the AOI, we compensate intra-class similarity with inter-class difference. We randomly pick e_i ∈ S_Y and ē_j ∈ S_Ȳ, where S_Ȳ is the set of latent vectors without the AOI, such that ∀ē_j ∈ S_Ȳ, F(G(ē_j)) = Ȳ. Further, we aggregate the dimension indices where the absolute difference is greater than the other hyperparameter inter_threshold, i.e., B_ij = {l | |e_i^l − ē_j^l| > inter_threshold, l ∈ [1, k]}, which are the dimensions representing the attribute difference between e_i and ē_j. We then take the intersection across all attribute differences, B_Y^inter = ∩_{e_i ∈ S_Y, ē_j ∈ S_Ȳ} B_ij, which is the least common difference, referred to as the inter-class difference of the AOI. Finally, we take C_Y = A_Y^intra ∪ B_Y^inter, since the usage of B_Y^inter is to find representative dimensions unseen by A_Y^intra, as elaborated in Sect. 4.5. A pseudo code of the method is shown in Algorithm 1, which can be extended to all other AOIs and all protected attributes. With intra-class similarity and inter-class difference, we can find the representative dimensions for both protected and facial attributes so that the generated images can properly capture the characteristics of the AOI. In general, we translate the attribute appearance in images to the attribute representation in latent space. To construct the synthetic dataset D_syn, we controllably translate the found attribute representations in the latent space back to image space. We first
AOI is set based on interests for different experiments.
Fig. 3. Examples of generated images with the protected attribute and a single assigned facial attribute, negative samples (left) and positive samples (right) in each 2 × 2 grid.
randomly simulate another set of latent vectors, referred to as identity seeds S_ID ⊂ E, following the standard normal distribution N(0, 1). Then, to assign a facial attribute Y, we perturb e_ID ∈ S_ID to be e_Y ∈ S_Y for the dimension indices in C_Y. In parallel, the label can be automatically assigned as Y. In practice, the latent space consists of R resolutions, and each resolution is a k-dimensional latent space. To further ensure the appearance of assigned facial attributes and mitigate the intervention from other facial attributes across the resolution spectrum, although we do not assume chosen dimensions are independent between different attributes and the latent space is not disentangled by design, we subdivide the resolution spectrum to assign different kinds of facial attributes in different resolutions (e.g., high-level facial attributes in lower resolutions and smaller-scale facial attributes in higher resolutions), which is elaborated in Sect. 4.4. By generating a balanced synthetic dataset D_syn with desired attributes and supplementing the minority group of the original dataset with sufficient images at both the protected attribute level and the facial attribute level, we can ensure the requirements of all three fairness criteria in Eqs. (1) to (3). From the perspective of information theory, assuming the whole information contained in the original dataset is fixed, i.e., the best recognition performance trained on this dataset is upper-bounded, synthetic datasets generated by our method aim to transfer and express the recognition task-related information in a fairer way. Advantages. By intra-class similarity and inter-class difference, we can accurately and effectively find a specific set of dimension indices representing the AOI in the dimension spectrum, so that we only need a small set of attribute seeds with annotations as starting seeds. Besides, with the resolution separation, we have a two-dimensional separation in both latent space and resolution. Thus, we can assign multiple AOIs to one single identity seed and stably preserve the randomness of other unassigned facial attributes, which allows the synthetic dataset to be constructed under different facial attribute distributions with more freedom. Furthermore, we assign the attribute label to the generated image I_i = G(e_i) naturally with the facial attribute Y represented by the S_Y to which its latent vector e_i ∈ S_Y belongs, rather than relying on a shallow classifier.
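The construction step can be pictured with the following editor-written sketch; the perturbation rule (copying the mean of the attribute seeds into the selected dimensions), the generator interface and all names are assumptions made for illustration, not the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(0)
K = 512                                               # latent dimensionality

def assign_attribute(e_id, attribute_seeds, c_y):
    # Perturb an identity seed only on the dimensions C_Y that encode the AOI.
    e_y = e_id.copy()
    e_y[c_y] = attribute_seeds[:, c_y].mean(axis=0)   # assumed perturbation rule
    return e_y

def build_synthetic_split(n_images, attribute_seeds, c_y, generator, label):
    # Generate n_images with the AOI present (label=1) or absent (label=0).
    records = []
    for _ in range(n_images):
        e_id = rng.standard_normal(K)                 # identity seed ~ N(0, 1)
        e = assign_attribute(e_id, attribute_seeds, c_y) if label else e_id
        image = generator(e)                          # e.g. a trained StyleGAN2
        records.append((image, label))                # label assigned automatically
    return records

# Toy usage with a stand-in "generator" so the sketch runs end to end.
fake_generator = lambda e: e.reshape(16, 32)          # placeholder for G: E -> I
seeds = rng.standard_normal((8, K))
c_y = rng.choice(K, size=40, replace=False)           # dimensions from Algorithm 1
positives = build_synthetic_split(4, seeds, c_y, fake_generator, label=1)

Because the label comes directly from the seed set S_Y used for the perturbation, no pre-trained attribute classifier is needed, which is the point stressed at the end of this section.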
4 Experimental Evaluation
We first investigate the nature of dataset bias by exploring each facial attribute and focus on the facial attributes which induce much bias. Furthermore, we empirically investigate the effectiveness of our method on two fashion tasks involving bias—(1) sex classification and (2) facial attribute classification. In sex classification, we concurrently manipulate multiple femininity/masculinity attributes to generate paired images with female and male appearance and construct a sex-level balanced dataset. Meanwhile, in facial attribute classification, on the basis of the sex-level balanced dataset, we further consider the other non-sex-related facial attributes to construct the semi-synthetic dataset which is balanced in terms of both sex and facial attributes. In both experiments, we first show that the classification performance of the model trained with synthetic datasets achieves comparable performance to the model trained on the original dataset. Then, we conduct a comprehensive fairness evaluation with a wide range of bias assessment metrics [11,34,42,54,57] to show that fairness is improved after training with synthetic datasets compared to the model solely trained on the original dataset. Further, we demonstrate that our method outperforms both strategic resampling [6] and the balanced dataset construction method with synthetic images [48] to address dataset bias, and several debiasing models [3,61,69] to address task bias. Finally, we present an ablation study to evaluate different factors influencing the performance of our method. Although we discuss sex as the protected attribute in this section, our method is general purpose and can be used for all protected attributes if the labels are available.

4.1 Attributes Study
Before presenting the improvement on fairness, we first conduct in-depth evaluation of dataset bias at the facial attribute level for CelebA dataset [39], which is a face dataset containing 202,599 images of celebrity faces and 40 binary attributes per image. Since our method produces a facial attribute-level balanced synthetic dataset, it is valuable to study dataset bias at facial attribute level instead of the general overall accuracy for all facial attributes. Following [48], we train a multitask facial attribute classifier with ResNet-50 [22] to recognize facial attributes on the sex-level balanced CelebA dataset. The main results are shown in Table 2 as baseline and full results are presented in the appendix. As shown in Table 2 by results of baseline, solely balancing across sex does not guarantee fairness in facial attribute classification since even with balanced training across females and males, the imbalance of facial attribute across sex still exists. Although we know classification models tend to learn distinguishable representations from positive samples instead of negative samples [35], there are insufficient positive samples of some specific facial attributes in the minority group. Inspired by the categorization in [48] where they only categorize a subset of facial attributes, we summarize all 40 facial attributes in CelebA dataset into three groups—(1) unbiased attributes which do not yield much bias
Table 1. Performance and fairness comparison on sex classification.

Training dataset           Setting      Number of training images     Classification accuracy ↑     Information leakage   Statistical dependence
                                        Female    Male     Total      Female   Male   Overall       BA ↓                  dcor² ↓   RLB ↓
Origin training dataset    Imbalanced   94509     68261    162770     99.2     97.7   98.6          −0.015                0.656     3.854
Origin training dataset    Balanced     68261     68261    136522     99.2     97.6   98.5          −0.016                0.651     3.792
GAN-Debiasing [48]         Original     254509    228261   482770     99.1     97.4   98.4          −0.018                0.596     3.567
GAN-Debiasing [48]         Same size    68261     68261    136522     98.8     97.2   98.0          −0.017                0.583     3.578
GAN-Debiasing [48]         Supplement   94509     94509    189018     99.2     97.2   98.2          −0.016                0.574     3.574
Ours                       Same size    68261     68261    136522     99.1     97.3   98.3          −0.018                0.592     3.742
Ours                       Supplement   94509     94509    189018     99.2     97.8   98.5          −0.016                0.544     3.481
(i.e., the difference of AP between female and male is less than 5%), (2) masculinity/femininity attributes, which are considered as AOI in Sect. 4.3 to construct the sex-level balanced dataset, and (3) non-sex-related facial attributes which nevertheless induce much bias even with sex-level balanced training; these are considered as AOI, in addition to the masculinity/femininity attributes, in Sect. 4.4 to construct the dataset that is balanced at both the sex level and the facial attribute level.
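The 5% AP-gap rule can be applied mechanically, as in the short editor-written sketch below; the classifier outputs, the use of scikit-learn's average_precision_score and the two-way bucketing are illustrative assumptions, and the further split of the biased bucket into masculinity/femininity versus non-sex-related attributes is done by inspection in the paper.

import numpy as np
from sklearn.metrics import average_precision_score

def group_attributes(scores, labels, sex, attribute_names, gap_threshold=5.0):
    # scores: (N, A) predicted probabilities, labels: (N, A) binary ground truth,
    # sex:    (N,)  protected attribute (1 = female, 0 = male).
    groups = {"unbiased": [], "biased": []}
    female, male = sex == 1, sex == 0
    for a, name in enumerate(attribute_names):
        ap_f = 100 * average_precision_score(labels[female, a], scores[female, a])
        ap_m = 100 * average_precision_score(labels[male, a], scores[male, a])
        key = "unbiased" if abs(ap_f - ap_m) < gap_threshold else "biased"
        groups[key].append(name)
    return groups

# Toy usage with random data for two attributes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, (1000, 2))
scores = np.clip(labels + 0.3 * rng.standard_normal((1000, 2)), 0, 1)
sex = rng.integers(0, 2, 1000)
print(group_attributes(scores, labels, sex, ["Chubby", "BlondHair"]))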
4.2 Synthetic Attribute-Level Balanced Datasets
To construct the synthetic dataset D_syn, we train the generative model solely on the training set of the CelebA dataset containing 162,770 images to ensure isolation from the testing set, since the images generated by the trained GAN are used for the following tasks. For the generative model choice to generate high-quality images (which is elaborated in the appendix), we use StyleGAN2 [30], since style mixing regularization in StyleGAN2 facilitates our goal to assign different specific facial attributes and mitigate the interference from other unassigned facial attributes. We persist in training the generative model to generate more realistic images. As a standard metric, FID is used to evaluate the sample quality of generated images. The FID of the results presented in the paper is 3.56, which is comparable to recently published GANs [40,44] trained on the CelebA dataset. Although the benchmark FID for image generation on the CelebA dataset is 2.71 [38], our objective is to utilize generated images to improve the fairness of existing datasets rather than to improve the image quality of generated images. To assign attributes, we use intra_threshold = 2√2 and inter_threshold = √2. In general, we empirically found that any settings for these two hyperparameters within the recommended range [√2, 2√2] will not significantly affect the performance of downstream classification tasks, which confirms our theoretical discussion in Sect. 4.5. Furthermore, for resolution spectrum separation, StyleGAN2 [30] already provides a reference for assigning different attributes in different resolutions, e.g., high-level attributes (hair style, face shape) in coarse spatial resolutions (4²–8²), and racial appearance (colors of eyes, hair, skin) in fine spatial resolutions (16²–1024²). The dimensionality of the latent space is 512 and the resolution of generated images is set to 256², so that by [30] we generally have a 14-layer 512-dimensional latent space in which to assign attributes. Specifically, we leave 4² original for the basic identity construction. For resolution choices to assign attributes, we assign face shape attributes (Chubby, Big Nose,
Table 2. Performance and fairness comparison on facial attribute classification.

Method              Chubby  BigNose  WavyHair  PaleSkin  BlondHair  DoubleChin  PointyNose  BagsUnderEyes  HighCheekbones  Average

AP ↑
Baseline            62.1    71.2     88.2      69.1      92.0       61.2        64.9        67.8           95.4            74.7
Resampling [6]      54.5    62.2     87.2      67.2      87.0       46.4        61.7        58.3           95.1            68.8
GAN-Debiasing [48]  58.8    69.5     85.6      68.5      90.6       58.5        63.5        63.1           95.3            72.6
Ours                60.5    66.5     86.8      69.0      92.1       58.6        61.4        62.9           95.5            72.6

DEO ↓
Baseline            28.7    35.6     33.0      11.9      10.8       26.7        32.8        20.1           10.8            23.4
Resampling [6]      15.0    12.7     12.1       6.2       4.3        4.2         5.2        19.2            5.9             9.4
GAN-Debiasing [48]  26.6    32.7     23.3      10.8       3.6       22.1        28.8        16.8            9.3            19.3
Ours                21.2     3.9     11.8       9.3       3.5        3.3         3.1        15.0            4.3             8.4

BA ↓
Baseline             1.72    5.83    −3.27      1.19      0.12       0.46        4.78        2.27          −1.32            1.31
Resampling [6]       1.38   −8.69    −5.49     −0.30     −3.37      −1.73       −3.21       −9.44          −6.17           −4.11
GAN-Debiasing [48]   1.30    5.51    −3.60      0.20      0.50       0.46        3.71        1.14          −1.89            0.81
Ours                 1.16   −6.71    −5.10     −0.44     −4.61      −1.16       −5.35       −5.95          −4.27           −3.60

KL ↓
Baseline            0.39    0.60     0.42      0.08      0.46       0.30        0.32        0.27           0.12            0.33
Resampling [6]      0.18    0.13     0.15      0.07      0.05       0.17        0.21        0.26           0.12            0.15
GAN-Debiasing [48]  0.37    0.54     0.33      0.07      0.37       0.29        0.20        0.26           0.09            0.28
Ours                0.33    0.13     0.06      0.04      0.34       0.09        0.04        0.08           0.08            0.13

dcor² ↓
Baseline            0.58    0.65     0.82      0.53      0.34       0.58        0.85        0.86           0.82            0.67
Resampling [6]      0.58    0.59     0.77      0.47      0.69       0.53        0.83        0.65           0.56            0.63
GAN-Debiasing [48]  0.35    0.54     0.50      0.08      0.17       0.31        0.39        0.43           0.33            0.34
Ours                0.18    0.31     0.33      0.06      0.23       0.17        0.24        0.25           0.30            0.23

RLB ↓
Baseline            0.63    1.24     1.56      0.03      1.12       0.45        0.55        0.82           1.36            0.86
Resampling [6]      0.58    1.13     1.45      0.03      1.03       0.44        0.50        0.63           0.96            0.75
GAN-Debiasing [48]  0.34    1.14     1.42      0.01      0.86       0.31        0.49        0.61           0.69            0.65
Ours                0.28    1.11     1.30      0.01      0.73       0.30        0.09        0.43           0.39            0.52
Pointy Nose, High Cheekbones and Double Chin) in 8², fine face shape attributes (Bags Under Eyes, Wavy Hair and Straight Hair) in 16², hair color attributes (Black Hair, Blond Hair, Brown Hair and Gray Hair) in 32²–64², and skin color attributes (Pale Skin) in 128²–256². Besides, we assign masculinity/femininity attributes in 8²–16². As shown in Fig. 2, the generated images are listed in pairs with highlighted femininity/masculinity attributes, and with similar non-sex facial attributes (e.g., skin color and hair color) and other photo settings (e.g., lighting, pose and background). Besides, in Fig. 3, we controllably assign femininity/masculinity attributes and another facial attribute. Furthermore, in Fig. 4, we concurrently assign sex and multiple facial attributes and generally keep other unassigned facial attributes unchanged, which demonstrates the advantage that attribute assignments do not interfere with each other.
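The resolution bookkeeping can be written down as a small editor-added sketch; the mapping of StyleGAN2 synthesis resolutions (4²–256²) to the 14 layers of the extended latent space and the helper names are assumptions chosen to mirror the assignment just described, not the authors' code.

# Assumed layer layout for a 256x256 StyleGAN2: two style layers per resolution,
# 4^2 ... 256^2 -> layers 0-1, 2-3, 4-5, 6-7, 8-9, 10-11, 12-13.
RESOLUTION_LAYERS = {4: (0, 1), 8: (2, 3), 16: (4, 5), 32: (6, 7),
                     64: (8, 9), 128: (10, 11), 256: (12, 13)}

# Attribute groups and the resolutions they are assigned to in the text above.
ATTRIBUTE_RESOLUTIONS = {
    "face_shape": [8],                    # Chubby, Big Nose, Pointy Nose, ...
    "fine_face_shape": [16],              # Bags Under Eyes, Wavy/Straight Hair
    "hair_color": [32, 64],
    "skin_color": [128, 256],
    "masculinity_femininity": [8, 16],
}
# Resolution 4^2 is left untouched for the basic identity construction.

def layers_for(group):
    # Return the style-layer indices that may be perturbed for an attribute group.
    layers = []
    for res in ATTRIBUTE_RESOLUTIONS[group]:
        layers.extend(RESOLUTION_LAYERS[res])
    return layers

print(layers_for("hair_color"))   # -> [6, 7, 8, 9]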
4.3 Sex Classification
As in [8,29], the imbalance of training dataset may hamper the accuracy of sex classification, we therefore validate the fairness improvement of our synthetic datasets on sex classification in this section. To help downstream classification models to well learn the difference between female and male (the useful information for sex classification), given an identity seed, we only modify femininity/masculinity attributes and leave other facial attributes unchanged, as shown in Fig. 2, which also ensures that there are sufficient samples with large diversity of facial attributes from identity seeds in both female group and male group. Experiment Setup. ResNet-50 [22] is used as the backbone for sex classifier. The baseline model is trained with original CelebA dataset [39] under two settings—(1) Original to keep the training set of CelebA dataset fully original with the whole set of 162,770 images, and (2) Balanced to balance the training set across sex. For training with synthetic dataset, we also conduct two types of
Table 3. Comparison of debiasing models on facial attribute classification.

Method                      mAP ↑                       Information leakage ↓       Statistical dependence ↓
                            Female   Male   Overall     DEO     BA      KL          dcor² [54]   RLB [34]
Baseline                    75.3     72.2   74.1        20.9    0.99    0.27        0.85         1.33
Adversarial forgetting [3]  73.2     70.4   72.1        18.8    0.61    0.25        0.51         1.25
Domain adaptation [69]      74.5     72.3   73.7        19.7    0.82    0.21        0.48         1.28
Domain independent [61]     74.6     74.2   74.4        20.8    0.55    0.27        0.37         1.12
Ours                        73.2     73.0   73.1         8.4   −3.26    0.16        0.24         0.65
experiments—(1) Same size to combine one half of balanced original training set with a same size balanced synthetic dataset to be mixed dataset so that the number of mixed dataset is the same as the number of training images in the balanced original dataset, and (2) Supplement to supplement the lack of male images compared with female images in the original dataset so that the mixed dataset is balanced in the whole dataset scale. For comparison with the other attribute-level balance method [48], we conduct experiments under the original setting in their paper (the original CelebA dataset combined with the balanced synthetic dataset with 160,000 sex pairs of images) and two same settings of ours. The testing set is the original testing CelebA dataset for all experiments. Evaluation Protocol. To demonstrate the effectiveness of our method on sex classification and facial attribute classification, we use accuracy and fairness as the main evaluation metrics. First, we compare the classification performance between the model trained on synthetic datasets and the original dataset to verify that the testing accuracy on the real testing dataset is preserved at the same level, which demonstrates that the generated images yield proper facial attributes in the metrics of the real dataset. Meanwhile, we use several fairness metrics for a comprehensive fairness comparison since different types of metrics assess bias in different directions. According to the taxonomy of bias assessment metrics in [34], we select several representative metrics in these categories. We select Bias Amplification (BA) [57,61] among information leakage-based metrics and two types of metrics based on statistical dependence, Distance Correlation (dcor2 ) [1,54] and Representation-Level Bias (RLB) [34] to increase the diversity of metrics and comprehensively evaluate fairness in different ways. Results. As shown in Table 1, we can see the testing accuracy of synthetic dataset is kept at the same level under both two settings, which demonstrates that the generated images yield proper femininity/masculinity attributes in the original dataset. Furthermore, although the classification accuracy is already saturated, with additional images under the Supplement setting, we improve the accuracy in minority groups. On the other hand, for fairness evaluation compared with other datasets, our method further improves fairness. Although [48] uses more images than the original dataset, our method yields even better results in both accuracy and fairness metrics with fewer training images. We elaborate the discussion on the relation between the size of the synthetic dataset and performance of our method in the appendix.
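Of the metrics used above, distance correlation is the least standard to compute by hand, so a compact editor-written NumPy version is given below; it follows the usual dCor definition applied to learned representations and sex labels and is only a sketch, not the exact implementation used in [54] or [34].

import numpy as np

def _pairwise_centered(a):
    # Double-centered Euclidean distance matrix of the samples in a.
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()

def distance_correlation_sq(x, y):
    # Squared distance correlation between samples x (N, d) and y (N,) or (N, d').
    A, B = _pairwise_centered(x), _pairwise_centered(y)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else dcov2 / denom

# Toy usage: representations that encode sex give a high dcor^2.
rng = np.random.default_rng(0)
sex = rng.integers(0, 2, 500)
features = np.c_[sex + 0.1 * rng.standard_normal(500), rng.standard_normal(500)]
print(round(distance_correlation_sq(features, sex), 3))

A value near zero indicates that the representations carry little statistical dependence on the protected attribute, which is the direction of improvement reported in Tables 1 and 2.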
Fig. 4. Examples of generated images with multiple assigned facial attributes.
4.4 Facial Attribute Classification
Having found that the remaining dataset bias of facial attribute classification on CelebA dataset stems from imbalance at both protected attribute level and facial attribute level as discussed in Sect. 4.1, we construct the synthetic dataset which is balanced in both these two levels to study the effectiveness of our method for all non-sex-related facial attributes, and compare our method with methods based on sex-level balance [6] and facial attribute-level balance [48]. Furthermore, we highlight the comparisons with debiasing models [3,61,69]. Experiment Setup. We train a multitask facial attribute classifier with ResNet-50 [22]. The baseline is trained with sex-level balanced CelebA dataset constructed from original CelebA dataset. Our method combines a portion of original dataset without surplus samples of some facial attributes and the synthetic dataset which supplements insufficient samples of AOI so that the mixed dataset is balanced at both sex level and facial attribute level, as illustrated in Fig. 1. Evaluation Protocol. We use Average Precision (AP) as the performance metric. For bias assessment, beside BA, dcor2 and RLB, Difference in Equal Opportunity (DEO) [42,47] as a metric version of equal opportunity and KLdivergence between score distribution (KL) [11,48] as a stronger notion stemming from equalized odds are used to verify the achievements of Eqs. (2) and (3). Results. In Table 2, we present one representative facial attribute for the mutually exclusive facial attributes, e.g., Wavy Hair and Straight Hair, and full results are presented in appendix. As AP is preserved at the same level for all AOIs, our method outperforms other methods for Blond Hair and High Cheekbones. In fairness comparison with baseline, all three methods introduce fairness under all metrics in a same trend. As pointed by [34], dcor2 and RLB, the metrics based on statistical dependence are more consistent and stable. In this sense, although strategically resampling [36] outperforms our method under BA, it yields second worse performance on the metrics based on statistical dependence. Furthermore, with better DEO and KL, we verify the achievement of Eqs. (2) and (3) in Sect. 3, which is unreachable for sex-level balanced dataset and strategically resampling. According to results of imbalanced training, due to the existence of the facial attribute-level skew in the dataset, the classifier trained with these skew datasets may be biased already. In this sense, compared with [48], our method is more stable in all facial attributes since we assign label directly rather than rely on
such an unreliable facial attribute classifier trained on the original dataset as in [48], which may be harmful for fairness. Compared with Table 1, the range of the two metrics based on statistical dependence in sex classification is clearly higher than their range in facial attribute classification, since they directly reveal the statistical dependence between the learned representations used to predict sex and the sex labels themselves, which better assesses the bias for sex classification. We also conduct an overall comparison of average results over all non-sex-related AOIs (7 facial attributes in the paper and 7 mutually exclusive facial attributes in the appendix) with several debiasing models based on adversarial forgetting [3], domain adaptation [69] and domain independent training [61]. We train a ResNet-50 [22] to classify all non-sex-related AOIs on the original CelebA dataset as the baseline model. For the multi-label classification, mean Average Precision (mAP) is used to evaluate the overall performance. In the classification performance comparison of Table 3, the domain independent method outperforms the other methods. In parallel, without using in-process debiasing techniques, our method yields comparable classification performance and outperforms the other methods in all bias assessment metrics, which demonstrates a comprehensive fairness improvement. In general, there is a tradeoff between classification performance and fairness performance, i.e., although the domain independent method outperforms other methods in mAP, our method improves fairness the most at an acceptable expense of classification performance. Furthermore, the proposed synthetic dataset is model-agnostic, as compared under different backbones in the appendix.
4.5 Ablation Study
Intra-class Similarity and Inter-class Difference. In this section, we discuss intra-class similarity and inter-class difference, which involve the two hyperparameters intra_threshold and inter_threshold. Consider the task of generating female and male images, for which we are given masculinity attribute seeds S_male and femininity attribute seeds S_female. Intra-class similarity is deprecated when intra_threshold is small, since there may be no dimension index l where ∀e_i, e_j ∈ S_male, |e_i^l − e_j^l| < intra_threshold, i.e., A_male^intra is empty. By contrast, inter-class difference will be deprecated when inter_threshold is large. For example, given an identity latent vector e_ID whose generated image is female, when intra-class similarity and inter-class difference are both deprecated, the generated image is the same as the prime image generated from e_ID. As shown in Fig. 5a, without intra-class similarity, the generated image is apparently female. Further, when we apply intra-class similarity and deprecate inter-class difference, the generated image is transferred to be male with a few remaining femininity attributes (e.g., hair), since the generative model has learned masculinity attributes but has not fully suppressed femininity attributes. On the other hand, if intra_threshold is large or inter_threshold is small, A_male^intra or B_male^inter contains all dimension indices, so that intra-class similarity or inter-class difference is overemphasized. As shown in Fig. 5b, compared with images generated under proper threshold settings, the images generated under strong threshold settings in the top row are cookie-cutter, without the randomness from the identity seeds. In the whole process, intra-class similarity contributes mostly and
Fig. 5. The role of intra-class similarity and inter-class difference.
inter-class difference plays an auxiliary role. Since the random latent vectors are simulated under the standard normal distribution N(0, 1), the difference between two latent vectors follows the normal distribution N(0, 2), whose standard deviation is √2. The proper range for the two thresholds is therefore [√2, 2√2], so that A_Y^intra may contain the dimension indices in the main lobe representing attribute similarity among S_Y, and B_Y^inter may exclude the dimension indices in the main lobe representing attribute similarity between S_Y and S_Ȳ, where S_Ȳ is the set of latent vectors without the Attribute of Interest Y.
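This √2 argument is easy to check numerically; the snippet below is an editor-added sanity check, not part of the paper, and the sample size is arbitrary.

import numpy as np

rng = np.random.default_rng(0)
e_i, e_j = rng.standard_normal((2, 1_000_000))   # coordinates of two N(0, 1) seeds
diff = e_i - e_j

print(round(diff.std(), 3))                      # ~1.414, i.e. sqrt(2)
# Fraction of coordinates counted as "similar" or "different"
# at the two ends of the recommended range [sqrt(2), 2*sqrt(2)].
for t in (np.sqrt(2), 2 * np.sqrt(2)):
    print(round(t, 3),
          "similar:", round((np.abs(diff) < t).mean(), 3),
          "different:", round((np.abs(diff) > t).mean(), 3))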
5 Conclusion
Through the investigation of imbalance at the facial attribute level, we can clearly identify the root cause of the uneven performance across protected attributes. To address the imbalance, from the perspective of information theory, we try our best to mitigate the loss of information from the original data (i.e., comparable performance) and to express such information in a fairer way (i.e., improved fairness) while the amount of whole information is fixed. In comparison with debiasing models, we have found a fairness-performance tradeoff in model training, analogous to the fairness-efficiency tradeoff in the real world. Still, the valuable remark is that although the network trained with the proposed synthetic data does not outperform the debiasing models in recognition performance, it is impressive that synthetic images can achieve consistent performance with real data and further yield better fairness. Furthermore, in the image generation process, our method overcomes the difficulty of obtaining labels through intra-class similarity and inter-class difference instead of relying on a shallow attribute classifier or significant extra human annotations. Therefore, our method can be scalably used for models fully trained with synthetic datasets in the future. Besides, it is a good direction for future researchers to extend our method to other face datasets and introduce fairness for more facial attributes. Finally, our work presents a good example of the possibility of same-level performance for computer vision models partially or fully trained with synthetic datasets. Acknowledgement. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via [2022-21102100007]. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
References 1. Adeli, E., et al.: Representation learning with statistical independence to mitigate bias. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2513–2523 (2021) 2. Alaa, A.M., van Breugel, B., Saveliev, E., van der Schaar, M.: How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. arXiv preprint arXiv:2102.08921 (2021) 3. Alvi, M., Zisserman, A., Nell˚ aker, C.: Turning a blind eye: explicit removal of biases and variation from deep neural network embeddings. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018) 4. Balakrishnan, G., Xiong, Y., Xia, W., Perona, P.: Towards causal benchmarking of biasin face analysis algorithms. In: Ratha, N.K., Patel, V.M., Chellappa, R. (eds.) Deep Learning-Based Face Analytics. ACVPR, pp. 327–359. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-74697-1 15 5. Bellamy, R.K., et al.: AI fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943 (2018) 6. Bickel, S., Br¨ uckner, M., Scheffer, T.: Discriminative learning under covariate shift. J. Mach. Learn. Res. 10(9) (2009) 7. Buolamwini, J., Gebru, T.: Gender shades: intersectional accuracy disparities in commercial gender classification. In: Friedler, S.A., Wilson, C. (eds.) Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Proceedings of Machine Learning Research, vol. 81, pp. 77–91. PMLR (2018). http://proceedings. mlr.press/v81/buolamwini18a.html 8. Buolamwini, J., Raji, I.D.: Actionable auditing: investigating the impact of publicly naming biased performance results of commercial AI products. Conference on Artificial Intelligence, Ethics, and Society (2019) 9. Calmon, F.P., Wei, D., Vinzamuri, B., Ramamurthy, K.N., Varshney, K.R.: Optimized pre-processing for discrimination prevention. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3995– 4004 (2017) 10. Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: periodic implicit generative adversarial networks for 3D-aware image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5799–5809 (2021) 11. Chen, M., Wu, M.: Towards threshold invariant fair classification. In: Conference on Uncertainty in Artificial Intelligence, pp. 560–569. PMLR (2020) 12. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 13. Choi, K., Grover, A., Singh, T., Shu, R., Ermon, S.: Fair generative modeling via weak supervision. In: International Conference on Machine Learning, pp. 1887– 1898. PMLR (2020) 14. Das, A., Dantcheva, A., Bremond, F.: Mitigating bias in gender, age and ethnicity classification: a multi-task convolution neural network approach. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018) 15. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through awareness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226 (2012)
16. Gong, S., Liu, X., Jain, A.K.: Jointly de-biasing face recognition and demographic attribute estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374, pp. 330–347. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58526-6 20 17. Gong, S., Liu, X., Jain, A.K.: Mitigating face recognition bias via group adaptive classifier. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3414–3424 (2021) 18. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014) 19. Grother, P., Ngan, M., Hanaoka, K.: Face recognition vendor test (FVRT): part 3, demographic effects. National Institute of Standards and Technology (2019) 20. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, pp. 3323–3331. Curran Associates Inc., Red Hook (2016) 21. Hariharan, B., Girshick, R.: Low-shot visual recognition by shrinking and hallucinating features. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3018–3027 (2017) 22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 23. Hernandez-Ortega, J., Galbally, J., Fierrez, J., Beslay, L.: Biometric quality: review and application to face recognition with faceqnet. arXiv preprint arXiv:2006.03298 (2020) 24. Hudson, D.A., Zitnick, L.: Generative adversarial transformers. In: International Conference on Machine Learning, pp. 4487–4499. PMLR (2021) 25. Jaiswal, A., AbdAlmageed, W., Wu, Y., Natarajan, P.: Bidirectional conditional generative adversarial networks. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 216–232. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6 14 26. Jaiswal, A., Moyer, D., Ver Steeg, G., AbdAlmageed, W., Natarajan, P.: Invariant representations through adversarial forgetting. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 4272–4279 (2020) 27. Jiang, Y., Chang, S., Wang, Z.: TransGAN: two pure transformers can make one strong GAN, and that can scale up. In: Advances in Neural Information Processing Systems, vol. 34 (2021) 28. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012) 29. Karkkainen, K., Joo, J.: Fairface: face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1548–1558 (2021) 30. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119 (2020) 31. Kim, B., Kim, H., Kim, K., Kim, S., Kim, J.: Learning not to learn: Training deep neural networks with biased data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9012–9020 (2019) 32. Kyono, T., van Breugel, B., Berrevoets, J., van der Schaar, M.: Decaf: generating fair synthetic data using causally-aware generative networks. In: NeurIPS (2021) 33. Leino, K., Fredrikson, M., Black, E., Sen, S., Datta, A.: Feature-wise bias amplification. 
In: International Conference on Learning Representations (2019). https:// openreview.net/forum?id=S1ecm2C9K7
CAT
379
34. Li, J., Abd-Almageed, W.: Information-theoretic bias assessment of learned representations of pretrained face recognition. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pp. 1–8. IEEE (2021) 35. Li, X., Jia, X., Jing, X.Y.: Negative-aware training: be aware of negative samples. In: ECAI 2020, pp. 1269–1275. IOS Press (2020) 36. Li, Y., Vasconcelos, N.: Repair: removing representation bias by dataset resampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9572–9581 (2019) 37. Ling, C.X., Sheng, V.S.: Cost-sensitive learning and the class imbalance problem. In: Encyclopedia of Machine Learning 2011, pp. 231–235 (2008) 38. Liu, L., Ren, Y., Lin, Z., Zhao, Z.: Pseudo numerical methods for diffusion models on manifolds. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=PlKWVd2yBkY 39. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015) 40. Maggipinto, M., Terzi, M., Susto, G.A.: Introvac: introspective variational classifiers for learning interpretable latent subspaces. Eng. Appl. Artif. Intell. 109, 104658 (2022) 41. Maze, B., et al.: IARPA Janus benchmark-C: face dataset and protocol. In: 2018 International Conference on Biometrics (ICB), pp. 158–165. IEEE (2018) 42. Morales, A., Fierrez, J., Vera-Rodriguez, R., Tolosana, R.: Sensitivenets: learning agnostic representations with application to face images. IEEE Trans. Pattern Anal. Mach. Intell. 43(6), 2158–2164 (2020) 43. Niemeyer, M., Geiger, A.: Giraffe: representing scenes as compositional generative neural feature fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453–11464 (2021) 44. Pandey, K., Mukherjee, A., Rai, P., Kumar, A.: Diffusevae: efficient, controllable and high-fidelity generation from low-dimensional latents. arXiv preprint arXiv:2201.00308 (2022) 45. Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: Styleclip: text-driven manipulation of stylegan imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2085–2094 (2021) 46. Pezdek, K., Bland´ on-Gitlin, I., Moore, C.: Children’s face recognition memory: more evidence for the cross-race effect. J. Appl. Psychol. 88(4), 760–3 (2003) 47. Quadrianto, N., Sharmanska, V., Thomas, O.: Discovering fair representations in the data domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8227–8236 (2019) 48. Ramaswamy, V.V., Kim, S.S., Russakovsky, O.: Fair attribute classification through latent space de-biasing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9301–9310 (2021) 49. Robinson, J.P., Livitz, G., Henon, Y., Qin, C., Fu, Y., Timoner, S.: Face recognition: too bias, or not too bias? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, p. 1 (2020) 50. Sandfort, V., Yan, K., Pickhardt, P.J., Summers, R.M.: Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Sci. Rep. 9(1), 1–9 (2019) 51. Sattigeri, P., Hoffman, S.C., Chenthamarakshan, V., Varshney, K.R.: Fairness GAN: generating datasets with fairness properties using a generative adversarial network. IBM J. Res. Dev. 63(4/5), 3-1 (2019)
380
J. Li and W. Abd-Almageed
52. Sharmanska, V., Hendricks, L.A., Darrell, T., Quadrianto, N.: Contrastive examples for addressing the tyranny of the majority. arXiv preprint arXiv:2004.06524 (2020) 53. Stone, R.S., Ravikumar, N., Bulpitt, A.J., Hogg, D.C.: Epistemic uncertaintyweighted loss for visual bias mitigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2898–2905 (2022) 54. Sz´ekely, G.J., Rizzo, M.L., Bakirov, N.K., et al.: Measuring and testing dependence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007) 55. Tartaglione, E., Barbano, C.A., Grangetto, M.: End: entangling and disentangling deep representations for bias correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13508–13517 (2021) 56. Terh¨ orst, P., et al.: A comprehensive study on face recognition biases beyond demographics. IEEE Trans. Technol. Soc. 3(1), 16–30 (2021) 57. Wang, A., Russakovsky, O.: Directional bias amplification. arXiv preprint arXiv:2102.12594 (2021) 58. Wang, M., Deng, W.: Mitigate bias in face recognition using skewness-aware reinforcement learning. arXiv preprint arXiv:1911.10692 (2019) 59. Wang, M., Deng, W., Hu, J., Tao, X., Huang, Y.: Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 692–702 (2019) 60. Wang, X., Xie, L., Dong, C., Shan, Y.: Real-ESRGAN: training real-world blind super-resolution with pure synthetic data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1905–1914 (2021) 61. Wang, Z., et al.: Towards fairness in visual recognition: effective strategies for bias mitigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8919–8928 (2020) 62. Xu, D., Yuan, S., Zhang, L., Wu, X.: Fairgan: fairness-aware generative adversarial networks. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 570–575. IEEE (2018) 63. Xu, D., Yuan, S., Zhang, L., Wu, X.: Fairgan+: achieving fair data generation and classification through generative adversarial nets. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 1401–1406. IEEE (2019) 64. Yin, X., Yu, X., Sohn, K., Liu, X., Chandraker, M.: Feature transfer learning for face recognition with under-represented data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5704–5713 (2019) 65. Zemel, R., Wu, Y., Swersky, K., Pitassi, T., Dwork, C.: Learning fair representations. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, vol. 28, pp. III-325–III-333. JMLR.org (2013) 66. Zhang, H., Grimmer, M., Ramachandra, R., Raja, K., Busch, C.: On the applicability of synthetic data for face recognition. In: 2021 IEEE International Workshop on Biometrics and Forensics (IWBF), pp. 1–6. IEEE (2021) 67. Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818 (2017) 68. Zhao, J., Yan, S., Feng, J.: Towards age-invariant face recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2020) 69. Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.W.: Men also like shopping: reducing gender bias amplification using corpus-level constraints. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2941–2951 (2017). 
https://www.aclweb.org/anthology/D17-1319
CAT
381
70. Zhu, W., Zheng, H., Liao, H., Li, W., Luo, J.: Learning bias-invariant representation by cross-sample mutual information minimization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15002–15012 (2021) 71. Zhu, X., Anguelov, D., Ramanan, D.: Capturing long-tail distributions of object subcategories. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 915–922 (2014). https://doi.org/10.1109/CVPR.2014.122 72. Zhu, X., Liu, Y., Li, J., Wan, T., Qin, Z.: Emotion classification with data augmentation using generative adversarial networks. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (eds.) PAKDD 2018. LNCS (LNAI), vol. 10939, pp. 349–360. Springer, Cham (2018). https://doi.org/10.1007/978-3-31993040-4 28
Weakly Supervised Invariant Representation Learning via Disentangling Known and Unknown Nuisance Factors

Jiageng Zhu1,2,3(B), Hanchen Xie2,3, and Wael Abd-Almageed1,2,3

1 USC Ming Hsieh Department of Electrical and Computer Engineering, Los Angeles, USA
2 USC Information Sciences Institute, Marina del Rey, USA
3 Visual Intelligence and Multimedia Analytics Laboratory, Marina del Rey, USA
{jiagengz,hanchenx,wamageed}@isi.edu

Abstract. Disentangled and invariant representations are two critical goals of representation learning, and many approaches have been proposed to achieve either one of them. Since the two goals are in fact complementary, we propose a framework that accomplishes both simultaneously. We introduce a weakly supervised signal to learn a disentangled representation that consists of three splits containing predictive, known nuisance and unknown nuisance information, respectively. Furthermore, we incorporate a contrastive method to enforce representation invariance. Experiments show that the proposed method outperforms state-of-the-art (SOTA) methods on four standard benchmarks, and that it attains better adversarial defense ability than competing methods without adversarial training.
1 Introduction

Robust representation learning, which aims at preventing overfitting and increasing generality, can benefit various downstream tasks [12, 17, 29]. Typically, a DNN learns to encode a representation which contains all factors of variation of the data, such as pose, expression, illumination, and age for face recognition, as well as other nuisance factors which are unknown or unlabelled. Disentangled representation learning and invariant representation learning are often used to address these challenges. For disentangled representation learning, Bengio et al. [1] define a disentangled representation as one in which a change in a given dimension corresponds to a variation of one and only one generative factor of the input data. Although many unsupervised learning methods have been proposed [3, 13, 16], Locatello et al. [22] have shown both theoretically and empirically that factor-wise disentanglement is impossible without supervision or inductive bias. To this end, recent works have adopted the concepts of semi-supervised learning [24] and weakly supervised learning [23]. On the other hand, Jaiswal et al. [14] take an invariant representation learning perspective in which they split the representation z into two parts z = [zp, zn], where zp only contains prediction-related information and zn merely contains nuisance factors.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25085-9_22.
Invariant representation learning aims to encode predictive latent factors that are invariant to the nuisance factors in the inputs [14, 25, 26, 28, 32]. By removing the information of nuisance factors, invariant representation learning achieves good performance on challenges such as adversarial attacks [6] and out-of-distribution generalization [14]. Furthermore, invariant representation learning has also been studied in reinforcement learning settings [5].
Fig. 1. Given an image with nuisance factors, (a): invariant representation learning splits predictive factors from all nuisance factors; (b): disentangled representation learning splits the known nuisance factors but leaving predictive and unknown nuisance factors; (c): Our method splits predictive, known nuisance and unknown nuisance factors simultaneously.
Despite the success of both disentangled and invariant representation learning methods, the relation between the two has not been thoroughly investigated. As shown in Fig. 1(a), invariant representation learning methods learn representations that maximize prediction accuracy by separating predictive factors from all other nuisance factors, while leaving the representations of both known and unknown nuisance factors entangled. Meanwhile, as illustrated in Fig. 1(b), although supervised disentangled representation methods can identify known nuisance factors, they fail to handle unknown nuisance factors, which may hurt downstream prediction tasks. Inspired by this observation, we propose a new training framework that seeks to achieve disentanglement and invariance of the representation simultaneously. To split the known nuisance factors znk from the predictive factors zp and the unknown nuisance factors znu, we introduce weak supervision signals for disentangled representation learning. To make the predictive factors zp independent of all nuisance factors zn, we introduce a new invariant regularizer via reconstruction. The predictive factors from the same class are further aligned through a contrastive loss to enforce invariance. Moreover, since our model achieves a more robust representation than other invariant models, it is demonstrated to obtain better adversarial defense ability. In summary, our main contributions are:
– Extending and combining both disentangled and invariant representation learning and proposing a novel approach to robust representation learning.
– Proposing a novel strategy for splitting the predictive, known nuisance and unknown nuisance factors, where mutual independence of those factors is achieved by the reconstruction step used during training.
– Outperforming state-of-the-art (SOTA) models on invariance tasks on standard benchmarks.
– Showing that the invariant latent representation trained using our method is also disentangled.
– Showing that, without using adversarial training, our model has better adversarial defense ability than other invariant models, which reflects the increased generality obtained through our method.
2 Related Work

Disentangled Representation Learning: Early works on disentangled representation learning aim at learning disentangled latent factors z by implementing an autoencoder framework [3, 13, 16]. The variational autoencoder (VAE) [17] is commonly used as the basic framework of disentanglement learning methods. The VAE uses a DNN to map the high-dimensional input x to a low-dimensional representation z; the latent representation z is then mapped back to a high-dimensional reconstruction x̂. As shown in Eq. (1), the overall objective used to train the VAE is the negative of the evidence lower bound (ELBO) of the likelihood log pθ(x1, x2, ..., xn), which contains two parts: the quality of the reconstruction and the Kullback-Leibler divergence (DKL) between the distribution qφ(z|x) and the assumed prior p(z):

L_{VAE} = -\mathrm{ELBO} = -\sum_{i=1}^{N}\Big(\mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big] - D_{KL}\big(q_\phi(z|x^{(i)})\,\|\,p(z)\big)\Big)   (1)
Advanced methods based on the VAE improve disentanglement performance by introducing new disentanglement regularizers. β-VAE [13] modifies the original VAE by adding a hyper-parameter β to balance the weights of the reconstruction loss and DKL; when β > 1, the model gains stronger disentanglement regularization. AnnealedVAE [3] implements a dynamic schedule that changes β from a large to a small value during training. FactorVAE [16] uses a discriminator to distinguish between the joint distribution of the latent factors q(z) and the product of the marginal distributions of every latent factor, ∏i q(zi); with the discriminator, FactorVAE automatically finds a better balance between reconstruction quality and disentangled representation. Compared to β-VAE, DIP-VAE [18] adds another regularizer D(qφ(z)||p(z)) between the marginal distribution of latent factors qφ(z) = ∫ qφ(z|x)p(x)dx and the prior p(z) to further aid disentangled representation learning, where D can be any proper distance function such as the mean squared error. β-TCVAE [7] decomposes the DKL term used in the VAE into three parts: total correlation, index-coded mutual information and dimension-wise KL divergence. To overcome the impossibility result of [22], AdaVAE [23] purposely chooses pairs of inputs as a supervision signal to learn representation disentanglement.
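To make the ELBO-based objective of Eq. (1) and its β-weighted variant concrete, the following PyTorch-style sketch computes the loss for a Gaussian encoder. It is an illustrative sketch only, assuming inputs normalized to [0, 1] and a diagonal Gaussian posterior; function and variable names are ours, not taken from any of the cited implementations.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_rec, mu, logvar, beta=1.0):
    """Negative ELBO with a beta weight on the KL term (beta = 1 recovers the plain VAE).

    x, x_rec  : input and reconstruction, values assumed in [0, 1]
    mu, logvar: parameters of the diagonal Gaussian q(z|x), shape (N, D)
    """
    n = x.size(0)
    # Reconstruction term: a Bernoulli/BCE likelihood; MSE would be an alternative choice.
    rec = F.binary_cross_entropy(x_rec, x, reduction="sum") / n
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / n
    return rec + beta * kl
```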
Invariant Representation Learning: The methods that aim at learning invariant representations can be classified into two groups: those that require annotations of the nuisance factors [21, 25] and those that do not. A considerable number of approaches using nuisance factor annotations have recently been proposed. By adding a regularizer that minimizes the Maximum Mean Discrepancy (MMD) to a neural network (NN), the NN+MMD approach [21] removes the effect of nuisance factors from the predictive factors. Building on NN+MMD, the Variational Fair Autoencoder (VFAE) [25] uses special priors that encourage independence between the nuisance factors and the ideal invariant factors. The Controllable Adversarial Invariance (CAI) [32] approach applies the gradient reversal trick [9], which penalizes the model if the latent representation carries information about the nuisance factors. CVIB [26] proposes a conditional form of the Information Bottleneck (IB) and encourages invariant representation learning by optimizing its variational bounds.
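As a reference point for the gradient reversal trick [9] mentioned above, a minimal PyTorch implementation could look as follows. This is a generic sketch of the trick, not the code of CAI [32]; the scaling factor lam is a hypothetical hyper-parameter name.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing the gradient penalizes the encoder whenever the nuisance
        # predictor attached after this layer succeeds.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```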
Fig. 2. Architecture of the model. Red box is the generation part and the blue box is the prediction part. Generation module aims at learning and splitting known nuisance factors znk and znu , and the prediction module aims at learning good predictive factors zp . (Color figure online)
However, because they demand annotations, these methods require more effort to pre-process the data and run into difficulties when the annotations are inaccurate or insufficient. Compared to such annotation-dependent approaches, annotation-free methods are easier to implement in practice. Unsupervised Adversarial Invariance (UAI) [14] splits the latent factors into factors useful for prediction and nuisance factors, and encourages the independence of the two by incorporating competition between the prediction and reconstruction objectives. NN+DIM [28] achieves invariant representations by using pairs of inputs and applying a neural-network-based mutual information estimator to minimize the mutual information between the two shared representations. Furthermore, Sanchez et al. [28] employ a discriminator to distinguish between the shared representation and the nuisance representation.
3 Learning Disentangled and Invariant Representation

3.1 Model Architecture

As illustrated in Fig. 2, the architecture of the proposed model contains two components: a generation module and a prediction module. Similar to a VAE, the generation module performs the encoding-decoding task. However, it encodes the input x into latent factors z = [zp, zn], where zp represents the latent predictive factors that contain useful information for the prediction task, whereas zn represents the latent nuisance factors and can be further divided into known nuisance factors znk and unknown nuisance factors znu.
Fig. 3. Weakly supervised disentanglement representation learning for known nuisance factors znk
The known nuisance factors znk are discovered and separated from zn via weakly supervised disentangled representation learning, where the joint distribution factorizes as p(znk) = ∏i p(znki). Since zn is the split containing the nuisance factors, once znk is identified, the remaining factors of zn naturally constitute the unknown nuisance factors znu. Then, zp and zn are concatenated to generate the reconstruction xrec, which is used to measure reconstruction quality. To enforce the independence between zp and zn, we add a regularizer based on another reconstruction task, in which the average mean and variance of the predictive factors zp are used to form new latent factors z̄p; this is discussed in Sect. 3.3. In the prediction module, we further incorporate a contrastive loss to cluster the predictive latent factors belonging to the same class.

3.2 Learning Independent Known Nuisance Factors znk

As illustrated in Fig. 2, the known nuisance factors znk are discovered and separated from zn, where p(znk) = ∏i p(znki), since nuisance information is expected to be present only within zn. To fulfill the theoretical requirement of including a supervision signal for disentangled representation learning, as proven in [22], we use selected pairs of inputs x(l) and
x(m) as supervision signals, where only a few common generative factors are shared. As illustrated in Fig. 3, during training, the network encodes a pair of inputs x(l) and x(m) into two latent representations z(l) = [zp(l), zn(l)] and z(m) = [zp(m), zn(m)], respectively, which are then decoded to the reconstructions xrec(l) and xrec(m). To encourage representation disentanglement, certain elements of zn(l) and zn(m) are detected and swapped to generate two new latent representations ẑ(l) and ẑ(m). The two new latent representations are then decoded into new reconstructions x̂rec(l) and x̂rec(m). By comparing x̂rec with xrec, the known nuisance factors znk are discovered and the elements of znk are enforced to be mutually independent of each other.
Fig. 4. In early training stages, only a small number of latent factors are swapped; the number of latent factors to be swapped then increases gradually.
Selecting Image Pairs for Training and Latent Factor Assumptions: As mentioned in [13], the true world simulator that uses generative factors to generate x can be modeled as p(x|v, w) = Sim(v, w), where v denotes the generative factors and w denotes other nuisance factors. Inspired by this, we choose pairs of images by randomly selecting several generative factors to be the same while keeping the values of the other generative factors random. Each image x has corresponding generative factors v, and a training pair is generated as follows: we first randomly select a sample x(l) whose generative factors are v(l) = [v1, v2, ..., vn]. We then randomly change the value of k elements in v(l) to form new generative factors v(m) and choose another sample x(m) according to v(m). During training, the indices of the generative factors that differ between v(l) and v(m), as well as the ground-truth values of all generative factors, are not available to the model. The model is weakly supervised since it is trained with only the knowledge of the number of factors k that have changed. Ideally, if the model learns a disentangled representation, it will encode the image pair x(l) and x(m) into the corresponding representations z(l) and z(m) with the property shown in Eq. (2). We denote the set of all differing elements between z(l) and z(m) by df_z and the set of all latent factors by d_z, such that df_z ⊆ d_z.

p(z_{n_j}^{(l)} \mid x^{(l)}) = p(z_{n_j}^{(m)} \mid x^{(m)}),\; j \notin df_z; \qquad p(z_{n_i}^{(l)} \mid x^{(l)}) \neq p(z_{n_i}^{(m)} \mid x^{(m)}),\; i \in df_z   (2)
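For datasets such as 3dShapes, whose images are indexed by their ground-truth generative factors, the pair-selection step described above could be sketched as follows. This is an assumption-laden illustration: the factor sizes and the sampling function name are ours, and only k (not which factors changed) would ever be exposed to the model.

```python
import numpy as np

def sample_weak_pair(factor_sizes, k, rng=np.random.default_rng()):
    """Return two factor vectors v_l, v_m that differ in exactly k randomly chosen factors."""
    v_l = np.array([rng.integers(s) for s in factor_sizes])
    v_m = v_l.copy()
    changed = rng.choice(len(factor_sizes), size=k, replace=False)
    for i in changed:
        # Re-sample until the chosen factor value actually differs.
        new_val = rng.integers(factor_sizes[i])
        while new_val == v_l[i]:
            new_val = rng.integers(factor_sizes[i])
        v_m[i] = new_val
    return v_l, v_m

# Example usage with six hypothetical factor cardinalities and k = 2 changed factors.
v_l, v_m = sample_weak_pair([10, 10, 10, 8, 4, 15], k=2)
```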
Detecting and Swapping the Distinct Latent Factors: The VAE adopts the reparameterization trick to make the posterior distribution qφ(z|x) differentiable, where the posterior of the latent factors is commonly assumed to be a factorized multivariate Gaussian, p(z|x) = qφ(z|x) [17]. Under this assumption, we can directly measure the discrepancy between the corresponding dimensions of the two latent representations z(l) and z(m) by a divergence Dm, which can be the KL divergence DKL. The process of detecting distinct latent factors is shown in Eq. (3), where a larger value of DKL implies a higher difference between the two corresponding latent factor distributions:

D_{KL}\big(q_\phi(z_i^{(l)}|x^{(l)})\,\|\,q_\phi(z_i^{(m)}|x^{(m)})\big) = \frac{(\sigma_i^{(l)})^2 + (\mu_i^{(l)} - \mu_i^{(m)})^2}{2(\sigma_i^{(m)})^2} + \log\frac{\sigma_i^{(m)}}{\sigma_i^{(l)}} - \frac{1}{2}   (3)
Since the model only has knowledge of the number of differing generative factors k, we swap all corresponding dimensions of zn(l) and zn(m) except the k elements with the highest Dm values. This swapping step creates two new latent representations ẑn(l) and ẑn(m), as shown in Eq. (4):

\hat{z}_{n_i}^{(l)} = z_{n_i}^{(m)},\;\; \hat{z}_{n_i}^{(m)} = z_{n_i}^{(l)},\;\; i \notin df_z; \qquad \hat{z}_{n_j}^{(l)} = z_{n_j}^{(l)},\;\; \hat{z}_{n_j}^{(m)} = z_{n_j}^{(m)},\;\; j \in df_z   (4)
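A sketch of the detect-and-swap step of Eqs. (3) and (4): the per-dimension KL divergence between the two diagonal Gaussian posteriors is computed, the k most divergent dimensions are kept, and the remaining nuisance dimensions are exchanged. Tensor and function names are illustrative, not the authors' implementation.

```python
import torch

def kl_per_dim(mu_l, logvar_l, mu_m, logvar_m):
    # KL(N(mu_l, var_l) || N(mu_m, var_m)), computed independently per latent dimension.
    var_l = logvar_l.exp()
    var_m = logvar_m.exp()
    return (var_l + (mu_l - mu_m) ** 2) / (2 * var_m) + 0.5 * (logvar_m - logvar_l) - 0.5

def swap_except_topk(z_l, z_m, mu_l, logvar_l, mu_m, logvar_m, k):
    """Swap all nuisance dimensions except the k with the largest divergence."""
    d = kl_per_dim(mu_l, logvar_l, mu_m, logvar_m)            # (batch, dim)
    topk = d.topk(k, dim=1).indices
    keep = torch.zeros_like(d, dtype=torch.bool).scatter_(1, topk, True)
    z_l_hat = torch.where(keep, z_l, z_m)                     # keep top-k, take the rest from the partner
    z_m_hat = torch.where(keep, z_m, z_l)
    return z_l_hat, z_m_hat
```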
Disentangled Representation Loss Function: After ẑn(l) and ẑn(m) are obtained, they are concatenated with zp(l) and zp(m), respectively, to generate two new latent representations ẑ(l) = [zp(l), ẑn(l)] and ẑ(m) = [zp(m), ẑn(m)], which are decoded into new reconstructions x̂rec(l) and x̂rec(m). Since there are only k differing generative factors between the pair of images, ideally there should also be only k pairs of differing distributions in the latent space after encoding. By swapping all the other latent factors, the new representations ẑ(l) and ẑ(m) should equal the original representations z(l) and z(m). Accordingly, the new reconstructions x̂rec(l) and x̂rec(m) should be identical to the original reconstructions xrec(l) and xrec(m). Therefore, we design the disentangled representation loss in Eq. (5), where D can be any suitable distance function, e.g., mean squared error (MSE) or binary cross-entropy (BCE):

L = L_{VAE}(x^{(l)}, z^{(l)}) + L_{VAE}(x^{(m)}, z^{(m)}) + D(\hat{x}_{rec}^{(l)}, x_{rec}^{(l)}) + D(\hat{x}_{rec}^{(m)}, x_{rec}^{(m)})   (5)
Training Strategies for Disentangled Representation Learning: To further improve the performance of disentangled representation learning, we design two strategies: warmup by amount and warmup by difficulty. Recall that in the swapping step the model needs to swap |d_z| − k elements of the latent representation. At the beginning of training, exchanging too many latent factors easily leads to mistakes. Therefore, in the first strategy, we gradually increase the number of latent factors being swapped from 1 to |d_z| − k. Further, to smoothly increase the training difficulty, we set the number of differing generative factors to 1 at the beginning and increase it as training continues.
3.3 Learning Invariant Predictive Factors zp

After we obtain the disentangled representation znk, the predictive factors zp may still be entangled with znu. Therefore, we need additional constraints to achieve a fully invariant representation of zp.

Making zp Independent of zn: As shown in [22], supervision signals need to be introduced for disentangled representation learning. Similarly, the independence of zp and zn also requires a supervision signal, as we discuss in the Appendix. Fortunately, for supervised training, a batch of samples naturally contains such a signal. Similar to Eq. (2), the distribution of the representation zp should be the same for samples of the same class, as shown in Eq. (6), where C(x(l)) denotes the class of sample x(l):

p(z_p^{(l)} \mid x^{(l)}) = p(z_p^{(m)} \mid x^{(m)}),\;\; C(x^{(l)}) = C(x^{(m)}); \qquad p(z_p^{(l)} \mid x^{(l)}) \neq p(z_p^{(m)} \mid x^{(m)}),\;\; C(x^{(l)}) \neq C(x^{(m)})   (6)
Similar to the method used for disentangled representation learning, we generate a new latent representation z̄p and its corresponding reconstruction x̄rec−p, and enforce the disentanglement between zp and zn by comparing the new reconstruction x̄rec−p with xrec. In contrast to the swapping method of Sect. 3.2, since a training batch often contains more than two samples of the same class, the swapping method is hard to apply in this situation. Therefore, we generate the new latent representation z̄p by computing the average mean μ̄p and average variance V̄p of the latent representations of the same class, as shown in Eq. (7):

\bar{z}_p = \mathcal{N}(\bar{\mu}_p, \bar{V}_p); \quad \bar{x}_{rec-p} = \mathrm{Decoder}([\bar{z}_p, z_n]); \quad \bar{\mu}_p = \frac{1}{|C|}\sum_{i \in C}\mu_{p_i},\;\; \bar{V}_p = \frac{1}{|C|}\sum_{i \in C}V_{p_i}   (7)
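One way to realize Eq. (7) in a batch setting is sketched below: every sample's posterior mean and variance are replaced by the averages over all samples of its class before re-sampling z̄p. The names are ours and the code is a sketch rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def class_averaged_latents(mu_p, var_p, labels):
    """Replace each sample's (mu, var) by the per-class averages and re-sample z_p_bar.

    mu_p, var_p: (N, D) posterior parameters of the predictive split
    labels:      (N,) integer class labels
    """
    num_classes = int(labels.max().item()) + 1
    one_hot = F.one_hot(labels, num_classes).float()            # (N, C)
    counts = one_hot.sum(0).clamp(min=1)                        # samples per class
    mu_bar = (one_hot.t() @ mu_p) / counts.unsqueeze(1)         # (C, D) class means
    var_bar = (one_hot.t() @ var_p) / counts.unsqueeze(1)       # (C, D) class variances
    mu_sel, var_sel = mu_bar[labels], var_bar[labels]           # broadcast back to the samples
    eps = torch.randn_like(mu_sel)
    return mu_sel + var_sel.sqrt() * eps                        # reparameterized z_p_bar
```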
We then generate the new reconstruction x̄rec−p using the same decoder as in the other reconstruction tasks, enforce the disentanglement of zp and zn by computing D(xrec, x̄rec−p), and update the parameters of the model according to its gradient.

Contrastive Feature Alignment: To achieve an invariant representation, we need to make sure that the latent representation that is useful for prediction can also be clustered according to the corresponding classes. Even though the commonly used cross-entropy (CE) loss can accomplish a similar goal, the direct objective of the CE loss is logit-level alignment, and it changes the representation distribution according to the logits, which does not guarantee a uniform distribution of the features. Alternatively, we incorporate a contrastive method to ensure that representation/feature alignment is accomplished effectively [31]. Similar to [15], we use the supervised contrastive loss of Eq. (8) to achieve feature alignment and cluster the representations zp according to their classes, where C is the set of samples from the same class (yp = yi):

L_{sup} = \sum_{i \in I}\frac{-1}{|C|}\sum_{p \in C}\log\frac{\exp(z_i \cdot z_p/\tau)}{\sum_{a \in A(i)}\exp(z_i \cdot z_a/\tau)}   (8)
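A compact version of the supervised contrastive term of Eq. (8), in the spirit of [15], is given below. It assumes L2-normalized embeddings z of shape (N, D) and integer class labels, and is a sketch rather than the authors' exact implementation.

```python
import torch

def sup_con_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over a batch of normalized embeddings z (N, D)."""
    n = z.size(0)
    sim = z @ z.t() / tau                                        # pairwise similarities
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))              # exclude i == a from the denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log-softmax over all other samples
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(1).clamp(min=1)
    # Average the log-probability over each anchor's positives, then over anchors.
    loss = -(log_prob * pos_mask).sum(1) / pos_count
    return loss.mean()
```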
The final loss function used to train the model, after adding the standard cross-entropy (CE) loss for the classifier, is given by Eq. (9):

L = L_{CE}(x, y) + L_{VAE} + \alpha L_{disentangle} + \beta L_{Sup} + \gamma L_{Z_p}, \quad L_{disentangle} = D(\hat{x}_{rec}^{(l)}, x_{rec}^{(l)}) + D(\hat{x}_{rec}^{(m)}, x_{rec}^{(m)}), \quad L_{Z_p} = D(\bar{x}_{rec-p}, x_{rec})   (9)

Table 1. Test average and worst accuracy results on Colored-MNIST, 3dShapes and MPI3D. Bold, Black: best result.

Models       | Colored-MNIST Avg / Worst   | 3dShapes Avg / Worst        | MPI3D Avg / Worst
Baseline     | 95.12 ± 2.42 / 66.17 ± 3.31 | 98.87 ± 0.52 / 96.89 ± 1.25 | 90.12 ± 3.13 / 87.89 ± 4.31
VFAE [25]    | 93.12 ± 3.07 / 65.54 ± 6.21 | 97.72 ± 0.81 / 93.34 ± 1.05 | 86.69 ± 3.12 / 82.43 ± 3.25
CAI [32]     | 93.56 ± 2.76 / 63.17 ± 5.61 | 97.62 ± 0.53 / 94.32 ± 0.89 | 86.63 ± 2.14 / 82.16 ± 5.83
CVIB [26]    | 93.31 ± 3.09 / 70.12 ± 4.77 | 97.11 ± 0.59 / 94.46 ± 0.90 | 87.04 ± 3.02 / 85.61 ± 2.08
UAI [14]     | 94.74 ± 2.19 / 74.25 ± 2.69 | 97.13 ± 1.02 / 95.21 ± 1.03 | 87.89 ± 4.23 / 83.01 ± 2.21
NN+DIM [28]  | 94.48 ± 2.35 / 80.25 ± 3.44 | 97.03 ± 1.07 / 96.02 ± 0.46 | 88.81 ± 1.37 / 82.01 ± 3.34
Our model    | 97.96 ± 1.21 / 90.43 ± 2.79 | 98.52 ± 0.51 / 97.63 ± 0.72 | 91.32 ± 2.38 / 89.17 ± 2.69
4 Experiments Evaluation

4.1 Benchmarks, Baselines and Metrics

The main objective of this work is to learn invariant representations and reduce overfitting to nuisance factors. Meanwhile, as a secondary objective, we also want to ensure that the learned representations are at least not less robust to adversarial attacks. Therefore, all models are evaluated on both an invariant representation learning task and an adversarial robustness task. We use four (4) datasets with different underlying factors of variation to evaluate the model:
– Colored-MNIST: an augmented version of MNIST [20] with two known nuisance factors: digit color and background color [28]. During training, the background color is chosen from three (3) colors and the digit color from another six (6) colors. At test time, the background color is set to three (3) new colors that differ from the training set.
– Rotation-Colored-MNIST: a further augmented version of Colored-MNIST. The background and digit color settings are the same as in Colored-MNIST. The dataset additionally contains digits rotated to four (4) different angles, Θtrain = {0, ±22.5, ±45}. For the test data, the rotation angles are set to Θtest = {0, ±65, ±75}. The rotation angles are used as unknown nuisance factors.
– 3dShapes [2]: contains 480,000 RGB 64 × 64 × 3 images with six (6) different generative factors. We choose object shape (four (4) classes) as the prediction task; only half of the object colors are used during training, and the remaining half are used to evaluate the performance of the invariant representation.
– MPI3D [10]: a real-world dataset containing 1,036,800 RGB images with seven (7) generative factors. As for 3dShapes, we choose object shape (six (6) classes) as the prediction target, and half of the object colors are used for training.
Prediction accuracy is used to evaluate the performance of invariant representation learning. Furthermore, we record both the average test accuracy and the worst-case test accuracy, as suggested by [27]. We find that using Eq. (9) directly does not guarantee good performance. This may be caused by the inconsistent behavior of the CE loss and the supervised contrastive loss. Thus, we separately train the classifier using the CE loss and use the remaining part of the total loss to train the rest of the model.

Table 2. Test average accuracy and worst accuracy results on Rotation-Colored-MNIST with different rotation angles. Bold, Black: best result.
Rotation-Colored-MNIST Avg Acc Worst Acc Avg Acc −75 −65
Worst Acc Avg Acc +65
Worst Acc Avg Acc +75
Worst Acc
Baseline VFAE [25] CAI [32] CVIB [26] UAI [14] NN+DIM [28]
77.0 ± 1.3 72.2 ± 2.4 74.9 ± 0.9 76.1 ± 0.8 76.0 ± 1.7 77.6 ± 2.6
77.5 ± 2.2 74.4 ± 2.5 77.3 ± 2.0 79.1 ± 1.2 80.0 ± 0.9 76.3 ± 4.3
65.8 ± 3.0 64.6 ± 3.7 67.8 ± 1.9 68.8 ± 2.9 68.2 ± 2.3 66.7 ± 3.7
49.9 ± 4.6 48.0 ± 3.8 42.9 ± 3.7 53.4 ± 2.6 51.1 ± 2.3 53.2 ± 5.6
Our model
81.0 ± 2.1 75.3 ± 2.5 90.8 ± 1.6 85.7 ± 2.4 87.3 ± 2.5 82.3 ± 2.1 73.2 ± 2.3 63.3 ± 2.9
62.3 ± 1.9 58.9 ± 2.3 59.3 ± 3.9 59.2 ± 3.0 61.1 ± 5.6 69.2 ± 2.7
89.7 ± 1.2 85.8 ± 1.7 86.5 ± 1.9 88.6 ± 0.9 88.8 ± 0.7 85.2 ± 3.4
85.8 ± 1.2 84.1 ± 2.1 84.2 ± 1.7 85.6 ± 0.7 85.4 ± 1.6 84.6 ± 3.1
68.3 ± 2.2 71.7 ± 1.3 64.7 ± 4.2 72.2 ± 1.2 70.2 ± 0.9 68.4 ± 3.1
Meanwhile, the performance of representation disentanglement is also important for representation invariance, since it evaluates the invariance of the latent factors representing the known nuisance factors. We adopt the following metrics to evaluate disentanglement; all metrics range from 0 to 1, where 1 indicates that the latent factors are fully disentangled. (1) Mutual Information Gap (MIG) [7] evaluates the gap between the top two highest mutual information values between a latent factor and the generative factors. (2) Separated Attribute Predictability (SAP) [18] measures the mean difference in prediction error between the two most predictive latent factors. (3) Interventional Robustness Score (IRS) [30] evaluates the reliance of a latent factor solely on its generative factor, regardless of other generative factors. (4) The FactorVAE (FVAE) score [16] uses a majority-vote classifier to predict the index of a fixed generative factor and takes the prediction accuracy as the score. (5) DCI-Disentanglement (DCI) [8] calculates the entropy of the distribution obtained by normalizing, over each dimension of the learned representation, its importance for predicting the value of a generative factor.

4.2 Comparison with Previous Work

We show the invariance learning results, i.e., the test average accuracy and worst accuracy, in Tables 1 and 2. For the Rotation-Colored-MNIST dataset, since we rotate the test
samples with angles θ ∈ Θtest = {±65, ±75}, which differ from the training rotation angles θ ∈ Θtrain = {0, ±22.5, ±45}, we report the average accuracy and worst accuracy under each rotation angle separately. The baseline model is a regular VGG16 model with no extra components for representation invariance. Our model largely outperforms prior work. To compare disentanglement performance, we show the disentangled representation learning results in Table 3. Since Colored-MNIST and Rotation-Colored-MNIST have only two generative factors, models for disentanglement learning tend to achieve nearly perfect disentanglement scores on them, which makes those results trivial. Therefore, we only report the disentanglement scores on the 3dShapes and MPI3D datasets.

Table 3. Disentanglement metrics on 3dShapes and MPI3D. Bold, Black: best result.

Models            | 3dShapes: MIG / SAP / IRS / FVAE / DCI  | MPI3D: MIG / SAP / IRS / FVAE / DCI
Unsupervised disentanglement learning
β-VAE [13]        | 0.194 / 0.063 / 0.473 / 0.847 / 0.246   | 0.135 / 0.071 / 0.579 / 0.369 / 0.317
AnnealedVAE [3]   | 0.233 / 0.087 / 0.545 / 0.864 / 0.341   | 0.098 / 0.038 / 0.490 / 0.397 / 0.228
FactorVAE [16]    | 0.224 / 0.0440 / 0.630 / 0.792 / 0.304  | 0.092 / 0.031 / 0.529 / 0.379 / 0.164
DIP-VAE-I [18]    | 0.143 / 0.026 / 0.491 / 0.761 / 0.137   | 0.104 / 0.073 / 0.476 / 0.491 / 0.223
DIP-VAE-II [18]   | 0.137 / 0.020 / 0.424 / 0.742 / 0.083   | 0.131 / 0.075 / 0.509 / 0.544 / 0.244
β-TCVAE [7]       | 0.364 / 0.096 / 0.594 / 0.970 / 0.601   | 0.189 / 0.146 / 0.636 / 0.430 / 0.322
Weakly-supervised disentanglement learning
Ada-ML-VAE [23]   | 0.509 / 0.127 / 0.620 / 0.996 / 0.940   | 0.240 / 0.074 / 0.576 / 0.476 / 0.285
Ada-GVAE [23]     | 0.569 / 0.150 / 0.708 / 0.996 / 0.946   | 0.269 / 0.215 / 0.604 / 0.589 / 0.401
Our model         | 0.716 / 0.156 / 0.784 / 0.996 / 0.919   | 0.486 / 0.225 / 0.615 / 0.565 / 0.560
To test the ability to defend against adversarial attacks without adversarial augmentation, we first train all models on Colored-MNIST, CIFAR10 and CIFAR100, and then apply different adversarial attacks on those datasets. The attack types are: the Fast Gradient Sign Method (FGSM) attack [11], the Projected Gradient Descent (PGD) attack [19], and the Carlini & Wagner (C&W) attack [4]. FGSM and PGD attack results are shown in Fig. 5, and C&W attack results are included in the Appendix.

Table 4. Disentanglement metrics with different training strategies applied to 3dShapes.

Warmup by amount | Warmup by difficulty | MIG   | SAP   | IRS   | FVAE  | DCI
no               | no                   | 0.492 | 0.096 | 0.661 | 0.902 | 0.697
yes              | no                   | 0.512 | 0.126 | 0.674 | 0.944 | 0.781
yes              | yes                  | 0.716 | 0.156 | 0.784 | 0.996 | 0.919
4.3 Ablation Study

Effectiveness of Training Strategies in Disentanglement Learning: To demonstrate the effectiveness of the training strategies illustrated in Fig. 4, we compare three settings: (1) neither strategy is used, (2) only the warmup-by-amount strategy is used, and (3) both strategies are used. As shown in Table 4, using both training strategies clearly outperforms the other settings.

Table 5. Performance of different training schemes on Colored-MNIST and Rotation-Colored-MNIST.

Training scheme      | Colored-MNIST Avg / Worst | Rotation-Colored-MNIST (65) Avg / Worst
LCE                  | 0.932 / 0.680             | 0.821 / 0.653
LCE + Lcontrastive   | 0.935 / 0.732             | 0.842 / 0.678
LCE ↔ Lcontrastive   | 0.980 / 0.904             | 0.873 / 0.823
Separately Training the Classifier and the Rest of the Model: To show the importance of the two-step training mentioned in Sect. 4.1, we compare the results of training the entire model jointly versus separately training the classifier and the other parts. Further, we also record the results of our model when the contrastive loss for feature-level alignment is not used. As shown in Table 5, either using only the CE loss (LCE) or training the whole model
Fig. 5. FGSM attacks and PGD attacks results
together (LCE + Lcontrastive) harms the performance of the model. By separately training the classifier and the other parts (LCE ↔ Lcontrastive), the framework achieves the best results for both average and worst accuracy.
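The separate training scheme (LCE ↔ Lcontrastive) reported in Table 5 can be sketched with two optimizers, one stepping only the classifier on the CE loss and one stepping the remaining modules on the rest of the objective. Module and loss names below are placeholders, and detaching the features for the classifier step is one possible interpretation of the scheme, not necessarily the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def two_step_train_iteration(batch, model, classifier, rest_loss_fn, opt_cls, opt_rest):
    """One iteration of the two-step scheme: classifier and the rest of the model update separately."""
    x, y = batch

    # Step 1: update only the classifier with the cross-entropy loss.
    opt_cls.zero_grad()
    z_p = model.encode_predictive(x).detach()   # hypothetical method returning the predictive split
    ce = F.cross_entropy(classifier(z_p), y)
    ce.backward()
    opt_cls.step()

    # Step 2: update the encoder/decoder parts with the remaining losses
    # (VAE, disentanglement, invariance and contrastive terms).
    opt_rest.zero_grad()
    rest = rest_loss_fn(model, x, y)
    rest.backward()
    opt_rest.step()
    return ce.item(), rest.item()
```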
5 Conclusion

In this work, we extend the ideas of representation disentanglement and representation invariance by combining them to achieve both goals at the same time. By introducing a contrastive loss and a new invariant regularization loss, we make the predictive factors zp more invariant to nuisance factors and increase both the average and worst accuracy on invariant learning tasks. Furthermore, we demonstrate that simultaneously achieving invariant and disentangled representations improves adversarial defense compared to using invariant representation learning alone.

Acknowledgement. This research is based upon work supported by the Defense Advanced Research Projects Agency (DARPA), under cooperative agreement number HR00112020009. The views and conclusions contained herein should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon.
References 1. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives (2014) 2. Burgess, C., Kim, H.: 3D shapes dataset (2018). https://github.com/deepmind/3dshapesdataset/ 3. Burgess, C.P., et al.: Understanding disentangling in β-VAE (2018) 4. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP) (2017) 5. Castro, P.S.: Scalable methods for computing state similarity in deterministic Markov decision processes. In: AAAI (2020) 6. Chen, J., Konrad, J., Ishwar, P.: A cyclically-trained adversarial network for invariant representation learning. In: CVPR Workshops (2020) 7. Chen, R.T.Q., Li, X., Grosse, R., Duvenaud, D.: Isolating sources of disentanglement in variational autoencoders (2019) 8. Eastwood, C., Williams, C.K.I.: A framework for the quantitative evaluation of disentangled representations. In: ICLR (2018). https://openreview.net/forum?id=By-7dz-AZ 9. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2096–2130 (2016) 10. Gondal, M.W., et al.: On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset (2019) 11. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2014). http://arxiv.org/abs/1412.6572, cite arxiv:1412.6572 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015) 13. Higgins, I., et al.: β-VAE: learning basic visual concepts with a constrained variational framework. In: ICLR (2017)
14. Jaiswal, A., Wu, R.Y., Abd-Almageed, W., Natarajan, P.: Unsupervised adversarial invariance. In: Advances in Neural Information Processing Systems, vol. 31 (2018) 15. Khosla, P., et al.: Supervised contrastive learning. CoRR abs/2004.11362 (2020). https:// arxiv.org/abs/2004.11362 16. Kim, H., Mnih, A.: Disentangling by factorising. In: ICML (2018). http://proceedings.mlr. press/v80/kim18b.html 17. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014) 18. Kumar, A., Sattigeri, P., Balakrishnan, A.: Variational inference of disentangled latent concepts from unlabeled observations. In: ICLR (2018) 19. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial machine learning at scale. CoRR abs/1611.01236 (2016). http://arxiv.org/abs/1611.01236 20. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010). http://yann.lecun.com/ exdb/mnist/ 21. Li, Y., Swersky, K., Zemel, R.: Learning unbiased features. arXiv preprint arXiv:1412.5244 (2014) 22. Locatello, F., et al.: Challenging common assumptions in the unsupervised learning of disentangled representations (2019) 23. Locatello, F., Poole, B., Raetsch, G., Sch¨olkopf, B., Bachem, O., Tschannen, M.: Weaklysupervised disentanglement without compromises. In: ICML (2020). http://proceedings.mlr. press/v119/locatello20a.html 24. Locatello, F., et al.: Disentangling factors of variations using few labels. In: ICLR (2020). https://openreview.net/forum?id=SygagpEKwB 25. Louizos, C., Swersky, K., Li, Y., Welling, M., Zeme, R.: The variational fair autoencoder. In: ICLR (2016) 26. Moyer, D., Gao, S., Brekelmans, R., Galstyan, A., Ver Steeg, G.: Invariant representations without adversarial training. In: Advances in Neural Information Processing Systems, vol. 31 (2018) 27. Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks. In: ICLR (2020). https://openreview.net/forum?id=ryxGuJrFvS 28. Sanchez, E.H., Serrurier, M., Ortner, M.: Learning disentangled representations via mutual information estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 205–221. Springer, Cham (2020). https://doi.org/10.1007/9783-030-58542-6 13 29. van Steenkiste, S., Locatello, F., Schmidhuber, J., Bachem, O.: Are disentangled representations helpful for abstract visual reasoning? CoRR abs/1905.12506 (2019). http://arxiv.org/ abs/1905.12506 - Miladinovi´c, Sch¨olkopf, B., Bauer, S.: Robustly disentangled causal mech30. Suter, R., -Dorde anisms: validating deep representations for interventional robustness (2019) 31. Wang, F., Liu, H.: Understanding the behaviour of contrastive loss. In: CVPR (2021) 32. Xie, Q., Dai, Z., Du, Y., Hovy, E., Neubig, G.: Controllable invariance through adversarial feature learning. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism Ioanna Gkartzonika , Nikolaos Gkalelis(B) , and Vasileios Mezaris CERTH-ITI, 6th Km Charilaou-Thermi Road, P.O. BOX 60361, Thessaloniki, Greece {gkartzoni,gkalelis,bmezaris}@iti.gr Abstract. In this paper two new learning-based eXplainable AI (XAI) methods for deep convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and L-CAM-Img, are proposed. Both methods use an attention mechanism that is inserted in the original (frozen) DCNN and is trained to derive class activation maps (CAMs) from the last convolutional layer’s feature maps. During training, CAMs are applied to the feature maps (L-CAM-Fm) or the input image (L-CAM-Img) forcing the attention mechanism to learn the image regions explaining the DCNN’s outcome. Experimental evaluation on ImageNet shows that the proposed methods achieve competitive results while requiring a single forward pass at the inference stage. Moreover, based on the derived explanations a comprehensive qualitative analysis is performed providing valuable insight for understanding the reasons behind classification errors, including possible dataset biases affecting the trained classifier (Source code is made publicly available at: https://github.com/bmezaris/L-CAM). Keywords: Explainable AI · XAI · Image classification · Class activation map · Deep convolutional neural networks · Attention
· Bias

1 Introduction
During the last years, we have witnessed breakthrough performance of DCNN image classifiers. However, the widespread commercial adoption of these methods is still hindered by the difficulty for users to obtain some kind of explanation of the DCNN decisions. This lack of DCNN transparency especially affects the adoption of this technology in safety-critical applications such as the medical, security and self-driving vehicle industries, where a wrong DCNN decision may have serious implications. To this end, there is great demand for developing eXplainable AI (XAI) methods [6,13,14,16–18,21,23,24,34]. A category of XAI approaches for DCNN image classifiers that is currently receiving increasing attention concerns methods that provide a visual explanation depicting the regions of the input image that contribute the most to the decision of the classifier. We should note that these approaches differ from methods used in weakly supervised learning tasks such as weakly supervised object localization and segmentation, where the goal is to locate the region of the target object and
not the image regions that contribute to the prediction of the classifier [15,19,30]. This can be seen, for instance, in the explanation examples of the various figures in our experimental evaluation section (Sect. 4.4), where the focus region produced by the XAI approach often does not coincide with the region of the object instance corresponding to the class label of the image. Gradient-based class activation map (CAM) [8,11,27,28,36] and perturbation-based [22,26,31,35] approaches have shown promising explanation performance. Given an input image and its inferred class label, these methods generate a CAM, which is rescaled to the image size to provide the so-called saliency map (SM); the SM indicates the image regions that the DCNN focused on in order to infer this class. However, these methods are either based on backpropagating gradients [8,27,28], producing suboptimal SMs due to the well-known gradient problems [4], or require many forward passes at the inference stage [22,26,31,35], thus introducing significant computational overhead. Furthermore, the training dataset is not exploited in the exploration of the internal mechanisms behind the decision process of the classifier. To this end, two new learning-based CAM methods are proposed, called L-CAM-Fm and L-CAM-Img, which utilize an appropriate loss function to train an attention mechanism [5] for generating visual explanations. Both methods can be used to generate explanations for arbitrary DCNN classifiers, are gradient-free, and during inference require only one forward pass to derive a CAM and generate the respective SM of an input image. Experimental evaluation using VGG-16 and ResNet-50 backbones on ImageNet shows the efficacy of the proposed approaches in terms of both explainability performance and computational efficiency. Moreover, an extensive qualitative analysis using the generated SMs to explain misclassification errors leads to interesting conclusions including, among others, possible biases in the classifier's training dataset. In summary, the contributions of this paper are:
– We present the first, to the best of our knowledge, learning-based CAM framework for explaining image classifiers; this materializes into two XAI methods, L-CAM-Fm and L-CAM-Img.
– An appropriate loss function, consisting of the cross-entropy loss and average and total variation CAM loss components, is employed during training, forcing the attention mechanism to extract CAMs of low energy that constitute good explanations.
The paper is structured as follows: related work and the proposed methods are presented in Sects. 2 and 3, while experiments and conclusions are presented in Sects. 4 and 5.
2 Related Work
We discuss here the visual XAI approaches most closely related to ours; for a more comprehensive survey, the reader is referred to [6,7,13,25]. Gradient-based CAM approaches (Fig. 1a) calculate a weight for each feature map of the last convolutional layer using the gradients backpropagated from the output, and derive the CAM as the weighted sum of the feature maps [8,27,28].
Fig. 1. Inference stage for literature XAI approaches: (a) Gradient-based CAM methods, (b) Perturbation-based methods with feature maps, and, (c) Perturbation-based methods with auxiliary masks. Blue components and arrows denote the (frozen) DCNN classifier and the flow of the original input image through its layers. Red arrows, indicating the flow of modified information such as masked images, and components in red, are introduced by the respective approach to derive the model decision’s explanation (SMs). (Color figure online)
Grad-CAM [28] calculates the importance of each feature map from the gradients flowing from the output layer into the last convolutional layer. Grad-CAM++ [8] utilizes a weighted combination of the positive partial derivatives of the last convolutional layer. Integrated Grad-CAM [27] introduces Integrated Gradients [32] to further improve the CAM's quality. Perturbation-based approaches are gradient-free [22,26,31,35]. Both Score-CAM [35] and SIDU [21] derive the weight of each feature map by forward-passing perturbed copies of the input image. Similarly, RISE [22] generates randomly
masked versions of the input image to compute the aggregation weights. In [26], SISE selects feature maps at various depths, generates the so-called attribution masks and combines them using their classification scores. In [31], a fraction of the feature maps are adaptively selected by ADA-SISE, reducing the computational complexity of SISE. The general architecture of SIDU, Score-CAM, SISE and ADA-SISE is shown in Fig. 1b, while the respective architecture for RISE is depicted in Fig. 1c. The form of explanation (SM) produced by all methods is shown right below the input image in Fig. 1a.
3 Proposed Method

3.1 Problem Formulation
Let f be a DCNN model trained to categorize images into one of R different classes. Suppose an input image X ∈ R^{W×H×C} passes through f, producing a model-truth label y ∈ {1, ..., R}, i.e., the top-1 class label inferred by f, and K feature maps extracted from f's last convolutional layer,

A \in \mathbb{R}^{P \times Q \times K},   (1)

where W, H, C and P, Q, K are the width, height and number of channels of X and A, respectively, and A:,:,k is the kth feature map. Given the above formulation, the goal of CAM-based methods is to derive an activation map from the K feature maps, the so-called CAM, and based on it generate the respective SM, visualizing the salient image regions that explain f's decision.

3.2 Training the Attention Mechanism
Consider a training set of R classes (the same classes that were used to train f), where each image X in the dataset is associated with a model-truth label y. This dataset is used to train an attention mechanism g(),

L^{(y)} = g(y, A),   (2)

where L^{(y)} ∈ R^{P×Q} is the CAM produced for a specified X and y. Specifically, the attention mechanism is implemented as

g(y, A) = \sum_{k=1}^{K} w_k^{(y)} A_{:,:,k} + b^{(y)} J,   (3)
where the weight matrix W = [w^{(1)}, ..., w^{(R)}]^T ∈ R^{R×K} and the bias vector b = [b^{(1)}, ..., b^{(R)}]^T ∈ R^R are the parameters of the attention mechanism, the transpose of the vector w^{(r)} = [w_1^{(r)}, ..., w_K^{(r)}]^T ∈ R^K is the rth row of W, w_k^{(r)} ∈ R is the kth element of w^{(r)}, and J ∈ R^{P×Q} is an all-ones matrix. That is, the model-truth label y at the input of g() is used to select the class-specific weight vector and bias term from the yth row of W and b, respectively.
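A sketch of the attention mechanism of Eqs. (2)-(3): a class-indexed weight vector and bias combine the K feature maps into a single P×Q CAM. This mirrors the description above but is only an illustrative reimplementation, not the authors' released code; module and argument names are ours.

```python
import torch
import torch.nn as nn

class ClassCAMAttention(nn.Module):
    """Produces a P x Q CAM as a class-specific linear combination of the K feature maps."""
    def __init__(self, num_classes, num_channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(num_classes, num_channels))  # W: R x K
        self.bias = nn.Parameter(torch.zeros(num_classes))                  # b: R

    def forward(self, feats, y):
        # feats: (N, K, P, Q) feature maps from the frozen classifier's last conv layer
        # y:     (N,) model-truth labels selecting the class-specific row of W and b
        w = self.weight[y]                                    # (N, K)
        cam = torch.einsum('nk,nkpq->npq', w, feats)          # weighted sum over channels
        return cam + self.bias[y].view(-1, 1, 1)              # add the class bias at every position
```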
Fig. 2. Network architectures of the proposed approaches: (a) L-CAM-Fm training, (b) L-CAM-Img training, (c) L-CAM-Fm/-Img inference.
To learn the parameters of the attention mechanism, two different approaches are proposed, called L-CAM-Fm and L-CAM-Img, with the corresponding network architectures shown in Figs. 2a and 2b. In both architectures, the attention mechanism is placed at the output of the last convolutional layer of the DCNN, and the elements of the derived CAM are normalized to [0,1] using the element-wise sigmoid function σ(). In L-CAM-Fm, the CAM produced by the attention mechanism is used as a self-attention mask to re-weight the elements of the feature maps, i.e.,

A_{:,:,k} \leftarrow A_{:,:,k} \odot \sigma(L^{(y)}), \quad k = 1, \dots, K,   (4)
where ⊙ denotes element-wise multiplication. In contrast, in L-CAM-Img the derived CAM is upscaled and applied to each channel of the input image,

X_{:,:,c} \leftarrow X_{:,:,c} \odot \theta(\sigma(L^{(y)})), \quad c = 1, \dots, C,   (5)
where θ(): R^{P×Q} → R^{W×H} is the upscaling operator (e.g. bilinear interpolation) and X:,:,c is the cth channel of X. The overall architecture is trained end-to-end using an iterative gradient descent algorithm, where the attention mechanism's weights are updated at every iteration while the weights of f remain fixed to their original values. The following loss function is used during training for both L-CAM-Fm and L-CAM-Img,

\lambda_1\, TV(\sigma(L^{(y)})) + \lambda_2\, AV(\sigma(L^{(y)})) + \lambda_3\, CE(y, u),   (6)
where CE(,) is the cross-entropy loss, u is the confidence score for class y derived using the L-CAM-Fm or -Img network, λ1, λ2, λ3 are regularization parameters, and AV(), TV() are the average and total variation operators, respectively. For the two latter operators we use the definitions presented in [9],

AV(S) = \frac{1}{PQ}\sum_{p,q}(s_{p,q})^{\lambda_4},   (7)

TV(S) = \sum_{p,q}\big[(s_{p,q} - s_{p,q+1})^2 + (s_{p,q} - s_{p+1,q})^2\big],   (8)
where s_{p,q} is the element at the pth row and qth column of any tensor S ∈ R^{P×Q}, and λ4 is a fourth regularization parameter. AV() and TV() are used to reduce the energy of the SM (i.e., the number of nonzero or high-valued elements) and to remove spurious/noisy areas in it, respectively. Intuitively, these components, in synergy with the CE loss, guide the DCNN to learn more informative, fine-grained SMs, i.e., SMs that contain only a few high-valued elements corresponding to the image regions contributing most to the classifier's decision. We should note that although f's weights are kept frozen, the gradients backpropagate through it and effectively train the attention mechanism, as explained for instance in [33]. Thus, the attention mechanism is forced to learn a transformation of the feature maps such that the CAM retains the regions of the input image that best explain f's decision.
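The loss of Eqs. (6)-(8) can be sketched as follows, assuming cam is the raw P×Q map produced by the attention mechanism and logits are the frozen classifier's outputs for the masked input; the default regularization weights follow the notation above, but the function itself is an illustrative sketch rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def lcam_loss(cam, logits, y, lam1=0.01, lam2=2.0, lam3=1.5, lam4=0.3):
    s = torch.sigmoid(cam)                                    # (N, P, Q), values in [0, 1]
    # Total variation: penalizes spurious, noisy regions in the CAM.
    tv = ((s[:, :, :-1] - s[:, :, 1:]) ** 2).sum(dim=(1, 2)) + \
         ((s[:, :-1, :] - s[:, 1:, :]) ** 2).sum(dim=(1, 2))
    # Average value raised to lambda4: pushes the CAM towards low energy.
    av = (s ** lam4).mean(dim=(1, 2))
    ce = F.cross_entropy(logits, y, reduction='none')         # keep the masked prediction correct
    return (lam1 * tv + lam2 * av + lam3 * ce).mean()
```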
3.3 Inference of Model Decision's Explanation
During the inference stage, the procedure to derive the CAM of a test image is the same for both L-CAM-Fm and L-CAM-Img (see Fig. 2c). That is, the test image is forward-passed through the DCNN to produce the corresponding feature maps and the model-truth label, which are then forwarded to the trained attention mechanism for computing the CAM (Eqs. (2), (3)). Similarly to [8,28], the explanation (SM) V ∈ R^{W×H} is then derived by

V^{(y)} = \theta(\varsigma(L^{(y)})),   (9)
where, θ() is an upscaling operator to the input image size (e.g. bilinear interpolation), ς() is the min-max normalization operator, i.e., transform each element l(y) of L(y) using, l(y) ← (l(y) − min(L(y) ))/(max(L(y) ) − min(L(y) )), and, min(L(y) ), max(L(y) ) are the smallest and largest element of L(y) , respectively.
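A short sketch of this inference-time step for a batch of raw CAMs; the helper name and interpolation settings are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def explanation_map(cam: torch.Tensor, out_size=(224, 224)) -> torch.Tensor:
    # Eq. (9): min-max normalize the CAM per image, then upscale to the input size.
    lo = cam.amin(dim=(-2, -1), keepdim=True)
    hi = cam.amax(dim=(-2, -1), keepdim=True)
    cam = (cam - lo) / (hi - lo + 1e-8)
    return F.interpolate(cam.unsqueeze(1), size=out_size,
                         mode='bilinear', align_corners=False).squeeze(1)
```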
4 Experiments

4.1 Dataset
ImageNet [10], which is among the most popular datasets in the visual XAI domain, was selected for the experiments. It contains R = 1000 classes, 1.3 million images for training and 50K images for testing. Due to the prohibitively high computational cost of perturbation-based approaches that are used for experimental comparison, only 2000 randomly-selected testing images are used for evaluation, following an evaluation protocol similar to [35].

4.2 Experimental Setup
The proposed L-CAM-Fm and L-CAM-Img are compared against the top-performing approaches in the literature for which publicly-available code is provided, i.e., Grad-CAM [28], Grad-CAM++ [8], Score-CAM [35], and RISE [22] (using the implementations of [2] for the first three and of [1] for the fourth). Two sets of experiments are conducted with respect to the employed DCNN classifier, i.e., one using VGG-16 [20] and another using ResNet-50 [12]. In both cases, pretrained models from the PyTorch model zoo [3] are used. The proposed approaches are trained using the loss of Eq. (6) with stochastic gradient descent, batch size 64 and learning rate 10−4. The learning rate decay factor per epoch and total number of epochs are 0.75, 7 for the VGG-16 experiment and 0.95, 25 for the ResNet-50 one. The loss regularization parameters (Eqs. (6), (7)) are chosen empirically using the training set in order to minimize the total loss (Eq. (6)) and at the same time bring the different loss components to the same order of magnitude (thus ensuring that all of them contribute similarly to the loss function): λ1 = 0.01, λ2 = 2, λ3 = 1.5, λ4 = 0.3. We should note that in all experiments the proposed methods exhibit a quite stable performance with respect to the above optimization parameters. During training, each image is rescaled and normalized as done during training of the original DCNN classifier, i.e., its shorter side is scaled to 256 pixels, then random-cropped to W × H × C, where W = H = 224 and C = 3 (the three RGB channels), and normalized to zero mean and unit variance. The same preprocessing is used during testing, except that center-cropping is applied. The size P × Q × K of the feature maps tensor at the last convolutional layer of the DCNN is P = Q = 14, K = 512 and P = Q = 7, K = 2048 for VGG-16 and ResNet-50, respectively. For the compared CAM approaches the SM of an input image is derived as follows [8,28]: the derived CAM is normalized to [0,1] using the min-max operator and transformed to the size of the input image using bilinear interpolation (Eq. 9). In contrast, for Score-CAM and RISE, as proposed in
their corresponding papers [22,35], the opposite procedure is followed to derive the SM, i.e., bilinear interpolation to the input image's size and then min-max normalization.

Table 1. Evaluation results for a VGG-16 (upper half) and ResNet-50 (lower half) backbone classifier using 2000 randomly-selected testing images of ImageNet. The best and 2nd-best performance for a given evaluation measure are shown in bold and underline, respectively.

VGG-16             AD(100%)↓  IC(100%)↑  AD(50%)↓  IC(50%)↑  AD(15%)↓  IC(15%)↑  #FW↓
Grad-CAM [28]        32.12      22.1       58.65      9.5      84.15     2.2        1
Grad-CAM++ [8]       30.75      22.05      54.11     11.15     82.72     3.15       1
Score-CAM [35]       27.75      22.8       45.6      14.1      75.7      4.3      512
RISE [22]             8.74      51.3       42.42     17.55     78.7      4.45    4000
L-CAM-Fm*            20.63      31.05      51.34     13.45     82.4      3.05       1
L-CAM-Fm             16.47      35.4       47        14.45     79.39     3.65       1
L-CAM-Img*           18.01      37.2       50.88     12.05     82.1      3          1
L-CAM-Img            12.96      41.25      45.56     14.9      78.14     4.2        1
L-CAM-Img†           12.15      40.95      37.37     20.25     74.23     4.45       1

ResNet-50          AD(100%)↓  IC(100%)↑  AD(50%)↓  IC(50%)↑  AD(15%)↓  IC(15%)↑  #FW↓
Grad-CAM [28]        13.61      38.1       29.28     23.05     78.61     3.4        1
Grad-CAM++ [8]       13.63      37.95      30.37     23.45     79.58     3.4        1
Score-CAM [35]       11.01      39.55      26.8      24.75     78.72     3.6     2048
RISE [22]            11.12      46.15      36.31     21.55     82.05     3.2     8000
L-CAM-Fm*            14.44      35.45      32.18     20.5      80.66     2.9        1
L-CAM-Fm             12.16      40.2       29.44     23.4      78.64     4.1        1
L-CAM-Img*           15.93      32.8       39.9      14.85     84.67     2.25       1
L-CAM-Img            11.09      43.75      29.12     24.1      79.41     3.9        1
4.3 Evaluation Measures
Two widely used evaluation measures, Average Drop (AD) and Increase in Confidence (IC) [8], are used in the experimental evaluation,

AD = (100/Υ) Σ_{i=1}^{Υ} max(0, f(X_i) − f(X_i ⊙ φ_ν(V_i))) / f(X_i),   (10)

IC = (100/Υ) Σ_{i=1}^{Υ} δ(f(X_i ⊙ φ_ν(V_i)) > f(X_i)),   (11)
where, f () is the original DCNN classifier, φν () is a threshold function to select the ν percent higher-valued pixels of the SM [11,35], δ() returns 1 when the input condition is satisfied and zero otherwise, Υ is the number of test images, Xi is the ith test image and Vi is the respective SM produced by the XAI method under evaluation. Intuitively, both measures assess the pixel-wise contribution to the classification confidence score. Specifically, AD is the average model’s confidence score drop when the masked test images are used, while, IC is the portion of
Fig. 3. Visualization of SMs with ν = 100% from various XAI methods superimposed on the original input image to produce class-specific visual explanations for the VGG-16 (columns 1 to 3) and ResNet-50 (columns 4 to 6) backbones.
test images for which the model's confidence score increased when the masked images are used. Lower AD and higher IC indicate a better explanation.
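The two measures can be sketched as follows for a batch of images, saliency maps and labels. The top-ν thresholding φ_ν is approximated here with a top-k pixel selection, and all names are illustrative assumptions rather than the authors' evaluation code.

```python
import torch

def ad_ic(model, images, sal_maps, labels, nu=0.15):
    """Sketch of Average Drop (Eq. (10)) and Increase in Confidence (Eq. (11)).

    images: B x C x H x W, sal_maps: B x H x W in [0,1], labels: B class indices.
    """
    with torch.no_grad():
        k = max(1, int(nu * sal_maps[0].numel()))
        thresh = sal_maps.flatten(1).topk(k, dim=1).values[:, -1]        # per-image cutoff
        masks = (sal_maps >= thresh[:, None, None]).float().unsqueeze(1)  # keep top-nu pixels
        p_orig = torch.softmax(model(images), dim=1).gather(1, labels[:, None]).squeeze(1)
        p_mask = torch.softmax(model(images * masks), dim=1).gather(1, labels[:, None]).squeeze(1)
    ad = 100 * torch.clamp(p_orig - p_mask, min=0).div(p_orig).mean()
    ic = 100 * (p_mask > p_orig).float().mean()
    return ad.item(), ic.item()
```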
4.4 Results
Comparisons and Ablation Study: The evaluation results in terms of AD(ν) and IC(ν) for different thresholds ν at φν (), i.e., ν = 100%, 50% and 15%, are depicted in the upper and lower half of Table 1 for VGG-16 and ResNet50, respectively. As an ablation study, we also report results for the proposed
Fig. 4. Two examples of using class-specific SMs (superimposed on the input image) produced by L-CAM-Img† on VGG-16 with ν = 100% for classes “pug” and “tiger cat” (left) and classes “soccer ball” and “Maltese” (right).
Fig. 5. Illustration of images and class-specific SMs (superimposed on the input image) whose ground truth and predicted labels are highly correlated.
methods when trained using only the CE loss, denoted as L-CAM-Fm* and L-CAM-Img*. The number of forward passes, #FW, needed to compute the SM for an input image at the inference stage, is also shown in the last column of this table. We should note that the auxiliary masks used by RISE (Fig. 1c) in the VGG-16 experiment are of size 7 × 7 [22] (which contrasts with the other approaches that use 14 × 14 feature maps for this experiment). For a fair comparison, we performed an additional experiment with the 7 × 7 feature maps after the last max pooling layer of VGG-16 using our L-CAM-Img, denoted as L-CAM-Img†. The results of this experiment are reported in the last row of the upper half of Table 1, under the L-CAM-Img results (i.e. the ones obtained using the 14 × 14 feature maps). Moreover, qualitative results for the SMs produced by the different methods for six sample input images are shown in Fig. 3, while class-specific SM results for two images containing instances of two different classes are provided in Fig. 4. From the obtained results we observe the following:
Fig. 6. Illustration of images and class-specific SMs (superimposed on the input image), which (although single-labeled) contain instances of two different ImageNet classes.
i) The proposed L-CAM-Img generally outperforms the gradient-based approaches and is comparable in AD, IC scores to the perturbation-based approaches Score-CAM and RISE, yet, contrarily to the latter, requires only one FW instead of 512-8000 at the inference stage.
ii) L-CAM-Img† using 7 × 7 feature maps achieves the best performance with VGG-16; our approach is learning-based and, as the experiments showed, it is easier for it to learn the combination of the feature maps in the lower-dimensional space. This is consistent with the typical behavior of learning methods when working with high-dimensional data that may lie in a low-dimensional manifold (which is often the case with images), i.e. the curse of dimensionality.
iii) L-CAM-Img outperforms L-CAM-Fm, but the latter still generally outperforms the gradient-based approaches.
iv) The proposed approaches provide smooth SMs focusing on important regions of the image, as illustrated in Fig. 3 (and also shown by the very good results obtained for ν = 15% in Table 1), and can produce class-specific explanations, as depicted in the examples of Fig. 4.
v) From the ablation study of employing only the CE loss (L-CAM-Fm*, L-CAM-Img*), we conclude that incorporating the two additional terms in the loss function (Eq. (6)) is very beneficial, especially for smaller values of ν.
Fig. 7. Images and SMs (superimposed on the input image) from the category "rugby ball". We observe that the classifier mostly learns the environment where the rugby activity takes place, e.g., football field, rugby players playing rugby, rather than the rugby ball itself. In the absence of these clues the classifier fails to correctly categorize the image, as shown in the examples of the last two columns.
Fig. 8. Images and SMs (superimposed on the input image) from the category "soup bowl". We see that the classifier has learned to assign to this category soup bowls that contain food; on the contrary, empty soup bowls are miscategorized into other classes such as tray and face powder.
Qualitative Analysis and Discussion: In the following, a qualitative study is performed using the proposed L-CAM-Img† . Specifically, the proposed approach
Fig. 9. Images and SMs (superimposed on the input image) from the category “sunglass”. We see that the classifier tends to focus on the sunglasses and the surrounding human face region; when the relevant human face region does not appear in the image or is occluded, the classifier infers the wrong category (e.g. shoe shop, loudspeaker).
is used to produce visual explanations with ν = 100% in order to understand why the VGG-16 classifier may fail to categorize a test image correctly. To this end, we group the different classification error cases into three categories:
i) Related classes: Some ImageNet classes are very close to each other both semantically and/or in appearance. For instance, there are classes such as "maillot" and "bikini", "seacoast" and "promontory", "schooner" and "yawl", "cap" and "coffee mug", and others. The same is also true for many animals, e.g., "miniature poodle" and "toy poodle", "panther" and "panthera tigris", "African elephant" and "Indian elephant", etc. A few representative examples of images belonging to classes of this category are shown in Fig. 5. In each column of this figure, the input image is presented along with the SM (superimposed on the input image) corresponding to the ground truth and the predicted labels. Moreover, under each SM we provide the corresponding class name. From these examples, an interesting conclusion is that the SMs corresponding to the ground truth and predicted class are similar, i.e., in both cases the classifier focuses on the same image regions to infer the label of the image.
ii) Multilabel images: ImageNet is a single label dataset, i.e., each image is annotated with only one class label. However, some images may contain instances belonging to more than one ImageNet class. For instance, we have identified images containing together instances of the classes "screw" and "screwdriver", "warplane" and "aircraft carrier", "pier" and "boathouse",
and others. For these cases, the classifier may correctly detect the instance of a class that is visible but does not correspond to the label of the image, which is considered a classification error. A few such examples are shown in Fig. 6. In contrast to the previous classification error category (related classes), we observe that now the SMs of the ground truth and predicted class differ significantly and usually identify a different region of the image.
iii) Class bias: Finally, we performed an analysis of the classification results in order to discover possible biases in specific ImageNet classes and understand how these biases affect the classifier decisions. To this end, representative examples from three classes are depicted in Figs. 7, 8, 9. From Fig. 8 we observe that the classifier has difficulty inferring the "soup bowl" label when an empty soup bowl is depicted. This finding, together with a visual inspection of the positive training samples for this class, which reveals that the training set is dominated by images depicting soup bowls filled with food, indicates that the classifier has in fact learned to detect mostly the "soup bowl filled with food" class instead of the more general "soup bowl". Similarly, from the results shown in Figs. 7 and 9 we get clear indications that in place of target class "rugby ball" the classifier to a large extent has learned to detect a broader "rugby game" class; and in place of target class "sunglass" the classifier has learned to detect the narrower "human face wearing sunglasses" class: when the sunglasses are shown but the human face is not visible, the classifier fails.
5 Conclusion
Two new visual XAI methods were presented, which, in contrast to current approaches, train an attention mechanism to produce explanations. We showed that it is possible to learn the feature maps’ weights for deriving very good CAMbased explanations. We also performed a qualitative study using the explanations produced by our approach to shed light on the reasons why an image is misclassified, obtaining interesting conclusions, including the discovery of possible biases in the training set. As future work we plan to utilize the explanation masks for automatic bias detection, e.g., extending the work presented in [29]. Acknowledgments. This work was supported by the EU Horizon 2020 programme under grant agreements H2020-101021866 CRiTERIA and H2020-951911 AI4Media.
References
1. RISE implementation. https://github.com/eclique/RISE. Accessed 01 Feb 2022
2. Score-CAM with pytorch. https://github.com/yiskw713/ScoreCAM. Accessed 01 Feb 2022
3. TORCHVISION.MODELS. https://pytorch.org/vision/stable/models.html. Accessed 01 Feb 2022
4. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim, B.: Sanity checks for saliency maps. In: Proceedings of NIPS, Montréal, Canada, pp. 9525–9536 (2018)
5. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR, San Diego, CA, USA, pp. 2921–2929 (2015)
6. Bai, X., et al.: Explainable deep learning for efficient and robust pattern recognition: a survey of recent developments. Pattern Recognit. 120, 108102 (2021)
7. Barredo Arrieta, A., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
8. Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: Proceedings of IEEE WACV, Lake Tahoe, NV, USA, pp. 839–847 (2018)
9. Dabkowski, P., Gal, Y.: Real time image saliency for black box classifiers. In: Proceedings of NIPS, Long Beach, California, USA, pp. 6970–6979 (2017)
10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of IEEE CVPR, Miami, FL, USA, pp. 248–255 (2009)
11. Desai, S., Ramaswamy, H.G.: Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization. In: Proceedings of IEEE WACV, Snowmass Village, CO, USA, pp. 972–980 (2020)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of IEEE CVPR, Las Vegas, NV, USA, pp. 770–778 (2016)
13. Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W.: XxAI - beyond explainable artificial intelligence. In: Proceedings of ICMLW, Vienna, Austria, pp. 3–10 (2020)
14. Hu, B., Vasu, B., Hoogs, A.: X-MIR: explainable medical image retrieval. In: Proceedings of WACV, Waikoloa, HI, USA, pp. 440–450 (2022)
15. Jiang, P.T., Zhang, C.B., Hou, Q., Cheng, M.M., Wei, Y.: LayerCAM: exploring hierarchical class activation maps for localization. IEEE Trans. Image Process. 30, 5875–5888 (2021)
16. Jung, D., Lee, J., Yi, J., Yoon, S.: iCaps: an interpretable classifier via disentangled capsule networks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 314–330. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7_19
17. Jung, S., Byun, J., Shim, K., Hwang, S., Kim, C.: Understanding VQA for negative answers through visual and linguistic inference. In: Proceedings of IEEE ICIP, Virtual Event/Anchorage, Alaska, USA, pp. 2873–2877 (2021)
18. Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for self-driving vehicles. In: Proceedings of ECCV, Munich, Germany, pp. 577–593 (2018)
19. Li, K., Wu, Z., Peng, K.C., Ernst, J., Fu, Y.: Tell me where to look: guided attention inference network. In: Proceedings of IEEE CVPR, Salt Lake City, UT, USA, pp. 9215–9223 (2018)
20. Liu, S., Deng, W.: Very deep convolutional neural network based image classification using small training sample size. In: Proceedings of ACPR, Kuala Lumpur, Malaysia, pp. 730–734 (2015)
21. Muddamsetty, S.M., Mohammad, N.S.J., Moeslund, T.B.: SIDU: similarity difference and uniqueness method for explainable AI. In: Proceedings of IEEE ICIP, Virtual Event, pp. 3269–3273 (2020)
22. Petsiuk, V., Das, A., Saenko, K.: RISE: randomized input sampling for explanation of black-box models. In: Proceedings of BMVC, Newcastle, UK (2018)
23. Plummer, B.A., Vasileva, M.I., Petsiuk, V., Saenko, K., Forsyth, D.: Why do these match? Explaining the behavior of image similarity models. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 652–669. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_38
24. Prabhushankar, M., Kwon, G., Temel, D., AlRegib, G.: Contrastive explanations in neural networks. In: Proceedings of IEEE ICIP, Virtual Event, pp. 3289–3293 (2020)
25. Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K. (eds.): Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science, vol. 11700. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6
26. Sattarzadeh, S., et al.: Explaining convolutional neural networks through attribution-based input sampling and block-wise feature aggregation. In: Proceedings of AAAI, Virtual Event, pp. 11639–11647 (2021)
27. Sattarzadeh, S., Sudhakar, M., Plataniotis, K.N., Jang, J., Jeong, Y., Kim, H.: Integrated Grad-CAM: sensitivity-aware visual explanation of deep convolutional networks via integrated gradient-based scoring. In: Proceedings of IEEE ICASSP, Toronto, ON, Canada, pp. 1775–1779 (2021)
28. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of IEEE ICCV, Venice, Italy, pp. 618–626 (2017)
29. Serna, I., Peña, A., Morales, A., Fiérrez, J.: InsideBias: measuring bias in deep networks and application to face gender biometrics. In: Proceedings of IEEE ICPR, Virtual Event/Milan, Italy, pp. 3720–3727 (2020)
30. Shi, X., Khademi, S., Li, Y., van Gemert, J.: Zoom-CAM: generating fine-grained pixel annotations from image labels. In: Proceedings of IEEE ICPR, Virtual Event/Milan, Italy, pp. 10289–10296 (2020)
31. Sudhakar, M., Sattarzadeh, S., Plataniotis, K.N., Jang, J., Jeong, Y., Kim, H.: AdaSISE: adaptive semantic input sampling for efficient explanation of convolutional neural networks. In: Proceedings of IEEE ICASSP, Toronto, ON, Canada, pp. 1715–1719 (2021)
32. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of ICML, Sydney, NSW, Australia, vol. 70, pp. 3319–3328 (2017)
33. Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S.M.A., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: Proceedings of NIPS, Virtual Event, vol. 34, pp. 200–212 (2021)
34. Uehara, K., Murakawa, M., Nosato, H., Sakanashi, H.: Multi-scale explainable feature learning for pathological image analysis using convolutional neural networks. In: Proceedings of IEEE ICIP, Virtual Event, pp. 1931–1935 (2020)
35. Wang, H., et al.: Score-CAM: score-weighted visual explanations for convolutional neural networks. In: Proceedings of IEEE/CVF CVPRW, Virtual Event, pp. 111–119 (2020)
36. Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of IEEE CVPR, Las Vegas, NV, USA, pp. 2921–2929 (2016)
Self-supervised Orientation-Guided Deep Network for Segmentation of Carbon Nanotubes in SEM Imagery
Nguyen P. Nguyen1, Ramakrishna Surya2, Matthew Maschmann2, Prasad Calyam1, Kannappan Palaniappan1, and Filiz Bunyak1(B)
1 Department of Electrical Engineering and Computer Science, University of Missouri-Columbia, Columbia, MO, USA
[email protected], {calyam,pal,bunyak}@missouri.edu
2 Department of Mechanical and Aerospace Engineering, University of Missouri-Columbia, Columbia, MO, USA
[email protected], [email protected]
Abstract. Electron microscopy images of carbon nanotube (CNT) forests are difficult to segment due to the long and thin nature of the CNTs; the density of the CNT forests resulting in CNTs touching, crossing, and occluding each other; and the low signal-to-noise ratio of electron microscopy imagery. In addition, due to image complexity, it is not feasible to prepare training segmentation masks. In this paper, we propose CNTSegNet, a dual loss, orientation-guided, self-supervised, deep learning network for CNT forest segmentation in scanning electron microscopy (SEM) images. Our training labels consist of weak segmentation labels produced by intensity thresholding of the raw SEM images and self labels produced by estimating the orientation distribution of CNTs in these raw images. The proposed network extends a U-net-like encoder-decoder architecture with a novel two-component loss function. The first component is dice loss computed between the predicted segmentation maps and the weak segmentation labels. The second component is mean squared error (MSE) loss measuring the difference between the orientation histogram of the predicted segmentation map and the original raw image. A weighted sum of these two loss functions is used to train the proposed CNTSegNet network. The dice loss forces the network to perform background-foreground segmentation using local intensity features. The MSE loss guides the network with global orientation features and leads to refined segmentation results. The proposed system needs only a few-shot dataset for training. Thanks to its self-supervised nature, it can easily be adapted to new datasets. Keywords: Semantic segmentation · Self-supervised learning · Carbon nanotubes (CNT) · Electron microscopy
1 Introduction
Carbon nanotubes (CNTs), discovered in 1991 [20], have an intriguing combination of mechanical, thermal, electrical, and chemical properties [13,21].
The unique physical properties of CNTs are a result of the hexagonal sp2-bonded graphene sheets that comprise their walls. Single-walled CNTs (SWNTs) may exhibit metallic or semiconducting properties depending on their chirality. SWNT transistor devices fabricated with sub-10 nm channel lengths have exhibited a high current density of 2.4 mA/micron and a low operating voltage of 0.5 V. Multi-walled CNTs (MWNTs) are metallic in nature, with diameters ranging from approximately 2–40 nm. MWNTs have been spun together to form yarns [43] that are electrically conductive and strong, yet capable of being tied into a knot.
Fig. 1. Carbon nanotube (CNT) pillar imaged using scanning electron microscope (SEM). (A-left) Full pillar view. (B-right) Zoomed side view of the CNT pillar.
Isolated, individual CNTs are difficult to synthesize and are often impractical for device-level integration. CNTs are more frequently grown as CNT forests – high density CNT populations synthesized on a support substrate. Crowding within a growing CNT population forces CNTs to orient vertically within a forest, normal to the growth substrate. Interactions between contacting CNTs generate persistent attractive van der Waals bonds which resist mechanical loads generated within the forest during synthesis. Individual CNTs within a forest deform in response to mechanical loading, leading to a CNT forest morphology that resembles an open-cell foam. The properties of the CNT forests are vastly diminished when compared to those of an individual CNT. For example, the elastic modulus of an individual CNT is in the order of 1 TPa, while the compressive elastic modulus of a CNT forest may be as low as 10 MPa [32] – similar to that of natural rubber. The deformation mechanisms of compressed CNT forests are highly variable [7,8,18,31–33,39,42] and are thought to result from variations in CNT forest morphology generated during cooperative synthesis [3,28,37]. CNT forests are candidates for the dry spinning of conductive, high-strength fibers [23,43], piezoresistive sensing [29,30,36], electrochemical energy storage [9,12], and thermal interface materials [10,11]. Testing for physical properties of CNT forests often requires destruction of the forest, which prevents further data collection. A method to determine physical properties of CNT forests indirectly using images of the said forest would help overcome the problem (Fig. 1). Thus a thorough analysis of CNT images is a critical step in determining the CNT forests' physical characteristics. The data obtained could then be used to determine how growth
parameters can be modified to obtain a CNT forest with favorable properties. CNT image analytics aims to quantify CNT attributes such as orientation, linearity, density, diameter, etc. The first step towards CNT feature characterization is segmentation. Earlier works on CNT image analytics relied on classical image processing approaches. In [14] thresholding was used to produce partial CNT masks to determine CNT diameters. In [41] class-entropy maximization was used to segment CNT images with modest magnification levels (800X-4000X). [40] thresholded image pixels into three classes: background, CNT, and uncertain areas. Feature vectors of the background and CNT classes extracted from small image patches were used to train a multi-layer perceptron neural network. The network then classified pixels of the uncertain area as either background or CNTs. However, this strategy was only effective with extremely sparse, non-overlapping CNTs in small patches. In [15,16] synthetic CNT forest images obtained from physics-based simulation have been analyzed using machine learning approaches to predict mechanical properties. While not developed for CNT image segmentation, recent works on detection and segmentation of other curvilinear structures such as fibers may be of interest for CNT image analytics. In [27] a 3D deep neural network pipeline was proposed for segmentation of short and thick glass fibers with acceptable density levels. The network was trained with a combination of synthetic data and real CT scan data with associated ground truths. In [4] an improved pipeline with deep center regression and geometric clustering was proposed for this type of glass fiber data. As shown in Fig. 1B, our dataset contains long, thin, and dense CNT fibers. Clustering or thresholding-based, unsupervised segmentation methods lead to limited success. Because of data complexity and ambiguity, manual labeling is not feasible. Thus, there is a shortage of high-quality datasets with associated labels that can enable the use of supervised learning based approaches. Self-supervised learning [22] has emerged as an approach to learn good representations from unlabeled data and to perform fine-tuning with labeled features at the down-stream tasks. In this paper, we present a self-supervised deep learning network for segmentation of CNT forests in scanning electron microscopy (SEM) images. The imaged CNT forests were grown using an in-situ SEM synthesis process based on chemical vapor deposition (CVD) [26]. The proposed deep segmentation network relies on two complementary training labels. The first label, the intensity thresholded raw input image, serves as a weak label that leads the network to perform binary CNT segmentation. The second label, the CNT orientation histogram calculated directly from the raw input image, constrains the segmentation process by enforcing the network to preserve orientation characteristics of the original image. Experimental results demonstrate refined segmentation results without supervision or the need for manual image annotation.
2 Methods
In order to enable characterization of CNT properties within dense CNT forests, we have developed a self-supervised segmentation method. According to [22],
self-supervised learning trains a model by using pseudo labels that are generated automatically without the requirement for human annotations. The training procedure consists of two steps: a pretext task and a downstream task. Feature representation is learned in the pretext task first, whereas model adaptation is completed in the downstream task. The downstream task also evaluates the quality of features learned by the pretext task. Our proposed system consists of a novel deep neural network with two complementary loss functions which correspond to the pretext task and the downstream task. The following subsections describe the network architecture, loss functions, and generation of training labels.

2.1 Network Architecture
We have built a self-supervised deep neural network with two complementary loss functions, named CNTSegNet. The architecture of this network is similar to the classical U-Net [38] architecture with encoder and decoder branches. The input to CNTSegNet consists of a single-channel 2D grayscale image. The output of CNTSegNet consists of a binary, single-channel, 2D image. During training and testing, SEM images of CNT forests are fed to the network to generate binary segmentation masks of CNTs. The network's encoder uses the ResNet34 model [17] as the backbone. It is a fully convolutional deep neural network with shortcut connections to learn residual features. The encoder includes three main layers with 16, 32, and 64 filters respectively. Figure 2 depicts the network architecture and the network training process. The network generates a pixel-wise class likelihood map which is binarized at the level of 0.7 to produce a binary segmentation mask. The orientation histogram for the prediction is computed from the generated class likelihood map. The proposed network is trained with a weighted sum of two complementary loss functions described below.
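One possible way to instantiate such an encoder-decoder is sketched below with the segmentation_models_pytorch package. This is an assumption on our part (the paper does not state its implementation, and its encoder uses fewer filters than a standard ResNet-34), so the snippet is only a structural approximation.

```python
import torch
import segmentation_models_pytorch as smp

# U-Net-style encoder-decoder with a ResNet-34 encoder, a single-channel
# grayscale input and a single-channel class-likelihood output.
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=1,
    classes=1,
)

x = torch.randn(1, 1, 768, 768)           # one SEM image patch
likelihood = torch.sigmoid(model(x))       # pixel-wise class likelihood map
mask = (likelihood > 0.7).float()          # binarized at 0.7 as in the paper
```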
Fig. 2. Network architecture and training pipeline for the proposed dual loss and selfsupervised network CNTSegNet. The network is based on an encoder-decoder architecture similar to U-Net segmentation network [17, 19, 38], but includes a second loss function and involves a self-supervised training scheme.
A) CNT Segmentation Loss: The first loss component aims to drive the network to perform segmentation prediction to match a given binary segmentation mask. Given a prediction mask Mask_pred and a training mask Mask_train, dice loss is computed using the following equation

Loss_Dice(Mask_pred, Mask_train) = 1 − 2 × |Mask_pred ∩ Mask_train| / (|Mask_pred| + |Mask_train|).   (1)
Dice loss is a pixel-wise function that matches local image features in the spatial domain. In this case, the dice loss by itself is not sufficient to generate reliable segmentation masks, since the network is trained with automatically generated coarse weak labels rather than precise ground truth segmentation masks. This weakly supervised learning step plays the role of a pretext task. B) CNT Orientation Loss: To compensate for the weak labels, we introduced a second loss component. This second loss aims to drive the network to refine the predicted segmentation masks by enforcing the output to preserve the orientation patterns of the input. The CNT forest orientation pattern is encoded using orientation histograms calculated without human annotation using the frequency domain methods proposed in [6,24] and briefly described in Sect. 2.2. Given the input and output orientation histograms h_in and h_pred, the orientation loss is computed as

Loss_MSE(h_pred, h_in) = Σ_{b=1}^{n} (h_pred(b) − h_in(b))^2,   (2)
where n and b refer to the number of bins in the histogram and the index of a histogram bin, respectively. Optimizing this orientation loss will push the arrangement of foreground pixels in the segmentation mask towards the orientation of corresponding pixels in the raw images. This operation plays the role of a downstream task in the self-supervised learning approach. C) Total Loss: The proposed network is trained with a total loss computed as the weighted sum of the dice segmentation and MSE orientation losses

Loss_Total = k1 × Loss_Dice + k2 × Loss_MSE,   (3)
where k1 and k2 refer to scalar weights. This total loss combines local and global image features extracted using spatial and frequency domain operations to preserve CNT morphology, particularly orientation patterns, during segmentation.
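A minimal sketch of this combined loss, assuming a differentiable `orientation_histogram()` helper as described in Sect. 2.2; names, shapes and the eps constant are illustrative assumptions.

```python
import torch

def total_loss(pred_logits, weak_mask, raw_image, orientation_histogram,
               k1=0.6, k2=1e-7, eps=1e-6):
    """Weighted dice + orientation-MSE loss (Eqs. (1)-(3))."""
    probs = torch.sigmoid(pred_logits)
    inter = (probs * weak_mask).sum()
    dice = 1.0 - 2.0 * inter / (probs.sum() + weak_mask.sum() + eps)   # Eq. (1)
    h_pred = orientation_histogram(probs)       # histogram of the prediction
    h_in = orientation_histogram(raw_image)     # histogram of the raw input
    mse = ((h_pred - h_in) ** 2).sum()                                  # Eq. (2)
    return k1 * dice + k2 * mse                                         # Eq. (3)
```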
2.2 Generation of Training Labels
The two sets of training labels, the weak CNT segmentation masks and the CNT orientation histograms are generated as follows to enable self-supervised network training.
A) Weak CNT Segmentation Masks: To produce weak segmentation labels we explored two thresholding-based unsupervised methods. The first method, Otsu [35], is a global thresholding strategy that calculates an "ideal" threshold by maximizing the inter-class variance between background and foreground classes. The results of the Otsu approach are shown in the second column of Fig. 3. Because of the increasing illumination, CNTs in the lower image regions are prominent while those in the upper regions fade into the background. The second method, adaptive thresholding, generates improved segmentation results if the background intensity varies widely. Adaptive thresholding determines threshold values in local regions by computing the weighted mean of the local neighborhood minus an offset value [2]. The third column of Fig. 3 illustrates the outcomes of adaptive thresholding. These segmentation masks provide a greater level of detail. For training of the proposed deep neural network, we used weak segmentation masks generated using adaptive thresholding.
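A sketch of how such weak labels can be produced with OpenCV; the block size and offset values are illustrative and not taken from the paper.

```python
import cv2

def weak_labels(gray_u8, block_size=31, offset=5):
    """Weak segmentation labels from a grayscale SEM image (uint8 array).

    Returns an Otsu (global) mask and an adaptive mask computed as the
    Gaussian-weighted mean of the local neighborhood minus an offset.
    """
    _, otsu = cv2.threshold(gray_u8, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    adaptive = cv2.adaptiveThreshold(gray_u8, 255,
                                     cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, block_size, offset)
    return otsu, adaptive
```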
Fig. 3. Thresholding of CNT forest SEM images. Raw SEM image (first column), Otsu thresholding [35] (second column), adaptive thresholding [2] (last column).
B) CNT Forest Orientation Histograms: CNT forests’ physical properties are strongly affected by orientation and alignment of the CNTs forming them. Alignment of the CNTs can enhance the mechanical, thermal, and electrical characteristics. Birefringence and linear dichroism, fluorescence polarization, and polarized Raman spectroscopy are some of the most commonly used methods to evaluate CNT alignment. [6,24]. In this study, we employed an image-based CNT forest orientation distribution estimation approach using radial sum method described in [6,24]. As the orientation feature is extracted by summing operation, it is possible to incorporate this feature to the computational pipeline and perform back-propagation to train our deep neural network. This approach can estimate the orientation of CNTs in either raw images or their binary masks. Figure 4 illustrates the processes required to estimate the orientation distribution of CNTs. Initially, the input image (raw image or binary mask) (Fig. 4A-B) is transformed into
Fourier space (Fig. 4D). The output is masked by a circle (Fig. 4E) divided into a number of bins (e.g. 360 bins, corresponding to 360°). The total intensity in each bin indicates the count of image pixels at the associated angle (Fig. 4C). This histogram represents the practical orientation distribution of CNTs in the input image. It is possible to fit this practical distribution as a mixture of several theoretical distributions as shown in Fig. 4F.
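A NumPy sketch of this radial-sum estimation under stated assumptions (360 one-degree bins and a circular mask of radius equal to half the smaller image side); the exact binning and normalization used in [6,24] may differ.

```python
import numpy as np

def orientation_histogram(image, n_bins=360):
    """Radial-sum orientation histogram in the Fourier domain (cf. [6,24])."""
    f = np.fft.fftshift(np.fft.fft2(image.astype(np.float64)))
    mag = np.abs(f)                                   # FFT magnitude
    h, w = mag.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot(yy - cy, xx - cx)
    theta = np.degrees(np.arctan2(yy - cy, xx - cx)) % 360.0
    inside = r <= min(h, w) / 2.0                     # circular mask
    bins = np.clip((theta[inside] * n_bins / 360.0).astype(int), 0, n_bins - 1)
    return np.bincount(bins, weights=mag[inside], minlength=n_bins)
```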
Fig. 4. CNT forest orientation distribution estimation using robust frequency-domain orientation estimation method described in [6, 24].
3 Experimental Results

3.1 Datasets
The CNT forests used in this study were grown using chemical vapor deposition (CVD) [26] and were imaged using a FEI Quanta environmental SEM. The collected images had a resolution of 1536 × 1094 pixels and were acquired with a pixel dwell time of 10 µs. A CNT forest can grow up to a height of several millimeters and often has different morphology at different locations of the forest pillar [5], thus it is important to collect images from different locations to obtain data encapsulating all the morphologies in the CNT forest. SEM images in this study were collected at a magnification of 50,000X. Care was taken to prevent a large overlap between images. 110 image patches of size 768 × 768 were used, with 34 of them going into the training set and the remaining 76 patches going into the test set.

3.2 Training Process
We trained the proposed deep neural network with the dice and MSE loss functions separately and with the combined (dice + MSE) loss function to explore the effects of each loss component. During training, we maintained the same
Fig. 5. Segmentation results for CNTSegNet with different loss functions. Raw SEM image of a sample CNT forest (a). Thresholded image [2] used as a weak segmentation label (b). Class likelihood maps and binary masks predicted using only the dice loss (c-d), only the MSE loss applied to orientation histogram (e-f), the combined dice and MSE loss functions (g-h).
learning rate and the same type of optimizer but varied the number of iterations for each loss function due to their distinct convergence processes.
Fig. 6. Inspection of intensity profiles. (a) Raw SEM image; (b) mask predicted using CNTSegNet; (c) weak-label used to train the network; (d) associated intensity profiles at a sample image row.
A) Training with Dice Segmentation Loss. The network was first trained with just the dice segmentation loss component for 10 epochs. We used the Adam optimizer with a learning rate of 5e-4. The final average dice score compared to the weak labels was 0.86. Figure 5c and d show this network's prediction as a class likelihood map and as a binary mask. The binary mask produced by the network is smoother and contains less tiny debris compared to the thresholded image due to the effects of the convolution filters and upsampling layers of the network. However, since the network was trained with thresholded images rather than ground truth masks, the prediction was coarse, resulting in the merging of many neighboring CNTs. B) Training with MSE Orientation Loss. In another task, the network was trained using just the MSE loss between the orientation vectors of the raw input image and the network prediction. For 8 epochs, we ran the Adam optimizer with a learning rate of 5e-4. In the absence of segmentation loss, the final average dice score for the network predictions was only 0.53 compared to the weak segmentation labels. This is due to the fact that the orientation vector includes just the summed global features. The loss function lacked the necessary local (pixel-wise) information needed to drive the segmentation mask.
Figure 5e and f show this network's prediction as a class likelihood map and as a binary mask. As expected, this approach dropped the dice scores significantly from 0.86 to 0.53. C) Training with Two-Component Loss Function. In the final task, we trained the proposed network with a combined loss function consisting of the dice segmentation and MSE orientation losses. We utilized an Adam optimizer with a learning rate of 5e-4. In the first 6 epochs, we initialized the segmentation output by training only with the dice loss. To regularize this output mask, for the following 3 epochs, we continued to train the network with a weighted sum of the dice segmentation and MSE orientation losses. As the maximum value of the dice loss is 1 but the magnitude of the orientation loss reaches much larger values at the levels of 1e+7 (as seen in Fig. 4C & F), we set the weight of the dice loss to 0.6, and the weight of the MSE loss to 1e-7 for a more balanced influence. This scheme resulted in an average dice score of 0.84 compared to the weak labels. Figure 5g and h show the prediction class likelihood map and associated binary mask obtained using this dual loss network. In this scheme, the dice loss guides the network to match the segmentation mask of the weak labels, while the MSE loss regularizes the segmentation process and guides the network to preserve the orientation properties of its input, resulting in finer segmentation details. Figure 6 aims to provide further visual insight. We plotted single-row intensity profiles (row=300) for a sample raw SEM image, the corresponding weak segmentation label used for network training, and the binary mask predicted using the proposed CNTSegNet network. The intensity peaks in the original signal correspond to individual CNTs. These plots show that the weak label tends to merge the neighboring peaks in the original signal, resulting in wider/thicker foreground blocks. We can observe that the CNTSegNet prediction can better detect these intensity peaks, resulting in narrower/thinner foreground blocks and more refined segmentation masks.
3.3 Segmentation Evaluation
We conducted inference on the test set after training the proposed CNTSegNet network with the combined loss function. We compared the performance of the CNTSegNet to three other segmentation methods, adaptive intensity thresholding [2], k-means clustering [1], and a recent unsupervised deep learning-based segmentation method [25]. K-means [1] is an unsupervised clustering method that partitions a dataset into K clusters where each data item belongs to the cluster with the closest centroid. We used k-means clustering to cluster the intensity levels into two clusters corresponding to CNTs and background. [25] is a novel unsupervised convolutional neural network using differentiable feature clustering to enable unsupervised image segmentation. The network includes a normalizing function to generate a response map from the network output, and an argument max function to select a cluster for each pixel. For a single
Fig. 7. Segmentation results for four sample images (columns 1–4). (a) Raw SEM images. (b) Segmentation masks obtained using the proposed CNTSegNet network. (c) Adaptive thresholding results [2]. (d) K-means clustering results [1]. (e) Unsupervised deep segmentation results using [25].
Fig. 8. Segmentation masks (left) and associated edge maps (middle), associated signed distance transform maps (right) for the proposed CNTSegNet network, adaptive thresholding [2], k-means clustering[1], and unsupervised deep segmentation [25] results.
test image, this method first trains the network to minimize the difference of the network output and the argument-max output, then it uses the argument-max output as the segmentation mask. Figure 7 depicts the segmentation results of the four aforementioned methods for four sample images. It can be observed that CNTSegNet results in more refined segmentation masks with better recall of the individual CNTs compared to all three methods. Compared to k-means clustering [1] and unsupervised deep segmentation [25], CNTSegNet is also more robust to illumination variations. To measure and compare segmentation quality we computed five unsupervised measures: orientation loss, edge coverage, average thickness, average separation, and distance entropy. In order to compute these measures, two intermediate representations, edge maps E(x, y) and signed distance transforms D(x, y), were generated as shown in Fig. 8. An edge map is a binary image identifying foreground-background transitions. The signed distance transform [34] assigns to each pixel of the foreground its distance to the closest background point, and to each pixel of the background the opposite of its distance to the closest foreground point. The signed distance transform can be used to measure thickness and spatial separations of foreground structures in an image. a) Orientation loss is computed between the original raw image and the corresponding segmentation mask as in Eq. 2. Lower values indicate more similar orientation patterns. b) Edge coverage is measured as the ratio of edge pixels to the total image area, where nFG and nBG indicate the number of foreground and background pixels, respectively. Higher edge coverage is an indication of more details in the segmentation mask: Edge coverage = (100 / (nFG + nBG)) Σ_{x,y} E(x, y).   (4) c) Average thickness is measured as the average distance over foreground pixels, where D(x, y) takes positive values: Average thickness = (1/nFG) Σ_{D(x,y)>0} D(x, y).   (5)
d) Average separation is measured as average distance on background pixels. For CNT forest images containing dense clusters of thin CNTs, lower average thickness and lower average separation indicate finer segmentation and higher recall of CNTs in a segmentation mask. 1 Average separation = − D(x, y) (6) nBG D(x,y)T l(s, z) = (9) max(0, s) z ≤ T Lcls =
(1/N_iter) Σ_{i=0}^{N_iter} Σ_{(x,c)∈M_test} || l(x ∗ f, z_c) ||^2 .   (10)
Here, according to the label confidence z, the threshold T defines the target and background regions. N_iter refers to the number of iterations. z_c is the target center label that is set from a Gaussian function. We only penalize positive confidence values for the background if z ≤ T. If z > T, we take the difference between them. Here, f denotes the weights obtained by the model predictor network. For bounding box estimation, following [8], it extends the training procedure to image sets by computing the modulation vector on the first frame in M_train and sampling candidate boxes from images in M_test. The bounding box loss L_bbox is computed as the mean squared error between the predicted BBox B_i and the groundtruth BBox B̄_i in M_test,

L_bbox = Σ_{i∈M_test} MSE(B_i, B̄_i).   (11)
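A hedged PyTorch sketch of the hinge-like error of Eq. (9) and the classification loss of Eq. (10); tensor layouts and names are our own assumptions rather than the authors' implementation.

```python
import torch

def hinge_error(scores, labels, T=0.05):
    # Eq. (9): s - z where the label confidence z > T (target region),
    # max(0, s) where z <= T (background region).
    return torch.where(labels > T, scores - labels, scores.clamp(min=0))

def classification_loss(score_maps, label_maps, T=0.05):
    # Eq. (10): squared hinge error accumulated over the optimizer iterations
    # and averaged by their number; score_maps / label_maps: N_iter x B x H x W.
    n_iter = score_maps.shape[0]
    return sum(hinge_error(s, z, T).pow(2).sum()
               for s, z in zip(score_maps, label_maps)) / n_iter
```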
3.5 Implementation Details
Training. Our tracker is trained in an end-to-end fashion. 1) Template and Search Area. Different from DeT [31], we use RGB images and raw depth images as inputs. To adapt the depth image to the pre-trained feature extraction model,
Fig. 6. Visualization of the attention maps for a representative frame. From left to right, they are input depth and RGB frames, feature maps in the search region after CMIM and SPM, and the output prediction, respectively.
we copy the raw depth to stack three channels. To incorporate background information, the template and search region is obtained by adding a random translation and cropping a square patch centered on the target, which is 5^2 times the target area. These regions are then resized to 288 × 288 pixels. Horizontal flip and brightness jittering are used for data augmentation. 2) Initialization. The backbone ResNet50 is initialized with weights pre-trained on ImageNet. The weights of the IoU modulation and IoU predictor in the regression head are initialized using [8]. The model predictor in the classification head is initialized using [2]. N_iter = 5, T = 0.05. We use V = 0.01 to initialize the information filter. The learnable weights between the two modalities, α and β, are initialized to 0.5. The network is optimized using the AdamW optimizer with a learning rate decay of 0.2 every 15th epoch and weight decay 1e−4. The target classification loss weight is set to λ = 1e−2. Inference. Different from the training phase, we track the target in a series of subsequent frames given the target position in the first frame. Once the annotated first frame is given, we use the data enhancement strategy to build an initial set of 15 samples to refine the model predictor. With subsequent frames of high confidence, the model predictor can continuously refine the classification head and enhance the ability to distinguish the target from the background. Bounding box estimation is performed as in the training phase.
3.6 Visualization
To explore how the dual-fused modality-aware network works in our framework, we visualize the attention maps of the fusion modules in a representative frame, as shown in Fig. 6. We visualize the attention maps of the modality-shared and modality-specific features from the output of the CMIM and SPM, respectively. As shown, we can obtain a very informative feature map after the proposed CMIM. Based on cross-modal attention, the modality-shared feature is generated to focus on the candidates, i.e., the box and the face. The SPM then uses the modality-specific information to suppress unreliable candidates. The combination of the two kinds of features in the SPM produces high-quality representations. This
proves that the shared and specific features can be well fused by our CMIM and SPM, leading to more discriminative feature representations for tracking.
4 Experiments

4.1 Experiment Settings
Datasets. To validate the effectiveness of the proposed tracker, we evaluate it on three public RGBD tracking datasets, including STC [29], CDTB [17] and DepthTrack [31]. Among them, STC is a compact short-term benchmark containing 36 sequences, while CDTB and DepthTrack are long-term ones containing 80 sequences and 50 sequences, respectively. Furthermore, we carry out an attribute-based evaluation to study the performance under different challenges. Evaluation Metrics. For STC, a representative short-term evaluation protocol, success rate, is used. Given the tracked bounding box b_t and the groundtruth bounding box b_a, the overlap score is defined as S = |b_t ∩ b_a| / |b_t ∪ b_a|, which is indeed the intersection over union of the two regions. With a series of thresholds from 0 to 1, the area-under-curve (AUC) is calculated as the success rate. For CDTB and DepthTrack, long-term evaluation metrics, i.e., tracking precision and recall, are used [32]. They can be obtained by: Pr(τ_θ) =
(1/N_p) Σ_{A_t(τ_θ)≠∅} Ω(A_t(τ_θ), G_t),   Re(τ_θ) =
(1/N_g) Σ_{G_t≠∅} Ω(A_t(τ_θ), G_t),   (12)
where Pr(τ_θ) and Re(τ_θ) denote the precision (Pr) and the recall (Re) over all frames, respectively. The F-score can then be obtained by F(τ_θ) = 2 Re(τ_θ) Pr(τ_θ) / (Re(τ_θ) + Pr(τ_θ)). N_p denotes the number of frames in which the target is predicted visible, and N_g denotes the number of frames in which the target is indeed visible (a computational sketch follows the tracker list below). Compared Trackers. We compare the proposed tracker with the following benchmarking RGBD trackers:
– Traditional RGBD trackers: PT [25], OAPF [21], DS-KCF [3], DS-KCF-shape [10], CA3DMS [16], CSR RGBD++ [12], STC [29] and OTR [13];
– Deep RGBD trackers: DAL [24], TSDM [34], and DeT¹ [31];
– State-of-the-art RGB trackers: ATOM [8], DiMP [2], PrDiMP [9], KeepTrack [20] and TransT [6].
¹ For a fair comparison, we use the DeT-DiMP50-Max checkpoint in all experiments.
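A small sketch of these long-term metrics at a single threshold τ_θ, assuming per-frame overlaps and visibility flags are available; all names are illustrative.

```python
import numpy as np

def precision_recall_f(ious, pred_visible, gt_visible):
    """Long-term tracking Pr, Re and F-score (Eq. (12)) at one threshold.

    ious[t] is the overlap Omega(A_t, G_t); pred_visible / gt_visible are
    boolean arrays marking frames where the target is predicted / actually
    visible.
    """
    ious = np.asarray(ious, dtype=float)
    n_p = max(int(pred_visible.sum()), 1)
    n_g = max(int(gt_visible.sum()), 1)
    pr = ious[pred_visible].sum() / n_p
    re = ious[gt_visible].sum() / n_g
    f = 0.0 if pr + re == 0 else 2 * pr * re / (pr + re)
    return pr, re, f
```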
Table 1. Overall performance on DepthTrack test set [31]. The top 3 results are shown in red, blue, and green.

Method              Type   Pr     Re     F-score
DS-KCF [3]          RGBD   0.075  0.077  0.076
DS-KCF-Shape [10]   RGBD   0.070  0.071  0.071
CSR RGBD++ [12]     RGBD   0.113  0.115  0.114
CA3DMS [16]         RGBD   0.212  0.216  0.214
DAL [24]            RGBD   0.478  0.390  0.421
TSDM [34]           RGBD   0.393  0.376  0.384
DeT [31]            RGBD   0.560  0.506  0.532
DiMP50 [2]          RGB    0.387  0.403  0.395
KeepTrack [20]      RGB    0.504  0.514  0.508
TransT [6]          RGB    0.484  0.494  0.489
DMTracker           RGBD   0.619  0.597  0.608

Table 2. Overall performance on CDTB dataset [17]. The top 3 results are shown in red, blue, and green.

Method              Type   Pr     Re     F-score
DS-KCF [3]          RGBD   0.036  0.039  0.038
DS-KCF-Shape [10]   RGBD   0.042  0.044  0.043
CA3DMS [16]         RGBD   0.271  0.284  0.259
CSR RGBD++ [12]     RGBD   0.187  0.201  0.194
OTR [13]            RGBD   0.336  0.364  0.312
DAL [24]            RGBD   0.661  0.565  0.592
TSDM [34]           RGBD   0.578  0.541  0.559
DeT [31]            RGBD   0.651  0.633  0.642
ATOM [8]            RGB    0.541  0.537  0.539
DiMP50 [2]          RGB    0.546  0.549  0.547
DMTracker           RGBD   0.662  0.658  0.660
Table 3. Overall performance and attribute-based results on STC [29]. The top 3 results are shown in red, blue, and green. Method
Type
Success
Attributes IV
DV
SV
CDV
DDV
SDC
SCC
BCC
BSC
PO
OAPF [21]
RGBD
0.26
0.15
0.21
0.15
0.15
0.18
0.24
0.29
0.18
0.23
0.28
DS-KCF [3]
RGBD
0.34
0.26
0.34
0.16
0.07
0.20
0.38
0.39
0.23
0.25
0.29
PT [25]
RGBD
0.35
0.20
0.32
0.13
0.02
0.17
0.32
0.39
0.27
0.27
0.30
DS-KCF-Shape [10]
RGBD
0.39
0.29
0.38
0.21
0.04
0.25
0.38
0.47
0.27
0.31
0.37
STC [29]
RGBD
0.40
0.28
0.36
0.24
0.24
0.36
0.38
0.45
0.32
0.34
0.37
CA3DMS [16]
RGBD
0.43
0.25
0.39
0.29
0.17
0.33
0.41
0.48
0.35
0.39
0.44
CSR RGBD++ [12]
RGBD
0.45
0.35
0.43
0.30
0.14
0.39
0.40
0.43
0.38
0.40
0.46
OTR [13]
RGBD
0.49
0.39
0.48
0.31
0.19
0.45
0.44
0.46
0.42
0.42
0.50
DAL [24]
RGBD
0.62
0.52
0.60
0.51
0.62
0.46
0.63
0.63
0.55
0.58
0.57
TSDM [34]
RGBD
0.57
0.53
0.42
0.43
0.54
0.37
0.56
0.55
0.41
0.45
0.53
DeT [31]
RGBD
0.60
0.39
0.56
0.48
0.58
0.40
0.59
0.62
0.48
0.46
0.54
DiMP [2]
RGB
0.60
0.49
0.61
0.47
0.56
0.57
0.60
0.64
0.51
0.54
0.57
PrDiMP [9]
RGB
0.60
0.47
0.62
0.47
0.51
0.58
0.64
0.62
0.53
0.54
0.58
KeepTrack [20]
RGB
0.62
0.51
0.63
0.47
0.57
0.56
0.61
0.62
0.53
0.52
0.58
DMTracker
RGBD
0.63
0.52
0.59
0.46
0.67
0.63
0.61
0.65
0.52
0.52
0.60
4.2 Main Results
Quantitative Results. We give the quantitative results of the compared models on DepthTrack, CDTB, and STC in Tables 1, 2, and 3, respectively. For DepthTrack and CDTB, the corresponding tracking Precision-Recall and F-score plots are also given in Fig. 7 and Fig. 8. As shown in Table 1, our method outperforms all of the compared state-of-the-art methods and obtains the best F-score of 0.608 on the DepthTrack dataset. On CDTB, we also obtain the top F-score of 0.660. Compared with the state-of-the-art RGBD tracker DeT, our DMTracker performs 7.6% and 1.8% higher on DepthTrack and CDTB, respectively, according to F-score. DMTracker also outperforms TransT by 11.9% on DepthTrack. Overall, the proposed DMTracker provides better accuracy than RGB trackers or advanced RGBD trackers on the existing RGBD datasets.
Qualitative Results. Figure 9 shows several representative example results of representative trackers and the proposed tracker on various benchmark datasets. Compared with other RGBD models, we can see that our DMTracker can achieve better tracking performance under multiple difficulties. Sequence 1 shows a scene with background clutter. Our method can accurately distinguish the target object, while other trackers are confused by the clutter. In sequence 2, we show an example with multiple similar objects and fast motion, where it is challenging to accurately track the object; our DMTracker nevertheless tracks the small-sized target ball robustly. In sequences 3 and 5, we show two examples where there is background clutter or occlusion in outdoor scenarios. Our method locates the objects more accurately compared to other approaches. In sequence 4, under a very dark scene with fast motion of the object, our tracker can robustly track the target object. Finally, it can be seen that some approaches fail to track the rotated object in a dark room, while our DMTracker can produce reliable results. Overall, our tracker can handle various tracking challenges and produce promising results.
Fig. 7. Overall tracking performance is presented as tracking Precision-Recall and F-score Plots on DepthTrack [31].
Fig. 8. Overall tracking performance is presented as tracking Precision-Recall and F-score Plots on CDTB [17].
Fig. 9. Qualitative results in different challenging scenarios. Seq 1: background clutter; Seq 2: similar objects and fast motion; Seq 3: outdoor scenario and full occlusion; Seq 4: dark scenes and fast motion; Seq 5: outdoor scenario and camera motion; Seq 6: dark scene and target rotation.
Attribute-Based Analysis. We report attribute-based evaluations of representative state-of-the-art algorithms on STC, DepthTrack, and CDTB for in-depth analysis. Success rates for the different attributes on STC are given in Table 3. Our proposed DMTracker outperforms the other compared models on Color Distribution Variation (CDV), Depth Distribution Variation (DDV), Surrounding Color Clutter (SCC), and Partial Occlusion (PO). Attribute-specific F-scores for the tested trackers on DepthTrack and CDTB are shown in Fig. 10 and Fig. 11, respectively. Our DMTracker outperforms the other trackers on 12 attributes on DepthTrack and on 7 attributes on CDTB. DMTracker clearly handles the depth-related challenges better, e.g., background clutter (BC), dark scenes (DS), depth change (DC), and similar objects (SO).
Fig. 10. Attribute-based F-score on DepthTrack [31].
Fig. 11. Attribute-based F-score on CDTB [17].

4.3 Ablation Study
To verify the relative importance of the key components of our tracker, we conduct ablation studies on the most recent DepthTrack dataset [31].

Table 4. Ablation study on Cross-Modal Attention (CMA) layer numbers.
Method      CMA Layers  Pr     Re     F-score
DMTracker   1           0.585  0.571  0.578
DMTracker   2           0.619  0.597  0.608
DMTracker   3           0.570  0.552  0.561
Effects of CMA Layers. To investigate the effect of the number of Cross-Modal Attention (CMA) layers, we compare tracking performance under different settings. Table 4 shows the results with 1, 2, and 3 CMA layers. Our model with 2 CMA layers obtains the best performance: a single CMA layer leads to inadequate learning of the target knowledge and cross-modal correlation, while deeper stacks risk overfitting.
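To make the layer-number ablation concrete, the following PyTorch sketch shows one way to stack N cross-modal attention layers between RGB and depth token features. It is a schematic illustration under our own assumptions (standard multi-head cross-attention with residual connections and layer normalization), not the authors' CMIM implementation; all class names, dimensions, and shapes are hypothetical.

import torch
import torch.nn as nn

class CMALayer(nn.Module):
    # One cross-modal attention block: each modality attends to the other,
    # with residual connections and layer normalization (our assumption).
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb, depth):
        rgb = self.norm_rgb(rgb + self.rgb_from_depth(rgb, depth, depth)[0])
        depth = self.norm_depth(depth + self.depth_from_rgb(depth, rgb, rgb)[0])
        return rgb, depth

class CMAStack(nn.Module):
    # Stack of num_layers CMA blocks; Table 4 compares 1, 2, and 3 such layers.
    def __init__(self, num_layers=2, dim=256):
        super().__init__()
        self.layers = nn.ModuleList([CMALayer(dim) for _ in range(num_layers)])

    def forward(self, rgb, depth):
        for layer in self.layers:
            rgb, depth = layer(rgb, depth)
        return rgb, depth

# Toy usage with placeholder token maps: (batch, tokens, channels).
rgb_tokens = torch.randn(1, 64, 256)
depth_tokens = torch.randn(1, 64, 256)
rgb_out, depth_out = CMAStack(num_layers=2)(rgb_tokens, depth_tokens)
print(rgb_out.shape, depth_out.shape)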
Component-Wise Analysis. Since the proposed DMTracker is designed to fuse cross-modal features and learn a unified representation, we ablate each key component for analysis. The base model is constructed by removing the whole dual-fused modality-aware network; its fused features are obtained by simple element-wise addition of the RGB and depth feature maps. As shown in Table 5, the tracking performance is clearly sub-optimal with only CMIM or only SPM, and improves markedly when CMIM and SPM are combined. As analyzed in Sect. 3.6, CMIM alone brings very informative but partly redundant features, which leads to sub-optimal performance. With only SPM, the interaction of cross-modal information is lacking, so the model does not fully exploit the modality-aware correlation information. When CMIM and SPM are combined, the redundant information is effectively suppressed, the shared and modality-specific information is enhanced, and the tracking performance improves accordingly.

Table 5. Ablation study on key components in the network.
Components          Pr     Re     F-score
Base                0.517  0.492  0.504
Base + CMIM         0.522  0.426  0.469
Base + SPM          0.584  0.563  0.574
Base + CMIM + SPM   0.619  0.597  0.608

Table 6. Ablation study on different SPM settings.

Method      SPM setting  Pr     Re     F-score
DMTracker   αD0 + βF1    0.576  0.560  0.568
DMTracker   αI0 + βF1    0.619  0.597  0.608
Effects of Modality Fusion Order in SPM. In the specificity preserving module (SPM) of our DMTracker, we first add the depth-specific features to the shared RGBD features and then fuse the result with the weighted RGB-specific features. When the order of the specific modalities is swapped, the performance of the tracker degrades (Table 6). This can be explained by the fact that the RGB features are more informative than the depth features, and this part of the information is dominant in the tracking process. If the spatial (depth) information is instead added with a high weight in the final fusion step, the proportion of visual information in the fused feature is weakened, and mixing in excess spatial information confuses the tracker.
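The fusion-order ablation of Table 6 can be read as in the minimal sketch below, which reflects our own reading of the text rather than the authors' code: by default the shared RGBD features first absorb the depth-specific features, and the weighted RGB-specific features are added last; the ablated variant swaps the roles of the two specific streams. The weights alpha and beta, the function name, and all tensor shapes are placeholders.

import torch

def spm_fuse(shared, rgb_specific, depth_specific, alpha=1.0, beta=1.0, rgb_last=True):
    # Default order (as described above): F1 = shared + depth-specific,
    # final = alpha * RGB-specific + beta * F1.
    if rgb_last:
        intermediate = shared + depth_specific
        return alpha * rgb_specific + beta * intermediate
    # Swapped order (the degraded setting in Table 6):
    # F1 = shared + RGB-specific, final = alpha * depth-specific + beta * F1.
    intermediate = shared + rgb_specific
    return alpha * depth_specific + beta * intermediate

# Placeholder feature maps of shape (batch, channels, height, width).
shared = torch.randn(1, 256, 16, 16)
rgb_specific = torch.randn(1, 256, 16, 16)
depth_specific = torch.randn(1, 256, 16, 16)
default = spm_fuse(shared, rgb_specific, depth_specific, rgb_last=True)
swapped = spm_fuse(shared, rgb_specific, depth_specific, rgb_last=False)
print(default.shape, swapped.shape)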
5 Conclusions

In this paper, we proposed a novel DMTracker to learn dual-fused modality-aware representations for RGBD tracking. On the one hand, the proposed cross-modal integration module learns the information shared between the RGB and depth channels through cross-modal attention. On the other hand, the subsequent specificity preserving module adaptively fuses the shared features with the RGB- and depth-specific features to obtain discriminative representations. Extensive experiments validate the superior performance of the proposed DMTracker as well as the effectiveness of each of its components.
Acknowledgements. This work was supported by the National Natural Science Foundation of China under Grants No. 61972188 and 62122035.
References

1. An, N., Zhao, X.G., Hou, Z.G.: Online RGB-D tracking via detection-learning-segmentation. In: 2016 23rd International Conference on Pattern Recognition, pp. 1231–1236. IEEE (2016)
2. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6182–6191 (2019)
3. Camplani, M., et al.: Real-time RGB-D tracking with depth scaling kernelised correlation filters and occlusion handling. In: BMVC, vol. 4, p. 5 (2015)
4. Cao, Z., Huang, Z., Pan, L., Zhang, S., Liu, Z., Fu, C.: TCTrack: Temporal contexts for aerial tracking. arXiv preprint arXiv:2203.01885 (2022)
5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
6. Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8126–8135 (2021)
7. Chen, Y.W., Tsai, Y.H., Yang, M.H.: End-to-end multi-modal video temporal grounding. In: Advances in Neural Information Processing Systems 34 (2021)
8. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: Accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4660–4669 (2019)
9. Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7183–7192 (2020)
10. Hannuna, S., et al.: DS-KCF: a real-time tracker for RGB-D data. J. Real-Time Image Proc. 16(5), 1–20 (2016)
11. Hu, J., Lu, J., Tan, Y.P.: Sharable and individual multi-view metric learning. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2017)
12. Kart, U., Kämäräinen, J.K., Matas, J.: How to make an RGBD tracker? In: ECCVW (2018)
13. Kart, U., Lukežič, A., Kristan, M., Kämäräinen, J.K., Matas, J.: Object tracking by reconstruction with view-specific discriminative correlation filters. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
14. Kim, J., Ma, M., Pham, T., Kim, K., Yoo, C.D.: Modality shifting attention network for multi-modal video question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10106–10115 (2020)
15. Lin, L., Fan, H., Xu, Y., Ling, H.: SwinTrack: A simple and strong baseline for transformer tracking. arXiv preprint arXiv:2112.00995 (2021)
16. Liu, Y., Jing, X.Y., Nie, J., Gao, H., Liu, J., Jiang, G.P.: Context-aware three-dimensional mean-shift with occlusion handling for robust object tracking in RGB-D videos. IEEE Trans. Multimedia 21(3), 664–677 (2018)
17. Lukežič, A., et al.: CDTB: A color and depth visual object tracking dataset and benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10013–10022 (2019)
18. Luo, W., Yang, B., Urtasun, R.: Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577 (2018)
19. Machida, E., Cao, M., Murao, T., Hashimoto, H.: Human motion tracking of mobile robot with Kinect 3D sensor. In: 2012 Proceedings of SICE Annual Conference (SICE), pp. 2207–2211. IEEE (2012)
20. Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13444–13454 (2021)
21. Meshgi, K., Maeda, S., Oba, S., Skibbe, H., Li, Y., Ishii, S.: An occlusion-aware particle filter tracker to handle complex and persistent occlusions. Comput. Vision Image Understand. 150, 81–94 (2016). https://doi.org/10.1016/j.cviu.2016.05.011
22. Pan, B., et al.: Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10870–10879 (2020)
23. Qi, H., Feng, C., Cao, Z., Zhao, F., Xiao, Y.: P2B: Point-to-box network for 3D object tracking in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6329–6338 (2020)
24. Qian, Y., Yan, S., Lukežič, A., Kristan, M., Kämäräinen, J.K., Matas, J.: DAL: A deep depth-aware long-term tracker. In: International Conference on Pattern Recognition (2020)
25. Song, S., Xiao, J.: Tracking revisited using RGBD camera: Unified benchmark and baselines. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 233–240 (2013)
26. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
27. Wang, N., Zhou, W., Wang, J., Li, H.: Transformer meets tracker: Exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1571–1580 (2021)
28. Wang, Q., Fang, J., Yuan, Y.: Multi-cue based tracking (2014)
29. Xiao, J., Stolkin, R., Gao, Y., Leonardis, A.: Robust fusion of color and depth data for RGB-D target tracking using adaptive range-invariant depth models and spatio-temporal consistency constraints. IEEE Trans. Cybern. 48(8), 2485–2499 (2017)
30. Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10448–10457 (2021)
31. Yan, S., Yang, J., Käpylä, J., Zheng, F., Leonardis, A., Kämäräinen, J.K.: DepthTrack: Unveiling the power of RGBD tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10725–10733 (2021)
32. Yang, J., et al.: RGBD object tracking: An in-depth review. arXiv preprint arXiv:2203.14134 (2022)
33. Yu, B., et al.: High-performance discriminative tracking with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9856–9865 (2021)
34. Zhao, P., Liu, Q., Wang, W., Guo, Q.: TSDM: Tracking by SiamRPN++ with a depth-refiner and a mask-generator. In: 2020 25th International Conference on Pattern Recognition, pp. 670–676. IEEE (2021)
35. Zheng, C., Yan, X., Gao, J., Zhao, W., Zhang, W., Li, Z., Cui, S.: Box-aware feature enhancement for single object tracking on point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13199–13208 (2021)
36. Zhou, T., Fu, H., Chen, G., Zhou, Y., Fan, D.P., Shao, L.: Specificity-preserving RGB-D saliency detection. In: International Conference on Computer Vision (ICCV) (2021)
Author Index
Abd-Almageed, Wael 363, 382 Angermann, Christoph 61 Appalaraju, Srikar 3 Bae, Yuseok 27 Ben, Kang 431 Bhat, Goutam 431 Björklund, Johanna 431 Bourgana, Taoufik 97 Bowden, Richard 243 Bunyak, Filiz 412 Cai, Dingding 431 Calyam, Prasad 412 Camgöz, Necati Cihan 243 Castro, José L. Alba 225 Chang, Hong 431 Chang, Hyung Jin 112, 431 Chen, Guangqi 431 Chen, Jiaye 431 Chen, Shengyong 431 Chen, Wei 112 Chen, Xiaotong 148 Chen, Xilin 431 Chen, Xin 291, 431, 461 Chen, Xiuyi 431 Chen, Yiwei 431 Chen, Yubei 184 Chen, Yu-Hsi 431 Chen, Zhibo 308 Chen, Zhixing 431 Cheng, Yangming 431 Ciaramella, Angelo 431 Croenen, Jonathan 97 Cruz, Rafael M. O. 431 Cui, Yutao 431 Daems, Rembert 97 Danelljan, Martin 431 Dasari, Mohana Murali 431 de Hoog, Joris 97 De Roovere, Peter 97 Deng, Qili 431 Dhar, Debajyoti 431
Di, Shangzhe 431 Dikov, Georgi 43 Drbohlav, Ondrej 431 Du, Daniel K. 431 Dunnhofer, Matteo 431 Dutta, Ujjal Kr 167 Džubur, Benjamin 431 Escalera, Sergio 225
Fan, Heng 431 Felsberg, Michael 431 Feng, Zhenhua 431 Fernández, Gustavo 431 Fernandez, Laura Docio 225 Fernando, Basura 347 Frahm, Jan-Michael 129 Fu, Hongyu 271 Fu, Zhihong 431 Gao, Shang 431, 478 Geraghty, David 129 Gkalelis, Nikolaos 396 Gkartzonika, Ioanna 396 Gorthi, Rama Krishna 431 Granger, Eric 431 Gu, Q. H. 431 Gupta, Himanshu 431 Haltmeier, Markus 61 He, Jianfeng 431 He, Keji 431 He, Tianyu 308 Ho, Chih-Hui 3 Hu, Hezhen 256 Hua, Xian-Sheng 308 Huang, Jianqiang 308 Huang, Yan 431 Hwang, Joong-Won 27 Hwang, Sung Ju 27 Jacques Junior, Julio C. S. 225 Jangid, Deepak 431 Jasani, Bhavan 3
Jenkins, Odest Chadwicke 148 Ji, Rongrong 431 Jiang, Cheng 431 Jiang, Yingjie 431 Jiang, Yuxuan 291 Jin, Xin 308 Jones, Shawn M. 203
Kämäräinen, Joni-Kristian 431 Kang, Ben 461 Kang, Ze 431 Kasten, Yoni 129 Kim, Hyung-Il 27 Kiran, Madhu 431 Kittler, Josef 431 Koh, Kian Boon 347 Kristan, Matej 431 Kwon, Yongjin 27
Lai, Simiao 431 Lan, Xiangyuan 431 Lawin, Felix Järemo 431 Lee, Dongwook 431 Lee, Hyunjeong 431 Lee, Seohyung 431 Lee, Youngwan 27 Leonardis, Aleš 112, 431, 478 Li, Dongdong 461 Li, Houqiang 256 Li, Hui 431 Li, Jiazhi 363 Li, Lincheng 271 Li, Ming 431 Li, Wangkai 431 Li, Xi 431 Li, Xianxian 431 Li, Xiao 431 Li, Zhe 431, 478 Lin, Beibei 271 Lin, Jierui 184 Lin, Liting 431 Ling, Haibin 431 Liu, Bo 431 Liu, Chang 431 Liu, Chen 271 Liu, Landong 256 Liu, Runtao 184 Liu, Si 431 Liu, Tongliang 308
Lu, Huchuan 431, 461 Lukežič, Alan 431 Ma, Bingpeng 431 Ma, Chao 431 Ma, Jie 431 Ma, Yinchao 431 Manmatha, R. 3 Marinov, Zdravko 326 Marlin, Benjamin M. 78 Martinel, Niki 431 Maschmann, Matthew 412 Matas, Jiří 431 Mayer, Christoph 431 Memarmoghadam, Alireza 431 Mezaris, Vasileios 396 Micheloni, Christian 431 Moallem, Payman 431 Nardo, Emanuel Di 431 Nguyen, Nguyen P. 412 Nguyen-Meidine, Le Thanh 431 Opipari, Anthony 148 Oyen, Diane 203 Palaniappan, Kannappan 412 Pan, Siyang 431 Park, ChangBeom 431 Paudel, Danda 431 Paul, Matthieu 431 Peng, Houwen 431 Pflugfelder, Roman 431 Price, True 129 Qi, Xingqun 271
Robinson, Andreas 431 Roitberg, Alina 326 Rout, Litu 431 Samplawski, Colin 78 Schneider, David 326 Shan, Shiguang 431 Shen, Xu 308 Simonato, Kristian 431 Siyal, Ahsan Raza 61 Song, Jingkuan 478 Song, Tianhui 431
Song, Xiaoning 431 Stiefelhagen, Rainer 326 Sun, Chao 431 Sun, Jingna 431 Surya, Ramakrishna 412 Tang, Zhangyong 431 Tian, Qi 291 Timofte, Radu 431 Tsai, Chi-Yi 431 Vadera, Meet P. 78 Van Gool, Luc 431 van Vugt, Joris 43 Vasconcelos, Nuno 3 Vázquez Enríquez, Manuel 225 Verma, Om Prakash 431 Wang, Dong 431, 461 Wang, Fei 431 Wang, Jiayun 184 Wang, Liang 431 Wang, Liangliang 431 Wang, Lijun 431 Wang, Limin 431 Wang, Qiang 431 Wang, Xinchao 308 Wang, Yanfeng 291 Wong, Ryan 243 Wu, Gangshan 431 Wu, Jinlin 431 Wu, Songhua 308 Wu, Xiaojun 431 wyffels, Francis 97 Xie, Fei 431 Xie, Hanchen 382 Xu, Qinwei 291 Xu, Tianyang 431 Xu, Wei 431 Xu, Yong 431 Xu, Yuanyou 431 Xue, Wanli 431 Xun, Zizheng 431 Yan, Bin 431 Yan, Song 431 Yang, Dawei 431 Yang, Jinyu 431, 478
Yang, Wankou 431 Yang, Wenyan 431 Yang, Xiaoyun 431 Yang, Yi 431 Yang, Yichun 431 Yang, Zongxin 431 Ye, Botao 431 Ye, Jingwen 308 Yu, Fisher 431 Yu, Hongyuan 431 Yu, Jiaqian 431 Yu, Qian 184 Yu, Qianjin 431 Yu, Stella X. 184 Yu, Weichen 431 Yu, Xin 271 Yu, Zeren 148 Yun, Kimin 27 Zajc, Luka Čehovin 431 Ze, Kang 431 Zhai, Jiang 431 Zhang, Chengwei 431 Zhang, Chunhu 431 Zhang, Huijie 148 Zhang, Kaihua 431 Zhang, Li 271 Zhang, Ruipeng 291 Zhang, Tianzhu 431 Zhang, Wenkang 431 Zhang, Ya 291 Zhang, Yushan 431 Zhang, Zhibin 431 Zhang, Zhipeng 431 Zhang, Zhongqun 112, 431 Zhao, Jie 431 Zhao, Shaochuan 431 Zhao, Weichao 256 Zheng, Feng 431, 478 Zheng, Haixia 431 Zheng, Linfang 112 Zheng, Min 431 Zhong, Bineng 431 Zhou, Wengang 256 Zhu, Jiageng 382 Zhu, Jiawen 431 Zhu, Jiyue 148 Zhu, Xuefeng 431 Zhuang, Yueting 431