Computer Vision – ACCV 2022: 16th Asian Conference on Computer Vision, Macao, China, December 4–8, 2022, Proceedings, Part IV (Lecture Notes in Computer Science) 3031263154, 9783031263156

The 7-volume set of LNCS 13841-13847 constitutes the proceedings of the 16th Asian Conference on Computer Vision, ACCV 2022, held in Macao, China, in December 2022.


English Pages 792 [781] Year 2023


Table of contents :
Preface
Organization
Contents – Part IV
Face and Gesture
Confidence-Calibrated Face Image Forgery Detection with Contrastive Representation Distillation
1 Introduction
2 Related Work
2.1 Face Forgery Detection
2.2 Knowledge Distillation
2.3 Confidence Calibration
3 Methodology
3.1 Feature Representation
3.2 Dual-Teacher Knowledge Distillation (DKD)
3.3 Confidence Calibration Loss (CCL)
4 Experimental Results and Discussions
4.1 Experimental Settings
4.2 In-dataset Evaluation
4.3 Cross-Dataset Evaluation
4.4 Ablation Study
5 Conclusions
References
Exposing Face Forgery Clues via Retinex-Based Image Enhancement
1 Introduction
2 Related Works
2.1 Face Forgery Detection
2.2 Retinex-Based Methods
3 Methodology
3.1 Overall Framework
3.2 Multi-scale Retinex Extraction
3.3 Feature Re-weighted Interaction
3.4 Salience-Guided Attention
3.5 Training Details and Loss Functions
4 Experiments
4.1 Experimental Setup
4.2 Comparison with Previous Methods
4.3 Ablation Studies and Visualizations
5 Conclusions
References
GB-CosFace: Rethinking Softmax-Based Face Recognition from the Perspective of Open Set Classification
1 Introduction
2 Softmax-Based Face Recognition
2.1 Framework
2.2 Objective
2.3 Properties
3 GB-CosFace Framework
3.1 Antetype Formulation
3.2 Adaptive Global Boundary
3.3 Geometric Analysis
4 Experiments
4.1 Face Recognition
4.2 Ablation Study
5 Conclusion
References
Learning Video-Independent Eye Contact Segmentation from In-the-Wild Videos
1 Introduction
2 Related Work
2.1 Gaze Estimation and Analysis
2.2 Eye Contact Detection
2.3 Action Segmentation
3 Proposed Method
3.1 Pseudo Label Generation
3.2 Iterative Model Training
3.3 Implementation Details
4 Experiments
4.1 Dataset
4.2 Evaluation
5 Conclusion
References
Exemplar Free Class Agnostic Counting
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Repetitive Region Proposal Networks
3.2 RepRPN-Counter
3.3 Knowledge Transfer for Handling Missing Labels
3.4 Training Objective
3.5 Implementation Details
4 Experiments
4.1 Dataset
4.2 Evaluation Metrics
4.3 Comparison with Class-Agnostic Visual Counters
4.4 Comparison with Object Detectors
4.5 Comparing RepRPN with RPN
4.6 Ablation Studies
4.7 Qualitative Results
5 Conclusions
References
Emphasizing Closeness and Diversity Simultaneously for Deep Face Representation
1 Introduction
2 Related Work
2.1 Margin-Based Methods
2.2 Mining-Based Methods
2.3 Contrastive Learning
3 Preliminary
3.1 Understanding Softmax-Based Methods
3.2 Derivative Analysis
4 Proposed Method
4.1 Hard Sample Mining
4.2 Loss Design
4.3 Sampling Strategy
5 Experiment
5.1 Implementation Details
5.2 Comparisons with SOTA Methods
5.3 Ablation Study
6 Conclusion
References
KinStyle: A Strong Baseline Photorealistic Kinship Face Synthesis with an Optimized StyleGAN Encoder
1 Introduction
2 Related Work
3 Methodology
3.1 Phase 1: Image Encoder
3.2 Phase 2: Fusion
3.3 Phase 3: Component-wise Parental Trait Manipulation (CW-PTM)
4 Experiment
4.1 Quantitative Evaluations of Different Encoder Configurations
4.2 Subjective Evaluation
4.3 Qualitative Evaluation
4.4 Ablation Studies
5 Conclusion
References
Occluded Facial Expression Recognition Using Self-supervised Learning
1 Introduction
2 Related Work
3 Method
3.1 Problem Statement
3.2 Pretext Task
3.3 Downstream Task
3.4 Optimization
4 Experiment
4.1 Experimental Conditions
4.2 Experimental Results and Analysis
4.3 Comparison with Related Work
5 Conclusion
References
Heterogeneous Avatar Synthesis Based on Disentanglement of Topology and Rendering
1 Introduction
2 Related Work
2.1 Universal Style Transfer
2.2 GAN-Based Methods
2.3 Cross-Domain Image Synthesis
3 Approach
3.1 RT-Net
3.2 TT-Net
4 Experiment
4.1 Dataset
4.2 Implementation Details
4.3 Qualitative Evaluation
4.4 User Study
4.5 Ablation Study
4.6 Quantitative Evaluation
5 Conclusion
References
Pose and Action
Focal and Global Spatial-Temporal Transformer for Skeleton-Based Action Recognition
1 Introduction
2 Related Work
3 Proposed Method
3.1 Basic Spatial-Temporal Transformer on Skeleton Data
3.2 Focal and Global Spatial-Temporal Transformer Overview
3.3 Focal and Global Spatial Transformer (FG-SFormer)
3.4 Focal and Global Temporal Transformer (FG-TFormer)
4 Experiments
4.1 Dataset
4.2 Implementation Details
4.3 Ablation Study
4.4 Visualization and Analysis
4.5 Comparison with the State-of-the-Arts
5 Conclusion
References
Spatial-Temporal Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition
1 Introduction
2 Related Work
2.1 Skeleton-Based Action Recognition
2.2 Graph Convolution Networks in Action Recognition
2.3 Topology Adaptive-Based Methods
3 Methodology
3.1 Preliminaries
3.2 Overview
3.3 Spatial and Temporal Graph Convolution
3.4 Topology Adaptive Encoder
3.5 Spatial and Temporal Adaptive Graph Convolution
3.6 Model Details
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Ablation Studies
4.4 Comparisons with SOTA Methods
5 Conclusion
References
3D Pose Based Feedback for Physical Exercises
1 Introduction
2 Related Work
2.1 Human Motion Prediction
2.2 Action Recognition
2.3 Physical Exercise Analysis
3 Methodology
3.1 Exercise Analysis Framework
4 EC3D Dataset
5 Evaluation
5.1 Dataset and Metrics
5.2 Quantitative Results
5.3 Qualitative Results
5.4 Ablation Studies
5.5 Limitations and Future Work
6 Conclusion
References
Generating Multiple Hypotheses for 3D Human Mesh and Pose Using Conditional Generative Adversarial Nets
1 Introduction
2 Related Work
3 The Proposed Approach
3.1 Overview
3.2 Human Mesh Estimation Using CGANs
3.3 Condition Expansion for Multiple Hypotheses
3.4 Human Mesh Selecting and Clustering
3.5 Implementation Details
4 Experimental Results
4.1 Datasets
4.2 Qualitative Results
4.3 Statistic Analysis of Multiple Hypotheses
4.4 Quantitative Comparison
4.5 Failure Case
5 Conclusion
References
SCOAD: Single-Frame Click Supervision for Online Action Detection
1 Introduction
2 Related Work
3 Method
3.1 Overview
3.2 Action Instance Miner
3.3 Pseudo Labels Generation
3.4 Online Action Detector
3.5 Training and Inference
4 Experiments
4.1 Comparison Experiments
4.2 Ablation Experiments
4.3 Random Influence of Clicking Labels
4.4 Qualitative Results
5 Conclusion
References
Neural Puppeteer: Keypoint-Based Neural Rendering of Dynamic Shapes
1 Introduction
2 Related Work
2.1 3D Keypoint Estimation
2.2 Morphable Models
2.3 Neural Fields
3 Neural Puppeteer
3.1 3D Keypoint Encoding
3.2 Keypoint-Based Neural Rendering
3.3 Training
3.4 Pose Reconstruction and Tracking
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Baselines
4.4 3D Keypoint and Pose Estimation
4.5 Keypoint-Based Neural Rendering
5 Limitations and Future Work
6 Conclusions
References
Decanus to Legatus: Synthetic Training for 2D-3D Human Pose Lifting
1 Introduction
2 Related Work
3 Proposed Method
3.1 Synthetic Human Pose Generation Model
3.2 Pseudo-realistic 3D Human Pose Sampling
3.3 Training with Synthetic Data
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Comparison with the State-of-the-Art
5 Ablation Studies
5.1 Synthetic Poses Realism
5.2 Effect of Diffusion
5.3 Layout Adaptation
6 Conclusion
References
Social Aware Multi-modal Pedestrian Crossing Behavior Prediction
1 Introduction
2 Related Work
2.1 Pedestrian Intention Estimation
2.2 Pedestrian Behavior Prediction
3 Method
3.1 Social Aware Encoder
3.2 Multi-modal Conditional Generating Module
3.3 Intention-Directed Decoder
3.4 Multi-task Loss
4 Experiments
4.1 Experiment Settings
4.2 Performance Evaluation and Analysis
4.3 Qualitative Example
5 Conclusion
References
Action Representing by Constrained Conditional Mutual Information
1 Introduction
2 Related Work
2.1 Self-Supervised Learning
2.2 Action Recognition
3 Methodology
3.1 Traditional Contrastive Learning by Minimizing InfoNCE
3.2 A Learning Framework by Minimizing CCMI
3.3 A Self-Supervised Method: CoMInG
3.4 Details of Training Process
4 Relationship Between CCMI Loss and InfoNCE Loss
5 Experiments
5.1 The Setting of Experiments
5.2 Comparison with Existing Methods
5.3 Ablations
6 Conclusion
References
Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition
1 Introduction
2 Related Works
3 Approach
4 Experiments
4.1 Ablation Study
4.2 Comparisons with the State-of-the-Art Methods
5 Conclusions
References
Video Analysis and Event Recognition
Spatial Temporal Network for Image and Skeleton Based Group Activity Recognition
1 Introduction
2 Related Work
2.1 Group Activity Recognition
2.2 Modeling of Interaction
3 Method
3.1 Image Branch
3.2 Skeleton Branch
3.3 Training Loss
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Comparison with the State-of-the-Arts
4.4 Ablation Studies
4.5 Visualization
5 Conclusion
References
Learning Using Privileged Information for Zero-Shot Action Recognition
1 Introduction
2 Related Work
2.1 Zero-Shot Action Recognition
2.2 Learning Using Privileged Information
3 Method
3.1 Overview
3.2 Video and Semantic Embedding
3.3 Object Embedding as PI
3.4 Hallucination Network
3.5 Cross-Attention Module
3.6 Training and Testing
4 Experiments
4.1 Setup
4.2 Implementation Details
4.3 Comparison with the State-of-the-Art
4.4 Ablation Studies
4.5 Qualitative Results
5 Conclusions
References
MGTR: End-to-End Mutual Gaze Detection with Transformer
1 Introduction
2 Related Work
2.1 Gaze Estimation in Social Scenarios
2.2 Mutual Gaze Detection
2.3 One-Stage Detection Method
3 Method
3.1 Representation of Mutual Gaze Instances
3.2 Model Architecture
3.3 Strategy for Mutual Gaze Instances Match
3.4 Loss Function Setting
4 Experiments
4.1 Datasets
4.2 Evaluation Metric
4.3 Current State-of-the-Art Approach
4.4 Implementation Details
4.5 Comparison with State-of-the-Art Method
4.6 Ablation Study
4.7 Qualitative Analysis
5 Conclusion
References
Is an Object-Centric Video Representation Beneficial for Transfer?
1 Introduction
2 Related Work
3 An Object-Centric Video Action Transformer
3.1 Architecture
3.2 Objectives
4 Implementation Details
5 Experiments
5.1 Datasets and Metrics
5.2 Ablations
5.3 Results
6 Conclusion
References
DCVQE: A Hierarchical Transformer for Video Quality Assessment
1 Introduction
2 Related Works
3 Divide and Conquer Video Quality Estimator (DCVQE)
3.1 Overall Architecture
3.2 Feature Extraction
3.3 Correlation Loss
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Ablation Studies
4.5 Comparison with the State-of-the-Art Methods
5 Conclusions
References
Not End-to-End: Explore Multi-Stage Architecture for Online Surgical Phase Recognition
1 Introduction
2 Related Work
3 Methods
3.1 Predictor Stage
3.2 Disturbed Prediction Sequence Generation
3.3 Refinement Stage
3.4 Training Details
4 Experiment
4.1 Dataset
4.2 Metrics
4.3 End-to-End VS. Not End-to-End
4.4 Stacked Number of GRUs in Refinement Model
4.5 Impact of Disturbed Prediction Sequence
4.6 Comparison with the SOTA Methods
5 Conclusion and Future Work
References
FunnyNet: Audiovisual Learning of Funny Moments in Videos
1 Introduction
2 Related Work
3 Method
3.1 FunnyNet Architecture
3.2 Training Model and Loss Functions
3.3 Unsupervised Laughter Detection
4 Datasets and Metrics
5 Experiments
5.1 Comparison to the State of the Art
5.2 Analysis of Unsupervised Laughter Detector
5.3 Ablation of FunnyNet
6 Analysis of FunnyNet
6.1 Audio vs Subtitles
6.2 FunnyNet Architecture
6.3 Funny Scene Detection in the Wild
7 Conclusions
References
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
1 Introduction
2 Related Works
3 Context Transformer (ConTra)
3.1 Definitions
3.2 Clip Context Transformer
3.3 Training ConTra
3.4 Multi-modal Context
4 Results
4.1 Experimental Settings
4.2 Clip Context Results
4.3 Results of Modality Context
5 Conclusions
References
A Compressive Prior Guided Mask Predictive Coding Approach for Video Analysis
1 Introduction
2 Related Works
2.1 Video Object Segmentation (VOS)
2.2 Image/Video Compression for Vision Tasks
3 Methods
3.1 Overview
3.2 Detailed Architecture
4 Experimentation
4.1 Training Details
4.2 Experimental Settings
4.3 Experimental Results
4.4 Ablation Studies and Modular Analysis
5 Discussions
6 Conclusion
References
BaSSL: Boundary-aware Self-Supervised Learning for Video Scene Segmentation
1 Introduction
2 Related Work
3 Preliminary
4 Boundary-aware Self-supervised Learning (BaSSL)
4.1 Overview
4.2 Pseudo-boundary Discovery
4.3 Pre-training Objectives
4.4 Fine-tuning for Scene Boundary Detection
5 Experiment
5.1 Experimental Settings
5.2 Comparison with State-of-the-Art Methods
5.3 Comparison with Pre-training Baselines
5.4 Ablation Studies
5.5 Analysis on Pre-trained Shot Representation Quality
5.6 Qualitative Analysis
6 Conclusion
References
HaViT: Hybrid-Attention Based Vision Transformer for Video Classification
1 Introduction
2 Related Work
3 Our Method
3.1 The Overall Architecture
3.2 Hybrid-attention Vision Transformer
3.3 Class-attention
4 Experiment
4.1 Setup
4.2 Comparison with State-of-the-Art
4.3 Ablation Studies
4.4 With Different Vision Transformers
5 Conclusion
References
Vision and Language
From Sparse to Dense: Semantic Graph Evolutionary Hashing for Unsupervised Cross-Modal Retrieval
1 Introduction
2 Methodology
2.1 Preliminaries
2.2 Unified Sparse Affinity Graph
2.3 Semantic Graph Evolution
2.4 Hash-Code Learning
2.5 Optimization
3 Experiments
3.1 Datasets
3.2 Implementation Details
3.3 Retrieval Performance
3.4 Ablation Study
4 Conclusion
References
SST-VLM: Sparse Sampling-Twice Inspired Video-Language Model
1 Introduction
2 Related Work
3 Methodology
3.1 Framework Overview
3.2 Video and Text Encoders
3.3 Dual X-MoCo
3.4 Model Pre-training
4 Experiments
4.1 Datasets and Settings
4.2 Ablation Study
4.3 Comparative Results
4.4 Visualization Results
5 Conclusion
References
PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization
1 Introduction
2 Related Work
3 Method
3.1 Text Pathway
3.2 Vision Pathway
3.3 Multi-Modal Information Interaction Module
3.4 Cross-Modal Contrastive Learning
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Ablation Study
4.4 Pathway Finetune
4.5 Different Backbone Networks
4.6 Comparison with State-of-the-Art Methods
4.7 Finetune on Small Datasets
5 Conclusion
References
Causal Property Based Anti-conflict Modeling with Hybrid Data Augmentation for Unbiased Scene Graph Generation
1 Introduction
2 Related Work
2.1 Scene Graph Generation
2.2 Evidence Theory
3 Approach
3.1 Overview of Approach
3.2 D-S Evidence Theory
3.3 Causal Property Anti-conflict Modeling
3.4 Hybrid Data Augmentation
4 Experiment
4.1 Evaluation Settings
4.2 Implementation Details
4.3 Comparisons with State-of-the-Art Methods
4.4 Ablation Studies
4.5 Qualitative Studies
5 Conclusion
References
gScoreCAM: What Objects Is CLIP Looking At?
1 Introduction
2 Related Work
3 Methods
3.1 Revisiting CAM and ScoreCAM
3.2 Proposed Method: Gradient-Guided ScoreCAM (gScoreCAM)
3.3 Evaluation Datasets
3.4 Evaluation Metrics
3.5 CLIP Networks
4 Design of gScoreCAM
4.1 Ablation Study of gScoreCAM
4.2 gScoreCAM Is the Best Weighting System among the Candidates
4.3 gScoreCAM is Less Noisy Compared to ScoreCAM
5 Experiments and Results
5.1 Zero-Shot Object Detection Results
5.2 Why Does gScoreCAM Perform Better in COCO and PartImageNet?
5.3 gScoreCAM Performs Better on Different CLIP Models
5.4 Qualitative Study via CLIP
6 Discussion and Conclusions
References
From Within to Between: Knowledge Distillation for Cross Modality Retrieval
1 Introduction
2 Related Work
3 Knowledge Distillation for Cross Modal Retrieval
3.1 Framework for Cross Modal Retrieval
3.2 Bidirectional Max-Margin Ranking Loss
3.3 Knowledge Distillation Loss from Caption and Video
3.4 Composition Loss for Training Retrieval Model
4 Experimental Results
4.1 Text-Video Retrieval Frameworks
4.2 Datasets
4.3 Knowledge Distillation for the MMT Retrieval Framework
4.4 Knowledge Distillation for the CLIP4Clip Retrieval Framework
4.5 Knowledge Distillation for the TEACHTEXT Framework
4.6 Comparison to Other Methods
5 Conclusions
References
Thinking Hallucination for Video Captioning
1 Introduction
2 Related Work
2.1 Video Captioning
2.2 Hallucination Problems in Natural Language Generation
3 Methodology
3.1 Visual Encoder
3.2 Auxiliary Heads: Category, Action, and Object
3.3 Context Gates: Balancing Intra-modal and Multi-modal Fusion
3.4 Decoder
3.5 Parameter Learning
3.6 The COAHA Metric
4 Experimental Results
4.1 Datasets
4.2 Implementation Details
4.3 Quantitative Results
4.4 Ablation Studies
4.5 Validation of Context Gates
4.6 Effect of Exposure Bias on Hallucination
4.7 Significance of COAHA
4.8 Qualitative Results
5 Conclusion
References
Boundary-Aware Temporal Sentence Grounding with Adaptive Proposal Refinement
1 Introduction
2 Related Work
3 Methodology
3.1 Feature Extraction
3.2 Feature Enhancement
3.3 Proposal Construction
3.4 Proposal Refinement
3.5 Training and Inference
4 Experiments
4.1 Dataset and Evaluation
4.2 Implementation Details
4.3 Comparison with State-of-the-Arts
4.4 Ablation Study
4.5 Qualitative Results
5 Conclusion
References
Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering
1 Introduction
2 Related Work
3 Methodology
3.1 Overview
3.2 Multimodal Features
3.3 Two-Stage Multimodality Fusion
3.4 Denoising Module
3.5 Pre-training
4 Experiments
4.1 Datasets and Evaluation Metrics
4.2 Implementation Details
4.3 Experimental Results
4.4 Ablation Study
5 Conclusion
References
Bright as the Sun: In-depth Analysis of Imagination-Driven Image Captioning
1 Introduction
2 Related Work
3 Analyzing Imagination-Driven Captions
3.1 Analysis of Existing Datasets
3.2 Classifying Imagination-Driven Captions and Their Usages
3.3 Dataset: IdC-I and IdC-II
4 A Method for Generating both Types of Captions
4.1 Overview of the Proposed Method
4.2 Image Encoder
4.3 Text Generator
4.4 CLIP-Based Scorer
5 Experiments
5.1 Evaluation of Metrics
5.2 Evaluation of Image Captioning Models
6 Summary and Conclusion
References
Heterogeneous Interactive Learning Network for Unsupervised Cross-Modal Retrieval
1 Introduction
2 Related Work
2.1 Cross-Modal Hashing
2.2 Attention Mechanism
2.3 Generative Adversarial Network
3 Proposed Approach
3.1 Problem Definition
3.2 Feature Extraction
3.3 Heterogeneous Feature Interaction
3.4 Network Learning and Optimization
4 Experiments and Evaluations
4.1 Datasets and Evaluation
4.2 Implementation Details
4.3 Performance
4.4 Ablation Study
5 Conclusion
References
Biometrics
GaitStrip: Gait Recognition via Effective Strip-Based Feature Representations and Multi-level Framework
1 Introduction
2 Related Work
2.1 Gait Recognition
2.2 Strip-Based Modeling
3 Proposed Method
3.1 Overview
3.2 Enhanced Convolution Module
3.3 Structural Re-parameterization
3.4 Multi-level Framework
3.5 Temporal Aggregation and Spatial Mapping
3.6 Loss Function
3.7 Training and Test Details
4 Experiments
4.1 Datasets and Evaluation Protocol
4.2 Implementation Details
4.3 Comparison with the State-of-the-Art
4.4 Ablation Study
4.5 Comparison with Other Strip-Based Modeling
4.6 Computational Analysis
5 Conclusion
References
Soft Label Mining and Average Expression Anchoring for Facial Expression Recognition
1 Introduction
2 Related Work
2.1 Facial Expression Recognition
2.2 Methods for Uncertainty or Ambiguity
3 Method
3.1 Soft Label Mining
3.2 Average Expression Anchoring
3.3 The Overall Framework
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Ablation Studies
4.4 Visualization Analysis
4.5 Evaluation on Synthetic Datasets
4.6 Comparison with the State-of-the-Arts
5 Conclusion
References
`Labelling the Gaps': A Weakly Supervised Automatic Eye Gaze Estimation
1 Introduction
2 Related Work
3 Preliminaries
4 Gaze Labelling Framework
4.1 Architectural Overview of `2-Labels'
4.2 Architectural Overview of `1-Label'
5 Experiments
6 Results
7 Limitations, Conclusion and Future Work
References
Author Index

LNCS 13844

Lei Wang · Juergen Gall · Tat-Jun Chin · Imari Sato · Rama Chellappa (Eds.)

Computer Vision – ACCV 2022 16th Asian Conference on Computer Vision Macao, China, December 4–8, 2022 Proceedings, Part IV

Lecture Notes in Computer Science

Founding Editors: Gerhard Goos, Juris Hartmanis

Editorial Board Members:
Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Moti Yung, Columbia University, New York, NY, USA

13844

The series Lecture Notes in Computer Science (LNCS), including its subseries Lecture Notes in Artificial Intelligence (LNAI) and Lecture Notes in Bioinformatics (LNBI), has established itself as a medium for the publication of new developments in computer science and information technology research, teaching, and education. LNCS enjoys close cooperation with the computer science R & D community, the series counts many renowned academics among its volume editors and paper authors, and collaborates with prestigious societies. Its mission is to serve this international community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings. LNCS commenced publication in 1973.

Lei Wang · Juergen Gall · Tat-Jun Chin · Imari Sato · Rama Chellappa Editors

Computer Vision – ACCV 2022 16th Asian Conference on Computer Vision Macao, China, December 4–8, 2022 Proceedings, Part IV

Editors:
Lei Wang, University of Wollongong, Wollongong, NSW, Australia
Juergen Gall, University of Bonn, Bonn, Germany
Tat-Jun Chin, University of Adelaide, Adelaide, SA, Australia
Imari Sato, National Institute of Informatics, Tokyo, Japan
Rama Chellappa, Johns Hopkins University, Baltimore, MD, USA

ISSN 0302-9743  ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-26315-6  ISBN 978-3-031-26316-3 (eBook)
https://doi.org/10.1007/978-3-031-26316-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

The 16th Asian Conference on Computer Vision (ACCV) 2022 was held in a hybrid mode in Macau SAR, China during December 4–8, 2022. The conference featured novel research contributions from almost all sub-areas of computer vision.

For the main conference, 836 valid submissions entered the review stage after desk rejection. Sixty-three area chairs and 959 reviewers made great efforts to ensure that every submission received thorough and high-quality reviews. As in previous editions of ACCV, this conference adopted a double-blind review process. The identities of authors were not visible to the reviewers or area chairs; nor were the identities of the assigned reviewers and area chairs known to the authors. The program chairs did not submit papers to the conference. After receiving the reviews, the authors had the option of submitting a rebuttal. Following that, the area chairs led the discussions and final recommendations were then made by the reviewers. Taking conflicts of interest into account, the area chairs formed 21 AC triplets to finalize the paper recommendations. With the confirmation of three area chairs for each paper, 277 papers were accepted. ACCV 2022 also included eight workshops, eight tutorials, and one grand challenge, covering various cutting-edge research topics related to computer vision.

The proceedings of ACCV 2022 are open access at the Computer Vision Foundation website, by courtesy of Springer. The quality of the papers presented at ACCV 2022 demonstrates the research excellence of the international computer vision communities.

This conference is fortunate to receive support from many organizations and individuals. We would like to express our gratitude for the continued support of the Asian Federation of Computer Vision and our sponsors, the University of Macau, Springer, the Artificial Intelligence Journal, and OPPO. ACCV 2022 used the Conference Management Toolkit sponsored by Microsoft Research and received much help from its support team. All the organizers, area chairs, reviewers, and authors made great contributions to ensure a successful ACCV 2022. For this, we owe them deep gratitude.

Last but not least, we would like to thank the online and in-person attendees of ACCV 2022. Their presence showed strong commitment and appreciation towards this conference.

December 2022

Lei Wang Juergen Gall Tat-Jun Chin Imari Sato Rama Chellappa

Organization

General Chairs
Gérard Medioni, University of Southern California, USA
Shiguang Shan, Chinese Academy of Sciences, China
Bohyung Han, Seoul National University, South Korea
Hongdong Li, Australian National University, Australia

Program Chairs
Rama Chellappa, Johns Hopkins University, USA
Juergen Gall, University of Bonn, Germany
Imari Sato, National Institute of Informatics, Japan
Tat-Jun Chin, University of Adelaide, Australia
Lei Wang, University of Wollongong, Australia

Publication Chairs
Wenbin Li, Nanjing University, China
Wanqi Yang, Nanjing Normal University, China

Local Arrangements Chairs
Liming Zhang, University of Macau, China
Jianjia Zhang, Sun Yat-sen University, China

Web Chairs
Zongyuan Ge, Monash University, Australia
Deval Mehta, Monash University, Australia
Zhongyan Zhang, University of Wollongong, Australia

AC Meeting Chair
Chee Seng Chan, University of Malaya, Malaysia

Area Chairs
Aljosa Osep, Technical University of Munich, Germany
Angela Yao, National University of Singapore, Singapore
Anh T. Tran, VinAI Research, Vietnam
Anurag Mittal, Indian Institute of Technology Madras, India
Binh-Son Hua, VinAI Research, Vietnam
C. V. Jawahar, International Institute of Information Technology, Hyderabad, India
Dan Xu, The Hong Kong University of Science and Technology, China
Du Tran, Meta AI, USA
Frederic Jurie, University of Caen and Safran, France
Guangcan Liu, Southeast University, China
Guorong Li, University of Chinese Academy of Sciences, China
Guosheng Lin, Nanyang Technological University, Singapore
Gustavo Carneiro, University of Surrey, UK
Hyun Soo Park, University of Minnesota, USA
Hyunjung Shim, Korea Advanced Institute of Science and Technology, South Korea
Jiaying Liu, Peking University, China
Jun Zhou, Griffith University, Australia
Junseok Kwon, Chung-Ang University, South Korea
Kota Yamaguchi, CyberAgent, Japan
Li Liu, National University of Defense Technology, China
Liang Zheng, Australian National University, Australia
Mathieu Aubry, Ecole des Ponts ParisTech, France
Mehrtash Harandi, Monash University, Australia
Miaomiao Liu, Australian National University, Australia
Ming-Hsuan Yang, University of California at Merced, USA
Palaiahnakote Shivakumara, University of Malaya, Malaysia
Pau-Choo Chung, National Cheng Kung University, Taiwan
Qianqian Xu, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China
Qiuhong Ke, Monash University, Australia
Radu Timofte, University of Würzburg, Germany and ETH Zurich, Switzerland
Rajagopalan N. Ambasamudram, Indian Institute of Technology Madras, India
Risheng Liu, Dalian University of Technology, China
Ruiping Wang, Institute of Computing Technology, Chinese Academy of Sciences, China
Sajid Javed, Khalifa University of Science and Technology, Abu Dhabi, UAE
Seunghoon Hong, Korea Advanced Institute of Science and Technology, South Korea
Shang-Hong Lai, National Tsing Hua University, Taiwan
Shanshan Zhang, Nanjing University of Science and Technology, China
Sharon Xiaolei Huang, Pennsylvania State University, USA
Shin'ichi Satoh, National Institute of Informatics, Japan
Si Liu, Beihang University, China
Suha Kwak, Pohang University of Science and Technology, South Korea
Tae Hyun Kim, Hanyang University, South Korea
Takayuki Okatani, Tohoku University, Japan/RIKEN Center for Advanced Intelligence Project, Japan
Tatsuya Harada, University of Tokyo/RIKEN, Japan
Vicky Kalogeiton, Ecole Polytechnique, France
Vincent Lepetit, Ecole des Ponts ParisTech, France
Vineeth N. Balasubramanian, Indian Institute of Technology, Hyderabad, India
Wei Shen, Shanghai Jiao Tong University, China
Wei-Shi Zheng, Sun Yat-sen University, China
Xiang Bai, Huazhong University of Science and Technology, China
Xiaowei Zhou, Zhejiang University, China
Xin Yu, University of Technology Sydney, Australia
Yasutaka Furukawa, Simon Fraser University, Canada
Yasuyuki Matsushita, Osaka University, Japan
Yedid Hoshen, Hebrew University of Jerusalem, Israel
Ying Fu, Beijing Institute of Technology, China
Yong Jae Lee, University of Wisconsin-Madison, USA
Yu-Chiang Frank Wang, National Taiwan University, Taiwan
Yumin Suh, NEC Laboratories America, USA
Yung-Yu Chuang, National Taiwan University, Taiwan
Zhaoxiang Zhang, Chinese Academy of Sciences, China
Ziad Al-Halah, University of Texas at Austin, USA
Zuzana Kukelova, Czech Technical University, Czech Republic

Additional Reviewers Abanob E. N. Soliman Abdelbadie Belmouhcine Adrian Barbu Agnibh Dasgupta Akihiro Sugimoto Akkarit Sangpetch Akrem Sellami Aleksandr Kim Alexander Andreopoulos Alexander Fix Alexander Kugele Alexandre Morgand Alexis Lechervy Alina E. Marcu Alper Yilmaz Alvaro Parra Amogh Subbakrishna Adishesha Andrea Giachetti Andrea Lagorio Andreu Girbau Xalabarder Andrey Kuehlkamp Anh Nguyen Anh T. Tran Ankush Gupta Anoop Cherian Anton Mitrokhin Antonio Agudo Antonio Robles-Kelly Ara Abigail Ambita Ardhendu Behera Arjan Kuijper Arren Matthew C. Antioquia Arjun Ashok Atsushi Hashimoto

Atsushi Shimada Attila Szabo Aurelie Bugeau Avatharam Ganivada Ayan Kumar Bhunia Azade Farshad B. V. K. Vijaya Kumar Bach Tran Bailin Yang Baojiang Zhong Baoquan Zhang Baoyao Yang Basit O. Alawode Beibei Lin Benoit Guillard Beomgu Kang Bin He Bin Li Bin Liu Bin Ren Bin Yang Bin-Cheng Yang BingLiang Jiao Bo Liu Bohan Li Boyao Zhou Boyu Wang Caoyun Fan Carlo Tomasi Carlos Torres Carvalho Micael Cees Snoek Chang Kong Changick Kim Changkun Ye Changsheng Lu

Chao Liu Chao Shi Chaowei Tan Chaoyi Li Chaoyu Dong Chaoyu Zhao Chen He Chen Liu Chen Yang Chen Zhang Cheng Deng Cheng Guo Cheng Yu Cheng-Kun Yang Chenglong Li Chengmei Yang Chengxin Liu Chengyao Qian Chen-Kuo Chiang Chenxu Luo Che-Rung Lee Che-Tsung Lin Chi Xu Chi Nhan Duong Chia-Ching Lin Chien-Cheng Lee Chien-Yi Wang Chih-Chung Hsu Chih-Wei Lin Ching-Chun Huang Chiou-Ting Hsu Chippy M. Manu Chong Wang Chongyang Wang Christian Siagian Christine Allen-Blanchette


Christoph Schorn Christos Matsoukas Chuan Guo Chuang Yang Chuanyi Zhang Chunfeng Song Chunhui Zhang Chun-Rong Huang Ci Lin Ci-Siang Lin Cong Fang Cui Wang Cui Yuan Cyrill Stachniss Dahai Yu Daiki Ikami Daisuke Miyazaki Dandan Zhu Daniel Barath Daniel Lichy Daniel Reich Danyang Tu David Picard Davide Silvestri Defang Chen Dehuan Zhang Deunsol Jung Difei Gao Dim P. Papadopoulos Ding-Jie Chen Dong Gong Dong Hao Dong Wook Shu Dongdong Chen Donghun Lee Donghyeon Kwon Donghyun Yoo Dongkeun Kim Dongliang Luo Dongseob Kim Dongsuk Kim Dongwan Kim Dongwon Kim DongWook Yang Dongze Lian

Dubing Chen Edoardo Remelli Emanuele Trucco Erhan Gundogdu Erh-Chung Chen Rickson R. Nascimento Erkang Chen Eunbyung Park Eunpil Park Eun-Sol Kim Fabio Cuzzolin Fan Yang Fan Zhang Fangyu Zhou Fani Deligianni Fatemeh Karimi Nejadasl Fei Liu Feiyue Ni Feng Su Feng Xue Fengchao Xiong Fengji Ma Fernando Díaz-del-Rio Florian Bernard Florian Kleber Florin-Alexandru Vasluianu Fok Hing Chi Tivive Frank Neumann Fu-En Yang Fumio Okura Gang Chen Gang Liu Gao Haoyuan Gaoshuai Wang Gaoyun An Gen Li Georgy Ponimatkin Gianfranco Doretto Gil Levi Guang Yang Guangfa Wang Guangfeng Lin Guillaume Jeanneret Guisik Kim

Gunhee Kim Guodong Wang Ha Young Kim Hadi Mohaghegh Dolatabadi Haibo Ye Haili Ye Haithem Boussaid Haixia Wang Han Chen Han Zou Hang Cheng Hang Du Hang Guo Hanlin Gu Hannah H. Kim Hao He Hao Huang Hao Quan Hao Ren Hao Tang Hao Zeng Hao Zhao Haoji Hu Haopeng Li Haoqing Wang Haoran Wen Haoshuo Huang Haotian Liu Haozhao Ma Hari Chandana K. Haripriya Harikumar Hehe Fan Helder Araujo Henok Ghebrechristos Heunseung Lim Hezhi Cao Hideo Saito Hieu Le Hiroaki Santo Hirokatsu Kataoka Hiroshi Omori Hitika Tiwari Hojung Lee Hong Cheng



Hong Liu Hu Zhang Huadong Tang Huajie Jiang Huang Ziqi Huangying Zhan Hui Kong Hui Nie Huiyu Duan Huyen Thi Thanh Tran Hyung-Jeong Yang Hyunjin Park Hyunsoo Kim HyunWook Park I-Chao Shen Idil Esen Zulfikar Ikuhisa Mitsugami Inseop Chung Ioannis Pavlidis Isinsu Katircioglu Jaeil Kim Jaeyoon Park Jae-Young Sim James Clark James Elder James Pritts Jan Zdenek Janghoon Choi Jeany Son Jenny Seidenschwarz Jesse Scott Jia Wan Jiadai Sun JiaHuan Ji Jiajiong Cao Jian Zhang Jianbo Jiao Jianhui Wu Jianjia Wang Jianjia Zhang Jianqiao Wangni JiaQi Wang Jiaqin Lin Jiarui Liu Jiawei Wang

Jiaxin Gu Jiaxin Wei Jiaxin Zhang Jiaying Zhang Jiayu Yang Jidong Tian Jie Hong Jie Lin Jie Liu Jie Song Jie Yang Jiebo Luo Jiejie Xu Jin Fang Jin Gao Jin Tian Jinbin Bai Jing Bai Jing Huo Jing Tian Jing Wu Jing Zhang Jingchen Xu Jingchun Cheng Jingjing Fu Jingshuai Liu JingWei Huang Jingzhou Chen JinHan Cui Jinjie Song Jinqiao Wang Jinsun Park Jinwoo Kim Jinyu Chen Jipeng Qiang Jiri Sedlar Jiseob Kim Jiuxiang Gu Jiwei Xiao Jiyang Zheng Jiyoung Lee John Paisley Joonki Paik Joonseok Lee Julien Mille

Julio C. Zamora Jun Sato Jun Tan Jun Tang Jun Xiao Jun Xu Junbao Zhuo Jun-Cheng Chen Junfen Chen Jungeun Kim Junhwa Hur Junli Tao Junlin Han Junsik Kim Junting Dong Junwei Zhou Junyu Gao Kai Han Kai Huang Kai Katsumata Kai Zhao Kailun Yang Kai-Po Chang Kaixiang Wang Kamal Nasrollahi Kamil Kowol Kan Chang Kang-Jun Liu Kanchana Vaishnavi Gandikota Kanoksak Wattanachote Karan Sikka Kaushik Roy Ke Xian Keiji Yanai Kha Gia Quach Kibok Lee Kira Maag Kirill Gavrilyuk Kohei Suenaga Koichi Ito Komei Sugiura Kong Dehui Konstantinos Batsos Kotaro Kikuchi


Kouzou Ohara Kuan-Wen Chen Kun He Kun Hu Kun Zhan Kunhee Kim Kwan-Yee K. Wong Kyong Hwan Jin Kyuhong Shim Kyung Ho Park Kyungmin Kim Kyungsu Lee Lam Phan Lanlan Liu Le Hui Lei Ke Lei Qi Lei Yang Lei Yu Lei Zhu Leila Mahmoodi Li Jiao Li Su Lianyu Hu Licheng Jiao Lichi Zhang Lihong Zheng Lijun Zhao Like Xin Lin Gu Lin Xuhong Lincheng Li Linghua Tang Lingzhi Kong Linlin Yang Linsen Li Litao Yu Liu Liu Liujie Hua Li-Yun Wang Loren Schwiebert Lujia Jin Lujun Li Luping Zhou Luting Wang

Mansi Sharma Mantini Pranav Mahmoud Zidan Khairallah Manuel Günther Marcella Astrid Marco Piccirilli Martin Kampel Marwan Torki Masaaki Iiyama Masanori Suganuma Masayuki Tanaka Matan Jacoby Md Alimoor Reza Md. Zasim Uddin Meghshyam Prasad Mei-Chen Yeh Meng Tang Mengde Xu Mengyang Pu Mevan B. Ekanayake Michael Bi Mi Michael Wray Michaël Clément Michel Antunes Michele Sasdelli Mikhail Sizintsev Min Peng Min Zhang Minchul Shin Minesh Mathew Ming Li Ming Meng Ming Yin Ming-Ching Chang Mingfei Cheng Minghui Wang Mingjun Hu MingKun Yang Mingxing Tan Mingzhi Yuan Min-Hung Chen Minhyun Lee Minjung Kim Min-Kook Suh

Minkyo Seo Minyi Zhao Mo Zhou Mohammad Amin A. Shabani Moein Sorkhei Mohit Agarwal Monish K. Keswani Muhammad Sarmad Muhammad Kashif Ali Myung-Woo Woo Naeemullah Khan Naman Solanki Namyup Kim Nan Gao Nan Xue Naoki Chiba Naoto Inoue Naresh P. Cuntoor Nati Daniel Neelanjan Bhowmik Niaz Ahmad Nicholas I. Kuo Nicholas E. Rosa Nicola Fioraio Nicolas Dufour Nicolas Papadakis Ning Liu Nishan Khatri Ole Johannsen P. Real Jurado Parikshit V. Sakurikar Patrick Peursum Pavan Turaga Peijie Chen Peizhi Yan Peng Wang Pengfei Fang Penghui Du Pengpeng Liu Phi Le Nguyen Philippe Chiberre Pierre Gleize Pinaki Nath Chowdhury Ping Hu



Ping Li Ping Zhao Pingping Zhang Pradyumna Narayana Pritish Sahu Qi Li Qi Wang Qi Zhang Qian Li Qian Wang Qiang Fu Qiang Wu Qiangxi Zhu Qianying Liu Qiaosi Yi Qier Meng Qin Liu Qing Liu Qing Wang Qingheng Zhang Qingjie Liu Qinglin Liu Qingsen Yan Qingwei Tang Qingyao Wu Qingzheng Wang Qizao Wang Quang Hieu Pham Rabab Abdelfattah Rabab Ward Radu Tudor Ionescu Rahul Mitra Raül Pérez i Gonzalo Raymond A. Yeh Ren Li Renán Rojas-Gómez Renjie Wan Renuka Sharma Reyer Zwiggelaar Robin Chan Robin Courant Rohit Saluja Rongkai Ma Ronny Hänsch Rui Liu

Rui Wang Rui Zhu Ruibing Hou Ruikui Wang Ruiqi Zhao Ruixing Wang Ryo Furukawa Ryusuke Sagawa Saimunur Rahman Samet Akcay Samitha Herath Sanath Narayan Sandesh Kamath Sanghoon Jeon Sanghyun Son Satoshi Suzuki Saumik Bhattacharya Sauradip Nag Scott Wehrwein Sebastien Lefevre Sehyun Hwang Seiya Ito Selen Pehlivan Sena Kiciroglu Seok Bong Yoo Seokjun Park Seongwoong Cho Seoungyoon Kang Seth Nixon Seunghwan Lee Seung-Ik Lee Seungyong Lee Shaifali Parashar Shan Cao Shan Zhang Shangfei Wang Shaojian Qiu Shaoru Wang Shao-Yuan Lo Shengjin Wang Shengqi Huang Shenjian Gong Shi Qiu Shiguang Liu Shih-Yao Lin

Shin-Jye Lee Shishi Qiao Shivam Chandhok Shohei Nobuhara Shreya Ghosh Shuai Yuan Shuang Yang Shuangping Huang Shuigeng Zhou Shuiwang Li Shunli Zhang Shuo Gu Shuoxin Lin Shuzhi Yu Sida Peng Siddhartha Chandra Simon S. Woo Siwei Wang Sixiang Chen Siyu Xia Sohyun Lee Song Guo Soochahn Lee Soumava Kumar Roy Srinjay Soumitra Sarkar Stanislav Pidhorskyi Stefan Gumhold Stefan Matcovici Stefano Berretti Stylianos Moschoglou Sudhir Yarram Sudong Cai Suho Yang Sumitra S. Malagi Sungeun Hong Sunggu Lee Sunghyun Cho Sunghyun Myung Sungmin Cho Sungyeon Kim Suzhen Wang Sven Sickert Syed Zulqarnain Gilani Tackgeun You Taehun Kim


Takao Yamanaka Takashi Shibata Takayoshi Yamashita Takeshi Endo Takeshi Ikenaga Tanvir Alam Tao Hong Tarun Kalluri Tat-Jen Cham Tatsuya Yatagawa Teck Yian Lim Tejas Indulal Dhamecha Tengfei Shi Thanh-Dat Truong Thomas Probst Thuan Hoang Nguyen Tian Ye Tianlei Jin Tianwei Cao Tianyi Shi Tianyu Song Tianyu Wang Tien-Ju Yang Tingting Fang Tobias Baumgartner Toby P. Breckon Torsten Sattler Trung Tuan Dao Trung Le Tsung-Hsuan Wu Tuan-Anh Vu Utkarsh Ojha Utku Ozbulak Vaasudev Narayanan Venkata Siva Kumar Margapuri Vandit J. Gajjar Vi Thi Tuong Vo Victor Fragoso Vikas Desai Vincent Lepetit Vinh Tran Viresh Ranjan Wai-Kin Adams Kong Wallace Michel Pinto Lira

Walter Liao Wang Yan Wang Yong Wataru Shimoda Wei Feng Wei Mao Wei Xu Weibo Liu Weichen Xu Weide Liu Weidong Chen Weihong Deng Wei-Jong Yang Weikai Chen Weishi Zhang Weiwei Fang Weixin Lu Weixin Luo Weiyao Wang Wenbin Wang Wenguan Wang Wenhan Luo Wenju Wang Wenlei Liu Wenqing Chen Wenwen Yu Wenxing Bao Wenyu Liu Wenzhao Zheng Whie Jung Williem Williem Won Hwa Kim Woohwan Jung Wu Yirui Wu Yufeng Wu Yunjie Wugen Zhou Wujie Sun Wuman Luo Xi Wang Xianfang Sun Xiang Chen Xiang Li Xiangbo Shu Xiangcheng Liu

Xiangyu Wang Xiao Wang Xiao Yan Xiaobing Wang Xiaodong Wang Xiaofeng Wang Xiaofeng Yang Xiaogang Xu Xiaogen Zhou Xiaohan Yu Xiaoheng Jiang Xiaohua Huang Xiaoke Shen Xiaolong Liu Xiaoqin Zhang Xiaoqing Liu Xiaosong Wang Xiaowen Ma Xiaoyi Zhang Xiaoyu Wu Xieyuanli Chen Xin Chen Xin Jin Xin Wang Xin Zhao Xindong Zhang Xingjian He Xingqun Qi Xinjie Li Xinqi Fan Xinwei He Xinyan Liu Xinyu He Xinyue Zhang Xiyuan Hu Xu Cao Xu Jia Xu Yang Xuan Luo Xubo Yang Xudong Lin Xudong Xie Xuefeng Liang Xuehui Wang Xuequan Lu



Xuesong Yang Xueyan Zou XuHu Lin Xun Zhou Xupeng Wang Yali Zhang Ya-Li Li Yalin Zheng Yan Di Yan Luo Yan Xu Yang Cao Yang Hu Yang Song Yang Zhang Yang Zhao Yangyang Shu Yani A. Ioannou Yaniv Nemcovsky Yanjun Zhu Yanling Hao Yanling Tian Yao Guo Yao Lu Yao Zhou Yaping Zhao Yasser Benigmim Yasunori Ishii Yasushi Yagi Yawei Li Ye Ding Ye Zhu Yeongnam Chae Yeying Jin Yi Cao Yi Liu Yi Rong Yi Tang Yi Wei Yi Xu Yichun Shi Yifan Zhang Yikai Wang Yikang Ding Yiming Liu

Yiming Qian Yin Li Yinghuan Shi Yingjian Li Yingkun Xu Yingshu Chen Yingwei Pan Yiping Tang Yiqing Shen Yisheng Zhu Yitian Li Yizhou Yu Yoichi Sato Yong A. Yongcai Wang Yongheng Ren Yonghuai Liu Yongjun Zhang Yongkang Luo Yongkang Wong Yongpei Zhu Yongqiang Zhang Yongrui Ma Yoshimitsu Aoki Yoshinori Konishi Young Jun Heo Young Min Shin Youngmoon Lee Youpeng Zhao Yu Ding Yu Feng Yu Zhang Yuanbin Wang Yuang Wang Yuanhong Chen Yuanyuan Qiao Yucong Shen Yuda Song Yue Huang Yufan Liu Yuguang Yan Yuhan Xie Yu-Hsuan Chen Yu-Hui Wen Yujiao Shi

Yujin Ren Yuki Tatsunami Yukuan Jia Yukun Su Yu-Lun Liu Yun Liu Yunan Liu Yunce Zhao Yun-Chun Chen Yunhao Li Yunlong Liu Yunlong Meng Yunlu Chen Yunqian He Yunzhong Hou Yuqiu Kong Yusuke Hosoya Yusuke Matsui Yusuke Morishita Yusuke Sugano Yuta Kudo Yu-Ting Wu Yutong Dai Yuxi Hu Yuxi Yang Yuxuan Li Yuxuan Zhang Yuzhen Lin Yuzhi Zhao Yvain Queau Zanwei Zhou Zebin Guo Ze-Feng Gao Zejia Fan Zekun Yang Zelin Peng Zelong Zeng Zenglin Xu Zewei Wu Zhan Li Zhan Shi Zhe Li Zhe Liu Zhe Zhang Zhedong Zheng


Zhenbo Xu Zheng Gu Zhenhua Tang Zhenkun Wang Zhenyu Weng Zhi Zeng Zhiguo Cao Zhijie Rao Zhijie Wang Zhijun Zhang Zhimin Gao Zhipeng Yu Zhiqiang Hu Zhisong Liu Zhiwei Hong Zhiwei Xu

Zhiwu Lu Zhixiang Wang Zhixin Li Zhiyong Dai Zhiyong Huang Zhiyuan Zhang Zhonghua Wu Zhongyan Zhang Zhongzheng Yuan Zhu Hu Zhu Meng Zhujun Li Zhulun Yang Zhuojun Zou Ziang Cheng Zichuan Liu

Zihan Ding Zihao Zhang Zijiang Song Zijin Yin Ziqiang Zheng Zitian Wang Ziwei Yao Zixun Zhang Ziyang Luo Ziyi Bai Ziyi Wang Zongheng Tang Zongsheng Cao Zongwei Wu Zoran Duric


Contents – Part IV

Face and Gesture

Confidence-Calibrated Face Image Forgery Detection with Contrastive Representation Distillation . . . . 3
  Puning Yang, Huaibo Huang, Zhiyong Wang, Aijing Yu, and Ran He

Exposing Face Forgery Clues via Retinex-Based Image Enhancement . . . . 20
  Han Chen, Yuzhen Lin, and Bin Li

GB-CosFace: Rethinking Softmax-Based Face Recognition from the Perspective of Open Set Classification . . . . 35
  Mingqiang Chen, Lizhe Liu, Xiaohao Chen, and Siyu Zhu

Learning Video-Independent Eye Contact Segmentation from In-the-Wild Videos . . . . 52
  Tianyi Wu and Yusuke Sugano

Exemplar Free Class Agnostic Counting . . . . 71
  Viresh Ranjan and Minh Hoai Nguyen

Emphasizing Closeness and Diversity Simultaneously for Deep Face Representation . . . . 88
  Chaoyu Zhao, Jianjun Qian, Shumin Zhu, Jin Xie, and Jian Yang

KinStyle: A Strong Baseline Photorealistic Kinship Face Synthesis with an Optimized StyleGAN Encoder . . . . 105
  Li-Chen Cheng, Shu-Chuan Hsu, Pin-Hua Lee, Hsiu-Chieh Lee, Che-Hsien Lin, Jun-Cheng Chen, and Chih-Yu Wang

Occluded Facial Expression Recognition Using Self-supervised Learning . . . . 121
  Jiahe Wang, Heyan Ding, and Shangfei Wang

Heterogeneous Avatar Synthesis Based on Disentanglement of Topology and Rendering . . . . 137
  Nan Gao, Zhi Zeng, GuiXuan Zhang, and ShuWu Zhang

Pose and Action

Focal and Global Spatial-Temporal Transformer for Skeleton-Based Action Recognition . . . . 155
  Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu, and Wanqing Li

Spatial-Temporal Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition . . . . 172
  Rui Hang and MinXian Li

3D Pose Based Feedback for Physical Exercises . . . . 189
  Ziyi Zhao, Sena Kiciroglu, Hugues Vinzant, Yuan Cheng, Isinsu Katircioglu, Mathieu Salzmann, and Pascal Fua

Generating Multiple Hypotheses for 3D Human Mesh and Pose Using Conditional Generative Adversarial Nets . . . . 206
  Xu Zheng, Yali Zheng, and Shubing Yang

SCOAD: Single-Frame Click Supervision for Online Action Detection . . . . 223
  Na Ye, Xing Zhang, Dawei Yan, Wei Dong, and Qingsen Yan

Neural Puppeteer: Keypoint-Based Neural Rendering of Dynamic Shapes . . . . 239
  Simon Giebenhain, Urs Waldmann, Ole Johannsen, and Bastian Goldluecke

Decanus to Legatus: Synthetic Training for 2D-3D Human Pose Lifting . . . . 257
  Yue Zhu and David Picard

Social Aware Multi-modal Pedestrian Crossing Behavior Prediction . . . . 275
  Xiaolin Zhai, Zhengxi Hu, Dingye Yang, Lei Zhou, and Jingtai Liu

Action Representing by Constrained Conditional Mutual Information . . . . 291
  Haoyuan Gao, Yifaan Zhang, Linhui Sun, and Jian Cheng

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition . . . . 307
  Lei Wang and Piotr Koniusz

Video Analysis and Event Recognition

Spatial Temporal Network for Image and Skeleton Based Group Activity Recognition . . . . 329
  Xiaolin Zhai, Zhengxi Hu, Dingye Yang, Lei Zhou, and Jingtai Liu

Learning Using Privileged Information for Zero-Shot Action Recognition . . . . 347
  Zhiyi Gao, Yonghong Hou, Wanqing Li, Zihui Guo, and Bin Yu

MGTR: End-to-End Mutual Gaze Detection with Transformer . . . . 363
  Hang Guo, Zhengxi Hu, and Jingtai Liu

Is an Object-Centric Video Representation Beneficial for Transfer? . . . . 379
  Chuhan Zhang, Ankush Gupta, and Andrew Zisserman

DCVQE: A Hierarchical Transformer for Video Quality Assessment . . . . 398
  Zutong Li and Lei Yang

Not End-to-End: Explore Multi-Stage Architecture for Online Surgical Phase Recognition . . . . 417
  Fangqiu Yi, Yanfeng Yang, and Tingting Jiang

FunnyNet: Audiovisual Learning of Funny Moments in Videos . . . . 433
  Zhi-Song Liu, Robin Courant, and Vicky Kalogeiton

ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval . . . . 451
  Adriano Fragomeni, Michael Wray, and Dima Damen

A Compressive Prior Guided Mask Predictive Coding Approach for Video Analysis . . . . 469
  Zhimeng Huang, Chuanmin Jia, Shanshe Wang, and Siwei Ma

BaSSL: Boundary-aware Self-Supervised Learning for Video Scene Segmentation . . . . 485
  Jonghwan Mun, Minchul Shin, Gunsoo Han, Sangho Lee, Seongsu Ha, Joonseok Lee, and Eun-Sol Kim

HaViT: Hybrid-Attention Based Vision Transformer for Video Classification . . . . 502
  Li Li, Liansheng Zhuang, Shenghua Gao, and Shafei Wang

Vision and Language

From Sparse to Dense: Semantic Graph Evolutionary Hashing for Unsupervised Cross-Modal Retrieval . . . . 521
  Yang Zhao, Jiaguo Yu, Shengbin Liao, Zheng Zhang, and Haofeng Zhang

SST-VLM: Sparse Sampling-Twice Inspired Video-Language Model . . . . 537
  Yizhao Gao and Zhiwu Lu

PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization . . . . 554
  Zhenxing Zheng, Gaoyun An, Shan Cao, Zhaoqilin Yang, and Qiuqi Ruan

Causal Property Based Anti-conflict Modeling with Hybrid Data Augmentation for Unbiased Scene Graph Generation . . . . 571
  Ruonan Zhang and Gaoyun An

gScoreCAM: What Objects Is CLIP Looking At? . . . . 588
  Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen

From Within to Between: Knowledge Distillation for Cross Modality Retrieval . . . . 605
  Vinh Tran, Niranjan Balasubramanian, and Minh Hoai

Thinking Hallucination for Video Captioning . . . . 623
  Nasib Ullah and Partha Pratim Mohanta

Boundary-Aware Temporal Sentence Grounding with Adaptive Proposal Refinement . . . . 641
  Jianxiang Dong and Zhaozheng Yin

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering . . . . 658
  Bingjia Li, Jie Wang, Minyi Zhao, and Shuigeng Zhou

Bright as the Sun: In-depth Analysis of Imagination-Driven Image Captioning . . . . 675
  Huyen Thi Thanh Tran and Takayuki Okatani

Heterogeneous Interactive Learning Network for Unsupervised Cross-Modal Retrieval . . . . 692
  Yuanchao Zheng and Xiaowei Zhang

Biometrics

GaitStrip: Gait Recognition via Effective Strip-Based Feature Representations and Multi-level Framework . . . . 711
  Ming Wang, Beibei Lin, Xianda Guo, Lincheng Li, Zheng Zhu, Jiande Sun, Shunli Zhang, Yu Liu, and Xin Yu

Soft Label Mining and Average Expression Anchoring for Facial Expression Recognition . . . . 728
  Haipeng Ming, Wenhuan Lu, and Wei Zhang

'Labelling the Gaps': A Weakly Supervised Automatic Eye Gaze Estimation . . . . 745
  Shreya Ghosh, Abhinav Dhall, Munawar Hayat, and Jarrod Knibbe

Author Index . . . . 765

Face and Gesture

Confidence-Calibrated Face Image Forgery Detection with Contrastive Representation Distillation

Puning Yang¹,², Huaibo Huang¹,², Zhiyong Wang³, Aijing Yu¹,², and Ran He¹,²(B)

¹ Center for Research on Intelligent Perception and Computing, NLPR, CASIA, Beijing, China
  {puning.yang,huaibo.huang,aijing.yu}@cripac.ia.ac.cn, [email protected]
² School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
³ Biomedical and Multimedia Information Technology (BMIT) Research Group, School of Information Technologies, University of Sydney, Camperdown, Australia
  [email protected]

Abstract. Face forgery detection has been increasingly investigated due to the great success of various deepfake techniques. While most existing face forgery detection methods have achieved excellent results on the test split of the same dataset or the same type of manipulations, they often do not work well on unseen datasets or unseen manipulations due to limited model generalization. Therefore, in this paper, we propose a novel contrastive distillation calibration (CDC) framework, which distills contrastive representations with confidence calibration to address this generalization issue. Different from previous methods that treat the two forgery types, Face Swapping and Face Reenactment, equally, we devise a dual-teacher module in which the knowledge of each forgery type is learned separately. A contrastive representation learning strategy is further presented to enhance the representations of diverse forgery artifacts. To prevent the proposed model from becoming overconfident, we propose a novel Kullback-Leibler divergence loss with dynamic weights to moderate the dual teachers' outputs. In addition, we introduce label smoothing to calibrate the model confidence with the target outputs. Extensive experiments on three popular datasets show that our proposed method achieves state-of-the-art performance for cross-dataset face forgery detection.

1

· Confidence calibration · Knowledge

Introduction

Recent years have witnessed the rapid development of various deepfake techniques, such as face swapping and face reenactment [27,46–48]. As a result, face Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3 1. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 3–19, 2023. https://doi.org/10.1007/978-3-031-26316-3_1

4

P. Yang et al.

Fig. 1. Forgery samples from FaceForensic++ (FF++) [40]. Face Swapping wants to change the identity and keep the motion of the target face. On the contrary, Face Reenactment wants to preserve the identity and change the motion of the target face. Different forgery artifacts are generated during the different forgery processes. Based on this observation, we propose a feature-augmented contrastive representation approach.

forgery detection has been increasingly investigated to address the societal concerns on identity fraud, and many face forgery detection methods have been developed with promising progress [15,55,56]. Existing methods generally learn representations from spatial, temporal, and frequency domains to distinguish forged faces from genuine ones. Early studies focused on detecting face forgery in a single domain [37,38]. Although these methods achieved excellent performance on seen datasets and manipulations, they lack generalization to unseen datasets and manipulations. Then multi-domain methods [33,35] were proposed to improve generalization by fusing the features of multiple domains. However, the redundant representations introduced by multi-domain features can easily lead to over-fitting. As a result, additional constraints on inter-domain independence are required, which eventually increases computational costs. Recently, deep learning based methods [1,40] have been investigated to learn features common to different types of forgeries. However, none of the existing techniques consider the forgery artifact differences between different types of forgery. As illustrated in Fig. 1, two types of forgeries, Face Swapping and Face Reenactment, have distinct characteristics due to different objectives, which need to be exploited separately. Therefore, in this paper, we propose a novel contrastive representation distillation model to address the issue of model generalization. Specifically, we devise a dual-teacher module in which each teacher is trained with one of two forgery types. That is, a Face Swap teacher and a Face Reenactment teacher are trained separately to obtain discriminative features for individual forgeries. Then, the knowledge from both teachers will be jointly distilled to a student model for improved model generalization. To further improve model generalization, we refine the task of face forgery detection from binary classification (i.e., fake vs real) to fine-grained confidence scores in the range of [0, 1]. To this end, we propose to calibrate the model


confidence with the scores. Therefore, we further devise a new loss function, namely the Confidence Calibration Loss, to calibrate the output of our framework. We exploit label smoothing to quantify the target output at finer levels. The distribution of targets is changed into a uniform distribution between the ground-truth labels and the labels processed by label smoothing regularization. The new targets will guide the model to make more fine-grained predictions to alleviate the issue of being overconfident. In addition, according to the outputs from each teacher, two dynamic confidence weights are assigned to each input sample. A loss with dynamic weights expresses the prediction of the two teachers for the current sample more accurately. Therefore, we name our proposed framework Contrastive Distillation Calibration (CDC). Overall, the key contributions of this paper are summarized as follows:

– We propose a novel contrastive distillation calibration framework to enhance the generalization of face forgery detection by designing a dual-teacher module for knowledge distillation and utilizing contrastive representation learning without additional disentangling.
– We propose to calibrate the model confidence with the targets for the first time for face forgery detection and devise a new Kullback-Leibler divergence based loss function with a label smoothing strategy.
– We perform comprehensive experiments on three widely used benchmark datasets: FaceForensics++ [40], Celeb-DF [31], and DFDC [8] to demonstrate that our proposed method outperforms the state-of-the-art face forgery detection methods on unseen datasets. Code is publicly available at: https://github.com/Puning97/CDC_face_forgery_detection.

2 Related Work

2.1 Face Forgery Detection

Most of the existing face forgery detection methods identify forgery traces from the perspective of the spatial [28,37,40,53], frequency [33,38], and temporal [29,35,56] domains. For example, Face X-ray [28] mainly paid attention to the mixing step existing in most face forgery cases and achieved state-of-the-art performance from the generalization perspective on raw videos. F3-Net [38] exploited Discrete Cosine Transform (DCT) coefficients to extract frequency features and achieved state-of-the-art performance on highly compressed videos. Various methods have also been proposed to exploit specific temporal incoherence in the temporal domain, such as eye blinking [29], lips motion [15], or expression [35]. However, these methods generally focus on learning low-level features from given datasets with a specific type of forgery, which cannot be generalized across different datasets and different types of forgeries. In our work, we propose a dual-teacher module to enhance the learning of generalized representations with contrastive learning.

2.2 Knowledge Distillation

Knowledge distillation refers to a learning process of distilling knowledge from a big teacher model into a small student model, and it has a history of more than ten years. Busira et al. [3] presented a method to compress a set of models into a single model without significant accuracy loss. Ba and Caruana [2] extended this idea to deep learning by using the logits of the teacher model. Hinton et al. [21] revived this idea under the name of knowledge distillation, which distills class probabilities by minimizing the Kullback-Leibler (KL) divergence between the softmax outputs of the teacher and student. In addition, the hidden representation of the teacher has been shown to hold additional knowledge that can contribute to improving the student's performance. Especially in computer vision, some recent knowledge distillation methods [24,39,42,54] were proposed to minimize the mean squared error (MSE) between the representation-level knowledge of two models; they addressed how to better extract more useful knowledge from the teacher model and transfer it to the student. For vision tasks, the most common knowledge distillation methods [10,14,20] focus on combining the ground truth and the teacher's predictions as the overall targets to train students.

On the one hand, we continue the idea of most knowledge distillation models: distilling the knowledge of the teacher model into the student model. On the other hand, existing distillation methods generally aim to transfer features from a big model to a small one. Such an operation often leads to students that lack prominent local advantages and whose global representation ability is weaker than that of their teachers. Therefore, we first train two small teachers with locally salient feature representations, and then distill these two small teachers' knowledge into a big student model. Theoretically, the large student preserves the representations of both teachers and improves model generalization through the transfer process.

2.3 Confidence Calibration

Modifying a model appropriately improves its robustness and generalization [17,18]. Confidence calibration is effective in enhancing the generalization of a model and improving its reliability for deployment in realistic scenarios [13,43]. As deep neural networks suffer from an over-fitting issue, Guo et al. [13] explained that existing neural networks often make overconfident predictions and proposed the concept of confidence calibration. Hein et al. [19] suggested that neural networks using the ReLU activation function are essentially piecewise linear functions, thus explaining why out-of-distribution data can easily cause softmax classifiers to generate highly confident but incorrect outputs: with piecewise linear functions, methods that operate on the output of classifiers cannot recognize an input as out-of-distribution. Confidence calibration improves the generalization of a model from this perspective.

Many confidence calibration methods have been proposed in recent years, including temperature scaling [22,32], mixup [49], label smoothing [44], Monte


Fig. 2. Overview of our proposed CDC framework. We first perform Contrastive Dual-teacher Learning to train the two teachers, and then undertake contrastive representation-based distillation with confidence calibration. Note that abbreviations are used for some symbols, e.g., ŷm = ŷmotion, th = thard, ŷi = ŷid, ŝs = ŝstu.

Carlo Dropout [12], and Deep Ensembles [26]. Label smoothing is a method of fusing the ground truth with a uniform distribution. It enforces a model to generate a smoother probability distribution to fit the soft targets, which is utilized in our distillation framework. In addition to label smoothing, we also propose to exploit dynamic confidence weights to better leverage the knowledge of the two teachers.

3 Methodology

As shown in Fig. 2, our proposed forgery detection method consists of three key components: a dual-teacher block, a student block, and a confidence calibration block. Firstly, we divide the FF++ dataset into two sub-datasets: Face Swapping Dataset (FSD) containing face swapped images and their corresponding genuine images and Face Reenactment Dataset (FRD) containing motion reenacted images and their corresponding genuine images, and obtain an id-teacher and a motion-teacher by training our backbone on FSD and FRD, respectively. Secondly, the knowledge of both teachers in the dual-teacher block is transferred to the student by training it on a multi-category forgery dataset. Finally, to achieve fine-grained classification of multiple types of forgeries, we devise confidence calibration-based loss functions with dynamic weights and introduce label smoothing to better supervise network training.


Table 1. Cross-evaluation results of the Dual-teacher block in terms of AUC (%) on FSD and FRD.

Teachers | Face swapping | Face reenactment
ID-Teacher | 99.53 | 58.12
Motion-Teacher | 76.58 | 99.25

3.1 Feature Representation

Existing forgery methods include two modes: face swapping and face reenactment. Face swapping aims to exchange identity information while keeping the motion information maximally. On the contrary, face reenactment aims to exchange the motion information and keep the identity information. Inspired by the principle of contrastive learning, we divide the FF++ dataset into two sub-datasets: the Face Swapping Dataset, which includes face swapped forgeries and corresponding genuine images, and the Face Reenactment Dataset, which provides face reenacted forgeries and corresponding genuine images. Two backbones are trained on these two sub-datasets, respectively, yielding two teachers: the ID-Teacher and the Motion-Teacher. A cross-evaluation experiment was undertaken, with results shown in Table 1, to verify the independence and validity of the ID-Teacher and the Motion-Teacher. The differences in cross-category evaluation indicate that existing face reenactment forgery methods keep the identity information well, whereas face swapping methods do not keep the motion information well.

We now introduce the details of our backbone, which consists of two parts: EfficientNet-B3 [45] and the Feature Transformer. These two parts are trained in an end-to-end manner for teacher training. Overall, given a suspect image I \in \mathbb{R}^{H \times W \times 3}, the first stage is the EfficientNet-B3 [45] pre-trained from TIMM [52]. It extracts an identity or motion feature F = \Phi_{EfficientNet\text{-}B3}(I), F \in \mathbb{R}^{H' \times W' \times C}. The second stage is the Feature Transformer, based on the Vision Transformer [9]. Note that the global feature F can be represented as a sequence of local features F_l \in \mathbb{R}^{H' \times W'}, l \in \{1, 2, \ldots, C\}, which plays the role of the 2D sequence of token embeddings in the standard Vision Transformer [9]; here C denotes the input sequence length, and H' and W' equal the input patch size p. Following the settings in ViT [9], we flatten the patches and map them to D dimensions with a trainable linear projection (Eq. 1). We prepend a learnable class embedding to the sequence of embedded patches (z_0^0 = F_{class}), and learnable 1D position embeddings E_{pos} are also included to retain positional information:

z_0 = [F_{class}; F_1 E; F_2 E; \cdots; F_C E] + E_{pos},    (1)

where F_n denotes the n-th part of the feature F, E \in \mathbb{R}^{(H' \cdot W') \times D}, and E_{pos} \in \mathbb{R}^{(C+1) \times D}. The core of the Feature Transformer is L standard Transformer Encoder blocks, each consisting of a multi-head self-attention (MSA) block (Eq. 2) and an MLP block (Eq. 3). Layer normalization (LN) [50] is applied before every block and residual connections after every block. The MLP contains two layers with a GELU activation function:

z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \cdots L,    (2)

z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1 \cdots L.    (3)

The output of the Transformer Encoder, z_L^0, is the representation of the image, and we apply an MLP head for the final prediction (Fig. 3):

y = \mathrm{MLP}(\mathrm{LN}(z_L^0)).    (4)

Fig. 3. Illustration of the backbone architecture, which consists of EfficientNet-B3 and a variant of the Vision Transformer. We split the features extracted from EfficientNet-B3 along the channel dimension and feed the sequences of features to a standard Transformer Encoder.
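To make the backbone concrete, the following is a minimal PyTorch sketch of the teacher architecture described above. It assumes a 224 × 224 input (so each channel map becomes a 7 × 7 = 49-dimensional token), timm's efficientnet_b3 weights, and illustrative layer/head counts; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import timm


class TeacherBackbone(nn.Module):
    def __init__(self, embed_dim=256, depth=1, num_heads=8, token_dim=49, num_classes=1):
        super().__init__()
        # EfficientNet-B3 trunk from TIMM; only its last feature map is used.
        self.cnn = timm.create_model("efficientnet_b3", pretrained=True, features_only=True)
        num_tokens = self.cnn.feature_info.channels()[-1]            # C: one token per channel
        self.proj = nn.Linear(token_dim, embed_dim)                  # linear projection E (token_dim = H'*W')
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, dim_feedforward=4 * embed_dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, x):                                            # x: (B, 3, 224, 224)
        feat = self.cnn(x)[-1]                                       # (B, C, H', W'), 7x7 for 224 inputs
        tokens = self.proj(feat.flatten(2))                          # (B, C, D): channel maps as tokens
        cls = self.cls_token.expand(feat.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed         # Eq. (1)
        z = self.encoder(z)                                          # Eqs. (2)-(3)
        return torch.sigmoid(self.head(z[:, 0]))                     # Eq. (4): score from the class token
```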

3.2 Dual-Teacher Knowledge Distillation (DKD)

Now we have two teachers to form the dual-teacher block. Specifically, the ID-Teacher maps face swapped samples I_{id} to the output y_{id}, and the Motion-Teacher maps face reenacted samples I_{motion} to the output y_{motion}. Our goal is to teach a student that inherits advantages from both teachers by distilling the knowledge from the dual-teacher block. Considering that the scales of the data seen by the teachers and the student are different, we fine-tune the backbone structure: the teachers and the student differ in the number of layers and heads in the Feature Transformer encoder. A given image I_{train} is fed to three branches: the ID-Teacher, the Motion-Teacher, and the Student. On the one hand, we obtain the predictions \hat{s}_{stu} from the Student and the hard targets \hat{t}_{hard} from the ground truth of the training dataset. On the other hand, the ID-Teacher and the Motion-Teacher respectively predict a score for the input, yielding \hat{y}_{id} and \hat{y}_{motion}. We use the binary cross-entropy (BCE) loss to supervise the prediction on the hard targets (Eq. 5), and the learning process between teachers and student is supervised by the Kullback-Leibler (KL) divergence loss (Eq. 6):

L_{hard} = \frac{1}{N}\sum_{1}^{N} \mathrm{BCE}(\hat{s}_{stu}, \hat{t}_{hard}),    (5)

L_{teacher} = \frac{1}{N}\sum_{1}^{N} \left(\mathrm{KL}(\hat{s}_{stu}, \hat{y}_{id}) + \mathrm{KL}(\hat{s}_{stu}, \hat{y}_{motion})\right).    (6)

The overall loss function of our distillation model is as follows:

L_{distill} = \lambda L_{hard} + \lambda L_{teacher}.    (7)
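A hedged sketch of the distillation objective in Eqs. (5)–(7) is given below. It assumes each model outputs a scalar fake-probability per image, so the KL term is computed between per-sample Bernoulli distributions; the exact KL direction, reduction, and value of λ are assumptions rather than the authors' settings.

```python
import torch
import torch.nn.functional as F


def bernoulli_kl(p, q, eps=1e-6):
    """KL(p || q) between per-sample Bernoulli distributions."""
    p, q = p.clamp(eps, 1 - eps), q.clamp(eps, 1 - eps)
    return p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))


def distillation_loss(s_stu, y_id, y_motion, t_hard, lam=0.5):
    """L_distill = lam * L_hard + lam * L_teacher, Eqs. (5)-(7)."""
    l_hard = F.binary_cross_entropy(s_stu, t_hard)                    # Eq. (5)
    l_teacher = (bernoulli_kl(s_stu, y_id.detach()) +
                 bernoulli_kl(s_stu, y_motion.detach())).mean()       # Eq. (6)
    return lam * l_hard + lam * l_teacher                             # Eq. (7)
```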

3.3 Confidence Calibration Loss (CCL)

As shown in Fig. 4, our model also suffers from overconfidence, so it is necessary to calibrate the confidence of our framework. We present two sub-functions to achieve this goal: Dynamic Confidence Weights and Label Smoothing Regularization.

Dynamic Confidence Weights (DCW). Given a training sample, we usually get different results from the two teachers, which implies different levels of confidence. We therefore propose the Dynamic Confidence Weight strategy, which is based on the outputs of the teachers. For instance, if one teacher's output for a training sample is \hat{y}_s, the probability of the other category is 1 - \hat{y}_s. We define the absolute difference (Eq. 8) between these two probability scores as the model confidence index, which represents the teacher's confidence in its prediction:

\lambda_s = |(1 - \hat{y}_s) - \hat{y}_s| = |2\hat{y}_s - 1|.    (8)

For each sample, we get two dynamic confidence weights \lambda_{id} and \lambda_{motion}:

\lambda_{id} = |2\hat{y}_{id} - 1|, \quad \lambda_{motion} = |2\hat{y}_{motion} - 1|.    (9)

Then we can calculate the losses of the samples:

L_{id} = \frac{1}{N}\sum_{1}^{N} \lambda_{id}\,\mathrm{KL}(\hat{s}_{stu}, \hat{y}_{id}),    (10)

L_{motion} = \frac{1}{N}\sum_{1}^{N} \lambda_{motion}\,\mathrm{KL}(\hat{s}_{stu}, \hat{y}_{motion}),    (11)

and the final loss function can be written as:

L_{teacher} = L_{id} + L_{motion}.    (12)


Label Smoothing Regularization (LSR). To further improve the generalization ability, we exploit the label smoothing method in our task. Based on our observations, forged images often contain some genuine content. For instance, both face swapped samples and face reenacted samples share common real content with genuine images: the background. Besides, face swapped samples also retain part of the same motion information as the genuine samples, and face reenacted samples have the same identity information as the genuine samples. Thus, it is reasonable to set the targets as a combination of the ground-truth labels and their converted outputs. Considering a smoothing parameter \epsilon and a sample with ground-truth label y, we replace the label distribution:

P_i = \begin{cases} 1, & \text{if } i = y \\ 0, & \text{if } i \neq y \end{cases} \;\Rightarrow\; P_i = \begin{cases} 1 - \epsilon, & \text{if } i = y \\ \epsilon, & \text{if } i \neq y \end{cases}.    (13)

Now we get a new label distribution, which is a mixture of the original ground-truth distribution and the converted distribution with weights 1 - \epsilon and \epsilon. We practically implement LSR with the cross entropy:

LS(\hat{p}, t) = \begin{cases} (1 - \epsilon) \cdot \left(-\sum_{i=0}^{1} t_i \log \hat{p}_i\right), & \text{if } i = y \\ \epsilon \cdot \left(-\sum_{i=0}^{1} t_i \log \hat{p}_i\right), & \text{if } i \neq y \end{cases}.    (14)

The final label smoothing loss is:

L_{smoothing} = \frac{1}{N}\sum_{1}^{N} LS(\hat{s}_{stu}, t_{hard}).    (15)

Finally, our total loss function is as follows:

L_{total} = (1 - \lambda_t) L_{hard} + \lambda_t L_{teacher} + \lambda_l L_{smoothing}.    (16)
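The full calibration objective in Eqs. (8)–(16) can be sketched as follows, reusing the Bernoulli-KL idea from the previous sketch. The label smoothing term is implemented here as standard soft-target binary cross-entropy, which may differ in detail from Eq. (14), and ε = 0.1 is an assumed value.

```python
import torch
import torch.nn.functional as F


def bernoulli_kl(p, q, eps=1e-6):
    """KL(p || q) between per-sample Bernoulli distributions."""
    p, q = p.clamp(eps, 1 - eps), q.clamp(eps, 1 - eps)
    return p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))


def ccl_total_loss(s_stu, y_id, y_motion, t_hard,
                   lambda_t=0.15, lambda_l=0.1, epsilon=0.1):
    # Dynamic confidence weights, Eqs. (8)-(9): |2*y_hat - 1| per sample.
    w_id = (2 * y_id - 1).abs().detach()
    w_motion = (2 * y_motion - 1).abs().detach()

    # Weighted teacher terms, Eqs. (10)-(12).
    l_teacher = (w_id * bernoulli_kl(s_stu, y_id)).mean() + \
                (w_motion * bernoulli_kl(s_stu, y_motion)).mean()

    # Hard-target BCE, Eq. (5).
    l_hard = F.binary_cross_entropy(s_stu, t_hard)

    # Label smoothing, Eqs. (13)-(15), written as soft-target BCE.
    t_smooth = t_hard * (1 - epsilon) + (1 - t_hard) * epsilon
    l_smoothing = F.binary_cross_entropy(s_stu, t_smooth)

    # Total loss, Eq. (16).
    return (1 - lambda_t) * l_hard + lambda_t * l_teacher + lambda_l * l_smoothing
```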

4 Experimental Results and Discussions

We evaluated the performance of our proposed CDC (i.e., DKD + CCL) against multiple state-of-the-art methods on three publicly available datasets. We show that our model achieves convincing performance under the in-dataset setting. To demonstrate the robust generalization ability of our model, we conducted cross-dataset evaluations by training the model only on the FF++ [40] dataset and testing it on unseen datasets. Ablation studies explore the contribution of each component in our framework, such as the impact of DKD and CCL.

4.1 Experimental Settings

Datasets. Following recent related works on face forgery detection [1,30,33,55], we conducted our experiments on the three benchmark public deepfake datasets: FaceForensics++ (FF++) [40], Celeb-DF [31], and Deepfake Detection Challenge (DFDC) [8]. FaceForensics++ (FF++) is the most widely used dataset in


many deepfake detection approaches. It contains 1,000 original real videos from the Internet, and each real video corresponds to 4 forged ones, manipulated by Deepfakes [11], NeuralTextures [47], FaceSwap [25], and Face2Face [48], respectively. Celeb-DF consists of high-quality forged celebrity videos created with an advanced synthesis process. The Deepfake Detection Challenge (DFDC) public test set was released for the Deepfake Detection Challenge; it contains many low-quality videos, which makes it exceptionally challenging.

Evaluation Metrics. We utilized the Accuracy rate (ACC) and the Area Under the Receiver Operating Characteristic Curve (AUC) as our evaluation metrics. (1) ACC: the accuracy rate is the most popular metric in classification tasks and is also applied to evaluate face forgery detection; we used ACC as one of the evaluation metrics in the experiments. (2) AUC: following Celeb-DF [31] and DFDC [8], we used AUC as the other evaluation metric, in particular for the cross-dataset evaluation.

Pre-processing and Training Setting. For all video frames, we used RetinaFace [7] to detect faces and saved the aligned facial images as inputs with a size of 224 × 224. We augmented our training data using Albumentations [4]. We set the hyper-parameters λt = 0.15, λl = 0.1 in Eq. (16). The optimizer was Adam [23] for end-to-end training of the complete model, with a learning rate of 1e-4 and a decay of 1e-6. We trained our models with a batch size of 32 for 100 epochs. All results were obtained on four NVIDIA RTX3090 GPUs.
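A minimal end-to-end training sketch under these settings is shown below; `student`, `id_teacher`, `motion_teacher`, and `train_set` are placeholders, the teachers are assumed frozen, interpreting the stated decay of 1e-6 as Adam weight decay is an assumption, and `ccl_total_loss` refers to the sketch after Eq. (16).

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4, weight_decay=1e-6)
loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=8)

for epoch in range(100):
    for images, labels in loader:
        with torch.no_grad():                        # the two teachers are kept frozen
            y_id = id_teacher(images)
            y_motion = motion_teacher(images)
        s_stu = student(images)
        t_hard = labels.float().view(-1, 1)
        loss = ccl_total_loss(s_stu, y_id, y_motion, t_hard)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```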

4.2 In-dataset Evaluation

Although our framework focuses on the generalization ability for the face forgery detection task, it still achieves competitive results in the in-dataset evaluation with FF++ [40] and DFDC [8]. Given a dataset, our model is trained on both genuine and deepfake data from the train split, and its performance is evaluated with the corresponding test split. Different from existing works, we compare the performance on two sub-datasets: the Face Swapping Dataset and the Face Reenactment Dataset. As shown in Tables 2 and 3 and Fig. 4, our framework is on par with existing state-of-the-art methods. Selim [41] achieved a better performance in terms of AUC than our framework because it was devised for the setting of the known DFDC dataset, and our model was not fine-tuned for the known dataset evaluation. Besides, the model confidence index distribution demonstrates that the model's confidence is also better calibrated without a clear loss of detection accuracy.

Table 2. Detection performance on unseen manipulations with our framework and others compared on the FF++ dataset.

Method | Face swapping ACC | Face swapping AUC | Face reenactment ACC | Face reenactment AUC
Face X-ray [28] | 98.97 | 99.18 | 98.73 | 98.99
I3D [5] | 97.85 | 98.32 | 94.65 | 95.17
MIL [51] | 97.54 | 97.21 | 98.19 | 98.27
Ours | 99.45 | 99.52 | 98.12 | 98.56

4.3 Cross-Dataset Evaluation

In real-world scenarios, the target of the forgery detection task is often an image generated by a new model from an unknown source. Successfully detecting such unseen images indicates the robustness of a model, and cross-dataset evaluation reflects its generalization ability well. Our experiments were designed to train our framework on the FF++ dataset and then test it on the Celeb-DF dataset to verify the model's generalization ability. As shown in Fig. 4 and Table 3, our model outperforms the state-of-the-art methods. Besides, the benefit of calibrating model confidence in the training phase is reflected in a more reasonable confidence index distribution and a higher AUC in the cross-dataset evaluation.

Table 3. Detection performance on the known dataset (DFDC, on the left) and the unseen dataset (Celeb-DF, on the right) with our framework and others compared in terms of AUC (%). Our method's performance is comparable to that of the best model (Selim [41]) in the DFDC competition and obtains the state-of-the-art performance in the cross-dataset evaluation. Results of some other methods are cited directly from [33].

Method | DFDC (AUC (%))
Capsule [37] | 53.3
Multi-task [36] | 53.6
HeadPose [53] | 55.9
Two-stream [57] | 61.4
VA-MLP [34] | 61.9
VA-LogReg | 66.2
MesoInception4 | 73.2
Meso4 [1] | 75.3
Xception-raw [40] | 49.9
Xception-c40 | 69.7
Xception-c23 | 72.2
FWA [30] | 72.7
DSP-FWA [30] | 75.5
Emotion [35] | 84.4
Selim [41] | 98.6
Ours | 97.9

Methods | FF++ | Celeb-DF
Two-stream [57] | 70.1 | 53.8
Multi-task [36] | 76.3 | 54.3
HeadPose [53] | 47.3 | 54.6
Meso4 [1] | 84.7 | 54.8
MesoInception4 | 83.0 | 53.6
VA-MLP [34] | 66.4 | 55.0
VA-LogReg | 78.0 | 55.1
FWA [30] | 80.1 | 56.9
Capsule [37] | 96.6 | 57.5
Xception-raw [40] | 99.7 | 48.2
Xception-c23 | 99.7 | 65.3
Xception-c40 | 95.5 | 65.5
DSP-FWA [30] | 93.0 | 64.6
F3-Net [38] | 98.1 | 65.2
Multi-attentional [55] | 99.8 | 67.4
Two-branch [33] | 93.2 | 73.4
Face X-ray [28] | 99.2 | 74.8
Ours | 99.1 | 75.1


Fig. 4. Model Confidence Index (Eq. 8) on the FF++ dataset (left) and the Celeb-DF dataset (right). We can conclude that CCL effectively mitigates the model's overconfidence problem. On the unseen data, unlike the baseline, which lacks confidence, the model using CCL can confidently predict the authenticity of an image.

4.4 Ablation Study

We performed comprehensive studies on the FF++ [40] and Celeb-DF [31] datasets to validate the design of our overall framework. In summary, compared with ablated variants that remove any component, the proposed framework shows higher detection accuracy across different face forgery models and in general authenticity detection. Specifically, we analyzed the performance of each component separately from the two major aspects of DKD and CCL. The results are shown below.

Effect of DKD. We first validate the backbone of the teachers and the student. For example, for the Motion-Teacher, we trained five variants of our framework: 1) ResNet-50; 2) Xception; 3) EfficientNet-B7; 4) EfficientNet-B4; 5) EfficientNet-B3.

Table 4. Ablation studies for backbone variants. Frame-level AUC (%) is reported.

Teacher backbone | FF++ | Celeb-DF | Params
ResNet-50 [16] | 98.62 | 66.83 | 24M
Xception [6] | 98.85 | 67.42 | 23M
EfficientNet-B7 [45] | 99.64 | 67.59 | 88M
EfficientNet-B4 [45] | 99.42 | 67.55 | 19M
EfficientNet-B3 [45] | 99.24 | 67.49 | 12M

Table 4 shows the results of different backbones. We can conclude that different backbones have little influence on the detection results under the in-dataset and cross-dataset settings. Considering the computational cost, we finally chose EfficientNet-B3 as our backbone. To validate the effectiveness of our Feature Transformer, we performed an ablation study on the framework, training four layer variants and four head variants. The results are presented in Table 5.


Table 5. Ablation study on the number of layers and heads of the Feature Transformer Encoder in our teacher framework. Frame-level AUC (%) is reported.

Teacher model | FF++ | Celeb-DF
B3+FTL × 12 | 99.32 | 67.27
B3+FTL × 3 | 99.12 | 67.58
B3+FTL × 2 | 98.99 | 67.62
B3+FTL × 1 | 98.89 | 67.72
B3+FTL × 1+HEAD × 12 | 98.89 | 67.72
B3+FTL × 1+HEAD × 16 | 98.95 | 67.70
B3+FTL × 1+HEAD × 8 | 98.84 | 68.02
B3+FTL × 1+HEAD × 6 | 98.78 | 68.98

We notice that 1) the Feature Transformer can improve the generalization capability; 2) too many layers of the standard encoder in the Feature Transformer cannot further improve the teacher's performance; and 3) an appropriate number of heads achieves the best result. With two teachers, we naturally chose the same structure for the student at first. However, the student with the same structure as the teachers was mediocre. We speculate that this structure is too small to accommodate the hypothesis spaces of the two teachers simultaneously. Thus, we adjusted the structural parameters of the student model. As shown in Table 6, the student with two layers of the Feature Transformer encoder and twelve heads in each layer achieves the best performance.

Effect of CCL. After verifying and carefully adjusting the structure of the teachers and the student, we confirmed the necessity of the various components of the calibration part and adjusted them. As shown in Table 7, we first validated the effectiveness of dynamic confidence weights and label smoothing. The results show that while the model generalization ability is improved, the in-dataset evaluation is almost unaffected.

Table 6. Ablation study on the number of heads in each layer of the Feature Transformer Encoder in our student framework. Frame-level AUC (%) is reported.

Student model (model only) | FF++ | Celeb-DF
B3+FTL × 1+HEAD × 6 | 98.97 | 68.78
B3+FTL × 1+HEAD × 8 | 98.92 | 69.84
B3+FTL × 1+HEAD × 10 | 98.88 | 69.82
B3+FTL × 1+HEAD × 12 | 98.94 | 70.68
B3+FTL × 1+HEAD × 14 | 99.01 | 70.32
B3+FTL × 2+HEAD × 12 | 99.13 | 71.62
B3+FTL × 4+HEAD × 12 | 99.17 | 68.73


Table 7. Ablation study of calibration components in our framework. Frame-level AUC (%) is reported.

Calibration components | FF++ | Celeb-DF
Backbone | 99.13 | 71.62
Backbone + DCW | 99.17 | 73.66
Backbone + LSR | 99.20 | 72.18
Backbone + DCW + LSR | 99.19 | 75.12

Secondly, we adjusted the proportion of each part in the loss function, namely the weight hyper-parameters λt and λl. As shown in Table 8, the performance increases as the proportion of the two calibration components increases. However, after the ratio reaches a certain value, the performance decreases again. There are two possible reasons. On the one hand, we speculate that an excessively high proportion of label smoothing causes the differences between targets to become too small. On the other hand, over-proportioned dynamic confidence weights will cause the student to learn in the wrong direction when both teachers make mistakes.

Table 8. Ablation study on different hyper-parameters λt and λl in our framework. Frame-level AUC (%) is reported.

λt | λl | FF++ | Celeb-DF
0.1 | 0.1 | 98.84 | 70.98
0.2 | 0.1 | 98.95 | 71.72
0.15 | 0.1 | 99.04 | 72.09
0.15 | 0.2 | 99.17 | 75.12
0.15 | 0.4 | 99.25 | 73.47

Due to the limited space, more ablation studies are available in supplementary materials.

5 Conclusions

In this paper, we have presented the CDC framework for improved generalization in face forgery detection. The proposed framework consists of three components: contrastive representation learning, dual-teacher distillation, and confidence calibration. Instead of treating forgery detection as a simple binary classification task, we calibrate the labels and model confidence to refine the targets. Through extensive experiments with contrastive representation distillation and confidence calibration, we demonstrated the superiority of our method compared with state-of-the-art methods in terms of AUC.


References 1. Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: MesoNet: a compact facial video forgery detection network. In: WIFS (2018) 2. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: NIPS (2014) 3. Busira, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: ACM KDD (2006) 4. Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information 11(2), 125 (2020) 5. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR (2017) 6. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR (2017) 7. Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., Zafeiriou, S.: RetinaFace: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641 (2019) 8. Dolhansky, B., et al.: The deepfake detection challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397 (2020) 9. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 10. Du, S., et al.: Agree to disagree: adaptive ensemble knowledge distillation in gradient space. In: NIPS (2020) 11. FaceSwapDevs: Deepfakes (2019). https://github.com/deepfakes/faceswap. Accessed 7 Nov 2021 12. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: ICML (2016) 13. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: ICML (2017) 14. Guo, J., et al.: Distilling object detectors via decoupled features. In: CVPR (2021) 15. Haliassos, A., Vougioukas, K., Petridis, S., Pantic, M.: Lips don’t lie: a generalisable and robust approach to face forgery detection. In: CVPR (2021) 16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 17. He, R., Hu, B.-G., Yuan, X.-T.: Robust discriminant analysis based on nonparametric maximum entropy. In: Zhou, Z.-H., Washio, T. (eds.) ACML 2009. LNCS (LNAI), vol. 5828, pp. 120–134. Springer, Heidelberg (2009). https://doi.org/10. 1007/978-3-642-05224-8 11 18. He, R., Hu, B., Yuan, X., Zheng, W.S.: Principal component analysis based on non-parametric maximum entropy. Neurocomputing 73(10–12), 1840–1852 (2010) 19. Hein, M., Andriushchenko, M., Bitterwolf, J.: Why relu networks yield highconfidence predictions far away from the training data and how to mitigate the problem. In: CVPR (2019) 20. Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: ICCV (2019) 21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS Workshop (2015) 22. Hsu, Y.C., Shen, Y., Jin, H., Kira, Z.: Generalized ODIN: detecting out-ofdistribution image without learning from out-of-distribution data. In: CVPR (2020)


23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 24. Komodakis, N., Zagoruyko, S.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: ICLR (2017) 25. Kowalski, M.: Faceswap (2018). https://github.com/MarekKowalski/FaceSwap. Accessed 7 Nov 2021 26. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: NIPS (2017) 27. Li, L., Bao, J., Yang, H., Chen, D., Wen, F.: Advancing high fidelity identity swapping for forgery detection. In: CVPR (2020) 28. Li, L., et al.: Face X-ray for more general face forgery detection. In: CVPR (2020) 29. Li, Y., Chang, M.C., Lyu, S.: In Ictu Oculi: exposing AI created fake videos by detecting eye blinking. In: WIFS Workshop (2018) 30. Li, Y., Lyu, S.: Exposing deepfake videos by detecting face warping artifacts. In: CVPR Workshops (2019) 31. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: a large-scale challenging dataset for deepfake forensics. In: CVPR (2020) 32. Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution image detection in neural networks. In: ICLR (2018) 33. Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P., AbdAlmageed, W.: Twobranch recurrent network for isolating deepfakes in videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 667–684. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6 39 34. Matern, F., Riess, C., Stamminger, M.: Exploiting visual artifacts to expose deepfakes and face manipulations. In: WACV Workshops (2019) 35. Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., Manocha, D.: Emotions don’t lie: an audio-visual deepfake detection method using affective cues. In: ACM MM (2020) 36. Nguyen, H.H., Fang, F., Yamagishi, J., Echizen, I.: Multi-task learning for detecting and segmenting manipulated facial images and videos. In: BTAS (2019) 37. Nguyen, H.H., Yamagishi, J., Echizen, I.: Capsule-forensics: using capsule networks to detect forged images and videos. In: ICASSP (2019) 38. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: face forgery detection by mining frequency-aware clues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 86–103. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2 6 39. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: hints for thin deep nets. In: ICLR (2015) 40. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: learning to detect manipulated facial images. In: ICCV (2019) 41. Seferbekov, S.: (2019). https://github.com/selimsef/dfdc deepfake challenge 42. Srinivas, S., Fleuret, F.: Knowledge transfer with jacobian matching. In: ICML (2018) 43. Stutz, D., Hein, M., Schiele, B.: Confidence-calibrated adversarial training: generalizing to unseen attacks. In: ICML (2020) 44. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016) 45. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)


46. Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M.: Neural voice puppetry: audio-driven facial reenactment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12361, pp. 716–731. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58517-4 42 47. Thies, J., Zollh¨ ofer, M., Nießner, M.: Deferred neural rendering: image synthesis using neural textures. ACM TOG 38(4), 1–12 (2019) 48. Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: real-time face capture and reenactment of RGB videos. In: CVPR (2016) 49. Thulasidasan, S., Chennupati, G., Bilmes, J.A., Bhattacharya, T., Michalak, S.: On mixup training: improved calibration and predictive uncertainty for deep neural networks. In: NIPS (2019) 50. Wang, Q., et al.: Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787 (2019) 51. Wang, X., Yan, Y., Tang, P., Bai, X., Liu, W.: Revisiting multiple instance neural networks. Pattern Recogn. 74, 15–24 (2018) 52. Wightman, R.: Pytorch image models (2019). https://github.com/rwightman/ pytorch-image-models 53. Yang, X., Li, Y., Lyu, S.: Exposing deep fakes using inconsistent head poses. In: ICASSP (2019) 54. Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In: CVPR (2017) 55. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional deepfake detection. In: CVPR (2021) 56. Zheng, Y., Bao, J., Chen, D., Zeng, M., Wen, F.: Exploring temporal coherence for more general video face forgery detection. In: ICCV (2021) 57. Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Two-stream neural networks for tampered face detection. In: CVPRW (2017)

Exposing Face Forgery Clues via Retinex-Based Image Enhancement

Han Chen1,2,3,4, Yuzhen Lin1,2,3,4, and Bin Li1,2,3,4(B)

1 Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen, China
2 Shenzhen Key Laboratory of Media Security, Shenzhen, China
3 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
4 Shenzhen University, Shenzhen 518060, China
{2016130205,linyuzhen2020}@email.szu.edu.cn, [email protected]

Abstract. Public concerns about deepfake face forgery are continually rising in recent years. Existing deepfake detection approaches typically use convolutional neural networks (CNNs) to mine subtle artifacts under high-quality forged faces. However, most CNN-based deepfake detectors tend to over-fit the content-specific color textures, and thus fail to generalize across different data sources, forgery methods, and/or post-processing operations. This motivates us to develop a method to expose the subtle forgery clues in RGB space. Herein, we propose to utilize multi-scale retinex-based enhancement of RGB space and compose a novel modality, named MSR, to complementarily capture the forgery traces. To take full advantage of the MSR information, we propose a two-stream network combined with salience-guided attention and feature re-weighted interaction modules. The salience-guided attention module guides the RGB feature extractor to concentrate more on forgery traces from an MSR perspective. The feature re-weighted interaction module implicitly learns the correlation between the two complementary modalities to promote feature learning for each other. Comprehensive experiments on several benchmarks show that our method outperforms the state-of-the-art face forgery detection methods in detecting severely compressed deepfakes. Besides, our method also shows superior performances in cross-dataset evaluation.

Keywords: Deepfake detection · Multi-scale retinex · Generalization

H. Chen and Y. Lin contributed equally to this work.

1 Introduction

Deepfake techniques [1–3] refer to a series of deep learning-based facial forgery techniques that can swap or reenact the face of one person in a video to another. While deepfake technology is very popular in the entertainment and film industries, it is also notorious for its unethical applications that can threaten politics,


economics, and personal privacy. Over the past few years, a large number of deepfake videos (called deepfakes) with potential harms have been uploaded to the Internet. Accordingly, countermeasures to identify deepfakes have become an urgent topic in social security.

To prevent malicious deepfake media from threatening the credibility of human society, deepfake detection (i.e., face forgery detection) has attracted widespread attention. Early works leverage handcrafted features (e.g., eye-blinking [4] or visual artifacts [5]) or semantic features extracted by universal CNNs [6,7] to identify real and fake images/videos. These methods achieve promising performance when the training and testing data are sampled from the same distribution. However, deepfakes in the real world differ from those contained in a training set in terms of the data source, forgery method, and post-processing. Due to these mismatched domain gaps, most deepfake detection methods suffer from severe performance drops in practical applications. Therefore, generalization capability is one of the major concerns for existing deepfake detection systems.

In general, two typical manners of addressing the generalization problem have been explored. On the one hand, some works train the deepfake detector with synthetic data that artificially simulates the forgery traces (e.g., visual resolution [8] or blending boundaries [9]), which encourages models to learn generic features for face forgery detection. However, these methods suffer a severe performance drop when facing post-processing distortions (e.g., video compression). On the other hand, some works utilize two-stream networks that introduce information from other domains, such as DCT [10] and SRM [11] features. These methods either simply concatenate RGB and other features at the end of the network or fuse them at a shallow layer, which rarely considers the relation and interaction between the additional information and regular color textures. This makes it difficult for them to fully utilize the additional information.

As pointed out in [12,13], the poor generalization of CNN-based deepfake detection can be attributed to the fact that deep CNN models tend to easily capture content-specific texture patterns in the RGB space. Thus, designing a deepfake detector with good generalization should consider suppressing the content-specific color textures and exposing discrepancies between forged and real regions. With this simple but powerful insight, we utilize a multi-scale retinex enhancement inspired by the illumination-reflection model [14,15] and compose a novel modality, named MSR, to complementarily capture the forgery traces. To take full advantage of the MSR information, we propose a two-stream network combined with salience-guided attention and feature re-weighted interaction modules. The salience-guided attention module guides the RGB feature extractor to concentrate more on forgery traces from the MSR perspective at multiple scales. The feature re-weighting module leverages the correlation between the two complementary modalities to promote feature learning for each other. Extensive experiments demonstrate that the proposed framework achieves consistent performance improvement compared to state-of-the-art methods. The main contributions of our work are summarized as follows.


– To expose the forgery traces, we perform retinex-based enhancement and propose a multi-scale retinex (MSR) feature as the complementary modality for RGB images.
– To take full advantage of MSR information, we devise a novel two-stream framework to collaboratively learn comprehensive representations. We design two functional modules to promote the correlation and interaction between the MSR and RGB components, i.e., the feature re-weighted interaction module and the salience-guided attention module.
– Comprehensive experiments are presented to reveal the robustness and generalization of our proposed method compared to several state-of-the-art competitors.

2 Related Works

2.1 Face Forgery Detection

The past four years have witnessed a wide variety of methods proposed for defending against the malicious usage of deepfakes. Early works focus on handcrafted features such as eye-blinking [4] and visual artifacts [5]. Due to the tremendous success of deep learning, convolutional neural networks (CNNs) [6,7,16] are widely used for the deepfake detection task and achieve better performance. As has been criticized, however, most of these methods suffer from severe over-fitting to the training data and cannot be effectively used in many practical scenarios.

There are methods trying to cope with the over-fitting issue of deepfake detectors. One effective approach is training models with synthetic data. For instance, Li et al. [8] noticed the quality gap between GAN-synthesized faces and natural faces, and proposed FWA (Face Warping Artifacts) to simulate fake images by blurring the facial regions of real images. BI (Blending Image) [9] and I2G (Inconsistency Image Generator) [17] were introduced to generate blended faces that simulate the blending artifacts of pristine image pairs with similar facial landmarks. In addition, it is also a common idea to use multi-modality (e.g., frequency domain) learning frameworks and auxiliary supervisions (e.g., forgery masks) to further mine heuristic forgery clues and improve the robustness of the model. Qian et al. [10] first employed global and local frequency information for the deepfake detection task. Luo et al. [11] employed an SRM filter that extracts high-frequency noise to guide RGB features. Wang et al. [18] amplified implicit local discrepancies from the RGB and frequency domains with a novel multi-modal contrastive learning framework. Kong et al. [19] introduced PRNU noise information to guide the RGB features, and proposed a novel two-stream network for not only identifying deepfakes but also localizing the forgery regions.

In this work, we utilize a novel MSR modality based on the Retinex theory [20] that exposes the forgery traces in RGB space. Furthermore, we devise a novel two-stream network that combines the MSR and RGB information to collaboratively learn comprehensive representations for detecting deepfakes.

Fig. 1. Overall framework of our proposed method. An RGB stream and an MSR stream run through the backbone network and interact via FRWI and SGA modules at the low-, mid- and high-level feature maps (spatial sizes 64 × 64, 32 × 32 and 16 × 16), followed by average pooling and a fully connected layer that predicts Real or Fake.

2.2 Retinex-Based Methods

Retinex theory [20] models the color perception of human vision on natural scenes. It assumes that an observed image can be decomposed into two components, i.e., reflectance and illumination, which can be mathematically formulated as:

S(x, y) = R(x, y) \otimes I(x, y),    (1)

where x and y are image pixel coordinates, R(x, y) represents reflectance, I(x, y) represents illumination, and \otimes represents element-wise multiplication.

Retinex-based methods are widely accepted among image enhancement methodologies [14,21] due to their robustness. Besides, retinex can also be viewed as a fundamental theory for the intrinsic image decomposition problem, which aims at disentangling an image into two independent components, such as structure and texture [22]. As for the image forensics task, Chen et al. [15] proposed to use retinex-based information for face anti-spoofing and achieved great generalization performance. In this work, we apply retinex-based information as a complement to the RGB modality, aiming to improve the generalization performance of deepfake detection.

3 Methodology

3.1 Overall Framework

In this work, we propose a novel two-stream framework that utilizes MSR information for face forgery detection. Specifically, the original RGB images are first converted to MSR images. Following that, to learn a comprehensive feature representation, the RGB and MSR information are integrated at three semantic levels (low, mid and high) through a two-stream network combined with salience-guided attention and feature re-weighted interaction modules.


Fig. 2. Illustration of multi-scale retinex (MSR) enhancement. Inspired by [15], we adopt an MSR-based algorithm for the deepfake detection task. (a) The MSR modality can attenuate content-specific colors and enhance forgery clues; red boxes mark blending traces that are hard to recognize in the RGB space but distinctive in the MSR space. (b) Pipeline of MSR extraction in this work (log-domain differences, clipping the histogram, and rescaling the clipped region). (Color figure online)

According to the resolutions of the output feature maps, we abstractly divide the whole network into three semantic layers. For a CNN, the low-resolution feature maps at the end of the network contain high-level semantic information, and vice versa. Thus, we denote these semantic layers as l \in \{low, mid, high\} for simplicity, where H^l, W^l, C^l are the height, width, and number of channels of the feature map of the corresponding layer. Formally, we define the feature maps from the RGB and MSR streams at the l-th layer of the network as F_R^l \in \mathbb{R}^{H^l \times W^l \times C^l} and F_M^l \in \mathbb{R}^{H^l \times W^l \times C^l}, respectively. The overall framework of our proposed approach is illustrated in Fig. 1, and several components are elaborated in more detail as follows.

3.2 Multi-scale Retinex Extraction

For the retinex theory, Eq. (1) is usually transformed into the logarithmic domain as:

\log[S(x, y)] = \log[R(x, y)] + \log[I(x, y)].    (2)


Fig. 3. The pipeline of our proposed method. We design a novel two-stream architecture, which aims to collaboratively learn comprehensive representations from the RGB and MSR information with Feature Re-weighted Interaction and Salience-Guided Attention modules.

We respectively represent \log[S(x, y)] and \log[R(x, y)] by s(x, y) and r(x, y) for convenience. Summarizing previous work, the illumination image can be generated from the source image using the center/surround retinex:

r(x, y) = s(x, y) - \log[S(x, y) * F(x, y)],    (3)

where F(x, y) denotes the surround function and * is the convolution operation; this is the so-called Single Scale Retinex (SSR) model. To overcome the strong dependency on the parameters of F(x, y), Jobson et al. [14] proposed a Multi-scale Retinex (MSR) model, which weights the outputs of several SSRs with different F(x, y). As shown in Fig. 2(a), the MSR modality can attenuate content-specific colors and enhance forgery clues. Thus, we utilize the multi-scale retinex enhancement and compose a novel modality, named MSR, for the deepfake detection task. As shown in Fig. 2(b), the pipeline of MSR extraction in this work can be formulated as:

r_{MSR}(x, y) = \sum_{i=1}^{3} w_i \left\{ \log[S(x, y)] - \log[S(x, y) * G_i(x, y)] \right\},    (4)

where G_i(x, y) denotes three Gaussian filters with \sigma_i = 10, 20, 30, and w_i are the corresponding weights. We also add color restoration operations to eliminate color shifts after the above MSR pipeline.
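A minimal sketch of the MSR extraction in Eq. (4) is given below, using OpenCV Gaussian blurs as the surround functions G_i. The equal weights w_i = 1/3, the clipping percentile, and the simple per-image rescaling are assumptions, and the color restoration step mentioned above is not reproduced.

```python
import cv2
import numpy as np


def multi_scale_retinex(img_bgr, sigmas=(10, 20, 30), clip=1.0):
    """MSR map of Eq. (4) with equal weights w_i = 1/len(sigmas)."""
    img = img_bgr.astype(np.float32)
    log_s = np.log1p(img)                                    # log[S(x, y)]
    msr = np.zeros_like(log_s)
    for sigma in sigmas:                                     # sum over the three scales
        blurred = cv2.GaussianBlur(img, (0, 0), sigma)       # S(x, y) * G_i(x, y)
        msr += (log_s - np.log1p(blurred)) / len(sigmas)
    # Clip the histogram tails and rescale the values to [0, 255].
    lo, hi = np.percentile(msr, clip), np.percentile(msr, 100 - clip)
    msr = np.clip((msr - lo) / (hi - lo + 1e-8), 0.0, 1.0)
    return (msr * 255).astype(np.uint8)
```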

3.3 Feature Re-weighted Interaction

To collaboratively align and integrate the feature maps from the two domains, we propose a novel Feature Re-weighted Interaction (FRWI) module inspired by the mechanism of SPADE [23]. The computation block of FRWI is described in Fig. 3(a).


Firstly, we concatenate the feature maps from the RGB and MSR streams as F^l_{Concat} = C(F^l_R, F^l_M), where C(·, ·) denotes feature concatenation along the channel dimension. In order to align and aggregate the feature maps from the two domains, we respectively generate the weight \gamma^l and bias \beta^l from F^l_{Concat} as:

\gamma^l = \delta(f_1(F^l_{Concat})),    (5)

\beta^l = f_2(F^l_{Concat}).    (6)

Specifically, \gamma^l is learned through a 3 × 3 convolution layer (denoted as f_1) followed by a sigmoid activation (denoted as \delta(·)), while \beta^l is learned through another 3 × 3 convolution layer (denoted as f_2). The outputs of the FRWI module can be formulated as:

\tilde{F}^l_R = \gamma^l \otimes F^l_R + \beta^l,    (7)

\tilde{F}^l_M = \gamma^l \otimes F^l_M + \beta^l,    (8)

where \tilde{F}^l_R and \tilde{F}^l_M represent the aligned feature maps of the RGB and MSR streams, respectively.
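A minimal PyTorch sketch of the FRWI module in Eqs. (5)–(8) is given below; the channel count is illustrative and this is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class FRWI(nn.Module):
    """Feature Re-weighted Interaction, Eqs. (5)-(8)."""

    def __init__(self, channels):
        super().__init__()
        self.f1 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)  # produces gamma
        self.f2 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)  # produces beta

    def forward(self, f_rgb, f_msr):
        f_cat = torch.cat([f_rgb, f_msr], dim=1)            # C(F_R^l, F_M^l)
        gamma = torch.sigmoid(self.f1(f_cat))               # Eq. (5)
        beta = self.f2(f_cat)                               # Eq. (6)
        return gamma * f_rgb + beta, gamma * f_msr + beta   # Eqs. (7)-(8)
```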

3.4 Salience-Guided Attention

Utilizing the forgery mask as auxiliary supervision is a common trick to improve the performance of face forgery detection. Inspired by this, we further adopt the forgery mask as a saliency map to highlight the manipulation traces. In particular, we introduce a spatial attention mechanism and design a Salience-Guided Attention (SGA) module, which guides feature learning in the RGB modality with MSR information at different semantic layers. The computation block of SGA is described in Fig. 3(b). Specifically, we predict the saliency map (denoted as \hat{M}^l) of the l-th semantic layer as:

\hat{M}^l = \delta(f^l_3(F^l_M)),    (9)

where f^l_3 represents a 1 × 1 convolution layer that reduces the channels of F^l_M to 1. We respectively set N = 64, 32, 16 for predicting \hat{M}^l in the low-, mid- and high-level layers. The output of the SGA module is formulated as:

F^l_{out} = F^l_R + \hat{M}^l \otimes F^l_R.    (10)
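Similarly, the SGA module in Eqs. (9)–(10) can be sketched as follows; keeping the predicted saliency map at the input feature resolution is an assumption made for simplicity.

```python
import torch
import torch.nn as nn


class SGA(nn.Module):
    """Salience-Guided Attention, Eqs. (9)-(10)."""

    def __init__(self, channels):
        super().__init__()
        self.f3 = nn.Conv2d(channels, 1, kernel_size=1)     # f_3^l: C channels -> 1

    def forward(self, f_rgb, f_msr):
        saliency = torch.sigmoid(self.f3(f_msr))            # Eq. (9)
        out = f_rgb + saliency * f_rgb                      # Eq. (10), saliency broadcast over channels
        return out, saliency                                # saliency is also supervised by L_SM
```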

3.5 Training Details and Loss Functions

We employ EfficientNet-B4 (EN-b4) as the backbone of our work. In order to capture more artifacts at higher resolutions, we change the stride of the first convolution layer of the backbone from 2 to 1. The whole end-to-end


training process involves the supervision of the binary classification and salience prediction tasks, and the overall loss function consists of two components:

L = L_{cls} + \lambda L_{SM},    (11)

where L_{cls} and L_{SM} represent the binary cross-entropy loss and the saliency map loss, respectively, and \lambda is the balance weight. Specifically, the binary cross-entropy classification loss L_{cls} is formulated as:

L_{cls} = -\left[ y_t \log y_p + (1 - y_t) \log(1 - y_p) \right],    (12)

where y_t and y_p represent the ground-truth label and the prediction logits, respectively. The saliency map loss L_{SM} consists of the \ell_2 losses of the three semantic layers, which can be formulated as:

L_{SM} = \sum_{l=1}^{3} \frac{1}{\Omega(M^l)} \left\| M^l - \hat{M}^l \right\|_2,    (13)

where Ω(·) represents the total number of elements and M^l is the ground-truth forgery mask of the l-th semantic layer. We employ the DSSIM [24] algorithm, which compares paired manipulated face images with their corresponding pristine source faces and thresholds the result, to obtain the original forgery mask M_{gt} with a size of 256 × 256. Besides, we use bi-linear interpolation to down-sample M_{gt} by {4×, 8×, 16×}, respectively obtaining the ground-truth saliency maps for the low-, mid- and high-level semantic layers (i.e., M^l, l = 1, 2, 3).
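A hedged sketch of the losses in Eqs. (11)–(13) is given below; the saliency term is approximated with mean squared error, which already normalizes by the number of elements, and the balance weight follows λ = 10 as stated in the experimental setup.

```python
import torch
import torch.nn.functional as F


def total_loss(logits, labels, pred_maps, gt_masks, lam=10.0):
    # Eq. (12): binary cross-entropy on the real/fake prediction.
    l_cls = F.binary_cross_entropy_with_logits(logits, labels)
    # Eq. (13), approximated with MSE over the three semantic levels
    # (MSE already divides by the number of elements Omega(M^l)).
    l_sm = sum(F.mse_loss(p, g) for p, g in zip(pred_maps, gt_masks))
    # Eq. (11): overall objective with balance weight lambda.
    return l_cls + lam * l_sm
```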

4 Experiments

4.1 Experimental Setup

Datasets and Pre-processing. In this paper, we mainly conduct experiments on the challenging FaceForensics++ (FF++) [7] dataset. FF++ contains 1000 Pristine (PT) videos (i.e., the real samples) and 4000 fake videos forged by four manipulation methods, i.e., Deepfakes (DF), Face2Face (F2F) [25], FaceSwap (FS), and NeuralTextures (NT) [26]. Besides, FF++ provides three quality levels for these videos, controlled by the constant rate quantization parameter (QP) used in compression: raw (QP = 0), HQ (high quality, QP = 23), and LQ (low quality, QP = 40). Considering deployment in real-world application scenarios, we conduct our experiments on both HQ and LQ videos. The samples were split into disjoint training, validation, and testing sets at the video level following the official protocol [7]. As for pre-processing, we utilized MTCNN [27] to detect and crop the face regions (enlarged by a factor of 1.3) from each video frame, and resized them to 256 × 256 as the input images.


Table 1. Detection performances (%) on the FF++ dataset. HQ and LQ denote the high-quality and low-quality data. * indicates the model was trained by us using the official code. The best results are in bold.

Methods | HQ ACC | HQ AUC | LQ ACC | LQ AUC
MesoNet [6] | 83.10 | – | 70.47 | –
Xception [7] | 95.73 | – | 81.00 | –
PRRNet [29] | 96.15 | – | 86.13 | –
SPSL [30] | 91.50 | 95.32 | 81.57 | 82.82
MTA [31] | 97.60 | 99.29 | 88.69 | 90.40
MC-LCR [18] | 97.89 | 99.65 | 88.07 | 90.28
D&L [19] | 98.40 | 99.77 | 84.84 | 87.10
Xception* | 96.06 | 98.89 | 86.35 | 90.25
RGB baseline | 96.08 | 98.98 | 86.36 | 91.00
Ours | 96.94 | 99.32 | 88.39 | 92.98

Implementation Details and Evaluation Metrics. The proposed framework is implemented in PyTorch on an NVIDIA Tesla A100 GPU (40 GB). We use EfficientNet-B4 (EN-b4) [28] as the backbone network, initialized with weights pre-trained on ImageNet. We employ an Adam optimizer with a cosine learning rate scheduler and set the training hyper-parameters as follows: mini-batch size 12, initial learning rate 0.0002, weight decay 0.05, β1 = 0.9, β2 = 0.999. We train for 500 epochs and set λ = 10 in the loss function. Following most existing face forgery detection methods, we mainly utilize the Accuracy rate (ACC) and the Area Under the Receiver Operating Characteristic Curve (AUC) as our evaluation metrics. We take AUC as the key evaluation metric and report frame-level performances.
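A minimal sketch of this training setup is shown below; `model` and `train_set` are placeholders, `total_loss` refers to the sketch after Eq. (13), and the assumption that the model returns both classification logits and multi-level saliency maps is for illustration only.

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999),
                             weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
loader = DataLoader(train_set, batch_size=12, shuffle=True, num_workers=8)

for epoch in range(500):
    for images, labels, masks in loader:             # masks: multi-level ground-truth saliency maps
        logits, pred_maps = model(images)            # assumed model outputs (see note above)
        loss = total_loss(logits, labels.float(), pred_maps, masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```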

4.2 Comparison with Previous Methods

In this part, we compare the proposed method with several state-of-the-art face forgery detection methods. Since we follow the official data splitting settings [7], we directly cite the results of previous methods from the corresponding papers. We also report the performance of an RGB baseline that removes the MSR stream and the proposed FRWI and SGA modules from our framework.

Detection on Different Video Qualities. In real-world situations, the videos spread on social media are usually compressed by popular algorithms such as H.264. Therefore, we evaluate our models on two video qualities, i.e., FF++(HQ) and FF++(LQ). Table 1 reports the comparison results with previous methods. For FF++(HQ), our method achieves comparably high performance (nearly 100% AUC) with respect to state-of-the-art methods. Detecting low-quality manipulated face images is a challenging task, as severe compression


Table 2. AUC (%) performance of binary detection on the FF++(LQ) dataset for each of the four manipulation methods. * indicates the model was trained by us using the official code. The best results are in bold.

Methods | DF | F2F | FS | NT
MesoNet [6] | 89.52 | 84.44 | 83.56 | 75.74
Xception [7] | 94.28 | 91.56 | 93.70 | 82.11
PRRNet [29] | 95.63 | 90.15 | 94.93 | 80.01
SPSL [30] | 93.48 | 86.02 | 92.26 | 76.78
MC-LCR [18] | 97.23 | 91.08 | 94.44 | 82.13
Xception* [7] | 95.60 | 89.76 | 93.33 | 78.87
RGB baseline | 95.67 | 88.48 | 93.50 | 80.10
Ours | 97.14 | 91.37 | 94.94 | 82.54

erases much detailed information from the original faces. For FF++(LQ), our method achieves the remarkable performance. Comparing the very recent work D&L [19], our method improves ACC and AUC in 3.55% and 5.88%, respectively. Furthermore, our method achieve better than comparing with the RGB baseline. It demonstrates that introducing the MSR information and joint learning with RGB features can effectively improve the detection performance (Table 2). Table 3. Recall rate (%) of multi-class classification on the FF++(LQ) dataset with each four manipulation methods. * indicate the model is trained by us implementing the official code. The best results are in bold. Methods

Methods        | DF    | F2F   | FS    | NT    | PT    | Avg
MesoNet [6]    | 62.45 | 40.37 | 28.89 | 63.35 | 40.93 | 47.20
Xception [7]   | 86.61 | 78.88 | 83.16 | 52.94 | 75.55 | 75.43
SPSL [30]      | 91.16 | 78.31 | 88.75 | 58.97 | 77.49 | 78.94
D&L [19]       | 95.28 | 86.96 | 93.24 | 71.66 | 63.80 | 82.19
Xception* [7]  | 93.13 | 79.24 | 85.93 | 66.74 | 67.25 | 78.46
RGB baseline   | 89.38 | 80.41 | 86.01 | 70.36 | 59.94 | 77.22
Ours           | 92.97 | 84.47 | 91.13 | 74.61 | 71.22 | 82.88

Detection on Specific Manipulation Methods. Although identifying the authenticity of input faces is of great importance, specifying the manipulation method is also a non-trivial problem. We evaluate the proposed method against the different manipulation methods in FF++(LQ). The models were trained and tested on FF++(LQ) for each manipulation method. Compared with previous detection methods, the proposed model achieves the best detection accuracy on all four manipulation methods. Furthermore, multi-class classification is more challenging and significant than binary classification. We further evaluate the proposed model on this five-way (pristine and the four respective manipulation methods) classification task.


Fig. 4. A t-SNE [32] visual comparison of embedding spaces between RGB baseline (left) and our proposed method (right) on FF++(LQ) in the multi-classification task.

As reported in Table 3, our method achieves the best average recall. As for the recent work D&L [19], although its performance in identifying DF, F2F, and FS is slightly better than ours, its accuracy in distinguishing NT and real samples is very poor. This reveals that D&L tends to over-fit deepfake samples with large forgery artifacts and produces a high false alarm rate. We also show the t-SNE [32] feature spaces of the FF++(LQ) data in the multi-class classification task. As shown in Fig. 4, the RGB baseline is more likely to confuse pristine faces with NT-based fake faces because this manipulation method modifies very few pixels in the spatial domain. In particular, NT-based images, in which only the lip region is slightly tampered, are very similar to pristine images and almost indistinguishable in the RGB domain. Conversely, our proposed method separates all classes in the embedding feature space. These improvements may benefit from the introduction of the MSR information. Cross-Dataset Evaluations. Most existing detection models suffer a significant performance drop when applied to unseen datasets. To comprehensively evaluate the generalization ability of the proposed model, we conduct extensive cross-dataset experiments. We train our model on the FF++/DF and Pristine (HQ) data and test it on the unseen DeepfakeTIMIT (DT-HQ/LQ) [33], CelebDF [24], DFD-HQ1, and DFDC-P [34] datasets. As shown in Table 4, the proposed method achieves the best generalization performance under all cross-dataset settings. When trained on FF++/DF and tested on CelebDF, a common protocol for measuring generalization, our method outperforms the other methods by at least 2.8% AUC. The cross-dataset experiments demonstrate that the proposed model achieves high generalization capability.

1 http://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
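For readers who wish to reproduce an embedding visualization like Fig. 4, the following is a minimal sketch using scikit-learn's t-SNE. The file names and label encoding are placeholders, not artifacts of the authors' code.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (N, D) penultimate-layer embeddings extracted from test frames.
# labels:   (N,) integer class ids, e.g. 0=Pristine, 1=DF, 2=F2F, 3=FS, 4=NT (placeholders).
features = np.load("embeddings.npy")   # hypothetical file produced by the detector
labels = np.load("labels.npy")

proj = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

for cls, name in enumerate(["Pristine", "DF", "F2F", "FS", "NT"]):
    mask = labels == cls
    plt.scatter(proj[mask, 0], proj[mask, 1], s=3, label=name)
plt.legend()
plt.savefig("tsne_ffpp_lq.png", dpi=200)
```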


Table 4. AUC (%) of cross-dataset evaluations. * indicates the model was trained by us using the official code. The results under the within-dataset setting are shown in gray and the best results are in bold.

Methods            | FF++/DF | DT-LQ | DT-HQ | DFD-HQ | DFDC-P | CelebDF
Xception [7]       | 99.70   | 95.90 | 94.40 | 85.90  | 72.20  | 65.50
Nirkin et al. [35] | 99.70   | -     | -     | -      | -      | 66.00
MC-LCR [18]        | 99.84   | -     | -     | -      | -      | 71.61
D&L [19]           | 99.85   | 56.08 | 47.20 | 76.23  | -      | 70.65
Xception* [7]      | 99.74   | 92.83 | 90.49 | 88.21  | 71.66  | 59.09
Ours               | 99.67   | 97.69 | 94.58 | 89.44  | 76.70  | 74.46

Table 5. Ablation studies on the FF++(LQ) dataset for identifying specific manipulation methods. The first row reports the AUC (%) of the full model; the other rows report the drop relative to it.

RGB | MSR | FRWI | SGA | DF     | F2F    | FS     | NT
√   | √   | √    | √   | 97.14  | 91.37  | 94.94  | 82.54
√   | √   | √    |     | ↓ 0.65 | ↓ 0.71 | ↓ 1.07 | ↓ 0.51
√   | √   |      | √   | ↓ 1.13 | ↓ 1.27 | ↓ 1.46 | ↓ 0.91
√   |     |      |     | ↓ 1.47 | ↓ 2.89 | ↓ 1.44 | ↓ 2.44
    | √   |      |     | ↓ 3.43 | ↓ 3.45 | ↓ 2.94 | ↓ 5.79

4.3 Ablation Studies and Visualizations

To explore the influence of each component, we evaluate the proposed model and its variants on identifying the specific manipulation methods on FF++(LQ). The results are presented in Table 5, from which we draw the following observations. In the single-stream setting, using only the RGB or MSR data as input leads to poor results. In the two-stream setting, combining the two original streams with the proposed FRWI or SGA improves the performance, which verifies that the MSR input is distinct from and complementary to the RGB data. The performance is further improved by adding both FRWI and SGA, reaching the peak with the overall proposed framework. This shows the effectiveness of each module: MSR exposes fake clues in the RGB space as supplementary information, and FRWI and SGA enhance this information by integrating the two streams. Furthermore, we present the visualization of MSR images and the saliency maps predicted by SGA in Fig. 5 and qualitatively analyze the results. We enlarge the predicted saliency maps M̂_l to the same size as the ground-truth forgery mask M_gt. We can observe that the MSR image attenuates content-specific colors and enhances forgery clues. As for the predicted saliency maps, they accurately localize the forgery region at all three semantic levels, demonstrating that SGA helps the RGB features capture subtle forgery clues with the help of multi-scale spatial guidance.
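Since the MSR modality is central to this ablation, the following is a minimal sketch of a multi-scale retinex computation in the spirit of Jobson et al. [14]. The scale values and weights are common defaults for illustration, not necessarily those used by the authors.

```python
import cv2
import numpy as np

def multi_scale_retinex(img, sigmas=(15, 80, 250), weights=None):
    """Multi-scale retinex: weighted sum of log(image) - log(Gaussian-smoothed image).

    img: HxWx3 uint8 image. sigmas/weights are illustrative defaults.
    """
    if weights is None:
        weights = [1.0 / len(sigmas)] * len(sigmas)
    img = img.astype(np.float64) + 1.0               # avoid log(0)
    msr = np.zeros_like(img)
    for sigma, w in zip(sigmas, weights):
        # Illumination estimate at this scale (ksize=(0,0) lets OpenCV derive it from sigma).
        illumination = cv2.GaussianBlur(img, (0, 0), sigma)
        msr += w * (np.log(img) - np.log(illumination + 1.0))
    # Rescale to [0, 255] for visualization or as network input.
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-8)
    return (msr * 255).astype(np.uint8)

# Usage: msr_face = multi_scale_retinex(cv2.imread("face.png"))
```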


Fig. 5. Qualitative results, showing the reference (Ref), RGB, and MSR images for pristine (PT), DF, F2F, FS, and NT samples. For the fake faces, we also display the corresponding real faces for reference.

5 Conclusions

In this work, we present a novel two-stream framework for deepfake detection. In particular, we utilize a multi-scale retinex enhancement inspired by the illumination-reflection model and compose a novel modality, named MSR, to capture complementary forgery traces. To take full advantage of the MSR information, we propose a two-stream network combined with multi-scale salience-guided attention and feature re-weighting modules. The multi-scale salience-guided attention module guides the RGB feature extractor to concentrate more on forgery traces from the MSR perspective at multiple scales. The feature re-weighting module leverages the correlation between the two complementary modalities to promote feature learning in each stream. Extensive experiments demonstrate that the proposed framework achieves consistent performance improvements over state-of-the-art methods. Future studies can extend this work to the video level so that multiple types of manipulated facial videos can be identified with a single general model. Acknowledgments. This work was supported in part by NSFC (Grant 61872244), Guangdong Basic and Applied Basic Research Foundation (Grant 2019B151502001), and Shenzhen R&D Program (Grant JCYJ20200109105008228).

References 1. Li, L., Bao, J., Yang, H., Chen, D., Wen, F.: Advancing high fidelity identity swapping for forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5074–5083 (2020) 2. Ngo, L.M., aan de Wiel, C., Karaoglu, S., Gevers, T.: Unified application of style transfer for face swapping and reenactment. In: Proceedings of the Asian Conference on Computer Vision (2020) 3. Wang, Y., et al.: HifiFace: 3D shape and semantic prior guided high fidelity face swapping. In: Twenty-Ninth International Joint Conference on Artificial Intelligence, vol. 2, pp. 1136–1142 (2021)


4. Li, Y., Chang, M., Lyu, S.: In Ictu Oculi: exposing AI created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7 (2018) 5. Matern, F., Riess, C., Stamminger, M.: Exploiting visual artifacts to expose deepfakes and face manipulations. In: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pp. 83–92 (2019) 6. Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: MesoNet: a compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7 (2018) 7. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Niessner, M.: FaceForensics++: learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1–11 (2019) 8. Li, Y., Lyu, S.: Exposing deepfake videos by detecting face warping artifacts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 46–52 (2019) 9. Li, L., et al.: Face X-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5001–5010 (2020) 10. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: face forgery detection by mining frequency-aware clues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 86–103. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2 6 11. Luo, Y., Zhang, Y., Yan, J., Liu, W.: Generalizing face forgery detection with highfrequency features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16317–16326 (2021) 12. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: International Conference on Learning Representations (2019) 13. Liu, Z., Qi, X., Torr., P.H.S.: Global texture enhancement for fake face detection in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8060–8069 (2020) 14. Jobson, D.J., Rahman, Z., Woodell, G.A.: A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Process. 6(7), 965–976 (1997) 15. Chen, H., Hu, G., Lei, Z., Chen, Y., Robertson, N.M., Li, S.Z.: Attention-based two-stream convolutional networks for face spoofing detection. IEEE Trans. Inf. Forensics Secur. 15, 578–593 (2020) 16. Han, J., Gevers, T.: MMD based discriminative learning for face forgery detection. In: Proceedings of the Asian Conference on Computer Vision (2020) 17. Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia, W.: Learning self-consistency for deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15023–15033 (2021) 18. Wang, G., Jiang, Q., Jin, X., Li, W., Cui, X.: MC-LCR: multimodal contrastive classification by locally correlated representations for effective face forgery detection. Knowl.-Based Syst. 250, 109114 (2022) 19. Kong, C., Chen, B., Li, H., Wang, S., Rocha, A., Kwong, S.: Detect and locate: exposing face manipulation by semantic- and noise-level telltales. IEEE Trans. Inf. Forensics Secur. 17, 1741–1756 (2022) 20. Land, E.H., McCann, J.J.: Lightness and retinex theory. J. Opt. Soc. Am. 61(1), 1–11 (1971)


21. Ma, Q., Wang, Y., Zeng, T.: Retinex-based variational framework for low-light image enhancement and denoising. IEEE Trans. Multimed. 1–9 (2022) 22. Jun, X., et al.: STAR: a structure and texture aware retinex model. IEEE Trans. Image Process. 29, 5022–5037 (2020) 23. Park, T., Liu, M.-Y., Wang, T.-C., Zhu, J.-Y.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346 (2019) 24. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: a large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3207–3216 (2020) 25. Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Niessner, M.: Face2Face: real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395 (2016) 26. Thies, J., Zollh¨ ofer, M., Nießner, M.: Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph. 38(4), 1–12 (2019) 27. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016) 28. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019) 29. Shang, Z., Xie, H., Zha, Z., Lingyun, Yu., Li, Y., Zhang, Y.: PRRNet: pixel-region relation network for face forgery detection. Pattern Recogn. 116, 107950 (2021) 30. Liu, H., et al.: Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 772–781 (2021) 31. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2185–2194 (2021) 32. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 1–27 (2008) 33. Korshunov, P., Marcel, S.: Vulnerability assessment and detection of deepfake videos. In: 2019 International Conference on Biometrics (ICB), pp. 1–6 (2019) 34. Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Ferrer, C.C.: The Deepfake Detection Challenge (DFDC) Preview Dataset. arXiv:1910.08854 (2019) 35. Nirkin, Y., Wolf, L., Keller, Y., Hassner, T.: DeepFake detection based on discrepancies between faces and their context. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 6111–6121 (2021)

GB-CosFace: Rethinking Softmax-Based Face Recognition from the Perspective of Open Set Classification

Mingqiang Chen(B), Lizhe Liu, Xiaohao Chen, and Siyu Zhu

Alibaba, Hangzhou, China
{mimingqiang.cmq,lizhe.llz,xiaohao.cxh,siting.zsy}@alibaba-inc.com

Abstract. State-of-the-art face recognition methods typically take the multi-classification pipeline and adopt the softmax-based loss for optimization. Although these methods have achieved great success, the softmax-based loss has a limitation from the perspective of open set classification: the multi-classification objective in the training phase does not strictly match the objective of open set classification testing. In this paper, we derive a new loss named global boundary CosFace (GB-CosFace). Our GB-CosFace introduces an adaptive global boundary to determine whether two face samples belong to the same identity, so that the optimization objective is aligned with the testing process from the perspective of open set classification. Meanwhile, since the loss formulation is derived from the softmax-based loss, GB-CosFace retains the excellent properties of the softmax-based loss, and CosFace is proved to be a special case of the proposed loss. We analyze and explain the proposed GB-CosFace geometrically. Comprehensive experiments on multiple face recognition benchmarks indicate that the proposed GB-CosFace outperforms current state-of-the-art face recognition losses in mainstream face recognition tasks. Compared to CosFace, GB-CosFace improves TAR by 5.30%, 0.70%, and 0.36% at FAR = 1e-6, 1e-5, and 1e-4 on the IJB-C benchmark.

1 Introduction

Research on the training objectives of face recognition (FR) has effectively improved the performance of deep-learning-based face recognition [1–4]. According to whether a proxy is used to represent a person's identity or a set of training samples, face recognition methods can be divided into proxy-free methods [5–16] and proxy-based methods [17–29]. The proxy-free methods directly compress the intra-class distance and expand the inter-class distance based on pair-wise learning [5–8] or triplet learning [9–13,15,16]. However, when dealing with a large amount of training data, the hard-mining operation which is crucial for

M. Chen and L. Liu—These authors contributed equally to this work.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_3.


Fig. 1. The difference of the objective between softmax-based training and the open set classification testing, where S(·) is the function to measure the distance between two samples, W1 and W2 are the prototypes of two identities respectively. In (a), X1 and X2 is the given training sample, m is the margin parameter. In (b), Xa1 and Xa2 are two testing samples of ID “a”, and Xb1 is a testing sample of ID “b”. ID “a” and “b” are not included in the training data.

proxy-free methods becomes extremely difficult. Recently, proxy-based method have achieved great success and shown advantages in big data training. Most of them take a softmax-based multi-classification pipeline and use cross-entropy loss as the optimization objective. In these methods, each identity in the training set is represented by a prototype, which is the weight vector of the final fully connected layer. We refer to this type of method as the softmax-based face recognition method in this paper. Despite the great success of softmax-based face recognition, this strategy has its limitation from the perspective of the open set classification [30–33]. As is shown in Fig. 1(a), the training objective of softmax-based multi-classification is to make the predicted probability of the target category larger than other categories. However, face recognition is an open set classification problem where the test category generally does not exist in the training category [1]. A typical requirement for a face recognition model is to determine whether two samples belong to the same identity by comparing the similarity between them with a global threshold T , as is shown in Fig. 1(b). The inconsistency of the objective of training and testing limits the performance. To reduce the impact of this inconsistency, current softmax-based face recognition methods have made various improvements to the training objective. One of the most vital improvements is to normalize the face features to the hypersphere for unified comparison [18,19]. Typically, the similarity between two samples is represented by the cosine similarity of their corresponding feature vectors. Large-margin-based methods [18,20,21,23] are proposed to further compress the intra-class distance and expand the inter-class distance. Recently, the dynamic schemes for the scale parameter [34] and the margin parameter [26,35] have been studied and further improved the model performance. From the perspective of training strategy, Lu et al. [36] proposed an optimal sampling strategy to address the inconsistency between the direction of gradient descent and optimizing the concerned evaluation metric. For face feature alignment, DAM [37] proposed a Discrepancy Alignment Metric, which introduces local inter-class differences for each face feature obtained from a pre-trained model, in the face verification stage. However, none of these methods consider introducing the global boundary in the testing process into the training objective.


In this paper, we propose a novel face recognition loss named global boundary CosFace (GB-CosFace), which resolves the above-mentioned inconsistency well and can be easily applied to end-to-end training on face recognition tasks. In our GB-CosFace loss, the training objective is aligned with the testing process by introducing a global boundary determined by the proposed adaptive boundary strategy. First, we compare the objective difference between the softmax-based loss and the face recognition testing process. Then, we abstract a reasonable training objective from the perspective of open set classification and derive an antetype of the proposed loss. Furthermore, we combine the excellent properties of softmax-based losses with the proposed antetype loss and derive the final GB-CosFace formulation. We further prove that CosFace [20,21] is a special case of the proposed GB-CosFace. Finally, we analyze and explain the proposed GB-CosFace geometrically. The contributions of this paper are summarized as follows.

– We propose the GB-CosFace loss for face recognition, which matches the testing objective of open set classification while inheriting the advantages of the softmax-based loss. To the best of our knowledge, this is the first work that introduces a global boundary into the training objective for face recognition.
– We analyze the difference and connection between GB-CosFace and general softmax-based losses, and give a reasonable geometric explanation.
– Our GB-CosFace clearly improves the performance of softmax-based face recognition (e.g., it improves TAR by 5.30%, 0.70%, and 0.36% at FAR = 1e-6, 1e-5, and 1e-4 on the IJB-C benchmark compared to CosFace).

2 Softmax-Based Face Recognition

To better understand the proposed GB-CosFace, this section reviews the general softmax-based face recognition pipeline.

2.1 Framework

The training framework of the general softmax-based face recognition is shown in Fig. 2. In this framework, each identity in the training set has its corresponding prototype. The prototypes are represented by the weight vectors of the final fully connected layer. Given a training sample, we call the prototype representing the identity of this sample “target prototype”, and call other prototypes “non-target prototypes”. After extracting face features from the backbone, the predicted scores which represent the similarity between the feature vector and each prototype are calculated through the final fully connected layer (FC layer). The similarity between the feature vector and the target prototype is called “target score”, and the other predicted scores are called “non-target scores”. Generally, the output feature vector and the prototypes are normalized to the unit hyper-sphere. Therefore, the predicted scores are usually represented by the cosine of the feature vector and the prototype. In training, the softmaxbased loss is adopted to optimize the backbone and the final FC layer through backpropagation.
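To make the framework concrete, a minimal PyTorch sketch of such a classification head is given below: the prototypes are the rows of a learnable weight matrix, and the predicted scores are the cosines of feature/prototype pairs. This is a generic illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassificationHead(nn.Module):
    """Final FC layer of softmax-based face recognition: one prototype per identity."""

    def __init__(self, feat_dim: int, num_identities: int):
        super().__init__()
        # Each row is the prototype (weight vector) of one training identity.
        self.prototypes = nn.Parameter(torch.randn(num_identities, feat_dim) * 0.01)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Normalize features and prototypes to the unit hyper-sphere, so the
        # predicted score for identity j is cos(theta_j).
        features = F.normalize(features, dim=1)
        prototypes = F.normalize(self.prototypes, dim=1)
        return features @ prototypes.t()   # (batch, num_identities) cosine scores

# head = CosineClassificationHead(feat_dim=512, num_identities=93000)
# cosines = head(backbone(images))  # fed into a softmax-based loss such as CosFace
```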


Fig. 2. The training framework of the general softmax-based face recognition.

2.2 Objective

For each iteration in n-class face recognition training, given a training sample and its label y, the general softmax-based loss is as follows:

L_S = -\log \frac{e^{s(\cos(\theta_y + m_\theta) - m_p)}}{e^{s(\cos(\theta_y + m_\theta) - m_p)} + \sum_i e^{s\cos\theta_i}}    (1)

where θ_y is the arc between the predicted feature vector and the target prototype, θ_i is the arc between the predicted feature vector and a non-target prototype, y is the index of the target identity, i is the index of the non-target identities, i ∈ [1, n] and i ≠ y. There are three hyper-parameters: the scale parameter s and the two margin parameters m_θ and m_p. We can reach several common softmax-based losses from Eq. 1. E.g., the normalized softmax loss is obtained if both m_θ and m_p are set to zero, while ArcFace and CosFace are obtained if we respectively set m_p and m_θ to 0. Softmax-based losses can be regarded as the smooth form of the following optimization objective O_S:

O_S = \mathrm{ReLU}\Big(\max_i(\cos\theta_i) - \big(\cos(\theta_y + m_\theta) - m_p\big)\Big)
    = \lim_{s \to +\infty} -\frac{1}{s}\log \frac{e^{s(\cos(\theta_y + m_\theta) - m_p)}}{e^{s(\cos(\theta_y + m_\theta) - m_p)} + \sum_{i=1, i \ne y}^{n} e^{s\cos\theta_i}}
    = \lim_{s \to +\infty} \frac{1}{s} L_S    (2)

where the SoftPlus function is used as a smooth form of the ReLU operator and \log\sum\exp(\cdot) is used as a smooth form of the \max(\cdot) operator. A more detailed derivation is included in the supplementary material. From this perspective, we can find that the training objective O_S constrains the target score to be larger than the maximum non-target score. The margin is introduced for a stricter constraint. However, this constraint is not completely consistent with the objective of the testing process. Based on Eq. 2, we can visualize the decision boundaries of the normalized softmax loss [19], CosFace [20,21], and ArcFace [23] under the binary classification case, as shown in Fig. 3(a)–(c).


Fig. 3. Decision boundaries of different loss functions under binary classification case. Figure (d) shows the expected decision boundary in the testing phase.

In the testing phase, a global threshold T of the cosine similarity needs to be fixed to determine whether two samples belong to the same person, as shown in Fig. 3(d). We can see that, even if a margin is added, the decision boundaries of softmax-based losses do not completely match the expected boundary for testing.
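For reference, the following is a minimal PyTorch sketch of the general softmax-based loss of Eq. 1, which covers the normalized softmax loss (m_theta = m_p = 0), ArcFace (m_p = 0), and CosFace (m_theta = 0) as special cases. It is a generic illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def general_softmax_loss(cosines, labels, s=32.0, m_theta=0.0, m_p=0.0):
    """Eq. 1: cross-entropy over scaled cosines with angular (m_theta) and cosine (m_p) margins.

    cosines: (batch, num_identities) cos(theta) scores from a normalized FC head.
    labels:  (batch,) target identity indices.
    """
    target_cos = cosines.gather(1, labels.unsqueeze(1)).clamp(-1 + 1e-7, 1 - 1e-7)
    # Apply the margins only to the target logit: cos(theta_y + m_theta) - m_p.
    target_logit = torch.cos(torch.acos(target_cos) + m_theta) - m_p
    logits = cosines.clone()
    logits.scatter_(1, labels.unsqueeze(1), target_logit)
    return F.cross_entropy(s * logits, labels)

# Normalized softmax: general_softmax_loss(cos, y)
# CosFace:            general_softmax_loss(cos, y, m_p=0.35)
# ArcFace:            general_softmax_loss(cos, y, m_theta=0.5)
```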

2.3 Properties

Current face recognition models do not directly apply O_S as the training objective. On the one hand, the max(·) operator only focuses on the maximum value, so gradients are backpropagated only to the target score and the maximum non-target score. On the other hand, if the argument of the ReLU function is less than 0, no gradient is backpropagated. As a smooth form of O_S, the softmax-based loss avoids these problems. The success of the softmax-based loss is due to its excellent properties.

Property 1. The gradients of the non-target scores are proportional to their softmax values. For the softmax-based loss, the backpropagated gradients are assigned to all non-target scores according to their softmax values. This property ensures that each non-target prototype plays a role in training, and hard non-target prototypes get more attention.

Property 2. The gradient of the target score and the sum of the gradients of all non-target scores have the same absolute value and opposite signs:

\frac{\partial L_S}{\partial \cos(\theta_y + m_\theta)} = -\sum_i \frac{\partial L_S}{\partial \cos\theta_i}    (3)

Softmax-based loss has balanced gradients for the target score and the nontarget scores. This property can maintain the stability of training and prevent the training process from falling into a local minimum. Considering the key role that these two properties play in face recognition training, we expect to inherit them in the loss design. In this paper, we add the consistency of training and testing to the loss design by introducing an adaptive global boundary. From the expected training objective, we derive our


GB-CosFace framework and prove compatibility with CosFace. This compatibility allows the proposed loss to inherit the excellent properties of the general softmax-based loss while solving the inconsistency between the training and testing objective.

3 GB-CosFace Framework

3.1 Antetype Formulation

Based on the face recognition testing process shown in Fig. 3(d), we propose to introduce a global threshold p_v as the boundary between the target score and the non-target scores. The target score is required to be larger than p_v while the maximum of the non-target scores is required to be less than p_v. Following this idea, we improve Eq. 2 as follows:

O_T = \mathrm{ReLU}\big(p_v - (p_y - m)\big), \qquad O_N = \mathrm{ReLU}\big(\max_i(p_i) - (p_v - m)\big)    (4)

where we divide the training objective into the target-score part O_T and the non-target-score part O_N. p_y is the target score, where p_y = cos θ_y; p_i is a non-target score, where p_i = cos θ_i; and m is the margin parameter introduced for stricter constraints. The training objective is to minimize O_T and O_N. Inspired by the success of the softmax-based loss and similar to Eq. 2, we take the smooth form of O_T and O_N as the antetype of the proposed loss:

L_{T1} = -\log \frac{e^{s(p_y - m)}}{e^{s(p_y - m)} + e^{s p_v}}, \qquad L_{N1} = -\log \frac{e^{s(p_v - m)}}{e^{s(p_v - m)} + \sum_i e^{s p_i}}    (5)

The losses for the target score and the non-target scores are denoted L_{T1} and L_{N1}, respectively. p_v is the global boundary hyper-parameter, which also acts as a "virtual score": for L_{T1}, p_v is a virtual non-target score, and for L_{N1}, p_v is a virtual target score. Since we take \log\sum\exp(\cdot) as the smooth form of \max(\cdot), the distribution of the gradients of the non-target scores inherits Property 1 (stated in Sect. 2.3) of the softmax-based loss. However, the proposed antetype introduces another problem: the setting of the hyper-parameter p_v. First, an inappropriate setting of p_v may cause a serious gradient imbalance problem. Since we separate the constraints on the target score and the non-target scores, the gradient balance between them is broken and the antetype loss no longer retains Property 2 (stated in Sect. 2.3). Second, considering the rapid growth of the exponential function and the amplification effect of the hyper-parameter s, the model is extremely sensitive to the choice of p_v. As can be seen in Fig. 4, a slight change in p_v can cause an order-of-magnitude difference between the gradients for the target score p_y and the non-target scores p_i. Therefore, an adaptive scheme for the global boundary is necessary.


Fig. 4. The ratio of the target gradient to the non-target gradient varies with pv under binary classification case using different py and pi . Hyper-parameter s and m are set to 32 and 0.15 respectively. Note that the ordinate is the base 10 logarithm of the ratio.

3.2 Adaptive Global Boundary

To control the gradient balance and adapt the global boundary to different training stages, we propose an adaptive global boundary method. We believe that an ideal global boundary should meet the following conditions: a) under this boundary setting, the gradients of the target score and the non-target scores should be roughly balanced from a global perspective; b) the global boundary should change slowly during the training process to keep the training stable while adapting to different training stages. Based on these two conditions, we make the following design. Gradient Balance Control. We define \hat{p}_v as the balanced threshold of the target score and the non-target scores, which satisfies \frac{\partial L_{T1}}{\partial p_y} = -\sum_i \frac{\partial L_{N1}}{\partial p_i}. Based on this condition, we reach the following form of \hat{p}_v:

\hat{p}_v = \frac{1}{2}\Big(p_y + \frac{1}{s}\log\sum_i e^{s p_i}\Big)    (6)
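The balance condition behind Eq. 6 can be checked with a short derivation under the definitions of Eq. 5; this derivation is reconstructed here for convenience, as the paper defers the details to its supplementary material.

\frac{\partial L_{T1}}{\partial p_y} = -\frac{s\, e^{s p_v}}{e^{s(p_y - m)} + e^{s p_v}}, \qquad
\sum_i \frac{\partial L_{N1}}{\partial p_i} = \frac{s\,\sum_i e^{s p_i}}{e^{s(p_v - m)} + \sum_i e^{s p_i}}

Setting the two magnitudes equal at p_v = \hat{p}_v and cross-multiplying,

e^{s \hat{p}_v}\, e^{s(\hat{p}_v - m)} = e^{s(p_y - m)} \sum_i e^{s p_i}
\;\Longrightarrow\;
e^{2 s \hat{p}_v} = e^{s p_y} \sum_i e^{s p_i}
\;\Longrightarrow\;
\hat{p}_v = \frac{1}{2}\Big(p_y + \frac{1}{s}\log\sum_i e^{s p_i}\Big).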

Ideally, for each iteration, to satisfy condition a) we would calculate \hat{p}_v for every sample in the dataset and take the mean value as the threshold p_v. For efficiency, we instead calculate the mean of \hat{p}_v over each batch and update a global estimate with a momentum update strategy:

p_{vg} = (1 - \gamma)\, p_{vg} + \gamma\, p_{vb}    (7)

where γ ∈ [0, 1] is the update rate and p_{vb} is the mean of \hat{p}_v in a batch. A small γ keeps p_{vg} stable; we empirically set γ to 0.01. This dynamic threshold makes the gradients balanced globally. However, for an individual sample the gradient imbalance can still be severe. Therefore, we set p_v to a weighted sum of p_{vg} and \hat{p}_v as follows:


p_v = \alpha\, p_{vg} + (1 - \alpha)\, \hat{p}_v    (8)

where α is a hyper-parameter and α ∈ [0, 1]. When α = 0, the gradients for the target score and the non-target scores are completely balanced. We can control the degree of gradient imbalance by adjusting α.

Compatible with CosFace. In Eq. 8, if we take α as 0, the proposed loss fully conforms to Property 1 and Property 2 (stated in Sect. 2.3) of the softmax-based loss. Through the following analysis, we can further find that CosFace [20,21] is a special case of the proposed loss when α = 0. The gradients of CosFace are calculated as follows:

G_{T\text{-}CosFace} = -G_{N\text{-}CosFace} = -\frac{s \sum_i e^{s p_i}}{e^{s(p_y - m)} + \sum_i e^{s p_i}} = -\frac{s\, e^{s p_n}}{e^{s(p_y - m)} + e^{s p_n}}    (9)

where the gradient for the target score is denoted G_{T-CosFace}, the sum of the gradients of the non-target scores is denoted G_{N-CosFace}, and p_n = \frac{1}{s}\log\sum_i e^{s p_i}. For the proposed loss, based on Eq. 5, we can obtain the gradient for the target score p_y (G_{T1}) and the sum of the gradients for the non-target scores p_i (G_{N1}) when α is set to 0:

G_{T1} = -G_{N1} = -\frac{s\, e^{\frac{1}{2}s p_n}}{e^{\frac{1}{2}s(p_y - 2m)} + e^{\frac{1}{2}s p_n}}    (10)

As the above equation shows, if we take p_v as \hat{p}_v (Eq. 6), the difference between the proposed loss (Eq. 5) and CosFace lies only in the margin and the scale. A more detailed proof is included in the supplementary material.

Final Loss. For formal unity with CosFace, we rewrite the proposed loss into the following form:

L_{GB\text{-}CosFace} = -\frac{1}{2}\log\frac{e^{2s(p_y - m)}}{e^{2s(p_y - m)} + e^{2s p_v}} - \frac{1}{2}\log\frac{e^{2s(p_v - m)}}{e^{2s(p_v - m)} + e^{2s p_n}}    (11)

where p_n = \frac{1}{s}\log\sum_i e^{s p_i}. The value of p_v follows Eq. 8. In training, p_v is a detached parameter which does not require gradients. This is the final form of the proposed GB-CosFace. Under this formulation, the hyper-parameter α controls the degree of gradient imbalance. If we set α to 0, the gradients for the target score and the non-target scores are balanced, and the proposed GB-CosFace is equivalent to CosFace with a margin of 2m and a scale of s.
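A minimal PyTorch sketch of Eq. 6–8 and the final loss of Eq. 11 is given below. It follows the formulas in this section but is an illustrative reconstruction, not the authors' released code; the boundary initialization value and details such as score clamping or distributed synchronization of p_vg are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBCosFaceLoss(nn.Module):
    """Sketch of GB-CosFace (Eq. 6-8, Eq. 11): softmax-based loss with an adaptive global boundary."""

    def __init__(self, s=32.0, m=0.16, alpha=0.15, gamma=0.01, pvg_init=0.6):
        super().__init__()
        self.s, self.m, self.alpha, self.gamma = s, m, alpha, gamma
        # Global boundary p_vg, updated by momentum (Eq. 7); initial value is an assumption.
        self.register_buffer("pvg", torch.tensor(pvg_init))

    def forward(self, cosines, labels):
        # cosines: (B, N) cosine similarities to all prototypes; labels: (B,) target indices.
        py = cosines.gather(1, labels.unsqueeze(1)).squeeze(1)             # target scores p_y
        non_target = cosines.scatter(1, labels.unsqueeze(1), float("-inf"))
        pn = torch.logsumexp(self.s * non_target, dim=1) / self.s          # p_n = (1/s) log sum exp(s p_i)

        pv_hat = 0.5 * (py + pn)                                           # Eq. 6, balanced threshold
        with torch.no_grad():                                              # Eq. 7, momentum update of p_vg
            self.pvg.mul_(1 - self.gamma).add_(self.gamma * pv_hat.mean())
        pv = (self.alpha * self.pvg + (1 - self.alpha) * pv_hat).detach()  # Eq. 8, detached boundary

        # Eq. 11 rewritten with softplus: -log sigmoid(2s(p_y - m - p_v)) and -log sigmoid(2s(p_v - m - p_n)).
        loss_t = F.softplus(2 * self.s * (pv - (py - self.m)))
        loss_n = F.softplus(2 * self.s * (pn - (pv - self.m)))
        return 0.5 * (loss_t + loss_n).mean()
```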

3.3 Geometric Analysis

To analyze the properties of the proposed loss and compare it with other softmax-based losses, we examine the loss boundaries in the binary classification case. The boundaries of ArcFace [23] and CosFace [20,21] are determined by Eq. 12 and Eq. 13, respectively:

|\arccos(P \cdot P_1) - \arccos(P \cdot P_2)| = m    (12)

|P \cdot P_1 - P \cdot P_2| = m    (13)

where P is the predicted normalized n-dimensional feature vector and n is the face feature dimension, P1 and P2 are the feature vectors of ID1 and ID2 respectively.

Fig. 5. Boundaries of the softmax-based losses and the proposed GB-CosFace loss. P1 and P2 are two points at a distance of 60◦ . The red line and blue line are the target boundaries for P1 and P2 respectively. For the normalized softmax loss, the boundaries for P1 and P2 are coincident and represented in black color.(Color figure online)

For the normalized softmax loss, the boundary is determined by Eq. 12 or Eq. 13 with a zero margin. We set the angle between vectors P_1 and P_2 to 60° and show the boundaries of the normalized softmax loss, ArcFace, and CosFace in the 3D spherical feature space in Fig. 5(a). The boundaries of the proposed GB-CosFace are determined according to Eq. 14:

|P \cdot P_1| = p_v + m, \qquad |P \cdot P_2| = p_v + m    (14)

According to Eq. 6 and Eq. 8, in the binary classification case, p_v can be represented as follows:


p_v = \alpha\, p_{vg} + (1 - \alpha)\,(P \cdot P_1 + P \cdot P_2)/2    (15)

We show the boundaries of the proposed GB-CosFace loss in Fig. 5(b), where pvg is fixed to 0.62 (a reasonable value according to the experiments in Sect. 4) and m is fixed to 0.15. In the face recognition problem, feature vectors of the same identity are expected to cluster together. However, by observing Fig. 5(a), we can find that the boundaries in the case of binary classification do not meet this expectation. Only the positions near the line from point P1 to point P2 on the sphere can be effectively constrained. Fortunately, the training set has far more than two identities. Ideally, the prototypes of different identities will be evenly distributed on the sphere. The feature vectors of the same identity will be constrained in all directions. But actually, it cannot be guaranteed that in the sparse high-dimensional spherical feature space, there are enough non-target prototypes evenly distributed around each training sample. The proposed loss LGB alleviates this problem by introducing a global boundary. As is shown in Fig. 5(b), when α = 0, the boundary is the same as CosFace. When α = 1, the boundary is a circle on the sphere centered on P1 or P2 with a fixed radius completely determined by pvg and the margin m. With the increase of α, the boundary is closer to the ideal open set classification objective. However, an excessively large α will cause blurring or even crossing of the boundaries between different identities.

Fig. 6. Visualization of the toy experiments on the proposed GB-CosFace. Different colors represent different identities.

To study the appropriate range of α, we conduct a toy experiment based on a seven-layer convolutional neural network on a small face recognition dataset containing ten identities. We set the feature dimension as three, and visualize the distribution of the feature vectors on the unit sphere under different α settings, as is shown in Fig. 6. The margin is fixed to 0.15 and α is adjusted from 0 to 1. When α = 0, our GB-CosFace is exactly the same as CosFace with the margin of 0.3, as indicated in Sect. 3.2. As α increases, e.g., α = 0.2, the feature vectors of the same identity are more concentrated as expected. The model performance will deteriorate if α is further increased, e.g., α = 0.8 or α = 1. The setting of α will be studied in detail in the Sect. 4.2.

4 Experiments

In this section, we verify our GB-CosFace on two important face tasks: face recognition and face clustering. Furthermore, we conduct ablation experiments to verify the proposed strategies and the settings of the hyper-parameters.

Dataset. We employ MS1MV3 [38], a refined version of MS1M [39], as our training set for all the following experiments. This is a large-scale face recognition dataset containing 5.1M face images of 93K celebrities. We use several popular benchmarks as the validation set, including LFW [40], CFP-FP [41], CPLFW [42], AgeDB-30 [43], and CALFW [44], and we use IJB-B [45] and IJB-C [46] as the testing sets.

Implementation Details. We use ResNet50 [47] and ResNet100 as the backbones for the following experiments. A BN-FC-BN structure is added after the last convolution layer to output 512-dimensional face feature vectors. For data pre-processing, all face images are resized to 112 × 112 and aligned using five facial points following recent papers [23,26]. Each RGB pixel is normalized to [−1, 1]. Random horizontal flip is the only data augmentation method employed in training. For optimization, we adopt the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 1e-4. We adopt a step learning rate decay strategy with an initial learning rate of 0.1. We train for 24 epochs and divide the learning rate by 10 at epochs 5, 10, 15, and 20. The training batch size is fixed to 512. Eight NVIDIA GPUs are employed for training. We fix the hyper-parameters s, m, α, and γ to 32, 0.16, 0.15, and 0.01, respectively, unless otherwise specified.
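The optimization schedule above can be written down compactly; the following is a small sketch under the stated hyper-parameters (SGD, momentum 0.9, weight decay 1e-4, initial learning rate 0.1, decayed by 10x at epochs 5, 10, 15, and 20). The model, loss, and data loader names are placeholders.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# model = ResNet backbone + BN-FC-BN head producing 512-d features and cosine scores
# against all identity prototypes (placeholder objects in this sketch).
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[5, 10, 15, 20], gamma=0.1)

for epoch in range(24):
    for images, labels in train_loader:          # batch size 512, random horizontal flip only
        cosines = model(images)
        loss = gb_cosface_loss(cosines, labels)  # s=32, m=0.16, alpha=0.15, gamma=0.01
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```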

4.1 Face Recognition

Analysis of Gradient Balance. Figure 7 shows the gradients and the global boundary p_v during the training process. Throughout training, the gradient of the target score G_T and the sum of the gradients of the non-target scores G_N follow the same convergence trend, and the values of G_T and G_N are approximately equal after 50k iterations. The trend of the global boundary parameter p_v during training is consistent with the gradients G_T and G_N, and it eventually converges to 0.62. This result shows that our adaptive global boundary strategy guarantees the stability of model training and keeps G_T and G_N balanced throughout the training process, which is consistent with the discussion in Sect. 3.2. Results on Validation Datasets. To compare with recent state-of-the-art competitors, we report results on several popular face recognition benchmarks, including LFW, CFP-FP, AgeDB-30, CALFW, and CPLFW. LFW focuses on unconstrained face verification.


Fig. 7. GT is the gradient of the target score, GN is the gradients sum of the non-target scores, and pv is the global boundary in Eq. 11.

The results are shown in Table 1. We achieve the best results on two of the five benchmarks. Even though both benchmarks are highly saturated, our GB-CosFace surpasses the recent methods on CFP-FP and CPLFW and achieves comparable results on the other three datasets. Results on IJB-B and IJB-C. IJB is one of the largest and most challenging benchmarks for evaluating unconstrained face recognition. IJB-B contains 1845 identities with 55025 frames and 7011 videos. IJB-C is an extension of IJB-B that contains about 3.5K identities from 138K face images and 11K face videos. The results are shown in Table 2. We achieve state-of-the-art results on IJB-B and IJB-C. Compared to CosFace, our GB-CosFace improves TAR by 6.07%, 4.07%, and 0.41% at FAR = 1e-6, 1e-5, and 1e-4 on IJB-B, and by 5.30%, 0.70%, and 0.36% at FAR = 1e-6, 1e-5, and 1e-4 on IJB-C.

Table 1. 1:1 verification accuracy on the LFW, CFP-FP, AgeDB-30, CALFW, and CPLFW datasets. Backbone network: ResNet100.

Method                           | LFW   | CFP-FP | AgeDB-30 | CALFW | CPLFW
CosFace [21] (CVPR18)            | 99.81 | 98.12  | 98.11    | 95.76 | 92.28
ArcFace [23] (CVPR19)            | 99.83 | 98.27  | 98.28    | 95.45 | 92.08
Sub-center ArcFace [48] (ECCV20) | 99.83 | 98.80  | 98.45    | -     | -
BroadFace [28] (ECCV20)          | 99.83 | 98.63  | 98.38    | 96.20 | 93.17
CurricularFace [25] (CVPR20)     | 99.80 | 98.37  | 98.32    | 96.20 | 93.13
URFace [49] (CVPR20)             | 99.78 | 98.64  | -        | -     | -
CosFace+SCF [50] (CVPR21)        | 99.80 | 98.59  | 98.26    | 96.18 | 93.26
MagFace [26] (CVPR21)            | 99.83 | 98.46  | 98.17    | 96.15 | 92.87
GB-CosFace                       | 99.80 | 98.84  | 98.31    | 96.15 | 93.55

Table 2. The face verification accuracy on IJB-B and IJB-C. We evaluate the TAR at FAR from 1e-4 to 1e-6. Backbone network: ResNet100.

Method                           | IJB-B 1e-6 | IJB-B 1e-5 | IJB-B 1e-4 | IJB-C 1e-6 | IJB-C 1e-5 | IJB-C 1e-4
CosFace [21] (CVPR18)            | 36.49 | 88.11 | 94.80 | 85.91 | 94.10 | 96.37
ArcFace [23] (CVPR19)            | 38.28 | 89.33 | 94.25 | 89.06 | 93.94 | 96.03
Sub-center ArcFace [48] (ECCV20) | -     | -     | 95.25 | -     | -     | 96.61
BroadFace [28] (ECCV20)          | 46.53 | 90.81 | 94.61 | 90.41 | 94.11 | 96.03
CurricularFace [25] (CVPR20)     | -     | -     | 94.80 | -     | -     | 96.10
GroupFace [51] (CVPR20)          | 52.12 | 91.24 | 94.93 | 89.28 | 94.53 | 96.26
CosFace+DAM [37] (ICCV21)        | -     | -     | 94.97 | -     | -     | 96.45
CosFace+SCF [50] (CVPR21)        | -     | 91.02 | 94.95 | -     | 94.78 | 96.22
MagFace [26] (CVPR21)            | 40.91 | 89.88 | 94.33 | 89.26 | 93.67 | 95.81
GB-CosFace                       | 42.56 | 92.18 | 95.21 | 91.21 | 94.80 | 96.73

4.2 Ablation Study

To analyze the effect of the adaptive boundary strategy and the setting of the hyper-parameter α, we train ResNet-50 networks on MS1MV3 with different settings and evaluate TAR@FAR=1e-4 on IJB-C.

Hyperparameter Setting. Compared to CosFace, we introduce another hyper-parameter α in Eq. 8. Since the settings of the scale parameter s and the margin parameter m have been studied in detail in previous works [20,21,23], we empirically set s = 32 and m = 0.16 (equivalent to m = 0.32 in CosFace) and focus on the setting of α. For a more detailed theoretical analysis, please refer to Sect. 3.3.

Table 3. Results of the proposed GB-CosFace under different settings of α.

Settings                              | IJB-C (TAR@FAR=1e-4)
FAR=1e-4, R50, adaptive pv, α = 0     | 96.10
FAR=1e-4, R50, adaptive pv, α = 0.05  | 96.15
FAR=1e-4, R50, adaptive pv, α = 0.15  | 96.24
FAR=1e-4, R50, adaptive pv, α = 0.25  | 96.35
FAR=1e-4, R50, adaptive pv, α = 0.35  | 96.33
FAR=1e-4, R50, adaptive pv, α = 0.60  | 96.08

We conduct the controlled experiment where the value of α is set from 0 to 0.6 and other parameters are fixed. The results are shown in Table 3. When the value of α gradually increases from 0 to 0.25, the performance of the model


gradually improves, and the model performs best with α = 0.25. After the value of α exceeds 0.35, the model performance clearly degenerates with increasing α. Overall, the model maintains relatively good results when α is between 0.15 and 0.35. This result is consistent with the previous discussion and the toy experiments in Sect. 3.3.

Effect of the Adaptive Boundary Strategy. To evaluate the effectiveness of the adaptive boundary strategy, we compare the fixed boundary strategy and the proposed adaptive boundary strategy of Sect. 3.2. We fix p_v in our GB-CosFace (Eq. 11) to different values and keep the other experimental settings the same as in Sect. 4.1. Since p_v converges to 0.62 in the experiment in Sect. 4.1, we choose p_v = 0.62 and additionally choose values near 0.62.

Table 4. Comparison of the proposed adaptive global boundary strategy and the fixed global boundary strategy.

Settings                              | IJB-C (TAR@FAR=1e-4)
FAR=1e-4, R50, α = 0.15, pv = 0.50    | 91.19
FAR=1e-4, R50, α = 0.15, pv = 0.58    | 96.27
FAR=1e-4, R50, α = 0.15, pv = 0.62    | 96.19
FAR=1e-4, R50, α = 0.15, pv = 0.66    | 96.17
FAR=1e-4, R50, α = 0.15, pv = 0.74    | 95.09
FAR=1e-4, R50, α = 0.15, adaptive pv  | 96.24

The results are shown in Table 4. With a fixed boundary, the model performs best when p_v = 0.58 and degrades rapidly as p_v changes, e.g., the TAR decreases to 91.19% when p_v = 0.50. Moreover, if we reduce p_v to 0.42 or increase it to 0.82, training does not converge. The adaptive boundary strategy performs very close to the best fixed-boundary result. This indicates that with a fixed boundary the model performance is sensitive to the value of p_v, and a very careful setting of p_v is required to obtain good results, whereas the adaptive global boundary strategy achieves similar performance without careful hyper-parameter tuning. This result is consistent with the previous discussion and the toy experiments in Sect. 3.3.

5 Conclusion

In this work, we discuss the inconsistency between the training objective of the softmax-based loss and the testing process of face recognition, and derive a new loss from the perspective of open set classification, called the global boundary CosFace(GB-CosFace). Our GB-CosFace aligns the training objective with


the face recognition testing process while inheriting the good properties of the softmax-based loss. Comprehensive experiments indicate that our GB-CosFace has an obvious improvement over general softmax-based losses.

References 1. Wang, M., Deng, W.: Deep face recognition: a survey. Neurocomputing 429, 215– 244 (2021) 2. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: closing the gap to humanlevel performance in face verification. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2014) 3. Sun, Y., Liang, D., Wang, X., Tang, X.: DeepID3: face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873 (2015) 4. Wen, Y., Zhang, K., Li, Z., Qiao, Yu.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi. org/10.1007/978-3-319-46478-7 31 5. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2005) 6. Sun, Y.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems (NIPS) (2014) 7. Ustinova, E., Lempitsky, V.: Learning deep embeddings with histogram loss. In: Advances in Neural Information Processing Systems (NIPS) (2016) 8. Han, C., Shan, S., Kan, M., Wu, S., Chen, X.: Face recognition with contrastive convolution. In: European Conference on Computer Vision (ECCV) (2018) 9. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 10. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Association (BMVC) (2015) 11. Ge, W.: Deep metric learning with hierarchical triplet loss. In: European Conference on Computer Vision (ECCV) (2018) 12. Zhong, Y., Deng, W.: Adversarial learning with margin-based triplet embedding regularization. In: International Conference on Computer Vision (ICCV) (2019) 13. Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 14. Rippel, O., Paluri, M., Dollar, P., Bourdev, L.: Metric learning with adaptive density discrimination. In: International Conference on Learning Representations (ICLR) (2015) 15. Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems (NIPS) (2016) 16. Wu, C.Y., Manmatha, R., Smola, A.J., Krahenbuhl, P.: Sampling matters in deep embedding learning. In: International Conference on Computer Vision (ICCV) (2017) 17. Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2014)


18. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: deep hypersphere embedding for face recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 19. Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: L2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM International Conference on Multimedia (ACM) (2017) 20. Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Signal Process. Lett. 25(7), 926–930 (2018) 21. Wang, H., et al.: Cosface: large margin cosine loss for deep face recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 22. Sun, Y., et al.: Circle loss: a unified perspective of pair similarity optimization. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 23. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: additive angular margin loss for deep face recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 24. Zheng, Y., Pal, D.K., Savvides, M.: Ring loss: convex feature normalization for face recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 25. Huang, Y., et al.: Curricularface: adaptive curriculum learning loss for deep face recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 26. Meng, Q., Zhao, S., Huang, Z., Zhou, F.: Magface: a universal representation for face recognition and quality assessment. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 27. Chen, B., Deng, W., Du, J.: Noisy softmax: improving the generalization ability of DCNN via postponing the early softmax saturation. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 28. Kim, Y., Park, W., Shin, J.: BroadFace: looking at tens of thousands of people at once for face recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 536–552. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58545-7 31 29. Deng, J., Guo, J., Yang, J., Lattas, A., Zafeiriou, S.: Variational prototype learning for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021) 30. Scheirer, W.J., de Rezende Rocha, A., Sapkota, A., Boult, T.E.: Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(7), 1757–1772 (2012) 31. Geng, C., Huang, S.J., Chen, S.: Recent advances in open set recognition: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3614–3631 (2020) 32. Ge, Z., Demyanov, S., Chen, Z., Garnavi, R.: Generative openmax for multi-class open set classification. arXiv preprint arXiv:1707.07418 (2017) 33. Yoshihashi, R., Shao, W., Kawakami, R., You, S., Iida, M., Naemura, T.: Classification-reconstruction learning for open-set recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 34. Zhang, X., Zhao, R., Qiao, Y., Wang, X., Li, H.: Adacos: adaptively scaling cosine logits for effectively learning deep face representations. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 35. Liu, H., Zhu, X., Lei, Z., Li, S.Z.: Adaptiveface: adaptive margin and sampling for face recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)


36. Lu, J., Xu, C., Zhang, W., Duan, L.Y., Mei, T.: Sampling wisely: deep image embedding by top-k precision optimization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019) 37. Liu, J., et al.: DAM: discrepancy alignment metric for face recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021) 38. Deng, J., Guo, J., Zhang, D., Deng, Y., Lu, X., Shi, S.: Lightweight face recognition challenge. In: International Conference on Computer Vision Workshops (ICCVW) (2019) 39. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-celeb-1M: a dataset and benchmark for large-scale face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 87–102. Springer, Cham (2016). https://doi.org/ 10.1007/978-3-319-46487-9 6 40. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition (2008) 41. Sengupta, S., Chen, J.C., Castillo, C., Patel, V.M., Chellappa, R., Jacobs, D.W.: Frontal to profile face verification in the wild. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2016) 42. Zheng, T., Deng, W.: Cross-pose LFW: a database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Technical report (2018) 43. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.: AgeDB: the first manually collected, in-the-wild age database. In: Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017) 44. Zheng, T., Deng, W., Hu, J.: Cross-age LFW: a database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197 (2017) 45. Whitelam, C., et al.: IARPA janus benchmark-b face dataset. In: Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017) 46. Maze, B., et al.: IARPA janus benchmark-C: face dataset and protocol. In: International Conference on Biometrics (ICB) (2018) 47. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 48. Deng, J., Guo, J., Liu, T., Gong, M., Zafeiriou, S.: Sub-center ArcFace: boosting face recognition by large-scale noisy web faces. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 741–757. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8 43 49. Shi, Y., Yu, X., Sohn, K., Chandraker, M., Jain, A.K.: Towards universal representation learning for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6817–6826 (2020) 50. Li, S., Xu, J., Xu, X., Shen, P., Li, S., Hooi, B.: Spherical confidence learning for face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021) 51. Kim, Y., Park, W., Roh, M.C., Shin, J.: Groupface: learning latent groups and constructing group-based representations for face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)

Learning Video-Independent Eye Contact Segmentation from In-the-Wild Videos

Tianyi Wu(B) and Yusuke Sugano

Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
{twu223,sugano}@iis.u-tokyo.ac.jp

Abstract. Human eye contact is a form of non-verbal communication and can have a great influence on social behavior. Since the location and size of the eye contact targets vary across different videos, learning a generic video-independent eye contact detector is still a challenging task. In this work, we address the task of one-way eye contact detection for videos in the wild. Our goal is to build a unified model that can identify when a person is looking at his gaze targets in an arbitrary input video. Considering that this requires time-series relative eye movement information, we propose to formulate the task as a temporal segmentation. Due to the scarcity of labeled training data, we further propose a gaze target discovery method to generate pseudo-labels for unlabeled videos, which allows us to train a generic eye contact segmentation model in an unsupervised way using in-the-wild videos. To evaluate our proposed approach, we manually annotated a test dataset consisting of 52 videos of human conversations. Experimental results show that our eye contact segmentation model outperforms the previous video-dependent eye contact detector and can achieve 71.88% framewise accuracy on our annotated test set. Our code and evaluation dataset are available at https:// github.com/ut-vision/Video-Independent-ECS.

Keywords: Human gaze · Eye contact · Video segmentation

1 Introduction

Human gaze and eye contact have strong social meaning and are considered key to understanding human dyadic interactions. Studies have shown that eye contact functions as a signaling mechanism [6,19], indicates interest and attention [2,22], and is related to certain psychiatric conditions [3,35,37]. The importance of human gaze in general has also been well recognized in the computer vision community, leading to a series of related research works on vision-based gaze estimation techniques [7,8,38,39,47,56,57,59-62]. Recent advances in vision-based gaze estimation have the potential to enable robust analyses of gaze behavior, including one-way eye contact. However, gaze estimation is still challenging in images with extreme head poses and lighting conditions, and it is not a trivial task to robustly detect eye contact in in-the-wild situations.


Fig. 1. Illustration of our proposed task of video-independent eye contact segmentation. Given a video sequence and a target person, the goal is to segment the video into fragments of the target person having and not having eye contact with his potential gaze targets.

Several previous studies have attempted to directly address the task of detecting one-way eye contact [9,36,46,54,58]. Given its binary classification nature, one-way eye contact detection can be a simpler task than regressing gaze directions. However, unconstrained eye contact detection remains a challenge. Fundamentally speaking, one-way eye contact detection is an ill-posed task if the gaze targets are not identified beforehand. Fully supervised approaches [9,46,54] necessarily result in environment-dependent models that cannot be applied to eye contact targets with different positions and sizes. Although there have been some works that address this task using unsupervised approaches that automatically detect the position of gaze targets relative to the camera [36,58], they still require a sufficient amount of unlabeled training data from the target environment. Learning a model that can detect one-way eye contact from arbitrary inputs independently of the environment is still a challenging task.

This work aims to address the task of unconstrained video-independent one-way eye contact detection. We aim to train a unified model that can be applied to arbitrary videos in the wild to obtain one-way eye contact moments of the target person without knowing his gaze targets beforehand. Since the position and size of the eye contact targets vary from video to video, it is nearly impossible to approach this task frame by frame. However, we humans can recognize when eye contact occurs from temporal eye movements, even when the target object is not visible in the scene. Inspired by this observation, we instead formulate the problem as a segmentation task utilizing the target person's temporal face appearance information from the input video (Fig. 1).

The remaining challenge here is that this approach requires a large amount of eye contact training data. It is undoubtedly difficult to manually annotate training videos covering a wide variety of environmental and lighting conditions. To train the eye contact segmentation model, we propose an unsupervised gaze target discovery method to generate eye contact pseudo-labels from noisy appearance-based gaze estimation results. Since online videos often contain camera movements and artificial edits, it is not a trivial task to locate eye contact targets relative to the camera.


Instead of making a strong assumption about a stationary camera, we assume only that the relative positions of the eye contact target and the person are fixed. Our method analyzes human gazes in the body coordinate system and treats high-density gaze point regions as positive samples. By applying our gaze target discovery method to the VoxCeleb2 dataset [12], we obtain a large-scale pseudo-labeled training dataset. Based on the initial pseudo-labels, our segmentation model is trained iteratively using the original facial features as input. We also manually annotated 52 videos with eye contact segmentation labels for evaluation, and experiments show that our approach can achieve 71.88% framewise accuracy on our test set and outperforms video-dependent baselines.

Our contributions are threefold. First, to the best of our knowledge, we are the first to formulate one-way eye contact detection as a segmentation task. This formulation allows us to naturally leverage the target person's face and gaze features temporally, leading to a video-independent eye contact detector that can be applied to arbitrary videos. Second, we propose a novel gaze target discovery method robust to videos in the wild. This leads to high-quality eye contact pseudo-labels that can be further used for both video-dependent eye contact detection and video-independent eye contact segmentation. Finally, we create and release a manually annotated evaluation dataset for eye contact segmentation based on the VoxCeleb2 dataset.

2 Related Work

2.1 Gaze Estimation and Analysis

Appearance-based Gaze Estimation. Appearance-based gaze estimation directly regresses the input image into the gaze direction and only requires an off-the-shelf camera. Although most works take the eye region as input [7,8,38,47,56,61], some demonstrated the advantage of using the full face as input [39,57,59,60,62,63]. If the eye region is hardly visible, possibly due to low resolution, extreme head poses, and poor lighting conditions, the full-face gaze model can be expected to infer the direction of the human gaze from the rest of the face. Since most gaze estimation datasets are collected in controlled laboratory settings [16,17,57], in-the-wild appearance-based gaze estimation remains a challenge. Some recent efforts have been made to address this issue by domain adaptation [29,44] or using synthetic data [39,55,64]. Note that eye contact detection is a different task from gaze estimation and is still difficult even with a perfect gaze estimator due to the unknown gaze target locations. The goal of this work is to improve the accuracy of eye contact detection on top of the state-of-the-art appearance-based gaze estimation method.

Gaze Following and Mutual Gaze Detection. First proposed by Recasens et al. [40], gaze following aims to estimate the object where the person gazes in an image [10,11,14,41,49,51,52]. Another line of work is mutual gaze detection, which aims to locate moments when two people are looking at each other.


Mutual gaze is an even stronger signal than one-way eye contact in reflecting the relationship between two people [31-33]. The problem of mutual gaze detection was first proposed by Marin-Jimenez et al. [30]. Our target differs from these tasks in two ways. First, we are interested in finding the moments in which one-way eye contact occurs with gaze targets, rather than determining the location of gaze targets on a frame-by-frame basis or detecting mutual gazes. Second, since our proposed method performs eye contact detection by segmenting the video based on the person's facial appearance, it can handle cases where the gaze targets are not visible in the scene. Although some gaze following works [10,11] can tell when gaze targets are outside the image, most of them are designed with the implicit assumption that the gaze target is included in the image.

2.2 Eye Contact Detection

Several previous works address the task of detecting eye contact specifically with the camera [9,46,54]. However, such pre-trained models cannot be applied to videos with the target person attending to gaze targets of different sizes and positions. Recent progress in appearance-based gaze estimation allows unsupervised detection of one-way eye contact in third-person videos using an off-the-shelf camera [36,58]. Zhang et al. assume a setting in which the camera is placed next to the gaze target and propose an unsupervised gaze target discovery method to locate the gaze target region relative to the camera [58]. They first run the appearance-based gaze estimator on all input sequences of human faces to get 3D gaze directions and then compute gaze points in the camera plane. This is followed by density-based clustering, which identifies high-density gaze point regions as the locations of gaze targets. Based on this idea, Müller et al. study eye contact detection in a group of 3-4 people having conversations [36]. Based on the assumption that all listeners would look at the speaker most of the time, they use audio clues to more accurately locate gaze targets in the camera plane.

There are two major limitations that make these two approaches inapplicable to videos in the wild. First, in many online videos, camera movements and jump cuts are common, making the camera coordinate system inconsistent throughout the video. Meanwhile, since gaze points are essentially the intersection between the gaze ray and the plane z = 0 in the camera coordinate system, the gaze points corresponding to gaze targets far from the camera will naturally be more scattered than those corresponding to gaze targets close to the camera when receiving the same amount of eye gazes. Consequently, density-based clustering would fail to identify potential gaze targets far from the camera on the camera plane. Second, both works only explored video-dependent eye contact detection, i.e., training one model for each test video. Instead, we study the feasibility of training a video-independent eye contact segmentation model that can be applied to different videos in the wild.


Fig. 2. An overview of our proposed unsupervised training pipeline for video-independent eye contact segmentation.

2.3 Action Segmentation

Action segmentation is the task of detecting and segmenting actions in a given video. Various research works have focused on designing the network architecture for the task. Singh et al. [45] propose to feed spatial-temporal video representations learned by a two-stream network to a bi-directional LSTM to capture dependencies between temporal fragments. Lea et al. [24] propose a network that performs temporal convolutions with the encoder-decoder architecture (ED-TCN). Recently, many works have tried to modify ED-TCN by introducing deformable convolutions [25], dilated residual layers [15], and dual dilated layers [28]. In this work, we adopt MS-TCN++ [28] as our segmentation model. Since labeling action classes and defining their temporal boundaries to create annotations for action segmentation can be difficult and costly, some work explored unsupervised action segmentation [23,27,43,48,50]. Based on the observation that similar actions tend to appear in a similar temporal position in a video, most of these works rely on learning framewise representations through the self-supervised task of time stamp prediction [23,27,48,50]. However, it is difficult to apply these methods directly to our scenario because eye contact is a sporadic activity that can occur randomly over time. We instead leverage human gaze information and deduce the gaze target position from gaze point statistics.

3 Proposed Method

Our proposed eye contact segmentation network takes as input a tracklet, i.e., a sequence of video frames in which the target person is tracked, and outputs framewise eye contact predictions. Formally, given a tracklet with I frames T = {I_i}_{i=1}^I, our objective is to train a model to produce framewise binary predictions of one-way eye contact Y = {y_i}_{i=1}^I for the person, where y_i ∈ [0, 1]^2 is a two-dimensional one-hot vector. We define gaze targets as physical targets with which the person interacts, such as the camera and another person in the conversation. These gaze targets do not have to be visible in the video. Figure 2 shows an overview of the proposed unsupervised approach to train the segmentation network. Our method consists of two stages, i.e., pseudo-label generation and iterative model training.


Fig. 3. An overview of the pseudo-label generation stage using our proposed method of gaze target discovery.

We start by collecting a large set of M unlabeled conversation videos V = {V_m}_{m=1}^M from online sources. For each video V_m with J_m frames, we first generate framewise pseudo-labels {p_j}_{j=1}^{J_m}, where p_j ∈ [0, 1]^2, using our proposed method of gaze target discovery. We also track the target person to extract a set of tracklets {T_k}_{k=1}^{K_m} from each video V_m. The pseudo-labels {p_j} are also split and assigned to each corresponding tracklet as a tracklet-wise set of pseudo-labels P_k. The collection of all tracklets T = {T_n}_{n=1}^N obtained from V, where N = Σ_m K_m, and their corresponding collection of pseudo-labels P = {P_n}_{n=1}^N are then used to train our eye contact segmentation model. Since our proposed gaze target discovery does not leverage temporal information, we propose an iterative training strategy that iteratively updates the pseudo-labels using the trained segmentation model that has learned rich temporal information. In the following sections, we describe details of our pseudo-label generation and iterative model training processes.
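To make the data flow of the two stages concrete, the following minimal Python sketch traces the pipeline end to end. The four callables passed in (generate_pseudo_labels, extract_tracklets, fit_segmentation_model, and the trained model itself) are hypothetical placeholders for the components detailed in the next sections, not the authors' released implementation.

```python
def train_video_independent_segmenter(videos,
                                      generate_pseudo_labels,  # video -> framewise labels {p_j}
                                      extract_tracklets,       # video -> list of (tracklet, frame indices)
                                      fit_segmentation_model,  # (tracklets, labels) -> trained model
                                      num_iterations=4):
    """Sketch of the two-stage unsupervised training pipeline."""
    tracklets, labels = [], []

    # Stage 1: pseudo-label generation via gaze target discovery, applied
    # independently to every unlabeled video, then split per tracklet.
    for video in videos:
        p = generate_pseudo_labels(video)
        for tracklet, frame_ids in extract_tracklets(video):
            tracklets.append(tracklet)
            labels.append([p[j] for j in frame_ids])

    # Stage 2: iterative model training. Iteration q > 1 is supervised by the
    # predictions of the model trained at iteration q - 1.
    model = None
    for _ in range(num_iterations):
        model = fit_segmentation_model(tracklets, labels)
        labels = [model(t) for t in tracklets]   # refresh pseudo-labels
    return model
```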

3.1 Pseudo Label Generation

We generate framewise pseudo-labels {p_j} for each training video V_m using our proposed gaze target discovery, which automatically locates the position of the gaze targets in the body coordinate system. An overview is illustrated in Fig. 3. In a nutshell, our proposed gaze target discovery obtains the target person's 3D gaze direction in the body coordinate system and identifies the high-density gaze regions. Since eye contact targets tend to form dense gaze clusters, these gaze regions are treated as potential gaze target locations. In the following sections, we give details of our proposed gaze target discovery and body pose estimation.

Gaze Target Discovery. For each frame I_j of the video V_m, we run an appearance-based gaze estimator to obtain the gaze vector g_j^c of the person of our interest in the camera coordinate system.


Through the data normalization process for gaze estimation [59], we also obtain the center of the face o_j^c as the origin of the gaze vector. We also perform body pose estimation to obtain the translation vector t_j and the rotation matrix R_j from the body coordinate system to the camera coordinate system. g_j^c and o_j^c are then transformed from the camera coordinate system to the body coordinate system through g_j^b = R_j^{-1} g_j^c and o_j^b = R_j^{-1}(o_j^c - t_j), where o_j^b indicates the face center in the body coordinate system. Therefore, o_j^b + β g_j^b defines the gaze ray in the body coordinate system. For each video V_m, we then compute a set of intersection points {u_j} between the gaze rays and a cylinder centered at the origin of the body coordinate system with a radius of r, and convert these 3D intersection points to 2D gaze points {v_j} on the cylinder plane by cutting the cylinder along the line (0, y, r) parameterized by y. We apply OPTICS clustering [1] on the set of 2D gaze points {v_j} and treat the identified clusters as eye contact regions for the m-th video V_m. We generate pseudo-labels {p_j} for V_m by treating all identified eye contact regions as positive samples and the others as negative samples.

Body Pose Estimation. We estimate the 3D body pose R and t of the target person based on 3D model fitting. We define our six-point 3D generic body model according to average human body statistics [34], consisting of the nose, neck, left and right shoulder, and left and right waist keypoints. This 3D body model is in a right-handed coordinate system in which the chest-facing direction is the negative Z-axis direction. Given the corresponding 2D keypoint locations from the target image, we can fit the 3D model using the P6P algorithm [26] assuming (virtual) camera parameters. This gives us the translation vector t and the rotation matrix R from the body coordinate system to the camera coordinate system.

To locate the six 2D body keypoints, we rely on a pre-trained 2D keypoint-based pose estimator. For each frame, the pose estimator takes the whole frame as input and outputs body keypoints, including the ones corresponding to our six-point body model. However, directly using the six 2D keypoints from the pose estimator can lead to inconsistent results across frames. This inconsistency arises from the fact that there normally exist at least two near-optimal solutions with similar reprojection errors due to the symmetric nature of the human body. Therefore, we introduce some subtle asymmetry into the 3D body model and stabilize the pose by introducing hard-coded keypoints, as illustrated in the lower part of Fig. 3. Specifically, we only use the three keypoints that correspond to the left shoulder s^l, the right shoulder s^r, and the neck n from the pose estimator. We calculate the other three keypoints, the left waist w^l, the right waist w^r, and the nose e, assuming that the target person is standing straight. The waist keypoints are defined as w^l = s^l + (0, d)^T and w^r = s^r + (0, d)^T, where d = ||s^r - s^l||_2 indicates the shoulder width, and the nose is defined as e = n - (0, αd)^T. We set the ratio α = 1.632 according to the same human body statistics [34].
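As an illustration of the gaze target discovery geometry, the sketch below maps one frame's gaze ray to a 2D point on the unrolled cylinder and then clusters a video's points with OPTICS. It assumes the cylinder axis is the vertical (y) axis of the body coordinate system, and the OPTICS min_samples value is an assumed hyperparameter not given in the paper; this is our reading of the procedure above, not the authors' code.

```python
import numpy as np
from sklearn.cluster import OPTICS

def gaze_point_on_cylinder(o_c, g_c, R, t, radius=1000.0):
    """Map one frame's gaze ray to a 2D point on the unrolled body cylinder.

    o_c, g_c: face center and gaze vector in camera coordinates; R, t map the
    body coordinate system to the camera coordinate system.
    """
    o_b = R.T @ (o_c - t)          # o^b = R^-1 (o^c - t); R^-1 == R.T for rotations
    g_b = R.T @ g_c                # g^b = R^-1 g^c

    # Intersect o^b + beta * g^b with the cylinder x^2 + z^2 = radius^2
    # (axis assumed to be the body frame's vertical y-axis).
    a = g_b[0] ** 2 + g_b[2] ** 2
    b = 2.0 * (o_b[0] * g_b[0] + o_b[2] * g_b[2])
    c = o_b[0] ** 2 + o_b[2] ** 2 - radius ** 2
    disc = b ** 2 - 4.0 * a * c
    if disc < 0:                   # gaze ray never reaches the cylinder
        return None
    beta = (-b + np.sqrt(disc)) / (2.0 * a)
    u = o_b + beta * g_b           # 3D intersection point u_j

    # Unroll by cutting the cylinder along the line (0, y, radius).
    theta = np.mod(np.arctan2(u[0], u[2]), 2.0 * np.pi)
    return np.array([radius * theta, u[1]])

def pseudo_labels_from_gaze_points(points, max_eps=8.0, min_samples=20):
    """OPTICS clustering of 2D gaze points; frames whose gaze point falls
    inside any identified cluster are treated as positive (eye contact)."""
    labels = OPTICS(max_eps=max_eps, min_samples=min_samples).fit(np.asarray(points)).labels_
    return labels >= 0
```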


Fig. 4. An overview of the iterative model training stage with an illustration for our network structure.

3.2 Iterative Model Training

By applying gaze target discovery to all videos in V and extracting tracklets, we obtain an initial training dataset consisting of T and P. However, our proposed gaze target discovery process generates noisy pseudo-labels. Consequently, the initial labels P obtained from appearance-based gaze estimation results can be oversegmented, violating the nature of eye contact segmentation. To address this problem, we propose to iteratively train new segmentation models supervised by the pseudo-labels P^q generated from the model trained in the previous iteration. Our segmentation model is trained to take low-level gaze CNN features as input and is expected to auto-correct the initial noisy pseudo-labels by attending to temporal information through the iterative process.

Figure 4 shows an overview of the iterative model training stage. Our segmentation network is based on the MS-TCN++ architecture [28] that takes as input a tracklet T_n and outputs its frame-wise eye contact predictions y_n. At iteration 1, the model is supervised with P^1 = P generated from the pseudo-label generator. In every subsequent iteration q > 1, the model is supervised with the improved pseudo-labels P^q produced by the model of the previous iteration and can thus learn richer temporal information. We repeat this process for a maximum of Q iterations.

Network Architecture. The structure of the segmentation network is illustrated in the lower part of Fig. 4. For each frame in the input tracklet T of length I, we extract and normalize the face image of the target person according to [59]. It is then fed to a pre-trained gaze estimation network based on the ResNet architecture [18] with 50 layers followed by a fully connected regression head. We extract the gaze feature vectors f_i ∈ R^2048 from the last layer and concatenate all gaze features {f_i} collected from the tracklet along the temporal dimension to form the gaze feature matrix F ∈ R^(2048×I), which is used as input to the segmentation block. Based on the MS-TCN++ architecture [28], the segmentation block consists of a prediction stage and several refinement stages stacked upon the prediction stage. The prediction stage has 11 dual dilated layers.


At the l-th dual dilated layer, the network performs two dilated convolutions with dilation rates 2^l and 2^(11-l). The features after the two dilated convolutions are first concatenated, so that the network is able to attend to long-range temporal information even at the early stage. This is then followed by ReLU activation, pointwise convolution, and skip connections. The refinement stages are similar to the prediction stage, except that the 11 dual dilated layers are replaced with 10 dilated residual layers.

For the loss function, we follow the original paper and use a combination of the cross-entropy loss L_cls and the truncated mean squared loss L_t-mse [28]. L_t-mse is defined based on the difference in the prediction between adjacent frames and encourages smoother model predictions. The loss for a single stage is defined as L_s = (1/|B|) Σ_{b∈B} (L_cls(p_b, y_b) + λ L_t-mse(y_b)), where B indicates the training batch and λ is a hyperparameter controlling the extent of L_t-mse. Finally, the overall loss for all stages is the sum of L_s at each stage.

Tracklet Extraction. To extract tracklets, we run face detection and face recognition on each video in V in the training dataset. A tracklet T_k is formed only if the IoU between the bounding boxes of the faces in consecutive frames is greater than τ_IoU. The tracklet is also disconnected if the neck and shoulder keypoints cannot be detected. We also discard short tracklets that do not exceed 4 s. Since pseudo-labels extracted from the gaze target discovery are person-specific, we also need to make sure that the training tracklets are extracted from the specific target person. To this end, we add a cosine similarity threshold τ_c when constructing training tracklets. Assuming that a set of reference face images is given, we compute the cosine similarity of the detected face with each of the reference faces. As long as one of them is greater than τ_c, the tracklet continues. Note that this threshold is not required during inference.
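To make the single-stage training objective above concrete, here is a minimal PyTorch sketch. The truncation threshold tau = 4 follows the original MS-TCN formulation and is an assumption here, since its value is not restated in this paper; the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def stage_loss(logits, targets, lam=0.15, tau=4.0):
    """Single-stage loss L_s: cross-entropy plus truncated MSE smoothing term.

    logits: (C, T) frame-wise class scores for one tracklet (C = 2 classes),
    targets: (T,) integer pseudo-labels in {0, 1}.
    """
    # Classification term L_cls over all frames.
    ce = F.cross_entropy(logits.transpose(0, 1), targets)

    # Truncated MSE L_t-mse: squared differences of adjacent frames'
    # log-probabilities, clamped at tau, which penalizes over-segmentation.
    log_probs = F.log_softmax(logits, dim=0)
    delta = (log_probs[:, 1:] - log_probs[:, :-1].detach()).abs()
    t_mse = delta.clamp(max=tau).pow(2).mean()

    return ce + lam * t_mse
```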

3.3 Implementation Details

The gaze estimation network is pre-trained on the ETH-XGaze dataset [57], and we follow their work to perform face detection, face normalization, and gaze estimation. We use OpenPose [5] as our 2D keypoint pose estimator. For face recognition, we use ArcFace [13], and set the cosine similarity threshold τ_c = 0.4 and the IoU threshold τ_IoU = 0.4. During the pseudo-label generation stage, we set the radius of the cylinder to r = 1000 mm. We also noticed that the hyperparameters of OPTICS can significantly affect the pseudo-label quality. In particular, we found that smaller max epsilon values should be given to longer videos, as the clustering space of long videos is much denser than that of short videos. To address this issue, we first set the max epsilon to 8 and perform OPTICS clustering. If no clusters are found, we continue to increment the max epsilon until at least one cluster is found. During training, we split the pseudo-labeled dataset into training and validation splits with a ratio of 8 : 2. To train MS-TCN++, we follow the suggested hyperparameters for the network architecture.


Fig. 5. Some example video frames randomly selected from our test dataset. We show the target celebrities in green and red bounding boxes, with green and red indicating positive and negative ground-truth labels. If visible, their gaze targets are also shown in orange bounding boxes. (Color figure online)

We use the Adam [21] optimizer and set the learning rate to 0.0005. We did not use dropout layers, and for the loss function, we set λ = 0.15. We empirically set the maximum number of iterations to Q = 4 and trained 50 epochs for each iteration.
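The adaptive max-epsilon scheme can be sketched as follows; the increment step and the retry cap are assumed values, since the text only states that max epsilon starts at 8 and is incremented until at least one cluster is found.

```python
import numpy as np
from sklearn.cluster import OPTICS

def cluster_with_adaptive_max_eps(points, start_eps=8.0, step=1.0, max_tries=50):
    """Increase OPTICS max_eps until at least one (non-noise) cluster is found."""
    points = np.asarray(points)
    eps = start_eps
    for _ in range(max_tries):
        labels = OPTICS(max_eps=eps).fit(points).labels_
        if (labels >= 0).any():      # label -1 marks noise; >= 0 is a cluster
            return labels, eps
        eps += step                  # no cluster yet: relax the density constraint
    return labels, eps
```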

4 Experiments

4.1 Dataset

We build our dataset upon the VoxCeleb2 dataset [12], which consists of celebrities being interviewed or speaking to the public. The trimmed VoxCeleb2 dataset includes short video clips only containing the head crops of celebrities and has been used in various tasks including talking head generation [20] and speaker identification [12]. Since our method also requires body keypoints detection, we opt to process the raw videos used in the VoxCeleb2 dataset. During training, we use randomly selected videos from celebrities id00501 to id09272. Due to the online availability and the computational cost of tracklet extraction, we used 5% of the entire dataset for training. After downloading the raw videos, we converted them to 25 fps. If the video is too short, there will not be enough gaze points to reliably identify high-density regions. On the other hand, if the video is too long, the possibility of the video containing multiple conversation sessions becomes high. Therefore, we only used videos of a duration between 2 and 12 min for training. In total, we pre-processed 4926 raw videos. During tracklet extraction, we obtain the sets of reference images from the VGGFace2 dataset [4]. For each celebrity, we only used the first 30 face images. This results in 49826 tracklets, which is equivalent to roughly 177.7 h. To evaluate our method, we manually annotate 52 videos (summing up to 3.6 h) from celebrity id00016 to id00500 using ELAN [53]. Each video is selected from a different celebrity to ensure identity diversity. We treat the host, camera, and other interviewees who interacted with the target celebrity as eye contact targets, meaning that there can be multiple eye contact targets in each video. Note that these gaze targets do not necessarily need to be visible in the scene.


Table 1. Comparison between our video-independent eye contact segmentation model and video-dependent baseline approaches.

Method                     Accuracy   Edit    F1@10   F1@25   F1@50
CameraPlane + SVM [58]     57.72%     46.83   38.37   30.08   20.48
CylinderPlane + SVM        68.52%     46.84   43.3    38.16   29.49
Ours                       71.88%     57.27   61.59   54.67   41.03

In addition, some test videos have poor lighting conditions, extreme head poses, and low resolution, making them difficult to segment. Figure 5 shows some randomly sampled frames from the test videos with the celebrity and gaze targets that we identified highlighted in the boxes. In total, 48.14% of the frames are labeled positive. We further examine the quality of the labels by visualizing gaze positions on the cylinder plane. If the gaze target looks too scattered on the cylinder plane, we re-annotate the video. After forming tracklets from the test videos, we get 510 tracklets (1.9 h in total) with 48.12% of the frames labeled positive. These test tracklets are used as our test set to evaluate our proposed eye contact segmentation model.

4.2 Evaluation

We compare our method with video-dependent baselines and show the effectiveness of our design choices through ablation studies. Following previous work on action segmentation, we evaluate our proposed method using framewise accuracy, segmental edit score, and F1 scores at overlap thresholds of 10%, 25%, and 50%, denoted by F1@{10, 25, 50}. Framewise accuracy measures the performance in the most straightforward way, but it favors longer tracklets and does not punish oversegmentation. The segmental edit score reflects the degree of oversegmentation, and the F1 scores measure the quality of the prediction.

Performance Comparison. Table 1 shows the comparison of our proposed unsupervised video-independent eye contact segmentation model with unsupervised video-dependent eye contact detection baselines. The first row (CameraPlane + SVM) is the re-implementation of the unsupervised method of Zhang et al. [58], which extracts framewise pseudo-labels in the camera plane and trains an SVM-based eye contact classifier on single-frame inputs. We did not set a safe margin around the positive cluster to filter out unconfident negative gaze points because we observed that it does not work well on in-the-wild videos, especially when there exist multiple gaze targets. The second row (CylinderPlane + SVM) applies our proposed cylinder plane gaze target discovery method, and an SVM is trained on the resulting pseudo-labels. Note that these two methods are video-dependent approaches, i.e., the models are trained specifically on the target video. Our proposed cylinder plane achieves 68.52% accuracy in video-dependent eye contact detection, outperforming the camera plane baseline by 10.8%, indicating the advantage of our pseudo-label generator on in-the-wild videos.


Table 2. Ablation results on our design choices.

Method                          Accuracy   Edit    F1@10   F1@25   F1@50
CameraPlane + Generic SVM       61.28%     44.34   39.32   32.44   22.46
CylinderPlane + Generic SVM     58.83%     39.86   34.26   27.19   18.08
CameraPlane + Ours              65.55%     46.27   44.88   38.56   27.85
Ours (1 iteration)              70.04%     42.21   43.97   38.55   27.42
Ours (2 iterations)             71.15%     50.60   55.09   48.80   36.54
Ours (3 iterations)             71.41%     52.24   57.26   50.72   37.45
Ours (4 iterations)             71.88%     57.27   61.59   54.67   41.03

The last row (Ours) corresponds to the proposed unsupervised eye contact segmentation approach with an iterative training strategy. Our method achieves 71.88% framewise accuracy, outperforming the video-dependent counterpart (CylinderPlane + SVM) by 3.36% and the camera-plane baseline by 14.16%. It also achieves the highest segmental edit score and F1 scores, indicating better segmentation quality.

Ablation Studies. We also conduct ablation studies to show the effectiveness of our design choices, i.e., our problem formulation, our proposed gaze target discovery method, and iterative training. The results are shown in Table 2. CameraPlane + Generic SVM is the baseline method that obtains pseudo-labels for tracklets using the gaze target discovery method of Zhang et al. [58] and trains an SVM-based generic video-independent eye contact detector. We choose to use an SVM as the classifier simply for comparison with the video-dependent baseline approaches, and the SVM is trained through online learning optimized by SGD. CylinderPlane + Generic SVM replaces the gaze target discovery method of Zhang et al. [58] with our proposed gaze target discovery method and, therefore, is also a video-independent eye contact detection approach. Although the baseline camera-plane approach outperforms our proposed cylinder-plane approach by 2.45%, both SVM-based detection models achieve framewise accuracy only slightly better than chance. CameraPlane + Ours obtains pseudo-labels on the camera plane but replaces the SVM with our segmentation model, making it a video-independent eye contact segmentation method. It outperforms its detection counterpart by 4.27%, showing the superiority of our problem formulation. Our proposed method without iterative training (Ours (1 iteration)) achieves 70.04% accuracy, showing the effectiveness of our proposed gaze target discovery when applied in the segmentation task setting. The last three rows of Table 2 show the effectiveness of the proposed iterative training strategy. From iteration 1 to iteration 2, we observe a decent improvement in model performance in terms of both accuracy and segmentation quality. There are also gradual model improvements in subsequent iterations, indicating its effectiveness. Our proposed method in the last iteration outperforms that of the first iteration by 1.84% in accuracy, and there is also a great improvement in the edit score and the F1 scores.
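For reference, framewise accuracy and the segmental F1@k metric can be computed as in the sketch below, which follows the standard action segmentation protocol (matching predicted and ground-truth segments by IoU at threshold k) and may differ in small details from the authors' exact evaluation code.

```python
import numpy as np

def to_segments(labels):
    """Split a framewise label sequence into (class, start, end) segments."""
    labels = np.asarray(labels).astype(int)
    cuts = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], cuts))
    ends = np.concatenate((cuts, [len(labels)]))
    return [(labels[s], s, e) for s, e in zip(starts, ends)]

def framewise_accuracy(pred, gt):
    return float((np.asarray(pred) == np.asarray(gt)).mean())

def f1_at_k(pred, gt, k=0.5):
    """Segmental F1 score at IoU overlap threshold k for one tracklet."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    matched = np.zeros(len(g_segs), dtype=bool)
    tp = fp = 0
    for c, s, e in p_segs:
        best_iou, best_gi = 0.0, -1
        for gi, (gc, gs, ge) in enumerate(g_segs):
            if gc != c:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_gi = iou, gi
        if best_gi >= 0 and best_iou >= k and not matched[best_gi]:
            tp += 1
            matched[best_gi] = True   # each ground-truth segment matches once
        else:
            fp += 1
    fn = int((~matched).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```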

Table 3. Evaluation of the model performance with different input length.

Length [s]   Accuracy   Edit    F1@10   F1@25   F1@50
> 4          71.88%     57.27   61.59   54.67   41.03
> 10         74.27%     62.31   63.93   57.45   43.76
> 30         77.31%     63.76   64.86   58.60   45.96
> 60         85.14%     68.96   76.65   73.57   65.01


Fig. 6. Accuracy of our segmentation model on different head pose ranges. We (a) visualize segmentation accuracy for each yaw-pitch interval bin, and (b) show examples of the model performance on test tracklets with a thumbnail image followed by model predictions and ground-truth annotations.

Detailed Analysis and Discussions. During both training and testing, we use tracklets longer than 4 s as input to the model. In Table 3, we vary the length of the input tracklet at test time and analyze its effect on the performance of the model. As can be seen, the model performance improves as we increase the input tracklet length. In particular, if the input length is more than one minute, our proposed method reaches an accuracy of 85.14%. The higher performance observed on longer tracklets can also be attributed to the performance of the model under different head poses. During the tracklet formation stage, we use face recognition to filter celebrity faces. Since face recognition performance is limited on profile faces, tracklets containing extreme head poses tend to be much shorter than those containing only frontal faces. During training, the lack of long tracklets with profile faces prevents the model from modeling long-term eye contact dependencies, leading to degraded performance on tracklets that contain mainly profile faces. The longer tracklets in the test set also contain mainly frontal faces, resulting in higher accuracy. In Fig. 6a, we visualize the accuracy of our trained model conditioned on different head poses. We use HopeNet [42] to extract head poses of celebrity faces in each frame of the test tracklets and compute framewise accuracy for all yaw-pitch intervals.


We can observe that our model works best when the pitch is between −40° and −60°, i.e., looking downward. It can also achieve decent performance when both yaw and pitch are around 0°. However, when the yaw is lower than −60° or higher than 60°, our model is even worse than random chance. In Fig. 6b, we show some qualitative visualizations. We randomly present test tracklets with frontal faces in the first row and test tracklets with profile faces in the second. Although our model has decent performance on frontal-face tracklets, the predictions on profile-face tracklets seem almost random. We also present test tracklets longer than one minute in the third row, and all of these long tracklets contain frontal faces. Consequently, we can observe mainly fine-grained eye contact predictions in these examples.

Another limitation of our proposed approach lies in our full-face appearance-based gaze estimator. Fundamentally, the line of work on full-face appearance-based gaze estimation regresses the face images into gaze directions in the normalized camera coordinate system. Consequently, the gaze features used as input to the segmentation model only have semantic meaning in the normalized space, but not in the real camera space. In addition, a full-face appearance-based estimator trained on gaze data collected in controlled settings tends to have limited performance on unconstrained images, especially when the person has extreme head poses unseen in the training dataset. This may also be a reason for the low performance of the model on profile faces.

Finally, our approach cannot handle the cases of moving gaze targets and humans. Our gaze target discovery assumes fixed relative positions between the eye contact target and the person and would consequently give incorrect pseudo-labels in such cases. Eye contact detection with moving gaze targets is a more challenging task than that with stable gaze targets. We argue that in this case gaze information alone will not be sufficient for the model. Information about the spatial relationship between the person and gaze targets should be introduced.

5 Conclusion

In this paper, we proposed and addressed the task of video-independent one-way eye contact segmentation for videos in the wild. We proposed a novel method of gaze target discovery to obtain frame-wise eye contact labels in unconstrained videos, which allows us to train the segmentation model in an unsupervised way. By manually annotating a test dataset consisting of 52 videos for evaluation, we showed that our proposed method leads to a video-independent eye contact detector that outperforms previous video-dependent approaches and is especially robust on non-profile face tracklets.

Acknowledgement. This work was supported by JST CREST Grant Number JPMJCR1781.


References 1. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: Optics: ordering points to identify the clustering structure. SIGMOD Rec. 28(2), 49–60 (1999) 2. Argyle, M., Dean, J.E.: Eye-contact, distance and affiliation. Sociometry 28, 289– 304 (1965) 3. Broz, F., Lehmann, H., Nehaniv, C.L., Dautenhahn, K.: Mutual gaze, personality, and familiarity: dual eye-tracking during conversation. In: IEEE International Symposium on Robot and Human Interactive Communication, pp. 858–864 (2012) 4. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. In: IEEE International Conference on Automatic Face & Gesture Recognition, pp. 67–74 (2018). https://doi.org/10.1109/FG. 2018.00020 5. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1302–1310 (2017). https://doi.org/10.1109/CVPR.2017. 143 6. Cañigueral, R., de C. Hamilton, A.F.: The role of eye gaze during natural social interactions in typical and autistic people. Front. Psychol. 10, 560 (2019). https:// doi.org/10.3389/fpsyg.2019.00560 7. Cheng, Y., Lu, F., Zhang, X.: Appearance-based gaze estimation via evaluationguided asymmetric regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 105–121. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_7 8. Cheng, Y., Zhang, X., Lu, F., Sato, Y.: Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 29, 5259–5272 (2020) 9. Chong, E., et al.: Detecting gaze towards eyes in natural social interactions and its use in child assessment. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1(3), 1–20 (2017) 10. Chong, E., Ruiz, N., Wang, Y., Zhang, Y., Rozga, A., Rehg, J.M.: Connecting gaze, scene, and attention: generalized attention estimation via joint modeling of gaze and scene saliency. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 397–412. Springer, Cham (2018). https://doi. org/10.1007/978-3-030-01228-1_24 11. Chong, E., Wang, Y., Ruiz, N., Rehg, J.M.: Detecting attended visual targets in video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5396–5406 (2020) 12. Chung, J.S., Nagrani, A., Zisserman, A.: VoxCeleb2: deep speaker recognition. In: Proceedings of Interspeech, pp. 1086–1090 (2018) 13. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4685–4694 (2019) 14. Fang, Y., et al.: Dual attention guided gaze target detection in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11390–11399 (2021) 15. Farha, Y.A., Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)


16. Fischer, T., Chang, H.J., Demiris, Y.: RT-GENE: real-time eye gaze estimation in natural environments. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 339–357. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_21 17. Funes Mora, K.A., Monay, F., Odobez, J.M.: EYEDIAP: a database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In: Proceedings of the Symposium on Eye Tracking Research and Applications, pp. 255–258 (2014) 18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90 19. Ho, S., Foulsham, T., Kingstone, A.: Speaking and listening with the eyes: gaze signaling during dyadic interactions. PloS One 10(8), e0136905 (2015) 20. Chung, J.S., Jamaludin, A., Zisserman, A.: You said that? In: Proceedings of the British Machine Vision Conference (BMVC), pp. 109.1–109.12 (2017) 21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015) 22. Kleinke, C.L.: Gaze and eye contact: a research review. Psychol. Bull. 100(1), 78–100 (1986) 23. Kukleva, A., Kuehne, H., Sener, F., Gall, J.: Unsupervised learning of action classes with continuous temporal embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12066–12074 (2019) 24. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 156–165 (2017) 25. Lei, P., Todorovic, S.: Temporal deformable residual networks for action segmentation in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6742–6751 (2018) 26. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: an accurate O(n) solution to the PnP problem. Int. J. Comput. Vision (IJCV) 81(2), 155–166 (2009) 27. Li, J., Todorovic, S.: Action shuffle alternating learning for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12628–12636, June 2021 28. Li, S.J., AbuFarha, Y., Liu, Y., Cheng, M.M., Gall, J.: MS-TCN++: multi-stage temporal convolutional network for action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2020) 29. Liu, Y., Liu, R., Wang, H., Lu, F.: Generalizing gaze estimation with outlier-guided collaborative adaptation. In: International Conference on Computer Vision (ICCV), pp. 3835–3844 (2021) 30. Marin-Jimenez, M.J., Zisserman, A., Ferrari, V.: "Here's looking at you, kid". Detecting people looking at each other in videos. In: Proceedings of the British Machine Vision Conference (BMVC), pp. 22.1–22.12 (2011) 31. Marin-Jimenez, M.J., Zisserman, A., Eichner, M., Ferrari, V.: Detecting people looking at each other in videos. Int. J. Comput. Vision (IJCV) 106(3), 282–296 (2014) 32. Marin-Jimenez, M.J., Kalogeiton, V., Medina-Suarez, P., Zisserman, A.: LAEO-Net: revisiting people looking at each other in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3477–3485, June 2019


33. Marin-Jimenez, M.J., Kalogeiton, V., Medina-Suarez, P., Zisserman, A.: LAEONet++: revisiting people looking at each other in videos. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 3069–3081 (2022) 34. Marshall, R., Summerskill, S.: Chapter 25 - posture and anthropometry. In: DHM and Posturography, pp. 333–350. Academic Press (2019) 35. Miller, S.R., Miller, C.J., Bloom, J.S., Hynd, G.W., Craggs, J.G.: Right hemisphere brain morphology, attention-deficit hyperactivity disorder (ADHD) subtype, and social comprehension. J. Child Neurol. 21(2), 139–144 (2006). https://doi.org/10. 1177/08830738060210021901 36. Müller, P., Huang, M.X., Zhang, X., Bulling, A.: Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In: Proceedings of the ACM Symposium on Eye Tracking Research & Applications, pp. 1–10 (2018) 37. Mundy, P.C., Sigman, M.D., Ungerer, J.A., Sherman, T.: Defining the social deficits of autism: the contribution of non-verbal communication measures. J. Child Psychol. Psychiatry 27(5), 657–69 (1986) 38. Park, S., Mello, S.D., Molchanov, P., Iqbal, U., Hilliges, O., Kautz, J.: Fewshot adaptive gaze estimation. In: International Conference on Computer Vision (ICCV), pp. 9368–9377 (2019) 39. Qin, J., Shimoyama, T., Sugano, Y.: Learning-by-novel-view-synthesis for full-face appearance-based 3D gaze estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4981–4991 (2022) 40. Recasens, A., Khosla, A., Vondrick, C., Torralba, A.: Where are they looking? In: International Conference on Neural Information Processing Systems, pp. 199–207 (2015) 41. Recasens, A., Vondrick, C., Khosla, A., Torralba, A.: Following gaze in video. In: IEEE International Conference on Computer Vision (ICCV), pp. 1444–1452 (2017) 42. Ruiz, N., Chong, E., Rehg, J.M.: Fine-grained head pose estimation without keypoints. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2155–215509 (2018) 43. Sener, F., Yao, A.: Unsupervised learning and segmentation of complex activities from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8368–8376 (2018) 44. Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2107–2116 (2017) 45. Singh, B., Marks, T.K., Jones, M., Tuzel, O., Shao, M.: A multi-stream bidirectional recurrent neural network for fine-grained action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1961–1970 (2016) 46. Smith, B.A., Yin, Q., Feiner, S.K., Nayar, S.K.: Gaze locking: passive eye contact detection for human-object interaction. In: Proceedings of the Annual ACM Symposium on User Interface Software and Technology, pp. 271–280 (2013) 47. Sugano, Y., Matsushita, Y., Sato, Y.: Learning-by-synthesis for appearance-based 3D gaze estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1821–1828 (2014)


48. Swetha, S., Kuehne, H., Rawat, Y.S., Shah, M.: Unsupervised discriminative embedding for sub-action learning in complex activities. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 2588–2592 (2021) 49. Tu, D., Min, X., Duan, H., Guo, G., Zhai, G., Shen, W.: End-to-end human-gazetarget detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2202–2210 (2022) 50. VidalMata, R.G., Scheirer, W.J., Kukleva, A., Cox, D., Kuehne, H.: Joint visualtemporal embedding for unsupervised learning of actions in untrimmed sequences. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1237–1246 (2021) 51. Wang, B., Hu, T., Li, B., Chen, X., Zhang, Z.: GaTector: a unified framework for gaze object prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19588–19597 (2022) 52. Wei, P., Liu, Y., Shu, T., Zheng, N., Zhu, S.C.: Where and why are they looking? Jointly inferring human attention and intentions in complex tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6801–6809 (2018) 53. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H.: ELAN: a professional framework for multimodality research. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1556–1559 (2006) 54. Ye, Z., Li, Y., Liu, Y., Bridges, C., Rozga, A., Rehg, J.M.: Detecting bids for eye contact using a wearable camera. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1, pp. 1–8 (2015) 55. Yu, Y., Liu, G., Odobez, J.M.: Improving few-shot user-specific gaze adaptation via gaze redirection synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11937–11946 (2019) 56. Yu, Y., Odobez, J.M.: Unsupervised representation learning for gaze estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7314–7324 (2020) 57. Zhang, X., Park, S., Beeler, T., Bradley, D., Tang, S., Hilliges, O.: ETH-XGaze: a large scale dataset for gaze estimation under extreme head pose and gaze variation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 365–381. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058558-7_22 58. Zhang, X., Sugano, Y., Bulling, A.: Everyday eye contact detection using unsupervised gaze target discovery. In: Proceedings of the Annual ACM Symposium on User Interface Software and Technology, pp. 193–203 (2017) 59. Zhang, X., Sugano, Y., Bulling, A.: Revisiting data normalization for appearancebased gaze estimation. In: Proceedings of the ACM Symposium on Eye Tracking Research & Applications, pp. 1–9 (2018) 60. Zhang, X., Sugano, Y., Bulling, A.: Evaluation of appearance-based methods and implications for gaze-based applications. In: Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2019) 61. Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: Appearance-based gaze estimation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4511–4520 (2015) 62. Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: It’s written all over your face: full-face appearance-based gaze estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2299–2308 (2017)


63. Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: MPIIGaze: real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 41(1), 162–175 (2019) 64. Zheng, Y., Park, S., Zhang, X., Mello, S.D., Hilliges, O.: Self-learning transformations for improving gaze and head redirection. In: International Conference on Neural Information Processing Systems, pp. 13127–13138 (2020)

Exemplar Free Class Agnostic Counting

Viresh Ranjan and Minh Hoai Nguyen

Stony Brook University, Stony Brook, USA
VinAI Research, Hanoi, Vietnam
[email protected]

Abstract. We tackle the task of Class Agnostic Counting, which aims to count objects in a novel object category at test time without any access to labeled training data for that category. Previous class agnostic counting methods cannot work in a fully automated setting and require computationally expensive test time adaptation. To address these challenges, we propose a visual counter which operates in a fully automated setting and does not require any test time adaptation. Our proposed approach first identifies exemplars from repeating objects in an image, and then counts the repeating objects. We propose a novel region proposal network for identifying the exemplars. After identifying the exemplars, we obtain the corresponding count by using a density estimation based visual counter. We evaluate our proposed approach on the FSC-147 dataset, and show that it achieves superior performance compared to the existing approaches. Our code and models are available at: https://github.com/Viresh-R/ExemplarFreeCounting.git.

1 Introduction

In recent years, visual counters have become more and more accurate at counting objects from specialized categories such as human crowds [11,12,25,49], cars [26], animals [3], and cells [2,13,47]. Most of these visual counters treat counting as a class-specific regression task, where a class-specific mapping is learned to map from an input image to the corresponding object density map, and the count is obtained by summing over the density map. However, this approach does not provide a scalable solution for counting objects from a large number of object categories because these visual counters can count only a single category at a time, and it also requires hundreds of thousands [49] to millions of annotated training objects [38,44] to achieve reasonably accurate performance for each category. A more scalable approach for counting objects from many categories is to use class-agnostic visual counters [24,30], which can count objects from many categories. But the downside of not having a predefined object category is that these counters require a human user to specify what they want to count by providing several exemplars for the object category of interest. As a result, these class-agnostic visual counters cannot be used in any fully automated systems. Furthermore, these visual counters need to be adapted to each new visual category [24] or each test image [30], leading to slower inference.


Fig. 1. Exemplar free class agnostic counter. Given an image containing instances of objects from unseen object categories, our proposed approach first generates exemplars from the repeating classes in the image using a novel region proposal network. Subsequently, a density predictor network predicts separate density maps for each of the exemplars. The total count for any exemplar, i.e. the number of times the object within the exemplar occurs in the image, is obtained by summing all the values in the density map corresponding to that exemplar.

In this paper, we present the first exemplar-free class-agnostic visual counter that is capable of counting objects from many categories, even for novel categories that have neither annotated objects at training time nor exemplar objects at testing time. Our visual counter does not require any human user in its counting process, and this is crucial for building fully automated systems in various applications such as wildlife monitoring, healthcare, and visual anomaly detection. For example, this visual counter can be used to alert environmentalists when a herd of animals of significant size passes by an area monitored by a wildlife camera. Another example is to use this visual counter to monitor for critical health conditions when a certain type of cell outgrows the other types. Unlike existing class-agnostic counters [24,30], our approach does not use any test time adaptation or finetuning. At this point, a reader might wonder if it is possible to identify all possible exemplars in an image automatically by using a class-agnostic object detector such as a Region Proposal Network (RPN) [33], and run an existing class-agnostic visual counter using the detected exemplars to count all objects in all categories. Although this approach does not require a human's input during the counting process, it can be computationally expensive. This is because RPNs usually produce a thousand or more object proposals. And this in turn requires executing the class-agnostic visual counter at least a thousand times, a time-consuming and computationally demanding process. To avoid this expensive procedure, we develop in this paper a novel convolutional network architecture called Repetitive Region Proposal Network (RepRPN), which can be used to automatically identify a few exemplars from the most frequent classes in the image. RepRPN is used at the first stage of our proposed two-stage visual counting algorithm named RepRPN-Counter. We use a density estimation based visual counter as the second stage of RepRPN-Counter, which predicts a separate high resolution density map for each exemplar.



The repetition score of a proposal is defined as the number of times the object contained within the proposal occurs in the image. The proposals with the highest repetition scores are chosen as the exemplars, and the second-stage density predictor estimates density maps only for these chosen exemplars. This exemplar generation procedure relies on the underlying assumption that, in an image containing different classes with varying counts, the classes of interest are the ones with the larger counts. Compared to the traditional RPN [33], RepRPN is better suited for the visual counting task, since it can significantly reduce the training and inference time of any two-stage counter. Furthermore, RepRPN can serve as a fast visual counter for applications that can tolerate some margin of error and do not require the localization information conveyed by density maps. Note that the second-stage predictor of our visual counter estimates a separate density map for each of the chosen exemplars. While training RepRPN-Counter, another technical challenge that we need to overcome is the lack of properly annotated data. The only dataset suitable for training class-agnostic visual counters is FSC-147 [30], which contains annotations for a single object category in each image and may contain unannotated objects from other categories. To obtain annotations for the unannotated objects in the FSC-147 dataset, we propose a novel knowledge transfer strategy in which we use a RepRPN trained on a large-scale object detection dataset [19] and a density prediction network [30] trained on FSC-147 as teacher networks. In short, the contributions of this paper are threefold: (1) we develop the first exemplar-free class-agnostic visual counter for novel categories that have neither annotated objects at training time nor exemplar objects at testing time; (2) we develop a novel architecture to simultaneously estimate the objectness and repetition scores of each proposal; (3) we propose a knowledge transfer strategy to handle unannotated objects in the FSC-147 dataset.

2 Related Work

Visual Counting. Most previous methods for visual counting focus on specific categories [1,4,6,17,20,22,25,28,29,31,35,37,39,41–43,45,48,49]. These visual counters can count a single category at a time, and require training data with hundreds of thousands [49] to millions of annotated instances [44] for every visual category, which are expensive to collect. These visual counters cannot generalize to new categories at test time, and hence, cannot handle our class agnostic counting task. To reduce the expensive annotation cost, some of these methods focus on designing unsupervised [22] and semi-supervised tasks [23] for visual counting. However, these methods still require a significant amount of annotations and training time for each new category. Class Agnostic Counting. Most related to ours is the previous works on class agnostic counting [24,30], which build counters that can be trained to count novel classes using relatively small number of examples from the novel classes. Lu and



Zisserman [24] proposed a Generic Matching Network (GMN) for class-agnostic counting, which follows a two-stage training framework where the first stage is trained on a large-scale video object tracking data, and the second stage consists of adapting GMN to a novel object class. GMN uses labeled data from the novel object class during the second stage, and only works well if several dozens to hundreds of examples are available for the adaptation. Few-shot Adaptation and Matching Network (FamNet) [30] is a recently proposed class agnostic few-shot visual counter which generalizes to a novel category at test time given only a few exemplars from the category. However, FamNet is an interactive visual counter which requires an user to provide the exemplars from the test image. Both GMN and FamNet require test time adaptation for each new class or test image, leading to slower counting procedures. Zero-Shot Object Detection. Also related to ours is the previous work on zero-shot object detection [5,27,50]. Most of these approaches [5,27] use a region proposal network to generate class-agnostic proposals, and map the features from the proposals to a semantic space where they can be directly compared with semantic word embeddings of novel object classes. However, all of these zero-shot detection approaches require access to the semantic word embeddings for the test classes, and cannot work for our class agnostic counting task where the test classes are not known a priori. Few-Shot Learning. Also related to ours is the previous works on few-shot learning [8,15,16,32,36], which aim to adapt classifiers to novel categories based on only a few labeled examples. One of the meta learning based few-shot approaches, Model Agnostic Meta Learning (MAML) [8], has been adapted for class-agnostic counting [30]. MAML focuses on learning parameters which can adapt to novel classes at test time by doing only a few gradient descent steps. Although these few-shot methods reduce the labeled data needed to generalize to new domains, most of these approaches cannot be used for our class agnostic counting task due to the unavailability of labeled data from the novel test class.

3 Proposed Approach

We propose an exemplar-free class-agnostic visual counter called RepRPN-Counter. Given an image containing one or more repetitive object categories, RepRPN-Counter predicts a separate density map for each of the repetitive categories. The object count for a repetitive category can be obtained by simply summing up the corresponding density map. For each category that is counted, RepRPN-Counter also provides the bounding box of an example from the category. RepRPN-Counter consists of two key components: 1) a Repetitive Region Proposal Network (RepRPN) for identifying exemplars from repetitive objects in an image, along with their approximate counts; and 2) a Density Prediction Network (DPN) that predicts a density map corresponding to any exemplar produced by the RepRPN.
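The count read-out described above is deliberately simple. The following minimal sketch (PyTorch-style, with illustrative shapes rather than the exact tensors used in the paper) shows how a per-exemplar count is obtained by summing its predicted density map.

```python
import torch

def counts_from_density_maps(density_maps: torch.Tensor) -> torch.Tensor:
    # density_maps: (num_exemplars, H, W); the count for each exemplar's
    # class is simply the sum over its density map.
    return density_maps.sum(dim=(1, 2))

# toy usage: three exemplars, 64x64 density maps
maps = torch.rand(3, 64, 64) * 0.01
print(counts_from_density_maps(maps))  # one estimated count per exemplar
```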



Fig. 2. RepRPN-counter is a two-stage Exemplar Free Class Agnostic Counter. RepRPN-Counter has two key components: 1) Repetitive Region Proposal Network (RepRPN) and 2) Density Prediction Network (DPN). RepRPN predicts repetition score and objectness score for every proposal. Repetition score is used to select few proposals, called exemplars, from the repeating classes in the image. The DPN predicts a separate density map for any proposal selected by the RepRPN. The total count for any proposal, i.e. the number of times the object within the proposal occurs in the image, is obtained by summing all the values in the density map corresponding to that proposal. The DPN ignores proposals which are less likely to contain repetitive objects, so as to reduce the time required for training and evaluation. To keep things simple, we have shown the density prediction step for a single proposal. In reality, several density maps are predicted by the DPN, one for every selected proposal.

Fig. 3. Missing labels in the FSC-147 dataset. Each image in the dataset comes with bounding box annotations for the exemplar objects (shown in blue), and dot annotations for all objects belonging to the same category as the exemplar. For each image, objects of only a single class are annotated. We present a knowledge transfer strategy to deal with incomplete annotation. (Color figure online)

For the rest of this section, we will describe the architecture of RepRPN in Sect. 3.1, the architecture of RepRPN-Counter in Sect. 3.2, the knowledge transfer approach for handling incomplete annotation in Sect. 3.3, and the overall training strategy in Sect. 3.4.

3.1 Repetitive Region Proposal Networks

Repetitive Region Proposal Network (RepRPN) proposes exemplars from the repetitive object classes in an image. RepRPN takes as input convolutional feature representation of an image computed by the Resnet-50 backbone [10], and predicts proposal bounding boxes along with objectness and repetition scores for every proposal at every anchor location. The objectness score is the probability



of the proposal belonging to any object class and not the background class. The repetition score refers to the number of times the object within the proposal occurs in the image. For example, consider an image with m cats and n oranges. The RepRPN should predict m as the repetition score for any cat proposal, and n as the repetition score for any orange proposal. The repetition score is used to select exemplars from the repetitive classes in the image, i.e., the proposals with the highest repetition scores are chosen as the exemplars. The original RPN formulation [33] uses a fixed window around an anchor location to predict the proposal boxes and objectness scores. However, this fixed-size window does not cover the entire image, and it does not contain sufficient information to predict the repetition score. This has been verified in our experiment, where an RPN using only a fixed-size window over convolutional features was unable to predict the repetition score. Predicting the repetition score requires access to information from the entire image. To obtain this global information efficiently, we make use of the encoder self-attention layers [40]. Given a feature vector at any location in the convolutional feature map, self-attention layers can pool information from similar vectors across the entire image, and can therefore be used to estimate the repetition score at any anchor location. To apply self-attention, we first transform the convolutional features into a sequence of length n: S \in R^{n \times d}. To preserve positional information, we concatenate appropriate d/2-dimensional row and column embeddings, resulting in d-dimensional positional embeddings which are added to the sequence S. We refer to the resulting embeddings as X \in R^{n \times d}. Given the sequence X, the self-attention layer first transforms X into query (X_Q), key (X_K), and value (X_V) matrices by multiplying X with matrices W_Q, W_K, and W_V:

X_Q = X W_Q,   X_K = X W_K,   X_V = X W_V.    (1)

The self-attention layer outputs a new sequence U, where the i-th element in the output sequence is obtained as a weighted average of the value sequence, and the weights are decided based on the similarity between the i-th query element and the key sequence. The output sequence U is computed as follows:

U = softmax(X_Q X_K^T) X_V.    (2)

Tensor U is reshaped into a tensor U' that has the same spatial dimensions as the input convolutional feature map. Tensor U' is forwarded to the bounding box regression, objectness prediction, and repetition score prediction heads. Each prediction head consists of a single 1×1 convolutional layer. At each anchor location in the image, we consider k anchor boxes. For each anchor box, we predict an objectness score, a repetition score, and bounding box coordinates. The repetition score is used to identify proposals containing the repetitive objects in the image, i.e., proposals with a large repetition score contain repetitive objects.
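As a rough illustration of Eqs. (1)-(2) and the 1×1 prediction heads, the sketch below implements a single-head version of this step in PyTorch. The layer sizes, the single attention head, and the way positional embeddings are passed in are simplifying assumptions, not the exact configuration used in the paper (which stacks five transformer layers with eight heads each).

```python
import torch
import torch.nn as nn

class RepRPNHeadSketch(nn.Module):
    """Minimal single-head sketch of Eqs. (1)-(2) plus the three 1x1 heads."""
    def __init__(self, d: int = 256, k: int = 3):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.obj_head = nn.Conv2d(d, k, 1)      # objectness per anchor
        self.rep_head = nn.Conv2d(d, k, 1)      # repetition score per anchor
        self.box_head = nn.Conv2d(d, 4 * k, 1)  # box coordinates per anchor

    def forward(self, feat, pos_emb):
        # feat, pos_emb: (B, d, H, W); positional information is simply added here
        b, d, h, w = feat.shape
        x = (feat + pos_emb).flatten(2).transpose(1, 2)    # (B, n, d), n = H*W
        q, k_, v = self.wq(x), self.wk(x), self.wv(x)      # Eq. (1)
        attn = torch.softmax(q @ k_.transpose(1, 2), dim=-1)
        u = attn @ v                                       # Eq. (2)
        u = u.transpose(1, 2).reshape(b, d, h, w)          # reshape to U'
        return self.obj_head(u), self.rep_head(u), self.box_head(u)
```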


3.2 RepRPN-Counter

As shown in Fig. 2, RepRPN-Counter consists of a ResNet-50 feature backbone, a RepRPN proposal network, and a Density Prediction Network (DPN). The RepRPN and the DPN share the same feature backbone. The RepRPN provides the DPN with the bounding box locations of the proposals with large repetition scores, also called exemplars, and the DPN predicts a separate density map for each exemplar. The DPN is trained and evaluated only on the chosen exemplars, and not on all the proposals, so as to reduce the training and inference time. Similar to previous works on class-agnostic counting [24,30], the DPN combines the convolutional features of an exemplar with the convolutional features of the entire image to predict the density map for the exemplar. The exemplar features are obtained by performing ROI pooling on the convolutional features computed by the backbone, at the locations defined by the exemplar bounding boxes. The exemplar features are correlated with the image features, and the resulting correlation map is propagated through the DPN. The DPN is a fully convolutional network consisting of five convolutional layers and three upsampling layers (more architecture details are provided in the supplementary material), and the predicted density maps have the same spatial dimensions as the input image. Note that the DPN predicts several density maps, one for each exemplar. The overall count for the object class pertaining to an exemplar is obtained by simply summing all the values in the density map corresponding to that exemplar. The DPN is not evaluated on proposals with a low repetition score; for such proposals, the repetition score can be used as the final count.
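A hedged sketch of how exemplar features can be correlated with the image features to form the DPN input is given below. It uses torchvision's roi_align and treats each ROI-pooled exemplar as a convolution kernel; the actual DPN architecture (five convolutional and three upsampling layers) is not reproduced here, and the exemplar boxes are assumed to already be expressed in feature-map coordinates.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def exemplar_correlation(image_feat: torch.Tensor, boxes: torch.Tensor,
                         out_size: int = 7) -> torch.Tensor:
    # image_feat: (1, C, H, W) backbone features
    # boxes: (N, 4) float exemplar boxes (x1, y1, x2, y2) in feature coordinates
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch idx
    ex_feat = roi_align(image_feat, rois, output_size=out_size)   # (N, C, 7, 7)
    # correlate each exemplar with the whole image feature map
    corr = F.conv2d(image_feat, ex_feat, padding=out_size // 2)   # (1, N, H, W)
    return corr  # one correlation map per exemplar, fed to the density predictor
```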

3.3 Knowledge Transfer for Handling Missing Labels

The only existing dataset consisting of images of densely populated objects from many visual categories that can be used for training class agnostic visual counters is FSC-147 [30]. However, it is not trivial to train RepRPN-Counter on FSC-147 because of the missing labels in the dataset. FSC-147 comes with two types of annotations for each image: a few exemplar bounding boxes to specify the object category to be counted, and dot annotations for all of the objects belonging to the same category as the specified exemplars. However, an image may contain objects from another category that has not been annotated, as shown in Fig. 3. Given the missing labels, forcing RepRPN-Counter to predict zero count for the unannotated objects may degrade the performance of the counter. We use knowledge transfer from teacher networks to address the incomplete annotation issue. We first train a RepRPN on the MSCOCO object detection dataset [19]. The MSCOCO training set consists of over 82K natural images from 80 visual categories, and the RPNs trained on this large dataset have been shown to generalize to previously unseen classes, thereby proving useful for tasks like zero-shot object detection [27]. We use the RepRPN trained on MSCOCO as a teacher network for generating the target labels for the objectness scores and the repetition scores for those proposals not intersecting with the annotated objects in the FSC-147 dataset. To get the target density maps corresponding to



the unannotated proposals, we use the pretrained class-agnostic visual counter FamNet [30], which can predict the density map for a novel object class given only a single exemplar. When needed, an unannotated proposal is fed into FamNet, and the output of FamNet is used as the target density map for training the proposed network RepRPN-Counter.
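The following pseudocode-style sketch summarizes the label-filling logic of this section. All names (teacher_reprpn, teacher_famnet, iou_fn) are hypothetical placeholders for illustration; the paper does not prescribe this exact interface.

```python
def targets_with_knowledge_transfer(proposal, annotated_boxes, image,
                                    teacher_reprpn, teacher_famnet, iou_fn):
    """Proposals that do not intersect any FSC-147 annotation get their
    objectness/repetition targets from a COCO-pretrained RepRPN teacher and
    their density target from a pretrained FamNet teacher."""
    if max((iou_fn(proposal, b) for b in annotated_boxes), default=0.0) > 0.0:
        return None  # ground-truth annotations are used for this proposal
    obj_t, rep_t = teacher_reprpn(image, proposal)  # objectness / repetition targets
    density_t = teacher_famnet(image, proposal)     # target density map
    return obj_t, rep_t, density_t
```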

3.4 Training Objective

RepRPN-Counter is trained in two stages. The first stage consists of training the RepRPN. Once trained, the RepRPN is kept frozen and used to generate exemplars for the density estimation network DPN. The second stage of training consists of training the DPN to predict the density map for every exemplar. Training Objective for RepRPN. For the i-th anchor box, the outputs of the RepRPN are the objectness score y_i, the bounding box coordinates b_i, and the repetition score c_i. Let the corresponding ground truth labels be y_i^*, b_i^*, c_i^*. We follow the same protocol as used in Faster RCNN [33] for obtaining the binary objectness label y_i^*, and the same parameterization for the bounding box coordinates b_i. c_i^* is the number of times the object within the anchor box, if any, occurs in the image. Since predicting c_i requires access to global information about the image, RepRPN makes use of self-attentional features as described in Sect. 3.1. The training loss for the i-th anchor box is:

L_{RepRPN} = \lambda L_{cls}(y_i, y_i^*) + \lambda L_{reg}(b_i, b_i^*) + L_{reg}(c_i, c_i^*),    (3)

where L_{cls} is the binary cross-entropy loss, and L_{reg} is the smooth L1 loss. When training RepRPN on the FSC-147 dataset, the labels y_i^* and c_i^* for the positive anchors are obtained using the ground-truth annotation of FSC-147. Note that in the FSC-147 dataset, only three exemplars per image are annotated with bounding boxes, while the rest of the objects are annotated with a dot around their center. We obtain the bounding boxes for all the dot-annotated objects by placing a bounding box of the average exemplar size around each of the dots. For anchors not intersecting with any of the annotated bounding boxes in FSC-147, the y_i^* and c_i^* labels are obtained using a teacher RepRPN, which has been pre-trained on the MSCOCO dataset [19]. Training Objective for DPN. Given an exemplar bounding box b_i and the feature map U for an input image I of size H×W, the density prediction network DPN predicts a density map Z_{b_i} = f(U, b_i) of size H×W. The training objective for the DPN is based on the mean squared error:

L_{mse}(Z_{b_i}, Z^*) = \frac{1}{HW} \sum_{r=1}^{H} \sum_{c=1}^{W} (Z_{b_i}(r, c) - Z^*(r, c))^2,

where Z ∗ is the target density map corresponding to Zbi . If the exemplar bi intersects with any annotated object, Z ∗ is obtained by convolving a Gaussian kernel with the corresponding dot annotation map. Note that Gaussian blurred



dot annotation maps are commonly used for training density-estimation-based visual counters [12,21,24,28,49]. For cases where b_i does not intersect with any annotated object, we use the pretrained FamNet [30] as a teacher network for obtaining Z^*. The FamNet teacher can predict a density map given an exemplar b_i and an input image I.
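A minimal PyTorch sketch of the two training objectives is shown below. It assumes anchor sampling and target assignment have already been done, and it mirrors Eq. (3) and the mean-squared-error objective while omitting details such as per-anchor masking.

```python
import torch
import torch.nn.functional as F

def reprpn_loss(obj_logit, box_pred, rep_pred, obj_gt, box_gt, rep_gt, lam=1.0):
    """Eq. (3): binary cross-entropy for objectness plus smooth-L1 terms for the
    box coordinates and the repetition score (weighting follows the equation)."""
    l_cls = F.binary_cross_entropy_with_logits(obj_logit, obj_gt)
    l_box = F.smooth_l1_loss(box_pred, box_gt)
    l_rep = F.smooth_l1_loss(rep_pred, rep_gt)
    return lam * l_cls + lam * l_box + l_rep

def dpn_loss(pred_density, target_density):
    """Mean squared error between predicted and target density maps."""
    return F.mse_loss(pred_density, target_density)
```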

3.5 Implementation Details

For training, we use the Adam optimizer [14] with a learning rate of 10^{-5} and a batch size of one. We use the first four convolutional blocks from the ImageNet-pretrained ResNet-50 [10] as the backbone. We keep the backbone frozen during training, since finetuning the backbone yields poor results: the backbone has feature maps suitable for detecting a large number of classes, and finetuning it leads to specialization towards the FSC-147 training classes, resulting in poor performance on the novel test classes. The weights of the RepRPN and DPN are initialized from a zero-mean univariate Gaussian with a standard deviation of 10^{-3}. RepRPN uses five self-attention transformer layers, each with eight heads. For training the RepRPN, we use four anchor sizes of 32, 64, 128, and 256, and three aspect ratios of 0.5, 1, and 2. We sample a batch of 96 anchors from each image during training. Training is done for 1000 epochs.
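For concreteness, the snippet below enumerates the 4 × 3 = 12 anchor shapes implied by the listed sizes and aspect ratios, following the usual RPN convention of keeping the anchor area fixed per size; the exact ratio convention used by the authors may differ.

```python
import itertools
import numpy as np

def anchor_shapes(sizes=(32, 64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Enumerate (width, height) anchor shapes with area fixed per size."""
    shapes = []
    for s, r in itertools.product(sizes, ratios):
        w = s * np.sqrt(r)   # width grows with the ratio
        h = s / np.sqrt(r)   # height shrinks so that w * h == s * s
        shapes.append((w, h))
    return shapes

print(len(anchor_shapes()))  # 12 anchor shapes per location
```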

4 Experiments

4.1 Dataset

We perform experiments on the recently proposed FSC-147 dataset [30], which was originally proposed for the exemplar-based class-agnostic counting task. The FSC-147 dataset consists of 6135 images from 147 visual categories, which are split into train, val, and test sets comprising 89, 29, and 29 classes, respectively. There are no common categories between the train, val, and test sets. The mean and maximum counts for images in the dataset are 56 and 3701, respectively. We train our model on the train set, and evaluate it on the test and val sets. Each image comes with annotations for a single object category of interest only, consisting of several exemplar bounding boxes and complete dot annotations for the objects of interest in the image. Since our goal is to build an exemplar-free counter, unlike previous methods [24,30], we do not use human-annotated exemplars as an input to our counter.

4.2 Evaluation Metrics

We use the Top-k version of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to compare the performance of the different visual counters. MAE and RMSE are defined as follows:

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|;   RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},

where n is the number of test images, and y_i and \hat{y}_i are the ground truth and predicted counts, respectively.



Table 1. Comparing RepRPN-Counter to class-agnostic counters. FamNet, GMN and MAML are exemplar based class-agnostic counters which have been adapted and trained for the exemplar-free setting, where a RPN is used for generating exemplars. We report the Top-1, Top-3 and Top-5 MAE and RMSE metrics on the val and test sets of FSC-147 dataset. RepRPN-Counter consistently outperforms the competing approaches.

| Method | MAE (Val) Top1/Top3/Top5 | RMSE (Val) Top1/Top3/Top5 | MAE (Test) Top1/Top3/Top5 | RMSE (Test) Top1/Top3/Top5 |
| GMN | 43.25 / 40.96 / 39.02 | 114.52 / 108.47 / 106.06 | 43.35 / 39.72 / 37.86 | 145.34 / 142.81 / 141.39 |
| MAML | 34.96 / 33.16 / 32.44 | 98.83 / 101.80 / 101.08 | 37.38 / 33.27 / 31.47 | 133.89 / 131.00 / 129.31 |
| FamNet (pretrained) | 47.66 / 42.85 / 39.52 | 125.54 / 121.59 / 116.08 | 50.89 / 42.70 / 39.38 | 150.52 / 146.08 / 143.51 |
| FamNet | 34.51 / 33.17 / 32.15 | 99.87 / 99.31 / 98.75 | 35.81 / 33.32 / 32.27 | 133.57 / 132.52 / 131.46 |
| RepRPN-Counter | 31.69 / 30.40 / 29.24 | 100.31 / 98.73 / 98.11 | 28.32 / 27.45 / 26.66 | 128.76 / 129.69 / 129.11 |

MAE and RMSE are the most commonly used metrics for the counting task [24,25,30,49]. However, RepRPN-Counter predicts several density maps and corresponding counts, one for each selected proposal. Given k predicted counts from k proposals, we compute Top-k MAE and RMSE by first selecting those proposals, among the top k, that have an IoU ratio of at least 0.3 with any ground truth box, and averaging the counts corresponding to the selected proposals to obtain the predicted count \hat{y}_i. In case none of the k proposals intersects with any ground truth box, we simply average all of the k counts to obtain \hat{y}_i.
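A small NumPy sketch of this Top-k evaluation protocol is given below; iou_fn is a placeholder for a standard box-IoU routine, and the proposals are assumed to be ordered as in the evaluation (top-k proposals first).

```python
import numpy as np

def topk_count(pred_counts, pred_boxes, gt_boxes, iou_fn, k=3, thr=0.3):
    """Average the counts of the top-k proposals that overlap a ground-truth
    box (IoU >= 0.3); if none do, average all k counts."""
    keep = [c for c, b in zip(pred_counts[:k], pred_boxes[:k])
            if any(iou_fn(b, g) >= thr for g in gt_boxes)]
    return float(np.mean(keep)) if keep else float(np.mean(pred_counts[:k]))

def mae_rmse(gt_counts, pred_counts):
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    err = gt - pred
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))
```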

4.3 Comparison with Class-Agnostic Visual Counters

We compare our proposed RepRPN-Counter with the previous class-agnostic counting methods [8,24,30] on the task of counting objects from novel classes. We do not compare with class-specific counters [25,49] because such counters cannot handle novel classes at test time. Furthermore, these counters require hundreds [49] or thousands [38,44] of images per category during training, while FSC-147 dataset contains an average of only 41 images per category. GMN [24], FamNet [30], and MAML [8,30] are exemplar based counters which can predict density map for any unseen object category based on few exemplars of the object category from the same image. These counters were originally proposed to work with human provided exemplars as an input to the counter. In order to make these exemplar based counters work in our exemplar free setup, we modify GMN, FamNet, and MAML based visual counters by replacing human provided exemplars with RPN [33] generated exemplars. We use the RPN of Faster RCNN [33] to generate the proposals for the competing approaches, and use the top k proposals with the highest objectness score as the exemplars. For fair comparison, both the RPNs used with the competing approaches as well as the RepRPN are pre-trained on the MSCOCO dataset [19]. We do not use MSCOCO to train the DPN. We train the competing approaches and our proposed approach on the train set of FSC-147, and report the results on the val and



Table 2. Comparing RepRPN-Counter with pre-trained object detectors, on Val-COCO and Test-COCO subsets of FSC-147, which only contain COCO classes. Pre-trained object detectors are available for these COCO classes. For RepRPN-Counter, we use the density map corresponding to the proposal with the highest repetition score. Without access to any labeled data from these COCO classes, our proposed approach outperforms all of the object detectors which are trained using the entire COCO train set containing a large number of images from these COCO classes.

| Method | Val-COCO MAE | Val-COCO RMSE | Test-COCO MAE | Test-COCO RMSE |
| Faster R-CNN | 52.79 | 172.46 | 36.20 | 79.59 |
| RetinaNet | 63.57 | 174.36 | 52.67 | 85.86 |
| Mask R-CNN | 52.51 | 172.21 | 35.56 | 80.00 |
| Detr | 58.35 | 175.97 | 45.51 | 96.57 |
| RepRPN-Counter (Ours) | 50.72 | 160.95 | 25.29 | 56.98 |

test sets of FSC-147. We also compare our method with a pre-trained version of FamNet originally trained on the few-shot counting task. Following [30], all of the methods are trained with three proposals, and evaluated with 1, 3, and 5 proposals. We report the Top-1, Top-3, and Top-5 MAE and RMSE values in Table 1. As can be seen from Table 1, our method RepRPN-Counter outperforms all of the competing methods. The pre-trained FamNet performs the worst, even though it was trained on the same FSC-147 training set. This shows that simply combining pre-trained exemplar-based class-agnostic counters with RPN-generated exemplars does not provide a reasonable solution for the exemplar-free setting. When retrained specifically for the exemplar-free setting, the performance of FamNet improves significantly compared to its pre-trained version. GMN performs worse than the other baselines, possibly due to the need for more examples for the adaptation process. This observation was reported earlier for the exemplar-based class-agnostic counting task as well [30].

4.4 Comparison with Object Detectors

One approach to counting is to use a detector and count the number of detections in an image. However, it requires thousands of examples to train an object detector, and the detector-based counters cannot be used for novel object classes. That being said, we compare RepRPN-Counter with object detectors on a subset of COCO categories from the validation and test sets of FSC-147. These subsets are called Val-COCO and Test-COCO, containing 277 and 282 images respectively. We compare our approach with the official implementations [46] of MaskRCNN [9], FasterRCNN [34], RetinaNet [18], and Detr [7]. The results are shown in Table 2. Without any access to labeled data from these COCO classes, our proposed method still outperforms the object detectors that have



Table 3. Comparing RepRPN with RPN, on the test set of FSC-147. Using RepRPN instead of RPN leads to a significant boost in performance.

| Method | RPN MAE | RPN RMSE | RepRPN MAE | RepRPN RMSE |
| GMN | 43.35 | 145.34 | 32.17 | 137.29 |
| MAML | 37.38 | 133.89 | 32.09 | 141.03 |
| FamNet (pre-trained) | 50.89 | 150.52 | 38.64 | 144.27 |
| FamNet | 35.81 | 133.57 | 32.94 | 132.82 |

been trained using the entire COCO train set containing thousands of images from these COCO classes. Detr [7] performs worse than some of the earlier object detectors because Detr uses a fixed number of query slots (usually 100), which limits the maximum number of objects it can detect, while FSC-147 has images containing thousands of objects.

4.5 Comparing RepRPN with RPN

We are also interested in checking whether RepRPN can boost the performance of class-agnostic visual counters other than RepRPN-Counter. For this experiment, we replace RPN [33] with RepRPN for GMN, FamNet, and MAML, and report the Top-1 MAE and RMSE scores on the FSC-147 test set in Table 3. Using RepRPN instead of RPN leads to a significant boost in performance for all class-agnostic visual counters. This suggests that RepRPN is much better suited than RPN for proposing exemplars in the exemplar-free counting task. Also, RepRPN works well with different types of class-agnostic counters, including the proposed RepRPN-Counter, GMN, FamNet, and MAML.

4.6 Ablation Studies

Our proposed RepRPN-Counter consists of two primary components: the RepRPN for exemplar proposal and the DPN for density prediction. Furthermore, our proposed knowledge transfer approach allows us to deal with unannotated objects in the FSC-147 dataset. In Table 4, we analyze the contribution of these components on the overall performance. The RepRPN baseline uses the repetition score as the final count. We propose to use RepRPN with DPN, but one can replace RepRPN by RPN [33] to get the method RPN+DPN. One can assume there are no unannotated objects in the FSC-147 dataset, and train our proposed RepRPN-Counter on FSC-147 without any knowledge transfer. As can be seen from Table 4, all the components of RepRPN-Counter are useful, and the best results are obtained when all the components are present. RPN+DPN performs much worse than RepRPN+DPN, which shows that RepRPN is better suited for our counting task than RPN.



Table 4. Analyzing individual components of RepRPN-Counter on the overall performance on the test set of FSC-147. RPN + DPN refers to the case where we replace RepRPN from our proposed approach with the RPN from Faster RCNN. As can be seen, RepRPN is a critical component of our proposed approach, and replacing it with RPN decreases the performance significantly. RepRPN+DPN-NoKT refers to the method when we do not use any knowledge transfer, which leads to a drop in performance. This shows the usefulness of the proposed knowledge transfer strategy.

| Method | MAE Top1/Top3/Top5 | RMSE Top1/Top3/Top5 |
| RepRPN+DPN (proposed) | 28.32 / 27.45 / 26.66 | 128.76 / 129.69 / 129.11 |
| RepRPN+DPN-NoKT (no knowledge transfer) | 29.52 / 28.80 / 28.42 | 132.76 / 131.03 / 130.82 |
| RPN+DPN | 35.81 / 33.32 / 32.27 | 133.57 / 132.52 / 131.46 |
| RepRPN (without DPN) | 29.60 / 29.18 / 28.95 | 136.25 / 136.21 / 136.26 |

[Fig. 4 panel annotations: GT Count 298 — Rep: 175, DPN: 331; GT Count 83 — Rep: 107, DPN: 83; GT Count 55 — Rep: 69, DPN: 61; GT Count 206 — Rep: 31, DPN: 10]

Fig. 4. Input images and the density maps predicted by RepRPN-Counter. Also shown in red are the selected proposals. Rep is the repetition score predicted by RepRPN, while DPN is the count obtained by summing the final density map.

4.7 Qualitative Results

In Fig. 4, we present a few input images, the proposal with the highest repetition score generated by RepRPN for each image, and the corresponding density map generated by the density prediction network. RepRPN-Counter performs well on the first three test cases. But it fails on the last one, because the aspect ratio of the chosen proposal is very different from the majority of the objects of interest. In Fig. 5, we show the RepRPN proposal with the highest repetition score for several images from the Val and Test set of FSC-147. First three examples in each row contain test cases where the repetition score is close to the groundtruth count. The last example in each row shows test case which proved to be harder



[Fig. 5 panel annotations: GT: 211, Rep: 223; GT: 120, Rep: 99; GT: 51, Rep: 51; GT: 1092, Rep: 348; GT: 108, Rep: 110; GT: 402, Rep: 492; GT: 161, Rep: 127; GT: 1228, Rep: 263]

Fig. 5. Selected proposal (shown in red) and corresponding repetition score (Rep) predicted by RepRPN (Color figure online). The first three examples in each row are success cases for RepRPN, and the predicted repetition score is close to the ground truth count. The last example in each row shows a failure case.

[Fig. 6 panel annotations: C_red: 12, C_blue: 134; C_red: 38, C_blue: 7; C_red: 10, C_blue: 5; C_red: 14, C_blue: 9]

Fig. 6. Counting multiple classes in an image using RepRPN-Counter. Shown are RepRPN proposals from two of the most frequent classes in an image, and the corresponding counts predicted by RepRPN-Counter. C_red/C_blue is the predicted count for the proposal shown in red/blue. (Color figure online)

for RepRPN. RepRPN does not perform well in some cases where the objects are extremely small in size. Since RepRPN, and RPN in general, use a fixed set of anchor sizes and aspect ratios, they may fail to detect extremely small objects. It is also difficult for RepRPN to handle extreme variation in scale within the image, as evident from the failure cases. RepRPN-Counter can also be used for the multi-class class-agnostic counting task, i.e., counting multiple object classes in an image. In Fig. 6, we show a few images from the Val and Test sets of FSC-147 having at least two object classes, and the counts predicted by RepRPN-Counter for the two most frequent classes in each image. RepRPN-Counter provides a reasonable count estimate for both classes.


5 Conclusions

In this paper, we tackled the task of Exemplar Free Class Agnostic Counting. We proposed RepRPN-Counter, the first exemplar-free class-agnostic counter capable of handling previously unseen categories at test time. Our two-stage counter consists of a novel region proposal network for finding exemplars from repetitive object classes, and a density estimation network that estimates the density map corresponding to each exemplar. We also showed that our region proposal network can significantly improve the performance of previous state-of-the-art class-agnostic visual counters.

References 1. Abousamra, S., Hoai, M., Samaras, D., Chen, C.: Localization in the crowd with topological constraints. In: AAAI (2021) 2. Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Detecting overlapping instances in microscopy images using extremal region trees. Med. Image Anal. 27, 3–16 (2016) 3. Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 483–498. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_30 4. Babu Sam, D., Sajjan, N.N., Venkatesh Babu, R., Srinivasan, M.: Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN. In: CVPR (2018) 5. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 397–414. Springer, Cham (2018). https://doi.org/10. 1007/978-3-030-01246-5_24 6. Cao, X., Wang, Z., Zhao, Y., Su, F.: Scale aggregation network for accurate and efficient crowd counting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 757–773. Springer, Cham (2018). https:// doi.org/10.1007/978-3-030-01228-1_45 7. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13 8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the34thInternational Conference on MachineLearning, Sydney, Australia, PMLR 70, 2017 (2017) 9. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017) 10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 11. Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely dense crowd images. In: CVPR (2013) 12. Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N., Shah, M.: Composition loss for counting, density map estimation and localization in dense crowds. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 544–559. Springer, Cham (2018). https://doi.org/10.1007/ 978-3-030-01216-8_33



13. Khan, A., Gould, S., Salzmann, M.: Deep convolutional neural networks for human embryonic cell counting. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 339–348. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-466040_25 14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 15. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop (2015) 16. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015) 17. Li, Y., Zhang, X., Chen, D.: CsrNet: dilated convolutional neural networks for understanding the highly congested scenes. In: CVPR (2018) 18. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017) 19. Lin, T., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48 20. Liu, W., Salzmann, M., Fua, P.: Context-aware crowd counting. In: CVPR (2019) 21. Liu, X., van de Weijer, J., Bagdanov, A.D.: Leveraging unlabeled data for crowd counting by learning to rank. In: CVPR (2018) 22. Liu, X., Van De Weijer, J., Bagdanov, A.D.: Leveraging unlabeled data for crowd counting by learning to rank. In: CVPR (2018) 23. Liu, Y., Liu, L., Wang, P., Zhang, P., Lei, Y.: Semi-supervised crowd counting via self-training on surrogate tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 242–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_15 24. Lu, E., Xie, W., Zisserman, A.: Class-agnostic counting. In: ACCV (2018) 25. Ma, Z., Wei, X., Hong, X., Gong, Y.: Bayesian loss for crowd count estimation with point supervision. In: ICCV (2019) 26. Mundhenk, T.N., Konjevod, G., Sakla, W.A., Boakye, K.: A Large contextual dataset for classification, detection and counting of cars with deep learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 785–800. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_48 27. Rahman, S., Khan, S., Porikli, F.: Zero-Shot object detection: learning to simultaneously recognize and localize novel concepts. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11361, pp. 547–563. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20887-5_34 28. Ranjan, V., Le, H., Hoai, M.: Iterative crowd counting. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 278–293. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_17 29. Ranjan, V., Shah, M., Nguyen, M.H.: Crowd transformer network. arXiv preprint arXiv:1904.02774 (2019) 30. Ranjan, V., Sharma, U., Nguyen, T., Hoai, M.: Learning to count everything. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3394–3403 (2021) 31. Ranjan, V., Wang, B., Shah, M., Hoai, M.: Uncertainty estimation and sample selection for crowd counting. In: ACCV (2020) 32. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning (2016) 33. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)



34. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NeurIPS (2015) 35. Sam, D.B., Surya, S., Babu, R.V.: Switching convolutional neural network for crowd counting. In: CVPR (2017) 36. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: One-shot learning with memory-augmented neural networks (2016) 37. Shi, M., Yang, Z., Xu, C., Chen, Q.: Revisiting perspective information for efficient crowd counting. In: CVPR (2019) 38. Sindagi, V.A., Yasarla, R., Patel, V.M.: JHU-crowd++: Large-scale crowd counting dataset and a benchmark method. arXiv preprint arXiv:2004.03597 (2020) 39. Song, Q., et al.: Rethinking counting and localization in crowds: a purely pointbased framework. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3365–3374 (2021) 40. Vaswani, A., et al.: Attention is all you need. In: Advances in neural information processing systems, pp. 5998–6008 (2017) 41. Wan, J., Chan, A.: Adaptive density map generation for crowd counting. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1130–1139 (2019) 42. Wan, J., Liu, Z., Chan, A.B.: A generalized loss function for crowd counting and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1974–1983 (2021) 43. Wang, C., et al.: Uniformity in heterogeneity: Diving deep into count interval partition for crowd counting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3234–3242 (2021) 44. Wang, Q., Gao, J., Lin, W., Li, X.: Nwpu-crowd: A large-scale benchmark for crowd counting. arXiv preprint arXiv:2001.03360 (2020) 45. Wang, Q., Gao, J., Lin, W., Yuan, Y.: Learning from synthetic data for crowd counting in the wild. In: CVPR (2019) 46. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019) 47. Xie, W., Noble, J.A., Zisserman, A.: Microscopy cell counting and detection with fully convolutional regression networks. Comput. Methods Biomech. Biomed. Eng. Imaging Visual. 6(3), 283–292 (2018) 48. Zhang, A., Yue, L., Shen, J., Zhu, F., Zhen, X., Cao, X., Shao, L.: Attentional neural fields for crowd counting. In: ICCV (2019) 49. Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: CVPR (2016) 50. Zhu, P., Wang, H., Saligrama, V.: Zero shot detection. IEEE Trans. Circuits Syst. Video Technol. 30(4), 998–1010 (2019)

Emphasizing Closeness and Diversity Simultaneously for Deep Face Representation

Chaoyu Zhao1, Jianjun Qian1(B), Shumin Zhu2, Jin Xie1, and Jian Yang1

1

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China {cyzhao,csjqian}@njust.edu.cn 2 AiDLab, Laboratory for Artificial Intelligence in Design, School of Fashion and Textiles, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Abstract. Recent years have witnessed remarkable progress in deep face recognition due to the advancement of softmax-based methods. In this work, we first provide the analysis to reveal the working mechanism of softmax-based methods from the geometry view. Margin-based softmax methods enhance the feature discrimination by the extra margin. Miningbased softmax methods pay more attention to hard samples and try to enlarge their diversity during training. Both closeness and diversity are essential for discriminative features learning; however, we observe that most previous works dealing with hard samples fail to balance the relationship between closeness and diversity. Therefore, we propose a novel approach to tackle the above issue. Specifically, we design a two-branch cooperative network: the Elementary Representation Branch (ERB) and the Refined Representation Branch (RRB). ERB employs the marginbased softmax to guide the network to learn elementary features and measure the difficulty of training samples. RRB employs the proposed sampling strategy in conjunction with two loss terms to enhance closeness and diversity simultaneously. Extensive experimental results on popular benchmarks demonstrate the superiority of our proposed method over state-of-the-art methods. Keywords: Deep face representation · Closeness and diversity Difficulty measure · Two-branch cooperative network

1 Introduction

(Supplementary Information: the online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_6.)

Deep face recognition (FR) has witnessed tremendous progress during recent years, mainly attributed to the growing scale of publicly available datasets, the



development of convolutional neural network architectures, and the improvement of loss functions. In 2014, DeepFace [30] closely reached the human-level performance in unconstrained face recognition based on a nine-layer deep neural network. Subsequently, several successful FR systems such as DeepID2 [26], DeepID3 [27], VGGFace [20], and FaceNet [23] demonstrate that well-designed deep architectures can obtain promising performance. Besides, the major advance comes from the evolution of loss functions for training deep convolutional neural networks. Early FR works often adopt methods based on metric learning, such as Contrastive Loss [3] and Triplet Loss [23]. However, most of them suffer from the combinatorial explosion, especially on large-scale datasets. Current deep FR approaches are typically based on marginbased softmax loss functions, such as L-Softmax [16], SphereFace [15], CosFace [32], AM-Softmax [31], and ArcFace [4]. These methods share a common idea of introducing a margin penalty between different classes to encourage feature discrimination. Subsequently, mining-based methods such as MV-Softmax [34] and CurricularFace [10] demonstrate that margin-based softmax methods fail to make good use of hard samples. They introduce the hard sample mining strategy and enlarge the distance between a misclassified sample and its negative class centers. By contrast, MagFace [18] learns the well-structured within-class feature distribution by loosening the margin constraint for uncertain samples. As is well analyzed in several works [8,29], softmax aims to optimize (sn − sp ) to achieve the decision boundary sn − sp = −m (m is the margin), where sp is the intra-class similarity and sn is the inter-class similarity. Moreover, the optimization of sp and sn are highly coupled. When mining-based methods enlarge the optimization strength of sn for hard samples, their sp will also get an extra tendency to be maximized (see Sect. 3.2). This naturally poses a problem: mining-based methods can indirectly enhance the within-class closeness while ignoring that hard samples usually contain much uncertainty and thus should lie on the edge of the intra-class distribution as is claimed in [18]. Consequently, the intra-class distribution structure in high dimensional space will be vulnerable, and the model will tend to overfit on noisy samples. Based on the observation above, this paper first analyzes the working mechanism of softmax-based methods from the view of closeness and diversity in geometry space. Then we introduce the embedding feature constraint to enhance the closeness sp for easy samples to establish robust class centers quickly, and meanwhile increase the diversity for hard samples to shift the optimization emphasis from their sp to sn and thus prevent overfitting. To summarize, our key contributions are as follows: – We analyze the working mechanism of softmax-based methods from the geometry view and claim that both closeness and diversity should be simultaneously emphasized for discriminative feature learning. – We propose a two-branch cooperative network to simultaneously learn closeness and diversity so that the model can improve both discrimination and generalization ability.



– We conduct extensive experiments on several publicly available benchmarks. Experimental results demonstrate the superiority of our proposed method over state-of-the-art methods.

2 Related Work

2.1 Margin-Based Methods

In deep face recognition, softmax is widely applied to supervise the network to promote the separability of features. However, it exceeds softmax's ability when facing challenging tasks where intra-class variations get larger. Several margin-based methods [4,15,16,31,32] have therefore been proposed. L-Softmax [16] first introduces a margin penalty into the traditional softmax. SphereFace [15] further normalizes the weight vectors by l2-normalization to learn face representations on a hypersphere. Subsequently, CosFace [32], AM-Softmax [31], and ArcFace [4] introduce an additive margin penalty in the cosine/angle space to further improve the discriminative power of the learned face representations. AdaCos [41] employs an adaptive scale parameter to promote the training supervision in dealing with various facial samples. MagFace [18] improves the performance of previous margin-based methods by keeping ambiguous samples away from class centers.

2.2 Mining-Based Methods

The hard sample mining strategy is also a critical step to enhance the feature representation ability [1,25]. OHEM [25] automatically indicates and emphasizes hard samples within a mini-batch according to their loss values. Focal Loss [13] reduces the weight of easy samples during training by introducing a re-weighting factor into the standard cross-entropy loss. MV-Softmax [34] emphasizes hard samples to guide the network to learn discriminative features by introducing an extra margin penalty when a sample is misclassified. CurricularFace [10] employs the Curriculum Learning (CL) strategy to focus on easy samples in the early training stage and concentrate on hard ones later.

2.3 Contrastive Learning

Traditional contrastive loss functions [6,35] are generally found in early FR works. However, most of them suffer from the combinatorial explosion when dealing with large-scale datasets [23,26,28]. Center Loss [36] proposes a joint supervision signal based on softmax to penalize the distances between the samples and their corresponding class centers. Range Loss [40] is proposed to address the long-tail problem by reducing within-class variances and enlarging inter-class differences in each mini-batch. Modern contrastive approaches [2,7,33] show promising results in the field of unsupervised representation learning. SimCLR [2] learns deep representations by minimizing the distance between multiple augmented views of the same image in the latent space. MoCo [7] proposes a dynamic



Fig. 1. The comparison between softmax and its variants from the geometry view (left) and the distribution view (right). The explanations of the components are listed under the plot. The distribution view is obtained by training a ResNet18 and evaluated on IJB-B. Closeness is higher when the boundary value (‘b’ in the figure) is larger. The boundary value asides the top 80% of positive scores on its right to indicate closeness. The diversity is higher when the mean of the red distribution is closer to zero and the variance is smaller. (Color figure online)

dictionary and a moving-averaged encoder to learn visual representations. Wang et al. [33] analyze alignment and uniformity on the feature hypersphere to guide the unsupervised representation learning. Supervised contrastive learning [12] significantly outperforms the traditional contrastive approaches by incorporating label information to construct genuine and imposter pairs.

3 Preliminary

3.1 Understanding Softmax-Based Methods

The softmax cross-entropy loss function (denoted as 'softmax' for short) can be formulated as follows:

L_{CE} = -\log \frac{e^{f_y}}{e^{f_y} + \sum_{k=1, k \neq y}^{C} e^{f_k}}    (1)

The development of softmax variants in FR is mainly attributed to the advancement of the positive logit fy and the negative logit fk . Let x, wi , and bi denote the feature vector of the input sample, the i-th class center, and the bias term, respectively. Then the traditional logit is calculated



as fi = wTi x + bi . It is a common practice to ignore the bias term bi in deep FR works [4,15,22,32]. The logit can be transformed as wTi x = wi  x cos θi , where θi is the angle distance between the feature x and the class center wi . Several works [4,32] further fix wi  = x = 1 and scale x to s. Then the logit can be reformulated as fi = s cos θi . As well-discussed in several works [8,29,38], the softmax’s constraint can be decoupled into the pulling force from the same class center and the pushing forces from the negative ones. Softmax can thus increase intra-class similarity sp with the pulling force while reducing the inter-class similarity sn with the pushing forces. In addition, if the pulling force is enlarged, then the pushing forces will be amplified as well, and vice versa. To clarify this phenomenon, we provide a toy example in Sect. 3.2. Original softmax simply optimizes (sn −sp ) to the decision boundary sn −sp = 0. However, the discriminative ability of facial features learned by the original softmax is limited. Therefore, several works introduce various margin penalties into softmax to improve feature discrimination. Generally speaking, they employ the decision boundary sn − sp = −m, and m is called margin. Compared with original softmax, the positive logit is reformulated as fy = s cos(θy + m) in ArcFace [4]. As shown on the left side in Fig. 1(b), ArcFace shrinks the decision boundary towards the positive direction with a constant m. In this way, it can learn more discriminative face representations, as shown on the right side in Fig. 1(b). In MV-Softmax, the misclassified sample’s negative logit is further reformulated as fk = s(t cos θk + t − 1), where t > 1. Based on ArcFace, MV-Softmax makes the decision boundary more rigorous by introducing an extra margin penalty on the negative logit for handling hard samples, as shown on the left side in Fig. 1(c). The right of Fig. 1(c) shows that (1) the negative distribution of MV-Softmax is more compact than ArcFace, which illustrates that MV-Softmax improves the diversity with the extra margin; (2) MV-Softmax has an inferior boundary value, indicating its insufficient closeness. A possible explanation is that MV-Softmax intends to improve the diversity for misclassified samples by the extra margin, however, it will indirectly enforce hard samples containing large uncertainty and noise to get closer to their positive class centers (see analysis in Sect. 3.2), leading to the overfitting and the inferior generalization ability. Besides, another line of works exists, e.g., AdaCos [41], AdaptiveFace [14], and MagFace [18]. They substitute the constant margin penalty with the adaptive one to generate more effective supervision during training. MagFace learns wellstructured intra-class features by dynamically adjusting the decision boundaries based on the feature magnitude. As shown on the left side in Fig. 1(d), MagFace relaxes the decision boundary for hard samples with large uncertainty and tightens the decision boundary for easy ones with high quality. The right side of Fig. 1(d) shows that the negative distribution in MagFace is not so compact as that in ArcFace. A probable reason is that MagFace prevents hard samples from obtaining excessive sp during training by reducing the margin. Considering that



the optimization of s_n and s_p is highly coupled in softmax, a suitable s_n is not well learned. In summary, MV-Softmax and MagFace adopt different strategies to deal with hard samples. MV-Softmax enhances the constraint strength for hard samples. Because MV-Softmax indirectly emphasizes s_p for uncertain samples, it fails to generalize well on challenging tasks during testing. By contrast, MagFace relaxes the constraint for hard samples. Although MagFace can prevent uncertain samples from obtaining an excessive s_p, it cannot ensure a desirable s_n for discriminative feature learning. Therefore, both of the above methods fail to emphasize closeness and diversity simultaneously.
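To make the logit modifications discussed above concrete, the snippet below evaluates the logits used by the different formulations quoted in this section (original softmax, the ArcFace positive logit, and the MV-Softmax negative logit for misclassified samples). The scale s, margin m, and t values are illustrative only, not the settings used by those methods' authors.

```python
import numpy as np

s, m, t = 64.0, 0.5, 1.2  # illustrative scale, margin, and MV-Softmax t > 1

def softmax_logit(cos_theta):
    # original softmax with normalized features: f_i = s * cos(theta_i)
    return s * cos_theta

def arcface_positive_logit(cos_theta_y):
    # ArcFace: f_y = s * cos(theta_y + m)
    theta_y = np.arccos(np.clip(cos_theta_y, -1.0, 1.0))
    return s * np.cos(theta_y + m)

def mv_softmax_negative_logit(cos_theta_k):
    # MV-Softmax (misclassified samples only): f_k = s * (t * cos(theta_k) + t - 1)
    return s * (t * cos_theta_k + t - 1.0)
```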

3.2 Derivative Analysis

In this subsection, we investigate how the margin penalty affects the pulling force and the pushing forces of softmax. Specifically, we demonstrate that if the pulling force is enlarged, the pushing forces will also get increased, and vice versa. A toy example is additionally provided to explain this phenomenon. Let us start with the gradient of softmax with respect to the logit f_i, which is calculated as:

\frac{\partial L_{CE}}{\partial f_i} = \underbrace{1(i = y) \cdot (p_i - 1)}_{pull} + \underbrace{1(i \neq y) \cdot p_i}_{push}    (2)

where p_i = \frac{e^{f_i}}{\sum_{k=1}^{C} e^{f_k}} is the predicted probability of the i-th class, and satisfies \sum_{i=1}^{C} p_i = 1. Equation 2 contains two parts: (1) the first part aims to pull a sample towards its positive class center; (2) the second part aims to push a sample away from its negative class centers. The difference between the two parts lies in that the pulling force can quickly establish the class center; however, it can hurt the generalization ability by pulling a noisy sample near its class center. The pushing forces can help to enhance the discrimination ability by encouraging diversity, but they cannot be directly employed to establish the class centers. Additionally, the pulling and pushing forces are equipped with opposite signs due to their different relative directions. The gradient summation with respect to each class always equals the constant zero:

\sum_{i=1}^{C} \frac{\partial L_{CE}}{\partial f_i} = \underbrace{p_y - 1}_{pull} + \underbrace{\sum_{i=1, i \neq y}^{C} p_i}_{push} = \sum_{i=1}^{C} p_i - 1 = 0    (3)

Therefore, if we enlarge either of the two forces by a margin, the other one will inevitably get increased at the same time. Figure 2 exhibits a simplified example to depict the above issue. We assume the feature vector x and the classifier W are identical among all four cases. Therefore, we only need to care about the relative changes of the logits and


Fig. 2. The comparison between softmax and its variants on a toy example. The blue bars denote the magnitude of gradients generated by predictions. (Color figure online)

the gradients. In Fig. 2(a), we assume the positive logit $f_y$ takes two units and the negative logits $f_{k_1}$ and $f_{k_2}$ take one unit. The classifier $W$ makes the right classification during training with softmax and thus generates limited gradients. Figure 2(b) shows that the positive logit $f_y$ gets smaller in ArcFace with the margin penalty. The pulling force is directly enhanced, and the pushing forces are indirectly enlarged according to Eq. 3. Based on ArcFace, MV-Softmax introduces a margin penalty on the negative logits $f_{k_1}$ and $f_{k_2}$ for misclassified samples. The margin penalty directly enlarges $f_{k_1}$ and $f_{k_2}$, leading to enhanced pushing forces. Although MV-Softmax intrinsically enhances the diversity for hard samples, it also indirectly magnifies the pulling force, as shown in Fig. 2(c). By contrast, MagFace adopts a smaller penalty on $f_y$ than ArcFace, so both kinds of forces become smaller, as shown in Fig. 2(d). Therefore, the pulling and pushing forces are highly coupled in softmax-based loss functions. Due to this phenomenon, softmax-based methods dealing with hard samples tend to overfit or become under-discriminative, as discussed in Sect. 3.1.
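The coupling of the two forces in Eqs. (2)-(3) can be checked numerically. The following is a minimal sketch (not the authors' code) that computes the cross-entropy gradient with respect to the logits for the toy example of Fig. 2 and verifies that the pulling and pushing terms always sum to zero, and that shrinking the positive logit (an ArcFace-style margin) enlarges both.

```python
import torch
import torch.nn.functional as F

def ce_logit_gradients(logit_values, y):
    """Return dL_CE/df_i for a single sample with target class y (equals p_i - 1(i == y))."""
    logits = torch.tensor(logit_values, dtype=torch.float32, requires_grad=True)
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([y]))
    loss.backward()
    return logits.grad

# Toy example of Fig. 2(a): positive logit takes two units, negatives take one unit each.
grad_softmax = ce_logit_gradients([2.0, 1.0, 1.0], y=0)

# ArcFace-style margin shrinks the positive logit, so both |pull| and |push| grow.
grad_margin = ce_logit_gradients([2.0 - 0.5, 1.0, 1.0], y=0)

print(grad_softmax, grad_softmax.sum())  # gradients sum to ~0 (Eq. 3)
print(grad_margin, grad_margin.sum())    # larger magnitudes, still sums to ~0
```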

4 Proposed Method

Based on the analysis in Sect. 3, we propose a two-branch cooperative framework to enhance closeness and diversity simultaneously. The proposed framework contains three parts: (1) the Hard Sample Mining Scheme for dynamically computing difficulty scores of training samples; (2) the Elementary Representation Branch (ERB) for learning initial face representations; (3) the Refined Representation Branch (RRB) for simultaneous closeness and diversity learning.¹

¹ Code is available at: https://github.com/Zacharynjust/FR-closeness-and-diversity.


Fig. 3. The overview of the proposed framework. Overall, it contains three parts: (1) the Hard Sample Mining Scheme to maintain difficulty scores; (2) the Elementary Representation Branch (ERB) to learn basic face representations; (3) the Refined Representation Branch (RRB) to learn closeness and diversity.

4.1 Hard Sample Mining

Hard samples play an important role in guiding DCNNs to learn discriminative features. Previous works [10,34] regard misclassified samples as hard ones, but they cannot measure the hardness quantitatively. MagFace [18] employs the feature magnitude to determine the difficulty degree, but it lacks intuitive interpretability. Different from the above works, we employ the cosine similarity to characterize the difficulty degree for its simplicity and effectiveness (Fig. 3). To smooth out short-term fluctuations caused by the sampling sequence, we calculate the moving average of the cosine similarity to characterize the difficulty degree for each sample:

$$d^{(t)} = \alpha d^{(t-1)} + (1-\alpha)\cos\theta^{(t)} \tag{4}$$

where $d$ represents the moving-averaged similarity (a smaller $d$ indicates a higher difficulty degree), $t$ stands for the $t$-th iteration, $\theta$ is the angle between a feature vector and its positive class center, and $\alpha$ is the weight factor. Based on Eq. 4, we propose two schemes to indicate hard samples: the Hard Mining Scheme (HMS) and the Soft Mining Scheme (SMS). HMS explicitly divides all training samples into easy/hard groups. SMS does not need the explicit division; instead, it adopts the difficulty degree to balance the weights of closeness and diversity for each sample during training. In this way, easy samples can help the network establish robust class centers quickly, while hard samples are beneficial for the network to further improve feature discrimination.
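A minimal sketch of how the per-sample difficulty of Eq. (4) could be maintained is shown below. This is our reading of the scheme, not the released code; the class and method names, the initialization of the scores, and the quantile-based split are assumptions.

```python
import torch

class DifficultyTracker:
    def __init__(self, num_samples: int, alpha: float = 0.9):
        self.alpha = alpha
        self.d = torch.ones(num_samples)  # smaller d = harder sample (init is an assumption)

    @torch.no_grad()
    def update(self, indices, features, centers, labels):
        # cos(theta) between each l2-normalized feature and its positive class center
        feats = torch.nn.functional.normalize(features, dim=1)
        w_pos = torch.nn.functional.normalize(centers[labels], dim=1)
        cos_theta = (feats * w_pos).sum(dim=1)
        # moving average of Eq. (4)
        self.d[indices] = self.alpha * self.d[indices] + (1 - self.alpha) * cos_theta

    def hard_mask(self, ratio: float = 0.2):
        # HMS: mark the lowest-d fraction of the dataset as the hard group
        threshold = torch.quantile(self.d, ratio)
        return self.d <= threshold
```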

4.2 Loss Design

Both closeness and diversity are indispensable for achieving better results; therefore, they should be simultaneously and properly emphasized for specific samples. Overall, we need to ensure the following conditions: (1) Easy samples should get enough closeness to ensure robust class centers [18]; (2) Hard samples should


keep a suitable distance from their positive class centers to ensure generalization ability [18]; (3) Hard samples should also gain enough distance away from their negative class centers to ensure discrimination ability [34]. Equipped with the above three conditions, we design the following two loss functions in RRB to emphasize closeness and diversity simultaneously during training.

Loss for Closeness. To enhance the closeness, we directly minimize the feature differences of within-class samples. The loss term with HMS can be formulated as follows:

$$L^H_{closeness} = \mathbb{E}_{(x,y)\sim P_{pos}} \left\| g(x) - g(y) \right\|^2 \tag{5}$$

where $P_{pos}$ stands for the distribution of positive pairs constructed from the mini-batch, and $g(\cdot)$ is the feature encoder with $l_2$-normalization in its output layer. The loss term with SMS is the re-weighted version of Eq. 5:

$$L^S_{closeness} = \mathbb{E}_{(x,y)\sim P_{pos}} \; \phi(d_x, d_y) \left\| g(x) - g(y) \right\|^2 \tag{6}$$

where $d_x$ and $d_y$ are the difficulty degrees of $x$ and $y$, and $\phi(\cdot)$ is a monotonically non-decreasing function of $d_x$ and $d_y$.

Loss for Diversity. In this part, we design the diversity loss to enhance the discrimination and generalization ability. Inspired by the uniform loss format used in [33], we design a loss for enhancing diversity which enlarges the distance between a sample and its negative class centers. The proposed diversity loss for a single sample can be formulated as follows:

$$L^i_{diversity} = \log \mathbb{E}_{w_j \sim \mathcal{W}^{(i)}_{sub}} \; e^{\max(0,\,\mathrm{sgn}(\cos\theta_j)) \cdot s\,(\cos\theta_j)^2} \tag{7}$$

where $\mathcal{W}^{(i)}_{sub}$ stands for the subset of negative class centers for the $i$-th sample, $s$ is the scale parameter, $\mathrm{sgn}(\cdot)$ is the sign function, and $\max(0, \mathrm{sgn}(\cos\theta_j))$ is used to truncate the gradient when $\cos\theta_j$ is smaller than 0. $\cos\theta_j = w_j^T x_i$ is the similarity between a sample and its negative class $j$. For the hard samples in a mini-batch, we unite all their subsets of negative class centers and then exclude their positive labels to construct the final negative class centers $\mathcal{W}$ for the mini-batch, which can be formulated as follows:

$$\mathcal{W} = \bigcup_{i=1}^{N} \mathcal{W}^{(i)}_{sub} - \overline{\mathcal{W}} \tag{8}$$

where $N$ stands for the total number of hard samples within a mini-batch and $\overline{\mathcal{W}}$ represents the positive class centers of the hard samples in the mini-batch. Then the


diversity loss function within a mini-batch using HMS can be further calculated as follows:

$$L^H_{diversity} = \mathbb{E}_{x\sim X} \log \mathbb{E}_{w_j \sim \mathcal{W}} \; e^{\max(0,\,\mathrm{sgn}(\cos\theta_j)) \cdot s\,(\cos\theta_j)^2} \tag{9}$$

where $X$ stands for the sample set in a mini-batch and $\mathcal{W}$ represents the final negative class centers for the mini-batch. The proposed diversity loss in conjunction with SMS is formulated as follows:

$$L^S_{diversity} = \mathbb{E}_{x\sim X} \; \psi(d_x) \log \mathbb{E}_{w_j \sim \mathcal{W}} \; e^{\max(0,\,\mathrm{sgn}(\cos\theta_j)) \cdot s\,(\cos\theta_j)^2} \tag{10}$$

where $d_x$ is the difficulty degree of $x$ and $\psi(\cdot)$ is a monotonically non-increasing function of $d_x$. Because the proposed diversity loss minimizes the cosine similarity between a sample feature and its negative class centers, it can directly push hard samples away from their negative class centers. Margin-based softmax seeks to minimize $(s_n - s_p)$ to achieve the decision boundary $s_n - s_p = -m$; when we reduce $s_n$ via the diversity loss, we shift the optimization emphasis of hard samples towards reducing $s_n$. Therefore, the diversity loss can, to some extent, prevent hard samples from achieving an excessive $s_p$.
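The two RRB losses of Eqs. (5)-(6) and (9)-(10) could be implemented roughly as follows. This is a hedged sketch under our reading of the formulas, not the released code; the function names and the way the negative-center pool is passed in are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def closeness_loss(feat_a, feat_b, d_a=None, d_b=None, phi=None):
    """Eqs. (5)/(6): squared distance between l2-normalized features of positive pairs."""
    fa = F.normalize(feat_a, dim=1)
    fb = F.normalize(feat_b, dim=1)
    sq_dist = (fa - fb).pow(2).sum(dim=1)       # ||g(x) - g(y)||^2
    if phi is not None:                         # SMS: re-weight by phi(d_x, d_y)
        sq_dist = phi(d_a, d_b) * sq_dist
    return sq_dist.mean()                       # expectation over P_pos

def diversity_loss(features, neg_centers, scale=64.0, psi_weights=None):
    """Eqs. (9)/(10): penalize positive cosine similarities to pooled negative class centers."""
    f = F.normalize(features, dim=1)            # (N, D) hard-sample features
    w = F.normalize(neg_centers, dim=1)         # (K, D) pooled negative centers
    cos = f @ w.t()                             # cos(theta_j), shape (N, K)
    gate = (cos > 0).float()                    # max(0, sgn(cos(theta_j)))
    exponents = gate * scale * cos.pow(2)       # gated s * cos(theta_j)^2
    # log E_j exp(...) = logsumexp - log K
    loss_per_sample = torch.logsumexp(exponents, dim=1) - math.log(neg_centers.shape[0])
    if psi_weights is not None:                 # SMS: weight each sample by psi(d_x)
        loss_per_sample = psi_weights * loss_per_sample
    return loss_per_sample.mean()
```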

4.3 Sampling Strategy

In this subsection, we introduce the proposed sampling strategy for RRB in detail. We construct a mini-batch depending on the class labels and difficulty degrees. The sampling strategy can be formulated as follows:

$$X = S(y, d) \tag{11}$$

where $y$ denotes the ground truth labels and $d$ stands for the difficulty degrees of the whole dataset as introduced in Sect. 4.1. For HMS, we further select a specific number of easy samples $X_e$ and hard samples $X_h$ according to the divided easy/hard groups within each class. Then, the total loss can be formulated as follows:

$$L^H_{total}(X) = L_{CE}(X) + \lambda_1 L^H_{closeness}(X_e) + \lambda_2 L^H_{diversity}(X_h) \tag{12}$$

where $L_{CE}$ is ArcFace in our model (other margin-based softmax loss functions can also be chosen), and $\lambda_1$ and $\lambda_2$ are the weight parameters for closeness and diversity, respectively. For SMS, we randomly select $N$ samples within each class and use their difficulty degrees to calculate the weight functions. Then, the total loss can be formulated as follows:

$$L^S_{total}(X, d) = L_{CE}(X) + \lambda_1 L^S_{closeness}(X, d) + \lambda_2 L^S_{diversity}(X, d) \tag{13}$$

where $d$ denotes the difficulty degrees of the samples in a mini-batch.
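The HMS objective of Eq. (12) could be assembled as below, reusing the loss sketches from Sect. 4.2. This is a rough sketch with assumed argument names; the ArcFace term is taken as a precomputed scalar loss over the whole batch.

```python
def total_loss_hms(arcface_loss, feats_easy_a, feats_easy_b,
                   feats_hard, neg_centers, lambda1=0.5, lambda2=1.0):
    """Eq. (12): margin-based softmax + closeness on easy pairs (X_e) + diversity on hard samples (X_h)."""
    l_close = closeness_loss(feats_easy_a, feats_easy_b)   # easy positive pairs
    l_div = diversity_loss(feats_hard, neg_centers)        # hard samples vs. pooled negatives
    return arcface_loss + lambda1 * l_close + lambda2 * l_div
```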


5 Experiment

5.1 Implementation Details

Datasets. We employ MS1MV2 [4] as our training data for a fair comparison with other methods. MS1MV2 is a semi-automatically refined version of MS1M [5], containing about 5.8M images of 85K different identities. For testing, we extensively evaluate our proposed method and the compared methods on several popular benchmarks, including LFW [9], CFP-FP [24], CPLFW [42], AgeDB [19], CALFW [43], IJB-B [37], IJB-C [17], and MegaFace [11].

Experimental Setting. We follow the setting in ArcFace [4] to align the images with five facial key points [39] and normalize the face images to 112 × 112. ResNet100 is used as the backbone network in our model. We implement our framework with PyTorch [21]. The models are trained by stochastic gradient descent. The batch size is set to 512. The weight decay is set to 5e-4, and the momentum is 0.9. To obtain the difficulty degree of samples, the backbone is first trained with ERB for 4 epochs with the learning rate 0.1. The backbone is then trained with the joint supervision of ERB and RRB for an extra 21 epochs. The learning rate is set to 0.1 initially and divided by 10 at the 6th, 12th, and 18th extra epochs. For the sampling strategy, we choose 64 unique classes for a mini-batch. In addition, 500 negative class centers of each hard sample are selected to construct the imposter pairs. For HMS, $\lambda_1$ and $\lambda_2$ are set to 0.5 and 1.0. We divide the top 20% of training samples into the hard group and collect seven easy samples and one hard sample within each class. For SMS, $\lambda_1$ and $\lambda_2$ are set to 0.5 and 2.0. In addition, we conduct experiments on both linear and non-linear weight functions. For the linear functions, we set $\phi(d_x, d_y) = \frac{d_x + d_y}{2}$ and $\psi(d_x) = 1 - d_x$. For the non-linear functions, we employ the sigmoid-like function $\sigma(x; \mu, \gamma) = \frac{1}{1 + e^{-\gamma(x-\mu)}}$ to conduct the non-linear transformation, where we fix $\mu = 0.5$ and $\gamma = 10$, and set $\phi(d_x, d_y) = \sigma(\frac{d_x + d_y}{2})$ and $\psi(d_x) = 1 - \sigma(d_x)$. We set $\alpha$ to 0.9 to calculate the difficulty degree.
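For reference, the linear and sigmoid-like weight functions quoted above could be written as the following small sketch (names are ours; inputs are expected to be tensors of difficulty scores).

```python
import torch

def sigma(x, mu=0.5, gamma=10.0):
    """Sigmoid-like transform with mu = 0.5, gamma = 10 as in the settings above."""
    return 1.0 / (1.0 + torch.exp(-gamma * (x - mu)))

def phi_linear(dx, dy):       # closeness weight, non-decreasing in d
    return (dx + dy) / 2

def psi_linear(dx):           # diversity weight, non-increasing in d
    return 1 - dx

def phi_nonlinear(dx, dy):
    return sigma((dx + dy) / 2)

def psi_nonlinear(dx):
    return 1 - sigma(dx)
```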

5.2 Comparisons with SOTA Methods

Results on Small Benchmarks. In this subsection, we conduct experiments on various benchmarks, including LFW [9] for unconstrained face verification, CFP-FP [24] and CPLFW [42] for cross-pose variations, and AgeDB [19] and CALFW [43] for cross-age variations. The 1:1 verification accuracy of the different methods on the above five benchmarks is listed in Table 1. According to Table 1, the proposed models achieve promising results, especially when they are integrated with HMS or SMS (non-linear). Note that our models give only slight improvements on LFW since the performance on LFW is nearly saturated. Besides, for age/pose-invariant face verification, our proposed method achieves better results than its competitors.


Table 1. Performance comparisons between the proposed method and state-of-the-art methods on various benchmarks (verification accuracy on LFW, CFP-FP, CPLFW, AgeDB, CALFW; IJB-B/IJB-C; MegaFace Id/Ver). * denotes our re-implemented results on ResNet100. [Best, Second Best]

Methods                  | LFW   | CFP-FP | CPLFW | AgeDB | CALFW | IJB-B | IJB-C | MegaFace Id | MegaFace Ver
Focal Loss* (CVPR16)     | 99.73 | 98.19  | 92.80 | 98.13 | 96.01 | 93.60 | 95.19 | 98.09       | 98.60
SphereFace (CVPR17)      | 99.42 | –      | –     | 97.16 | 94.55 | –     | –     | –           | –
CosFace* (CVPR18)        | 99.78 | 98.12  | 92.28 | 98.11 | 95.76 | 94.10 | 95.51 | 98.20       | 98.32
ArcFace* (CVPR19)        | 99.80 | 98.27  | 92.75 | 98.00 | 95.96 | 94.26 | 95.73 | 98.34       | 98.55
MV-Softmax* (AAAI20)     | 99.80 | 98.30  | 92.93 | 97.98 | 96.10 | 94.01 | 95.59 | 98.22       | 98.28
Circle Loss (CVPR20)     | 99.73 | 96.02  | –     | –     | –     | –     | 93.95 | 98.50       | 98.73
CurricularFace* (CVPR20) | 99.82 | 98.30  | 93.05 | 98.32 | 96.05 | 94.75 | 96.04 | 98.65       | 98.70
MagFace* (CVPR21)        | 99.83 | 98.23  | 92.93 | 98.27 | 96.12 | 94.42 | 95.81 | 98.51       | 98.64
Ours, HMS                | 99.83 | 98.44  | 93.05 | 98.20 | 96.05 | 94.86 | 96.25 | 98.60       | 98.75
Ours, SMS, Linear        | 99.82 | 98.27  | 93.03 | 98.18 | 96.12 | 94.72 | 96.03 | 98.58       | 98.73
Ours, SMS, Non-Linear    | 99.82 | 98.40  | 93.12 | 98.37 | 96.15 | 95.02 | 96.35 | 98.72       | 98.84

Fig. 4. ROC of 1:1 verification protocol on IJB-B/C.

Results on IJB-B and IJB-C. In this part, we compare our method with the state-of-the-art methods on IJB. Both IJB-B and IJB-C are challenging benchmarks containing a considerable number of face images clipped from videos. Table 1 lists the comparisons on TAR@FAR = 1e-4 between our models and the compared methods. Without bells and whistles, our models achieve the leading results among all methods and clearly improve the performance on IJB-B/C. Among our three models, SMS with non-linearity achieves the top results on both IJB-B and IJB-C. The reason for the improvements is that both closeness and diversity should be simultaneously emphasized; as analyzed in Sect. 3, both MV-Softmax and MagFace fail to emphasize closeness and diversity in a proper way. In addition, our models outperform the other competitors under most FPR settings, as shown in Fig. 4.


Fig. 5. The rank-1 face identification accuracy with different distractors on MegaFace.

Results on MegaFace. In this subsection, we evaluate the proposed method and the compared methods in terms of identification and verification on MegaFace [11]. In our experiment, MegaFace is used as the gallery set, and FaceScrub is employed as the probe set. The results of the compared methods are listed in Table 1. "Id" refers to the rank-1 face identification accuracy with 1M distractors, and "Ver" refers to the face verification TAR@FPR = 1e-6. Table 1 shows that our models obtain the overall best results on both identification and verification tasks. Specifically, our model with SMS (non-linear) obtains the highest identification/verification results among all methods. In addition, although the performance degrades with an increasing number of distractors, our model achieves overall superiority over the other methods, as shown in Fig. 5.

5.3 Ablation Study

In this part, we conduct experiments under different settings to investigate the effectiveness of the proposed components. Loss Terms. In Table 2, Models 2, 3, 7, and 8 illustrate that it is difficult to obtain promising results if either closeness or diversity is absent. Models 2 and 7 pay more attention to closeness and achieve desirable results on the simple benchmark (e.g., LFW). However, their performance degrades when dealing with the challenging datasets (e.g., IJB-B/C). Models 3 and 8 enforce diversity and achieve better results on IJB-B/C; however, they show no improvement over Model 1 on LFW. Division Thresholds for HMS. Here, we evaluate the proposed hard sample division scheme on several benchmarks. The experimental results of different division percentages for hard samples are listed in Table 2 (Models 4–6). Table 2 shows that our model achieves the best overall performance by taking 20% of the training data as hard samples.


Table 2. Verification comparisons on several benchmarks under different loss terms and mining schemes. We conduct ablation experiments on the MS1MV2 subset containing 10K unique identities with ResNet34. [Best, Second Best]

Model | Closeness | Diversity | Mining scheme | LFW   | CFP-FP | AgeDB | IJB-B | IJB-C
1     |           |           | –             | 99.20 | 90.94  | 94.08 | 82.61 | 86.32
2     | ✓         |           | HMS, T = 20%  | 99.30 | 90.79  | 93.96 | 82.14 | 85.91
3     |           | ✓         | HMS, T = 20%  | 99.20 | 91.68  | 94.46 | 83.64 | 86.82
4     | ✓         | ✓         | HMS, T = 10%  | 99.25 | 91.55  | 94.38 | 83.50 | 86.71
5     | ✓         | ✓         | HMS, T = 20%  | 99.32 | 91.62  | 94.63 | 83.62 | 86.78
6     | ✓         | ✓         | HMS, T = 30%  | 99.26 | 91.65  | 94.60 | 83.53 | 86.87
7     | ✓         |           | SMS, γ = 10   | 99.28 | 90.92  | 94.01 | 82.33 | 86.10
8     |           | ✓         | SMS, γ = 10   | 99.13 | 91.46  | 94.25 | 83.98 | 87.20
9     | ✓         | ✓         | SMS, Linear   | 99.25 | 91.01  | 94.10 | 82.88 | 86.53
10    | ✓         | ✓         | SMS, γ = 5    | 99.27 | 91.55  | 94.23 | 83.80 | 86.95
11    | ✓         | ✓         | SMS, γ = 10   | 99.30 | 91.80  | 94.55 | 83.92 | 87.04
12    | ✓         | ✓         | SMS, γ = 15   | 99.25 | 91.37  | 94.61 | 83.85 | 87.10

Besides, our model can also give competitive results when the hard samples occupy 10% or 30% of the training samples. Non-linearity Magnitudes γ for SMS. Models 9–12 in Table 2 provide the experimental results with different magnitudes of non-linearity. Model 9 achieves limited improvements compared with Model 1, indicating that it is difficult to achieve closeness and diversity using the proposed SMS with a linear function. Models 10–12 demonstrate that a suitable non-linearity obtained by adjusting γ (i.e., 10) is helpful for achieving closeness and diversity. In addition, SMS with a properly chosen γ can achieve better results on the IJB datasets than HMS. A possible reason is that the non-linear function can adaptively adjust the weights of samples based on their difficulty degrees.

6 Conclusion

This paper has proposed a two-branch cooperative network to learn discriminative features based on a geometric understanding of margin-based softmax methods. Softmax-based methods can be decomposed into the pulling force from the corresponding class center and the pushing forces from the negative class centers. Based on this, our model further enlarges the pulling force to enhance closeness and employs the pushing forces to enforce diversity. Extensive experimental results demonstrate the superiority of our proposed method over other competitors. Acknowledgements. This work was supported by the National Science Fund of China under Grant Nos. 61876083, 62176124.


References
1. Chen, B., et al.: Angular visual hardness. In: International Conference on Machine Learning, pp. 1637–1648. PMLR (2020)
2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
3. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 539–546. IEEE (2005)
4. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
5. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 87–102. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_6
6. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 2, pp. 1735–1742. IEEE (2006)
7. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
8. He, L., Wang, Z., Li, Y., Wang, S.: Softmax dissection: towards understanding intra- and inter-class objective for embedding learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10957–10964 (2020)
9. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition (2008)
10. Huang, Y., et al.: CurricularFace: adaptive curriculum learning loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5901–5910 (2020)
11. Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The MegaFace benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882 (2016)
12. Khosla, P., et al.: Supervised contrastive learning. arXiv preprint arXiv:2004.11362 (2020)
13. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
14. Liu, H., Zhu, X., Lei, Z., Li, S.Z.: AdaptiveFace: adaptive margin and sampling for face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11947–11956 (2019)
15. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: SphereFace: deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220 (2017)
16. Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional neural networks. In: ICML, vol. 2, p. 7 (2016)


17. Maze, B., et al.: IARPA Janus benchmark-C: face dataset and protocol. In: 2018 International Conference on Biometrics (ICB), pp. 158–165. IEEE (2018)
18. Meng, Q., Zhao, S., Huang, Z., Zhou, F.: MagFace: a universal representation for face recognition and quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14225–14234 (2021)
19. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.: AgeDB: the first manually collected, in-the-wild age database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59 (2017)
20. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition (2015)
21. Paszke, A., et al.: Automatic differentiation in PyTorch (2017)
22. Ranjan, R., Castillo, C.D., Chellappa, R.: L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507 (2017)
23. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
24. Sengupta, S., Chen, J.C., Castillo, C., Patel, V.M., Chellappa, R., Jacobs, D.W.: Frontal to profile face verification in the wild. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. IEEE (2016)
25. Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
26. Sun, Y.: Deep learning face representation by joint identification-verification. The Chinese University of Hong Kong, Hong Kong (2015)
27. Sun, Y., Liang, D., Wang, X., Tang, X.: DeepID3: face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873 (2015)
28. Sun, Y., Wang, X., Tang, X.: Hybrid deep learning for face verification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1489–1496 (2013)
29. Sun, Y., et al.: Circle loss: a unified perspective of pair similarity optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6398–6407 (2020)
30. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708 (2014)
31. Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Sig. Process. Lett. 25(7), 926–930 (2018)
32. Wang, H., et al.: CosFace: large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274 (2018)
33. Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning, pp. 9929–9939. PMLR (2020)
34. Wang, X., Zhang, S., Wang, S., Fu, T., Shi, H., Mei, T.: Mis-classified vector guided softmax loss for face recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12241–12248 (2020)
35. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10(2), 207–244 (2009)
36. Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_31
37. Whitelam, C., et al.: IARPA Janus benchmark-B face dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 90–98 (2017)
38. Zeng, D., Shi, H., Du, H., Wang, J., Lei, Z., Mei, T.: NPCFace: negative-positive collaborative training for large-scale face recognition. arXiv preprint arXiv:2007.10172 (2020)
39. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Sig. Process. Lett. 23(10), 1499–1503 (2016)
40. Zhang, X., Fang, Z., Wen, Y., Li, Z., Qiao, Y.: Range loss for deep face recognition with long-tailed training data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5409–5418 (2017)
41. Zhang, X., Zhao, R., Qiao, Y., Wang, X., Li, H.: AdaCos: adaptively scaling cosine logits for effectively learning deep face representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10823–10832 (2019)
42. Zheng, T., Deng, W.: Cross-Pose LFW: a database for studying cross-pose face recognition in unconstrained environments. Technical report 5, 7. Beijing University of Posts and Telecommunications (2018)
43. Zheng, T., Deng, W., Hu, J.: Cross-Age LFW: a database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197 (2017)

KinStyle: A Strong Baseline Photorealistic Kinship Face Synthesis with an Optimized StyleGAN Encoder

Li-Chen Cheng1, Shu-Chuan Hsu1, Pin-Hua Lee1(B), Hsiu-Chieh Lee1, Che-Hsien Lin2, Jun-Cheng Chen2, and Chih-Yu Wang2

1 National Taiwan University, Taipei City, China
{b06902128,b06502152,b07303024,b07902030}@ntu.edu.tw
2 Academia Sinica, Taipei City, China
{ypps920080,pullpull,cywang}@citi.sinica.edu.tw

Abstract. High-fidelity kinship face synthesis is a challenging task due to the limited amount of kinship data available for training and low-quality images. In addition, it is also hard to trace the genetic traits between parents and children from those low-quality training images. To address these issues, we leverage the pre-trained state-of-the-art face synthesis model, StyleGAN2, for kinship face synthesis. To handle large age, gender, and other attribute variations between the parents and their children, we conduct a thorough study of its rich latent spaces and different encoder architectures for an optimized encoder design to repurpose StyleGAN2 for kinship face synthesis. The obtained latent representation from our developed encoder pipeline with stage-wise training strikes a better balance of editability and synthesis fidelity for identity preservation and attribute manipulation than other compared approaches. With extensive subjective, quantitative, and qualitative evaluations, the proposed approach consistently achieves better performance in terms of facial attribute heredity and image generation fidelity than other compared state-of-the-art methods. This demonstrates the effectiveness of the proposed approach, which can yield promising and satisfactory kinship face synthesis using only a single and straightforward encoder architecture.

Keywords: Kinship face synthesis · StyleGAN Encoder

This work was supported by the National Science and Technology Council under Grants 108-2628-E-001-003-MY3, 111-2628-E-001-002-MY3, 111-3114-E-194-001, 110-2221-E-001-009-MY2, 110-2634-F-002-051, 111-2221-E-001-002, and the Academia Sinica under Thematic Research Grant AS-TP-110-M07-2.

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_7.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 105–120, 2023. https://doi.org/10.1007/978-3-031-26316-3_7

1 Introduction

With the recent popularity of deep image and face synthesis, kinship face synthesis has received increasing attention in the facial analysis research community. The goal of kinship face synthesis is to render the possible child faces given a pair of parental face images. This facilitates plenty of kinship-related applications, including producing visual effects, delineating the possible facial appearance of a lost child after a long period of time, analyzing the facial traits of a family, etc. However, kinship face synthesis is still a challenging and ongoing research problem as compared with other face synthesis tasks due to a lack of large-scale training data, severe label noise, and poor image quality. In addition, it is also hard to trace the genetic traits between parents and children from those low-quality training images, especially when there is interference caused by facial variations in illumination, pose, and other factors. To synthesize a child face, one can use a single reference parental image or images from both parents. For the former, because the information from the other parent is unavailable, there exist ambiguities in mapping a parental face to its child face. Ertuğrul et al. [1] propose the first work in this category, but the one-versus-one relation fails to capture enough information to yield promising and satisfactory results. For the latter, although the two-versus-one relation between the parents and their child provides a good constraint for better kinship face synthesis, the limited amount of kinship data and data noise still restrict the generative models [2,3] from synthesizing high-fidelity child faces. This also usually results in overfitting or in synthesized faces that lack diversity and are close to the average face when no further regularizations are applied. Lin et al. [4] leverage the pre-trained state-of-the-art face synthesis model upon the FFHQ dataset, StyleGAN2 [5], and train an additional encoder to extract latent representations encoding rich parental appearance information from the face images of the parents for kinship face synthesis, mitigating the training data issues. Their method can effectively utilize the data manifold of the pre-trained StyleGAN2 as a regularization to restrict the possible kinship face distributions and to synthesize meaningful and good child faces. Nevertheless, without considering the issues of attribute data imbalance and feature entanglement, the method still lacks the capability to perform further smooth and effective manipulation over specific facial traits towards the parents, such as face component-wise manipulation over eyes, nose, mouth, etc. Zhang et al. [6] proposed to use multiple encoders, one for each component, to realize component-wise manipulation, but this introduces more training effort and computational cost than other approaches. In general, the pipeline for kinship face synthesis can be divided into three stages: parental feature extraction, parental feature fusion, and face rendering. To address the above issues of kinship face synthesis, we also leverage the pre-trained StyleGAN2 for rendering due to its encoded rich face prior and superior face synthesis capability. However, due to the complex nature of the StyleGAN2 model and the large appearance variations (i.e., age, gender, and other facial attributes) between parents and their children, it requires us to conduct a careful study of


its various latent spaces (i.e., Z, W, W+, S spaces.) and encoder architectures for an optimized encoder to repurpose StyleGAN2 for kinship face synthesis. To our knowledge, these have not been well studied for kinship face synthesis in the literature. With thorough evaluations of different design choices as shown in Table 1 (i.e., more details are presented in Sect. 4.), we propose an encoder design consisting of an image encoder and a fusion block. The image encoder is further composed of an ID-preserved block with the design of Encoder for Editing (e4e) by Tov et al. [7] for better disentangled latent representation and editability in addition to an attribute block for normalizing age and gender variations of the parental representations for better synthesis fidelity. The fusion block fuses the latent representations of the parents for the final child representation. The obtained representation from the proposed encoder pipeline with stage-wise training strikes a better balance of editability and synthesis fidelity for identity preserving and attribute manipulations than other compared approaches. With extensive subjective, quantitative, and qualitative evaluations, the proposed approach consistently achieves better synthesis results using the Family-In-the-Wild (FIW) [8] and TSKinFace [9] datasets in terms of facial attribute heredity and image generation fidelity than other compared state-of-the-art methods. Furthermore, with our representation, we can also easily realize a component-wise parental trait manipulation (CW-PTM) through a method proposed by Chong et al. [10] to flexibly manipulate any desired face parts or regions of the synthesized face towards the parents through latent interpolation while ensuring the transition is smooth and continuous. Surprisingly, the manipulation results are competitive with other approaches employing multiple facial component encoders for the purpose. This also demonstrates the proposed method can not only yield promising and satisfactory kinship face synthesis but also enable the fine control of facial attributes using only a single and straightforward encoder architecture without the complex multi-encoder structure. This reduces training difficulties such as tuning the hyperparameters of multiple encoders simultaneously.

2 Related Work

In this section, we briefly review the relevant research works for kinship image synthesis using deep generative models. Deep Image Synthesis: Many studies of image synthesis using a deep neural network rely heavily on Generative Adversarial Network (GAN) [11–14]. Based on GAN models, StyleGAN [15] was proposed to generate high-level attributes for synthesized images and preserve linearity in the latent space of generative models. StyleGAN encodes the latent z from the latent space Z into the feature space W , then w is chosen from feature space and input to multiple layers of convolution layers in order to control various styles of the output image through the adaptive instance normalization (AdaIN) module. Abdal et al. [16] proposed an efficient algorithm to perform inversion of the input image into an extended feature space W + instead of W space of a pre-trained StyleGAN for better image


reconstruction. StyleGAN2 [5] further improves the details of the synthesized image, such as removing the blob-like artifacts, by redesigning the network of StyleGAN. Moreover, StyleGAN2 simplifies the instance normalization process with a weight demodulation operation. The latent space of GANs has been studied carefully in recent years, especially in the field of computer vision [17–20]. For facial images, it is ideal to map the source into a latent space in an effort to reduce dimension as well as to provide image editing in latent space. Tov et al. proposed Encoder for Editing [7], which allows manipulation of inverted images. Kinship Face Synthesis: Kinship face synthesis is a recently commenced problem that aims to generate the child image given the images of the parents [21]. Some works studied generating a child image using the image of the father or mother as reference [2,22,23]. Nevertheless, these approaches suffer from either low resolution or mode collapse, and thus the generated results are unsatisfactory. Some works use both parents' images as input, such as the methods proposed by Ghatas et al. [24] and Zaman et al. [25]. Still, the outputs of the first work sometimes contain corrupted artifacts, and the second work does not take the child image as guidance. Some recent works have further improved the synthesis results for the kinship face synthesis problem using more advanced architectures and loss designs. Gao et al. [3] introduce DNA-Net, which leverages a conditional adversarial autoencoder to generate the child images. Zhang et al. [6] generate child images by assigning an inheritance control vector of a facial part so as to let the child inherit the facial region from the mother or father. Lin et al. [4] concatenate the latent space embedding of a child with age and gender vectors to render child images with a pre-trained StyleGAN2. ChildGAN [26] extracts the representative semantic vectors and synthesizes the child image by macro fusion and micro fusion. Our proposed method utilizes a novel encoder design and an attribute alignment method; as a result, our model is capable of synthesizing child images that inherit designated facial regions, leading to outstanding performance as well as diversity in the synthesized child images.

3 Methodology

Our goal is to build a framework to synthesize a high-fidelity child face from a pair of parental face images, $I_F$ and $I_M$, while being able to smoothly control the age, gender, and specific facial features of the synthesized face. The overview of the proposed framework is shown in Fig. 1, which comprises three phases. The first phase is to encode the face images of the parents into the optimal latent codes, $s_F$ and $s_M$, while maintaining a good compromise between fidelity and editability for StyleGAN2. The identity information of the parental face images is preserved through the ID-preserved block, and the age and gender attributes are further normalized by the attribute block. In the second phase, we perform a weighted average over the transformed latent representations of the parents to obtain the final representation, $s_C$, for the child. The weights can be either manually assigned or learned by several multilayer perceptron (MLP) layers. In the third phase, with


Fig. 1. The overview of the proposed encoder-decoder framework based on StyleGAN2 for kinship face synthesis.

the representation, a flexible and fine manipulation of the selected region towards the parents can be achieved through latent interpolation. In the following, we will describe the details of each component.

3.1 Phase 1: Image Encoder

Due to the immense diversity of age, gender, and identity in the face images of the parents and children, it is difficult to train a single encoder at a time to acquire a latent representation that not only preserves the identity information but also is age and gender invariant. Thus, we divide the image encoder into two blocks, one for preserving identity information and the other for normalizing the age and gender of the latent codes of the parents. Meanwhile, we perform stage-wise training, which provides more flexible training strategies and requires less computational resources. The details of each block are described as follows. Identity-Preserved Block. Let $E(\cdot)$ denote the identity-preserved encoder block (ID-preserved block), and let $G_{W^+}(\cdot)$ refer to StyleGAN2 taking a latent code in the $W^+$ space as input. The objective of the ID-preserved block is to learn the mapping $E: \mathcal{I} \to W^+$ while preserving the identity of the input image. As shown in Table 1, to preserve the identity information in the resultant latent representation while maximizing its editability, we adopt the Encoder for Editing (e4e) architecture [7] with the identity loss for this purpose. Given an input image, the encoder returns a base latent and a residual, each having the dimension $\mathbb{R}^{18\times512}$. The final output latent code is obtained by adding them together. This design can minimize the variance between the latent codes of the 18 layers of the $W^+$ space and make the latent code more editable. In addition, this also facilitates


the age and gender normalization and other manipulations of the latent codes of the parents in the second and later stages. Let $w$ denote the latent code encoded from the input image $I$, $w_{base}$ denote the base latent code, $w_\Delta$ denote the residual, and $I_{syn}$ denote the synthesized image. That is,

$$w = E(I) = w_{base} + w_\Delta, \tag{1}$$
$$I_{syn} = G_{W^+}(w). \tag{2}$$
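A minimal sketch (ours, not the released code) of the composition in Eqs. (1)-(2) is given below; `encoder` and `generator` are placeholders for the pre-trained e4e-style encoder and the fixed StyleGAN2 generator described in the text.

```python
def encode_and_render(image, encoder, generator):
    """Encode an image to W+ as base + residual and render it with a fixed StyleGAN2."""
    w_base, w_delta = encoder(image)     # each of shape (B, 18, 512)
    w = w_base + w_delta                 # Eq. (1): w = E(I) = w_base + w_Delta
    return generator(w), w, w_delta      # Eq. (2): I_syn = G_{W+}(w)
```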

For the loss function, to preserve the identity of the input image, we impose a common identity constraint:

$$L_{ID1} = 1 - \langle R(I_{syn}), R(I) \rangle, \tag{3}$$

where $\langle\cdot,\cdot\rangle$ is the cosine similarity and $R(\cdot)$ is a pre-trained ArcFace model that extracts the feature map of a face from the penultimate layer. Moreover, we add an L2 facial landmark loss to further enhance the alignment of the synthesized face $I_{syn}$ to be centered in the image. The landmarks on the center line of the face between the nose and mouth are used for the landmark loss, which is defined as

$$L_{land} = \left\| E^x_{land}(I_{syn}) - C \right\|_2, \tag{4}$$

where $E^x_{land}(\cdot)$ is a pre-trained landmark predictor and $C$ is a vector filled with the value of the x coordinate of the image center, which is 512 in the rest of the experiments for a 1024 × 1024 image. Lastly, we also add two additional losses as in [7] for image editability and quality. The first one is an L2 regularization loss on the residual $w_\Delta$:

$$L_{reg1} = \| w_\Delta \|_2. \tag{5}$$

The second loss is the non-saturating GAN loss with R1 regularization, which is used to force $w$ not to deviate from the $W^+$ space. Let $D(\cdot)$ denote a latent discriminator:

$$L^D_{adv} = -\mathbb{E}_{w_r\sim W^+}[\log D(w_r)] - \mathbb{E}_{I\sim\mathcal{I}}[\log(1 - D(E(I)))] + \frac{\gamma}{2}\,\mathbb{E}_{w_r\sim W^+}\big[\| \nabla_{w_r} D(w_r) \|^2_2\big],$$
$$L^E_{adv} = -\mathbb{E}_{I\sim\mathcal{I}}[\log D(E(I))]. \tag{6}$$

To sum up, the total loss function of the ID-preserved block is

$$L_{enc} = \lambda_{ID1} L_{ID1} + \lambda_{land} L_{land} + \lambda_{reg1} L_{reg1} + \lambda_{adv} L_{adv}. \tag{7}$$
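The ID-preserved block objective of Eq. (7) could be assembled as in the following hedged sketch. The helper names (`arcface`, `landmark_net`, `adv_loss`) are ours; the loss weights follow the values reported in Sect. 4, and the landmark term here is a simplified stand-in for Eq. (4).

```python
import torch.nn.functional as F

def id_preserved_loss(I, I_syn, w_delta, arcface, landmark_net, adv_loss,
                      lambdas=(1.0, 0.0008, 0.0002, 0.5), center_x=512.0):
    # Eq. (3): identity constraint via cosine similarity of ArcFace features
    l_id1 = 1 - F.cosine_similarity(arcface(I_syn), arcface(I), dim=1).mean()
    # Eq. (4), simplified: x coordinates of the center-line landmarks should sit at center_x
    x_coords = landmark_net(I_syn)
    l_land = (x_coords - center_x).pow(2).mean()
    # Eq. (5): L2 regularization on the residual
    l_reg1 = w_delta.pow(2).mean()
    l1, l2, l3, l4 = lambdas
    return l1 * l_id1 + l2 * l_land + l3 * l_reg1 + l4 * adv_loss   # Eq. (7)
```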

Attribute Block. After the ID-preserved block, the attribute block is further used to align the age and gender of the input parental latent codes $w = \{w_i\}_{i=1}^{18}$ in $W^+$, where $w_i \in \mathbb{R}^{512}$. As mentioned in Sect. 2, attribute manipulation can be achieved by shifting the latent code along specific latent directions. Thus,


we learn an offset vector for the desired modification by employing MLPs, $M(\cdot)$, each followed by a leaky ReLU, which take the concatenation of a parental latent code $w$ and the desired age $\alpha$ and gender $\beta$ values as input. For $\alpha$, the input value ranges from 0 to 1, corresponding to 0 to 100 years old. For $\beta$, 1 represents male and 0 represents female. Then, we can obtain the modified latent code $w'$ by adding the original $w$ and the offset $\Delta w = M(w, \alpha, \beta)$. In the training process, $\alpha$ is sampled from $\mathrm{Uniform}[0, 1]$, and $\beta$ is sampled from $\mathrm{Bernoulli}(0.5)$. Besides arbitrary attributes, we can also use the ground truth attributes to construct a reconstruction loss and a cycle consistency loss on $w$. We can thus obtain three modified latent codes for each $w$:

$$w'_{syn} = w + M(w, \alpha, \beta), \quad w'_{rec} = w + M(w, \alpha_{gt}, \beta_{gt}), \quad w'_{cyc} = w'_{syn} + M(w'_{syn}, \alpha_{gt}, \beta_{gt}) \tag{8}$$

where $\alpha_{gt}$, $\beta_{gt}$ denote the ground truth age and gender labels. For each modified latent code $w'$, we can obtain the corresponding synthesized image $I_{syn}$, $I_{rec}$, or $I_{cyc}$ by passing it into $G_{W^+}(\cdot)$. For the loss function of the attribute block, we first employ the following age and gender losses:

$$L_{age} = \|\alpha - C_a(I_{syn})\|_2 + \|\alpha_{gt} - C_a(I_{rec})\|_2 + \|\alpha_{gt} - C_a(I_{cyc})\|_2,$$
$$L_{gen} = H(\beta, C_b(I_{syn})) + H(\beta_{gt}, C_b(I_{rec})) + H(\beta_{gt}, C_b(I_{cyc})), \tag{9}$$

where $C_a(\cdot)$ denotes a pre-trained age classifier, $C_b(\cdot)$ denotes a pre-trained gender classifier, and $H(\cdot)$ denotes the cross-entropy loss. Similarly, we also use an identity loss for attribute block training. However, since the identity of a person may become obscure as the person ages, Alaluf et al. [27] proposed an identity loss decayed with the age difference between the prediction and the ground truth. We further extend this idea to both age and gender. The identity loss can be formulated as

$$L_{ID2} = \xi \cdot (1 - \langle R(I_{syn}), R(I)\rangle) + (1 - \langle R(I_{rec}), R(I)\rangle) + (1 - \langle R(I_{cyc}), R(I)\rangle), \tag{10}$$

where $\xi$ is the decay coefficient, and we set $\xi = 0.45 + 0.35\cos(|\alpha - \alpha_{gt}|\cdot\pi) + 0.2\cos(|\beta - \beta_{gt}|\cdot\pi)$. Moreover, to make training faster and prevent latent codes from deviating from the original $W^+$ space, we utilize perceptual similarity losses on the images and use L2 regularization on the offsets:

$$L_{reg2} = \|M(w, \alpha, \beta)\|_2 + \|M(w, \alpha_{gt}, \beta_{gt})\|_2 + \|M(w'_{syn}, \alpha_{gt}, \beta_{gt})\|_2,$$
$$L_{per} = \|P(I_{enc}) - P(I_{syn})\|_2 + \|P(I_{enc}) - P(I_{rec})\|_2 + \|P(I_{enc}) - P(I_{cyc})\|_2, \tag{11}$$

where Ienc = GW + (E(I)) is the reconstructed image by passing the latent after ID-preserved block to StyleGAN2, and P (·) is a pre-trained AlexNet feature


extractor pre-trained upon the ImageNet dataset. Lastly, the total loss function is as follows:

$$L_{attr} = \lambda_{ID2} L_{ID2} + \lambda_{age} L_{age} + \lambda_{gen} L_{gen} + \lambda_{reg2} L_{reg2} + \lambda_{per} L_{per}. \tag{12}$$
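The core of the attribute block, the offset predictor of Eq. (8) and the decay coefficient of Eq. (10), could look like the following minimal sketch. The layer sizes of the MLP are our assumption (the paper only specifies MLPs with leaky ReLUs), and the 50-fold duplication of the conditioning values follows the description in Sect. 4.

```python
import math
import torch
import torch.nn as nn

class LatentShifter(nn.Module):
    """Predicts Delta_w = M(w, alpha, beta) for a W+ code (Eq. 8); sizes are assumptions."""
    def __init__(self, latent_dim=18 * 512, cond_dim=100, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, w, age, gender):
        # age in [0, 1] and gender in {0, 1}, each of shape (B, 1), duplicated 50 times
        cond = torch.cat([age.repeat(1, 50), gender.repeat(1, 50)], dim=1)
        delta = self.net(torch.cat([w.flatten(1), cond], dim=1))
        return w + delta.view_as(w)              # w' = w + M(w, alpha, beta)

def decay_coefficient(alpha, alpha_gt, beta, beta_gt):
    """xi of Eq. (10): equals 1.0 when the requested attributes match the ground truth."""
    return (0.45 + 0.35 * math.cos(abs(alpha - alpha_gt) * math.pi)
                 + 0.20 * math.cos(abs(beta - beta_gt) * math.pi))
```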

As suggested in [28], the latent codes in the S space of StyleGAN2 result in a better style mixing performance. We follow this idea and apply the affine transform layers $A(\cdot)$, available in the StyleGAN2 model, to convert the latent representation $w'$ of each parent into $s$ for the next stage as $s = A(w')$.

3.2 Phase 2: Fusion

Once the transformed latent representations of the parents in the S space are obtained, we perform a weighted average to blend them into one final child latent code as follows:

$$s_C = \gamma \circ s_M + (1 - \gamma) \circ s_F, \tag{13}$$

where $s_F$ and $s_M \in \mathbb{R}^{9088}$ denote the two latent codes in the S space for the parents, $s_C$ denotes the resultant child code, $\circ$ is the element-wise product, and $\gamma \in [0,1]^{9088}$ is the blending coefficient, which can be either manually specified or learned by a fusion network in which we employ an MLP layer that takes the concatenated vector of $s_F$ and $s_M$ as input, trained with L2 reconstruction and ID losses similar to the ID-preserved block. Besides the S space, we also perform the blending in the $W^+$ space for comparison; more details can be found in Sect. 4.1.
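The fusion of Eq. (13) could be realized as in the following sketch; the hidden layer size and the sigmoid output used to keep the learned $\gamma$ in $[0,1]^{9088}$ are our assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, style_dim=9088, hidden=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * style_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, style_dim), nn.Sigmoid(),   # gamma in [0, 1]^9088
        )

    def forward(self, s_father, s_mother, gamma=None):
        if gamma is None:                                  # learned fusion
            gamma = self.mlp(torch.cat([s_father, s_mother], dim=1))
        return gamma * s_mother + (1 - gamma) * s_father   # Eq. (13)
```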

3.3 Phase 3: Component-wise Parental Trait Manipulation (CW-PTM)

With our optimized encoder pipeline of kinship face synthesis in Phases 1 and 2, the resultant latent code for the child is suitable for editing through latent interpolation, as compared with other methods like StyleDNA. We can easily apply a similar approach as in [10] to realize component-wise parental trait manipulation (CW-PTM) in a single encoder-decoder framework. The relations between each dimension of the latent code and specific facial features, such as eyes, nose, and mouth, are obtained with K-means clustering. Then, a mask corresponding to the facial features over the latent code can be derived accordingly. A specific facial attribute can be transferred from one image to another by shifting the original latent code towards the target one based on the mask. For example, suppose the mask for the eyes is denoted by $M_{eye}$, where $M_{eye} \in \{0,1\}^{9088}$ and a one stands for a position of the latent vector controlling the synthesized face's eyes, as shown in Fig. 1. Therefore, we can shift the part of $s_C$ within $M_{eye}$ toward $s_{target}$ by a coefficient $\epsilon$. That is,

$$s'_C = s_C + \epsilon \cdot M_{eye} \circ (s_{target} - s_C), \tag{14}$$

where $s'_C$ is the modified latent code and $\circ$ denotes the element-wise product. The step size $\epsilon$ determines how similar the eye is to the target. Leveraging this method, we can manipulate the latent code to make the synthetic child


Fig. 2. It illustrates the ROC curves for ablation studies and the comparisons between the proposed approach and the baseline, StyleDNA. ID stands for ID-preserved block, A for attribute block, F for learned fusion, and CW for component-wise parental trait manipulation. M represents directly averaging parent latent codes without fusing them with a learned network.

inherit specific parental traits. The proposed approach is not only more memory and computation efficient but also allows more flexible and smooth manipulation towards any selected regions of the parents by latent interpolation as shown in Fig. 6b than other similar works, such as [6] which employs fixed multiple component-wise encoders. In addition, the proposed approach avoids using multiple encoders, which increases the training difficulties due to plenty of model parameters and hyperparameter tuning. The multiple region manipulations can be achieved through recursively applying the same procedure.
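The shift of Eq. (14) itself is a one-line operation once the component mask is available (derived offline, e.g., by K-means as in [10]; mask derivation is omitted here). The following is a hedged sketch with assumed argument names, with the sign convention chosen so that a positive step moves the selected entries toward the target parent's code, as described in the text.

```python
def shift_component(s_child, s_target, mask, eps=0.5):
    """s_child, s_target: (9088,) S-space codes; mask: (9088,) binary component mask."""
    return s_child + eps * mask * (s_target - s_child)   # Eq. (14)
```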

4 Experiment

In this section, we show the results of the proposed approach together with several recent representative state-of-the-art methods for quantitative, qualitative, and subjective comparisons. Implementation Details: We use a pre-trained and fixed StyleGAN2 model upon the FFHQ dataset as our decoder. For the encoder, we train both the ID-preserved and attribute blocks using the FFHQ-Aging dataset [29], which contains images with age and gender label information. We also perform weighted sampling based on the number of training instances per age and gender. Then, the learned fusion block is trained with the FIW dataset [8], which contains approximately 2,000 tri-pairs of kinship images. Since images in the FIW dataset are low-resolution, we preprocess them with GFP-GAN for super-resolution [30] before training. The facial landmarks are extracted using MobileFaceNets [31]. We pre-train the age and gender classifiers following the same setting as [4]. Instead of one-hot vectors, the age and gender conditionals are transformed to [0, 1] and {0, 1}, respectively, and then duplicated 50 times each in order to facilitate stable training. For the hyperparameters of training, we set the batch size to 6, use a standard Ranger optimizer with a learning rate of 0.0001, and set the loss


weights as follows. We set $\lambda_{ID1} = 1$, $\lambda_{land} = 0.0008$, $\lambda_{reg1} = 0.0002$, $\lambda_{adv} = 0.5$ for the ID-preserved block; $\lambda_{ID2} = 0.5$, $\lambda_{gen} = 1$, $\lambda_{age} = 5$, $\lambda_{per} = 0.5$, $\lambda_{reg2} = 0.05$ for the attribute block; and $\lambda_{ID3} = 1$, $\lambda_{2} = 1$ for the learned fusion block. Training Pipeline: To facilitate the training process and handle the data imbalance issues, we perform stage-wise training to train one component at a time while freezing the model weights of prior stages. This enables us not only to apply the relevant losses for the optimal training of each encoder block according to its task characteristics but also to train the model using much fewer GPU resources than other approaches.

4.1 Quantitative Evaluations of Different Encoder Configurations

Due to the complex nature of the StyleGAN2 model, for an optimized encoder design, we first investigate the encoder design choices of different latent spaces and network architectures for kinship face synthesis in terms of the AUC of the ROC curve for face verification and the FID score for image synthesis, where we measure the facial similarity between the synthesized child face and its corresponding ground truth child face. The pre-trained ArcFace model for face verification is used to extract the latent representations of both the prediction and the ground truth, followed by cosine similarity computation. We randomly sample 100 positive and negative pairs from the test set of the FIW dataset for the similarity computation. From Table 1, we find that encoding images into the $W^+$ space or S space can generate offspring faces that are more realistic and more similar to the ground truths. In addition, the editability of the encoded latent codes needs to be considered in our StyleGAN pipeline: the methods with another popular backbone, pSp [32], which does not account for editability, attain lower AUC and higher FID scores compared to the ones with e4e. Performing fusion in the $W^+$ space or S space attains similar AUC and FID scores, and although the learning-based fusion achieves a slightly better improvement than the manual one, there is not much difference between them if the encoder is well designed. In the rest of the experiments, we select the configuration of (6) to further conduct qualitative and subjective evaluation, since it has the highest AUC for the best identity preservation and also yields good perceptual image quality. For the configuration of (6), we further compare the ROC curves of the proposed method with StyleDNA due to the public availability of its source code. As shown in Fig. 2, the proposed approach achieves the best performance with an AUC of 0.8101. We also show the blue curve for the most promising result with an AUC of 0.8871 after applying CW-PTM using the ID loss from the ground truth as guidance. This improvement also shows the strength of CW-PTM in exploring a face that more closely resembles the ground truth child face.


Table 1. The quantitative results of pipelines with different combinations of the encoder and the fusion method.

    | Encoder space | Encoder type | Fusion space | Fusion type | AUC (↑) | FID (↓)
(1) | W             | Resnet       | W            | Learned     | 0.6720  | 197.9197
(2) | W+            | e4e          | W+           | Mean        | 0.8050  | 133.3718
(3) | W+            | pSp          | W+           | Mean        | 0.7738  | 176.4417
(4) | W+            | e4e          | S            | Mean        | 0.8051  | 133.3681
(5) | W+            | pSp          | S            | Mean        | 0.7738  | 176.4349
(6) | W+            | e4e          | S            | Learned     | 0.8101  | 138.2808
(7) | W+            | pSp          | S            | Learned     | 0.7783  | 173.5681

Table 2. (a) The weighted average rank for the resemblance between the synthesized child faces using different approaches and a pair of parental face images. (b) The weighted average rank of naturalness and photo-realism for the synthesized child faces of different approaches.

(a)
           | GT   | styleDNA [4] | ChildGAN [26] | Ours
Session I  | 2.83 | 2.92         | 2.33          | 1.92
Session II | 2.27 | 2.82         | 2.18          | 1.55 (CW-PTM)

(b)
         | styleDNA [4] | ChildGAN [26] | Ours
Avg rank | 1.57         | 2.59          | 1.84

4.2 Subjective Evaluation

To further compare the generation quality of different methods, we conduct a subjective evaluation in two independent online sessions, with 186 participants in the first and 131 in the second. The two sessions contain 9 and 13 questions, respectively. In each session, we ask the participants to rank the synthesized child faces of the different methods, our approach and the state-of-the-art methods StyleDNA [4] and ChildGAN [26], along with the ground truth, according to their perceived resemblance to the given reference face images of the parents, where a lower rank represents a higher score (i.e., one is the most likely and four is the least likely). Since the code of [6], which employs multiple facial component encoders, is not publicly available and the image quality in its pdf file is low, we do not use it for the subjective evaluation. In addition, we also asked participants to rate the naturalness and photo-realism of the different synthesized child faces. As shown in Table 2, we compute the weighted average ranks for the resemblance between the synthesized child and the reference parental faces. The proposed method consistently achieves the best subjective


Fig. 3. The qualitative comparisons between the proposed approach and StyleDNA after performing fine attribute manipulation towards the parents using CW-PTM upon the respective extracted latent representations according to the selected facial component or region of the parent. The top row shows the ground truth faces of the parents and their child. The proposed approach achieves more continuous and better manipulation results than the compared baseline, keeping the age and gender intact while performing the manipulation.

performances as compared with the other methods. For the photo-realism test in Table 2 for the first session, the average weighted rank of the proposed approach is close to that of StyleDNA. Instead of ranking, we further conducted a mean opinion score test to compare the photo-realism of the proposed approach with StyleDNA in the second session, where the score ranges from 1 to 5 and a higher value is better. The proposed approach achieves a better average score of 3.53 than StyleDNA's 3.23. It is worth noting that the proposed approach even outperforms the ground truth. These results further demonstrate the strength of the proposed approach. For more details of the questions and rank scores, we refer interested readers to the supplementary materials.

Qualitative Evaluation

Finally, we also show various visual samples to compare the proposed approach with other methods in terms of the capability of various attribute manipulations, including age, gender, and parental traits. From Fig. 3, we find the synthesized


Fig. 4. The ground-truth face images along with the synthesized child images from the proposed and compared methods. The leftmost three or four columns of (a) FIW and (b) TSKinFace, respectively, depict the ground-truth images, and the right four columns depict the synthesized child images, where the compared methods include DNA-Net [3], ChildGAN [26], StyleDNA [4], and Zhang et al. [6].

faces by the proposed approach change more smoothly when adjusting the values of the corresponding latent codes, as compared with StyleDNA (the inference code of StyleDNA is publicly available, and StyleDNA is thus chosen as the main comparison target for the qualitative analysis). This further demonstrates the advantages of the proposed approach in synthesizing high-fidelity kin faces while allowing smooth component-wise manipulation toward any selected facial components or regions of the parents, as shown in Fig. 6b. Employing multiple encoders, in contrast, usually introduces additional computational costs and training difficulties. We also show the synthesis results in Fig. 4 using the face images of the FIW and TSKinFace datasets. The proposed method synthesizes faces with higher fidelity than the other approaches.

4.4 Ablation Studies

In this section, we perform ablation studies to understand the effectiveness of each component, where ID denotes the ID-preserved block, A the attribute block, F the learned fusion, M directly averaging the parent latent codes without fusing them with a learned network, and CW the proposed component-wise parental trait manipulation. As shown in Fig. 6a, the fidelity of the synthesized child faces improves as more components of the proposed framework are applied. With the assistance of CW-PTM, we can further flexibly make the synthesized child faces closer to either the father, the mother, or both, for different facial components or regions at the same time. Surprisingly, from Fig. 2, the AUC scores of all the ablated results are better


Fig. 5. The effects of using different numbers of MLPs in the attribute block: a single MLP (left) and three MLPs (right). The results using three MLPs show more prominent facial manipulation than those using a single MLP.


Fig. 6. (a) Qualitative results using different combinations of our network components for the ablation study. (b) The proposed approach allows flexible parental trait manipulation with any selected facial components or regions toward either parent.

than that of StyleDNA. This shows the strength of the proposed optimized encoder architecture and the stage-wise training. For the attribute block, besides a single MLP, we further divide the encoded latent code from the eighteen layers of the StyleGAN2 model into three groups, each with a corresponding MLP for transformation, where the first three layers control coarse-grained details, the fourth to seventh layers control middle-grained details, and the rest control fine-grained details. Although the difference in FID between the two settings is small (37.056 vs. 36.663), the three-MLP setting produces face images with better attribute manipulation than the single-MLP one, as shown in Fig. 5.
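To make the grouping concrete, the sketch below (ours, not the authors' released code) splits an 18 × 512 W+ code into the coarse (layers 0–2), middle (3–6) and fine (7–17) groups described above and applies one MLP per group; the hidden width and the residual form of the shift are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupedLatentShifter(nn.Module):
    """Sketch of the attribute block: one MLP per W+ layer group.

    Assumes an 18 x 512 StyleGAN2 W+ code; the group boundaries follow the
    paper (layers 0-2 coarse, 3-6 middle, 7-17 fine), while the hidden size
    and residual formulation are assumptions.
    """

    def __init__(self, dim=512, hidden=512):
        super().__init__()
        self.groups = [(0, 3), (3, 7), (7, 18)]  # coarse / middle / fine
        self.shifters = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
                          nn.Linear(hidden, dim))
            for _ in self.groups
        ])

    def forward(self, w_plus):                    # w_plus: (B, 18, 512)
        parts = []
        for (start, end), mlp in zip(self.groups, self.shifters):
            chunk = w_plus[:, start:end]
            parts.append(chunk + mlp(chunk))      # residual shift per group
        return torch.cat(parts, dim=1)

# usage: w_edit = GroupedLatentShifter()(torch.randn(2, 18, 512))
```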

5 Conclusion

The main contribution of our work is a thorough study of different latent spaces and encoder architectures for repurposing StyleGAN2 for kinship face synthesis. The proposed optimized encoder strikes a better balance between editability and synthesis fidelity for kinship face synthesis than the other compared methods, while allowing smooth and continuous face trait manipulation. Through extensive subjective, quantitative, and qualitative evaluations, the proposed approach consistently achieves better performance in terms of facial attribute heredity and image generation fidelity than other state-of-the-art methods. This demonstrates the effectiveness of the proposed approach,


which can yield promising and satisfactory kinship face synthesis using only a single straightforward encoder architecture.

References

1. Ertuğrul, I.Ö., Dibeklioğlu, H.: What will your future child look like? Modeling and synthesis of hereditary patterns of facial dynamics. In: IEEE International Conference on Automatic Face and Gesture Recognition (FG) (2017)
2. Ozkan, S., Ozkan, A.: KinshipGAN: synthesizing of kinship faces from family photos by regularizing a deep face network. In: IEEE International Conference on Image Processing (ICIP) (2018)
3. Gao, P., Robinson, J., Zhu, J., Xia, C., Shao, M., Xia, S.: DNA-Net: age and gender aware kin face synthesizer. In: IEEE International Conference on Multimedia and Expo (ICME) (2021)
4. Lin, C.H., Chen, H.C., Cheng, L.C., Hsu, S.C., Chen, J.C., Wang, C.Y.: StyleDNA: a high-fidelity age and gender aware kinship face synthesizer. In: IEEE International Conference on Automatic Face and Gesture Recognition (FG) (2021)
5. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
6. Zhang, Y., Li, L., Liu, Z., Wu, B., Fan, Y., Li, Z.: Controllable descendant face synthesis. arXiv preprint arXiv:2002.11376 (2020)
7. Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph. (TOG) 40, 1–14 (2021)
8. Robinson, J.P., Shao, M., Wu, Y., Liu, H., Gillis, T., Fu, Y.: Visual kinship recognition of families in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 40, 2624–2637 (2018)
9. Qin, X., Tan, X., Chen, S.: Tri-subjects kinship verification: understanding the core of a family. In: 2015 14th IAPR International Conference on Machine Vision Applications (MVA), pp. 580–583 (2015)
10. Chong, M.J., Chu, W.S., Kumar, A., Forsyth, D.: Retrieve in style: unsupervised facial feature transfer and retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
11. Liu, S., et al.: Face aging with contextual generative adversarial nets. In: Proceedings of the 25th ACM International Conference on Multimedia (2017)
12. Tang, H., Bai, S., Sebe, N.: Dual attention GANs for semantic image synthesis. In: Proceedings of the 28th ACM International Conference on Multimedia (2020)
13. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (ICLR) (2019)
14. Zhao, J., et al.: Dual-agent GANs for photorealistic and identity preserving profile face synthesis. In: Advances in Neural Information Processing Systems 30 (2017)
15. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
16. Abdal, R., Qin, Y., Wonka, P.: Image2StyleGAN: how to embed images into the StyleGAN latent space? In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)
17. Zhang, L., Bai, X., Gao, Y.: SalS-GAN: spatially-adaptive latent space in StyleGAN for real image embedding. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 5176–5184 (2021)
18. Sainburg, T., Thielk, M., Theilman, B., Migliori, B., Gentner, T.: Generative adversarial interpolative autoencoding: adversarial training on latent space interpolations encourage convex latent distributions. arXiv preprint arXiv:1807.06650 (2018)
19. Mukherjee, S., Asnani, H., Lin, E., Kannan, S.: ClusterGAN: latent space clustering in generative adversarial networks. Proceed. AAAI Conf. Artif. Intell. 33, 4610–4617 (2019)
20. Bojanowski, P., Joulin, A., Lopez-Paz, D., Szlam, A.: Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776 (2017)
21. Robinson, J.P., Shao, M., Fu, Y.: Survey on the analysis and modeling of visual kinship: a decade in the making. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)
22. Ertuğrul, I.Ö., Jeni, L.A., Dibeklioğlu, H.: Modeling and synthesis of kinship patterns of facial expressions. Image Vis. Comput. 79, 133–143 (2018)
23. Sinha, R., Vatsa, M., Singh, R.: FamilyGAN: generating kin face images using generative adversarial networks. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12537, pp. 297–311. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-67070-2_18
24. Ghatas, F.S., Hemayed, E.E.: GANKIN: generating kin faces using disentangled GAN. SN Appl. Sci. 2, 166 (2020)
25. Zaman, I., Crandall, D.: Genetic-GAN: synthesizing images between two domains by genetic crossover. In: European Conference on Computer Vision Workshops (ECCVW) (2020)
26. Cui, X., Zhou, W., Hu, Y., Wang, W., Li, H.: Heredity-aware child face image generation with latent space disentanglement. arXiv preprint arXiv:2108.11080 (2021)
27. Alaluf, Y., Patashnik, O., Cohen-Or, D.: Only a matter of style: age transformation using a style-based regression model. ACM Trans. Graph. (TOG) 40, 1–12 (2021)
28. Kafri, O., Patashnik, O., Alaluf, Y., Cohen-Or, D.: StyleFusion: a generative model for disentangling spatial segments. arXiv preprint arXiv:2107.07437 (2021)
29. Or-El, R., Sengupta, S., Fried, O., Shechtman, E., Kemelmacher-Shlizerman, I.: Lifespan age transformation synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 739–755. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_44
30. Wang, X., Li, Y., Zhang, H., Shan, Y.: Towards real-world blind face restoration with generative facial prior. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
31. Chen, S., Liu, Y., Gao, X., Han, Z.: MobileFaceNets: efficient CNNs for accurate real-time face verification on mobile devices. CoRR abs/1804.07573 (2018)
32. Richardson, E., et al.: Encoding in style: a StyleGAN encoder for image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

Occluded Facial Expression Recognition Using Self-supervised Learning

Jiahe Wang1, Heyan Ding1, and Shangfei Wang1,2(B)

1 Key Lab of Computing and Communication Software of Anhui Province, University of Science and Technology of China, Hefei, China
{pia317,dhy0513}@mail.ustc.edu.cn
2 Anhui Robot Technology Standard Innovation Base, University of Science and Technology of China, Hefei, China
[email protected]

Abstract. Recent studies on occluded facial expression recognition typically required fully expression-annotated facial images for training. However, it is time consuming and expensive to collect a large number of facial images with various occlusions and expression annotations. To address this problem, we propose an occluded facial expression recognition method through self-supervised learning, which leverages the profusion of available unlabeled facial images to explore robust facial representations. Specifically, we generate a variety of occluded facial images by randomly adding occlusions to unlabeled facial images. Then we define occlusion prediction as the pretext task for representation learning. We also adopt contrastive learning to make the representation of a facial image and those of its variations with synthesized occlusions close. Finally, we train an expression classifier as the downstream task. The experimental results on several databases containing both synthesized and realistic occluded facial images demonstrate the superiority of the proposed method over state-of-the-art methods.

Keywords: Occluded facial expression recognition · Self-supervised learning · Representation learning

1 Introduction

Facial expressions play an important role in our daily communication. In recent years, facial expression recognition has attracted increasing attention and achieved great progress [4,8,18,21] because of its application in many fields, such as psychological treatment, security and service robots. However, it is still challenging to recognize facial expressions from occluded facial images. Current methods of occluded facial expression recognition can be divided into four categories: robust facial representation, non-occluded facial image reconstruction, sub-region analysis, and non-occluded facial image help. Robust facial representation methods aim to locate the representation that is insensitive to


occlusion but discriminative for expression recognition. It is very difficult to find a robust representation because the types of occlusion are diverse and the positions of occlusion are infinite. Non-occluded facial image reconstruction methods aim to construct non-occluded facial images using a generative model and train the facial expression classifier from the reconstructed facial images. However, the generated facial images are typically not as realistic as the real images, and this affects the performance of facial expression recognition. Sub-region analysis methods divide the image into several regions and recognize expressions from these regions and the entire image. Dividing facial images typically requires facial landmarks, and the attention mechanism is used to select important regions. However, facial landmark detection from occluded facial images remains challenging. Non-occluded image help methods adopt non-occluded facial images as privileged information to assist occluded facial expression recognition. During training, these methods typically construct two networks: one for non-occluded facial expression recognition and the other for occluded facial expression recognition. During testing, these methods assume that all facial images are occluded and only the network for occluded expressions is used, whereas in a realistic scenario we do not know whether a facial image is occluded or not. Furthermore, all the above methods require fully expression-annotated images for training. Because the types and positions of occlusion are infinite, collecting a large-scale dataset with various facial expressions and occlusions is difficult. To address this, we propose an occluded facial expression recognition method through self-supervised learning [6]. We use a large number of unlabeled facial images in the pretext task to learn a robust and occlusion-insensitive facial representation. First, we synthesize many occluded images by randomly adding different occlusions to a large number of images. We apply occlusion detection as the pretext task to learn the representation. We also use contrastive learning to make the representation of a facial image and those of its variations with synthesized occlusions similar. Finally, we set occluded expression recognition as the downstream task. Our contributions are as follows: We are the first to introduce a large number of unlabeled facial images for occluded expression recognition through self-supervised learning. We design an occlusion detection task and a similarity constraint between the occluded and non-occluded facial images as the pretext task.

2

Related Work

Because of the variability of occlusion, occluded facial expression recognition is still a big challenge. Current work can be classified into four categories: sub-region analysis [3,10,11,18], robust facial representation [2,20], non-occluded facial image reconstruction [12,17] and non-occluded facial image help [16,19]. Sub-region analysis methods typically divide the image into several regions and obtain results from the regions and the entire image. These methods often apply the attention mechanism. Wang et al. [18] proposed a region attention network that adjusted the importance of facial parts and designed a region-biased


loss function to obtain a high attention weight for the important region. Li et al. [11] presented a patch-gated convolutional neural network (PG-CNN) for facial expression recognition under occlusion. The PG-CNN chose 24 interest patches that were fed into an attention network to extract local features and learn an unobstructed score. The final classifier was constructed based on the weighted concatenated local features of all regions. Then, Li et al. [10] further proposed the global gated unit to add the global information of facial images for facial expression recognition. Dapogny et al. [3] used random forests to train partially defined local subspaces of the face and adapted local expression predictions as high-level representations. Then they weighted confidence scores provided by an autoencoder network. However, to divide the images or obtain regions, these methods typically require facial landmarks. It is difficult to detect facial landmarks from occluded facial images, which greatly affects the results of sub-region analysis methods. The detection error may be propagated to the classification task. Robust facial representation methods aim to find a visual representation that is robust to occlusion. Zhang et al. [20] used a Monte Carlo algorithm to extract a set of Gabor templates from images and converted these templates into template match distance features. Cornejo and Pedrini [2] used robust principal component analysis to reconstruct occluded facial regions and then extracted census transform histogram features. However, these methods do not have good generalization ability. It is difficult to find features that are insensitive to occlusion, because the positions of occlusion are unlimited. Non-occluded facial image reconstruction methods exploit a deep generative model to construct non-occluded facial images. Lu et al. [12] exploited a generator to complete the occluded image, and the generated completed image was then used to predict the expression. They used reconstruction loss, triplet loss and adversarial loss to implement occluded facial image completion. Ranzato et al. [17] used a deep belief network to construct a non-occluded face from the occluded face and then predicted the expression from the complete face. However, their visualization of occluded images was not good because of the unlimited positions and types of occlusion. Errors caused by an inaccurately reconstructed facial image may be propagated to the final task. Non-occluded facial image help methods train facial expression classifiers from occluded facial images with the assistance of non-occluded facial images. Generally, non-occluded facial images have more useful information than occluded facial images. Pan et al. [16] used non-occluded images as privileged information to enhance the occluded classifier. Pan et al. trained two deep neural networks for occluded and non-occluded images separately and then used the non-occluded network to guide the occluded network. Xia et al. [19] proposed a stepwise learning strategy to obtain a robust network. They divided occluded and non-occluded images into three subsets from simple to difficult. Then they input the three subsets into the network to learn parameters in stages. They also used least squares generative adversarial networks [14] to reduce the feature gap between occluded facial images and non-occluded facial images. However, non-occluded facial image help methods trained two distinct networks for occluded


and non-occluded images. During the tests, all facial images were assumed to be occluded, which may have affected the results of non-occluded facial images. To summarize, existing methods require a large number of expression-labeled occluded images. However, occluded facial images with expression labels are difficult to collect. Thus, we generate a large number of occluded facial images from unlabeled facial images to simulate real occluded images, where the positions and types of occlusion vary. Therefore, in this study, we propose an occluded facial expression recognition method that uses self-supervised learning. Specifically, we design a pretext task related to occluded facial images. We apply contrastive learning to maximize the similarity between the facial image and its variations with synthesized occlusions. We also design an occlusion detection task as the pretext task. Then we set expression recognition as the downstream task and add a classifier to fine-tune the model.

3 Method

The framework of the proposed occluded facial expression recognition through self-supervised learning is shown in Fig. 1. The training process is mainly composed of two tasks. In the pretext task, we use the pre-training set to obtain an initial feature extractor F. In the downstream task, we add a classifier C after feature extractor F, and use the training set to fine-tune the parameters of extractor F and classifier C.

3.1 Problem Statement

Let $D_{pre\_train} = \{x_{pc}^{(i)}\}_{i=1}^{N_p}$ denote the pre-training set of $N_p$ training samples, where $x_{pc}^{(i)} \in \mathbb{R}^{H\times W\times 3}$ is a non-occluded facial image, which has no expression label. We generate an occluded facial image $x_{po}^{(i)}$ from the non-occluded facial image $x_{pc}^{(i)}$. We randomly select the type of occlusion from $N_c$ types of occlusions, and the positions of occlusions are random. $x_{po}^{(i)}$ has a corresponding occlusion mask $M \in \{0,1\}^{H\times W}$, where $H$ and $W$ denote the height and width of the input image, respectively. $D_{fine\_train} = \{x_{fo}^{(i)}, y^{(i)}\}_{i=1}^{N_{fo}} \cup \{x_{fc}^{(i)}, y^{(i)}\}_{i=1}^{N_{fc}}$ denotes the training set of $N_{fo}+N_{fc}$ training samples, which have expression labels. We obtain $\{x_{fo}^{(i)}\}_{i=1}^{N_{fo}}$ by adding occlusion to $\{x_{fc}^{(i)}\}_{i=1}^{N_{fc}}$. $y^{(i)} \in \{0, 1, \ldots, N_e-1\}$ represents the expression label of the $i$th sample. $D_{test} = \{x_o^{(i)}\}_{i=1}^{N_1} \cup \{x_c^{(i)}\}_{i=1}^{N_2}$ denotes the $N_1$ occluded and $N_2$ non-occluded testing samples. Given $D_{pre\_train}$ and $D_{fine\_train}$, we first use self-supervised learning to obtain the initial parameters of extractor F on the pre-training set, and then fine-tune extractor F and classifier C on the training set. Our goal is to learn a network $f: \mathbb{R}^{H\times W\times 3} \to \{0, 1, \ldots, N_e-1\}$, which improves prediction for occluded expression images.
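As an illustration of how an occluded sample $x_{po}$ and its mask $M$ could be synthesized from a clean image $x_{pc}$, the sketch below pastes one randomly chosen occluder at a random position and returns the occluded image together with the binary mask; the occluder list, the size range, and the use of PIL for resizing are assumptions for illustration, not the paper's exact pipeline.

```python
import random
import numpy as np
from PIL import Image

def synthesize_occlusion(face, occluders, min_frac=0.2, max_frac=0.5):
    """Paste one randomly chosen occluder (uint8 RGB array) onto a face image
    (H x W x 3 uint8 array) at a random position.

    Returns (occluded image, mask), where the mask is 1 on occluded pixels and
    0 elsewhere, matching the binary mask M used in the pretext task.
    The relative size range of the occluder is an assumption.
    """
    h, w = face.shape[:2]
    occ = random.choice(occluders)
    scale = random.uniform(min_frac, max_frac)            # random occluder size
    oh, ow = max(1, int(h * scale)), max(1, int(w * scale))
    occ = np.array(Image.fromarray(occ).resize((ow, oh)))  # PIL resize takes (W, H)

    top = random.randint(0, h - oh)                        # random top-left corner
    left = random.randint(0, w - ow)

    occluded = face.copy()
    occluded[top:top + oh, left:left + ow] = occ[..., :3]
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + oh, left:left + ow] = 1.0
    return occluded, mask
```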



Fig. 1. Framework of the method. In the pretext task, extractor F extracts features; mask recognition network Uo predicts the location of the occlusion; and projection head R outputs the facial representation. In the downstream task, classifier C predicts the expression.

3.2 Pretext Task

Because of the infinite types and positions of occlusion, we always lack expression-annotated facial images with occlusions for learning a robust facial representation. Fortunately, we may synthesize a large number of facial images with various occlusions. Therefore, we design pretext tasks to learn representations from facial images with synthesized occlusions. To extract distinguishing and occlusion-insensitive features, we use many unlabeled facial images to obtain good initialization parameters of extractor F in the pretext task. Specifically, we add occlusion to a non-occluded facial image to obtain an occluded facial image. Then we introduce an occlusion detector after a certain layer of the feature extractor to predict the occlusion position. Through occlusion detection, the learned features from a certain layer of the feature extractor may be aware of the importance of facial areas. Then, we adopt contrastive learning to make the learned representation of a facial image and those of its variations with synthesized occlusions similar. Therefore, the learned features are robust to occlusion.

Similarity Loss. After we generate occluded facial images from non-occluded facial images, we expect that their representations will be similar. Contrastive learning maximizes the similarity between positive pairs and minimizes the similarity between negative pairs, which meets our needs. Given a mini-batch containing $K$ samples $\{x_{pc}^{(j)}\}_{j=1}^{K}$, we randomly add occlusion to $\{x_{pc}^{(j)}\}_{j=1}^{K}$ to obtain $\{x_{po}^{(j)}\}_{j=1}^{K}$. Then both $x_{po}^{(j)}$ and $x_{pc}^{(j)}$ are fed into extractor F to extract features $h_{po}^{(j)} = F(x_{po}^{(j)})$ and $h_{pc}^{(j)} = F(x_{pc}^{(j)})$. We use projection head R to obtain facial representations $z_{po}^{(j)} = R(h_{po}^{(j)})$ and $z_{pc}^{(j)} = R(h_{pc}^{(j)})$. Because $x_{po}^{(j)}$ is transformed from $x_{pc}^{(j)}$, we consider $(z_{po}^{(j)}, z_{pc}^{(j)})$ as the positive pair, and $(z_{po}^{(j)}, z)$ and $(z_{pc}^{(j)}, z)$ with $z \in \{z_{po}^{(i)}, z_{pc}^{(i)}\}_{i=1}^{K} \setminus \{z_{po}^{(j)}, z_{pc}^{(j)}\}$ as the negative pairs. In our case, the positive pair has a higher similarity than the negative pairs. We adopt cosine similarity to measure the similarity. The similarity loss is

$$L_{SS} = \frac{1}{2K} \sum_{j=1}^{K} \left( L_{s_{po}}^{j} + L_{s_{pc}}^{j} \right) \qquad (1)$$

where $L_{s_{po}}^{j}$ and $L_{s_{pc}}^{j}$ are the similarity losses of $z_{po}^{(j)}$ and $z_{pc}^{(j)}$, respectively. The specific forms of $L_{s_{po}}^{j}$ and $L_{s_{pc}}^{j}$ are

$$L_{s_{po}}^{j} = -\log \frac{\exp\left(\mathrm{sim}\left(z_{po}^{(j)}, z_{pc}^{(j)}\right)/\tau\right)}{\sum_{z \neq z_{po}^{(j)}} \exp\left(\mathrm{sim}\left(z_{po}^{(j)}, z\right)/\tau\right)}, \quad L_{s_{pc}}^{j} = -\log \frac{\exp\left(\mathrm{sim}\left(z_{po}^{(j)}, z_{pc}^{(j)}\right)/\tau\right)}{\sum_{z \neq z_{pc}^{(j)}} \exp\left(\mathrm{sim}\left(z_{pc}^{(j)}, z\right)/\tau\right)} \qquad (2)$$

where $\tau$ is the temperature parameter and $\mathrm{sim}(u, v) = u^{\top}v / \|u\|\|v\|$ denotes the cosine similarity between two vectors $u$ and $v$. By minimizing $L_{s_{po}}^{j}$ and $L_{s_{pc}}^{j}$, the similarity of the positive pair in the numerator is increased and the similarity of the negative pairs in the denominator is decreased.
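A minimal PyTorch sketch of Eqs. (1)–(2) is given below; it treats the clean/occluded projections of the same image as the positive pair and all other projections in the mini-batch as negatives. Batching details are assumptions, and the default temperature follows the paper's setting of τ = 2.

```python
import torch
import torch.nn.functional as F

def similarity_loss(z_pc, z_po, tau=2.0):
    """L_SS of Eq. (1): z_pc, z_po are (K, d) projections of the clean and
    occluded views of the same K images; tau is the temperature."""
    z = F.normalize(torch.cat([z_pc, z_po], dim=0), dim=1)   # (2K, d), unit norm
    sim = z @ z.t() / tau                                    # cosine similarities / tau
    k = z_pc.size(0)
    # positive index: clean j <-> occluded j
    pos = torch.arange(2 * k, device=z.device).roll(k)
    # exclude self-similarity from the denominator
    sim.fill_diagonal_(float('-inf'))
    # -log softmax over the remaining 2K-1 samples, evaluated at the positive,
    # averaged over all 2K anchors (the 1/2K factor of Eq. (1))
    return F.cross_entropy(sim, pos)
```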


Occlusion Loss. In facial images, the occluded area typically contains less or even no information about the facial expression. In the pretext task, the occluded position of synthesized occluded facial images is easy to obtain. When generating a synthesized occluded facial image, we can obtain a binary mask $M$ for the occlusion position, where 1 represents occlusion and 0 represents no occlusion. We expect the network to capture the occluded-position information; hence, we add a mask recognition network $U_o$ to obtain $\hat{M}$, which predicts the binary mask $M$. Because $M$ is a binary mask, we use the cross-entropy loss to optimize the result. The occlusion loss is

$$L_{mask} = -\frac{1}{H \times W} \sum_{j,k} \left( M[j,k] \log \hat{M}[j,k] + \left(1 - M[j,k]\right) \log\left(1 - \hat{M}[j,k]\right) \right) \qquad (3)$$

where $M[j,k]$ denotes whether the point $(j,k)$ is occluded, and $\hat{M}[j,k]$ represents the predicted probability that the point $(j,k)$ belongs to the occlusion. The point $(j,k)$ represents the point in the $k$th row and $j$th column. The overall pretext task loss is defined as

$$L_{pre} = L_{SS} + \lambda L_{mask} \qquad (4)$$

where $\lambda$ is a hyperparameter that balances the trade-off between the two losses. The overall loss considers both the occluded position and the facial features. We optimize the model by minimizing the overall pretext task loss.
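The occlusion loss of Eq. (3) and the combined pretext loss of Eq. (4) reduce to a per-pixel binary cross-entropy plus the similarity term; the sketch below assumes `m_hat` is the sigmoid output of $U_o$ already resized to the mask resolution, and it reuses `similarity_loss` from the previous sketch.

```python
import torch.nn.functional as F

def occlusion_loss(m_hat, m):
    """L_mask of Eq. (3): m_hat is the predicted occlusion probability map
    (values in [0, 1]) and m the ground-truth binary mask, both (B, H, W)."""
    return F.binary_cross_entropy(m_hat, m)

def pretext_loss(z_pc, z_po, m_hat, m, lam=0.2, tau=2.0):
    """L_pre of Eq. (4); lam = 0.2 follows the paper's setting of lambda."""
    return similarity_loss(z_pc, z_po, tau) + lam * occlusion_loss(m_hat, m)
```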

3.3 Downstream Task

We set occluded expression recognition as the downstream task. In the pretext task, we obtain extractor F for occluded facial images. In the downstream task, we take the feature extractor F obtained in the previous step and add a classifier C after it. Then we use the training set to fine-tune extractor F and classifier C. $\hat{y} = C(F(x))$ performs facial expression recognition for a facial image $x$, where $x \in D_{fine\_train}$. We use the cross-entropy loss to measure the difference between the prediction and the ground-truth label. The classification loss is

$$L_{cla} = L_{CE}(\hat{y}, y) \qquad (5)$$

3.4 Optimization

Among the losses, we only use $L_{SS}$ and $L_{mask}$ in the pretext task, and $L_{cla}$ in the downstream task. We first use $L_{SS}$ and $L_{mask}$ to obtain extractor F, and then use $L_{cla}$ to fine-tune extractor F and classifier C. We use Adam [7] to update the parameters.
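A compact sketch of this two-stage schedule follows. It assumes `F_net` returns the pooled backbone feature, the `conv1_features` hook is a hypothetical accessor for the feature map fed to U_o, `pretext_loss` refers to the earlier sketch, and loader structure and epoch counts are placeholders.

```python
import torch

def train(F_net, R, U_o, C, pretrain_loader, finetune_loader,
          pretext_epochs=10, finetune_epochs=10, lr=1e-4, lam=0.2, tau=2.0):
    """Stage 1: minimize L_pre on the unlabeled pre-training set.
    Stage 2: fine-tune F and C with L_cla on the labeled training set."""
    opt = torch.optim.Adam(list(F_net.parameters()) + list(R.parameters())
                           + list(U_o.parameters()), lr=lr)
    for _ in range(pretext_epochs):
        for x_pc, x_po, mask in pretrain_loader:        # clean image, occluded copy, mask
            z_pc, z_po = R(F_net(x_pc)), R(F_net(x_po))
            m_hat = U_o(F_net.conv1_features(x_po))     # hypothetical conv1 hook
            loss = pretext_loss(z_pc, z_po, m_hat, mask, lam, tau)
            opt.zero_grad(); loss.backward(); opt.step()

    opt = torch.optim.Adam(list(F_net.parameters()) + list(C.parameters()), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(finetune_epochs):
        for x, y in finetune_loader:                    # labeled facial images
            loss = ce(C(F_net(x)), y)                   # L_cla of Eq. (5)
            opt.zero_grad(); loss.backward(); opt.step()
```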

4 Experiment

4.1 Experimental Conditions

For the pretext task, we chose the large-scale face recognition dataset VGGFace2 [1] as the pre-training set. Following the experimental conditions in Pan et al.'s study [16] and Li et al.'s study [11], we conducted within-database experiments on synthesized occluded databases, that is, the Real-world Affective Faces Database (RAF-DB) [9], the AffectNet database [15], and the extended Cohn-Kanade database (CK+) [13]. When testing our method on the Facial Expression Dataset with Real Occlusions (FED-RO) [10], we merged AffectNet and RAF-DB as the training set. We also tested our method on the original test databases, that is, RAF-DB and AffectNet. Following the experimental conditions in Wang et al.'s study [18], we tested our model on Occlusion-AffectNet [18] and Occlusion-RAF-DB [18]. The details are as follows:


VGGFace2 includes 3.31M images from 9,131 subjects. Each subject has an average of 362.6 images. The database was downloaded from Google Image Search and has large variations in ethnicity, age, and pose. In our experiment, we used the VGGFace2 database as our pre-training set.

RAF-DB contains approximately 30K real-world images annotated with basic or compound expressions by 40 annotators. In our experiment, we only used images with seven basic expressions (i.e., neutral, happiness, sadness, surprise, fear, disgust and anger); hence, we used 12,271 images as the training set and 3,068 images as the test set.

AffectNet was collected from the internet by querying expression-related keywords. It contains more than 1M facial images, of which approximately 450K images were annotated by 12 human experts. In our experiment, we used facial images with seven basic expressions, which included approximately 280K images as the training set and 3,500 as the test set.

CK+ consists of 593 sequences from 123 subjects. Each image sequence begins with the onset frame and ends with the apex frame. We collected onset and apex frames as neutral and target expressions. In our experiment on CK+, we collected 636 facial images and adopted 10-fold subject-independent cross-validation.

Occlusion-AffectNet and Occlusion-RAF-DB were selected from the validation set of AffectNet and the test set of RAF-DB by Wang et al. [18]. On these two test sets, we used the same experimental conditions as those in Wang et al.'s study. Occlusion-AffectNet includes 683 realistic occluded facial images with eight basic expressions (i.e., neutral, happiness, sadness, surprise, fear, disgust, anger and contempt). Because Occlusion-AffectNet contains eight basic expressions, we used images with eight basic expressions from AffectNet, which included approximately 287K images as the training set, and Occlusion-AffectNet as the test set. Occlusion-RAF-DB contains 735 realistic occluded facial images with seven basic expressions. In our experiment, we used images from RAF-DB with seven basic expressions, which included 12,271 images as the training set, and Occlusion-RAF-DB as the test set.

Fig. 2. Examples of realistic occluded facial images in FED-RO and the synthesized occluded facial images in AffectNet. The first and second rows are realistic occluded facial images, and the third and fourth rows are synthesized occluded facial images.


FED-RO is the first facial expression dataset to present real occlusions in the wild, and was collected by Li et al. [11]. FED-RO contains 400 images, which are labeled with seven basic expressions. Because FED-RO is small, we only used it for cross-database evaluation.

To mimic real-world scenarios, we artificially synthesized occluded facial images by adding occluding objects at random locations in all databases except the three real-occlusion facial expression databases, that is, FED-RO, Occlusion-AffectNet, and Occlusion-RAF-DB. We used a variety of occlusion types to synthesize occluded facial images. The position of the occlusion in each facial image was random. To compare our method with Pan et al.'s method [16], we used the same types of occlusion as those in Pan et al.'s study: food, hands and drinks. In Fig. 2, we show some examples of realistic occluded facial images and synthesized occluded facial images in FED-RO and AffectNet. Because the Occlusion-AffectNet database in Wang et al.'s work and the AffectNet database in Pan et al.'s work use different numbers of facial expressions, we use C7 to denote the seven-class task and C8 to denote the eight-class task. AffectNet(C7) and Occlusion-AffectNet(C8) represent the seven- and eight-class facial expression recognition tasks, respectively.

We conducted ablation experiments to verify the effect of the pretext task and of its two loss functions, that is, the occlusion loss and the similarity loss, on AffectNet, Occlusion-AffectNet, RAF-DB, Occlusion-RAF-DB and FED-RO. First, we directly used ResNet-34 without pre-training as the baseline, which we refer to as the non-pretext task. Second, we used LSS or Lmask alone as the pretext task, denoted by LSS or Lmask. Finally, we used both LSS and Lmask as the pretext task, denoted by LSS + Lmask.

The implementation of the proposed method is based on the PyTorch framework. Because the AffectNet database is imbalanced, we resampled the data during training. We exploited ResNet-34 [5] to extract features. We built two fully connected layers as the small neural network projection head R to obtain the dimensional representation, two convolutional layers and an upsampling layer as the mask recognition network Uo, and two fully connected layers as the classifier C. We add the mask recognition network Uo after the Conv1 layer of the ResNet-34. In our experiments, we resized the facial images in CK+ to 48 × 48 pixels and other images to 224 × 224 pixels. When conducting experiments on CK+, we used five types of occlusion, that is, 8 × 8 occlusion, 16 × 16 occlusion, 24 × 24 occlusion, eye occlusion and mouth occlusion, to illustrate the robustness of the model to occlusions of different sizes. The batch size was 64. The hyperparameter λ was 0.2. The temperature parameter τ was 2. The learning rate of the network was 1e-4. We determined the hyperparameter in the loss function using grid search.
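The wiring described above can be sketched as follows with a torchvision ResNet-34; attaching Uo after the Conv1 stage and the 224 × 224 input size follow the text, while the mask-head channel widths, classifier hidden size, and projection dimension are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class OccludedFERModel(nn.Module):
    """Sketch of the architecture: ResNet-34 backbone F, projection head R,
    mask recognition network U_o after conv1, and expression classifier C."""

    def __init__(self, num_classes=7, proj_dim=128):
        super().__init__()
        self.backbone = resnet34(weights=None)
        self.proj = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                  nn.Linear(512, proj_dim))           # head R
        self.mask_head = nn.Sequential(                               # U_o (after conv1, 64 channels)
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
            nn.Upsample(size=(224, 224), mode='bilinear', align_corners=False),
            nn.Sigmoid())
        self.classifier = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                        nn.Linear(256, num_classes))  # classifier C

    def forward(self, x):
        b = self.backbone
        c1 = b.maxpool(b.relu(b.bn1(b.conv1(x))))     # conv1-stage feature map
        m_hat = self.mask_head(c1).squeeze(1)         # predicted occlusion mask
        h = b.layer4(b.layer3(b.layer2(b.layer1(c1))))
        h = torch.flatten(b.avgpool(h), 1)            # 512-d feature from F
        return self.proj(h), self.classifier(h), m_hat
```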

4.2 Experimental Results and Analysis

Analysis of Facial Expression Recognition Without Occlusions. The experimental results of facial expression recognition without occlusions are shown in Table 1. We trained the method on AffectNet(C7) and RAF-DB and tested


it on the original validation set of AffectNet(C7) and the original test set of RAF-DB, respectively. Table 1 yields the following observations.

Table 1. Experimental results of facial expression recognition without occlusions on the AffectNet(C7) and RAF-DB databases. (C7 represents seven classifications.)

| Methods | AffectNet(C7) | RAF-DB |
| PG-CNN [11] | 55.33 | 83.27 |
| gACNN [10] | 58.78 | 85.07 |
| Non-pretext | 57.09 | 82.53 |
| LSS | 59.66 | 84.09 |
| Lmask | 58.40 | 83.18 |
| LSS + Lmask | 60.20 | 85.95 |

First, using LSS or Lmask led to an improvement compared with the baseline non-pretext task. Specifically, the accuracies of LSS and Lmask were 2.57% and 1.31% higher than that of the non-pretext task on AffectNet(C7). The experimental results on the RAF-DB database demonstrated a similar trend. Guidance regarding both the feature and the occluded position helped the extractor F to learn more robust feature representations.

Table 2. Experimental results of facial expression recognition under synthesized occlusions on RAF-DB, AffectNet(C7) and CK+. (R8, R16 and R24 denote the sizes of the occlusions: 8 × 8, 16 × 16 and 24 × 24, respectively. AffectNet(C7) represents seven classifications.)

| Methods | RAF-DB | AffectNet(C7) | CK+ R8 | CK+ R16 | CK+ R24 | CK+ eye occluded | CK+ mouth occluded |
| RGBT [20] | 72.56 | 49.21 | 92.00 | 82.00 | 62.50 | 88.00 | 30.30 |
| WLS-RF [3] | 74.66 | 51.74 | 92.20 | 86.40 | 74.80 | 87.90 | 72.70 |
| PG-CNN [11] | 78.05 | 52.47 | 96.58 | 95.70 | 92.86 | 96.50 | 93.92 |
| gACNN [10] | 80.54 | 54.84 | 96.58 | 95.97 | 94.82 | 96.57 | 93.88 |
| Pan et al.'s work [16] | 81.97 | 56.42 | 97.80 | 96.86 | 94.03 | 96.86 | 93.55 |
| Xia et al.'s work [19] | 82.74 | 57.46 | 98.01 | 96.22 | 95.91 | 97.17 | 95.44 |
| Non-pretext | 81.45 | 54.06 | 96.91 | 94.65 | 94.18 | 95.12 | 94.33 |
| LSS | 83.18 | 57.09 | 97.64 | 96.07 | 94.97 | 96.23 | 95.60 |
| Lmask | 82.53 | 55.03 | 97.17 | 95.28 | 94.65 | 95.75 | 95.12 |
| LSS + Lmask | 84.06 | 58.40 | 98.27 | 96.70 | 95.59 | 97.33 | 96.07 |

Second, the similarity loss was more effective than the occlusion loss: its accuracies were 1.26% and 0.91% higher than those of using only the occlusion loss on AffectNet(C7) and RAF-DB, respectively. This may be because the test images are non-occluded; hence the occlusion loss had little effect.


Table 3. Experimental results of facial expression recognition under realistic occlusions on FED-RO, Occlusion-AffectNet(C8) and Occlusion-RAF-DB. (C8 represents eight classifications.)

| Methods | FED-RO | Occlusion-AffectNet(C8) | Occlusion-RAF-DB |
| PG-CNN [11] | 64.25 | – | – |
| gACNN [10] | 66.50 | – | – |
| Pan et al.'s work [16] | 69.75 | – | – |
| Xia et al.'s work [19] | 70.50 | – | – |
| RAN [18] | 67.98 | 58.50 | 82.72 |
| Non-pretext | 68.25 | 55.93 | 79.73 |
| LSS | 69.25 | 58.86 | 81.09 |
| Lmask | 69.00 | 57.25 | 80.41 |
| LSS + Lmask | 70.00 | 59.30 | 82.45 |

Table 4. Experimental results of the λ sensitivity analysis on Occlusion-AffectNet(C8).

| λ | Acc (%) |
| 0.02 | 58.51 |
| 0.2 | 59.30 |
| 2 | 58.42 |
| 20 | 57.54 |

Third, our method achieved the best performance using the similarity loss and occlusion loss together. Specifically, the accuracies of our method were 3.11%, 0.54% and 1.80% higher than those of the non-pretext task, LSS and Lmask on AffectNet(C7), and 3.42%, 1.86% and 2.77% higher on RAF-DB, respectively. The pretext task using both similarity loss and occlusion loss helped the downstream task to learn a more robust facial representation and make better predictions.

Analysis of Facial Expression Recognition with Synthesized Occlusions. The experimental results of facial expression recognition with synthesized occlusions are shown in Table 2. We trained our method on AffectNet(C7) and RAF-DB, and tested it on the synthesized occluded AffectNet(C7) and RAF-DB test sets. The table yields the following observations. First, our method achieved the best performance using both occlusion loss and similarity loss. Specifically, the accuracies of our method were 2.61%, 0.88% and 1.53% higher than those of the non-pretext task, LSS and Lmask on RAF-DB, and 4.34%, 1.31% and 3.37% higher on AffectNet(C7), respectively. Such observations are consistent with the experiments without occlusion. Second, the experiment on CK+ obtained good results. The results demonstrated that our method was robust to occlusions of different sizes. As the size


Fig. 3. Confusion matrices on the three realistic occlusion databases. (Row indices represent the ground-truth labels, whereas column indices represent the predictions.)

of the occlusion increased, the classification accuracy decreased. This demonstrated that the larger the size of the occlusion, the more useful information about the facial expression was occluded, and the more difficult it was to recognize facial expressions. The result for mouth occlusion is lower than that for eye occlusion. This demonstrated that the mouth contains more expression information than the eyes. The experimental results for synthesized occluded facial expression recognition were similar to those for non-occluded facial expression recognition. This demonstrated that our method was helpful for both non-occluded and occluded facial expression recognition.

Analysis of Facial Expression Recognition with Realistic Occlusions. The experimental results of facial expression recognition with realistic occlusions are shown in Table 3. We trained the method on AffectNet(C8) and RAF-DB and tested it on Occlusion-AffectNet(C8) and Occlusion-RAF-DB, respectively. We merged AffectNet(C7) and RAF-DB to obtain the training set and used FED-RO as the test set for the cross-database experiment. Table 3 yields the following observations.

First, using both occlusion loss and similarity loss as the pretext task obtained better classification accuracy. For example, the accuracies of our method were 3.37%, 0.44% and 2.05% higher than those of the non-pretext task, LSS, and Lmask on Occlusion-AffectNet(C8); 1.75%, 0.75% and 1.00% higher on FED-RO; and 2.72%, 1.36% and 2.04% higher on Occlusion-RAF-DB, respectively. In addition, we also conducted a sensitivity analysis on λ on Occlusion-AffectNet(C8), and the results are shown in Table 4. The results show that the λ we set can well balance the trade-off between the two losses.

Second, we investigated the per-expression-category classification performance on FED-RO, Occlusion-AffectNet(C8) and Occlusion-RAF-DB. The confusion matrices based on our method are shown in Fig. 3. These matrices show that our method achieved high accuracy in the happiness category. Many images of fear were mistakenly classified as surprise. The results demonstrated that it


was difficult to distinguish between fear and surprise. The accuracy for disgust was relatively low on the three test sets. This may be because the numbers of facial images showing disgust in AffectNet and RAF-DB were relatively small. Third, we obtained good results for realistic occlusions, synthesized occlusions, and the original test sets, which indicates the effectiveness of our method. The pretext task helped facial expression recognition. To visualize the perception ability of the mask recognition network Uo, we also created a result map of the generated masks, which is shown in Fig. 4. Our network found the locations of the synthesized occlusions. Our network also detected part of the real occluded positions.

Fig. 4. Examples of the learned occluded masks on AffectNet and FED-RO. The first two rows are the synthesized occluded images and corresponding learned occluded areas. The last two rows are realistic occluded images and corresponding learned occluded areas.

4.3 Comparison with Related Work

To illustrate the superiority of our method, we compared it with state-of-the-art methods, that is, RGBT [20], WLS-RF [3], PG-CNN [11] and gACNN [10], on realistic occluded databases, synthesized occluded databases, and the original (non-occluded) test sets. Because the Occlusion-AffectNet(C8) and Occlusion-RAF-DB databases were newly collected by Wang et al. [18], on these databases we only compared our method to RAN [18].

Our method obtained better results on most datasets. Table 1 shows that our method obtained better performance on RAF-DB and AffectNet(C7) for non-occluded facial images. Specifically, the accuracies of our method were 4.87% and 1.42% higher than those of PG-CNN and gACNN on AffectNet(C7), and 2.68% and 0.88% higher on RAF-DB, respectively. We also compared our method on synthesized occluded databases. As shown in Table 2, our method achieved better accuracy than PG-CNN, gACNN, Pan et al.'s method, and Xia et al.'s method


by 6.01%, 3.52%, 2.09% and 1.32% on RAF-DB, and 5.93%, 3.56%, 1.98% and 0.94% on AffectNet(C7), respectively. Our method also achieved superior performance under four types of occlusion on CK+: eye occlusion, mouth occlusion, 8×8 occlusion and 24×24 occlusion. RGBT uses hand-crafted features to help facial expression recognition, which lack generalization. WLS-RF does not use the guidance of non-occluded facial images, and the entire framework is not trained end to end. PG-CNN uses an attention network to extract local features from facial regions of the convolutional feature maps. gACNN introduces a global gated unit to complement the global information of facial images, which extends PG-CNN. Although these two methods use the attention mechanism to pay attention to non-occluded regions, they typically require facial landmarks to locate sub-regions. Pan et al.'s method uses non-occluded images to guide the occluded expression classifier. However, it trains two distinct networks for non-occluded and occluded images, so it cannot predict occluded and non-occluded facial emotion with a single network. Xia et al.'s method divides the dataset into three parts based on difficulty, and the network is then learned in three stages. The end of each stage may not be optimal and requires human control. Our method obtained competitive results, recognized occluded and non-occluded facial expressions end to end, and effectively used the occluded and non-occluded facial image information.

Finally, we compared the generalization ability of our method with that of related methods on realistic facial images. The experimental results with realistic occlusions in Table 3 demonstrate that our method outperformed most related methods. Specifically, our method achieved better accuracy than PG-CNN, gACNN and Pan et al.'s method by 5.75%, 3.50% and 0.25% on FED-RO, respectively. Our method also achieved 2.02% and 0.80% higher accuracy than RAN on FED-RO and Occlusion-AffectNet(C8), respectively. These results demonstrate that our method also obtains good results on realistic facial images.

5 Conclusion

In this study, we proposed an occluded expression recognition method through self-supervised learning. We designed pretext tasks related to occluded facial images. We adopted a similarity loss to make the representation of a facial image and those of its variations with synthesized occlusions similar, and used an occlusion loss to optimize the occlusion detection. Our method achieved better results than most state-of-the-art methods on both occluded and non-occluded facial images, which demonstrates its superiority.

Acknowledgements. This work was supported by National Natural Science Foundation of China 92048203 and a project from the Anhui Science and Technology Agency 202104h04020011.


References

1. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74. IEEE (2018)
2. Cornejo, J.Y.R., Pedrini, H.: Recognition of occluded facial expressions based on CENTRIST features. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1298–1302. IEEE (2016)
3. Dapogny, A., Bailly, K., Dubuisson, S.: Confidence-weighted local expression predictions for occlusion handling in expression recognition and action unit detection. Int. J. Comput. Vis. 126(2), 255–271 (2018)
4. Georgescu, M.I., Ionescu, R.T., Popescu, M.: Local learning with deep and handcrafted features for facial expression recognition. IEEE Access 7, 64827–64836 (2019)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
6. Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015)
8. Li, S., Deng, W.: Deep facial expression recognition: a survey. IEEE Trans. Affect. Comput. (2020)
9. Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2852–2861 (2017)
10. Li, Y., Zeng, J., Shan, S., Chen, X.: Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 28(5), 2439–2450 (2018)
11. Li, Y., Zeng, J., Shan, S., Chen, X.: Patch-gated CNN for occlusion-aware facial expression recognition. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2209–2214. IEEE (2018)
12. Lu, Y., Wang, S., Zhao, W., Zhao, Y.: WGAN-based robust occluded facial expression recognition. IEEE Access 7, 93594–93610 (2019)
13. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 94–101. IEEE (2010)
14. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802 (2017)
15. Mollahosseini, A., Hasani, B., Mahoor, M.H.: AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10(1), 18–31 (2017)
16. Pan, B., Wang, S., Xia, B.: Occluded facial expression recognition enhanced through privileged information. In: Proceedings of the 27th ACM International Conference on Multimedia, pp. 566–573 (2019)
17. Ranzato, M., Susskind, J., Mnih, V., Hinton, G.: On deep generative models with applications to recognition. In: CVPR 2011, pp. 2857–2864. IEEE (2011)
18. Wang, K., Peng, X., Yang, J., Meng, D., Qiao, Y.: Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 29, 4057–4069 (2020)
19. Xia, B., Wang, S.: Occluded facial expression recognition with step-wise assistance from unpaired non-occluded images. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2927–2935 (2020)
20. Zhang, L., Tjondronegoro, D., Chandran, V.: Random Gabor based templates for facial expression recognition in images with facial occlusion. Neurocomputing 145, 451–464 (2014)
21. Zhao, R., Liu, T., Xiao, J., Lun, D.P.K., Lam, K.M.: Deep multi-task learning for facial expression recognition and synthesis based on selective feature sharing. Int. Conf. Pattern Recogn. (2020)

Heterogeneous Avatar Synthesis Based on Disentanglement of Topology and Rendering

Nan Gao, Zhi Zeng, GuiXuan Zhang(B), and ShuWu Zhang

Institute of Automation, Chinese Academy of Sciences, Beijing, China
{nan.gao,zhi.zeng,guixuan.zhang,shuwu.zhang}@ia.ac.cn

Abstract. There are obvious structural and color discrepancies among different heterogeneous domains. In this paper, we explore the challenging heterogeneous avatar synthesis (HAS) task considering topology and rendering transfer. HAS transfers the topology as well as the rendering style of the referenced face to the source face, to produce high-fidelity heterogeneous avatars. Specifically, first, we utilize a Rendering Transfer Network (RT-Net) to render the grayscale source face based on the color palette of the referenced face. The grayscale features and color style are injected into RT-Net based on adaptive feature modulation. Second, we apply a Topology Transfer Network (TT-Net) to conduct heterogeneous facial topology transfer, where the image content of RT-Net is transferred based on AdaIN controlled by a heterogeneous identity embedding. Comprehensive experimental results show that the disentanglement of rendering and topology is beneficial to the HAS task, and that our HASNet has comparable performance with other state-of-the-art methods.

Keywords: Image synthesis · Style transfer · Disentanglement representation learning

1 Introduction

Avatar means another stylized identity in the heterogeneous domain. As for face images, there are various topology patterns for facial components, such as 3D cartoon, 2D anime, sketch, NIR, real-world or other domains. Style transfer methods [11,13,19,20,26] change the textural style and preserve the content of the source image, guided by the referenced image. These methods are universal to different heterogeneous domains but ignore the facial topology transfer. To adapt to the target face distribution, some GAN-based methods [17,24,28] are proposed to address two-domain face translations. However, these methods have obvious topology and color distortions. The high-fidelity HAS task is also a style transfer task, which is supposed to possess high-fidelity facial topology and global color consistencies with the reference-domain face, while preserving as much as possible the other content and attributes


Fig. 1. HASNet realizes high-fidelity heterogeneous avatar synthesis based on topology and rendering transfer, where HASNet handles various heterogeneous domains.

of the source face, e.g., facial pose, expression, background and hair. Heterogeneous faces have diverse facial component topology patterns. Moreover, since different domains might have a large color distribution discrepancy, it is a non-trivial challenge to implement the HAS task without considering rendering consistency. Therefore, to produce more vivid heterogeneous avatars, both the rendering and topology styles deserve to be manipulated. As for color rendering, ColorThief is used to extract prominent color palette values. As for topology adaptation, a pretrained identity extractor is capable of capturing discriminative facial component shapes, e.g., ArcFace extracts the prominent structural features of a face. In our paper, we conduct rendering transfer based on the color palette from ColorThief (http://lokeshdhakar.com/projects/color-thief/), and implement topology transfer based on the manifold of ArcFace [6]. This disentanglement framework realizes high-fidelity heterogeneous avatar synthesis. As shown in Fig. 1, the results of HASNet show a good performance on style transfer of global color and facial topology, while preserving the other contents and attributes of the source faces.

Our contributions are threefold, as follows.

– We propose a two-stage framework to explore the challenging and meaningful heterogeneous avatar synthesis task. We propose an effective rendering transfer framework. By considering adaptive modulation of color and spatial content, RT-Net achieves topology-aware and high-quality colorization in various heterogeneous domains.


Fig. 2. Two-stage framework of HASNet. RT-Net helps HASNet implement higher-fidelity heterogeneous rendering. TT-Net refines the facial topology style of the heterogeneous avatars.

– Based on the first stage, we propose TT-Net to realize high-fidelity and controllable facial topology transfer across various heterogeneous domains.
– We observe HASNet variants and conduct comparative experiments with state-of-the-art universal and GAN-based style transfer methods. Experimental results demonstrate the good performance of HASNet in both visual and quantitative manners.

2 Related Work

2.1 Universal Style Transfer

There are some universal style transfer methods [11,13,19,26]. [11] explores the domain-aware characteristics from the texture and topology features of referenced images to handle both artistic and photo-realistic style transfer. AdaIN [13] makes the variance and mean values of the source features aligned with those of the reference features, which achieves real-time universal style transfer. WCT [19] enables universal style transfer based on disentangled whitening and coloring operations. SANet [26] integrates the style patterns according to the semantic spatial attentional map of the source content.

2.2 GAN-Based Methods

StyleGANs [14–16] synthesize high-resolution images based on noise disentanglement and style modulation. UGATIT [17] modulates the generator applying the discriminative domain-aware style embedding and proposes AdaILN. NICE-GAN [5] reuses the discriminator to implement unsupervised image-to-image translation. [7] proposes an unsupervised face rotation method that considers disentangling the facial shape and identity. Toonify [28] trains the generator and discriminator for the 3D cartoon and real-world domains, respectively. Then the base


FS-Ada [24] proposes a transfer learning scheme that adapts a source model to diverse target domains with limited data.

2.3 Cross-Domain Image Synthesis

As for VIS-NIR translation, [8,9] demonstrate that dual heterogeneous face mapping is beneficial to recognizing heterogeneous faces. [33] utilizes inter-supervision and intra-supervision style transfer for the Sketch-VIS task. [32] designs a memorized Grayscale-VIS colorization model with AdaIN modulation. As for heterogeneous research datasets, the CASIA NIR-VIS 2.0 [18], Oulu-CASIA NIR-VIS [4] and BUAA-VisNir Face [12] databases are widely used in the NIR-VIS task. The CUHK Face Sketch FERET (CUFSF) [34], IIIT-D Viewed Sketch database [2] and XM2VTS database [22] are established in Sketch-VIS research. [17] proposes the selfie2anime dataset in the anime domain. IIIT-CFW [23] provides faces with exaggerated drawings. The Tufts dataset [25] presents subjects captured in different imaging domains, e.g., RGB, NIR and Thermal.

3 Approach


Fig. 3. RT-Net helps HASNet implement higher-fidelity heterogeneous rendering.

We design a two-stage framework to synthesize heterogeneous avatars, as shown in Fig. 2. In stage I, given a source face X_S, we obtain the grayscale image X_S^g by transforming from RGB to Lab space. We apply ColorThief to extract the dominant color palette of the reference face X_R. Based on the spatial features of X_S^g and the color style z_C, we modulate RT-Net to obtain the rendering transfer face X_RT. In stage II, we conduct topology transfer based on X_RT to synthesize X_TT, whose facial topology distribution is consistent with X_R. In this way, HASNet generates high-fidelity heterogeneous avatars.
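For concreteness, palette extraction with the ColorThief library (footnote 1) and the Lab-based grayscale conversion might look like the following sketch; the file names are placeholders, and the 20-color palette size follows the C_{X_R} ∈ R^{20×3} embedding used in Sect. 3.1.

```python
# Sketch (not the authors' code): dominant-palette extraction with ColorThief
# and Lab-based grayscale conversion of the source face.
import numpy as np
from colorthief import ColorThief          # pip install colorthief
from skimage import io
from skimage.color import rgb2lab

# Reference face: extract the 20 most dominant RGB colors (C_{X_R} in R^{20x3}).
palette = ColorThief("reference_face.png").get_palette(color_count=20, quality=10)
C_XR = np.asarray(palette, dtype=np.float32) / 255.0    # shape (20, 3), in [0, 1]

# Source face: take the L channel of Lab space as the grayscale image X_S^g.
X_S = io.imread("source_face.png")[..., :3]             # HxWx3 RGB (drop alpha if any)
X_S_gray = rgb2lab(X_S)[..., 0] / 100.0                 # HxW, lightness in [0, 1]

print(C_XR.shape, X_S_gray.shape)
```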

3.1 RT-Net

As shown in Fig. 3, the color consistency between the facial area and the background is higher with RT-Net than without it, which indicates the importance of RT-Net for the HAS task. More examples are shown in Fig. 5.


In RT-Net, X_S^g is mapped to the multi-scale spatial feature maps z_G = {z_G^1, z_G^2, ..., z_G^n} via a U-Net [29]. A multi-FC mapping network MLP is leveraged to convert the extracted color palette embedding C_{X_R} ∈ R^{20×3} to the color style z_C ∈ R^{256}, which is used to modulate the multi-scale decoder of RT-Net. We integrate these two embeddings with the Rendering Transfer Module (RTM) in an adaptive alignment manner.

Specifically, the input feature h^i ∈ R^{C_h^i × H^i × W^i} of RTM_i is aligned under the guidance of two groups of learned parameters from z_G and z_C, respectively. It is formulated as

h_I^i = \frac{h^i - \mu_I^i}{\sigma_I^i},    (1)

\{G, C\}_I^i = \gamma_{\{G,C\}_I}^i \odot h_I^i + \beta_{\{G,C\}_I}^i,    (2)

where μ_I^i and σ_I^i, i.e., the mean and standard deviation of h^i, are used as instance-aware normalization parameters. Let z_G^i ∈ R^{C_G^i × H_G^i × W_G^i} and z_C ∈ R^{C_C × 1} be the grayscale and color embeddings, respectively. The affine transform parameters γ_{G_I}^i and β_{G_I}^i ∈ R^{C_h^i × H^i × W^i} are obtained from z_G^i using a convolutional layer, similar to SPADE [27]. Meanwhile, γ_{C_I}^i and β_{C_I}^i ∈ R^{C_h^i × H^i × W^i} are mapped from z_C using a fully connected transform, inspired by AdaIN [13]. To adaptively integrate the spatial and color modulation, the attention maps M_{G_I}^i and M_{C_I}^i are generated from h_I^i by a convolution layer followed by a sigmoid. The instance denormalization result H_I^i is denoted as

H_I^i = G_I^i \odot M_{G_I}^i + C_I^i \odot M_{C_I}^i,    (3)

where ⊙ represents the element-wise product, and M_{G_I}^i + M_{C_I}^i = 1.

As for the objective functions, in the training stage of RT-Net we use a Huber loss on the residual ΔX_{RT} = |X_{RT} − X_S| to constrain the colored image reconstruction:

L_{rec} = \begin{cases} \frac{1}{2}\,\Delta X_{RT}^{2}, & \Delta X_{RT} \le \delta \\ \delta \cdot \left(\Delta X_{RT} - \frac{1}{2}\delta\right), & \text{otherwise} \end{cases}    (4)
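For reference, Eq. (4) written out directly in PyTorch (a sketch; the value of δ is not given here, so 1.0 is only a placeholder, and torch.nn.HuberLoss implements the same piecewise form):

```python
# Sketch: the Huber reconstruction loss of Eq. (4), written out explicitly.
import torch

def huber_rec_loss(x_rt: torch.Tensor, x_s: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Piecewise quadratic/linear penalty on the per-pixel residual |X_RT - X_S|."""
    residual = (x_rt - x_s).abs()
    quadratic = 0.5 * residual ** 2                     # branch for residual <= delta
    linear = delta * (residual - 0.5 * delta)           # branch for residual >  delta
    return torch.where(residual <= delta, quadratic, linear).mean()

# torch.nn.HuberLoss(delta=1.0) gives the same value under mean reduction.
x_rt, x_s = torch.rand(2, 3, 512, 512), torch.rand(2, 3, 512, 512)
print(huber_rec_loss(x_rt, x_s).item())
```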

Moreover, we optimize a discriminator for the generated X_RT, conditioned on X_S^g and C_{X_S}. The total loss of RT-Net is

L_{RT\text{-}Net} = L_{adv}(X_S^g, C_{X_S}, X_{RT}) + \lambda_{rec} L_{rec}.    (5)

At test time, the color style is extracted from the heterogeneous reference face X_R to lessen the color gap between the source and reference domains.
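To make Eqs. (1)–(3) concrete, the following PyTorch sketch shows one possible reading of an RTM block; the layer names, channel sizes and the complementary attention map M_C = 1 − M_G are our assumptions, not the authors' implementation.

```python
# Sketch of one Rendering Transfer Module (RTM_i), Eqs. (1)-(3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RTM(nn.Module):
    def __init__(self, c_h: int, c_g: int, c_c: int = 256):
        super().__init__()
        self.norm = nn.InstanceNorm2d(c_h, affine=False)         # Eq. (1)
        # SPADE-like spatial modulation from the grayscale feature z_G^i.
        self.gamma_g = nn.Conv2d(c_g, c_h, 3, padding=1)
        self.beta_g = nn.Conv2d(c_g, c_h, 3, padding=1)
        # AdaIN-like channel modulation from the color style z_C.
        self.gamma_c = nn.Linear(c_c, c_h)
        self.beta_c = nn.Linear(c_c, c_h)
        # Attention map that fuses the two branches, Eq. (3).
        self.att = nn.Conv2d(c_h, c_h, 3, padding=1)

    def forward(self, h, z_g, z_c):
        h_i = self.norm(h)                                        # Eq. (1)
        z_g = F.interpolate(z_g, size=h.shape[-2:], mode="nearest")
        g_i = self.gamma_g(z_g) * h_i + self.beta_g(z_g)          # Eq. (2), spatial branch
        gc, bc = self.gamma_c(z_c), self.beta_c(z_c)
        c_i = gc[..., None, None] * h_i + bc[..., None, None]     # Eq. (2), color branch
        m_g = torch.sigmoid(self.att(h_i))                        # one reading of M_G
        m_c = 1.0 - m_g                                           # enforces M_G + M_C = 1
        return g_i * m_g + c_i * m_c                              # Eq. (3)

rtm = RTM(c_h=128, c_g=64)
out = rtm(torch.rand(1, 128, 64, 64), torch.rand(1, 64, 64, 64), torch.rand(1, 256))
print(out.shape)  # torch.Size([1, 128, 64, 64])
```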

3.2 TT-Net

After stage I, the global rendering style has been transferred to the reference heterogeneous domain. In stage II, TT-Net further transfers the facial topology style to produce vivid HAS results.



Fig. 4. L_SCX helps HASNet implement higher-fidelity topology transfer.

In the multi-level topology transfer modulation (TTM) module, we inject z_id into each feature level of TT-Net by implementing layer-aware AdaIN operations, where we use the StyledConv block of StyleGAN [16]. It is denoted as

\mathrm{AdaIN}(Con^i, \gamma_{id}^i, \beta_{id}^i) = \gamma_{id}^i \odot \frac{Con^i - \mu^i}{\sigma^i} + \beta_{id}^i,    (6)

where Con^i ∈ R^{C_{Con}^i × H^i × W^i} is the spatial information of TTM_i, and σ^i and μ^i are the standard deviation and mean of Con^i. The parameters γ_id^i and β_id^i ∈ R^{C_{Con}^i × H^i × W^i}, predicted from z_id, address the instance denormalization. The initial content feature Con^0 is extracted by VGG [30] from X_RT.

As for the objective functions, we use the reconstruction loss as the pixel-level supervision between X_{TT}^{ij} and X_S^i when the source face X_S^i and the reference face X_R^j are the same. It is denoted as

L_{rec} = \begin{cases} \frac{1}{2} \left\| X_{TT}^{ij} - X_S^{i} \right\|_2^{2}, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}    (7)

We use an identity consistency loss on X_{TT}^{ij} for HAS face synthesis. Specifically, we measure the distance between the Arcface manifolds of X_{TT}^{ij} and X_R^j via

L_{id} = 1 - \left\langle z_{id}(X_{TT}^{ij}),\, z_{id}(X_R^{j}) \right\rangle,    (8)

where ⟨·,·⟩ denotes cosine similarity and z_id is the pretrained Arcface [6]. Furthermore, TT-Net utilizes the contextual similarity loss [21] between X_{TT}^{ij} and X_R^j to improve the contextual distribution perception. It is denoted as

L_{SCX} = -\log\!\left( CX\!\left( F_{vgg}^{l}(X_{TT}^{ij}),\, F_{vgg}^{l}(X_R^{j}) \right) \right),    (9)

where l denotes the relu3_2 and relu4_2 layers of the pretrained VGG19 model [30]. At the same time, we introduce a content contextual similarity loss between X_{TT}^{ij} and X_S^i for content preservation of the source face:

L_{CCX} = -\log\!\left( CX\!\left( F_{vgg}^{l}(X_{TT}^{ij}),\, F_{vgg}^{l}(X_S^{i}) \right) \right).    (10)
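A compact sketch of the identity term in Eq. (8); the `arcface` module below stands for any pretrained, frozen Arcface embedder and is an assumption, and the contextual terms in Eqs. (9)–(10) would additionally require a contextual-loss implementation such as [21].

```python
# Sketch: identity consistency loss of Eq. (8) with a frozen Arcface-style embedder.
import torch
import torch.nn.functional as F

def identity_loss(x_tt: torch.Tensor, x_r: torch.Tensor, arcface: torch.nn.Module) -> torch.Tensor:
    """L_id = 1 - cos( z_id(X_TT), z_id(X_R) ), averaged over the batch."""
    with torch.no_grad():
        z_ref = arcface(x_r)                      # reference embedding, no gradient needed
    z_gen = arcface(x_tt)                         # generated-face embedding
    cos = F.cosine_similarity(z_gen, z_ref, dim=1)
    return (1.0 - cos).mean()

# Stand-in embedder only to make the sketch runnable; replace with real Arcface weights.
arcface = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 112 * 112, 512))
x_tt, x_r = torch.rand(2, 3, 112, 112), torch.rand(2, 3, 112, 112)
print(identity_loss(x_tt, x_r, arcface).item())
```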

As shown in Fig. 4, using only L_CCX fails to synthesize the topology of the sketch domain, while using only L_SCX transfers excessive reference styles, ignoring the spatial content preservation of the source face.


Fig. 5. Stage I and stage II of HASNet. RT-Net helps HASNet implement higher-fidelity heterogeneous rendering.

More ablation analyses are shown in Fig. 9. Therefore, TT-Net combines both L_CCX and L_SCX. The total loss of TT-Net is

L_{TT\text{-}Net} = L_{adv} + \lambda_{id} L_{id} + \lambda_{rec} L_{rec} + \lambda_{CCX} L_{CCX} + \lambda_{SCX} L_{SCX}.    (11)

4 Experiment

4.1 Dataset

We collect three kinds of heterogeneous domains to conduct the HAS task:
– Lighting condition variants. We select daytime portraits with variable attributes and identities from FFHQ [15]. We collect some night portraits taken in the evening or at night, and we randomly select some samples from the CASIA NIR-VIS 2.0 dataset [18].
– Art drawing variants. 3D cartoons, anime images, sketches, exaggerated drawings and oil paintings are collected from the Toonify dataset [28], the selfie2anime dataset [17], CUHK Face Sketch FERET (CUFSF) [34], IIIT-CFW [23] and MetFaces², respectively. Moreover, there are some sculptures of famous people (used only for research), and we collect some role faces of Beijing Opera. Note that we obtain super-resolution counterparts using [31] for the low-quality face samples.

² https://github.com/postite/metfaces-dataset.



Fig. 6. Ablation study of HASNet concerning the dimension of the initial content feature in TT-Net.


Fig. 7. Qualitative comparison with other state-of-the-art universal style transfer methods in NIR, sketch and anime style domains.

– Life dimension variants. We collect some Buddha statue images (used only for research) from the Internet, the faces from the movie Avatar, and some drawings of ancient people in Chinese culture.

Note that we take all kinds of heterogeneous faces, i.e., about 4000 images, as a whole training set in Stage I, where the true color palette of the source face is used to modulate RT-Net. In contrast, TT-Net takes randomly paired faces from different heterogeneous domains as the source and reference faces, respectively. There are 13 kinds of heterogeneous domains handled by the single generator of TT-Net, which differs from other GAN-based methods [17,24,28] that focus on specific two-domain translations.


Fig. 8. Additional comparison results of HASNet and other methods. HASNet realizes high-fidelity HAS based on topology and rendering transfer, where the identity and attributes are more controllable with better background preservation.

4.2 Implementation Details

The source and target faces of our dataset are aligned and cropped based on 5 facial landmarks [31]. We resize all samples to 512 × 512 resolution. We set Con^0 ∈ R^{512×32×32}, z_id ∈ R^{512×1}, z_{Palette} ∈ R^{20×3}, and z_C ∈ R^{256×1}. There are 7 RTM and 10 TTM modules. In Eq. (5), λ_rec = 10. In Eq. (11), λ_adv = λ_SCX = 1, λ_rec = 100, λ_id = 10, and λ_CCX = 0.5. We set the batch size to 4. RT-Net and TT-Net are trained separately. We adapt the architectures of our discriminator networks from [1] and [15] for RT-Net and TT-Net, respectively. In the StyleGAN architecture [15], the coarse-resolution (4 × 4 to 32 × 32) layers (1–7) are used to control shape modulation. Moreover, we conduct a toy experiment and find that the dimension of the initial content feature of X_RT has an important impact on TT-Net. As shown in Fig. 6, Con16 exhibits more background distortion, which is not enough for the high-fidelity HAS task that considers background topology preservation. Con64 imposes more constraints from the source content, so it fails to transfer the shape style of the eyes in the Avatar style domain.


Our HASNet uses the initial Con^0 with 32 × 32 resolution and shows visually satisfying HAS behavior.

4.3 Qualitative Evaluation

As shown in Fig. 7, DSTN [11] synthesizes distorted and messy faces with the wrong global color style in the NIR and sketch domains. AdaIN [13] achieves good textural transfer in the global view but is not competent at transferring the shape styles of the eyes, nose and mouth. WCT [19] also has obvious color distortion, especially in the NIR domain. SANet [26] produces artifacts on the face area and background. Deep analogy [20] is another universal style transfer method that conducts patch matching based on VGG features, which is time-consuming and prone to matching errors. Furthermore, although its local components are more vivid than those of [11,13,19,26], the eye size is still similar to that of the source faces. HASNet realizes sufficient topology and rendering transfer: RT-Net bridges the color gap between the source and reference images, and TT-Net modifies the source facial topology towards the reference component distribution.

As for GAN-based methods, we compare HASNet with Toonify [28], FS-Ada [24] and U-GAT-IT [17], as shown in Fig. 8. Toonify finds an optimized latent code of the source and then feeds this code to the reference-domain generator. Its results have obvious background distortions (rows 7, 9) and poor heterogeneous rendering effects (rows 1–6), and the noses of its avatars are less identifiable than those of HASNet. FS-Ada produces easily detectable artifacts concerning topology and color, especially in the anime domain. U-GAT-IT shows unstable cross-domain translation performance, e.g., topological and color artifacts on the facial area (rows 2 and 7), and does not respect facial occlusions (row 2). HASNet achieves higher topology and rendering consistency with the reference faces, while better preserving the background content of the source images.

Given these qualitative evaluations, a user study is warranted, considering that human eyes are highly sensitive to topology and color appearance. More details are introduced in Sect. 4.4.

4.4 User Study

A user study plays an important role in the quality evaluation of the HAS task. First, we briefly explain the HAS task and invite 10 users to carefully observe the source, reference and HAS faces. Each type of face has 30 samples, where the NIR, sketch and anime domains have 10 samples each. The observers give scores from 1 to 10 for four aspects: (a) transferred topology perception, (b) content preservation of the source face, (c) color transfer of the reference face, and (d) overall preference, where higher HAS quality is reflected by higher scores. Finally, we collect 300 score tables, where each table contains 52 human decisions about different methods and indexes. The average scores are displayed in Table 1. Specifically, transferred topology perception measures the identification degree and topological completeness of the HAS results with respect to the target style images.


Table 1. User study results of HASNet variants, as well as the universal and GAN-based style transfer methods.

Methods             Topology↑  Content↑  Color↑  Preference↑
DSTN [11]           1.2        1.9       7.8     2.7
AdaIN [13]          0.9        4.4       8.3     3.2
WCT [19]            1.8        3.7       8.4     3.3
SANet [26]          1.9        1.7       7.9     2.5
Deep analogy [20]   2.2        4.6       8.1     4.5
Toonify [28]        8.0        4.5       3.5     7.1
FS-Ada [24]         6.5        4.9       7.6     7.9
U-GAT-IT [17]       7.2        5.5       5.8     8.4
HASNet              8.9        9.1       8.5     9.0
  w/o RT-Net        8.8        8.9       5.1     8.4
  w/ L_CCX          7.3        8.9       4.7     4.7
  w/ L_SCX          8.5        7.1       7.0     3.3
  w/o L_CX          6.9        6.5       3.1     2.0

Content preservation of the source face focuses on the face pose, expression, hair and other background content; these contents are supposed to be faithful to the source image to some extent. The HAS task mainly transfers the global color and facial component topology. The color index represents how well the output obeys the color style of the reference faces, as well as the diffusion degree of the colorization. The preference score reflects the users' overall preference. HASNet obtains the best scores on the topology, content and color aspects. Moreover, HASNet has better perception scores than Toonify and U-GAT-IT. We show more cross-domain HAS results in Fig. 10, which indicates that our approach synthesizes controllable heterogeneous avatars.

4.5 Ablation Study

As shown in Fig. 9, we present diverse HAS results of HASNet variants. Specifically, RT-Net successfully transfers the color style of the reference images to the source faces. As shown in row 5, HASNet achieves higher consistency between the facial area and the background (cols 4 and 5), whereas applying TT-Net directly to the original source image results in a large color discrepancy between the results and the reference faces, which hinders high-fidelity heterogeneous avatar synthesis. The variant w/ L_CCX maintains more spatial grayscale information from the RT-Net faces (col 4), which stays far from the topology style of the reference sketch face; moreover, this variant produces low-fidelity cartoon eyes (cols 6–8). The variant w/ L_SCX transfers excessive reference style, e.g., hair (cols 1–3), and carries uncontrollable artifacts (cols 5–8). w/o L_CX means HASNet without L_SCX and L_CCX, which results in more color artifacts.


Fig. 9. Ablation study of HASNet. We show the results of RT-Net, Source + TT-Net, RT-Net + TT-Net (HASNet), only w/ L_CCX, only w/ L_SCX, and w/o L_CX.

4.6 Quantitative Evaluation

Quantitative results serve only as a reference for performance; it is more informative to consider both the objective scores and the user study. For each reference heterogeneous domain, we select 10 images from each of the other 12 domains as the source dataset in the test stage of the HAS task. We employ two evaluation metrics concerning image fidelity (FID [10], KID [3]), which are used in [17] to measure cross-domain synthesis quality. For the evaluation of topology transfer, we measure the identity distance between the HAS results and the reference face based on Arcface. Moreover, we calculate the color distance of the first 20 prominent color values between the HAS result and the reference face based on ColorThief. As shown in Table 2, we compare HASNet with strong style transfer approaches, including fine-tuned DSTN [11], AdaIN [13], WCT [19] and SANet [26], in three distinctive domains, i.e., NIR, sketch and anime. Note that Deep analogy [20] requires a long inference time, so we only show its visual comparisons in Fig. 7.
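One plausible way to compute the identity and color distances described above (the exact distance definitions and the pairing of palette colors by dominance order are our assumptions, not taken from the paper):

```python
# Sketch: identity distance (Arcface cosine) and dominant-palette color distance.
import numpy as np
import torch
import torch.nn.functional as F

def id_distance(emb_gen: torch.Tensor, emb_ref: torch.Tensor) -> float:
    """1 - cosine similarity between Arcface embeddings of result and reference."""
    return float(1.0 - F.cosine_similarity(emb_gen, emb_ref, dim=0))

def color_distance(palette_gen: np.ndarray, palette_ref: np.ndarray) -> float:
    """Mean L2 distance between the 20 dominant colors, paired by dominance order."""
    return float(np.linalg.norm(palette_gen - palette_ref, axis=1).mean())

emb_gen, emb_ref = torch.rand(512), torch.rand(512)          # placeholder embeddings
pal_gen, pal_ref = np.random.rand(20, 3), np.random.rand(20, 3)
print(id_distance(emb_gen, emb_ref), color_distance(pal_gen, pal_ref))
```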


Table 2. Quantitative evaluation on the CASIA NIR-VIS 2.0 [18], CUHK [34] and selfie2anime [17] datasets. Our model has better ID scores and comparable color scores. As for FID, due to the specificity of the HAS problem, we need to preserve the background, which affects FID because the FID score is calculated on the whole image. In our opinion, a controllable HAS task should consider both vivid avatar synthesis and topology preservation of the background; otherwise the essential heterogeneous scene information of the source will be seriously lost (Fig. 8).

Domains  Methods       ID × 10↓  Color × 10↓  FID↓     KID × 100↓  Overall↓
NIR      DSTN [11]     9.3       0.3          185.48   14.6        209.68
         AdaIN [13]    9.3       0.4          212.52   19.01       241.23
         WCT [19]      9.1       0.3          210.72   18.54       238.66
         SANet [26]    9.2       0.5          199.25   18.17       227.12
         HASNet        4.5       0.6          195.7    18.11       218.91
         w/o RT-Net    4.4       1.2          206.89   19.41       231.9
         w/ L_CCX      3.3       1.4          206.95   19.13       230.78
         w/ L_SCX      2.8       0.5          177.26   16.28       196.84
         w/o L_CX      2.6       1.2          210.72   21.11       235.63
CUHK     DSTN [11]     8.7       0.4          162.74   12.3        184.14
         AdaIN [13]    9         0.5          187.42   16.93       213.85
         WCT [19]      8.8       0.5          185.07   16.22       210.59
         SANet [26]    8.6       0.4          166.71   15.2        190.91
         HASNet        5         0.7          166.73   15          187.43
         w/o RT-Net    4.9       1.3          177.37   16.14       199.71
         w/ L_CCX      3.8       1.2          177.81   15.8        198.61
         w/ L_SCX      3.1       0.6          173.09   17.07       193.86
         w/o L_CX      2.8       1.3          186.43   18.1        208.63
Anime    DSTN [11]     7.9       0.4          138.5    7.45        154.25
         AdaIN [13]    7.6       0.6          115.29   5.56        129.05
         WCT [19]      7.4       0.3          124.65   7.72        140.07
         SANet [26]    7.6       0.7          113.47   6.09        127.86
         HASNet        4.6       0.7          112.3    7.36        124.96
         w/o RT-Net    4.7       1.2          111.92   7.24        125.06
         w/ L_CCX      2.5       1.2          155.46   12.11       171.27
         w/ L_SCX      2.1       0.5          101.8    6.19        110.59
         w/o L_CX      1.8       1.1          128.75   9.90        141.55

Concretely, [11,13,19,26] have slightly better color consistency scores than HASNet. However, for DSTN [11] in the NIR domain, the results show a purple tint, which is perceived as worse by human eyes; this demonstrates the necessity of the user study. These state-of-the-art universal style transfer methods have worse identity consistency scores than HASNet. DSTN obtains better FID and KID scores than HASNet in the NIR and sketch domains, but produces messy visual results, as shown in Fig. 7.


Table 3. Quantitative evaluation with GAN-based methods on the CASIA NIR-VIS 2.0 [18], CUHK [34] and selfie2anime [17] datasets. However, a better FID score does not prove better performance on the HAS task. U-GAT-IT has better FID scores, but with visual distortions of color, shape, and background (Fig. 8).

Domains  Methods         ID × 10↓  Color × 10↓  FID↓     KID × 100↓
NIR      Toonify [28]    8.3       0.9          109.73   7.36
         FS-Ada [24]     8.4       1.2          191.53   19.5
         U-GAT-IT [17]   7.7       0.8          118.94   9.9
         HASNet          4.3       0.7          195.46   17.94
CUHK     Toonify [28]    7.3       1.7          105.06   8.47
         FS-Ada [24]     7.3       0.3          117.46   7.82
         U-GAT-IT [17]   7.6       0.3          89.84    6.77
         HASNet          4.8       0.8          160.61   13.64
Anime    Toonify [28]    5.7       1.3          173.3    13.65
         FS-Ada [24]     5.2       1.6          192.43   17.61
         U-GAT-IT [17]   4.7       1.3          98.48    3.79
         HASNet          4.5       0.9          125.21   8.67


Fig. 10. More HAS results across diverse heterogeneous domains.

The w/o RT-Net variant shows a color score drop, as well as obvious performance degradation in FID and KID in the NIR and sketch domains. w/ L_CCX and w/o L_CX cause many color artifacts (Fig. 9) because of the missing L_SCX constraint, and thus get worse FID and KID scores. In contrast, using only L_SCX yields lower FID and KID scores, e.g., in the anime domain, but with severe topology degradations (Fig. 9). We further evaluate the performance against GAN-based methods, as shown in Table 3, using 100 real-world faces from FFHQ as the source images.


Toonify and U-GAT-IT obtain good FID and KID scores, while our HASNet achieves better identity and color consistency scores.

5 Conclusion

We explore the challenging heterogeneous avatar synthesis (HAS) task considering topology and rendering transfer. RT-Net and TT-Net alleviate the color and structural discrepancies between the HAS results and the reference faces, while preserving other source contents and attributes. Comprehensive experimental results in various heterogeneous domains show that the disentanglement of rendering and topology is beneficial to the HAS task. HASNet produces controllable and high-fidelity heterogeneous avatars. Acknowledgments. This work is supported by the National Key R&D Program of China (2019YFB1406202).

References

1. Bahng, H., Yoo, S., Cho, W., et al.: Coloring with words: guiding image colorization through text-based palette generation. In: Proceedings of ECCV, pp. 431–447 (2018)
2. Bhatt, H.S., Bharadwaj, S., Singh, R., et al.: Memetically optimized MCWLD for matching sketches with digital face images. IEEE TIFS 7(5), 1522–1535 (2012)
3. Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying MMD GANs. In: ICLR (2018)
4. Chen, J., Yi, D., Yang, J., et al.: Learning mappings for face synthesis from near infrared to visual light images. In: 2009 IEEE Conference on CVPR, pp. 156–163. IEEE (2009)
5. Chen, R., Huang, W., Huang, B., et al.: Reusing discriminators for encoding: towards unsupervised image-to-image translation. In: Proceedings of CVPR, pp. 8168–8177 (2020)
6. Deng, J., Guo, J., Xue, N., et al.: Arcface: additive angular margin loss for deep face recognition. In: Proceedings of CVPR, pp. 4690–4699 (2019)
7. Duan, B., Fu, C., Li, Y., et al.: Cross-spectral face hallucination via disentangling independent factors. In: Proceedings of CVPR, pp. 7930–7938 (2020)
8. Fu, C., Wu, X., Hu, Y., et al.: Dual variational generation for low shot heterogeneous face recognition. In: NIPS (2019)
9. Fu, C., Wu, X., Hu, Y., et al.: DVG-face: dual variational generation for heterogeneous face recognition. IEEE Trans. PAMI 44(6), 2938–2952 (2021)
10. Heusel, M., Ramsauer, H., Unterthiner, T., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS (2017)
11. Hong, K., Jeon, S., Yang, H., et al.: Domain-aware universal style transfer. In: Proceedings of ICCV, pp. 14609–14617 (2021)
12. Huang, D., Sun, J., Wang, Y.: The BUAA-VisNir face database instructions. School of Computer Science and Engineering, Beihang University, Beijing, China, Technical report IRIP-TR-12-FR-001, vol. 3 (2012)
13. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of ICCV, pp. 1501–1510 (2017)


14. Karras, T., Aittala, M., Laine, S., et al.: Alias-free generative adversarial networks. In: NIPS (2021)
15. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of CVPR, pp. 4401–4410 (2019)
16. Karras, T., Laine, S., Aittala, M., et al.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of CVPR, pp. 8110–8119 (2020)
17. Kim, J., Kim, M., Kang, H., et al.: U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In: ICLR (2020)
18. Li, S., Yi, D., Lei, Z., et al.: The CASIA NIR-VIS 2.0 face database. In: Proceedings of the IEEE Conference on CVPR Workshops, pp. 348–353 (2013)
19. Li, Y., Fang, C., Yang, J., et al.: Universal style transfer via feature transforms. In: NIPS, pp. 386–396 (2017)
20. Liao, J., Yao, Y., Yuan, L., et al.: Visual attribute transfer through deep image analogy. In: SIGGRAPH (2017)
21. Mechrez, R., Talmi, I., Zelnik-Manor, L.: The contextual loss for image transformation with non-aligned data. In: Proceedings of ECCV, pp. 768–783 (2018)
22. Messer, K., Matas, J., Kittler, J., et al.: XM2VTSDB: the extended M2VTS database. In: Second International Conference on Audio and Video-Based Biometric Person Authentication, vol. 964, pp. 965–966. Citeseer (1999)
23. Mishra, A., Rai, S.N., Mishra, A., Jawahar, C.V.: IIIT-CFW: a benchmark database of cartoon faces in the wild. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 35–47. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46604-0_3
24. Ojha, U., Li, Y., Lu, J., et al.: Few-shot image generation via cross-domain correspondence. In: Proceedings of CVPR, pp. 10743–10752 (2021)
25. Panetta, K., Wan, Q., Agaian, S., et al.: A comprehensive database for benchmarking imaging systems. IEEE Trans. PAMI 42(3), 509–520 (2018)
26. Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: Proceedings of CVPR, pp. 5880–5888 (2019)
27. Park, T., Liu, M.Y., Wang, T.C., et al.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of CVPR, pp. 2337–2346 (2019)
28. Pinkney, J.N., Adler, D.: Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020)
29. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd ICLR (2015)
31. Wang, X., Li, Y., Zhang, H., et al.: Towards real-world blind face restoration with generative facial prior. In: CVPR (2021)
32. Yoo, S., Bahng, H., Chung, S., et al.: Coloring with limited data: few-shot colorization via memory augmented networks. In: Proceedings of CVPR, pp. 11283–11292 (2019)
33. Zhang, M., Wang, R., Gao, X., et al.: Dual-transfer face sketch-photo synthesis. IEEE Trans. Image Process. 28(2), 642–657 (2018)
34. Zhang, W., Wang, X., Tang, X.: Coupled information-theoretic encoding for face photo-sketch recognition. In: CVPR 2011, pp. 513–520. IEEE (2011)

Pose and Action

Focal and Global Spatial-Temporal Transformer for Skeleton-Based Action Recognition

Zhimin Gao¹, Peitao Wang¹, Pei Lv¹, Xiaoheng Jiang¹, Qidong Liu¹, Pichao Wang², Mingliang Xu¹(B), and Wanqing Li³

¹ Zhengzhou University, Zhengzhou, China
{iegaozhimin,ielvpei,jiangxiaoheng,ieqdliu,iexumingliang}@zzu.edu.cn
² DAMO Academy, Alibaba Group (U.S.) Inc., Bellevue, USA
³ AMRL, University of Wollongong, Wollongong, Australia
[email protected]

Abstract. Despite the great progress achieved by transformers in various vision tasks, they remain underexplored for skeleton-based action recognition, with only a few attempts. Besides, these methods directly calculate the pair-wise global self-attention equally for all the joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and short-range temporal dynamics. In this work, we propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer), equipped with two key components: (1) FG-SFormer: a focal-joint and global-part coupling spatial transformer. It forces the network to focus on modelling correlations for both the learned discriminative spatial joints and human body parts, respectively. The selected focal joints eliminate the negative effect of non-informative ones when accumulating the correlations. Meanwhile, the interactions between the focal joints and body parts are incorporated to enhance the spatial dependencies via mutual cross-attention. (2) FG-TFormer: a focal and global temporal transformer. Dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints or body parts, which is found to be vitally important to make the temporal transformer work. Extensive experimental results on three benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show that our FG-STFormer surpasses all existing transformer-based methods and compares favourably with state-of-the-art GCN-based methods.

Keywords: Action recognition · Skeleton · Spatial-temporal transformer · Focal joints · Motion patterns

1 Introduction

Human action recognition has long been a crucial and active research field in video understanding, since it has a broad range of applications such as human-computer interaction, intelligent video surveillance and robotics [4,34,44].


In recent years, skeleton-based action recognition has gained increasing attention with the advent of cost-effective depth cameras like Microsoft Kinect [52] and advanced pose estimation techniques [2], which make skeleton data more accurate and accessible. By representing the action as a sequence of joint coordinates of the human body, the highly abstracted skeleton data is compact and robust to illumination, human appearance changes and background noise.

Effectively modelling the spatial-temporal correlations and dynamics of joints is crucial for recognizing actions from skeleton sequences. The dominant solutions in recent years are graph convolutional networks (GCNs) [46], as they can model the irregular topology of the human skeleton. By designing advanced graph topologies or traversal rules, GCN-based methods greatly improve the recognition performance [30,40]. Meanwhile, the recent success of the Transformer [41] has gained significant interest and performance boosts in various computer vision tasks [3,9,29,32]. For skeleton-based action recognition, one would expect the self-attention mechanism in the transformer to naturally capture effective correlations of joints in both spatial and temporal dimensions for action categorization, without enforcing the articulation constraints of the human body as GCNs do. However, there are only a few transformer-based attempts [33,38,51], and they devise a hybrid model of GCN and transformer [33] or a multi-task learning framework [51]. How to utilize self-attention to learn effective spatial-temporal relations of joints and representative motion features is still a thorny problem.

Moreover, most of these transformer-based methods directly calculate the global one-to-one relations of joints for the spatial and temporal dimensions respectively. Such a strategy undervalues the spatial interactions of discriminative local joints and short-term temporal dynamics for identifying crucial action-related patterns. On the one hand, since not all joints are informative for recognizing actions [16,27], these methods suffer from the influence of irrelevant or noisy joints by accumulating correlations with them via the attention mechanism, which can harm the recognition. On the other hand, given that the vanilla transformer lacks the inductive bias [29] to capture the locality of temporally structured data, it is difficult for these methods to directly model effective temporal relations of joints globally over a long input sequence.

To tackle these issues, we propose a novel end-to-end Focal and Global Spatial-Temporal Transformer network, dubbed FG-STFormer, to effectively capture relations of the crucial local joints and the global contextual information in both spatial and temporal dimensions for skeleton-based action recognition. It is composed of two components: FG-SFormer and FG-TFormer. Intuitively, each action can be distinguished by the co-movement of (1) some critical local joints, (2) global body parts, and/or (3) joint-part interactions. For example, as shown in Fig. 1, actions such as taking a selfie and kicking mainly involve important joints of the hands, head and feet, as well as related body parts of the arms and legs, while actions like sit down primarily require understanding of body-part cooperation and dynamics. Based on the above observations, at the late stage of the network we adaptively sample the informative spatial local joints (focal joints) for each action, and force the network to focus on modelling the correlations among them via multi-head self-attention, without involving non-informative joints.


Meanwhile, in order to compensate for the missing global co-movement and spatial structure information, we incorporate the dependencies among human body parts using self-attention. Furthermore, interactions between the body parts and the focal joints are explicitly modelled via mutual cross-attention to enhance their spatial collaboration. All of these are achieved by the proposed FG-SFormer.

Fig. 1. The proposed FG-SFormer (bottom) learns correlations for both adaptively selected focal joints and body parts, as well as the joint-part interactions via cross-attention. FG-TFormer (top) models the explicit local temporal relations of joints or parts, as well as the global temporal dynamics.

The FG-TFormer is designed to model the temporal dynamics of joints or body parts. We find that straightforwardly using the vanilla temporal transformer leads to ineffective temporal relations and poor recognition performance, and that one of the key culprits is the absence of a local bias, which makes it challenging for the transformer to focus on effective temporal motion patterns in the long input. Taking these factors into consideration, we integrate dilated temporal convolutions into the multi-head self-attention mechanism to explicitly encode the short-term temporal motions of a joint or part from its neighbors, which equips the transformer with a local inductive bias. The short-range feature representations of all the frames are further fused by the global self-attention weights to embrace the global contextual motion information. Thus, the designed strategy enables the transformer to learn both important local and effective global temporal relations of the joints and human body parts in a unified structure, which is validated as critical to making the temporal transformer work.

To summarize, the contributions of this work lie in four aspects:

1. We propose a novel FG-STFormer network for skeleton-based action recognition, which can effectively capture the discriminative correlations of focal joints as well as the global contextual motion information in both the spatial and temporal dimensions.


2. We design a focal-joint and global-part coupling spatial transformer, namely FG-SFormer, to model the correlations of adaptively selected focal joints and those of human body parts. A joint-part mutual cross-attention is integrated to enhance the spatial collaboration.
3. We introduce an FG-TFormer to explicitly and effectively capture both the short- and long-range temporal dependencies of the joints and body parts.
4. Extensive experimental results on three datasets highlight the effectiveness of our method, which surpasses all existing transformer-based methods.

2 Related Work

Skeleton-Based Action Recognition. With the great progress achieved in skeleton-based action recognition, existing works can be broadly divided into three groups, i.e., RNN-, CNN-, and GCN-based methods. RNNs concatenate the coordinates of all joints in one frame and treat the sequence as a time series [10,19,24,49,53]. Some works design specialized network structures, like trees [26,42], to make the RNN aware of spatial information. CNN-based methods transform a skeleton sequence into a pseudo-image via hand-crafted mappings [11,13,18,21,22,28,45], and then use popular networks to learn the spatial and temporal dynamics in it. GCN-based methods, like ST-GCN [46], enable a more natural spatial topology representation of skeleton joints by organizing them as a non-Euclidean graph, where spatial correlations are modelled for bone-connected joints. As the fixed graph topology (or adjacency matrix) is not flexible enough to model dependencies among spatially disconnected joints, many subsequent methods focus on designing high-order or multi-scale adjacency matrices [12,15,20,23,30] and dynamically adjusted graph topologies [5,23,37,48,50]. Nevertheless, these manually devised joint traversal rules limit the flexibility to learn more effective spatial-temporal dynamics of joints for action recognition.

Transformer Based Methods. Several recent works extend the Transformer [41] to the spatial and temporal dimensions of skeleton-based action recognition. Among them, DSTA [38] is the first to use self-attention to learn joint relations, whereas in practice a spatial transformer interleaved with temporal convolution is employed for some typical datasets. ST-TR [33] adopts a hybrid architecture of GCN and transformer in a two-stream network, with each stream replacing the GCN or temporal convolution with spatial or temporal self-attention. STST [51] introduces a transformer network in which the spatial and temporal dimensions are processed in parallel; in addition, the network is trained together with multiple self-supervised learning tasks.

3 Proposed Method

In this section, we first briefly review the basic spatial and temporal Transformer blocks (referred to as Basic-SFormer and Basic-TFormer blocks, respectively) used by most existing skeleton-based action recognition methods [33,38], which also form the basis of our network. Then the proposed Focal and Global Spatial-Temporal Transformer (FG-STFormer) is introduced in detail.

3.1 Basic Spatial-Temporal Transformer on Skeleton Data

Vanilla Transformer (V-Former) Block. The vanilla transformer [41] block consists of two important modules: multi-head self-attention (MSA) and point-wise feed-forward network (FFN). Let an input composed of N elements with C-dimensional features be X ∈ R^{N×C}. For an MSA having H heads, X is first linearly projected to a set of queries Q, keys K and values V. Then, the scaled dot-product attention of head h is calculated as

\mathrm{Attention}(Q^h, K^h, V^h) = \mathrm{softmax}\!\left( \frac{Q^h (K^h)^{T}}{\sqrt{d}} \right) V^h = A^h V^h,    (1)

where Q^h, K^h, V^h ∈ R^{N×d}, with d = C/H being the feature dimension of one head, and A^h ∈ R^{N×N} is the attention map. MSA concatenates the outputs of all the heads and feeds them into the FFN module, which generally consists of a number of linear layers to transform the features.

Basic Spatial Transformer (Basic-SFormer) Block. For a skeleton sequence of T frames and N joints, let the C-dimensional input be X = {X_t ∈ R^{N×C}}_{t=1}^{T}. The Basic-SFormer block extends the V-Former block [41] to the spatial dimension. It computes the inter-joint correlations for each frame X_t via Eq. (1) and generates an attention map A_t^h ∈ R^{N×N}, with each element (A_t^h)_{ij} representing the spatial correlation score between joints i and j. Then, the features of each joint are updated as the weighted sum of the values of all the joints. For the entire skeleton sequence, T spatial attention maps are produced.

Basic Temporal Transformer (Basic-TFormer) Block. By extending the V-Former to the temporal dimension, a Basic-TFormer learns global-range dynamics of a joint along the entire sequence. It rearranges the input as X = {X_n ∈ R^{T×C}}_{n=1}^{N} to tackle the temporal dimension. With Eq. (1), one of the N attention maps A_n^h ∈ R^{T×T} is computed for the nth joint. Each row in it stands for the dependencies of this joint across all the frames.
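For reference, a minimal multi-head self-attention sketch matching Eq. (1); applying it over the joint axis yields a Basic-SFormer-style block and over the frame axis a Basic-TFormer-style block (this is our illustrative code, not the authors' implementation):

```python
# Sketch: scaled dot-product multi-head self-attention of Eq. (1).
import torch
import torch.nn as nn

class MSA(nn.Module):
    def __init__(self, dim: int, heads: int = 3):
        super().__init__()
        self.heads, self.d = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, C), N = joints or frames
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, H, N, d)
        q, k, v = (t.view(B, N, self.heads, self.d).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)   # A^h
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.rand(2, 25, 64)                          # 25 joints of one frame, C = 64
print(MSA(dim=64, heads=2)(x).shape)               # torch.Size([2, 25, 64])
```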

3.2 Focal and Global Spatial-Temporal Transformer Overview

An overview of the proposed FG-STFormer network is depicted in Fig. 2. It consists of two stages, whose primary components are the FG-SFormer block and the FG-TFormer block. The former is designed for the late stage of the network to model the correlations of the sampled focal joints, the global co-movement of human body parts in the spatial dimension, and the interactions between the focal joints and body parts. The latter is devised to explicitly learn important local relations and global motion dynamics in the temporal dimension, and is used in both stages. The two stages are therefore assigned specific responsibilities: stage 1 aims to learn correlations for all joint pairs, as generally done, so as to provide effective representations for stage 2 to mine reliable focal joints and part embeddings, while stage 2 targets modelling both the discriminative relations among focal joints and the global movement information of body parts.


Fig. 2. Architecture of the proposed Focal and Global Spatial-Temporal Transformer (FG-STFormer). L1 and L2 are the number of layers in Stage 1 and Stage 2 respectively.

These two stages cooperate with each other to make the network learn discriminative and comprehensive spatial-temporal motion patterns for recognition. Specifically, given a raw skeleton sequence X_in ∈ R^{N×T×C_0}, a linear layer is first applied to project the 2D or 3D joint coordinates from C_0 to a higher dimension C_1, generating feature X_1 ∈ R^{N×T×C_1}. Then, X_1 goes through the two successive stages of FG-STFormer. Stage 1 sequentially stacks L_1 layers, each consisting of a Basic-SFormer block and our FG-TFormer block. At the end of stage 1, we obtain the high-level feature representations X_2 ∈ R^{N×T×C_2}. They are then passed into stage 2, where the network splits into two branches. One branch adaptively selects K focal joints for each frame of the sequence and discards the remaining non-informative ones, producing features X_2^J ∈ R^{K×T×C_2}. Meanwhile, the other branch partitions the joints into P global-level human body parts and generates the feature tensor X_2^P ∈ R^{P×T×C_2}. X_2^J and X_2^P are then passed through L_2 layers that interleave FG-SFormer and FG-TFormer blocks. In particular, one FG-SFormer block consists of a Basic-SFormer sub-block and a joint-part cross-attention sub-block to sufficiently model the spatial interaction information of actions. Stage 2 then produces output features X_3^J and X_3^P from the two branches, respectively. Global average pooling (GAP) is applied to them, and the results are concatenated along the feature channels, producing features X_out ∈ R^{1×1×C_out}. With these, FG-STFormer finally performs classification using two fully connected layers and a Softmax classifier.

3.3 Focal and Global Spatial Transformer (FG-SFormer)

The proposed FG-SFormer block, designed for stage 2 of the network, learns critical and comprehensive spatial structure and motion patterns based on two observations. On the one hand, there is often a subset of key joints that plays a vital role in action categorization [16,27], while the other joints are irrelevant or even noisy for action analysis. Especially for transformer-based methods, the features of one joint can be influenced by those non-informative joints when integrating the features of all the joints. Therefore, it is beneficial to identify the focal joints and concentrate on them in the deep layers of the network, after the shallow layers have sufficiently learned the relationships among all the joints. On the other hand, it is not enough to focus only on the movement of focal joints. The movement of human body parts carries crucial global contextual motion information for recognizing an action [10,14].


Meanwhile, the interactions between joints and parts convey rich kinematic information that can be exploited to fully mine action-related patterns. Therefore, we propose to learn relations for adaptively identified focal joints and for human body parts, as well as their interactions, in the spatial dimension. Three modules are designed to achieve this: (i) focal joints selection; (ii) global-level part partition encoding; and (iii) joint-part cross-attention.

Focal Joints Selection. In the joint branch of stage 2, we design a ranking-based strategy to adaptively sample the focal joint subset for each frame of an action sequence from the input X_2 ∈ R^{N×T×C_2}, and discard the non-informative joints. To achieve this, we leverage a trainable projection vector W_p ∈ R^{C_2×1} and the sigmoid function to predict the informativeness scores S ∈ R^{N×T} for all the joints in each frame as

S = \mathrm{sigmoid}(X_2 W_p / \|W_p\|),    (2)

where each element S_{ij} represents the informativeness score of the ith joint in the jth frame; the larger the score, the more informative the joint. We sort the scores of all the joints for each frame and take the features corresponding to the top K joints with the largest scores to form the features X_2^J ∈ R^{K×T×C_2} of the focal joint subset:

idx = \mathrm{sort}(S, K), \qquad X_2^J = X_2(idx, :, :),    (3)

where idx contains the indices of the selected joints with the largest scores. X_2^J is then fed into the Basic-SFormer sub-block introduced in Sect. 3.1 to calculate the correlations only for those focal joints and update their feature embeddings. The Basic-SFormer block/sub-block used in both stages 1 and 2 is depicted in Fig. 3 (a). It uses the sine and cosine position encoding [41] to encode the joint type information. In the MSA module with H heads, the spatial attention map A_t is calculated for each frame. As in [38], we add a global regularization attention map A_g shared by all the sequences. The FFN module consists of a linear layer followed by a Leaky ReLU activation [31].

Global-Level Part Partition Encoding. We explicitly model the correlations between global-level body parts in the other branch of stage 2 of our FG-STFormer. The joints are partitioned into P parts based on the physical skeleton structure and human prior knowledge. To obtain feature embeddings of the P parts from X_2, we concatenate the features of joints belonging to the same body part and then transform them into one part-level feature embedding via a linear layer shared by all parts. This generates the part embedding X_2^P ∈ R^{P×T×C_2}, which is then passed into the Basic-SFormer sub-block shown in Fig. 3 (a) to model the one-to-one part relations and update the features correspondingly.
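A possible PyTorch reading of the focal-joint selection in Eqs. (2)–(3) and of the part partition encoding; aggregating the per-frame scores by their mean and the five-joint part grouping are our simplifications, not the paper's exact choices.

```python
# Sketch: focal-joint selection (Eqs. (2)-(3)) and part partition encoding.
import torch
import torch.nn as nn

def select_focal_joints(x2: torch.Tensor, w_p: torch.Tensor, k: int):
    """x2: (N, T, C) joint features; w_p: (C, 1) projection. Returns (K, T, C)."""
    s = torch.sigmoid((x2 @ w_p).squeeze(-1) / w_p.norm())       # scores S: (N, T)
    idx = s.mean(dim=1).topk(k).indices                          # top-K joints (mean over frames)
    return x2[idx]                                               # X_2^J: (K, T, C)

def part_embedding(x2: torch.Tensor, parts, proj: nn.Linear):
    """Concatenate joints of each part and map to one part-level embedding."""
    feats = [proj(x2[p].permute(1, 0, 2).flatten(1)) for p in parts]   # each: (T, C)
    return torch.stack(feats)                                    # X_2^P: (P, T, C)

N, T, C, K = 25, 32, 256, 15
x2, w_p = torch.rand(N, T, C), torch.rand(C, 1)
parts = [list(range(i, i + 5)) for i in range(0, 25, 5)]         # placeholder 5-joint parts
proj = nn.Linear(5 * C, C)                                       # linear layer shared by all parts
print(select_focal_joints(x2, w_p, K).shape, part_embedding(x2, parts, proj).shape)
```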


Fig. 3. (a) The Basic-SFormer block/sub-block used in Stages 1 and 2. (b) The Joint-part cross-attention (JP-CA) sub-block used in the FG-SFormer block.

Joint-Part Cross-Attention. To enable information diffusion across the focal joints and body parts and to model their co-movement, we devise a joint-part cross-attention sub-block, termed JP-CA. It uses multi-head cross-attention to interact and diffuse the features of the two branches. Here, we present the JP-CA from the part branch to the focal joint branch as an example, as shown in Fig. 3 (b). For notational convenience, we omit the subscripts of X_2^J and X_2^P. Let Q_J, K_J and V_J be the queries, keys and values mapped from the joint-branch features X^J, and Q_P, K_P and V_P be those from the part-branch features X^P, respectively. The part-to-joint cross-attention takes Q_J as queries and K_P and V_P as keys and values, and is calculated as

\mathrm{Attention}(Q_J, K_P, V_P) = \mathrm{softmax}\!\left( \frac{Q_J K_P^{T}}{\sqrt{d}} \right) V_P = A_{jp} V_P,    (4)

(4)

where d is the feature dimension of one head. The attention map Ajp ∈ RK×P models the joint-part correlations and is used to aggregate part features for each focal joint. Other operations in this sub-block is same as those in the adopted Basic-SFormer sub-block. Notably, JP-CA is adaptive to actions, which is flexible to capture distinct collaborative patterns for input actions. Analogously, the cross-attention from the joint-branch to part-branch can be defined in similar operations. 3.4

Focal and Global Temporal Transformer (FG-TFormer)

Though temporal transformer has been applied in skeleton-based action recognition in existing works [1,33,51], it is rarely effectively deployed solely with spatial transformer in a single-stream architecture or in pure transformer-based models, largely because: (i) it is difficult for the self-attention to directly model effective temporal relations globally for distinguishing actions over the long input sequence; and (ii) the lack of inductive biases of transformer. To address these issues, we propose to assist transformer in focusing on both the important local and the global temporal relations of joints explicitly, and


It utilizes dilated temporal convolution to generate the values V in MSA, which works in two aspects: (i) explicitly learning the short-term temporal motion representation of a joint from its neighboring frames; and (ii) introducing beneficial local inductive biases to the transformer. Meanwhile, the attention map generated by MSA models the global temporal correlations. Therefore, the resulting fused joint representations integrate both local temporal relations and global contextual information. The same effect is achieved for the body part representations when FG-TFormer is applied to the part branch.

Fig. 4. The pipeline of the proposed FG-TFormer block.

Specifically, let the input feature tensor of the FG-TFormer block in layer l (l = 1, 2, ..., L) be X^l ∈ R^{N×T×C}, where L is the total number of layers. First, the absolute position encoding is used to encode the temporal order information. Then, the queries and keys Q_h^l, K_h^l ∈ R^{N×T×C/H} of head h among the H heads are generated via the usual linear projection in FG-TSA. For V_h^l, in contrast to existing works, we utilize a dilated temporal convolution with kernel size k_t × 1 and dilation rate d_t, denoted as V_h^l = \mathrm{TCN}_{dilate}(X^l)_h ∈ R^{N×T×C/H}. Then, the global self-attention map is calculated via Eq. (1) and utilized to fuse the local feature representation V_h^l, so that each joint representation is injected with its global contextual dynamics. The features are then reshaped and linearly transformed with W_h ∈ R^{C/H × C/H}, followed by concatenating the output features of all the heads and conducting a linear transform with W_O ∈ R^{C×C} using a Leaky ReLU activation. The output of the FG-TFormer block, X^{l+1}, is obtained by adding the shortcut from the input X^l. The whole process is formulated as

X^{l+1} = \mathrm{Concat}\left[ \mathrm{head}(X^l)_1, \ldots, \mathrm{head}(X^l)_H \right] W_O + X^l,
\mathrm{head}(X^l)_h = \left[ \mathrm{Attention}(Q_h^l, K_h^l)\, \mathrm{TCN}_{dilate}(X^l)_h \right] W_h + \mathrm{TCN}_{dilate}(X^l)_h.    (5)

Besides, we halve the temporal resolution of a sequence when generating V, using a convolution of stride 2, whenever the feature channels are doubled for an FG-TFormer block. This hierarchical structure reduces the computational cost.
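One way to realize the FG-TSA of Eq. (5) for a single head, with a dilated temporal convolution producing the values and the global attention map fusing them; the kernel size and dilation below follow Sect. 4.2, while the rest is our sketch.

```python
# Sketch: focal and global temporal self-attention (FG-TSA), single head, Eq. (5).
import torch
import torch.nn as nn

class FGTSAHead(nn.Module):
    def __init__(self, dim: int, k_t: int = 7, d_t: int = 2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        # Dilated temporal convolution producing the values V (local motion encoding).
        self.v_tcn = nn.Conv1d(dim, dim, kernel_size=k_t, dilation=d_t,
                               padding=d_t * (k_t - 1) // 2)
        self.w_h = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (N, T, C) for one skeleton
        q, k = self.q(x), self.k(x)
        v = self.v_tcn(x.transpose(1, 2)).transpose(1, 2)                  # TCN_dilate(X)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)  # (N, T, T)
        return self.w_h(attn @ v) + v            # global fusion plus local shortcut

x = torch.rand(25, 128, 64)                      # 25 joints, 128 frames, C = 64
print(FGTSAHead(64)(x).shape)                    # torch.Size([25, 128, 64])
```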

4 Experiments

4.1 Dataset

NTU-RGB+D 60 (NTU-60) [35] contains 56,880 sequences in 60 classes. It is collected from 40 subjects and provides the 3D locations of 25 human body joints. It recommends two benchmarks for evaluation: (1) Cross-subject (X-Sub): training data is from 20 subjects and test data from the other 20 subjects. (2) Cross-view (X-View): sequences captured by camera views 2 and 3 are taken as training data, and those captured by camera view 1 as test data.

NTU-RGB+D 120 (NTU-120) [25] has 120 classes and 113,945 samples captured from 32 camera setups and 106 subjects. It recommends two benchmarks: (1) Cross-subject (X-Sub): training data is from 53 subjects, and test data from the other 53 subjects. (2) Cross-setup (X-Set): samples with even setup IDs are used for training, and samples with odd setup IDs for testing.

Northwestern-UCLA (NW-UCLA) [43] is captured by three Kinect cameras from three viewpoints. It contains 1,494 sequences in 10 action categories, each performed by 10 actors. The same evaluation protocol as in [26] is used: training data comes from the first two cameras and test data from the other camera.

4.2 Implementation Details

Our FG-STFormer model consists of 8 layers in two stages. Stage 1 contains L1 = 6 layers and stage 2 consists of L2 = 2 layers. The channel dimensions of the layers are 64, 64, 128, 128, 256, 256, 256 and 256. The number of frames is halved at the third and fifth layers. The number of spatial attention heads for the Basic-SFormer and FG-SFormer blocks is set to 3. Each FG-TFormer block uses 2 attention heads, which adopt a temporal kernel size of kt = 7 and dilation rates of dt = 1 and dt = 2, respectively. The numbers of focal joints and body parts in the two branches of stage 2 are K = 15 and P = 10, respectively. All experiments are conducted on one RTX 3090 GPU with the PyTorch framework. We use SGD with Nesterov momentum of 0.9 and weight decay of 0.0005 to train our model for 80 epochs. A warm-up strategy is used for the first 5 epochs. The initial learning rate is set to 0.01 and decays by a factor of 10 at the 50th and 70th epochs. For NTU-60 and NTU-120, the batch size is 32; the sequences are sampled to 128 frames and we use the data pre-processing method in [5]. For NW-UCLA, the batch size is 32 and we use the same data pre-processing as in [43].
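The optimizer and schedule described above could be wired up roughly as follows; the model is a placeholder and the linear warm-up shape is our assumption.

```python
# Sketch: SGD with Nesterov momentum and the warm-up + step-decay schedule.
import torch

model = torch.nn.Linear(256, 60)               # placeholder for the FG-STFormer model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            nesterov=True, weight_decay=0.0005)

def lr_at(epoch: int, base_lr: float = 0.01) -> float:
    if epoch < 5:                              # warm-up over the first 5 epochs
        return base_lr * (epoch + 1) / 5
    if epoch < 50:
        return base_lr
    if epoch < 70:
        return base_lr * 0.1                   # decay by a factor of 10 at epoch 50
    return base_lr * 0.01                      # and again at epoch 70

for epoch in range(80):
    for g in optimizer.param_groups:
        g["lr"] = lr_at(epoch)
    # ... one training epoch over the skeleton batches would run here ...
```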

4.3 Ablation Study

In this section, the effectiveness of each individual component of FG-STFormer is evaluated under the X-Sub protocol of the NTU-60 dataset, using only the joint stream.

Effectiveness of FG-SFormer Block. To examine the effectiveness of the proposed FG-SFormer, we evaluate its important components, i.e., focal joints selection, the part branch and the joint-part cross-attention (JP-CA). We employ the Basic-SFormer as the baseline, which calculates the correlations for all the joints at every layer of the network without using the part branch and JP-CA.


For temporal modelling, our FG-TFormer is used. We gradually extend the baseline by adding our designs one by one. The experimental results are shown in Table 1.

Table 1. Ablation study of different components in the FG-SFormer block.

Methods         Focal Joints Selection  Part Branch  JP-CA  Acc (%)
Basic-SFormer   –                       –            –      87.8
A               √                       –            –      88.3
B               √                       √            –      89.1
C               √                       √            √      89.5

As seen, model A selects the focal joints at stage 2 of the network and improves the performance of the Basic-SFormer by 0.5%. This indicates that it is beneficial to identify discriminative joints. Then, model B introduces the part branch into stage 2 of the network, providing a performance improvement of 0.8% and reflecting that the intra-part spatial relations carry helpful global motion patterns. Finally, adding JP-CA to form model C further increases the accuracy by 0.4%, which implies that the interactions between the body parts and the selected focal joints are helpful for distinguishing actions.

Impact of Number of Focal Joints. To explore the effect of selecting different numbers of focal joints, we test models with different K in the FG-SFormer blocks at stage 2. Note that K = 25 means all the joints are used. As shown in Table 2, the accuracy gradually improves as K increases from 3 to 15, and then decreases as it increases further. This implies that redundant or noisy joints indeed harm the recognition performance, while too small a number of focal joints is not enough to accurately identify the actions.

Table 2. Comparison of classification accuracy using different numbers of focal joints.

K        3     6     9     12    15    18    21    25
Acc (%)  88.6  88.9  89.0  89.2  89.5  89.2  89.3  89.1

Effectiveness of FG-TFormer Block. To evaluate the efficacy of the FG-TFormer block, we build experiments on the complete network by modifying the FG-TFormer block only. The model using the Basic-TFormer is taken as the baseline, which solely adopts the global MSA in the temporal dimension. According to the results shown in Table 3, without TCN_dilate in MSA, the Basic-TFormer performs significantly worse than our FG-TFormer, with a large margin of −6.7%. Besides, by replacing the Basic-TFormer with TCN_dilate, the performance is greatly improved by 6.3%. Finally, our FG-TFormer achieves a further improvement of 0.4% by integrating TCN_dilate into the self-attention mechanism.

166

Table 3. Ablation study for components in the FG-TFormer block.

Methods         Global MSA  TCNdilate  Acc (%)
Basic-TFormer   √           ×          82.8
FG-TFormer      ×           √          89.1
                √           √          89.5
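To make the TCNdilate branch concrete, the sketch below shows depth-preserving temporal convolutions with kernel size 7 and dilation rates 1 and 2, one per temporal attention head as described above. How FG-TFormer fuses these with the attention output is an assumption here (simple summation is used).

```python
import torch
import torch.nn as nn

class DilatedTemporalConv(nn.Module):
    """Temporal convolution with kernel size kt and a chosen dilation,
    applied along the frame axis of a (N, C, T, V) skeleton tensor."""

    def __init__(self, channels: int, kt: int = 7, dilation: int = 1):
        super().__init__()
        pad = dilation * (kt - 1) // 2          # keep T unchanged
        self.conv = nn.Conv2d(channels, channels, kernel_size=(kt, 1),
                              padding=(pad, 0), dilation=(dilation, 1))
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):                       # x: (N, C, T, V)
        return self.bn(self.conv(x))

# Two parallel branches with dilation rates 1 and 2, mirroring the two
# temporal attention heads; summing them is an assumed fusion rule.
branches = nn.ModuleList([DilatedTemporalConv(64, kt=7, dilation=d) for d in (1, 2)])
x = torch.randn(2, 64, 128, 25)
out = sum(b(x) for b in branches)
```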

Configuration Exploration. We explore different network configurations for stages 1 and 2 of our FG-STFormer by adjusting the numbers of layers L1 and L2, with the total number of layers fixed at 8. The results are shown in Table 4. Comparing models A, B and C, we find that higher performance is obtained when more than 4 layers are used in stage 1, and the best performance is achieved with L1 = 6 and L2 = 2. The accuracy drops when stage 2 is assigned fewer layers, as in model D. These observations indicate that stage 1 needs to sufficiently learn the relations among all the joints; otherwise the performance can be harmed by focusing on unreliable focal joints and part collaborations.

Table 4. Comparison of different network configurations of our FG-STFormer.

Methods  Stage 1 (L1)  Stage 2 (L2)  Acc (%)
A        4             4             88.8
B        5             3             89.0
C        6             2             89.5
D        7             1             89.0

Fig. 5. The selected focal joints and learned joint-part interactions of actions. (Color figure online)

4.4 Visualization and Analysis

To examine which joints the focal-joint selection concentrates on at stage 2, we visualize the 13 focal joints with the largest scores for three actions in Fig. 5. These focal joints are depicted as coloured dots in the left skeleton of each action; the darker the dot, the higher the informativeness score of the joint. We can see that the actions Clapping, Kicking something and Say stop mainly select


hands, shoulders, elbows and feet as the focal joints. Besides, Fig. 5 illustrates the learned attention weights from parts to focal joints for these actions, where attentions with large values are shown as green lines. As seen, the actions Clapping and Say stop mainly build interactions between the focal joints and the upper limbs, while Kicking something interacts between the focal joints and body parts over the whole body. These results verify that our FG-SFormer captures the spatial relations between the key joints and the global contextual movement information. Moreover, we compare the Basic-TFormer with our FG-TFormer on action classes where the former has low accuracy. As shown in Fig. 6, our network improves the performance on these classes, which mainly involve subtle and fine-grained motions of the hands, feet and head. This indicates that our FG-TFormer can capture such subtle interaction patterns by explicitly embedding the neighboring relations into it.

Fig. 6. Accuracy comparison between the Basic-TFormer and our FG-TFormer blocks.

Table 5. Comparison to the state-of-the-art methods on the NW-UCLA dataset.

Methods                Year  NW-UCLA Top-1 (%)
HBRNN-L [10]           2015  78.5
Ensemble TS-LSTM [19]  2017  89.2
AGC-LSTM [39]          2019  93.3
Shift-GCN [8]          2020  94.6
DC-GCN+ADG [7]         2020  95.3
CTR-GCN [5]            2021  96.5
FG-STFormer (ours)     2022  97.0

4.5 Comparison with the State-of-the-Arts

We compare our FG-STFormer with existing state-of-the-art (SOTA) methods on three datasets: NW-UCLA, NTU-60 and NTU-120. Following the previous works [30,38,51], we fuse results of four modalities, i.e., joint, bone, joint motion, and bone motion. The results are shown in Tables 5 and 6. As seen, our method outperforms all existing transformer-based methods under nearly all evaluation


benchmarks on NTU-60 and NTU-120, including the latest method STST [51], which uses not only parallel spatial and temporal transformers but also multiple self-supervised learning tasks, and ST-TR [33], which adopts a hybrid architecture of spatial-temporal transformer and GCN. Our method surpasses DSTA [38] by 2.4% and 1.6% on the two evaluation protocols of NTU-120.

Table 6. Performance comparisons against the SOTA methods on NTU-60 and NTU-120.

Methods                    Year  NTU-60 X-Sub (%)  NTU-60 X-View (%)  NTU-120 X-Sub (%)  NTU-120 X-Set (%)
GCN-based methods
ST-GCN [46]                2018  81.5              88.3               70.7               73.2
2s-AGCN [37]               2019  88.5              95.1               82.9               84.9
DGNN [36]                  2019  89.9              96.1               –                  –
Shift-GCN [8]              2020  90.7              96.5               85.9               87.6
Dynamic GCN [48]           2020  91.5              96.0               87.3               88.6
MS-G3D [30]                2020  91.5              96.2               86.9               88.4
MST-GCN [6]                2021  91.5              96.6               87.5               88.8
CTR-GCN [5]                2021  92.4              96.8               88.9               90.6
STF [17]                   2022  92.5              96.9               88.9               90.0
Transformer-based methods
DSTA [38]                  2020  91.5              96.4               86.6               89.0
ST-TR [33]                 2021  89.9              96.1               82.7               84.7
UNIK [47]                  2021  86.8              94.4               80.8               86.5
STST [51]                  2021  91.9              96.8               –                  –
FG-STFormer (ours)         2022  92.6              96.7               89.0               90.6

Moreover, compared to GCN-based methods, the performance of our FG-STFormer is also at the top. It compares favourably with the current state-of-the-art STF [17] and CTR-GCN [5] on NTU-60 and NTU-120, and even outperforms the latter on NW-UCLA by 0.5%, verifying the effectiveness of FG-STFormer.

5 Conclusion

In this work, we present a novel focal and global spatial-temporal transformer network (FG-STFormer) for skeleton-based action recognition. In the spatial dimension, it learns intra- and inter-correlations for adaptively sampled focal joints and global body parts, capturing discriminative and comprehensive spatial dependencies. In the temporal dimension, it explicitly learns both local and global temporal relations, enabling the network to capture rich motion patterns effectively. The proposed FG-STFormer achieves state-of-the-art performance on three datasets, demonstrating the effectiveness of our method.


Acknowledgement. The work of Zhimin Gao and Peitao Wang was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61906173. The work of Xiaoheng Jiang, Qidong Liu, and Mingliang Xu was supported by NSFC under U21B2037, 62276238 and 62036010 respectively.

References 1. Bai, R., et al.: GCST: graph convolutional skeleton transformer for action recognition. arXiv preprint arXiv:2109.02860 (2021) 2. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: realtime multiperson 2D pose estimation using part affinity fields. IEEE Trans. PAMI 43(1), 172–186 (2019) 3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of CVPR, pp. 6299–6308 (2017) 5. Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., Hu, W.: Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of ICCV, pp. 13359–13368 (2021) 6. Chen, Z., Li, S., Yang, B., Li, Q., Liu, H.: Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: Proceedings of AAAI, vol. 35, pp. 1113–1122 (2021) 7. Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., Lu, H.: Decoupling GCN with DropGraph module for skeleton-based action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 536–553. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58586-0 32 8. Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of CVPR, pp. 183–192 (2020) 9. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. In: ICLR (2020) 10. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: CVPR, pp. 1110–1118 (2015) 11. Duan, H., Zhao, Y., Chen, K., Shao, D., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. arXiv preprint arXiv:2104.13586 (2021) 12. Gao, X., Hu, W., Tang, J., Liu, J., Guo, Z.: Optimized skeleton-based action recognition via sparsified graph regression. In: Proceedings of ACM MM, pp. 601–610 (2019) 13. Hou, Y., Li, Z., Wang, P., Li, W.: Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. CSVT 28(3), 807–811 (2018) 14. Huang, L., Huang, Y., Ouyang, W., Wang, L.: Part-level graph convolutional network for skeleton-based action recognition. In: Proceedings of AAAI, vol. 34, pp. 11045–11052 (2020) 15. Huang, Z., Shen, X., Tian, X., Li, H., Huang, J., Hua, X.S.: Spatio-temporal inception graph convolutional networks for skeleton-based action recognition. In: Proceedings of ACM MM, pp. 2122–2130 (2020)


16. Jiang, M., Kong, J., Bebis, G., Huo, H.: Informative joints based human action recognition using skeleton contexts. Signal Process. Image Commun. 33, 29–40 (2015) 17. Ke, L., Peng, K.C., Lyu, S.: Towards to-at spatio-temporal focus for skeleton-based action recognition. In: Proceedings of AAAI (2022) 18. Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: A new representation of skeleton sequences for 3D action recognition. In: Proceedings of CVPR, pp. 3288–3297 (2017) 19. Lee, I., Kim, D., Kang, S., Lee, S.: Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In: Proceedings of ICCV, pp. 1012–1020 (2017) 20. Li, B., Li, X., Zhang, Z., Wu, F.: Spatio-temporal graph routing for skeleton-based action recognition. In: Proceedings of AAAI, vol. 33, pp. 8561–8568 (2019) 21. Li, C., Xie, C., Zhang, B., Han, J., Zhen, X., Chen, J.: Memory attention networks for skeleton-based action recognition. IEEE Trans. NNLS 33(9), 4800–4814 (2021) 22. Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-based action recognition with convolutional neural networks. In: ICMEW, pp. 597–600. IEEE (2017) 23. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of CVPR, pp. 3595–3603 (2019) 24. Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In: Proceedings of CVPR, pp. 5457–5466 (2018) 25. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans. PAMI 42(10), 2684–2701 (2019) 26. Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 816–833. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46487-9 50 27. Liu, J., Wang, G., Hu, P., Duan, L.Y., Kot, A.C.: Global context-aware attention LSTM networks for 3D action recognition. In: Proceedings of CVPR (2017) 28. Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn. 68, 346–362 (2017) 29. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of ICCV, pp. 10012–10022 (2021) 30. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of CVPR, pp. 143–152 (2020) 31. Maas, A.L., Hannun, A.Y., Ng, A.Y., et al.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, vol. 30, p. 3 (2013) 32. Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: Proceedings of ICCV, pp. 3163–3172 (2021) 33. Plizzari, C., Cannici, M., Matteucci, M.: Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 208, 103219 (2021) 34. Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010) 35. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of CVPR, pp. 1010–1019 (2016)


36. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of CVPR, pp. 7912–7921 (2019) 37. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of CVPR, pp. 12026– 12035 (2019) 38. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In: Proceedings of ACCV (2020) 39. Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of CVPR, pp. 1227–1236 (2019) 40. Song, Y.F., Zhang, Z., Shan, C., Wang, L.: Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. PAMI 45(2), 1474–1488 (2022) 41. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017) 42. Wang, H., Wang, L.: Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In: Proceedings of CVPR, pp. 499–508 (2017) 43. Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.C.: Cross-view action modeling, learning and recognition. In: Proceedings of CVPR, pp. 2649–2656 (2014) 44. Wang, P., Li, W., Ogunbona, P., Wan, J., Escalera, S.: RGB-D-based human motion recognition with deep learning: a survey. Comput. Vis. Image Underst. 171, 118–139 (2018) 45. Wang, P., Li, Z., Hou, Y., Li, W.: Action recognition based on joint trajectory maps using convolutional neural networks. In: Proceedings of ACM MM, pp. 102– 106 (2016) 46. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of AAAI (2018) 47. Yang, D., Wang, Y., Dantcheva, A., Garattoni, L., Francesca, G., Bremond, F.: Unik: a unified framework for real-world skeleton-based action recognition. In: Proceedings of BMVC (2021) 48. Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., Tang, H.: Dynamic GCN: context-enriched topology learning for skeleton-based action recognition. In: Proceedings of ACM MM, pp. 55–63 (2020) 49. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of ICCV, pp. 2117–2126 (2017) 50. Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., Zheng, N.: Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of CVPR, pp. 1112–1121 (2020) 51. Zhang, Y., Wu, B., Li, W., Duan, L., Gan, C.: STST: spatial-temporal specialized transformer for skeleton-based action recognition. In: Proceedings of ACM MM, pp. 3229–3237 (2021) 52. Zhang, Z.: Microsoft Kinect sensor and its effect. IEEE Multimedia 19(2), 4–10 (2012) 53. Zhu, W., et al.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Proceedings of AAAI, vol. 30 (2016)

Spatial-Temporal Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition

Rui Hang and MinXian Li(B)
Nanjing University of Science and Technology, Nanjing, China
{hangrui,minxianli}@njust.edu.cn

Abstract. Skeleton-based action recognition approaches usually construct the skeleton sequence as spatial-temporal graphs and perform graph convolution on these graphs to extract discriminative features. However, due to the fixed topology shared among different poses and the lack of direct long-range temporal dependencies, it is not trivial to learn the robust spatial-temporal feature. Therefore, we present a spatial-temporal adaptive graph convolutional network (STA-GCN) to learn adaptive spatial and temporal topologies and effectively aggregate features for skeleton-based action recognition. The proposed network is composed of spatial adaptive graph convolution (SA-GC) and temporal adaptive graph convolution (TA-GC) with an adaptive topology encoder. The SA-GC can extract the spatial feature for each pose with the spatial adaptive topology, while the TA-GC can learn the temporal feature by modeling the direct long-range temporal dependencies adaptively. On three large-scale skeleton action recognition datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton, the STA-GCN outperforms the existing state-of-the-art methods. The code is available at https://github.com/hang-rui/STA-GCN.

Keywords: Action recognition · Adaptive topology · Graph convolution

1 Introduction

Action recognition is an essential task in human-centered computing and computer vision, which plays an increasingly crucial role in video surveillance, human-computer interaction, video analysis, and other applications [1,26,38]. In recent years, skeleton-based human action recognition has attracted much attention due to the development of depth sensors [46] and pose estimation algorithms [2,33]. Conventional deep learning methods adopt recurrent neural networks (RNN) [8,20,43] and convolutional neural networks (CNN) [14–16] to analyze the skeleton sequence by representing it as a vector sequence or pseudo-image. However, the skeleton sequence is naturally structured as a spatial-temporal


Fig. 1. Illustration of (a) the fixed spatial and temporal topology and (b) the adaptive spatial and temporal topology. Different color points indicate the different human joints, and the lines correspond to the spatial and temporal direct correlation between joints. Best viewed in color.

graph. For this reason, Yan et al. [40] firstly proposed the spatial-temporal graph convolutional network (ST-GCN) to model the motion patterns of action on a skeleton spatial-temporal graph. After that, a series of graph convolutional networks (GCN) methods based on spatial-temporal graphs have been proposed for skeleton-based action recognition [7,22,29]. However, there are still two disadvantages to the feature extraction operations on the spatial-temporal graph: (1) In the stage of learning spatial features, the fixed spatial topology is shared among all poses, which may not be optimal for the action with large changes in pose. For action such as “throw”, before and after the throw, the human pose presents two forms of backward-leaning and forward-leaning, which represents different semantics. Using the fixed spatial topology may mistakenly enhance irrelevant connections or weaken critical ones, failing to accurately represent the spatial dependencies. This fact suggests that the spatial topology should be adaptive to each pose in the skeleton sequence. (2) In the stage of learning temporal features, existing methods apply temporal convolution with a fixed small kernel to extract the short-range temporal feature. It leads to the weak capacity to model temporal long-range joint dependencies vital for action recognition. To learn the robust feature representation in the spatial and temporal dimensions, we propose a spatial-temporal Adaptive Graph Convolutional Network (STA-GCN) in this work. The proposed network is composed of two key modules: Spatial Adaptive Graph Convolution (SA-GC) and Temporal Adaptive Graph Convolution (TA-GC). Both SA-GC and TA-GC have a critical embedded component: Topology Adaptive Encoder (TAE). The SA-GC module is designed to extract spatial features by modeling the spatial adaptive joint dependencies. The TA-GC module is designed to learn temporal features by capturing the direct long-range joint dependencies in the temporal dimension. Combined with SA-GC and TA-GC modules, the proposed model can learn the discriminative features both in the spatial and temporal dimensions.


The TAE component is proposed to learn spatial adaptive topology and temporal adaptive topology. The existing GCN methods use fixed spatial and temporal topology (Fig. 1(a)). This fixed spatial topology forces each skeleton frame to adopt the same spatial topology, while the fixed temporal topology forces all trajectories to use the same temporal topology. We argue that this fixed topology is insufficient to represent the joint dependencies per pose or per trajectory. Therefore, we propose the TAE to solve this problem by learning the spatial adaptive topology and the temporal adaptive topology (Fig. 1(b)). The spatial adaptive topology can generate the pose-specific dependencies for each frame in the skeleton sequence to learn discriminative spatial features. The temporal adaptive topology can model the direct long-range dependencies between any two joints in the trajectory graph to extract robust temporal features. The main contributions of this work are summarized as follows:
– A spatial adaptive graph convolution (SA-GC) module is proposed to extract the spatial feature for each pose with the spatial adaptive topology.
– A temporal adaptive graph convolution (TA-GC) module is proposed to learn the temporal feature by modeling the direct long-range temporal dependencies.
– A topology adaptive encoder (TAE) embedded into graph convolution is proposed to generate the adaptive spatial topology and the adaptive temporal topology.
– We propose a Spatial-Temporal Adaptive Graph Convolutional Network (STA-GCN) composed of SA-GC and TA-GC, which outperforms state-of-the-art approaches on three large-scale skeleton action recognition datasets: NTU RGB+D [27], NTU RGB+D 120 [19], and Kinetics-Skeleton [13].

2 Related Work

2.1 Skeleton-Based Action Recognition

With the development of deep learning-based video understanding, a series of video-based methods have been proposed for action recognition. Specifically, 2D CNNs [18,36] efficiently recognize actions by modeling the relationships in the temporal dimension, while 3D CNNs [3,34] capture motion information in a unified network through a simple extension from the spatial domain to the spatial-temporal domain. Recently, skeleton-based methods [7,8,14–16,20,22,24,29,39,40,42,43] have been developed extensively, since skeleton data are more computationally efficient and exhibit stronger robustness. Skeleton data can eliminate the influence of variations in illumination, camera viewpoint, background, and clothing in real-world videos [21,35,37]. Therefore, we adopt the skeleton-based action recognition approach in this paper.

2.2 Graph Convolution Networks in Action Recognition

Skeleton-based action recognition approaches fall into three categories: RNN-based, CNN-based, and GCN-based. RNNs represent the


skeleton sequence as a vector sequence [8,20,43], while CNNs represent it as a pseudo-image [14–16]. However, both RNN and CNN methods ignore the skeleton topology among joints. GCN-based methods represent the skeleton sequence as a spatial-temporal graph and extract features in the spatial and temporal dimensions by graph convolution and temporal convolution, respectively. Yan et al. [40] first introduced graph convolution and temporal convolution into skeleton-based action recognition to model the spatial configurations and temporal dynamics simultaneously. Subsequent works improved the model by applying multi-hop methods to extract multi-scale features [12,17], adding mechanisms to adaptively capture the relations between distant joints [17,41,44], and adding extra edges between adjacent frames to extract features on the extended graph [9,22,23]. However, all these methods apply the graph convolution network (GCN) only in the spatial dimension; in the temporal dimension, a temporal convolution network (TCN) is used to learn the temporal feature, which is limited by its fixed small convolution kernel. In this work, we apply graph convolution in both the spatial and temporal dimensions. Temporal graph convolution can effectively learn the temporal feature by modeling the direct long-range temporal dependencies.

2.3 Topology Adaptive-Based Methods

The topology-based adaptive methods [17,29,41,45] aim to generate the appropriate spatial topology based on the input data. This spatial topology indicates whether and how important a connection exists between any two joints in the graph. Existing methods focus only on the spatial dimension, and they mainly use parametric adjacency matrices to learn a spatial topology optimized for all data [29] or generate a specific spatial topology for each action sample based on the input data [17,41,45]. These methods alleviate the limitation caused by the fixed pre-defined spatial topology of GCN methods. However, existing methods force all frames to share the same spatial topology, which is unreasonable because each pose represents a different semantic and cannot use the same spatial topology to extract features. Moreover, these approaches do not consider topology in the temporal dimension, which hinders the long-range dependencies modeling in the temporal dimension. To address these problems, we propose the Topology Adaptive Encoder (TAE) to learn spatial adaptive topology for each frame in the spatial graph and temporal adaptive topology for each trajectory in the temporal graph. The spatial adaptive topology can efficiently help spatial graph convolution extract the features of different poses in action. On the other hand, the temporal adaptive topology is used in temporal graph convolution to extract long-range dependencies.

3 Methodology

3.1 Preliminaries

Notations. A human skeleton with N joints is presented as an undirected spatial graph G = (V, E), where V = {vi |i = 1, . . . , N } is the set of N vertices


Fig. 2. The overview of the proposed Spatial-Temporal Adaptive Graph Convolutional Network (STA-GCN). (a) The spatial adaptive graph convolution (SA-GC) module is adopted to learn the spatial feature for each pose with the spatial adaptive topology. (b) The temporal adaptive graph convolution (TA-GC) module is subsequently adopted to learn the temporal feature for each trajectory with the temporal adaptive topology. (c) The Topology Adaptive Encoder (TAE) is used both in SA-GC and TA-GC modules to learn the spatial adaptive topology and the temporal adaptive topology.

representing joints and E is the edge set representing bones. The topology of G is formulated as an adjacency matrix A ∈ R^{N×N}, whose element a_{ij} ∈ {0, 1} denotes whether an edge exists between vertices v_i and v_j.

Graph Convolution. According to Yan et al. [40], the spatial dependencies of the joints in each frame can be conveniently encoded with graph convolution, formulated as:

X^{(l+1)} = \sigma\Big( \sum_{k=0}^{K} \tilde{A}^{(k)} X^{(l)} W^{(l)}_{(k)} \Big)    (1)

where X^{(l)} is the feature of layer l, σ(·) is an activation function, and K denotes the pre-defined maximum graphic distance. \tilde{A}^{(k)} = \Lambda^{-\frac{1}{2}} (A^{(k)} + I^{(k)}) \Lambda^{-\frac{1}{2}} is the k-th order normalized adjacency matrix, where A^{(k)} + I^{(k)} is the skeleton graph with added self-loops to keep identity features and Λ is the diagonal degree matrix of (A + I). W^{(l)}_{(k)} are learnable parameters implementing the convolution operation.
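A minimal PyTorch rendering of Eq. (1) is sketched below; the normalized adjacency matrices are precomputed and the layer sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

def normalize(adj: torch.Tensor) -> torch.Tensor:
    """A_tilde = D^{-1/2} (A + I) D^{-1/2} for one adjacency matrix."""
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = torch.diag(a.sum(-1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt

class SkeletonGraphConv(nn.Module):
    """Graph convolution of Eq. (1): X_{l+1} = sigma(sum_k A_tilde_k X_l W_k)."""

    def __init__(self, in_ch: int, out_ch: int, adj_tilde: torch.Tensor):
        super().__init__()
        # adj_tilde: (K+1, N, N) stack of normalized adjacency matrices
        self.register_buffer("adj_tilde", adj_tilde)
        self.weights = nn.ModuleList(
            [nn.Linear(in_ch, out_ch, bias=False) for _ in range(adj_tilde.size(0))])
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (batch, T, N, in_ch)
        out = 0
        for a, w in zip(self.adj_tilde, self.weights):
            # aggregate neighbour features with the k-th adjacency matrix
            out = out + torch.einsum("nm,btmc->btnc", a, w(x))
        return self.act(out)

adj = normalize(torch.ones(25, 25))          # toy 25-joint skeleton graph
layer = SkeletonGraphConv(3, 64, adj.unsqueeze(0))
y = layer(torch.randn(4, 64, 25, 3))         # (4, 64, 25, 64)
```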

3.2 Overview

To learn discriminative features on the spatial and temporal graph, we propose a Spatial-Temporal Adaptive Graph Convolutional Network (STA-GCN). An overview of our proposed method is illustrated in Fig. 2. After extracting local skeleton features using a pre-defined spatial topology (i.e. human skeleton structure), we adopt Spatial Adaptive Graph Convolution (SA-GC) and Temporal Adaptive Graph Convolution (TA-GC) to learn discriminative features. The SA-GC is proposed to extract the spatial feature for each pose with the spatial adaptive topology. The TA-GC is proposed to learn the temporal feature by modeling the direct long-range temporal dependencies. Both SA-GC and


TA-GC have an embedded Topology Adaptive Encoder (TAE), which generates a unique and appropriate topology for each pose and each trajectory. On top of SA-GC and TA-GC, we apply a temporal convolution to aggregate features in the temporal dimension. Based on the discriminative features extracted by several STA-GC blocks, a fully connected layer with a softmax activation function produces the final class. The method is discussed in detail in the following sections.

3.3 Spatial and Temporal Graph Convolution

Most existing works treat the human skeleton sequence as a spatial-temporal graph, where features are extracted through spatial graph convolution and temporal convolution. However, the temporal convolution is not powerful enough to learn the temporal feature due to its fixed small kernel. Therefore, in this section, we introduce a more robust feature extraction operator on the spatial and temporal graphs. Let us first consider a graph convolution on the spatial-temporal graph G^{st} = (V^{st}, E^{st}), where V^{st} = {v_{nt} | n = 1, ..., N, t = 1, ..., T} is the set of all nodes across T frames in the skeleton sequence and E^{st} is the spatial-temporal edge set. We further deconstruct the spatial-temporal graph into T spatial graphs across time and N temporal graphs across joints. The spatial graphs are represented as G^s = (V^{st}, E^s), where E^s is the spatial edge set, formulated as a spatial adjacency matrix A^s ∈ R^{T×N×N}. Note that the spatial adjacency matrix degrades to the vanilla form A ∈ R^{N×N} when all spatial graphs share the same spatial correlations. Similarly, the temporal graphs are represented as G^t = (V^{st}, E^t), and the temporal adjacency matrix A^t ∈ R^{N×T×T} can be formulated. After the graph decomposition, we develop two graph convolutions: spatial graph convolution (S-GC) and temporal graph convolution (T-GC), respectively formulated as:

f_{out}(v_{nt}) = \sum_{p=1}^{N} a^{s}_{(pt)(nt)} f_{in}(v_{pt}) w(v_{pt})    (2)

f_{out}(v_{nt}) = \sum_{q=1}^{T} a^{t}_{(nq)(nt)} f_{in}(v_{nq}) w(v_{nq})    (3)

where a^{s}_{(pt)(nt)} and a^{t}_{(nq)(nt)} are elements of A^s and A^t, respectively. Features on the spatial-temporal graph can be extracted by employing S-GC and T-GC. The whole feature extraction process is formulated as:

f_{out}(v_{nt}) = \sum_{q=1}^{T} a^{t}_{(nq)(nt)} \Big( \sum_{p=1}^{N} a^{s}_{(pq)(nq)} f_{in}(v_{pq}) w_1(v_{pq}) \Big) w_2(v_{nq})    (4)
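Eqs. (2) and (3) can be expressed compactly with tensor contractions. The sketch below follows the shapes given in the text (A^s of size T×N×N, A^t of size N×T×T); the single shared weight and the random topologies are simplifications for illustration only.

```python
import torch
import torch.nn as nn

T, N, C, C_out = 64, 25, 64, 64
x = torch.randn(T, N, C)             # one skeleton sequence
A_s = torch.rand(T, N, N)            # per-frame spatial topology   (Eq. 2)
A_t = torch.rand(N, T, T)            # per-joint temporal topology  (Eq. 3)
w = nn.Linear(C, C_out, bias=False)

# S-GC: for each frame t, aggregate over joints p.
f_spatial = torch.einsum("tpn,tpc->tnc", A_s, w(x))

# T-GC: for each joint n, aggregate over frames q.
f_temporal = torch.einsum("nqt,qnc->tnc", A_t, w(x))
```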

3.4 Topology Adaptive Encoder

Previous works force each pose in the skeleton sequence to share a fixed spatial topology. However, different poses represent different semantics, and using the same topology will extract spatial features incorrectly for different poses, leading to weak performance on actions with large changes in pose. Moreover, existing methods do not apply graph convolution in the temporal dimension; if graph convolution is to be performed there, a suitable temporal topology is needed to represent the relationships between joints in the temporal graph. To generate a more detailed topology, we propose a unified topology generation module, the Topology Adaptive Encoder (TAE), which applies an unshared topology generation strategy. Specifically, the TAE generates the spatial topology for each pose in spatial graph convolution and the temporal topology for each trajectory in temporal graph convolution. Technically, the TAE applies self-attention operations to extract correlations between joints in an embedding space:

A = \ell_2\mathrm{norm}\big( X W_{\phi} W_{\psi}^{T} X^{T} \big)    (5)

where X is the input feature. We first utilize two linear transformation functions φ and ψ, with parameters W_φ and W_ψ, to embed the input feature into the embedding space. The two embedded feature matrices are then multiplied to obtain a topology matrix A. Finally, ℓ2 normalization is applied to each row of the adjacency matrix, which eases optimization; with ℓ2 normalization, normalizing by node degree is unnecessary.
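A minimal sketch of Eq. (5) follows: two linear embeddings, a pairwise product, and row-wise ℓ2 normalization. The embedding width and input shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopologyAdaptiveEncoder(nn.Module):
    """Generate an adaptive topology A = l2norm(X W_phi W_psi^T X^T)."""

    def __init__(self, in_ch: int, embed_ch: int = 16):
        super().__init__()
        self.phi = nn.Linear(in_ch, embed_ch, bias=False)
        self.psi = nn.Linear(in_ch, embed_ch, bias=False)

    def forward(self, x):                      # x: (..., V, C) node features
        m_phi = self.phi(x)                    # (..., V, embed_ch)
        m_psi = self.psi(x)                    # (..., V, embed_ch)
        a = m_phi @ m_psi.transpose(-1, -2)    # (..., V, V) raw correlations
        return F.normalize(a, p=2, dim=-1)     # row-wise l2 normalization

tae = TopologyAdaptiveEncoder(in_ch=64)
# One topology per frame: input (batch, T, N, C) -> output (batch, T, N, N).
A_spatial = tae(torch.randn(8, 64, 25, 64))
```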

3.5 Spatial and Temporal Adaptive Graph Convolution

We integrate the proposed TAE into S-GC and T-GC to obtain a pair of adaptive graph convolution operations on the spatial-temporal graph: Spatial Adaptive Graph Convolution (SA-GC) (Fig. 3(a)) and Temporal Adaptive Graph Convolution (TA-GC) (Fig. 3(b)). Since SA-GC on the spatial graph and TA-GC on the temporal graph are equivalent operations, for simplicity we only introduce SA-GC; its counterpart TA-GC can be deduced naturally. Specifically, SA-GC contains three parts: (1) feature transformation with function T^s(·), (2) topology adaptive encoding with function M^s(·), and (3) feature aggregation with function A^s(·). Given the input feature X ∈ R^{T×N×C}, the output Y ∈ R^{T×N×C'} of SA-GC is formulated as:

Y = A^s\big( T^s(X), M^s(X) \big)    (6)

Feature Transformation. The goal of feature transformation is to transform input features into high-level representations using the function T^s(·). Here we use a simple linear transformation:

F^s = T^s(X) = X W^s    (7)


Fig. 3. (a) The Spatial Adaptive Graph Convolution (SA-GC) module. (b) The Temporal Adaptive Graph Convolution (TA-GC) module.

where F^s ∈ R^{T×N×C'} is the transformed high-level feature and W^s ∈ R^{C×C'} is the weight matrix.

Topology Adaptive Encoding. The topology adaptive encoding part of SA-GC uses the proposed TAE module to generate a topology optimized for each skeleton individually:

A^s = M^s(X) = \ell_2\mathrm{norm}\big( X W_{\phi} W_{\psi}^{T} X^{T} \big)    (8)

where X is the input feature, and W_φ and W_ψ are the parameters of the embedding functions φ and ψ. The two feature maps are reshaped into matrices M_φ ∈ R^{T×N×C} and M_ψ ∈ R^{T×C×N}, which are multiplied to obtain a topology matrix A^s ∈ R^{T×N×N}, whose elements represent the correlations between joints in a specific frame. Finally, ℓ2 normalization is applied to each row of the adjacency matrix.

Feature Aggregation. As indicated in Fig. 3, given the temporal-wise spatial adaptive adjacency matrix A^s generated from the input samples, we aggregate the high-level features F^s with the temporal-wise feature aggregation function A^s(·):

Y = A^s(A^s, F^s) = \big[ A^s_1 F^s_1 \;\|_t\; A^s_2 F^s_2 \;\|_t \cdots \|_t\; A^s_T F^s_T \big]    (9)

where \|_t denotes concatenation along the temporal dimension, and A^s_t ∈ R^{N×N} and F^s_t ∈ R^{N×C'} are taken from the t-th frame of A^s and F^s, respectively. Throughout this process the topology is optimized for each skeleton individually; therefore, the proposed SA-GC can effectively distinguish actions, especially those with large changes in pose.

3.6 Model Details

The entire network consists of five STA-GC blocks, whose output channel numbers are 64, 64, 64, 128, and 256. We apply a BatchNorm layer at the start to normalize the input data. The temporal dimension is halved at the 4-th and 5-th blocks by strided temporal convolution. The temporal convolutional network (TCN) used in the STA-GC block follows the multi-scale temporal convolutions of [22]; the main difference is that we reduce the number of channels and fuse six branches with a point-wise convolution. We also add extra residual connections to facilitate training. Since the operations in SA-GC and TA-GC are equivalent, we only describe the implementation of SA-GC. We first utilize two linear transformation functions to map the input features into two compact representations, multiply the two embedded features to obtain the spatial adaptive topology, and apply ℓ2 normalization to the resulting adjacency matrix, which is then used in the graph convolution.
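The five-block layout can be summarised as a small configuration table; the sketch below only encodes the channel widths and temporal strides stated above, with a placeholder module standing in for the full STA-GC block (3 input channels for xyz coordinates is an assumption).

```python
import torch.nn as nn

# (in_ch, out_ch, temporal_stride) for the five STA-GC blocks.
block_cfg = [
    (3,   64,  1),
    (64,  64,  1),
    (64,  64,  1),
    (64,  128, 2),   # temporal dimension halved
    (128, 256, 2),   # temporal dimension halved
]

class STABlockStub(nn.Module):
    """Placeholder: a strided temporal convolution standing in for
    SA-GC + TA-GC + multi-scale TCN."""

    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Conv2d(in_ch, out_ch, kernel_size=(9, 1),
                              padding=(4, 0), stride=(stride, 1))

    def forward(self, x):            # x: (batch, C, T, V)
        return self.body(x)

backbone = nn.Sequential(*[STABlockStub(i, o, s) for i, o, s in block_cfg])
classifier = nn.Linear(256, 60)      # fully connected layer with softmax at test time
```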

4 Experiments

4.1 Datasets

NTU RGB+D 60 [27] is currently the most widely used indoor skeleton-based action recognition dataset, which contains 56, 880 skeleton action sequences with 60 action classes performed by 40 volunteers and captured by three Microsoft Kinect v2 cameras from different views concurrently. Each sample contains one action with two subjects at most, and each skeleton is composed of 25 joints. The dataset is separated into two benchmarks: (1) Cross-subject (X-Sub): 40, 320 samples performed by 20 subjects are separated into the training set, and the other 16, 560 samples performed by different 20 subjects belong to the test set. (2) Cross-view (X-View): the training set contains 37, 920 samples from camera views 2 and 3, and the test set contains 18, 960 samples from camera view 1. NTU RGB+D 120 [19] is the largest indoor skeleton-based action recognition dataset, which extends NTU RGB+D 60 with additional 57,367 skeleton sequences over 60 extra action classes, totalling contains 114, 480 skeleton action sequences in 120 action classes performed by 106 volunteers, and has 32 different camera setups, each setup representing a specific location and background. Similarly, the dataset is separated into two benchmarks: (1) Cross-subject (X-Sub): 63, 026 samples performed by 53 subjects are separated into the training set, and the other 50, 922 samples performed by different 53 subjects belong to the test set. (2) Cross-setup (X-Set): the training set contains 54, 471 samples with even setup IDs, and the test set contains 59, 477 samples with odd setup IDs. Kinetics Skeleton. Kinetics 400 [13] is a large-scale human action dataset that contains 300, 000 video clips of 400 classes collected from the Internet. After applying Openpose [2] pose-estimation algorithm on Kinetics 400, the Kinetics


Skeleton dataset contains 240,436 training and 19,796 evaluation skeleton clips, where each skeleton graph contains 18 body joints along with their 2D coordinates and confidence scores.

4.2 Implementation Details

All experiments are implemented on two RTX 3090 GPUs with the PyTorch deep learning framework. Stochastic gradient descent (SGD) with momentum 0.9 and weight decay 0.0001 is used for optimization. The model is trained for 70 epochs in total. The initial learning rate is set to 0.1 and decays with a cosine schedule after the 10-th epoch. Moreover, a warm-up strategy [10] is applied over the first 10 epochs, gradually increasing the learning rate from 0 to the initial value to make training more stable. The batch size is set to 32. Input data are preprocessed following [32], and the cross-entropy loss is employed.
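A minimal sketch of the warm-up plus cosine schedule described above, assuming a simple per-epoch multiplier; the model is a placeholder.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(64 * 25 * 3, 60)     # placeholder network

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0001)

warmup_epochs, total_epochs = 10, 70

def lr_lambda(epoch):
    # Linear warm-up from 0 to the initial rate over the first 10 epochs,
    # then cosine decay over the remaining epochs.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
```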

4.3 Ablation Studies

We analyze the individual components and their configurations in the final architecture. The performance is reported as Top-1 and Top-5 classification accuracy on the Cross-Subject benchmark of NTU RGB+D 60 using only the joint data.

Table 1. Comparison of the accuracy when gradually adding STA-GC and when adding only SA-GC or TA-GC, on the X-Sub of NTU RGB+D 60.

Methods                    Params  Top-1 (%)  Top-5 (%)
Baseline                   1.44    88.1       98.2
+ 1 STA-GC                 1.30    88.7       98.0
+ 2 STA-GC                 1.36    89.1       98.4
+ 3 STA-GC                 1.38    89.3       98.4
STA-GCN with SA-GC only    1.21    88.7       98.0
STA-GCN with TA-GC only    1.21    89.0       98.3
STA-GCN                    1.40    89.5       98.4

Effectiveness of STA-GC. To verify the effectiveness of the proposed STA-GC block, we build up the model incrementally from its individual modules. We employ ST-GCN [40] as the baseline for controlled experiments; for a fair comparison, we add residual connections to ST-GCN and replace its temporal modeling module with the temporal convolution described in Sect. 3.6. The experimental results are shown in Table 1. We first gradually add STA-GC into the baseline. For a fair comparison, we halve the original ten stages in the baseline to five after adding STA-GC to keep the number of parameters comparable, which also alleviates the over-smoothing problem caused by adding the new graph convolution.


We observe that the accuracy increases steadily and is substantially improved each time a graph convolution is augmented with the STA-GC module, which validates the effectiveness of STA-GC. We then validate the effects of SA-GC and TA-GC separately by adding either of them to the baseline, and observe performance gains of 1.4% and 0.4%, respectively, indicating that the proposed SA-GC and TA-GC can effectively extract features in the spatial and temporal dimensions. Moreover, SA-GC and TA-GC are complementary, and their combination promotes each other to achieve better performance for effective motion feature learning.

Table 2. Comparison of the accuracy when STA-GCN applies the fixed topology or the adaptive topology on the X-Sub of NTU RGB+D 60.

Spatial topology  Temporal topology  Top-1 (%)  Top-5 (%)
Fixed             Fixed              86.6       97.7
Adaptive          Fixed              88.0       97.8
Fixed             Adaptive           87.7       98.0
Adaptive          Adaptive           89.5       98.4

Effectiveness of TAE. To verify the effectiveness of the proposed TAE, we keep the backbone of STA-GCN and apply either the adaptive topologies generated by TAE or fixed parameterized topologies in SA-GC and TA-GC. As shown in Table 2, the models using TAE only in the spatial dimension or only in the temporal dimension outperform the model using only fixed topologies, and the model applying TAE in both dimensions achieves the best results. This demonstrates that TAE can effectively generate appropriate spatial and temporal topologies for robust feature learning.

Table 3. Comparison of the accuracy when STA-GCN applies different topology adaptive methods on the X-Sub of NTU RGB+D 60.

Methods            Top-1 (%)  Top-5 (%)
2s-AGCN [29]       88.9       97.9
Dynamic-GCN [41]   80.0       95.4
TAE                89.5       98.4

Comparison with Other Topology Adaptive Methods. To further validate the effectiveness of our TAE, we compare the performance of different topology adaptive methods in Table 3. Specifically, we keep the backbone of STA-GCN and only replace the topology adaptive method in the graph convolution for a fair comparison. TAE outperforms the topology adaptive methods of 2s-AGCN and Dynamic-GCN, proving that TAE is effective in generating adaptive topologies.


Table 4. Comparison of the accuracy when STA-GCN applies the shared or unshared topology generation strategy on the X-Sub of NTU RGB+D 60.

Spatial topology  Temporal topology  Top-1 (%)  Top-5 (%)
Shared            Shared             88.6       98.1
Unshared          Shared             89.0       98.3
Shared            Unshared           88.8       98.2
Unshared          Unshared           89.5       98.4

Shared Topology vs Unshared Topology. We also verify the topology-unshared strategy of TAE. Specifically, we keep the backbone of STA-GCN and compare the model's performance when TAE uses the shared or unshared topology generation strategy. The unshared strategy means that TAE generates a specific topology for each skeleton or trajectory in each action sample, while the shared strategy means that TAE generates the same topology for all poses or trajectories in each action sample. As shown in Table 4, the unshared strategy achieves better performance than the shared strategy, indicating the importance of generating a specific topology for each skeleton or trajectory.

Table 5. Comparison of the Top-1 accuracy (%) on actions with large changes in pose on the X-Sub of NTU RGB+D 60.

Methods       throw  stand up  hopping  pick up  falling down
ST-GCN [40]   88.3   96.8      97.1     92.8     97.1
STA-GCN       98.5   98.6      95.2     99.6     91.0

Performance for Recognizing Actions with Large Changes in Pose. To further verify that the proposed STA-GCN can recognize actions with large pose changes more effectively, we compare it with ST-GCN on several actions. For a fair comparison, we add additional residual connections and apply the same temporal convolution module to ST-GCN. The results are shown in Table 5: our method outperforms ST-GCN in Top-1 accuracy on five actions (throw, stand up, hopping, pick up, and falling down), demonstrating that our approach can effectively identify actions with large changes in pose.


Table 6. Comparisons of the Top-1 accuracy with the state-of-the-art methods on the NTU RGB+D 60 and NTU RGB+D 120 datasets.

Methods                     NTU-60 X-Sub (%)  NTU-60 X-View (%)  NTU-120 X-Sub (%)  NTU-120 X-Set (%)
ST-GCN [40]                 81.5              88.3               –                  –
AS-GCN [17]                 86.8              94.2               –                  –
2s-AGCN [29]                88.5              95.1               –                  –
AGC-LSTM [30]               89.2              95.0               –                  –
DGNN [28]                   89.9              96.1               –                  –
PL-GCN [11]                 89.2              95.0               –                  –
NAS-GCN [25]                89.4              95.7               –                  –
SGN [44]                    89.0              94.5               79.2               81.5
Shift-GCN [7]               90.7              96.5               85.9               87.6
MS-G3D [22]                 91.5              96.2               86.9               88.4
DC-GCN+ADG [6]              90.8              96.6               86.5               88.1
PA-ResGCN-B19 [31]          90.9              96.0               87.3               88.3
Dynamic-GCN [41]            91.5              96.0               87.3               88.6
MST-GCN [5]                 91.5              96.6               87.5               88.8
EfficientGCN-B4 [32]        92.1              96.1               88.7               88.9
CTR-GCN [4]                 92.4              96.8               88.9               90.6
STA-GCN(J)                  89.5              95.6               85.0               86.2
STA-GCN(B)                  90.2              95.4               85.5               87.1
STA-GCN(JM)                 88.3              94.3               82.9               84.0
STA-GCN(BM)                 88.6              94.1               83.1               84.9
2s-STA-GCN(J, B)            91.6              96.2               88.1               89.6
3s-STA-GCN(J, B, JM)        92.7              96.9               89.2               90.6
4s-STA-GCN(J, B, JM, BM)    92.8              97.0               89.4               90.8

4.4 Comparisons with SOTA Methods

For fair comparisons, we follow the same multi-stream fusion strategy as [4,7,41]. Specifically, we use four modality streams, i.e., joint stream (J), bone stream (B), joint motion stream (JM), and bone motion stream (BM). A simple score-level fusion strategy is adopted to obtain the fused score. The 1-stream model uses the individual stream of four modalities as input data. The 2-stream model fuses the joint and bone stream. The 3-stream model fuses the joint, bone, and joint motion stream. The 4-stream model fuses all four modality streams. We compare our models with the state-of-the-art methods on NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton in Tables 6 and 7 respectively. On NTU RGB+D 60 and NTU RGB+D 120, our STA-GCN using three modality streams outperforms the previous state-of-the-art methods using four modality streams (i.e., CTR-GCN [4] and MST-GCN [5]). Our final 4s-STA-GCN achieves


new state-of-the-art performance. On Kinetics Skeleton, our model with four-stream fusion outperforms the current state-of-the-art MST-GCN [5] by 2.1% and 2.7% in Top-1 and Top-5 accuracy, respectively. All these experimental results demonstrate the superiority of STA-GCN.

Table 7. Comparisons of the Top-1 and Top-5 accuracy with the state-of-the-art methods on the Kinetics Skeleton dataset.

Methods                     Top-1 (%)  Top-5 (%)
ST-GCN [40]                 30.7       52.8
AS-GCN [17]                 34.8       56.5
2s-AGCN [29]                36.1       58.7
DGNN [28]                   36.9       59.6
NAS-GCN [25]                37.1       60.1
MS-G3D [22]                 38.0       60.9
MST-GCN [5]                 38.1       60.8
STA-GCN(J)                  36.0       58.5
STA-GCN(B)                  34.9       57.3
STA-GCN(JM)                 33.1       56.4
STA-GCN(BM)                 33.6       56.6
2s-STA-GCN(J, B)            38.5       61.5
3s-STA-GCN(J, B, JM)        40.0       63.0
4s-STA-GCN(J, B, JM, BM)    40.2       63.5
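As a small illustration of the score-level fusion used in Sect. 4.4, the sketch below combines per-stream class scores by a weighted sum; equal weights and the use of softmax scores are assumptions, not the paper's exact recipe.

```python
import torch

def fuse_streams(logits_per_stream, weights=None):
    """Score-level fusion of modality streams (e.g. joint, bone,
    joint motion, bone motion). Each entry is a (batch, classes) tensor."""
    if weights is None:
        weights = [1.0] * len(logits_per_stream)     # equal weights assumed
    scores = [w * torch.softmax(l, dim=-1)
              for w, l in zip(weights, logits_per_stream)]
    fused = torch.stack(scores).sum(dim=0)
    return fused.argmax(dim=-1)                      # predicted class per sample

# e.g. 4s fusion over four streams
preds = fuse_streams([torch.randn(8, 60) for _ in range(4)])
```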

5 Conclusion

In this work, we present a Spatial-Temporal Adaptive Graph Convolutional Network (STA-GCN) to capture robust motion patterns for skeleton action recognition. The STA-GCN is composed of spatial adaptive graph convolution (SA-GC) and temporal adaptive graph convolution (TA-GC). The SA-GC module is designed to extract spatial features by modeling the spatial adaptive joint dependencies. The TA-GC module is designed to learn temporal features by capturing the direct long-range joint dependencies in the temporal dimension. Both SA-GC and TA-GC have a critical embedded component: the Topology Adaptive Encoder (TAE), which generates the spatial adaptive topology and the temporal adaptive topology for learning discriminative features. Extensive experimental results demonstrate the effectiveness of the proposed modules. On three large-scale skeleton action recognition datasets, the proposed STA-GCN achieves state-of-the-art performance.

Acknowledgements. This work is supported by the National Natural Science Foundation of China (Project No. 62076132) and the Natural Science Foundation of Jiangsu (Project No. BK20211194).


References 1. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: a review. ACM Comput. Surv. (CSUR) 43(3), 1–43 (2011) 2. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017) 3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) 4. Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., Hu, W.: Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13359–13368 (2021) 5. Chen, Z., Li, S., Yang, B., Li, Q., Liu, H.: Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1113–1122 (2021) 6. Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., Lu, H.: Decoupling GCN with DropGraph module for skeleton-based action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 536–553. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58586-0 32 7. Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192 (2020) 8. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015) 9. Gao, X., Hu, W., Tang, J., Liu, J., Guo, Z.: Optimized skeleton-based action recognition via sparsified graph regression. In: Proceedings of the 27th ACM International Conference on Multimedia, pp. 601–610 (2019) 10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 11. Huang, L., Huang, Y., Ouyang, W., Wang, L.: Part-level graph convolutional network for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11045–11052 (2020) 12. Huang, Z., Shen, X., Tian, X., Li, H., Huang, J., Hua, X.S.: Spatio-temporal inception graph convolutional networks for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2122–2130 (2020) 13. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017) 14. Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297 (2017) 15. Kim, T.S., Reiter, A.: Interpretable 3D human action analysis with temporal convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1623–1631. IEEE (2017)


16. Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-based action recognition with convolutional neural networks. In: 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 597–600. IEEE (2017) 17. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3595–3603 (2019) 18. Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019) 19. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+ D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684–2701 (2019) 20. Liu, J., Wang, G., Hu, P., Duan, L.Y., Kot, A.C.: Global context-aware attention LSTM networks for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1647–1656 (2017) 21. Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn. 68, 346–362 (2017) 22. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 143– 152 (2020) 23. Obinata, Y., Yamamoto, T.: Temporal extension module for skeleton-based action recognition. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 534–540. IEEE (2021) 24. Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition. J. Visual Commun. Image Representation 25(1), 24–38 (2014) 25. Peng, W., Hong, X., Chen, H., Zhao, G.: Learning graph convolutional network for skeleton-based human action recognition by neural searching. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 2669–2676 (2020) 26. Poppe, R.: A survey on vision-based human action recognition. Image Vision Comput. 28(6), 976–990 (2010) 27. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016) 28. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7912–7921 (2019) 29. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019) 30. Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1227–1236 (2019) 31. Song, Y.F., Zhang, Z., Shan, C., Wang, L.: Stronger, faster and more explainable: a graph convolutional baseline for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1625–1633 (2020)


32. Song, Y.F., Zhang, Z., Shan, C., Wang, L.: Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 1474–1488 (2022) 33. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019) 34. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015) 35. Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3D skeletons as points in a lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595 (2014) 36. Wang, L., et al.: Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell. 41(11), 2740–2755 (2018) 37. Wang, P., Li, W., Ogunbona, P., Wan, J., Escalera, S.: RGB-D-based human motion recognition with deep learning: a survey. Comput. Vis. Image Underst. 171, 118–139 (2018) 38. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115(2), 224–241 (2011) 39. Xia, L., Chen, C.C., Aggarwal, J.K.: View invariant human action recognition using histograms of 3d joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27. IEEE (2012) 40. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018) 41. Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., Tang, H.: Dynamic GCN: context-enriched topology learning for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 55–63 (2020) 42. Zanfir, M., Leordeanu, M., Sminchisescu, C.: The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2752–2759 (2013) 43. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126 (2017) 44. Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., Zheng, N.: Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1112–1121 (2020) 45. Zhang, X., Xu, C., Tao, D.: Context aware graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14333–14342 (2020) 46. Zhang, Z.: Microsoft kinect sensor and its effect. IEEE Multimedia 19(2), 4–10 (2012)

3D Pose Based Feedback for Physical Exercises

Ziyi Zhao1,2, Sena Kiciroglu1(B), Hugues Vinzant3, Yuan Cheng4, Isinsu Katircioglu5, Mathieu Salzmann1,6, and Pascal Fua1

1 CVLab, EPFL, Lausanne, Switzerland [email protected]
2 Tongji University, Lausanne, Switzerland [email protected]
3 a-gO, Corps-Nuds, France
4 KTH, Stockholm, Sweden
5 SDSC, Lausanne, Switzerland
6 Clearspace, Lausanne, Switzerland

Abstract. Unsupervised self-rehabilitation exercises and physical training can cause serious injuries if performed incorrectly. We introduce a learning-based framework that identifies the mistakes made by a user and proposes corrective measures for easier and safer individual training. Our framework does not rely on hard-coded, heuristic rules. Instead, it learns them from data, which facilitates its adaptation to specific user needs. To this end, we use a Graph Convolutional Network (GCN) architecture acting on the user's pose sequence to model the relationship between the body joint trajectories. To evaluate our approach, we introduce a dataset with 3 different physical exercises. Our approach yields 90.9% mistake identification accuracy and successfully corrects 94.2% of the mistakes. Our code and dataset are available at https://github.com/Jacoo-Zhao/3D-Pose-Based-Feedback-For-Physical-Exercises.

1 Introduction

Being able to perform exercises without requiring the supervision of a physical trainer is a convenience many people enjoy, especially after the COVID-19 pandemic. However, the lack of effective supervision and feedback can end up doing more harm than good, which may include causing serious injuries. There is therefore a growing need for computer-aided exercise feedback strategies. A few recent works have addressed this problem [1–6]. However, they focus only on identifying whether an exercise is performed correctly or not [1, 6], or they rely on hard-coded rules based on joint angles that cannot easily be extended to new exercises [2–4]. In this work, we therefore leverage recent advances in the fields of pose estimation [7–9], action recognition [10, 11] and motion prediction [12–14] to design a framework that provides automated and personalized feedback to supervise physical exercises. Specifically, we developed a method that not only points out mistakes but also offers suggestions on how to fix them without relying on hard-coded, heuristic rules to define

H. Vinzant, Y. Cheng and I. Katircioglu—Former members of CVLab, EPFL.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3 12. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 189–205, 2023. https://doi.org/10.1007/978-3-031-26316-3_12


what a successful exercise sequence should be. Instead, it learns from data. To this end, we use a two-branch deep network. One branch is an action classifier that tells users what kind of errors they are making. The other proposes corrective measures. They both rely on Graph Convolutional Networks (GCNs) that can learn to exploit the relationships between the trajectories of individual joints. Figure 1 depicts the kind of output our network produces.

[Fig. 1 panel labels — a) Squats: Knees inward, Feet too wide, Not low enough, Front bent; b) Lunges: Not low enough, Knee passes toe; c) Planks: Arched back, Hunch back]

Fig. 1. Example results from our framework depicting frames from the a) squat, b) lunge, and c) plank classes. The red poses correspond to the exercises performed incorrectly while the green poses correspond to our corrections. Note that although we display a single pose from each mistake type, our framework operates on entire sequences.

To showcase our framework’s performance, we recorded a physical exercise dataset with 3D poses and instruction label annotations. Our dataset features 3 types of exercises; squats, lunges and planks. Each exercise type is performed correctly and with mistakes following specific instructions by 4 different subjects. Our approach achieves 90.9% mistake recognition accuracy on a test set. Furthermore, we use the classification branch of our framework to evaluate the performance of the correction branch, considering the correction to be successful if the corrected motion is classified as “correct”. Under this metric, our approach successfully corrects 94.2% of users’ mistakes. We will make our code and dataset publicly available upon acceptance.

2 Related Work Our work is at the intersection of several sub-fields of computer vision: (i) We draw inspiration from GCN based human motion prediction architectures; (ii) we identify the users’ mistakes in an action recognition fashion; and (iii) we address the task of physical exercise analysis. We therefore discuss these three topics below.


2.1 Human Motion Prediction

Human motion prediction is a complex task due to the inherent uncertainty in forecasting into the future. In recent years, many deep learning methods based on recurrent neural networks (RNNs) [14–20], variational auto encoders (VAEs) [21–24], transformers [12], and graph convolutional networks (GCNs) [13, 25–27] have been proposed. We focus our discussion on GCN based ones, as we also exploit the graph-like connections of human joints with GCNs in our approach. In [25], Mao et al. proposed to exploit a GCN to model the relationships across joint trajectories by representing them with Discrete Cosine Transform (DCT) coefficients. The approach was extended in [13] by integrating an attention module, and in [27, 28] by using cross-subject attention for multi-person motion prediction. In [26], the input coefficients to the GCN were extracted via an inception module instead of the DCT. Our motion correction branch is inspired by [25], but instead of forecasting future motion, we predict correctly performed exercises.

2.2 Action Recognition

Although there is a vast literature on image-based action recognition, here we focus on its skeleton-based counterpart, as our approach also processes 3D poses. Early deep learning based approaches to skeleton-based action recognition mostly relied on RNNs [29–33]. Li et al. [10] used convolutional neural networks (CNNs) to extract features hierarchically by first finding local point-level features and gradually extracting global spatial and temporal features. Zhang et al. [34] designed CNN and RNN networks that are robust to viewpoint changes. Recently, [11, 35, 36] employed GCNs for action recognition. Specifically, Tang et al. [35] designed a reinforcement learning scheme to select the most informative frames and feed them to a GCN. Li et al. [36] developed a GCN framework that not only models human joint connections, but also learns to infer "actional-links", which are joint dependencies learned from the data. Zhang et al. [11] designed a two-module network, consisting of a first GCN-based module that extracts joint-level information and a second frame-level module capturing temporal information via convolutional layers and spatial and temporal max-pooling. Our classification branch borrows ideas from Mao et al.'s [25] and Zhang et al.'s [11] architectures. It is composed of graph convolutional blocks as proposed by Mao et al. [25] combined with the frame-level module architecture proposed by Zhang et al. [11].

2.3 Physical Exercise Analysis

Physical exercise analysis aims to prevent injuries that may arise when a person performs motions incorrectly. In its simplest form, such an analysis amounts to detecting whether the subject performs the exercise correctly or not. This was achieved in several works [1, 5, 6] by exploiting 2D poses extracted from the input images. In particular, Dittakavi et al. [5] detected which joints need to be fixed by finding the overall joint angle distribution of the dataset and detecting poses in which a joint angle is an anomaly.


This framework operates on single frames, as opposed to our method which operates on entire sequences. In [37], Zell et al. represented the human body as a mass-spring model and analyzed the extension torque on certain joints, allowing them to classify whether a motion is performed correctly or not. While useful, such classification-based approaches offer limited information to the user, as they do not provide them with any feedback about the specific type of mistakes they made. Moreover, most existing works operate on 2D pose inputs [1–3, 5, 6]. Similar to [4], we design our framework to work with 3D poses, enabling us to be robust to ambiguities found in 2D poses. While a few works took some steps toward giving feedback [2–4], this was achieved in a hard-coded fashion, by thresholding angles between some of the body joints. As such, this approach relies on manually defining such thresholds, and thus does not easily extend to new exercises. Furthermore, it does not provide the user with personalized corrective measures in a visual manner, by demonstrating the correct version of their performance. We address this by following a data-driven approach able to automatically learn the different "correct" forms of an exercise, and that can easily extend to different types of exercises and mistakes. To the best of our knowledge, our framework is the first to both identify mistakes and suggest personalized corrections to the user.

3 Methodology

Before we introduce our framework in detail, let us formally define the tasks of motion classification and correction. Motion classification seeks to predict the action class c of a sequence of 3D poses from t = 1 to t = N, denoted as P_{1:N}. We can write this as

c = F_{class}(P_{1:N}),

where F_{class} is the classification function. We define motion correction as the task of finding the "correct" version of a sequence, which can be written as

\hat{P}_{1:N} = F_{corr}(P_{1:N}),

where F_{corr} is the correction function and \hat{P}_{1:N} is the corrected sequence. Ideally, the corrected sequence should be of class "correct". We can use the classification function to verify that this is the case, i.e.,

c_{correct} = F_{class}(\hat{P}_{1:N}),

where c_{correct} is the label corresponding to a correctly performed exercise. Given these definitions, we now describe the framework we designed to address these tasks and discuss our training and implementation details.

3.1 Exercise Analysis Framework

Our framework for providing exercise feedback relies on GCNs and consists of two branches: One that predicts whether the input motion is correct or incorrect, specifying the mistake being made in the latter case, and one that outputs a corrected 3D pose


sequence, providing a detailed feedback to the user. We refer to these two branches as the "classifier" and "corrector" models, respectively. Inspired by Mao et al. [25], we use the DCT coefficients of joint trajectories, rather than the 3D joint positions, as input to our model. This allows us to easily process sequences of different lengths. The corrector model outputs DCT coefficient residuals, which are then summed with the input coefficients and undergo an inverse-DCT transform to be converted back to a series of 3D poses. To reduce the time and space complexity of training the classifier and the corrector separately and to improve the accuracy of the model, we combine the classification and correction branches into a single end-to-end trainable model. Figure 2 depicts our overall framework. It takes the DCT coefficients of each joint trajectory as input. The first layers are shared by the two models, and the framework then splits into the classification and correction branches. Furthermore, we feed the predicted action labels coming from the classification branch to the correction branch. We depict this in Fig. 2 as the "Feedback Module". Specifically, we first find the label with the maximum score predicted by the classification branch, convert this label into a one-hot encoding, and feed it to a fully-connected layer. The resulting tensor is concatenated to the output of the first graph convolutional blocks (GCB) of the correction branch. This process allows us to explicitly provide label information to the correction module, enabling us to further improve the accuracy of the corrected motion.
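To make the trajectory encoding concrete, the following sketch shows how per-joint trajectories could be converted to DCT coefficients and how a predicted residual would be mapped back to 3D poses. It is a minimal illustration using SciPy's orthonormal DCT-II, not the authors' implementation; the number of retained coefficients is a hypothetical choice.

```python
import numpy as np
from scipy.fft import dct, idct

def poses_to_dct(poses, num_coeffs=25):
    """Encode a pose sequence (T, J, 3) as truncated DCT coefficients per joint coordinate.

    Returns an array of shape (J * 3, num_coeffs): one spectrum per joint trajectory,
    which makes sequences of different lengths comparable.
    """
    T = poses.shape[0]
    trajectories = poses.reshape(T, -1).T                     # (J*3, T), one row per trajectory
    coeffs = dct(trajectories, type=2, norm='ortho', axis=1)
    return coeffs[:, :num_coeffs]

def dct_to_poses(coeffs, seq_len, num_joints):
    """Invert truncated DCT coefficients back to a (seq_len, J, 3) pose sequence."""
    full = np.zeros((coeffs.shape[0], seq_len))
    full[:, :coeffs.shape[1]] = coeffs
    trajectories = idct(full, type=2, norm='ortho', axis=1)   # inverse of the ortho DCT-II
    return trajectories.T.reshape(seq_len, num_joints, 3)

# Correction as described above: the corrector predicts a residual in DCT space,
# which is added to the input coefficients before inverting, e.g.
# corrected = dct_to_poses(poses_to_dct(poses) + predicted_residual, T, J)
```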

Fig. 2. Our framework consists of a classification and a correction branch. They share several graph convolutional layers and are then split such that the classification branch identifies the type of mistakes made by the user and the correction branch outputs a corrected pose sequence. The result of the classification branch is fed to the correction branch via a feedback module.


Implementation and Training Details. We primarily use GCB similar to those presented in [25] in our network architecture, depicted in Fig. 3. These modules allow us to learn the connectivity between different joint trajectories. Each graph convolutional layer is set to have 256 hidden features. Additionally, our classification branch borrows ideas from Zhang et al.'s [11] action recognition model. It is a combination of GCB modules and the frame-level module architecture of [11] consisting of convolutional layers and spatial max-pooling layers.

We train our network in a supervised manner by using pairs of incorrectly performed and correctly performed actions. However, it is not straightforward to find these pairs of motions. The motion sequences are often of different lengths, and we face the task of matching incorrectly performed actions to the closest correctly performed action from the same actor. To do so, we make use of Dynamic Time Warping (DTW) [38], which enables us to find the minimal alignment cost between two time series of different lengths, using dynamic programming. We compute the DTW loss between each incorrect and correct action pair candidate and select the pair with the smallest loss value.

We use the following loss functions to train our model.
– E_corr: The loss of the correction branch, which aims to minimize the soft-DTW [39] loss between the corrected output sequence and the closest correct motion sequence, determined as described previously. The soft-DTW loss is a differentiable version of the DTW loss, implemented by replacing the minimum operation by a soft minimum.
– E_smooth: The smoothness loss on the output of the correction branch, to ensure the produced motion is smooth and realistic. It penalizes the velocities of the output motion by imposing an L2 loss on them.
– E_class: The loss of the classification branch, which aims to minimize the cross entropy loss between the predicted logits and the ground-truth instruction label.

We combine these losses into

E_loss = w_corr E_corr + w_class E_class + w_smooth E_smooth,    (1)

where E_loss is the overall loss and w_corr, w_class, w_smooth are the weights of the correction, classification, and smoothness losses, respectively. For our experiments we set w_corr = 1, w_class = 1, and w_smooth = 1e-3.

During training, we use curriculum learning in the feedback module: Initially the ground-truth instruction labels are given to the correction branch. We then use a scheduled sampling strategy similar to [40], where the probability of using the ground-truth labels instead of the predicted ones decreases from 1 to 0 linearly as the epochs increase. In other words, the ground-truth labels are progressively substituted with the labels predicted by the classification branch, until only the predicted labels are used. During inference, only the predicted labels are given to the correction branch.

We use Adam [41] as our optimizer. The learning rate is initially set to 0.01 and decays according to the equation lr = 0.01 · 0.9^{i/s}, where lr is the learning rate, i is the epoch and s is the decay step, which is set to 5. To increase robustness and avoid overfitting, we also use drop-out layers with probability 0.5. We use a batch size of 32 and train for 50 epochs.
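Because the incorrect and correct sequences have different lengths, the pairing step relies on the DTW alignment cost. A minimal NumPy version of the standard DTW recursion is sketched below; the Euclidean frame distance is an assumption for illustration, and the actual training objective uses the differentiable soft-DTW [39] rather than this hard-minimum variant.

```python
import numpy as np

def dtw_cost(seq_a, seq_b):
    """Minimal alignment cost between two pose sequences of shapes (Ta, D) and (Tb, D).

    Classic dynamic-programming DTW with a Euclidean frame distance.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Ta, Tb]

def closest_correct(incorrect_seq, correct_seqs):
    """Pick the correctly performed sequence with the smallest DTW cost,
    as done when building (incorrect, correct) training pairs."""
    costs = [dtw_cost(incorrect_seq, c) for c in correct_seqs]
    return correct_seqs[int(np.argmin(costs))]
```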


Fig. 3. Graph Convolutional Block (GCB) consisting of graph convolutional layers, batch normalization layers, ReLUs and drop-outs.
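For illustration, a graph convolutional block in the spirit of [25] and Fig. 3 could be implemented as in the hedged PyTorch sketch below, where the connectivity between joint-trajectory nodes is a fully learnable adjacency matrix. The 256 hidden features and dropout probability of 0.5 follow the text; the exact layer arrangement and residual connection are assumptions.

```python
import torch
import torch.nn as nn

class GraphConvBlock(nn.Module):
    """Graph convolutional block: two graph conv layers with batch normalization,
    ReLU and dropout, plus a residual connection.

    x has shape (batch, num_nodes, feat); the learnable adjacency mixes
    information across joint-trajectory nodes.
    """
    def __init__(self, num_nodes, feat_dim=256, p_dropout=0.5):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_nodes))      # learnable node connectivity
        self.w1 = nn.Linear(feat_dim, feat_dim)
        self.w2 = nn.Linear(feat_dim, feat_dim)
        self.bn1 = nn.BatchNorm1d(num_nodes * feat_dim)
        self.bn2 = nn.BatchNorm1d(num_nodes * feat_dim)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(p_dropout)

    def _layer(self, x, w, bn):
        b, n, f = x.shape
        y = torch.matmul(self.adj, w(x))                   # mix nodes, then transform features
        y = bn(y.reshape(b, -1)).reshape(b, n, f)
        return self.drop(self.act(y))

    def forward(self, x):
        y = self._layer(x, self.w1, self.bn1)
        y = self._layer(y, self.w2, self.bn2)
        return x + y                                       # residual connection
```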

4 EC3D Dataset

Existing sports datasets such as Yoga-82 [42], FineGym [43], FSD-10 [44], and Diving48 [45] often include correct performances of exercises but do not include incorrect sequences. They are also not annotated with 3D poses. Therefore, to evaluate our approach, we recorded and processed a dataset of physical exercises performed both correctly and incorrectly, which we named the "EC3D" (Exercise Correction in 3D) dataset. Specifically, this dataset contains 3 types of actions, each with 4 subjects who repeatedly performed a particular correct or incorrect motion as instructed. We show the number of sequences per action and the instructions for each subject in Table 1. The dataset contains a total of 132 squat, 127 lunge, and 103 plank action sequences, split across 11 instruction labels. The videos were captured by 4 GoPro cameras placed in a ring around the subject, using a frame rate of 30 fps and a 1920 × 1080 image resolution. Figure 4 depicts example images taken from the dataset with their corresponding 2D and 3D skeleton representation. The cameras' intrinsics were obtained by recording a chessboard pattern and using standard calibration methods implemented in OpenCV [46].

Table 1. The EC3D dataset with the number of sequences per instruction of each subject, the total number of sequences per instruction and the total number of sequences per action. We reserve Subjects 1, 2, and 3 for training and 4 for testing.

Exercise  Instruction label   Subject 1  Subject 2  Subject 3  Subject 4  Total (per instruction)  Total (per action)
Squats    Correct             10         10         11         10         41                       132
          Feet too wide       5          8          5          5          23
          Knees inward        6          7          5          5          23
          Not low enough      5          7          5          4          21
          Front bent          5          6          6          7          24
Lunges    Correct             12         11         11         12         46                       127
          Not low enough      10         10         10         10         40
          Knee passes toe     10         10         11         10         41
Planks    Correct             7          8          11         7          33                       103
          Arched back         5          5          11         9          30
          Hunch back          10         10         11         9          40


Fig. 4. Example images from the EC3D dataset, depicting images from the squat, lunge, and plank classes with their corresponding 3D pose visualizations. a) Images for each exercise type from the dataset from each camera viewpoint, with the 2D poses overlayed. b) The corresponding 3D poses, visualized from two different viewpoints.

We annotated the 3D poses in an automated manner, whereas the action and instruction labels were annotated manually. Specifically, the 3D pose annotation was performed as follows: First, the 2D joint positions were extracted from the images captured by each camera using OpenPose [47], an off-the-shelf 2D pose estimation network. We then used bundle adjustment to determine the cameras’ extrinsics. For the bundle adjustment algorithm to converge quickly and successfully, additional annotations were made on static landmarks in 5 frames. Since the cameras were static during recording, for each camera, we averaged the extrinsics optimized for each of these frames. Afterwards, these values were kept constant, and we triangulated the 2D poses to compute the 3D poses.
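As a concrete reference for the triangulation step, the snippet below shows a standard linear (DLT) triangulation of a single joint from several calibrated views. It is a generic sketch rather than the authors' code, and assumes each camera is described by a 3×4 projection matrix built from the calibrated intrinsics and extrinsics.

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d):
    """Triangulate a single 3D joint from N >= 2 calibrated views.

    proj_mats: list of N camera projection matrices, each 3x4 (intrinsics @ extrinsics).
    points_2d: array of shape (N, 2) with the joint's pixel coordinates per view.
    Returns the 3D point minimizing the linear DLT residual.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # homogeneous solution = last right singular vector
    return X[:3] / X[3]
```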


During the triangulation process, we detected whenever any joint had a high reprojection error to catch mistakes in the 2D pose estimates. Such 2D pose annotations were discarded to prevent mistakes in the 3D pose optimization. The obtained 3D pose values were afterwards smoothed using a Hamming filter to avoid jittery motion. Finally, we manually went through the extracted 3D pose sequences in order to ensure that there are no mistakes and that they are consistent with the desired motion. To make the resulting 3D poses uniform, we further normalized, centered and rotated them. As the different heights and body sizes of the different subjects cause differences in skeletal lengths, a random benchmark was selected to normalize the skeletal lengths while maintaining the connections between joints. Furthermore, we centered all poses on their hip joint and rotated them so that the spine was perpendicular to the ground and all movements were performed in the same direction.
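A minimal version of the smoothing and hip-centering described above might look as follows; the window length and the hip-joint index are hypothetical choices made only for illustration.

```python
import numpy as np

def smooth_and_center(poses, hip_idx=0, window=9):
    """Temporally smooth a (T, J, 3) pose sequence with a Hamming window,
    then center every frame on its hip joint."""
    w = np.hamming(window)
    w /= w.sum()
    T, J, _ = poses.shape
    flat = poses.reshape(T, -1)
    # Convolve each coordinate trajectory with the normalized Hamming window.
    smoothed = np.stack(
        [np.convolve(flat[:, k], w, mode='same') for k in range(flat.shape[1])], axis=1
    ).reshape(T, J, 3)
    return smoothed - smoothed[:, hip_idx:hip_idx + 1, :]   # hip-centered poses
```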

5 Evaluation 5.1 Dataset and Metrics We use the EC3D dataset to evaluate our model performance both quantitatively and qualitatively. We use subjects 1, 2, and 3 for training and subject 4 for evaluation. We use top-1 classification accuracy to evaluate the results of the instruction classification task, as used by other action classification works [10, 11]. For the motion correction task, we make use of the action classifier branch: If the corrected motion is classified as “correct” by our classification branch, we count the correction as successful. We report the percentage of successfully corrected motions as the evaluation metric for this task. 5.2 Quantitative Results We achieve an average mistake recognition accuracy of 90.9% when classifying sequences in EC3D, as shown by the detailed results for each specific exercise instruction in Table 2. In the same table, we also show that 94.2% of the corrected results are classified as “correct” by our classification model. The high classification accuracy and correction success show that our framework is indeed capable of analyzing physical exercises and giving useful feedback. As no existing works have proposed detailed correction strategies, we compare our framework to a simple correction baseline consisting of retrieving the closest “correct” sequence from the training data. The closest sequence is determined as the sequence with the lowest DTW loss value to the input sequence. In Table 3, we provide the DTW values between the incorrectly performed input and the corrected output. The DTW loss acts as an evaluation of the accuracy of joint positions, as it is an L2 loss on the time aligned sequences. For this metric, the lower, the better, i.e., the output motion should be as close as possible to the original one while being corrected as necessary. Our framework yields a high success rate of correction together with a lower DTW loss than the baseline, thus supporting our claims. Note that we do not evaluate the baseline’s correction success percentage because it retrieves the same sequences that were used to train the network, to which the classification branch might have already overfit.
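Both evaluation metrics reduce to simple counting. The sketch below assumes the "correct" instruction is encoded as class 0, which is a hypothetical encoding, and treats the trained classification branch as a black-box function.

```python
import numpy as np

def top1_accuracy(pred_logits, labels):
    """Fraction of sequences whose highest-scoring class matches the instruction label."""
    return float(np.mean(np.argmax(pred_logits, axis=1) == labels))

def correction_success(classifier, corrected_sequences, correct_class=0):
    """A correction counts as successful if the classifier labels the
    corrected sequence as 'correct'."""
    preds = [np.argmax(classifier(seq)) for seq in corrected_sequences]
    return float(np.mean(np.array(preds) == correct_class))
```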


Table 2. Results of our classification and correction branches on the EC3D dataset. We achieve 90.9% recognition accuracy on average and successfully correct 94.2% of the mistakes.

Exercise  Mistake label     Classification accuracy (%)  Correction success (%)
Squats    Correct           90.0                         100
          Feet too wide     100                          100
          Knees inward      100                          100
          Not low enough    100                          100
          Front bent        57.1                         85.7
Lunges    Correct           66.7                         100
          Not low enough    100                          60.0
          Knee passes toe   100                          90.0
Planks    Correct           85.7                         100
          Arched back       100                          100
          Hunch back        100                          100
Average                     90.9                         94.2

5.3 Qualitative Results

In Fig. 5, we provide qualitative results corresponding to all the incorrect motion examples from each action category. Note that the incorrect motions are successfully corrected, yet still close to the original sequence. This makes it possible for the user to easily recognize their own motion and mistakes.

5.4 Ablation Studies

We have tried various versions of our framework and recorded our results in Table 4. In this section, we present the different experiments, also depicted in Fig. 6, and the discussions around these experiments. Separated Models. We first analyze the results of separated classification networks. According to Table 4, our separated classification branch architecture is denoted as “separated classification.” We have also evaluated a simpler, fully GCN based separated action classifier branch, denoted as “separated classification (simple)”. We show that the results of the classification branch degrade slightly when separated from the correction branch. This indicates that the classification branch also sees a minor benefit from being part of a combined model. The simpler classification network performs worse than our architecture inspired by [11], showing that the pooling module improves the classification accuracy. Afterwards, we analyze the results of a separated correction network, denoted as “separated correction”. Here the difference is quite profound; we see that separating the correction model from the classification model degrades correction success significantly. We note that 50 epochs was not enough for the separated corrector framework to converge, therefore we trained it for a total of 150 epochs.


Fig. 5. Qualitative results from our framework depicting incorrect input motions and corrected output motions from categories a–d) squats, e,f) lunges, g, h) planks. We present the incorrect input sequences (red) in the top row. The corrected sequences (green) overlaid on top of the incorrect input sequences (red) are presented in the bottom row. The most significant corrections are highlighted with a yellow bounding box. We find that our proposals are successful in correcting the incorrect sequences. This figure is best viewed in color and zoomed in on a screen.


Fig. 6. Ablation study frameworks. We depict the different architectures we evaluated for the ablation studies: a) Separated classifier (simple) and separated classifier. b) Separated corrector. c) Combined model without feedback. This model does not include a “Feedback Module”, the classification branch’s results are not explicitly fed to the correction branch.


Table 3. DTW results of the correction branch. We compare our framework to a simple baseline retrieving the best matching "correct" sequence from the training dataset depending on the classification label. We report the DTW loss between the input and the output sequences (lower is better). Our framework successfully corrects the subject's mistakes, while not changing the input so drastically that the subject would not be able to recognize their own performance.

Exercise  Mistake label     Retrieval baseline  Our framework
Squats    Correct           1.28                0.56
          Feet too wide     4.23                1.46
          Knees inward      1.61                0.66
          Not low enough    1.83                0.61
          Front bent        4.74                2.53
Lunges    Correct           1.94                1.82
          Not low enough    1.86                1.31
          Knee passes toe   2.27                1.48
Planks    Correct           2.41                1.79
          Arched back       12.20               1.53
          Hunch back        4.10                1.09
Average                     3.49                1.35

Combined Models. We train our framework without the feedback module (“combined w/o feedback”), and without the smoothness loss (“combined w/o smoothness”). We find that these perform worse than our model with the feedback module and with the smoothness loss in terms of correction success. This shows that using the classification results as feedback as well as the smoothness loss for training is useful for more successful corrections. We notice that our framework trained without smoothness loss has higher classification accuracy, despite having a lower correction success. We believe this is due to the fact that the smoothness loss acts as a regularizer on the framework, therefore causing slight performance losses to the classification branch. However, the results of the correction success are significantly higher with the smoothness loss. We also evaluate our trained model by passing random incorrect instruction labels to the correction branch instead of the labels predicted by the classification branch (“combined with random incorrect feedback”). The correction success drops significantly, showing that the classification results are indeed very useful for the correction branch. Further qualitative and quantitative results are presented in the supplementary material. 5.5 Limitations and Future Work While our current results are quite impressive, there is still room for improvement in terms of performance. In particular, our framework struggles in correcting specific types of motions, such as not low enough lunges. We plan to explore different additions to our framework, such as an attention module, to better correct these types of mistakes.


Table 4. Results of the ablation studies with several variations of our framework. We report the classification accuracy (%) and the correction success rate (%), where higher is better for both metrics. Our framework benefits greatly from combining the two tasks in an end-to-end learning fashion, from using a feedback module, and from using a pooling layer in the classification branch. The smoothness loss causes a slight degradation in classification accuracy but is greatly beneficial for the correction success.

                                                    Classification accuracy (%)  Correction success (%)
Separated  Separated classification (simple)        88.6                         –
           Separated classification                 89.8                         –
           Separated correction                     –                            83.5
Combined   Combined w/o feedback                    82.3                         85.3
           Combined with random incorrect feedback  90.9                         87.3
           Combined w/o smoothness                  93.4                         87.5
           Ours                                     90.9                         94.2

Our future work will consist of expanding the dataset to include more action sequences and more types of mistakes, performed by a larger set of subjects. We believe that this will allow us to further improve our framework and add components that address the shortcomings we will discover using such a dataset.

6 Conclusion We have presented a 3D pose based feedback framework for physical exercises. We have designed this framework to output feedback in two branches; a classification branch to identify a potential mistake and a correction branch to output a corrected sequence. Through ablation studies, we have validated our network architectural choices and presented detailed experimental results, making a strong case for the soundness of our framework design. We have also introduced a dataset of physical exercises, on which we have achieved 90.9% classification accuracy and 94.2% correction success. Acknowledgements. This work was supported in part by the Swiss National Science Foundation.

References 1. Chen, S., Yang, R.R.: Pose Trainer: correcting exercise posture using pose estimation, arXiv Preprint (2020) 2. Yang, L., Li, Y., Zeng, D., Wang, D.: Human exercise posture analysis based on pose estimation. In: 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) (2021) 3. Kanase, R., Kumavat, A., Sinalkar, R., Somani, S.: Pose Estimation and Correcting Exercise Posture. ITM Web Conf. 40, 03031 (2021)


4. Fieraru, M., Zanfir, M., Pirlea, S.C., Olaru, V., Sminchisescu, C.: AIFit: automatic 3D human-interpretable feedback models for fitness training. In: Conference on Computer Vision and Pattern Recognition (2021) 5. Dittakavi, B., et al.: Pose Tutor: an explainable system for pose correction in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2022) 6. Rangari, T., Kumar, S., Roy, P., Dogra, D., Kim, B.G.: Video based exercise recognition and correct pose detection. Multim. Tools Appl. 81 (2022) 7. Kocabas, M., Athanasiou, N., Black, M.J.: VIBE: video inference for human body pose and shape estimation. In: Conference on Computer Vision and Pattern Recognition, pp. 5252– 5262 (2020) 8. Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3D human pose estimation with spatial and temporal transformers. In: International Conference on Computer Vision, pp. 11656–11665 (2021) 9. Gong, K., et al.: PoseTriplet: co-Evolving 3D human pose estimation, Imitation, and Hallucination Under Self-Supervision. In: Conference on Computer Vision and Pattern Recognition (2022) 10. Li, Y., Yuan, L., Vasconcelos, N.: Co-Occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In: International Joint Conference on Artificial Intelligence (2018) 11. Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., Zheng, N.: Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Conference on Computer Vision and Pattern Recognition (2020) 12. Aksan, E., Kaufmann, M., Cao, P., Hilliges, O.: A Spatio-temporal transformer for 3D human motion prediction. In: International Conference on 3D Vision (3DV) (2021) 13. Mao, W., Liu, M., Salzmann, M.: History repeats itself: human motion prediction via motion attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 474–489. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-585686 28 14. Kiciroglu, S., Wang, W., Salzmann, M., Fua, P.: Long term motion prediction using keyposes. In: International Conference on 3D Vision (3DV) (2022) 15. Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: International Conference on Computer Vision (2015) 16. Jain, A., Zamir, A., ADN Saxena, S.S.A.: Structural-RNN: deep learning on spatio-temporal graphs. In: Conference on Computer Vision and Pattern Recognition (2016) 17. Ghosh, P., Song, J., Aksan, E., Hilliges, O.: Learning human motion models for long-term predictions. In: International Conference on 3D Vision (2017) 18. Martinez, J., Black, M., Romero, J.: On human motion prediction using recurrent neural networks. In: Conference on Computer Vision and Pattern Recognition (2017) 19. Wang, B., Adeli, E., Chiu, H.K., Huang, D.A., Niebles, J.C.: Imitation learning for human pose prediction. In: International Conference on Computer Vision, pp. 7123–7132 (2019) 20. Barsoum, E., Kender, J., Liu, Z.: HP-GAN: probabilistic 3D human motion prediction via GAN. In: Conference on Computer Vision and Pattern Recognition (2017) 21. Butepage, J., Black, M., Kragic, D., Kjellstrom, H.: Deep representation learning for human motion prediction and classification. In: Conference on Computer Vision and Pattern Recognition (2017) 22. B¨utepage, J., Kjellstr¨om, H., Kragic, D.: Anticipating many futures: online human motion prediction and generation for human-robot interaction. 
In: International Conference on Robotics and Automation (2018)


23. B¨utepage, J., Kjellstr¨om, H., Kragic, D.: Predicting the what and how - a probabilistic semisupervised approach to multi-task human activity modeling. In: Conference on Computer Vision and Pattern Recognition, pp. 2923–2926 (2019) 24. Aliakbarian, S., Saleh, F.S., Salzmann, M., Petersson, L., Gould, S.: A stochastic conditioning scheme for diverse human motion prediction. In: Conference on Computer Vision and Pattern Recognition (2020) 25. Mao, W., Liu, M., Salzmann, M., Li, H.: Learning trajectory dependencies for human motion prediction. In: International Conference on Computer Vision (2019) 26. Lebailly, T., Kiciroglu, S., Salzmann, M., Fua, P., Wang, W.: Motion prediction using temporal inception module. In: Asian Conference on Computer Vision (2020) 27. Katircioglu, I., Georgantas, C., Salzmann, M., Fua, P.: Dyadic human motion prediction, arXiv Preprint (2022) 28. Guo, W., Bie, X., Alameda-Pineda, X., Moreno-Noguer, F.: Multi-person extreme motion prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 29. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Conference on Computer Vision and Pattern Recognition (2015) 30. Shahroudy, A., Liu, J., Ng, T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Conference on Computer Vision and Pattern Recognition, pp. 1010– 1019 (2016) 31. Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3d human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 816–833. Springer, Cham (2016). https://doi.org/10.1007/978-3-31946487-9 50 32. Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: AAAI Conference on Artificial Intelligence (2017) 33. Liu, J., Wang, G., Hu, P., Duan, L., Kot, A.: Global Context-aware attention LSTM networks for 3D action recognition. In: Conference on Computer Vision and Pattern Recognition (2017) 34. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive neural networks for high performance skeleton-based human action recognition. . Trans. Pattern, Anal. Mach. Intell. 41 1963–1978 (2019) 35. Tang, Y., Tian, Y., Lu, J., Li, P., Zhou, J.: Deep progressive reinforcement learning for skeleton-based action recognition. In: Conference on Computer Vision and Pattern Recognition (2018) 36. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Conference on Computer Vision and Pattern Recognition (2019) 37. Zell, P., Wandt, B., Rosenhahn, B.: Joint 3D human motion capture and physical analysis from monocular videos. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017) 38. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. In: IEEE Trans. Acoust. Speech Sig. Proc. 26, 43–49 (1978) 39. Cuturi, M., Blondel, M.: Soft-DTW: a differentiable loss function for time-series. In: International Conference on Machine Learning (2017) 40. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems (2015) 41. Kingma, D.P., Ba, J.: Adam: A Method for stochastic optimisation. 
In: International Conference on Learning Representations (2015)


42. Verma, M., Kumawat, S., Nakashima, Y., Raman, S.: Yoga-82: a new dataset for fine-grained classification of human poses. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020) 43. Shao, D., Zhao, Y., Dai, B., Lin, D.: FineGym: a Hierarchical Video Dataset for Fine-grained Action Understanding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 44. Shenglan, L., et al.: FSD-10: a fine-grained classification dataset for figure skating. Neurocomputing 413, 360–367 (2020) 45. Li, Y., Li, Y., Vasconcelos, N.: RESOUND: towards action recognition without representation bias. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 520–535. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-012311 32 46. (Open Source Computer Vision Library). http://opencv.org 47. Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Conference on Computer Vision and Pattern Recognition, pp. 1302–1310 (2017)

Generating Multiple Hypotheses for 3D Human Mesh and Pose Using Conditional Generative Adversarial Nets

Xu Zheng, Yali Zheng(B), and Shubing Yang

School of Automation Engineering, University of Electronic and Scientific Technology of China, Chengdu, China [email protected]

Abstract. Despite recent successes in 3D human mesh/pose recovery, reconstruction ambiguity remains a challenging problem that cannot be avoided when lighting changes, occlusion or self-occlusion occur in a scene. We argue that there can be multiple 3D human meshes corresponding to a single image from a given viewpoint, because we do not really know what happens under extreme lighting or behind occlusions/self-occlusions. In this paper, we address the problem using Conditional Generative Adversarial Nets (CGANs) to generate multiple hypotheses for 3D human mesh and pose from a single image under the condition of 2D joints and the relative depth of adjacent joints. In the first stage, the initial estimates of the 2D human skeleton, relative depth and features are taken as input of the CGANs to train the generator and discriminator. In the second stage, the generator of the CGANs is used to produce multiple human meshes under different conditions, consistent with the human silhouette and 2D joint points. Selecting and clustering are utilized to eliminate abnormal and redundant human meshes. The number of hypotheses is not fixed for each image; it depends on the 2D pose ambiguity. Unlike existing end-to-end 3D human mesh recovery methods, our approach consists of three task-specific deep networks trained separately to mitigate the training burden in terms of time and datasets. Our approach has been evaluated qualitatively and quantitatively not only on laboratory and real-scene datasets but also on Internet images, and experimental results demonstrate its effectiveness.

Keywords: Human mesh · CGAN · Multiple hypotheses

1 Introduction Recovering 3D human mesh and pose from images is a fundamental problem in the field of computer vision. It is challenging to reconstruct 3D human mesh accurately and robustly from a single image, which has drawn a lot of attention from researchers. Essentially this is an ill-posed problem whether using traditional 3D geometric methods or using deep learning methods to restore. The number of geometric constraints is less than the number of unknown variables, which causes the uncertainty of 3D human mesh and pose reconstruction. Further, as extreme lighting, occlusion or self-occlusion in real scenes happens, it exacerbates the uncertainty [10, 51]. Most of the existing methods are only able to recover one plausible human mesh from a single image, however, there c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 206–222, 2023. https://doi.org/10.1007/978-3-031-26316-3_13


[Fig. 1 panel labels — Input | Hypothesis 1 | Hypothesis 2 | 3D poses]

Fig. 1. Our approach takes a single image as input, and recovers multiple diverse hypotheses of 3D human mesh and pose. Here two hypotheses for the human mesh are shown in two different views. (Best viewed in color)

exists multiple possible reconstruction for a number of images, see the first column in Fig. 1, that is one image leads to multiple reconstructions. Generative Adversarial Nets are the popular frameworks to train generative model in an alternative manner, however, they suffer from the mode collapse easily, and they are hard to train. So Mirza and Osindero proposed CGANs, and conditioning could be class labels, some part of data or even data from different modality [29]. CGANs have many benefits. They can generate models under the control of condition, and speed the convergence of model training. This property of CGANs is quite suitable to solve the problem of multiple hypotheses reconstruction for human mesh and pose. In this paper, we try to generate multiple hypotheses for reasonable 3D human mesh via CGANs which are consistent with human silhouette and 2D joint points detected from images, see Fig. 1. Unlike the existing end-to-end methods for 3D human mesh reconstruction, we propose a two-stage approach to generate multiple hypotheses of 3D human mesh under the CGANs framework. In the first stage, human silhouette, 2D joints and features are detected automatically, and CGANs are trained to generate human mesh bases in an adversarial manner. Since 2D joints and relative depth may be estimated incorrectly due to the extreme light or occlusion/self-occlusion, so we expand the conditions to control the generation of 3D human meshes and poses. So in the second stage, limited hypotheses for 3D human mesh and pose are produced by varying the conditions of CGANs due to being tolerant to detection ambiguity, then clustered in a cascade pipeline. Here, the conditions include 2D joints and the relative depth of adjacent joints. One of the benefits of our system is that the background influence is removed since only human silhouette, 2D joints and relative depth of 2D joints are taken as input for CGANs, it is easily generalized to different scenarios. Our contributions include:


1) We propose a two-stage weakly-supervised approach to address the problem of multiple hypotheses for human mesh and pose by CGANs, which only takes human silhouette as input under the constraints of 2D joints and relative depth without any 3D supervision. 2) We generate multiple 3D human body meshes by importing limited and discrete hypotheses with varying conditions of CGANs and clustering, instead of sampling from a probability distribution. In our approach, the number of conditions is limited and discrete, since the hypotheses space is limited and discrete. 3) We evaluate throughout the experiments that our system is suitable for multiple hypotheses generation for 3D human mesh and pose not only from image datasets of the laboratory and real scenarios, but also from Internet images without networks retraining.

2 Related Work 3D Mesh Reconstruction. The reconstruction methods of 3D human body mesh from a single image have made great progress in recent years along with the development of deep learning technique. These methods can be roughly divided into two categories: one is the human body shape reconstruction method based on parametric models [2, 13, 25, 37], the other is the human body shape reconstruction methods based on nonparameters [9, 19, 41, 43, 51]. The former methods try to encode a human mesh into a parametric model. The advantage of this type of methods is that it is easy to reconstruct the complete shape of human body from a single image, even if occlusion happens in the image. The disadvantage is that parametric models suffer from the ability of limited detail representation. A human shape is only represented as a linear combination of 10 shape bases, so details of the human shape may be recovered insufficiently. The model-free methods describe the details better, but require to learn the large amount of variables. Model-Based Mesh Reconstruction. A number of human mesh models have been proposed in the field, such as SCAPE [2], Skinned Multi-Person Linear model (SMPL) [25], ADAM [13] and so on [31, 37, 52]. The most popular SMPL model [25] has 82dimensional parameters, including 10 shape base parameters and 24 × 3 joint rotation parameters, and it has been extended into SMPL-H [37], SMPLify [3] and SMPLify-X [31]. SMPLify, a multi-stage optimization method, was proposed in [3], which used DeepCut [36] to estimate 2D pose first, and then optimized reconstruct results. HMR [14] used an end-to-end CNN network to regress SMPL parameters from RGB images with discriminator constraints. TexturePose [32] took texture information to enhance CNN network ability. SPIN [18] combined a learning-based algorithm and optimization algorithm into an iterative framework to obtain SMPL parameters, and achieved state-of-the-art results among the methods of model-based human mesh recovering. SMPLify-X [31] extended the SMPL-X model, which had a human mesh with hands and face, and took 2D human pose to optimize with prior constraints and penalty of mesh collision. MTC [48] optimized ADAM [13] by 3D pose direction vector and 2D pose estimating by CNN-based network.


Model-Free Mesh Reconstruction. The model-free shape recovery methods consider a human body consisting of voxel volumes, and learn the body surface directly. BodyNet [43] presented an end-to-end CNN based network to estimate volumetric representation of human with 2D pose, depth and 3D pose. Kolotouros et al. proposed that taking CNN and graph neural network was significantly easy to regress 3D location of human mesh [19]. Gabeur et al. estimated the “visible” and “hidden” depth maps, and combined into a full-body 3D point cloud [9]. Zhu et al. [51] proposed a hierarchical method to capture the detailed human body. Tan et al. presented a self-supervised method to relax the dependance on ground truth data from videos [41]. 3D Pose Estimation. While the problem of 2D human pose estimation has been well solved, 3D human pose estimation is still challenging due to 3D reconstruction ambiguity caused by the variation of viewpoint, human body and clothing. Deep learning technique is a popular way to estimate 3D pose from a single image, and these approaches achieve good results. Pavlakos et al. [34] applied the stack hourglass network to estimate every voxel likelihood for each joint. Mehta et al. [28] argued that the algorithm generalizability constrained by available 3D pose datasets, and proposed a benchmark “MPI-INF-3DHP” (MPII-3D for short in this paper) which covered outdoor and indoor scenes. Zhou et al. explored the problem of 3D pose estimation in the wild, and proposed a weakly-supervised transfer learning framework due to the lack of training data [50]. Kacabas et al. proposed EpipolarPose to estimate 3D human poses and camera matrix with the constraints of epipolar geometry [17]. Jahangiri and Yuille addressed the problem to estimate multiple diverse 3D poses in [10], and started initial 3D pose to generate multiple hypotheses according to prior sampling, while Li and Lee [21] generated multiple hypotheses for 3D human pose based on a mixture density network. As transformer technique arises, it is regarded as a powerful backbone in vision field. Zheng et al. [49] and Li et al. [22] proposed to estimate human pose via transformer, and achieve big approvement. However, transformer-based methods always have a big model, which are resource-consuming and have high requirements on GPU, while our approach has less training variables in weakly-supervised way.

3 The Proposed Approach In this section, we introduce our proposed weakly-supervised approach in detail. We give the work flow firstly, then explain how to estimate human mesh and pose using CGANs, and how to generate the multiple hypotheses by CGANs by selecting and clustering. Finally we give the implementation details of each parts. 3.1 Overview We present a cascade framework to recover multiple hypotheses for 3D human shapes that are consistent with the 2D human silhouette and 2D joints heat map extracted from images. As shown in Fig. 2, there are two stages in our pipeline, consisting of three parts N1 , N2 , N3 : N1 : an encoder-decoder structure, takes the original image as input to regress 30 2D joints heat map FJ and human silhouette S, and learns features F (Note that 30 joints


Fig. 2. Schematic diagram of the proposed approach. There are two stages in our pipeline, consisting of three parts: the encoder-decoder N1 to regress 2D human silhouette and 2D joint heat map, the CNN N2 to estimate a relative depth vector, the CGANs N3 to generate human mesh. Firstly 2D human joints, human silhouette learnt by N1 , and the relative depth vector learnt by N2 are taken as the inputs to train the generator and discriminator of CGANs N3 . In the second stage, multiple hypotheses for human mesh are generated by rationally expanding the relative ˜ q } and 2D joints sets {J˜p }. Then abnormal 3D human meshes are removed depth vectors {V by selecting steps, and the close human meshes are clustered to output the final mesh and pose hypotheses.

including 24 points of SMPL model, 5 points for one nose, two eyes, two ears, and 1 point on head). N2 : a CNN, takes 2D human silhouette S, joint heat map FJ and features F learnt by N1 as input to estimate a relative depth matrix M. V is a binary vector form of M, which only has the relative depth between adjacent joints. N3 : the CGANs, consisting of a conditional generator and a discriminator. The generator takes feature maps of joints, heat map FJ , human silhouette S and the relative depth vector V to regress the parameters β and θ of SMPL model to generate human mesh. The discriminator distinguishes meshes from β and θ of SMPL model as real or fake. In the first stage, 2D human joint heat map and human silhouette learnt by N1 and the relative depth vector learnt by N2 are taken as inputs to train the generator and discriminator of CGANs N3 . In the second stage, multiple hypotheses for human meshes are generated reasonably by expanding the relative depth vector and the 2D skeleton set. And the optimization is following to make human meshes better. Then abnormal 3D human meshes are rejected by selecting step after discarding human meshes with


crossing through parts, and the close human meshes are clustered to output the final mesh hypotheses after detail refining. The whole pipeline is shown in Fig. 2.
Since 2D joints and silhouettes for humans are well studied, and a number of 2D skeleton and human segmentation datasets have been published by different institutions, it is not difficult to learn N1 and N2. The implementation details of N1 and N2 will be described in Sect. 3.5. Once N1 is trained to learn the feature map of 2D joints FJ, the silhouette image S, and a 2048-D feature vector F, the location set of joints, denoted by J, is derived by integrating over the corresponding joint feature map FJ [39]. FJ, S and F are taken as input to train N2 to estimate the matrix M of relative depths. Each element M_ij ∈ [0, 1] can be thought of as the probability that the i-th joint is farther than the j-th joint in the camera coordinate frame [33]. Let \bar{M}_ij be a sparse binary matrix, in which 0/1 represents that the i-th joint is farther/closer than the j-th joint in the camera coordinate frame. In the following subsections, the key parts are introduced, including human mesh estimation via CGANs, multiple hypotheses generation, and human mesh selecting and clustering.

3.2 Human Mesh Estimation Using CGANs

In our CGANs, the human mesh is encoded by the SMPL parametric model, and the model parameters – β (shape) and θ (pose) – are learnt to represent the human mesh from the human silhouette S constrained by J and M. The discriminator distinguishes human meshes from β and θ as real or fake. The modified objective function for CGANs is minimized as follows,

\min_G \max_D L(D, G) = E[\log D(\beta, \theta)] + E[\log(1 - D(G(S | J, M)))]    (1)

In the formula, the generator G is conditioned on the 2D joints J and the relative depth M, while D is a classifier learnt by taking samples of (β, θ) generated by G as negative samples, and samples from the Mosh dataset as positive ones. The reason that we do not impose the condition constraint on the discriminator D is that we found it sufficient to take (β, θ) for classifying a sample as real or fake. The generator is built on ResNet-18, and the discriminator is just a multi-layer perceptron net. Please refer to the details in Sect. 3.5.
Then the generator G and the discriminator D are trained under the GANs framework in an alternating manner. As the discriminator becomes stronger, the generator gets stronger. The generator is trained under the control of J and M with weak supervision from the 2D joints J and the human silhouette S, and it enables us to generate multiple hypotheses by varying the conditions. Further, the loss for the conditional generator is minimized over β, θ as follows:

L_G(\beta, \theta) = L_{2D} + L_{dep} + L_{reg}    (2)

The LG consists of three terms, 2D constraint loss L2D , the relative depth constraint loss Ldep , and the regularization term Lreg . They have to be balanced in the training.
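For concreteness, one alternating update of the CGAN objective in Eq. (1), combined with the weakly supervised generator loss of Eq. (2), could be organized as in the hedged PyTorch sketch below. The network interfaces, the Mosh parameter sampler and the assumption that the discriminator outputs a probability are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cgan_step(generator, discriminator, opt_g, opt_d,
              silhouette, joints_2d, rel_depth, real_params, lg_terms):
    """One alternating update for the conditional generator/discriminator of Eq. (1).

    real_params: (beta, theta) tensors sampled from the Mosh dataset (real examples).
    lg_terms:    callable returning L_2D + L_dep + L_reg for generated parameters (Eq. (2)).
    The discriminator is assumed to output a probability of "real".
    """
    # --- discriminator update: real (beta, theta) vs. generated ones ---
    fake_beta, fake_theta = generator(silhouette, joints_2d, rel_depth)
    d_real = discriminator(*real_params)
    d_fake = discriminator(fake_beta.detach(), fake_theta.detach())
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator update: fool the discriminator + weak 2D/depth supervision ---
    d_fake = discriminator(fake_beta, fake_theta)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) + \
             lg_terms(fake_beta, fake_theta)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```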


Each term is defined as follows,

L_{2D} = ||J(M(\beta, \theta)) - J_{2D}||^2 + ||S(M(\beta, \theta)) - S||^2

L_{dep} = \sum_{l=1}^{|J_{2D}|} \sum_{J_k \in N(J_l)} \max((J_l - J_k) \cdot (2 \cdot \bar{M}_{l,k} - 1), 0)

L_{reg} = \gamma_{\beta} ||\beta||^2 + \gamma_{\theta} ||\theta||^2    (3)
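The relative-depth term L_dep above is essentially a hinge penalty on the depth ordering of adjacent joints. A small PyTorch sketch is given below; it assumes the camera-frame depth is the z-coordinate of each joint and that the binary ordering matrix has been flattened into a dictionary over adjacent pairs, both of which are illustrative assumptions.

```python
import torch

def depth_ordering_loss(joints_3d, adj_pairs, rel_depth_bin):
    """Hinge penalty encouraging generated joints to respect the predicted depth ordering.

    joints_3d:     (N, 3) joints of the generated mesh in camera coordinates.
    adj_pairs:     list of (l, k) index pairs of adjacent joints.
    rel_depth_bin: dict mapping (l, k) -> 1 if joint l is predicted closer than joint k, else 0.
    """
    loss = joints_3d.new_zeros(())
    for l, k in adj_pairs:
        sign = 2.0 * rel_depth_bin[(l, k)] - 1.0          # maps {0, 1} -> {-1, +1}
        # penalize when the depth difference disagrees with the predicted ordering
        loss = loss + torch.clamp((joints_3d[l, 2] - joints_3d[k, 2]) * sign, min=0.0)
    return loss
```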

In the above, M is the operation of recovering a mesh from (β, θ), J is the operation that projects the 3D joints of the mesh into 2D, and S is the operation that renders a silhouette binary image from the generated mesh [16]. N denotes the neighbourhood set of a joint, since only the adjacent joints are considered in our approach. \bar{M}_{l,k} in L_{dep}, estimated in the first stage and taking the value 1 or 0, constrains whether the joint J_l is closer or farther than the joint J_k. γ_β and γ_θ are the balance factors in L_{reg}. Once the CGANs are trained in the first stage, the generator is exploited to produce multiple possible human meshes and poses in the second stage.

3.3 Condition Expansion for Multiple Hypotheses

Now we explain how to generate multiple hypotheses for human mesh and pose from a single image. We only care about the relative depth between adjacent joints, and transform the matrix \bar{M} into a vector form V. Each element in V represents the relative depth relationship between two adjacent joints, and a limited space is generated with the relative depth vectors V_q (q = 1, 2, ..., Q) and 2D joint sets J_p (p = 1, 2, ..., P, P < [...]

[...] > 0 do
3:    k ← j ← i
4:    s ← cos(V, f_i) = (V · f_i) / (||V|| ||f_i||)
5:    τ = mean(s[s > 0])
6:    s[s < τ] ← 0
7:    s[s > τ] ← 1
8:    while s[k] = 0 do
9:        k ← k − 1
10:   end while
11:   while s[j] = 0 do
12:       j ← j − 1
13:   end while
14:   S[k...j][c_th] ← s[k...j]
15: end for
16: return S

3.3 Pseudo Labels Generation

During the early stage of training, we use the action-click frame to calculate similarity scores with all frames of the video and generate similarity instances. The detailed process is summarized in Algorithm 1. During training, AIM obtains pseudo-action instances through a two-stage threshold strategy. First, categories with small video-level confidence scores are filtered out using a threshold. Then, short instances that cannot constitute an action are filtered out using another threshold. Finally, we obtain the pseudo-action instances I = {y_t}_{t=−T}^{0} and calculate their IoU with the video similarity instances, θ = IoU(I, S). When θ is greater than the set IoU threshold, the pseudo-action instances are converted into pseudo-frame labels for OAD learning; otherwise, the video similarity instances are converted.
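For illustration, a small NumPy sketch of the IoU test used to pick between the two candidate pseudo labels; the function names and the threshold value are our own assumptions, not the released implementation.

import numpy as np

def temporal_iou(a, b):
    """a, b: (start, end) frame indices of two temporal instances."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def select_pseudo_label(action_inst, similarity_inst, iou_thresh=0.5):
    # Keep the AIM pseudo-action instance if it overlaps the similarity instance
    # enough, otherwise fall back to the similarity instance itself.
    theta = temporal_iou(action_inst, similarity_inst)
    return action_inst if theta > iou_thresh else similarity_inst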

3.4 Online Action Detector

The Online Action Detector (OAD) takes a continuous feature sequence F = [f_{−T}, ..., f_0] as input, where T is the sequence length. It outputs the corresponding action category score y_t and the probability that the frame belongs to an action start. In this work, OAD uses a GRU as the prediction cell. The GRU updates its hidden state h_t at each time step as

h_t = GRU(h_{t−1}, f_t).    (7)


Next, a fully connected layer classifies h_t at the current time t to obtain a_t and s_t, where a_t and s_t represent the action category score and the probability of belonging to an action start, respectively. At the end of each training epoch, OAD obtains the pseudo-action-frame labels y^p_{ja} and pseudo-action-start labels y^p_{js} for the training video from the action instances generated by AIM, where j = {1, 2, ..., T} indicates the index of a frame in the training video, T is the total number of frames, and y ∈ {0, 1} indicates action non-start or start. Following previous work [8], we calculate the cross-entropy loss between y^p_{ja} and the action category score a_t as the frame loss L_frame:

L_frame = − (1/T) Σ_{j=1}^{T} Σ_{c=0}^{C} y^p_{ja} log a_{jc}.    (8)

At the same time, we utilize the focal loss [17] between y^p_{js} and the predicted action start probability s_t as the start loss L_start:

L_start = − (1/T) Σ_{j=1}^{T} Σ_{m=0}^{1} y^p_{js} (1 − s_{jm})^γ log s_{jm},    (9)

where γ is a hyper-parameter. Finally, we combine L_frame and L_start as L_OAD:

L_OAD = L_frame + L_start.    (10)

3.5 Training and Inference

Training. In the early stage of training, we utilize L_AIM to optimize AIM to generate pseudo-action instances. After the action instances are first generated, we jointly train AIM and OAD through L_total:

L_total = L_OAD + λ L_AIM    (11)

As shown in Fig. 2, pseudo-action instances are continuously generated by AIM. To reduce computation, we update the pseudo-action instances after every training epoch and take Iter iterations as an epoch. Although we do not use the co-activity similarity loss of WOAD, which requires that videos of the same action category appear in each batch, we still split the dataset into batches in the same way for a fair comparison. Inference. During the inference phase, only the OAD is required for online action detection. At each time step t, OAD outputs a_t and s_t, where a_t can be used directly as the action frame prediction score. Following previous works [7,8], we obtain the action start score s_t, where s_{t(1:c)} = a_{t(1:c)} ∗ s_{t1} and s_{t0} = a_{t0} ∗ s_{t0} indicate the c-th action start score and the background score, respectively.


Table 1. The respective performances of our method and several existing methods on THUMOS14 under different label formats are compared.

Methods | Feature | Supervision | pAP@1.0s | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | mAP
W-TALC [20] ECCV18 | I3D | Video-level | 16.2 | 26.0 | 31.3 | 34.6 | 36.2 | 37.6 | 38.6 | 39.3 | 39.9 | 40.3 | 48.0
WOAD [8] CVPR21 | I3D | Video-level | 21.9 | 32.9 | 40.5 | 44.4 | 48.1 | 49.8 | 50.8 | 51.7 | 52.4 | 53.1 | 54.4
SCOAD | I3D | Video-level + Click-level | 24.4 | 39.2 | 44.8 | 49.0 | 50.7 | 51.6 | 52.4 | 53.0 | 53.6 | 54.0 | 61.9
StartNet [7] ICCV19 | I3D | Frame-level | 21.9 | 33.5 | 39.6 | 42.5 | 46.2 | 46.6 | 47.7 | 48.3 | 48.6 | 49.0 | –
TRN [25] ICCV19 | I3D | Frame-level | – | – | – | – | – | – | – | – | – | – | 51.0
WOAD [8] CVPR21 | I3D | Frame-level | 28.0 | 40.6 | 45.7 | 48.0 | 50.1 | 51.0 | 51.9 | 52.4 | 53.0 | 53.1 | 67.1
SCOAD | I3D | Frame-level | 30.6 | 42.3 | 48.2 | 51.9 | 54.5 | 55.4 | 56.0 | 56.5 | 56.9 | 57.0 | 69.9

Table 2. mAP is compared on THUMOS14 with strongly supervised and weakly supervised methods. (+x% Frame) means that x% of the videos have frame-level (strong) annotations, while the others keep their original annotations.

Methods | +0% | +10% | +30% | +50% | +100%
TRN [25] ICCV19 | – | – | – | – | 51.0
WOAD [8] CVPR21 | 54.4 | 55.0 | 59.3 | 62.6 | 67.1
SCOAD | 61.9 | 63.7 | 65.2 | 66.8 | 69.9

4 Experiments

Datasets. We conduct experiments on two widely used benchmarks, THUMOS14 [11] and ActivityNet1.2 [2]. THUMOS14 includes more than 254 h of videos from 20 sports categories collected from YouTube. Following previous works [5,6,8,24,31], we train the model on the validation set (200 videos) and evaluate it on the test set (212 videos). ActivityNet1.2 contains 9682 videos of complex human activities in 100 categories. We train on the training set (4819 videos) and evaluate on the validation set (2383 videos). The two datasets pose different challenges: the difficulty of THUMOS14 mainly stems from the dramatic variation in the duration of action instances, while ActivityNet1.2 features numerous action categories, large intra-class variation, etc. Evaluation Metrics. Following previous works [5,6,8,9,24,25], we report per-frame mean average precision (mAP) and point-based average precision (pAP) to measure the performance of action category and action start prediction. For mAP, the classification scores of all frames are ranked, precision and recall are computed, and the interpolated average precision (AP) per category is averaged to obtain the mAP. Similar to the bounding-box-based AP in object detection, pAP measures the accuracy of the predicted action starts by their temporal offset from the ground truth. We follow WOAD [8] and report pAP at ten thresholds from 1.0 to 10.0 seconds.
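As a concrete reference, per-frame mAP can be computed along the following lines. This is a sketch under the assumption of one-vs-all binary labels per class; the official evaluation scripts of the benchmarks may differ in details such as interpolation.

import numpy as np
from sklearn.metrics import average_precision_score

def per_frame_map(scores, labels):
    """scores: (N, C) per-frame class scores, labels: (N, C) binary ground truth."""
    aps = []
    for c in range(scores.shape[1]):
        if labels[:, c].sum() > 0:           # skip classes absent from the test set
            aps.append(average_precision_score(labels[:, c], scores[:, c]))
    return float(np.mean(aps))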


Table 3. The respective performances of our method and several existing methods on ActivityNet1.2 under different label formats are compared.

Methods | Feature | Supervision | pAP@1.0s | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | mAP
W-TALC [20] ECCV18 | I3D | Video-level | 5.2 | 8.5 | 10.7 | 12.8 | 14.5 | 15.9 | 17.1 | 18.1 | 19.1 | 20.1 | 53.8
WOAD [8] CVPR21 | I3D | Video-level | 7.9 | 11.6 | 14.3 | 16.4 | 18.8 | 20.3 | 22.2 | 23.4 | 24.7 | 25.3 | 66.7
SCOAD | I3D | Video-level + Click-level | 9.2 | 12.9 | 15.9 | 18.8 | 21.1 | 22.3 | 23.7 | 24.7 | 25.6 | 26.2 | 68.7
StartNet [7] ICCV19 | I3D | Frame-level | 7.5 | 11.5 | 14.1 | 16.5 | 18.4 | 19.7 | 20.9 | 21.8 | 22.9 | 23.6 | –
TRN [25] ICCV19 | I3D | Frame-level | – | – | – | – | – | – | – | – | – | – | 69.1
WOAD [8] CVPR21 | I3D | Frame-level | 8.7 | 13.6 | 17.0 | 19.7 | 21.6 | 23.0 | 24.7 | 25.8 | 26.8 | 27.7 | 70.7
SCOAD | I3D | Frame-level | 12.4 | 17.4 | 21.2 | 24.3 | 26.9 | 29.0 | 30.8 | 32.3 | 33.6 | 34.5 | 72.2

Table 4. mAP is compared on ActivityNet1.2 with strongly supervised and weakly supervised methods. (+x% Frame) means that x% of the videos have frame-level (strong) annotations, while the others keep their original annotations.

Methods | +0% | +30% | +50% | +70% | +100%
TRN [25] ICCV19 | – | – | – | – | 69.1
WOAD [8] CVPR21 | 66.7 | 66.9 | 68.5 | 69.3 | 70.7
SCOAD | 68.7 | 70.2 | 71.1 | 71.6 | 72.2

Baseline. Since the framework of our method is derived from the recent work WOAD [8], we use it as the baseline for our click-supervision experiments. In our method, the LSTM cells in WOAD are replaced by GRU cells and only the video-level Multiple Instance Learning (MIL) [20] loss is retained, and experiments are conducted to verify the effectiveness of our method. Implementation Details. We conduct experiments on THUMOS14 and ActivityNet1.2 using two-stream (RGB and optical flow) features extracted from the I3D network [3] pre-trained on the Kinetics-400 [3] dataset. Video frames are extracted at a frame rate of 25 fps with a chunk size of 16. Our method is implemented in PyTorch and optimized with the Adam algorithm. We benchmark our model on an NVIDIA RTX 3090 GPU. We set the batch size to 10, the learning rate to 3e−4 and the weight decay to 1e−4. Pseudo-instance generation is updated with Iter = 100 iterations per epoch on THUMOS14 and Iter = 500 per epoch on ActivityNet1.2. For OAD, we follow WOAD [8] and set the hyper-parameter γ = 2 in Eq. 9. The dimension of the hidden layer h_t is set to 4096, and the training sequence length for the GRU is 64. Since only few action starts exist in a video, we use all positive frames and randomly sample three times the number of negative frames to calculate the start loss in each training step. In addition, the click labels used in the results reported on THUMOS14 in this paper


Table 5. Comparison of the parameters and inference time of our method with strongly and weakly supervised methods. The reported times do not include the processing time of feature extraction.

Methods | Supervision | Param | Infer time
TRN [25] ICCV19 | Frame | 314M | 2.60 ms
StartNet [7] ICCV19 | Frame | 118M | 0.56 ms
WOAD [8] CVPR21 | Video | 110M | 0.40 ms
SCOAD | Video + Click | 80M | 0.32 ms

Table 6. Ablation study on the efficacy of each component of the AIM on the THUMOS14 dataset.

Baseline | Laction | Lback | IoU Filter | mAP | pAP@1.0
✓ | | | | 49.6 | 17.9
✓ | | ✓ | | 53.7 | 20.7
✓ | ✓ | | | 56.2 | 22.8
✓ | ✓ | ✓ | | 59.3 | 23.5
✓ | ✓ | ✓ | ✓ | 61.9 | 24.4

come from the human click annotations provided by [18,32]. For ActivityNet1.2, we randomly generate the click labels from the ground truth.

4.1 Comparison Experiments

Quantitative Comparisons. We compare with recent state-of-the-art methods for weakly supervised online action detection and consistently obtain significant performance on THUMOS14 and ActivityNet1.2. As shown in Table 1, our model achieves state-of-the-art performance and improves mAP on the THUMOS14 dataset from 54.4% to 61.9% compared to WOAD. It can be seen that the more accurate click annotations bring a very intuitive performance improvement, as they give an accurate cue of temporal information for action detection. In Table 3, compared with WOAD, the mAP of our model is improved by 2.0% with click labels on ActivityNet1.2, which shows the effectiveness of the proposed method. In addition, the proposed method can also be used for action detection with frame labels: as shown in Table 3, our method outperforms WOAD with a 1.5% improvement on ActivityNet1.2. For action start prediction, our model comprehensively surpasses WOAD at every threshold in Tables 1 and 3. Especially at a threshold of 1.0 s, our pAP improves by 2.5% and 1.3% compared to WOAD on THUMOS14 and ActivityNet1.2, respectively. This shows that the classification accuracy of our method is significant. Meanwhile, our method outperforms strongly supervised methods under both click-level and frame-level supervision on THUMOS14 and ActivityNet1.2. This phenomenon shows that choosing a suitable classifier is critical.


Table 7. Ablation study on the efficacy of each component of the OAD on the THUMOS14 dataset.

Methods | Supervision | mAP | pAP@1.0
SCOAD LSTM | Video-Level + Click-Level | 61.6 | 23.9
SCOAD Temp.pool | Video-Level + Click-Level | 61.3 | 20.7
SCOAD | Video-Level + Click-Level | 61.9 | 24.4
SCOAD LSTM | Frame-Level | 69.6 | 30.4
SCOAD Temp.pool | Frame-Level | 69.4 | 28.6
SCOAD | Frame-Level | 69.9 | 30.6

Table 8. Randomly generated click labels on the ground truth of THUMOS14 using different random seeds.

Method | Supervision | Seed | mAP | pAP@1.0
SCOAD | Video-Level + Click-Level | 1 | 63.1 | 23.9
SCOAD | Video-Level + Click-Level | 10 | 63.1 | 22.6
SCOAD | Video-Level + Click-Level | 100 | 60.6 | 21.3
SCOAD | Video-Level + Click-Level | 1000 | 61.5 | 22.7

Effectiveness and Efficiency. To further illustrate the flexibility and effectiveness of our method, we also evaluate the performance using mixed annotations, as shown in Tables 2 and 4. The results show that our method can improve performance as the annotation accuracy increases. Compared to WOAD, our method without frame-level labels approaches its performance when using 50% frame-level labels on THUMOS14 and ActivityNet1.2. At the same time, we compare the number of parameters and the computation of the model with the baselines in Table 5; both our parameter count and inference time are lower than those of the baseline and the strongly supervised methods. For a fair comparison, we use the same NVIDIA Tesla V100 GPU as WOAD to calculate the average inference time over the entire THUMOS14. Benefiting from the GRU cell with fewer parameters and faster convergence, our model is the fastest, 0.08 ms faster than WOAD, with only 80M parameters.

4.2 Ablation Experiments

Although we use a structure similar to WOAD, our method outperforms WOAD. A specific reason is that our label information is more robust than that of WOAD. We explore deeper reasons below through ablation experiments. AIM of Each Component. In Table 6, the influence of each loss in AIM on action instance generation is studied. When using the background constraint on the top-k scores, both mAP and pAP grow, by 4.1% and 2.8% respectively. When adding the action frame constraint, mAP and pAP are improved by 6.6% and 4.9%, respectively.


Therefore, it can be inferred that although the top-k strategy can constrain the action classes, it struggles to give real action positions effectively. However, although there are only a few click labels, they constrain the source of the top-k regions. With the further restriction of the background-click labels, the action boundaries become clearer. After that, action instances with higher confidence are further selected under the filtering of the video similarity proposals. We visualize this process in Fig. 3 and describe it in detail in Sect. 4.4. OAD of Each Component. Compared with WOAD, we remove the max-pooling operation in the temporal dimension, directly use h_t for action start prediction in Eq. 7, and use a GRU cell instead of an LSTM cell. We conducted detailed experiments on these two improvements on THUMOS14, and the results are shown in Table 7. We evaluate the results with different labels separately. When using click-level labels and an LSTM network for prediction, mAP and pAP decrease by 0.3% and 0.5%, respectively. For action start prediction, temporal pooling increases the convergence difficulty of the network, shown as a joint reduction of 0.6% and 3.7% in mAP and pAP. As expected, the above phenomenon also occurs when frame-level labels are used, thus validating our inferences.

4.3 Random Influence of Clicking Labels

Since there is huge variability in the clicks of individual annotators, elements of randomness are inevitable in our method. We therefore verify that the performance improvement from click supervision is robust. Similar to previous work [18,32], we randomly generate several sets of click labels with different random seeds on the ground truth of THUMOS14. The experimental results are shown in Table 8. Each action region contains at least one action-click label, and each video contains at least one background-click label. It can be seen that the click labels generated with different random seeds cause fluctuations of about 2.5% in mAP and 2.6% in pAP. This is still enough to show that click labels are influential for prediction. However, how to eliminate this random factor so that the network converges to the same solution as much as possible is still worth studying.

4.4 Qualitative Results

Figure 3 provides a qualitative analysis of our action instance generation process. As shown in Fig. 3(a), although the top-k strategy can mine action regions, it suffers from blurred temporal boundaries. The constraint of the background-click loss responds strongly to the background regions, but it also makes the response in the action regions less confident; the action-click loss alone does not seem to fully address this problem, as shown in Fig. 3(b). When the background loss and action loss are jointly employed, the predicted scores are clearly expressed in Fig. 3(c): they show more confident responses at action regions and background boundaries. Finally, after filtering through the IoU threshold, the result is converted into a pseudo-label in Fig. 3(d).


Fig. 3. Visualization of the prediction scores of the AIM module under different loss constraints.

5 Conclusion

This paper proposes a method for online action detection using click labels. Extensive experiments demonstrate that using click labels does not significantly increase the cost of manual annotation, but effectively improves the prediction accuracy. The proposed method consists of pseudo label generation and action prediction. AIM finds potential action regions through a top-k strategy for video labels and foreground labels, and then uses background labels to reduce the response of background regions. Thanks to these operations, the proposed method can effectively avoid blurry boundaries. To refine the estimated results, we generate the final pseudo-labels under the filtering of similarity instances. Remarkably, exploiting the offline generated video similarity instances for action clicks brings enormous performance gains to our model. However, this video similarity proposal relies on prior knowledge, which also leaves a promising direction for future research.

References

1. Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What's the point: semantic segmentation with point supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 549–565. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_34


2. Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR, pp. 961–970 (2015) 3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR, pp. 6299–6308 (2017) 4. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP, pp. 1724–1734, October 2014 5. Eun, H., Moon, J., Park, J., Jung, C., Kim, C.: Learning to discriminate information for online action detection. In: CVPR, pp. 809–818 (2020) 6. Gao, J., Yang, Z., Nevatia, R.: RED: reinforced encoder-decoder networks for action anticipation. In: BMVC (2017) 7. Gao, M., Xu, M., Davis, L.S., Socher, R., Xiong, C.: StartNet: online detection of action start in untrimmed videos. In: ICCV, pp. 5542–5551 (2019) 8. Gao, M., Zhou, Y., Xu, R., Socher, R., Xiong, C.: WOAD: weakly supervised online action detection in untrimmed videos. In: CVPR, pp. 1915–1923 (2021) 9. De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., Tuytelaars, T.: Online action detection. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 269–284. Springer, Cham (2016). https://doi.org/10.1007/ 978-3-319-46454-1 17 10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. In: Neural Computation, pp. 1735–1780, November 1997 11. Jiang, Y.G., et al.: THUMOS challenge: action recognition with a large number of classes (2014). http://crcv.ucf.edu/THUMOS14/ 12. Kim, H.-U., Koh, Y.J., Kim, C.-S.: Global and local enhancement networks for paired and unpaired image enhancement. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 339–354. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2 21 13. Kim, J., Misu, T., Chen, Y.T., Tawari, A., Canny, J.: Grounding human-to-vehicle advice for self-driving vehicles. In: CVPR, pp. 10591–10599 (2019) 14. Ko, K.E., Sim, K.B.: Deep convolutional framework for abnormal behavior detection in a smart surveillance system. Eng. Appl. Artif. Intell. 67, 226–234 (2018) 15. Lee, P., Byun, H.: Learning action completeness from points for weakly-supervised temporal action localization. In: ICCV, pp. 13648–13657 (2021) 16. Li, J., Han, K., Wang, P., Liu, Y., Yuan, X.: Anisotropic convolutional networks for 3D semantic scene completion. In: CVPR, pp. 3351–3359 (2020) 17. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: ICCV, pp. 2980–2988 (2017) 18. Ma, F., et al.: SF-Net: single-frame supervision for temporal action localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 420–437. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058548-8 25 19. Moran, S., Marza, P., McDonagh, S., Parisot, S., Slabaugh, G.: DeepLPF: deep local parametric filters for image enhancement. In: CVPR, pp. 12826–12835 (2020) 20. Paul, S., Roy, S., Roy-Chowdhury, A.K.: W-TALC: weakly-supervised temporal activity localization and classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 588–607. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0 35 21. Shu, T., Xie, D., Rothrock, B., Todorovic, S., Chun Zhu, S.: Joint inference of groups, events and human roles in aerial videos. In: CVPR, pp. 4576–4584 (2015) 22. 
Wang, L., Xiong, Y., Lin, D., Van Gool, L.: UntrimmedNets for weakly supervised action recognition and detection. In: CVPR, pp. 4325–4334 (2017)


23. Wang, P., Liu, L., Shen, C., Shen, H.T.: Order-aware convolutional pooling for video based action recognition. Pattern Recogn. 91, 357–365 (2019) 24. Wang, X., et al.: OadTR: online action detection with transformers. In: ICCV, pp. 7565–7575 (2021) 25. Xu, M., Gao, M., Chen, Y.T., Davis, L.S., Crandall, D.J.: Temporal recurrent networks for online action detection. In: ICCV, pp. 5532–5541 (2019) 26. Yan, Q., Gong, D., Liu, Y., van den Hengel, A., Shi, J.Q.: Learning Bayesian sparse networks with full experience replay for continual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 109–118 (2022) 27. Yan, Q., et al.: High dynamic range imaging via gradient-aware context aggregation network. Pattern Recogn. 122, 108342 (2022) 28. Yan, Q., et al.: Attention-guided network for ghost-free high dynamic range imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1751–1760 (2019) 29. Yan, Q., Gong, D., Zhang, Y.: Two-stream convolutional networks for blind image quality assessment. IEEE Trans. Image Process. 28(5), 2200–2211 (2018) 30. Yan, Q., et al.: Deep HDR imaging via a non-local network. IEEE Trans. Image Process. 29, 4308–4322 (2020) 31. Yang, L., Han, J., Zhang, D.: Colar: effective and efficient online action detection by consulting exemplars. In: CVPR (2022) 32. Yang, L., et al.: Background-click supervision for temporal action localization. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 9814–9829 (2021) 33. Yu, L., Yang, Y., Huang, Z., Wang, P., Song, J., Shen, H.T.: Web video event recognition by semantic analysis from ubiquitous documents. IEEE Trans. Image Process. 25(12), 5689–5701 (2016) 34. Yuan, Y., Lyu, Y., Shen, X., Tsang, I., Yeung, D.Y.: Marginalized average attentional network for weakly-supervised learning. In: ICLR (2019) 35. Zhang, C., Cao, M., Yang, D., Chen, J., Zou, Y.: Cola: weakly-supervised temporal action localization with snippet contrastive learning. In: CVPR, pp. 16010–16019 (2021)

Neural Puppeteer: Keypoint-Based Neural Rendering of Dynamic Shapes

Simon Giebenhain, Urs Waldmann, Ole Johannsen, and Bastian Goldluecke
University of Konstanz, Konstanz, Germany
[email protected]

Abstract. We introduce Neural Puppeteer, an efficient neural rendering pipeline for articulated shapes. By inverse rendering, we can predict 3D keypoints from multi-view 2D silhouettes alone, without requiring texture information. Furthermore, we can easily predict 3D keypoints of the same class of shapes with one and the same trained model and generalize more easily from training with synthetic data which we demonstrate by successfully applying zero-shot synthetic to real-world experiments. We demonstrate the flexibility of our method by fitting models to synthetic videos of different animals and a human, and achieve quantitative results which outperform our baselines. Our method uses 3D keypoints in conjunction with individual local feature vectors and a global latent code to allow for an efficient representation of time-varying and articulated shapes such as humans and animals. In contrast to previous work, we do not perform reconstruction in the 3D domain, but project the 3D features into 2D cameras and perform reconstruction of 2D RGB-D images from these projected features, which is significantly faster than volumetric rendering. Our synthetic dataset will be publicly available, to further develop the evolving field of animal pose and shape reconstruction.

1 Introduction

Neural scene representations have become an emerging trend in computer vision during the last couple of years. They allow scenes to be represented through neural networks which operate on 3D space, enabling tasks like novel view synthesis of static content, generalization over object and scene classes, body, hand and face modelling, and relighting and material editing. For a detailed overview please refer to [53,56]. While most such methods rely on time- and memory-intensive volumetric rendering, [50] proposes a method of rendering with a single network evaluation per ray. In this paper we propose a single-evaluation rendering

S. Giebenhain and U. Waldmann – Authors contributed equally. We acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2117 – 422037984, and the SFB Transregio 161 "Quantitative Methods for Visual Computing", Project B5. Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_15.


Fig. 1. Zero-Shot Synthetic to Real Experiment: Giraffe. Reconstruction of a giraffe in different poses. From left to right: (a) input view, (b) given segmentation mask used as input for keypoint estimation, (c) reconstruction with estimated keypoints and shape, (d) rendering from a different perspective (cf. supplementary for more real-world examples).

formulation based on keypoints in order to represent dynamically deforming objects, like humans and animals. This allows us to render not only novel views of a known shape but also new and unseen poses by adjusting the positions of 3D keypoints. We apply this flexible approach to prevalent tasks in computer vision: the representation and reconstruction of pose for humans and animals. More specifically, we propose a silhouette-based 3D keypoint detector that is based on inverse neural rendering. We decide to factor out appearance variation for keypoint detection, in order to ease the bridging of domain gaps and to handle animals with little data available. Despite only relying on silhouettes, our proposed approach shows promising results when compared to the state-of-the-art 3D keypoint detector [17]. Furthermore, our approach is capable of zero-shot synthetic-to-real generalization, see Fig. 1. Contributions. Our contribution is a flexible keypoint-based neural scene representation and neural rendering framework called Neural Puppeteer (NePu). We demonstrate that NePu provides valuable gradients as a differentiable forward map in an inverse rendering approach for 3D keypoint detection. Since we formulate the inverse rendering exclusively on 2D silhouettes, the resulting 3D keypoint detector is inherently robust with respect to transformations or domain shifts. Note that common 3D keypoint estimators require a huge amount of training samples with different textures in order to predict keypoints of the same class of shapes. Another advantage of being independent of texture is that it is easier to generalize from training with synthetic data. This can be particularly useful in cases where it is highly challenging to obtain a sufficient amount of real-world annotations, such as for wild animals (cf. Fig. 1). For animal shapes, we outperform a state-of-the-art 3D multi-view keypoint estimator in terms of Mean Per Joint Position Error (MPJPE) [17]. Unlike common practice, we shift rendering from the 3D to the 2D domain, requiring only a single neural network evaluation per ray. In this sense, our approach can be interpreted as a locally conditioned version of Light Field Networks (LFNs) [50]. Our formulation is capable of learning the inter-pose variations of a single instance (constant shape and appearance) under constant lighting conditions, similar to [52]. In contrast, LFNs learn a prior over inter-object variations. We retain the rendering efficiency of LFNs and are capable of rendering color,


depth and occupancy simultaneously at 20 ms per 256×256 image. This is significantly faster than NeRF-like approaches such as [52], which typically achieve less than 1 fps. Due to our fast renderer, fitting a target pose by inverse rendering can be done at ∼1 fps using 8 cameras. Furthermore, we show that our keypoint-based local conditioning significantly improves the neural rendering of articulated objects, with visibly more details and quantitative improvements in PSNR and MAE over LFNs. Code and data sets [12] to reproduce the results in the paper are publicly available at https://urs-waldmann.github.io/NePu/. We hope to inspire further work on animal pose and shape reconstruction, where our synthetic dataset can serve as a controlled environment for evaluation purposes and to experiment with novel ideas.

2 Related Work

2.1 3D Keypoint Estimation

Human keypoint estimation is a vast field with many applications [2,6,55,57]. For further reading, we refer the reader to [9,19,25,54]. The current state-of-the-art methods for 3D human keypoint estimation when trained on a single data set, i.e. the famous Human3.6M data set [16], are [14,17,45] with an average MPJPE of 18.7 mm, 20.8 mm and 26.9 mm, respectively. At the time of writing, there was no code available for [45]. That is why we choose LToHP [17] as a baseline to quantitatively compare our model to. With the huge success of human keypoint estimation, the prediction of 3D keypoints for animals has become a sub-branch of its own [3,13,20,21,34]. We notice that all these 3D frameworks for animals exploit 2D keypoints with standard 3D reconstruction techniques. That is why we also choose LToHP [17] as our baseline for animals, since it uses a learnable approach for the 3D reconstruction part. Please keep in mind that our pose estimation only relies on multi-view silhouettes, while the above methods require RGB images. Using silhouettes gives more robustness to changes in texture and lighting. The only other work we are aware of that extracts keypoints from silhouettes is [5]. While [5] extracts 2D keypoints from silhouettes for quadrupeds using a template mesh and a stacked hourglass network [35], we are able to predict 3D coordinates for arbitrary shapes.

2.2 Morphable Models

The seminal morphable models for animals and humans are SMAL [65] and SMPL [26] respectively. An extension of SMPL, called SMPL-X [41], includes hands [46] and face [24]. These models have been used to estimate the 3D pose and shape from a single image [4,47,62], from multiple unconstrained images in the wild [49,64] or in an unsupervised manner from a sparse set of landmarks [29]. Because creating these models is tedious, the authors of [63] present an unsupervised disentanglement of pose and 3D mesh shape. Neural Body [43]


uses implicit neural representations where different sparse multi-view frames of a video share the same set of latent codes anchored to the vertices of the SMPL model. In this way the SMPL model provides a geometric prior for their model, with which they can reconstruct 3D geometry and appearance and synthesize novel views. While these representations are comparatively easy to handle, the need for a parametric template mesh limits them to a pre-defined class of shapes. In particular, models for quadrupeds can fit a large variety of relatively similar animals like cats and horses, but run into problems if uncommon shapes are present [65]. For example, the fitting of elephants or giraffes poses significant problems due to their additional features (trunk) or different shapes (long neck). Similar problems arise in the case of humans and clothing, e.g. a free flowing coat. On the contrary, [23,58] exploit implicit representations and reduce manual intervention completely. They disentangle dynamic objects into a signed distance field defined in canonical space and a latent pose code represented by the flow field from a canonical pose to a given shaped pose of the same identity.

2.3 Neural Fields

Neural fields are an emergent research area that has become a popular representation for neural rendering [32,33,50,53,60], 3D reconstruction and scene representation [31,38,51], geometry-aware generative modelling [7,36,48] and many more. For a detailed overview, we refer the reader to [56]. While these representations often rely on a global latent vector to represent the information of interest [31,38,50,51], the importance of locally conditioning the neural field has been demonstrated in [8,11,33,36,44]. By relying on local information drawn from geometrically aligned latent vectors, these methods often obtain higher quality reconstructions and generalize better due to their translation equivariance. Neural fields have an especially strong impact on neural rendering with the formulation of radiance-based integration over rays introduced in NeRF [32]. While NeRF and many follow-ups [33,60] achieve high-quality renderings of single scenes, [59] match their performance without relying on neural networks, implying that NeRF's core competence is meaningful gradients for optimization. [52] combines the well known NeRF pipeline [32] with a keypoint-based skeleton. This allows them to reconstruct articulated 3D representations of humans by representing the pose through the 3D positions of the keypoints. However, [52] optimizes 3D coordinates in the camera system obtained from fitting the SMPL model [26], while we predict 3D world coordinates from multi-view silhouettes. In contrast to the volumetric rendering of NeRF, which typically requires hundreds of evaluations of the neural network, LFNs [50] propose an alternative by instantly predicting a pixel's color given the corresponding ray's origin and direction. This approach results in much faster rendering compared to NeRF-like approaches. In this work, we embrace the idea of single-evaluation rendering [50], and employ ideas from [11] for an efficient representation and architecture. Our approach to rendering is thus much faster than [52], allowing also for faster solutions to


the inverse rendering problem of estimating keypoint locations from images. Since [50] is the only other single-evaluation-rendering method we are aware of, we quantitatively compare our method in Table 1 in terms of color and depth.

3 Neural Puppeteer

We will describe Neural Puppeteer in three parts. First, we discuss the encoding of the pose and latent codes, as can be seen in the upper half of Fig. 2. Second, we describe our keypoint-based neural rendering, as depicted in the lower half of the figure. Third and last, we discuss how our pipeline can be inverted to perform keypoint estimation by fitting the pose generated from 3D keypoints to input silhouette data.

3.1 3D Keypoint Encoding

Given 3D keypoint coordinates x ∈ R^{K×3} of a subject with K keypoints, we aim to learn an encoder network

enc : R^{K×3} → R^{d_z}, x ↦ z    (1)

that encodes a pose x as a d_z-dimensional global representation z, as well as a decoder network

dec : R^{d_z} → R^{K×3} × R^{K×d_f}, z ↦ (x̂, f)    (2)

that reconstructs the pose x̂ and obtains local features f_k ∈ R^{d_f} for each keypoint. Subsequently, we use the representation (z, f) to locally condition our neural rendering on the pose x, as explained in Sect. 3.2. Please note that we do not require a skeleton model, i.e. connectivity between keypoints, since the pose is only interpreted as a point cloud. We build our encoder upon the vector self-attention

VSA : R^{K×3} × R^{K×d_f} → R^{K×d_f}, (x, f) ↦ f′    (3)

introduced in [61] as a neural geometric point cloud operator. Consequently, our encoder, which consists of L layers, produces features

f^{(l+1)} = ET_l(BN_l(f^{(l)} + VSA_l(x, f^{(l)}))), l ∈ {0, ..., L − 1},    (4)

where BN denotes a BatchNorm layer [15] and ET denotes element-wise transformations containing a 2-layer MLP, a residual connection and another BatchNorm. Initial features f^{(0)} ∈ R^{K×d_f} are learned as free parameters. The final global representation z is obtained via dimension-wise global max-pooling,

z = max_{k=1,...,K} f_k^{(L)}.    (5)

We decode the latent vector z using two 3-layer MLPs. Keypoints are reconstructed using x̂ = MLP_pos(z), and additional features f̃ = MLP_feats(z) holding information for the subsequent rendering are extracted. Finally, these features are refined to f using 3 further VSA layers, which completes the decoder dec(z) = (x̂, f). For more architectural details we refer to the supplementary.


Fig. 2. Neural Puppeteer: Our method takes 3D keypoints (top left) and learns individual latent codes for each keypoint (top right). We project them into arbitrary camera views and perform neural rendering to reconstruct 2D RGB-D images (bottom right). The pipeline can be used to perform pose estimation by inverse rendering, optimizing the 3D keypoint positions by performing gradient descent in the latent code z ∈ Rdz . For rendering, we use only the closest keypoints, illustrated by the yellow circle around the point in question (yellow dot). Connections between keypoints are shown just for visualization and not used by the network. (Color figure online)

3.2 Keypoint-Based Neural Rendering

Contrary to many recent neural rendering approaches, we do not rely on costly NeRF-style volumetric rendering [32]. Instead we adopt an approach similar to LFNs [50] that predicts a pixel color with a single neural network evaluation. While LFNs use a global latent vector and the ray's origin and orientation as network inputs, we directly operate in pixel coordinates. Perspective information is incorporated by projecting the keypoints x and the corresponding local features f into pixel coordinates. More specifically, given camera extrinsics E and intrinsics K and the resulting projection π_{E,K}, we obtain 2D keypoints

x_2D = π_{E,K}(x).    (6)

Additionally, we append the depth values d, such that the keypoints' positions are x*_2D = (x_2D, d) ∈ R^{K×3}. Using this positional information we further


refine the features in pixel coordinates using L = 3 layers of VSA. Specifically, we define f_2D^{(0)} = f and

f_2D^{(l+1)} = VSA_l(x*_2D, f_2D^{(l)}), l ∈ {0, 1, 2}.    (7)

The resulting refined features f_2D^{(L)} and coordinates x*_2D are the basis for our single-evaluation rendering, as described next. Given a pixel coordinate q ∈ R^2, we follow [11] to interpolate meaningful information from nearby features f_2D^{(L)} and the global representation z. To be more specific, the relevant information is

y = VCA((q, 0), z, x*_2D, f_2D^{(L)}) ∈ R^{d_f},    (8)

where VCA denotes the vector cross-attention from [11] and we set the depth for q to zero. Finally, the predictions ĉ for color, d̂ for depth and ô for 2D occupancy values are predicted using three feed-forward network heads

FFN_col : R^{d_f} → [0, 1]^3, FFN_dep : R^{d_f} → [0, 1], and FFN_occ : R^{d_f} → [0, 1],    (9)

respectively, using the same architecture as in [44]. For convenience we define the color rendering function

C_{E,K} : R^{K×3} × R^{K×d_f} × R^{d_z} → [0, 1]^{H×W×3},    (10)

which renders an image seen with extrinsics E and intrinsics K conditioned on keypoints x, encoded features f and global representation z, by executing FFN_col for all pixels q. For the depth and silhouette modalities we similarly define D_{E,K} and S_{E,K}, respectively. Note that our silhouettes contain probabilities for a pixel lying on the object.

3.3 Training

We consider a dataset consisting of M poses x_m ∈ R^{K×3} captured by C cameras with extrinsics E_c and intrinsics K_c. For each view c and pose m we have 2D observations

I_{m,c}, D_{m,c}, S_{m,c}, m ∈ {1, ..., M}, c ∈ {1, ..., C},    (11)

corresponding to color, depth and silhouette, respectively. All model parameters are trained jointly to minimize the composite loss

L = λ_pos L_pos + λ_col L_col + λ_dep L_dep + λ_sil L_sil + λ_reg ‖z‖_2,    (12)

where the different positive numbers λ are hyperparameters to balance the influence of the different losses. The keypoint reconstruction loss

L_pos = Σ_{m=1}^{M} ‖x_m − x̂_m‖_2    (13)


minimizes the mean Euclidean distance between the ground truth and reconstructed keypoint positions. The color rendering loss

L_col = Σ_{m=1}^{M} Σ_{c=1}^{C} ‖C_{E_c,K_c}(x_m, f_m, z_m) − I_{m,c}‖_2^2    (14)

is the squared pixel-wise difference over all color channels; the depth loss L_dep is given by a structurally identical formula. Finally, the silhouette loss

L_sil = Σ_{m=1}^{M} Σ_{c=1}^{C} BCE(S_{E_c,K_c}(x_m, f_m, z_m), S_{m,c})    (15)

measures the binary cross entropy BCE(ô, o) = −[o · log(ô) + (1 − o) · log(1 − ô)] over all pixels. Hence the silhouette renderer is trained to classify pixels into inside and outside points, similar to [31].

3.4 Pose Reconstruction and Tracking

While the proposed model learns a prior over poses along with their appearance and geometry and thus can be used to render from 3D keypoints, we can also infer 3D keypoints by solving an inverse problem, using NePu as a differentiable forward map from keypoints to images. We are especially interested in silhouette-based inverse rendering, in order to obtain robustness against transformations that leave silhouettes unchanged. Given observed silhouettes S_c, extrinsics E_c and intrinsics K_c for cameras c ∈ {1, ..., C}, we optimize for

ẑ = argmin_z Σ_{c=1,...,C} BCE(S_{E_c,K_c}(dec(z), z), S_c) + λ‖z‖_2.    (16)

Following [41], we use the PyTorch [39] implementation of the limited-memory BFGS (L-BFGS) [37] to solve the optimization problem. Once ẑ has been obtained, we recover the corresponding pose as x̂ = MLP_pos(ẑ). Since the above mentioned optimization problem does not assume any prior knowledge about the pose, the choice of an initial value is critical. For initial values too far from the ground truth pose, we observe an unstable optimization behavior that frequently diverges or gets stuck in local minima. To overcome this obstacle, we run the optimization with I different starting conditions z^(1), ..., z^(I), which we obtain by clustering the 2-dimensional t-SNE [28] embedding of the global latent vectors over the training set using affinity propagation [10]. We obtain different solutions ẑ_i from minimization of Eq. (16), and as the final optimum choose the one with the best IoU to the target silhouettes,

ẑ = argmax_{ẑ_i} Σ_{c=1,...,C} IoU(S_{E_c,K_c}(dec(ẑ_i), ẑ_i), S_c).    (17)

While such a procedure carries a significant overhead, the workload is amortized in a tracking scenario. When provided with a sequence of silhouettes


Table 1. Quantitative results of the rendering on the test sets. Comparison of the color PSNR [dB] between the reconstructed RGB images and the mean absolute error (MAE) [mm] of the reconstructed depth for LFN* and NePu. The local conditioning significantly improves reconstruction accuracy. See text for a discussion of the results.

 | Color PSNR [dB]: LFN* | NePu | Δ | Depth MAE [mm]: LFN* | NePu | Δ
Sy. Cow | 14.95 | 19.17 | +4.22 | 43.4 | 22.3 | −21.1
Sy. Giraffe | 15.99 | 21.23 | +5.24 | 92.1 | 35.4 | −56.7
Sy. Pigeon | 21.47 | 28.03 | +6.56 | 6.9 | 2.5 | −4.4
Sy. Human | 20.23 | 27.49 | +7.26 | 77.8 | 20.9 | −56.9
Average | 18.16 | 23.98 | +5.82 | 55.1 | 20.3 | −34.8

S_{c,1}, ..., S_{c,T}, c ∈ {1, ..., C} of T frames, we use the above method to determine ẑ_1. For subsequent frames we initialize ẑ_{t+1} = ẑ_t and fine-tune by minimizing Eq. (16) using a few steps of gradient descent. Our unoptimized implementation of the tracking approach runs at roughly 1 s per frame using 8 cameras.
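The tracking variant then amounts to reusing the previous optimum and taking a few gradient steps per frame, e.g. as in the following sketch, which builds on the fitting function above; the optimizer choice and step count are illustrative assumptions.

import torch
import torch.nn.functional as F

def track(render_silhouette, sil_sequence, cameras, z0, lr=1e-2, steps=5, lam=1e-2):
    z_prev, latents = z0, []
    for target_sils in sil_sequence:               # one list of per-camera silhouettes per frame
        z = z_prev.clone().requires_grad_(True)
        opt = torch.optim.Adam([z], lr=lr)         # a few plain gradient steps instead of multi-start L-BFGS
        for _ in range(steps):
            opt.zero_grad()
            loss = lam * z.norm()
            for cam, sil_gt in zip(cameras, target_sils):
                loss = loss + F.binary_cross_entropy(render_silhouette(z, cam), sil_gt)
            loss.backward()
            opt.step()
        z_prev = z.detach()
        latents.append(z_prev)
    return latents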

4 Experiments

In this section, we evaluate different aspects of our method. To evaluate pose estimation on previously unseen data, we compare our method to the state-of-the-art multi-view keypoint detector [17], which we train for each individual subject using the same dataset as for NePu. The fundamental claims of our methodology are shown by rendering novel views and poses, quantitatively evaluating color and depth estimates, and comparing against a version of our framework that utilizes the LFN [50] formulation for neural rendering. For visual evidence that our method produces temporally consistent views and for additional qualitative results we refer to the videos and experiments in our supplemental material. In addition we compare NePu to AniNeRF [42] on Human3.6M [16].

4.1 Datasets

Our method can be trained for any kind of shape data that can be described by keypoints or skeletons. Connectivity between the keypoints neither needs to be known, nor do we make use of it in any form. We perform a thorough evaluation of our method on multiple datasets, for two types of shapes where a keypoint description often plays a major role: humans and animals. For the human data we use the SMPL-X Blender add-on [41]. Here, we evaluate both on individual poses and captured motion sequences. The poses were obtained through the AGORA dataset [40], the animations through the

Fig. 3. Novel view synthesis for novel poses (columns: Human, Giraffe, Pigeon, Cow; rows: GT, NePu, LFN*). Our method is capable of generating realistic renderings of the captured subject. In this figure, we show novel poses from new perspectives. See text for a discussion of the results.

AMASS dataset, and originally captured by ACCAD and BMLmovi [1,22,30]. We describe the human pose by 33 keypoints; see the additional material for details. For the training of the animal data we use hand-crafted animated 3D models from different sources. To capture a variety of different animal shapes, we render datasets with a cow, a giraffe, and a pigeon. In particular the giraffe is a challenging animal for template-mesh-based methods, as its neck is much longer compared to most other quadrupeds. For each animal we created animations which include idle movement (e.g. looking around), walking, trotting, running, cleaning, and eating, and render between 910 and 1900 timesteps. We use between 19 and 26 keypoints to describe the poses. The keypoints for all shapes were created by tracking individual vertices. As keypoints in the interior of the shape are typically preferred, we average the positions of two vertices on opposing sides of the body part. All datasets were rendered using Blender (www.blender.org) and include multi-view data of 24 cameras placed on three rings at different heights around the object. For each view and time step we generate ground truth RGB-D data as well as silhouettes of the main object. The camera parameters, and the 3D and 2D keypoints for each view and timestep, are also included.

4.2 Implementation Details

We implement all our models in PyTorch [39] and train using the AdamW optimizer [27] with a weight decay of 0.005 and a batch size of 64 for a total of 2000 epochs. We set the initial learning rate to 5e−4 , which is decayed with a factor of 0.2 every 500 epochs. We weight the training loss in Eq. (12) with λpos = 2,


Fig. 4. 3D point cloud reconstruction. The 2D depth estimations from our method can be used to perform 3D reconstruction of the captured shape. The raw 3D point clouds generated by projecting the estimates from all cameras into the common reference frame already yield a good representation. The outliers originate from depth estimates at occlusion boundaries as the L2 loss encourages smoothed depth maps and could be easily removed by filtering the point cloud.

λ_col = λ_dep = 1, λ_sil = 3 and λ_reg = 1/16. Other hyperparameters and architectural details are presented in the supplementary. Instead of rendering complete images during training to compute L_col, L_dep and L_sil, we only render a randomly sampled subset of pixels. Both for color and depth we sample uniformly from all pixels in the ground truth mask. Hence the color and depth rendering is unconstrained outside of the silhouette. To compute L_sil we sample areas near the boundary of the silhouette more thoroughly, similar to [8].
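The optimizer and schedule stated above translate directly into PyTorch, e.g. as in this short sketch; model stands for any nn.Module implementing the pipeline, and everything beyond the stated hyperparameters is an assumption.

import torch

def make_optimizer(model):
    # AdamW with weight decay 0.005; learning rate 5e-4 decayed by a factor of 0.2 every 500 epochs.
    opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.005)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=500, gamma=0.2)
    return opt, sched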

4.3 Baselines

We use different baselines: LToHP [17] for our pose estimation and tracking approach, and [50] for our keypoint-based locally conditioned rendering results. In addition we compare NePu to AniNeRF [42] on Human3.6M [16]. LToHP. LToHP [17] presents two solutions for multi-view 3D human pose estimation, an algebraic and a volumetric one. For details see our supplementary and [17]. We use their implementation from [18] with their provided configuration file for hyperparameters. In order to obtain the quantitative results in Table 2, we individually fine-tune [17] on every animal, using the same data that we trained NePu on. In animal pose estimation it is common practice to fine-tune a pretrained network, as in state-of-the-art animal pose estimators like DeepLabCut [34], due to a lack of animal pose data. For all models, including our models, we select the epoch with minimum validation error for testing. LFNs. For the comparison to [50], we integrate their rendering formulation in our framework, resulting in two differences. First, their rendering operates in Plücker coordinates instead of pixel values. Second, and more importantly, we do not use local features f for conditioning in this baseline, but global conditioning via concatenation using z. In the following, we denote this model by LFN*. AniNeRF. We trained NePu using the same training regime as AniNeRF [42]: We only trained on a single subject, using every 5th of the first 1300 frames of


Table 2. Quantitative results for 3D keypoint estimation on the test sets. Comparison to LToHP [17]. Values are given as MPJPE [mm] and its median over all samples [mm]. Δ shows the difference to [17]. alg., vol. and vol.gt indicate that [17] is trained with its algebraic model, its volumetric model with the root joint from the algebraic results, and its volumetric model with the root joint from ground truth, respectively.

 | MPJPE [mm]: LToHP [17] | NePu | Δ | Median [mm]: LToHP [17] | NePu | Δ
Sy. Cow | 124 (vol.) | 15 | −109 | 110 (vol.) | 9 | −101
Sy. Giraffe | 190 (vol.gt) | 67 | −123 | 154 (vol.gt) | 31 | −123
Sy. Pigeon | 10.3 (alg.) | 1.3 | −9.0 | 6.2 (alg.) | 1.1 | −5.1
Sy. Human | 28 (alg.) | 46 | +18 | 22 (alg.) | 28 | +6
Average | 88 | 32 | −56 | 73 | 17 | −56

sequence "Posing" of subject "S9" for training and every 5th of the following 665 frames for testing. Like AniNeRF, we also only used cameras 0–2 for training and camera 3 for testing.

4.4 3D Keypoint and Pose Estimation

Quantitative results for the 3D keypoint estimation are shown in Table 2. We report the MPJPE in mm and its median [mm] over all test samples. For LToHP [17] we evaluate both the algebraic and volumetric model and report the better result. In addition we report the average of MPJPE and median over all objects. We achieve a better average MPJPE and median (32 mm and 17 mm respectively) over all objects than LToHP (88 mm and 73 mm respectively). Note, however, that [17] achieves better results for humans only. We hypothesize two reasons for that. First, the human-specific pre-training of LToHP transfers well to the human data we evaluate on. Secondly, the extremities of the human body (especially arms and hands) vanish more often in silhouettes than for the animals we worked with. Example qualitative results can be found in Fig. 5.

4.5 Keypoint-Based Neural Rendering

Quantitative results for the keypoint-based neural rendering part are shown in Table 1. For color comparison we report the PSNR [dB] over all test samples, while for depth comparison we report the MAE [mm]. We achieve a better average PSNR and MAE (23.98 dB and 20.3 mm respectively) over all objects than LFN* (18.16 dB and 55.1 mm respectively). Qualitative results for color rendering and depth reconstruction can be found in Fig. 3 and Sec. 3.2 of our supplementary respectively. Comparing our method to the implementation without local conditioning shows the importance of the local conditioning, the renderings are much more detailed, with more precise

Fig. 5. Results for pose estimation. We show projected keypoints of the ground truth, NePu and LToHP [17] in the first, second and third row, respectively, for a human, a giraffe, a pigeon and a cow. Color coded dots indicate the keypoint location and its 3D error to the ground truth. Red dots indicate a high, green dots a low 3D error. The connections between keypoints are just for visualization and not used by the networks. (Color figure online)

boundaries. Figure 4 shows projections of multiple depth maps from different viewing directions as 3D point clouds. The individual views align nicely, yielding good input for further processing and analysis. The results of the novel view and pose comparison with AniNeRF [42] on Human3.6M [16] are shown in Fig. 6. Even though our rendering formulation is fundamentally different and much less developed, our results look promising, but cannot meet the quality of AniNeRF. While AniNeRF leverages the SMPL parameters to restrict the problem to computing blend weight fields, our method has to solve a more complex problem. In principle our rendering could also be formulated in canonical space and leverage SMPL to model deformations. In addition, part segmentation maps, as well as the relation of view direction and body orientation, could further help to reduce ambiguities in the 2D rendering. Compared to A-NeRF [52] and AniNeRF [42], the fundamental differences in the rendering pipeline result in a significant speed increase. They render at 0.25–1 fps and 1 fps at 512×512 px, respectively, while we render at 50 fps at 256×256 px. In contrast to both methods we do not make use of optimization techniques that constrain the rendering to a bounding box of the 3D keypoints. Employing this would result in a total speed increase of 50–200× at 512×512 px.


Fig. 6. Novel view and pose comparison with AniNeRF. Novel view (a: GT, b: NePu, c: AniNeRF) and novel view + novel pose (d: GT, e: NePu, f: AniNeRF) rendering on the Human3.6M dataset.

5 Limitations and Future Work

In the future we plan to extend our model to account for instance specific shape and color variations, by incorporating the respective information in additional latent spaces. Moreover, our 3D keypoint encoder architecture (cf. Sect. 3.1) can for example be further improved for humans in a similar fashion to [52]. Such an approach would intrinsically be rotation equivariant and better capture the piece-wise rigidity of skeletons. Finally, while 2D rendering is much faster, it also means that we do not have guaranteed consistency between the generated views of the scene, in the sense that they are not necessarily exact renderings of the same 3D shape. This is an inherent limitation and cannot easily be circumvented, but our quantitative results indicate that it does not seem to impact quality of the rendering by much.

6 Conclusions

In this paper we present a neural rendering framework called Neural Puppeteer that projects keypoints and features into a 2D view for local conditioning. We demonstrate that NePu can detect 3D keypoints with an inverse rendering approach that takes only 2D silhouettes as input. In contrast to common 3D keypoint estimators, this is by design robust with respect to changes in texture, lighting or domain shifts (e.g. synthetic vs. real-world data), provided that silhouettes can be detected. Due to our single-evaluation neural rendering, inverse rendering for downstream tasks becomes feasible. For animal shapes, we outperform a state-of-the-art 3D multi-view keypoint estimator in terms of MPJPE [17], despite only relying on silhouettes. In addition, we render color, depth and occupancy simultaneously at 20 ms per 256×256 image, significantly faster than NeRF-like approaches, which typically achieve less than 1 fps. The proposed keypoint-based local conditioning significantly improves neural rendering of articulated objects quantitatively and qualitatively, compared to a globally conditioned baseline.


References

1. Advanced Computing Center for the Arts and Design: ACCAD MoCap Dataset. https://accad.osu.edu/research/motion-lab/mocap-system-and-data
2. Artacho, B., Savakis, A.: UniPose+: a unified framework for 2D and 3D human pose estimation in images and videos. IEEE TPAMI 44(12), 9641–9653 (2021)
3. Bala, P.C., Eisenreich, B.R., Yoo, S.B.M., Hayden, B.Y., Park, H.S., Zimmermann, J.: Automated markerless pose estimation in freely moving macaques with OpenMonkeyStudio. Nat. Commun. 11, 4560 (2020)
4. Biggs, B., Boyne, O., Charles, J., Fitzgibbon, A., Cipolla, R.: Who left the dogs out? 3D animal reconstruction with expectation maximization in the loop. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 195–211. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_12
5. Biggs, B., Roddick, T., Fitzgibbon, A., Cipolla, R.: Creatures great and SMAL: recovering the shape and motion of animals from video. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 3–19. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_1
6. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR (2017)
7. Chan, E., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: periodic implicit generative adversarial networks for 3D-aware image synthesis. arXiv (2020)
8. Chibane, J., Alldieck, T., Pons-Moll, G.: Implicit functions in feature space for 3D shape reconstruction and completion. In: CVPR (2020)
9. Cormier, M., Clepe, A., Specker, A., Beyerer, J.: Where are we with human pose estimation in real-world surveillance? In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 591–601 (2022)
10. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
11. Giebenhain, S., Goldlücke, B.: AIR-Nets: an attention-based framework for locally conditioned implicit representations. In: 2021 International Conference on 3D Vision (3DV), pp. 1054–1064 (2021)
12. Giebenhain, S., Waldmann, U., Johannsen, O., Goldlücke, B.: Neural puppeteer: keypoint-based neural rendering of dynamic shapes (dataset), October 2022. https://doi.org/10.5281/zenodo.7149178
13. Günel, S., Rhodin, H., Morales, D., Campagnolo, J., Ramdya, P., Fua, P.: DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. Elife 8, e48571 (2019)
14. He, Y., Yan, R., Fragkiadaki, K., Yu, S.I.: Epipolar transformers. In: CVPR (2020)
15. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 37 (2015)
16. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI 36(7), 1325–1339 (2014)
17. Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y.: Learnable triangulation of human pose. In: ICCV (2019)


18. Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y.: Learnable triangulation of human pose (2019). https://github.com/karfly/learnable-triangulation-pytorch
19. Ji, X., Fang, Q., Dong, J., Shuai, Q., Jiang, W., Zhou, X.: A survey on monocular 3D human pose estimation. Virtual Reality Intell. Hardw. 2(6), 471–500 (2020)
20. Joska, D., et al.: AcinoSet: a 3D pose estimation dataset and baseline models for cheetahs in the wild. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13901–13908 (2021). https://doi.org/10.1109/ICRA48506.2021.9561338
21. Karashchuk, P., et al.: Anipose: a toolkit for robust markerless 3D pose estimation. Cell Rep. 36(13), 109730 (2021)
22. BioMotionLab: BMLmovi Motion Capture Database. https://www.biomotionlab.ca//
23. Lei, J., Daniilidis, K.: CaDeX: learning canonical deformation coordinate space for dynamic surface representation via neural homeomorphism. In: CVPR, pp. 6624–6634, June 2022
24. Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36(6), 194:1–194:17 (2017). Two first authors contributed equally
25. Liu, Z., Zhu, J., Bu, J., Chen, C.: A survey of human pose estimation: the body parts parsing based methods. J. Vis. Commun. Image Represent. 32, 10–19 (2015)
26. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34(6), 1–16 (2015). https://doi.org/10.1145/2816795.2818013
27. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
28. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86) (2008). https://jmlr.org/papers/v9/vandermaaten08a.html
29. Madadi, M., Bertiche, H., Escalera, S.: Deep unsupervised 3D human body reconstruction from a sparse set of landmarks. Int. J. Comput. Vis. 129(8), 2499–2512 (2021). https://doi.org/10.1007/s11263-021-01488-2
30. Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: ICCV, pp. 5441–5450, October 2019. https://doi.org/10.1109/ICCV.2019.00554
31. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: learning 3D reconstruction in function space. In: CVPR (2019)
32. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
33. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15 (2022). https://doi.org/10.1145/3528223.3530127
34. Nath, T., Mathis, A., Chen, A.C., Patel, A., Bethge, M., Mathis, M.W.: Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nat. Protoc. 14, 2152–2176 (2019)
35. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29


36. Niemeyer, M., Geiger, A.: GIRAFFE: representing scenes as compositional generative neural feature fields. In: CVPR (2021)
37. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5
38. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: DeepSDF: learning continuous signed distance functions for shape representation. In: CVPR (2019)
39. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
40. Patel, P., Huang, C.H.P., Tesch, J., Hoffmann, D.T., Tripathi, S., Black, M.J.: AGORA: avatars in geography optimized for regression analysis. In: CVPR, June 2021
41. Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (2019)
42. Peng, S., et al.: Animatable neural radiance fields for modeling dynamic human bodies. In: ICCV, pp. 14314–14323, October 2021
43. Peng, S., et al.: Neural body: implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: CVPR (2021)
44. Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M., Geiger, A.: Convolutional occupancy networks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 523–540. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_31
45. Reddy, N.D., Guigues, L., Pishchulin, L., Eledath, J., Narasimhan, S.G.: TesseTrack: end-to-end learnable multi-person articulated 3D pose tracking. In: CVPR, pp. 15190–15200 (2021)
46. Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 36(6), 245:1–245:17 (2017). https://doi.org/10.1145/3130800.3130883
47. Rüegg, N., Zuffi, S., Schindler, K., Black, M.J.: BARC: learning to regress 3D dog shape from images by exploiting breed information. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3876–3884, June 2022
48. Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: GRAF: generative radiance fields for 3D-aware image synthesis. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
49. Sengupta, A., Budvytis, I., Cipolla, R.: Probabilistic 3D human shape and pose estimation from multiple unconstrained images in the wild. In: CVPR, pp. 16094–16104, June 2021
50. Sitzmann, V., Rezchikov, S., Freeman, W.T., Tenenbaum, J.B., Durand, F.: Light field networks: neural scene representations with single-evaluation rendering. In: Advances in Neural Information Processing Systems (2021). https://openreview.net/forum?id=q0h6av9Vi8
51. Sitzmann, V., Zollhoefer, M., Wetzstein, G.: Scene representation networks: continuous 3D-structure-aware neural scene representations. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/file/b5dc4e5d9b495d0196f61d45b26ef33e-Paper.pdf
52. Su, S.Y., Yu, F., Zollhöfer, M., Rhodin, H.: A-NeRF: articulated neural radiance fields for learning human shape, appearance, and pose. In: NeurIPS (2021)
53. Tewari, A., et al.: Advances in neural rendering. arXiv preprint arXiv:2111.05849 (2021)


54. Toshpulatov, M., Lee, W., Lee, S., Roudsari, A.H.: Human pose, hand and mesh estimation using deep learning: a survey. J. Supercomput. 78, 7616–7654 (2022). https://doi.org/10.1007/s11227-021-04184-7
55. Tu, H., Wang, C., Zeng, W.: VoxelPose: towards multi-camera 3D human pose estimation in wild environment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 197–212. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_12
56. Xie, Y., et al.: Neural fields in visual computing and beyond (2021). https://arxiv.org/abs/2111.11426
57. Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: keypoint localization via transformer. In: ICCV, pp. 11802–11812 (2021)
58. Yenamandra, T., et al.: i3DMM: deep implicit 3D morphable model of human heads. In: CVPR, pp. 12803–12813 (2021)
59. Yu, A., Fridovich-Keil, S., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks (2021)
60. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)
61. Zhao, H., Jiang, L., Jia, J., Torr, P., Koltun, V.: Point transformer. In: International Conference on Computer Vision (ICCV) (2021)
62. Zheng, Z., Yu, T., Wei, Y., Dai, Q., Liu, Y.: DeepHuman: 3D human reconstruction from a single image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
63. Zhou, K., Bhatnagar, B.L., Pons-Moll, G.: Unsupervised shape and pose disentanglement for 3D meshes. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 341–357. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_21
64. Zuffi, S., Kanazawa, A., Black, M.J.: Lions and tigers and bears: capturing non-rigid, 3D, articulated shape from images. In: CVPR (2018)
65. Zuffi, S., Kanazawa, A., Jacobs, D.W., Black, M.J.: 3D menagerie: modeling the 3D shape and pose of animals. In: CVPR (2017)

Decanus to Legatus: Synthetic Training for 2D-3D Human Pose Lifting

Yue Zhu and David Picard

LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
{yue.zhu,david.picard}@enpc.fr
https://github.com/Zhuyue0324/Decanus-to-Legatus

Abstract. 3D human pose estimation is a challenging task because of the difficulty of acquiring ground-truth data outside of controlled environments. A number of further issues have been hindering progress in building a universal and robust model for this task, including domain gaps between different datasets, unseen actions between train and test datasets, various hardware settings and the high cost of annotation, etc. In this paper, we propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus) during the training of a 2D to 3D human pose lifter neural network. Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets, but in a zero-shot setup, showing the generalization potential of our framework.

Keywords: 3D Human pose · Synthetic training · Zero-shot

1 Introduction

3D human pose estimation from single images [1] is a challenging and yet very important topic in computer vision because of its numerous applications, from pedestrian movement prediction to sports analysis. Given an RGB image, the system predicts the 3D positions of the key body joints of the human(s) in the image. Recent works on deep learning methods have shown very promising results on this topic [6,21,26,48–50]. Current discriminative 3D human pose estimation methods, in which the neural network directly outputs the positions, can be put into two categories: one-stage methods, which directly estimate the 3D poses inside the world or camera space [29,34], and two-stage methods, which first estimate 2D human poses in the camera space and then lift the 2D estimated skeletons to 3D [18]. However, all these approaches require massive amounts of supervision data to train the neural network. Contrarily to 2D annotations, obtaining the 3D annotations for training and evaluating these methods is usually limited to controlled

This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012640 made by GENCI, and was supported and funded by Ergonova.
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_16.


Fig. 1. The main idea of our synthetic generation method: use a hierarchic probabilistic tree and its per joint distribution to generate realistic synthetic 3D human poses.

environments for technical reasons (motion capture systems, camera calibration, etc.). This brings a weakness in generalization to in-the-wild images, where there can be more unseen scenarios with different kinds of human appearances, backgrounds and camera parameters. In comparison, obtaining 2D annotations is much easier, and there are many more diverse existing 2D datasets in the wild [3,22,51]. This makes 2D to 3D pose lifting very appealing, since these methods can benefit from the more diverse 2D data at least for their 2D detection part. Since the lifting part does not require the input image but only the 2D keypoints, we infer that it can be trained without any real ground-truth 3D information. Training 3D lifting without using explicit 3D ground truth has previously been realized by using multiple views and cross-view consistency to ensure correct 3D reconstructions [45]. However, multiple views can be cumbersome to acquire and are also limited to controlled environments. In order to tackle this problem, we propose an algorithm which generates infinite synthetic 3D human skeletons on the fly during the training of the lifter, from just a few initial handcrafted poses. This generator provides enough data to train a lifter to invert 2D projections of these generated skeletons back to 3D, and can also be used to generate multiple views for cross-view consistency. We introduce a Markov chain indexed by a tree structure (a Markov tree), following a hierarchical parent-child joint order, which allows us to generate skeletons with a distribution that we evolve through time so as to increase the complexity of the generated poses (see Fig. 1). We evaluate our approach on the two benchmark datasets Human3.6M and MPI-INF-3DHP and achieve zero-shot results that are competitive with those of weakly supervised methods. To summarize, our contributions are:

– A 3D human pose generation algorithm following a probabilistic hierarchical architecture and a set of distributions, which uses zero real 3D pose data.
– A Markov tree model of distributions that evolve through time, allowing generation of unseen human poses.
– A semi-automatic way to handcraft a few 3D poses to seed the initial distribution.
– Zero-shot results that are competitive with methods using real data.

2 Related Work

Monocular 3D Human Pose Estimation. In recent years, monocular 3D human pose estimation has been widely explored in the community. The models can be mainly categorized into generative models [2,4,7,24,33,39,47], which fit 3D parametric models to the image, and discriminative models, which directly learn 3D positions from the image [1,38]. Generative models try to fit the shape of the entire body and as such are great for augmented reality or animation purposes [35]. However, they tend to be less precise than discriminative models. On the other hand, a difficulty of discriminative models is that depth information is hard to infer from a single image when it is not explicitly modeled, and thus additional bias must be learned using 3D supervision [25,26], multiview spatial consistency [13,45,48] or temporal consistency [1,9,23]. Discriminative models can also be categorized into one-stage models, which predict 3D poses directly from images [14,25,29,34], and two-stage methods, which first learn a 2D pose estimator and then lift the obtained 2D poses to 3D [18,28,45,48,49,52]. Lifting 2D poses to 3D is somewhat of an ill-posed problem because of depth ambiguity. But the larger quantity and diversity of 2D datasets [3,22,51], as well as the already much better performance achieved in 2D human pose estimation, provide a strong argument for focusing on lifting 2D human poses to 3D.

Weak Supervision Methods. Since obtaining precise 3D annotations of human poses is hard for technical reasons and mostly limited to controlled environments, many research proposals tackled this problem by designing weak supervision methods that avoid using 3D annotations. For example, Iqbal et al. [18] apply a rigid-aligned multiview consistency 3D loss between multiple 3D poses estimated from different 2D views of the same 3D sample. Mitra et al. [30] learn 3D pose in a canonical form and ensure the same predicted poses from different views. Fang et al. [13] propose a virtual mirror so that the estimated 3D poses, after being symmetrically projected onto the other side of the mirror, should also look correct, thus simulating another kind of 'multiview' consistency. Finally, Wandt et al. [45] learn lifted 3D poses in a canonical form as well as a camera position so that every 3D pose lifted from a different view of the same 3D sample should still have 2D reprojection consistency. For us, in addition to the 3D supervision obtained from our synthetic generation, we also use multiview consistency to improve our training performance.

Synthetic Human Pose Training. Since the early days of the Kinect, synthetic training has been a popular option for estimating 3D human body pose [40]. The most common strategy is to perform data augmentation in order to increase the size and diversity of real datasets [16]. Others, like Sminchisescu et al. [43], render synthetically generated poses on natural indoor and outdoor image backgrounds. Okada et al. [32] generate synthetic human poses in a subspace constructed by PCA using the walking sequences extracted from the CMU Mocap dataset [19]. Du et al. [12] create a synthetic height-map dataset to train a dual-stream convolutional network for 2D joint localization. Ghezelghieh et al. [15] utilize


3D graphics software and the CMU Mocap dataset to synthesize humans with different 3D poses and viewpoints. Pumarola et al. [36] created 3DPeople, a large-scale synthetic dataset of photo-realistic images with a large variety of subjects, activities and human outfits. Both [11] and [25] use pressure maps as input to estimate 3D human pose with synthetic data. In this paper, we are only interested in generating realistic 3D poses as a set of keypoints so as to train a 2D to 3D lifting neural network. As such, we do not need to render visually realistic humans with meshes, textures and colors for this much simpler task.

Human Pose Prior. Since the human body is highly constrained, it can be leveraged as an inductive bias in pose estimation. Bregler et al. [8] use a kinematic-chain human pose model that follows the skeletal structure, extended by Sigal et al. [42] with interpenetration constraints. Chow et al. [10] introduced the Chow-Liu tree, the maximum spanning tree over pairwise mutual information, to model pairs of joints that exhibit a high flow of information. Lehrmann et al. [20] use a Chow-Liu tree that maximizes an entropy function depending on nearest neighbor distances, and learn local conditional distributions from data based on this tree structure. Sidenbladh et al. [41] use cylinders and spheres to model the human body. Akhter et al. [2] learn joint-angle limit priors under local coordinate systems of 3 human body parts: torso, head, and upper legs. We use a variant of the kinematic model because the 3D limb lengths are fixed no matter the view, which facilitates the generation of synthetic skeletons.

Cross-Dataset Generalization. Due to the diversity of human appearances and viewpoints, cross-dataset generalization has recently been the center of attention of several works. Wang et al. [46] learn to predict camera views so as to auto-adjust to different datasets. Li et al. [21] and Gong et al. [16] perform data augmentation to cover the possible unseen poses in the test dataset. Rapczyński et al. [37] discuss several methods, including normalisation, viewpoint estimation, etc., for improving cross-dataset generalization. In our method, since we use purely synthetic data, we are always in a cross-dataset generalization setup.

3 Proposed Method

The goal of our method is to create a simple synthetic human pose generation model that allows us to train on purely synthetic data, without any real 3D human pose data during the whole training procedure.

3.1 Synthetic Human Pose Generation Model

Local Spherical Coordinate System. Without loss of generality, we use the Human3.6M skeleton layout shown in Fig. 2 (a) throughout the paper. To simplify human pose generation, we set the pelvis joint (joint 0) as the root joint and the origin of the global Cartesian coordinate system, from which a tree structure is applied to generate joints one by one. We suppose that the position of one joint


Fig. 2. (a) The 17-joint model of Human3.6M that we use. (b) The parent-child joint relation graph: with the parent joint's coordinates as the origin of a local spherical coordinate system, it generates the child joint's position. (c) The parent-child ρ, θ, φ relation graph: with the parent joint's ρ, θ, φ information, it samples the child joint's ρ, θ, φ. (d) An example of how a child joint is generated with ρ, θ, φ sampled from the relationship in (c), under the local spherical coordinate system with its parent joint in (b) as origin.

depends on the position of the joint which is directly connected to it but closer (in the geodesic sense) to the root joint. We call this kinematic chain the parent-child joint relations, as shown in Fig. 2 (b). With this relationship, we propose to generate the child joint in a local spherical coordinate system (ρ, θ, φ) centered on its parent joint (see Fig. 2 (d)). The ρ, θ, φ values are sampled with respect to a conditional distribution P(x_child | x_parent). This produces a Markov chain indexed by a tree structure, which we denote as a Markov tree. Our motivation for using a local spherical coordinate system for joint generation is that each human body branch has a fixed length ρ no matter the movement. Also, since the supination and pronation of the branches are not encoded in the skeleton representation, the new joint position can be parameterized with the polar angle θ and the azimuthal angle φ. Furthermore, by using an axis system depending on the 'grandparent-parent' branch instead of the global coordinate system, the interval of angles θ and φ achievable by a human is more limited than in a global coordinate system. Finally, our local spherical coordinate system is entirely bijective with the global coordinate system.

Hierarchic Probabilistic Skeleton Sampling Model. Generating a human pose in our local spherical coordinate system is equivalent to generating a set of (ρ, θ, φ). We thus propose to sample these values from a distribution that approximates that of real human poses. To retain plausible poses, we limit the range of (ρ, θ, φ) for each joint based on what is on average biologically achievable. Since body joints follow a tree-like structure, it is unlikely that sampling each joint independently of the others leads to realistic poses. Instead, we propose to model the distribution of the joints by a Markov chain indexed by a tree following the skeleton, where the probability of sampling a tuple (ρ, θ, φ) for a joint depends on the values sampled for its parent. More formally, denoting a child joint c and


its parent p(c) following the tree structure, we have:

$$(\rho_c, \theta_c, \phi_c) \sim P\big((\rho, \theta, \phi) \mid (\rho_{p(c)}, \theta_{p(c)}, \phi_{p(c)})\big) \qquad (1)$$

Please note that the tree structure used for accounting for the dependencies between joints, as shown in Fig. 2 (c), is slightly different from the kinematic one. We found in practice that it is better to condition the position of one shoulder on the position of the same-side hip, and to condition a symmetrical shoulder/hip on its already generated counterpart rather than on their common parent. Intuitively, this seems to better encode global consistency. To facilitate modeling the distribution P((ρ, θ, φ) | (ρ_{p(c)}, θ_{p(c)}, φ_{p(c)})), we make the further assumption that all three components only depend on their parent counterparts. More formally:

$$\rho_c \sim P(\rho \mid \rho_{p(c)}), \quad \theta_c \sim P(\theta \mid \theta_{p(c)}), \quad \phi_c \sim P(\phi \mid \phi_{p(c)}) \qquad (2)$$

This allows us to model each distribution with a simple non-parametric model consisting of a 2D histogram representing the probability of sampling, e.g., ρ_c knowing the value of ρ_{p(c)}. In practice, we use 50-bin histograms for each value, totalling 3 × 16 = 48 2D histograms of size 50 × 50. When there is no ambiguity, we use the same notation P(·|·) for the histogram and the probability.
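As an illustration of the non-parametric model above, the following minimal sketch (not the authors' code) shows one way the 50 × 50 conditional histograms could be populated from a handful of seed poses expressed in local spherical coordinates, and how a child component is then sampled given its parent's value. The toy `PARENT` tree, the value range and the fallback for empty parent bins are our assumptions.

```python
import numpy as np

N_BINS = 50
# Hypothetical toy tree: joint 0 is the root, joints 1-3 hang off their parents.
PARENT = {0: None, 1: 0, 2: 1, 3: 1}

def to_bin(x, lo, hi):
    """Discretise a continuous value into one of N_BINS bins over [lo, hi]."""
    return int(np.clip((x - lo) / (hi - lo) * N_BINS, 0, N_BINS - 1))

def build_conditional_histograms(seed_values, lo, hi):
    """seed_values: (n_seeds, n_joints) array of one component (e.g. theta) per seed pose.
    Returns a (N_BINS, N_BINS) histogram P(child_bin | parent_bin) for every non-root joint."""
    hists = {c: np.zeros((N_BINS, N_BINS)) for c, p in PARENT.items() if p is not None}
    for pose in seed_values:
        for c, p in PARENT.items():
            if p is not None:
                hists[c][to_bin(pose[p], lo, hi), to_bin(pose[c], lo, hi)] += 1.0
    return hists

def sample_child(hist, parent_value, lo, hi, rng):
    """Sample a child value given its parent's value from a conditional 2D histogram."""
    row = hist[to_bin(parent_value, lo, hi)]
    if row.sum() == 0:              # unseen parent bin: fall back to the marginal over children
        row = hist.sum(axis=0)
    b = rng.choice(N_BINS, p=row / row.sum())
    width = (hi - lo) / N_BINS
    return lo + (b + rng.random()) * width   # uniform draw inside the selected bin

rng = np.random.default_rng(0)
seed_theta = rng.uniform(0.0, np.pi, size=(10, len(PARENT)))   # stand-in for 10 seed poses
hists = build_conditional_histograms(seed_theta, 0.0, np.pi)
theta_2 = sample_child(hists[2], parent_value=seed_theta[0, 1], lo=0.0, hi=np.pi, rng=rng)
```

Walking the Markov tree from the root and calling such a sampler for ρ, θ and φ of every child joint yields one full skeleton, which is then converted back to Cartesian coordinates.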

3.2 Pseudo-realistic 3D Human Pose Sampling

The next step is to estimate a distribution that can approximate the real 3D pose distribution, and from which our model can sample, so that the generated poses look like real human actions. Under the constraint of zero-shot 3D real data, we choose to look at a limited amount of real 2D poses and 'manually' lift them to 3D to build our distribution. However, it is impossible to tell the exact depths of keypoints from an image by eye, and checking a lot of images one by one would be a huge amount of work. Instead, we choose a 3-step procedure to obtain our handcrafted 3D poses:

High-Variance 2D Poses. We randomly sample 1000 sets of 10 2D human poses from the target dataset (e.g., Human3.6M). We then compute the total variance for each set and pick the sets with the largest variance as our candidates. This ensures our initial pose set has high diversity.

Semi-automatic 2D to 3D Seed Pose Lifting. Next, we use a semi-automatic way to lift the samples in each seed set to 3D. The idea is as follows: from an image for which we already know the 2D distances between connected joints, and if we can estimate the 3D length of each branch that connects the joints as well as the proportion λ_prop between the 2D length in the image (in pixels) and the 3D length (in centimeters), we can estimate the relative depth between connected joints using the Pythagorean theorem, under the assumption that the camera


produces an almost orthogonal projection. The ambiguity about the sign of these depths, which decides whether one joint is in front of or behind its parent joint, can easily be manually annotated. To estimate the 3D lengths, we define a set of fixed values representing the branch lengths (||c − p(c)||₂, ∀c except the root joint) of the human body, based on biological data. Since we later calculate under a proportionality assumption between 3D and 2D, we only need them to roughly represent the proportionality between different human bone lengths. We also manually annotate sign_c for each keypoint c, denoting whether it is relatively further from or closer to the camera compared to its parent joint p(c). Finally, the 2D-3D size proportion λ_prop is calculated under the assumption that the 3 joints around the head (head top, nose and neck) form a triangle of known ratio, which is independent of rotation and view, as visually shown in Fig. 3. This is reasonable since there is no largely moving articulated part in this triplet. We choose AB = 1 as the unit length and we suppose the proportions between AB, BC and CA are fixed (BC = αAB, AC = βAB). Noting d_B = B'B − A'A and d_C = C'C − A'A, for the 2D skeleton we know A'B', B'C' and A'C' (in pixels); we then have 3 unknown variables d_B, d_C and λ_prop, and 3 equations:

$$d_B^2 = AB^2 - \left(\frac{A'B'}{\lambda_{prop}}\right)^2, \quad d_C^2 = (\beta AB)^2 - \left(\frac{A'C'}{\lambda_{prop}}\right)^2, \quad (d_B - d_C)^2 = (\alpha AB)^2 - \left(\frac{B'C'}{\lambda_{prop}}\right)^2 \qquad (3)$$

Then we can solve for λ_prop. In practice, we set α = 1 and β = 5/3. After obtaining these depths, we apply the Pythagorean theorem to get the final depth value of all joints following the kinematic order. Examples of semi-automatically lifted 3D poses are shown in Fig. 4. Since there are only a few keypoints to label as in front of or behind their parent joint, the labeling process is very easy and takes only about 3 minutes per image.
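For illustration, the system in Eq. (3) can be solved numerically; the sketch below recovers d_B, d_C and λ_prop with a generic least-squares solver, using AB = 1 as in the paper. The solver choice and the initial guess are our assumptions, and the signs of the relative depths remain manually annotated as described above.

```python
import numpy as np
from scipy.optimize import least_squares

def solve_head_triangle(ab_px, ac_px, bc_px, alpha=1.0, beta=5.0 / 3.0):
    """Solve Eq. (3) for (d_B, d_C, lambda_prop), with AB = 1 as the unit 3D length.
    ab_px, ac_px, bc_px: 2D pixel distances A'B', A'C', B'C'."""
    def residuals(v):
        d_b, d_c, lam = v
        return [
            d_b ** 2 - (1.0 - (ab_px / lam) ** 2),
            d_c ** 2 - (beta ** 2 - (ac_px / lam) ** 2),
            (d_b - d_c) ** 2 - (alpha ** 2 - (bc_px / lam) ** 2),
        ]
    # lambda_prop is roughly the pixel length of A'B' per unit 3D length; small depths as a start.
    x0 = np.array([0.1, 0.1, ab_px])
    sol = least_squares(residuals, x0, bounds=([-np.inf, -np.inf, 1e-6], np.inf))
    d_b, d_c, lam = sol.x
    # The sign of d_b and d_c (in front of / behind) is decided by the manual annotation.
    return d_b, d_c, lam

# Example call with made-up pixel measurements:
# d_b, d_c, lam = solve_head_triangle(ab_px=30.0, ac_px=48.0, bc_px=25.0)
```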

Fig. 3. (a) 3D poses (red A, B and C, in centimeters) of 3 joints of the head projected onto the 2D camera plane (blue A', B' and C', in pixels). (b) The same, seen from the right side after a 90° rotation. (Color figure online)


Fig. 4. An example of a set of 10 semi-automatically lifted 3D poses. This set of seeds is also the one which produces our best score on the Human3.6M dataset. These 10 lifted samples have a 79.42 mm MPJPE error compared to the ground truth.

Distribution Diffusion. We then transform the 3D poses into the local spherical coordinate system and use each seed set as the initial distribution to populate the histograms. Since the sampling of a new skeleton follows the Markov tree structure and different limbs have a weak correlation between them in our model, it is possible to sample skeletons that look like combinations of the original 10 samples within the seed set. However, these initial samplings are by no means complete, and we run the risk of overfitting the lifter network to these poses only. To alleviate this problem, we introduce a diffusion process in each 2D histogram such that the probability of adjacent parameters is raised over time. More formally:

$$P(x_c \mid x_{p(c)})_{t+1} = P(x_c \mid x_{p(c)})_t + \alpha_{x_c}\, \Delta P(x_c \mid x_{p(c)})_t, \quad x \in \{\rho, \theta, \phi\} \qquad (4)$$

where Δ is the Laplacian operator and α_{x_c} is the diffusion coefficient. This idea is derived from the heat diffusion equation in thermodynamics, in which bins with a higher probability diffuse to their neighbours (Laplacian operator), making the generation process more and more likely to generate samples outside the initial bins. The main reason behind our diffusion process is that of curriculum learning [5]. At first, the diversity of sampled skeletons is low and the neural network is able to quickly learn how to lift these poses. At a later stage, the diffusion process allows the sampling process to generate more diverse skeletons that are progressive extensions of the initial pose angles, avoiding overfitting to the original poses. We show in Fig. 5 an example of the evolution of the histogram and the increase of generation variety through diffusion.
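A minimal sketch of one diffusion step of Eq. (4) on a single 2D conditional histogram is given below; the discrete 5-point Laplacian, the boundary handling and the renormalisation are our assumptions about details the paper does not spell out.

```python
import numpy as np

def diffuse(hist, alpha):
    """One diffusion step of Eq. (4) on a 2D conditional histogram.
    A discrete 5-point Laplacian spreads probability mass to neighbouring bins."""
    padded = np.pad(hist, 1, mode="edge")
    lap = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
           padded[1:-1, :-2] + padded[1:-1, 2:] - 4.0 * hist)
    new = np.clip(hist + alpha * lap, 0.0, None)
    return new / new.sum()          # keep the histogram a valid distribution
```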

3.3 Training with Synthetic Data

The training setup of the 2D-3D lifter network l_w is shown in Fig. 6 and consists of 3 main components: (1) sampling a batch of skeletons at each step; (2) sampling different virtual cameras to project the generated skeletons into 2D; and finally


Fig. 5. First row: an example of the distribution histogram of a joint after 0, 200, 1000 and 3000 steps of diffusion. Second row: an example of the slightly increased generation variety when sampling from a single bin and generating 10 samples each time, after 0, 200, 1000 and 3000 steps of diffusion.

Algorithm 1. Sampling algorithm
Require: true distribution Pt, empirical distribution Pe
  bins ← {bins where Pt > 0 and Pe ≤ Pt}
  b ∼ U(bins)
  return random sample from b

(3) the different losses used to optimize l_w. In practice, l_w is a simple 8-layer MLP with 1 in-layer, 3 basic residual blocks of width 1024, and 1 out-layer, adapted from [45]. When sampling a new batch of skeletons using our generator, we have to keep in mind that the distribution of the generator varies through time because of the diffusion process introduced in Eq. 4. To avoid over-sampling or under-sampling bins with low density, we propose to track the number of skeletons that have been generated in each bin and to adjust the sampling strategy accordingly. More formally, let us denote P_t the true distribution obtained by Eq. 4, and P_e the empirical distribution obtained by tracking the generation process. The corrected sampling algorithm is shown in Algorithm 1 and basically selects uniformly a plausible bin (P_t > 0) that has not been over-sampled (P_e ≤ P_t). The whole generation process simply loops over all joints using the Markov tree and is shown in Algorithm 2. At initialization, we sample 5000 real 2D poses, compute the proportion of nearest neighbours within each pose seed, and use it to initialize the histogram to give more importance to more frequent poses. Regarding the projection of the batch into 2D, we propose to sample a set of batch-wise rotation matrices R_1,...,N, mostly rotating around the vertical axis, to simulate different viewpoints. Then, the rotated 3D skeletons are simply X_{3D,i} = R_i X_{3D,0}, i ∈ {1, ..., N}, with X_{3D,0} being the original skeleton in


Algorithm 2. Pose generation algorithm
Require: true distribution Pt, empirical distribution Pe, Markov tree structure T, sampling algorithm S
  X ← 0_(J,3)                              ▷ 3 = ρ, θ, φ
  for i ∈ {ρ, θ, φ} do                     ▷ root joint
    X[0, i] ← S(Pt(X_0), Pe(X_0))
  end for
  for (p, c) in T do                       ▷ parent-child relations in T
    for i ∈ {ρ, θ, φ} do
      X[c, i] ← S(Pt(X_(c,i) | X_(p,i)), Pe(X_(c,i) | X_(p,i)))
      update Pe(X_(c,i) | X_(p,i))
    end for
  end for
  return X in Cartesian coordinates

global Cartesian coordinates. To simulate the cameras, we follow [45] and use a scaleless orthogonal projection:

$$X_{2D,i} = \frac{W X_{3D,i}}{\|W X_{3D,i}\|_F}, \quad W = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad (5)$$

where W is the orthogonal projection matrix and ‖·‖_F is the Frobenius norm. Normalizing by the Frobenius norm allows us to be independent of the global scale of X_{2D,i} while retaining the relative scale of the bones with respect to each other. In practice, we found that uniformly sampling random rotation matrices at each batch makes the training much more difficult. Instead, we sample views with a small noise around the identity matrix and let the noise increase as the training goes on, to generate more complex views at later stages. Finally, to train the network, we leverage several losses. First, since we have the 3D ground truth associated with each generated skeleton:

$$L_{3D} = \frac{1}{N} \sum_{i=1}^{N} \left\| \frac{X_{3D,i}}{\|X_{3D,i}\|_F} - \frac{\hat{X}_{3D,i}}{\|\hat{X}_{3D,i}\|_F} \right\|_1, \qquad (6)$$

with $\hat{X}_{3D,i} = l_w(X_{2D,i})$ being the output of the lifter l_w, and ‖·‖₁ the ℓ1 norm. 3D skeletons are normalized before being compared because the input of the lifter is scaleless, and as such it would make no sense to expect the lifter to recover the global scale of X_{3D}. Then, we use the multiple views generated thanks to R_i to enforce a multiview consistency loss. Calling $\hat{X}_{2D,i,j} = W R_j R_i^{-1} \hat{X}_{3D,i}$ the projection of the lifted skeleton from view i into view j, we optimize the cross-view projection error:

$$L_{2D} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left\| \frac{X_{2D,j}}{\|X_{2D,j}\|_F} - \frac{\hat{X}_{2D,i,j}}{\|\hat{X}_{2D,i,j}\|_F} \right\|_1 \qquad (7)$$

The global synthetic training loss we use is the following combination:

$$L = L_{2D} + \lambda_{3D} L_{3D} \qquad (8)$$
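To make the training objective concrete, here is a hedged sketch of the scaleless projection of Eq. (5) and the losses of Eqs. (6)–(8); the tensor shapes, the `lifter` interface and the exact averaging constants are our assumptions rather than the authors' implementation.

```python
import torch

W = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])           # orthogonal projection matrix of Eq. (5)

def project(x3d):
    """Scaleless orthogonal projection, Eq. (5). x3d: (B, 3, J) -> (B, 2, J)."""
    x2d = W @ x3d
    return x2d / x2d.flatten(1).norm(dim=1).view(-1, 1, 1)

def scaleless_l1(pred, target):
    """L1 distance between Frobenius-normalised skeletons (used in Eqs. (6) and (7))."""
    pred = pred / pred.flatten(1).norm(dim=1).view(-1, 1, 1)
    target = target / target.flatten(1).norm(dim=1).view(-1, 1, 1)
    return (pred - target).abs().sum(dim=(1, 2)).mean()

def total_loss(x3d_views, r_mats, lifter, lambda_3d=0.1):
    """Eq. (8): scaleless 3D supervision plus cross-view 2D reprojection consistency.
    x3d_views: list of N (B, 3, J) rotated ground-truth skeletons; r_mats: their (3, 3) rotations.
    `lifter` is assumed to map a (B, 2, J) 2D pose to a (B, 3, J) 3D pose."""
    x2d_views = [project(x) for x in x3d_views]
    x3d_hat = [lifter(x2d) for x2d in x2d_views]
    l3d = sum(scaleless_l1(xh, x) for xh, x in zip(x3d_hat, x3d_views)) / len(x3d_views)
    l2d = 0.0
    for i, xh in enumerate(x3d_hat):
        for j, rj in enumerate(r_mats):
            x_ij = project(rj @ r_mats[i].T @ xh)   # rotation inverse = transpose
            l2d = l2d + scaleless_l1(x_ij, x2d_views[j])
    return l2d / (len(x3d_hat) ** 2) + lambda_3d * l3d
```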


Fig. 6. Our whole training process with synthetic data. Our generator g generates a 3D human pose following the given distributions P of ρ, θ and φ. Multiple randomly generated rotations R are applied to project it into different camera views. The projector W projects the poses into scaleless 2D coordinates, which are the network inputs. The estimated 3D output poses are supervised with the scaleless 3D loss L_{3D}, and also with the cross-view scaleless 2D reprojection loss L_{2D}, which rotates the estimated 3D pose from one view to another with the known R and applies 2D supervision after the projection W.

4 Experiments

4.1 Datasets

We use two widely used datasets, Human3.6M [17] and MPI-INF-3DHP [29], to quantitatively evaluate our method. We only use our generated synthetic samples for training, and evaluate on S9 and S11 of Human3.6M and TS1-TS6 of MPI-INF-3DHP with their common protocols. In order to compare the quality of our generated skeletons with real 2D data, we also use the COCO [22] and MPII [3] datasets to check the generalizability of our method with a qualitative evaluation.

4.2 Evaluation Metrics

For the quantitative evaluation on both Human3.6M and MPI-INF-3DHP we use the MPJPE, i.e. the mean Euclidean distance between the reconstructed and ground-truth 3D pose coordinates after the root joint is aligned (P1 evaluation protocol of the Human3.6M dataset). Since we train the network with a scaleless loss, we follow [45] and rescale the output 3D pose's Frobenius norm to the ground-truth 3D pose's Frobenius norm in order to compute the MPJPE. We also report the PCK, i.e. the percentage of keypoints for which the distance between the predicted and ground-truth 3D pose is less than or equal to half of the head's length.
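A small sketch of these two metrics, under the assumption that poses are given as (J, 3) arrays with the root joint already subtracted:

```python
import numpy as np

def mpjpe_scaled(pred, gt):
    """Root-aligned MPJPE after rescaling the prediction's Frobenius norm to the ground truth's."""
    pred = pred * (np.linalg.norm(gt) / np.linalg.norm(pred))
    return np.linalg.norm(pred - gt, axis=1).mean()

def pck(pred, gt, head_length):
    """Percentage of joints whose error is at most half the head segment length."""
    err = np.linalg.norm(pred - gt, axis=1)
    return (err <= 0.5 * head_length).mean()
```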

4.3 Implementation Details

We use a batch size of 32 and we train for 10 epochs on a single 16 GB GPU using the Adam optimizer and a learning rate of 10⁻⁴. We set the number of views to N = 4, and the total number of synthetic 2D input samples for each epoch is the same as the number of H36M training samples, to make a fair comparison.


Table 1. Comparison of our results with the state of the art under the common protocol 1 on Human3.6M and MPI-INF-3DHP; weak supervision methods [18, 21, 30, 45] and synthetic training methods [15, 12, 44] are compared against ours, averaged over 10 seed sets and for the best run. Values before and after the ± symbol are mean and standard deviation. H36M MPJPE↓: competing methods 67.4, 120.95, 65.9, 106.8, ≥78.13, 126.47 and 111.6; Ours (10 sets) 95.4 ± 13.5; Ours (best run) 60.8. 3DHP MPJPE↓: reported by two competing methods only (109.3 and 104.0); Ours (10 sets) 148.4 ± 7.6; Ours (best run) 132.8. 3DHP PCK↑: reported by the same two competing methods (79.5 and 77.0); Ours (10 sets) 57.7 ± 2.3; Ours (best run) 61.9.

We compare our results with the state-of-the-art methods with synthetic supervision for training in Table 1. We present several weak supervision methods which also do not use real 3D annotations, and instead use other sort of real data supervision whereas we do not. We can see that our method outperforms these synthetic training methods and achieves the performance on par with weakly supervised methods on H36M, while never using a real example for training. We show qualitative results on the COCO dataset on Fig. 7. Since the COCO layout is different from that of H36M, we use a linear interpolation of existing joints to localize the missing joints. We can see that our model still achieves good qualitative performances on zero shot lifting of human poses in the wild (first 2 rows). Failed predictions (last row) tend to bend the legs backward even when the human is standing still, which may be a bias of the generator.

5 5.1

Ablation Studies Synthetic Poses Realism

We want to see how similar our synthetic skeletons are to real skeletons. Qualitatively we compare our distribution after diffusion with the distribution of the whole Human3.6M and MPI-INF-3DHP datasets, for some of the joints as shown in Fig. 8. We can see that, even though there are many poses in MPI-INF3DHP have never appear in Human3.6M, the distributions of angles θ and φ of these two real datasets have very similar shapes, which means our local spherical coordinate system successfully models the invariance of the biological achievable

Decanus to Legatus: Synthetic Training for 2D-3D Human Pose Lifting

269

Fig. 7. Example of zero shot lifting in the wild on images from the COCO dataset. The first row are visually correct prediction, while the last row presents ’failure’ cases, mostly due to right leg learnt a bias of leaning backward.

Fig. 8. Left: Examples of distributions of angle θ and φ from same parent-child pairs computed on Human3.6M, MPI-INF-3DHP, and our diffusion process. Right: Precision and recall evaluated with 5k generated samples and 5k real 2D samples from h36m.

human pose angles and their frequencies which are independent of camera view point. Our seeds+diffuse strategy produces a Gaussian mixture which succeed in covering big parts of real dataset’s distribution. Quantitatively we apply a precision/recall test, as is common practice with GANs [31]. We sample 5000 real and 5000 synthetic poses and project them to 2D plane using the scaleless projection in 5 and the Euclidean distance. Precision (resp. Recall) is defined as percentage of synthetic samples (resp. real samples) inside the union of the balls centered on each real sample (resp. synthetic sample) and with a radius of the distance to its 10-th nearest real sample neighbor (resp. synthetic sample neighbor). In our case, we already know that most synthetic skeleton generated by our Markov tree are biologically possible thanks to the limits in the generation intervals. As such, we are more interested in a very high recall so as to not miss the diversity of real skeletons. All our seed sets have

270

Y. Zhu and D. Picard

Table 2. Results on the 24-keypoint SMPL model, compared to the state-of-the-art Method

Labeled training data

MPJPE↓

CLIFF (ECCV22) H36m + 3DHP + COCO + MPII + 3DPW 52.8 65.5 DynaBOA (TPAMI22) H36m + 3DPW Ours

24 samples from 3DPW

61.09 ± 2.16

more than 70% recall and highest one achieves 91.8% recall. The precision, on the other hand, is around 40%, with 47.1% as the highest, which is still good considering we only start with 10 manually lifted initial poses for each seed. 5.2

Effect of Diffusion

We want to see why diffusion process is essential to our method. We take respectively 1, 10, 100, 1000 and 10000 samples of 3D poses on Human3.6M dataset as initial seed to make distribution graphs, and apply our 2D precision recall test after diffusion process. The result is shown in Fig. 8. We can see that diffusion generally increase recall value at the cost of precision value. The distribution using 1 samples as seed is much worse with the others in recall which means it can only cover around 60% of samples from real dataset even with diffusion process, while the distribution using 100 samples or more are close in performances. The diffusion process can reduce the gap between the distribution using 10 samples as seeds and those using 100 or more samples, which is important to us considering we want to avoid handcrafting a lot of initial poses. 5.3

Layout Adaptation

We show that our synthetic generation and training method also work on a different keypoint layout by applying the whole process on a newly defined hierarchic Markov tree based on 24 keypoints of SMPL model [24] and evaluating on 3DPW dataset [27]. We use 24 samples from its training set (one frame from each video) using our 2D variance based criterion for the seeds. Since our training method is scaleless, we rescale the predicted 3D poses by the average Forbenius norm of the 24 samples in the seed. The average MPJPE of 10 different seeds is shown in Table 2. This validates the generalization capability of our method.

6

Conclusion

We present an algorithm which allows to generate synthetic 3D human skeletons on the fly during the training, following a Markov-tree type distribution which evolve through out time to create unseen poses. We propose a scaleless multiview training process based on purely synthetic data generated from a few handcrafted poses. We evaluate our approach on the two benchmark datasets Human3.6M and MPI-INF-3DHP and achieve promising results in a zero shot setup.

Decanus to Legatus: Synthetic Training for 2D-3D Human Pose Lifting

271

References

1. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28, 44–58 (2006)
2. Akhter, I., Black, M.J.: Pose-conditioned joint angle limits for 3D human pose reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
3. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
4. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. In: ACM Transactions on Graphics (2005)
5. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning (2009)
6. Bhatnagar, B.L., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Combining implicit function learning and parametric models for 3D human reconstruction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 311–329. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_19
7. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep It SMPL: automatic estimation of 3D human pose and shape from a single image. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 561–578. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_34
8. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: Proceedings of 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No. 98CB36231) (1998)
9. Choi, H., Moon, G., Chang, J.Y., Lee, K.M.: Beyond static features for temporally consistent 3D human pose and shape from a video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
10. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14, 462–467 (1968)
11. Clever, H.M., Erickson, Z., Kapusta, A., Turk, G., Liu, K., Kemp, C.C.: Bodies at rest: 3D human pose and shape estimation from a pressure image using synthetic data. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
12. Du, Y., et al.: Marker-less 3D human motion capture with monocular image sequence and height-maps. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_2
13. Fang, Q., Shuai, Q., Dong, J., Bao, H., Zhou, X.: Reconstructing 3D human pose by watching humans in the mirror. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
14. Gärtner, E., Pirinen, A., Sminchisescu, C.: Deep reinforcement learning for active human pose estimation. In: AAAI (2020)
15. Ghezelghieh, M.F., Kasturi, R., Sarkar, S.: Learning camera viewpoint using CNN to improve 3D body pose estimation. In: 3D Vision (2016)
16. Gong, K., Zhang, J., Feng, J.: PoseAug: a differentiable pose augmentation framework for 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)


17. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1325–1339 (2014)
18. Iqbal, U., Molchanov, P., Kautz, J.: Weakly-supervised 3D human pose learning via multi-view images in the wild. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
19. Lab, C.G.: Motion capture database (2001). http://mocap.cs.cmu.edu
20. Lehrmann, A.M., Gehler, P.V., Nowozin, S.: A non-parametric Bayesian network prior of human pose. In: 2013 IEEE International Conference on Computer Vision (2013)
21. Li, S., et al.: Cascaded deep monocular 3D human pose estimation with evolutionary training data. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
22. Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: ECCV (2014)
23. Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.C., Asari, V.: Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
24. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) (2015)
25. Luo, Y., Li, Y., Foshey, M., Shou, W., Sharma, P., Palacios, T., Torralba, A., Matusik, W.: Intelligent carpet: inferring 3D human pose from tactile signals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
26. Ma, X., Su, J., Wang, C., Ci, H., Wang, Y.: Context modeling in 3D human pose estimation: a unified perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
27. von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 614–631. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_37
28. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings IEEE International Conference on Computer Vision (ICCV) (2017)
29. Mehta, D., et al.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 2017 Fifth International Conference on 3D Vision (3DV) (2017)
30. Mitra, R., Gundavarapu, N.B., Sharma, A., Jain, A.: Multiview-consistent semi-supervised learning for 3D human pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
31. Naeem, M.F., Oh, S.J., Uh, Y., Choi, Y., Yoo, J.: Reliable fidelity and diversity metrics for generative models. In: International Conference on Machine Learning (2020)
32. Okada, R., Soatto, S.: Relevant feature selection for human pose estimation and localization in cluttered images. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5303, pp. 434–445. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88688-4_32


33. Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (2019)
34. Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: CVPR (2017)
35. Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion synthesis with transformer VAE. In: ICCV (2021)
36. Pumarola, A., Sanchez, J., Choi, G., Sanfeliu, A., Moreno-Noguer, F.: 3DPeople: modeling the geometry of dressed humans. In: International Conference in Computer Vision (ICCV) (2019)
37. Rapczyński, M., Werner, P., Handrich, S., Al-Hamadi, A.: A baseline for cross-database 3D human pose estimation. Sensors 31, 3769 (2021)
38. Rhodin, H., et al.: Learning monocular 3D human pose estimation from multi-view images. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
39. Schmidtke, L., Vlontzos, A., Ellershaw, S., Lukens, A., Arichi, T., Kainz, B.: Unsupervised human pose estimation through transforming shape templates. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
40. Shotton, J., et al.: Real-time human pose recognition in parts from single depth images. In: CVPR 2011 (2011)
41. Sidenbladh, H., Black, M.J., Fleet, D.J.: Stochastic tracking of 3D human figures using 2D image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 702–718. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45053-X_45
42. Sigal, L., Isard, M., Haussecker, H., Black, M.J.: Loose-limbed people: estimating 3D human pose and motion using non-parametric belief propagation. Int. J. Comput. Vision 98, 15–48 (2011)
43. Sminchisescu, C., Kanaujia, A., Metaxas, D.: Learning joint top-down and bottom-up processes for 3D visual inference. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006) (2006)
44. Varol, G., et al.: Learning from synthetic humans. In: CVPR (2017)
45. Wandt, B., Rudolph, M., Zell, P., Rhodin, H., Rosenhahn, B.: CanonPose: self-supervised monocular 3D human pose estimation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
46. Wang, Z., Shin, D., Fowlkes, C.C.: Predicting camera viewpoint improves cross-dataset generalization for 3D human pose estimation. CoRR (2020)
47. Xu, H., Bazavan, E.G., Zanfir, A., Freeman, W.T., Sukthankar, R., Sminchisescu, C.: GHUM & GHUML: generative 3D human shape and articulated pose models. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
48. Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., Zhang, W.: Deep kinematics analysis for monocular 3D human pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
49. Xu, T., Takano, W.: Graph stacked hourglass networks for 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)


50. Zanfir, A., Bazavan, E.G., Xu, H., Freeman, W.T., Sukthankar, R., Sminchisescu, C.: Weakly supervised 3D human pose and shape reconstruction with normalizing flows. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 465–481. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_28
51. Zhang, S.H., et al.: Pose2Seg: detection free human instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
52. Zhou, X., Zhu, M., Leonardos, S., Derpanis, K.G., Daniilidis, K.: Sparseness meets deepness: 3D human pose estimation from monocular video. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

Social Aware Multi-modal Pedestrian Crossing Behavior Prediction

Xiaolin Zhai1,2, Zhengxi Hu1,2, Dingye Yang1,2, Lei Zhou1,2, and Jingtai Liu1,2(B)

1 Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
{2120210410,hzx,1711502}@mail.nankai.edu.cn, [email protected]
2 Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China

Abstract. With the development of self-driving vehicles, pedestrian behavior prediction plays a vital role in constructing a safe human-robot interactive environment. Previous methods ignored the inherent uncertainty of pedestrian future actions and the temporal correlations of spatial interactions. To solve these problems, we propose a novel social aware multi-modal pedestrian crossing behavior prediction network. In this research field, our network innovatively explores the multi-modal nature of pedestrian future action prediction and forecasts diverse and plausible futures. To model the social aware context in both the spatial and temporal domains, we also construct a spatial-temporal heterogeneous graph, bridging the spatial-temporal gap between the scene and the pedestrian. Experiments show that our model achieves state-of-the-art performance on the pedestrian action detection and prediction tasks. The code is available at https://github.com/zxll0106/Pedestrian_Crossing_Behavior_Prediction.

Keywords: Pedestrian crossing behavior prediction · Video understanding

1 Introduction

Predicting pedestrian behaviors plays a critical role in the human-robot interactive scene. Pedestrians in the urban traffic scenario can extract useful information from the surroundings to infer others' motion patterns and make reasonable decisions. We hope that autonomous systems can acquire the capability to mimic human perception to predict pedestrian behaviors, which is important for creating a safer environment for both robots and pedestrians.

This work is supported in part by National Key Research and Development Project under Grant 2019YFB1310604, and in part by National Natural Science Foundation of China under Grant 62173189.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_17.


Fig. 1. Our model captures useful information not only from the target pedestrian but also from the social aware context. Interactions with the surrounding traffic objects and the ego-vehicle motion are utilized to enhance the dynamical representation of the target pedestrian. Our model also considers the uncertainty of pedestrian future actions and performs diverse predictions. We choose the most frequently appearing action sequence as the final prediction result. The ego-vehicle will make reactive decisions based on the predicted action sequence of the target pedestrian. Only the standing and going towards labels appear in this figure; the other action labels are omitted.

Many works [10,13,14,22] explored pedestrian crossing behaviors and achieved significant improvement in pedestrian behavior understanding. SF-GRU [13] designed stacked RNNs to gradually fuse multiple inputs to estimate pedestrian intention. MMH-PAP [14] modeled the temporal dynamics of different inputs and designed an attention module to calculate the weights of each input to predict binary crossing action. Yao et al. [22] proposed an intention estimation and action detection network, in which a soft-attention module is designed to capture the spatial interaction between different traffic objects. However, they not only neglect the multi-modal nature of pedestrian future actions, but also ignore the temporal continuity of relations between traffic objects.

Different from previous works, we take into account the inherent uncertainty of pedestrian behaviors, which is a critical cue for inferring future actions, as shown in Fig. 1. Under the same history states, this uncertainty can cause diverse and plausible futures. For example, when the pedestrian comes to the crossroad, he may cross directly or wait for the red traffic light. Deterministic models are not suitable for capturing a one-to-many mapping and producing probabilistic inference. As a result, we propose the Multi-Modal Conditional Generative Module to learn multiple modes of pedestrian future actions. In this module, we introduce multiple latent variables to model the multimodality of the pedestrian's future and perform diverse action predictions.

Spatial-temporal interactions between traffic objects play a vital role in refining the scene information and enhancing the pedestrian representation. Pedestrians observe the current and past states of other traffic objects to perceive the surrounding environment and make reasonable decisions. To enable the robot


with the ability of perception, we propose the Social Aware Encoder to extract pedestrian-specific contextual information. Spatial relations between traffic objects and the temporal continuity of these relations should also be fully considered. In the Social Aware Encoder, we construct a spatial-temporal heterogeneous graph to model spatial-temporal interactions between the target pedestrian and heterogeneous traffic objects.

The key contributions of our work are threefold:
• We highlight a new direction in pedestrian behavior understanding, namely multi-modal pedestrian action prediction. The Multi-Modal Conditional Generative Module is proposed to capture the multimodality of pedestrian future actions.
• Our proposed Social Aware Encoder can jointly model spatial-temporal contextual interactions and augment the target pedestrian representation.
• To aggregate the above modules, we propose a social aware multi-modal pedestrian behavior prediction network. Experiment results on the PIE and JAAD datasets show that our network achieves state-of-the-art performance on the action detection and action prediction tasks. On the intention estimation task, our model improves by 1.1%–5.7% depending on the metric.

2 Related Work

2.1 Pedestrian Intention Estimation

Pedestrian intention estimation plays a critical role in helping the autonomous system make safer decisions and construct a safe urban traffic environment. Previous works [4,8,20,24] took the destination of the trajectory as pedestrian intention and lacked a deeper semantic interpretation of pedestrian intentions. PIE [10] extracted pedestrian intention from RGB images, captured temporal dynamics of intention features, and utilized intentions to guide pedestrian motion prediction. ST-DenseNet [17] proposed a real-time network that incorporates pedestrian detection, tracking, and intention estimation; it utilized YOLOv3 [15] to detect pedestrians, used the SORT [21] algorithm to track them, and designed a spatial-temporal DenseNet [5] to estimate their intentions. SF-GRU [13] collected visual inputs from pedestrians and their surrounding scenes, then gradually concatenated the inputs and fused them into stacked RNNs. Kotseruba et al. [6] analyzed the influence of the pedestrians themselves and the environment on intentions. They evaluated the impact of pedestrian gaze, location, orientation, and interaction, and also utilized the locations of designated crosswalks and curbs in the environment to estimate pedestrian intention. They combined the influences of these factors and used a logistic regression classifier to infer pedestrian intentions. FuSSI-Net [9] utilized a pose estimation network to obtain human joint coordinates and designed different strategies to fuse human joint features and visual features. Liu et al. [7] estimated intention from the pedestrian-centric and location-centric perspectives. They constructed


a pedestrian-centric graph based on pedestrian position relations and appearance relations. The pedestrian-centric graph is then modified into a location-centric graph to predict whether there are pedestrians crossing in front of the ego-vehicle. Although intentions are forward-looking and are eventually reflected in actions, these works ignored the interaction between intentions and actions. Our network produces multi-modal future action sequences under the direction of intentions, which fully accounts for the relation between pedestrian intentions and actions.

2.2 Pedestrian Behavior Prediction

In the human-robot interactive scene, uncertainties of pedestrian behaviors are an important challenge for autonomous systems. Pedestrian behavior understanding can provide a reasonable inference of pedestrian behaviors, which is beneficial for robot navigation. MMH-PAP [14] utilized an LSTM to integrate the temporal dynamics of visual features and position features and applied an attention mechanism to generate weighted representations of each input. Graph-SIM [23] proposed a pedestrian action prediction network based on graph neural networks; given their locations and speeds, it clustered road users and assigned importance weights to their relations with the target pedestrian. CIR-Net [2] paid particular attention to the relation between pedestrian actions and trajectories; to couple actions and trajectories, it aligned changes in trajectories with changes in actions. The multi-task network proposed by [22] can estimate pedestrian intention and detect crossing action simultaneously, utilizing the soft attention network [3] to model spatial relations between the target pedestrian and other road components. Previous works ignored the inherent multimodality of pedestrian future actions, and their deterministic models are not suitable for modeling a distribution over diverse futures. We take the uncertainty of pedestrian futures into consideration and design the Multi-Modal Conditional Generative Module to produce diverse and plausible future action sequences.

3 Method

Our goal is to produce diverse and plausible action predictions of the target pedestrian P. At time t, we summarize P's current state s_t and history states s_{t−1}, ..., s_{t−H} for H history time steps as the input X_t = s_(t−H:t). There is also additional surrounding information S_t of P, including the position and type of other traffic objects and the ego-vehicle's motion, to which autonomous systems have access. Given X_t and S_t, we aim to predict the target pedestrian's actions Y_t = Â_(t+1:t+F) for the future F time steps, which is referred to as modeling P(Y_t | X_t, S_t).


Fig. 2. Overview of our social-aware multi-modal pedestrian behavior prediction network. Our proposed network contains 3 parts, namely the Social Aware Encoder, the Multi-Modal Conditional Generative Module, and the Intention-Directed Decoder. The Social Aware Encoder is composed of four modules: Pedestrian Dynamics Encoding Module (green), Spatial-Temporal Heterogeneous Graph Module (blue), Ego-Vehicle Dynamics Encoding Module (yellow), and Pedestrian Future Fusion Module (brown). The symbol that joins the four features in the figure denotes the concatenation operation. (Color figure online)

3.1 Social Aware Encoder

Considering the target pedestrian and the surrounding environment, we propose the Social Aware Encoder, which is composed of four modules in parallel: Pedestrian Dynamics Encoding, Spatial-Temporal Heterogeneous Graph, Ego-Vehicle Dynamics Encoding, and Pedestrian Future Fusion, as shown in Fig. 2. We explore the temporal dynamics of the target pedestrian P to enhance the individual representation X_P in the Pedestrian Dynamics Encoding (PDE). Simultaneously, the Spatial-Temporal Heterogeneous Graph (STHG) models the pairwise relations between P and heterogeneous traffic objects to obtain the context encoded feature X_C. Meanwhile, Ego-Vehicle Dynamics Encoding (EVDE) considers the decision making of autonomous systems and models the temporal dependency X_E of the ego-vehicle's motion. We aggregate future features of P to capture rich latent information X_F from the future in the Pedestrian Future Fusion (PFF).
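To make the fusion step concrete, the following minimal PyTorch sketch shows how the four module outputs could be concatenated into the social-aware feature X_S and fed to the two classifiers described in the next paragraph. All layer names and feature dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SocialAwareFusion(nn.Module):
    """Minimal sketch: fuse X_P, X_C, X_E, X_F into X_S and classify.

    Dimensions and layer names are assumptions; the four inputs are assumed
    to be produced by PDE, STHG, EVDE and PFF respectively.
    """
    def __init__(self, d_p=128, d_c=128, d_e=64, d_f=64, d_s=256,
                 num_actions=7, num_intents=1):
        super().__init__()
        self.fc = nn.Linear(d_p + d_c + d_e + d_f, d_s)
        self.action_head = nn.Linear(d_s, num_actions)   # current action A_t
        self.intent_head = nn.Linear(d_s, num_intents)   # crossing-intention logit I_t

    def forward(self, x_p, x_c, x_e, x_f):
        x_s = torch.relu(self.fc(torch.cat([x_p, x_c, x_e, x_f], dim=-1)))
        return x_s, self.action_head(x_s), self.intent_head(x_s)

# usage with random features for a batch of 4 pedestrians
fusion = SocialAwareFusion()
x_s, action_logits, intent_logit = fusion(torch.randn(4, 128), torch.randn(4, 128),
                                          torch.randn(4, 64), torch.randn(4, 64))
```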


We concatenate X_P, X_C, X_E and X_F and apply a fully-connected layer to obtain the social-aware feature X_S. Finally, an action classifier and an intent classifier are applied on X_S to estimate the current intention Î_t and the current action Â_t, respectively.

Pedestrian Dynamics Encoding (PDE). This module models the temporal dynamics of the current and history states of the target pedestrian. We note that the current state lacks temporal context, so we merge the information of the history states in the temporal domain. The states s_(t−H:t) contain position features pos_(t−H:t) and visual features v_(t−H:t). Position features are the bounding box of the target pedestrian embedded through a fully-connected layer. Visual features are captured by a pretrained CNN backbone network [18] on the cropped image patch which includes the pedestrian and the surrounding environment. These two features have different semantic attributes, so we model their temporal dynamics separately. We input these feature sequences into the corresponding GRU networks to enrich them with temporal dynamical evolution clues. The outputs of the two GRU networks are concatenated to obtain the Pedestrian Dynamical Feature X_P, which aggregates temporal dynamics from multiple inputs.

Spatial-Temporal Heterogeneous Graph (STHG). We propose the Spatial-Temporal Heterogeneous Graph to capture other traffic objects' influence on the target pedestrian. There are various types of traffic objects, so their inherent heterogeneity cannot be ignored. For example, traffic objects of the same type have the same semantic attributes and relations with the target person. Hence, we regard the traffic objects in the traffic scenario as a heterogeneous multi-instance system. Taking the inherent heterogeneity of traffic objects into account, all traffic objects of the same type are aggregated in a local graph to capture intra-type interaction. Then we model the high-order relation between the target pedestrian and the traffic objects of the same type in a global graph. We construct a local graph G^c = (V^c, E^c) for each type c ∈ C, where the type set C includes 5 types, namely traffic neighbors (pedestrians, cyclists, vehicles, etc.), traffic signs, traffic lights, stations, and crosswalks. V^c = {v_1, v_2, ..., v_{N_c}} contains the traffic objects of type c at the same time step, and v_i = {x_tl, y_tl, x_br, y_br, c} represents the bounding box coordinates and the type of v_i. In the edge set E^c, we allow traffic objects belonging to the same type to connect with each other. We propose an information propagation mechanism on the local graph G^c to aggregate contextual information from the neighbor objects:

v_c^(l) = Avgpool(φ_c(v_1^(l)), ..., φ_c(v_{N_c}^(l)))    (1)

v_i^(l+1) = concat(φ_c(v_i^(l)), v_c^(l))    (2)

where v_i^(l) is the feature of object i in the l-th layer of the local graph, v_c^(l) is the spatial interaction feature of type c, and v_i^(0) is the initial feature v_i.

281

φc is the fully connected layer and embeds the object feature of type c into the high-dimensional spaces. Average pooling is utilized to propagate spatial context information from neighbor objects. We stack multiple layers of the local graph G c to fuse features of different objects in the different layers. In the global graph, we integrate the interaction between the target pedestrian and traffic objects of type c into the global feature Xc . We note that the attention mechanism can assign an adaptive weight to each object and understand underlying relations better. Hence, we utilize the attention mechanism to calculate traffic objects’ weights of type c based on relations with target pedestrian: √ exp(θp (vP )T θq (vi )/ dθ ) (3) Ai = Nc √ T j=1 exp(θp (vP ) θq (vj )/ dθ ) Xc =

Nc 

Ai · vi

(4)

i=1

where vi ∈ V c , vP is the target pedestrian feature, and θp and θq embed them into the dθ -dimension space. Previous works considered only spatial interaction [22]. Different from them, we capture the spatial-temporal dynamics integrally to build the bridge between the spatial and temporal context. We utilize GRU to aggregate contextual information of Xc in the temporal domain to obtain the final context encoded feature. XC = concat ({GRUc (Xc ), c ∈ C})

(5)

Finally, we obtain the Context Encoded Feature XC which aggregates spatialtemporal dependency of relations between the target pedestrian and heterogeneous traffic objects. Ego-Vehicle Dynamics Encoding (EVDE). Interactions between pedestrians and the ego-vehicle contain important latent information. For example, when observing the crossing action of the pedestrian in front of the ego-vehicle, the controller will slow down or make way accordingly. Similarly, the pedestrian observes the motion tendency of the ego-vehicle and then may wait for the egovehicle to pass. As a result, we take into account ego-vehicle future motion plans and infer the pedestrian intention and action from them. We consider that the attribute of ego-vehicle motion is different from other road users, so this module utilizes a separate bi-directional GRU to encode the ego-vehicle future motions ME (t+1:t+F ) , including speed, acceleration, yaw rate, and yaw acceleration. We refer to the output of the GRU as Ego-vehicle Dynamical Feature XE which integrates the bi-directional long-term dependency of the future motion plan.

282

X. Zhai et al.

Pedestrian Future Fusion (PFF). Considering that future actions can reflect pedestrian intentions, we collect future features of the target pedestrian, namely hidden states h of GRU cells in the Intention-Directed Decoder for the future F time steps. Future features for the future F time steps are denoted as HF = h(t+1:t+F ) . We apply a bi-directional GRU to model the temporal dynamics of the target pedestrian’s future features. In a bi-directional GRU, information flows in forward and backward two directions, so bi-directional information can be integrated. We obtain the last hidden state of the GRU and refer to it as Pedestrian Future Features XF . 3.2

Multi-modal Conditional Generating Module

There are multiple uncertainties in action prediction since the prediction is inherently probabilistic. For example, when the pedestrian faces the crosswalk, the pedestrian may cross the road, wait for a red light or wait for other vehicles to pass. Hence, under the same situation, there will be diverse future scenarios where each one contains a reasonable explanation. A deterministic function that projects one input to one output may not have the adequate capacity to represent the diverse latent space. To learn a one-to-many mapping, we adopt a deep conditional generative model, conditional variational auto-encoder [19]. This module contains the prior network Pν (z|Xt ), and the recognition network Qφ (z|Xt , Yt ). φ, ν refer to the weights of corresponding networks. The latent variables z play a critical role in modeling the inherent multimodality of the pedestrian future. In the training stage, we utilize a bi-directional GRU to encode the ground truth future actions A(t+1:t+F ) obtaining Yt . Qφ (z|Xt , Yt ) takes Xt and Yt as inputs to predict the mean μqz and covariance σzq and then sample the latent variable zq from N (μqz , σzq ). The goal of Qφ is to learn a distribution from observations and ground truth to the latent variable zq . With no prior knowledge of ground truth future actions, Pν (z|Xt ) predicts the mean μpz and covariance σzp based on observations. Similarly, the latent variable z p is sampled from the distribution N (μpz , σzp ). We optimize Kullback-Leibler divergence between N (μqz , σzq ) and N (μpz , σzp ) to make zp produced by prior network learn the distribution modeled by recognition network. During the training stage, the latent variables z are sampled from the recognition network and then concatenated with XS from the encoder. In the testing stage, we sample z from the prior network because we can not get access to ground truth future actions. 3.3

Intention-Directed Decoder

The intention is prospective and action can be described as following along line to attain a certain intention. Under the guidance of intentions, generation network Pθ (Yt |Xt , It , z) is utilized to model the diversity of future actions. Multiple latent variables z are used to generate future actions with multiple modes. And pedestrian intention It provides a direction for the development of future actions. Considering that future actions can be regarded as a temporal sequence, we adopt GRU cells to construct the Intention-Directed Decoder. The input of

Social Aware Multi-modal Pedestrian Crossing Behavior Prediction

283

the decoder is the concatenation of estimated intention It and the predicted action at the last prediction time step. Under the guidance of pedestrian intention, the decoder iteratively predicts future actions. And we collect the hidden states h(t+1:t+F ) of the GRU cell and pass them to Social Aware Encoder to extract important information in the future. 3.4

Multi-task Loss

The loss function of our model contains Kullback-Leibler divergence (KLD) between Qφ and Pν , binary cross entropy loss LI for intention estimation at time step t, cross entropy loss LA for action detection at time step t and action prediction for future time steps from t + 1 to t + F : L=

T 

(λ1 KLD(Qφ (z|X, Y ), Pν (z|X)) + λ2 LI (It , It )

t=1

(6)

t , At ) + λ4 LA (A (t+1:t+F ) , A(t+1:t+F ) )) +λ3 LA (A  are the predicted value of intention where T is the total sample length, I and A and action, and I and A are the ground truth of intention and action. λ1 , λ2 , λ3 , and λ4 are utilized to balance multiple tasks.

4

Experiments

We conduct experiments on the PIE and JAAD datasets under the original data setting and the ‘time to event’ setting to verify the effectiveness of our pedestrian behavior prediction model. Firstly, we introduce experiment settings, including datasets, implementation details, data sampling strategies, and evaluation metrics. Secondly, we perform quantitative experiments to compare our model with the state-of-the-art methods and then demonstrate the effectiveness of our proposed modules through the ablation study. Finally, we present qualitative experiments to visualize the efficiency of our proposed modules. 4.1

Experiment Settings

Datasets. There are two publicly available naturalistic pedestrian behavior datasets, namely Pedestrian Intention Estimation (PIE) [10] and Joint Attention in Autonomous Dataset (JAAD) [11,12]. PIE contains 6 h of driving videos under the first-person perspective which are shot by a monocular dashboard camera. The dataset provides annotations of 1842 pedestrians and traffic objects appearing at the same time. Pedestrian annotations contain the bounding box coordinates, intentions, and actions. Traffic objects consist of cyclists, vehicles, signs, traffic lights, crosswalks, and stations. Annotations of traffic objects contain the bounding box and the type. In addition, the ego-vehicle motion is also captured by the onboard diagnostics

284

X. Zhai et al.

sensor. We follow [22] to split 1842 pedestrians into 880 for training, 243 for validation, and 719 for testing. JAAD provides 300 video clips of 686 pedestrians which range from 5 to 15 s. The dataset also collects the video from the first-person perspective. There are annotations of pedestrians, including the bounding box, intention, and action labels. In addition, the dataset provides contextual tags which contain the traffic light state and the sign type. The ego-vehicle motion state is also accessible. We follow [22] to split 686 pedestrians into 188 for training, 32 for validation, and 126 for testing. Implementation Details. The image patches in the input contain the pedestrian and the surrounding environment. We follow [10] to expand the bounding box to twice its original size and crop the image based on the enlarged bounding box. The VGG-16 [18] pretrained on ImageNet [16] is used as the backbone network to extract the feature map of the image patch. In Multi-Modal Conditional Generative Module, we sample K = 20 latent variables z in the prior network and recognition network. Inspired by [22], we expand 2 action labels (walking and standing) to 7 semantic action labels (standing, waiting, going towards the crossing point, crossing, crossed and standing, crossed and waiting, and other walking). And the pedestrian intention consists of two binary labels, namely crossing and not-crossing. At each frame, we detect the intention and action of the current frame and predict actions for the future F = 5 frames. To balance the loss of multiple tasks, we set λ1 , λ2 , λ3 , and λ4 as 1. To train our model, we set the batch size as 64, learning rate as 10−5 , and adopt RMSprop optimizer with α = 0.9,  = 10−7 . In the testing stage, we consider the inconsistent length of trajectories, so we input one trajectory at each testing time. We take the future action sequence which appears most in K predicted sequences as the final prediction result. We collect action prediction results at each time step of the trajectory to calculate mAP of action prediction. Data Sampling Strategy. We adopt two data sampling strategies that previous works widely used, and we list the results under the two strategies: (1) the PIE strategy: all of the original data was utilized in the PIE [10]. In the training stage, we sample the trajectories which are truncated to the length T to make batch training feasible, where we set T as 15 and 30. (2) the time to event (TTE) strategy: During training, we follow [13,14] to sample trajectories with 1–2 s before the crossing event. Evaluation Metrics. Pedestrian actions contain multiple labels, so we utilize mAP to evaluate the results of action prediction and detection. Pedestrian intention estimation is a binary classification problem, so we adopt accuracy, F1 score, precision, and area under curve (AUC) as evaluation metrics.

Social Aware Multi-modal Pedestrian Crossing Behavior Prediction

285

Table 1. Results of pedestrian action detection and prediction are compared with stateof-the-art methods on the PIE dataset using the PIE sampling strategy. ‘*’ indicates the result produced by the re-implemented model. ‘Det’ represents pedestrian action detection and ‘Pred’ represents pedestrian action prediction. Method

T = 15 T = 30 mAP(Det) mAP(Pred) mAP(Det) mAP(Pred)

Yao et al. [22] 0.23*

0.22*

0.24

0.23*

Ours

0.25

0.29

0.26

0.29

Table 2. Results of pedestrian intention estimation are compared with state-of-theart methods. ‘PIE(PIE)’, ‘PIE(TTE)’, ‘JAAD(TTE)’ in the first row indicate tested on the PIE dataset using the PIE sampling strategy, on the PIE dataset using the time to event (TTE) sampling strategy, on the JAAD dataset using the TTE sampling strategy, respectively. Method

PIE(PIE) Acc F1

PIE(TTE) JAAD(TTE) Prec AUC Acc F1 Prec AUC Acc F1 Prec AUC

I3D [1] PIE [10] SF-GRU [13] MMH-PAP [14] Yao et al. [22]

– 0.79 – – 0.82

– 0.86 – – 0.94

Ours

0.85 0.91 0.92 0.87 0.85 0.91 0.93 0.89 0.88 0.74 0.67 0.91

4.2

– 0.87 – – 0.88

– 0.73 – – 0.83

0.63 – 0.87 0.89 0.84

0.42 – 0.78 0.81 0.90

0.37 – 0.74 0.77 0.96

0.58 – 0.85 0.88 0.88

0.79 – 0.83 0.84 0.87

0.49 – 0.59 0.62 0.70

0.42 – 0.50 0.54 0.66

0.71 – 0.79 0.8 0.92

Performance Evaluation and Analysis

Quantitative Results on Pedestrian Action Detection and Prediction. To verify the effectiveness of our model on the pedestrian action detection and prediction tasks, we compare our model with state-of-art models on the PIE dataset using the PIE sampling strategy in Table 1. Our model surpasses the previous works by a good margin on pedestrian action detection and prediction under T = 15 and T = 30 sampling length. Compared with [22], our model captures the spatial-temporal interaction between the target pedestrian with other traffic objects. The uncertainty of action prediction is also modeled in Conditional Multi-Modal Generating Model to sample multi-modal future action sequences. It is clear that our model incorporates the social aware context and the multimodality of pedestrian futures, which significantly improves the effectiveness of the pedestrian detection and prediction task. Quantitative Results on Pedestrian Intention Estimation. In Table 2, we also conduct pedestrian intention estimation experiments on the PIE and JAAD datasets adopting different sampling strategies. On the PIE dataset under the PIE sampling strategy, our model surpasses the state-of-the-art method [22] and achieves the best 0.85 accuracy, 0.91 F1 score, and 0.87 AUC. F1 score

286

X. Zhai et al.

Table 3. Ablation study of Spatial-Temporal Heterogeneous Graph module on the PIE dataset using the PIE sampling strategy and sampling length T = 30. The third row shows the result without traffic objects, namely removing the STHG module in our model. Row 4–8 show the effect of using one type of traffic object alone. The last row is the result with all traffic object types. Traffic objects

Action

Intention

Traffic neighbor

Crosswalk Traffic light

Traffic sign

Station mAP(Det) mAP(Pred) Acc

F1

Prec AUC











0.24

0.17

0.78 0.87 0.86 0.72











0.25

0.21

0.80 0.88 0.88 0.75











0.26

0.21

0.83 0.89 0.89 0.82











0.27

0.22

0.84 0.90 0.90 0.82











0.25

0.21

0.81 0.88 0.88 0.76











0.26

0.21

0.81 0.88 0.88 0.77











0.29

0.26

0.85 0.91 0.92 0.87

Table 4. Ablation study of the local graph in Saptial-Temporal Heterogeneous Graph on the PIE dataset using the PIE sampling strategy and sampling length T = 30.

Layers of local graphs mAP(Det) mAP(Pred) Acc

F1

Prec AUC

L=1

0.27

0.23

0.84 0.90 0.91 0.86

L=2

0.28

0.24

0.85 0.91 0.91 0.87

L=3

0.29

0.26

0.85 0.91 0.92 0.87

L=4

0.28

0.24

0.85 0.90 0.92 0.86

and AUC metrics are also increased by our model on the PIE dataset using the TTE strategy. On the JAAD dataset with the TTE sampling strategy, our model outperforms previous methods by 1.1% to 5.7% on the multiple metrics. Experiment results on intention estimation demonstrate the effectiveness of our multi-task model. Ablation Study on Spatial-Temporal Heterogeneous Graph. To explore the impact of traffic objects belonging to different types, we conduct the ablation study on the STHG module. Results in Table 3 show that adding any type of traffic object enhances the effectiveness of the STHG module. Among them, crosswalks and traffic lights both make a significant performance boost. We consider that the crosswalk determines where pedestrians will cross, and the traffic light determines when pedestrians can cross the road. Results also demonstrate that our model can mimic human perception to observe the surrounding environment and extract useful information. In addition, we investigate the impact of different layers of the local graph in the STHG module in Table 4. As the number of layers increases, the effectiveness on multiple metrics will improve. And the performance achieves the best when the number of layers is three.

Social Aware Multi-modal Pedestrian Crossing Behavior Prediction

287

Table 5. Ablation study on Multi-Modal Conditional Generative Module (MMCGM) on the PIE dataset using PIE sampling strategy and sampling length T = 30. MMCGM Action Intention mAP(Det) mAP(Pred) Acc F1

Prec AUC

K=1

0.28

0.24

0.85 0.91 0.91 0.87

K = 20

0.29

0.26

0.85 0.91 0.92 0.87

Ablation Study on Multi-modal Conditional Generating Module. As shown in Table 5, we conduct the ablation study on Multi-Modal Conditional Generating Module (MMCGM). The performance of the deterministic form (K = 1) on the action prediction is degraded. We consider that the deterministic form may not fully account for the inherent stochasticity of the future. We set the number of samples K as 20 to evaluate the effectiveness of the multi-modal action prediction. The result suggests that the multi-modal prediction makes a performance improvement from 0.24 mAP to 0.26 mAP. The introduction of multiple latent variables plays a critical role in approximating the one-to-many distribution to model the diversity of future actions. 4.3

Qualitative Example

Figure 3 shows the visualization of the STHG module and the predicted multimodal action sequences. In the frame t − 2 ∼ t + 2, the target pedestrian facing the crosswalk intends to cross the road. Observing green traffic lights for vehicles, he is waiting for crossing. At the time step t, STHG module pays attention to the traffic lights and the crosswalk. In addition, traffic neighbors in the urban scenario also deserve our attention. Traffic neighbor 1 and 2 waiting for crossing are highlighted, since the pedestrian usually observes surrounding neighbor pedestrians to infer the current scene information. STHG module also notes that there are many passing vehicles, which suggests that pedestrians should wait for vehicles to pass. At the bottom of the Fig. 3, we present multi-modal predictions of future action sequences. At the time step t − 2, our model (K = 20) outperforms the deterministic form (K = 1). It is clear that our multi-modal model considers the uncertainty of pedestrians and conducts probabilistic predictions. In the probabilistic outputs, we take the future action sequence which appears most as the final prediction result. The first action sequence appears 7 times and is highly likely to happen. We consider that the multi-modal model which can produce probabilistic inference is more preferable than the deterministic form in some application scenarios. At the time step t, both the multi-modal and deterministic model perform well, since the two models can capture the useful spatial-temporal interactions and infer the correct future actions from them.

288

X. Zhai et al.

Fig. 3. At the top of the figure, we present the visualization of attention matrices of Spatial-Temporal Heterogeneous Graph Module at the t frame. The bottom of the figure is the visualization of the deterministic output (K = 1) and the probabilistic output (K = 20). Red bounding boxes indicate the correctly predicted action sequence. ‘Count’ represents the frequency the action sequence appeared. (Color figure online)

5

Conclusion

In this work, we propose a social aware multi-modal pedestrian behavior prediction network. In the Social Aware Encoder, we capture the spatial-temporal interaction between the target pedestrian and heterogeneous traffic objects. MultiModal Conditional Generative Module is designed to model the inherent uncertainties of future action sequences. Experiments demonstrate that our model outperforms previous methods on multiple tasks on the PIE and JAAD datasets using multiple metrics. The comprehensive ablation study and visualization also verify the effectiveness of our proposed modules.

Social Aware Multi-modal Pedestrian Crossing Behavior Prediction

289

References 1. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) 2. Chen, B., Li, D., He, Y.: Simultaneous prediction of pedestrian trajectory and actions based on context information iterative reasoning. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1007–1014. IEEE (2021) 3. Chen, L., et al.: SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667 (2017) 4. Gu, J., Sun, C., Zhao, H.: DenseTNT: end-to-end trajectory prediction from dense goal sets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15303–15312 (2021) 5. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 2261–2269. IEEE Computer Society (2017). https://doi.org/10.1109/CVPR.2017.243 6. Kotseruba, I., Rasouli, A., Tsotsos, J.K.: Do they want to cross? Understanding pedestrian intention for behavior prediction. In: 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1688–1693. IEEE (2020) 7. Liu, B., et al.: Spatiotemporal relationship reasoning for pedestrian intent prediction. IEEE Robot. Autom. Lett. 5(2), 3485–3492 (2020) 8. Mangalam, K., et al.: It is not the journey but the destination: endpoint conditioned trajectory prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 759–776. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58536-5_45 9. Piccoli, F., et al.: FuSSi-Net: fusion of spatio-temporal skeletons for intention prediction network. In: 2020 54th Asilomar Conference on Signals, Systems, and Computers, pp. 68–72. IEEE (2020) 10. Rasouli, A., Kotseruba, I., Kunic, T., Tsotsos, J.K.: PIE: a large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6262– 6271 (2019) 11. Rasouli, A., Kotseruba, I., Tsotsos, J.K.: Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior. In: ICCVW, pp. 206–213 (2017) 12. Rasouli, A., Kotseruba, I., Tsotsos, J.K.: It’s not all about size: on the role of data properties in pedestrian detection. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11129, pp. 210–225. Springer, Cham (2019). https://doi.org/10.1007/ 978-3-030-11009-3_12 13. Rasouli, A., Kotseruba, I., Tsotsos, J.K.: Pedestrian action anticipation using contextual feature fusion in stacked RNNs. arXiv preprint arXiv:2005.06582 (2020) 14. Rasouli, A., Yau, T., Rohani, M., Luo, J.: Multi-modal hybrid architecture for pedestrian action prediction. In: 2022 IEEE Intelligent Vehicles Symposium (IV), pp. 91–97. IEEE (2022) 15. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. CoRR abs/1804.02767 (2018). http://arxiv.org/abs/1804.02767

290

X. Zhai et al.

16. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y 17. Saleh, K., Hossny, M., Nahavandi, S.: Real-time intent prediction of pedestrians for autonomous ground vehicles via spatio-temporal DenseNet. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 9704–9710. IEEE (2019) 18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 19. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems, vol. 28 (2015) 20. Wang, C., Wang, Y., Xu, M., Crandall, D.: Stepwise goal-driven networks for trajectory prediction. IEEE Robot. Autom. Lett. 7(2), 2716–2723 (2022) 21. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, 17–20 September 2017, pp. 3645–3649. IEEE (2017). https://doi.org/10.1109/ICIP.2017.8296962 22. Yao, Y., Atkins, E., Roberson, M.J., Vasudevan, R., Du, X.: Coupling intent and action for pedestrian crossing behavior prediction. arXiv preprint arXiv:2105.04133 (2021) 23. Yau, T., Malekmohammadi, S., Rasouli, A., Lakner, P., Rohani, M., Luo, J.: GraphSIM: a graph-based spatiotemporal interaction modelling for pedestrian action prediction. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 8580–8586. IEEE (2021) 24. Zhao, H., et al.: TNT: target-driven trajectory prediction. arXiv preprint arXiv:2008.08294 (2020)

Action Representing by Constrained Conditional Mutual Information Haoyuan Gao1 , Yifaan Zhang1,3(B) , Linhui Sun1,2 , and Jian Cheng1,3 1

2

Institute of Automation, Chinese Academy of Sciences, Beijing, China {gaohaoyuan2019,sunlinhui2018}@ia.ac.cn, {yfzhang,jcheng}@nlpr.ia.ac.cn School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 3 AIRIA, Guangzhou, China Abstract. Contrastive learning achieves a remarkable performance for representation learning by constructing the InfoNCE loss function. It enables learned representations to describe the invariance in data transformation without labels. Contrastive learning also been employed in self-supervised learning of action recognition. However, this kind of method fails to introduce assumptions according to human knowledge about the prior distribution of representations in the training process. For solving this problem, this paper proposes a self-supervised learning framework, which can achieve different self-supervised learning methods by choosing different assumptions about the prior distribution of representations, while still learning the description of invariance in data transformation as contrastive learning. This framework minimizes the CCMI (Constrained Conditional Mutual Information) loss function, which represents the conditional mutual information between input augmented samples of the same sample and the output representations of the encoder while the prior distribution of representations is constrained. By theoretical analysis of the framework, it is proved that traditional contrastive learning by InfoNCE is a special case without human knowledge constraint of this framework. The Gaussian Mixture Model on Unit Hyper-sphere is chosen as the representation prior distribution to achieve the self-supervised method called CoMInG. Compared with the existing methods, the performance of the learned representation by this method in the downstream task of action recognition is significantly improved.

1 Introduction Action recognition is widely used in video surveillance, human-computer interaction, and video understanding. The methods of action recognition include RGB imagebased, depth image-based and skeleton-based methods. Recently, skeleton-based methods have attracted increasing attention due to their lower computation consumption and higher robustness against viewpoint variations and noisy backgrounds than the methods using RGB images and depth images [22]. Various skeleton-based works were proposed and achieved significant performance in recent years [6] [29] [32]. However, most of them utilize the supervised learning paradigm to learn action representations, which require massive labeled samples for training. This leads to some problems, including the high cost of labeling and the risk of mislabeling due to the high inter-class similarity of the actions. In addition, massive valuable information for learning implied c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 291–306, 2023. https://doi.org/10.1007/978-3-031-26316-3_18

292

H. Gao et al.

in unlabeled data is not utilized in the training. Therefore, self-supervised learning of action representation attracted increasing attention. Self-supervised learning is a type of unsupervised representation learning that creates learning target without human annotation to train the encoder network to obtain effective representation for downstream tasks. Only a few works focus on selfsupervised learning of action recognition [21] [25] [33]. The best performer [21] of them used contrastive learning for action representing, constructing the InfoNCE loss function to enable learned representations to describe the invariance in data transformation. However, such contrastive learning methods fail to introduce assumptions according to human knowledge about the prior distribution of representations in the training process. For solving this problem, this paper proposes a self-supervised learning framework, which can achieve different self-supervised learning methods by choosing different assumptions about the prior distribution of representations, while still learning the description of invariance in data transformation as contrastive learning. This framework minimizes the CCMI (Constrained Conditional Mutual Information) loss function, which represents the conditional mutual information between input augmented samples of the same sample and the output representations of the encoder while the prior distribution of representations is constrained. By theoretical analysis of the framework, it is proved that traditional contrastive learning by InfoNCE is a special case without human knowledge constraint of this framework. In this paper, the Gaussian Mixture Model on Unit Hyper-sphere is chosen as the representation prior distribution to introduce a self-supervised method named CoMInG (Constrained Conditional Mutual Information Minimizing with Gaussian Mixture Model on Unit Hyper-sphere as Representation Prior), and employ it for action representing of skeleton-based action recognition. Our contributions can be summarized as follows. – This paper proposes a self-supervised learning framework by minimizing CCMI, which can achieve different self-supervised learning methods by choosing different assumptions about the prior distribution of representations, while still learning the description of invariance in data transformation as contrastive learning. – By theoretical analysis of the framework, it is proved that traditional contrastive learning by InfoNCE is a special case without human knowledge constraint of this framework. This conclusion enhances the theoretical credibility of the proposed framework. – This paper proposes a self-supervised learning method CoMInG, by assuming the prior of representations as a Gaussian Mixture Model on Unit Hyper-sphere, and comprehensively evaluates the effectiveness of CoMInG on three public datasets: NTU RGB+D 60, NTU RGB+D 120 and SBU datasets. Under the linear evaluation protocol, the proposed CoMInG achieves the best performance than existing selfsupervised learning methods. The rest of this paper is organized as follows: Sect. 2 introduces the previous works related to the proposed work. Section 3 introduces the details of CCMI loss and CoMInG method. Section 4 proves the relationship between InfoNCE and CCMI. Section 5

Action Representing by Constrained Conditional Mutual Information

293

compares CoMInG with existing baselines. Ablation studies are also provided in Sect. 6. The conclusion of the proposed work is shown in Sect. 6.

2 Related Work 2.1 Self-Supervised Learning Self-supervised learning aims to learn an effective representation from unlabeled data by using a self-supervised proxy task to train an encoder network. The learned representation can be transferred and beneficial to the downstream tasks [12]. Self-supervised learning methods include generative-based methods, contrastivebased methods, and clustering-based methods [18]. Generative-based methods employ an encoder-decoder structure or generative adversarial network to learn the representation. For example, [14] proposed a method to use an autoencoder as a generator with a discriminator for the automatic colorization of images. For the clustering-based methods, [2] presented DeepCluster, which iteratively groups the features with a standard clustering algorithm, K-means, and uses the subsequent assignments as supervision to update the weights of the encoder network. [1] proposed a method obtained by maximizing the information between labels and input data indices, using a fast variant of the Sinkhorn-Knopp algorithm. [3] proposed a framework both using the contrastivebased and clustering-based method, SwAV, which employs the Sinkhorn-Knopp algorithm to cluster the data and uses the cluster codeID of the other augmentations fromof the same image to guide the representations. PCL combined MoCo with an off-line kmeans clustering process to propose the ProtoNCE loss [15]. Most of the state-of-theart self-supervised learning methods are contrastive-based methods. Contrastive multiview coding (CMC) enforced the different views of the same image close to each other [26]. Momentum contrastive (MoCo) improved contrastive learning by introducing a momentum encoder and a queue-based memory bank [10]. Chen et al. proposed SimCLR, which adds a projector network behind the encoder and redesigns a stronger augmentation strategy for contrastive learning [4]. [8] introduced BYOL, which relies on two neural networks, referred to as online network and target network. It trains the online network to predict the target network representation of the same image under a different augmented view and updates the target network with a slow-moving average of the online network. Barlow Twins maximized the similarity between the correlation matrix of the output representations and the identity matrix, avoiding the complete collapse that is likely to appear in contrastive learning [31]. 2.2 Action Recognition Traditional skeleton-based methods recognized the pattern of action by designing handcrafted descriptors, such as the method proposed by Oreifej et al. which learns features by a modified histogram of oriented gradients (HOG) algorithm [20]. Due to the significant development of deep learning, numerous methods for supervised skeleton-based action recognition by the deep neural network were proposed, such as Directed Graph

294

H. Gao et al.

Neural Network (DGNN) proposed by Shi et al. [24], DeCoupling Graph Convolutional Networks (DC-GCN) proposed by Cheng et al. [5] and Spatial-temporal Graph Convolutional Networks (STGCN) proposed by Yan et al. [29]. Only a few self-supervised methods were proposed in recent years. [33] proposed a method by both using an encoder-decoder structure and a generative adversarial structure to reconstruct a masked input sequence. [25] forced the encoder to learn the action representation by using an autoencoder to re-generate the skeleton sequence, and additionally proposed a decoder-weakening mechanism by fixing the decoder weights or decoder states. [16] proposed to integrate multiple self-supervised tasks, that are motion prediction, jigsaw puzzle recognition, and contrastive learning, to learn more general representations. [21] maximized the similarity between augmented instances of the same input skeleton sequence by a queue-based memory bank and momentum encoder. [28] proposed a self-supervised framework, which not only reconstructs sequence by an autoencoder but also regards the K-means clustering results of action representations as pseudo labels to train the encoder. Contrastive learning by InfoNCE has best performance in self-supervised action recoginition [21], but it cannot introduce assumptions according to human knowledge about the prior distribution of representations in the training process.

3 Methodology 3.1

Traditional Contrastive Learning by Minimizing InfoNCE

InfoNCE is a widely used loss function in self-supervised learning, which enables learned representations to describe the invariance in data transformation, that is: I

LN CE = −

1 exp (sim(z (i,1) , z (i,2) )) log I (i,1) , z (j,2) )) I i=1 j=1 exp (sim(z

(1)

I is the size of the dataset. z (i,1) and z (i,2) are respectively the representations of two randomly augmented version of ith sample extracted by the encoder network [4]. sim(a, b) measures the similarity between two variables. The self-supervised methods using InfoNCE are called contrastive learning methods. Although contrastive learning achieves a remarkable performance for representation learning in multiple applications, it fails to introduce assumptions according to human knowledge about the prior distribution of representations in the training process. Therefore, this paper proposes the CCMI loss to solve this problem, and we will prove InfoNCE is a special case of CCMI in Sect. 4. 3.2

A Learning Framework by Minimizing CCMI

This subsection proposes a simple framework for self-supervised learning by minimizing loss function called CCMI (Constrained Conditional Mutual Information). This framework can introduce various different self-supervised learning methods by different choices of representation prior according to human knowledge, while still learning the description of invariance in data transformation as contrastive learning.

Action Representing by Constrained Conditional Mutual Information

295

Let us consider one dataset X = {xi }Ii=1 consisting of I i.i.d. samples of some continuous variable x, which is a skeleton sequence in our task. We assume that each sample is generated by a random process including an unobserved continuous random variable z. In addition, a variable v denotes the augmented sample for the sample x after some random data augmentation. Specifically, the generating process of x and v can be divided into three steps: (1) a value z i is sampled from distribution pθ∗ (z); (2) a value xi is sampled from distribution pθ∗ (x|z); (3) a value v i is sampled from distribution pθ∗ (v|x, z). We assume that the pθ∗ (z), pθ∗ (x|z) and pθ∗ (v|x, z) come from parametric families of distributions pθ (z), pθ (x|z) and pθ (v|x, z). Based on the used data augmentation strategy, we can reasonably assume that for any xi and xj (i = j), v i = v j will never happen if v i and v j are respectively sampled from pθ∗ (v|xi ) and pθ∗ (v|xj ). Therefore, if g(v) is a function that can find the only sample x corresponding to the augmented sample v, the conditional distriis pθ∗ (x|v) = 1 if x = g(v) else 0. Then we can obtain that bution pθ∗ (x|v)  pθ∗ (z|v) = pθ∗ (x|v)pθ∗ (z|x, v)dx = pθ∗ (z|x, v). Because we wish that the information of the representation vector z is only affected by the semantic information in the sample x that does not change with the data augmentation, the learning target is that when x is known, v and z have conditional independence, that is, pθ (v|x, z) = pθ (v|x). Therefore, the learning target we set is: θ∗ = arg min KL(pθ (v|x, z)pθ (v|x)) θ

= arg min θ

= arg min θ

= arg min θ

= arg min θ

= arg min θ

E

[log

pθ (v|x, z) ] pθ (v|x)

E

[log

pθ (v|x, z)pθ (z, x) ] pθ (v|x)pθ (z, x)

E

[log

pθ (z|x, v)pθ (v|x)pθ (x) ] pθ (v|x)pθ (z|x)p(x)

E

[log

pθ (z|v)pθ (v|x) ] pθ (v|x)pθ (z|x)

pθ (z,x,v)

pθ (z,x,v)

pθ (z,x,v)

pθ (z,x,v)



 pθ (z, v|x) log pθ (x)[

pθ (z, v|x) dzdv]dx pθ (v|x)pθ (z|x)

= arg min I(Z; V |X)

(2)

θ

where I(Z; V |X) is the conditional mutual information between z and v, while x is known. When I(Z; V |X) reaches its minimum value 0, pθ (v|x, z) = pθ (v|x). However, only minimizing I(Z, V |X) can easily achieve the trivial solution, that is, whatever the value of v and x are, pθ (z|v, x) is the same. A simple way to solve this problem is to inject our human knowledge into the hypothesis of prior distribution pθ (z). We can choose a distribution qφ (z) based on the human knowledge, and then set the optimization problem as: θ∗ , φ∗ = arg min I(Z; V |X) θ,φ

s.t. KL(pθ (z)qφ (z)) = 0

(3)

296

H. Gao et al.

Thanks to the constraint KL(pθ (z)qφ (z)) = 0, the trivial solution is avoided. For transform the optimization problem to a loss function for encoder training, we need to reform it. The Monte-Carlo Estimate [9] shows that:  E [f (x)] =

p(x)

N 1  p(x)f (x)dx ≈ f (xi ) N i=1

(4)

where, xi is sampled from p(x). Based on the Lagrangian multiplier method and Monte-Carlo Estimate, the above optimization problem is equal to: θ∗ , φ∗ , λ∗ = arg min I(Z; V |X) + λ θ,φ,λ



pθ (x)pθ (v|x)pθ (z|v)[log

= arg min θ,φ,λ

E

pθ (x,v|z)



= arg min

[KL(pθ (z)qφ (z))]

pθ (z) pθ (z|v) + λ log ]dzdvdx pθ (z|x) qφ (z)

pθ (x)pθ (v|x)pθ (z|v)[log pθ (z|v) − log pθ (z|x)

θ,φ,λ

+ λ log pθ (z) − λ log qφ (z)]   = arg min pθ (x)pθ (v|x)pθ (z|v)[log pθ (z|v) − log ( pθ (v|x)pθ (z|v)dv) θ,φ,λ  + λ log ( pθ (x)pθ (v|x)pθ (z|v)dvdx) − λ log qφ (z)]dzdvdx = arg min θ,φ,λ

+ λ log

I N L M  1  [− log pθ (z (i,m,l) |v (i,n) ) IM L i=1 m=1 n=1

I  N 

l=1

pθ (z (i,m,l) |v (j,n) ) − λ log qφ (z (i,m,l) ) +

j=1 n=1

1 H(pθ (z|v (i,m) )) − log N ] L (5)

This loss function is called as CCMI (Constrained Conditional Mutual Information), where the x(i) , v (i,m) and z (i,m,l) are sampled from pθ (x), pθ (v|x(i) ) and pθ (z|v (i,m) ) respectively. I denotes the size of the dataset. Both M and N denote the number of times that sample v. L denotes the number of times that sample z. H(pθ (z|v)) denotes the entropy of pθ (z|v). The illustration of the framework by minimizing CCMI loss is shown in Fig. 1. 3.3

A Self-Supervised Method: CoMInG

In this subsection, we choose the Gaussian Mixture Model on Unit Hyper-sphere as the representation prior of CCMI to introduce a novel self-supervised method named CoMInG (Constrained Conditional Mutual Information Minimizing with Gaussian Mixture Model on Unit Hyper-sphere as Representation Prior). For the training of encoder, the distribution pθ (z|v) and qφ (z) need to be determined. Following the paper of Variational Auto-Encoder [13], the pθ (z|v) is set as: pθ (z|v) = N (fη (v), I)

(6)

Action Representing by Constrained Conditional Mutual Information

297

Fig. 1. Illustration of the proposed method.

where the fη denotes our encoder network with the parameter η. The N (μ, S) denotes a multivariate Gaussian distribution with μ as the mean vector and S as the covariance matrix. In addition, we set the covariance matrix of the Gaussian distribution as identity matrix I. The reason of this setting is, in CCMI loss, the fourth part, the entropy of pθ (z|v), needs to be minimized, which means the variance of Gaussian distribution needs to be minimized. For this target, we can primarily set the variance of Gaussian distribution as a small constant and we don’t need to optimize this part in the training process. The qφ (z) is determined as a GMM (Gaussian Mixture Model) on Unit Hypersphere, due to the category information included in representation. We assume that each category match one Gaussian distribution in the GMM. We regard z as a sampling from one of the categories c, that is: 

qφ (z) =

q(c)q(z|c) =

c∈C

1 K

=

K 1  N (wk , I) K k=1

K 

1

k=1

2π 2

D

exp (−

(z − wk )T (z − wk ) ) 2

(7)

We assume that for any k, q(ck ) = 1/K. I denotes the identity matrix, and D denotes the dimension of z. W = {wk }K k=1 are K vectors that denote the mean vectors of the K Gaussian distributions. These mean vectors need to be optimized, which are same as the parameter η of the encoder network. However, previous works prove that for contrastive learning, employing cosine similarity is better than using Euclidean distance in experiments [4], so we replace Euclidean distance in qφ (z) with cosine similarity, that is: qφ (z) =



q(c)q(z|c) =

c∈C T

K 1  1 sim(z, wk ) ) D exp ( K 2 2 k=1 2π

(8)

k

z w where sim(z, wk ) = zw k .  However, due to the replacement, q(z|c)dc = 1 is no longer available. Therefore, q(z|c) need to be normalized, that is:

qφ (z) =

 c∈C

q(c)q(z|c) =

K sim(z, wk ) 1  1 exp ( ) D K 2 2 k=1 2π S

(9)

298

H. Gao et al.

 1  sim(z,wk ) where S = )dz. This makes q(z|c)dc = 1 available again. D exp ( 2 2π 2 Here Gaussian Mixture Model is transformed to the Gaussian Mixture Model on Unit Hyper-sphere. Other distributions can be chosen as qφ (z) based on human knowledge, such as uniform distribution, exponential distribution, etc. According to the conclusion in f-gan [19], variational inference can be utilized to introduce all kinds of distribution as the target of prior. According to the choice of pθ (z|v) and qφ (z), the loss function is: LCCM I =

L I N M  1  [− log pθ (z (i,m,l) |v (i,n) ) IM L i=1 m=1 n=1 l=1

+ λ log

I  N 

pθ (z (i,m,l) |v (j,n) )

j=1 n=1

1 H(pθ (z|v (i,m) )) − log N ] L M  I  N L   [− log exp (sim(z (i,m,l) , fη (v (i,n) )))

− λ log qφ (z (i,m,l) ) + =

1 IM L

+ λ log

i=1 m=1 l=1

I  N 

n=1

exp (sim(z (i,m,l) , fη (v (j,n) )))

j=1 n=1

− λ log

K  1 exp (sim(z (i,m,l) , w(k) )) + constant] S

(10)

k=1

where D is the dimension of representations. 3.4

Details of Training Process

Following previous contrastive learning works [4] [10] and for fair comparison, in our experiments, we set M = N = 1. According to the result in the paper of Variational Auto-Encoder [13], we set L = 1. In addition, according to the ablation study, we set λ = 1. Based on this setting, when the constant term is omitted, the CCMI loss function becomes: LCCM I =

I I  1 [−sim(z (i,1) , z (i,2) ) + log exp (sim(z (i,1) , z (j,2) )) I i=1 j=1

− log

K 

exp (sim(z (i,1) , w(k) ))]

(11)

k=1

where z (i,1) is the representation vector of the augmented sample obtained by x after the first random data augmentation through the encoder, and z (i,2) is the representation vector of the augmented sample obtained by x after the second random data augmentation. Then the first and second parts of CCMI can be optimized directly by gradient

Action Representing by Constrained Conditional Mutual Information

299

descent, but the third part of CCMI loss cannot. Thus we minimize the third part by EM algorithm. The optimization problem of the third part of CCMI loss is: η ∗ , φ∗ = arg min log qφ (z) = arg min log qφ (fη (v)) η,φ

η,φ

(12)

It is hard to optimize this function directly, so we use a surrogate function to higherbound it:   qφ (fη (v), c) log qφ (fη (v)) = log qφ (fη (v), c) = log qφ (c|fη (v)) qφ (c|fη (v)) c∈C



 c∈C

c∈C

qφ (fη (v), c) qφ (c|fη (v)) log qφ (c|fη (v))

(13)

Then the E-step and M-step can be obtained: – E-Step qφ(t) (c|fη( t) (v)) = qφ(t) (c, fη(t) (v))/qφ(t) (fη(t) (v))

(14)

– M-Step η (t+1) , φ(t+1) = arg min η,φ



qφ(t) (c|fη(t) (v)) log (qφ (fη (v)|c))

(15)

c∈C

The pseudo-code of complete training process is shown in Algorithm 1. The parameter τ denotes a temperature parameter, which is widely used in previous contrastive learning methods to adjust the loss function [4]. The E-step is the calculation of γ ik in the training process, and the M-step is the gradient descent in the training process.

4 Relationship Between CCMI Loss and InfoNCE Loss InfoNCE is a widely used loss function in previous self-supervised methods such as CPC, SimCLR and MoCo. This section proves that InfoNCE is a special case of CCMI. In CoMInG, we set pθ (z|v) as Gaussian distribution and set λ = 1, M = N = 1 and L = 1 for CCMI. If we change the choice of qφ (z), and set that qφ (z) = pθ (z|v), which means we have no human knowledge about prior distribution and set no limitation for it, then the CCMI loss function is: LCCM I =

L I N M  1  [− log Pθ (z (i,m,l) |v (i,n) ) IM L i=1 m=1 n=1 l=1

+ λ log

I  N 

Pθ (z (i,m,l) |v (j,n) ) − λ log qφ (z (i,m,l) )

j=1 n=1

1 + H(pθ (z|v (i,m) )) − log N ] L

300

H. Gao et al.

Algorithm 1. CoMInG Input: initialized encoder parameters η, initialized mean vectors for Gaussian Mixture Model W = {wk }K k=1 , batch size B, learning rate α, temperature parameter τ , augmentation strategy T , similarity measurement sim(a, b) 1: for all minibatch in one epoch do 2: Sample augmentation t1 and t2 from T 3: for sampled minibatch {xi }B i=1 do 4: for sampled minibatch {xj }B j=1 do 5: v (i,1) = t1 (x(i) ),v (i,2) = t2 (x(i) ), v (j,2) = t2 (x(j) ) 6: z (i,1) = fη (v (i,1) ),z (i,2) = fη (v (i,2) ),z (j,2) = fη (v (j,2) ) 7: for k = 1 : K do (sim(z (i,1) ,w(k) )) 8: γ ik = Kexpexp (sim(z (i,2) ,w(l) )) l=1

9: end for 10: end for 11: end for  (i,1) , z (i,2) )/τ 12: Lccmi = − B1 [ B i=1 sim(z B (i,1) , z (j,2) )/τ )] 13: + log j=1 exp (sim(z K 14: − log k=1 γ ik exp (sim(z (i,1) , w(k) )/τ )] ccmi ccmi , η = η − α ∂L∂η 15: W = W − α ∂L∂W 16: end for 17: return encoder parameterss η

L I N M  1  [− log Pθ (z (i,m,l) |v (i,n) ) IM L i=1 m=1 n=1 l=1  N I  1 pθ (z|v (i,m) ) log pθ (z|v (i,m) )dz + log Pθ (z (i,m,l) |v (j,n) ) − L j=1 n=1

=

1 H(pθ (z|v (i,m) )) − log N ] L I N L M  1  [− log Pθ (z (i,m,l) |v (i,n) ) = IM L i=1 m=1 n=1

+

l=1

+ log

I  N 

Pθ (z (i,m,l) |v (j,n) ) + constant]

j=1 n=1

=

L I N M  1  [− log exp (sim(z (i,m,l) , fη (v (i,n) ))) IM L i=1 m=1 n=1 l=1

+ log

I  N 

exp (sim(z (i,m,l) , fη (v (j,n) ))) + constant]

j=1 n=1 I

=−

1 exp (sim(z (i,1) , z (i,2) )) + constant log I (i,1) , z (j,2) )) I i=1 j=1 exp (sim(z

(16)

Action Representing by Constrained Conditional Mutual Information

301

This loss function equals to InfoNCE, so that InfoNCE is a special case of CCMI when we set λ = 1, M = N = 1, L = 1, pθ (z|v) = N (fη (v), I) and qφ (z) = pθ (z|v). This conclusion reveals that the essence of minimizing InfoNCE is to minimize the constrained conditional mutual information without injecting human knowledge to constrain the representation prior distribution.

5 Experiments 5.1 The Setting of Experiments For evaluation, we conduct our experiments on commonly used three datasets: NTU60 dataset [23] (56578 samples, 60 categories), NTU120 dataset [17] (113945 samples, 120 categories) and SBU dataset [30] (282 samples, 8 categories). In the experiments, the sequence length is set to 150, 150, and 40 for NTU60, NTU120, and SBU, respectively. The coordinate of the middle spine joint is subtracted by the coordinates of all joints for normalizing the skeleton sequences. Our encoder network adopts the LSTM with 512 hidden units. In self-supervised pre-training, the encoder is trained by CoMInG. The batch size is 32, 32, and 128 for NTU60, NTU120, and SBU respectively. The network is trained by the SGD optimizer. The weight decay and momentum are set to 1e-4 and 0.9, respectively. We run the pre-training process for 60 epochs and the learning rate is multiplied by 0.1 per 30 epochs with 0.01 as the initialization. According to the results of ablation studies, we set the temperature τ as 0.06. The number of Gaussian of representation prior to 120, 150, and 30 for NTU60, NTU120, and SBU datasets respectively. The random data augmentations used in pre-training are ’Axis2Zero’ and ’Shear’. For each joint in each sample, ’Axis2Zero’ randomly chooses one of the axes of the 3D coordinates of the joint and changes it to zero, and the ’Shear’ augmentation replaces each joint in a fixed direction [21]. After self-supervised pre-training, we use the linear evaluation to test our method. Specifically, the pre-trained encoder network by self-supervised learning is attached to a linear classifier, and we train the linear classifier for 90 epochs using skeleton sequences and labels in the training set while the parameter of the encoder is frozen. No augmentation is adopted in the training of the linear classifier. Then the Top-1 accuracy on the testing set is used to evaluate the effectiveness of the representations. The optimizer for training is stochastic gradient descent with a Nesterov momentum set as 0.9. The initialization of the learning rate is 1, and the learning rate decays at 15, 35, 60, and 75 epochs by 0.5. 5.2 Comparison with Existing Methods Table 1, Table 2 and Table 3 compare the results of our approach with supervised approaches (using RNN as the backbone) and previous self-supervised approaches on SBU, NTU60 and NTU120 datasets in the linear evaluation setting respectively. The compared self-supervised approaches include all state-of-the-art approaches in this domain. “ ” represents that we use the code shared by authors of original papers to obtain

302

H. Gao et al.

Table 1. Comparison with supervised, and self-supervised methods on SBU dataset. Bold numbers refer to the best performers. ID Method

Fold 1 2

3

4

5

Avg

Supervised 1

RNN

40.0 42.3 26.8 27.8 35.4 34.5

2

GRU

40.0 40.4 28.6 33.3 40.0 36.5

3

LSTM

49.1 53.2 37.5 42.0 53.8 47.1

4

*P&C FW [25]

16.4 15.4 21.4 27.8 15.6 19.3

5 6

*PCRP [28] ASCAL [21]

16.4 36.5 21.4 24.1 29.7 25.6 52.7 46.2 41.1 31.5 41.5 42.6

7

InfoNCE

61.8 53.8 46.4 50.0 53.1 53.0

8

Ours

65.5 63.5 51.8 61.1 53.1 59.0

Self-supervised

these results, because in the original papers authors don’t show the performance of the methods on these datasets. “ ” represents that this method is coded by us, using the setting of our method. The results in Table 1 show that our method achieves significant improvement over previous self-supervised approaches and supervised baselines on all testing folds on SBU datasets. The results in Table 2 and Table 3 show that our method outperforms all self-supervised approaches and supervised approaches with RNN backbone on both cross-view and cross-subject settings of NTU60 and both cross-set and cross-subject settings of NTU120. It is worth noting that our method outperforms the traditional contrastive learning with InfoNCE on both SBU (ID=7) and NTU60 (ID=10) datasets. 5.3

Ablations

Hyperparameters. Table 4 shows the performance with different λ on three datasets. λ measures the weight of the constraint of pθ (z) in the loss function. The training process will pay more attention on the KL(pθ (z)qφ (z)) and less attention on I(Z; V |X) when λ becomes larger. The results show that it performs best when λ = 1, CoMInG achieves the best result. In Table 5, the results show that when τ set as 0.06, the best performance is obtained. In Table 6, we show the performance with different batch size on three datasets. The best batch size of NTU60 and SBU is 32 and 128 respectively. An interesting phenomenon is, for specific training epochs (here is 60), for large-scale dataset NTU60, smaller batch size has a significant advantage over the larger ones, and for small-scale datasets SBU, larger batch size has a significant advantage over the smaller ones. This phenomenon was also shown in previous paper [21]. The Setting of Prior Distribution. In CoMInG, the GMM (Gaussian Mixture Model) on unit hyper-sphere is chosen as qφ (z). Here we try different numbers of Gaussian

Action Representing by Constrained Conditional Mutual Information

303

Table 2. Comparison with supervised, and Table 3. Comparison with supervised, and self-supervised methods on NTU60 dataset. self-supervised methods on NTU120 dataset. Bold numbers refer to the best performers. Bold numbers refer to the best performers. ID Method

CView CSub

ID Method

CSet

Acc(%) Acc(%) Supervised

Supervised

1

Lie Group [27]

52.8

50.1

2

HBRNN [7]

64.0

59.0

3

Deep RNN [17]

64.1

56.3

53.7

\

3

39.1

4

1

Soft RNN [11]

44.9

36.3

2

PA LSTM [23]

26.3

25.5

P&C FW [25]

42.7

41.7

PCRP [28]

45.1

41.7

Self-Supervised

Self-Supervised 4

CSub

Acc(%) Acc(%)

*PCL [28]

5

LongT GAN [33] 48.1

6

P&C FW [25]

44.3

50.8

5

ASCAL [21]

49.2

48.3

7

MS2L [16]

\

52.6

6

Ours

50.7

49.4

8

PCRP [28]

63.5

53.9

9

ASCAL [21]

63.6

58.0

10 InfoNCE

61.0

56.9

11 Ours

69.4

59.8

Table 4. Performances for using different λ on two datasets for training 60 epochs. For SBU and NTU60, the results are for fold1 and crossview setting respectively.

Dataset τ

Dataset λ Acc(%) 1

Table 5. Performances for using different τ on two datasets for training 60 epochs. For SBU and NTU60, the results are for fold1 and crossview setting respectively.

2

3

4

NTU60 69.4 65.6 57.1 42.8

Acc(%) 0.03 0.06 0.1 SBU

0.3

60.0 65.5 61.8 56.4

NTU60 67.3 69.4 63.6 55.8

Table 6. Performances for using different batch size on two datasets for training 60 epochs. For SBU and NTU60, the results are for fold1 and cross-view setting respectively.

Table 7. Performances for using different number of Gaussian on three datasets for training 60 epochs. For SBU, NTU60 and NTU120, the results are for fold1, cross-view setting and cross-set setting respectively.

Dataset batch size Acc(%) 32 SBU

64

128 256

23.6 56.4 65.5 58.2

NTU60 69.4 67.2 65.1 62.8

SBU

Num of Gauss

Acc(%)

10

Top1

61.8 61.8 65.5 60.0 60.0

20

NTU60

Num of Gauss 60

30

90

40

50

Acc(%)

30

Top1

65.4 67.4 68.2 69.4 67.9

120 150

NTU120 Num of Gauss Acc(%)

90

Top1

48.8 50.2 50.7 49.8 49.6

120 150 180 210

304

H. Gao et al.

for our mixture model as the prior distribution. The choices of best performances are shown in Table 7.

6 Conclusion We propose a framework for self-supervised learning by minimizing the constrained conditional mutual information between input augmented samples of the same sample and the output representations of the encode, which can achieve different selfsupervised learning methods by choosing different assumptions about the prior distribution of representations, while still learning the description of invariance in data transformation as contrastive learning. Theoretical analysis shows that contrastive learning by InfoNCE is a special case of the proposed framework without human knowledge constraint. Based on this framework, we introduce a self-supervised method by choosing the Gaussian Mixture Model on Unit Hyper-sphere as the prior distribution of representations, and employ it for unsupervised action representing of skeleton-based action recognition. Experimental results of the proposed method show significant improvement on various commonly used datasets for action recognition. Acknowledgements. This work was supported in part by NSFC 62273347, the National Key Research and Development Program of China (2020AAA0103402), Jiangsu Leading Technology Basic Research Project (BK20192004), and NSFC 61876182.

References 1. Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371 (2019) 2. Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149 (2018) 3. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882 (2020) 4. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020) 5. Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., Lu, H.: Decoupling GCN with DropGraph module for skeleton-based action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 536–553. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-58586-0_32 6. Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192 (2020) 7. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015) 8. Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020)

Action Representing by Constrained Conditional Mutual Information

305

9. Hammersley, J., Morton, K.: A new monte Carlo technique: antithetic variates. In: Mathematical proceedings of the Cambridge philosophical society, vol. 52, pp. 449–475. Cambridge University Press (1956) 10. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020) 11. Hu, J.F., Zheng, W.S., Ma, L., Wang, G., Lai, J., Zhang, J.: Early action prediction by soft regression. IEEE Trans. Pattern Anal. Mach. Intell. 41(11), 2568–2583 (2018) 12. Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2020) 13. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 14. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883 (2017) 15. Li, J., Zhou, P., Xiong, C., Hoi, S.C.: Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966 (2020) 16. Lin, L., Song, S., Yang, W., Liu, J.: MS2L: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2490–2498 (2020) 17. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+ D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684–2701 (2019) 18. Liu, X., et al.: Self-supervised learning: generative or contrastive. IEEE Trans. Knowl. Data Eng. (2021) 19. Nowozin, S., Cseke, B., Tomioka, R.: F-GAN: training generative neural samplers using variational divergence minimization. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 271–279 (2016) 20. Ohn-Bar, E., Trivedi, M.: Joint angles similarities and HOG2 for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 465–470 (2013) 21. Rao, H., Xu, S., Hu, X., Cheng, J., Hu, B.: Augmented skeleton based contrastive action learning with momentum LSTM for unsupervised action recognition. Inf. Sci. 569, 90–109 (2021) 22. Ren, B., Liu, M., Ding, R., Liu, H.: A survey on 3D skeleton-based action recognition using learning method. arXiv preprint arXiv:2002.05907 (2020) 23. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+ D: A large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016) 24. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7912–7921 (2019) 25. Su, K., Liu, X., Shlizerman, E.: Predict & Cluster: unsupervised skeleton based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9631–9640 (2020) 26. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 776–794. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_45 27. Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3D skeletons as points in a lie group. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595 (2014)

306

H. Gao et al.

28. Xu, S., Rao, H., Hu, X., Hu, B.: Prototypical contrast and reverse prediction: unsupervised skeleton based action recognition. arXiv preprint arXiv:2011.07236 (2020) 29. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018) 30. Yun, K., Honorio, J., Chattopadhyay, D., Berg, T.L., Samaras, D.: Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 28–35. IEEE (2012) 31. Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310– 12320. PMLR (2021) 32. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126 (2017) 33. Zheng, N., Wen, J., Liu, R., Long, L., Dai, J., Gong, Z.: Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition Lei Wang1,2

and Piotr Koniusz1,2(B)

1

Australian National University, Canberra, Australia 2 Data61/CSIRO, Sydney, Australia {lei.wang,piotr.koniusz}@data61.csiro.au Abstract. We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping which jointly models each smooth path between the query and support frames to achieve simultaneously the best alignment in the temporal and simulated camera viewpoint spaces for end-to-end learning under the limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone. We also include a setting with a transformer. Finally, we propose a similarity-based loss which encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We show state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.

1 Introduction Action recognition is arguably among key topics in computer vision due to applications in video surveillance [63, 65], human-computer interaction, sports analysis, virtual reality and robotics. Many pipelines [7, 18, 19, 30, 59, 64] perform action classification given the large amount of labeled training data. However, manually collecting and labeling videos for 3D skeleton sequences is laborious, and such pipelines need to be retrained or fine-tuned for new class concepts. Popular action recognition networks include twostream neural networks [18, 19, 71] and 3D convolutional networks (3D CNNs) [7, 59], which aggregate frame-wise and temporal block representations, respectively. However, such networks indeed must be trained on large-scale datasets such as Kinetics [7, 31, 66, 68] under a fixed set of training class concepts. Thus, there exists a growing interest in devising effective Few-shot Learning (FSL) for action recognition, termed Few-shot Action Recognition (FSAR), that rapidly adapts to novel classes given a few training samples [5, 14, 23, 47, 67, 73, 79]. However, FSAR for videos is scarce due to the volumetric nature of videos and large intra-class variations.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_19. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 307–326, 2023. https://doi.org/10.1007/978-3-031-26316-3_19

308

L. Wang and P. Koniusz

FSL for image recognition has been widely studied [3, 17, 20, 33, 34, 46] including contemporary CNN-based FSL methods [21, 29, 54, 57, 61, 76], which use metalearning, prototype-based learning and feature representation learning. Just in 2020– 2022, many FSL methods [5, 13, 15, 16, 22, 24, 32, 35, 37, 41, 42, 52, 58, 70, 78, 86] have been dedicated to image classification or detection [75, 77, 82–84]. Noteworthy mentioning is the incremental learning paradigm that can also tackle novel classes [51]. In this paper, we aim at advancing few-shot recognition of articulated set of connected 3D body joints.

Fig. 1. Our 3D skeleton-based FSAR with JEANIE. Frames from a query sequence and a support sequence are split into short-term temporal blocks X1 , ..., Xτ and X1 , ..., Xτ  of length M given stride S. Subsequently, we generate (i) multiple rotations by (Δθx , Δθy ) of each query skeleton by either Euler angles (baseline approach) or (ii) simulated camera views (gray cameras) by camera shifts (Δθaz , Δθalt ) w.r.t.the assumed average camera location (black camera). We pass all skeletons via Encoding Network (with an optional transformer) to obtain feature tensors Ψ and Ψ  , which are directed to JEANIE. We note that the temporal-viewpoint alignment takes place in 4D space (we show a 3D case with three views: −30◦ , 0◦ , 30◦ ). Temporally-wise, JEANIE starts from the same t = (1, 1) and finishes at t = (τ, τ  ) (as in DTW). Viewpoint-wise, JEANIE starts from every possible camera shift Δθ ∈ {−30◦ , 0◦ , 30◦ } (we do not know the true correct pose) and finishes at one of possible camera shifts. At each step, the path may move by no more than (±Δθaz , ±Δθalt ) to prevent erroneous alignments. Finally, SoftMin picks up the smallest distance.

With an exception of very recent models [38, 39, 44, 45, 48, 67], FSAR approaches that learn from skeleton-based 3D body joints are scarce. The above situation prevails despite action recognition from articulated sets of connected body joints, expressed as 3D coordinates, does offer a number of advantages over videos such as (i) the lack of the background clutter, (ii) the volume of data being several orders of magnitude smaller, and (iii) the 3D geometric manipulations of sequences being relatively friendly. Thus, we propose a FSAR approach that learns on skeleton-based 3D body joints via Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). As FSL is based on learning similarity between support-query pairs, to achieve good matching of queries with support sequences representing the same action class, we propose to simultaneously model the optimal (i) temporal and (ii) viewpoint alignments. To this end, we build on soft-DTW [11], a differentiable variant of Dynamic Time Warping (DTW) [10]. Unlike soft-DTW, we exploit the projective camera geometry. We assume that the best smooth path in DTW should simultaneously provide the best temporal and viewpoint alignment, as sequences that are being matched might have been captured under different camera viewpoints or subjects might have followed different trajectories.

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition

309

To obtain skeletons under several viewpoints, we rotate skeletons (zero-centered by hip) by Euler angles [1] w.r.t.x, y and z axes, or generate skeleton locations given simulated camera positions, according to the algebra of stereo projections [2]. We note that view-adaptive models for action recognition do exist. View Adaptive Recurrent Neural Networks [80, 81] is a classification model equipped with a viewadaptive subnetwork that contains the rotation and translation switches within its RNN backbone, and the main LSTM-based network. Temporal Segment Network [62] models long-range temporal structures with a new segment-based sampling and aggregation module. However, such pipelines require a large number of training samples with varying viewpoints and temporal shifts to learn a robust model. Their limitations become evident when a network trained under a fixed set of action classes has to be adapted to samples of novel classes. Our JEANIE does not suffer from such a limitation. Our pipeline consists of an MLP which takes neighboring frames to form a temporal block. Firstly, we sample desired Euler rotations or simulated camera viewpoints, generate multiple skeleton views, and pass them to the MLP to get block-wise feature maps, next forwarded to a Graph Neural Network (GNN), e.g., GCN [27], Fisher-Bures GCN [56], SGC [72], APPNP [28] or S2 GC [85, 87], followed by an optional transformer [12], and an FC layer to obtain graph-based representations passed to JEANIE. JEANIE builds on Reproducing Kernel Hilbert Spaces (RKHS) [53] which scale gracefully to FSAR problems which, by their setting, learn to match pairs of sequences rather than predict class labels. JEANIE builds on Optimal Transport [60] by using a transportation plan for temporal and viewpoint alignment in skeletal action recognition. Below are our contributions: i. We propose a Few-shot Action Recognition approach for learning on skeletonbased articulated 3D body joints via JEANIE, which performs the joint alignment of temporal blocks and simulated viewpoint indexes of skeletons between supportquery sequences to select the smoothest path without abrupt jumps in matching temporal locations and view indexes. Warping jointly temporal locations and simulated viewpoint indexes helps meta-learning with limited samples of novel classes. ii. To simulate different viewpoints of 3D skeleton sequences, we consider rotating them (1) by Euler angles within a specified range along x and y axes, or (2) towards the simulated camera locations based on the algebra of stereo projection. iii. We investigate several different GNN backbones (including transformer), as well as the optimal temporal size and stride for temporal blocks encoded by a simple 3-layer MLP unit before forwarding them to GNN. iv. We propose a simple similarity-based loss encouraging the alignment of withinclass sequences and preventing the alignment of between-class sequences. We achieve the state of the art on large-scale NTU-60 [50], NTU-120 [39], Kineticsskeleton [74] and UWA3D Multiview Activity II [49]. As far as we can tell, the simultaneous alignment in the joint temporal-viewpoint space for FSAR is a novel proposition.

2 Related Works Below, we describe 3D skeleton-based action recognition, FSAR approaches and GNNs.

310

L. Wang and P. Koniusz

Action Recognition (3D Skeletons). 3D skeleton-based action recognition pipelines often use GCNs [27], e.g., spatio-temporal GCN [74], an a-links inference model [36], shift-graph model [9] and multi-scale aggregation node [40]. However, such models rely on large-scale datasets, and cannot be easily adapted to novel class concepts. FSAR (Videos). Approaches [23, 47, 73] use a generative model, graph matching on 3D coordinates and dilated networks, respectively. Approach [88] uses a compound memory network. ProtoGAN [14] generates action prototypes. Model [79] uses permutationinvariant attention and second-order aggregation of temporal video blocks, whereas approach [5] proposes a modified temporal alignment for query-support pairs via DTW. FSAR (3D Skeletons). Few FSAR models use 3D skeletons [38, 39, 44, 45]. Global Con-text-Aware Attention LSTM [38] selectively focuses on informative joints. ActionPart Semantic Relevance-aware (APSR) model [39] uses the semantic relevance between each body part and action class at the distributed word embedding level. Signal Level Deep Metric Learning (DML) [45] and Skeleton-DML [44] one-shot FSL approaches encode signals into images, extract features using CNN and apply multisimilarity miner losses. In contrast, we use temporal blocks of 3D body joints of skeletons encoded by GNNs under multiple viewpoints of skeletons to simultaneously perform temporal and viewpoint-wise alignment of query-support in the meta-learning regime. Graph Neural Networks. GNNs are popular in the skeleton-based action recognition. We build on GNNs in this paper due to their excellent ability to represent graphstructured data such as interconnected body joints. GCN [27] applies graph convolution in the spectral domain, and enjoys the depth-efficiency when stacking multiple layers due to non-linearities. However, depth-efficiency costs speed due to backpropagation through consecutive layers. In contrast, a very recent family of so-called spectral filters do not require depth-efficiency but apply filters based on heat diffusion to the graph Laplacian. As a result, they are fast linear models as learnable weights act on filtered node representations. SGC [72], APPNP [28] and S2 GC [85] are three methods from this family which we investigate for the backbone. Multi-view Action Recognition. Multi-modal sensors enable multi-view action recognition [64, 80]. A Generative Multi-View Action Recognition framework [69] integrates complementary information from RGB and depth sensors by View Correlation Discovery Network. Some works exploit multiple views of the subject [39, 50, 69, 81] to overcome the viewpoint variations for action recognition on large training datasets. In contrast, our JEANIE learns to perform jointly the temporal and simulated viewpoint alignment in an end-to-end meta-learning setting. This is a novel paradigm based on similarity learning of support-query pairs rather than learning class concepts.

3 Approach To learn similarity/dissimilarity between pairs of sequences of 3D body joints representing query and support samples from episodes, our goal is to find a smooth joint viewpoint-temporal alignment of query and support and minimize or maximize the matching distance dJEANIE (end-to-end setting) for same or different support-query

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition

311

labels, respectively. Figure 2 (top) shows that sometimes matching of query and support may be as easy as rotating one trajectory onto another, in order to achieve viewpoint invariance. A viewpoint invariant distance [25] can be defined as:   dinv (Ψ , Ψ  ) = inf d γ(Ψ ), γ  (Ψ  ) , (1)  γ,γ ∈T

Fig. 2. (top) In viewpoint-invariant learning, the distance between query features Ψ and support features Ψ  has to be computed. The blue arrow indicates that trajectories of both actions need alignment. (bottom) In real life, subject’s 3D body joints deviate from one ideal trajectory, and so advanced viewpoint alignment strategy is needed.

Fig. 3. JEANIE (1-max shift). We loop over all points. At (t, t , n) (green point) we add its base distance to the minimum of accumulated distances at (t, t−1, n−1), (t, t−   n+1) (orange plane), (t−1, t− 1, n), (t, t−1,   1, n−1), (t−1, t −1, n), (t−1, t −1, n+1) (red plane) and (t−1, t, n−1), (t−1, t, n), (t−1, t, n+1) (blue plane).

where T is a set of transformations required to achieve a viewpoint invariance, d(·, ·) is some base distance, e.g., the Euclidean distance, and Ψ and Ψ  are features describing query and support pair of sequences. Typically, T may include 3D rotations to rotate one trajectory onto the other. However, such a global viewpoint alignment of two sequences is suboptimal. Trajectories are unlikely to be straight 2D lines in the 3D space. Figure 2 (bottom) shows that 3D body joints locally follow complicated non-linear paths. Thus, we propose JEANIE that aligns and warps query/support sequences based on the feature similarity. One can think of JEANIE as performing Eq. (1) with T containing camera viewpoint rotations, and the base distance d(·, ·) being a joint temporalviewpoint variant of soft-DTW to account for local temporal-viewpoint variations of 3D body joint trajectories. JEANIE unit in Fig. 1 realizes such a strategy (SoftMin operation is equivalent of Eq. (1)). While such an idea sounds simple, it is effective, it has not been done before. Figure 3 (discussed later in the text) shows one step of the temporalviewpoint computations of JEANIE. We present a necessary background on Euler angles and the algebra of stereo projection, GNNs and the formulation of soft-DTW in Appendix Sect. A. Below, we detail our pipeline shown in Fig. 1, explain the proposed JEANIE and our loss function. Notations. IK stands for the index set {1, 2, ..., K}. Concatenation of αi is denoted by [αi ]i∈II , whereas X:,i means we extract/access column i of matrix D. Calligraphic mathcal fonts denote tensors (e.g., D), capitalized bold symbols are matrices (e.g., D), lowercase bold symbols are vectors (e.g., ψ), and regular fonts denote scalars.

312

L. Wang and P. Koniusz

Encoding Network (EN). We start by generating K × K  Euler rotations or K × K  simulated camera views (moved gradually from the estimated camera location) of query skeletons. Our EN contains a simple 3-layer MLP unit (FC, ReLU, FC, ReLU, Dropout, FC), GNN, optional Transformer [12] and FC. The MLP unit takes M neighboring frames, each with J 3D skeleton body joints, forming one temporal block. In total, depending on stride S, we obtain some τ temporal blocks which capture the short temporal dependency, whereas the long temporal dependency is modeled with our JEANIE. Each temporal block is encoded by the MLP into a d×J dimensional feature map. Subsequently, query feature maps of size K ×K  ×τ and support feature maps of size τ  are forwarded to a GNN, optional Transformer (similar to ViT [12], instead of using image patches, we feed each body joint encoded by GNN into the transformer), and an     FC layer, which returns Ψ ∈ Rd ×K×K ×τ query feature maps and Ψ  ∈ Rd ×τ support feature maps. Feature maps are passed to JEANIE and the similarity classifier.

Fig. 4. A comparison of paths in 3D for soft-DTW, Free Viewpoint Matching (FVM) and our JEANIE. For a given support skeleton sequence (green color), we choose viewing angles between −45◦ and 45◦ for the camera viewpoint simulation. The support skeleton sequence is shown in black color. (a) soft-DTW finds each individual alignment per viewpoint fixed throughout alignment: dshortest = 4.08. (b) FVM is a greedy matching algorithm that in each time step seeks the best alignment pose from all viewpoints which leads to unrealistic zigzag path (person cannot jump from front to back view suddenly): dFVM = 2.53. (c) Our JEANIE (1-max shift) is able to find smooth joint viewpoint-temporal alignment between support and query sequences. We show each optimal path for each possible starting position: dJEANIE = 3.69. While dFVM = 2.53 for FVM is overoptimistic, dshortest = 4.08 for fixed-view matching is too pessimistic, whereas JEANIE strikes the right matching balance with dJEANIE = 3.69. 



Let support maps Ψ  be [f (X1 ; F), ..., f (Xτ  ; F)] ∈ Rd ×τ and query maps Ψ   be [f (X1 ; F), ..., f (Xτ ; F)] ∈ Rd ×K×K ×τ , for query and support frames per block X, X ∈ R3×J×M . Moreover, we define f (X; F ) = FC(Transf(GNN(MLP(X; FM LP ); FGN N ); FT ransf ); FF C ), F ≡ [FM LP , FGN N , FT ransf , FF C ] is the set of parameters of EN (note optional Transformer [12]). As GNN, we try GCN [27], SGC [72], APPNP [28] or S2 GC [85]. JEANIE. Matching query-support pairs requires temporal alignment due to potential offset in locations of discriminative parts of actions, and due to potentially different

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition

313

dynamics/speed of actions taking place. The same concerns the direction of the dominant action trajectory w.r.t.the camera. Thus, JEANIE, our advanced soft-DTW, has the transportation plan A ≡ Aτ,τ  ,K,K  , where apart from temporal block counts τ and τ  , for query sequences, we have possible ηaz left and ηaz right steps from the initial camera azimuth, and ηalt up and ηalt down steps from the initial camera altitude. Thus, K = 2ηaz +1, K  = 2ηalt +1. For the variant with Euler angles, we simply have A ≡ Aτ,τ  ,K,K  where K = 2ηx +1, K  = 2ηy +1 instead. Then, JEANIE is given as: dJEANIE (Ψ , Ψ  ) = SoftMinγ A, D(Ψ , Ψ  ) ,

(2)

A∈A





×τ ×τ ≡ [dbase (ψm,k,k , ψn )] (m,n)∈Iτ×Iτ  and D contains distances. where D ∈ RK×K + (k,k )∈IK×IK 

Figure 3 shows one step of JEANIE (1-max shift). Suppose the given viewing angle set is {−40◦ , −20◦ , 0◦ , 20◦ , 40◦ }. For 1-max shift, we loop over (t, t, n). At location (t, t, n), we extract the base distance and add it together with the minimum of aggregated distances at the shown 9 predecessor points. We store that total distance at (t, t, n), and we move to the next point. Note that for viewpoint index n, we look up (n−1, n, n+1). Extension to the ι-max shift is straightforward. Algorithm 1 illustrates JEANIE. For brevity, let us tackle the camera viewpoint alignment in a single space, e.g., for some shifting steps −η, ..., η, each with size Δθaz . The maximum viewpoint change from block to block is ι-max shift (smoothness). As we have no way to know the initial optimal camera shift, we initialize all possible origins of shifts in accumulator rn,1,1 = dbase (ψn,1 , ψ1 ) for all n ∈ {−η, ..., η}. Subsequently, a phase related to soft-DTW (temporal-viewpoint alignment) takes place. Finally, we choose the path with the smallest distance over all possible viewpoint ends by selecting a soft-minimum over [rn,τ,τ  ]n∈{−η,...,η} . Notice that accumulator  R ∈ R(2ι+1)×τ ×τ . Moreover, whenever either index n−i, t−j or t −k in rn−i,t−j,t−k (see algorithm) is out of bounds, we define rn−i,t−j,t−k = ∞. JEANIE. Matching query-support pairs requires temporal alignment due to potential offset in locations of discriminative parts of actions, and due to potentially different dynamics/speed of actions taking place. The same concerns the direction of the dominant action trajectory w.r.t.the camera. Thus, JEANIE, our advanced soft-DTW, has the transportation plan A ≡ Aτ,τ  ,K,K  , where apart from temporal block counts τ and τ  , for query sequences, we have possible ηaz left and ηaz right steps from the initial camera azimuth, and ηalt up and ηalt down steps from the initial camera altitude. Thus, K = 2ηaz +1, K  = 2ηalt +1. For the variant with Euler angles, we simply have A ≡ Aτ,τ  ,K,K  where K = 2ηx +1, K  = 2ηy +1 instead. Then, JEANIE is given as: dJEANIE (Ψ , Ψ  ) = SoftMinγ A, D(Ψ , Ψ  ) ,

(3)

A∈A





×τ ×τ ≡ [dbase (ψm,k,k , ψn )] (m,n)∈Iτ×Iτ  and D contains distances. where D ∈ RK×K + (k,k )∈IK×IK 

Figure 3 shows one step of JEANIE (1-max shift). Suppose the given viewing angle set is {−40◦ , −20◦ , 0◦ , 20◦ , 40◦ }. For 1-max shift, we loop over (t, t, n). At location (t, t, n), we extract the base distance and add it together with the minimum of

314

L. Wang and P. Koniusz

aggregated distances at the shown 9 predecessor points. We store that total distance at (t, t, n), and we move to the next point. Note that for viewpoint index n, we look up (n−1, n, n+1). Extension to the ι-max shift is straightforward. Algorithm 1 illustrates JEANIE. For brevity, let us tackle the camera viewpoint alignment in a single space, e.g., for some shifting steps −η, ..., η, each with size Δθaz . The maximum viewpoint change from block to block is ι-max shift (smoothness). As we have no way to know the initial optimal camera shift, we initialize all possible origins of shifts in accumulator rn,1,1 = dbase (ψn,1 , ψ1 ) for all n ∈ {−η, ..., η}. Subsequently, a phase related to soft-DTW (temporal-viewpoint alignment) takes place. Finally, we choose the path with the smallest distance over all possible viewpoint ends by selecting a soft-minimum over [rn,τ,τ  ]n∈{−η,...,η} . Notice that accumulator  R ∈ R(2ι+1)×τ ×τ . Moreover, whenever either index n−i, t−j or t −k in rn−i,t−j,t−k (see algorithm) is out of bounds, we define rn−i,t−j,t−k = ∞. FVM. To ascertain whether JEANIE is better than performing separately the temporal and simulated viewpoint alignments, we introduce a baseline called the Free Viewpoint Matching (FVM). FVM, for every step of DTW, seeks the best local viewpoint alignment, thus realizing non-smooth temporal-viewpoint path in contrast to JEANIE. To this end, we apply DTW in Eq. (3) with the base distance replaced by: dFVM(ψ t ,ψ   ) = t





 SoftMinγ¯ dbase (ψm,n,t , ψm  ,n ,t ), m,n,m ,n ∈{−η,...,η} 



(4)



where Ψ ∈ Rd ×K×K ×τ and Ψ  ∈ Rd ×K×K ×τ are query and support feature maps. We abuse the notation by writing dFVM(ψ t ,ψ   ) as we minimize over viewpoint indexes t



in Eq. (4). We compute the distance matrix D ∈ Rτ+×τ ≡ [dFVM (ψt , ψt  )](t,t )∈Iτ ×Iτ  . Figure 4 shows the comparison between soft-DTW (view-wise), FVM and our JEANIE. FVM is a greedy matching method which leads to complex zigzag path in 3D space (assuming the camera viewpoint single space in ψn,t and no viewpoint in ψt  ). Although FVM is able to find the smallest distance path compared to soft-DTW and JEANIE, it suffers from several issues (i) It is unreasonable for poses in a given sequence to match under sudden jumps in viewpoints. (ii) Suppose the two sequences are from two different classes, FVM still yields the smallest distance (decreased interclass variance). Loss Function. For the N -way Z-shot problem, we have one query feature map and N × Z support feature maps per episode. We form a mini-batch containing B episodes. Thus, we have query feature maps {Ψb }b∈IB and support feature maps   }b∈IB ,n∈IN ,z∈IZ . Moreover, Ψb and Ψb,1,: share the same class, one of N {Ψb,n,z ‡ classes drawn per episode, forming the subset C ≡ {c1 , ..., cN } ⊂ IC ≡ C. To be pre  ), ∀b ∈ IB , z ∈ IZ while y(Ψb ) = y(Ψb,n,z ), ∀b ∈ IB , n ∈ cise, labels y(Ψb ) = y(Ψb,1,z   IN \{1}, z ∈ IZ . In most cases, y(Ψb ) = y(Ψb ) if b = b and b, b ∈ IB . Selection of C ‡ per episode is random. For the N -way Z-shot protocol, we minimize:

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition

315

Algorithm 1. Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). Input (forward pass): Ψ , Ψ  , γ > 0, dbase (·, ·), ι-max shift. 1: r:,:,: = ∞, rn,1,1 = dbase (ψn,1 , ψ1 ), ∀n ∈ {−η, ..., η} 2: Π ≡ {−ι, ..., 0, ..., ι} × {(0, 1), (1, 0), (1, 1)} 3: for t ∈ Iτ : 4: for t ∈ Iτ  : 5: if t = 1 or t= 1: 6: for n ∈ {−η, ..., η}: 

rn,t,t = dbase (ψn,t , ψt  ) + SoftMinγ [rn−i,t−j,t−k ](i,j,k)∈Π   Output: SoftMinγ [rn,τ,τ  ]n∈{−η,...,η}



7:

 2 l(d+, d− ) = μ(d+ )−{μ(TopMinβ (d+ ))} 2  + μ(d− )−{μ(TopMaxNZβ (d− ))} ,

(5)

  where d+ = [dJEANIE (Ψb , Ψb,1,z )]b∈IB and d− = [dJEANIE (Ψb , Ψb,n,z )] z∈IZ

, b∈IB , n∈IN\{1},z∈IZ (6)

where d+ is a set of within-class distances for the mini-batch of size B given N -way Z-shot learning protocol. By analogy, d− is a set of between-class distances. Function μ(·) is simply the mean over coefficients of the input vector, {·} detaches the graph during the backpropagation step, whereas TopMinβ (·) and TopMaxNZβ (·) return β smallest and N Zβ largest coefficients from the input vectors, respectively. Thus, Eq. (5) promotes the within-class similarity while Eq. (6) reduces the between-class similarity. Integer β ≥ 0 controls the focus on difficult examples, e.g., β = 1 encourages all withinclass distances in Eq. (5) to be close to the positive target μ(TopMinβ (·)), the smallest observed within-class distance in the mini-batch. If β > 1, this means we relax our positive target. By analogy, if β = 1, we encourage all between-class distances in Eq. (6) to approach the negative target μ(TopMaxNZβ (·)), the average over the largest N Z between-class distances. If β > 1, the negative target is relaxed.

4 Experiments We provide network configurations and training details in Appendix Section H. Below, we describe the datasets and evaluation protocols on which we validate our JEANIE. Datasets. Appendix Section B. and Table 9. contain details of datasets described below. i. UWA3D Multiview Activity II [49] contains 30 actions performed by 9 people in a cluttered environment. In this dataset, the Kinect camera was moved to different positions to capture the actions from 4 different views: front view (V1 ), left view (V2 ), right view (V3 ), and top view (V4 ). ii. NTU RGB+D (NTU-60) [50] contains 56,880 video sequences and over 4 million frames. This dataset has variable sequence lengths and high intra-class variations.

316

L. Wang and P. Koniusz

iii. NTU RGB+D 120 (NTU-120) [39], an extension of NTU-60, contains 120 action classes (daily/health-related), and 114,480 RGB+D video samples captured with 106 distinct human subjects from 155 different camera viewpoints. iv. Kinetics [26] is a large-scale collection of 650,000 video clips that cover 400/600/700 human action classes. It includes human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. As the Kinetics-400 dataset provides only the raw videos, we follow approach [74] and use the estimated joint locations in the pixel coordinate system as the input to our pipeline. To obtain the joint locations, we first resize all videos to the resolution of 340 × 256, and convert the frame rate to 30 FPS. Then we use the publicly available OpenPose [6] toolbox to estimate the location of 18 joints on every frame of the clips. As OpenPose produces the 2D body joint coordinates and Kinetics-400 does not offer multiview or depth data, we use a network of Martinez et al.[43] pre-trained on Human3.6M [8], combined with the 2D OpenPose output to estimate 3D coordinates from 2D coordinates. The 2D OpenPose and the latter network give us (x, y) and z coordinates, respectively. Evaluation Protocols. For the UWA3D Multiview Activity II, we use standard multiview classification protocol [49, 63, 64], but we apply it to one-shot learning as the view combinations for training and testing sets are disjoint. For NTU-120, we follow the standard one-shot protocol [39]. Based on this protocol, we create a similar one-shot protocol for NTU-60, with 50/10 action classes used for training/testing respectively. To evaluate the effectiveness of the proposed method on viewpoint alignment, we also create two new protocols on NTU-120, for which we group the whole dataset based on (i) horizontal camera views into left, center and right views, (ii) vertical camera views into top, center and bottom views. We conduct two sets of experiments on such disjoint view-wise splits: (i) using 100 action classes for training, and testing on the same 100 action classes (ii) training on 100 action classes but testing on the rest unseen 20 classes. Appendix Section G details new/additional eval. protocols on NTU-60/NTU-120. Stereo Projections. For simulating different camera viewpoints, we estimate the fundamental matrix F (Eq. 7 in Appendix), which relies on camera parameters. Thus, we use the Camera Calibrator from MATLAB to estimate intrinsic, extrinsic and lens distortion parameters. For a given skeleton dataset, we compute the range of spatial coordinates x and y, respectively. We then split them into 3 equally-sized groups to form roughly left, center, right views and other 3 groups for bottom, center, top views. We choose ∼15 frame images from each corresponding group, upload them to the Camera Calibrator, and export camera parameters. We then compute the average distance/depth and height per group to estimate the camera position. On NTU-60 and NTU-120, we simply group the whole dataset into 3 cameras, which are left, center and right views, as provided in [39], and then we compute the average distance per camera view based on the height and distance settings given in the table in [39].

Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition 100 80

10-way 20-way 30-way 40-way 50-way

73

65

50

58

horizontal camera view 0

[-15,15] [-30,30] [-45,45] [-60,60] [-75,75] [-90,90]

(a) horizontal camera view

50

S2GC (50-way) GCN (20-way) GCN (50-way)

65 60

65

58

50

10-way 20-way 30-way 40-way 50-way

accuracy (%)

accuracy (%)

73

S2GC (20-way)

acc. (%)

80

317

40

vertical camera view 0

30

[-15,15] [-30,30] [-45,45] [-60,60] [-75,75] [-90,90]

(b) vertical camera view

2

Fig. 5. The impact of viewing angles on NTU-6

6

10

14

18

22

Fig. 6. The impact of β in loss function on NTU-60 with S2 GC and GCN.

Table 1. Experimental results on NTU-60 (left) and NTU-120 (right) for different camera viewpoint simulations. Below the dashed line are ablated few variants of JEANIE.

# Training classes

10 

20

NTU-60 30 40

50

20

NTU-120 40 60 80

100

Euler simple (K +K ) Euler (K ×K  ) CamVPC (K ×K  )

54.3 56.2 60.4 64.0 68.1 60.8 67.4 67.5 70.3 75.0 59.7 68.7 68.4 70.4 73.2

30.7 36.8 39.5 44.3 46.9 32.9 39.2 43.5 48.4 50.2 33.1 40.8 43.7 48.4 51.4

V(Euler) 2V(Euler simple) 2V(Euler) 2V(CamVPC) 2V(CamVPC+crossval.) 2V(CamVPC+crossval.)+Transf.

54.0 54.3 60.8 59.7 63.4 65.0

30.6 30.7 32.9 33.1 37.2 38.5

56.0 56.2 67.4 68.7 72.4 75.2

60.2 60.4 67.5 68.4 73.5 76.7

63.8 64.0 70.3 70.4 73.2 78.9

67.8 68.1 75.0 73.2 78.1 80.0

36.7 36.8 39.2 40.8 43.0 44.1

39.2 39.5 43.5 43.7 49.2 50.3

44.0 44.3 48.4 48.4 50.0 51.2

47.0 46.9 50.2 51.4 55.2 57.0

4.1 Ablation Study We start our experiments by investigating the GNN backbones (Appendix Section C.1), camera viewpoint simulation and their hyper-parameters (Appendix Section C.3, C.4, C.5). Camera Viewpoint Simulations. We choose 15◦ as the step size for the viewpoints simulation. The ranges of camera azimuth/altitude are in [−90◦ , 90◦ ]. Where stated, we perform a grid search on camera azimuth/altitude with Hyperopt. Below, we explore the choice of the angle ranges for both horizontal and vertical views. Figure 5a and 5b (evaluations on the NTU-60 dataset) show that the angle range [−45◦ , 45◦ ] performs the best, and widening the range in both views does not increase the performance any further. Table 1 (top) shows results for the chosen range [−45◦ , 45◦ ] of camera viewpoint simulations. Euler simple (K + K  ) denotes a simple concatenation of features from both horizontal and vertical views, whereas Euler/CamVPC(K × K  ) represents the grid search of all possible views. It shows that Euler angles for the viewpoint augmentation outperform Euler simple, and CamVPC (viewpoints of query sequences are generated by the stereo projection geometry) outperforms Euler angles in almost all the experiments on NTU-60 and NTU-120. This proves the effectiveness of using the stereo projection geometry for the viewpoint augmentation. More baseline experiments with/without viewpoint alignment are in Appendix Sec. C.2.

318

L. Wang and P. Koniusz

Table 2. Experimental results on NTU-60 (left) and NTU-120 (right) for the ι-max shift. The ι-max shift is the maximum viewpoint shift from block to block in JEANIE.

                 NTU-60                          NTU-120
ι-max shift      10    20    30    40    50      20    40    60    80    100
ι = 1            60.8  70.7  72.5  72.9  75.2    36.3  42.5  48.7  50.0  54.8
ι = 2            63.8  72.9  74.0  73.4  78.1    37.2  43.0  49.2  50.0  55.2
ι = 3            55.2  58.9  65.7  67.1  72.5    36.7  43.0  48.5  49.0  54.9
ι = 4            54.5  57.8  63.5  65.2  70.4    36.5  42.9  48.3  48.9  54.3

Table 3. The impact of the number of frames M in a temporal block under stride step S on results (NTU-60). S = pM, where 1-p describes the temporal block overlap percentage; higher p means fewer overlapping frames between temporal blocks.

       S=M            S=0.8M         S=0.6M         S=0.4M         S=0.2M
M      50-cl  20-cl   50-cl  20-cl   50-cl  20-cl   50-cl  20-cl   50-cl  20-cl
5      69.0   55.7    71.8   57.2    69.2   59.6    73.0   60.8    71.2   61.2
6      69.4   54.0    65.4   54.1    67.8   58.0    72.0   57.8    73.0   63.0
8      67.0   52.7    67.0   52.5    73.8   61.8    67.8   60.3    68.4   59.4
10     62.2   44.5    63.6   50.9    65.2   48.4    62.4   57.0    70.4   56.7
15     62.0   43.5    62.6   48.9    64.7   47.9    62.4   57.2    68.3   56.7
30     55.6   42.8    57.2   44.8    59.2   43.9    58.8   55.3    60.2   53.8
45     50.0   39.8    50.5   40.6    52.3   39.9    53.0   42.1    54.0   45.2

Evaluation of β. Figure 6 shows that our loss function performs best with β = 8 and β = 14 on the 20- and 50-class protocols, respectively, on NTU-60 for both the S2GC and GCN backbones. Moreover, the choice of β is not affected by the backbone.

The ι-max shift. Table 2 shows the evaluations of ι for the maximum shift. We notice that ι = 2 yields the best results for all experimental settings on both NTU-60 and NTU-120. Increasing ι further does not improve the performance.

Block Size and Strides. Table 3 shows evaluations of block size M and stride S, and indicates that the best performance (both 50- and 20-class) is achieved for a smaller block size (frame count in the block) and a smaller stride. Longer temporal blocks decrease the performance because the temporal information does not reach the temporal alignment step. Our block encoder encodes each temporal block to learn local temporal motions, and finally aggregates these block features to form the global temporal motion cues. A smaller stride helps capture more local motion patterns. Considering the computational cost and the performance, we choose M = 8 and S = 0.6M.

Euler vs CamVPC. Table 1 (bottom) shows that using the viewpoint alignment simultaneously in two dimensions, x and y for Euler angles, or azimuth and altitude for the stereo projection geometry (CamVPC), improves the performance by 5-8% compared to Euler simple, a variant where the best viewpoint alignment path was chosen from the best alignment path along x and the best alignment path along y. Euler simple is better than Euler with y rotations only ((V) includes rotations along y while (2V) includes rotations along two axes). Using HyperOpt [4] to search for the best angle range in which we perform the viewpoint alignment (CamVPC+crossval.) improves results further. Enabling the viewpoint alignment for support sequences yields an extra improvement. With the Transformer (2V+Transf.), JEANIE boosts results by ∼2%.
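A minimal sketch of the temporal block splitting with block size M and stride S = pM discussed above; the joint count and feature layout are illustrative:

```python
import numpy as np

def split_into_blocks(skeleton, M=8, p=0.6):
    """Split a (T, J, 3) skeleton sequence into temporal blocks of M frames
    with stride S = p*M, so that 1-p is the overlap ratio between blocks."""
    S = max(1, int(round(p * M)))
    T = skeleton.shape[0]
    blocks = [skeleton[t:t + M] for t in range(0, T - M + 1, S)]
    return np.stack(blocks) if blocks else np.empty((0, M) + skeleton.shape[1:])

# Example: a 45-frame sequence with the setting chosen above (M=8, S=0.6M)
seq = np.random.randn(45, 25, 3)
print(split_into_blocks(seq).shape)   # (8, 8, 25, 3): 8 blocks of 8 frames each
```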


Table 4. Results on NTU-60 (S2GC backbone). Models use temporal alignment by soft-DTW or JEANIE (joint temporal-viewpoint alignment) except if indicated otherwise.

# Training classes                     10    20    30    40    50
Each frame to frontal view             52.9  53.3  54.6  54.2  58.3
Each block to frontal view             53.9  56.1  60.1  63.8  68.0
Traj. aligned baseline (video-level)   36.1  40.3  44.5  48.0  50.2
Traj. aligned baseline (block-level)   52.9  55.8  59.4  63.6  66.7
Matching Nets [61]                     46.1  48.6  53.3  56.2  58.8
Matching Nets [61]+2V                  47.2  50.7  55.4  57.7  60.2
Prototypical Net [54]                  47.2  51.1  54.3  58.9  63.0
Prototypical Net [54]+2V               49.8  53.1  56.7  60.9  64.3
TAP [55]                               54.2  57.3  61.7  64.7  68.3
S2GC (no soft-DTW)                     50.8  54.7  58.8  60.2  62.8
soft-DTW                               53.7  56.2  60.0  63.9  67.8
(no soft-DTW)+Transf.                  56.0  64.2  67.3  70.2  72.9
soft-DTW+Transf.                       57.3  66.1  68.8  72.3  74.0
JEANIE+Transf.                         65.0  75.2  76.7  78.9  80.0

4.2 Comparisons with the State-of-the-Art Methods

One-Shot Action Recognition (NTU-60). Table 4 shows that aligning query and support trajectories by the angle of the torso 3D joint, denoted Traj. aligned baseline, is not very powerful, as alluded to in Fig. 2 (top). Aligning piece-wise parts (blocks) is better than aligning entire trajectories. In fact, aligning individual frames by the torso to the frontal view (Each frame to frontal view) and aligning the block average of the torso direction to the frontal view (Each block to frontal view) were marginally better. We note these baselines use soft-DTW. We show more comparisons in Appendix Sec. E. Our JEANIE with Transformer (JEANIE+Transf.) outperforms soft-DTW with Transformer (soft-DTW+Transf.) by 7.46% on average.

One-Shot Action Recognition (NTU-120). Table 5 shows that JEANIE outperforms the recent SL-DML and Skeleton-DML by 6.1% and 2.8%, respectively (100 training classes). For comparison, we extended the view adaptive neural networks [81] by combining them with Prototypical Net [54]. VA-RNN+VA-CNN [81] uses 0.47M+24M parameters with random rotation augmentations, while JEANIE uses 0.25-0.5M parameters. Their rotation+translation keys are not proven to perform smooth optimal alignment as JEANIE does. In contrast, JEANIE jointly performs a smooth viewpoint-temporal alignment via a principled transportation plan (in a ≥3-dimensional space) by design. They use Euler angles, which are a worse option than the camera projection of JEANIE. We notice that the ProtoNet+VA backbones are about 12% worse than our JEANIE. Even if we split skeletons into blocks to let soft-DTW perform temporal alignment of prototypes and queries, JEANIE is still 4-6% better. JEANIE outperforms FVM by 2-4%. This shows that jointly seeking the best temporal-viewpoint alignment is more valuable than treating viewpoint alignment as a local task (free-range alignment at each step of soft-DTW).
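For reference, a minimal NumPy sketch of the soft-DTW recursion [11] underlying the soft-DTW baselines above; JEANIE extends this dynamic program with additional viewpoint indices, which is not shown here:

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft-minimum used in soft-DTW (Cuturi & Blondel [11])."""
    values = np.asarray(values) / -gamma
    m = values.max()
    return -gamma * (m + np.log(np.exp(values - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW distance between block-feature sequences x: (n, d) and y: (m, d)."""
    n, m = len(x), len(y)
    D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** 2  # cost matrix
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = D[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]

# Toy usage on two short sequences of block descriptors
a, b = np.random.randn(6, 16), np.random.randn(7, 16)
print(soft_dtw(a, b))
```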


Table 5. Experimental results on NTU-120 (S2GC backbone). Methods use temporal alignment by soft-DTW or JEANIE (joint temporal-viewpoint alignment) except VA [80, 81] and other cited works. For VA*, we used soft-DTW on temporal blocks, while VA generated the temporal blocks.

# Training classes                       20    40    60    80    100
APSR [39]                                29.1  34.8  39.2  42.8  45.3
SL-DML [45]                              36.7  42.4  49.0  46.4  50.9
Skeleton-DML [44]                        28.6  37.5  48.6  48.0  54.2
Prototypical Net+VA-RNN(aug.) [80]       25.3  28.6  32.5  35.2  38.0
Prototypical Net+VA-CNN(aug.) [81]       29.7  33.0  39.3  41.5  42.8
Prototypical Net+VA-fusion(aug.) [81]    29.8  33.2  39.5  41.7  43.0
Prototypical Net+VA*-fusion(aug.) [81]   33.3  38.7  45.2  46.3  49.8
TAP [55]                                 31.2  37.7  40.9  44.5  47.3
S2GC (no soft-DTW)                       30.0  35.9  39.2  43.6  46.4
soft-DTW                                 30.3  37.2  39.7  44.0  46.8
(no soft-DTW)+Transf.                    31.2  37.5  42.3  47.0  50.1
soft-DTW+Transf.                         31.6  38.0  43.2  47.8  51.3
FVM+Transf.                              34.5  41.9  44.2  48.7  52.0
JEANIE+Transf.                           38.5  44.1  50.3  51.2  57.0

Table 6. Experiments on 2D and 3D Kinetics-skeleton. Note that we have no results for JEANIE or FVM on 2D coordinates (aligning viewpoints is an operation in 3D).

            S2GC (no soft-DTW)   soft-DTW   FVM    JEANIE   JEANIE+Transf.
2D skel.    32.8                 34.7       -      -        -
3D skel.    35.9                 39.6       44.1   50.3     52.5

JEANIE on the Kinetics-Skeleton. We evaluate our proposed model on both 2D and 3D Kinetics-skeleton. We split the whole dataset into two halves: 200 action classes for training and the remaining classes for testing. As we are unable to estimate the camera location, we simply use Euler angles for the camera viewpoint simulation. Table 6 shows that using 3D skeletons outperforms the use of 2D skeletons by 3-4%, and that JEANIE outperforms the baseline (temporal alignment only) and Free Viewpoint Matching (FVM, which at every step of DTW seeks the best local viewpoint alignment, thus realizing a non-smooth temporal-viewpoint path in contrast to JEANIE) by around 5% and 6%, respectively. With the Transformer, JEANIE further boosts results by 2%.

Few-Shot Multiview Classification. Table 7 (UWA3D Multiview Activity II) shows that adding temporal alignment to SGC, APPNP and S2GC improves the performance, and that a large performance gain is obtained by further adding the viewpoint alignment. As this dataset is challenging for recognizing actions from a novel viewpoint, our proposed method performs consistently well on all combinations of training/testing viewpoint variants. This is expected, as our method aligns both temporal and camera viewpoints, which allows for a robust classification. JEANIE outperforms FVM by 4.2%, and outperforms the baseline (with temporal alignment only) by 7% on average.
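A sketch of the Euler-angle viewpoint simulation applied to 3D skeletons, assuming rotations about the vertical and horizontal axes; the exact axis conventions of the paper are not reproduced here:

```python
import numpy as np

def rotation_matrix(azimuth_deg, altitude_deg):
    """Rotation about the vertical (y) axis by the azimuth and about the
    horizontal (x) axis by the altitude, both in degrees."""
    a, b = np.radians(azimuth_deg), np.radians(altitude_deg)
    Ry = np.array([[ np.cos(a), 0, np.sin(a)],
                   [ 0,         1, 0        ],
                   [-np.sin(a), 0, np.cos(a)]])
    Rx = np.array([[1, 0,          0         ],
                   [0, np.cos(b), -np.sin(b)],
                   [0, np.sin(b),  np.cos(b)]])
    return Rx @ Ry

def simulate_viewpoints(skeleton, step=15, max_angle=45):
    """Generate rotated copies of a (T, J, 3) skeleton over a grid of
    simulated camera azimuth/altitude angles (Euler-angle variant)."""
    angles = np.arange(-max_angle, max_angle + 1, step)
    return {(az, alt): skeleton @ rotation_matrix(az, alt).T
            for az in angles for alt in angles}

views = simulate_viewpoints(np.random.randn(30, 25, 3))
print(len(views))  # 7 x 7 = 49 simulated viewpoints for step=15, range [-45, 45]
```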


Table 7. Experiments on the UWA3D Multiview Activity II.

Training view   V1&V2       V1&V3       V1&V4       V2&V3       V2&V4       V3&V4       Mean
Testing view    V3    V4    V2    V4    V2    V3    V1    V4    V1    V3    V1    V2
GCN             36.4  26.2  20.6  30.2  33.7  22.4  43.1  26.6  16.9  12.8  26.3  36.5  27.6
SGC             40.9  60.3  44.1  52.6  48.5  38.7  50.6  52.8  52.8  37.2  57.8  49.6  48.8
 +soft-DTW      43.9  60.8  48.1  54.6  52.6  45.7  54.0  58.2  56.7  40.2  60.2  51.1  52.2
 +JEANIE        47.0  62.8  50.4  57.8  53.6  47.0  57.9  62.3  57.0  44.8  61.7  52.3  54.6
APPNP           42.9  61.9  47.8  58.7  53.8  44.0  52.3  60.3  55.1  38.2  58.3  47.9  51.8
 +soft-DTW      44.3  63.2  50.7  62.3  53.9  45.0  56.9  62.8  56.4  39.3  60.1  51.9  53.9
 +JEANIE        46.8  64.6  51.3  65.1  54.7  46.4  58.2  65.1  58.8  43.9  60.3  52.5  55.6
S2GC            45.5  46.8  49.5  57.3  51.0  57.0  64.4  61.6  43.2  61.2  42.9  49.2  52.5
 +soft-DTW      48.2  51.2  53.2  62.4  57.8  62.2  67.2  67.0  46.8  66.2  45.0  53.0  56.7
 +FVM           50.7  56.3  55.8  63.7  62.5  63.8  68.8  69.2  47.1  68.8  51.4  55.7  59.5
 +JEANIE        55.3  61.4  60.9  66.4  68.8  66.7  70.2  72.5  50.8  73.9  57.2  60.2  63.7

Table 8. Results on NTU-120 (multiview classification). The baseline is soft-DTW + S2GC.

Training view             bott.   bott.   bott.&cent.   left    left    left&cent.
Testing view              cent.   top     top           cent.   right   right
100/same 100 (baseline)   74.2    73.8    75.0          58.3    57.2    68.9
100/same 100 (FVM)        79.9    78.2    80.0          65.9    63.9    75.0
100/same 100 (JEANIE)     81.5    79.2    83.9          67.7    66.9    79.2
100/novel 20 (baseline)   58.2    58.2    61.3          51.3    47.2    53.7
100/novel 20 (FVM)        66.0    65.3    68.2          58.8    53.9    60.1
100/novel 20 (JEANIE)     67.8    65.8    70.8          59.5    55.0    62.7

Table 8 (NTU-120) shows that adding more camera viewpoints to the training process helps the multiview classification. Using bottom and center views for training and top view for testing, or using left and center views for training and the right view for testing yields 4% gain (‘same 100’ means the same train/test classes but different views). Testing on 20 novel classes (‘novel 20’ never used in training) yields 62.7% and 70.8% for multiview classification in horizontal and vertical camera viewpoints, respectively.

5 Conclusions

We have proposed a Few-shot Action Recognition (FSAR) approach for learning on 3D skeletons via JEANIE. We have demonstrated that the joint alignment of temporal blocks and simulated viewpoints of skeletons between support-query sequences is effective in the meta-learning setting, where the alignment has to be performed on new action classes with only a few samples. Our experiments have shown that using the stereo camera geometry is more effective than simply generating multiple views by Euler angles in the meta-learning regime. Most importantly, we have introduced a novel FSAR approach that learns on articulated 3D body joints.


Acknowledgements. We thank Dr. Jun Liu (SUTD) for discussions on FSAR for 3D skeletons, and CSIRO’s Machine Learning and Artificial Intelligence Future Science Platform (MLAI FSP).

References 1. Euler angles. Wikipedia. https://en.wikipedia.org/wiki/Euler_angles. Accessed 08 Mar 2022 2. Lecture 12: Camera projection. On-line. https://www.cse.psu.edu/~rtc12/CSE486/lecture12. pdf. Accessed: 08 Mar 2022 3. Bart, E., Ullman, S.: Cross-generalization: Learning novel classes from a single example by feature replacement. In: CVPR, pp. 672–679 (2005) 4. Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., Cox, D.D.: Hyperopt: a python library for model selection and hyperparameter optimization. Comput. Sci. Discov. 8(1), 014008 (2015) 5. Cao, K., Ji, J., Cao, Z., Chang, C.Y., Niebles, J.C.: Few-shot video classification via temporal alignment. In: CVPR (2020) 6. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR (2017) 7. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR (2017) 8. Catalin, I., Dragos, P., Vlad, O., Cristian, S.: Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI (2014) 9. Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: CVPR (2020) 10. Cuturi, M.: Fast global alignment kernels. In: ICML (2011) 11. Cuturi, M., Blondel, M.: Soft-DTW: a differentiable loss function for time-series. In: ICML (2017) 12. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2020) 13. Dvornik, N., Schmid, C., Mairal, J.: Selecting relevant features from a multi-domain representation for few-shot classification. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 769–786. Springer, Cham (2020). https://doi.org/ 10.1007/978-3-030-58607-2_45 14. Dwivedi, S.K., Gupta, V., Mitra, R., Ahmed, S., Jain, A.: Protogan: towards few shot learning for action recognition. arXiv (2019) 15. Elsken, T., Staffler, B., Metzen, J.H., Hutter, F.: Meta-learning of neural architectures for few-shot learning. In: CVPR (2020) 16. Fei, N., Guan, J., Lu, Z., Gao, Y.: Few-shot zero-shot learning: Knowledge transfer with less supervision. In: ACCV (2020) 17. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006) 18. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: CVPR (2017) 19. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR (2016) 20. Fink, M.: Object classification from a single example utilizing class relevance metrics. In: NeurIPS, pp. 449–456 (2005) 21. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Precup, D., Teh, Y.W. (eds.) ICML, vol. 70, pp. 1126–1135. PMLR (2017) 22. Guan, J., Zhang, M., Lu, Z.: Large-scale cross-domain few-shot learning. In: ACCV (2020)


23. Guo, M., Chou, E., Huang, D.-A., Song, S., Yeung, S., Fei-Fei, L.: Neural graph matching networks for fewshot 3d action recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 673–689. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_40 24. Guo, Y., Codella, N.C., Karlinsky, L., Codella, J.V., Smith, J.R., Saenko, K., Rosing, T., Feris, R.: A broader study of cross-domain few-shot learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12372, pp. 124–141. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58583-9_8 25. Haasdonk, B., Burkhardt, H.: Invariant kernel functions for pattern analysis and machine learning. Mach. Learn. 68(1), 35–61 (2007) 26. Kay, W., et al.: The kinetics human action video dataset. arXiv (2017) 27. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017) 28. Klicpera, J., Bojchevski, A., Gunnemann, S.: Predict then propagate: graph neural networks meet personalized pagerank. In: ICLR (2019) 29. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2 (2015) 30. Koniusz, P., Wang, L., Cherian, A.: Tensor representations for action recognition. IEEE TPAMI (2020) 31. Koniusz, P., Wang, L., Sun, K.: High-order tensor pooling with attention for action recognition. arXiv (2021) 32. Koniusz, P., Zhang, H.: Power normalizations in fine-grained image, few-shot image and graph classification. IEEE Trans. Pattern Anal. Mach. Intell. 44(2), 591–609 (2022) 33. Lake, B.M., Salakhutdinov, R., Gross, J., Tenenbaum, J.B.: One shot learning of simple visual concepts. CogSci (2011) 34. Li, F.F., VanRullen, R., Koch, C., Perona, P.: Rapid natural scene categorization in the near absence of attention. Proc. Natl. Acad. Sci. 99(14), 9596–9601 (2002) 35. Li, K., Zhang, Y., Li, K., Fu, Y.: Adversarial feature hallucination networks for few-shot learning. In: CVPR (2020) 36. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: CVPR (2019) 37. Lichtenstein, M., Sattigeri, P., Feris, R., Giryes, R., Karlinsky, L.: TAFSSL: task-adaptive feature sub-space learning for few-shot classification. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 522–539. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_31 38. Liu, J., Wang, G., Hu, P., Duan, L., Kot, A.C.: Global context-aware attention LSTM networks for 3d action recognition. In: CVPR, pp. 3671–3680 (2017) 39. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.Y., Kot, A.C.: NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. (2019) 40. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: CVPR (2020) 41. Lu, C., Koniusz, P.: Few-shot keypoint detection with uncertainty learning for unseen species. In: CVPR (2022) 42. Luo, Q., Wang, L., Lv, J., Xiang, S., Pan, C.: Few-shot learning via feature hallucination with variational inference. In: WACV (2021) 43. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: ICCV. pp. 2659–2668 (2017) 44. 
Memmesheimer, R., Häring, S., Theisen, N., Paulus, D.: Skeleton-DML: deep metric learning for skeleton-based one-shot action recognition. arXiv (2021)


45. Memmesheimer, R., Theisen, N., Paulus, D.: Signal level deep metric learning for multimodal one-shot action recognition. arXiv (2020) 46. Miller, E.G., Matsakis, N.E., Viola, P.A.: Learning from one example through shared densities on transforms. In: CVPR, vol. 1, pp. 464–471 (2000) 47. Mishra, A., Verma, V.K., Reddy, M.S.K., Arulkumar, S., Rai, P., Mittal, A.: A generative approach to zero-shot and few-shot action recognition. In: WACV, pp. 372–380 (2018) 48. Qin, Z., et al.: Fusing higher-order features in graph neural networks for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. (99), 1–15 (2022) 49. Rahmani, H., Mahmood, A., Huynh, D.Q., Mian, A.: Histogram of Oriented Principal Components for Cross-View Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38, 2430–2443 (2016) 50. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: CVPR (2016) 51. Simon, C., Koniusz, P., Harandi, M.: On learning the geodesic path for incremental learning. In: CVPR, pp. 1591–1600 (2021) 52. Simon, C., Koniusz, P., Nock, R., Harandi, M.: On Modulating the gradient for metalearning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 556–572. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-585983_33 53. Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT-Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–158. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45167-9_12 54. Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. In: Guyon, I., et al.: (eds.) NeurIPS, pp. 4077–4087 (2017) 55. Su, B., Wen, J.R.: Temporal alignment prediction for supervised representation learning and few-shot sequence classification. In: ICLR (2022) 56. Sun, K., Koniusz, P., Wang, Z.: Fisher-Bures adversary graph convolutional networks. In: Conference on Uncertainty in Artificial Intelligence, Israel, vol. 115, pp. 465–475 (2019) 57. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: CVPR, pp. 1199–1208 (2018) 58. Tang, L., Wertheimer, D., Hariharan, B.: Revisiting pose-normalization for fine-grained fewshot recognition. In: CVPR (2020) 59. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015) 60. Villani, C.: Optimal Transport Old and New. Springer, Heidelberg (2009). https://doi.org/10. 1007/978-3-540-71050-9 61. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R. (eds.) NeurIPS, pp. 3630–3638 (2016) 62. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks for action recognition in videos. IEEE Trans. Pattern. Anal. Mach. Intell. 41(11), 2740–2755 (2019) 63. Wang, L.: Analysis and evaluation of kinect-based action recognition algorithms. Master’s thesis, School of the Computer Science and Software Engineering, The University of Western Australia (2017) 64. Wang, L., Huynh, D.Q., Koniusz, P.: A comparative review of recent kinect-based action recognition algorithms. IEEE Trans. Image Process. 29, 15–28 (2020) 65. Wang, L., Huynh, D.Q., Mansour, M.R.: Loss switching fusion with similarity search for video classification. In: ICIP (2019) 66. 
Wang, L., Koniusz, P.: Self-supervising action recognition by statistical moment and subspace descriptors. In: ACM-MM, pp. 4324–4333 (2021)


67. Wang, L., Koniusz, P.: Uncertainty-DTW for time series and sequences. In: Avidan, S., Brostow, G., Cisse, M., Farinella, G.M., Hassner, T. (eds) Computer Vision–ECCV 2022. ECCV 2022. LNCS, vol. 13681. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-198038_11 68. Wang, L., Koniusz, P., Huynh, D.Q.: Hallucinating IDT descriptors and I3D optical flow features for action recognition with CNNs. In: ICCV (2019) 69. Wang, L., Ding, Z., Tao, Z., Liu, Y., Fu, Y.: Generative multi-view human action recognition. In: ICCV (2019) 70. Wang, S., Yue, J., Liu, J., Tian, Q., Wang, M.: Large-scale few-shot learning via multi-modal knowledge discovery. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 718–734. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058607-2_42 71. Wang, Y., Long, M., Wang, J., Yu, P.S.: Spatiotemporal pyramid network for video action recognition. In: CVPR (2017) 72. Wu, F., Zhang, T., de Souza Jr., A.H., Fifty, C., Yu, T., Weinberger, K.Q.: Simplifying graph convolutional networks. In: ICML (2019) 73. Xu, B., Ye, H., Zheng, Y., Wang, H., Luwang, T., Jiang, Y.G.: Dense dilated network for few shot action recognition. In: ACM ICMR, pp. 379–387 (2018) 74. Yan, S., Xiong, Y., Lin, D.: Spatial Temporal Graph Convolutional Networks for SkeletonBased Action Recognition. In: AAAI (2018) 75. Yu, X., Zhuang, Z., Koniusz, P., Li, H.: 6DoF object pose estimation via differentiable proxy voting regularizer. In: BMVC. BMVA Press (2020) 76. Zhang, H., Koniusz, P.: Power normalizing second-order similarity network for few-shot learning. In: WACV, pp. 1185–1193 (2019) 77. Zhang, H., Koniusz, P., Jian, S., Li, H., Torr, P.H.S.: Rethinking class relations: absoluterelative supervised and unsupervised few-shot learning. In: CVPR, pp. 9432–9441 (June 2021) 78. Zhang, H., Li, H., Koniusz, P.: Multi-level second-order few-shot learning. IEEE Trans. Multim. (99), 1 (2022) 79. Zhang, H., Zhang, L., Qi, X., Li, H., Torr, P.H.S., Koniusz, P.: Few-shot action recognition with permutation-invariant attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 525–542. Springer, Cham (2020). https://doi.org/ 10.1007/978-3-030-58558-7_31 80. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: ICCV (2017) 81. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans. Pattern Aanal. Mach. Intell. 41(8), 1963–1978 (2019) 82. Zhang, S., Luo, D., Wang, L., Koniusz, P.: Few-shot object detection by second-order pooling. In: Ishikawa, H., Liu, C.-L., Pajdla, T., Shi, J. (eds.) ACCV 2020. LNCS, vol. 12625, pp. 369–387. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-69538-5_23 83. Zhang, S., Murray, N., Wang, L., Koniusz, P.: Time-rEversed diffusioN tEnsor transformer: a new TENET of few-shot object detection. In: Avidan, S., Brostow, G., Cisse, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision–ECCV 2022. ECCV 2022. LNCS, vol. 13680. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20044-1_18 84. Zhang, S., Wang, L., Murray, N., Koniusz, P.: Kernelized few-shot object detection with efficient integral aggregation. In: CVPR, pp. 19207–19216 (June 2022) 85. Zhu, H., Koniusz, P.: Simple spectral graph convolution. In: ICLR (2021)


86. Zhu, H., Koniusz, P.: EASE: unsupervised discriminant subspace learning for transductive few-shot learning. In: CVPR (2022) 87. Zhu, H., Sun, K., Koniusz, P.: Contrastive laplacian eigenmaps. In: NeurIPS, pp. 5682–5695 (2021) 88. Zhu, L., Yang, Y.: Compound memory networks for few-shot video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 782– 797. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_46

Video Analysis and Event Recognition

Spatial Temporal Network for Image and Skeleton Based Group Activity Recognition

Xiaolin Zhai 1,2, Zhengxi Hu 1,2, Dingye Yang 1,2, Lei Zhou 1,2, and Jingtai Liu 1,2 (B)

1 Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  {2120210410,hzx,1711502}@mail.nankai.edu.cn, [email protected]
2 Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China

Abstract. Group activity recognition aims to infer group activity in multi-person scenes. Previous methods usually model inter-person relations and integrate individuals' features into group representations. However, they neglect intra-person relations contained in the human skeleton. Individual representations can also be inferred by analyzing the evolution of human skeletons. In this paper, we utilize RGB images and human skeletons as the inputs, which contain complementary information. Considering the different semantic attributes of the two inputs, we design two diverse branches. For RGB images, we propose Scene Encoded Transformer, Spatial Transformer, and Temporal Transformer to explore inter-person spatial and temporal relations. For skeleton inputs, we capture the intra-person spatial and temporal dynamics by designing Spatial and Temporal GCN. Our main contributions are: i) we propose a spatial-temporal network with two branches for group activity recognition utilizing RGB images and human skeletons. Experiments show that our model achieves 97.1% MCA and 96.1% MPCA on the Collective Activity dataset and 94.0% MCA and 94.4% MPCA on the Volleyball dataset. ii) we extend the two datasets by introducing human skeleton annotations, namely human joint coordinates and confidence, which can also be used in the action recognition task. The code is available at https://github.com/zxll0106/Image_and_Skeleton_Based_Group_Activity_Recognition.

Keywords: Group activity recognition · Video analysis · Scene understanding

1 Introduction

(This work is supported in part by National Key Research and Development Project under Grant 2019YFB1310604, in part by National Natural Science Foundation of China under Grant 62173189.)

Fig. 1. Intuitive examples of a talking video clip in the Collective Activity dataset. On the inter-person level, we capture the spatial relation in the group at the same frame, and temporal dynamics of interaction between consecutive frames. On the intra-person level, the human skeleton reveals the more detailed evolution of the individual action, so we model the complex and diverse spatial-temporal individual features.

Group activity recognition [6] is widely used in social behavior understanding, service robots, and autonomous driving cars, therefore playing a vital role in

video analysis and scene understanding. The goal of group activity recognition is to understand what they are doing in the multi-person scene. In the previous methods, great efforts have been made to verify the effectiveness of modeling inter-person relations. However, intra-person relations in the human skeleton convey fine-grained actions information, but receive less attention in the group activity recognition task. To model the relationship between individuals, graph neural networks are applied to represent inter-person social relations. Nodes in the graph are individuals and the edges are formed by the relationship between them. Recent attention in group activity recognition has also focused on Transformer [28] which was proposed in the field of natural language processing. Transformer in group activity recognition [11,19,37] is utilized to model the inter-person relation and combine individuals’ features. In addition, some methods [2,11,19] utilized optical flow frames which contain different motion features from RGB images. Previous works show impressive improvement, which suggests the effectiveness of modeling inter-person relations. However, they neglect that intra-person information of the human skeleton contains fine-grained motion dynamics. Human skeletons convey information complementary to RGB images and optical flow frames. Human bodies can be viewed as a whole composed of a trunk and limbs, while human actions can be regarded as motions of joints and bones. And skeletons can be represented by 2D position sequences of human


joints. With the development of pose estimation networks, we can obtain more accurate human joint coordinates. From reliable skeleton data, we can model intra-person spatial-temporal dynamics and extract an effective representation. Accordingly, we propose a spatial-temporal network with two branches to capture the intra-person and inter-person relations. For RGB image inputs, we propose Scene Encoded Transformer to model the interaction between individuals and the surrounding scene, and extract informative scene features. As shown in Fig. 1, individuals also take consideration into the reaction of others to determine their behaviors, besides interacting with the scene. We adopt Spatial and Temporal Transformer to infer spatial and temporal inter-person relations, respectively. Moreover, for skeleton inputs, Spatial and Temporal GCN are designed to model intra-person spatial and temporal relations in the human skeleton. In Spatial GCN, spatial information is propagated from one node to another along intra-person connections. Temporal GCN passes temporal dynamics between consecutive frames of the same node. As shown in Fig. 1, exploiting intra-person spatial-temporal relations is important to modeling individuals’ features. The contributions of this work can be summarized as: (1) We propose a spatial-temporal network with two branches for group activity recognition utilizing RGB images and human skeletons. For RGB image inputs, we propose Scene Encoded Transformer, Spatial Transformer, and Temporal Transformer to model the inter-person relation. For skeleton inputs, we propose Spatial and Temporal GCN to capture intra-person spatial and temporal relations which lack further exploration in the group activity recognition task. Experiment results show that our proposed model achieves 97.1% MCA and 96.1% MPCA on the Collective Activity dataset and 94.0% MCA and 94.4% MPCA on the Volleyball dataset. (2) We extend the two datasets by introducing human skeleton annotations, namely human joint coordinates and confidences. Extended datasets can be not only used in group activity recognition but also used in skeleton-based action recognition.

2 Related Work

2.1 Group Activity Recognition

Machine learning plays an important role in addressing the issue of group activity recognition. Initially, many approaches designed hand-crafted features and applied probabilistic graphical models [1,4,5,7,12,17]. With the development of deep neural networks, many approaches [3,8,15,16,24] used CNNs to extract the feature map of the video clip. RNNs were also used to infer temporal dynamics of individual actions and group activity [3,8,15,16,24]. [16] utilized the person LSTM layer and the group LSTM layer to extract individual and group features, respectively. [3] designed a unified framework where RNN can reason the probability of individual actions. The semantic graph extracted from text labels


and images was applied in stagNet [24], and structural-RNN was used to capture temporal dynamics. Recent development in graph neural networks has improved the ability to model relations between individuals [9,14,23,30,33,38]. ARG [30] constructed actor relation graphs to capture appearance and position relations between actors. [9] utilized I3D as the backbone network to extract spatial-temporal features of a video clip, used self-attention to integrate individuals' features, and used graph attention to model relations between individuals. [23] used mean-field conditional random fields to infer temporal and spatial relations. The social adaptive module designed in [33] has the same structure in the spatial and temporal domains and can infer key instances under the weakly supervised setting. In PRL [14], individuals' relations were represented on a semantic relation graph. Dynamic Inference Network was proposed in [38] to construct person-specific spatial and temporal graphs by designing a Dynamic Relation (DR) module and a Dynamic Walk (DW) module. Meanwhile, researchers in group activity recognition [11,19,27,37,40] have shown an increased interest in Transformer [28], which was proposed in the field of natural language processing. Actor-Transformer [11] used RGB frames, optical flow frames, and pose features as input and used Transformer to model the group representation. GroupFormer [19] designed a Transformer encoder to capture spatial and temporal features and adopted a Transformer decoder in a cross manner to capture spatial-temporal interactive features. [37] enhanced individuals' representations with global contextual information and aggregated the relations between individuals using a Spatial-Temporal Bi-linear Pooling module. Previous works paid more attention to exploring the inter-person relation and confirmed its effectiveness in inferring group features. However, they neglected that the human skeleton, which contains intra-person relations, also conveys different fine-grained information. Different from them, we capture spatial and temporal relations both on intra-person and inter-person levels in parallel by designing two branches.

2.2 Modeling of Interaction

Modeling interaction relationships is an important component in the multiinstance problem, such as action recognition [18,21,34], pedestrian intent estimation [35,36], and trajectory prediction [10,39]. To model the interaction between joint nodes, [21] proposed the MS-G3D network which can disentangle the node representations in different spatial-temporal neighborhoods. Actional links were utilized in [18] to propagate actional information between different joint nodes and structural links were used to expand the respective fields. To exploit the interaction between the target individual and other traffic users, [35] proposed Attentive Relation Network which adopted soft attention to assign the weight of multiple traffic users. AgentFormer [39] utilized individual-aware attention to capture individual-to-itself and individual-to-others relations. Previous works have considered only one of the intra-person and inter-person relations.


Fig. 2. Overview of the proposed network, which contains the Image Branch and the Skeleton Branch. The Image Branch is proposed to model the inter-person interaction and the Skeleton Branch is designed to extract the intra-person relation.

Different from them, we capture the interaction on the intra-person and inter-person levels. Considering the different semantic attributes of the two levels, we propose a two-branch model to refine different-grained spatial-temporal dynamics.

3 Method

In this section, we outline the pipeline of our method. The overall framework of our method is illustrated in Fig. 2. Our method consists of two branches, namely the Image Branch and the Skeleton Branch. The Image Branch takes RGB frames as input, captures the inter-person interaction, and obtains spatial-temporal contextual individual and group features. The Skeleton Branch takes human skeletons as input, adopts Spatial and Temporal GCN, which are suitable for modeling intra-person relations, and obtains complementary individual and group representations. We concatenate the outputs of the two branches and pass them to classifiers to obtain individual actions and the group activity.
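A minimal sketch of this two-branch late fusion, with placeholder feature dimensions; the branch internals described in Sects. 3.1-3.2 are abstracted away, and this is not the authors' released code:

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Concatenate group/individual features from the two branches and classify."""
    def __init__(self, d_img=1024, d_skel=256, n_activities=5, n_actions=6):
        super().__init__()
        self.activity_head = nn.Linear(d_img + d_skel, n_activities)
        self.action_head = nn.Linear(d_img + d_skel, n_actions)

    def forward(self, g_img, g_skel, ind_img, ind_skel):
        # g_*:   (B, d)    group-level features from each branch
        # ind_*: (B, N, d) individual features from each branch
        group = torch.cat([g_img, g_skel], dim=-1)
        indiv = torch.cat([ind_img, ind_skel], dim=-1)
        return self.activity_head(group), self.action_head(indiv)

model = TwoBranchFusion()
y_g, y_i = model(torch.randn(2, 1024), torch.randn(2, 256),
                 torch.randn(2, 12, 1024), torch.randn(2, 12, 256))
print(y_g.shape, y_i.shape)  # torch.Size([2, 5]) torch.Size([2, 12, 6])
```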

3.1 Image Branch

First, we take T -frame images as the input and obtain individual features from Image Feature Extractor. Then, Scene Encoded Transformer incorporates contextual information in the scene to enhance the individual feature. Considering the continuity and the interactivity of the individual action and group activity, Spatial and Temporal Transformer mine spatial and temporal relations of enhanced individual features. Finally, we concatenate the outputs of Spatial and Temporal Transformer and then apply a pooling layer to get the group feature of the Image Branch. Image Feature Extractor. The input of the Image Branch is T -frame images that are centered on the labeled frame. We utilize the backbone network to


extract feature maps X ∈ R^{T×D×H×W} of the input frames. Given bounding boxes of N individuals, we apply RoIAlign [13] to extract individuals' features from the feature maps. Then a fully-connected layer is adopted to get d-dimensional individuals' features X_I ∈ R^{T×N×d}.

Scene Encoded Transformer. Individual actions are influenced by multiple social and environmental elements, in particular the interaction with the scene. The impact of the scene from a global view lacks further exploration in previous works [11,15,30,38]. To extract relational features of the surrounding scene, we propose Scene Encoded Transformer, which captures useful information from the surrounding scene to enhance the individual features. Inspired by Transformer, we note that the attention mechanism plays a critical role in encoding the relatively important elements effectively. Taking into account the unique importance of each scene element, we utilize the attention mechanism in Scene Encoded Transformer to calculate the weight based on relations between individuals and the scene. We consider that the scene feature X is composed of H × W scene elements, and each element contains a visual feature and a position feature. Different scene elements contribute to the target individual unequally, so the attention mechanism is adopted to assign an adaptive weight to each element. To align the individuals' features X_I and the feature maps X, we project them to d_E-dimensional subspaces, obtaining X'_I ∈ R^{T×N×d_E} and X' ∈ R^{T×HW×d_E}. We view X'_I as the query Q, and view X' as the key K and the value V. We then utilize the similarity between the query Q and the key K to assign an adaptive weight to the value V and compute the weighted sum of the value V. It is difficult to capture order information of the inputs because there is no recurrence and no convolution in Transformer [28]. As a result, we encode their position features as follows:

PE(pos, 2k) = sin(pos / 10000^(2k/L)),   PE(pos, 2k+1) = cos(pos / 10000^(2k/L))   (1)

For the i-th individual's features, we encode the center coordinates (x_i, y_i) of his bounding box:

pos = x_i for k ∈ {0, 1, ..., L/4 - 1};   pos = y_i for k ∈ {L/4, L/4 + 1, ..., L/2 - 1}   (2)

For the scene feature map, we encode each scene element's coordinate (x, y) of the feature map, where x ∈ [0, H], y ∈ [0, W]:

pos = x for k ∈ {0, 1, ..., L/4 - 1};   pos = y for k ∈ {L/4, L/4 + 1, ..., L/2 - 1}   (3)

Finally, we obtain the scene encoded individual features X_E ∈ R^{T×N×d_E}, which integrate the scene contextual information into the individual features.
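A small NumPy sketch of the sinusoidal encoding in Eqs. (1)-(2), under one reading where the x and y coordinates each occupy half of the L-dimensional code:

```python
import numpy as np

def positional_encoding(pos, L):
    """Sinusoidal encoding of a scalar position: even dims use sin, odd dims cos."""
    k = np.arange(L // 2)
    angle = pos / (10000 ** (2 * k / L))
    pe = np.empty(L)
    pe[0::2] = np.sin(angle)
    pe[1::2] = np.cos(angle)
    return pe

def encode_box_center(x, y, L=64):
    """Eq. (2): the first half of the code encodes x, the second half encodes y."""
    return np.concatenate([positional_encoding(x, L // 2),
                           positional_encoding(y, L // 2)])

print(encode_box_center(120.0, 45.0).shape)  # (64,)
```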


Spatial Transformer. In order to propagate information between different individuals in the scene, we propose Spatial Transformer to regard individuals as nodes of a graph and aggregate information from surrounding individuals. Our Spatial Transformer captures the target individual's interaction with surrounding individuals and assigns a relative importance weight to them. Then inter-person spatial relations can be integrated into the individuals' features. Spatial Transformer views the temporal dimension of the scene encoded individuals' features X_E as the batch dimension. We adopt different projection functions to map X_E to Q_S, K_S and V_S, calculate the attention weights of neighboring individuals, and obtain the feature X_S, which incorporates spatial information. Positional encoding follows Scene Encoded Transformer.

Temporal Transformer. Considering the temporal correlations of individual actions and group activities, we propose Temporal Transformer to encode the evolution of individual features in the temporal domain. Some previous works [9,19] utilized I3D to integrate the temporal feature of the input images, but they ignored individual-level temporal dynamics. Compared to the pooling operation in the temporal domain [30], the attention mechanism in Temporal Transformer can calculate the adaptive importance of individuals in consecutive frames. Temporal Transformer views the spatial dimension of the scene encoded individuals' features X_E as the batch dimension, so we swap the temporal and spatial dimensions of X_E, reshaping them to X'_E ∈ R^{N×T×d_E}. The self-attention mechanism takes X'_E as input and captures contextual temporal information to enrich the individuals' features. Positional encoding in Temporal Transformer uses the time order as the position information of different frames:

pos = t,  t ∈ [0, T]   (4)

3.2 Skeleton Branch

We obtain human skeleton data from RGB images using Skeleton Data Extractor. Spatial and Temporal GCN are utilized to model intra-person spatialtemporal features. These features are pooled to form the group feature of the Skeleton Branch. Skeleton Data Extractor. The skeleton data contains intra-person interaction information which is different from inter-person interaction modeled in the Image Branch. It can be represented by 2D coordinates sequences of human joints. We apply HRNet [26] pretrained on the COCO dataset as a pose estimation network. Given the bounding box of each individual, we crop the RGB image and resize it to 192 × 256. HRNet receives cropped images and then predicts human joints coordinates with confidence which are utilized as the input of Spatial and Temporal GCN.


Graph Convolutional Network. The input of a graph convolutional network (GCN) is graph-structured data, which is different from common image-structured data. The layer-wise propagation rule of GCN has a simple structure:

H^(l+1) = σ( D~^(-1/2) A~ D~^(-1/2) H^(l) W^(l) )   (5)

where A~ = A + I, A is the adjacency matrix of the input graph, I is the identity matrix, D~ with D~_ii = Σ_j A~_ij is the degree matrix of A~, W^(l) is the learned weight matrix, H^(l) contains the feature vectors of the nodes in the l-th layer, and σ(·) is the activation function ReLU(·) = max(0, ·).

Spatial GCN. Spatial relations of human joints can be represented as a graph, so we utilize Spatial GCN to integrate intra-person spatial relations. We construct the spatial graph G_t = (V_t, E_t) on the human joints in the t-th frame. V_t = {v_ti | i = 1, ..., N_joint} represents the joint nodes, and the attribute of v_ti is its coordinates and estimated confidence. The edge set E_t = {e_ij | i, j = 1, ..., N_joint} represents the relation between joint nodes v_ti and v_tj. Constructing the relation matrix A^S is the key component in GCN. We note that human joints can only move within certain limits and interact with the surrounding joints. Hence, we only consider interaction between the node v_ti and its neighborhood set N(v_ti) = {v_tj | d(v_ti, v_tj) ≤ 1, j ≠ i}. Noting that the motion of joints is driven by the muscles, the movements of muscles are split into two types, namely shortening and lengthening. Muscles shorten, pulling the weight towards the body center, and muscles lengthen, keeping the weight away from the body center. So we divide the nodes in the neighborhood set N(v_ti) into two categories, the proximal set and the distal set:

Proximal = {v_tj | d(v_tj, v_center) < d(v_ti, v_center), v_tj ∈ N(v_ti)}
Distal   = {v_tj | d(v_tj, v_center) > d(v_ti, v_center), v_tj ∈ N(v_ti)}   (6)

where we regard the neck node as the center v_center of the body. The value of A^S_ij depends on which set the neighbor node v_tj belongs to. The v_ti in the t-th frame are stacked to form H^(0).

Temporal GCN. Temporal GCN captures the temporal dynamics of the same joint node in different frames. We construct the temporal graph G_i = (V_i, E_i) on the i-th joint node across frames, where V_i = {v_ti | t = 1, 2, ..., T} and E_i = {e_tt' | t, t' = 1, 2, ..., T, t ≠ t'}. We only consider M frames around the t-th frame, namely

A^T_tt' = 1 if t - (M-1)/2 ≤ t' ≤ t + (M-1)/2, and 0 otherwise.   (7)

The v_ti on the i-th joint node are stacked to form H^(0).

3.3

337

Training Loss

To train an end-to-end model, we calculate the loss using the cross-entropy loss: L = L1 (y G , yG ) + λL2 (y I , yI )

(8)

where L1 and L2 are the cross-entropy loss, y G and yG are the ground truth and the prediction value of the group activity, y I and yI are the ground truth and the prediction value of the individual action, and λ is a hyper-parameter balancing the two loss.

4

Experiments

In this section, we conduct experiments on the Collective Activity dataset and the Volleyball dataset. First, we introduce the two datasets which are widely used in group activity recognition. Second, we provide the implementation details of our proposed model. Third, we compare our proposed method with state-ofthe-art methods. Finally, we present the ablation study and visualization of our model to verify the effectiveness of our proposed modules. 4.1

Datasets

Collective Activity Dataset. The Collective Activity dataset [6] consists of 44 clips composed of frames that range from 194 to 1814. The middle frame of every ten frames contains the bounding box coordinates annotations and individuals’ action labels (NA, waiting, talking, queuing, crossing, and walking). Actions of most individuals in the same scene determine the group activity (waiting, talking, queuing, crossing, and walking). We follow [29,31,32] to merge the label crossing with walking to the label moving. The test set is composed of 1/3 of the video clips and the training set is composed of the rest of the video clips following [24]. Volleyball Dataset. The Volleyball dataset [15] consists of multiple volleyball clips whose length is 41 frames. The middle frame of each clip contains bounding box coordinates, individual action labels, and group activity labels. Individual action labels contain 9 actions: setting, digging, falling, jumping, blocking, moving, spiking, waiting, and standing. Group activity labels contain 8 activities, namely right set, right pass, right spike, right winpoint, left set, left pass, left spike, and left winpoint. We split the dataset into the training set composed of 3493 clips and the testing set composed of 1337 clips. 4.2

Implementation Details

Referring to the previous methods, we resize the images in the Volleyball dataset to H ×W = 720×1280 and images in the Collective Activity dataset to H ×W = 480 × 720. We select T = 10 frames in the clips, which contain 5 frames before

338

X. Zhai et al.

Table 1. Comparison with state-of-the-art models on the Collective Activity dataset(CAD) and Volleyball dataset(VD) using Multi-class Classification Accuracy(MCA) and Mean Per Class Accuracy(MPCA) metrics. Method

MCA-CAD MPCA-CAD MCA-VD MPCA-VD

HDTM [16] SBGAR [20] StagNet [24] CRM [2] ARG [30] M. Ehsanpour et al. [9] PRL [14] Actor-Transformer [11] DIN [38]

81.5 86.1 89.1 85.8 91.0 89.4 91.0 -

94.2 93.8 95.9

81.9 66.9 89.3 93.0 92.6 93.1 91.4 93.5 93.6

91.8 93.8

Ours-Image Ours-Skeleton Ours-Image+Skeleton

93.7 95.0 97.1

93.1 92.8 96.1

92.9 83.9 94.0

93.5 83.0 94.4

the middle frames and 4 frames after the middle frames. In the Image Branch, we adopt pretrained VGG16 [25] as the backbone and apply RoIAlign with the crop size K × K = 5 × 5 on the feature map. After the fully connected layer, the dimension of individuals’ features d is 1024. We set scene encoded individual feature dimension dE = 1024. For simplicity, the number of layers in Scene Encoded/Spatial/Temporal Transformer is set as 1. In the Skeleton Branch, we adopt HRNet model pose_hrnet_w32 which is pretrained on the COCO dataset. Given the bounding boxes of individuals, we crop images and resize them to H × W = 256 × 192. For the training loss, we set λ = 1 to balance two tasks. In addition, we set the dropout rate as 0.3 to reduce overfitting. For training on the Volleyball dataset, we adopt Adam optimizer with β1 = 0.9, β2 = 0.999,  = 10−8 . We finetune the backbone network with the learning rate 10−5 in 200 epochs and then train the whole model in 30 epochs with the learning rate ranging from 10−4 to 10−5 . For training on the Collective Activity dataset, we adopt Adam optimizer with the same hyper-parameters. We finetune the backbone network with the same learning rate in 100 epochs and then train the whole model in 30 epochs with the fixed learning rate 10−5 . After training the Image Branch and Skeleton Branch separately, we freeze the parameters in the two branches, concatenate the group features extracted by the two branches, and train the classifier layer to obtain the result of the fusion. 4.3

Comparison with the State-of-the-Arts

Collective Activity Dataset. To verify the effectiveness of our model, we compare our model with the state-of-the-art models on the Collective Activity

Image and Skeleton Based Group Activity Recognition

339

Fig. 3. (a) The confusion matrix of our proposed model on Collective Activity dataset. (b) The confusion matrix of our proposed model on Volleyball dataset.

dataset. The result measured by the Multi-class Classification Accuracy(MCA) and Mean Per Class Accuracy(MPCA) metrics is shown in Table 1. We list the results of the Image Branch, the Skeleton Branch, and the fusion of the two branches. Our Image Branch model and Skeleton Branch model achieves 93.7% MCA and 95.0% MCA respectively, and outperform other methods. Intra-person spatial-temporal dynamics extracted in the Skeleton Branch play a pivotal role in understanding individual and group behaviors. And inter-person relation captured in the Image Branch also models important contextual interaction. The fusion of two branches surpasses all methods by 6.1% MCA and 0.2% MPCA, which demonstrates that the two branches play a complementary role. To show the effectiveness of our model, we draw the confusion matrix on the Collective Activity dataset shown in Fig. 3(a). It indicates that our model can distinguish well on most classes. Volleyball Dataset. We further evaluate the effectiveness of our model on the Volleyball dataset and list the results compared with the state-of-the-art models in Table 1. MCA and MPCA metrics are also used to evaluate the results on the Volleyball dataset. Our Image Branch model achieves 92.9% MCA and 93.5% MPCA. It shows that context encoded features and spatial-temporal relations between individuals are pivotal for inferring group features. Our skeleton branch model performs not well on the Volleyball dataset, as we think there are overlapping instances in some images, which will lead to a decline of recognition accuracy. The fusion of two branches achieves 94.0% MCA and 94.4% MPCA and outperforms previous methods, which also demonstrates that inter-person and intra-person relations are complementary. As shown in Fig. 3(b), the confusion matrix of the Volleyball dataset shows that the accuracy of our model achieves 90% in most classes. Our model rarely confuses the left and right sides because our model integrates the whole scene feature with individuals’ features and captures the relation between individuals.

340

X. Zhai et al.

Table 2. Ablation study of the Image Branch on the Volleyball dataset. The base model contains the backbone network, utilizes RoIAlign to extract individuals’ features, and applies a classifier layer to obtain the group activity. Scene encoded transformer Spatial transformer Temporal transformer MCA MPCA 88.0

88.6

92.5

93.1

92.2

92.6



92.5

92.8



92.9

93.5

 



 



Table 3. Ablation study of the skeleton branch on the collective activity dataset. SGCN denotes Spatial GCN module. TGCN denotes Temporal GCN module. STGCN denotes a layer composed of Spatial and Temporal GCN. Components

SGCN

SGCN+TGCN Layers of STCGN

-

-

MCA/MPCA 53.7/42.2 69.9/50.6

4.4

L=2

L=4

L=6

L=8

L=10

L=12

78.0/67.9 86.7/77.0 87.2/78.7 88.2/79.0 95.0/92.8 93.7/90.0

Ablation Studies

To evaluate the effectiveness of our proposed modules, we perform ablation study using MCA and MPCA metrics. Scene Encoded Transformer. To study the performance of Scene Encoded Transformer, we append this module to the base model. As shown in Table 2, Scene Encoded Transformer improves 4.5% MCA and 4.5% MPCA. Considering that individuals always interact with the scene, the scene contains latent information related to individuals’ actions and group activity. Hence, utilizing relations between individuals and the scene can enhance the group representation. Spatial and Temporal Transformer. In Table 2, Spatial and Temporal Transformer are appended after Scene Encoded Transformer, respectively. Using Spatial or Temporal Transformer alone impacts negatively on the effectiveness of our model, as we consider that spatial and temporal dependency are not considered simultaneously. After the concatenation of Spatial Transformer and Temporal Transformer, our model achieves 92.9% MCA and 93.5% MPCA on the Volleyball dataset, which indicates that fusing the complex spatial-temporal interaction can improve the effectiveness of our model. Spatial and Temporal GCN. We conduct the ablation study on the Skeleton Branch to verify the effectiveness of Spatial and Temporal GCN. As can be seen in Table 3, adding Temporal GCN after Spatial GCN improves the performance from 42.2% MPCA to 50.6% MPCA. It demonstrates that integrating spatial

Image and Skeleton Based Group Activity Recognition

left spike

frame0

left spike

frame4

left spike

341

frame9

Fig. 4. Visualization results of the attention matrix in Scene Encoded Transformer on the Volleyball dataset.

and temporal intra-person relations is fundamental to modeling individuals’ representations. In addition, we gradually increase the number of layers, and observe that the performance achieves the best result when the number of layers is set to 10. Stacked layers of Spatial and Temporal GCN can jointly exploit intra-person spatial-temporal relations and integrate them into individuals’ features. 4.5

Visualization

Scene Encoded Transformer. We visualize the attention map of Scene Encoded Transformer on the Volleyball dataset in Fig. 4. Scene Encoded Transformer emphasizes the location of the volleyball, spectator seats, and referees on the volleyball court. We observe that Scene Encoded Transformer pays more attention to the volleyball at the moment of individual spiking the volleyball. It is clear that key scene features can be captured by Scene Encoded Transformer. Spatial and Temporal Transformer. Visualization results of the left spike activity on the Volleyball dataset are shown in Fig. 5. We observe that individual 4 is spiking the volleyball in frame 0. And Spatial Transformer highlights individual 4 and relation with others. This shows that Spatial Transformer pays attention to the key instances and utilizes relations with each other to infer the group activity. In frame 9, individual 6 who is blocking the ball is highlighted by Spatial Transformer. Interaction between individual 6 and individual 4 can also confirm that the group activity is left spike. Additionally, Temporal Transformer takes into account the inter-frame influence. Spatial and Temporal GCN. As shown in Fig. 6, we project group features extracted by Spatial and Temporal GCN on the Collective Activity dataset to 2D dimensions using t-SNE [22]. Spatial and Temporal GCN aggregate intraperson spatial and temporal dynamics to the group representations and cluster the representations well. As the layers get deeper, group features are more discriminative. These visualization results verify the effectiveness of our Spatial and Temporal GCN in the Skeleton Branch.


Fig. 5. Visualization results of the attention matrix in Spatial and Temporal Transformer on the Volleyball dataset.

Fig. 6. From left to right: t-SNE visualizations of 2, 6 and 10 layers of the Spatial and Temporal GCN on the Collective Activity dataset.

5 Conclusion

This paper proposes a spatial-temporal network with two branches taking RGB images and human skeletons as input. For RGB image inputs, the Scene Encoded Transformer is proposed to incorporate scene features into individuals' features, and the Spatial and Temporal Transformers are designed to extract spatial and temporal information, respectively. Furthermore, we utilize human skeletons as an additional input and capture spatial and temporal dynamics with the Spatial and Temporal GCN. Experiments demonstrate that our proposed model achieves outstanding performance while capturing spatial and temporal dependencies at the intra-person and inter-person levels.


References
1. Amer, M.R., Lei, P., Todorovic, S.: HiRF: hierarchical random field for collective activity recognition in videos. In: ECCV 2014. LNCS, vol. 8694, pp. 572–585. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_37
2. Azar, S.M., Atigh, M.G., Nickabadi, A., Alahi, A.: Convolutional relational machine for group activity recognition. In: CVPR, pp. 7892–7901 (2019). https://doi.org/10.1109/CVPR.2019.00808
3. Bagautdinov, T.M., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In: CVPR, pp. 3425–3434 (2017). https://doi.org/10.1109/CVPR.2017.365
4. Choi, W., Savarese, S.: A unified framework for multi-target tracking and collective activity recognition. In: ECCV 2012. LNCS, vol. 7575, pp. 215–230. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_16
5. Choi, W., Savarese, S.: Understanding collective activities of people from videos. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1242–1257 (2014). https://doi.org/10.1109/TPAMI.2013.220
6. Choi, W., Shahid, K., Savarese, S.: What are they doing?: collective activity classification using spatio-temporal relationship among people. In: ICCV Workshops, pp. 1282–1289 (2009). https://doi.org/10.1109/ICCVW.2009.5457461
7. Choi, W., Shahid, K., Savarese, S.: Learning context for collective activity recognition. In: CVPR, pp. 3273–3280 (2011). https://doi.org/10.1109/CVPR.2011.5995707
8. Deng, Z., Vahdat, A., Hu, H., Mori, G.: Structure inference machines: recurrent neural networks for analyzing relations in group activity recognition. In: CVPR, pp. 4772–4781 (2016). https://doi.org/10.1109/CVPR.2016.516
9. Ehsanpour, M., Abedin, A., Saleh, F., Shi, J., Reid, I., Rezatofighi, H.: Joint learning of social groups, individuals action and sub-group activities in videos. In: ECCV 2020. LNCS, vol. 12354, pp. 177–195. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_11


10. Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., Schmid, C.: VectorNet: encoding HD maps and agent dynamics from vectorized representation. In: CVPR, pp. 11522–11530 (2020). https://doi.org/10.1109/CVPR42600.2020.01154
11. Gavrilyuk, K., Sanford, R., Javan, M., Snoek, C.G.M.: Actor-transformers for group activity recognition. In: CVPR, pp. 836–845 (2020). https://doi.org/10.1109/CVPR42600.2020.00092
12. Hajimirsadeghi, H., Yan, W., Vahdat, A., Mori, G.: Visual recognition by counting instances: a multi-instance cardinality potential kernel. In: CVPR, pp. 2596–2605 (2015). https://doi.org/10.1109/CVPR.2015.7298875
13. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961–2969 (2017)
14. Hu, G., Cui, B., He, Y., Yu, S.: Progressive relation learning for group activity recognition. In: CVPR, pp. 977–986 (2020). https://doi.org/10.1109/CVPR42600.2020.00106
15. Ibrahim, M.S., Mori, G.: Hierarchical relational networks for group activity recognition and retrieval. In: ECCV 2018. LNCS, vol. 11207, pp. 742–758. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_44
16. Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: CVPR, pp. 1971–1980 (2016). https://doi.org/10.1109/CVPR.2016.217
17. Lan, T., Sigal, L., Mori, G.: Social roles in hierarchical models for human activity recognition. In: CVPR, pp. 1354–1361 (2012). https://doi.org/10.1109/CVPR.2012.6247821
18. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: CVPR, pp. 3595–3603 (2019). https://doi.org/10.1109/CVPR.2019.00371


19. Li, S., Cao, Q., Liu, L., Yang, K., Liu, S., Hou, J., Yi, S.: GroupFormer: group activity recognition with clustered spatial-temporal transformer. In: ICCV, pp. 13668–13677 (2021)
20. Li, X., Chuah, M.C.: SBGAR: semantics based group activity recognition. In: ICCV, pp. 2895–2904 (2017). https://doi.org/10.1109/ICCV.2017.313
21. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: CVPR, pp. 143–152 (2020)
22. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
23. Pramono, R.R.A., Chen, Y.T., Fang, W.H.: Empowering relational network by self-attention augmented conditional random fields for group activity recognition. In: ECCV 2020. LNCS, vol. 12346, pp. 71–90. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_5
24. Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., Van Gool, L.: stagNet: an attentive semantic RNN for group activity recognition. In: ECCV 2018. LNCS, vol. 11214, pp. 104–120. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_7
25. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
26. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR, pp. 5693–5703 (2019). https://doi.org/10.1109/CVPR.2019.00584
27. Tamura, M., Vishwakarma, R., Vennelakanti, R.: Hunting group clues with transformers for social group activity recognition. CoRR abs/2207.05254 (2022). https://doi.org/10.48550/arXiv.2207.05254
28. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
29. Wang, M., Ni, B., Yang, X.: Recurrent modeling of interaction context for collective activity recognition. In: CVPR, pp. 7408–7416 (2017). https://doi.org/10.1109/CVPR.2017.783
30. Wu, J., Wang, L., Wang, L., Guo, J., Wu, G.: Learning actor relation graphs for group activity recognition. In: CVPR, pp. 9964–9974 (2019). https://doi.org/10.1109/CVPR.2019.01020


31. Yan, R., Tang, J., Shu, X., Li, Z., Tian, Q.: Participation-contributed temporal dynamic model for group activity recognition. In: ACM Multimedia (MM 2018), pp. 1292–1300 (2018). https://doi.org/10.1145/3240508.3240572
32. Yan, R., Xie, L., Tang, J., Shu, X., Tian, Q.: HiGCIN: hierarchical graph-based cross inference network for group activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2020). https://doi.org/10.1109/TPAMI.2020.3034233
33. Yan, R., Xie, L., Tang, J., Shu, X., Tian, Q.: Social adaptive module for weakly-supervised group activity recognition. In: ECCV 2020. LNCS, vol. 12353, pp. 208–224. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_13
34. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI, pp. 7444–7452 (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17135
35. Yao, Y., Atkins, E., Roberson, M.J., Vasudevan, R., Du, X.: Coupling intent and action for pedestrian crossing behavior prediction. arXiv preprint arXiv:2105.04133 (2021)
36. Yau, T., Malekmohammadi, S., Rasouli, A., Lakner, P., Rohani, M., Luo, J.: Graph-SIM: a graph-based spatiotemporal interaction modelling for pedestrian action prediction. In: ICRA, pp. 8580–8586 (2021)
37. Yuan, H., Ni, D.: Learning visual context for group activity recognition. In: AAAI, pp. 3261–3269 (2021). https://ojs.aaai.org/index.php/AAAI/article/view/16437
38. Yuan, H., Ni, D., Wang, M.: Spatio-temporal dynamic inference network for group activity recognition. In: ICCV, pp. 7476–7485 (2021)
39. Yuan, Y., Weng, X., Ou, Y., Kitani, K.: AgentFormer: agent-aware transformers for socio-temporal multi-agent forecasting. In: ICCV, pp. 9793–9803 (2021). https://doi.org/10.1109/ICCV48922.2021.00967
40. Zhou, H., et al.: COMPOSER: compositional reasoning of group activity in videos with keypoint-only modality. In: ECCV 2022 (2022). https://doi.org/10.1007/978-3-031-19833-5_15

Learning Using Privileged Information for Zero-Shot Action Recognition

Zhiyi Gao1, Yonghong Hou1(B), Wanqing Li2, Zihui Guo1, and Bin Yu1

1 School of Electrical and Information Engineering, Tianjin University, Tianjin, China
{zhiyigao,houroy,gzihui,yubin 1449508506}@tju.edu.cn
2 Advanced Multimedia Research Laboratory, University of Wollongong, Wollongong, Australia
[email protected]

Abstract. Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have never been seen during training. Most existing methods assume a shared semantic space between seen and unseen actions and intend to directly learn a mapping from a visual space to the semantic space. This approach is challenged by the semantic gap between the visual space and the semantic space. This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap and hence effectively assist the learning. In particular, a simple hallucination network is proposed to implicitly extract object semantics during testing without explicitly detecting objects, and a cross-attention module is developed to augment the visual feature with the object semantics. Experiments on the Olympic Sports, HMDB51 and UCF101 datasets show that the proposed method outperforms the state-of-the-art methods by a large margin.

Keywords: Action recognition · Zero-shot learning · Privileged information · Cross-attention

1 Introduction

Research on action recognition from videos has made rapid progress in the past years, with ceiling performance being reached on some datasets [1–5]. However, with the growing number of action categories, the traditional supervised approach suffers from a scalability problem. Moreover, annotating sufficient examples for the ever-growing new categories in the real world is expensive and time-consuming. Therefore, extending a well-trained model to new/unseen classes, known as Zero-Shot Action Recognition (ZSAR), has gained increasing interest recently. In general, ZSAR assumes that both seen and unseen classes of actions share a common semantic space, e.g. a high-dimensional vector space, and is achieved by learning a mapping from visual features to the semantic space. Once the mapping is learned, an unseen sample is classified by nearest neighbor search in the semantic space. There are a number of ways to define the semantic space.


Fig. 1. Proposed method vs. existing methods. An object detector pretrained on ImageNet is used to extract objects during training (purple line). At test time, the unseen actions are highly related to ImageNet classes. The previous methods (blue line) still require an object detector, whereas the proposed method (red line) uses a hallucination network to extract object-related information. (Color figure online)

Typical early works [6–8] define the semantic space using hand-crafted attributes. Recently, the semantic space is often defined as the word embedding space of action names, i.e. word2vec, using a pretrained language model [9]. As for the mapping, directly learning it from seen actions has been studied in [10]. However, visual observations and action names or hand-crafted attributes are two different modalities, and the semantic gap between them challenges the effectiveness and robustness of learning such a mapping from visual features to semantic representations.

To mitigate the semantic gap, several approaches have been reported recently. One approach is to define a semantic space using example images [11], e.g. considering the visual features of the images as the semantic space. The second approach is to amend the visual feature with semantic information. For instance, in [12], a pretrained object detector is employed to recognize objects in action videos, and the embedding of the object names is treated as the visual representation without considering the spatio-temporal information of the actions. In [13], objects are also extracted, and the embedding of their names is concatenated with the feature representing poses to form the visual feature. Since such a visual feature carries some amount of semantic information, the semantic gap is expected to be narrowed. It has to be pointed out that both [12] and [13] require an object detector to extract objects and word-embed their names during testing. The detector is


pretrained on a large dataset such as ImageNet [14]. This practice raises a question about the validity of these methods as truly ZSAR, because the large-scale dataset used to train the object detector likely contains images highly related to unseen action classes (see Fig. 1). For instance, images representing the actions "Ride Bike" and "Horse Riding" are found in ImageNet [14], and these actions are likely to be considered unseen actions in the random training-test split of ZSAR evaluation. Nevertheless, using information about objects relevant to actions is a promising strategy. It not only narrows the semantic gap via word embedding of the extracted object names, but also provides information in addition to the spatio-temporal visual information of the actions. For a true ZSAR, it is desirable that an object detector pretrained on an external dataset is not employed during testing, both to save computation and to avoid potential license fees in commercial applications. This paper presents such a method based on the paradigm of learning using privileged information (LUPI) [15]. Specifically, objects are annotated or extracted offline from seen actions, and their names are word-embedded into a vector in the visual space as privileged information (PI) during training. Unlike the methods in [12,13], our method does not need the object detector during the testing phase; instead, it uses a hallucination network to mimic the extraction of related semantic information. The output of the hallucination network is fused with the visual feature by a cross-attention module to narrow the semantic gap and assist the mapping from the visual feature to the semantic space. Experiments on the widely used Olympic Sports [16], HMDB51 [17] and UCF101 [18] datasets show that the proposed method achieves state-of-the-art results and outperforms the existing methods by a large margin.

The main contributions of this paper are as follows:
– A novel ZSAR method that implicitly leverages object information to narrow the semantic gap and assist learning of a mapping from visual features to a semantic space.
– A hallucination network trained with the privileged information of objects related to actions to "imitate" the object semantics during testing.
– A cross-attention module to fuse the visual feature with object semantics or the feature from the hallucination network.
– Extensive evaluation on the widely used Olympic Sports, HMDB51 and UCF101 benchmarks, achieving state-of-the-art results.

2 Related Work

2.1 Zero-Shot Action Recognition

In general, ZSAR is achieved by defining a common semantic space for seen and unseen actions and learning a mapping between the visual space and the semantic space. Regarding the semantic space, early works define it based on hand-crafted attributes [6–8]. However, the definition and annotation of hand-crafted attributes are subjective and labour-intensive. Recently, the word embedding


of action names [10,19,20] has been favorably adopted. In addition, textual descriptions of human actions or visual features extracted from still images have been studied as a semantic space [11]. In [21,22], semantically meaningful poses extracted from textual instructions associated with actions are also used to define a semantic space. Chen and Huang [23] use a textual description of actions, called Elaborative Description (ED), and embed it to define a semantic space. As for the visual space, it is generally defined through hand-crafted features [24] (e.g. the Dense Trajectory Features (DTF) [25] and the Improved Dense Trajectories (IDT) [26]) or deep features [27] extracted with pretrained 3D CNN-based networks (e.g. the Convolutional 3D Network (C3D) [28] and the Inflated 3D Network (I3D) [2]). After constructing the semantic space and the visual space, most existing works directly learn a mapping from the visual space to the semantic space [10,29]. However, this approach is challenged by the semantic gap between the two spaces. Recently, some works attempt to amend the visual feature with object semantics in order to narrow this gap. For example, Jain et al. [12] extract objects from action videos with a pretrained object detector and embed the object names to form a visual representation. In [13], object information and pose information are extracted simultaneously to form the visual feature. However, both of them utilize the pretrained object detector at test time, which is questionable, as discussed in the introduction. In contrast, the proposed method does not use any object detector during testing.

2.2 Learning Using Privileged Information

Vapnik and Vashist [15] introduced the paradigm of learning using privileged information (LUPI). It assumes that additional data, referred to as privileged information (PI), are available during training but not during testing. A number of approaches have been proposed for action recognition using privileged information. Niu et al. [30] utilize textual features extracted from the contextual descriptions of web images and videos as privileged information to train a robust action detector. Motiian et al. [31] explore several types of privileged information, such as motion information or 3D skeletons, to improve model learning. Crasto et al. [32] regard optical flow as privileged information alongside RGB for training, while only RGB is used at test time to avoid flow computation. Similarly, Garcia et al. [33] consider that depth data will not always be available when a model is deployed in real scenarios; therefore, depth data is used as privileged information along with RGB data during training, while only RGB data is employed at test time. Different from the above methods, the proposed method uses object semantics from seen actions as privileged information to assist learning of the visual feature as well as a mapping between the visual feature and the semantic space. In addition, a hallucination network is proposed to implicitly extract semantics during testing instead of relying on a pretrained object detector.


Fig. 2. The illustration of the proposed framework. It takes the visual feature and the corresponding object semantics as input during training. A hallucination network is simultaneously trained to implicitly extract object-related semantics and replace the object detector at test time. A cross-attention module is designed to fuse the visual feature with object semantics or the feature from the hallucination network. At test time, nearest neighbor search is used for prediction in the semantic space. Gray blocks represent modules that are fixed during training; colored (green, red, purple) blocks indicate modules that are trained. (Color figure online)

3 Method

3.1 Overview

Let the training set $D_s = \{(v_s, y_s) \mid v_s \in V_s, y_s \in Y_s\}$ consist of videos $v_s$ in the video space $V_s$ with labels from the seen action classes $Y_s$. Similarly, $D_u = \{(v_u, y_u) \mid v_u \in V_u, y_u \in Y_u\}$ is the test set, which consists of videos $v_u$ with labels from the unseen action classes $Y_u$. Seen and unseen action classes are disjoint, i.e. $Y_s \cap Y_u = \emptyset$. The task of ZSAR is to train a model only on $D_s$ and predict the class labels of the unseen videos $v_u$. The proposed architecture is illustrated in Fig. 2, where the embedding space $f_y$ of action names is used as the shared semantic space and the visual feature $f_v$ extracted by the action network defines the visual space. In addition, the object semantics $f_o$ of each seen action class are utilized as privileged information to assist learning of a mapping from the visual feature to the semantic space. The object semantics $f_o$ are the word embedding of the objects $o$, which may be obtained offline using an object detector. Specifically, in the training stage, the video feature $f_v$ and the object semantics $f_o$ are fed to a cross-attention module and fused into a joint representation $f_{\hat{v}\hat{o}}$, which is matched with the semantic representation $f_y$ in the semantic space. Meanwhile, a hallucination network is trained to "imitate" the object semantics $f_o$. At test time, the


trained hallucination network is used to implicitly extract the object-related semantics $f_o'$, which are fused with the visual feature $f_v$ by the cross-attention module to obtain the joint representation $f_{\hat{v}\hat{o}}$ as well. Finally, nearest-neighbor search in the semantic space is used to recognize the unseen action class. In the rest of this section, the five key components of the proposed framework, namely video and semantic embedding, object embedding, the hallucination network, the cross-attention module, and training and testing, are described in detail.

3.2 Video and Semantic Embedding

Video Embedding. Given a video clip $v \in \mathbb{R}^{T \times H \times W}$ of $T$ frames with height $H$ and width $W$, the action network consists of a pretrained R(2+1)D [10] network followed by a fully connected layer, which extracts a spatio-temporal video feature $f_v \in \mathbb{R}^{300}$.

Semantic Embedding. Following the majority of works [19], word2vec is applied to embed action names as the semantic representation. Specifically, the skip-gram language model [9] trained on 100 billion words from Google News articles is used to map each word into a 300-dimensional semantic space. For a class name containing multiple words, the vectors of all words are averaged to obtain the semantic embedding. That is, a class name consisting of $n$ words $y = \{y_1, \cdots, y_n\}$ is embedded as $f_y = \frac{1}{n}\sum_{i=1}^{n} \mathrm{W2V}(y_i) \in \mathbb{R}^{300}$.
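As a minimal illustration of the semantic embedding step, the sketch below averages pretrained word2vec vectors over the words of a class name with gensim. The model file name and the example class name are assumptions for illustration only.

```python
# Sketch of the semantic embedding f_y: average the word2vec vectors of the
# words in an action class name (any 300-d word2vec binary can be used).
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

def embed_class_name(name: str) -> np.ndarray:
    """Return the mean word vector of an action class name."""
    words = [w for w in name.lower().replace("_", " ").split() if w in w2v]
    return np.mean([w2v[w] for w in words], axis=0)   # 300-d vector

f_y = embed_class_name("ride bike")   # hypothetical class name
```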

3.3 Object Embedding as PI

Objects are extracted by a pretrained object detector from all video clips, with $t$ frames uniformly sampled per clip, and the top $k$ objects with the highest probabilities are reserved. In order to associate the extracted objects with action classes, we calculate the frequency of the objects extracted from all clips of the same action class and preserve the top $m$ objects $o = \{o_1, \cdots, o_m\}$. The objects are then word-embedded as the privileged information of each individual action class. Similarly, the object semantics of each action class can be represented by $f_o = \frac{1}{m}\sum_{i=1}^{m} \mathrm{W2V}(o_i) \in \mathbb{R}^{300}$.
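A small sketch of this per-class aggregation is given below: count detected object names over all clips of a class, keep the top-m, and average their word vectors. The names clip_objects and w2v are placeholders, not identifiers from the paper.

```python
# Sketch of building the object-semantic PI f_o for one action class.
from collections import Counter
import numpy as np

def object_semantic(clip_objects, w2v, m=5):
    """clip_objects: list of per-clip lists of detected object names
    for one action class; w2v: a word-vector lookup (e.g. gensim KeyedVectors)."""
    counts = Counter(obj for clip in clip_objects for obj in clip)
    top_m = [obj for obj, _ in counts.most_common(m) if obj in w2v]
    return np.mean([w2v[obj] for obj in top_m], axis=0)   # 300-d vector f_o
```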

3.4 Hallucination Network

Since the ImageNet dataset on which the object detector is pretrained contains categories highly related to unseen action classes, we argue that the object detector should not be employed during the testing phase for truly zero-shot recognition. To replace the object detector, we introduce a hallucination network, which is trained to "imitate" the object semantics. Thus, at test time it can extract the object-related semantics $f_o'$ to fuse with the visual feature. The hallucination network consists of a fixed pretrained R(2+1)D [10] network and four fully connected layers.
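A minimal PyTorch sketch of such a hallucination network is shown below: a frozen clip backbone followed by four fully connected layers that regress a 300-d vector meant to imitate the object semantics. The torchvision R(2+1)D-18 model is used here as a stand-in for the pretrained backbone, and the hidden layer widths are assumptions, not the paper's exact configuration.

```python
# Sketch of a hallucination network: frozen R(2+1)D backbone + 4 FC layers.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

class HallucinationNet(nn.Module):
    def __init__(self, emb_dim=300):
        super().__init__()
        backbone = r2plus1d_18(pretrained=True)   # stand-in pretrained backbone
        backbone.fc = nn.Identity()               # keep the 512-d clip feature
        for p in backbone.parameters():           # backbone stays fixed
            p.requires_grad = False
        self.backbone = backbone
        self.head = nn.Sequential(                # four fully connected layers
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        with torch.no_grad():
            feat = self.backbone(clip)
        return self.head(feat)                    # hallucinated semantics f_o'
```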


Fig. 3. Illustration of the proposed cross-attention module. It consists of a mutual-attention layer and a feature fusion operation. Visual features and object semantics, or features from the hallucination network, are fed to the attention layer to exchange information. The features are then fused to obtain the joint representations.

3.5 Cross-Attention Module

We propose a cross-attention module to fuse the visual feature $f_v$ with the object semantics $f_o$ (training stage) or the feature $f_o'$ from the hallucination network (testing stage) using a mutual-attention mechanism. As shown in Fig. 3, the proposed module contains two parts: a mutual-attention layer and a feature fusion operation. Taking the training phase as an example, the visual feature $f_v$ and the object semantics $f_o$ are first mapped to query vectors ($Q^v$/$Q^o$) and key vectors ($K^v$/$K^o$) of dimension $d_k$, and value vectors ($V^v$/$V^o$) of dimension $d_v$, using unshared learnable linear projections. Mutual attention is then applied to retrieve the information from the visual key $K^v$ and value $V^v$ related to the object-semantic query $Q^o$, and vice versa. The dot product of $Q^o$ and $K^v$ goes through a softmax function to obtain the attention weights, which form a weighted sum of $V^v$:

\hat{f}_v = \mathrm{Softmax}\left(\frac{Q^o (K^v)^\top}{\sqrt{d_k}}\right) V^v \qquad (1)

Similarly, we can calculate the attention weights on the object semantics with respect to the visual feature to obtain $\hat{f}_o$:

\hat{f}_o = \mathrm{Softmax}\left(\frac{Q^v (K^o)^\top}{\sqrt{d_k}}\right) V^o \qquad (2)

Such a mutual-attention scheme takes cross-modal interactions into consideration. It computes the visual-feature attention in the semantic modality and the object-semantic attention in the visual modality, and produces the corresponding cross-modal features. After the mutual-attention layer, the features from both modalities are fused via a simple feature addition:

f_{\hat{v}\hat{o}} = \hat{f}_v + \hat{f}_o \qquad (3)
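The sketch below is one way this mutual attention plus additive fusion could be written in PyTorch. It is written for token sequences of each modality; how the 300-d features are tokenized is not spelled out in this excerpt, so the sequence form, the mean pooling before the addition, and the layer sizes are all assumptions of the sketch rather than the paper's implementation.

```python
# Sketch of mutual attention (Eqs. 1-2) with additive fusion (Eq. 3).
import math
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=300, d_k=64, d_v=300):
        super().__init__()
        self.d_k = d_k
        # Unshared learnable linear projections for each modality.
        self.q_v, self.k_v, self.v_v = nn.Linear(dim, d_k), nn.Linear(dim, d_k), nn.Linear(dim, d_v)
        self.q_o, self.k_o, self.v_o = nn.Linear(dim, d_k), nn.Linear(dim, d_k), nn.Linear(dim, d_v)

    def attend(self, q, k, v):
        w = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return w @ v

    def forward(self, f_v, f_o):
        # f_v: (B, Nv, dim) visual tokens; f_o: (B, No, dim) object-semantic tokens.
        f_v_hat = self.attend(self.q_o(f_o), self.k_v(f_v), self.v_v(f_v))  # Eq. (1)
        f_o_hat = self.attend(self.q_v(f_v), self.k_o(f_o), self.v_o(f_o))  # Eq. (2)
        # Eq. (3): additive fusion; token-mean pooling aligns the two shapes
        # (the pooling is an assumption of this sketch).
        return f_v_hat.mean(dim=1) + f_o_hat.mean(dim=1)

# Example usage with single 300-d vectors per modality:
# fusion = CrossAttentionFusion()
# f_joint = fusion(f_v.unsqueeze(1), f_o.unsqueeze(1))   # (B, 300)
```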

3.6 Training and Testing

As illustrated in Fig. 2, in the training stage the visual feature $f_v$ and the object semantics $f_o$ are fused into a joint representation $f_{\hat{v}\hat{o}}$ in the semantic space, which is matched to the corresponding semantic representation $f_y$. The loss function $\mathcal{L}_{action}$ is defined as

\mathcal{L}_{action} = \left\| f_y - f_{\hat{v}\hat{o}} \right\|_2^2 \qquad (4)

In addition, a regression loss between the object semantics $f_o$ and the feature $f_o'$ from the hallucination network is defined as

\mathcal{L}_{hallucinate} = \left\| f_o - f_o' \right\|_2^2 \qquad (5)

The full model is trained end to end, and the overall loss of the proposed model is

\mathcal{L} = \mathcal{L}_{action} + \mathcal{L}_{hallucinate} \qquad (6)

At test time, a test video goes through both the action network and the hallucination network to produce the visual feature $f_v$ and the object-related semantics $f_o'$. After the cross-attention module, a joint representation $f_{\hat{v}\hat{o}}$ is obtained. Finally, nearest-neighbor search in the semantic space predicts the label $P$ of the unseen action class as

P = \arg\min_{y \in Y_u} \cos(f_{\hat{v}\hat{o}}, f_y) \qquad (7)

where $\cos(\cdot)$ is the cosine distance.
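For concreteness, a minimal sketch of the training objective (Eqs. 4–6) and the nearest-neighbor prediction (Eq. 7) is given below. All tensor names (f_joint, f_y, f_o_hat, f_o, class_embs) are placeholders for the quantities defined above, not identifiers from the paper's code.

```python
# Sketch of the overall loss and nearest-neighbour inference.
import torch
import torch.nn.functional as F

def total_loss(f_joint, f_y, f_o_hat, f_o):
    l_action = ((f_joint - f_y) ** 2).sum(dim=-1).mean()    # Eq. (4)
    l_halluc = ((f_o_hat - f_o) ** 2).sum(dim=-1).mean()    # Eq. (5)
    return l_action + l_halluc                              # Eq. (6)

@torch.no_grad()
def predict(f_joint, class_embs):
    """Nearest-neighbour search in the semantic space (Eq. 7).
    f_joint: (B, 300); class_embs: (C, 300) embeddings of unseen class names."""
    sim = F.cosine_similarity(f_joint.unsqueeze(1),
                              class_embs.unsqueeze(0), dim=-1)   # (B, C)
    # Smallest cosine distance corresponds to the largest cosine similarity.
    return sim.argmax(dim=1)
```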

4 Experiments

4.1 Setup

Datasets and Splits. The proposed method is extensively evaluated on the Olympic Sports [16], HMDB51 [17] and UCF101 [18] datasets. The Olympic Sports dataset consists of 783 videos divided into 16 sports actions. The HMDB51 dataset contains 6766 realistic videos distributed over 51 actions. The UCF101 dataset has 101 action classes with a total of 13,320 videos collected from YouTube. To compare the proposed method with state-of-the-art methods, we follow the same 50/50 data splits as used in [29], i.e., videos of 50% of the categories are used for training and the remaining 50% of the categories are considered unseen for testing. Specifically, experiments are conducted with 8/8, 26/25 and 51/50 splits for Olympic Sports, HMDB51 and UCF101, respectively. For each case, 30 independent splits are randomly generated for each dataset [34], and the average accuracy and standard deviation are reported for experimental evaluation.


ZSAR Settings. There are two ZSAR settings: the inductive setting and the transductive setting. The former assumes that only the labeled videos from the seen categories are available during training, while the latter can also use the unlabeled data of the unseen categories for model training. In this work, we focus on inductive ZSAR [10,13,23,35] and do not discuss the transductive approach [7,29].

4.2 Implementation Details

In the experiments, we use the pretrained R(2+1)D network [10] followed by a fully connected layer to extract spatio-temporal features. The object detector is the BiT image model [36] pretrained on ImageNet-21k [14]. When extracting objects, the number of uniformly sampled frames t per video clip is set to 8, the number of top objects k reserved from the action clips is set to 20, and the number of top objects m per action class is set to 5. Both the action classes and the extracted objects are embedded using the skip-gram word2vec model [9], which is pretrained on the Google News dataset and generates a 300-dimensional vector representation for each word. For the input videos, we select a clip of 16 frames from each video, uniformly sampled in time, and each frame is cropped to 112 × 112 pixels. The experiments are conducted on two TitanX GPUs. The networks are trained for 10 epochs on the HMDB51 and UCF101 datasets and 20 epochs on the Olympic Sports dataset with a batch size of 16. We use the Adam optimizer [37] to train all networks. The learning rate is initially set to 1e-4 and is decayed by a factor of 0.5 every 5 epochs. The average Top-1 and Top-5 accuracy (%) ± standard deviation are reported.

Table 1. ZSAR performance on the three existing benchmarks compared with state-of-the-art methods. VE and SE represent Visual Embedding and Semantic Embedding. FV represents Fisher Vectors, Obj represents objects. A represents human-annotated attribute vectors, W represents the word2vec embedding, and ED represents elaborative description. The average Top-1 accuracy (%) ± standard deviation is reported.

Methods       VE       SE  Olympic Sports  HMDB51        UCF101
O2A [12]      Obj      W   N/A             15.6          30.3
MTE [38]      FV       W   44.3 ± 8.1      19.7 ± 1.6    15.8 ± 1.3
ASR [11]      C3D      W   N/A             21.8 ± 0.9    24.4 ± 1.0
GMM [34]      C3D      W   N/A             19.3 ± 2.1    17.3 ± 1.1
UAR [39]      FV       W   N/A             24.4 ± 1.6    17.5 ± 1.6
CEWGAN [40]   I3D      W   50.5 ± 6.9      30.2 ± 2.7    26.9 ± 2.8
TS-GCN [41]   GCN      W   56.5 ± 6.6      23.2 ± 3.0    34.2 ± 3.1
BD-GAN [35]   C3D      A   49.80 ± 10.72   23.86 ± 2.95  18.73 ± 3.73
E2E [10]      R(2+1)D  W   N/A             32.7          48
VDARN [13]    GCN      W   57.6 ± 3.4      21.6 ± 2.8    26.4 ± 2.8
ER [23]       TSM      ED  60.2 ± 8.9      35.3 ± 4.6    51.8 ± 2.9
Ours          R(2+1)D  W   61.9 ± 7.8      38.8 ± 4.6    52.6 ± 2.4

4.3 Comparison with the State-of-the-Art

The comparison results are shown in Table 1. Overall, the proposed method performs best against state-of-the-art methods on the three widely used datasets [16–18]. When using the same word embedding method, our approach outperforms the recent VDARN [13] and TS-GCN [41] methods on all three datasets. Compared with attribute-based [35] and ED-based [23] methods, which are not scalable and are labour-intensive, the proposed method uses the word embedding of action names as the semantic space and still obtains better performance. Compared with the E2E [10] method, which also uses the R(2+1)D network and directly maps from visual features to the semantic space, our method improves by 6.1 and 4.6% points on the HMDB51 and UCF101 datasets, respectively, indicating that amending the visual feature with semantic information can effectively narrow the semantic gap. Note that the O2A [12] and VDARN [13] methods also use object information as part of the visual feature, but they use an object detector during testing as well. Our method uses a hallucination network to extract object-related information during testing and still performs better than them. In addition, the proposed method not only outperforms the very recent ER [23] method in classification accuracy, but also has a smaller standard deviation, which indicates that our method has relatively stable performance under different training and testing data partitions.

Table 2. Ablation study of the proposed object semantics and hallucination network on the HMDB51 and UCF101 datasets. "Baseline" refers to directly mapping visual features to the semantic space; "OS (Object Semantics)" refers to extracting object semantics with an object detector and fusing them with visual features during training, while the detector is not used at test time; "HN (Hallucination Network)" refers to implicitly extracting object semantics and fusing them with visual features at test time. The average Top-1 and Top-5 accuracy (%) ± standard deviation are reported.

Baseline  OS  HN   HMDB51 Top-1  HMDB51 Top-5  UCF101 Top-1  UCF101 Top-5
✓         ×   ×    32.7 ± 3.0    56.3 ± 3.4    47.9 ± 2.2    72.4 ± 3.4
✓         ✓   ×    36.7 ± 4.3    65.6 ± 5.0    49.8 ± 2.5    76.9 ± 2.7
✓         ✓   ✓    38.8 ± 4.6    66.9 ± 3.7    52.6 ± 2.4    77.1 ± 2.9

4.4 Ablation Studies

Impact of Object Semantics and Hallucination Network. To verify the effectiveness of the proposed object semantics and hallucination network, ablation studies are conducted on HMDB51 and UCF101 datasets. As shown in the second row of Table 2, leveraging object semantics to augment visual feature only at the training stage can significantly improve classification accuracy on the two datasets. The hallucination network contributes further improvement by 4.9/2.1/2.8% points for Top-1 accuracy and 2.1/1.3/0.2% points for


Fig. 4. Accuracy comparison of each unseen action class on the HMDB51 dataset between the proposed method and baseline. Here, OS represents Object Semantics, HN represents Hallucination Network. The experimental settings of OS and HN are the same as Table 2.

Top-5 accuracy on the three datasets, respectively (see the last row of Table 2). Figure 4 further shows the accuracy comparison for each unseen action. It can be seen that the recognition accuracy of our method is significantly higher than the baseline for the red-marked categories, such as "shooting ball", "shooting bow", "sword", "ride bike", etc. These actions involve interaction with objects; thus the hallucination network extracts the semantics of the related objects to augment the visual feature and improve recognition. In addition, some actions such as "dive" (marked in blue) occur in a specific scene, and information about the scene is extracted by the hallucination network to augment the visual information. However, some actions such as "smile" and "shake hands" (marked in yellow) do not involve any interaction with the surrounding environment. In this case, the information extracted by the hallucination network may interfere with the visual feature. This interference can be minimized by the choice of objects used in training (see Sect. 4.4, Impact of Object Number).

Impact of Fusion Strategy. To verify that the proposed cross-attention module is better than other fusion strategies, we perform another ablation experiment on the HMDB51 and UCF101 datasets; the results are shown in Table 3. Compared with the multiplication, concatenation and addition fusion schemes, the proposed fusion strategy achieves the best performance. The reason is that the other fusion strategies neglect the modality discrepancy. The proposed cross-attention block can effectively mitigate this issue, in which the visual features and object semantics exchange information through the


Table 3. Ablation study of the proposed cross-attention module with different fusion manners on the HMDB51 and UCF101 datasets. The average Top-1 and Top-5 accuracy (%) ± standard deviation are reported.

Methods          HMDB51 Top-1  HMDB51 Top-5  UCF101 Top-1  UCF101 Top-5
Multiplication   34.1 ± 4.6    56.6 ± 3.6    49.4 ± 3.4    73.2 ± 3.6
Concatenation    35.3 ± 4.7    58.6 ± 5.4    51.8 ± 2.8    75.4 ± 3.4
Addition         37.1 ± 4.3    64.5 ± 4.4    51.9 ± 2.5    76.9 ± 3.5
Cross-attention  38.8 ± 4.6    66.9 ± 3.7    52.6 ± 2.4    77.1 ± 2.9

mutual-attention layer, reducing the feature variations. The cross-modal features are then fused to produce the joint feature representation. The results also show that the proposed cross-attention module has the smallest standard deviation on the UCF101 dataset.

Impact of Object Number. Figure 5 shows the impact of the number of objects used for each action class. The performance first increases as the number of objects grows and then decreases once the number of objects exceeds five on the HMDB51 dataset. This is probably because each action class is related to only a limited number of objects, so irrelevant objects become noise and degrade performance. However, the degradation is insignificant (only about 0.3% points) even when the number of objects is increased to nine.

Fig. 5. Evaluation of the proposed method with different numbers of objects on the HMDB51 dataset. The average Top-1 (a) and Top-5 (b) accuracy (%) are illustrated.

4.5 Qualitative Results

We further analyze the effectiveness of the proposed method via the qualitative visualizations of visual feature distribution in the semantic space [42], as shown in Fig. 6. For the sake of visualizations, we randomly sample 8 unseen


Fig. 6. The t-SNE visualization of the visual feature distribution of unseen classes of the three datasets. Different classes are shown as dots in different colors. Here, OS represents Object Semantics, HN represents Hallucination Network. The experimental settings of OS and HN are the same as Sect. 4.4.

classes from the Olympic Sports dataset and 10 unseen classes from the HMDB51 and UCF101 datasets. As can be seen from Fig. 6(a) and Fig. 6(b), with the aid of the object detector during training, the baseline network yields improved feature representations. In Fig. 6(c), we further use the hallucination network to mimic the object semantics and fuse them with visual features at test time, which produces the tightest intra-class clusters and the most separable inter-class clusters in the semantic space. The visualization results show that the proposed method can effectively use object semantics to amend the visual representation, thus alleviating the semantic gap.

5 Conclusions

In this paper, we propose a novel method for ZSAR. It is built upon the paradigm of learning using privileged information (LUPI). Specifically, object semantics are used as privileged information to narrow the semantic gap and assist learning of a mapping from the visual feature to a semantic space. A hallucination network is trained to implicitly extract object-related semantics at test time


instead of using a pretrained object detector. Moreover, a cross-attention module is designed to fuse the visual feature with object semantics or the feature from the hallucination network. Experiments on the Olympic Sports, HMDB51 and UCF101 benchmarks show that the proposed method achieves state-of-the-art results.

References
1. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR, pp. 1933–1941 (2016)
2. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
3. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: CVPR, pp. 4768–4777 (2017)
4. Wang, Y., Long, M., Wang, J., Yu, P.S.: Spatiotemporal pyramid network for video action recognition. In: CVPR, pp. 1529–1538 (2017)
5. Zhu, Y., et al.: A comprehensive study of deep video action recognition. arXiv preprint arXiv:2012.06567 (2020)
6. Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: CVPR, pp. 3337–3344 (2011)
7. Fu, Y., Hospedales, T.M., Xiang, T., Fu, Z., Gong, S.: Transductive multi-view embedding for zero-shot recognition and annotation. In: ECCV 2014. LNCS, vol. 8690, pp. 584–599. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_38
8. Wang, Q., Chen, K.: Zero-shot visual recognition via bidirectional latent embedding. Int. J. Comput. Vis. 124, 356–383 (2017)
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
10. Brattoli, B., Tighe, J., Zhdanov, F., Perona, P., Chalupka, K.: Rethinking zero-shot video classification: end-to-end training for realistic applications. In: CVPR, pp. 4613–4623 (2020)
11. Wang, Q., Chen, K.: Alternative semantic representations for zero-shot human action recognition. In: ECML PKDD 2017. LNCS (LNAI), vol. 10534, pp. 87–102. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71249-9_6
12. Jain, M., van Gemert, J.C., Mensink, T., Snoek, C.G.: Objects2action: classifying and localizing actions without any video example. In: ICCV, pp. 4588–4596 (2015)
13. Su, Y., Xing, M., An, S., Peng, W., Feng, Z.: VDARN: video disentangling attentive relation network for few-shot and zero-shot action recognition. Ad Hoc Netw. 113, 102380 (2021)
14. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)


15. Vapnik, V., Vashist, A.: A new learning paradigm: learning using privileged information. Neural Netw. 22, 544–557 (2009)
16. Niebles, J.C., Chen, C.W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: ECCV, pp. 392–405. Springer (2010)
17. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV, pp. 2556–2563 (2011)
18. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
19. Xu, X., Hospedales, T., Gong, S.: Semantic embedding space for zero-shot action recognition. In: ICIP, pp. 63–67 (2015)
20. Bishay, M., Zoumpourlis, G., Patras, I.: TARN: temporal attentive relation network for few-shot and zero-shot action recognition. arXiv preprint arXiv:1907.09021 (2019)
21. Zhou, L., Li, W., Ogunbona, P., Zhang, Z.: Semantic action recognition by learning a pose lexicon. Pattern Recogn. 72, 548–562 (2017)
22. Zhou, L., Li, W., Ogunbona, P., Zhang, Z.: Jointly learning visual poses and pose lexicon for semantic action recognition. IEEE Trans. Circuits Syst. Video Technol. 30, 457–467 (2019)
23. Chen, S., Huang, D.: Elaborative rehearsal for zero-shot action recognition. In: ICCV, pp. 13638–13647 (2021)
24. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: CVPR, pp. 951–958 (2009)
25. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV, pp. 3551–3558 (2013)
26. Wang, H., Oneata, D., Verbeek, J., Schmid, C.: A robust and efficient video representation for action recognition. Int. J. Comput. Vis. 119, 219–238 (2016)
27. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR, pp. 6450–6459 (2018)
28. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp. 4489–4497 (2015)
29. Xu, X., Hospedales, T., Gong, S.: Transductive zero-shot action recognition by word-vector embedding. Int. J. Comput. Vis. 123, 309–333 (2017)
30. Niu, L., Li, W., Xu, D.: Visual recognition by learning from web data: a weakly supervised domain generalization approach. In: CVPR, pp. 2774–2783 (2015)
31. Motiian, S., Piccirilli, M., Adjeroh, D.A., Doretto, G.: Information bottleneck learning using privileged information for visual recognition. In: CVPR, pp. 1496–1505 (2016)
32. Crasto, N., Weinzaepfel, P., Alahari, K., Schmid, C.: MARS: motion-augmented RGB stream for action recognition. In: CVPR, pp. 7882–7891 (2019)


33. Garcia, N.C., Morerio, P., Murino, V.: Learning with privileged information via adversarial discriminative modality distillation. IEEE Trans. Pattern Anal. Mach. Intell. 42, 2581–2593 (2020)
34. Mishra, A., Verma, V.K., Reddy, M.S.K., Arulkumar, S., Rai, P., Mittal, A.: A generative approach to zero-shot and few-shot action recognition. In: WACV, pp. 372–380 (2018)
35. Mishra, A., Pandey, A., Murthy, H.A.: Zero-shot learning for action recognition using synthesized features. Neurocomputing 390, 117–130 (2020)
36. Kolesnikov, A., et al.: Big Transfer (BiT): general visual representation learning. In: ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_29
37. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
38. Xu, X., Hospedales, T.M., Gong, S.: Multi-task zero-shot action recognition with prioritised data augmentation. In: ECCV 2016. LNCS, vol. 9906, pp. 343–359. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_22
39. Zhu, Y., Long, Y., Guan, Y., Newsam, S., Shao, L.: Towards universal representation for unseen action recognition. In: CVPR, pp. 9436–9445 (2018)
40. Mandal, D., et al.: Out-of-distribution detection for generalized zero-shot action recognition. In: CVPR, pp. 9985–9993 (2019)
41. Gao, J., Zhang, T., Xu, C.: I know the relationships: zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. Proc. AAAI Conf. Artif. Intell. 33, 8303–8311 (2019)
42. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

MGTR: End-to-End Mutual Gaze Detection with Transformer

Hang Guo, Zhengxi Hu, and Jingtai Liu(B)

Nankai University, Tianjin, China
{1911610,hzx}@mail.nankai.edu.cn, [email protected]

Abstract. People looking at each other, or mutual gaze, is ubiquitous in our daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods focus on two-stage approaches, whose inference speed is limited by the two-stage pipeline and whose second-stage performance is affected by the first stage. In this paper, we propose a novel one-stage mutual gaze detection framework called Mutual Gaze TRansformer, or MGTR, to perform mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer the mutual gaze relationship based on global image information, which streamlines the whole process. Experimental results on two mutual gaze datasets show that our method accelerates the mutual gaze detection process without losing performance. An ablation study shows that different components of MGTR capture different levels of semantic information in images. Code is available at https://github.com/Gmbition/MGTR.

Keywords: End-to-End mutual gaze detection · One-stage method · Mutual gaze instance match

1 Introduction

Containing rich information, gaze plays an important role in reflecting the attention, intention and emotion of a person [1,2]. Among all kinds of gaze, mutual gaze is indispensable in building the bridge between two minds [3,17]. From mutual gaze, one can infer the willingness to interact and the strength of the relationship. Moreover, mutual gaze can also be used for people-connection analysis in social scene interpretation and is an important clue for Human-Robot Interaction. For these reasons, achieving automatic mutual gaze detection is very promising. This work is supported in part by the National Key Research and Development Project under Grant 2019YFB1310604 and in part by the National Natural Science Foundation of China under Grant 62173189.

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_22.


The target of end-to-end mutual gaze detection is to detect all human heads in the scene and then recognize whether any two people are looking at each other. Previous studies [6,19,20] have achieved favorable results by dividing the whole process into two stages: detect all human heads in the scene, and then treat the recognition of mutual gaze as a binary classification problem. Specifically, Marin et al. proposed a video-based method [19] (Fig. 1(a)) that first detects all human heads with a pretrained head detector, then enumerates all head-crop pairs and feeds them into a classification network to identify whether two people are looking at each other. The image-based mutual gaze detection work of Doosti et al. [6] (Fig. 1(b)) uses pseudo 3D gaze to boost mutual gaze detection and also adopts a two-stage strategy.
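To make the cost structure of this second stage concrete, the toy sketch below enumerates every pair of detected head boxes before scoring them, which is why the computation grows quadratically with the number of people. The head_boxes list and pair_classifier callable are illustrative placeholders, not the actual detectors or classifiers used in [6,19].

```python
# Toy illustration of the two-stage pipeline's pairwise enumeration step.
from itertools import combinations

def mutual_gaze_two_stage(head_boxes, pair_classifier):
    """head_boxes: list of detected head boxes (x1, y1, x2, y2);
    pair_classifier: any callable scoring a pair of head crops/boxes."""
    scores = {}
    for i, j in combinations(range(len(head_boxes)), 2):   # O(N^2) pairs
        scores[(i, j)] = pair_classifier(head_boxes[i], head_boxes[j])
    return scores
```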

Fig. 1. An illustration of two-stage methods. The method in (a) uses a pretrained head detector and performs mutual gaze recognition by exploiting pose and position information of enumerated head-crop-pairs. The method in (b) also uses a head detector and utilizes pseudo 3D gaze to boost mutual gaze detection. Methods (a) and (b) both perform head detection in the first stage and need to enumerate all head-crop-pairs in one image, which slows down the inference process.

Although these two-stage methods have achieved promising results, their designs have some shortcomings. First, the classifier in the second stage makes inferences based on the local information of the head crops instead of the global image information; body posture, for example, is also an important clue for judging mutual gaze. Second, the classification performance of the second stage depends on the localization accuracy of the first stage. Furthermore, when there are many people in the scene, the computational cost increases because all detected heads must be enumerated, which slows down inference. To overcome these limitations, and inspired by the attention mechanism of the Transformer [25], we propose a one-stage mutual gaze detection model called Mutual Gaze TRansformer, or MGTR (Fig. 2), which detects all human heads in the scene and simultaneously identifies whether there is a mutual gaze based on the global image information. By designing mutual gaze instance triples, we turn the mutual gaze detection process from serial to parallel, which greatly accelerates inference without losing performance.

Fig. 2. An illustration of the proposed one-stage method. Utilizing the designed mutual gaze instance triples, our one-stage method detects all human heads and the corresponding mutual gaze labels in parallel, which accelerates the mutual gaze detection pipeline.

Our proposed MGTR consists of the following four modules: a Backbone for feature extraction, a Transformer Encoder, a Transformer Decoder, and a fully connected network for mutual gaze instance prediction. First, a convolutional neural network extracts image features; a one-dimensional sequence is then generated by flattening the feature map, and we combine it with the positional encoding [4,22] to obtain the Encoder input. After that, the output of the Encoder, combined with learnable mutual gaze queries, is passed through the Decoder to model connections between different people. Finally, the encoded mutual gaze queries are passed through a fully connected network to output the mutual gaze instances as the result of our model. Overall, our main contributions are as follows:

– We build a one-stage model called MGTR which combines human head detection and mutual gaze recognition. To the best of our knowledge, this is the first work that integrates the mutual gaze detection task into a one-stage method.
– We model the locations of people and the relationships between them using global image information instead of head image crops.
– Our model outperforms the state-of-the-art method on the end-to-end mutual gaze detection task. Moreover, MGTR performs mutual gaze detection faster.

2 Related Work

In this section, we first review methods for gaze estimation, which encompass a variety of gaze types (Sect. 2.1). We then turn to the literature on mutual gaze detection (Sect. 2.2), and finally review one-stage detection methods (Sect. 2.3).

2.1 Gaze Estimation in Social Scenarios

Eye gaze can convey rich information and is closely related to the attention, intention, and emotion of a person; even people from different cultures may share a similar understanding of eye gaze [10]. Recently, the computer vision community has also produced a lot of research on gaze estimation in social scenarios, with promising results. For example, Lian et al. [13] proposed a solution for predicting the gaze points of target persons. Fan et al. [7] proposed a spatial-temporal modeling method to detect people looking at the same target simultaneously. Zhuang et al. proposed MUGGLE [28], an approach suited to shared-gaze estimation for large numbers of people. To detect whether two people in a video are looking at each other, Marin et al. [19,20] proposed a method based on spatial and temporal information. To understand different types of eye gaze in a group of people, Fan et al. [8] proposed a method to detect multiple types of human gaze, such as single gaze, shared gaze, etc. In this work, we focus on image-based one-stage mutual gaze detection.

2.2 Mutual Gaze Detection

Mutual gaze is one of the most common types of gaze communication in social scenarios, and several methods have attempted automatic mutual gaze detection. These methods are all two-stage, consisting of a human head detector in the first stage and a mutual gaze classifier in the second stage. Specifically, Marin et al. proposed the video-based LAEO-Net [20] and obtained promising mutual gaze detection results by considering both temporal and spatial information. They later modified LAEO-Net to obtain LAEO-Net++ [19], which achieved better performance. Moreover, the image-based approach proposed by Doosti et al. [6] takes advantage of multi-task learning by using pseudo 3D gaze to boost mutual gaze detection. However, these methods suffer from a lack of global image information and slow inference speed due to their sequential two-stage architecture.

2.3 One-Stage Detection Method

Recently, much computer vision research has followed a one-stage design to speed up the processing pipeline. In object detection, for example, the favorable results of SSD [16], YOLO [23], RetinaNet [14], DETR [5] and other methods have demonstrated the advantages of one-stage detection. Compared with two-stage detection methods, one-stage methods perform detection and classification with a single network; the pipeline is generally simpler, faster, and more computationally efficient, making it easier to adopt in real-world applications.

3 Method

The task of one-stage mutual gaze detection is, given an image, to detect all mutual gaze instances in the image in an end-to-end way. In this section, we first describe the Representation of Mutual Gaze Instances (Sect. 3.1), followed by the detailed Model Architecture (Sect. 3.2); after that we introduce the Strategy for Mutual Gaze Instances Match (Sect. 3.3) and, finally, the Loss Function Setting (Sect. 3.4).

Fig. 3. Representation of Mutual Gaze Instances in the scene. In the social scene, each individual is uniquely numbered and the enumerated head pairs are order-independent. MG means Mutual Gaze.

3.1 Representation of Mutual Gaze Instances

We define a mutual gaze instance as a triple, namely (Person1, Person2, Mutual Gaze Label), where Person1 and Person2 contain the bounding box coordinates and the class confidence of a person, and the Mutual Gaze Label is one when Person1 and Person2 are looking at each other and zero otherwise. It is noteworthy that, for the mutual gaze detection task, predicting the class of each box may seem unnecessary; however, this setting serves as a detection threshold in the test phase, where we need the person class confidence of each box. A detailed example of representing a real scene with mutual gaze instances is given in Fig. 3. Additionally, a mutual gaze instance is unordered; that is, the relationship between the i-th person and the j-th person only needs to be recorded once, as (Personi, Personj, Mutual Gaze Label between i and j).
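As a concrete illustration, the triple can be held in a small container such as the sketch below. The field names and types are ours, chosen for clarity, and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MutualGazeInstance:
    """One mutual gaze instance: two head boxes, their person confidences,
    and a binary label saying whether the two people look at each other."""
    box1: Tuple[float, float, float, float]  # (x1, y1, x2, y2) of Person1's head
    box2: Tuple[float, float, float, float]  # (x1, y1, x2, y2) of Person2's head
    conf1: float                             # person-class confidence for box1
    conf2: float                             # person-class confidence for box2
    mutual_gaze: int                         # 1 if looking at each other, else 0

# The pair is unordered, so (i, j) and (j, i) describe the same instance
# and only need to be stored once.
```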

3.2 Model Architecture

Our proposed MGTR mainly consists of four parts: a Backbone, an Encoder module, a Decoder module, and an MLP. Figure 4 shows an overview of the MGTR architecture.

Backbone. A convolutional neural network is used to extract features from an input image of original size [H, W, 3]. After the convolutional neural network, we obtain a feature map of size [C, H, W]. We then reduce the channel dimension of the feature map from C to d with a 1 × 1 convolution kernel, resulting in a new feature map of size [d, H, W]. Since the Transformer Encoder requires a sequence as input, we flatten the last two dimensions of the new feature map to obtain a flattened feature, called the input embedding, of size [d, HW].
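A minimal PyTorch sketch of this step is shown below. The ResNet-50 backbone and the reduced dimension d = 256 follow common DETR-style settings and are assumptions here, not values quoted from this subsection.

```python
import torch
import torch.nn as nn
import torchvision

class FlattenedBackbone(nn.Module):
    """CNN feature map -> 1x1 conv channel reduction -> flattened token sequence."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.body = nn.Sequential(*list(resnet.children())[:-2])  # keep the conv feature map
        self.reduce = nn.Conv2d(2048, d_model, kernel_size=1)     # C -> d

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feat = self.body(images)                 # [B, C, H, W]
        feat = self.reduce(feat)                 # [B, d, H, W]
        return feat.flatten(2).permute(2, 0, 1)  # [HW, B, d], sequence-first for the Transformer

tokens = FlattenedBackbone()(torch.randn(2, 3, 512, 512))
# positional encoding would be added to these tokens before the Encoder
```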

Fig. 4. An overview of the MGTR architecture. It consists of four components: a Backbone to convert the input image into a one-dimensional input embedding, an Encoder followed by a Decoder to get the encoded Mutual Gaze Queries, and an MLP to predict mutual gaze instances. laeo means Looking At Each Other. See Sect. 3.2 for more details.

Encoder. The Encoder layer in MGTR is the same as the standard Transformer Encoder layer, including a multi-head self-attention layer and a feed-forward network (FFN). Due to the permutation invariance of the Transformer, we add the positional encoding to the input embedding to obtain the Query and Key of the Encoder, and use only the input embedding as the Value. For convenience, we refer to the output of the Encoder as the context feature.

Decoder. The Decoder layer in MGTR is also the same as the standard Transformer Decoder layer, which contains two multi-head attention layers and a feed-forward network. We refer to the N learnable positional embeddings as mutual gaze queries. In the multi-head self-attention layer, the Query, Key, and Value all come from either the mutual gaze queries or the sum of the previous decoder layer's output and the mutual gaze queries. In the encoder-decoder cross-attention layer, the Value comes from the context feature generated by the Encoder, the Key is the sum of the context feature and the positional encoding, and the Query is the sum of the multi-head self-attention layer's output and the mutual gaze queries. We denote the output of the Decoder as the output embedding. The self-attention mechanism in the Encoder and Decoder helps the model capture the positions of different people in the image and the relationships between them. The N output embeddings encoded from the N mutual gaze queries are then converted into mutual gaze instances by the subsequent MLP, so that we obtain N final mutual gaze instance results; we discuss this part next.
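The query flow can be sketched with stock PyTorch Transformer modules as below. The number of layers (6) and of queries (100) follow Sect. 4.4, while d_model = 256 and 8 heads are assumptions; the DETR-style treatment of positional encodings (added to queries and keys but not to values) is simplified here by adding them directly to the inputs, so this is an approximation of the design described above rather than the released implementation.

```python
import torch
import torch.nn as nn

class MGTRDecoderSketch(nn.Module):
    """Learnable mutual gaze queries attending to the encoded image memory (simplified)."""
    def __init__(self, d_model=256, nhead=8, num_layers=6, num_queries=100):
        super().__init__()
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead), num_layers)
        self.decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead), num_layers)
        self.queries = nn.Embedding(num_queries, d_model)   # mutual gaze queries

    def forward(self, tokens, pos):
        # tokens, pos: [HW, B, d] flattened image tokens and positional encodings
        memory = self.encoder(tokens + pos)                  # context feature
        b = tokens.shape[1]
        q = self.queries.weight.unsqueeze(1).repeat(1, b, 1) # [N, B, d]
        return self.decoder(q, memory)                       # N output embeddings per image
```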

Fig. 5. An example explaining the mutual gaze instance match. By padding the set G with ∅, we make P and G the same size, so the matching can be transformed into a bipartite graph matching problem. laeo means Looking At Each Other.

MLP for Mutual Gaze Prediction. After passing the N mutual gaze queries through the Decoder, we obtain N output embeddings, which contain information about the positions of the head bounding boxes and the relationships between different people in the image. We then pass the output embeddings into the MLP to predict the mutual gaze instances. Specifically, three one-layer fully-connected networks are used to predict the confidence scores of Person1, Person2, and the Mutual Gaze Label respectively, and two three-layer fully-connected networks are used to predict the head bounding boxes of Person1 and Person2. The human confidence score prediction branches have three classes, covering whether the box is a person and not-match (the meaning of not-match is described later). The mutual gaze confidence prediction branch also has three categories, indicating whether Person1 and Person2 are looking at each other and not-match. We then apply Softmax to the outputs of all confidence prediction branches to obtain normalized confidences. For the head bounding box regression branches, the output dimension of the MLP is four, representing the normalized center coordinates, width, and height of the bounding box.
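A sketch of these prediction heads is given below. The hidden width, the sigmoid on the box outputs, and the exact class ordering are our assumptions; only the head counts, layer depths, and output sizes follow the description above.

```python
import torch
import torch.nn as nn

class InstanceHeads(nn.Module):
    """Per-query heads: three confidence classifiers and two head-box regressors."""
    def __init__(self, d_model=256, hidden=256):
        super().__init__()
        self.person1_cls = nn.Linear(d_model, 3)  # person-related classes plus not-match
        self.person2_cls = nn.Linear(d_model, 3)
        self.gaze_cls = nn.Linear(d_model, 3)     # laeo / not laeo / not-match

        def box_mlp():
            return nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))      # normalized (cx, cy, w, h)
        self.box1, self.box2 = box_mlp(), box_mlp()

    def forward(self, out_emb):
        # out_emb: [N, B, d] output embeddings from the Decoder
        return {
            "p1_conf": self.person1_cls(out_emb).softmax(-1),
            "p2_conf": self.person2_cls(out_emb).softmax(-1),
            "gaze_conf": self.gaze_cls(out_emb).softmax(-1),
            "box1": self.box1(out_emb).sigmoid(),
            "box2": self.box2(out_emb).sigmoid(),
        }
```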

3.3 Strategy for Mutual Gaze Instances Match

After passing the output embeddings through the MLP, we get N predicted mutual gaze instances. However, the number of ground truth mutual gaze instances is not necessarily N (it is often less than N). This requires some predicted mutual gaze instances to represent empty predictions, which we denote as not-match, indicating that these mutual gaze instances do not match any ground truth. More precisely, if we denote the set of ground truth instances as G with size M, and the set of predicted instances as P with size N, then a satisfactory model should output a set which contains N − M instances representing not-match.


Having handled the mismatch between the numbers of ground truth and predicted instances by designing the not-match class, the next key problem is how to match the predicted instances with the ground truth instances. Specifically, if we denote the set of elements predicted to be non-empty in P as S, then the problem is how to build a map σ from S to G. Note that the size of S is not necessarily equal to the size of G. We solve this matching problem in another way: by padding the ground truth set G with ∅, we make G and P equal in size, so the matching problem is transformed into a one-to-one bipartite matching problem between P and G. In this work, we use the Hungarian algorithm [12] to solve this problem. A more concrete example can be seen in Fig. 5.

3.4 Loss Function Setting

Assume the mapping from the predicted set P to the ground-truth set G is denoted as σ(i), meaning that the i-th element in P is mapped to the σ(i)-th element in G. We design the matching cost of the Hungarian algorithm as follows:

L_{match} = \sum_{i=1}^{N} [ \beta_1 L_{class}(c_i, c_{\sigma(i)}) + \beta_2 L_{box}(b_i, b_{\sigma(i)}) ]    (1)

where L_{class}(c_i, c_{\sigma(i)}) represents the cost between the i-th mutual gaze instance from P and the σ(i)-th ground truth from G in human class confidence and mutual gaze confidence, and we call it the class loss function. L_{box}(b_i, b_{\sigma(i)}) represents the cost between the i-th mutual gaze instance from P and the σ(i)-th from G in head bounding box regression, and we call it the head bounding box regression loss function. β_1 and β_2 are hyperparameters that weight these two types of losses. The specific forms of the two losses are given below.

For the class loss function, we denote the mutual gaze confidence as p_i^{gaze} and use p_i^{h_1} and p_i^{h_2} to represent the human class confidences. Under this representation, the class loss function is defined as:

L_{class}(c_i, c_{\sigma(i)}) = \alpha_1 p_i^{h_1} + \alpha_2 p_i^{h_2} + \alpha_3 p_i^{gaze}    (2)

For the head bounding box regression loss function, the definition is:

L_{box}(b_i, b_{\sigma(i)}) = \gamma_1 \ell_1(b_i, b_{\sigma(i)}) + \gamma_2 GIoU(b_i, b_{\sigma(i)})    (3)

where \ell_1(\cdot) is the L1 loss. We also add the GIoU loss [24] to the head bounding box regression loss function, and we confirm the importance of the GIoU loss in the later ablation study. With the above costs from the i-th element in P to the σ(i)-th element in G defined, we solve the following optimal bipartite matching problem with the Hungarian algorithm:

\sigma^* = \arg\min_{\sigma} L_{match}    (4)
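A sketch of this matching step is shown below, using SciPy's linear_sum_assignment as a stand-in for the Hungarian algorithm. The class-cost term is assumed to be precomputed as in Eq. (2), the GIoU term of Eq. (3) is omitted for brevity, the array shapes are illustrative, and the β weights are the matching values from Sect. 4.4.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(class_cost, pred_boxes, gt_boxes, beta1=1.2, beta2=1.0):
    """Bipartite matching between N predictions and M ground-truth instances.

    class_cost: [N, M] class-cost term of Eq. (2), alpha-weighted person and gaze confidences.
    pred_boxes: [N, 8] the two predicted head boxes per instance, (cx, cy, w, h) x 2.
    gt_boxes:   [M, 8] the two ground-truth head boxes per instance.
    Returns (pred_idx, gt_idx); unmatched predictions are assigned the not-match
    class when the training loss is computed.
    """
    # L1 box cost between every prediction and every ground truth (GIoU term omitted here)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # [N, M]
    cost = beta1 * class_cost + beta2 * box_cost                                # Eq. (1)
    pred_idx, gt_idx = linear_sum_assignment(cost)                              # Eq. (4)
    return pred_idx, gt_idx
```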


Fig. 6. Some examples of UCO-LAEO and AVA-LAEO datasets. The images in the first row come from the UCO-LAEO dataset which contains both head bounding boxes and mutual gaze labels. The images in the second row are from the AVA-LAEO dataset which only contains mutual gaze labels and we approximate the ground-truth head bounding box coordinates by using a pre-trained head detector [16].

After the optimal bipartite matching problem is solved by the Hungarian algorithm, we can calculate the loss function for training. In this work, the training loss is defined almost the same as in Eq. (1); the difference is that in L_{class} we use the cross-entropy loss instead of the negative predicted probability, and there are some subtle differences in the hyperparameter settings (more details can be found in Sect. 4.4). Unlike previous mutual gaze detection methods, which first train a head detector and then freeze its parameters while training the mutual gaze classifier, our method optimizes the class loss function and the head bounding box regression loss function simultaneously during training.

4 Experiments

4.1 Datasets

UCO-LAEO [20]. This dataset consists of detailed mutual gaze instance annotations with human head bounding boxes and mutual gaze labels. The original dataset was manually divided into positive and negative instances for sample balance, which means some instances in an image may not be annotated as negative instances. Since we need to detect all the instances in one image, we only use positive samples from the original annotations and treat all remaining instances as negative samples.

AVA-LAEO [20]. This dataset has broad coverage, reaching 50,797 video frames generated from the Atomic Visual Actions dataset (AVA v2.2) [21]. Since there are no manual head bounding box annotations in the original AVA v2.2 dataset, we first use a pre-trained head detector [16] to generate all head bounding box coordinates in each frame as head bounding box ground truth before our training starts. Similarly, we only use positive instances from the original dataset annotations and regard all the remaining instances as negative. Some examples of the UCO-LAEO and AVA-LAEO datasets are shown in Fig. 6.

4.2 Evaluation Metric

In this work, we use the mean average precision (mAP) to measure how well our model performs on the two datasets. It is worth mentioning that although our task is similar to a binary classification task, we do not use the AP of a single class as the evaluation metric. This is because our task is to simultaneously predict accurate human head positions as well as mutual gaze labels, so we need to consider the AP of both classes to include the evaluation of head localization accuracy. For example, if the model predicts the mutual gaze label of an instance to be positive but the prediction is wrong, it does not necessarily mean that the ground-truth mutual gaze label is negative: it is also possible that the predicted head bounding boxes do not match the ground truth. Therefore, an instance is predicted correctly if and only if it locates the head boxes of the two people in the correct positions and correctly predicts whether the two are looking at each other. This criterion requires the AP of both classes to be high.
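The per-instance correctness criterion can be written as below, reusing the fields of the MutualGazeInstance sketch from Sect. 3.1. The IoU threshold of 0.5 and the corner box format are our assumptions; the paper does not state the exact threshold in this subsection.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def instance_correct(pred, gt, iou_thr=0.5):
    """Correct only if both head boxes match a ground-truth pair (in either order)
    and the mutual gaze label agrees."""
    if pred.mutual_gaze != gt.mutual_gaze:
        return False
    same_order = iou(pred.box1, gt.box1) >= iou_thr and iou(pred.box2, gt.box2) >= iou_thr
    swapped    = iou(pred.box1, gt.box2) >= iou_thr and iou(pred.box2, gt.box1) >= iou_thr
    return same_order or swapped
```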

4.3 Current State-of-the-Art Approach

The current state-of-the-art approach for video-based mutual gaze detection is LAEO-Net++, proposed by Marin et al. [19]. The image-based state-of-the-art method was proposed by Doosti et al. [6], and we call it Pseudo 3D Gaze in this paper. Both aforementioned works differ from ours: they only detect mutual gaze without detecting human head positions, while MGTR does both, which is a more difficult task. Moreover, LAEO-Net++ is a video-based method that uses ground-truth human head boxes as model input and utilizes temporal connections among neighboring video frames, so its input contains more prior information, while ours is an image-based method whose input is a single image. Given these differences, we modify these methods accordingly for a fair comparison. Specifically, for LAEO-Net++, since we focus on image-based mutual gaze detection, we would use its image-based version introduced in [6]. However, we cannot obtain the modified LAEO-Net++ for comparison because the code in [6] is not open source, so we directly use the performance numbers of the image-based LAEO-Net reported in [6]. For Pseudo 3D Gaze, which focuses on the performance of the second stage, we add a head detector in front of the original model for end-to-end mutual gaze detection; a detailed description of this baseline is given in Sect. 4.4.

4.4 Implementation Details

Data Augmentation. We normalize the input image using the mean and std from ImageNet [11]. To improve the robustness of the model, we randomly apply horizontal flipping, brightness and contrast adjustment, random cropping, and random resizing (to enable the model to detect instances at multiple scales).

Hyperparameter Settings. To balance the class cost and the head bounding box regression cost, we set β1 = 1.2, β2 = 1.0 in the Hungarian matching process and β1 = β2 = 1.0 in the training loss function. At the same time, we set α1 = α2 = 1.0 and α3 = 2.0 in both the Hungarian matching and the loss function, to make the model focus more on judging the existence of mutual gaze. In the head bounding box regression loss function, we set γ1 = 5.0 and γ2 = 2.0.

Training Settings. We use ResNet-50 [9] with frozen batch-norm layers as the Backbone of MGTR; the numbers of Encoder and Decoder layers are both set to 6, the same as in [29]. Both the Backbone and the Encoder-Decoder use pretrained parameters from COCO [15]-pretrained DETR [5]. The number of mutual gaze queries is set to 100. In the training phase, we use a batch size of 8 and the AdamW [18] optimizer, with a constant learning rate of 1e-4 for the Encoder-Decoder and 1e-5 for the Backbone, and we train the model until performance on the test set no longer improves.

Description of Pseudo 3D Gaze Using a Head Detector. Since the Pseudo 3D Gaze method in [6] uses ground-truth head bounding boxes, for a fair comparison we set up a baseline that consists of two parts: a head detector and a mutual gaze classifier. We use the head detector proposed in [26] for head box detection. The classifier of this baseline adopts the same network as the one proposed by Doosti et al. and also uses pseudo 3D gaze to boost the training process. During training, we only train the mutual gaze classifier, with randomly initialized weights, using the ground-truth head bounding boxes. During testing, we first detect the head bounding boxes with the pretrained detector and then pass the paired detected head crops through the mutual gaze classifier to get the result for each mutual gaze instance.
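For reference, the numeric settings listed above can be collected in one place. All values below are copied from this subsection; only the dictionary structure and key names are ours.

```python
MGTR_CONFIG = {
    "matching":   {"beta1": 1.2, "beta2": 1.0},   # Hungarian matching weights
    "train_loss": {"beta1": 1.0, "beta2": 1.0},   # weights in the training loss
    "class_loss": {"alpha1": 1.0, "alpha2": 1.0, "alpha3": 2.0},
    "box_loss":   {"gamma1": 5.0, "gamma2": 2.0}, # L1 and GIoU weights
    "backbone": "resnet50_frozen_bn",
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
    "num_queries": 100,
    "batch_size": 8,
    "optimizer": "AdamW",
    "lr_encoder_decoder": 1e-4,
    "lr_backbone": 1e-5,
}
```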

4.5 Comparison with State-of-the-Art Method

Table 1 shows the quantitative results compared with the state-of-the-art methods in terms of FPS and mAP on the UCO-LAEO and AVA-LAEO datasets. On UCO-LAEO, MGTR processes images more than 150 times faster than the two-stage approach in extreme social scenes with more than four people, and nearly 17 times faster averaged across the test set. Moreover, MGTR achieves 64.8% mAP, an 18.1% increase over Pseudo 3D Gaze using detected head bounding boxes. Meanwhile, our method is also comparable with image-based state-of-the-art methods that use ground truth head bounding boxes, with a difference of only 0.3%. The strong mAP of MGTR shows that it handles the imbalance of positive and negative samples in the training set well.


Table 1. Comparison with state-of-the-art methods. FPSext refers to the number of images processed per second in extreme social scenes with more than four people, and FPSall refers to the number of images processed per second averaged across the whole test set. The FPS is evaluated using an NVIDIA 3090TI GPU. Since we have no access to the code of the image-based LAEO-Net in [6], we do not evaluate its FPS. w/ GT and w/t GT respectively indicate whether ground truth head bounding boxes are used as model input. The mAP represents the positive class's AP for two-stage methods and the two classes' average AP for one-stage methods. *Number reported from [6]. Method

End-to-end training FPSext UCO AVA

FPSall UCO AVA

mAP UCO-LAEO AVA-LAEO w/ GT w/t GT w/ GT w/t GT

0.49

55.9 65.1

Two-stage∗ Image based LAEO-Net ✗ Pseudo 3D Gaze ✗

0.51

One-stage

78.06 14.56 9.18 10.81 -

MGTR (ours)



0.93

0.89

46.7

70.2 72.2

52.3

64.8

-

66.2

On the AVA-LAEO dataset, MGTR processes each image more than 15 times faster than the two-stage method in social scenes with more than four people, and more than 12 times faster averaged across the test set. Besides, MGTR obtains a 66.2% mAP score, a 13.9% increase compared with the baseline using detected head bounding boxes.

4.6 Ablation Study

In this part, we design several ablations to study how data augmentation, different components of MGTR, and the loss function setting affect performance. We choose the UCO-LAEO dataset and use MGTR with a ResNet-50 backbone as the base model. The ablation baselines are as follows: (1) no multi-scale resize and random cropping in data augmentation; (2) ResNet-101 as the backbone, to study how model complexity affects performance; (3) no GIoU loss, i.e., only BCE loss and L1 loss are used in the loss function; (4) DIoU loss [27] instead of the original GIoU loss, with other parts kept consistent with the base model, to explore the effect of different IoU losses; (5) CIoU loss [27] in place of the GIoU loss, with other parts unchanged. The results of the ablation study are provided in Table 2. It can be seen that data augmentation, including multi-scale resize and random cropping, is important for training; without it, mAP drops by 11.4%. At the same time, using ResNet-101 as the backbone leads to a decrease in performance, so a more complex model is not always better. As can be seen from the fourth-to-last row of Table 2, when the GIoU loss is removed from the loss function, the mAP of MGTR drops by 9.0%, which indicates that the GIoU loss is indispensable for MGTR to accurately locate each person's head bounding box.


Table 2. Ablation study of MGTR. mAP refers to the average AP over the two classes, APrare refers to the AP of the minority category (usually the positive mutual gaze label), APnormal refers to the majority category, and Recall refers to the average Recall over the two classes.

Model setting                   | #param | mAP  | APrare | APnormal | Recall
NoDataAugmentation              | 41.4M  | 53.4 | 52.8   | 54.3     | 68.6
Resnet101Backbone               | 60.3M  | 51.3 | 49.7   | 53.0     | 65.5
NoGIoULoss                      | 41.4M  | 55.8 | 52.8   | 58.8     | 67.3
DataAug+Resnet50+DIoU           | 41.4M  | 55.7 | 51.1   | 60.4     | 69.2
DataAug+Resnet50+CIoU           | 41.4M  | 60.4 | 54.5   | 66.3     | 68.4
Base (DataAug+Resnet50+GIoU)    | 41.4M  | 64.8 | 58.3   | 71.4     | 75.3

When using DIoU loss or CIoU loss to replace the original GIoU loss, the mAP drops by 9.1% and 4.4%, respectively, indicating that even though DIoU and CIoU are improvements over GIoU, using the GIoU loss still achieves the best performance. This may be because both the Backbone and the Encoder-Decoder in MGTR are initialized with the parameters of the pretrained DETR model, which also uses the GIoU loss; using a consistent loss may therefore give better results.

4.7 Qualitative Analysis

Different from previous two-stage methods, MGTR gives all head bounding boxes and mutual gaze labels in the scene simultaneously in an end-to-end manner, which may not be easy to interpret at first glance. In this part, we analyze the different roles of the Encoder and Decoder of MGTR at different levels of image semantic understanding. To study these roles, we visualize the last attention layer of the Encoder and the Decoder respectively; the results are shown in Fig. 7. It can be seen that the role of the Encoder in MGTR is to find all the head bounding boxes in the social scene, because the head area of each person is given a larger attention weight in the attention map. After the Encoder finds all the people in the scene, the Decoder can find the relationships between them. In the Decoder's attention map, when the mutual gaze label is positive, two people's head regions receive large attention weights; when the label is negative, only one person's head region is focused on. Therefore, we conclude that the role of the Decoder is to predict which pairs of people in the current scene are looking at each other, thereby modeling the relationships between different people.


Fig. 7. Visualization of last attention layer in Encoder and Decoder and the predicted result by MGTR (from UCO-LAEO dataset). For each row, the first image is the original input image, the second image is the attention weight in Encoder, the third image is the attention weight in Decoder, and the last image is the predicted result by MGTR.

5 Conclusion

In this work, we propose a one-stage mutual gaze detection method called Mutual Gaze TRansformer, or MGTR, to directly predict mutual gaze instances in an end-to-end manner. Different from current two-stage mutual gaze detection methods, MGTR is the first work that integrates human head detection and mutual gaze recognition into one stage, which simplifies the detection pipeline. Experiments on two mutual gaze datasets demonstrate that our proposed method can greatly accelerate the inference process while improving performance. In the future, we will explore incorporating mutual gaze information into the analysis of interpersonal relationships, where detected mutual gaze instances can serve as an important clue for social scene interpretation.

References

1. Abele, A.: Functions of gaze in social interaction: communication and monitoring. J. Nonverbal Behav. 10(2), 83–101 (1986)
2. Admoni, H., Scassellati, B.: Social eye gaze in human-robot interaction: a review. J. Hum.-Robot Interact. 6(1), 25–63 (2017)
3. Argyle, M., Cook, M.: Gaze and mutual gaze (1976)
4. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3286–3295 (2019)
5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13


6. Doosti, B., Chen, C.H., Vemulapalli, R., Jia, X., Zhu, Y., Green, B.: Boosting image-based mutual gaze detection using pseudo 3D gaze. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1273–1281 (2021) 7. Fan, L., Chen, Y., Wei, P., Wang, W., Zhu, S.C.: Inferring shared attention in social scene videos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6460–6468. IEEE (2018) 8. Fan, L., Wang, W., Huang, S., Tang, X., Zhu, S.C.: Understanding human gaze communication by spatio-temporal graph reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5724–5733 (2019) 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 10. Kleinke, C.L.: Gaze and eye contact: a research review. Psychol. Bull. 100(1), 78 (1986) 11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017) 12. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955) 13. Lian, D., Yu, Z., Gao, S.: Believe it or not, we know what you are looking at! In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 35–50. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6 3 14. Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll´ ar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017) 15. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 16. Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 2 17. Loeb, B.K.: Mutual eye contact and social interaction and their relationship to affiliation (1972) 18. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 19. Mar´ın-Jim´enez, M.J., Kalogeiton, V., Medina-Su´ arez, P., Zisserman, A.: LAEONet++: revisiting people looking at each other in videos. IEEE Trans. Pattern Anal. Mach. Intell. (2021). https://doi.org/10.1109/TPAMI.2020.3048482 20. Marin-Jimenez, M.J., Kalogeiton, V., Medina-Suarez, P., Zisserman, A.: LAEONet: revisiting people looking at each other in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3477– 3485 (2019) 21. Murray, N., Marchesotti, L., Perronnin, F.: AVA: a large-scale database for aesthetic visual analysis. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2408–2415. IEEE (2012) 22. Parmar, N., et al.: Image transformer. In: International Conference on Machine Learning, pp. 4055–4064. PMLR (2018) 23. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)


24. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019) 25. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 26. Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., Li, S.Z.: S3FD: single shot scaleinvariant face detector. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201 (2017) 27. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12993–13000 (2020) 28. Zhuang, N., et al.: Muggle: multi-stream group gaze learning and estimation. IEEE Trans. Circuits Syst. Video Technol. 30(10), 3637–3650 (2019) 29. Zou, C., et al.: End-to-end human object interaction detection with HOI transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11825–11834 (2021)

Is an Object-Centric Video Representation Beneficial for Transfer?

Chuhan Zhang1(B), Ankush Gupta2, and Andrew Zisserman1

1 Visual Geometry Group, Department of Engineering Science, University of Oxford, Oxford, UK
{czhang,az}@robots.ox.ac.uk
2 DeepMind, London, UK
[email protected]

Abstract. The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets (SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens), we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification.

Keywords: Video action recognition · Object centric representations · Transfer learning

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_23.

1 Introduction

Visual data is complicated: a seemingly infinite stream of events emerges from the interactions of a finite number of constituent objects. Abstraction and reasoning in terms of these entities and their inter-relationships (object-centric reasoning) has long been argued by developmental psychologists to be a core building block of infant cognition [1], and key for human-level common sense [2]. This object-centric understanding posits that objects exist [3], have permanence over time, and carry along physical properties such as mass and shape that govern their interactions
with each other. Factorizing the environment in terms of these objects as recurrent entities allows for combinatorial generalization in novel settings [2]. Consequently, there has been a gradual growth in video models that embed object-centric inductive biases, e.g., augmenting the visual stream with actor or object bounding-box trajectories [4–7], graph algorithms on object nodes [8,9], or novel architectures for efficient discovery, planning and interaction [10–12]. The promise of object-centric representations is transfer across tasks. Due to the shared underlying physics across different settings, knowledge of object properties like shape, texture, and position can be repurposed with little or no modification for new settings [13], much like infants who learn to manipulate objects and understand their properties, and then apply these skills to new objects or new tasks [14,15] (Fig. 1).

Fig. 1. Do Object-Centric video representations transfer better? To induce objectness into visual representations of videos, we learn a set of object-centric vectors—which are tied to specific objects present in the video, as well as context vectors—which reason about relations and context. The representation is built by fusing the two modality streams of the video—the visual stream, and the spatio-temporal object bounding-box stream. We train the model on the standard action recognition task, and show that using the object and context vectors can lead to SOTA results when evaluated for transfer to unseen objects, unseen environments, novel classes, other tasks and also standard action classification.

In this paper we investigate this promise by developing an object-centric video model and determining if it has superior task generalization performance compared to object-agnostic and other recent object-centric methods. In a similar manner to pre-training a classification model on ImageNet, and then using the backbone network for other tasks, we pre-train our object-centric model on the action recognition classification task, and then determine its performance on downstream tasks using a linear probe. We consider a model to be object-centric if it learns a set of object summary vectors, that explicitly distil information about a specific object into a particular

latent variable. In contrast, in object-agnostic [16–19] or previous state-of-the-art object-centric video models [4,5,20], the object information is de-localized throughout the representation.

To this end, we introduce a novel architecture based on a transformer [21] that achieves an object-centric representation through its design and its training. First, a bottleneck representation is learned, where a set of object query vectors [22] tied to specific constituent objects cross-attend, in the manner of DETR [22], to visual features and corresponding bounding-box trajectories. We demonstrate that this cross-attention based fusion is an effective method for merging the two modality streams [4,5,23], visual and geometric, complementing the individual streams. We call this 'modality' fusion module an Object Learner. Second, a novel trajectory contrast loss is introduced to further enhance object-awareness in the object summaries. Once learnt, this explicit set of object summary vectors is repurposed and refined for downstream tasks.

We evaluate the task generalization ability of the object-centric video representation using a number of challenging transfer tasks and settings:
1. Unseen data: Action classification with known actions (verbs), but novel objects (nouns), in the SomethingElse dataset [20]; action classification with known actions (verbs and nouns), but unseen kitchens, in EpicKitchens [24].
2. Low-shot data: Few-shot action classification in SomethingElse; tail-class classification in EpicKitchens.
3. Other downstream tasks: Hand contact state estimation in SomethingElse, and human-object predicate prediction in ActionGenome [25].
Note, task 3 uses a linear probe on pre-trained representations for rigorously quantifying the transferability. In addition to evaluating the transferability as above, we also benchmark the learned object-centric representations on the standard task of action classification.

In summary, our key contributions are:
1. A new object-centric video recognition model with explicit object representations. The object-centric representations are learned using a novel cross-attention based module which fuses the visual and geometric streams, complementing the two individually.
2. The object-centric model sets a new record on a comprehensive set of tasks which evaluate transfer efficiency and accuracy on unseen objects, novel classes and new tasks on SomethingElse, Action Genome and EpicKitchens.
3. Significant gains over the previous best results on standard action recognition benchmarks: 74.0% (+6.1%) on SomethingSomething-V2, 66.6% (+6.3%) on Action Genome, and 46.3% (+0.6%) top-1 accuracy on EpicKitchens.

2 Related Work

Object-Centric Video Models. Merging spatio-temporal object-level information and visual appearance for video recognition models has been explored extensively. These methods either focus solely on the human actors in the videos [6,7,26],

or more generally model human-object interactions [27–30]. The dominant approach involves RoI-pooling [31,32] features extracted from a visual backbone using object/human bounding-boxes generated either from object detectors [33], or more generally using a region proposal network (RPN) [6,8,26,34–36] on each frame independently, followed by global aggregation using recurrent models [37]. The input to these methods is assumed to just be RGB pixels, and the object boxes are obtained downstream. A set of object-centric video models [4,5,20] assume object boxes as input, and focus on efficient fusion of the two streams; we follow this setting. Specifically, ORViT [4] is an object-aware vision transformer [38] which incorporates object information in two ways: (1) by attending to RoI-pooled object features, and (2) by merging encoded trajectories at several intermediate layers. STIN [20] encodes the object boxes and identity independently of the visual stream, and merges the two through concatenation before feeding into a classifier. STLT [5] uses a transformer encoder on object boxes, first across all objects in a given frame, and then across frames, before fusing with appearance features. We adopt STLT’s hierarchical trajectory encoder, and develop a more performant cross-attention based fusion method. Multi-modal Fusion. Neural network architectures which fuse multiple modalities, both within the visual domain, i.e., images and videos [39] with optical flow [23,40], bounding-boxes [6,8,26,34–36], as well as across other modalities, e.g., sound [41,42] and language [43–46], have been developed. The dominant approach was introduced in the classic two-stream fusion method [23] which processes the visual and optical flow streams through independent encoders before summing the final softmax predictions. Alternative methods [40] explore fusing at intermediate layers with different operations, e.g., sum, max, concatenation, and attention-based non-local operation [47]. We also process the visual and geometric streams independently, but fuse using a more recent cross-attention based transformer decoder [48] acting on object-queries [22]. An alternative to learning a single embedding representing all the input modalities, is to learn modality encoders which all map into the same joint vector space [43,49,50]; such embeddings are primarily employed for retrieval. Benchmarks with Object Annotations. Reasoning at the object level lies at the heart of computer vision, where standard benchmarks for recognition [51], detection and segmentation [52,53], and tracking [54–57] are defined for categories of objects. Traditionally, bounding-box tracking of single [54,55] or multiple objects [56,57], or more spatially-precise video object segmentation [58– 61] were the dominant benchmarks for object-level reasoning in videos. More recently, a number of benchmarks probe objects in videos in other ways, e.g., ActionGenome [25] augments the standard action recognition with human/object based scene-graphs, SomethingElse [20] tests for transfer of action recognition on novel objects, CATER [62] evaluates compositional reasoning over synthetic objects, and CLEVERER [63] for object-based counterfactual reasoning. Object-Oriented Reasoning. There is a large body of work on building in and reasoning with object-level inductive biases across multiple domains and tasks. Visual recognition is typically defined at the object-level both in images

Fig. 2. The Object-Centric Video Transformer Model Architecture. The Visual Encoder module ingests the RGB video and produces a set of object-agnostic spatio-temporal tokens. The Trajectory Encoder module ingests the bounding boxes and labels them with object ID embeddings D, to produce object-aware trajectory spatio-temporal tokens. The Object Learner module fuses the visual and trajectory streams by querying with the object IDs D, and outputs object summaries which contain both visual and trajectory information. An object-level auxiliary loss is used to encourage each object summary vector to be tied to the object in the query. Finally, the Classification module ingests the outputs from the Object Learner to predict the class. The model is trained with cross-entropy losses applied to the class predictions from the dual encoders and the Object Learner, together with the auxiliary loss.

[53,64–66] and videos [25,33,67,68]. Learning relations, expressed as edges between entities/particles expressed as nodes in a graph, has been employed for amortizing inference in simulators and modelling dynamics [11,69]. Such factorized dynamics models, conditioned on structured object representations, have been employed for future prediction and forecasting [70–72]. Object-conditional image and video decomposition [10,73–75], such as MONet [76] and Genesis [77], and generation [78–83] methods benefit from compositional generalization. Finally, object-level world-models have been used to constrain action-spaces and states in robotics [84,85] and control domains [12,86,87].

3 An Object-Centric Video Action Transformer

We first describe the architecture of the object-centric video action recognition model for fusing the visual and trajectory streams. We then describe the training objectives for action classification and for learning the object representations. Finally, we discuss our design choices and the differences between our model and previous fusion methods, and explain its advantages.

3.1 Architecture

The model is illustrated in Fig. 2, and consists of four transformer-based modules. We briefly describe each module, with implementation details in Sect. 4.

Video Encoder. The encoder ingests a video clip F of RGB frames F = (f_1, f_2, ..., f_T), where f_i ∈ R^{H×W×3}. The clip F is encoded by a Video Transformer [48] which tokenizes the frames by 3D patches to produce downsampled feature maps. These feature maps, appended with a learnable CLS token, are processed by self-attention layers to obtain the spatio-temporal visual representations V ∈ R^{T×H'W'×C} and the video-level visual embedding C_visual. We take the representation V from the 6th self-attention layer as the visual input of the Object Learner, and C_visual from the last layer to compute the final loss in Eq. (2).

Trajectory Encoder. The encoder ingests the bounding box coordinates B^t = (b^t_1, b^t_2, ..., b^t_O) of the O annotated objects in the t-th frame, where each box b^t_i is in the format [x1, y1, x2, y2]. These boxes are encoded into corresponding box embeddings Φ^t = (Φ^t_1, Φ^t_2, ..., Φ^t_O) through an MLP. Object ID embeddings D = {d_i}_{i=1}^{O} are added to Φ to produce the sum Φ_in; this keeps the ID information persistent throughout the video length T. Φ_in serves as the input to the Trajectory Encoder, which is the Spatial Temporal Layout Transformer (STLT) of [5]. STLT consists of two self-attention Transformers in sequence: a Spatial Transformer and a Temporal Transformer. First, the Spatial Transformer encodes the boxes in every frame separately: it takes a learnable CLS token and the box embeddings Φ^t_in ∈ R^{O×C} from frame t as input to its self-attention layers, and outputs a frame-level representation l^t ∈ R^{1×C} and spatial-context-aware box embeddings Φ^t_out ∈ R^{O×C}. The Temporal Transformer models trajectory information over frames: it applies self-attention to the frame-level embeddings L = (l^1, l^2, ..., l^T) from the Spatial Transformer together with another learnable CLS token. Its outputs are temporal-context-aware frame embeddings L_out ∈ R^{T×C} and a video-level trajectory representation C_traj ∈ R^{1×C}. C_traj is used to compute the final loss in Eq. (2), while L_out ∈ R^{T×1×C} is concatenated with Φ_out ∈ R^{T×O×C} from the Spatial Transformer to form the spatio-temporal trajectory embeddings G ∈ R^{T×(O+1)×C}, which are used as the trajectory input to the Object Learner. (See Supp. for the detailed architecture.)

Object Learner. The Object Learner module is a cross-attention Transformer [21] with a query set Q = {q_i}_{i=1}^{O+K} made up of O learnable object queries and K learnable context queries. The same ID embeddings D from the Trajectory Encoder are added to the first O queries to provide object-specific identification, while the remaining K context queries can be learnt freely. We concatenate the visual feature maps V ∈ R^{T×H'W'×C} and trajectory embeddings G ∈ R^{T×(O+1)×C} as keys and values in the cross-attention layers. Note that the query latents are video level (i.e., common across all frames), and attend to the features from the visual and trajectory encoders using cross-attention. The Object Learner outputs summary vectors S = {s_i}_{i=1}^{O+K}, O of which are object-centric while the remaining K carry context information. The output is independent of the number of video frames, with the visual and trajectory information distilled into the summary vectors. Figure 3 presents a schematic of the module.

Classification Module. This is a light-weight cross-attention transformer that ingests the summary vectors output by the Object Learner, together with a learnable query vector C_obj. The vector output of this module is used by a linear classifier for action prediction.
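A condensed sketch of the Object Learner is given below: O object queries (tagged with the same ID embeddings as the trajectory stream) plus K context queries cross-attend to the concatenated visual and trajectory tokens. Using a stock nn.TransformerDecoder and d = 768 are our simplifications and assumptions; the paper's module uses trajectory attention inside its layers.

```python
import torch
import torch.nn as nn

class ObjectLearnerSketch(nn.Module):
    """O object queries + K context queries cross-attend to the fused video tokens."""
    def __init__(self, d=768, n_obj=6, n_ctx=2, n_layers=6, n_heads=8):
        super().__init__()
        self.obj_queries = nn.Parameter(torch.randn(n_obj, d))
        self.ctx_queries = nn.Parameter(torch.randn(n_ctx, d))
        self.id_embed = nn.Embedding(n_obj, d)              # shared with the trajectory encoder
        self.xattn = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, n_heads), n_layers)

    def forward(self, visual_tokens, traj_tokens):
        # visual_tokens: [T*H'*W', B, d]; traj_tokens: [T*(O+1), B, d]
        b = visual_tokens.shape[1]
        obj = self.obj_queries + self.id_embed.weight        # tie object queries to object IDs
        q = torch.cat([obj, self.ctx_queries], dim=0)        # [O+K, d], video-level queries
        q = q.unsqueeze(1).repeat(1, b, 1)                   # [O+K, B, d]
        kv = torch.cat([visual_tokens, traj_tokens], dim=0)  # fused keys/values
        return self.xattn(q, kv)                             # summary vectors S, [O+K, B, d]
```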

Fig. 3. Object Learner and Auxiliary Loss. The Object Learner is a cross-attention transformer that outputs object-level summary vectors by attending to tokens from both the visual and trajectory encoders. These summary vectors are used for downstream tasks like classification. An auxiliary loss is added in which the object summary vectors are tasked with distinguishing ground-truth and shuffled trajectory embeddings.

3.2 Objectives

We apply two types of losses to train the model. One is an object-level auxiliary loss on the object summary vectors S = (s1 , s2 , . . . , so ) to ensure object-centric information is learned in these vectors. The other is a standard cross-entropy loss on action category prediction. Object-Level Trajectory Contrast Loss. The aim of this loss is to encourage the object specific latent queries in the Object Learner to attend to both the trajectory and visual tokens, and thereby fuse the information from the two input modalities. The key idea is to ensure that the video-level object queries do not ignore the identity, position encodings, and trajectory tokens. The loss is a contrastive loss which encourages discrimination between the correct trajectories and others that are randomly perturbed or from other video clips in the batch. This is implemented using an InfoNCE [88] loss, with a small transformer encoder used to produce vectors for each of the trajectories. This encoder only consists of two self-attention layers that encode the object trajectories into vectors.

In more detail, for an object j, the transformer takes its ground-truth trajectory B_j ∈ R^{T×4} and outputs the embedding z_j ∈ R^C as the positive to be matched against the summary vector \hat{s}_j ∈ R^C. For negatives, the other trajectories in the same batch, as well as n new ones generated by temporally shuffling B_j and encoded into z_j^{shuffle}, are used.

L_{aux} = - \sum_{j} \log \frac{\exp(\hat{s}_j \cdot z_j)}{\sum_{k} \exp(\hat{s}_j \cdot z_k) + \sum_{k} \exp(\hat{s}_j \cdot z_k^{shuffle})}    (1)

Final Objective. We use two MLPs as classifiers for the CLS vectors from the visual and trajectory backbones and from the Object Learner. The first classifier f_1(·) is applied to the concatenated C_visual and C_traj CLS vectors, and the second, f_2(·), is applied to the CLS vector C_obj from the Object Learner. The total loss is the sum of the cross-entropy losses for the two classifiers and the auxiliary loss:

L_{total} = L_{CE}(f_1(C_{visual}; C_{traj}), gt) + L_{CE}(f_2(C_{obj}), gt) + L_{aux}    (2)

The final class prediction is obtained by averaging the class probabilities from the two classifiers. Discussion: Object Learner and Other Fusion Modules. Prior fusion methods can be categorized into three main types: (a) RoI-Pooling based methods like STRG [8], where visual features are pooled using boxes for the downstream tasks; (b) Joint training methods like ORViT [4] where the two modalities are encoded jointly from early stages; and (c) Two stream methods [5,20] with dual encoders for the visual and trajectory modalities, where fusion is in the last layer. The RoI-pooling based methods explicitly pool features inside boxes for downstream operations, omitting context outside the boxes. In contrast, our model allows the queries to attend to the visual feature maps freely. Joint training benefits from fine-grained communication between modalities, but this may not be as robust as the two-stream models under domain shift. Our method combines the two, by keeping the dual encoders for independence and having a bridging module to link the information from their intermediate layers. Quantitative comparisons are done in Sect. 5.
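Before moving to implementation details, the trajectory contrast loss of Eq. (1) can be sketched as follows. The tensor shapes are our assumptions, and the positive term is kept in the denominator following the usual InfoNCE convention; this is an illustrative sketch rather than the released implementation.

```python
import torch

def trajectory_contrast_loss(summary, z_pos, z_neg, z_shuf):
    """InfoNCE-style loss of Eq. (1).

    summary: [B, O, C]   object summary vectors s_hat from the Object Learner
    z_pos:   [B, O, C]   embeddings of the matching ground-truth trajectories
    z_neg:   [M, C]      embeddings of other trajectories in the batch
    z_shuf:  [B, O, n, C] embeddings of temporally shuffled copies of each trajectory
    """
    pos = (summary * z_pos).sum(-1)                            # s_hat_j . z_j
    neg_batch = torch.einsum("boc,mc->bom", summary, z_neg)    # s_hat_j . z_k
    neg_shuf = torch.einsum("boc,bonc->bon", summary, z_shuf)  # s_hat_j . z_k^shuffle
    denom = pos.exp() + neg_batch.exp().sum(-1) + neg_shuf.exp().sum(-1)
    return -(pos.exp() / denom).log().mean()
```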

4 Implementation Details

Model Architecture. We use Motionformer [48] as the visual encoder, operating on 16 frames of size 224 × 224 pixels uniformly sampled from a video; the 3D patch size for tokenization is 2 × 16 × 16. We use STLT [5] as the trajectory encoder, which takes normalized bounding boxes from 16 frames as input. Our Object Learner is a cross-attention transformer with 6 layers and 8 heads; in its layers we adopt the trajectory attention introduced in [48] instead of conventional joint spatio-temporal attention. The Classification Module has 4 self-attention layers with 6 heads. We set the number of context queries to 2 for all datasets,


and the number of object queries to 6 on SomethingElse, SomethingSomething and Epic-Kitchens, and to 37 on Action Genome.

Training. We train our models on 2 Nvidia RTX6k GPUs with the AdamW [89] optimizer. Due to the large model size and limited compute resources, we are not able to train the full model end-to-end with a large batch size. Instead, we first train the visual backbone for action classification with batch size 8, and then keep it frozen while we train the rest of the model with batch size 72. More details on architecture and training are in the supplementary material.
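The two-stage recipe above can be expressed roughly as follows; the attribute name visual_backbone, the learning rate and the weight decay are assumptions made for illustration (only the optimizer choice, AdamW, and the freezing strategy come from the text).

```python
import torch
from torch.optim import AdamW

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Stage 1 (not shown): train `model.visual_backbone` alone for action
# classification with batch size 8. Stage 2: freeze it and train the trajectory
# encoder, Object Learner and classifiers with batch size 72.
def build_stage2_optimizer(model, lr=1e-4, weight_decay=0.05):
    freeze(model.visual_backbone)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=lr, weight_decay=weight_decay)
```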

5 Experiments

We conduct experiments on four datasets: SomethingSomething-V2, SomethingElse, Action Genome and Epic-Kitchens. We first train and evaluate our model on the standard task of action recognition on these datasets, and then test its transferability on novel tasks/settings, including action recognition on unseen objects, few-shot action recognition, hand state classification, and scene-graph predicate prediction. We first introduce the datasets and metrics, followed by a comparison of our method with other fusion methods and ablations on the design choices in the proposed Object Learner. Finally, we compare with SOTA models on different tasks and analyze the results.

5.1 Datasets and Metrics

SomethingSomething-V2 [90] is a collection of 220k labeled video clips of humans performing basic actions with objects, with 168k training videos and 24k validation videos. It contains 174 classes; these classes are object agnostic and named after the interaction, e.g., 'moving something from left to right'. We use the ground-truth boxes provided in the dataset as input to our networks.

Something-Else [20] is built on the videos in SomethingSomething-V2 [90] and proposes new training/test splits for two tasks testing generalizability: compositional action recognition and few-shot action recognition. The compositional split is designed so that there is no overlap in object categories between the 55k training videos and the 58k validation videos. In the few-shot setting, there are 88 base actions (112k videos) for pre-training and 86 novel classes for fine-tuning. We use the ground-truth boxes provided in the dataset.

Action Genome [25] uses videos and action labels from Charades [91], and decomposes actions into spatio-temporal scene graphs by annotating humans, objects and their relationships. It contains 10k videos (7k train / 3k val) with 0.4M object annotations. We use the raw frames and ground-truth boxes provided for action classification over 157 classes.

Epic-Kitchens [24] is a large-scale egocentric dataset with 100 hours of activities recorded in kitchens. It provides 496 videos for training and 138 videos for validation; each video comes with detected bounding boxes from [92]. We use an offline tracker [93] to associate the boxes across frames and use the resulting trajectories as input to our networks.
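As a rough illustration of how detected or ground-truth boxes become the trajectory encoder's input, the sketch below resamples one object's tracked boxes to a fixed number of frames and normalizes the coordinates; the handling of missing detections and the helper name are assumptions, not the paper's preprocessing code.

```python
import torch

def boxes_to_trajectory(frame_boxes, width, height, num_frames=16):
    """frame_boxes: list of length T_raw; each item is the tracked (x1, y1, x2, y2)
    box of one object in that frame, or None when the track is missing."""
    idx = torch.linspace(0, len(frame_boxes) - 1, num_frames).long()
    traj = torch.zeros(num_frames, 4)
    for t, i in enumerate(idx.tolist()):
        box = frame_boxes[i]
        if box is not None:
            x1, y1, x2, y2 = box
            traj[t] = torch.tensor([x1 / width, y1 / height,
                                    x2 / width, y2 / height])
    return traj  # (num_frames, 4), coordinates normalized to [0, 1]
```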

5.2 Ablations

Comparison with Other Fusion Methods. For a fair comparison, we implement other fusion methods using our (i.e., the same) visual and trajectory backbones. We choose LCF (Late Concatenation Fusion) and CACNF (Cross-Attention CentralNet Fusion) as they are the two best methods among the seven in the latest work [5]. We also compare against a baseline method where the class probabilities from the encoders trained independently are averaged. We implement CACNF with the same number of cross-attention layers and attention heads as in our Object Learner. Table 1 summarizes the results: the above fusion methods work better than the baseline, and our model achieves better results than the other parametric methods. Performance for Motionformer with other trajectory encoders, or STLT with other visual encoders, has been explored in previous works [4,5,20]; more comparisons are in Table 2.

Table 1. Different fusion methods with the same visual backbone. We show the performance of Motionformer and STLT alone on SthSth-V2, and compare the classification performance of different fusion methods on them, namely averaged class probabilities, LCF, CACNF and our Object Learner (OL).

Visual Enc. | Traj. Enc. | Fusion Method | Parametric | SthSth-V2 Top1 | Top5
Motionformer | - | - | - | 66.5 | 90.1
- | STLT | - | - | 57.0 | 85.2
Motionformer | STLT | Avg prob | ✗ | 67.5 | 90.9
Motionformer | STLT | CACNF [5] | ✓ | 69.7 | 93.4
Motionformer | STLT | LCF [5] | ✓ | 73.1 | 94.2
Motionformer | STLT | OL (ours) | ✓ | 74.0 | 94.2

Ablating Trajectory Contrast Loss. We compare the performance of training with and without the trajectory contrast loss on different transfer tasks in Table 3. Having the object-level auxiliary loss (Eq. (1)) brings an improvement in 3 out of 4 tasks: 1% in hand state classification on SomethingElse, 2% in scene-graph prediction and 1.7% on standard classification in Action Genome. The results show the auxiliary loss helps in both task transfer and standard action recognition. Figure 4 also visualizes the attention scores in the Object Learner: object queries trained with the auxiliary loss are more object-centric when attending to the visual frames.

5.3 Results

We present the experimental results on a wide range of tasks, organized by the dataset used. For each dataset, we first compare the results of different models on standard action recognition, and then introduce the transfer tasks and discuss the performance.


Fig. 4. Trajectory Contrast Loss Induces Object-Centric Attention. The Object Learner's attention from various heads when trained with (left) and without (right) the trajectory contrast loss of Eq. (1). In both cases, attention falls on individual objects, indicating objectness in the model. However, there is a notable difference: without the trajectory contrast loss, there is no attention on the hands (first row). Hence, the trajectory contrast loss induces enhanced objectness in the object summary vectors.

5.3.1 SomethingSomething-V2 and SomethingElse

Standard Action Recognition. We evaluate on the regular train/val split of SomethingSomething-V2 for performance on (seen) action recognition. The accuracy of our model is 5.9% higher than ORViT and 7.1% higher than STLT [5], showing that the advantage is not only on transfer tasks but also on standard action classification.

Transfer to Unseen Objects. Compositional action recognition is a task in Something-Else where actions are classified given unseen objects (i.e., objects not present in the training set). It therefore requires the model to learn appearance-agnostic information about the actions. Our object-centric model improves the visual Motionformer by 10.9%, and outperforms the joint-encoding ORViT model by a margin of 3.9%, showing that keeping the trajectory encoding independent from the visual encoder can make the representations more generalizable.

Data-efficiency: Few-shot Action Recognition. We follow the experimental settings in [20] and freeze all parameters except the classifiers in the 5-shot and 10-shot experiments on SomethingElse. Again, models using both the visual and trajectory modalities have an obvious advantage over visual-only ones. The performance boost is more obvious in the low-data regime, with a 4.7% and 0.3% improvement over R3D,STLT in 5-shot and 10-shot respectively.


Table 2. Comparison with SOTA models on Something-Else and SomethingSomething-V2. We report top-1 and top-5 accuracy on three action classification tasks: compositional and few-shot action recognition on SomethingElse, and action recognition on SomethingSomething-V2. From top to bottom we show the performance of SOTA visual models, trajectory models, and models that take both modalities as input. In the last section we list the classification performance of the backbone baseline without an Object Learner (OL) and of our model with an Object Learner (OL). Our model outperforms the other methods by a clear margin on all tasks. Cells marked "–" are not reported.

Model | Video input | Box input | GFLOP | SthSth-V2 Top1 / Top5 | Compositional (unseen objects) Top1 / Top5 | 5-shot | 10-shot
I3D [16] | ✓ | ✗ | 28 | 61.7 / 83.5 | 46.8 / 72.2 | 21.8 | 26.7
SlowFast,R101 [39] | ✓ | ✗ | 213 | 63.1 / 87.6 | 45.2 / 73.4 | 22.4 | 29.2
Motionformer [48] | ✓ | ✗ | 369 | 66.5 / 90.1 | 62.8 / 86.7 | 28.9 | 33.8
STIN [20] | ✗ | ✓ | 5.5 | 54.0 / 79.6 | 51.4 / 79.3 | 24.5 | 30.3
SFI [36] | ✗ | ✓ | – | – | 44.1 / 74.0 | 24.3 | 29.8
STLT [5] | ✗ | ✓ | 4 | 57.0 / 85.2 | 59.0 / 86.0 | 31.4 | 38.6
STIN+I3D [20] | ✓ | ✓ | 33.5 | – | 54.6 / 79.4 | 28.1 | 33.6
STIN,I3D [20] | ✓ | ✓ | 33.5 | – | 58.1 / 83.2 | 34.0 | 40.6
SFI [36] | ✓ | ✓ | – | – | 61.0 / 86.5 | 35.3 | 41.7
R3D,STLT(CACNF) [5] | ✓ | ✓ | 48 | 66.8 / 90.6 | 67.1 / 90.4 | 37.1 | 45.5
ORViT [4] | ✓ | ✓ | 405 | 73.8 / 93.6 | 69.7 / 91.0 | 33.3 | 40.2
Motionformer+STLT (baseline) | ✓ | ✓ | 373 | 72.8 / 94.1 | 72.0 / 92.3 | 38.9 | 44.6
Motionformer+STLT+OL (ours) | ✓ | ✓ | 383.3 | 74.0 / 94.2 | 73.6 / 93.5 | 40.0 | 45.7

Table 3. Ablation of the auxiliary loss and Object Learner (OL) on compositional action, hand state classification and predicate prediction. We show linear probe results on the backbone CLS token. Cells marked "–" are not reported.

Method | Aux loss | Something-Else Compositional Top1 / Top5 | Hand contact state Per-video / Per-class | Action Genome Action mAP | Predicate R@10 / R@20
ORViT | – | 69.7 / 91.0 | 70.2 / 66.0 | – | –
MFormer+STLT (baseline) | – | 72.0 / 93.2 | 66.8 / 43.3 | 66.0 | 78.3 / 83.5
MFormer+STLT+OL (ours) | ✗ | 73.5 / 93.5 | 77.5 / 68.5 | 64.9 | 78.9 / 83.8
MFormer+STLT+OL (ours) | ✓ | 73.6 / 93.5 | 78.2 / 69.7 | 66.6 | 80.9 / 85.4

It is worth noting that while the raw classification results from the backbone and Object Learner classifiers only have a 0.1% difference, averaging the two together gives more than 1% improvement. This suggests that our Object Learner has captured complementary information by combining the two streams.

Transfer to Hand State Classification. We further evaluate the object-level representations (pre-trained with standard action recognition) on hand contact state classification using a linear probe. We extract hand state labels using a pre-trained object-hand detector [92] as ground truth, and design a 3-way classification task on SomethingElse. Specifically, the three classes are 'no hand contact', 'one hand contact' (one hand in contact with an object) and 'two hands contact' (both hands in contact with an object). In our experiments, we average-pool the object summary vectors, train a linear classifier on the training set, and test on the validation set. We conduct the linear probe on summaries trained


with and without the auxiliary loss in Eq. (1), and also on the baseline backbone classifier. Video-level top-1 accuracy and class-level top-1 accuracy are reported in Table 3. Our model is better than the baseline by 11.4% in per-video accuracy and 26.4% in per-class accuracy. Object summaries trained with the auxiliary loss on trajectories outperform those trained without it by about 1%.

5.3.2 Epic-Kitchens

Standard Action Recognition. Table 4 shows the results of action recognition on Epic-Kitchens. With the Object Learner, our model is 0.6–4.0% more accurate in action prediction than other methods that use both the visual and trajectory streams as input, and 1.7% more accurate than the Late Concatenation Fusion (LCF) method without an Object Learner.

Classification on Tail Classes and Unseen Kitchens. In Table 4 we also present the classification results on tail classes and on videos from unseen kitchens. On average, object-centric models are better than visual-only models by 4.8% on tail actions and by 0.4% on unseen kitchens. Among all the models with objectness, our model with the Object Learner achieves the best action classification accuracy on both tail classes and unseen kitchens.

5.3.3 Action Genome

Standard Action Recognition. In Action Genome, each action clip is labelled with object bounding boxes and their categories. We follow the experimental settings in [5] and train and evaluate our model with RGB frames and ground-truth trajectories as input. Table 5 shows the classification results. Using our Object Learner trained with the auxiliary loss, we achieve the best result of 66.6% mAP, outperforming other fusion methods using the same backbone. We also compare to SGFB [25], which uses scene graphs as input; our model is better by 6.3% without access to the relationships between objects.

Table 4. Action classification results on Epic-Kitchens. Our model achieves the best results compared to other methods using the same backbone. MFormer uses 224 × 224 resolution input and MFormer-HR uses 336 × 336 resolution input.

Methods | Box input | Overall (Action / Verb / Noun) | Tail classes (Action / Verb / Noun) | Unseen kitchens (Action / Verb / Noun)
SlowFast [39] | N | 38.5 / 65.5 / 50.0 | 18.8 / 36.2 / 23.3 | 29.7 / 56.4 / 41.5
ViViT-L [94] | N | 44.0 / 66.4 / 56.8 | – | –
MFormer [48] | N | 43.1 / 66.7 / 56.5 | – | –
MFormer-HR [48] | N | 44.5 / 67.0 / 58.5 | 19.7 / 34.2 / 28.4 | 34.8 / 58.0 / 46.6
MFormer-HR+STRG | Y | 42.5 / 65.8 / 55.4 | 24.7 / 39.9 / 34.4 | 34.8 / 59.5 / 48.1
MFormer-HR+STRG+STIN | Y | 44.1 / 66.9 / 57.8 | – | –
MFormer-HR-ORViT [4] | Y | 45.7 / 68.5 / 57.9 | – | –
MFormer-HR+STLT (baseline) | Y | 44.6 / 67.4 / 58.8 | 23.3 / 38.5 / 34.1 | 35.1 / 59.7 / 49.6
MFormer-HR+STLT+OL (ours) | Y | 46.3 / 68.7 / 59.4 | 25.7 / 39.9 / 35.3 | 35.4 / 59.7 / 48.3


Table 5. Action recognition and human-object predicate prediction results on Action Genome. In action classification, our model outperforms others with the same frame and box input, and even SGFB with scene graph (SG) input. When linear-probing the output features for predicate prediction, our Object Learner fuses the visual and trajectory streams efficiently and is 2.6% higher than the baseline LCF in recall@10. We also show that the object-centric representations learned with the auxiliary loss are better than those learned without it on both tasks.

Backbone | Method | Boxes | SG | Aux loss | # Frames | Action CLS mAP | Predicate Pred. R@10 / R@20
I3D [5] | Avgpool | N | N | – | 32 | 33.5 | –
MFormer [48] | CLS token | N | N | – | 16 | 36.5 | 76.4 / 82.6
STLT [5] | CLS token | Y | N | – | 16 | 56.7 | 79.0 / 84.1
STLT [5] | CLS token | Y | N | – | 32 | 60.0 | –
I3D+STLT [5] | CACNF [5] | Y | Y | – | 32 | 61.6 | –
MFormer+STLT | CACNF [5] | Y | N | – | 16 | 64.2 | –
R101-I3D-NL [95] | SGFB [25] | Y | Y | – | 32+ | 60.3 | –
MFormer+STLT | LCF (baseline) | Y | N | – | 16 | 66.0 | 78.3 / 83.5
MFormer+STLT | OL (ours) | Y | N | N | 16 | 64.9 | 78.9 / 83.8
MFormer+STLT | OL (ours) | Y | N | Y | 16 | 66.6 | 80.9 / 85.4

Transfer to Scene Graph Predicate Prediction. We transfer the model trained on action classification to scene-graph predicate prediction by linear probing. In this task, the model has to predict the predicate (relationship) between human and object when the bounding boxes and categories are known. Given the object id, we concatenate a one-hot object-id vector with the classification vector from the frozen model, and train a linear classifier to predict the predicate. As shown in Table 5, the result from object summaries trained with the auxiliary loss is 2.6% higher than linear probing the concatenated CLS tokens (LCF) from the two backbones, and 2.0% higher than the variant trained without the auxiliary loss.
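A linear probe of this kind can be sketched as follows; using scikit-learn's LogisticRegression as the linear classifier is an assumption (the text only specifies a linear classifier on the concatenated one-hot object id and frozen feature).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_predicates(features, object_ids, labels, num_objects):
    """features: (N, D) frozen pooled vectors; object_ids: (N,) integer ids;
    labels: (N,) predicate classes. Returns a fitted linear classifier."""
    one_hot = np.eye(num_objects)[object_ids]            # (N, num_objects)
    x = np.concatenate([features, one_hot], axis=1)      # concat id with feature
    clf = LogisticRegression(max_iter=2000)              # linear probe
    return clf.fit(x, labels)
```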

6 Conclusion

We set out to evaluate whether objectness in video representations can aid visual task transfer. To this end, we have developed an object-centric video model which fuses the visual stream with object trajectories (bounding boxes) in a novel transformer-based architecture. We indeed find that the object-centric representations learned by our model are more transferable to novel tasks and settings in video recognition using a simple linear probe, i.e., they outperform both prior object-agnostic and object-centric representations on a comprehensive suite of transfer tasks. This work uses a very coarse geometric representation of objects, i.e., bounding boxes, for inducing object awareness in visual representations. In the future, more spatially precise or physically grounded representations, e.g., segmentation masks or 3D shape, could further enhance transferability.


Acknowledgements. This research is funded by a Google-DeepMind Graduate Scholarship, a Royal Society Research Professorship, and EPSRC Programme Grant VisualAI EP/T028572/1.

References

1. Spelke, E.S., Breinlinger, K., Macomber, J., Jacobson, K.: Origins of knowledge. Psychol. Rev. 99, 605 (1992)
2. Tenenbaum, J.B., Kemp, C., Griffiths, T.L., Goodman, N.D.: How to grow a mind: statistics, structure, and abstraction. Science 331, 1279–1285 (2011)
3. Grill-Spector, K., Kanwisher, N.: Visual recognition: as soon as you know it is there, you know what it is. Psychol. Sci. 16, 152–160 (2005)
4. Herzig, R., et al.: Object-region video transformers. arXiv preprint arXiv:2110.06915 (2021)
5. Radevski, G., Moens, M.F., Tuytelaars, T.: Revisiting spatio-temporal layouts for compositional action recognition. In: Proceedings of BMVC (2021)
6. Sun, C., Shrivastava, A., Vondrick, C., Murphy, K., Sukthankar, R., Schmid, C.: Actor-centric relation network. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 335–351. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_20
7. Zhang, Y., Tokmakov, P., Hebert, M., Schmid, C.: A structured model for action detection. In: Proceedings of CVPR (2019)
8. Wang, X., Gupta, A.: Videos as space-time region graphs. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 413–431. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_25
9. Chen, Y., Rohrbach, M., Yan, Z., Shuicheng, Y., Feng, J., Kalantidis, Y.: Graph-based global reasoning networks. In: Proceedings of CVPR (2019)
10. Locatello, F., et al.: Object-centric learning with slot attention. In: NeurIPS (2020)
11. Battaglia, P., Pascanu, R., Lai, M., Jimenez Rezende, D., et al.: Interaction networks for learning about objects, relations and physics. In: NeurIPS (2016)
12. Kulkarni, T.D., Gupta, A., Ionescu, C., Borgeaud, S., Reynolds, M., Zisserman, A., Mnih, V.: Unsupervised learning of object keypoints for perception and control. In: NeurIPS (2019)
13. Dubey, R., Agrawal, P., Pathak, D., Griffiths, T.L., Efros, A.A.: Investigating human priors for playing video games. In: Proceedings of ICML (2018)
14. Gopnik, A., Meltzoff, A.N., Kuhl, P.K.: The scientist in the crib: what early learning tells us about the mind. William Morrow Paperbacks (2000)
15. Smith, L.B., Jayaraman, S., Clerkin, E., Yu, C.: The developing infant creates a curriculum for statistical learning. Trends Cogn. Sci. 22, 324–336 (2018)
16. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of CVPR (2017)
17. Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: Proceedings of ICCV (2019)
18. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19
19. Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: Proceedings of CVPR (2020)


20. Materzynska, J., Xiao, T., Herzig, R., Xu, H., Wang, X., Darrell, T.: Something-else: compositional action recognition with spatial-temporal interaction networks. In: Proceedings of CVPR (2020)
21. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
22. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
23. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NeurIPS (2014)
24. Damen, D., et al.: Scaling egocentric vision: the dataset. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 753–771. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_44
25. Ji, J., Krishna, R., Fei-Fei, L., Niebles, J.C.: Action genome: actions as compositions of spatio-temporal scene graphs. In: Proceedings of CVPR (2020)
26. Girdhar, R., Carreira, J., Doersch, C., Zisserman, A.: Video action transformer network. In: Proceedings of CVPR (2019)
27. Gao, C., Xu, J., Zou, Y., Huang, J.-B.: DRG: dual relation graph for human-object interaction detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 696–712. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_41
28. Kato, K., Li, Y., Gupta, A.: Compositional learning for human object interaction. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11218, pp. 247–264. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_15
29. Xu, B., Wong, Y., Li, J., Zhao, Q., Kankanhalli, M.S.: Learning to detect human-object interactions with knowledge. In: Proceedings of CVPR (2019)
30. Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. In: Proceedings of CVPR (2018)
31. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
32. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of ICCV (2017)
33. Gupta, A., Davis, L.S.: Objects in action: an approach for combining action understanding and object perception. In: Proceedings of CVPR (2007)
34. Baradel, F., Neverova, N., Wolf, C., Mille, J., Mori, G.: Object level visual reasoning in videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 106–122. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_7
35. Arnab, A., Sun, C., Schmid, C.: Unified graph structured models for video understanding. In: Proceedings of ICCV (2021)
36. Yan, R., Xie, L., Shu, X., Tang, J.: Interactive fusion of multi-level features for compositional activity recognition. arXiv preprint arXiv:2012.05689 (2020)
37. Ma, C.Y., Kadav, A., Melvin, I., Kira, Z., AlRegib, G., Graf, H.P.: Attend and interact: higher-order object interactions for video understanding. In: Proceedings of CVPR (2018)
38. Kolesnikov, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of ICLR (2021)
39. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of ICCV (2019)


40. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of CVPR (2016)
41. Arandjelović, R., Zisserman, A.: Objects that sound. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 451–466. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_27
42. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 639–658. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_39
43. Aytar, Y., Vondrick, C., Torralba, A.: See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932 (2017)
44. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: Devise: a deep visual-semantic embedding model. In: NeurIPS (2013)
45. Weston, J., Bengio, S., Usunier, N.: Wsabie: Scaling up to large vocabulary image annotation. In: Proceedings of IJCAI (2011)
46. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: Proceedings of ICCV (2019)
47. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of CVPR (2018)
48. Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., Henriques, J.F.: Keeping your eye on the ball: Trajectory attention in video transformers. In: NeurIPS (2021)
49. Alayrac, J.B., et al.: Self-supervised multimodal versatile networks. In: NeurIPS (2020)
50. Nagrani, A., Albanie, S., Zisserman, A.: Learnable PINs: cross-modal embeddings for person identity. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 73–89. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_5
51. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: Proceedings of CVPR (2009)
52. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2014). https://doi.org/10.1007/s11263-014-0733-5
53. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
54. Kristan, M., et al.: The ninth visual object tracking vot2021 challenge results. In: Proceedings of ICCV (2021)
55. Huang, L., Zhao, X., Huang, K.: Got-10k: a large high-diversity benchmark for generic object tracking in the wild. In: IEEE PAMI (2019)
56. Dave, A., Khurana, T., Tokmakov, P., Schmid, C., Ramanan, D.: TAO: a large-scale benchmark for tracking any object. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 436–454. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_26
57. Dendorfer, P., et al.: Motchallenge: a benchmark for single-camera multiple target tracking. IJCV 129, 845–881 (2021)
58. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of CVPR (2016)


59. Xu, N., et al.: Youtube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327 (2018)
60. Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: Proceedings of ICCV (2019)
61. Wang, W., Feiszli, M., Wang, H., Tran, D.: Unidentified video objects: a benchmark for dense, open-world segmentation. In: Proceedings of ICCV (2021)
62. Girdhar, R., Ramanan, D.: CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning. In: ICLR (2020)
63. Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019)
64. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123, 32–73 (2017)
65. Krishna, R., Chami, I., Bernstein, M., Fei-Fei, L.: Referring relationships. In: Proceedings of CVPR (2018)
66. Johnson, J., Karpathy, A., Fei-Fei, L.: Densecap: fully convolutional localization networks for dense captioning. In: Proceedings of CVPR (2016)
67. Saenko, K., et al.: Mid-level features improve recognition of interactive activities. Department of Electrical Engineering and Computer Science, University of California, Berkeley, Technical report (2012)
68. Xu, H., Yang, L., Sclaroff, S., Saenko, K., Darrell, T.: Spatio-temporal action detection with multi-object interaction. arXiv preprint arXiv:2004.00180 (2020)
69. Battaglia, P., et al.: Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018)
70. Ye, Y., Singh, M., Gupta, A., Tulsiani, S.: Compositional video prediction. In: Proceedings of ICCV (2019)
71. Liang, J., Jiang, L., Niebles, J.C., Hauptmann, A.G., Fei-Fei, L.: Peeking into the future: predicting future person activities and locations in videos. In: Proceedings of CVPR (2019)
72. Wu, Y., Gao, R., Park, J., Chen, Q.: Future video synthesis with object motion prediction. In: Proceedings of CVPR (2020)
73. Greff, K., et al.: Multi-object representation learning with iterative variational inference. In: Proceedings of ICML (2019)
74. Henderson, P., Lampert, C.H.: Unsupervised object-centric video generation and decomposition in 3d. In: NeurIPS (2020)
75. Yang, C., Lamdouar, H., Lu, E., Zisserman, A., Xie, W.: Self-supervised video object segmentation by motion grouping. In: Proceedings of ICCV (2021)
76. Burgess, C.P., et al.: Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390 (2019)
77. Engelcke, M., Kosiorek, A.R., Jones, O.P., Posner, I.: Genesis: Generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052 (2019)
78. Johnson, J., Gupta, A., Fei-Fei, L.: Image generation from scene graphs. In: Proceedings of CVPR (2018)
79. Herzig, R., Raboh, M., Chechik, G., Berant, J., Globerson, A.: Mapping images to scene graphs with permutation-invariant structured prediction. In: NeurIPS (2018)
80. Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: GauGAN: semantic image synthesis with spatially adaptive normalization. In: ACM SIGGRAPH 2019 Real-Time Live (2019)


81. Singh, K.K., Ojha, U., Lee, Y.J.: FineGAN: unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In: Proceedings of CVPR (2019)
82. Yang, B., et al.: Learning object-compositional neural radiance field for editable scene rendering. In: Proceedings of ICCV (2021)
83. Herzig, R., Bar, A., Xu, H., Chechik, G., Darrell, T., Globerson, A.: Learning canonical representations for scene graph to image generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12371, pp. 210–227. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58574-7_13
84. Ye, Y., Gandhi, D., Gupta, A., Tulsiani, S.: Object-centric forward modeling for model predictive control. In: Proceedings of CoRL (2020)
85. Devin, C., Abbeel, P., Darrell, T., Levine, S.: Deep object-centric representations for generalizable robot learning. In: Proceedings of International Conference on Robotics and Automation (2018)
86. Bapst, V., et al.: Structured agents for physical construction. In: Proceedings of ICML (2019)
87. Anand, A., Racah, E., Ozair, S., Bengio, Y., Côté, M.A., Hjelm, R.D.: Unsupervised state representation learning in Atari. In: NeurIPS (2019)
88. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
89. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Proceedings of ICLR (2019)
90. Goyal, R., et al.: The "something something" video database for learning and evaluating visual common sense. In: Proceedings of ICCV (2017)
91. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_31
92. Shan, D., Geng, J., Shu, M., Fouhey, D.F.: Understanding human hands in contact at internet scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9869–9878 (2020)
93. Zhang, Y., et al.: ByteTrack: multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864 (2021)
94. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: a video vision transformer. In: Proceedings of ICCV (2021)
95. Wu, C.Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., Girshick, R.: Long-term feature banks for detailed video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 284–293 (2019)

DCVQE: A Hierarchical Transformer for Video Quality Assessment

Zutong Li and Lei Yang(B)

Weibo R&D Limited, Palo Alto, USA
[email protected]

Abstract. The explosion of user-generated videos stimulates a great demand for no-reference video quality assessment (NR-VQA). Inspired by our observation on the actions of human annotation, we put forward a Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA. Starting from extracting the frame-level quality embeddings (QE), our proposal splits the whole sequence into a number of clips and applies Transformers to learn the clip-level QE and update the frame-level QE simultaneously; another Transformer is introduced to combine the clip-level QE to generate the video-level QE. We call this hierarchical combination of Transformers the Divide and Conquer Transformer (DCTr) layer. An accurate video quality feature extraction can be achieved by repeating the process of this DCTr layer several times. Taking the order relationship among the annotated data into account, we also propose a novel correlation loss term for model training. Experiments on various datasets confirm the effectiveness and robustness of our DCVQE model.

1 Introduction

Recent years have witnessed a significant increase in user-generated content (UGC) on social media platforms like Youtube, Tiktok, and Weibo. Watching UGC videos on computers or smartphones has become part of our daily life. This trend stimulates a great demand for automatic video quality assessment (VQA), especially in popular video sharing/recommendation services. UGC-VQA, also known as blind or no-reference video quality assessment (NR-VQA), aims to evaluate in-the-wild videos without the corresponding pristine reference videos. Usually, UGC videos may suffer from complex distortions due to the diversity of capturing devices, uncertain shooting skills, compression, and poor editing processes. Although many excellent algorithms have been proposed to evaluate video quality, it remains a challenging task to assess the quality of UGC videos accurately and consistently.

Z. Li: work done while Z. Li was at Weibo; Z. Li is currently with Microsoft.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_24.


Fig. 1. Three consecutive frames from Video B304 in the LIVE-VQC dataset. An overall high mean opinion score (MOS) of 91.73 was annotated.

Besides frame-level image information, temporal information is regarded as a critically important factor for video analysis tasks. Although many image quality assessment (IQA) models [14,22,33,35,60–62,64,67] can be applied to VQA based on a simple temporal pooling process [52], these models may not work very robustly because of the absence of proper time-sequence aggregation. For instance, Fig. 1 shows three consecutive frames extracted from Video B304 in the LIVE-VQC [48] dataset. As seen, motion blur distortion appears on the actress's hand area. These frame images are most likely to be recognized as middle or even low quality when applying a sophisticated IQA method to them individually. However, the quality of this video was labeled as high by human annotators, because a very smooth movement of the actress can be observed when playing the video stream.

RNNs [1,2,23] and 3D-CNNs [6,12,13,20,25] are potential models to integrate spatial and temporal information for NR-VQA. Though these algorithms perform well on many datasets, the difficulty of parallelizing RNN models and the non-negligible computational cost of 3D-CNNs make them infeasible for many internet applications that require quick responses, such as online video sharing/recommendation services.

On the other hand, research on NR-VQA is often limited by the lack of sufficient training data, due to the tedious work of labeling the mean opinion score (MOS) for each video. Among the publicly available datasets, MCL-JCV [56], VideoSet [57], UGC-VIDEO [27], CVD-2014 [38] and LIVE-Qualcomm [16] are generated in lab environments, while KoNViD-1k [18], LIVE-VQC [48], YouTube-UGC [58] and LSVQ [63] are collected in-the-wild. As mentioned above, one could consider VQA a temporal extension of the IQA task and thus apply frame-level distortion and ranking processes [28] to augment the small video datasets for algorithm development. However, in-the-wild videos are usually hard to synthesize, since they may suffer from compound distortions which cannot be exactly parameterized as a combination of certain distortion cases. Recently, the large-scale LSVQ dataset [63], which contains 39,075 annotated videos with authentic distortions, has been released for public research; not many studies have been conducted on this new dataset so far. Additionally, most previous works take the L1 or L2 loss [23,54,63,64,70] as the optimization criterion for model training. Since these criteria to some extent


ignore the order relationship of the quality scores of the training samples, the trained model may not be stable in quantifying the perceptual differences between videos with similar quality scores. For example, in our research we find that many existing NR-VQA models work well in identifying both high- and low-quality videos, but struggle to distinguish videos with middle quality scores. How to effectively quantify the difference between samples with similar perceptual scores therefore becomes the key to the success of an NR-VQA model.

To address the above problems, in this paper we put forward a new Divide and Conquer Video Quality Estimator (DCVQE) model for NR-VQA. We summarize our contributions as follows: (1) Inspired by our observation on the actions of human annotation, we propose a Divide and Conquer Transformer (DCTr) architecture to extract video quality features for NR-VQA. Our algorithm starts from extracting frame-level quality representations. Regarded as a divide process, we split the input sequence into a number of clips and apply Transformers to learn the clip-level quality embeddings (QE) and update the frame-level QE simultaneously. Subsequently, a conquer process is conducted by using another Transformer to combine the clip-level QE into a video-level QE. After stacking several DCTr layers and topping them with a linear regressor, our DCVQE model predicts the quality value of the input video. (2) By taking the order relationship of the training samples into account, we propose a novel correlation loss that brings an additional order constraint on video quality to guide training. Experiments indicate that the introduction of this correlation loss consistently helps to improve the performance of our DCVQE model. (3) We conduct extensive experiments on different datasets and confirm that our DCVQE outperforms most other algorithms.

2 Related Works

Traditional NR-VQA Solutions: Many prior NR-VQA works are "distortion specific" because they are designed to identify different distortion types like blur [31], blockiness [39], or noise [49] in compressed videos or image frames. More recent and popular models are built on natural video statistics (NVS) features, which are created by extending the highly regular parametric bandpass models of natural scene statistics (NSS) from IQA to VQA tasks. Among them, successful applications have been explored in both the frequency domain (BIQI [36], DIIVINE [37], BLINDS [44], BLINDS-II [45]) and the spatial domain (NIQE [35], BRISQUE [33]). V-BLIINDS [46] combined spatio-temporal NSS with motion coherency models to estimate perceptual video quality. Inter-subband correlations, modeled by spatial-domain statistical features of frame differences, were used to quantify the degree of distortion in VIIDEO [34]. A 3D-DCT was applied to local space-time regions to establish quality-aware features in [26]. Based on hand-crafted feature selection and combination, the recent algorithms VIDEVAL [53] and TLVQM [21] demonstrated outstanding performance on many UGC datasets. The design of these hand-crafted features is deliberate, though they are hard to deploy in an end-to-end fashion for NR-VQA tasks.


Deep Learning Based NR-VQA Solutions: Deep neural networks have shown superior abilities in many computer vision tasks. With the availability of perceptual image quality datasets [15,19,41,64], many successful applications have been reported in the past decade [28,29,50,64,70]. Combined with a convolutional neural aggregation network, DeepVQA [59] utilized the advantages of CNNs to learn spatio-temporal visual sensitivity maps for VQA. Based on a weakly supervised learning and resampling strategy, Zhang et al. [69] proposed a general-purpose NR-VQA framework which inherits the knowledge learned from full-reference VQA and can effectively alleviate the curse of inadequate training data. VSFA [23] used a pretrained CNN to abstract frame features, and introduced the gated recurrent unit (GRU) to learn temporal dependencies. Following VSFA, MDTVSFA [24] proposed a mixed-datasets training method to further improve VQA performance. Although the above methods perform well on synthetic distortion datasets, they may be unstable when analyzing UGC videos with complex and diverse distortions. PVQ [63] reported leading performance on the large-scale LSVQ dataset. For a careful study of local and global spatio-temporal quality, spatial, temporal and spatio-temporal patches were introduced in [63]. RAPIQUE [54], by leveraging a set of NSS features concatenated with learned CNN features, showed top performance on several public datasets, including KoNViD-1k, YouTube-UGC, and their combination.

Transformer Techniques in Computer Vision: The self-attention-based Transformer architecture shows exceptional performance in natural language processing [9,55]. Recently, many researchers have introduced Transformers to solve computer vision problems. ViT [10] directly runs attention among image patches with positional embeddings for image classification. The Detection Transformer (DETR) [5] reached performance comparable to Faster-RCNN [43] by designing a new object detection system based on Transformers and a bipartite matching loss for direct set prediction. Through contrastive learning on 400 million image-text pairs, CLIP [42] showed impressive performance on different zero-shot transfer learning problems. For IQA tasks, the Transformer also shows its strength. Inspired by ViT, TRIQ [66] connected a Transformer with an MLP head to predict perceptual image quality, where sufficiently long positional embeddings were set to analyze images with different resolutions. IQT [7] achieved outstanding performance by applying a Transformer encoder and decoder to the features of reference and distorted images. By introducing a 1D CNN and a Transformer to integrate short-term and long-term temporal information, the recent work LSCT [65] demonstrated excellent performance on VQA. Our proposal is also derived from the Transformer, inspired by our observation on the actions of human annotation. Experiments on various datasets confirm the effectiveness and robustness of our method.

3 Divide and Conquer Video Quality Estimator (DCVQE)

Human judgements of video quality are usually content-dependent and affected by temporal memory [3,11,32,47,51,57,68]. In our investigation, we notice that many human annotators give their opinion on the quality of a video after two actions: first, they watch the video quickly (usually in fast-forward mode) to get an overall impression of its quality; then they may scroll forward and backward to review specific parts of the video before their final decision. Inspired by this observation, we propose a hierarchical architecture, dubbed Divide and Conquer Video Quality Estimator (DCVQE), for NR-VQA. Our model works by extracting three levels of video quality representations, from frames to video clips to the whole video sequence, progressively and repeatedly, somewhat similar to the reverse process of human annotation. An additional correlation loss term is also presented to bring an additional order constraint on video quality to guide training. We find that our method can effectively improve the performance of NR-VQA. We describe our work in detail in the following paragraphs.

Fig. 2. The architectures of the proposed Divide and Conquer Transformer (DCTr) layer (left) and Divide and Conquer Video Quality Estimator (DCVQE) (right).

3.1 Overall Architecture

The left side of Fig. 2 represents the architecture of the key video quality analysis layer, the Divide and Conquer Transformer (DCTr), in our proposal. In order to simulate the second action of human annotation mentioned above, we split the input sequence into a number of clips, and introduce a Transformer module, TransformerD, to learn quality representations for each clip. As shown in Fig. 2,


Fig. 3. The architecture of TransformerD (including the dashed box) and TransformerC (excluding the dashed box).

for the k-th DCTr layer, we split the whole sequence into I clips, and each clip covers J frame-level quality representations generated by the previous layer. For the i-th (1 ≤ i ≤ I) clip C_i^k, a module TransformerD_i is applied to combine the two levels of quality embeddings (QE) generated by the previous layer, that is, all the J frame-level QE F_{i,j}^{k−1} (1 ≤ j ≤ J) and the video-level QE Q^{k−1}, to simultaneously learn the clip-level QE C_i^k and update the frame-level QE F_{i,j}^{k} for the current layer. We further simulate the first action of human annotation by integrating the learned clip-level QE to generate a video-level QE: the other Transformer module, TransformerC, topped with an average pooling layer, merges all clip-level quality representations C_i^k to predict the video-level QE Q^k for the current layer. In general, our proposed video quality analysis module is constructed in a divide-and-conquer format, so we coin it the Divide and Conquer Transformer (DCTr) layer.

The overall architecture of our DCVQE model is shown in the right side of Fig. 2. As seen, we stack several DCTr layers (3 layers for our proposal; refer to the supplementary material for details about this setting) to extract the final video-level QE Q^k for the input video. The final quality score is predicted by topping this embedding Q^k with a regressor. In practice, to improve the temporal sensitivity of our model, we progressively expand the coverage of video clips in the DCTr as the layers deepen. For example, supposing that each clip in the k-th DCTr layer covers J frame-level QE from the previous layer, the coverage of each clip in the (k+1)-th DCTr layer will be 2J, which means two neighbouring clips of the k-th DCTr layer are combined to extract the clip-level QE in the (k+1)-th DCTr layer. For the first DCTr layer, the input frame-level QE F_{∗,∗}^{1} is generated by our feature extractor described in Subsect. 3.2, and the input video-level QE Q^1 is initialized randomly. Both of them are integrated with positional embeddings P_{∗,∗} [9] for the following processes.

TransformerD and TransformerC are the two most important modules in our DCVQE. As shown in Fig. 3, they have a similar architecture to the classic


Fig. 4. Details of CNN backbone fine-tuning (top) and feature extraction (bottom).

Transformer [55]. Three fully connected layers are used to convert the input features into Query, Key, and Value tensors. Split operations are then adopted to group these three tensors into H heads, respectively. For an input frame-level feature with the shape of B × S × D, where B, S and D denote the batch size, sequence length and dimension of the frame embedding respectively, the shapes of the corresponding grouped Query, Key and Value tensors will be H × B × S/H × D. The attention weights are computed by dot-product operations between the grouped Query and grouped Key tensors. To make the TransformerD module more concerned with clip-level quality information, except for the first position, which is reserved for the input video-level QE, we mask out the attention weights outside a predefined temporal range using sequence masks (the dashed square plotted in Fig. 3) and adopt a Softmax layer to normalize the rest. The normalized weights are then applied to the grouped Value tensor to update the quality features. A stack operation is conducted on the head axis to recover the shape of the tensor. Here we also introduce the residual technique [17] to improve the robustness of the TransformerD module, so the final output of this module is the addition of the recovered features and the residual. The TransformerC module is designed to update the video-level QE for each DCTr layer. The difference between TransformerC and TransformerD is that TransformerC does not contain sequence masks or the residual block, as shown in Fig. 3.
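The sequence mask described above can be sketched as a boolean attention mask in which position 0 (the video-level QE) stays fully connected and every other position only sees a window of neighbouring frames. The convention True = blocked follows PyTorch's attn_mask; whether the frames also attend back to position 0 is an assumption of this sketch.

```python
import torch

def temporal_attention_mask(seq_len, window=15):
    """Boolean mask for TransformerD-style attention (True = masked out).
    Index 0 holds the video-level QE and stays fully connected; the remaining
    positions only attend to neighbours within +/- `window` frames."""
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() > window   # outside the window
    mask[0, :] = False    # the video-level QE token attends everywhere
    mask[:, 0] = False    # and every frame can attend to it (assumption)
    return mask           # (seq_len, seq_len), usable as attn_mask in PyTorch
```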

3.2 Feature Extraction

To avoid the additional distortions that may be caused by applying preprocessing steps to the input data, such as image resizing and filtering, we take the original full-size image frames as the input of our model. The pretrained ResNet-50 [17] is adopted as the CNN backbone of our feature extractor. To increase the sensitivity of the feature extractor to frame-level distortions, we first conduct a typical IQA task to fine-tune the CNN backbone. As shown in the top of Fig. 4, all CNN layers except the last group of CNN blocks are frozen for fine-tuning. We find this process helps to transfer the knowledge learned from ImageNet [8] to our application, as verified by our experiments.
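A sketch of this partial fine-tuning with torchvision's ResNet-50 is shown below; the single-output regression head and the choice of pretrained-weights tag are illustrative, while keeping only layer4 trainable mirrors the "last group of CNN blocks" described above.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_iqa_finetune_model():
    backbone = resnet50(weights="IMAGENET1K_V1")
    for name, p in backbone.named_parameters():
        # Only the last group of CNN blocks (layer4) stays trainable.
        p.requires_grad = name.startswith("layer4")
    backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # MOS regression head
    return backbone
```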


Based on the fine-tuned backbone, as the bottom of Fig. 4 shows, the quality feature of each input frame is then extracted by concatenating the outputs of a global average (AVG) pooling and a global standard deviation (STD) pooling process [23,24]. Since the dimensions of both the AVG pooling and STD pooling outputs are 2048, the frame-level QE for each frame is a vector with a dimension of 4096. In practice, following [23], we further add a fully connected (FC) layer to reduce the dimension of the frame-level QE to 128. This is an efficient way to balance the accuracy of the frame-level feature representation and the overall computational cost.
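The pooling-and-projection step can be sketched as follows; the shapes follow the text (2048-channel feature map, 4096-D concatenation, 128-D output), everything else is illustrative.

```python
import torch
import torch.nn as nn

class QualityFeature(nn.Module):
    """Frame-level quality embedding: AVG-pool + STD-pool of a (B, 2048, H, W)
    feature map, concatenated to 4096-D, then reduced to 128-D by an FC layer."""
    def __init__(self, in_ch=2048, out_dim=128):
        super().__init__()
        self.fc = nn.Linear(2 * in_ch, out_dim)

    def forward(self, fmap):                      # fmap: (B, 2048, H, W)
        flat = fmap.flatten(2)                    # (B, 2048, H*W)
        avg = flat.mean(dim=2)
        std = flat.std(dim=2)
        return self.fc(torch.cat([avg, std], dim=1))   # (B, 128)
```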

3.3 Correlation Loss

Prior works usually use the following L1 loss to optimize the model:

$$L_1 = \frac{1}{N}\sum_{n=1}^{N} |p_n - g_n| \qquad (1)$$

where p_n, n ∈ [1, N], represents the n-th predicted MOS, g_n denotes its corresponding ground truth, and N is the total number of videos. Mathematically, this criterion to some extent ignores the relative order relationship of the quality scores of the training samples in a batch, and it may therefore have limited ability to quantify the perceptual differences between videos with similar quality scores. For example, consider training two NR-VQA models with the L1 loss criterion on two video samples A and B with MOS ground truths of 7.0 and 7.1. If their quality scores are predicted as P_A^1 = 7.1, P_B^1 = 7.0 and P_A^2 = 6.9, P_B^2 = 7.0 by the two models respectively (where P_A^1 denotes the quality score of video A predicted by model 1, and so forth), the same strength coming from the L1 losses (0.1) of these two training samples will contribute to optimizing the models correspondingly. Though it is hard to say which model is better than the other, in our opinion it may be easier to optimize model 2 because of the positive correlation between the ground truths and its loss. Based on this idea, a new correlation loss, aiming to bring an additional order constraint on video quality to guide the training procedure, is proposed as follows:

$$L_c = \frac{1}{N}\sum_{n=1}^{N} \max\left(0,\; -\Big(\sum_{m=1}^{N}(p_n - p_m)\Big)\Big(\sum_{m=1}^{N}(g_n - g_m)\Big)\right) \qquad (2)$$

where p_m represents the prediction of a training sample in a batch of size N, and p_n denotes the n-th prediction, which is picked to compare the ranking order with each element in the prediction set. g_m and g_n stand for the corresponding ground-truth scores of p_m and p_n, respectively. This loss can be understood as taking the n-th video in the batch as an anchor and computing the positive or negative correlation between predictions and ground truths for


this anchor and every other one in the batch. Equation (2) can be further simplified as:

$$L_c = \frac{1}{N}\sum_{n=1}^{N} \max\left(0,\; -N^2\Big(p_n - \sum_{m=1}^{N}\frac{p_m}{N}\Big)\Big(g_n - \sum_{m=1}^{N}\frac{g_m}{N}\Big)\right) \qquad (3)$$

$$L_c = N\sum_{n=1}^{N} \max\big(0,\; -(p_n - \bar{p})(g_n - \bar{g})\big) \qquad (4)$$

where p̄ and ḡ represent the mean values of the predictions and ground truths, respectively. We find that this simplified L_c term is actually equivalent to the Spearman correlation coefficient without normalization. A total loss L, which combines the two loss terms in Eqs. (1) and (4), is finally constructed to optimize our DCVQE model:

$$L = \alpha \times L_1 + \beta \times L_c \qquad (5)$$

where α and β represent the weights of the L1 loss and the L_c loss. We discuss the optimal settings for these two weights in Subsect. 4.4.
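Written out directly from Eqs. (1), (4) and (5), the training criterion can be sketched as below; the default weights use the α = 0.7, β = 0.3 setting reported later, and the unnormalized scale of the correlation term follows Eq. (4) literally.

```python
import torch

def correlation_loss(pred, gt):
    """Simplified correlation term of Eq. (4): penalise anchors whose deviation
    from the batch mean disagrees in sign between predictions and labels."""
    n = pred.numel()
    term = -(pred - pred.mean()) * (gt - gt.mean())
    return n * torch.clamp(term, min=0).sum()

def dcvqe_loss(pred, gt, alpha=0.7, beta=0.3):
    l1 = torch.abs(pred - gt).mean()                         # Eq. (1)
    return alpha * l1 + beta * correlation_loss(pred, gt)    # Eq. (5)
```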

4 Experiments

4.1 Datasets

We conduct experiments on four in-the-wild VQA datasets: KoNViD-1k [18], LIVE-VQC [48], YouTube-UGC [58], and LSVQ [63]. Among these, KoNViD-1k contains 1,200 unique contents with a duration of 8 s; LIVE-VQC, the smallest, consists of 585 video contents; YouTube-UGC contains 1,500 UGC video clips sampled from millions of YouTube videos; and the recently released LSVQ, containing 39,075 video samples with diverse durations and resolutions, is the largest publicly available UGC dataset for research right now. We also follow the suggestion in [40,53] to select the YouTube-UGC dataset as the anchor and map the MOS values of KoNViD-1k and LIVE-VQC onto a common scale to generate an All-Combined dataset for performance evaluation. The typical random 60-20-20 strategy [23] is applied to split the data into three sets, i.e., 60% for training, 20% for evaluation, and the remaining 20% for testing. For LSVQ, we follow the setting in [63] to first generate a Test-1080p set and then conduct a random 80-20 split on the rest to generate the training and testing sets for our experiments.

4.2 Evaluation Metrics

Here we adopt four commonly used criteria, the Spearman Rank-Order Correlation Coefficient (SRCC), Kendall Rank-Order Correlation Coefficient (KRCC), Pearson Linear Correlation Coefficient (PLCC) and Root Mean Square Error (RMSE), to measure prediction monotonicity and prediction accuracy. For fair comparisons, we conduct each evaluation 100 times and report the median values of these four metrics as the final results.
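A sketch of these four criteria with SciPy is given below; note that PLCC is computed on the raw predictions here, since the text does not state whether a nonlinear mapping is fitted first.

```python
import numpy as np
from scipy import stats

def vqa_metrics(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    srcc = stats.spearmanr(pred, gt)[0]
    krcc = stats.kendalltau(pred, gt)[0]
    plcc = stats.pearsonr(pred, gt)[0]
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return {"SRCC": srcc, "KRCC": krcc, "PLCC": plcc, "RMSE": rmse}
```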

4.3 Implementation Details

We first fine-tune the ImageNet-pretrained ResNet-50 backbone on the KonIQ-10k dataset [19] with a random 80-20 split and a maximum of 20 epochs. The CNN backbone with the best SRCC on the test set is selected to construct the feature extractor for our VQA task. The frame-level features are extracted with this feature extractor, where all frames of each video are considered without sampling. The temporal range of the sequence mask in the proposed TransformerD (see Fig. 3) is empirically set to 15. As a tradeoff between GPU memory, running speed, and performance, we set the maximum length of each input video to 600 (i.e., only the first 600 frames of a video are considered if the video is longer than this maximum length) and the maximum number of training epochs to 75. An evaluation is conducted after each epoch, and the model with the lowest loss on the validation set is stored.
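For reference, the hyperparameters stated in this subsection can be collected as a small configuration; values not mentioned in the text (optimizer, learning rate, batch size) are deliberately left out rather than guessed.

```python
# Hyperparameters stated in this subsection, collected for reference; anything
# not mentioned in the text (optimizer, learning rate, batch size) is omitted.
DCVQE_TRAINING = {
    "iqa_pretrain_dataset": "KonIQ-10k",   # 80-20 split, up to 20 epochs
    "iqa_pretrain_epochs": 20,
    "sequence_mask_range": 15,             # temporal range of TransformerD mask
    "max_video_length": 600,               # frames; longer videos are truncated
    "max_epochs": 75,
    "model_selection": "lowest validation loss after each epoch",
}
```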

4.4 Ablation Studies

In this subsection, we perform ablation studies to better understand the different components of the proposed DCVQE model.

CNN Backbone: Training the CNN feature extraction backbone can be regarded as a typical IQA task. Two series of architectures based on ResNet and ViT backbones are fine-tuned either fully or partially in our study, where partial fine-tuning means that only the last blocks of these models, topped with average pooling and fully connected layers (see Fig. 4), are trainable. We find that fully fine-tuned backbones behave negatively with respect to network complexity, i.e. deeper structures result in worse performance, and that partially fine-tuned backbones usually work better than fully fine-tuned ones for both series of architectures. We also incorporate a recent attention-based IQA architecture, PHIQNet [65], into our study and observe performance similar to partial fine-tuning on ResNet (refer to the supplementary material for details). As a trade-off between model complexity and performance, and for a fair comparison with previous work [23], we select the partially fine-tuned ResNet-50 as the feature extraction backbone of our DCVQE model.

Temporal Range of Sequence Mask: We introduce a group of experiments on the KoNViD-1k dataset to study how the temporal range affects the DCVQE model. By directly replacing the proposed DCTr layers with traditional Transformer layers [55], a baseline Transformer model is constructed for performance comparison. Test results are plotted in Fig. 5. The best temporal range for the baseline Transformer model is about 9, and its performance slightly decreases as the temporal range increases. In particular, if we set the range to "all", which means that all frame-level representations are involved in the self-attention processes [55] of the Transformer, the performance of the baseline drops dramatically. Switching to our DCVQE, the overall performance stays at a high level, and the best performance is observed when the temporal range reaches 15. The results of the "all" setting for DCVQE, in which one frame only conducts self-attention with all frames in its own clip, also demonstrate the robust performance of our DCVQE model.


Fig. 5. Performance comparisons between a baseline Transformer model and our DCVQE with different settings of the temporal range.

Table 1. Performance comparisons of different weight combinations of α (for the L1 loss term) and β (for the proposed CL term) on the KoNViD-1k dataset.

α   | β   | SRCC   | PLCC   | KRCC   | RMSE
0.0 | 1.0 | 0.6980 | 0.6956 | 0.5149 | 3.0966
0.3 | 0.7 | 0.8259 | 0.8264 | 0.6360 | 0.3924
0.5 | 0.5 | 0.8349 | 0.8353 | 0.6431 | 0.3546
0.7 | 0.3 | 0.8382 | 0.8375 | 0.6500 | 0.3515
1.0 | 0.0 | 0.8278 | 0.8306 | 0.6396 | 0.3623

Correlation Loss: To show how the proposed Correlation Loss (CL) contributes to our VQA task, the performance of DCVQE under different weight settings of the CL and L1 loss terms is reported in Table 1. The best performance is obtained when the L1 loss weight α is 0.7 and the CL weight β is 0.3. We note that the CL is not intended to play the dominant role in model optimization, since it only describes the relative order relationships between videos, particularly those with similar quality scores, so the overall improvement from adding the CL term may not look very significant. Nevertheless, we confirm that our model achieves better performance with the introduction of CL (Eq. 4). The remaining experiments are conducted with α = 0.7 and β = 0.3 unless otherwise specified.


Table 2. Component-by-component comparisons of several key parts of our DCVQE on the KoNViD-1k dataset. Here we note that the L1 loss is a default setting for all tests. Rows 3 and 4 apply AVG pooling (AVG) and TransformerC (TransC), respectively, to conduct the "conquer" operation. "TR 15" in rows 5, 6, 7, and 8 means the temporal range of the sequence mask is set to 15 for those tests.

Row # | Components                            | SRCC   | PLCC   | RMSE
1     | Vanilla-Transformer                   | 0.7988 | 0.7994 | 0.3842
2     | Divide Only                           | 0.7987 | 0.8032 | 0.3723
3     | Divide + AVG                          | 0.8025 | 0.8140 | 0.3644
4     | Divide + TransC                       | 0.8031 | 0.8121 | 0.3608
5     | Divide + AVG + TR 15                  | 0.8181 | 0.8267 | 0.3474
6     | Divide + TransC + AVG + TR 15         | 0.8278 | 0.8289 | 0.3453
7     | Divide + TransC + AVG + TR 15 + PW-RL | 0.8307 | 0.8326 | 0.3542
8     | Divide + TransC + AVG + TR 15 + CL    | 0.8382 | 0.8375 | 0.3515

Component-by-Component Ablation: We further stack the key parts of our DCVQE component by component to analyze their impacts individually and gradually. To additionally study the effectiveness of our new CL term, a well-known pairwise ranking loss (PW-RL) [4], which is similar to our proposal, is also involved in the experiments. As shown in Table 2, compared with the Vanilla-Transformer (row 1), introducing only the Divide operation (row 2) does not noticeably change the performance. In row 3 we employ the AVG pooling operator to integrate clip-level embeddings into video-level embeddings, and the results show that this simple "conquer" operation slightly improves the performance. In row 4 the conquer operation is switched to the proposed TransformerC, which yields similar correlation scores with a lower RMSE (compared with row 3). A sequence-mask temporal range of 15 is introduced in rows 5 and 6, where we can see that this change clearly benefits the system. Furthermore, the CL and PW-RL losses are compared directly in rows 7 and 8. The results confirm that the proposed CL provides the best performance for our VQA task.

4.5 Comparison with the State-of-the-Art Methods

We compare our method with 13 state-of-the-art VQA methods, which can be roughly divided into NR-IQA based methods: BRISQUE [33], GM-LOG [61], HIGRADE [22], FRIQUEE [14], CORNIA [62], HOSA [60], KonCept-512 [19], PaQ-2-PiQ [64], and NR-VQA based methods: V-BLIINDS [46], TLVQM [21], VSFA [23], VIDEVAL [53] and RAPIQUE [54]. For the NR-IQA based methods, we extract one frame feature per second and regard the average of these features as the video representation to train a support vector regressor (SVR) for prediction [53,54,63].


Test results are reported in Table 3, and the best result for each evaluation metric was marked in bold in the original table. Since it is hard to reproduce the VIDEVAL and RAPIQUE methods, in Table 3 we directly cite the results given in [53,54] (marked with "*"). Different from our default 60-20-20 data splitting, these two methods used a random 80-20 strategy to split data for training and testing. For a fair comparison, we train an additional version of the DCVQE model under the same 80-20 splitting; the results of this additional model are marked with "†". As seen from the table, our model (DCVQE†) significantly outperforms the current top method RAPIQUE on the KoNViD-1k dataset by 4.37% SRCC and 2.45% PLCC, and greatly exceeds the second-place method VIDEVAL on the YouTube-UGC dataset by 5.62% SRCC and 5.67% PLCC. The TLVQM method performs best on the LIVE-VQC dataset. Looking into this dataset, most videos are captured with mobile devices, so camera motion blur is one of the common distortions affecting video quality. TLVQM introduces several hand-crafted motion-related features to handle this issue, which explains its strong performance compared with all the deep learning based solutions. Nevertheless, apart from TLVQM, our proposal performs the best among the remaining methods. The evaluation on the All-Combined dataset of KoNViD-1k, LIVE-VQC, and YouTube-UGC also confirms the robustness of our model: as shown in the column "All Combined", our method (DCVQE†) surpasses the second-place method (RAPIQUE) by 2.84% SRCC and 1.12% PLCC.

Table 3. Performance comparisons with the state-of-the-art methods. Results marked "*" are cited from [53,54]; "†" denotes our DCVQE trained with the 80-20 split.

Model | KoNViD-1k (SRCC / PLCC / RMSE) | LIVE-VQC (SRCC / PLCC / RMSE) | YouTube-UGC (SRCC / PLCC / RMSE) | All combined (SRCC / PLCC / RMSE)
BRISQUE | 0.6567 / 0.6576 / 0.4813 | 0.5925 / 0.6380 / 13.100 | 0.3820 / 0.3952 / 0.5919 | 0.5695 / 0.5861 / 0.5617
GM-LOG | 0.6578 / 0.6636 / 0.4818 | 0.5881 / 0.6212 / 13.223 | 0.3678 / 0.3920 / 0.5896 | 0.5650 / 0.5942 / 0.5588
HIGRADE | 0.7206 / 0.7269 / 0.4391 | 0.6103 / 0.6332 / 13.027 | 0.7376 / 0.7216 / 0.4471 | 0.7398 / 0.7368 / 0.4674
FRIQUEE | 0.7472 / 0.7482 / 0.4252 | 0.6579 / 0.7000 / 12.198 | 0.7652 / 0.7571 / 0.4169 | 0.7568 / 0.7550 / 0.4549
CORNIA | 0.7169 / 0.7135 / 0.4486 | 0.6719 / 0.7183 / 11.832 | 0.5972 / 0.6057 / 0.5136 | 0.6764 / 0.6974 / 0.4946
HOSA | 0.7654 / 0.7664 / 0.4142 | 0.6873 / 0.7414 / 11.353 | 0.6025 / 0.6047 / 0.5132 | 0.6957 / 0.7082 / 0.4893
KonCept-512 | 0.7349 / 0.7489 / 0.4260 | 0.6645 / 0.7278 / 11.626 | 0.5872 / 0.5940 / 0.5135 | 0.6608 / 0.6763 / 0.5091
PaQ-2-PiQ | 0.6130 / 0.6014 / 0.5148 | 0.6436 / 0.6683 / 12.619 | 0.2658 / 0.2935 / 0.6153 | 0.4727 / 0.4828 / 0.6081
V-BLIINDS | 0.7101 / 0.7037 / 0.4595 | 0.6939 / 0.7178 / 11.765 | 0.5590 / 0.5551 / 0.5356 | 0.6545 / 0.6599 / 0.5200
TLVQM | 0.7729 / 0.7668 / 0.4102 | 0.7988 / 0.8025 / 10.145 | 0.6693 / 0.6590 / 0.4849 | 0.7271 / 0.7342 / 0.4705
VSFA | 0.7728 / 0.7754 / 0.4205 | 0.6978 / 0.7426 / 11.649 | 0.7611 / 0.7500 / 0.4269 | 0.7690 / 0.7862 / 0.4253
VIDEVAL* | 0.7832 / 0.7803 / 0.4026 | 0.7522 / 0.7514 / 11.100 | 0.7787 / 0.7733 / 0.4049 | 0.7960 / 0.7939 / 0.4268
RAPIQUE* | 0.8031 / 0.8175 / 0.3623 | 0.7548 / 0.7863 / 10.518 | 0.7591 / 0.7684 / 0.4060 | 0.8070 / 0.8279 / 0.3968
DCVQE† | 0.8382 / 0.8375 / 0.3515 | 0.7620 / 0.7858 / 10.549 | 0.8225 / 0.8172 / 0.3770 | 0.8299 / 0.8372 / 0.3824
DCVQE | 0.8206 / 0.8224 / 0.3671 | 0.7479 / 0.7648 / 11.599 | 0.8069 / 0.8050 / 0.3974 | 0.8239 / 0.8350 / 0.3914

To demonstrate how well our model learns from a given dataset, in Fig. 6 we visualize the correlation between the video quality predictions and their ground truths (top row), as well as the video-level QE learned by the "conquer" operation of the last DCTr layer (bottom row). The top row shows a strong correlation between our predictions and their ground truths. The bottom row plots the dimension-reduced video-level QE, where the gray value of each point is assigned the normalized ground-truth MOS score of the corresponding video sample, i.e. the darker a point, the higher its MOS score. We can observe that points with higher MOS scores scatter on one side while points with lower MOS scores lie on the other, which indicates that the separability of quality for these video samples is significantly enhanced by our DCVQE model.


Fig. 6. Visualize test results. Top figures: Video quality predictions versus their ground truth MOS scores; Bottom figures: dimension-reduced video-level QE colored by ground truth MOS scores, where we apply t-SNE algorithm [30] to reduce the dimension of the embeddings to 2 for display.

Table 4. Performance comparisons on the large-scale LSVQ dataset.

Model | Test SRCC | Test PLCC | Test-1080p SRCC | Test-1080p PLCC
BRISQUE | 0.579 | 0.576 | 0.497 | 0.531
TLVQM | 0.772 | 0.774 | 0.589 | 0.616
VIDEVAL | 0.794 | 0.783 | 0.545 | 0.554
VSFA | 0.801 | 0.796 | 0.675 | 0.704
PVQ | 0.827 | 0.828 | 0.711 | 0.739
DCVQE | 0.836 | 0.834 | 0.727 | 0.758

Additionally, we compare our model with several others on the recently released large-scale LSVQ dataset. The PVQ model [63], which was proposed along with this dataset, is also involved in our evaluation. Test results listed in Table 4 confirm the excellent performance of our model again.

5 Conclusions

Inspired by our observation of how humans annotate video quality, in this paper we propose a new Divide and Conquer Transformer (DCTr) architecture to extract video quality features for NR-VQA. Starting from the frame-level quality embeddings (QE) of an input video, two types of Transformers are introduced to extract the clip-level QE and the video-level QE progressively in each DCTr layer. By stacking several DCTr layers and topping them with a regressor, a hierarchical model, named Divide and Conquer Video Quality Estimator (DCVQE), is constructed to predict the quality score of the input video. We also put forward an additional correlation loss that imposes an order relationship among the training data to guide training. Experiments confirm that our proposal outperforms most other methods. Moreover, our model is purely deep learning based, whereas other top methods more or less rely on NSS/NVS features, so we believe our proposal is more practical.

References 1. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: Salah, A.A., Lepri, B. (eds.) HBU 2011. LNCS, vol. 7065, pp. 29–39. Springer, Heidelberg (2011). https://doi.org/10.1007/ 978-3-642-25446-8 4 2. Ballas, N., Yao, L., Pal, C., Courville, A.: Delving deeper into convolutional networks for learning video representations. CoRR (2016) 3. Bampis, C.G., Li, Z., Moorthy, A.K., Katsavounidis, I., Aaron, A., Bovik, A.C.: Study of temporal effects on subjective video quality of experience. IEEE Trans. Image Process. 26(11), 5217–5231 (2017) 4. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach, vol. 227, pp. 129–136, January 2007 5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: Endto-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 13 6. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733 (2017) 7. Cheon, M., Yoon, S.J., Kang, B., Lee, J.: Perceptual image quality assessment with transformers. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 433–442 (2021) 8. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) 9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019) 10. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 11. Duanmu, Z., Ma, K., Wang, Z.: Quality-of-experience of adaptive video streaming: exploring the space of adaptations. In: Proceedings of the 25th ACM international conference on Multimedia, pp. 1752–1760 (2017)


12. Feichtenhofer, C.: X3d: expanding architectures for efficient video recognition. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 200–210 (2020) 13. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6201–6210 (2019) 14. Ghadiyaram, D., Bovik, A.: Perceptual quality prediction on authentically distorted images using a bag of features approach. J. Vision. 17, 32 (2016) 15. Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 25(1), 372–387 (2016) 16. Ghadiyaram, D., Pan, J., Bovik, A.C., Moorthy, A.K., Panda, P., Yang, K.C.: In-capture mobile video distortions: a study of subjective behavior and objective algorithms. IEEE Trans. Circuits Syst. Video Technol. 28(9), 2061–2077 (2018) 17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016) 18. Hosu, V., et al.: The Konstanz natural video database (Konvid-1k). In: 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6 (2017) 19. Hosu, V., Lin, H., Szir´ anyi, T., Saupe, D.: KonIQ-10K: an ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 29, 1 (2020) 20. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Largescale video classification with convolutional neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014) 21. Korhonen, J.: Two-level approach for no-reference consumer video quality assessment. IEEE Trans. Image Process. 28(12), 5923–5938 (2019) 22. Kundu, D., Ghadiyaram, D., Bovik, A.C., Evans, B.L.: No-reference quality assessment of tone-mapped HDR pictures. IEEE Trans. Image Process. 26(6), 2957–2971 (2017) 23. Li, D., Jiang, T., Jiang, M.: Quality assessment of in-the-wild videos. In: Proceedings of the 27th ACM International Conference on Multimedia, October 2019 24. Li, D., Jiang, T., Jiang, M.: Unified quality assessment of in-the-wild videos with mixed datasets training. Int. J. Comput. Vision 129(4), 1238–1257 (2021) 25. Li, X., Wang, Y., Zhou, Z., Qiao, Y.: SmallBigNet: integrating core and contextual views for video classification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1089–1098 (2020) 26. Li, X., Guo, Q., Lu, X.: Spatiotemporal statistics for video quality assessment. IEEE Trans. Image Process. 25(7), 3329–3342 (2016) 27. Li, Y., Meng, S., Zhang, X., Wang, S., Wang, Y., Ma, S.: UGC-video: perceptual quality assessment of user-generated videos. In: 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 35–38 (2020) 28. Liu, X., van de Weijer, J., Bagdanov, A.D.: RankIQA: learning from rankings for no-reference image quality assessment. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1040–1049 (2017) 29. Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (2018) 30. van der Maaten, L., Hinton, G.: Visualizing data using T-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)


31. Marziliano, P., Dufaux, F., Winkler, S., Ebrahimi, T.: A no-reference perceptual blur metric. In: Proceedings of International Conference on Image Processing, vol. 3 (2002) 32. Mirkovic, M., Vrgovi´c, P., Stefanovi´c, D., Anderla, A.: Evaluating the role of content in subjective video quality assessment. Sci. World J. 2014, 625219 (2014) 33. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012) 34. Mittal, A., Saad, M.A., Bovik, A.C.: A completely blind video integrity oracle. IEEE Trans. Image Process. 25(1), 289–300 (2016) 35. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2013) 36. Moorthy, A.K., Bovik, A.C.: A two-step framework for constructing blind image quality indices. IEEE Signal Process. Lett. 17(5), 513–516 (2010) 37. Moorthy, A.K., Bovik, A.C.: Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE Trans. Image Process. 20(12), 3350–3364 (2011) 38. Nuutinen, M., Virtanen, T., Vaahteranoksa, M., Vuori, T., Oittinen, P., H¨ akkinen, J.: CVD 2014-a database for evaluating no-reference video quality assessment algorithms. IEEE Trans. Image Process. 25(7), 3073–3086 (2016) 39. Pan, F., Lin, X., Rahardja, S., Ong, E.P., Lin, W.: Using edge direction information for measuring blocking artifacts of images. Multidimens. Syst. Signal Process. 18, 297–308 (2007) 40. Pinson, M., Wolf, S.: Objective method for combining multiple subjective data sets. pp. 583–592 (2003) 41. Ponomarenko, N., et al.: Color image database tid2013: Peculiarities and preliminary results. In: European Workshop on Visual Information Processing (EUVIP), pp. 106–111 (2013) 42. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021) 43. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017) 44. Saad, M.A., Bovik, A.C., Charrier, C.: A DCT statistics-based blind image quality index. IEEE Signal Process. Lett. 17(6), 583–586 (2010) 45. Saad, M.A., Bovik, A.C., Charrier, C.: Blind image quality assessment: a natural scene statistics approach in the DCT domain. IEEE Trans. Image Process. 21(8), 3339–3352 (2012) 46. Saad, M.A., Bovik, A.C., Charrier, C.: Blind prediction of natural video quality. IEEE Trans. Image Process. 23(3), 1352–1365 (2014) 47. Siahaan, E., Hanjalic, A., Redi, J.: Semantic-aware blind image quality assessment. Signal Process. Image Commun. 60, 237–252 (2017) 48. Sinno, Z., Bovik, A.C.: Large-scale study of perceptual video quality. IEEE Trans. Image Process. 28(2), 612–627 (2019) 49. Tai, S.C., Yang, S.M.: A fast method for image noise estimation using Laplacian operator and adaptive edge detection. In: 2008 3rd International Symposium on Communications, Control and Signal Processing, pp. 1077–1081 (2008) 50. Talebi, H., Milanfar, P.: NIMA: neural image assessment. IEEE Trans. Image Process. 27(8), 3998–4011 (2018)


51. Triantaphillidou, S., Allen, E., Jacobson, R.: Image quality comparison between JPEG and JPEG2000. II. Scene dependency, scene analysis, and classification. J. Imaging Sci. Technol. 51, 259–275 (2007) 52. Tu, Z., Chen, C.J., Chen, L.H., Birkbeck, N., Adsumilli, B., Bovik, A.C.: A comparative evaluation of temporal pooling methods for blind video quality assessment. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 141–145 (2020) 53. Tu, Z., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: UGC-VQA: benchmarking blind video quality assessment for user generated content. IEEE Trans. Image Process. 30, 4449–4464 (2021) 54. Tu, Z., Yu, X., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: RAPIQUE: rapid and accurate video quality prediction of user generated content. IEEE Open J. Signal Process. 2, 425–440 (2021) 55. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 56. Wang, H., et al.: MCL-JCV: A JND-based h.264/AVC video quality assessment dataset. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 1509–1513 (2016) 57. Wang, H., et al.: VideoSet: a large-scale compressed video quality dataset based on JND measurement. J. Visual Commun. Image Represent. 46, 292–302 (2017) 58. Wang, Y., Inguva, S., Adsumilli, B.: YouTube UGC dataset for video compression research. In: 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–5 (2019) 59. Woojae, K., Kim, J., Ahn, S., Kim, J., Lee, S.: Deep video quality assessor: from spatio-temporal visual sensitivity to a convolutional neural aggregation network. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision ECCV 2018, pp. 224–241. Springer International Publishing, Cham (2018) 60. Xu, J., Ye, P., Li, Q., Du, H., Liu, Y., Doermann, D.: Blind image quality assessment based on high order statistics aggregation. IEEE Trans. Image Process. 25(9), 4444–4457 (2016) 61. Xue, W., Mou, X., Zhang, L., Bovik, A.C., Feng, X.: Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features. IEEE Trans. Image Process. 23(11), 4850–4862 (2014) 62. Ye, P., Kumar, J., Kang, L., Doermann, D.: Unsupervised feature learning framework for no-reference image quality assessment. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1098–1105 (2012) 63. Ying, Z., Mandal, M., Ghadiyaram, D., Bovik, A.: Patch-VQ: ‘patching up’ the video quality problem. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14019–14029 (2021) 64. Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (PAQ-2-PIQ): Mapping the perceptual space of picture quality. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3572–3582 (2020) 65. You, J.: Long short-term convolutional transformer for no-reference video quality assessment. pp. 2112–2120 (2021) 66. You, J., Korhonen, J.: Transformer for image quality assessment. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 1389–1393. IEEE (2021) 67. Zhang, L., Zhang, L., Bovik, A.C.: A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 24(8), 2579–2591 (2015)


68. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018) 69. Zhang, Y., Gao, X., He, L., Lu, W., He, R.: Blind video quality assessment with weakly supervised learning and resampling strategy. IEEE Trans. Circuits Syst. Video Technol. 29(8), 2244–2255 (2019) 70. Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: Metaiqa: Deep meta-learning for noreference image quality assessment. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14131–14140 (2020)

Not End-to-End: Explore Multi-Stage Architecture for Online Surgical Phase Recognition

Fangqiu Yi, Yanfeng Yang, and Tingting Jiang(B)

National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China
{chinayi,yangyanfeng,ttjiang}@pku.edu.cn

Abstract. Surgical phase recognition is of particular interest to computer assisted surgery systems, in which the goal is to predict what phase is occurring at each frame of a surgery video. Networks with multi-stage architecture have been widely applied in many computer vision tasks with rich patterns, where a predictor stage first outputs initial predictions and an additional refinement stage operates on the initial predictions to perform further refinement. Existing works show that surgical video contents are well ordered and contain rich temporal patterns, making the multi-stage architecture well suited for the surgical phase recognition task. However, we observe that when simply applying the multi-stage architecture to the surgical phase recognition task, the end-to-end training manner makes the refinement ability fall short of expectations. To address the problem, we propose a new non end-to-end training strategy and explore different designs of multi-stage architecture for the surgical phase recognition task. In the non end-to-end training strategy, the refinement stage is trained separately with two proposed types of disturbed sequences. Meanwhile, we evaluate three different choices of refinement models to show that our analysis and solution are robust to the choice of specific multi-stage model. We conduct experiments on two public benchmarks, the M2CAI16 Workflow Challenge and the Cholec80 dataset. The SOTA-comparable results show that the multi-stage architecture holds great potential to boost the performance of existing single-stage models. Code is available at https://github.com/ChinaYi/NETE.

Keywords: Surgical phase recognition · Surgical workflow segmentation · Multi-stage architecture

1 Introduction

Surgical phase recognition is of particular interest to computer assisted surgery systems, because it offers solutions to numerous demands of the modern operating room, such as monitoring surgical processes [1], scheduling surgeons [2] and enhancing coordination among surgical teams [3].

F. Yi and Y. Yang contribute equally to this work.


Fig. 1. Pipeline of multi-stage architecture.

This paper works on the online surgical phase recognition task, which requires predicting what phase is occurring at each frame without using information from future frames.

Existing surgical phase recognition models can be divided into two groups. The first group is single-stage models, which output prediction results directly from the input visual features, while the second group is multi-stage models, which additionally stack a refinement stage over the prediction results to perform further refinement. In our opinion, the multi-stage architecture is well suited for the surgical phase recognition task. First of all, networks with multi-stage architecture have been widely applied in many computer vision tasks with rich patterns, such as human pose estimation [4,5] and action segmentation [6]. Generally speaking, the multi-stage architecture consists of a predictor stage and a refinement stage, as shown in Fig. 1. Due to hard-to-recognize visual features, the initial predictions output by the predictor stage may contain errors that violate intrinsic patterns within the data (e.g. the tiny spikes of over-segmentation errors within a continuous action, or human pose estimates that do not conform to the connectivity of human body joints). The initial predictions are therefore further refined by the refinement stage. Secondly, surgical video contents are well ordered and contain rich temporal patterns, and some works have been motivated by utilizing these temporal patterns to refine the predictions. The success of [7,8] shows that it is possible for the multi-stage architecture to rectify misclassifications caused by ambiguous visual features in the predictor stage.

However, we observe that the improvement brought by the multi-stage structure in surgical phase recognition is not as obvious as in other tasks. Experiments in [9] show that the performance improvement from the additional refinement stage is very limited. This interesting phenomenon raises the question: why does the multi-stage architecture not work as well as expected?


Fig. 2. Our non end-to-end training strategy where two stages are trained separately.

In this paper, we first answer the question of why with our analysis, and then give a solution for how to make the multi-stage architecture work better for the surgical phase recognition task. The answer to why is twofold. Firstly, a common issue for the multi-stage architecture is that the refinement stage cannot actually learn how to refine under an end-to-end training manner. As shown in Fig. 1, the supervision signal is applied to both the predictor stage and the refinement stage. After several epochs of training, the predictor stage quickly converges to the ground truth, which means that the initial predictions ŷp are almost 100% correct during training. However, during inference on test data, ŷp still contains many errors due to the imperfect predictor. This large gap between the inputs of the refinement stage during training and inference makes the refinement ability fall short of expectations. Secondly, under end-to-end training, the limited size of current surgical phase recognition datasets cannot support the refinement stage, which brings additional parameters; this leads to a severe over-fitting problem compared with applying the multi-stage architecture to other vision tasks.

With this answer to why, we propose a non end-to-end training strategy in which the predictor stage and the refinement stage are trained separately to solve the above two issues simultaneously. As shown in Fig. 2, the predictor stage is trained with the raw video data. To reduce the gap between the inputs of the refinement stage during training and inference, two types of training sequences for the refinement stage are carefully designed to simulate the real output of the predictor during inference, denoted as the cross-validate type and the mask-hard-frame type, respectively. Meanwhile, as the refinement stage is trained separately with these two types of carefully designed training sequences, the over-fitting problem for surgical phase recognition is also alleviated.

Besides the training strategy, we further explore designs of the multi-stage architecture by evaluating three different temporal models for the refinement stage, including TCN (offline) [10], causal TCN (online) [10] and GRU (online) [11]. For the predictor stage, we use the causal TCN in [9] for its high efficiency and good performance; in principle, our solution can be applied to any single-stage predictor model. Extensive experiments are conducted on two public benchmarks, M2CAI16 [12] and Cholec80 [13]. Results show that all three refinement models trained with our strategy successfully boost the performance of the single predictor stage, demonstrating that our analysis and solution are robust to different choices of refinement models, and the SOTA-comparable results show that the multi-stage architecture holds great potential to boost the performance of existing single-stage models.



2 Related Work

Existing surgical phase recognition models can be divided into two groups. The first group is single-stage models which output prediction results with the input visual features. For example, a number of works utilized dynamic time warping [14,15], conditional random field [16], and variations of Hidden Markov Model (HMM) [17,18] over extracted visual features. [7] trained an end-to-end RNN model that first used a very deep ResNet to extract visual features for each frame and then applied a LSTM network to model the temporal dependencies of sequential frames. [19] proposed a LSTM-based temporal network structure that leveraged task-specific network representation to collect long-term sufficient statistics that were propagated by a sufficient statistics model. In addition, with the wide application of transformer in computer vision, there are also some single-stage models based on transformer. [20] used a novel attention regularization loss which encouraged the transformer model to focus on high-quality frames during training, and the attention weights could be utilized to identify characteristic high attention frames for each surgical phase, which could further help the surgery summarization. [21] proposed a hybrid embedding aggregation transformer model which used cleverly designed spatial and temporal embeddings by allowing for active queries based on spatial information from temporal embedding sequences. The second group is the multi-stage models which additionally stack a refinement stage over the prediction results to perform a further refinement. [9] is the first one which brought in multi-stage architecture for surgical phase recognition task. They used a causal TCN [10] to output initial predictions over pre-extracted CNN features, and then appended another causal TCN [10] to refine the predictions. [22] proposed a multi-task multi-stage temporal convolutional network along with a multi-task convolutional neural network training setup to jointly predict the phases and steps and benefit from their complementarity to better evaluate the execution of the procedure. In addition, some surgical phase recognition models used data augmentation techniques to improve performance, such as cross validation and hard frames detection. Cross validation is the simplest way to make part of training data unseen to the predictor model, so that the unseen part of data could be used to


simulate the real predictions during the inference. Besides, the concept of hard frames was first proposed by [8], which denoted the frames that were not recognizable from their visual appearance. They observed that single-stage models usually made mistakes on hard frames, so they found out all the hard frames in the training videos and mapped them to corresponding phases separately.

3 Methods

The multi-stage architecture stacks a refinement stage over the predictor stage sequentially. We propose a training strategy in which these two stages are trained separately, and we explore designs of the multi-stage architecture. We first introduce the predictor stage and its training in Sec. 3.1, then describe the generation of disturbed prediction sequences in Sec. 3.2, and finally discuss the refinement stage and its training in Sec. 3.3. It is worth noting that, although we train the multi-stage architecture in a non end-to-end manner, the inference process is still end-to-end, as in a normal multi-stage architecture.

3.1 Predictor Stage

We use the causal TCN in [9] to obtain the initial predictions, for its high efficiency and good performance. Unlike general temporal convolutions, which depend on both n past and n future frames, causal temporal convolutions rely only on the current and previous frames and thus meet the demand of online surgical phase recognition. In principle, the predictor model can be any online model. The input of the causal TCN is the frame-wise features extracted by a pre-trained CNN. Denote the output prediction sequence as ŷp ∈ R^{C×T}; for each frame, the output is a vector of size C giving the classification probability of each class. For the loss function Lp, we use a combination of a cross-entropy classification loss and a smoothing loss [23], which applies a mean squared error to the classification probabilities of every two adjacent frames:

$$L_p = \frac{1}{T}\sum_{t=1}^{T} -\log \hat{y}_{p(c,t)} \;+\; \frac{1}{TC}\sum_{m=1}^{C}\sum_{t=1}^{T-1}\bigl|\hat{y}_{p(m,t)} - \hat{y}_{p(m,t+1)}\bigr|^{2} \qquad (1)$$

where c denotes the ground-truth phase of frame t.
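A minimal PyTorch sketch of Eq. (1), assuming the model outputs per-frame logits of shape (T, C); this illustrates the loss rather than reproducing the authors' implementation.

```python
import torch.nn.functional as F

def predictor_loss(logits, labels):
    """Eq. (1): frame-wise cross-entropy plus a smoothing term that penalizes
    large changes of the phase probabilities between adjacent frames."""
    ce = F.cross_entropy(logits, labels)        # -log y_hat_{p(c,t)} averaged over T
    prob = logits.softmax(dim=1)                # (T, C) phase probabilities
    smooth = F.mse_loss(prob[1:], prob[:-1])    # |y_hat_t - y_hat_{t+1}|^2 averaged
    return ce + smooth
```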

3.2 Disturbed Prediction Sequence Generation

The input of the refinement stage is the prediction sequence ŷp output by the predictor stage. During inference, ŷp contains many errors due to the imperfect predictor. To achieve better refinement results, we should therefore minimize the distribution gap between the preliminary prediction results seen during training and during inference. Thus, we design two types of disturbed prediction sequences that simulate the imperfect predictions of the predictor model at inference time. Note that the generation of both types of disturbed prediction sequences involves the predictor model, since we cannot obtain prediction sequences directly from the raw video data.


Fig. 3. (a) The generation of the mask-hard-frame type. (b) The generation of the cross-validate type.

Mask-Hard-Frame Type. The concept of hard frames was proposed by [8] to denote frames that are not recognizable from their visual appearance, and their work shows that single-stage models usually make mistakes on hard frames. Motivated by this, we add perturbations to the predictions of these hard frames. We first train a predictor model with the normal video data. Then, we follow the rule of [8] to find the hard frames in the training set and perturb them by covering the whole image with a black mask. Finally, we pass the perturbed video data through the predictor model to obtain the disturbed prediction sequence. The workflow is shown in Fig. 3(a).

Cross-Validate Type. Cross validation is the simplest way to make part of the training data unseen to the predictor model, so that the unseen part can be used to simulate the real predictions of the predictor model during inference. Specifically, we randomly partition the training videos into K groups of equal size. Each time, a single group is retained for validation, and the remaining K − 1 groups are used to train a predictor model; the trained predictor is then used to obtain the prediction sequences of the retained videos. The workflow is shown in Fig. 3(b). We set K = 10 in our experiments.
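The cross-validate generation loop can be sketched as below; train_predictor and predict are placeholders standing in for training and running the causal TCN of Sec. 3.1, so this is only a schematic of the procedure.

```python
import random

def cross_validate_sequences(videos, train_predictor, predict, k=10):
    """Each video's disturbed sequence comes from a predictor that never saw it."""
    videos = list(videos)
    random.shuffle(videos)
    folds = [videos[i::k] for i in range(k)]          # K groups of (roughly) equal size
    disturbed = {}
    for i, held_out in enumerate(folds):
        train_set = [v for j, fold in enumerate(folds) if j != i for v in fold]
        model = train_predictor(train_set)            # placeholder: fit a predictor
        for video in held_out:
            disturbed[video] = predict(model, video)  # imperfect predictions
    return disturbed
```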


Fig. 4. (a) An overview of causal TCN, which can be performed online. The dilation factor is doubled at each causal layer, i.e. 1, 2, 4, ...., 1024. (b) The architecture of the causal convolutions with different dilation factors.


3.3 Refinement Stage

For a training set with N training videos, each of the two methods can generate N perturbed prediction sequences to train the refinement model. Both types of disturbed sequences are used to train the refinement stage in the following experiments unless otherwise specified. For the loss function of the refinement stage, we use the cross-entropy loss. For the choice of the specific refinement model, we evaluate three common temporal models: TCN [10], causal TCN [10] and GRU [11]. We chose these three because they are the most common temporal models. In addition, stacking several predictors sequentially has shown significant improvements in tasks such as human pose estimation and action segmentation; the stacked architecture is composed of several models arranged sequentially such that each model operates directly on the output of the previous one. We therefore also stack single GRU stages to form a stacked GRU for the refinement stage.

Causal TCN. The overview of the causal TCN is illustrated in Fig. 4(a). The first layer is a 1 × 1 convolution that adjusts the dimension of the input features from 2048-d to 64-d. This layer is followed by several layers of dilated causal convolutions, as shown in Fig. 4(b). The causal convolution differs from a common temporal convolution: for each time step t and filter length d, it convolves from X_{t−d} to X_t, which meets the demands of online surgical phase recognition. Note that we do not use temporal pooling layers, because they might result in a loss of the fine-grained information necessary for phase recognition. Instead, we use a dilation factor that is doubled at each causal layer, i.e. 1, 2, 4, ..., 1024, to enlarge the temporal receptive field. Finally, we apply a 1 × 1 convolution over the output of the last dilated causal convolution layer to obtain the probability vector dimension.
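Below is a compact sketch of such a dilated causal stack (kernel size 2, dilation doubled up to 1024, left padding only); it follows the description above but is not the official implementation, and the residual-free design and the class count are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTCN(nn.Module):
    """Input (B, in_dim, T) frame features -> output (B, num_classes, T) logits."""

    def __init__(self, in_dim=2048, hidden=64, num_classes=7, num_layers=11):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, kernel_size=1)        # 2048-d -> 64-d
        self.layers = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel_size=2, dilation=2 ** i)
            for i in range(num_layers)                             # dilation 1, 2, ..., 1024
        )
        self.out = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.inp(x)
        for conv in self.layers:
            pad = conv.dilation[0]                 # left padding keeps frame t causal
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return self.out(x)
```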


Fig. 5. The overview of single GRU network. The GRU unit employs three gates to modulate the interactions between the GRU cells and the environment. The dimension of the hidden state is set to 128.

Single GRU. The Gated Recurrent Unit (GRU) is a popular variant of the Recurrent Neural Network (RNN); its architecture is shown in Fig. 5. The GRU unit employs three gates, i.e. a reset gate r_t, an update gate z_t and a new gate n_t, to modulate the interactions between the GRU cells and the environment. The dimension of the hidden state is set to 128. At time step t, given the input p_t (the probability vector of frame X_t belonging to each phase) and the hidden state h_{t−1}, the GRU unit updates as follows:

$$\begin{aligned}
r_t &= \sigma(W_{ir} p_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}),\\
z_t &= \sigma(W_{iz} p_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}),\\
n_t &= \tanh(W_{in} p_t + b_{in} + r_t * (W_{hn} h_{t-1} + b_{hn})),\\
h_t &= (1 - z_t) * n_t + z_t * h_{t-1}.
\end{aligned} \qquad (2)$$

In the above equations, σ is the sigmoid function and * is the Hadamard product. To obtain the refined predictions, we additionally apply a fully connected layer over the hidden state.
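A single-GRU refinement stage can be sketched as follows (hidden size 128 as stated above; the class count is dataset-dependent and illustrative):

```python
import torch.nn as nn

class GRURefiner(nn.Module):
    """Refine the predictor's frame-wise phase probabilities with a GRU."""

    def __init__(self, num_classes=7, hidden=128):
        super().__init__()
        self.gru = nn.GRU(input_size=num_classes, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)   # fully connected layer on h_t

    def forward(self, probs):
        # probs: (B, T, C) probability vectors p_t from the predictor stage
        h, _ = self.gru(probs)   # unidirectional: frame t only sees frames <= t
        return self.fc(h)        # (B, T, C) refined logits
```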


Stacked GRU. The refinement model takes an initial prediction as input and outputs a refined prediction. It is therefore natural to stack multiple refinement models sequentially, so that each model operates on the refined predictions of the previous one; the effect of such a composition is an incremental refinement of the predictions. In the stacked GRU, the input to each subsequent GRU is the refined result of the previous GRU, not its hidden state. The set of operations at each GRU of the stacked GRU can be formally described as

$$P^{0} = \hat{p}_{1:T}, \qquad P^{s} = \mathrm{GRU}(P^{s-1}) \qquad (3)$$

where P^s is the refined prediction at the s-th GRU and GRU is the single GRU discussed above. As for the loss function, different from the single-stage refinement models, we apply the cross-entropy loss to the refined probability sequence of every GRU.
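Continuing the sketch above, the stacked refinement of Eq. (3) simply chains several such GRU stages, feeding each one the softmax of the previous stage's output and supervising every stage:

```python
import torch.nn as nn

class StackedGRURefiner(nn.Module):
    """Each GRU stage refines the probabilities produced by the previous one."""

    def __init__(self, num_classes=7, num_stages=3, hidden=128):
        super().__init__()
        self.stages = nn.ModuleList(
            GRURefiner(num_classes, hidden) for _ in range(num_stages)
        )

    def forward(self, probs):
        outputs = []
        for stage in self.stages:
            logits = stage(probs)            # P^s = GRU(P^{s-1})
            probs = logits.softmax(dim=-1)   # pass refined probabilities onward
            outputs.append(logits)
        return outputs                        # cross-entropy applied to every stage
```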

3.4 Training Details

We employ ResNet50 [24] as the visual feature extractor to extract off-the-shelf video features. Specifically, the ResNet50 is trained frame-wise, without temporal information, using a cross-entropy loss; the dimension of the extracted ResNet50 features is 2048-d. With the pre-extracted video features, we train the predictor stage for 100 epochs with an initial learning rate of 1e-4 and the Adam optimizer. The refinement stage is trained for 40 epochs with the two proposed types of disturbed sequences.

4 Experiment

4.1 Dataset

M2CAI16 Workflow Challenge Dataset. The M2CAI16 dataset [12] contains 41 laparoscopic videos of cholecystectomy procedures acquired at 25 fps; 27 of them are used for training and 14 for testing. The videos are segmented into 8 phases by experienced surgeons.

Cholec80 Dataset. The Cholec80 dataset [13] contains 80 videos of cholecystectomy surgeries performed by 13 surgeons. The dataset is divided into a training set (40 videos) and a testing set (40 videos). The videos are segmented into 7 phases and are captured at 25 fps.

4.2 Metrics

To quantitatively analyze the performance of our method, we use three metrics [7]: the Jaccard index (JACC), recall (Rec) and accuracy (Acc). Acc evaluates the proportion of correctly classified frames in the whole video, while Rec and JACC evaluate the results for each individual phase.
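One common way to compute these metrics frame-wise is sketched below; the exact per-phase and per-video averaging conventions follow [7] and may differ slightly from this illustration.

```python
import numpy as np

def phase_metrics(pred, gt, num_phases):
    """Frame accuracy plus per-phase recall and Jaccard, averaged over phases."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    acc = float(np.mean(pred == gt))
    recalls, jaccards = [], []
    for c in range(num_phases):
        tp = np.sum((pred == c) & (gt == c))
        fn = np.sum((pred != c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        if tp + fn == 0:                      # phase absent from this video
            continue
        recalls.append(tp / (tp + fn))
        jaccards.append(tp / (tp + fn + fp))
    return acc, float(np.mean(recalls)), float(np.mean(jaccards))
```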


Table 1. Comparison of multi-stage architectures with three different choices of refinement models under the end-to-end training strategy and ours on the Cholec80 dataset. Predictor denotes the single-stage predictor without the refinement stage.

Method | Acc | JACC | Rec
Predictor | 88.8 ± 6.3 | 73.2 ± 9.8 | 84.9 ± 7.2
End-to-End+GRU | 87.1 ± 7.8 | 69.7 ± 12.6 | 83.2 ± 9.4
End-to-End+causal TCN | 87.7 ± 6.3 | 77.7 ± 11.2 | 84.3 ± 6.3
End-to-End+TCN | 89.8 ± 6.6 | 75.8 ± 8.4 | 87.4 ± 7.5
Ours+GRU | 90.8 ± 7.0 | 75.5 ± 11.1 | 85.6 ± 10.0
Ours+causal TCN | 91.0 ± 5.2 | 74.2 ± 11.8 | 84.1 ± 9.6
Ours+TCN | 92.8 ± 5.0 | 78.7 ± 9.4 | 87.5 ± 8.3

4.3 End-to-End VS. Not End-to-End

In this section, we evaluate the performance of multi-stage architectures with the end-to-end training strategy and with our non end-to-end training strategy on the Cholec80 dataset. Results are shown in Table 1. With the end-to-end training strategy, the multi-stage architecture only achieves results comparable to the single-stage predictor; when causal TCN or GRU is used as the refinement model, performance is even slightly worse, possibly due to over-fitting. These results are consistent with those in [9]. In contrast, all three refinement models trained with our proposed disturbed sequences clearly boost the performance of the predictor, which supports our earlier analysis of the multi-stage architecture and shows that our solution is effective and not sensitive to the choice of refinement model. Although TCN achieves the best performance as the refinement model, it does not meet the constraint of online surgical phase recognition because it needs information from future frames, so we explore GRU as the refinement model in the following experiments.

Figure 6 shows qualitative results of GRU as the refinement model under the end-to-end training strategy and our non end-to-end training strategy on the Cholec80 dataset. The results of the single-stage predictor and the end-to-end multi-stage model both contain a large number of short surgical phases, which violate temporal continuity and destroy the smoothness of the predictions; the end-to-end multi-stage model also contains more prediction errors than the single-stage predictor. These results clearly highlight the ability of the multi-stage architecture trained with our solution to produce consistent and smooth predictions.

4.4 Stacked Number of GRUs in Refinement Model

In this section, we use the stacked GRU as the refinement-stage model. During training, the stacked GRU is trained as a whole with the two proposed types of sequences, and the loss function is applied to the output of each GRU. Results of the stacked GRU with different numbers of GRUs on the Cholec80 dataset are shown in Table 2; we obtain the best results when stacking 3 GRUs sequentially. To give an intuitive explanation of how the stacked refinement model works, we also show the qualitative results of a video at different stages of a stacked GRU with 3 GRUs in Fig. 7: Fig. 7(a) shows the predictions at each stage, and Fig. 7(b) shows the probability sequences of the predictions. We can observe that adding more GRUs results in an incremental refinement of the predictions. We also conduct experiments on the M2CAI16 dataset. Results of the stacked GRU with different numbers of GRUs on the M2CAI16 dataset are shown in Table 3; here the best results are obtained when stacking 2 GRUs, which is fewer than on the Cholec80 dataset. This may be due to the smaller size of M2CAI16.


Fig. 6. Qualitative results on the Cholec80 dataset of the multi-stage architectures trained with the end-to-end manner and ours. GRU is used for the refinement stage. (a) is the ground truth. (b) is the prediction from the single-stage predictor. (c) is the prediction from the multi-stage architecture trained with the end-to-end manner. (d) is the prediction from the multi-stage architecture trained with our solution.

Table 2. Effects of the stacked number of GRUs on the Cholec80 dataset.

Method | Acc | JACC | Rec
Single GRU | 90.8 ± 7.0 | 75.5 ± 11.1 | 85.6 ± 10.0
Stacked GRU (2 GRU) | 91.3 ± 6.1 | 75.9 ± 11.2 | 84.4 ± 10.0
Stacked GRU (3 GRU) | 91.5 ± 7.1 | 77.2 ± 11.2 | 86.8 ± 8.5
Stacked GRU (4 GRU) | 90.8 ± 5.8 | 74.2 ± 15.9 | 83.1 ± 14.6

Table 3. Effects of the stacked number of GRUs on the M2CAI16 dataset.

Method | Acc | JACC | Rec
Single GRU | 86.2 ± 9.1 | 72.6 ± 11.6 | 90.0 ± 11.7
Stacked GRU (2 GRU) | 88.2 ± 8.5 | 75.1 ± 10.6 | 91.4 ± 11.2
Stacked GRU (3 GRU) | 86.9 ± 10.2 | 72.7 ± 11.1 | 89.9 ± 9.2
Stacked GRU (4 GRU) | 87.0 ± 8.4 | 72.4 ± 11.0 | 89.0 ± 12.3


Fig. 7. (a) Qualitative results for the predictions in each GRU of the stacked GRU on the Cholec80 dataset. (b) Qualitative results for the probability sequences of the predictions in each GRU of the stacked GRU.

4.5 Impact of Disturbed Prediction Sequence

The disturbed prediction sequences used to train the refinement stage are very important. In this section, we conduct ablative experiments with different combinations of the disturbed prediction sequences and other augmentation techniques; the multi-stage architecture uses the stacked GRU with 3 GRUs as the refinement stage. Besides the mask-hard-frame type and the cross-validate type, we additionally design a random-mask type and a normal-noise type. Different from the mask-hard-frame type, the random-mask type masks frames at random, regardless of how important the masked frames are. The normal-noise type adds Gaussian noise with different standard deviations to the output of the predictor model to simulate its imperfect predictions during inference. Table 4 shows the results on the Cholec80 dataset. First, if only one type of disturbed sequence is used, performance drops due to the lack of training data. We can also observe that the refinement stage trained with the random-mask type is not as good as with the mask-hard-frame type, which demonstrates the importance of where perturbations are added: only when the frames that are hard to classify correctly during inference are masked does the obtained sequence resemble the real situation, which is why mhf outperforms rm. In addition, when a single augmentation method is used, adding noise performs better than the mask-hard-frame and random-mask types, and when combined with the other methods, the normal-noise type further improves their performance and achieves the SOTA performance. These results show that adding Gaussian noise to the output of the predictor model is also an effective perturbation and plays a complementary role to the disturbed prediction sequences.


Table 4. Results of the multi-stage architectures trained with different augmentation techniques. Abbreviations: cv for the cross-validate type, mhf for the mask-hard-frame type, rm for the random-mask type, nn for the normal-noise type.

Method | Acc | JACC | Rec
nn(σ = 0.3) | 88.9 ± 6.3 | 72.8 ± 9.4 | 83.6 ± 9.6
nn(σ = 0.5) | 89.3 ± 6.4 | 74.2 ± 9.1 | 84.8 ± 8.2
nn(σ = 0.7) | 89.1 ± 6.4 | 73.2 ± 9.7 | 84.2 ± 9.2
cv | 89.6 ± 5.6 | 70.4 ± 13.5 | 83.5 ± 13.2
mhf | 88.7 ± 9.4 | 70.6 ± 9.4 | 81.8 ± 9.9
rm | 86.5 ± 7.8 | 69.7 ± 10.9 | 82.4 ± 7.0
nn+cv | 90.4 ± 5.3 | 72.9 ± 14.6 | 85.6 ± 10.6
nn+mhf | 89.2 ± 7.2 | 71.9 ± 10.5 | 82.7 ± 9.7
nn+rm | 87.4 ± 6.8 | 69.2 ± 11.5 | 82.6 ± 8.6
cv+mhf | 91.5 ± 7.1 | 77.2 ± 11.2 | 86.8 ± 8.5
nn+cv+mhf | 92.0 ± 5.3 | 77.1 ± 11.5 | 87.0 ± 7.3
cv+rm+mhf | 91.0 ± 6.8 | 75.3 ± 13.4 | 85.4 ± 10.3
nn+cv+rm+mhf | 91.7 ± 5.2 | 76.0 ± 12.6 | 86.4 ± 9.5

4.6 Comparison with the SOTA Methods

To compare our solution with the SOTA methods, we use a stacked GRU with 3 GRUs as the refinement stage for the Cholec80 dataset and a stacked GRU with 2 GRUs for the M2CAI16 dataset. Tables 5 and 6 show the results. Compared with the single-stage predictor (causal TCN), the multi-stage architecture trained with our solution largely boosts its performance. Note that TeCNO [9] is also a multi-stage network in which a causal TCN is used for both the predictor stage and the refinement stage. When the multi-stage network is simply applied to the surgical phase recognition task, end-to-end training makes the refinement ability fall short of expectations, and TeCNO only achieves results comparable to the single-stage causal TCN. When our strategy is applied instead, the advantage of the multi-stage network is revealed, which supports our earlier analysis and shows that our solution is effective. Compared with the other SOTA methods, our method is comparable to Trans-SVNet [21] on the Cholec80 dataset and much better than the rest, while on the M2CAI16 dataset our method outperforms all other methods.

Table 5. Comparison with the SOTA methods on the Cholec80 dataset.

Method | Acc | JACC | Rec
ResNet [8] | 78.3 ± 7.7 | 52.2 ± 15.0 | –
PhaseLSTM [25] | 80.7 ± 12.9 | 64.4 ± 10.0 | –
PhaseHMM [25] | 71.1 ± 20.3 | 62.4 ± 10.4 | –
EndoNet [13] | 81.7 ± 4.2 | – | 79.6 ± 7.9
EndoNet-GTbin [13] | 81.9 ± 4.4 | – | 80.0 ± 6.7
SV-RCNet [7] | 85.3 ± 7.3 | – | 83.5 ± 7.5
OHFM [8] | 87.0 ± 6.3 | 66.7 ± 12.8 | –
TeCNO [9] | 88.6 ± 2.7 | – | 85.2 ± 10.6
OperA [20] | 85.8 ± 1.0 | – | 87.7 ± 0.7
Trans-SVNet [21] | 90.3 ± 7.1 | 79.3 ± 6.6 | 88.8 ± 7.4
Causal TCN | 88.8 ± 6.3 | 73.2 ± 9.8 | 84.9 ± 7.2
Ours | 92.0 ± 5.3 | 77.1 ± 11.5 | 87.0 ± 7.3

Table 6. Comparison with the SOTA methods on the M2CAI16 dataset.

Method | Acc | JACC | Rec
ResNet [8] | 76.3 ± 8.9 | 56.4 ± 10.4 | –
PhaseLSTM [25] | 72.5 ± 10.6 | 54.8 ± 8.9 | –
PhaseHMM [25] | 79.5 ± 12.1 | 64.1 ± 10.3 | –
SV-RCNet [7] | 81.7 ± 8.1 | 65.4 ± 8.9 | 81.6 ± 7.2
OHFM [8] | 84.8 ± 8.0 | 68.5 ± 11.1 | –
Trans-SVNet [21] | 87.2 ± 9.3 | 74.7 ± 7.7 | 87.5 ± 5.5
Causal TCN | 84.1 ± 9.6 | 69.8 ± 10.7 | 88.3 ± 9.6
Ours | 88.2 ± 8.5 | 75.1 ± 10.6 | 91.4 ± 11.2

5 Conclusion and Future Work

In this paper, we observe that the end-to-end training manner makes the refinement ability of the multi-stage architecture fall short of expectations. To solve this problem, we propose a new non end-to-end training strategy and explore different designs of multi-stage architectures for the surgical phase recognition task. To minimize the distribution gap between training and inference, we generate two types of disturbed sequences as the input of the refinement stage. In the future, we will explore other predictor models and apply our solution to other computer vision tasks where the multi-stage architecture is widely used.

Acknowledgements. This work was partially supported by the Natural Science Foundation of China under contract 62088102. We also acknowledge the Clinical Medicine Plus X-Young Scholars Project, and the High-Performance Computing Platform of Peking University for providing computational resources.



Author contributions. Fangqiu Yi and Yanfeng Yang contributed equally to this work.

References

1. Bricon-Souf, N., Newman, C.R.: Context awareness in health care: a review. Int. J. Med. Inf. 76, 2–12 (2007)
2. Bhatia, B., Oates, T., Xiao, Y., Hu, P.: Real-time identification of operating room state from video. Proc. Conf. Innov. Appl. Artif. Intell. 2, 1761–1766 (2007)
3. Lin, H.C., Shafran, I., Murphy, T.E., Okamura, A.M., Yuh, D.D., Hager, G.D.: Automatic detection and segmentation of robot-assisted surgical motions. In: Medical Image Computing and Computer Assisted Intervention, pp. 802–810 (2005)
4. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, pp. 483–499 (2016)
5. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)
6. Farha, Y.A., Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3570–3579 (2019)
7. Jin, Y., et al.: SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging 37, 1114–1126 (2018)
8. Yi, F., Jiang, T.: Hard frame detection and online mapping for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention (2019)
9. Czempiel, T., et al.: TeCNO: surgical phase recognition with multi-stage temporal convolutional networks. In: Medical Image Computing and Computer Assisted Intervention, vol. 12263, pp. 343–352 (2020)
10. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1003–1012 (2017)
11. Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734 (2014)
12. Stauder, R., Ostler, D., Kranzfelder, M., Koller, S., Feußner, H., Navab, N.: The TUM LapChole dataset for the M2CAI 2016 workflow challenge. arXiv abs/1610.09278 (2016)
13. Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging 36, 86–97 (2017)
14. Blum, T., Feußner, H., Navab, N.: Modeling and segmentation of surgical workflow from laparoscopic video. In: Medical Image Computing and Computer Assisted Intervention, pp. 400–407 (2010)
15. Padoy, N., Blum, T., Ahmadi, S.A., Feussner, H., Berger, M.O., Navab, N.: Statistical modeling and recognition of surgical workflow. Med. Image Anal. 16, 632–641 (2012)
16. Tao, L., Zappella, L., Hager, G.D., Vidal, R.: Surgical gesture segmentation and recognition. In: Medical Image Computing and Computer Assisted Intervention, pp. 339–346 (2013)



17. Lalys, F., Riffaud, L., Morandi, X., Jannin, P.: Surgical phases detection from microscope videos by combining SVM and HMM. In: Menze, B., Langs, G., Tu, Z., Criminisi, A. (eds.) MCV 2010. LNCS, vol. 6533, pp. 54–62. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18421-5_6
18. Padoy, N., Blum, T., Feussner, H., Berger, M.O., Navab, N.: On-line recognition of surgical activity for monitoring in the operating room. In: Proceedings of the Conference on Innovative Applications of Artificial Intelligence, vol. 3, pp. 1718–1724 (2008)
19. Ban, Y., et al.: Aggregating long-term context for learning laparoscopic and robot-assisted surgical workflows. In: IEEE International Conference on Robotics and Automation, pp. 14531–14538 (2021)
20. Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: OperA: attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention, vol. 12904, pp. 604–614 (2021)
21. Gao, X., Jin, Y., Long, Y., Dou, Q., Heng, P.A.: Trans-SVNet: accurate phase recognition from surgical videos via hybrid embedding aggregation transformer. Med. Image Comput. Comput. Assist. Interven. 12904, 593–603 (2021)
22. Ramesh, S., et al.: Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures. Int. J. Comput. Assist. Radiol. Surg. 16(7), 1111–1119 (2021). https://doi.org/10.1007/s11548-021-02388-z
23. Farha, Y.A., Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
25. Twinanda, A.P., Mutter, D., Marescaux, J., Mathelin, M.D., Padoy, N.: Single- and multi-task architectures for surgical workflow challenge at M2CAI 2016. arXiv:1610.08844 (2016)

FunnyNet: Audiovisual Learning of Funny Moments in Videos

Zhi-Song Liu1, Robin Courant2(B), and Vicky Kalogeiton2

1 Caritas Institute of Higher Education, Hong Kong, China
[email protected]
2 VISTA, LIX, Ecole Polytechnique, IP Paris, France
{robin.courant,vicky.kalogeiton}@lix.polytechnique.fr
http://www.lix.polytechnique.fr/vista/projects/2022_accv_liu

Abstract. Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as facial expression, body language, dialogues and culture. In this paper, we propose FunnyNet, a model that relies on cross- and self-attention for both visual and audio data to predict funny moments in videos. Unlike most methods that focus on text with or without visual data to identify funny moments, in this work we exploit audio in addition to visual cues. Audio comes naturally with videos, and moreover it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD and Friends, and the TED-talk dataset UR-Funny. Extensive experiments and analysis show that FunnyNet successfully exploits visual and auditory cues to identify funny moments, while our findings corroborate our claim that audio is more suitable than text for funny moment prediction. FunnyNet sets the new state of the art for laughter detection with audiovisual or multimodal cues on all datasets.

1 Introduction

We understand the world by using our senses, especially in the multimedia area. All signals can stimulate one's feelings and reactions. Funniness is universal and timeless: in 1900 BC Sumerians wrote the first joke, and it is still funny nowadays. However, whereas humans can easily understand funny moments, even across cultures and eras, machines do not. Even though the number of interactions between humans and machines is growing fast, identifying funniness remains an obstacle to making these interactions spontaneous. Indeed, understanding funny moments is a complex problem since they can be purely visual, purely auditory, or a mix of both cues: there is no recipe for the perfect joke.

Z.-S. Liu and R. Courant contributed equally to this work.
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_26.



Fig. 1. What is funny? Audio cues along with visual frames and facial data are a rich source of information for identifying funny moments in videos. Video scene from Pulp Fiction, 1994; source video: https://www.youtube.com/watch?v=4L5LjjYVsHQ

Recently, some works have tried to understand what makes a joke, humor and funny moments [1,2]. These rely solely on text, and only a couple of them also use videos [3,4]; in such cases, video is always combined with text. However, these works are limited, as text (transcripts, subtitles) does not come naturally with videos but depends on external pipelines, which themselves depend on other factors such as accents, sound quality, simultaneous audio, language and annotation. Thus, such works lack flexibility and can hardly be used in the wild. In contrast, audio comes naturally with videos, and it contains crucial and complementary cues, such as tones, pauses, pitch, pronunciation and background noises [5,6]. Indeed, when people talk, not only what they say matters, but also how they say it. In turn, the visual content is also very important. For instance, depending on the context, the very same phrase said by the same person can be funny or sad (Fig. 1). Yet, facial expressions, body gestures and scene context help better understand the sense of a phrase, thus impacting funniness.

Therefore, in this paper, we introduce FunnyNet, an audiovisual model for funny moment prediction. It consists of three encoders: (a) a visual encoder that encompasses the global context information of a scene, (b) a face encoder representing the facial expressions of individuals, and (c) an audio encoder that captures voice and language effects; and the Cross Attention Fusion (CAF) module, i.e., a new module that learns cross-modality correlations hierarchically so that features from different modalities can be combined to form a unified feature for prediction. Thus, FunnyNet is trained to embed all cross-attention features in the same space via self-supervised contrastive learning [7], in addition to classifying clips as funny or not-funny.

To obtain labelled data, we exploit the laughter that naturally exists in sitcom TV shows. We define as 'funny moment' any n-second clip followed by laughter, and as 'not funny' the clips not followed by laughter. To extract laughter, we propose an unsupervised labelling approach that clusters audio segments into laughter, music, voice and empty, based on their waveform differences1. Moreover, we enrich the Friends dataset with laughter annotations.

Our extensive experimentation and analysis show that combining visual with audio cues is suitable for funny-moment detection.

1 Note that we use the laughter solely as an indicator for data labelling; the laughter itself is not included in the audio segments of FunnyNet. Once FunnyNet is trained, it can detect funny moments in any video, with or without laughter.



Specifically, our findings demonstrate that audio results in superior performance compared to language, hence revealing its superiority for the task and supporting our intuition that audio captures higher-level cues than subtitles. Moreover, we compare FunnyNet to the state of the art on five datasets including sitcoms (TBBT, MHD, MUStARD and Friends) and TED talks (UR-Funny), and show that it outperforms all other methods for all metrics and input configurations. We also apply FunnyNet to data from other domains, i.e., movies, stand-up comedies and audiobooks. For quantitative evaluation, we apply FunnyNet to a manually annotated sitcom without canned laughter. It shows that FunnyNet predicts funny moments without fine-tuning, revealing its flexibility for funny-moment detection in the wild.

Our contributions are: (1) We introduce FunnyNet, an audiovisual model for funny moment detection. It combines features from various modalities using the proposed CAF module, relying on cross- and self-attention. (2) Extensive experiments and analysis highlight that FunnyNet successfully exploits audiovisual cues, and show that audio is better suited than text for funny-moment detection. (3) FunnyNet achieves the new state of the art on five datasets, and we also demonstrate its flexibility by applying it to in-the-wild videos.

2 Related Work

Sarcasm and Humor Detection. Sarcasm and humor share similar styles (irony, exaggeration and twist) but also differ from each other in terms of representation. Sarcasm usually relates to dialogues; hence, most methods detect sarcasm by processing language using human efforts. For instance, [8] collect a speech dataset from social media using hashtags and manual labeling, while others [9,10] study the acoustic patterns related to sarcasm, like slower speaking rates or higher volumes of voice. In contrast, a humorous moment is defined as the moment before laughter [6,11]. Hence, such methods [4,6,11–13] process audio to extract laughter for labeling. Nevertheless, for prediction, most such approaches focus solely on language models [1,2] or on multiple cues including text [11,13]. For instance, LaughMachine [4] proposes vision and language attention mechanisms, while MSAM [3] combines self-attention blocks and LSTMs to encode vision and text. [14] first use an advanced BERT [15] model to process long-term textual correlation and then vision for the prediction. Following this, [16] propose a Multimodal Adaptation Gate to efficiently leverage textual cues to explore better representations for sentiment analysis. Few methods also explore audio. For instance, MUStARD [6] and UR-FUNNY [11] process text, audio and frames using LSTMs to explore long-term correlations, while HKT [13] classifies language (context and punchline) and non-verbal cues (audio and frame) to learn cross-attention correlations for humor prediction. They combine audio with other information (video and text) in a simple feature fusion process without investigating the inter-correlations in depth. Specifically, they stack multimodal features to learn global weighting parameters without considering the biases in different domains. In contrast, we believe that funny scenes can be triggered by mutual signals from multiple modalities; hence, we explore the cross-domain agreement of cues with contrastive training. Moreover, instead of text, FunnyNet relies



solely on audiovisual cues, as audio comes naturally with videos and contains all the essential cues for funny moment prediction.

Sound Event Detection detects which sound events happen in an audio track and when. Most attempts either rely on annotated data [17] or use source separation [18]. The input plays a crucial role, and most methods use Mel spectrograms [19–22] instead of audio waveforms. Laughter detection focuses on one specific event: laughter. For this, some methods rely on physiological sensors [23,24], while others [25,26] follow the supervised paradigm to train detectors. In contrast, our laughter detector is an unsupervised, robust and straightforward labelling method that exploits multichannel audio specificities.

Multimodal Signal is processed with models like LSTM [27], GRN [28], ViT [29] and VQVAE [30] and is studied for various tasks. For instance, [31,32] recognize facial movements to separate the speaker's voice in the audio. [33,34] temporally align the audio and video using attention to locate the speaker. Several methods extend this to other applications, including speech recognition [35], audio-image retrieval [36,37], audiovisual generation [38,39], video-text retrieval [40], human replacement [41], visual question answering [42], and affect analysis [43]. Recently, Video Transformer models [44,45] showed improved accuracy on various video tasks, in particular for classification [44–47]. Their self-/cross-attention operation provides a natural mechanism to connect multimodal signals. Thus, some works exploit this to account for multiple modalities, such as inter and intra cross-attention in [48], contrastive cross-attention in [49], iterative cross-attention in [47,50] or bottlenecks in [51], for various tasks such as summarization [52], retrieval [53,54], audiovisual classification [51] and predicting goals [55]. [48,49] iteratively apply self- and cross-attention to explore correlations among modalities. Instead, FunnyNet both fuses all modalities and in parallel learns the cross-correlation among different modalities; this avoids biases caused by one dominant modality.

3 Method

Here, we present FunnyNet, its training process and its losses (Sects. 3.1–3.2). For training labels, we propose an unsupervised laughter detector (Sect. 3.3).

Overview. FunnyNet consists of (a) three encoders: a visual encoder with videos as input, a face encoder with face tracks as input, and an audio encoder with audio as input; and (b) the proposed Cross-Attention Fusion (CAF) module, which explores cross- and intra-modality correlations by applying cross- and self-attention to the encoder outputs. Then, the fused feature is fed to a binary classifier (Fig. 2). For comparison, FunnyNet can take text as an extra input with a text encoder. It is trained to embed all modalities in the same space via a self-supervised contrastive loss and to classify clips as funny or not. For training, we exploit the laughter that naturally exists in TV shows: we define as 'funny moment' any audiovisual snippet followed by laughter, and as 'not funny' any audiovisual snippet not followed by laughter.



Fig. 2. FunnyNet. Given audiovisual clips, it predicts funny moments in videos. It consists of the audio (blue), visual (green), and face (red) encoders, whose outputs pass through the Cross Attention Fusion (CAF) module, which consists of cross-attention (CA) and self-attention (SA) for feature fusion. It is trained to embed all modalities in the same space via self-supervision (Lss) and to classify clips as funny or not-funny (Lcls). (Color figure online)

3.1 FunnyNet Architecture

Audio Encoder. FunnyNet takes as input audio snippets Xaudio in the form of a Mel spectrogram2. They are fed into the audio encoder, i.e., BYOL-A [21], to obtain a 1D feature vector. Finally, we use a Projection module to map it to a 512-D vector for the final prediction: FA ∈ R^512.

Visual Encoder. It processes video frames with TimeSformer [45]. Its inputs are patches of size 16 × 16, partitioned from eight consecutive input frames Xvisual. Unlike TimeSformer, where the representation is obtained from the 'classification token', we obtain the representation by average pooling the features from all patches, thus forming a 768-D vector; then, we use a Projection module to map it to a 512-D vector: FV ∈ R^512. Video context complements audio (or subtitles) to provide richer content [11]. If there is no sound, and hence no subtitle, visual cues can also provoke laughter.

Face Encoder. Face features capture local cues that enrich the visual representation. We use InceptionResNet [56,57] to extract up to eight faces per frame, which we then process with an LSTM to form a 512-D vector FF ∈ R^512. Note that, instead of more advanced models [58–60], we use InceptionResNet because of its robustness and efficiency [61].

Text Encoder. For a fair comparison to the state of the art, we also experiment with a text encoder that uses BERT [15] to extract key features and feeds them to an LSTM to model temporal correlations (more details in the supplementary).

Projection Module. It consists of a linear layer followed by batch normalization, a tanh activation function and another linear layer. It takes the input features from each encoder and projects them into a common 512-D feature space.

Cross-Attention Fusion (CAF) learns the cross-domain correlations among vision, audio and face (yellow box in Fig. 2). It consists of (a) three cross-attention (CA) modules and (b) one self-attention (SA) module, described below:

2 Mel spectrogram is a 2D acoustic time-frequency representation of sound.



(a) Cross-attention is used in cross-domain knowledge transfer to learn across-cue correlations by attending the features from one domain to another [48,62,63]. In CAF, it models the relationship among vision, audio and face features. We stack all features as $F_U \in \mathbb{R}^{3\times512}$, and then feed $F_U$ into three cross-attention modules to attend to vision, face and audio, respectively (Fig. 2). Next, the scaled attention per modality is computed as $\sigma\!\left(\frac{Q_U K_i^T}{\sqrt{d}}\right) V_i$, where $i \in \{V, F, A\}$ for {vision, face, audio} and $\sigma$ is the softmax. The query comes from the stacked features, $Q_U = W_{Q_U} F_U$, while the key and value come from a single modality, $K_i = W_{K_i} F_i$ and $V_i = W_{V_i} F_i$. Next, we obtain three cross-attentions and sum them into a unified feature $F_S$:

$$F_S = \sum_{i \in \{V,F,A\}} \sigma\!\left(\frac{Q_U K_i^T}{\sqrt{d}}\right) V_i . \qquad (1)$$

(b) Self-attention computes the intra-correlation of the $F_S$ features, which are further summed with a residual $F_S$:

$$F_{CAF} = F_S + \sigma\!\left(\frac{Q_S K_S^T}{\sqrt{d}}\right) V_S , \qquad (2)$$

where $Q_S = W_{Q_S} F_S$, $K_S = W_{K_S} F_S$ and $V_S = W_{V_S} F_S$. Finally, we average the $F_{CAF}$ tokens and feed the result to a classification layer.

Discussion. CAF differs from existing methods [48,62] in the computation of the cross-attention. Using the stacked features $F_U$ as the query $Q_U$ to attend to each modality brings three benefits: (a) it is order-agnostic: for any modality pair we compute cross-attention once, instead of twice by interchanging queries and keys/values, which reduces computation; (b) each modality serves as a query to search for tokens in the other modalities, which yields richer feature fusion; and (c) it generalizes to any number of modalities, resulting in scalability.
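As an illustration of Eqs. (1)–(2), the following is a compact PyTorch sketch of the Projection module and of a CAF-style fusion over one visual, one face and one audio token; the use of single-head attention, the two-class head and all parameter names are simplifying assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Linear -> BatchNorm -> tanh -> Linear, mapping an encoder output to the
    shared 512-D space, as described above."""
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim),
                                 nn.BatchNorm1d(out_dim),
                                 nn.Tanh(),
                                 nn.Linear(out_dim, out_dim))
    def forward(self, x):                                  # x: (B, in_dim)
        return self.net(x)

class CAF(nn.Module):
    """Cross-Attention Fusion sketch: Eq. (1) cross-attention of the stacked
    feature F_U to each single modality, then Eq. (2) self-attention with a
    residual, token averaging and a classification layer."""
    def __init__(self, d=512):
        super().__init__()
        self.d = d
        self.q_u = nn.Linear(d, d, bias=False)             # W_QU
        self.kv = nn.ModuleDict({m: nn.ModuleDict({'k': nn.Linear(d, d, bias=False),
                                                   'v': nn.Linear(d, d, bias=False)})
                                 for m in ['V', 'F', 'A']})
        self.q_s = nn.Linear(d, d, bias=False)              # W_QS
        self.k_s = nn.Linear(d, d, bias=False)              # W_KS
        self.v_s = nn.Linear(d, d, bias=False)              # W_VS
        self.cls = nn.Linear(d, 2)                          # funny / not-funny (assumed head)

    def attend(self, q, k, v):
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        return att @ v

    def forward(self, f_v, f_f, f_a):                       # each: (B, 512)
        feats = {'V': f_v, 'F': f_f, 'A': f_a}
        f_u = torch.stack([f_v, f_f, f_a], dim=1)           # F_U: (B, 3, 512)
        q_u = self.q_u(f_u)
        # Eq. (1): sum of cross-attentions from F_U to each single modality.
        f_s = sum(self.attend(q_u,
                              self.kv[m]['k'](feats[m]).unsqueeze(1),
                              self.kv[m]['v'](feats[m]).unsqueeze(1))
                  for m in ['V', 'F', 'A'])                  # (B, 3, 512)
        # Eq. (2): self-attention over F_S plus a residual connection.
        f_caf = f_s + self.attend(self.q_s(f_s), self.k_s(f_s), self.v_s(f_s))
        return self.cls(f_caf.mean(dim=1))                   # average tokens, classify
```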

3.2 Training Model and Loss Functions

Positive and Negative Samples. To create samples, we exploit the laughter that naturally exists in episodes. We define as 'funny' any n-sec clip followed by laughter, and as 'not funny' any n-sec clip not followed by laughter. More formally, given a laughter at timestep $(t_s, t_e)$, we extract an n-sec clip at $(t_s - n, t_s)$ and we split it into audio and video. For each video, we sample n frames (1 FPS). For the audio, we resample it to 16000 Hz and transform it into a Mel spectrogram. Thus, each sample corresponds to n seconds and consists of a Mel spectrogram for the audio and an n-frame-long video. In practice, we use 8-s clips, as this is the average time between two canned laughters and it also leads to better performance (ablations of n-sec clips and n frames per clip in the supplementary). Note that we clip the audio based on the starting time of the laughter, so the positive samples do not include any laughter.

Self-supervised Contrastive Loss. To capture 'mutual' audiovisual information, we solve a self-supervised synchronization task [64–66]: we encourage visual features to be correlated with their true audio and uncorrelated with audio from other videos. Given the i-th pair of visual features $v^i$ and true audio features $a^i$, and N other audios from the same batch, $a^1, \ldots, a^N$, we minimize the loss [7,67,68]:

$$L_{cotrs} = -\log \frac{\exp(S(v^i, a^i)/\tau)}{\sum_{j=1}^{N} \exp(S(v^i, a^j)/\tau)} , \qquad (3)$$

where $S$ is the cosine similarity and $\tau$ the temperature factor. Equation 3 accounts for audio and visual features. Here, we compute the contrastive loss between all three modality pairs, i.e., visual-audio, face-audio and visual-face. Thus, our self-supervised loss is $L_{ss} = \frac{1}{3}\left(L_{cotrs}^{v^i,a^i} + L_{cotrs}^{f^i,a^i} + L_{cotrs}^{v^i,f^i}\right)$.

where S the cosine similarity and τ the temperature factor. Equation 3 accounts for audio and visual features. Here, we compute the contrastive loss between all three modalities, i.e., visual-audio, face-audio, and visual-face. Thus, our self i ,ai i i i ,f i ,a . + Lvcotrs + Lfcotrs supervised loss is Lss = − 13 Lvcotrs Final Loss. FunnyNet is trained with a Softmax loss Ycls to predict if the input is funny or not, and the Lss to learn ‘mutual’ information across modalities. Thus, the final loss is: L=λss Lss + λcls Lcls , where λss , λcls the weighting parameters that control the importance of each loss. 3.3

Fig. 3. Proposed laughter detector. It takes raw waveforms as input and consists of (i) removing voices by subtracting channels (here, the audio is stereo with 2 channels), (ii) detecting peaks, and (iii) clustering audios into music and laughter.

3.3 Unsupervised Laughter Detection

To detect funny moments automatically, we design an unsupervised laughter detector consisting of three steps (Fig. 3).

(i) Remove Voices. Background audio includes sounds, music and laughter; instead, voice (speech) is part of the foreground audio. We remove voices from the audio by exploiting multichannel audio specificities. Given raw waveform audio, when the audio is stereo (two channels), the voices are centered and are common to both channels [69]; hence, by subtracting the channels, we remove the voice and keep the background audio. In surround tracks (six channels), we remove the voice channel [69] and keep the background ones.

(ii) Background Audios. The waveforms from (i) are mostly empty, with sparse peaks that correspond to audio events: laughter and music. To split them into background and empty segments, we use an energy-based peak detector3 that detects peaks based on the computed waveform energy. Then, we keep the background segments and convert them to log-scaled Mel spectrograms.

(iii) Cluster Audio Segments. For each laughter and music segment, we extract features using a self-supervised pre-trained encoder. Then, we cluster all audio segments using K-means to distinguish the laughter segments from the music ones.

3 https://github.com/amsehili/auditok.
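The following is a simplified NumPy/scikit-learn sketch of the three-step detector above; the hand-rolled energy thresholding stands in for the auditok-based peak detector, the window and threshold values are assumptions, and the pretrained audio feature extractor is left as a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def remove_voice(stereo):                  # stereo: (2, num_samples)
    """Step (i): centre-panned voice cancels when subtracting the two channels,
    leaving the background audio (music, laughter, effects)."""
    return stereo[0] - stereo[1]

def energy_segments(x, sr, win=0.5, thresh=0.02):
    """Step (ii): simple energy-based stand-in for the peak detector; returns
    (start, end) times in seconds of non-empty background segments."""
    hop = int(win * sr)
    energy = np.array([np.sqrt(np.mean(x[i:i + hop] ** 2))
                       for i in range(0, len(x) - hop, hop)])
    active = energy > thresh
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        if (not a or i == len(active) - 1) and start is not None:
            segments.append((start * win, (i + 1) * win))
            start = None
    return segments

def cluster_segments(features, n_clusters=2):
    """Step (iii): cluster segment embeddings (e.g. from a pretrained encoder
    such as BYOL-A) with K-means to separate laughter from music."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)

# Usage sketch; `embed(segment, sr)` is a placeholder for the self-supervised
# audio encoder mapping a waveform segment to a feature vector.
# background = remove_voice(stereo_waveform)
# segs = energy_segments(background, sr)
# feats = np.stack([embed(background[int(s*sr):int(e*sr)], sr) for s, e in segs])
# labels = cluster_segments(feats)       # one cluster corresponds to laughter
```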

4 Datasets and Metrics

Datasets. We use five datasets (more details in the supplementary). The Big Bang Theory (TBBT) dataset [4] contains 228 episodes of the TBBT TV show: (183, 23, 22) for (train, val, test). All episodes come with video, audio and subtitles, labelled as humor (or not) if followed (or not) by laughter. The Multimodal Humor Dataset (MHD) [3] contains 110 episodes from TBBT, split (84, 6, 20) for (train, val, test), with splits disjoint from TBBT. It contains multiple modalities; the subtitles are tagged as humor (or not). MUStARD [6] contains 690 segments from 4 TV shows with video-audio-transcript labelled as sarcastic or not. UR-Funny [11] contains 1866 TED-talk segments with video-audio-transcript labelled as funny or not. Friends [70,71] contains 25 episodes from the third season of the Friends TV show, split as (1–15, 16–20, 21–25) for (train, val, test). Each episode contains video, audio and face tracks. Here, we enrich it with manually annotated laughter time-codes, i.e., the starting and ending time of each laughter.

Metrics. To evaluate FunnyNet, we use classification accuracy (Acc) and F1 score (F1). For the laughter detector, we compute precision (Pre), recall (Rec) and F1 at the sample scale at the detection level and at the frame scale at the temporal level.
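For the detection-level evaluation, a predicted laughter segment is typically matched to a ground-truth segment by temporal IoU; the sketch below illustrates one common matching convention, which is an assumption rather than the exact protocol of the paper.

```python
def segment_iou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def detection_prf(pred, gt, iou_thr=0.3):
    """Detection-level Pre/Rec/F1: a prediction is a true positive if it
    overlaps an unmatched ground-truth segment with IoU above the threshold."""
    matched, tp = set(), 0
    for p in pred:
        best, best_iou = None, iou_thr
        for j, g in enumerate(gt):
            iou = segment_iou(p, g)
            if j not in matched and iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched.add(best)
            tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

# Example: detection_prf([(1.0, 3.0), (10.0, 11.0)], [(1.2, 3.1)], iou_thr=0.3)
```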

5 Experiments

We provide experiments for FunnyNet. In the supplementary, we include more results on feature modalities, modules, impacts, time windows of inputs (n-sec inputs, n frames), losses, automatic vs. manual laughter, datasets, the effect of fine-tuning, metrics, complexity, a discussion of the laughter detector, and videos.

Implementation Details. We train FunnyNet with the Adam optimizer, a learning rate of 10^-4, a batch size of 32, and PyTorch [72]. At training, we use data augmentation: for frames, we randomly apply rotation and horizontal/vertical flipping and randomly set the sampling rate to 8 frames; for audio, we apply random forward/backward time shifts and random Gaussian noise.

Setting. In our experiments, we train FunnyNet on Friends. For MUStARD/UR-Funny, we fine-tune FunnyNet on their respective train sets. For TBBT/MHD, we fine-tune it only on a subset of the training set from TBBT (32 random episodes).
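A minimal sketch of the augmentations described above is given below; the rotation angle, shift range and noise scale are assumptions, as their exact magnitudes are not specified here.

```python
import torch
import torchvision.transforms as T

# Frame augmentation: random rotation and horizontal/vertical flips, applied
# to each sampled frame (PIL image or image tensor); magnitudes are assumed.
frame_aug = T.Compose([T.RandomRotation(degrees=10),
                       T.RandomHorizontalFlip(p=0.5),
                       T.RandomVerticalFlip(p=0.5)])

def augment_audio(mel, max_shift=8, sigma=0.01):
    """Audio augmentation: random forward/backward shift of the Mel
    spectrogram along its time axis, plus additive Gaussian noise."""
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    mel = torch.roll(mel, shifts=shift, dims=-1)     # (freq_bins, time_steps)
    return mel + sigma * torch.randn_like(mel)
```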

5.1 Comparison to the State of the Art

Here, we evaluate FunnyNet on five datasets: TBBT, MHD, MUStARD, UR-Funny and Friends, and compare it to the state of the art: MUStARD [6], MSAM [3], MISA [14], HKT [13] and LaughM [4]. Table 1 reports the results (including random, all-positive and all-negative baselines) for both metrics. We indicate the modalities each method uses as V: video, F: face, T: text and A: audio. Overall, we observe that the proposed audiovisual FunnyNet (V+F+A) outperforms all methods for all metrics on all five datasets. For TBBT it outperforms LaughM by a notable margin of +5% for F1 and Acc, while for MHD it outperforms MSAM by 3% in F1 and 7% in Acc, and LaughM by 3% in Acc.



Table 1. Comparison to the state of the art on five datasets. Modalities used per method: A: audio, V: visual frames, F: face, T: text. † Reproduced results: we use the exact model as in [4], pre-train it on Friends and fine-tune it on the other datasets.

Method / Metrics          | TBBT        | MHD         | MUStARD     | UR-Funny    | Friends
                          | F1    Acc   | F1    Acc   | F1    Acc   | F1    Acc   | F1    Acc
Random                    | 46.3  50.0  | 56.1  50.9  | 48.3  48.7  | 50.2  50.2  | 51.0  51.0
All positive              | 60.3  43.2  | 75.6  60.8  | 66.7  50.0  | 75.4  50.7  | 66.7  50.0
All negative              | 0.0   56.8  | 0.0   39.2  | 0.0   50.0  | 0.0   49.3  | 0.0   50.0
MUStARD 2019 (V+A+T) [6]  | –     –     | –     –     | 71.7  71.8  | –     –     | –     –
MSAM 2021 (V+T) [3]       | –     –     | 81.3  72.4  | –     –     | –     –     | –     –
MISA 2020 (V+A+T) [14]    | –     –     | –     –     | –     66.2  | –     69.8  | –     –
HKT 2021 (V+A+T) [13]     | –     –     | –     –     | –     79.4  | –     77.4  | –     –
LaughM 2021 (T) [4]       | 64.2  70.5  | 86.5  76.3  | 68.6  68.7  | 71.9  67.6  | 74.7  59.8
FunnyNet: V+F+A           | 69.6  74.0  | 84.0  79.3  | 81.4  81.0  | 83.7  78.0  | 86.8  84.8
FunnyNet: V+A+T           | 73.8  75.8  | 83.4  78.6  | 79.5  79.9  | 84.1  79.9  | 88.2  85.8
FunnyNet: V+F+T           | 76.0  69.5  | 75.9  69.8  | 75.2  76.3  | 82.3  73.0  | 81.3  76.2
FunnyNet: V+F+A+T         | 75.9  78.3  | 85.2  79.6  | 83.2  82.0  | 84.4  80.2  | 88.8  86.4

For MUStARD and UR-Funny, the results are more conclusive as we compare against several methods that use different modalities; in all cases, FunnyNet outperforms MUStARD, MISA, HKT and LaughM by 10–12% in F1 and 2–15% in Acc on MUStARD, and by 20% in F1 and 1–10% on UR-Funny. For Friends, we observe similar patterns, where we outperform LaughM by 11% in F1 and 25% in Acc. These results confirm the effectiveness of FunnyNet compared to other methods. Our remarks are the following. First, FunnyNet performs best among all methods that leverage audio (MUStARD, MISA, HKT), even without using text. Second, the performance on the out-of-domain UR-Funny is significantly high. Third, for TBBT and MHD our results are much less optimized than the ones from LaughM or MSAM, as we do not have access to the exact same test videos as either work, so inevitably there are some time shifts or wrong labels4, and we use much fewer training data (32 vs. 183 episodes in LaughM and 84 episodes in MHD). These points highlight that FunnyNet is an effective model for funny moment detection.

FunnyNet Using Text. For a fair comparison, we explore a FunnyNet version that leverages subtitles in addition to audiovisual cues (FunnyNet: V+A+T). We compare it against MUStARD, MISA and HKT, which use the same modalities, and observe that FunnyNet (V+A+T) outperforms them all by a large margin (1–14% for all metrics and datasets). This shows that the performance boost of FunnyNet stems from the superior architecture and the adequate modality fusion, rather than from the difference in input modalities.

5.2 Analysis of Unsupervised Laughter Detector

We compare our laughter detector to the state-of-the-art LD [25], used in [6], and to RLD [26]. Table 2 reports the results on Friends.

4 The label time shift is 0.3–1 s on TBBT and 0.3–2 s on v2.



Table 2. Laughter detection evaluation on Friends. We compare 'Ours' to two audio feature extractors.

Method               | Acc   | Temporal            | Det IoU = 0.3       | Det IoU = 0.7
                     |       | Pre    Rec    F1    | Pre    Rec    F1    | Pre    Rec    F1
LD [25]              | 43.64 | 35.70  98.95  52.28 | 25.69  22.09  23.35 | 4.02   3.73   3.82
RLD [26]             | 74.46 | 58.91  61.98  59.69 | 66.15  53.71  59.10 | 18.45  15.04  16.52
Ours (Wav2CLIP [73]) | 77.56 | 64.49  63.66  63.70 | 91.25  61.23  73.07 | 49.74  33.45  39.89
Ours (BYOL-A [21])   | 85.97 | 76.94  79.38  77.81 | 94.57  82.25  87.83 | 54.07  47.11  50.27

Table 3. Ablation of modalities of FunnyNet on the test set of all five datasets (A: audio, V: visual frames, F: face).

A  V  F | TBBT        | MHD         | MUStARD     | UR-Funny    | Friends
        | F1    Acc   | F1    Acc   | F1    Acc   | F1    Acc   | F1    Acc
✓  –  – | 68.9  63.2  | 64.9  67.5  | 65.2  63.8  | 68.2  64.7  | 73.7  66.7
–  ✓  – | 65.0  58.9  | 66.3  66.5  | 64.6  60.2  | 67.5  60.4  | 72.9  63.4
–  –  ✓ | 64.8  59.4  | 66.8  67.7  | 64.7  61.1  | 67.2  61.2  | 72.9  62.1
✓  ✓  – | 70.3  72.8  | 79.9  73.1  | 79.0  77.5  | 80.9  76.8  | 84.2  81.1
✓  –  ✓ | 70.3  72.9  | 80.5  73.9  | 80.4  78.9  | 82.9  79.4  | 83.9  82.2
✓  ✓  ✓ | 69.6  74.0  | 84.0  79.3  | 81.4  81.0  | 83.7  78.0  | 86.8  84.8

Table 4. Ablation of CAF of FunnyNet on the Friends test set.

CAF: Self  Cross | A+V           | A+F           | A+V+F
                 | F1     Acc    | F1     Acc    | F1     Acc
     –     –     | 84.51  80.65  | 82.62  78.99  | 85.97  84.02
     ✓     –     | 84.10  80.64  | 84.01  81.01  | 86.57  84.13
     –     ✓     | 83.13  81.05  | 83.71  81.33  | 86.61  84.52
     ✓     ✓     | 84.19  81.14  | 83.91  82.22  | 86.79  84.75
MMCA [48]        | 83.56  81.05  | 83.44  82.06  | 86.71  84.36
CoMMA [49]       | 83.71  81.08  | 83.79  82.24  | 86.69  84.63

We observe that, overall, our detector outperforms both supervised ones. We also examine the efficiency of the BYOL-A [21] and Wav2CLIP [73] encoders, where we observe that using BYOL-A outperforms Wav2CLIP, due to its richer audio representation capacity. From the detected laughters, we make three notes: (a) most false positives are unfiltered sounds that are not well separable by K-means; (b) most false negatives are intra-diegetic laughter, which is less loud and, hence, less detectable; (c) when the music is superimposed with the laughter (e.g. during a party), the peak detector fails.

5.3 Ablation of FunnyNet

Modalities. Table 3 reports the ablation of all modalities of FunnyNet on five datasets. Using audio alone produces better results than any visual modality alone, underlining that audio is more suitable than visual cues for our task, as it encompasses the way of speaking (tone, pauses). Combining modalities outperforms using single ones: combining audio and visual cues increases the F1 by 3–12% and the Acc by 6–17%. This is expected, as the modalities bring complementary information and their combination helps discriminate funny moments. Audio+face leads to smaller boosts than audio+visual, as frames capture global information better than (low-level) faces. Overall, using all modalities achieves the best performance.

Cross-Attention Fusion (CAF). Table 4 reports results with various cross- and self-attention fusions in CAF. We observe that including either self- or cross-attention (second, third rows) brings improvements over not having any (first row), indicating that they enhance the feature representation. The fourth row shows that using them both for feature fusion leads to the best performance. For completeness, we also compare CAF against the state-of-the-art MMCA [48] and



Fig. 4. Audio vs Text for funny-moment detection on Friends. Relying solely on voices (subtitles) fails when nobody is speaking; the audio, however, may succeed. (a) humorous background music without voices, (b) abrupt background sound (plates smashing) accompanied by a simple dialogue ‘I dropped a cup’

CoMMA [49]. CAF, MMCA and CoMMA all use self- and cross-attention jointly for feature fusion. Their main difference is that MMCA and CoMMA first use self-attention to individually process each modality, then concatenate all modalities together and process them using cross-attention to output the final feature representation. Instead, CAF uses cross-attention to gradually fuse one modality with the rest of the modalities to fully explore cross-modal correlations. The results (fifth and last rows) show that CAF outperforms MMCA [48] and CoMMA [49] by 0.1∼0.4 in F1 score and 0.03∼0.2 in accuracy. This reveals the importance of the gradual modality fusion and hence the superiority of CAF.

6 Analysis of FunnyNet

6.1 Audio vs Subtitles

Instead of subtitles [4], FunnyNet relies on audio, as it (1) encodes information mutual with the text, (2) encodes not only words but also the way of speaking (pauses, pitch and so on), and (3) can succeed when a scene is funny yet nobody is speaking, by exploiting background sounds.

Quantitative Analysis. Table 1 reports FunnyNet results with combinations of text and audiovisual modalities. By comparing visual-textual to visual-audio features (V+F+T vs. V+F+A), we observe that in most cases the audiovisual cues perform better than the visual-textual ones (by 4–8% in both metrics for all datasets); this corroborates our claim that audio identifies funny moments in videos better than text. The last row reports results when combining all modalities (V+F+A+T). This further outperforms all other FunnyNet results, indicating that FunnyNet can efficiently exploit all sources of signal for funny-moment prediction.

Qualitative Analysis. Figure 4 shows some scene frames from Friends and their audio, separated into background (middle) and vocals/voice (bottom) [74]. (a) Joey is trying to use chopsticks to pick up the food but ends up dropping it. There is no voice, but vibrant and active music with a rhythm similar to Joey's action.



Fig. 5. Visualization of (a,b) funny and (c) non-funny predictions on Friends. We show the audio and visual (frame and faces) inputs, the learned average weights of the cross-attentions from CAF (pie chart), and the subtitles (for better understanding)

Fig. 6. t-SNE visualization of embeddings on Friends for (a) visual, (b) audio, (c) face, (d) all modalities. We show positive (blue) and negative samples (red). (Color figure online)

FunnyNet correctly predicts the scene as funny by leveraging audio. (b) The extremely loud sound of several smashing dishes is followed by Gunther appearing and calmly saying "I dropped a cup". Unlike text, the audio correctly captures the contradiction between the smashing sound and the calm words, hence predicting the scene as funny. In both cases, using text results in incorrect predictions, whereas audio successfully leverages non-verbal cues and correctly predicts the scenes as funny, showing that audio addresses such cases better than text.

6.2 FunnyNet Architecture

Modality Impact. To visualize the impact of the modalities, we compute the average attention values over the three CA modules (CA boxes in Fig. 2) and then show the average weight of each modality in the pie charts of Fig. 5. For this, we show two positive and one negative samples from Friends with frames, faces (on frames), and the audio spectrogram (left) and pitch (right). Note that FunnyNet does not use subtitles; we show the dialogues only for better understanding. We observe that the contribution of each modality varies; the commonality, though, is that audio contributes more than half, followed by visual and finally face. When characters smile ('Chandler' in b), the contribution of face increases, indicating the importance of facial expressions, whereas the 'over the shoulder' shot in (c) shows that face posture plays a small role. Moreover, the dramatic peaks of the audio pitch in (a,b) show that they are associated with a funny punchline ('That is a bad duck'), whereas the smooth ones in (c) imply a 'normal' scene.


Fig. 7. Top-5 predicted positive and negative faces of ‘Chandler’ and ‘Rachel’ from Friends

Fig. 9. Failure cases on Friends


Fig. 8. CAF attention maps on Friends. (a,b,c) CA between FU and audio, vision and face; (d) SA on FU

Fig. 10. Funny moments in the wild

Feature Visualization. Figure 6 shows the t-SNE [75] visualization of the features: (a-c) visual, audio and face, and (d) all of them, with blue for funny and red for not-funny samples. All single features, and in particular the visual and facial ones, are scattered around the centre of the 2D space without clear boundaries between positives and negatives. However, the joint embedding shows a clear separation between funny and not-funny, thus revealing the effectiveness of FunnyNet.

Impact of Face Encoder. To examine its effect, in Fig. 7 we depict the top-5 detected faces from positive (top) and negative (bottom) samples of two characters from Friends. We observe that all positives have rich expressions (yelling, staring, smiling, open mouth), while the negatives are either of bad quality (no useful features) or show a neutral expression, thus indicating scenes without funniness.

Impact of CAF Module. To examine the effect of CAF, we visualize in Fig. 8 the learned attention maps: red indicates higher and blue lower attention. (a,b,c) display the cross-attention between the fused FU and the (a) audio, (b) visual and (c) face features. Since FU is stacked from audio, vision and face, we observe that each modality attends highly to itself. We also observe in (a) that both the visual and the face tokens in FU fire with the audio, indicating that FunnyNet captures correlations between audio and visual expressions or movements (e.g. a character laughing). Then, in (c), all modalities attend to the face features, thus revealing their mutual information. Finally, (d) displays the self-attention map of FS, where we observe that FS attends to all tokens with different weights.

Failure Analysis. We note two groups of failure cases. First, when characters laugh sarcastically, the scene is not always funny, but all modalities incorrectly, yet confidently, tag it as funny. Figure 9-(a) shows this, where 'Rachel' laughs sarcastically, which is not funny (subs 'ha ha'); we wrongly predict it as positive. Second, visual cues fail in dark scenes, so we rely on audio. Figure 9-(b) shows a night scene with no clear faces and dark frames, where FunnyNet uses mostly the non-discriminative audio; hence, we wrongly predict the scene as negative.

6.3 Funny Scene Detection in the Wild

We show applications of FunnyNet to videos from other domains (more in the supplementary).

1. Sitcoms without Canned Laughter. We collect 9 episodes of the first season (∼180 min) of Modern Family (Lloyd and Levitan, 2009)5, which has no canned laughter. We manually annotate as positive every punchline that could lead to laughter, resulting in 453 positives (we will make them available). We apply FunnyNet on the 8 s preceding funny moments, resulting in an accuracy of 55.4% vs. 50% for random. Our remarks are: (i) FunnyNet is not fine-tuned on this data, and (ii) Modern Family differs from other sitcoms with a live audience, as the characters do not react to punchlines. Thus, our results indicate that FunnyNet is capable of detecting funny moments in out-of-domain cases. Figure 10 (a) shows a correctly predicted funny moment between two characters who vary their speech rhythm and tones.

2. Movies with Diverse Funny Styles. Figure 10 (b) depicts such an example from the film Dumb and Dumber (Farrelly, 1994). FunnyNet correctly detects funny moments followed by silence or by a speaker's change of tone.

3. Stand-Up Comedies contain several punchlines that make audiences laugh. We experiment on the Jerry Seinfeld stand-up comedy 23 Hours to Kill. Figure 10 (c) shows that FunnyNet detects funny moments correctly and confidently, as Jerry is highly expressive (expressions, gestures).

4. Audio-Only. As audio is the most discriminative cue, we examine its impact on out-of-domain audio: narrating jokes and reading books. FunnyNet detects funny punchlines in jokes, mostly when they are accompanied by a change of pitch or a pause; for the audiobook, it successfully detects funny moments when the reader's voice imitates a character.

7 Conclusions

We introduced FunnyNet, an audiovisual model for funny moment detection. In contrast to works that rely on text, FunnyNet exploits audio, which comes naturally with videos and contains high-level cues (pauses, tones, etc.). Our findings show that audio is the dominant cue for signaling funny situations, while video offers complementary information. Extensive analysis and visualizations also support our finding that audio is better than text (in the form of subtitles) when it comes to scenes with no or simple dialogue but with hilarious acting or funny background sounds. Our results show the effectiveness of each component of FunnyNet, which outperforms the state of the art on TBBT, MUStARD, MHD, UR-Funny and Friends. Future work includes analyzing the contribution of individual audio cues (pitch, tone, etc.).

Acknowledgements. We would like to thank Dim P. Papadopoulos and Xi Wang for proofreading and the anonymous reviewers for their feedback. This work was supported by a DIM RFSI grant, and the ANR-22-CE39-0016 and ANR-22-CE23-0007 projects.

5 https://www.youtube.com/playlist?list=PL8v3aNB88WMM0iwOUeLpgFf3pHH9u xz7.



References

1. Annamoradnejad, I., Zoghi, G.: ColBERT: using BERT sentence embedding for humor detection. arXiv preprint arXiv:2004.12765 (2020)
2. Weller, O., Seppi, K.: The rJokes dataset: a large scale humor collection. In: LREC (2020)
3. Patro, B.N., Lunayach, M., Srivastava, D., Sarvesh, S., Singh, H., Namboodiri, V.P.: Multimodal humor dataset: predicting laughter tracks for sitcoms. In: WACV (2021)
4. Kayatani, Y., et al.: The laughing machine: predicting humor in video. In: WACV (2021)
5. Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., Cambria, E., Morency, L.P.: Memory fusion network for multi-view sequential learning. In: AAAI (2018)
6. Castro, S., Hazarika, D., Pérez-Rosas, V., Zimmermann, R., Mihalcea, R., Poria, S.: Towards multimodal sarcasm detection (an Obviously perfect paper). In: ACL (2019)
7. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of ICML (2020)
8. Davidov, D., Tsur, O., Rappoport, A.: Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In: ACL (2010)
9. Rockwell, P.: Lower, slower, louder: vocal cues of sarcasm. J. Psycholinguist. Res. (2000)
10. Tepperman, J., Traum, D., Narayanan, S.S.: 'Yeah right': sarcasm recognition for spoken dialogue systems. In: INTERSPEECH (2006)
11. Hasan, M.K., et al.: UR-FUNNY: a multimodal language dataset for understanding humor. In: EMNLP-IJCNLP (2019)
12. Bertero, D., Fung, P.: Deep learning of audio and language features for humor prediction. In: LREC (2016)
13. Hasan, M.K., et al.: Humor knowledge enriched transformer for understanding multimodal humor. In: AAAI (2021)
14. Hazarika, D., Zimmermann, R., Poria, S.: MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia (2020)
15. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
16. Rahman, W., et al.: Integrating multimodal information in large pretrained transformers. In: ACL (2020)
17. Mesaros, A., Heittola, T., Virtanen, T.: TUT database for acoustic scene classification and sound event detection. In: 24th European Signal Processing Conference (EUSIPCO) (2016)
18. Défossez, A., Usunier, N., Bottou, L., Bach, F.: Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254 (2019)
19. Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T., Plumbley, M.D.: Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. ACM Transactions on Audio, Speech, and Language Processing (2017)
20. Wang, L., Luc, P., Recasens, A., Alayrac, J.B., Oord, A.: Multimodal self-supervised learning of general audio representations. arXiv preprint arXiv:2104.12807 (2021)



21. Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N., Kashino, K.: BYOL for audio: self-supervised learning for general-purpose audio representation. In: 2021 International Joint Conference on Neural Networks (IJCNN) (2021)
22. Saeed, A., Grangier, D., Zeghidour, N.: Contrastive learning of general-purpose audio representations. In: ICASSP (2021)
23. Barral, O., Kosunen, I., Jacucci, G.: No need to laugh out loud: predicting humor appraisal of comic strips based on physiological signals in a realistic environment. ACM Transactions on Computer-Human Interaction (TOCHI) (2017)
24. Shimasaki, A., Ueoka, R.: Laugh log: e-textile bellyband interface for laugh logging. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (2017)
25. Ryokai, K., Durán López, E., Howell, N., Gillick, J., Bamman, D.: Capturing, representing, and interacting with laughter. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (2018)
26. Gillick, J., Deng, W., Ryokai, K., Bamman, D.: Robust laughter detection in noisy environments. In: INTERSPEECH (2021)
27. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
28. Gao, Y., Glowacka, D.: Deep gate recurrent neural network. In: ACCV (2016)
29. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. In: ICLR (2020)
30. Walker, J., Razavi, A., Oord, A.: Predicting video with VQVAE. arXiv preprint arXiv:2103.01950 (2021)
31. Gabbay, A., Ephrat, A., Halperin, T., Peleg, S.: Seeing through noise: visually driven speaker separation and enhancement. In: ICASSP (2018)
32. Afouras, T., Chung, J.S., Zisserman, A.: The conversation: deep audio-visual speech enhancement. In: INTERSPEECH (2020)
33. Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. In: CVPR (2018)
34. Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 252–268. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_16
35. Aytar, Y., Vondrick, C., Torralba, A.: See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932 (2017)
36. Engel, J., Agrawal, K.K., Chen, S., Gulrajani, I., Donahue, C., Roberts, A.: GANSynth: adversarial neural audio synthesis. In: ICLR (2019)
37. Zhou, H., Sun, Y., Wu, W., Loy, C.C., Wang, X., Liu, Z.: Pose-controllable talking face generation by implicitly modularized audio-visual representation. In: CVPR (2021)
38. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: CVPR (2016)
39. Zhou, H., Xu, X., Lin, D., Wang, X., Liu, Z.: Sep-stereo: visually guided stereophonic audio generation by associating source separation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 52–69. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_4
40. Dong, J., Li, X., Xu, C., Yang, X., Yang, G., Wang, X., Wang, M.: Dual encoding for video retrieval by text. IEEE TPAMI 44, 4065–4080 (2021)



41. Dufour, N., Picard, D., Kalogeiton, V.: SCAM! Transferring humans between images with semantic cross attention modulation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. LNCS, vol. 13674. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19781-9_41
42. Liang, Z., Jiang, W., Hu, H., Zhu, J.: Learning to contrast the counterfactual samples for robust visual question answering. In: EMNLP (2020)
43. Deng, D., Zhou, Y., Pi, J., Shi, B.E.: Multimodal utterance-level affect analysis using visual, audio and text features. arXiv preprint arXiv:1805.00625 (2018)
44. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: ICCV (2021)
45. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: Proceedings of ICML (2021)
46. Ryoo, M.S., Piergiovanni, A., Arnab, A., Dehghani, M., Angelova, A.: TokenLearner: what can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297 (2021)
47. Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O., Carreira, J.: Perceiver: general perception with iterative attention. In: Proceedings of ICML (2021)
48. Wei, X., Zhang, T., Li, Y., Zhang, Y., Wu, F.: Multi-modality cross attention network for image and sentence matching. In: CVPR (2020)
49. Tan, R., Plummer, B.A., Saenko, K., Jin, H., Russell, B.: Look at what I'm doing: self-supervised spatial grounding of narrations in instructional videos. In: NeurIPS (2021)
50. Lee, J.T., Jain, M., Park, H., Yun, S.: Cross-attentional audio-visual fusion for weakly-supervised action localization. In: ICLR (2020)
51. Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention bottlenecks for multimodal fusion. In: NeurIPS (2021)
52. Narasimhan, M., Rohrbach, A., Darrell, T.: CLIP-It! Language-guided video summarization. In: NeurIPS (2021)
53. Gabeur, V., Sun, C., Alahari, K., Schmid, C.: Multi-modal transformer for video retrieval. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 214–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_13
54. Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. In: ICCV (2021)
55. Epstein, D., Vondrick, C.: Learning goals from failure. In: CVPR (2021)
56. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: CVPR (2015)
57. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
58. Wang, C., Fang, H., Zhong, Y., Deng, W.: MLFW: a database for face recognition on masked faces. arXiv preprint arXiv:2109.05804 (2021)
59. Chrysos, G.G., Moschoglou, S., Bouritsas, G., Deng, J., Panagakis, Y., Zafeiriou, S.P.: Deep polynomial neural networks. IEEE TPAMI (2021)
60. Chou, H.R., Lee, J.H., Chan, Y.M., Chen, C.S.: Data-specific adaptive threshold for face recognition and authentication. In: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (2019)
61. Du, H., Shi, H., Zeng, D., Zhang, X.P., Mei, T.: The elements of end-to-end deep face recognition: a survey of recent advances. ACM Comput. Surv. (CSUR) 54, 1–42 (2020)



62. Mohla, S., Pande, S., Banerjee, B., Chaudhuri, S.: FusAtNet: dual attention based spectrospatial multimodal fusion network for hyperspectral and lidar classification. In: CVPRW (2020)
63. Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. In: CVPR (2017)
64. Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: ACCV (2016)
65. Korbar, B.: Co-training of audio and video representations from self-supervised temporal synchronization. CoRR (2018)
66. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 639–658. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_39
67. Chung, S.W., Chung, J.S., Kang, H.G.: Perfect match: improved cross-modal embeddings for audio-visual synchronisation. In: ICASSP (2019)
68. Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
69. Huber, D.M., Runstein, R.: Modern Recording Techniques. Routledge, Milton Park (2012)
70. Kalogeiton, V., Zisserman, A.: Constrained video face clustering using 1NN relations. In: BMVC (2020)
71. Brown, A., Kalogeiton, V., Zisserman, A.: Face, body, voice: video person-clustering with multiple modalities. In: ICCV (2021)
72. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
73. Wu, H.H., Seetharaman, P., Kumar, K., Bello, J.P.: Wav2CLIP: learning robust audio representations from CLIP. arXiv preprint arXiv:2110.11499 (2021)
74. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. J. Open Source Softw. 5, 5124 (2020)
75. Hinton, G., Roweis, S.: Stochastic neighbor embedding. In: NeurIPS (2002)

ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Adriano Fragomeni, Michael Wray, and Dima Damen(B)

Department of Computer Science, University of Bristol, Bristol, UK
[email protected]

Abstract. In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose the Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representation. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for the video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the proposed method.

1 Introduction

Millions of hours of video are being uploaded to online platforms every day. Leveraging this wealth of visual knowledge relies on methods that can understand the video, whilst also allowing videos to be searchable, e.g. via language. Methods can query the entire video [22, 49, 67] or the individual segments, or clips, that make up a video [14, 41]. In this work, we focus on the latter problem of clip-sentence retrieval, specifically from long untrimmed videos. This is particularly beneficial for retrieving all instances of the same step (e.g. folding dough or jacking up a car) from videos of various procedures.

In Fig. 1, we compare current clip-sentence retrieval approaches (e.g. [3, 7, 11, 39, 46, 58]) to our proposed context transformer. We leverage local temporal context clues, readily available in long videos, to improve retrieval performance. Local sequences of actions often use similar objects or include actions towards the same goal, which can enrich the embedded clip representation. We emphasise the importance of learnt local temporal clip context, that is, the few clips surrounding (i.e. before and after) the clip to be embedded. Our model, ConTra, learns to attend to relevant neighbouring clips by using a transformer encoder, differing from previous works which learn context over frames [20] or globally across the entire video [22] (see Video-Paragraph Retrieval, Fig. 1 top). We supervise ConTra with cross-modal contrastive losses and a proposed neighbouring loss that ensures the embedding is distinct across overlapping contexts.

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_27.


Fig. 1. Left: We compare video-paragraph retrieval (top) to current and proposed clip-sentence retrieval (bottom) in long videos. In ConTra, we propose to attend to local context of neighbouring clips. Right: Examples where ConTra can enrich the clip representation from next/previous clips, observing the onion (top) or that the quinoa has already been washed (bottom). Line thickness/brightness represents attention weights.

Our contributions are summarised as follows: (i) we explore the task of cross-modal clip-sentence retrieval when using local context in the clip, text, or both modalities simultaneously; (ii) we propose ConTra, a transformer-based architecture that learns to attend to local temporal context, supervised by a multi-term loss that is able to distinguish consecutive clips through the introduction of a neighbouring contrastive loss; (iii) we demonstrate the added value of local context by conducting detailed experiments on three datasets.

2 Related Works
In this section, we split video-text retrieval works into those which primarily focus on either Clip-Sentence or Video-Paragraph retrieval before presenting works that use temporal context for other video understanding tasks.
Clip-Sentence Retrieval: Most works primarily rely on two-stream (i.e. dual) approaches [3, 23, 32, 40, 42, 46, 58, 60, 65, 70], using multiple text embeddings [9, 11, 15, 43, 61], video experts [36, 37, 41], or audio [1, 2, 7, 58]. Recently, single-stream cross-modal encoders have also been used [38, 39, 52, 64, 72], improving inter-modality modelling at the cost of increased computational complexity. In ConTra, we use a dual-stream model with separate branches for the visual and the textual components. Temporal modelling of frames within a clip is a common avenue for retrieval approaches [20, 58, 72]. Gabeur et al. [20] use multiple video experts with a multimodal transformer to better capture the temporal relationships between modalities. Wang et al. [58] learn an alignment between words and frame features alongside the clip-sentence alignment. ActBert [72] also models alignment between clip and word-level features using self-supervised learning. Bain et al. [3] adapt a ViT [16] model, trained with a curriculum learning schedule, to gradually attend to more frames within each clip. MIL-NCE [40] alleviates noise within the automated captions, matching clips to neighbouring sentences. However, the learned representation does not go beyond the clip extent. VideoCLIP [65] creates positive clips by sampling both the centrepoint (within narration timestamp) and the clip's duration to better align clips and sentences, foregoing the reliance on explicit start/end times. In our work, we go beyond temporal modelling of the clip itself to using local context outside the clip.


Other works improve modelling of the textual representation [15, 46, 61]. Patrick et al. [46] introduce a generative task of cross-instance captioning to alleviate false negatives. They create a support set of relevant captions and learn to reconstruct a sample’s text representation as a weighted combination of a support-set of video representations from the batch. However, whilst they use information from other sentences, there is no notion of those which are temporally related. Instead, we propose to explore relationships between neighbouring sentences using local context within the same video. Video-Paragraph Retrieval: Another retrieval task is video-paragraph retrieval [11, 22, 34, 36, 49, 51, 58, 60, 67], where videos and paragraphs describing the full videos are embedded in their entirety. There are two main approaches: using hierarchical representations between the paragraph/video and constituent sentences/clips [22, 36, 67] or jointly modelling the entire video/paragraph with a cross-modal transformer [34, 51]. COOT [22] models the interactions between levels of granularity for each modality by using a hierarchical transformer. The video and paragraph embeddings are obtained via a combination of their clips and sentences. ClipBERT [34] inputs both text and video to a single transformer encoder after employing sparse sampling, where only a single or a few sampled clips are used at each training step. In this work, we focus on clip-sentence retrieval, but take inspiration from video-paragraph works in how they relate clips within a video. Importantly, we focus on local context, which is applicable to long videos with hundreds of clips. Temporal Context for Video Understanding: We also build on works that successfully used local temporal context for other video understanding tasks such as: action recognition [6, 30, 62, 68]; action anticipation [19, 48]; object detection and tracking [4, 5]; moment localisation [69]; and Content-Based Retrieval [50]. Bertasius and Torresani [5] use local context for mask propagation to better segment and track occluded objects. Kazakos et al. [30] use the context of neighbouring clips for action recognition using a transformer encoder along with a language model to ensure the predicted sequence of actions is realistic. Feichtenhofer et al. [62] allow for modelling features beyond the short clips. They use a non-local attention block and find that using context from up to 60 s can help recognise actions. Shao et al. [50] use a self-attention mechanism to model long-term dependencies for content-based video retrieval. They use a supervised contrastive learning method that performs automatic hard negative mining and utilises a memory bank to increase the capacity of negative samples. To the best of our knowledge, ours is the first work to explore using neighbouring clips as context for cross-modal clip-sentence retrieval in long untrimmed videos.

3 Context Transformer (ConTra)
We first explicitly define the task of clip-sentence retrieval in Sect. 3.1 before extending the definition to incorporate local clip context in untrimmed videos. We then present our clip context embedding in Sect. 3.2, where we provide details of our architecture, followed by the training losses in Sect. 3.3. We then extend ConTra to context in both modalities in Sect. 3.4. An overview of our approach can be seen in Fig. 2.
3.1 Definitions
We begin with a set of untrimmed videos, v_i ∈ V. These are broken down further into ordered clips, c_ij ∈ v_i, each with a corresponding sentence/caption, t_ij, describing the


Fig. 2. Overview of ConTra: Given a video vi split into clips cij , we encode clips into features hij , projected by P and tagged with a learnt position encoding φ relative to the centre clip. A Clip Context (CC) encoder learns an enriched representation of the centre clip (cyan arrow) attending to context clips (orange arrows). The embedding space Ω is learnt with cross-modal (CML) and neighbouring (NEI) losses. NEI pushes overlapping contexts further apart—shown for fcc (cij ) and fcc (ci(j+1) ). (Color figure online)

action within the clip. Querying by the sentence t_ij aims to retrieve the corresponding clip c_ij, and vice versa for cross-modal retrieval. Learning a dual-stream retrieval model focuses on learning two projection functions, f_c : c → Ω ⊆ R^d and f_t : t → Ω ⊆ R^d, which project the video/text modalities respectively into a common d-dimensional embedding space, Ω, where c_ij and t_ij are close. The weights of these embedding functions can be collectively trained using contrastive-based losses, including a triplet loss [47, 54–56] or noise-contrastive estimation [24, 29, 40]. Instead of using the clip solely, we wish to utilise its local temporal context to enrich the embedded representation. We define the temporal Clip Context (CC) using m, around the clip c_ij, as follows:

CC_m(c_{ij}) = (c_{i(j-m)}, \dots, c_{ij}, \dots, c_{i(j+m)})   (1)

There are 2m clips that are temporally adjacent to c_ij in the same video v_i. Note that the length of the adjacent clips governed by m differs per dataset and video. Importantly, we still aim to retrieve the corresponding sentence t_ij for the centre clip c_ij, but utilise the untrimmed nature of the video around the clip to enrich c_ij's representation. In Table 1, we differentiate between existing tasks in Sect. 2 and our proposed settings. Note that in video-paragraph retrieval, models cannot be used to retrieve individual clips. Different from the standard clip-sentence setting, we utilise neighbouring clips to enrich the clip representation, the sentence representation or both. Next, we describe our Context Transformer.
Table 1. Comparison of tasks with/without using context in video and text.
Task | Clip? | Video | Text
Video-paragraph | × | All clips | All sentences
Clip-sentence: No context | ✓ | Clip | Sentence
Clip-sentence: Clip context | ✓ | Clip+context | Sentence
Clip-sentence: Text context | ✓ | Clip | Sent.+context
Clip-sentence: Context in both | ✓ | Clip+context | Sent.+context
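For illustration, the following is a minimal sketch (not the authors' code) of how the clip-context window CC_m(c_ij) of Eq. (1) can be assembled from per-clip features, using the boundary handling described later in Sect. 4.1 (the first/last clip is duplicated when the centre clip is near the start or end of the video). Function and variable names are ours.

```python
# Illustrative sketch: building the clip-context window CC_m(c_ij) of Eq. (1)
# from the ordered clip features of one untrimmed video.
from typing import List, Sequence


def clip_context_window(clip_features: Sequence, j: int, m: int) -> List:
    """Return the 2m+1 features [c_{j-m}, ..., c_j, ..., c_{j+m}] for centre clip j.

    Indices outside the video are clamped, i.e. the first/last clip is repeated,
    so every window has a fixed length of 2m+1.
    """
    T = len(clip_features)
    window = []
    for offset in range(-m, m + 1):
        k = min(max(j + offset, 0), T - 1)  # clamp to [0, T-1]
        window.append(clip_features[k])
    return window


# Example: a video with 5 clips and m = 2; the window for the first clip
# repeats clip 0 on the left.
video = ["c0", "c1", "c2", "c3", "c4"]
print(clip_context_window(video, j=0, m=2))  # ['c0', 'c0', 'c0', 'c1', 'c2']
```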


3.2 Clip Context Transformer
We learn the embedding function f_cc : CC_m(c) → Ω, using the local clip context in Eq. 1. We consider each clip as a linear projection of its features, P(h_j). We drop the video index i here for simplicity. We learn 2m + 1 distinct positional embeddings, (φ_{-m}, ..., φ_0, ..., φ_m), that are added such that h'_{j+α} = P(h_{j+α}) + φ_{0+α}, where −m ≤ α ≤ m and φ_0 is the positional embedding of the centre clip. Note that the positional embeddings emphasise the order of the clip within the context window rather than the full video, thus reflecting the relative position of the context to the centre clip c_ij, and are identical across contexts. We showcase this on two neighbouring clips in Fig. 2. We form the input to the encoder transformer as:

H' = [h'_{j-m}, \dots, h'_j, \dots, h'_{j+m}]   (2)

From H', we aim to learn the embedding of the centre clip. We use a multi-headed attention block [53] with the standard self-attention heads and residual connections. The output of the r-th attention head is thus computed as

A_r = \sigma\left( \frac{(\theta_r^Q H')(\theta_r^K H')^{\top}}{\sqrt{d}} \right) (\theta_r^W H')   (3)

where \theta_r^Q, \theta_r^K, \theta_r^W \in \mathbb{R}^{(2m+1) \times d/R} are learnable projection matrices. The output of the multi-head attention is then calculated as the concatenation of all R heads:

A = [A_1, \dots, A_R] + H'   (4)

For the clip embedding, we focus on the output from A corresponding to the centre clip j, such that:

f_{cc}(c_{ij}) = g(A_j) + A_j   (5)

where g is one or more linear layers with ReLU activations, along with another residual connection. Note that the size of f_cc is d, independent of the context length m. f_cc can be extended with further multi-head attention layers. We discuss how we train the ConTra model next.
3.3 Training ConTra
Cross-Modal Loss. For both training and inference, we calculate the cosine similarity s(c_ij, t_kl) between the embeddings of the context-enriched clip f_cc(c_ij) and a sentence f_t(t_kl). Cross-modal losses are regularly used in retrieval works, such as the triplet loss [9, 22, 36, 42] and the Noise-Contrastive Estimation (NCE) loss [1, 7, 35, 64]. We use NCE as our cross-modal loss (L_CML) [24, 29]:

L_{CML} = \frac{1}{|B|} \sum_{(c_{ij}, t_{ij}) \in B} -\log \left( \frac{e^{s(c_{ij}, t_{ij})/\tau}}{e^{s(c_{ij}, t_{ij})/\tau} + \sum_{(c', t') \sim N'} e^{s(c', t')/\tau}} \right)   (6)

where B is a set of corresponding clip-caption pairs, i.e. (c_ij, t_ij), and τ is the temperature parameter. We construct the negative set N' in each case from the batch by combining (c_ij, t_lk)_{ij ≠ lk} as well as (c_lk, t_ij)_{ij ≠ lk}, considering negatives for both clip and sentence across elements in the batch.
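To make the clip-context encoder of Sect. 3.2 (Eqs. 2-5) concrete, below is a minimal PyTorch sketch. It assumes standard multi-head self-attention via nn.MultiheadAttention, whose internals differ slightly from the per-head formulation of Eq. (3); dimensions, initialisation and layer counts are illustrative and this is not the released implementation.

```python
# A minimal sketch of the clip-context encoder (Sect. 3.2), under simplifying
# assumptions; class and variable names are ours.
import torch
import torch.nn as nn


class ClipContextEncoder(nn.Module):
    def __init__(self, feat_dim: int, d: int = 512, m: int = 3, heads: int = 2):
        super().__init__()
        self.m = m
        self.proj = nn.Linear(feat_dim, d)                            # P(.)
        self.pos = nn.Parameter(0.001 * torch.randn(2 * m + 1, d))    # phi_{-m..m}
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.g = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, 2m+1, feat_dim) clip features ordered around the centre clip
        h = self.proj(window) + self.pos          # h' = P(h) + phi
        a, _ = self.attn(h, h, h)                 # multi-head self-attention
        a = a + h                                 # residual, as in Eq. (4)
        centre = a[:, self.m]                     # output corresponding to the centre clip
        return self.g(centre) + centre            # f_cc, as in Eq. (5)


enc = ClipContextEncoder(feat_dim=1024, d=512, m=3, heads=2)
out = enc(torch.randn(4, 7, 1024))   # 4 windows of 2*3+1 = 7 clips each
print(out.shape)                     # torch.Size([4, 512])
```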


Uniformity Loss. The uniformity loss is less regularly used, but was proposed in [57] and used in [10, 17] for image retrieval. It ensures that the embedded representations preserve maximal information, i.e. feature vectors are distributed uniformly on the unit hypersphere. We use the uniformity loss (L_UNI), defined as:

L_{UNI} = \log \left( \frac{1}{|B|} \sum_{u, u' \in U \times U} e^{-2 \| u - u' \|_2^2} \right)   (7)

where U = {c_1, t_1, ..., c_B, t_B} contains all the clips and sentences in the batch. This loss term is applied to all the clips and sentences in a batch.
Neighbouring Loss. We additionally propose a neighbour-contrasting loss (L_NEI) to ensure that the embeddings of context items are well discriminated. Indeed, one of the challenges of introducing local temporal context is the overlap between contexts of neighbouring clips. Consider two neighbouring clips in the same video, say c_ij and c_i(j+1) (see Fig. 2); the context windows CC(c_ij) and CC(c_i(j+1)) share 2m clips. While the positional encodings of the clips differ, distinguishing between the embedded neighbouring clips can be challenging. This can be considered as a special case of hard negative mining, as in [18, 27]; however, our usage of it, where only neighbouring clips are considered as negatives, is novel. Accordingly, we define L_NEI using the NCE loss:

L_{NEI} = \frac{1}{|B|} \sum_{(c_{ij}, t_{ij}) \in B} -\log \left( \frac{e^{s(c_{ij}, t_{ij})/\tau}}{e^{s(c_{ij}, t_{ij})/\tau} + e^{s(c_{i(j+\alpha)}, t_{ij})/\tau}} \right)   (8)

where α is randomly sampled from [−m, m] subject to t_ij ≠ t_i(j+α). We thus randomly sample a neighbouring clip, avoiding neighbours where the sentences are matching (e.g. the sentence "mix ingredients" might be repeated in consecutive clips). In practice, the neighbouring loss is calculated by having another batch of sampled neighbouring clips of size B. We use a single negative neighbour per clip to keep the batch size to B regardless of the length m, though we do ablate differing numbers of sampled negatives in Sect. 4.2. We optimize our ConTra model by minimizing the overall loss function L:

L = L_{CML} + L_{NEI} + L_{UNI}   (9)

We keep the weights between the three losses the same in all experiments and datasets, showcasing that we outperform other approaches without hyperparameter tuning. In the supplementary, we report results when tuning the weights, to ablate these. Once the model is trained, it can be used for both sentence-to-clip and clip-to-sentence retrieval. The clip is enriched with the context, whether used in the gallery set (in sentence-to-clip) or in the query (in clip-to-sentence). When performing sentence-to-clip retrieval, our query consists of only one sentence, as usually done in other approaches, and is thus comparable to these. During inference, the gallery of clips is always given, and thus all approaches have access to the same information.
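The following hedged sketch illustrates the three loss terms of Eqs. (6)-(9) under simplifying assumptions: L2-normalised embeddings, a common symmetric in-batch formulation of the cross-modal NCE term, and a mean rather than the paper's 1/|B| normalisation in the uniformity term. Names are ours and this is not the authors' code.

```python
# Simplified sketch of the training objective (Eqs. 6-9); assumptions noted above.
import torch
import torch.nn.functional as F

TAU = 0.07  # temperature, as in the paper


def cross_modal_nce(clip_emb, text_emb):
    """clip_emb, text_emb: (B, d) L2-normalised embeddings of matching pairs."""
    sim = clip_emb @ text_emb.t() / TAU                  # (B, B) cosine similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    # in-batch negatives for both directions (a symmetric variant of Eq. 6)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))


def neighbouring_loss(clip_emb, neigh_emb, text_emb):
    """neigh_emb holds, per element, one randomly sampled neighbouring clip (Eq. 8)."""
    pos = (clip_emb * text_emb).sum(-1) / TAU
    neg = (neigh_emb * text_emb).sum(-1) / TAU
    return -(pos - torch.logsumexp(torch.stack([pos, neg]), dim=0)).mean()


def uniformity_loss(embeddings):
    """embeddings: all clip and sentence embeddings of the batch (Eq. 7, simplified)."""
    sq_dists = torch.cdist(embeddings, embeddings).pow(2)
    return sq_dists.mul(-2).exp().mean().log()


def total_loss(clip_emb, neigh_emb, text_emb):
    # the three terms are weighted equally, as stated in the paper
    u = torch.cat([clip_emb, text_emb], dim=0)
    return (cross_modal_nce(clip_emb, text_emb)
            + neighbouring_loss(clip_emb, neigh_emb, text_emb)
            + uniformity_loss(u))


# toy usage with random, normalised embeddings
c = F.normalize(torch.randn(32, 512), dim=-1)
t = F.normalize(torch.randn(32, 512), dim=-1)
n = F.normalize(torch.randn(32, 512), dim=-1)
print(total_loss(c, n, t).item())
```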


3.4 Multi-modal Context
In previous sections (Sects. 3.1-3.3), we motivated our approach by focusing on local clip context, i.e. context in the visual modality. However, ConTra could similarly be applied to the local context of the text modality. As an example, given steps of a recipe, these can be utilised to build a Text Context (TC), such that:

TC_m(t_{ij}) = (t_{i(j-m)}, \dots, t_{ij}, \dots, t_{i(j+m)})   (10)

To give an example for clip-to-sentence retrieval, a single clip is used as the query, but the gallery is constructed of captions that have had their representation enriched via sentences of neighbouring clips (e.g. "Add the mince to the pan" has attended to "Take the beef mince out of its wrapper" and "Fry until the mince is browned"). This contextual text knowledge could come from video narrations or steps in a recipe. We also assess the utilisation of context in both modalities. This setup assumes access to local context in both clip and sentence. L_NEI is thus applied to both neighbouring clip contexts and text contexts, using one negative for each case. The architectures of f_cc and f_tc are identical, but they are learned as two separate embedding functions with unique weights and positional embeddings (we experimented with sharing these embeddings but, similar to previous approaches [28], this performed worse; see supplementary).

4 Results
We first present our experimental settings and the choice of untrimmed datasets in Sect. 4.1. We then focus on clip context results, including comparison with state-of-the-art (SOTA) methods, in Sect. 4.2, before exploring text context and context in both modalities in Sect. 4.3. Finally, we discuss limitations and avenues for future work.
4.1 Experimental Settings
Datasets. Video datasets commonly used for cross-modal retrieval can be split into two groups: trimmed and untrimmed. In trimmed datasets, such as MSRVTT [66], MSVD [8] and VATEX [59], the full video is considered as a single clip and thus no context can be utilised. In Table 2, we compare the untrimmed datasets for their size and the number of clips per video. Datasets with 1-2 clips per video on average limit the opportunity to explore long or local temporal context. While we include QuerYD [44] in the table, this dataset does not allow for context to be explored, as clips from the same video are split between the train and test sets. We choose to evaluate our method on three untrimmed datasets, whose average number of clips/video is greater than 3. We describe the notion of context in each:
YouCook2 [71] contains YouTube cooking videos. On average, training videos contain 7.75 clips, each associated with a sentence. The dataset has been evaluated for clip-sentence retrieval [23, 40, 52, 70] as well as video-paragraph retrieval [22]. We focus on clip-sentence retrieval, utilising the local context, which represents previous/follow-up steps in a recipe. Given YouCook2's popularity, we use it for all ablation experiments.
ActivityNet Captions [33] consists of annotated YouTube videos from ActivityNet [26]. The dataset has only been evaluated for video-paragraph retrieval [22, 34, 67]



Table 2. Comparing untrimmed video datasets by size and number of clips per video. *QuerYD videos are split across train and test so we report overall clips/video.
Datasets | #clips (Train / Test) | #videos (Train / Test) | #clips per video (Train / Test)
Charades-STA [21] | 5,657 / 1,596 | 5,338 / 1,334 | 1.60 / 1.20
DiDeMo [27] | 21,648 / 2,650 | 8,511 / 1,037 | 2.21 / 2.21
QuerYD* [44] | 9,118 / 1,956 | 1,283 / 794 | 8.3 (overall)
YouCook2 [71] | 10,337 / 3,492 | 1,333 / 457 | 7.75 / 7.64
ActivityNet CS [33] | 37,421 / 17,505 | 10,009 / 4,917 | 3.74 / 3.56
EPIC-KITCHENS-100 [12] | 67,217 / 9,668 | 495 / 138 | 135.79 / 70.06

where all clips in the same video are concatenated, and all corresponding captions are also concatenated to form the paragraph. Instead, we consider the val_1 split, and introduce an ActivityNet Clip-Sentence (CS) variant using all the individual clips and their corresponding captions/sentences. We emphasise that this evaluation cannot be compared to published results on video-paragraph retrieval, and we instead evaluate two methods to act as baselines for comparison.
EPIC-KITCHENS-100 [12] offers a unique opportunity to explore context in significantly longer untrimmed videos. On average, there are 135.8 clips per video of kitchen-based actions, shot from an egocentric perspective. We use the train/test splits for the multi-instance retrieval benchmark, but evaluate on the single-instance retrieval task using the set of unique narrations.
Evaluation Metrics. We report the retrieval performance, for both clip-to-sentence and sentence-to-clip tasks, using the two standard metrics of Recall at K = {1, 5, 10} (R@K) and median rank (MR). We also report the sum of cross-modal R@K as RSum to demonstrate overall performance. Where figures are plotted, tables of exact results are given in the supplementary.
Visual Features. To be comparable to prior work, we use the same features as recent methods per dataset. For YouCook2, we use the S3D backbone provided by [40] pre-trained on [42], extracting 1024-d features. We uniformly sample 32 frames from each clip with a 224 × 224 resolution. For ActivityNet CS, we use frame features provided by [67]. These frame features are combined into clip features using a single transformer, trained with shared weights across clips, as proposed in [22], obtaining 384-d features. Note that this transformer is trained for clip-sentence alignment, using the code from [22], without the global context and contextual transformer. For EPIC-KITCHENS-100, we use the publicly available 3072-d features from [31].
Text Features. For YouCook2 and EPIC-KITCHENS-100, we take a maximum of 16 words, without removing stopwords, from each sentence and extract 2048-d feature vectors using the text branch in [40] pre-trained on [42]. This consists of a linear layer with a ReLU activation applied independently to each word embedding, followed by max pooling and a randomly initialised linear layer to reduce dimensionality. We fine-tune the text branch layer to accommodate missing vocabulary (we add 174 and 104 words missing from the model in [42] for YouCook2 and EPIC-KITCHENS-100, respectively). For ActivityNet CS, we feed



the sentences into a pretrained BERT-Base Uncased model [13] and use the per-token outputs of the last 2 layers to train a sentence transformer, as in [22], and obtain 384-d text features.
Architecture Details. The number of layers and heads in the ConTra encoder differs depending on the dataset size. For the small-scale YouCook2, we use 1 layer and R = 2 heads to avoid overfitting. For the larger two datasets we use 2 layers and R = 8 heads. The inner dimension of the transformers is 2048-d. The learnt positional encoding matches the feature dimension: 512-d for YouCook2 and EPIC-KITCHENS-100, and 384-d for ActivityNet CS, initialised from N(0, 0.001). Due to the small dimension of the features of ActivityNet CS, we remove the linear projection P from our architecture. We apply dropout of 0.3 at h_j and h'_j.
Implementation Details. We use the Adam optimizer with a starting learning rate of 1 × 10^-4 and decrease it linearly after a warmup period of 1300 iterations. The batch size is fixed to 512. The temperature τ in Eqs. 6 and 8 is set to 0.07 as in [25, 45, 63], and the dimension of the common embedding space Ω is set to 512 for YouCook2 and EPIC-KITCHENS-100, and 384 for ActivityNet CS. We ablate these values in the supplementary. If the clip does not have sufficient temporal context (i.e. it is at the start/end of the video), we pad the input by duplicating the first/last clip to obtain a fixed-length context. Our code is available at https://github.com/adrianofragomeni/ConTra.
4.2 Clip Context Results
Context Length Analysis. We analyse the effect of clip context length by varying m from its no-context baseline, m = 0, to m = 5. Results are presented in Fig. 3. In all three datasets, the largest improvement is obtained comparing m = 0, i.e. no context, to m = 1, i.e. introducing the smallest context, where the RSum increases by 8.5, 17.6 and 23.2 for YouCook2, ActivityNet CS, and EPIC-KITCHENS-100, respectively. This highlights that neighbouring clips are able to improve the retrieval performance. Moreover, every m > 0 outperforms m = 0 on all datasets. We obtain the best performance on YouCook2 at m = 3. ActivityNet CS also obtains its best performance when using m = 3, and EPIC-KITCHENS-100 when m = 4. Although ConTra introduces a new hyperparameter, m, Fig. 3 shows that RSum saturates when m ≥ 3 across all datasets. Using a larger context does not further improve the performance.
Fig. 3. Analysis of clip context (CC) with differing m across YouCook2, ActivityNet CS and EPIC-KITCHENS-100.
We show the attention weights learned by the multi-headed attention layers in Fig. 4, averaged over all videos, per layer (left), and for specific examples (right). YouCook2 focuses more on the later clips due to the recipes being more recognisable when ingredients are brought together. The attention weights for EPIC-KITCHENS-100 and ActivityNet CS are higher for earlier clips, with ActivityNet's first layer and EPIC-KITCHENS-100's second layer attending to past clips. EPIC-KITCHENS-100 specifically has higher attention weights


Fig. 4. Left: average attention weights over videos as m changes, per dataset and layer. Right: qualitative examples with clip attention.

on directly neighbouring clips. From the examples in Fig. 4 (right), ConTra uses local context to discriminate objects which may be occluded, such as the chicken in YouCook2, the contents of the salad bag in EPIC-KITCHENS-100, or the brush in ActivityNet CS.
In Fig. 5, we analyse how individual words are affected by using context. On a word-by-word basis, we find all captions that contain a given word and count the number of times the rank of those captions improved/worsened after adding context. E.g., the word 'pan' in YouCook2 is present in 373 captions, of which 182 captions improve their rank with context while 126 captions worsen their rank, resulting in a delta of 56. In YouCook2, 'salt', 'sauce', and 'oil' all see a large improvement when using context, likely due to easily being obscured by their containers. In comparison, for EPIC-KITCHENS-100, verbs benefit the most from context; these actions tend to be very short, so surrounding context can help discriminate them. Overall, these results showcase that using local Clip Context (CC) enhances the clip embedding representation and consistently results in a boost in performance.
Fig. 5. The 15 most improved/hindered words per dataset when context is used. Verbs are lighter/italicised. Best viewed in colour.
Comparison with State of the Art. The most commonly used untrimmed dataset in prior work is YouCook2. For fair comparison, we split the SOTA methods on this dataset into blocks according to the pre-training and fine-tuning datasets: (i) training only on YouCook2, (ii) training only on other large-scale datasets, (iii) pre-training on large-scale datasets then fine-tuning on YouCook2 (note that this is where ConTra lies), and (iv) additionally pre-training with proxy tasks on large-scale datasets. Table 3 compares ConTra with the SOTA on YouCook2. Overall, ConTra outperforms all directly-comparable SOTA works [22, 23, 42]. ConTra outperforms COOT [22], which trains for video-paragraph retrieval on the full video. Distinct from COOT [22], we only use clip context, i.e. single sentences in training and inference, and local context.
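For reference, a small sketch of how the reported retrieval metrics (R@K, median rank MR, and RSum as the sum of the reported recalls) can be computed from a query-by-gallery cosine-similarity matrix; this is an illustration with our own names, not the authors' evaluation code.

```python
# Illustrative computation of R@K, median rank and RSum from a similarity matrix.
import numpy as np


def _gt_ranks(sim: np.ndarray) -> np.ndarray:
    """0-indexed rank of the ground-truth item for each query; sim[i, j] is the
    similarity of query i to gallery item j, and item i is the match of query i."""
    order = (-sim).argsort(axis=1)    # gallery indices, most similar first
    ranks = order.argsort(axis=1)     # rank of each gallery item, per query
    return ranks[np.arange(len(sim)), np.arange(len(sim))]


def recall_at_k(sim: np.ndarray, k: int) -> float:
    return 100.0 * float(np.mean(_gt_ranks(sim) < k))


def median_rank(sim: np.ndarray) -> float:
    return float(np.median(_gt_ranks(sim)) + 1)   # 1-indexed rank


# toy example: random similarities for 100 sentence/clip pairs
sim_s2c = np.random.rand(100, 100)   # sentence-to-clip
sim_c2s = sim_s2c.T                  # clip-to-sentence
rsum = sum(recall_at_k(s, k) for s in (sim_s2c, sim_c2s) for k in (1, 5, 10))
print(recall_at_k(sim_s2c, 5), median_rank(sim_s2c), rsum)
```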


Table 3. Sentence-to-Clip comparison with SOTA on YouCook2 test set. PX: pre-train end-to-end with proxy tasks. FT: fine-tuning on YouCook2. −: unreported results.
Method | PX | FT | R@1 | R@5 | R@10 | MR | RSum
(i) HGLMM [32] | × | ✓ | 4.6 | 14.3 | 21.6 | 75 | 40.5
(i) UniVL (FT-joint) [38] | × | ✓ | 7.7 | 23.9 | 34.7 | 21 | 66.3
(ii) ActBert [72] | ✓ | × | 9.6 | 26.7 | 38.0 | 19 | 74.3
(ii) MMV FAC [2] | × | × | 11.7 | 33.4 | 45.4 | 13 | 90.5
(ii) VATT-MBS [1] | × | × | − | − | 45.5 | 13 | −
(ii) MCN [7] | × | × | 18.1 | 35.5 | 45.2 | − | 98.8
(ii) MIL-NCE [40] | × | × | 15.1 | 38.0 | 51.2 | 10 | 104.3
(iii) HowTo100M [42] | × | ✓ | 8.2 | 24.5 | 35.3 | 24 | 68.0
(iii) GRU+SSA [23] | × | ✓ | 10.9 | 28.4 | − | − | −
(iii) COOT [22] | × | ✓ | 16.7 | 40.2 | 52.3 | 9 | 109.2
(iii) MIL-NCE (from [70]) | × | ✓ | 15.8 | 40.3 | 54.1 | 8 | 110.2
(iii) ConTra (ours) | × | ✓ | 16.7 | 42.1 | 55.2 | 8 | 114.0
(iv) CUPID [70] | ✓ | ✓ | 17.7 | 43.2 | 57.1 | 7 | 117.9
(iv) DeCEMBERT [52] | ✓ | ✓ | 17.0 | 43.8 | 59.8 | 9 | 120.6
(iv) UniVL (FT-joint) [38] | ✓ | ✓ | 22.2 | 52.2 | 66.2 | 5 | 140.6
(iv) VLM [64] | ✓ | ✓ | 27.0 | 56.9 | 69.4 | 4 | 153.3
(iv) VideoCLIP [65] | ✓ | ✓ | 32.2 | 62.6 | 75.0 | − | 169.8

Table 4. Comparison on ActivityNet CS for Clip-Sentence Retrieval. *Our reproduced results. †: transformer fine-tuned first; clip/sentence features match those from COOT [22] −g.
Method | FT | S2C (R@1 R@5 R@10 MR) | C2S (R@1 R@5 R@10 MR) | RSum
(ii) MIL-NCE [40]* | × | 2.4 6.8 10.0 460 | 2.1 6.2 9.2 543 | 36.7
(iii) HowTo100M [42]* | ✓ | 3.8 12.5 18.9 68 | 3.6 11.4 17.3 78 | 67.5
COOT [22] −g* | ✓ | 3.7 11.9 18.6 67 | 3.7 12.0 18.8 64 | 68.7
ConTra (ours) | † | 5.9 18.4 27.6 38 | 6.4 19.3 28.5 37 | 106.1
COOT [22]* | ✓ | 6.2 18.8 28.4 33 | 6.3 19.0 28.4 32 | 107.0

Table 5. Comparison with baseline on EPIC-KITCHENS-100. *Our reproduced results.
Method | FT | S2C (R@1 R@5 R@10 MR) | C2S (R@1 R@5 R@10 MR) | RSum
(ii) MIL-NCE [40]* | × | 3.2 10.4 15.4 188 | 2.1 7.8 12.3 194 | 51.2
(iii) JPoSE [61]* | ✓ | 2.5 7.5 11.6 13 | 4.4 17.4 27.2 17 | 70.6
ConTra (ours) | ✓ | 22.2 43.4 53.4 9 | 28.2 52.0 61.1 5 | 260.3

The last block of Table 3 includes works that are not directly comparable to ConTra, as these models are pre-trained end-to-end on HowTo100M with additional proxy tasks, e.g. masked language modelling, whereas ConTra is initialised randomly. Although DeCEMBERT [52] is not directly comparable, ConTra is less complex, with 9.5M parameters compared to DeCEMBERT's 115.0M, and our results are only marginally lower. To the best of our knowledge, no prior work has evaluated on ActivityNet CS for clip-sentence retrieval. For comparison, we evaluate [40, 42] on ActivityNet CS for clip-sentence retrieval using public code. We also run the code from COOT [22], trained on video-paragraph retrieval, and obtain their results on clip-sentence retrieval. We then replace the text input with sentence-level representa-


Table 6. Ablation of loss function terms: Neighbouring Loss (L_NEI), Cross-Modal Loss (L_CML), Hard Triplet Mining (L_HardMining), and Uniformity Loss (L_UNI).
Loss | S2C (R@1 R@5 R@10 MR) | C2S (R@1 R@5 R@10 MR) | RSum
L_NEI | 6.4 18.3 27.3 39 | 4.3 15.5 23.6 43 | 95.4
L_CML | 15.7 39.9 53.4 9 | 14.5 38.5 52.3 9 | 214.3
L_CML + L_HardMining | 15.7 39.8 53.5 9 | 14.2 39.0 51.9 10 | 214.1
L_CML + L_NEI | 16.2 41.4 54.1 9 | 14.8 39.3 52.6 9 | 218.4
L_CML + L_NEI + L_UNI | 16.7 42.1 55.2 8 | 14.8 40.5 53.9 9 | 223.2

tions, and remove their global alignment to produce the COOT−g baseline reported above. Table 4 shows that ConTra outperforms MIL-NCE [40], HowTo100M [42] and COOT−g by a considerable margin. Our RSum is only marginally lower than COOT, where global context is considered for both modalities during training. Note that methods that train for global context cannot be used for datasets with hundreds of clips per video, like EPIC-KITCHENS-100. In Table 5, we compare ConTra to JPoSE [61] and our reproduced results of MIL-NCE [40] on EPIC-KITCHENS-100, outperforming on all metrics by a large margin. We cannot train or evaluate COOT on EPIC-KITCHENS-100, which has 136 clips per video on average. Additionally, clips in EPIC-KITCHENS-100 are significantly shorter, increasing the benefits of attending to local context.
Fig. 6. Comparison between similarities to neighbouring clips, s(f_cc(j+1), f_s(j)), with and without using L_NEI. Without L_NEI, ConTra gives higher similarities to neighbouring clips.
Ablation Studies. We ablate ConTra on YouCook2. Ablations on the other two datasets are in the supplementary.
Loss Function. In Table 6, we first test the formulation of the two losses L_NEI and L_CML individually, using the NCE loss [1, 7, 35, 64]. On its own, L_NEI learns with a limited variety of negatives, as only neighbouring clips are used. Then, we compare L_NEI to the standard hard mining approach proposed in [18]. L_NEI consistently outperforms hard mining. Our proposed loss L, with its 3 terms, performs the best, improving RSum by 4.8 when adding the uniformity loss L_UNI, which preserves maximal information and so obtains better embeddings. We further demonstrate the benefits of L_NEI in Fig. 6. We bin the neighbouring clips j ± 1 based on their similarity to the sentence j along the x-axis. We then calculate the difference between this similarity with and without context, and provide the average and extent of these differences in a box plot over all datasets. When this difference, on the y-axis, is > 0, the context transformer would have increased the similarity between the neighbouring clip and the sentence. Without L_NEI, the similarity is increased further, particularly for clips and sentences with low cosine similarity, depicted on the x-axis.


Number of Negative Neighbouring Clips. Table 7 shows how the performance changes when we consider more than one negative for our neighbouring loss L_NEI. Increasing the negatives from 1 to 2 improves the retrieval results marginally. RSum remains the same when increasing the number of negatives further to 3. We keep the number of negatives equal to 1 in all the experiments.
Table 7. Analysis of the performance when varying the number of negatives in L_NEI.
#Negatives | S2C (R@1 R@5 R@10 MR) | C2S (R@1 R@5 R@10 MR) | RSum
1 | 16.7 42.1 55.2 8 | 14.8 40.5 53.9 9 | 223.2
2 | 16.9 42.2 55.7 8 | 15.6 40.5 53.9 9 | 224.8
3 | 17.3 42.4 55.6 8 | 14.9 40.7 53.9 9 | 224.8
Aggregate Context. As explained in Sect. 3.2, we select the middle output of the transformer encoder as our clip embedding. In order to justify this design choice, we compare to other aggregation approaches. These are of two types, based on where local context is aggregated: over the visual features h_j or over the outputs of the clip transformer encoder f_cc. Moreover, we experimented with two aggregation techniques, Maximum and Average. Table 8 shows that aggregating features performs poorly. Moreover, using the middle output outperforms the other two aggregation techniques, as the model is enriching the embedding of the anchor clip from its contextual neighbours.

Table 8. Comparing aggregation approaches.
h_j | f_cc | S2C (R@1 R@5 R@10 MR) | C2S (R@1 R@5 R@10 MR) | RSum
Avg | − | 4.9 16.3 26.3 40 | 4.3 16.0 25.8 41 | 93.6
Max | − | 3.1 13.1 22.0 46 | 3.4 13.4 21.7 49 | 76.7
− | Avg | 15.6 40.1 53.9 9 | 14.0 38.6 51.6 10 | 213.8
− | Max | 16.9 41.7 54.8 8 | 14.7 40.1 53.7 9 | 221.9
− | Mid | 16.7 42.1 55.2 8 | 14.8 40.5 53.9 9 | 223.2
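The aggregation variants of Table 8 amount to different reductions over the window dimension; a tiny illustrative snippet (tensor shapes assumed, names ours) follows.

```python
# Illustrative reductions over the context window dimension, as compared in Table 8.
import torch

m = 3
outputs = torch.randn(4, 2 * m + 1, 512)   # encoder outputs for 4 windows of 7 clips

avg_pooled = outputs.mean(dim=1)           # "Avg" over the f_cc outputs
max_pooled = outputs.max(dim=1).values     # "Max" over the f_cc outputs
centre = outputs[:, m]                     # "Mid": the centre-clip output used by ConTra
print(avg_pooled.shape, max_pooled.shape, centre.shape)
```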

4.3 Results of Modality Context
In Figs. 7 and 8 we provide a comparable analysis as m increases from no context (m = 0) up to m = 5, for text context (Fig. 7) and context in both modalities (Fig. 8). Results consistently demonstrate context to be helpful in both cases; m = 1 outperforms m = 0 by a large margin in every case. For some datasets, e.g. ActivityNet CS, performance saturates and drops slightly for m > 3. For long videos, e.g. EPIC-KITCHENS-100, performance continues to improve with larger context.
In Fig. 9 we show qualitative examples from models trained with clip context and with context in both modalities. Clip context (left) shows that additional visual context helps. For example, in row 1, the previous clip includes the potatoes before being mashed, a more recognisable shape compared to their mashed state. When using context in both modalities (right), these benefits are combined, leading to the model discriminating between difficult examples in which very similar clips are described.
Limitations. We find that local context is certainly beneficial for clip-sentence retrieval, but also acknowledge there are cases in which it is detrimental. An example of this is in Fig. 9 (row 3, left, "add water"), where context drops the rank of the correct clip from 4 to 15. Studying the correct video, we note that neighbouring clips are not


Fig. 7. Analysis of temporal text context (TC), reporting RSum in S2C.

Fig. 8. Analysis of temporal context in both modalities (BC), reporting RSum in S2C.

Fig. 9. Qualitative clip-to-sentence results for clip context (left) and context in both modalities (right) from 3 datasets: YouCook2 (top), ActivityNet CS (middle), and EPIC-KITCHENS-100 (bottom). The change in rank of retrieved video from no context (cyan) to using context (orange), e.g. from rank 9 to rank 4 when using context. (Color figure online)

always related to the main action, i.e. "turn on the cooker". The enriched clip representation is thus less similar to the query sentence. In Fig. 5, we also show specific words harmed by clip context.

5 Conclusions
In this work, we introduce the notion of local temporal context for clip-sentence cross-modal retrieval in long videos. We propose an attention-based deep encoder, which we term Context Transformer (ConTra), that is trained using contrastive losses in the embedding space. We demonstrate the impact of ConTra on individual modalities as well as on both modalities in cross-modal retrieval. We ablate our method to further show the benefit of each component. Our results indicate, both qualitatively and by comparing to other approaches, that local context decreases the ambiguity in clip-sentence retrieval on three video datasets.


Acknowledgements. This work used public datasets and was supported by EPSRC UMPIRE (EP/T004991/1) and Visual AI (EP/T028572/1).

References 1. Akbari, H., et al.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Conference on Neural Information Processing Systems (NeurIPS) (2021) 2. Alayrac, J., et al.: Self-supervised multimodal versatile networks. In: Conference on Neural Information Processing Systems (NeurIPS) (2020) 3. Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. In: International Conference on Computer Vision (ICCV) (2021) 4. Beery, S., Wu, G., Rathod, V., Votel, R., Huang, J.: Context R-CNN: long term temporal context for per-camera object detection. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 5. Bertasius, G., Torresani, L.: Classifying, segmenting, and tracking object instances in video with mask propagation. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 6. Cartas, A., Radeva, P., Dimiccoli, M.: Modeling long-term interactions to enhance action recognition. In: International Conference on Pattern Recognition (ICPR) (2021) 7. Chen, B., et al.: Multimodal clustering networks for self-supervised learning from unlabeled videos. In: International Conference on Computer Vision (ICCV) (2021) 8. Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Association for Computational Linguistics (ACL/IJCNLP) (2011) 9. Chen, S., Zhao, Y., Jin, Q., Wu, Q.: Fine-grained video-text retrieval with hierarchical graph reasoning. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 10. Chun, S., Oh, S.J., de Rezende, R.S., Kalantidis, Y., Larlus, D.: Probabilistic embeddings for cross-modal retrieval. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 11. Croitoru, I., et al.: TeachText: crossmodal generalized distillation for text-video retrieval. In: International Conference on Computer Vision (ICCV) (2021) 12. Damen, D., et al.: Rescaling egocentric vision: collection, pipeline and challenges for epickitchens-100. Int. J. Comput. Vision (IJCV) 130, 33–55 (2021) 13. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (NAACL-HLT) (2019) 14. Dong, J., Li, X., Snoek, C.G.: Word2visualvec: Image and video to sentence matching by visual feature prediction. CoRR, abs/1604.06838 (2016) 15. Dong, J., et al.: Dual encoding for video retrieval by text. Trans. Pattern Anal. Mach. Intell. (TPAMI) 44, 4065–4080 (2021) 16. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021) 17. El-Nouby, A., Neverova, N., Laptev, I., J´egou, H.: Training vision transformers for image retrieval. CoRR (2021) 18. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: British Machine Vision Conference (BMVC) (2018)


19. Furnari, A., Farinella, G.M.: What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMS and modality attention. In: International Conference on Computer Vision (ICCV) (2019) 20. Gabeur, V., Sun, C., Alahari, K., Schmid, C.: Multi-modal transformer for video retrieval. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 214–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8 13 21. Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: International Conference on Computer Vision (ICCV) (2017) 22. Ging, S., Zolfaghari, M., Pirsiavash, H., Brox, T.: COOT: cooperative hierarchical transformer for video-text representation learning. In: Conference on Neural Information Processing Systems (NeurIPS) (2020) 23. Guo, X., Guo, X., Lu, Y.: SSAN: separable self-attention network for video representation learning. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 24. Gutmann, M., Hyv¨arinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, 1–55 (2012) 25. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 26. Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 27. Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.C.: Localizing moments in video with temporal language. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2018) 28. Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: general perception with iterative attention. In: International Conference on Machine Learning (ICML) (2021) 29. J´ozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the limits of language modeling. CoRR, abs/1602.02410 (2016) 30. Kazakos, E., Huh, J., Nagrani, A., Zisserman, A., Damen, D.: With a little help from my temporal context: Multimodal egocentric action recognition. In: British Machine Vision Conference (BMVC) (2021) 31. Kazakos, E., Nagrani, A., Zisserman, A., Damen, D.: Epic-fusion: audio-visual temporal binding for egocentric action recognition. In: International Conference on Computer Vision (ICCV) (2019) 32. Klein, B., Lev, G., Sadeh, G., Wolf, L.: Associating neural word embeddings with deep image representations using fisher vectors. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 33. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: International Conference on Computer Vision (ICCV) (2017) 34. Lei, J., et al.: Less is more: Clipbert for video-and-language learning via sparse sampling. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 35. Liu, S., Fan, H., Qian, S., Chen, Y., Ding, W., Wang, Z.: Hit: hierarchical transformer with momentum contrast for video-text retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021) 36. Liu, Y., Albanie, S., Nagrani, A., Zisserman, A.: Use what you have: Video retrieval using representations from collaborative experts. In: British Machine Vision Conference (BMVC) (2019) 37. 
Liu, Y., Chen, Q., Albanie, S.: Adaptive cross-modal prototypes for cross-domain visual-language retrieval. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021)


38. Luo, H., et al.: UNIVILM: a unified video and language pre-training model for multimodal understanding and generation. CoRR, abs/2002.06353 (2020) 39. Miech, A., Alayrac, J., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: Efficient text-to-visual retrieval with transformers. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 40. Miech, A., Alayrac, J., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 41. Miech, A., Laptev, I., Sivic, J.: Learning a text-video embedding from incomplete and heterogeneous data. CoRR, abs/1804.02516 (2018) 42. Miech, A., Zhukov, D., Alayrac, J., Tapaswi, M., Laptev, I., Sivic, J.: How to 100 m: learning a text-video embedding by watching hundred million narrated video clips. In: International Conference on Computer Vision (ICCV) (2019) 43. Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: International Conference on Multimedia Retrieval (ICMR) (2018) 44. Oncescu, A., Henriques, J.F., Liu, Y., Zisserman, A., Albanie, S.: QUERYD: a video dataset with high-quality text and audio narrations. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021) 45. Patrick, M., et al.: Multi-modal self-supervision from generalized data transformations. In: International Conference on Computer Vision (ICCV) (2021) 46. Patrick, M., et al.: Support-set bottlenecks for video-text representation learning. In: International Conference on Learning Representations (ICLR) (2021) 47. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 48. Sener, F., Singhania, D., Yao, A.: Temporal aggregate representations for long-range video understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12361, pp. 154–171. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058517-4 10 49. Shao, D., Xiong, Yu., Zhao, Y., Huang, Q., Qiao, Yu., Lin, D.: Find and focus: retrieve and localize video events with natural language queries. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 202–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3 13 50. Shao, J., Wen, X., Zhao, B., Xue, X.: Temporal context aggregation for video retrieval with contrastive learning. In: Winter Conference on Applications of Computer Vision (WACV) (2021) 51. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: A joint model for video and language representation learning. In: International Conference on Computer Vision (ICCV) (2019) 52. Tang, Z., Lei, J., Bansal, M.: DecemBERT: learning from noisy instructional videos via dense captions and entropy minimization. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (2021) 53. Vaswani, A., et al.: Attention is all you need. In: Conference on Neural Information Processing Systems (NeurIPS) (2017) 54. Wang, J., et al.: Learning fine-grained image similarity with deep ranking. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2014) 55. 
Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. Trans. Pattern Anal. Mach. Intell. (TPAMI) 41, 394–407 (2018) 56. Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)


57. Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning (ICML) (2020) 58. Wang, X., Zhu, L., Yang, Y.: T2VLAD: global-local sequence alignment for text-video retrieval. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 59. Wang, X., Wu, J., Chen, J., Li, L., Wang, Y., Wang, W.Y.: VATEX: a large-scale, highquality multilingual dataset for video-and-language research. In: International Conference on Computer Vision (ICCV) (2019) 60. Wei, J., Xu, X., Yang, Y., Ji, Y., Wang, Z., Shen, H.T.: Universal weighting metric learning for cross-modal matching. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 61. Wray, M., Csurka, G., Larlus, D., Damen, D.: Fine-grained action retrieval through multiple parts-of-speech embeddings. In: International Conference on Computer Vision (ICCV) (2019) 62. Wu, C., Feichtenhofer, C., Fan, H., He, K., Kr¨ahenb¨uhl, P., Girshick, R.B.: Long-term feature banks for detailed video understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 63. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 64. Xu, H., et al.: VLM: task-agnostic video-language model pre-training for video understanding. In: Association for Computational Linguistics (ACL/IJCNLP) (2021) 65. Xu, H., et al.: VideoCLIP: contrastive pre-training for zero-shot video-text understanding. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021) 66. Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 67. Zhang, B., Hu, H., Sha, F.: Cross-modal and hierarchical modeling of video and text. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 385–401. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8 23 68. Zhang, C., Gupta, A., Zisserman, A.: Temporal query networks for fine-grained video understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 69. Zhang, Z., Han, X., Song, X., Yan, Y., Nie, L.: Multi-modal interaction graph convolutional network for temporal language localization in videos. IEEE Trans. Image Process. 30, 8265– 8277 (2021) 70. Zhou, L., Liu, J., Cheng, Y., Gan, Z., Zhang, L.: CUPID: adaptive curation of pre-training data for video-and-language representation learning. CoRR, abs/2104.00285 (2021) 71. Zhou, L., Xu, C., Corso, J.J.: Towards automatic learning of procedures from web instructional videos. In: Conference on Artificial Intelligence (AAAI) (2018) 72. Zhu, L., Yang, Y.: ActBERT: learning global-local video-text representations. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

A Compressive Prior Guided Mask Predictive Coding Approach for Video Analysis
Zhimeng Huang, Chuanmin Jia, Shanshe Wang, and Siwei Ma

National Engineering Research Center of Visual Technology, Peking University, Beijing 100871, China; Wangxuan Institute of Computer Technology, Peking University, Beijing 100871, China. [email protected]
Abstract. In real-world scenarios, video analysis algorithms are conducted on visual signals after compression and transmission. Generally speaking, most codecs introduce irreversible distortion due to coarse quantization during compression. The distortion may lead to significant perception degradation in terms of video analysis performance. To tackle this problem, we propose an efficient plug-and-play approach to explicitly preserve the essential semantic information in video sequences. The proposed approach can boost video analysis performance with little extra bit cost. Specifically, we employ the proposed approach on an emerging video analysis task, video object segmentation (VOS). Extensive experimental results show that our work outperforms existing coding approaches over multiple VOS datasets. Concretely, it can improve the analysis performance by up to 13% at similar bitrates. Additional experiments also verify the flexibility of our scheme, as there is no dependency on any specific VOS model or encoding method. Essentially, the proposed approach provides novel insights for the emerging Video Coding for Machines (VCM) standard.

1 Introduction

In recent years, video has become the dominant component of internet traffic. Considering the data volume of video big data, it is necessary to develop highly efficient video compression from an analysis-friendly perspective. However, in earlier studies [1], the target of video compression is simply to preserve signal fidelity.
Supported in part by the National Natural Science Foundation of China under grants 62072008, 62025101, 61931014, 62101007, and in part by the High Performance Computing Platform of Peking University, which are gratefully acknowledged.



Fig. 1. The comparison between the paradigm of vision tasks in real-world applications and that in our proposed approach. Our work uses an additional bitstream to significantly improve the VOS performance. (a) Typical VOS paradigm in existing solutions; (b) VOS paradigm of the proposed approach.

Methods following this target tend to ignore other useful information in the compressed domain, which may cause severe performance degradation in video analysis tasks. To discuss this problem, we regard video object segmentation, a recently popular video analysis task, as a representative of video analysis tasks. Semi-supervised VOS targets segmenting particular objects throughout the entire video sequence, given only the object mask of the first frame. Existing state-of-the-art (SOTA) solutions [2–4] achieve high accuracy by fully utilizing the semantic information of the input videos. Nevertheless, the performance of these highly efficient methods decreases sharply when dealing with video sequences with compression distortions, especially at low bitrates. Therefore, it is critical to boost the performance of existing video analysis methods on compressed videos.
For better comprehension, we summarize two kinds of paradigms for VOS. As shown in Fig. 1a, the VOS algorithm is conducted after signal capture, signal encoding, bitstream transmission, and bitstream decoding. In other words, VOS is more like a downstream task for video compression. To avoid analytical performance degradation caused by video compression, we propose a compressive prior guided mask predictive coding approach. The proposed scheme is shown in Fig. 1b. Different from the traditional one, the proposed work could improve


the analysis performance through an additional bitstream. Following the proposed approach, VOS methods can be deployed on the raw videos directly (we omit encoding distortions caused by signal capturing tools such as cameras and assume that the video sequences in the dataset are all pristine) rather than on compressed video sequences with severe semantic distortion. Specifically, our framework can be divided into three parts: Motion Estimation (ME), Feature Compression (FC), and Motion Comprehension (MC). For the first part, we utilize the existing masks as a reference to extract a motion feature of the current mask. Then the motion feature is compressed by an end-to-end autoencoder for transmission. Finally, the compressed motion feature and the reference masks are fed into a convolutional neural network (CNN) based comprehension network to generate the predicted masks (a high-level sketch of this pipeline is given at the end of this section). Experiments are conducted on two VOS baseline models and three standard VOS datasets. The consistently superior performance over the baseline codec demonstrates the effectiveness and generality of the proposed framework. The contributions are summarized as follows:
– We propose a novel approach to improve VOS performance on compressed video sequences. To the best of our knowledge, this is the first work that jointly considers the bitrate and the corresponding VOS performance.
– The proposed framework is highly efficient, generalizable, and flexible. It can be transferred to any VOS method or dataset without any fine-tuning.
– Experimental results show that the proposed framework outperforms traditional codecs over two VOS models and three VOS datasets. Exhaustive experiments also demonstrate the robustness of our framework. Our method provides novel possibilities for Video Coding for Machines (VCM) research.
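As referenced above, the following is a high-level structural sketch, under our own assumptions about layer shapes, of the three modules (ME, FC, MC); it is not the authors' architecture and, for brevity, FC omits the quantisation and entropy coding a real codec would perform.

```python
# Structural sketch (assumptions, not the paper's implementation) of the ME/FC/MC pipeline.
import torch
import torch.nn as nn


class MotionEstimation(nn.Module):
    """Extracts a motion feature v_{t,k} from the current and reference binary masks."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(channels, channels, 3, stride=2, padding=1))

    def forward(self, cur_mask, ref_mask):
        return self.net(torch.cat([cur_mask, ref_mask], dim=1))


class FeatureCompression(nn.Module):
    """Placeholder autoencoder standing in for the end-to-end compression of v_{t,k}."""
    def __init__(self, channels: int = 32, bottleneck: int = 8):
        super().__init__()
        self.enc = nn.Conv2d(channels, bottleneck, 1)
        self.dec = nn.Conv2d(bottleneck, channels, 1)

    def forward(self, v):
        return self.dec(self.enc(v))   # a real codec would also quantise and entropy-code


class MotionComprehension(nn.Module):
    """CNN that predicts the current mask from the compressed motion feature and the reference."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, v_hat, ref_mask):
        v_up = nn.functional.interpolate(v_hat, size=ref_mask.shape[-2:])
        return torch.sigmoid(self.net(torch.cat([v_up, ref_mask], dim=1)))


ref, cur = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
me, fc, mc = MotionEstimation(), FeatureCompression(), MotionComprehension()
pred = mc(fc(me(cur, ref)), ref)
print(pred.shape)   # torch.Size([1, 1, 64, 64])
```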

2 Related Works

2.1 Video Object Segmentation (VOS)

VOS has two sub-tasks: semi-supervised VOS and unsupervised VOS. The difference between them is whether an initial mask is provided. In this paper, we mainly consider the former. Current semi-supervised VOS methods fall into one of two categories. Following the first category, VOS methods [5–7] generate the masks by fine-tuning on the provided initial mask. Regarding the second category, these approaches [8–10] propagate the mask from the previous frame using optical flow and then refine these estimates using a fully convolutional network. However, both categories of VOS methods face difficulties when dealing with compressed video sequences due to the damage to discriminative features caused by quantization error.
Moreover, some VOS approaches utilize compressed videos as extra supplementary information for better accuracy or efficiency. To increase VOS performance, several algorithms [11, 12] convert the input video sequences into the compressed domain. In [13], sparse motion vectors are utilized for object

Note that we omit the encoding distortions caused by signal capturing tools such as cameras. Namely, we assume that the video sequences pre-processed in the dataset are all pristine videos.

472

Z. Huang et al.

To realize highly efficient processing, several works [14] use the bitstream of compressed video sequences to accelerate existing VOS methods. A plug-and-play acceleration framework is proposed in [15] by propagating the motion vectors extracted from the HEVC [16] bitstream. However, existing approaches for accuracy and efficiency pay little attention to a critical problem: the bitrate of the compressed videos and the associated VOS performance. For example, if the bandwidth for compressed videos were unlimited, common codecs would allow almost lossless compression at an extremely high bitrate, which is unrealistic in practical scenarios. Therefore, in this paper we regard bitrate as another dimension of the discussion, which distinguishes this work from the methods mentioned in this subsection.

2.2 Image/Video Compression for Vision Tasks

Most of the existing compression frameworks for visual tasks aim at image tasks such as image classification, object detection, and semantic segmentation. Therefore, we introduce some image compression methods that can improve analysis performance for vision tasks. Benefiting from emerging neural image/video compression approaches [17–19], most existing machine-vision-oriented methods mainly utilize a task-related learning objective to optimize the entire framework [20–23]. Chamain et al. [24] formulate a detection loss to optimize an existing end-to-end image compression framework. Moreover, a content-adaptive end-to-end compression approach [25] is proposed for the instance segmentation task. However, these methods rely heavily on specific analytical models and datasets, and are therefore less effective in practical application scenarios. To overcome this problem, we aim to design a framework that is effective across different datasets and analytical algorithms.

3 Methods

The proposed method is elaborated in this section. We first present a schematic illustration of our method together with terminology definitions. Subsequently, detailed descriptions of each module are provided and analyzed. The notations and preliminary concepts are shown in Table 1.

3.1 Overview

The overall flowchart of the proposed approach is illustrated in Fig. 2. Denote $X$ as the original video sequence with length $T$. For each frame $X_t \in X$ at time step $t$, we generate the corresponding VOS mask $M_t = F(X_t)$ with the VOS model $F$. After all of the masks are generated, we obtain the mask sequence $M = \{M_1, M_2, \ldots, M_t, \ldots\}$. Note that the first mask $M_1$ is provided by the semi-supervised VOS task. For every other mask $M_t$ at time step $t$ ($t \in [2, T]$), we deconstruct it into $N_t$ binary masks, where $N_t$ indicates the number of objects in $M_t$.


Table 1. Notations and descriptions of the proposed scheme

Notation                            Description
X = {X_1, X_2, ..., X_t, ...}       a sequence of video frames with time step t
M = {M_1, M_2, ..., M_t, ...}       a sequence of masks
M_t                                 original mask at t
N_t                                 number of objects at t
m_{t,k}                             binary mask of object k at t
v_{t,k}                             motion feature of object k at t
\hat{v}_{t,k}                       compressed motion feature of object k at t
\hat{m}_{t,k}                       the prediction of m_{t,k}
P_t                                 the prediction of M_t
z_{t,k}                             bitstream of v_{t,k}
b_{t,k}                             bitrate of compressed v_{t,k}
B_t                                 bitrate of compressed motion features at t
F                                   baseline VOS method
\Theta_\tau                         trainable parameters of module \tau

Denote the binary map of the $k$-th object in $M_t$ as $m_{t,k}$, which is formulated as
$$ m_{t,k}(i,j) = \begin{cases} 1, & M_t(i,j) = k \\ 0, & \text{otherwise,} \end{cases} \qquad (1) $$
in which $(i,j)$ represents the coordinate of each pixel. The reason for this split is that objects with different labels would disturb the motion estimation of each other; the related analysis is conducted in Sect. 4.4. Note that $m_{t-1,k}$ is not available at the decoder, so for each binary mask $m_{t,k}$ we utilize the previously reconstructed binary mask $\hat m_{t-1,k}$ as the reference mask to ensure encoder-decoder consistency. When $t = 2$, the reference mask is $m_{1,k}$, given by the labeled data. The binary mask $m_{t,k}$ and its reference are then fed into the ME module to estimate the changes from time step $t-1$ to $t$. The changes are represented by a motion feature $v_{t,k} = ME(m_{t,k}, \hat m_{t-1,k})$. After that, we encode the motion feature into a more compact representation $z_{t,k}$ by FC for transmission. We use $b_{t,k}$, the size of $z_{t,k}$, to evaluate the bit cost of compressing each motion feature. Given all of the $b_{t,k}$ at time step $t$, the bit cost for the entire mask, $B_t$, is calculated by
$$ B_t = \sum_{k=1}^{N_t} b_{t,k}. \qquad (2) $$
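As a small illustration of Eqs. (1) and (2), the sketch below splits an integer-labeled mask into per-object binary masks and sums per-object bitstream sizes. It is a minimal sketch under the assumption of an integer-labeled mask with background id 0; the helper names are ours and not from the paper's code.

```python
import numpy as np

def split_into_binary_masks(mask: np.ndarray, num_objects: int):
    """Eq. (1): turn an integer-labeled mask M_t into N_t binary masks m_{t,k}."""
    # mask: (H, W) array with object ids in {0, 1, ..., N_t}, where 0 is background.
    return [(mask == k).astype(np.uint8) for k in range(1, num_objects + 1)]

def total_bit_cost(bitstreams):
    """Eq. (2): B_t is the sum of the per-object bitstream sizes b_{t,k} (in bits)."""
    return sum(8 * len(z) for z in bitstreams)  # len(z) counts bytes of each bitstream

# toy usage
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 1:3] = 1   # object 1
mask[0:2, 4:6] = 2   # object 2
binary_masks = split_into_binary_masks(mask, num_objects=2)
assert binary_masks[0].sum() == 4 and binary_masks[1].sum() == 4
```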

The compressed motion feature $\hat v_{t,k}$, together with the reference mask $\hat m_{t-1,k}$, is utilized by MC to generate the predicted binary mask $\hat m_{t,k}$. Specifically, the MC module consists of two parts. First, $\hat v_{t,k}$ is used to warp $\hat m_{t-1,k}$ and obtain a coarse prediction of $m_{t,k}$. Then the warped mask is further refined by a CNN-based network. Finally, we merge all of the predicted binary masks into the predicted mask $P_t$ at time step $t$ according to Eq. (3).


Fig. 2. The overall flowchart of the proposed approach. ME, FC, and MC denote the Motion Estimation, Feature Compression, and Motion Comprehension modules, respectively. First, the mask $M_t$ is generated by a VOS model. The mask is then split into several binary masks. For each binary mask $m_{t,k}$, we utilize the corresponding predicted mask $\hat m_{t-1,k}$ as the reference mask to extract the motion feature by ME. The motion feature $v_{t,k}$ is then compressed by the FC module for transmission and decompression. After decompression, the compressed motion feature $\hat v_{t,k}$, together with $\hat m_{t-1,k}$, is utilized by MC to predict the binary mask $\hat m_{t,k}$. Finally, all of the binary masks are merged into the final reconstructed mask $P_t$. Note that we color some binary masks (including $m_{t,k}$, $\hat m_{t-1,k}$, and $\hat m_{t,k}$) for better comprehension.

$$ P_t(i,j) = \begin{cases} k, & \hat m_{t,k}(i,j) = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (3) $$
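For completeness, the merge step of Eq. (3) can be sketched as follows. This is a toy illustration; the equation leaves overlap handling implicit, and here we simply let higher object ids overwrite lower ones.

```python
import numpy as np

def merge_binary_masks(binary_masks):
    """Eq. (3): combine predicted binary masks \\hat m_{t,k} into one labeled mask P_t."""
    h, w = binary_masks[0].shape
    merged = np.zeros((h, w), dtype=np.uint8)
    for k, m in enumerate(binary_masks, start=1):
        merged[m == 1] = k  # later (higher-id) objects overwrite earlier ones on overlap
    return merged
```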

3.2 Detailed Architecture

ME. In our proposed framework, we employ a masked CNN-based optical flow estimation approach [26] to estimate the motion between temporally adjacent binary masks. A visualization of the ME module is shown in Fig. 3. The split binary mask $m_{t,k} \in \{0,1\}^{W \times H}$ is fed into the optical flow estimation model to generate $o_{t,k} \in \mathbb{R}^{W \times H \times 2}$, which denotes the motion displacement between $\hat m_{t-1,k}$ and $m_{t,k}$. Then $o_{t,k}$ is element-wise multiplied with $\hat m_{t-1,k}$ to generate the motion feature $v_{t,k} \in \mathbb{R}^{W \times H \times 2}$. Compared to directly deploying the estimated flow, this element-wise multiplication has two advantages. First, the multiplication removes the disturbance from other regions; note that the optical flow responses in the black (background) region are estimation noise. Second, the multiplication makes the motion feature easier to compress, which means the bit cost of the proposed approach decreases.


Fig. 3. Network of the ME module (the object mask and the reference mask are fed into optical flow estimation, and the estimated optical flow is element-wise multiplied with the reference mask to produce the motion feature). The example is the video "swan" in DAVIS 2017. The optical flow estimation module follows the implementation in [26].

Compared to the optical flow or motion vectors in traditional codecs, the proposed ME module can be optimized end-to-end with more flexibility. FC. Since the motion feature extracted by the ME module cannot be transmitted directly, we deploy a modified hyperprior-guided autoencoder proposed in [27], which was originally designed to compress three-channel images; we therefore modify the shapes of its inputs and outputs to match those of $v_{t,k}$. The outputs of FC are the compressed motion feature $\hat v_{t,k}$, the bitstream $z_{t,k}$, and the size of the bitstream $b_{t,k}$. The architecture of the FC module is shown in Fig. 4. Every $v_{t,k} \in \mathbb{R}^{W \times H \times 2}$ is compressed into $z^1 \in \mathbb{R}^{\frac{W}{16} \times \frac{H}{16} \times 96}$ and $z^2 \in \mathbb{R}^{\frac{W}{64} \times \frac{H}{64} \times 64}$, where $z^1$ denotes the representation of $v_{t,k}$ and $z^2$ represents the parameters of the distribution used to recover $\hat v_{t,k} \in \mathbb{R}^{W \times H \times 2}$ (strictly $z^1_{t,k}$ and $z^2_{t,k}$, but we drop the subscript $(t,k)$ for simplicity). Compared to traditional compression methods, the proposed FC module is more efficient, because the trainable FC module can be optimized to fit the signal distribution of $v_{t,k}$, which differs from that of natural images. MC. The MC module is deployed to generate the predicted binary mask $\hat m_{t,k}$ from the compressed motion feature $\hat v_{t,k}$ and the reference mask $\hat m_{t-1,k}$. The flowchart of MC is shown in Fig. 5. First, we warp the reference binary mask $\hat m_{t-1,k}$ with the compressed motion feature $\hat v_{t,k}$. However, the warped mask is far from an accurate prediction, so we deploy a CNN-based refinement network after the warp operation. The network takes the reference mask and the warped mask as inputs to generate the final predicted binary mask. Limited by the length of the paper, more comparisons between the warped masks before and after refinement are shown in the supplementary materials.

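The masking step of ME and the coarse warping step of MC can be sketched as below. This is a minimal PyTorch sketch under our own assumptions: the flow tensor stores (x, y) displacements in its two channels, and backward bilinear warping via `grid_sample` stands in for whatever warp convention the authors use; the function names are ours.

```python
import torch
import torch.nn.functional as F

def masked_motion_feature(flow: torch.Tensor, ref_mask: torch.Tensor) -> torch.Tensor:
    """ME output: element-wise multiply the estimated flow (B, 2, H, W)
    with the reference binary mask (B, 1, H, W), zeroing background noise."""
    return flow * ref_mask

def warp_mask(ref_mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Coarse MC prediction: warp the reference mask with the (compressed) motion feature."""
    b, _, h, w = ref_mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(b, -1, -1, -1)  # (B, H, W, 2)
    coords = base + flow.permute(0, 2, 3, 1)          # displaced sampling locations
    # normalize coordinates to [-1, 1] as required by grid_sample
    coords[..., 0] = 2.0 * coords[..., 0] / max(w - 1, 1) - 1.0
    coords[..., 1] = 2.0 * coords[..., 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(ref_mask, coords, mode="bilinear", align_corners=True)

# toy usage: zero flow leaves the mask unchanged
mask = torch.zeros(1, 1, 8, 8); mask[:, :, 2:5, 2:5] = 1.0
flow = torch.zeros(1, 2, 8, 8)
assert torch.allclose(warp_mask(mask, masked_motion_feature(flow, mask)), mask, atol=1e-5)
```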


Fig. 4. Network architecture of the FC module. The notations on the blocks denote the hyper-parameters; for example, "k3c64s2" indicates a convolution or deconvolution layer with a kernel size of 3, 64 channels, and a stride of 2. The blocks in green represent convolution layers, and the blocks in blue denote deconvolution layers. The negative slope of all LeakyReLU layers is set to 0.1. (Color figure online)
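To make the "kXcYsZ" notation concrete, the sketch below pairs an analysis transform with a synthesis transform for a 2-channel motion feature. It is only a sketch in the spirit of the FC module: the hyperprior branch of [27] and the entropy model are omitted, straight-through rounding stands in for quantization and coding, and the exact layer configuration is our assumption rather than the paper's specification.

```python
import torch
import torch.nn as nn

class FCSketch(nn.Module):
    """Toy analysis/synthesis pair for a 2-channel motion feature v_{t,k}."""
    def __init__(self):
        super().__init__()
        self.analysis = nn.Sequential(                       # W x H x 2 -> W/16 x H/16 x 96
            nn.Conv2d(2, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 96, 5, stride=2, padding=2),
        )
        self.synthesis = nn.Sequential(                      # back to W x H x 2
            nn.ConvTranspose2d(96, 64, 5, stride=2, padding=2, output_padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 64, 5, stride=2, padding=2, output_padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 64, 5, stride=2, padding=2, output_padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 2, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, v):
        z1 = self.analysis(v)                                 # latent z^1 (entropy coding omitted)
        z1_hat = z1 + (torch.round(z1) - z1).detach()         # straight-through rounding stand-in
        return self.synthesis(z1_hat), z1_hat

v = torch.randn(1, 2, 64, 64)
v_hat, z1 = FCSketch()(v)
assert v_hat.shape == v.shape and z1.shape == (1, 96, 4, 4)
```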

4 Experimentation

4.1 Training Details

Loss Function. The goal of the proposed framework is to balance the accuracy of the predicted masks against the bits required for transmission. Therefore, the optimization problem at time $t$ is formulated as
$$ L_t = \alpha R_t + \lambda D_t = \alpha \sum_{k=1}^{N_t} H(\hat v_{t,k}) + \lambda \sum_{k=1}^{N_t} d(m_{t,k}, \hat m_{t,k}), \qquad (4) $$

in which $d(m_{t,k}, \hat m_{t,k})$ denotes the distortion between $m_{t,k}$ and $\hat m_{t,k}$; in practice, we use the mean squared error (MSE) in our experiments. $H(\cdot)$ represents the number of bits used to compress the motion feature. Bitrate estimation is itself a complicated problem in end-to-end image compression, but it is beyond the scope of this work, so we directly adopt the hyperprior-based entropy model [27], denoted by the function $H(\cdot)$. $\lambda$ is the Lagrange multiplier that adjusts the trade-off between $R_t$ and $D_t$, and $\alpha$ is a binary parameter that switches the rate term $R_t$ on or off.
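As a concrete illustration of Eq. (4), a minimal sketch of the rate-distortion objective is given below. The bit estimate `bit_estimates` stands in for the hyperprior entropy model of [27]; its name and the per-object list batching are our assumptions.

```python
import torch
import torch.nn.functional as F

def rd_loss(pred_masks, gt_masks, bit_estimates, lam: float, alpha: float):
    """Eq. (4): L_t = alpha * sum_k H(v_hat_{t,k}) + lambda * sum_k d(m_{t,k}, m_hat_{t,k}).
    pred_masks / gt_masks: lists of (1, H, W) tensors, one per object k.
    bit_estimates: list of scalar tensors, the estimated bits per compressed motion feature."""
    rate = torch.stack(bit_estimates).sum()
    distortion = sum(F.mse_loss(p, g, reduction="mean") for p, g in zip(pred_masks, gt_masks))
    return alpha * rate + lam * distortion

# Stage I in Table 2 uses alpha = 0 (distortion only); Stages II-III use alpha = 1.
```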


Fig. 5. Network architecture of the MC module (the reference mask is warped with the compressed motion feature, and the warped result is refined by a ResBlock-based network to produce the predicted mask). The notations on the blocks follow the rules given in Fig. 4. All of the ResBlocks share the structure shown at the bottom right corner of the figure.

Table 2. Detailed training configuration

Stage      Components trained  α  λ     Learning rate  Batch size  Epochs
Stage I    ME, MC              0  1     1e-3           6           40
Stage II   FC                  1  1e-5  1e-3           4           50
Stage III  FC, ME, MC          1  1e-5  1e-3           2           200

Training Details. The training configuration of each stage is shown in Table 2. The optical flow estimation network in the ME module is initialized with a pre-trained model (from https://github.com/zacjiang/GMA). The optimizer is AdamW, with weight decay and ε set to 5e-5 and 1e-8, respectively. The final stage is optimized for up to 200 epochs. The entire scheme is implemented in PyTorch 1.11.0 with CUDA 11.3, and the simulation environment is Ubuntu 18.04 with one NVIDIA 3080 Ti graphics card. We train distinct models for different bitrate points to achieve optimal performance.
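The staged schedule can be expressed compactly as below; this is only a restatement of Table 2 as a configuration sketch, and the dictionary keys are ours.

```python
# Three-stage schedule from Table 2; alpha toggles the rate term in Eq. (4).
STAGES = [
    {"name": "I",   "trained": ["ME", "MC"],       "alpha": 0, "lam": 1.0,  "lr": 1e-3, "batch": 6, "epochs": 40},
    {"name": "II",  "trained": ["FC"],             "alpha": 1, "lam": 1e-5, "lr": 1e-3, "batch": 4, "epochs": 50},
    {"name": "III", "trained": ["FC", "ME", "MC"], "alpha": 1, "lam": 1e-5, "lr": 1e-3, "batch": 2, "epochs": 200},
]
```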

4.2 Experimental Settings

Datasets and Preparation. The experiments are conducted on three VOS benchmarks: DAVIS 2016 [28], DAVIS 2017 [29], and YouTube-VOS [30]. DAVIS 2016 and DAVIS 2017 are small datasets with 50 and 120 video sequences, respectively, while YouTube-VOS is a large-scale dataset with 3945 video sequences. All video sequences are converted from RGB to the YUV420 format, which is the color space widely employed by traditional codecs such as x265 and VVEnc (VVEnc [31] is a lightweight implementation of the reference software of VVC [32]). For the tests on all three datasets, we use the same model trained on the DAVIS 2017 dataset. Note that our framework does not require enormous amounts of training video: only 60 video sequences are utilized. This configuration shows that our work is simple and robust on unseen VOS datasets.



Table 3. Experimental results on DAVIS 2016 and DAVIS 2017

Dataset          VOS model  Method           Bitrate↓  Jm↑     Fm↑     (J&F)m↑
DAVIS 2016 [28]  AOT [33]   Original         –         0.9014  0.9217  0.9117
                            x265 (baseline)  0.0198    0.7944  0.8118  0.8031
                            x265+Ours        0.0168    0.8728  0.9044  0.8886
                 STCN [34]  Original         –         0.9042  0.9305  0.9172
                            x265 (baseline)  0.0198    0.7517  0.7848  0.7683
                            x265+Ours        0.0158    0.8814  0.9148  0.8981
DAVIS 2017 [29]  AOT [33]   Original         –         0.8251  0.8791  0.8521
                            x265 (baseline)  0.0209    0.7165  0.7608  0.7386
                            x265+Ours        0.0178    0.8056  0.8652  0.8354
                 STCN [34]  Original         –         0.8200  0.8862  0.8531
                            x265 (baseline)  0.0209    0.6734  0.7161  0.6947
                            x265+Ours        0.0158    0.7890  0.8641  0.8265

Metrics. For the accuracy on VOS, we follows the standard criteria from [28]: Jaccard Index J and F-scores, which represent the region similarity and contour accuracy, respectively. Additionally, we also report some detailed performance for each dataset: {Mean↑, Recall↑, Decay↓}x{J ,F} for DAVIS 2017 and {Seen, Unseen}x{J ,F} for YouTube-VOS. The experimental results of these metrics are provided in supplementary materials In addition to the traditional VOS evaluation metrics, we have added the evaluation metric: bitrate. Bitrate denotes the compression ratio of codecs. Generally speaking, the method with less bitrate is better given the same performance. Specifically, we utilize bits-per-pixel (bpp) as the evaluation of bitrate in the experiments. As the proposed approach is an additional module to existing codecs. Thus the bitrate of our work is the sum of it and the baseline codec for a fair comparison. Base Video Codecs. We choose two conventional codecs (x265 and VVEnc) as the baselines. Considering the efficiency and effectiveness, we choose the x265 library in FFmpeg with veryfast preset for main experiments. And VVEnc is utilized to compress videos for the supplementary experiments. The compressed sequences will be shared to encourage others to research compressed video segmentation. Base VOS Models. We choose the AOT and STCN as the base model in our framework.5 AOT employs a Long Short-Term Transformer to construct hierarchical matching and propagation. STCN combines the computational advantages of temporal convolutional networks with the representational power and robustness of stochastic latent spaces. 4.3

Experimental Results

In this subsection, the experiments are deployed with different VOS models and codecs for comparison. Please refer to supplementary material for more details 5

AOT and STCN are representative VOS models with codes and models available.

A Compressive Prior Guided Mask Predictive Coding Approach

479

Table 4. Experimental results on YouTube-VOS dataset. VOS model Method

Bitrate↓ Js ↑

Fs ↑

Ju ↑

Fu ↑

G↑

AOT [33]

Original – x265(baseline) 0.0159 x265+Ours 0.0121

0.8387 0.7990 0.8880 0.8848 0.8526 0.7966 0.8429 0.7530 0.8450 0.8094 0.8265 0.8811 0.7782 0.8742 0.8400

STCN [34]

Original – x265(baseline) 0.0159 x265+Ours 0.0122

0.8259 0.8695 0.7946 0.8772 0.8418 0.7867 0.8264 0.7372 0.8200 0.7867 0.8149 0.8634 0.7726 0.8656 0.8291

and commands about the x265/VVEnc settings. Furthermore, the concrete VOS models are also provided in it. The performance of the proposed framework compared to x265 codec is reported in Table 3 and Table 4. On DAVIS 2016, our work achieves 8% and 13% accuracy(J &F) improvement for AOT and STCN with 35% and 25% bitrate saving, respectively. On DAVIS 2017, our work achieves 10% and 12% accuracy(J &F) improvement for AOT and STCN with 14% and 24% bitrate saving, respectively. On YouTube-VOS, we achieve 3.1% and 4.3% G(G denotes the score calculated by the evaluation server6 ) improvement for AOT and STCN with 20% and 23% bitrate saving. The performance improvement over all of the VOS models and datasets reveals the generality of our framework. Rate-Performance Curves. We report the Rate-Performance (RP) curves in Fig. 6. The QP set of both x265 and x265+Ours are {32, 37, 42, 47}. Taking DAVIS 2016+AOT in Fig. 6(a) as an example, the line in orange denotes the RP curve of video sequences compressed by x265 codec without the proposed framework. The line in blue indicates the RP curve of video sequences compressed by x265 with the proposed framework. From these curves, several conclusions could be drawn. First of all, the proposed framework outperforms x265 at every bitrate. Secondly, the accuracy loss of our method is minimal compared to the results on the original videos. Last but not the least, the proposed framework could achieve remarkable performance for different datasets and VOS methods. 4.4

Ablation Studies and Modular Analysis

FC Module. We utilize a hyperprior guided auto-encoder model to compress the underlying motion feature. Actually, it is also reasonable to deploy a traditional codec to encode the feature. Thus we investigate the efficiency of FC module by using x265 codec for comparison. As shown in Table 5, experimental results prove that the proposed FC module could efficiently compress the motion vector. It is because that the learned parameters in FC module were optimized to fit the distribution of motion features during training. MC Module. In our framework, we propose a CNN-based MC model to refine the predicted mask after the warp operation. Another alternative is to utilize the 6

https://competitions.codalab.org/competitions/20127#participate-submit results.

480

Z. Huang et al.

(a) D16+AOT

(b) D16+STCN

(c) D17+AOT

(d) D17+STCN

(e) YTB+AOT

(f) YTB+STCN

Fig. 6. Rate-Performance curves of our approach. D16, D17, and YTB represent DAVIS 2016, DAVIS 2017 and YouTube-VOS, respectively. Original denotes that the videos are uncompressed.

warped mask directly as the prediction results. Thus we conduct a set of ablation study to verify the effectiveness of the MC module. From the visualization in Table 6, it is obvious that the propose MC module significantly improve the quality of the predicted mask. Then the experimental results shows that the (J &F)m of the approach without MC module will drop about 30% on DAVIS2017 dataset. More Codecs. We conduct experiments on more codecs to verify the generality of our framework. The experimental results on are shown in Table 7. Although VVEnc is the most efficient conventional codecs, the proposed framework still achieve 9.1% improvement on DAVIS 2017 with 32% bits saving. More Bitrates. By adjusting the λ in the loss function (λ = 10−a , a = 2, 3, 4, 5), the proposed framework are trained for different bitrates. To valid the performance at different bitrates, we conduct experiments on DAVIS 2017. Note that the bitrate is calculated without that of codecs. As shown in Table 8, with the

A Compressive Prior Guided Mask Predictive Coding Approach

481

Table 5. Ablation study for FC module on DAVIS 2017 dataset. Method

Bitrate↓ Jm ↑

Fm ↑

(J &F )m ↑

x265+Ours(x265) 0.0209

0.8052 0.8655 0.8353

x265+Ours(FC)

0.8056 0.8652 0.8354

0.0178

Table 6. Ablation study for MC module on DAVIS 2017 dataset. Bitrate↓ Jm ↑

Method

Fm ↑

(J &F )m ↑

x265+Ours(Warp) 0.0178

0.4725 0.5358 0.5041

x265+Ours(MC)

0.8056 0.8652 0.8354

0.0178

increase of the bitrate, the difference compared to the analysis performance of the original videos become smaller and smaller. This experimental result demonstrates that our model could be adapted to different application scenarios by adjusting the bitrate. Direct Comparison to Codecs. Rethinking the paradigm illustrated in Fig. 1b, it is feasible to utilize existing compression codecs to compress the VOS masks to replace our approach. Therefore we conduct a comparative experiment. We utilize x265 as the mask codec to compress the masks generated by AOT on DAVIS 2017. The experimental results are shown in Table 9. It is obvious that the masks encoded by x265 are not accurate for VOS anymore. Then main reason to the degradation is the noise caused by codec at extremely low bitrate.

5

Discussions

Extensibility. The proposed framework can be easily extended to other maskbased vision tasks. Mask based vision tasks denotes tasks where the output is a mask related to the semantics of the video such as video instance segmentation and multi object tracking. To run our work on other tasks, simply replace the VOS model with the model for the corresponding task. Robustness. As reported in Table 8, the proposed framework could utilize an extremely low bitrate to achieve 95% performance compared to that on original video sequences. It proves that our work could handle practical scenarios with low network. Generality. The generality of the proposed framework is manifested in two aspects. Firstly, the proposed framework does not rely on any VOS methods. Table 7. Experimental results for VVEnc on DAVIS 2017 dataset. Method

Bitrate↓ Jm ↑

Original

-

Fm ↑

(J &F )m ↑

0.8251 0.8791 0.8521

VVEnc(baseline) 0.0179

0.7235 0.7646 0.7440

VVEnc+Ours

0.8056 0.8652 0.8354

0.0121

482

Z. Huang et al. Table 8. Experimental results for different λ(λ = 10−a ) a

Bitrate↓ Jm ↑

Fm ↑

(J &F )m ↑

Original -

0.8251 0.8791 0.8521

2

0.0262

0.8174 0.8706 0.8440

3

0.0162

0.8135 0.8690 0.8413

4

0.0069

0.8056 0.8652 0.8354

5

0.0023

0.7752 0.8502 0.8127

Table 9. Experimental results for direct comparison to codecs Method Bitrate↓ Jm ↑

Fm ↑

(J &F )m ↑

Original –

0.8251 0.8791 0.8521

x265

0.0072

0.5856 0.6488 0.6172

Ours

0.0069

0.8056 0.8652 0.8354

Therefore, our framework can also be adapted to new VOS methods with higher performance. Secondly, the input of the proposed frameworks are masks rather than objects, which means the category information of the objects are not utilized in our work. Thus it could be easily transferred to other datasets without fine-tuning.

6 Conclusion

In this paper, we propose a simple and effective framework to improve the performance of semi-supervised VOS on compressed video sequences by preserving semantic information during the compression procedure. Such a framework effectively diminishes the degradation of VOS performance caused by lossy compression. Moreover, the proposed framework is plug-and-play, meaning that it does not rely on a specific VOS method or a specific codec. Concretely, it improves the analysis performance by up to 13% at similar bitrates. As VCM has become an emerging research topic, we provide a promising solution for efficient and high-performance compressed-domain VOS.

References 1. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the h. 264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13, 560–576 (2003) 2. Luiten, J., Voigtlaender, P., Leibe, B.: PReMVOS: proposal-generation, refinement and merging for video object segmentation. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 565–580. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7 35

A Compressive Prior Guided Mask Predictive Coding Approach

483

3. Maninis, K.K., et al.: Video object segmentation without temporal information. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1515–1530 (2018) 4. Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using spacetime memory networks. In: IEEE International Conference on Computer Vision, pp. 9226–9235 (2019) 5. Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taix´e, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 221–230 (2017) 6. Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., Felsberg, M.: Learning fast and robust target models for video object segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7406–7415 (2020) 7. Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for the 2017 davis challenge on video object segmentation. In: The 2017 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, vol. 5 (2017) 8. Li, X., Loy, C.C.: Video object segmentation with joint re-identification and attention-aware mask propagation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 93–110. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9 6 9. Li, X., et al.: Video object segmentation with re-identification. arXiv preprint arXiv:1708.00197 (2017) 10. Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2663–2672 (2017) 11. Jamrozik, M.L., Hayes, M.H.: A compressed domain video object segmentation system. In: International Conference on Image Processing, vol. 1, I-I. IEEE (2002) 12. Porikli, F., Bashir, F., Sun, H.: Compressed domain video object segmentation. IEEE Trans. Circuits Syst. Video Technol. 20, 2–14 (2009) 13. Babu, R.V., Ramakrishnan, K., Srinivasan, S.: Video object segmentation: a compressed domain approach. IEEE Trans. Circuits Syst. Video Technol. 14, 462–474 (2004) 14. Tan, Z., et al.: Real time video object segmentation in compressed domain. IEEE Trans. Circuits Syst. Video Technol. 31, 175–188 (2020) 15. Xu, K., Yao, A.: Accelerating video object segmentation with compressed video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1342–1351 (2022) 16. Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 22, 1649–1668 (2012) 17. Lu, G., Zhang, X., Ouyang, W., Chen, L., Gao, Z., Xu, D.: An end-to-end learning framework for video compression. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3292–3308 (2020) 18. Hu, Z., Lu, G., Xu, D.: FVC: a new framework towards deep video compression in feature space. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1502–1511 (2021) 19. Li, J., Li, B., Lu, Y.: Deep contextual video compression. Adv. Neural. Inf. Process. Syst. 34, 18114–18125 (2021) 20. Chen, Z., He, T.: Learning based facial image compression with semantic fidelity metric. Neurocomputing 338, 16–25 (2019) 21. Wang, S., et al.: Towards analysis-friendly face representation with scalable feature and texture compression. IEEE Trans. Multimedia 24, 3169–3181 (2021)

484

Z. Huang et al.

22. Le, N., Zhang, H., Cricri, F., Ghaznavi-Youvalari, R., Rahtu, E.: Image coding for machines: an end-to-end learned approach. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1590–1594 (2021) 23. Huang, Z., Jia, C., Wang, S., Ma, S.: HMFVC: a human-machine friendly video compression scheme. IEEE Trans. Circuits Syst. Video Technol. (2022) 24. Chamain, L.D., Racap´e, F., B´egaint, J., Pushparaja, A., Feltman, S.: End-to-end optimized image compression for machines, a study. In: IEEE Data Compression Conference, pp. 163–172 (2021) 25. Le, N., Zhang, H., Cricri, F., Ghaznavi-Youvalari, R., Tavakoli, H.R., Rahtu, E.: Learned image coding for machines: a content-adaptive approach. In: IEEE International Conference on Multimedia and Expo, pp. 1–6 (2021) 26. Jiang, S., Campbell, D., Lu, Y., Li, H., Hartley, R.: Learning to estimate hidden motions with global motion aggregation. In: International Conference on Computer Vision, pp. 9772–9781 (2021) 27. Minnen, D., Ball´e, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, vol. 31 (2018) 28. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., SorkineHornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016) 29. Pont-Tuset, J., et al.: The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017) 30. Xu, N., et al.: YouTube-VOS: a large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327 (2018) 31. Wieckowski, A., et al.: VVENC: an open and optimized VVC encoder implementation. In: IEEE International Conference on Multimedia & Expo Workshops, pp. 1–2. IEEE (2021) 32. Bross, B., Chen, J., Ohm, J.R., Sullivan, G.J., Wang, Y.K.: Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 109, 1463–1493 (2021) 33. Yang, Z., Wei, Y., Yang, Y.: Associating objects with transformers for video object segmentation. Adv. Neural. Inf. Process. Syst. 34, 2491–2502 (2021) 34. Aksan, E., Hilliges, O.: STCN: stochastic temporal convolutional networks. In: International Conference on Learning Representations (2019)

BaSSL: Boundary-aware Self-Supervised Learning for Video Scene Segmentation Jonghwan Mun1(B) , Minchul Shin1 , Gunsoo Han1 , Sangho Lee2 , Seongsu Ha2 , Joonseok Lee2(B) , and Eun-Sol Kim3(B) 1

2

Kakao Brain, Seongnam, South Korea {jason.mun,craig.starr,coco.han}@kakaobrain.com Graduate School of Data Science, Seoul National University, Seoul, South Korea {sangho.lee,sha17,joonseok}@snu.ac.kr 3 Department of Computer Science, Hanyang University, Seoul, South Korea [email protected] Abstract. Self-supervised learning has drawn attention through its effectiveness in learning in-domain representations with no ground-truth annotations; in particular, it is shown that properly designed pretext tasks bring significant performance gains for downstream tasks. Inspired from this, we tackle video scene segmentation, which is a task of temporally localizing scene boundaries in a long video, with a self-supervised learning framework where we mainly focus on designing effective pretext tasks. In our framework, given a long video, we adopt a sliding window scheme; from a sequence of shots in each window, we discover a moment with a maximum semantic transition and leverage it as pseudoboundary to facilitate the pre-training. Specifically, we introduce three novel boundary-aware pretext tasks: 1) Shot-Scene Matching (SSM), 2) Contextual Group Matching (CGM) and 3) Pseudo-boundary Prediction (PP); SSM and CGM guide the model to maximize intra-scene similarity and inter-scene discrimination by capturing contextual relation between shots while PP encourages the model to identify transitional moments. We perform an extensive analysis to validate effectiveness of our method and achieve the new state-of-the-art on the MovieNet-SSeg benchmark. The code is available at https://github.com/kakaobrain/bassl. Keywords: Video scene segmentation

1

· Self-supervised learning

Introduction

Understanding long videos such as movies, for an AI system, has been viewed as an extremely challenging task [1]. In contrast, for humans, as studies in cognitive science [49] tell us it is naturally achieved by breaking down a video into J. Mun and M. Shin—Equal contribution.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_29. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 485–501, 2023. https://doi.org/10.1007/978-3-031-26316-3_29

486

J. Mun et al.

Fig. 1. Examples of the video scene segmentation. In each row, we visualize the shots including similar visual cues (e.g., characters, places, etc.) with the same colored border.

meaningful units (e.g., event) and reasoning about these units and their relation [42]. From this point of view, dividing a long video into a series of shorter temporal segments can be considered as an essential step towards the high-level video understanding. Motivated by this, in this paper, we tackle the video scene segmentation task, temporally localizing scene boundaries from a long video; the term scene is widely used in filmmaking and scene (a series of semantically cohesive shots) is considered as a basic unit for understanding the story of movies. One of the biggest challenges in video scene segmentation is that it is not achieved simply by detecting changes in visual cues. As shown in Fig. 1(a), we present an example of nine shots, all of which belong to the same scene, where two characters are talking on the phone; the overall visual cues within the scene do not stay the same but rather change repeatedly when each character appears. On the other hand, Fig. 1(b) shows two different scenes which contain visually similar shots (highlighted in blue) where the same character appears in the same place. Thus, it is expected that two adjacent scenes which share shots with similar visual cues need to be contextually discriminated. From this observation, it is important for the video scene segmentation task to model contextual relationship between shots by maximizing 1) intra-scene similarity (i.e., the shots in the same scene should be close to each other) and 2) inter-scene discrimination across two adjacent scenes (i.e., shots across the scene boundary should be distinguishable). Supervised learning approaches (e.g., [34]) are clearly limited due to the lack of large-scale datasets with reliable ground-truth annotations; in addition, collecting boundary annotations from long videos is extremely expensive. Recently, self-supervision [5,9,17,37] is spotlighted through its effectiveness in learning in-domain representation without relying on costly ground-truth annotations. The self-supervised learning methods [11,14,33] in the video domain have been proposed to learn spatio-temporal patterns in a short term; inspired by this, ShotCoL [8] proposed shot-level representation pre-training algorithm based on contrastive prediction task. Although ShotCoL shows the remarkable performance, such shot-level representation learned without being aware of the semantic transition is insufficient for video scene segmentation. This is because the task requires not only a good representation for individual shots but also contextual representation considering neighboring shots at a higher level as observed in Fig. 1. Thus, we set our main goal to design effective pre-text tasks for video scene segmentation so that the model can learn the contextual relationship between shots across semantic transition during pre-training.

BaSSL

487

We introduce a novel Boundary-aware Self-Supervised Learning (BaSSL) framework where we learn boundary-aware contextualized representation effective in capturing semantic transtion during pre-training and adapt the learned representation for precise scene boundary detection through fine-tuning. The main idea during pre-training is identifying a moment with a maximum semantic transition and using it as psuedo-boundary. Then, we propose three boundaryaware pretext tasks that are beneficial to the video scene segmentation task as follows: 1) Shot-Scene Matching (SSM) matching shots with their associated scenes, 2) Contextual Group Matching (CGM) aligning shots whether they belong to the same scene or not and 3) Pseudo-boundary Prediction (PP) capturing semantic changes. SSM and CGM encourage the model to maximize intrascene similarity and inter-scene discrimination, while PP enables the model to learn the capability of identifying transitional moments. In addition, we perform Masked Shot Modeling task inspired by CBT [46] to further learn temporal relationship between shots. The comprehensive analysis demonstrates the effectiveness of the boundary-aware pre-training compared to shot-level pre-training as well as the contribution of the individual proposed components (i.e., pseudoboundary discovery algorithm and boundary-aware pretext tasks). Our main contributions are summarized as follows: (i) we introduce a novel boundary-aware pre-training framework which leverages pseudo-boundaries to learn contextual relationship between shots during the pre-training; (ii) we propose three boundary-aware pretext tasks, which are carefully designed to learn essential capabilities required for the video scene segmentation task; (iii) we perform extensive ablations to demonstrate the effectiveness of the proposed framework; (iv) we achieve the new state-of-the-art on the MovieNet-SSeg benchmark with large margins compared to existing methods.

2

Related Work

Video Scene Segmentation approaches formulate the task as a problem of temporal grouping of shots. In this formulation, the optimal grouping can be achieved by clustering-based [7,35,36,40], dynamic programming-based [16,39, 48] or multi-modal input-based [30,43] methods. However, the aforementioned methods have been trained and evaluated on small-scale datasets such as OVSD [38] and BBC [3] which can produce a poorly generalized model. Recently, [19] introduce a large-scale video scene segmentation dataset (i.e., MovieNet-SSeg) that contains hundreds of movies. Training with large-scale data, [34] proposes a strong supervised baseline model that performs a shot-level binary classification followed by grouping using the prediction scores. [8] proposes a shot contrastive pre-training method that learns shot-level representation. We found ShotCoL [8] to be the most similar to our method. However, our method is different from ShotCoL in that we focus on learning contextual representations by considering the relationship between shots through boundary-aware pre-text tasks. Action Segmentation in Videos is one of the related works for video scene segmentation, which identifies action labels of individual frames, thus can divide a video into a series of action segments. Supervised methods [13,24]

488

J. Mun et al.

proposed CNN-based architectures to effectively capture temporal relationship between frames in order to address an over-segmentation issue. As framelevel annotations are prohibitively costly to acquire, weakly supervised methods [6,15,26,27,41,44,59] have been suggested to use an ordered list of actions occurring in a video as supervision. Most of the methods are trained to find (temporal) semantic alignment between frames and a given action list using an HMM-based architecture [21], a DP-based assignment algorithm [15] or a DTW-based temporal alignment method [6]. Recently, unsupervised methods [22,23,28,51,54] have been further proposed; in a nutshell, clustering-based prototypes (corresponding to one of the actions) are discovered from unlabeled videos, then the methods segment the videos by assigning prototypes into frames. Contrary to action segmentation localizing segments each of which represents a single action within an activity, video scene segmentation requires localizing more complex segments each of which may be composed of more than two actions (or activities). Self-supervised Learning in Videos has been actively studied for the recent years with approaches proposing various pretext tasks such as future frame prediction [2,45,52], temporal ordering of frames [25,31,55], geometric transformations prediction [20], colorization of videos [53], multimodal correspondence [57] and contrastive prediction [11,14,33]. In addition, CBT [46,47] proposes a pretext task of masked frame modeling to learn temporal dependency between frames (or clips). Note that since most of those methods are proposed for the classification task, they would be sub-optimal to the video scene segmentation task. On the other hand, BSP [56] proposes a pre-training algorithm based on pseudo-boundary synthesis for temporal localization tasks. However, the method still requires video-level class labels to synthesize pseudo-boundaries thus is not applicable to videos such as movies that are hard to define semantic labels. Also, note that we empirically show that pseudo-boundaries identified by our method are more effective for pre-training than synthesized pseudo-boundaries.

3

Preliminary

Terminologies. A long video (e.g., documentaries, TV episodes and movies) is assumed to have a hierarchical structure at three-level semantics: frame, shot and scene. A shot is a series of frames physically captured by the same camera during an uninterrupted period of time. A scene is a series of semantically cohesive shots and serves as a semantic unit for making a story. Note that, in this paper, our focus is on finding scene-level boundaries. Video Scene Segmentation Task. Given a long video, which contains a series of N shots {s1 , ..., sN } with class labels {y1 , ..., yN } where yi ∈ {0, 1} indicating if it is the last shot of a scene, the video scene segmentation task is formulated as a simple binary classification task at an individual shot level. Leveraging the local context from the neighbor shots, existing methods [8,34] adopt a sliding window scheme. For nth shot sn , the window is defined by Sn = {sn−K , ..., sn , ..., sn+K }

BaSSL

489

Fig. 2. Overall pipeline of our proposed framework, BaSSL.

containing a sequence of 2K + 1 shots where K is the number of neighbor shots before and after sn . Then, supervised learning methods typically train a parameterized (θ) model by maximizing the expected log-likelihood: θ∗ = arg max E [log pθ (yn |Sn )] . θ

(1)

Note that each shot s is given by a set of Nk key-frames, resulting in a tensor with size of (Nk , C, H, W ) where C, H and W are the RGB channels, the height and the width, respectively. Model Architecture. The model (θ) consists of two main components: 1) shot encoder embedding a shot by capturing its spatio-temporal patterns, and 2) contextual relation network (CRN) capturing contextual relation between shots. Taking a window Sn = {sn−K , ..., sn , ..., sn+K } centered at sn as an input, twolevel representations are extracted as follows: en = fENC (sn ) and Cn = fCRN (En ),

(2)

where fENC : RNk ×C×H×W → RDe and fCRN : R(2K+1)×De → R(2K+1)×Dc represent the shot encoder and CRN while De and Dc mean dimensions of encoded and contextualized features, respectively. en is an encoding of shot sn by fENC while En = {en−K , ..., en , ..., en+K } and Cn = {cn−K , ..., cn , ..., cn+K } correspond to the input and output feature sequence for fCRN , respectively. In addition, the model employs additional pretext-specific heads for pre-training or a scene boundary detection head for fine-tuning. Shot-level Self-supervised Learning. ShotCoL [8] proposes a shot-level contrastive self-supervised learning algorithm for video scene segmentation, which learns to make representation of visually similar nearby shots—highly likely to belong to the same scene—similar. However, the method has following two limitations. First, since ShotCoL pre-trains a model without explicitly identifying semantic boundaries during pre-training, it may fail to properly maximize intrascene similarity and inter-scene dissimilarity. For example, the visually similar shots in different scenes may be learned indistinguishable. Second, the method learns shot representation given by the shot encoder (fENC ) only and does not learn temporal and contextual relation between shots given by the contextual relation network (fCRN ). Contrary to such shot-level self-supervised learning, we propose boundary-aware self-supervised learning that trains both fENC and fCRN

490

J. Mun et al.

Fig. 3. An example in each row shows an input window sampled from the same scene where there exists no ground-truth scene-level boundary. Our method finds a pseudoboundary shot (highlighted in red) that divides a sequence into two pseudo-scenes (represented by green and orange bars, respectively) so that semantics (e.g., places, characters) maximally changes. (Color figure online)

while capturing the desired contextual relation between shots across semantic change. More detailed comparisons are given in appendix.
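To make the two-level representation of Eq. (2) concrete, a minimal sketch of a shot encoder and a contextual relation network is given below. The layer sizes, the tiny CNN standing in for the paper's ResNet-50 encoder, and the use of a vanilla TransformerEncoder are our assumptions, not the exact configuration used by the authors.

```python
import torch
import torch.nn as nn

class TwoLevelModel(nn.Module):
    """e_n = f_ENC(s_n); {c_i} = f_CRN({e_i}) over a window of 2K+1 shots."""
    def __init__(self, d_enc=256, d_ctx=256, nhead=8, num_layers=2):
        super().__init__()
        # f_ENC: a tiny CNN over key-frames, pooled over space, then averaged over key-frames.
        self.shot_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_enc, 3, stride=2, padding=1), nn.AdaptiveAvgPool2d(1),
        )
        # f_CRN: a Transformer encoder over the sequence of shot embeddings.
        layer = nn.TransformerEncoderLayer(d_model=d_ctx, nhead=nhead, batch_first=True)
        self.crn = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, window):                 # window: (B, 2K+1, N_k, 3, H, W)
        b, s, nk, c, h, w = window.shape
        x = window.reshape(b * s * nk, c, h, w)
        e = self.shot_encoder(x).flatten(1)            # per key-frame embedding
        e = e.reshape(b, s, nk, -1).mean(dim=2)        # average key-frames -> e_n, (B, 2K+1, D_e)
        ctx = self.crn(e)                              # contextualized c_n, (B, 2K+1, D_c)
        return e, ctx

window = torch.randn(2, 17, 3, 3, 64, 64)  # B=2, 2K+1=17 shots, N_k=3 key-frames each
e, ctx = TwoLevelModel()(window)
assert e.shape == (2, 17, 256) and ctx.shape == (2, 17, 256)
```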

4 Boundary-aware Self-supervised Learning (BaSSL)

4.1 Overview

As illustrated in Fig. 2, our framework BaSSL is based on two-stage training following common practice [8]: pre-training on large-scale unlabeled data with selfsupervision and fine-tuning on relatively small labeled data. Our main focus is in the pre-training stage, designing effective pretext tasks for video scene segmentation. Furthermore, we aim to train both shot encoder and contextual relation network while maximizing intra-scene similarity and inter-scene discrimination across semantic transition. During pre-training, given an input window Sn , BaSSL finds a shot across which the semantic transition becomes maximum and uses it as a pseudoboundary to self-supervise the model. To be specific, we leverage the dynamic time warping technique to divide the shots in a window into two semantically disjoint sub-sequences, thus yielding a pseudo-boundary (Sect. 4.2). Then, we pre-train a model θ using three boundary-aware pre-text tasks and the masked shot modeling task adopted from CBT [46] to maximize intra-scene similarity and inter-scene dissimilarity (Sect. 4.3). After pre-trained with the four pretext tasks, the model is fine-tuned with labeled scene boundaries (Sect. 4.4). 4.2

Pseudo-boundary Discovery

The goal of our pre-training is to learn a capability of capturing semantic change before and after a semantic transition moment, thereby leading to higher performance in video scene segmentation. Specifically, we leverage a pseudo-boundary as a clue for self-supervision. However, extracting scene-level pseudo-boundaries from an input window is challenging. This is because there may be no scene boundary and it is also difficult to determine how many boundaries there are.

BaSSL

491

Fig. 4. Illustration of four pre-training pretext tasks.

Therefore, given an input window, we adopt a simple approach of finding a single moment where the semantics is maximally transitioning, and use it as pseudoboundary. Although such identified moment may not correspond to scene-level boundary, it is still effective in learning a capability to capture semantic transition and the capability can be adapted to detect scene-level transition via finetuning. Figure 3 shows identified pseudo-boundaries from input windows having no ground-truth scene boundary; we observe that the resulting two sub-sequences are still cognitively distinguishable. More examples are presented in appendix. The process, dividing an input window Sn into two continuous, nonand Sright with maximum semantic transition, overlapping sub-sequences Sleft n n can be seen as a temporal alignment problem between Sn and Sslow n ; specifiand the last one to Sright , cally, observing the first shot should belong to Sleft n n slow we define Sn = {sn−K , sn+K }, which can be seen as a same video with Sn with lower sampling frequency. Then, the problem becomes aligning intermediate shots either to the first shot sn−K or the last shot sn+K while preserving continuity. Under the problem setting, we adopt dynamic time warping (DTW) [4] to find the optimal alignment between Sn and Sslow n . DTW solves the following optimization problem using dynamic programming to maximize semantic coherence of the resulting two sub-sequences among all possible boundary candidates: b∗ =

 arg max

b=−K+1,...,K−1

1 b+K +

b 

sim(en−K , en+i )

i=−K+1

 K−1  1 sim(en+K , en+j ) , K −b−1

(3)

j=b+1

where b and b∗ are the candidate and optimal boundary offsets, respectively. x y computes cosine similarity between two given shot encodsim(x, y) = xy = {sn−K , ..., sn+b∗ } and Sright = ings. Two sub-sequences are inferred as Sleft n n {sn+b∗ +1 , ..., sn+K }. sn+b∗ is the pseudo-boundary shot, which is the last shot of Sleft n . The results are used for learning boundary-aware pretext tasks, which will be described Sect. 4.3. Discussion on Single Pseudo-boundary. One might question if identifying multiple pseudo-boundaries is more reasonable, since there may exist more than two semantic transitions in an input window. However, we emphasize that the

492

J. Mun et al.

goal of our boundary-aware pre-training is learning a capability of capturing semantic transition, not lying on perfectly capturing all semantic transitions (or scene boundaries) at once; the capability to capture all scene-level boundaries is adapted via fine-tuning. In experiments, we verify that pre-training with one semantically strongest pseudo-boundary brings remarkable performance gain. 4.3
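A minimal sketch of the boundary-offset search in Eq. (3) is shown below, operating on pre-computed shot embeddings. The function and variable names are ours; the b = K-1 endpoint is skipped because it leaves the right-hand average empty.

```python
import torch
import torch.nn.functional as F

def find_pseudo_boundary(shot_embs: torch.Tensor) -> int:
    """Eq. (3): pick the offset b that maximizes the average similarity of the left
    sub-sequence to the first shot plus that of the right sub-sequence to the last shot.
    shot_embs: (2K+1, D) embeddings e_{n-K}, ..., e_{n+K}; returns the best offset b*."""
    n = shot_embs.shape[0]
    k = (n - 1) // 2
    sims_first = F.cosine_similarity(shot_embs, shot_embs[0:1], dim=-1)   # sim(e_{n-K}, e_{n+i})
    sims_last = F.cosine_similarity(shot_embs, shot_embs[-1:], dim=-1)    # sim(e_{n+K}, e_{n+j})
    best_b, best_score = None, float("-inf")
    for b in range(-k + 1, k - 1):               # candidate offsets; b = K-1 skipped (empty right side)
        left = sims_first[1 : b + k + 1].mean()      # i = -K+1 .. b   ->  indices 1 .. b+K
        right = sims_last[b + k + 1 : n - 1].mean()  # j = b+1 .. K-1  ->  indices b+K+1 .. 2K-1
        score = (left + right).item()
        if score > best_score:
            best_b, best_score = b, score
    return best_b

embs = torch.cat([torch.randn(1, 128).repeat(6, 1), torch.randn(1, 128).repeat(11, 1)])
print(find_pseudo_boundary(embs))  # prints -3: the split between the two constant halves
```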

Pre-training Objectives

As shown in Fig. 4, we pre-train a model with three novel boundary-aware pretext tasks—1) shot-scene matching (Lssm ), 2) contextual group matching (Lcgm ) and 3) pseudo-boundary prediction (Lpp )—and an additional one, masked shot modeling (Lmsm ) as follows: Lpretrain = Lssm + Lcgm + Lpp + Lmsm .

(4)

Shot-Scene Matching (SSM). The objective of this task is to make the representations of a shot and its associated scene similar to each other, while the representations of the shot and other scenes dissimilar. In other words, SSM encourages the model to maximize intra-scene similarity, while minimizing interright scene similarity. Considering the splitted two sub-sequences (Sleft ) as n and Sn pseudo-scenes, we train the model using the InfoNCE loss [32]:     right ) , (5) Lssm = Lnce hssm (en−K ), hssm (rleft n ) + Lnce hssm (en+K ), hssm (rn Lnce (e, r) = − log

esim(e,r)/τ

+



esim(e,r)/τ  , sim(¯ e,r)/τ + sim(e,¯ r)/τ ¯∈Ne e ¯ e r∈Nr e

(6)

where hssm is a SSM head of a linear layer, τ is a temperature hyperparameter means a scene-level representation, which is defined by the averaged and rleft n encoding of shots in the sub-sequence Sleft n . Ne and Nr in Eq. (6) are constructed using other shots and pseudo-scenes in a mini-batch, respectively. Contextual Group Matching (CGM). Since directly matching representations of shots and scenes would not be effective when the scenes are composed of visually dissimilar shots, CGM is introduced to bridge this gap. Similar to SSM, CGM is also designed to maximize intra-scene similarity and inter-scene discrimination. However, CGM measures semantic coherence of the shots rather than comparing visual cues. With CGM, the model learns to decide if the given two shots belong to the same group (i.e., scene) or not. In detail, we use the center shot sn in the input sequence as the anchor and construct a triplet of and Sright ; the one sampled (sn , spos , sneg ). We sample each shot from Sleft n n within the same sub-sequence with sn is used as the positive shot spos , while the other as the negative sneg . CGM loss is defined using binary cross-entropy by Lcgm = − log (hcgm (cn , cpos )) − log (1 − hcgm (cn , cneg )) ,

(7)

where hcgm is a CGM head taking two shots as input and predicting a matching score. cn , cpos and cneg are the contextualized features by fCRN for the center, positive and negative shots, respectively.

BaSSL

493

Pseudo-boundary Prediction (PP). Through the above two pretext tasks, our model learns the contextual relationship between shots. In addition to these, we design an extra pretext task, PP, which is more directly related to boundary detection; PP makes the model have a capability of identifying transitional moments that semantic changes. Using the pseudo-boundary shot and one randomly sampled non-boundary shot, the PP loss is defined as a binary crossentropy loss: (8) Lpp = − log (hpp (cn+b∗ )) − log (1 − hpp (c¯b )) , where hpp is a PP head that projects the contextualized shot representation to a probability distribution over binary class. cn+b∗ and c¯b indicate the contextualized representation from fCRN for the pseudo-boundary shot sn+b∗ and randomly sampled non-boundary shot s¯b , respectively. Masked Shot Modeling (MSM). Inspired by masked frame modeling [46,47], we adopt the MSM task whose goal is to reconstruct the representation of masked shots based on their surrounding shots. In this task, given a set of encoded shot representations, we randomly apply masking each of them with a probability of 15%. For a set M of masked shot offsets, we learn to regress the output on each masked shot to its encoded shot representation:  em − hmsm (cm )22 , (9) Lmsm = m∈M

where hmsm is a MSM head to match the dimension of contextualized shot representation with that of encoded one. em and cm denote the encoded and contextualized features by fENC and fCRN for a masked shot sm , respectively. 4.4

Fine-tuning for Scene Boundary Detection

Recall that we formulate the video scene segmentation as a binary classification task to identify contextual transition across the scene. Different from the pretraining stage, given an input window Sn , we employ a scene boundary detection head hsbd to infer a prediction from the contextualized representation (cn ) for the center shot sn . Following ShotCoL [8], we freeze the parameters of the shot encoder and then train only CRN and the scene boundary detection head using a binary cross-entropy loss with the ground truth label yn as follows: Lfinetune = −yn log(hsbd (cn )) + (1 − yn ) log(1 − hsbd (cn )).

(10)

With a sliding window scheme, each shot is decided to be a scene boundary when its prediction score is higher than a pre-defined threshold (set to 0.5).
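A sketch of the fine-tuning objective in Eq. (10) and the thresholded sliding-window decision rule follows; the boundary head is assumed to emit a single logit per center shot.

```python
import torch
import torch.nn.functional as F

def finetune_loss(boundary_logit: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between h_sbd(c_n) and the ground-truth boundary label y_n."""
    return F.binary_cross_entropy_with_logits(boundary_logit, y.float())

def is_scene_boundary(boundary_logit: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Sliding-window inference: the center shot is a boundary if its score exceeds the threshold."""
    return torch.sigmoid(boundary_logit) > threshold

logits = torch.tensor([1.2, -0.3])
print(finetune_loss(logits, torch.tensor([1, 0])), is_scene_boundary(logits))
```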

5 Experiment

5.1 Experimental Settings

Dataset. For evaluation, we use the MovieNet-SSeg dataset [19] containing 1,100 movies with 1.6M shots. Only 318 out of 1,100 movies have scene boundary annotations, which are divided into 190, 64, and 64 movies for training, validation,

494

J. Mun et al.

Table 1. Comparison with other algorithms. † and ‡ denote that the numbers are copied from [34] and [19], respectively.  indicates the methods exploiting additional information (e.g., audio, place, cast, transcript). The best numbers are in bold. Method

AP (↑)

mIoU (↑)

AUC-ROC (↑)

F1 (↑)

Siamese [3]‡

35.80

39.60





MS-LSTM [19]‡

46.50

46.20





47.10

48.80





GraphCut [36]†

14.10

29.70

SCSA [7]†

14.70

30.50





15.50

32.00





25.10

35.70





Grouping [39]‡

33.60

37.20

BaSSL w/o fine-tuning (10 epochs)

31.55

39.36

71.67

32.55

Supervised Learning



LGSS [34]†

Unsupervised Learning

DP [16]† Story Graph [48]† 





Self-supervised Learning ShotCoL [8] BaSSL (10 epochs) BaSSL (40 epochs)

53.40







56.26 ± 0.04

49.50 ± 0.11

90.27 ± 0.02

45.70 ± 0.24

57.40 ± 0.08 50.69 ± 0.45 90.54 ± 0.03 47.02 ± 0.87

and test split, respectively. Following ShotCoL [8], we use the entire 1,100 movies with no ground truth labels for the pre-training and fine-tune the model on the training split. The performance is measured on the test split. Metric. Following [19], we compare algorithms using AP and mIoU. Also, we adopt F1 score1 and AUC-ROC as additional evaluation metrics. We also report Meta-Sum metric inspired by [10,29] for easy and straightforward comparison. Implementation Details. We employ ResNet-50 [18] and Transformer [50] as shot encoder and CRN, respectively. We cross-validate the number of neighbor shots among K = {4, 8, 12, 16} and K = 8 is selected due to its good performance and computational efficiency. In all experiments, we report mean and std from 5 fine-tuned models with random seeds. More details are presented in appendix. 5.2

Comparison with State-of-the-Art Methods

We compare BaSSL with 1) supervised ones: LGSS [34], Siamese [3], MSLSTM [19] and, 2) unsupervised ones: GraphCut [36], SCSA [7], DP [16], StoryGraph [48] and Grouping [39], and 3) self-supervised ones: ShotCoL [8]. Without fine-tuning, BaSSL can be seen as an unsupervised model in that it is trained to predict the pseudo-boundary by the PP task. Table 1 summarizes comparison against competing methods. BaSSL without fine-tuning shows competitive or outperforming performance based only on the basic visual cue compared to competing unsupervised ones. Furthermore, fine-tuning BaSSL with ground-truth scene boundaries improves AP by 24.71%p and BaSSL outperforms all other algorithms. Finally, through longer pre-training (40 epochs), BaSSL surpasses 1

¹ Contrary to previous works [8,34] that report recall, we use the F1 score for a balanced comparison between precision and recall.


Table 2. Average precision (AP) comparison with pre-training baselines. Note that SimCLR (NN) corresponds to our reproduced ShotCoL using SimCLR. The Transfer column indicates which modules (fENC, fCRN) are transferred, and the last three columns report AP for different architectures of fCRN during fine-tuning (MLP, MS-LSTM, Transformer).

| Pre-training method | Transfer | MLP | MS-LSTM | Transformer
Supervised pre-training using image dataset
M1 | ImageNet | fENC | 43.12 ± 0.14 | 45.10 ± 0.55 | 47.13 ± 1.04
M2 | Places365 | fENC | 43.82 ± 0.10 | 45.87 ± 0.40 | 48.71 ± 0.50
Shot-level pre-training
M3 | SimCLR (instance) | fENC | 45.60 ± 0.07 | 49.09 ± 0.24 | 51.51 ± 0.31
M4 | SimCLR (temporal) | fENC | 45.55 ± 0.11 | 49.24 ± 0.26 | 50.05 ± 0.78
M5 | SimCLR (NN) | fENC | 45.99 ± 0.13 | 50.73 ± 0.19 | 51.17 ± 0.69
Boundary-aware pre-training
M6 | BaSSL | fENC | 46.53 ± 0.11 | 50.58 ± 0.14 | 50.82 ± 0.69
M7 | BaSSL | fENC, fCRN | – | – | 56.26 ± 0.04
M8 | M5 + M7 | fENC, fCRN | – | – | 56.86 ± 0.01

the previous state-of-the-art method (i.e., ShotCoL) by a large margin (4.00%p in AP).

5.3 Comparison with Pre-training Baselines

We perform extensive experiments to compare BaSSL with pre-training baselines that learn shot-level representations with fENC only. We consider three types of pre-training approaches. The first group (M1-2) trains fENC using image-level supervision on ImageNet [12] or place labels on Places365 [58]. The second group (M3-5) trains fENC through shot-level contrastive learning (i.e., SimCLR [9]) with different positive-pair sampling strategies. Specifically, instance (M3) takes another augmented view of the center shot, temporal (M4) takes one randomly sampled neighbor shot within a local temporal window as the positive pair, and NN (M5) takes the most visually similar shot among the neighbor shots as the positive pair, which is also known as ShotCoL [8]. The last group (M6-8) learns both fENC and fCRN through the boundary-aware pretext tasks proposed in this paper. Given the pre-trained representations of fENC, we train a video scene segmentation model with three different types of fCRN: MLP [8], MS-LSTM [19]² and Transformer. For a fair comparison, all pre-training methods employ ResNet-50 as fENC and we pre-train the models for 10 epochs. From Table 2, we make the following observations. First, when transferring pre-trained shot representations, employing MS-LSTM or Transformer as fCRN is more effective than using MLP, as they are designed to capture contextual relations between shots (see M1-6). Second, BaSSL (M7) outperforms all competing baselines (M1-5), which shows the importance of boundary-aware pre-training. Third, transferring the representation through fCRN is important for the boundary detection task, leading to a performance gain of 5.44%p in AP (see M6-7). Finally, learning shot-level and contextual

https://github.com/AnyiRao/SceneSeg/tree/master/lgss.
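The three positive-pair sampling strategies compared in M3–M5 (instance, temporal, and NN) can be summarized with the hedged sketch below; it assumes pre-extracted shot features and key frames, and its function and variable names are illustrative, not from any released code.

```python
import random
import torch
import torch.nn.functional as F

def sample_positive(shots, feats, center_idx, strategy="nn", augment=None):
    """Pick the positive pair for the center shot under the three strategies.

    shots:   list of shot key frames (tensors) in the local window around center_idx
    feats:   (S, D) tensor of shot features, used only by the 'nn' strategy
    augment: augmentation callable, used only by the 'instance' strategy
    """
    if strategy == "instance":        # M3: another augmented view of the same shot
        return augment(shots[center_idx])
    if strategy == "temporal":        # M4: a randomly sampled neighbor shot
        neighbors = [i for i in range(len(shots)) if i != center_idx]
        return shots[random.choice(neighbors)]
    if strategy == "nn":              # M5 (ShotCoL): the most visually similar neighbor
        sims = F.cosine_similarity(feats[center_idx : center_idx + 1], feats, dim=-1)
        sims[center_idx] = -1.0       # exclude the shot itself
        return shots[int(sims.argmax())]
    raise ValueError(strategy)
```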


Table 3. Ablation study on varying combinations of the pretext tasks (SSM, CGM, PP, MSM) for pre-training. The best scores are highlighted in bold.

Model | AP | mIoU | AUC-ROC | F1 | Sum
P1 | 42.57 ± 0.29 | 40.12 ± 0.50 | 84.11 ± 0.15 | 30.83 ± 0.79 | 197.63
P2 | 36.76 ± 0.02 | 40.59 ± 0.18 | 82.06 ± 0.04 | 30.94 ± 0.32 | 190.35
P3 | 36.55 ± 0.04 | 39.58 ± 0.05 | 81.36 ± 0.03 | 29.96 ± 0.04 | 187.45
P4 | 13.33 ± 0.23 | 29.80 ± 0.39 | 64.65 ± 0.98 | 18.68 ± 0.39 | 126.45
P5 | 55.77 ± 0.05 | 48.19 ± 0.21 | 90.19 ± 0.03 | 43.17 ± 0.39 | 237.32
P6 | 56.04 ± 0.08 | 49.00 ± 0.16 | 90.13 ± 0.02 | 44.74 ± 0.29 | 239.91
P7 | 38.09 ± 0.03 | 41.25 ± 0.10 | 82.85 ± 0.01 | 32.24 ± 0.24 | 195.43
P8 | 54.39 ± 0.07 | 47.54 ± 0.18 | 89.72 ± 0.03 | 42.48 ± 0.22 | 234.13
P9 | 39.49 ± 0.04 | 41.71 ± 0.12 | 83.27 ± 0.02 | 32.85 ± 0.20 | 197.32
P10 | 38.53 ± 0.07 | 40.85 ± 0.15 | 82.78 ± 0.04 | 31.47 ± 0.16 | 193.63
P11 | 41.02 ± 0.07 | 40.89 ± 0.10 | 83.79 ± 0.02 | 31.53 ± 0.18 | 197.23
P12 | 56.10 ± 0.08 | 49.10 ± 0.17 | 90.09 ± 0.03 | 45.42 ± 0.30 | 240.71
P13 | 56.20 ± 0.06 | 48.00 ± 0.17 | 90.13 ± 0.01 | 43.24 ± 0.27 | 237.57
P14 | 56.26 ± 0.02 | 48.42 ± 0.33 | 90.25 ± 0.01 | 43.98 ± 0.58 | 238.91
P15 (all tasks) | 56.26 ± 0.04 | 49.50 ± 0.11 | 90.27 ± 0.02 | 45.70 ± 0.24 | 241.73

representations is complementary to each other; that is, incorporating ShotCoL (M5) and our framework (M7) provides further improved performance (M8).

5.4 Ablation Studies

Impact of Individual Pretext Tasks. We investigate the contribution of individual pretext tasks by training models with varying combinations of the pretext tasks. From Table 3, we draw two observations. First, among the models trained with a single pretext task (P1-4), MSM leads to the worst performance. This indicates that the boundary-aware pretext tasks (i.e., SSM, CGM and PP) are indeed important for scene boundary detection. Second, the more pretext tasks are used, the better the performance, and the best result is obtained with all tasks (P15). This means all tasks are complementary to each other and contribute to the improvement.

Pseudo-boundary Discovery Method. To check the effectiveness of DTW-based pseudo-boundary discovery, we train three models with different pseudo-boundary decision strategies: 1) Random, which defines one randomly sampled shot in the input window as the pseudo-boundary; 2) Fixed, which always takes the center shot as the pseudo-boundary; and 3) Synthesized, inspired by BSP [56], which synthesizes the input window by concatenating two sub-sequences sampled from different movies and uses the last shot of the first sub-sequence as the pseudo-boundary. Table 4(a) summarizes the results. Our approach of adopting DTW to find pseudo-boundaries achieves the best performance. It is notable that BaSSL with Synthesized pseudo-boundaries also outperforms the pre-training baselines in Table 2, which again shows the importance of boundary-aware pre-training.
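For reference, the three baseline pseudo-boundary strategies above can be written as the following rough sketch; the window representation and function names are simplified assumptions, and the DTW-based discovery itself is omitted.

```python
import random

def pseudo_boundary(window, strategy="fixed", other_movie_window=None):
    """Return (shots, boundary_index) for one input window of shots.

    'random'      -- a randomly sampled shot in the window is the pseudo-boundary
    'fixed'       -- the center shot is always the pseudo-boundary
    'synthesized' -- concatenate two sub-sequences from different movies (cf. BSP [56]);
                     the last shot of the first sub-sequence is the pseudo-boundary
    """
    if strategy == "random":
        return window, random.randrange(len(window))
    if strategy == "fixed":
        return window, len(window) // 2
    if strategy == "synthesized":
        half = len(window) // 2
        mixed = window[:half] + other_movie_window[: len(window) - half]
        return mixed, half - 1
    raise ValueError(strategy)
```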


Table 4. Ablations to check the impact of pseudo-boundary discovery strategies, the number of neighboring shots (K) and longer pre-training. The best scores are in bold.

(a) Pseudo-boundary discovery methods.
Pseudo-boundary | AP
Random | 46.64 ± 0.37
Fixed | 49.53 ± 0.32
Synthesized | 54.61 ± 0.03
DTW (ours) | 56.26 ± 0.04

(b) The number of neighbor shots.
# Neighbors (K) | AP
4 | 55.98 ± 0.10
8 | 56.26 ± 0.04
12 | 56.29 ± 0.03
16 | 55.31 ± 0.04

(c) The number of pre-training epochs.
Epochs | AP
10 | 56.26 ± 0.04
20 | 56.74 ± 0.04
30 | 56.74 ± 0.07
40 | 57.40 ± 0.08
50 | 57.15 ± 0.08

Table 5. Scene clustering quality measured by normalized mutual information (NMI).

Model | Short (Nc = 8) | Medium (Nc = 16) | Long (Nc = 32) | Δ ↓ (Short → Long)
ImageNet | 67.50 | 61.60 | 56.25 | −16.67%
SimCLR (temporal) | 82.40 | 81.65 | 78.99 | −4.14%
SimCLR (NN) | 83.54 | 83.17 | 81.25 | −2.75%
BaSSL (ours) | 86.22 | 86.72 | 85.63 | −0.68%

Hyperparameters. We analyze the impact of two hyperparameters: the number of neighbor shots K and the number of pre-training epochs. Table 4(b) shows that performance increases with more neighbor shots, saturating around K = 12. Table 4(c) shows the impact of longer pre-training: performance increases up to a certain point (40 epochs) and decreases afterward. We conjecture that this is partly due to overfitting to noise from incorrect pseudo-boundaries.

5.5 Analysis on Pre-trained Shot Representation Quality

We analyze the quality of pre-trained shot representations using normalized mutual information (NMI) as a measure of clustering quality. Specifically, we randomly sample 100 scenes from the test split of MovieNet-SSeg while varying the scene length Nc ∈ {8, 16, 32} (the number of shots included in a single scene). We then perform K-Means clustering with 100 clusters on the Nc × 100 shot representations extracted by the pre-trained model. The intent is to form a single cluster per scene, assuming that high-quality representations place shot embeddings from the same scene close to each other. Considering the randomness in K-Means clustering and scene sampling, we report the score averaged over 5 trials. Table 5 shows the NMI scores of the different pre-trained models. BaSSL outperforms the shot-level pre-training baselines and the model pre-trained on ImageNet. With respect to different scene lengths (Nc), BaSSL is also more robust than the others: as the scenes become longer and the visual diversity across shots increases (Nc = 8 → 32), the performance of BaSSL drops by only 0.68%, while the other baselines suffer severe degradation. This implies the effectiveness of BaSSL in maximizing intra-scene similarity.

Fig. 5. Visualization of the similarity (below) between shot representations for randomly sampled consecutive shots (above). We observe that the shot representations become more clearly clustered as the pretext tasks are added one by one.
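The clustering protocol of this analysis can be reproduced with a few lines of scikit-learn, as in the hedged sketch below; it assumes the shot features have already been extracted with the pre-trained encoder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def scene_clustering_nmi(shot_feats, scene_ids, n_scenes=100, seed=0):
    """shot_feats: (Nc * n_scenes, D) array; scene_ids: ground-truth scene index per shot."""
    pred = KMeans(n_clusters=n_scenes, random_state=seed).fit_predict(shot_feats)
    return normalized_mutual_info_score(scene_ids, pred)

# Average over 5 trials of scene sampling / clustering, as in the protocol above:
# scores = [scene_clustering_nmi(f, s, seed=i) for i, (f, s) in enumerate(trials)]
# print(np.mean(scores))
```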

5.6 Qualitative Analysis

To qualitatively check the effect of individual pretext tasks, Fig. 5 visualizes the matrix of cosine similarities between shot representations for 16 randomly sampled consecutive shots. The shot representations are computed by models without fine-tuning in order to focus solely on the behavior of each pretext task at the pre-training stage. When only MSM is used, roughly three clusters appear, but the similarity around boundaries is smoothed. When PP is added, the dissimilarities around the boundaries become sharper. With the additional CGM, the clusters become more clearly separated. Finally, adding SSM increases the similarity of shots within the same cluster (i.e., more yellow entries). We present further qualitative analysis of the discovered pseudo-boundaries and scene boundary predictions in the supplementary material.

6 Conclusion

We present BaSSL, a novel self-supervised framework for video scene segmentation, especially designed to learn the contextual relationship between shots. Through pseudo-boundary discovery, we can define and conduct boundary-aware pretext tasks that encourage the model to learn contextual relational representations and the capability to capture transitional moments. Comprehensive experiments demonstrate the effectiveness of our framework, which achieves outstanding performance on the MovieNet-SSeg dataset.

Acknowledgements. This work was supported by Kakao Brain and partly by Korea research grants from NRF (2021H1D3A2A03038607/10%, 2022R1C1C1010627/10%), and IITP (2021-0-01778/10%, 2022-0-00264/40%, 2022-0-00951/10%, 2022-0-00612/10%, 2020-0-01373/10%).


References 1. Abu-El-Haija, S., et al.: YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675 (2016) 2. Ahsan, U., Sun, C., Essa, I.: DiscrimNet: semi-supervised action recognition from videos using generative adversarial networks. arXiv:1801.07230 (2018) 3. Baraldi, L., Grana, C., Cucchiara, R.: A deep Siamese network for scene detection in broadcast videos. In: ACM MM (2015) 4. Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: SIGKDD Workshop (1994) 5. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. arXiv:2006.09882 (2020) 6. Chang, C.Y., Huang, D.A., Sui, Y., Fei-Fei, L., Niebles, J.C.: D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In: CVPR (2019) 7. Chasanis, V.T., Likas, A.C., Galatsanos, N.P.: Scene detection in videos using shot clustering and sequence alignment. IEEE Trans. Multimedia 11(1), 89–100 (2008) 8. Chen, S., Nie, X., Fan, D., Zhang, D., Bhat, V., Hamid, R.: Shot contrastive selfsupervised learning for scene boundary detection. In: CVPR (2021) 9. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: a simple framework for contrastive learning of visual representations. In: ICML (2020) 10. Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: UNiversal Image-TExt representation learning. In: ECCV (2020) 11. Dave, I., Gupta, R., Rizve, M.N., Shah, M.: TCLR: temporal contrastive learning for video representation. arXiv:2101.07974 (2021) 12. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009) 13. Farha, Y.A., Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: CVPR (2019) 14. Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., He, K.: A large-scale study on unsupervised spatiotemporal representation learning. In: CVPR (2021) 15. Fried, D., Alayrac, J.B., Blunsom, P., Dyer, C., Clark, S., Nematzadeh, A.: Learning to segment actions from observation and narration. arXiv:2005.03684 (2020) 16. Han, B., Wu, W.: Video scene segmentation using a novel boundary evaluation criterion and dynamic programming. In: IEEE International Conference on Multimedia and Expo (2011) 17. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020) 18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 19. Huang, Q., Xiong, Yu., Rao, A., Wang, J., Lin, D.: MovieNet: a holistic dataset for movie understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 709–727. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-58548-8_41 20. Jing, L., Tian, Y.: Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387 (2018) 21. Kuehne, H., Richard, A., Gall, J.: A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 765–779 (2018)


22. Kukleva, A., Kuehne, H., Sener, F., Gall, J.: Unsupervised learning of action classes with continuous temporal embedding. In: CVPR (2019) 23. Kumar, S., Haresh, S., Ahmed, A., Konin, A., Zia, M.Z., Tran, Q.H.: Unsupervised activity segmentation by joint representation learning and online clustering. arXiv:2105.13353 (2021) 24. Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal CNNs for fine-grained action segmentation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 36–52. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46487-9_3 25. Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised Representation Learning by Sorting Sequences. In: ICCV (2017) 26. Li, J., Lei, P., Todorovic, S.: Weakly Supervised Energy-based Learning for Action Segmentation. In: ICCV (2019) 27. Li, J., Todorovic, S.: Set-constrained Viterbi for set-supervised action segmentation. In: CVPR (2020) 28. Li, J., Todorovic, S.: Action shuffle alternating learning for unsupervised action segmentation. In: CVPR (2021) 29. Li, L., et al.: VALUE: a multi-task benchmark for video-and-language understanding evaluation. In: NeurIPS (2021) 30. Liang, C., Zhang, Y., Cheng, J., Xu, C., Lu, H.: A novel role-based movie scene segmentation method. In: Pacific-Rim Conference on Multimedia (2009) 31. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi. org/10.1007/978-3-319-46448-0_32 32. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018) 33. Qian, R., et al.: Spatiotemporal contrastive video representation learning. In: CVPR (2021) 34. Rao, A., et al.: A local-to-global approach to multi-modal movie scene segmentation. In: CVPR (2020) 35. Rasheed, Z., Shah, M.: Scene detection in Hollywood movies and TV shows. In: CVPR (2003) 36. Rasheed, Z., Shah, M.: Detection and representation of scenes in videos. IEEE Trans. Multimedia 7(6), 1097–1105 (2005) 37. Roh, B., Shin, W., Kim, I., Kim, S.: Spatially consistent representation learning. In: CVPR (2021) 38. Rotman, D., Porat, D., Ashour, G.: Robust and efficient video scene detection using optimal sequential grouping. In: IEEE International Symposium on Multimedia (ISM) (2016) 39. Rotman, D., Porat, D., Ashour, G.: Optimal sequential grouping for robust video scene detection using multiple modalities. Int. J. Seman. Comput. 11(02), 193–208 (2017) 40. Rui, Y., Huang, T.S., Mehrotra, S.: Exploring video structure beyond the shots. In: Proceedings of the IEEE International Conference on Multimedia Computing and Systems (1998) 41. Shen, Y., Wang, L., Elhamifar, E.: Learning to segment actions from visual and language instructions via differentiable weak sequence alignment. In: CVPR (2021) 42. Shou, M.Z., Lei, S.W., Wang, W., Ghadiyaram, D., Feiszli, M.: Generic event boundary detection: a benchmark for event segmentation. In: ICCV (2021)


43. Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H., Bugalho, M., Trancoso, I.: Temporal Video Segmentation to Scenes using High-level Audiovisual Features. IEEE Trans. Circuits Syst. Video Technol. 21(8), 1163–1177 (2011) 44. Souri, Y., Fayyaz, M., Minciullo, L., Francesca, G., Gall, J.: Fast weakly supervised action segmentation using mutual consistency. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6196–6208 (2021) 45. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015) 46. Sun, C., Baradel, F., Murphy, K., Schmid, C.: Learning video representations using contrastive bidirectional transformer. arXiv:1906.05743 (2019) 47. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: ICCV (2019) 48. Tapaswi, M., Bauml, M., Stiefelhagen, R.: StoryGraphs: visualizing character interactions as a timeline. In: CVPR (2014) 49. Tversky, B., Zacks, J.M.: Event perception. Oxford Handbook of Cognitive Psychology (2013) 50. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017) 51. VidalMata, R.G., Scheirer, W.J., Kukleva, A., Cox, D., Kuehne, H.: Joint visualtemporal embedding for unsupervised learning of actions in untrimmed sequences. In: WACV (2021) 52. Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. NIPS (2016) 53. Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., Murphy, K.: Tracking emerges by colorizing videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 402–419. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_24 54. Wang, Z., et al.: Unsupervised action segmentation with self-supervised feature learning and co-occurrence parsing. arXiv:2105.14158 (2021) 55. Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y.: Self-supervised spatiotemporal learning via video clip order prediction. In: CVPR (2019) 56. Xu, M., et al.: Boundary-sensitive pre-training for temporal localization in videos. arXiv:2011.10830 (2020) 57. Zhang, B., et al.: A hierarchical multi-modal encoder for moment localization in video corpus. arXiv:2011.09046 (2020) 58. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2017) 59. Zhukov, D., Alayrac, J.B., Cinbis, R.G., Fouhey, D., Laptev, I., Sivic, J.: Cross-task weakly supervised learning from instructional videos. In: CVPR (2019)

HaViT: Hybrid-Attention Based Vision Transformer for Video Classification

Li Li1,2, Liansheng Zhuang1(B), Shenghua Gao3, and Shafei Wang2

1 University of Science and Technology of China, Hefei 230026, China
[email protected], [email protected]
2 Peng Cheng Laboratory, Shenzhen 518000, China
3 ShanghaiTech University, Shanghai 201210, China

Abstract. Video transformers have become a promising tool for video classification due to its great success in modeling long-range interactions through the self-attention operation. However, existing transformer models only exploit the patch dependencies within a video when doing self-attention, while ignoring the patch dependencies across the different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybridattention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Different from existing self-attention, the hybrid-attention is computed based on internal patch tokens and an external patch token dictionary which encodes external patch prior information across the different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that our HaViT model achieves state-of-the-art performance in the video classification task against existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve the performance.

1 Introduction

The task of video classification is to understand the visual and audio features to assign one or more relevant tags to the video. With the rapid increase of video content, this task is critical for many applications such as video retrieval [1] and video surveillance [2]. Compared with image classification, video classification is more challenging due to the temporal dimension, which increases the overall size of the input and variations in sequence. Though many methods have been proposed to model spatial relationships for image classification, it is still an open problem to jointly model spatial and temporal features in a video. The remarkable progress of the transformer [3] in natural language processing (NLP) has inspired researchers to investigate its adaptation to image classification. The transformer is notable for its use of multi-head self-attention to model long-range dependencies, which are often modeled by the large receptive fields


formed by deep stacks of convolutional operations. However, convolutional operations can only capture local neighborhoods in images, and the deep-stacking strategy is inherently limited in capturing long-range dependencies by means of aggregating shorter-range information [4]. Conversely, the self-attention operation attends to all elements in the input sequence, and thus can capture both local and global spatial relationships over non-overlapping image patches. Recently, a pure transformer-based architecture, the Vision Transformer (ViT) [5], has been proposed to replace convolutions completely, and it outperformed its convolutional counterparts in image classification [5]. Inspired by the fact that an attention-based architecture is an intuitive choice for modelling long-range contextual relationships in video, several transformer-based models have been proposed for video classification [6–11]. Some models apply self-attention on top of convolutional layers [6], while others use self-attention as the exclusive building block of the video classification model [7,8]. The natural extension of Vision Transformers to the 3-dimensional video signal is challenging. Specifically, each transformer encoder involves heavy computation such as pair-wise self-attention. Meanwhile, a video has a longer sequential representation than an image due to the additional temporal axis. Consequently, directly applying joint space-time attention to flattened video sequences is neither economical nor easy to optimize. To reduce the computation costs, some efforts [7,8] have factorized the spatial and temporal domains via a factorized encoder or factorized self-attention, and have achieved a good speed-accuracy trade-off. Though achieving promising results in video classification, all these transformer models only exploit internal patch dependencies across the spatial and temporal dimensions within a video, while ignoring external patch dependencies across different videos. In fact, external patch dependencies across different images or videos, which capture the external patch prior information, play an important role in many low-level vision tasks such as image restoration [12] and video super-resolution [13]. This paper argues that external patch prior information is beneficial to transformer-based models for video classification. Motivated by the above assumptions, this paper introduces a novel Hybrid-attention based Vision Transformer (HaViT) for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. The main operation performed in this architecture is hybrid-attention, which is computed on a sequence of spatio-temporal tokens extracted from a video and an external token dictionary extracted from extra videos. The spatio-temporal tokens encode internal patch information within a video, while the external token dictionary encodes external patch information across different videos. HaViT uses hybrid-attention instead of self-attention to model both internal patch dependencies within a video and external patch dependencies across different videos. To improve the model performance, HaViT inserts the class token later in the transformer. This choice eliminates the discrepancy on the first layers of the transformer, which are thus used to perform hybrid-attention between patches and the external token dictionary only. Extensive experiments on three public datasets (Kinetics-400 [14],


Kinetics-600 [15] and Something-Something v2 [16]) show that HaViT achieves competitive results on video classification against existing state-of-the-art models. In summary, our main contributions are as follows:

– A new Hybrid-attention based Vision Transformer (HaViT) is proposed for video classification, which is mainly built on the hybrid-attention module. To the best of our knowledge, this is the first transformer architecture for video classification that explicitly exploits external patch dependencies across videos.
– A new hybrid-attention mechanism is introduced to model both long-range patch dependencies within a video and external patch dependencies across different videos, which can be easily integrated into existing transformer models to improve their performance.
– Extensive experiments on public datasets demonstrate that our proposed HaViT model outperforms most existing video transformer models in most cases. When combined with existing video transformer models, the hybrid-attention does improve their performance on video classification.

2 Related Work

Early works on video classification use hand-crafted features to encode appearance and motion information [17,18]. With the success of AlexNet in image classification [19], deep learning increasingly dominates visual modeling for video classification. For convolutional models, backbone architectures for video were previously adapted from those for images simply by extending the modeling through the temporal axis. Consequently, 3D convolutional neural networks (3D-CNNs) have become a de-facto standard for video classification [20–24]. Compared with their image counterparts, 3D-CNNs have significantly more parameters and thus require more computation. To alleviate this, a large body of works (such as P3D [25], R(2+1)D [26], and S3D [27]) factorize convolutions across the spatial and temporal dimensions to achieve a better speed-accuracy trade-off. However, the potential of convolution-based approaches is limited by the small receptive field of the convolution operator. With a self-attention mechanism, the receptive field can be broadened with fewer parameters and lower computation costs, which leads to better performance. In [4], the non-local network introduces self-attention on top of CNNs. Further, CBA-QSA CNN [28] extends self-attention with compact bilinear mapping for fine-grained action classification. With the success of the Vision Transformer (ViT) in image classification [5], a shift in backbone architectures is currently underway for video classification, from Convolutional Neural Networks (CNNs) to attention-based transformers [7–11]. Attention-based transformers use self-attention blocks at each layer to understand a frame's role with respect to other frames in the video. Since performing full spatio-temporal attention is computationally prohibitive, many efforts have been devoted to reducing computation costs via factorizing the temporal and spatial


domains. In TimeSformer [8], the authors propose applying spatial and temporal attention in an alternating manner, reducing the complexity of calculating attention weights. Similarly, ViViT [7] explores several methods of space-time factorization; in addition, it adapts the patch embedding process from [5] to 3D data. However, all existing works only exploit internal patch dependency information within a video via self-attention, while ignoring external patch dependency information across different videos. In fact, external patch dependencies have been proven to be important for many vision tasks. This paper exploits such external information in a transformer model for video classification.

Fig. 1. Diagram of HaViT. The patches extracted from the frames are linearly projected into the token embedding, and position embedding is added. The patches and the class token are fed into the transformer with L hybrid-attention layers.

3 Our Method

3.1 The Overall Architecture

The core idea of our proposed method is to introduce an external patch token dictionary to represent the texture feature space of all image patches. It uses the similarity of image patches in videos to calculate the internal attention and uses the similarity of image patches in current videos and the external token dictionary to calculate the external attention. In the calculation of hybrid-attention, the external token dictionary is used to expand the elements involved in the attention for the encoding of prior information. Specifically, when using the attention mechanism for feature embedding, the embedding of one image patch is not only linearly combined by the image patch features of the current video, but also by the elements of the external patch token dictionary. As shown in Fig. 1, the overall architecture includes a patch token embedding module, a video transformer module with hybrid-attention, and a class-attention layer. In the following, we elaborate on the processing flow: Firstly, F frame images are sampled

506

L. Li et al.

from the video sequence to form a multidimensional tensor x ∈ R^{H×W×C×F} as the model input, where H, W and C denote the height, the width and the number of channels of each frame, respectively. Then, each frame in the video is divided into a fixed number of non-overlapping image patches, and each image patch is reshaped into a flattened vector x_{p,t}, with p = 1, ..., N denoting the spatial locations and t = 1, ..., F denoting the frame index. We then obtain the image patch feature sequence z^0_{p,t} through the patch token embedding module. Next, we feed the image patch feature sequence to the video transformer with hybrid-attention. This module uses the image patch feature sequence and the external token dictionary to compute the feature representations of the image patches through the multi-head hybrid-attention mechanism. Finally, the feature for classification is computed by the class-attention layer, which combines the feature representations of the image patches with the class token.
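A minimal sketch of this patch token embedding step is given below, assuming a convolutional patch projection and a learnable positional embedding; the module name and default sizes are illustrative only, and the class token is kept aside since it is only injected in the later class-attention stage.

```python
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Split each of the F frames into PxP patches and project them to d-dim tokens."""
    def __init__(self, patch=16, in_ch=3, dim=768, frames=8, img_size=224):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2                      # N patches per frame
        self.pos = nn.Parameter(torch.zeros(1, frames * n_patches, dim))

    def forward(self, x):                                         # x: (B, F, C, H, W)
        B = x.shape[0]
        tokens = self.proj(x.flatten(0, 1))                       # (B*F, d, H/P, W/P)
        tokens = tokens.flatten(2).transpose(1, 2)                # (B*F, N, d)
        tokens = tokens.reshape(B, -1, tokens.shape[-1])          # (B, N*F, d)
        return tokens + self.pos                                  # z^0_{p,t}
```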

3.2 Hybrid-attention Vision Transformer

The hybrid-attention module is designed to model the relationship between image patches in a video and the relationship across different videos. It consists of the internal attention based on patch tokens within a video and the external attention based on an external token dictionary D_token. In the following, we describe the hybrid-attention module and its several variants.

Internal Attention. The internal attention is designed to model the internal patch dependencies in a video. Each patch representation is projected into the query, key, and value vectors, which are computed from z^{(l-1)}_{p,t} encoded by the preceding block:

q^{(l,a)}_{p,t} = W_Q^{(l,a)} f_{LN}(z^{(l-1)}_{p,t}) \in \mathbb{R}^{d_h},
k^{(l,a)}_{p,t} = W_K^{(l,a)} f_{LN}(z^{(l-1)}_{p,t}) \in \mathbb{R}^{d_h},        (1)
v^{(l,a)}_{p,t} = W_V^{(l,a)} f_{LN}(z^{(l-1)}_{p,t}) \in \mathbb{R}^{d_h},
(p,t) \in \{(p',t') \mid p' = 1,\dots,N;\ t' = 1,\dots,F\} \cup \{(0,0)\},

where a = 1, 2, ..., A is the index over the multiple attention heads, and d_h = d/A is the hidden dimension of each head. W_Q, W_K, W_V \in \mathbb{R}^{d_h \times d} are learnable parameter matrices for projecting the queries, keys and values. Next, the attention weights \alpha^{(l,a)}_{p,t} \in \mathbb{R}^{NF+1} are computed via the dot products of the query q^{(l,a)}_{p,t} with all keys:

\alpha^{(l,a)}_{p,t} = \sigma\Big( \frac{(q^{(l,a)}_{p,t})^{T}}{\sqrt{d_h}} \cdot \big[\, k^{(l,a)}_{0,0},\ \{ k^{(l,a)}_{p',t'} \}_{p'=1,\dots,N;\ t'=1,\dots,F} \,\big] \Big),        (2)

where σ denotes the activation function. The attention weights are used as coefficients in a weighted summation over the value vectors to obtain the output of each attention head s^{(l,a)}_{p,t}. These outputs from all heads are then concatenated, passed through the embedding matrix W_O, and fed to a feed-forward network (FFN) consisting of two MLP layers with GeLU activation:

\tilde{z}^{(l)}_{p,t} = W_O \big[\, s^{(l,1)}_{p,t};\ \dots;\ s^{(l,A)}_{p,t} \,\big] + z^{(l-1)}_{p,t}.        (3)

In (2), the attention coefficient is calculated using the query vector of an image patch and the key vectors of all image patches in the video, i.e., the internal attention is computed jointly over the spatial and temporal dimensions. A reduction in computation can be achieved by disentangling the spatial and the temporal dimension. For the spatial dimension, only N+1 query-key comparisons are made, using keys from the same frame as the query patch token exclusively:

\alpha^{(l,a)\,\mathrm{space}}_{p,t} = \sigma\Big( \frac{(q^{(l,a)}_{p,t})^{T}}{\sqrt{d_h}} \cdot \big[\, k^{(l,a)}_{0,0},\ \{ k^{(l,a)}_{p',t} \}_{p'=1,\dots,N} \,\big] \Big).        (4)

If we only consider the space attention, the model becomes a ViT that encodes the features of each frame, and the classifier vector is the global average of the features of all frames. The baseline for time-dimensional dependencies is proposed by TimeSformer [8], making only F+1 query-key comparisons and using the patches at the same location in the other frames as the query patch:

\alpha^{(l,a)\,\mathrm{time}}_{p,t} = \sigma\Big( \frac{(q^{(l,a)}_{p,t})^{T}}{\sqrt{d_h}} \cdot \big[\, k^{(l,a)}_{0,0},\ \{ k^{(l,a)}_{p,t'} \}_{t'=1,\dots,F} \,\big] \Big).        (5)

The internal attention with divided space-time dimensions can effectively reduce the computation cost, but it increases the number of parameters of the model: in (4) and (5), the query and key vectors are used twice and are calculated with two different sets of parameters. Compared with the joint space-time attention mechanism, this method increases the number of parameters but reduces the cost of computation.

External Attention. The external attention models the external patch dependencies across different videos. Since the number of external patches is huge, this paper proposes to use an external patch token dictionary to encode the external patch information. In this way, each patch is represented by the


combination of the elements of the dictionary, and the corresponding attention coefficients encode the dependencies between internal patches and external patches in different videos. The external token dictionary includes two trainable parts, the value set {v^e_1, v^e_2, ..., v^e_n} and its corresponding key set {k^e_1, k^e_2, ..., k^e_n}. Given a patch embedding z^{(l-1)}_{p,t} from the preceding layer, the attention coefficients are calculated via dot products with the key set of the external token dictionary, and the value set is then linearly weighted to obtain the external feature \hat{z}^{(l)}_{p,t}:

\alpha_i = \sigma\Big( \frac{(q^{(l)}_{p,t})^{T}}{\sqrt{d}} \cdot \{ k^e_i \}_{i=1,\dots,n} \Big),
\hat{z} = \sum_{i=1}^{n} \alpha_i v^e_i,        (6)
\hat{z}^{(l)}_{p,t} = f_{FFN}(f_{LN}(\hat{z})) + \hat{z},

where q^{(l)}_{p,t} is the query vector of z^{(l-1)}_{p,t}. Note that the external token dictionary in the external attention aims to learn general visual features. For the sake of simplicity, we only give the single-head formula; multi-head attention can also be used to compute the external attention (Fig. 2).
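A single-head sketch of this external attention with a learnable key/value dictionary (Eq. (6)) could look as follows; the dictionary size, initialization and module name are assumptions, not details from the paper.

```python
import math
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Attend from patch queries to a learnable external token dictionary (single head)."""
    def __init__(self, dim=768, n_tokens=256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)    # {k^e_i}
        self.values = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)  # {v^e_i}
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, q):                                    # q: (B, N*F, dim) patch queries
        attn = (q @ self.keys.t()) / math.sqrt(q.shape[-1])  # (B, N*F, n_tokens)
        z_hat = attn.softmax(dim=-1) @ self.values           # weighted sum of dictionary values
        return self.ffn(self.norm(z_hat)) + z_hat            # FFN(LN(z_hat)) + z_hat
```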

Fig. 2. Diagram of the combined-correlation hybrid-attention mechanism. The green block represents the image patches in the video and the yellow block represents the external patch token dictionary. (Color figure online)

Hybrid Schemes for Hybrid-attention. First of all, it is necessary to explain how the external prior information of videos is modeled with the external token dictionary. This paper assumes that the feature of an image patch is composed of two parts. The first is a linear combination of the features of image patches in the current video, obtained with the self-attention mechanism (i.e., the internal attention introduced earlier); the second is constructed from shared visual features that are not specific to the current video. We use a learnable dictionary, which is related to all videos, to construct these shared visual features.


The keys and values calculated from the external dictionary and from the internal patches can be spliced together for the subsequent computation. In the internal attention, the feature information of all image patches in the video is aggregated by computing attention. Note that not only the key vectors of all image patches but also all keys from the external token dictionary are used to calculate the attention coefficients. In this way, an expanded attention coefficient is computed as follows:

\alpha^{l,a}_{p,t} = \sigma\Big( \frac{(q^{l,a}_{p,t})^{T}}{\sqrt{d_h}} \cdot \{ k^{l,a}_i \} \Big),        (7)

where i indexes both the image patch feature sequence and the external token dictionary. In this scheme, the model splices the internal and external attention coefficients of a query patch into an extended correlation matrix (called the combined-correlation type). With the attention coefficients calculated by (7), the final output of hybrid-attention is obtained as the weighted summation over the corresponding value vectors.
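One way to realize this combined-correlation scheme is to concatenate the dictionary keys and values with the internal ones before a single softmax, as in the hedged single-head sketch below (joint space-time attention is used for brevity; all names and sizes are illustrative).

```python
import math
import torch
import torch.nn as nn

class CombinedHybridAttention(nn.Module):
    """Single-head hybrid-attention: one softmax over internal plus dictionary keys (Eq. 7)."""
    def __init__(self, dim=768, n_tokens=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.ext_k = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)  # external key set
        self.ext_v = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)  # external value set
        self.proj = nn.Linear(dim, dim)

    def forward(self, z):                                     # z: (B, S, dim) patch tokens of one video
        B, S, D = z.shape
        q, k, v = self.q(z), self.k(z), self.v(z)
        k = torch.cat([k, self.ext_k.expand(B, -1, -1)], dim=1)  # splice internal + external keys
        v = torch.cat([v, self.ext_v.expand(B, -1, -1)], dim=1)
        attn = (q @ k.transpose(1, 2)) / math.sqrt(D)             # extended correlation matrix
        out = attn.softmax(dim=-1) @ v
        return self.proj(out)
```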

3.3 Class-attention

The class token has two functions: guiding the learning of attention weights between patches, and aggregating the overall information that is passed to the linear classifier [29]. Recent work has shown that separating these two functions is beneficial for classification, and in this paper we explore whether this also improves video classification. Our implementation includes two stages: the hybrid-attention stage and the class-attention stage. In the hybrid-attention stage, we obtain the spatio-temporal features of the patches without the class token. In the class-attention stage, we only update the class token embedding while keeping the patch features frozen. Specifically, we first calculate the query, key, and value vectors:

q^{(l,a)}_{0,0} = W_Q^{(l,a)} f_{LN}(z^{(l-1)}_{0,0}),
k^{(l,a)}_{p,t} = W_K^{(l,a)} f_{LN}(z^{(l-1)}_{p,t}),        (8)
v^{(l,a)}_{p,t} = W_V^{(l,a)} f_{LN}(z^{(l-1)}_{p,t}).

Next, the attention weights are given by

\alpha^{(l,a)}_{0,0} = \sigma\Big( \frac{(q^{(l,a)}_{0,0})^{T}}{\sqrt{d_h}} \cdot \big\{ k^{(l,a)}_{p',t'} \big\}_{p'=1,\dots,N;\ t'=1,\dots,F} \Big).        (9)

Then, Eq. (3) is used to compute z^{(l)}_{0,0} as the output of the class-attention block. Finally, HaViT consists of a hybrid-attention module (which combines the internal attention and the external attention) and several class-attention layers (in which only the value of the class token is updated).
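A hedged sketch of one class-attention layer (Eqs. (8)–(9)), in which only the class token is updated while the patch features are left untouched, is given below; the names and the exact residual/normalization placement are assumptions.

```python
import math
import torch
import torch.nn as nn

class ClassAttentionLayer(nn.Module):
    """Update only the class token by attending to the frozen patch features."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_token, patches):       # cls_token: (B, 1, dim), patches: (B, S, dim)
        q = self.q(self.norm(cls_token))
        k = self.k(self.norm(patches))
        v = self.v(self.norm(patches))
        attn = (q @ k.transpose(1, 2)) / math.sqrt(q.shape[-1])   # (B, 1, S)
        out = self.proj(attn.softmax(dim=-1) @ v)
        return cls_token + out                     # patch features are left unchanged
```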

4 Experiment

4.1 Setup

Datasets. All the experiments in this paper were conducted on the following datasets. The Kinetics dataset contains short clips sampled from YouTube. Since some videos on YouTube have been deleted or privatized, the dataset versions used in this paper include about 260k clips for Kinetics-400 and 397k clips for Kinetics-600. Note that these numbers are lower than the original datasets and thus might induce a negative performance bias when comparing with previous works. Something-Something v2 (SSv2) consists of about 220k short videos with a length of 2 to 6 s, depicting humans performing predefined basic actions on daily objects. Since the objects and backgrounds in the videos are consistent across the different action classes, this dataset tends to require stronger temporal modeling.

Network Architecture. The backbone modules closely follow the ViT architecture. Most of the experiments were performed using HaViT-B/16 (L = 12, A = 12, d = 768, P = 16) and HaViT-S/16 (L = 12, A = 6, d = 384, P = 16), where L, A, d, P denote the number of transformer layers, the number of heads, the embedding dimension, and the patch size, respectively.

Training and Inference. Unless otherwise stated, we sample frames uniformly across the video. For the training stage, we resize the smaller dimension of each frame to a value in [256, 320] and take a random crop of size 224 × 224 from the same location for all frames of the same video. At inference, we report accuracy for 4 × 3 views (4 temporal clips and 3 spatial crops). The models are implemented in Python with PyTorch and were trained on a DGX-v1 server (Table 1).

Table 1. Training hyperparameters for the experiments.

Config | Value
Optimizer | AdamW [30]
Momentum | β1, β2 = 0.9, 0.999
Batch size | 64
Learning rate | 2.5e−4
Weight decay | 0.05
Learning rate schedule | Cosine with linear warmup
Linear warmup epochs | 3
Epochs | 30
Dropout | 0.1
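These hyperparameters map directly onto a standard PyTorch optimizer and scheduler setup; the sketch below is illustrative only and assumes a generic `model` together with the 30 training epochs and 3 warmup epochs listed in Table 1.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, epochs=30, warmup_epochs=3,
                                  base_lr=2.5e-4, weight_decay=0.05):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            betas=(0.9, 0.999), weight_decay=weight_decay)

    def lr_lambda(epoch):                      # linear warmup, then cosine decay
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return opt, LambdaLR(opt, lr_lambda)
```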


Table 2. Comparison with state-of-the-art methods on the Kinetics-400 dataset. "T×" denotes that T frames are used in our experiments.

Method | Top-1 | Top-5 | Views | TFLOPs
blVNet [31] | 73.5 | 91.2 | 3 × 3 | 0.84
STM [32] | 73.7 | 91.6 | 10 × 3 | 2.01
TEA [33] | 76.1 | 92.5 | 10 × 3 | 2.10
CorrNet-101 [35] | 79.2 | – | 10 × 3 | 6.70
ip-CSN-152 [23] | 79.3 | 93.8 | 10 × 3 | 3.27
LGD-R101 [36] | 79.4 | 94.4 | 10 × 3 | –
SlowFast [24] | 79.8 | 93.9 | 10 × 3 | 7.02
X3D-XXL [22] | 80.4 | 94.6 | 10 × 3 | 5.82
TimeSformer-L [8] | 80.7 | 94.7 | 1 × 3 | 7.14
ViViT-L/16 × 2 [7] | 80.6 | 94.7 | 4 × 3 | 17.35
Swin-B [9] | 80.6 | 94.6 | 4 × 3 | 3.38
Our model (8×) | 80.6 | 94.3 | 4 × 3 | 6.96
Our model (16×) | 81.7 | 95.2 | 4 × 3 | 13.12

4.2 Comparison with State-of-the-Art

In this subsection, we compare our HaViT model with state-of-the-art models on the three datasets described above. The results are shown in Tables 2, 3 and 4. Unless otherwise stated, we report results on all datasets using 4 × 3 views. Table 2 gives a comparison with the state of the art on Kinetics-400, including convolution-based networks and transformer-based networks. Compared with the transformer-based TimeSformer-L, our model improves classification accuracy by 1.1%. Compared with the most advanced convolutional network, X3D, our model improves classification accuracy by 1.3% while using fewer temporal views.

Table 3. Comparison with state-of-the-art on the Kinetics-600.

Method | Top-1 | Top-5 | TFLOPs
AttentionNAS [6] | 79.8 | 94.4 | 1.03
LGD-R101 [36] | 81.5 | 95.6 | –
SlowFast [24] | 81.8 | 95.1 | 7.02
X3D-XL [22] | 81.9 | 95.5 | 1.05
TimeSformer [8] | 82.4 | 95.3 | 5.11
ViViT-L/16 × 2 [7] | 82.5 | – | 17.35
Swin-B [9] | 84.0 | 96.5 | 3.38
Our model (16×) | 84.5 | 96.1 | 13.12


Table 3 shows the comparison between our model and the state-of-the-art on Kinetics-600. The classification accuracy is much higher than the previous convolution-based methods (+2.6%) and transformer-based methods (+0.5%). Compared with Kinetics-400, the Kinetics-600 dataset is about 0.6 times larger, which also indicates that the performance of the transformer model improves when the dataset is larger.

Table 4. Comparison with state-of-the-art on the SSv2.

Method | Top-1 | Top-5 | TFLOPs
TRN [37] | 48.8 | 77.6 | –
SlowFast [24] | 61.7 | – | 7.02
TSM [34] | 63.4 | 88.5 | 0.95
STM [32] | 64.2 | 89.7 | 2.01
TEA [33] | 65.1 | – | 2.10
blVNet [31] | 65.2 | 90.3 | 0.84
TimeSformer-L [8] | 62.5 | – | 7.14
ViViT-L/16 × 2 [7] | 65.4 | 89.8 | 17.35
Our model (16×) | 67.3 | 90.5 | 13.12

Table 4 compares our model with the state-of-the-art on the Something-Something v2 dataset. In terms of classification accuracy, our model is 2.1% higher than the previous convolutional network blVNet and 1.9% higher than the previous transformer model ViViT-L/16.

4.3 Ablation Studies

This subsection studies the impact of different components on the HaViT performance. For all experiments in this subsection, we use the lightweight model HaViT-S/16 with a model dimension of 384 and adopt the Kinetics-400 dataset.

Table 5. Effect of different hybrid schemes for hybrid-attention.

Hybrid scheme | Top-1 | Top-5
Simplified | 77.1 | 92.5
Multi-view | 77.5 | 92.7
Combined-correlation | 78.2 | 93.1

Hybrid Schemes for Hybrid-attention. First, we consider the effect of different hybrid schemes for hybrid-attention on the final performance. Three schemes are discussed: the simplified scheme, the multi-view scheme, and the combined-correlation scheme. The baseline is the simplified scheme, which directly adds the internal attention result and the external attention result. As shown in Table 5, compared to the simplified scheme, the multi-view scheme is more flexible and achieves better performance. In the combined-correlation scheme, each attention head is influenced by both the internal and the external attention weights, and each query patch's feature is decided by whichever attention influences it most; there is thus a trade-off between the internal and the external attention. It is therefore not surprising that the combined-correlation scheme achieves the best performance, and our model adopts it. We also give attention visualizations of the HaViT model: Fig. 3 shows that, compared with the existing divided space-time self-attention scheme, the proposed hybrid-attention scheme attends more precisely to the relevant objects in the video.
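For reference, the simplified scheme can be written as an additive fusion of the two attention outputs, in contrast to the single-softmax combined-correlation sketch given in Sect. 3.2; the snippet below is purely illustrative and assumes modules like those sketched earlier.

```python
# Simplified hybrid scheme: run internal and external attention separately and add the results.
# `internal_attn` and `external_attn` are assumed to be modules such as the Sect. 3.2 sketches.
def simplified_hybrid(z, internal_attn, external_attn):
    return internal_attn(z) + external_attn(z)
```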

Fig. 3. Attention visualization results of the two models. "Drinking" and "eating hot dog" are two example categories; for each, the three panels show the original video frames, divided space-time self-attention, and hybrid-attention, respectively.

Effect of Attention Realization. We now give experimental results for the different realizations of attention introduced earlier. The models using joint space-time self-attention and divided space-time self-attention are pre-trained on the image classification dataset ImageNet. From Table 6, the model using divided self-attention has more parameters than the joint space-time implementation. There are three modeling methods for joint space-time self-attention: the original attention method and two linear computational complexity methods (linear activation and cosine re-weighting) [38]. The accuracy of the linear activation method is lower than that of the original joint space-time method (−2.7%), but its inference speed is about 5 times faster. After applying the cosine re-weighting technique, performance improves (+2.2%), but the classification accuracy is still not as good as the original joint space-time attention. Cosine re-weighting attaches larger attention coefficients to the tokens around the query, so that query image patches pay more attention to the surrounding image patches, which explains the better accuracy of this linear-complexity method. Although these linear-complexity joint space-time methods are fast, they also suffer from instability during training. Compared with joint space-time self-attention, divided space-time self-attention improves accuracy by about 1% and computes about 3 times faster. Therefore, divided space-time attention is used for internal attention modeling.

Table 6. Effect of internal attention realization.

Internal attention | Top-1 (%) | Parameters (M)
Joint space-time | 77.3 | 26.4
Linear activation | 74.6 | 26.4
Cosine re-weighting | 76.8 | 26.4
Divided space-time | 78.1 | 34.5

4.4 With Different Vision Transformers

To verify that the proposed hybrid-attention scheme can be combined with different transformer models, we run experiments with different vision transformers. First, the different vision transformer models are extended to the 3-dimensional space to obtain the corresponding video transformer models. Then the different models are trained on the Kinetics-400 dataset and their classification accuracy is tested. Finally, the models obtained by combining the different video transformers with the hybrid-attention scheme are trained and tested. The classification accuracy results are shown in Table 7. For all three transformer models, the hybrid-attention scheme obtained by using the external token dictionary achieves better results. The proposed hybrid-attention scheme can thus be combined with other vision transformer models and improve their performance.


Table 7. Effect of hybrid schemes combined with different vision transformers.

Transformer model | External token dictionary | Top-1 (%)
ViT-S [5] | without | 77.6
ViT-S [5] | with | 78.2 ↑
CvT [39] | without | 76.6
CvT [39] | with | 77.4 ↑
Swin-T [9] | without | 78.4
Swin-T [9] | with | 78.9 ↑

5 Conclusion

In this paper, we propose a new hybrid-attention based vision transformer model for video classification, which explicitly exploits external patch dependencies across videos. Instead of self-attention, it uses hybrid-attention to model both long-range patch dependencies within a video and external patch dependencies across videos. Compared to existing vision transformers, our model achieves competitive or better performance on public datasets including Kinetics-400/600 and SSv2. Experiments also show that hybrid-attention can be integrated into existing transformer models and improve their performance.

Acknowledgements. This work was supported in part to Dr. Liansheng Zhuang by NSFC under contract No. U20B2070 and No. 61976199.

References 1. Dong, Y., Li, J.: Video retrieval based on deep convolutional neural network. In: Proceedings of the 3rd International Conference on Multimedia Systems and Signal Processing, pp. 12–16 (2018) 2. Muhammad, K., Khan, S., Elhoseny, M., Ahmed, S.H., Baik, S.W.: Efficient fire detection for uncertain surveillance environment. IEEE Trans. Industr. Inf. 15, 3113–3122 (2019) 3. Vaswani, A., et al.: Attention is all you need (2017) 4. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018) 5. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale (2020) 6. Wang, X., et al.: AttentionNAS: spatiotemporal attention cell search for video classification. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 449–465. Springer, Cham (2020). https://doi.org/10. 1007/978-3-030-58598-3_27 7. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846 (2021)


8. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, p. 4 (2021) 9. Liu, Z., et al.: Video swin transformer. arXiv preprint arXiv:2106.13230 (2021) 10. Zhang, Y., et al.: VidTr: video transformer without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13577–13587 (2021) 11. Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172 (2021) 12. Chen, F., Zhang, L., Yu, H.: External patch prior guided internal clustering for image denoising. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2015) 13. Ma, Z., Liao, R., Tao, X., Xu, L., Jia, J., Wu, E.: Handling motion blur in multiframe super-resolution. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5224–5232 (2015) 14. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017) 15. Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about kinetics-600 (2018) 16. Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850 (2017) 17. Laptev, I.: On space-time interest points. Int. J. Comput. Vision 64, 107–123 (2005) 18. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vision 103, 60–79 (2013) 19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 20. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015) 21. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 22. Feichtenhofer, C.: X3D: expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 23. Tran, D., Wang, H., Torresani, L., Feiszl, M.: Video classification with channelseparated convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 24. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 25. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE Conference on Computer Vision (ICCV) (2017) 26. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)


27. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19 28. Hao, Y., Zhang, H., Ngo, C.W., Liu, Q., Hu, X.: Compact bilinear augmented query structured attention for sport highlights classification. In: Proceedings of the 28th ACM International Conference on Multimedia (2020) 29. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021) 30. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018) 31. Fan, Q., Chen, C.F.R., Kuehne, H., Pistoia, M., Cox, D.: More is less: learning efficient video representations by big-little network and depthwise temporal aggregation. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, Curran Associates, Inc. (2019) 32. Jiang, B., Wang, M., Gan, W., Wu, W., Yan, J.: STM: spatiotemporal and motion encoding for action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 33. Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L.: Tea: temporal excitation and aggregation for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918 (2020) 34. Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 35. Wang, H., Tran, D., Torresani, L., Feiszli, M.: Video modeling with correlation networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 36. Qiu, Z., Yao, T., Ngo, C.W., Tian, X., Mei, T.: Learning spatio-temporal representation with local and global diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 37. Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 831–846. Springer, Cham (2018). https://doi.org/10.1007/ 978-3-030-01246-5_49 38. Qin, Z., et al.: cosformer: Rethinking softmax in attention. In: International Conference on Learning Representations (2021) 39. Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: CVT: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31 (2021)

Vision and Language

From Sparse to Dense: Semantic Graph Evolutionary Hashing for Unsupervised Cross-Modal Retrieval Yang Zhao1 , Jiaguo Yu1 , Shengbin Liao2 , Zheng Zhang3 , and Haofeng Zhang1(B) 1

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China {zhao yang,yujiaguo,zhanghf}@njust.edu.cn 2 National Engineering Research Center for E-learning, Huazhong Normal University, Wuhan 430079, China 3 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China Abstract. In recent years, cross-modal hashing has attracted an increasing attention due to its fast retrieval speed and low storage requirements. However, labeled datasets are limited in real application, and existing unsupervised cross-modal hashing algorithms usually employ heuristic geometric prior as semantics, which introduces serious deviations as the similarity score from original features cannot reasonably represent the relationships among instances. In this paper, we study the unsupervised deep cross-modal hash retrieval method and propose a novel Semantic Graph Evolutionary Hashing (SGEH) to solve the above problem. The key novelty of SGEH is its evolutionary affinity graph construction method. To be concrete, we explore the sparse similarity graph with clustering results, which evolve from fusing the affinity information from code-driven graph on intrinsic data and subsequently extends to dense hybrid semantic graph which restricts the process of hash code learning to learn more discriminative results. Moreover, the batch-inputs are chosen from edge set rather than vertexes for better exploring the original spatial information in the sparse graph. Experiments on four benchmark datasets demonstrate the superiority of our framework over the state-of-the-art unsupervised cross-modal retrieval methods. Code is available at: https://github.com/theusernamealreadyexists/SGEH. Keywords: Cross-modal hashing · Visual-text retrieval affinity graph · Semantic graph evolution

1

· Sparse

Introduction

Cross-modal retrieval aims to search the related results of other different modalities from a query term of one modal, e.g., using a caption to retrieve the related Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3 31. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 521–536, 2023. https://doi.org/10.1007/978-3-031-26316-3_31

522

Y. Zhao et al.

pictures in database. With the explosive growth of multimedia data, hashing technology, which encodes continuous features into common hash space where relative samples have similar binary codes, is widely used in cross-modal retrieval technology due to its few storage, low Hamming distance computational complexity and fast retrieval speed [6–8,10,14,17,26,29,30]. Recently, [24] proposes an unsupervised deep cross-modal hashing method that learns hash codes via Laplacian constraint in objective function to preserve the neighborhood information from code-driven dense semantic graph. However, a significant shortcoming is that it only preserves the original relationship from different modalities, while integrating the affinity information from instances into a unified structure in advance can improve affinity relation construction, which brings a better performance. Then [31] adopts Generative Adversarial Network (GAN) to train cross-modal hash training. With the intention to preserve the correlation from inter-modal and intra-modal in the latent hash space, this method maintains a manifold structure across the attributes of different modalities. In their algorithm, hash code plays a significant role for both the generator and the discriminator. Based on CycleGAN, [25,33] proposed a new method to learn hash codes via unpaired instances. [20] proposes Deep Joint-Semantics Reconstructing Hashing (DJSRH) which fuses the semantic similarities into a unified matrix to explore the latent relevance for the input multi-modal instances. Though impressive progress the above methods have made, several challenges still exist in this task. Therefore, our study is motivated according to the following issues. (1) The difficulty of excavating non-label relationships from intramodal and inter-modal. Unsupervised hashing cross-modal methods usually have no access to accurate relationship among instances. Based on cooccurrence information, recent unsupervised techniques [10,11,20,22,27,31,32] usually adopt Attention Mechanism or GAN to generate affinity graph structure which aims to aggregate neighborhood information. (2) The semantic gap of different modality. Given that each modality has its own geometric prior, each modal have its own affinity code-driven graph. the different between multi-graph may confuse the training process. Thus, coming up with a semantic-unified graph is necessary for conducting the task. (3) The impossibility to utilize the adjacency matrix as the feature space of large graph. Graph embedding methods can be utilized to fix out the huge storage consumption of large adjacency matrix. However, graph embedding loses a lot of original information during dimension reduction process, which may lead to a sub-optimal performance and fail to preserve the similarity information. In this paper, to tackle the aforementioned issues, we propose a novel unsupervised method called Semantic Graph Evolutionary Hashing (SGEH). We define a graph evolutionary module which extends the sparse affinity graph to a dense semantic graph, the core idea of which is illustrated in Fig. 1. Specifically, we first take the code-driven similarity graph of both visual modality and textual modality built upon geometric prior and fuse them in a automatically updated weight. Then, after keeping updating the fused graph, we generate a sparse semantic graph which can be shared by both image modal and text modal

From Sparse to Dense

523

Fig. 1. The process of semantic graph evolution. Sparse graph evolves with the similarity information on geometrical priority. O means object which composes of visual information and caption, and Oi means i-th object in train set.

and relief the problem of lacking label. Our fuse method is inspired by [21] and preserves similarity information hidden in original data from both two modalities. Subsequently, to maximize the use of the spatial information in the sparse graph, we randomly select samples from the connected node pairs in the sparse graph to construct the input of the convolutional neural network. In addition, on one hand, it is impossible to employ the huge adjacency matrix as the feature space for the input of deep network; on the other hand, focusing on neighboring sample pairs can avoid the interference of non-neighboring sample pairs. Hence, we evolve the spares affinity graph to a dense manner by taking the local geometric characters. To map data from different modalties into one common latent hash space, we try to reduce the distance between binary hash codes of the same instance from two modalities, which is reflected in our objective function. In summary, this method has the following main contributions: – We propose a novel SGEH method by evolving the joint sparse graph obtained by cross-modal clustering into a dense semantic graph that can be used for mini-batch deep learning, which enables hash learning to obtain rich semantic associations among examples. – We propose a graph evolutionary mechanism to learn a hybrid semantic graph structure from sparse to dense. Instead of directly using dense graphs constructed according to the geometric characteristics of the samples, the graph evolutionary mechanism first learn a sparse affinity graph, and subsequently extend it to a dense form by fusing the dense relation built upon the geometric graph. – Comprehensive experiments are conducted on four popular datasets, and the results show the priority of the proposed model.

2 2.1

Methodology Preliminaries

The overall pipeline of SGEH is shown in Fig. 2. We first introduce several definitions in our methods. Let F I ∈ Rm×dI and F T ∈ Rm×dV denote m training

524

Y. Zhao et al.

Fig. 2. The framework of our proposed Semantic Graph Evolutionary Hashing (SGEH).

visual and textual instance features respectively. m means the amount of instance in the whole training set. With n equals to the amount of instances in each batch, the visual and textual features in each batch are denoted as X I ∈ Rn×dI and X T ∈ Rn×dV respectively. Here dI and dT represent the dimensions of image and caption features respectively. Furthermore, we aim to generate binary hash codes B I and B T by embedding continuous features into common latent hash space, where B H ∈ Rn×b , (H ∈ {I, T }) and b is hash code length. Utilizing hash code that preserve the neighborhood information can greatly improve the performance of retrieval task. Specifically, previous methods can be grouped into two categories in terms of how to guide hashing learning based on original features. The first category methods preserve the information of original features and use them to learning hash codes directly, they share the following common loss function:  2  2 LH = X I W I − B I F + X T W T − B T F , (1) s.t.B g ∈ {+1, −1}n×l , (B g )T B g = nI, where W g is learning parameter matrix and g ∈ {I, T }. Equation 1 aims to reduce the gap between features and hash codes. The second constraint (B g )T B g = nI aims to generate mutually independent hash codes. Evolved from the first category, the second category typically generates similarity structure from the features in two modalities, and further combines with the method of graph optimization or graph fusion methods to obtain jointsemantic affinity matrices. Both the design of construing matrices and the strategy of employing the matrices in training stage have an impact on the final

From Sparse to Dense

525

performance. To be specific, the common formulation is as following: 2

LG = S − QF , s.t.S = g1 (X I , X T ) ∈ {+1, −1}n×n , Q = g2 (B I , B T ) ∈ {+1, −1}n×n ,

(2)

where m is the batch size, S preserves the affinity information of samples in each batch and is used for hash learning. Q represent the affinity matrix generated by hash codes. g1 and g2 means two similarity calculating functions. However, Graph structure is a typical non-euclidean structure data. Generally, building adjacency matrix based on euclidean distance not only causes similarity information loss but also calculates unrealistic similarity which confuses the training process. For instance, image feature XiI and XjI are unrelated but may got a few similarity score in a mini-batch, which misleads the process of hash code training. As demonstrated in [22], the sparse graph distilling knowledge from geometric and semantic information of the whole training set can save the most useful similarity information hidden in the data samples to avoid useless or interfering information. Then the unified semantic graph is kept updated based on these features from two modalities, which is defined as Z = f (F I , F T ), where f (·) is neighborhood information fusion function, the solution of which will be demonstrated in Subsect. 2.2, from Eq. 18 to Eq. 23. 2.2

Unified Sparse Affinity Graph

Clustering results provided by pre-computed global graph can solve the problem of lacking label information to a certain degree. Therefore, in this stage, we need to generate two similarity-induced graphs from both visual and textual modalities which are used to learn a fusion semantic graph Z ∈ [0, 1]m×m . And Z should keep samples with smaller distance responding to a larger similarity score, and ones with larger distance responding to a smaller similarity score. Finally, Z need to produce clustering results for better hash learning. What‘s more, it is desirable to update the similarity-induced graphs and fusion semantic graph at the same time for strengthening the accuracy of clustering result. As demonstrated in [21,22], sparse structure has strong anti-noise ability and friendly to  2 − FiH − FjH 2 H H ) storage. Thus, we first use Gaussian Kernel S(Fi , Fj ) = exp( 2σ 2 to define the weight of two instances from same modal, where σ is the width parameter of the function and controls the radial range of the function, and lately keep k nearest neighborhoods of each vertex to keep graph sparse. Thus, we get S I ∈ [0, 1]m×m and S T ∈ [0, 1]m×m , where S H is the similarity-induced graph and H ∈ {I, T }. SGEH mimics GMC [21] loss function and translated it into double-modal expression. min



{S I ,S T }

m   H  H Fi − FjH 2 Sij + λ1 2

H∈{I,T } i,j=1

H s.t.Sii

=

H 0, Sij

≥ 0, 1

T

SiH

= 1,

 H∈{I,T }

m   H 2 Si  , 2 i

(3)

526

Y. Zhao et al.

and the optimal solution of S I and S T can be written in closed-form as Eq. (4) with Lagrange Multiplier Method, which is proved in [21]. ⎧ bi,p+1 − bij ⎨ p j ≤ p, H pb Sij = (4) i,p+1 , − h=1 bih ⎩ 0 j > p, where p is a hyper parameter which adjusts the number of neighbors kept in 2 graph and FiH − FjH 2 is simplified as bij . Then we need to learn a sparse fusion semantic graph Z to represent the similarity connection in both visual and textual modalities. We design the loss function describing the distance between Z and S H as:   2 min wH Z − S H F , s.t.Zij ≥ 0, 1T Zi = 1, (5) Z

H∈{I,T }

where wH is the weight of similarity-induced graph. As deliberated in Sect. 1, it is desirable to weight both visual modal and textual modal automatically. The last problem we need to figure out is how to produce clustering result directly on Z. This can be tackled be adding an rank constraint on the graph Laplacian matrix of Z as  (6) min T r D T LD , s.t.D T D = I, D

where L is the Laplacian matrix of Z and D ∈ Rm×c is the embedding matrix composed by cluster center vectors. It comes from a theorem that if a matrix is non-negative, then the dimension of the nullspace of Laplacian matrix of the graph of this matrix is the number of connected components of the graph. Thus, let c denotes the number of connected components of Z, if rank(L)=m − c, the vertexes in Z can be divided into C clusters. However, rank(L)=m−c is difficult to achieve. Given that L is positive semi-definite, the constraint rank(L)=m − c can be achieved if the summation of top-c eigenvalue  of L equals to zero. Ky c Fan’s Theorem [4] told us i=1 vi = min T r D T LD , s.t.D T D = I, where vi is the ith smallest eigenvalue of L. Then based on the above fact, this problem can be further tackled by restricting T r D T LD = 0. Mathematically, we got the loss function of producing sparse fusion semantic graph U as: 

LA =

m   H  H Fi − FjH 2 Sij + λ1 2

H∈{I,T } i,j=1



+

H∈{I,T } H s.t.Sii

 H∈{I,T }

m   H 2 Si  2 i

 2  w Z − S H F + 2λ2 T r D T LD , H

(7)

H = 1, Sij ≥ 0, 1T SiH = 1, Zij ≥ 0, 1T Zi = 1, D T D = I,

where H ∈ {I, T }, Si ∈ Rn×1 , Ui ∈ Rn×1 , wH is the weight of similarity-induced graph, L is the Laplacian matrix of Z and D ∈ Rm×c is the clustering center matrix. The optimization of above problem will be solved in Sect. 2.5.

From Sparse to Dense

2.3

527

Semantic Graph Evolution

Although the matrix we obtained at this time can more accurately reflect the relationship between instances, we need to use mini-batch method in the neural network model, which means that in each batch, similarity matrix composed of randomly selected samples tends to be sparse for each instance only have p neighbors. In the hash learning stage, the semantic similarity matrix is utilized to constrain the hash similarity matrix generated by the hash code. However, sparse matrices are incompetent to guide the hash learning process. For example, two pairs of data whose similarity is 0 in the sparse semantic matrix are not also 0 in the hash similarity matrix. At the same time, the sparse matrix has the problem of information loss. Thus, in order to solve the problem of sparse, we need to evolve the similarity matrix into a denser form. To be specific, suppose there are t edges in εZ , which is the edge set of Z, and n (t + 1 < n ≤ 2t) related vertexes for each batch. Then we use V and E to represent its vertex set and edge set respectively. And subsequently we constructs a local sparse graph A = (V, E) on batch-inputs. The features of vertexes in V can be denoted as X I and X T , which are defined above respectively. We define the similarity matrices in mini batch from visual and textual modalities as:   T  H (8) Sij = XiH (XjH ) /(XiH  XjH ), H where Sij ∈ [−1, +1], H ∈ {I, T }. XiH means the i-th row in X H and XjH stands for the j-th row in X H . We employ S I and S T to integrate the original similarity information in image and text modal. Then unified sparse affinity graph A is evolved from spare to dense by fusing the information from S I and S I . Then we get hybrid semantic affinity matrix U = G(A, S I , S T ) ∈ [−1, +1]n×n to describe the affinity structure in both two modalities, with Uij describing the captured fusion semantic affinity information between the input samples ei and ej . The hybrid semantic affinity matrix is calculated as:

U = G1 (A, S I , S T ) = (1 − λ3 )[λ4 SI + (1 − λ4 )ST ] + λ3 A,

(9)

where λ3 adjust the importance of clustering result and λ4 balances the weights between affinity structure information of visual modality and textual modality. The manner of constructing U combines the similarity information across both clustering result and original affinity structure in two modalities. Given that samples selected in batch are highly related, U refines the affinity more accurate than randomly training samples in mini-batch, which makes it capturing more effective latent common similarity relationship over multi-modal perspective. In another word, U reflects the original affinity connection among input samples, after which we can subsequently learn binary hash code that are employed to achieve cross-modal retrieval task. Alternatively, following the form of combination in [20], we can make A evolved in the form as: T

S = λ4 SI + (1 − λ4 )ST , U = G2 (A, S I , S T ) = λ3 A + (1 − λ3 )(

SS ). n

(10)

528

2.4

Y. Zhao et al.

Hash-Code Learning

In this subsection, we utilize the accurate semantic matrix U , which represents the original affinity relations of the input instances, to restrict the generation stage of hash code. In latent hash hypercube, adjacent vertices share small Hamming distance and more similar hash codes. Thus, hash codes can be understood as discrete features. To calculate the similarity with neighborhoods in Hamming space, the similarity function can be defined as:   T  Z(BiH , BjH ) = BiH (BjH ) /(BiH  BjH ),

(11)

where H ∈ {I, T }, BiH means the i-th row in B H and BjH means the j-th row in B H . The result of Eq.(11) is the cosine affinity score which representing the angular connection among discrete features. We adopt manner of Eq.(2) that minimize the reconstruction error between the similarity matrix of hash code and the affinity matrix U of continuous features to keep their similarity consistency. Therefore, we define pairwise cosine similarity matrices as Q and = Z(BiH , BjH ). Then, we employ QHH ij 2  LB HH = min αU − Z(BiH , BjH )F , BH

(12)

as the formulation to compute the difference between U and QHH ij . In Eq. (12), α is a hyper-parameter which makes reconstruction more flexible, as discussed in [20]. Given that U ∈ [−1, +1]n×n , αU ∈ [−α, +α]n×n . For example, supposed that Uij = 0.8, which means that ith instance and jth instance got 0.8 similarity score, then the similarity score of corresponding hash codes calculated from Hamming space need to be close to 0.8. α > 1 means the similarity score of hash codes pair need to lager than 0.8 and accordingly make the nodes in Hamming space dense, while α < 1 means the similarity score of hash codes pair need to smaller than 0.8 and accordingly make the nodes in Hamming space sparse. We empirically find that it is beneficial to threshold α > 1. And this phenomenon can be attributed to the fact that cosine similarity measures the similarity between two vectors by measuring the cosine of the angle between them. The result is not related to the length of the vector, but only related to the direction of the vector. Setting α > 1 means we force the binary hash code close to the latent clustering center in Hamming space in direction than it should be, which bring a better performance as we are trying to Increase the distance between categories and reduce the distance within category. Given that each instance is still represented by hash codes from two modalities, we need to restrict the reconstruction in the manner of intro-modal and inter-modal. Specifically, we employ QII and QT T as the intro-modal reconstruction for image modal and text modal respectively, and QIT is engaged as inter-modal reconstruction. Finally, the consistency loss between binary hash

From Sparse to Dense

529

code and continues original features can be summarized as: 2  LB =LB II + η1 LB T T + η2 LB IT = min αU − Z(BiI , BjI )F BI (13) 2 2   + η1 min αU − Z(BiT , BjT )F + η2 min αU − Z(BiI , BjI )F , BT

B I ,B T

where η1 and η2 are the trade-off parameters to balance the reconstruction of inter-modal and intro-modal in latent hash space. To avoid wasting bits and align representation distributions, it is worth generating mutually independent hash codes as the constraint BgT Bg = mI in Eq. 1. However, the above loss cannot tackle this problem. Following [18], we regularize the latent discrete features with axuiliary discriminator d(·, ζ). To be concrete, we assume the each row in B comes from a distribution D = 2Y − 1, Y ∼ B(1, 0.5), which maximizes the code entropy. We suppose that each binary code is proiored by a binomial distribution which is D. Then, to adversarially regularize the latent variables, we utilize auxiliary discriminator d which involves two fully-connected layers successively with ReLu and sigmoid non-linearities. In a word, it is to balance the amount of zeros and ones in each binary code and further maximize the code entropy. In this end, we can employ the following discriminator d to balance −1 and +1 in a binary hash code: d(Bi , ζ) ∈ (−1, +1); d(y B i , ζ) ∈ (−1, +1),

(14)

where y B i obeys the same distribution as Bi for implicit regularizing Bi . Therefore, our final loss can be written as: L = LA + LB −

b η4   (log(1 − d(Bi , ζ)) + log d(y B i , ζ))). b i=1

(15)

H∈I,T

2.5

Optimization

In this method, the global sparse graph is employed to solve the problem of missing label information and we first need to optimize Eq. 7. Sparse Affinity Graph. In this stage, we basically refer to the alternating rules used in [21] to optimize Eq. 7. As there are four variables in total and are coupled with each other, the problem is split into four step. It is beneficial to get detailed information from the above method. Here, we directly give the close-form solution of the variables Z, S, D, wH : step1 : Fix Z, D and wH , update S H . When Z, D and wH fixed, the last item of Eq. 7 is constant and accordingly original problem is translated into following pattern: 

min SH

m   H  H Fi − FjH 2 Sij + λ1 2

H∈{I,T } i,j=1

+



H∈{I,T }

 2 w Z − S H F , H

 H∈{I,T }

H s.t.Sii

=

m   H 2 Si  2 i

H 1, Sij

(16) ≥ 0, 1T SiH = 1,

530

Y. Zhao et al.

Updating both two view is independent, we update each S H in the following way: min SH

m m    H  H  H 2   Fi − FjH 2 Sij Si  + wH Z − S H 2 , + λ 1 2 F 2 i,j=1

H s.t.Sii

=

i H 1, Sij

≥ 0, 1

T

SiH

(17)

= 1,

For simplicity, we suppose that a feature of sample is similar to its neighbours and accordingly can update the representation using its p neighbor data points, where p is the number of neighbors. We employ the solution from [21] and give the final solution as follows: ⎧ bi,p+1 − bij + 2wH Zij − 2wH Zi,p+1 ⎨ p p j ≤ p, H∗ Sij = pbi,p+1 , − h=1 bih − 2pwH Zi,p+1 + 2 h=1 wH Zih (18) ⎩ 0 j > p, step2 : Fix Z, D and S H , update wH . In this step, fixing problem 7 is the same way to solve the problem (5). Theorem. If the weights wH are fixed, solving problem 5 is equivalent to solving the following probelm: 

2 Z − S H F , s.t.Zij ≥ 0, 1T Zi = 1, min (19) Z

H∈{I,T }

Proof. The Lagrange function of Eq (19) is: 

2 Z − S H F + Θ (Λ, Z) , s.t.Zij ≥ 0, 1T Zi = 1,

(20)

H∈{I,T }

where Λ is the Lagrange multiplier, and Θ (Λ, Z) is the formalized term derived from constraints. Taking the derivative of Eq. (20) with respect to Z and setting the derivative to zero, we get the following equation: ∗

wH =

2

1 2

Z − S H F

.

(21)

step3 : Fix all the other variables except Z, and it can be proved that solving Eq. 7 is equivalent to solving the following problem:  2    λ1 H T  Zi − Si + min di  (22)   , s.t.Zij ≥ 0, 1 Zi = 1, H Zi 4w 2 H∈{H,I}

where Zi means the i-th row in Z and dij means the similarity score between SiH and Zi . The problem in Eq. 22 can be solved with Lagrange Multiplier Method

From Sparse to Dense

531

as proved in [21] with several steps:  H 1 1T q H 1 λ1 H∈{I,T } q + − , q = − d , p = i 4wH 2 m 2m m 1  f (t) = (t − pj )+ − t, Zij ∗ = (pj − t∗ )+ , m j=1 H

SiH

(23)

where t∗ makes f (t∗ ) = 0 and (·)+ = max(·, 0). In summary, the produce for solving the proposed problem in Eq. 7 can be found in the supplementary material. step4 :Fix all the other variables except D, optimizing problem (7) is equivalent to problem (6), which is formed by the c eigenvectors of L corresponding to the c smallest eigenvalues. Deep Hash Learning. In traditional hash methods [5,18], the process of mapping continuous features to discrete space causes huge quantization loss stem from the fact that the sign function, which outputs +1 for positive number and -1 for negative number, can not be derived. To handle this problem, we follow [20] to adopt a scaled tanh function: B = tanh(βY ) ∈ [−1, +1]m×d , β ∈ R+ ,

(24)

where Y represent that final output of Convolutional Neural Network. It is noticed that β is kept increasing during deep training stage. To be noted that it is motivated by a crucial fact that limα→∞ tanh(βy) = sgn(y).

3 3.1

Experiments Datasets

Four datasets, including Wiki [15], NUS-WIDE [2], MIRFlickr-25K [9] and MSCOCO [12], are employed to evaluate the proposed methods, more details about the four datasets can be found in the supplementary material. 3.2

Implementation Details

For all of our experiments, we follow previous methods to employ the VGG-16 fc7 to extract the 4,096-dimensional deep features X I ∈ Rn×4096 from original images, while for original textual features we utilize the universal sentence encoder [1] to represent final textual features X T whose dimension is 512. Besides, considering the computational burden in the solution process of Z, we randomly pick up 20,000 instances from NUS-WIDE and MSCOCO dataset. It is worth noting that to calculate the consistency loss as the manner of Eq. 2, we need to force the items in the ranges. However, the cosine similarity ranges -1 from +1 while the affinity value elements in Z are non-negative, which can be obtained by Eq. 23. Therefore, as A is the batch-input of Z, we preprocess

532

Y. Zhao et al.

Table 1. The mAP@all results on image query text (I → T ) and text query image (T → I) retrieval at various encoding lengths and datasets. The best performances are shown in bold. Task

Method WIKI

MIRFlicker-25K

MSCOCO

NUS-WIDE

16bit 32bit 64bit 16bit 32bit 64bit 16bit 32bit 64bit 16bit 32bit 64bit I → T CMFH

0.173 0.169 0.184 0.580 0.572 0.554 0.442 0.423 0.492 0.381 0.429 0.416

PDH

0.196 0.168 0.184 0.544 0.544 0.545 0.442 0.423 0.492 0.368 0.368 0.368

IMH

0.151 0.145 0.133 0.557 0.565 0.559 0.416 0.435 0.442 0.349 0.356 0.370

QCH

0.159 0.143 0.131 0.579 0.565 0.554 0.496 0.470 0.441 0.401 0.382 0.370

DJSRH 0.274 0.304 0.350 0.649 0.662 0.669 0.561 0.585 0.585 0.496 0.529 0.528 DGCPN 0.226 0.326 0.410 0.651 0.670 0.702 0.469 0.586 0.630 0.517 0.553 0.567 DSAH

0.249 0.333 0.381 0.654 0.693 0.700 0.518 0.595 0.632 0.539 0.566 0.576

JDSH

0.253 0.289 0.325 0.665 0.681 0.697 0.571 0.613 0.624 0.545 0.553 0.572

SGEH

0.396 0.422 0.441 0.665 0.695 0.703 0.578 0.617 0.634 0.565 0.584 0.579

T → I CMFH

0.176 0.170 0.179 0.583 0.566 0.556 0.453 0.435 0.499 0.394 0.451 0.447

PDH

0.344 0.293 0.251 0.544 0.544 0.546 0.437 0.440 0.440 0.366 0.366 0.367

IMH

0.236 0.237 0.218 0.560 0.569 0.563 0.560 0.561 0.520 0.350 0.356 0.371

QCH

0.341 0.289 0.246 0.585 0.567 0.556 0.505 0.478 0.445 0.405 0.385 0.372

DJSRH 0.246 0.287 0.333 0.658 0.660 0.665 0.563 0.577 0.572 0.499 0.530 0.536 DGCPN 0.186 0.297 0.522 0.648 0.676 0.703 0.474 0.594 0.634 0.509 0.556 0.574 DSAH

0.249 0.315 0.393 0.678 0.700 0.708 0.533 0.590 0.630 0.546 0.572 0.578

JDSH

0.256 0.303 0.320 0.660 0.692 0.710 0.565 0.619 0.632 0.545 0.566 0.576

SGEH

0.452 0.510 0.530 0.687 0.706 0.711 0.578 0.626 0.635 0.570 0.588 0.595

the A with A ← 2A − 1. Additionally, we fix the batch size as 8 and employ the SGD optimizer with 0.9 momentum and 0.0005 weight decay. We experimentally take α = 1.5 and λ3 = 0.4 for all four datasets. Then we set c = 5, p = 10000, λ4 = 0.6, η1 = η2 = 0.1 for NUM-WIDE, c = 5, p = 3000, λ4 = 0.9, η1 = η2 = 0.1 for MIRFlicker, c = 5, p = 1000, λ4 = 0.3, η1 = η2 = 0.3 for Wiki and c = 5, p = 3000, λ4 = 0.6, η1 = η2 = 0.1 for MSCOCO. 3.3

Retrieval Performance

Baselines. Previous methods can be categorized into two kinds according to whether takes the whole retrieved points into consideration or not. Hence, in order to prove that our method has superior performance under different evaluation indicators, we conduct experiments on two aspects. Specifically, on the one hand, we compare the mAP results with IMH [19], CMFH [3], PDH [16], QCH [23], DJSRH [20], DSAH [28], JDSH [13], DGCPN [30] conducted on Wiki, MIRFlicker, MSCOCO and NUS-WIDE datasets, with the whole retrieved points occupied (i.e., mAP@all), and the results are shown in Tab. 1. All the compared method are conducted according to their released codes or description in their original papers. The retrieval performance on mAP@50 can be found in the supplementary material. Quantitative Results. It can be observed that the proposed SGEH outperforms all of other unsupervised cross-modal hashing methods in Table 1 on all

From Sparse to Dense

533

Fig. 3. Results of Precision VS Recall Curves of various unsupervised hashing methods on datasets WIKI, MIRFLickr-25K, MSCOCO and NUS-WIDE with 32-bit codes.

four datasets regardless of the cross-modal retrieval tasks and code lengths, which demonstrates the effectiveness of the proposed methods. Specifically, our image query for text retrieval performance on Wiki dataset improves a lot compared with other deep methods on three kinds of hash codes, especially on 16 bits and 32 bits, while the text query for image retrieval performance outperforms them more than 10.8%, 19.5%, 0.8% on 16 bits, 32 bits, and 64 bits respectively. While improvements on NUS-WIDE and MSCOCO are related lower, which is stemmed from that we only using 20,000 samples as training set. The corresponding Precision-Recall (P-R) curves of represented methods are also retorted in Fig. 3, which can further prove the effectiveness of our method. In particular, our curves for Wiki are all located above those of the other methods, which means that the precision of our approach can significantly surpass that of the other works at the same recall rates. As for the multi-label datasets, Our P-R curves on MSCOCO and NUS-WIDE are also higher than the other, but not as obviously as the curves on Wiki. On the MIRFlickr-25K, we can obtain that the results are slightly worse than DSAH for 32 bits when the recall rate is higher than 0.14 when image queries text and 0.12 when text queries image. However, taking text query image for instance, it can be seen that our curve get (recall = 0.05, precision = 0.81), which means that our method can obtain images with 81% accuracy among the 0.05 × 20, 000 = 1000 return images. 3.4

Ablation Study

To further demonstrate the effectiveness of each part in SGEH, we design several variants to evaluate the performance when adding the proposed each components. Following the introduction order in Sect. 2, SGEH-1 and SGEH-2 are the basic variants which respectively only employ A as similarity matrix and

534

Y. Zhao et al.

Table 2. The mAP@all on MIRFlickr-25K to evaluate the value of each component. Model

Configuration

32bits 64bits I→T T →I I→T T →I

SGEH-1 U = A

0.658

0.679

0.650

0.678

SGEH-2 U = λ4 S I + (1 − λ4 )S T 0.684

0.695

0.680

0.701

SGEH-3 U = G1 (A, S , S )

0.685

0.694

0.692

0.699

SGEH-4 U = G2 (A, S I , S T )

0.688

0.692

0.693

0.702

η4 = 0.001

0.695

0.706

0.703

0.711

I

SGEH

T

only employ S I with S T as similarity matrix. SGEH is the variant that merges the affinity metrics in the manner of G(A, S I , S T ) = (1 − λ3 )[λ4 SI + (1 − λ4 )ST ] + λ3 A. SGEH is the variant based on SGEH-4 which further supplements the loss of discriminator. The results are shown in Table 2, and from which we can discover that each component of our proposed method has its own effect. Table 2 suggests that removing any component of our final framework leads to performance degradation. Specially, compared with the results of SGEH-1 and SGEH-2, the better performance of SGEH-3 and SGEH-4 shows illustrate the effectiveness of the proposed fusion strategy Eq. (9). The combination of clustering information from total dataset and neighborhood information in each mini-batch can much more accurately define the similarity relationship, impelling to learn more consistent hash codes and accordingly achieving better performance. What’s more, SGEH demonstrate the important role of hashcode regularization. It facilitates the proposed method for the end-to-end batch-wise training which better refine the similarity relationship by combining the clustering results and mini-batch neighborhood information than previous mini-batch pattern.

4

Conclusion

This paper proposed Semantic Graph Evolutionary Hashing (SGEH) for unsupervised cross-modal retrieval. SGEH first employs sparse affinity graph to update the unified sparse affinity graph, which is shared by both visual modal and textual modal. And subsequently the sparse graph is evolved from sparse to dense by fusing code-driven similarity information. Consequently, the sparse graph extends to the Hybrid Semantic Graph which is utilized to restrict hash code learning. The key novelty of this method is the graph evolution scheme. Then hash code can be learned via construction consistence loss with a more effective feature space. Extensive experiments demonstrate the superiority of our proposed method and detailed ablation study shows the effect of each module utilized in our method. Acknowledgements. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61872187, No. 62072246 and No.

From Sparse to Dense

535

62077023, in part by the Natural Science Foundation of Jiangsu Province under Grant No. BK20201306, and in part by the “111” Program under Grant No. B13022.

References 1. Cer, D., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018) 2. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: Nus-wide: a real-world web image database from National University of Singapore. In: Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 1–9 (2009) 3. Ding, G., Guo, Y., Zhou, J.: Collective matrix factorization hashing for multimodal data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2075–2082 (2014) 4. Fan, K.: On a theorem of weyl concerning eigenvalues of linear transformations i. Proc. Natl. Acad. Sci. United States America 35(11), 652 (1949) 5. Gong, Y., Lazebnik, S., Gordo, A., Perronnin, F.: Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2916–2929 (2012) 6. He, L., Xu, X., Lu, H., Yang, Y., Shen, F., Shen, H.T.: Unsupervised cross-modal retrieval through adversarial learning. In: ICME, pp. 1153–1158 (2017) 7. Hu, H., Xie, L., Hong, R., Tian, Q.: Creating something from nothing: Unsupervised knowledge distillation for cross-modal hashing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2020 8. Hu, M., Yang, Y., Shen, F., Nie, N., Hong, R., Shen, H.: Collective reconstructive embeddings for cross-modal hashing. IEEE IEEE Trans. Image Process. 28(6), 2770–2784 (2019) 9. Huiskes, M.J., Lew, M.S.: The mir flickr retrieval evaluation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 39–43 (2008) 10. Li, C., Deng, C., Li, N., Liu, W., Gao, X., Tao, D.: Self-supervised adversarial hashing networks for cross-modal retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4242–4251, June 2018 11. Li, C., Deng, C., Wang, L., Xie, D., Liu, X.: Coupled cyclegan: unsupervised hashing network for cross-modal retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 176–183 (2019) 12. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 13. Liu, S., Qian, S., Guan, Y., Zhan, J., Ying, L.: Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1379–1388 (2020) 14. Lu, X., Zhu, L., Li, J., Zhang, H., Shen, H.T.: Efficient supervised discrete multiview hashing for large-scale multimedia search. IEEE Trans. Multimedia 22(8), 2048–2060 (2020). https://doi.org/10.1109/TMM.2019.2947358 15. Rasiwasia, N., et al.: A new approach to cross-modal multimedia retrieval. In: Proceedings of the ACM International Conference on Multimedia, pp. 251–260 (2010) 16. Rastegari, M., Choi, J., Fakhraei, S., Hal, D., Davis, L.: Predictable dual-view hashing. In: Proceedings of International Conference on Machine Learning, pp. 1328–1336. PMLR (2013)

536

Y. Zhao et al.

17. Shen, H.T., Liu, L., Yang, Y., Xu, X., Huang, Z., Shen, F., Hong, R.: Exploiting subspace relation in semantic labels for cross-modal hashing. IEEE Trans. Knowl. Data Eng. (2020) 18. Shen, Y., et al.: Auto-encoding twin-bottleneck hashing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2827 (2020) 19. Song, J., Yang, Y., Yang, Y., Huang, Z., Shen, H.T.: Inter-media hashing for largescale retrieval from heterogeneous data sources. In: Proceedings of the International Conference on Management of Data, pp. 785–796 (2013) 20. Su, S., Zhong, Z., Zhang, C.: Deep joint-semantics reconstructing hashing for largescale unsupervised cross-modal retrieval. In: Proceedings of the International Conference on Computer Vision, pp. 3027–3035 (2019) 21. Wang, H., Yang, Y., Liu, B.: Gmc: graph-based multi-view clustering. IEEE Trans. Knowl. Data Eng. 32(6), 1116–1129 (2019) 22. Wang, W., Shen, Y., Zhang, H., Yao, Y., Liu, L.: Set and rebase: determining the semantic graph connectivity for unsupervised cross modal hashing. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 853–859 (2020) 23. Wu, B., Yang, Q., Zheng, W.S., Wang, Y., Wang, J.: Quantized correlation hashing for fast cross-modal search. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3946–3952. Citeseer (2015) 24. Wu, G., Lin, Z., Han, J., Liu, L., Ding, G., Zhang, B., Shen, J.: Unsupervised deep hashing via binary latent factor models for large-scale cross-modal retrieval. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2854–2860 (2018) 25. Wu, L., Wang, Y., Shao, L.: Cycle-consistent deep generative hashing for crossmodal retrieval. IEEE Trans. Image Process. 28(4), 1602–1612 (2018) 26. Xie, L., Shen, J., Zhu, L.: Online cross-modal hashing for web image retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence (2016) 27. Xu, R., Li, C., Yan, J., Deng, C., Liu, X.: Graph convolutional network hashing for cross-modal retrieval. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 982–988 (2019) 28. Yang, D., Wu, D., Zhang, W., Zhang, H., Li, B., Wang, W.: Deep semanticalignment hashing for unsupervised cross-modal retrieval. In: Proceedings of the 2020 International Conference on Multimedia Retrieval, pp. 44–52 (2020) 29. Yang, E., Deng, C., Liu, W., Liu, X., Tao, D., Gao, X.: Pairwise relationship guided deep hashing for cross-modal retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017) 30. Yu, J., Zhou, H., Zhan, Y., Tao, D.: Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021) 31. Zhang, J., Peng, Y., Yuan, M.: Unsupervised generative adversarial cross-modal hashing. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) 32. Zhang, X., Lai, H., Feng, J.: Attention-aware deep adversarial hashing for crossmodal retrieval. In: Proceedings of European Conference on Computer Vision, September 2018 33. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the International Conference on Computer Vision, pp. 2223–2232 (2017)

SST-VLM: Sparse Sampling-Twice Inspired Video-Language Model Yizhao Gao1,2 and Zhiwu Lu1,2(B) 1

2

Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China {gaoyizhao,luzhiwu}@ruc.edu.cn Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China Abstract. Most existing video-language modeling methods densely sample dozens (or even hundreds) of video clips from each raw video to learn the video representation for text-to-video retrieval. This paradigm requires high computational overload. Therefore, sparse sampling-based methods are proposed recently, which only sample a handful of video clips with short time duration from each raw video. However, they still struggle to learn a reliable video embedding with fragmented clips per raw video. To overcome this challenge, we present a novel video-language model called SST-VLM inspired by a Sparse Sampling-Twice (SST) strategy, where each raw video is represented with only two holistic video clips (each has a few frames, but throughout the entire video). For training our SST-VLM, we propose a new Dual Cross-modal MoCo (Dual X-MoCo) algorithm, which includes two cross-modal MoCo modules to respectively model the two clip-text pairs (for each video-text input). In addition to the classic cross-modal contrastive objective, we devise a cliplevel alignment objective to obtain more consistent retrieval performance by aligning the prediction distributions of the two video clips (based on the negative queues of MoCo). Extensive experiments show that our SST-VLM achieves new state-of-the-art in text-to-video retrieval.

1

Introduction

Video-language modeling has drawn great attention in recent years, because it is applicable to a wide variety of practical downstream tasks, including text-tovideo retrieval [1–4], video captioning [1,5,6], and video question answering [7–9]. In this paper, we focus on text-to-video retrieval, and hopefully our work can bring some inspirations to other video-language tasks. Since raw videos consist of a series of image frames, processing these frames acquires tremendous computation cost and resource consumption. Therefore, how to efficiently and effectively utilize/integrate the video frames to obtain informative video representation has become a great challenge in text-to-video retrieval. Existing approaches [10–19] address this challenge mainly by encoding each raw video with multiple sampled video clips. Most of them [10–14,16] sample video clips with short time duration (e.g., 1 s for each clip) from the raw video. Since such local clips can hardly represent the holistic content of the raw c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023  L. Wang et al. (Eds.): ACCV 2022, LNCS 13844, pp. 537–553, 2023. https://doi.org/10.1007/978-3-031-26316-3_32

538

Y. Gao and Z. Lu

Fig. 1. Comparison among different sparse sampling strategies for text-to-video retrieval. We draw one video-text pair from the original dataset (1st row). Note that the video content of ‘meat’ (that appears in several captions) fails to be sampled in ClipBERT [10] (2nd row), but it is correctly sampled in our SST-VLM (3rd row).

video, these methods often sample them densely (i.e., sample a large number of local clips per raw video). Unlike these dense sampling methods, ClipBERT [10] applies a sparse sampling strategy to each raw video (i.e., only a handful of local clips are sampled), which has been reported to be effective. However, it still has limitations: the sampled local clips with short time duration are separately matched with the query text to obtain the clip-level predictions (before aggregated into the final video-level prediction), and thus the video content of some important concepts may be ignored by such sparse sampling strategy (see Fig. 1). Therefore, matching sampled local clips to the whole video description is not reliable for video-language modeling. To overcome these limitations, in this work, we propose a new video sampling strategy named ‘Sparse Sampling-Twice (SST)’ for text-to-video retrieval, which sparsely and holistically samples two video clips from each raw video. Our sampling strategy has two key characteristics: (1) Sparse Random Sampling – we first subdivide a raw video into a handful of equal segments and then randomly sample a single frame from each segment, resulting in a holistic video clip. (2) Sampling-Twice – since sampling only one holistic clip may ignore some key information of the raw video and make the video-text prediction unreliable, we propose to sample two holistic clips by imposing the same sparse random sampling strategy twice on each raw video. Note that we can easily sample more clips per raw video, but in this work, we focus on sampling-twice due to the GPU resource restriction. The detailed comparison between the sparse sampling strategies used in ClipBERT [10] and our SST strategy is shown in Fig. 1. Inspired by our SST, we present a novel video-language model termed SSTVLM for text-to-video retrieval (see Fig. 2). To train our model, we propose a new

SST-VLM

539

Dual Cross-modal MoCo (Dual X-MoCo) algorithm, which includes two crossmodal MoCo [20] modules to respectively model the two clip-text pairs for each video-text input. For the video clip, we employ a 2D image encoder (i.e., ViTbase [21]) to embed the sampled frames and obtain the video embedding by a Transformer [22] module. For the text description, we employ a text encoder (i.e., BERT-base [23]) to obtain its embedding. Note that making retrieval prediction with only one sparsely sampled clip is not reliable and the model’s performance varies significantly across different sampled clips per raw video. Therefore, in addition to the classic cross-modal contrastive objective, we devise a new cliplevel alignment objective to obtain more consistent retrieval performance based on the two video clips sampled by SST (per raw video). Concretely, in each training step, the retrieval distributions of the two video clips are aligned by minimizing the Kullback-Leibler (KL) divergence between them. Since the retrieval distributions have actually been computed during obtaining the cross-modal contrastive loss, our alignment objective almost requires no extra computation cost. Overall, our clip-level alignment objective enables our SST-VLM to achieve more consistent performance in text-to-video retrieval. Importantly, we find that it is even effective without using more frames per raw video (see the ablation study in Table 1). Our main contributions are three-fold: (1) We present a novel video-language contrastive learning framework termed SST-VLM for text-to-video retrieval, which is inspired by the ‘Sparse Sampling-Twice (SST)’ strategy. Different from ClipBERT [10], our SST sparsely and holistically samples two video clips from each raw video. (2) We propose a new Dual X-MoCo algorithm for training our SST-VLM. It is seamlessly integrated with the SST strategy so that our model can achieve more stable as well as better performance. (3) Extensive experiments show that our SST-VLM achieves new state-of-the-art in text-to-video retrieval.

2

Related Work

Text-to-Video Retrieval. Text-to-video retrieval has recently become a popular video-language modeling task. Classic approaches [24,24–30] pre-extract the video features using expert models, including those trained on other tasks such as object recognition, and action classification. They also pre-extract the text features using pre-trained language models [23,31]. The major drawback of this paradigm is the lack of cross-modal interaction during feature pre-extraction. To tackle this problem, a number of works [10–14,16,18,32] have proposed to train video-language models without using pre-extracted features. Most of them [11–14,16] embed each raw video with densely sampled video clips. Different from such costly dense sampling, ClipBERT [10] introduces a sparse sampling strategy, which samples a handful of video clips with short-time duration to learn the video representation. In this work, instead of sampling locally multiple times (like ClipBERT), our SST-VLM proposes a Sparse Sampling-Twice strategy which sparsely and holistically samples two video clips from each raw video. Importantly, we choose to align the clip-level prediction distributions in the retrieval task to obtain more reliable video embeddings.

540

Y. Gao and Z. Lu

Contrastive Learning. Contrastive learning has achieved great success in visual recognition [20,33–39]. The infoNCE loss [33] has been widely used for contrastive learning, where a large number of negative samples are proven to be crucial for better performance. There are two ways of collecting negative samples: (1) SimCLR [37] utilizes the augmented view of each sample to be the positive sample and all other samples in the current batch to be negative ones. (2) MoCo [20] and its variant [40] introduce a momentum mechanism to maintain a large negative queue. Since MoCo can decouple the number of negative samples from the batch size, MoCo-based models are applicable to the setting with a small total batch size (less GPUs are needed for training). In this work, we thus choose to employ MoCo for video-language modeling. Interestingly, we have also explored BYOL [39] and SimSiam [41] in text-to-video retrieval, but found that they fail without using negative samples. Note that contrastive learning has already been applied to text-to-video retrieval in the latest works [28,30]. Concretely, TACo [28] adopts token-aware cascade contrastive learning (enhanced with hard negative mining), and HiT [30] introduces cross-modal MoCo based on both feature-level and semantic-level matching. They both utilize pre-extracted features as their inputs, leading to suboptimal results. Different from them, in this work, we focus on learning better video/text embeddings directly from raw videos/texts by proposing a new Dual X-MoCo algorithm, which includes two cross-modal MoCo modules with both cross-modal contrastive loss and clip-level alignment loss.

3 3.1

Methodology Framework Overview

The flowchart of our SST-VLM model for text-to-video retrieval is illustrated in Fig. 2. Concretely, we are given a training set of N video-text pairs D = {Vi , Ti }N i=1 , where each video Vi has Si frames of resolution H × W and each text Ti is represented by the natural language in English. Our model aims to learn a video encoder fθv and a text encoder fθt to project each video and its paired text into the joint embedding space so that they can be aligned with each other. The video and text encoders of our model are presented in Sect. 3.2, and our proposed Dual X-MoCo algorithm for model training is given in Sect. 3.3. 3.2

Video and Text Encoders

Video Encoder. Given a raw video Vi , we randomly and sparsely sample Nc = 2 c video clips {ci,r }N r=1 , with s (s < Si ) frames per video clip (see Sect. 3.3 for more details). For each sampled video clip ci,r , we first extract the visual embeddings v ∈ Rs×Dv of all frames through an pre-trained image encoder f v (e.g., the Fi,r ViT-base model [21]) with an output dimension Dv : v Fi,r [k] = f v (ci,r [k]), k = 1, · · · , s,

(1)

SST-VLM

541

Fig. 2. Schematic illustration of our SST-VLM model. We sparsely and randomly sample two video clips from each raw video. Inspired by this ‘Sparse Sampling-Twice’ strategy, we devise a Dual X-MoCo algorithm to train our model. The video encoder consists of a ViT-base model followed by a Transformer module, and the text encoder is a BERT-base model. The momentum video/text encoders marked with dotted lines are initialized by the video/text encoders and updated by Eqs. (8) and (9). v where ci,r [k] denotes the k-th frame of the video clip ci,r and Fi,r [k] denotes the v . We then utilize a Transformer [22] module fatt to capture the k-th row of Fi,r temporal correlation across all frame embeddings: v v v v Fi,r = fatt (Fi,r [1], Fi,r [2], · · · , Fi,r [s]),

(2)

v where Fi,r ∈ Rs×Dv are the embeddings output by the Transformer. We finally c obtain the embedding Fi,r ∈ RD of video clip ci,r by averaging all frame embeddings, which is followed by a linear projection layer: c v v v = Linear(Avg(Fi,r [1], Fi,r [2], · · · , Fi,r [s])), Fi,r

(3)

v v where Fi,r [k] is the k-th row of Fi,r and D is the common dimension for video and text embeddings. Linear(·) denotes a linear projection layer and Avg(·) denotes the average pooling function that computes the mean of input embeddings. Overall, our video encoder fθv (with parameters θv ) encodes video clips by Eqs. (1)–(3).

Text Encoder. For each text Ti , we first tokenize it into a sequence of tokens [ti1 , ti2 , · · · , tili ], where li denotes the length of Ti . A pre-trained language model f t (e.g., the BERT-base model [23]) is then used to map the sequence of tokens to a Dt -dimensional embedding Fit ∈ RDt : Fit = f t (ti1 , ti2 , · · · , tili ).

(4)

542

Y. Gao and Z. Lu

Similar to Eq. (3), we obtain the final text embedding by: Fit = Linear(Fit ),

(5)

where Linear(·) is a linear layer with the output dimension D. Overall, our text encoder fθt (with parameters θt ) encodes raw texts by Eqs. (4) and (5). 3.3

Dual X-MoCo

Sparse Sampling-Twice. In this work, to overcome the limitations of ClipBERT [10] (see Fig. 1), we propose to sparsely and holistically sample two video clips from each raw video, noted as the ‘Sparse Sampling-Twice’ strategy. Specifically, given a raw video Vi with Si frames from a mini-batch B = {Vi , Ti }B i=1 , we first subdivide it into s equal segments, where s = 4 in our implementation. We then randomly sample one frame from each segment to form the first video clip ci,1 . The second video clip ci,2 is obtained by the same random sampling strategy. Finally, we feed the two sparsely sampled video clips ci,1 , ci,2 and paired text Ti separately into the video encoder and text encoder to obtain video clip c c , Fi,2 and text embedding Fit : embeddings Fi,1 c c Fi,1 = fθv (ci,1 ), Fi,2 = fθv (ci,2 ), Fit = fθt (Ti ).

(6)

Cross-Modal Contrastive Loss. As shown in Fig. 2, our Dual X-MoCo includes two cross-modal MoCo [20] modules to respectively model the sampled two clip-text pairs for each video-text input. Concretely, in a mini-batch B = {Vi , Ti }B i=1 , we first sample two video clips ci,1 & ci,2 from each video Vi by our SST strategy. Further, we encode the video clips ci,1 , ci,2 and paired text Ti c c by Eq. (6). After that, we obtain the key embeddings Ki,1 , Ki,2 , Kit of ci,1 , ci,2 , Ti by the momentum encoders fθvm and fθtm : c c Ki,1 = fθvm (ci,1 ), Ki,2 = fθvm (ci,2 ), Kit = fθtm (Ti ),

(7)

where f_{θvm} (with parameters θvm) is initialized by the video encoder f_{θv} and f_{θtm} (with parameters θtm) is initialized by the text encoder f_{θt}. During training, the parameters of the momentum encoders f_{θvm} and f_{θtm} are updated by the momentum-update rule as follows:

θvm = m · θvm + (1 − m) · θv,    (8)

θtm = m · θtm + (1 − m) · θt,    (9)

where m is a momentum coefficient. After loss calculation, with the earliest momentum embeddings popped out, we push {K^c_{i,1}}^B_{i=1}, {K^c_{i,2}}^B_{i=1}, and {K^t_i}^B_{i=1} respectively into the queues Q^c_1, Q^c_2, and Q^t, where Q^c_1 = {n^c_{j,1}}^{N_q}_{j=1}, Q^c_2 = {n^c_{j,2}}^{N_q}_{j=1}, and Q^t = {n^t_j}^{N_q}_{j=1}. Note that the queue size N_q is decoupled from B. As a result, for each query text embedding F^t_i, we have positive video clip embeddings K^c_{i,1}, K^c_{i,2} and negative video clip embeddings n^c_{j,1} ∈ Q^c_1, n^c_{j,2} ∈ Q^c_2.
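The momentum update of Eqs. (8)-(9) and the queue maintenance can be sketched as follows (a minimal PyTorch illustration under the assumption that an encoder and its momentum copy share the same parameter layout; not the authors' implementation):

```python
import torch

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m=0.999):
    """Eqs. (8)-(9): exponential moving average of the encoder parameters."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data = m * p_m.data + (1.0 - m) * p.data

@torch.no_grad()
def dequeue_and_enqueue(queue, keys):
    """Drop the oldest B momentum embeddings and append the new batch of keys
    (queue: [N_q, D] tensor, keys: [B, D] tensor); N_q stays decoupled from B."""
    return torch.cat([queue[keys.shape[0]:], keys.detach()], dim=0)
```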


We thus define the text-to-video contrastive loss L_t2v as an InfoNCE-based loss:

P_{i,1} = exp(F^t_i · K^c_{i,1} / τ), P_{i,2} = exp(F^t_i · K^c_{i,2} / τ),    (10)

N_{i,1} = Σ^{N_q}_{j=1} exp(F^t_i · n^c_{j,1} / τ), N_{i,2} = Σ^{N_q}_{j=1} exp(F^t_i · n^c_{j,2} / τ),    (11)

L_t2v = − (1/B) Σ^B_{i=1} (log (P_{i,1} / (P_{i,1} + N_{i,1})) + log (P_{i,2} / (P_{i,2} + N_{i,2}))),    (12)

where τ is the temperature, and P_{i,1} (or P_{i,2}) is the similarity score between the positive clip embedding K^c_{i,1} (or K^c_{i,2}) and the query text embedding F^t_i. Additionally, N_{i,1} (or N_{i,2}) is the summed similarity score between the negative clip embeddings n^c_{j,1} (or n^c_{j,2}) and the query text embedding F^t_i. Similarly, for query video clips with embeddings F^c_{i,1}, F^c_{i,2} computed by Eq. (6) and their positive/negative text embeddings K^t_i, n^t_j, we define the video-to-text contrastive loss L_v2t as follows:

P̃_{i,1} = exp(F^c_{i,1} · K^t_i / τ), P̃_{i,2} = exp(F^c_{i,2} · K^t_i / τ),    (13)

Ñ_{i,1} = Σ^{N_q}_{j=1} exp(F^c_{i,1} · n^t_j / τ), Ñ_{i,2} = Σ^{N_q}_{j=1} exp(F^c_{i,2} · n^t_j / τ),    (14)

L_v2t = − (1/B) Σ^B_{i=1} (log (P̃_{i,1} / (P̃_{i,1} + Ñ_{i,1})) + log (P̃_{i,2} / (P̃_{i,2} + Ñ_{i,2}))).    (15)

The total cross-modal contrastive loss L_cl is given by:

L_cl = L_t2v + L_v2t.    (16)
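For clarity, the text-to-video half of Eqs. (10)-(12) can be sketched as follows (a minimal illustration with assumed tensor shapes and L2-normalized embeddings, not the authors' code); the video-to-text half in Eqs. (13)-(15) is symmetric, with the roles of clip and text embeddings swapped:

```python
import torch

def t2v_infonce(text_emb, key1, key2, queue1, queue2, tau=0.07):
    """Dual text-to-video InfoNCE loss: text_emb/key1/key2 are [B, D] embeddings,
    queue1/queue2 are [N_q, D] negative queues for the two sampled clips."""
    pos1 = torch.exp((text_emb * key1).sum(-1) / tau)       # P_{i,1}
    pos2 = torch.exp((text_emb * key2).sum(-1) / tau)       # P_{i,2}
    neg1 = torch.exp(text_emb @ queue1.t() / tau).sum(-1)   # N_{i,1}
    neg2 = torch.exp(text_emb @ queue2.t() / tau).sum(-1)   # N_{i,2}
    return -(torch.log(pos1 / (pos1 + neg1)) +
             torch.log(pos2 / (pos2 + neg2))).mean()
```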

Clip-Level Alignment Loss. Note that different video clips sampled from the same video should correspond to the same ground-truth text. Therefore, a good model should have consistent retrieval performance over different video clips sampled from each video. To this end, in addition to the cross-modal contrastive loss, we propose to enhance the performance consistency of our SST-VLM model by minimizing a clip-level alignment loss defined with the Symmetrical Kullback-Leibler (Sym-KL) divergence.

Given a video V_i with two video clips c_{i,r} (r = 1, 2) and its paired text T_i, the video-to-text retrieval probability distribution of the query video clip is denoted as Ũ_{i,r} = [û^0_{i,r}, ..., û^{N_q}_{i,r}] ∈ R^{N_q+1}. Similarly, the text-to-video retrieval probability distribution of the query text is U_{i,r} = [u^0_{i,r}, ..., u^{N_q}_{i,r}] ∈ R^{N_q+1}. Concretely, the j-th element (j = 0, ..., N_q) of Ũ_{i,r} or U_{i,r} is defined as:

û^j_{i,r} = P̃_{i,r} / (P̃_{i,r} + Ñ_{i,r}),  u^j_{i,r} = P_{i,r} / (P_{i,r} + N_{i,r}),  (j = 0),    (17)

û^j_{i,r} = exp(F^c_{i,r} · n^t_j / τ) / (P̃_{i,r} + Ñ_{i,r}),  u^j_{i,r} = exp(F^t_i · n^c_{j,r} / τ) / (P_{i,r} + N_{i,r}),  (j > 0),    (18)


where P̃_{i,r}, Ñ_{i,r}, P_{i,r}, N_{i,r} have been computed in Eqs. (13), (14), (10) and (11). We then define the clip-level alignment loss L_al with the Sym-KL divergence:

L_û = (1/B) Σ^B_{i=1} Σ^{N_q}_{j=0} (û^j_{i,1} log(û^j_{i,1} / û^j_{i,2}) + û^j_{i,2} log(û^j_{i,2} / û^j_{i,1})),    (19)

L_u = (1/B) Σ^B_{i=1} Σ^{N_q}_{j=0} (u^j_{i,1} log(u^j_{i,1} / u^j_{i,2}) + u^j_{i,2} log(u^j_{i,2} / u^j_{i,1})),    (20)

L_al = L_û + L_u.    (21)
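Each term of Eqs. (19)-(21) is a symmetric KL divergence between the two clips' retrieval distributions, as in the following minimal sketch (shapes are assumptions; not the authors' code):

```python
import torch

def sym_kl_alignment(dist1, dist2, eps=1e-8):
    """Sym-KL between the [B, N_q+1] retrieval distributions of the two clips."""
    kl_12 = (dist1 * torch.log((dist1 + eps) / (dist2 + eps))).sum(-1)
    kl_21 = (dist2 * torch.log((dist2 + eps) / (dist1 + eps))).sum(-1)
    return (kl_12 + kl_21).mean()
```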

Total Loss. Our SST-VLM model is trained by minimizing both the cross-modal contrastive loss and the clip-level alignment loss. We thus have the total loss:

L_total = L_cl + λ · L_al,    (22)

where λ is the weight hyper-parameter.

3.4 Model Pre-training

Note that our model can be readily applied to the image-text retrieval task when the temporal Transformer module is removed. Therefore, similar to ClipBERT [10], our model (excluding the temporal Transformer module) is pretrained on a widely-used image-text dataset (with overall 5.3M image-text pairs), which consists of CC3M [42], VisGenome [43], SBU [44], COCO [45], and Flickr30k [46]. In this work, we do not pre-train our model on a large-scale external video-text dataset like HowTo100M [12] due to the limited computation resources. Although only pre-trained on an image-text dataset rather than a large-scale video-text dataset, our model still achieves new state-of-the-art on several benchmark datasets for text-to-video retrieval (see Table 3).

4 Experiments

4.1 Datasets and Settings

Datasets. We evaluate our SST-VLM on three benchmarks: (1) MSR-VTT [1] contains 10k videos with 200k descriptions. We first follow recent works [18,24,47], using the 1k-A split of 9k training videos and 1k test videos. Further, we also adopt the split in [10,28] (called the 7k-1k split in our work), with 7k training and 1k test videos. (2) MSVD [2] consists of 80k English descriptions for 1,970 videos from YouTube, and each video has around 40 captions. As in [18,29,47], we use the standard split: 1,200 videos for training, 100 videos for validation, and 670 videos for testing. (3) VATEX [3] includes 25,991 videos for training, 3,000 videos for validation, and 6,000 videos for testing. Since the original test set is private, we follow [29,48] and randomly split the original validation set into two equal parts, with 1,500 videos for validation and the other 1,500 videos for testing.


Table 1. Ablation study for our SST-VLM model. Text-to-video retrieval results are reported on the MSR-VTT 1k-A test set.

Lcl | Lal | Frames | R@1 ↑ | R@5 ↑ | R@10 ↑
 ×  |  ×  | 4      | 31.4  | 59.5  | 69.8
 ×  |  ×  | 8      | 31.3  | 59.3  | 70.1
 ✓  |  ×  | 4+4    | 32.1  | 61.5  | 71.8
 ✓  |  ✓  | 4+4    | 33.4  | 62.5  | 73.5

Evaluation Metrics. We evaluate the text-to-video retrieval performance with the widely-used evaluation metrics in information retrieval, including Recall at K (shortened as R@K with K = 1, 5, 10) and Median Rank (shortened as MedR). R@K refers to the percentage of queries that are correctly retrieved in the top-K results. MedR measures the median rank of correct answers in the retrieved ranking list, where a lower score indicates better performance.

Implementation Details. We adopt ViT-base [21] as the frame feature extractor of our video encoder and BERT-base [23] as the text encoder. For visual augmentation at the training stage, we apply random-crop, gray-scaling, horizontal-flip, and color-jitter to the input video frames that are resized to 384 × 384 (only frame-resizing and center-crop are deployed at the evaluation stage). Due to the computation constraint, we empirically set the hyper-parameters as: τ = 1, λ = 0.1, and the initial learning rate is 5e−5. We only update the last 8 layers of the video and text encoders (the other layers are frozen) during training. It takes about 2 h per epoch to train our model on MSR-VTT with 8 Tesla V100 GPUs. In addition, different from the SST strategy (4 frames per clip) used for model training, we sample two video clips with 8 frames from each video V_i at the evaluation stage. With the two clip embeddings F^c_{i,1}, F^c_{i,2} obtained by Eq. (6), we obtain the final embedding of V_i for evaluation by averaging F^c_{i,1} and F^c_{i,2}.
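For reference, R@K and MedR can be computed from a query-video similarity matrix as in the following sketch (our own illustration, not the authors' evaluation code; it assumes text query i is paired with video i):

```python
import numpy as np

def retrieval_metrics(sim_matrix):
    """sim_matrix[i, j]: similarity between text query i and video j."""
    ranks = []
    for i, row in enumerate(sim_matrix):
        order = np.argsort(-row)                            # most similar first
        ranks.append(int(np.where(order == i)[0][0]) + 1)   # 1-based rank of the GT video
    ranks = np.asarray(ranks)
    metrics = {"R@%d" % k: 100.0 * float((ranks <= k).mean()) for k in (1, 5, 10)}
    metrics["MedR"] = float(np.median(ranks))
    return metrics
```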

4.2 Ablation Study

Contributions of Contrastive and Alignment Losses. We analyze the contributions of the cross-modal contrastive loss Lcl and the clip-level alignment loss Lal used in our SST-VLM. The ablative results are shown in Table 1. The baseline model (1st row) is formed with a single cross-modal MoCo framework where only one video clip (with frames = 4) is sampled for each video. Based on our SST strategy (with frames = 4+4), we obtain another baseline method by removing the clip-level alignment loss Lal from our Dual X-MoCo (3rd row). To further demonstrate the effectiveness of our full model, we also train a baseline model (2nd row) based on single cross-modal MoCo with frames = 8. We can observe that: (1) Sampling one video clip with 4 or 8 frames leads to comparable performance (2nd row vs. 1st row). (2) With our SST strategy (4+4 frames per video), our model achieves significant improvements (3rd row vs. 1st/2nd row). This suggests that the improvement of our SST-VLM comes from sampling two clips rather than simply from using more frames per raw video.


Table 2. Comparative results obtained by different alignment methods used in Eq. (21). Text-to-video retrieval results are reported on the MSR-VTT 1k-A test set.

Alignment method | R@1 ↑ | R@5 ↑ | R@10 ↑
NC               | 32.8  | 62.0  | 71.7
L2               | 32.4  | 62.2  | 71.5
Asym-KL          | 33.2  | 62.2  | 72.1
Sym-KL (ours)    | 33.4  | 62.5  | 73.5

(3) The clip-level alignment loss can further improve the performance of our SST-VLM model, yielding around 1% improvement on all R@K (K = 1, 5, 10) results over our SST-VLM model trained with only the contrastive loss (4th row vs. 3rd row).

Comparison to Alternative Alignment Methods. We further analyze the impact of alternative methods used for our clip-level alignment loss in Table 2. Note that the alignment loss L_al in Eq. (21) is defined with the Symmetric Kullback-Leibler (Sym-KL) divergence. This Sym-KL distance can be replaced by the negative cosine similarity (NC), the L2 distance, or the Asymmetric KL (Asym-KL) divergence. Concretely, the alternative alignment losses are defined by:

L^NC_al = − (1/B) Σ^B_{i=1} (Ũ_{i,1} · Ũ_{i,2} / (||Ũ_{i,1}||_2 ||Ũ_{i,2}||_2) + U_{i,1} · U_{i,2} / (||U_{i,1}||_2 ||U_{i,2}||_2)),    (23)

L^L2_al = (1/B) Σ^B_{i=1} (||Ũ_{i,1} − Ũ_{i,2}||_2 + ||U_{i,1} − U_{i,2}||_2),    (24)

L^Asym_al = (1/B) Σ^B_{i=1} Σ^{N_q}_{j=0} (û^j_{i,1} log(û^j_{i,1} / û^j_{i,2}) + u^j_{i,1} log(u^j_{i,1} / u^j_{i,2})),    (25)

where Ũ_{i,r}, U_{i,r}, û_{i,r}, u_{i,r} are defined in Eqs. (17) and (18). We can find that SST-VLM with NC, L2 or Asym-KL leads to slightly lower performance on R@1 and R@5, and nearly 2% performance degradation on R@10, as compared with our SST-VLM using Sym-KL. We thus choose Sym-KL in this work.

4.3 Comparative Results

Table 3 shows the comparative results for text-to-video retrieval on MSR-VTT. We compare our SST-VLM with a wide range of representative/state-of-the-art methods, including those [27-29] pre-trained on HowTo100M and those [10,18] pre-trained on image-text datasets. For an extensive comparison, we also include methods [24,30,47] that utilize pre-extracted expert features. Although our SST-VLM is pre-trained on the smallest dataset with only 5.3M image-text pairs, it still achieves the best performance under both the 7k-1k and 1k-A splits, demonstrating the effectiveness of our Dual X-MoCo for video-language modeling. Concretely, under the 7k-1k split, our SST-VLM outperforms the second best competitor by 5.0% on R@1, 5.9% on R@5, and 4.2% on R@10.


Table 3. Comparison to the state-of-the-art results for text-to-video retrieval on the MSR-VTT test set. w/o PE: methods trained without using multi-modal pre-extracted features. VLM PT: datasets for pre-training visual-language models. VL Pairs: the number of visual-language pairs in the pre-training datasets.

7k-1k Split
Method               | w/o PE | VLM PT           | VL Pairs | R@1 ↑ | R@5 ↑ | R@10 ↑ | MedR ↓
JSFusion [11]        | ✓      | –                | –        | 10.2  | 31.2  | 43.2   | 13.0
HT MIL-NCE [12]      | ✓      | HowTo100M        | >100M    | 14.9  | 40.2  | 52.8   | 9.0
ActBERT [13]         | ✓      | HowTo100M        | >100M    | 16.3  | 42.8  | 56.9   | 10.0
HERO [14]            | ✓      | HowTo100M        | >100M    | 16.8  | 43.4  | 57.7   | –
VidTranslate [16]    | ✓      | HowTo100M        | >100M    | 14.7  | –     | 52.8   | –
NoiseEstimation [26] |        | HowTo100M        | >100M    | 17.4  | 41.6  | 53.6   | 8.0
UniVL [27]           |        | HowTo100M        | >100M    | 21.2  | 49.6  | 63.1   | 6.0
ClipBERT [10]        | ✓      | COCO, VisGenome  | 5.6M     | 22.0  | 46.8  | 59.9   | 6.0
TACo [28]            |        | HowTo100M        | >100M    | 24.8  | 52.1  | 64.5   | 5.0
SST-VLM (ours)       | ✓      | CC3M, Others     | 5.3M     | 29.8  | 58.0  | 68.7   | 3.0

1k-A Split
Method               | w/o PE | VLM PT           | VL Pairs | R@1 ↑ | R@5 ↑ | R@10 ↑ | MedR ↓
CE [47]              |        | –                | –        | 20.9  | 48.8  | 62.4   | 6.0
AVLnet [25]          |        | HowTo100M        | >100M    | 27.1  | 55.6  | 66.6   | 4.0
MMT [24]             |        | HowTo100M        | >100M    | 26.6  | 57.1  | 69.6   | 4.0
Support Set [29]     |        | HowTo100M        | >100M    | 30.1  | 58.5  | 69.3   | 3.0
HiT [30]             |        | HowTo100M        | >100M    | 30.7  | 60.9  | 73.2   | 2.6
TACo [28]            |        | HowTo100M        | >100M    | 28.4  | 57.8  | 71.2   | 4.0
Frozen in Time [18]  | ✓      | CC3M, WebVid-2M  | 5.5M     | 31.0  | 59.5  | 70.5   | 3.0
SST-VLM (ours)       | ✓      | CC3M, Others     | 5.3M     | 33.4  | 62.5  | 73.5   | 3.0

Table 4. Comparison to the state-of-the-arts on MSVD for text-to-video retrieval.

Method               | R@1 ↑ | R@5 ↑ | R@10 ↑ | MedR ↓
VSE [49]             | 12.3  | 30.1  | 42.3   | 14.0
VSE++ [50]           | 15.4  | 39.6  | 53.0   | 9.0
Multi. Cues [51]     | 20.3  | 47.8  | 61.1   | 6.0
CE [47]              | 19.8  | 49.0  | 63.8   | 6.0
Support Set [29]     | 28.4  | 60.0  | 72.9   | 4.0
Frozen in Time [18]  | 33.7  | 64.7  | 76.3   | 3.0
SST-VLM (ours)       | 36.2  | 66.4  | 76.9   | 2.0

It also leads to the best MedR = 3.0. Moreover, under the 1k-A split (with more training data than the 7k-1k split), our SST-VLM outperforms the latest state-of-the-art methods [18,30] by 2.4% on R@1 and 1.6% on R@5. In particular, compared with HiT [30], our SST-VLM achieves better results on R@1 and R@5, and obtains competitive results on R@10 and MedR. This is still impressive and remarkable, given that HiT is not only pre-trained on the much larger HowTo100M dataset but also utilizes numerous pre-extracted expert features. Table 4 shows the comparative results on MSVD. Our SST-VLM outperforms all competitors, especially yielding a 2.5% margin on R@1 against the latest state-of-the-art [18].


Table 5. Comparison to the state-of-the-arts on VATEX for text-to-video retrieval.

Method            | R@1 ↑ | R@5 ↑ | R@10 ↑ | MedR ↓
VSE [49]          | 28.0  | 64.3  | 76.9   | 3.0
VSE++ [50]        | 33.7  | 70.1  | 81.0   | 2.0
Dual [51]         | 31.1  | 67.4  | 78.9   | 3.0
HGR [47]          | 35.1  | 73.5  | 83.5   | 2.0
HANet [19]        | 36.4  | 74.1  | 84.1   | 2.0
Support Set [29]  | 45.9  | 82.4  | 90.4   | 1.0
SST-VLM (ours)    | 53.4  | 85.3  | 92.0   | 1.0

Table 6. Comparison to the state-of-the-arts on MSR-VTT (1k-A split) for video-to-text retrieval.

Method            | R@1 ↑ | R@5 ↑ | R@10 ↑ | MedR ↓
CE [47]           | 20.9  | 48.8  | 62.4   | 6.0
AVLnet [25]       | 28.5  | 54.6  | 65.2   | 4.0
MMT [51]          | 28.0  | 57.5  | 69.7   | 3.7
Support Set [29]  | 28.5  | 58.6  | 71.6   | 3.0
SST-VLM (ours)    | 33.2  | 61.2  | 72.0   | 3.0

Table 7. Comparison to the state-of-the-arts on MSVD for video-to-text retrieval.

Method            | R@1 ↑ | R@5 ↑ | R@10 ↑ | MedR ↓
VSE++ [50]        | 21.2  | 43.4  | 52.2   | 9.0
Multi. Cues [51]  | 31.5  | 51.0  | 61.5   | 5.0
Support Set [29]  | 34.7  | 59.9  | 70.0   | 3.0
SST-VLM (ours)    | 47.3  | 72.0  | 78.0   | 2.0

The results on VATEX in Table 5 further demonstrate that our SST-VLM achieves 7.5% improvement on R@1, 2.9% improvement on R@5, and 1.6% improvement on R@10 over the second best competitor [29].

For a comprehensive comparison, we also provide video-to-text retrieval results on the MSR-VTT 1k-A split in Table 6, in addition to the text-to-video retrieval results. Our SST-VLM outperforms the latest state-of-the-art (i.e., Support Set [29] pre-trained on HowTo100M [12]) by 4.7% on R@1, 2.6% on R@5, and 0.4% on R@10. It also achieves the best MedR = 3.0. Moreover, we present the results for video-to-text retrieval on the MSVD [2] test set in Table 7. Our SST-VLM outperforms the second best competitor [29] by 12.6% on R@1, 12.1% on R@5, and 8.0% on R@10. It also leads to the best MedR = 2.0. These results indicate that our SST-VLM is effective for video-language modeling on both video-to-text and text-to-video retrieval tasks.

4.4 Visualization Results

Retrieval Rank Distribution. To show the stability of our SST-VLM, we visualize the text-to-video retrieval rank results on the MSR-VTT 1k-A test set in Fig. 3.


Fig. 3. Visualization results of text-to-video retrieval on the MSR-VTT 1k-A test set. (a) The results by our SST-VLM without using the clip-level alignment loss (but with cross-modal contrastive loss); (b) The results by our full SST-VLM.

Fig. 4. Attention visualization for our SST-VLM. The (attention) heatmaps are shown for two video-text pairs sampled from the MSR-VTT test set. Texts in red denote key objects for each video caption. (Color figure online)

Note that we still sample two video clips (as in the training stage) from each raw video in the test set, resulting in two video clip sets C_1 = {c_{i,1}}^{1,000}_{i=1} and C_2 = {c_{i,2}}^{1,000}_{i=1}. To show the effectiveness of our clip-level alignment loss Lal in enhancing stability, we evaluate two related models: the model in Fig. 3(a) is trained without Lal (but with the cross-modal contrastive loss Lcl), while the model in Fig. 3(b) is exactly our full SST-VLM model. For each model, we visualize the retrieval rank (ranging from 1 to 1,000) distribution between video clip set C_1 (or C_2) and the same set of text queries. In addition, we report the MedR results for each distribution and the KL divergence (KLDiv) between the distributions of differently sampled video clips for each model. We find that: (1) The MedR results in Fig. 3(b) are equal to 3.0 for both C_1 and C_2, while those in Fig. 3(a) are different (4.0 for C_1 and 3.0 for C_2). (2) The KL divergence in Fig. 3(b) is two orders of magnitude smaller than that in Fig. 3(a). Therefore, the clip-level alignment loss Lal indeed leads to more stable results.

Attention Visualization. To further show that our SST-VLM has learned to understand the semantic content in videos, we adopt a recent Transformer visualization method [52] to highlight the relevant regions of the input frames according to the input texts. In this work, different from the original visualization method that computes the gradients directly from the total loss backward, we compute the separate gradients of each input frame and visualize the attention maps of all frames.


Fig. 5. Text-to-video retrieval examples obtained by our SST-VLM on the MSR-VTT 1k-A test set. For each query text, we visualize the top-2 retrieved videos (with 3 frames shown per video). We also present the original paired text under each video.

Concretely, as shown in Fig. 4, we present two video-text pairs (and their visualization results) sampled from the MSR-VTT test set. The left part presents a 4-frame video clip with the text 'Penguins wander around'. The attention visualization shows that the penguins in all frames have actually been noticed by our model. Moreover, the right part presents a 4-frame video clip with a longer text 'A rocket is launching. Smoke is emerging from the base of the rocket'. The attention visualization is rather interesting: as the rocket launches, our model pays more attention to the rocket and its smoke. Overall, these visualization results indicate that our SST-VLM has actually learned to understand the semantic content in videos.

Text-to-Video Retrieval Examples. Figure 5 shows the qualitative text-to-video retrieval results obtained by our SST-VLM on the MSR-VTT 1k-A test set. We visualize the top-2 videos (with 3 frames shown per video) for each query text. Concretely, the left part of Fig. 5 consists of a query text 'a little girl does gymnastics' and the retrieved top-2 videos (with their original paired texts) shown below, while the right part of Fig. 5 is organized similarly. For each query text, we have the following observations: (1) The ground-truth video is correctly retrieved in first place. (2) The texts of the second retrieved videos are also similar to the query text, which means that the semantic content of these videos is still consistent with the query text. Overall, these qualitative results indicate that our SST-VLM has indeed aligned the video and text embeddings well in the learned joint space (which is crucial for video-text retrieval).

5 Conclusion

In this paper, we propose a novel video-language model called SST-VLM inspired by the Sparse Sampling-Twice (SST) strategy that sparsely and holistically samples two video clips from each raw video. For training our SST-VLM, we devise a new Dual X-MoCo algorithm, which includes both cross-modal contrastive and clip-level alignment losses to enhance the performance stability of our model.


Extensive results on several benchmarks show that our SST-VLM achieves new state-of-the-art in text-to-video retrieval. The ablation study and attention visualization further demonstrate the effectiveness of our SST-VLM. In our ongoing research, we will apply our SST-VLM to other video-language understanding tasks such as video captioning and video question answering.

References

1. Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: CVPR, pp. 5288–5296 (2016)
2. Chen, D., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: ACL, pp. 190–200 (2011)
3. Wang, X., Wu, J., Chen, J., Li, L., Wang, Y., Wang, W.Y.: VaTeX: a large-scale, high-quality multilingual dataset for video-and-language research. In: ICCV, pp. 4580–4590 (2019)
4. Hendricks, A.L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: ICCV, pp. 5804–5813 (2017)
5. Rohrbach, A., Rohrbach, M., Tandon, N., Schiele, B.: A dataset for movie description. In: CVPR, pp. 3202–3212 (2015)
6. Zhou, L., Xu, C., Corso, J.: Towards automatic learning of procedures from web instructional videos. In: AAAI, pp. 7590–7598 (2018)
7. Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
8. Lei, J., Yu, L., Bansal, M., Berg, T.L.: TVQA: localized, compositional video question answering. In: EMNLP, pp. 1369–1379 (2018)
9. Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: CVPR, pp. 1359–1367 (2017)
10. Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T.L., Bansal, M., Liu, J.: Less is more: ClipBERT for video-and-language learning via sparse sampling. In: CVPR, pp. 7331–7341 (2021)
11. Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question answering and retrieval. In: ECCV, pp. 487–503 (2018)
12. Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In: ICCV, pp. 2630–2640 (2019)
13. Zhu, L., Yang, Y.: ActBERT: learning global-local video-text representations. In: CVPR, pp. 8743–8752 (2020)
14. Li, L., Chen, Y.C., Cheng, Y., Gan, Z., Yu, L., Liu, J.: HERO: hierarchical encoder for video+language omni-representation pre-training. In: EMNLP, pp. 2046–2065 (2020)
15. Feng, Z., Zeng, Z., Guo, C., Li, Z.: Exploiting visual semantic reasoning for video-text retrieval. In: IJCAI, pp. 1005–1011 (2020)
16. Korbar, B., Petroni, F., Girdhar, R., Torresani, L.: Video understanding as machine translation. arXiv preprint arXiv:2006.07203 (2020)
17. Li, Z., Guo, C., Yang, B., Feng, Z., Zhang, H.: A novel convolutional architecture for video-text retrieval. In: ICME, pp. 1–6 (2020)
18. Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. arXiv preprint arXiv:2104.00650 (2021)
19. Wu, P., He, X., Tang, M., Lv, Y., Liu, J.: HANet: hierarchical alignment networks for video-text retrieval. In: ACM-MM, pp. 3518–3527 (2021)
20. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: CVPR, pp. 9726–9735 (2020)
21. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. In: ICLR (2021)
22. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)
23. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
24. Gabeur, V., Sun, C., Alahari, K., Schmid, C.: Multi-modal transformer for video retrieval. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 214–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_13
25. Rouditchenko, A., et al.: AVLnet: learning audio-visual language representations from instructional videos. arXiv preprint arXiv:2006.09199 (2020)
26. Amrani, E., Ben-Ari, R., Rotman, D., Bronstein, A.: Noise estimation using density estimation for self-supervised multimodal learning. In: AAAI, pp. 6644–6652 (2021)
27. Luo, H., et al.: UniVL: a unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353 (2020)
28. Yang, J., Bisk, Y., Gao, J.: TACo: token-aware cascade contrastive learning for video-text alignment. arXiv preprint arXiv:2108.09980 (2021)
29. Patrick, M., et al.: Support-set bottlenecks for video-text representation learning. In: ICLR (2021)
30. Liu, S., Fan, H., Qian, S., Chen, Y., Ding, W., Wang, Z.: HiT: hierarchical transformer with momentum contrast for video-text retrieval. arXiv preprint arXiv:2103.15049 (2021)
31. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
32. Wang, X., Zhu, L., Yang, Y.: T2VLAD: global-local sequence alignment for text-video retrieval. In: CVPR, pp. 5079–5088 (2021)
33. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
34. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR, pp. 3733–3742 (2018)
35. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 776–794. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_45
36. Khosla, P., et al.: Supervised contrastive learning. In: NeurIPS, pp. 18661–18673 (2020)
37. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: ICML, pp. 1597–1607 (2020)
38. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. In: NeurIPS, pp. 22243–22255 (2020)
39. Grill, J., et al.: Bootstrap your own latent: a new approach to self-supervised learning. In: NeurIPS, pp. 21271–21284 (2020)
40. Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057 (2021)
41. Chen, X., He, K.: Exploring simple siamese representation learning. In: CVPR, pp. 15750–15758 (2021)
42. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL, pp. 2556–2565 (2018)
43. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV, pp. 32–73 (2017)
44. Ordonez, V., Kulkarni, G., Berg, T.: Im2Text: describing images using 1 million captioned photographs. In: NeurIPS, pp. 1143–1151 (2011)
45. Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
46. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV, pp. 2641–2649 (2015)
47. Liu, Y., Albanie, S., Nagrani, A., Zisserman, A.: Use what you have: video retrieval using representations from collaborative experts. In: BMVC, p. 279 (2019)
48. Chen, S., Zhao, Y., Jin, Q., Wu, Q.: Fine-grained video-text retrieval with hierarchical graph reasoning. In: CVPR, pp. 10635–10644 (2020)
49. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)
50. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC, p. 12 (2018)
51. Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: ICMR, pp. 19–27 (2018)
52. Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: CVPR, pp. 782–791 (2021)

PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization

Zhenxing Zheng(1,2,3), Gaoyun An(2,3), Shan Cao(2,3), Zhaoqilin Yang(2,3), and Qiuqi Ruan(2,3)

1 School of Communications and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
[email protected]
2 Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
3 Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
{gyan,18112001,19112010,qqruan}@bjtu.edu.cn

Abstract. An action contains rich multi-modal information, yet current methods generally map the action class to a numerical label as the supervised information to train models. However, numerical labels cannot describe the semantic content contained in the action. This paper proposes PromptLearner-CLIP for action recognition, where the text pathway uses PromptLearner to automatically learn the text content of the prompt as its input and computes the semantic features of actions, and the vision pathway takes video data as its input to learn the visual features of actions. To strengthen the interaction between features of different modalities, this paper proposes a multi-modal information interaction module that utilizes a Graph Neural Network (GNN) to process both the semantic features of the text content and the visual features of a video. In addition, the single-modal video classification problem is transformed into a multi-modal video-text matching problem. Multi-modal contrastive learning is used to reduce the feature distance between samples of the same class but from different modalities. The experimental results show that PromptLearner-CLIP can utilize textual semantic information to significantly improve the performance of various single-modal backbone networks on action recognition, and it achieves top-tier results on the Kinetics400, UCF101, and HMDB51 datasets. Code is available at https://github.com/ZhenxingZheng/PromptLearner.

1 Introduction

With the development of mobile devices and communication networks, video has become the main carrier of information. It is of great practical significance to understand and analyze human actions in the video. As an essential branch of video understanding, action recognition aims to analyze and recognize human actions in videos by analyzing video data and using specific algorithms.


Different from image processing tasks, action recognition needs to analyze not only the appearance information but also the semantics of an action. How to effectively encode the feature of an action remains a fundamental problem to be solved. An action contains rich multi-modal information, and current methods generally map the action class to a numerical label as the supervised information to train models. However, numerical labels cannot describe the semantic content contained in the action. For instance, the sample 'playing tennis' on YouTube provides the corresponding text description 'A 12-year-old boy playing tennis', which not only describes the class of the action but also includes the action subject. Therefore, text provides rich semantic information, and the visual feature can be enhanced by relevant text content. Although some samples on YouTube are accompanied by detailed text descriptions, most samples contain a lot of information unrelated to the video content.

Recently, in the field of natural language processing, researchers proposed a new paradigm, "pretrain, prompt, predict", where, according to the downstream task, a template is designed such that the model can fit the pre-training task when predicting. Based on this, CLIP [29] constructed the text input by designing a variety of natural language description templates and filled the image label text into the blank positions of the templates. The experiments showed that different prompts have an important impact on the model, and subtle differences may lead to changes in performance. CoOp [52] used continuous representations to represent the prompt, whose parameters are optimized in an end-to-end fashion.

Based on the above analyses, this paper proposes a multi-modal semantic-guided network, PromptLearner-CLIP, for action recognition. In the training phase, the text labels of the samples in a batch are filled into the prompt template and then processed by the text encoder to extract text features. At the same time, the visual features of the videos in the batch are extracted by the visual encoder. Finally, the similarities of visual features and text features are computed, resulting in the similarity matrix that is used for optimization. In the inference phase, all labels are filled into the prompt template, and the feature similarity scores between each video in the test dataset and all prompts are computed. The label with the highest similarity score is assigned to the video. Our contributions are summarized as follows: (1) PromptLearner is used to learn the text content of the prompt as the input to the text pathway, and its parameters are optimized together with the backbone network; (2) a GNN is used to process both the semantic features of the text content and the visual features of the video and to strengthen the interaction between semantic features and visual features; (3) finally, the Kullback-Leibler (KL) loss and the supervised contrastive loss are used to reduce the feature distance between samples of the same class but from different modalities.

2 Related Work

Single-Modal Action Recognition. C3D [33] stacked 3D convolutional layers to learn spatial-temporal features. ARTNet [37] designed appearance and relation branches to perform spatial modeling and relation modeling in a parallel way. R(2+1)D [34] decomposed the 3D convolution kernel into a 2D spatial convolution kernel and a 1D temporal convolution kernel. Because action recognition needs to process multiple frames of a video, it has large computational complexity. Based on the fact that adjacent frames of a video have redundant information, TSN [39] proposed a sparse sampling strategy and a feature aggregation module to model long-term temporal relationships of an action. AdaScan [15] pooled the video frames containing important information and discarded the video frames with less information. To complete effective temporal modeling for actions, TSM [24] shifted the feature map by a position along the temporal dimension, so that the convolutional feature map of the current frame obtains the information of adjacent frames.

Cross-Modal Action Recognition. PoTion [5] encoded the displacement of key points of the human body on a color image that was then processed by a CNN to obtain action features containing pose motion information. The multi-stream network [43] used three streams to process an RGB image, multiple optical flow images, and spectrograms to model appearance features, short-term motion features, and sound features of actions respectively. In addition to sound information, text as a rich expression can describe the semantic content of a video. ClipBERT [19] fused the feature of each video clip and the feature of text to model the multi-modal feature by Transformer [36]. ActBERT [53] learned joint video-text feature representations to capture global and local visual cues from each pair of video clip and text description.

Contrastive Learning. Contrastive learning, as an unsupervised representation learning method, has been successfully applied in the field of computer vision. MoCo [12] built a dynamic dictionary with a queue composed of previous sample features and set an instance discrimination task for contrastive unsupervised learning. VideoMoCo [28] built a generator to mask partial video frames and used a discriminator to distinguish the full video sequence features from the masked video sequence features. Inspired by the mask prediction task, MaskCo [50] masked a specific region of an augmented image while keeping the other augmented image unchanged, and then calculated the region-level feature contrastive loss of the two images to implement the contrastive mask prediction task for visual representation learning. In the training of a network, a batch of samples may contain multiple samples belonging to the same class. SupCon [17] incorporated the label information into the contrastive loss, which considers multiple positive samples of the same class for each anchor point so that the features from the same class are closer than the features of different classes.

3 Method

This paper proposes a multi-modal semantic-guided network, PromptLearner-CLIP, for action recognition, as shown in Fig. 1. The vision pathway uses ViT [7] as the backbone network to process video frames and obtains frame features and video features. The text pathway uses Transformer to process text content and obtains word features and sentence features. To construct a valid text input, PromptLearner is used to automatically learn the text content as the input to the text pathway. After extracting text features and visual features, a multi-modal information interaction module is used to let the text features and visual features interact. Finally, multi-modal contrastive learning is used to reduce the feature distance between videos belonging to the same class.


Fig. 1. Illustration of PromptLearner-CLIP consisting of vision feature extraction, text feature extraction, multi-modal information interaction, and contrastive learning.

3.1 Text Pathway

Multi-modal learning aims to use specific algorithms to learn the complementarity and eliminate the redundancy between different modalities. However, most of the datasets for action recognition only provide action classes without corresponding text descriptions. Recently, a new paradigm of prompting has been proposed in the field of natural language processing. By setting different fill-in-the-blank templates, the downstream task is adjusted to a form similar to the pre-training task, which can effectively solve the downstream task. PromptLearner proposed in this paper uses a text template to process the action label text, expanding the label text into a sentence with certain semantic content as the text description of a video, which is used as the input to the text pathway to extract semantic information. PromptLearner uses continuous vectors to represent the content of the text, and its parameters are updated together with the backbone network in an end-to-end fashion, represented as follows:

t^i = [V^i_1][V^i_2][V^i_3]...[V^i_M][CLASS^i],    (1)

where [V^i_m] (m ∈ [1, M]) represents the m-th context token, [CLASS^i] is the label text of the i-th action class, t^i is the learnable text content, i ∈ [1, I], and M and I represent the number of context tokens and action classes respectively. Class-specific context tokens are used in this paper.
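A class-specific learnable context as in Eq. (1) can be sketched as follows (a minimal illustration in the spirit of CoOp; the module name, shapes, and initialization are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class PromptContext(nn.Module):
    """Class-specific learnable prompt context: M context vectors per action class,
    concatenated with the word embeddings of that class's [CLASS] label text."""
    def __init__(self, num_classes, num_ctx=16, dim=512):
        super().__init__()
        # [I, M, D]: one set of M context vectors per action class
        self.ctx = nn.Parameter(torch.randn(num_classes, num_ctx, dim) * 0.02)

    def forward(self, class_token_embs):
        # class_token_embs: [I, L_cls, D] -> prompts t^i = [V^i_1 ... V^i_M][CLASS^i]
        return torch.cat([self.ctx, class_token_embs], dim=1)
```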


After the learnable prompt is constructed, Transformer is used to process the text content. Transformer adopts an encoder-decoder structure and the text pathway only uses the Transformer encoder to extract the features of text content. The encoder consists of multiple encoding layers consisting of the selfattention layer, LayerNorm layer, and feed-forward layer. The structure of an encoder is shown in Fig. 2.


Fig. 2. Structure of Transformer encoder consisting of the self-attention layer, LayerNorm layer, and feed-forward layer.

The overall process of the Transformer encoder extracting text features is represented as follows:

y_0 = [V_1 + PE_1, ..., V_M + PE_M, V_class + PE_class],    (2)

q_l = k_l = v_l = LayerNorm(y_{l−1}),    (3)

y'_l = MSA(q_l, k_l, v_l) + y_{l−1},    (4)

y_l = FFN(LayerNorm(y'_l)) + y'_l,  l = 1, ..., L,    (5)

[s_1, s_2, ..., s_M, s_class] = y_L,    (6)

where l denotes the l-th encoder layer, V_m and PE_m denote the m-th context token and positional embedding, MSA denotes the multi-head self-attention layer, FFN denotes the feed-forward layer, and L is the number of Transformer encoder layers. Transformer uses the self-attention layer to effectively capture dependencies between features at any location and learn text features. After the text content is processed by the Transformer encoder, the word features S = {s_m}^{M+1}_{m=1} and the sentence feature s_0 are obtained.
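The per-layer computation of Eqs. (3)-(5) corresponds to a pre-LayerNorm Transformer encoder layer, as in the following generic sketch (an illustration only, not the exact text encoder used by the authors):

```python
import torch
import torch.nn as nn

class PreLNEncoderLayer(nn.Module):
    """Pre-LayerNorm encoder layer: self-attention (Eqs. (3)-(4)) + feed-forward (Eq. (5))."""
    def __init__(self, dim=512, heads=8, ffn_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, y):                 # y: [B, M+1, D]
        q = k = v = self.norm1(y)         # Eq. (3)
        y = self.attn(q, k, v)[0] + y     # Eq. (4)
        y = self.ffn(self.norm2(y)) + y   # Eq. (5)
        return y
```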

3.2 Vision Pathway

The original input to Transformer is a sequence of tokens. To satisfy this requirement, ViT first pre-processes each frame to obtain the input token sequence. Given a frame x ∈ R^{H×W×C}, ViT first splits the frame into many non-overlapping patches of the same size, and then these frame patches are flattened into 1D vectors composed of pixel values, denoted as x_p ∈ R^{N×(P^2·C)}:

x_p = [x^1_p, x^2_p, ..., x^N_p],    (7)


where H and W denote the height and width of the frame respectively, C is the number of channels, (P, P) is the resolution of a patch, x^n_p represents the flattened vector from the n-th patch, and N = H·W/P^2 is the number of split patches. Through a linear projection E consisting of fully-connected layers, the vector of pixels is transformed into the patch embedding feature with dimension d_model. At the start of the patch sequence, we prepend a learnable embedding z_cls, whose state at the output layer denotes the frame feature. A learnable 1D position embedding p is used to retain the position information of each patch and is added to the patch embedding feature, as follows:

z_0 = [z_cls, Ex^1_p, Ex^2_p, ..., Ex^N_p] + p,    (8)
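The patch splitting of Eq. (7) can be sketched as follows (an illustration of the pre-processing step only; the function name and the channel-first layout are assumptions):

```python
import torch

def patchify(frame, patch=16):
    """Split a frame [C, H, W] into N = HW/P^2 flattened patches of length P^2*C."""
    c, h, w = frame.shape
    patches = frame.unfold(1, patch, patch).unfold(2, patch, patch)   # [C, H/P, W/P, P, P]
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return patches                                                    # [N, P^2*C]
```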

The overall process of ViT extracting the frame feature is summarized as follows:

q_l = k_l = v_l = LayerNorm(z_{l−1}),    (9)

z'_l = MSA(q_l, k_l, v_l) + z_{l−1},    (10)

z_l = FFN(LayerNorm(z'_l)) + z'_l,  l = 1, ..., L.    (11)

Finally, Transformer is used to process the video frame by frame, and we obtain the sequence of frame features V = {v_k}^K_{k=1}. These features are averaged to form the video-level feature v_0.

3.3 Multi-Modal Information Interaction Module

After extracting the vision features and text features, the model obtains the frame-level features V = {v_k}^K_{k=1}, the video-level feature v_0, the word-level features S = {s_m}^{M+1}_{m=1}, and the sentence-level feature s_0. Inspired by Dynamic Graph Attention Network [47], the multi-modal information interaction module represents the sentence as multiple soft distributions over words and parses the language structure of the sentence gradually. Firstly, the sentence-level feature s_0 is linearly projected to the question feature q^t at the t-th step and is concatenated with the result of the previous step, resulting in u^t:

q^t = W^t × s_0 + b^t,    (12)

u^t = [q^t, y^{t−1}],    (13)

where W^t and b^t denote the learnable parameters, and y^{t−1} denotes the result at the (t−1)-th step. Then, the semantic parsing module computes the similarity between u^t and each word-level feature to predict the visual reasoning process, obtaining the soft distribution over all words R^t = {r^t_m}^{M+1}_{m=1}:

s^t = δ(W_u × u^t + b_u),    (14)

a^t_m = W_{s2} × [tanh(W_{s0} × s^t + W_{s1} × s_m)],    (15)

r^t_m = exp(a^t_m) / Σ^{M+1}_{m=1} exp(a^t_m),    (16)

where W_u, b_u, W_{s0}, W_{s1}, and W_{s2} are parameter matrices shared across different visual reasoning steps, and δ denotes ReLU. Finally, the output at the t-th step is calculated as follows:

y^t = Σ^{M+1}_{m=1} r^t_m · s_m.    (17)
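One step of the semantic parsing module (Eqs. (12)-(17)) can be sketched as follows (module name, dimensions, and interfaces are our own assumptions; only the operations follow the equations above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticParser(nn.Module):
    """Project the sentence feature, attend over word features, return their weighted sum."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)          # W^t, b^t
        self.proj_u = nn.Linear(2 * dim, dim)      # W_u, b_u
        self.w_s0, self.w_s1 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.w_s2 = nn.Linear(dim, 1)

    def forward(self, sent_feat, word_feats, y_prev):
        # sent_feat: [B, D], word_feats: [B, M+1, D], y_prev: [B, D]
        q = self.proj_q(sent_feat)                                  # Eq. (12)
        u = F.relu(self.proj_u(torch.cat([q, y_prev], dim=-1)))     # Eqs. (13)-(14)
        a = self.w_s2(torch.tanh(self.w_s0(u).unsqueeze(1) + self.w_s1(word_feats)))
        r = torch.softmax(a, dim=1)                                 # Eqs. (15)-(16)
        return (r * word_feats).sum(dim=1)                          # Eq. (17)
```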

Then, the semantic parsing feature y^t and the frame-level features V = {v_k}^K_{k=1} are jointly fed into a GNN for multi-modal feature interaction. To effectively embed textual semantic information into visual features, a question-guided graph attention mechanism is used to dynamically assign higher weights to frame features related to the textual content. In this paper, all frame features are concatenated with the output of the semantic parsing module to form the vertices of the graph. The edges connecting the vertices represent the relationships between frames, and the vertex features are represented as follows:

v'_k = [v_k, y^t],  for k = 1, ..., K.    (18)

Next, the self-attention layer is used to calculate the correlation between the feature of any vertex in the graph and the features of all neighboring vertices, and the neighboring vertex information is aggregated to update the vertex feature. The feature correlation is calculated as follows:

α^v_{ij} = (W_q × v'_i) × (W_k × v'_j)^T,    (19)

where W_q and W_k are parameter matrices used to project the vertex features into a feature subspace in which the correlations between the i-th vertex feature and all other vertex features are computed. The correlation α^v is normalized by the softmax function as a weight to aggregate the information of the other vertices:

α_{ij} = exp(α^v_{ij}) / Σ^K_{j=1, j≠i} exp(α^v_{ij}),    (20)

v^*_i = δ(v'_i + Σ_{j, j≠i} α_{ij} · v'_j),    (21)

where α_{ij} denotes the weight between the i-th and j-th vertex features. Finally, the multi-modal fusion method BUTD [1] is used to obtain the multi-modal representation:

J = f(v^*, y^t; W_fuse),    (22)

where W_fuse denotes the parameters of the fusion method, and J is the resulting multi-modal feature.
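The question-guided graph attention of Eqs. (18)-(21) can be sketched as follows (a minimal illustration with assumed shapes, not the authors' GNN implementation):

```python
import torch
import torch.nn as nn

class QuestionGuidedGraphAttention(nn.Module):
    """Concatenate each frame feature with the parsed text feature (Eq. (18)) and
    aggregate the other vertices with softmax attention (Eqs. (19)-(21))."""
    def __init__(self, frame_dim=512, text_dim=512, dim=512):
        super().__init__()
        self.w_q = nn.Linear(frame_dim + text_dim, dim)
        self.w_k = nn.Linear(frame_dim + text_dim, dim)

    def forward(self, frame_feats, y_t):
        # frame_feats: [B, K, Df], y_t: [B, Dt]
        v = torch.cat([frame_feats,
                       y_t.unsqueeze(1).expand(-1, frame_feats.size(1), -1)], dim=-1)
        scores = self.w_q(v) @ self.w_k(v).transpose(1, 2)       # Eq. (19): [B, K, K]
        diag = torch.eye(v.size(1), dtype=torch.bool, device=v.device)
        scores = scores.masked_fill(diag, float("-inf"))         # exclude j = i
        alpha = torch.softmax(scores, dim=-1)                    # Eq. (20)
        return torch.relu(v + alpha @ v)                         # Eq. (21)
```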

3.4 Cross-Modal Contrastive Learning

Cross-modal contrastive learning plays an important role in image retrieval by learning a shared feature space for image-text matching. Therefore, the matching loss between the multi-modal feature similarity matrix and the label matrix in ActionCLIP [40] is used to pull the pairwise text and visual features close to each other:

L_KL = (1/2) E_{(s,v)∼D} [KL(p^{s2v}(s), q^{s2v}(s)) + KL(p^{v2s}(v), q^{v2s}(v))],    (23)

where q^{s2v}(s) and q^{v2s}(v) are label matrices in which the position of each paired video and text is set to 1 and all other positions are set to 0, and p^{s2v}(s) and p^{v2s}(v) are multi-modal feature similarity matrices in which the cosine distance is used to measure feature similarity. Although the KL loss can reduce the difference between the multi-modal feature similarity matrix and the label matrix, there may be multiple positive sample pairs belonging to the same class in a batch of samples. If the label information is included in contrastive learning, the feature encoder will produce features at a closer distance. The supervised contrastive loss is calculated as follows:

L_sup = Σ_{i∈D} L^sup_i = Σ_{i∈D} (−1/|P(i)|) Σ_{p∈P(i)} log [exp(J_i × s^p_0 / τ) / Σ_{a∈A(i)} exp(J_i × s^a_0 / τ)],    (24)

where P(i) ≡ {p ∈ A(i) : y_p = y_i} is the set of indices of all positives to the i-th sample in a batch, |P(i)| is its cardinality, A(i) ≡ I \ {i}, and D denotes a batch of samples. The overall loss is represented as follows:

L = L_KL + L_sup.    (25)
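The supervised contrastive term of Eq. (24) can be sketched as follows (a minimal illustration; the pairing of the fused feature J_i with sentence features s_0 and all shapes are assumptions based on the description above, not the authors' code):

```python
import torch

def supervised_contrastive_loss(fused_feats, text_feats, labels, tau=0.07):
    """fused_feats: [B, D] multi-modal features J_i; text_feats: [B, D] sentence
    features s_0; labels: [B] class indices. Positives share the anchor's label."""
    logits = fused_feats @ text_feats.t() / tau                      # J_i · s_0^a / τ
    eye = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(eye, -1e9)                           # A(i) = I \ {i}
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss_i = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1.0)
    return loss_i.mean()
```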

4 Experiments

4.1 Datasets

The training set of Kinetics400 [16] has 240K videos and the validation set has 20K videos. Kinetics400 is divided into 400 categories, each of which contains at least 400 samples. Each sample is obtained by cropping Youtube videos and lasts about 10 s. Mini-Kinetics-200 [44] is a subset of the Kinetics400 dataset and contains 200 categories. There are 400 samples and 25 samples for each category in the training set and the validation set respectively. The UCF101 [31] dataset contains 13 320 videos with a total of 101 action categories. The HMDB51 [18] dataset contains 51 categories of daily actions with a total of 6 766 videos. UCF101 and HMDB51 datasets provide three splits of training sets and testing sets, and researchers need to compare the average accuracy of the three splits to verify the effectiveness of the method.

4.2 Implementation Details

The text encoder adopts Transformer with 12 layers, where the self-attention layer contains 8 heads and the number of neurons in the hidden layer is set to 512. For the visual feature encoder, this paper uses ViT that also has a 12-layer Transformer. Two types of feature encoders are initialized from CLIP [29]. The number of learnable context tokens of PromptLearner is set to 16. The initial template is “a video of action X”, where X represents the label text of the action and the context vector is randomly initialized with mean 0 and variance 0.02. For the multi-modal information interaction module, the number of neurons in the hidden layer of GNN is set to 512, and the dimension of the output composite feature is set to 512. The AdamW optimizer is used to optimize the model’s parameters, where the initial learning rates of the encoders and the remaining module parameters are set to 5e-6 and 5e-5 respectively, and the weight decay is set to 0.2. The batch size is set to 64, and the total training epoch is set to 50. The first 5 epochs use the warm-up strategy and the remaining 45 epochs use the halfcosine decay strategy. RandAugment is used to crop the region with 224 × 224 size from each frame. This paper adopts a sampling method to randomly sample 8 or 16 frames from a video. During the testing, 10 groups of frame sequences were randomly sampled from each video, and the average of the 10 groups of similarity scores is calculated to predict the action category. After the model is trained on Kinetics400, PromptLearner-CLIP is transferred to UCF101 and HMDB51 datasets, keeping the training and testing strategies unchanged. 4.3

Ablation Study

The loss function mainly consists of three parts, the video-text supervised contrastive loss, the video-text KL loss, and the text-video KL loss. The numbers in the first column of Table 1 represent the coefficients of corresponding loss functions, and the second and third columns report Top-1 and Top-5 accuracies on mini-Kinetics-200 respectively. Table 1. Analysis of loss functions on mini-Kinetics-200 (%) Contrastive loss:video-text loss:text-video loss Top-1 Top-5 1:0:1

67.15 88.89

1:1:0

84.70 97.29

0:1:1

85.10 97.39

1:1:1

85.34 97.15

The experimental results in Table 1 show that when three loss functions are used to optimize parameters, PromptLearner-CLIP achieves the best experimental results on mini-Kinetics-200, and the Top-1 accuracy is 85.34%.

PromptLearner-CLIP

563

When the video-text KL loss is removed, the Top-1 accuracy drops to 67.15%. At the same time, it can be seen from the third row of the table that when the text-video KL loss is removed, the Top-1 accuracy of the model drops to 84.70%. Finally, observing the experimental results in the fourth row when removing the multi-modal supervised contrastive loss, the model has a little drop in the Top-1 accuracy, which confirms that the multi-modal supervised contrastive loss can further increase the similarity of samples belonging to the same class and improve the performance of the model. Table 2 builds multiple models to study the influence of different semantic information on the multi-modal information interaction module. The average feature in the first column means that the average vector of word features encoded by Transformer is used as the input, and the Transformer feature means that the output at [EOS] position of the highest layer is used as the input. The models from the fourth row to the seventh row represent the text semantic parsing features with different semantic parsing steps. Table 2. Analysis of the multi-modal information interaction module on mini-Kinetics200 (%) Textual information

Top-1 Top-5

Average feature

85.16 97.11

Transformer feature

85.04 97.49

One-step semantic parse

84.74 97.27

Two-step semantic parse

85.02 97.17

Three-step semantic parse 85.26 97.05 Four-step semantic parse

85.06 97.19

Table 2 shows that when the average feature and Transformer feature are used as the inputs, the model achieves similar performance, which demonstrates that both features can effectively represent the input without parsing text semantics. The experimental results in the fourth row to the seventh row processed by different semantic parsing steps show that the accuracy of the model is gradually increased, and the model with three parsing steps achieves the highest value, rising from 84.74% to 85.26%, which indicates that parsing the text content can provide more detailed semantic information to guide the learning process of visual features. Table 3 conducts ablation analyses of PromptLearner-CLIP on mini-Kinetics200. For a fair comparison, a baseline model was set up to use ViT to process visual input. ActionCLIP uses the single-modal model ViT as the backbone network for visual feature extraction and builds a multi-modal learning framework to process visual data and text data simultaneously. The fourth, fifth, and sixth rows of Table 3 represent the results removing the multi-modal information interaction module, PromptLearner initialization, and video-text supervised contrastive loss respectively.

564

Z. Zheng et al. Table 3. Ablation study of PromptLearner-CLIP on mini-Kinetics-200 (%) Interaction module Prompt initializaiton Contrastive learning Top-1 Top-5 –





83.70 96.53







84.02 96.85







84.78 97.41







84.82 97.27







85.10 97.39







85.26 97.05

When textual information is incorporated, ActionCLIP improves the classification results of ViT on mini-Kinetics-200, which demonstrates that text content can provide semantic clues for action recognition. From the fourth to seventh rows in the table, when the main modules in PromptLearner-CLIP are removed one by one, the performance decreases to different degrees, indicating that each module in the model has a certain contribution to improving the performance. When the model removes the multi-modal information interaction module, the accuracy in Top-1 drops to 84.78%. Compared with the experimental results of the fifth and sixth rows, the performance drops the most. 4.4

Pathway Finetune

PromptLearner-CLIP contains text and vision pathways to extract features of different modalities. Table 4 summarizes the results of fine-tuning the backbone network parameters of different pathways on mini-Kinetics-200. Table 4 shows that the model finetuning two pathways(the fifth row) is significantly higher than the model freezing the parameters of two pathways(the second row) in the Top-1 accuracy, and the Top-1 and Top-5 accuracies are increased by 3.71% and 0.86% respectively, which demonstrates that it is still necessary to finetune the model parameters on the target dataset and learn the dataset-specific features. Table 4. Analysis of finetuning different pathways on mini-Kinetics-200 (%) Text pathway Vision pathway Top-1 Top-5

4.5





81.55 96.19



-

81.73 95.93





85.04 97.29





85.26 97.05

Different Backbone Networks

In this section, the visual backbone networks have completed the training on Kinetics400 and the parameters of them are frozen. Table 5 shows the accuracies

PromptLearner-CLIP

565

of PromptLearner-CLIP using three visual backbone networks on Kinetics400. In the comparison of each group of the same backbone network, PromptLearnerCLIP achieves performance improvement on ActionCLIP. The experimental results in the third and fifth rows of the table show that reducing the size of frame patches will bring more computation, but the model will learn more detailed relationships between frame regions and discriminative features. The experimental results in the fifth and seventh rows show that more video frames can help the model to obtain more complete action information and improve the accuracy of the model. Table 5. Analysis of different visual backbone networks on Kinetics400 (%) Model

Top-1 Top-5

ActionCLIP(ViT-32-8f)

77.49 93.88

PromptLearner-CLIP(ViT-32-8f)

77.92 94.51

ActionCLIP(ViT-16-8f)

80.32 95.41

PromptLearner-CLIP(ViT-16-8f)

80.86 95.61

ActionCLIP(ViT-16-16f)

81.12 95.73

PromptLearner-CLIP(ViT-16-16f) 81.60 95.86

4.6

Comparison with State-of-the-Art Methods

Finally, PromptLearner-CLIP and current state-of-the-art methods are compared on the Kinetics400 dataset. Table 6 summarizes Top-1 and Top-5 accuracies on Kinetics400 of different methods. First, the classification accuracies of Transformer-based methods in the table, such as MViT-B [8], TimeSformer-L [42], and ViT-B-VTN [27] on Kinetics400 are higher than 2D CNN-based [23] and 3D CNN-based [10] methods. PromptLearner-CLIP uses the same backbone network and achieves higher experimental results, although the number of input frames to the model is less than these three models. ViViT-L [2] uses a deeper ViT as the backbone network and achieves 80.6% Top-1 accuracy on Kinetics400, which is lower than our model by 1.0%. When ViViT-L initializes ViT-B/16 parameters with the model pre-trained on the large-scale dataset JFT, ViViT-L achieves the best experimental results in the table. Deeper models, more video frames, larger image resolutions, and larger pre-training datasets will stimulate the potential of the model to achieve a higher action recognition accuracy. 4.7

Finetune on Small Datasets

Finally, the PromptLearner-CLIP pre-trained on Kinetics400 is transferred to HMDB51 and UCF101; the results are shown in Table 7. MSM-ResNets [54] takes RGB images, optical flow images, and action saliency images as inputs, and obtains 66.7% on HMDB51 and 93.5% on UCF101. Since MSM-ResNets is not pre-trained on a large-scale video dataset, its performance is significantly lower than that of the current state-of-the-art methods. PoTion [5] extracts the motion information of human pose from the video to learn pose motion features. Two-stream I3D [4] extracts appearance features and action features from RGB images and optical flow images respectively. When the spatial-temporal features of I3D are fused with the PoTion features, the accuracies improve by 0.7% on HMDB51 and 0.3% on UCF101. However, I3D+PoTion is lower than PromptLearner-CLIP+I3D(Flow) on both datasets, revealing that the multi-modal learning framework proposed in this paper can effectively exploit the action clues contained in other modalities to improve performance. Overall, the comparison of the different methods in the table shows that PromptLearner-CLIP achieves results competitive with state-of-the-art methods.


Table 7. Comparison with state-of-the-art methods on HMDB51 and UCF101 (%)
Model                            HMDB51   UCF101
MSM-ResNets [54]                 66.7     93.5
SVT [30]                         67.2     93.7
two-stream TSN [39]              68.5     94.0
Temporal Squeeze Network [13]    71.5     95.2
TVNet+IDT [9]                    72.6     95.4
TokenShift [48]                  -        96.8
TEA-ResNet50 [23]                73.3     96.9
VidTr [49]                       74.4     96.7
Dense Dilated Network [45]       74.5     96.9
Dynamic Network [51]             75.5     96.8
ActionCLIP [40]                  76.2     97.1
BQN [14]                         77.6     97.6
S3D-G [44]                       78.2     96.8
MARS+RGB [6]                     79.5     97.6
SIFP+SlowFast [20]               80.1     96.9
two-stream I3D [4]               80.2     97.9
PoTion+I3D [5]                   80.9     98.2
STRM [32]                        81.3     98.1
STA-MARS [21]                    81.4     98.4
PromptLearner-CLIP+I3D(Flow)     81.3     98.5

5 Conclusion

This paper proposes PromptLearner-CLIP, a multi-modal semantic-guided action recognition network that uses textual information to enhance the representation ability of features. Experiments on Kinetics400, UCF101, and HMDB51 demonstrate that PromptLearner can automatically learn the text content of the prompt and provide semantic clues for action recognition. In addition, through the multi-modal information interaction module, features of different modalities pass information to each other and the differences between multi-modal features are handled effectively, and the supervised contrastive loss further reduces the feature distance between samples of the same class but different modalities. PromptLearner-CLIP achieves highly competitive accuracies on these three action recognition datasets compared with state-of-the-art methods. In future work, we will study methods that incorporate the visual content of a video into the prompt to better learn semantic information.


Acknowledgements. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFE0110500 and in part by the National Natural Science Foundation of China under Grant 62006015 and Grant 62072028.

References 1. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086. IEEE, Salt Lake City, UT, USA (2018) 2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Luˇci´c, M., Schmid, C.: ViViT: a video vision transformer. In: ICCV, pp. 6836–6846. IEEE, Montreal, Canada (2021) 3. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, pp. 813–824. ACM, Virtual (2021) 4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR, pp. 4724–4733. IEEE, Honolulu, HI, USA (2017) 5. Choutas, V., Weinzaepfel, P., Revaud, J., Schmid, C.: PoTion: pose motion representation for action recognition. In: CVPR, pp. 7024–7033. IEEE, Salt Lake City (2018) 6. Crasto, N., Weinzaepfel, P., Alahari, K., Schmid, C.: MARS: motion-augmented RGB stream for action recognition. In: CVPR, pp. 7874–7883. IEEE, Long Beach, CA, USA (2019) 7. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR, pp. 1–21. Virtual (2021) 8. Fan, H., et al.: Multiscale vision transformers. In: ICCV, pp. 6824–6835. IEEE, Montreal, Canada (2021) 9. Fan, L., Huang, W., Gan, C., Ermon, S., Gong, B., Huang, J.: End-to-end learning of motion representation for video understanding. In: CVPR, pp. 6016–6025. IEEE, Salt Lake City, UT, USA (2018) 10. Feichtenhofer, C.: X3D: expanding architectures for efficient video recognition. In: CVPR, pp. 200–210. IEEE, Seattle, WA, USA (2020) 11. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: CVPR, pp. 6202–6211. IEEE, Seoul, Korea (2019) 12. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR, pp. 9726–9735. IEEE, Seattle, WA, USA (2020) 13. Huang, G., Bors, A.G.: Learning spatio-temporal representations with temporal squeeze pooling. In: ICASSP, pp. 2103–2107. IEEE, Barcelona, Spain (2020) 14. Huang, G., Bors, A.G.: Busy-quiet video disentangling for video classification. In: WACV, pp. 1341–1350. IEEE, Waikoloa, HI, USA (2022) 15. Kar, A., Rai, N., Sikka, K., Sharma, G.: AdaScan: adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In: CVPR, pp. 5699–5708. IEEE, Honolulu, HI, USA (2017) 16. Kay, W., et al.: The kinetics human action video dataset (2017) 17. Khosla, P., et al.: Supervised contrastive learning. In: NeurIPS, pp. 18661–18673. MIT Press, Virtual (2021) 18. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV, pp. 2556–2563. IEEE, Barcelona, Spain (2011)


19. Lei, J., et al.: Less is more: clipBERT for video-and-language learning via sparse sampling. In: CVPR, pp. 7331–7341. IEEE, Virtual (2021) 20. Li, J., Wei, P., Zhang, Y., Zheng, N.: A slow-i-fast-p architecture for compressed video action recognition. In: ACM MM, pp. 2039–2047. ACM, Seattle, WA, USA (2020) 21. Li, J., Liu, X., Zhang, W., Zhang, M., Song, J., Sebe, N.: Spatiotemporal attention networks for action recognition and detection. IEEE Trans. Multimedia 22(11), 2990–3001 (2020) 22. Li, X., Wang, Y., Zhou, Z., Qiao, Y.: SmallBigNet: integrating core and contextual views for video classification. In: CVPR, pp. 1092–1101. IEEE, Seattle, WA, USA (2020) 23. Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L.: Tea: temporal excitation and aggregation for action recognition. In: CVPR, pp. 906–915. IEEE, Seattle, WA, USA (2020) 24. Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV, pp. 7082–7092. IEEE, Seoul, Korea (2019) 25. Liu, Z., et al.: TEINet: towards an efficient architecture for video recognition. In: AAAI, pp. 11669–11676. AAAI, New York, USA (2020) 26. Liu, Z., Wang, L., Wu, W., Qian, C., Lu, T.: TAM: temporal adaptive module for video recognition. In: ICCV, pp. 13708–13718. IEEE, Montreal, Canada (2021) 27. Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: ICCV, pp. 3163–3172. IEEE, Montreal, Canada (2021) 28. Pan, T., Song, Y., Yang, T., Jiang, W., Liu, W.: VideoMoCo: contrastive video representation learning with temporally adversarial examples. In: CVPR, pp. 11200– 11209. IEEE, Virtual (2021) 29. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763. ACM, Virtual (2021) 30. Ranasinghe, K., Naseer, M., Khan, S., Khan, F.S., Ryoo, M.: Self-supervised video transformer. In: CVPR, pp. 2874–2884. IEEE, New Orleans, Louisiana, USA (2022) 31. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild (2012) 32. Thatipelli, A., Narayan, S., Khan, S., Anwer, R.M., Khan, F.S., Ghanem, B.: Spatio-temporal relation modeling for few-shot action recognition. In: CVPR, pp. 19958–19967. IEEE, New Orleans, Louisiana, USA (2022) 33. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV, pp. 4489–4497. IEEE, Santiago, Chile (2015) 34. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR, pp. 6450–6459. IEEE, Salt Lake City, UT, USA (2018) 35. Truong, T.D., et al.: DirecFormer: a directed attention in transformer approach to robust action recognition. In: CVPR, pp. 20030–20040. IEEE, New Orleans, Louisiana, USA (2022) 36. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008. MIT Press, Long Beach, CA, USA (2017) 37. Wang, L., Li, W., Li, W., Van Gool, L.: Appearance-and-relation networks for video classification. In: CVPR, pp. 1430–1439. IEEE, Salt Lake City, UT, USA (2018) 38. Wang, L., Tong, Z., Ji, B., Wu, G.: TDN: temporal difference networks for efficient action recognition. In: CVPR, pp. 1895–1904. IEEE, Virtual (2021)


39. Wang, L., et al.: Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell. 41(11), 2740–2755 (2019) 40. Wang, M., Xing, J., Liu, Y.: ActionCLIP: a new paradigm for video action recognition (2021) 41. Wang, R., et al.: BEVT: BERT pretraining of video transformers. In: CVPR, pp. 14733–14743. IEEE, New Orleans, Louisiana, USA (2022) 42. Wang, X., Gupta, A.: Videos as space-time region graphs. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 413–431. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1 25 43. Wu, Z., Jiang, Y.G., Wang, X., Ye, H., Xue, X.: Multi-stream multi-class fusion of deep networks for video classification. In: ACM MM, pp. 791–800. ACM, Amsterdam, Netherlands (2016) 44. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0 19 45. Xu, B., Ye, H., Zheng, Y., Wang, H., Luwang, T., Jiang, Y.: Dense dilated network for video action recognition. IEEE Trans. Image Process. 28(10), 4941–4953 (2019) 46. Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: CVPR, pp. 588–597. IEEE, Seattle, WA, USA (2020) 47. Yang, S., Li, G., Yu, Y.: Dynamic graph attention for referring expression comprehension. In: ICCV, pp. 4643–4652. IEEE, Seoul, Korea (2019) 48. Zhang, H., Hao, Y., Ngo, C.W.: Token shift transformer for video classification. In: ACM MM, pp. 917–925. ACM, Chengdu, China (2021) 49. Zhang, Y., et al.: VidTr: video transformer without convolutions. In: ICCV, pp. 13577–13587. IEEE, Montreal, Canada (2021) 50. Zhao, Y., Wang, G., Luo, C., Zeng, W., Zha, Z.J.: Self-supervised visual representations learning by contrastive mask prediction. In: ICCV, pp. 10160–10169. IEEE, Virtual (2021) 51. Zheng, Y., Liu, Z., Lu, T., Wang, L.: Dynamic sampling networks for efficient action recognition in videos. IEEE Trans. Image Process. 29, 7970–7983 (2020) 52. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vision 130, 2337–2348 (2022) 53. Zhu, L., Yang, Y.: ActBERT: learning global-local video-text representations. In: CVPR, pp. 8746–8755. IEEE, Seattle, WA, USA (2020) 54. Zong, M., Wang, R., Chen, X., Chen, Z., Gong, Y.: Motion saliency based multistream multiplier ResNets for action recognition. Image Vis. Comput. 107, 104108 (2021)

Causal Property Based Anti-conflict Modeling with Hybrid Data Augmentation for Unbiased Scene Graph Generation

Ruonan Zhang and Gaoyun An

1 Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
2 Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
{21120318,gyan}@bjtu.edu.cn

Abstract. Scene Graph Generation (SGG) aims to detect visual triplets of pairwise objects based on object detection. There are three key factors being explored to determine a scene graph: visual information, local and global context, and prior knowledge. However, conventional methods balancing losses among these factors lead to conflict, causing ambiguity, inaccuracy, and inconsistency. In this work, to apply evidence theory to scene graph generation, a novel plug-and-play Causal Property based Anti-conflict Modeling (CPAM) module is proposed, which models key factors by Dempster-Shafer evidence theory, and integrates quantitative information effectively. Compared with the existing methods, the proposed CPAM makes the training process interpretable, and also manages to cover more fine-grained relationships after inconsistencies reduction. Furthermore, we propose a Hybrid Data Augmentation (HDA) method, which facilitates data transfer as well as conventional debiasing methods to enhance the dataset. By combining CPAM with HDA, significant improvement has been achieved over the previous state-of-the-art methods. Extensive ablation studies have also been conducted to demonstrate the effectiveness of our method.

Keywords: Scene graph generation · D-S evidence theory · Data augmentation

1 Introduction

Scene Graph Generation (SGG) is an important task in computer vision, which can bridge low-level visual tasks such as object detection [26] and high-level visual tasks such as visual question answering [44], 3D scene synthesis [25], image captioning [38], etc. Today's SGG [3,30,32,42] is considered as a deterministic task, recognizing relationships between pairwise objects. Recent work has made steady progress on SGG and provides powerful models to encode both the visual and linguistic context of the scene. As illustrated in Fig. 1, a scene graph is composed of visual triplets in the form of subject-predicate-object.


Fig. 1. An example of Visual Genome [14] dataset and its scene graph. Obviously, most of the predicates in this image are trivial, and the same is true for the dataset.

However, due to the long-tailed distribution of annotations, SGG is far from practical. In Visual Genome [14], there are 50 predicate classes, yet the top 5 predicate classes account for more than 100K samples [43]. BA-SGG [12] divided all predicates into two categories: informative and common. The frustrating fact is that annotators prefer common predicates, which are exactly the "head" predicates of the dataset, so the model can hardly be blamed for generating trivial and less informative predicates. To address this problem, we propose a Hybrid Data Augmentation (HDA) method. Meanwhile, as shown in Fig. 2, counterfactual inference with a causal graph [29] is also adopted to eliminate this highly-skewed long-tailed bias. A causal graph summarizes three key factors that determine a scene graph: the visual appearance feature, the context feature, and prior knowledge. Specifically, visual features are extracted from paired objects, context features are encoded by an RNN or GNN for each object, and prior knowledge denotes the predicate class distribution in the dataset once the categories of subject and object are identified. The predicate confidence scores are then predicted from these features separately, and the scores are finally fused, usually by adding or gating, to obtain the final result. Though models based on causal graphs outperform conventional ones, we find an extra bias introduced by causal graphs. As shown in Fig. 3, the output distributions of the branches of a causal graph may conflict with each other. For the triplet helmet-on-shelf, if only the context branch is considered, the final prediction is "on", while the visual branch tends to wrongly predict "in". Meanwhile, "background" is regarded as the most probable prediction by the prior knowledge. What is worse, after these three predicted distributions are fused, the final result is misdirected to "in", although one branch performs correctly. To eliminate this extra bias, we propose a Causal Property Anti-conflict Modeling (CPAM) module that directly models these three branches via Dempster-Shafer (D-S) evidence theory [40], which was first applied in expert systems [20] to handle uncertain information. In D-S evidence theory, pieces of evidence are mutually exclusive probability distributions that represent all possible answers to a question. By transferring this theory to the causal-graph-based SGG task, the predicted confidence scores of the three branches can be explicitly modeled as evidence; the fused scores can then be computed by the Dempster combination rules, and thus this additional bias may be removed. Notably, CPAM is a plug-and-play module.

Fig. 2. The causal graph [29] in SGG. The final confidence scores of predicates come for three branches: context feature encoded by RNN or GNN from each proposal, visual appearance feature extracted by paired proposals, and prior knowledge distribution after the categories of subject and object are identified. Meanwhile, as input of the context branch, object visual features are also passed to the other two branches.

To sum up, the main contributions of our work are as follows: (1) We systematically reveal the long-tailed bias that limits SGG's overall performance, as well as an extra bias introduced by the multi-branch prediction of a causal graph. (2) We propose a novel Causal Property Anti-conflict Modeling (CPAM) module, which applies Dempster-Shafer evidence theory and can serve as a plug-and-play module. (3) By combining data transfer with a conventional debiasing method, we propose a Hybrid Data Augmentation (HDA) method to balance the data distribution in Visual Genome. (4) Experimental results demonstrate the effectiveness of the proposed CPAM and HDA, which achieve state-of-the-art performance under existing evaluation metrics on Visual Genome and improve the mean recall metric significantly.

2 Related Work

2.1 Scene Graph Generation

Scene Graph Generation has drawn widespread attention in the computer vision community since Lu et al. [24] formalized SGG as a visual task of recognizing relationships between objects. After the large-scale image semantic understanding dataset Visual Genome [14] was released, SGG gradually became a popular task in computer vision. However, more and more researchers have recognized the highly-skewed long-tailed distribution of the dataset, which seriously limits SGG performance. To address this issue, TDE [29] introduced counterfactual causal analysis in the inference stage, BA-SGG [12] transferred common predicates to informative predicates by semantic adjustment, BGNN [16] proposed a bi-level data sampling strategy to sample instances of different entities and predicates, and SHA-GCL [9] first grouped all the predicates and then employed a median-resampling method to obtain a balanced distribution. We therefore also propose an effective strategy, HDA, to reduce the impact of the long-tailed bias, and the experiments show better results.

Fig. 3. An example of the triplet helmet-on-shelf from Fig. 1. Predicate probability distributions arising from different causal properties in the re-implemented MOTIFS [42] are shown in (a) to (c). The predictions generated from prior knowledge in (c) are intuitive guesses obtained from the probability distribution of the dataset once the categories of subject and object are known. The predictions of these branches cause ambiguity because the maximum logits of (a), (b), and (c) are not all the same. After CPAM, the wrong prediction is corrected.

The existing SGG models can be roughly divided into two categories: two-stage methods [22,35,45] and one-stage methods. One-stage methods [17-19] predict object labels and relationships simultaneously, which loses the assistance of language semantics during training. The co-occurrence of object pairs and predicates in the dataset shows that knowing the categories of the subject and object is quite helpful for making a correct prediction; for example, the predicate between "person" and "horse" is probably "riding on". Specifically, two-stage methods can provide extra semantic information from object labels, while one-stage methods cannot. Currently, two-stage methods are in the slight majority, and we also build our pipeline on this paradigm.

2.2 Evidence Theory

In 1967, Dempster proposed evidence theory and applied it to statistical problems [4,5]. Shafer then published the first monograph on evidence theory in 1976; by introducing the concept of the belief function, Shafer developed and improved the theory, now known as Dempster-Shafer (D-S) theory, marking the birth of evidence theory [27]. Evidence theory has since developed rapidly because of its strength in uncertain inference, which is why we apply it to model the ambiguity brought by the different predictions of a causal graph. Evidence theory is widely used in pattern recognition [6], information fusion [10], artificial intelligence [1], expert systems [36], etc. Researchers in the computer vision community are also trying to introduce evidence theory to model uncertainty [8,15,31,33]. More detailed background knowledge can be found in [11,13,37,40].

3 Approach

3.1 Overview of Approach

Conventional SGG models can be defined as SG = (B, O, R), where B, O and R represent the bounding box, object, and relational prediction models respectively. Given an image I, the whole process of SGG can conventionally be expressed as:

P(SG|I) = P(B|I) P(O|B, I) P(R|O, B, I).   (1)

Faster R-CNN [26] is adopted to return the coordinates b_i ∈ R^4 of each proposal, the visual feature v_i ∈ R^4096, and the object classification confidence scores c_i ∈ R^O of each node, where O is the number of object categories. The visual feature x_i of each node is formed by concatenating a semantic embedding feature, a spatial embedding feature, and the feature extracted by Faster R-CNN. The global context is then encoded, and object-level and edge-level contexts are obtained. Furthermore, the paired-box feature of the i-th and j-th proposals is extracted, denoted u_ij ∈ R^4096. As shown in Fig. 4(c), to better capture the connection of properties in a causal graph and avoid conflict, our proposed CPAM module is adopted, which employs uncertainty modeling instead of a discriminative manner; the conventional fusion of causal properties is replaced with CPAM. As shown in Fig. 4(a), our proposed HDA method is also adopted to relieve the long-tailed distribution.
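As a rough illustration of how the node feature x_i described above could be assembled, the sketch below concatenates the Faster R-CNN visual feature with a semantic embedding of the predicted object label and a spatial embedding of the box. It is not the authors' code, and the embedding dimensions other than 4096 are arbitrary assumptions.

import torch
import torch.nn as nn

class NodeFeature(nn.Module):
    def __init__(self, num_obj_classes, sem_dim=200, spa_dim=128):
        super().__init__()
        self.obj_embed = nn.Embedding(num_obj_classes, sem_dim)           # semantic embedding of the label
        self.box_embed = nn.Sequential(nn.Linear(4, spa_dim), nn.ReLU())  # spatial embedding of b_i

    def forward(self, vis_feat, obj_label, box):
        # vis_feat: (N, 4096) Faster R-CNN feature, obj_label: (N,) long, box: (N, 4)
        return torch.cat([self.obj_embed(obj_label), self.box_embed(box), vis_feat], dim=-1)  # x_i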

Fig. 4. The pipeline of our proposed method. In this work, we utilize Faster R-CNN as an object detector, and divide the method into three parts: (a) Enhancing the dataset to balance data distribution. (b) Extracting all the features used in the framework, including initial object relabel generated by Faster R-CNN, semantics embedding feature and encoded global context, etc. (c) Modeling data distributions of different branches based on the causal graph.

3.2 D-S Evidence Theory

D-S evidence theory models uncertainty by applying the Dempster combination rules [40]. Some basic concepts of evidence theory are introduced below.

Definition 1. Frame of Discernment. Θ = {H_1, H_2, ..., H_N} is defined as a frame of discernment, which consists of a set of mutually exclusive non-empty events, namely ∀i, j ∈ [1, N] with i ≠ j, H_i ∩ H_j = Φ, where Φ is the empty set and H_i denotes the i-th event. In the SGG task, the events are the candidate predicate classes scored by the three branches of a causal graph.

Definition 2. Mass Function. A Basic Probability Assignment (BPA) is the output of the mass function. The mass function is a mapping that satisfies the following constraints, where n is the number of predicate categories and m(H_i) is the confidence score of each predicate class:

m(Φ) = 0,  Σ_{i=1}^{n} m(H_i) = 1.   (2)

Definition 3. Belief Function and Plausibility Function. Bel(·) and Pl(·) are basic concepts in evidence theory that measure the confidence of evidence. Bel(A) is the sum of the BPAs of all subsets of proposition A, while Pl(A) is the sum of the BPAs of all subsets intersecting proposition A:

Bel(A) = Σ_{B⊆A} m(B),   (3)

Pl(A) = Σ_{B∩A≠Φ} m(B).   (4)
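The following toy example (not taken from the paper) illustrates Definition 3: it computes Bel(A) and Pl(A) for a small made-up BPA whose focal elements are subsets of a three-predicate frame of discernment.

frame = frozenset({"on", "standing on", "in"})
m = {frozenset({"on"}): 0.5,
     frozenset({"on", "standing on"}): 0.3,
     frame: 0.2}                     # toy masses, sum to 1

def bel(A):
    # sum of masses of all focal elements contained in A
    return sum(mass for B, mass in m.items() if B <= A)

def pl(A):
    # sum of masses of all focal elements intersecting A
    return sum(mass for B, mass in m.items() if B & A)

A = frozenset({"on", "standing on"})
print(bel(A), pl(A))  # 0.8 1.0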

Definition 4. Dempster Combination Rules. Suppose there are two independent and completely reliable pieces of evidence, corresponding to BPAs m1 and m2 respectively. ∀A ⊆ Θ, the Dempster Combination Rules (DCR) are defined by:

m(A) = 0, if A = Φ;   m(A) = (1 / (1 − K)) Σ_{B∩C=A} m1(B) m2(C), if A ≠ Φ,   (5)

with

K = Σ_{B∩C=Φ} m1(B) m2(C),   (6)

where K is the conflict coefficient between m1 and m2. Note that some works denote 1 − K as the conflict coefficient; the effect is the same. In our case, the predicted distribution of the visual branch is modeled as B, while C represents the result of the context branch; the intersection of B and C corresponds to confidence scores assigned to the same predicate class.
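A minimal sketch of Definition 4 for the singleton case used here is given below. It is an illustration rather than the authors' implementation, and it assumes the two BPAs are plain NumPy arrays over the N predicate classes, so B ∩ C = Φ whenever the indices differ.

import numpy as np

def dempster_combine(m1, m2):
    # m1, m2: length-N arrays that each sum to 1 (basic probability assignments)
    joint = np.outer(m1, m2)                 # m1(B) * m2(C) for every pair (B, C)
    K = joint.sum() - np.trace(joint)        # conflict: total mass with B ∩ C = Φ, Eq. (6)
    combined = np.diag(joint) / (1.0 - K)    # renormalized matched mass, Eq. (5)
    return combined, K

m1 = np.array([0.7, 0.2, 0.1])   # e.g. visual-branch scores (toy values)
m2 = np.array([0.1, 0.8, 0.1])   # e.g. context-branch scores (toy values)
m, K = dempster_combine(m1, m2)
print(K, m, m.sum())             # the combined assignment sums to 1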

3.3 Causal Property Anti-conflict Modeling

To reduce the possibility of conflict between causal properties, a Causal Property Anti-conflict Modeling module is proposed. To apply D-S evidence theory to SGG, we first need to specify the frame of discernment and its exclusive events. Taking "person" as the subject and "motorbike" as the object, for example, the events are of the form "the predicate between them is riding" or "the predicate between them is sitting on". There are 50 predicate classes in Visual Genome, so for each pair of objects the discriminant results of the predicate category are the events in the frame of discernment. To capture the intrinsic uncertainty of a causal graph, we explicitly model the output distribution of each branch as evidence. First, the visual feature v_i and context feature c_i are projected into a subspace of the same dimension, where W1, W2 ∈ R^{4096×512} are linear transformation matrices. To satisfy Eq. (2), the projected features are normalized as:

v′ = softmax(W1^T v),   (7)

c′ = softmax(W2^T c).   (8)

To obtain the conflict coefficient K, an auxiliary matrix A is introduced, whose elements are all 1 except that the diagonal elements are 0. N is the number of predicate categories and I ∈ R^{N×N} is the identity matrix. After obtaining K, the combined probability assignment m is computed and used as the weighting coefficient of the visual distribution v′ and the context distribution c′, to which the frequency distribution of each pair of objects is added. We call this approach binary causal attribute fusion; a triple fusion among all three properties is also introduced, and the two strategies are compared in Sect. 4.4. The classification score vector of the relationships is obtained as follows:

K = Σ_{i=1}^{N} Σ_{j=1}^{N} v′(i) × c′(j) × A(i, j),   (9)

m = (1 / (1 − K)) Σ_{i=1}^{N} Σ_{j=1}^{N} v′(i) × c′(j) × I(i, j),   (10)

p = m × v′ + m × c′ + frequency.   (11)

The category of the relationship between a pair of nodes is then predicted by:

r = arg max_{r∈N} (p).   (12)
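A condensed sketch of the fusion in Eqs. (7)-(12) is given below. It is not the released code: the projection matrices are assumed to map directly to the N predicate scores, and the frequency prior is assumed to be a precomputed length-N vector.

import torch
import torch.nn.functional as F

def cpam_fuse(v, c, W1, W2, frequency):
    # v, c: (4096,) visual and context features; W1, W2: (4096, N); frequency: (N,)
    v_p = F.softmax(W1.t() @ v, dim=-1)           # Eq. (7)
    c_p = F.softmax(W2.t() @ c, dim=-1)           # Eq. (8)
    N = v_p.shape[0]
    A = 1.0 - torch.eye(N)                        # ones everywhere except the diagonal
    joint = torch.outer(v_p, c_p)
    K = (joint * A).sum()                         # conflict coefficient, Eq. (9)
    m = (joint * torch.eye(N)).sum() / (1.0 - K)  # combined assignment, Eq. (10)
    p = m * v_p + m * c_p + frequency             # Eq. (11)
    return p.argmax()                             # predicted predicate class, Eq. (12)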

3.4 Hybrid Data Augmentation

There are many semantically ambiguous triplets in the dataset. Some predicates share similar semantic spaces like “holding” and “carrying”, while some predicates can reasonably describe relations at the same time, such as “on” and “standing on”.


To deal with the highly-skewed long-tailed distribution, a Hybrid Data Augmentation method is proposed. Based on [43], internal and external data transfer are adopted to deal with the semantic-ambiguity and long-tailed problems. Internal data transfer tries to transform general predicates into informative predicates, and external data transfer takes advantage of negative samples; specifically, negative samples are relabeled to generate a more diverse training set. Furthermore, to obtain a more enhanced dataset, the conventional resampling strategy [2] is introduced: predicate categories in the tail of the data distribution are up-sampled according to a sampling fraction during training. Count(·) denotes the number of training samples in the dataset. We set p = 3.0 as default, and the sampling rate ϕ_i is calculated as below.

ϕ_i = 1.0, if Σ_{k=1}^{M} Count(p_k) …

τ > 0 is a temperature parameter controlling the smoothness of the distribution. Similarly, we can also compute the cross-domain caption-to-video similarity probability distribution as:

Q^{cv}_{ij} = exp(s(c_i, v_j)/τ) / Σ_{l=1}^{B} exp(s(c_i, v_l)/τ), ∀j ∈ B.   (5)

Given the above two distributions, we use the knowledge distillation loss to measure the dissimilarity between the caption-to-caption distribution P and the caption-to-video distribution Q using the Kullback-Leibler divergence:

L_cap_distill = (1/B) Σ_{i=1}^{B} KL(P^{cc}_{i,:} || Q^{cv}_{i,:}).   (6)

Analogously, the similarity between video features can be used as another training signal. The video-to-video and video-to-caption similarities can also be written in the form of two distributions:

P^{vv}_{ij} = exp(s(v_i, v_j)/τ) / Σ_{l=1}^{B} exp(s(v_i, v_l)/τ), ∀j ∈ B,   (7)

and

Q^{vc}_{ij} = exp(s(c_j, v_i)/τ) / Σ_{l=1}^{B} exp(s(c_l, v_i)/τ), ∀j ∈ B.   (8)

Again, the video distillation loss is computed as the Kullback-Leibler divergence between the two distributions:

L_vid_distill = (1/B) Σ_{i=1}^{B} KL(P^{vv}_{i,:} || Q^{vc}_{i,:}).   (9)
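The two distillation losses in Eqs. (4)-(9) can be sketched in PyTorch as follows. This is an illustration rather than the authors' code; it assumes cosine similarity for s(·,·) and an arbitrary temperature value.

import torch
import torch.nn.functional as F

def sim(a, b):
    # (B, B) pairwise similarities between two batches of embeddings
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

def distillation_losses(cap, vid, tau=0.07):
    P_cc = F.softmax(sim(cap, cap) / tau, dim=-1)   # within-domain teacher for captions
    Q_cv = F.softmax(sim(cap, vid) / tau, dim=-1)   # cross-domain student, Eq. (5)
    P_vv = F.softmax(sim(vid, vid) / tau, dim=-1)   # within-domain teacher for videos, Eq. (7)
    Q_vc = F.softmax(sim(vid, cap) / tau, dim=-1)   # cross-domain student, Eq. (8)
    # KL(P || Q) averaged over the batch, Eqs. (6) and (9)
    L_cap = F.kl_div(Q_cv.log(), P_cc, reduction="batchmean")
    L_vid = F.kl_div(Q_vc.log(), P_vv, reduction="batchmean")
    return L_cap, L_vid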

3.4 Composition Loss for Training Retrieval Model

We add the two new distillation loss functions in Eq. (6) and Eq. (9) to the standard bidirectional max-margin ranking loss for training the compatibility function between a caption and a video. The distillation loss for the captions is given by:

L_cap_compose = L_margin + L_cap_distill.   (10)

In the same way, the knowledge distillation for videos is given by:

L_vid_compose = L_margin + L_vid_distill.   (11)

In our experiments, we train with both losses and also show an ablation study for each loss function. During development, we explored several ways for combining the two losses and find that treating both of them equally yields the best results.
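For reference, a hedged sketch of the standard bidirectional max-margin ranking loss L_margin used in Eqs. (10) and (11) is given below; the exact formulation and the margin value used by each framework may differ.

import torch

def bidirectional_max_margin(s, margin=0.2):
    # s: (B, B) caption-video similarity matrix with positive pairs on the diagonal
    B = s.size(0)
    pos = s.diag().view(B, 1)
    cost_c2v = (margin + s - pos).clamp(min=0)        # caption i against negative videos
    cost_v2c = (margin + s - pos.t()).clamp(min=0)    # video j against negative captions
    mask = 1.0 - torch.eye(B, device=s.device)        # drop the positive pairs themselves
    return ((cost_c2v + cost_v2c) * mask).sum() / B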

4 Experimental Results

4.1 Text-Video Retrieval Frameworks

Retrieval Frameworks. One advantage of the proposed method is its ability to be used with different TVR frameworks. To demonstrate this, we experiment with three recent TVR frameworks: Multi-modal Transformer (MMT) [15], CLIP4Clip [37], and TEACHTEXT [9]. Both MMT and TEACHTEXT are based on the Collaborative Experts (CE) architecture [35] with multiple experts for video encoding, while CLIP4Clip [37] is the most recent TVR framework with a single video encoder based on a vision transformer.

Video Encoders. MMT [15] uses seven pretrained experts for video encoding, namely: Motion from S3D [56], Audio from VGGish [21], Scene from DenseNet [25], OCR from the output of a text detector embedded with Word2Vec [42], Face from ResNet50 [18] trained on VGGFace2, Speech using Google Cloud Speech, and Appearance from SENet-154 [24]. TEACHTEXT [9] also uses seven experts for video encoding. Most of these video experts are similar to those of MMT, except for the Motion expert, which comprises two action experts, Action(KN) and Action(IG). The first action expert, Action(KN), is an I3D model trained on Kinetics [4], and the second, Action(IG), is a 34-layer R(2+1)D model [52] pretrained on the IG-65m dataset [16]. TEACHTEXT establishes a stronger baseline than CE (denoted CE+) by using the more powerful text embeddings from GPT2-XL [50]. The full TEACHTEXT framework (denoted TT-CE+) is trained with additional knowledge given by multiple text encoders as teachers, including Word2Vec [42], GPT2-XL [50], and GPT2-XL-F. CLIP4Clip [37] uses the pretrained CLIP [49] model as a single video encoder, based on the Vision Transformer (ViT) [12]. In our experiments, we use the pretrained CLIP (ViT-B/16) [49] as our backbone to encode video; the visual encoder has 12 layers and a patch size of 16. Unfortunately, due to our limited computational resources, we have to keep the first 6 layers of the ViT-B/16 frozen during training. Furthermore, we can only train our model with relatively smaller batch sizes than in [37]: the batch size is set to 24 for the MSRVTT and MSVD datasets, and for DiDeMo we can only use a batch size of 6, since the videos in this dataset are relatively long and the video encoder requires a long observation window of 64 frames. As a result, our reproduced results for CLIP4Clip on DiDeMo are not as good as the previously reported results [37].

Training Details. Our training procedure is based on the implementations of MMT (https://github.com/gabeur/mmt), TEACHTEXT (https://github.com/albanie/collaborative-experts), and CLIP4Clip (https://github.com/ArrowLuo/CLIP4Clip). We train all the models in PyTorch [46] with the Adam optimizer [28]. The bidirectional max-margin ranking loss is used for MMT and TEACHTEXT (CE+ and TT-CE+), while the normalized softmax loss is used in CLIP4Clip for contrastive learning. If not otherwise specified, all training parameters are the same as reported in MMT [15], TEACHTEXT [9], and CLIP4Clip [37].

Table 1. Benefits of knowledge distillation for the MMT framework on the MSRVTT and ActivityNet datasets. Both types of distillation effectively improve text-to-video and video-to-text retrieval on all datasets; Caption Distillation is slightly better than Video Distillation.
Datasets & Methods             Text → Video             Video → Text
MSRVTT 1k-A                    R@1   R@5   R@10         R@1   R@5   R@10
MMT [15]                       26.6  57.1  69.6         27.0  57.5  69.7
MMT + Caption Distillation     27.8  58.4  70.4         27.0  58.8  70.2
MMT + Video Distillation       26.7  59.0  71.8         26.4  58.1  71.7
MSRVTT 1k-B                    R@1   R@5   R@10         R@1   R@5   R@10
MMT [15]                       20.3  49.1  63.9         21.1  49.4  63.2
MMT + Caption Distillation     20.8  52.4  66.5         22.7  52.5  67.4
MMT + Video Distillation       21.3  51.4  66.2         21.3  52.5  66.3
ActivityNet                    R@1   R@5   R@50         R@1   R@5   R@50
MMT [15]                       22.7  54.2  93.2         22.9  54.8  93.1
MMT + Caption Distillation     24.3  57.1  94.1         26.5  58.7  94.1
MMT + Video Distillation       24.4  58.0  94.0         25.7  58.0  93.3

Table 2. Text-to-video retrieval performance for CLIP4Clip with and without knowledge distillation on the MSRVTT, MSVD and DiDeMo datasets. Both types of distillation effectively improve the performance on all datasets. All results are obtained with ViT-B/16 as the video encoder; Caption Distillation is slightly better than Video Distillation.
Datasets and methods                  R@1↑   R@5↑   R@10↑   MnR↓
MSRVTT 1k-A
CLIP4Clip [37]                        43.4   72.3   80.5    15.4
CLIP4Clip + Caption Distillation      46.2   73.1   81.2    13.2
CLIP4Clip + Video Distillation        44.7   72.8   81.1    13.8
MSVD
CLIP4Clip [37]                        48.6   78.6   87.2    9.0
CLIP4Clip + Caption Distillation      48.8   79.2   87.5    8.7
CLIP4Clip + Video Distillation        48.9   79.1   87.4    9.0
DiDeMo
CLIP4Clip [37]                        42.0   69.1   78.1    18.8
CLIP4Clip + Caption Distillation      43.2   69.7   79.2    17.5
CLIP4Clip + Video Distillation        43.2   69.2   79.3    17.9

4.2 Datasets

We perform experiments on four challenging TVR benchmarks: MSRVTT [57], ActivityNet [30], DiDeMo [20], and MSVD [5]. We report performance on both text-to-video and video-to-text retrieval tasks. For experiments using TEACHTEXT and CLIP4Clip, we follow [9,37] and report only text-to-video performance.

MSRVTT [57] is a large-scale dataset for video understanding, especially for TVR. The dataset contains 10,000 video clips crawled from the web, and each clip is associated with 20 natural sentences annotated by AMT workers. In the MMT framework, we follow [15,40,58] and perform experiments on Split 1k-A and Split 1k-B. Split 1k-A [58] uses 9000 videos for training and 1000 for testing, while Split 1k-B [40] uses 6656 videos for training and 1000 for testing. In addition, for direct comparison with prior work, we also provide the retrieval performance on the full split of this dataset with the TEACHTEXT framework (6513 videos for training, 497 for validation, and 2990 for testing).

ActivityNet [30] contains 20,000 YouTube videos amounting to 849 video hours with temporally annotated sentence descriptions. Following [60] and [15], we concatenate all the sentence descriptions of each video to form a paragraph. There are 10009 instances in the training set. Following the same paragraph-video retrieval setup as [15,35,60], we evaluate on the "val1" split (4917 videos) of this dataset.

DiDeMo [20] stands for Distinct Describable Moments. This dataset contains 10,464 unedited personal videos with diverse content such as sports, concerts, and pets. Each video in the dataset has three to five captions. We follow [9,35,60] and use 8392 videos for training, 1065 for validation, and 1004 for testing.

MSVD [5] has 1970 video clips associated with 80K English captions. The setup is similar to [9,35]: 1200 clips are used for training, 100 for validation, and 670 for testing. Since MSVD videos do not contain sound, audio-based features are not used for this dataset.

Evaluation Metrics. The performance of all models is evaluated with recall at rank N (R@N), a standard retrieval metric; a better model achieves higher recall. We also report the median rank (MdR) and mean rank (MnR) of the correct results, for which lower numbers indicate better performance. Similar to [9], we also report the geometric mean of R@1, R@5, and R@10 for conciseness; the geometric mean summarizes the overall retrieval performance at multiple recall ranking steps. Since each TVR framework is evaluated on a different subset of datasets, we perform our experiments following the setups of previous works [9,15,37].
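The reported metrics can be computed as in the short sketch below (an illustration, not the official evaluation scripts), where ranks holds the 1-based rank of the correct item for each query.

import numpy as np

def retrieval_metrics(ranks):
    ranks = np.asarray(ranks)
    r = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in (1, 5, 10)}   # recall at rank N
    r["MdR"] = float(np.median(ranks))                                # median rank
    r["MnR"] = float(np.mean(ranks))                                  # mean rank
    r["geom"] = float(np.cbrt(r["R@1"] * r["R@5"] * r["R@10"]))       # geometric mean of R@1/5/10
    return r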

4.3 Knowledge Distillation for the MMT Retrieval Framework

We first evaluate the benefits of knowledge distillation for MMT on the MSRVTT and ActivityNet datasets. As can be seen from Table 1, training retrieval models with knowledge distillation improves the retrieval performance: our models yield significant improvements over their direct baseline MMT on both text-to-video and video-to-text retrieval. For Split 1k-A, Caption Distillation improves the text-to-video retrieval performance from 26.6% to 27.8% at R@1. For R@5 and R@10, the performance gain brought by Video Distillation is even higher, from 57.1% to 59.0% and from 69.6% to 71.8%, respectively. On the ActivityNet dataset, the performance gaps are even wider, clearly demonstrating the benefits of our proposed distillation loss; our models perform better than the direct baseline MMT at every recall step. Both types of knowledge distillation provide benefits for training the retrieval system. Between the two, Caption Distillation performs slightly better than Video Distillation. There are two possible reasons for this. One is that the caption similarity starts from a pretrained BERT model, whereas the video similarity is trained from scratch. The other is that the text similarity computation is arguably simpler in that it is done over a single representation, while the video similarity is computed over a composition of features from multiple experts.

Table 3. Text-to-video retrieval performance of TEACHTEXT [9] without and with Caption Distillation on four datasets
Methods              R@1↑   R@5↑   R@10↑   MdR↓
MSRVTT
CE+ [9]              13.8   36.5   49.4    11.0
+Our Distillation    14.7   37.8   50.6    10.0
TT-CE+ [9]           14.6   37.9   50.9    10.0
+Our Distillation    14.7   38.1   51.1    10.0
ActivityNet
CE+ [9]              19.4   49.3   65.4    6.0
+Our Distillation    20.6   50.6   66.9    5.0
TT-CE+ [9]           23.5   57.2   73.6    4.0
+Our Distillation    23.9   57.3   73.5    4.0
MSVD
CE+ [9]              25.1   56.5   70.9    4.0
+Our Distillation    26.0   58.3   72.9    4.0
TT-CE+ [9]           25.1   56.8   71.2    4.0
+Our Distillation    25.5   57.1   71.7    4.0
DiDeMo
CE+ [9]              18.2   43.9   57.1    7.9
+Our Distillation    20.2   45.2   58.8    7.0
TT-CE+ [9]           21.6   48.6   62.9    6.0
+Our Distillation    21.7   49.2   62.4    5.7

Table 4. Text-to-video retrieval performance for TEACHTEXT methods with and without knowledge distillation. The performance measure is the geometric mean of R@1, R@5 and R@10. For each dataset, the left column is the base model (CE+ or TT-CE+ from TEACHTEXT) and the right column is the result obtained by adding our distillation loss. Our proposed method improves the performance of all base models on all datasets.
Method        MSRVTT         ActivityNet    DiDeMo         MSVD
              Base   Ours    Base   Ours    Base   Ours    Base   Ours
CE+ [9]       29.2   30.4    39.7   41.1    35.8   37.7    46.5   47.9
TT-CE+ [9]    30.4   30.6    46.3   46.5    40.4   40.5    46.6   47.1

4.4 Knowledge Distillation for the CLIP4Clip Retrieval Framework

616

V. Tran et al.

Table 5. Comparison to other methods on MSRVTT 1k-A dataset. Results are obtained by applying caption distillation on CLIP4Clip [37] framework. Method

R@1↑ R@5↑ R@10↑ MnR↓

ActBERT [61] MIL-NCE [39] JSFusion [58] HT [41] HT-pretrained [41] CE [35] CLIP [49] MMT [15] MMT-pretrained [15] TT-CE+ [9] SSB [47] Frozen [1] MDMMT [13] CLIP4Clip [37]

8.6 9.9 10.2 12.1 14.9 20.9 22.5 24.6 26.6 29.6 30.1 31.0 38.9 44.5

23.4 24.0 31.2 35.0 40.2 48.8 43.3 54.0 57.1 61.6 58.5 59.5 69.0 71.4

33.1 32.4 43.2 48.0 52.8 62.4 53.7 67.1 69.6 74.2 69.3 70.5 79.7 81.6

28.2 61.7 24.0 16.5 15.3

Ours

46.2

73.1

81.2

13.2

and Lnorm_softmax + Lvid_distill are used instead. As can be seen from Table 2, both proposed distillation losses, especially the Caption Distillation loss, improve the retrieval performance on the three datasets. 4.5

Knowledge Distillation for the TEACHTEXT Framework

We also evaluate the benefits of knowledge distillation for the TEACHTEXT [9] framework. As before, the distillation losses are added to the training loss of this method. Since caption distillation is slightly better than video distillation, we perform further experiments with caption distillation only. The experiments are conducted for two training settings: (1) without external teachers (CE+), and (2) with external teachers (TT-CE+). Specifically, we use the loss Lmargin + Lcap_distill when applying our method on the CE+ setting, and the loss Lmargin + Ld + Lcap_distill when applying our method on the TT-CE+ setting. Here, Ld is the distillation loss from external teachers [9]. Following [9], we perform experiments on the four TVR benchmarks. For a fair comparison with most previous methods, we do not use the denoising trick [9] when training on the crowd-sourced datasets MSRVTT and MSVD. Performance Without Using External Teachers (CE+). Tables 3 and 4 show the results of this experiment. On all four datasets, the CE+ method with the proposed caption distillation outperforms the baseline CE+ method without any knowledge distillation. On average, caption distillation brings a

From Within to Between

617

Table 6. Comparison with the other methods on the MSVD dataset. Method

R@1↑ R@5↑ R@10↑ MnR↓

VSE++ [14] M-Cues [43] MEE [40] CE [9] TT-CE [9] CE+ [9] TT-CE+ [9] SSB [47] Frozen [1] CLIP4Clip [37]

15.4 20.3 21.1 21.5 22.1 25.1 25.1 28.4 33.7 46.2

39.6 47.8 52.0 52.3 52.2 56.5 56.8 60.0 64.7 76.1

53.0 61.1 66.7 67.5 67.2 70.9 71.2 72.9 76.3 84.6

10.0

Ours

48.8

79.2

87.5

8.7

Fig. 3. Qualitative retrieval performance on DiDeMo dataset with CE+ baseline. On the left is the query. The first column is the correct video clip. The next two columns show the top 1 retrieved clips by CE+ and our proposed method respectively.

1.4% gain over the direct baseline CE+. We also show some qualitative results of our proposed method in Fig. 3. Notably, the proposed method is a form of self-distillation and does not use any external information for training, while the full TT-CE+ employs the external teachers for distillation. Performance Using External Teachers (TT-CE+). As can be also seen in Tables 3 and 4, caption distillation is still beneficial when combining with the external teachers of TEACHTEXT. However, the additional benefit is not as large as when training without external teachers (i.e., with the direct CE+ baseline). This is likely due to the saturation of information, since we already have the extra distillation from the three external teachers. A similar saturation phenomenon was also reported in [9]. The performance of TEACHTEXT plateaued after three external teachers had been used for distillation, and no significant further improvement was observed.

618

V. Tran et al. Table 7. Comparison with other methods on the DiDeMo dataset.

Method

R@1↑ R@5↑ R@10↑ MnR↓

S2VT [53] FSE [60] MEE [40] CE [9] TT-CE [9] CE+ [9] ClipBERT [33] TT-CE+ [9] Frozen [1] MDMMT [13] CLIP4Clip [37] (reported in [37]) CLIP4Clip-rerun (frozen layers + smaller batches)

11.9 13.9 16.1 17.1 21.0 18.2 20.4 21.6 34.6 38.9 43.4 42.0

33.6 36.0 41.2 41.9 47.5 43.9 48.0 48.6 65.0 69.0 70.2 69.1

55.2 56.0 61.9 57.1 60.8 62.9 74.7 79.7 80.6 78.1

43.7 17.5 18.8

CLIP4Clip-rerun + Caption Distillation (Ours)

43.2

69.7

79.2

17.5

4.6

Comparison to Other Methods

Using the proposed caption distillation loss with CLIP4CLIP, we achieve better text-to-video retrieval performance than the previous state-of-the-art results, as can be seen in Table 5 for MSRVTT 1k-A and Table 6 for MSVD datasets. The results on the DiDeMo dataset is shown in Table 7, and our method is not better than the current state-of-the-art due to the lack of computational resources to follow the recommended experiment setting. As explained in Sect. 4.1, for DiDeMo, we has to freeze some network layers and use much smaller batch size. This method is denoted as CLIP4CLIP-rerun in Table 7, and our method outperforms this direct baseline.

5

Conclusions

In this paper, we proposed a novel knowledge distillation loss for cross modal text-to-video and video-to-text retrieval. The new loss exploits the information from features of the same domain as knowledge to guide the similarity learning for the cross domain matching. This information does not require any external data or additional annotation and can be drawn directly from either the text or video features for distillation. This loss function can be combined with the original loss function of a retrieval framework. More importantly, our proposed knowledge distillation loss is framework agnostic, and is applicable to any retrieval framework. Extensive experiments on three retrieval frameworks and four large-scale datasets for cross modal retrieval show the benefits of our method.

From Within to Between

619

Acknowledgement. This material is based on research that is supported in part by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003.

References 1. Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. In: Proceedings of the International Conference on Computer Vision (2021) 2. Breiman, L., Shang, N.: Born again trees. University of California, Berkeley, Berkeley, CA, Technical report, vol. 1, no. 2 (1996) 3. Buciluva, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541 (2006) 4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 5. Chen, D., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (2011) 6. Chen, S., Zhao, Y., Jin, Q., Wu, Q.: Fine-grained video-text retrieval with hierarchical graph reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020) 7. Chen, Y., Xian, Y., Koepke, A., Shan, Y., Akata, Z.: Distilling audio-visual knowledge by compositional contrastive learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 8. Chun, S., Oh, S.J., De Rezende, R.S., Kalantidis, Y., Larlus, D.: Probabilistic embeddings for cross-modal retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8415–8424 (2021) 9. Croitoru, I., et al.: Teachtext: crossmodal generalized distillation for text-video retrieval. In: Proceedings of the International Conference on Computer Vision (2021) 10. Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: detecting scene text via instance segmentation. In: Proceedings of AAAI Conference on Artificial Intelligence (2018) 11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Association for Computational Linguistics (2018) 12. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of International Conference on Learning and Representation (2021) 13. Dzabraev, M., Kalashnikov, M., Komkov, S., Petiushko, A.: MDMMT: multidomain multimodal transformer for video retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 14. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: Proceedings of the British Machine Vision Conference (2018) 15. Gabeur, V., Sun, C., Alahari, K., Schmid, C.: Multi-modal transformer for video retrieval. In: Proceedings of the European Conference on Computer Vision (2020)

620

V. Tran et al.

16. Ghadiyaram, D., Tran, D., Mahajan, D.: Large-scale weakly-supervised pretraining for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) 17. Ging, S., Zolfaghari, M., Pirsiavash, H., Brox, T.: COOT: cooperative hierarchical transformer for video-text representation learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 22605–22618 (2020) 18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 19. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38 20. Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with temporal language. In: Proceedings of the International Conference on Computer Vision (2017) 21. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (2017) 22. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015) 23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 24. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 7132–7141 (2018) 25. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 26. Huang, Z., et al.: Learning with noisy correspondence for cross-modal matching. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 29406–29419 (2021) 27. Jha, A., Kumar, A., Banerjee, B., Namboodiri, V.: SD-MTCNN: self-distilled multi-task CNN. In: Proceedings of the British Machine Vision Conference (2020) 28. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 29. Koepke, A., Wiles, O., Zisserman, A.: Visual pitch estimation. In: Proceedings of the Sound and Music Computing Conference (2019) 30. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: Proceedings of the International Conference on Computer Vision (2017) 31. Le, Q.V., et al.: Building high-level features using large scale unsupervised learning. In: Proceedings of the International Conference on Machine Learning (2012) 32. Lee, S., Song, B.C.: Graph-based knowledge distillation by multi-head attention network. In: Proceedings of the British Machine Vision Conference (2019) 33. Lei, J., et al.: Less is more: clipbert for video-and-language learning via sparse sampling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 34. Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2

From Within to Between

621

35. Liu, Y., Albanie, S., Nagrani, A., Zisserman, A.: Use what you have: video retrieval using representations from collaborative experts. In: Proceedings of the British Machine Vision Conference (2019) 36. Liu, Y., Wang, Z., Jin, H., Wassell, I.: Synthetically supervised feature learning for scene text recognition. In: Proceedings of the European Conference on Computer Vision (2018) 37. Luo, H., et al.: CLIP4Clip: an empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 (2021) 38. Miech, A., Alayrac, J.B., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: efficient text-to-visual retrieval with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9826–9836 (2021) 39. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: Endto-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9879–9889 (2020) 40. Miech, A., Laptev, I., Sivic, J.: Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516 (2018) 41. Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the International Conference on Computer Vision (2019) 42. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR Workshop (2013) 43. Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: ACM International Conference on Multimedia Retrieval (2018) 44. Nguyen, N., et al.: Dictionary-guided scene text recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 45. Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) 46. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32, pp. 8026– 8037 (2019) 47. Patrick, M., et al.: Support-set bottlenecks for video-text representation learning. In: Proceedings of International Conference on Learning and Representation (2021) 48. Phuong, M., Lampert, C.H.: Distillation-based training for multi-exit architectures. In: Proceedings of the International Conference on Computer Vision (2019) 49. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021) 50. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019) 51. Tang, Z., Lei, J., Bansal, M.: Decembert: learning from noisy instructional videos via dense captions and entropy minimization. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2415–2426 (2021) 52. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

622

V. Tran et al.

53. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014) 54. Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018) 55. Wray, M., Doughty, H., Damen, D.: On semantic similarity in video retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3650–3660 (2021) 56. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision (2018) 57. Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 58. Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question answering and retrieval. In: Proceedings of the European Conference on Computer Vision (2018) 59. Zhai, A., Wu, H.Y.: Classification is a strong baseline for deep metric learning. In: Proceedings of the British Machine Vision Conference (2018) 60. Zhang, B., Hu, H., Sha, F.: Cross-modal and hierarchical modeling of video and text. In: Proceedings of the European Conference on Computer Vision (2018) 61. Zhu, L., Yang, Y.: ActBERT: learning global-local video-text representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8746–8755 (2020)

Thinking Hallucination for Video Captioning

Nasib Ullah (LIVIA, Department of Systems Engineering, ÉTS, Montreal, Canada) and Partha Pratim Mohanta (ECSU, Indian Statistical Institute, Kolkata, India)

Abstract. With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite the performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influence of the source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the introduction of auxiliary heads trained in a multi-label setting on top of the extracted visual features and (b) the addition of context gates, which dynamically select the features during fusion. The standard evaluation metrics for video captioning measure similarity with the ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets, especially by a massive margin in CIDEr score.

1 Introduction

Video captioning is the translation of a video into a natural language description. It has many potential applications, including video retrieval, human-computer interaction, assisting the visually challenged, and many more. The encoder-decoder based sequence-to-sequence architecture (initially proposed for machine translation) has helped video captioning exceptionally.

(This work was done while Nasib Ullah was a PLP at ECSU, ISI Kolkata, India. The code is available at: https://github.com/nasib-ullah/THVC.)

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_37.


Fig. 1. Video captioning models suffer from two types of hallucination: object hallucination (above) and action hallucination (below). Unfortunately, the standard metrics fail to adequately capture hallucination, whereas our proposed COAHA metric reflects it more directly. The results are generated using the MARN [23] model. For COAHA, lower is better.

Pre-trained vision models extract the frame features in the encoder, and the decoder is a conditional language model. Recent improvements have happened broadly in three areas: better visual feature extraction [20,23], better use of and support from external pre-trained language models [21,47], and better strategies [9,29] to sample informative frames. Despite these improvements, a significant problem remains that is known as hallucination [24,27,38] in the literature. Unlike image captioning, there are two types of hallucination in the case of video captioning: object hallucination occurs when the model describes objects that are not present in the video, and action hallucination occurs when the model describes an action that has not been performed in the video, as depicted in Fig. 1. Further evidence that hallucination is a significant problem can be seen from the token position vs. model confidence plot in Fig. 2. Generally, the object and action words occur towards the middle of the caption, and we can see that the model has low confidence while generating tokens related to objects and actions. In this work, we point out three fundamental sources of hallucination for video captioning: (i) inadequate visual features from pre-trained models, (ii) improper influence of features during intra-modal and multi-modal fusion, and (iii) exposure bias in the model training strategy. We propose two robust and straightforward techniques that alleviate the problem. The exposure bias problem acts mainly under domain shift [19,38], so we will not focus on it. Standard evaluation metrics only measure similarity with the ground-truth captions and may not fully capture object and action relevance. To this end,


Fig. 2. Model confidence at different token positions. The plot is obtained by averaging confidence over the MSVD validation set using the MARN [23] model.

we propose a new metric, COAHA (caption object and action hallucination assessment), to assess the hallucination rate. The CHAIR metric [27] has been proposed for image captioning. However, it considers only object hallucination and relies on object segmentation annotations that are unavailable for video captioning datasets. Unlike CHAIR, our metric considers both object and action hallucination and leverages existing captions instead of relying on segmentation annotations. Finally, we show that our solutions outperform the state-of-the-art models on the standard metrics, especially by a large margin in CIDEr-D, on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets.

2 Related Work

2.1 Video Captioning

The main breakthrough in video captioning came with the encoder-decoder based sequence-to-sequence architecture. Although MP-LSTM [35] applied this first, it did not capture the temporal information between frames. S2VT [34] and SA-LSTM [43] capture the temporal dependencies between frames: the former shares a single LSTM network for both the encoder and the decoder, whereas the latter uses attention weights over frames and 3D HOG features. Recent methods have improved upon SA-LSTM [43]. For example, RecNet [2] adds a reconstruction loss based on a backward flow from the decoder output to the video feature construction. MARN [23] uses an external memory to capture the various visual contexts corresponding to each word. Both MARN [23] and M3 [39] utilize motion and appearance features. However, unlike MARN [23], M3 [39] uses a heterogeneous memory to model long-term visual-text dependencies. OA-BTG [45] and STG-KD [20] both use object features along with appearance


and motion features. OA-BTG [45] leverages the temporal trajectory features of salient objects in the video. In contrast, STG-KD [20] uses a spatio-temporal graph network for object interaction features and a Transformer network for the language model instead of recurrent neural networks. ORG-TRL [47] also uses object interaction features, but unlike STG-KD [20], it utilizes an external language model that guides the standard recurrent neural network based language model. Another line of work focuses on better sampling strategies to choose informative video frames: PickNet [9] samples informative frames based on a reward-based objective, whereas SGN [29] uses partially decoded information to sample frames. Regardless of these improvements, hallucination is still one of the major problems in existing models. In this work, we focus on the significant sources of hallucination, how to mitigate them, and, finally, a metric to measure the rate of hallucination.

2.2 Hallucination Problems in Natural Language Generation

Koehn et al. [16] pointed out hallucination as one of the six challenges in neural machine translation. Müller et al. [19] and Wang et al. [38] linked the problem of hallucination with exposure bias. Wang et al. [38] used minimum risk training (MRT) instead of maximum likelihood estimation (MLE) to reduce exposure bias and concluded that the improvement mostly happens under domain shift. Lee et al. [17] claimed that NMT models hallucinate under source perturbation. Tu et al. [31] used context gates to control the source and target contributions, reducing hallucination in neural machine translation. For image captioning, Rohrbach et al. [27] discussed object hallucination and proposed the CHAIR metric to assess the degree of object hallucination. Unlike neural machine translation and image captioning, video captioning is more complex because of the spatio-temporal nature of the input data. Although state-of-the-art video captioning models suffer from hallucination, there has been no prior attempt to address it. To the best of our knowledge, this is the first attempt to address hallucination for video captioning.

3 Methodology

As shown in Fig. 3, the components of our proposed system are (a) a visual encoder responsible for the extraction of different visual features from the input video, (b) auxiliary heads that reduce the hallucination arising from inadequate visual features, (c) context gates that dynamically control the importance of features during intra-modal and multi-modal fusion, and (d) a decoder, which is a conditional language model. Alongside these components, we use running visual and language memories that act as an external short-term memory. We use a multi-label auxiliary head loss and a coherent loss along with the standard cross-entropy loss for training.


Fig. 3. The architecture of our proposed model. The components are (1) the visual encoder for visual feature extraction, (2) the auxiliary heads and (3) the context gates to reduce hallucination, and finally (4) the decoder for sentence generation.

3.1 Visual Encoder

From a given input video, we uniformly sample $N$ frames $\{f_i\}_{i=1}^N$ and clips $\{c_i\}_{i=1}^N$, where each $c_i$ is a sequence of frames around $f_i$. We extract appearance features $\{a_i\}_{i=1}^N$ and motion features $\{m_i\}_{i=1}^N$ using a pre-trained 2D CNN $\phi^a$ [10] and a pre-trained 3D CNN $\phi^m$ [13], respectively, where $a_i = \phi^a(f_i)$ and $m_i = \phi^m(c_i)$. Apart from appearance and motion, we also extract object features $\{o_i\}_{i=1}^N$ using a pre-trained object detection module $\phi^o$ [26], where $o_i = \phi^o(f_i)$. For each frame, we select salient objects based on an objectness threshold $v$ and average the features of those salient objects. The appearance ($\{a_i\}_{i=1}^N$) and motion ($\{m_i\}_{i=1}^N$) features help to capture the global context and the motion information of the video. In contrast, the object features ($\{o_i\}_{i=1}^N$) are more localized and help to capture fine-grained information.
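As a concrete illustration of this multi-stream setup, the sketch below extracts appearance and motion features with off-the-shelf torchvision backbones. It is a minimal stand-in, not the authors' implementation: the backbone choices (ResNet-50, R3D-18) and weight tags are assumptions, whereas the paper uses ViT-L, a 3D ResNeXt-101, and Faster R-CNN. Object features $\{o_i\}$ would be obtained analogously by averaging detector box features for detections above the objectness threshold $v$.

```python
# Minimal sketch of the appearance/motion streams (object stream omitted).
# Backbone choices and weight tags are illustrative stand-ins, not the paper's exact models.
import torch
import torchvision

class TwoStreamFeatures(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.phi_a = torchvision.models.resnet50(weights="IMAGENET1K_V2")       # 2D CNN phi^a
        self.phi_a.fc = torch.nn.Identity()                                      # keep pooled features
        self.phi_m = torchvision.models.video.r3d_18(weights="KINETICS400_V1")   # 3D CNN phi^m
        self.phi_m.fc = torch.nn.Identity()

    @torch.no_grad()
    def forward(self, frames, clips):
        # frames: (N, 3, H, W) sampled frames f_i; clips: (N, 3, T, H, W) clips c_i around them
        a = self.phi_a(frames)   # appearance features {a_i}, shape (N, 2048)
        m = self.phi_m(clips)    # motion features {m_i}, shape (N, 512)
        return a, m

enc = TwoStreamFeatures().eval()
a, m = enc(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 16, 112, 112))
```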

3.2 Auxiliary Heads: Category, Action, and Object

Unlike image captioning, video captioning takes a sequence of frames as input, which makes training the feature extraction module computationally expensive. Existing methods therefore use pre-trained models for the different visual feature extractors and treat these features as the input to the model. However, the pre-trained models are trained on different data distributions, and the extracted features are not adequate, which leads to the hallucination problem. We hypothesize that inadequate object and appearance features are responsible for object hallucination, whereas inadequate motion features lead to action hallucination. We propose specialized classification heads on top of each type of visual feature, trained in a multi-label setting. The supervision is generated by


leveraging the existing captions, and the features from the heads are more suitable and semantically relevant. We include an object head $\psi^h_{Obj}$ on top of the object features $\{o_i\}_{i=1}^N$, where the training objective is to predict the visual objects present in the captions of that input video. Similarly, an action head $\psi^h_{Act}$ is applied to the motion features $\{m_i\}_{i=1}^N$ to predict the actions present in the captions. The ground truth for both the action head and the object head is extracted using named entity recognition [7,12] and part-of-speech tagging [37] methods from the NLP literature. To refine the appearance features $\{a_i\}_{i=1}^N$, we add a categorical head $\psi^h_{Cat}$ to predict the video category, which is available for the MSR-VTT dataset but not for MSVD. Although we use three specific objectives, this framework can be applied with other objectives, such as predicting attributes on top of the appearance features. Finally, we use the features $\{a^r_i\}_{i=1}^N$, $\{m^r_i\}_{i=1}^N$ and $\{o^r_i\}_{i=1}^N$ from the auxiliary heads as the input to the model, where $a^r_i = \psi^h_{Cat}(a_i)$, $m^r_i = \psi^h_{Act}(m_i)$ and $o^r_i = \psi^h_{Obj}(o_i)$. Figure 4 shows the modeling of the action head.
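A minimal sketch of one such auxiliary head is given below (dimensions are assumed; this is not the authors' released code): a shared MLP refines the per-position features, and a pooled multi-label classifier over the caption-derived vocabulary provides the BCE supervision described in Sect. 3.5.

```python
# Sketch of a single auxiliary head psi^h: it refines per-frame/clip features and
# is trained with a multi-label BCE objective over the caption-derived vocabulary.
import torch
import torch.nn as nn

class AuxiliaryHead(nn.Module):
    def __init__(self, feat_dim: int, refined_dim: int, vocab_size: int):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(feat_dim, refined_dim), nn.ReLU())  # shared MLP
        self.classify = nn.Linear(refined_dim, vocab_size)

    def forward(self, feats):                          # feats: (B, N, feat_dim)
        refined = self.refine(feats)                   # refined features a_i^r / m_i^r / o_i^r
        logits = self.classify(refined.mean(dim=1))    # pool over the N positions
        return refined, logits

# Multi-label targets Y of shape (B, vocab_size) built from the caption's words.
head = AuxiliaryHead(feat_dim=2048, refined_dim=512, vocab_size=400)
feats = torch.randn(8, 28, 2048)
targets = torch.randint(0, 2, (8, 400)).float()
refined, logits = head(feats)
loss_aux = nn.functional.binary_cross_entropy_with_logits(logits, targets)
```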

3.3 Context Gates: Balancing Intra-modal and Multi-modal Fusion

There are two types of feature fusion in the existing video captioning framework: intra-modal fusion and inter-modal (multi-modal) fusion. In intra-modal fusion, the different visual (appearance, motion, and object) features are merged, so one feature may dominate other important features. For example, motion features may dominate appearance and object features, which leads to object hallucination, while dominance of object and appearance features over motion features leads to action hallucination. Similarly, during multi-modal fusion at the decoder, a dominant language prior leads to hallucination [31,36], whereas a dominant source context leads to captions that are not fluent. We include two separate context gate units for the intra-modal and multi-modal fusion of features, which mitigates hallucination. The gates are responsible for dynamically selecting the importance of each type of feature. Tu et al. [31] used a context gate to balance the source and target context in neural machine translation. Nevertheless, unlike Tu et al. [31], we use separate gates for intra-modal and multi-modal fusion. Also, instead of depending on the decoder memory, our gates are conditioned on the running visual ($M_v$) and language ($M_l$) memories, as shown in Fig. 3. Empirical studies showed better performance with separate running memories rather than the decoder memory; one possible reason is that the decoder memory is contaminated by a mixture of source and target features and noisy LSTM gates. At the encoder, the combined visual feature $v_{m,t}$ is

$v_{m,t} = [CG^E_a \circ a^r_t;\; CG^E_m \circ m^r_t;\; CG^E_o \circ o^r_t]$   (1)

where

$a^r_t = \sum_{i=1}^{N} \alpha_{i,t}\, a^r_{i,t}, \quad m^r_t = \sum_{i=1}^{N} \alpha_{i,t}\, m^r_{i,t}, \quad o^r_t = \sum_{i=1}^{N} \alpha_{i,t}\, o^r_{i,t}$   (2)


Fig. 4. Diagram of the action auxiliary head. $\{m_i\}_{i=1}^N$ are the motion features from the visual encoder. The MLP is shared over clip features. The same architecture is used for the object and categorical auxiliary heads.

$CG^E_{a/m/o} = f(a^r_t, m^r_t, o^r_t, M_v, M_l)$   (3)

where $CG^E_{a/m/o}$ are the context gates corresponding to the appearance, motion, and object features, respectively, and $\alpha_{i,t}$ are the attention weights [43]. We share the attention network over the object, motion, and appearance features for regularization. At the decoder, the multi-modal feature $C_t$ is

$C_t = [CG^D_S \circ v_{m,t};\; CG^D_T \circ E(y_{t-1})]$   (4)

where

$CG^D_{S/T} = f(v_{m,t}, E(y_{t-1}), M_v, M_l)$   (5)

$CG^D_{S/T}$ are the context gates corresponding to the source and target context, respectively, and $E(y_{t-1})$ is the embedding of the previous token. $f$ is realized as an MLP, $[\,;\,]$ denotes concatenation, and $\circ$ denotes the Hadamard product.

The decoder is a language model conditioned on merged visual and language prior features Ct . We have used the LSTM network because of its superior sequential modeling performance. At t time steps during the sentence generation, the hidden memory of LSTM can be expressed as, ht = LST M (Ct , ht−1 )

(6)

where ht−1 is the hidden memory at previous time steps, and Ct is the combined multi-modal feature as shown in Eq. 4. Although transformer-based [32] models

630

N. Ullah and P. P. Mohanta

have outperformed LSTM in most NLP benchmarks, LSTM and transformer models performed almost equally in our case. Finally, the probability distribution over words is generated using a fully connected layer followed by a softmax layer. P (yt |V, y1 , y2 , .., yt−1 ) = sof tmax(Vh ht + bh )

(7)

where Vh and bh are learnable parameters. 3.5

Parameter Learning

Along with the standard cross-entropy loss, our model is trained using two additional losses, Auxiliary head loss, and Coherent loss. Attention Based Decoder. Negative log-likelihood (or cross-entropy) is the standard objective function for the video captioning models. The loss for a minibatch is T B   log p(yt |V, y1 , y2 , .., yt−1 ; θ) (8) LCE = − i=1 t=1

where θ is learnable parameters, V is the video feature, yt is the tth word in the sentence of length T, and B is the mini-batch size. Auxiliary Heads. Auxiliary heads are trained in multi-label settings. For all three heads, we have used binary cross-entropy loss over sigmoid activation. The total auxiliary loss is, A O LAH = λC LC AH + λA LAH + λO LAH

(9)

where λC , λA and λO are hyperparameters corresponding to Category head loss A C LC AH , Action head loss LAH , and Object head loss LAH , respectively. Each loss is calculated as A N N C h A h LC AH = Υ (Y , ψCat (aj j=1 )), LAH = Υ (Y , ψAct (mj j=1 )) O h N LO AH = Υ (Y , ψObj (oj j=1 ))

where,

Υ (y, p) =

M B  

[yi,m log pi,m + (1 − yi,m ) log(1 − pi,m )]

(10)

i=1 m=1

Y C , Y A and Y O are category, action and object ground truth for the given batch (B) of data.

Thinking Hallucination for Video Captioning

631

Coherent Loss. The consecutive frames in a video are highly repetitive. So encoding of consecutive frames should be similar. We apply the coherent loss to restrict the embedding of consecutive frames to be similar. Coherent loss [23] has been used before to regularize attention weights, but unlike Pei et al. [23], we apply coherent loss on appearance, motion, and object features. It is debatable whether to apply coherent loss on motion features, but empirically we got a good result, and the possible reason might be shorter clip length. The total coherent loss for a mini-batch is, o α LCL = λf cl LaCL + λmcl Lm CL + λocl LCL + λacl LCL

(11)

where λf cl , λmcl , λocl and λacl are hyperparameters corresponding to appearance o coherent loss LaCL , motion coherent loss Lm CL , object coherent loss LCL and α attention coherent loss LCL respectively. r The individual coherent losses are calculated as, LaCL = Φ(ari ), Lm CL = Φ(mi ), o α r LCL = Φ(oi ) and LCL = Φ(αi ) where, Φ(f ) =

B  N T  

(i)

(i)

|fn,t − fn−1,t |

(12)

i=1 t=1 n=2

The overall loss function of our model is L = LCE + LAH + LCL 3.6

(13)

The COAHA Metric

In order to measure the degree of hallucination, we propose COAHA (caption object and action hallucination assessment) metric. The metric is defined for per instance and is given by, COAHA = OH + AH

(14)

where,  OH =

hi ∈HO

dhi

 ,

AH =

ai ∈HA

dai

(15) T T OH measures the degree of object hallucination, and AH measures the degree of action hallucination. T is the average ground truth caption length and NO and NA are the set of objects and actions mentioned in the video. HO is the set of hallucinated objects, and HA is the set of hallucinated actions. dhi measures the average semantic distance between the current hallucinated object and all objects in NO . Similarly, dai measures the average semantic dissimilarity of hallucinated action. The higher dhi or higher cardinality of HO will lead to a higher value of OH, and the same follows for AH. So, if the hallucinated object or action is a synonym, then the semantic distance would be small, which will lead to a low hallucination value. Finally, we have taken the addition of OH and AH.

632

N. Ullah and P. P. Mohanta

We have applied the same approach mentioned in the Auxiliary Head section to extract the list of objects and actions from each video (from the ground truth caption). For measuring the semantic distance (dhi and dai ) we have used pre-trained Fasttext [6] embedding model. Mathematically, dhi =

 1 ϑ(wk , hi ) |NO |

(16)

wk ∈NO

where, ϑ is the cosine similarity on top of the Fasttext [6] embedding. Similarly the formulae follows for dai .

4

Experimental Results

We evaluate our proposed model on two most popular benchmark datasets: MSVD and MSR-VTT. To compare with the state-of-the-art results, we have used the four most popular metrics: CIDEr [33], METEOR [3], ROUGE-L [18] and BLEU-4 [22]. 4.1

Datasets

MSVD. Microsoft Video Description (MSVD) dataset [8] consists of 1970 open domain videos of single activity and is described by 40 captions generated by Amazon Mechanical Turk. For a fair comparison, we have followed the standard split [35,43] of 1200 videos for training, 100 for validation, and 670 for testing. MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset [42] is a large scale benchmark datasets with 10k videos and 20 categories. Each video is annotated with 20 captions with an average length of 20 s. We have followed the standard benchmark split [42] of 6513 for training, 497 for validation, and 2990 for testing. 4.2

Implementation Details

Feature Extraction and Decoding. We uniformly sample 28 frames per video, and sentences longer than 30 words are truncated. The 1024-D appearance features of each frame are extracted by ViTL [10] pre-trained on Imagenet [28]. For the 2048-D motion features, we use C3D [13] with ResNeXt-101 [41] and pre-trained on the Kinetic-400 dataset. To extract the object features, we apply Faster-RCNN [26] with ResNet-101 and FPN backbone pre-trained on MSCOCO [40]. Appearance, motion, and objects features are embedded to 512D before sending to the next stage. At the decoder LSTM, we use a hidden layer size of 512, and the word embedding size is fixed to 512. The vocabulary is built by words with at least 5 occurrences. During testing, we use greedy decoding to generate the sentences. Auxiliary Heads and Context Gates Related Details. The running visual and language memory size is set to 512. We use the Spacy [14] module pretrained with transformers [32] for object and action extraction from ground truth captions. We have also used Porter Stemmer [5] to work with the root

Thinking Hallucination for Video Captioning

633

Table 1. Performance comparison on MSVD and MSR-VTT benchmarks. B4, M, R, and C denote BLEU-4, METEOR, ROUGE L, and CIDEr, respectively. Models

MSVD B@4 M

R

C

MSR-VTT B@4 M R

C

SA-LSTM [43] h-RNN [44] hLSTMat [30] RecNet [2] M3 [39] PickNet [9] MARN [23] GRU-EVE [1] POS+CG [37] OA-BTG [45] STG-KD [20] SAAT [46] ORG-TRL [47] SGN [29]

45.3 44.3 53.0 52.3 52.8 52.3 48.6 47.9 52.5 56.9 52.2 46.5 54.3 52.8

64.2 69.8 69.6 71.9 71.5 71.3 73.9 69.4 73.9 72.9

76.2 62.1 73.8 80.3 76.5 92.2 78.1 88.7 90.6 93.0 81.0 95.2 94.3

36.3 38.3 39.1 38.1 41.3 40.4 38.3 42.0 41.4 40.5 40.5 43.6 40.8

39.9 42.7 44.1 47.1 48.1 48.7 46.9 47.1 49.1 50.9 49.5

Ours

53.3 36.5 74.0 99.9 41.1 28.9 61.9 51.7

31.9 31.1 33.6 34.1 33.3 33.3 35.1 35.0 34.1 36.2 36.9 33.5 36.4 35.5

25.5 26.3 26.6 26.6 27.7 28.1 28.4 28.2 28.2 28.3 28.2 28.8 28.3

58.3 59.3 59.8 60.7 60.7 61.6 60.9 60.9 62.1 60.8

form of words. Finally, to measure the semantic distance between words, we use the pre-trained FastText [6] model from Gensim [25].

Other Details. We apply Adam [15] with a fixed learning rate of 1e-4 and a gradient clip value of 5. Our model is trained for 900 epochs with a batch size of 100. The coherent loss weights $\lambda_{acl}$, $\lambda_{fcl}$, $\lambda_{mcl}$, and $\lambda_{ocl}$ are set to 0.01, 0.1, 0.01, and 0.1, respectively. All experiments are run on a single Titan X GPU.

4.3 Quantitative Results

To demonstrate the effectiveness of our approach, we compare the performance of our model with state-of-the-art models. The quantitative results in Table 1 show that our model performs significantly better than the other methods, especially in CIDEr. CIDEr is specially designed for captioning tasks and correlates better with human judgment than the other metrics. Compared to OA-BTG [45] and ORG-TRL [47], our model lags in the BLEU-4 score, which was designed for machine translation evaluation. Also, ORG-TRL [47] and OA-BTG [45] utilize object interaction and object trajectory features, respectively, whereas we only take mean localized object features for simplicity.

4.4 Ablation Studies

We perform the quantitative evaluation to investigate the effectiveness of the components we proposed. We conduct ablation studies that start with a primary


Table 2. Ablation studies of the context gates and auxiliary heads on MSVD benchmark. C.Gates denotes context gates, and A.Heads denotes auxiliary heads. Methods A.Heads C.Gates B@4 M

R

C

COAHA

×

×

50.1 34.8 72.9 91.1 10.57



×

52.6 36.2 73.4 96.6 7.90

×



52.5 36.3 73.6 97.1 7.96





53.3 36.5 74.0 99.9 7.02

decoder and add the auxiliary heads and context gates incrementally. Table 2 shows the results.

Table 3. Semantic significance of the COAHA metric. OH and AH represent the object and action hallucination scores, respectively. For OH, AH and COAHA, lower is better.


Caption

OH AH COAHA

A man is riding a motorcycle

0

A man is riding a horse

8.12 0

8.12

A man is riding a car

5.10 0

5.10

A woman is riding a motorcycle

4.67 0

4.67

An animal is riding a motorcycle

9.58 0

9.58

0

0

A man is playing with a motorcycle 0

9.52 9.52

A man is eating a motorcycle

0

10.4 10.4

A man is playing with a toy

6.45 9.52 15.97

4.5 Validation of Context Gates

To validate that the context gates help in dynamically selecting the feature importance, we show the values of the context gates during multi-modal fusion. Fig. 5(a) shows that the source context gate value is high during the generation of visual object words. In contrast, the target context gate value is high during the generation of non-visual words.

4.6 Effect of Exposure Bias on Hallucination

For machine translation, it has been shown [19,38] that exposure bias mainly matters under domain shift. To validate that exposure bias does not have a strong effect in the same-domain case, we show the model confidence (Fig. 5(b)) and the COAHA values (Table 4) under different training strategies. Although scheduled sampling [4] and professor forcing [11] are designed to mitigate exposure bias, they do not affect hallucination significantly.


Fig. 5. (a) Contribution of context gates during the generation of visual and non-visual words. (b) The model confidence vs. token positions under different training strategies. Table 4. COAHA scores at different training strategies. Training strategy

COAHA

Teacher forcing

7.02

Scheduled sampling 7.01 Professor forcing


6.98

4.7 Significance of COAHA

In order to show the semantic significance of our proposed metric, we calculate OH (object hallucination rate), AH (action hallucination rate), and COAHA for different semantically perturbed captions, as shown in Table 3.

4.8 Qualitative Results

We compare the captions generated by our model with those of Pei et al. [23]. From Fig. 6, we can see that our model is less prone to hallucination and better identifies the objects and actions in the video.


Fig. 6. Captions generated by our model and MARN [23].

5 Conclusion

This work shows that hallucination is a significant problem for video captioning and points out three significant sources of hallucination in the existing framework. We propose two robust and straightforward strategies to mitigate hallucination. By applying our solutions, we obtain new state-of-the-art results on two popular benchmark datasets, especially by a significant margin in CIDEr score. Furthermore, we show that exposure bias is not a significant issue in the same-domain case. Finally, we propose a new metric to measure the hallucination rate, which the standard metrics cannot capture. Qualitative and quantitative experimental results show our model's effectiveness in overcoming the hallucination problem in video captioning.


References 1. Aafaq, N., Akhtar, N., Liu, W., Gilani, S.Z., Mian, A.: Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 12487– 12496 (2019) 2. Bairui, W., Lin, M., Wei, Z., Wei, L.: Reconstruction network for video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018) 3. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, 29 June 2005, pp. 65–72. Association for Computational Linguistics (2005). https://aclanthology.org/W050909/ 4. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, Quebec, Canada (2015) 5. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly, Sebastopol (2009). http://www.oreilly.de/catalog/9780596516499/index.html 6. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017) 7. Chen, D., Manning, C.D.: A fast and accurate dependency parser using neural networks. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Qatar (2014) 8. Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19–24 June 2011, Portland, Oregon, USA, pp. 190–200. The Association for Computer Linguistics (2011). https://aclanthology.org/P11-1020/ 9. Chen, Y., Wang, S., Zhang, W., Huang, Q.: Less is more: picking informative frames for video captioning. In: Proceedings of the European Conference on Computer Vision (2018) 10. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria (2021) 11. Goyal, A., Lamb, A., Zhang, Y., Zhang, S., Courville, A.C., Bengio, Y.: Professor forcing: a new algorithm for training recurrent networks. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain (2016) 12. Guillaume, L., Miguel, B., Sandeep, S., Kazuya, K., Chris, D.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016) 13. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 14. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017, to appear)


15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA (2015) 16. Koehn, P., Knowles, R.: Six challenges for neural machine translation. In: Proceedings of the First Workshop on Neural Machine Translation, NMT@ACL 2017, Vancouver, Canada, 4 August 2017, pp. 28–39. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/w17-3204 17. Lee, K., Firat, O., Agarwal, A., Fannjiang, C., Sussillo, D.: Hallucinations in neural machine translation. In: NIPS 2018 Interpretability and Robustness for Audio, Speech and Language Workshop (2018) 18. Lin, C.: Rouge: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL 2004 Workshop, Barcelona, Spain, vol. 8 (2004) 19. M¨ uller, M., Rios, A., Sennrich, R.: Domain robustness in neural machine translation. In: Proceedings of the 14th Conference of the Association for Machine Translation in the Americas, AMTA 2020, Virtual, 6–9 October 2020, pp. 151–164. Association for Machine Translation in the Americas (2020). https://aclanthology. org/2020.amta-research.14/ 20. Pan, B., et al.: Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020) 21. Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 22. Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 6–12 July 2002, Philadelphia, PA, USA, pp. 311–318. ACL (2002). https://doi.org/10.3115/1073083.1073135. https://aclanthology.org/P02-1040/ 23. Pei, W., Zhang, J., Wang, X., Ke, L., Shen, X., Tai, Y.: Memory-attended recurrent network for video captioning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019, pp. 8347–8356. Computer Vision Foundation/IEEE (2019) 24. Raunak, V., Menezes, A., Junczys-Dowmunt, M.: The curious case of hallucinations in neural machine translation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, 6–11 June 2021, pp. 1172–1183. Association for Computational Linguistics (2021). https://doi.org/10. 18653/v1/2021.naacl-main.92 ˇ uˇrek, R., Sojka, P.: Software framework for topic modelling with large corpora. 25. Reh˚ In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50. ELRA (2010). http://is.muni.cz/publication/ 884893/en 26. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. (2017) 27. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018, pp. 4035–4045. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/d18-1437


28. Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015) 29. Ryu, H., Kang, S., Kang, H., Yoo, C.D.: Semantic grouping network for video captioning. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021 (2021) 30. Song, J., Gao, L., Guo, Z., Liu, W., Zhang, D., Shen, H.T.: Hierarchical LSTM with adjusted temporal attention for video captioning. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence (2017) 31. Tu, Z., Liu, Y., Lu, Z., Liu, X., Li, H.: Context gates for neural machine translation. Trans. Assoc. Comput. Linguist. 5, 87–99 (2017). https://transacl.org/ojs/index. php/tacl/article/view/948 32. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017) 33. Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: consensus-based image description evaluation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015, pp. 4566–4575. IEEE Computer Society (2015). https://doi.org/10.1109/CVPR.2015.7299087 34. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R.J., Darrell, T., Saenko, K.: Sequence to sequence - video to text. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015, pp. 4534– 4542. IEEE Computer Society (2015). https://doi.org/10.1109/ICCV.2015.515 35. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R.J., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. In: NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, 31 May–5 June 2015, pp. 1494–1504. The Association for Computational Linguistics (2015). https://doi.org/10.3115/v1/n15-1173 36. Voita, E., Sennrich, R., Titov, I.: Analyzing the source and target contributions to predictions in neural machine translation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, 1–6 August 2021, pp. 1126–1140. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.91 37. Wang, B., Ma, L., Zhang, W., Jiang, W., Wang, J., Liu, W.: Controllable video captioning with POS sequence guidance based on gated fusion network. In: The IEEE International Conference on Computer Vision (ICCV) (2019) 38. Wang, C., Sennrich, R.: On exposure bias, hallucination and domain shift in neural machine translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020, pp. 3544–3552. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/ 2020.acl-main.326 39. Wang, J., Wang, W., Huang, Y., Wang, L., Tan, T.: M3: multimodal memory modelling for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 40. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https:// github.com/facebookresearch/detectron2 41. Xie, S., Girshick, R.B., Doll´ ar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 5987–5995 (2017)


42. Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 5288–5296. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016. 571 43. Yao, L., et al.: Describing videos by exploiting temporal structure. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015, pp. 4507–4515. IEEE Computer Society (2015). https://doi.org/ 10.1109/ICCV.2015.512 44. Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W.: Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 45. Zhang, J., Peng, Y.: Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019) 46. Zheng, Q., Wang, C., Tao, D.: Syntax-aware action targeting for video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020) 47. Ziqi, Z., et al.: Object relational graph with teacher- recommended learning for video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)

Boundary-Aware Temporal Sentence Grounding with Adaptive Proposal Refinement

Jianxiang Dong and Zhaozheng Yin (Stony Brook University, New York, USA)

Abstract. Temporal sentence grounding (TSG) in videos aims to localize the temporal interval from an untrimmed video that is relevant to a given query sentence. In this paper, we introduce an effective proposalbased approach to solve the TSG problem. A Boundary-aware Feature Enhancement (BAFE) module is proposed to enhance the proposal feature with its boundary information, by imposing a new temporal difference loss. Meanwhile, we introduce a Boundary-aware Feature Aggregation (BAFA) module to aggregate boundary features and propose a Proposal-level Contrastive Learning (PCL) method to learn queryrelated content features by maximizing the mutual information between the query and proposals. Furthermore, we introduce a Proposal Interaction (PI) module with Adaptive Proposal Selection (APS) strategies to effectively refine proposal representations and make the final localization. Extensive experiments on Charades-STA, ActivityNet-Captions and TACoS datasets show the effectiveness of our solution. Our code is available at https://github.com/DJX1995/BAN-APR.

1 Introduction

Temporal sentence grounding (TSG) in videos has been an important but challenging problem in Vision-Language understanding area [1], which requires techniques from both Computer Vision and Natural Language Processing. It aims to localize a temporal video moment1 in an untrimmed video that semantically matches a given query sentence, as shown in Fig. 1(a). In addition, TSG can be widely used in many applications such as question answering in videos, video event captioning and human computer interaction [2–4]. Most TSG methods can be categorized into two groups: (1) proposal-based approaches and (2) proposal-free methods. The former one borrows techniques from object detection. A set of proposals with start and end timestamps are proposed, then visual-language interaction is performed to generate proposal representations. Afterwards, it computes the similarity between each proposal and the given query sentence and selects the proposal with the highest matching 1

We define a moment to be an interval in the video with the start and end timestamps.



score as the localized video moment. The latter also performs complex visual-language interaction but directly predicts the start and end timestamps based on the feature representation.

Fig. 1. (a) An illustration of the temporal sentence grounding in videos. Bottom: We expect the learned boundary-aware feature representation to have high activations of temporal difference around the start and end timestamps. (b) Proposal-based methods aggregate video clips within each proposal to generate the proposal feature. However, these proposals may include video clips (background frames) outside the ground truth moment (foreground frames), which will affect the quality of the learned proposal representations.

Despite the success of previous methods, there are three challenges remaining unsolved in TSG: – Feature discrimination around temporal boundaries. Most of previous methods usually apply the pooling operation on cross-modal representations, e.g., average pooling or RoI (Region of Interest) pooling [5,6] to generate proposal features, which neglects the discriminative temporal boundary information. Without this information, it is difficult to distinguish overlapping or nearby proposals correctly because the pooling results of overlapped or neighbouring regions are very similar. – Proposal construction with precise feature representations. Another challenging but crucial problem in TSG task is how to extract the query-related visual information when constructing proposal representations. For example, in Fig. 1(b), some proposals include the whole ground truth moment and some have an overlap with the ground truth moment. When aggregating proposal representations based on video clips within proposals’ start and end timestamps, we introduce a lot of noises from the unrelated background frames (frames outside the ground truth moment), which may affect the quality of the generated proposal representations. – Proposal Interaction. The proposal-free methods do not construct proposal representations and also neglect the informative proposal relationship. Proposal-based approaches usually have a large number of proposals and use a complex proposal interaction module (e.g., a 2D CNN [7] or graph neural network [4]) to model the proposal relationship, which requires a lot of computations during training and inference.


To address the above challenges, we develop a novel Boundary-aware Network with Adaptive Proposal Refinement (BANet-APR) for temporal sentence grounding: – Firstly, we design a Boundary-aware Feature Enhancement (BAFE) module to extract the start and the end boundary information. A temporal difference loss is applied onto the boundary-aware feature to ensure that the feature representation has a high activation of temporal difference around the start and end boundary positions (The concept is shown in Fig. 1(a) and the validation is shown in Fig. 6 in the experiment section). – Secondly, we introduce a Boundary-aware Feature Aggregation (BAFA) module to generate discriminative and informative proposal representations where the boundary feature and semantic content feature are both considered when constructing proposal representations. Meanwhile, we design a Proposal-level Contrastive Learning (PCL) method to implicitly enforce the semantic content feature to be query-related. Unlike previous methods [8,9] which either adopt frame-based contrastive learning or predefine an IoU threshold to construct positive and negative sets, we enforce proposals that include the whole ground truth moment to be close to the query while proposals that have no overlap with the ground truth moment to be away from the query. The boundary-aware feature and the semantic content feature are complement to each other when generating proposal representations to learn more discriminative (from boundary-aware feature) and informative (from semantic content feature) proposal features for the TSG task. – Finally, we propose a refinement module which makes coarse predictions, selects k proposals via Adaptive Proposal Selection (APS), performs Proposal Interaction (PI) among the selected proposals and then localizes the moment. This refinement module only models the interaction between the selected k confident and representative proposals instead of all proposals as used in many previous works [4,7,10], which is more effective and efficient. Our main contributions are fourfold: 1) We design a Boundary-aware Feature Enhancement (BAFE) module to extract boundary information and design a temporal difference loss that is directly applied on the feature representation to learn more discriminative proposal features; 2) We propose a novel Boundaryaware Feature Aggregation (BAFA) with Proposal-level Contrastive Learning (PCL) method to learn more discriminative and informative proposal features; 3) We propose a Proposal Interaction (PI) module with Adaptive Proposal Selection (APS) strategies to effectively refine proposal representations and make the final localization prediction; and 4) Compared to the latest state-of-the-art, the experiments on three datasets (Charades-STA [11], ActivityNet Captions [12] and TACoS [13]) show the effectiveness of our proposed BANet-APR.

2 Related Work

Temporal sentence grounding (TSG) is a new task introduced in the computer vision community recently [11,14,15]. Formally, it aims to retrieve a video


moment with start and end timestamps from an untrimmed video using a query sentence. Current existing methods can be roughly grouped into two categories, namely propose-based and proposal-free methods [10,14,16–22]. Proposal-based methods first pre-define a set of video moment proposals and solve the problem by choosing the best matched proposal [4,7,11]. CTRL [11] adopts a sliding window to generate moment proposals and rank the proposals based on their similarity scores to the query sentence. 2D-TAN [7] and RaNet [4] enumerate all possible proposals and introduce complex proposal interaction methods using convolution and graph neural network, respectively. Proposal-free methods directly regress the start and end timestamps or classify which frames are the start and end boundary frames of the matched video moment [1,23–25]. Extracting informative boundary feature is very important and has been explored by some previous works [4,26–28]. RaNet [4] adopts two branches, e.g., a start branch and an end branch, to capture the start and end boundary information to learn more discriminative proposal representations. In addition to boundary branches, De-VLTrans-MSA [26] attaches a MLP as a classifier on top of the feature and defines a downstream task to predict the probability of a frame being the start or end stage to help boundary feature learning. CBP [27] designs an anchor submodule and a boundary submodule which take the feature vector at every temporal position as input and predict the probability of a proposal ending at that position. However, the classifier for the downstream tasks is easy to overfit to the statistics of the dataset and we may train a good classifier instead of good feature representations. Unlike previous methods which either have no boundary-related constraints on the feature, or use a classifier to predict the probability of a frame being the start and end boundary, our paper directly imposes a novel temporal difference loss on the boundary feature to enforce it to be boundary-aware. The loss is calculated only based on boundary regions instead of all temporal positions (so non-boundary regions with high temporal difference will not get punished). It allows the model itself to determine the high-salient temporal locations, which reduces the side effect of overfitting. Proposal-based methods are intuitive and follow similar spirits of anchorbased approaches in object detection, but it suffers from the redundant computation cost. 2D-TAN [7] proposes a sparse sampling strategy to sparsely sample long proposals and densely sample short proposals, which is adopted by many other works. However, the proposal number is still too large for modeling dense proposal relationships. APGN [29] proposes a two-step model which first regresses a small number of (start, end) tuples as proposals and then ranks the proposed proposals. LPNet [30] proposes a novel model with a fixed set of learnable proposals where proposals’ start and end positions serve as model parameters and are updated at every iteration. However, there is no constraint on the generated proposals, which may lead to repeated and useless proposals. Therefore, we propose to adaptively select unrepeated and representative proposals for further refinement. Contrastive learning [31–34] methods are commonly used in self-supervised learning (SSL) research to learn high quality representations in an unsupervised manner. 
In image representation learning, it brings the representation of different transformations of the same image closer and push the representation


of transformations from different images apart [35]. Contrastive learning is a MI (mutual information)-based approach and, in practice, contrastive learning models are usually trained by maximizing an estimation of MI between different transformations of data [36]. In the TSG task, IVG-DCL [8] adopts the contrastive learning concept in their model for the TSG task in which they apply contrastive loss on each element in the video sequence (clip-level) and on each video among all videos within a batch (video-level). In our PCL, we perform proposal-level contrastive learning which directly helps the discriminative proposal representation learning. Moreover, unlike previous methods [8,9] which construct positive and negative set based on IOUs, we do not manually set an IoU threshold for positive and negative splitting, which can reduce the side effect of confusing proposals and enforce the proposal encoder focusing more on the query-related information.

3 Methodology

Fig. 2. An overview of our proposed model architecture for TSG.

Figure 2 depicts the overall framework of our BANet-APR, which consists of feature extraction, feature enhancement, proposal construction and proposal refinement: 1) Given a video and a query sentence, we use two encoders to generate the visual feature and the language feature, which are then fed into a crossmodal interaction module; 2) The output cross-modal feature will pass through a Boundary-aware Feature Enhancement module to strengthen the boundary information; 3) Then, we aggregate the enhanced feature and generate proposal representations by a proposal construction module; and 4) Finally, we design a Proposal Refinement module to adaptively select confident and informative proposals and refine their feature representations for the final moment localization. 3.1

Feature Extraction

Given an untrimmed video V and a sentence query Q, we denote the video as ˆ V = {vt }Tt=1 , where vt is the t-th frame and Tˆ is the total number of frames. Similarly, each query sentence is represented by Q = {wi }N i=1 , where wi is the i-th word and N is the total number of words.

646

J. Dong and Z. Yin

Feature Encoding. We use a pre-trained video feature extractor, such as C3D i is the [37] or I3D [38], to obtain video features V = { vi }Ti=1 ∈ RT ×dv , where v ˆ i-th video feature, dv refers to the video feature dimension, T = T /s is the total number of extracted features from the video and s is the number of frames of a video clip in C3D or I3D models. Afterwards, we feed the video feature into a stacked Bi-LSTM [39] network as the visual encoder to further aggregate its contextual information over the temporal domain as: V = VisualEncoder(V ),

(1)

where V = {vi }Ti=1 ∈ RT ×d is the final encoded visual feature and d is the dimension of the visual feature. Similarly, we use the GloVe [40] to encode each word in the sentence. The N ×dw  = { , where encoded feature of the input sentence is denoted as S si }N i=1 ∈ R dw is the GloVe embedding dimension. Similarly, we use a stacked Bi-LSTM network as the lauguage encoder to further aggregate its sequential context:  Q = LanguageEncoder(S),

(2)

N ×d is the final encoded query feature for N words in where Q = {qi }N i=1 ∈ R the query, each of which has the same feature dimension of the visual feature.

Cross-modal Interaction. After getting the visual feature and the language feature, we adopt the Context-Query Attention Layer (CQA) [41] to model the interaction between the visual and the language feature. It takes the visual and the language feature as input and outputs the fused feature that provides rich cross-modal information for localization. The output will then be fed into another BiLSTM(·) to capture the sequential relationship. F = BiLSTM(CQA(V , Q)) ∈ RT ×d ,

(3)

where F ∈ RT ×d is the cross-modal representation. 3.2
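To make the encoding pipeline concrete, here is a minimal PyTorch-style sketch of the two Bi-LSTM encoders and the cross-modal fusion step of Eqs. (1)-(3). It is not the authors' code: the `SimpleCrossAttention` block is a stand-in for the full Context-Query Attention layer of [41], and the class names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SimpleCrossAttention(nn.Module):
    """Simplified stand-in for the Context-Query Attention (CQA) layer."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.proj = nn.Linear(2 * d, d)

    def forward(self, v, q):
        # Attend from video clips to query words, then fuse with the visual stream.
        ctx, _ = self.attn(query=v, key=q, value=q)          # (B, T, d)
        return self.proj(torch.cat([v, ctx], dim=-1))        # (B, T, d)

class CrossModalEncoder(nn.Module):
    def __init__(self, d_v=4096, d_w=300, d=512):
        super().__init__()
        # Stacked Bi-LSTM encoders; hidden size d//2 so outputs are d-dimensional.
        self.visual_enc = nn.LSTM(d_v, d // 2, num_layers=2,
                                  bidirectional=True, batch_first=True)
        self.lang_enc = nn.LSTM(d_w, d // 2, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.cqa = SimpleCrossAttention(d)
        self.fuse = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)

    def forward(self, clip_feats, word_embs):
        V, _ = self.visual_enc(clip_feats)   # Eq. (1): (B, T, d)
        Q, _ = self.lang_enc(word_embs)      # Eq. (2): (B, N, d)
        F, _ = self.fuse(self.cqa(V, Q))     # Eq. (3): (B, T, d)
        return F, V, Q
```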

3.2 Feature Enhancement

The generated cross-modal representation $F$ contains both the visual and language information for the TSG task. However, it does not carry precise query-related boundary information, which is crucial for accurate moment localization. Therefore, we design a Boundary-aware Feature Enhancement (BAFE) module consisting of two branches that 1) strengthen the start and end boundary information and 2) enforce the proposal construction to also pay attention to the query-related elements in the cross-modal features. The process is shown in Fig. 3. Given the cross-modal representation $F$, we first pass it through two independent Bi-LSTMs to generate the boundary-aware feature $\tilde{F}_b$ and the semantic content feature $\tilde{F}_c$:

$\tilde{F}_b = \mathrm{BiLSTM}(F),$  (4)


Fig. 3. Feature Enhancement. We have two branches to generate the boundary-aware feature and semantic content feature.

$\tilde{F}_c = \mathrm{BiLSTM}(F),$  (5)

where $\tilde{F}_j = \{f_j^i\}_{i=1}^{T} \in \mathbb{R}^{T \times d}$, $j \in \{b, c\}$, are the enhanced cross-modal features.

We impose a temporal difference loss on the boundary-aware feature $\tilde{F}_b$ by calculating its temporal difference:

$\Delta f_b = \{\Delta f_b^i\}_{i=1}^{T}, \quad \text{where } \Delta f_b^i = \|f_b^i - f_b^{i-1}\|_2 + \|f_b^i - f_b^{i+1}\|_2,$  (6)

$d = \mathrm{Softmax}(\Delta f_b),$  (7)

where $d = \{d_i\}_{i=1}^{T}$, $d_i$ is a scalar value representing the temporal feature difference at position $i$, and $\|\cdot\|_2$ refers to the $l_2$-norm. For each video-query pair with ground truth start and end timestamps $\hat{t}_s$ and $\hat{t}_e$, we calculate the ground truth temporal difference regarding the boundaries:

$\tilde{d}_i = \frac{1}{2\sqrt{2\pi}\sigma}\left(e^{-\frac{(i-\hat{t}_s)^2}{2\sigma^2}} + e^{-\frac{(i-\hat{t}_e)^2}{2\sigma^2}}\right),$  (8)

where $i$ is the position index in the temporal domain and $\sigma$ is the standard deviation of the Gaussian distribution (in this paper, $\sigma = \alpha(\hat{t}_e - \hat{t}_s)$, where $\alpha$ is a hyper-parameter). We expect the temporal difference of $\tilde{F}_b$ to have high activation values at the ground truth start and end boundaries, as $\tilde{d}_i$ does. Thus, we define the temporal difference loss $L_{td}$ as

$L_{td} = -\sum_{i=1}^{T} \hat{d}_i \log(d_i),$  (9)

where $\hat{d}_i = \tilde{d}_i / \sum_{i=1}^{T} \tilde{d}_i$ is the normalized ground truth temporal difference. By minimizing $L_{td}$, $\tilde{F}_b$ is forced to become a boundary-aware feature.
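As an illustration, here is a small PyTorch-style sketch of the temporal difference loss in Eqs. (6)-(9). It is a direct reading of the formulas above, not the authors' released code; the `alpha` default is a placeholder hyper-parameter, and the Gaussian normalization constant cancels after the normalization of $\hat{d}$.

```python
import math
import torch
import torch.nn.functional as F

def temporal_difference_loss(fb, ts, te, alpha=0.2):
    """fb: boundary-aware feature (B, T, d); ts, te: ground-truth boundary indices (B,)."""
    B, T, _ = fb.shape
    # Eq. (6): l2 distance to the previous and next positions (edges padded by replication).
    prev = torch.cat([fb[:, :1], fb[:, :-1]], dim=1)
    nxt = torch.cat([fb[:, 1:], fb[:, -1:]], dim=1)
    delta = (fb - prev).norm(dim=-1) + (fb - nxt).norm(dim=-1)    # (B, T)
    d = F.softmax(delta, dim=-1)                                  # Eq. (7)

    # Eq. (8): Gaussian bumps centred at the ground-truth boundaries.
    idx = torch.arange(T, device=fb.device).float().unsqueeze(0)  # (1, T)
    sigma = alpha * (te - ts).float().clamp(min=1.0).unsqueeze(1)
    gt = torch.exp(-(idx - ts.unsqueeze(1)) ** 2 / (2 * sigma ** 2)) \
       + torch.exp(-(idx - te.unsqueeze(1)) ** 2 / (2 * sigma ** 2))
    gt = gt / (2 * math.sqrt(2 * math.pi) * sigma)
    d_hat = gt / gt.sum(dim=-1, keepdim=True)                     # normalized target

    return -(d_hat * torch.log(d + 1e-8)).sum(dim=-1).mean()      # Eq. (9)
```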

3.3 Proposal Construction

The general idea of proposal construction is to leverage both the discriminative boundary-aware feature and the informative semantic content feature to construct high quality proposal representations.


Proposal Generator. As illustrated in Fig. 4(a), in the proposal generator the vertical and horizontal axes denote the start and end indices of a proposal, respectively. Each block indicates a (start, end) pair, and blocks in blue are proposals with valid (start, end) pairs, where the start index is smaller than the end index. The proposal generator enumerates all possible valid pairs. Then, we densely sample short proposals and sparsely sample longer proposals so that proposals with large overlaps will not be selected [7].

Boundary-aware Feature Aggregation (BAFA). As shown in Fig. 4(b), given a proposal with start and end timestamps $(s, e)$ from the proposal generator, the BAFA module aggregates the boundary information from the boundary-aware feature $\tilde{F}_b$ and the content information from the semantic content feature $\tilde{F}_c$ to construct the proposal feature representation:

$R = \{r_m\}_{m=1}^{M}, \quad \text{where } r_m = \mathrm{MLP}([\bar{f}_c^{m}; f_b^{s_m}; f_b^{e_m}]) \in \mathbb{R}^{1 \times d},$  (10)

where $R \in \mathbb{R}^{M \times d}$ are the proposal features, $M$ is the number of sampled proposals, $f_b^{s_m}$ and $f_b^{e_m}$ are the elements at the start and end timestamps of the boundary-aware feature, and $\bar{f}_c^{m} = \mathrm{maxpool}(f_c^{s_m} : f_c^{e_m})$ is obtained by max-pooling the semantic content feature sequence from the start to the end position of the $m$-th proposal.
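A minimal sketch of the boundary-aware feature aggregation of Eq. (10), assuming the (start, end) index pairs of the sampled proposals are already available; the two-layer MLP design is illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BAFA(nn.Module):
    """Aggregate boundary and content features into proposal representations (Eq. 10)."""
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, fb, fc, starts, ends):
        # fb, fc: (T, d) boundary-aware / semantic content features of one video.
        # starts, ends: (M,) start and end clip indices of the sampled proposals.
        content = torch.stack([fc[s:e + 1].max(dim=0).values
                               for s, e in zip(starts.tolist(), ends.tolist())])  # (M, d)
        boundary = torch.cat([fb[starts], fb[ends]], dim=-1)                      # (M, 2d)
        return self.mlp(torch.cat([content, boundary], dim=-1))                   # (M, d)
```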

Fig. 4. (a) Proposal Constructor. We enumerate all possible proposals and sparsely sample long proposals. (b) Boundary-aware Feature Aggregation. We aggregate video clips in the boundary-aware feature and the semantic content feature and pass them through an MLP layer to generate the final proposal feature.

Proposal-level Contrastive Learning. We impose a contrastive loss on proposal representations aggregated from the semantic content feature (shown in Fig. 4(b)), which serves as an additional supervision to guide the BAFA module to generate query-related proposals. Given all proposals P , we treat proposals that include the whole ground truth interval as positive sample set P+ , and the


ones that have no overlap with the ground truth interval as the negative sample set $P_-$. Then, we use the MIL-NCE loss [42] as our contrastive loss:

$L_{cl} = -\log \dfrac{\sum_{p \in P_+} e^{h_t(q)^{\top} h_v(p)}}{\sum_{p \in P_+} e^{h_t(q)^{\top} h_v(p)} + \sum_{p \in P_-} e^{h_t(q)^{\top} h_v(p)}},$  (11)

where $q = \mathrm{AvgPool}(Q)$ is the global query feature representation and $p$ (equivalent to $\bar{f}_c^{m}$ in the BAFA module) is the proposal feature generated via max-pooling over the semantic content feature of the video clips. Similar to the SimCLR approach [34], we add two learnable projectors $h_v(\cdot)$ and $h_t(\cdot)$ to project the proposal feature and the text feature into a compatible embedding space, and apply the contrastive loss on the projected features.
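The following is a hedged PyTorch sketch of the proposal-level MIL-NCE loss in Eq. (11) for a single video-query pair; batching over multiple pairs and any temperature scaling are omitted since the equation above does not specify them.

```python
import torch
import torch.nn as nn

class ProposalContrastiveLoss(nn.Module):
    def __init__(self, d=512, d_proj=128):
        super().__init__()
        self.h_v = nn.Linear(d, d_proj)   # proposal projector
        self.h_t = nn.Linear(d, d_proj)   # query projector

    def forward(self, query_feat, prop_feats, pos_mask, neg_mask):
        """query_feat: (d,); prop_feats: (M, d) max-pooled content features;
        pos_mask / neg_mask: (M,) booleans marking P+ (covers GT) and P- (no overlap)."""
        q = self.h_t(query_feat)                    # (d_proj,)
        p = self.h_v(prop_feats)                    # (M, d_proj)
        scores = p @ q                              # (M,) dot products h_t(q)^T h_v(p)
        pos = torch.logsumexp(scores[pos_mask], dim=0)
        all_ = torch.logsumexp(scores[pos_mask | neg_mask], dim=0)
        return -(pos - all_)                        # Eq. (11) computed in log-space
```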

3.4 Proposal Refinement

The proposal refinement module aims to effectively select a subset of proposals and refine their feature representations for the final localization.

Coarse Predictor. After obtaining all proposals (shown in Fig. 5(a)) and their corresponding proposal features, we apply a predictor that takes a proposal feature as input and predicts the matching score of that proposal with respect to the query:

$p_i = \mathrm{Sigmoid}(\mathrm{MLP}(r_i)),$  (12)

where $r_i$ and $p_i$ are the feature representation and the predicted matching score of the $i$-th proposal, respectively.

Adaptive Proposal Selection. After obtaining the coarse matching scores of all proposals, we select $k$ of them. However, instead of directly selecting the proposals with the top-$k$ highest matching scores (shown in Fig. 5(b)), we design an adaptive selection strategy that considers both global diversity and local compactness, as sketched in the code below. Specifically, we first perform Non-Maximum Suppression to select the top $m$ confident anchor proposals with small mutual overlaps. Then, for each anchor proposal, we select the top $n$ confident nearby proposals that have large overlaps with that anchor ($k = m \times (n+1)$). Therefore, we are able to model both the global (between anchors) and local (between an anchor and its neighbors) relationships in the subsequent proposal interaction step. Figure 5(c) shows the effectiveness of the adaptive selection. By comparing Fig. 5(b) and (c), we can observe that the adaptive proposal selection strategy has a larger chance of including the ground truth moment in the pool of proposals for further refinement.
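A simplified sketch of the adaptive proposal selection described above, assuming a 1-D temporal IoU; the NMS overlap threshold (0.5 here) is a placeholder rather than a value reported in the paper.

```python
import torch

def temporal_iou(a, b):
    """a: (M, 2), b: (K, 2) proposals as (start, end); returns (M, K) IoU matrix."""
    inter = (torch.min(a[:, None, 1], b[None, :, 1])
             - torch.max(a[:, None, 0], b[None, :, 0])).clamp(min=0)
    union = (a[:, 1] - a[:, 0])[:, None] + (b[:, 1] - b[:, 0])[None, :] - inter
    return inter / union.clamp(min=1e-6)

def adaptive_select(proposals, scores, m=16, n=4, nms_thr=0.5):
    """Pick m diverse anchors via NMS, then n high-scoring neighbors per anchor."""
    order = scores.argsort(descending=True)
    anchors = []
    for i in order.tolist():                       # greedy NMS over the score order
        if len(anchors) == m:
            break
        if all(temporal_iou(proposals[i:i+1], proposals[a:a+1]).item() < nms_thr
               for a in anchors):
            anchors.append(i)

    selected = list(anchors)
    for a in anchors:                              # local neighbors of each anchor
        ious = temporal_iou(proposals[a:a+1], proposals).squeeze(0)
        ious[a] = -1.0                             # exclude the anchor itself
        neighbor_scores = scores.clone()
        neighbor_scores[ious < nms_thr] = -1e9     # keep only highly overlapping proposals
        selected += neighbor_scores.topk(n).indices.tolist()
    return torch.tensor(selected)                  # k = m * (n + 1) proposal indices
```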


Fig. 5. Effectiveness of the adaptive proposal selection. Each cell denotes a possible proposal and the x axis and y axis represent the end and start indices of the proposal, respectively. Cells in green are selected proposals, and cells in black are not selected. The red cross represents the ground truth temporal interval.

Proposal Interaction. Given a proposal with start and end positions, we embed it with its positional information based on sine and cosine functions of different frequencies [43]. Then, we adopt an MLP to project it into a new space:

$\hat{r}_i = \mathrm{MLP}([r_i; f_s^{pos}; f_e^{pos}]),$  (13)

where $r_i$ is the $i$-th proposal feature and $f_s^{pos}$ and $f_e^{pos}$ are the positional embeddings of the start and end positions, respectively. Then, we adopt a Graph Neural Network (GNN) with Edge Convolution [29] to perform proposal interactions and refine the proposal features, where all the selected proposals are densely connected:

$\tilde{R} = \mathrm{GNN}(\hat{R}), \quad \hat{R} = \{\hat{r}_i\}_{i=1}^{k},$  (14)

where $\tilde{R} = \{\tilde{r}_i\}_{i=1}^{k}$ are the refined proposal representations.
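A rough sketch of one densely connected edge-convolution update for the proposal interaction; this is one possible reading of EdgeConv [29] on a fully connected proposal graph, written in plain PyTorch rather than a graph library, with an illustrative MLP.

```python
import torch
import torch.nn as nn

class DenseEdgeConv(nn.Module):
    """One EdgeConv-style message-passing step over a fully connected proposal graph."""
    def __init__(self, d=512):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())

    def forward(self, r):
        # r: (k, d) selected proposal features.
        k, d = r.shape
        xi = r.unsqueeze(1).expand(k, k, d)           # receiver features
        xj = r.unsqueeze(0).expand(k, k, d)           # sender features
        messages = self.edge_mlp(torch.cat([xi, xj - xi], dim=-1))  # (k, k, d)
        return messages.max(dim=1).values             # aggregate messages per node
```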

Moment Localization. After obtaining the refined proposal features, we apply two predictors to predict the matching score and the boundary offsets of each proposal:

$\hat{p}_i = \mathrm{Sigmoid}(\mathrm{MLP}(\tilde{r}_i)),$  (15)

$(\Delta t_s^i, \Delta t_e^i) = \mathrm{MLP}(\tilde{r}_i),$  (16)

where $\hat{p}_i$ is the final predicted matching score and $\Delta t_s^i$ and $\Delta t_e^i$ are the predicted start and end boundary offsets, respectively. The refined proposal representations enable the newly predicted matching scores in Eq. (15) to be more precise than those from the coarse predictor. We use the truncated IoU value [7] as the ground truth matching score and adopt a binary cross-entropy loss for both the coarse and final matching score prediction losses $L_{mc}$ and $L_{mf}$, respectively, and a smooth $L_1$ loss for the boundary offset loss $L_b$:

$L_{mc} = -\frac{1}{M}\sum_{i=1}^{M}\left(\hat{y}_i \log p_i + (1-\hat{y}_i)\log(1-p_i)\right),$  (17)

$L_{mf} = -\frac{1}{k}\sum_{i=1}^{k}\left(\hat{y}_i \log \hat{p}_i + (1-\hat{y}_i)\log(1-\hat{p}_i)\right),$  (18)

$L_b = \mathrm{SmoothL1}(\Delta \hat{t}_s - \Delta t_s) + \mathrm{SmoothL1}(\Delta \hat{t}_e - \Delta t_e),$  (19)

where $\hat{y}_i$ is the ground truth matching score, $k$ is the number of selected proposals, $M$ is the total number of sampled proposals, and $\Delta \hat{t}_s$ and $\Delta \hat{t}_e$ are the ground truth boundary offsets.

3.5 Training and Inference

The overall training objective consists of the temporal difference loss $L_{td}$ (Eq. 9), the contrastive loss $L_{cl}$ (Eq. 11), the matching score prediction losses $L_{mc}$ (Eq. 17) and $L_{mf}$ (Eq. 18), and the boundary offset loss $L_b$ (Eq. 19):

$L = \alpha_{td} L_{td} + \alpha_{cl} L_{cl} + \alpha_{mc} L_{mc} + \alpha_{mf} L_{mf} + \alpha_{b} L_{b},$  (20)

where $\alpha_{td}$, $\alpha_{cl}$, $\alpha_{mc}$, $\alpha_{mf}$ and $\alpha_{b}$ are loss weights that balance the different loss contributions and are determined on the validation set. During inference, we pass the video sequence and the query sentence into the model and obtain a set of proposals together with their corresponding matching scores. We then select the proposal with the highest matching score to generate the final localization.
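For completeness, a small sketch of how the loss terms of Eqs. (17)-(20) could be computed and combined during training; the weight values are placeholders, since the paper only states that they are tuned on the validation set.

```python
import torch.nn.functional as F

def matching_and_boundary_losses(p_coarse, p_fine, y_coarse, y_fine,
                                 offset_pred, offset_gt):
    """Eqs. (17)-(19): BCE on coarse/final matching scores, smooth-L1 on offsets."""
    l_mc = F.binary_cross_entropy(p_coarse, y_coarse)   # over all M sampled proposals
    l_mf = F.binary_cross_entropy(p_fine, y_fine)       # over the k selected proposals
    l_b = F.smooth_l1_loss(offset_pred, offset_gt)
    return l_mc, l_mf, l_b

def total_loss(l_td, l_cl, l_mc, l_mf, l_b,
               w=(1.0, 1.0, 1.0, 1.0, 1.0)):            # placeholder loss weights
    """Eq. (20): weighted sum of the five training objectives."""
    return w[0] * l_td + w[1] * l_cl + w[2] * l_mc + w[3] * l_mf + w[4] * l_b
```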

4 Experiments

We evaluate the proposed BANet-APR and compare it with state-of-the-art approaches on the Charades-STA, ActivityNet Captions and TACoS datasets.

4.1 Dataset and Evaluation

Charades-STA. Charades-STA is built on the Charades dataset [44] of indoor activities and is extended by [11] with temporal annotations of text descriptions. In total, it has 6,672 videos and 16,128 moment-query pairs, with 12,408 pairs for training and 3,720 pairs for testing.

ActivityNet Captions. ActivityNet Captions was first introduced by [12] and contains more than 20k videos. Following the given split, we use val 1 as the validation set and val 2 as the testing set, resulting in 37,417, 17,505 and 17,031 samples for training, validation and testing, respectively.

TACoS. The TACoS [13] dataset is based on 127 indoor cooking videos of around 7 minutes on average. We follow the same splits as in [11], with 10,146, 4,589 and 4,083 moment-query pairs for training, validation and testing.


Evaluation Metric. For a fair comparison, we follow previous works [11,45] and adopt Recall@1 at various Intersection-over-Union thresholds (R1@IoU=m), which measures the percentage of predicted proposals whose IoU with the ground truth annotation is larger than m, where m ∈ {0.3, 0.5, 0.7}.

4.2 Implementation Details

For fair comparisons with the state of the art, we adopt C3D [37] features for the ActivityNet and TACoS datasets and I3D [38] features for the Charades-STA videos. We utilize pre-trained GloVe 840B-300d embeddings [40] to encode the query sentences. In our experiments, we set the number of hidden units in the Bi-LSTM to 256 and the feature dimension d to 512. In the contrastive loss, the dimension of the projected feature space is set to 128. In the adaptive proposal selection, we set the number of anchors m = 16 and the number of neighbors n = 4. The video feature sequence length for Charades-STA, ActivityNet and TACoS is set to 48, 64 and 128, respectively. The batch size is set to 32. We train our model using the Adam optimizer with a learning rate of 1 × 10^{-3}.
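Collected in one place, the hyper-parameters stated above can be expressed as a simple configuration sketch; the field names are ours, not the authors'.

```python
# Hyper-parameters as reported in the text; field names are illustrative.
BANET_APR_CONFIG = {
    "hidden_units_bilstm": 256,
    "feature_dim": 512,
    "contrastive_proj_dim": 128,
    "num_anchors_m": 16,
    "num_neighbors_n": 4,
    "video_seq_len": {"Charades-STA": 48, "ActivityNet": 64, "TACoS": 128},
    "batch_size": 32,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
}
```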

4.3 Comparison with State-of-the-Art Methods

In Table 1 we compare our BANet-APR with recent state-of-the-art methods [4,7,8,21,23,24,29,30,46,47]. Our model achieves the highest R1@IoU=0.7 on both Charades-STA and ActivityNet. In terms of R1@IoU=0.5, we obtain the highest score on Charades-STA and the second-best score on both ActivityNet and TACoS. Although our model only achieves the second-best scores on TACoS, it still improves over all other methods except CPN [48] by a large margin, and it gives much higher performance than CPN on the other two datasets.

4.4 Ablation Study

Effectiveness of Model Components. As shown in Table 2, we perform in-depth ablation studies on the effectiveness of the different components in our model on the Charades-STA dataset, including the Boundary-aware Feature Enhancement (BAFE) in feature enhancement, the Proposal-level Contrastive Learning (PCL) in proposal construction, and the Proposal Interaction (PI) and Adaptive Proposal Selection (APS) in proposal refinement. We observe that removing any component decreases the performance. Moreover, the last three rows show that introducing proposal interaction helps moment localization and that the adaptive selection strategy further improves the performance.

More Ablations on the Boundary-Aware Feature Enhancement. Table 3 shows the results of different designs of the boundary-aware feature enhancement module, from which we observe the following.


Table 1. Comparisons with other state-of-the-art methods. Bold and underline denote the best and the second best results, respectively. For a fair comparison, we only compare with models using the I3D or C3D feature.

Method       | ActivityNet R1@0.5 | ActivityNet R1@0.7 | Charades-STA R1@0.5 | Charades-STA R1@0.7 | TACoS R1@0.3 | TACoS R1@0.5
DRN [24]     | 45.45 | 24.36 | 53.09 | 31.75 | –     | 23.17
LGI [23]     | 41.51 | 23.07 | 59.46 | 35.48 | –     | –
CPN [48]     | 45.10 | 28.10 | 59.77 | 36.67 | 48.29 | 36.58
IVG-DCL [8]  | 43.84 | 27.10 | –     | –     | 38.84 | 29.07
DeNet [49]   | 43.79 | –     | 59.70 | 38.52 | –     | –
DRFT [45]    | 42.37 | 25.23 | 60.79 | 36.72 | –     | –
CBLN [47]    | 48.12 | 27.60 | 61.13 | 38.22 | 38.98 | 27.65
2D-TAN [7]   | 44.51 | 26.54 | –     | –     | 37.29 | 25.32
RaNet [4]    | 45.59 | 28.67 | 60.40 | 39.65 | 43.34 | 33.54
LPNet [30]   | 45.92 | 25.39 | 54.33 | 34.03 | –     | –
APGN [29]    | 48.92 | 28.64 | 62.58 | 38.86 | 39.34 | 28.34
Ours         | 48.12 | 29.67 | 63.68 | 42.28 | 48.24 | 33.74

First, simply taking the max-pooled semantic content feature to construct the proposal representations gives the worst result. Second, introducing the boundary-aware feature gives better results on the higher IoU metric. Finally, the combination of the two gives the best performance, which verifies that they are complementary to each other.

4.5 Qualitative Results

Figure 6 demonstrates how the boundary-aware feature helps the localization.

Table 2. Effectiveness of different components in our model on the Charades-STA dataset (components: BAFE, PCL, PI, APS; metrics: R1@0.5 / R1@0.7). The four ablated variants reach 58.01/34.35, 61.75/39.68, 62.23/40.78 and 63.12/41.34, while the full model with all components achieves 63.68/42.28.


Table 3. Ablations of the boundary-aware feature enhancement on the Charades-STA dataset. F̃c is the semantic content feature and F̃b is the boundary-aware feature.

F̃c | F̃b | R1@0.5 | R1@0.7
✓   |     | 62.35  | 39.07
    | ✓   | 61.96  | 40.45
✓   | ✓   | 63.68  | 42.28

The bottom row in Fig. 6 displays the temporal difference values of the boundary-aware feature F̃b after softmax normalization. We can observe that the boundary-aware feature has high activation values around the start and end positions, which aids the temporal localization accuracy.

Fig. 6. Qualitative examples on the boundary-aware feature.

5 Conclusion

In this paper, we propose a novel Boundary-aware Network with Adaptive Proposal Refinement (BANet-APR) for the TSG task. Specifically, we design a Boundary-aware Feature Enhancement (BAFE) module to extract the boundary information and introduce a Boundary-aware Feature Aggregation (BAFA) module where the boundary-aware feature and the semantic content feature work together to construct discriminative and informative proposal representations. Moreover, a Proposal-level Contrastive Learning (PCL) method is proposed to enforce the semantic content feature to be query-related. Finally, we propose to adaptively select a subset of proposals and perform proposal interactions to refine their feature representations for the final localization. We conduct experiments on three benchmark datasets and show the effectiveness of our model. Acknowledgement. Jianxiang Dong and Zhaozheng Yin have been supported by National Science Foundation via National Robotics Initiative grant CMMI-1954548 and Human Technology Frontier grant ECCS-2025929.


References

1. Zhang, H., Sun, A., Jing, W., Zhou, J.T.: Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931 (2020)
2. Zhu, F., Zhu, Y., Chang, X., Liang, X.: Vision-language navigation with self-supervised auxiliary reasoning tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10012–10022 (2020)
3. Huang, D., Chen, P., Zeng, R., Du, Q., Tan, M., Gan, C.: Location-aware graph convolutional networks for video question answering. Proceed. AAAI Conf. Artif. Intell. 34, 11021–11028 (2020)
4. Gao, J., Sun, X., Xu, M., Zhou, X., Ghanem, B.: Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717 (2021)
5. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
6. Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5783–5792 (2017)
7. Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. Proceed. AAAI Conf. Artif. Intell. 34, 12870–12877 (2020)
8. Nan, G., et al.: Interventional video grounding with dual contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2765–2775 (2021)
9. Wang, Z., Wang, L., Wu, T., Li, T., Wu, G.: Negative sample matters: a renaissance of metric learning for temporal grounding. Proceed. AAAI Conf. Artif. Intell. 36, 2613–2623 (2022)
10. Zhang, D., Dai, X., Wang, X., Wang, Y.F., Davis, L.S.: MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1247–1257 (2019)
11. Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275 (2017)
12. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning events in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715 (2017)
13. Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. 1, 25–36 (2013)
14. Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5803–5812 (2017)
15. Dong, J., et al.: Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. 44, 4065–4080 (2021)
16. Ge, R., Gao, J., Chen, K., Nevatia, R.: MAC: mining activity concepts for language-based temporal localization. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 245–253 (2019)
17. Liu, M., Wang, X., Nie, L., He, X., Chen, B., Chua, T.S.: Attentive moment retrieval in videos. In: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 15–24 (2018)


18. Chen, J., Chen, X., Ma, L., Jie, Z., Chua, T.S.: Temporally grounding natural sentence in video. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 162–171 (2018)
19. Zhang, Z., Lin, Z., Zhao, Z., Xiao, Z.: Cross-modal interaction networks for query-based moment retrieval in videos. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 655–664 (2019)
20. Liu, M., Wang, X., Nie, L., Tian, Q., Chen, B., Chua, T.S.: Cross-modal moment localization in videos. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 843–851 (2018)
21. Yuan, Y., Ma, L., Wang, J., Liu, W., Zhu, W.: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In: Advances in Neural Information Processing Systems (NIPS), pp. 534–544 (2019)
22. Xu, H., He, K., Plummer, B.A., Sigal, L., Sclaroff, S., Saenko, K.: Multilevel language and vision integration for text-to-clip retrieval. Proceed. AAAI Conf. Artif. Intell. 33, 9062–9069 (2019)
23. Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819 (2020)
24. Zeng, R., Xu, H., Huang, W., Chen, P., Tan, M., Gan, C.: Dense regression network for video grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10287–10296 (2020)
25. Cao, M., Chen, L., Shou, M.Z., Zhang, C., Zou, Y.: On pursuit of designing multimodal transformer for video grounding. arXiv preprint arXiv:2109.06085 (2021)
26. Zhang, M., et al.: Multi-stage aggregated transformer network for temporal language localization in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12669–12678 (2021)
27. Wang, J., Ma, L., Jiang, W.: Temporally grounding language queries in videos by contextual boundary-aware prediction. Proceed. AAAI Conf. Artif. Intell. 34, 12168–12175 (2020)
28. Xu, M., et al.: Boundary-sensitive pre-training for temporal localization in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7220–7230 (2021)
29. Liu, D., Qu, X., Dong, J., Zhou, P.: Adaptive proposal generation network for temporal sentence localization in videos. arXiv preprint arXiv:2109.06398 (2021)
30. Xiao, S., Chen, L., Shao, J., Zhuang, Y., Xiao, J.: Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678 (2021)
31. Van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
32. Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717 (2020)
33. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742 (2018)
34. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
35. Grill, J.B., et al.: Bootstrap your own latent: a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)


36. Gupta, T., Vahdat, A., Chechik, G., Yang, X., Kautz, J., Hoiem, D.: Contrastive learning for weakly supervised phrase grounding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 752–768. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_44
37. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497 (2015)
38. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6299–6308 (2017)
39. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997)
40. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
41. Yu, A.W., et al.: QANet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018)
42. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889 (2020)
43. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
44. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_31
45. Chen, Y.W., Tsai, Y.H., Yang, M.H.: End-to-end multi-modal video temporal grounding. In: Advances in Neural Information Processing Systems 34 (2021)
46. Rodriguez, C., Marrese-Taylor, E., Saleh, F.S., Li, H., Gould, S.: Proposal-free temporal moment localization of a natural-language query in video using guided attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2464–2473 (2020)
47. Liu, D., et al.: Context-aware biaffine localizing network for temporal sentence grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11235–11244 (2021)
48. Zhao, Y., Zhao, Z., Zhang, Z., Lin, Z.: Cascaded prediction network via segment tree for temporal video grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4197–4206 (2021)
49. Zhou, H., Zhang, C., Luo, Y., Chen, Y., Hu, C.: Embracing uncertainty: decoupling and de-bias for robust temporal grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8445–8454 (2021)

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Bingjia Li1, Jie Wang2, Minyi Zhao1, and Shuigeng Zhou1(B)

1 Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
{bjli20,zhaomy20,sgzhou}@fudan.edu.cn
2 ByteDance, Shanghai, China
[email protected]

Abstract. Text-based visual question answering (TextVQA) is to answer a text-related question by reading texts in a given image, which needs to jointly reason over three modalities—question, visual objects and scene texts in images. Most existing works leverage graph or sophisticated attention mechanisms to enhance the interaction between scene texts and visual objects. In this paper, observing that compared with visual objects, the question and scene text modalities are more important in TextVQA while both layouts and visual appearances of scene texts are also useful, we propose a two-stage multimodality fusion based method for high-performance TextVQA, which first semantically combines the question and OCR tokens to understand texts better and then integrates the combined results into visual features as additional information. Furthermore, to alleviate the redundancy and noise in the recognized scene texts, we develop a denoising module with contrastive loss to make our model focus on the relevant texts and thus obtain more robust features. Experiments on the TextVQA and ST-VQA datasets show that our method achieves competitive performance without any large-scale pre-training used in recent works, and outperforms the state-of-the-art methods after being pre-trained.

Keywords: TextVQA · Scene text recognition · Multimodal information fusion · Contrastive learning

1 Introduction

Nowadays, numerous methods [2,15,24,30] have been proposed to solve the task of visual question answering (VQA) [3], which is to answer questions about images. However, these methods fail in answering text-related questions as they usually focus on objects and scenes while ignoring texts in the images. Actually, text matters in real life as it appears ubiquitously in practical images like advertisements, conveying valuable information that is essential for scene understanding and reasoning. Thus, text-based visual question answering (TextVQA) [5,23,28] is gaining


Fig. 1. Some examples of TextVQA. The 1st, 2nd and 3rd rows show the original images, all the scene text areas extracted by an OCR system in each image, and the questions, respectively. All these questions can be correctly answered with only these cropped scene texts in the images. To correctly answer these questions, the model requires to first understand the semantics of the questions and scene texts, and then use some additional scene text information: (a) layout information for Q1, (b) visual appearance for Q2, (c) both layout and visual appearance for Q3.

popularity in recent years as an extension of VQA to answer text-related questions by reading and understanding scene texts in images.

Motivation. In general, the TextVQA task requires the model to read scene texts in images with an OCR system and jointly reason over three modalities: question, visual objects and scene texts. From our point of view, TextVQA research faces two major technical challenges:

1) How to effectively exploit multimodal information? Most existing works [9,11,14,19] use an object detector to extract global visual objects and utilize graph or complicated attention mechanisms to enhance the interaction between OCR tokens and visual objects. However, Wang et al. [31] pointed out that the accuracy of M4C [12] is almost unaffected after discarding the visual object modality, because the information of visual objects is not utilized well. Then, do global visual objects matter in TextVQA? As we all know, there are three modalities available for TextVQA: question, visual objects and scene texts in each image. Each piece of scene text contains textual content, layout and visual appearance (font, color, background, etc.). Our preliminary study shows that, compared to visual objects, scene texts are more important, and that for high-performance TextVQA both layouts and visual appearances are indispensable. Concretely, we found that over 70% of the questions can be answered by using only the scene text areas of the images, and 60% of the questions can still be correctly answered even after discarding the visual appearances of scene texts while keeping only the layout information. Layouts provide contextual information, which is not only critical for answering position-related questions but also helpful for text understanding. Figure 1 shows some examples of TextVQA. We can see that to answer Q1 (Fig. 1(a)), layouts of scene texts are required, and visual appearances of scene texts are critical to


Q2 (Fig. 1(b)). To correctly answer Q3 (Fig. 1(c)), besides layouts, the blue background of the text “Park ave” is also indispensable, without which the answer may mistakenly become “STOP”.

2) How to select proper scene texts to answer the question? Existing methods try to use as many OCR tokens as possible in the image to provide enough semantic information and make sure that the correct answer texts are included in the input sequence. Although more OCR tokens bring more contextual information, which does help improve the performance, they also inevitably introduce noise. Note that not all scene texts are relevant to the question. For example, in Fig. 1(a), only texts containing numbers (e.g. “11870” or “15”) need to be considered, as the question asks for a number. Too many unrelated OCR tokens may confuse the model, especially in text-intensive scenarios, so an ideal solution is to select the texts most relevant to the question when the semantic relationship alone is not enough to support question answering.

Solution and Contributions. In this paper, we pay more attention to the question and scene text modalities. On the one hand, to address the first challenge mentioned above, we propose a two-stage multimodality fusion module to take full advantage of the textual content, layouts and visual appearances of scene texts. In the first fusion stage, our model tries to understand the question and scene texts by combining them with the layouts of scene texts as contextual information with the help of LayoutLM [35]. After the interaction of textual and contextual features, visual features are included in the second fusion stage to handle questions that need the help of visual clues from scene texts. On the other hand, to handle the second challenge, i.e., reducing redundancy and noise in the recognized scene texts, we develop a denoising module that masks irrelevant OCR tokens and uses a contrastive loss to integrate the features of positive samples. In such a way, our model is able to focus on relevant texts and obtain more robust features. In summary, the contributions of this paper are:

– Observing that the question and scene text modalities are of first importance in TextVQA, while both layouts and visual appearances of scene texts are useful, we propose a two-stage multimodality fusion based method to take full advantage of this information to boost TextVQA.
– We develop a denoising module with contrastive loss as an auxiliary task to reduce the redundancy and noise of recognized scene texts, which makes the model focus on the relevant texts and obtain more robust features.
– We validate the effectiveness and superiority of our method on the TextVQA and ST-VQA benchmarks. Experimental results show that our method achieves competitive results without any large-scale pre-training used in recent works, and outperforms the state-of-the-art methods after being pre-trained.

2 Related Work

TextVQA aims to answer text-related questions by first reading scene texts in images and then reasoning over three modalities—question, visual objects

Two-Stage Multimodality Fusion for High-Performance TextVQA

661

and scene texts. As a pioneer work, Singh et al. [28] proposed the first dataset, TextVQA, with a framework LoRRA that extends the VQA model Pythia [13] with an OCR attention branch. Later, several other datasets were built with texts of different scenarios, e.g. ST-VQA [5] in daily natural scenes, OCR-VQA [23] of book covers, STE-VQA [32] with bilingual texts and M4-ViteVQA [39] in video text understanding. Recent works [4,9–12,14,19,21,29,31,36–38,40] have tried to improve the performance of TextVQA with various network architectures, more powerful OCR systems or large-scale datasets. Among them, M4C [12] utilizes multimodal transformers to fuse all modalities with a dynamic pointer network supporting multi-step answer decoding, which is the basis of most later works. Based on M4C, SA-M4C [14] proposes a spatiality-aware self-attention layer and handles different spatial relationships with different attention heads. Similarly, some other works [9,10,19,38,40] leverage graph or complicated attention mechanisms to emphasize the relationships between objects and OCR tokens, but the performance improvement is mainly gained from stronger OCR systems. TAP [36] is the first work to introduce pre-training to this task and pre-trains the model with three auxiliary tasks. With the help of the Microsoft-OCR system and the proposed large-scale dataset OCR-CC, it significantly boosts the TextVQA performance. LOGOS [21] enhances the model's understanding ability with two grounding tasks to better localize the key information of the image. LaTr [4] bases its architecture on T5 [25] and applies the pre-training strategy on large-scale scanned documents. Although some recent works emphasize the significance of the question and scene texts, none of them takes full advantage of these two modalities together with the layouts and visual appearances of scene texts simultaneously. In this paper, we propose a two-stage multimodality fusion based method to comprehensively exploit such information. In addition, we also develop a denoising module with contrastive loss to reduce the redundancy and noise of recognized scene texts, which makes the model focus on the relevant texts. Our experiments verify the effectiveness and advantage of the proposed method.

3 Methodology

3.1 Overview

Figure 2 shows the architecture of our method, which mainly consists of three components: multimodal feature extraction, two-stage multimodality fusion and denoising. Besides, an optional pre-training component is considered. Given a sample X with an image I and a text-related question Q, we first extract features of the question, scene text and visual object modalities. These features are then progressively fused and reasoned over with our two-stage multimodality fusion module, where the first stage focuses on textual and layout information from Q and the scene texts in I, and the second stage includes the visual appearances of scene texts and utilizes global visual objects as auxiliary information.


Fig. 2. The architecture of our method. After features of different modalities are extracted, they are fused and reasoned progressively with the two-stage fusion module. The output features of scene texts will be used for further answer decoding with the help of the denoising module, which masks the input and uses contrastive loss as an auxiliary task.

The denoising module first masks the input OCR tokens, and then utilizes the masked result and a contrastive loss as an auxiliary task to make the model focus on the relevant texts. Optionally, our model can be further pre-trained on the question and scene text modalities to boost the performance. Note that in our method, the denoising module is used only in the fine-tuning stage.

3.2 Multimodal Features

OCR Features. After extracting OCR tokens in image I with an OCR system [7], previous works [12,19,36] obtain multiple features of the OCR tokens with various pre-trained models and add them together before fusion. Unlike them, we categorize the features into three types: layout embeddings, visual appearance features and word embeddings. Let $O = \{O_i\}_{i=1}^{N}$ be the OCR tokens after tokenization, where $N$ is the length of the sequence. The layout embedding $x_i^{ocr,l}$ of the $i$-th OCR token $O_i$ indicates its size and 2-D spatial position, and is defined as

$x_i^{ocr,l} = E_x(x_i^0) + E_y(y_i^0) + E_x(x_i^1) + E_y(y_i^1) + E_w(w_i) + E_h(h_i),$  (1)

where $E_x$, $E_y$, $E_w$ and $E_h$ are learnable embedding layers, $(x_i^0, y_i^0)$ and $(x_i^1, y_i^1)$ denote the coordinates of the upper-left and bottom-right corners, and $w_i$ and $h_i$ are the width and height of the detected bounding box. All coordinates are scaled to 0–1000. For each OCR bounding box, we use Faster R-CNN [26] to extract the visual appearance feature $x_i^{ocr,v}$. As for the texts, the word embedding is represented as $x_i^{ocr,t} = E_t(O_i)$, where $E_t$ is a learnable embedding layer. The final representations of the detected OCR tokens are then defined as $X^{ocr,l} = \{x_i^{ocr,l}\}_{i=1}^{N}$, $X^{ocr,v} = \{x_i^{ocr,v}\}_{i=1}^{N}$ and $X^{ocr,t} = \{x_i^{ocr,t}\}_{i=1}^{N}$.
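A minimal sketch of the layout embedding of Eq. (1), assuming boxes are already scaled to the 0–1000 range; the embedding dimension is illustrative.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Sum of learnable embeddings for box coordinates, width and height (Eq. 1)."""
    def __init__(self, dim=768, max_coord=1001):
        super().__init__()
        self.ex = nn.Embedding(max_coord, dim)
        self.ey = nn.Embedding(max_coord, dim)
        self.ew = nn.Embedding(max_coord, dim)
        self.eh = nn.Embedding(max_coord, dim)

    def forward(self, boxes):
        # boxes: (N, 4) integer tensor of (x0, y0, x1, y1), each scaled to 0-1000.
        x0, y0, x1, y1 = boxes.unbind(-1)
        w = (x1 - x0).clamp(min=0)
        h = (y1 - y0).clamp(min=0)
        return (self.ex(x0) + self.ey(y0) + self.ex(x1) + self.ey(y1)
                + self.ew(w) + self.eh(h))
```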


Question Features. Let $Q = \{Q_i\}_{i=1}^{L}$ be the sequence of question tokens after tokenization, where $L$ is the sequence length. For the $i$-th token $Q_i$, the word embedding is represented as $x_i^{q,t} = E_t(Q_i)$, which shares the same embedding layer as the OCR word embedding. The final representation is defined as $X^{q,t} = \{x_i^{q,t}\}_{i=1}^{L}$.

Object Features. As mentioned above, the visual object modality is not the key factor in the TextVQA task, so we only use it as a supplement. To unify the inputs, we use the same Faster R-CNN to detect visual objects $V = \{V_i\}_{i=1}^{M}$ in image $I$, where $M$ is the number of objects, and then extract the feature of each object $V_i$ as $x_i^{obj,v}$. Similar to the OCR embedding, we obtain the layout embedding $x_i^{obj,l} = E_x(x_i^0) + E_y(y_i^0) + E_x(x_i^1) + E_y(y_i^1) + E_w(w_i) + E_h(h_i)$. As the features of visual objects are applied only in the second fusion stage, we simply sum them up as $x_i^{obj} = W_1 x_i^{obj,v} + x_i^{obj,l}$, where $W_1$ is a linear layer that controls the dimension. The final object representation is defined as $X^{obj} = \{x_i^{obj}\}_{i=1}^{M}$.

3.3 Two-Stage Multimodality Fusion

After the features of different modalities are extracted, a common routine is to add the unimodal features together and fuse them with a multimodal transformer, which is not effective enough in our work. To take full advantage of the textual, layout and visual appearance information, we propose two-stage multimodality fusion. In the first fusion stage, the model focuses on understanding the texts, including the question and scene texts. We believe that there is a semantic connection between the scene texts and the question: most questions are semantically closely related to the texts in the image, so texts can provide valuable clues for answering the question, which constitutes the basis of the TextVQA task. Therefore, we give these two modalities the highest priority and understand them jointly. Besides, previous works [34,35] on document understanding have shown the value of layout information, which provides contextual information. Similarly, in natural scenes, a specific OCR token's 2-D position and positional relationship with its contextual tokens help to understand the OCR token. Here, we base the first-stage fusion on LayoutLM [35]. LayoutLM is a BERT-like model that incorporates visually rich layout information and aligns it with the input texts. With both word embeddings and layout embeddings as input, it is pre-trained with the masked visual language model (MVLM) on the document dataset IIT-CDIP Test Collection 1.0 [18], which also brings more knowledge to our model. As there is a second fusion stage, we use only the first 6 layers of LayoutLM, with the weights from HuggingFace [33] as initialization. To jointly reason over the question and OCR tokens, as shown in Fig. 3, we concatenate the features of the question and OCR tokens. The input of the OCR tokens is defined as $X^{ocr,t} + X^{ocr,l}$, and we add a special [CLS] token at the beginning of the OCR tokens to represent the whole text, which will be used in the denoising module. As to the question, because there is no layout embedding, we


Fig. 3. The input of the first multimodality fusion stage. We concatenate the features of the question and OCR tokens together, including both word embeddings and layout embeddings. All the coordinates for the question are set to zero.

set all the coordinates to zero. Finally, we input the unified features into the fusion module and obtain

$[X^{q'}; X^{ocr'}] = \Phi([X^{q,t}; X^{ocr,t} + X^{ocr,l}]),$  (2)

where $X^{q'}$ and $X^{ocr'}$ are the semantically enhanced features of the question and OCR tokens, respectively, $\Phi(\cdot)$ is the first fusion module (LayoutLM) and $[;]$ is the concatenation operation.

In the second fusion stage, the visual appearance features of scene texts are introduced as additional information to help handle questions that require visual clues from scene texts. Here, we combine them with the semantic features obtained in the first fusion stage. In addition, we also use the visual objects as an aid and then fuse the three modalities with a multimodal transformer. The output features are represented as

$[X^{q''}; X^{ocr''}; X^{obj'}] = \Psi([X^{q'}; X^{ocr'} + W_2 X^{ocr,v}; X^{obj}]),$  (3)

where $X^{q''}$, $X^{ocr''}$ and $X^{obj'}$ are the outputs for the question, scene texts and visual objects, respectively, $\Psi(\cdot)$ is the second fusion module and $W_2$ is a linear layer that projects $X^{ocr,v}$ to the same dimension as $X^{ocr'}$. Finally, $X^{ocr''}$ will be used in the denoising module and for further answer decoding.
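A rough sketch of the two-stage fusion flow of Eqs. (2)-(3). The LayoutLM call uses the HuggingFace interface mentioned by the authors, but the exact wiring (how the layout boxes are fed, the second-stage transformer configuration, the visual feature dimension) is our simplification rather than the paper's implementation.

```python
import torch
import torch.nn as nn
from transformers import LayoutLMModel

class TwoStageFusion(nn.Module):
    def __init__(self, d=768, n_layers2=4, n_heads=12):
        super().__init__()
        # Stage 1: first 6 layers of LayoutLM jointly encode question + OCR tokens.
        self.layoutlm = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")
        self.layoutlm.encoder.layer = self.layoutlm.encoder.layer[:6]
        # Stage 2: a generic multimodal transformer encoder.
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.stage2 = nn.TransformerEncoder(layer, n_layers2)
        self.w2 = nn.Linear(2048, d)   # projects Faster R-CNN OCR appearance features

    def forward(self, input_ids, bbox, ocr_visual, obj_feats, n_question):
        # Eq. (2): question boxes are all-zero, OCR tokens carry their layout boxes.
        stage1 = self.layoutlm(input_ids=input_ids, bbox=bbox).last_hidden_state
        q, ocr = stage1[:, :n_question], stage1[:, n_question:]
        # Eq. (3): add OCR visual appearance, append object features, fuse again.
        ocr = ocr + self.w2(ocr_visual)
        fused = self.stage2(torch.cat([q, ocr, obj_feats], dim=1))
        return fused
```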

3.4 Denoising Module

More scene texts bring more noise, and considering only the semantic relationship between the question and OCR tokens is not enough to find the texts relevant to answering the question. To relieve this issue, we design a denoising module, which augments the samples by randomly masking the input OCR tokens to reduce the modality interaction and understanding difficulty, and utilizes contrastive learning to make the model focus on the relevant texts. In the following, we introduce this module in detail.


Masking Strategy. The text of the correct answer provides essential information, without which the text-related question about the image cannot be answered, so we avoid masking the tokens that appear in the answer. The remaining OCR tokens are randomly masked with a certain probability. It is worth mentioning that masking OCR tokens by deleting them entirely may lead to confusion. Consider a question like "What is the second word on the page?": if the first word is deleted and we simply drop its feature, the original second word is no longer the correct answer, as it has become the first one. Therefore, we replace the masked OCR tokens with a special [MASK] token while keeping the corresponding layout and visual appearance features. In this way, our model knows that a word exists there but not what the word is, which effectively reduces the interaction and understanding difficulty. Besides, in order to enlarge the differences between the original sample and the augmented ones, we use whole word masking. For example, in our model the input OCR word "vegeburger" is tokenized into three subtokens "ve", "##ge" and "##burger", and the other features are duplicated for each token. Once the word is chosen, we mask these subtokens of the word separately.

Contrastive Learning. With this masking strategy, for a batch of $l$ samples, we augment each sample $t$ times with a probability $p$. Given a pair of question and answer (a training sample), these augmented samples can be seen as positive samples of the original one, while the rest are negative samples. As mentioned above, for the $i$-th sample $X_i$ in the batch we can get the output features $X_i^{ocr''}$ of the scene texts with the two-stage fusion module. We take the feature of the first [CLS] token as the representation of the whole sequence and project it to $z_i \in \mathbb{R}^d$ with a contrastive projection head, which is composed of two linear layers. Similarly, we can get the contrastive outputs of all the augmented samples using the same module. Then, we utilize a contrastive loss to constrain the distance between positive samples in the latent space:

$L_{cont} = \sum_{i \in B} L_{cont}^{i} = -\sum_{i \in B} \log \dfrac{\exp(z_i \cdot z_{j(i)} / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},$  (4)

where $B = \{1, ..., l \times (t+1)\}$ is the index set of samples in the batch after augmentation, "·" denotes the dot product, $j(i)$ denotes the indexes of the positive samples of the $i$-th sample, $A(i) = B \setminus i$, and $\tau \in \mathbb{R}^+$ is a scalar temperature parameter. As only the correct answer is never masked, it plays a key role in the loss, and the model is thus forced to pay more attention to it.

Training. For the main task, we use the teacher-forcing technique [17] and a multi-label binary cross-entropy loss $L_{bce}$ to train the model, defined as

$L_{bce} = -y_{gt}\log(y_{pred}) - (1 - y_{gt})\log(1 - y_{pred}), \quad y_{pred} = \dfrac{1}{1 + \exp(-f(X^{ocr''}))},$  (5)

where $y_{gt}$ is the ground-truth target, $y_{pred}$ is the final prediction of our model and $f(\cdot)$ is the answer generation module, following previous works on the TextVQA task.
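A hedged sketch of the contrastive term in Eq. (4), written for one batch of projected [CLS] features; it assumes a boolean matrix marking which pairs are augmentations of the same original sample, and it averages over each sample's positives in the supervised-contrastive style, which is one common reading of the equation.

```python
import torch

def denoising_contrastive_loss(z, pos_matrix, tau=0.07):
    """z: (B, d) projected [CLS] features;
    pos_matrix: (B, B) bool, True where samples i and j come from the same original."""
    B = z.size(0)
    sim = (z @ z.t()) / tau                              # pairwise dot products / temperature
    mask_self = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float("-inf"))      # exclude a = i, i.e. A(i) = B \ i
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = pos_matrix & ~mask_self
    # Average the log-probability over each sample's positives, then over the batch.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```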


Finally, the total loss $L$ is defined as a linear combination of the two losses above:

$L = \lambda_1 L_{bce} + \lambda_2 L_{cont},$  (6)

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that trade off the two losses.

3.5 Pre-training

So far, we have introduced the main components of our method. Recent works [4,36] tend to take advantage of pre-training to improve the performance. As an option, our model can also be pre-trained to achieve higher performance. Here, we conduct the masked language modeling (MLM) task on our model. Unlike TAP [36], we directly pre-train our model on the question and scene text modalities with both layout and visual appearance features, but do not introduce additional texts to enhance the question modality. LayoutLM has been pre-trained with both the texts and the layout information of documents. Considering the domain gap between documents and daily natural scenes, and the fact that natural scenes contain more visual information, we further pre-train our model to align all these features. In the pre-training stage, we randomly mask each text token of both the question and the scene texts with a probability of 15%. The masked tokens are replaced by a special [MASK] with 80% probability, replaced by a random token with 10% probability, and remain unchanged with 10% probability. Note that we only mask the input tokens while keeping the corresponding 2-D position embeddings and visual appearance features, as we believe that they provide additional contextual information: tokens with a close spatial relationship or a similar visual appearance tend to be more related in semantics. Then, the model is required to recover the masked word Wmask with two fully-connected layers. In our experiments, we find that the image-text matching (ITM) task used in TAP brings no performance improvement, so we do not use it. Finally, we pre-train our model with a cross-entropy loss and use our denoising module in the fine-tuning stage.

4 Experiments

4.1 Datasets and Evaluation Metrics

TextVQA is the first proposed large-scale dataset for the TextVQA task, which contains a total of 45,336 questions about 28,408 images sampled from the Open Images dataset [16]. All the questions are related to the texts in the images, and each of them has 10 answers provided by 10 different annotators. ST-VQA is similar to TextVQA; it consists of 23,038 images collected from more diverse sources. There are 31,791 questions in total, and each question has up to two answers. All the questions can only be answered based on the texts that appear in the images. We report our results on the open dictionary task (task 3), which contains 18,921 training images and 2,971 test images. Following previous


works [12], we split the training images into a training set with 17,028 images and a validation set with 1,893 images.

TextCaps and OCR-CC are introduced additionally during the pre-training stage. TextCaps [27] reuses images from the TextVQA dataset and attaches 145,329 captions to them, while OCR-CC was proposed along with TAP [36] and contains 1.367 million text-related image-caption pairs. These captions play the same role as the questions to increase the amount of data in the pre-training stage.

Evaluation Metrics. For the TextVQA dataset, we report the VQA accuracy [3] measured via the soft voting of the 10 answers. For the ST-VQA dataset, besides VQA accuracy we also report the Average Normalized Levenshtein Similarity (ANLS), defined as $1 - d_L(a_{pred}, a_{gt})/\max(|a_{pred}|, |a_{gt}|)$, where $a_{pred}$ and $a_{gt}$ are the predicted and ground-truth answers and $d_L$ refers to the edit distance. The final result is the average of all scores, with scores below the threshold 0.5 truncated to 0.
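As an illustration, the ANLS metric described above can be computed as follows; this is a straightforward reading of the definition, not the official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def anls(pred: str, gts: list, threshold: float = 0.5) -> float:
    """ANLS of one prediction against its ground-truth answers (best match is kept)."""
    best = 0.0
    for gt in gts:
        s = 1 - levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
        best = max(best, s)
    return best if best >= threshold else 0.0
```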

4.2 Implementation Details

We implement our method based on the code of TAP [36]. The maximum input sequence lengths of question tokens and OCR tokens and the number of objects are set to L = 20, N = 300 and M = 100, respectively. For our two-stage fusion module, the base model contains a LayoutLM with 6 layers and a multimodal transformer with 4 layers and 12 attention heads, while '†' refers to a larger model with 8 and 12 layers adopted in LayoutLM and the multimodal transformer, respectively. The LayoutLM is initialized with weights from HuggingFace [33] and the multimodal transformer is initialized from scratch. The dimension of the joint embedding space is set to 768. In our denoising module, we use d = 128 for the output features of the contrastive projection head. Besides, we set t = 3 and p = 0.15 to augment each sample. For the other hyper-parameters, we use τ = 0.07, λ1 = 1 and λ2 = 1 by experience. During the training stage, we adopt AdamW [20] as our optimizer and set the batch size l to 32. The learning rate is 1e-4 for the multimodal transformer and 1e-5 for the pre-trained LayoutLM. The warm-up ratio and warm-up iterations are set to 0.2 and 1,000. In the pre-training stage, we pre-train the model for 24,000 iterations when only the TextVQA or ST-VQA dataset is used, and for 240,000 iterations when the TextCaps and OCR-CC datasets are included. In the fine-tuning stage, the model is further trained for another 30,000 iterations.

4.3 Experimental Results

Results on TextVQA. Following previous works [36], we evaluate our method under two different settings: the first is the constrained setting that uses only TextVQA for training and Rosetta [7] for OCR detection, and the second is the unconstrained setting. All results are presented in Table 1, whose top part reports the results under the constrained setting. Because TAP [36] uses 100 OCR tokens, LaTr [4] uses


Table 1. Performance comparison on the TextVQA dataset. The top part reports the results in the constrained setting that only uses TextVQA for training and Rosetta for OCR detection, while the bottom part displays the results in the unconstrained setting. We compare our method with several state-of-the-art methods, and our method clearly outperforms them.

Method            | OCR system    | Extra data               | Val acc. | Test acc.
M4C [12]          | Rosetta-en    | ✗                        | 39.40    | 39.01
SMA [9]           | Rosetta-en    | ✗                        | 40.05    | 40.66
CRN [19]          | Rosetta-en    | ✗                        | 40.39    | 40.96
LaAP-Net [11]     | Rosetta-en    | ✗                        | 40.68    | 40.54
BOV [37]          | Rosetta-en    | ✗                        | 40.90    | 41.23
TAP [36]          | Rosetta-en    | ✗                        | 44.06    | –
LaTr [4]          | Rosetta-en    | ✗                        | 44.06    | –
Ours (100 tokens) | Rosetta-en    | ✗                        | 44.20    | –
Ours              | Rosetta-en    | ✗                        | 44.75    | –
M4C [12]          | Rosetta-en    | ST-VQA                   | 40.55    | 40.46
LaAP-Net [11]     | Rosetta-en    | ST-VQA                   | 41.02    | 40.54
SA-M4C [14]       | Google-OCR    | ST-VQA                   | 45.40    | 44.6
SMA [9]           | SBD-Trans OCR | ST-VQA                   | –        | 45.51
BOV [37]          | SBD-Trans OCR | ST-VQA                   | 46.24    | 46.96
TAP [36]          | Microsoft-OCR | ✗                        | 49.91    | 49.71
TAP [36]          | Microsoft-OCR | ST-VQA                   | 50.57    | 50.71
LOGOS [21]        | Microsoft-OCR | ST-VQA                   | 51.53    | 51.08
TAP [36]          | Microsoft-OCR | ST-VQA, TextCaps, OCR-CC | 54.71    | 53.97
Ours              | Microsoft-OCR | ✗                        | 53.33    | 52.35
Ours              | Microsoft-OCR | ST-VQA                   | 54.33    | 54.47
Ours†             | Microsoft-OCR | ST-VQA, TextCaps, OCR-CC | 55.96    | 55.33

512 OCR tokens and previous works tend to use 50 OCR tokens, to make a fair comparison, we conduct experiments on different numbers of OCR tokens. As can be seen, when using only 100 OCR tokens, our method has already lifted the accuracy achieved by TAP and LaTr from 44.06% to 44.20%. When using 300 OCR tokens, our method achieves the state-of-the-art accuracy of 44.75%. The bottom part displays results in the unconstrained setting and we also list the OCR system and extra data used by different methods. Following TAP [36], we use Microsoft-OCR to detect scene texts in the images and gradually expand the training data. As we can see, (1) when switching to the Microsoft-OCR, without any extra data our method achieves 53.33% and 52.35% on the validation and test set, improving TAP by +3.42% and +2.64% respectively in the same setting. (2) When adding ST-VQA as another training dataset like previous works, our method improves the test accuracy from 51.08% of LOGOS [21] to 54.47% (+3.39%), which has already surpassed the final result of TAP after pre-trained on larger datasets (53.97%). Besides, unlike

Table 2. Performance comparison on the ST-VQA dataset.

Method        | Val acc. | Val ANLS | Test ANLS
M4C [12]      | 38.05    | 0.472    | 0.462
SA-M4C [14]   | 42.23    | 0.512    | 0.504
SMA [9]       | –        | –        | 0.466
CRN [19]      | –        | –        | 0.483
LaAP-Net [11] | 39.74    | 0.497    | 0.485
BOV [37]      | 40.18    | 0.500    | 0.472
LOGOS [21]    | 48.63    | 0.581    | 0.579
TAP [36]      | 50.83    | 0.598    | 0.597
Ours          | 50.49    | 0.598    | 0.587
Ours†         | 55.51    | 0.646    | 0.634


Table 3. Ablation study results on the TextVQA dataset. “TMFM” and “DM” refer to the two-stage multimodality fusion module and denoising module.
     Configuration   TMFM   DM   Pre-train   Val acc.
(1)  TextVQA         ✓      ✗    ✗           50.71
(2)  TextVQA         ✓      ✓    ✗           51.32
(3)  w/ ST-VQA       ✓      ✗    ✗           51.49
(4)  w/ ST-VQA       ✓      ✓    ✗           52.56
(5)  w/ MLM          ✓      ✗    ✓           52.82
(6)  w/ MLM, ITM     ✓      ✗    ✓           52.47
(7)  w/ MLM          ✓      ✓    ✓           53.33

these methods, our accuracy on the test set is higher than that on the validation set (54.33%), demonstrating that our method has better generalization ability. (3) After introducing image caption datasets during pre-training, our method achieves a final result of 55.33%, increasing the accuracy of TAP by +1.36%. Results on ST-VQA. Table 2 presents the results on the ST-VQA dataset in the unconstrained setting. The base model is pre-trained and fine-tuned on the training set of ST-VQA, and † refers to the large model that uses TextVQA, ST-VQA, TextCaps and OCR-CC in pre-training. Unlike the TextVQA dataset, in ST-VQA all the answers are texts in the images. Our method works better on ST-VQA as we focus more on scene texts. As can be seen, without any extra data, our method achieves an accuracy of 50.49% and an ANLS score of 0.598, which is nearly on par with the large-scale pre-trained TAP. When pre-trained on larger datasets, the final accuracy and ANLS score of our method reach 55.51% and 0.646, i.e., +4.68% and +0.048 higher than those of TAP.

4.4 Ablation Study

Here, we conduct an ablation study to demonstrate the effectiveness of each component in our method. All the experiments are done on the TextVQA dataset. Overall Results. As shown in Table 3, we first report the overall ablation results of different components in our method. With only our two-stage multimodality fusion module and without any pre-training, the accuracy reaches 50.71%, which outperforms TAP after being pre-trained (49.91%), while our denoising module further increases the performance to 51.32% (+0.61%). In particular, when introducing ST-VQA as an additional training dataset, the improvement gained by our denoising module increases to +1.07% (Row 3 and Row 4), showing that our denoising module works better with more data. One possible reason is that all the answers of the ST-VQA dataset are texts in the images without any external vocabulary, which is beneficial to the denoising module. Finally, our method achieves a competitive result of 52.56% without pre-training. On the other hand, Row 5 and Row 6 show the results of our model with pre-training on


Table 4. Ablation study on the two-stage multimodality fusion module.
Fusion module      X^{ocr,l}   X^{ocr,v}   X^{obj}   Val acc.
One-stage fusion   ✓           ✓           ✓         49.41
Two-stage fusion   ✗           ✗           ✗         45.12
Two-stage fusion   ✓           ✗           ✗         49.24
Two-stage fusion   ✓           ✓           ✗         50.50
Two-stage fusion   ✓           ✓           ✓         50.71

Table 5. Ablation study on augmentation times t and masking probability p of the denoising module.
        p = 0.15   p = 0.3   p = 0.5
t = 1   50.53      51.04     50.41
t = 3   51.32      51.16     50.36

the TextVQA dataset. As can be seen, the MLM task makes a positive impact, while introducing the ITM task causes a decrease from 52.82% to 52.47% (−0.35%). This is possibly because the semantic relationship between scene texts and the image is not so strong (unlike image captions, which are usually closely related to the image). At last, Row 7 shows the final result of our model trained on the TextVQA dataset. Our denoising module still works well after the model is pre-trained, increasing the accuracy from 52.82% to 53.33%. Effect of Two-Stage Multimodality Fusion Module. We conduct a further ablation study on our two-stage multimodality fusion module to verify its effectiveness and show the influence of each feature. First, we test the one-stage fusion paradigm. In this setting, features of different modalities are extracted separately and unimodal features are added together before fusion. Here, we use BERT [8] to extract the feature of the question following previous works, while keeping LayoutLM to extract the semantic features of OCR tokens. The semantic and visual features of OCR tokens are then added together before interacting with the other modalities. As shown in Table 4, the resulting accuracy is 49.41%, while our two-stage multimodality fusion module improves the accuracy to 50.71% (+1.30%), which verifies our expectation that jointly considering the question and scene texts is beneficial to the understanding of both modalities. Then, we examine the importance of each feature. As mentioned at the beginning, texts are the basis: after removing the object features and the layout/visual appearance features of scene texts in our two-stage fusion module, the model can still obtain an accuracy of 45.12%. Adding layout features in the first stage to help text understanding brings an increase of +4.12%, and introducing visual appearance features of scene texts on top of the former setting achieves an additional improvement of +1.26%, which shows the importance of both the layout and visual appearance of scene texts. Finally, we add object features as auxiliary information. The accuracy gap between our model with and without object features is only about 0.2%, which is consistent with our claim that global visual objects do not play the primary role in the TextVQA task. Effect of Denoising Module. As shown in Table 5, we conduct an ablation study on the hyper-parameters of our denoising module by setting different values for the augmentation times and masking probability. Augmenting only one time


Fig. 4. tSNE results of the output features of OCR tokens in a text-rich sample. (a) Without our denoising module, there are more noise points around the answer tokens. (b) With our denoising module, fewer unrelated tokens are around the correct answers.

makes no obvious improvement, which may be due to the strong randomness. Increasing the augmentation times is helpful, while increasing the masking probability leads to an accuracy decrease. Therefore, we choose to augment each sample 3 times with a masking probability of 15%. To show the superiority of our masking strategy, we further test some other strategies. Masking without excluding the correct answers obtains an accuracy of 50.74% (−0.58%), while deleting the whole features achieves 50.13% (−0.19%), which shows that our masking strategy is the best choice. As using only data augmentation can also boost performance, we conduct an experiment by removing our contrastive loss, and the accuracy decreases to 51.08%, which shows the effectiveness of contrastive learning. Besides, we compare the visualization results of models trained with and without our denoising module. Figure 4 shows the t-SNE [22] visualization of the output features of OCR tokens. As can be seen, when trained without our denoising module, there are more noise points around the correct answer. With our denoising module, since tokens in the correct answer are not masked and fewer tokens are considered each time, there are fewer noise points around the answer tokens. Moreover, our proposed denoising module also works well with previous methods. Here we conduct experiments with the common baseline model M4C [12] on the TextVQA dataset. As the input features of M4C are different from ours, we randomly mask the semantic features of FastText [6] and PHOC [1]. Our experimental results show that the denoising module lifts the accuracy of M4C from 45.55% to 46.24%, which verifies its effectiveness and flexibility.
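To make the denoising module more concrete, the following is a minimal sketch, not the released implementation: it assumes masking is realized by zeroing randomly chosen non-answer OCR-token features (t = 3 views, p = 0.15) and that the contrastive objective is an InfoNCE-style loss with temperature τ = 0.07; the helper names and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def mask_ocr_features(ocr_feats, answer_mask, t=3, p=0.15):
    """Create t augmented copies of OCR-token features (N, D): each copy
    randomly zeroes non-answer tokens with probability p; tokens belonging
    to the ground-truth answer (answer_mask == 1) are never masked."""
    views = []
    for _ in range(t):
        drop = (torch.rand(ocr_feats.size(0)) < p) & (answer_mask == 0)
        view = ocr_feats.clone()
        view[drop] = 0.0
        views.append(view)
    return views

def contrastive_loss(z_anchor, z_positive, z_negatives, tau=0.07):
    """InfoNCE-style loss: pull the projected feature of an augmented view
    towards its original sample and push it away from other samples."""
    z_anchor = F.normalize(z_anchor, dim=-1)        # (B, d)
    z_positive = F.normalize(z_positive, dim=-1)    # (B, d)
    z_negatives = F.normalize(z_negatives, dim=-1)  # (K, d)
    pos = (z_anchor * z_positive).sum(-1, keepdim=True) / tau   # (B, 1)
    neg = z_anchor @ z_negatives.t() / tau                      # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)      # positive is index 0
    return F.cross_entropy(logits, labels)
```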

5 Conclusion

In this paper, we observe that, compared to visual objects, the question and scene text modalities are more important in the TextVQA task, while both layout and visual appearance are useful. Based on this observation, we propose a two-stage multimodality fusion based method to boost TextVQA. Besides, in order to alleviate the redundancy and noise of recognized scene texts, we develop a denoising module that utilizes a contrastive loss to make the model focus on the


relevant texts. Extensive experiments are conducted on two benchmarks, which verify the effectiveness and superiority of the proposed method. Acknowledgments. This work was supported in part by a ByteDance Research Collaboration Project.

References 1. Almaz´ an, J., Gordo, A., Forn´es, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2552– 2566 (2014) 2. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018) 3. Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015) 4. Biten, A.F., Litman, R., Xie, Y., Appalaraju, S., Manmatha, R.: LATR: layoutaware transformer for scene-text VQA. arXiv preprint arXiv:2112.12494 (2021) 5. Biten, A.F., et al.: Scene text visual question answering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4291–4301 (2019) 6. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017) 7. Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71–79 (2018) 8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 9. Gao, C., et al.: Structured multimodal attentions for TextVQA. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 9603–9614 (2021) 10. Gao, D., Li, K., Wang, R., Shan, S., Chen, X.: Multi-modal graph neural network for joint reasoning on vision and scene text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12746–12756 (2020) 11. Han, W., Huang, H., Han, T.: Finding the evidence: localization-aware answer prediction for text visual question answering. arXiv preprint arXiv:2010.02582 (2020) 12. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992– 10002 (2020) 13. Jiang, Y., Natarajan, V., Chen, X., Rohrbach, M., Batra, D., Parikh, D.: Pythia v0. 1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956 (2018) 14. Kant, Y., et al.: Spatially aware multimodal transformers for TextVQA. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 715–732. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058545-7 41 15. Kim, W., Son, B., Kim, I.: VILT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)


16. Krasin, I., et al.: Openimages: a public dataset for large-scale multi-label and multiclass image classification, vol. 2, no. 3, p. 18 (2017). Dataset. https://github.com/ openimages 17. Lamb, A.M., Alias Parth Goyal, A.G., Zhang, Y., Zhang, S., Courville, A.C., Bengio, Y.: Professor forcing: a new algorithm for training recurrent networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 18. Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665–666 (2006) 19. Liu, F., Xu, G., Wu, Q., Du, Q., Jia, W., Tan, M.: Cascade reasoning network for text-based visual question answering. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4060–4069 (2020) 20. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 21. Lu, X., Fan, Z., Wang, Y., Oh, J., Ros´e, C.P.: Localize, group, and select: boosting text-VQA by scene text modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2631–2639 (2021) 22. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008) 23. Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 947–952. IEEE (2019) 24. Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.S., Wen, J.R.: Counterfactual VQA: a cause-effect look at language bias. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12700–12710 (2021) 25. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019) 26. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015) 27. Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 742–758. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5 44 28. Singh, A., et al.: Towards VQA models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317– 8326 (2019) 29. Singh, A., Pang, G., Toh, M., Huang, J., Galuba, W., Hassner, T.: TextOCR: towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8802–8812 (2021) 30. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (2019) 31. Wang, Q., Xiao, L., Lu, Y., Jin, Y., He, H.: Towards reasoning ability in scene text visual question answering. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2281–2289 (2021) 32. Wang, X., et al.: On the general value of evidence, and bilingual scene-text visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10126–10135 (2020)


33. Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) 34. Xu, Y., et al.: LayoutLMv2: multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740 (2020) 35. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1192–1200 (2020) 36. Yang, Z., et al.: Tap: text-aware pre-training for text-VQA and text-caption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8751–8761 (2021) 37. Zeng, G., Zhang, Y., Zhou, Y., Yang, X.: Beyond OCR+ VQA: involving OCR into the flow for robust and accurate TextVQA. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 376–385 (2021) 38. Zhang, X., Yang, Q.: Position-augmented transformers with entity-aligned mesh for TextVQA. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2519–2528 (2021) 39. Zhao, M., et al.: Towards video text visual question answering: benchmark and baseline. In: Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022) 40. Zhu, Q., Gao, C., Wang, P., Wu, Q.: Simple is not easy: a simple strong baseline for TextVQA and TextCaps. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3608–3615 (2021)

Bright as the Sun: In-depth Analysis of Imagination-Driven Image Captioning

Huyen Thi Thanh Tran1(B) and Takayuki Okatani1,2

1 RIKEN Center for AIP, Tokyo, Japan
{tran,okatani}@vision.is.tohoku.ac.jp
2 Graduate School of Information Sciences, Tohoku University, Sendai, Japan

Abstract. Existing studies on image captioning mainly focus on generating “literal” captions based on visual entities in images and their basic properties such as colors and spatial relationships. However, to describe images, humans use not only literal descriptions but also “imagination-driven” descriptions that characterize visual entities by some different entities; these are often more vivid, precise, and visually comprehensible to readers/hearers. Nonetheless, none of the existing studies seriously considers captions of this type. This study presents the first comprehensive analysis of the generation and evaluation of imagination-driven captions. Specifically, we first analyze imagination-driven captions in existing image captioning datasets. Then, we present comprehensive categorizations of imagination-driven captions and their usage, discussing the (potential) issues current image captioning models face in generating such captions. Next, compiling the captions extracted from the existing datasets and synthesizing fake captions, we create a dataset named IdC-I and -II. Using this dataset, we examine nine existing image captioning metrics in terms of how accurately they can evaluate imagination-driven caption generation. Last, we propose a baseline model for imagination-driven captioning. It has a built-in mechanism to select whether to generate a literal or an imagination-driven caption, which existing image captioning models cannot do. Experimental results demonstrate that our model performs better than six existing models, especially for imagination-driven caption generation. Dataset and code will be publicly available at: https://github.com/TranHuyen1191/Imagination-driven-Image-Captioning.

Keywords: Image captioning · Imagination-driven image captioning

1 Introduction

Image captioning is the task of automatically generating a description (also known as caption) in natural language for a given image. It is one of the fundamental tasks of computer vision, which has broad applicability in various areas


Fig. 1. Examples of image captioning. GT denotes human-generated descriptions: (a): “Literal” description; (b) (c) & (d): “Imagination-driven” descriptions. SAT denotes captions generated by Show-Attend-Tell [5] that is trained on the MS COCO dataset [6]. (a) & (b) are from MS COCO [6], (c) & (d) are from ArtEmis [7] and Flickr30K [8], respectively.

of biomedicine, commerce, and web searching [1]. For instance, image captioning can help visually-impaired people understand the content of images and, to some extent, form similar images in their minds. Existing studies on image captioning mainly focus on generating “literal” captions, which are based on visual entities in images and their basic properties (e.g., colors, spatial relationships) [2–4]. As an example in Fig. 1(a), the caption “a sign that is on the side of a road”, which is generated by an existing model of Show-Attend-Tell (SAT) [5], is based on three visual entities of sign, side, and road, and their relationships of on and of. To describe images, humans actually use not only literal descriptions but also “imagination-driven” descriptions. Unlike literal descriptions, which directly describe visual entities, imagination-driven descriptions characterize visual entities by other entities, which we call imaginary entities in this paper. Imaginary entities are usually not presented in images, but typically share one or several common properties with visual entities. Examples of imagination-driven descriptions are shown in Fig. 1(b), (c), and (d). When observing the image in Fig. 1(b), rather than only thinking about the cupcakes, a visual entity, the annotator associates them with sheep, an imaginary entity, since they are similar in shape and color. Similarly, in the example of Fig. 1(c), the bright colors in the image evoke the Sun in the annotator’s mind; the activity of the man in the image of Fig. 1(d) makes the annotator liken a tennis racket to a guitar. We tend to generate imagination-driven descriptions for images that are capable of spontaneously stimulating our imagination to produce similar images in our minds. It is hard, possibly impossible, for us to give concise and understandable literal descriptions of such images. This is because of the limited expression power of literal language; humans can perceive many more colors and shapes than concepts or words we have in language to describe them [9]. In addition, forcibly generated literal descriptions for such images are usually unnatural and complicated, making them visually incomprehensible for humans. For instance, it is quite challenging to create literal descriptions for the example images in Fig. 1(b) (c) and (d). Instead, the annotators provide imagination-driven descriptions, which are vivid and easy to be visualized by readers/hearers.


In this paper, we study the generation and evaluation of such imagination-driven captions, which have been overlooked in previous studies. Although a lot of effort has been made to provide various datasets [6,8,10–12] and methods [13–16] for image captioning, no prior study has seriously considered imagination-driven captions. There are three main challenges to enabling the generation of imagination-driven captions. The first is the lack of datasets fit for the study. The second is the absence of proper methods for generating such captions; the existing image captioning models are not designed to properly handle imagination-driven captions, as discussed later. The third is the lack of methods that can evaluate imagination-driven caption generation; existing metrics for image captioning do not perform well, as we will show later. Towards conquering the above challenges, in this paper, we first analyze four existing datasets, showing the statistics of imagination-driven captions in them. We then show a comprehensive categorization of imagination-driven captions and their usages. We also discuss the issues existing image captioning models face in generating such captions. By collecting imagination-driven captions from the four source datasets and synthesizing fake captions, we create a dataset named IdC-I and -II. Using this dataset, we assess the accuracy of nine state-of-the-art evaluation metrics. Experimental results show that UMIC [17] achieves high accuracy in evaluating imagination-driven captions of natural images, whereas for art images all the considered metrics are not very effective. Finally, we propose a novel image captioning method that can adequately handle imagination-driven caption generation. We design it to address a particular difficulty that existing models have: they do not distinguish between visual and imaginary entities, leading to the inability to generate good imagination-driven captions and to select the caption type appropriate for the input image. In the above example of Fig. 1(b), the sheep in the imagination-driven caption do not appear as real entities in the image. Thus, the models need to be able to differentiate between real and imaginary sheep. Moreover, they must adequately judge which type of caption to generate for a given image; literal captions are sufficient for some images, and imagination-driven ones are better or necessary for others. To address these, our model generates literal and imagination-driven captions separately and selects the one that best fits the input image. For the latter, we build a scorer that scores the quality of the generated imagination-driven caption based on the content of the image. The scorer is built on CLIP [18], a pre-trained vision-language model. Experimental results demonstrate the effectiveness of the proposed model over six existing methods developed for standard image captioning. In summary, the main contributions of this work are as follows:
– We analyze the existing image captioning datasets and present a comprehensive categorization of imagination-driven captions and their usages.
– We introduce a dataset for the generation and evaluation of imagination-driven image captioning.


– Using this dataset, we examine nine state-of-the-art metrics of image captioning to measure the accuracy of imagination-driven caption generation.
– We propose a new image captioning model and experimentally examine its effectiveness.

2 Related Work

Image Captioning Datasets. In the literature, a lot of efforts have been made to build image captioning datasets [6–8,10,11,19]. Some of them are domaingeneric datasets with images of various scenes and objects such as MS COCO [6], Flickr30K [8], and CC12M [19]. Because of the broad coverage of scenes and objects, these datasets are usually considered as standard benchmarks to build and evaluate image captioning methods [20]. Besides, there are some domain-specific datasets that are constructed for several specific tasks [1]. For instance, the CUB dataset consists of 117,880 captions of bird images [21]. Considering linguistic aspects, the authors in [22] focus on captions including negations. In [23], an attempt has been made to build a corpus of commonly used phrases that are repeated almost verbatim in captions of different images. To the best of our knowledge, there has been no previous research constructing datasets specific to imagination-driven caption generation. Image Captioning Models. Most existing image captioning models are based on an encoder-decoder paradigm. In this paradigm, an image encoder is used to project images to visual representations, which are then fed into a text generator to generate captions [1]. Many models use CNN as an image encoder [5,13– 15]. However, CNN usually results in loss of granularity [1]. To address this problem, an attention mechanism over visual regions is exploited [16,24–26]. Typically, Faster R-CNN [27] is used to extract bounding boxes of concrete objects, whose representations are then fed to a text generator to output captions. Recently, thanks to computational efficiency and scalability, transformer architectures based on self-attention mechanisms [28] are also adopted as image encoders [29,30]. Regarding text generators, LSTM [31] has become a predominant architecture for a long time due to its ability to learn dependencies in captions. However, it faces issues about training speed and the ability to learn long-term dependencies [20]. Recent studies employ transformer architectures [28] as an alternative, since it can better learn long-term dependencies [24,25,32]. Besides, to enrich generated captions, some studies adopt graphs to detect spatial relationships of visual entities [3,25]. Existing image captioning studies mainly focus on detecting visual entities and their spatial relationships. So far, there has been no previous study that aims at generating imagination-driven captions, where forming imaginary entities is also of crucial importance.
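As a concrete illustration of the encoder-decoder paradigm summarized above, the sketch below feeds pre-extracted region features into an autoregressive transformer decoder. It is a generic skeleton with assumed dimensions, not any of the cited models.

```python
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """Minimal encoder-decoder captioner: an image encoder produces region
    features and an autoregressive text decoder attends to them."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # map image features to decoder width
        self.embed = nn.Embedding(vocab_size, d_model)  # caption token embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, caption_tokens):
        # image_feats: (B, R, feat_dim) region features; caption_tokens: (B, L)
        memory = self.proj(image_feats)
        tgt = self.embed(caption_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                        # (B, L, vocab_size) next-token logits
```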

3 Analyzing Imagination-Driven Captions

We first analyze imagination-driven captions to answer the following three questions: 1) how frequently imagination-driven captions are used in existing image captioning datasets; 2) how many types exist; and 3) when humans (annotators) use such captions. For this purpose, we first examine imagination-driven captions in four image captioning datasets. We then classify imagination-driven captions in terms of their types and usages. We finally introduce datasets, named IdC-I and IdC-II, for the study of imagination-driven caption generation, which collect the captions extracted from the above datasets together with synthesized fake captions used to assess evaluation metrics of image captioning.

3.1 Analysis of Existing Datasets

To answer the above questions, we consider the following existing image captioning datasets: MS COCO [6], Flickr30K [8], VizWiz [11], and ArtEmis [7]. We adopt a filtering strategy to extract imagination-driven captions from them. MS COCO and Flickr30K are domain-generic datasets, while VizWiz and ArtEmis are domain-specific datasets. MS COCO contains 995,684 captions from 164,062 images, which were gathered by searching for 80 object categories and different scene types on Flickr [33]. Flickr30K consists of 158,920 captions from 31,784 images of everyday activities and scenes. VizWiz is constructed to study image captioning for people who are blind. It includes 195,905 captions from 39,181 images taken by blind photographers in their daily lives. ArtEmis, containing 454,684 captions from 80,031 art images, is created to investigate affective human experiences evoked by artworks. As in the examples in Fig. 1(b), (c) and (d), it can be seen that the annotators use the phrases look like, remind me, and as if to associate visual entities (i.e., cupcakes, colors, and tennis racket) with imaginary entities (i.e., sheep, Sun, and guitar). Taking advantage of this feature, a list of 34 keywords as given in Table 1 is exploited to automatically extract imagination-driven captions from the source datasets. A caption is considered imagination-driven if it includes at least one of these keywords. Table 2 shows the statistics of imagination-driven captions extracted by the above procedure; ratioImg (ratioCapt) is defined as the ratio of the extracted images (captions) over the total number of images (captions) in each source dataset. It can be seen that imagination-driven captions account for only a small portion of captions.
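The keyword-based filtering described above is simple to implement. The sketch below uses only a small illustrative subset of cue phrases explicitly mentioned in the text; the paper's full list contains 34 keywords (given in its Table 1), which is not reproduced here.

```python
# A small illustrative subset of cue phrases; the paper's Table 1 lists 34 keywords.
IMAGINATION_KEYWORDS = [
    "look like", "looks like", "looking like",
    "remind me", "reminds me",
    "as if", "as though",
]

def is_imagination_driven(caption: str) -> bool:
    """A caption is treated as imagination-driven if it contains
    at least one of the cue keywords (case-insensitive)."""
    text = caption.lower()
    return any(kw in text for kw in IMAGINATION_KEYWORDS)

def filter_captions(captions):
    """Split a list of captions into imagination-driven and literal ones."""
    imagination = [c for c in captions if is_imagination_driven(c)]
    literal = [c for c in captions if not is_imagination_driven(c)]
    return imagination, literal
```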

4 Experiments and Evaluations

4.1 Datasets and Evaluation

We conducted experiments on three public cross-modal retrieval datasets, including Wiki [14], MIRFlickr-25K [8] and NUS-WIDE [3], to verify the validity of our method. In our experiments, we adopt mAP@50 to evaluate the retrieval performance.
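For reference, mAP@50 for hashing-based retrieval can be computed by Hamming-ranking the database for each query and averaging the precision over the top 50 results. The sketch below is a generic implementation, not the authors' evaluation code; it assumes ±1 hash codes and multi-hot labels, with two items considered relevant when they share at least one label.

```python
import numpy as np

def map_at_k(query_codes, db_codes, query_labels, db_labels, k=50):
    """mAP@k for hashing retrieval: rank the database by Hamming distance to
    each query and average the precision over the top-k returned items.
    Codes are {-1, +1} matrices; labels are multi-hot matrices."""
    num_bits = query_codes.shape[1]
    ap_sum, num_valid = 0.0, 0
    for q_code, q_label in zip(query_codes, query_labels):
        # Hamming distance via inner product of ±1 codes.
        hamming = 0.5 * (num_bits - db_codes @ q_code)
        order = np.argsort(hamming)[:k]
        relevant = (db_labels[order] @ q_label) > 0      # shares at least one label
        if relevant.sum() == 0:
            continue
        ranks = np.arange(1, k + 1)
        precision_at_i = np.cumsum(relevant) / ranks
        ap_sum += (precision_at_i * relevant).sum() / relevant.sum()
        num_valid += 1
    return ap_sum / max(num_valid, 1)
```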

4.2 Implementation Details

In this section, we present the implementation details of our HILN in the experiments. We implement HILN in PyTorch [13] on a workstation configured with an NVIDIA RTX 3090 GPU. We set the dimension of the feature representation extracted by the transformer to 1024. After the heterogeneous feature interaction, the feature representation obtained by dimensionality reduction is also 1024-dimensional, and the dimension of the hash layer is set to be consistent with the length of the hash code. Moreover, we train the proposed HILN in a mini-batch way and set the batch size to 32. For the comparison methods, we use the optimal experimental parameter configurations provided by their authors for training. For fairness, all methods are evaluated on the same datasets. The weight decay rate is 0.0005 and the momentum is set to 0.9. When training on the Wiki dataset, the learning rate of the network is set to 0.01, with λ = 0.4, β = 0.4, ω = 0.2; when training on the MIRFlickr-25k and NUS-WIDE datasets, the learning rate is set to 0.001, with λ = 0.45, β = 0.45, ω = 0.1. The training epochs on the Wiki, MIRFlickr-25K, and NUS-WIDE datasets are set to 150, 100, and 80, respectively.

4.3 Performance

We compare our HILN with several representative deep unsupervised cross-modal hashing retrieval methods, including DBRC [6], UDCMH [18], DJSRH [16],


Table 1. The mAP@50 values of Wiki, MIRFlickr-25k, and NUS-WIDE at various code lengths. Bold data represent the best performance among all contrasting methods. '-' denotes results that are not available. Columns in each group are 16/32/64/128 bits.

I→T task
Method       Wiki (16/32/64/128)        MIRFlickr-25k (16/32/64/128)   NUS-WIDE (16/32/64/128)
DBRC [6]     0.253 0.265 0.269 0.288    0.617 0.619 0.620 0.621        0.424 0.459 0.447 0.447
UDCMH [18]   0.309 0.318 0.329 0.346    0.689 0.698 0.714 0.717        0.511 0.519 0.524 0.558
DJSRH [16]   0.388 0.403 0.412 0.421    0.810 0.843 0.862 0.876        0.724 0.773 0.798 0.817
JDSH [12]    0.346 0.431 0.433 0.442    0.832 0.853 0.882 0.892        0.736 0.793 0.832 0.835
DSAH [21]    0.416 0.430 0.438 0.445    0.863 0.877 0.895 0.903        0.775 0.805 0.818 0.827
AGSH [15]    0.397 0.434 0.446 -        0.679 0.691 0.698 -            0.543 0.552 0.562 -
KDCMH [10]   -     -     -     -        0.713 0.716 0.724 0.728        0.615 0.628 0.637 0.642
DGCPN [22]   0.420 0.438 0.440 0.448    0.875 0.891 0.908 0.918        0.788 0.820 0.826 0.833
AGCH [26]    0.408 0.425 0.433 0.450    0.865 0.887 0.892 0.912        0.809 0.830 0.831 0.852
OURS         0.446 0.453 0.467 0.472    0.878 0.908 0.932 0.944        0.793 0.830 0.862 0.879

T→I task
Method       Wiki (16/32/64/128)        MIRFlickr-25k (16/32/64/128)   NUS-WIDE (16/32/64/128)
DBRC [6]     0.573 0.588 0.598 0.599    0.618 0.626 0.626 0.628        0.455 0.459 0.468 0.473
UDCMH [18]   0.622 0.633 0.645 0.658    0.692 0.704 0.718 0.733        0.637 0.653 0.695 0.716
DJSRH [16]   0.611 0.635 0.646 0.658    0.786 0.822 0.835 0.847        0.712 0.744 0.771 0.789
JDSH [12]    0.630 0.631 0.647 0.651    0.825 0.864 0.878 0.880        0.721 0.785 0.794 0.804
DSAH [21]    0.644 0.650 0.660 0.662    0.846 0.860 0.881 0.882        0.770 0.790 0.804 0.815
AGSH [15]    0.431 0.443 0.453 -        0.674 0.689 0.693 -            0.543 0.567 0.570 -
KDCMH [10]   -     -     -     -        0.711 0.715 0.731 0.733        0.623 0.636 0.647 0.651
DGCPN [22]   0.644 0.651 0.660 0.662    0.859 0.876 0.890 0.905        0.783 0.802 0.812 0.817
AGCH [26]    0.627 0.640 0.648 0.658    0.829 0.849 0.852 0.880        0.769 0.780 0.798 0.802
OURS         0.643 0.651 0.660 0.661    0.877 0.908 0.932 0.944        0.793 0.831 0.862 0.879

JDSH [12], DSAH [21], AGSH [15], KDCMH [10], AGCH [26] and DGCPN [22]. Table 1 shows the mAP@50 values of HILN and the other comparison methods on MIRFlickr-25k, NUS-WIDE, and Wiki for two cross-modal retrieval tasks and four hash code lengths. Figure 4 shows the precision@top-K curves on the three datasets at 128 bits for five comparison methods. I→T means that images are used as queries to retrieve from the text database; T→I is the opposite. It can be seen that HILN is significantly better than the latest unsupervised cross-modal hashing methods. Specifically, compared to DGCPN [22] on Wiki, we achieve a boost of 5.3% in average mAP@50 over different hash code lengths in the I→T task. Moreover, HILN achieves boosts of 1.9% and 3.7% in average mAP@50 over different hash code lengths in the I→T and T→I tasks on MIRFlickr-25k, respectively, and boosts of 2.9% and 4.6% in the two retrieval tasks on NUS-WIDE. The main reason for the performance improvement is the HFI module proposed in HILN. It also keeps the performance of image-to-text and text-to-image retrieval essentially balanced.

4.4 Ablation Study

We design several variants to validate the impact of our proposed modules and to demonstrate the superiority of the original HILN.


Table 2. The mAP@50 results at 128 bits of ablation study about MSA on MIRFlickr-25k and NUS-WIDE. Bold data represent the best performance.
Model      Configuration       MIRFlickr I→T   MIRFlickr T→I   NUS-WIDE I→T   NUS-WIDE T→I
Baseline   -                   0.903           0.882           0.827          0.815
HILN-1     Baseline+IMSA       0.914           0.892           0.842          0.827
HILN-2     Baseline+TMSA       0.910           0.888           0.838          0.823
HILN-3     Baseline+I,TMSA     0.919           0.896           0.848          0.832

Table 3. The mAP@50 results at 128 bits of ablation study about HFI on MIRFlickr-25k and NUS-WIDE. Bold data represent the best performance of the HILN method.
Model   Configuration   MIRFlickr I→T   MIRFlickr T→I   NUS-WIDE I→T   NUS-WIDE T→I
HFI-1   -               0.919           0.896           0.848          0.832
HFI-2   HFI-1+ATT       0.939           0.939           0.871          0.871
HFI-3   HFI-1+Ladv      0.931           0.928           0.862          0.853
HFI-4   HFI-2+Ladv      0.944           0.944           0.879          0.879

The Effectiveness of Multi-head Self-attention. As shown in Table 2, we designed several variants to verify the effectiveness of the multi-head self-attention (MSA) module. HILN-1 applies the multi-head self-attention mechanism only to the image modality to extract global semantic information, HILN-2 applies it only to the text modality, and HILN-3 applies it to both images and texts. From the results of Baseline, HILN-1, HILN-2, and HILN-3, we can see the effectiveness of MSA. HILN-1 improves mAP@50 by 1.1% and 1.5% over the Baseline for the T→I task on MIRFlickr and NUS-WIDE, respectively, and by 1.2% and 1.8% for the I→T task on the two datasets; this improvement comes from extracting the global semantic information of the image with MSA. HILN-2 improves mAP@50 by 0.7% and 1% over the Baseline for the T→I task on MIRFlickr and NUS-WIDE, respectively, and by 0.7% and 1.3% for the I→T task; this improvement comes from extracting the global semantic information of the text with MSA. From the mAP@50 results of HILN-1, HILN-2, and HILN-3, we find that MSA can effectively capture the global dependencies of semantic features within each modality for modeling semantic relations among object entities. These results suggest that MSA is more effective for the image modality and performs best when applied to both modalities simultaneously.
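The IMSA/TMSA modules are not spelled out in this excerpt, but the idea of capturing global intra-modality dependencies with multi-head self-attention can be sketched as follows; the dimensions and the residual/LayerNorm wrapping are assumptions, not the exact HILN design.

```python
import torch
import torch.nn as nn

class IntraModalSelfAttention(nn.Module):
    """Multi-head self-attention over a sequence of intra-modal features
    (e.g., image region features or word features), so that every token can
    attend to every other token and aggregate global semantic context."""
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, N, dim) — N region or word features of one modality
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)   # residual connection keeps local cues

# Example: global image features from 49 region tokens of dimension 1024.
image_tokens = torch.randn(4, 49, 1024)
global_image = IntraModalSelfAttention()(image_tokens)   # (4, 49, 1024)
```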


The Effectiveness of Heterogeneous Feature Interaction and Adversarial Loss. Meanwhile, as shown in Table 3, we designed several variants to verify the effectiveness of the heterogeneous feature interaction and the adversarial loss. HFI-1 uses only the MSA mechanism to capture the global semantic information of the image and text modalities. HFI-2 builds on HFI-1 by adding only the heterogeneous feature interaction module. HFI-3 adds only the adversarial loss to HFI-1. HFI-4 builds on HFI-2 by adding the adversarial loss.

Table 4. The mAP@50 results at 128 bits of the ablation study on the three types of attention on MIRFlickr-25k and NUS-WIDE.
Model   Configuration                               MIRFlickr-25k I→T   MIRFlickr-25k T→I   NUS-WIDE I→T   NUS-WIDE T→I
HFI-a   MSA+Ladv                                    0.919               0.896               0.848          0.832
HFI-b   Baseline+Spatial attention                  0.927               0.907               0.861          0.849
HFI-c   Baseline+Spatial+Channel attention          0.935               0.918               0.869          0.856
HFI-d   Baseline+Spatial+Channel+Cross attention    0.944               0.944               0.879          0.879

HFI-2 improves mAP@50 by 4.8% and 4.7% over HFI-1 for the T→I task on MIRFlickr and NUS-WIDE, respectively, and by 2.2% and 2.7% for the I→T task on the two datasets. The performance improvement of HFI-2 is attributed to the attention in the heterogeneous feature interaction module, which effectively aligns the global semantic similarity relations between different modalities so that the global semantic relations across modalities stay consistent. HFI-3 improves mAP@50 by 3.2% and 2.5% over HFI-1 for the T→I task on MIRFlickr and NUS-WIDE, respectively, and by 1.3% and 1.7% for the I→T task. Moreover, from the mAP@50 results of HFI-2 and HFI-4, the further improvement of HFI-4 over HFI-2 comes mainly from the adversarial loss, which shows that the adversarial loss effectively maintains semantic consistency. Comparing HFI-2 and HFI-3, HFI-2 makes the performance of the I→T and T→I tasks comparable, while HFI-3 is stronger for the I→T task than for the T→I task; HFI-2 also performs better than HFI-3 overall. From the above comparative analysis, using either the heterogeneous feature interaction or the adversarial loss alone is not as effective as using both simultaneously. Therefore, the adversarial loss we introduce is effective for cross-modal hashing. The Effectiveness of Spatial Attention, Channel Attention, and Cross Attention. As shown in Table 4, we designed several variants to verify the effectiveness of spatial attention, channel attention, and cross attention. HFI-a uses MSA and the adversarial loss. HFI-b builds on HFI-a by adding


Fig. 4. Precision@top-K curves on three datasets at 128 bits.

only spatial attention. HFI-c builds on HFI-b by adding only the channel attention. HFI-d builds on HFI-c by adding only the cross attention. From the results of HFI-a, HFI-b, HFI-c, and HFI-d, we can see the effectiveness of spatial attention, channel attention, and cross attention. In particular, when the three kinds of attention are used simultaneously, the retrieval performance is the best.
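The adversarial loss Ladv is not spelled out in this excerpt, so the following is only a hedged sketch of one common formulation consistent with the discussion above: a small discriminator tries to tell image representations from text representations, while the encoders are trained to make the two modalities indistinguishable, encouraging semantic consistency. All names and the exact loss form are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDiscriminator(nn.Module):
    """Predicts whether a representation comes from the image or text branch."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)   # logits, shape (B,)

def adversarial_losses(disc, img_repr, txt_repr):
    ones = torch.ones(img_repr.size(0))
    zeros = torch.zeros(txt_repr.size(0))
    half_i = torch.full((img_repr.size(0),), 0.5)
    half_t = torch.full((txt_repr.size(0),), 0.5)
    # Discriminator: separate image representations (label 1) from text ones (label 0).
    d_loss = (F.binary_cross_entropy_with_logits(disc(img_repr.detach()), ones)
              + F.binary_cross_entropy_with_logits(disc(txt_repr.detach()), zeros))
    # Encoders: make the two modalities indistinguishable (soft target 0.5),
    # which pushes their representation distributions to stay consistent.
    g_loss = (F.binary_cross_entropy_with_logits(disc(img_repr), half_i)
              + F.binary_cross_entropy_with_logits(disc(txt_repr), half_t))
    return d_loss, g_loss
```

In practice d_loss would update only the discriminator and g_loss only the encoders, alternating the two updates as in standard adversarial training.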

5 Conclusion

In this paper, we propose a novel unsupervised cross-modal hashing method called Heterogeneous Interactive Learning Network (HILN) for unsupervised cross-modal retrieval. To bridge the heterogeneous semantic gap between different modalities, we introduce a multi-head self-attention mechanism to capture the global dependencies of semantic features within the modality for modeling semantic relations among object entities. Meanwhile, we exploit a heterogeneous feature interaction module for feature fusion to align the semantic relationships between different modalities. Moreover, we introduce adversarial loss into network learning to further maintain semantic consistency. Extensive experiments have shown the effectiveness of our method. Acknowledgments. This work was supported in part by the National Natural Science Foundation of China (Grant No. 61902204), in part by the Natural Science Foundation of Shandong Province of China (Grant No. ZR2019BF028).


References 1. Bai, C., Zeng, C., Ma, Q., Zhang, J., Chen, S.: Deep adversarial discrete hashing for cross-modal retrieval. In: Proceedings of the 2020 International Conference on Multimedia Retrieval, pp. 525–531 (2020) 2. Chen, S., Wu, S., Wang, L., Yu, Z.: Self-attention and adversary learning deep hashing network for cross-modal retrieval. Comput. Electr. Eng. 93, 107262 (2021) 3. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 1–9 (2009) 4. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 5. Gu, W., Gu, X., Gu, J., Li, B., Xiong, Z., Wang, W.: Adversary guided asymmetric hashing for cross-modal retrieval. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 159–167 (2019) 6. Hu, D., Nie, F., Li, X.: Deep binary reconstruction for cross-modal hashing. IEEE Trans. Multimed. 21(4), 973–985 (2018) 7. Hu, H., Xie, L., Hong, R., Tian, Q.: Creating something from nothing: unsupervised knowledge distillation for cross-modal hashing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3123–3132 (2020) 8. Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pp. 39–43 (2008) 9. Jiang, Q.Y., Li, W.J.: Deep cross-modal hashing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3232–3240 (2017) 10. Li, M., Wang, H.: Unsupervised deep cross-modal hashing by knowledge distillation for large-scale cross-modal retrieval. In: Proceedings of the 2021 International Conference on Multimedia Retrieval, pp. 183–191 (2021) 11. Lin, Z., et al.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017) 12. Liu, S., Qian, S., Guan, Y., Zhan, J., Ying, L.: Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1379–1388 (2020) 13. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 14. Pereira, J.C., et al.: On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 36(3), 521–535 (2013) 15. Shen, X., Zhang, H., Li, L., Liu, L.: Attention-guided semantic hashing for unsupervised cross-modal retrieval. In: 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021) 16. Su, S., Zhong, Z., Zhang, C.: Deep joint-semantics reconstructing hashing for largescale unsupervised cross-modal retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3027–3035 (2019) 17. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017) 18. Wu, G., et al.: Unsupervised deep hashing via binary latent factor models for large-scale cross-modal retrieval. In: IJCAI, pp. 2854–2860 (2018) 19. Yan, C., Bai, X., Wang, S., Zhou, J., Hancock, E.R.: Cross-modal hashing with semantic deep embedding. Neurocomputing 337, 58–66 (2019)


20. Yang, B., Wang, L., Wong, D.F., Shi, S., Tu, Z.: Context-aware self-attention networks for natural language processing. Neurocomputing 458, 157–169 (2021) 21. Yang, D., Wu, D., Zhang, W., Zhang, H., Li, B., Wang, W.: Deep semanticalignment hashing for unsupervised cross-modal retrieval. In: Proceedings of the 2020 International Conference on Multimedia Retrieval, pp. 44–52 (2020) 22. Yu, J., Zhou, H., Zhan, Y., Tao, D.: Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4626–4634. AAAI (2021) 23. Yu, Y., Xiong, Y., Huang, W., Scott, M.R.: Deformable Siamese attention networks for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6728–6737 (2020) 24. Zhang, J., Peng, Y.: Multi-pathway generative adversarial hashing for unsupervised cross-modal retrieval. IEEE Trans. Multimed. 22(1), 174–187 (2019) 25. Zhang, J., Peng, Y., Yuan, M.: SCH-GAN: semi-supervised cross-modal hashing by generative adversarial network. IEEE Trans. Cybern. 50(2), 489–502 (2018) 26. Zhang, P.F., Li, Y., Huang, Z., Xu, X.S.: Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval. IEEE Trans. Multimed. 24, 466– 479 (2021) 27. Zhu, L., Tian, G., Wang, B., Wang, W., Zhang, D., Li, C.: Multi-attention based semantic deep hashing for cross-modal retrieval. Appl. Intell. 51, 1–13 (2021)

Biometrics

GaitStrip: Gait Recognition via Effective Strip-Based Feature Representations and Multi-level Framework

Ming Wang1, Beibei Lin2, Xianda Guo3, Lincheng Li4, Zheng Zhu3, Jiande Sun5, Shunli Zhang1(B), Yu Liu1, and Xin Yu6

1 Beijing Jiaotong University, Beijing, China
[email protected]
2 National University of Singapore, Singapore, Singapore
3 PhiGent Robotics, Beijing, China
4 NetEase Fuxi AI Lab, Beijing, China
5 Shandong Normal University, Jinan, China
6 University of Technology Sydney, Ultimo, Australia

Abstract. Many gait recognition methods first partition the human gait into N-parts and then combine them to establish part-based feature representations. Their gait recognition performance is often affected by partitioning strategies, which are empirically chosen in different datasets. However, we observe that strips as the basic component of parts are agnostic against different partitioning strategies. Motivated by this observation, we present a strip-based multi-level gait recognition network, named GaitStrip, to extract comprehensive gait information at different levels. To be specific, our high-level branch explores the context of gait sequences and our low-level one focuses on detailed posture changes. We introduce a novel StriP-Based feature extractor (SPB) to learn the strip-based feature representations by directly taking each strip of the human body as the basic unit. Moreover, we propose a novel multi-branch structure, called Enhanced Convolution Module (ECM), to extract different representations of gaits. ECM consists of the SpatialTemporal feature extractor (ST), the Frame-Level feature extractor (FL) and SPB, and has two obvious advantages: First, each branch focuses on a specific representation, which can be used to improve the robustness of the network. Specifically, ST aims to extract spatial-temporal features of gait sequences, while FL is used to generate the feature representation of each frame. Second, the parameters of the ECM can be reduced in test by introducing a structural re-parameterization technique. Extensive experimental results demonstrate that our GaitStrip achieves state-of-the-art performance in both normal walking and complex conditions. The source code is published at https://github.com/M-Candy77/GaitStrip.

1 Introduction

Gait recognition is one of the most popular biometric techniques. Since it can be used in a long-distance condition and cannot be disguised, gait recognition


Fig. 1. Visualization of feature extractors of different methods.

is widely applied in video surveillance and access control systems. However, this technology faces great challenges due to the complexity of the external environment, such as cross-view conditions, speed changes, bad weather and variations in appearance [4,13,14,35,37]. Recently, many Convolutional Neural Network (CNN) based gait recognition frameworks have been proposed to generate discriminative feature representations [1,2,7,15–19,21,22,26–28,33,36,40,44,45]. As shown in Fig. 1(a), some researchers extract gait features directly from the whole gait sequence, which captures global context information of gait sequences [2,27,40]. As those methods take the human gait as a unit to extract features, some local gait changes that are important for gait recognition might not be fully captured, which may affect the recognition performance. On the other hand, some other researchers [7,44] propose part-based feature representations of the human gait, as shown in Fig. 1(b). They first partition the human gait into N parts and then extract the detailed information of each part. Although carefully choosing the number of partitions in different convolutional layers can achieve appealing performance, it is unclear how to build an accurate part-based model on new datasets, which limits the generalization of these methods. According to these findings, we argue that the part-based feature representation is not a general feature representation for gait recognition. Hence, we ask whether there is a gait descriptor that is insensitive to the partitioning strategy. Through careful analysis of recent part-based methods, we find that strips, rather than parts, are the minimal effective representation elements for gaits. Using strips, we can circumvent the handcrafted partitioning in part-based methods. As shown in Fig. 1(c), the strip can be considered an extreme form of the part-based representation, so it is not necessary to manually determine a reasonable number of parts. Motivated by this observation, we propose a new gait recognition network, called GaitStrip, to learn more discriminative feature representations based on strips. Specifically, GaitStrip is implemented under a multi-level framework to improve the representation capability. The multi-level framework includes two branches, i.e., the low-level branch and the high-level one. In particular, the high-level branch extracts the global context information


from low-resolution gait images, while the low-level one captures more details from high-resolution images. Furthermore, we introduce the Enhanced Convolution Module (ECM), a multi-branch block, into our GaitStrip. ECM includes three branches, i.e., the StriP-Based feature extractor (SPB), the Spatial-Temporal feature extractor (ST) and the Frame-Level feature extractor (FL), where each branch corresponds to a specific representation. Specifically, SPB is designed to generate strip-based feature representations by taking each strip of the human body as a basic unit, ST aims to extract spatial-temporal information of a gait sequence, and FL is used to extract each frame's spatial features. On the other hand, we introduce a structural re-parameterization technique to reduce the parameters of the ECM module at test time [6]. Specifically, the parameters of SPB, ST and FL can be merged into a single 3 × 3 × 3 convolution. After feature extraction, we obtain an effective feature representation by using temporal aggregation and spatial mapping operations. The temporal aggregation integrates the temporal information of a variable-length gait sequence [20]. The spatial mapping first partitions the feature maps into multiple horizontal vectors and aggregates each vector by Generalized-Mean (GeM) pooling operations [25] for better representation. Extensive experiments on widely-used gait recognition benchmarks demonstrate that our GaitStrip outperforms state-of-the-art methods significantly. The main contributions of the proposed method are three-fold, as follows:
– Based on the observation that the strip-based method can achieve more effective gait representations than part-based partitioning, we propose a multi-level gait recognition framework based on strips to extract more comprehensive gait features, in which the high-level representation contains the context information while the low-level representation extracts local details of gait sequences.
– We develop an effective enhanced convolution module including three branches, which not only takes advantage of both frame-level and spatial-temporal features but also uses SPB to enhance the representation ability. Furthermore, we use the structural re-parameterization technique to reduce the parameters for high efficiency at test time.
– We compare the proposed method with several state-of-the-art methods on two public datasets, CASIA-B and OUMVLP. The experimental results demonstrate that the proposed method achieves superior performance to these approaches.

2 Related Work

2.1 Gait Recognition

Existing gait recognition methods can be divided into two types, i.e., global-based and local-based.


The global-based methods usually take the human gait as a whole to generate global feature representations [27,34,40]. For instance, Shiraga et al. [27] first calculate the Gait Energy Image (GEI) by using the mean function to compress the temporal information of gait sequences, and then utilize 2D CNNs to extract gait features. However, the generation of the GEI causes a loss of temporal information, which may degrade the representation ability. Thus, some other researchers [2,3,10,43] use 2D CNNs to extract each frame's feature before building the template. On the other hand, some researchers [20,30,32] extract spatial-temporal information from gait sequences for representation. Recently, 3D CNNs have been used in gait recognition to learn the spatial-temporal representation of the entire gait sequence. For example, Lin et al. [20] use 3D CNNs to extract spatial-temporal information, and employ temporal aggregation to integrate temporal information, addressing the variable-length issue of video sequences. The local-based methods usually take parts of the human gait as input to establish part-based feature representations [7,44]. For example, Fan et al. [7] propose a focal convolution layer to extract part-based gait features. The focal convolution layer first splits the feature maps into several local parts and then uses a shared convolution to extract each part's feature. Zhang et al. [44] first partition the human gait into four parts and then use a 2D CNN to obtain feature representations of each part. However, these local-based methods need to predefine the number of partitions for specific datasets, which limits their generalization ability.
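For clarity, the GEI mentioned above is simply the temporal mean of aligned binary silhouettes; a minimal sketch:

```python
import numpy as np

def gait_energy_image(silhouettes):
    """silhouettes: (T, H, W) array of aligned binary silhouettes in {0, 1}.
    The GEI is their temporal mean, which compresses the whole sequence
    into a single grayscale template (and discards temporal ordering)."""
    silhouettes = np.asarray(silhouettes, dtype=np.float32)
    return silhouettes.mean(axis=0)        # (H, W), values in [0, 1]
```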

2.2 Strip-Based Modeling

Recently, many strip-based modeling methods have been proposed in computer vision. For example, Ding et al. [5] propose a novel block, called the Asymmetric Convolution Block (ACB), to generate discriminative feature representations. They use 1D asymmetric forms (e.g., 3 × 1 Conv and 1 × 3 Conv) to improve the feature representation ability of the standard square-kernel convolution (3 × 3 Conv). Note that the asymmetric convolutions can exploit the information of the horizontal and vertical strips, and in particular they can be fused into the original square-kernel convolution. Huang et al. [12] propose the CCNet network to capture global contextual information. CCNet, which is built with Criss-Cross Attention blocks, models the relationships of horizontal and vertical strips. However, the aforementioned methods only focus on spatial strip-based information and do not capture the temporal changes of each strip. Therefore, in this paper, we propose a novel strip-based feature extractor, which can be used to establish each strip's spatial-temporal information. In particular, as far as we know, GaitStrip is the first network that models strip-based feature representations in gait recognition.


Fig. 2. Overview of the entire gait recognition framework.

3 Proposed Method

In this section, we first give an overview of the whole gait recognition framework. Then, we describe the enhanced convolution module, the multi-level structure and the feature mapping in detail. Finally, we introduce the training and test strategies.

3.1 Overview

The proposed gait recognition framework, GaitStrip, which includes a feature extraction stage and a feature mapping stage, is shown in Fig. 2. GaitStrip is built on 3D convolutions, which can effectively extract spatial-temporal information from gait sequences. In the feature extraction stage, we propose a novel enhanced convolution module that uses both a frame-level feature extractor and a strip-based feature extractor to improve the representation ability of the traditional spatial-temporal feature extractor. We further design a multi-level framework which includes both high-level and low-level branches. In the feature mapping stage, the temporal aggregation operation is introduced to integrate the temporal information of the feature maps [20]. Then, the feature maps are partitioned into multiple horizontal vectors and the information is aggregated by Generalized-Mean (GeM) pooling [25]. Finally, a combined loss function consisting of both cross-entropy loss and triplet loss is employed to train the proposed network.

3.2 Enhanced Convolution Module

Recently, many excellent feature extractors have been proposed to extract robust gait features, which can be divided into two types. One is the frame-level feature extractor, which extracts gait features of each frame [2,3,7], and the other is the spatial-temporal feature extractor, which generates spatial-temporal feature representations of a gait sequence [20,30,32].

Assume that the feature map $X_{in} \in \mathbb{R}^{C_{in} \times T_{in} \times H_{in} \times W_{in}}$ is the input of a convolution operation, where $C_{in}$ is the number of channels, $T_{in}$ is the length of the gait sequence and $(H_{in}, W_{in})$ is the image size of each frame. The frame-level and spatial-temporal feature extractors can be designed as

$$X_{FL} = c^{1\times 3\times 3}(X_{in}), \quad (1)$$
$$X_{ST} = c^{3\times 3\times 3}(X_{in}), \quad (2)$$

where $c^{a\times b\times c}(\cdot)$ represents the 3D convolution with kernel size $(a, b, c)$. $X_{FL} \in \mathbb{R}^{C_{out} \times T_{in} \times H_{in} \times W_{in}}$ and $X_{ST} \in \mathbb{R}^{C_{out} \times T_{in} \times H_{in} \times W_{in}}$ are the outputs of the frame-level and spatial-temporal feature extractors, respectively.

The frame-level features ignore the temporal information of the gait sequence, while the spatial-temporal features focus on spatial-temporal changes and may not pay enough attention to the detailed information of each frame. Thus, we propose a combined framework which takes advantage of frame-level and spatial-temporal information as our backbone. The combined structure includes two branches, i.e., the spatial-temporal feature extractor and the frame-level feature extractor, and can be designed as

$$X_{STFL} = X_{FL} + X_{ST}. \quad (3)$$

To further improve the global representation and address the inflexibility issue of the part-based representation, we present a StriP-Based feature extractor (SPB) which extracts strip-based features along the horizontal and vertical axes, respectively. The strip-based feature extractor on the horizontal axis first splits the human body into multiple horizontal strips and then applies a convolution to extract the spatial-temporal information of each horizontal strip. This extractor can be denoted as

$$X_{SPB\text{-}H} = c^{3\times 1\times 3}(X_{in}), \quad (4)$$

where $c^{3\times 1\times 3}(\cdot)$ denotes the 3D convolution with kernel size $(3, 1, 3)$ and $X_{SPB\text{-}H} \in \mathbb{R}^{C_{out} \times T_{in} \times H_{in} \times W_{in}}$ is the output of this extractor. Similarly, a strip-based feature extractor is used for the vertical strips' spatial-temporal extraction, represented as

$$X_{SPB\text{-}V} = c^{3\times 3\times 1}(X_{in}), \quad (5)$$

where $c^{3\times 3\times 1}(\cdot)$ denotes the 3D convolution with kernel size $(3, 3, 1)$ and $X_{SPB\text{-}V} \in \mathbb{R}^{C_{out} \times T_{in} \times H_{in} \times W_{in}}$ is the output of this extractor. Finally, by combining the horizontal and vertical feature extractors, the strip-based feature extractor obtains the following feature maps:

$$X_{SPB} = X_{SPB\text{-}H} + X_{SPB\text{-}V}. \quad (6)$$

The proposed SPB can be used to enhance the feature representation ability of the traditional feature extractors. By combining SPB with the aforementioned spatial-temporal and frame-level feature extractors, as shown in Fig. 3, the ECM module can be obtained as

$$X_{ECM} = X_{ST} + X_{FL} + X_{SPB}. \quad (7)$$

Thus, more comprehensive feature representations can be achieved.
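For concreteness, the following is a minimal PyTorch sketch of a training-time ECM block as described by Eqs. 6-7: four parallel 3D convolutions whose outputs are summed. The class and variable names are ours, not from the authors' released code, and the channel sizes in the example are illustrative only.

```python
# Minimal sketch of a training-time ECM block (Eq. 7); names are ours.
# Input and output feature maps have shape (batch, channels, T, H, W).
import torch
import torch.nn as nn

class ECM(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # ST: 3x3x3 spatial-temporal, FL: 1x3x3 frame-level,
        # SPB-H / SPB-V: 3x1x3 and 3x3x1 strip-based branches.
        self.st = nn.Conv3d(c_in, c_out, (3, 3, 3), padding=(1, 1, 1), bias=False)
        self.fl = nn.Conv3d(c_in, c_out, (1, 3, 3), padding=(0, 1, 1), bias=False)
        self.spb_h = nn.Conv3d(c_in, c_out, (3, 1, 3), padding=(1, 0, 1), bias=False)
        self.spb_v = nn.Conv3d(c_in, c_out, (3, 3, 1), padding=(1, 1, 0), bias=False)

    def forward(self, x):
        # X_ECM = X_ST + X_FL + X_SPB, with X_SPB = X_SPB-H + X_SPB-V (Eqs. 6-7).
        return self.st(x) + self.fl(x) + self.spb_h(x) + self.spb_v(x)

if __name__ == "__main__":
    ecm = ECM(32, 64)
    silhouettes = torch.randn(2, 32, 30, 16, 11)   # (batch, channels, frames, H, W)
    print(ecm(silhouettes).shape)                   # torch.Size([2, 64, 30, 16, 11])
```

All branches preserve the spatial-temporal resolution, so the four outputs can be summed element-wise exactly as in Eq. 7.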


Fig. 3. Overview of the enhanced convolution module. ST represents the Spatial-Temporal feature extractor, FL represents the Frame-Level feature extractor, SPB-V represents the StriP-Based feature extractor in the vertical direction and SPB-H represents the StriP-Based feature extractor in the horizontal direction.

3.3 Structural Re-parameterization

To reduce the parameters of the proposed ECM, we introduce the structural re-parameterization method [6] to ensemble the different convolutions during the test stage. As shown in Eq. 7, the ECM block includes four convolutions: $c^{3\times 3\times 3}(\cdot)$, $c^{1\times 3\times 3}(\cdot)$, $c^{3\times 1\times 3}(\cdot)$ and $c^{3\times 3\times 1}(\cdot)$. During the test stage, these convolutions can be integrated into a single 3D convolution $c^{3\times 3\times 3}_{emb}(\cdot)$, which can be designed as

$$c^{3\times 3\times 3}_{emb} = c^{3\times 3\times 3} + c^{3\times 3\times 3}_{t} + c^{3\times 3\times 3}_{h} + c^{3\times 3\times 3}_{w}, \quad (8)$$

where $c^{3\times 3\times 3}_{t}$, $c^{3\times 3\times 3}_{h}$ and $c^{3\times 3\times 3}_{w}$ are zero-padding expansions of $c^{1\times 3\times 3}$, $c^{3\times 1\times 3}$ and $c^{3\times 3\times 1}$, respectively, so that all kernels have the same dimensions. According to Eq. 8, the ECM in the test stage can be designed as

$$X_{ECM} = c^{3\times 3\times 3}_{emb}(X_{in}). \quad (9)$$

Note that although four convolutions are employed to improve the representation ability in the training stage, only a single convolution is required in the test stage, which neither increases the number of parameters nor degrades the inference efficiency.
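The sketch below illustrates one way the kernel merging of Eq. 8 can be implemented in PyTorch, assuming the bias-free branch layout sketched above: each smaller kernel is zero-padded to 3 × 3 × 3 and the weights are summed. Function and variable names are ours.

```python
# Sketch of folding the four ECM branches into one 3x3x3 convolution (Eq. 8).
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def merge_into_3x3x3(st, fl, spb_h, spb_v):
    """Fold 3x3x3, 1x3x3, 3x1x3 and 3x3x1 conv branches into a single 3x3x3 conv."""
    merged = nn.Conv3d(st.in_channels, st.out_channels, (3, 3, 3), padding=1, bias=False)
    w = st.weight.clone()
    # F.pad pads the last three weight dims in (W, H, T) order.
    w += F.pad(fl.weight,    (0, 0, 0, 0, 1, 1))   # 1x3x3 -> 3x3x3 (pad the temporal axis)
    w += F.pad(spb_h.weight, (0, 0, 1, 1, 0, 0))   # 3x1x3 -> 3x3x3 (pad the height axis)
    w += F.pad(spb_v.weight, (1, 1, 0, 0, 0, 0))   # 3x3x1 -> 3x3x3 (pad the width axis)
    merged.weight.copy_(w)
    return merged

# Quick equivalence check on random weights and a random input.
st    = nn.Conv3d(8, 16, (3, 3, 3), padding=(1, 1, 1), bias=False)
fl    = nn.Conv3d(8, 16, (1, 3, 3), padding=(0, 1, 1), bias=False)
spb_h = nn.Conv3d(8, 16, (3, 1, 3), padding=(1, 0, 1), bias=False)
spb_v = nn.Conv3d(8, 16, (3, 3, 1), padding=(1, 1, 0), bias=False)
conv = merge_into_3x3x3(st, fl, spb_h, spb_v)
x = torch.randn(1, 8, 12, 16, 11)
print(torch.allclose(st(x) + fl(x) + spb_h(x) + spb_v(x), conv(x), atol=1e-5))  # True
```

Because convolution is linear in its weights, the merged kernel reproduces the sum of the four branch outputs up to floating-point error, which is why the test-time network needs only one convolution per ECM block.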

3.4 Multi-level Framework

To further improve the representation ability, we design a multi-level framework based on the proposed ECM block for both high-level and low-level feature extraction. The low-level branch directly extracts features from the large-size feature maps and focuses on the details of the human body. By contrast, the high-level branch, which works on feature maps down-sampled by max pooling, extracts more abstract information.
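A rough sketch of this two-level arrangement is given below; it is our own simplification in which plain 3D convolutions stand in for the ECM blocks, and only the branching itself is shown.

```python
# Sketch (ours) of the two-level arrangement: the low-level branch works on the
# full-resolution maps, the high-level branch on spatially max-pooled maps.
import torch
import torch.nn as nn

class TwoLevel(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.low  = nn.Conv3d(c, c, 3, padding=1)        # stand-in for an ECM block
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))  # spatial down-sampling only
        self.high = nn.Conv3d(c, c, 3, padding=1)        # stand-in for an ECM block

    def forward(self, x):
        return self.low(x), self.high(self.pool(x))

x = torch.randn(2, 64, 30, 32, 22)
low, high = TwoLevel(64)(x)
print(low.shape, high.shape)   # (2, 64, 30, 32, 22) and (2, 64, 30, 16, 11)
```

Both branch outputs are later mapped to horizontal feature vectors and concatenated, as described in Sect. 3.5.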

3.5 Temporal Aggregation and Spatial Mapping

To adaptively aggregate the temporal information of variable-length gait sequences, we introduce temporal aggregation [20]. Assuming that the feature map $X_{out} \in \mathbb{R}^{C_{f} \times T_{f} \times H_{f} \times W_{f}}$ is the output of the feature extraction module, the temporal aggregation operation can be represented as

$$Y_{ta} = F^{1\times T_{f}\times 1\times 1}_{Max}(X_{out}), \quad (10)$$

where $Y_{ta} \in \mathbb{R}^{C_{f} \times 1 \times H_{f} \times W_{f}}$ is the output of the temporal aggregation module. For the spatial mapping, we generate multiple horizontal feature representations and then combine them to improve the spatial representation ability [2,7,20,24,39]. The spatial mapping can be represented as

$$Y_{out} = F_{s}\big(F^{1\times 1\times 1\times W_{f}}_{GeM}(Y_{ta})\big), \quad (11)$$

where $Y_{out} \in \mathbb{R}^{C_{f} \times 1 \times H_{f} \times 1}$ is the output of the spatial mapping, $F_{GeM}(\cdot)$ denotes the Generalized-Mean (GeM) pooling operation [25] and $F_{s}(\cdot)$ denotes multiple separate fully connected (FC) layers. After spatial mapping, we obtain the final feature representation $Y$ by concatenating the high-level and low-level feature maps along the horizontal axis.
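A compact sketch of Eqs. 10-11 is shown below: max-pooling over the temporal axis, GeM pooling of each horizontal strip over the width axis, and a separate fully connected mapping per strip. The module name, the learnable GeM exponent and its initial value are our own assumptions, not taken from the paper.

```python
# Sketch (ours) of temporal aggregation + GeM-based spatial mapping (Eqs. 10-11).
import torch
import torch.nn as nn

class TemporalGeMMapping(nn.Module):
    def __init__(self, channels, num_strips, p=6.5):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(float(p)))                    # GeM exponent (assumed learnable)
        self.fc = nn.Parameter(torch.randn(num_strips, channels, channels) * 0.01)  # one FC per strip

    def forward(self, x):                                   # x: (B, C, T, H, W)
        x = x.max(dim=2).values                             # temporal aggregation -> (B, C, H, W)
        x = x.clamp(min=1e-6).pow(self.p).mean(dim=3).pow(1.0 / self.p)  # GeM over W -> (B, C, H)
        x = x.permute(2, 0, 1)                              # (H, B, C)
        y = torch.matmul(x, self.fc)                        # separate FC per horizontal strip
        return y.permute(1, 2, 0)                           # (B, C, H) horizontal feature vectors

feat = torch.relu(torch.randn(4, 128, 30, 16, 11))
print(TemporalGeMMapping(128, num_strips=16)(feat).shape)   # torch.Size([4, 128, 16])
```

The max over the temporal axis is what makes the representation independent of the sequence length, and the per-strip FC layers correspond to the "multiple separate fully connected layers" of $F_{s}(\cdot)$.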

3.6 Loss Function

To train the proposed network, we employ a combined loss function which consists of the triplet loss and the cross-entropy loss. Besides the traditional cross-entropy loss used for classification, the triplet loss is introduced to pull samples of the same ID as close as possible while pushing those of different IDs farther apart in the feature space. The combined loss function is calculated on the output of the spatial mapping and is represented as

$$L_{combined} = L_{tri} + L_{cse}, \quad (12)$$

where $L_{tri}$ and $L_{cse}$ denote the triplet loss and the cross-entropy loss, respectively. $L_{tri}$ is defined as

$$L_{tri} = \max\big(d(r, s) - d(r, t) + m,\ 0\big), \quad (13)$$

where $r$ and $s$ are samples of the same category, while $r$ and $t$ are samples from different categories, $d(\cdot)$ represents the Euclidean distance between two samples and $m$ is the margin of the triplet loss.
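For illustration, a minimal sketch of this combined objective is given below, written in a Batch-All style over a P × K batch (the sampling strategy used in Sect. 3.7). The function name and the toy batch at the bottom are ours; the margin value follows the implementation details reported later.

```python
# Sketch (ours) of L_combined = L_tri + L_cse (Eqs. 12-13), Batch-All triplets.
import torch
import torch.nn.functional as F

def combined_loss(features, logits, labels, margin=0.2):
    n = features.size(0)
    dist = torch.cdist(features, features)                   # (n, n) Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = same & ~torch.eye(n, dtype=torch.bool, device=same.device)
    # valid[a, p, q] selects (anchor, positive, negative) index triplets
    valid = pos.unsqueeze(2) & (~same).unsqueeze(1)
    tri = F.relu(dist.unsqueeze(2) - dist.unsqueeze(1) + margin)[valid].mean()
    return tri + F.cross_entropy(logits, labels)

features = torch.randn(32, 256)                               # flattened representations Y
logits = torch.randn(32, 74)                                  # FC outputs over training identities
labels = torch.repeat_interleave(torch.arange(8), 4)          # P = 8 identities, K = 4 samples each
print(combined_loss(features, logits, labels))
```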

3.7 Training and Test Details

Training. In this paper, we introduce a combined loss function consisting of cross-entropy loss and triplet loss [2,7,31,38,41,42] to train the proposed GaitStrip. Specifically, the feature representation $Y$ is fed directly into the triplet loss function [2], and is passed through an FC layer before being fed into the cross-entropy loss function. Batch ALL (BA) [8] is used as the sampling strategy: each batch contains $P \times K$ samples, corresponding to $P$ classes with $K$ samples per class. Considering the memory limit, the length of the gait sequences is set to $T$ in the training stage.

Test. During the test stage, we input the whole sequence into GaitStrip to produce the feature representation $Y$, which is then flattened into a vector to represent the corresponding sample. In general, gait datasets are divided into two sets, i.e., the gallery set and the probe set. The feature vectors from the gallery set are taken as the standard views to be retrieved, while those from the probe set are used to evaluate the performance. Specifically, we calculate the Euclidean distance between each feature vector in the probe set and all feature vectors in the gallery set, and the class label of the gallery sample with the smallest distance is assigned to the probe sample.
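The evaluation protocol just described amounts to rank-1 nearest-neighbour retrieval; a short sketch of it (our own helper, with random stand-in features) is given below.

```python
# Sketch (ours) of the test protocol: assign every probe the label of its nearest
# gallery sample in Euclidean distance and report rank-1 accuracy.
import torch

def rank1_accuracy(gallery_feat, gallery_label, probe_feat, probe_label):
    dist = torch.cdist(probe_feat, gallery_feat)       # (num_probe, num_gallery)
    pred = gallery_label[dist.argmin(dim=1)]           # label of the closest gallery sample
    return (pred == probe_label).float().mean().item()

g = torch.randn(50, 256); gl = torch.randint(0, 10, (50,))
p = g[:20] + 0.01 * torch.randn(20, 256); pl = gl[:20]  # noisy copies of gallery samples
print(rank1_accuracy(g, gl, p, pl))                      # close to 1.0
```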

4 Experiments

4.1 Datasets and Evaluation Protocol

CASIA-B. The CASIA-B dataset [37] is one of the largest gait datasets for evaluation. It includes 124 subjects, each of which contains 10 groups of gait sequences (six groups of normal walking (NM) #01–#06, two groups of walking with a bag (BG) #01–#02 and two groups of walking with a coat (CL) #01–#02). Each group contains 11 view angles (0°–180°) with a sampling interval of 18°. Hence, the CASIA-B dataset contains 124 (subjects) × 10 (groups) × 11 (view angles) = 13,640 gait sequences. The dataset is divided into a training set and a test set. We use the protocol of [2] for evaluation, which includes three different settings, i.e., Small-sample Training (ST), Medium-sample Training (MT) and Large-sample Training (LT). In the three settings, 24, 62 and 74 subjects are used to form the training set, respectively, and the rest are used for testing. During the test stage, four groups of sequences (NM#01–#04) are used as the gallery set and the rest (NM#05–#06, BG#01–#02 and CL#01–#02) are taken as the probe set.

OUMVLP. The OUMVLP dataset [29] is one of the most popular gait datasets and includes 10,307 subjects. Each subject was recorded in two groups of video sequences (Seq#00 and Seq#01), each of which contains 14 view angles (0°, 15°, ..., 75°, 90°, 180°, 195°, ..., 255°, 270°). During the test phase, the sequences in Seq#01 are used as the gallery set and the sequences in Seq#00 are taken as the probe set.

4.2 Implementation Details

Gait sequences are preprocessed and normalized to the same size of 64 × 44 on both datasets [2]. In CASIA-B, GaitStrip has four blocks, where the last three blocks are built with the proposed ECM.

The channel number of the four blocks is set to 32, 64, 128 and 128, respectively. In OUMVLP, we use five blocks to construct the proposed GaitStrip and the last two blocks are implemented with the ECM module. The channel number of the five blocks is set to 64, 128, 196, 256 and 256, respectively. The margin of the triplet loss is set to 0.2 and Adam is selected as the optimizer. During the training stage, the parameters P and K are both set to 8, and the length of the sequences T is set to 30. The learning rate is set to 1e-4 and reset to 1e-5 for the last 10K iterations. For the ST, MT and LT settings on the CASIA-B dataset, the number of iterations is set to 60K, 70K and 80K, respectively. On the OUMVLP dataset, P × K is set to 32 × 8, the number of iterations is set to 210K, and the learning rate is initialized to 1e-4 and reset to 1e-5 after 150K iterations.

Table 1. Rank-1 accuracy (%) on CASIA-B under all view angles, different settings and conditions, excluding the identical-view case. The gallery contains NM#1-4, 0°–180°.

Setting  Probe   Method        0°    18°   36°   54°   72°   90°   108°  126°  144°  162°  180°  Mean
ST (24)  NM#5-6  GaitSet [2]   64.6  83.3  90.4  86.5  80.2  75.5  80.3  86.0  87.1  81.4  59.6  79.5
                 MT3D [20]     71.9  83.9  90.9  90.1  81.1  75.6  82.1  89.0  91.1  86.3  69.2  82.8
                 GaitGL [23]   77.0  87.8  93.9  92.7  83.9  78.7  84.7  91.5  92.5  89.3  74.4  86.0
                 Ours          79.6  89.5  95.6  94.3  86.4  82.0  86.6  93.0  93.6  90.1  75.1  87.8
         BG#1-2  GaitSet [2]   55.8  70.5  76.9  75.5  69.7  63.4  68.0  75.8  76.2  70.7  52.5  68.6
                 MT3D [20]     64.5  76.7  82.8  82.8  73.2  66.9  74.0  81.9  84.8  80.2  63.0  74.0
                 GaitGL [23]   68.1  81.2  87.7  84.9  76.3  70.5  76.1  84.5  87.0  83.6  65.0  78.6
                 Ours          71.4  82.6  90.4  88.1  77.9  73.6  79.8  86.4  89.1  86.3  71.3  81.5
         CL#1-2  GaitSet [2]   29.4  43.1  49.5  48.7  42.3  40.3  44.9  47.4  43.0  35.7  25.6  40.9
                 MT3D [20]     46.6  61.6  66.5  63.3  57.4  52.1  58.1  58.9  58.5  57.4  41.9  56.6
                 GaitGL [23]   46.9  58.7  66.6  65.4  58.3  54.1  59.5  62.7  61.3  57.1  40.6  57.4
                 Ours          54.3  67.8  75.0  71.6  66.2  59.7  65.5  70.5  69.6  63.6  46.6  64.6
MT (62)  NM#5-6  GaitSet [2]   86.8  95.2  98.0  94.5  91.5  89.1  91.1  95.0  97.4  93.7  80.2  92.0
                 MT3D [20]     91.9  96.4  98.5  95.7  93.8  90.8  93.9  97.3  97.9  95.0  86.8  94.4
                 GaitGL [23]   93.9  97.6  98.8  97.3  95.2  92.7  95.6  98.1  98.5  96.5  91.2  95.9
                 Ours          94.0  98.0  98.7  97.8  95.6  93.0  96.1  98.2  98.6  97.0  92.6  96.3
         BG#1-2  GaitSet [2]   79.9  89.8  91.2  86.7  81.6  76.7  81.0  88.2  90.3  88.5  73.0  84.3
                 MT3D [20]     86.7  92.9  94.9  92.8  88.5  82.5  87.5  92.5  95.3  92.9  81.2  89.8
                 GaitGL [23]   88.5  95.1  95.9  94.2  91.5  85.4  89.0  95.4  97.4  94.3  86.3  92.1
                 Ours          88.8  95.2  96.8  95.5  92.7  87.4  90.7  95.7  97.6  95.3  87.0  93.0
         CL#1-2  GaitSet [2]   52.0  66.0  72.8  69.3  63.1  61.2  63.5  66.5  67.5  60.0  45.9  62.5
                 MT3D [20]     67.5  81.0  85.0  80.6  75.9  69.8  76.8  81.0  80.8  73.8  59.0  75.6
                 GaitGL [23]   70.7  83.2  87.1  84.7  78.2  71.3  78.0  83.7  83.6  77.1  63.1  78.3
                 Ours          69.2  86.7  90.0  88.3  83.6  75.8  82.3  88.1  88.1  81.7  65.7  81.8
LT (74)  NM#5-6  GaitSet [2]   90.8  97.9  99.4  96.9  93.6  91.7  95.0  97.8  98.9  96.8  85.8  95.0
                 GaitPart [7]  94.1  98.6  99.3  98.5  94.0  92.3  95.9  98.4  99.2  97.8  90.4  96.2
                 MT3D [20]     95.7  98.2  99.0  97.5  95.1  93.9  96.1  98.6  99.2  98.2  92.0  96.7
                 GaitGL [23]   96.0  98.3  99.0  97.9  96.9  95.4  97.0  98.9  99.3  98.8  94.0  97.4
                 Ours          96.0  98.4  98.8  97.9  96.6  95.3  97.5  98.9  99.1  99.0  96.3  97.6
         BG#1-2  GaitSet [2]   83.8  91.2  91.8  88.8  83.3  81.0  84.1  90.0  92.2  94.4  79.0  87.2
                 GaitPart [7]  89.1  94.8  96.7  95.1  88.3  84.9  89.0  93.5  96.1  93.8  85.8  91.5
                 MT3D [20]     91.0  95.4  97.5  94.2  92.3  86.9  91.2  95.6  97.3  96.4  86.6  93.0
                 GaitGL [23]   92.6  96.6  96.8  95.5  93.5  89.3  92.2  96.5  98.2  96.9  91.5  94.5
                 Ours          92.8  96.6  97.2  96.5  95.2  90.5  93.5  97.5  98.3  97.6  91.4  95.2
         CL#1-2  GaitSet [2]   61.4  75.4  80.7  77.3  72.1  70.1  71.5  73.5  73.5  68.4  50.0  70.4
                 GaitPart [7]  70.7  85.5  86.9  83.3  77.1  72.5  76.9  82.2  83.8  80.2  66.5  78.7
                 MT3D [20]     76.0  87.6  89.8  85.0  81.2  75.7  81.0  84.5  85.4  82.2  68.1  81.5
                 GaitGL [23]   76.6  90.0  90.3  87.1  84.5  79.0  84.1  87.0  87.3  84.4  69.5  83.6
                 Ours          79.9  92.3  93.4  89.2  86.0  80.0  86.0  88.5  91.7  87.5  73.5  86.2

4.3 Comparison with the State-of-the-Art

Evaluation on CASIA-B. We compare the proposed method with several gait recognition approaches, including GaitSet [2], GaitPart [7], MT3D [20] and GaitGL [23], on the CASIA-B dataset. The experimental results are shown in Table 1. It can be observed that the proposed method achieves the highest average accuracy under all settings (ST, MT and LT) and conditions (NM, BG and CL). Furthermore, we examine the performance of the proposed method under different settings and conditions in detail.

Evaluation Under Various Settings (ST, MT and LT). Our method achieves high performance under all three settings (ST, MT and LT) and exceeds the best results reported before. The complete experimental results under these settings are given in Table 1. The recognition accuracy of GaitGL under the ST, MT and LT settings in the NM condition is 86.0%, 95.9% and 97.4%, respectively, whereas the proposed method reaches 87.8%, 96.3% and 97.6%, respectively. Furthermore, our method obtains significant improvements compared with the other methods in all three settings.

Evaluation Under Various Conditions (NM, BG and CL). It can be seen that when the external environment changes and more challenges exist, the accuracy decreases considerably. Under the LT setting, the accuracy of GaitGL in the NM, BG and CL conditions is 97.4%, 94.5% and 83.6%, respectively. Compared with GaitGL, our results are 0.2%, 0.7% and 2.6% higher, respectively. Under the ST and MT settings, the proposed method also achieves the best performance. In the ST setting, our method outperforms GaitGL by 1.8%, 2.9% and 7.2% under NM, BG and CL, respectively. In the MT setting, the accuracy of the proposed method is 96.3%, 93.0% and 81.8%, which exceeds GaitGL by 0.4%, 0.9% and 3.5%, respectively.

Evaluation on Specific Angles (0°, 90°, 180°). The proposed method shows significant improvement at some extreme view angles (0°, 90° and 180°). For example, the average accuracy of MT3D in the LT setting under NM is 96.7%, but its accuracies at these three specific view angles are 95.7%, 93.9% and 92.0%, respectively. For the proposed method, the average accuracy in the LT setting under NM is 97.6%, which outperforms MT3D by 0.9%, and the accuracies at the specific view angles (0°, 90° and 180°) are 96.0%, 95.3% and 96.3%, respectively, outperforming MT3D by 0.3%, 1.4% and 4.3%. The main reason may be that the proposed SPB module extracts the feature of each strip, enabling the ECM to obtain more effective feature representations at these view angles.

Evaluation on OUMVLP. Compared with CASIA-B, the OUMVLP dataset contains many more subjects. Hereby, we compare GaitStrip with several well-known gait recognition methods, including GEINet [27], GaitSet [2], GaitPart [7], GLN [9], SRN+CB [10], GaitGL [23] and 3D Local [11], on this dataset. The experimental results are shown in Table 2, which indicates that the proposed method achieves the optimal performance in all conditions. For example, the accuracy of GaitGL with invalid probe sequences included is 89.7%, while that of the proposed method is 90.5%, outperforming GaitGL by 0.8%. Excluding invalid probe sequences, the accuracy of GaitGL is 96.2%, while the accuracy of the proposed method is 97.0%.


Table 2. Rank-1 accuracy (%) on the OUMVLP dataset under different view angles, excluding identical-view cases.

Method         0°    15°   30°   45°   60°   75°   90°   180°  195°  210°  225°  240°  255°  270°  Mean
GEINet [27]    24.9  40.7  51.6  55.1  49.8  51.1  46.4  29.2  40.7  50.5  53.3  48.4  48.6  43.5  45.3
GaitSet [2]    84.5  93.3  96.7  96.6  93.5  95.3  94.2  87.0  92.5  96.0  96.0  93.0  94.3  92.7  93.3
GaitPart [7]   88.0  94.7  97.7  97.6  95.5  96.6  96.2  90.6  94.2  97.2  97.1  95.1  96.0  95.0  95.1
GLN [9]        89.3  95.8  97.9  97.8  96.0  96.7  96.1  90.7  95.3  97.7  97.5  95.7  96.2  95.3  95.6
SRN+CB [10]    91.2  96.5  98.3  98.4  96.3  97.3  96.8  92.3  96.3  98.1  98.1  96.0  97.0  96.2  96.4
GaitGL [23]    90.5  96.1  98.0  98.1  97.0  97.6  97.1  94.2  94.9  97.4  97.4  95.7  96.5  95.7  96.2
3D Local [11]  –     –     –     –     –     –     –     –     –     –     –     –     –     –     96.5
Ours           92.8  97.0  98.4  98.5  97.6  98.2  97.8  96.0  96.2  97.8  97.9  96.6  97.3  96.7  97.0

Table 3. Rank-1 accuracy (%) of different ECM blocks.

ST  FL  SPB   NM    BG    CL
✓             97.4  94.9  85.3
✓       ✓     97.4  95.2  85.5
    ✓         96.2  92.9  78.5
    ✓   ✓     97.2  94.9  85.2
✓   ✓         97.4  95.2  85.9
✓   ✓   ✓     97.6  95.2  86.2

4.4 Ablation Study

In this paper, to obtain effective feature representations, we propose GaitStrip with the ECM block, the SPB feature extractor and the multi-level framework. Therefore, we design several ablation studies to explore the contribution of these key components.

Analysis of the SPB Module. We propose the novel SPB extractor to extract more discriminative gait features. To explore the contribution of the SPB, we first design three groups of comparative experiments: ST alone versus the combination of ST and SPB, FL alone versus the combination of FL and SPB, and the combination of ST and FL versus the combination of ST, FL and SPB. The experimental results are shown in Table 3. We find that the modules with SPB outperform those without SPB. The accuracy of the methods with and without SPB in the NM condition is very close, but the methods with SPB perform clearly better in the CL condition. Specifically, the accuracy in the CL condition using FL alone is 78.5%, while the accuracy with the combination of FL and SPB is 85.2%, an increase of 6.7%. In the CL condition, the accuracy with the combination of FL, ST and SPB is 86.2%, which is 0.3% higher than that of the combination of FL and ST alone. Hence, the SPB helps to extract more comprehensive gait features, which plays an important role in improving recognition.


Table 4. Rank-1 accuracy (%) of different levels.

High-level  Low-level   NM    BG    CL
✓                       97.3  94.4  83.4
            ✓           97.2  94.4  84.4
✓           ✓           97.6  95.2  86.2

Table 5. The accuracy (%) of different strip-based modeling methods on the CASIA-B dataset.

Method         NM    BG    CL
baseline+ECM   97.6  95.2  86.2
baseline+ACB   96.1  92.8  79.4
baseline+CCA   80.4  75.1  67.6

Analysis of the ECM Block. In this paper, we propose the ECM to generate discriminative feature representations by taking full advantage of frame-level and strip-based information. The ECM consists of the ST, FL and SPB branches. To explore the advantage of combining ST, FL and SPB for robust feature extraction, we design ablation experiments that use only one or two of the modules. The results are shown in Table 3. In the NM condition, the accuracy of the combination of ST and SPB is 97.4%, the accuracy of the combination of FL and SPB is 97.2%, and the accuracy of the combination of ST, FL and SPB is 97.6%, which is 0.2% and 0.4% higher, respectively. This study shows that the combination of ST, FL and SPB obtains better accuracy in the NM, BG and CL conditions than using only one or two of the modules.

Analysis of the Multi-level Framework. The proposed GaitStrip works with multiple levels. To investigate the contribution of the low-level and high-level branches, we design comparison methods with only one branch. The experimental results are shown in Table 4, from which we observe that the accuracy of the methods with only the high-level or only the low-level branch is 97.3% and 97.2%, respectively, while the accuracy with both levels is 97.6%, an improvement of 0.3% and 0.4%, respectively. This demonstrates that the multi-level structure effectively enhances the representation ability and thus improves the recognition performance.

4.5 Comparison with Other Strip-Based Modeling

In Sect. 2.2, we introduced two different modules for modeling strip-based information. To analyze their performance, we design experiments that replace the ECM module with the Asymmetric Convolution Block (ACB) or the Criss-Cross Attention (CCA) block. All experiments use the LT setting on CASIA-B. The experimental results are shown in Table 5. It can be observed that the proposed ECM achieves better performance than the other strip-based modeling methods. This may be because our ECM utilizes the spatial-temporal information of each strip, improving the feature representation ability. The accuracy of the ECM method in NM, BG and CL is 97.6%, 95.2% and 86.2%, respectively, which exceeds the ACB method by 1.5%, 2.4% and 6.8%. The accuracy of the CCA method in NM, BG and CL is 80.4%, 75.1% and 67.7%, respectively, which is also inferior to our method. Compared with the other strip-based methods, the proposed method better exploits the spatial-temporal representation, especially in complex conditions, and achieves significant improvement.

4.6 Computational Analysis

In the inference stage, the proposed ECM can be embedded into a standard 3D convolution, which reduces the parameters and the inference time. The computational analysis is shown in Table 6. It can be observed that the average accuracy of using ECM is 93.0%, outperforming that of using ST by 0.5%, while after re-parameterization the parameters of both modules are equal.

Table 6. The accuracy (%), inference time (seconds/sequence) and parameters (M) of different methods on the CASIA-B dataset.

Re-param   Metric            ST     ST+FL   ECM
×          Accuracy          92.5   92.8    93.0
×          Inference time    0.025  0.027   0.035
×          Parameters        3.87   4.33    5.25
✓          Accuracy          92.5   92.8    93.0
✓          Inference time    0.025  0.025   0.025
✓          Parameters        3.87   3.87    3.87

5 Conclusion

In this paper, we propose GaitStrip, a novel gait recognition network with an ECM block and a multi-level framework. On the one hand, the proposed ECM, which aggregates spatial-temporal, frame-level and strip-based information, can generate more comprehensive feature representations. Moreover, the spatial-temporal, frame-level and strip-based feature extractors can be embedded into a single 3D convolution in the inference stage, which does not introduce additional parameters. On the other hand, the multi-level structure containing both low-level and high-level branches combines global semantic and local detailed information. The experimental results verify that the proposed GaitStrip achieves appealing performance in normal environments as well as in complex conditions.


Acknowledgements. This work was supported by the National Natural Science Foundation of China (61976017 and 61601021), the Beijing Natural Science Foundation (4202056), the Fundamental Research Funds for the Central Universities (2022JBMC013) and the Australian Research Council (DP220100800, DE230100477). The support and resources from the Center for High Performance Computing at Beijing Jiaotong University (http://hpc.bjtu.edu.cn) are gratefully acknowledged.

References

1. Chai, T., Mei, X., Li, A., Wang, Y.: Silhouette-based view-embeddings for gait recognition under multiple views. In: ICIP (2021)
2. Chao, H., He, Y., Zhang, J., Feng, J.: GaitSet: regarding gait as a set for cross-view gait recognition. In: AAAI (2019)
3. Chao, H., Wang, K., He, Y., Zhang, J., Feng, J.: GaitSet: cross-view gait recognition through utilizing gait as a deep set. TPAMI 44(7), 3467–3478 (2021)
4. Connor, P., Ross, A.: Biometric recognition by gait: a survey of modalities and features. In: CVIU (2018)
5. Ding, X., Guo, Y., Ding, G., Han, J.: ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In: ICCV (2019)
6. Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., Sun, J.: RepVGG: making VGG-style convnets great again. In: CVPR (2021)
7. Fan, C., et al.: GaitPart: temporal part-based model for gait recognition. In: CVPR (2020)
8. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
9. Hou, S., Cao, C., Liu, X., Huang, Y.: Gait lateral network: learning discriminative and compact representations for gait recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 382–398. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_22
10. Hou, S., Liu, X., Cao, C., Huang, Y.: Set residual network for silhouette-based gait recognition. TBIOM 3(3), 384–393 (2021)
11. Huang, Z., et al.: 3D local convolutional neural networks for gait recognition. In: ICCV (2021)
12. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: criss-cross attention for semantic segmentation. In: ICCV (2019)
13. Jin, Y., Sharma, A., Tan, R.T.: DC-ShadowNet: single-image hard and soft shadow removal using unsupervised domain-classifier guided network. In: ICCV (2021)
14. Jin, Y., Yang, W., Tan, R.T.: Unsupervised night image enhancement: when layer decomposition meets light-effects suppression. arXiv preprint arXiv:2207.10564 (2022)
15. Li, S., Liu, W., Ma, H.: Attentive spatial-temporal summary networks for feature learning in irregular gait recognition. TMM 21(9), 2361–2375 (2019)
16. Li, X., Makihara, Y., Xu, C., Yagi, Y., Ren, M.: Gait recognition invariant to carried objects using alpha blending generative adversarial networks. PR 105, 107376 (2020)
17. Li, X., Makihara, Y., Xu, C., Yagi, Y., Ren, M.: Gait recognition via semi-supervised disentangled representation learning to identity and covariate features. In: CVPR (2020)
18. Li, X., Makihara, Y., Xu, C., Yagi, Y., Yu, S., Ren, M.: End-to-end model-based gait recognition. In: ACCV (2020)


19. Lin, B., Liu, Y., Zhang, S.: GaitMask: mask-based model for gait recognition. In: BMVC (2021)
20. Lin, B., Zhang, S., Bao, F.: Gait recognition with multiple-temporal-scale 3D convolutional neural network. In: ACM MM (2020)
21. Lin, B., Zhang, S., Liu, Y., Qin, S.: Multi-scale temporal information extractor for gait recognition. In: ICIP (2021)
22. Lin, B., Zhang, S., Wang, M., Li, L., Yu, X.: GaitGL: learning discriminative global-local feature representations for gait recognition. arXiv2208 (2022)
23. Lin, B., Zhang, S., Yu, X.: Gait recognition via effective global-local feature representation and local temporal aggregation. In: ICCV (2021)
24. Liu, J., et al.: Leaping from 2D detection to efficient 6DoF object pose estimation. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12536, pp. 707–714. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66096-3_47
25. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. TPAMI 41(7), 1655–1668 (2018)
26. Shen, C., Lin, B., Zhang, S., Huang, G.Q., Yu, S., Yu, X.: Gait recognition with mask-based regularization. arXiv preprint arXiv:2203.04038 (2022)
27. Shiraga, K., Makihara, Y., Muramatsu, D., Echigo, T., Yagi, Y.: GEINet: view-invariant gait recognition using a convolutional neural network. In: ICB (2016)
28. Song, C., Huang, Y., Huang, Y., Jia, N., Wang, L.: GaitNet: an end-to-end network for gait based human identification. PR 96, 106988 (2019)
29. Takemura, N., Makihara, Y., Muramatsu, D., Echigo, T., Yagi, Y.: Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Trans. Comput. Vis. Appl. 10(1), 1–14 (2018). https://doi.org/10.1186/s41074-018-0039-6
30. Thapar, D., Jaswal, G., Nigam, A., Arora, C.: Gait metric learning Siamese network exploiting dual of spatio-temporal 3D-CNN intra and LSTM based inter gait-cycle-segment features. PRL 125, 646–653 (2019)
31. Tian, Y., Yu, X., Fan, B., Wu, F., Heijnen, H., Balntas, V.: SOSNet: second order similarity regularization for local descriptor learning. In: CVPR (2019)
32. Wolf, T., Babaee, M., Rigoll, G.: Multi-view gait recognition using 3D convolutional neural networks. In: ICIP (2016)
33. Wu, H., Tian, J., Fu, Y., Li, B., Li, X.: Condition-aware comparison scheme for gait recognition. TIP 30, 2734–2744 (2020)
34. Wu, Z., Huang, Y., Wang, L., Wang, X., Tan, T.: A comprehensive study on cross-view gait based human identification with deep CNNs. TPAMI 39(2), 209–226 (2016)
35. Yeoh, T., Aguirre, H.E., Tanaka, K.: Clothing-invariant gait recognition using convolutional neural network. In: ISPACS (2016)
36. Yu, S., et al.: HID 2021: competition on human identification at a distance 2021. In: IJCB (2021)
37. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: ICPR (2006)
38. Yu, X., et al.: Unsupervised extraction of local image descriptors via relative distance ranking loss. In: ICCV Workshops (2019)
39. Yu, X., Zhuang, Z., Koniusz, P., Li, H.: 6DoF object pose estimation via differentiable proxy voting loss. In: BMVC (2020)
40. Zhang, C., Liu, W., Ma, H., Fu, H.: Siamese neural network based gait recognition for human identification. In: ICASSP (2016)


41. Zhang, J., et al.: Gigapixel whole-slide images classification using locally supervised learning. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. LNCS, vol. 13432, pp. 192–201. Springer, Cham (2022)
42. Zhang, X., et al.: Sleep stage classification based on multi-level feature learning and recurrent neural networks via wearable device. Comput. Biol. Med. 103, 71–81 (2018)
43. Zhang, Y., Huang, Y., Wang, L., Yu, S.: A comprehensive study on gait biometrics using a joint CNN-based method. PR 93, 228–236 (2019)
44. Zhang, Y., Huang, Y., Yu, S., Wang, L.: Cross-view gait recognition by discriminative feature learning. TIP 29, 1001–1015 (2019)
45. Zhu, Z., et al.: Gait recognition in the wild: a benchmark. In: ICCV (2021)

Soft Label Mining and Average Expression Anchoring for Facial Expression Recognition

Haipeng Ming, Wenhuan Lu, and Wei Zhang(B)

Tianjin University, Tianjin, China
{minghaipeng,wenhuan,tjuzhangwei}@tju.edu.cn

Abstract. Facial expression recognition (FER) suffers from high interclass similarity and large intraclass variation, which lead to ambiguity or uncertainty, confuse annotators, and hinder the network from learning valuable facial expression features. Recently, many studies have identified this uncertainty or ambiguity as one of the key challenges in FER. In this paper, we propose a new method to address this issue from two aspects: a soft label mining module that converts the original hard labels to soft labels dynamically during training, and an average facial expression anchoring module that separates unique expression features from shared, similar expression features. The soft label mining module breaks the limits of the categorical model and mitigates the uncertainty or ambiguity, while the average facial expression anchoring module suppresses the high interclass similarity of facial expressions. Our method can be used to train any backbone network for facial expression recognition. Experiments on popular datasets show that our method achieves state-of-the-art results of 92.82% on RAF-DB and 67.91% on SFEW, and a comparable result of 62.26% on AffectNet. The code is available at https://github.com/HaipengMing/SLM-AEA.

1 Introduction

Facial expression is one of the most natural, powerful, and universal signals for human beings to convey their emotional states and intentions [6,32]. In the past years, facial expression recognition (FER) has attracted much attention due to its important role in human-computer interaction, health care, and many other applications. Similar to other modalities in affective computing, a facial expression is commonly characterized as one of several discrete affective states (e.g., the basic emotions defined by Ekman and Friesen [9,10]), which is also known as the categorical model. Generally, annotating facial expressions with the categorical model is much easier and cheaper than with other models (FACS [11] and dimensional models [26]), and it is more consistent with people's intuition as well. Existing facial expression datasets are mostly annotated with the categorical model, such as Oulu-CASIA [44], SFEW/AFEW [7], FERPlus [1], RAF-DB [16], and AffectNet [20] (AffectNet is also annotated with Valence-Arousal dimensions).


Fig. 1. An illustration of the uncertainty or ambiguity. These 14 images are from AffectNet and their labels are attached to the bottom of the images. From left to right, sadness, neutral, happiness. But there is no clear boundary between sadness and neutral as well as neutral and happiness.

However, the categorical model is limited in its ability to represent the complexity and subtlety of facial expressions, especially in the wild. As illustrated in Fig. 1, there is not usually a clear boundary between different expression categories. Even worse, due to the high interclass similarity, annotators may annotate incorrectly, which means that the quality and consistency of the datasets are difficult to guarantee because of the annotators' subjectivity. Compared to discrete affective states, soft labels have greater expressivity and can describe ambiguity appropriately: high interclass similarity can be described by other expression components, while high intraclass variation can be described by different extents of the real expression component. However, it is expensive and time-consuming to provide soft labels for large-scale FER datasets; a compromise is soft label mining. Based on [41], we design a simple yet efficient soft label mining module to mine soft labels from the original annotations, i.e., the hard labels. Specifically, we introduce the label smoothing method to initially transfer the hard labels to soft labels according to a manually set value p, which reflects the confidence level of the original annotations. The initialized soft labels act as targets to train the network parameters, and we update them during training by taking into account both the distribution predicted by the network and the original distribution (i.e., the hard labels). To make a trade-off between the two, we design a ramp function that achieves this balance dynamically. Note that the soft label mining module is designed for training: it imposes no additional burden on inference and adds only a very small additional burden to training. Meanwhile, we also design a novel average facial expression anchoring module to suppress the high interclass similarity. Specifically, we take the average feature of a mini-batch as the anchor feature, i.e., the average expression, and introduce a learnable vector of the same size as the attention weight of the anchor feature. The weighted anchor feature is then summed with the expression features to obtain the final extracted features. We suppress the uncertainty or ambiguity in FER through these two modules, especially the soft label mining module, which shows excellent and stable performance on different FER datasets. Our method can be used to train any backbone network for facial expression recognition.


Overall, the main contributions can be summarized as follows:
(1) We propose a novel method to address the uncertainty or ambiguity problem in FER by designing a soft label mining module and an average expression anchoring module. In comparison with existing related work, our approach is simpler and more efficient.
(2) Our approach is extensively evaluated on laboratory databases and real-world datasets. Experimental results show that our method achieves state-of-the-art performance of 92.82% on RAF-DB and 67.91% on SFEW, and a comparable result of 62.26% on AffectNet.

2 Related Work

2.1 Facial Expression Recognition

Facial expression recognition has been an active topic for many years. Early work focused on using handcrafted features (e.g., SIFT [22], HOG [5], LBP [27], Gabor wavelets [2], etc.) to extract emotion features and recognize facial expressions. With the development of deep learning, learning-based methods have become mainstream; they can be roughly classified into three groups: data-focused [21,42], model-focused [3,25,34,40] and label-focused [4,28,35]. The results reported by Ng et al. [21] show that pre-fine-tuning on an additional FER dataset can improve performance. Zeng et al. [42] propose a model termed IPA2LT to address the inconsistency issue when fusing different FER datasets. Yang et al. [40] and Wang et al. [34] introduce an adversarial mechanism into FER. Ruan et al. [25] propose a novel model that takes into account subtle differences between different facial expressions. Recently, some researchers have begun to consider the uncertainty of annotations in FER datasets. Chen et al. [4] introduce label distribution learning and draw on an auxiliary space of FACS or landmarks. Wang et al. [35] propose a Self-Cure Network (SCN) to suppress the uncertainty by correcting possible mislabeling. She et al. [28] propose a model named DMUE, which achieved the previous leading performance, to mine the latent distribution and estimate uncertainty. Zhao et al. [45] propose a lightweight FER network that considers both the model and the labels.

2.2 Methods for Uncertainty or Ambiguity

Low-quality annotations caused by uncertainty or ambiguity can be considered as label noise, which also appears in other computer vision tasks. Numerous methods have been proposed to resolve this issue. Some methods leverage a small set of clean data [17,33]. Qu et al. [23] propose a label-noise robust network that matches feature distributions. In the field of face recognition (FR), a GCN-based model [43] is proposed to address large-scale label noise. Recently, label distribution has been proposed to mitigate the adverse impact of label noise by converting logical labels to a discretized bivariate Gaussian label distribution with the help of prior knowledge [12,13,31]. In a classification problem, a label distribution amounts to a soft label. Label enhancement (LE) [38,39] is the universal way to find the latent ground truth. Uncertainty estimation methods [30,37] can be used to address inconsistent data quality. The uncertainty or ambiguity in FER is intrinsic: compared to other classification tasks, facial expressions of different classes suffer not only from high interclass similarity but also from high intraclass variation.

3 Method

Fig. 2. The overall framework of our method. Face images are first fed into a backbone network for feature extraction. The extracted features are then averaged to obtain the anchor feature, further multiplied element-wise by a learnable weight vector. The obtained results and the extracted initial features are added element-wise as the final extracted features of the network. We view the soft label as the target to train the network. When training, the soft labels are updated dynamically with a newly designed loss function. The solid line in the figure represents forward propagation while the dashed line represents back-propagation.

Notation. Given a FER dataset $\mathcal{X}$, we denote its corresponding hard label space as $\mathcal{H} = \{\mathbf{y} : \mathbf{y} \in \{0,1\}^{C}, \|\mathbf{y}\|_{1} = 1\}$ and the soft label space as $\mathcal{S} = \{\mathbf{y} : \mathbf{y} \in [0,1]^{C}, \|\mathbf{y}\|_{1} = 1\}$, where $C$ is the number of classes. For the $i$-th facial image $\mathbf{x}_{i}$ belonging to $\mathcal{X}$, we denote $\hat{y}_{i} = \{\hat{y}_{i1}, \hat{y}_{i2}, \ldots, \hat{y}_{iC}\} \in \mathcal{H}$ as its annotated deterministic hard label, which is a one-hot vector. The output vector of the backbone network is denoted as $z_{i} = \{z_{i1}, z_{i2}, \ldots, z_{iC}\}$.

3.1 Soft Label Mining

We first transfer the hard labels to soft labels with the help of label smoothing [15], a common trick to improve performance in classification tasks. A hyperparameter $p$ is introduced for initialization: the more uncertain or ambiguous the FER dataset is, the lower $p$ is set. For hard labels, $p$ is 1. We take the initialized soft labels as the new target to train the network, and we update the soft labels dynamically during training. Specifically, in a $C$-class classification task, the prediction of the model can be formulated as

$$z_{i} = f(\mathbf{x}_{i}; \theta), \quad (1)$$

where $f$ is the model and $\theta$ is the set of network parameters. The predicted result of the model $f$ is then normalized by the softmax function:

$$\bar{z}_{i} = \mathrm{softmax}(z_{i}), \quad (2)$$
$$\bar{z}_{ij} = \frac{\exp(z_{ij})}{\sum_{j=1}^{C} \exp(z_{ij})}. \quad (3)$$

Obviously, $\bar{z}_{i}$ belongs to $\mathcal{S}$ and can be viewed as the label distribution generated by the model $f$. Generally, the purpose of training is to minimize the cross-entropy loss function

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \hat{y}_{ij} \log \bar{z}_{ij}. \quad (4)$$

For the $i$-th image belonging to the $c$-th class,

$$L_{i} = -z_{ic} + \log\Big(\sum_{j=1}^{C} \exp(z_{ij})\Big). \quad (5)$$

Its optimal solution is $z_{ic}^{*} = \infty$ while keeping the other logits small enough. But many ambiguous facial expressions have similar intensity in different emotional components, so it is not reasonable to force the other components to be small. The idea of label smoothing is to transfer $\hat{y}_{i} \in \mathcal{H}$ to $\bar{y}_{i} \in \mathcal{S}$ by changing the construction to

$$\bar{y}_{ij} = \begin{cases} 1-\varepsilon & \text{if } j = c, \\ \varepsilon/(C-1) & \text{otherwise,} \end{cases} \quad (6)$$

where $\varepsilon$ is a small constant and $\bar{y}_{i}$ is a label distribution. We denote its predecessor before the softmax operator as $\tilde{y}_{i}$, whose components are computed as

$$\tilde{y}_{ij} = \begin{cases} k & \text{if } j = c, \\ k - \log\frac{(1-\varepsilon)(C-1)}{\varepsilon} & \text{otherwise,} \end{cases} \quad (7)$$

733

where k is an arbitrary number. It determines the data scale before softmax. But when ε is fixed, k has no effect on y¯, that is, the distribution of y¯ in S is only related to ε. In our experiments, k is set to 5. The y˜i is used as the score-form soft labels and updated with a learning rate during training. Note that we use p (p = 1−ε) instead of ε as the hyperparameter. It is because p responds to the confidence level of the original annotations. Our experimental results show that in the synthetic noise datasets, p should be adjusted to lower as the noise ratio increases. As mentioned earlier, y˜i is used as the score-form soft label. The cross entropy loss function in Eq. 4 now is: Lcls = −

N C 1  y¯ij log¯ zij N i=1 j=1

(8)

y¯i = sof tmax(˜ y i)

(9)

Lcls is the loss function to update the network parameter θ . We update y˜i and θ in two continuous but different backpropagation stages. After θ has been updated, y˜i is regarded as the learnable parameters. Both [35] and [28] show that the network prediction z¯i of some ambiguous facial expressions is more credible than the annotations during training. So an intuitive approach is to take full advantage of z¯i . Lm = −

N C 1  z¯ij log y¯ij N i=1 j=1

(10)

We mine the soft label by Eq. 10. But certain samples should benefit from the original annotations, so we also utilize another loss function (Eq. 11) to explore the compatibility between the original label and the latent label distribution. Lcpt = −

N C 1  yˆij log y¯ij N i=1 j=1

(11)

We introduce a ramp function [28] to adjust the weight between different loss functions. The overall loss function guiding y˜i to update is: L = rampup (e) · Lm + rampdown (e) · Lcpt  exp(−(α − βe )2 ) rampup (e) = 1  1 rampdown (e) = exp(−(α − βe )2 )

e ≤ β, e>β e ≤ β, e>β

(12)

(13)

(14)

734

H. Ming et al.

where α and β are hyperparameters. α is used to adjust the slope of ramp function, which is fixed at 1 in [28]. β is the epoch threshold. In the training process, the network trusts the original annotations more before β-th epoch, and trusts itself more after β-th epoch. ∂L (15) ∂˜ y The backpropagation of L takes another learning rate λ to update y˜ as shown in Eq. 15. Note that λ has a strong relationship with k in Eq. 7. k decides the scale of score-form soft label, while λ is the learning rate to update it. In our experiment, k is fixed to 5 and λ is fixed to 200. Different k has different optimal λ, but once a pair of k and λ is determined, there is no need to change it. For different datasets, only the hyperparameter p, α and β need to be adjusted. y˜ ← y˜ − λ
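The following is a condensed sketch of the soft-label update described by Eqs. 7 and 12-15. The helper names and the explicit gradient step are our own framing of the procedure, not the authors' released code; the default hyperparameter values follow the implementation details reported in Sect. 4.

```python
# Sketch (ours) of score-form soft-label initialization (Eq. 7), ramp weighting
# (Eqs. 13-14) and the soft-label update (Eq. 15).
import math
import torch
import torch.nn.functional as F

def init_score_labels(hard, num_classes, p=0.9, k=5.0):
    eps = 1.0 - p
    off = k - math.log((1.0 - eps) * (num_classes - 1) / eps)   # non-target score
    y = torch.full((hard.numel(), num_classes), off)
    y[torch.arange(hard.numel()), hard] = k                     # target score is k
    return y

def ramp_weights(e, alpha=1.6, beta=7.0):
    w = math.exp(-(alpha - e / beta) ** 2)
    return (w, 1.0) if e <= beta else (1.0, w)                   # (ramp_up, ramp_down)

def update_soft_labels(y_score, z_bar, y_hat, epoch, lam=200.0):
    y_score = y_score.detach().requires_grad_(True)
    y_bar = F.softmax(y_score, dim=1)
    up, down = ramp_weights(epoch)
    # L = ramp_up * L_m + ramp_down * L_cpt
    L = -up * (z_bar * y_bar.log()).sum(1).mean() \
        - down * (F.one_hot(y_hat, y_bar.size(1)) * y_bar.log()).sum(1).mean()
    L.backward()
    return (y_score - lam * y_score.grad).detach()               # Eq. 15

hard = torch.tensor([2, 0, 5])
y_score = init_score_labels(hard, num_classes=7)
z_bar = F.softmax(torch.randn(3, 7), dim=1)                      # detached network predictions
y_score = update_soft_labels(y_score, z_bar, hard, epoch=3)
print(F.softmax(y_score, dim=1))
```

Taking the softmax of the initialized scores reproduces the label-smoothed distribution of Eq. 6, which is why only the scale k and the update rate λ need to be chosen jointly.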

3.2 Average Expression Anchoring

To handle the high interclass similarity in FER, we design an average expression anchoring module to separate the common features from the unique features. We share a similar mindset with a small part of [29]. In general, the features extracted from the backbone network are processed by a fully connected layer to get $z_{i}$:

$$z_{i} = W^{T} F_{i}, \quad (16)$$

where $F_{i} \in \mathbb{R}^{D}$ belongs to the feature space. The average expression anchoring module is comprised of a learnable attention weight $\tilde{W} \in \mathbb{R}^{D}$ and an element-wise summation operation, which can be denoted as

$$\tilde{F}_{i} = \tilde{W} * \bar{F} + F_{i}, \quad (17)$$
$$\bar{F} = \frac{1}{b}\sum_{i=1}^{b} F_{i}, \quad (18)$$

where $*$ means element-wise multiplication, $+$ means element-wise summation, and $b$ is the batch size of the training set. The average expression anchoring module replaces $F_{i}$ with $\tilde{F}_{i}$, and we think it addresses the uncertainty or ambiguity problem from two aspects. First, because expressions of different categories share very similar information, which makes them hard for the network to distinguish, it is necessary to separate the unique features from the common features: $\bar{F}$ represents the highly similar information while $F_{i}$ represents the specific information, and experimental results show that the network benefits from this separation. Second, for low-quality samples, anchoring the average expression allows them to take advantage of information from high-quality samples in the same batch, thus improving the overall quality of the dataset. Note that the improvement of this module depends on the "average expression", i.e., $\bar{F}$, which means it does not apply when inferring on a single expression rather than a batch of expressions.
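A small sketch of Eqs. 17-18 is shown below; the class name and the zero initialization of the weight are our own choices for illustration.

```python
# Sketch (ours) of average expression anchoring: the batch-mean feature is
# re-weighted by a learnable vector and added back to every sample.
import torch
import torch.nn as nn

class AverageExpressionAnchoring(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(dim))       # learnable attention weight W~

    def forward(self, feats):                         # feats: (b, D) backbone features F_i
        anchor = feats.mean(dim=0, keepdim=True)      # F_bar, the "average expression"
        return feats + self.w * anchor                # F~_i = W~ * F_bar + F_i

feats = torch.randn(32, 512)
print(AverageExpressionAnchoring(512)(feats).shape)   # torch.Size([32, 512])
```

As the paper notes, the anchor is only meaningful when a batch of expressions is available; with a single image the "average" degenerates to the image's own feature.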

3.3 The Overall Framework

As illustrated in Fig. 2, given a FER dataset with hard labels, we first assess the confidence level of the dataset annotations via p. We then fix p and soften the hard labels by label smoothing to obtain the initialized soft labels, which are used as the target to train the backbone network. An average expression anchoring module is introduced at the end of the backbone to mitigate the high interclass similarity in FER by separating the unique features from the common features. The soft labels are further updated dynamically during training by minimizing multiple loss functions: we force them to approximate both the network prediction and the original annotations through Lm and Lcpt, and we design different weights so that the network trusts the original annotations more at the beginning and trusts itself more as training progresses. Note that we use not only traditional CNNs but also Swin Transformers [18] as backbones to extract facial expression features; the effects of different backbone networks are compared in Sect. 4.

4 Experiments

4.1 Datasets

Oulu-CASIA [44] contains videos captured in controlled lab conditions, in which subjects were asked to pose six basic expressions (happiness, surprise, sadness, anger, disgust, fear). We select the last three frames of each sequence captured under visible light and strong illumination (1,440 images in total) to construct the training set and the test set. Similar to [35], we employ the subject-independent ten-fold cross-validation protocol for evaluation.

RAF-DB [16] is a real-world facial expression dataset containing 15,339 facial images annotated with the six basic expressions and the neutral expression. Among them, 12,271 images are used for training and the remaining 3,068 images for testing. We report the overall sample accuracy on the testing set.

AffectNet [20] is by far the largest FER dataset collected in the wild and is annotated with both categorical and Valence-Arousal dimension labels. It contains more than 1,000,000 images collected from the Internet by querying 1,250 expression-related keywords in three search engines, of which about 450,000 images are manually annotated with the seven expression classes used in RAF-DB plus the extra contempt expression. Among them, around 280K images are used for training and the remaining 4K images for testing. The overall sample accuracy is used for measurement.

SFEW [7] is created by selecting static frames from Acted Facial Expressions in the Wild (AFEW). The images in SFEW are labeled with the six basic expressions and the neutral expression, the same as RAF-DB. We use 958 images for training and 436 images for testing. The overall sample accuracy is used for measurement.

4.2 Implementation Details

We take ResNet18 [14] pre-trained on MS-Celeb-1M as the default backbone network with the standard routine for a fair comparison, and we align the faces of the in-the-wild datasets for pose normalization. The facial expression images are resized to 256 × 256 pixels and further augmented by random cropping to 224 × 224 pixels, horizontal flipping, and Gaussian noise, each with a probability of 0.5. As mentioned earlier, the parameters k and λ are set to 5 and 200, respectively, and the hyperparameter α is set to 1.6. The other parameters (β and p) need to be adjusted according to the dataset: p reflects the confidence level of the original annotations, and β is the epoch by which the backbone has learned enough knowledge to trust its own predictions. In practice, setting β to 7 and p to 0.9 generally achieves a performance that greatly exceeds the baseline. We use Adam with a weight decay of 10^-4. The initial learning rate is 10^-3 and is reduced to 10^-6 following a cosine schedule; training ends at epoch 40. We implement our model with the PyTorch toolbox on a single Nvidia 2080Ti GPU and train it in an end-to-end manner.

4.3 Ablation Studies

We conduct ablation experiments to observe the effect of the key parameters and the different modules on the final performance. We choose RAF-DB and AffectNet as benchmarks since they are two of the largest popular real-world FER datasets.

Influence of Different Modules. To evaluate the contribution of the different modules, we conduct an ablation study of the soft label mining module and the average expression anchoring module on RAF-DB and AffectNet. Considering that some related works do not pre-train the network on large-scale face recognition datasets, we also investigate the effect of pre-training on MS-Celeb-1M. The experimental results are shown in Table 1, from which several observations can be drawn. First, the soft label mining module improves the performance by 1.82% on RAF-DB and 3.86% on AffectNet without pre-training on MS-Celeb-1M, and by 1.98% on RAF-DB and 4.31% on AffectNet with pre-training. It plays the most important role in our method due to its outstanding and stable improvements. Second, pre-training on a large-scale face recognition dataset and the average expression anchoring module both improve the accuracy on RAF-DB noticeably, but pre-training brings very little improvement on AffectNet, and the average expression anchoring module can even degrade the accuracy on AffectNet slightly. We attribute this to the consistency of the datasets: due to the uncertainty or ambiguity problem in FER, larger-scale FER datasets tend to be more inconsistent, and the need for the soft label mining module, which breaks the limitations of the categorical model, is therefore greater. Third, combining pre-training on a large-scale face recognition dataset with the union of the two modules yields a larger improvement. Taking the result of pre-training on MS-Celeb-1M as the baseline, our method improves the accuracy by 5.97% on RAF-DB and 4.74% on AffectNet.


Table 1. Accuracy (%) comparison of different modules on RAF-DB and AffectNet. SLM and AEA are abbreviations for the soft label mining module and the average expression anchoring module, respectively. ✓ in the "Pre-train" column means ResNet18 is pre-trained on MS-Celeb-1M, while × means the initialized parameters of ResNet18 are provided by PyTorch, pre-trained on ImageNet.

Pre-train  SLM  AEA   RAF-DB  AffectNet
×          ×    ×     86.34   59.34
×          ✓    ×     87.91   61.63
×          ×    ✓     89.12   58.78
×          ✓    ✓     91.65   61.82
✓          ×    ×     87.59   59.44
✓          ✓    ×     89.32   62.00
✓          ×    ✓     91.99   59.29
✓          ✓    ✓     92.82   62.26

Fig. 3. The accuracy (%) with different α and β on AffectNet.

Influence of α. α determines the slope and the ramp function’s initial value. We introduced α in our model to improve the ramp function. This is because the original ramp function has a higher starting point, but we want to force the network to concentrate more on the original labels in the early phases of training. The experimental results show that the introduction of α helps to resolve this problem. Figure 3 shows the influence of α. If α is set too small, the network’s predictions at the beginning moments of training will have a greater impact on relabeling. If α is set too large, the slope of the ramp function will be too steep. Influence of β. β is the epoch that the network starts to mine soft labels. Before β-th epoch, soft label mining depends more on the original hard label. But after β-th epoch, it depends more on the logits of the network. Theoretically, the lower the quality of the dataset, the lower the β. But too small β will harm learning enough useful features. Figure 3 shows the impact of β on AffectNet, and proves this speculation well. Different datasets need different β. As shown in


Table 2. Optimal β and p for different FER datasets.

     Oulu-CASIA  RAF-DB  AffectNet  SFEW
β    7           8       6.5        7
p    0.95        0.97    0.85       0.91

Optimal p of Different FER Datasets. We conduct extensive experiments to find the optimal p for each dataset. As shown in Table 2, AffectNet has the minimum optimal p, which suggests that it is the most uncertain or ambiguous of the four FER datasets. We attribute this to the scale of AffectNet: as mentioned earlier, larger-scale FER datasets tend to be more inconsistent.

Table 3. Accuracy (%) comparison of different backbone networks on RAF-DB and AffectNet. SLM-AEA is an abbreviation for our method. All the backbone networks are pre-trained on ImageNet for a fair comparison.

Backbone               SLM-AEA    RAF-DB  AffectNet
ResNet18 [14]          ×          86.34   59.34
ResNet18 [14]          ✓          91.65   61.82
ResNet50 [14]          ×          85.39   59.04
ResNet50 [14]          ✓          89.08   60.95
Swin Transformer [18]  ×          87.61   59.74
Swin Transformer [18]  ✓          90.63   62.17

Different Backbone Networks. Our framework can be used to train any backbone network for facial expression recognition. We conduct experiments not only on CNNs but also on the recently popular transformer, choosing ResNet18, ResNet50, and the Swin Transformer as backbones for comparison. For fairness, all the backbones are pre-trained on ImageNet, which makes the performance slightly inferior to pre-training on MS-Celeb-1M. As shown in Table 3, it is interesting that ResNet50 does not perform as well as ResNet18 on RAF-DB and AffectNet. In contrast, the performance of the Swin Transformer on AffectNet is surprisingly strong, and we believe transformers have further potential for facial expression recognition. However, on RAF-DB the relative performance of the Swin Transformer decreases when our method is applied.


Fig. 4. Nine images and their predicted results from RAF-DB. There is some ambiguity in each image. The orange soft labels were mined by our model (Su: Surprise, Fe: Fear, Di: Disgust, Ha: Happy, Sa: Sad, An: Anger, Ne: Neutral). The blue logits were output from the baseline method. (Color figure online)

4.4 Visualization Analysis

To demonstrate the superiority of our method, we analyze it visually in comparison with the baseline. We select nine images from RAF-DB with different degrees of ambiguity. As shown in Fig. 4, the label of each expression is attached to the green rectangle in the lower part of the image. Compared with the baseline, our method mines more of the latent ground truth. The rationality of the soft labels can be illustrated with the first two expressions labeled as sadness in Fig. 4: the baseline predicts both images as "Neutral", whereas our predictions place large weight on both "Neutral" and "Sadness". Although our method predicts only one of them correctly, we consider its output closer to the ground truth. The same conclusion can be drawn for the other images. Another interesting example is the fourth expression; it is hard to determine its real label, and our model appears somewhat "confused" by it as well. Despite the presence of many uncertain or ambiguous expressions in FER datasets, we assume that the labels provided by the annotators are correct overall (at least for one component).

4.5 Evaluation on Synthetic Datasets

One consequence of ambiguity is that annotators tend to mislabel expressions, which can be viewed as label noise. Label noise is usually classified into two types: symmetric noise, in which a label can be flipped to any other class uniformly at random, and asymmetric noise, in which labels are flipped to specific (typically similar) classes. Following [35], we synthesize label noise by randomly flipping labels and add 10%, 20%, and 30% noise to RAF-DB. We reproduce SCN [35] for comparison with our method. For a fair comparison, both methods use the same ResNet18 backbone initialized with parameters pre-trained on MS-Celeb-1M.
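As a point of reference for the noise-injection protocol, the helper below flips a fixed fraction of labels to a different, uniformly chosen class. It is a minimal sketch, not the exact script used in [35] or in our experiments; the function name and the seven-class default are illustrative.

```python
import random

def add_label_noise(labels, noise_ratio, num_classes=7, seed=0):
    # Flip `noise_ratio` of the integer labels to a different class chosen
    # uniformly at random (e.g. 0.1, 0.2, 0.3 for RAF-DB).
    rng = random.Random(seed)
    noisy = list(labels)
    flip_ids = rng.sample(range(len(noisy)), int(noise_ratio * len(noisy)))
    for idx in flip_ids:
        candidates = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(candidates)
    return noisy
```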


Table 4. Accuracy (%) comparison with the state-of-the-art method on synthetic-noise RAF-DB datasets.

Method    n (%)   RAF-DB/best  RAF-DB/last
Baseline  10      82.88        81.64
SCN       10      86.77        86.40
SLM-AEA   10      91.59        91.59
Baseline  20      80.84        74.88
SCN       20      85.20        84.78
SLM-AEA   20      89.86        89.57
Baseline  30      79.19        63.88
SCN       30      82.69        82.34
SLM-AEA   30      87.74        85.95

Table 5. Comparison with the state-of-the-arts. + denotes that both AffectNet and RAF-DB are used as the training set. ∗ denotes that oversampling is used since the training set of AffectNet is imbalanced.

Benchmark dataset  Method                  Pre-trained dataset  Acc. (%)
Oulu-CASIA         IPA2LT+ [42]            -                    61.49
Oulu-CASIA         FN2EN [8]               2.6M face images     87.71
Oulu-CASIA         DeRL [40]               BU-4DFE & BP4D       88.00
Oulu-CASIA         DDL [24]                Multi-PIE            88.26
Oulu-CASIA         FDRL (ResNet18) [25]    MS-Celeb-1M          88.26
Oulu-CASIA         SLM-AEA (ResNet18)      MS-Celeb-1M          88.61
RAF-DB             LDL-ALSG+ [4]           -                    85.53
RAF-DB             IPA2LT+ [42]            -                    86.77
RAF-DB             SCN (ResNet18)+ [35]    MS-Celeb-1M          88.14
RAF-DB             DMUE (ResNet18) [28]    MS-Celeb-1M          88.76
RAF-DB             FDRL (ResNet18) [25]    MS-Celeb-1M          89.47
RAF-DB             SLM-AEA (ResNet18)      MS-Celeb-1M          92.82
AffectNet          IPA2LT+ [42]            -                    55.71
AffectNet          RAN∗ [36]               MS-Celeb-1M          59.50
AffectNet          SCN (ResNet18)∗ [35]    MS-Celeb-1M          60.23
AffectNet          CVT∗ [19]               MS-Celeb-1M          61.70
AffectNet          SLM-AEA∗ (ResNet18)     MS-Celeb-1M          62.26
AffectNet          DMUE (ResNet18)∗ [28]   MS-Celeb-1M          62.84
SFEW               RAN [36]                MS-Celeb-1M          56.40
SFEW               DMUE (ResNet18) [28]    MS-Celeb-1M          57.12
SFEW               IPA2LT+ [42]            -                    58.29
SFEW               DDL [24]                Multi-PIE            59.86
SFEW               FDRL (ResNet18) [25]    MS-Celeb-1M          62.16
SFEW               SLM-AEA (ResNet18)      MS-Celeb-1M          67.91


We report the mean accuracy of ten experiments. As shown in Table 4, our method outperforms both the baseline and SCN for all noise ratios; that is, it has better error-correction capability. We believe this capability comes from representing expressions with soft labels rather than hard labels. Note that the value of p should be adjusted according to the synthesized noise ratio: as the noise ratio increases, p should be tuned down slightly to achieve better performance.

4.6 Comparison with the State-of-the-Arts

We compare our method with the state-of-the-art methods in Table 5. IPA2LT aims to address the inconsistency between different datasets. DDL is proposed to disentangle disturbing factors in facial expression images. DeRL decomposes expressions into expressive components and unexpressed neutral components with the help of a GAN. RAN is developed to deal with occlusion and head pose in FER. LDL-ALSG introduces label distribution learning to FER and uses extra information from related tasks. SCN and DMUE are recent methods that also address the uncertainty or ambiguity problem of FER; [35] is the first paper to raise the uncertainty problem in FER, and DMUE (in its ResNet50-IBN version) achieved the previous leading performance. We choose the ResNet18 version of DMUE for a fair comparison. FDRL is also a recently proposed method and achieves leading performance on both Oulu-CASIA and RAF-DB. CVT introduces the transformer into FER and likewise achieves state-of-the-art performance. Compared with these methods, we achieve better performance and set new records on Oulu-CASIA, RAF-DB, and SFEW; SLM-AEA outperforms the state-of-the-art methods consistently and by a large margin on RAF-DB and SFEW. Although it does not outperform DMUE on AffectNet, it achieves a very competitive result with a simpler method.

5 Conclusion

In this paper, we proposed a novel method to address the uncertainty and ambiguity problem in FER. Our method is composed of two main modules: the soft label mining module and the average expression anchoring module. The former is designed to break the limitations of the classification model by dynamically converting hard labels to soft labels, and the latter aims to alleviate the high inter-class similarity. The soft label mining module plays the most important role in our method owing to its large and stable improvements. Compared with the state-of-the-art methods, our SLM-AEA is simpler yet effective, and experiments on popular benchmarks show that it is extremely competitive. Acknowledgements. This work was supported by the Tianjin Municipal Science and Technology Program for New Generation of Artificial Intelligence (19Z-XZNGX00030).


References 1. Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: ACM International Conference on Multimodal Interaction, pp. 279–283 (2016) 2. Bazzo, J.J., Lamar, M.V.: Recognizing facial actions using Gabor wavelets with neutral face average difference. In: International Conference on Automatic Face and Gesture Recognition, pp. 505–510 (2004) 3. Cai, J., Meng, Z., Khan, A.S., Li, Z., O’Reilly, J., Tong, Y.: Island loss for learning discriminative features in facial expression recognition. In: International Conference on Automatic Face and Gesture Recognition, pp. 302–309 (2018) 4. Chen, S., Wang, J., Chen, Y., Shi, Z., Geng, X., Rui, Y.: Label distribution learning on auxiliary label space graphs for facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 13984–13993 (2020) 5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition (2005) 6. Darwin, C., Prodger, P.: The Expression of the Emotions in Man and Animals. Oxford University Press, Oxford (1998) 7. Dhall, A., Goecke, R., Lucey, S., Gedeon, T.: Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark. In: International Conference on Computer Vision, pp. 2106–2112 (2011) 8. Ding, H., Zhou, S.K., Chellappa, L.: FaceNet2ExpNet: regularizing a deep face recognition net for expression recognition. In: International Conference on Automatic Face and Gesture Recognition (2017) 9. Ekman, P.: Strong evidence for universals in facial expressions: a reply to Russell’s mistaken critique. Psychol. Bull. 115(2), 268–287 (1994) 10. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. J. Pers. Soc. Psychol. 17(2), 124–129 (1971) 11. Ekman, P., Rosenberg, E.: What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, Oxford (1997) 12. Gao, B., Xing, C., Xie, C., Wu, J., Geng, X.: Deep label distribution learning with label ambiguity. IEEE Trans. Image Process. 26(6), 2825–2838 (2017) 13. Geng, X.: Deep label distribution learning with label ambiguity. IEEE Trans. Knowl. Data Eng. 28(7), 1734–1748 (2016) 14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016) 15. He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of tricks for image classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567 (2019) 16. Li, S., Deng, W., Du, J., Zhang, Z.: Reliable crowd-sourcing and deep localitypreserving learning for expression recognition in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 285–2861 (2017) 17. Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., Li, L.: Learning from noisy labels with distillation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1910–1918 (2017) 18. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision (2021) 19. Ma, F., Sun, B., Li, S.: Robust facial expression recognition with convolutional visual transformers. IEEE Trans. Affect. Comput. (2021)


20. Mollahosseini, A., Hasani, B., Mahoor, M.H., Mahoor, M.H.: AffectNet: a database for facial expression, valence, and arousal computing in the wild. TAC 10(1), 18–31 (2017) 21. Ng, H.W., Nguyen, V.D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. In: ACM International Conference on Multimodal Interaction, pp. 443–449 (2015) 22. Ng, P.C., Henikoff, S.: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31(13), 3812–3814 (2003) 23. Qu, Y., Mo, S., Niu, J.: DAT: training deep networks robust to label-noise by matching the feature distributions. In: IEEE Conference on Computer Vision and Pattern Recognition (2021) 24. Ruan, D., Yan, Y., Chen, S., Xue, J.H., HanziWang: deep disturbance-disentangled learning for facial expression recognition. In: ACM International Conference on Multimedia (2020) 25. Ruan, D., Yan, Y., Lai, S., Chai, Z., Shen, C., Wang, H.: Feature decomposition and reconstruction learning for effective facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7660–7669 (2021) 26. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161–1178 (1980) 27. Shan, C., Gong, S., McOwan, P.W.: Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis. Comput. 27(6), 803–816 (2009) 28. She, J., Hu, Y., Shi, H., Wang, J., Shen, Q., Mei, T.: Dive into ambiguity: latent distribution mining and pairwise uncertainty estimation for facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6248–6257 (2021) 29. Shi, J., Zhu, S., Liang, Z.: Learning to amend facial expression representation via de-albino and affinity (2021). arXiv:2103.10189 30. Shu, J., et al.: Meta-weight-net: Learning an explicit mapping for sample weighting. In: Annual Conference on Neural Information Processing Systems (2019) 31. Su, K., Geng, X.: Soft facial landmark detection by label distribution learning. In: AAAI Conference on Artificial Intelligence, pp. 5008–5015 (2019) 32. Tian, Y.I., Kanade, T., Cohn, J.F.: Recognizing action units for facial expression analysis. T-PAMI 23(2), 97–115 (2001) 33. Veit, A., Alldrin, N., Chechika, G., Krasin, I., Gupta, A., Belongie, S.: Learning from noisy large-scale datasets with minimal supervision. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847 (2017) 34. Wang, C., Wang, S., Liang, G.: Identity- and pose-robust facial expression recognition through adversarial feature learning. In: ACM International Conference on Multimedia, pp. 238–246 (2019) 35. Wang, K., Peng, X., Yang, J., Lu, S., Qiao, Y.: Suppressing uncertainties for largescale facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6897–6906 (2020) 36. Wang, K., Peng, X., Yang, J., Meng, D., Qiao, Y.: Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 29, 4057–4069 (2020) 37. Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., Bailey, J.: Symmetric cross entropy for robust learning with noisy labels. In: International Conference on Computer Vision (2019) 38. Xu, N., Liu, Y., Geng, X.: Label enhancement for label distribution learning. IEEE Trans. Knowl. Data Eng. 33(4), 1632–1643 (2021)


39. Xu, N., Shu, J., Liu, Y., Geng, X.: Variational label enhancement. In: Proceedings of the 37th International Conference on Machine Learning, pp. 10597–10606 (2020) 40. Yang, H., Ciftci, U.A., Yin, L.: Facial expression recognition by de-expression residue learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2168–2177 (2018) 41. Yi, K., Wu, J.: Probabilistic end-to-end noise correction for learning with noisy labels. In: IEEE Conference on Computer Vision and Pattern Recognition (2019) 42. Zeng, J., Shan, S., Chen, X.: Facial expression recognition with inconsistently annotated datasets. In: European Conference on Computer Vision, pp. 22–237 (2018) 43. Zhang, Y., et al.: Global-local GCN: large-scale label noise cleansing for face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2020) 44. Zhao, G., Huang, X., Taini, M., Li, S.Z., ¨ aInen, M.P.: Facial expression recognition from near-infrared videos. Image Vis. Comput. 29(9), 607–619 (2011) 45. Zhao, Z., Liu, Q., Zhou, F.: Robust lightweight facial expression recognition network with label distribution training. In: AAAI Conference on Artificial Intelligence, pp. 3510–3519 (2021)

'Labelling the Gaps': A Weakly Supervised Automatic Eye Gaze Estimation

Shreya Ghosh1(B), Abhinav Dhall1,2, Munawar Hayat1, and Jarrod Knibbe3

1 Monash University, Melbourne, Australia
{shreya.ghosh,munawar.hayat}@monash.edu, [email protected]
2 IIT Ropar, Rupnagar, India
3 University of Melbourne, Melbourne, Australia
[email protected]

Abstract. Over the past few years, there has been an increasing interest in interpreting gaze direction in unconstrained environments with limited supervision. Owing to data curation and annotation issues, replicating gaze estimation methods on other platforms, such as unconstrained outdoor settings or AR/VR, may lead to a significant drop in performance due to the insufficient availability of accurately annotated data for model training. In this paper, we explore the interesting yet challenging problem of gaze estimation with a limited amount of labelled data. The proposed method utilizes domain knowledge from the labelled subset together with visual features, including identity-specific appearance, gaze trajectory consistency and motion features. Given a gaze trajectory, the method uses the label information of only the start and the end frames of the sequence. An extension of the proposed method further reduces the requirement to only the start frame, with a minor drop in the quality of the generated labels. We evaluate the proposed method on four benchmark datasets (CAVE, TabletGaze, MPII and Gaze360) as well as web-crawled YouTube videos. Our proposed method reduces the annotation effort to as low as 2.67%, with minimal impact on performance, indicating the potential of our model for enabling gaze estimation in an 'in-the-wild' setup (https://github.com/i-am-shreya/Labelling-the-Gaps).

Keywords: Gaze estimation · Weakly-supervised learning · Neural network

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-26316-3_44.

1 Introduction

The 'language of the eyes' provides an insight into complex mental states such as visual attention [29] and human cognition (emotions, beliefs and desires) [52]. Accurate gaze estimation has wide applications in computer vision-related assistive technologies [16,33], where the gaze is measured as the line of sight of the pupil


in 3D/2D space or as a 2D screen location [24]. Recent advances in computer vision and deep learning have significantly enhanced the accuracy of gaze estimation [15]. The most promising eye gaze estimation techniques either require specialized hardware (for example Tobii [34]) or use supervised image processing solutions [46,62]. Device- and sensor-based gaze estimation methods are highly dependent on user assistance and illumination conditions, suffer from high device failure rates in uncontrolled environments, and are constrained by the device's working distance. On the other hand, supervised methods require a large amount of labelled data for training, and manual labelling of human gaze information is a complex, noisy, resource-expensive and time-consuming task. To overcome these limitations, weakly-supervised learning provides a promising paradigm, since it enables learning from a large amount of readily available non-annotated data. A few works [10,17,26,57] explore this direction to mitigate the data curation and annotation issue; however, these methods mostly investigate the problem from a spatial analysis perspective. Human eye movement is a spatio-temporal, dynamic process that is either task-driven or involuntary. It is therefore interesting to simplify the ballistic eye movement and curate large-scale training data for gaze representation learning.

To this end, we propose a weakly supervised eye gaze estimation framework. Our proposed technique reduces the requirement for a large number of annotated training samples. We show that the technique can also be used to facilitate the annotation process and reduce bias in the data annotations. The proposed method requires the ground truth labels of the start and end frames of a pre-defined gaze trajectory; we further refine this strategy so that only the start frame's gaze annotation is required. Our proposed method significantly reduces the annotation effort, which could be beneficial for quickly annotating large-scale gaze datasets. Moreover, it can be used in several applications such as immersive, augmented and virtual reality [9,49] (especially Foveated Rendering), the animation industry [18,38] and social robotics [2,58], where unsupervised or weakly supervised calibration is highly desirable. In Foveated Rendering, gaze-based interaction demands low-latency gaze estimation to reduce energy consumption: the virtual environment displays high-quality images only around the user's point of view and blurs the peripheral region. Due to the delays in frame-wise gaze estimation pipelines, the usage of Foveated Rendering is quite limited and mostly the head-pose direction is used to approximate the field of view [9]. Our proposed method has the potential to bridge this gap and reduce energy consumption by interpolating the user's gaze trajectory. Another potential application is the animation industry [18,38]: given the start and end Point of Gaze (PoG) of a virtual avatar, our method can generate realistic labels for intermediate frames to display realistic facial gestures in an interactive environment. Similarly, in social robotics, multi-modal gaze control strategies have been explored for guiding a robot's gaze; for example, an array of microphones has been utilized [2] to guide the gaze direction of a robot named Maggie, while other well-established methods use infrared lasers and multimodal stimuli (e.g., visual, auditory and tactile) to model known gaze trajectories [58].
Our proposed methods could eliminate the aforementioned requirement of specialised hardware or predefined heuristics to navigate the environment. The main contributions of the paper are as follows:


1. We propose two weakly supervised neural networks ('2-labels' and '1-label') for gaze estimation, which require the labels of two frames and of one frame in a gaze sequence, respectively.
2. We use task-specific information to bridge the gap between labelled and unlabelled samples. Our proposed method leverages facial appearance, relative motion, trajectory ordering and embedding consistency; this task-specific knowledge bridges the gap between labelled and unlabelled samples during learning.
3. We evaluate the performance of the proposed networks in two settings: 1) on benchmark datasets (CAVE, TabletGaze, MPII and Gaze360) and 2) on unlabelled 'in the wild' YouTube data where ground truth annotation is not available. Additionally, we perform cross-dataset experiments to validate the generalizability of the framework.
4. We demonstrate the effectiveness of our proposed techniques by re-training state-of-the-art eye gaze estimation methods with the labels generated by our method from very few prior annotations. The results indicate comparable performance for state-of-the-art frameworks (for example, 3.8° by pictorial gaze [37] and 4° with our 2-labels technique).
5. We also validate our learning-based interpolation method on unlabelled YouTube data where ground truth annotation is not available. Our experimental results suggest that this annotation method can be useful for extracting substantial training data for learning gaze estimation models.

2 Related Work

Gaze Estimation. Recent advances in computer vision and deep learning techniques have significantly enhanced the gaze estimation performance [24,37]. A thorough analysis of gaze estimation literature is mentioned in a recent survey [15]. Appearance based gaze estimation methods [30,31,63] learn image to gaze mapping either via support vector regression [46] or deep learning methods [12,23,27,59,60,63]. Among the deep learning based methods, supervised learning methods [23,37,60,63] mostly encode appearance based gaze which require a large amount of annotated data. To overcome the limitation, few works explore gaze estimation with limited supervision such as ‘learning-bysynthesis’ [48], hierarchical generative models [51], conditional random field [6], unsupervised gaze target discovery [62], unsupervised representation learning [10,57], weakly supervised learning [26], pseudo labelling [17] and few-shot learning [36,56]. Among these studies, the few shot learning approach required very few (≤ 9) calibration samples for gaze inference. Our method requires even less data annotation for gaze estimation (CAVE: 6.56%, TabletGaze: < 1%, MPII: 4.67% and Gaze360: 2.38%). Gaze Motion. Eye movements are divided into the following categories: 1) Saccade. Saccades are voluntary eye movements to adjust the PoG in a visual field and it usually lasts for 10 to 100 ms. 2) Smooth Pursuit. It is an involuntary eye movement that occurs while tracking a moving visual target. 3) Fixations. Fixations consist of three involuntary eye movements termed as tremor,


Fig. 1. Weakly supervised labelling approach illustration. The zig-zag path AD on the left is an example of a human gaze movement. This path can further be broken into several 'gaze trajectories' (AB, BC and CD). Given a gaze trajectory AB with gaze annotations for A and B, the objective of this work is to annotate all the intermediate unlabelled frames, i.e. P1, P2 and P3. The right side shows an example of a gaze trajectory [63].

drift and microsaccades [11]. The main objective is to stabilize the PoG on an object. Prior works along this line mainly used velocity based thresholding [25], BLSTM [47], Bayesian framework [42], and hierarchical HMM [67] for classification. Arabadzhiyska et al. [3] model saccade dynamics for gaze-contingent rendering. Our proposed method uses trajectory constrained gaze interpolation using temporal coherency with limited ground truth labels. Gaze Datasets. In the past decade, several datasets [12–14,21,24,35,46,59,60] have been proposed to estimate gaze accurately. The dataset collection technique has evolved from constrained lab environments [46,61] to unconstrained indoor [12,21,59–61] and outdoor settings [24]. To consider both the constrained and unconstrained settings, we evaluate our weakly supervised framework in CAVE [46], TabletGaze [21], MPII [60] and Gaze360 [24] datasets. Weakly Supervised Neural Networks. Over the past few years, several promising weakly supervised methods have been proposed which mainly infer on the basis of prior knowledge [8,28], task-specific domain knowledge [4,64,65], representation learning [53,66], loss-imposed learning paradigms [4,41] and combinations of the above [4]. Williams et al. [54] propose a semi-supervised Gaussian process model to predict the gaze. This method simplifies the data collection process as well. Bilen et al. [8] use pre-trained deep CNN for the object detection task. Arandjelovic et al. [4] propose a weakly supervised ranking loss for the place recognition task. Haeusser et al. [19] introduce ‘associative learning’ paradigm, which allows semi-supervised end-to-end training of any arbitrary network architecture. Unlike these studies, we explore loss-imposed domain knowledge for our framework.

3 Preliminaries

Gaze Trajectory. Human eye movement follows an arbitrary continuous path in three-dimensional space termed as the ‘gaze trajectory’ [40]. Gaze trajectories generally depend on the person, context and external factors [32]. Eye movements can be divided into three types: fixations, saccades and smooth pursuit [40]. Fixation occurs when the gaze may pause in a specific position voluntarily or involuntarily. Conversely, gaze moves from one to another position for a saccade.


The human gaze consists of a series of fixations and saccades in random order. In this work, we consider a small duration of eye movement from one position to another. Assume the zig-zag path (left of Fig. 1) is an example of a human gaze movement path. This path can be divided into three small sub-paths (AB, BC and CD), and we conduct our experiments on each such small sub-path. We use the term 'gaze trajectory' to refer to this simplified version of the gaze path (i.e. AB). The red points (P1, P2 and P3) on AB are the frames in the A-to-B sequence. Considering these as discrete, they can be split into 3-point subsets with a constraint: each subset should contain the start and end points, and the points should maintain the order of the trajectory sequence. We term such a 3-point set a 3-frame set. For example, {A, P1, B}, {A, P2, B} and {A, P3, B} are three 3-frame sets. Problem Statement. In the context of a video, the points (A, P1, P2, P3 and B) are frames in a specific trajectory order. Throughout this paper, we term A and B the start and end frames. Given the annotated start and end frames, the gaze trajectory sequence can be divided into small segments. If the segment duration is too small, the temporal coherence is high but uninformative; on the other hand, a longer duration could hinder the learning of meaningful representations due to its diversity. We observed that the average angular difference over a large time segment of approximately 80 frames is around 35°, whereas the smallest segments have an angular difference of less than 1°. Therefore, from a long sequence we mine 3-frame subsets of the gaze trajectory for learning meaningful representations. We work with two experimental settings: (a) 2-labels, when the start and end frame annotations are available, and (b) 1-label, when only the start frame annotation is available. Given a gaze trajectory similar to AB with labels for A and B, the objective of this work is to annotate the unlabelled frames P1, P2, P3, ..., Pn, where n is the number of intermediate frames. Notations. Suppose that we have a set of N '3-frame set' samples in a dataset $D = \{X_n, Y_{s_n}, Y_{e_n}\}_{n=1}^{N}$, where $X_n$ is the n-th 3-frame set consisting of start, middle/unlabelled and end frames $(f_s, f_{ul}, f_e)_n$, and $Y_{s_n}$ and $Y_{e_n}$ are the n-th start and end frame labels, respectively. Let us assume our model G with learnable parameters $\theta$ maps the input $X_n \in \mathbb{R}^{3 \times 100 \times 50 \times 3}$ to the relevant label spaces, i.e., $Y_{s_n} \in \mathbb{R}^3$, $Y_{ul_n} \in \mathbb{R}^3$ and $Y_{e_n} \in \mathbb{R}^3$. The mapping function is denoted as $G_\theta : X_n \rightarrow \{Y_{s_n}, Y_{ul_n}, Y_{e_n}\}$.
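A minimal sketch of the 3-frame set construction described above; the frame representation is abstract and the function name is ours:

```python
def mine_3frame_sets(trajectory_frames):
    # `trajectory_frames` is the ordered frame list [A, P1, ..., Pn, B] of one
    # gaze trajectory, where only A (start) and B (end) carry annotations.
    # Each returned triplet keeps the labelled endpoints plus one intermediate
    # frame, preserving the trajectory order: {A, P1, B}, {A, P2, B}, ...
    start, end = trajectory_frames[0], trajectory_frames[-1]
    return [(start, middle, end) for middle in trajectory_frames[1:-1]]

# Example: a trajectory with three intermediate frames yields three 3-frame sets.
sets = mine_3frame_sets(["A", "P1", "P2", "P3", "B"])
assert sets == [("A", "P1", "B"), ("A", "P2", "B"), ("A", "P3", "B")]
```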

4 Gaze Labelling Framework

4.1 Architectural Overview of '2-Labels'

The overview of the proposed framework is shown in Fig. 2. Given a 3-frame set $X_t: \{f_s, f_{ul}, f_e\}$, we define an encoder E which maps the input $X_t$ to the latent space $Z_t: \{Z_s, Z_{ul}, Z_e\}$, where $Z_s \in \mathbb{R}^{2048}$, $Z_{ul} \in \mathbb{R}^{2048}$ and $Z_e \in \mathbb{R}^{2048}$ (refer to Fig. 2, left). After the $E: X_t \rightarrow Z_t$ mapping, two motion features $M_{s\text{-}ul}$ and $M_{ul\text{-}e}$ are extracted between the start-middle and middle-end frames. These motion features are concatenated with the latent embeddings. On top of this, Fully Connected (FC)


layers with 512 and 1024 nodes are appended before the prediction. Finally, the network predicts $Y_s^p$, $Y_{ul}^p$ and $Y_e^p$, the gaze predictions corresponding to the input frames $\{f_s, f_{ul}, f_e\}$. The backbone network is not architecture-specific, although we use VGG-16 [45] and ResNet-50 [20] for our experiments. The rationale behind the framework design is explained as follows. Identity Adaptation. The obvious usefulness of user adaptation of gaze calibration for AR and VR devices motivates us to design an identity-specific gaze labelling framework [43]. Thus, we purposefully select identity-specific 3-frame sets. At the same time, it is important for the framework to be able to learn the variations across a large number of subjects with different head poses, gaze directions, appearances, illuminations, image resolutions and many other configurations. Additionally, the framework should encode rich features relevant to the eye-region appearance, which is the most important factor for weakly supervised gaze labelling. Similar to recent studies [36,59], we use eye-region images as input for gaze inference. The 3-frame set is selected over a small temporal window in a video; since there is only a subtle change in the appearance of the eyes from one frame to another, the latent representations of the images also change minimally. At a conceptual level, we are motivated by the smoothness constraint in optical flow algorithms. We compute the cosine distance between the start-middle and middle-end pairs when calculating the consistency loss, which helps to preserve identity-specific features across the 3-frame set. Motion Feature. At first glance, the task of predicting gaze from sparsely labelled data may seem overly challenging. However, given a 3-frame set of a subject over a very small time duration, there is a high correspondence between frames A, Pi and B, where i = {1, 2, 3}. Under this constraint, the objective reduces to modelling the head and eye motion information to bridge the gap, which motivates us to use motion features for sequence modelling. Similar to [7] in the pose estimation domain, we encode weak inter-frame motion by computing the $\ell_1$ distance between two consecutive frames in the latent space. We define the motion feature as $M_{s\text{-}ul} \oplus M_{ul\text{-}e}$, where $M_{s\text{-}ul} = Z_s - Z_{ul}$ and $M_{ul\text{-}e} = Z_{ul} - Z_e$ represent the motion features between the start-middle and middle-end frames, respectively. We then use this feature to estimate the gaze direction along the given trajectory. To train the above network, we use the following loss functions. Regression Loss. For each 3-frame set $X_t$, the start and end frames are annotated in the 2-labels setting. These annotations provide strong supervisory information for predicting the gaze of the middle, unlabelled frame: they characterize an arbitrary gaze trajectory, and the unlabelled middle frame lies between the start and end frames on that trajectory, so it belongs to the same distribution as the start and end frames. The regression loss is defined as $l_{reg} = MSE(Y_s, Y_s^p) + MSE(Y_e, Y_e^p)$, where MSE is the Mean Squared Error and $Y_s$, $Y_s^p$, $Y_e$ and $Y_e^p$ are the start label, predicted start label, end label and predicted end label, respectively.


Fig. 2. Overview of ‘2-labels’ and ‘1-label’ network pipelines. The frameworks take 3-frame set as input and learn to interpolate intermediate labels via weak supervision. Refer Sect. 4 for more details.

Consistency Loss. The main aim of the consistency loss is to maintain consistency between the latent and label spaces. Here, the latent space is denoted Z and the label space is the output space of the '2-labels' framework. In the latent space, let the distance between $Z_s$ and $Z_{ul}$ be $d^z_{s\text{-}ul}$ and the distance between $Z_{ul}$ and $Z_e$ be $d^z_{ul\text{-}e}$. Similarly, in the label space, let the distance between $Y_s^p$ and $Y_{ul}^p$ be $d^y_{s\text{-}ul}$ and the distance between $Y_{ul}^p$ and $Y_e^p$ be $d^y_{ul\text{-}e}$. According to our hypothesis, the distances from the start to the unlabelled frame and from the unlabelled to the end frame remain consistent across the two spaces. The loss is defined as $l_{consistency} = |d^y_{s\text{-}ul} - d^z_{s\text{-}ul}| + |d^y_{ul\text{-}e} - d^z_{ul\text{-}e}|$. Here, the cosine distance is used. The rationale behind this choice is as follows: the cosine distance is applied pairwise to utilize the partial annotations and bridge the gap between labelled and unlabelled frames; it leverages identity-specific information across the 3-frame set, and it captures motion-similarity information, which indirectly encodes the relative ordering of the frames. The angular distance $d_{a\text{-}b}$ between frames a and b is defined as $d_{a\text{-}b} = \frac{a \cdot b}{\|a\|_2 \|b\|_2}$, where a and b are latent- or label-space embeddings of the start, unlabelled and end frames, and '·' denotes the dot product. Following [5], this term is computed in two stages: first, the embeddings are $L_2$-normalized (i.e. $a/\|a\|_2$), which maps the d-dimensional latent embedding onto a unit hyper-sphere, where cosine similarity/distance and the dot product are equivalent. As the human gaze follows a spatio-temporal trajectory [40], the 3-frame sets also satisfy their relative trajectory ordering; this condition is enforced by the consistency loss, as it carries the frames' relative positions with respect to the middle unlabelled frame from the label space into the latent space.

Overall Loss Function. The final loss function is defined as

$Loss = \lambda_1 l_{reg} + \lambda_2 l_{consistency}$    (1)

It includes the regression and consistency losses, where $\lambda_{1,2}$ are regularization parameters. In the next framework, we further reduce the label requirement (by removing the need for the end frame's label) with minimal impact on performance.
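A minimal PyTorch-style sketch of the '2-labels' objective assembled from the pieces above (Eq. 1). Tensor shapes, the batch reduction and the variable names are assumptions of this sketch, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    # Embeddings are L2-normalised first, so the dot product equals the
    # cosine similarity used as the pairwise distance d_{a-b}.
    return torch.sum(F.normalize(a, dim=-1) * F.normalize(b, dim=-1), dim=-1)

def two_labels_loss(z_s, z_ul, z_e, y_s_pred, y_ul_pred, y_e_pred,
                    y_s, y_e, lam1=1.0, lam2=1.0):
    # (Motion features M_{s-ul} = z_s - z_ul and M_{ul-e} = z_ul - z_e are
    # concatenated with the embeddings inside the network itself; only the
    # loss terms are sketched here.)
    # Regression loss on the two annotated frames.
    l_reg = F.mse_loss(y_s_pred, y_s) + F.mse_loss(y_e_pred, y_e)
    # Consistency: pairwise distances should agree between the latent space
    # and the label space for (start, middle) and (middle, end).
    l_con = (torch.abs(cosine_distance(y_s_pred, y_ul_pred) -
                       cosine_distance(z_s, z_ul)) +
             torch.abs(cosine_distance(y_ul_pred, y_e_pred) -
                       cosine_distance(z_ul, z_e))).mean()
    return lam1 * l_reg + lam2 * l_con
```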

4.2 Architectural Overview of '1-Label'

The network architecture of the '1-label' network is shown in Fig. 2 (right). As in the '2-labels' network, the motion features and identity-specific appearance are leveraged for gaze estimation. The main motivation for moving from the 2-labels to the 1-label architecture is to learn rich features with an even smaller number of total annotations in a dataset. In the '1-label' architecture we have an additional decoder module $D(Z_e; \phi_D)$ consisting of an FC layer. The decoder D, parameterized by $\phi_D$, maps the penultimate-layer features ($\in \mathbb{R}^{1024}$) to latent embeddings $Z_e^r \in \mathbb{R}^{2048}$. The rationale behind this remapping is to add a constraint that can enhance the performance of the network by encoding a meaningful representation of $f_e$; an $\ell_2$ loss is computed between $Z_e$ and $Z_e^r$. As with the 2-labels architecture, the backbone network is not architecture-specific, although we use VGG-16 [45] and ResNet-50 [20] as backbones in our experiments. To train this network, we use the following loss functions. Regression Loss. As in the 2-labels architecture, we compute the regression loss for the start label: $l_{reg} = MSE(Y_s, Y_s^p)$, where MSE is the mean squared error and $Y_s$ and $Y_s^p$ are the start label and predicted start label, respectively. Consistency Loss. As in 2-labels, the 1-label technique uses the consistency loss $l_{consistency} = |d^y_{s\text{-}ul} - d^z_{s\text{-}ul}|$, where $d^y_{s\text{-}ul}$ and $d^z_{s\text{-}ul}$ are the distances between the start and unlabelled frames in the label and latent spaces, respectively. Similar Distribution. We leverage the constraint that the gaze information belongs to a similar distribution. Given a 3-frame set $X_t$, the output gaze should follow a specific distribution, so we compute the Kullback-Leibler (KL) divergence between $Y_{true}$ and $Y_{pred}$: $loss = Y_{true} \log\frac{Y_{true}}{Y_{pred}}$. The objective term $l_{divergence}$ minimizes the divergence between the gaze information of the start and unlabelled frames: $l_{divergence} = KL(Y_s, Y_{ul})$. Embedding Loss. This is the $\ell_2$ loss computed between $Z_e$ and $Z_e^r$, which amounts to predicting future embeddings from prior knowledge. Overall Loss Function. The final loss function for '1-label' is an ensemble of the regression, consistency, similar-distribution and embedding losses, where $\lambda_1 \cdots \lambda_4$ are regularization parameters:

$Loss = \lambda_1 l_{reg} + \lambda_2 l_{consistency} + \lambda_3 l_{divergence} + \lambda_4 l_{embedding}$    (2)
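For the two additional '1-label' terms, a minimal sketch is given below. Turning the 3-D gaze outputs into distributions with a softmax is an assumption of this sketch (the exact construction of the distributions is not spelled out here), and the variable names are ours:

```python
import torch
import torch.nn.functional as F

def one_label_extra_losses(y_s_pred, y_ul_pred, z_e, z_e_rec, eps=1e-8):
    # Similar-distribution term: KL divergence between the (softmaxed) gaze
    # of the start frame and of the unlabelled frame.
    p = F.softmax(y_s_pred, dim=-1)
    q = F.softmax(y_ul_pred, dim=-1)
    l_div = torch.sum(p * torch.log((p + eps) / (q + eps)), dim=-1).mean()
    # Embedding term: l2 loss between the end-frame latent z_e and its
    # reconstruction z_e_rec produced by the decoder D.
    l_emb = F.mse_loss(z_e_rec, z_e)
    return l_div, l_emb
```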

5 Experiments

On Benchmark Datasets. We validate the proposed methods on four benchmark datasets: CAVE, TabletGaze, MPII and Gaze360. These datasets are collected in fully constrained (CAVE), less constrained (TabletGaze and MPII) and unconstrained (Gaze360) environments.


Table 1. Comparison of benchmark dataset statistics, showing the amount of data in the original dataset and in the derived data (mined 3-frame sets).

Dataset          Original dataset  Derived dataset
CAVE [46]        5,880             3,024
MPII [63]        213,659           32,751
TabletGaze [21]  1,428 min         108,524
Gaze360 [24]     172,000           197,588

Automatic 3-frame Set Mining. The first step in implementing our proposed method is 3-frame set mining. To generate 3-frame sets, we perform dataset-specific pre-processing, as the datasets are collected in different setups. Please note that we require ground truth labels to define the gaze trajectories, especially for the CAVE and MPII datasets, where temporal information is not available. Moreover, the annotated subset required for weak supervision is a special subset that contains the start and end frames of the trajectories. Additionally, there is no temporal overlap between 3-frame sets for any of the datasets. Table 1 compares the dataset statistics, showing the amount of data in the original datasets and the derived data (mined 3-frame sets used for weak supervision). See the supplementary material for more details.

On Unlabelled 'in the Wild' YouTube Data. We also evaluate our method on 'in the wild' data, i.e. when expert/ground truth labels are not available. We leverage the two-eye symmetry property, i.e. the change in the relative position of the irises is symmetrical while scanning 3D space [10]. We collect approximately 400,000 frames from YouTube videos using this strategy; the details are provided in the supplementary material.

Experimental Settings. We define the following terminology for easy navigation in the upcoming sections: 1) Original Data (OD): the benchmark dataset's unaltered data; 2) Derived Data (DD): 3-frame set data mined from OD; 3) Original Labels (OL): the original ground-truth labels provided with OD; and 4) Predicted Labels (PL): labels predicted by the '2-labels' and '1-label' methods (i.e. $Y_{ul}^p$). We perform experiments with the following settings. 1) Validation w.r.t. Ground Truth Labels: we first apply 3-frame set mining to obtain the DD (refer to Table 1), split the DD into 80%-20% train-test splits without any identity overlap, and evaluate our proposed method against the OL. 2) Label Quality Assessment via State-of-the-Art Methods' Performance: we train the state-of-the-art methods [21,24,37] on PL and validate on OL to assess label quality. 3) Experiments with Different Data Partitions: we train state-of-the-art models [21,24,37] with different input data settings, namely a) the start and end frames of the 3-frame sets, b) 50% of the whole data and c) the newly labelled frames. 4) Ablation Studies: we conduct extensive ablation studies to show the importance of the loss functions, the motion feature, sequential modelling and the regularization parameters. 5) Gaze Labelling 'in the Wild': we collect 'in the wild' gaze data from YouTube videos with a Creative Commons licence and compare the label quality with various model-based techniques [22,50,55].


Table 2. Comparison of the '2-labels' and '1-label' techniques using MAE, CC and angular error. Both frameworks, with both VGG-16 and ResNet-50 backbones, are trained on 80% of the Derived Data (DD, i.e. mined 3-frame sets) and validated on the remaining 20% of the DD; this 80%-20% partition has no identity overlap. MAE: Mean Absolute Error, CC: Correlation Coefficient, TG: TabletGaze. Angular error is in degrees (TG: in cm).

                       VGG-16                        ResNet-50                     Angular error
                       2-labels      1-label         2-labels      1-label         2-labels  1-label
Dataset                MAE    CC     MAE    CC       MAE    CC     MAE    CC
CAVE                   0.29   0.85   0.53   0.21     0.25   0.90   0.43   0.25     3.05      3.30
MPII                   0.43   0.62   0.57   0.25     0.42   0.63   0.57   0.25     5.00      5.40
Gaze360                0.38   0.69   0.45   0.42     0.34   0.71   0.40   0.70     15.00     15.80
TabletGaze             0.49   0.54   0.50   0.58     0.47   0.55   0.49   0.55     2.27      2.61
YouTube 'in the wild'  0.27   0.90   0.36   0.81     0.24   0.92   0.34   0.86     9.41      12.07

Further, we performed additional experiments to evaluate the generalizability of the proposed method; see the supplementary material for the cross-dataset evaluation. Evaluation Metrics. For quantitative evaluation, we use Mean Absolute Error (MAE), Correlation Coefficient (CC) and Angular Error (in °). Following each database's evaluation protocol, we use 'leave-one-person-out' for MPII, cross-validation for CAVE and TabletGaze, and the train-val-test partitions for Gaze360. We randomly split the 3-frame sets into 80%-20% training and testing sets. For a fair comparison between datasets, we estimate the errors in normalized space, i.e. we apply min-max normalization to map the labels to the [0-1] range. Additionally, we compute the angular error (in °) except for the TabletGaze dataset, for which we compute the error in cm (similar to [21]). To compare with the state-of-the-art methods, we use the evaluation protocols of the respective studies. See the supplementary material for more details. Training Details. After the 3-frame set mining, we apply the Dlib face detector [44] for eye detection. If face detection fails, especially for the Gaze360 dataset, we use the cropped head images provided with the dataset (https://github.com/erkil1452/gaze360/tree/master/dataset); otherwise, we use the resized input image. For the backbone network, we choose the VGG-16 [45] and ResNet-50 [20] architectures. For training, we use the SGD optimizer with a learning rate of 0.001 and a decay of $1 \times 10^{-6}$ per epoch. The values of $\lambda_1 \cdots \lambda_4$ are set to 1. In each case, the models are trained for 1,000 epochs with batch size 32 and early stopping.
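For completeness, a small sketch of the evaluation utilities described above; the per-dimension min-max normalization is an assumption about how the [0-1] mapping is applied:

```python
import numpy as np

def angular_error_deg(g_pred, g_true):
    # Angular error (degrees) between two 3-D gaze vectors.
    cos = np.dot(g_pred, g_true) / (np.linalg.norm(g_pred) *
                                    np.linalg.norm(g_true) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def minmax_normalize(labels):
    # Map gaze labels to [0, 1] per dimension for the normalized-space
    # MAE/CC reported in Table 2.
    labels = np.asarray(labels, dtype=np.float64)
    lo, hi = labels.min(axis=0), labels.max(axis=0)
    return (labels - lo) / (hi - lo + 1e-8)
```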

6 Results

Gaze Labelling and Estimation Performance Comparison. To show the effectiveness of the proposed method, we evaluate on four benchmark datasets and on YouTube data. First, we apply 3-frame set mining to obtain the derived data (refer to Table 1). We then split the derived data randomly into train and test sets (train set: 80%, test set: 20%). Please note that the test set does not have



Table 3. Results of the re-trained state-of-the-art methods on the MPII and CAVE datasets in terms of angular error (°). OP = Original Predictions, i.e. the results reported in the original papers [37].

Dataset  Method          Alexnet  VGG 16  Pictorial Gaze [37]
MPII     OP              5.70     5.40    4.50
MPII     2-labels        5.90     5.79    4.70
MPII     1-label         6.30     6.20    4.90
MPII     Train:YouTube                    4.57
CAVE     OP              4.20     3.90    3.80
CAVE     2-labels        4.10     4.46    4.00
CAVE     1-label         4.55     4.84    4.36
CAVE     Train:YouTube                    3.90

Table 4. Results of the re-trained methods on the TabletGaze and Gaze360 datasets, in terms of error in cm and angular error in °, respectively. OP = Original Predictions (based on manual annotation), as reported in [21,24].

TabletGaze     mHOG+SVR [21]    Gaze360        Pinball LSTM [24]
OP             2.50             OP             13.50
2-labels       2.70             2-labels       14.40
1-label        3.10             1-label        17.20
Train:YouTube  2.30             Train:YouTube  12.80

Table 5. Performance comparison for different input data settings with the state-of-the-art methods on MPII, CAVE, TabletGaze (TG) and Gaze360. NL: Newly Labelled, SF: Start Frame, EF: End Frame.

Dataset    CAVE   TG     Gaze360  MPII
Method     [37]   [21]   [24]     [37]
(SF+EF)    7.14   8.60   22.10    6.10
50% data   5.50   4.50   18.70    5.40
2-labels   4.00   2.70   14.40    4.70
NL frames  4.30   3.00   14.90    4.30
1-label    4.36   3.10   17.20    4.90

identity overlap with training partition. The results are mentioned in Table 2 in terms of MAE, CC and angular error. We use VGG-16 and Resnet-50 as backbone networks to show the impact of different network architectures. Quantitatively, ResNet-50 performs slightly better than the VGG-16. From Table 2, it is also observed that ‘2-labels’ technique is closer to the original label distribution as compared to the ‘1-label’ technique due to the absence of supervisory signal (i.e. absence of end frame label). Due to high-resolution images in CAVE dataset, generated labels’ similarity as compared with original labels is high for


Table 6. Ablation study: effect of the loss functions for '2-labels' (Eq. 1) and '1-label' (Eq. 2). NA: Not Applicable.

Loss                                                                        MAE (2-labels)  MAE (1-label)
$l_{reg}$                                                                   0.98            0.99
$l_{reg} + l_{consistency}$                                                 0.42            0.76
$l_{reg} + l_{consistency} + l_{divergence}$ (for 1-label)                  NA              0.58
$l_{reg} + l_{consistency} + l_{divergence} + l_{embedding}$ (for 1-label)  NA              0.57

'2-labels' setting. Rather than relying on sequential modelling, our loss-imposed gaze estimation method improves model performance.

Comparison with State-of-the-Art Methods. We also evaluate the quality of the labels generated by our proposed methods. For this purpose, we train existing state-of-the-art methods [12,21,24,37] with the labels predicted by our method, and measure their performance against the original ground truth labels. The comparison is reported in Tables 3 and 4. For [24,37], we use the authors' GitHub implementations [1]. It is observed that '2-labels' performs better than '1-label' on all the datasets. For the CAVE dataset, the 'gazemap'-based method [37] in the '2-labels' setting (i.e. 4.08°) performs better than the other settings. It should be noted that the results of the re-trained methods are comparable to those obtained when training with the original labels. Using less than 5% labelled data (4.67% for MPII and 2.38% for Gaze360), the labels generated by our weakly-supervised method perform favourably when evaluated with state-of-the-art methods [24,37], which shows the usefulness of our weakly-supervised approach to label generation.

Re-training State-of-the-Art Methods with Subsets of Data. To further validate the generated labels, we re-train [21,24,37] from scratch using the following labelled sets independently: a) the start and end frames of the 3-frame sets only, b) 50% of the originally labelled training data, c) frames with labels generated by the 2-labels method, d) frames with labels generated by the 2-labels method excluding the start and end frames, and e) frames with labels generated by the 1-label method. The results are shown in Table 5. When we train the networks with only the start and end frames, the error is high, as these frames make up less than 10% of the whole dataset. When we use 50% of the whole labelled data, the results improve significantly. Similarly, with the newly labelled frames, the error is lower than in the above two settings; note that in the newly labelled case the training is performed on 90-95% of the training data. These results validate the quality of the labels generated by our methods.

Ablation Studies

Impact of Loss Function. We progressively integrate the different parts of our method. We assess the impact of each loss term mentioned in Eq. 1 and Eq. 2


by considering them one at a time during learning on the MPII dataset. The results are shown in Table 6. For '2-labels' and '1-label', if only the regression loss is used, the error is high for both techniques (0.98 and 0.99); we argue that this is due to the lack of domain knowledge. Adding the consistency loss reduces the error significantly for both settings (0.42 and 0.76). Adding the KL-divergence-based loss to the '1-label' framework further reduces the error from 0.76 to 0.58. Finally, the embedding loss is introduced to enforce future-frame consistency, though we note that for MPII the change in MAE is very small (from 0.58 to 0.57). These experiments clearly establish the individual importance of each of the proposed loss terms in our framework.

Impact of Motion Feature. To judge the impact of the motion feature, we evaluate the performance of '2-labels' on the CAVE dataset without the motion feature, i.e. with the latent-space features directly concatenated (refer to Fig. 2). The results show that with the motion feature the MAE is reduced from 0.34 to 0.25; thus, it is important to include this information in the framework.

Sequential Modelling vs '2-labels'. We replace the whole pipeline with an LSTM module with 3 steps, whose input is the 3-frame set and which estimates the gaze of the unlabelled frame. The resulting MAE is quite high (0.95) compared to our ResNet-50-based '2-labels' framework (0.25). A possible reason for our framework's advantage is that the task-relevant losses are imposed over very short temporal windows.

Gaze Labelling 'in the Wild'. The best results on the collected data in terms of MAE and CC are reported in the last row of Table 2. The angular errors w.r.t. SLERP [39] for '2-labels' and '1-label' are 9.41° and 12.07°, although SLERP is a weak baseline for comparing eye gaze in an unconstrained environment. When we use the generated labels on the YouTube dataset to complement the other datasets, the performance improves for TabletGaze and Gaze360 (see Tables 3 and 4). For adapting the SOTA methods (Tables 3 and 4), we fine-tune the models following the standard protocol [10,56]. The results suggest that the network learns a meaningful representation. Moreover, we compare our method with model-based approaches [22,50,55]; as shown in Table 7, our proposed method outperforms the model-based methods.

Table 7. Comparison with model-based methods on YouTube data. MAE: Mean Absolute Error, AE: Angular Error.

Method    MAE   AE
2-labels  0.24  9.41
[50]      0.74  15.30
[55]      0.52  13.90
[22]      0.57  14.01

Qualitative Analysis. Figure 3 illustrates a few examples of gaze trajectories along with the predictions of the proposed method. The trajectories consist of start, unlabelled and end frames, and the trajectory length is not limited to 3. During training, the proposed method takes the start, the end and one of the unlabelled frames as input. Please note that the eye patches cropped from the facial images are used as input. In the '2-labels' case, the ground truth labels of


Fig. 3. Few examples of gaze trajectories along with qualitative prediction results. Here, red and green arrow represent predicted and ground truth gaze direction, respectively. (Color figure online)

both the start and end frames are provided for weak supervision. In Fig. 3, the green and red arrows indicate the ground truth and predicted gaze directions, respectively. For better understanding, we plot the gaze direction for each eye originating from the detected pupil centre. This qualitative analysis indicates that our weakly supervised method learns to interpolate gaze labels efficiently from the terminal frames of a trajectory.

7 Limitations, Conclusion and Future Work

Our study introduces a weakly-supervised approach for generating labels for the intermediate frames of a defined gaze trajectory. The proposed frameworks leverage task-specific domain knowledge, i.e. trajectory ordering, motion and appearance features. With extensive experiments, we show that the labels generated by our methods are comparable to the ground truth labels. Further, we show that existing state-of-the-art techniques re-trained with the labels generated by our method give comparable performance. This implies that with just 1%-5.6% labelled data (depending on the dataset), training can be performed with performance comparable to when 100% of the training data is available. Further, we also propose a technique to collect and label eye gaze 'in the wild'. The proposed method can be used for other computer vision based applications (e.g.


gaze tracking devices for AR and VR) without the prior need of having to use the whole labelled dataset during training. Main limitations of our study are the fixed gaze trajectory requirement, inclusion of eye-blink and near frontal face. In the future, we will investigate subject specific gaze estimation in challenging situations such as low-resolution images, uneven illumination conditions and extreme head-poses. Although our methods consider gaze annotation in few of the aforementioned diverse conditions, it would be interesting to have more in-depth study in this domain.

