Pattern Recognition and Computer Vision: 6th Chinese Conference, PRCV 2023, Xiamen, China, October 13–15, 2023, Proceedings, Part VI (Lecture Notes in Computer Science) 9819985366, 9789819985364

The 13-volume set LNCS 14425-14437 constitutes the refereed proceedings of the 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023, held in Xiamen, China, during October 13-15, 2023.


English Pages 540 [534] Year 2024


Table of contents :
Preface
Organization
Contents – Part VI
Computational Photography, Sensing and Display Technology
RSID: A Remote Sensing Image Dehazing Network
1 Introduction
2 Related Work and Model
2.1 Related Work
2.2 Model
3 Experiment
3.1 Datasets
3.2 Evaluation and Details
4 Result
4.1 Model Analysis
4.2 Real Remote Sensing Image Dehazing Analysis
5 Conclusion
References
ContextNet: Learning Context Information for Texture-Less Light Field Depth Estimation
1 Introduction
2 Related Work
2.1 Traditional Methods
3 Method
3.1 Overview
3.2 Feature Extraction and AugSPP Module
3.3 Sub-pixel Cost Volume
3.4 Cost Aggregation and Disparity Regression
4 Experiments
4.1 Datasets and Implementation Details
4.2 Comparison of State-of-the-Art Methods
4.3 Ablation Study
5 Conclusion and Limitations
References
An Efficient Way for Active None-Line-of-Sight: End-to-End Learned Compressed NLOS Imaging
1 Introduction
2 Sensing Methods
2.1 Forward Model for Active NLOS Imaging
2.2 Forward Model Combined with Compressed Sensing
3 Network Architecture
3.1 Decompressor
3.2 Encoder and Decoder
3.3 Learnable Module for Weights of Masks
4 Evaluation and Results
4.1 Dataset
4.2 Training Strategies
4.3 Reconstruction of the Proposed End-to-End Network
5 Ablation and Discussion
5.1 Ablation Study
5.2 Big Noise
5.3 Extremely Low CR
6 Conclusion
7 Supplementary Material
References
DFAR-Net: Dual-Input Three-Branch Attention Fusion Reconstruction Network for Polarized Non-Line-of-Sight Imaging
1 Introduction
2 Related Work
2.1 Passive NLOS Imaging
2.2 Polarimetric NLOS Imaging
3 Method
3.1 Network Architecture
3.2 Encoder/Decoder
3.3 Split Kernel Channel Attention Mechanism (SKAM)
3.4 CBCF Sub-Net
3.5 Loss Function
4 Experiment
4.1 Experimental Setup
4.2 Experimental Results
4.3 Ablation Study
5 Conclusion
References
EVCPP: Example-Driven Virtual Camera Pose Prediction for Cloud Performing Arts Scenes
1 Introduction
2 Related Work
3 Methods
3.1 Human Pose Estimation Module
3.2 Cinematic Feature Extraction Module
3.3 Camera Behavior Extraction Module
3.4 Camera Pose Prediction Module
3.5 Object Function
4 Experiment and Results
4.1 Experiment Details
4.2 Experimental Results
4.3 Visualization Results
5 Conclusions
References
RBSR: Efficient and Flexible Recurrent Network for Burst Super-Resolution
1 Introduction
2 Related Work
3 Method
3.1 Problem Formation and Pipeline
3.2 Recurrent Fusion Module
3.3 Implicit Weighting Loss
4 Experiments
4.1 Experimental Settings
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Study
5 Conclusion
References
WDU-Net: Wavelet-Guided Deep Unfolding Network for Image Compressed Sensing Reconstruction
1 Introduction
2 The Proposed Method
2.1 Problem Formulation
2.2 Wavelet-Guided Deep Unfolding Network (WDU-Net)
2.3 Loss Function
3 Experimental Results
3.1 Implementation Details
3.2 Comparison with the Latest Interpretable CS Methods
3.3 Ablation Study
4 Conclusion
References
Video Analysis and Understanding
Memory-Augmented Spatial-Temporal Consistency Network for Video Anomaly Detection
1 Introduction
2 Related Work
3 Method
3.1 Overall Framework
3.2 Integration Module
3.3 Spatial-Temporal Memory Fusion Module
3.4 Dissociation Module
3.5 Training Loss
3.6 Anomaly Detection
4 Experiments
4.1 Datasets and Evaluation Metrics
4.2 Implementation Details
4.3 Ablation Study
4.4 Comparison with State-of-the-Art Methods
4.5 Visualization Analysis
5 Conclusion
References
Frequency and Spatial Domain Filter Network for Visual Object Tracking
1 Introduction
2 Related Work
3 Proposed Method
3.1 Overall Architecture
3.2 Global and Local Context Enhancement
4 Experiments
4.1 Implementation Details
4.2 Analysis of the Proposed Method
4.3 State-of-the-Art Comparison
5 Conclusions
References
Enhancing Feature Representation for Anomaly Detection via Local-and-Global Temporal Relations and a Multi-stage Memory
1 Introduction
2 Proposed Method
2.1 Local-and-Global Temporal Relations Modeling
2.2 Multi-stage Memory-Augmented Feature Discrimination Learning and Classifier Learning
3 Experiments
3.1 Experimental Setup
3.2 Results on ShanghaiTech and UCF-Crime
3.3 Ablation Study
3.4 Visual Analysis
4 Conclusions
References
DFAformer: A Dual Filtering Auxiliary Transformer for Efficient Online Action Detection in Streaming Videos
1 Introduction
2 Related Work
3 Methodology
3.1 Notational Definition
3.2 Encoder
3.3 Decoder
3.4 Auxiliary Task — Jaccard Summary Unit
3.5 Training and Inference
4 Experiments
4.1 Dataset and Setup
4.2 Compared with State-of-the-Art Methods
4.3 Ablation Studies
5 Conclusion
References
Relation-Guided Multi-stage Feature Aggregation Network for Video Object Detection
1 Introduction
2 Related Work
2.1 Pixel-Based Aggregation
2.2 Region-Based Aggregation
3 Proposed Method
3.1 Overview
3.2 Multi-stage Feature Aggregation Framework
3.3 Multi-sources Feature Aggregation
3.4 Temporal Relation-Guided Aggregation Module
4 Experiments and Discussions
4.1 Dataset and Experimental Setting
4.2 Quantitative Analysis
4.3 Qualitative Analysis
4.4 Parameter Setting Analysis and Ablation Experiment
5 Conclusion
References
Multimodal Local Feature Enhancement Network for Video Summarization
1 Introduction
2 Related Work
2.1 Single Vision Modality
2.2 Multimodal Video Summarization
3 Method
3.1 Fused Multi-head Attention
3.2 Local&Global Multi-head Attention
3.3 Predicted Scores and Generated Summary
4 Experiments
4.1 Datasets and Metrics
4.2 Implementation Details
4.3 Comparisons with the State of the Art
4.4 Ablation Study and Sensitivity Analysis
5 Conclusion
References
Asymmetric Attention Fusion for Unsupervised Video Object Segmentation
1 Introduction
2 Related Work
2.1 UVOS Based on Optical Flow
3 Methodology
3.1 Optical Flow and Encoder
3.2 Asymmetric Attention Fusion Module
3.3 Feature Correction Module
3.4 Bidirectional Purification Module and Decoder
3.5 Learning Objectives
4 Experiments
4.1 Experimental Scheme
4.2 Quantitative and Qualitative Results
4.3 Ablation Study
5 Conclusion
References
Flow-Guided Diffusion Autoencoder for Unsupervised Video Anomaly Detection
1 Introduction
2 Related Work
2.1 Video Anomaly Detection
2.2 Diffusion Models
3 Method
3.1 Mixed Gaussian Clustering Network
3.2 Conditional Diffusion AutoEncoder
3.3 Sample Refinement
4 Experiments
4.1 Implementation Details
4.2 Anomaly Score
4.3 Comparison with Existing Methods
4.4 Ablation Study
5 Conclusion
References
Prototypical Transformer for Weakly Supervised Action Segmentation
1 Introduction
2 Related Work
3 Preliminaries
3.1 Notations
3.2 Dynamic Time Warping
3.3 Attention Mechanism
4 Prototypical Transformer
4.1 Video Encoder
4.2 Action Ordering Prediction
4.3 Segmentation
4.4 Training
5 Experiments
5.1 Datasets and Settings
5.2 Main Results
5.3 Ablation Study
6 Conclusions
References
Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization
1 Introduction
2 Related Works
2.1 Audio Visual Event Localization
2.2 Attention Mechanism
3 Method
3.1 Problem Notations
3.2 Unimodal-Multimodal Event-Aware Interaction
3.3 Collaborative Unimodal Event Filter
3.4 Dual-Interactive Multimodal Event Classification
3.5 Localization and Objective Function
4 Experiments and Analysis
4.1 Experiment Setup
4.2 Comparisons with State-of-the-Art Methods
4.3 Ablation Studies
4.4 Qualitative Analysis
5 Conclusion
References
Dual-Memory Feature Aggregation for Video Object Detection
1 Introduction
2 Related Work
2.1 Image Object Detection
2.2 Video Object Detection
3 Method
3.1 Preliminary
3.2 Dual-Memory Feature Aggregation
3.3 Contrastive Temporal Correlation
4 Experiments
4.1 Dataset and Metric
4.2 Implementation Details
4.3 Main Results
4.4 Ablation Study
5 Conclusion
References
Going Beyond Closed Sets: A Multimodal Perspective for Video Emotion Analysis
1 Introduction
2 Related Work
3 Method
3.1 Multimodal Matching Framework
3.2 Emotion Prompt
3.3 Model Training
4 Experiments
4.1 Experiment Setup
4.2 Results on Video Emotion Classification
4.3 Visualization of the Representation Space
4.4 Video-Emotional Experience Matching
5 Conclusion
References
Temporal-Semantic Context Fusion for Robust Weakly Supervised Video Anomaly Detection
1 Introduction
2 Related Work
3 Proposed Method
3.1 Problem Statement
3.2 Architecture
3.3 Temporal-Semantic Context Fusion Module
3.4 Model Training
4 Experiments
4.1 Datasets and Evaluation Measure
4.2 Implementation Details
4.3 Experiment Results and Discussions
4.4 Ablation Study
5 Conclusion
References
A Survey: The Sensor-Based Method for Sign Language Recognition
1 Introduction
2 Sensor Modality
2.1 Kinect
2.2 Flexible Sensor
2.3 Electromyographic Sensor
2.4 Force Myography Sensor
2.5 Inertial Sensor
2.6 Photoplethysmography Sensor
2.7 Radio Frequency Sensor
3 Other Categories
3.1 Dataset
3.2 Evaluation
3.3 Appearance
4 Discussion
4.1 Material
4.2 Heterogeneous SLR
4.3 Combination of Sensor and Vision
5 Conclusion
References
Utilizing Video Word Boundaries and Feature-Based Knowledge Distillation Improving Sentence-Level Lip Reading
1 Introduction
2 Related Work
2.1 Sentence-Level Lip Reading
2.2 Video Word Boundary
2.3 Knowledge Distillation
3 Methods
3.1 Transformer-Based Lip Reading Model Merging Video Word Boundaries (TLiM-VWB)
3.2 Lip Reading Model Training Method Utilizing Video Word Boundaries by Knowledge Distillation (LiM-VWB-KD)
4 Experiment
4.1 Dataset
4.2 Evaluation Metrics
4.3 Implementation
4.4 Results
5 Conclusion
References
Denoised Temporal Relation Network for Temporal Action Segmentation
1 Introduction
2 Related Work
2.1 Action Segmentation
2.2 Uncertainty and Evidence Theory
3 Technical Approach
3.1 Overview
3.2 Epistemic Uncertainty Prediction
3.3 Temporal Reasoning Module
4 Experiments
4.1 Datasets and Evaluation Metrics
4.2 Implementation Details
4.3 Compared with State-of-the-Art Methods
4.4 Ablation Study
5 Conclusions
References
Vision Applications and Systems
3D Lightweight Spatial-Spectral Attention Network for Hyperspectral Image Classification
1 Introduction
2 Proposed Method
2.1 3D-LSSAN Structure
2.2 Spatial-Spectral Attention Unit
2.3 3D Large Kernel Attention Model
3 Experimental Results
3.1 Description of Datasets
3.2 Discussion of Hyperparameters
3.3 Classification Results
4 Conclusion
References
Deepfake Detection via Fine-Grained Classification and Global-Local Information Fusion
1 Introduction
2 Methodology
2.1 Fine-Grained Classification Mechanism
2.2 Global-Local Information Fusion Architecture
2.3 Global Center Loss
3 Experiments
3.1 Experimental Settings
3.2 Performance Evaluations
3.3 Analysis
4 Conclusion
References
Unsupervised Image-to-Image Translation with Style Consistency
1 Introduction
2 Related Work
3 Proposed Method
3.1 First Stage: Training of Content Encoder
3.2 Second Stage: Training of Generator
4 Experiment
4.1 Setting
4.2 Experimental Results
4.3 Limitations
5 Conclusion
References
SemanticCrop: Boosting Contrastive Learning via Semantic-Cropped Views
1 Introduction
2 Related Work
2.1 Contrastive Learning
2.2 Positive Sample Generation
3 Proposed Method
3.1 Preliminaries
3.2 Heatmap Localization
3.3 Center-Suppressed Probabilistic Sampling
4 Experiments
4.1 Datasets
4.2 Comparative Methods
4.3 Implementation Details
4.4 Experiment Results
4.5 Ablation Experiment
5 Conclusion
References
Transformer-Based Multi-object Tracking in Unmanned Aerial Vehicles
1 Introduction
2 Related Work
3 Multi-object Tracking Based on Transformer
3.1 Transformer-Based Detector
3.2 Tracking Algorithm Combining Spatial Attributes
4 Experiment and Analysis
4.1 Data Set and Evaluation Indicators
4.2 Experiments and Simulations
4.3 Ablation Experiments
4.4 Comparison With Other Algorithms
5 Conclusion
References
HEI-GAN: A Human-Environment Interaction Based GAN for Multimodal Human Trajectory Prediction
1 Introduction
2 Related Works
2.1 Interaction Models
2.2 Multimodal Modeling
3 Method
3.1 Problem Definition
3.2 HEI-GAN
3.3 Trajectory Prediction System
4 Experiment
4.1 Datasets and Metrics
4.2 Comparison with Existing Works
4.3 Ablation Study
4.4 Qualitative Results
5 Conclusion
References
CenterMatch: A Center Matching Method for Semi-supervised Facial Expression Recognition
1 Introduction
2 Background
3 CenterMatch
3.1 Class Center
3.2 Adaptive Class Threshold
3.3 Loss Function
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Ablation Study
4.4 Quantitative and Qualitative Analysis
4.5 Comparison with State-of-the-Art Methods
4.6 Visualization
5 Conclusion
References
Cross-Dataset Distillation with Multi-tokens for Image Quality Assessment
1 Introduction
2 Related Work
2.1 NR-IQA Based on CNN
2.2 NR-IQA Based on Vision Transformer
3 Methodology
3.1 Overview
3.2 Multi-tokens Image Quality Transformer
3.3 Image Quality Prediction
3.4 Attention Distillation Cross Datasets
4 Experiments
4.1 Benchmark Datasets and Evaluation Protocols
4.2 Implementation Details
4.3 Overall Prediction Performance Comparison
4.4 Generalization Capability Validation
4.5 Ablation Study
4.6 Visualization of Class Activation Map
4.7 Analysis of the Number of Class Tokens
5 Conclusion
References
Quality-Aware CLIP for Blind Image Quality Assessment
1 Introduction
2 Related Work
2.1 BIQA
2.2 CLIP Applications
3 Method
3.1 Overview of CLIP
3.2 QA-CLIP
4 Experiments
4.1 Datasets and Evaluation Protocols
4.2 Implementation Details
4.3 Overall Prediction Performance Comparison
4.4 Generalization Capability Validation
4.5 Ablation Study
5 Conclusion
References
Multi-agent Perception via Co-attentive Communication Mechanism
1 Introduction
2 Related Works
2.1 Multi-agent Communication
2.2 Information Fusion in Multi-agent Perception
3 Methods
3.1 Problem Formulation
3.2 Overview of the Proposed Framework
3.3 Co-attentive Scheduler
4 Experiment
4.1 Dataset and Experiment Settings
4.2 Quantitative Evaluation
4.3 Studies on Co-attentive Scheduler
5 Conclusion
References
DBRNet: Dual-Branch Real-Time Segmentation NetWork for Metal Defect Detection
1 Introduction
2 Method
2.1 Low-Params Feature Enhancement Module
2.2 Attention Flow-Semantic Fusion Module
2.3 Deep Connection Pyramid Pooling Module
3 Experiments
3.1 Datasets and Evaluation Metrics
3.2 Implementation Details
3.3 Comparison Experiments
3.4 Ablation Study
4 Conclusion
References
MaskDiffuse: Text-Guided Face Mask Removal Based on Diffusion Models
1 Introduction
2 Related Work
2.1 Image Inpainting with GANs
2.2 Image Inpainting with Diffusion Models
2.3 Text-to-Image Synthesis
3 Preliminaries: Denoising Diffusion Probabilistic Models
4 Method
4.1 Text-Guided Mask Removal
4.2 Conditional Resampling
5 Experiments and Applications
5.1 Experiment Settings
5.2 Comparison and Evaluation
5.3 Mask Removal for Face Recognition
5.4 Ablation Study
6 Conclusion
References
Image Generation Based Intra-class Variance Smoothing for Fine-Grained Visual Classification
1 Introduction
2 Related Work
2.1 Fine-Grained Visual Classification
2.2 Image Generation
2.3 Data Augmentation
3 Methods
3.1 SmoothGAN
3.2 Dual-Threshold-Filtering Strategy
3.3 Training Classifier with Augmented Data
4 Experimental Results and Discussions
4.1 Datasets
4.2 Implementation Details
4.3 Main Results
4.4 Ablation Study
5 Conclusion
References
Cross-Domain Soft Adaptive Teacher for Syn2Real Object Detection
1 Introduction
2 Related Works
3 Proposed Method
3.1 Definition of Terms
3.2 Domain-Adaption Augmentation
3.3 Feature Adversarial Learning
3.4 Mean Teacher and Soft Adaptive Teacher
3.5 Full Objective
4 Experiments
4.1 Sim10k to Cityscapes
4.2 Sim10k to BDD100K
4.3 RarePlanes to DOTA
4.4 Ablation Study on Components
5 Conclusion
References
Dynamic Graph-Driven Heat Diffusion: Enhancing Industrial Semantic Segmentation
1 Introduction
2 Related Work
2.1 Self-training Framework
3 Method
3.1 Industrial Semantic Segmentation
3.2 Dynamic Weight Graph by Heat Propagation
4 Experiments
4.1 Dataset Description
4.2 Implementation Details
4.3 Experimental Results
4.4 Evaluation on DSS
5 Conclusion
References
EKGRL: Entity-Based Knowledge Graph Representation Learning for Fact-Based Visual Question Answering
1 Introduction
2 Related Works
3 Methods
3.1 Main Idea
3.2 Preliminary
3.3 Representation Learning in Entity Space
4 PFVQA Dataset
4.1 Image and Psychological Facts
4.2 Construction of Q&A Set and Dataset Statistics
5 Results
5.1 Overall Results on FVQA
5.2 Overall Results on PFVQA
5.3 Ablation Study
6 Conclusions
References
Disentangled Attribute Features Vision Transformer for Pedestrian Attribute Recognition
1 Introduction
2 Related Works
2.1 Pedestrian Attribute Recognition
2.2 Disentangled Features Learning
3 Methods
3.1 Overview
3.2 Correlation Self-attention Module
3.3 Exclusive Self-attention Module
3.4 Random Sequence Attention Module
4 Experiments
4.1 Datasets
4.2 Implementation
4.3 Loss Function
4.4 Comparison with State-of-the-Art Methods
4.5 Ablation Study
5 Conclusion
References
A High-Resolution Network Based on Feature Redundancy Reduction and Attention Mechanism
1 Introduction
2 Related Work
3 Approach
3.1 G-HRNet
3.2 GA-HRNet
4 Experiments
4.1 COCO
4.2 COCO-WholeBody
4.3 Ablation Study
5 Conclusion
References
Author Index


LNCS 14430

Qingshan Liu · Hanzi Wang · Zhanyu Ma · Weishi Zheng · Hongbin Zha · Xilin Chen · Liang Wang · Rongrong Ji (Eds.)

Pattern Recognition and Computer Vision 6th Chinese Conference, PRCV 2023 Xiamen, China, October 13–15, 2023 Proceedings, Part VI

Lecture Notes in Computer Science
Founding Editors: Gerhard Goos, Juris Hartmanis

Editorial Board Members: Elisa Bertino, Purdue University, West Lafayette, IN, USA; Wen Gao, Peking University, Beijing, China; Bernhard Steffen, TU Dortmund University, Dortmund, Germany; Moti Yung, Columbia University, New York, NY, USA


The series Lecture Notes in Computer Science (LNCS), including its subseries Lecture Notes in Artificial Intelligence (LNAI) and Lecture Notes in Bioinformatics (LNBI), has established itself as a medium for the publication of new developments in computer science and information technology research, teaching, and education. LNCS enjoys close cooperation with the computer science R & D community, the series counts many renowned academics among its volume editors and paper authors, and collaborates with prestigious societies. Its mission is to serve this international community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings. LNCS commenced publication in 1973.

Qingshan Liu · Hanzi Wang · Zhanyu Ma · Weishi Zheng · Hongbin Zha · Xilin Chen · Liang Wang · Rongrong Ji Editors

Pattern Recognition and Computer Vision 6th Chinese Conference, PRCV 2023 Xiamen, China, October 13–15, 2023 Proceedings, Part VI

Editors
Qingshan Liu, Nanjing University of Information Science and Technology, Nanjing, China
Hanzi Wang, Xiamen University, Xiamen, China
Zhanyu Ma, Beijing University of Posts and Telecommunications, Beijing, China
Weishi Zheng, Sun Yat-sen University, Guangzhou, China
Hongbin Zha, Peking University, Beijing, China
Xilin Chen, Chinese Academy of Sciences, Beijing, China
Liang Wang, Chinese Academy of Sciences, Beijing, China
Rongrong Ji, Xiamen University, Xiamen, China

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-981-99-8536-4 ISBN 978-981-99-8537-1 (eBook) https://doi.org/10.1007/978-981-99-8537-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore Paper in this product is recyclable.

Preface

Welcome to the proceedings of the Sixth Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2023), held in Xiamen, China. PRCV is formed from the combination of two distinguished conferences: CCPR (Chinese Conference on Pattern Recognition) and CCCV (Chinese Conference on Computer Vision). Both have consistently been the top-tier conference in the fields of pattern recognition and computer vision within China’s academic field. Recognizing the intertwined nature of these disciplines and their overlapping communities, the union into PRCV aims to reinforce the prominence of the Chinese academic sector in these foundational areas of artificial intelligence and enhance academic exchanges. Accordingly, PRCV is jointly sponsored by China’s leading academic institutions: the Chinese Association for Artificial Intelligence (CAAI), the China Computer Federation (CCF), the Chinese Association of Automation (CAA), and the China Society of Image and Graphics (CSIG). PRCV’s mission is to serve as a comprehensive platform for dialogues among researchers from both academia and industry. While its primary focus is to encourage academic exchange, it also places emphasis on fostering ties between academia and industry. With the objective of keeping abreast of leading academic innovations and showcasing the most recent research breakthroughs, pioneering thoughts, and advanced techniques in pattern recognition and computer vision, esteemed international and domestic experts have been invited to present keynote speeches, introducing the most recent developments in these fields. PRCV 2023 was hosted by Xiamen University. From our call for papers, we received 1420 full submissions. Each paper underwent rigorous reviews by at least three experts, either from our dedicated Program Committee or from other qualified researchers in the field. After thorough evaluations, 522 papers were selected for the conference, comprising 32 oral presentations and 490 posters, giving an acceptance rate of 37.46%. The proceedings of PRCV 2023 are proudly published by Springer. Our heartfelt gratitude goes out to our keynote speakers: Zongben Xu from Xi’an Jiaotong University, Yanning Zhang of Northwestern Polytechnical University, Shutao Li of Hunan University, Shi-Min Hu of Tsinghua University, and Tiejun Huang from Peking University. We give sincere appreciation to all the authors of submitted papers, the members of the Program Committee, the reviewers, and the Organizing Committee. Their combined efforts have been instrumental in the success of this conference. A special acknowledgment goes to our sponsors and the organizers of various special forums; their support made the conference a success. We also express our thanks to Springer for taking on the publication and to the staff of Springer Asia for their meticulous coordination efforts.


We hope these proceedings will be both enlightening and enjoyable for all readers. October 2023

Qingshan Liu Hanzi Wang Zhanyu Ma Weishi Zheng Hongbin Zha Xilin Chen Liang Wang Rongrong Ji

Organization

General Chairs
Hongbin Zha, Peking University, China
Xilin Chen, Institute of Computing Technology, Chinese Academy of Sciences, China
Liang Wang, Institute of Automation, Chinese Academy of Sciences, China
Rongrong Ji, Xiamen University, China

Program Chairs
Qingshan Liu, Nanjing University of Information Science and Technology, China
Hanzi Wang, Xiamen University, China
Zhanyu Ma, Beijing University of Posts and Telecommunications, China
Weishi Zheng, Sun Yat-sen University, China

Organizing Committee Chairs
Mingming Cheng, Nankai University, China
Cheng Wang, Xiamen University, China
Yue Gao, Tsinghua University, China
Mingliang Xu, Zhengzhou University, China
Liujuan Cao, Xiamen University, China

Publicity Chairs
Yanyun Qu, Xiamen University, China
Wei Jia, Hefei University of Technology, China


Local Arrangement Chairs
Xiaoshuai Sun, Xiamen University, China
Yan Yan, Xiamen University, China
Longbiao Chen, Xiamen University, China

International Liaison Chairs
Jingyi Yu, ShanghaiTech University, China
Jiwen Lu, Tsinghua University, China

Tutorial Chairs
Xi Li, Zhejiang University, China
Wangmeng Zuo, Harbin Institute of Technology, China
Jie Chen, Peking University, China

Thematic Forum Chairs
Xiaopeng Hong, Harbin Institute of Technology, China
Zhaoxiang Zhang, Institute of Automation, Chinese Academy of Sciences, China
Xinghao Ding, Xiamen University, China

Doctoral Forum Chairs
Shengping Zhang, Harbin Institute of Technology, China
Zhou Zhao, Zhejiang University, China

Publication Chair
Chenglu Wen, Xiamen University, China

Sponsorship Chair
Yiyi Zhou, Xiamen University, China


Exhibition Chairs
Bineng Zhong, Guangxi Normal University, China
Rushi Lan, Guilin University of Electronic Technology, China
Zhiming Luo, Xiamen University, China

Program Committee
Baiying Lei, Shenzhen University, China
Changxin Gao, Huazhong University of Science and Technology, China
Chen Gong, Nanjing University of Science and Technology, China
Chuanxian Ren, Sun Yat-Sen University, China
Dong Liu, University of Science and Technology of China, China
Dong Wang, Dalian University of Technology, China
Haimiao Hu, Beihang University, China
Hang Su, Tsinghua University, China
Hui Yuan, School of Control Science and Engineering, Shandong University, China
Jie Qin, Nanjing University of Aeronautics and Astronautics, China
Jufeng Yang, Nankai University, China
Lifang Wu, Beijing University of Technology, China
Linlin Shen, Shenzhen University, China
Nannan Wang, Xidian University, China
Qianqian Xu, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China
Quan Zhou, Nanjing University of Posts and Telecommunications, China
Si Liu, Beihang University, China
Xi Li, Zhejiang University, China
Xiaojun Wu, Jiangnan University, China
Zhenyu He, Harbin Institute of Technology (Shenzhen), China
Zhonghong Ou, Beijing University of Posts and Telecommunications, China

Contents – Part VI

Computational Photography, Sensing and Display Technology

RSID: A Remote Sensing Image Dehazing Network . . . . . 3
Yuan Li and Yafeng Zhao

ContextNet: Learning Context Information for Texture-Less Light Field Depth Estimation . . . . . 15
Wentao Chao, Xuechun Wang, Yiming Kan, and Fuqing Duan

An Efficient Way for Active None-Line-of-Sight: End-to-End Learned Compressed NLOS Imaging . . . . . 28
Chen Chang, Tao Yue, Siqi Ni, and Xuemei Hu

DFAR-Net: Dual-Input Three-Branch Attention Fusion Reconstruction Network for Polarized Non-Line-of-Sight Imaging . . . . . 41
Hao Liu, Pengfei Wang, Xin He, Ke Wang, Shaohu Jin, Pengyun Chen, Xiaoheng Jiang, and Mingliang Xu

EVCPP: Example-Driven Virtual Camera Pose Prediction for Cloud Performing Arts Scenes . . . . . 53
Jucheng Qiu, Xiaoyu Wu, and Boshu Jia

RBSR: Efficient and Flexible Recurrent Network for Burst Super-Resolution . . . . . 65
Renlong Wu, Zhilu Zhang, Shuohao Zhang, Hongzhi Zhang, and Wangmeng Zuo

WDU-Net: Wavelet-Guided Deep Unfolding Network for Image Compressed Sensing Reconstruction . . . . . 79
Xinlu Wang, Lijun Zhao, Jinjing Zhang, Yufeng Zhang, and Anhong Wang

Video Analysis and Understanding

Memory-Augmented Spatial-Temporal Consistency Network for Video Anomaly Detection . . . . . 95
Zhangxun Li, Mengyang Zhao, Xinhua Zeng, Tian Wang, and Chengxin Pang

Frequency and Spatial Domain Filter Network for Visual Object Tracking . . . . . 108 Manqi Zhao, Shenyang Li, and Han Wang


Enhancing Feature Representation for Anomaly Detection via Local-and-Global Temporal Relations and a Multi-stage Memory . . . . . . . . . 121 Xuan Li, Ding Ma, and Xiangqian Wu DFAformer: A Dual Filtering Auxiliary Transformer for Efficient Online Action Detection in Streaming Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 Shicheng Jing and Liping Xie Relation-Guided Multi-stage Feature Aggregation Network for Video Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Tingting Yao, Fuxiao Cao, Fuheng Mi, and Danmeng Li Multimodal Local Feature Enhancement Network for Video Summarization . . . 158 Zhaoyun Li, Xiwei Ren, and Fengyi Du Asymmetric Attention Fusion for Unsupervised Video Object Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 Hongfan Jiang, Xiaojun Wu, and Tianyang Xu Flow-Guided Diffusion Autoencoder for Unsupervised Video Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Aoni Zhu, Wenjun Wang, and Cheng Yan Prototypical Transformer for Weakly Supervised Action Segmentation . . . . . . . . 195 Tao Lin, Xiaobin Chang, Wei Sun, and Weishi Zheng Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 Huilin Tian, Jingke Meng, Yuhan Yao, and Weishi Zheng Dual-Memory Feature Aggregation for Video Object Detection . . . . . . . . . . . . . . 220 Diwei Fan, Huicheng Zheng, and Jisheng Dang Going Beyond Closed Sets: A Multimodal Perspective for Video Emotion Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Hao Pu, Yuchong Sun, Ruihua Song, Xu Chen, Hao Jiang, Yi Liu, and Zhao Cao Temporal-Semantic Context Fusion for Robust Weakly Supervised Video Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 Yuan Zeng, Yuanyuan Wu, Jing Liang, and Wu Zeng A Survey: The Sensor-Based Method for Sign Language Recognition . . . . . . . . . 257 Tian Yang, Cong Shen, Xinyue Wang, Xiaoyu Ma, and Chen Ling


Utilizing Video Word Boundaries and Feature-Based Knowledge Distillation Improving Sentence-Level Lip Reading . . . . . . . . . . . . . . . . . . . . . . . . 269 Hongzhong Zhen, Chenglong Jiang, Jiyong Zhou, Liming Liang, and Ying Gao Denoised Temporal Relation Network for Temporal Action Segmentation . . . . . 282 Zhichao Ma and Kan Li Vision Applications and Systems 3D Lightweight Spatial-Spectral Attention Network for Hyperspectral Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297 Ziyou Zheng, Shuzhen Zhang, Hailong Song, and Qi Yan Deepfake Detection via Fine-Grained Classification and Global-Local Information Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 Tonghui Li, Yuanfang Guo, and Yunhong Wang Unsupervised Image-to-Image Translation with Style Consistency . . . . . . . . . . . . 322 Binxin Lai and Yuan-Gen Wang SemanticCrop: Boosting Contrastive Learning via Semantic-Cropped Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 Ya Fang, Zipeng Chen, Weixuan Tang, and Yuan-Gen Wang Transformer-Based Multi-object Tracking in Unmanned Aerial Vehicles . . . . . . . 347 Jiaxin Li and Hongjun Li HEI-GAN: A Human-Environment Interaction Based GAN for Multimodal Human Trajectory Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359 Zihao Wang, Xuguang Chen, Sichao Wen, and Yaonong Wang CenterMatch: A Center Matching Method for Semi-supervised Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 Linhuang Wang, Xin Kang, Satoshi Nakagawa, and Fuji Ren Cross-Dataset Distillation with Multi-tokens for Image Quality Assessment . . . . 384 Timin Gao, Weixuan Jin, Bokai Lai, Zhen Chen, Runze Hu, Yan Zhang, and Pingyang Dai Quality-Aware CLIP for Blind Image Quality Assessment . . . . . . . . . . . . . . . . . . . 396 Wensheng Pan, Zhifu Yang, DingMing Liu, Chenxin Fang, Yan Zhang, and Pingyang Dai


Multi-agent Perception via Co-attentive Communication Mechanism . . . . . . . . . 409 Ning Gong, Zhi Li, Shaohui Li, Yuxin Ke, Zhizhuo Jiang, Yaowen Li, and Yu Liu DBRNet: Dual-Branch Real-Time Segmentation NetWork for Metal Defect Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 Tianpeng Zhang, Xiumei Wei, Xiaoming Wu, and Xuesong Jiang MaskDiffuse: Text-Guided Face Mask Removal Based on Diffusion Models . . . 435 Jingxia Lu, Xianxu Hou, Hao Li, Zhibin Peng, Linlin Shen, and Lixin Fan Image Generation Based Intra-class Variance Smoothing for Fine-Grained Visual Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 Zihan Yan, Ruoyi Du, Kongming Liang, Tao Wei, Wei Chen, and Zhanyu Ma Cross-Domain Soft Adaptive Teacher for Syn2Real Object Detection . . . . . . . . . 460 Weijie Guo, Boyong He, Yaoyuan Wu, Xianjiang Li, and Liaoni Wu Dynamic Graph-Driven Heat Diffusion: Enhancing Industrial Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 Jiaquan Li, Min Jiang, and Minghui Shi EKGRL: Entity-Based Knowledge Graph Representation Learning for Fact-Based Visual Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485 Yongjian Ren, Xiaotang Chen, and Kaiqi Huang Disentangled Attribute Features Vision Transformer for Pedestrian Attribute Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497 Caihua Liu, Jiaxian Guo, Sichu Chen, and Xia Feng A High-Resolution Network Based on Feature Redundancy Reduction and Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 Yuqing Pan, Weiming Lan, Feng Xu, and Qinghua Ren Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523

Computational Photography, Sensing and Display Technology

RSID: A Remote Sensing Image Dehazing Network

Yuan Li and Yafeng Zhao(B)

Northeast Forestry University, No. 26, Hexing Road, Harbin 150036, Heilongjiang, China
[email protected]
Supported by the National Natural Science Foundation of China (61975028) and Jiao Xu.

Abstract. Hazy images suffer from loss of detail and dull colors, which significantly hinders information extraction from remote sensing imagery, so image dehazing is a necessary preprocessing step. Remote sensing images are large and information-rich, so processing them often runs into GPU memory overflow and difficulty in removing non-uniform haze. Targeting these characteristics, this paper proposes an efficient and lightweight end-to-end dehazing method. FA attention combined with smoothed dilated convolution is used as the main structure of the encoder, which allows unbalanced handling of haze of different opacity levels while reducing the parameter count. Channel weight fusion self-attention is added in the decoder to realize automatic weight learning and pixel-level processing of features from different receptive fields. We tested the proposed method on the public RESIDE dataset and on real large-size hazy remote sensing images, where it achieved satisfactory results, demonstrating its effectiveness.

Keywords: Hazy images · FA attention · Channel weight fusion self-attention · UAV remote sensing · Smoothed dilated convolution

1 Introduction

Hazy images can degrade the performance of vision systems because of their low quality, so image dehazing is an essential pre-processing step before images are fed to advanced computer vision systems. In recent years, image dehazing, a typical low-level vision task, has received increasing attention from researchers. McCartney [1] proposed the atmospheric scattering model by analyzing the formation mechanism of haze-degraded images and introducing the scattering and absorption of the atmospheric medium, which cause image degradation, into the imaging model. After that, researchers proposed numerous dehazing algorithms [2-4] based on this model and verified its fitting ability. Image dehazing is a typical ill-posed problem; a common strategy is to turn it into a well-posed one by extracting prior information that imposes constraints on the model, which is defined as

I(x) = J(x)t(x) + A(1 − t(x))    (1)

where x denotes a pixel, I(x) the observed hazy scene, J(x) the haze-free scene, A the global atmospheric light, and t(x) the medium transmission (transmittance), which can be further expressed as t(x) = e^(−βδ(x)), where β is the atmospheric scattering coefficient and δ(x) is the scene depth at pixel x. Usually, both the atmospheric light A and the per-pixel transmittance t(x) are unknown, so image dehazing is a severely under-constrained problem.

The emergence and development of deep learning is a milestone in artificial intelligence, and its powerful nonlinear mapping capability has also been applied to image dehazing. Using deep neural networks to construct a direct mapping from hazy images to clear images is a typical model-free dehazing approach: the cause of degradation and the imaging process need not be modeled; the network takes a hazy image as input and, after training, outputs a clear image. Examples include DehazeNet [5], AOD-Net [6], GFN [7], and GridDehazeNet [8]. These methods achieve dehazing by training on paired images without complicated preprocessing, which improves processing efficiency provided sufficient computing power is available. All of the above are end-to-end models, and because their feature networks differ in feature extraction and utilization ability, their results vary: GFN [7] has limited ability to handle dense haze, AOD-Net [6] tends to output images with low brightness, and GridDehazeNet [8] struggles with non-uniform haze over white backgrounds. Producing higher-quality dehazing results requires stronger feature extraction and utilization, which in turn demands more computation. In recent years, some networks have obtained better dehazing results by stacking ever more feature extraction modules, but this brings a heavy computational burden; the representative MSBDN [9] is an example, and it is hard to apply to general large-size remote sensing image processing.

In the field of remote sensing, hazy images are characterized by high resolution and uneven haze distribution [10]. Transferring the above models therefore often runs into insufficient computing power or residual non-uniform haze. In 2019, Ke et al. [11] used deep learning to dehaze 480*480-pixel satellite remote sensing images, but the brightness of the processed images dropped noticeably; in 2021, Musunuri and Kwon [12] ran dehazing experiments on images of several sizes and obtained good results on small images, but not satisfactory visual results on large images with uneven haze distribution. Therefore, when processing remote sensing data, two main aspects must be considered: (a) the model should be deep enough to balance dehazing quality against the number of parameters while remaining as compact as possible; (b) for unevenly distributed haze, multi-scale features should be fused with different weights to achieve good results.

Since the introduction of the Transformer, image dehazing has been further improved, but Transformers require more computation [13], and processing large images with insufficient computing power often causes GPU memory overflow. Uformer [14] is a typical application of the Transformer to image dehazing; its parameter count is far higher than that of CNN-based methods with comparable results, so a CNN is more suitable as the basic structure for remote sensing images.

The main contributions of this paper are as follows. (a) The proposed method is based on a CNN structure with an improved smoothed dilated convolution as the feature extraction module; the receptive field is changed by varying the dilation rate, saving computational resources as much as possible. (b) Channel weight fusion self-attention is used to achieve unbalanced, pixel-level processing, which better handles non-uniform haze. (c) Residual structures are used extensively to deepen the model; the proposed method achieves satisfactory visual results on hazy images and can process large-size 4K images on a single GPU.
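To make the degradation model of Eq. (1) concrete, the following is a minimal NumPy sketch of how a hazy image can be synthesized from a clear one. The depth ramp, β value, and atmospheric light A used here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def synthesize_haze(J, depth, beta=1.0, A=0.9):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t),
    with transmittance t = exp(-beta * depth)."""
    t = np.exp(-beta * depth)            # per-pixel medium transmittance
    t = t[..., None]                     # broadcast over the RGB channels
    return J * t + A * (1.0 - t)

# Illustrative example: a random "clear" image and a left-to-right depth ramp.
J = np.random.rand(64, 64, 3)                         # clear scene, values in [0, 1]
depth = np.tile(np.linspace(0.0, 3.0, 64), (64, 1))   # assumed pseudo depth map
I = synthesize_haze(J, depth, beta=0.8, A=0.95)       # hazy observation
print(I.shape, float(I.min()), float(I.max()))
```

Dehazing inverts this process without knowing A or t(x), which is what makes the problem ill-posed.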

2 Related Work and Model

2.1 Related Work

Improved Smoothed Dilated Convolution. Downsampling is generally used to reduce computation and enlarge the receptive field; without it, the receptive field must be grown in other ways, and the information lost by downsampling usually has to be compensated by increasing the number of channels. This paper therefore does not use downsampling for feature extraction and learning; instead, smoothed dilated convolution is adopted to balance the loss of receptive field against the need for detailed features. Let the dilated convolution have a filter w of size S and dilation rate r. The output O of the one-dimensional dilated convolution at position i is formulated as follows [15]; in Fig. 1(a), different dilation rates correspond to different receptive field sizes.

O[i] = Σ_{s=1}^{S} f[i + r·s] w[s]    (2)
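As a quick numerical check of Eq. (2), the sketch below evaluates a one-dimensional dilated convolution directly from the definition (with 0-based indexing); the signal, filter, and dilation rates are made-up illustrative values.

```python
import numpy as np

def dilated_conv1d(f, w, r):
    """Direct evaluation of Eq. (2): O[i] = sum_s f[i + r*s] * w[s]."""
    S = len(w)
    out_len = len(f) - r * (S - 1)       # positions where every tap i + r*s stays in range
    return np.array([sum(f[i + r * s] * w[s] for s in range(S))
                     for i in range(out_len)])

f = np.arange(10, dtype=float)    # toy input signal
w = np.array([1.0, 0.5, 0.25])    # toy filter of size S = 3
print(dilated_conv1d(f, w, r=1))  # dense taps
print(dilated_conv1d(f, w, r=2))  # dilation rate 2: wider receptive field, same filter size
```

Increasing r widens the receptive field without adding parameters, which is exactly why the paper uses it in place of downsampling.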

The process of dilated convolution can be divided into three steps. First, the input feature maps are subsampled with factor r. Second, these intermediate feature-map sets are fed into a standard convolution whose filter has the same weights as the original dilated convolution after all inserted zeros are removed. Finally, the feature maps are re-interleaved to the original resolution to produce the output [16]. The process is visualized in Fig. 1(b). Based on this view, a separate shared convolution can be performed before the dilated convolution, which avoids gridding artifacts.

Fig. 1. Dilated convolution

The FA attention consists of two modules. The first is channel attention, which operates on the original feature map to reassign the features of different channels. The second is pixel attention, which reweights individual pixels through element-wise multiplication [17]. In essence, in detailed image regions different haze thicknesses call for different weights, so channels and pixels should both be treated unequally [17]. In this paper, the smoothed dilated convolution is cascaded with the FA attention; the resulting improved smoothed dilated convolution structure is shown in Fig. 2.

Fig. 2. Improved Smoothed Dilated Convolution
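The block in Fig. 2 can be summarized with the PyTorch-style sketch below: a shared smoothing convolution applied before the dilated convolution (to suppress gridding artifacts), followed by channel and pixel attention in the spirit of the FA attention [17]. Channel counts, kernel sizes, and the exact attention layout are assumptions made for illustration; the paper's Fig. 2 is the authoritative structure.

```python
import torch
import torch.nn as nn

class FAAttention(nn.Module):
    """Channel attention followed by pixel attention (FFA-style layout assumed here)."""
    def __init__(self, ch):
        super().__init__()
        self.ca = nn.Sequential(                      # channel-wise reweighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 8, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 8, ch, 1), nn.Sigmoid())
        self.pa = nn.Sequential(                      # pixel-wise reweighting
            nn.Conv2d(ch, ch // 8, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 8, 1, 1), nn.Sigmoid())

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.pa(x)

class SmoothedDilatedBlock(nn.Module):
    """Shared pre-convolution + dilated convolution + FA attention, with a residual."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.pre = nn.Conv2d(ch, ch, 3, padding=1)    # shared smoothing convolution
        self.dil = nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)
        self.att = FAAttention(ch)

    def forward(self, x):
        y = torch.relu(self.dil(torch.relu(self.pre(x))))
        return x + self.att(y)                        # residual connection

x = torch.randn(1, 64, 128, 128)
print(SmoothedDilatedBlock(64, dilation=2)(x).shape)  # torch.Size([1, 64, 128, 128])
```

The residual wrapper mirrors the paper's heavy use of residual structures, and the sigmoid gates are what give thick-haze regions larger weights than clear regions.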

Channel Weight Fusion Self-attention. In deep learning tasks, capturing multi-scale features is a common way to enlarge the effective receptive field while avoiding redundant information [18]. In neural networks, multi-scale can be realized concretely by scaling the output feature maps of different convolutional layers to a uniform size. Images of different scales suit different tasks: if the task is simple, such as deciding whether an image is a solid color or contains a foreground, a small-scale image is sufficient; if the task is harder, such as semantic segmentation or image description, large-scale images are needed for good results [19]. Classical networks such as U-Net [20] and YOLO [21] usually use skip connections to fuse information from different receptive fields and scales, with each layer fused using the same weights, which gives the network a good grasp of both details and the global picture. For fusing different receptive fields, this paper instead learns the fusion weights automatically, so that information of different importance under different receptive fields is combined adaptively. The formulation is sketched as follows.

F = K₁F₁ + K₂F₂ + K₃F₃ + K₄F₄    (3)

F̃ = F × σ(Conv(δ(Conv(F))))    (4)

where F₁, F₂, F₃, F₄ denote the feature information under different receptive fields, the Kᵢ are learnable parameters, δ denotes the ReLU activation, σ the sigmoid activation, and F̃ the final output. The convolution at the end of the fusion module reduces the feature map to 1 × H × W, and this map is multiplied element-wise with the fused features. This fuses information from different receptive fields while treating different pixels unequally, which helps the network focus on regions with different haze levels.
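A minimal sketch of Eqs. (3) and (4) is given below: four feature maps from different receptive fields are fused with learnable scalar weights K₁-K₄, and the fused map is then reweighted pixel-wise by a Conv-ReLU-Conv-Sigmoid branch. Tensor shapes, kernel sizes, and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelWeightFusion(nn.Module):
    """F = K1*F1 + ... + K4*F4, then F_tilde = F * sigmoid(Conv(ReLU(Conv(F))))."""
    def __init__(self, ch, n_inputs=4):
        super().__init__()
        self.k = nn.Parameter(torch.ones(n_inputs))        # learnable fusion weights K_i
        self.gate = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, 1, 3, padding=1), nn.Sigmoid())  # 1 x H x W pixel map

    def forward(self, feats):                              # list of 4 same-shaped tensors
        fused = sum(k * f for k, f in zip(self.k, feats))  # Eq. (3)
        return fused * self.gate(fused)                    # Eq. (4), broadcast over channels

feats = [torch.randn(1, 64, 64, 64) for _ in range(4)]
print(ChannelWeightFusion(64)(feats).shape)                # torch.Size([1, 64, 64, 64])
```

Because the Kᵢ and the gating branch are trained jointly with the rest of the network, the fusion weights adapt to where haze is thick and where it is thin.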

2.2 Model

The complete structure used in this paper is shown in Fig. 3. The forward encoder mainly extracts feature maps: three plain convolutions perform the initial extraction, and then, to reach different receptive fields, smoothed dilated convolutions are used instead of downsampling to enlarge the receptive field. The automatically weighted fusion network then fuses the receptive-field features. The decoder produces the feature residual, which is added to the hazy image to obtain the dehazed result. The first three convolution blocks can be regarded as three cascaded convolution modules with dilation rate 0; only three are used in order to limit the number of parameters. Downsampling is not used here to expand the receptive field; it could be applied to large images as appropriate to reduce computation and avoid running out of memory, but in principle it also causes some loss of detail.

8

Y. Li and Y. Zhao

Fig. 3. Complete model structure diagram
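Putting the pieces of Fig. 3 together, the overall pipeline can be outlined as follows: a few plain convolutions extract features, three smoothed dilated blocks with different dilation rates provide different receptive fields, the channel weight fusion module merges them, and the decoder predicts a residual that is added back to the hazy input. This reuses the SmoothedDilatedBlock and ChannelWeightFusion sketches above, and the layer counts and dilation rates (1, 2, 4) are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RSIDSketch(nn.Module):
    """Residual dehazing outline: output = hazy input + predicted residual."""
    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Sequential(                       # three plain convolutions
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        # assumes SmoothedDilatedBlock and ChannelWeightFusion from the sketches above
        self.blocks = nn.ModuleList(
            [SmoothedDilatedBlock(ch, d) for d in (1, 2, 4)])  # assumed dilation rates
        self.fuse = ChannelWeightFusion(ch)
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)       # decoder head predicts the residual

    def forward(self, hazy):
        x = self.head(hazy)
        feats = [x]                                      # shallow features
        for blk in self.blocks:                          # three different receptive fields
            x = blk(x)
            feats.append(x)
        fused = self.fuse(feats)                         # four inputs, weighted fusion
        return hazy + self.tail(fused)                   # add residual to the hazy image

print(RSIDSketch()(torch.randn(1, 3, 256, 256)).shape)   # torch.Size([1, 3, 256, 256])
```

Predicting a residual rather than the clear image directly keeps the mapping close to identity in haze-free regions, which is consistent with the residual emphasis in the paper.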

3 Experiment

3.1 Datasets

Two different datasets are used for validation. The REalistic Single Image DEhazing (RESIDE) dataset provides the small-size images used in this paper; each hazy image is paired with a corresponding clear image [22]. The large-size images are real hazy images captured by UAV aerial photography, with a minimum size of 3656*2742.

3.2 Evaluation and Details

In this paper, PSNR [23], SSIM, and the number of parameters are adopted as evaluation criteria. PSNR and SSIM are calculated as follows.

MSE = (1/D) Σ_{i=1}^{D} (x_i − y_i)²    (5)

PSNR = 10 · log₁₀(MAX_I² / MSE)    (6)

SSIM(x, y) = (2μ_x μ_y + C₁)(2σ_xy + C₂) / ((μ_x² + μ_y² + C₁)(σ_x² + σ_y² + C₂))    (7)

RSID: A Remote Sensing Image Dehazing Network

9

convolutional modules, each module contains three expanding convolutional submodules, and each small expanding convolutional module has the same dilation rate. The loss function in this paper uses M AE.

4 4.1

Result Model Analysis

To compare the advantages and disadvantages of this method with other methods more objectively and fairly, we conducted tests on the public data set, and the test results are shown in Table 1. We compare the main dehazing network structures, among which the best dehazing effect is the MSBDN, the use of dense feature fusion modules leads to a large number of parameters and high computational requirements for processing large-size remote sensing data. In 2021, Uformer demonstrated the feasibility of using the Transformer structure for image dehazing, but its parameter size reaches 50.45M. In terms of model size and processing effect, it is inferior to MSBDN and the method proposed in this paper. Table 1. Comparison of different methods Method

PSNR SSIM

(TPAMI’10)DCP

18.75

Param

0.8536 ——

CAP

19.05

0.8428 ——

(TIP’16)DehazeNet

21.14

0.8698 0.01M

(ICCV’17)AOD-NET

20.51

0.8340 0.002M

(CVPR’18)GFN

22.3

0.8845 0.499M

(ICCV’19)GridDehazeNet 32.16

0.9760 0.96M

(CVPR’20)MSBDN

33.67

0.9850 31.35M

(CVPR’21)Uformer

31.91

0.9710 50.45M

Ours

32.24

0.9841 4.96M

To show the processing effect of our method, we selected indoor and outdoor scenes and compared GridDehazeNet, MSBDN, and our method with similar processing effects. As can be seen in Fig. 4, GridDehazeNet shows significant color distortion, with less than uniform and realistic excess in the white background part, and significant black unevenness in Fig. 4(c) and Fig. 4(h). Both the MSBDN and the method proposed in this paper achieve good image dehazing effects, which are indistinguishable from the human eye. There is a subtle color difference when the details are enlarged. Under the test of outdoor scenes, there is a clear color unevenness in the sky in Fig. 5(c) and a clear haze residue in Fig. 5(h). In Fig. 5(e) the color contrast is stronger and the outline of the tall buildings in the distance is clearer, but

10

Y. Li and Y. Zhao

the trees are also relatively darker. In Fig. 5(j), the blue sky is darker and the contrast of shadows and lighting is deeper. From the overall effect, Fig. 5(i) and Fig. 5(d) are closer to the original image. From the overall results, the images processed by the MSBDN are closer to the original images, while the method in this paper appears to have a stronger contrast in the processing of outdoor scenes. From the perspective of PSNR and SSIM, the MSBDN has a better processing effect. However, it is difficult to define whether Fig. 5 (i) or Fig. 5 (j) is better processed when subjectively evaluated. It is worth mentioning that the PSD [25] performed poorly on PNSR and SSIM at the CVPR in 2021, but received overwhelmingly favorable comments during the subjective evaluation. In terms of the number of parameters, the method in this paper compresses the parameter size to only 1/6 or smaller than MSBDN, so the method proposed in this paper is more advantageous in processing remote sensing data.

Fig. 4. Indoor scene processing effects

Fig. 5. Indoor scene processing effects

RSID: A Remote Sensing Image Dehazing Network

11

Table 2. Ablation experiment FA Attention



Channel weight fusion self-attention  PSNR

—–





—–

32.24 31.51 31.39

Table 2 shows the impact of the two components. The FA attention mechanism improves PSNR by 0.73, and channel weight fusion self-attention improves PSNR by 0.89. With these two modules, the network not only assigns different attention across channels and pixels but also retains low-level information and passes it to deeper layers. Thanks to the weighting mechanism, FA attention focuses more on informative content such as thick-haze regions, high-frequency texture, and color fidelity, and therefore produces a better dehazing effect. The experiments in this subsection demonstrate the positive effect of FA attention and channel weight fusion self-attention on the proposed network, and show that fusing channel information from different receptive fields with different weights benefits the network.

4.2 Real Remote Sensing Image Dehazing Analysis

Finally, the proposed method is tested on real remote sensing images. All images were taken by UAV, which demonstrates the effectiveness of the method for real remote sensing image dehazing. Three different types of hazy scenarios are selected. Scene 1 has a relatively uniform haze distribution; as can be seen in Fig. 6, uniform haze is removed very effectively: the outline of the road is clearer, the car driving on the road is more visible, the color of the adjacent farmland is more vivid, and the color contrast between different tree species is more pronounced, all of which helps further interpretation of the image.

Fig. 6. Scene 1 dehazing result

Scene 2 is a case of uneven haze with increased haze concentration. In Scene 2, the proposed method still demonstrates good performance. We zoom in on


some details of the image, and it can be seen from Fig. 7 that dehazing restores information that was largely invisible in the hazy input, which facilitates further utilization of the image.

Fig. 7. Dehazing results for real Scene 2 (unevenly distributed haze with higher concentration)

Scene 3 shows the dehazing effect in the case of very high haze concentration and a very uneven distribution. The haze in Fig. 8 is the thickest and the most uneven. The vast majority of the haze in the picture is removed, but a small amount is not removed cleanly. In the blue box there is a very small amount of residual haze; in the red box, where the naked eye can barely recognize any information in the hazy scene, most of the information is recovered, although some information cannot be recovered due to the excessive haze concentration.

Fig. 8. Dehazing results for real Scene 3 (dense, very unevenly distributed haze)

In all three scenes, processing the hazy images improves image quality: color contrast is enhanced and details are recovered. These experiments show that the proposed method remains effective for haze removal in real scenes and that the model generalizes well. In addition, the remote sensing images in this paper all reach the 4K standard, yet no GPU memory overflow occurs on a single GPU.

5 Conclusion

In this paper, we proposed an effective end-to-end haze removal method for the problem of large-size remote sensing image dehazing; the improved smoothed dilated convolution and the Channel weight fusion self-attention are the highlights of the model. The ablation experiments demonstrate the positive effect of these components. Through experiments on


images and visualization of the dehazed results, we demonstrate that the proposed method can effectively perform image dehazing and produces excellent visual results on images of multiple sizes, which provides a good foundation for further image interpretation and related work.

References 1. Hide, R.: Optics of the atmosphere: scattering by molecules and particles. Phys. Bull. 28(11), 521 (1977). https://doi.org/10.1088/0031-9112/28/11/025 2. Dai, C., Lin, M., Wu, X., Zhang, D.: Single hazy image restoration using robust atmospheric scattering model. Signal Process. 166, 107257 (2020). https://doi. org/10.1016/j.sigpro.2019.107257. https://www.sciencedirect.com/science/article/ pii/S0165168419303093 3. Mill´ an, M.M.: Remote sensing of air pollutants: a study of some atmospheric scattering effects. Atmos. Environ. 14(11), 1241–1253 (1980). https://doi.org/10. 1016/0004-6981(80)90226-7. https://www.sciencedirect.com/science/article/pii/ 0004698180902267 4. Hong, S., Kim, M., Kang, M.G.: Single image dehazing via atmospheric scattering model-based image fusion. Signal Process. 178, 107798 (2021). https://doi. org/10.1016/j.sigpro.2020.107798. https://www.sciencedirect.com/science/article/ pii/S016516842030342X 5. Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: an end-to-end system for single image haze removal. IEEE Trans. Image Process. 25(11), 5187–5198 (2016). https://doi.org/10.1109/TIP.2016.2598681 6. Li, B., Peng, X., Wang, Z., Xu, J., Feng, D.: Aod-net: all-in-one dehazing network. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4780– 4788 (2017). https://doi.org/10.1109/ICCV.2017.511 7. Ren, W., et al.: Gated fusion network for single image dehazing. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3253– 3261 (2018). https://doi.org/10.1109/CVPR.2018.00343 8. Liu, X., Ma, Y., Shi, Z., Chen, J.: Griddehazenet: attention-based multi-scale network for image dehazing. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7313–7322 (2019). https://doi.org/10.1109/ICCV.2019. 00741 9. Dong, H., et al.: Multi-scale boosted dehazing network with dense feature fusion. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2154–2164 (2020). https://doi.org/10.1109/CVPR42600.2020.00223 10. Li, J., Hu, Q., Ai, M.: Haze and thin cloud removal via sphere model improved dark channel prior. IEEE Geosci. Remote Sens. Lett. 16(3), 472–476 (2019). https:// doi.org/10.1109/LGRS.2018.2874084 11. Ke, L., et al.: Haze removal from a single remote sensing image based on a fully convolutional neural network. J. Appl. Remote Sens. 13(3), 036505 (2019). https:// doi.org/10.1117/1.JRS.13.036505 12. Musunuri, Y.R., Kwon, O.S.: Deep residual dense network for single image superresolution. Electronics 10(5) (2021). https://doi.org/10.3390/electronics10050555, https://www.mdpi.com/2079-9292/10/5/555 13. Song, Y., He, Z., Qian, H., Du, X.: Vision transformers for single image dehazing. IEEE Trans. Image Process. 32, 1927–1941 (2023). https://doi.org/10.1109/TIP. 2023.3256763


14. Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: a general u-shaped transformer for image restoration. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17662–17672 (2022). https://doi.org/ 10.1109/CVPR52688.2022.01716 15. Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning (2018) 16. Wang, Z., Ji, S.: Smoothed dilated convolutions for improved dense prediction. Data Min. Knowl. Discov. 35(4), 1470–1496 (2021). https://doi.org/10.1007/ s10618-021-00765-5 17. Qin, X., Wang, Z., Bai, Y., Xie, X., Jia, H.: FFA-net: feature fusion attention network for single image dehazing. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11908–11915 (2020). https://doi.org/10. 1609/aaai.v34i07.6865. https://ojs.aaai.org/index.php/AAAI/article/view/6865 18. Wang, H., Kembhavi, A., Farhadi, A., Yuille, A.L., Rastegari, M.: Elastic: improving cnns with dynamic scaling policies. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 19. Yu, X., Gong, Y., Jiang, N., Ye, Q., Han, Z.: Scale match for tiny person detection. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1246–1254 (2020). https://doi.org/10.1109/WACV45572.2020.9093394 20. Kim, S.M., Shin, J., Baek, S., Ryu, J.H.: U-Net convolutional neural network model for deep red tide learning using GOCI. J. Coastal Res. 90(sp1), 302 – 309 (2019). https://doi.org/10.2112/SI90-038.1 21. Redmon, J., Farhadi, A.: Yolov3: an incremental improvement. CoRR abs/1804.02767 (2018). http://arxiv.org/abs/1804.02767 22. Li, B., et al.: Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 28(1), 492–505 (2019). https://doi.org/10.1109/TIP.2018.2867951 23. Huynh-Thu, Q., Ghanbari, M.: Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 44, 800–801 (2008). https://api.semanticscholar.org/ CorpusID:62732555 24. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004). https://doi.org/10.1109/TIP.2003.819861 25. Chen, Z., Wang, Y., Yang, Y., Liu, D.: PSD: principled synthetic-to-real dehazing guided by physical priors. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7176–7185 (2021). https://doi.org/10.1109/ CVPR46437.2021.00710

ContextNet: Learning Context Information for Texture-Less Light Field Depth Estimation Wentao Chao, Xuechun Wang, Yiming Kan, and Fuqing Duan(B) School of Artificial Intelligence, Beijing Normal University, Beijing, China {chaowentao,wangxuechun,kanyiming}@mail.bnu.edu.cn, [email protected]

Abstract. Depth estimation in texture-less regions of the light field is an important research direction. However, there are few existing methods dedicated to this issue. We find that context information is significantly crucial for depth estimation in texture-less regions. In this paper, we propose a simple yet effective method called ContextNet for texture-less light field depth estimation by learning context information. Specifically, we aim to enlarge the receptive field of feature extraction by using dilated convolutions and increasing the training patch size. Moreover, we design the Augment SPP (AugSPP) module to aggregate features of multiplescale and multiple-level. Extensive experiments demonstrate the effectiveness of our method, significantly improving depth estimation results in texture-less regions. The performance of our method outperforms the current state-of-the-art methods (e.g., LFattNet, DistgDisp, OACC-Net, and SubFocal) on the UrbanLF-Syn dataset in terms of MSE ×100, BadPix 0.07, BadPix 0.03, and BadPix 0.01. Our method also ranks third place of comprehensive results in the competition about LFNAT Light Field Depth Estimation Challenge at CVPR 2023 Workshop without any post-processing steps (The code and model are available at https:// github.com/chaowentao/ContextNet.).

Keywords: Light field · Depth estimation · Texture-less regions

1 Introduction

Light field (LF) can simultaneously record the spatial and angular information of the scene with a single shot, which has many practical applications, for example refocusing [20,30], super-resolution [5–7,14,26,28,36], view synthesis [15,19,33], semantic segmentation [22], 3D reconstruction [17], virtual reality [35], especially depth (disparity) estimation [1–4,21,23,25,27,29,38,39]. By utilizing the additional angle information of LF images, researchers on LF depth estimation have made great progress. Texture-less regions are common and intractable in LF images, restricting the performance of depth estimation, especially in synthetic datasets such as UrbanLF-Syn [22]. However, there are c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024  Q. Liu et al. (Eds.): PRCV 2023, LNCS 14430, pp. 15–27, 2024. https://doi.org/10.1007/978-981-99-8537-1_2


few methods for texture-less LF depth estimation, specifically for large textureless regions. We analyze the characteristics of existing methods, which can be divided into traditional and deep learning-based methods. Traditional methods [4,13,24,32,38–40] need to rely on various prior assumptions, are very vulnerable to texture-less regions, and the algorithms are time-consuming. In recent years, with the development of deep learning, deep learning-based methods [1–3,10,11,23,25,27,29] increasingly are used for LF depth estimation tasks and have advantages over traditional methods in terms of efficiency and accuracy. Nowadays, the mainstream deep learning-based methods [1–3,25,27, 29] are mainly based on multi-view stereo matching, including four steps: feature extraction, cost volume construction, cost aggregation, and disparity regression, such as SubFocal [2], which has achieved the best overall performance on the HCI 4D benchmark [12]. Currently, existing methods [1–3,25,27,29] often adopt the LF image patch (e.g., 32×32) for efficient training, which is effective when there are few or no texture-less regions in the image. However, its side effect causes the receptive field of the model to be too limited and unable to learn abundant context information. When the texture-less region is too large, it may exceed the size of the model’s receptive field, making it impossible for the model to infer reasonable results. We find that the depth estimation results of textured regions around textureless regions are reliable, so it is important to learn large and abundant context information effectively to alleviate the difficulty of depth estimation in largescale texture-less regions. As for context information learning, there are two aspects to achieving it. On the one hand, we aim to enlarge the receptive field of feature extraction, including using dilated convolutions and increasing the patchsize of training images. On the other hand, we design an Augment Spatial Pyramid Pooling (AugSPP) module to aggregate features from multiple scales and levels. Combining the expanded receptive field with the AugSPP module alleviates the difficulty of depth estimation in large-scale texture-less regions. Our contributions are as follows: • We analyze the importance of context information in texture-less regions and present a simple yet effective method called ContextNet to learn context information for texture-less LF depth estimation. • We utilize dilated convolutions and increase the patchsize of training images to enlarge the receptive field of feature extraction. Moreover, an augment SPP (AugSPP) module is designed to effectively aggregate features from multiple scales and levels. • Extensive experiments validate the effectiveness of our method. In comparison with state-of-the-art methods, our method achieves superior performance in terms of MSE ×100 and BadPix metrics on the UrbanLF-Syn dataset. Furthermore, our method ranks third place in the LFNAT LF Depth Estimation Challenge at the CVPR 2023 Workshop without any post-processing.

2 Related Work

The related work on LF depth estimation can generally be divided into two categories: traditional methods and deep learning-based methods. Below we review each category in detail.

2.1 Traditional Methods

Traditional methods can generally be subdivided into three categories based on the representations of LF images: multi-view stereo (MVS), epipolar plane image (EPI), and defocus-based methods. MVS-based methods [13] use multi-view information from SAIs for stereo matching to obtain depth. Jeon et al. [13] employed phase translation theory to describe the sub-pixel shift between SAIs and utilized matching processes for stereo matching. EPI-based methods [37] implicitly estimate the depth of the scene by computing the slope of the EPI. Wanner et al. [31] proposed a structure tensor that can estimate the slope of lines in horizontal and vertical EPIs, and refined these results through global optimization. Additionally, Zhang et al. [37] introduced the Spinning Parallelogram Operator (SPO), which can compute the slope of straight lines in EPI with minimal sensitivity to occlusion, noise, and spatial blending. Defocus-based methods [24] obtain depth by measuring how blurred pixels are on different focus stacks. Tao et al. [24] combined scattering and matching cues to generate a local depth map through Markov random field for global optimization. Williem et al. [32] improved the robustness of occlusion and noise for depth estimation by using information entropy between different angles and adaptive scattering. Zhang et al. [38] used the special linear structure of an EPI and locally linear embedding (LLE) for LF depth estimation. Zhang et al. [39] proposed a twostage method for LF depth estimation that utilized graph-based structure-aware analysis. They combined an undirected graph with occluded and unoccluded SAIs in corner blocks to exploit the structural information of the LF. Han et al. [9] introduced an occlusion-aware vote cost (OAVC) to enhance the accuracy of edge preservation in the depth map. However, these methods rely on handdesigned features and subsequent optimization, which are time-consuming and have limited accuracy for text-less regions. Deep Learning-Based Methods. Deep learning has experienced rapid development and has been widely applied in various LF processing tasks, particularly depth estimation. Heber et al.[11] were the first to use a CNN to extract features from an EPI and calculate the scene’s depth. Shin et al.[23] proposed EPINet, which used four directional (0◦ , 90◦ , 45◦ , and 135◦ ) EPIs as input and a center sub-aperture image (SAI) disparity map as output. Tsai et al.[25] introduced the LFAttNet network, which employed a view selection module based on an attention mechanism to calculate the importance of each view and served as the weight for cost aggregation. Guo et al.[8] designed an occlusion region detection network (ORDNet) for explicit estimation of occlusion maps and subsequent networks focus on non-occluded and occluded regions, respectively. Chen


et al.[3] designed the AttMLFNet, an attention-based multilevel fusion network that combines features from different perspectives hierarchically through intrabranch and inter-branch fusion strategies. Wang et al.[29] extended the spatialangular interaction mechanism to the disentangling mechanism and proposed DistgDisp for LF depth estimation. They also developed the OACC-Net [27], which uses dilated convolution instead of shift-and-concat operation and iterative processing with occlusion masks to build an occlusion-aware cost volume. Chao et al.[2] proposed the SubFocal method for sub-pixel disparity distribution learning by constructing a sub-pixel cost volume and leveraging disparity distribution constraints to obtain a high-precision disparity map. Chao et al.[1] presented a method called OccCasNet by constructing the occlusion-aware cascade cost volume for depth estimation and achieved a better trade-off between accuracy and efficiency. At present, existing methods often utilize LF image patches (e.g., 32×32) for training and have achieved high accuracy on textured regions of LF. However, this setting can result in a model with a too-limited receptive field, as the texture-less region may be too large and exceed the size of the model’s receptive field, leading to unreasonable results and low accuracy. We observe that the depth estimation of textured regions around texture-less regions can be accurate. Therefore, we propose a method called ContextNet to learn large and abundant context information in an efficient and effective manner, which will be beneficial for addressing the challenge of depth estimation in large-scale texture-less regions. To achieve this, we utilize dilated convolution and increase the patch size of training images to efficiently enlarge the receptive field of feature extraction. Additionally, we design an AugSPP module to effectively aggregate features at multiple-scale and multiple-level.

3 Method

3.1 Overview

An overview of our ContextNet is shown in Fig. 1(a). First, the features of each SAI are extracted and aggregated using a shared feature extraction based on dilated convolution [34] and the AugSPP module [16]. Second, the sub-pixel view shift is performed to construct the sub-pixel cost volume [2]. Third, the cost aggregation module is used to aggregate the cost volume information. The predicted disparity map is produced by attaching a disparity regression module. We will describe each module in detail below.

3.2 Feature Extraction and AugSPP Module

Enlarging Receptive Field. Dilated convolution expands the receptive field by adding holes in the convolution filter, and has a larger receptive field under the same parameter amount and calculation amount without using downsampling. Therefore, we utilize dilated convolution and increase training patchsize correspondingly to enlarge the receptive field. Figure 1(b) shows the structure

Fig. 1. The specific network design of ContextNet. (a) Overall Architecture. (b) Feature Extraction and AugSPP module.

of feature extraction. First, two 3 × 3 convolutions are employed to extract the initial feature with 4 channels. Then, we use a dilated-convolution-based [34] feature extraction module to extract multi-level features of each SAI. The feature extraction module contains four stages, and the dilated ratios are set to 1, 2, 4, and 8, respectively. Multi-scale and Multi-level Features Aggregation. Based on the original SPP module [16], we propose the AugSPP module for aggregating multi-scale and multi-level features. Specifically, different from previous methods [2,25], we add an extra pooling size of 32 × 32 in the AugSPP module, so five multi-scale average pooling operations are used to compress the features. The sizes of the average pooling blocks are 2 × 2, 4 × 4, 8 × 8, 16 × 16 and 32 × 32, respectively. Bilinear interpolation is adopted to upsample these low-dimensional feature maps to the same size. We also aggregate multi-level features from the feature extraction stages using skip connections to further improve the discrimination of features, and a 1 × 1 convolution layer is used to reduce the feature dimension. The features output by the AugSPP module contain multi-scale and multi-level discriminative context information and are concatenated to form a feature map F. The feature extraction and aggregation module incorporates additional textured features from large neighboring regions for challenging regions, such as texture-less and reflective areas.
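As a rough illustration of the pooling-and-upsampling path described above, an AugSPP-style aggregation could be sketched in PyTorch as below; the channel sizes and the exact skip-connection wiring are assumptions, and this is not the released ContextNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugSPP(nn.Module):
    """Average-pool the feature map at several block sizes, upsample each pooled
    map back to the input resolution, and concatenate with the input features."""
    def __init__(self, in_channels, out_channels, pool_sizes=(2, 4, 8, 16, 32)):
        super().__init__()
        self.pool_sizes = pool_sizes
        # 1x1 convolution to reduce the concatenated feature dimension
        self.reduce = nn.Conv2d(in_channels * (len(pool_sizes) + 1), out_channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [x]
        for k in self.pool_sizes:
            pooled = F.avg_pool2d(x, kernel_size=k, stride=k)
            branches.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))
        return self.reduce(torch.cat(branches, dim=1))

# channel counts here are arbitrary illustration values
feat = AugSPP(in_channels=16, out_channels=16)(torch.randn(1, 16, 64, 64))
```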

3.3 Sub-pixel Cost Volume

Cost volume is constructed by shift-and-concat [1–3,25] operation within a predefined disparity range (such as from –0.5 to 1.6 in the UrbanLF-Syn dataset).


To alleviate the narrow baseline of LF images, and different from the previous method [13] that uses the phase shift theorem to construct a sub-pixel cost volume, we follow [1,2] and construct a sub-pixel feature-level cost volume based on bilinear interpolation, which saves memory. After shifting the feature maps, we concatenate them into a 4D cost volume of size D × H × W × C. It is worth noting that a smaller sampling interval generates a finer sub-pixel cost volume but increases computation time and slows down inference. Therefore, to trade off accuracy and speed, we adopt 22 disparity levels ranging from –0.5 to 1.6, with a sub-pixel interval of 0.1.
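A minimal sketch of the bilinear sub-pixel shift underlying this construction is given below; the tensor layout, the sign of the shift, and the angular-offset convention are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def shift_feature(feat, dx, dy):
    """Bilinearly shift a (C, H, W) feature map by a fractional offset (dx, dy) pixels."""
    c, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # grid_sample expects normalized coordinates in [-1, 1]
    gx = 2 * (xs.float() + dx) / (w - 1) - 1
    gy = 2 * (ys.float() + dy) / (h - 1) - 1
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)
    return F.grid_sample(feat.unsqueeze(0), grid, mode="bilinear",
                         padding_mode="border", align_corners=True)[0]

def build_cost_volume(feats, offsets, disparities):
    """feats: list of (C, H, W) per-view features; offsets: (u, v) angular offsets
    from the center view; disparities: iterable of disparity hypotheses."""
    levels = []
    for d in disparities:
        shifted = [shift_feature(f, d * u, d * v) for f, (u, v) in zip(feats, offsets)]
        levels.append(torch.cat(shifted, dim=0))   # concatenate along channels
    return torch.stack(levels, dim=0)              # D x (C * num_views) x H x W

disparities = torch.arange(-0.5, 1.6001, 0.1)      # 22 levels with a 0.1 interval
feats = [torch.randn(4, 32, 32) for _ in range(3)]
volume = build_cost_volume(feats, [(-1, 0), (0, 0), (1, 0)], disparities)
```

Because every disparity level shifts all view features, memory grows linearly with the number of levels, which is one reason the 0.1 interval is chosen as a compromise between accuracy and speed.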

3.4 Cost Aggregation and Disparity Regression

The shape of the sub-pixel cost volume is D × H × W × C, where H × W denotes the spatial resolution, D is the disparity number, and C is the channel number of the feature maps, and we employ a 3D CNN to aggregate the sub-pixel cost volume. Following [2,25], our cost aggregation consists of eight 3 × 3 × 3 convolutional layers and two residual blocks. After passing through these 3D convolutional layers, we obtain the final cost volume C_f ∈ D × H × W. We normalize C_f by using the softmax operation along dimension D. Finally, the output disparity \hat{d} can be calculated as follows:

\hat{d} = \sum_{d_k = D_{min}}^{D_{max}} d_k \times \mathrm{softmax}(-C_{d_k}),    (1)

where \hat{d} denotes the estimated center view disparity, D_{min} and D_{max} stand for the minimum and maximum disparity values, respectively, and d_k is the sampling value between D_{min} and D_{max} according to the predefined sampling interval.
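Equation (1) is a soft (expectation-style) regression over the disparity hypotheses; a compact sketch, assuming the aggregated cost has shape D × H × W:

```python
import torch

def regress_disparity(cost, d_min=-0.5, d_max=1.6, interval=0.1):
    """Soft regression of Eq. (1): cost has shape (D, H, W); lower cost = better match."""
    levels = torch.arange(d_min, d_max + 1e-6, interval)   # disparity samples d_k
    prob = torch.softmax(-cost, dim=0)                     # normalize along D
    return (levels.view(-1, 1, 1) * prob).sum(dim=0)       # expected disparity per pixel

disp = regress_disparity(torch.randn(22, 64, 64))
```

Compared with a hard argmin over the levels, this expectation keeps the output differentiable and allows sub-interval precision.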

4 Experiments

In this section, we first introduce the UrbanLF-Syn dataset and implementation details. Then, we compare the performance of our method with the state-of-the-art methods. Finally, we conduct an extensive ablation study to analyze the proposed ContextNet.

4.1 Datasets and Implementation Details

The UrbanLF-Syn dataset [22] contains 230 synthetic LF samples, with 170 training, 30 validation, and 30 test samples for the LFNAT LF Depth Estimation Challenge at the CVPR 2023 Workshop. Each sample consists of 81 SAIs with a spatial resolution of 480×640 and an angular resolution of 9×9. We employ the L1 loss as the loss function, as it is robust to outliers. We use SubFocal as the baseline model. We follow the same data augmentation strategy as in [1,2,25] to improve the model performance, which includes random horizontal and vertical flipping, 90-degree rotation, and adding random noise.
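These flips and rotations have to be applied to the spatial and angular dimensions jointly to keep the LF structure consistent, as noted below Table 1; a minimal sketch for a 5D light field array (assumed layout U × V × H × W × C):

```python
import numpy as np

def lf_horizontal_flip(lf):
    """Horizontally flip a light field of shape (U, V, H, W, C).
    The angular x-axis (V) and the spatial x-axis (W) must be flipped together,
    otherwise the epipolar geometry of the light field is destroyed."""
    return lf[:, ::-1, :, ::-1, :].copy()

def lf_rotate90(lf):
    """Rotate the spatial and angular dimensions jointly by 90 degrees."""
    lf = np.rot90(lf, k=1, axes=(2, 3))   # rotate the spatial H-W plane
    lf = np.rot90(lf, k=1, axes=(0, 1))   # rotate the angular U-V plane accordingly
    return lf.copy()

# The ground-truth disparity map must be transformed consistently as well
# (flipped/rotated, and re-signed for some transforms).
aug = lf_rotate90(lf_horizontal_flip(np.random.rand(9, 9, 64, 64, 1)))
```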


Table 1. Quantitative comparison results with state-of-the-art methods on the validation set of the UrbanLF-Syn dataset [22] in terms of BadPix 0.07, BadPix 0.03, BadPix 0.01, and MSE×100. The best results are shown in boldface.

BadPix 0.07
Method          Img11   Img27   Img34   Img50   Img54   Img68   Img69   Img70   Avg.
LFattNet [25]   10.179  23.648  22.637  9.668   13.533  9.819   12.075  7.467   13.629
DistgDisp [29]  6.825   20.240  24.522  8.304   13.830  9.367   10.118  7.691   12.612
OACC-Net [27]   6.845   24.058  27.792  10.390  13.922  11.799  13.234  8.750   14.599
SubFocal [2]    9.194   22.223  22.619  6.008   11.476  8.385   9.357   6.542   11.976
Ours            8.456   12.149  17.104  5.396   5.771   6.954   5.790   2.875   8.062

BadPix 0.03
Method          Img11   Img27   Img34   Img50   Img54   Img68   Img69   Img70   Avg.
LFattNet [25]   14.902  34.327  30.583  15.486  19.495  14.758  18.967  15.744  20.533
DistgDisp [29]  12.240  33.657  32.753  15.282  21.297  15.938  18.350  15.770  20.661
OACC-Net [27]   14.095  38.117  34.853  18.251  21.962  23.187  21.917  18.052  23.804
SubFocal [2]    13.342  30.742  28.696  10.661  15.413  12.176  14.103  13.739  17.359
Ours            15.032  23.512  24.472  10.643  11.308  10.758  11.187  11.723  14.830

BadPix 0.01
Method          Img11   Img27   Img34   Img50   Img54   Img68   Img69   Img70   Avg.
LFattNet [25]   24.113  49.134  43.789  30.819  33.616  28.407  34.609  31.762  34.531
DistgDisp [29]  30.141  51.105  50.838  33.585  38.073  46.023  36.557  36.934  40.407
OACC-Net [27]   33.191  57.538  51.364  38.486  39.505  56.173  42.212  41.426  44.987
SubFocal [2]    20.913  41.808  37.782  19.375  22.847  19.485  23.257  25.075  26.318
Ours            23.614  39.654  35.729  19.788  20.960  18.555  21.352  22.688  25.293

MSE ×100
Method          Img11   Img27   Img34   Img50   Img54   Img68   Img69   Img70   Avg.
LFattNet [25]   0.662   1.977   2.514   0.283   1.142   1.083   0.495   0.210   1.046
DistgDisp [29]  0.304   1.558   2.509   0.192   0.763   0.743   0.313   0.185   0.820
OACC-Net [27]   0.342   3.025   11.739  1.799   3.527   2.790   2.095   0.273   3.199
SubFocal [2]    0.644   1.735   2.975   0.200   1.376   1.066   0.515   0.165   1.085
Ours            0.324   0.795   1.038   0.105   0.252   0.418   0.172   0.080   0.398

It is important to note that the spatial and angular dimensions need to be flipped or rotated jointly to maintain the LF structure. We randomly crop LF images into 64×64 grayscale patches to provide more context information for our model. We remove texture-less regions where the mean absolute difference between the center pixel and the other pixels is less than 0.02. The batch size is set to 16, and we use the Adam optimizer [18]. The disparity range is set from –0.5 to 1.6, and the disparity interval is set to 0.1. Our ContextNet is implemented in the framework of TensorFlow on an NVIDIA A100 GPU. The learning rate is initially set to 1 × 10^-3 and decreased by a factor of 0.5 every 30 epochs. The training is stopped after 120 epochs, and we select the best model based on its performance on the validation set. We evaluate our method using two metrics: Mean Squared Error (MSE ×100) and BadPix(ε). MSE ×100 measures the mean squared error over all pixels, multiplied by 100. BadPix(ε) represents the percentage of pixels whose absolute disparity error exceeds a predefined threshold ε, commonly set to 0.01, 0.03, or 0.07.
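For concreteness, the two metrics can be computed as in the following NumPy sketch (masking of any invalid pixels is omitted):

```python
import numpy as np

def mse_x100(pred, gt):
    """Mean squared disparity error over all pixels, multiplied by 100."""
    return 100.0 * np.mean((pred - gt) ** 2)

def badpix(pred, gt, threshold=0.07):
    """Percentage of pixels whose absolute disparity error exceeds the threshold."""
    return 100.0 * np.mean(np.abs(pred - gt) > threshold)

pred, gt = np.random.rand(480, 640), np.random.rand(480, 640)
print(mse_x100(pred, gt), badpix(pred, gt, 0.07))
```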

4.2 Comparison of State-of-the-Art Methods

Qualitative Comparison. We compare our method with four state-of-the-art methods, including LFattNet [25], DistgDisp [29], OACC-Net [27], and SubFocal [2]. Figure 2 shows qualitative comparison results on the Img27, Img54, Img69, and Img70 scenes of the UrbanLF-Syn validation set. Note that these scenes contain large texture-less areas, such as road surfaces. Compared with the other methods, our method has less error overall, especially in texture-less regions, which verifies that learning context information helps depth estimation in texture-less regions.

Fig. 2. Visual comparisons between our method and state-of-the-art methods on the UrbanLF-Syn dataset [22] scenes, i.e., Img27, Img54, Img69 and Img70, including LFattNet [25], DistgDisp [29], OACC-Net [27] and SubFocal [2], with the corresponding BadPix 0.07 error maps. Lower is better. Please zoom in for a better comparison.


Table 2. The benchmark average comparison on the testing set of the UrbanLF-Syn dataset [22] in terms of MSE ×100, BadPix 0.07, BadPix 0.03, and BadPix 0.01. The best results are shown in boldface.

Method        MSE ×100   BadPix 0.07   BadPix 0.03   BadPix 0.01
MultiBranch   2.776      43.402        64.915        86.35
MTLF          1.373      13.034        21.452        41.156
UOAC          0.953      15.582        28.211        53.926
EPI-Cost      1.175      14.637        24.15         47.927
MS3D          0.559      7.917         14.664        31.066
CBPP          0.394      5.907         12.628        27.385
HRDE          0.368      6.205         12.825        27.802
Ours          0.416      6.75          12.649        24.681

Table 3. The average results of different disparity intervals and numbers on 8 scenes of the UrbanLF-Syn [22] validation set in terms of BadPix 0.07 and MSE×100. The best results are shown in boldface.

Disparity Interval   Disparity Number   Disparity Range   MSE ×100   BadPix 0.07
1                    4                  [–1, 2]           1.277      14.367
0.5                  6                  [–0.5, 2]         1.141      13.873
0.3                  8                  [–0.5, 1.6]       1.072      13.614
0.15                 15                 [–0.5, 1.6]       1.063      13.254
0.1                  22                 [–0.5, 1.6]       1.065      12.946
0.05                 43                 [–0.5, 1.6]       1.085      13.201

Quantitative Comparison. We have also conducted quantitative comparison experiments with four state-of-the-art methods [2,25,27,29]. Table 1 shows the comparison results on the UrbanLF-Syn dataset [22] for four metrics: BadPix 0.07, BadPix 0.03, BadPix 0.01, and MSE ×100. Our method ranks first in most scenes and achieves the top average BadPix 0.07, BadPix 0.03, BadPix 0.01, and MSE ×100, outperforming current state-of-the-art methods by a large margin. We have submitted our results to the UrbanLF-Syn dataset website. Table 2 shows that our method is competitive and ranks third place in comprehensive results in the competition for the LFNAT LF Depth Estimation Challenge at the CVPR 2023 Workshop without any post-processing steps (see footnote 1).

1 http://www.lfchallenge.com/dp lambertian plane result/. On the benchmark, the name of our method is called SF-Net.


Table 4. The average results of different variants on 8 scenes of the UrbanLF-Syn [22] validation set in terms of BadPix 0.07 and MSE×100. The best results are shown in boldface.

Variants                                 MSE ×100   BadPix 0.07
baseline                                 1.157      13.230
+ add input training patchsize: 64×64    0.806      11.298
+ change dilated ratio: [1,2,4,8]        0.416      8.758
+ AugSPP: 32×32 average pooling          0.459      8.675
+ AugSPP: concat multi-level feature     0.440      8.223
+ finer interval: 0.1                    0.460      8.184
+ more feature channel: 170              0.398      8.062

4.3 Ablation Study

Disparity Interval of Sub-pixel Cost Volume. We also conduct extensive ablation experiments to validate our method. First, we experiment with different disparity intervals and disparity numbers for the sub-pixel cost volume. In the default setting, the training patchsize is set to 32 and the feature channel of cost aggregation is set to 170. It can be seen from Table 3 that as the disparity interval decreases, the corresponding MSE and BadPix 0.07 generally decrease. Considering efficiency and accuracy, we finally chose a disparity interval of 0.1. Context Information Learning. We validate the different components for context information learning. The training patchsize of the baseline model is set to 32, the feature channel of cost aggregation is set to 96, the disparity interval is 0.15, the number of disparities is 15, and the disparity range is –0.5 to 1.6. As shown in Table 4, the proposed components improve the metrics step by step. Compared with the baseline model, our method achieves improvements of 65.6% and 39% on the MSE ×100 and BadPix 0.07 metrics, respectively.
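The disparity-number column of Table 3 follows directly from the range and the interval; a quick check:

```python
def num_disparity_levels(d_min, d_max, interval):
    """Number of evenly spaced disparity hypotheses, endpoints included."""
    return round((d_max - d_min) / interval) + 1

for d_min, d_max, step in [(-1, 2, 1), (-0.5, 2, 0.5), (-0.5, 1.6, 0.3),
                           (-0.5, 1.6, 0.15), (-0.5, 1.6, 0.1), (-0.5, 1.6, 0.05)]:
    print(d_min, d_max, step, num_disparity_levels(d_min, d_max, step))
    # prints 4, 6, 8, 15, 22, 43 levels, matching Table 3
```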

5 Conclusion and Limitations

In this paper, we propose a method, namely ContextNet, to learn context information for texture-less LF depth estimation. On the one hand, we use dilated convolution and increase the patchsize of training images to enlarge the receptive field of feature extraction. On the other hand, an AugSPP module is designed to improve the overall performance of our method by effectively aggregating features from multi-scale and multi-level. Extensive experiments validate the effectiveness of our method. Our method outperforms state-of-the-art methods on the UrbanLF-Syn dataset and also ranks third place in comprehensive results


in the competition about LFNAT LF Depth Estimation Challenge at the CVPR 2023 Workshop. While our method achieves competitive results in texture-less regions, there is still room for improvement. Regarding the LF depth estimation of texture-less regions, we plan to start from the following aspects in the future. We can further expand the receptive field by fusing the results of monocular depth estimation. Additionally, we may design a post-processing step by diffusing the depth of textured regions to texture-less regions. Finally, we can utilize shape priors for texture-less depth estimation with the help of semantic segmentation maps. Acknowledgement. This work is supported by the National Key Research and Development Project Grant, Grant/Award Number: 2018AAA0100802.

References 1. Chao, W., Duan, F., Wang, X., Wang, Y., Wang, G.: Occcasnet: occlusionaware cascade cost volume for light field depth estimation. arXiv preprint arXiv:2305.17710 (2023) 2. Chao, W., Wang, X., Wang, Y., Wang, G., Duan, F.: Learning sub-pixel disparity distribution for light field depth estimation. TCI Early Access, 1–12 (2023) 3. Chen, J., Zhang, S., Lin, Y.: Attention-based multi-level fusion network for light field depth estimation. In: AAAI, pp. 1009–1017 (2021) 4. Chen, J., Chau, L.: Light field compressed sensing over a disparity-aware dictionary. TCSVT 27(4), 855–865 (2017) 5. Chen, Y., Zhang, S., Chang, S., Lin, Y.: Light field reconstruction using efficient pseudo 4d epipolar-aware structure. TCI 8, 397–410 (2022) 6. Cheng, Z., Liu, Y., Xiong, Z.: Spatial-angular versatile convolution for light field reconstruction. TCI 8, 1131–1144 (2022) 7. Cheng, Z., Xiong, Z., Chen, C., Liu, D., Zha, Z.J.: Light field super-resolution with zero-shot learning. In: CVPR, pp. 10010–10019 (2021) 8. Guo, C., Jin, J., Hou, J., Chen, J.: Accurate light field depth estimation via an occlusion-aware network. In: ICME, pp. 1–6 (2020) 9. Han, K., Xiang, W., Wang, E., Huang, T.: A novel occlusion-aware vote cost for light field depth estimation. TPAMI 44(11), 8022–8035 (2022) 10. He, L., Wang, G., Hu, Z.: Learning depth from single images with deep neural network embedding focal length. TIP 27(9), 4676–4689 (2018) 11. Heber, S., Pock, T.: Convolutional networks for shape from light field. In: CVPR, pp. 3746–3754 (2016) 12. Honauer, K., Johannsen, O., Kondermann, D., Goldluecke, B.: A dataset and evaluation methodology for depth estimation on 4D light fields. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10113, pp. 19–34. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54187-7 2 13. Jeon, H.G., et al.: Accurate depth map estimation from a lenslet light field camera. In: CVPR, pp. 1547–1555 (2015) 14. Jin, J., Hou, J., Chen, J., Kwong, S.: Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization. In: CVPR, pp. 2260–2269 (2020)


15. Jin, J., Hou, J., Chen, J., Zeng, H., Kwong, S., Yu, J.: Deep coarse-to-fine dense light field reconstruction with flexible sampling and geometry-aware fusion. TPAMI 44, 1819–1836 (2020) 16. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI 37(9), 1904–1916 (2015) 17. Kim, C., Zimmer, H., Pritch, Y., Sorkine-Hornung, A., Gross, M.H.: Scene reconstruction from high spatio-angular resolution light fields. TOG 32(4), 73–1 (2013) 18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 19. Meng, N., So, H.K.H., Sun, X., Lam, E.Y.: High-dimensional dense residual convolutional neural network for light field reconstruction. TPAMI 43(3), 873–886 (2019) 20. Ng, R., Levoy, M., Br´edif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photography with a hand-held plenoptic camera. Ph.D. thesis, Stanford University (2005) 21. Peng, J., Xiong, Z., Wang, Y., Zhang, Y., Liu, D.: Zero-shot depth estimation from light field using a convolutional neural network. TCI 6, 682–696 (2020) 22. Sheng, H., Cong, R., Yang, D., Chen, R., Wang, S., Cui, Z.: Urbanlf: a comprehensive light field dataset for semantic segmentation of urban scenes. TCSVT 32(11), 7880–7893 (2022) 23. Shin, C., Jeon, H.G., Yoon, Y., Kweon, I.S., Kim, S.J.: Epinet: a fully-convolutional neural network using epipolar geometry for depth from light field images. In: CVPR, pp. 4748–4757 (2018) 24. Tao, M.W., Hadap, S., Malik, J., Ramamoorthi, R.: Depth from combining defocus and correspondence using light-field cameras. In: ICCV, pp. 673–680 (2013) 25. Tsai, Y.J., Liu, Y.L., Ouhyoung, M., Chuang, Y.Y.: Attention-based view selection networks for light-field disparity estimation. In: AAAI, pp. 12095–12103 (2020) 26. Van Duong, V., Huu, T.N., Yim, J., Jeon, B.: Light field image super-resolution network via joint spatial-angular and epipolar information. TCI 9, 350–366 (2023) 27. Wang, Y., Wang, L., Liang, Z., Yang, J., An, W., Guo, Y.: Occlusion-aware cost constructor for light field depth estimation. In: CVPR, pp. 19809–19818 (2022) 28. Wang, Y., Wang, L., Liang, Z., Yang, J., Timofte, R., Guo, Y.: Ntire 2023 challenge on light field image super-resolution: dataset, methods and results. arXiv preprint arXiv:2304.10415 (2023) 29. Wang, Y., et al.: Disentangling light fields for super-resolution and disparity estimation. TPAMI 45, 425–443 (2022) 30. Wang, Y., Yang, J., Guo, Y., Xiao, C., An, W.: Selective light field refocusing for camera arrays using bokeh rendering and superresolution. SPL 26(1), 204–208 (2018) 31. Wanner, S., Goldluecke, B.: Variational light field analysis for disparity estimation and super-resolution. TPAMI 36(3), 606–619 (2014) 32. Williem, W., Park, I.K.: Robust light field depth estimation for noisy scene with occlusion. In: CVPR, pp. 4396–4404 (2016) 33. Wu, G., Liu, Y., Fang, L., Dai, Q., Chai, T.: Light field reconstruction using convolutional network on epi and extended applications. TPAMI 41(7), 1681–1694 (2018) 34. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015) 35. Yu, J.: A light-field journey to virtual reality. TMM 24(2), 104–112 (2017) 36. Zhang, S., Lin, Y., Sheng, H.: Residual networks for light field image superresolution. In: CVPR, pp. 11046–11055 (2019)


37. Zhang, S., Sheng, H., Li, C., Zhang, J., Xiong, Z.: Robust depth estimation for light field via spinning parallelogram operator. CVIU 145, 148–159 (2016) 38. Zhang, Y., et al.: Light-field depth estimation via epipolar plane image analysis and locally linear embedding. TCSVT 27(4), 739–747 (2016) 39. Zhang, Y., Dai, W., Xu, M., Zou, J., Zhang, X., Xiong, H.: Depth estimation from light field using graph-based structure-aware analysis. TCSVT 30(11), 4269–4283 (2019) 40. Zhu, H., Wang, Q., Yu, J.: Occlusion-model guided antiocclusion depth estimation in light field. J-STSP 11(7), 965–978 (2017)

An Efficient Way for Active None-Line-of-Sight: End-to-End Learned Compressed NLOS Imaging Chen Chang, Tao Yue, Siqi Ni, and Xuemei Hu(B) School of Electronic Science and Engineering, Nanjing University, Nanjing, China [email protected]

Abstract. Non-line-of-sight (NLOS) imaging is an emerging detection technique that uses multiple reflections of a transmitted beam to capture scenes beyond the user's field of view. Due to its high reconstruction quality, active transient NLOS imaging has been widely investigated. However, much of the existing work has focused on optimizing reconstruction algorithms while neglecting the time cost of data acquisition. Conventional imaging systems use mechanical point-by-point scanning, which is time-consuming and does not exploit the sparsity of NLOS objects. In this paper, we propose to realize NLOS imaging in an efficient way, based upon the theory of compressive sensing (CS). To reduce data volume and acquisition time, we introduce end-to-end CS imaging to learn an optimal CS measurement matrix for efficient NLOS imaging. Through quantitative and qualitative experimental comparison with SOTA methods, we demonstrate an improvement of at least 1.4 dB in PSNR for the reconstructed depth map compared to using partial Hadamard sensing matrices. This work will effectively advance the real-time performance and practicality of NLOS imaging. Keywords: Non-line-of-sight · Compressed sensing · Single pixel imaging

1 Introduction The ability to image scenes hidden beyond the visible field of view (FOV) will significantly expand the imaging limits of existing technology, with important research implications and widespread use in machine vision, military, remote sensing, medical imaging and autonomous vehicle applications. Unlike traditional imaging techniques, NLOS imaging computationally reconstructs information of scenes outside the FOV by capturing indirect signals transmitted from the scene. In the last decade, NLOS imaging has attracted much attention, and imaging systems based on different detection principles have been proposed, including transient imaging [1–5], speckle correlations [6], thermal imaging [7], acoustic imaging [8], wave-front shaping [9, 10], and occlusionbased imaging [11, 12], with increasing image quality and reconstruction speed. Among these methods, transient imaging based on active laser scanning has become the main method to study NLOS due to its complete system construction and high spatial resolution. However, there are still many problems to be solved, such as long acquisition time, © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 Q. Liu et al. (Eds.): PRCV 2023, LNCS 14430, pp. 28–40, 2024. https://doi.org/10.1007/978-981-99-8537-1_3


low SNR, limited spatial resolution, the need of accurate priors, etc. In particular, long acquisition time has become a bottleneck restricting the practical application. To solve the problem with reasonable reconstruction quality, some research works propose to reduce the number of illumination points directly, such as random downsampling [19] and circular scanning instead of grid scanning [20]. Besides, SPAD arrays [21, 22] are also proposed for NLOS detection, eliminating the requirement of scanning, while with a high cost and limited spatial resolution. Based upon the theory of CS, single-pixel imaging is introduced for NLOS imaging with random or Hadamard masks, e.g., single-pixel detection [18] and spatial multiplexing detection (SMD) [13]. However, these commonly utilized CS mask do not necessarily fully utilize the sparsity characteristics of transient data, which is worth exploring for efficient NLOS sampling. In our work, we consider the advantages of single-pixel imaging in reducing acquisition time, and explore the possibility of incorporating CS into acquisition. Meanwhile, based upon the compressed measurement data, we propose a neural network called CSUNET for efficient NLOS reconstruction. The contributions of this paper can be summarized as below: First, we construct an end-to-end neural network with a learnable module to jointly learn the optimal NLOS sensing masks and optimize reconstruction. Second, to promote the convergence of the end-to-end NLOS imaging network, we propose a segment training strategy, and a content-aware MSE loss function for elegant convergence of the decompression module. Third, we demonstrate the proposed NLOS CS imaging method with extensive experiments, specifically, the proposed method could realize a PSNR gain of at least 1.4 dB compared to existing methods. Besides, we demonstrate the effectiveness of the proposed method under big noise and extreme low compression ratios.

2 Sensing Methods

2.1 Forward Model for Active NLOS Imaging

Existing active transient imaging systems typically treat the illumination point and the detection point as the same point, called confocal mode [1]. For simplicity, we define the location of the wall as the z = 0 plane; the system and the hidden object locate in the 3D half-space Ω where z > 0. The coordinates of a hidden point m on the hidden object surface are defined as (x, y, z), the illumination point l on the wall as (x_l, y_l, 0), and the photon transient histogram corresponding to the detection point p can be expressed as:

\tau(x_l, y_l, t) = \int_{\Omega} \frac{1}{r_l^4} \rho(x, y, z)\, \delta(2 r_l - tc)\, dx\, dy\, dz,    (1)

where r_l is the distance between l/p and m, c is the speed of light, and ρ(x, y, z) is the albedo of m. The Dirac function δ relates time to space. To reconstruct the scene with a spatial resolution of w × w, w × w points have to be scanned and detected, and the 1D photon histograms τ(x_l, y_l, t) obtained from each scanning point are assembled into the 3D transient data τ of size w × w × t_res, where t_res denotes the time bin of the TCSPC.
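To make Eq. (1) concrete, a discretized confocal forward model can be simulated by binning the round-trip travel time of each hidden-scene voxel, as in the sketch below; the grid size, bin width, and albedo values are arbitrary illustration parameters, not the acquisition settings used in the paper.

```python
import numpy as np

def confocal_histogram(albedo, voxel_xyz, lx, ly, t_res=512, bin_width=1e-11, c=3e8):
    """Discretized version of Eq. (1) for one confocal scan point (lx, ly, 0).
    albedo: (N,) albedo of each hidden-scene voxel; voxel_xyz: (N, 3) voxel centers."""
    hist = np.zeros(t_res)
    d = voxel_xyz - np.array([lx, ly, 0.0])
    r = np.linalg.norm(d, axis=1)                            # wall point -> voxel distance
    bins = np.floor(2 * r / (c * bin_width)).astype(int)     # round-trip time bin
    valid = (bins >= 0) & (bins < t_res) & (r > 0)
    np.add.at(hist, bins[valid], albedo[valid] / r[valid] ** 4)
    return hist

# Example: a single hidden patch 0.5 m away from the scan point
hist = confocal_histogram(np.array([1.0]), np.array([[0.0, 0.0, 0.5]]), 0.0, 0.0)
print(hist.argmax())   # bin index of the ~3.3 ns round-trip return
```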


2.2 Forward Model Combined with Compressed Sensing

As shown in Fig. 1(a), the transient data is sparse both spatially and temporally, making it possible to reduce the number of masks for compression. We assume that after the laser illuminates a point on the wall, a DMD is used to collect photons returned from all points within the detection area. Let the set of illumination points be L and the detection area be D. For an illumination point (x_{l,i}, y_{l,j}, 0) ∈ L and a measurement point (x, y, z), the transient histogram of the whole detection area can be expressed as:

\tau(x_l, y_l, t) = \sum_{(x_p, y_p, 0) \in D} \int_{\Omega} \frac{1}{r_l^2 r_p^2} \rho(x, y, z)\, \delta(r_l + r_p - ct)\, dx\, dy\, dz    (2)


Fig. 1. Transient data and the process of compression.

Let M be the set of masks that the DMD displays, containing m masks with a spatial resolution of w × w. The DMD scrolls through these masks to obtain m 1D histograms, assembled in width to form the modulated photon transient data:

\tau'(x_{l,i}, y_{l,j}, t) = M \cdot \tau(x_{l,i}, y_{l,j}, t),    (3)

where · represents dot-multiplication and summation at each transient frame. The size of the modulated data is m × t_res. When m = w × w, it is theoretically possible to solve τ(x_{l,i}, y_{l,j}, t) directly from τ'(x_{l,i}, y_{l,j}, t). Every scanned point (x_{l,i}, y_{l,j}, 0) ∈ L corresponds to a matrix τ'(x_{l,i}, y_{l,j}, t). To reduce acquisition time and the number of scanned points, while maintaining a uniform distribution of returned photon signals, the illumination point l is located in the center of the detection area. We define the data compression ratio (CR) as: CR =

m m × tres × 100% 4) leads to heightened model complexity, thereby increasing the risk of overfitting.

5 Conclusion

We propose an Asymmetric Attention Fusion Network based on optical flow for Unsupervised Video Object Segmentation. In our work, the asymmetric attention fusion module performs attention interaction between two modalities and extracts temporal information from motion features, which is then passed on to appearance features. The feature correction module further extracts deeper features while maintaining the balance between the two modalities. With the assistance of these two modules, our network achieves superior segmentation accuracy compared to other state-of-the-art segmentation models. Acknowledgement. This work was supported in part by the National Natural Science Foundation of China (62020106012, 62106089).



Flow-Guided Diffusion Autoencoder for Unsupervised Video Anomaly Detection Aoni Zhu, Wenjun Wang, and Cheng Yan(B) Tianjin University, Tianjin, China {zhu aoni,wjwang,yancheng work}@tju.edu.cn

Abstract. Video anomaly detection (VAD) aims to automatically detect abnormalities that deviate from expected behaviors. Due to the heavy reliance of mainstream one-class methods on labeled normal samples, some unsupervised VAD methods have emerged. However, these methods are unable to comprehensively detect both appearance and motion anomalies in videos. To address the above problem, we present for the first time a Flow-guided Diffusion AutoEncoder (FDAE) that generates the objects of each frame to detect anomalies in an unsupervised manner. Our model takes foreground objects and motion information as inputs to train a conditional diffusion autoencoder for foreground reconstruction. To make our model concentrate on learning normal samples, we further design a sample refinement scheme and introduce a mixed Gaussian clustering network to enhance the capability of the diffusion model in capturing typical characteristics of normal samples. Comprehensive experiments on three publicly available datasets demonstrate that the proposed FDAE outperforms all competing unsupervised approaches.

Keywords: Video anomaly detection · Unsupervised learning · Diffusion model

1 Introduction

Video anomaly detection (VAD) is an important task of identifying abnormal events that deviate from normality in real applications, e.g., intelligent surveillance systems. The task is challenging due to the impossibility of collecting all anomalous events, which are highly varied in the real world. There have been many methods introduced for VAD over the years [1–11]. The majority of these methods follow the one-class VAD approach, which assumes that only normal samples are available during training and then learns patterns from normal samples to detect events that differ from these patterns as anomalies. Convolutional-neural-network-based autoencoders [12] and U-Net [13] are popular frameworks in one-class VAD for learning such patterns with sufficient normal data. However, these methods face two problems: 1) labeling normal samples from massive data is extremely labor-intensive, and 2) in natural scenes, the

learned normal pattern may be close to the abnormal one once the difference is slight; e.g., background changes from fallen leaves or shade may be more notable than an inconspicuous stealing action. To overcome the annotation problem, a few unsupervised VAD methods have been proposed, such as [6–11]. Early methods [6–9] mainly design classifiers to distinguish normal and abnormal video frames while ignoring the influence of the background. The discrepancy in the background usually affects the performance of these classifiers. To this end, object-detection-based methods have been proposed [10] to concentrate the model on foreground learning. These models can effectively identify appearance anomalies but ignore action anomalies. Action-based methods [11] employ an action recognition network to acquire motion features, but exhibit a substantial dependence on the capacity of the pre-trained action networks. How to effectively detect both appearance and motion anomalies in unsupervised VAD is still a challenge. In this paper, we present a novel Flow-guided Diffusion AutoEncoder (FDAE) that learns to integrate the appearance and motion information of normal samples for unsupervised VAD. Our method is based on an autoencoder architecture that captures the spatio-temporal information of foreground objects by utilizing consecutive video frames. Different from previous works [6–9] that take the whole frame as input, we employ object detection and optical flow methods to obtain a time series of bounding boxes and optical flows as input samples. These appearance and motion features are used to train our diffusion autoencoder, in which the optical flows serve as a condition for foreground object learning. To reduce the influence of abnormal motion features, we design a mixed Gaussian clustering network that contains multiple Gaussian components to reconstruct optical flows for feature filtering. Finally, we design a sample refinement scheme that alleviates the adverse effects of abnormal features and helps the model to discriminate normal samples from anomalies. The four contributions of our research are summarized as follows:
– We propose a novel unsupervised VAD method, which for the first time introduces a flow-guided diffusion model into unsupervised VAD to learn the spatio-temporal features of foreground objects in consecutive frames.
– We introduce a mixed Gaussian clustering network to cluster important features and enhance the learning process.
– We design an efficient sample refinement scheme to concentrate our model on normal feature learning.
– Our experimental results show that the proposed method outperforms existing unsupervised VAD methods on standard benchmarks.

2 Related Work

2.1 Video Anomaly Detection

Unsupervised VAD. Benefiting from training data that requires no labeling, unsupervised VAD has attracted increasing attention from researchers and has become an important application in the video analysis field. Early unsupervised methods rely on the analysis of hand-crafted features extracted from each frame of a video, such as the histogram of gradients (HoG) and histogram of flows (HoF) [6], and then train a linear classifier to determine anomalies [7,8]. With the development of convolutional neural networks, an end-to-end deep unsupervised VAD method has been proposed [9]. Its key step is to calculate an anomaly score for each frame as a label through other unsupervised models, to improve the classifier's discriminatory power. As a result, the performance relies heavily on the performance of the extra model. Based on their anomaly scores, recent methods [10,11] have emerged that utilize data pre-processing to eliminate abnormal samples during training. This approach ensures that the model learns features solely from normal samples, even in unlabeled data.

Clustering-Based VAD. For better normal data learning, problem-specific memory modules have been incorporated into VAD methods to store the significant features of the training set, using a cosine similarity measure to cluster the memory items. Gong et al. [1] integrated a memory module into their deep model to store features relevant to normal samples, so that, at inference, anomalies are difficult to reconstruct from these stored features. Liu et al. [3] applied a memory module to store optical flows that provide video motion information for feature learning. Park et al. [2] compressed the memory module to maintain the performance while reducing the model size. In addition, Fan et al. [14] utilized a two-stream mixed Gaussian clustering network to learn video appearance and motion features and then fused the information over the anomaly scores. However, all of the above belong to one-class VAD, and no clustering network has been proposed for unsupervised VAD.

2.2 Diffusion Models

Recently, the diffusion models have achieved SOTA performance on many generative tasks [15–18]. The diffusion model leverages the diffusion process to generate high-quality samples by denoising corrupted data. Denoising Diffusion Probabilistic Models (DDPM) [16] uses the denoising score matching objective, where the model is trained to estimate the score of the data distribution at each diffusion step. On the other hand, Denoising Diffusion Implicit Models (DDIM) [17] extend the concept of DDPM by introducing an implicit model formulation, allowing for more flexible and expressive modeling. In addition to applying it to image generation tasks, Diffusion Autoencoders [18] utilizes the DDIM process to gradually transform the input data into a more meaningful and decodable representation, which is achieved by adding high-level semantics to the latent space representation and encouraging the preservation of information during each diffusion step.

3 Method

The framework is shown in Fig. 1. The FDAE mainly consists of three modules: the Mixed Gaussian Clustering Network (MGCN), the Conditional Diffusion AutoEncoder (CDAE), and the Refinement Schema (RS). In the training stage, following [4], we employ Cascade R-CNN [19] and FlowNet [20] as the foreground information extractor to obtain the object bounding boxes of each frame and the corresponding optical flows from the previous t_f consecutive video frames. The MGCN is designed to learn normal features from unlabeled videos. These optical flows are then fed into the CDAE together with the bounding boxes of frame t_f + 1 for diffusion learning. The RS ensures that the CDAE learns from normal samples. During the test phase, the model generates the bounding boxes of frame t_f + 1 and detects bounding boxes with large reconstruction errors as anomalous.

Fig. 1. The framework of our model FDAE. The optical flows y_{1:t_f} of t_f consecutive video frames x_{1:t_f} are extracted as input to the proposed Mixed Gaussian Clustering Network (MGCN) to obtain the reconstructed optical flows ŷ_{1:t_f}. Then, the foreground object x_{t_f+1} of the (t_f + 1)-th video frame is extracted by R-CNN as the input x_0 of the Conditional Diffusion AutoEncoder (CDAE) for diffusion training. The Refinement Schema (RS) is used for normal sample selection.

3.1 Mixed Gaussian Clustering Network

The MGCN in our framework is inspired by [14], which extends a variational autoencoder to support clustering. Through this network, the optical flow y ∼ N(μ_y, σ_y² I) of each frame is clustered into one of K components and reconstructed as ŷ ∼ N(μ_y, σ_y² I) by the following process: 1) each component c_k of the mixed Gaussian clustering is initialized by K-Means; 2) the latent variable z is sampled from c; 3) z generates the reconstructed optical flow ŷ through the decoder f parameterized by θ. This can be defined as follows:

p(c) = Cat(π),  (1)

p(z | c) = N(z | μ_c, σ_c² I),  (2)

f(z; θ) = [μ_y; log σ_y²],  (3)

p(ŷ | z) = N(μ_y, σ_y² I),  (4)

where π_k is the prior probability of component c_k, π ∈ R_+^K, and Σ_{k=1}^{K} π_k = 1. Similar to the variational autoencoder [12], the encoder g parameterized by φ is utilized to model q(z | y):

g(y; φ) = [μ̃, log σ̃²],  (5)

q(z | y) = N(z; μ̃, σ̃² I),  (6)

which yields the evidence lower bound (ELBO) of the log-likelihood:

log p(y) = log ∫_z Σ_c p(y, z, c) dz ≥ E_{q(z,c|y)}[ log ( p(y, z, c) / q(z, c | y) ) ] = L_ELBO(y).  (7)

Further expansion gives the following expression for the ELBO:

L_ELBO(y) = E_{q(z,c|y)}[ log p(y, z, c) − log q(z, c | y) ]
          = E_{q(z,c|y)}[ log p(y | z) + log p(z | c) + log p(c) − log q(z | y) − log q(c | y) ],  (8)

where q(c | y) can be approximated by Monte Carlo sampling as follows:

q(c | y) = E_{q(z|y)}[ p(c | z) ] ≈ (1/L) Σ_{l=1}^{L} [ p(c) p(z^(l) | c) / Σ_{c'=1}^{K} p(c') p(z^(l) | c') ],  (9)

where z^(l) ∼ N(μ̃, σ̃² I) and L is the number of Monte Carlo samples. By substituting Eqs. 2, 3, 5, 6 and 9 into Eq. 8, and using the SGVB estimator and the reparameterization trick, the loss function of the MGCN is obtained as follows:

L_MGCN = α · ‖y − ŷ‖₂²
       + β · [ Σ_{c=1}^{K} (γ_c / 2) ( log σ_c² + σ̃²/σ_c² + (μ̃ − μ_c)²/σ_c² )
             − Σ_{c=1}^{K} γ_c ( log(π_c / γ_c) + ½ (1 + log σ̃²) ) ],  (10)

where ‖·‖₂² is the squared L2 distance, α and β are hyperparameters that control the balance, and γ_c denotes q(c | y) for simplicity.
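The following minimal sketch shows how the terms of Eq. 10 can be assembled once the encoder, decoder and mixture parameters are available; the variable names and the per-dimension simplifications are ours, not the authors' implementation.

```python
import numpy as np

def mgcn_loss(y, y_hat, mu_tilde, log_var_tilde, pi, mu_c, var_c, gamma,
              alpha=1.0, beta=0.5):
    """Sketch of Eq. (10).
    y, y_hat: flattened optical flow and its reconstruction, shape (d,).
    mu_tilde, log_var_tilde: parameters of q(z | y).
    pi: (K,) mixture priors; mu_c, var_c: (K, d) component means / variances.
    gamma: (K,) responsibilities approximating q(c | y)."""
    var_tilde = np.exp(log_var_tilde)
    recon = np.sum((y - y_hat) ** 2)                       # alpha * ||y - y_hat||^2 term
    kl_z = sum(gamma[c] * 0.5 * np.sum(np.log(var_c[c])
                                       + var_tilde / var_c[c]
                                       + (mu_tilde - mu_c[c]) ** 2 / var_c[c])
               for c in range(len(pi)))                    # component-matching term
    kl_c = -np.sum(gamma * (np.log(pi / (gamma + 1e-12))
                            + 0.5 * np.sum(1.0 + log_var_tilde)))
    return alpha * recon + beta * (kl_z + kl_c)
```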

3.2 Conditional Diffusion AutoEncoder

Motivated by [18], the motion features y_{1:t_f} of t_f consecutive frames are utilized to generate the latent variable z_flow as a condition for the CDAE, whose noise prediction network ε_θ adopts the U-Net [13] architecture, thereby integrating appearance and motion features. Specifically, we add noise to the bounding boxes x_0 of frame t_f + 1 up to x_T through a Gaussian diffusion process with t ∼ Uniform(1, ..., T) steps, formalized as follows:

q(x_t | x_{t−1}) = N(√(1 − β_t) x_{t−1}, β_t I),  (11)

where {β_t}_{t=1}^{T} are the variances used for each step. As a result, the noisy image x_t at step t can be expressed in terms of x_0 as follows:

q(x_t | x_0) = N(√α_t x_0, (1 − α_t) I),  (12)

where α_t = Π_{i=1}^{t} (1 − β_i). In turn, the denoising process and its distribution can be expressed as follows:

x_{t−1} = √α_{t−1} · (x_t − √(1 − α_t) ε_θ(x_t)) / √α_t + √(1 − α_{t−1}) ε_θ(x_t),  (13)

q(x_{t−1} | x_t, x_0) = N( √α_{t−1} x_0 + √(1 − α_{t−1}) · (x_t − √α_t x_0) / √(1 − α_t), 0 ).  (14)

The motion features of each set of t_f consecutive frames ∈ R^{c×h×w} are transformed into a non-spatial vector z_flow by a simple linear transformation, where z_flow ∈ R^d with d = 512. As in [21], the condition is injected using adaptive group normalization (AdaGN) layers as follows:

AdaGN(h, t, z_flow) = y_s (t_s GroupNorm(h) + t_b),  (15)

where h ∈ R^{c×h×w} is the feature map produced by the FDAE encoder from x_t, GroupNorm(·) is group normalization [22], y_s ∈ R^c = Affine(z_flow), and (t_s, t_b) ∈ R^{2×c} = MLP(ψ(t)) is the output of a multilayer perceptron applied to a sinusoidal encoding ψ(·) of the step t. Through the involvement of z_flow, the FDAE models p_θ(x_{t−1} | x_t, z_flow), which corresponds to the denoising distribution q(x_{t−1} | x_t, x_0) defined in Eq. 14, with the following procedure:

p_θ(x_{0:T} | z_flow) = p(x_T) Π_{t=1}^{T} p_θ(x_{t−1} | x_t, z_flow).  (16)

The following loss function is optimized to train the FDAE:

L_FDAE = Σ_{t=1}^{T} E_{x_0, ε_t} [ ‖ε_θ(x_t, t, z_flow) − ε_t‖₂² ],  (17)

where ε_t ∈ R^{3×h×w} ∼ N(0, I). The reconstruction x̂_0 is recovered from the noise ε_θ predicted by the FDAE as follows:

x̂_0(x_t, t, z_flow) = (x_t − √(1 − α_t) ε_θ(x_t, t, z_flow)) / √α_t.  (18)

Because z_flow has limited capacity to compress stochastic details, and in order to encourage x_T to encode only the information missed by z_flow, the noisy image x_{t+1} at step t + 1 is obtained during the testing phase as follows:

x_{t+1} = √α_{t+1} x̂_0(x_t, t, z_flow) + √(1 − α_{t+1}) ε_θ(x_t, t, z_flow).  (19)

After generating the noise, the reconstructed image x̂_0 is output using the inverse diffusion process of Eq. 13.
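As a rough illustration of Eqs. 12 and 17, the sketch below draws a noisy sample and regresses the injected noise conditioned on the optical-flow code; `eps_model` stands in for the conditional U-Net ε_θ with AdaGN and is an assumption, not the released code.

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I), Eq. (12)."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)          # assumes x0 has shape (B, C, H, W)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

def fdae_loss(eps_model, x0, z_flow, alpha_bar):
    """Eq. (17): predict the injected noise, conditioned on the flow code z_flow."""
    t = torch.randint(0, alpha_bar.numel(), (x0.size(0),), device=x0.device)
    x_t, eps = q_sample(x0, t, alpha_bar)
    eps_hat = eps_model(x_t, t, z_flow)          # U-Net with AdaGN conditioning
    return ((eps_hat - eps) ** 2).flatten(1).sum(dim=1).mean()
```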

3.3 Sample Refinement

Inspired by the normality advantage from [10], we design a sample refinement scheme that motivates the model to learn normal features from unlabeled videos. During training, our model discards the samples most suspected of being anomalous. Specifically, every n epochs, the model calculates the anomaly score for each frame of the entire training set and sorts the scores from largest to smallest. The top λ percent of video frames are then excluded from the subsequent n epochs of training, as sketched below.
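A schematic version of this filter is given below; the score computation is assumed to happen elsewhere, and `drop_ratio` plays the role of λ.

```python
import numpy as np

def refine_training_set(scores, drop_ratio=0.10):
    """scores: per-frame anomaly scores over the whole training set.
    Returns the indices of frames kept for the next n epochs."""
    n_drop = int(len(scores) * drop_ratio)
    order = np.argsort(scores)             # ascending: most suspicious frames last
    keep = order[: len(scores) - n_drop]   # discard the top lambda percent
    return np.sort(keep)
```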

4 Experiments

4.1 Implementation Details

Dataset. Three datasets, UCSD Ped2 [23], CUHK Avenue [24], and ShanghaiTech [25], are used for the experiments.
– UCSD Ped2 contains 16 training and 12 testing video clips, with a total of 4560 frames captured by fixed cameras. All video frames are grayscale images, and most of the anomalies are appearance anomalies.
– CUHK Avenue contains 16 training and 21 testing video clips, with a total of 30,652 frames captured with camera shake. Its abnormal events include pedestrians sprinting and pedestrians pushing bicycles in the street.
– ShanghaiTech contains 130 abnormal events and over 270,000 training frames, where the abnormal events are mainly abnormal human movements. The dataset includes 13 scenes captured from various camera angles and under different lighting conditions.

Training Settings. Abnormal samples only exist in the testing sets of the above three datasets, so the training and testing sets are combined for training, following [9]. The condition latent variable y_{1:t_f} is derived from the optical flow of 4 consecutive frames. The balancing parameters α and β of the loss L_MGCN are set to 1.0 and 0.5, respectively. To reduce the impact of anomalous frames on model training, λ is set to 10%. The number of mixed Gaussian components k is set to 0, 5, and 10 to train the model, and the corresponding results are shown in Table 2. The batch size and the number of training epochs are set to 256 and 60, respectively.

Fig. 2. Anomaly scores calculated by the model on Ped2 and Avenue; the orange boxed areas mark the anomalies. (Color figure online)

4.2 Anomaly Score

The anomaly score S of the video frame at each time t_f is calculated from two parts: (1) the mean squared error of the MGCN-reconstructed optical flow, S_op = ‖ŷ_{t_f} − y_{t_f}‖₂², and (2) the mean squared error of the FDAE-reconstructed image, S_ap = ‖x̂_{t_f} − x_{t_f}‖₂². The two are combined into the anomaly score S as follows:

S = (S_op − μ_op)/σ_op + (S_ap − μ_ap)/σ_ap,  (20)

where μ_op, σ_op, μ_ap, σ_ap are the means and standard deviations of S_op and S_ap, respectively. The larger S is, the more likely the frame is anomalous (see Fig. 2 for an example of the anomaly score).
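The scoring in Eq. 20 amounts to z-normalizing the two reconstruction errors over a video and summing them, as in the following sketch (names are illustrative):

```python
import numpy as np

def anomaly_scores(flow_err, frame_err, eps=1e-8):
    """flow_err, frame_err: per-frame MSE arrays from the MGCN and the FDAE."""
    s_op = (flow_err - flow_err.mean()) / (flow_err.std() + eps)
    s_ap = (frame_err - frame_err.mean()) / (frame_err.std() + eps)
    return s_op + s_ap   # larger value -> more likely anomalous
```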

4.3 Comparison with Existing Methods

In Table 1, we compare our method with six existing state-of-the-art unsupervised VAD methods: the Ordinal Regression method (OR) [9], the Classifier Two-Sample method (CTS) [8], Unmasking the Abnormal Events (UAE) [7], the Discriminative Framework (DF) [6], Self-paced Refinement (SR) [10], and Generative Cooperative Learning (GCL) [11]. We also present four groups of our method for the ablation study. Following [1,2], the Area Under Curve (AUC) is used to evaluate the performance. As shown in Table 1, our method, which integrates appearance and motion features, achieves superior performance compared to existing methods.

Table 1. AUC of different unsupervised VAD methods.

Method    Ped2    Avenue   ShanghaiTech
DF [6]    63.0%   78.3%    -
UAE [7]   82.2%   80.6%    -
CTS [8]   87.5%   84.4%    -
OR [9]    83.2%   -        -
SR [10]   97.2%   90.7%    72.6%
GCL [11]  -       -        72.4%
FDAE      97.7%   91.1%    73.8%

4.4 Ablation Study

Disturbance. Since the training samples have no annotations, the mixed Gaussian distribution intuitively retains key information about both normal and abnormal features. However, anomalies are widely recognized as a minority in the real world [5,9,26]. Based on this fact, we assume that the mixed Gaussian distribution preserves much less information about anomalies than about normal samples. To verify this assumption, inspired by [26], we design an operation that disturbs a component c_k by:

c_k → c̃_k = c_k + N(0, 1 − Σ_{q^j ∈ Q} I_{U^d}(q^j) / |Q|),  (21)

where Q is the feature map output by the encoder, q^j is a feature vector in Q, U^d is the set of features whose cosine similarity with c_k is the highest, c̃_k is the component after the disturbance, and I_{U^d}(·) is the indicator function:

I_{U^d}(q^j) = 1 if q^j ∈ U^d, and 0 if q^j ∉ U^d.  (22)
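One possible reading of Eqs. 21 and 22 is sketched below, where each component receives Gaussian noise whose variance shrinks as more encoder features are assigned to it by cosine similarity; the implementation details are our assumption.

```python
import numpy as np

def disturb_components(components, features):
    """components: (K, d) Gaussian component means; features: (M, d) encoder outputs."""
    comp_n = components / np.linalg.norm(components, axis=1, keepdims=True)
    feat_n = features / np.linalg.norm(features, axis=1, keepdims=True)
    nearest = np.argmax(feat_n @ comp_n.T, axis=1)     # indicator I_{U^d}(q^j)
    disturbed = components.copy()
    for k in range(len(components)):
        frac = np.mean(nearest == k)                   # |U^d| / |Q| for component k
        std = np.sqrt(max(1.0 - frac, 0.0))
        disturbed[k] += np.random.normal(0.0, std, size=components.shape[1])
    return disturbed
```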


In the experiment, we disturb all the original c_k by Eq. 21 and set different numbers of Gaussian components. The results are summarized in Table 2, where it can be seen that both removing the MGCN from training and disturbing the Gaussian components lead to worse accuracy, which validates that the MGCN plays an auxiliary role in learning normal-sample features for the FDAE.

Table 2. AUC with different configurations of mixed Gaussian components. 0 denotes training without the MGCN; 5 and 10 denote mixed Gaussian clustering with 5 and 10 components, respectively; ∗ denotes using the disturbance.

ck   Ped2    Avenue   ShanghaiTech
0    96.5%   87.8%    69.2%
10   88.9%   89.1%    71.2%
5∗   93.6%   88.8%    70.1%
5    97.7%   91.1%    73.8%

Content of Condition. To verify the guiding role of optical flow, we carry out an experimental comparison of the content of the condition, which is either the bounding boxes or the optical flows of the t_f video frames. As seen in Table 3, the better results obtained with optical flow as the condition confirm the positive effect of optical flow on the model's learning of spatio-temporal information.

Transformation of Condition. In contrast to [18], our model employs a linear transformation to obtain the condition, as opposed to a parameterized network structure. Table 4 compares the two ways of obtaining the conditional latent variable z_flow, i.e., a Fully Connected Layer (FCL) and a Linear Transformation (LT). The results demonstrate that the simple linear transformation yields superior performance compared to a parametric network, as it maintains the linear nature of the optical flow space and thus retains more information.

Table 3. AUC results for different content of the condition. DAE is an unconditional diffusion autoencoder, BDAE is conditioned on the bounding boxes, and FDAE is conditioned on the optical flows.

       Ped2    Avenue   ShanghaiTech
DAE    93.1%   85.6%    66.2%
BDAE   94.6%   86.8%    68.1%
FDAE   97.7%   91.1%    73.8%


Table 4. AUC results for different transformations of the condition. FDAE_FCL adopts a fully connected layer, and FDAE_LT adopts the linear transformation.

           Ped2    Avenue   ShanghaiTech
FDAE_FCL   88.9%   89.1%    71.2%
FDAE_LT    97.7%   91.1%    73.8%

5 Conclusion

In this paper, we propose a flow-guided diffusion model for unsupervised VAD. We build a conditional diffusion autoencoder, a mixed gaussian clustering network, and a sample refinement scheme to guarantee that the model concentrates on normal feature learning. Through experiments, we find that the mixed gaussian clustering network can strengthen the process of normal feature learning. With the optical flow information, the conditional diffusion autoencoder can effectively learn spatio-temporal foreground object features. Experiments on three datasets confirm the effectiveness of the proposed method.

References 1. Gong, D., et al.: Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In: Conference on Computer Vision and Pattern Recognition, pp. 1705–1714 (2019) 2. Park, H., Noh, J., Ham, B.: Learning memory-guided normality for anomaly detection. In: Conference on Computer Vision and Pattern Recognition, pp. 14372–14381 (2020) 3. Liu, Z., Nie, Y., Long, C., Zhang, Q., Li, G.: A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In: International Conference on Computer Vision, pp. 13588–13597 (2021) 4. Chen, C., et al.: Comprehensive regularization in a bi-directional predictive network for video anomaly detection. In: AAAI Conference on Artificial Intelligence, vol. 36, pp. 230–238 (2022) 5. Georgescu, M.I., Ionescu, R.T., Khan, F.S., Popescu, M., Shah, M.: A backgroundagnostic framework with adversarial training for abnormal event detection in video. IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 4505–4523 (2021) 6. Del Giorno, A., Bagnell, J.A., Hebert, M.: A discriminative framework for anomaly detection in large videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 334–349. Springer, Cham (2016). https://doi. org/10.1007/978-3-319-46454-1 21 7. Tudor Ionescu, R., Smeureanu, S., Alexe, B., Popescu, M.: Unmasking the abnormal events in video. In: International Conference on Computer Vision, pp. 2895– 2903 (2017) 8. Liu, Y., Li, C.L., P´ oczos, B.: Classifier two sample test for video anomaly detections. In: BMVC, p. 71 (2018) 9. Pang, G., Yan, C., Shen, C., Hengel, A.V.D., Bai, X.: Self-trained deep ordinal regression for end-to-end video anomaly detection. In: Conference on Computer Vision and Pattern Recognition, pp. 12173–12182 (2020)


10. Yu, G., Wang, S., Cai, Z., Liu, X., Xu, C., Wu, C.: Deep anomaly discovery from unlabeled videos via normality advantage and self-paced refinement. In: Conference on Computer Vision and Pattern Recognition, pp. 13987–13998 (2022) 11. Zaheer, M.Z., Mahmood, A., Khan, M.H., Segu, M., Yu, F., Lee, S.I.: Generative cooperative learning for unsupervised video anomaly detection. In: Conference on Computer Vision and Pattern Recognition, pp. 14744–14754 (2022) 12. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013) 13. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 14. Fan, Y., Wen, G., Li, D., Qiu, S., Levine, M.D., Xiao, F.: Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder. Comput. Vis. Image Underst. 195, 102920 (2020) 15. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022) 16. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020) 17. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020) 18. Preechakul, K., Chatthee, N., Wizadwongsa, S., Suwajanakorn, S.: Diffusion autoencoders: toward a meaningful and decodable representation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10619–10629 (2022) 19. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018) 20. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: Conference on Computer Vision and Pattern Recognition, pp. 2462–2470 (2017) 21. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Adv. Neural. Inf. Process. Syst. 34, 8780–8794 (2021) 22. Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vision, pp. 3–19 (2018) 23. Li, W., Mahadevan, V., Vasconcelos, N.: Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 18–32 (2013) 24. Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 FPS in MATLAB. In: International Conference on Computer Vision, pp. 2720–2727 (2013) 25. Luo, W., Liu, W., Gao, S.: A revisit of sparse coding based anomaly detection in stacked RNN framework. In: International Conference on Computer Vision, pp. 341–349 (2017) 26. Huyan, N., Quan, D., Zhang, X., Liang, X., Chanussot, J., Jiao, L.: Unsupervised outlier detection using memory and contrastive learning. IEEE Trans. Image Process. 31, 6440–6454 (2022)

Prototypical Transformer for Weakly Supervised Action Segmentation

Tao Lin1,4, Xiaobin Chang2,4,5(B), Wei Sun1, and Weishi Zheng3,4

1 School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
[email protected], [email protected]
2 School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China
3 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
[email protected], [email protected]
4 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Beijing, China
5 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, People's Republic of China

Abstract. Weakly supervised action segmentation aims to recognize the sequential actions in a video, with only action orderings as supervision for model training. Existing methods either predict the action labels to construct discriminative losses or segment the video based on action prototypes. In this paper, we propose a novel Prototypical Transformer (ProtoTR) to alleviate the defects of existing methods. The motivation behind ProtoTR is to further enhance the prototype-based method with more discriminative power for superior segmentation results. Specifically, the Prediction Decoder of ProtoTR translates the visual input into an action ordering while its Video Encoder segments the video with action prototypes. As a unified model, both the encoder and decoder are jointly optimized on the same set of action prototypes. The effectiveness of the proposed method is demonstrated by its state-of-the-art performance on different benchmark datasets.

Keywords: Video Segmentation · Action Recognition · Weak Supervision · Dynamic Time Warping

1 Introduction

Action segmentation aims to recognize the different actions performed one after another in a video and to localize their temporal spans by segmenting out the start and end timestamps. The fully supervised setting [6,14,17,25] requires frame-wise action labels for model training, whose labelling cost can be huge. On the contrary, the temporal order of the different actions in a video can be represented as an action ordering and obtained at a much lower price than the frame-wise labels. In this work, we focus on handling the weakly supervised action segmentation problem with such action orderings as weak supervision [4,5,9,12,15,16,18–22]. Many weakly supervised action segmentation methods [5,15,16,19,22] aim to recognize the action in each frame, as is done under the fully supervised setting [6,14,17,25]. However, the action orderings fail to provide direct supervision for frame-wise model training. To construct discriminative learning objectives from the frame-wise predictions and the action-ordering weak supervision, recent approaches adopt Viterbi and Hidden Markov Model (HMM) algorithms [12,15,18,20], or calculate their mutual consistency [22]. Although competitive performance can be obtained by these methods, two shortcomings [22] are highlighted: (1) over-segmentation, i.e., adjacent frames may be classified into several different classes by error; and (2) such models rely on pseudo frame-wise labels for training, and the generation process of these pseudo labels can be unstable. In DP-DTW [4], a different paradigm was proposed. Instead of inferring the frame-wise labels, it treats the video segmentation problem as the alignment between video frames and action prototypes via the Dynamic Time Warping (DTW) [1] algorithm, where each prototype is a learnable representative sequence of the corresponding action. The limitations of the previous paradigm are thus largely avoided. However, the visual model and action prototypes of DP-DTW are learned only with the hinge loss and distance loss. Direct optimization based on the action ordering supervision is not included. Therefore, the learned model can be less discriminative, which leads to compromised performance. In this paper, we aim to improve the prototype-based method with a novel Prototypical Transformer (ProtoTR), as shown in Fig. 1. Inspired by sequence-to-sequence (Seq2Seq) learning [8,10,24], a Prediction Decoder is designed in ProtoTR to translate the video into an action ordering. As a unified model, the action prototypes are not only used to segment the video but also utilized as the input of the Prediction Decoder for action ordering prediction. Both the encoder and decoder are then jointly optimized with the same set of action prototypes. With the help of the proposed ProtoTR, more discriminative action prototypes and a more discriminative Video Encoder can be learned. Superior segmentation results are thus achieved by the proposed method. The main contributions of this work are summarized as follows:
1. A novel Prototypical Transformer (ProtoTR) is proposed to explicitly exploit the weak supervision by translating the video into an action ordering. It helps to learn more discriminative action prototypes and Video Encoder for superior weakly supervised action segmentation performance.
2. The proposed ProtoTR is a unified model. The action prototypes are not only used to segment the video but also utilized as the input of the Prediction Decoder to predict the action ordering. The whole model is jointly optimized in an end-to-end manner for superior results.
3. The proposed method achieves state-of-the-art performance on two large-scale weakly supervised action segmentation benchmark datasets and surpasses existing methods with clear margins.

2 Related Work

Most of the recent works for weakly supervised action segmentation are Viterbi and Hidden Markov Model (HMM) based methods [12,15,16,18–21]. Kuehne et al. [15] proposed an iteratively refined HMM model. It generates and refines frame-wise pseudo labels at each epoch. Following Kuehne et al. [15], the approaches, e.g., [12], using HMM with Recurrent Neural Network (RNN) are proposed. Different from the two-stage methods, Richard et al. [20] introduced the Neural Network Viterbi (NNV) to jointly generate and refine the pseudo labels. Based on NNV, the Constrained Discriminative Forward Loss (CDFL) [18,21] is proposed to boost the performance. Besides the Viterbi and HMM-based methods, the mutual consistency [22] between the frame-wise label prediction and the action ordering supervision is computed and used as a learning objective. However, these approaches can suffer from over-segmentation and unstable pseudo-label generation. Following a different paradigm, DP-DTW [4] formulates video segmentation as the alignment between video frames and the concatenated action prototypes, where each prototype is a learnable representative sequence of the corresponding action. To find the optimal alignment, Dynamic Time Warping (DTW) [1] is used. As such alignment is monotonic and no pseudo label is required, the shortcomings of the previous paradigm are avoided. Although D3 TW [3] also utilizes DTW, it still belongs to the previous paradigm as its segmentation is based on frame-wise predictions and without action prototypes. However, the visual model and action prototypes of DP-DTW are learned only with the hinge loss and distance loss. The direct optimization based on the action ordering supervision is not included. Therefore, the learned model could be less discriminative and leads to compromised performance. In this paper, a novel Prototypical Transformer (ProtoTR) is proposed to learn more discriminative action prototypes and Video Encoder with action ordering learning. As a unified model, the encoder and decoder of ProtoTR are jointly optimized with the same set of action prototypes and lead to superior segmentation performance.

3 Preliminaries

3.1 Notations

Let X_{1:T} = [x_1, ..., x_T] denote a video with T frames. The goal of action segmentation is to classify each frame of a given video X_{1:T} into the frame-wise labels Ŷ_{1:T} = [ŷ_1, ..., ŷ_T], where ŷ_t ∈ C and C is the set of action classes. In weakly supervised action segmentation, only the action ordering (weak supervision) A_{1:N} = [a_1, ..., a_N], a_n ∈ C, rather than the full supervision Y_{1:T}, is provided for training. In other words, N is usually less than T, and the lengths L_{1:N} = [l_1, ..., l_N] of the actions in the ordering are unknown. For each action class c ∈ C, there is a corresponding action prototype p_c ∈ R^{d_p×τ_p}, where d_p and τ_p are the feature dimension and the temporal length of the prototype, respectively.

3.2 Dynamic Time Warping

Dynamic Time Warping (DTW) [1] is widely used to measure the similarity between time series. Given two temporal sequences s ∈ R^{m×τ1} and s′ ∈ R^{m×τ2} in the same m-dimensional space, DTW seeks the temporal alignment that minimizes the Euclidean distance between s and s′:

DTW(s, s′) = min_{π ∈ W(s, s′)} ( Σ_{(i,j)∈π} d(s_i, s′_j)² )^{1/2},  (1)

where π = [(i_1, j_1), ..., (i_{τ1}, j_{τ2})] is an alignment path consisting of index pairs and W(s, s′) is the set of all possible paths. The outputs of DTW are denoted as

π̂, d̂ = DTW(s_1, s_2),  (2)

where π̂ is the optimal alignment path between s_1 and s_2 with minimum Euclidean distance d̂.
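For reference, a standard dynamic-programming implementation of Eq. (1) is sketched below; it returns the minimum alignment cost, and the optimal path π̂ can be recovered by backtracking the dp table. This is a generic DTW routine, not the authors' code.

```python
import numpy as np

def dtw(s1, s2):
    """s1: (m, t1), s2: (m, t2) sequences in the same m-dimensional space."""
    t1, t2 = s1.shape[1], s2.shape[1]
    # pairwise squared Euclidean distances d(s_i, s'_j)^2
    dist = np.linalg.norm(s1[:, :, None] - s2[:, None, :], axis=0) ** 2
    dp = np.full((t1 + 1, t2 + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            dp[i, j] = dist[i - 1, j - 1] + min(dp[i - 1, j], dp[i, j - 1], dp[i - 1, j - 1])
    return np.sqrt(dp[t1, t2])   # minimum alignment cost of Eq. (1)
```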

3.3 Attention Mechanism

Attention [23] is a fundamental mechanism used by ProtoTR. Given the keys K and values V, the attention output of a query vector q is

Attention(q, K, V) = Softmax(qKᵀ / √d_k) V,  (3)

where d_k is the feature dimension of the keys.
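Eq. (3) for a single query vector can be written directly as follows (the shapes are assumptions for illustration):

```python
import numpy as np

def attention(q, K, V):
    """q: (d_k,), K: (n, d_k), V: (n, d_v) -> weighted sum of the values."""
    logits = K @ q / np.sqrt(K.shape[1])
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V
```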

4 Prototypical Transformer

4.1 Video Encoder

The proposed ProtoTR is illustrated in Fig. 1. An input video X_{1:T} = [x_1, ..., x_T] is first fed into the Video Encoder to extract the frame-wise features F_{1:T} = [f_1, ..., f_T] ∈ R^{d_f×T}:

F_{1:T} = Encoder(X_{1:T}),  (4)

where d_f is the dimension of each frame feature, which is identical to the feature dimension d_p of the action prototypes.

4.2 Action Ordering Prediction

Fig. 1. Detailed illustration of our Prototypical Transformer (ProtoTR) architecture. A video X is first fed into the Video Encoder and encoded as the frame-wise feature F. The Prediction Decoder then takes both the tokens embedded by the action prototypes and F as input and predicts the action ordering Â of the input video. Finally, the optimal alignment path π̂ between F and P is searched by DTW to segment the video. The different modules are centred on the action prototypes and optimized jointly.

Different from traditional Seq2Seq models that use individual embedding layers to convert tokens into high-dimensional vectors, the action tokens in ProtoTR are converted into vectors of dimension d_model using the corresponding action prototypes. Specifically, to predict the action ordering of an input video, the Prediction Decoder takes both the frame-wise features F_{1:T} extracted by the Video Encoder and the action tokens as input. The action tokens are used to index the corresponding action prototypes. The detailed procedure is depicted in Fig. 2. The first token for prediction is the start-of-sequence (sos) token. The Prediction Decoder aims to forecast the action â_{n+1} at step n + 1 based on the previous predictions Â_{1:n} = [â_1, ..., â_n] up to step n. Â_{1:n} is used to index the corresponding action prototypes and obtain E_{1:n} = [p_sos, p_{â_1}, ..., p_{â_n}] ∈ R^{d_p×τ_p×(n+1)}, where p_sos ∈ R^{d_p×τ_p} is the prototype of the sos token. Average pooling is then applied to reduce the dimension τ_p:

Ē_{1:n} = AvgPooling(E_{1:n}),  (5)

with Ē_{1:n} = [p̄_sos, p̄_{â_1}, ..., p̄_{â_n}] ∈ R^{d_p×(n+1)}. As the next action is closely related to the previous ones, the self-attention mechanism is used:

S(p̄_{â_n}) = Attention(p̄_{â_n}, Ē_{1:n}, Ē_{1:n}),  (6)

where p̄_{â_n} corresponds to the query vector and Ē_{1:n} serves as the keys and values of Eq. (3). S(p̄_{â_n}) encodes the self-attention information within Â_{1:n}. The cross-attention relation between the action ordering and the visual content is then modelled as

C(p̄_{â_n}) = Attention(S(p̄_{â_n}), F_{1:T}, F_{1:T}),  (7)

Fig. 2. Predicting the next action â_{n+1} based on the partially predicted action ordering Â_{1:n} at the previous step n. Common transformer modules such as skip connections and the feed-forward network are omitted for clarity.

where S(p̄_{â_n}) serves as the query and the video feature F_{1:T} serves as the keys and values. C(p̄_{â_n}) is then fed into a classification layer to predict the next action label â_{n+1}:

â_{n+1} = Classify(C(p̄_{â_n})).  (8)

With â_{n+1}, the action ordering Â_{1:n+1} = [â_1, ..., â_n, â_{n+1}] at the next step n + 1 is predicted.
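The decoding step of Eqs. (5)–(8) can be summarized as in the sketch below; `prototypes`, `sos_proto` and `W_cls` are illustrative stand-ins for the prototype table, the sos prototype and the classification layer, not the authors' implementation.

```python
import numpy as np

def _attn(q, K, V):
    logits = K @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ V

def predict_next_action(pred_prefix, prototypes, sos_proto, F, W_cls):
    """pred_prefix: predicted action ids so far; prototypes[a], sos_proto: (d_p, tau_p);
    F: (d_f, T) frame features with d_f = d_p; W_cls: (num_classes, d_f) classifier."""
    tokens = [sos_proto] + [prototypes[a] for a in pred_prefix]
    E_bar = np.stack([p.mean(axis=1) for p in tokens])   # Eq. (5): average pooling
    s = _attn(E_bar[-1], E_bar, E_bar)                   # Eq. (6): self-attention
    c = _attn(s, F.T, F.T)                               # Eq. (7): cross-attention
    return int(np.argmax(W_cls @ c))                     # Eq. (8): next action label
```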

4.3 Segmentation

To segment a given video X, the Video Encoder is used to extract the frame-wise features F_{1:T}, as detailed in Sect. 4.1. The Prediction Decoder then takes both F_{1:T} and the action tokens as input and predicts the action ordering Â_{1:N}, as described in Sect. 4.2. According to the predicted action ordering Â_{1:N} = [â_1, ..., â_N], the corresponding action prototypes are concatenated along the temporal dimension, resulting in P_{1:N}:

P_{1:N} = Concat([p_{â_1}, ..., p_{â_N}]).  (9)

DTW (Eq. (2)) is then utilized to find the optimal alignment path π̂ between the frame-wise features F_{1:T} and the concatenated prototype sequence P_{1:N}, where each prototype is a learnable representative sequence of the corresponding action.

As F_{1:T} corresponds to the input video frames and P_{1:N} is concatenated according to the predicted action ordering, π̂ is the optimal alignment between the input video frames and the predicted action ordering. In other words, the predicted action labels are assigned to the video frames via π̂, as shown in the sketch below. With such frame-wise action labels obtained, the video is segmented.
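A minimal sketch of this label assignment, assuming the DTW path is available as (frame index, prototype step) pairs and that each prototype step is tagged with the action it was concatenated from:

```python
def labels_from_path(path, segment_of, num_frames):
    """path: list of (frame_idx, prototype_step) pairs from DTW;
    segment_of[j]: the action class of prototype step j."""
    labels = [None] * num_frames
    for i, j in path:
        labels[i] = segment_of[j]   # each frame takes the action of its aligned step
    return labels
```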

4.4 Training

The predicted action ordering Â can be explicitly supervised by the ground-truth ordering A via the cross-entropy loss:

L_CE = CrossEntropy(Â, A).  (10)

To learn the action segmentation, the hinge loss L_h and the distance loss L_D from [4] are utilized. Given a training video, the ground-truth action ordering is defined as the positive ordering A⁺, while all the other action orderings in the training set are negative orderings A⁻. The corresponding positive and negative concatenated prototypes are obtained via Eq. (9) and denoted as P⁺ and P⁻, respectively. The positive and negative distances d̂⁺, d̂⁻ between the frame-wise features F and P⁺, P⁻ are calculated by DTW (Eq. (2)):

π̂⁺, d̂⁺ = DTW(F, P⁺),  (11)

π̂⁻, d̂⁻ = DTW(F, P⁻).  (12)

The hinge loss L_h is designed to train F to be closer to its positive concatenated prototype P⁺ than to the negative ones, with margin δ ≥ 0. Specifically,

L_h = Σ_{q=1}^{Q} max(0, d̂⁺ − d̂⁻,q + δ),  (13)

where Q is the total number of negative samples used. The distance loss L_D is the positive distance:

L_D = d̂⁺.  (14)

Therefore, the overall loss L_w_seg of ProtoTR consists of three parts:

L_w_seg = L_CE + α L_h + β L_D,  (15)

where α and β are balancing hyper-parameters.
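Putting Eqs. (10)–(15) together, the overall objective can be sketched as follows; `dtw_fn` is any routine returning the DTW cost (e.g., the sketch in Sect. 3.2), the margin default is only a placeholder, and α, β follow the values reported in Sect. 5.1.

```python
def prototr_loss(ce_loss, F, P_pos, P_negs, dtw_fn, delta=1.0, alpha=2.0, beta=0.025):
    """ce_loss: cross-entropy of the predicted ordering (Eq. (10));
    P_pos / P_negs: concatenated prototypes of the positive / negative orderings."""
    d_pos = dtw_fn(F, P_pos)                                   # Eq. (11)
    hinge = sum(max(0.0, d_pos - dtw_fn(F, P_neg) + delta)     # Eq. (13)
                for P_neg in P_negs)
    return ce_loss + alpha * hinge + beta * d_pos              # Eq. (15), with L_D = d_pos
```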

5 Experiments

5.1 Datasets and Settings

Datasets. Two benchmark datasets for weakly supervised action segmentation are used. The Breakfast [13] dataset contains 1,712 videos of preparing breakfast in 18 different home kitchens, and the frames are annotated with 48 different action classes. On average, there are about 7 action instances per video. The four splits of the benchmark setting [13] are used for cross-validation, and the averaged results are reported. The Hollywood Extended [2] dataset contains 937 videos taken from Hollywood movies with 16 different action classes. On average, there are about 6 action instances per video. The ten splits of the benchmark setting [19] are used for cross-validation, and the averaged results are reported.

Protocols. Weakly supervised action segmentation contains two benchmark sub-tasks, segmentation and alignment. During testing, the action ordering of a given test video is not available for segmentation, but it is provided for alignment. The performance is evaluated with three criteria [14,19]: frame accuracy (F-acc), intersection over union (IoU) and intersection over detection (IoD).

Implementation Details. The Prediction Decoder of ProtoTR follows a one-layer, one-head transformer architecture for both the self-attention and cross-attention modules. The model is initialized following the Xavier procedure presented in [7] and trained from scratch. The hidden feature dimensions across different modules are fixed at 64. The action prototypes of ProtoTR have temporal length τ_p equal to 8 and 6 for the Breakfast and Hollywood datasets, respectively. Each prototype is implemented as an individual fully connected layer and initialized as the medoid of the corresponding action class. The pre-computed video features from [19,20] are used for fair comparisons. The learning objective L_w_seg of Eq. (15) uses α = 2.0 and β = 0.025 for all experiments. The model is trained end-to-end by the Adam optimizer [11] with a learning rate of 0.001 and a mini-batch size of 64 for 6,000 iterations.

Main Results

Comparisons among different methods under the segmentation setting are shown in Table 1. The proposed ProtoTR consistently achieves the best performance. Table 1. Comparisons among different methods under the segmentation setting. Method

Breakfast F-acc IoU

IoD

Hollywood F-acc IoU

IoD

HMM+RNN [19] NN-Viterbi [20] D3 TW [3] MuCon [22] CDFL [18] TransAcT [21] DP-DTW [4]

33.3 43.0 45.7 48.5 50.2 53.2 50.8

33.7 35.6

45.4 45.1

33.6 41.6 45.0 47.7 55.6

25.8 43.3

ProtoTR

57.5

40.6 51.5 57.8

11.9 19.5 33.2

35.5 45.3

Prototypical Transformer for Weakly Supervised Action Segmentation

203

It improves the state-of-the-art DP-DTW with more than 5.0% margins of all metrics on the Breakfast dataset. On the Hollywood dataset, ProtoTR can outperform its counterpart, DP-DTW, with non-trivial margins and achieves clearly better results than other SOTA methods. The qualitative segmentation results of ProtoTR and DP-DTW are shown and compared in Fig. 3. The action ordering from DP-DTW is flawed with one action missed. ProtoTR returns the matched action ordering and thus results in more accurate segmentation than that of DP-DTW.

Fig. 3. Qualitative results on the Breakfast dataset under the segmentation setting. Texts in the first row represent the action ordering and the bars in the second row indicate the action segmentation. GT denotes the ground truth. Best viewed in colors. (Color figure online) Table 2. Comparisons among different methods under the alignment setting. Method

Breakfast F-acc IoU

IoD

Hollywood F-acc IoU

IoD

HMM+RNN [19] NN-Viterbi [20] D3 TW [3] MuCon [22] CDFL [18] TransAcT [21] DP-DTW [4]

57.0 63.0 65.5 67.7

45.8 50.8

47.3 56.3 66.2 63.9 66.5

59.4 64.3 64.8 66.4

46.3 48.7 50.9 52.3 52.9 61.7

ProtoTR

74.4

55.3 70.5 68.8

40.5 46.8

49.4 64.5

Fig. 4. Qualitative results on the Breakfast dataset under the alignment setting. Texts in the first row represent the action ordering and the bars in the second row indicate the action segmentation. GT denotes the ground truth. Best viewed in colors. (Color figure online)

204

T. Lin et al. Table 3. Ablation study of ProtoTR with Breakfast dataset. Method

Segmentation F-acc IoU IoD

Alignment F-acc IoU

IoD

DP-DTW 50.8

35.6 45.1 67.7

50.8 66.5

BasicTR

54.6

38.7 50.1 72.0

53.9 69.4

ProtoTR 57.5

40.6 51.5 74.4

55.3 70.5

Different methods are also compared under the alignment setting, as shown in Table 2. ProtoTR still achieves the best performance among them. Specifically, on the Breakfast dataset, ProtoTR boosts the SOTA DP-DTW results with 6.7%, 4.5% and 4.0% margins on frame accuracy, IoU and IoD respectively. Consistent and clear improvements over DP-DTW can be achieved by our method on the Hollywood dataset as well. The qualitative results are illustrated in Fig. 4. The performance by ProtoTR is more accurate than the DP-DTW one. 5.3

Ablation Study

As described in Sect. 4, ProtoTR is a unified model with the encoder and decoder jointly optimized with the same set of action prototypes. The basic transformer (BasicTR) is a competitor to ProtoTR. BasicTR is with additional independent action embeddings rather than the unified set of prototypes as in ProtoTR. Specifically, action tokens are converted to vectors by the additional independent action embeddings in BasicTR, while they are represented by corresponding action prototypes in ProtoTR. As shown in Table 3, ProtoTR consistently achieves superior results to BasicTR under all settings and metrics. The effectiveness of joint learning on the same action prototypes is demonstrated. Moreover, noticeable improvements are obtained by BasicTR over DP-DTW. It demonstrates that with the help of the Prediction Decoder, more discriminative prototypes and the Video Encoder can be learned to boost performance.

6

Conclusions

In this paper, we propose a novel Prototypical Transformer (ProtoTR) to exploit weak supervision by translating the video into action ordering. As a unified framework, different modules of ProtoTR are jointly optimized on the same set of action prototypes. With more discriminative action prototypes and visual encoder learned, superior action segmentation results are thus obtained by ProtoTR. The proposed method is evaluated on two weakly supervised benchmark datasets and achieves better results than existing methods with clear margins. Limitation and future work: The proposed method is built upon the naive attention mechanism with high latency. As a future direction, the model efficiency can be improved by using more advanced transformer architectures.

Prototypical Transformer for Weakly Supervised Action Segmentation

205

Acknowledgement. This work was supported by the National Science Foundation for Young Scientists of China (62106289).

References 1. Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time series. In: KDD Workshop, Seattle, WA, USA, vol. 10, pp. 359–370. (1994) 2. Bojanowski, P., et al.: Weakly supervised action labeling in videos under ordering constraints. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 628–643. Springer, Cham (2014). https://doi.org/10.1007/ 978-3-319-10602-1 41 3. Chang, C.Y., Huang, D.A., Sui, Y., Fei-Fei, L., Niebles, J.C.: D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3546–3555. IEEE (2019) 4. Chang, X., Tung, F., Mori, G.: Learning discriminative prototypes with dynamic time warping. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8395–8404. IEEE (2021) 5. Ding, L., Xu, C.: Weakly-supervised action segmentation with iterative soft boundary assignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6508–6516 (2018) 6. Farha, Y.A., Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) 7. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010) 8. Guo, J., Xue, W., Guo, L., Yuan, T., Chen, S.: Multi-level temporal relation graph for continuous sign language recognition. In: Yu, S., et al. (eds.) PRCV 2022, Part III. LNCS, vol. 13536, pp. 408–419. Springer, Cham (2022). https://doi.org/10. 1007/978-3-031-18913-5 32 9. Huang, D.-A., Fei-Fei, L., Niebles, J.C.: Connectionist temporal modeling for weakly supervised action labeling. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 137–153. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46493-0 9 10. Ji, P., Yang, B., Zhang, T., Zou, Y.: Consensus-guided keyword targeting for video captioning. In: Yu, S., et al. (eds.) PRCV 2022, Part III. LNCS, vol. 13536, pp. 270–281. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-18913-5 21 11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 12. Koller, O., Zargaran, S., Ney, H.: Re-sign: re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMS. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4297–4305. IEEE (2017) 13. Kuehne, H., Arslan, A.B., Serre, T.: The language of actions: recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 780–787. IEEE (2014)

206

T. Lin et al.

14. Kuehne, H., Gall, J., Serre, T.: An end-to-end generative framework for video segmentation and recognition. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. IEEE (2016) 15. Kuehne, H., Richard, A., Gall, J.: Weakly supervised learning of actions from transcripts. Comput. Vis. Image Underst. 163, 78–89 (2017) 16. Kuehne, H., Richard, A., Gall, J.: A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 765–779 (2018) 17. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) 18. Li, J., Lei, P., Todorovic, S.: Weakly supervised energy-based learning for action segmentation. In: IEEE/CVF International Conference on Computer Vision, pp. 6242–6250. IEEE (2019) 19. Richard, A., Kuehne, H., Gall, J.: Weakly supervised action learning with RNN based fine-to-coarse modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 754–763. IEEE (2017) 20. Richard, A., Kuehne, H., Iqbal, A., Gall, J.: NeuralNetwork-Viterbi: a framework for weakly supervised video learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7386–7395. IEEE (2018) 21. Ridley, J., Coskun, H., Tan, D.J., Navab, N., Tombari, F.: Transformers in action: weakly supervised action segmentation. arXiv preprint arXiv:2201.05675 (2022) 22. Souri, Y., Fayyaz, M., Minciullo, L., Francesca, G., Gall, J.: Fast weakly supervised action segmentation using mutual consistency. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 6196–6208 (2021) 23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 24. Yang, B., Zhang, T., Zou, Y.: Clip meets video captioning: Concept-aware representation learning does matter. In: Yu, S., et al. (eds.) PRCV 2022, Part I. LNCS, vol. 13534, pp. 368–381. Springer, Cham (2022). https://doi.org/10.1007/978-3031-18907-4 29 25. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2914–2923 (2017)

Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization Huilin Tian, Jingke Meng(B) , Yuhan Yao, and Weishi Zheng School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China [email protected]

Abstract. Audio-visual event localization (AVE) task focuses on localizing audio-visual events where event signals occur in both audio and visual modalities. Existing approaches primarily emphasize multimodal (i.e. audio-visual fused) feature processing to capture high-level event semantics, while overlooking the potential of unimodal (i.e. audio or visual) features in distinguishing unimodal event segments where only the visual or audio event signal appears within the segment. To overcome this limitation, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) framework for audio-visual event localization. The framework consists of several key steps. Firstly, audio and visual features are enhanced by multimodal features, and then adaptively fused to further enhance the multimodal features. Simultaneously, the unimodal features collaborate with multimodal features to filter unimodal events. Lastly, by considering the collaborative emphasis on event content at both the segment and video levels, a dual interaction mechanism is established to exchange information, and video features are utilized for event classification. Experimental results demonstrate the significant superiority of our UMCE framework over state-of-the-art methods in both supervised and weakly supervised AVE settings.

Keywords: Audio-visual event

1

· Action localization · Attention

Introduction

Audio-visual event localization (AVE) is a significant and challenging task introduced by Tian et al. [18]. Its objective is to develop a model capable of accurately localizing the position of audio-visual events and predicting their corresponding categories. An audio-visual event segment is defined as containing the event that is both visible and audible, exemplified by the first three segments in Fig. 1. As Supported by National Science Foundation for Young Scientists of China (62206315), China Postdoctoral Science Foundation (2022 M 713574) and Fundamental Research Funds for the Central Universities, Sun Yat-sen University (23 ptpy 112). c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024  Q. Liu et al. (Eds.): PRCV 2023, LNCS 14430, pp. 207–219, 2024. https://doi.org/10.1007/978-981-99-8537-1_17

208

H. Tian et al.

Fig. 1. An illustration example of the audio-visual event localization task.

illustrated in the figure, video segments are labeled as audio-visual events only when they exhibit concurrent visual and audio event signals. Over the past years, the AVE task has garnered growing attention, leading researchers to explore various methods [7,21,22,27] of inter-modal and intra-modal interaction between audio and visual content. There are two types of segments labeled as background: pure background segments where neither visual nor audio modalities contain event information, such as the 5th segment in Fig. 1, and unimodal event segments where only the visual or audio event signal appears within the segment, indicated by the red regions in Fig. 1. Unimodal event segments, compared to pure background segments, contain event information and are harder to be identified as background, posing a significant challenge in AVE task. Previous studies [21–23] often struggle to distinguish unimodal events due to their reliance on audio-visual features for audio-visual event localization. We attribute this mislocalization to the inherent limitation of fused features in discerning the modality of event occurrence, as the visual and audio information within the fused features is extensively blended. To overcome these limitations, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) framework for audio-visual event localization. This framework promotes the synergistic use of unimodal and multimodal features. Given the independent nature of audio and visual features, we utilize them to identify unimodal events. Meanwhile, the audio-visual fused features exhibit higher robustness in predicting the multimodal event category at the video level. Recognizing the inadequate content of unimodal features and the comprehensive information contained in audio-visual fused features, we employ the latter as guidance to enhance the event information in the former. Moreover, we integrate audio/visual features to obtain fused features through multimodal adaptive fusion. Furthermore, by considering the collaborative emphasis on event content at both segment and video levels, we establish a dual interaction for exchanging information and employ video features for event classification. The major contributions of this work are summarized as follows: (1) To address the challenges posed by unimodal events, we introduce the unimodal-multimodal event-aware interaction module and collaborative uni-

Unimodal-Multimodal Collaborative Enhancement for AVE

209

modal event filter. These modules enhance unimodal and multimodal features and leverage them to achieve precise localization of audio-visual events. (2) We introduce the dual-interactive multimodal event classification to enhance event information of video- and segment-level features for event classification. (3) We integrate these modules and propose a unimodal-multimodal collaborative enhancement network, which outperforms the state-of-the-art methods by a large margin in both supervised and weakly-supervised AVE settings.

2 Related Works

2.1 Audio-Visual Event Localization

In recent years, there have been extensive studies and improvements on the audio-visual event localization (AVE) task, which can be divided into two main directions. Most methods have explored cross-modal feature fusion for better audio-visual representations. Xu et al. [23] proposed the CMRA network, which lets audio and visual features interact through spatial-level and channel-level cross-modal attention. The M2N proposed by Wang et al. [20] tries to learn relations among audio, visual, and fusion features. Wu et al. [21] proposed the dual attention matching (DAM) module. Yu et al. [27] used the MPN framework to perceive global semantics and unmixed local information. Liu et al. [9] proposed BMFN to adjust unimodal features and fusion features with forward-backward fusion modules. Meanwhile, many other works focused on how to emphasize event information and suppress background noise. The positive sample propagation (PSP) [28] exploits relevant audio-visual paired features and prunes negative and weak connections, and Xia et al. [22] utilized cross-modal background suppression at the time level and event level. In contrast to existing models, our proposed UMCE method employs a collaborative approach between unimodal and fused features to identify unimodal events. Moreover, by capturing event-aware segment-level and video-level information, our method achieves precise localization.

2.2 Attention Mechanism

The attention mechanism is motivated by the way humans process information, attending to the most relevant part of the features and suppressing noise. It has many variants [1,6,11,19,25] and has been applied to several vision tasks such as point clouds [12,26], action prediction [16], and pose estimation [17]. In addition to incorporating the inter- and intra-modal attention mechanisms employed by previous methods [21–23], our approach introduces attention between unimodal and audio-visual fused features. Specifically, we enhance the unimodal features by leveraging guidance from the fused features, thereby amplifying the event signals within each modality. Simultaneously, the fused features dynamically capture cross-modal event information, facilitating a comprehensive understanding of the audio-visual context.


Fig. 2. Architecture of UMCE. Taking pre-extracted audio and visual features as inputs, the unimodal-multimodal event-aware interaction module is applied to enhance the audio and visual features and fuse them adaptively into audio-visual features. Then, the collaborative unimodal event filter is used to identify unimodal events. Finally, the dual-interactive multimodal event classification is deployed to exchange event information between the segment and video levels for event classification.

3 Method

Our UMCE network comprises three components, as presented in Fig. 2. The detailed architecture is elaborated in the following subsections.

3.1 Problem Notations

Each video from the AVE dataset [18] is represented as S = {(A_t, V_t)}_{t=1}^T, where T indicates the length of the video, A_t ∈ R^{d_a} and V_t ∈ R^{H×W×d_v} respectively stand for the audio and visual features of the t-th segment, and d_a and d_v are the audio and visual feature dimensions. In the supervised setting, the labels are segment-level over C event categories and a background category, while only video-level labels are provided in the weakly-supervised setting. In the following formulations, σ denotes the sigmoid function and ⊙ means element-wise multiplication.
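To make the notation concrete, the following minimal sketch instantiates the inputs and labels as tensors. The segment count T = 10 and audio dimension d_a = 128 follow the dataset/feature setup described later; the visual map size and the specific values are illustrative assumptions, not part of the method.

```python
import torch

# Illustration of the AVE notation with hypothetical sizes.
T, C = 10, 28                     # segments per video, event categories
d_a, H, W, d_v = 128, 7, 7, 512   # audio dim (VGG-like), assumed visual map size

A = torch.randn(T, d_a)           # audio features, one vector per 1-s segment
V = torch.randn(T, H, W, d_v)     # visual features, one map per segment

# Supervised setting: segment-level labels over C events plus background
segment_labels = torch.randint(0, C + 1, (T,))
# Weakly-supervised setting: a single video-level category label
video_label = torch.randint(0, C, (1,))
print(A.shape, V.shape, segment_labels.shape)
```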

3.2 Unimodal-Multimodal Event-Aware Interaction

Fusion-Guided Unimodal Enhancement. Compared to fused multimodal features, the information provided by unimodal features is usually inadequate for event identification. In this module, the audio and visual features capture information from the complementary modality to enhance the understanding of the overall event semantics, while maintaining the independence of each modality. Specifically, after audio-guided attention [22] and self-attention [8,19], we calculate the preliminary multimodal features F^pre = (A^self + V^self)/2, where A^self, V^self ∈ R^{T×d}. Subsequently, we utilize F^pre as the event guidance to enhance A^self and V^self and obtain the fusion-guided unimodal features V^fu and A^fu. For example, V^fu is calculated as follows. First, we obtain the fusion-guided visual attention map M_f^v ∈ R^{T×d}, which indicates channel-wise feature weights of V^self:

M_f^v = σ(((F^pre U_f ⊙ V^self U_u) U_1) U_2),    (1)

where U_f, U_u ∈ R^{d×d_h}, U_1 ∈ R^{d_h×d} and U_2 ∈ R^{d×d} are fully-connected layers with ReLU. The fusion-guided visual features are then calculated as

V^fu = M_f^v ⊙ V^self + α · V^self,    (2)

where α is a hyperparameter. A^fu is obtained in the same manner. By utilizing the guidance of the fused features, the fusion-guided visual and audio features V^fu, A^fu ∈ R^{T×d} enhance relevant event information and diminish irrelevant background noise for better localization.

Multimodal Adaptive Fusion. In order to enhance the fusion of robust event information derived from the unimodal features, this module dynamically allocates attention to the audio and visual modalities, prioritizing the relevant information in each segment. Referring to [22], the event importance scores are calculated from the corresponding unimodal features A^self and V^self, respectively:

g^a = σ(W_g^a · A^self),  g^v = σ(W_g^v · V^self),    (3)

where W_g^a, W_g^v ∈ R^{d×1}. g^a and g^v represent the degree of event information contained in the respective modalities. Hence, they can be used to determine the attention weights for the fusion of multimodal features: w_t^a, w_t^v = Softmax(g_t^a, g_t^v). The adaptive intra-segment fused features are then calculated as the weighted sum of the unimodal features:

F_t^ai = A_t^self ⊙ w_t^a + V_t^self ⊙ w_t^v.    (4)

It is also essential to obtain inter-segment event information. Following [23], we obtain the inter-segment fused features with cross-modal attention (CMA), which is formulated as

CMA(q, k) = Softmax(QK^T / √d_m) V,  Q = qW^Q,  K = kW^K,  V = kW^V,    (5)

where W^Q, W^K and W^V ∈ R^{d×d}. Finally, the inter-segment and intra-segment features are combined as the segment-level fused features:

m^av = A^self ⊙ V^self,  c^av = Concat(A^self, V^self),  F^o = CMA(m^av, c^av) + β · F^ai.    (6)


Thus, the fused features contain multi-scale audio-visual event information, enhancing the model's ability to comprehend events accurately and facilitating correct video classification. The significance of both inter- and intra-segment cross-modal features for localization is discussed in detail in the experimental section.
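To make Eqs. (1)–(4) concrete, the following PyTorch-style sketch shows one possible implementation of the fusion-guided channel attention and the adaptive intra-segment fusion. The exact placement of ReLU activations, the feature dimension d = 256, and the variable names are illustrative assumptions rather than the authors' code; the cross-modal attention of Eq. (5) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionGuidedEnhance(nn.Module):
    """Channel attention guided by the fused features (cf. Eqs. (1)-(2))."""
    def __init__(self, d, d_h, alpha=0.7):
        super().__init__()
        self.U_f, self.U_u = nn.Linear(d, d_h), nn.Linear(d, d_h)
        self.U_1, self.U_2 = nn.Linear(d_h, d), nn.Linear(d, d)
        self.alpha = alpha

    def forward(self, F_pre, X_self):                 # both (T, d)
        gate = F.relu(self.U_f(F_pre)) * F.relu(self.U_u(X_self))
        m = torch.sigmoid(self.U_2(F.relu(self.U_1(gate))))   # attention map (T, d)
        return m * X_self + self.alpha * X_self       # fusion-guided features

def adaptive_intra_fusion(A_self, V_self, W_ga, W_gv):
    """Adaptive intra-segment fusion (cf. Eqs. (3)-(4))."""
    g_a = torch.sigmoid(A_self @ W_ga)                # (T, 1) audio importance
    g_v = torch.sigmoid(V_self @ W_gv)                # (T, 1) visual importance
    w = torch.softmax(torch.cat([g_a, g_v], dim=1), dim=1)
    return A_self * w[:, 0:1] + V_self * w[:, 1:2]    # (T, d)

# Toy usage with illustrative sizes; separate attention modules per modality.
T, d, d_h = 10, 256, 128
A_self, V_self = torch.randn(T, d), torch.randn(T, d)
F_pre = (A_self + V_self) / 2
enh_a, enh_v = FusionGuidedEnhance(d, d_h), FusionGuidedEnhance(d, d_h)
A_fu, V_fu = enh_a(F_pre, A_self), enh_v(F_pre, V_self)
F_ai = adaptive_intra_fusion(A_self, V_self, torch.randn(d, 1), torch.randn(d, 1))
```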

3.3 Collaborative Unimodal Event Filter

After fusion-guided unimodal enhancement, the enhanced unimodal features V^fu, A^fu can be utilized for a preliminary identification of the modality in which the events occur. In this subsection, we introduce a straightforward yet effective filter that leverages the collaboration between unimodal features and audio-visual fused features for localization. First, we emphasize event segments and suppress background segments with the unimodal importance scores g^a, g^v calculated in Eq. (3), and obtain the importance-weighted unimodal features A^g and V^g:

A^g = A^fu + (g^a + g^v) · A^fu,  V^g = V^fu + (g^a + g^v) · V^fu.

In order to filter out unimodal events, we predict the audio and visual event scores as follows:

s^a = σ(A^g · W_uni^a),  s^v = σ(V^g · W_uni^v),    (7)

where W_uni^a, W_uni^v ∈ R^{d×1}. The unimodal event scores reflect the likelihood of event occurrence in their respective modalities. Therefore, by taking the smaller value between s^a and s^v, we can effectively reduce the scores of unimodal event segments. Meanwhile, we utilize the segment-level fused features from Eq. (12) to calculate the fused event scores:

s^r = σ(F^f W^r),    (8)

where W^r ∈ R^{d×1}. Finally, s^r ∈ R^T and min(s^a, s^v) ∈ R^T collaborate to calculate the eventness scores s^e:

s^e = (1 − γ) · s^r + γ · min(s^a, s^v),    (9)

where γ is a hyperparameter. The eventness score represents the probability that an audio-visual event exists in the segment, and it is used for localization.
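The eventness computation of Eqs. (7)–(9) reduces to a few vectorized operations. Below is a minimal sketch under assumed shapes; γ = 0.9 follows the supervised default reported in the hyper-parameter study, while the random tensors stand in for learned features and weights.

```python
import torch

def eventness_scores(A_g, V_g, F_f, W_a, W_v, W_r, gamma=0.9):
    """Collaborative unimodal event filter (cf. Eqs. (7)-(9)).

    A_g, V_g: importance-weighted unimodal features, shape (T, d)
    F_f:      segment-level fused features, shape (T, d)
    """
    s_a = torch.sigmoid(A_g @ W_a).squeeze(-1)   # audio event scores, (T,)
    s_v = torch.sigmoid(V_g @ W_v).squeeze(-1)   # visual event scores, (T,)
    s_r = torch.sigmoid(F_f @ W_r).squeeze(-1)   # fused event scores, (T,)
    # min(s_a, s_v) stays high only when both modalities carry the event,
    # which suppresses unimodal event segments.
    return (1 - gamma) * s_r + gamma * torch.minimum(s_a, s_v)

T, d = 10, 256
s_e = eventness_scores(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
                       torch.randn(d, 1), torch.randn(d, 1), torch.randn(d, 1))
```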

3.4 Dual-Interactive Multimodal Event Classification

Cross-Modal Video-Level Feature Fusion. In addition to segment-level eventness prediction, video-level event classification is also essential for precise audio-visual event localization. Hence, we integrate event information from all segments within the video to obtain comprehensive video-level features. Specifically, we first calculate the segment-level multimodal importance scores g^f = σ(W_g^f · F^o), where W_g^f ∈ R^{1×d}, and use them as weights to obtain the video-level audio-visual features via a weighted sum: F = Σ_{t=1}^{T} (F_t^o · g_t^f). The final video-level audio-visual features focus on event-related information across segments and are employed to predict the video event category scores s^c ∈ R^C:

s^c = Softmax(F W_c).    (10)


Video-Guided Segment Enhancement. Compared to the segment-level audio-visual features, the video-level features F aggregate all segment features within the video and thereby provide a richer context of event information. Therefore, we employ the video-level features to guide the segment features, directing their attention toward event-related information:

M_t^g = σ(((F U_g ⊙ F_t^o U_s) U_3) U_4),    (11)

where U_g, U_s ∈ R^{d×d_h}, U_3 ∈ R^{d_h×d} and U_4 ∈ R^{d×d} are fully-connected layers with ReLU. M_t^g adaptively emphasizes the event information in the t-th segment. The final segment-level fused features are calculated as

F_t^f = F_t^o ⊙ M_t^g,    (12)

and F^f is used for the calculation of the eventness scores in Eq. (8).
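The dual interaction of Eqs. (10)–(12) can be sketched as two small functions: one that pools segments into a video-level feature for classification, and one that feeds that video-level feature back to re-weight segment features. Shapes, ReLU placement, and names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def video_level_classify(F_o, W_gf, W_c):
    """Cross-modal video-level fusion and classification (cf. Eq. (10))."""
    g_f = torch.sigmoid(F_o @ W_gf)              # (T, 1) segment importance
    F_vid = (F_o * g_f).sum(dim=0)               # (d,) video-level feature
    s_c = F.softmax(F_vid @ W_c, dim=-1)         # (C,) category scores
    return F_vid, s_c

def video_guided_enhance(F_vid, F_o, U_g, U_s, U_3, U_4):
    """Video-guided segment enhancement (cf. Eqs. (11)-(12))."""
    guide = F.relu(F_vid @ U_g)                                  # (d_h,)
    m = torch.sigmoid(F.relu(F.relu(F_o @ U_s) * guide) @ U_3 @ U_4)  # (T, d)
    return F_o * m                                               # segment features F^f

T, d, d_h, C = 10, 256, 128, 28
F_o = torch.randn(T, d)
F_vid, s_c = video_level_classify(F_o, torch.randn(d, 1), torch.randn(d, C))
F_f = video_guided_enhance(F_vid, F_o, torch.randn(d, d_h), torch.randn(d, d_h),
                           torch.randn(d_h, d), torch.randn(d, d))
```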

3.5 Localization and Objective Function

After the above modules, we obtain two outputs: the video-level event category scores s^c (Eq. (10)) and the segment-level eventness scores s^e (Eq. (9)). These scores are crucial for the final prediction and model optimization. To train our model in the supervised setting, we first decouple the ground truth of the video Y^f into two labels: the video-level event category label c = {c_k | c_k ∈ {0,1}, k = 0, ..., C, Σ_{k=1}^{C} c_k = 1} and the segment-level eventness label e ∈ R^T. Based on these two labels, the overall objective function is formulated as

L_fully = L^c + (1/T) Σ_{t=1}^{T} L_t^e,    (13)

where L^c denotes the cross-entropy loss between the category scores s^c and the ground-truth label c, and L_t^e is the binary cross-entropy loss between the eventness score s_t^e and the ground-truth label e_t. During the inference phase, if s_t^e > 0.5, the t-th video segment is classified as belonging to the predicted class s^c; otherwise, the t-th video segment is labeled as background.

In the weakly-supervised setting, only video-level event category labels are available during training, so we cannot supervise the eventness scores directly. We therefore combine the eventness scores with segment-level event category scores for training:

s^wc = F^o W^wc,  s^w = Softmax(Σ_{t=1}^{T} (s_t^wc ⊙ s_t^e)),  L_weakly = L^c + L^wc,    (14)

where W^wc ∈ R^{d×C} and L^wc is the cross-entropy loss between s^w and the ground-truth label c. After training, we generate segment-level pseudo labels on the training set, which are used to retrain our model in the supervised setting.
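A minimal sketch of the two objectives follows, assuming the reconstructed forms of Eqs. (13)–(14) above; the exact reduction and the composition of L_weakly are assumptions based on that reconstruction, and the score tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def supervised_loss(s_c, s_e, video_label, eventness):
    """Eq. (13): video-level category CE plus per-segment eventness BCE."""
    # s_c: (C,) softmax scores; s_e: (T,) eventness in [0,1]; eventness: (T,) in {0,1}
    l_c = F.nll_loss(torch.log(s_c + 1e-8).unsqueeze(0), video_label)
    l_e = F.binary_cross_entropy(s_e, eventness.float())  # mean over T
    return l_c + l_e

def weak_category_loss(F_o, s_e, W_wc, video_label):
    """The L^wc term of Eq. (14); Eq. (14) additionally adds the video-level L^c."""
    s_wc = F_o @ W_wc                                           # (T, C)
    s_w = F.softmax((s_wc * s_e.unsqueeze(-1)).sum(dim=0), dim=-1)  # (C,)
    return F.nll_loss(torch.log(s_w + 1e-8).unsqueeze(0), video_label)

T, d, C = 10, 256, 28
s_c = torch.softmax(torch.randn(C), dim=-1)
s_e = torch.rand(T)
label, ev = torch.tensor([3]), torch.randint(0, 2, (T,))
l_sup = supervised_loss(s_c, s_e, label, ev)
l_wc = weak_category_loss(torch.randn(T, d), s_e, torch.randn(d, C), label)
```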

4 Experiments and Analysis

4.1 Experiment Setup

The Audio-Visual Event Dataset. The Audio-Visual Event dataset, proposed by Tian et al. [18], is the sole publicly accessible dataset designed for the audio-visual event localization task. It is derived from AudioSet [2] and comprises 4143 videos, each spanning a duration of 10 s. These videos cover 28 diverse event categories, such as Guitar, Frying food, and Church bell.

Evaluation Metrics. Following previous works [5,18,21,23,28], we adopt the overall segment classification accuracy over the C event categories and one background category as the evaluation metric. We utilize the integration of the final segment-level eventness scores s^e and the video-level event category scores s^c as the event prediction for evaluation.

Implementation Details. We adopt the VGG-19 [15] model pre-trained on ImageNet [4] to extract segment-level visual features. Similarly, the 128-dimensional audio features are obtained by a VGG-like [3] model pre-trained on AudioSet [2]. The training batch size is set to 64, while the test batch size is 32. Following the approach in previous works [18,22], we initialize the learning rate to 7 × 10^-4 and decay it by a factor of 0.5 at epochs 10, 20 and 30.

Table 1. Comparisons with state-of-the-art methods in supervised and weakly-supervised manners on the AVE dataset.

Models             | Reference   | Supervised | Weakly-supervised
-------------------|-------------|------------|------------------
AVEL (audio) [18]  | ECCV 2018   | 59.5       | 53.4
AVEL (visual) [18] | ECCV 2018   | 55.3       | 52.9
AVSDN [5]          | ICASSP 2019 | 72.6       | 66.8
AVEL [18]          | ECCV 2018   | 72.7       | 66.8
CMAN [24]          | AAAI 2020   | 73.3       | 66.7
DAM [21]           | ICCV 2019   | 74.5       | 70.4
AVRB [14]          | WACV 2020   | 74.8       | 68.9
AVIN [13]          | ICASSP 2020 | 75.2       | 69.4
AVT [7]            | ACCV 2020   | 76.8       | 70.2
CMRAN [23]         | ACMMM 2020  | 77.4       | 72.9
PSP [28]           | CVPR 2021   | 77.8       | 73.5
BMFN [9]           | ICASSP 2022 | 78.7       | 74.0
CMBS [22]          | CVPR 2022   | 79.3       | 74.2
DCMR [10]          | TMM 2022    | 79.6       | 74.3
UMCE (Ours)        | –           | 80.1       | 75.1

4.2 Comparisons with State-of-the-Art Methods

As presented in Table 1, we conduct a comprehensive comparison between our proposed UMCE network and several previous methods. The results demonstrate that our approach achieves superior performance in both supervised and weakly-supervised settings. Specifically, in the supervised setting, when compared to the top two AVE methods, UMCE surpasses CMBS [22] and DCMR [10] by 0.8% and 0.5%, respectively. Moreover, in the weakly-supervised setting, our method achieves the highest performance of 75.1%, surpassing CMBS [22] by 0.9% and DCMR [10] by 0.8%.

Table 2. Ablation study of different modules on the AVE dataset.

Models    | Supervised | Weakly-supervised
----------|------------|------------------
w/o FGUE  | 78.13      | 73.93
w/o MAF   | 79.45      | 73.55
w/o CUEF  | 79.65      | 74.92
w/o VFF   | 79.08      | 74.62
w/o VSE   | 78.61      | 74.57
UMCE      | 80.10      | 75.07

4.3 Ablation Studies

We validate the effectiveness of each component of our method in Table 2, including fusion-guided unimodal enhancement (FGUE), multimodal adaptive fusion (MAF), the collaborative unimodal event filter (CUEF), cross-modal video-level feature fusion (VFF) and video-guided segment enhancement (VSE). The removal of any module leads to a decrease in localization accuracy, indicating the significant impact of each module on localization capability. In the following, we discuss the effectiveness of the structural design of certain modules.

Impact of Fusion-Guided Unimodal Enhancement. As illustrated in Table 3a, we compare our method with several different settings of the fusion-guided unimodal enhancement module, where "w/o FGUE" refers to deleting this module entirely, and "audio attention" and "visual attention" denote applying fusion-guided attention only to audio or only to visual features, respectively. Applying fusion-guided enhancement to both modalities performs better than enhancing a single modality alone, confirming that both attention branches are indispensable for FGUE.

The Effectiveness of the Multimodal Adaptive Fusion. In the multimodal adaptive fusion module, the fused feature F^o comprises two components: one is the inter-segment feature derived from the cross-modal attention [23]; the other is the intra-segment feature obtained from the multimodal adaptive fusion. In Table 3b, "w/o MAF" denotes that the model directly assigns the features from the cross-modal attention [23] to F^o, while "w/o CMRA" means utilizing MAF only. These ablation experiments demonstrate the necessity of both intra- and inter-segment information for audio-visual event localization.

Table 3. Ablation studies on unimodal-multimodal event-aware interaction. (a) is on fusion-guided unimodal enhancement (FGUE). (b) is on multimodal adaptive fusion (MAF).

(a) Models       | Supervised | Weakly-supervised
-----------------|------------|------------------
w/o FGUE         | 78.13      | 73.93
audio attention  | 79.23      | 74.05
visual attention | 78.51      | 74.77
FGUE             | 80.10      | 75.07

(b) Models       | Supervised | Weakly-supervised
-----------------|------------|------------------
w/o MAF          | 79.45      | 73.55
w/o CMRA [23]    | 78.56      | 74.23
MAF+mean         | 80.10      | 75.07

Influence of the Unimodal and Fusion Filter Branches. For the collaborative unimodal event filter, we conduct an ablation study to verify the necessity of the unimodal and fused event scores, comparing the following settings: using only the unimodal event scores ("unimodal branch"), using only the fused event scores ("fusion branch") and using both ("unimodal+fusion branch"). As illustrated in Table 4, utilizing both scores yields significantly better performance than using either the unimodal or the fused event scores alone. This demonstrates that both unimodal and fused event scores are indispensable for accurate event localization.

Table 4. Ablation study on the collaborative unimodal event filter.

Models                 | Supervised | Weakly-supervised
-----------------------|------------|------------------
unimodal branch        | 78.56      | 74.90
fusion branch          | 79.65      | 74.92
unimodal+fusion branch | 80.10      | 75.07

Influence of Hyper-parameters. Figure 3 shows the experimental results obtained by varying the hyperparameters α from Eq. (2), β from Eq. (6), and γ from Eq. (9). The optimal performance in the supervised setting is achieved with α = 0.7, β = 0.3 and γ = 0.9. Similarly, for weakly-supervised event localization, the best performance is achieved with α = 0.7, β = 0.2 and γ = 0.4.


Fig. 3. Comparison of different hyperparameter values.

Fig. 4. Qualitative results of our model on a dog barking event. The red regions indicate the results predicted by our model. (Color figure online)

4.4 Qualitative Analysis

Figure 4 presents a qualitative example that showcases the efficacy of our proposed UMCE network. Although the fused event scores are capable of identifying audio-visual events, they are prone to misclassifying unimodal events as audio-visual events. Conversely, the unimodal event scores accurately pinpoint event occurrences within each modality and filter out unimodal event segments. By considering both the unimodal event scores and the fused event scores, the final eventness scores mitigate the interference caused by unimodal events and enable precise localization of audio-visual events.

5 Conclusion

In this paper, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) network to address the mislocalization of unimodal events. Our approach focuses on enhancing event information in both audio/visual and fused features through unimodal-multimodal event-aware interaction. Then we leverage the event semantics of unimodal features to identify unimodal events and assist fused features in the localization. Additionally, our network facilitates information exchange between segment- and video-level fused features to improve event classification. Experimental results demonstrate the superiority of our UMCE approach in both supervised and weakly-supervised settings.


References
1. Cao, Y., Min, X., Sun, W., Zhai, G.: Attention-guided neural networks for full-reference and no-reference audio-visual quality assessment. TIP 32, 1882–1896 (2023)
2. Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: ICASSP (2017)
3. Hershey, S., et al.: CNN architectures for large-scale audio classification. In: ICASSP (2017)
4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017)
5. Lin, Y., Li, Y., Wang, Y.F.: Dual-modality seq2seq network for audio-visual event localization. In: ICASSP (2019)
6. Lin, Y.B., Tseng, H.Y., Lee, H.Y., Lin, Y.Y., Yang, M.H.: Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing. NIPS (2021)
7. Lin, Y.B., Wang, Y.C.F.: Audiovisual transformer with instance attention for audio-visual event localization. In: ACCV (2020)
8. Lin, Z., et al.: A structured self-attentive sentence embedding. In: ICLR (2017)
9. Liu, S., Quan, W., Liu, Y., Yan, D.: Bi-directional modality fusion network for audio-visual event localization. In: ICASSP (2022)
10. Liu, S., Quan, W., Wang, C., Liu, Y., Liu, B., Yan, D.M.: Dense modality interaction network for audio-visual event localization. TMM (2022)
11. Mercea, O.B., Riesch, L., Koepke, A., Akata, Z.: Audio-visual generalised zero-shot learning with cross-modal attention and language. In: CVPR (2022)
12. Qin, S., Li, Z., Liu, L.: Robust 3D shape classification via non-local graph attention network. In: CVPR (2023)
13. Ramaswamy, J.: What makes the sound?: A dual-modality interacting network for audio-visual event localization. In: ICASSP (2020)
14. Ramaswamy, J., Das, S.: See the sound, hear the pixels. In: WACV (2020)
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
16. Stergiou, A., Damen, D.: The wisdom of crowds: temporal progressive attention for early action prediction. In: CVPR (2023)
17. Tang, Z., Qiu, Z., Hao, Y., Hong, R., Yao, T.: 3D human pose estimation with spatio-temporal criss-cross attention. In: CVPR (2023)
18. Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: ECCV (2018)
19. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
20. Wang, H., Zha, Z., Li, L., Chen, X., Luo, J.: Multi-modulation network for audio-visual event localization. CoRR (2021)
21. Wu, Y., Zhu, L., Yan, Y., Yang, Y.: Dual attention matching for audio-visual event localization. In: ICCV (2019)
22. Xia, Y., Zhao, Z.: Cross-modal background suppression for audio-visual event localization. In: CVPR (2022)
23. Xu, H., Zeng, R., Wu, Q., Tan, M., Gan, C.: Cross-modal relation-aware networks for audio-visual event localization. In: ACM MM (2020)
24. Xuan, H., Luo, L., Zhang, Z., Yang, J., Yan, Y.: Discriminative cross-modality attention network for temporal inconsistent audio-visual event localization. TIP 30, 7878–7888 (2021)
25. Xuan, H., Zhang, Z., Chen, S., Yang, J., Yan, Y.: Cross-modal attention network for temporal inconsistent audio-visual event localization. In: AAAI (2020)
26. Yang, J., et al.: Modeling point clouds with self-attention and gumbel subset sampling. In: CVPR (2019)
27. Yu, J., Cheng, Y., Feng, R.: MPN: multimodal parallel network for audio-visual event localization. In: ICME (2021)
28. Zhou, J., Guo, D., Wang, M.: Contrastive positive sample propagation along the audio-visual event line. TPAMI (2023)

Dual-Memory Feature Aggregation for Video Object Detection

Diwei Fan 1,2,3, Huicheng Zheng 1,2,3 (B), and Jisheng Dang 1,2,3

1 School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China
2 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou, China
3 Guangdong Province Key Laboratory of Information Security Technology, Guangzhou, China
[email protected]

Abstract. Recent studies on video object detection have shown the advantages of aggregating features across frames to capture temporal information, which can mitigate appearance degradation, such as occlusion, motion blur, and defocus. However, these methods often employ a sliding window or memory queue to store temporal information frame by frame, leading to discarding features of earlier frames over time. To address this, we propose a dual-memory feature aggregation framework (DMFA). DMFA simultaneously constructs a local feature cache and a global feature memory in a feature-wise updating way at different granularities, i.e., pixel level and proposal level. This approach can partially preserve key features across frames. The local feature cache stores the spatio-temporal contexts from nearby frames to boost the localization capacity, while the global feature memory enhances semantic feature representation by capturing temporal information from all previous frames. Moreover, we introduce contrastive learning to improve the discriminability of temporal features, resulting in more accurate proposal-level feature aggregation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the ImageNet VID benchmark.

Keywords: video object detection · feature aggregation · temporal information · global memory · local feature cache

1 Introduction

As a basic task of computer vision, image object detection has achieved remarkable progress. Recent research efforts have focused on video tasks in real-world scenarios, which often face appearance degradation caused by occlusion, motion blur, and defocus. To overcome these challenges, existing video object detection (VOD) methods have attempted to aggregate temporal information for feature enhancement.


Specifically, local temporal information is crucial for improving localization performance, while global temporal information is beneficial for augmenting the semantic representation of features. FGFA [28] and STMN [26] enhance the key frame by propagating pixel-level features within a local range. SELSA [25] and LRTR [21] leverage the long-range temporal relationships between proposals to aggregate features. MEGA [1] is the first to model both local and global temporal information across proposal-level features, significantly boosting detection accuracy. However, these methods only use a single granularity of temporal information for feature enhancement. Without pixel-level temporal alignment, frame variations lead to erroneous proposals that decrease the performance of proposal-level feature aggregation. On the other hand, ignoring proposal-level enhancement prevents the establishment of fine-grained spatio-temporal correlations between objects. Besides, none of the aforementioned methods fully exploit the temporal information from all previous frames, as they rely on a frame-wise first-in-first-out (FIFO) updating structure that discards all information from earlier frames due to GPU memory limitations.

In this paper, we propose a dual-memory feature aggregation framework (DMFA) for pixel-level and proposal-level feature enhancement. Concretely, the dual memory consists of a local feature cache that stores spatio-temporal contexts from nearby frames and a global feature memory that captures temporal semantic features from all previous frames. To accommodate comprehensive temporal information, we adopt a feature-wise updating strategy rather than the frame-wise updating approach used in FIFO structures, which enables the partial preservation of features across frames. Through these designs, we extract sufficient temporal information for effective aggregation at different granularities, leading to high-quality spatio-temporally consistent feature representations that benefit subsequent detection. Besides, we employ contrastive learning to explore the temporal correlations among object features, enhancing their discriminability without introducing additional computational complexity. This approach pulls the features of the same objects closer in the embedding space and pushes those of different objects farther away.

The contributions can be summarized as follows: 1) We propose a dual-memory feature aggregation framework that simultaneously captures local spatio-temporal contexts and global temporal information from all previous frames to enhance pixel-level and proposal-level features. 2) We introduce contrastive learning into the VOD task to enhance the discriminability of features and further improve detection performance. 3) Without bells and whistles, our DMFA achieves state-of-the-art performance on the commonly-adopted ImageNet VID benchmark.

2 Related Work

2.1 Image Object Detection

Current image object detection methods can be divided into one-stage and two-stage methods. One-stage methods are generally faster but less accurate, as they directly predict object category probabilities and position coordinates. Representative works include SSD [17], RetinaNet [16], and YOLO [18]. Two-stage methods prioritize accuracy over speed. They first generate candidate regions and then refine them to achieve accurate detection results. Related works include Fast R-CNN [4], Faster R-CNN [20], and Mask R-CNN [10].

2.2 Video Object Detection

Although video object detection is sensitive to appearance degradation, e.g., partial occlusion, motion blur, and defocus, abundant temporal information from other video frames may help improve detection performance. Existing VOD methods can be categorized into box-level and feature-level methods. Box-level methods use additional tracking models or post-processing techniques to model the temporal associations between detected bounding boxes. For example, T-CNN [15] incorporates a tracker to model object motion and refines the results through temporal convolution. Seq-NMS [7] re-scores boxes belonging to the same object sequence by computing the intersection over union (IoU) between the bounding boxes. However, these methods introduce high computational costs or cannot serve as an end-to-end framework. In contrast, feature-level methods attempt to aggregate temporal features from other frames to enhance the current frame features end-to-end. They generally associate features between frames in one of two ways: optical flow-based and attention-based methods. Specifically, DFF [29] and FGFA [28] introduce the optical flow predicted by FlowNet [3] to align motion features across frames. Due to the heavy optical flow computational cost, a series of attention-based methods have been proposed. SELSA [25] exploits a multi-frame semantic similarity strategy to aggregate high-level proposal features. RDN [2] and MEGA [1] leverage relation modules [13] to capture spatial-temporal contexts across frames to augment object features. MAMBA [22] introduces a unified feature enhancement module to aggregate multi-level features. QueryProp [8] speeds up the aggregation process through a lightweight framework. Moreover, TransVOD [12] applies transformer architecture to VOD and achieves remarkable results.

3 Method

We first present an overview of the notations commonly used in VOD and the relation module [13] in Sect. 3.1. In Sect. 3.2, we explain the architecture of our dual memory and its pivotal role in feature aggregation. Additionally, we illustrate how to employ contrastive learning to improve the temporal correlations of object features in Sect. 3.3.

3.1 Preliminary

As illustrated in Fig. 1, we denote the current t-th frame and its corresponding pixel-level features as I_t and F_t, respectively. Through the RPN and ROI pooling [20], we obtain the proposal-level features P_t. Our goal is to enhance F_t and P_t with our dual memory, thereby improving the performance of subsequent classification and regression.

Fig. 1. Overview of our DMFA framework. Given the current frame It , dual memories enhance both pixel-level and proposal-level features. During training, the standard detection loss, i.e., Ldetection , and our proposed contrastive embedding loss, i.e., Lcontrast , are combined to update the model parameters.

Relation Module. We employ the relation module to model the relationships between features. Given a set of features F = {f_i}_{i=1}^{N_c} (N_c denotes the number of features), a relation module enhances each f_i by computing relation features as a weighted sum of the other features with N multi-attention heads [23]. Specifically, the relation features of f_i from the n-th head are computed as

f_R^{n,*}(f_i, F) = Σ_j ω_{ij}^{n,*} (W_V^n f_j),  n = 1, ..., N,    (1)

where W_V^n denotes a linear transformation matrix and ω_{ij}^{n,*} is the relation weight between f_i and f_j, which reflects the appearance and, optionally, the geometric similarity. According to [1], * ∈ {Z, U} denotes whether the geometric similarity is considered, where Z indicates inclusion and U indicates exclusion. Finally, we obtain the enhanced features of f_i by adding the concatenated relation features from the N heads:

f_{rn}^{*}(f_i, F) = f_i + [f_R^{n,*}(f_i, F)]_{n=1}^{N},    (2)

where [·] and + denote channel-wise concatenation and element-wise addition.
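The geometry-free variant of this relation module (Eqs. (1)–(2)) is essentially multi-head attention over a feature bank. The sketch below is an illustrative simplification (per-head value projections share one linear layer and the geometric term is omitted), not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationModule(nn.Module):
    """Geometry-free multi-head relation module (cf. Eqs. (1)-(2))."""
    def __init__(self, d, n_heads=8):
        super().__init__()
        assert d % n_heads == 0
        self.h, self.dk = n_heads, d // n_heads
        self.W_q, self.W_k, self.W_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, f, bank):                    # f: (N, d), bank: (M, d)
        q = self.W_q(f).view(-1, self.h, self.dk).transpose(0, 1)     # (h, N, dk)
        k = self.W_k(bank).view(-1, self.h, self.dk).transpose(0, 1)  # (h, M, dk)
        v = self.W_v(bank).view(-1, self.h, self.dk).transpose(0, 1)  # (h, M, dk)
        w = F.softmax(q @ k.transpose(1, 2) / self.dk ** 0.5, dim=-1) # appearance weights
        rel = (w @ v).transpose(0, 1).reshape(f.size(0), -1)          # concat heads
        return f + rel                              # residual addition, Eq. (2)

rm = RelationModule(d=1024)
enhanced = rm(torch.randn(300, 1024), torch.randn(1000, 1024))
```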

3.2 Dual-Memory Feature Aggregation

In this part, we discuss how the dual memory captures sufficient temporal information for feature aggregation at various granularities.

Dual-Memory Architecture. We provide an overview of our dual memory and its updating procedure in Fig. 2. The dual memory consists of a local feature cache and a global feature memory. The local feature cache M_L = {m_i^L}_{i=1}^{N_l} stores the semantic feature representations from nearby frames and preserves box information for proposal-level enhancement. The global feature memory M_G = {m_i^G}_{i=1}^{N_g} captures rich semantic feature representations from all previous frames. To reduce the computational cost while ensuring feature diversity, we randomly sample features from M_G to form a lightweight feature subset S_G = {s_i^G}_{i=1}^{N_s} for feature aggregation. Here, N_s, N_g, and N_l denote the sizes of the global feature subset, global feature memory, and local feature cache, respectively. Previous methods update the memory in a frame-wise manner, which only considers temporal information from a fixed number of frames. In contrast, we adopt a feature-wise updating strategy that partially preserves features across frames and can therefore capture a broader range of temporal information from all previous frames. Suppose K features of the current frame are to be added; we replace the earliest K features in M_L with them. To accommodate these displaced local features, M_G randomly removes an equal number of existing global features.
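A minimal sketch of this feature-wise updating scheme follows. It assumes that displaced local features migrate into the global memory, as described above; the class layout, default sizes (taken from the defaults reported in the experiments: N_l = 1000, N_g = 20000, N_s = 1000), and method names are illustrative, not the authors' code.

```python
from collections import deque
import random
import torch

class DualMemory:
    """Illustrative feature-wise update of the local cache and global memory."""
    def __init__(self, n_local=1000, n_global=20000, n_sample=1000):
        self.local = deque()
        self.global_mem = []
        self.n_local, self.n_global, self.n_sample = n_local, n_global, n_sample

    def update(self, new_feats):                   # new_feats: list of (d,) tensors
        # the earliest K local features are displaced into the global memory
        displaced = []
        while len(self.local) + len(new_feats) > self.n_local and self.local:
            displaced.append(self.local.popleft())
        self.local.extend(new_feats)
        # global memory randomly drops features to make room for the displaced ones
        overflow = len(self.global_mem) + len(displaced) - self.n_global
        for _ in range(max(0, overflow)):
            self.global_mem.pop(random.randrange(len(self.global_mem)))
        self.global_mem.extend(displaced)

    def sample_global(self):
        n = min(self.n_sample, len(self.global_mem))
        return random.sample(self.global_mem, n)

mem = DualMemory()
mem.update([torch.randn(1024) for _ in range(75)])   # e.g. top-75 proposal features
subset = mem.sample_global()
```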

Fig. 2. The architecture of our dual memory. It takes pixel-level features F_t or proposal-level features P_t as input to generate the enhanced features. For convenience, we uniformly define the input as Q_t^{L,k-1}, the globally-enhanced features as Q_t^{G,k}, and the locally-enhanced features as Q_t^{L,k}; k denotes the k-th iteration.

Feature Aggregation. As illustrated in Fig. 1, dual memories are constructed at the pixel level and proposal level to enhance the features of the current frame. Given a set of features Q_t (F_t or P_t), we apply the relation module to aggregate the global and local temporal information captured in the dual memories. Specifically, Q_t first aggregates the sampled global features from S_G, and these globally-enhanced features then integrate local information from M_L. Note that a single relation module can hardly model such complex temporal contexts. To fully exploit the potential of the temporal information, the enhancement process is recursively performed N_p (N_pix or N_pro) times, which is formulated as follows:

Q_t^{G,k} = f_{rn}^{U}(Q_t^{L,k-1}, S_G),  k = 1, ..., N_p,    (3)

Q_t^{L,k} = f_{rn}^{*}(Q_t^{G,k}, M_L),  k = 1, ..., N_p,    (4)

where Q_t^{L,0} (i.e., Q_t^{L,k-1} with k = 1) denotes the input at the first iteration, and Q_t^{G,k} and Q_t^{L,k} represent the global and local features enhanced at the k-th iteration, respectively. Additionally, f_{rn}^{U}(·) and f_{rn}^{Z}(·) refer to the geometry-free and geometry-based relation modules, respectively, defined in Eq. (2). When performing local feature aggregation in Eq. (4), we utilize the geometry-free relation module at the pixel level and the geometry-based relation module at the proposal level, because leveraging the box information preserved in M_L can further improve the ability to locate objects. Finally, the output of the N_p-th local feature aggregation is taken as F_t^e or P_t^e. With the support of iterative dual-memory enhancement, the pixel-level features establish fine-grained temporally consistent representations, enabling the generation of high-quality proposals. Meanwhile, the proposal-level features perceive sufficient spatio-temporal information, thus improving the robustness of object features against appearance degradation.
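The iterative aggregation of Eqs. (3)–(4) amounts to alternating two relation-module calls. The sketch below keeps the relation modules as generic callables (an identity stand-in is used in the toy call); shapes and names are assumptions.

```python
import torch

def dual_memory_aggregate(Q, S_G, M_L, rel_global, rel_local, n_iter=2):
    """Iterative global-then-local enhancement (cf. Eqs. (3)-(4)).

    Q:          (N, d) pixel- or proposal-level features of the current frame.
    S_G, M_L:   sampled global subset and local cache, each (M, d).
    rel_global: geometry-free relation module; rel_local: geometry-based at the
                proposal level (geometry-free at the pixel level).
    """
    for _ in range(n_iter):
        Q = rel_global(Q, S_G)    # Eq. (3): absorb global semantics
        Q = rel_local(Q, M_L)     # Eq. (4): absorb local spatio-temporal context
    return Q

rel = lambda q, bank: q           # identity stand-in; plug in a relation module
out = dual_memory_aggregate(torch.randn(300, 1024), torch.randn(1000, 1024),
                            torch.randn(1000, 1024), rel, rel, n_iter=2)
```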

3.3 Contrastive Temporal Correlation

More discriminative features help distinguish different objects, boosting the quality of temporal correlations. To achieve this, we adopt contrastive learning to encourage gathering features of the same objects in the embedding space while pushing those of different objects farther away. Before updating the enhanced proposal-level features into the dual memory, we calculate a contrastive embedding loss, which is denoted as follows:

L_contrast = Σ_{P_t^e} log[1 + Σ_{M_L^+} Σ_{M_L^-} exp(P_t^e · M_L^-) / exp(P_t^e · M_L^+)],    (5)

where exp(a · b) measures the similarity between the feature embeddings a and b, P_t^e denotes the enhanced proposal-level feature embeddings, and M_L^+ and M_L^- are positive and negative feature embeddings from the local feature cache. Specifically, we select positive and negative samples with a predefined similarity threshold ε, which is empirically set to 0.6. If the similarity between the feature embeddings is greater than ε, we label them as M_L^+; otherwise, they are regarded as M_L^-. By minimizing the contrastive embedding loss, our framework enhances the discriminability and temporal consistency of the feature representations without adding computational complexity.
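Assuming the reconstructed form of Eq. (5) above, a vectorized sketch of the loss is given below; the reduction over proposals and the absence of normalization are assumptions of this reconstruction.

```python
import torch

def contrastive_embedding_loss(p, pos, neg):
    """Sketch of the contrastive term in Eq. (5).

    p:   (N, d) enhanced proposal embeddings of the current frame
    pos: (P, d) positive embeddings from the local cache (similarity > eps)
    neg: (Q, d) negative embeddings from the local cache
    """
    sim_pos = p @ pos.t()                                  # (N, P)
    sim_neg = p @ neg.t()                                  # (N, Q)
    # exp(p·m-) / exp(p·m+) = exp(p·m- − p·m+), summed over all +/- pairs
    ratio = torch.exp(sim_neg.unsqueeze(1) - sim_pos.unsqueeze(2))   # (N, P, Q)
    return torch.log1p(ratio.sum(dim=(1, 2))).sum()

loss = contrastive_embedding_loss(torch.randn(8, 128),
                                  torch.randn(5, 128), torch.randn(20, 128))
```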

4 Experiments

4.1 Dataset and Metric

We conduct experiments on the ImageNet VID dataset [19], which is a large-scale public benchmark for VOD. This dataset comprises 3,862 training videos and 555 validation videos. Following the previous protocols in [1,2,22,25], we train our model on the intersection of the 30 classes shared by the ImageNet VID and DET datasets. We then evaluate the performance of our method using the mean average precision (mAP) metric.

4.2 Implementation Details

Backbone and Detection Network. We adopt ResNet-101 [11] as the backbone. Following the standard practice in [1,2,25,28], we modify the stride of the first conv block in the conv5 stage from 2 to 1 in order to increase the resolution of the feature maps. To maintain the receptive field, all the 3 × 3 conv layers are replaced by dilated convolutions with a dilation rate of 2. We use Faster R-CNN [20] as the baseline detector and place the RPN on top of the conv4 stage. In the RPN, we utilize 4 scales {64^2, 128^2, 256^2, 512^2} and 3 aspect ratios {1:2, 1:1, 2:1} to generate 12 anchors for each spatial location. Besides, non-maximum suppression is applied with an IoU threshold of 0.7 to generate 300 proposals per frame.

Training and Inference Details. During both training and testing, we resize images to a shorter side of 600 pixels. Our models are trained on two Titan X GPUs with SGD. Each GPU processes a mini-batch consisting of the key frame and a randomly selected frame. The training consists of two stages. In the first stage, we train the pixel-level enhancement for 60K iterations with an initial learning rate of 0.001, which is reduced to 0.0001 after 40K iterations. In the second stage, we train the entire model for 120K iterations, with a learning rate of 0.001 for the first 80K iterations and 0.0001 for the remaining 40K iterations. After detection, the proposal-level dual memory is updated using the top K = 75 proposal features with the highest object scores. For the pixel-level dual memory update, K = 50 pixel features are randomly sampled within each proposal.
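For concreteness, the anchor configuration described above (4 scales × 3 aspect ratios = 12 anchors per location) can be enumerated as follows; this is only an illustration of the configuration, not the detector code.

```python
import itertools

scales = [64, 128, 256, 512]          # anchor areas are scale**2
ratios = [(1, 2), (1, 1), (2, 1)]     # width : height

anchors = []
for s, (rw, rh) in itertools.product(scales, ratios):
    area = float(s * s)
    w = (area * rw / rh) ** 0.5       # keep the area fixed for each ratio
    h = area / w
    anchors.append((round(w, 1), round(h, 1)))

print(len(anchors))                   # 12 anchors per spatial location
```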

4.3 Main Results

Quantitative Results. Table 1 compares our method with state-of-the-art methods using the ResNet-101 backbone. Our DMFA achieves the best accuracy (mAP). Specifically, our DMFA achieves 83.5% mAP, 8.9% higher than the baseline Faster R-CNN [20]. Compared with optical flow-based methods like FGFA [28] and MANet [24], our DMFA performs better, showing improvements of 7.2% and 5.4%, respectively. The reason is that they only adopt local temporal information to align pixel-level features while ignoring the broader range of temporal information. Besides, RDN [2], SELSA [25], and TROI [5] utilize limited temporal information to enhance proposal-level features, resulting in inferior performance to ours. As two strong competitors, MEGA [1] aggregates both global and local information to improve accuracy, while QueryProp [8] uses a sparse query-based framework to achieve a trade-off between accuracy and speed. Our DMFA outperforms MEGA by 0.6% mAP and QueryProp by 1.2% mAP, owing to our effective utilization of sufficient temporal information at various granularities. Furthermore, we re-implement MEGA on our device for a fair comparison; our DMFA shows a gain of 1.7% mAP over the re-implemented MEGA, which is compelling evidence of our state-of-the-art performance.

Table 1. Comparison with state-of-the-art methods on the ImageNet VID validation set. "*" denotes the methods that we re-implemented.

Methods           | Publication | Backbone       | Detector     | mAP  | FPS
------------------|-------------|----------------|--------------|------|-----
Faster R-CNN [20] | NIPS-2015   | ResNet-101     | –            | 74.6 | 15.6
FGFA [28]         | ICCV-2017   | ResNet-101     | R-FCN        | 76.3 | 1.4
MANet [24]        | ECCV-2018   | ResNet-101     | R-FCN        | 78.1 | 4.9
THP [27]          | CVPR-2018   | ResNet-101&DCN | R-FCN        | 78.6 | 13
SELSA [25]        | ICCV-2019   | ResNet-101     | Faster R-CNN | 80.3 | 3.2
RDN [2]           | ICCV-2019   | ResNet-101     | Faster R-CNN | 81.8 | 8.9
MEGA [1]          | CVPR-2020   | ResNet-101     | Faster R-CNN | 82.9 | 5.5
TROI [5]          | AAAI-2021   | ResNet-101     | Faster R-CNN | 82.1 | –
QueryProp [8]     | AAAI-2022   | ResNet-101     | Sparse R-CNN | 82.3 | 26.8
GMLCN [6]         | TMM-2022    | ResNet-101     | Faster R-CNN | 78.6 | 25.2
TSFA [9]          | PR-2022     | ResNet-101     | Faster R-CNN | 81.9 | 7.2
BoxMask [14]      | WACV-2023   | ResNet-101     | Faster R-CNN | 82.3 | –
TransVOD [12]     | PAMI-2023   | ResNet-101     | DETR         | 81.9 | 2.9
MEGA* [1]         | CVPR-2020   | ResNet-101     | Faster R-CNN | 81.8 | 4.5
Ours              | –           | ResNet-101     | Faster R-CNN | 83.5 | 6.8

Qualitative Results. We present the visualization results of MEGA [1] and our DMFA in Fig. 3. Our DMFA outperforms MEGA in handling complex scenarios involving motion blur and partial occlusion. Specifically, while MEGA fails to detect the lizard under occlusion and assigns lower confidence scores to the rapidly moving cat, DMFA provides accurate classification and localization results. This is attributed to the comprehensive mining and effective utilization of spatio-temporal information by our DMFA.

Fig. 3. Visual comparisons of two cases between MEGA [1] and our DMFA. Each bounding box labels a detected object with the class name and corresponding confidence score. Overall, our DMFA exhibits superior robustness against (a) motion blur and (b) partial occlusion.

Table 2. Ablation study of each component in DMFA. "Pixel" denotes the pixel-level dual memory, "Proposal" the proposal-level dual memory, "Feature-wise update" the feature-wise updating strategy, and "Contrast" the contrastive embedding loss. We use Faster R-CNN as the baseline and default to the frame-wise updating approach.

Methods | Pixel | Proposal | Contrast | Feature-wise update | mAP (%) | FPS
--------|-------|----------|----------|---------------------|---------|-----
a       |       |          |          |                     | 74.6    | 15.6
b       |       | ✓        |          |                     | 81.5    | 9.1
c       |       | ✓        |          | ✓                   | 82.5    | 9.4
d       | ✓     |          |          | ✓                   | 80.6    | 9.5
e       | ✓     | ✓        |          | ✓                   | 83.1    | 6.8
f       | ✓     | ✓        | ✓        | ✓                   | 83.5    | 6.8

4.4 Ablation Study

Effectiveness of Each Component. We conduct experiments to analyze the contribution of each component in our DMFA, and the results are presented in Table 2. Method (a) is the baseline detector Faster R-CNN with ResNet-101, which achieves 74.6% overall mAP. For proposal-level enhancement, Method (b) incorporates the frame-wise updating dual memory, while Method (c) introduces the feature-wise updating dual memory. Their performances improve to 81.5% mAP and 82.5% mAP, respectively, a 1% mAP gap, because the feature-wise updating dual memory can model a wider range of temporal information from all previous frames, leading to more robust proposal-level features. Method (d) introduces the feature-wise updating dual memory at the pixel level into (a), a significant improvement of 6% mAP. Method (e) utilizes both the pixel-level and proposal-level dual memory, achieving 83.1% mAP, an improvement of 8.5% mAP over Method (a). This demonstrates the superiority of leveraging sufficient temporal information for feature aggregation at different granularities. Method (f) integrates the contrastive embedding loss into (e), which brings a further 0.4% mAP improvement, because contrastive learning further enhances the discriminability of the features. These results demonstrate the effectiveness of each component in DMFA.

Number of Iterations in the Pixel-Level Enhancement. In Table 3a, we adjust the number of iterations in the pixel-level enhancement N_pix from 0 to 3. Our method degenerates to the baseline when N_pix = 0. With N_pix = 1, the accuracy improves to 80.6% mAP. As N_pix increases further, the accuracy saturates while the FPS drops significantly. Therefore, we set N_pix to 1 by default.

Number of Iterations in the Proposal-Level Enhancement. Table 3b shows the effect of using different numbers of iterations in the proposal-level enhancement. As N_pro increases from 0 to 3, our method improves the accuracy from 74.6% to 82.6% mAP. To balance the speed-accuracy trade-off, N_pro = 2 is the default setting.

Table 3. Influence of the number of iterations in the pixel-level enhancement N_pix, the number of iterations in the proposal-level enhancement N_pro, the local feature cache size N_l, and the global feature subset size N_s.

accuracy is improved to 80.6% mAP. As the Npix increases, the accuracy reaches a saturation point, and the FPS drops significantly. Therefore, we set Npix to 1 by default. Number of Iterations in the Proposal-Level Enhancement. Table 3b shows the effect of using the different number of iterations in proposal-level enhancement. As Npro increases from 0 to 3, our method improves accuracy from 74.6% to 82.6% in mAP. To balance the speed-accuracy trade-off, Npro = 2 is the default setting. Table 3. Influence of the number of iterations in the pixel-level enhancement Npix , number of iterations in the proposal-level enhancement Npro , local feature cache size Nl , and global feature subset size Ns . (a) Number of iterations in the pixel-level enhancement Npix Npix

0

1

2

3

mAP (%) FPS

74.6 15.6

80.6 9.5

80.5 6.8

80.5 5.3

(b) Number of iterations in the proposal-level enhancement Npro Npro

0

1

2

3

mAP (%) FPS

74.6 15.6

80.7 11.8

82.5 9.4

82.6 7.8

(c) Size of local feature cache Nl Nl

0

500

1000

1500

2000

mAP (%) FPS

74.6 15.6

80.8 9.6

81.5 9.5

81.2 9.5

80.9 9.4

(d) Size of global feature subset Ns Ns

0

500

1000

2000

5000

mAP (%) FPS

81.5 9.5

82.3 9.5

82.5 9.4

82.5 9.2

82.6 8.7

Size of the Local Feature Cache. Tables 3c and 3d show the results of experiments using the proposal-level enhancement with the default setting. First, we explore how the size of the local feature cache N_l affects performance. As shown in Table 3c, we vary N_l from 0 to 2000. With N_l = 1000, the accuracy increases to 81.5% mAP. However, when N_l exceeds 1000, the performance shows a downward trend, since modeling position information between temporally distant proposals may harm performance. Therefore, we use N_l = 1000 by default.


Size of the Global Feature Subset. Regarding the global feature memory, we set its capacity N_g to 20,000 based on the size of our GPU memory. As discussed in Sect. 3.2, we randomly sample global features to form a feature subset for lightweight aggregation. In Table 3d, we incorporate the default local feature cache mentioned above to test the size of the global feature subset N_s. As the number of sampled features increases, the detection accuracy improves. For efficient aggregation, we choose N_s = 1000 as the default configuration.

5 Conclusion

In this paper, we propose a dual-memory feature aggregation framework, termed DMFA. Specifically, DMFA employs a feature-wise updating strategy to capture sufficient temporal information at various granularities, and the dual memory then allows these temporal cues to be utilized accurately to enhance features. Moreover, the contrastive embedding loss enhances the discriminability of features without introducing additional computational complexity. Experiments show the state-of-the-art performance of our DMFA on the ImageNet VID dataset.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Grant 61976231 and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2019A1515011869 and Grant 2023A1515012853.

References
1. Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343 (2020)
2. Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: IEEE International Conference on Computer Vision, pp. 7022–7031 (2019)
3. Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)
4. Girshick, R.B.: Fast R-CNN. In: IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
5. Gong, T., et al.: Temporal ROI align for video object recognition. In: AAAI Conference on Artificial Intelligence, pp. 1442–1450 (2021)
6. Han, L., Yin, Z.: Global memory and local continuity for video object detection. IEEE Trans. Multimed. (2022). https://doi.org/10.1109/TMM.2022.3164253
7. Han, W., et al.: Seq-NMS for video object detection. arXiv:1602.08465 (2016)
8. He, F., Gao, N., Jia, J., Zhao, X., Huang, K.: QueryProp: object query propagation for high-performance video object detection. In: AAAI Conference on Artificial Intelligence, pp. 834–842 (2022)
9. He, F., Li, Q., Zhao, X., Huang, K.: Temporal-adaptive sparse feature aggregation for video object detection. Pattern Recogn. 127, 108587 (2022)
10. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
12. He, L., et al.: TransVOD: end-to-end video object detection with spatial-temporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 45, 7853–7869 (2023)
13. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597 (2018)
14. Hashmi, K.A., Pagani, A., Stricker, D., Afzal, M.Z.: BoxMask: revisiting bounding box supervision for video object detection. In: IEEE Winter Conference on Applications of Computer Vision, pp. 2029–2039 (2023)
15. Kang, K., et al.: T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28, 2896–2907 (2018)
16. Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 318–327 (2020)
17. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
18. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
19. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2014)
20. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
21. Shvets, M., Liu, W., Berg, A.C.: Leveraging long-range temporal relationships between proposals for video object detection. In: IEEE International Conference on Computer Vision, pp. 9755–9763 (2019)
22. Sun, G., Hua, Y., Hu, G., Robertson, N.M.: MAMBA: multi-level aggregation via memory bank for video object detection. In: AAAI Conference on Artificial Intelligence, pp. 2620–2627 (2021)
23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
24. Wang, S., Zhou, Y., Yan, J., Deng, Z.: Fully motion-aware network for video object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 557–573. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_33
25. Wu, H., Chen, Y., Wang, N., Zhang, Z.: Sequence level semantics aggregation for video object detection. In: IEEE International Conference on Computer Vision, pp. 9216–9224 (2019)
26. Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11212, pp. 494–510. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01237-3_30
27. Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218 (2018)
28. Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: IEEE International Conference on Computer Vision, pp. 408–417 (2017)
29. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4141–4150 (2017)

Going Beyond Closed Sets: A Multimodal Perspective for Video Emotion Analysis

Hao Pu 1, Yuchong Sun 1, Ruihua Song 1 (B), Xu Chen 1, Hao Jiang 2, Yi Liu 2, and Zhao Cao 2

1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
{puhao,ycsun,xu.chen}@ruc.edu.cn, songruihua [email protected]
2 Huawei Technologies, Beijing, China
{jianghao66,liuyi139,caozhao1}@huawei.com

Abstract. Emotion analysis plays a crucial role in understanding video content. Existing studies often approach it as a closed set classification task, which overlooks the important fact that the emotional experiences of humans are complex and difficult to express adequately with a limited number of categories. In this paper, we propose MM-VEMA, a novel MultiModal perspective for Video EMotion Analysis. We formulate the task as a crossmodal matching problem within a joint multimodal space of videos and emotional experiences (e.g. emotional words, phrases, sentences). By finding experiences that closely match each video in this space, we can derive the emotions evoked by the video in a more comprehensive manner. To construct this joint multimodal space, we introduce an efficient yet effective method that manipulates the multimodal space of a pre-trained vision-language model using a small set of emotional prompts. We conduct experiments and analyses to demonstrate the effectiveness of our methods. The results show that videos and emotional experiences are well aligned in the joint multimodal space. Our model also achieves state-of-the-art performance on three public datasets.

Keywords: Multimodal · Video · Emotion

1 Introduction

With the rapid development of video social media platforms, a tremendous number of user-generated videos are shared and viewed on the internet every day, making video content analysis a hot research topic. In particular, video emotion analysis is essential because it has great research value in investigating humans' emotional reactions to what they see, and it thus helps make a robot more natural and humanized. It is also helpful in many real applications, such as video recommendation [22] and video management on social platforms [37].


Fig. 1. Illustrating two pipelines of VEA. (a) depicts the traditional 1-N classification scheme, which lacks the understanding of emotion labels as language and is difficult to expand categories. In contrast, we introduce a Multimodal Matching Perspective in (b) that is more in line with the human emotional reaction process and can leverage the power of language models.

There is a long line of research on video emotion analysis (VEA) that regards it as a classification task. Early studies [9,28] categorize audience emotion into a few categories, using emotion models such as Plutchik's [18] and Ekman's [7] to annotate emotional experiences on social media. The number of categories was recently expanded to 27 [5]. However, these works are still limited to describing complex human feelings with a closed set of category labels, making it difficult to express richer emotional experiences with open vocabularies. The paradigm of previous works maps visual features to emotion labels [19,29,36], but it treats labels as IDs and thus ignores their meaning. As a result, it is hard to bridge the huge affective gap [8] between video content and emotion. In fact, as shown in Fig. 1, when viewers watch a video of a football game, they first form experiences such as concepts (e.g., celebrating) or comments (e.g., "we are the champions!") that come to mind, which are then matched to one or several emotion labels (satisfaction, excitement, etc.). Language serves as the carrier of this process. A few recent works have started to use language as an additional signal for training emotion classifiers [6,17], but there is still large room to improve the alignment between videos and language from the perspective of emotion.


project them into a joint space and align them. The emotion of each video can then be matched to the nearby experiences. These emotional experiences include a large number of emotional words, phrases, and even sentences, and can thus adequately convey the emotions of audiences. Moreover, the emotional experiences surrounding each video reflect the process by which humans produce emotional responses. Constructing such a joint space is not trivial because of the affective gap [8] between videos and emotional experiences. We propose to start from a multimodal space (e.g., CLIP [20]) in which videos and texts are already well aligned, and then manipulate this space with a number of emotional prompts to obtain an emotion-oriented space. We conduct extensive experiments to verify the effectiveness of our methods. As our analysis shows, the multimodal representation space after training has a strong ability to align videos and emotional experiences. Moreover, our model achieves state-of-the-art performance on three VEA datasets. We summarize our contributions as follows: (1) We propose a novel multimodal perspective for video emotion analysis: instead of classifying videos into closed sets of categories, we formulate the task as a crossmodal matching problem. (2) We propose an emotional prompt-guided method for manipulating a multimodal space, which can effectively construct a well-aligned joint multimodal space of videos and emotional experiences. (3) We achieve state-of-the-art performance on three public VEA datasets.

2 Related Work

Emotion Modeling for Video Emotion Analysis. Two emotion models are widely used by affective computing researchers in the video emotion analysis area: dimensional emotion space (DES) and categorical emotion states (CES). Baveye et al. [2] propose LIRIS-ACCEDE, a large-scale movie affective analysis dataset annotated with VA emotional states, enabling researchers to study the impact of videos on viewers with deep learning methods for the first time. VideoEmotion-8 [9] collects videos from social media and manually annotates them with Plutchik's [18] eight basic categories. YouTube/Flickr-EkmanSix [28] leverages Ekman's [7] emotion model for its annotations. Both datasets provide a limited yet valuable resource for recognizing emotions in video content. Mazeika et al. [16] introduce a dataset that encompasses a significantly wider range and diversity of videos, with annotations on the intensity levels of 27 self-reported emotional categories [5]. Despite the finer categorization of emotions, these categories remain a closed set. We argue that complex emotional responses should not be constrained by fixed categories but can instead be conveyed through language, like any subjective feeling. Few works have taken this perspective, so we explore open-vocabulary video emotion analysis in this paper. Methods for Video Emotion Analysis. Similar to the semantic gap in computer vision [37], one main challenge for video emotion analysis is the affective gap between low-level pixels and high-level abstract emotion concepts. To bridge this gap, many efforts have been made to find discriminative features that can effectively capture emotion information [1,4,15,30].


Fig. 2. Overview of our proposed MM-VEMA framework. Videos and emotional texts are projected into a joint CLIP-initialized representation space. To efficiently transfer the space to a more emotional one, emotion prompts are designed to make the emotion labels more descriptive. A matching strategy is proposed to smoothly manipulate the space.

SentiBank [4] is a large visual sentiment ontology that contains 1,200 concepts and provides rich emotional semantic representations. Many works [11,12,31,32,34,35] study how to use deep learning methods to obtain better emotional visual representations. Deng et al. [6] introduce a language-supervised method that combines language and visual emotion features, driving the visual model to gain stronger emotional discernment through language prompts. Pan et al. [17] utilize time-synchronized comments (TSCs) as auxiliary supervision owing to their accessibility and rich emotional cues. Both works show that language plays an essential role in visual emotion perception. However, while previous efforts attempt to use language to bridge the affective gap, they are still restricted to closed-set video classification. In contrast, we propose a novel multimodal approach that incorporates language models to understand emotional concepts and treats closed-set emotion classification as an open-vocabulary video-emotional experience matching task.

3 Method

The goal of video emotion analysis is to figure out the emotion evoked by videos. Previous methods typically regard the task as simply assigning one of N labels to a given video, which ignores the semantic information of emotion labels and cannot be effectively extended to retrieving emotional experiences. To break this limitation, we propose a multimodal matching framework, MM-VEMA, which


views video emotion recognition as a special case of a general matching task between videos and emotional experiences. Section 3.1 outlines the overall architecture of our proposed framework, MM-VEMA. In Sect. 3.2, we present our implementation of utilizing efficient emotion prompts to transfer the CLIP representation space to a new emotion-oriented multimodal space. Section 3.3 introduces the training objective of our method.

3.1 Multimodal Matching Framework

As Fig. 2 shows, our multimodal matching framework consists of three parts: (1) Video Encoding, which includes an image encoder, denoted as $E_i$, and a temporal modeling module, denoted as $M_v$; (2) Text Encoding, which consists of an emotional prompt generator, denoted as $P_t$, and a text encoder, denoted as $E_t$; (3) Matching Strategy, which utilizes cosine similarity to measure the distance between the two embeddings, denoted as the function $S(\cdot, \cdot)$. Given a frame set $F = \{f_1, f_2, \cdots, f_T\}$ extracted from video $v$ and its candidate emotion labels $L = \{l_1, l_2, \cdots, l_C\}$, where $T$ is the number of sampled frames and $C$ is the number of categories, we obtain the video embedding $e_v$ and the text embedding $e_t$ as follows:

$$e_v = M_v(\{E_i(f_1), E_i(f_2), \cdots, E_i(f_T)\}), \quad e_t = E_t(P_t(l)). \tag{1}$$

Through this encoding process, we map the information from both modalities onto a shared representation space. This unified perspective allows the model to perceive and understand visual and textual information jointly. We then compute the cosine similarity between the video embedding and the text embedding:

$$s_v = S(e_v, e_t) = \frac{e_v \cdot e_t^{\top}}{\lVert e_v \rVert\, \lVert e_t \rVert} \tag{2}$$

Thus we can rank texts by their similarities; the video's predicted labels are determined by selecting the most similar labels or texts. To initialize the entire training framework, we start from CLIP's [20] parameters, as CLIP provides an initial well-aligned multimodal representation space. Text Encoding. Before a label l is encoded as an embedding, it first passes through the prompt generator to produce an emotional experience, such as "this picture evokes feeling like l". We then use CLIP's text encoder, a transformer-based [25] module, to encode the sentence, and take the [EOS] embedding generated by Et as the embedding of the entire emotional experience. Video Encoding. Video encoding is split into two stages. In the first stage, the video is sampled into T frames f1, f2, ..., fT, and each frame is encoded by the same CLIP image encoder Ei to obtain T frame embeddings. For temporal modeling, we employ mean pooling, a widely


used, simple, yet effective method, to aggregate the frame embeddings and generate the final video representation. Matching Strategy. The similarity score between ev and et serves as a confidence measure: a higher score suggests a closer semantic relationship between them. During training, the objective is to maximize s(v, l) when v and l are matched and to minimize it in all other instances. During inference, the label with the highest similarity score to the video is regarded as the predicted category. In addition, any word or sentence, regardless of category, can be used as a query to determine its emotional similarity to the video; this score indicates how closely the emotion involved in the query aligns with that evoked by the video.
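The following is a minimal PyTorch sketch of this matching pipeline. The encoder objects, the prompt template, and the tensor shapes are stand-ins for the CLIP-initialized components described above, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def video_embedding(frames, image_encoder):
    # frames: (T, 3, H, W) sampled video frames; the encoder maps each frame to a D-dim feature
    frame_feats = image_encoder(frames)            # (T, D) per-frame embeddings
    return frame_feats.mean(dim=0)                 # mean pooling over time -> (D,)

def match_emotions(frames, labels, image_encoder, text_encoder,
                   prompt="This picture evokes feeling like {}"):
    e_v = video_embedding(frames, image_encoder)               # video embedding e_v
    e_t = text_encoder([prompt.format(l) for l in labels])     # (C, D) prompted label embeddings e_t
    sims = F.cosine_similarity(e_v.unsqueeze(0), e_t, dim=-1)  # Eq. (2) for every candidate
    return sims, labels[int(sims.argmax())]                    # scores and the predicted label
```

At inference time the same scoring can be applied to arbitrary open-vocabulary queries (words, phrases, or comments) instead of the closed label set.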

3.2 Emotion Prompt

While CLIP exhibits a well-aligned visual-language space that can capture emotional semantics from texts to some extent, its text representation space is biased towards comprehending descriptive sentences rather than individual abstract emotion labels. To accommodate this characteristic, we design a number of prompt templates for emotion labels, converting them into more descriptive emotional experiences. This facilitates a smooth transition from CLIP's original representation space, which favors descriptive comprehension, to a representation space that prioritizes emotional understanding through training. Following [27], we design three kinds of emotion prompts: prefix prompts, cloze prompts, and suffix prompts. A prefix prompt is a fixed prompt added before the label, such as "This picture evokes feeling like {}". A cloze prompt inserts the label in the middle of the prompt, such as "Feeling {} after viewing this image". Finally, a suffix prompt is added after the label, for instance, "{} is the emotion the photo evoked". Note that we use "image" or "photo" rather than "video" in the experience description, since CLIP pre-training is based on image-text pairs, which makes such phrasing more effective for representing the visual modality. See Supplementary for the complete set of emotion prompts.
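As an illustration only, a tiny prompt generator over the three template types quoted above might look as follows; the full template set used in the paper is given in the supplementary material.

```python
# Example templates taken from the three prompt types described above (prefix, cloze, suffix).
PREFIX = "This picture evokes feeling like {}"
CLOZE = "Feeling {} after viewing this image"
SUFFIX = "{} is the emotion the photo evoked"

def emotional_experiences(label):
    """Turn an abstract emotion label into descriptive emotional experiences."""
    return [t.format(label) for t in (PREFIX, CLOZE, SUFFIX)]

# emotional_experiences("satisfaction")
# -> ['This picture evokes feeling like satisfaction', ...]
```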

3.3 Model Training

During training, the video encoder and text encoder are optimized jointly to construct an emotional video-text representation space by increasing the similarity between the video representation ev and its paired emotional experience representation et , while decreasing the similarity between unpaired ones. Training Loss. For the task of predicting a single emotion label for each video, each video sample v corresponds to a ground truth label c ∈ C. We adopt the common implementation [10,27] of infoNCE loss to maximize the similarity between matched video-category pairs while minimizing the similarity for non-matching pairs.

$$\mathcal{L}_{single} = -\frac{1}{B}\sum_{i}^{B}\sum_{c \in C(i)} \log \frac{\exp\!\left(s(e_{v_i}, e_{t_c})/\tau\right)}{\sum_{j}^{B} \exp\!\left(s(e_{v_i}, e_{t_j})/\tau\right)} \tag{3}$$

Note that τ is the temperature hyper-parameter for scaling and B is the training batch size. For the task of multi-label classification over C emotion labels per video, such as classification on the VCE dataset, we optimize the Kullback-Leibler (KL) divergence to better fit the target emotion intensity distribution. Unlike contrastive learning within a batch, we minimize the distance between two C-dimensional intensity distribution vectors: one is the similarity vector of the video sample v to all candidate labels, the other is the ground-truth intensity distribution gt ∈ R^C.

$$\mathcal{L}_{multi} = \mathrm{KL}\!\left(s(e_v, e_t),\, gt\right) \tag{4}$$

Moreover, to address the issue of uneven label distribution in emotion classification, we utilize Focal Loss [13] to mitigate the learning bias.
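For the single-label case, Eq. (3) reduces to a standard batch-wise contrastive (infoNCE-style) objective. The sketch below, written under the simplifying assumption of exactly one matched label per video, shows one way it could be implemented; it is not the authors' code.

```python
import torch
import torch.nn.functional as F

def single_label_loss(video_emb, text_emb, targets, tau=0.01):
    # video_emb: (B, D) video embeddings; text_emb: (B, D) prompted label embeddings;
    # targets: (B,) index of the matching text for each video in the batch.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau                   # cosine similarities scaled by temperature
    return F.cross_entropy(logits, targets)  # -log softmax over the batch, i.e. Eq. (3) with |C(i)| = 1
```

For the multi-label VCE setting, the same similarity vector would instead be compared to the ground-truth intensity distribution with a KL-divergence term as in Eq. (4).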

4 Experiments
4.1 Experiment Setup

We use the following three public datasets in our experiments. The Video Cognitive Empathy (VCE) dataset contains 61,046 videos with annotations of the emotional responses of human viewers. It is divided into a training set of 50,000 videos and a test set of 11,046 videos, with each video lasting 14.1 s on average. Each video is annotated with an intensity distribution over 27 descriptive emotion states [5]. The VideoEmotion-8 (VE-8) dataset [9] is composed of 1,101 videos from YouTube and Flickr with an average duration of 107 s. Each video is labeled with one of the eight emotions in Plutchik's model, with a minimum of 100 videos per category. We perform ten-fold cross validation as previous works did [9,17,36] and report the average results across the ten runs. The YouTube/Flickr-EkmanSix (YF-6) dataset [28] consists of 1,637 videos collected from YouTube and Flickr, with an average duration of 112 s. The videos are labeled with one of Ekman's six basic emotion categories, with a minimum of 221 videos per category. There are 819 videos for training and 818 for testing [28]. For VCE, we adopt the top-3 accuracy metric used by [16], which measures the proportion of test samples where the predicted maximum emotion falls within the top three emotions of the ground-truth distribution. For VE-8 and YF-6, the model is trained with a learning rate of 1 × 10−5 and a batch size of 32 for 50 epochs, and the standard top-1 accuracy is used as the metric. For vision encoding, we use CLIP's ViT-B/32 model, a 12-layer vision transformer with an input patch size of 32, and extract the [CLS] token from its last layer's output. The text encoder is based on the 12-layer, 512-wide Transformer used in CLIP, with 8 attention heads and the highest layer's


Table 1. Comparison with SOTA methods on VCE (top-3 accuracy, %).
Method                  Performance
Majority Emotion [16]   35.7
R(2+1)D [24]            65.6
STAM [21]               66.4
TimeSformer [3]         66.6
VideoMAE [23]           68.9
MM-VEMA                 73.3

Table 2. Comparison with SOTA methods on VE-8 and YF-6 (top-1 accuracy, %).
Method        VE-8    YF-6
CSS [29]      51.48   55.62
VAANet [36]   54.50   55.30
Dual [19]     53.34   57.37
FAEIL [33]    57.63   60.44
TAM [17]      57.53   61.00
MM-VEMA       59.38   62.75

activation at the [EOS] token serving as the feature representation. The input frames have a spatial resolution of 224 × 224. We use the same frame sampling and data augmentation strategy as [26] to sample 16 frames. For VCE, the model is trained with a learning rate of 1 × 10−5 and a batch size of 16 for 10 epochs. We train our models using the AdamW optimizer with a weight decay of 0.2. The learning rate is warmed up for 10% of the total training epochs and then decayed to zero following a cosine schedule for the remaining training. We set the temperature hyperparameter τ to 0.01.
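A hedged sketch of this optimization setup (AdamW, linear warmup for 10% of the schedule, then cosine decay to zero) is shown below; the model object is a placeholder, and the exact scheduler implementation in the original work may differ.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, lr=1e-5, weight_decay=0.2, warmup_frac=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```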

4.2 Results on Video Emotion Classification

We evaluate our approach against state-of-the-art methods (SOTAs) on three video emotion classification datasets: VCE, VE-8, and YF-6. As shown in Table 1 and Table 2, our approach outperforms other video emotion analysis methods while requiring only the visual modality of the video. It achieves a significant improvement on the VCE dataset compared to the best previous method (+4.4%), owing to the rich language annotations: VCE annotates each video with intensities for 27 emotion labels, providing richer language information for our method to exploit than VE-8's single annotation from 8 emotion labels and YF-6's single annotation from 6 emotion labels. Our method also achieves improvements on the VE-8 and YF-6 datasets. Unlike previous methods that utilize several modalities of the videos (e.g., audio, optical flow) and introduce auxiliary training data, our method only utilizes the visual information of the videos. Moreover, we use the same visual encoder (ViT) and pre-training parameters (CLIP) as TAM [17], which demonstrates the benefit of introducing a text encoder. We also conduct an ablation study on the impact of our multimodal framework and emotional prompt on the three datasets. The method "CLIP-Uni" in Table 3 uses only the visual branch of our framework; following [6], we feed the output of the video encoder into a multi-layer perceptron (MLP). The method "CLIP-Mul" uses the multimodal framework without any prompt. The results show that incorporating the understanding of the text modality significantly enhances the model's classification performance, and our emotion prompt further improves CLIP's emotion perception capability.


Fig. 3. Visualization of the space of MM-VEMA in comparison to CLIP. Emotional words in our representation space exhibit a more reasonable distribution, with surrounding words that are more relevant and concrete.

Table 3. Ablation study of network structure on the three datasets (accuracy, %).
Method      Prompt   VCE     VE-8    YF-6
CLIP-Uni    ✗        71.80   51.94   57.63
CLIP-Mul    ✗        72.77   58.52   62.25
MM-VEMA     ✓        73.29   59.38   62.75

4.3 Visualization of the Representation Space

To determine whether MM-VEMA provides a video-language representation space with a more comprehensive understanding of emotions, we visualize and compare the representation space of our model trained on VCE with that of CLIP using t-SNE [14]. We project the 27 emotion words in VCE and 5000 high-frequency words onto a two-dimensional plane and show the space of MM-VEMA in Fig. 3. We then zoom in on a part of both spaces, i.e., MM-VEMA Space and CLIP Space, and display the details for comparison. We make two observations from Fig. 3: (1) In our MM-VEMA Space, words with the same emotion are gathered together, and the word clusters transition smoothly from one emotion to another related emotion. For instance, "aesthetic appreciation" is closely related to "admiration" and "awe", as beautiful things often deserve people's appreciation. The emotional progression "excitement" → "relief" → "satisfaction" is highly intuitive for humans. It is noteworthy that "amusement" and "boredom" are closely related, as many user-generated videos aim to entertain viewers; when these videos fail to meet viewers' expectations of


Fig. 4. Two cases of Video-Emotional Experience Matching. MM-VEMA not only has superior classification accuracy but also allows for more diverse and nuanced experiences of emotional reactions through open-vocabulary retrieval.

amusement, they are often labeled as "boredom". See Supplementary for the CLIP Space visualization. (2) Our MM-VEMA Space provides a more concrete and comprehensive understanding of emotional words than CLIP Space. Take the category "craving", which means a strong desire for something, as an example: the nearby words in CLIP Space, such as "ideal" and "necessity", are either synonyms or abstract words that are less related, whereas the nearby words in MM-VEMA Space are related to food, such as "cooking" and "taste", which often trigger a strong appetite. As another example, the words near "satisfaction" shift from synonyms like "happy" in CLIP Space to more concrete words like "birthday" and "marriage" that can bring satisfaction in MM-VEMA Space. These observations indicate that after training, the model has a more specific understanding of emotions, effectively bridging the "affective gap".

4.4 Video-Emotional Experience Matching

In addition to achieving superior performance in closed-set emotion classification, our model leverages matching to obtain more diverse and nuanced emotional experiences. We use the 5000 high-frequency words from the MM-VEMA space visualization as an open-vocabulary retrieval library, and 5000 randomly selected video comments from a short video platform as a comment retrieval library. As shown in Fig. 4, our model successfully matches the emotion label "Empathetic pain" for a video depicting a child falling off a skateboard and captures related emotional concepts such as "injury" and "punishment". In contrast to the descriptive comments retrieved by CLIP matching, our approach


yields more direct and expressive comments that better reflect the audience's emotional reaction to the video. See Supplementary for more qualitative results.

5 Conclusion

In this paper, we introduce a novel multimodal perspective for video emotion analysis, which formulates this task as a crossmodal matching problem in a joint multimodal space of videos and emotional descriptions. To construct this space, we start from a common space of vision and language (e.g., CLIP), and propose an efficient and effective method of using emotion prompts to manipulate CLIP space to make it more emotion-oriented. We analyze our emotional multimodal space and find it can align videos with rich emotional experiences. We also achieve state-of-the-art performance on three video emotion analysis datasets. In the future, we will further explore using more forms of emotional experiences (e.g., comments) to make our model better understand video emotion. Acknowledgments. This work was supported by the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (21XNLG28), National Natural Science Foundation of China (No. 62276268) and Huawei Technology. We acknowledge the anonymous reviewers for their helpful comments.

References 1. Ali, A.R., et al.: High-level concepts for affective understanding of images. In: WACV, pp. 679–687. IEEE (2017) 2. Baveye, Y., et al.: LIRIS-ACCEDE: a video database for affective content analysis. TAC 6(1), 43–55 (2015) 3. Bertasius, G., et al.: Is space-time attention all you need for video understanding? In: ICML, vol. 2, p. 4 (2021) 4. Borth, D., et al.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: ACM MM, pp. 223–232 (2013) 5. Cowen, A.S., et al.: Self-report captures 27 distinct categories of emotion bridged by continuous gradients. PNAS 114(38), E7900–E7909 (2017) 6. Deng, S., et al.: Simple but powerful, a language-supervised method for image emotion classification. TAC (2022) 7. Ekman, P.: An argument for basic emotions. Cogn. Emot. 6(3–4), 169–200 (1992) 8. Hanjalic, A.: Extracting moods from pictures and sounds: towards truly personalized tv. SPM 23(2), 90–100 (2006) 9. Jiang, Y.G., et al.: Predicting emotions in user-generated videos. In: AAAI, vol. 28 (2014) 10. Ju, C., et al.: Prompting visual-language models for efficient video understanding. In: Avidan, S., Brostow, G., Ciss´e, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13695, pp. 105–124. Springer, Cham (2022). https://doi.org/10. 1007/978-3-031-19833-5 7 11. Lee, J., et al.: Context-aware emotion recognition networks. In: ICCV, pp. 10143– 10152 (2019)


12. Li, Y., et al.: Decoupled multimodal distilling for emotion recognition. In: CVPR, pp. 6631–6640 (2023) 13. Lin, T.Y., et al.: Focal loss for dense object detection. In: ICCV, pp. 2980–2988 (2017) 14. Van der Maaten, L., et al.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008) 15. Machajdik, J., et al.: Affective image classification using features inspired by psychology and art theory. In: ACM MM, pp. 83–92 (2010) 16. Mazeika, M., et al.: How would the viewer feel? Estimating wellbeing from video scenarios. arXiv preprint arXiv:2210.10039 (2022) 17. Pan, J., et al.: Representation learning through multimodal attention and timesync comments for affective video content analysis. In: ACM MM, pp. 42–50 (2022) 18. Plutchik, R.: Emotions: a general psychoevolutionary theory. Approaches Emot. 1984(197–219), 2–4 (1984) 19. Qiu, H., et al.: Dual focus attention network for video emotion recognition. In: ICME, pp. 1–6. IEEE (2020) 20. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763. PMLR (2021) 21. Sharir, G., et al.: An image is worth 16×16 words, what is a video worth? arXiv preprint arXiv:2103.13915 (2021) 22. Stray, J., et al.: What are you optimizing for? Aligning recommender systems with human values. arXiv preprint arXiv:2107.10939 (2021) 23. Tong, Z., et al.: VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602 (2022) 24. Tran, D., et al.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR, pp. 6450–6459 (2018) 25. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017) 26. Wang, L., et al.: Temporal segment networks for action recognition in videos. TPAMI 41(11), 2740–2755 (2018) 27. Wang, M., et al.: ActionCLIP: a new paradigm for video action recognition. arXiv preprint arXiv:2109.08472 (2021) 28. Xu, B., et al.: Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization. TAC 9(2), 255–270 (2016) 29. Xu, B., et al.: Video emotion recognition with concept selection. In: ICME, pp. 406–411. IEEE (2019) 30. Yanulevskaya, V., et al.: Emotional valence categorization using holistic image features. In: ICIP, pp. 101–104. IEEE (2008) 31. Yu, W., et al.: CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In: ACL, pp. 3718–3727 (2020) 32. Yu, W., et al.: Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: AAAI, vol. 35, pp. 10790–10797 (2021) 33. Zhang, H., et al.: Recognition of emotions in user-generated videos through framelevel adaptation and emotion intensity learning. TMM (2021) 34. Zhang, Z., et al.: Temporal sentiment localization: listen and look in untrimmed videos. In: ACM MM, pp. 199–208 (2022) 35. Zhang, Z., et al.: Weakly supervised video emotion detection and prediction via cross-modal temporal erasing network. In: CVPR, pp. 18888–18897 (2023) 36. Zhao, S., et al.: An end-to-end visual-audio attention network for emotion recognition in user-generated videos. In: AAAI, vol. 34, pp. 303–311 (2020) 37. Zhao, S., et al.: Affective image content analysis: two decades review and new perspectives. TPAMI 44(10), 6729–6751 (2021)

Temporal-Semantic Context Fusion for Robust Weakly Supervised Video Anomaly Detection
Yuan Zeng, Yuanyuan Wu(B), Jing Liang, and Wu Zeng
College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, Sichuan, China
{2021020855,2021020874,zengwu}@stu.cdut.edu.cn, [email protected]

Abstract. Video anomaly detection (VAD) has emerged as a vital and challenging task in computer vision, driven by the rapid growth of surveillance video. Weakly supervised learning is a critical branch of this field, with Multiple Instance Learning (MIL) standing out as the prevalent approach. In the weakly supervised VAD task, video context information provides crucial cues. However, the majority of studies in this field have focused primarily on the temporal context, neglecting the semantic context of videos, which provides a deeper understanding of the observed content. Therefore, this paper proposes a MIL-based framework for efficiently capturing video context information in the weakly supervised VAD task. A GCN-based temporal-semantic context fusion module is employed to comprehensively capture both temporal and semantic context. Additionally, to enhance feature discrimination and learning robustness, our approach integrates highly effective anomaly classification learning and feature magnitude learning loss functions. In extensive experiments on two challenging datasets (ShanghaiTech and UCF-Crime), our approach outperforms several recent methods, demonstrating its effectiveness.

Keywords: Video Anomaly Detection · Context Information · MIL-based Framework

1 Introduction


Introduction

Video anomaly detection (VAD) is a is a challenging task in computer vision, receiving increasing attention from researchers. Previous studies [10,18] have defined anomalies as patterns that deviate from the normal patterns in videos. This led to a one-class classifier (OCC) method that trains only on normal samples has been proposed. However, collecting comprehensive real-world normal samples poses significant challenges. To overcome this, researchers in [13] propose treating VAD as a weakly supervised task, leveraging training samples annotated with video-level labels. This approach achieves higher accuracy with less human annotation effort compared to the OCC method. Multi-Instance c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024  Q. Liu et al. (Eds.): PRCV 2023, LNCS 14430, pp. 245–256, 2024. https://doi.org/10.1007/978-981-99-8537-1_20

246

Y. Zeng et al.

Learning (MIL) is widely used in computer vision as it has shown ability in handling weakly supervised learning problems. It forms the basis for addressing the weakly supervised VAD task introduced by [13]. It treats each video as a bag, with video snippets as instances. Bags holding all normal instances as negative bags, and bags containing abnormal instances as positive bags. This paper aims to describe anomaly detection as a weakly supervised learning problem trained with MIL approaches. The weakly supervised VAD task poses two main challenges. Firstly, the subjective nature of the anomaly demands expertise to determine video abnormality. Secondly, abnormal videos contain both normal and abnormal instances, making it difficult to distinguish due to their similarity. Regarding the first challenge, context information is crucial in video processing. Incorporating context into videos has been applied to various video processing tasks [11,17,18], yielding significant outcomes. In weakly supervised VAD task, methods [5,14,18] explore aggregating context into extracted features. These methods effectively capture insights from the context information, which contribute to anomaly identification. However, their focus solely on the temporal context, overlooking the emphasis of semantic context. By fusing semantic context, a deeper understanding of the observed content can be achieved, facilitating the identification of anomalies beyond low-level visual features. For instance, certain actions may be considered abnormal in a specific scene, but normal in a different scene. Through the fusion of semantic context, more informed judgments can be made. Drawing inspiration from G-TAD [17], we propose fusing both temporal and semantic context into features for the weakly supervised VAD task, which leads to encouraging results and enhances its effectiveness. For challenge 2, various loss functions have been designed to enhance the distinguishability between normal and abnormal instances in abnormal videos [3,13–15]. As MIL-based weakly supervised anomaly detection can be transformed into a binary-class classification problem, common loss is sigmoid cross-entropy loss, employing video-level labels for instances. However, it may be affected by label noise from normal instances in positive bags. Therefore, the top-K select method is introduced to improve feature separability. Moreover, feature magnitude learning is introduced to enhance the robustness of top-K selection. This work integrates highly effective anomaly classification and feature magnitude learning loss functions to enhance feature discrimination and learning robustness. To summarize, our contributions can be described as follows: (1) We propose a MIL-based framework for weakly supervised VAD task that efficiently captures video context information and integrates highly effective anomaly classification learning and feature magnitude learning loss functions to enhance feature discrimination and learning robustness. (2) A novel GCN-based temporal-semantic context fusion module called GCNeXt-T is designed to utilize both temporal and semantic context more sufficiently.

Temporal-Semantic Context Fusion for Robust Video Anomaly Detection

247

(3) The effectiveness of our approach has been validated through extensive experiments on two challenging datasets (i.e., ShanghaiTech and UCFCrime). A frame-level AUC of 96.41% is achieved on ShanghaiTech, and a frame-level AUC of 82.27% is achieved on UCF-Crime.

2

Related Work

In this paragraph, we introduce some works that most relevant to our research briefly. AR-Net [15] maps feature representations to anomaly scores by a anomaly regression layer. MCR [5] introduces differential contextual information for anomaly scores refinement. TCA-VAD [18] uses multi-scale attention to establish video temporal dependencies. However, these methods are limited by their inadequate utilization of context information. WAGCN [1] and STGCNs [12] employ Graph Convolutional Network(GCN) [7] for complicated context relationship modeling. Some methods employ transformer to capture temporal and semantic context. MSL [9] utilizes a Transformer-based network to encode snippet features, and [6] proposes a spatio-temporal transformer framework for encoding well-structured skeleton features. These methods highlight context’s role in anomaly detection. Our method uses a graph to connect both temporal and semantic context, capturing intricate dependencies between instances effectively. To improve the performance of MIL-based weakly supervised VAD method, AR Net [15] proposes dynamic multiple-instance learning (DMIL) Loss and center Loss to improve feature separability, and dynamically selects top-K instances for DMIL Loss. RTFM [14] tackles unstable top-K selection due to label noise, using feature magnitude learning for accurate positive instance detection, enhancing robustness. MGFN [3] proposes magnitude contrastive (MC) Loss for scene adaptive feature magnitudes learning. We extend these advancements by integrating DMIL, center, and MC Loss, demonstrating effective and robust performance.

3 3.1

Proposed Method Problem Statement

The input of our network consists of N videos, each containing Li snippets. In our approach, we define a snippet as a sequence of 16 consecutive frames from videos. The network is constructed based on features extracted from the I3D network [2]. These features are denoted as X =   xi  xi ∈ RLi ×C , i ∈ (1, 2, . . . , N ) , where C denotes the dimension of each in video i. The video-level label snippet, and Li denotes the number of   snippets of videos can be represented as Y = yi  i ∈ (1, 2, . . . , N ) , where yi ∈ {0, 1}. And anomaly scores vector of videos can be expressed as S =   the predicted si  si ∈ [0, 1]Li ×1 , i ∈ (1,2, . .. , N ) . The feature magnitudes vector of videos can be expressed as M = mi  mi ∈ [0, 1]Li ×1 , i ∈ (1, 2, . . . , N )}.

248

Y. Zeng et al.

3.2

Architecture

The structure of our proposed framework is shown in Fig. 1. The video features (X) are inputted into a GCN-based temporal-semantic context fusion module called GCNeXt-T. Through n iterations, this module produces fused feature representations denoted as H (n) . Then, these feature representations are utilized to generate feature magnitudes (M) and anomaly scores (S). Finally, the topK feature magnitudes are selected from bags to calculate feature magnitude learning loss (MC Loss), and the selected top-K indexes are employed in anomaly classification learning to calculate the DMIL Loss. Additionally, the center Loss is calculated within normal bags.

Shrink

t

Enlarge

Top-K Feature Magnitudes selection (M)

Shrink

MC Loss Top-K indexes

I3D

Enlarge

CE

t

DMIL Loss

Anomaly Scores(S)

n iterations ci

Center Loss

GCNeXt-T Input Videos

Normal Representations

Features (X)

Abnormal Representations

Feature Representations (

Bag

Shrink

Anomaly Scores

( )

)

Feature Magnitudes

Loss

CE

ci

Cross Entropy

Center Value

Fig. 1. Overview of our proposed network architecture. Snippet-level features are extracted using the I3D network. Subsequently, a graph is constructed based on feature similarity. Then the context information is captured using the GCN-based context fusion module GCNeXt-T. Finally, model training is performed by combining anomaly classification learning and feature magnitude learning.

3.3

Temporal-Semantic Context Fusion Module

In weakly supervised VAD task, most methods focus only on temporal context. Inspired by G-TAD [17], which proposed GCNeXt to capture both temporal and semantic context effectively for temporal behavior detection, we design a GCN-based temporal-semantic context fusion module called GCNeXt-T to utilize context information more sufficiently in weakly supervised VAD task (see Fig. 2). Initially, a video is represented as a graph, and graph convolution network is applied separately to aggregate temporal and semantic context information. Then, the processed context information and the input are fused together in a single iteration. After n iterations, the final fused features are obtained.

Temporal-Semantic Context Fusion for Robust Video Anomaly Detection

249

n iterations

256,4

4,4

4,256

32 4,4

4,256

Convolution on Semantic Graph

256,4

4,4

Output

256,4

4,256

32 4,4 4,256 256,4 Convolution on Temporal Graph

Fig. 2. The GCN-based temporal-semantic context fusion module GCNeXt-T. It has two streams, one aggregates temporal context, and the other aggregates semantic context. The output is the fused context information and input after n iterations.

Construction of the Graph. To model contextual dependencies between video snippets, we conceptualize a video as a graph, where each video snippet represents a node. Temporal edges connect temporally adjacent temporal contexts, and semantic edges connect semantic contexts that share similar semantic content (see Fig.3). Therefore, a context dependency graph G = {V, E} is built,  where V = vi  i ∈ (1, 2, . . . , Li ) represents the node set of the graph, with each vi denoting a video snippet, and E = Et U Es represents the set of edges, containing temporal edges Et and semantic edges Es .

Vedio Snippet

Temporal Edge Semantic Edge Abnormal Node Normal Node

Fig. 3. The context dependency graph of a video. A video is viewed as a graph. The nodes represent video snippets, and the edges contain temporal edges and semantic edges. A sliding window is introduced to temporal edges on each node to capture time dependencies more effectively, depicted by a blue shaded area in this figure, with the current node highlighted in green. (Color figure online)

Temporal Edges. Temporal edges are utilized to capture relationships between temporally adjacent nodes. Each node vi is connected to a forward edge Etf and a backward edge Etb in G-TAD [17] proposed GCNeXt. However, it only focuses on the two closest nodes, neglecting the potential information provided

250

Y. Zeng et al.

by several adjacent nodes. Recognizing this limitation, we propose GCNeXtT, which incorporates a sliding window of size 2l +1 for temporal edges on each node. This is intended to enhance the utilization of temporal context information. The equations for forward edges Etf and backward edges Etb are presented as Eq. (1) and Eq. (2) respectively. Etf = {(vi , vi+1 ) . . . (vi , vi+l )|i ∈ (1, 2, ..., Li − l)} .

(1)

Etb = {(vi−l , vi ) . . . (vi−1 , vi )|i ∈ (l + 1, l + 2, ..., Li )} .

(2)

Semantic Edges. The concept of semantic edges originates from dynamic edge convolution (DGCNN) [19]. In determining the anomalies of a video snippet, semantically relevant information provides important clues for weakly supervised VAD task. Therefore, semantic edges (Es ) (see Eq. (3)) are dynamically constructed within the graph to capture semantic context information between video snippets. Here, vni (k) denotes the node index of the k -th nearest neighbor of node vi . The K Nearest Neighbors (KNN) algorithm is employed to establish these semantic edges, selecting them based on the Euclidean distance between node features.    (3) Es = (vi , vni (k)) i ∈ (1, 2, . . . , Li ) , k ∈ (1, 2, . . . , K) . Graph Convolution. To effectively aggregate local neighborhoods and learn global information, DGCNN is employed for implementing the graph convolution network. Equation (4) presents the graph convolution equation, where W (n) is a trainable weight vector and A(n) represents the adjacency matrix without selfconnection. [ ] denotes concatenation operation, and n denotes the number of iterations.

T    (n) (n) T (n) T (n) T xi , A(n) xi − xi . (4) W (n) F xi , A(n) , W (n) = Multi-stream Incorporate. A split-transform-merge strategy is used for temporal and semantic streams. Each stream has 32 paths for transformation diversity, enabling comprehensive capture of context information from multiple perspectives. Each path includes a graph convolution and two 1-by-1 convolutions  represented as F . Information from streams and input is fused to build the updated graph. GCNeXt-T obtains fused n iterations (see information after Eq. (5)). The adjacency matrices A = Aft

ward edges

Etf ,

able weights W

(n)

backward edges Etb , semantic (n) (n) = Wtf , Wtb , Ws (n) are

, Abt

(n)

(n)

, As

correspond the for-

edges Es respectively. The trainassociated with these edges.

     (n) (n) (n) (n) H (n) xi , A(n) , W (n) = ReLU F xi , Aft , Wtf      (n)   (n) (n) (n) (n) + F xi , Aft , Wtb + F xi , As (n) , Ws (n) + xi .

(5)

Temporal-Semantic Context Fusion for Robust Video Anomaly Detection

3.4

251

Model Training

Multiple-Instance Learning Loss (LD M I L ). In the LDM IL formulation (see Eq. (6)), cross-entropy is calculated between these selected top-K anomaly scores and their corresponding labels, expanding the inter-class distance between normal and abnormal instances. ki      1 −yi log sji + (1 − yi ) log 1 − sji . (6) LDM IL = ki j=1 Center Loss (Lc ). AR Net introduces the Lc (see Eq. (7)) to reduce intraclass distance among normal instances by learning the feature centre, addressing similarity between maximum anomaly scores in normal and anomaly videos. ⎧ L  2  ⎨ 1 i  j − c s  , if yi = 0 i Lc = Li j=1 i , (7) ⎩ 0, otherwise where ci represents center of the anomaly score vector si of the i -th video, which can be obtained by calculating the average value. Magnitude Contrastive Loss (Lm c ). MGFN introduces the Lmc to facilitate the learning of scene-adaptive feature magnitudes and enhance the leaning robustness of the MIL approach. Equation (8) shows the expression, in which D(., .) refers the distance between selected top-K feature magnitudes (mki i ). p, q denote indices of normal video instances, and u, v represent indices of abnormal instances. Further details can be found in [3]. B/2

Lmc =

(1 − l) (D (mpn , mqn )) +

p,q=0

B

(1 − l) (D (mua , mva ))

u,v=B/2

B/2

+

B



(8)

l (M argin − D (mpn , mua )).

p=0 u=B/2

Overall Loss Function. In the proposed method, we introduce a integration of highly effective anomaly classification learning and feature magnitude learning loss functions. Considering the complexity of video duration and incorrect label assignment during the early training stages, we employ the LDM IL and Lc to increase the distinguishability of features. In order to improve the learning robustness of the MIL-based approach, the Lmc is employed for sceneadaptive feature magnitude learning. By integrating these loss functions, we aim to enhance feature discrimination and robustness of our proposed approach. The total loss in the model training is defined as Eq. (9): L = LDM IL + λLc + αLmc ,

(9)

where λ and α are parameters representing the loss weights that balance the impact of each factor.

252

4

Y. Zeng et al.

Experiments

4.1

Datasets and Evaluation Measure

Our method is evaluated on two challenging datasets for weakly supervised VAD tasks: ShanghaiTech [10] and UCF-Crime [13]. The ShanghaiTech dataset comprises 437 videos from 13 scenes. For binary classification, we utilize the split method from [20]. The training set includes 238 videos, and the test set contains 199 videos. The UCF-Crime dataset is gathered from real-world surveillance videos, encompassing 13 anomalies. The dataset consists of 1900 videos, with 1610 in the training set and 290 in the test set. We evaluate the performance of our method by calculating the frame-level area under the ROC curve (AUC). 4.2

Implementation Details

Following previous works [15,16,18], we combine the 1024-dimensional I3DRGB and I3DOptical F low features as input. We conducted relevant experiments1 , and the best results are achieved with a batch size of 60 and balanced sampling. The size of sliding window is 11. The GCNeXt-T module undergoes 4 iterations (n = 4). We set the center Loss weight λ = 20 and the MC Loss weight α = 0.1. The training process utilizes the Adam optimizer with weight decay of 0.00001 and learning rate of 0.0001. 4.3

Experiment Results and Discussions

We compare our method with state-of-the-art approaches on the ShanghaiTech and UCF-Crime datasets (see Table 1 and Table 2). On the ShanghaiTech dataset, our method demonstrates superior performance compared to MIST and BN-SVP, with improvements of 1.58% and 0.41%, respectively. Additionally, our method outperforms AR Net, MSAF, MCR, and TCA-VAD, achieving improvements of 3.17%, 2.95%, 1.49%, and 2.81% respectively. These results demonstrate the advantage of utilizing semantic context in anomaly detection, as opposed to methods that merely rely on temporal context. Notably, in comparison with other GCN-based methods, our method outperforms GCN-Anomaly, STGCNs, and WAGCN by 11.97%, 0.36%, and 2.11%, respectively. Furthermore, our method also performs excellent on the UCF-Crime dataset. Specifically, our method outperforms MIL, AR Net, MCR by 6.86%, 3.12%, and 1.27%, respectively and improves upon the GCN-based method [8] by 0.15%. These results highlight the superior performance of our approach in addressing the weakly supervised VAD task. The visualized results obtained from our method and AR Net are shown in Fig. 4. In video 01-0035, the anomaly corresponds to a car breaking into the pavement. Our method better aligns with the ground truth, featuring more prominent boundaries. AR Net fails to detect the anomaly around the 150 frame, 1

https://github.com/ycolourful/video-anomaly-detection.

Temporal-Semantic Context Fusion for Robust Video Anomaly Detection

253

Table 1. AUC (%) performance comparison on ShanghaiTech.

Method

GCN used Feature

MIL (2018, CVPR) [13] AR Net (2021, ICME) [15] AR Net (2021, ICME) [15] MIST (2021, CVPR) [4] MSAF (2022) [16] TCA-VAD (2022, ICME) [18] MCR (2022, ICME) [5] MGFN (2022, AAAI) [3]

False

GCN-Anomaly (2019, CVPR) [20] True WAGCN (2022) [1] STGCNs (2022) [12] Ours Ours

ShanghaiTech

RGB

I3D I3DRGB I3DRGB+Optical I3DRGB I3DRGB+Optical I3DRGB+Optical I3DRGB+Optical I3DRGB I3DRGB I3DRGB I3DRGB I3DRGB I3DRGB+Optical

F low

F low F low F low

F low

86.30 85.38 93.24 94.83 93.46 93.60 94.92 96.02 84.44 94.30 96.05 91.36 96.41

Table 2. AUC (%) performance comparison on UCF-Crime. Method

GCN used Feature

MIL (2018, CVPR) [13] AR Net (2021, ICME) [15] AR Net (2021, ICME) [15] MCR (2022, ICME) [5]

False

GCN (2022, Neurocomputing) [8] True GCN (2022, Neurocomputing) [8] Ours Ours

I3DRGB I3DRGB I3DRGB+Optical I3DRGB C3D TSNRGB I3DRGB I3DRGB+Optical

UCF

F low

F low

75.41 78.96 79.15 81.0 81.08 82.12 80.07 82.27

while our method successfully identifies it. In normal video 04-007, our method obtains a predicted score of 0 to all frames, while AR Net does not. Based on these results, our method demonstrates superior effectiveness. In video 01-0064, an anomaly occurs involving a skateboard passing through the pedestrian sidewalk. Both our method and AR Net fail to detect it. The main reasons are as follows: 1) Before frame 130, the skateboarder is partially visible. This highlights the challenge of judging anomalies from localized object parts. 2) After frame 130, despite the anomaly object appearing, the skateboard still closely resembles normal walking in the video. Therefore, video anomaly detection in this scenario remains notably challenging.

254

Y. Zeng et al.

Fig. 4. Visualization of the testing results for ShanghaiTech. The first line corresponds to the visualization results of AR Net, and the second line represents the visualization results of our method. The red blocks in the graph indicate the ground truth annotations of abnormal events. (Color figure online)

4.4

Ablation Study

To evaluate our proposed method, we conduct ablation experiments on the ShanghaiTech dataset using the baseline model AR Net (see Table 3). The initial AR Net model achieved an AUC of 93.25% on the frame-level AUC. The introduction of the GCNeXt, as proposed by G-TAD, the performance increased to 94.56%. Moreover, the employment of the proposed GCNeXt-T further enhances the results to 94.68%. Regarding the loss function, combining these highly effective loss functions results in an AUC of 93.83%. Notably, when integrating GCNeXt-T with the new loss function, a performance gain of 3.16% is observed. Table 3. AUC (%) Performance Comparison of Baseline and Our Method. Model

GCNeXt GCNeXt-T MC Loss DMIL loss+Center loss AUC (%)

AR Net ✓

AR Net AR Net ✓ AR Net ✓



AR Net



AR Net







93.25



93.83



94.56



96.28



94.68



96.41

To investigate the impact of the number of iterations on the GCNeXt-T, we conduct experiments with varying iterations (see Fig. 5). The AUC demon-

Temporal-Semantic Context Fusion for Robust Video Anomaly Detection

255

strates an increasing trend as iterations (n) increases, peaking at 4. Increasing n facilitates context learning and captures more valuable context information. However, it can also introduce noise and result in higher computational costs. 96.5 96.4 96.3

AUC(%)

96.2 96.1 96 95.9 95.8 95.7 95.6 95.5 0

1

2

3

4

5

6

7

8

The number of iterations of the GCNeXt-T module(n)

Fig. 5. The frame-level AUC on shanghaiTech dataset with different iterations of GCNeXt-T module.

5

Conclusion

In this paper, we propose a novel approach for weakly supervised video anomaly detection. Our method captures both temporal context and semantic context, distinguishing it from existing methods. Additionally, we introduce a integration of highly effective anomaly classification learning and feature magnitude learning loss functions to enhance the model’s learning capability. Extensive experiments are conducted on the ShanghaiTech and UCF-Crime datasets to validate our approach, demonstrating the efficacy of our method.

References 1. Cao, C., Zhang, X., Zhang, S., Wang, P., Zhang, Y.: Adaptive graph convolutional networks for weakly supervised anomaly detection in videos. IEEE Signal Process. Lett. 29, 2497–2501 (2022) 2. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) 3. Chen, Y., Liu, Z., Zhang, B., Fok, W., Qi, X., Wu, Y.C.: MGFN: magnitudecontrastive glance-and-focus network for weakly-supervised video anomaly detection (2022) 4. Feng, J.C., Hong, F.T., Zheng, W.S.: MIST: multiple instance self-training framework for video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14009–14018 (2021) 5. Gong, Y., Wang, C., Dai, X., Yu, S., Xiang, L., Wu, J.: Multi-scale continuityaware refinement network for weakly supervised video anomaly detection. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)

256

Y. Zeng et al.

6. Huang, C., Liu, Y., Zhang, Z., Liu, C., Wen, J., Xu, Y., Wang, Y.: Hierarchical graph embedded pose regularity learning via spatio-temporal transformer for abnormal behavior detection. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 307–315 (2022) 7. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016) 8. Li, N., Zhong, J.X., Shu, X., Guo, H.: Weakly-supervised anomaly detection in video surveillance via graph convolutional label noise cleaning. Neurocomputing 481, 154–167 (2022) 9. Li, S., Liu, F., Jiao, L.: Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 1395–1403 (2022) 10. Liu, W., Luo, W., Lian, D., Gao, S.: Future frame prediction for anomaly detectiona new baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6536–6545 (2018) 11. Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., Mei, T.: Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 344–353 (2019) 12. Mu, H., Sun, R., Wang, M., Chen, Z.: Spatio-temporal graph-based CNNs for anomaly detection in weakly-labeled videos. Inf. Process. Manage. 59(4), 102983 (2022) 13. Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6479–6488 (2018) 14. Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., Carneiro, G.: Weaklysupervised video anomaly detection with robust temporal feature magnitude learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4975–4986 (2021) 15. Wan, B., Fang, Y., Xia, X., Mei, J.: Weakly supervised video anomaly detection via center-guided discriminative learning. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2020) 16. Wei, D., Liu, Y., Zhu, X., Liu, J., Zeng, X.: MSAF: multimodal supervise-attention enhanced fusion for video anomaly detection. IEEE Signal Process. Lett. 29, 2178– 2182 (2022). https://doi.org/10.1109/LSP.2022.3216500 17. Xu, M., Zhao, C., Rojas, D.S., Thabet, A., Ghanem, B.: G-TAD: sub-graph localization for temporal action detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10156–10165 (2020) 18. Yu, S., Wang, C., Xiang, L., Wu, J.: TCA-VAD: temporal context alignment network for weakly supervised video anomly detection. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2022). https://doi.org/10. 1109/ICME52920.2022.9859607 19. Yue, W., Yongbin, S., Ziwei, L., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (TOG) 38(5), 1–12 (2019) 20. Zhong, J.X., Li, N., Kong, W., Liu, S., Li, T.H., Li, G.: Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1237–1246 (2019)

A Survey: The Sensor-Based Method for Sign Language Recognition
Tian Yang1,2, Cong Shen1,2(B), Xinyue Wang1,2, Xiaoyu Ma1,2, and Chen Ling3

1 School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, People's Republic of China
{tyang,zhujue1998}@stud.tjut.edu.cn
2 Engineering Research Center of Learning-Based Intelligent System, Ministry of Education, Tianjin, People's Republic of China
[email protected]
3 R&D Center (China), Intel-Mobileye, Shanghai, People's Republic of China
[email protected]

Abstract. Sign language is a crucial communication carrier for deaf people to express and exchange their thoughts and emotions. However, ordinary individuals cannot acquire proficiency in sign language in the short term, so deaf people face huge barriers with the hearing community. Given this conundrum, it is valuable to investigate Sign Language Recognition (SLR) equipped with sensors that collect data for subsequent computer vision processing. This study reviews sensor-based SLR methods, which can transform heterogeneous signals from various underlying sensors into high-level motion representations. Specifically, we summarize current developments in sensor-based SLR techniques from the perspective of modalities. Additionally, we distill the sensor-based SLR paradigm and compare state-of-the-art works, including computer vision approaches. Following that, we outline research opportunities and expectations for future work. Keywords: Sign Language Recognition

Abstract. Sign language is a crucial communication carrier among deaf people to express and exchange their thoughts and emotions. However, ordinary individuals cannot acquire proficiency in sign language in the short term, which leaves deaf people facing huge barriers when communicating with the hearing community. Regarding this conundrum, it is valuable to investigate Sign Language Recognition (SLR) equipped with sensors that collect data for subsequent computer vision processing. This study reviews sensor-based SLR methods, which transform heterogeneous signals from various underlying sensors into high-level motion representations. Specifically, we summarize current developments in sensor-based SLR techniques from the perspective of modalities. Additionally, we distill the sensor-based SLR paradigm and compare the state-of-the-art works, including computer vision. Following that, we conclude with research opportunities and expectations for future work.

Keywords: Sign Language Recognition · Sensor · Computer Vision

1 Introduction

Compared with traditional sound-mediated language, sign language has an extraordinary visual perception mode for communication among deaf individuals, a community of almost 489 million people worldwide [28]. Along with the support of facial expressions and mouth movements, it incorporates various gesture forms, locations, and movement trajectories [41]. It is challenging for hearing people to communicate with the deaf because sign language differs from natural language in grammar and syntax. Therefore, Sign Language Recognition (SLR) can provide great aid in removing the communication hurdles faced by hearing-impaired individuals [31]. (This work was supported in part by the Postgraduate Scientific Research Innovation Practice Program of Tianjin University of Technology (YJ2247).)

Fig. 1. The operation of an Analog-Digital converter. (a) The analog signals are sampled at a given frequency. (b) The various discrete waves can be utilized to approximate the quantized digital signals with binary code.

Most existing SLR methods rely heavily on sophisticated visual data collection processes and peripheral input devices. In vision-based scenarios, cameras are set up to capture and record hand motions for sign language dataset construction [43]. Although several algorithms and models have been created to process the captured images and transform them into the corresponding text, their application is still constrained by several conditions:

1. The surrounding elements, such as lighting and shadows, must be carefully dealt with.
2. Noisy signals should be removed during the generation of the dataset.
3. The recognition accuracy is highly likely to be affected by the complexity of the environment [51].
4. Conventional cameras can only capture information from a frontal perspective towards the performer [19].

Sensor-based SLR approaches can address these problems through sampling, quantization, and encoding. In particular, an Analog-Digital (AD) converter transforms the continuous analog signal obtained from sensors into a discrete digital output, as illustrated in Fig. 1. Once the motion data has been converted to digital values, the computer can conveniently derive normalized features and feed them into a model to recognize gestures. In contrast with camera images, the data obtained from non-camera sensors contain more physical quantities, which can reflect hand and finger movement, location, and posture from multiple views. Sensors are robust in that they can support SLR under various illumination, background, and occlusion conditions, whereas approaches that depend only on video are more constrained by these surrounding influences, which undermines recognition accuracy. Low-power sensors and communication modules are frequently adopted in sensor devices for their economy, in contrast with extensive computational devices and considerable energy consumption [22].
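The sampling and quantization steps described above can be made concrete with a short sketch. The snippet below is a minimal illustration and is not taken from any of the surveyed systems; the sampling rate, bit depth, and voltage range are arbitrary assumptions.

```python
import numpy as np

def quantize(signal, n_bits=8, v_min=-1.0, v_max=1.0):
    """Map a continuous-valued signal onto 2**n_bits discrete levels (uniform quantization)."""
    levels = 2 ** n_bits
    step = (v_max - v_min) / (levels - 1)
    codes = np.round((np.clip(signal, v_min, v_max) - v_min) / step).astype(int)
    return codes  # integer codes that would be transmitted as binary words

# "Analog" sensor waveform, sampled at an assumed rate of 1 kHz.
fs = 1000
t = np.arange(0, 0.5, 1.0 / fs)
analog = 0.8 * np.sin(2 * np.pi * 5 * t)   # stand-in for a bending/motion signal

digital_codes = quantize(analog, n_bits=8)
print(digital_codes[:10])
```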

Related work on sensor-based devices for SLR has received extensive attention. We summarize this prior research to elucidate its implementation and future challenges. The rest of this survey is arranged as follows. Section 2 classifies the existing work according to the data modality captured by the sensors. Section 3 classifies the research from the perspectives of datasets, evaluation metrics, and device appearance. Section 4 highlights opportunities for future approaches. Section 5 concludes.

Fig. 2. The paradigm of sensor-based SLR. The heterogeneous signals obtained from sensors mainly consist of WiFi signals, myoelectric signals, vision-based information, and data recorded from hand sensors.

2 Sensor Modality

The state-of-the-art SLR works, including computer vision-based analyses that interpret analog signals captured by sensors such as cameras, are not only affected by the development of feature engineering in the vision domain but also require choosing appropriate machine learning strategies based on the type of signals. We summarize the paradigm of cutting-edge research based on the activity signals and feature extractors in Fig. 2; the feature extraction step can be regarded as processing digital signals with computer vision techniques. From the view of modalities, we compare various sensors in terms of cost, valid range, merits, and cons, including the Kinect, flex sensor, Force Myography (FMG), Electromyography (EMG), inertial, Photoplethysmography (PPG), and Radio Frequency (RF) sensors, as shown in Table 1. Moreover, to deal with the motion signals obtained from the sensors, we classify the cutting-edge works according to the type of sensor, as shown in Table 2.

Table 1. The sensor modalities.

Modalities       Cost ($)   Valid Range  Merits                  Cons
Kinect sensor    100–200    meter        Depth perception        Less effective for long-distance tracking
Flex sensor      1–10       centimeter   High flexibility        For close-range measurements only
FMG sensor       100–500    centimeter   Non-invasive            Susceptible to external interference
EMG sensor       50–200     centimeter   High precision          Sensitive to muscle mass and position
Inertial sensor  10–50      centimeter   Motion capturing        Large cumulative error
PPG sensor       1–20       millimeter   Heart rate measurement  Highly disturbed by ambient light
RF sensor        100–1000   meter        Long distances          Expensive cost

Table 2. The sensor-based SLR works. (Columns of the table: Work, Year, Modality (Kinect / Flex / FMG / EMG / Inertial / PPG / RF), Appearance, Classifiers, Dataset Size, Class, Metrics, and Country.)

Work  Year  Appearance
[46]  2015  –
[14]  2016  glove
[45]  2016  –
[23]  2017  –
[47]  2017  –
[4]   2020  Wrist-worn
[1]   2020  glove
[17]  2020  –
[40]  2020  glove
[27]  2020  glove
[30]  2021  glove
[16]  2021  –
[39]  2021  –
[25]  2021  –
[12]  2021  glove
[10]  2021  glove
[50]  2021  Wrist-worn
[49]  2022  glove
[13]  2022  Armband
[20]  2022  glove
[21]  2022  glove
[36]  2022  –
[29]  2022  glove
[6]   2022  Armband
[48]  2022  –
[8]   2022  –
[32]  2023  –
[3]   2023  glove

Note: FMG: Force Myography, EMG: Electromyography, PPG: Photoplethysmography, RF: Radio Frequency. SVM: Support Vector Machine, KNN: K-Nearest Neighbor, LR: Logistic Regression, DTW: Dynamic Time Warping, DT: Decision Tree, ELM: Extreme Learning Machine, NV: Naive Bayes, SDE: Subspace Discriminant Ensemble, WNN: Wide Neural Network, LD: Linear Discriminant, ACC: Accuracy.

2.1 Kinect

Kinect is a depth sensor platform that integrates a traditional RGB camera, an infrared camera, and a projector [11]. Due to its strong ability to track moving objects, Kinect sensors are frequently applied to record sign language video and generate the corresponding datasets [33]. Kumar et al. utilized Leap Motion and Kinect sensors to capture finger and palm positions from two separate viewpoints, respectively [23]. Yang et al. acquired 3D depth information of hand movements from the Kinect sensor and employed hierarchical conditional random fields (CRF) as the gesture classifier [46]. A number of Kinect-based computer vision algorithms and applications have been proposed [18]. The manufacturer of Kinect, Microsoft Corporation, has provided software development kits to customize the behavior of the sensor. However, Kinect offers less adaptability and flexibility than the exchangeable sensors of a traditional sensing infrastructure. Moreover, successive iterations of the device have limited the continued availability of the hardware and have influenced Kinect-based SLR research to some extent.

2.2 Flexible Sensor

Contrasted with the Kinect, the flex sensor is made of flexible material with excellent ductility, which adapts to more stringent criteria [38].


Fig. 3. The SLR works based on flex sensors. (The images of the ionic liquid and self-powered triboelectric material are derived from [30] and [27], respectively.)

By changing the amount of stored charge, flex sensors use resistive and capacitive elements to respond to mechanical deformation [35]. Due to the adaptability of its structure, the flex sensor offers various promising applications, including wearable equipment, environmental monitoring, and medical electronics, which make it easier to assess body movement and skin-related parameters [34]. Flex sensors often serve as the glove element that measures the signal of gesture deformation [12,40,49], as demonstrated in Fig. 3. Flexibility and off-the-shelf commercial availability are the main considerations of SLR researchers, and both are determined by the specific materials [21,29,39]. For instance, Pukar et al. designed a self-powered triboelectric flex sensor with randomly distributed microstructures to detect hand movement [27]. Moreover, Nhu et al. utilized an ionic liquid strain sensor to assess hand deformation data, which has superior sensitivity and longevity compared to traditional flexible sensors [30]. Flex sensors have also emerged in the high-end mobile market to measure the degree of joint bending, such as at the fingers and wrists. Although various studies have employed machine learning approaches to build classifiers [3,10,20], most SLR works align sensor data with records in a database.
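As a rough illustration of how such a resistive flex sensor is typically read out, the sketch below converts a raw ADC reading into a resistance via an assumed voltage divider and then into an approximate bend angle by linear interpolation between two calibration points. All constants (supply voltage, fixed resistor, calibration resistances) are assumptions for illustration, not values from the cited works.

```python
def flex_adc_to_angle(adc_value, adc_max=1023, v_supply=3.3,
                      r_fixed=47_000.0, r_flat=25_000.0, r_bent=100_000.0):
    """Convert an ADC reading from a flex-sensor voltage divider to a rough bend angle.

    Assumed wiring: v_out is measured across the flex sensor, which sits in series
    with a fixed resistor between v_supply and ground.
    """
    v_out = adc_value / adc_max * v_supply
    v_out = min(max(v_out, 1e-6), v_supply - 1e-6)      # avoid division by zero
    r_flex = r_fixed * v_out / (v_supply - v_out)        # voltage-divider equation
    # Linear calibration assumption: r_flat maps to 0 degrees, r_bent to 90 degrees.
    angle = (r_flex - r_flat) / (r_bent - r_flat) * 90.0
    return max(0.0, min(90.0, angle))

print(flex_adc_to_angle(612))
```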

2.3 Electromyographic Sensor

EMG can measure the electrical activity of muscles during relaxation and contraction via surface sensors, which deliver a more stable analog signal than flex sensors. It can be applied to assess the muscular movements of the fingers or arms in SLR [6,13,16], as demonstrated in Table 3, where we summarize the number of sensors, their placement, the signal processing algorithms, and the dimension of the obtained features. Despite their excellent performance, EMG sensors are difficult to promote among the general public because of their invasive manner of use. Moreover, the sensitivity of surface EMG to external noise is also a factor that influences SLR accuracy.

Table 3. The summarization of EMG-based SLR works.

Work  Amount  Signal Processing                      Dimension  Placement
[13]  5       –                                      26         arm
[16]  3       interpolation                          312        arm
[36]  8       infinite impulse response              11         arm
[6]   8       –                                      400        arm
[47]  4       two-stage amplifier, band-pass filter  –          extensor digiti minimi, palmaris longus, extensor carpi radialis longus, extensor carpi ulnaris
[45]  4       infinite impulse response              19         extensor digitorum, flexor carpi radialis longus, extensor carpi radialis longus, extensor carpi ulnaris
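The band-pass and IIR filtering steps listed in Table 3 can be sketched as follows; the 1 kHz sampling rate and 20 to 450 Hz passband are typical choices for surface EMG and are assumptions here, not parameters reported by the cited works.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_emg(raw, fs=1000.0, low=20.0, high=450.0, order=4):
    """Zero-phase Butterworth band-pass filtering of a raw surface-EMG channel."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, raw)

rng = np.random.default_rng(0)
raw_emg = rng.normal(size=2000)          # stand-in for 2 s of raw EMG at 1 kHz
filtered = bandpass_emg(raw_emg)
envelope = np.abs(filtered)              # simple rectification as a crude activation feature
print(filtered.shape, envelope.mean())
```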

2.4 Force Myography Sensor

The FMG sensor is a flex-sensor-based alternative to electromyography that measures the muscle squeeze and deformation produced during muscle contraction in order to infer muscle activity [42]; for example, Barioul et al. connected Force Sensing Resistor (FSR) sensors to a bracelet to extract such signals [4]. Information about movement and deformation is obtained through the FSRs, whose readings correspond to the muscle deformation. An existing study has shown that FMG performs better than EMG in human action recognition and intention detection [15].

2.5 Inertial Sensor

Inertial sensors measure the acceleration and angular velocity of an object and are used for posture measurement, motion monitoring [16,39], navigation and positioning, posture recognition, and human-computer interaction [2]. To enable functions like gesture control and virtual reality interaction, they can be employed to assess hand position and movement [3,12]. As the most explored inertial sensor, the accelerometer uses tiny springs or piezoelectric materials to measure changes in the acceleration of an object [1,16,24]; several researchers have utilized it to calculate the rotation of hand components and variations in speed [14,47]. Gyroscopes track changes in the angular velocity of object rotation and can detect rotational inertia or angular displacement using suspended or rotating structures [1]; they are suitable for precisely capturing angular velocity and improving the accuracy of gesture detection [39,47]. An Inertial Measurement Unit (IMU) integrates several inertial sensors, including accelerometers, gyroscopes, and magnetometers [5]. Magnetometers measure the strength and direction of a magnetic field, sense changes in the field, and provide direction and position. These merits make IMUs popular for motion control and precise calculation, such as measuring displacements on gloves, which can capture motion with higher accuracy [1,45].
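A common way to fuse accelerometer and gyroscope readings into a stable orientation estimate is a complementary filter; the sketch below estimates pitch only, and the 100 Hz update rate and 0.98 blending coefficient are illustrative assumptions rather than settings of the cited glove systems.

```python
import math

def complementary_pitch(accel, gyro_y, dt=0.01, alpha=0.98):
    """Fuse accelerometer (ax, ay, az in g) and gyro pitch rate (deg/s) into pitch angles."""
    pitch = 0.0
    estimates = []
    for (ax, ay, az), gy in zip(accel, gyro_y):
        # Pitch from the gravity direction (valid when external acceleration is small).
        accel_pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
        # Blend the integrated gyro rate (smooth, drifts) with the accel estimate (noisy, drift-free).
        pitch = alpha * (pitch + gy * dt) + (1.0 - alpha) * accel_pitch
        estimates.append(pitch)
    return estimates

accel_samples = [(0.0, 0.0, 1.0)] * 100          # level hand
gyro_samples = [5.0] * 100                        # slow rotation of 5 deg/s
print(complementary_pitch(accel_samples, gyro_samples)[-1])
```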


Fig. 4. The SLR works based on WiFi signals. The four different colors from the top to the bottom of the figure are from [8, 25, 32, 48], respectively. The CSI can be converted into visual image information, such as spectrograms, by signal analysis operations.

2.6 Photoplethysmography Sensor

PPG sensors monitor physiological parameters, such as blood flow and heart rate, using optoelectronic measuring technology [7]. They have been widely employed in wearable devices, including smart bracelets, smart watches, and smart glasses, to accomplish real-time monitoring and analysis of human physiological data [9]. Mehdi et al. extracted meaningful characteristics from such signals to identify human activities [7]. Furthermore, Zhao et al. applied PPG data to investigate the variations in blood flow induced by gestural movements [50]. Although distinct gesture expressions can be captured by the fine-grained PPG signal, skin color and moisture should be taken into account, since both affect the received light.
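A minimal sketch of extracting heart rate from a PPG waveform by peak detection is given below; the 50 Hz sampling rate and the minimum peak distance are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

def ppg_heart_rate(ppg, fs=50.0):
    """Estimate heart rate (beats per minute) from a PPG trace via peak detection."""
    # Peaks must be at least 0.4 s apart (i.e. at most 150 bpm), an assumed bound.
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs), prominence=0.1)
    if len(peaks) < 2:
        return None
    intervals = np.diff(peaks) / fs               # inter-beat intervals in seconds
    return 60.0 / intervals.mean()

t = np.arange(0, 10, 1 / 50.0)
synthetic_ppg = np.sin(2 * np.pi * 1.2 * t)       # roughly 72 bpm synthetic pulse wave
print(round(ppg_heart_rate(synthetic_ppg)))
```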

2.7 Radio Frequency Sensor

RF sensors can measure physical quantities by assessing reflection, scattering, and interference of RF signals [37]. It can be applied to measure data such as distance, speed, position, object form in industrial production, vital signs, and lesion locations in medical diagnostics. In addition, RF data can be exploited to recognize motions after data processing procedures such as signal transformation [17,25]. Gurbuz et al. recorded an ASL sign language dataset by combining three RF sensors and two Xethru sensors [17]. Liu et al. have collected the Channel State Information (CSI) from WiFi signals to capture hand movement [25]. The general paradigm of WiFi-based SLR works in Fig. 4 shows the acquired CSI information can be fed into the classifiers for gesture identification [32]. Data preprocessing aims to eliminate the noise and retrieve the signal with the most distinctive features [48]. To increase SLR accuracy, feature classifiers must extract contextual data from the collected temporal vectors [8].
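The signal-analysis step sketched in Fig. 4, turning CSI amplitude streams into spectrogram-like images that a classifier can consume, can be illustrated with a short-time Fourier transform. The packet rate and STFT window length below are assumptions, not the settings of the cited WiFi systems.

```python
import numpy as np
from scipy.signal import stft

def csi_to_spectrogram(csi_amplitude, fs=100.0, window_len=64):
    """Convert one subcarrier's CSI amplitude stream into a time-frequency image."""
    # Remove the static (DC) component so that only motion-induced variation remains.
    centered = csi_amplitude - csi_amplitude.mean()
    freqs, times, zxx = stft(centered, fs=fs, nperseg=window_len, noverlap=window_len // 2)
    return freqs, times, np.abs(zxx)              # magnitude spectrogram

rng = np.random.default_rng(1)
t = np.arange(0, 5, 1 / 100.0)
csi = 1.0 + 0.2 * np.sin(2 * np.pi * 3 * t) + 0.05 * rng.normal(size=t.size)
freqs, times, spec = csi_to_spectrogram(csi)
print(spec.shape)   # (frequency bins, time frames): image-like input for a classifier
```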

3 Other Categories

3.1 Dataset

We summarize the existing datasets in terms of size, category, and country, as shown in Table 2. These items show that existing works have not used identical datasets, which requires the community to establish a consistent assessment standard to facilitate comparison.

3.2 Evaluation

The last column of Table 2 indicates the evaluation metrics that are commonly used to assess the performance of sensor-based SLR works, including ACC, Precision, Recall, and the F1 score, as in (1), (2), (3), and (4), respectively:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (3)$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)$$

where TP, TN, FP, and FN denote True Positive, True Negative, False Positive, and False Negative, respectively. In addition, response time is also utilized as an evaluation metric to measure the sensitivity of the sensor components.
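For concreteness, the four metrics in Eqs. (1)-(4) can be computed directly from the confusion counts, as in the short sketch below (the counts themselves are made-up numbers).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Recall, Precision and F1 from confusion counts, as in Eqs. (1)-(4)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return acc, recall, precision, f1

print(classification_metrics(tp=90, tn=850, fp=25, fn=35))
```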

3.3 Appearance

Although Table 2 shows that gloves are currently the most dominant form of sensor-carrying appearance in SLR, many of them are uncomfortable and unpopular due to bulky designs and expensive cost. Recently, researchers have focused on integrating sensors into a compact, transportable bracelet or armband to overcome these drawbacks.

4 Discussion

4.1 Material

Semiconductor materials have occupied the predominant position in sensor-based SLR. In contrast with this traditional utilization, future development is trending toward miniaturization, flexibility, and intelligence in the design of novel materials. Specifically, sensitivity and portability are the chief focus points of the wearable SLR equipment explored by researchers [44]. As an interdisciplinary task, SLR in computer vision can be boosted by novel sensor materials, which have broad application prospects [30].

4.2 Heterogeneous SLR

Despite the rapid growth of sensor-based gesture recognition summarized above, device-based solutions have only partly addressed the problem of isolated word recognition, and continuous SLR in realistic settings remains challenging. Two reasons should be taken into account, as listed below.


1. The sensor can only be used for large-scale motion capture due to noise and limited sampling frequency, which means that tracking fine movements still requires visual information to assist the computation.
2. The fluency and coherence of gestures in continuous sign language mean that device-based SLR datasets cannot be constructed by simply combining isolated-word datasets. The absence of relevant datasets and the alignment of annotations should be considered.

4.3 Combination of Sensor and Vision

Computer vision-based and device-based SLR have both achieved remarkable results but still have limits because neither provides a panoramic view on its own. On the vision side, it is necessary to address the processing of visual data, including denoising complex backgrounds in RGB images and decreasing the computational cost of massive data. On the device side, although various works have employed multi-sensor fusion techniques to let devices collaborate across a few views, continuous SLR often lacks a sufficient variety of sensors, because the characteristics of a sensor determine which features of sign language it can capture. By virtue of multimodal fusion across vision and sensor data, the complementarity of information can help increase recognition accuracy. However, additional research challenges, such as the growth in data volume and the increasing cost of computational power, must be dealt with. Besides, overfitting is more likely to occur under unbalanced data distributions, and mapping data from different spaces introduces computational bias that hampers model convergence. The semantic discrepancies caused by various tasks can be reduced by learning shared information from several domains [26].

5 Conclusion

SLR is a crucial research topic in human-computer interaction and computer vision. In this paper, we have investigated SLR methods in depth, organized by a taxonomy of sensor modalities, and have summarized and discussed the most representative sensor-based SLR works. In the end, we have outlined current challenges and pointed out several promising future research directions. We anticipate that this survey offers a panorama of SLR technology that will help prompt researchers in future work.

References 1. Abdullah, A., Abdul-Kadir, N.A., Che Harun, F.K.: An optimization of IMU sensors-based approach for Malaysian sign language recognition. In: ICCED, pp. 1–4 (2020)


2. Alaoui, F., Fourati, H., Kibangou, A., Robu, B., Vuillerme, N.: Kick-scooters identification in the context of transportation mode detection using inertial sensors: methods and accuracy. J. Intell. Transport. Syst. (2023). https://doi.org/10.1080/ 15472450.2022.2141118 3. Alosail, D., Aldolah, H., Alabdulwahab, L., Bashar, A., Khan, M.: Smart glove for bi-lingual sign language recognition using machine learning. In: IDCIoT, pp. 409–415 (2023) 4. Barioul, R., Ghribi, S.F., Ben Jmaa Derbel, H., Kanoun, O.: Four sensors bracelet for American sign language recognition based on wrist force myography. In: CIVEMSA, pp. 1–5 (2020) 5. Barraza Madrigal, J.A., Contreras Rodr´ıguez, L.A., Cardiel P´erez, E., Hern´ andez Rodr´ıguez, P.R., Sossa, H.: Hip and lower limbs 3D motion tracking using a doublestage data fusion algorithm for IMU/MARG-based wearables sensors. Biomed. Signal Process. Control 86, 104938 (2023) 6. Ben Haj Amor, A., El Ghoul, O., Jemni, M.: Deep learning approach for sign language’s handshapes recognition from EMG signals. In: ITSIS, pp. 1–5 (2022) 7. Boukhechba, M., Cai, L., Wu, C., Barnes, L.E.: ActiPPG: using deep neural networks for activity recognition from wrist-worn photoplethysmography (PPG) sensors. Smart Health 14, 100082 (2019) 8. Chen, H., Feng, D., Hao, Z., Dang, X., Niu, J., Qiao, Z.: Air-CSL: Chinese sign language recognition based on the commercial WiFi devices. Wirel. Commun. Mob. Comput. 2022 (2022). https://doi.org/10.1155/2022/5885475 9. Choi, J., Hwang, G., Lee, J.S., Ryu, M., Lee, S.J.: Weighted knowledge distillation of attention-LRCN for recognizing affective states from PPG signals. Expert Syst. Appl. 120883 (2023) 10. Chu, X., Liu, J., Shimamoto, S.: A sensor-based hand gesture recognition system for Japanese sign language. In: LifeTech, pp. 311–312 (2021) 11. DiFilippo, N.M., Jouaneh, M.K.: Characterization of different Microsoft Kinect sensor models. IEEE Sens. J. 15(8), 4554–4564 (2015) 12. Dweik, A., Qasrawi, H., Shawar, D.: Smart glove for translating Arabic sign language “SGTArSL”. In: ICCTA, pp. 49–53 (2021) 13. Fouts, T., Hindy, A., Tanner, C.: Sensors to sign language: a natural approach to equitable communication. In: ICASSP, pp. 8462–8466 (2022) 14. Galka, J., Masior, M., Zaborski, M., Barczewska, K.: Inertial motion sensing glove for sign language gesture acquisition and recognition. IEEE Sens. J. 16(16), 6310– 6316 (2016) 15. Godiyal, A.K., Singh, U., Anand, S., Joshi, D.: Analysis of force myography based locomotion patterns. Measurement 140, 497–503 (2019) 16. Gupta, R., Bhatnagar, A.S.: Multi-stage Indian sign language classification with sensor modality assessment. In: ICACCS, vol. 1, pp. 18–22 (2021) 17. Gurbuz, S.Z., et al.: ASL recognition based on Kinematics derived from a multifrequency RF sensor network. IEEE Sens. J. 1–4 (2020) 18. Han, J., Shao, L., Xu, D., Shotton, J.: Enhanced computer vision with Microsoft Kinect sensor: a review. IEEE T. Cybern. 43(5), 1318–1334 (2013) 19. Hu, H., Wang, W., Zhou, W., Zhao, W., Li, H.: Model-aware gesture-to-gesture translation. In: CVPR, pp. 16423–16432 (2021) 20. Ji, L., Liu, J., Shimamoto, S.: Recognition of Japanese sign language by sensorbased data glove employing machine learning. In: LifeTech, pp. 256–258 (2022) 21. Kania, M., Korzeniewska, E., Zawi´slak, R., Nikitina, A., Krawczyk, A.: Wearable solutions for the sign language. In: MEES, pp. 1–4 (2022)


22. Kudrinko, K., Flavin, E., Zhu, X., Li, Q.: Wearable sensor-based sign language recognition: a comprehensive review. IEEE Rev. Biomed. Eng. 14, 82–97 (2021) 23. Kumar, P., Gauba, H., Roy, P.P., Dogra, D.P.: A multimodal framework for sensor based sign language recognition. Neurocomputing 259(SI), 21–38 (2017) 24. Kwon, J., Nam, H., Chae, Y., Lee, S., Kim, I.Y., Im, C.H.: Novel three-axis accelerometer-based silent speech interface using deep neural network. Eng. Appl. Artif. Intell. 120, 105909 (2023) 25. Liu, C., Liu, J., Shimamoto, S.: Sign language estimation scheme employing Wi-Fi signal. In: SAS, pp. 1–5 (2021) 26. Ma, Y., Zhao, S., Wang, W., Li, Y., King, I.: Multimodality in meta-learning: a comprehensive survey. Knowl.-Based Syst. 250, 108976 (2022) 27. Maharjan, P., et al.: A human skin-inspired self-powered flex sensor with thermally embossed microstructured triboelectric layers for sign language interpretation. Nano Energy 76, 105071 (2020) 28. Mitra, S., Acharya, T.: Gesture recognition: a survey. IEEE Trans. Syst. Man Cybern.-Syst. 37(3), 311–324 (2007) 29. Muralidharan, N.T., Rohidh, M.R., Harikumar, M.E.: Modelling of sign language smart glove based on bit equivalent implementation using flex sensor. In: WiSPNET, pp. 99–104 (2022) 30. Nhu, C.T., Dang, P.N., Thanh, V.N.T., Thuy, H.T.T., Thanh, V.D., Thanh, T.B.: A sign language recognition system using ionic liquid strain sensor. In: ISMEE, pp. 263–267 (2021) 31. Qahtan, S., Alsattar, H.A., Zaidan, A.A., Deveci, M., Pamucar, D., Martinez, L.: A comparative study of evaluating and benchmarking sign language recognition system-based wearable sensory devices using a single fuzzy set. Knowl.-Based Syst. 269, 110519 (2023) 32. Qin, Y., Pan, S., Zhou, W., Pan, D., Li, Z.: WiASL: American sign language writing recognition system using commercial WiFi devices. Measurement 218, 113125 (2023) 33. Rakun, E., Andriani, M., Wiprayoga, I.W., Danniswara, K., Tjandra, A.: Combining depth image and skeleton data from Kinect for recognizing words in the sign system for Indonesian language (SIBI [Sistem Isyarat Bahasa Indonesia]). In: ICACSIS, pp. 387–392 (2013) 34. Rashid, A., Hasan, O.: Wearable technologies for hand joints monitoring for rehabilitation: a survey. Microelectron. J. 88, 173–183 (2019) 35. Saggio, G., Riillo, F., Sbernini, L., Quitadamo, L.R.: Resistive flex sensors: a survey. Smart Mater. Struct. 25(1), 013001 (2016) 36. Saif, R., Ahmad, M., Naqvi, S.Z.H., Aziz, S., Khan, M.U., Faraz, M.: Multi-channel EMG signal analysis for Italian sign language interpretation. In: ICETST, pp. 1–5 (2022) 37. Sarkar, B., Takeyeva, D., Guchhait, R., Sarkar, M.: Optimized radio-frequency identification system for different warehouse shapes. Knowl.-Based Syst. 258, 109811 (2022) 38. Sharma, A., Ansari, M.Z., Cho, C.: Ultrasensitive flexible wearable pressure/strain sensors: parameters, materials, mechanisms and applications. Sens. Actuat. A 347, 113934 (2022) 39. Subedi, B., Dorji, K.U., Wangdi, P., Dorji, T., Muramatsu, K.: Sign language translator of Dzongkha alphabets using Arduino. In: i-PACT, pp. 1–6 (2021) 40. Suri, A., Singh, S.K., Sharma, R., Sharma, P., Garg, N., Upadhyaya, R.: Development of sign language using flex sensors. In: ICOSEC, pp. 102–106 (2020)


41. Sze, F.: From gestures to grammatical non-manuals in sign language: a case study of polar questions and negation in Hong Kong sign language. Lingua 267, 103188 (2022) 42. Ul Islam, M.R., Bai, S.: A novel approach of FMG sensors distribution leading to subject independent approach for effective and efficient detection of forearm dynamic movements. Biomed. Eng. Adv. 4, 100062 (2022) 43. Venugopalan, A., Reghunadhan, R.: Applying deep neural networks for the automatic recognition of sign language words: a communication aid to deaf agriculturists. Expert Syst. Appl. 185, 115601 (2021) 44. Wang, Z., et al.: Hear sign language: a real-time end-to-end sign language recognition system. IEEE Trans. Mob. Comput. 21(7), 2398–2410 (2022) 45. Wu, J., Sun, L., Jafari, R.: A wearable system for recognizing American sign language in real-time using IMU and surface EMG sensors. IEEE J. Biomed. Health Inform. 20(5, SI), 1281–1290 (2016) 46. Yang, H.D.: Sign language recognition with the Kinect sensor based on conditional random fields. IEEE Sens. J. 15(1), 135–147 (2015) 47. Yang, X., Chen, X., Cao, X., Wei, S., Zhang, X.: Chinese sign language recognition based on an optimized tree-structure framework. IEEE J. Biomed. Health Inform. 21(4), 994–1004 (2017) 48. Zhang, N., Zhang, J., Ying, Y., Luo, C., Li, J.: Wi-phrase: deep residual-multihead model for WiFi sign language phrase recognition. IEEE Internet Things J. 9(18), 18015–18027 (2022) 49. Zhang, Y., Xu, W., Zhang, X., Li, L.: Sign annotation generation to alphabets via integrating visual data with somatosensory data from flexible strain sensor-based data glove. Measurement 202, 111700 (2022) 50. Zhao, T., Liu, J., Wang, Y., Liu, H., Chen, Y.: Towards low-cost sign language gesture recognition leveraging wearables. IEEE Trans. Mob. Comput. 20(4), 1685– 1701 (2021) 51. Zhou, H., Zhou, W., Zhou, Y., Li, H.: Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Trans. Multimed. 24, 768–779 (2022)

Utilizing Video Word Boundaries and Feature-Based Knowledge Distillation Improving Sentence-Level Lip Reading Hongzhong Zhen, Chenglong Jiang, Jiyong Zhou, Liming Liang, and Ying Gao(B) School of Computer Science and Engineering, South China University of Technology, Guangzhou, China {cshzzhen,csjiangcl gx,202121044812,csliangliming}@mail.scut.edu.cn, [email protected]

Abstract. Lip reading is to recognize the spoken content from a silent video of lip movement. A general problem in sentence-level lip reading is that the length of the predicted text is inconsistent with the actual text. To alleviate this problem, we introduce video word boundary information into sentence-level lip reading and propose the TLiM-VWB model. Besides, to deal with the situation in which video word boundaries cannot be obtained in the wild, we propose the LiM-VWB-KD method with two knowledge distillation strategies that utilize video word boundary information implicitly. We evaluate our model and method on the CMLR and LRS2 datasets with the CER/WER metrics and our proposed length difference rate (LDR). The results of the TLiM-VWB model verify the effectiveness of video word boundary information for improving sentence-level lip reading accuracy. We also show the effectiveness of the LiM-VWB-KD method, especially with the feature-based strategy. Our LiM-VWB-KD method achieves the best result on Chinese sentence-level lip reading among methods using the Transformer architecture and achieves new state-of-the-art performance in the speaker-independent setting on CMLR.

Keywords: Lip reading · Deep learning · Video word boundary · Knowledge distillation

1 Introduction

Lip reading, also known as visual speech recognition, is to recognize spoken content from a silent video of lip movement. In recent years, the performance of lip reading has been gradually improving with the development of deep learning. Previous research [1,10,19] on lip reading mainly paid attention to improving the model structure, but few studies try to improve the accuracy of lip reading from the perspective of text. We find a frequent problem in sentence-level lip reading: the length of the predicted text is inconsistent with the actual text, as illustrated in Fig. 1(a). This problem greatly reduces the accuracy because missing or redundant words are counted as wrong words and can even result in the wrong recognition of other words in the context. It shows that current lip reading models have not grasped the video duration of each word, so they do not accurately predict the number of words in the video. In order to alleviate this problem, we use video word boundary information to strengthen the lip reading model and propose the Transformer-based Lip Reading Model Merging Video Word Boundaries (TLiM-VWB). To the best of our knowledge, we are the first to introduce video word boundary information into sentence-level lip reading. On the other hand, we cannot directly obtain the video word boundaries in the wild scene. In order to utilize the video word boundary information implicitly, we propose the Lip Reading Model Training Method Utilizing Video Word Boundaries by Knowledge Distillation (LiM-VWB-KD).

Fig. 1. The length difference problem and the video word boundary.

We conduct the LiM-VWB-KD method with a response-based strategy and a feature-based strategy. To evaluate the accuracy of sentence-level lip reading, we not only use the conventional metrics character error rate (CER) and word error rate (WER) but also propose to use the length difference rate (LDR) to measure the length accuracy of the predicted sentences. We conduct experiments on the CMLR and LRS2 datasets and verify the effectiveness of the TLiM-VWB model, which shows that video word boundary information can effectively help to improve the accuracy of sentence-level lip reading. We also verify the effectiveness of the LiM-VWB-KD method. On CMLR, our LiM-VWB-KD method achieves a CER of 18.06% in the speaker-dependent setting and 28.67% in the speaker-independent setting. Among methods using the Transformer architecture, our method achieves the best result on Chinese sentence-level lip reading. More importantly, we update the state-of-the-art result in the speaker-independent setting on CMLR. In general, the contributions of our work are as follows:

• We introduce video word boundary information into a sentence-level lip reading model for the first time and propose the TLiM-VWB model. We verify that utilizing video word boundary information can effectively improve the accuracy of sentence-level lip reading.
• To improve the accuracy of lip reading in the wild scene, we propose the LiM-VWB-KD method, which utilizes video word boundary information implicitly. Our method achieves the best result among Chinese sentence-level lip reading models with the Transformer architecture and achieves new state-of-the-art performance in the speaker-independent setting on the CMLR dataset.
• We propose to use the length difference rate (LDR) to measure the length accuracy of the predicted sentences. Using this metric, we show the improvement of our model and method on the length accuracy of lip reading.

2 Related Work

2.1 Sentence-Level Lip Reading

Sentence-level lip reading recognizes an entire sentence from a lip movement video and is a complex sequence-to-sequence task. Afouras et al. [1] introduced the Transformer architecture to English sentence-level lip reading and proposed the TM-CTC model. In the Chinese field, Zhao et al. [19] proposed the specialized CSSMCM model and released the CMLR dataset. After that, the CTCH-LipNet model [10] utilized the Transformer architecture to perform Chinese sentence-level lip reading and achieved a better result on CMLR. Recently, LipSound2 [11] provided an additional speaker-independent baseline for Chinese sentence-level lip reading.

2.2 Video Word Boundary

The video word boundary refers to the video frame position corresponding to the boundary between words in the lip movement video, as illustrated in Fig. 1(b). In lip reading datasets, video word boundaries are usually indicated by the start and end times of the words. In the word-level lip reading field, Stafylakis et al. [14] used video word boundaries to separate the video mapped to one word into the proper clip and redundant clips and improved the accuracy of lip reading on the LRW dataset [3]. Feng et al. [4] used the same method on the LRW-1000 dataset [17] and also obtained gains. Following [4,14], Ma et al. [9] likewise showed that using video word boundaries is a beneficial strategy for word-level lip reading.

2.3 Knowledge Distillation

Knowledge distillation [7] transfers the output layer knowledge from the teacher model into the student model traditionally. In the past few years, feature-based knowledge distillation was also widely used, and the losses on intermediate representation were proposed [13,18]. There are also some methods [8,12,20] utilizing knowledge distillation to improve the accuracy of lipreading. For example, LIBS [20] distilled multi-granularity knowledge from the speech recognizer to enhance the lip reader.

3 Methods

3.1 Transformer-Based Lip Reading Model Merging Video Word Boundaries (TLiM-VWB)

Sentence-Level Lip Reading Pipeline. We employ a cascaded lip reading pipeline to perform the sentence-level lip reading task. Firstly, the video of lip movement is fed into the visual front-end model described in [15] to generate a visual feature sequence. Then our proposed model uses this visual feature sequence to generate character probabilities that are used to predict the sentence.

Overall Architecture. The overall architecture of TLiM-VWB is shown in Fig. 2(a). To merge the video word boundaries when processing visual features, we place the video word boundary embedding module between the encoder and the decoder. We design the structure of the encoder and decoder according to the TM-CTC model [1]; the main component of both is a stack of Transformer encoder layers [16]. The video word boundary embedding module utilizes video word boundaries to enhance the encoding result and assist the decoding process. The outputs of the decoder are the probabilities used to calculate the CTC loss [5] during model training.

Fig. 2. The architecture of TLiM-VWB.


Video Word Boundary Embedding. The video word boundary embedding module combines video word boundary information with the intermediate representation from the encoder to form an intermediate representation merging video word boundary information. The video word boundary information is often recorded in the form of word boundary times, so we carry out a series of transformations to make it addable to the intermediate representation matrix. Firstly, we use a sequence $S_{et} = (t_1, \dots, t_n)$ to denote the end times (boundary times) of the words in the video, where $n$ is the number of words. We multiply each time in $S_{et}$ by the video frame rate to get the boundary frame index sequence $S_{bfi}$, i.e.

$$S_{bfi} = (i_1, \dots, i_n), \quad i_k = t_k \cdot f, \ t_k \in S_{et}, \quad (1)$$

where $f$ is the video frame rate. We then set a boundary frame indication vector $V_b = (b_1, \dots, b_T)$, where $T$ denotes the number of frames in the video, so that $V_b$ has the same length as the video frame sequence. The elements of $V_b$ are 1 at the boundary frame indices and 0 elsewhere, i.e.

$$b_i = \begin{cases} 1, & i \in S_{bfi} \text{ and } i \in [1, T], \\ 0, & i \notin S_{bfi} \text{ and } i \in [1, T]. \end{cases} \quad (2)$$

The boundary frame indication vector $V_b$ then passes through an embedding layer to form the boundary frame indication matrix $M_b$, i.e.

$$M_b = \mathrm{Embedding}(V_b). \quad (3)$$

In this process, every element of $V_b$ is extended to a $d$-dimensional vector, where $d$ is the dimension of the intermediate representation vector corresponding to one frame. As a result, the size of $M_b$ is consistent with that of the intermediate representation matrix $M_{inter}$ produced by the encoder for the whole video. After that, $M_b$ and $M_{inter}$ are added element-wise to form the intermediate representation merging video word boundary information $M_{imb}$, i.e.

$$M_{imb} = M_{inter} + M_b. \quad (4)$$

The video word boundary embedding module is illustrated in Fig. 2(b).
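A minimal PyTorch-style sketch of the module described by Eqs. (1)-(4) is given below. It is an illustration of the idea rather than the authors' released code; the tensor shapes and the two-entry embedding table are our assumptions.

```python
import torch
import torch.nn as nn

class VideoWordBoundaryEmbedding(nn.Module):
    """Add a learned embedding of the boundary-frame indicator to the encoder output."""

    def __init__(self, d_model):
        super().__init__()
        # Two rows: index 0 = non-boundary frame, index 1 = boundary frame (Eq. (2)).
        self.embedding = nn.Embedding(2, d_model)

    def forward(self, m_inter, end_times, fps):
        # m_inter: (T, d) intermediate representation; end_times: word end times in seconds.
        T = m_inter.size(0)
        v_b = torch.zeros(T, dtype=torch.long)
        for t_k in end_times:                      # Eq. (1): boundary frame indices
            idx = int(round(t_k * fps))
            if 1 <= idx <= T:
                v_b[idx - 1] = 1
        m_b = self.embedding(v_b)                  # Eq. (3): (T, d) boundary matrix
        return m_inter + m_b                       # Eq. (4): merged representation

module = VideoWordBoundaryEmbedding(d_model=512)
m_imb = module(torch.randn(75, 512), end_times=[0.6, 1.3, 2.4], fps=25.0)
print(m_imb.shape)
```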

3.2 Lip Reading Model Training Method Utilizing Video Word Boundaries by Knowledge Distillation (LiM-VWB-KD)

Overview. In the inference process in the wild, the video word boundaries are unavailable. To improve the accuracy of lip reading in this situation, we propose a two-stage, knowledge distillation based lip reading model training method that utilizes video word boundary information implicitly. In the first stage, we pre-train the TLiM-VWB model. In the second stage, we take the pre-trained TLiM-VWB model as the teacher model and a similar model without the video word boundary embedding module as the student model for knowledge distillation.


Knowledge Distillation Strategies. In the first stage of our method, we pre-train the teacher model using the CTC loss. In the second stage, i.e. the knowledge distillation stage, the pre-trained weights of the teacher model are frozen. In this process, the teacher model and the student model receive the same visual feature sequence as input, and only the teacher model receives video word boundary information. We design a feature-based knowledge distillation strategy that is related to video word boundary information more closely; for comparison, we also design a response-based knowledge distillation strategy. The two strategies are illustrated in Fig. 3.

Fig. 3. Two knowledge distillation strategies used in LiM-VWB-KD.

Response-Based Strategy. In the response-based strategy, we distill the output layer knowledge from the teacher to the student. We calculate the KL-divergence loss between the probability distributions output by the decoders of the student and the teacher, and add it to the CTC loss of the student model with a weight, i.e.

$$L_{stu1} = L_{CTC} + \lambda_1 L_{KD1}, \quad (5)$$

$$L_{KD1} = \sum_{i=1}^{T} \sum_{c \in C} p_i^t(c) \cdot \log \frac{p_i^t(c)}{p_i^s(c)}, \quad (6)$$

where $L_{CTC}$ is the CTC loss, $T$ is the number of frames in the video, and $C$ denotes the character set that we use. $p_i^t(c)$ denotes the probability that the teacher model recognizes the $i$-th frame of the video as character $c$, and $p_i^s(c)$ denotes the corresponding probability for the student model.

Feature-Based Strategy. In the feature-based strategy, we use the intermediate representation merging video word boundary information as the knowledge. For the input visual feature sequence, the teacher model calculates and outputs the intermediate representation merging video word boundary information $M_{imb}^t$. The student model outputs the intermediate representation $M_{inter}^s$ calculated by its encoder for the same visual feature sequence. We calculate the L1 loss between $M_{inter}^s$ and $M_{imb}^t$. The student model also uses its decoder to get the character probabilities and calculate the CTC loss. The two losses are added together with a weight to get the total target loss of the student model, i.e.

$$L_{stu2} = L_{CTC} + \lambda_2 L_{KD2}, \quad (7)$$

$$L_{KD2} = \frac{1}{Td} \sum_{i=1}^{T} \left\| M_{inter,i}^{s} - M_{imb,i}^{t} \right\|_1, \quad (8)$$

where $\|\cdot\|_1$ is the $\ell_1$-norm, $M_{inter,i}^{s}$ denotes the $i$-th vector in $M_{inter}^{s}$, corresponding to the feature of the $i$-th frame in the video, and $M_{imb,i}^{t}$ denotes the corresponding vector of the teacher. $d$ is the dimension of the vectors $M_{inter,i}^{s}$ and $M_{imb,i}^{t}$. The student model is optimized by minimizing $L_{stu2}$. In this strategy, the output of the student's encoder is driven close to the output of the teacher's video word boundary embedding module, so the encoder of the student model implicitly learns about the video word boundary information to help the decoder decode.
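The two distillation objectives in Eqs. (5)-(8) can be written compactly as below. This is a hedged sketch of the losses only (CTC targets, padding, and the exact reductions used by the authors are omitted); logits are assumed to have shape (T, |C|) for one video.

```python
import torch
import torch.nn.functional as F

def response_based_kd_loss(student_logits, teacher_logits):
    """Eq. (6): frame-wise KL divergence between teacher and student output distributions."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    # Sum over characters and frames of p_t * (log p_t - log p_s).
    return F.kl_div(log_p_s, p_t, reduction="sum")

def feature_based_kd_loss(m_inter_student, m_imb_teacher):
    """Eq. (8): mean absolute (L1) distance between the intermediate representations."""
    return (m_inter_student - m_imb_teacher).abs().mean()

def student_loss(ctc_loss, kd_loss, weight):
    """Eqs. (5)/(7): weighted sum of the CTC loss and a distillation loss."""
    return ctc_loss + weight * kd_loss

T, C, d = 75, 3518, 512
kd1 = response_based_kd_loss(torch.randn(T, C), torch.randn(T, C))
kd2 = feature_based_kd_loss(torch.randn(T, d), torch.randn(T, d))
print(float(kd1), float(kd2))
```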

4 Experiment

4.1 Dataset

CMLR [19] is the largest public Chinese sentence-level lip reading dataset. It includes 102072 spoken sentences by 11 speakers from the Chinese national news program “News Broadcast”. The boundary time of each word is included in the transcription. Its training, validation and test sets contain 71448, 10206 and 20418 sentences respectively.

LRS2 [1] is a large-scale English sentence-level lip reading dataset. It includes 144482 spoken sentences from BBC television. Its pre-training, training, validation and test sets contain 96318, 45839, 1082 and 1243 sentences respectively. In its pre-training set, the boundary time of each word is included.

4.2 Evaluation Metrics

We use CER and WER to measure the accuracy of sentence-level lip reading; WER is only used to evaluate English sentence-level lip reading:

$$\mathrm{CER/WER} = \frac{S + D + I}{N}, \quad (9)$$

where $S$, $D$, $I$ are the numbers of substitutions, deletions and insertions from the predicted sentence to the target sentence, and $N$ is the number of characters or words in the target sentence.


In addition, we propose to use the length difference rate (LDR) to measure the degree of length difference between the predicted sentence and the target sentence:

$$\mathrm{LDR} = \frac{|\hat{N} - N|}{N}, \quad (10)$$

where $\hat{N}$ is the number of characters or words in the predicted sentence. We calculate the global value of these metrics by using the corresponding total counts over all evaluated sentences.
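CER/WER in Eq. (9) is the Levenshtein (edit) distance normalized by the target length, and LDR in Eq. (10) only compares lengths; a small self-contained sketch is shown below (character-level, for illustration).

```python
def edit_distance(pred, target):
    """Levenshtein distance: the minimum number of substitutions, deletions and insertions."""
    dp = list(range(len(target) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, t in enumerate(target, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (p != t))
            prev = cur
    return dp[-1]

def cer(pred, target):
    """Eq. (9) at the character level: edit distance divided by target length."""
    return edit_distance(pred, target) / max(len(target), 1)

def ldr(pred, target):
    """Eq. (10): absolute length difference divided by target length."""
    return abs(len(pred) - len(target)) / max(len(target), 1)

print(cer("abcd", "abed"), ldr("abcde", "abed"))
```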

4.3 Implementation

Preprocess. We clip the lip regions of the video frames in the dataset and resize them to 112×112. Then we use the pre-trained visual front-end model in [1] to extract the visual feature sequences.

Experiment Schedule. On the CMLR dataset, we pre-train the TLiM-VWB model and carry out the LiM-VWB-KD method in two settings. In the speaker-dependent setting, we use the original dataset division as in [19]. In the speaker-independent setting, the speakers of the training set and test set must be different, so we divide the dataset in the same way as [11]: the S1 and S6 parts of CMLR serve as the test set, and the other parts serve as the training and validation sets. On the LRS2 dataset, we use the pre-training set to pre-train the TLiM-VWB model with curriculum learning and carry out the LiM-VWB-KD method; we split a pre-validation set from the pre-training set for validation. In the curriculum learning, we train the model 12 times iteratively, with the number of words per sample increasing from 1 to all of the words in the sentence.

Text Generation. In the inference process, we use greedy search and prefix beam search [6] with a language model to obtain the predicted sentence. We use the swig decoders Python package, which implements the transcription score calculation method in [2], to carry out the prefix beam search.

Training Details. The CMLR vocabulary has 3517 Chinese characters and the end mark, while the LRS2 vocabulary includes 26 letters, 10 Arabic numerals, space, apostrophe, and the end mark. We train 14290 samples in a step (virtual epoch) for CMLR and 16384 for LRS2. The batch size is 32 for CMLR and 16 for LRS2. We train the models using the Adam optimizer with default parameters, except that β2 = 0.98 in the knowledge distillation stage for CMLR. For CMLR, the learning rate starts at 10^-4 and is reduced to 70% of its value until reaching 10^-7, and then to 50% until 10^-9, every time the validation set accuracy plateaus for 10 steps in the pre-training and knowledge distillation stages. For LRS2, the learning rate waits 25 steps for an accuracy plateau and is reduced to 50% until 10^-6.
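The greedy-search decoding mentioned above amounts to taking the arg-max character per frame and then collapsing CTC repetitions and blanks; a minimal sketch (with an assumed blank index of 0 and a toy vocabulary) is given below. The prefix beam search with a language model is more involved and is not reproduced here.

```python
import torch

def ctc_greedy_decode(log_probs, id_to_char, blank_id=0):
    """Greedy CTC decoding: frame-wise arg-max, collapse repeats, then drop blanks."""
    best_ids = log_probs.argmax(dim=-1).tolist()      # (T,) best character id per frame
    decoded, prev = [], None
    for idx in best_ids:
        if idx != prev and idx != blank_id:
            decoded.append(id_to_char[idx])
        prev = idx
    return "".join(decoded)

vocab = {0: "<blank>", 1: "你", 2: "好"}
frame_log_probs = torch.log_softmax(torch.randn(20, len(vocab)), dim=-1)
print(ctc_greedy_decode(frame_log_probs, vocab))
```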

4.4 Results

Effectiveness of TLiM-VWB Model. To compare with our TLiM-VWB model, we also implement the TM-CTC model as a baseline on the CMLR dataset in both the speaker-dependent and speaker-independent settings. Table 1 shows the results of the two models on the test set of CMLR, where +LM means using prefix beam search with a language model. On CMLR, the lip reading accuracy of the TLiM-VWB model exhibits significant improvement over that of TM-CTC: in all conditions, CER is reduced by about a third. Particularly in the speaker-independent setting, there is a more pronounced decrease in CER, indicating that the TLiM-VWB model improves the generalization performance more obviously. In addition, the LDR of the TLiM-VWB model decreases by more than half compared to TM-CTC and becomes very small. This result demonstrates that the TLiM-VWB model can achieve more accurate sentence length prediction by leveraging video word boundary information.

Table 1. Comparison of results between TLiM-VWB and TM-CTC on the test set of CMLR.

Model      Spk-Dep CER  +LM      LDR     +LM     Spk-Indep CER  +LM      LDR     +LM
TM-CTC     32.31%       22.61%   4.81%   4.18%   47.01%         34.56%   6.16%   5.64%
TLiM-VWB   22.67%       14.89%   1.63%   1.27%   31.12%         19.96%   2.40%   2.00%
Variation  ↓9.64%       ↓7.72%   ↓3.18%  ↓2.91%  ↓15.89%        ↓14.60%  ↓3.86%  ↓3.64%

Table 2. Comparison of results between TLiM-VWB and TM-CTC on the pre-validation set of LRS2.

N       Model      CER      WER      LDR
2       TM-CTC     48.26%   82.52%   2.80%
        TLiM-VWB   39.84%   71.99%   0.83%
        Variation  ↓8.42%   ↓10.53%  ↓1.97%
5       TM-CTC     51.11%   85.82%   8.10%
        TLiM-VWB   39.80%   69.64%   1.29%
        Variation  ↓11.31%  ↓16.18%  ↓6.81%
9       TM-CTC     50.59%   83.72%   8.94%
        TLiM-VWB   36.45%   62.73%   1.24%
        Variation  ↓14.14%  ↓20.99%  ↓7.70%
17      TM-CTC     49.77%   82.53%   10.08%
        TLiM-VWB   34.55%   58.24%   1.08%
        Variation  ↓15.22%  ↓24.29%  ↓9.00%
29      TM-CTC     49.33%   81.69%   11.23%
        TLiM-VWB   33.58%   55.36%   0.94%
        Variation  ↓15.75%  ↓26.33%  ↓10.29%
all     TM-CTC     48.13%   79.48%   12.59%
        TLiM-VWB   32.59%   53.83%   1.07%
        Variation  ↓15.54%  ↓25.65%  ↓11.52%



For LRS2 dataset, Table 2 shows the results of TLiM-VWB and TM-CTC model on the pre-validation set during partial iterations of curriculum learning when pre-training, where N denotes the number of words in the sentences. For all evaluation metrics, the performance of TLIM-VWB model is obviously better than TM-CTC. Especially, compared with TM-CTC, the LDR of TLIMVWB is very small. Besides, as the number of words in sentences increases in the curriculum learning process, the improvement of TLIM-VWB model also becomes greater. This is because the video word boundary information plays a bigger role when the sentence is longer. The results on CMLR and LRS2 dataset proves the effectiveness of TLiM-VWB model, which shows that the accuracy of sentence-level lip reading can be effectively improved with video word boundary information. Effectiveness of LiM-VWB-KD Method. We carry out two strategies of LiM-VWB-KD method in speaker-dependent setting on CMLR dataset. Table 3 shows the results of two strategies with different knowledge distillation loss weights. We compare these with the results of TM-CTC, which are equivalent to the results of training the student model alone without the guidance of the teacher. From this table, we can see that both two strategies can improve the accuracy of lip reading with appropriate knowledge distillation loss weight. But the feature-based strategy has better performance. The reason for this is that the key factor of teacher model’s better performance is the participation of video word boundary information. Through feature-based knowledge distillation strategy, the student model can implicitly learn the key features brought by video word boundary information and have a more similar computational process to the teacher model in the decoder. In contrast, response-based strategy only makes the student learn the output of the teacher and ignore the process. Table 3. Results of two LiM-VWB-KD strategies in speaker-dependent setting on CMLR. Method

Weight

CER

LDR +LM

LiM-VWB-KD (Response-based) λ1 λ1 λ1 λ1 LiM-VWB-KD (Feature-based)

λ2 λ2 λ2 λ2

TM-CTC



+LM

= 0.002 = 0.01 = 0.05 = 0.10

29.62% 29.45% 29.71% 31.78%

20.68% 20.98% 21.75% 22.57%

3.36% 3.79% 6.42% 6.59%

3.87% 5.79% 8.06% 7.45%

= 0.5 = 1.0 = 2.0 = 5.0

29.77% 26.73% 26.89% 27.30%

21.24% 18.18% 18.06% 18.63%

3.80% 2.88% 2.89% 3.00%

3.89% 2.65% 2.63% 2.75%

32.31%

22.61%

4.81%

4.18%

Word Boundaries and Knowledge Distillation Improving Lip Reading

279

We also carry out LiM-VWB-KD method in speaker-independent setting on CMLR. The results of our method and other state-of-the-art methods on CMLR are showed in Table 4. Using the feature-based strategy, our LiM-VWBKD method can achieve the CER of 18.06% in speaker-dependent setting and 28.67% in speaker-independent setting. Compared with other methods that use Transformer architecture such as CTCH-LipNet, our method achieve the best result. It is worth noting that our method achieve the new state-of-the-art performance in speaker-independent setting on CMLR. Table 4. Comparison of results with other existing methods. Method

CMLR LRS2 Spk-Dep Spk-Indep CER CER CER

WER

WAS [20] LipCH-Net-seq [19] CSSMCM [19] LIBS [20] TM-CTC [1] TM-CTC + LM [1] LipSound2 [11] LipSound2 + LM [11] CTCH-LipNet [10]

38.93% 34.07% 32.48% 31.27% 32.31% 22.61% 25.03% 22.93% 22.01%

– – – – 47.01% 34.56% 36.56% 33.44% –

48.28% – – 45.53% 37.21% 35.57% – – –

68.19% – – 65.29% 65.00% 54.70% – – –

LiM-VWB-KD LiM-VWB-KD LiM-VWB-KD LiM-VWB-KD

29.45% 20.68% 26.73% 18.06%

43.86% 30.27% 41.44% 28.67%

36.90% 35.69% 35.73% 34.72%

62.46% 54.58% 61.52% 52.84%


On LRS2, we use LiM-VWB-KD method only on the pre-training set because of the absence of video word boundaries in the training set. We continue to train the student model without the teacher on the training set. Nevertheless, we also see the improvement of lip reading accuracy compared with baseline using this method, which is also showed in Table 4.

5 Conclusion

In this paper, we propose the TLiM-VWB model to introduce video word boundary information into sentence-level lip reading. Through experiments on the CMLR and LRS2 datasets, we show the effect of video word boundary information in improving sentence-level lip reading accuracy with our TLiM-VWB model. We also propose the LiM-VWB-KD method to utilize video word boundary information implicitly. By comparing our method with other state-of-the-art methods on CMLR, we show the advancement of our method in the Chinese sentence-level lip reading field.


References 1. Afouras, T., Chung, J.S., Senior, A., et al.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 8717–8727 (2018) 2. Amodei, D., Ananthanarayanan, S., Anubhai, R., et al.: Deep speech 2: end-toend speech recognition in English and Mandarin. In: International Conference on Machine Learning, pp. 173–182. PMLR (2016) 3. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 87–103. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54184-6 6 4. Feng, D., Yang, S., Shan, S.: An efficient software for building lip reading models without pains. In: 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–2. IEEE (2021) 5. Graves, A., Fern´ andez, S., Gomez, F., et al.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006) 6. Hannun, A.Y., Maas, A.L., Jurafsky, D., et al.: First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873 (2014) 7. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015) 8. Ma, P., Martinez, B., Petridis, S., et al.: Towards practical lipreading with distilled and efficient models. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7608–7612. IEEE (2021) 9. Ma, P., Wang, Y., Petridis, S., et al.: Training strategies for improved lip-reading. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8472–8476. IEEE (2022) 10. Ma, S., Wang, S., Lin, X.: A transformer-based model for sentence-level Chinese mandarin lipreading. In: 2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC), pp. 78–81. IEEE (2020) 11. Qu, L., Weber, C., Wermter, S.: LipSound2: self-supervised pre-training for lipto-speech reconstruction and lip reading. IEEE Trans. Neural Netw. Learn. Syst. (2022) 12. Ren, S., Du, Y., Lv, J., et al.: Learning from the master: distilling cross-modal advanced knowledge for lip reading. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13325–13333 (2021) 13. Romero, A., Ballas, N., Kahou, S.E., et al.: FitNets: hints for thin deep nets. In: International Conference on Learning Representations (2015) 14. Stafylakis, T., Khan, M.H., Tzimiropoulos, G.: Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs. Comput. Vis. Image Underst. 176, 22–32 (2018) 15. Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lipreading. In: Interspeech (2017) 16. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)


17. Yang, S., Zhang, Y., Feng, D., et al.: LRW-1000: a naturally-distributed large-scale benchmark for lip reading in the wild. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pp. 1–8. IEEE (2019) 18. Yu, L., Yazici, V.O., Liu, X., et al.: Learning metrics from teachers: compact networks for image embedding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2907–2916 (2019) 19. Zhao, Y., Xu, R., Song, M.: A cascade sequence-to-sequence model for Chinese mandarin lip reading. In: Proceedings of the ACM Multimedia Asia, pp. 1–6 (2019) 20. Zhao, Y., Xu, R., Wang, X., et al.: Hearing lips: improving lip reading by distilling speech recognizers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 6917–6924 (2020)

Denoised Temporal Relation Network for Temporal Action Segmentation

Zhichao Ma and Kan Li(B)

Beijing Institute of Technology, Beijing 100081, China
[email protected]

Abstract. Temporal relations among action segments play a crucial role in temporal action segmentation. Existing methods tend to employ graph neural networks to model these temporal relations. However, their performance is unsatisfactory and exhibits serious over-segmentation due to the noisy features they generate. To solve these issues, we present an action segmentation framework termed the denoised temporal relation network (DTRN). In DTRN, a temporal reasoning module (TRM) models inter-segment temporal relations and conducts feature denoising jointly. Specifically, the TRM applies an uncertainty-gated reasoning mechanism for noise immunity and utilizes a cross-attention-based structure to combine informative clues from the discriminative enhance module, which is trained under Selective Margin Plasticity (SMP) to ensure that its clues are informative; SMP adjusts the decision boundary adaptively by changing specific margins in real time. Our framework is demonstrated to be effective and achieves state-of-the-art accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.

Keywords: Denoised Temporal Relation Network · Temporal Action Segmentation · Selective Margin Plasticity

1 Introduction

Temporal action segmentation [8] involves temporally segmenting and recognizing human actions in untrimmed videos, which is an essential task for understanding human actions [10,24] and has wide applications [6,8]. Modeling temporal relations among action segments is beneficial for temporal action segmentation [12,24], since sequential human behavior in daily life always forms a meaningful event, and both the correlation and anti-correlation among actions provide informative clues for recognizing actions [12], which helps in reasoning about actions whose objects are blurred or occluded. For example, if we see a taking-cabbage action followed by taking a knife, the cutting-cabbage action can be correctly inferred even though the cabbage is blocked by the hand. Most existing methods [6,8,24] focus on capturing frame-level temporal dependencies or introducing complementary techniques [6,8,24] to improve the performance through temporal modeling. They [8,24] are inadequate


to capture temporal relations among action segments because they lack explicit segment-level relation modeling. A few works [12,21] explore graph convolutional networks (GCNs) to model segment-level temporal relations: they first transform the initial segmentation predicted by the previous stage into temporally ordered node representations according to the boundaries obtained from that segmentation, and then update the node representations through message passing on the graph to refine the segmentation. However, their performance is unsatisfactory because of the noisy features they generate. Specifically, they [12,21] directly perform message passing on heuristic graphs built from the initial segmentation to update node representations and do not process the noisy nodes caused by an inaccurate initial segmentation. In this paper, we propose a novel action segmentation framework, called the denoised temporal relation network (DTRN), to overcome the above issues. To be specific, a temporal reasoning module in DTRN first applies an uncertainty-gated reasoning mechanism for noise immunity, and then conducts feature denoising (FD) by utilizing a cross-attention-based structure to combine the clues from the discriminative enhance module (DEM). To provide informative clues, the DEM is trained under Selective Margin Plasticity (SMP). In summary, our contributions are four-fold:
– We propose a novel action segmentation framework, called DTRN, which overcomes incorrect action boundaries and noisy features to improve action segmentation.
– In DTRN, a temporal reasoning module (TRM) conducts feature denoising: it first controls inter-node message passing with an uncertainty-gated reasoning mechanism for noise immunity, and then utilizes a cross-attention-based structure to integrate informative clues for feature denoising.
– In DTRN, SMP adjusts the decision boundary adaptively to provide informative clues for FD by changing specific margins in real time during training.
– Our framework is demonstrated to be effective and efficient by extensive ablation studies, and it achieves new state-of-the-art performance on the 50Salads, GTEA, and Breakfast benchmarks.

2 Related Work

2.1 Action Segmentation

Given a video of human daily activities, temporal action segmentation methods predict the human action at every frame [5]. Some representative early works realize action segmentation with sequential models [3] or multi-stream architectures, but they suffer from information forgetting [8] and computational redundancy, respectively, due to their iterative processing and their integration of multiple streams. To ensure temporal smoothness, many approaches [21] apply probabilistic models over framewise features. Among these models, the transformer-based method [24] is competitive but not efficient due to the


full connection and self-attention operations on long video sequences. TCN [6,8] is more efficient than the transformer, since long-term temporal dependencies can be captured by simply stacking layers. However, TCN usually suffers from over-segmentation [19] because it ignores the temporal relations among action segments. Although GCN-based methods [12,21] are devoted to modeling segment-level temporal relations, they show unsatisfactory frame-recognition performance. Our DTRN is inspired by the above observations to explore effective structures that integrate the advantages of different models.

Fig. 1. The architecture of DTRN. The backbone module generates an initial segmentation z and uncertainties u, and u and the detected boundaries are fed to the TRM to refine z. Both Encoder-1 and Encoder-2 are sub-networks of SSTCN [8]. H1 and H2 are two different 1 × 1 temporal convolutions.

2.2

Uncertainty and Evidence Theory

Various uncertainty learning theories [14,17] have been proposed, such as subjective logic [14] and RBF distances. Epistemic uncertainty learning is our focus, where epistemic uncertainty [14] denotes the uncertainty of model parameters induced by the training and the framework, and it can be reduced with an increasing amount of training data. Typical methods include Bayesian networks, TCP [7], and ensemble methods [17]. A Bayesian network replaces deterministic parameters with stochastic distributions, but it incurs a huge computation cost. TCP [7] pursues a more efficient way that generates uncertainty by predicting confidence with an extra module. The ensemble method [17] utilizes the differences among the predictions of multiple predictors to determine the uncertainty, further reducing the computation cost. However, the above methods are not efficient because they rely on extra modules. The Dempster-Shafer Evidence Theory (DST) can be applied to model the epistemic


uncertainty and achieve Bayesian inference based on subjective logic [14] without extra modules. Our DTRN learns the uncertainty based on the DST.

3 Technical Approach

3.1 Overview

Given an untrimmed video of T frames, where each frame has size H × W in RGB color, the goal of DTRN is to predict the action category of every frame. The pipeline of DTRN is shown in Fig. 1: the video is first fed to an off-the-shelf backbone module to generate an initial segmentation, and then the TRM models inter-segment temporal relations and conducts FD jointly to refine the segmentation based on the boundaries provided by the boundary detector (BD). The backbone model usually first extracts features f with I3D [3] and then feeds f to its own temporal model. Our BD receives f as input, first outputs a predicted segmentation with SSTCN [8], and then obtains the boundaries from that segmentation. The outputs of BD and TRM are supervised by the function L_n during training:

L_n(p_{t,c}) = -\frac{1}{T}\sum_{c=1}^{C} \beta h_t \log p_{t,c} + \lambda L_m(p_{t,c}),  (1)

L_m(p) = E_{(h_t,t)}\big(\kappa_p^2 (1-\iota(\kappa_p-\tau)) + \tau^2 \iota(\kappa_p-\tau)\big), \quad \kappa_p = |\log(p_{t,c}) - \log(p_{t-1,c})|,  (2)

where L_m is the truncated mean squared error used to smooth the predictions, p_{t,c} is the c-th (ground-truth category) element of the softmax output of the t-th frame, and ι(·) outputs one if its input is true and zero otherwise. The backbone outputs uncertainties u = (u_1, · · · , u_T) for its predictions z = (z_1, · · · , z_T), where u_i is the uncertainty of z_i ∈ R^C (the predicted action likelihood of the i-th frame). To achieve this, we keep the structure unchanged, obtain z from the exponential function, and supervise it with the objective function L_d (introduced in Sect. 3.2) during training. The TRM contains two branches: the DEM provides informative clues to the other branch for feature denoising (FD), while the other branch conducts FD and uncertainty-gated reasoning on the graph (URG) to update the features, and the updated features are fed to a linear classifier to obtain the final segmentation. To ensure informative clues, the DEM is trained under SMP and L_n. By jointly optimizing the backbone, BD, DEM, and TRM, the segmentation z can be effectively refined.
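To make the training objective concrete, the following is a minimal PyTorch-style sketch of the truncated smoothing term of Eq. (2) combined with the framewise classification term. It assumes framewise logits of shape (T, C), omits the βh_t weighting of Eq. (1), and is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def truncated_smoothing_loss(log_probs: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Truncated MSE over adjacent-frame log-probability differences (Eq. 2)."""
    # kappa = |log p_{t,c} - log p_{t-1,c}| for every class and every frame t > 0
    kappa = (log_probs[1:] - log_probs[:-1].detach()).abs()
    # quadratic penalty below the threshold tau, constant tau^2 above it,
    # so genuine action boundaries are not over-penalized
    truncated = torch.where(kappa <= tau, kappa ** 2, torch.full_like(kappa, tau ** 2))
    return truncated.mean()

def segmentation_loss(logits: torch.Tensor, labels: torch.Tensor,
                      lam: float = 0.15, tau: float = 4.0) -> torch.Tensor:
    """Framewise cross-entropy plus lam * smoothing term (the beta*h_t weighting is omitted)."""
    log_probs = F.log_softmax(logits, dim=-1)   # (T, C)
    ce = F.nll_loss(log_probs, labels)          # classification term of Eq. (1)
    return ce + lam * truncated_smoothing_loss(log_probs, tau)
```

The values λ = 0.15 and τ = 4 follow the implementation details given in Sect. 4.2.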

3.2 Epistemic Uncertainty Prediction

To obtain the uncertainties of the inference, z is supervised by L_d based on empirical Bayes [13], where an adjusted Dirichlet distribution (4) that assigns belief masses to every class is used as a prior on the multinomial likelihood to obtain the marginal likelihood for frame i:


L_d = -\log\left(\int \prod_{c=1}^{C} p_{i,c}^{y_{ic}} \, \frac{1}{B(\alpha_i)} \prod_{c=1}^{C} p_{i,c}^{\alpha_{ic}-1} \, dp_i\right) + \lambda L_m(\hat{e})
    = \sum_{c=1}^{C} y_{ic}\left(\log\Big(\sum_{c=1}^{C}\alpha_{ic}\Big) - \log(\alpha_{ic})\right) + \lambda L_m(\hat{e}),  (3)

D(p \mid \alpha) = \begin{cases} \dfrac{1}{B(\alpha)}\prod_{c=1}^{C} p_c^{\alpha_c-1} & \text{for } p \in S_C, \\ 0 & \text{otherwise,} \end{cases}  (4)

where B(·) is the beta function, α is the distribution parameter, and L_m has the same form as (2) but replaces p in (2) with ê to smooth the evidence, so κ_p becomes κ_ê:

\kappa_{\hat{e}} = \left|\log\Big(\frac{e_t}{\sum_{c=1}^{C} e_{tc}}\Big) - \log\Big(\frac{e_{t-1}}{\sum_{c=1}^{C} e_{t-1,c}}\Big)\right|.  (5)

In this way, u_t and the evidence e^t of the t-th frame can be interpreted through α_c^t:

u_t = \frac{C}{\sum_{c=1}^{C}\alpha_c^t}, \quad \alpha_c^t = e_c^t + I, \quad e^t = [e_1^t, \cdots, e_C^t] \in \mathbb{R}^C, \quad I = \{1, \cdots, 1\}.  (6)
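The mapping from backbone outputs to evidence, Dirichlet parameters, and per-frame uncertainty in Eqs. (3) and (6) can be sketched as follows; this is an illustrative PyTorch snippet (the clamp is only for numerical safety), not the authors' code.

```python
import torch

def evidential_uncertainty(logits: torch.Tensor):
    """Evidence e^t = exp(logits), alpha = e + 1, and u_t = C / sum_c alpha_c^t (Eq. 6)."""
    evidence = torch.exp(logits.clamp(max=10.0))   # non-negative evidence; the clamp is an assumption
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1)                   # Dirichlet strength per frame
    uncertainty = logits.shape[-1] / strength      # u_t, close to 1 when evidence is weak
    return uncertainty, alpha

def edl_marginal_likelihood_loss(alpha: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative log marginal likelihood of Eq. (3); the smoothing term is omitted here."""
    strength = alpha.sum(dim=-1, keepdim=True)     # (T, 1)
    per_class = torch.log(strength) - torch.log(alpha)
    return per_class.gather(1, labels.unsqueeze(1)).mean()
```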

3.3 Temporal Reasoning Module

Uncertainty-Gated Reasoning on Graph. The TRM conducts URG with a matrix D̆ for noise immunity, where the graph is denoted by G(V, E), V is the node set, and e(i, j) ∈ E denotes the edge between node i and node j. To construct the graph, the input features are first encoded into features o by a TDC, and then the node representations h, with h_i the feature of node i (segment i), are given by

h = \{h_1, \cdots, h_N\}, \quad h_i = \frac{1}{t_e^i - t_s^i}\sum_{t=t_s^i}^{t_e^i} o_t, \quad o = (o_1, \cdots, o_T), \quad h_i \in \mathbb{R}^D,  (7)

where segment i starts at frame t_s^i and ends at frame t_e^i, as decided by the detected boundaries. The edges of the graph model the action relations and are built from the inter-node similarity; given a kernel size k, e(i, j) is given by

e(i, j) = \begin{cases} \dfrac{h_i \cdot h_j}{\max\{\|h_i\|_2 \cdot \|h_j\|_2,\ \epsilon\}} & \text{if } |i - j| \le k \ (j \in N_i), \\ 0 & \text{otherwise.} \end{cases}

Given the graph, we label every node with an uncertainty to adjust the message passing adaptively. If a segment contains many high-uncertainty frames, its node is labeled a noise node. Thus the uncertainties v = \{v_1, \cdots, v_N\} of the nodes, where v_i is the uncertainty of node i, are generated by

v_i = \iota\Big(\hat{v}_i - \frac{\sum_{t=1}^{T}\hat{u}_t}{T}\Big), \quad \hat{v}_i = \frac{\sum_{t=t_s^i}^{t_e^i}\bar{u}_t}{t_e^i - t_s^i + 1}, \quad \bar{u}_t = \iota\Big(\hat{u}_t - \frac{\sum_{t=1}^{T}\hat{u}_t}{T}\Big),  (8)

\hat{u}_t = \frac{u_t - \min(u)}{\max(u) - \min(u) + \epsilon}, \quad \iota(x) = \min\{1,\ \max\{0, x\} \times \exp(\max\{0, x\})\},  (9)

where min(·) and max(·) are the minimum and maximum values over all elements of a vector. D̆ is built based on v and the adjacency matrix A, following the rule of reducing the influence of nodes with high uncertainties. Thus URG is given by

h_i^{(m)} = PRelu\Big(\sum_{j \in N_i} \breve{D}(i, j)\, h_i^{(m-1)} w_{i,j}^{(m)}\Big), \quad \breve{D}(i, j) = A(i, j) - A(i, j) \times \beta * v_j, \quad A(i, j) = \frac{\exp(e(i, j))}{\sum_{k \in N_i}\exp(e(i, k))},

where h_i^{(m)} is the representation of node i in the m-th layer, w_{i,j}^{(m)} are the parameters, β ∈ [0, 1] is a weighting hyper-parameter, and exp(·) and PRelu(·) are the exponential and PReLU functions, respectively.

Feature Denoising. We assume that the features from the DEM are informative, which is ensured by SMP and the frame-level temporal relation reasoning, and thus beneficial for FD. The features updated by URG are further updated by (10), where a cross-attention (11) summarizes the relations between a node and the nearby features from the DEM, and the result is transformed into frame-level representations:

h^{(m)} = \eta + \hat{h}^{(m)}, \quad \eta = PRelu(W_\eta * h^*),  (10)

\hat{h}_t^{(m)} = h_i^{(m)} + Softmax\big(W_h h^* K_{l_1^t:l_2^t}\big) V_{l_1^t:l_2^t}, \quad \text{if } t \in \{t_s^i, \cdots, t_e^i\}.  (11)

Here K and V are the features output by two 1 × 1 convolutions fed with h^* (the features from the DEM), and X_{l_1:l_2} denotes the vectors from the l_1-th to the l_2-th frame of a matrix X, so K_{l_1^t:l_2^t}, V_{l_1^t:l_2^t} \in \mathbb{R}^{(l_2^t - l_1^t) \times D}. The indices l_1^t and l_2^t select the associated frames with the hyper-parameter ς:

l_1^t = t - l, \quad l_2^t = t + l, \quad l = \min\{(t_e^i - t_s^i)/2,\ \varsigma\},  (12)
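A small sketch of the windowed cross-attention used for FD (Eqs. (11) and (12)) is given below. The query projection and the clipping of the window to the sequence boundaries are assumptions made for illustration, not the authors' implementation.

```python
import torch

def attention_window(t: int, t_s: int, t_e: int, zeta: int = 23):
    """Per-frame window [l1^t, l2^t] of Eq. (12); zeta caps the radius for long segments."""
    half = min((t_e - t_s) // 2, zeta)
    return t - half, t + half

def windowed_cross_attention(query: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                             t: int, t_s: int, t_e: int, zeta: int = 23) -> torch.Tensor:
    """One-frame version of the cross-attention term in Eq. (11).

    query: (D,) projected node representation (the projection itself is an assumption),
    K, V:  (T, D) keys/values obtained from the DEM features h* by 1x1 convolutions.
    """
    l1, l2 = attention_window(t, t_s, t_e, zeta)
    l1, l2 = max(l1, 0), min(l2 + 1, K.shape[0])       # clip to the sequence (assumption)
    attn = torch.softmax(query @ K[l1:l2].T, dim=-1)   # weights over the local window
    return attn @ V[l1:l2]                             # DEM context summarized for frame t
```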

Selective Margin Plasticity. SMP builds a loss function L_s to ensure that the DEM provides informative clues. During training, the outputs of H1 are supervised by L_s and the weights of H1 are normalized (‖w_i‖ = 1, where w_i is the weight of the i-th convolution kernel in H1). SMP targets easily confused actions, which usually involve similar motions: we build a zero-initialized matrix \mathcal{E} that, through semantic induction, labels every action j that is easily confused with some action i, marked as \mathcal{E}_{ij} = 1. Given \mathcal{E}, a barrier vector b_{g_t} = [b_{g_t}^1, \cdots, b_{g_t}^C] (g_t is the ground-truth class of frame t) is obtained for the t-th frame to constitute L_s, which adjusts the angular margin between the classifier of the ground-truth category and the sample, i.e., the strength with which each sample is pulled towards its classifier. b_{g_t} reduces the corresponding angles by hindering the optimization, so its items corresponding to easily confused categories are assigned large values. b_{g_t}^j is changed in real time during training based on the discriminativeness of y. Thus, b_{g_t}^j is given by

b_{g_t}^j = \mathcal{E}_{(g_t, j)} \times \iota(j \ne g_t,\ j \in \Xi),  (13)


where Ξ is the set of the K categories with the highest probabilities in y_i. We define U = [1, \cdots, C] and the set Φ_{g_t} = U \setminus \{g_t\}; the loss for the t-th frame is given by

L_s(x, t) = -\frac{1}{T}\sum_{t=1}^{T}\log\frac{e^{s\cdot(\cos\theta_{g_t}-m)}}{e^{s\cdot(\cos\theta_{g_t}-m)} + J_t}, \quad J_t = \sum_{j=1,\, j\in\Phi_{g_t}}^{C} e^{s\cdot(\cos\theta_j + b_{g_t}^j)},  (14)

where cos θ_j is the cosine between w_j and the normalized feature x_t of the t-th frame (‖x_t‖ = 1), the hyper-parameter m keeps a measure of the feature discriminativeness of the current frame, and s = 1. Trained in this way, y is forced to be discriminative among easily confused actions, which makes the features of the DEM informative.
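The margin loss of Eq. (14) with the barrier of Eq. (13) can be sketched as follows. The barrier tensor is assumed to be precomputed per frame, and the margin value used here is only illustrative.

```python
import torch

def smp_loss(features: torch.Tensor, weights: torch.Tensor, labels: torch.Tensor,
             barrier: torch.Tensor, m: float = 0.5, s: float = 1.0) -> torch.Tensor:
    """Selective-margin loss of Eq. (14).

    features: (T, D) L2-normalized frame features, weights: (C, D) L2-normalized
    classifier weights, barrier: (T, C) values b^j_{g_t}, non-zero only for easily
    confused, high-probability classes (Eq. 13).
    """
    cos = features @ weights.T                                   # (T, C) cosine similarities
    target_cos = cos.gather(1, labels.unsqueeze(1)).squeeze(1)   # cos(theta_{g_t})
    numerator = torch.exp(s * (target_cos - m))
    # barriers make easily confused classes stronger competitors, forcing larger margins
    other_logits = s * (cos + barrier)
    other_logits = other_logits.scatter(1, labels.unsqueeze(1), float('-inf'))
    J = torch.exp(other_logits).sum(dim=1)                       # J_t of Eq. (14)
    return -torch.log(numerator / (numerator + J)).mean()
```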

4 Experiments

4.1 Datasets and Evaluation Metrics

The 50Salads [20] dataset contains 50 videos of 17 action categories; each video involves thousands of frames and, on average, 20 action instances of various temporal durations, is 6.4 min long on average, and records salad-preparation activities performed by 25 actors. The GTEA [9] dataset contains 28 egocentric videos of 11 action categories, involving 7 kinds of daily activities conducted by 4 actors and 20 action instances per video on average. The Breakfast [15,16] dataset contains 1712 videos of 48 different actions, recording breakfast activities in 18 different kitchens; on average, there are 6 action instances in each video. Following the default setting [8], we use 5-fold, 4-fold, and 4-fold cross-validation and report average values for 50Salads, GTEA, and Breakfast, respectively. For quantitative evaluation, we adopt the following metrics as in [18]: frame-wise accuracy (Acc), segmental edit score (Edit), and F1 score at overlap thresholds 0.1, 0.25, and 0.5 (F1@10, F1@25, F1@50). The Edit and F1 metrics consider the temporal relations between the ground truth and the inference, so they reflect the degree of over-segmentation and measure the quality of the inference.
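For reference, a sketch of the segmental F1@k computation in the spirit of [18] is shown below; the official evaluation code may differ in minor details such as tie-breaking.

```python
def labels_to_segments(labels):
    """Collapse framewise labels into (class, start, end) segments."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start, t - 1))
            start = t
    return segments

def f1_at_k(pred, gt, overlap=0.5):
    """Segmental F1@k: a predicted segment counts as a true positive if its IoU with an
    unmatched ground-truth segment of the same class reaches the overlap threshold."""
    p_segs, g_segs = labels_to_segments(pred), labels_to_segments(gt)
    used, tp = [False] * len(g_segs), 0
    for c, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gc, gs, ge) in enumerate(g_segs):
            if gc != c or used[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs) + 1)
            union = max(pe, ge) - min(ps, gs) + 1
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= overlap:
            tp, used[best_j] = tp + 1, True
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    precision = tp / (tp + fp + 1e-10)
    recall = tp / (tp + fn + 1e-10)
    return 2 * precision * recall / (precision + recall + 1e-10)
```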

4.2 Implementation Details

Consistent with [8,22], we sample the videos of all datasets at a temporal resolution of 15 frames per second (fps). We set λ and τ to 0.15 and 4, respectively (they control the importance of L_m), and we set β to 0.1 and ς to 23. All models are trained with the Adam optimizer and implemented in PyTorch. The learning rate of all models is 0.005.

4.3 Comparison with State-of-the-Art Methods

To evaluate the overall performance of our framework, DTRN is compared with recent state-of-the-art methods in Table 1 and Table 2, where


Fig. 2. The visualization of action segmentation for different methods with color-coding (Different colors represent different types of actions).

DTGRM [21] and GTRM [12] are graph-based models for modeling temporal relations among segments; ASFormer [24] is a competitive transformer-based method but is inefficient due to its huge computation and memory cost; and the other methods are built on MSTCN [8], where various complementary techniques with high computation costs, such as temporal domain adaptation and structure search, are used to improve performance, so they are not efficient enough. "MSTCN+Ours" indicates that MSTCN is the backbone model in our DTRN, which makes the comparison of our structure with the above models more illustrative. In Table 1 and Table 2, we can see that our method achieves the best performance on all metrics compared with recent state-of-the-art methods and improves significantly over the baseline. For example, on the 50Salads dataset, MSTCN+Ours outperforms the baseline by 5.3% in frame-wise accuracy. We also visualize some segmentation results in Fig. 2, which illustrate the effectiveness of our method. Moreover, MSTCN+Ours only requires 1.87 M model parameters, while ASRF, SSTDA, and BCN require 2.31 M, 4.60 M, and 5.72 M parameters, respectively; the parameter count of BCN is roughly three times that of our temporal model, yet our DTRN still obtains better performance. Thus, MSTCN+Ours maintains both efficiency and effectiveness.


Table 1. Performance comparisons (in percentage units (%)) on the 50Salads and GTEA.

Method | 50Salads: F1@{10,25,50} | Edit | Acc | GTEA: F1@{10,25,50} | Edit | Acc
MSTCN+DTGRM [21] | 79.1 / 75.9 / 66.1 | 72.0 | 80.0 | 87.8 / 86.6 / 72.9 | 83.0 | 77.6
MSTCN+GTRM [12] | 75.4 / 72.8 / 63.9 | 67.5 | 82.6 | – / – / – | – | –
ASFormer [24] | 85.1 / 83.4 / 76.0 | 79.6 | 85.6 | 90.1 / 88.8 / 79.2 | 84.6 | 79.7
MSTCN [8] | 76.3 / 74.0 / 64.5 | 67.9 | 80.7 | 87.5 / 85.4 / 74.6 | 81.4 | 79.2
SSTDA [5] | 83.0 / 81.5 / 73.8 | 75.8 | 83.2 | 90.0 / 89.1 / 78.0 | 86.2 | 79.8
MSTCN+HASR [1] | 83.4 / 81.8 / 71.9 | 77.4 | 81.7 | 89.2 / 87.3 / 73.2 | 85.4 | 77.4
MSTCN++ +UARL [4] | 80.8 / 78.7 / 69.5 | 74.6 | 82.7 | 90.1 / 87.8 / 76.5 | 84.9 | 78.8
BUIMS-TCN [6] | 81.1 / 79.8 / 72.4 | 74.0 | 83.9 | 89.4 / 86.6 / 76.6 | 85.0 | 80.6
AU-TCN [2] | 74.8 / 72.7 / 66.1 | – | 86.3 | 84.9 / 83.1 / 74.5 | – | 83.5
MSTCN+RF [11] | 80.3 / 78.0 / 69.8 | 73.4 | 82.2 | 89.9 / 87.3 / 75.8 | 84.6 | 78.5
BCN+TACHA [23] | 84.0 / 82.7 / 74.2 | 77.4 | 85.2 | 90.6 / 89.5 / 78.2 | 86.0 | 81.2
MSTCN+Ours | 85.1 / 83.5 / 77.1 | 79.6 | 86.0 | 90.6 / 88.8 / 80.1 | 87.1 | 83.0
ASFormer+Ours | 86.0 / 85.2 / 81.9 | 81.4 | 87.1 | 91.8 / 90.0 / 82.5 | 87.3 | 83.5

Table 2. Performance comparison on the Breakfast dataset.

Method | F1@{10,25,50} (%) | Edit (%) | Acc (%)
DTGRM | 68.7 / 61.9 / 46.6 | 68.9 | 68.3
GTRM | 57.5 / 54.0 / 43.3 | 58.7 | 65.0
MSTCN | 52.6 / 48.1 / 37.9 | 61.7 | 66.3
BCN | 68.7 / 65.5 / 55.0 | 66.2 | 70.4
SSTDA | 75.0 / 69.1 / 55.2 | 73.7 | 70.2
BUIMS-TCN | 71.0 / 65.2 / 50.6 | 70.2 | 68.7
MSTCN++ +UARL | 65.2 / 59.4 / 47.4 | 66.2 | 67.8
MSTCN+RF | 74.9 / 69.0 / 55.2 | 73.3 | 70.7
BCN+TACHA | 70.8 / 67.7 / 57.9 | 68.0 | 72.2
MSTCN+Ours | 71.5 / 68.7 / 60.0 | 73.1 | 72.4

4.4 Ablation Study

Analysis of SMP. SMP provides auxiliary training that helps the DEM output informative features. We perform an ablation study on SMP by adding it to and removing it from the DEM; the results are shown in Table 3. They show that SMP is effective in improving performance, and the Act value in Table 3 illustrates the role of SMP in generating features that are discriminative for the targeted classes (in this paper, we select the easily confused actions as the target). Together, these results illustrate that SMP is effective in ensuring that the DEM generates informative features.

Analysis of URG. We conduct an ablation study of URG in the TRM by adding and removing URG from the TRM; the testing results are shown in Table 4, where


Table 3. Performance comparison (in percentage units (%)) on the 50Salads and Breakfast datasets.

SMP | 50Salads: F1@{10,25,50} | Edit | Acc | Act | Breakfast: F1@{10,25,50} | Edit | Acc
w/o | 85.0 / 83.3 / 75.3 | 79.5 | 85.0 | 4512 | 68.4 / 58.1 / 48.1 | 72.0 | 71.9
w/ | 85.1 / 83.5 / 77.1 | 79.6 | 86.0 | 5291 | 71.5 / 68.7 / 60.0 | 73.1 | 72.4

the results in the first row are from the method with URG removed, i.e., D̆(i, j) is replaced with A(i, j) when updating the features. Table 4 shows that the method with URG is effective in improving the quality of the segmentation, especially the accuracy. URG helps to correct some predictions, which also illustrates its role in reducing the impact of data noise.

Table 4. Performance comparison (in percentage units) on the 50Salads and GTEA datasets.

URG | 50Salads: F1@{10,25,50} | Edit | Acc | GTEA: F1@{10,25,50} | Edit | Acc
w/o | 85.0 / 83.5 / 75.2 | 79.5 | 85.0 | 90.4 / 89.1 / 79.2 | 86.0 | 82.8
w/ | 85.1 / 83.5 / 77.1 | 79.6 | 86.0 | 90.6 / 88.8 / 80.1 | 87.1 | 83.0

Table 5. Testing the effect of TRM, the results are in percentage units (%), and the backbone we adopted is MSTCN.

Method | 50Salads: F1@{10,25,50} | Edit | Acc | GTEA: F1@{10,25,50} | Edit | Acc
TGraph | 83.3 / 81.9 / 75.5 | 78.1 | 84.2 | 88.9 / 85.7 / 76.1 | 83.5 | 80.2
AGraph | 84.0 / 82.3 / 75.5 | 79.0 | 85.0 | 90.3 / 87.9 / 78.5 | 86.3 | 81.9
TRM | 85.1 / 83.5 / 77.1 | 79.6 | 86.0 | 90.6 / 88.8 / 80.1 | 87.1 | 83.0

Analysis of the Cross-Attention-Based Structure for FD. To test the function of the cross-attention-based structure, we construct several variants of the TRM for comparison, as shown in Table 5. The first one is TGraph, in which the cross-attention-based structure in the TRM is removed, i.e., operations (11) and (10) are removed. The second one is AGraph, in which operation (11) is removed and operation (10) is kept. Table 5 indicates the positive effects of (11) and (10), respectively: it shows that (10) helps to reduce over-segmentation, since


it improves the Edit score substantially, and (11) helps to correct many wrong predictions. For example, the TRM outperforms AGraph by 1.6% on F1@50 and 1.1% on frame-wise accuracy, and its performance greatly exceeds that of the backbone. This also validates the advantage of the TRM in understanding human actions.

5 Conclusions

In this work, we propose a novel action segmentation framework, called DTRN, to overcome the impact of noisy features. The TRM in DTRN utilizes URG for noise immunity and a cross-attention-based structure to realize feature denoising. The DEM in the TRM provides informative clues for FD, which is ensured by the proposed SMP and the frame-level temporal relation reasoning. SMP adaptively adjusts the decision boundary by changing specific margins in real time during training to provide informative clues for FD. Extensive ablation studies illustrate our superiority over existing methods in both efficiency and effectiveness, and our DTRN outperforms the state-of-the-art methods on the challenging 50Salads, GTEA, and Breakfast datasets.

Acknowledgments. This work is supported by the Beijing Natural Science Foundation (No. 4222037, L181010) and the National Natural Science Foundation of China (No. 61972035).

References 1. Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16282–16290. IEEE, Montreal, QC, Canada (2021). https://doi.org/10.1109/ iccv48922.2021.01599 2. Cao, J., Xu, R., Lin, X., Qin, F., Peng, Y., Shao, Y.: Adaptive receptive field u-shaped temporal convolutional network for vulgar action segmentation. Neural Comput. Appl. 35(13), 9593–9606 (2023). https://doi.org/10.1007/s00521-02208190-5 3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733. IEEE Computer Society, Honolulu, HI, USA (2017). https://doi.org/10.1109/cvpr.2017.502 4. Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 820–826. ijcai.org, Vienna, Austria (2022). https://doi.org/10.24963/ijcai.2022/115 5. Chen, M.H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9451–9460. Computer Vision Foundation/IEEE, Seattle, WA, USA (2020). https://doi.org/10. 1109/cvpr42600.2020.00947 6. Chen, W., et al.: Bottom-up improved multistage temporal convolutional network for action segmentation. Appl. Intell. 52(12), 14053–14069 (2022). https://doi.org/ 10.1007/s10489-022-03382-x


7. Corbi`ere, C., Thome, N., Bar-Hen, A., Cord, M., P´erez, P.: Addressing failure prediction by learning model confidence. In: Advances in Neural Information Processing Systems, pp. 2898–2909. Vancouver, BC, Canada (2019) 8. Farha, Y.A., Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3570–3579. Computer Vision Foundation/IEEE, Long Beach, CA, USA (2019). https://doi.org/10.1109/cvpr.2019.00369 9. Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011. pp. 3281–3288. IEEE Computer Society, Colorado Springs, CO, USA (2011). DOI: 10.1109/cvpr.2011.5995444 10. Gao, S.H., Han, Q., Li, Z.Y., Peng, P., Wang, L., Cheng, M.M.: Global2local: efficient structure search for video action segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16800– 16809. Computer Vision Foundation/IEEE, virtual event (2021). https://doi.org/ 10.1109/cvpr46437.2021.01653 11. Gao, S., Li, Z.Y., Han, Q., Cheng, M.M., Wang, L.: RF-Next: efficient receptive field search for convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 45(3), 2984–3002 (2023). https://doi.org/10.1109/TPAMI.2022.3183829 12. Huang, Y., Sugano, Y., Sato, Y.: Improving action segmentation via graph-based temporal reasoning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14021–14031. Computer Vision Foundation/IEEE, WA, USA, June 2020. https://doi.org/10.1109/cvpr42600.2020.01404 13. Jamil, T., Braak, C.: Selection properties of type ii maximum likelihood (empirical Bayes) in linear models with individual variance components for predictors. Pattern Recognit. Lett. 33(9), 1205–1212 (2012) 14. Josang, A., Hankin, R.: Interpretation and fusion of hyper opinions in subjective logic. In: 15th International Conference on Information Fusion (FUSION), pp. 1225–1232. IEEE, Singapore (2012) 15. Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787. IEEE Computer Society, Columbus, OH, USA (2014). https://doi.org/10.1109/CVPR.2014.105 16. Kuehne, H., Gall, J., Serre, T.: An end-to-end generative framework for video segmentation and recognition. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. IEEE Computer Society, Lake Placid, NY, USA (2016). https://doi.org/10.1109/WACV.2016.7477701 17. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems, pp. 6402–6413. Long Beach, CA, USA (2017) 18. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1003–1012. IEEE Computer Society, Honolulu, HI, USA (2017). https://doi.org/10.1109/cvpr.2017.113 19. Li, S., Farha, Y.A., Liu, Y., Cheng, M.M., Gall, J.: MS-TCN++: multi-stage temporal convolutional network for action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2020). https://doi.org/10.1109/tpami.2020.3021756 20. Stein, S., Mckenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: The 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, vol. 33, pp. 3281–3288. ACM, Zurich, Switzerland (2013)


21. Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with selfsupervision for action segmentation. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, pp. 2729–2737. AAAI Press, Virtual Event (2021) 22. Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 34–51. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-58595-2 3 23. Yang, D., Cao, Z., Mao, L., Zhang, R.: A temporal and channel-combined attention block for action segmentation. Appl. Intell. 53(3), 2738–2750 (2023). https://doi. org/10.1007/s10489-022-03569-2 24. Yi, F., Wen, H., Jiang, T.: Asformer: transformer for action segmentation. In: The British Machine Vision Conference, p. 236. BMVA Press, Online (2021)

Vision Applications and Systems

3D Lightweight Spatial-Spectral Attention Network for Hyperspectral Image Classification

Ziyou Zheng1, Shuzhen Zhang1,2(B), Hailong Song1, and Qi Yan1

1 College of Communication and Electronic Engineering, Jishou University, People's South Road, Jishou 416000, Hunan, China
shuzhen [email protected]
2 Key Laboratory of Visual Perception and Artificial Intelligence, Hunan University, Lushan Road, Changsha 410000, Hunan, China

Abstract. Compared with the convolutional neural network (CNN), a 3D lightweight network (3D-LWNet) can successfully perform hyperspectral image (HSI) classification using fewer network parameters. However, the results of these methods cannot reflect the effect of each of the different features used in the classification process. To address this problem, we propose a 3D lightweight spatial-spectral attention network (3D-LSSAN), which adopts a 3D large kernel attention (3D-LKA) mechanism to better determine the effect of each feature in HSI classification. 3D-LKA directly operates on the spatial-spectral features of the HSI, which reduces the number of trainable parameters and alleviates network overfitting. Moreover, 3D-LKA decomposes the convolution kernel, which allows it to capture both local and long-range information in the HSI with a small amount of computation. Specifically, the HSI is first sent through a convolutional network layer to generate shallow spatial-spectral features. Second, these shallow features are input into six spatial-spectral attention (SSA) units based on 3D-LKA to emphasize the importance of different parts of the features. Finally, the output features of the SSA units are fed into a fully connected layer to obtain the classification result. Experimental results on two publicly available datasets demonstrate that the proposed 3D-LSSAN achieves better classification performance than the other techniques.

Keywords: Convolutional neural network · Hyperspectral image classification · Attention mechanism · Lightweight network · Large kernel attention

1

Introduction

Hyperspectral images (HSI) include hundreds of contiguous narrow spectral bands that reveal small spectral differences among ground objects, making them suitable for remote sensing applications, including agriculture [1], mineralogy [2], and astronomy [3]. Nonetheless, high-dimensional data are susceptible to


the Hughes phenomenon [4], in which classification accuracy initially increases and subsequently decreases as the number of bands grows [5]. Therefore, an efficient feature extraction method is necessary to process high-dimensional HSI. Early studies mostly relied on hand-crafted features, but these features required extensive a priori knowledge, so a more powerful feature extraction method is needed. Deep learning (DL) has gained significant attention in vision tasks. End-to-end training is a distinct feature of DL models, which provides the capability to learn directly from the input data without requiring any manual feature engineering or other human intervention for the intermediate layers. In recent years, DL methods have been applied to HSI classification with remarkable success. CNN architectures such as 2D-CNN [6], 3D-CNN [7], and ResNet [8] have been utilized for HSI classification. The classification performance of a CNN is directly related to its model size. Due to the small number of labeled HSI samples, using a CNN to classify HSI often leads to overfitting. Zhang et al. [9] proposed a 3D lightweight network (3D-LWNet) to mitigate this issue by utilizing depth-separable convolution to extend the network depth without increasing the parameter count. The 3D-LWNet demonstrates improved classification performance with fewer parameters. Moreover, the CNN lacks feature-significance recognition, as the kernel convolves within a receptive field without prioritizing vital information. To address this issue, attention mechanisms have been introduced into CNNs. The attention mechanism is a selection process that can automatically identify crucial features in a set of input data. Mei et al. [10] integrated both RNN and CNN with attention mechanisms in an attempt to determine both spectral and spatial relationships. Ma et al. [11] employed a structure similar to the one proposed by Mei et al., where the squeeze-and-excitation network (SENet) [12] is utilized as a substitute for the RNN to identify spectral relationships; this results in lower overall computational complexity and better classification performance. Furthermore, these attention mechanisms all have a distinct shortcoming: they operate separately in the spectral and spatial dimensions, which increases the complexity of the model. Self-attention [13,14] has proven to be a notable method for learning the relationships between different parts. For example, Vaswani et al. [15] utilized several self-attention mechanisms to create the influential Transformer model, which does not require traditional network structures such as RNNs and CNNs, relying solely on attention mechanisms for machine translation. Self-attention is limited to one-dimensional sequences and cannot directly incorporate two-dimensional images. Hong et al. proposed SpectralFormer [16], which applies the transformer to the HSI classification task; experiments proved that using self-attention for HSI can achieve better classification performance. Furthermore, using larger convolution kernels (e.g., 15 × 15, 21 × 21) [17,18] can expand the receptive field and, to some extent, capture long-range feature relationships similar to self-attention methods. Zhong et al. proposed a lightweight criss-cross large kernel convolutional neural network [19], where the lightweight module of the network consists

Fig. 1. Structure of the 3D-LSSAN. The six SSA units are divided into four groups with 32, 64, 128, and 256 channels. R represents the ReLU function, B represents the batch normalization layer, and ⊕ represents the addition operation.

of two 1D convolutions with self-attention in orthogonal directions, and long-range context features are aggregated by using large kernel convolutions in the 1D convolutional layers. According to the above analysis, this study employs large kernel convolution. The main contributions of this paper are as follows:
– We propose a 3D lightweight spatial-spectral attention network (3D-LSSAN) and introduce a 3D large kernel attention (3D-LKA) into the network to enhance the details of the extracted HSI features.
– The 3D-LKA used in the network structure is a novel spatial-spectral attention mechanism that can process the spatial and spectral dimensions simultaneously.
The rest of this paper is arranged as follows. Section 2 gives the details of the proposed method. Section 3 describes the datasets used in the study and the comparison experiments with other HSI classification techniques. Section 4 summarises the conclusions of the study and gives an outlook for the future.

2

Proposed Method

This section presents the 3D-LSSAN method with the 3D-LKA attention mechanism, which aims to improve HSI classification accuracy using a CNN. A detailed explanation of the method is provided below.

2.1 3D-LSSAN Structure

As previously noted, a CNN cannot assess the different effects of the different extracted features on HSI classification. Thus, we propose a 3D-LSSAN for HSI classification in this study. Figure 1 illustrates the network framework.


As shown in Fig. 1, the original HSI is first divided into fixed-size cubes, which are fed into a single convolutional layer for feature extraction. Batch normalization and a ReLU activation are applied to the extracted features. The features then pass through a 3D max-pooling layer to reduce their size and the computational load. The 3D max-pooling outputs are passed through six SSA units, which are divided into four groups, as illustrated in Fig. 1; the numbers of channels in the four groups are 32, 64, 128, and 256, respectively. In the last three groups, a convolution layer with a stride of 2 is used in lieu of a pooling layer to compress the feature dimensionality. For a more detailed description of the SSA unit, please refer to Sect. 2.2. The 3D-LKA in the SSA units determines the weight distribution by capturing spatial-spectral local and long-range features; a detailed introduction to 3D-LKA is provided in Sect. 2.3. Finally, the features are compressed to a fixed size by an adaptive average pooling layer and are subsequently sent to the fully connected layer for classification.

2.2 Spatial-Spectral Attention Unit

To increase the model depth without gradient-vanishing issues, ResNet uses shortcut connections and bottleneck structures. However, directly applying ResNet to HSI still faces gradient vanishing due to the limited number of labeled samples. Grouped convolution reduces the parameters by decomposing the kernel while maintaining scale and adaptability; it is commonly used in depth-separable convolution, where the numbers of groups, input channels, and output channels are kept equal. Depthwise separable convolution is also used in the SSA units in this study. Figure 1 displays the structure of the SSA unit. This study employs two SSA structures. Each SSA structure comprises two point-wise convolution (PW-conv) layers, a depth-wise convolution (DW-conv) layer (3 × 3 × 3 kernel size), and a 3D-LKA. The key difference between the two SSA structures is that the former lacks extra operations in the shortcut path, while the latter includes downsampling: an average pooling layer and a PW-conv layer reduce the feature map dimensions before the shortcut addition. The second SSA has a stride of 2, which would cause a feature-size mismatch without downsampling. The initial PW-conv increases the channels by a factor of t; following the method proposed in reference [9], t is set to 4. For D input channels, the bottleneck holds 32D² parameters, while the SSA unit has 5D² + 27D parameters. Given an input of size S × S × L, the bottleneck requires 32 × S² × L × D² floating point operations (FLOPs), while the SSA unit needs S² × L (5D² + 27D) FLOPs.
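The SSA unit described above can be sketched as the following PyTorch module. The exact placement of batch normalization, the downsampling kernel sizes, and the channel bookkeeping are assumptions, and the attention block is left pluggable so that the 3D-LKA of Sect. 2.3 can be inserted.

```python
import torch
import torch.nn as nn

class SSAUnit(nn.Module):
    """Two point-wise (1x1x1) convolutions around a 3x3x3 depth-wise convolution
    and an attention block, with an identity or downsampling shortcut."""

    def __init__(self, channels: int, expansion: int = 4, stride: int = 1, attention=None):
        super().__init__()
        hidden = channels * expansion                      # t = 4 as stated in the text
        self.pw1 = nn.Sequential(nn.Conv3d(channels, hidden, 1, bias=False),
                                 nn.BatchNorm3d(hidden), nn.ReLU(inplace=True))
        self.dw = nn.Sequential(nn.Conv3d(hidden, hidden, 3, stride=stride, padding=1,
                                          groups=hidden, bias=False),
                                nn.BatchNorm3d(hidden), nn.ReLU(inplace=True))
        self.attn = attention if attention is not None else nn.Identity()  # plug in 3D-LKA here
        self.pw2 = nn.Sequential(nn.Conv3d(hidden, channels, 1, bias=False),
                                 nn.BatchNorm3d(channels))
        self.shortcut = nn.Identity() if stride == 1 else nn.Sequential(
            nn.AvgPool3d(3, stride=stride, padding=1),     # kernel/padding chosen so shapes match
            nn.Conv3d(channels, channels, 1, bias=False))

    def forward(self, x):
        out = self.pw2(self.attn(self.dw(self.pw1(x))))
        return torch.relu(out + self.shortcut(x))
```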

2.3 3D Large Kernel Attention Model

Self-attention can capture distant dependencies and enhance relationships between inputs. For images, it refines details via local and global pixel relations and aids contextual understanding.

Fig. 2. Structure of the 3D Large Kernel Attention (3D-LKA) model. ⊗ represents the multiplication operation.

Large kernel convolution differs from standard 3 × 3 or 5 × 5 kernels: a larger size widens the receptive field, capturing complex features over a broader range. For images, using large kernels enriches the feature information, in line with attention mechanisms. Large Kernel Attention (LKA) [20] merges the benefits of self-attention and large kernel convolution: it uses large kernels for distant relationships and integrates self-attention for detailed image features. LKA's ability to capture both local and global relations makes it potent for image analysis. The LKA algorithm was originally devised for natural images, however, and is not directly applicable to HSI due to the dissimilar dimensions of the two data types: natural images typically comprise only three bands (R, G, and B), whereas HSI data may consist of hundreds of bands. Consequently, we reconstruct a new model that extends LKA into 3D-LKA. We deploy 3D convolutional kernels to enable spatial-spectral feature processing in three dimensions. Specifically, we decompose a kernel of size K × K × K into a ⌈K/d⌉ × ⌈K/d⌉ × ⌈K/d⌉ depth-wise dilated convolution (DW-D-conv) with dilation rate d, a (2d − 1) × (2d − 1) × (2d − 1) DW-conv, and a 1 × 1 × 1 convolution. The DW-D-conv captures long-range relationships in the HSI, similar to prior self-attention methods, while the DW-conv captures local HSI information, and the 1 × 1 × 1 convolution (PW-conv) manages the relationships between the different channels. The flowchart of the 3D-LKA model is shown in Fig. 2.
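Following this decomposition, a minimal PyTorch sketch of the 3D-LKA block is given below (with the paper's K = 21 and d = 3, i.e., a 5 × 5 × 5 DW-conv, a 7 × 7 × 7 DW-D-conv with dilation 3, and a 1 × 1 × 1 PW-conv). It adapts the LKA of Guo et al. [20] to 3D and is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LKA3D(nn.Module):
    """3D Large Kernel Attention: a local DW-conv, a long-range dilated DW-conv, and a
    point-wise conv produce an attention map that multiplicatively gates the input."""

    def __init__(self, dim: int, K: int = 21, d: int = 3):
        super().__init__()
        k_dw = 2 * d - 1                       # 5: local depth-wise kernel
        k_dwd = -(-K // d)                     # ceil(K / d) = 7: dilated depth-wise kernel
        self.dw = nn.Conv3d(dim, dim, k_dw, padding=k_dw // 2, groups=dim)
        self.dw_d = nn.Conv3d(dim, dim, k_dwd, padding=((k_dwd - 1) * d) // 2,
                              dilation=d, groups=dim)
        self.pw = nn.Conv3d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_d(self.dw(x)))  # spatial-spectral attention map
        return attn * x                        # element-wise gating (the multiplication in Fig. 2)

# example: the block preserves the feature shape, e.g. for a 32-channel feature cube
# feat = torch.randn(1, 32, 25, 27, 27); assert LKA3D(32)(feat).shape == feat.shape
```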

3 Experimental Results

3.1 Description of Datasets

In this study, we perform experiments on two datasets, Indian Pines (IP) and Pavia University (PU). We provide a concise overview of each dataset below. The IP dataset was collected by an AVIRIS sensor in 1992. It comprises 145 × 145 pixels with a spatial resolution of 20 m. The dataset contains 224 bands within the 0.4–2.5 × 10⁻⁶ m wavelength range. It contains 16


Table 1. Sample Distribution of IP.

No | Class name | Training samples | Validation samples | Test samples
1 | Alfalfa | 24 | 6 | 16
2 | Corn notill | 120 | 30 | 1278
3 | Corn mintill | 120 | 30 | 680
4 | Corn | 80 | 20 | 137
5 | Grass pasture | 120 | 30 | 333
6 | Grass trees | 120 | 30 | 580
7 | Pasture mowed | 16 | 4 | 8
8 | Hay windrowed | 120 | 30 | 328
9 | Oats | 12 | 3 | 5
10 | Soybean notill | 120 | 30 | 822
11 | Soybean mintill | 120 | 30 | 2305
12 | Soybean clean | 120 | 30 | 443
13 | Wheat | 120 | 30 | 55
14 | Woods | 120 | 30 | 1115
15 | Buildings Grass Trees | 40 | 10 | 336
16 | Stone Steel Towers | 40 | 10 | 43

Total | | 1412 | 353 | 8484

categories. The number of bands was reduced to 200 by removing the bands covering the absorption area. The 16 main landcover types are listed in the Table 1, along with the number of samples used for training, validation and testing in the classification task. The PU dataset was acquired using a ROSIS sensor during a flight over Pavia University in northern Italy in 2001. The number of bands in the dataset is 103, with a 0.43–0.86 ×10−6 meters wavelength range. The spatial resolution of the dataset is 1.3 m per pixel and it contains 610 × 340 pixels. The image contains nine landcover categories. The specific category names and the division of the training, validation and test sets are shown in Table 2. 3.2

Discussion of Hyperparameters

In terms of hyperparameter settings, we compare experimental results with different patch size, kernel size K and dilation rate d. The patch size used for comparison is not less than 15, because a patch size under 15 would result in the disappearance of features during the network forward propagation process. We choose values of 15, 17, 19, 21, 23, 25, 27 and 30 for this study. As show in Fig. 3(a), our model performs best with a patch size of 27. Therefore, in subsequent experiments, the patch size is set to 27. As describe previously by Guo et al., values of (7, 2), (14, 3), (21, 3), and (28, 4) are chosen for K, and d. The results of this analysis are show in Fig. 3(b). The results are best when K and d are 21 and 3. If the (K, d) value are too small,

3D LSSAN for Hyperspectral Image Classification

303

Table 2. Sample Distribution of PU. No Class name

Number of training samples

Number of validation samples

Number of test samples

1

Asphalt

160

40

6431

2

Meadows

160

40

18449

3

Gravel

160

40

1899

4

Trees

160

40

2899

5

Metal sheets

160

40

1145

6

Bare Soil

160

40

4829

7

Bitumen

160

40

1130

8

Bricks

160

40

3482

9

Shadows

160

40

747

1440

360

40976

total

it prevented 3D-LKA from capturing long-range feature relations. If the (K, d) value are too large, the 3D-LKA capture features that did not belong to this category, thus reducing classification accuracy. According to the above results, we choose (21, 3) for (K,d ). 3.3

Classification Results

To demonstrate the effectiveness of 3D-LSSAN, we compare its performance against five models: spectralNet [21], 3D-hyper generative adversarial minority oversampling (3D-hyperGAMO) [22], double branch multi-attention mechanism network (DBMA) [11], double branch dual-attention mechanism network (DBDA) [23], and the pyramidal residual network (pResNet) [24]. The classification performance is evaluated using four indicators, which include overall accuracy (OA), average accuracy (AA), category accuracy(CA), and the kappa coefficient (Kappa). Tables 3 and 4 present the CA, AA, OA, and Kappa of each of the test methods for both the IP and PU datasets. Additionally, Figs. 4 and 5 provide the corresponding classification maps.

Fig. 3. OA values corresponding to different patch size and (K, d).

304

Z. Zheng et al. Table 3. Classification Result of Indian Pines(IP) Dataset.

Class

SpectralNet 3D-HyperGAMO DBDA DBMA pResNet 3D-LWNet Proposed

Alfalfa

95.12

100.0

98.70

98.03

100.0

100.0

100.0

Corn notill

92.68

100.0

96.14

94.80

99.61

95.33

96.67 100.0

Corn mintill

97.05

97.50

98.06

95.07

99.33

99.33

Corn

96.24

100.0

96.52

97.55

98.59

100.0

100.0

Grass pasture

95.17

100.0

97.12

95.55

98.16

100.0

100.0

Grass trees

96.19

99.02

98.27

98.63

98.63

100.0

100.0

Pasture mowed

100.0

95.18

84.02

80.76

92.00

100.0

100.0

Hay windrowed

100.0

100.0

99.94

98.31

100.0

100.0

100.0

Oats

66.67

95.42

84.77

97.46

88.89

100.0

100.0

Soybean notill

98.97

96.86

94.21

89.07

96.57

94.67

98.67

Soybean mintill

98.50

99.36

97.27

96.65

99.23

92.67

98.00

Soybean clean

94.38

96.36

97.19

96.15

95.69

100.0

98.67

Wheat

100.0

97.85

98.59

98.45

100.0

100.0

100.0

Woods

99.12

99.54

99.30

98.65

98.42

100.0

100.0

Buildings Grass Trees 93.08

95.11

97.92

91.58

93.37

100.0

100.0

Stone Steel Towers

90.47

98.01

93.13

90.55

98.81

100.0

100.0

OA

96.83

97.63

97.56

96.33

98.42

98.34

99.26

AA

94.60

98.14

95.79

95.27

97.33

98.88

99.50

Kappa(×100)

96.39

97.30

97.22

95.82

98.20

98.19

99.20

Findings reveal that spectralNet, a 2D method, demonstrated poor classification performance, with an OA of 96.83% and 97.15% in the two datasets analyses. Conversely, pResNet, a 3D structure, achieve better classification performance. It attain an OA of 98.42% and 98.30%, in two dataset analyses as the 3D structure can directly extract spatial-spectral features. Moreover, 3DhyperGAMO produce high-quality data for training through generative means, resulting in better classification performance. Remarkably, compare to SpectralNet, both DBMA and DBDA exhibit enhance classification performance in the PU dataset due to the incorporation of an attention mechanism, achieving 0.97% and 1.05% improvement, respectively. The 3D-LSSAN method outperform all Table 4. Classification Result of Pavia University (PU) Dataset. Class

SpecialNet 3D-HyperGAMO DBDA DBMA pResNet 3D-LWNet Proposed

Asphalt

97.59

99.88

98.91

97.99

97.35

96.88

100.0

Meadows

99.03

96.5

99.84

98.94

99.82

98.75

98.75

Gravel

88.16

100.0

97.33

97.32

86.83

98.13

100.0

Trees

97.34

90.53

98.08

98.13

97.30

98.13

98.75

Metal sheets

94.10

96.43

99.83

99.50

100.0

100.0

100.0

Bare Soil

97.58

95.01

98.65

99.56

100.0

100.0

100.0

Bitumen

86.12

98.47

95.58

99.22

93.48

100.0

100.0

Bricks

96.89

98.42

91.69

92.75

98.03

95.00

99.38

Shadows

95.32

99.91

97.69

96.98

100.0

100.0

99.38

OA

97.15

98.28

98.20

98.12

98.30

98.54

99.58

AA

94.68

97.24

97.51

97.82

96.98

98.54

99.58

Kappa(×100) 96.23

97.71

97.62

97.50

97.75

98.36

99.53

3D LSSAN for Hyperspectral Image Classification

305

Fig. 4. Classification maps of the IP dataset. (a) Pseudocolor image. (b) Ground-truth. (c) SpectralNet, OA = 96.83%. (d) 3D-HyperGAMO, OA = 97.63%. (e) DBDA, OA = 97.56%. (f) DBMA. OA = 96.33%. (g) pResNet, OA = 98.42%. (h) 3D-LWNet, OA = 98.34%. (i) proposed method, OA = 99.26%.

others quantitatively. Our approach produces cleaner classification maps with less noise. For example, in the Soybean mintill section of the IP dataset, DBDA and DBMA maps has considerable noise, while 3D-LSSAN maps are clearer. Similarly, in the Meadows section of the PU dataset, 3D-LSSAN generate maps with significantly less noise compare to other methods. We compare different model performance using various sample sizes to show 3D-LLSAN’s superiority. For the PU dataset, we get classification results for sample sizes of 50, 80, 110, and 160 per class. Due to the imbalance nature of the IP dataset, we use a portion of the total sample size for training: 5%, 8%, 12%, and 14%. These results are in Tables 6 and 5. The data in these tables indicate that as training data increase, all methods improve. Notably, 3D-LSSAN consistently outperformed other methods with larger sample sizes. These findings highlight our approach’s effectiveness. Table 5. Classificaton OA(%) of PU Dataset with different training sample number. Methods

50

80

110

160

SpectralNet

90.66

94.60

96.27

97.15

3D-HyperGMAO 95.21

97.55

97.77

97.94

DBDA

97.40

97.44

98.20

96.96

DBMA

93.75

96.79

95.88

98.12

pResNet

94.59

95.94

97.00

98.30

3D-LWNet

96.44

97.11

97.56

98.54

Proposed

97.56 97.78 98.43 99.58

306

Z. Zheng et al.

Fig. 5. Classification maps of the PU dataset. (a) Pseudocolor image. (b) Groundtruth. (c) SpectralNet, OA = 97.15%. (d) 3D-HyperGAMO, OA = 98.28%. (e) DBDA, OA = 98.20%. (f) DBMA. OA = 98.12%. (g) pResNet, OA = 98.30%. (h) 3D-LWNet, OA = 98.54%. (i) proposed method, OA = 99.58%. Table 6. Classification OA(%) of IP dataset with different training sample percent.

4

Methods

5%

8%

12%

14%

SpectralNet

74.24

95.23

97.51

98.71

3D-HyperGMAO 90.36

93.7

95.29

95.06

DBDA

95.79

94.71

97.56

93.95

DBMA

86.86

91.51

94.71

94.06

pResNet

94.48

95.38

98.13

98.42

3D-LWNet

93.43

95.33

97.61

98.34

Proposed

95.02 95.96 98.35 99.26

Conclusion

In this study, we introduce 3D-LKA, a novel attention mechanism for HSI classification that applies attention to spatial and spectral features simultaneously. Our approach uses 3D-LKA to capture local and long-range information, enhancing feature extraction accuracy. We also present 3D-LSSAN, a network incorporating 3D-LKA into the bottleneck structure of ResNet. Unlike lightweight CNNs, 3D-LSSAN prevents the neglect of critical information within receptive fields. Comparative experiments on real-world datasets show the superiority of the proposed method over recent approaches. Our future plans involve

3D LSSAN for Hyperspectral Image Classification

307

combining 3D-LSSAN with traditional clustering method to address difficulties by the limited labeled samples and enhance the HSI classification accuracy. Acknowledgments. This work is funded by the Research Foundation of Education Department of Hunan Province of China under Grant No. 22A0371; the Graduate Research Project of Jishou University under Grant No. jdy22024.


Deepfake Detection via Fine-Grained Classification and Global-Local Information Fusion

Tonghui Li, Yuanfang Guo(B), and Yunhong Wang

Laboratory of Intelligent Recognition and Image Processing, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
{lthlth,andyguo,yhwang}@buaa.edu.cn

Abstract. In response to the increasing amount of deepfake content on the internet, a large number of deepfake detection methods have been developed recently. To the best of our knowledge, existing methods simply perform binary classification, i.e., they consider the deepfake images generated by different forgery methods as a single category. Unfortunately, different deepfake forgery methods usually generate deepfake images with different artifacts/appearances. Under such circumstances, a simple binary classification mechanism may limit the learning ability of the detection models, i.e., they may ignore certain forgery traces. Therefore, we propose a novel deepfake detection method via fine-grained classification and global-local information fusion. Specifically, we improve the binary classification task with a fine-grained classification mechanism, such that the deepfake detection model can learn more precise features for fake images from different forgery methods. Besides, we construct a global-local information fusion architecture to emphasize the important information in certain local regions and fuse it with global semantic information. In addition, we design a global center loss, which makes the real image features more cohesive and enlarges the distance between real and fake image features, to further enhance the generalization ability of the detection model. Extensive experiments demonstrate the effectiveness and superiority of our method.

Keywords: Deepfake detection · Fine-grained classification · Global-local information fusion · Global center loss

1 Introduction

With the continuous development of generative techniques such as Generative Adversarial Networks [5] and Variational AutoEncoders [8], people can easily produce high-quality manipulated (deepfake) images which are difficult to identify by human perception. Unfortunately, these technologies may be abused for various malicious purposes, such as political slander, economic fraud, fake news, etc., which usually induce huge negative impacts on social trust, public trust, political trust, etc. Therefore, it is of great significance to develop effective counterfeit detection methods.

Fig. 1. The results of different deepfake forgery methods when forging the same target and source images. Note that different methods tend to generate different artifacts/appearances.

To alleviate these risks, many deepfake detection methods have been proposed [4,10,13,22]. Some of the early work focused on exploiting biological inconsistencies in fake images [20]. Recently, Li et al. achieved promising results in [10] by mining the artifact information of blending boundaries. Dong et al. proposed an Identity Consistency Transformer [4] by identifying the identity consistency of the inner and outer regions of each face. Liu et al. utilized the upsampling artifacts in the frequency domain to detect forgery in [13]. Masi et al. utilized a dual-stream network in [15] to fuse RGB information and high-frequency information. Zhao et al. suggested multiple attention maps, which focus on different regions, by introducing a region loss in [22].

Currently, most of the existing methods regard the deepfake detection problem as a binary classification problem [13,17,20,22]. A typical deepfake detection pipeline utilizes a backbone network to extract features from input images, and then classifies them as real or fake. Unfortunately, this binary classification mechanism may possess certain problems. Since there exists a variety of forging methods, their correspondingly generated fake images tend to possess various types of artifacts and different data distributions. As shown in Fig. 1, though the same images are forged, the fake images from different forging methods tend to show differences in many details. Therefore, if the model is trained to perform a binary classification on the real images and the fake images from different forging methods, the learning of the model may be hindered and the model may ignore certain forgery traces. Besides, certain local regions, such as eyes, nose and lip, usually contain more information than other facial regions. As shown in Fig. 1, different forgery methods tend to vary more obviously when forging these regions.


Inspired by the above observations, in this paper, we propose a deepfake detection method via fine-grained classification and global-local information fusion. Specifically, we improve the binary classification task of deepfake detection via a fine-grained classification mechanism. This mechanism divides the fake category into multiple fake categories corresponding to different forging methods, such that the detection model can learn more precise features for the images from different fake categories. Besides, we construct a global-local information fusion architecture, which extracts local features with shallow networks and fuses them with the global features. These extracted local features can better describe the local forgery artifacts at the eyes, nose and lip regions, to further boost our fine-grained classification mechanism. In addition, to alleviate the overfitting issue which exists widely in existing methods, we add a global center loss into our training loss to make the real image features more cohesive and enlarge the distance between the real and fake image features. Benefiting from the global center loss, the classification boundary learned by our model tends to be more explicit.

Our major contributions can be summarized as follows:

– We propose a novel deepfake detection method, by improving the conventional binary classification to a fine-grained classification.
– We propose a global-local information fusion architecture, which effectively extracts and combines global deep semantic information and local shallow texture information.
– We design a global center loss to further enhance the generalization ability of our model.

2 Methodology

To solve the problems induced by the simple binary classification mechanism in conventional deepfake detection methods, we propose a deepfake detection method via a fine-grained classification mechanism and global-local information fusion. The framework of our method is shown in Fig. 2. In the global branch, a pretrained backbone network is adopted to extract deep global features. In the local branch, multiple shallow convolutional neural networks are constructed to extract shallow features from three informative local regions, i.e., nose, eyes and lip. The shallow features are then concatenated and enhanced via a channel-wise attention module to obtain the local features. At last, the global deep features and the local shallow features are fused and a fine-grained classification is performed. Meanwhile, a global center loss is calculated with respect to the fused features, to further expand the margin between the real and fake images, such that the detection performance and generalization ability of the model can be improved.


Fig. 2. Overview of our method.

Algorithm 1. Pseudocode of FGC in a PyTorch-like style.

# input: input image (N, C, H, W)
# output: pred (N, 2)
# backbone: the model on which the fine-grained classification mechanism is based
# S: number of fine-grained categories (the real category plus the fake categories)
# y: ground-truth label of the fine-grained categories
# category_weight: loss weight for different categories

# extract features from the input image: (N, C, H, W) -> (N, C_out)
features = backbone(input)

# fine-grained fully connected layer: (N, C_out) -> (N, S)
output = softmax(linear(features), dim=1)

# training phase: the fine-grained classification mechanism calculates the loss
if phase == "train":
    loss = CrossEntropyLoss(output, y, weight=category_weight)
    Optimize()
# test phase: the fine-grained prediction is converted to a binary (real/fake) label
elif phase == "test":
    pred = cat((output[:, :1], sum(output[:, 1:], dim=1, keepdim=True)), dim=1)
    pred = argmax(pred, dim=1)

2.1 Fine-Grained Classification Mechanism

As mentioned above, images generated by different forgery methods tend to present different data distributions in the feature space. Previous methods simply consider the forged images as a single category, which may misguide the detection model to miss certain forgery cues. Therefore, we improve binary classification with a fine-grained classification mechanism in our method. Specifically, we construct the forgery categories according to the forgery methods utilized in the training process. Then, we constrain our method to further classify each fake image into its corresponding forgery category. With the help of the fine-grained classification mechanism, the model can learn more precise features for the fake images generated from different forgery methods, such that the real images can also be better identified. In the actual detection (testing) process, we directly convert the results of the fine-grained classification into the corresponding real or fake categories. The calculation process of the fine-grained mechanism is summarized in Algorithm 1.
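To make the mechanism concrete, the following is a minimal PyTorch sketch of a fine-grained classification head together with the test-time conversion to a binary decision. It is an illustration only, not the authors' implementation; the class and argument names are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedHead(nn.Module):
    """Illustrative fine-grained head: class 0 is 'real', classes 1..S-1 are the forgery methods."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, features):
        return self.fc(features)                       # (N, S) fine-grained logits for training

    @torch.no_grad()
    def predict_binary(self, features):
        # Test time: merge all fake probabilities into a single 'fake' score.
        prob = F.softmax(self.fc(features), dim=1)     # (N, S)
        real = prob[:, :1]                             # (N, 1)
        fake = prob[:, 1:].sum(dim=1, keepdim=True)    # (N, 1)
        return torch.cat([real, fake], dim=1).argmax(dim=1)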

2.2 Global-Local Information Fusion Architecture

In our method, we employ a deep neural network as the backbone to extract features from the input images. With the help of the backbone network, the output features are correlated with the entire input image and contain high-level semantic information. Unfortunately, the extracted deep features tend to contain less local information, which can better describe the local forgery artifacts. Based on the above observation, we propose a global-local information fusion architecture, which can effectively exploit the complementary global deep information and local shallow information.

Intuitively, typical deepfake methods tend to have much more difficulty in forging the local regions with more details, such as eyes, nose and lip, and therefore tend to induce more forgery artifacts/cues within these local regions. The local branch is thus constructed to emphasize the local information in these regions. Specifically, the local branch utilizes the key point information of face detection to identify three local regions, i.e., eyes, nose and lip, in the input image. These local regions are denoted as R_p, where p ∈ {eyes, nose, lip}. Then, the local branch inputs the local regions into their corresponding shallow convolutional neural networks f_p for shallow feature extraction. The extracted features are downsampled to F^p ∈ R^{H_l × W_l × C_l} by average pooling. Then, the output of the local branch, F^r ∈ R^{H_l × W_l × (C_l × 3)}, can be obtained via a concatenation operation, as

F^r = \mathrm{Concatenate}(\mathrm{Pooling}_p(f_p(R_p))).   (1)

Although the architectures of the shallow networks employed in the above local feature extraction process are identical, the importance and the amount of information contained in the above three local regions are different in the actual deepfake detection process. Thus, the channel attention module in [19] is adopted as

A(F^r) = \sigma\big(W_1 \delta(W_0 F^r_{Avg}) + W_1 \delta(W_0 F^r_{Max})\big),   (2)

F^{local} = A(F^r) \cdot F^r.   (3)

Here, F^r_{Max} and F^r_{Avg} correspond to F^r after the maximum and average pooling operations, respectively. W_0 and W_1 correspond to the parameters of the two layers in the channel attention module. σ and δ denote the Sigmoid and ReLU activation functions. After obtaining the enhanced local feature, we fuse the local feature F^{local} with the global feature F^{global} to generate the final feature F^{gl}. The fusion operation is denoted as Fuse(·, ·); in this paper, it is achieved by using a concatenation operation:

F^{gl} = \mathrm{Fuse}(F^{global}, F^{local}).   (4)
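For illustration, a minimal PyTorch sketch of Eqs. (1)-(4) is given below. The shallow networks f_p are passed in as arguments; the pooled spatial size, the channel-attention reduction ratio, and the function names are our assumptions rather than the paper's specification.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Channel attention of Eq. (2): a shared two-layer MLP applied to average- and max-pooled features.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W_0
            nn.ReLU(inplace=True),                                      # delta
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W_1
        )

    def forward(self, f_r):
        avg = self.mlp(F.adaptive_avg_pool2d(f_r, 1))
        mx = self.mlp(F.adaptive_max_pool2d(f_r, 1))
        return torch.sigmoid(avg + mx)                                  # sigma

def local_branch(regions, shallow_nets, attention, pool_size=7):
    # Eq. (1): shallow features of eyes/nose/lip are pooled and concatenated along channels.
    feats = [F.adaptive_avg_pool2d(shallow_nets[p](regions[p]), pool_size)
             for p in ("eyes", "nose", "lip")]
    f_r = torch.cat(feats, dim=1)
    return attention(f_r) * f_r                                         # Eq. (3): F_local = A(F_r) * F_r

def fuse(f_global, f_local):
    # Eq. (4): concatenation-based fusion of the (flattened) global and local features.
    return torch.cat([f_global, f_local.flatten(1)], dim=1)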

2.3 Global Center Loss

To further improve the generalization ability of the model, a global center loss is designed based on the single center loss in [9] and added to the loss function of our model. The global center loss constrains the features of real images to be more cohesive, while maintaining a certain distance between the features of fake images and the cluster center of the real image features. By adopting the global center loss, our model can learn a more explicit classification boundary.

The existing single center loss in [9] calculates the cluster center of the real image features within each batch, which may deviate obviously from the global center and degrade the learning of our model. On the contrary, we initialize the cluster center C_g of the real image features globally on the entire training dataset. In each training iteration, the cluster center C_b of the real image features is calculated on the current data batch, and the global center C_g is then updated accordingly:

C_b = \frac{\sum_{i=0}^{n-1} F_i^{gl} \cdot \mathbb{I}\{y_i = 0\}}{\sum_{i=0}^{n-1} \mathbb{I}\{y_i = 0\}},   (5)

C_g = C_g + \alpha \cdot (C_b - C_g),   (6)

where α controls the updating strength, y_i is the ground truth label of the i-th sample in the batch, and I is an indicator function, which takes the value 1 if the internal condition is satisfied and 0 otherwise. After updating the global center, we calculate the Euclidean distances from the real and fake image features to the global center, respectively:

D_m = \frac{\sum_{i=0}^{n-1} \|F_i^{gl} - C_g\|_2 \cdot \mathbb{I}\{y_i = m\}}{\sum_{i=0}^{n-1} \mathbb{I}\{y_i = m\}}.   (7)

Then, our global center loss L_{center} is calculated according to the Euclidean distances D_0 and D_1, as

L_{center} = D_0 + \mathrm{ReLU}(\beta \cdot \sqrt{N} - D_1),   (8)

where β controls the expected distance from the features of fake images to the global center, and N represents the dimension of the feature. Note that this global center loss constrains the forged features to prevent them from being pushed back without limit, which would force their distributions to be overly scattered. At last, we compute the overall loss function L_{total} by combining the center loss L_{center} with the cross-entropy loss L_{ce}, as

L_{total} = L_{ce} + \lambda \cdot L_{center},   (9)

where λ controls the importance of the center loss. After adding the global center loss, a larger distance between the real and fake image features will be maintained, and the classification boundary will be more explicit. This gives our model a better generalization ability to detect fake images generated by unknown forgery methods.
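A minimal PyTorch sketch of Eqs. (5)-(9) is given below, assuming `feats` are the fused features F^gl of a batch and `labels` are binary (0 = real). The detached center update and the device handling are simplifications of ours, not the authors' implementation.

import torch
import torch.nn.functional as F

class GlobalCenterLoss:
    # Sketch of Eqs. (5)-(8); alpha and beta follow the paper's hyperparameter names.
    def __init__(self, feat_dim, alpha=0.3, beta=0.3):
        self.center = torch.zeros(feat_dim)   # global center C_g of real-image features
        self.alpha, self.beta = alpha, beta
        self.feat_dim = feat_dim

    def __call__(self, feats, labels):
        self.center = self.center.to(feats.device)
        real, fake = labels == 0, labels != 0
        if real.any():
            c_b = feats[real].mean(dim=0).detach()                          # Eq. (5): batch center C_b
            self.center = self.center + self.alpha * (c_b - self.center)    # Eq. (6): update C_g
        d_real = (feats[real] - self.center).norm(dim=1).mean() if real.any() else feats.new_zeros(())
        d_fake = (feats[fake] - self.center).norm(dim=1).mean() if fake.any() else feats.new_zeros(())
        # Eq. (8): pull real features towards C_g, keep fake features at least beta*sqrt(N) away
        return d_real + F.relu(self.beta * self.feat_dim ** 0.5 - d_fake)

# Eq. (9): total training loss (logits and labels are placeholders)
# loss = F.cross_entropy(logits, fine_grained_labels) + lam * center_loss(fused_feats, binary_labels)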


Table 1. Architecture of the shallow network in the local branch.

layer name   parameters
conv1        3 × 3, 32, stride 2
conv2        5 × 5, 64
conv3        3 × 3, 128
conv4        1 × 1, 128
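Table 1 only specifies the convolution sizes; the padding, activations, and input channels in the following PyTorch sketch are assumptions of ours.

import torch.nn as nn

def shallow_local_net(in_channels=3):
    """Sketch of the 4-layer shallow network in Table 1."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),  # conv1
        nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(inplace=True),                     # conv2
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),                    # conv3
        nn.Conv2d(128, 128, kernel_size=1),                                                     # conv4
    )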

3 Experiments

3.1 Experimental Settings

Datasets. To facilitate the comparisons with the existing methods, the proposed model is trained on the most widely used benchmark FF++ [17]. The tests are performed on FF++ [17], Celeb-DF [12] and DeepFakeDetection (DFD) [2].

Metrics. In this paper, Accuracy Rate (ACC) and Area Under the ROC Curve (AUC), which are commonly applied in deepfake detection tasks [13,15,17], are employed as the evaluation metrics.

Implementation Details. In our experiments, EfficientNet-B4 [18] is utilized as the backbone network in the global branch. In the local branch, we construct a 4-layered convolutional neural network as the shallow network to extract shallow features from important local regions. Note that the details of the shallow network are provided in Table 1. The Adam [7] optimizer is adopted to update the parameters of the model. We set the learning rate to 0.001 and decay it to half of itself every 5 epochs, until the validation loss has not decreased for 5 consecutive epochs. The relevant hyperparameters of the center loss are set as α = 0.3, β = 0.3, λ = 1. All the input images are preprocessed by RetinaFace for face and keypoint detection. Note that since FF++, which contains five forgery methods, i.e., Deepfakes, Face2Face, FaceSwap, NeuralTexture and Faceshifter, is utilized as the training dataset, the fake categories in our fine-grained classification mechanism are constructed accordingly.
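The optimizer and decay rule described above could be wired up as in the sketch below; the model and the validation loss here are dummies, so only the Adam/StepLR/early-stopping scaffolding reflects the text.

import torch
import torch.nn as nn

model = nn.Linear(16, 6)                           # placeholder for the detection network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve the LR every 5 epochs

best_val, patience = float("inf"), 0
for epoch in range(1000):
    # ... training step would go here ...
    val_loss = torch.rand(1).item()                # stand-in for the real validation loss
    scheduler.step()
    if val_loss < best_val:
        best_val, patience = val_loss, 0
    else:
        patience += 1
        if patience >= 5:                          # stop after 5 epochs without improvement
            break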

3.2 Performance Evaluations

Intra-Dataset Evaluations. The intra-dataset evaluations are performed on the FF++ dataset. The results in Table 2 demonstrate that our method outperforms all the existing methods in the low-quality (LQ) setting, obtaining remarkable improvements compared to the current state-of-the-art methods, e.g., 3.99% in terms of the ACC score against PD [21] and 8.53% against SPSL [13]. Although our ACC score in the high-quality (HQ) setting is slightly inferior to MADD [22], MADD needs to repeatedly mask images with different shapes during training, which makes it take approximately 3× the time of our method to process a single epoch.

Table 2. Intra-dataset evaluation results on FaceForensics++.

Methods                 HQ ACC(%)  HQ AUC(%)  LQ ACC(%)  LQ AUC(%)
MesoNet [1]             83.10      –          70.47      –
Face X-ray [10]         –          87.40      –          61.60
Xception [3]            95.73      96.30      86.86      89.30
Xception-ELA [6]        93.86      94.80      79.63      82.90
Two-Branch [15]         –          98.70      –          86.59
MADD [22]               97.60      99.29      88.69      90.40
SPSL [13]               91.50      95.32      81.57      82.82
PD [21]                 95.43      98.71      86.11      89.29
EfficientNet-B4 [18]    96.63      99.18      88.15      90.71
Ours                    97.28      99.29      90.10      93.07


Fig. 3. The impacts of varying different hyperparameters β, λ. (a) Varying λ when β is fixed at 0.3. (b) Varying β when λ is fixed at 1.

Cross-Dataset Evaluations. The cross-dataset evaluations are conducted on the Celeb-DF dataset to evaluate the generalization ability of the proposed method. Since transform-based data augmentation is a commonly utilized cross-dataset performance boosting strategy, we also adopt this strategy in the training phase for fair comparison. According to the results presented in Table 3, our method outperforms the existing methods and achieves the overall best performance when utilizing the transform-based data augmentation strategy. Even without the data augmentation strategy, our method still surpasses the other approaches in the same setting. This clearly demonstrates the superiority of our proposed method in terms of generalization ability.

3.3 Analysis

Hyperparameter Analysis. We analyze the impacts of varying the hyperparameters β and λ on FF++(HQ). The results are shown in Fig. 3.


Table 3. Cross-dataset evaluation results (AUC) on Celeb-DF. † indicates that the transform-based data augmentation strategy is applied during the training process.

Methods                 Celeb-DF
Meso4 [1]               54.80
MesoInception4 [1]      53.60
HeadPose [20]           54.60
Xception-c23 [12]       65.30
DSP-FWA [11]            64.60
F3-Net [16]             65.17
EfficientNet-B4 [18]    64.29
Face X-ray [10]†        74.76
MADD [22]†              67.44
SPSL [13]†              76.88
PD [21]†                74.22
Ours                    70.75
Ours†                   77.62

Table 4. Ablation results (AUC(%)) of our method. The complete method achieves the best results. FGC represents the fine-grained classification mechanism, GLF denotes the global-local information fusion architecture and GCL stands for the global center loss.

FGC  GCL  GLF   FF++(HQ)   DFD     Celeb-DF
                                   64.29
                99.04      80.16
                99.12      79.29   67.27
                99.16      85.14   70.28
✓    ✓    ✓     99.29      86.10   70.75

According to the results, except for the abnormality at λ = 0.5, the performance of the model increases with the increase of λ, and reaches the optimum when λ = 1. Regarding β, it controls the expected lower bound of the distance between the fake image features and the global center of real features. When β is too large, the model can hardly converge; when β is too small, the model may not be able to distinguish between the real and fake image features. According to the results, when β = 0.3, the model achieves a good balance between discriminability and convergence, and exhibits the best performance. Thus, β = 0.3 and λ = 1 are selected in the other experiments. Since we observed that α has little impact on the performance of our model after convergence, we set α = 0.3.

Ablation Study. Table 4 shows the ablation results of our proposed work. As can be observed, each of our key components, i.e., the fine-grained classification mechanism, the global-local information fusion architecture and the global center loss, contributes solidly to the final method.

Fig. 4. The t-SNE result of the image features extracted from the FF++ HQ testing dataset. (a) The results of EfficientNet-B4 (binary classification). (b) The results of EfficientNet-B4 with the fine-grained classification mechanism.

Table 5. The fine-grained classification performances of different backbone networks with our fine-grained classification mechanism on FF++.

Backbone           HQ ACC(%)  HQ AUC(%)  LQ ACC(%)  LQ AUC(%)
Xception           95.20      99.54      75.71      94.86
EfficientNet-B4    96.86      99.64      83.44      97.04

Effectiveness of the Fine-grained Classification Mechanism. To verify the effectiveness of the fine-grained classification mechanism, we utilize EfficientNet [18] and Xception [3] as the backbone networks, and apply the fine-grained classification mechanism to them respectively. The fine-grained classification and binary classification results are shown in Table 5 and Table 6, respectively. Table 5 reveals that the backbone networks can distinguish the deepfake images generated by different forgery methods well when the fine-grained classification mechanism is applied. For the binary classification task in Table 6, the fine-grained classification mechanism obviously improves the binary classification performance of the model. Meanwhile, this mechanism makes the model converge faster and thus reduces the training time. In addition, the fine-grained classification mechanism can be easily applied to existing models to enhance their detection capabilities.

To further verify the effectiveness of the fine-grained classification mechanism, we employ t-SNE [14] to visualize the intermediate features extracted from the testing dataset. As shown in Fig. 4(a), although the binary classification model treats the deepfake images from different forgery methods as a single fake category, the extracted features still tend to form multiple clusters related to the forgery methods. After applying our fine-grained classification mechanism, the model can extract features which can be clustered in a more precise manner.
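The visualisation in Fig. 4 can be reproduced in spirit with scikit-learn's t-SNE; the features and labels below are random placeholders standing in for the fused test-set features and their fine-grained labels.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.randn(500, 128)          # placeholder for the extracted feature vectors
labels = np.random.randint(0, 6, size=500)    # 0 = real, 1..5 = the five FF++ forgery methods

emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
plt.savefig("tsne_features.png")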


Table 6. Binary forgery classification results of the backbone networks with/without the fine-grained classification mechanism (FGC).

Backbone                   HQ ACC(%)  HQ AUC(%)  LQ ACC(%)  LQ AUC(%)
Xception w/o FGC           95.73      96.30      86.86      89.30
EfficientNet-B4 w/o FGC    96.66      99.04      88.15      90.71
Xception w/ FGC            95.84      98.60      87.68      89.50
EfficientNet-B4 w/ FGC     97.06      99.12      89.68      93.01

Table 7. Comparison of different shallow network designs.

Shallow Network Designs              ACC(%)  AUC(%)
4-layered CNN                        97.19   99.28
4-layered CNN + dense connections    97.12   99.16
ROI + 4-layered CNN                  97.15   99.17

As shown in Fig. 4(b), the explicit clusters verify that the data distributions of fake images from different forgery methods are indeed different. By adopting our fine-grained classification mechanism, the model can better discriminate between real and fake images.

Comparison of Different Shallow Network Designs. In this analysis, we verify the design of the shallow network in the local branch. Three designs are compared: 1) extracting features via our designed 4-layered convolutional neural network, 2) extracting features via the designed 4-layered convolutional neural network with dense connections, and 3) obtaining the ROI features corresponding to the eyes, nose and lip regions from the shallow layer of the backbone network, and then extracting features via the designed 4-layered convolutional neural network. According to Table 7, our 4-layered convolutional neural network gives the best performance and is utilized in the other experiments.

4 Conclusion

In this paper, we propose a new deepfake detection method via fine-grained classification and global-local information fusion. Our method improves the binary classification performance with a fine-grained classification mechanism. Meanwhile, our method also effectively exploits the complementary global deep information and local shallow information. These improvements allow our deepfake detection model to learn finer forgery traces and better identify the deepfake images from different forgery methods within the feature space. In addition, we design a global center loss to effectively enlarge the margin between the real and fake images, which makes the classification boundary more explicit. Experimental results have demonstrated the effectiveness and superiority of our method.


References 1. Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: Mesonet: a compact facial video forgery detection network. In: IEEE WIFS, pp. 1–7 (2018) 2. AI, G.: Contributing data to deepfake detection research (2019). https://ai. googleblog.com/2019/09/contributing-data-to-deepfake-detection.html. Accessed 09 Apr 2022 3. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: IEEE CVPR, pp. 1251–1258 (2017) 4. Dong, X., et al.: Protecting celebrities from deepfake with identity consistency transformer. In: IEEE/CVF CVPR, pp. 9468–9478 (2022) 5. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014) 6. Gunawan, T.S., Hanafiah, S.A.M., Kartiwi, M., Ismail, N., Za’bah, N.F., Nordin, A.N.: Development of photo forensics algorithm by detecting photoshop manipulation using error level analysis. Indones. J. Electr. Eng. Comput. Sci. 7(1), 131–137 (2017) 7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 8. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) 9. Li, J., Xie, H., Li, J., Wang, Z., Zhang, Y.: Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection. In: IEEE/CVF CVPR, pp. 6458–6467 (2021) 10. Li, L., et al.: Face x-ray for more general face forgery detection. In: IEEE/CVF CVPR, pp. 5001–5010 (2020) 11. Li, Y., Lyu, S.: Exposing DeepFake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656 (2018) 12. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: a large-scale challenging dataset for DeepFake forensics. In: IEEE/CVF CVPR, pp. 3207–3216 (2020) 13. Liu, H., et al.: Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In: IEEE/CVF CVPR, pp. 772–781 (2021) 14. der Maaten, L.V., Hinton, G.: Visualizing data using t-SNE. JMLR 9(86), 2579– 2605 (2008) 15. Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P., AbdAlmageed, W.: Twobranch recurrent network for isolating deepfakes in videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 667–684. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6 39 16. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: face forgery detection by mining frequency-aware clues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 86–103. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2 6 17. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: learning to detect manipulated facial images. In: IEEE/CVF ICCV, pp. 1–11 (2019) 18. Tan, M., Le, Q.: Efficientnet: rethinking model scaling for convolutional neural networks. In: ICML, pp. 6105–6114. PMLR (2019) 19. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/9783-030-01234-2 1


20. Yang, X., Li, Y., Lyu, S.: Exposing deep fakes using inconsistent head poses. In: IEEE ICASSP, pp. 8261–8265 (2019) 21. Zhang, B., Li, S., Feng, G., Qian, Z., Zhang, X.: Patch diffusion: a general module for face manipulation detection. AAAI 36(3), 3243–3251 (2022) 22. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional deepfake detection. In: IEEE/CVF CVPR, pp. 2185–2194 (2021)

Unsupervised Image-to-Image Translation with Style Consistency

Binxin Lai and Yuan-Gen Wang(B)

School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
[email protected], [email protected]

Abstract. Unsupervised Image-to-Image Translation (UNIT) has gained significant attention due to its strong ability of data augmentation. UNIT aims to generate a visually pleasing image by synthesizing one image's content with another's style. However, current methods cannot ensure that the style of the generated image matches that of the input style image well. To overcome this issue, we present a new two-stage framework, called Unsupervised Image-to-Image Translation with Style Consistency (SC-UNIT), for improving the style consistency between the image of the style domain and the generated image. The key idea of SC-UNIT is to build a style consistency module to prevent the deviation of the learned style from the input one. Specifically, in the first stage, SC-UNIT trains a content encoder to extract multiple-layer content features, wherein the last layer's feature can represent the abstract domain-shared content. In the second stage, we train a generator to integrate the content features with the style feature to generate a new image. During the generation process, dynamic skip connections and multiple-layer content features are used to build multiple-level content correspondences. Furthermore, we design a style reconstruction loss to make the style of the generated image consistent with that of the input style image. Numerous experimental results show that our SC-UNIT outperforms state-of-the-art methods in image quality, style diversity, and style consistency, even for domains with significant visual differences. The code is available at https://github.com/GZHU-DVL/SC-UNIT.

Keywords: Unsupervised Image-to-Image Translation · Prior Distillation · Generative Adversarial Network · Style Consistency

1 Introduction

The primary goal of Unsupervised Image-to-Image Translation (UNIT) is to translate an image from one domain to another while ensuring the quality of the generated images. This technology can help create virtual images from existing real images. Many approaches have been proposed to enhance the quality of generated images and make them more natural and realistic.

Fig. 1. Results of SC-UNIT on close domains (Male↔Female), distant domains (Face↔Cat), and extremely distant domains (Bird↔Car). The task is to translate the input (left of each pair of images) into the output (right of each pair of images).

CycleGAN [30] introduces the concept of cycle consistency to establish mappings between different domains. This approach has shown good performance in color and texture translation, such as transforming winter to summer, horses to zebras, and photos to Monet-like paintings. However, it does not perform well for translations that involve significant geometric transformations. Additionally, images generated by CycleGAN lack diversity: with a single generator and one input image, it can only produce a single corresponding output image. With the growing research on feature decoupling, the majority of researchers have adopted the encoder-decoder framework and applied a content encoder and a style encoder [7,13,14,22,23,27] to learn domain-shared features and domain-specific features, respectively. Since Huang et al. [12] proved the effectiveness of the AdaIN layer in style transformation, almost all recent works on the UNIT task have exploited the AdaIN layer to transfer the style of images. StarGAN2 [7] is an excellent work that applies the encoder-decoder framework and the style encoder to deal with the problem of diversity. Furthermore, StarGAN2 provides a high-quality animal face dataset called AFHQ, which greatly assists in assessing the capability of models in translations across domains with significant variations in shape and appearance. However, it still does not work well in translations that involve large changes in appearance and shape, since such translations require models to build more abstract semantic correspondences. GP-UNIT [27] proposes a two-stage framework that distills the generative prior from BigGAN [2] in the first stage to extract abstract content features which can represent common attributes across domains with dramatic differences in shape and appearance. This approach greatly improves on the deficiencies of the aforementioned methods in building abstract semantic correspondences. However, we find that in GP-UNIT, the style consistency, such as species, between the style input and the generated image cannot be guaranteed very well.


In this paper, we propose a new framework called Unsupervised Image-to-Image Translation with Style Consistency (SC-UNIT), which builds on a style consistency module to improve the style consistency between the input style image and the generated image. Our framework contains two stages. In the first stage, we train a content encoder using data with strong content correlation as input to extract domain-shared content features. In the second stage, we train a generator that uses dynamic skip connections to build multi-level correspondences, and uses a style consistency module to ensure the style similarity between the input style image and the generated image during training. Experimental results demonstrate that our framework addresses the aforementioned problems and achieves satisfactory performance, as depicted in Fig. 1.

2 Related Work

Unsupervised Image-to-Image Translation. UNIT aims to generate diverse and high-quality virtual images from unlabeled data, which helps reduce labor costs. Many works in this field have made outstanding contributions. For example, CycleGAN [30] exploits cycle consistency constraints to build a mapping between two domains, and the introduction of a discriminator [10] against the generator contributes to generating more realistic images. MUNIT [13] proposes a new framework where the latent space of images can be divided into a content space (domain-shared) and a style space (domain-specific); it employs a content encoder and a style encoder to extract content features and style features, respectively. COCO-FUNIT [26] utilizes a content-conditioned style encoder to alleviate the missing content problem. StarGAN2 [7] introduces FAN [3] to accurately localize the positions of facial features, resulting in good performance in face translation; additionally, it utilizes diversity regularization to enhance the diversity of the generated images. GP-UNIT [27] proposes a two-stage framework which utilizes the prior distilled from BigGAN [2] to help learn abstract semantic content features.

Prior Distillation. Generally, a prior can help models accelerate convergence and reduce the learning difficulty; it can also improve the stability of training. A large number of works [4,8,15,24,27,29] have demonstrated the effectiveness of prior distillation in improving model performance. These works generally distill the prior from StyleGAN [19] or BigGAN [2] to improve image quality or to better capture abstract content features.

Generative Adversarial Network. The Generative Adversarial Network (GAN) [10] is the mainstream technique in UNIT. It forces a generator to generate more realistic images by introducing a discriminator that predicts whether an image is fake. The discriminator and generator take turns making progress through the adversarial loss: as the discriminator becomes more accurate in distinguishing real and fake images, the generator learns to generate more authentic images, ultimately achieving a Nash equilibrium. Many works [1,2,6,7,13,17-19,21,27] with excellent performance in image generation have adopted GAN to improve the authenticity of generated images.

Fig. 2. First Stage: Training of Content Encoder.

3 Proposed Method

The training process is divided into two stages. In the first stage, we train a content encoder that distills the prior from the data generated by BigGAN to better learn the domain-shared features. In the second stage, we train a generator to build multiple-level correspondences between input and output using dynamic skip connections. At the same time, we build a style consistency module to prevent the deviation of the learned style from the input style.

3.1 First Stage: Training of Content Encoder

The first stage is designed to train a content encoder that can extract domain-shared content features at an abstract semantic level. These features can represent the content of images, such as the orientation of the main object and rough shape correspondence. Figure 2 shows the training framework of the first stage. During the training process, a pair of images (x, y) is used as input for each iteration. Every image pair (x, y), which is randomly selected from domains X and Y, is generated by a consistent latent code in BigGAN, because images generated by BigGAN with the same latent code show a strong correlation in content. Some examples are shown in Fig. 3.

Fig. 3. Generation of BigGAN [2].

Then, we employ the content encoder E_c and style encoder E_s to extract content features E_c(x) and style features E_s(x), respectively. In the next step, we define a decoder D to reconstruct the original input x; the reconstructed image is denoted as x̄. Meanwhile, D(E_c(x), E_s(x), l_x) produces the mask m̄ based on the content feature E_c(x) and the domain label l_x to predict the mask of image x (l_x represents the domain of the image x). The generated mask m̄ = D_mask(E_c(x), l_x) (D_mask is part of D) should be as similar as possible to x_s (x_s is generated by applying HTC [5] to image x, which predicts the mask of x), so that the content feature extracted by E_c includes information about the shape of the input. We therefore use an L1 loss as the shape reconstruction loss (denoted as L_srec) to calculate the difference between m̄ and x_s. Thus, L_srec can be defined as:

L_{srec} = \lambda_{srec} \mathbb{E}_x[\|\bar{m} - x_s\|_1].   (1)

To ensure that the content feature is domain-shared, we repeat the above operation with the single difference that we replace the content feature of x with the content feature of y when generating the mask m̂ = D_mask(E_c(y), l_x) for the pair of images (x, y). Then, we calculate the difference between the generated mask m̂ and x_s with the shape distance loss L_sdist, defined as:

L_{sdist} = \lambda_{sdist} \mathbb{E}_{x,y}[\|\hat{m} - x_s\|_1].   (2)

Because the input images (x, y) are synthesized from the same latent code, they exhibit a strong relationship in content, and the content features extracted by the content encoder from x and y should theoretically be as similar as possible. Based on this observation, the feature distance loss L_fdist is added, which helps the content encoder find the domain-shared content features:

L_{fdist} = \lambda_{fdist} \mathbb{E}_{x,y}[\|E_c(x) - E_c(y)\|_1].   (3)

Fig. 4. Second Stage: Training of Generator.

For cycle consistency, the reconstructed image x̄ = D(E_c(x), E_s(x), l_x) should be the same as the original input x. To this end, the appearance reconstruction loss L_arec, consisting of an MSE loss and a VGG perceptual loss [16], is added:

L_{arec} = \lambda_{arec} \mathbb{E}_x[\|\bar{x} - x\|_2 + VGG(\bar{x}, x)].   (4)
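As a sketch, Eq. (4) could be implemented with an ImageNet-pretrained VGG as the perceptual feature extractor. The choice of VGG-19, the layer cut-off, and the newer torchvision weights API are assumptions of ours; the paper only states that a VGG perceptual loss [16] is used.

import torch
import torch.nn as nn
import torchvision.models as models

class AppearanceRecLoss(nn.Module):
    """Sketch of Eq. (4): pixel-space MSE plus an MSE on frozen VGG-19 features (up to relu4_1)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:21]
        self.vgg = vgg.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.mse = nn.MSELoss()

    def forward(self, rec, target):
        return self.mse(rec, target) + self.mse(self.vgg(rec), self.vgg(target))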

In addition, a domain classifier C is introduced as part of the adversarial network. The goal is to guide C to classify the domain of the content feature more accurately; confronting C helps train the content encoder to extract domain-shared content features. L2 normalization is used to improve the stability and generalization capability of the model. Both terms are included in a loss function L_reg (R indicates the gradient reversal layer [9]):

L_{reg} = \lambda_{reg} \mathbb{E}_x[\|E_c(x)\|_2] + \lambda_{cls} \mathbb{E}_x[-l_x \log C(R(E_c(x)))].   (5)
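The gradient reversal layer R of Eq. (5) is a standard construction [9]: identity in the forward pass, negated gradient in the backward pass. A minimal PyTorch sketch is shown below; the usage comment is illustrative, with C and the labels as placeholders.

import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, gradient multiplied by -lambd backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage inside Eq. (5) (classifier C and domain_label are placeholders):
# logits = C(grad_reverse(content_feature))
# l_reg  = lambda_reg * content_feature.norm(dim=1).mean() \
#          + lambda_cls * torch.nn.functional.cross_entropy(logits, domain_label)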

Considering the above, the complete objective function of the first stage can be written as:

\min_{E_c, E_s, D, C} \; L_{srec} + L_{sdist} + L_{fdist} + L_{arec} + L_{reg}.   (6)

3.2 Second Stage: Training of Generator

The second stage aims to train a generator G that builds multiple-level correspondences between the generated image ŷ and the content input x. G can generate images at a more detailed level, incorporating content features E_c(x) from the content input x and style features E_s(y) from the style input y. Two images (x, y) are used as inputs. At this stage, the content encoder E_c trained in the first stage is fixed in weight. Figure 4 illustrates the training process of the second stage.

Dynamic Skip Connection. In the first stage, E_c generates six content features, and the decoder D only uses the content feature of the last layer, because the primary purpose of D is to guide E_c to extract better coarse-level content features, such as shape and direction, and D does not need many details about the content. In the second stage, however, the generator G requires fine-level correspondences to generate high-quality images that encompass all the necessary details, so the content features of all six layers extracted by E_c are used by G. Inspired by GP-UNIT [27], we use dynamic skip connections to better integrate the multiple-layer features. A dynamic skip connection utilizes two gates (a reset gate and an update gate) to integrate the information from the current layer with the information from the previous layer.

Every dynamic skip connection layer takes S^{i-1} (hidden state from the previous layer), E_c^i (content feature of the current layer), and G^{i-1} (generated feature from the previous layer) as input, where i indicates the i-th layer of G. To match the scale of S^{i-1} with E_c^i, an upsampling operation, denoted by ↑, is applied to S^{i-1}. Similarly, σ and ◦ indicate the activation function and convolutional layers, respectively; W_m^i, W_E^i, W_r^i, and W_S^i represent the convolution weights of the update gate, new content, reset gate, and state, respectively; and Con(·, ·) indicates concatenation. The content feature E_c^0 is used as the initial hidden state S^0. Then, in layer i, the following operations are performed:

\hat{S}^{i-1} = \sigma(W_S^i \circ \uparrow S^{i-1}),
r^i = \sigma(W_r^i \circ Con(\hat{S}^{i-1}, E_c^i)),
m^i = \sigma(W_m^i \circ Con(\hat{S}^{i-1}, E_c^i)),
S^i = r^i \hat{S}^{i-1},
\hat{E}_c^i = \sigma(W_E^i \circ Con(S^i, E_c^i)),
Y = (1 - m^i) G^{i-1} + m^i \hat{E}_c^i.   (7)

It is worth mentioning that a mask loss L_msk is applied to m^i to avoid the impact of useless elements and retain only the most important content. L_msk is defined as:

L_{msk} = \lambda_{msk} \sum_i \mathbb{E}_x[\|m^i\|_1].   (8)
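A minimal PyTorch sketch of one dynamic skip connection layer of Eq. (7) is given below. The kernel sizes, the use of a sigmoid for σ, and the channel arguments are assumptions of ours, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSkipLayer(nn.Module):
    def __init__(self, state_ch, content_ch, gen_ch):
        super().__init__()
        self.w_s = nn.Conv2d(state_ch, content_ch, 3, padding=1)        # W_S (state)
        self.w_r = nn.Conv2d(content_ch * 2, content_ch, 3, padding=1)  # W_r (reset gate)
        self.w_m = nn.Conv2d(content_ch * 2, gen_ch, 3, padding=1)      # W_m (update gate)
        self.w_e = nn.Conv2d(content_ch * 2, gen_ch, 3, padding=1)      # W_E (new content)

    def forward(self, state_prev, content_i, gen_prev):
        s_hat = torch.sigmoid(self.w_s(F.interpolate(state_prev, scale_factor=2)))  # upsampled state
        cat = torch.cat([s_hat, content_i], dim=1)
        r = torch.sigmoid(self.w_r(cat))                   # reset gate r^i
        m = torch.sigmoid(self.w_m(cat))                   # update gate m^i (also used by L_msk, Eq. (8))
        state = r * s_hat                                  # S^i
        e_hat = torch.sigmoid(self.w_e(torch.cat([state, content_i], dim=1)))        # \hat{E}_c^i
        out = (1 - m) * gen_prev + m * e_hat               # fused output Y
        return out, state, m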

Style Consistency Loss. Although high-quality image generation has been achieved in UNIT, the style of the generated image ŷ often fails to align with the style of the input image y. This problem also limits the diversity of styles in the generated images. There are two underlying reasons for this phenomenon. One is the absence of a style consistency loss function, which results in a deviation of the learned style. The other is that the model prioritizes learning styles that appear more frequently in the style input. Based on these two points, we design a style reconstruction loss L_styr, computed between the reconstructed image ȳ = G(E_c(y), E_s(y)) and the style input y in the style domain through an MSE loss:

L_{styr} = \lambda_{styr} \mathbb{E}_y[\|E_s(\bar{y}) - E_s(y)\|_2].   (9)


Full Objective Function. Our complete loss function in the second stage is written as:

\min_{E_s, G} \max_{P} \; L_{adv} + L_{sty} + L_{styr} + L_{msk} + L_{con} + L_{rec},   (10)

where L_adv indicates the adversarial loss with the discriminator P [10]:

L_{adv} = \lambda_{adv} \mathbb{E}_{x,y}[\log(1 - P(\hat{y}))] + \mathbb{E}_y[\log P(y)],   (11)

where ŷ = G(E_c(x), E_s(y)) is generated by integrating the content features E_c(x) of input x and the style features E_s(y) of input y. Because of this, the generated image ŷ should be the same as x in the content domain and the same as y in the style domain. Thus, L_con and L_sty are applied:

L_{con} = \lambda_{con} \mathbb{E}_{x,y}[\|E_c(\hat{y}) - E_c(x)\|_1],   (12)

L_{sty} = \lambda_{sty} \mathbb{E}_{x,y}[\|E_s(\hat{y}) - E_s(y)\|_1].   (13)

The purpose of L_rec in the second stage is consistent with L_arec in the first stage. L_rec is defined as:

L_{rec} = \lambda_{rec} \mathbb{E}_y[\|\bar{y} - y\|_1 + VGG(\bar{y}, y)],   (14)

where ȳ = G(E_c(y), E_s(y)) denotes the reconstruction of the original y.
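The second-stage objective could be assembled as in the following sketch, where E_c, E_s, G and P are the content encoder, style encoder, generator and discriminator, and the λ weights are passed in a dictionary. The encoders are treated as returning single tensors, the VGG term of Eq. (14) is omitted for brevity, the discriminator is assumed to output probabilities, and the alternating generator/discriminator updates of the min-max game are not shown.

import torch
import torch.nn.functional as F

def second_stage_losses(x, y, Ec, Es, G, P, w):
    y_hat = G(Ec(x), Es(y))          # translated image: content of x, style of y
    y_bar = G(Ec(y), Es(y))          # reconstruction of the style input y

    l_adv = w["adv"] * (torch.log(1 - P(y_hat)).mean() + torch.log(P(y)).mean())  # Eq. (11)
    l_con = w["con"] * F.l1_loss(Ec(y_hat), Ec(x))                                 # Eq. (12)
    l_sty = w["sty"] * F.l1_loss(Es(y_hat), Es(y))                                 # Eq. (13)
    l_rec = w["rec"] * F.l1_loss(y_bar, y)                   # Eq. (14), VGG term omitted in this sketch
    l_styr = w["styr"] * F.mse_loss(Es(y_bar), Es(y))        # Eq. (9): style consistency loss
    return l_adv + l_con + l_sty + l_rec + l_styr            # L_msk is accumulated inside the generator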

Fig. 5. Visual comparison. Each group shows, from left to right, the content input, the style input, our result, GP-UNIT, and StarGAN2. Our generated images are more realistic and have higher correspondences with the inputs.

4 Experiment

4.1 Setting

Dataset. In the first stage, both real and synthetic data are used. For real data, ImageNet [25] processed with HTC is applied as input, where HTC is used to find and tailor the object region. Each domain contains 600 images for training, except the face domain, which includes 29K images from CelebA-HQ [17]. For synthetic data, 655 images of each domain are generated by BigGAN; images with the same label in each domain correlate strongly because BigGAN uses the same latent code. This dataset, called symImageNet291, contains 291 domains generated by BigGAN, and 600 images per domain are used for training. In the second stage, three datasets are prepared for five image translation tasks. For Dog↔Cat translation, the AFHQ dataset [7] is used, and each domain uses 4K images for training. As done in GP-UNIT, for Bird↔Dog translation, four classes of birds and four classes of dogs are selected from ImageNet291 for training (each class contains 600 images). For Male↔Female, CelebA-HQ is used for training. For Cat↔Face, 4K images of cats from AFHQ and 29K images of faces from CelebA-HQ are used for training. For Bird↔Car, four classes of birds and four classes of cars from ImageNet291 are chosen (each class contains 600 images) for training.

Setting of Hyper-Parameters. We use one NVIDIA GeForce RTX 3090 to train the model with PyTorch. The batch size is set to 10. The training time of the first stage is about one day, and the training time of the second stage is about two days. The number of iterations is set to 180K. The model occupies about 22 GB of memory. In the first stage, the hyper-parameters are set as λsrec = 5.0, λsdist = 5.0, λfdist = 1.0, λarec = 1.0, λreg = 0.001, and λcls = 1.0. In the second stage, we set λadv = 1.0, λsty = 50.0, λstyr = 0.5, λmsk = 1.0, and λcon = 1.0. It is worth noting that the dynamic skip connection is only used in the second and third layers of the generator to better preserve the most essential features.

4.2 Experimental Results

Qualitative Comparison. Visual comparisons on different translation tasks are shown in Fig. 5. We can see from Fig. 5 that GP-UNIT performs poorly in style consistency, outline recovery, and detail rebuilding. StarGAN2 performs well in domains involving faces, but its performance is greatly reduced in tasks involving objects without the five sense organs, and in terms of style consistency, StarGAN2 performs worse than GP-UNIT. Interestingly, our method performs better than its competitors in both style similarity and content correspondence.

Quantitative Comparison. For the evaluation of generated image quality and diversity, the Frechet Inception Distance (FID) [11] and Learned Perceptual Image Patch Similarity (LPIPS) [28] are introduced. FID estimates the quality of images by calculating the distance between the feature vectors of the generated and original images.


Table 1. Quantitative comparison. FID and LPIPS are used to assess the quality and diversity of generated images. Except for the diversity of generated images in StarGAN2, our model performs better than others.

Task                Male ↔ Female     Dog ↔ Cat         Human Face ↔ Cat
Metric              FID      LPIPS    FID      LPIPS    FID       LPIPS
TraVeLGAN [1]       66.60    −        58.91    −        85.28     −
U-GAT-IT [20]       29.47    −        38.31    −        110.57    −
MUNIT [13]          22.64    0.37     80.93    0.47     56.89     0.53
COCO-FUNIT [26]     39.19    0.35     97.08    0.08     236.90    0.33
StarGAN2 [7]        14.61    0.45     22.08    0.45     11.35     0.51
GP-UNIT [27]        14.63    0.37     15.29    0.51     13.04     0.49
SC-UNIT             11.45    0.37     10.38    0.54     9.04      0.53

Task                Bird ↔ Dog        Bird ↔ Car        Average
Metric              FID      LPIPS    FID      LPIPS    FID       LPIPS
TraVeLGAN [1]       169.98   −        164.28   −        109.01    −
U-GAT-IT [20]       178.23   −        194.05   −        110.12    −
MUNIT [13]          217.68   0.57     121.02   0.60     99.83     0.51
COCO-FUNIT [26]     30.27    0.51     207.92   0.12     122.27    0.28
StarGAN2 [7]        20.54    0.52     29.28    0.58     19.57     0.50
GP-UNIT [27]        11.29    0.60     13.93    0.61     13.64     0.52
SC-UNIT             10.36    0.63     11.88    0.62     10.64     0.54

Fig. 6. Effect of style consistency module.

Fig. 7. Limitation. SC-UNIT fails to build direction correspondences in extremely distant domains.


The lower the FID, the higher the quality of the generated images. LPIPS is introduced to estimate the diversity of the generated images by calculating the perceptual loss between pairs of generated images; the higher the LPIPS score, the more diverse the generated images. The comparison results are listed in Table 1. Except for the diversity of the Male↔Female translation, our method exceeds the other methods in all tasks in terms of both FID and LPIPS.

Ablation Study. Figure 6 shows what is generated if we do not use the style consistency (SC) module in our model. The generated result without the style consistency module cannot keep the style consistent with the style input; the subjects of the two images do not seem to belong to the same species. Moreover, because of the absence of the style consistency module, the generated image learns the left eye from the style input and the right eye from the content input, which is not a good behavior in UNIT.

4.3 Limitations

Although our SC-UNIT outperforms state-of-the-art methods in image quality, style consistency, and style diversity, it still has some limitations. Figure 7 shows a typical limitation: SC-UNIT sometimes translates the tail of a bird into the front of a car, because both are the thinner parts of the objects. SC-UNIT pays more attention to shape reconstruction while ignoring direction.

5 Conclusion

In this paper, we design a novel two-stage Unsupervised Image-to-Image Translation framework based on a style consistency module to make the style of generated images more similar to that of the style input, thereby reducing style deviation. It also improves feature fusion, making the resulting images more natural and diverse in style. A breakthrough in feature decoupling would further benefit UNIT, since it would help the model learn domain-shared and domain-specific features more precisely. An interesting extension would be to apply the model to more domains with vast differences. Furthermore, the core ideas of the feature fusion method may apply to other research domains to generate data that would otherwise be unobtainable. We expect to see more exciting research on UNIT in tasks with greater challenges.

Acknowledgement. This work was partly supported by the National Natural Science Foundation of China under Grant 62272116. The authors acknowledge the Network Center of Guangzhou University for providing HPC computing resources.

References 1. Amodio, M., Krishnaswamy, S.: Travelgan: image-to-image translation by transformation vector learning. In: CVPR, pp. 8983–8992 (2019)


2. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: ICLR. vol. abs/1809.11096 (2019) 3. Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: ICCV, pp. 1021–1030 (2017) 4. Chan, K.C., Wang, X., Xu, X., Gu, J., Loy, C.C.: GLEAN: generative latent bank for large-factor image super-resolution. In: CVPR, pp. 14245–14254 (2021) 5. Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR, pp. 4974–4983 (2019) 6. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: CVPR, pp. 8789–8797 (2018) 7. Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: Stargan v2: diverse image synthesis for multiple domains. In: CVPR, pp. 8188–8197 (2020) 8. Collins, E., Bala, R., Price, B., Susstrunk, S.: Editing in style: uncovering the local semantics of GANs. In: CVPR, pp. 5771–5780 (2020) 9. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML, pp. 1180–1189 (2015) 10. Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS, pp. 2672–2680 (2014) 11. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local NASH equilibrium. In: NeurIPS, pp. 6629–6640 (2017) 12. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, pp. 1501–1510 (2017) 13. Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-toimage translation. In: ECCV, pp. 172–189 (2018) 14. Jiang, L., Zhang, C., Huang, M., Liu, C., Shi, J., Loy, C.C.: TSIT: a simple and versatile framework for image-to-image translation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 206–222. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8 13 15. Jiang, Y., Huang, Z., Pan, X., Loy, C.C., Liu, Z.: Talk-to-edit: fine-grained facial editing via dialog. In: ICCV, pp. 13799–13808 (2021) 16. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10. 1007/978-3-319-46475-6 43 17. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: ICLR. vol. abs/1710.10196 (2018) 18. Karras, T., et al.: Alias-free generative adversarial networks. In: NeurIPS, pp. 852– 863 (2021) 19. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR, pp. 4401–4410 (2019) 20. Kim, J., Kim, M., Kang, H., Lee, K.H.: U-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In: ICLR. vol. abs/1907.10830 (2019) 21. Liu, M., et al.: STGAN: a unified selective transfer network for arbitrary image attribute editing. In: CVPR, pp. 3673–3682 (2019) 22. Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: NeurIPS, pp. 700–708 (2017)

334

B. Lai and Y.-G. Wang

23. Liu, M.Y., et al.: Few-shot unsupervised image-to-image translation. In: ICCV, pp. 10551–10560 (2019) 24. Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: StyleCLIP: text-driven manipulation of stylegan imagery. In: ICCV, pp. 2085–2094 (2021) 25. Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. IJCV 115, 211–252 (2015) 26. Saito, K., Saenko, K., Liu, M.-Y.: COCO-FUNIT: few-shot unsupervised image translation with a content conditioned style encoder. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 382–398. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8 23 27. Yang, S., Jiang, L., Liu, Z., Loy, C.C.: Unsupervised image-to-image translation with generative prior. In: CVPR, pp. 18332–18341 (2022) 28. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR, pp. 586–595 (2018) 29. Zhu, J., Shen, Y., Zhao, D., Zhou, B.: In-domain GAN inversion for real image editing. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 592–608. Springer, Cham (2020). https://doi.org/10.1007/ 978-3-030-58520-4 35 30. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, pp. 2223–2232 (2017)

SemanticCrop: Boosting Contrastive Learning via Semantic-Cropped Views

Ya Fang1, Zipeng Chen1, Weixuan Tang2(B), and Yuan-Gen Wang1(B)

1 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
2 Institute of Artificial Intelligence and Blockchain, Guangzhou University, Guangzhou, China
{yafangna,czp}@e.gzhu.edu.cn, {tweix,wangyg}@gzhu.edu.cn

Abstract. Siamese-structure-based contrastive learning has shown excellent performance in learning visual representations due to its ability to minimize the distance between positive pairs and increase the distance between negative pairs. Existing works mostly employ RandomCrop or ContrastiveCrop to obtain positive pairs of an image. However, RandomCrop causes the cropped views to contain many useless backgrounds, while ContrastiveCrop produces positive pairs that are too similar. In this paper, we propose a novel SemanticCrop to yield cropped views containing as much semantic information as possible. Specifically, SemanticCrop first computes a heatmap of an image. Then, an empirical threshold is tuned to box out a semantic region whose heatmap values are over this threshold. Finally, we design a center-suppressed probabilistic sampling to avoid excessive similarity between positive pairs, making the cropped view contain more parts of an object. As a plug-and-play module, the MoCo, SimCLR, SimSiam, and BYOL models equipped with our SemanticCrop module achieve an accuracy improvement from 0.5% to 2.34% on the CIFAR10, CIFAR100, IN-200, and IN-1K datasets. The code is available at https://github.com/GZHU-DVL/SemanticCrop.

Keywords: Contrastive learning · Self-supervised learning · Data augmentation

1 Introduction

With the rapid development of deep learning technology, many supervised methods have been proposed [12,17,22]. These methods utilized sufficient label information to train models and achieved reliable performance in downstream tasks. However, obtaining adequate labeled data in real scenarios is challenging, which limits further application of the supervised methods. In contrast, self-supervised learning (SSL) has been drawing increasing attention for its ability to train models on unlabeled data. As one of the most promising directions in SSL, contrastive learning stands out with its superior performance on downstream tasks [2,7,8,11,14,19,24,26].

Contrastive learning first pairs two different augmented views of the same input image as positive samples and uses the augmented views of other input images as negative samples. Then, it works by reducing the distance between positive samples and increasing the distance between negative samples in the feature space. As such, the quality of positive sample pairs is crucial for contrastive learning. [23] aimed to enhance the quality of positive sample pairs, while [21] achieved good performance through strategies like image blending and label smoothing. In addition, [27] directly modified the values in the feature space to increase the distance between the positive and negative samples. Although these approaches employ different strategies to obtain positive samples, they all use RandomCrop to generate input samples.
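For readers who want a concrete picture of this objective, the following is a minimal sketch of an InfoNCE-style contrastive loss of the kind such Siamese frameworks typically minimize; it is illustrative only, and the temperature value and batch construction are assumptions rather than settings taken from this paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """Minimal InfoNCE-style contrastive loss for two batches of embeddings.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Matching rows are positives; all other rows in the batch act as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # (N, N) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device) # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```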

Fig. 1. The motivation of our proposed SemanticCrop. (a) shows the cropped image obtained through RandomCrop, (b) shows the cropped image obtained through ContrastiveCrop, and (c) shows the cropped image obtained through SemanticCrop.

RandomCrop is widely used for data augmentation. However, it fails to fully utilize semantic information during the cropping process and may even damage the semantic information of the images. As shown in Fig. 1(a), the cropped images from RandomCrop may contain excessive background. ContrastiveCrop [16] designs semantic-aware localization and center-suppressed sampling to utilize the semantic information of the images during the cropping process. Semantic-aware localization provides an operational area for cropping based on semantic information, ensuring that each cropped image contains the object. Following a β distribution, center-suppressed sampling selects a point as the center point per augmentation to reduce the similarity of the cropped samples (see Fig. 1(b)). However, such center-suppressed sampling can still pick center points that are very close to each other, making the positive sample pairs highly similar. To address this issue, this paper proposes a plug-and-play method called SemanticCrop. Specifically, we apply heatmap localization to determine the cropping operational area for an input image. We design a new center-suppressed probabilistic sampling approach within this area, where a small probability value is introduced as a constraint during the cropping process. By incorporating this additional constraint, we can further push the centers of the two consecutive crops apart, resulting in a stronger center suppression effect (as illustrated in

Fig. 1(c)). This leads to lower similarity between positive sample pairs, increasing the diversity of input samples to the model and improving its generalization ability. The main contributions of our work can be summarized as follows:

– To the best of our knowledge, this is the first study to investigate how to optimize center-suppressed sampling to make cropping more effective.
– We propose a new method that adds a small constraint when sampling the center point during the cropping process. This constraint helps alleviate cropping instability while preserving valuable semantic information.
– Through extensive experiments, we determine the empirically optimal probability values for different contrastive learning models, ensuring that our method performs best across various contrastive learning models.

2 Related Work

2.1 Contrastive Learning

Contrastive learning has attracted much attention in unsupervised learning because it exploits unlabeled data to train models. In contrastive learning, each input image is randomly augmented to generate two different augmented images. Then, the feature embeddings of positive samples are pulled together and those of negative samples are pushed apart through a contrastive loss. In MoCo [10], a queue-based approach was proposed to store many negative samples, reducing the storage space required for the memory bank. It also utilized momentum updates to ensure the consistency of the features encoded by the encoder in the network architecture. In SimCLR [3], a fully connected layer was introduced to increase the diversity of sample features, and a sufficiently large batch size was used to ensure that the model could incorporate a significant number of negative samples. In BYOL [9], the traditional dictionary lookup task in contrastive learning was transformed into a prediction task, enabling the model to overcome the limitation of the number of negative samples, and beneficial techniques from previous works were also integrated into the network. In SimSiam [5], it was discovered through model simplification that siamese networks are natural and effective core tools for maintaining invariance in contrastive learning models. In addition to model architecture, researchers [1,2,18,25,28] have also explored the selection of training samples. Furthermore, theoretical analysis and empirical studies have been proposed to better understand the essence of contrastive learning.

2.2 Positive Sample Generation

The quality of positive samples is crucial for contrastive learning. Previous works have mostly focused on studying how to obtain high-quality positive sample pairs. In this regard, SimCLR [3] studied the impact of various data augmentation techniques on model performance and found that cropping and color variations are two effective methods for enhancing model performance. Additionally,

due to the limitations of self-supervised learning on complex scene images with multiple objects, CAST [20] introduced unsupervised saliency maps to adjust the cropping process, making the models more robust, and [15] took the object-scene relation into account when making crops but requires additional object proposal algorithms. SimCLR [3], CAST, and [15] have improved the data augmentation methods, but fundamentally they still rely on the equiprobable selection of cropping center points in RandomCrop. ContrastiveCrop [16] addressed this issue by introducing semantic-aware localization and center-suppressed sampling to leverage semantic information while avoiding excessively high similarity in positive sample pairs. However, ContrastiveCrop fails to effectively suppress high similarity in positive sample pairs. In other words, even with center-suppressed sampling, there can still be instances where the similarity in positive sample pairs is too high (see Fig. 1(b)). In our paper, we propose SemanticCrop. Specifically, we introduce a probability value as a secondary constraint during center point sampling. When a sampled center point falls below this probability value, it is discarded and re-sampled. This further increases the distance between sampled center points during cropping, enhancing the suppression ability of center-suppressed sampling and improving control over the similarity of positive sample pairs.

3 Proposed Method

In this section, we introduce SemanticCrop. As shown in Fig. 2 and Algorithm 1, we first provide the parameters required for cropping and input an image into the contrastive learning model. Then, heatmap localization obtains a bounding box as the cropping operational area. Next, center-suppressed probabilistic sampling is used to obtain the cropping center point within the given operational area. Finally, the cropped images are obtained with the help of the sampled center point and the provided parameters. We explain heatmap localization and center-suppressed probabilistic sampling in detail in Sect. 3.2 and Sect. 3.3, respectively.

Fig. 2. Flowchart of the cropping process for SemanticCrop. The curve in the graph represents the function curve of the Beta distribution. The red line indicates the points that need to be discarded, and the blue line represents the points that can be chosen as center points. (Color figure online)

3.1 Preliminaries

RandomCrop is a widely used data augmentation technique in supervised and unsupervised learning. It increases input data diversity, enhancing the model's generalization ability. Specifically, given an input image I with a scale s and an aspect ratio r, the width and height of the cropped image are determined based on s and r. Then, we randomly choose a point inside the image I as the center to crop this image. The formula for RandomCrop can be described as:

(x, y, h, w) = Rcrop(s, r, I),    (1)

where (x, y) represents the center coordinates of the crop and (h, w) represents the height and width of the crop window; s represents the scale, r represents the aspect ratio, and I represents the input image. While RandomCrop enhances the diversity of samples in the original dataset, the practice of selecting every point in the image as the cropping center with equal probability fails to utilize the useful semantic information inherent in the images. As shown in Fig. 1(a), it may even harm the semantic information of the images due to excessive cropping into the background.
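To make Eq. (1) concrete, here is a minimal sketch of how such a crop window can be sampled; the scale and aspect-ratio ranges are illustrative, torchvision-style assumptions rather than this paper's settings, and the function name is hypothetical.

```python
import random
from PIL import Image

def random_crop_window(img: Image.Image, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3)):
    """A sketch of R_crop(s, r, I): returns the center (x, y) and size (h, w)
    of a uniformly sampled crop window over the whole image."""
    W, H = img.size
    area = random.uniform(*scale) * W * H       # crop area driven by the scale s
    r = random.uniform(*ratio)                  # aspect ratio r (w / h)
    w = min(int(round((area * r) ** 0.5)), W)
    h = min(int(round((area / r) ** 0.5)), H)
    x = random.uniform(w / 2, W - w / 2)        # every feasible center is equally likely
    y = random.uniform(h / 2, H - h / 2)
    return x, y, h, w
```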

3.2 Heatmap Localization

The semantic information of an image is valuable for a contrastive learning model. Following ContrastiveCrop [16], we adopt heatmap localization to exclude background and locate targets. Specifically, in the early training stage, we use the RandomCrop method as a warm-up to give the model better feature extraction capabilities. Then, in the feature map output by the model for the input image, we highlight the regions with feature values above t as the highlighted regions of the heatmap. Next, we introduce an indicator function to obtain the bounding box of an object from the heatmap, which serves as the cropping operational area for subsequent processes (as shown in Fig. 2). This process can be described simply as

H = L(1[M > t]),    (2)

where M represents the heatmap, t ∈ [0, 1] is the threshold of activations, 1 is the indicator function, and L calculates the rectangular closure of the activated positions. Then, we modify the formula of RandomCrop as

(x̃, ỹ, h̃, w̃) = Rcrop(s, r, H),    (3)

where the definitions of x̃, ỹ, h̃, w̃, s, r, and Rcrop are similar to Eq. (1).
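A minimal sketch of the localization step in Eqs. (2)–(3), under the assumption that the heatmap M is obtained by channel-averaging and min–max normalizing the last feature map; the fallback when no activation exceeds t is an added safeguard, and the returned box would still need to be rescaled from feature-map coordinates to image coordinates.

```python
import torch

def heatmap_bounding_box(feat, t=0.1):
    """L(1[M > t]): rectangular closure of activated positions in a (C, h, w) feature map."""
    heat = feat.mean(dim=0)                                          # collapse channels -> (h, w)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)    # normalized heatmap M in [0, 1]
    ys, xs = torch.where(heat > t)                                   # indicator 1[M > t]
    if len(xs) == 0:                                                 # nothing activated: fall back to full map
        return 0, 0, heat.shape[1] - 1, heat.shape[0] - 1
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()
```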

3.3 Center-Suppressed Probabilistic Sampling

After introducing heatmap localization, we make better use of the semantic information of the images. However, due to the reduced cropping operational area, the similarity between the cropped images is significantly increased. This causes the model to

receive more features with extremely high similarity and reduces its generalization capability. To address this issue, we design center-suppressed probabilistic sampling to reduce the similarity of positive sample pairs. Specifically, when selecting the cropping center point within the given cropping operational area, we utilize a beta distribution β(α, α), a U-shaped distribution, to adjust the probability of each point being selected as a cropping center point. This helps increase the likelihood of center points being closer to the boundaries of the cropping area and reduces the possibility of center points clustering together (see Fig. 2). Additionally, we introduce a probability value k as a secondary constraint. When the selected center point falls below k, it is discarded and re-sampled, further pushing the cropping center point toward the boundaries of the operational area. This method transforms the beta distribution into

β̂ = β(α, α) > k,    (4)

where k is a value in [0, 1]. This allows us to discard points below the threshold value k, and by adjusting the value of k, we can easily fine-tune the technique for different datasets to achieve better results. Finally, we formulate our SemanticCrop as

(x̂, ŷ, ĥ, ŵ) = Scrop(s, r, H),    (5)

where Scrop refers to the sampling function incorporating center-suppressed probabilistic sampling, and H represents the bounding box of the earlier heatmap.

Algorithm 1. SemanticCrop
Input: image I, crop scale s, crop ratio r, threshold of activations t, probability value k, parameter of the β distribution α
  h = √(s · r)                      // height of the crop
  w = √(s / r)                      // width of the crop
  F = Forward(I)                    // features of the last layer
  M = Normalization(F)              // feature map after normalizing
  H = L(1[M > t])                   // the cropping operational area from Eq. (2)
  repeat                            // sample the crop center (x, y) from β̂(α, α) by Eq. (4)
    u ∼ β(α, α), v ∼ β(α, α)
    x = Hx0 + (Hx1 − Hx0) · u
    y = Hy0 + (Hy1 − Hy0) · v
  until u > k and v > k
Output: Scrop = (x, y, h, w)
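As a rough Python sketch of the sampling loop in Algorithm 1 (the variable names, the α value, and the default k are illustrative rather than the paper's exact settings):

```python
import random

def center_suppressed_sample(box, alpha=0.6, k=1e-6):
    """Sample a crop center inside box = (x0, y0, x1, y1) from a U-shaped Beta(alpha, alpha),
    discarding and re-sampling draws whose values do not exceed the probability value k."""
    x0, y0, x1, y1 = box
    while True:
        u = random.betavariate(alpha, alpha)     # U-shaped: mass concentrates near 0 and 1
        v = random.betavariate(alpha, alpha)
        if u > k and v > k:                      # secondary constraint of Eq. (4)
            break
    x = x0 + (x1 - x0) * u                       # centers pushed toward the region boundary
    y = y0 + (y1 - y0) * v
    return x, y
```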

4 Experiments

4.1 Datasets

CIFAR10 [13] is a commonly used image classification dataset comprising 60,000 color images of 10 different categories. Each category contains 6,000

images. The images are of size 32 × 32 pixels and are divided into training and testing sets, with 50,000 images in the training set and 10,000 images in the testing set.
CIFAR100 [13] consists of 60,000 color images belonging to 100 different categories, each containing 600 images. The images are also 32 × 32 pixels and are divided into training and testing sets for training and evaluating the performance of models.
IN-200 [16] is a subset derived from the ImageNet dataset. Based on the source code provided by ContrastiveCrop [16], 200 classes were selected from ImageNet to create this new dataset for quick model training and evaluation. Each class in Imagenet-200 consists of several hundred images, resulting in a total of approximately 20,000 images.
ImageNet [6] is widely used for computer vision tasks such as image classification, object detection, and image recognition. It contains millions of images from various categories. The training set of ImageNet consists of approximately 1.2 million images from 1,000 different categories. These images cover a wide range of objects, scenes, and concepts. The validation set contains 50,000 images and is used for model validation and fine-tuning. The test set consists of 100,000 images and is used to evaluate the performance of models on unseen images.

4.2 Comparative Methods

To evaluate our method against the baseline models, we follow the same experimental protocol as ContrastiveCrop [16]. We employ four commonly used contrastive learning methods: MoCo [10], SimCLR [3], BYOL [9], and SimSiam [5]. We train and test these baseline methods using the source code provided by ContrastiveCrop. The accuracy values regarding ContrastiveCrop and RandomCrop in Table 1 are obtained by running the source code provided by ContrastiveCrop.

4.3 Implementation Details

Following the experimental protocol of ContrastiveCrop [16], we employ linear classification to validate the effectiveness of our method. We freeze the pre-trained weights and train a supervised linear classifier on top of them. The top-1 classification accuracy on the validation set is reported as the result. For CIFAR10 and CIFAR100, during the pre-training phase, we train the ResNet-18 model for 500 epochs with a learning rate of 0.5. The linear classifier is initially trained with a learning rate of 10.0 for 100 epochs, and the learning rate is multiplied by 0.1 at the 60th and 80th epochs. In the experiments on ImageNet, ResNet-50 is used as the base model. The pre-training settings of MoCo V1 [4], MoCo V2 [10], and SimSiam [5] exactly follow their original works. The parameter settings for SimCLR [3] remain consistent with ContrastiveCrop [16].

4.4 Experiment Results

Our method improves the accuracy of contrastive learning models trained using this approach on datasets with many samples and challenging training scenarios. In conclusion, SemanticCrop is effective for mainstream contrastive learning models, regardless of the size and diversity of the dataset.

Table 1. Linear classification results on CIFAR10 and CIFAR100.

Method        CIFAR10                    CIFAR100
              Rcrop    Ccrop    Scrop    Rcrop    Ccrop    Scrop
MoCo [10]     86.64    87.71    88.39    59.35    61.27    61.69
SimCLR [3]    89.55    89.74    90.03    60.71    61.85    62.20
BYOL [9]      91.53    91.80    92.33    63.86    64.90    65.15
SimSiam [5]   90.48    91.04    91.17    63.82    64.16    64.74

Results on CIFAR10 and CIFAR100. As shown in Table 1, under the same experimental settings as ContrastiveCrop, training MoCo with SemanticCrop improves the classification accuracy by 0.68% and 0.42% on CIFAR10 and CIFAR100, respectively. SimCLR shows improvements of 0.3% and 0.35%, BYOL shows improvements of 0.53% and 0.25%, and SimSiam shows improvements of 0.13% and 0.58% on the same datasets. The experimental results demonstrate that SemanticCrop has a positive effect on the classification performance of different contrastive learning models on small datasets.

Table 2. Linear classification results on IN-200.

Method     Arch.       Epoch   IN-200 Top-1
                               Ccrop    Ours
MoCo V1    ResNet-50   50      56.56    56.71
MoCo V2    ResNet-50   50      52.83    53.11
SimCLR     ResNet-50   50      52.65    52.86
SimSiam    ResNet-50   50      51.51    52.17

Table 3. Linear classification results on IN-1K.

Method            Arch        Epoch   IN-1K Top-1
MoCo V1 + Ccrop   ResNet-50   100     57.22
MoCo V1 + Ours    ResNet-50   100     57.42
MoCo V2 + Ccrop   ResNet-50   100     63.21
MoCo V2 + Ours    ResNet-50   100     63.49

Results on ImageNet. As shown in Table 2 and Table 3, different contrastive learning models trained with SemanticCrop improve linear classification accuracy on ImageNet, which has larger image sizes and more challenging classification tasks. On IN-200, MoCo V1 shows an improvement of 0.15%, MoCo V2 shows an improvement of around 0.3%, SimCLR shows an improvement of 0.21%, and SimSiam shows an improvement of 0.66%. On IN-1K, MoCo V1 shows an improvement of 0.20%, and MoCo V2 shows an improvement of around 0.28%. The experimental results show that the proposed method is still effective on large datasets.

Fig. 3. Ablation results on CIFAR100 and CIFAR10. The line on the left represents the classification accuracy of MoCo on CIFAR100 under the influence of different probability values, while the line on the right represents the classification accuracy of SimCLR on CIFAR10 under the influence of different probability values. The probability values are represented as 10 raised to the power of a negative exponent.

So far, according to Table 1, Table 2, and Table 3, it can be seen that there is a greater improvement in performance on the smaller datasets of CIFAR-10 and CIFAR-100, while the improvement on the larger datasets of IN-1K and IN-200 is relatively small.

4.5 Ablation Experiment

Probability Value Selection for Different Comparative Methods. We conduct ablation experiments to explore the influence of different probability values on various contrastive learning models and confirm that each model has its own optimal probability value. As shown in Fig. 3, the performance of different contrastive learning models varies when trained with different probability values on different datasets. However, at a specific probability value, the contrastive learning models consistently outperform the models trained using the ContrastiveCrop method. For instance, MoCo achieves the highest classification accuracy when
the probability value is set to 10⁻⁶, while SimCLR performs best when the probability value is set to 10⁻⁸. This demonstrates that different contrastive learning models have different optimal probability values, and using too large or too small probability values may decrease model accuracy (see Fig. 3). Based on these observations, we conjecture that these disparities stem from the divergent focal points of different models within the feature space, leading to varied sensitivity towards the similarity of positive samples.

SemanticCrop with Other Transformations. To better compare SemanticCrop with ContrastiveCrop, we test the performance of models trained on the CIFAR100 dataset using RandomCrop, ContrastiveCrop, SemanticCrop, and a combination of various other data augmentation techniques. As shown in Table 4, it can be observed that SemanticCrop, whether used alone or in combination with other data augmentation techniques, consistently outperforms models trained using RandomCrop and ContrastiveCrop. This demonstrates that SemanticCrop positively impacts the performance of the contrastive learning models, surpassing the performance of ContrastiveCrop.

Table 4. Ablation experiment results on the CIFAR-100 dataset with various combined data augmentation techniques.

Flip   ColorJitter + Grayscale   Blur   Rcrop   Ccrop   Scrop
✓      ✓                         ✓      59.35   61.27   61.69
✓                                       47.83   48.37   48.77
       ✓                                58.37   60.34   60.57
                                 ✓      23.27   24.54   24.79
                                        46.00   48.00   48.26

5 Conclusion

SemanticCrop

345

improves the performance of mainstream contrastive learning models regardless of the size and complexity of datasets. Acknowledgement. This work was supported in part by the National Natural Science Foundation of China under Grants 62272116 and 62002075, in part by the Basic and Applied Basic Research Foundation of Guangdong Province under Grant 2023A1515011428, and in part by the Science and Technology Foundation of Guangzhou under Grant 2023A04J1723. The authors acknowledge the Network Center of Guangzhou University for providing HPC computing resources.

References 1. Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., Saunshi, N.: A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229 (2019) 2. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV, pp. 9650–9660 (2021) 3. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML, pp. 1597–1607 (2020) 4. Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020) 5. Chen, X., He, K.: Exploring simple Siamese representation learning. In: CVPR, pp. 15750–15758 (2021) 6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009) 7. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88, 303–338 (2010) 8. Faster, R.: Towards real-time object detection with region proposal networks. NeurIPS 9199(10.5555), 2969239–2969250 (2015) 9. Grill, J.B., et al.: Bootstrap your own latent-a new approach to self-supervised learning. NeurIPS 33, 21271–21284 (2020) 10. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961– 2969 (2017) 11. Khosla, P., et al.: Supervised contrastive learning. NeurIPS 33, 18661–18673 (2020) 12. Khosla, P., et al.: Supervised contrastive learning. NeurIPS 33, 18661–18673 (2020) 13. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) 14. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 15. Mishra, S., et al.: Object-aware cropping for self-supervised learning. arXiv preprint arXiv:2112.00319 (2021) 16. Peng, X., Wang, K., Zhu, Z., Wang, M., You, Y.: Crafting better contrastive views for siamese representation learning. In: CVPR, pp. 16031–16040 (2022) 17. Peng, Y., He, X., Zhao, J.: Object-part attention model for fine-grained image classification. TIP 27(3), 1487–1500 (2017) 18. Purushwalkam, S., Gupta, A.: Demystifying contrastive self-supervised learning: invariances, augmentations and dataset biases. NeurIPS 33, 3407–3418 (2020)

19. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39(6), 1137–1149 (2017)
20. Selvaraju, R.R., Desai, K., Johnson, J., Naik, N.: Casting your model: learning to localize improves self-supervised representations. In: CVPR, pp. 11058–11067 (2021)
21. Shen, Z., Liu, Z., Liu, Z., Savvides, M., Darrell, T.: Rethinking image mixture for unsupervised visual representation learning. arXiv preprint arXiv:2003.05438 (2020)
22. Sun, M., Yuan, Y., Zhou, F., Ding, E.: Multi-attention multi-class constraint for fine-grained image recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 834–850. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_49
23. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 776–794. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_45
24. Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution discrepancy. NeurIPS 32, 8250–8260 (2019)
25. Xiao, T., Wang, X., Efros, A.A., Darrell, T.: What should not be contrastive in contrastive learning. arXiv preprint arXiv:2008.05659 (2020)
26. Zhang, J., Wang, Y., Zhou, Z., Luan, T., Wang, Z., Qiao, Y.: Learning dynamical human-joint affinity for 3D pose estimation in videos. TIP 30, 7914–7925 (2021)
27. Zhu, R., Zhao, B., Liu, J., Sun, Z., Chen, C.W.: Improving contrastive learning by visualizing feature transformation. In: ICCV, pp. 10306–10315 (2021)
28. Zoph, B., et al.: Rethinking pre-training and self-training. NeurIPS 33, 3833–3845 (2020)

Transformer-Based Multi-object Tracking in Unmanned Aerial Vehicles

Jiaxin Li and Hongjun Li(B)

School of Information Science and Technology, Nantong University, Nantong 226019, Jiangsu, People's Republic of China
[email protected]

Abstract. Unmanned aerial vehicles (UAVs) possess a wide field of view, maneuverability, and autonomy in the field of multi-object tracking. By combining the high-altitude perspective and multidimensional perception capabilities of UAVs with artificial intelligence image processing and object recognition technologies, efficient and accurate multi-object tracking can be achieved. To enhance the processing capacity of relevant information, a transformer-based multi-object tracking model is proposed for UAV perspectives. The backbone network of the detector is based on the transformer architecture to extract target features. In this structure, the network is further deepened to optimize feature fusion and improve the ability to capture small targets. To address the tracking challenges arising from dynamic variations between different targets, the tracker implements multi-object tracking in UAV scenarios based on the motion characteristics of targets and optimizes the tracking process accordingly. The model was evaluated on the VisDrone dataset, achieving a detection accuracy of 90.1% and a MOTA of 51.9%. Furthermore, the model demonstrated a remarkable speed of 20 frames per second. In summary, the proposed algorithm demonstrates excellent performance in target tracking from UAV perspectives and exhibits certain advantages over other algorithms.

Keywords: Multi-object tracking · transformer · unmanned aerial vehicles

1 Introduction

With the rapid development of UAVs and artificial intelligence technologies, the integration of UAVs with artificial intelligence has shown tremendous potential in the field of multi-object tracking. The flexible flight capability and wide field of view of UAVs provide an ideal platform for capturing high-quality video information, while the fast algorithms and advanced visual processing capabilities of artificial intelligence offer accurate multi-object tracking capabilities [1]. The technology of multi-object tracking using UAVs plays an important and meaningful role in the current stage of social development [2]. It plays a critical role in public safety by enabling real-time monitoring and tracking of potential threats such as criminal activities and terrorist attacks, thereby enhancing societal security. Additionally, UAV-based multi-object tracking technology plays a significant role in traffic management and urban planning by accurately monitoring and analyzing vehicles, pedestrians, and traffic flow, providing data support for

traffic optimization and urban design. It is also widely applied in environmental monitoring, disaster response, and agriculture sectors, enabling real-time monitoring of natural resources, disaster situations, and crop growth conditions, thereby providing decision support and guidance for rescue operations [3, 4].
In recent years, significant research achievements have emerged in the field of UAV-based multi-object tracking, providing effective solutions to the problem of target tracking from the UAV perspective [5]. For example, the JDE algorithm achieves accurate target localization and robust tracking performance by jointly training the object detection and feature embedding networks [6]. The CenterTrack algorithm achieves efficient and accurate multi-object tracking through center point estimation and motion prediction techniques [7]. However, UAV-based multi-object tracking still faces several challenges [8], such as the need for improved tracking performance in complex scenes and the issue of tracking speed. To address these challenges and better cope with the vast scenes from the UAV perspective, this paper proposes a deep network-based multi-object tracking model. The contributions of this model are as follows:

1. By improving the detection algorithm, the overall processing speed is enhanced while also strengthening the feature extraction capability. This improvement enables more accurate recognition and detection of targets in wide-angle scenarios. By effectively utilizing feature information, the algorithm can rapidly and accurately localize and identify multiple targets.
2. In the tracking algorithm, taking into account the characteristics of the targets, further enhancement is achieved in terms of tracking performance, efficiency, and capability in specific scenarios. By considering target motion patterns and context information, the algorithm can better predict target positions and trajectories, achieving stable and accurate multi-object tracking.
3. This model is designed for UAV perspective scenarios, enabling high-speed processing of UAV-captured videos, and it is also applicable to similar wide-angle scenarios, providing valuable references for related algorithms.

2 Related Work

In the field of computer vision, the Convolutional Neural Network (CNN) is a prominent model used for feature extraction. It utilizes multiple convolutional layers to extract features from deep networks, enabling analysis and understanding of visual data such as images and videos. CNN networks hold significant value in this field, and since the introduction of AlexNet [9], they have become the mainstream architecture and have evolved into various improved network structures, such as ResNet [10], DenseNet [11], and EfficientNet [12]. These enhanced network architectures exhibit superior feature extraction capabilities, thereby enhancing the accuracy and efficiency of the models. Self-attention layers are a technique that enhances the feature extraction capability of CNN networks. They analyze and decode features by computing the dependencies between different positions in the input data. Self-attention layers have been widely applied in natural language processing and have recently been introduced to the field of computer vision. Transformer networks, based on the self-attention mechanism, have found extensive use in natural language processing [13] and have started to be applied

to object detection tasks in computer vision in recent years. Visual Transformers and their improved networks [14, 15] have demonstrated high accuracy and speed in image classification. Visual Transformers directly apply the Transformer architecture to non-overlapping medium-sized image patches, striking a balance between speed and accuracy. This architecture has also been applied to object detection tasks with good results. Recently, researchers have applied the Visual Transformer model to dense visual tasks of object detection through upsampling or transposed convolution and have made improvements to optimize the related architectures for better performance in tasks such as classification and detection [16]. These techniques provide more accurate and efficient solutions for target detection tasks from the UAV perspective.
In the target tracking phase, factors such as localization, motion, and appearance are critical for multi-frame object matching. Recent target tracking algorithms [17–19] have introduced neural networks for learning motion to achieve more robust results in the presence of camera motion or low frame rates. These methods utilize position and motion information for short-term matching, improving tracking accuracy. For long-term matching, some algorithms employ appearance features. The Deepsort algorithm [20] extracts appearance features of detected targets using an independent re-identification model and combines them with position and motion information, utilizing a cascaded matching strategy for target tracking. Although embedding the re-identification model between detection and tracking is relatively convenient, the use of multiple models increases computational costs and slows down the overall processing speed. Currently, more tracking algorithms combine object detection with appearance feature extraction, using detection and vector feature models for target detection and feature extraction, omitting the separate re-identification module, and significantly improving the processing speed.

3 Multi-object Tracking Based on Transformer

To address the challenges of detecting and tracking targets in complex environments from the perspective of UAVs, this paper aims to enhance the practical application capabilities of tracking algorithms. We propose a transformer-based multi-object tracking algorithm that takes a balanced approach to detection and tracking, catering to a wide range of viewing angles. This algorithm employs a multi-stage feature extraction process to deepen the semantic understanding of features and enhance their extraction capabilities, thereby improving the ability to capture small targets. The extracted features are then incorporated into the tracking algorithm, leveraging the relational characteristics among targets in wide-angle views to achieve higher tracking accuracy. By considering the comprehensive relationship among targets, our algorithm can accurately predict the positions and motion trajectories of targets, thus enhancing tracking accuracy and stability. The overall framework of the proposed algorithm is illustrated in Fig. 1.

Fig. 1. Algorithm structure diagram

3.1 Transformer-Based Detector

Backbone Network Section: Feature extraction plays a crucial role in object detection. Currently, the transformer module has achieved breakthroughs in feature extraction, making it essential for object detection tasks. To effectively detect small objects in large-scale visual images, the current solution involves using a multi-stage encoding-decoding structure for feature extraction and compressing the image size. Through non-convolutional downsampling, it is possible to preserve detailed features as much as possible while leveraging the transformer module for more in-depth feature extraction. This approach enables obtaining more accurate object detection results without sacrificing image quality.
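As a rough, hedged sketch of the patch-partition and linear-embedding front end shown in Fig. 2 (the 4 × 4 patch size and the 48-dimensional patch features come from the description in this section; the embedding width is an assumed, illustrative value):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an H x W x 3 image into non-overlapping 4 x 4 patches (48-dim each)
    and project them with a linear embedding, as in the backbone of Fig. 2."""
    def __init__(self, patch=4, in_ch=3, embed_dim=96):   # embed_dim is illustrative
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * in_ch, embed_dim)

    def forward(self, x):                                  # x: (B, 3, H, W)
        B, C, H, W = x.shape
        x = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        # -> (B, C, H/4, W/4, 4, 4); flatten each patch into a 48-dim vector
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // self.patch) * (W // self.patch), -1)
        return self.proj(x)                                # (B, H/4 * W/4, embed_dim)
```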

Fig. 2. Backbone network architecture

Figure 2 illustrates the structure of the backbone network. In this backbone network, a non-convolutional segmentation model is first used to divide the RGB image into non-overlapping patches, with each patch size being 4 × 4. Therefore, the feature dimension of each patch is 4 × 4 × 3 = 48. Next, the original value features are projected using a linear embedding layer, and the transformer module with a mobile standard multi-head self-attention mechanism is employed for feature extraction, resulting in a multiple-fold increase in output dimension. Additionally, the non-convolutional downsampling module combines the transformer with a mobile multi-head self-attention mechanism, and through multiple iterations, it obtains an output size of H/64 × W/64 × 16C, achieving preliminary feature extraction for small objects. In this process, features of different scales at each stage are preserved and applied to the subsequent multi-task feature fusion. Compared to general feature extraction networks, this backbone network is deeper, to better handle small objects in large scenes. Moreover, the network's downsampling avoids convolutional operations, effectively reducing computational complexity and improving computational speed.

Feature Fusion Section: In object detection, due to the varying sizes of objects, analyzing only one feature layer is often insufficient. Fusing features from different

spatial scales can more effectively enhance detection performance. Lower-level features have higher resolution and focus more on fine details, but they lack semantic information due to their shallow network depth. On the other hand, higher-level features have richer semantic information but lower resolution, resulting in poorer perception of details due to their deep network structure. In real-world scenarios, wide-angle shooting leads to small scales for most objects. As the perspective changes, objects captured by the camera may become extremely small, making their recognition more challenging. To address this situation, a multi-scale fusion module with multiple detection layers is used to improve the detection capability for small objects.

Fig. 3. Feature fusion structure

Figure 3 illustrates the feature fusion architecture. After the initial slicing convolution, the feature size becomes one-fourth of the input image. It then goes through four transformer modules, with each module reducing the feature size by half. The output of each module is retained, and different convolutions are applied to obtain features under different attention levels. Finally, multiple feature modules are fused together to generate four different-scale results. To enhance the anti-interference capability of the smallest detection layer and incorporate contextual information, the context information is added to the output of each layer. Each downsampling operation brings about a change in scale, and features of different scales are used as input to the corresponding output layer. This enriches the feature information of each layer and allows for further reference to the features of adjacent output layers, enabling the fusion of multiple input features for each scale output. By utilizing the feature maps during the feature extraction process for feature fusion, both feature loss and computational cost are reduced, thereby maintaining computational speed while improving the detection capability for small objects.

3.2 Tracking Algorithm Combining Spatial Attributes

With different shooting perspectives, the distribution of objects also varies. In the case of aerial views captured by UAVs, the high shooting angle results in minimal differences in size between objects and clear spatial distribution. Unlike the complexities of occlusion and other factors in low-angle views, tracking difficulty is relatively low.

In wide-angle views, most objects are generally dispersed and independent, although there may be some partial occlusions. To achieve more accurate tracking of objects in different scenes, the objects are input into a classifier for classification. When there is overlap between the bounding boxes of different objects and the overlapping region extends beyond the center point of one of the objects, that object is labeled as occluded. For occluded objects, improvements are made to the cosine distance in the tracking algorithm, taking spatial hierarchy into account to achieve object matching. On the other hand, non-occluded objects are labeled as dispersed and undergo simple bounding box matching. To capture the size difference between two bounding boxes, a similarity parameter, denoted as s, is introduced. Since the height of an object is more susceptible to local motion effects, the ratio of the bounding box widths is used as the measure of similarity. When two similar objects have a significant distance between them, the similarity tends to be low. When the two objects are close to each other and occlusion occurs, the occluded object's bounding box may be incomplete and its width greatly affected, resulting in a lower similarity value. This partly demonstrates the effectiveness of the similarity measure.
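A minimal sketch of the occlusion-labelling rule and the width-based similarity s described above; since the exact formula for s is not given, the min/max width ratio used here is an assumption, and the function names are illustrative.

```python
def is_occluded(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Label box_a as occluded if its overlap region
    with box_b extends beyond box_a's center point."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if ix1 >= ix2 or iy1 >= iy2:
        return False                                   # no overlap at all
    cx, cy = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    return ix1 <= cx <= ix2 and iy1 <= cy <= iy2       # overlap covers the center

def width_similarity(box_a, box_b):
    """Similarity s based on the ratio of bounding-box widths (assumed min/max form)."""
    wa, wb = box_a[2] - box_a[0], box_b[2] - box_b[0]
    return min(wa, wb) / max(wa, wb)
```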

Fig. 4. Spatial depth relationships for different target frames

In the case of Fig. 4, considering the object's movement in the spatial depth direction, we focus on the object in the previous frame (frame n) and the object detected in the current frame (frame n + 1) that intersects with the anchor box. These two objects may not necessarily be the same tracking target. We assume that the bottom edges of all bounding boxes lie on the ground, allowing us to place all bottom edges on the same plane and calculate the distance between the two objects more accurately. By analyzing the figure, we can see that the distance between the previous bounding box and the subsequent bounding box is denoted as d1, while d2 represents the distance between their center points. However, d2 is not suitable for capturing the spatial depth differences between the two targets, as it is influenced not only by the target's position but also by the target's height and the variation in perspective. Due to the different shooting directions, the direction of the extended line of d1 varies. To unify the measurement, we introduce the vertical distance l1 between the two bottom edges as the distance. To assess the target more reasonably, we also introduce the vertical distance l2 between the two top edges and compare it with l1 to calculate the contrast. Compared to l1, this value

includes the size difference between the two objects. It is calculated as shown in Eq. (1):

r = l1 / (l1 + l2),    (1)

where r is the preliminary contrast computed from the two vertical distances. When the target moves within a small range, the difference between l1 and l2 is small, and the contrast approaches 0.5. When the target exhibits larger spatial depth movement, the contrast fluctuates around 0.5 as the target moves. Therefore, we use 0.5 as a basic threshold and introduce it into the calculation of contrast. This yields the final formula for the contrast, where a higher contrast value corresponds to a higher confidence level in the calculated intersection-over-union (IoU) results, as shown in Eq. (2):

c = 1 − | l1 / (l1 + l2) − 0.5 |,    (2)

Finally, we incorporate the similarity s and the contrast c as corresponding parameters into the calculation of the confidence. Since they are observed from different perspectives, we calculate them separately and use the root mean square to balance their values, as shown in Eq. (3):

out = √((iou × s)² + (iou × c)²).    (3)
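Putting Eqs. (1)–(3) together, the sketch below shows one way the spatially aware association score could be computed for a pair of boxes; the IoU helper and the min/max form of the width similarity s are assumptions consistent with the description above, not the authors' exact implementation.

```python
def iou(a, b):
    """Standard intersection-over-union for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def association_score(prev_box, cur_box):
    """out = sqrt((iou*s)^2 + (iou*c)^2), with s the width similarity and c the contrast of Eq. (2)."""
    s = min(prev_box[2] - prev_box[0], cur_box[2] - cur_box[0]) / \
        max(prev_box[2] - prev_box[0], cur_box[2] - cur_box[0])
    l1 = abs(prev_box[3] - cur_box[3])            # vertical gap between the two bottom edges
    l2 = abs(prev_box[1] - cur_box[1])            # vertical gap between the two top edges
    c = 1.0 - abs(l1 / (l1 + l2 + 1e-8) - 0.5)    # contrast of Eq. (2)
    o = iou(prev_box, cur_box)
    return ((o * s) ** 2 + (o * c) ** 2) ** 0.5   # Eq. (3)
```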

4 Experiment and Analysis

The GPU used for training and testing in this study is a GeForce RTX 1080Ti with 11 GB of VRAM. The system has a total memory size of 31 GB. The software configuration includes Python 3.8, PyTorch 1.8.0, and CUDA 10.1.

4.1 Data Set and Evaluation Indicators

In this experiment, the VisDrone2019 dataset is used for training and testing. VisDrone2019 is a large-scale dataset specifically designed for object detection and tracking, focusing on drone imagery. The dataset consists of thousands of high-resolution images and millions of annotated object instances, covering a wide range of scenes and environments such as urban areas, rural landscapes, and coastal regions. Due to the high complexity of drone-captured images, including variations in pose, occlusion, and multi-scale objects, the VisDrone2019 dataset is considered one of the most challenging datasets for drone imagery analysis.
In multi-object tracking, common evaluation metrics include Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), False Positives (FP), False Negatives (FN), and the ID F1 Score (IDF1). Accuracy is also reported as a performance indicator on the detection side.
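For reference, MOTA as reported here is conventionally the CLEAR-MOT quantity below; this is the standard definition rather than a formula given in the paper, where FN_t, FP_t, and IDSW_t are the misses, false positives, and identity switches in frame t, and GT_t is the number of ground-truth objects in frame t.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```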

4.2 Experiments and Simulations

In order to comprehensively evaluate the performance of the proposed detection algorithm, we selected the VisDrone dataset as the experimental dataset and conducted separate detection experiments on it. To provide a more intuitive understanding of the algorithm's performance, we present the comparison results of different detectors and tracking algorithms in Table 1.

Table 1. Experiments with different modules

Detector   Tracker    IDF1   MOTP   MOTA
Yolov3     Sort       44.2   23.4   41.9
Yolov3     Deepsort   45.0   23.5   42.4
Yolov3     Ours       47.1   24.1   44.3
Ours       Sort       48.9   25.7   45.6
Ours       Deepsort   49.5   25.5   47.1
Ours       Ours       50.6   26.8   49.1

The table reveals that when using YOLOv3 as the detector, different tracking algorithms yield minor differences in performance, but our proposed tracking algorithm still outperforms them. Furthermore, when using our proposed network, whether in conjunction with SORT or DeepSORT, the tracking performance is further improved. In particular, compared to DeepSORT, our algorithm achieves a 4.5% increase in MOTA. Considering multiple tracking algorithms, it can be observed that our proposed detection algorithm consistently outperforms YOLOv3 across the various tracking algorithm combinations. Overall, both the detector and the tracker proposed in this paper demonstrate favorable performance, exhibiting satisfactory tracking results and providing an effective solution for research in the field of drone-based object tracking from aerial perspectives.

4.3 Ablation Experiments

In practical applications of tracking algorithms, it is crucial to consider both tracking performance and processing speed, as these two metrics directly impact the algorithm's practicality. In this regard, through ablation experiments, this paper conducted targeted improvements on each module of the JDE algorithm to evaluate the impact of each module on the algorithm's performance and the overall improvement of the algorithm.

Table 2. Ablation experiments

Model        IDF1   MOTA   MOTP   FPS
JDE          45.0   42.4   23.5   12
+Backbone    48.2   44.9   24.3   22
+Neck        49.5   47.2   25.6   20
+Tracker     50.6   49.1   26.8   20

The results in Table 2 indicate significant improvements of our algorithm over the JDE algorithm in terms of IDF1, MOTA, and MOTP. These enhancements come from improvements to three key modules of the JDE algorithm: the backbone network, the feature fusion Neck module, and the optimization for drone-perspective tracking. Regarding the backbone network, we optimized the network structure based on the transformer, resulting in a 2.5% improvement in MOTA and a 10 FPS increase in overall speed. In the Neck module, our algorithm utilized a multi-scale feature fusion approach, further improving MOTA by 2.3%, although this approach led to a decrease in speed of 2 FPS. Finally, with the optimization for drone-perspective tracking, our algorithm achieved a 1.9% improvement in MOTA while maintaining the overall processing speed, which is 66.7% higher than that of JDE. The experimental results demonstrate the significant performance enhancements of our proposed tracking algorithm in practical applications.

Fig. 5. Comparison of actual tracking results between the Baseline and the proposed method in UAV scenarios

Fig. 6. Tracking performance of the different methods across multiple frames

In order to better illustrate the performance of the algorithm, Figs. 5 and 6 show the actual tracking results comparing the Baseline and the proposed method in the context

of unmanned aerial vehicle scenarios. The Baseline is a tracking model based on the JDE algorithm, which performs suboptimally in high-altitude UAV views. By observing Fig. 5, it is evident that the Baseline encounters difficulties in detecting small targets, leading to poorer tracking performance. In contrast, the proposed algorithm demonstrates more stability in target detection on the road. Furthermore, for targets in the middle of the road, both methods exhibit good tracking performance. Additionally, Fig. 6 presents the performance of different methods across multiple frames. We can observe occlusions caused by vegetation obstructing vehicles and the overlapping of vehicles. Such complex scenarios often pose challenges to tracking. In the Baseline, occlusion by vegetation leads to target loss, and the overlapping of vehicles causes changes in target IDs, thereby affecting the tracking performance. However, the proposed algorithm maintains stable tracking performance in the face of such challenges. It can mitigate the tracking difficulties caused by occlusions to a certain extent, enabling continuous tracking of targets. In summary, the proposed tracking algorithm demonstrates superior performance in dealing with complex UAV scenarios, providing a reliable solution for UAV tracking tasks.

4.4 Comparison With Other Algorithms

Table 3. Algorithm comparison

Method             Precision   IDF1   MOTA
JDE [6]            89.4        45.0   42.4
MOTDT [21]         86.1        47.3   45.8
CenterTrack [7]    87.9        50.0   47.1
QuasiDense [22]    85.7        44.4   42.1
CorrTracker [23]   87.1        49.4   46.4
MOTR [24]          88.5        51.2   47.9
Ours               90.1        50.6   49.1

Table 3 presents a comparison of several state-of-the-art algorithms in recent years to evaluate their performance in multi-object tracking tasks. Among these algorithms, the proposed algorithm demonstrates better results in terms of the IDF1 and MOTA metrics. This indicates that our algorithm has significant advantages in both the accuracy and the overall performance of object tracking. For the detection module, the JDE algorithm performs best in detection accuracy among the compared baselines but poorly in feature extraction for tracking, resulting in overall subpar tracking performance. In contrast, our algorithm not only maintains high detection accuracy but also further improves tracking performance. Compared to the worst-performing QuasiDense algorithm, our algorithm achieves a 7.0% increase in MOTA, and compared to the second-best performing MOTR algorithm, it achieves a 1.2% increase.

In conclusion, the proposed algorithm exhibits good tracking performance in addressing the challenges of object tracking from UAV perspectives, and it possesses certain advantages compared to other algorithms.

5 Conclusion

The use of UAVs for multi-object tracking offers a broad perspective and enhanced maneuverability. This paper focuses on the multi-object tracking problem and proposes a transformer-based model specifically designed for the UAV perspective. Compared to recent algorithms, this model achieves a significant improvement in the MOTA metric, enabling more accurate object detection and tracking. By combining the perspective of UAVs and artificial intelligence algorithms, we can achieve real-time tracking and analysis of multiple objects in complex scenarios. This has significant implications for applications such as area monitoring, security, and disaster response using UAVs. In the future, we will continue to explore more advanced algorithms and technologies to further drive the development of multi-object tracking from the UAV perspective. We aim to apply these advancements to a wider range of fields, bringing greater convenience and benefits to people's lives and work.

References 1. Alavi, A.H., Jiao, P., Buttlar, W.G., et al.: Internet of Things enabled smart cities: state-ofthe-art and future trends. Measurement 129, 589–606 (2018) 2. Chanak, P., Banerjee, I.: Congestion free routing mechanism for IoT enabled wireless sensor networks for smart healthcare applications. IEEE Trans. Consum. Electron. 66(3), 223–232 (2020) 3. Sharma, S.K., Wang, X.: Toward massive machine type communications in ultra-dense cellular IoT networks: current issues and machine learning-assisted solutions. IEEE Commun. Surv. Tutorials. 22(1), 426–471 (2020) 4. Sreenu, G., Durai, S.: Intelligent video surveillance: a review through deep learning techniques for crowd analysis. J. Big Data 6(1), 1–27 (2019) 5. Ren, W.H., Wang, X.C., Tian, J.D., et al.: Tracking-by-counting: using network flows on crowd density maps for tracking multiple targets. IEEE Trans. Image Process. 30, 1439–1452 (2020) 6. Wang, Z.D., Zheng, L., Liu, Y.X., et al.: Towards real-time multi-object tracking. In: Computer Vision-ECCV 2020: 16th European Conference, pp. 107–122 (2020) 7. Zhou, X.Y., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: Computer VisionECCV 2020: 16th European Conference, pp. 474–490 (2020) 8. Rezaee, K., Zadeh, H.G., Chakraborty, C., et al.: Smart visual sensing for overcrowding in COVID-19 infected cities using modified deep transfer learning. IEEE Trans. Industr. Inf. 19(1), 813–820 (2022) 9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017) 10. He, K.M., Zhang, X.Y., Ren, S.Q., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770– 778 (2016)


11. Huang, G., Liu, Z., Van Der Maaten, L., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700– 4708 (2017) 12. Tan, M.X., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019) 13. DoVaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing System, pp. 6000–6010 (2017) 14. Liu, Z., Lin, Y.T., Cao, Y., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022 (2021) 15. Han, K., Xiao, A., Wu, E.H., et al.: Transformer in transformer. Adv. Neural. Inf. Process. Syst. 2021(34), 15908–15919 (2021) 16. Yuan, L., Chen, Y.P., Wang, T., et al.: Tokens-to-token VIT: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567 (2021) 17. Bewley, A., Ge, Z.Y., Ott, L., et al.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing, pp. 3464–3468. (2016) 18. Chu, P., Wang, J., You, Q.Z., et al.: Transmot: Spatial-temporal graph transformer for multiple object tracking. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4870–4880 (2023) 19. Wu, J.L., Cao, J.L., Song, L.C., et al.: Track to detect and segment: an online multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12352–12361 (2021) 20. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing, pp. 3645–3649. IEEE (2017) 21. Chen, L., Ai, H.Z., Zhuang, Z.J., et al.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo, pp. 1–6. IEEE (2018) 22. Pang, J.M., Qiu, L.Y., Li, X., et al.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 164–173 (2021) 23. Wang, Q., Zheng, Y., Pan, P., et al.: Multiple object tracking with correlation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876–3886 (2021) 24. Zeng, F.G., Dong, B., Zhang, Y., et al.: Motr: End-to-end multiple-object tracking with transformer. In: Computer Vision-ECCV 2022: 17th European Conference, pp. 659–675 (2022)

HEI-GAN: A Human-Environment Interaction Based GAN for Multimodal Human Trajectory Prediction

Zihao Wang(B), Xuguang Chen, Sichao Wen, and Yaonong Wang

Zhejiang Leapmotor Technology Co., Ltd., Hangzhou 310051, China
[email protected]
Z. Wang and X. Chen—Equal contribution.

Abstract. Human trajectory prediction is an indispensable key component in autonomous driving systems and robot systems. The difficulty of human motion prediction lies in its inherent stochasticity and multimodality. Recently, some studies model the multimodality of human motion by predicting multiple possible future goals, which improves the accuracy of trajectory prediction. However, such methods fail to fully consider the influence of the environment on the target motion to be predicted. To solve the problem above, we propose HEI-GAN, a goal prediction model, with a human-environment interaction modeling method. The proposed method fuses the information from both the human motion and the environment to guide the model to learn the impact of human-environment interaction. We tested the performance of HEI-GAN on the Stanford Drone Dataset and the ETH/UCY dataset. The results show that HEI-GAN has better prediction performance than the existing goal prediction models. At the same time, the prediction accuracy of the proposed method for the complete trajectory also reaches an advanced level.

Keywords: Environment Interaction · Goal Prediction · Human Trajectory Prediction · Multimodal Modeling

1 Introduction

With the development of technology, human trajectory prediction has been widely used in intelligent systems such as autonomous driving, robots and video surveillance [1]. The difficulty of human trajectory prediction lies in the inherent stochasticity of human motion. As shown in Fig. 1, different human behavior patterns and goals make the future trajectory multimodal. At the same time, the motion of surrounding pedestrians, obstacles and other environmental factors also makes this task much more complex. Previous studies [2, 3] have established physical kinematics models based on human motion trajectories to model the relation between past motion and future trajectories. With the continuous evolution of human trajectory prediction research, trajectory prediction methods based on social interaction [4–7], environment information [8, 9], and final destination [10, 11] have been proposed successively. In addition, the focus of trajectory prediction has shifted from uni-modal prediction to multimodal prediction, which gives multiple possible future trajectories [12–15]. Among all of the proposed methods, goal-based multimodal human trajectory prediction methods achieve superior results. However, the performance of these methods is strongly affected by the accuracy of the predicted goals. When making goal predictions, the existing models show some problems, such as incomplete utilization of scene information, lack of social interaction and inaccurate modeling, which reduce the accuracy of goal prediction.

Fig. 1. The future trajectory of a human is determined by the destination, surrounding agents and obstacles. In this scenario, the person in black is the pedestrian whose trajectory should be predicted, while all other gray characters are considered as obstacles.

Aiming at overcoming the difficulties in human trajectory prediction and the existing problems of goal-based methods stated above, we propose a multimodal human trajectory prediction method that makes full use of social interaction, environment knowledge and past motion information. The proposed method establishes the goal prediction model HEI-GAN based on a human-environment interaction heatmap. We adopt a particular method to construct a heatmap that represents the social interaction relationship. At the same time, in order to better capture the stochasticity of human motion, a special GAN structure is innovatively adopted in the goal prediction model, which makes the predicted goal heatmap not constrained by the ground truth and hence leads to better multimodality. We propose a trajectory prediction system that combines a deep learning model with a social force model (SFM) [2], defines two network modules to make efficient use of the predicted goals and semantic maps, and improves the accuracy of trajectory prediction. The contributions of this work can be summarized as follows:

• We propose a novel interaction modeling method that integrates human-environment interaction information and constructs the interaction in the form of a heatmap. It solves the problem of incomplete use of environment information in existing methods and increases the effective information input to the network model.
• We construct HEI-GAN, a goal prediction network, to solve the problem of inaccurate goal prediction in existing methods. HEI-GAN strengthens feature learning through an adversarial process and uses the discriminator to improve the accuracy of the predicted goals from different receptive fields.


• We conduct complete tests on Stanford Drone and ETH/UCY, two popular public pedestrian datasets. Experimental results show that the proposed method reaches a top goal prediction level on both datasets and, in particular, achieves better prediction results than the existing state-of-the-art methods on the Stanford Drone dataset.

2 Related Works

With the continuous development of deep learning, the dominant models adopted in human trajectory prediction research have changed from early hand-crafted kinematic models [2, 3] to data-driven deep network models [4, 6]. Some studies formulate trajectory prediction as a sequence prediction problem and use recurrent neural networks (RNNs) to model human trajectory sequences [4]. Subsequently, more and more methods take social interaction as an important factor in constructing trajectory prediction models.

2.1 Interaction Models

Establishing an interaction model between people and the environment has become one of the hottest topics. Gupta et al. [6] propose a pooling mechanism to encode the surrounding pedestrians into a global pooling vector. Mohamed et al. [16] improve the quality of aggregated interaction features [4, 6] by modeling the pedestrian interaction in the scene with a spatio-temporal graph structure. Salzmann et al. [12] also use a graph-based model to obtain the interaction relationship among pedestrians. With the attention mechanism [17] showing its remarkable effect in sequence regression, many studies propose stronger prediction methods based on attention and the Transformer [5, 18]. Sadeghian et al. [9] combine attention networks and GAN [19] and use social and physical attention mechanisms to extract social interaction features from the scene context. Yu et al. [20] also use the Transformer to improve the Graph Attention Network.

2.2 Multimodal Modeling

Due to the influence of the surrounding environment, the future trajectory of human motion becomes more uncertain. Thus, more and more recent trajectory prediction methods take the uncertainty of human motion into account. Some, for instance, use generative models to predict the distribution of future trajectories. Typical generative trajectory prediction models include CVAE [5, 12] and Gaussian mixture models [21]. Recently, goal-based methods [13, 22] have achieved higher prediction accuracy in the multimodal trajectory prediction task. Mangalam et al. [22] propose PECNet, a model that considers the potential goals of human motion and uses a CVAE structure to predict multiple future goals. In another work by Mangalam et al. [10], the past pedestrian trajectory is constructed into a heatmap; convolutional neural networks are used to predict a probability distribution map, and multimodal goals are sampled from it. Xu et al. [15] store observed trajectories and their corresponding future endpoints as a set of empirical data in a repository; when performing complete trajectory prediction, multiple approximate goals are looked up in this repository. Chiara et al. [11] propose an attention-based trajectory prediction backbone on the basis of [10].


3 Method

3.1 Problem Definition

The ego-motion X of the pedestrian of interest is constructed from the observed trajectory samples during the past T_obs time steps and represented as a sequence X = \{X^{t_0-T_{obs}}, X^{t_0-T_{obs}+1}, \ldots, X^{t_0}\}, X \in \mathbb{R}^{2\times T_{obs}}, where X^t = (x^t, y^t) is the position coordinate at time step t and t_0 is the last observed time step of the tracked period. The future trajectory over the T_fut future time steps is modeled by a coordinate sequence Y = \{Y^{t_0+1}, \ldots, Y^{t_0+T_{fut}-1}, Y^{t_0+T_{fut}}\}, Y \in \mathbb{R}^{2\times T_{fut}}.

Ego Motion Heatmap. Similar to the work conducted by Mangalam et al. [10], we transfer the observed trajectory X to an image sequence P = \{P^{t_0-T_{obs}}, P^{t_0-T_{obs}+1}, \ldots, P^{t_0}\}, P \in \mathbb{R}^{T_{obs}\times h\times w}, with T_obs channels as a trajectory heatmap, where h and w are the height and width of the scene RGB map, respectively. For each time step t, P^t is calculated using the distance d^t_{i,j} between each pixel and X^t. The trajectory heatmap P^t and d^t_{i,j} can be represented as:

P^t = \sum_{i=0}^{h}\sum_{j=0}^{w} \frac{2\, d^t_{i,j}}{\max d^t}    (1)

d^t_{i,j} = \left\| (p_x, p_y) - X^t \right\|_2, \quad (p_x, p_y) \in P^t    (2)

where (p_x, p_y) are the pixel coordinates.
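The following is a minimal NumPy sketch of one way to rasterize the observed trajectory into the per-time-step distance heatmap of Eq. (1)–(2); the function name and the small stabilizing constant are illustrative assumptions, not part of the original method description.

import numpy as np

def trajectory_heatmap(traj, h, w):
    """Rasterize an observed trajectory (T_obs, 2) into a (T_obs, h, w) heatmap.

    For every time step t, each pixel stores 2*d / max(d), where d is the
    Euclidean distance between the pixel and the position X^t (Eq. 1-2).
    """
    ys, xs = np.mgrid[0:h, 0:w]                    # pixel coordinates (p_y, p_x)
    heatmaps = []
    for x_t, y_t in traj:                          # X^t = (x^t, y^t)
        d = np.sqrt((xs - x_t) ** 2 + (ys - y_t) ** 2)
        heatmaps.append(2.0 * d / (d.max() + 1e-8))
    return np.stack(heatmaps, axis=0)

# Example: 8 observed steps in a 64 x 64 scene map
P = trajectory_heatmap(np.random.rand(8, 2) * 64, h=64, w=64)
print(P.shape)  # (8, 64, 64)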

Fig. 2. The generation of social interaction heatmap N.

Scene Constraints. Scene constraints include buildings, trees, sidewalks, roads and barriers. Our proposed method uses U-Net [23] to semantically segment the scene RGB map and divides the scene constraints into C categories. In this way, the scene semantic map S ∈ R^{C×h×w} is obtained, and its height and width are the same as those of the trajectory heatmap P.

Masked Social Interaction Heatmap. We model the social interaction based on three elements: time, the locations of the agents, and the visual field of the predicted target. In order to reduce the complexity of the model, the proposed method only considers pedestrians in the scene.


We construct a distance heatmap N_o = \{N_o^{t_0-T_{obs}}, N_o^{t_0-T_{obs}+1}, \ldots, N_o^{t_0}\}, N_o \in \mathbb{R}^{T_{obs}\times h\times w}. The distance heatmap N_o^t at time step t can be represented as:

N_o^t = \sum_{i=0}^{s}\sum_{j=0}^{s} \left( \frac{s - d^t_{i,j}}{s} \right)^{\alpha}    (3)

where d^t_{i,j} is calculated by (2), \alpha is a tunable hyperparameter, and s is the side length of N_o. N_o summarizes the influence of the whole surrounding area on the predicted pedestrian in the distance dimension during the observed period. By introducing \alpha, we are able to control how strongly agents that are far away from the pedestrian of interest are accounted for in this influence.

As shown in Fig. 2, we filter all surrounding agents in the scene through several key parameters, such as the motion direction v_i, the field-of-view radius r and the field-of-view angle \theta of the pedestrian of interest. After that, we set the maximum number of surrounding agents to m and the influence range of agents to r_M. According to r_M and the position coordinates, we define the field-of-view mask M^t, M^t \in \{0, 1\}, at time step t as:

M^t = \sum_{i=0}^{s}\sum_{j=0}^{s} \begin{cases} 1, & \text{if } (i, j) \in X_k \pm r_M \\ 0, & \text{otherwise} \end{cases}    (4)

where X_k is the position of the k-th agent.

Fig. 3. Overall structure of HEI-GAN.

We generate the masked social interaction heatmap N = \{N^{t_0-T_{obs}}, N^{t_0-T_{obs}+1}, \ldots, N^{t_0}\}, N \in \mathbb{R}^{T_{obs}\times h\times w}, by combining the distance heatmap N_o and the field-of-view mask M. This process can be represented as:

N^t = M^t \circ N_o^t    (5)

where \circ denotes the Hadamard product.
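Below is a small sketch of one time step of Eq. (3)–(5), assuming d is the pixel-to-target distance of Eq. (2) and that the neighbour list has already been filtered by the field-of-view rules described above; the function name and default values of alpha and r_M are illustrative.

import numpy as np

def masked_social_heatmap(target_xy, neighbors_xy, s, alpha=2.0, r_m=5):
    """One time step of the masked social interaction heatmap N^t = M^t o N_o^t."""
    grid = np.mgrid[0:s, 0:s].astype(float)
    ys, xs = grid[0], grid[1]
    # Distance heatmap N_o (Eq. 3): pixels close to X^t get values near 1,
    # alpha controls how quickly the influence of far-away regions decays.
    d = np.sqrt((xs - target_xy[0]) ** 2 + (ys - target_xy[1]) ** 2)
    n_o = np.clip((s - d) / s, 0.0, None) ** alpha
    # Field-of-view mask M (Eq. 4): 1 inside a box of half-width r_M around
    # each retained agent, 0 elsewhere.
    m = np.zeros((s, s))
    for nx, ny in neighbors_xy:
        m[(np.abs(xs - nx) <= r_m) & (np.abs(ys - ny) <= r_m)] = 1.0
    return m * n_o                                   # Hadamard product (Eq. 5)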


3.2 HEI-GAN

HEI-GAN uses heatmaps instead of trajectory sequences to complete the mapping from past information to future state. We creatively apply the adversarial property of GANs to heatmap processing and obtain more reasonable predicted goals. We adapt Pix2Pix [24], the classic model for image-to-image translation, and use a U-Net generator and a PatchGAN discriminator to achieve pedestrian trajectory prediction. The complete process is shown in Fig. 3.

Generator. In order to consider the dynamic environment comprehensively, we concatenate the motion information P, the interaction information N and the semantic information S as prior data H_env to represent the predicted scene. Specifically, the prior data H_env = concat(S, N, P), H_env ∈ R^{(C+2T_obs)×s×s}, is the input of our generator. We then use an encoder-decoder with a U-Net architecture to extract and transfer inter-layer features. The skip connections inside U-Net preserve the spatial details of the hidden features well. The whole process is modeled as:

O_G = G(H_{env}; W_G)    (6)

where G is the generator with learnable parameters W_G, and O_G represents the output heatmap, from which the final goal positions are sampled.

Discriminator. We choose PatchGAN, a fully convolutional structure, to judge the plausibility of the prediction results over the different receptive fields corresponding to each patch. As shown in (7), convolutional layers extract the features of the input heatmaps and try to identify the authenticity of each s × s patch:

O_D = D(\mathrm{concat}(P, O_G); W_D)    (7)

where D is the discriminator with learnable parameters W_D, and O_D is the final output that indicates whether the input is real or fake.

Sampling. The generator outputs the heatmap of the target agent, which marks the possible positions of the predicted goals via Gaussian distributions at t = t_0 + T_fut. We use the TTST [10] to extract the specific goal locations Y^goal from the output heatmap. Specifically, this method first samples 10000 pixels from the heatmap according to their probabilities, and then obtains K goals through the K-means clustering algorithm.
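A minimal sketch of the two-stage goal sampling described above (probability-weighted pixel sampling followed by K-means); the function name, the random seed handling and the use of scikit-learn are assumptions made for illustration.

import numpy as np
from sklearn.cluster import KMeans

def sample_goals(goal_heatmap, k=20, n_samples=10000, seed=0):
    """Sample K candidate goal positions from the generator's output heatmap."""
    rng = np.random.default_rng(seed)
    h, w = goal_heatmap.shape
    p = goal_heatmap.clip(min=0).ravel()
    p = p / p.sum()                                     # pixel-wise sampling probabilities
    idx = rng.choice(h * w, size=n_samples, p=p)        # draw 10000 pixels
    pts = np.stack([idx % w, idx // w], axis=1).astype(float)   # (x, y) pixel coordinates
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pts)
    return km.cluster_centers_                          # (k, 2) goal positions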

Loss. We use L_GAN(G, D) and L_BCE to constrain the generation of the predicted goals. The two losses use both the high-frequency information of the whole heatmap and the low-frequency information of each pixel to draw the predicted maps close to the ground-truth maps Y_gt, which are obtained through Gaussian distributions centered at the destination Y^{t_0+T_fut}. The total loss function L_HEI-GAN can be represented as:

L_{GAN}(G, D) = \mathbb{E}\left[\log D\left(\mathrm{concat}(P, Y_{gt})\right)\right] + \mathbb{E}\left[\log\left(1 - O_D\right)\right]    (8)

L_{BCE} = \mathrm{BCE}(Y_{gt}, O_G)    (9)

L_{HEI\text{-}GAN} = L_{GAN}(G, D) + \lambda L_{BCE}    (10)

365

Fig. 4. Overall structure of the trajectory prediction system.

3.3 Trajectory Prediction System As shown in Fig. 4, we propose a trajectory prediction system based on SFM [2], which includes destination-driven module and environment force module. Destination-Driven Module. We introduce the modeling method of the influence of destination on pedestrian motion in SFM. When there is a certain destination Y goal , the pedestrian movement to Y goal can be constructed as a polyline through destination t of Y goal to the pedestrian’s next driving force Fgoal . In the future duration Tfut , Fgoal







Y goal −X t t → , expected velocity velocity is determined by the expected direction − e des = Y goal −X t 2 t t → t = Y goal −X 2 and the current velocity − v = X t − X t−1 : vdes Tfut −t



1 t t (v e − vt ) τ des des   τ = φ vt , X t , Yˆ goal , Wτ t Fgoal =

(11) (12)

where τ is the time interval over which the pedestrian adjusts the current velocity to reach v^t_des again; it is fitted by a neural network φ(·). φ(·) is composed of an LSTM and MLPs, and W_τ is the network parameter. The input vector [v^t, X^t] of the LSTM is the concatenation of velocity and position.

Environment Force Module. We define the impact of scene constraints on the pedestrian trajectory as the environment force F_env. For example, building entrances may be attractive to pedestrians, making them approach continuously during movement, whereas greenbelts, obstacles and other structures exert a repulsive force, making pedestrians stay away from them. Therefore, the construction of F_env is determined by the location, category and distance of the stationary objects around the pedestrian. In the semantic map S, we crop an s_env × s_env image I_env, I_env ∈ R^{C×s_env×s_env}, according to the velocity direction of the predicted pedestrian at time t.


The position E of the surrounding stationary objects and the environment coefficient ξ are calculated based on I_env. The environment force F^t_env at time t and ξ can be expressed as:

F^t_{env} = \sum_{k=0}^{C} \frac{\xi^t \left( E^t_k - X^t \right)}{\left\| E^t_k - X^t \right\|_2^2}    (13)

\xi = \psi\left( I_{env}, W_\xi \right)    (14)

where ψ(·) can be any image feature extraction network. In our work, we use a network composed of multi-layer residual blocks [2] and linear layers, and W_ξ is the network parameter.

Trajectory Prediction. Based on F_goal and F_env, we establish a pedestrian dynamic model in discrete state. The process of the current velocity v^t changing into v^{t+Δt} after the time interval Δt can be calculated through the variation \frac{dv^t}{dt} = F_{goal} + F_{env}:

v^{t+\Delta t} = v^t + \Delta t\, \frac{dv^t}{dt}    (15)

Finally, we predict the position \hat{X}^{t+\Delta t} of the pedestrian at the next time step by the following formula:

\hat{X}^{t+\Delta t} = \hat{X}^t + \Delta t\, v^{t+\Delta t}    (16)
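A self-contained sketch of an Euler roll-out of the dynamics in Eq. (11) and (15)–(16), assuming a fixed relaxation time tau and unit time step; in the paper τ is regressed by φ(·) and F_env comes from the semantic map, so force_env_fn and the constants here are placeholders.

import numpy as np

def rollout(x0, v0, y_goal, t_fut, force_env_fn, tau=0.5, dt=1.0):
    """Roll out future positions from the goal force and an environment force."""
    traj, x, v = [], x0.astype(float).copy(), v0.astype(float).copy()
    for t in range(t_fut):
        to_goal = y_goal - x
        dist = np.linalg.norm(to_goal) + 1e-8
        f_goal = ((dist / max(t_fut - t, 1)) * to_goal / dist - v) / tau  # Eq. (11)
        dv = f_goal + force_env_fn(x)        # dv/dt = F_goal + F_env
        v = v + dt * dv                      # Eq. (15)
        x = x + dt * v                       # Eq. (16)
        traj.append(x.copy())
    return np.array(traj)

# Example with no environment force
path = rollout(np.zeros(2), np.array([1.0, 0.0]), np.array([10.0, 5.0]),
               t_fut=12, force_env_fn=lambda x: np.zeros(2))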

4 Experiment

4.1 Datasets and Metrics

We use two commonly-used public pedestrian datasets, the Stanford Drone Dataset [25] and ETH/UCY [26, 27], to evaluate our method. To evaluate the accuracy of the predicted future trajectories, two common metrics, Average Displacement Error (ADE) and Final Displacement Error (FDE), are used to measure the quality of the predicted trajectories in each experiment.

Table 1. Comparison with existing works on SDD (Min20 ADE / Min20 FDE ↓, K = 20 samples). Bold and underlined numbers indicate the best and second-best.

Method   Social-GAN [6]  Sophie [9]   PECNet [22]  MemoNet [15]  Y-Net [10]  Goal-SAR [11]  Ours
ADE/FDE  27.23/41.44     16.27/29.38  9.96/15.88   8.56/12.66    7.85/11.85  7.75/11.83     7.72/11.59


Table 2. Comparison with existing works on the ETH/UCY dataset (Min20 ADE / Min20 FDE ↓, K = 20 samples).

Method               ETH        HOTEL      UNIV       ZARA1      ZARA2      AVG
Social-GAN [6]       0.81/1.52  0.72/1.61  0.60/1.26  0.34/0.69  0.42/0.84  0.58/1.18
Sophie [9]           0.70/1.43  0.76/1.67  0.54/1.24  0.30/0.63  0.38/0.78  0.54/1.15
Transformer-TF [18]  0.61/1.12  0.18/0.36  0.35/0.65  0.22/0.38  0.17/0.32  0.31/0.55
STAR [20]            0.36/0.65  0.17/0.36  0.32/0.62  0.26/0.55  0.22/0.46  0.26/0.53
PECNet [22]          0.54/0.87  0.18/0.24  0.35/0.60  0.22/0.39  0.17/0.30  0.29/0.48
Trajectron++ [12]    0.39/0.83  0.12/0.21  0.20/0.44  0.15/0.33  0.11/0.25  0.19/0.39
AgentFormer [5]      0.45/0.75  0.14/0.22  0.25/0.45  0.18/0.30  0.14/0.24  0.23/0.39
MemoNet [15]         0.40/0.61  0.11/0.17  0.24/0.43  0.18/0.32  0.14/0.24  0.21/0.35
Y-Net [10]           0.28/0.33  0.10/0.14  0.24/0.41  0.17/0.27  0.13/0.22  0.18/0.27
Ours                 0.34/0.32  0.11/0.15  0.29/0.47  0.19/0.27  0.14/0.22  0.21/0.29

Table 3. Ablation study on model structures and inputs (Min20 FDE ↓, K = 20 samples). Each row corresponds to one combination of network structure (CNN only vs. CNN generator with GAN discriminator) and inputs (P, S, N, or N w/o sc).

ETH    HOTEL  UNIV   ZARA1  ZARA2  SDD
0.43   0.16   0.50   0.33   0.26   11.88
0.42   0.16   0.53   0.33   0.28   12.05
0.41   0.15   0.47   0.31   0.25   11.79
0.50   0.16   0.49   0.30   0.24   11.71
0.45   0.15   0.51   0.33   0.27   12.17
0.32   0.15   0.47   0.27   0.22   11.59

Fig. 5. Visualization of predicted multimodal future trajectories and goal heatmaps.


4.2 Comparison with Existing Works

Stanford Drone Dataset. Table 1 lists the results of the benchmark models on SDD. Note that we use exactly the same test set as the other models in the table and achieve better results. Specifically, HEI-GAN achieves a Min20 ADE of 7.72 and a Min20 FDE of 11.59, which outperforms the existing state-of-the-art method, Goal-SAR.

ETH/UCY Dataset. We compare our approach with the existing top-ranking methods on ETH/UCY, and the results are shown in Table 2. In the ETH scene, HEI-GAN achieves a Min20 FDE of 0.32, which is better than the current SOTA method Y-Net. Moreover, HEI-GAN reaches the same Min20 FDE as Y-Net in ZARA1 and ZARA2. It can be seen that our model is able to provide accurate goals and thereby guide the trajectory prediction system to predict future trajectories closer to the ground truth. However, there is still a performance gap between our model and Y-Net in the UNIV scene. The reason may be that the prediction error is relatively high for some test samples that are not affected by the environment.

4.3 Ablation Study

In order to verify the effectiveness of each module of HEI-GAN in improving the accuracy of goal prediction, we conducted an ablation study on submodels with different networks and input features under the Min20 FDE metric. The first test concerns the model structure, to verify the effectiveness of the GAN for goal prediction. We set up two models for this test: CNN (only the U-Net) and GAN (a model composed of the CNN generator and the discriminator). The second test concerns the model input, to verify the effect of social interactions and their different modeling methods on the model predictions. Table 3 reports the Min20 FDE of the submodels of each test. In the table, P is the heatmap of the input trajectory, S is the semantic map of the input scene, and N is the social interaction heatmap. "w/o sc" means that N is constructed using all the information from the surrounding agents without filtering them. The results show that the Min20 FDE of most submodels with the GAN is better than that of the submodels using only the CNN. In terms of the model input, constructing the social interaction heatmap after screening nearby pedestrians according to the visual field further improves the accuracy of goal prediction.

4.4 Qualitative Results

In Fig. 5, we show the visualization of the predicted multimodal trajectories and goal heatmaps given by our proposed model. Dots represent way points, stars represent endpoints, the ground-truth future trajectory is shown in green, the past trajectory in blue, and the multimodal predicted trajectories in red. The first four pairs (a)–(d) in Fig. 5 are four scenes from SDD, and the last two pairs (e)–(f) are scenes from the ETH/UCY dataset. The qualitative results show that HEI-GAN can well capture the multimodality of future pedestrian motion, in addition to predicting a trajectory that is very close to the ground truth in different scenes.


5 Conclusion

This paper presents HEI-GAN, a goal prediction model that performs well on the pedestrian destination prediction task, together with a trajectory prediction system based on the predicted goals. In terms of input modeling, the proposed method fully considers information such as surrounding agents and obstacles, providing comprehensive interaction features to the model. With respect to the model structure, the proposed HEI-GAN creatively uses a GAN for heatmap processing and improves the accuracy of the predicted goals through the adversarial process. The experimental results show that the proposed model reaches a top goal prediction level on the Stanford Drone and ETH/UCY datasets and, in addition, outperforms the existing state-of-the-art method with regard to trajectory prediction on SDD.

References 1. Rudenko, A., et al.: Human motion trajectory prediction: a survey. Int. J. Robot. Res. 39(8), 895–935 (2020) 2. Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Phys. Rev. 51(5), 4282 (1995) 3. Schöller, C., Aravantinos, V., Lay, F., Knoll, A.: What the constant velocity model can teach us about pedestrian motion prediction. IEEE Robot. Autom. Lett. 5(2), 1696–1703 (2020) 4. Alahi, A., et al.: Social LSTM: human trajectory prediction in crowded spaces. In: Proceedings of CVPR, pp. 961–971 (2016) 5. Yuan, Y., Weng, X., Ou, Y., Kitani, K.M.: Agentformer: agent-aware transformers for sociotemporal multi-agent forecasting. In: Proceedings of ICCV, pp. 9813–9823 (2021) 6. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social gan: Socially acceptable trajectories with generative adversarial networks. In: Proceedings of CVPR, pp. 2255–2264 (2018) 7. Tang, H., Wei, P., Li, H., Li, J., Zheng, N.: Relation reasoning for video pedestrian trajectory prediction. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2022) 8. Sadeghian, A., et al.: Car-net: Clairvoyant attentive recurrent network. In: Proceedings of ECCV, pp. 151–167 (2018) 9. Sadeghian, A., et al.: Sophie: an attentive gan for predicting paths compliant to social and physical constraints. In: Proceedings of CVPR, pp. 1349–1358 (2019) 10. Mangalam, K., An, Y., Girase, H., Malik, J.: From goals, waypoints & paths to long term human trajectory forecasting. In: Proceedings of ICCV, pp. 15233–15242 (2021) 11. Chiara, L.F., et al.: Goal-driven self-attentive recurrent networks for trajectory prediction. In: CVPR, pp. 2518–2527 (2022) 12. Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 683–700. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-58523-5_40 13. Dendorfer, P., Osep, A., Leal-Taixé, L.: Goal-gan: Multimodal trajectory prediction based on goal position estimation. In: Proceedings of ACCV (2020) 14. Dendorfer, P., Elflein, S., Leal-Taixé, L.: Mg-GAN: a multi-generator model preventing out-of-distribution samples in pedestrian trajectory prediction. In: Proceedings of ICCV, pp. 13158–13167 (2021)


15. Xu, C., Mao, W., Zhang, W., Chen, S.: Remember intentions: retrospective-memory-based trajectory prediction. In: Proceedings of CVPR, pp. 6488–6497 (2022) 16. Mohamed, A., Qian, K., Elhoseiny, M., Claudel, C.: Social-STGCNN: a social spatio-temporal graph convolutional neural network for human trajectory prediction. In: Proceedings of CVPR, pp. 14424–14432. (2020) 17. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017) 18. Giuliari, F., Hasan, I., Cristani, M., Galasso, F: Transformer networks for trajectory forecasting. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 10335–10342 (2021) 19. Goodfellow, I., et al.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020) 20. Yu, C., Ma, X., Ren, J., Zhao, H., Yi, S.: Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 507–523. Springer, Cham (2020). https://doi.org/10. 1007/978-3-030-58610-2_30 21. Ivanovic, B., Pavone, M.: The trajectron: probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In: Proceedings of ICCV, pp. 2375–2384 (2019) 22. Mangalam, K., Girase, H., Agarwal, S., Lee, K.-H., Adeli, E., Malik, J., Gaidon, A.: It is not the journey but the destination: Endpoint conditioned trajectory prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 759–776. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_45 23. Ronneberger, O., Fischer, P., Brox, T: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and ComputerAssisted Intervention, pp. 234–241 (2015) 24. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of CVPR, pp. 1125–1134 (2017) 25. Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: Human trajectory understanding in crowded scenes. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 549–565. Springer, Cham (2016). https://doi.org/ 10.1007/978-3-319-46484-8_33 26. Pellegrini, S., Ess, A., Schindler, K., Van Gool, L.: You’ll never walk alone: Modeling social behavior for multi-target tracking. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 261–268 (2009) 27. Lerner, A., Chrysanthou, Y., Lischinski, D.: Crowds by example. Comput. Graph. Forum 26(3), 655–664 (2007)

CenterMatch: A Center Matching Method for Semi-supervised Facial Expression Recognition

Linhuang Wang1, Xin Kang1(B), Satoshi Nakagawa2, and Fuji Ren3

1 Department of Information Science and Intelligent Systems, Tokushima University, Tokushima, Japan
[email protected]
2 Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
3 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract. The label uncertainty in large-scale qualitative facial expression datasets, caused by low-quality images and subjective annotations, combined with the high similarity between facial expression categories, makes facial expression recognition (FER) more challenging than traditional classification tasks. To address this problem, this paper proposes a new method called CenterMatch for semi-supervised facial expression recognition. Our approach sets class centers in a high-dimensional space and independent adaptive confidence thresholds for each category. As the positions of the class centers are continuously updated by high-quality sample features, the confidence thresholds for each category adapt accordingly. Subsequently, high-confidence unlabeled samples are selected and assigned high-quality pseudo-labels, effectively suppressing label uncertainty. Additionally, we introduce a distance loss to constrain the updating direction of the class centers, encouraging them to move away from each other and overcome the challenge of high similarity between expression categories. Experimental results demonstrate that our model outperforms other semi-supervised learning methods by adaptively adjusting confidence thresholds based on the learning status and difficulty of different categories, achieving a favorable balance between the quality and quantity of pseudo-labels. Furthermore, our approach exhibits excellent performance on multiple widely-used challenging in-the-wild datasets, confirming its effectiveness and generalizability.

Keywords: Facial Expression Recognition · Semi-supervised Learning · Affective Computing

This research has been supported by the Project of Discretionary Budget of the Dean, Graduate School of Technology, Industrial and Social Sciences, Tokushima University.

1 Introduction

Facial expression is a potent, natural, and universal signal for humans to communicate their emotional states and intentions [1,2]. Facial Expression Recognition (FER) has wide applications in the real world, such as driver fatigue detection, affective computing, service robots, and human-computer interaction [3]. The advancements in FER heavily rely on large-scale annotated datasets [4]. However, collecting such datasets with a large number of labeled samples is challenging and expensive, as it requires the expertise of multiple human annotators [5,6]. Therefore, semi-supervised facial expression recognition (SS-FER) holds significant research value, as it enables training models on a large amount of unlabeled data. Consistency regularization and pseudo-labeling are two powerful techniques that leverage unlabeled data and have been widely adopted in modern semi-supervised learning (SSL) algorithms [10]. Recently proposed SSL methods [8,10–12] have achieved competitive results by combining these techniques with both weak and strong data augmentations.


Fig. 1. Some contrasting examples from the RAF-DB [5] training set. Certain facial expressions are highly similar, such as Surprise and Fear, or Neutral and Disgust. Additionally, identifying the expression on the right is exceedingly challenging for both machines and even humans.

Despite the impressive performance of these methods on common classification tasks, they still face challenges when applied to SS-FER. There are two main reasons for this: (1) High similarity among FER categories [13]. Unlike common datasets, certain facial expressions exhibit extremely high similarity. As shown in Fig. 1, distinguishing between fear and surprise is challenging. (2) Uncertainty in labeling [7]. As shown in Fig. 1, collecting high-quality labels for in-the-wild facial expression datasets sourced from the internet is challenging due to factors such as annotators’ subjectivity and image artifacts like occlusion and blurring.


In this approach, we construct class centers in a high-dimensional feature space and independent adaptive confidence thresholds for each category. During training, high-quality sample features are used to update the positions of the class centers, and the corresponding confidence thresholds adaptively change based on the distribution of the class centers. Categories whose centers are far apart from the other class centers indicate better learning states and lower learning difficulties, and therefore maintain higher confidence thresholds. Conversely, categories with closer distances have lower confidence thresholds. The adaptive thresholds enhance the model's attention towards categories with higher learning difficulties while ensuring high-quality pseudo-labels and improving the utilization of unlabeled data in the early stages of training. In this process, unlabeled data may be assigned more accurate pseudo-labels than the original labels, effectively suppressing label uncertainty in the datasets. Additionally, we introduce a distance loss to constrain the update of the class centers. The distance loss encourages the class centers to move farther apart, thereby improving the recognition performance on highly similar facial expression categories. In summary, our main contributions can be summarized as follows:

• We propose a novel end-to-end SS-FER method called CenterMatch, which addresses the challenge of differentiating highly similar facial expression categories through the constraint of a distance loss. To the best of our knowledge, this is the first exploration of the issue of high similarity between classes in SS-FER.
• Our adaptive confidence thresholds achieve an optimal balance between the quality and quantity of pseudo-labels, effectively suppressing label uncertainty through the utilization of high-quality pseudo-labels.
• Our CenterMatch outperforms other SSL algorithms on multiple widely used and challenging in-the-wild datasets, demonstrating superior performance.

2 Background

We first formulate the SSL framework for a C-class classification problem. Denote a batch of the labeled and unlabeled datasets as D_L = \{(x^l_i, y^l_i)\}_{i=1}^{N_L} and D_U = \{x^u_i\}_{i=1}^{N_U}, respectively, where x^l_i and x^u_i are the labeled and unlabeled training samples, and y^l_i is the one-hot ground-truth label of the labeled data. We use N_L and N_U to represent the number of training samples in D_L and D_U, respectively. Each unlabeled sample is individually subjected to weak augmentation and strong augmentation, resulting in the augmented sample sets W_U = \{x^w_i\}_{i=1}^{N_U} and S_U = \{x^s_i\}_{i=1}^{N_U}, respectively. This process can be represented as follows:

x^w_i = \omega(x^u_i), \quad x^s_i = \Omega(x^u_i)    (1)

where ω(·) represents weak augmentation, and Ω(·) represents strong augmentation.
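A small torchvision sketch of one possible realization of ω(·) and Ω(·), matching the augmentation choices reported later in the implementation details (RandomCrop, RandomHorizontalFlip, RandAugment); the crop padding and RandAugment settings are illustrative assumptions.

from torchvision import transforms
from torchvision.transforms import RandAugment

# Weak augmentation ω(·): random crop + horizontal flip.
weak_aug = transforms.Compose([
    transforms.RandomCrop(224, padding=16),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Strong augmentation Ω(·): RandAugment on top of the weak transforms.
strong_aug = transforms.Compose([
    transforms.RandomCrop(224, padding=16),
    transforms.RandomHorizontalFlip(),
    RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])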


All samples are fed into the model to obtain both the sample feature and the model prediction. It is worth noting that the term sample feature refers to the feature obtained from the last layer before making predictions. This process can be represented as follows:

f^l_i, p^l_i = \mathrm{Net}(x^l_i), \quad f^w_i, p^w_i = \mathrm{Net}(x^w_i), \quad f^s_i, p^s_i = \mathrm{Net}(x^s_i)    (2)

where Net(·) represents the model, and f ∈ R^d and p ∈ R^C respectively denote the sample feature and the model's prediction. We employ the cross-entropy loss as the objective function for labeled data and strongly-augmented unlabeled data, where the pseudo-labels generated from predictions on weakly-augmented data serve as the labels for the strongly-augmented data. This process can be described as follows:

\hat{y}^u_i = \mathrm{argmax}(p^w_i)    (3)

L = \frac{1}{N_L} \sum_{i=1}^{N_L} \mathrm{CE}(y^l_i, p^l_i) + \frac{1}{N_U} \sum_{i=1}^{N_U} \mathbb{1}\left(\max(p^w_i) \geq \tau\right) \cdot \mathrm{CE}(\hat{y}^u_i, p^s_i)    (4)

where yˆiu is the one-hot pseudo-label. CE(·) represents the cross-entropy loss, and τ denotes the confidence threshold. From Eq. 4, it can be observed that in previous works [8–12], there has been a tendency to transform semi-supervised learning into supervised learning as much as possible. While this approach may be suitable for general classification tasks, it is not effective for FER due to the high similarity among certain facial expressions that are difficult to distinguish, and the uncertainty in labels may interfere with the effectiveness of supervised learning. Additionally, in some prior works, τ is set to a high fixed value (e.g., FixMatch [8] sets τ to 0.95). This not only leads to low sample utilization during the early stages of training but also results in poor learning performance for challenging class distinctions. Therefore, to address these issues specifically in FER tasks, we propose a novel semi-supervised algorithm called CenterMatch.
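A minimal PyTorch sketch of the generic objective in Eq. (3)–(4) with a fixed threshold τ, as used by the prior methods discussed above; CenterMatch later replaces the fixed τ with the per-class thresholds τ(c_j). The function name is illustrative.

import torch
import torch.nn.functional as F

def thresholded_pseudo_label_loss(logits_l, labels_l, logits_w, logits_s, tau=0.95):
    """Supervised CE plus confidence-thresholded pseudo-label CE (Eq. 3-4)."""
    sup = F.cross_entropy(logits_l, labels_l)
    with torch.no_grad():
        probs_w = torch.softmax(logits_w, dim=1)
        conf, pseudo = probs_w.max(dim=1)            # pseudo-label = argmax p^w
        mask = (conf >= tau).float()                 # 1(max(p^w) >= tau)
    unsup = (F.cross_entropy(logits_s, pseudo, reduction="none") * mask).mean()
    return sup + unsup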

3 CenterMatch

The overall architecture of our proposed method, CenterMatch, is depicted in Fig. 2. Our model maps the samples into a high-dimensional feature space, where each category is associated with an updating class center. In addition to the commonly used cross-entropy loss, we introduce a distance loss to encourage samples of the same class to be closer and samples from different classes to be farther apart, thereby enhancing the discriminability of highly similar facial expressions. Moreover, we employ independent adaptive confidence thresholds for each category, which not only ensure the quality of pseudo-labels but also improve the model’s learning effectiveness for challenging categories and the utilization of unlabeled data.


Fig. 2. The overall architecture of our CenterMatch, where Ci represents the class center of the i-th class.

3.1 Class Center

We construct a high-dimensional feature space where samples of the same class are encouraged to cluster together, while samples from different classes are pushed apart as far as possible, aiming to increase the distinguishability of similar facial expressions. Each class is represented by a class center c_j ∈ R^d (j ∈ {1, 2, ..., C}), which is iteratively updated during the training process. During training, we update the class centers by selecting sample features with higher confidence from the labeled data and the strongly-augmented data. The selection process can be described as follows:

\Delta f^l_j = \frac{\sum_{n=1}^{N_L} \mathbb{1}(\max(p^l_i) \geq \delta) \cdot \mathbb{1}(y^l_i = j) \cdot f^l_i}{\sum_{n=1}^{N_L} \mathbb{1}(\max(p^l_i) \geq \delta) \cdot \mathbb{1}(y^l_i = j)}    (5)

\Delta f^s_j = \frac{\sum_{n=1}^{N_U} \mathbb{1}(\max(p^s_i) \geq \delta) \cdot \mathbb{1}(\hat{y}^u_i = j) \cdot f^s_i}{\sum_{n=1}^{N_U} \mathbb{1}(\max(p^s_i) \geq \delta) \cdot \mathbb{1}(\hat{y}^u_i = j)}    (6)

\Delta c_j = \frac{\Delta f^l_j + \Delta f^s_j}{2}    (7)

where Δf^l_j and Δf^s_j represent the average sample features selected from the labeled and strongly-augmented data, respectively, and Δc_j represents the update value for the class centers.


δ is a hyperparameter that serves as the threshold for reliability. A higher value of δ indicates that the selected features for updating the class centers are more reliable. The effects of different values of δ are discussed in Sect. 4.3. The update process for the class centers can be described as follows:

c^t_j = \begin{cases} \text{random initialization}, & \text{if } t = 0, \\ \lambda c^{t-1}_j + (1 - \lambda)\, \Delta c^t_j, & \text{otherwise}, \end{cases}    (8)

We employ an exponential moving average (EMA) to update the class centers, with the momentum decay λ set to 0.999, where t represents the training step.
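A hedged PyTorch sketch of the class-center selection and EMA update of Eq. (5)–(8). When one of the two feature sets is empty for a class, this sketch falls back to the available one instead of the exact average in Eq. (7); function and argument names are assumptions.

import torch

@torch.no_grad()
def update_class_centers(centers, feats_l, probs_l, labels_l,
                         feats_s, probs_s, pseudo_s, delta=0.9, lam=0.999):
    """EMA update of the class centers following Eq. (5)-(8)."""
    C, _ = centers.shape
    for j in range(C):
        sel_l = (probs_l.max(1).values >= delta) & (labels_l == j)
        sel_s = (probs_s.max(1).values >= delta) & (pseudo_s == j)
        parts = []
        if sel_l.any():
            parts.append(feats_l[sel_l].mean(0))     # Δf^l_j  (Eq. 5)
        if sel_s.any():
            parts.append(feats_s[sel_s].mean(0))     # Δf^s_j  (Eq. 6)
        if parts:
            delta_c = torch.stack(parts).mean(0)     # Δc_j    (Eq. 7)
            centers[j] = lam * centers[j] + (1 - lam) * delta_c   # Eq. (8)
    return centers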

3.2 Adaptive Class Threshold

SSL algorithms use confidence scores to select which unlabeled samples to include in training. Recent studies have shown that setting the confidence threshold separately for each class is superior to using a fixed threshold for all classes [4,10–12]. Specifically, we adjust the confidence thresholds independently for each category based on its learning status and difficulty. Higher confidence thresholds are set for easily distinguishable categories, while lower thresholds are assigned to challenging categories (e.g., facial expressions that are prone to confusion), allowing the model to focus more on learning from samples of those categories. Moreover, lower confidence thresholds are beneficial for improving the utilization of unlabeled data in the early stages of training. Therefore, we establish independent adaptive confidence thresholds for each category, dynamically adjusted according to the distribution of the class centers in the feature space. Specifically, when a class center is farther away from the other centers, the category is more easily distinguishable by the model, implying a better learning status and lower learning difficulty; as a result, a higher confidence threshold is assigned to that category. Conversely, if the distance is smaller, suggesting higher similarity to other categories, a lower threshold is set. To achieve this, we compute the distances between each class center and the other class centers, and use these distances to update the confidence thresholds for each category. The specific process can be described as follows:

\mathrm{dis}(c_j) = \sum_{k=1, k \neq j}^{C} \left\| c_j - c_k \right\|_2    (9)

\tau(c_j) = \mathrm{MaxNorm}(\mathrm{dis}(c_j)) \cdot \tau_{max} = \frac{\mathrm{dis}(c_j)}{\max\{\mathrm{dis}(c) : c \in [C]\}} \cdot \tau_{max}    (10)

where dis(c_j) represents the sum of distances between the j-th class center and the other class centers, and τ(c_j) denotes the confidence threshold for the j-th class. We set τ_max to 0.95 as the upper limit of the confidence threshold. After scaling with maximum-value normalization, more distant classes (corresponding to easily distinguishable expressions) maintain higher confidence thresholds, while closer classes (associated with difficult-to-distinguish expressions) have lower thresholds. Furthermore, during training, these confidence thresholds continuously change as the class centers are updated.
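A compact sketch of Eq. (9)–(10); the diagonal of the pairwise distance matrix is zero, so summing each row equals the sum over k ≠ j. The function name is an assumption.

import torch

def adaptive_thresholds(centers, tau_max=0.95):
    """Per-class confidence thresholds τ(c_j) from the class centers (Eq. 9-10)."""
    dist = torch.cdist(centers, centers, p=2)        # pairwise ||c_j - c_k||_2
    dis = dist.sum(dim=1)                            # Eq. (9): diagonal terms are 0
    return dis / dis.max() * tau_max                 # Eq. (10): max-normalised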

3.3 Loss Function

Following the previous SSL methods [4,8–12], we compute the cross-entropy loss separately for the labeled samples (denoted as L^l_ce) and the strongly-augmented samples (denoted as L^u_ce). This process can be described as follows:

L^l_{ce} = \frac{1}{N_L} \sum_{i=1}^{N_L} \mathrm{CE}(y^l_i, p^l_i)    (11)

L^u_{ce} = \frac{1}{N_U} \sum_{i=1}^{N_U} \mathbb{1}\left(\max(p^w_i) \geq \tau(\hat{y}^u_i)\right) \cdot \mathrm{CE}(\hat{y}^u_i, p^s_i)    (12)

In contrast to previous SSL methods [4,8–12], inspired by CenterLoss [14], we introduce a distance loss that aims to bring samples of the same class closer together in the feature space while pushing samples of different classes further apart. For any given sample x, with corresponding feature f and label y, the distance loss can be computed as follows:

L_{dis}(y, f) = \frac{\left\| f - c_y \right\|_2^2}{\left\| f - c_y^{closest} \right\|_2^2 + \epsilon}    (13)

where c_y represents the correct class center to which x belongs, while c_y^{closest} represents the class center that is closest to c_y, indicating the most confusable class. To prevent a zero denominator, we introduce a constant term ε set to 0.0001. The purpose of L_dis(y, f) is to decrease the distance between x and c_y while increasing the distance between x and c_y^{closest}, thereby enhancing the distinguishability of c_y and c_y^{closest}.

We avoid using the sum of distances between c_y and all other class centers as the denominator because not all classes are equally confusable with c_y. If we considered the sum of distances, the model might overly emphasize increasing the distances to class centers that are already far away in order to make the loss smaller, and consequently neglect the closest class centers. Therefore, we only consider the distance between c_y and c_y^{closest}, allowing the model to focus on distinguishing the most confusable classes.

Specifically, we compute the distance loss separately for the labeled samples (denoted as L^l_dis) and the strongly-augmented samples (denoted as L^u_dis). This process can be described as follows:

L^l_{dis} = \frac{1}{N_L} \sum_{i=1}^{N_L} L_{dis}(y^l_i, f^l_i)    (14)

L^u_{dis} = \frac{1}{N_U} \sum_{i=1}^{N_U} \mathbb{1}\left(\max(p^w_i) \geq \tau(\hat{y}^u_i)\right) \cdot L_{dis}(\hat{y}^u_i, f^s_i)    (15)

Therefore, the overall loss function of the network is defined as follows:

L = L^l_{ce} + \alpha_{dis} L^l_{dis} + \alpha_u \left( L^u_{ce} + \alpha_{dis} L^u_{dis} \right)    (16)

where αu and αdis represent the loss weights for unlabeled samples and the distance loss respectively. Following previous works [8,10,11], we set αu to 1.
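A hedged sketch of Eq. (13) and the total objective of Eq. (16); the confidence mask of Eq. (14)–(15) is assumed to be applied by the caller, and all names here are illustrative.

import torch

def distance_loss(feats, labels, centers, eps=1e-4):
    """L_dis of Eq. (13): pull features towards their own center and away from
    the single most confusable (closest) other center."""
    cc = torch.cdist(centers, centers)                      # pairwise center distances
    cc.fill_diagonal_(float("inf"))
    closest = cc.argmin(dim=1)                              # index of c_y^closest per class
    c_own = centers[labels]                                 # c_y
    c_near = centers[closest[labels]]                       # c_y^closest
    num = ((feats - c_own) ** 2).sum(dim=1)                 # ||f - c_y||^2
    den = ((feats - c_near) ** 2).sum(dim=1) + eps          # ||f - c_y^closest||^2 + eps
    return (num / den).mean()

def total_loss(l_ce_l, l_dis_l, l_ce_u, l_dis_u, alpha_dis=1.0, alpha_u=1.0):
    """Overall objective of Eq. (16), with alpha_u = 1 as in the paper."""
    return l_ce_l + alpha_dis * l_dis_l + alpha_u * (l_ce_u + alpha_dis * l_dis_u)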

4 Experiments

4.1 Datasets

We evaluate our method on three widely used in-the-wild datasets: RAF-DB [5], AffectNet [6] and FER-2013 [15]. The FER-2013 [15] dataset was collected and labeled automatically using the Google Image Search API. It consists of over 35,000 images covering 7 basic expressions, with 28,000 images allocated to the training split. The RAF-DB [5] dataset comprises approximately 15,000 images, with 12,000 images allocated to the training split. The annotation process involved 315 human annotators, with each label reviewed by approximately 40 individual annotators. This dataset covers 7 expression classes. AffectNet [6] is an immensely large dataset collected from the Internet. We used the images depicting the 7 basic facial expressions, with approximately 28K images in the training set and 3,500 images in the validation set.

4.2 Implementation Details

We conducted experiments using the PyTorch toolbox on an NVIDIA TITAN RTX GPU. Building upon previous studies [2,16], we adopted ResNet-18 [17] as the backbone network, with a parameter count of 11.18M and a computational complexity of 1.82G FLOPs. All images were resized to 224 × 224 before being fed into the network. Consequently, the term f in Eq. 2 refers to a 512-dimensional feature vector obtained through global average pooling. We applied RandomCrop and RandomHorizontalFlip as weak augmentation methods and used RandAugment [18] as the strong augmentation method. We set δ to 0.9. We used the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate was set to 0.1. For labeled data, a batch size of 64 was used, while for unlabeled data the batch size was twice as large. All models were trained for 2^20 iterations. To ensure a fair comparison, we maintained these settings across all experiments. It is worth noting that the class centers and the distance loss are computed only during the training phase, and thus introduce no additional computational load at inference time. In the semi-supervised experiments, we randomly selected 10, 25, 100, and 250 samples per class from the original training set as labeled data, while the remaining samples were treated as unlabeled data.


Since all the datasets consist of 7 categories, the resulting numbers of labeled samples were 70, 175, 700, and 1750, respectively. Additionally, for comparison purposes, we conducted experiments under the fully supervised setting using the same backbone network. To evaluate the models' performance, we used all samples from the original validation set of each dataset and employed accuracy as the evaluation metric. Following the SSL evaluation protocol, we conducted five runs with different random seeds for semi-supervised learning and report the average and standard deviation of the results across the five runs to ensure reliability and provide comprehensive insights.

4.3 Ablation Study

We evaluated the impact of the different components of CenterMatch, and the specific results are presented in Table 1. We can observe that both the adaptive threshold and the distance loss improve the performance of the model, with the distance loss showing a greater improvement. This indicates that our proposed distance loss effectively enhances the discriminability between different categories.

Table 1. Evaluation of the impact of CenterMatch components (accuracy on RAF-DB with 70 / 1750 labels). Rows are the four combinations of Ldis and τc; the last row uses both.
58.45±3.12 / 75.89±2.14
60.41±2.60 / 78.45±1.45
71.35±0.54 / 81.35±1.37
74.32±1.02 / 83.12±1.21

Table 2. Evaluation of the impact of δ (accuracy on RAF-DB with 70 / 1750 labels).
δ = 0.14: 34.12±2.45 / 39.46±2.14
δ = 0.25: 40.15±4.21 / 45.17±3.45
δ = 0.9:  74.32±1.02 / 83.12±1.21
δ = 0.95: 72.45±0.13 / 81.32±0.24

Table 3. Evaluation of the impact of αdis (accuracy on RAF-DB with 70 / 1750 labels).
αdis = 0.01: 64.34±1.01 / 76.91±1.62
αdis = 0.1:  67.63±1.14 / 78.75±1.52
αdis = 1:    74.32±1.02 / 83.12±1.21
αdis = 10:   70.84±1.25 / 80.64±1.07


4.4

Quantitative and Qualitative Analysis

1

Accuracy

Sampling Rate

1 0.95 0.9 0.85 0.8

FixMatch

0.9

FlexMatch 0.8

FreeMatch

0.7

SoftMatch

0.6

0.75

CenterMatch

0.5

0.7 5k

55k

105k

5k

155k

55k

Iter.

105k

155k

Iter.

(a) Pseudo-label sampling rate in each iteration

(b) Pseudo-label accuracy in each iteration

Fig. 3. Comparison of the utilization of unlabeled data during training.

We conducted further analysis on the quantity and quality of pseudo-labels on RAF-DB [5] with 1750 labels. To quantitatively analyze the utilization of unlabeled data, we computed the ratio of the number of unlabeled samples used for training to the total number of unlabeled samples in a batch during each training iteration. This metric reflects the adoption rate of unlabeled data. Simultaneously, we calculated the proportion of correctly assigned pseudo-labels among the unlabeled samples used for training as a qualitative analysis metric. This metric measures the accuracy of the pseudo-labels. As shown in Fig. 3(a), our CenterMatch significantly improves the utilization of unlabeled data in the early stages of training compared to fixed-threshold methods like FixMatch [8]. Additionally, Fig. 3(b) illustrates that, except for FixMatch [8], our CenterMatch achieves the highest pseudo-label accuracy. As training progresses, the selected unlabeled samples inevitably contain increasingly challenging examples, leading to a decrease in the accuracy of pseudolabels. However, our method still maintains a high level of performance. In conclusion, our CenterMatch strikes a favorable balance between the quantity and quality of pseudo-labels, ensuring both a high sampling rate and accuracy of pseudo-labels. Table 4. The performance of different semi-supervised methods. Bold represents the best accuracy for each setting. Labels

FER-2013 70 175

700

1750

RAF-DB 70

175

700

1750

AffcetNet 70

175

700

1750

MixMatch [9] FixMatch [8] FlexMatch [10] FreeMatch [11] SoftMatch [12]

45.69±1.34 47.88±0.45 48.21±0.33 49.31±0.45 47.34±1.27

55.73±0.68 59.46±0.12 60.42±0.65 61.21±0.34 62.78±0.45

58.27±0.45 62.20±1.20 63.45±1.21 64.32±1.45 62.32±1.32

36.34±0.74 43.25±0.32 59.45±0.64 69.45±0.78 71.35±1.56

43.12±0.86 52.44±0.16 63.46±0.49 71.65±0.64 70.14±0.23

64.14±0.67 64.34±0.65 66.14±1.63 72.45±2.41 73.45±1.80

73.66±0.89 75.51±1.31 77.13±0.72 79.64±0.45 80.32±0.14

30.80±1.21 30.08±0.51 31.03±0.24 33.45±0.26 34.65±0.31

32.40±1.42 38.31±0.32 39.45±0.21 42.65±0.32 43.44±0.65

39.77±1.30 46.37±0.47 47.35±0.81 49.65±1.01 50.32±1.24

48.31±1.51 51.25±0.61 52.46±0.13 54.61±0.33 54.14±0.43

CenterMatch (ours)

50.31±1.64 52.32±0.77 63.61±0.89 65.45±1.03 74.32±1.02 75.45±1.45 78.45±1.17 83.12±1.21 38.45±0.43 45.64±0.54 52.14±0.35 56.26±1.12

Same Backbone in Fully Supervised

66.78

46.41±1.60 49.9±0.32 50.12±1.30 51.32±1.47 50.47±1.03

81.21

57.13

4.5 Comparison with State-of-the-Art Methods

Table 4 presents the comparison results between our method and other approaches. Our method outperforms the others in all experimental settings, even surpassing the performance of supervised learning on the RAF-DB dataset. We attribute this improvement to the significant label uncertainty present in the RAF-DB dataset. In fully supervised learning, the inclusion of noisy labels can cause the model to learn erroneous features, ultimately impacting its performance. However, in the context of semi-supervised learning, when a noisy label corresponds to an unlabeled sample, our method assigns a pseudo-label that better aligns with the sample’s distinctive features. Essentially, it reallocates a label that is more accurate than the original one. This approach enhances the model’s resilience to label uncertainty, contributing to its improved performance.

Fig. 4. t-SNE [19] visualization of facial expression features obtained by different methods.

4.6 Visualization

We visualized the facial expression features obtained from different methods using t-SNE [19]. All models were trained on RAF-DB [5] with 1750 labels. As shown in Fig. 4, our CenterMatch clusters features of the same category closely together and exhibits clear boundaries between different categories. Compared to other methods, our approach enables easier discrimination between different expressions, even those with high similarity. Furthermore, we calculated the confusion matrices for the different methods. As depicted in Fig. 5, CenterMatch notably improves the recognition accuracy of highly similar expressions, such as Surprise and Fear, and Disgust and Neutral, when compared to other methods.

Fig. 5. Comparison of the confusion matrices among different methods.

5 Conclusion

In this paper, we propose a novel method called CenterMatch for semi-supervised facial expression recognition. Our approach involves constructing a class center and an independent adaptive confidence threshold for each category. The confidence threshold dynamically adapts as the class center is continuously updated, thereby mitigating the label uncertainty caused by low-quality images and annotator subjectivity. Additionally, we introduce a distance loss to encourage the class centers to move farther apart, facilitating better discrimination between highly similar facial expression categories. Our method achieves state-of-the-art results on multiple widely used, challenging in-the-wild datasets, demonstrating its effectiveness and generalization ability.

References 1. Tian, Y.I., Kanade, T., Cohn, J.F.: Recognizing action units for facial expression analysis. IEEE Trans. Pattern Anal. Mach. Intell. 23(2), 97–115 (2001) 2. Darwin, C., Prodger, P.: The Expression of the Emotions in Man and Animals. Oxford University Press, USA (1998) 3. She, J., Hu, Y., Shi, H., Wang, J., Shen, Q., Mei, T.: Dive into ambiguity: latent distribution mining and pairwise uncertainty estimation for facial expression recognition. In: CVPR, pp. 6248–6257 (2021) 4. Li, H., Wang, N., Yang, X., Wang, X., Gao, X.: Towards semi-supervised deep facial expression recognition with an adaptive confidence margin. In: CVPR, pp. 4166–4175 (2022) 5. Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: CVPR, pp. 2852–2861 (2017) 6. Mollahosseini, A., Hasani, B., Mahoor, M.H.: AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10(1), 18–31 (2017) 7. Wang, K., Peng, X., Yang, J., Lu, S., Qiao, Y.: Suppressing uncertainties for largescale facial expression recognition. In: CVPR, pp. 6897–6906 (2020) 8. Sohn, K., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. Adv. Neural. Inf. Process. Syst. 33, 596–608 (2020) 9. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: MixMatch: a holistic approach to semi-supervised learning. Adv. Neural. Inf. Process. Syst. 32 (2019) 10. Zhang, B., et al.: FlexMatch: boosting semi-supervised learning with curriculum pseudo labeling. Adv. Neural. Inf. Process. Syst. 34, 18408–18419 (2021) 11. Wang, Y., et al.: FreeMatch: self-adaptive thresholding for semi-supervised learning. arXiv preprint arXiv:2205.07246 (2022) 12. Chen, H., et al.: SoftMatch: addressing the quantity-quality trade-off in semisupervised learning. arXiv preprint arXiv:2301.10921 (2023) 13. Pan, X., Liu, W., Wang, Y., Lu, X., Liu, B.: MSL-FER: mirrored self-supervised learning for facial expression recognition. In: ICIP, pp. 1601–1605. IEEE (2022) 14. Wen, Y., Zhang, K., Li, Z., Qiao, Yu.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi. org/10.1007/978-3-319-46478-7 31


15. Goodfellow, I.J., et al.: Challenges in representation learning: a report on three machine learning contests. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.) ICONIP 2013. LNCS, vol. 8228, pp. 117–124. Springer, Heidelberg (2013). https:// doi.org/10.1007/978-3-642-42051-1 16 16. Roy, S., Etemad, A.: Analysis of semi-supervised methods for facial expression recognition. In: ACII, pp. 1–8. IEEE (2022) 17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016) 18. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: practical automated data augmentation with a reduced search space. In: CVPR, pp. 702–703 (2020) 19. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)

Cross-Dataset Distillation with Multi-tokens for Image Quality Assessment

Timin Gao1,2, Weixuan Jin2, Bokai Lai2, Zhen Chen2, Runze Hu3, Yan Zhang1,4(B), and Pingyang Dai1,2

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen 361005, People’s Republic of China [email protected] Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen 361005, People’s Republic of China 3 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, People’s Republic of China 4 Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, People’s Republic of China Abstract. No Reference Image Quality Assessment (NR-IQA) aims to accurately evaluate image distortion by simulating human assessment. However, this task is challenging due to the diversity of distortion types and the scarcity of labeled data. To address these issues, we propose a novel attention distillation-based method for NR-IQA. Our approach effectively integrates knowledge from different datasets to enhance the representation of image quality and improve the accuracy of predictions. Specifically, we introduce a distillation token in the Transformer encoder, enabling the student model to learn from the teacher across different datasets. By leveraging knowledge from diverse sources, our model captures essential features related to image distortion and enhances the generalization ability of the model. Furthermore, to refine perceptual information from various perspectives, we introduce multiple class tokens that simulate multiple reviewers. This not only improves the interpretability of the model but also reduces prediction uncertainty. Additionally, we introduce a mechanism called Attention Scoring, which combines the attention-scoring matrix from the encoder with the MLP header behind the decoder to refine the final quality score. Through extensive evaluations of six standard NR-IQA datasets, our method achieves performance comparable to the state-of-the-art NR-IQA approaches. Notably, it achieves SRCC values of 0.932 (compared to 0.892 in TID2013) and 0.964 (compared to 0.946 in CSIQ). Keywords: Image quality assessment

· Distillation · Transformer

1 Introduction

No Reference Image Quality Assessment (NR-IQA) is a fundamental research domain that aims to simulate the subjective evaluation of distorted images by


human perceptual systems, without reference images. The state-of-the-art methods [1–4] in this field currently utilize pre-trained upstream backbones to extract semantic features, which are subsequently fine-tuned on NR-IQA datasets. However, the scarcity of IQA datasets poses a challenge, as simple fine-tuning often yields unsatisfactory results. Consequently, numerous researchers are dedicated to fully leveraging the limited information available in IQA datasets. For example, Zheng et al. [5] extracted reference information from degraded images by distilling knowledge from pristine-quality images, enabling the capture of deep image priors that prove useful for quality assessment. Yin et al. [6] incorporated non-IQA datasets as reference images to expand the dataset and utilized knowledge distillation to transfer distribution differences across various distorted images. We believe that distillation solely from the original dataset or other task datasets is inadequate for acquiring crucial quality feature representations, and it also hampers the fusion of knowledge between students and teachers. To overcome this challenge, we propose a novel approach called CDMT-IQA, which leverages knowledge distillation across datasets to acquire more essential quality representations. Specifically, we augment the Vision Transformer architecture with a distillation token, which serves a similar purpose as the class token but focuses on reproducing the teacher’s pseudo labels. It is noteworthy that the student and teacher models are associated with different datasets. Through distillation, the student can acquire cross-dataset knowledge from the teacher, and through cross-dataset knowledge fusion, further enhance the model’s analytical capabilities. Furthermore, we introduce multiple class tokens and an Attention Scoring mechanism based on self-attention to mitigate scoring uncertainties. Our contributions are the following: • To the best of our knowledge, we are the first to introduce a distillation token for cross-dataset knowledge distillation in the NR-IQA task. By incorporating knowledge from diverse datasets, our model is capable of capturing crucial information related to image quality perception. This integration further enhances the generalization capability of the model. • We introduce multiple class tokens to simulate different image quality judges and introduce a new Attention Scoring mechanism to generate quality predictions. Multiple class tokens reduce the randomness of prediction, and Attention Scoring thoroughly explores the relationship between class tokens and the image, as well as the underlying distortion information. • Our method demonstrated state-of-the-art performance across all four synthetic datasets and remained competitive on two authentic datasets.

2 Related Work

2.1 NR-IQA Based on CNN

Recently, deep learning-based NR-IQA methods have gained significant popularity due to their ability to effectively learn intricate perceptual features from


images. CNN-based NR-IQA methods commonly employ image blocks as input or extract learned features from different layers of the CNN to create multiscale representations, with shallow layers capturing local details and deep layers capturing high-level semantics, which are then combined and mapped to quality scores. Several relevant works [7–9] have been acknowledged and advocated. Although these strategies have been proven effective, they inevitably introduce a substantial computational burden during training and inference. Moreover, CNN-based methods are frequently limited by the inadequate representation of non-local features, resulting from the inherent local bias of CNN. This limitation negatively impacts the NR-IQA task.

2.2 NR-IQA Based on Vision Transformer

Vision Transformer (ViT) [10] has demonstrated impressive performance in various visual-related applications. The NR-IQA model based on ViT can be divided into two categories: hybrid transformers and pure ViT transformers. The method based on the hybrid Transformer [2,11] employs CNN to extract perceptual features, which are then used as input to the Transformer encoder. On the other hand, pure ViT transformers [4,12] directly utilize image patches as input to the Transformer encoder. The Transformer-based methods have shown promising performance. However, their capability to accurately characterize image quality remains limited due to their exclusive reliance on Transformer encoders. Furthermore, existing ViT approaches commonly rely on the class token for determining image quality. It should be noted that class tokens were initially designed for describing image content in tasks such as object recognition. Consequently, directly utilizing a single class token to represent the perceived image quality features may not be the most suitable approach.

3 Methodology

3.1 Overview

We present CDMT-IQA, a novel Transformer-based framework for NR-IQA. The overall architecture of CDMT-IQA is illustrated in Fig. 1. Given an input image, we partition it into a sequence of patches. We introduce multiple learnable class tokens as well as a distillation token and incorporate them into Vision Transformer. By employing self-attention operations, these tokens capture both local and non-local dependencies from the patch embeddings, thereby preserving comprehensive information regarding image quality. Subsequently, we feed these tokens back into the Transformer decoder. Following multi-head self-attention, the tokens engage in cross-attention with the image features obtained by the decoder. Finally, we feed the quality embeddings obtained from class tokens into a score refinement module named Attention Scoring to obtain the quality score and feed the quality embeddings obtained from distillation tokens into an MLP head to obtain the distillation score. During training, we calculate L1 loss and

Fig. 1. The overview of our CDMT-IQA. (The figure shows the Vision Transformer encoder with patch tokens, multi-class-tokens, and a distillation token, the Transformer decoder, the Attention Scoring module built on the class-to-patch attention matrix Ac2p, the MLP heads, and the cross-dataset teacher supervising the distillation score.)

distillation loss for the two scores, respectively. During inference, we take a weighted sum of the two scores to obtain the final score. This design supports both accurate and generalizable prediction of image quality.

3.2 Multi-tokens Image Quality Transformer

Multi-tokens Transformer Encoder. We introduce a Multi-Tokens Transformer Encoder based on the Vision Transformer [10]. The Vision Transformer encoder combines both local and non-local information from a sequence of input patches in a way that minimizes inherent biases, enabling it to capture and analyze the perceptual features of an image comprehensively. In contrast to the standard design, we employ multiple class tokens and a distillation token instead of a single class token. More specifically, we first divide each image into N patches, and each patch is transformed into a D-dimensional embedding using a linear projection layer. By incorporating position embeddings into each patch embedding, we generate a sequence of patch tokens X = {x_0, x_1, ..., x_{N−1}} ∈ R^{N×D}. Furthermore, we introduce M additional learnable class tokens C^0 = {c^0_0, c^0_1, ..., c^0_{M−1}} ∈ R^{M×D} and a distillation token D^0 = {d^0} ∈ R^{1×D}. The final input token sequence T = {C^0, D^0, X} ∈ R^{(M+N+1)×D} comprises the patch tokens, the class tokens, and the distillation token. Three linear projection layers are employed to convert T into the matrices Q, K, V ∈ R^{(M+N+1)×D}, representing the query, key, and value, respectively. The Multi-Head Self-Attention (MHSA) operation is utilized to generate the output feature F_o, which is obtained as follows:

F_M = MHSA(Q, K, V) + T,   (1)
F_o = MLP(Norm(F_M)) + F_M,   (2)


In the above equations, Norm(·) denotes layer normalization, and the output of the Transformer encoder is denoted as F_o = {F_o[0]; ...; F_o[M+N]} ∈ R^{(M+N+1)×D}, which preserves the same size as the input.

Transformer Decoder. To effectively characterize image quality, we utilize the output of the encoder's multi-class-tokens as the input for the decoder. This approach aims to enhance the significance of the extracted features for image quality, as expressed below:

Q_d = MHSA(Norm(C^1, D^1) + (C^1 + D^1)),   (3)
C^2, D^2 = MHCA(Norm(Q_d), K_d, V_d) + Q_d,   (4)

where C^1 = {F_o[0]; ...; F_o[M−1]} ∈ R^{M×D} and D^1 = {F_o[M]} ∈ R^{1×D} denote the encoder outputs corresponding to C^0 and D^0, respectively. K_d = V_d = {F_o[M+1]; ...; F_o[M+N]} ∈ R^{N×D} are employed as the key and value of the decoder. Then, Q_d, K_d, and V_d are sent to a Multi-Head Cross-Attention (MHCA) block to perform the cross-attention. This interaction ensures that the attentional features become more significant for image quality.

3.3 Image Quality Prediction

The output of the decoder is commonly regarded as a consolidation of perceptual information pertaining to image quality. This output is then forwarded to the MLP header to facilitate the regression task for quality prediction. By using multiple class tokens, the output of each token can be interpreted as the quality preference of a judge with a different subjective opinion. Furthermore, we introduce an Attention Scoring mechanism that combines the MLP information with the attention information derived from the encoder to improve the prediction accuracy of the class tokens.

Specifically, within the Transformer encoder we can establish a token-to-token attention map through the global pairwise attention matrix A_{t2t} ∈ R^{(M+N+1)×(M+N+1)}, where A_{t2t} = softmax(QK^T / √D). From it we extract the class-to-patch attention A_{c2p} ∈ R^{M×N}, with A_{c2p} = A_{t2t}[0 : M−1, M+1 : M+N], which signifies the relationship between each judge and each patch. Each row of the attention matrix A_{c2p} corresponds to the attention scores of one judge across all patches. We can extract the attention map from each layer of the Transformer encoder. Considering that higher layers acquire more sophisticated discriminative representations, whereas earlier layers capture general and low-level visual information, we merge the class-to-patch connections from the last K Transformer encoding layers to establish a more precise correspondence between judgments and images. This process can be expressed as:

Â_{c2p} = (1/K) Σ_l A^l_{c2p},   (5)

A_{c2i} = Σ_{j=0}^{N} Â_{c2p}[:, j],   (6)

where the sum in Eq. (5) runs over the last K encoder layers.

Subsequently, the scores obtained from A_{c2i} ∈ R^{M×1} and from the MLP are combined via a dot product, yielding the final predicted quality score Y_{class token}:

Y_{class token} = A_{c2i}^T · MLP(C^2).   (7)

For the final predicted quality score of the distillation token, Y_{distill token}, we simply use an MLP layer:

Y_{distill token} = MLP(D^2).   (8)

At inference, we use β to balance Y_{class token} and Y_{distill token} to obtain the final image quality score Y_{final}:

Y_{final} = β Y_{class token} + (1 − β) Y_{distill token}.   (9)

3.4 Attention Distillation Cross Datasets

We develop an attention distillation across datasets to enhance model accuracy and generalization. As depicted in Fig. 1, we introduce a distillation token into the initial embedding. As discussed in Sect. 3.3, our student model generates two scores corresponding to the class tokens and the distillation token. These scores are utilized to compute the hard loss for the labels and the distillation loss for the teacher. The global loss function is defined as follows:

L_{global} = λ ||Y_{class token} − Y||_1 + (1 − λ) ||Y_{distill token} − Y_T||_1,   (10)

where Yclass token and Ydistill token denote the outputs of multiple class tokens and the distillation token, Y represents the ground truth, YT represents the predictions of the teacher model, and || · ||1 represents the L1 norm. It is important to note that we employ a cross-dataset teacher model, implying that the dataset for training the teacher differs from the one currently used for student training. We incorporate these additional perturbations into the training of student models, aiming to acquire more fundamental representations of image quality by leveraging knowledge across different datasets.

4 Experiments

4.1 Benchmark Datasets and Evaluation Protocols

We evaluate the effectiveness of the proposed CDMT-IQA model by conducting an assessment using six standard NR-IQA datasets. These datasets consist of four synthetic datasets, namely LIVE [13], CSIQ [14], TID2013 [15], and KADID [16], as well as two authentic datasets, namely LIVEC [17] and SPAQ [18]. The synthetic datasets are generated by applying different types of


Table 1. Performance comparison measured by averages of SRCC and PLCC on four synthetic datasets and two authentic datasets, where bold entries indicate the best results and underlines indicate the second-best. Each cell reports PLCC / SRCC.

Method            LIVE           CSIQ           TID2013        KADID          LIVEC          SPAQ
DIIVINE [19]      0.908 / 0.892  0.776 / 0.804  0.567 / 0.643  0.435 / 0.413  0.591 / 0.588  0.600 / 0.599
BRISQUE [20]      0.944 / 0.929  0.748 / 0.812  0.571 / 0.626  0.567 / 0.528  0.629 / 0.629  0.817 / 0.809
ILNIQE [21]       0.906 / 0.902  0.865 / 0.822  0.648 / 0.521  0.558 / 0.534  0.508 / 0.508  0.712 / 0.713
WaDIQaM [8]       0.955 / 0.960  0.844 / 0.852  0.855 / 0.835  0.752 / 0.739  0.671 / 0.682  – / –
DBCNN [22]        0.971 / 0.968  0.959 / 0.946  0.865 / 0.816  0.856 / 0.851  0.869 / 0.851  0.915 / 0.911
TIQA [11]         0.965 / 0.949  0.838 / 0.825  0.858 / 0.846  0.855 / 0.850  0.861 / 0.845  – / –
P2P-BM [23]       0.958 / 0.959  0.902 / 0.899  0.856 / 0.862  0.849 / 0.840  0.842 / 0.844  – / –
HyperIQA [1]      0.966 / 0.962  0.942 / 0.923  0.858 / 0.840  0.845 / 0.852  0.882 / 0.859  0.915 / 0.911
TReS [2]          0.968 / 0.969  0.942 / 0.922  0.883 / 0.863  0.858 / 0.859  0.877 / 0.846  – / –
MUSIQ [3]         0.911 / 0.940  0.893 / 0.871  0.815 / 0.773  0.872 / 0.875  0.746 / 0.702  0.921 / 0.918
DACNN [24]        0.980 / 0.978  0.957 / 0.943  0.889 / 0.871  0.905 / 0.905  0.884 / 0.866  0.921 / 0.915
DEIQT [4]         0.982 / 0.980  0.963 / 0.946  0.908 / 0.892  0.887 / 0.889  0.894 / 0.875  0.923 / 0.919
CDMT-IQA (ours)   0.985 / 0.983  0.974 / 0.964  0.942 / 0.932  0.907 / 0.906  0.894 / 0.868  0.925 / 0.920

distortions, such as JPEG compression and random noise, to a limited set of original images. Specifically, the LIVE dataset contains 799 synthetically distorted images, while CSIQ contains 866, with five and six distortion types, respectively. In contrast, the TID2013 dataset comprises 3,000 images with 24 distortion types, and the KADID dataset consists of 10,125 images with 25 distortion types. On the other hand, the authentic datasets consist of images taken by diverse photographers using a variety of mobile devices. LIVEC comprises 1162 images, and SPAQ contains 11,000 images obtained from public multimedia resources. In order to evaluate the predictive performance of the CDMT-IQA model in relation to accuracy and monotonicity, we utilize two commonly employed criteria: Spearman's Rank Correlation Coefficient (SRCC) and Pearson's Linear Correlation Coefficient (PLCC). These metrics are widely recognized and provide a measurement scale ranging from 0 to 1. A value closer to 1 indicates a higher level of performance for both SRCC and PLCC, thereby reflecting superior predictive capabilities.

4.2 Implementation Details

To train the CDMT-IQA, we follow a standard procedure that involves randomly cropping the input image into ten patches, each with a resolution of 224 × 224 pixels. These patches are subsequently reshaped into a sequence of patches with a patch size of 16, and the input token dimension is 384. For the model architecture, we adopt the ViT-S Transformer encoder proposed in DeiT III [25]. The attention score is computed as the average of the encoder's last three layers, and the number of class tokens is set to 6. As for the decoder, we employ a depth of 1. The model is trained for 9 epochs with a learning rate of 2 × 10^−4.


Table 2. SRCC on the cross-dataset validation. The best performances are highlighted in boldface; underlines indicate the second-best.

Training → Testing   LIVEC→KonIQ   KonIQ→LIVEC   LIVE→CSIQ   CSIQ→LIVE
DBCNN                0.754         0.755         0.758       0.877
P2P-BM               0.740         0.770         0.712       –
HyperIQA             0.772         0.785         0.744       0.926
TReS                 0.733         0.786         0.761       –
DEIQT                0.744         0.794         0.781       0.932
CDMT-IQA             0.805         0.810         0.806       0.960

Additionally, a decay factor of 10 is applied every 3 epochs. The β in Eq. 9 is set to 0.99, and the λ in Eq. 10 is set to 10^−6. To ensure a comprehensive evaluation, we randomly partition each dataset into an 80% training set and a 20% testing set. To reduce performance bias, we repeat this process ten times and report the average values of PLCC and SRCC.

4.3 Overall Prediction Performance Comparison

We compare CDMT-IQA with 12 classical or state-of-the-art NR-IQA methods, including hand-crafted feature-based NR-IQA methods like ILNIQE [21] and BRISQUE [20] as well as deep learning-based methods such as HyperIQA [1] and TReS [2]. Table 1 shows the performance on the four synthetic datasets, revealing that CDMT-IQA outperforms all previous methods, particularly on CSIQ and TID2013, where significant improvements are achieved. In terms of SRCC, CDMT-IQA shows enhancements of 0.018 and 0.04 over the second-best performing method. The right half of Table 1 presents our performance on the authentic datasets, demonstrating that we achieve the highest accuracy on SPAQ and perform competitively with other methods on LIVEC. Due to the diversity of image content and distortion types, achieving excellent performance on so many datasets simultaneously is a daunting challenge. Therefore, these results provide strong evidence for the effectiveness and superiority of CDMT-IQA in accurately characterizing image quality.

4.4 Generalization Capability Validation

To further assess the generalization ability of CDMT-IQA, we conduct crossdataset validation experiments. Specifically, our model is trained on one dataset and subsequently tested on another dataset without any fine-tuning or parameter adaptation. To ensure simplicity and universality, we perform four sets of experiments, comprising two synthetic datasets and two authentic datasets. The experimental results of SRCC averages on the four datasets are presented in


Table 3. Ablation experiments on the TID2013 and CSIQ datasets (PLCC / SRCC). Bold entries indicate the best performance.

Baseline (neither component):   TID2013 0.896±0.050 / 0.873±0.063   CSIQ 0.960±0.010 / 0.945±0.014
Single component (Cross-Dataset Distillation or Multi-tokens Attention Scoring alone):
                                TID2013 0.922±0.028 / 0.906±0.033   CSIQ 0.969±0.006 / 0.960±0.008
                                TID2013 0.923±0.015 / 0.906±0.016   CSIQ 0.971±0.005 / 0.960±0.007
Both components:                TID2013 0.942±0.012 / 0.932±0.014   CSIQ 0.974±0.005 / 0.964±0.007

Table 2. It is evident that CDMT-IQA demonstrates the best performance on all four cross datasets. Notably, CDMT-IQA's cross-dataset performance not only significantly surpasses previous SOTA methods but also rivals some supervised methods on LIVE. These results strongly indicate the robust generalization ability of CDMT-IQA.

4.5 Ablation Study

CDMT-IQA is an innovative NR-IQA model based on the Transformer architecture, comprising two key components: Cross-Dataset Distillation and Multi-tokens Attention Scoring. Both components play pivotal roles in accurately characterizing image quality and enhancing overall model performance. To gain deeper insights into the significance of each component, we conduct ablation experiments and analyze their impact on the TID2013 and CSIQ datasets. The results of the ablation study are presented in Table 3, revealing substantial improvements over the baseline when utilizing either Cross-Dataset Distillation or Multi-tokens Attention Scoring alone across the two datasets. The combination of Cross-Dataset Distillation and Multi-tokens Attention Scoring yields the maximum enhancement over the baseline. Notably, on the TID2013 dataset, PLCC and SRCC increase by 0.046 and 0.059, respectively. In conclusion, the ablation experiments underscore the significant contributions of all components in our method toward accurately characterizing image quality.

4.6 Visualization of Class Activation Map

We utilize Grad-CAM to visualize the feature attention map. Figure 2 reveals that our model comprehensively and accurately directs its attention to the target area, while the baseline [4] often focuses on non-target areas or fails to identify the focal region. Furthermore, the quality score prediction results demonstrate


Fig. 2. Activation maps of baseline [4] and CDMT-IQA using Grad-CAM. Rows 1–3 represent input images, CAMs generated by the baseline [4], and CAMs generated by CDMT-IQA, respectively. The numbers below each row represent the Ground Truth and the predicted scores of the baseline model and our model, respectively.

Table 4. Analysis of the number of class tokens on synthetic datasets. Bold entries indicate the best performance.

Number of tokens   TID2013 PLCC   TID2013 SRCC   CSIQ PLCC   CSIQ SRCC
M = 1              0.922          0.906          0.969       0.960
M = 2              0.928          0.913          0.970       0.960
M = 4              0.932          0.920          0.966       0.955
M = 6              0.942          0.932          0.974       0.964
M = 8              0.927          0.910          0.960       0.951
M = 10             0.928          0.914          0.962       0.947

the superior image quality evaluation capabilities of our model compared to the baseline, with the predicted scores exhibiting closer alignment with the Ground Truth. In short, these phenomena serve as compelling evidence of the effectiveness of our proposed method.

4.7 Analysis of the Number of Class Tokens

We conduct experiments on the impact of different numbers of class tokens on TID2013 and CSIQ, and the results are shown in Table 4. As intuition suggests, using only one class token is not enough. This is easy to explain: we consider multiple class tokens as multiple different reviewers, and increasing the number of reviewers can reduce the randomness of predictions and make the results more representative. Therefore, multiple class tokens are necessary. However, more class tokens are not always better. From Table 4, it can be seen that the performance of the model increases as the number of class tokens increases, reaching its best at M = 6, and then decreasing.

5 Conclusion

In this paper, we propose a novel NR-IQA method named CDMT-IQA, which enhances the representation capabilities for image quality through knowledge distillation across datasets. Specifically, we introduce an additional distillation token to facilitate student learning from the teacher and enable knowledge integration across datasets. Furthermore, we consider the class tokens as abstractions of quality-aware features and simulate multiple reviewers by increasing the number of class tokens. We design an Attention Scoring mechanism to refine the output of each class token, which helps reduce prediction uncertainty. Our experiments conducted on six IQA datasets have demonstrated the superiority of the proposed method.

Acknowledgement. This work was supported by National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002, No. 2022J06001).

References 1. Su, S., et al.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: CVPR, pp. 3667–3676 (2020) 2. Golestaneh, S.A., Dadsetan, S., Kitani, K.M.: No-reference image quality assessment via transformers, relative ranking, and self-consistency. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1220–1230 (2022) 3. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: CVPR, pp. 5148–5157 (2021) 4. Qin, G., et al.: Data-efficient image quality assessment with attention-panel decoder. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (2023) 5. Zheng, H., Yang, H., Fu, J., Zha, Z.-J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: CVPR, pp. 10242– 10251 (2021)


6. Yin, G., Wang, W., Yuan, Z., Han, C., Ji, W., Sun, S., Wang, C.: Content-variant reference image quality assessment via knowledge distillation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 3134–3142 (2022) 7. Kang, L., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for noreference image quality assessment. In: CVPR, pp. 1733–1740 (2014) 8. Bosse, S., Maniry, D., M¨ uller, K.-R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 27(1), 206–219 (2017) 9. Liu, X., van de Weijer, J., Bagdanov, A.D.: RankIQA: learning from rankings for no-reference image quality assessment. In: ICCV (2017) 10. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021) 11. You, J., Korhonen, J.: Transformer for image quality assessment. In: ICIP, pp. 1389–1393. IEEE (2021) 12. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021) 13. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 15(11), 3440–3451 (2006) 14. Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging 19(1), 011006 (2010) 15. Ponomarenko, N., et al.: Image database TID2013: peculiarities, results and perspectives. Signal Process.: Image Commun. 30, 57–77 (2015) 16. Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3. IEEE (2019) 17. Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 25(1), 372–387 (2015) 18. Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686 (2020) 19. Saad, M.A., Bovik, A.C., Charrier, C.: Blind image quality assessment: a natural scene statistics approach in the DCT domain. IEEE Trans. Image Process. 21(8), 3339–3352 (2012) 20. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012) 21. Zhang, L., Zhang, L., Bovik, A.C.: A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 24(8), 2579–2591 (2015) 22. Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 30(1), 36–47 (2018) 23. Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In: CVPR, pp. 3575–3585 (2020) 24. Pan, Z., et al.: DACNN: blind image quality assessment via a distortion-aware convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 32(11), 7518–7531 (2022) 25. Touvron, H., Cord, M., J´egou, H.: Deit III: revenge of the VIT. arXiv preprint arXiv:2204.07118 (2022)

Quality-Aware CLIP for Blind Image Quality Assessment

Wensheng Pan1,2, Zhifu Yang2, DingMing Liu2, Chenxin Fang2, Yan Zhang1,3(B), and Pingyang Dai1,2

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen 361005, People’s Republic of China [email protected] 2 School of Informatics, Xiamen University, Xiamen 361005, People’s Republic of China 3 Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, People’s Republic of China

Abstract. Blind Image Quality Assessment (BIQA) aims to simulate human perception of image quality without reference images. Pretrained visual-linguistic models, like CLIP, have shown excellent performance in various visual tasks and have been successfully applied in BIQA. However, existing CLIP-based approaches typically employ a coarse classification method, dividing images into two or five quality levels based on CLIP’s text-image comparison ability. In this work, we propose a novel approach for BIQA that introduces a fine-grained quality-level stratification strategy. This strategy enables a more precise assessment of image quality across a wider range of levels. Additionally, we present a two-stage training model called Quality-Aware CLIP (QA-CLIP). In the first stage, we leverage a set of learnable text tokens to optimize the text description and fully utilize the representation capabilities of CLIP’s text encoder. In the second stage, we further optimize the image encoder and quality-aware block to capture features that are highly relevant to perceived quality. Experimental results demonstrate that QACLIP achieves comparable performance with state-of-the-art methods on various synthetic and real datasets. Notably, in CSIQ, TID2013, and KADID datasets, QA-CLIP outperforms the state-of-the-art by 1.2%, 4.7%, and 4.8% respectively in terms of Spearman Rank Correlation Coefficient (SRCC).

Keywords: Blind Image Quality Assessment · CLIP · Quality-Aware · Learnable Text Tokens

1 Introduction

The importance of digital images has been increasing in fields such as medical imaging, multimedia communication, and computer vision. Therefore, accurate


evaluation of image quality has become crucial. Image Quality Assessment (IQA) aims to objectively measure image quality by simulating human visual perception. Recently, research has focused on Blind Image Quality Assessment (BIQA) due to the unavailability of reference images. Conventional BIQA approaches rely on hand-engineered features, such as natural scene statistics (NSS), or shallow feature learning using codebooks. In recent years, deep learning-based methods have gained popularity for their ability to extract intricate image perceptual features. These methods employ a feature extractor to extract image features and then predict the image quality score. OpenAI introduced the CLIP model [14], a self-supervised pretraining framework that encodes images and text into a shared vector space without requiring annotated data. CLIP demonstrates strong zero-shot transfer learning capabilities across various tasks. Wang et al. [18] was the first to explore CLIP’s potential in challenging image perception evaluation tasks. They proposed a prompt pairing strategy using antonym prompts (e.g., “Good photo.” and “Bad photo.”) to reduce prompt ambiguity. Building on this work, Zhang et al. [23] proposed multi-task learning to fine-tune CLIP by combining image quality assessment with distortion-type classification and scene classification tasks. They employed a Likert-scale with five quality levels, and the model outputs scores as logitweighted sums of the quality levels. However, this multi-task training approach incurs additional annotation costs, and the coarse division of quality levels presents performance bottlenecks. To address this issue, we propose a fine-grained quality-level stratification strategy that expands the quality levels to 100. Each quality level is mapped to a specific range of score values. For example, images with scores in the range of [49,50) are categorized as level [50], corresponding to class [50]. By leveraging the CLIP model and training on text-image pairs, our aim is to develop a feature representation model that is highly correlated with image quality. In existing methods based on Vision Transformer (ViT) for BIQA, the CLS token is typically considered to contain aggregated perceptual information for image quality. However, directly utilizing the CLS token for optimal representation is challenging. Qin et al. [13] proposed a Quality-Aware Decoder to further explain the CLS token and extract features that closely align with image quality. Building on the work of Qin, we propose a Quality-Aware Block (QAB) that consists of a Quality-Aware Decoder and a Feature Fusion Module. The Feature Fusion Module combines features from multiple perspectives to reduce uncertainty derived from single perspectives. Additionally, we present a two-stage training model called Quality-Aware CLIP (QA-CLIP). In the first stage, inspired by Li et al. [7], we introduce a set of learnable text tokens to describe quality levels. Only the learnable tokens are optimized, while other module parameters are frozen. This allows us to fully leverage the representation ability of the CLIP text encoder and learn the optimal text description. In the second stage, we freeze the learnable tokens and the text encoder and update the parameters of the image encoder and Quality-


Aware Block. This enables us to learn features that are highly relevant to image perception and quality. In summary, our contributions are as follows: 1) We propose a fine-grained quality-level stratification strategy for BIQA, enabling the learning of features that are more closely related to image quality. 2) We present a two-stage training model called QA-CLIP. In our model, a set of learnable text tokens is incorporated to leverage the representation ability of the text encoder. Inspired by Qin et al. [13], we propose the QAB module to evaluate image quality from multiple perspectives and extract deep features purely related to quality levels. 3) Experimental results demonstrate that QA-CLIP outperforms state-of-the-art methods on both synthetic and natural datasets, showcasing the significant potential of CLIP-based models in the field of BIQA.

2 Related Work

2.1 BIQA

Conventional BIQA methods traditionally relied on hand-engineered features like natural scene statistics (NSS) or shallow feature learning through codebooks. However, in recent years, there has been a surge in the popularity of deep learning-based approaches that excel at extracting complex image perceptual features. These methods typically employ a feature extractor to capture image features, which are then used to predict the image quality score. For instance, Su et al. [16] developed a CNN-based IQA model that mimics the top-down perception mechanism of the Human Visual System (HVS) by splitting the IQA task into two sequential subtasks based on the characterization of image semantics. Deep learning-based approaches have demonstrated their effectiveness in capturing and leveraging intricate image features, surpassing the performance of traditional methods in BIQA tasks.

2.2 CLIP Applications

CLIP is a pre-training method that leverages contrastive learning of text-image pairs and has shown remarkable transfer learning capabilities across various visual tasks. Initially, Radford et al. [14] collected 400 million text-image pairs from the internet to pre-train the CLIP model using contrastive learning. Building upon CLIP’s success, Zhou et al. [24] introduced the CoOp model, which further improved transfer learning by incorporating prompt tuning, and replacing fixed prompts with dynamically adjustable ones. Shortly after its inception, CLIP was also applied to semantic segmentation and object detection tasks. The work closest to ours is Wang et al. [18] and Zhang et al. [23]. Wang et al. [18]demonstrated that CLIP can be directly applied to visual perception assessment without task-specific fine-tuning. They employed a prompt pairing strategy, utilizing antonym prompts in pairs (e.g., “Good photo.” and “Bad photo.”) to guide the evaluation. Zhang et al. [23] proposed fine-tuning the CLIP


Fig. 1. The overview of QA-CLIP. In the first stage, only a set of learnable tokens are optimized, while other module parameters are frozen. In the second stage, we optimize the image encoder and Quality-Aware Block.

model using multi-task learning, incorporating the image quality assessment task with distortion-type classification and scene classification tasks. They employed a Likert-scale with five quality levels, and the model outputs scores as logitweighted sums of the quality levels. In contrast to the aforementioned works, we propose a fine-grained quality-level stratification strategy that expands the quality levels to 100. Moreover, we employ a two-stage training strategy that leverages the representation ability of the text encoder.

3 Method

3.1 Overview of CLIP

We provide a brief introduction to vision-language pre-training, focusing on the CLIP model. The CLIP model comprises a text encoder T : T → RK and an image encoder V : RN → RK . The text encoder T generates feature representations of natural language text descriptions. On the other hand, the image encoder V extracts deep semantic features from images using architectures like ResNet-50 or ViT. To encode text descriptions, each word in a given description (e.g., “A photo of a [class]”) is first converted into a lowercase byte pair encoding (BPE) representation. This representation assigns a unique numeric ID from a vocabulary of length 49152. To ensure computational efficiency, each text sequence has a fixed length of 77 and includes start [SOS] and end [EOS] tokens. After passing through the transformer layers, the [EOS] token serves as


the final feature representation of the encoded text. It is then layer normalized and projected into the multimodal embedding space. To be specific, let D = {x_1, x_2, ..., x_B} represent a minibatch of images. V extracts the feature representation V_i for image x_i, and T is employed to generate the feature representation T_i for the corresponding textual description (e.g., "A photo of a dog"). T_i and V_i are then mapped to the same multimodal embedding space using projection layers, and the similarity score between the text feature T_i and the image feature V_i is calculated. This can be expressed as:

Sim(T_i, V_i) = G(T_i) · G′(V_i),   (1)

where G and G′ are linear layers projecting the embeddings into a cross-modal embedding space. The image-to-text contrastive loss L_{v2t} is calculated as:

L_{v2t}(i) = −log [ exp(Sim(T_i, V_i)) / Σ_{j=1}^{B} exp(Sim(T_j, V_i)) ],   (2)

and the text-to-image contrastive loss L_{t2v}:

L_{t2v}(i) = −log [ exp(Sim(T_i, V_i)) / Σ_{j=1}^{B} exp(Sim(T_i, V_j)) ].   (3)

Preliminaries. Given an image xi ∈ RN that may undergo several stages of degradations, the goal of a BIQA model qˆ: RN → R is to predict the perceptual quality of xi , close to its Mean Opinion Score (MOS), denoted as qi ∈ R. Stage 1. QA-CLIP is illustrated in Fig. 1. LIQE [23] employed a Likert-scale with five quality levels, and the model outputs scores as logit-weighted sums of the quality levels. However, for BIQA, dividing the quality of images into five levels corresponding to the scores lacks the desired precision for more nuanced applications. Differently, we propose a fine-grained quality-level stratification strategy that expands the quality levels to 100. Each quality level is mapped to a specific range of score values. For example, images with scores in the range of [49,50) are categorized as level [50], corresponding to class [50]. As a result, each image xi is assigned a quality level label Ii and a MOS label qi . The final predicted score of the model for an image can be represented as: qˆ (x) =

C 

P (c|x) × c

(4)

c=1

where P (c|x) represents the class probabilities after applying softmax, and C represents the total number of classes. However, it is difficult to exploit CLIP in our approach where the classes are numbers instead of specific text.

Quality-Aware CLIP for Blind Image Quality Assessment

401

To address this issue, we introduce a set of learnable text tokens for each quality level. Specifically, we design the textual input for the text encoder as follows: “A photo with a quality score of [X]1 [X]2 [X]3 ...[X]M ”. M represents the number of learnable tokens. In the first stage of training, we only train the learnable text tokens while freezing the parameters of the remaining modules. The loss function used in the training process is similar to the CLIP model, but there may be multiple images belonging to the same score interval in a minibatch, which means Ti may have multiple positive samples. Therefore, the loss is modified as follows: 1  exp (Sim (Ti , Va )) log B Lt2v (i) = − (5) |A| j=1 exp (Sim (Ti , Vj )) a∈A

where A represents the set of all images within a minibatch that share the same class as image xi . The overall loss function is represented as: LStage1 = Lt2v + Lv2t

(6)

Stage 2. In the previous training stage, QA-CLIP learned text descriptions that correspond to quality levels. In this subsequent stage, we keep the text descriptions input to the text encoder and the parameters of the text encoder fixed and focus on optimizing the image encoder and Quality-Aware Block. To enhance the overall performance, we adopt a pairwise learning-to-rank model estimation approach for BIQA, following the methodology introduced in LIQE. Specifically, we transform the training set, which includes MOS qx , into another set where the input consists of pairs of images (x1 , x2 ), and the target p (x1 , x2 ) is a binary label.  1 if qx1 ≥ qx2 p (x1 , x2 ) = (7) 0 otherwise Following Thurstone’s Case V model, we make the assumption that the true perceptual quality of the image follows a Gaussian distribution with mean estimated by qˆ. Based on this assumption, we can express the probability that the quality of x1 is higher than x2 as:



pˆ (x1 , x2 ) = Φ

qˆ (x1 ) − qˆ (x2 ) √ 2



(8)

Φ represents the standard normal cumulative distribution function with a variance of 1. In order to minimize the difference between two discrete probability distributions, we employ the fidelity loss [17].   p − (1 − p) (1 − pˆ) (9) Lf id (p, pˆ) = 1 − pˆ Additionally, we utilize the Smooth L1 Loss Lsmooth and Cross-entropy loss with label smoothing Lce for the purpose of optimization. Lce is denoted as: Lce =

C  k=1

−Ik log Pk

(10)

402

W. Pan et al.

where Ik = (1 − ε) Ik +ε/C denotes value in the quality level target distribution, and Pk represents prediction logits of class k. In summary, the loss used in our second training stage is summarized as: LStage2 = Lf id + αLsmooth + βLce

(11)

where α and β are the coefficients balancing Lsmooth and Lce . Quality-Aware Block. Inspired by the Attention-Panel Mechanism proposed in [13], we propose a Quality-Aware Block to generate multi-view subjective opinions for images, as shown in Fig. 1. Differently, we utilize a feature fusion module to obtain an average feature embedding from multiple feature embeddings output by the transformer decoder. Specifically, the image encoder generates feature representations for image xi , denoted as Zi = {Fcls ; F1 , F2 , F3 , ..., FP } ∈ R(P +1)×d , where P is the number of patches. Subsequently, we then slice Zi to obtain Z  = {F1 , F2 , F3 , ..., FP } ∈ RP ×d . Let L be the number of views, and we create attention-panel embeddings J = {J1 , ..., JL } ∈ RL×d . Additionally, we expand the CLS token L times to obtain T = {Fcls , ..., Fcls } ∈ RL×d . We use (J + T ) and Z  as inputs to the transformer decoder to obtain the multiple feature embeddings output S = {S1 , S2 , S3 , ..., SL } ∈ RL×d . S is then fed into the feature fusion module to obtain the final integrated multi-view image feature Vi . Specifically, we perform averaging on L features denoted as: 1 Sj L j=1 L

Vi =

4 4.1

(12)

Experiments Datasets and Evaluation Protocols

We conduct a comprehensive evaluation of our proposed model using six widely recognized image quality evaluation datasets. These datasets consist of four authentic datasets and two synthetic datasets. The authentic datasets we use are LIVE [15], CSIQ [6], TID2013 [12], and KADID [8]. The synthetic datasets are LIVEC [2] and KonIQ [4]. To assess the performance of our model, we employ Pearson’s Linear Correlation Coefficient (PLCC) and Spearman’s Rank order Correlation Coefficient (SRCC) as evaluation metrics. PLCC measures the accuracy of our model’s predictions, while SRCC evaluates the monotonicity of the BIQA algorithm predictions. Both metrics range from 0 to 1, where higher values indicate better performance in terms of prediction accuracy and monotonicity.

Quality-Aware CLIP for Blind Image Quality Assessment

4.2

403

Implementation Details

Table 1. Performance comparison is measured by averages of PLCC and SRCC, where bold entries indicate the best results. Dataset

LIVE CSIQ TID2013 KADID LIVEC KonIQ

Criterion

PLCC

BRISQUE [10] ILNIQE [21] MEON [9] DBCNN [22] TIQA [20] MetaIQA [25] P2P-BM [19] HyperIQA [16] TReS [3] MUSIQ [5] DACNN [11] DEIQT [13] LIQE [23]

0.944 0.906 0.955 0.971 0.965 0.959 0.958 0.966 0.968 0.911 0.980 0.982 0.951

0.748 0.865 0.864 0.959 0.838 0.908 0.902 0.942 0.942 0.893 0.957 0.963 0.939

ours

0.978

Criterion

SRCC

BRISQUE [10] ILNIQE [21] MEON [9] DBCNN [22] TIQA [20] MetaIQA [25] P2P-BM [19] HyperIQA [16] TReS [3] MUSIQ [5] DACNN [11] DEIQT [13] LIQE [23] ours

0.571 0.648 0.824 0.865 0.858 0.868 0.856 0.858 0.883 0.815 0.889 0.908 –

0.567 0.558 0.691 0.856 0.855 0.775 0.849 0.845 0.858 0.872 0.905 0.887 0.931

0.629 0.508 0.710 0.869 0.861 0.802 0.842 0.882 0.877 0.746 0.884 0.894 0.910

0.685 0.537 0.628 0.884 0.903 0.856 0.885 0.917 0.928 0.928 0.912 0.934 0.908

0.970 0.944

0.981

0.898

0.873

0.929 0.902 0.951 0.968 0.949 0.960 0.959 0.962 0.969 0.940 0.978 0.980 0.970

0.812 0.822 0.852 0.946 0.825 0.899 0.899 0.923 0.922 0.871 0.943 0.946 0.936

0.528 0.534 0.604 0.851 0.850 0.762 0.840 0.852 0.859 0.875 0.905 0.889 0.930

0.629 0.508 0.697 0.851 0.845 0.835 0.844 0.859 0.846 0.702 0.866 0.875 0.904

0.681 0.523 0.611 0.875 0.892 0.887 0.872 0.906 0.915 0.916 0.901 0.921 0.919

0.978

0.958 0.939

0.978

0.869

0.873

0.626 0.521 0.808 0.816 0.846 0.856 0.862 0.840 0.863 0.773 0.871 0.892 –

Model. The backbone of our image and text feature extractor are the image encoder V and the text encoder T from CLIP, respectively. We specifically employ the ViT-B/16 architecture for the image encoder, which comprises 12

404

W. Pan et al.

transformer layers with a hidden size of 768 dimensions. To align with the output of T , the dimension of the image feature vector is reduced from 768 to 512 using a linear layer. In our Quality-Aware Block, we employ a single-layer transformer decoder. Training Details. In the first training stage, we employ the Adam optimizer with an initial learning rate of 3×10−5 , which is decayed using a cosine schedule. The training process includes a warm-up period of 5 epochs, during which random cropping is applied as data augmentation. Each original image is randomly cropped into 8 images of size 224 × 224. In this stage, only the learnable tokens [X]1 [X]2 [X]3 ...[X]M are optimized. In the second training stage, we also use the Adam optimizer, but with a warm-up period of 10 epochs. The learning rate is linearly increased from 9.5 × 10−7 to 5 × 10−6 . At the 30th and 50th epochs, the learning rate is decreased by a factor of 0.1. Random horizontal flipping and random cropping are applied as data augmentation during this stage. And we set the coefficient α to 0.001 and β to 0.1 for LStage2 . Both training stages consist of 60 epochs. For the LIVE and CSIQ datasets, the batch size is set to 32, and 64 for other datasets. We split the data into 80% for training and 20% for testing. To mitigate performance bias, we repeat each experiment 10 times and calculate the average PLCC and SRCC. 4.3

4.3 Overall Prediction Performance Comparison

Table 1 reports the comparison between the proposed QA-CLIP and 13 state-of-the-art BIQA methods, including hand-crafted methods such as ILNIQE [21] and BRISQUE [10] as well as deep learning-based methods such as MUSIQ [5], MetaIQA [25], DEIQT [13], and the CLIP-based LIQE [23]. Across all datasets, QA-CLIP demonstrates competitive performance compared to the state of the art. Notably, on the CSIQ, TID2013, and KADID datasets, QA-CLIP achieves state-of-the-art performance, surpassing existing methods by 0.7%, 3.6%, and 5.0% in PLCC and by 1.2%, 4.7%, and 4.8% in SRCC, respectively. These results highlight the effectiveness and leading performance of QA-CLIP in image quality assessment.

4.4 Generalization Capability Validation

We assess the generalization performance of QA-CLIP using cross-dataset validation. Specifically, we train a BIQA model on one dataset and directly evaluate it on another without fine-tuning or adjusting parameters. The comparisons with DBCNN, P2P-BM, HyperIQA, TReS, and DEIQT are shown in Table 2. We utilize four datasets and report the median experimental results. QA-CLIP exhibits superior performance compared to other methods on the KonIQ and CSIQ datasets and remains competitive on the remaining datasets. These findings highlight the robust generalization performance of QA-CLIP.


Table 2. SRCC on cross-dataset validation (best results in boldface in the original).

Training    LIVEC  KonIQ  LIVE   CSIQ
Testing     KonIQ  LIVEC  CSIQ   LIVE
DBCNN       0.754  0.755  0.758  0.877
P2P-BM      0.740  0.770  0.712  –
HyperIQA    0.772  0.785  0.744  0.926
TReS        0.733  0.786  0.761  –
DEIQT       0.744  0.794  0.781  0.932
Ours        0.800  0.772  0.789  0.925

4.5 Ablation Study

Necessity of Learnable Tokens. Our textual description is "A photo with a quality score represented by [X]1[X]2[X]3...[X]M". Without learnable tokens, the representation capability of the text encoder cannot be fully exploited. To validate their necessity, we remove them; Table 3 shows a significant performance decline compared to QA-CLIP.

Necessity of Quality-Aware Block. To validate the effectiveness of the Quality-Aware Block (QAB), we replace the QAB with other structures: (1) nothing (no additional structure), and (2) the Adaptor of [1], which consists of two linear layers each followed by a ReLU activation. The experimental results in Table 3 show that incorporating QAB yields significantly better results than the other configurations.

Table 3. Ablation study on each component of QA-CLIP.

Variants                 TID2013           KADID
                         PLCC    SRCC      PLCC    SRCC
w/o Learnable Tokens     0.743   0.787     0.712   0.759
Nothing instead of QAB   0.916   0.905     0.974   0.972
Adaptor instead of QAB   0.936   0.926     0.976   0.975
QA-CLIP                  0.944   0.939     0.981   0.978

Effect of Each Loss. Table 4 illustrates the impact of each loss function on the performance of QA-CLIP on the TID2013 dataset. Without the fidelity loss, PLCC and SRCC decrease by 1.1% and 1.4%, respectively. Without the smooth L1 loss, they drop by 0.9% and 1.6%, and omitting the cross-entropy loss with label smoothing leads to decreases of 0.5% and 0.9%.

Table 4. Ablation study on each loss on the TID2013 dataset (a check mark denotes that the loss is used).

Lfid  Lsmooth  Lce   PLCC   SRCC
 ✓      ✓            0.939  0.930
 ✓              ✓    0.935  0.923
        ✓       ✓    0.933  0.925
 ✓      ✓       ✓    0.944  0.939
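For illustration, a hedged sketch of a Stage-2 objective combining the three terms of Table 4 is given below. The pairing of α = 0.001 and β = 0.1 with the individual terms, the Gaussian-CDF form of the pairwise fidelity loss, and the function names are assumptions made for this sketch, not the authors' exact formulation.

```python
# Hedged sketch: fidelity + smooth L1 + label-smoothed cross-entropy.
import torch
import torch.nn.functional as F

def fidelity_loss(pred_scores, gt_scores, eps=1e-8):
    # pairwise fidelity loss (Tsai et al. [17]) over all image pairs in a batch
    d_pred = pred_scores.unsqueeze(0) - pred_scores.unsqueeze(1)
    d_gt = gt_scores.unsqueeze(0) - gt_scores.unsqueeze(1)
    p_pred = 0.5 * (1 + torch.erf(d_pred / 2 ** 0.5))   # predicted P(i better than j)
    p_gt = (d_gt > 0).float()                            # ground-truth ranking
    fid = 1 - torch.sqrt(p_pred * p_gt + eps) \
            - torch.sqrt((1 - p_pred) * (1 - p_gt) + eps)
    mask = ~torch.eye(len(pred_scores), dtype=torch.bool)  # drop self-pairs
    return fid[mask].mean()

def stage2_loss(pred_scores, level_logits, gt_scores, gt_levels,
                alpha=0.001, beta=0.1):
    l_fid = fidelity_loss(pred_scores, gt_scores)
    l_smooth = F.smooth_l1_loss(pred_scores, gt_scores)
    l_ce = F.cross_entropy(level_logits, gt_levels, label_smoothing=0.1)
    return l_fid + alpha * l_smooth + beta * l_ce        # assumed weighting
```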

Number of Learnable Tokens M. We analyze the parameter M in Table 5. Interestingly, increasing the number of tokens to 6 or 8 does not result in significant performance improvements; instead, there is only a slight increase in performance on LIVE. Ultimately, we choose M = 4 as it yields the largest performance improvement across different settings.

Table 5. Ablation study on the number of learnable tokens.

Number  LIVE            CSIQ
        PLCC   SRCC     PLCC   SRCC
2       0.974  0.972    0.959  0.945
4       0.978  0.978    0.970  0.958
6       0.976  0.976    0.962  0.950
8       0.972  0.973    0.970  0.962

5 Conclusion

To better leverage CLIP's cross-modal representation abilities, we propose Quality-Aware CLIP with two training stages. In contrast to previous approaches that coarsely divide BIQA quality levels into two or five categories, we introduce a fine-grained quality-level stratification strategy that expands the number of quality levels to 100. In the first stage, we utilize learnable text tokens to optimize the text descriptions and fully exploit the representation capabilities of CLIP's text encoder; this stage focuses on refining the textual representations associated with quality levels. In the second stage, we optimize the image encoder and quality-aware block to capture image features that are highly relevant to perceived quality. Experiments on various synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed model.


Acknowledgement. This work was supported by National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002, No. 2022J06001).

References 1. Gao, P., et al.: CLIP-adapter: better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021) 2. Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 25(1), 372–387 (2015) 3. Golestaneh, S.A., Dadsetan, S., Kitani, K.M.: No-reference image quality assessment via transformers, relative ranking, and self-consistency. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1220–1230 (2022) 4. Hosu, V., Lin, H., Sziranyi, T., Saupe, D.: KonIQ-10k: an ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 29, 4041–4056 (2020) 5. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021) 6. Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging 19(1), 011006 (2010) 7. Li, S., Sun, L., Li, Q.: CLIP-ReID: exploiting vision-language model for image reidentification without concrete text labels. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1405–1413 (2023) 8. Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3. IEEE (2019) 9. Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (2017) 10. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012) 11. Pan, Z., et al.: DACNN: blind image quality assessment via a distortion-aware convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 32(11), 7518–7531 (2022) 12. Ponomarenko, N., et al.: Image database TID2013: peculiarities, results and perspectives. Sig. Process. Image Commun. 30, 57–77 (2015) 13. Qin, G., et al.: Data-efficient image quality assessment with attention-panel decoder. arXiv preprint arXiv:2304.04952 (2023) 14. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021) 15. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 15(11), 3440–3451 (2006)


16. Su, S., et al.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3667–3676 (2020) 17. Tsai, M.F., Liu, T.Y., Qin, T., Chen, H.H., Ma, W.Y.: FRank: a ranking method with fidelity loss. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 383–390 (2007) 18. Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. arXiv preprint arXiv:2207.12396 (2022) 19. Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3585 (2020) 20. You, J., Korhonen, J.: Transformer for image quality assessment. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 1389–1393. IEEE (2021) 21. Zhang, L., Zhang, L., Bovik, A.C.: A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 24(8), 2579–2591 (2015) 22. Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 30(1), 36–47 (2018) 23. Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14071–14081 (2023) 24. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vision 130(9), 2337–2348 (2022) 25. Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: deep meta-learning for noreference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14143–14152 (2020)

Multi-agent Perception via Co-attentive Communication Mechanism

Ning Gong1, Zhi Li1, Shaohui Li1, Yuxin Ke1, Zhizhuo Jiang1, Yaowen Li1, and Yu Liu2(B)

1 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
{gn21,keyx22}@mails.tsinghua.edu.cn, {zhilizl,lishaohui,jiangzhizhuo,liyw23}@sz.tsinghua.edu.cn
2 Department of Electronics, Tsinghua University, Beijing, China
[email protected]

Abstract. Multi-agent collaborative perception has the potential to significantly enhance perception performance by facilitating the exchange of complementary information among agents through communication. Effective communication plays a crucial role in enabling agents to collaborate and exchange valuable information; however, traditional methods face challenges in communication scheduling. To address these issues, we propose the Co-Attentive Multi-agent Perception (CAMP) model, a learning-based framework for multi-agent collaborative perception. In CAMP, we propose a novel co-attentive scheduler that constructs communication among agents from both pixel-level and feature-level perspectives. We conduct extensive experiments on a public dataset, and the experimental results clearly illustrate the effectiveness of the proposed CAMP in the multi-agent perception task while using the least amount of bandwidth resources.

Keywords: Multi-Agent System · Collaborative Perception · Communication Mechanism · Transformer

1 Introduction

In the past decade, deep learning methods have achieved significant success in single-agent perception tasks [1]. However, in real-world scenarios such as autonomous driving and unmanned aerial vehicles, systems often comprise multiple autonomous, interacting agents [3]. Given the complexity of these environments, traditional perception approaches designed for single agents are limited in capturing a comprehensive view of the surroundings, because each agent typically has limited sensing capabilities and a restricted local perspective. Recognizing this constraint, multi-agent collaborative perception [11] has recently emerged as an intriguing problem to be addressed.


Indeed, multi-agent collaborative perception aims to overcome the limitations of single-agent approaches by enabling agents to cooperate and share information, so the communication between agents plays a pivotal role. Previous studies have predominantly modeled multi-agent communication with fully-connected graphs, where agents share features or messages via broadcast communication [2,4]. In real-world applications, however, limited bandwidth renders fully-connected communication impractical, and an increase in the number of agents causes an unacceptable surge in computing resources. To address these challenges, recent studies learn dynamic communication connections between agents [8,11,12], primarily using feature maps derived from multi-agent perceptions to construct a communication graph. Although they have made significant progress in addressing bandwidth limitations, the feature and pixel semantics within communication scheduling remain largely unexplored.

Unfortunately, achieving effective communication scheduling in multi-agent perception is challenging. On the one hand, learning an effective communication mechanism under bandwidth constraints remains an open issue: it is important but difficult to select which agents should communicate with each other and to construct a communication graph that optimizes the exchange of information within limited resources. On the other hand, once the communication targets have been determined, fusing the perceptual information from different agents presents another significant challenge; in particular, incorporating both the feature-level and pixel-level semantics of multi-agent perceptions to achieve a comprehensive understanding of the environment is difficult.

To tackle the aforementioned challenges, we propose a Co-Attentive Multi-agent Perception (CAMP) model that learns a communication scheduler from both feature-level and pixel-level semantics. CAMP consists of four modules: a feature encoder, a co-attentive scheduler, a perceptive fusion module, and a downstream task decoder. We first develop a feature encoder to extract multi-agent perceptive features from the observations. We then propose a novel co-attentive scheduler to determine how the agents construct communication from both pixel-level and feature-level perspectives. Next, a perceptive fusion module generates fused perceptive representations after communication. Finally, a downstream task decoder outputs the prediction for the given task. We conduct extensive experiments on a public dataset to evaluate the effectiveness of CAMP. Experimental results illustrate that CAMP outperforms all compared methods while using the least amount of bandwidth resources.

2 Related Works

2.1 Multi-agent Communication

Multi-agent communication refers to the process in which agents exchange information to coordinate their actions and share knowledge within a multi-agent system. Over the past few years, researchers have developed various methods and protocols for multi-agent communication to enhance the performance and efficiency of multi-agent systems. Early studies [10,15,17] mainly establish communication rules through predefined communication protocols. For example, CommNet [16] utilizes continuous communication to accomplish fully cooperative tasks and learns communication protocols among a dynamically changing number of agents; VAIN [7] introduces attention mechanisms to determine which agents communicate to share information; DIAL [4] uses reinforcement learning to derive the individual Q-value [20] of each agent based on its observation and the messages from other agents. Most of these works rely on rule-based methods or reinforcement learning with fully connected communication, which exhibits significant drawbacks when agents operate in dynamic environments under bandwidth limitations. In response, subsequent research has introduced distributed communication methods that establish dynamic connections among agents. For example, Who2com [12] employs a cross-attention module to identify specific communication targets, thereby reducing bandwidth usage; When2com [11] develops a self-attention mechanism to determine whether communication with other agents is necessary; Airpooling [13] adopts a pseudo-centralized approach in which one agent serves as a server, effectively reducing communication bandwidth; and CRCN [14] incorporates channel attention to extract matching relationships between agents and establish communication connections. In this work, we focus on the critical aspect of establishing communication relationships, specifically generating confidence scores between different agents, and we propose a novel co-attentive scheduler that determines how agents communicate from both feature-level and pixel-level perspectives.

2.2 Information Fusion in Multi-agent Perception

Information fusion refers to the integration of data and knowledge from multiple information sources to provide more comprehensive and accurate information [19]. In multi-agent perception tasks, an agent needs to obtain a comprehensive understanding of its surroundings by integrating information from other agents; how to fuse the information from diverse agents is therefore another pivotal concern. A currently effective approach is weighted fusion using attention mechanisms. For instance, ATOC [9] applies an attention unit to determine what to broadcast and how to fuse the features. Other works utilize graph neural networks (GNNs [21]) for information fusion: V2VNet [5] introduces a multi-round message-passing approach based on graph neural networks to enhance perception and prediction performance, and Where2comm [8] uses spatial confidence to extract spatial features from point cloud data for fusion. In this paper, we utilize an attention-based perceptive fusion module to fuse the perceptive features received from the communicating agents.

Fig. 1. Illustration of the proposed Co-Attentive Multi-agent Perception (CAMP) framework. CAMP employs a co-attentive scheduler to achieve efficient and robust communication among the agents. All agents share identical feature encoders, pixel-level projections, feature-level projections, and downstream task decoders.

3 Methods

This section introduces the Co-Attentive Multi-agent Perception (CAMP) model, a multi-agent collaborative perception framework that leverages co-attention for improved communication of noisy observations. As shown in Fig. 1, each agent in the proposed framework adopts several modules, i.e., a feature encoder, a co-attentive scheduler, a perceptive fusion module, and a downstream task decoder. Among these modules, the co-attentive scheduler determines how the agents communicate with each other. Specifically, we employ pixel-level and feature-level attention to precisely predict the confidence in transmitting other agents' messages. In the remaining parts of this section, we sequentially formulate the communication problem, overview the proposed framework, and elaborate on the co-attentive scheduler.

3.1 Problem Formulation

We denote the number of agents in a multi-agent perception system by N and their observations by X = {x_i}_{i=1,...,N}. Each agent is expected to finish a task with the perceptive supervision Y = {y_i}_{i=1,...,N}. However, the observations of several agents may be noisy due to unstable environmental conditions and perceptive devices, and these noisy observations may degrade the performance of the downstream tasks. Therefore, we wish to schedule the transmission of the observations and achieve optimal performance of the multi-agent system. The optimization of our framework corresponds to Eq. (1):

$\{\theta, \mathcal{P}\}^{*} = \arg\max_{\theta,\mathcal{P}} \sum_{i=1}^{N} g\big(\phi_{\theta}\big(x_i, \{P_{i\to j}\}_{j=1}^{N}\big),\, y_i\big),$   (1)

where g(·, ·) is the metric evaluating performance, φ_θ(·) is the perception network (i.e., the feature encoder, perceptive fusion, and downstream task decoder in this work) with trainable parameters θ, and P_{i→j} denotes the message transmitted from the i-th agent to the j-th agent. In this paper, we specify P_{i→j} with confidence scores {w_{i,j}}_{i≠j} and utilize the co-attentive scheduler to obtain them.

3.2 Overview of the Proposed Framework

Existing collaborative perception approaches can be broadly categorized into three types: raw-measurement-based early collaboration, output-based late collaboration, and feature-based intermediate collaboration. We focus on feature-based intermediate collaboration because of its balance between performance and bandwidth utilization. The framework of the proposed method is presented in the bottom half of Fig. 1, and its components are introduced as follows.

Feature Encoder. The feature encoder extracts perceptive features from the observations. Note that CAMP can be combined with diverse inputs, such as RGB images and point clouds, to tackle various tasks. This work adopts a pre-trained ResNet-18 [6] model as the backbone of the feature encoder; the model is pre-trained on ImageNet and truncated to fit semantic segmentation. Denoting the input of the i-th agent as x_i, we have

$z_i = f_{enc}(x_i; \theta_e),$   (2)

where z_i represents the perceptive feature and f_enc(·; θ_e) is the encoder parameterized by θ_e. The extracted perceptive features are transmitted according to the communication graph obtained via the co-attentive scheduler.

Co-attentive Scheduler. The co-attentive scheduler determines how the agents communicate. We establish the communication topology with a graph G = (V, E), where the vertices V represent agents and the edges E denote the communication between two agents, and we assign the estimated confidence score w_{i,j} as the weight of each edge. Here, we adopt the original observation x_i and the perceptive feature z_i as the inputs. We extract two groups of queries and keys (q_i, k_i) and (q'_i, k'_i) from z_i and x_i with a feature-level attention f_feat(·; θ_f) and a pixel-level attention f_pix(·; θ_p), respectively. The queries and keys are further weighted to obtain the confidence scores w_i = {w_{i,j}}, which are utilized in the subsequent perceptive fusion. The details of the co-attentive scheduler are elaborated in Sect. 3.3.

Perceptive Fusion. We employ a weighted sum to fuse the received perceptive features {z_j} (j ≠ i). Without loss of generality, we adopt a zero tensor z_j = 0 for any dropped feature to keep a consistent formulation of the fusion module. The fused feature h_i for the i-th agent is then obtained as

$h_i = \mathrm{concat}\Big(z_i,\, \sum_{j\neq i} w_{i,j}\, z_j\Big),$   (3)

where concat(a, b) concatenates the two tensors a and b along the channel dimension.

Downstream Task Decoder. The downstream task decoder outputs the prediction for the given task; for example, in the context of semantic segmentation it outputs the predicted pixel-level segmentation. Given the fused feature h_i, the decoder f_dec(·) outputs the result ŷ_i as

$\hat{y}_i = f_{dec}(h_i; \theta_d),$   (4)

where θ_d denotes the learned parameters of the decoder. Note that {ŷ_i} is the final result produced by the agents in the collaborative perception system, which can be regarded as the $\phi_{\theta}\big(x_i, \{P_{i\to j}\}_{j=1}^{N}\big)$ term in the problem formulation in Eq. (1).

Loss Function and Training Strategy. To optimize the whole framework, we use the labels of the downstream tasks as supervision and take the objective in Eq. (5) for back-propagation and weight updates:

$\{\theta_e, \theta_f, \theta_p, \theta_d\}^{*} = \arg\max_{\theta_e,\theta_f,\theta_p,\theta_d} \sum_{i=1}^{N} g(\hat{y}_i,\, y_i),$   (5)

where θ_f and θ_p are the parameters of the co-attentive scheduler, which are elaborated in Sect. 3.3. We update the weights of our model using the above objective in an end-to-end fashion. For semantic segmentation, g(·, ·) is the mean intersection over union (MIoU). Please note that we do not use the communication labels included in the dataset during training; they are only used as an indicator to evaluate communication accuracy during testing.

3.3 Co-attentive Scheduler

We illustrate the proposed co-attentive scheduler in the top half of Fig. 1. The scheduler determines the communications between the agents, that is, whether two agents are expected to transmit messages. Here, we specify the perceptive features {z_i}_{i=1,···,N} as the transmitted messages. To achieve efficient communication of noisy observations, we adopt a pixel-level attention module f_pix(·; θ_p) and a latent-level attention module f_feat(·; θ_f). The co-attentions further contribute to the confidence scores {w_{i,j}}_{i≠j} that measure the importance of transmitting the perceptive feature of the j-th agent to the i-th agent. These components are introduced in detail as follows.

Feature-Level Attention. The feature attention module takes the perceptive feature z_i as input and outputs feature-level scores w_i^feat. This module projects z_i into a query q_i ∈ R^{1×M} and a key k_i ∈ R^{1×M}:

$q_i,\, k_i = f_{feat}(z_i; \theta_f),$   (6)

where f_feat(·; θ_f) is a nonlinear projection parameterized by θ_f. The projection utilizes multiple convolutional layers to downsample the feature map; the downsampled features are then flattened and passed through a fully connected layer to obtain the query and key. Each scheduler gathers the queries {q_j}_{j≠i} and keys {k_j}_{j≠i} from the other agents. The feature-level score w_i^feat is then obtained by

$w_i^{feat} = \mathrm{softmax}\big(Q\, k_i^{T}\big) = \big[w_{i,0}^{feat}, \cdots, w_{i,N-1}^{feat}\big],$   (7)

where Q = concat(q_0, ···, q_{N−1}) is the concatenation of the N queries along the first dimension. Note that w_i^feat contains N elements, and we use N − 1 of them for estimating the final scores.

Pixel-Level Attention. The query generated by feature-level attention may carry a perceptual bias due to pre-training and cannot precisely model the full spatial correlation between two agents. Therefore, we additionally employ pixel-level attention to capture this comprehensive correlation and ensure effective communication between agents. Inspired by the Transformer [18], we decouple the Q, K, and V of the attention module and apply a Transformer-based encoder to obtain a pixel-level query q'_i ∈ R^{1×M} and a pixel-level key k'_i ∈ R^{1×M}. The projection is

$q'_i,\, k'_i = f_{pix}(x_i; \theta_p),$   (8)

where f_pix(·; θ_p) is a projection parameterized by θ_p. The projection first applies several convolutional layers to extract features from the original input image; these features are then downsampled, flattened, and passed through a fully connected layer to obtain a compact vector representation. Subsequently, the Transformer-based encoder generates the pixel-level query and key. The pixel-level score w_i^pix is obtained by gathering the queries {q'_j}_{j≠i} and keys {k'_j}_{j≠i} from the other agents:

$w_i^{pix} = \mathrm{softmax}\big(Q'\, k_i'^{T}\big) = \big[w_{i,0}^{pix}, \cdots, w_{i,N-1}^{pix}\big],$   (9)

where Q' = concat(q'_0, ···, q'_{N−1}) is the concatenation of the N pixel-level queries along the first dimension.

Confidence Score Estimation. Finally, the confidence scores generated by the two modules are combined as

$w_{i,j} = \alpha \cdot w_{i,j}^{pix} + (1-\alpha) \cdot w_{i,j}^{feat}, \quad \forall j \neq i,$   (10)

where w_{i,j} is the ultimate confidence score for transmitting the perceptive feature from the j-th agent to the i-th agent. In addition, we further reduce the bandwidth by removing the connections whose confidence score w_{i,j} is less than a threshold τ. Thus, the communication graph can be sparse and efficient.

4 Experiment

4.1 Dataset and Experiment Settings

Dataset. We use the publicly available dataset AirSim-MAP [11] to evaluate the effectiveness of our model. Airsim-MAP [11] contains multi-view images captured by five unmanned aerial vehicles (UAVs) flying simultaneously in a simulated city environment. In the noisy version of the dataset, Airsim-MAP applies Gaussian blurs with a random kernel size ranging from 1 to 100 [12]. Each view has a probability (≥ 50%) of being corrupted, and each frame contains at least one noisy image in the five views. The noisy version of the dataset comprises 4.2K aerial RGB images of size 512 × 512 and their corresponding 3D bounding boxes. Experiment Settings. We follow the experimental scenario of multi-agent collaborative perception in When2com [11] and set up a Multiple Request Multiple Support (MRMS) case as the main experiment of this work. In this case, we aim to generate precise 2D semantic segmentation masks for each agent based on their observations. The agent responsible for performing the segmentation is referred to as the requesting agent, while the other agents providing supplementary information are referred to as supporting agents. By leveraging the collaboration of all agents, the objective is to achieve accurate and comprehensive 2D semantic segmentation results for all the agents. In the multi-agent system, we introduce noise to half of the frames. We randomly replace one of the remaining agents with the non-degraded frame from the original agent. It is ensured that each degraded requesting agent has a corresponding non-degraded image among its supporting agents.


Baselines and Evaluation Metrics. We evaluate several baseline models, including fully-connected (FC) and distributed communication (DistCom) models. The FC models fuse all agents' observations, either weighted or unweighted, whereas DistCom models selectively fuse a subset of the available observations, focusing on specific agents or information sources.

• CatAll (FC) is a naive FC baseline that concatenates the encoded image features of all agents before the subsequent network stages.
• RandCom (DistCom) is a naive distributed baseline that randomly selects one of the other agents as a supporting agent.
• Who2com [12] (DistCom) is a distributed communication model that excludes the self-attention mechanism.
• OccDeg presents the performance when agents operate independently with degraded data. AllNorm corresponds to the performance when agents operate independently with clean, non-degraded data.

We also consider the communication modules of CommNet [16], TarMac [2], MRGNN [22], When2com [11], and CRCN [14] as baselines for comparison. For a fair comparison, we use ResNet-18 [6] as the feature backbone for our model and all mentioned baselines. We evaluate all models with mean intersection over union (MIoU), overall accuracy (PA), and frequency weighted accuracy (FA). In addition, we report the bandwidth of all FC and DistCom models in megabytes per frame (MBpf); the bandwidth is obtained by summing the sizes of the transmitted queries, keys, and perceptive features.

Table 1. Experimental results on MRMS. Models are evaluated with mean intersection over union (MIoU), overall accuracy (PA), and frequency weighted accuracy (FA); bandwidth is reported as MBytes per frame / average number of links per agent.

Models        Bandwidth     Noisy                  Normal                 Avg
              (MBpf/links)  PA     FA     MIoU     PA     FA     MIoU     PA     FA     MIoU
AllNorm       –             –      –      –        –      –      –        89.10  80.72  57.39
OccDeg        –             65.74  48.31  30.01    88.39  79.43  56.29    82.75  70.61  49.37
CatAll        2.5/5         71.85  55.72  34.91    88.18  78.99  55.07    84.16  72.65  49.87
VAIN          2.5/5         69.72  53.85  34.14    88.85  80.27  57.47    84.14  72.94  51.18
CommNet       2.5/5         70.09  53.72  33.95    88.67  79.93  56.60    84.09  72.70  50.70
TarMac        2.5/5         86.63  76.53  52.90    87.43  77.78  54.31    87.23  77.47  53.82
MRGNN         2.5/5         88.66  79.91  56.95    88.98  80.44  57.49    88.90  80.31  57.36
RandCom       0.5/1         56.15  34.85  17.12    84.61  73.21  47.84    77.60  62.67  40.19
Airp          0.5/1.67      62.77  43.31  25.59    83.93  72.27  47.40    78.72  64.35  41.92
Who2com       0.5/1         69.72  53.85  34.14    88.85  80.27  57.47    84.14  72.94  51.18
When2com      0.38/0.77     87.84  78.15  56.15    88.76  79.99  57.35    88.53  79.62  57.05
CRCN          1.5/3         88.54  79.67  56.95    88.93  80.30  57.72    88.84  80.15  57.53
CAMP (Ours)   0.38/0.76     89.04  80.58  57.43    89.21  80.79  58.02    89.11  80.62  57.88


Fig. 2. Results of comparative experiments: (a) communication study, (b) ablation study, (c) parameter study.

4.2 Quantitative Evaluation

We list the experimental results for the MRMS case in Table 1. Although CatAll and RandCom use communication, they perform even worse than OccDeg (no communication), which indicates that noisy observations can reduce performance on the downstream task. By weighting the received features via attention, VAIN [7] and Who2com [12] obtain more stable fused features and thus improve over OccDeg, CatAll, and RandCom. In addition to the above baselines, we also evaluate several fully-connected communication methods (i.e., CommNet [16], MRGNN [22], and TarMac [2]) in the MRMS scenario. CommNet [16] fuses the information from other agents by pooling, which is inefficient in processing degraded observations, as shown by the comparison with CatAll. TarMac [2] significantly improves the performance over CatAll, but its one-way communication results in large bandwidth usage, which is not feasible for real-world applications. MRGNN [22] employs graph neural networks to improve the communication topology, yet its fully connected communication also leads to inefficient bandwidth usage.

Since fully connected communication suffers from inefficient bandwidth usage, we design the proposed framework based on distributed communication and compare it with other distributed communication models (i.e., Airpooling [13], When2com [11], and CRCN [14]). Airpooling [13] exploits the waveform superposition property of a multi-access channel to implement fast over-the-air averaging of pooled features, but its downstream task performance is sacrificed. When2com [11] uses self-attention to establish an adaptive communication topology and significantly improves both bandwidth utilization and downstream accuracy. CRCN [14] adopts channel attention to reduce the redundancy of the agents' observations. Compared to these distributed communication models, the proposed method achieves state-of-the-art performance in both bandwidth usage and downstream task accuracy. In other words, the proposed co-attentive scheduler proves more efficient than the self-attention used in When2com [11] and the channel attention used in CRCN [14]. Moreover, the proposed method remains stable on noisy observations, which benefits real-world applications.

4.3 Studies on Co-attentive Scheduler

Communication Accuracy Analysis. We compute communication accuracy metrics on the Single Request Multiple Support (SRMS) case of the AirSim-MAP dataset [11] to investigate the source of our model's improvement. SRMS is a scenario in which a single requester sends a request and multiple supporters respond to it. Communication accuracy measures the effectiveness of communication between requesters and supporters. The SRMS case provides ground-truth communication labels for evaluating the efficiency of collaborative perception algorithms; the labels are rankings of communication confidence scores from a single agent's perspective, indicating the necessity of adopting supporters' information. As shown in Fig. 2(a), we report the results of RandCom, Who2com [12], When2com [11], CRCN [14], and the proposed method. The attention modules significantly improve communication accuracy compared to RandCom, and the proposed method surpasses most methods, which demonstrates that the proposed co-attention is effective. The proposed method also achieves superior performance on the downstream task compared to CRCN [14], as presented in Table 1.

Ablation Study on Co-attentive Scheduler. We also conduct ablation studies on the co-attentive scheduler. We evaluate the effectiveness of the proposed co-attention by training four models with different strategies for predicting the confidence scores, i.e., the proposed co-attention, pixel-level only, feature-level only, and random scores. The results are presented in Fig. 2(b). The proposed co-attention achieves the best MIoU on semantic segmentation, indicating its efficiency.

Parameter Study on Fusion Weight α. We traverse the fusion weight α ∈ [0, 1] to balance the pixel-level attention and the feature-level attention. The results in Fig. 2(c) show that α = 0.7 is the optimal fusion weight, suggesting that both pixel-level and feature-level attention are important for constructing efficient communication.

5 Conclusion

In this paper, we proposed CAMP, a learning-based multi-agent collaborative perception framework. Our framework employs a novel co-attentive scheduler, leading to efficient agent communication and stable performance on downstream tasks. Experimental results show that the proposed CAMP achieves state-of-the-art performance and that the proposed co-attentive scheduler is effective.


Acknowledgement. This work was partially supported by grants from the National Key Research and Development Program of China (No. 2021YFA0715202), the Natural Science Foundation of China (No. 62293544 and 62022092), and the China Postdoctoral Science Foundation (No. 2022M720077).

References 1. Chen, S., Liu, B., Feng, C., Vallespi-Gonzalez, C., Wellington, C.: 3D point cloud processing and learning for autonomous driving: impacting map creation, localization, and perception. IEEE Sig. Process. Mag. 38(1), 68–86 (2020) 2. Das, A., et al.: TarMAC: targeted multi-agent communication. In: International Conference on Machine Learning, pp. 1538–1546. PMLR (2019) 3. Ferber, J.: Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence, 1st edn. Addison-Wesley Longman Publishing Co., Inc. (1999) 4. Foerster, J., Assael, I.A., De Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 5. Harding, J., et al.: Vehicle-to-vehicle communications: readiness of V2V technology for application. Technical report, United States, National Highway Traffic Safety Administration (2014) 6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 7. Hoshen, Y.: VAIN: attentional multi-agent predictive modeling. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 8. Hu, Y., Fang, S., Lei, Z., Zhong, Y., Chen, S.: Where2comm: communicationefficient collaborative perception via spatial confidence maps. arXiv preprint arXiv:2209.12836 (2022) 9. Jiang, J., Lu, Z.: Learning attentional communication for multi-agent cooperation. In: Advances in Neural Information Processing Systems, vol. 31 (2018) 10. Li, Y., Bhanu, B., Lin, W.: Auction protocol for camera active control. In: 2010 IEEE International Conference on Image Processing, pp. 4325–4328. IEEE (2010) 11. Liu, Y.C., Tian, J., Glaser, N., Kira, Z.: When2com: multi-agent perception via communication graph grouping. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4106–4115 (2020) 12. Liu, Y.C., Tian, J., Ma, C.Y., Glaser, N., Kuo, C.W., Kira, Z.: Who2com: collaborative perception via learnable handshake communication. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6876–6883. IEEE (2020) 13. Liu, Z., Lan, Q., Kalør, A.E., Popovski, P., Huang, K.: Over-the-air multi-view pooling for distributed sensing. arXiv preprint arXiv:2302.09771 (2023) 14. Luo, G., Zhang, H., Yuan, Q., Li, J.: Complementarity-enhanced and redundancyminimized collaboration network for multi-agent perception. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3578–3586 (2022) 15. Melo, F.S., Spaan, M.T.J., Witwicki, S.J.: QueryPOMDP: POMDP-based communication in multiagent systems. In: Cossentino, M., Kaisers, M., Tuyls, K., Weiss, G. (eds.) EUMAS 2011. LNCS (LNAI), vol. 7541, pp. 189–204. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34799-3 13


16. Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 17. Tan, M.: Multi-agent reinforcement learning: independent vs. cooperative agents. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337 (1993) 18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 19. Wang, Z., Ziou, D., Armenakis, C., Li, D., Li, Q.: A comparative analysis of image fusion methods. IEEE Trans. Geosci. Remote Sens. 43(6), 1391–1402 (2005) 20. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8, 279–292 (1992). https:// doi.org/10.1007/BF00992698 21. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1), 4–24 (2020) 22. Zhou, Y., Xiao, J., Zhou, Y., Loianno, G.: Multi-robot collaborative perception with graph neural networks. IEEE Robot. Autom. Lett. 7(2), 2289–2296 (2022)

DBRNet: Dual-Branch Real-Time Segmentation NetWork for Metal Defect Detection

Tianpeng Zhang1,2,3, Xiumei Wei1,2,3, Xiaoming Wu1,2,3, and Xuesong Jiang1,2,3(B)

1 Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
[email protected]
2 Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
3 Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China

Abstract. Metal surface defect detection is an important task for quality control in industrial production processes, and the requirements on accuracy and running speed are becoming increasingly high. However, real-time surface defect segmentation remains a challenge due to the complex edge details of metal defects, inter-class similarity, and intra-class differences. For this reason, we propose the Dual-branch Real-time Segmentation NetWork (DBRNet) for pixel-level defect classification on metal surfaces. First, we propose the Low-params Feature Enhancement Module (LFEM), which improves the feature extraction capability of the model with fewer parameters and does not significantly reduce the inference speed. Then, to address inter-class similarity, we design the Attention Flow-semantic Fusion Module (AFFM), which effectively integrates high-dimensional semantic information into the low-dimensional detail feature map by generating flow-semantic offset positions and using global attention. Finally, the Deep Connection Pyramid Pooling Module (DCPPM) is proposed to aggregate multi-scale context information and realize an overall perception of the defect. Experiments on NEU-Seg, MT, and the Severstal Steel Defect Dataset show that DBRNet outperforms other state-of-the-art approaches in balancing accuracy, speed, and parameters. The code is publicly available at https://github.com/fffcompu/DBRNet-Defect.

Keywords: Metal surface defect detection · real-time · semantic segmentation

This work was supported in part by Major innovation projects of the pilot project of science, education and industry integration (2022JBZ01-01), and in part by Taishan Scholars Program (tsqn202211203).

1 Introduction

In the industrial production process, many types of defects inevitably appear on the surface of metal materials, such as spots, scratches, and surface inclusions. These defects can degrade product performance or even pose safety hazards, so surface defect detection methods are critical. However, metal defect detection remains challenging due to complex edge details, large differences within the same defect class (intra-class differences), and local similarity between different defect classes (inter-class similarity).

With the development of deep learning, semantic segmentation has emerged; unlike the Faster R-CNN [15] and YOLO [2,9] families of object detection methods, it achieves pixel-level classification and more accurate boundary prediction. Therefore, generic segmentation methods [1,3,17] are widely used in metal defect segmentation. Zhang et al. [13] added a dual parallel attention module to the encoder of DeepLabV3+ to improve local representation and enrich context dependence, achieving 89.95 mIoU on a steel defect dataset. Zhang et al. [21] proposed MCNet, which uses a dense-block pyramid pool to fully exploit context information for surface defect segmentation of rail tracks. Dong et al. [5] proposed PGA-Net with pyramid feature fusion blocks and global context attention blocks, which effectively connects the feature maps extracted from the backbone network, establishes a deep supervision mechanism, and obtains precise boundaries for segmenting various defects. Damacharla et al. [4] proposed the TL-UNet approach, which uses transfer learning to accomplish automatic surface segmentation. Although these networks can segment defects well, they do not achieve real-time performance because of their structural complexity.

In recent years, many researchers have studied fast segmentation methods for road scenes. BiSeNetV2 [20] proposes a bilateral structure with two branches that extract semantic features and detailed features respectively, which ensures real-time performance while extracting spatial information efficiently. ShuffleNetV2 [12] uses channel shuffle and group convolution to reduce computational cost. STDCNet [6] designs a short-term dense cascade module to extract multi-scale features and proposes a detail aggregation module to guide decoder learning, which preserves low-level spatial details more accurately without increasing inference time. DDRNet [7] introduces bilateral connections to enhance the information exchange between the context and detail branches for precise segmentation. However, metal defects differ significantly from road scenes, which affects the accuracy and robustness of these algorithms when processing images containing metal defects.

To achieve high-precision defect segmentation while maintaining real-time speed, this paper proposes DBRNet for pixel-level metal defect prediction. The contributions of this paper are summarized as follows:

(1) We propose a Low-params Feature Enhancement Module (LFEM), which uses depth-wise convolutions with different kernel sizes to reduce the number of parameters and enhance feature extraction without significantly slowing down the model's inference speed.

(2) To deal with the problem of inter-class similarity, the Attention Flow-semantic Fusion Module (AFFM) is proposed; using global attention and semantic flow, it effectively fuses high-level semantic features while preserving detailed features.
(3) The Deep Connection Pyramid Pooling Module (DCPPM) is designed for the semantic branch to improve the overall perception of defects by using dense connections to fuse different pooling features.
(4) Experimental results on three publicly available defect datasets demonstrate the effectiveness of the proposed DBRNet for metal surface defect segmentation.

Fig. 1. The overall architecture of the network. 3 × 3 and 1 × 1 denote standard convolutions with the corresponding kernel sizes. ⊕ and "sum" indicate element-wise addition; "up" denotes an upsampling operation.

2 Method

The overall structure of the network is shown in Fig. 1. The model is based on a dual-branch architecture, where the semantic branch focuses on the category information of defects and the detail branch focuses on their location details. The whole network consists of the LFEM, AFFM, and DCPPM modules together with ResNet's Residual Bottleneck Block (RBB) and Residual Basic Block (RB). First, two LFEMs replace the original RB blocks in the detail branch and the semantic branch to generate feature maps at 1/8 and 1/32 of the image resolution, respectively, enhancing the feature extraction capability of the model. Then, AFFM is added before the fourth RB of the detail branch to efficiently fuse semantic features into the high-resolution feature map. The feature maps of the detail branch are fused into the semantic branch through 3 × 3 convolution downsampling before the LFEM and RBB of the semantic branch to improve edge recognition. Finally, element-wise addition ("sum") combines the detail-branch and semantic-branch information, and the output is obtained by the SegHead operation. Note that the processing of the SegHead and the auxiliary loss is kept consistent with DDRNet [7].

2.1 Low-Params Feature Enhancement Module

The number of parameters and the inference speed of a model are not linearly related: most devices are optimized for 3 × 3 standard convolution, giving RB blocks composed of 3 × 3 convolutions an advantage in running speed. However, simply stacking RB blocks makes the number of parameters rise dramatically, and the feature extraction becomes homogeneous. To balance parameters, speed, and accuracy, we propose the Low-params Feature Enhancement Module (LFEM), inspired by ShuffleNetV2 [12] and LAANet [22], to replace the original RB blocks that generate the feature maps at 1/8 of the image resolution in the detail branch and at 1/32 in the semantic branch.

Fig. 2. The details of LFEM. (a) is utilized when the stride = 1; (b) is based on the mobile inverted bottleneck and is used when the stride = 2. Conv denotes standard convolution; DWConv denotes depth-wise convolution.

LFEM is shown in Fig. 2 and covers two situations. When the stride equals 1, the channel number of the input is first expanded 2× by a 1 × 1 convolution to obtain an upgraded feature map and enhance feature extraction. To maintain computational efficiency, the upgraded feature map is then divided into two branches by a channel split, each receiving half of the channels; the first branch applies a 3 × 3 depth-wise convolution to extract local information, while the second branch applies a 5 × 5 depth-wise convolution to capture more complex feature information. The outputs of the two branches are concatenated along the channel dimension. Afterward, a 1 × 1 convolution increases information sharing between the branches, followed by a residual addition to prevent gradient vanishing and explosion. Finally, channel shuffle promotes information fusion between channels, which leads to higher segmentation accuracy. When the stride equals 2, the module is similar to the MobileNet


structure; however, to improve information retention during downsampling, a 1 × 1 standard convolution is used to downsample the residual path.
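A hedged PyTorch sketch of the stride-1 LFEM path described above is given below; the normalization and activation placement are assumptions, and the released code may differ in detail.

```python
# Sketch of the stride-1 LFEM path: 1x1 expansion, channel split,
# parallel 3x3/5x5 depth-wise branches, 1x1 fusion, residual add, shuffle.
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class LFEM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        mid = channels * 2
        half = mid // 2
        self.expand = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.dw3 = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.dw5 = nn.Conv2d(half, half, 5, padding=2, groups=half, bias=False)
        self.fuse = nn.Sequential(nn.Conv2d(mid, channels, 1, bias=False),
                                  nn.BatchNorm2d(channels))

    def forward(self, x):
        y = self.expand(x)
        a, b = torch.chunk(y, 2, dim=1)                       # channel split
        y = torch.cat([self.dw3(a), self.dw5(b)], dim=1)      # two DW branches
        y = self.fuse(y) + x                                  # residual add
        return channel_shuffle(torch.relu(y))

x = torch.randn(1, 64, 64, 128)
print(LFEM(64)(x).shape)    # torch.Size([1, 64, 64, 128])
```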

2.2 Attention Flow-Semantic Fusion Module

In defect segmentation, the local similarity of different defects makes it difficult to identify the class information of a defect. Integrating semantic information into detail feature maps enables the segmentation results to retain detailed information while containing rich category information. However, directly fusing semantic features into detail feature maps with element-wise operations leads to misalignment and loss of feature information. Inspired by SFNet [11], we propose AFFM, which uses a global attention mechanism and the semantic offset positions of adjacent features to adaptively align high-level semantic features with low-level detail features and improve the ability to identify defect classes.

Fig. 3. The architecture of AFFM. ⊗ indicates element-wise multiplication; GAP indicates global average pooling; Gridsample denotes the grid-sample function.

As shown in Fig. 3, the input of AFFM consists of Fhigh ∈ R^{H/2×W/2×2C}, the high-level feature map from the semantic branch, and Flow ∈ R^{H×W×C}, the low-level feature map from the detail branch; the output is Fout ∈ R^{H×W×C}. We first apply global average pooling to the high-level feature to quickly extract global context information, immediately followed by a 1 × 1 convolution that adjusts it to the same number of channels as the low-level feature; the obtained attention weights ∈ R^{1×1×C} are multiplied with the low-level feature, providing a simple way to weight it and highlight detailed information. To cope with the inefficient feature fusion between high-level and low-level features caused by the difference in resolution and information


contained, we resample the high-level feature to achieve feature alignment by learning a semantic offset position field ∈ R^{H×W×2}. Specifically, we use a 1 × 1 convolution to compress the channels of the high-level feature, obtaining Fmidhigh ∈ R^{H/2×W/2×C}, and then apply another 1 × 1 convolution and upsampling to generate a comparison feature map. The low-level feature map is processed only by a 1 × 1 convolution to generate a comparison feature map of the same shape H × W × C/2. These two comparison feature maps are fused by channel concatenation and fed into a 3 × 3 convolution to predict the semantic offset positions between the features at different levels. The semantic offset position field consists of offsetX and offsetY, which represent the horizontal and vertical offsets of the sampling coordinates of each pixel during resampling. The gridsample operation then resamples Fmidhigh to generate F'high, embedded at the correct positions in the low-level feature. Finally, element-wise addition combines F'low and F'high to obtain the final output Fout. The entire process can be formulated as:

\begin{aligned}
F'_{low} &= \mathrm{Conv}_{1\times 1}(\mathrm{GAP}(F_{high})) \otimes F_{low} \\
F_{midhigh} &= \mathrm{Conv}_{1\times 1}(F_{high}) \\
field &= \mathrm{Conv}_{3\times 3}\big(\mathrm{Concat}\big(\mathrm{Conv}_{1\times 1}(F'_{low}),\, \mathrm{up}(\mathrm{Conv}_{1\times 1}(F_{midhigh}))\big)\big) \\
F'_{high} &= \mathrm{gridsample}(F_{midhigh}, field) \\
F_{out} &= F'_{low} \oplus F'_{high}
\end{aligned} \tag{1}

where Conv_{i×i} denotes a standard convolution with an i × i kernel and Concat denotes channel concatenation.
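As an illustration of Eq. (1), the sketch below implements an AFFM-style alignment step in PyTorch with grid_sample; the channel sizes, the normalisation of the offset field, and all module names are our assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AFFM(nn.Module):
    def __init__(self, c_low, c_high):
        super().__init__()
        self.attn = nn.Conv2d(c_high, c_low, 1)           # applied after GAP
        self.squeeze_high = nn.Conv2d(c_high, c_low, 1)   # produces F_midhigh
        self.cmp_low = nn.Conv2d(c_low, c_low // 2, 1)
        self.cmp_high = nn.Conv2d(c_low, c_low // 2, 1)
        self.offset = nn.Conv2d(c_low, 2, 3, padding=1)   # field: offsetX, offsetY

    def forward(self, f_low, f_high):
        n, c, h, w = f_low.shape
        # global-attention weighting of the detail branch
        w_att = torch.sigmoid(self.attn(F.adaptive_avg_pool2d(f_high, 1)))
        f_low_w = f_low * w_att
        # learn a semantic offset field from both levels
        f_mid = self.squeeze_high(f_high)
        cmp = torch.cat([self.cmp_low(f_low_w),
                         F.interpolate(self.cmp_high(f_mid), size=(h, w),
                                       mode='bilinear', align_corners=False)], dim=1)
        field = self.offset(cmp).permute(0, 2, 3, 1)      # N x H x W x 2
        # resample the high-level feature at the offset positions
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        base = torch.stack((xs, ys), dim=-1).to(f_low).expand(n, -1, -1, -1)
        grid = base + field / torch.tensor([w, h], device=f_low.device) * 2
        f_high_aligned = F.grid_sample(f_mid, grid, align_corners=False)
        return f_low_w + f_high_aligned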

2.3 Deep Connection Pyramid Pooling Module

For defect segmentation, some defects exhibit intra-class differences, which require context information to achieve an overall perception of metal defects. Pooling operations can quickly provide multi-scale context, but traditional pooling modules lack information sharing after the pooling operation. We therefore design DCPPM, which uses pooling operations with different kernel sizes and dense connections to obtain more fine-grained multi-scale context information in the semantic branch. As shown in Fig. 4, the input x ∈ R^{H×W×C} is fed to five parallel branches for multi-scale feature fusion. The 5th branch uses a global average pooling operation Pg(·), while the 2nd to 4th branches use pooling operations Pi, i ∈ {2, 3, 4}, with kernel sizes {5, 9, 17}. Except for the first branch, the other four branches apply their pooling operation and then generate F_i^mid ∈ R^{H×W×C/4} via a 1 × 1 standard convolution and upsampling for subsequent fusion. Branches with larger pooling kernels are fused with deeper feature information so as to obtain multi-scale information at different granularities.


Fig. 4. The details of DCPPM.

Specifically, to obtain the output F_i^out of each branch, the 1st branch performs no further operation and F_0^mid is directly used as the output F_0^out. The outputs of the 2nd and 3rd branches are obtained by summing F_i^mid and F_{i-1}^out and then applying a 3 × 3 depth-wise convolution followed by a 1 × 1 standard convolution. The 4th and 5th branches sum F_i^mid with F_{i-1}^out and Q_{i-1}, likewise followed by a depth-wise convolution and a standard convolution, where Q_i denotes the output of the intermediate addition of branch i. The F_i^out of all branches are concatenated and fed into a 1 × 1 convolution, another 1 × 1 convolution is applied to the input x, and the two outputs are summed to obtain F_out.
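A rough PyTorch sketch of a DCPPM-style pyramid with dense connections is given below; the pooling strides, the exact form of the dense additions, and all names are assumptions where the text does not specify them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DCPPM(nn.Module):
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.scale0 = nn.Conv2d(c_in, c_mid, 1, bias=False)          # 1st branch
        self.pools = nn.ModuleList([
            nn.AvgPool2d(5, stride=2, padding=2),
            nn.AvgPool2d(9, stride=4, padding=4),
            nn.AvgPool2d(17, stride=8, padding=8),
            nn.AdaptiveAvgPool2d(1)])                                 # global branch
        self.squeeze = nn.ModuleList([nn.Conv2d(c_in, c_mid, 1, bias=False) for _ in range(4)])
        self.process = nn.ModuleList([nn.Sequential(
            nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid, bias=False),  # depth-wise
            nn.Conv2d(c_mid, c_mid, 1, bias=False)) for _ in range(4)])
        self.compress = nn.Conv2d(5 * c_mid, c_out, 1, bias=False)
        self.shortcut = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [self.scale0(x)]                                       # F0_out = F0_mid
        for i in range(4):
            f_mid = F.interpolate(self.squeeze[i](self.pools[i](x)), size=(h, w),
                                  mode='bilinear', align_corners=False)
            outs.append(self.process[i](f_mid + outs[-1]))            # dense connection
        return self.compress(torch.cat(outs, dim=1)) + self.shortcut(x)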

3 Experiments

3.1 Datasets and Evaluation Metrics

Datasets. The NEU-Seg [5] dataset consists of 3600 hot-rolled steel strip images with a resolution of 200 × 200 and contains three types of defects (Inclusion, Patches, and Scratches), each with 1200 samples. To evaluate FPS, we resize the images to 512 × 512. Each image is annotated at the semantic level. The MT dataset [8] contains five types of magnetic-tile defects (porosity, crack, wear, fracture, and unevenness), with 392 defect images and 952 defect-free images. Since the images have different resolutions, they are resized to 512 × 512, and each image is given a pixel-level defect segmentation label. The Severstal Steel Defect Dataset [18] contains 12568 steel images with a resolution of 1600 × 256 and pixel-level annotations for four defect categories; the dataset contains 5902 defect-free samples and 6666 defect samples. Metrics. We adopt evaluation metrics widely used in real-time segmentation: mean Intersection over Union (mIoU), Frames Per


Second (FPS), the number of parameters (Params), and Floating Point Operations (FLOPs) to measure the quality and efficiency of the model.
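For reference, a small sketch of how per-class IoU and mIoU are typically computed from integer label maps via a confusion matrix (our illustration, not the authors' evaluation code):

import numpy as np

def mean_iou(pred, gt, num_classes):
    # pred, gt: integer label maps of the same shape
    mask = (gt >= 0) & (gt < num_classes)
    hist = np.bincount(num_classes * gt[mask].astype(int) + pred[mask].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(1) + hist.sum(0) - inter
    iou = inter / np.maximum(union, 1)
    return iou.mean()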

3.2 Implementation Details

We use the SGD optimizer with momentum and a linear learning-rate decay strategy. The momentum is set to 0.9, the initial learning rate to 1e-2, and the weight decay to 5e-4. For data augmentation, the NEU-Seg and MT datasets are randomly scaled by factors of 0.5 to 2.5 and then randomly cropped to 512 × 512, while the Severstal Steel Defect Dataset is randomly cropped to 512 × 256. The batch size is 8, every dataset is split into train:val:test = 6:2:2, and OHEM is used as the loss function. All experiments are performed on an NVIDIA RTX 3080 with PyTorch 1.11.0 and CUDA 11.3.
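A minimal sketch of this optimisation setup in PyTorch is shown below; the placeholder model and the exact shape of the decay schedule are assumptions, since the paper only states that the learning rate decreases linearly.

import torch

model = torch.nn.Conv2d(3, 4, 3)   # stand-in for DBRNet (4 output classes assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

def adjust_lr(optimizer, base_lr, cur_iter, max_iter, power=1.0):
    # power=1.0 gives a linear decay from base_lr to 0 over training
    lr = base_lr * (1 - cur_iter / max_iter) ** power
    for group in optimizer.param_groups:
        group['lr'] = lr
    return lr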

3.3 Comparison Experiments

Fig. 5. Visualisation of results on the NEU-Seg dataset (columns: input image, U-Net, SegNet, DeepLabV3+, DFANet, ICNet, ENet, ERFNet, LEDNet, BiseNetV2, STDC, DDRNet, ground-truth label, and ours).

We compare DBRNet with 11 representative segmentation methods on three publicly available defect segmentation datasets. Table 1 and Table 2 show the experimental results. Results on NEU-Seg Dataset. On this dataset, DBRNet achieves 123.40 FPS at a resolution of 512 × 512 and reaches 83.14% mIoU, the best of all compared methods in both accuracy and speed. Specifically, compared with DDRNet, our method reduces Params and FLOPs by 2.35 MB and 1.45 G, respectively, with a small increase in FPS, while its mIoU is 1.89% higher. DBRNet is also competitive in accuracy against DeepLabV3+ and U-Net. The comparison visualization on this

Table 1. Comparative experiments on the NEU-Seg and MT datasets

Method          | Resolution | Params(M) | FLOPs(G) | FPS    | mIoU NEU-Seg | mIoU MT
U-Net [17]      | 512 × 512  | 28.95     | 361.23   | 26.24  | 82.65        | 31.52
SegNet [1]      | 512 × 512  | 29.44     | 160.02   | 53.67  | 76.70        | 15.30
DeepLabV3+ [3]  | 512 × 512  | 54.70     | 83.20    | 47.24  | 82.79        | 30.50
DFANet [10]     | 512 × 512  | 2.15      | 1.79     | 39.26  | 45.02        | 20.07
ICNet [23]      | 512 × 512  | 26.70     | 5.58     | 56.00  | 79.13        | 61.35
ENet [14]       | 512 × 512  | 0.35      | 2.42     | 83.48  | 78.79        | 38.18
ERFNet [16]     | 512 × 512  | 2.06      | 12.89    | 84.87  | 79.09        | 35.13
LEDNet [19]     | 512 × 512  | 0.91      | 5.72     | 69.80  | 78.99        | 48.09
BiseNetV2 [20]  | 512 × 512  | 5.19      | 17.75    | 117.90 | 81.38        | 62.05
STDCNet [6]     | 512 × 512  | 9.34      | 11.37    | 121.81 | 82.09        | 62.58
DDRNet [7]      | 512 × 512  | 5.69      | 4.89     | 121.70 | 81.25        | 67.18
ours            | 512 × 512  | 3.34      | 3.44     | 123.40 | 83.14        | 70.51

Table 2. Comparative experiments on the Severstal Steel Defect Dataset

Method          | Resolution | Params(M) | FLOPs(G) | FPS    | mIoU
U-Net [17]      | 1600 × 256 | 28.95     | 564.42   | 17.05  | 54.65
SegNet [1]      | 1600 × 256 | 29.44     | 251.53   | 33.43  | 44.20
DeepLabV3+ [3]  | 1600 × 256 | 54.70     | 130.00   | 36.40  | 60.07
DFANet [10]     | 1600 × 256 | 2.15      | 2.79     | 38.00  | 35.83
ICNet [23]      | 1600 × 256 | 26.70     | 8.73     | 55.00  | 60.02
ENet [14]       | 1600 × 256 | 0.35      | 7.04     | 72.31  | 47.89
ERFNet [16]     | 1600 × 256 | 2.06      | 20.14    | 83.87  | 55.30
LEDNet [19]     | 1600 × 256 | 0.91      | 8.94     | 70.50  | 45.79
BiseNetV2 [20]  | 1600 × 256 | 5.19      | 27.75    | 112.00 | 59.08
STDCNet [6]     | 1600 × 256 | 9.34      | 7.27     | 116.12 | 57.64
DDRNet [7]      | 1600 × 256 | 5.69      | 7.63     | 122.30 | 59.54
ours            | 1600 × 256 | 3.34      | 5.41     | 120.06 | 60.61

dataset is shown in Fig. 5. It can be seen that the results predicted by DBRNet are close to the ground truth (label) in the detail regions, which demonstrates its effectiveness in metal surface defect segmentation. Results on MT Dataset. On this dataset, the FPS, Params, and FLOPs of DBRNet are the same as on NEU-Seg because the image resolution is the same. The difference is that the accuracy of our method on this dataset is


significantly higher than that of the other comparison methods, with an mIoU of 70.51%, which shows that DBRNet remains effective when segmenting metal defects with a larger number of defect types. Results on Severstal Steel Defect Dataset. DBRNet's mIoU is also the best on this dataset. Although its FPS is 2.24 lower than DDRNet's, its mIoU is 1.07% higher. The mIoU of ICNet is similar to that of DBRNet, but our method is faster and has fewer parameters. Compared with the other methods, ours achieves an excellent balance between speed, parameters, and accuracy on this dataset.

3.4 Ablation Study

To verify the effectiveness of the proposed method, ablation experiments were performed on the three datasets. Note that our baseline is a modified DDRNet, and each ablation replaces the corresponding original module. Table 3 and Table 4 show the ablation results.

Table 3. Ablation on the NEU-Seg and MT datasets

Baseline | LFEM | AFFM | DCPPM | Params(M) | FLOPs(G) | FPS    | mIoU NEU-Seg | mIoU MT
✓        |      |      |       | 5.53      | 4.27     | 137.01 | 79.55        | 66.38
✓        | ✓    |      |       | 3.84      | 3.58     | 130.71 | 80.57        | 68.26
✓        |      | ✓    |       | 5.54      | 4.29     | 129.64 | 80.46        | 67.88
✓        |      |      | ✓     | 5.01      | 4.14     | 133.33 | 79.87        | 67.09
✓        | ✓    |      | ✓     | 3.32      | 3.45     | 126.59 | 81.40        | 68.87
✓        | ✓    | ✓    |       | 3.85      | 3.59     | 124.29 | 82.38        | 69.70
✓        | ✓    | ✓    | ✓     | 3.34      | 3.74     | 123.40 | 83.14        | 70.51

Table 4. Ablation on the Severstal Steel Defect Dataset

Baseline | LFEM | AFFM | DCPPM | Params(M) | FLOPs(G) | FPS    | mIoU
✓        |      |      |       | 5.53      | 6.68     | 133.60 | 58.64
✓        | ✓    |      |       | 3.84      | 5.59     | 125.31 | 60.02
✓        |      | ✓    |       | 5.54      | 6.70     | 123.40 | 59.83
✓        |      |      | ✓     | 5.01      | 6.47     | 132.00 | 59.31
✓        | ✓    |      | ✓     | 3.32      | 5.39     | 123.30 | 60.37
✓        | ✓    | ✓    |       | 3.85      | 5.62     | 122.74 | 60.54
✓        | ✓    | ✓    | ✓     | 3.34      | 5.41     | 120.06 | 60.61


Effectiveness of Single Modules. Compared with the baseline, replacing the corresponding RB blocks with LFEM decreases FPS by only about 7 while reducing the number of parameters by 1.69 MB and increasing mIoU by 1.02%, 1.88%, and 1.38% on NEU-Seg, MT, and the Severstal Steel Defect Dataset, respectively. This suggests that LFEM extracts features more effectively than the RB blocks alone. Adding AFFM increases the number of parameters by just 0.01 M while improving mIoU by 0.91%, 1.5%, and 1.19% on the three datasets, demonstrating that AFFM can effectively fuse semantic information and address the similarity between different defects without a significant parameter cost. Adding DCPPM increases mIoU by 0.32%, 0.71%, and 0.67% on the three datasets, showing that it aggregates multi-scale information to achieve an overall perception of defects. Effectiveness of Different Module Combinations. In addition to analysing the modules individually, we also evaluate their combinations, as seen in rows 5 to 7 of Table 3 and Table 4. Every combination improves the accuracy of the model. The best results are achieved when all three modules are added: the number of parameters is reduced by 2.19 MB and the mIoU reaches 83.14%, 70.51%, and 60.61% on the three datasets. These results show that the three modules together effectively improve the performance of the model.

4 Conclusion

In this paper, we propose DBRNet, a real-time metal defect segmentation network designed for both high accuracy and high efficiency. First, we design the LFEM module, which enhances feature extraction with fewer parameters. Second, to address the inter-class similarity problem, we use the AFFM module, which more effectively fuses high-level semantic information into low-level detail information. In addition, we add DCPPM to the semantic branch to enrich the model's multi-scale information and thus realize an overall perception of defects. Experimental results on three datasets show that the algorithm performs excellently in terms of both model efficiency and segmentation quality, indicating that it can be deployed on metal defect detectors for real-time segmentation of metal defects.

References 1. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017) 2. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)


3. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2 49 4. Damacharla, P., Rao, A., Ringenberg, J., Javaid, A.Y.: TLU-Net: a deep learning approach for automatic steel surface defect detection. In: 2021 International Conference on Applied Artificial Intelligence (ICAPAI), pp. 1–6. IEEE (2021) 5. Dong, H., Song, K., He, Y., Xu, J., Yan, Y., Meng, Q.: PGA-Net: pyramid feature fusion and global context attention network for automated surface defect detection. IEEE Trans. Ind. Inf. 16(12), 7448–7458 (2019) 6. Fan, M., et al.: Rethinking BiSeNet for real-time semantic segmentation. In: Proceedings of the IEEE/CVF Conference on CVPR, pp. 9716–9725 (2021) 7. Hong, Y., Pan, H., Sun, W., Jia, Y.: Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085 (2021) 8. Huang, Y., Qiu, C., Yuan, K.: Surface defect saliency of magnetic tile. Vis. Comput. 36, 85–96 (2020). https://doi.org/10.1007/s00371-018-1588-5 9. Jocher, G.: Yolov5 (2021). https://github.com/ultralytics/yolov5 10. Li, H., Xiong, P., Fan, H., Sun, J.: DFANet: deep feature aggregation for real-time semantic segmentation. In: Proceedings of the IEEE/CVF Conference on CVPR, pp. 9522–9531 (2019) 11. Li, X., et al.: Semantic flow for fast and accurate scene parsing. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 775–793. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8 45 12. Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9 8 13. Pan, Y., Zhang, L.: Dual attention deep learning network for automatic steel surface defect segmentation. Comput.-Aided Civ. Infrastruct. Eng. 37(11), 1468–1487 (2022) 14. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016) 15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015) 16. Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 19(1), 263–272 (2017) 17. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 18. Severstal: Steel defect detection, kaggle challange 2019 (2019). https://www.kaggle. com/c/severstal-steel-defect-detection 19. Wang, Y., et al.: LEDNet: a lightweight encoder-decoder network for real-time semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1860–1864. IEEE (2019)


20. Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., Sang, N.: BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 129, 3051–3068 (2021). https://doi.org/10.1007/s11263-021-01515-2 21. Zhang, D., Song, K., Xu, J., He, Y., Niu, M., Yan, Y.: MCnet: multiple context information segmentation network of no-service rail surface defects. IEEE Trans. Instrum. Meas. 70, 1–9 (2020) 22. Zhang, X., Du, B., Wu, Z., Wan, T.: LAANet: lightweight attention-guided asymmetric network for real-time semantic segmentation. Neural Comput. Appl. 34(5), 3573–3587 (2022). https://doi.org/10.1007/s00521-022-06932-z 23. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 418–434. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9 25

MaskDiffuse: Text-Guided Face Mask Removal Based on Diffusion Models

Jingxia Lu1, Xianxu Hou4, Hao Li1, Zhibin Peng1, Linlin Shen1,2,3(B), and Lixin Fan5

1 Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
{lujingxia2021,lihao2021,pengzhibin2021}@email.szu.edu.cn
2 WeBank Institute of Financial Technology, Shenzhen University, Shenzhen, China
[email protected]
3 National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China
4 School of AI and Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou, China
5 WeBank Co., Ltd., Shenzhen, China

Abstract. As masked face images can significantly degrade the performance of face-related tasks, face mask removal remains an important and challenging task. In this paper, we propose a novel learning framework, called MaskDiffuse, to remove face masks based on the Denoising Diffusion Probabilistic Model (DDPM). In particular, we leverage CLIP to fill the missing parts by guiding the reverse process of a pretrained diffusion model with text prompts. Furthermore, we propose a multi-stage blending strategy to preserve the unmasked areas and a conditional resampling approach to make the generated contents consistent with the unmasked regions. Thus, our method achieves interactive, user-controllable, and identity-preserving mask removal with high quality. Both qualitative and quantitative experimental results demonstrate the superiority of our method for mask removal over alternative methods.

Keywords: Diffusion models · Mask removal · Text-to-image

1 Introduction

A common challenge for face perception is occlusion. Due to the large area of occlusion, face images with masks heavily degrade the performance of many face-related tasks, such as face detection [26], face identification [22], face recognition [8] and age estimation [12]. Removing masks is an effective way to improve the performance of these tasks. Early works [7] mainly regard the mask removal problem as an inpainting task. They usually employ the generative adversarial network (GAN) to restore the masked facial areas while preserving the background of the original image. Although these approaches have provided appealing results, there are several


Fig. 1. Our method enables (a) user controllable face mask removal guided by facial attribute descriptions and (b) identity-preserved mask removal guided by the name of a celebrity.

limitations. First, it is difficult to ensure that the filled content is visually consistent with the unmasked regions. Second, due to the deterministic process of GANs, the images generated by these approaches often lack diversity. In addition, existing methods mainly use predefined annotations such as landmarks and sketches as conditions. However, it’s difficult for them to achieve highly intuitive and interactive image inpainting. Inspired by recent advances in vision-language pretraining [18] and diffusion models [10,24], we propose MaskDiffuse: an intuitive text-guided face mask removal method. The diffusion model is an emerging alternative paradigm for generative modeling and it can generate samples through an iterative denoising process. Our key idea is to fill in the missing parts by guiding the reverse process of pretrained diffusion model with text prompts. In particular, the guidance is achieved by employing CLIP [18], which has been recently shown to be effective for text-driven edits. In addition, we propose a multi-stage blending strategy to better preserve the unmasked areas during the reverse diffusion process. Furthermore, we adopt a conditional resampling approach to maintain the global visual consistency of the image. As a result, our proposed approach enables a simple and intuitive user interface for text-guided face mask removal. As shown in Fig. 1, we can achieve user-controllable face mask removal by using various facial attribute descriptions, and identity-preserved face mask removal by using the names of celebrities. Concretely, the main contributions of this work are summarized as follows: • We propose a diffusion model-based framework to achieve text-guided face mask removal. Our method makes full use of the semantics of images and textual descriptions in a joint framework. • We further incorporate a multi-stage blending strategy and a conditional resampling approach during the reverse diffusion process, which can effectively preserve the unmasked facial areas and maintain the global visual consistency of the output images.


• We validate the effectiveness of our approach on various mask removal tasks, demonstrating the superiority of our method over other state-of-the-art techniques.

2 Related Work

2.1 Image Inpainting with GANs

GAN-based methods are commonly used for image inpainting with or without conditions. Unconditional image inpainting approaches directly predict the missing parts from the corrupted images and the corresponding masks. Early works achieve inpainting by using global and local discriminators [11], attention-based architectures [31] and multi-column convolutional neural networks [27]. By contrast, conditional image inpainting methods require additional information such as landmarks, sketches, or texts. For example, Yang et al. [30] achieve face inpainting conditioned on facial landmarks. Nazeri et al. [16] use hallucinated edges as a prior to guide the inpainting process. Xiong et al. [28] propose to first predict the foreground contour and then restore the missing region using the predicted contour as guidance. Other works [13,32] achieve image inpainting based on text descriptions. However, these GAN-based methods usually require paired data and struggle to generate diverse images.

2.2 Image Inpainting with Diffusion Models

Another successful line of work for image inpainting is based on diffusion models. Diffusion-based approaches [15,24,25] replace the unknown region of the image with a noised version of the input image after each sampling step. Other methods such as GLIDE [17] and Palette [21] fine-tune the diffusion models to achieve inpainting: during fine-tuning, random regions of training examples are erased, and the remaining portions are fed into the model along with a mask channel as additional information. However, in the context of face images, these methods usually generate artifacts.

2.3 Text-to-Image Synthesis

There are also works that generate images from text descriptions, known as textto-image synthesis. Typically, a conditional model is trained to generate images from paired texts [29]. More recently, impressive results have been achieved by leveraging large scale auto-regressive [19] or diffusion models [20]. Rather than training condition models, several approaches [3,5] employ test-time optimization to explore the latent spaces of a pre-trained generator. These models typically guide the optimization to minimize a text-to-image similarity score derived from an auxiliary model such as CLIP [18]. However, existing methods are mainly designed for image generation tasks while we focus on inpainting the masked region of the facial image under the guidance of text descriptions.


Fig. 2. Overview of our approach: (a) multi-stage blending: diffuse the clean image at step t and multiply it with a binary mask; (b) CLIP guidance: calculate the CLIP objective; (c) reverse process: denoise the image using the condition provided by (a) and (b).

3 Preliminaries: Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) [10,24] show promising performance in image generation, and consist of a forward diffusion process and a reverse diffusion process. Given a sample x_0 ∼ q(x_0), a Markov chain of latent variables x_1, ..., x_T is produced in the forward process by progressively adding a small amount of Gaussian noise to the sample:

q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\big),   (1)

where β_t ∈ (0, 1) is a noise schedule controlling the step size of the added noise. Since the Gaussian transitions compose in closed form, we can sample x_t directly from x_0 at any step without generating the intermediate steps; denoting α_t := 1 − β_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s,

q(x_t \mid x_0) := \mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I\big).   (2)

Furthermore, x_t can be expressed as a linear combination of x_0 and ε via the reparameterization trick:

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,   (3)

where ε ∼ N(0, I) has the same dimension as x_0. Since the reverse of the forward process q(x_{t-1} | x_t) is intractable, DDPM learns parameterized Gaussian transitions p_θ(x_{t-1} | x_t). The generative (or reverse) process has the same functional form [24] as the forward process, and it is expressed as a Gaussian transition with learned mean and fixed variance [10]:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\big).   (4)

Further, by decomposing μ_θ into a linear combination of x_t and the noise approximation ε_θ, the generative process is expressed as:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\Big) + \sigma_t z,   (5)

where z ∼ N(0, I) and ε_θ is a neural network used for noise prediction at each denoising step.
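The two closed-form steps above, Eqs. (3) and (5), can be summarised in a short sketch; eps_model stands for the noise-prediction network ε_θ and the tensors alpha, alpha_bar, sigma hold the per-step schedule values — all placeholders of this illustration.

import torch

def q_sample(x0, t, alpha_bar):
    # sample x_t ~ q(x_t | x_0) as in Eq. (3)
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps

def p_sample(eps_model, xt, t, alpha, alpha_bar, sigma):
    # one reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t) as in Eq. (5)
    eps = eps_model(xt, t)
    mean = (xt - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
    z = torch.randn_like(xt) if t > 0 else torch.zeros_like(xt)
    return mean + sigma[t] * z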

4 Method

Overview. We aim to remove the face mask and simultaneously generate the corresponding area consistent with the provided text descriptions. To achieve this goal, we seek to control the generation process of DDPM to fill the missing areas guided by a text-to-image similarity based on the CLIP model. Furthermore, we propose a multi-stage blending strategy to preserve the unmasked areas and a conditional resampling approach to preserve the global visual consistency of the image.

4.1 Text-Guided Mask Removal

Problem Statement. Given a masked face image x^m, a text prompt d, and a binary mask m, our goal is to produce an unmasked image x^{um} that satisfies the following conditions: 1) the masked region x^{um} ⊙ m should be consistent with the text description d; 2) the unmasked areas x^{um} ⊙ (1 − m) are well preserved; and 3) the transition across the boundaries of the inpainted regions is natural.

Multi-stage Blending. First, we consider how to preserve the visual contents of the unmasked areas. One straightforward way is to blend the inpainted image and the original image in pixel space. However, this naive blending can result in artifacts around the boundaries. As shown in Fig. 2(a), we propose a multi-stage blending strategy during the reverse diffusion process of DDPM. In particular, at each stage t of the reverse process, we first obtain the noised version x^m_t of the masked image according to Eq. 3 and the corresponding unmasked image x^{um}_t according to Eq. 5. The blended image is then obtained as

x_{t-1} = m \odot x^{um}_{t-1} + (1 - m) \odot x^{m}_{t-1}.   (6)

In our experiment, we perform the above blending at every reverse diffusion stage. As the structure of the generated content is mainly determined in the early steps of the diffusion process [9], a natural transition on the mask boundary can be guaranteed by the multi-stage blending strategy.

CLIP Guidance. Inspired by guided diffusion [6], which uses the gradient information of a pretrained classifier to control the image generation process, we incorporate text information into the reverse diffusion process by leveraging the CLIP model. Since off-the-shelf CLIP models are trained on clean images, we estimate the clean image \hat{x}_0 for the current x_t according to the forward diffusion process as follows:

\hat{x}_0 = \frac{x_t}{\sqrt{\bar{\alpha}_t}} - \frac{\sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}.   (7)

Then, a CLIP-based loss L_{CLIP}(\hat{x}_0, d) is used to measure the mismatch between the given text and image. Specifically, the cosine distance between the CLIP embedding of the text description and the embedding of the estimated clean image \hat{x}_0 is calculated as:

L_{CLIP}(\hat{x}_0, d) = 1 - f(\hat{x}_0) \cdot g(d),   (8)

where f and g denote the CLIP image encoder and text encoder, respectively. Finally, we use the gradient of L_{CLIP} with respect to x_t to guide the diffusion sampling process as:

x_{t-1} \sim \mathcal{N}\big(\mu_\theta(x_t, t) - s\Sigma \nabla_{x_t} L_{CLIP}(\hat{x}_0, d), \Sigma\big),   (9)

where s represents the scale of guidance.
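A compact sketch of one guided, blended reverse step combining Eqs. (6)–(9) is given below; clip_image, text_emb, eps_model, and the use of σ_t² in place of the variance Σ are assumptions of this illustration, not the authors' implementation.

import torch

def guided_blended_step(xt, t, x_masked, m, text_emb, eps_model,
                        clip_image, alpha, alpha_bar, sigma, scale):
    xt = xt.detach().requires_grad_(True)
    eps = eps_model(xt, t)
    x0_hat = (xt - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()       # Eq. (7)
    loss = 1 - torch.cosine_similarity(clip_image(x0_hat), text_emb, dim=-1).mean()  # Eq. (8)
    grad = torch.autograd.grad(loss, xt)[0]
    mean = (xt - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
    x_um = mean - scale * sigma[t] ** 2 * grad + sigma[t] * torch.randn_like(xt)     # Eq. (9)
    x_m = alpha_bar[t].sqrt() * x_masked + (1 - alpha_bar[t]).sqrt() * torch.randn_like(xt)
    return m * x_um + (1 - m) * x_m                                                  # Eq. (6)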

4.2 Conditional Resampling

In order to strengthen the CLIP guidance, we propose a conditional resampling procedure during the reverse diffusion process. In particular, resampling is performed by adding Gaussian noise to a less noisy image x_{t−λ} to generate a more noisy image x_t, which is then denoised back to x_{t−λ} again, where λ is the step size at which resampling is performed. This process is repeated several times. Note that the CLIP guidance is also incorporated during the denoising process; thus it is reinforced in each resampling pass and more text-consistent content can be generated. Moreover, by resampling from the blended image, we effectively preserve the global visual consistency of the generated image. The overall approach is described in Algorithm 1.

Algorithm 1. Text-guided mask removal, given a diffusion model (μ_θ(x_t), Σ_θ(x_t)) and a CLIP model
Input: masked facial image x^m, target text description d, input mask m
Output: edited image x with the masked area restored according to the text description d
1: x_T ∼ N(0, I)
2: for t = T, ..., 1 do
3:   for u = 1, ..., U do
4:     ε ∼ N(0, I) if t > 1, else ε = 0
5:     μ, Σ ← μ_θ(x_t), Σ_θ(x_t)
6:     x̂_0 ← x_t/√ᾱ_t − √(1 − ᾱ_t) ε_θ(x_t, t)/√ᾱ_t
7:     x^{um}_{t−1} ∼ N(μ − ∇_{x_t} L_CLIP(x̂_0, d), Σ)
8:     x^{m}_{t−1} ∼ N(√ᾱ_t x^m, (1 − ᾱ_t) I)
9:     x_{t−1} ← x^{um}_{t−1} ⊙ m + x^{m}_{t−1} ⊙ (1 − m)
10:    if u < U and t > 1 then
11:      x_t ∼ N(√(1 − β_{t−1}) x_{t−1}, β_{t−1} I)
12:    end if
13:   end for
14: end for
15: return x_0
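The resampling jump in lines 10–12 of Algorithm 1 can be sketched as follows; step_fn is assumed to wrap one guided, blended denoising step such as the sketch above, and betas is the noise schedule — both are placeholders of this illustration.

import torch

def reverse_with_resampling(xT, T, U, betas, step_fn):
    xt = xT
    for t in range(T, 0, -1):
        for u in range(U):
            x_prev = step_fn(xt, t)              # one guided, blended denoising step
            if u < U - 1 and t > 1:
                # diffuse back one step: x_t ~ N(sqrt(1 - beta_{t-1}) x_{t-1}, beta_{t-1} I)
                b = betas[t - 1]
                xt = (1 - b).sqrt() * x_prev + b.sqrt() * torch.randn_like(x_prev)
            else:
                xt = x_prev
    return xt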

5 Experiments and Applications

5.1 Experiment Settings

Pretrained Models. In our implementation, we choose the CLIP model with pretrained ViT-B/32 weights. The DDPM is pretrained on the CelebA-HQ dataset provided by [14]. We use facial attribute descriptions to achieve user-controllable facial mask removal and the names of celebrities for identity-preserved facial mask removal.

Pretrained Models. In our implementation, we choose CLIP model with pretrained ViT-B/32 weights. DDPM is pretrained on CelebA-HQ dataset provided by [14]. We use facial attribute descriptions to achieve user-controllable facial mask removal and use the names of celebrities for identity-preserved facial mask removal. Training Details. We use T = 250 diffusion timesteps in our experiment and the resampling frequency λ is set to 10. We use Adam solver for optimization with a constant learning rate 3e−4 . Our method is implemented using PyTorch on a single NVIDIA Tesla V100 GPU.

Fig. 3. User controllable mask removal: A comparison with PaintByWord++ [2, 4], Local CLIP-guided diffusion [3], Blended diffusion [1]. All baselines fail to well preserve the unmasked region, in contrast to the results of MaskDiffuse.

5.2 Comparison and Evaluation

Qualitative Results. We conduct qualitative and quantitative experiments to validate the effectiveness of the proposed method on the task of text-guided face mask removal. For comparison, we choose recently proposed CLIP-based methods, including PaintByWord++ [2,4], Local CLIP-guided diffusion [3], and Blended diffusion [1]. The results of user-controllable face mask removal are shown in Fig. 3. We can see that although the local CLIP-guided diffusion model can generate high-quality images, the unmasked region is modified. Though blended


diffusion can preserve the unmasked region, it produces artifacts across the edge of the mask. In addition, the content generated by the two methods does not well match the text descriptions. By contrast, our MaskDiffuse is able to generate face inpainting consistent with the provided descriptions. Besides, we conduct a comparison between these methods on the task of identity-preserved face mask removal. As shown in Fig. 4, MaskDiffuse can better preserve the identity.

Fig. 4. Identity-preservation: A comparison with (1) PaintByWord++ [2, 4], (2) Local CLIP-guided diffusion [3], (3) Blended diffusion [1]. Our method can better preserve the identity.

User Study. In addition, we conduct a user study for a quantitative comparison. Participants are asked to rate (on a scale of 1–5) the results shown in Fig. 3 and Fig. 4, in terms of realism, identity preservation, and agreement with the text description. Table 1 shows that MaskDiffuse outperforms the three baselines in all these aspects.

Table 1. User study results. ↑ indicates that higher is better.

Method            | ID-preserve↑ | Text match↑ | Realism↑
PaintByWord++     | 1.38         | 1.82        | 1.28
Local CLIP GD     | 3.54         | 3.48        | 2.85
Blended diffusion | 2.73         | 2.55        | 2.43
MaskDiffuse       | 3.85         | 4.12        | 4.51

5.3 Mask Removal for Face Recognition

Removing the mask has important applications in face recognition, as the mask obscures important facial features such as the eyes, nose, and mouth. Our method can thus be used to improve the accuracy of face recognition for masked faces by removing the masks. To this end, we conduct a face verification experiment following [23], which extracts the features of the two face images and calculates their Euclidean distance. We follow the standard 1:1 verification setting: a positive pair consists of a masked face and a clean face of the same identity, while a negative pair contains a masked face and a clean face from different identities. We collect both masked and clean faces of 100 celebrities from the Internet and randomly generate 100 positive and 100 negative pairs. Features are extracted by FaceNet, and the distance between two faces is compared with a threshold to decide whether a pair is positive or negative. Figure 5 visualizes the ROC curve obtained by varying the threshold from 0.5 to 1.5. We also compare our method with Local CLIP Guided Diffusion [3], PaintByWord++ [2,4], and Blended Diffusion [1]. As shown in Fig. 5, although Blended Diffusion, PaintByWord++, and Local CLIP Guided Diffusion are able to remove the mask and generate a clean face, the identity is not well preserved, i.e., the verification performances of the three approaches are even worse than those obtained with the original masked faces. By contrast, MaskDiffuse can well preserve the identity when removing the mask, thus achieving the best verification accuracy. Furthermore, we compare the methods in terms of the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) in Table 2. It can be seen that our approach outperforms the others by a large margin.
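A sketch of this 1:1 verification protocol (our illustration; the FaceNet embeddings are assumed to be precomputed, and all function names are our own):

import numpy as np

def verify(feat_a, feat_b, threshold):
    # True -> the pair is predicted to be the same identity
    return np.linalg.norm(feat_a - feat_b) < threshold

def roc_points(pairs, labels, thresholds):
    # pairs: list of (feat_a, feat_b); labels: 1 for positive pairs, 0 for negative
    points = []
    labels_arr = np.array(labels, dtype=bool)
    for th in thresholds:
        preds = np.array([verify(a, b, th) for a, b in pairs])
        tpr = (preds & labels_arr).sum() / max(labels_arr.sum(), 1)
        fpr = (preds & ~labels_arr).sum() / max((~labels_arr).sum(), 1)
        points.append((fpr, tpr))
    return points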

Fig. 5. ROC curves for PaintByWord++ [2, 4], Local CLIP Guided Diffusion [3], Blended Diffusion [1], without mask removal and MaskDiffuse.


Table 2. Quantitative comparison for identity-preserved face mask removal. ↑ indicates that higher is better.

Method            | PSNR↑ | SSIM↑
PaintByWord++     | 9.76  | 0.43
Local CLIP GD     | 8.97  | 0.31
Blended diffusion | 15.67 | 0.71
MaskDiffuse       | 18.56 | 0.75

5.4 Ablation Study

In this section, we conduct ablation studies to validate the effectiveness of individual components in our method: multi-stage blending, CLIP guidance, and conditional resampling. The random seeds are fixed throughout the experiments. Figure 6 shows the visual comparison by adding each component. We observe that the visual quality of faces generated by our full model is much better than other ablated versions. For example, when only using the CLIP guidance, the unmasked regions are significantly changed, and the identity is not well preserved when only using multi-stage blending. In addition, without conditional resampling, unwanted artifacts across the boundaries could be generated.

Fig. 6. Identity-preserved mask removal for our model (1) with CLIP guidance only, (2) with multi-stage blending only, (3) with CLIP guidance and multi-stage blending, and (4) with CLIP guidance, multi-stage blending and conditional resampling.

6 Conclusion

In this paper, we present a novel denoising diffusion probabilistic model solution for text-guided face mask removal. By leveraging the knowledge learned from CLIP to guide the image generation process of diffusion models, we achieve user-controllable and identity-preserved face mask removal with high quality. Finally,


we demonstrate the application of our method to the tasks of interactive face editing and face recognition, and its superiority over baselines. Acknowledgement. This work was supported by the National Natural Science Foundation of China under Grant 62206180, 82261138629; Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515010688, 2020A1515111199 and 2022 A1515011018; Shenzhen Municipal Science and Technology Innovation Council under Grant JCYJ20220531101412030; and Swift Fund Fintech Funding.

References 1. Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208–18218 (2022) 2. Bau, D., et al.: Paint by word. arXiv preprint arXiv:2103.10951 (2021) 3. Crowson, K.: CLIP guided diffusion HQ 256x256 (2021). https://colab.research. google.com/drive/12a Wrfi2 gwwAuN3VvMTwVMz9TfqctNj 4. Crowson, K.: VQGAN+CLIP (2021). https://colab.research.google.com/drive/ 1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN 5. Crowson, K., et al.: VQGAN-CLIP: open domain image generation and editing with natural language guidance. In: Avidan, S., Brostow, G., Ciss´e, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022, Part XXXVII. LNCS, vol. 13697, pp. 88–105. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19836-6 6 6. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Adv. Neural. Inf. Process. Syst. 34, 8780–8794 (2021) 7. Din, N.U., Javed, K., et al.: A novel GAN-based network for unmasking of masked face. IEEE Access 8, 44276–44287 (2020) 8. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using Laplacian faces. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 328–340 (2005) 9. Hertz, A., Mokady, R., et al.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022) 10. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020) 11. Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 36(4), 1–14 (2017) 12. Li, K., Xing, J., Hu, W., Maybank, S.J.: D2C: deep cumulatively and comparatively learning for human age estimation. Pattern Recogn. 66, 95–105 (2017) 13. Lin, Q., Yan, B., Li, J., Tan, W.: MMFL: multimodal fusion learning for text-guided image inpainting. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1094–1102 (2020) 14. Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Repaint: inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471 (2022) 15. Meng, C., Song, Y., et al.: SDEdit: image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021) 16. Nazeri, K., Ng, E., Joseph, T., Qureshi, F.Z., Ebrahimi, M.: Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212 (2019)


17. Nichol, A., Dhariwal, P., et al.: Glide: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021) 18. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021) 19. Ramesh, A., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021) 20. Saharia, C., Chan, W., et al.: Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487 (2022) 21. Saharia, C., et al.: Palette: image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10 (2022) 22. Samaria, F.S., Harter, A.C.: Parameterisation of a stochastic model for human face identification. In: Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142. IEEE (1994) 23. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015) 24. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp. 2256–2265. PMLR (2015) 25. Song, Y., Sohl-Dickstein, J., et al.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020) 26. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57, 137–154 (2004) 27. Wang, Y., Tao, X., Qi, X., Shen, X., Jia, J.: Image inpainting via generative multicolumn convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 31 (2018) 28. Xiong, W., et al.: Foreground-aware image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5840– 5848 (2019) 29. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018) 30. Yang, Y., Guo, X., Ma, J., Ma, L., Ling, H.: LaFIn: generative landmark guided face inpainting. arXiv preprint arXiv:1911.11394 (2019) 31. Yu, J., et al.: Generative image inpainting with contextual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514 (2018) 32. Zhang, L., Chen, Q., Hu, B., Jiang, S.: Text-guided neural image inpainting. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1302– 1310 (2020)

Image Generation Based Intra-class Variance Smoothing for Fine-Grained Visual Classification

Zihan Yan1, Ruoyi Du1, Kongming Liang1, Tao Wei2, Wei Chen2, and Zhanyu Ma1(B)

1 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
[email protected]
2 Li Auto, Beijing, China

Abstract. Fine-grained visual classification (FGVC) is challenging because of the unsmooth intra-class data distribution caused by the combination of relatively significant intra-class variation and scarce training data. To this end, most works in FGVC focused on explicitly/implicitly enhancing the model representation ability. In this paper, however, we take a different stance – alleviating the unsmooth intraclass data distribution in FGVC datasets via data generation. In particular, we propose the following components for data augmentation: (i) SmoothGAN: an information-theoretic extension to the Generative Adversarial Network (GAN) that can generate high-quality fine-grained images with continuously varying intra-class differences. (ii) Dual-threshold-filtering: the generated data are selected according to both their reality and discriminability via SmoothGAN’s discriminator and a basic FGVC model. Experiments on popular FGVC datasets demonstrate that training with augmented data can significantly boost model performance in the FGVC task. The code is available at https://github.com/PRIS-CV/SmoothGAN.

Keywords: Fine-Grained Visual Classification · Image Generation · Data Augmentation

1 Introduction

Fine-grained visual classification (FGVC) has made significant strides in various industrial and research domains, demonstrating its success in applications such

This work was supported in part by Beijing Natural Science Foundation Project No. Z200002, in part by National Natural Science Foundation of China (NSFC) No. U19B2036, U22B2038, 62106022, 62225601, in part by Youth Innovative Research Team of BUPT No. 2023QNTD02, in part by scholarships from China Scholarship Council (CSC) under Grant CSC No. 202206470055, and in part by BUPT Excellent Ph.D. Students Foundation No. CX2022152.



Fig. 1. Real images vs. Images generated by SmoothGAN. (a) Real images have large intra-class variation. (b) SmoothGAN disentangles representations on a category of CUB-200-2011. Varying only two latent codes c1 from [−2, 2] and c2 from [−2, 0], other latent codes and noise remain unchanged.

as automatic biodiversity monitoring [32], intelligent retail [19], intelligent transportation [37], etc. FGVC involves the classification of fine-grained sub-categories within an object category [13]. In comparison to conventional classification tasks, FGVC exhibits a greater degree of intra-class variation and inter-class similarity [9]. Additionally, fine-grained datasets tend to have a small-scale nature due to the presence of rare categories and the high cost associated with annotation. Consequently, this results in an uneven intra-class distribution within FGVC datasets, posing challenges for extracting discriminative features even for stateof-the-art deep models. The majority of previous research in FGVC has primarily concentrated on improving the model’s representational capacity by employing techniques such as localization or segmentation to explicitly extract discriminative features [14,34,38,39]. Despite significant progress, limited or no attempts have been made to address the root cause of the problem, namely, the mitigation of the non-uniform intra-class distribution in FGVC datasets. In this regard, an intuitive solution involves interpolating the data points and smoothing the data distribution through data augmentation techniques. This paper aims to investigate FGVC from this particular perspective. Given the consistent advancements in generation models in recent years, such as Generative Adversarial Networks (GANs) [15] and Diffusion Models [18],etc, that excels in capturing the underlying distribution of real data and generating realistic samples. Hence, this paper specifically extends the framework of ProjectedGAN to achieve data interpolation. Nonetheless, generating samples directly using ProjectedGAN gives rise to two challenges. The first challenge is generating smooth data. As illustrated in Fig. 1 we aim to generate samples with continuous


variations in intra-class attributes to facilitate data augmentation. The second challenge pertains to the potential unreliability of synthetic samples, characterized by their lack of naturalness or low distinguishability. These unreliable samples can adversely impact the performance of the FGVC model [29]. To overcome these challenges, our approach comprises two components: (i) SmoothGAN, an information-theoretic extension of ProjectedGAN capable of generating fine-grained images with high-quality and disentangled classindependent features, and (ii) Dual-threshold-filtering, a data selection strategy that filters out unreliable synthetic samples during the data augmentation process performed by the generation model. Our main contribution lies in stating the optimization challenges caused by the non-smoothness of data and, for the first time, introducing data generation to achieve intra-class distribution smoothing for FGVC. Technically, building upon our motivation, we propose SmoothGAN – we employ a Conditional ProjectedGAN as the base model and incorporate the concept of mutual information constraint from InfoGAN to generate high-quality intra-class smooth-varying images. Additionally, we introduce the Dual-threshold-filtering mechanism to select high-quality images. Experiments on four widely used datasets demonstrate the effectiveness of the proposed method – the generated and filtered data can significantly boost model performance in the FGVC task.

2 Related Work

2.1 Fine-Grained Visual Classification

The success of deep learning has led to the development of numerous methods aimed at distinguishing between different fine-grained categories. Based on the level of annotation information utilized during training, FGVC methods relying on deep neural networks can be categorized into those based on strongly supervised information and those based on weakly supervised information. Strongly supervised information-based methods [2,7,34,38] necessitate not only the inclusion of category annotations in the dataset but also annotated frame or part annotation point information during the model training process. Owing to the exorbitant labeling costs associated with strongly supervised information, recognition methods that solely rely on weakly supervised labeling information using category annotations have garnered increased attention [3,4,12,16,36]. In FGVC, the identification and acquisition of discriminative local information play a crucial role. Therefore, [6,14,24,35,39] enhance the accuracy of fine-grained image classification by precisely localizing and learning discriminative feature regions. While these ideas have worked well, there is less or no effort invested in getting to the root of the problem - smoothing out the intra-class distribution in the FGVC dataset.

2.2 Image Generation

Generative Adversarial Networks (GANs) [15] have emerged as one of the most successful techniques for generating highly realistic images. Traditional genera-


tive adversarial network models are trained through an adversarial game played between two entities: the discriminator D(·) and the generator G(·). The discriminator is trained to discern between generated and real samples, while the generator is trained to deceive the discriminator. Image generation techniques have found extensive applications in both industrial and academic domains. These applications include image-to-image translation [41], face generation [20], image restoration [8], super-resolution image reconstruction [10], data augmentation [1], etc. As image generation techniques have evolved, it has become feasible to generate fine-grained images that contain distinctive features. Hence, this paper explores the generation of data for data augmentation using a generation model capable of producing high-quality fine-grained images that exhibit discernible features.


Fig. 2. Generated images of SmoothGAN on CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft datasets.

2.3 Data Augmentation

Data augmentation techniques enhance the training dataset by generating additional samples through transformations applied to real samples. Conventional data augmentation methods encompass geometric transformations such as rotation, scaling, white balancing, and image sharpening, which introduce geometric variations or distortions to the image [27]. The recent advancements in GAN have led to their application in data augmentation [1,40]. The majority of these studies employing generative models for data augmentation primarily aim to mitigate classifier overfitting. In our work, our objective is twofold: to alleviate overfitting by augmenting the sample size and to enhance the smoothness of the intra-class distribution within the Fine-grained Visual Classification (FGVC) dataset.

3 Methods

3.1 SmoothGAN

To address the issue of unsmooth intra-class distribution in the FGVC dataset through data augmentation, the generated images must meet the dual requirements of high quality and continuous intra-class variation. Considering the


impressive performance of ProjectedGAN in generating high-quality fine-grained images, we selected it as the base model. Figures 1 and 2 demonstrate that our generation model, SmoothGAN, built upon ProjectedGAN, is capable of generating high-quality images within the FGVC dataset. Furthermore, to enhance the smoothness of the intra-class data distribution and generate samples with continuously varying intra-class differences, we drew inspiration from [5] and developed SmoothGAN, which disentangles the class-independent features of each image class. In contrast to traditional GANs, ProjectedGAN incorporates a collection of feature projectors K_n(·) that map real and generated images to the discriminator's input space, along with a set of independent discriminators D_n(·) that operate on distinct feature projections. The feature projectors K_n(·) are implemented using a feature extractor that produces multi-scale outputs. The ProjectedGAN training objective V(·,·) can thus be formulated as follows:

\min_{G}\max_{\{D_n\}} V(\{D_n\}, G) = \sum_{n \in N} \mathbb{E}_x[\log D_n(K_n(x))] + \mathbb{E}_z[\log(1 - D_n(K_n(G(z))))],   (1)

where x is a real sample from the real distribution P_x and z is a latent vector sampled from a simple distribution P_z. To control the category of the generated images, we need to introduce the image label y as the condition. Hence, Conditional-ProjectedGAN training can be formulated as follows:

\min_{G}\max_{\{D_n\}} V(\{D_n\}, G) = \sum_{n \in N} \mathbb{E}_x[\log D_n(K_n(x|y))] + \mathbb{E}_z[\log(1 - D_n(K_n(G(z|y))))].   (2)

In SmoothGAN, we introduce an auxiliary distribution Q for each multi-scale discriminator of ProjectedGAN. Hence, SmoothGAN training can be formulated as follows:

\min_{G,\{Q_n\}}\max_{\{D_n\}} V(\{D_n\}, G, \{Q_n\}) = V(\{D_n\}, G) - \sum_{n \in N} \lambda L_I(G, Q_n),   (3)

where L_I(G, Q_n) is a variational lower bound on the mutual information I(c; G(z, c)) between the latent code c and the generated distribution G(z, c). With Monte Carlo simulation, L_I(G, Q) is easy to approximate; in particular, L_I can be maximized with respect to Q directly and with respect to G(·) via the reparameterization trick [5]. Equation 3 extends the InfoGAN formulation [5], mainly by incorporating multiple auxiliary distributions Q so as to be compatible with ProjectedGAN.
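For illustration, a minimal PyTorch sketch of such an auxiliary head Q_n and the corresponding mutual-information term is given below; the Gaussian parameterisation of the continuous codes and all names are our assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class QHead(nn.Module):
    def __init__(self, feat_channels, code_dim):
        super().__init__()
        # predicts mean and log-variance of a factored Gaussian over the codes c
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(feat_channels, 2 * code_dim))

    def forward(self, feat):
        mu, log_var = self.net(feat).chunk(2, dim=1)
        return mu, log_var

def mutual_info_loss(mu, log_var, c):
    # negative Gaussian log-likelihood of the true codes c under Q's prediction
    # (minimising this maximises the variational lower bound L_I up to a constant)
    return 0.5 * (log_var + (c - mu) ** 2 / log_var.exp()).sum(dim=1).mean()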

3.2 Dual-Threshold-Filtering Strategy

As GAN sampling involves randomness, some generated samples may be unreliable and negatively impact the classifier. To address this issue, we propose a dual-threshold-filtering strategy.
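A hedged sketch of how such a filter could look, based only on the description given earlier (reality judged by SmoothGAN's discriminator, discriminability by a baseline FGVC classifier); the thresholds tau_real and tau_cls and all function names are our assumptions.

import torch

@torch.no_grad()
def keep_sample(x, y, discriminator, classifier, tau_real, tau_cls):
    # keep a generated image only if it looks real enough and the baseline
    # classifier is confident it belongs to the intended class y
    realness = torch.sigmoid(discriminator(x)).mean()      # reality score
    prob = torch.softmax(classifier(x), dim=1)[0, y]        # discriminability score
    return bool(realness > tau_real) and bool(prob > tau_cls)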

5: while li > 0 do
6:   lossRateByLayer ← hl · (1 − (1 − hl)^(li+1))
7:   while loc in DiffLocs[li] do
8:     cNeedHeat ← heatMap[loc][needHeat]
9:     if heatInfo[loc] has heatMap[parents] then
10:      pNeedHeat ← heatMap[parents][needHeat]
11:      pTempHeat ← heatMap[parents][tempHeat]
12:      pTempHeat ← cNeedHeat · (1 − cNeedHeat · lossRateByLayer) + pTempHeat
13:      pNeedHeat ← pNeedHeat + pTempHeat · hl
14:    else
15:      heatMap[parents][needHeat] ← cNeedHeat · hl
16:      heatMap[parents][tempHeat] ← cNeedHeat · (1 − cNeedHeat · lossRateByLayer)
17:    end if
18:   end while
19: end while
20: return heatMap
21: end function

Heat Propagation. Heat initialization takes place after reading and clipping the data to obtain the initial dynamic graph distribution. The initial distribution is assigned values based on specific rules, such as higher values in the center, lower values at the edges, or fixed constant values. However, relying solely on



the initial distribution is insufficient to effectively guide the model, as it merely introduces prior knowledge that requires the expertise of senior engineers. After initializing the heat, we further introduce a heat propagation mechanism as illustrated in Algorithm 1. During the training process, we obtain pseudo-labels generated by the teacher model as well as the predicted sigmoid probabilities. The pseudo-labels are used to calculate the positional differences with the ground truth, thereby obtaining the necessary conditions for updates, namely, positional differences, sigmoid probabilities, initial heat, and effective label layers. Inspired by the concept of energy loss in heat transfer, we propagate the heat from the center of the labels with a loss rate of hl. To optimize computational resources, we impose a constraint that lower-level nodes only receive heat from a single corresponding upper-level node. The heat propagation process consists of two steps: heat demand aggregation and heat allocation. Aggregation: As outlined in lines 5–19 of Algorithm 1, heat demand aggregation gradually propagates the demand from the lower-level nodes to the upperlevel nodes. The upper-level nodes calculate the discounted heat loss caused by the heat transfer to their respective layers. Allocation: Based on the results of heat demand aggregation, the heat is allocated from the center of the labels to each positional difference. It is worth noting that the loss transferred to the lower-level nodes is absorbed by the upperlevel nodes along the way, ensuring that all nodes along the path receive attention from the model, thereby improving the smoothness of the labels.

4 Experiments

4.1 Dataset Description

We evaluate our method using two datasets. The first dataset is DustProj, an industrial dust dataset comprising 1343 images for training, 199 images for validation, and 261 images for testing. Given the harsh conditions of industrial applications and the non-professional background of the annotators, the training and validation sets were annotated by 11 non-professionals. For the test set, professionals annotated the data after the initial 11 annotators completed their annotations. The second dataset we employ is the DSS Smoke Segmentation dataset, a synthetic dataset focusing on smoke. It contains a total of 73632 images, with 70632 for training and 3000 for testing; the test images are further divided into three groups: DS01, DS02, and DS03.

4.2 Implementation Details

Our method employs the Deep Dual-resolution networks (DDRNet) [4] as the primary test model. To evaluate its performance, we compare it with other well-known semantic segmentation network models, namely UNet++ [22], PSPNet, Segformer, and Deeplabv3. Among these models, we incorporate dynamic


loss weighted learning of heat propagation using DDRNet and Segformer, which exhibit exceptional performance in our results. During the training process, we apply various data augmentations such as color gamut adjustment, clipping, and blurring, along with image standardization. For DDRNet, UNet++, and PSPNet, we utilize the SGD optimizer with an initial learning rate of 0.01, gradually decreasing it over time. For the Segformer model, we employ the Adam optimizer with an initial learning rate of 0.003, which decays to 0.01 of its original value over the course of training.

4.3 Experimental Results

To evaluate the performance of our model, we utilize the mean Intersection over Union (mIoU) as the general semantic segmentation accuracy measure. Furthermore, we also compare the model's parameters, model size, and Frames Per Second (FPS) to assess its efficiency. These indicators serve as valuable references for selecting a model suitable for industrial environments. In the following sections, we provide a detailed comparison between the two datasets, highlighting the achievements obtained through our approach.

Evaluation on DustProj. The DustProj dataset is specifically designed for dust analysis in industrial environments. As a field-applied dataset, it exhibits certain limitations, including the challenges posed by harsh industrial conditions and the presence of flawed data. Nevertheless, these limitations provide valuable insights into the real-world application of machine learning algorithms in industrial settings. In Table 1, we present quantitative results comparing our approach with several popular models such as UNet++, Deeplabv3, and others. The experimental results reveal significant variations in data accuracy among the different models, with PSPNet and DDRNet exhibiting contrasting performance. Leveraging the feature extraction capabilities of pre-trained ResNet, both PSPNet and Deeplabv3 achieve notable prediction accuracy on this dataset. However, DDRNet, lacking pre-training and dual-branch guidance, struggles to adapt to the complexities of this industrial environment, resulting in an mIoU of only 53%.

Table 1. Comparison of all the methods in terms of mIoU [%] on DustProj

Method         | mIoU (%) DustProj | Parameters (M) | Model size (MB) | Image size (pixel) | FPS
PSPNet [21]    | 59.1 | 35.7 | 144  | (512,512) | 37.4
UNet++ [22]    | 58.0 | 8.7  | 35.8 | (512,512) | 32.2
Deeplabv3 [1]  | 58.1 | 57.2 | 228  | (512,512) | 28.2
Segformer [15] | 59.2 | 42.5 | 171  | (512,512) | 41.4
DDRNet [4]     | 53.0 | 30.2 | 122  | (512,512) | 71.8
DDRNet+HP      | 60.4 | 30.2 | 122  | (512,512) | 71.8
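Both tables report mIoU; as a concrete reference for how this metric is typically computed, here is a minimal sketch over assumed integer label maps, not the evaluation code actually used in the paper.

```python
# Minimal mIoU sketch: per-class intersection-over-union averaged over the classes
# present in either the prediction or the ground truth (assumed integer label maps).
import numpy as np

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:          # class absent in both maps; skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Usage example: mean_iou(pred_map, gt_map, num_classes=2) for a binary dust/background task.
```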


In contrast, our method incorporates a teacher model and a heat propagation guidance model without altering the DDRNet architecture. By leveraging attention learning techniques, we successfully guide and unlock the fitting ability of the DDRNet model, leading to the highest mIoU accuracy.

Table 2. Comparison of all the methods in terms of mIoU [%] on DSS

Method                  | mIoU DS01 | DS02 | DS03 | Parameters (M) | Model size (MB) | FPS
ERFNet [11]             | 69.9 | 67.9 | 68.7 | 2.06 | 15.8 | 60.5
LEDNet [12]             | 69.0 | 67.8 | 68.5 | 0.91 | 7.18 | 58.9
DFANet [9]              | 63.2 | 59.4 | 61.8 | 2.18 | 16.9 | 32.4
CGNet [14]              | 68.6 | 65.5 | 67.2 | 0.49 | 3.94 | 53.0
DSS [19]                | 71.0 | 71.0 | 69.8 | 29.9 | -    | 32.5
Frizzi [3]              | 70.4 | 70.0 | 70.7 | 57.0 | -    | -
W-Net [18]              | 73.1 | 74.0 | 73.4 | 31.1 | 127  | 60.4
Light-weighted Net [17] | 74.2 | 72.5 | 72.8 | 0.88 | 6.88 | 68.8
Segformer [15]          | 74.8 | 75.4 | 75.1 | 42.5 | 171  | 43.7
DDRNet [4]              | 75.0 | 74.2 | 73.1 | 30.2 | 122  | 76.2
Segformer+HP            | 78.0 | 77.7 | 78.1 | 42.5 | 171  | 43.7
DDRNet+HP               | 79.6 | 79.4 | 79.4 | 30.2 | 122  | 76.2

4.4 Evaluation on DSS

The DSS dataset is a publicly available smoke synthesis dataset that offers a large quantity of high-quality data. It combines smoke with various background images using synthesis technology, providing a comprehensive evaluation of air particle segmentation accuracy. Table 2 presents a comparison of our proposed method’s mIoU, FPS, and other indicators on the DSS dataset. DDRNet demonstrates good performance on this dataset, but we don’t stop there. By introducing our dynamic heat spread weighting scheme, we can further enhance its performance, achieving the best mIoU. It’s worth noting that we also apply the heat spread weighting scheme to the Segformer model, effectively improving its accuracy on this dataset. This further validates the effectiveness of our proposed method in guiding model learning. In summary, our method effectively guides the model during training, avoiding overfitting caused by data fitting preferences and enhancing the model’s focus on feature extraction from the data.

5 Conclusion

This paper presents a novel training framework that utilizes dynamic weight assignment to enhance model performance. The main contribution of this method is its ability to leverage a teacher model for comprehensive guidance during student model training, leading to improved fitting and generalization capabilities. However, it’s important to note that the introduction of pixel-level dynamic weight allocation may increase CPU computing time, potentially affecting GPU utilization. Additionally, our current approach primarily focuses on weight allocation for instance targets and doesn’t fully address sample proportion differences between targets. Future work will aim to address the challenge of dynamic weight distribution among targets more effectively. The goal is to unlock the potential of underestimated models and revitalize their performance.

References 1. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Convolution for Semantic Image Segmentation (2017). arXiv:1706.05587 2. Du, G., et al.: Graph-based class-imbalance learning with label enhancement. IEEE Trans. Neural Netw. Learn. Syst. 1–15 (2021). https://doi.org/10.1109/TNNLS. 2021.3133262 3. Frizzi, S., Bouchouicha, M., Ginoux, J.M., Moreau, E., Sayadi, M.: Convolutional neural network for smoke and fire semantic segmentation. IET Image Proc. 15(3), 634–647 (2021). https://doi.org/10.1049/ipr2.12046 4. Hong, Y., Pan, H., Sun, W., Jia, Y.: Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes (2021). arXiv:2101.06085 5. Hoyer, L., Dai, D., Van Gool, L.: DAFormer: improving network architectures and training strategies for domain-adaptive semantic segmentation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 9914–9925. IEEE (2022). https://doi.org/10.1109/ CVPR52688.2022.00969 6. Jiang, M., Huang, W., Huang, Z., Yen, G.G.: Integration of global and local metrics for domain adaptation learning via dimensionality reduction. IEEE Trans. Cybern. 47(1), 38–51 (2017). https://doi.org/10.1109/TCYB.2015.2502483 7. Jiang, M., Wang, Z., Guo, S., Gao, X., Tan, K.C.: Individual-based transfer learning for dynamic multiobjective optimization. IEEE Trans. Cybern. 51(10), 4968– 4981 (2021). https://doi.org/10.1109/TCYB.2020.3017049 8. Jiang, M., Wang, Z., Hong, H., Yen, G.G.: Knee point-based imbalanced transfer learning for dynamic multiobjective optimization. IEEE Trans. Evol. Comput. 25(1), 117–129 (2021). https://doi.org/10.1109/TEVC.2020.3004027 9. Li, H., Xiong, P., Fan, H., Sun, J.: DFANet: deep feature aggregation for real-time semantic segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 9514–9523. IEEE (2019). https://doi.org/10.1109/CVPR.2019.00975 10. Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic Segmentation using Adversarial Networks (2016). arXiv:1611.08408 ´ 11. Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 19(1), 263–272 (2018). https://doi.org/10.1109/TITS.2017.2750080


12. Wang, Y., et al.: LEDNet: a lightweight encoder-decoder network for real-time semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1860–1864 (2019). https://doi.org/10.1109/ICIP.2019.8803154. ISSN 2381-8549 13. Wang, Z., Cao, L., Lin, W., Jiang, M., Tan, K.C.: Robust graph meta-learning via manifold calibration with proxy subgraphs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 15224–15232 (2023). https://doi.org/10. 1609/aaai.v37i12.26776 14. Wu, T., Tang, S., Zhang, R., Cao, J., Zhang, Y.: CGNet: a light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 30, 1169– 1179 (2021). https://doi.org/10.1109/TIP.2020.3042065 15. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090. Curran Associates, Inc. (2021) 16. Ye, C., Jiang, M., Luo, Z.: Smoke segmentation based on weakly supervised semantic segmentation. In: 2022 12th International Conference on Information Technology in Medicine and Education (ITME), pp. 348–352 (2022). https://doi.org/10. 1109/ITME56794.2022.00082 17. Yuan, F., Li, K., Wang, C., Fang, Z.: A lightweight network for smoke semantic segmentation. Pattern Recogn. 137, 109289 (2023). https://doi.org/10.1016/j. patcog.2022.109289 18. Yuan, F., Zhang, L., Xia, X., Huang, Q., Li, X.: A wave-shaped deep neural network for smoke density estimation. IEEE Trans. Image Process. 29, 2301–2313 (2020). https://doi.org/10.1109/TIP.2019.2946126 19. Yuan, F., Zhang, L., Xia, X., Wan, B., Huang, Q., Li, X.: Deep smoke segmentation. Neurocomputing 357, 248–260 (2019). https://doi.org/10.1016/j.neucom.2019.05. 011 20. Zhang, F., Pan, T., Wang, B.: Semi-supervised object detection with adaptive classrebalancing self-training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 3252–3261 (2022). https://doi.org/10.1609/aaai. v36i3.20234 21. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 6230–6239. IEEE (2017). https://doi.org/10.1109/CVPR.2017. 660 22. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: UNet++: a nested U-Net architecture for medical image segmentation. In: Stoyanov, D., et al. (eds.) DLMIA/ML-CDS -2018. LNCS, vol. 11045, pp. 3–11. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00889-5 1

EKGRL: Entity-Based Knowledge Graph Representation Learning for Fact-Based Visual Question Answering

Yongjian Ren^{1,2}, Xiaotang Chen^{1,2,3}, and Kaiqi Huang^{1,2,3}(B)

1 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
2 Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Beijing, China
[email protected], {xtchen,kaiqi.huang}@nlpr.ia.ac.cn

Abstract. Fact-based Visual Question Answering (FVQA) is a task that aims at answering questions based on a given image and the external knowledge associated with it. The reasoning abilities of current FVQA models, including query-based and joint learning methods, are insufficient. To achieve stronger reasoning ability, we propose an entity-based knowledge graph representation learning (EKGRL) method. Our model achieves state-of-the-art performance on the FVQA dataset. Furthermore, we build a psychological fact-based VQA dataset (PFVQA) containing 6129 questions of six different types, which is, as far as we know, the first VQA dataset built on psychological knowledge. We demonstrate that EKGRL continues to achieve state-of-the-art performance on PFVQA, showing its ability to maintain good performance on reasoning and knowledge representation based on external knowledge from both the commonsense and psychological domains.

Keywords: FVQA · Representation Learning · Psychological Knowledge

1 Introduction

Visual Question Answering (VQA) is proposed to examine the perceptual and reasoning ability of AI systems, which uses an image and corresponding natural language questions as input and outputs the answers automatically. At present, it is widely used in medical treatment, advertising and visual assistance for blind people. In traditional VQA work [1,2,9,10,12,17,24], researchers expect systems to answer a broad range of vision questions like humans do. Therefore, most of the visual questions generated by humans or machines are beyond the visual information of the pictures themselves and need external knowledge as support. The research of Fact-based VQA tasks [4,14,15,18,20,21,25] aims to


Fig. 1. The main idea of reasoning process. The orange arrows represent the reasoning steps, where 1 and 2 indicate that the scope of the answer is limited to the entity “dog” and the entity “protect” by reasoning, and 3 indicates that the answer entity is further inferred to be “dog”.

combine structured external facts to answer visual questions, which is committed to solving the problem of information loss inherent in traditional VQA work. FVQA work can be roughly divided into two classes: query-based and joint learning methods. For query-based methods [14,15,18,20,21], supporting facts are rarely used for reasoning. The step of filtering numerous facts is inefficient, and the pipeline working mode leads to the accumulation of errors. For joint learning methods [4,25], although the knowledge space is embedded in the reasoning process, it is still independent of the answer space, resulting in a lack of reasoning ability. In conclusion, the existing FVQA methods do not fully represent the knowledge space for reasoning, resulting in insufficient reasoning ability. In this paper, we propose an entity-based knowledge graph representation learning (EKGRL) method to learn the mapping of concepts in the knowledge space. In the EKGRL model, we perform representation learning in the entity space to express the features of the knowledge graph, and a gating mechanism is used to further infer the specific answer from the concepts. The main idea of the reasoning process is shown in Fig. 1. Furthermore, we build a psychological fact-based VQA (PFVQA) dataset based on professional psychological facts, containing 2190 images and 6129 psychological question and answer (Q&A) pairs. PFVQA requires models to answer questions about the psychological meanings of objects in an image, which is both a practical and a challenging task, while current FVQA datasets are mainly built on commonsense knowledge. On one hand, some psychological assessment and psychotherapy methods, such as sandplay therapy, require diagnosing the mental state of individuals based on external psychological knowledge and visual


scenes created by subjects. On the other hand, the task of "answering the psychological meaning of objects in an image" demands that models perform multi-hop reasoning on visual questions, relying on both general and psychological external knowledge. This task tests the perceptual and reasoning abilities of AI models in understanding visual scenes. In general, our contributions are as follows:

– We introduce a novel model, EKGRL, that can answer visual questions more efficiently. Performing representation learning on the knowledge graph along with a gating mechanism gives EKGRL strong reasoning and representation ability on the knowledge graph.
– We build a new dataset, PFVQA, for verifying the representation ability of VQA methods on the knowledge space. As an extension of the FVQA dataset, PFVQA adds 2,150 psychological Q&A pairs and 1,247 psychological facts, introducing knowledge of the psychological field into FVQA.
– Quantitatively, we demonstrate the effectiveness of our method on the FVQA and PFVQA datasets, outperforming state-of-the-art methods by 6% on Top-1 accuracy.

2 Related Works

Visual Question Answering (VQA). The VQA task [12] is first proposed for a deeper understanding of visual scenes. Since then, new fusion methods [9,17,24] of image features and question features have been proposed. Existing fusion models [5,19,23] have been able to fully exploit the contextual information between objects in images. In particular, the attention mechanism methods based on region proposals have significantly improved the performance of visual question answering, such as the Bottom-Up and Top-Down Attention VQA [1] and the Bilinear Attention Networks [10]. At the same time, researchers propose a small-scale dataset DAQUAR [12] for indoor complex scenes, a VQA dataset [2] aiming at open and free question answering, a VQA v2.0 dataset [6] and a COCO-QA dataset [17] with more evenly distributed answers. However, in these traditional work and dataset, most questions are too open-ended, especially in the VQA dataset [2] and the DAQUAR dataset [12]. Fact-Based Visual Question Answering (FVQA). The early method [22] retrieves the relevant unstructured description information from the external knowledge base DBpedia according to the visual attributes. Another work [13] uses ArticleNet model to find the most possible answer from external text, which is obtained by retrieving Wikipedia. However, there is a lot of noise in the unstructured external knowledge text, which introduces errors. VQA work based on structured external knowledge mainly refer to the FVQA work, where the external knowledge consists of facts. On one hand, early FVQA methods were query-based methods [14,15,18,20], in which answers were filtered out step by step. Therefore, the reasoning ability of such methods is relatively


Fig. 2. Overview of our framework. EKGRL model consists of three parts: multimodal feature fusion, entity representation learning and similarity matching. Firstly, we use a CNN module (ResNet [7]) to extract features from images and a LSTM [8] module to extract features from questions. Secondly, we use a fusion module fe to fuse image features and question features, learning the embedding representation to entity space. Meanwhile, we use another fusion module fr to learn the embedding representation to relation space. Finally, we use a gating mechanism ¬σ and another LSTM module to learn the position of answer entity in the supporting fact triplet.

weak because the external knowledge is not fully utilized. Furthermore, the inherent pipeline working mode in such methods leads to the accumulation of errors [4]. On the other hand, joint learning approaches [4,25] aim to enhance the representation of knowledge spaces. Although some methods use graph neural networks to represent the knowledge space and improve the accuracy of the answer [25], the reasoning ability is limited due to the independence of the knowledge space and the answer space. Another method maps image-question features to answer space, relation space, and knowledge space through three fusion models [4], but the answer space and external knowledge space are still independent. And the external knowledge is only used as a mask to filter candidate answers by affecting the similarity. In general, due to the insufficient representation of the external knowledge space and the independence of the external knowledge and answer space, the joint learning methods also lacks sufficient reasoning ability.

3 Methods

3.1 Main Idea

Our model is inspired by the process by which humans answer visual questions. When given a question related to an image, such as the example in Fig. 1, people can quickly think of entities related to the question. Next, if the image in Fig. 1 is


provided, people will point out the position of "dog" in the image and answer. Inspired by this process, we regard all concepts in the VQA dataset as entities in the knowledge space, including concepts in images (goals, behaviors, scenes) and implicit or explicit concepts in questions and answers. To avoid error accumulation, we give up the pipeline working mode and use a series of subspaces to characterize the features of different modalities. Our EKGRL model is established based on three parts as shown in Fig. 2: multimodal feature fusion, entity representation learning and similarity matching.

3.2 Preliminary

Different from the open-answer VQA setting defined in OK-VQA [13], we define the answer set A = {a | (a, r, e) or (e, r, a), r ∈ R_S and e ∈ E_S}, where R_S and E_S respectively represent the relation set and entity set in the supporting fact set KB_S. The supporting fact set is defined as KB_S = {f_1, f_2, ..., f_n}.

3.3 Representation Learning in Entity Space

The introduction of representation learning on the knowledge graph enhances the representational and reasoning abilities of EKGRL. Our entity representation learning module is inspired by TransE [3], where triplets in the knowledge graph are expressed as in Formula 1:

h + r = t,    (1)

where h is the head entity, r is the relation and t is the tail entity of the triplet. In our EKGRL model, the answer entity embedding is obtained by Formula 2:

A(i_n, q_n) = F_{θe}(i_n, q_n),                      if H(q_n) = 1
A(i_n, q_n) = F_{θe}(i_n, q_n) + F_{θr}(i_n, q_n),   if H(q_n) = 0    (2)

where H(q_n) judges whether the answer entity is the head entity or the tail entity, and F_{θe} and F_{θr} learn the mapping from the fused features of the image i_n and the question q_n to the entity space and the relation space, respectively. During the actual training process, Formula 2 is expressed as Formula 3:

A(i_n, q_n) = F_{θe}(i_n, q_n) + F_{θr}(i_n, q_n) · (1 − H(q_n)),    (3)

where H(q_n) = 1 indicates that the answer is the head entity and H(q_n) = 0 indicates that the answer is the tail entity, according to Formula 1. H(q_n) is calculated by Formula 4:

H(q_n) = σ(LSTM(q_n)).    (4)

Based on the probabilistic model of compatibility (PMC) method defined in [4], we obtain the prediction probability of the answer by Formula 5:

P(a | i_n, q_n) = exp(A(i_n, q_n) · E(a) / τ) / Σ_{a′∈A} exp(A(i_n, q_n) · E(a′) / τ),    (5)


where the loss temperature τ benefits the optimization process, and E(a) is the entity embedding representation pretrained on the external knowledge base. We maximize the probability of the PMC model; our loss function is defined as:

l_α = − Σ_{n=1}^{N} Σ_{b∈A} α(a, b) log P(b | i_n, q_n),    (6)

where α(a, b) is a binary classification weight function. When the answer a is predicted correctly, that is, when a = b, the value of α(a, b) is 1; otherwise it is 0. Our EKGRL model finally gets the answer through the following formula:

â = argmax_{a∈A} A(i_n, q_n) · E(a).    (7)
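To make Formulas 3–5 and 7 concrete, here is a minimal NumPy sketch under assumed shapes; it is an illustration, not the authors' implementation, and f_e, f_r, h_gate and entity_table are hypothetical names for the entity-space mapping, the relation-space mapping, the LSTM gate output and the pretrained entity embeddings.

```python
# Hedged sketch of the gated answer embedding (Formula 3) and PMC scoring (Formulas 5, 7).
import numpy as np

def answer_embedding(f_e, f_r, h_gate):
    # Formula 3: the relation term is switched off when the answer is the head entity.
    return f_e + f_r * (1.0 - h_gate)

def predict(f_e, f_r, h_gate, entity_table, tau=0.1):
    # entity_table: (num_answers, D) pretrained KG embeddings E(a).
    a_vec = answer_embedding(f_e, f_r, h_gate)
    logits = entity_table @ a_vec / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # PMC probabilities, Formula 5
    return probs, int(np.argmax(logits))      # arg-max answer, Formula 7
```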

For the input questions, we use pretrained GloVe word embeddings [16] to represent each word, while for the answer vocabulary, we use the pretrained knowledge graph to express the mapping E_e(a).
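With the 0/1 weight α(a, b), the objective in Formula 6 reduces to the negative log-probability of the ground-truth answer under the PMC distribution; the following is a hedged batch-level sketch under assumed array shapes, not the released training code.

```python
# Minimal sketch of the training loss in Formula 6 as a cross-entropy over the answer set.
import numpy as np

def pmc_loss(answer_vecs, entity_table, gt_indices, tau=0.1):
    # answer_vecs: (B, D) gated embeddings A(i_n, q_n); entity_table: (num_answers, D);
    # gt_indices: (B,) indices of the ground-truth answers.
    logits = answer_vecs @ entity_table.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(gt_indices)), gt_indices].mean()
```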

4 PFVQA Dataset

As mentioned in previous sections, almost all existing FVQA datasets are built on world knowledge or common knowledge, enabling the corresponding FVQA models to answer commonsense-based questions. Psychological visual question answering (PFVQA) requires models to answer questions related to the psychological meaning of objects in an image, which is both a practical and a challenging task requiring the model's perceptual and reasoning abilities in understanding visual scenes. Based on unstructured psychological knowledge, we build the PFVQA dataset as an expansion of FVQA, containing 2190 images and 6129 question-answer pairs in total.

4.1 Image and Psychological Facts

Our PFVQA dataset uses the images in FVQA, with a total of 2190 images and 326 types of objects. Psychological experts provide us with unstructured psychological symbol description texts, covering 419 kinds of common artificial and natural objects and the corresponding psychological mapping descriptions. We summarize 1247 psychological facts from these texts and store them in the form of RDF triples as the psychology knowledge base. Every fact takes the form (goal, SymbolOf, psychological symbolic concept). The psychology knowledge base contains 333 kinds of objects as head entities and 606 kinds of different psychological concepts as tail entities.

4.2 Construction of Q&A Set and Dataset Statistics

Psychological question and answer pairs of PFVQA are generated by three annotators. After being provided with images and psychological support facts, the


Table 1. Examples and number of questions for each type.

Type       | Example                                                                               | Number
Property   | What does the vehicle in this picture usually symbolize?                              | 895
Location   | In this picture, where is the location of the object symbolizing peace?               | 853
Similarity | Which animal is closer to the one in the image in symbolic meaning, elephant or wolf? | 902
Object     | What animal pictured in this image is the symbol of loyalty?                          | 1455
Category   | Which category does the object in the image belong to?                                | 976
Valence    | Does the red symbol typically symbolize a positive, negative, or neutral sentiment?   | 1048

Fig. 3. Statistics of the different types of questions in our PFVQA dataset. The proportions of the six question types are presented in the chart.

annotators are asked to generate Q&A pairs that belong to one of the following six types: property, location, similarity, object, category and valence. To ensure balance and avoid bias across the different types in the PFVQA dataset, we instruct the annotators to generate question-answer pairs that fall into one of the six specified types. We also filtered out questions that exhibited clear scene bias and those that could be answered without the external psychological knowledge. The PFVQA dataset contains 2,190 images and 6129 psychological Q&A pairs, with an average question length of 14.4. The example, number and proportion of each Q&A type are shown in Table 1 and Fig. 3. We also compare the PFVQA dataset with various VQA datasets in Table 2, based on the number of images and questions and the source of external knowledge.

5 Results

To evaluate the reasoning ability and representational ability on knowledge graph of EKGRL model, we validate our method on FVQA and PFVQA datasets. Section 5.1 and 5.2 show the overall results of our model on FVQA and PFVQA


datasets respectively, proving the reasoning ability and representational ability of EKGRL on the knowledge graph. Section 5.3 gives the ablation study of EKGRL, which evaluates the impact of different components. Figure 4 shows the qualitative results of EKGRL on the FVQA and PFVQA datasets.

Table 2. Comparison of several major VQA datasets. The last three columns represent the characteristics of the external knowledge base.

Dataset       | #Images | #Questions | Image source         | Support Fact | Common Sense | Psychology knowledge
DAQUAR        | 1,449   | 12,468     | NYU-Depth            | -            | -            | -
COCO-QA       | 69,172  | 117,684    | COCO                 | -            | -            | -
VQA-abstract  | 50,000  | 150,000    | Clipart              | -            | -            | -
VQA           | 204,721 | 614,163    | COCO                 | -            | -            | -
Visual 7w     | 47,300  | 327,939    | COCO                 | -            | -            | -
Visual Genome | 108,000 | 1,445,322  | COCO                 | -            | -            | -
VQA-2         | 204,721 | 1,105,904  | COCO                 | -            | -            | -
TallyQA       | 165,443 | 287,907    | Visual Genome + COCO | -            | -            | -
KB-VQA        | 700     | 2,402      | COCO+ImgNet          | -            | ✓            | -
KVQA          | 24,602  | 183,007    | Wikipedia            | -            | -            | -
FVQA          | 2,190   | 5,826      | COCO+ImgNet          | ✓            | ✓            | -
Ours (PFVQA)  | 2,190   | 6,129      | COCO+ImgNet          | ✓            | ✓            | ✓

5.1 Overall Results on FVQA

To evaluate the performance of different models on the FVQA dataset, we use Top-1, Top-3 and Top-10 accuracy in our experiments, where Top-K indicates that the ground-truth answer ranks within the top K answers provided by the model. We use the average accuracy over the 5 test splits as the overall accuracy. Table 3 shows the performance of EKGRL and other VQA models on the FVQA dataset. The methods used for comparison include Hie-Question+Image+Pre-VQA [11], FVQA (top-3-QQmaping) [20], FVQA (Ensemble) [20], Straight to the Facts [15], Out of the Box [14], Multi-Layer Cross-Modal Knowledge Reasoning [25] and Zero-shot FVQA [4]. These comparison methods represent query-based FVQA models and joint learning models, respectively. Among all these FVQA methods, our EKGRL model achieves the best performance, with a 2.23% boost on Top-1 accuracy and a 1.27% boost on Top-3 accuracy. The Mucko model shows the smallest accuracy gap because it performs joint reasoning using a modality-aware heterogeneous graph convolutional network. Compared with joint learning models, our method explicitly reflects the reasoning process from the image-question pair to external knowledge, making the reasoning and associative process clearer and more interpretable.
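As a small illustration of the Top-K metric used throughout this section, assuming one row of answer scores per question (this is not the evaluation script itself):

```python
# Minimal Top-K accuracy sketch: a prediction counts as correct when the ground-truth
# answer index appears among the K highest-scored answers for its question.
import numpy as np

def top_k_accuracy(scores, gt_indices, k=3):
    # scores: (B, num_answers); gt_indices: (B,)
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == gt_indices[:, None]).any(axis=1)
    return float(hits.mean())
```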

5.2 Overall Results on PFVQA

The PFVQA dataset includes 2190 images and 6129 psychological Q&A pairs. We use Top-1, Top-3 and Top-10 accuracy as metrics. Table 3 shows the performance of EKGRL and other VQA models on the PFVQA dataset.


Table 3. Performance of different methods on the FVQA (left) and PFVQA (right) datasets. Our method performs the best on Top-1, Top-3 and Top-10 accuracy.

Table 4. Ablation study of key components of EKGRL.

Method                      | Top-1 | Top-3 | Δ Top-1
EKGRL                       | 74.78 | 85.51 | -
w/o Entity embedding        | 62.76 | 73.03 | −12.02
w/o Entity space mapping    | 64.02 | 76.74 | −10.76
w/o Relation space mapping  | 63.88 | 76.50 | −10.90
w/o Representation learning | 55.79 | 68.47 | −18.99
fusion model (MLP)          | 67.02 | 78.33 | −7.76
fusion model (BAN)          | 73.71 | 81.72 | −1.07

On the PFVQA dataset, our EKGRL model obtains the best performance among all these VQA methods, achieving a 6.77% boost on Top-1 accuracy, a 6.10% boost on Top-3 accuracy and a 4.72% boost on Top-10 accuracy. After adding the psychological knowledge and Q&A pairs, the accuracy of the other VQA methods decreases, while EKGRL maintains its performance under the change of dataset and the expansion of the knowledge graph. The psychological relationships introduced in the PFVQA dataset contain multiple relationship chains between different entities, bringing implicit connections between entities that are not related in the common semantic space. These hidden relationship chains challenge the reasoning process of the model, because the expansion of the knowledge graph tests its representation ability on the knowledge graph. The results on the PFVQA dataset prove the representation and generalization ability of our model. Our EKGRL model can provide a new research route for solving VQA tasks and performing knowledge graph representation.

5.3 Ablation Study

The results of the ablation study are shown in Table 4. We first evaluate the influence of multimodal feature fusion in different spaces by removing the entity embedding, the entity space mapping and the relation space mapping. Compared with


Fig. 4. Qualitative results of EKGRL. The first two columns ((a), (b), (e), (f)) show examples of psychological VQA based on psychological supporting facts. The last two columns ((c), (d), (g), (h)) show examples of common sense VQA. The first row shows samples whose question-image entities are head entities, while the second row shows the ones with tail entities as question-image entities. (d) and (h) are two failure examples of prediction: (d) fails because the wrong supporting fact is found, and (h) fails because of confusion between similar entities. In particular, (f) shows an example of PFVQA on a sandplay scene created on a 3D electronic sandplay platform, showing the potential of EKGRL in psychological assessment and psychotherapy.

the full model, the Top-1 accuracy drops by 12.02%, 10.76% and 10.90%, respectively, which indicates the positive effect of the entity space and relation space, providing valuable information for answer inference. We also change our fusion model from Stacked Attention (SAN) to MLP or BAN, resulting in 7.76% and 1.07% reductions in Top-1 accuracy. We assess the value of the proposed representation learning by removing the answer-entity embedding, which results in an 18.99% decrease in Top-1 accuracy, higher than any other item in the ablation study. This significant decrease demonstrates the value and effectiveness of representation learning.

6 Conclusions

In this work, we have proposed the EKGRL model to learn the mapping of concepts in the knowledge space and solve the FVQA task. We have built a PFVQA dataset containing psychological questions to verify the representational and reasoning ability of EKGRL in the psychological domain. As an expansion to the FVQA


dataset, the PFVQA dataset contains psychological Q&A pairs and supporting facts, introducing psychological attributes to entities which can test the abilities of reasoning and knowledge representation of models. Our EKGRL model has achieved state-of-the-art performance on both FVQA and PFVQA datasets. The experiment results have proved that the entity representation learning and the gating mechanism in EKGRL enable the model to have strong multi-hop reasoning and representation abilities on knowledge graph.

References 1. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 2. Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015) 3. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013). https://proceedings. neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf 4. Chen, Z., Chen, J., Geng, Y., Pan, J.Z., Chen, H.: Zero-shot visual question answering using knowledge graph (2021) 5. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP (2016) 6. Goyal, Y., Khot, T., Summers Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering, pp. 6325–6334 (2017). https://doi.org/10.1109/CVPR.2017.670 7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2015). https://api.semanticscholar.org/CorpusID:206594692 8. Hochreiter, S.: Long short term memory (1995). https://api.semanticscholar.org/ CorpusID:261210188 9. Kafle, K., Kanan, C.: Answer-type prediction for visual question answering. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4976–4984 (2016). https://doi.org/10.1109/CVPR.2016.538 10. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018). https://proceedings.neurips.cc/paper/2018/file/ 96ea64f3a1aa2fd00c72faacf0cb8ac9-Paper.pdf 11. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering (2016) 12. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014). https://proceedings. neurips.cc/paper/2014/file/d516b13671a4179d9b7b458a6ebdeb92-Paper.pdf


13. Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: a visual question answering benchmark requiring external knowledge. In: Proceedings of the IEEE/ CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 14. Narasimhan, M., Lazebnik, S., Schwing, A.: Out of the box: reasoning with graph convolution nets for factual visual question answering. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018). https://proceedings.neurips.cc/paper/2018/file/ c26820b8a4c1b3c2aa868d6d57e14a79-Paper.pdf 15. Narasimhan, M., Schwing, A.G.: Straight to the facts: learning knowledge base retrieval for factual visual question answering. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 451–468 (2018) 16. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1162. https:// aclanthology.org/D14-1162 17. Ren, M., Kiros, R., Zemel, R.S.: Exploring models and data for image question answering. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS 2015. pp. 2953–2961. MIT Press, Cambridge (2015) 18. Shah, S., Mishra, A., Yadati, N., Talukdar, P.P.: KVQA: knowledge-aware visual question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 8876–8884 (2019). https://doi.org/10.1609/aaai.v33i01. 33018876. https://ojs.aaai.org/index.php/AAAI/article/view/4915 19. Shih, K.J., Singh, S., Hoiem, D.: Where to look: focus regions for visual question answering. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4613–4621 (2016). https://doi.org/10.1109/CVPR.2016.499 20. Wang, P., Wu, Q., Shen, C., Dick, A., van den Hengel, A.: FVQA: fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2413–2427 (2018). https://doi.org/10.1109/TPAMI.2017.2754246 21. Wang, P., Wu, Q., Shen, C., Dick, A., Van Den Henge, A.: Explicit knowledge-based reasoning for visual question answering. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, pp. 1290–1296. AAAI Press (2017) 22. Wu, Q., Wang, P., Shen, C., Dick, A., van den Hengel, A.: Ask me anything: free-form visual question answering based on knowledge from external sources. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 23. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 24. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual question answering. CoRR abs/1512.02167 (2015). http://arxiv.org/abs/ 1512.02167 25. Zhu, Z., Yu, J., Wang, Y., Sun, Y., Hu, Y., Wu, Q.: Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 1097–1103. International Joint Conferences on Artificial Intelligence Organization (2020). https://doi.org/10.24963/ijcai.2020/153

Disentangled Attribute Features Vision Transformer for Pedestrian Attribute Recognition

Caihua Liu^{1,3}, Jiaxian Guo^{2}(B), Sichu Chen^{1}, and Xia Feng^{1,3}

1 School of Computer Science and Technology, Civil Aviation University of China, Tianjin, China
  {chliu,2022051004,xfeng}@cauc.edu.cn
2 Geophysical Exploration Center, China Earthquake Administration, Zhengzhou, China
  [email protected]
3 Key Laboratory of Intelligent Airport Theory and System, CAAC, Tianjin, China

This work was supported by the Scientific Research Project of Tianjin Educational Committee under Grant 2021KJ037.

Abstract. Pedestrian attribute recognition, as a sub-task of multi-label classification, is influenced by the joint and coupled relationships among the numerous labels of the samples. This impact affects the model's ability to judge labels and results in a loss of recognition accuracy for both overall and individual labels in unfamiliar scenes. In this paper, we propose a Disentangled Attribute Features Vision Transformer (DAF-ViT) model for pedestrian attribute recognition, aiming to enhance the discriminative ability of correlated features in pedestrian samples and suppress the discriminative ability of irrelevant features. The decoupling model consists of three parallel self-attention modules: a correlation self-attention module (CSA) for extracting correlated attribute features, an exclusive self-attention module (ESA) for extracting mutually exclusive attribute features, and a random sequence self-attention module (RSA) for extracting unrelated attribute features. The final predicted attributes are obtained by combining the results of these modules using a multi-level result fusion module. Compared to the baseline methods, our approach achieves a 2.8% and 2.5% increase in mA accuracy on the PETA and PA-100K datasets, respectively. This effectively improves the average accuracy of the model on various attributes.

Keywords: Pedestrian attribute recognition · Self-attention mechanism · Disentangled features

1 Introduction

Pedestrian attribute recognition, characterized by the unique properties of pedestrian attributes, presents several novel challenges in algorithmic model design


compared to traditional multi-label image classification tasks. One such challenge pertains to the influence of coupling relationships among pedestrian attributes on multi-label classification. Coupling relationships refer to the inter-dependencies and interactions between multiple attributes of the same pedestrian. These relationships often manifest through two phenomena: attribute co-occurrence and attribute exclusivity. Attribute co-occurrence describes the higher probability of two or more attributes appearing simultaneously, such as skirt and female, shorts and short sleeves, and so on. Attribute exclusivity, on the other hand, denotes the higher probability of two or more attributes not co-occurring, for instance, long hair and male, shorts and overcoat, and so forth. Furthermore, the co-occurrence patterns of other attribute pairs exhibit no discernible regularity, such as glasses and long sleeves, hat and backpack, among others. The coupling relationships among pedestrian attributes specifically refer to semantic connections that induce attribute co-occurrence or attribute exclusivity phenomena in pedestrian datasets. These relationships can serve as prior knowledge to aid models in improving prediction accuracy and speed for regular samples across most training and testing scenarios. However, when faced with a small proportion of outlier data, they may lead to prediction biases. For example, when a model tends to associate and jointly predict the attributes of long hair and female, it may exhibit high accuracy in training and testing datasets, as well as in most application scenarios. However, when encountering rare test samples, such as males with long hair or females with short hair, the model’s judgments may be influenced by prior knowledge, leading to prediction errors. Therefore, excessive reliance on such coupling relationships as prior knowledge for pedestrian attribute recognition can enhance the training and inference speed of the model, perform well on specific test sets, but may compromise the model’s robustness and generalization, thereby hindering its deployment in more complex environmental applications. In order to address the influence of coupling relationships among pedestrian attributes on attribute classification, this paper presents a novel model called the Disentangled Attribute Features Vision Transformer (DAF-ViT). DAF-ViT extends the Vision Transformer [3] with three self-attention modules: the Correlation self-attention (CSA) module, responsible for extracting relevant attribute features; the Exclusive self-attention (ESA) module, responsible for extracting mutually exclusive attribute features; and the Random sequence attention (RSA) module, responsible for extracting unrelated attribute features. These parallel self-attention modules perform their respective tasks, and after individually classifying the decoupled features, the multi-level result fusion module [4] is utilized to obtain the final result. The contributions of this work can be summarized as follows: – An effective Disentangled Attribute Features Vision Transformer (DAF-ViT) is proposed for pedestrian attribute recognition, which uses sequential latent feature attention to generate multiple latent tokens dynamically. Aiming to enhance the discriminative ability of correlated features in pedestrian samples and suppress the discriminative ability of irrelevant features.


– Three parallel self-attention modules are proposed: a correlation self-attention module (CSA) for extracting correlated attribute features, an exclusive self-attention module (ESA) for extracting mutually exclusive attribute features, and a random sequence self-attention module (RSA) for extracting unrelated attribute features.
– Comprehensive experiments show that DAF-ViT effectively improves the average accuracy of the model on various attributes on the PETA and PA-100K pedestrian datasets.

2 Related Works

2.1 Pedestrian Attribute Recognition

Early pedestrian attribute recognition mainly employed handcrafted methods targeting low-level features in classification algorithms. However, they exhibited limitations when dealing with samples with different feature distributions. Currently, the majority of solutions for pedestrian attribute recognition employ deep learning-based multi-label classification models. To this end, Bourdev et al. [1] propose the Poselets model to split the image into a set of parts, each capturing a salient pattern corresponding to a given viewpoint and local pose. Moreover, Tang et al. [16] propose the Attribute Localization Module (ALM) to adaptively discover the most discriminative attention regions and learn the regional features for each attribute at multiple scales. However, these existing methods still run into the problems of insufficient dynamic mapping and insufficient hierarchical aggregation.

2.2 Disentangled Features Learning

Coupling relationships refer to the inter-dependencies and interactions between multiple attributes of the same pedestrian. These relationships often manifest through two phenomena: attribute co-occurrence and attribute exclusivity. Attribute co-occurrence describes the higher probability of two or more attributes appearing simultaneously. Attribute exclusivity, on the other hand, denotes the higher probability of two or more attributes not co-occurring. Many studies have approached the multi-class problem by considering the correlation relationships among attributes. Singh et al. [13] use class activation maps (CAM) as weak location annotation to de-correlate the feature representations of a category from its co-occurring context. They achieve this by learning a feature subspace that explicitly represents categories occurring in the absence of context alongside a joint feature subspace that represents both categories and context.

3 Methods

3.1 Overview

In order to address the impact of the coupling relationship between pedestrian attributes on attribute classification, we introduce the Disentangled Attribute


Features Vision Transformer (DAF-ViT). As shown in Fig. 1, the DAF-ViT model utilizes the T2T-ViT [18] framework as the backbone network and incorporates three self-attention modules: Correlation self-attention (CSA), Exclusive self-attention (ESA), and Random sequence attention (RSA). These three attention modules are connected in parallel to the operations of the original self-attention.


Fig. 1. The pipeline of the DAF module consists of Correlation self-attention (CSA), Exclusive self-attention (ESA), and Random sequence attention (RSA).

Upon receiving pedestrian RGB images as input, the DAF-ViT model employs a Vision Transformer encoder for initial feature extraction. The self-attention mechanism is used to learn the similarity between image patches. Subsequently, the feature sequence obtained from the self-attention is simultaneously fed into three branch modules: the correlation self-attention module, the exclusive self-attention module, and the random sequence attention module. The weight parameters of these three modules are not shared and are trained independently. Each module performs multi-label classification individually. Finally, the results from the three modules are fused using the multi-level result fusion, yielding the final classification outcome.

3.2 Correlation Self-attention Module

The Correlation self-attention module and the original Transformer module share the same self-attention mechanism, which employs similarity calculations through self-attention to capture the feature relationships between blocks of pedestrian images. Throughout this process, the image blocks are sequentially connected, ensuring the preservation of the positional relationships between them. As shown in Fig. 2, in the subsequent self-attention computation, the tokens in the image sequence are first multiplied by three different matrices to obtain

Disentangled Attribute Features Vision Transformer Z1=a11 v1

Z2=a12 v2

Z3=a13 v3

......

501

Zj=aij vj

Softmax a11

v1

a12

a13

q

k

k

k

1

1

2

3

token1

token2

token3

......

a1i

......

ki

......

tokeni

Fig. 2. The detailed workflow of the DAF module consists of Correlation self-attention (CSA), Exclusive self-attention (ESA), and Random sequence attention (RSA).

the query vectors q, key vectors k, and value vectors v. Subsequently, the query vector q_i of each token is used to calculate the similarity with the key vectors k_j of the other tokens, resulting in attention weights a_ij. The calculation formula for the attention weights is as follows:

a_ij = softmax(q_i · k_j^T / √D),    (1)

where D represents the dimension of q and k. To prevent numerical explosion caused by the increase in dimension, the values are normalized by dividing them by √D. The Softmax function is then applied to map the computed results to the range [0, 1], representing the probability distribution of attention similarity between different tokens. After obtaining the attention weights a_ij, they are multiplied with all the original tokens to obtain updated token values. The calculation is expressed as follows:

z_j = a_ij · x_j,    (2)

By multiplying the attention weights with the original tokens, the attention feature distribution in the sequence can represent the extracted feature vector of the sample image. The Correlation self-attention module learns the similarity relationships between sequential image blocks and can be utilized to explore the co-occurrence information among pedestrian attributes. In most datasets and application scenarios, these coupling relationships can serve as prior knowledge to facilitate the subsequent classification process.
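The following is a minimal NumPy sketch of the scaled dot-product attention in Formulas 1–2 for a single head; it is an illustrative assumption, not the released DAF-ViT code, and the projection matrices w_q, w_k and w_v are hypothetical names.

```python
# Hedged sketch of the correlation self-attention over a token sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlation_self_attention(tokens, w_q, w_k, w_v):
    # tokens: (L, C); w_q / w_k / w_v: (C, D) projection matrices.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # a_ij of Formula 1
    return attn @ v                                   # reweighted token values (Fig. 2)
```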

3.3 Exclusive Self-attention Module

Although the coupling relationships of attribute co-occurrence can enhance the efficiency of multi-label classification for pedestrians in most datasets and application scenarios, they can become obstacles to predicting attributes in certain testing environments. To simultaneously strengthen and suppress the phenomenon of attribute co-occurrence, we design the Exclusive Self-Attention (ESA) module in parallel with the Correlation Self-Attention module. The key idea of the ESA module is to learn the dissimilarity within the sequence of image blocks, aiming to reduce the relationships among the blocks in the feature space as much as possible. The generation of q, k, and v in the ESA module therefore remains consistent with the CSA module. However, in terms of attention value calculation, the ESA module focuses on computing the dissimilarity between two image blocks to represent their degree of exclusiveness. The calculation formula is as follows:

a_ij = 1 − softmax(q_i · k_j^T / √D),    (3)

where the softmax function normalizes the results to obtain values between 0 and 1, indicating the similarity and relevance of an image block to the remaining blocks. Subtracting the normalized result from 1 represents the exclusiveness and irrelevance of an image block relative to the other blocks. Through multiple iterations, the image blocks in the sequence are pushed further apart in the feature space, thereby capturing the differences between individual blocks. This calculation method of exclusive attention cannot guarantee the accuracy of the classification results for each image block, and in some cases, it may even have lower overall accuracy than the correlation attention. However, on certain outlier data (e.g., males with long hair), the exclusive attention can provide more accurate results than other methods. In the subsequent result fusion module, it is sufficient to extract the higher-confidence results from the exclusive attention, while the lower-accuracy results can be replaced by the results from other branch modules.
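A corresponding hedged sketch of Formula 3 follows, under the same assumed shapes as the correlation-attention sketch above; the inversion gives larger weights to dissimilar image blocks.

```python
# Hedged sketch of the exclusive self-attention branch (Formula 3).
import numpy as np

def exclusive_self_attention(tokens, w_q, w_k, w_v):
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    sim = np.exp(scores - scores.max(axis=-1, keepdims=True))
    sim = sim / sim.sum(axis=-1, keepdims=True)      # softmax similarity
    attn = 1.0 - sim                                 # Formula 3: invert the similarity
    return attn @ v
```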

3.4 Random Sequence Attention Module

The relationships between the various attributes of pedestrian samples are manifested not only in the semantic relationships between labels but also in the positional relationships corresponding to each attribute. When the sample image is encoded into a 1D sequence of image blocks, the relative positional information of the different body parts of the pedestrian is maintained, meaning that the feature of the head always comes before the feature of the body, and the feature of the legs always comes after the feature of the body. Therefore, one of the reasons for the co-occurring and mutually exclusive phenomena of pedestrian attributes is the fixed relative positions of these attributes in the sample. The problem of attribute co-occurrence and mutual exclusivity among pedestrian sample attributes is therefore addressed by transforming it into the influence of the fixed relative positions of pedestrian attributes on attribute learning.



Fig. 3. The workflow of the Random sequence attention (RSA) module.

To eliminate this influence as much as possible, a Random sequence attention (RSA) module is designed to operate in parallel with the correlation self-attention module. The RSA module has a similar overall computation process to the CSA module. The difference lies in the step before each round of self-attention computation, where the sequence of image blocks is randomly shuffled before being fed into the subsequent self-attention computation, and the positions of the elements in the image block sequence are replaced accordingly based on the new index. This shuffling process disorders the image block sequence, which is then passed to the self-attention weight calculation and assignment processes. The specific algorithmic flow of the Random sequence attention (RSA) is shown in Fig. 3. Similar to the application of the mutual exclusivity attention results, the computation results of the RSA cannot guarantee the accuracy of each attribute's classification result. The computation of the random self-attention is only highly accurate for a few samples such as outliers or rare samples. In most datasets and application scenarios (e.g., when attribute co-occurrence promotes the accuracy of attribute classification), the correlation self-attention algorithm may be more effective. Therefore, when obtaining the final classification results, a result fusion module is used to select the attributes with the highest accuracy from the results of the random sequence attention as its predicted results. The lower-accuracy results can be replaced by the results from other modules (such as the CSA and ESA modules). By leveraging the strengths of the different modules, the final attribute prediction results are obtained.
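Here is a hedged sketch of the random-sequence branch; it is illustrative only, and the shuffling granularity and the order-restoration step are assumptions consistent with the description above.

```python
# Hedged sketch of the random sequence attention: shuffle the token order before the
# attention so fixed relative positions of body parts cannot drive the attention pattern.
import numpy as np

def random_sequence_attention(tokens, w_q, w_k, w_v, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(tokens.shape[0])          # new index for this attention round
    shuffled = tokens[perm]
    q, k, v = shuffled @ w_q, shuffled @ w_k, shuffled @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)
    out = attn @ v
    return out[np.argsort(perm)]                     # restore the original token order
```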

4 Experiments

4.1 Datasets

The PETA [2] dataset consists of 19,000 images, including 8,705 individuals. Each image is annotated with 61 binary attributes and 4 multi-class attributes. For pedestrian attribute recognition tasks, typically only 35 attributes from the dataset are used for training and testing. The PA-100K [11] dataset is collected from 598 real outdoor surveillance cameras and contains 100,000 pedestrian images. Each image in this dataset is annotated with 26 binary attributes.

4.2 Implementation

Prior to the experiments, all pedestrian images were resized to 224 × 224 pixels. The AdamW optimizer was employed. The initial learning rate was set to 1e-5 with cosine learning rate decay.
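A hedged sketch of this preprocessing and optimization setup follows; the epoch count used for the cosine schedule and the absence of weight decay are placeholders rather than values reported here.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# 224x224 inputs, AdamW with initial lr 1e-5 and cosine decay, as stated above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])


def build_daf_vit_optimizer(model: torch.nn.Module, epochs: int = 100):
    # `epochs` is a placeholder; the paper does not report the training length here.
    optimizer = AdamW(model.parameters(), lr=1e-5)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```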

4.3 Loss Function

We adopt the weighted binary cross-entropy loss (Weighted BCE loss) [8] in DAF-ViT. The Weighted BCE loss effectively alleviates the problem of imbalanced data and labels and is formulated as follows:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} w_j\left(\hat{y}_{ij}\log(y_{ij}) + (1-\hat{y}_{ij})\log(1-y_{ij})\right) \quad (4)$$

$$w_j = \exp(-r_j/\alpha^2) \quad (5)$$

where N and M are the numbers of images and attributes, i and j index images and attributes, y_ij is the prediction and ŷ_ij the ground-truth label, w_j is the loss weight of the j-th attribute, r_j is the positive ratio of the j-th attribute in the training set, and α is a tuning parameter set to 1 in the experiments.
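A small sketch of Eqs. (4)–(5) in PyTorch is given below; the function name, the probability clamp, and the tensor shapes are our own choices.

```python
import torch


def weighted_bce_loss(pred, target, pos_ratio, alpha: float = 1.0, eps: float = 1e-7):
    """Sketch of Eqs. (4)-(5).
    pred:      (N, M) predicted probabilities y_ij
    target:    (N, M) binary ground-truth labels \hat{y}_ij
    pos_ratio: (M,)   positive ratio r_j of each attribute in the training set
    """
    w = torch.exp(-pos_ratio / alpha ** 2)                 # Eq. (5)
    pred = pred.clamp(eps, 1 - eps)                        # numerical safety (ours)
    per_elem = w * (target * torch.log(pred)
                    + (1 - target) * torch.log(1 - pred))
    return -per_elem.sum(dim=1).mean()                     # Eq. (4): mean over images
```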

4.4 Comparison with State-of-the-Art Methods

Prior to this study, few pedestrian attribute recognition models were designed specifically for attribute decoupling, so directly comparable methods are scarce. The models included in the comparison are therefore grouped by their backbone networks: AlexNet, InceptionNet, ResNet, LSTM, and ViT. Table 1 presents the experimental results on the PETA dataset, and Table 2 presents the results on the PA-100K dataset. From Table 1, it can be observed that the DAF-ViT model achieves the highest mean accuracy (mA), a 2.8% improvement over the T2T-ViT model, as well as a 1% improvement in recall over the baseline network. However, DAF-ViT falls clearly behind in accuracy (Acc), precision (Prec), and F1 score. The same phenomenon can be observed in the experimental results on the PA-100K dataset presented in Table 2.


Table 1. Comparison of models on the PETA dataset

| Model | Backbone | mA | Acc | Prec | Recall | F1 |
|---|---|---|---|---|---|---|
| ACN [14] | AlexNet | 81.2 | - | 84.1 | 81.2 | 82.6 |
| DeepMAR [8] | AlexNet | 82.9 | 75.1 | 83.7 | 83.1 | 83.4 |
| PGDM [9] | AlexNet | 82.9 | 78.1 | 86.9 | 84.7 | 85.8 |
| HP-Net [11] | InceptionNet | 81.8 | 76.1 | 84.9 | 83.2 | 84.0 |
| JLPLS-PAA [15] | InceptionNet | 84.9 | 79.5 | 87.4 | 86.3 | 86.9 |
| ALM [16] | InceptionNet | 86.3 | 79.5 | 85.7 | 88.1 | 86.9 |
| SSC [6] | ResNet-50 | 86.5 | 78.9 | 86.0 | 87.1 | 86.9 |
| MsVAA [12] | ResNet-50 | 84.6 | 78.9 | 86.8 | 86.1 | 86.4 |
| JRL [17] | LSTM | 82.1 | - | 82.6 | 82.1 | 82.0 |
| JRL-V2 [17] | LSTM | 85.6 | - | 86.0 | 85.3 | 85.4 |
| ViT [3] | ViT | 69.7 | 50.2 | 63.2 | 70.8 | 66.8 |
| C-Trans [7] | ViT | 84.6 | 78.7 | 85.9 | 86.6 | 86.3 |
| SimViT [10] | ViT | 84.9 | 79.1 | 86.9 | 86.5 | 86.5 |
| T2T-ViT [18] | ViT | 83.7 | 78.5 | 84.6 | 88.2 | 86.3 |
| DAF-ViT (ours) | ViT | 86.5 | 78.2 | 83.0 | 89.2 | 86.0 |

Table 2. Comparison of models on the PA-100K dataset

| Model | Backbone | mA | Acc | Prec | Recall | F1 |
|---|---|---|---|---|---|---|
| DeepMAR [8] | AlexNet | 72.7 | 70.4 | 82.2 | 80.4 | 81.3 |
| PGDM [9] | AlexNet | 74.9 | 73.1 | 84.4 | 82.3 | 83.3 |
| HP-Net [11] | InceptionNet | 74.2 | 72.2 | 82.9 | 82.1 | 82.6 |
| JLPLS-PAA [15] | InceptionNet | 81.6 | 78.9 | 86.8 | 87.7 | 87.3 |
| VAC [5] | ResNet-50 | 79.2 | 79.4 | 88.9 | 86.2 | 87.6 |
| SSC [6] | ResNet-50 | 81.8 | 78.8 | 85.9 | 89.1 | 86.8 |
| ViT [3] | ViT | 67.7 | 49.3 | 68.3 | 60.2 | 64.2 |
| C-Trans [7] | ViT | 81.6 | 75.9 | 82.6 | 83.3 | 83.6 |
| SimViT [10] | ViT | 82.3 | 76.8 | 84.6 | 85.1 | 85.8 |
| T2T-ViT [18] | ViT | 81.7 | 76.3 | 85.7 | 82.5 | 84.3 |
| DAF-ViT (ours) | ViT | 84.2 | 69.6 | 73.8 | 90.9 | 81.0 |

The reasons behind these observations are analyzed and discussed in the next section in light of the ablation experiments.

4.5 Ablation Study

To verify the effectiveness of the Correlation Self-Attention (CSA), Exclusive Self-Attention (ESA), and Random Sequence Attention (RSA) modules, a series of ablation experiments were conducted. Table 3 presents the results of these ablation experiments on the PETA dataset using the DAF-ViT model. The Baseline refers to the original network (referred to as B), and B + CSA denotes the Baseline with the CSA module added. To ensure fair comparisons and eliminate the influence of the number of attention layers and parameters on the experimental results, the total number of attention layers was kept consistent.

Table 3. Ablation study results on the PETA dataset

| Model | Attention Layers | mA | Acc | Prec | Recall | F1 | Para. (M) |
|---|---|---|---|---|---|---|---|
| Baseline | 14 | 83.71 | 78.53 | 84.62 | 88.24 | 86.32 | 25.92 |
| B+CSA | 14 + 12 | 86.26 | 77.13 | 81.25 | 90.31 | 85.54 | 41.83 |
| B+ESA | 14 + 12 | 86.17 | 76.29 | 79.66 | 91.64 | 85.23 | 41.83 |
| B+RSA | 14 + 12 | 86.22 | 75.53 | 78.68 | 91.66 | 84.67 | 41.83 |
| B+CSA+ESA | 14 + 6 + 6 | 86.27 | 77.06 | 80.67 | 91.21 | 85.62 | 40.35 |
| B+CSA+RSA | 14 + 6 + 6 | 86.20 | 75.77 | 79.28 | 91.36 | 84.89 | 40.35 |
| B+ESA+RSA | 14 + 6 + 6 | 86.50 | 76.96 | 80.83 | 90.63 | 85.45 | 40.35 |
| B+CSA+ESA+RSA | 14 + 4 + 4 + 4 | 86.38 | 78.17 | 82.60 | 90.08 | 86.17 | 38.87 |
| B+CSA+ESA+RSA | 14 + 12 + 12 + 12 | 86.54 | 78.17 | 82.97 | 89.17 | 85.96 | 74.3 |

From Table 3, it can be observed that, compared with the Baseline, CSA, ESA, and RSA improve mA by 2.55%, 2.46%, and 2.51% respectively, and Recall also improves, while Acc, Prec, and F1 decrease slightly. Both the ESA and RSA modules yield lower mA, Acc, Prec, and F1 than the CSA module. Thus, although the CSA, ESA, and RSA modules individually improve the mean accuracy, they also lead to more false negatives and false positives. This may arise from the fact that only a subset of features exhibits coupling relationships, while other features have no apparent correlations. When performing exclusive or random attention, the model treats all features as a single sequence, so features that do not require decoupling undergo unnecessary decoupling computations, which harms the accuracy of attribute classification. When the Baseline is combined with two decoupling modules, the above conclusions generally hold. The B + ESA + RSA combination achieves the highest mA, improving by 2.79% over the Baseline and slightly outperforming B + ESA by 0.33% and B + RSA by 0.28%. However, Acc, Prec, and F1 do not improve and in some cases even decrease. To investigate the impact of attention depth on model performance, two further experiments combine the Baseline with all three decoupling modules: one with 14 + 4 + 4 + 4 attention layers in total and the other with 14 + 12 + 12 + 12 layers.


In the first experiment, the total number of attention layers remained the same as in the previous experiments, but each module was reduced to 4 attention layers. Its mA improved by 0.11% and 0.18% over the B + CSA + ESA and B + CSA + RSA combinations respectively, but decreased by 0.12% compared with B + ESA + RSA. In terms of Acc, Prec, and F1, however, it achieved the largest improvements among the decoupled configurations, indicating that the B + CSA + ESA + RSA combination effectively mitigates false negatives and false positives, with the ESA and RSA modules contributing most. In the second experiment, the three sub-modules of B + CSA + ESA + RSA were each given 12 attention layers. This configuration allows every module to fully exploit self-attention learning and probes the maximum capacity of the DAF-ViT model. It improved all evaluation metrics, with mA increasing by 2.83% over the baseline model and by 0.16% over the 14 + 4 + 4 + 4 configuration. From these comparisons it can be concluded that, when the total number of attention layers is kept constant, the Baseline + ESA + RSA combination achieves the highest mA. This is because, among the numerous attribute features, some carry coupled information while others do not; when the exclusive and random attention mechanisms decouple the features from different perspectives, the baseline branch can still retain the learning of uncoupled features, and the optimal result of each module is obtained in the final result fusion.

5 Conclusion

While the co-occurrence and mutual exclusion phenomena among pedestrian attributes can serve as prior knowledge that helps the model improve prediction accuracy and speed on regular samples in most training and testing scenarios, they may introduce prediction biases on a small number of outlier samples. We propose the Disentangled Attribute Features Vision Transformer (DAF-ViT), which extends the Vision Transformer with three self-attention modules: the Correlation Self-Attention (CSA), responsible for extracting correlated attribute features; the Exclusive Self-Attention (ESA), responsible for extracting mutually exclusive attribute features; and the Random Sequence Attention (RSA), responsible for extracting unrelated attribute features. These parallel self-attention modules perform their respective tasks, and after the decoupled features are classified individually, a multi-level result fusion module produces the final result. The experimental results demonstrate a significant improvement in the mean accuracy of DAF-ViT on pedestrian attribute datasets.


References 1. Bourdev, L., Maji, S., Malik, J.: Describing people: a poselet-based approach to attribute classification. In: 2011 International Conference on Computer Vision, pp. 1543–1550. IEEE (2011) 2. Deng, Y., Luo, P., Loy, C.C., Tang, X.: Pedestrian attribute recognition at far distance. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 789–792 (2014) 3. Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 4. Feng, X., Guo, J., Liu, C.: Latent dynamic token vision transformer for pedestrian attribute recognition. In: 2023 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI). IEEE (2023) 5. Guo, H., Zheng, K., Fan, X., Yu, H., Wang, S.: Visual attention consistency under image transforms for multi-label image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 729–739 (2019) 6. Jia, J., Chen, X., Huang, K.: Spatial and semantic consistency regularizations for pedestrian attribute recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 962–971 (2021) 7. Lanchantin, J., Wang, T., Ordonez, V., Qi, Y.: General multi-label image classification with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16478–16488 (2021) 8. Li, D., Chen, X., Huang, K.: Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 111–115. IEEE (2015) 9. Li, D., Chen, X., Zhang, Z., Huang, K.: Pose guided deep model for pedestrian attribute recognition in surveillance scenarios. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2018) 10. Li, G., Xu, D., Cheng, X., Si, L., Zheng, C.: SimViT: exploring a simple vision transformer with sliding windows. arXiv preprint arXiv:2112.13085 (2021) 11. Liu, X., et al.: HydraPlus-Net: attentive deep features for pedestrian analysis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 350– 359 (2017) 12. Sarafianos, N., Xu, X., Kakadiaris, I.A.: Deep imbalanced attribute classification using visual attention aggregation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 708–725. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6 42 13. Singh, K.K., Mahajan, D., Grauman, K., Lee, Y.J., Feiszli, M., Ghadiyaram, D.: Don’t judge an object by its context: Learning to overcome contextual bias. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11070–11078 (2020) 14. Sudowe, P., Spitzer, H., Leibe, B.: Person attribute recognition with a jointlytrained holistic CNN model. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 87–95 (2015) 15. Tan, Z., Yang, Y., Wan, J., Hang, H., Guo, G., Li, S.Z.: Attention-based pedestrian attribute analysis. IEEE Trans. Image Process. 28(12), 6126–6140 (2019)


16. Tang, C., Sheng, L., Zhang, Z., Hu, X.: Improving pedestrian attribute recognition with weakly-supervised multi-scale attribute-specific localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4997–5006 (2019) 17. Wang, J., Zhu, X., Gong, S., Li, W.: Attribute recognition by joint recurrent learning of context and correlation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 531–540 (2017) 18. Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567 (2021)

A High-Resolution Network Based on Feature Redundancy Reduction and Attention Mechanism

Yuqing Pan(B), Weiming Lan, Feng Xu, and Qinghua Ren

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
[email protected]

Abstract. Model lightweighting is an essential aspect of computer vision tasks. We found that HRNet achieves superior performance by maintaining high resolution through multiple parallel subnetworks, but it also introduces many unnecessary redundant features, resulting in high complexity. Many methods replace modules in the backbone network with lightweight ones, often based on MobileNet, Sandglass modules, or ShuffleNet. However, these methods often suffer from significant performance degradation. This paper proposes GA-HRNet, a lightweight high-resolution human pose estimation network with an integrated attention mechanism, built upon the HRNet framework. The network structure of HRNet is restructured with reference to the G-Ghost Stage, introducing GPU-efficient cross-layer cheap operations to reduce inter-block feature redundancy, significantly lowering complexity while preserving high accuracy. Moreover, to compensate for the accuracy loss due to reconstruction, an attention module is designed and introduced to emphasize more important information in the channel and spatial dimensions, thereby improving network precision. Experimental results on the COCO and COCO-WholeBody datasets demonstrate that the proposed GA-HRNet is lighter and yet more accurate than HRNet. Moreover, it outperforms several state-of-the-art methods in terms of performance.

Keywords: human pose estimation · lightweight · attention mechanism · high-resolution network

1 Introduction

High-resolution representation is essential for achieving high performance in human pose estimation [20]. Due to limited computational resources, developing efficient high-resolution models is of paramount importance. HRNet has demonstrated superior capabilities in position-sensitive tasks such as semantic segmentation, human pose estimation, and object detection, but it has a large computational complexity. Although some progress has been made in model lightweighting, such as Lite-HRNet [28], these methods often suffer from significant accuracy loss. In addition, existing lightweight methods


designed specifically for CPUs introduce GPU-inefficient operations and do not make a good trade-off between GPU latency and computational complexity. The proposed approach reduces feature redundancy by reconstructing the network structure, significantly reducing complexity during inference while barely compromising accuracy. We avoid introducing GPU-inefficient operations, enabling the network to leverage the parallel computing capabilities of GPUs effectively. Building upon HRNet, we design a lightweight network called G-HRNet by incorporating the structure of the G-Ghost Stage from G-GhostNet [9]. Experimental results indicate that G-HRNet significantly reduces complexity while preserving high accuracy.

The attention mechanism can bring performance improvements to various neural networks, and channel & spatial attention further combines the advantages of channel attention and spatial attention. To further improve performance, we introduce a lightweight Channel-Spatial Attention Module (CSAM) that combines information across the channel and spatial dimensions, which significantly enhances information representation. The feature maps are encoded with the Channel Attention Module and the Spatial Attention Module in order. The resulting network is referred to as GA-HRNet.

The main contributions of this paper are as follows:

1. We propose a lightweight network, G-HRNet, which is reconstructed by reducing feature redundancy. Compared to the baseline HRNet, our method significantly reduces complexity while preserving high accuracy.
2. Building upon G-HRNet, we propose a lightweight high-resolution network, GA-HRNet, which incorporates an efficient Channel-Spatial Attention Module (CSAM).
3. We evaluate GA-HRNet on the COCO and COCO-WholeBody datasets, and experimental results demonstrate that our approach achieves a better balance between accuracy and complexity.

2 Related Work

Lightweight. In recent years, a series of lightweight model architectures have been proposed. MobileNets [1,10,18] first reduced complexity with depthwise separable convolutions, then introduced inverted residual blocks and utilized AutoML techniques to obtain better performance. ShuffleNet [15,30] introduces the channel shuffle operation and proposes design guidelines for efficient models in terms of both memory consumption and GPU parallelism. C-GhostNet [8] aims to reduce input redundancy by processing input features with fewer convolution filters, resulting in reduced FLOPs, parameters, and memory movement within the network. However, due to significant hardware differences between CPUs and GPUs, some operations with lower FLOPs (particularly depthwise convolutions and channel shuffle) may not perform optimally on GPUs and other throughput-oriented processors, resulting in slower inference than traditional convolutional neural networks. G-GhostNet [9] investigates


inter-layer feature redundancy in convolutional structures and introduces GPU-efficient cross-layer cheap operations to reduce computational complexity while minimizing memory movement.

Attention Mechanism. Humans can naturally and effectively identify salient regions in complex scenes. Inspired by this observation, attention mechanisms have been introduced in computer vision to mimic this aspect of the human visual system and have yielded remarkable success in various computer vision tasks. Existing attention methods can be grouped into four basic categories: channel attention, spatial attention, temporal attention, and branch attention, plus two mixed categories: channel & spatial attention and spatial & temporal attention [7]. In neural networks, different channels represent different objects [2], and channel attention assigns weights to all channels, which can be seen as a process of object selection: deciding what to focus on. Hu et al. [11] first proposed channel attention in SENet, whose core is the SE module, consisting of a squeeze module and an excitation module. GSoP-Net [6] uses global second-order pooling in the squeeze module to achieve better performance; ECANet [21] reduces complexity by improving the excitation module; and SRM [12] improves both the squeeze and excitation modules. Spatial attention is an adaptive mechanism for selecting spatial regions. Channel & spatial attention combines the advantages of channel attention and spatial attention, focusing on both important objects and important regions [2]. CBAM [23] and BAM [17] improve efficiency by introducing global average pooling and separating attention along the channel and spatial dimensions. DANet [5] and RGA [31] combine self-attention mechanisms with channel & spatial attention to model pairwise interactions.

3 Approach

3.1 G-HRNet

HRNet. HRNet enhances the high-resolution representation by generating feature maps in parallel, which yields high performance, but the numerous multi-resolution fusions also bring many redundant features [24]. To alleviate this problem, we rethink the internal structure of HRNet. As shown in Table 1, HRNet consists of stacks of many identical blocks, referred to as "block stacks" for convenience. The intermediate feature sizes within each block stack are the same, and the output features are obtained by passing the information through the whole series of blocks. However, this is not necessary and generates a lot of redundant features.


Table 1. Architecture of HRNet W48

| Layer | Operate Module | Module Repeat | Module Branch | Block Repeat | Output Channels |
|---|---|---|---|---|---|
| Input image | - | - | - | - | 3 |
| Stage 1 | conv2d / Layer (bottleneck) | 2 / 1 | 1 | 4 | 64 |
| Stage 2 | HRModule (basic block) | 1 | 2 | 4, 4 | 48, 96 |
| Stage 3 | HRModule (basic block) | 4 | 3 | 4, 4, 4 | 48, 96, 192 |
| Stage 4 | HRModule (basic block) | 3 | 4 | 4, 4, 4, 4 | 48, 96, 192, 384 |

across multiple layers within that block stack. G-Ghost Stage leverages the original block stack with fewer channels to generate intrinsic features, uses inexpensive operations together with an information-additive aggregation of intrinsic and ghost features to generate the ghost features, and finally combines the intrinsic and ghost features. By adapting the CNN structure with the G-Ghost Stage, a significant reduction in computational complexity can be achieved while maintaining high accuracy.

G-HRNet. G-HRNet is constructed using HRNet W48 as a reference. As shown in Table 1, the initial stage of HRNet comprises 4 residual units, and the second, third, and fourth stages consist of 1, 4, and 3 HRModules, each containing 4 residual units. We apply the G-Ghost Stage modification to all residual units in all stages, resulting in G-HRNet.
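The sketch below illustrates the general G-Ghost idea of splitting a block stack into a thin intrinsic branch and a cheap ghost branch; the split ratio, the use of the stack input for the cheap branch, and the omission of the mix/aggregation step are simplifications of [9], not the exact G-HRNet configuration, and the assumed block signature block(in_ch, out_ch) is ours.

```python
import torch
import torch.nn as nn


class GGhostStyleStage(nn.Module):
    """Simplified sketch of a G-Ghost-style block stack (after [9]).
    A thin "intrinsic" branch keeps the original blocks at reduced width,
    while a cheap 1x1-conv branch generates "ghost" features; the two are
    concatenated at the end. Assumes the blocks preserve spatial resolution."""

    def __init__(self, block, in_ch, out_ch, depth, ghost_ratio=0.5):
        super().__init__()
        ghost_ch = int(out_ch * ghost_ratio)
        intrinsic_ch = out_ch - ghost_ch
        blocks = [block(in_ch, intrinsic_ch)]
        blocks += [block(intrinsic_ch, intrinsic_ch) for _ in range(depth - 1)]
        self.intrinsic = nn.Sequential(*blocks)      # costly branch, fewer channels
        self.cheap = nn.Sequential(                  # GPU-friendly cheap operation
            nn.Conv2d(in_ch, ghost_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return torch.cat([self.intrinsic(x), self.cheap(x)], dim=1)
```

In G-HRNet each "block stack" of HRNet (the residual units of a stage or HRModule branch) would be restructured along these lines, so that only the intrinsic branch keeps the full sequence of blocks.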

Fig. 1. The overall structure of CSAM.

3.2 GA-HRNet

The SE module has achieved significant success in computer vision and can easily bring large performance gains to CNN-based networks. However, it considers only the channel dimension and not the spatial dimension, which discards a lot of spatial information. In addition, its excitation module is not efficient enough: the dimensionality reduction operation in the SE module reduces model complexity, but it also destroys the direct correspondence between channels and their weights, which can negatively affect the prediction of channel attention [21]. Although the later CBAM and ECA modules solve the above two


problems respectively and achieve better results, neither addresses both issues at once. To reduce the accuracy loss caused by the structural reconstruction, we propose a new attention module called the Channel-Spatial Attention Module (CSAM), which introduces spatial attention and optimizes the channel attention compared with the SE module. Woo et al. [23] have shown that arranging the two sub-modules sequentially, with channel attention first, gives better results than a parallel arrangement, and we follow this basic structure. Given an intermediate feature map F ∈ R^{C×H×W}, CSAM sequentially infers a one-dimensional channel attention map M_c ∈ R^{C×1×1} and a two-dimensional spatial attention map M_s ∈ R^{1×H×W}, as shown in Fig. 1. The entire attention process can be summarized as:

$$F' = M_c(F) \otimes F \quad (1)$$

$$F'' = M_s(F') \otimes F' \quad (2)$$

where ⊗ denotes element-wise multiplication, F' is the intermediate output, and F'' is the final output. The details of each attention module are described below.

Channel Attention Module. We generate the channel attention map by exploiting the inter-channel relationships of the features. To compute channel attention efficiently, we compress the spatial dimension of the input feature map. Spatial information is commonly aggregated with average pooling, but max pooling gathers another important clue about distinctive features and enables finer channel-wise attention [23]. Thus, we use both average-pooled and max-pooled features. Since the dimensionality reduction in the SE excitation module destroys the direct correspondence between channels and their weights, we avoid dimensionality reduction in the channel attention branch so that effective channel attention can be learned. We first aggregate the spatial information of the feature map with an average pooling and a max pooling operation, and then apply a 1D convolution to generate the channel attention map M_c ∈ R^{C×1×1}. In short, channel attention is calculated as:

$$M_c(F) = \sigma\big(f_1(\mathrm{AvgPool}(F) + \mathrm{MaxPool}(F))\big) \quad (3)$$

where σ is the sigmoid function and f_1 is a 1D convolution.

Spatial Attention Module. Besides the channel dimension, spatial information is also important for neural networks. Spatial attention focuses on "where": generating spatial attention maps from the spatial interrelationships of features complements the information neglected by channel attention. Consistent with CBAM, spatial attention is calculated as:

$$M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) \quad (4)$$

where f^{7×7} is a convolution with a kernel size of 7 × 7.
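A compact PyTorch sketch of CSAM following Eqs. (1)–(4) is given below; the kernel size of the 1D convolution is an assumption (the text only states that a 1D convolution is used), and the variable names are our own.

```python
import torch
import torch.nn as nn


class CSAM(nn.Module):
    """Sketch of the Channel-Spatial Attention Module: channel attention
    without dimensionality reduction (Eq. 3), then spatial attention (Eq. 4),
    applied sequentially as in Eqs. (1)-(2)."""

    def __init__(self, kernel_1d: int = 3):
        super().__init__()
        # Channel attention: 1D conv over the pooled channel descriptor.
        self.conv1d = nn.Conv1d(1, 1, kernel_size=kernel_1d,
                                padding=kernel_1d // 2, bias=False)
        # Spatial attention: 7x7 conv over concatenated [avg; max] maps.
        self.conv7x7 = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                                   # f: (B, C, H, W)
        # ---- channel attention, Eq. (3) ----
        desc = f.mean(dim=(2, 3)) + f.amax(dim=(2, 3))      # AvgPool + MaxPool, (B, C)
        mc = self.sigmoid(self.conv1d(desc.unsqueeze(1))).squeeze(1)   # (B, C)
        f1 = f * mc[:, :, None, None]                       # Eq. (1)
        # ---- spatial attention, Eq. (4) ----
        avg_map = f1.mean(dim=1, keepdim=True)              # (B, 1, H, W)
        max_map = f1.amax(dim=1, keepdim=True)
        ms = self.sigmoid(self.conv7x7(torch.cat([avg_map, max_map], dim=1)))
        return f1 * ms                                      # Eq. (2)
```

Such a module can be appended to the end of a residual unit, which is how GA-HRNet integrates it as described in the following paragraph.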


GA-HRNet. We build GA-HRNet on top of G-HRNet. The first stage of G-HRNet contains 4 residual units, and the second, third, and fourth stages contain 1, 4, and 3 exchange blocks, respectively, each exchange block containing 4 residual units. We add CSAM modules at the end of the 4 residual units in the first stage and of all residual blocks in the second, third, and fourth stages to enhance the network's information aggregation capability, obtaining GA-HRNet.

Table 2. Results on COCO val. We only report the input size, parameters, FLOPs, and mAP of the pose estimation model and ignore the object detection model. "-" indicates data not provided in the original paper.

| Method | Input size | Params (MB) | GFLOPs | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|---|---|---|
| SimpleBaseline [25] | 256 × 192 | 68.6 | 15.7 | 72.0 | 89.3 | 79.8 | 68.7 | 78.9 | 77.8 |
| SimpleBaseline [25] | 384 × 288 | 68.6 | 35.6 | 74.3 | 89.6 | 81.1 | 70.5 | 79.7 | 79.7 |
| TransPose-H-A6 [27] | 256 × 192 | 17.5 | 21.8 | 75.8 | - | - | - | - | 80.8 |
| HRNet W32 [19] | 256 × 192 | 28.5 | 7.1 | 73.4 | 89.5 | 80.7 | 70.2 | 80.1 | 78.9 |
| HRNet W32 [19] | 384 × 288 | 28.5 | 16 | 75.8 | 90.6 | 82.7 | 71.9 | 82.8 | 81.0 |
| EvoPose2D-S [16] | 256 × 192 | 2.53 | 1.07 | 70.2 | 88.9 | 77.8 | 66.5 | 76.8 | 76.9 |
| EvoPose2D-M [16] | 384 × 288 | 7.34 | 5.59 | 75.1 | 90.2 | 81.9 | 71.5 | 81.7 | 81.0 |
| EvoPose2D-L [16] | 512 × 384 | 14.7 | 17.7 | 76.6 | 90.5 | 83.0 | 72.7 | 83.4 | 82.3 |
| UniFormer-S [13] | 488 × 320 | 25.2 | 14.8 | 76.2 | 90.6 | 83.2 | 68.6 | 79.4 | 81.4 |
| UniFormer-B [13] | 256 × 192 | 53.5 | 9.2 | 75.0 | 90.6 | 83.0 | 67.8 | 77.7 | 80.4 |
| UniFormer-B [13] | 384 × 288 | 53.5 | 22.1 | 76.7 | 90.8 | 84.0 | 69.3 | 79.7 | 81.9 |
| HRNet W48 [19] | 256 × 192 | 63.6 | 15.77 | 74.8 | 90.3 | 82.0 | 71.3 | 81.6 | 80.2 |
| G-HRNet W48 | 256 × 192 | 43.38 | 10.57 | 75.0 | 90.3 | 82.1 | 71.6 | 81.8 | 80.5 |
| A-HRNet W48 | 256 × 192 | 63.61 | 15.8 | 76.2 | 90.9 | 83.2 | 72.1 | 83.0 | 80.9 |
| GA-HRNet W48 | 256 × 192 | 43.39 | 10.6 | 76.1 | 90.5 | 82.5 | 72.4 | 82.9 | 81.1 |
| HRNet W48 [19] | 384 × 288 | 63.6 | 35.48 | 76.3 | 90.8 | 82.9 | 72.3 | 83.4 | 81.2 |
| GA-HRNet W48 | 384 × 288 | 43.39 | 23.84 | 76.9 | 91.0 | 83.8 | 73.1 | 83.7 | 82.7 |

4 Experiments

Based on MMPose [3], we conducted comparative experiments and ablation studies on two datasets: COCO [14] and COCO-WholeBody [26].

4.1 COCO

Setting. The COCO (Common Objects in Context) [14] dataset is widely used in computer vision. Our model is trained on train2017 (118K images with over 200K person instances) and evaluated on val2017 (5K images) and test-dev2017 (40K images). The COCO annotations include 17 body keypoints per person.

Training. The experimental environment is configured as follows: Ubuntu 18.04 LTS 64-bit, two GeForce RTX 3090 graphics cards, and the PyTorch 1.10.2 deep learning framework. The number of workers per GPU is 2.


Adam is used as the optimizer during training, with an initial learning rate of 5e-4 and a linear warm-up over the first 500 iterations. The network is trained for a total of 210 epochs.

Testing. The testing environment is the same as the training environment, and tests are conducted on both the COCO val and the COCO test-dev sets.

Evaluation metrics. We use the standard evaluation metrics for human pose estimation on COCO: mean Average Precision (mAP) based on Object Keypoint Similarity (OKS). We report standard average precision and average recall scores, including AP, AP50, AP75, APM, APL, and AR.
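A hedged sketch of this optimization schedule follows; keeping the learning rate constant after warm-up is an assumption, since the post-warm-up decay is not stated here.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR


def build_training_schedule(model: torch.nn.Module, warmup_iters: int = 500):
    # Adam with initial lr 5e-4 and a linear warm-up over the first 500 iterations,
    # as stated above; 210 epochs of training are assumed to be driven by the loop.
    optimizer = Adam(model.parameters(), lr=5e-4)
    warmup = LambdaLR(optimizer,
                      lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))
    return optimizer, warmup

# Typical use: call warmup.step() once per iteration during the warm-up phase
# and optimizer.step() every iteration as usual.
```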

Table 3. Results on COCO test-dev. We only report the input size, parameters, FLOPs, and mAP of the pose estimation model and ignore the object detection model. "-" indicates that data was not provided in the original paper.

| Method | Input size | Params (MB) | GFLOPs | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|---|---|---|
| Lite-HRNet-30 [28] | 384 × 288 | 1.8 | 0.7 | 69.7 | 90.7 | 77.5 | 66.9 | 75.0 | 75.4 |
| SimpleBaseline [25] | 384 × 288 | 68.6 | 35.6 | 73.7 | 91.9 | 81.1 | 70.3 | 80.0 | 79.0 |
| TransPose-H-A6 [27] | 256 × 192 | 17.5 | 21.8 | 75.0 | 92.2 | 82.3 | 71.3 | 81.1 | - |
| HRNet W32 [19] | 384 × 288 | 28.5 | 16 | 74.9 | 92.5 | 82.8 | 71.3 | 80.9 | 80.1 |
| EvoPose2D-L [16] | 512 × 384 | 14.7 | 17.7 | 75.7 | 91.9 | 83.1 | 72.2 | 81.5 | 81.7 |
| HRNet W48 [19] | 256 × 192 | 63.6 | 15.77 | 73.8 | 92.1 | 82.0 | 70.5 | 79.6 | 79.2 |
| GA-HRNet W48 | 256 × 192 | 43.39 | 10.6 | 75.1 | 92.3 | 82.9 | 71.8 | 80.8 | 80.3 |
| HRNet W48 [19] | 384 × 288 | 63.6 | 35.48 | 75.5 | 92.5 | 83.3 | 71.9 | 81.5 | 80.5 |
| GA-HRNet W48 | 384 × 288 | 43.39 | 23.84 | 76.0 | 92.6 | 83.1 | 72.5 | 81.8 | 81.7 |

Results on COCO val. Table 2 and Fig. 2 show the performance of our proposed method and other methods on COCO val. GA-HRNet reduces the complexity of the baseline HRNet by about 1/3 at both 256 × 192 and 384 × 288 resolutions while achieving clear improvements in accuracy (1.3% and 0.6% AP, respectively). When trained with an input size of 256 × 192, GA-HRNet outperforms the other methods at the same resolution: it is 0.3% higher than TransPose-H-A6 on AP, 5.9% higher than EvoPose2D-S, and 1.1% higher than UniFormer-B. With an input size of 384 × 288, GA-HRNet also achieves a good balance between complexity and accuracy: it is 2.6% higher than SimpleBaseline on AP, 1.8% higher than EvoPose2D-M, and 0.2% higher than UniFormer-B (76.9% vs 76.7%). In addition, GA-HRNet outperforms EvoPose2D-L, which uses a 512 × 384 input (76.9% vs 76.6%).

Results on COCO test-dev. Table 3 shows the performance of our network and other methods on COCO test-dev. Our GA-HRNet achieves the highest AP of 76.0%. Compared to TransPose-H-A6 with the same input size, our 256 × 192 model achieves a 0.1% AP improvement with a clear advantage in complexity;


it also surpasses EvoPose2D-L, which uses a larger input size. Compared to the baseline model, GA-HRNet achieves notable accuracy improvements while substantially reducing complexity.

4.2 COCO-WholeBody

Setting. The COCO-WholeBody [26] dataset is the first large-scale benchmark for whole-body pose estimation. It extends COCO 2017 with the same training/validation split and provides 133 keypoints. Our model is trained on train2017 and validated on val2017.

Training. All configurations are consistent with those used for COCO.

Testing. The testing environment is the same as the training environment, and tests are conducted on the COCO-WholeBody validation set.

Evaluation metrics. Consistent with COCO, we use the standard evaluation metrics for human pose estimation, including five types of AP and AR: body, foot, face, hand, and whole-body.

Table 4. Results on COCO-WholeBody val. We only report the input size, parameters, FLOPs, and the five APs/ARs of the pose estimation model and ignore the object detection model. Whole-body AP is the main metric we evaluate. "-" indicates that data was not provided in the original paper.

| Method | Input size | Params (MB) | GFLOPs | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole-body AP | Whole-body AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SimpleBaseline [25] | 384 × 288 | 27.78 | 20.42 | 66.6 | 74.7 | 63.5 | 76.3 | 73.2 | 81.2 | 53.7 | 64.7 | 57.3 | 67.1 |
| TCFormer [29] | 256 × 192 | - | 6.62 | 69.1 | 77.0 | 69.8 | 81.3 | 64.9 | 74.6 | 53.5 | 65.0 | 57.2 | 67.8 |
| PVT [22] | 384 × 288 | - | 19.65 | 67.3 | 76.1 | 66.0 | 79.4 | 74.5 | 82.2 | 54.5 | 65.4 | 58.9 | 68.9 |
| FastPose50-dcn-si [4] | 256 × 192 | - | 6.1 | 70.6 | 75.6 | 70.2 | 77.5 | 77.5 | 82.5 | 45.7 | 53.9 | 59.2 | 66.5 |
| ZoomNet [26] | 384 × 288 | - | 28.49 | 74.5 | 81.0 | 60.9 | 70.8 | 88.0 | 92.4 | 57.9 | 73.4 | 63.0 | 74.2 |
| HRNet W48 [19] | 256 × 192 | 63.6 | 15.79 | 69.6 | 77.4 | 65.3 | 77.2 | 65.3 | 74.3 | 52.8 | 63.0 | 57.4 | 67.7 |
| G-HRNet W48 | 256 × 192 | 43.38 | 10.59 | 70.8 | 78.9 | 66.9 | 78.6 | 64.4 | 73.9 | 47.1 | 57.9 | 56.7 | 67.7 |
| A-HRNet W48 | 256 × 192 | 63.61 | 15.81 | 72.7 | 77.3 | 62.4 | 75.7 | 74.4 | 84.4 | 53.0 | 63.3 | 60.6 | 70.5 |
| GA-HRNet W48 | 256 × 192 | 43.39 | 10.6 | 72.4 | 79.5 | 58.6 | 69.6 | 75.8 | 83.8 | 52.1 | 62.3 | 59.9 | 70.1 |
| HRNet W48 [19] | 384 × 288 | 63.6 | 35.52 | 72.2 | 79.0 | 69.4 | 79.9 | 77.7 | 83.4 | 58.7 | 67.9 | 63.1 | 71.6 |
| GA-HRNet W48 | 384 × 288 | 43.39 | 23.88 | 73.5 | 81.0 | 63.5 | 75.2 | 87.5 | 92.9 | 59.0 | 74.0 | 64.0 | 75.0 |

Results on COCO-WholeBody val. As shown in Table 4 and Fig. 2(b), GA-HRNet achieves superior performance and balances accuracy and complexity well. Specifically, GA-HRNet achieves higher accuracy (whole-body AP increased by 2.5% and 0.9%, respectively) and lower complexity (GFLOPs and parameters both reduced by about 1/3) compared to the baseline HRNet. Compared to mainstream methods, it has clear advantages in both accuracy and complexity: at the same resolution, it is 2.7% higher than TCFormer on AP, 0.7% higher than FastPose50-dcn-si, 5.1% higher than PVT, and 1.0% higher than ZoomNet while using 4.61 fewer GFLOPs.


Fig. 2. Illustration of the complexity and accuracy comparison on the COCO val and COCO-WholeBody val.

4.3 Ablation Study

Consistent with Lite-HRNet [28], we perform ablations on the validation sets of COCO [14] and COCO-WholeBody [26] and report the results. The input size is 256 × 192 for both datasets.

G-Ghost Stage. Tables 2 and 4 compare G-HRNet with HRNet W48. The complexity of G-HRNet is about 2/3 of that of HRNet W48 while the accuracy is almost unchanged: on COCO val, the AP of G-HRNet W48 is 0.2% higher than that of HRNet W48, and on COCO-WholeBody val it is 0.7% lower. The results show that compressing the network by reducing feature redundancy is feasible and highly efficient, bringing a considerable reduction in complexity at a very small cost in accuracy.

Attention Module. Tables 2 and 4 compare A-HRNet with HRNet W48. A-HRNet brings a significant improvement in accuracy while adding only negligible complexity: on COCO val, it improves AP by 1.4% while adding 0.03 GFLOPs and 0.01 M parameters; on COCO-WholeBody val, it improves by 3.2% while adding 0.02 GFLOPs and 0.01 M parameters. The results show that the proposed attention module is lightweight and efficient, fully combining channel attention and spatial attention to bring significant performance gains.


Fig. 3. Example images. The top row shows results in COCO format and the bottom row shows results in COCO-WholeBody format.

5 Conclusion

In this paper, based on HRNet, we propose a high-resolution network GA-HRNet for human pose estimation that produces accurate and spatially precise keypoint heatmaps. As shown in Fig. 3, the proposed network produces accurate results. Compared to HRNet, we have made two improvements: (1) optimized the network structure to significantly reduce computational complexity and parameter count; (2) proposed an attention module to improve network accuracy. Experiments on two benchmark datasets show that our proposed method significantly improves the baseline model and achieves competitive performance on several human-centric visual tasks (pose estimation and whole-body pose estimation). We envision that our proposed method is general and can be applied to a wide range of visual tasks such as object detection and semantic segmentation. Future work will focus on exploring the effectiveness of our method on these visual tasks. Acknowledgement. This work was supported in part by the National Natural Science Foundation of China under Grant 62202207; in part by the Natural Science Foundation of Jiangsu Higher Education Institutions under Grant 22KJB520015; and in part by the Scientific Research Foundation of Jiangsu University under Grant 21JDG051.

References 1. Andrew, H., et al.: Searching for MobileNetV3. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324 (2019) 2. Chen, L., et al.: SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667 (2017)


3. MMPose Contributors: OpenMMLab pose estimation toolbox and benchmark (2020) 4. Fang, H.S., et al.: AlphaPose: whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 46(6), 7157– 7173 (2022) 5. Fu, J., et al.: Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146– 3154 (2019) 6. Gao, Z., Xie, J., Wang, Q., Li, P.: Global second-order pooling convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3024–3033 (2019) 7. Guo, M.H., et al.: Attention mechanisms in computer vision: a survey. Comput. Visual Media 8(3), 331–368 (2022) 8. Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C.: GhostNet: more features from cheap operations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1580–1589 (2020) 9. Han, K., et al.: GhostNets on heterogeneous devices via cheap operations. Int. J. Comput. Vis. 130(4), 1050–1069 (2022) 10. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017) 11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018) 12. Lee, H., Kim, H.E., Nam, H.: SRM: a style-based recalibration module for convolutional neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1854–1862 (2019) 13. Li, K., et al.: Uniformer: unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45, 12581–12600 (2023) 14. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 15. Ma, N., Zhang, X., Zheng, H.-T., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9 8 16. McNally, W., Vats, K., Wong, A., McPhee, J.: EvoPose2D: pushing the boundaries of 2D human pose estimation using accelerated neuroevolution with weight transfer. IEEE Access 9, 139403–139414 (2021) 17. Park, J., Woo, S., Lee, J.Y., Kweon, I.S.: BAM: bottleneck attention module. arXiv preprint arXiv:1807.06514 (2018) 18. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018) 19. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019) 20. Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020) 21. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11534–11542 (2020)


22. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021) 23. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/9783-030-01234-2 1 24. Wu, H., Liang, C., Liu, M., Wen, Z.: Optimized HRNet for image semantic segmentation. Expert Syst. Appl. 174, 114532 (2021) 25. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3030-01231-1 29 26. Xu, L., et al.: ZoomNAS: searching for whole-body human pose estimation in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 45(4), 5296–5313 (2022) 27. Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11802–11812 (2021) 28. Yu, C., et al.: Lite-HRNet: a lightweight high-resolution network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10440–10450 (2021) 29. Zeng, W., et al.: Not all tokens are equal: human-centric visual analysis via token clustering transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11101–11111 (2022) 30. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018) 31. Zhang, Z., Lan, C., Zeng, W., Jin, X., Chen, Z.: Relation-aware global attention for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3186–3195 (2020)

Author Index

C Cao, Fuxiao 146 Cao, Zhao 233 Chang, Chen 28 Chang, Xiaobin 195 Chao, Wentao 15 Chen, Pengyun 41 Chen, Sichu 497 Chen, Wei 447 Chen, Xiaotang 485 Chen, Xu 233 Chen, Xuguang 359 Chen, Zhen 384 Chen, Zipeng 335 D Dai, Pingyang 384, 396 Dang, Jisheng 220 Du, Fengyi 158 Du, Ruoyi 447 Duan, Fuqing 15 F Fan, Diwei 220 Fan, Lixin 435 Fang, Chenxin 396 Fang, Ya 335 Feng, Xia 497 G Gao, Timin 384 Gao, Ying 269 Gong, Ning 409 Guo, Jiaxian 497 Guo, Weijie 460 Guo, Yuanfang 309 H He, Boyong 460 He, Xin 41

Hou, Xianxu 435 Hu, Runze 384 Hu, Xuemei 28 Huang, Kaiqi 485 J Jia, Boshu 53 Jiang, Chenglong 269 Jiang, Hao 233 Jiang, Hongfan 170 Jiang, Min 473 Jiang, Xiaoheng 41 Jiang, Xuesong 422 Jiang, Zhizhuo 409 Jin, Shaohu 41 Jin, Weixuan 384 Jing, Shicheng 134 K Kan, Yiming 15 Kang, Xin 371 Ke, Yuxin 409 L Lai, Binxin 322 Lai, Bokai 384 Lan, Weiming 510 Li, Danmeng 146 Li, Hao 435 Li, Hongjun 347 Li, Jiaquan 473 Li, Jiaxin 347 Li, Kan 282 Li, Shaohui 409 Li, Shenyang 108 Li, Tonghui 309 Li, Xianjiang 460 Li, Xuan 121 Li, Yaowen 409 Li, Yuan 3 Li, Zhangxun 95


Li, Zhaoyun 158 Li, Zhi 409 Liang, Jing 245 Liang, Kongming 447 Liang, Liming 269 Lin, Tao 195 Ling, Chen 257 Liu, Caihua 497 Liu, DingMing 396 Liu, Hao 41 Liu, Yi 233 Liu, Yu 409 Lu, Jingxia 435 M Ma, Ding 121 Ma, Xiaoyu 257 Ma, Zhanyu 447 Ma, Zhichao 282 Meng, Jingke 207 Mi, Fuheng 146 N Nakagawa, Satoshi Ni, Siqi 28 P Pan, Wensheng 396 Pan, Yuqing 510 Pang, Chengxin 95 Peng, Zhibin 435 Pu, Hao 233 Q Qiu, Jucheng

53

R Ren, Fuji 371 Ren, Qinghua 510 Ren, Xiwei 158 Ren, Yongjian 485 S Shen, Cong 257 Shen, Linlin 435 Shi, Minghui 473 Song, Hailong 297 Song, Ruihua 233

371

Sun, Wei 195 Sun, Yuchong 233 T Tang, Weixuan 335 Tian, Huilin 207 W Wang, Anhong 79 Wang, Han 108 Wang, Ke 41 Wang, Linhuang 371 Wang, Pengfei 41 Wang, Tian 95 Wang, Wenjun 183 Wang, Xinlu 79 Wang, Xinyue 257 Wang, Xuechun 15 Wang, Yaonong 359 Wang, Yuan-Gen 322, 335 Wang, Yunhong 309 Wang, Zihao 359 Wei, Tao 447 Wei, Xiumei 422 Wen, Sichao 359 Wu, Liaoni 460 Wu, Renlong 65 Wu, Xiangqian 121 Wu, Xiaojun 170 Wu, Xiaoming 422 Wu, Xiaoyu 53 Wu, Yaoyuan 460 Wu, Yuanyuan 245 X Xie, Liping 134 Xu, Feng 510 Xu, Mingliang 41 Xu, Tianyang 170 Y Yan, Cheng 183 Yan, Qi 297 Yan, Zihan 447 Yang, Tian 257 Yang, Zhifu 396 Yao, Tingting 146 Yao, Yuhan 207 Yue, Tao 28


Z Zeng, Wu 245 Zeng, Xinhua 95 Zeng, Yuan 245 Zhang, Hongzhi 65 Zhang, Jinjing 79 Zhang, Shuohao 65 Zhang, Shuzhen 297 Zhang, Tianpeng 422 Zhang, Yan 384, 396 Zhang, Yufeng 79 Zhang, Zhilu 65


Zhao, Lijun 79 Zhao, Manqi 108 Zhao, Mengyang 95 Zhao, Yafeng 3 Zhen, Hongzhong 269 Zheng, Huicheng 220 Zheng, Weishi 195, 207 Zheng, Ziyou 297 Zhou, Jiyong 269 Zhu, Aoni 183 Zuo, Wangmeng 65