Pattern Recognition and Computer Vision: 6th Chinese Conference, PRCV 2023, Xiamen, China, October 13–15, 2023, Proceedings, Part I (Lecture Notes in Computer Science) 9819984289, 9789819984282

The 13-volume set LNCS 14425-14437 constitutes the refereed proceedings of the 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023, held in Xiamen, China, during October 13-15, 2023.

Language: English · Pages: 528 [525] · Year: 2024

Table of contents:
Preface
Organization
Contents – Part I
Action Recognition
Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion Based Classification
1 Introduction
2 Related Work
3 Our Proposed Approach
3.1 Overview
3.2 Network Architecture
4 Experiment
4.1 Dataset and Evaluation Metric
4.2 Implementation Details
4.3 Comparison with Other SOTA Algorithms
4.4 Ablation Study
4.5 Parameter Analysis
5 Conclusion
References
Multi-scale Dilated Attention Graph Convolutional Network for Skeleton-Based Action Recognition
1 Introduction
2 Related Works
2.1 Attention Mechanism
2.2 Lightweight Models
3 Method
3.1 Multi-Branch Fusion Module
3.2 Semantic Information
3.3 Graph Convolution Module
3.4 Time Convolution Module
4 Experiment
4.1 Dataset
4.2 Experimental Details
4.3 Ablation Experiment
4.4 Comparison with State-of-the-Art
5 Action Visualization
6 Conclusion
References
Auto-Learning-GCN: An Ingenious Framework for Skeleton-Based Action Recognition
1 Introduction
2 Related Work
3 Methodology
3.1 GCN-Based Skeleton Processing
3.2 The AL-GCN Module
3.3 The Attention Correction and Jump Model
3.4 Multi-stream Gaussian Weight Selection Algorithm
4 Experimental Results and Analysis
4.1 Datasets
4.2 Implementation Details
4.3 Compared with the State-of-the-Art Methods
4.4 Ablation Study
4.5 Visualization
5 Conclusion
References
Skeleton-Based Action Recognition with Combined Part-Wise Topology Graph Convolutional Networks
1 Introduction
2 Related Work
2.1 Skeleton-Based Action Recognition
2.2 Partial Graph Convolution in Skeleton-Based Action Recognition
3 Methods
3.1 Preliminaries
3.2 Part-Wise Spatial Modeling
3.3 Part-Wise Spatio-Temporal Modeling
3.4 Model Architecture
4 Experiments
4.1 Datasets
4.2 Training Details
4.3 Ablation Studies
4.4 Comparison with the State-of-the-Art
5 Conclusion
References
Segmenting Key Clues to Induce Human-Object Interaction Detection
1 Introduction
2 Related Work
3 Approach
3.1 Key Features Segmentation-Based Module
3.2 Key Features Learning Encoder
3.3 Spatial Relationships Learning Graph-Based Module
3.4 Training and Inference
4 Experiments
4.1 Implementation Details
4.2 Implementation Results
4.3 Ablation Study
4.4 Qualitative Results
5 Conclusion
References
Lightweight Multispectral Skeleton and Multi-stream Graph Attention Networks for Enhanced Action Prediction with Multiple Modalities
1 Introduction
2 Related Work
2.1 Skeleton-Based Action Recognition
2.2 Dynamic Graph Neural Network
3 Methods
3.1 Spatial Embedding Component
3.2 Temporal Embedding Component
3.3 Action Prediction
4 Experiments and Discussion
4.1 NTU RGB+D Dataset
4.2 Experiments Setting
4.3 Evaluation of Human Action Recognition
4.4 Ablation Study
4.5 Visualization
5 Conclusion
References
Spatio-Temporal Self-supervision for Few-Shot Action Recognition
1 Introduction
2 Related Work
2.1 Few-Shot Action Recognition
2.2 Self-supervised Learning (SSL)-Based Few-Shot Learning
3 Method
3.1 Problem Definition
3.2 Spatio-Temporal Self-supervision Framework
4 Experiments
4.1 Experimental Settings
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Studies
5 Conclusions
References
A Fuzzy Error Based Fine-Tune Method for Spatio-Temporal Recognition Model
1 Introduction
2 Related Work
2.1 Spatio-Temporal (3D) Convolution Networks
2.2 Clips Selection and Features Aggregation
3 Proposed Method
3.1 Problem Definition
3.2 Fuzzy Target
3.3 Fine Tune Loss Function
4 Experiment
4.1 Datasets and Implementation Details
4.2 Performance Comparison
4.3 Discussion
5 Conclusion
References
Temporal-Channel Topology Enhanced Network for Skeleton-Based Action Recognition
1 Introduction
2 Proposed Method
2.1 Network Architecture
2.2 Temporal-Channel Focus Module
2.3 Dynamic Channel Topology Attention Module
3 Experiments
3.1 Datasets and Implementation Details
3.2 Ablation Study
3.3 Comparison with the State-of-the-Art
4 Conclusion
References
HFGCN-Based Action Recognition System for Figure Skating
1 Introduction
2 Figure Skating Hierarchical Dataset
3 Figure Skating Action Recognition System
3.1 Data Preprocessing
3.2 Multi-stream Generation
3.3 Hierarchical Fine-Grained Graph Convolutional Neural Network (HFGCN)
3.4 Decision Fusion Module
4 Experiments and Results
4.1 Experimental Environment
4.2 Experiment Results and Analysis
5 Conclusion
References
Multi-modal Information Processing
Image Priors Assisted Pre-training for Point Cloud Shape Analysis
1 Introduction
2 Proposed Method
2.1 Problem Setting
2.2 Overview Framework
2.3 Multi-task Cross-Modal SSL
2.4 Objective Function
3 Experiments and Analysis
3.1 Pre-training Setup
3.2 Downstream Tasks
3.3 Ablation Study
4 Conclusion
References
AMM-GAN: Attribute-Matching Memory for Person Text-to-Image Generation
1 Introduction
2 Related Work
2.1 Text-to-image Generative Adversarial Network
2.2 GANs for Person Image
3 Method
3.1 Feature Extraction
3.2 Multi-scale Feature Fusion Generator
3.3 Real-Result-Driven Discriminator
3.4 Objective Functions
4 Experiment
4.1 Dataset
4.2 Implementation
4.3 Evaluation Metrics
4.4 Quantitative Evaluation
4.5 Qualitative Evaluation
4.6 Ablation Study
5 Conclusion
References
RecFormer: Recurrent Multi-modal Transformer with History-Aware Contrastive Learning for Visual Dialog
1 Introduction
2 Related Work
3 Method
3.1 Preliminaries
3.2 Model Architecture
3.3 Training Objectives
4 Experimental Setup
4.1 Dataset
4.2 Baselines
4.3 Evaluation Metric
4.4 Implementation Details
5 Results and Analysis
5.1 Main Results
5.2 Ablation Study
5.3 Attention Visualization
6 Conclusion
References
KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing
1 Introduction
2 Background
2.1 Text-to-Image Generation and Editing
2.2 Stable Diffusion Model
3 KV Inversion: Training-Free KV Embeddings Learning
3.1 Task Setting and Reason of Existing Problem
3.2 KV Inversion Overview
4 Experiments
4.1 Comparisons with Other Concurrent Works
4.2 Ablation Study
5 Limitations and Conclusion
References
Enhancing Text-Image Person Retrieval Through Nuances Varied Sample
1 Introduction
2 Related Work
2.1 Text-Image Retrieval
2.2 Text-Image Person Retrieval
3 Method
3.1 Feature Extraction and Alignment
3.2 Nuanced Variation Module
3.3 Image Text Matching Loss
3.4 Hard Negative Metric Loss
4 Experiment
4.1 Datasets and Evaluation Setting
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Study
5 Conclusion
References
Unsupervised Prototype Adapter for Vision-Language Models
1 Introduction
2 Related Work
2.1 Large-Scale Pre-trained Vision-Language Models
2.2 Adaptation Methods for Vision-Language Models
2.3 Self-training with Pseudo-Labeling
3 Method
3.1 Background
3.2 Unsupervised Prototype Adapter
4 Experiments
4.1 Image Recognition
4.2 Domain Generalization
4.3 Ablation Study
5 Conclusion
References
Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval
1 Introduction
2 Related Works
3 Method
3.1 Overview
3.2 MCD: Multimodal Causal Discovery
3.3 MMC-CLIP
3.4 Image-Text Alignment
4 Experiments
4.1 Datasets and Settings
4.2 Results on MSCOCO
4.3 Results on Flickr30K
4.4 Ablation Studies
5 Conclusion
References
Exploring Cross-Modal Inconsistency in Entities and Emotions for Multimodal Fake News Detection
1 Introduction
2 Related Works
2.1 Single-Modality Fake News Detection
2.2 Multimodal Fake News Detection
3 Methodology
3.1 Feature Extraction
3.2 Cross-Modal Contrastive Learning
3.3 Entity Consistency Learning
3.4 Emotional Consistency Learning
3.5 Multimodal Fake News Detector
4 Experiments
4.1 Experimental Configurations
4.2 Overall Performance
4.3 Ablation Studies
5 Conclusion
References
Deep Consistency Preserving Network for Unsupervised Cross-Modal Hashing
1 Introduction
2 The Proposed Method
2.1 Problem Definition
2.2 Deep Feature Extraction and Hashing Learning
2.3 Features Fusion and Similarity Matrix Construction
2.4 Hash Code Fusion and Reconstruction
2.5 Objective Function
3 Experiments
3.1 Datasets and Baselines
3.2 Implementation Details
3.3 Results and Analysis
4 Conclusion
References
Learning Adapters for Text-Guided Portrait Stylization with Pretrained Diffusion Models
1 Introduction
2 Related Work
2.1 Text-to-Image Diffusion Models
2.2 Control of Pretrained Diffusion Model
2.3 Text-Guided Portrait Stylizing
3 Method
3.1 Background and Preliminaries
3.2 Overview of Our Method
3.3 Portrait Stylization with Text Prompt
3.4 Convolution Adapter
3.5 Adapter Optimization
4 Experiments
4.1 Implementation Settings
4.2 Comparisons
4.3 Ablation Analysis
5 Conclusion
References
EdgeFusion: Infrared and Visible Image Fusion Algorithm in Low Light
1 Introduction
2 Related Work
2.1 AE-Based Image Fusion Methods
2.2 CNN-Based Image Fusion Methods
2.3 GAN-Based Image Fusion Methods
3 Design of EdgeFusion
3.1 Problem Formulation
3.2 Network Architecture
3.3 Loss Function
4 Experimental Validation
4.1 Experimental Configurations
4.2 Fusion Metrics
4.3 Comparative Experiment
4.4 Detection Performance
4.5 Ablation Studies
5 Conclusion
References
An Efficient Momentum Framework for Face-Voice Association Learning
1 Introduction
2 Related Work
2.1 Face and Voice Association Learning
2.2 Contrastive Learning
3 Methodology
3.1 General Definition
3.2 Cross-modal MoCo
4 Experiment
4.1 Dataset
4.2 Implementation Details
4.3 Evaluation Protocols
4.4 Performance Analysis and Comparison
4.5 Ablation Study
4.6 Visualization Study
5 Conclusion
References
Multi-modal Instance Refinement for Cross-Domain Action Recognition
1 Introduction
2 Related Work
2.1 Action Recognition
2.2 Unsupervised Domain Adaptation for Action Recognition
2.3 Deep Reinforcement Learning
3 Proposed Method
3.1 Two-Stream Action Recognition
3.2 Instance Refinement
3.3 Domain Adversarial Alignment
3.4 Training
4 Experiments
4.1 Implementation Details
4.2 Results
4.3 Ablation Study
5 Conclusion
References
Modality Interference Decoupling and Representation Alignment for Caricature-Visual Face Recognition
1 Introduction
2 Related Work
3 Proposed Method
3.1 Overview
3.2 Identity-Interference Orthogonal Decoupling (IIOD) Module
3.3 Modality Feature Alignment (MFA) Module
3.4 Joint Loss Function
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Ablation Studies
4.4 Comparison with State-of-the-Art Methods
5 Conclusion
References
Plugging Stylized Controls in Open-Stylized Image Captioning
1 Introduction
2 Related Work
2.1 Image Captioning
2.2 Stylized Image Captioning
3 Method
3.1 Style Factor
3.2 Fluency Factor
4 Experimental Results and Discussion
4.1 Datasets
4.2 Metric
4.3 Implementation Details
4.4 Quantitative Analysis
4.5 Qualitative Analysis
4.6 Ablation Study
5 Conclusion
References
MGT: Modality-Guided Transformer for Infrared and Visible Image Fusion
1 Introduction
2 Related Work
3 Methodology
3.1 Motivation
3.2 Overall Pipeline of MGT
3.3 Modality-Guided Cross-Attention Block
3.4 Loss Function
4 Experiments
4.1 Experimental Configurations
4.2 Comparative Experiment
4.3 Ablation Study
5 Conclusion
References
Multimodal Rumor Detection by Using Additive Angular Margin with Class-Aware Attention for Hard Samples
1 Introduction
2 Related Work
2.1 Multimodal Rumor Detection
2.2 Contrastive Learning
2.3 Multimodal Fusion
3 Methodology
3.1 Multi-layer Fusion Module
3.2 Angular Margin-Based Contrastive Learning
3.3 Class-Aware Attention Module
3.4 Weighted Loss with Cross-Entropy
4 Experiments and Results
4.1 Datasets
4.2 Implementation Details
4.3 Result
4.4 Ablation Study
5 Conclusion
References
An Effective Dynamic Reweighting Method for Unbiased Scene Graph Generation
1 Introduction
2 Related Work
3 Method
3.1 Basic Reweighting Formula
3.2 Dynamic Reweighting Mechanism
3.3 Non-relation Class Reweighting
4 Experiments
4.1 Experiment Setting
4.2 Implementation Details
4.3 Comparisons with State-of-the-Art Methods
4.4 Ablation Study
4.5 Visualization Results
5 Conclusion
References
Multi-modal Graph and Sequence Fusion Learning for Recommendation
1 Introduction
2 Related Work
3 The Proposed Method
3.1 Problem Statement
3.2 Multi-modal-Aware Fusion Layer
3.3 Sequence-Aware Learning Layer
3.4 Prediction and Optimization
3.5 Computational Complexity
4 Experiments and Analysis
4.1 Experimental Settings
4.2 Performance Comparison (Q1)
4.3 Ablation Study (Q2)
4.4 Parameter Analyses (Q3)
5 Conclusion and Future Work
References
Co-attention Guided Local-Global Feature Fusion for Aspect-Level Multimodal Sentiment Analysis
1 Introduction
2 Related Work
3 Method
3.1 Problem Description
3.2 Proposed Method
4 Experimental Analysis
4.1 Experimental Result
4.2 Ablation Experiments
5 Conclusion
References
Discovering Multimodal Hierarchical Structures with Graph Neural Networks for Multi-modal and Multi-hop Question Answering
1 Introduction
2 Related Work
3 Methodology
3.1 Multimodal and Multi-hop Reader
3.2 Hierarchy-Aware Graph Neural Encoder
3.3 Graph-Selection Decoder
4 Experiments
4.1 Datasets
4.2 Baselines
4.3 Implementation Details
4.4 Experimental Results
4.5 Ablation Study
4.6 Case Study
5 Conclusion
References
Enhancing Recommender System with Multi-modal Knowledge Graph
1 Introduction
2 Related Work
2.1 KG-Based Recommendation
2.2 Multi-modal Knowledge Graphs
3 Method
3.1 Multi-modal Entity Encoder
3.2 Multi-modal Knowledge Graph Propagation Layer
4 Experiments
4.1 Datasets and Settings
4.2 Baselines
4.3 Results
5 Conclusion
References
Location Attention Knowledge Embedding Model for Image-Text Matching
1 Introduction
2 Related Work
3 Method
3.1 Image Representation
3.2 Text Representation
3.3 Knowledge Representation
3.4 Feature Fusion
3.5 Image and Text Matching
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Comparison Methods
4.5 Ablation Experiments
5 Conclusion
References
Pedestrian Attribute Recognition Based on Multimodal Transformer
1 Introduction
2 Related Works
3 Proposed Models
3.1 Cross Transformer
3.2 Multi-modal Transformer
3.3 Multi-modal Transformer Models
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics and Loss Function
4.3 Implementation Details
4.4 Classification Results
5 Conclusion
References
RGB-D Road Segmentation Based on Geometric Prior Information
1 Introduction
2 Method
2.1 Disparity Transformation Module
2.2 Surface Normal Vector Estimation Module
2.3 Road Segmentation Network
3 Experiment
3.1 Experimental Setup
3.2 Ablation Study
3.3 Comparison of Road Segmentation
4 Conclusion
References
Contrastive Perturbation Network for Weakly Supervised Temporal Sentence Grounding
1 Introduction
2 Related Work
2.1 Fully Supervised Temporal Sentence Grounding
2.2 Weakly Supervised Temporal Sentence Grounding
2.3 Multiple Instance Learning
3 Proposed Method
3.1 Overall Framework
3.2 Proposal Generation Module
3.3 Reconstruction Module
3.4 Model Training
4 Experiments
4.1 Datasets
4.2 Evaluation Metric
4.3 Implementation Details
4.4 Comparisons to the State-of-the-Art
4.5 Ablation Study
5 Conclusion
References
MLDF-Net: Metadata Based Multi-level Dynamic Fusion Network
1 Introduction
2 Related Works
3 Methodology
3.1 Metadata Guidance Module
3.2 Dynamic Feature Selection Module
4 Experimental Details
4.1 Dataset and Evaluation Metrics
4.2 Implementation Details
4.3 Experimental Results
4.4 Ablation Experiment
5 Summary
References
Efficient Adversarial Training with Membership Inference Resistance
1 Introduction
2 Background and Related Work
2.1 Workflow of Empirical Adversarial Training
2.2 Membership Inference Attacks
3 Our Method
3.1 Design Goals
3.2 Overview
3.3 Design Details
4 Experimental Setup
4.1 Implementation, Dataset, and Model Architecture
4.2 Baselines
4.3 Metrics
5 Experimental Results
5.1 Performance of Our Basic Framework
5.2 Performance of Our Framework Variants
6 Conclusion
References
Enhancing Image Comprehension for Computer Science Visual Question Answering
1 Introduction
2 Related Work
2.1 Visual Question Answering
2.2 VQA in Intelligent Education
3 Approach
3.1 Image Understanding Layer
3.2 Encoding Layer
3.3 Cross-Modal Attention Layer
3.4 Output Layer
4 Experiments
4.1 CSDQA Dataset
4.2 Implementation Details
4.3 Baselines
4.4 Experimental Results and Analysis
4.5 Ablation Studies
4.6 Case Study
5 Conclusions
References
Cross-Modal Attentive Recalibration and Dynamic Fusion for Multispectral Pedestrian Detection
1 Introduction
2 Related Work
2.1 Multispectral Pedestrian Detection
2.2 Attentive Feature Refinement
2.3 Dynamic Weight Networks
3 Methodology
3.1 Overview
3.2 Cross-Modal Attentive Feature Recalibration
3.3 Multi-modality Dynamic Feature Fusion
4 Experiments
4.1 Datasets
4.2 Parameter Setting
4.3 Evaluation Metrics
4.4 Results Analysis
5 Conclusion
References
Author Index
